



### An Overview of High Performance Computing and Challenges for the Future

### Jack Dongarra

University of Tennessee Oak Ridge National Laboratory University of Manchester



- Overview of High Performance Computing
- Look at some implementations of linear algebra algorithms on today's High Performance Computers
  - As an examples of the kind of thing needed.

### Ç

# State of Supercomputing Today

- Pflops (> 10<sup>15</sup> Flop/s) computing fully established with 81 systems.
- Three technology architecture possibilities or "swim lanes" are thriving.
  - Commodity (e.g. Intel)
  - Commodity + accelerator (e.g. GPUs) (104 systems)
  - Special purpose lightweight cores (e.g. IBM BG, ARM, Intel's Knights Landing)
- Interest in supercomputing is now worldwide, and growing in many new markets (around 50% of Top500 computers are used in industry).
- Exascale (10<sup>18</sup> Flop/s) projects exist in many countries and regions.
- Intel processors have largest share, 89% followed by AMD, 4%.





H. Meuer, H. Simon, E. Strohmaier, & JD

- Listing of the 500 most powerful Computers in the World
- Yardstick: Rmax from LINPACK MPP

Ax=b, dense problem

- Updated twice a year Size SC'xy in the States in November Meeting in Germany in June
- All data available from www.top500.org 4



### Performance Development of HPC over the Last 24 Years from the Top500



# November 2015: The TOP 10 Systems

| Rank  | Site                                              | Computer                                                                   | Country            | Cores     | Rmax<br>[Pflops] | % of<br>Peak | Power<br>[MW] | MFlops<br>/Watt |
|-------|---------------------------------------------------|----------------------------------------------------------------------------|--------------------|-----------|------------------|--------------|---------------|-----------------|
| 1     | National Super<br>Computer Center in<br>Guangzhou | Tianhe-2 NUDT,<br>Xeon 12C + <mark>Intel Xeon Phi (57c)</mark><br>+ Custom | China              | 3,120,000 | <i>33.9</i>      | 62           | 17.8          | 1905            |
| 2     | DOE / OS<br>Oak Ridge Nat Lab                     | Titan, Cray XK7, AMD (16C) +<br>Nvidia Kepler GPU (14c) +<br>Custom        | USA                | 560,640   | 17.6             | 65           | 8.3           | 2120            |
| 3     | DOE / NNSA<br>L Livermore Nat Lab                 | Sequoia, BlueGene/Q (16c)<br>+ custom                                      | USA                | 1,572,864 | 17.2             | 85           | 7.9           | 2063            |
| 4     | RIKEN Advanced Inst<br>for Comp Sci               | K computer Fujitsu SPARC64<br>VIIIfx (8c) + Custom                         | Japan              | 705,024   | 10.5             | 93           | 12.7          | 827             |
| 5     | DOE / OS<br>Argonne Nat Lab                       | Mira, BlueGene/Q (16c)<br>+ Custom                                         | USA<br>B TOTAL     | 786,432   | 8.16             | 85           | 3.95          | 2066            |
| 6     | DOE / NNSA /<br>Los Alamos & Sandia               | Trinity, Cray XC40,Xeon 16C +<br>Custom                                    | USA<br>B TITAL     | 301,056   | 8.10             | 80           |               |                 |
| 7     | Swiss CSCS                                        | Piz Daint, Cray XC30, Xeon 8C +<br>Nvidia Kepler (14c) + Custom            | <mark>Swiss</mark> | 115,984   | 6.27             | 81           | 2.3           | 2726            |
| 8     | HLRS Stuttgart                                    | Hazel Hen, Cray XC40, Xeon 12C<br>+ Custom                                 | Germany            | 185,088   | 5.64             | 76           |               |                 |
| 9     | KAUST                                             | Shaheen II, Cray XC40, Xeon<br>16C + Custom                                | Saudi<br>Arabia    | 196,608   | 5.54             | 77           | 2.8           | 1 <i>9</i> 54   |
| 10    | Texas Advanced<br>Computing Center                | Stampede, Dell Intel (8c) + Intel<br>Xeon Phi (61c) + IB                   | USA                | 204,900   | 5.17             | 61           | 4.5           | 1489            |
| 500 ( | 368) Karlsruher                                   | MEGAWARE Intel                                                             | Germany            | 10,800    | .206             | 95           |               |                 |

# Recent Developments

- US DOE planning to deploy O(100) Pflop/s systems for 2017-2018 \$525M hardware
- " Oak Ridge Lab and Lawrence Livermore Lab to receive IBM and Nvidia based systems
- Argonne Lab to receive Intel based system
  - > After this the Exaflop
- " US Dept of Commerce is groups from receiving In<sup>.</sup>







### Yutong Lu from NUDT at the International Supercomputer Conference in Germany in July 2015

| Status              |                            |                                    | Tionhe-2A                         |
|---------------------|----------------------------|------------------------------------|-----------------------------------|
|                     | Tianhe-1A                  | Tianhe-2                           | flaine -                          |
| System              |                            | 54.9                               | ~100                              |
| System Peak(PF)     | 4.7                        | 17.8                               | ~18                               |
| Peak Power(MW)      | 4.04                       | 1 4 PB                             | ~3PB                              |
| Total System Memory | 262 TB                     | 1.4 PD                             |                                   |
| Node<br>Node        | 0.655                      | 3.431                              | ~6                                |
| Node processors     | Xeon X5670<br>Nvidia M2050 | Xeon E5 2692<br>Xeon Phi           | Xeon E5 2692<br>China Accelerator |
| System size(nodes)  | 7,168 nodes                | 16,000 nodes                       | ~18,000                           |
| system Interconnect | TH Express-1               | TH Express-2                       | Express-2+                        |
| File System         | 2 PB                       | 12.4PB<br>H <sup>2</sup> FS+Lustre | ~30PB<br>S+TDM                    |

# **China Accelerator**

### Matrix2000 GPDSP

### High Performance

- 64bit Supported
- $> \sim 2.4/4.8$ TFlops(DP/SP)
- ➤ 1GHz, ~200W

### High Throughput

- > High-bandwidth Memory
- > 32~64GB
- ➢ PCIE 3.0, 16x



K HPCL

National University of Defense Technology





| Absolute C | counts |
|------------|--------|
| US:        | 201    |
| China:     | 109    |
| Japan:     | 38     |
| UK:        | 18     |
| France:    | 18     |
| Germany:   | 32     |

China nearly tripled the number of systems on the latest list, while the number of systems in the US has fallen to the lowest point since the TOP500 list was created.

# France's 18 systems on Top500

| Rank | Name             | Computer                                                            | Site                      | Manufacturer     | Total<br>Cores | Rmax    | % Peak         |
|------|------------------|---------------------------------------------------------------------|---------------------------|------------------|----------------|---------|----------------|
| 33   | Pangea           | SGI ICE X, Xeon 8C 2.600GHz, Inf FDR                                | Total                     | SGI              | 110400         | 2098090 | 91%            |
| 44   | Occigen          | bullx DLC, Xeon 12C 2.6GHz, Inf FDR                                 | GENCI-CINES               | Bull, Atos Group | 50544          | 1628770 | 77%            |
| 53   | Curie thin nodes | Bullx B510, Xeon 8C 2.700GHz, Inf QDR                               | CEA/TGCC-<br>GENCI        | Bull, Atos Group | 77184          | 1359000 | 82%            |
| 70   | Turing           | BlueGene/Q, Power BQC 16C 1.60GHz,<br>Custom                        | CNRS/IDRIS-<br>GENCI      | IBM              | 98304          | 1073327 | 85%            |
| 74   | Tera-100         | Bull bullx super-node S6010/S6030                                   | CEA                       | Bull, Atos Group | 138368         | 1050000 | 84%            |
| 121  | Zumbrota         | BlueGene/Q, Power BQC 16C 1.60GHz,<br>Custom                        | EDF R&D                   | IBM              | 65536          | 715551  | 85%            |
| 167  | HPC4             | HP POD - Cluster Platform, Intel Xeon 12C 2.7GHz, Inf FDR           | Airbus                    | Hewlett-Packard  | 34560          | 516897  | 69%            |
| 171  | PORTHOS          | IBM NeXtScale nx360M5, Xeon 14C<br>2.6GHz, Inf FDR                  | EDF R&D                   | IBM              | 16100          | 506357  | 76%            |
| 190  | Beaufix          | Bullx DLC B710 Blades, Intel Xeon 12C 2.7GHz, Inf FDR               | Meteo France              | Bull, Atos Group | 24192          | 469097  | 90%            |
| 194  | Prolix           | Bullx DLC B710 Blades, Intel Xeon 12C 2.7GHz, Inf FDR               | Meteo France              | Bull, Atos Group | 23760          | 464865  | 91%            |
| 215  | Manny            | bullx DLC 720, Xeon 12C 2.6GHz, Inf FDR                             | Bull                      | Bull, Atos Group | 12960          | 430459  | 80%            |
| 278  | Athos            | iDataPlex DX360M4, Intel Xeon 12C<br>2.7GHz, Inf FDR14              | EDF R&D                   | IBM              | 18144          | 352671  | 90%            |
| 283  | airain           | Bullx B510, Xeon 8C 2.7GHz, Ind QDR                                 | CEA/CCRT                  | Bull, Atos Group | 18144          | 346070  | 88%            |
| 399  | EOS              | Bullx DLC B710 Blades, Intel Xeon 10C 2.8GHz, Inf FDR               | CALMIP / U of<br>Toulouse | Bull, Atos Group | 12240          | 255078  | 93%            |
| 400  | romeo            | Bull R421-E3 Cluster, Intel Xeon 8C<br>2.6GHz, Inf FDR, NVIDIA K20x | Champagne-<br>Ardenne     | Bull, Atos Group | 5720           | 254900  | 66%            |
| 412  |                  | Cluster Platform 3000 BL460c, Intel Xeon 12C 2.7GHz, Inf FDR        | Manufacturing<br>Company  | Hewlett-Packard  | 13152          | 249348  | 88%            |
| 421  |                  | HP POD - Cluster Platform 3000 BL260c<br>G6, 3.06 GHz, Inf          | Airbus                    | Hewlett-Packard  | 24192          | 243900  | 82%            |
| 433  | Jade             | SGI ICE 8200EX, Xeon 4C 3.000GHz, Inf                               | GENCI-CINES               | SGI              | 23040          | 237800  | <sup>89%</sup> |

### Commodity plus Accelerator Today 104 of the Top500 Systems

ICL UT



## Technology Trends: Microprocessor C:



Gordon Moore (co-founder of Intel) Electronics Magazine, 1965 Number of devices/chip doubles every 18 months

2X transistors/Chip Every 1.5 years Called "<u>Moore's Law</u>"



The future of integrated electronics is the future of electronics itself. The advantages of integration will bring about a proliferation of electronics, pushing this science into many new areas.

Integrated circuits will lead to such wonders as home computers—or at least terminals connected to a central computer—automatic controls for automobiles, and personal portable communications equipment. The electronic wristwatch needs only a display to be feasible today.

But the biggest potential lies in the production of large systems. In telephone communications, integrated circuits in digital filters will separate channels on multiplex equipment. Integrated circuits will also switch telephone circuits and perform data processing.

Computers will be more powerful, and will be organized in completely different ways. For example, memories built of integrated electronics may be distributed throughout the

The author



Dr. Gordon E. Moore is one of the new breed of electronic engineers, schooled in the physical sciences rather than in electronics. He earned a B.S. degree in chemistry from the hyperbid Collignment of Collignment machine instead of being concentrated in a central unit. In addition, the improved reliability made possible by integrated circuits will allow the construction of larger processing units. Machines similar to those in existence today will be built at lower costs and with faster turn-around.

#### Present and future

By integrated electronics, I mean all the various technologies which are referred to as microelectronics today as well as any additional ones that result in electronics functions supplied to the user as irreducible units. These technologies were first investigated in the late 1950's. The object was to miniaturize electronics equipment to include increasingly complex electronic functions in limited space with minimum weight. Several approaches evolved, including microassembly techniques for individual components, thinfilm structures and semiconductor integrated circuits.

Each approach evolved rapidly and converged so that each borrowed techniques from another. Many researchers believe the way of the future to be a combination of the various approaches.

The advocates of semiconductor integrated circuitry are already using the improved characteristics of thin-film resistors by applying such films directly to an active semiconductor substrate. Those advocating a technology based upon

# Moore's Secret Sauce: Dennard Scaling

Moore's Law put lots more transistors on a chip...but it's Dennard's Law that made them useful

**Dennard Scaling :** 

- Decrease feature size by a factor of  $\lambda$  and decrease voltage by a factor of  $\lambda$ ; then
- # transistors increase by  $\lambda^2$
- Clock speed increases by  $\boldsymbol{\lambda}$
- Energy consumption does not change

2x transistor count 40% faster 50% more efficient

#### Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions

ROBERT H. DENNARD, MEMBER, IEEE, FRITZ H. GAENSSLEN, HWA-NIEN YU, MEMBER, IEEE, V. LEO RIDEOUT, MEMBER, IEEE, ERNEST BASSOUS, AND ANDRE R. LEBLANC, MEMBER, IEEE

D

 $\Delta W_{\ell}$ 

€ 31, €01

U.s.

Ν.

Abstract—This paper considers the design, fabrication, and characterization of very small MOSEFT switching devices witable for digital integrated circuits using dimensions of the order of 1  $\mu$ . Scaling relationships are presented which above how a conventional MOSEFT can be reduced in size. An imporved small device structure is presented that uses ion implantation to provide shallow source and drain regions and a nonuniform substrate doping profile doping profile and the corresponding threshold voltage versus source voltage characteristic. A two-dimensional current transport model is used to predict the relative degree of short-channel effects for different device parameter combinations. Polysilicon-gate MOSFET's with channel length as short as 0.5  $\mu$  were fabricated, and the device characteristics measured and compared with predicted values. The performance improvement expected from using these very small devices in highly miniaturized integrated circuits is projected.

Manuscript received May 20, 1974; revised July 3, 1974. The authors are with the IBM T. J. Watson Research Center, Yorktown Heights, N.Y. 10598.

LIST OF SYMBOLS Inverse semilogarithmic slope of subthreshold characteristic. Width of idealized step function profile for channel implant. Work function difference between gate and substrate. Dielectric constants for silicon and silicon dioxide. Drain current. Boltzmann's constant. Unitless scaling constant. MOSFET channel length. Effective surface mobility. Intrinsic carrier concentration. Substrate acceptor concentration Band bending in silicon at the onset of strong inversion for zero substrate voltage.

[Dennard, Gaensslen, Yu, Rideout, Bassous, Leblanc, **IEEE JSSC**, 1974] 14

# Dennard Scaling Over

### Evolution of processors

The primary reason cited for the breakdown is that at small sizes, current leakage poses greater challenges, and also causes the chip to heat up, which creates a threat of thermal runaway and therefore further increases energy costs.



### Unfortunately Dennard Scaling is Over: What is the Catch?

Powering the transistors without melting the chip



CPU DB: recording microprocessor history, CACM, V 55 N 4, 2012, http://dl.acm.org/citation.cfm?id=2133822

# Power Cost of Frequency

- Power  $\propto$  Voltage<sup>2</sup> x Frequency (V<sup>2</sup>F)
- Frequency ~ Voltage

| Power «Frequency <sup>3</sup> |       |      |      |      |       |            |      |  |
|-------------------------------|-------|------|------|------|-------|------------|------|--|
|                               | Cores | V    | Freq | Perf | Power | PE (Bops/w | att) |  |
| Superscalar                   | 1     | 1    | 1    | 1    | 1     | 1          |      |  |
| "New" Superscalar             | 1X    | 1.5X | 1.5X | 1.5X | 3.3X  | 0.45X      |      |  |
|                               | -     |      |      |      |       |            | /    |  |

# Power Cost of Frequency

- Power  $\propto$  Voltage<sup>2</sup> x Frequency (V<sup>2</sup>F)
- Frequency ~ Voltage



50% more performance with 20% less power

Preferable to use multiple slower devices, than one superfast device

# Construction For Structure Construction Cons



Intel Haswell (18 cores)



AMD Interlagos (16 cores)



Fujitsu Venus (16 cores)



Intel Xeon Phi (60 cores)





IBM Power 8 (12 cores)



Nvidia Kepler (2688 Cuda cores) IBM BG/Q (18 cores)



ShenWei (16 core)



#### Peak Performance - Per Core FLOPs $FLOPS = cores \times clock >$ cvcle Floating point operations per cycle per core + Most of the recent computers have FMA (Fused multiple add): (i.e. $x \leftarrow x + y^*z$ in one cycle) + Intel Xeon earlier models and AMD Opteron have SSE2 + 2 flops/cycle DP & 4 flops/cycle SP + Intel Xeon Nehalem ('09) & Westmere ('10) have SSE4 + 4 flops/cycle DP & 8 flops/cycle SP + Intel Xeon Sandy Bridge('11) & Ivy Bridge ('12) have AVX + 8 flops/cycle DP & 16 flops/cycle SP intel inter + Intel Xeon Haswell ('13) & (Broadwell ('14)) AVX2 + 16 flops/cycle DP & 32 flops/cycle SP + Xeon Phi (per core) is at 16 flops/cycle DP & 32 flops/cycle SP We Intel Xeon Skylake ('15) are here + 32 flops/cycle DL & 64 flops/cycle SP Sandy Bridge Haswell Skylake

pe Nam

yotta zetta

exaF petal

teraF gigal

mega kiloF

### **CPU Access Latencies in Clock Cycles**



## Memory transfer (Its All About Data Movement) Example on my laptop: One level of memory



(Omitting latency here.)

The model IS simplified (see next slide) but it provides an upper bound on performance as well. I.e., we will never go faster than what the model predicts. (And, of course, we can go slower ... )

# FMA: fused multiply-add



Note: It is reasonable to expect the one loop codes shown here to perform as well as their Level 1 BLAS counterpart (on multicore with an OpenMP pragma for example).

The true gain these days with using the BLAS is (1) Level 3 BLAS, and (2) portability.

- Take two double precision vectors x and y of size
   n=375,000.
- Data size:



- (375,000 double) \* (8 Bytes / double) = 3 MBytes per vector

(Two vectors fit in cache (6 MBytes). OK.)

- Time to move the vectors from memory to cache:
   (6 MBytes) / (25.6 GBytes/sec) = 0.23 ms
- Time to perform computation of DOT:
  - (2n flop) / (56 Gflop/sec) = 0.01 ms

## **Vector Operations**

# total\_time $\geq$ max ( time\_comm , time\_comp ) = max ( 0.23ms , 0.01ms ) = 0.23ms

Performance = (2 x 375,000 flops)/.23ms = 3.2 Gflop/s

## Performance for DOT ≤ 3.2 Gflop/s Peak is 56 Gflop/s

We say that the operation is communication bounded. No reuse of data.

# Level 1, 2 and 3 BLAS



- Double precision matrix A and vectors x and y of size n=860.
   GEMV: α α x + β
- Data size:



- (860<sup>2</sup> + 2\*860 double) \* (8 Bytes / double) ~ 6 MBytes

Matrix and two vectors fit in cache (6 MBytes).

- Time to move the data from memory to cache:
   (6 MBytes) / (25.6 GBytes/sec) = 0.23 ms
- Time to perform computation of DOT:
   (2n<sup>2</sup> flop) / (56 Gflop/sec) = 0.26 ms

## Matrix - Vector Operations

We say that the operation is communication bounded. Very little reuse of data.

• Take two double precision vectors x and y of size n=500.



- Data size:
  - ( 500<sup>2</sup> double ) \* ( 8 Bytes / double ) = 2 MBytes per matrix

(Three matrices fit in cache (6 MBytes). OK.)

- Time to move the matrices in cache:
   (6 MBytes) / (25.6 GBytes/sec) = 0.23 ms
- Time to perform computation in GEMM:
   (2n<sup>3</sup> flop) / (56 Gflop/sec) = 4.46 ms

# Matrix Matrix Operations

total\_time ≥ max ( time\_comm , time\_comp )

= max( 0.23ms , 4.46ms ) = 4.46ms

For this example, communication time is less than 6% of the computation time.

Performance =  $(2 \times 500^{3} \text{ flops})/4.69 \text{ms} = 53.3 \text{ Gflop/s}$ 

There is a lots of data reuse in a GEMM; 2/3n per data element. Has good temporal locality.

If we assume total\_time ≈ time\_comm +time\_comp, we get **Performance for GEMM ≈ 53.3 Gflop/sec** 

Performance for DOT ≤ 3.2 Gflop/s Performance for GEMV ≤ 5.7 Gflop/s

(Out of 56 Gflop/sec possible, so that would be 95% peak performance efficiency.)

### Level 1, 2 and 3 BLAS

1 core Intel Haswell i7-4850HQ, 2.3 GHz (Turbo Boost at 3.5 GHz);

Peak = 56 Gflop/s



1 core Intel Haswell i7-4850HQ, 2.3 GHz, Memory: DDR3L-1600MHz 6 MB shared L3 cache, and each core has a private 256 KB L2 and 64 KB L1. The theoretical peak per core double precision is 56 Gflop/s per core. Compiled with gcc and using Veclib

### The Standard LU Factorization LINPACK 1970's HPC of the Day: Vector Architecture





#### Main points

- Factorization column (zero) mostly sequential due to memory bottleneck
- Level 1 BLAS
- Divide pivot row has little parallelism
- Rank -1 Schur complement update is the only easy parallelize task
- Partial pivoting complicates things even further
- Bulk synchronous parallelism (fork-join)
  - Load imbalance
  - Non-trivial Amdahl fraction in the panel
  - Potential workaround (look-ahead) has complicated implementation

# The Standard LU Factorization LAPACK 1980's HPC of the Day: Cache Based SMP



Factor panel with Level 1,2 BLAS Triangular update Schur complement update Next Step

#### Main points

- Panel factorization mostly sequential due to memory bottleneck
- Triangular solve has little parallelism
- Schur complement update is the only easy parallelize task
- Partial pivoting complicates things even further
- Bulk synchronous parallelism (fork-join)
  - Load imbalance
  - Non-trivial Amdahl fraction in the panel
  - Potential workaround (look-ahead) has complicated implementation

# Last Generations of DLA Software



#### **2D Block Cyclic Layout**

|   |   | Ma | ıtrix p | point | of vi | ew |   |   | ] [ |   |   |   | Proc | e | SSO | r poi | int of | F١ | view |   |   |  |
|---|---|----|---------|-------|-------|----|---|---|-----|---|---|---|------|---|-----|-------|--------|----|------|---|---|--|
| 0 | 2 | 4  | 0       | 2     | 4     | 0  | 2 | 4 |     |   | 0 | 0 | 0    |   | 2   | 2     | 2      |    | 4    | 4 | 4 |  |
| 1 | 3 | 5  | 1       | 3     | 5     | 1  | 3 | 5 |     |   | 0 | 0 | 0    |   | 2   | 2     | 2      |    | 4    | 4 | 4 |  |
| 0 | 2 | 4  | 0       | 2     | 4     | 0  | 2 | 4 |     |   | 0 | 0 | 0    |   | 2   | 2     | 2      |    | 4    | 4 | 4 |  |
| 1 | 3 | 5  | 1       | 3     | 5     | 1  | 3 | 5 |     |   | 0 | 0 | 0    |   | 2   | 2     | 2      |    | 4    | 4 | 4 |  |
| 0 | 2 | 4  | 0       | 2     | 4     | 0  | 2 | 4 |     | L | 0 | 0 | 0    |   | 2   | 2     | 2      |    | 4    | 4 | 4 |  |
| 1 | 3 | 5  | 1       | 3     | 5     | 1  | 3 | 5 |     | 1 | 1 | 1 | 1    |   | 3   | 3     | 3      |    | 5    | 5 | 5 |  |
| 0 | 2 | 4  | 0       | 2     | 4     | 0  | 2 | 4 |     |   | 1 | 1 | 1    |   | 3   | 3     | 3      |    | 5    | 5 | 5 |  |
| 1 | 3 | 5  | 1       | 3     | 5     | 1  | 3 | 5 |     |   | 1 | 1 | 1    |   | 3   | 3     | 3      |    | 5    | 5 | 5 |  |
| 0 | 2 | 4  | 0       | 2     | 4     | 0  | 2 | 4 |     | l | 1 | 1 | 1    |   | 3   | 3     | 3      |    | 5    | 5 | 5 |  |



# Classical Analysis of Algorithms May Not be Valid

- "Processors over provisioned for floating point arithmetic
- " Data movement extremely expensive
- " Operation count is not a good indicator of the time to solve a problem.
- "Algorithms that do more ops may actually take less time.







### 3 Generations of software compared



Bottleneck in the Bidiagonalization The Standard Bidiagonal Reduction: xGEBRD Two Steps: Factor Panel & Update Tailing Matrix



### **\***Characteristics

- Total cost  $8n^3/3$ , (reduction to bi-diag
- Too many Level 2 BLAS operations
- 4/3 n<sup>3</sup> from GEMV and 4/3 n<sup>3</sup> from G
- Performance limited to 2\* performan
- →Memory bound algorithm.



### **Recent Work on 2-Stage Algorithm**



### **\***Characteristics

- Stage 1:
  - Fully Level 3 BLAS
  - Dataflow Asynchronous execution
- Stage 2:
  - Level "BLAS-1.5"
  - Asynchronous execution
  - Cache friendly kernel (reduced communication)

### **Recent work on developing new 2-stage algorithm**



$$\begin{split} \text{flops} &\approx \sum_{\substack{s=1\\s=1}}^{\frac{n-n_b}{n_b}} 2n_b^3 + (nt-s)3n_b^3 + (nt-s)\frac{10}{3}n_b^3 + (nt-s) \times (nt-s)5n_b^3 \\ &+ \sum_{s=1}^{\frac{n-n_b}{n_b}} 2n_b^3 + (nt-s-1)3n_b^3 + (nt-s-1)\frac{10}{3}n_b^3 + (nt-s) \times (nt-s-1)5n_b^3 \\ &\approx \frac{10}{3}n^3 + \frac{10n_b}{3}n^2 + \frac{2n_b}{3}n^3 \\ \text{flops} &\approx \frac{10}{3}n^3(\text{gemm})_{\text{first stage}} \end{split}$$

More Flops, original did 8/3 n<sup>3</sup> 25% More flops

### Recent work on developing new 2-stage algorithm



25% More flops and 1.8 – 7 times faster



#### Parallelization of LU and QR.

#### <sup>IC</sup> Parallelize the update:

- Easy and done in any reasonable software.
- This is the  $2/3n^3$  term in the FLOPs count.
- Can be done efficiently with LAPACK+multithreaded BLAS





dgemm

←

# Synchronization (in LAPACK LU)





# PLASMA LU Factorization

×GEMM

xGFMN





- Added with OpenMP 3.0 (2009)
- Allows parallelization of irregular problems
- OpenMP 4.0 (2013) Tasks can have
   dependencies
  - ►DAGs



# Tiled Cholesky Decomposition

}



```
#pragma omp parallel
#pragma omp master
  CHOLESKY(A);
                  - }
CHOLESKY(A) {
    for (k = 0; k < M; k++) {
        #pragma omp task depend(inout:A(k,k)[0:tilesize]
           POTRF(A(k,k)); \}
        for (m = k+1; m < M; m++) {
            #pragma omp task \
                depend (in: A(k, k) [0: tilesize]) \
                depend(inout:A(m,k)[0:tilesize])
            { TRSM(A(k,k), A(m,k)); }
        for (m = k+1; m < M; m++) {
            #pragma omp task \
                depend (in:A(m,k)[0:tilesize])
                depend(inout:A(m,m)[0:tilesize])
            \{ SYRK(A(m,k), A(m,m)); \}
            for (n = k+1; n < m; n++) {
                #pragma omp task \
                    depend(in:A(m, k)[0:tilesize], \
                              A(n,k)[0:tilesize]) \setminus
                    depend(inout:A(m,n)[0:tilesize])
                \{ GEMM(A(m,k), A(n,k), A(m,n)); \}
            }
      }
   }
```



# Dataflow Based Design

"Objectives

- > High utilization of each core
- Scaling to large number of cores
- Synchronization reducing algorithms

### "Methodology

- Dynamic DAG scheduling
- Explicit parallelism
- Implicit communication
- Fine granularity / block data layout

"Arbitrary DAG with dynamic scheduling





# Pipelining: Cholesky Inversion 3 Steps: Factor, Invert L, Multiply L's



Avoiding Synchronization

"Responsibly Reckless" Algorithms
Try fast algorithm (unstable algorithm) that might fail (but rarely)
Check for instability
If needed, recompute with stable algorithm

LU decomposition (Gaussian Elimination) for the solution of Ax = b





- Stability issue: a<sub>kk</sub> may be small or zero ⇒ large element growth ⇒ elements of normal size lost in summation.
- Partial pivoting (GEPP): swap rows so that each  $a_{kk}$  is large. row k is exchanged with row p such that  $|a_{pk}| = \max_{\substack{j \ge k}} |a_{jk}|$ Eventually, PA = LU (P permutation matrix).

### Pivoting is expensive

- Complete pivoting, partial pivoting, tournament pivoting, etc.
- GEPP implemented in most numerical libraries (LAPACK...)
- No floating point operation in pivoting but it involves irregular movements of data
- Communication overhead due to pivoting:  $O(n^2)$  comparisons



Cost of partial pivoting in LU factorization (MAGMA), Nvidia Kepler K20

### Random matrices are nice (for pivoting)

#### (see [ Trefethen and Schreiber, SIMAX 90 ], [ Yeung and Chan, SIMAX 97 ])



### No pivoting by randomizing instead:

- For general systems (LU factorization): Initially proposed by [Parker, 1995] Revisited in [MB, Dongarra, Herrmann Tomov, TOMS 2013]
- Idea: the original matrix is transformed into a matrix that would be sufficiently "random" so that, with a probability close to 1, pivoting is not needed.

Random Butterfly Transformation (RBT)

$$Ax = b \equiv \underbrace{U^T AV}_{A_r} \underbrace{V^{-1} x}_{y} = \underbrace{U^T b}_{c}$$

- Compute  $A_r = U^T A V$  with U, V random (recursive butterflies)
- Factorize A<sub>r</sub> without pivoting (GENP)
- Solve  $A_r y = U^T b$  then x = V y

Requirements :

- Randomization must be cheap
- Fast GENP ("Cholesky" speed)
- Accuracy close to that of GEPP (possibly IR)

### **Butterfly Matrix**

A butterfly matrix is defined as any *n*-by-*n* matrix of the form:

$$B = \frac{1}{\sqrt{2}} \left( \begin{array}{cc} R & S \\ R & -S \end{array} \right)$$

where R and S are random diagonal matrices.

$$B = \left( \begin{array}{c} \\ \end{array} \right)$$

Remark:

$$B = \frac{1}{\sqrt{2}} \begin{pmatrix} I_{n/2} & I_{n/2} \\ I_{n/2} & -I_{n/2} \end{pmatrix} \begin{pmatrix} R & 0 \\ 0 & S \end{pmatrix}$$

# HPL - Bad Things

- LINPACK Benchmark is 37 years old
  - TOP500 (HPL) is 22 years old
- Floating point-intensive performs O(n<sup>3</sup>) floating point operations and moves O(n<sup>2</sup>) data.
- No longer so strongly correlated to real apps.
- Reports Peak Flops (although hybrid systems see only 1/2 to 2/3 of Peak)
- Encourages poor choices in architectural features
- Overall usability of a system is not measured
- Used as a marketing tool
- Decisions on acquisition made on one number
- Benchmarking for days wastes a valuable resource

## **Proposal: HPCG**

- High Performance Conjugate Gradient (HPCG).
- Solves Ax=b, A large, sparse, b known, x computed.
- An optimized implementation of PCG contains essential computational and communication patterns that are prevalent in a variety of methods for discretization and numerical solution of PDEs
- Patterns:
  - Dense and sparse computations.
  - Dense and sparse collective.
  - Multi-scale execution of kernels via MG (truncated) V cycle.
  - Data-driven parallelism (unstructured sparse triangular solves).
- Strong verification and validation properties (via spectral properties of PCG).

### **Goals for New Benchmark**

 Augment the TOP500 listing with a benchmark that correlates with important scientific and technical apps not well represented by HPL









- Encourage vendors to focus on architecture features needed for high performance on those important scientific and technical apps.
  - Stress a balance of floating point and communication bandwidth and latency
  - Reward investment in high performance collective ops
  - Reward investment in high performance point-to-point messages of various sizes
  - Reward investment in local memory system performance
  - Reward investment in parallel runtimes that facilitate intra-node parallelism
- Provide an outreach/communication tool
  - Easy to understand
  - Easy to optimize
  - Easy to implement, run, and check results
- Provide a historical database of performance information
  - The new benchmark should have longevity

## **Model Problem Description**

- Synthetic discretized 3D PDE (FEM, FVM, FDM).
- Single heat diffusion model.
- Zero Dirichlet BCs, Synthetic RHS s.t. solution = 1.
- Local domain:  $(n_x \times n_y \times n_z)$
- Process layout:  $(np_x \times np_y \times np_z)$
- Global domain:  $(n_x * np_x) \times (n_y * np_y) \times (n_z * np_z)$
- Sparse matrix:
  - 27 nonzeros/row interior.
  - 7 18 on boundary.
  - Symmetric positive definite.



27-point stencil operator

## HPL vs. HPCG: Bookends

- Some see HPL and HPCG as "bookends" of a spectrum.
  - Applications teams know where their codes lie on the spectrum.
  - Can gauge performance on a system using both HPL and HPCG numbers.
- Problem of HPL execution time still an issue:
  - Need a lower cost option. End-to-end HPL runs are too expensive.
  - Work in progress.
  - Began last year with about 20 results, today have 41 systems.
    - Not interested in collecting 500 systems

### **Comparison Peak, HPL**



## Comparison Peak, HPL, & HPCG



### HPCG Results, Nov 2015, 1-10

| Rank | Site                                                  | Computer                                                                                  | Cores     | Rmax<br>Pflops | HPCG<br>Pflops | HPCG<br>/HPL | % of<br>Peak |
|------|-------------------------------------------------------|-------------------------------------------------------------------------------------------|-----------|----------------|----------------|--------------|--------------|
| 1    | NSCC / Guangzhou                                      | Tianhe-2 NUDT, Xeon 12C 2.2GHz + Intel<br>Xeon Phi 57C + Custom                           | 3,120,000 | 33.86          | 0.580          | 1.7%         | 1.1%         |
| 2    | RIKEN Advanced Institute for<br>Computational Science | K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect                                      | 705,024   | 10.51          | 0.460          | 4.4%         | 4.1%         |
| 3    | DOE/SC/Oak Ridge Nat Lab                              | Titan - Cray XK7 , Opteron 6274 16C<br>2.200GHz, Cray Gemini interconnect,<br>NVIDIA K20x | 560,640   | 17.59          | 0.322          | 1.8%         | 1.2%         |
| 4    | DOE/NNSA/LANL/SNL                                     | Trinity - Cray XC40, Intel E5-2698v3,<br>Aries custom                                     | 301,056   | 8.10           | 0.182          | 2.3%         | 1.6%         |
| 5    | DOE/SC/Argonne National<br>Laboratory                 | Mira - BlueGene/Q, Power BQC 16C<br>1.60GHz, Custom                                       | 786,432   | 8.58           | 0.167          | 1.9%         | 1.7%         |
| 6    | HLRS/University of Stuttgart                          | Hazel Hen - Cray XC40, Intel E5-2680v3,<br>Infiniband FDR                                 | 185,088   | 5.64           | 0.138          | 2.4%         | 1.9%         |
| 7    | NASA / Mountain View                                  | Pleiades - SGI ICE X, Intel E5-2680,<br>E5-2680V2, E5-2680V3, Infiniband FDR              | 186,288   | 4.08           | 0.131          | 3.2%         | 2.7%         |
| 8    | Swiss National<br>Supercomputing Centre (CSCS)        | Piz Daint - Cray XC30, Xeon E5-2670 8C<br>2.600GHz, Aries interconnect , NVIDIA<br>K20x   | 115,984   | 6.27           | 0.124          | 2.0%         | 1.6%         |
| 9    | KAUST / Jeda                                          | Shaheen II - Cray XC40, Intel Haswell 2.3<br>GHz 16C, Cray Aries                          | 196,608   | 5.53           | 0.113          | 2.1%         | 1.6%         |
| 10   | Texas Advanced Computing<br>Center/Univ. of Texas     | Stampede - PowerEdge C8220, Xeon<br>E5-2680 8C 2.7GHz, Infiniband, Phi<br>SE10P           | 522,080   | 5.16           | 0.096          | 1.9%         | 1.0%         |

### HPCG Results, Nov 2015, 11-20

| Rank | Site                                                | Computer                                                                                           | Cores   | Rmax<br>Pflops | HPCG<br>Pflops | HPCG/<br>HPL | % of<br>Peak |
|------|-----------------------------------------------------|----------------------------------------------------------------------------------------------------|---------|----------------|----------------|--------------|--------------|
| 11   | Forschungszentrum Jülich                            | JUQUEEN - BlueGene/Q                                                                               | 458,752 | 5.0089         | 0.095          | 1.9%         | 1.6%         |
| 12   | Information Technology Center,<br>Nagoya University | ITC, Nagoya - Fujitsu PRIMEHPC<br>FX100                                                            | 92,160  | 2.91           | 0.086          | 3.0%         | 2.7%         |
| 13   | Leibniz Rechenzentrum                               | SuperMUC - iDataPlex DX360M4,<br>Xeon E5-2680 8C 2.70GHz,<br>Infiniband FDR                        | 147,456 | 2.897          | 0.083          | 2.9%         | 2.6%         |
| 14   | EPSRC/University of Edinburgh                       | ARCHER - Cray XC30, Intel Xeon E5<br>v2 12C 2.700GHz, Aries interconnect                           | 118,080 | 1.643          | 0.081          | 4.9%         | 3.2%         |
| 15   | DOE/SC/LBNL/NERSC                                   | Edison - Cray XC30, Intel Xeon<br>E5-2695v2 12C 2.4GHz, Aries<br>interconnect                      | 133,824 | 1.655          | 0.079          | 4.8%         | 3.1%         |
| 16   | National Institute for Fusion<br>Science            | Plasma Simulator - Fujitsu<br>PRIMEHPC FX100, SPARC64 Xifx,<br>Custom                              | 82,944  | 2.376          | 0.073          | 3.1%         | 2.8%         |
| 17   | GSIC Center, Tokyo Institute of<br>Technology       | TSUBAME 2.5 - Cluster Platform<br>SL390s G7, Xeon X5670 6C 2.93GHz,<br>Infiniband QDR, NVIDIA K20x | 76,032  | 2.785          | 0.073          | 2.6%         | 1.3%         |
| 18   | HLRS/Universitaet Stuttgart                         | Hornet - Cray XC40, Xeon E5-2680<br>v3 2.5 GHz, Cray Aries                                         | 94,656  | 2.763          | 0.066          | 2.4%         | 1.7%         |
| 19   | Max-Planck-Gesellschaft MPI/IPP                     | iDataPlex DX360M4, Intel Xeon<br>E5-2680v2 10C 2.800GHz, Infiniband                                | 65,320  | 1.283          | 0.061          | 4.8%         | 4.2%         |
| 20   | CEIST / JAMSTEC                                     | Earth Simulator - NEC SX-ACE                                                                       | 8,192   | 0.487          | 0.058          | 11.9%        | 11.0%        |







| Systems              | <b>2015</b><br>Tianhe-2                       |
|----------------------|-----------------------------------------------|
| System peak          | 55 Pflop/s                                    |
| Power                | 18 MW<br>(3 Gflops/W)                         |
| System memory        | <b>1.4 PB</b><br>(1.024 PB CPU + .384 PB CoP) |
| Node performance     | <b>3.43 TF/s</b><br>(.4 CPU +3 CoP)           |
| Node concurrency     | 24 cores CPU +<br>171 cores CoP               |
| Node Interconnect BW | 6.36 GB/s                                     |
| System size (nodes)  | 16,000                                        |
| Total concurrency    | <b>3.12 M</b><br>12.48M threads (4/core)      |
| MTTF                 | Few / day                                     |

## Exascale System Architecture with a cap of \$200M and 20MW

| Systems              | <b>2015</b><br>Tianhe-2                       |
|----------------------|-----------------------------------------------|
| System peak          | 55 Pflop/s                                    |
| Power                | 18 MW<br>(3 Gflops/W)                         |
| System memory        | <b>1.4 PB</b><br>(1.024 PB CPU + .384 PB CoP) |
| Node performance     | <b>3.43 TF/s</b><br>(.4 CPU +3 CoP)           |
| Node concurrency     | 24 cores CPU +<br>171 cores CoProc            |
| Node Interconnect BW | 6.36 GB/s                                     |
| System size (nodes)  | 16,000                                        |
| Total concurrency    | <b>3.12 M</b><br>12.48M threads (4/core)      |
| MTTF                 | Few / day                                     |

## Exascale System Architecture with a cap of \$200M and 20MW

| Systems              | <b>2015</b><br>Tianhe-2                       | 2020-2023               | Difference<br>Today & Exa |
|----------------------|-----------------------------------------------|-------------------------|---------------------------|
| System peak          | 55 Pflop/s                                    | 1 Eflop/s               | ~20×                      |
| Power                | 18 MW<br>(3 Gflops/W)                         | ~20 MW<br>(50 Gflops/W) | O(1)<br>~15x              |
| System memory        | <b>1.4 PB</b><br>(1.024 PB CPU + .384 PB CoP) | 256 PB                  | ~100x                     |
| Node performance     | <b>3.43 TF/s</b><br>(.4 CPU +3 CoP)           | 1.2 or 15TF/s           | O(1)                      |
| Node concurrency     | 24 cores CPU +<br>171 cores CoProc            | 0(1k) or 10k            | ~5x - ~50x                |
| Node Interconnect BW | 6.36 GB/s                                     | 200-400 GB/s            | ~40×                      |
| System size (nodes)  | 16,000                                        | O(100,000) or O(1M)     | ~6x - ~60x                |
| Total concurrency    | <b>3.12 M</b><br>12.48M threads (4/core)      | O(billion)              | ~100×                     |
| MTTF                 | Few / day                                     | Many / day              | O(?)                      |

# Critical Issues at Peta & Exascale for Algorithm and Software Design

- Synchronization-reducing algorithms
  - Break Fork-Join model
- Communication-reducing algorithms
  - Use methods which have lower bound on communication
- Mixed precision methods
  - 2x speed of ops and 2x speed for data movement
- Autotuning
  - Today's machines are too complicated, build "smarts" into software to adapt to the hardware
- Fault resilient algorithms
  - Implement algorithms that can recover from failures/bit flips
- Reproducibility of results
  - Today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.



- Major Challenges are ahead for extreme computing
  - Parallelism O(10<sup>9</sup>)
    - Programming issues
  - Hybrid
    - Peak and HPL may be very misleading
    - No where near close to peak for most apps
  - Fault Tolerance
    - Today Sequoia BG/Q node failure rate is 1.25 failures/day
  - Power
    - 50 Gflops/w (today at 2 Gflops/w)
- We will need completely new approaches and technologies to reach the Exascale level

# Collaborators / Software / Support

- PLASMA <u>http://icl.cs.utk.edu/plasma/</u>
- MAGMA <u>http://icl.cs.utk.edu/magma/</u>
- PaRSEC(Parallel Runtime Scheduling and Execution Control)
- http://icl.cs.utk.edu/parsec/



Collaborating partners
 University of Tennessee, Knoxville
 University of California, Berkeley
 University of Colorado, Denver

