# Technologies and application performance

Marc Mendez-Bermond

HPC Solutions Expert - Dell Technologies

September 2017


#### The landscape is changing

"We are no longer in the general purpose era... the argument of tuning software for hardware is moot. Now, to get the best bang for the buck, you have to tune both."



- Kushagra Vaid, general manager of server engineering, Microsoft Cloud Solutions

https://www.nextplatform.com/2017/03/08/arm-amd-x86-server-chips-get-mainstream-lift-microsoft/amp/

### Moore's Law (Technology)

- The clock speed plateau
- The power ceiling
- IPC limit



Chuck Moore, "DATA PROCESSING IN EXASCALE-CLASS COMPUTER SYSTEMS", The Salishan Conference on High Speed Computing, 2011

# Amdahl's Law (Application)

- Amdahl's law predicts the maximum speedup from the parallel fraction of your application:
  - 50% parallel: x2 max
  - 99% parallel: x100 max
  - 99.9% parallel: x1000 max

- But you should also check the efficiency:
  - 99.9% parallel on 1024 processors: about x506, i.e. an efficiency of 49%
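The figures above can be reproduced with a few lines of Python. This is a minimal sketch of the textbook form of Amdahl's law; the function names are illustrative and not taken from the original slides.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def parallel_efficiency(parallel_fraction: float, processors: int) -> float:
    """Speedup divided by processor count (1.0 means perfect scaling)."""
    return amdahl_speedup(parallel_fraction, processors) / processors


def max_speedup(parallel_fraction: float) -> float:
    """Upper bound on speedup with an unlimited number of processors."""
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    # Limits for the three parallel fractions quoted on the slide.
    for p in (0.5, 0.99, 0.999):
        print(f"{p:.1%} parallel: max x{max_speedup(p):.0f}")
    # Speedup and efficiency at a finite processor count.
    s = amdahl_speedup(0.999, 1024)
    e = parallel_efficiency(0.999, 1024)
    print(f"99.9% parallel on 1024 processors: x{s:.0f}, efficiency {e:.0%}")
```

Even with 99.9% of the work parallelized, 1024 processors deliver a speedup of roughly 500, so about half of the machine is effectively idle.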



# WARNING: Legacy slide from back in 2014 (ENS/PSMN cluster inauguration)

#### Still valid conclusions!

#### Intel Xeon Phi: a few considerations

- x86\_64 programming models
- Cache coherency
  - Dual-ring interconnect
  - 8 (soon 16) GB RAM
- Simple, straight-to-the-point cores
  - No "out of order" execution
  - No branch prediction
  - 4 Hyper-threads per core
  - Wide vectors (16 op/c/core)
- PCIe connectivity to host

App should fit in onboard memory, Parallelism > 99.9%, Vectorization > 95%
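One way to see why the vectorization threshold is so high is to apply Amdahl-style reasoning to the vector lanes instead of the cores. The sketch below assumes that scalar code retires one operation per cycle and vectorized code retires 16 (the "16 op/c/core" figure above); the function names are hypothetical and the model ignores memory effects.

```python
VECTOR_WIDTH = 16  # operations per cycle per core, as quoted on the slide


def vector_speedup(vectorized_fraction: float, width: int = VECTOR_WIDTH) -> float:
    """Amdahl-style speedup when only part of the work uses the vector unit."""
    scalar_fraction = 1.0 - vectorized_fraction
    return 1.0 / (scalar_fraction + vectorized_fraction / width)


def vector_unit_utilization(vectorized_fraction: float, width: int = VECTOR_WIDTH) -> float:
    """Fraction of the peak vector throughput that is actually achieved."""
    return vector_speedup(vectorized_fraction, width) / width


if __name__ == "__main__":
    for v in (0.50, 0.90, 0.95, 0.99):
        print(
            f"{v:.0%} vectorized: x{vector_speedup(v):.1f} "
            f"({vector_unit_utilization(v):.0%} of vector peak)"
        )
```

Under these assumptions, a code that is 95% vectorized still reaches only a little over half of the vector peak, which is consistent with treating 95% as a minimum rather than a comfortable target.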



# Moore's Law vs Amdahl's Law - "Too Many Cooks in the Kitchen"



Industry is applying Moore's Law by adding more cores

Meanwhile Amdahl's Law says that you cannot use them all efficiently
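The tension can be made concrete by asking how many cores an application can use before its efficiency drops below a chosen target. The sketch below inverts Amdahl's law for that purpose; the function names and the sample targets are illustrative assumptions, not values from the slides.

```python
import math


def efficiency(parallel_fraction: float, processors: int) -> float:
    """Amdahl efficiency: speedup divided by the number of processors."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction * processors + parallel_fraction)


def max_processors_for_efficiency(parallel_fraction: float, target: float) -> int:
    """Largest processor count whose Amdahl efficiency stays at or above target."""
    if not 0.0 <= parallel_fraction < 1.0:
        raise ValueError("parallel_fraction must be in [0, 1)")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    serial_fraction = 1.0 - parallel_fraction
    # Solve 1 / (serial * N + parallel) >= target for N.
    bound = (1.0 / target - parallel_fraction) / serial_fraction
    return max(1, math.floor(bound))


if __name__ == "__main__":
    for p in (0.95, 0.99, 0.999):
        n = max_processors_for_efficiency(p, 0.75)
        print(f"{p:.1%} parallel: at most {n} processors for 75% efficiency")
```

For a 95% parallel code and a 75% efficiency target, the bound is only a handful of cores, far fewer than a current dual-socket node provides.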

#### System trend over the years (1)



#### System trend over the years (2)






# Improving performance - what levers do we have?

- Challenge: Sustain performance trajectory without massive increases in cost, power, real estate, and unreliability
- Solutions: <u>No single answer</u>, must **intelligently turn** "Architectural Knobs"



#### Turning the knobs 1 - 4



- Frequency is unlikely to change much: thermal, power, and leakage challenges
- Moore's Law still holds: 130 nm -> 14 nm, LOTS of transistors
- Number of sockets per system is the easiest knob, but challenging for power, density, cooling, and networking
- IPC still grows
- FMA3/4, AVX, FPGA implementations for algorithms: challenging for the user/developer

### New capabilities according to Intel

| Thurley Platform                                         |                           | Romley Platform                                               |                           | Grantley Platform                                        |                           | Purley Platform                                          |                   |
|----------------------------------------------------------|---------------------------|---------------------------------------------------------------|---------------------------|----------------------------------------------------------|---------------------------|----------------------------------------------------------|-------------------|
| Intel <sup>®</sup> Microarchitecture<br>Codename Nehalem |                           | Intel <sup>®</sup> Microarchitecture<br>Codename Sandy Bridge |                           | Intel <sup>®</sup> Microarchitecture<br>Codename Haswell |                           | Intel <sup>®</sup> Microarchitecture<br>Codename Skylake |                   |
| Nehalem                                                  | Westmere                  | Sandy<br>Bridge                                               | Ivy Bridge                | Haswell                                                  | Broadwell                 | Skylake                                                  | Future<br>Product |
| 45nm                                                     | 32nm                      | 32nm                                                          | 22nm                      | 22nm                                                     | 14nm                      | 14nm                                                     |                   |
| New Micro-<br>architecture                               | New Process<br>Technology | New Micro-<br>architecture                                    | New Process<br>Technology | New Micro-<br>architecture                               | New Process<br>Technology | New Micro-<br>architecture                               |                   |
| SSSE3                                                    | SSE4                      | AVX                                                           | AVX                       | AVX2                                                     | AVX2                      | AVX-512                                                  |                   |
| 2007                                                     | 2009                      | 2012                                                          | 2013                      | 2014                                                     | 2015                      | 2017                                                     |                   |

#### The state of ISV software

| Segment             | Applications                 | Vectorization support    |
|---------------------|------------------------------|--------------------------|
| CFD                 | Fluent, LS-DYNA, STAR-CCM+   | Limited SSE2 support     |
| CSM                 | CFX, RADIOSS, Abaqus         | Limited SSE2 support     |
| Weather             | WRF, UM, NEMO, CAM           | Yes                      |
| Oil and Gas         | Seismic processing           | Not applicable           |
|                     | Reservoir Simulation         | Yes                      |
| Chemistry           | Gaussian, GAMESS, Molpro     | Not applicable           |
| Molecular dynamics  | NAMD, GROMACS, Amber         | PME kernels support SSE2 |
| Biology             | BLAST, Smith-Waterman        | Not applicable           |
| Molecular mechanics | CPMD, VASP, CP2K, CASTEP     | Yes                      |

Bottom line: ISV support for new instructions is poor. This is less of an issue for in-house developed codes, but programming is hard.

#### Meanwhile the bandwidth is suffering



# Add to this the Memory Bandwidth and System Balance



Obtained from: http://sc16.supercomputing.org/2016/10/07/sc16-invited-talk-spotlight-dr-john-d-mccalpin-presents-memory-bandwidth-system-balance-hpc-systems/

### And data is becoming sparser (think "Big Data")



- Sparse data has very low arithmetic intensity and is therefore memory bound
- Common in CFD, but also in genetic evaluation of species
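To illustrate why sparse workloads are memory bound, the sketch below implements a sparse matrix-vector product in CSR format and estimates its arithmetic intensity in flops per byte. The traffic model is an optimistic assumption (square matrix, 8-byte values, 4-byte indices, input and output vectors each moved once), and the five-nonzeros-per-row example is a hypothetical stencil-like matrix, not data from the slides.

```python
def csr_spmv(values, col_indices, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Nonzeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_indices[k]]
        y[row] = acc
    return y


def csr_spmv_arithmetic_intensity(nnz, n_rows, value_bytes=8, index_bytes=4):
    """Estimated flops per byte of memory traffic for one CSR SpMV."""
    flops = 2 * nnz  # one multiply and one add per nonzero
    bytes_moved = (
        nnz * (value_bytes + index_bytes)  # matrix values and column indices
        + (n_rows + 1) * index_bytes       # row pointers
        + 2 * n_rows * value_bytes         # x read once, y written once
    )
    return flops / bytes_moved


if __name__ == "__main__":
    # 2x2 example: [[2, 0], [1, 3]] times [1, 2].
    print(csr_spmv([2.0, 1.0, 3.0], [0, 0, 1], [0, 1, 3], [1.0, 2.0]))
    # Hypothetical matrix with one million rows and five nonzeros per row.
    n = 1_000_000
    print(f"{csr_spmv_arithmetic_intensity(5 * n, n):.3f} flop/byte")
```

An intensity well below one flop per byte means the floating-point units spend most of their time waiting on memory, whatever the vector width.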

#### Xeon roofline model (v4)
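As a reminder of what a roofline plot encodes, attainable performance is the minimum of the compute peak and the product of arithmetic intensity and memory bandwidth. The sketch below uses placeholder values for peak and bandwidth; they are assumptions chosen for round numbers, not measured Xeon v4 figures.

```python
def roofline_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s for a kernel with the given flop/byte intensity."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity above which a kernel becomes compute bound."""
    return peak_gflops / bandwidth_gbs


if __name__ == "__main__":
    PEAK_GFLOPS = 1000.0   # placeholder compute peak, not a measured value
    BANDWIDTH_GBS = 100.0  # placeholder memory bandwidth, not a measured value
    print(f"ridge point: {ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS):.1f} flop/byte")
    for ai in (0.125, 1.0, 10.0, 50.0):
        perf = roofline_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
        bound = "memory" if ai < ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS) else "compute"
        print(f"AI {ai:>6.3f} flop/byte -> {perf:7.1f} GFLOP/s ({bound} bound)")
```

With these placeholder numbers, a kernel at 0.125 flop/byte reaches only a small fraction of the peak, and adding cores or wider vectors does not move it off the bandwidth slope.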



#### What does Intel do about these trends?

| Problem             | Westmere   | Sandy Bridge                                                      | Ivy Bridge                                                           | Haswell                                                                            | Broadwell                                                                                                                                  | Skylake                                                                                                                              |
|---------------------|------------|-------------------------------------------------------------------|----------------------------------------------------------------------|------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------|
| QPI<br>bandwidth    | No problem | Even better                                                       | Two snoop<br>modes                                                   | Three snoop<br>modes                                                               | Four (!) snoop<br>modes                                                                                                                    | <ul> <li>UPI</li> <li>COD snoop<br/>modes</li> </ul>                                                                                 |
| Memory<br>bandwidth | No problem | Extra memory<br>channel                                           | Larger cache                                                         | Extra load/store<br>units                                                          | Larger cache                                                                                                                               | <ul> <li>Extra<br/>load/store<br/>units</li> <li>+50%<br/>memory<br/>channels</li> </ul>                                             |
| Core<br>frequency   | No problem | <ul> <li>More cores</li> <li>AVX</li> <li>Better Turbo</li> </ul> | <ul> <li>Even more<br/>cores</li> <li>Above TDP<br/>Turbo</li> </ul> | <ul> <li>Still more<br/>cores</li> <li>AVX2</li> <li>Per-core<br/>Turbo</li> </ul> | <ul> <li>Again even<br/>more cores</li> <li>optimized<br/>FMA</li> <li>Per-core<br/>Turbo<br/>based on<br/>instruction<br/>type</li> </ul> | <ul> <li>More cores</li> <li>Larger OOO<br/>engine</li> <li>AVX-512</li> <li>3 different<br/>core<br/>frequency<br/>modes</li> </ul> |

## C4130 – Ten supported variations






### Pragmatic computing

| Parallelize                     | Vectorize                                  | Optimize                                                                   |
|---------------------------------|--------------------------------------------|----------------------------------------------------------------------------|
| Take advantage of multicore     | Take advantage<br>of large-vector<br>units | <ul> <li>Intrinsic optimization</li> <li>Execution optimization</li> </ul> |
| Amdahl's law<br>Moore's law : b | Efficiency of<br>implementation            |                                                                            |

#### Public benchmark data

en.community.dell.com/techcenter/high-performance-computing/b/general\_hpc/archive/2017/08/04/lammps-four-node-comparative-performance-analysis-on-skylake-processors



# Portfolio: Ready Solutions for HPC

- Maximum flexibility
- Validated for use case
- Heterogeneity with lower risk
- Component lifecycle automation and control

#### Consumption models

- Fastest time to value
- Optimized and tuned for use case
- Greatest risk reduction
- Solution lifecycle automation

#### STORAGE READY BUNDLES

- Dell EMC Ready Bundle for HPC NFS Storage: scales from a minimum of 48TB to 480TB of raw capacity in a single namespace
- Dell EMC Ready Bundle for HPC Lustre Storage: Lustre parallel file storage system that scales from 120TB to petabytes of data

#### SYSTEMS FOR A RANGE OF USE CASES

- Dell EMC HPC System for Life Sciences: fully integrated for pharma/biotech applications
- Dell EMC HPC System for Manufacturing: fully integrated for computer-aided engineering (CAE) workloads
- Dell EMC HPC System for Research: general-purpose compute cluster for multiple research workloads




### HPC Innovation Lab World-Class Infrastructure

# Dedication to Research and Development:

- 13K sq. ft (1200m<sup>2</sup>) with 1300+ Servers and ~10PB
- Leverage Expertise in HPC
- Test New Technologies
- Tune your applications for performance and efficiency



#### Zenith

- Top500 class system based on Intel Scalable Systems Framework (OPA, KNL, Xeon, OpenHPC)
- 256 nodes with dual 2697v4 processors, non-blocking OPA fabric, and 270 TFlops sustained performance

#### Rattler

- Research/development system in collaboration with Mellanox and NVIDIA
- 80 nodes configured with InfiniBand EDR and 2660v3 processors

# Thank you!

#### marc\_mendez\_bermond@dell.com
