YET ANOTHER STENCIL KERNEL (YASK):
A FRAMEWORK FOR FINITE-DIFFERENCE OPTIMIZATION

Chuck Yount, Principal Engineer, Intel Corporation
chuck.yount@intel.com

with contributed material from
• Alex Duran, Intel Corporation Iberia, Spain
• Alex Breuer & Josh Tobin, Univ. of CA, San Diego
• Alex Heinecke, Intel Labs

Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
October 4, 2017
Legal Disclaimers

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Statements in this document that refer to Intel's plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel's results and plans is included in Intel's SEC filings, including the annual report on Form 10-K.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Intel, Xeon, Xeon Phi, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804
Agenda

Background
- Introduction to HPC Stencil Algorithms via earthquake simulation
- Overview of the Intel® Xeon Phi™ and Intel® Xeon® Scalable Processors

YASK Framework for creating and tuning Stencil Code
- Vector-folding feature
- Automatic-tuning feature
- Current and future work

Summary
Application Domain: HPC Stencil Computation

- Iterative kernels that update elements in one or more N-dimensional grids using a fixed pattern of computation on neighboring elements.
- Fundamental algorithm in many scientific simulations, commonly used for solving differential equations using finite-difference methods (FDM).

Images from https://commons.wikimedia.org

- Image Processing
- Seismic Modeling
- Weather Simulation

Images from https://commons.wikimedia.org
Example: AWP-ODC-OS

AWP-ODC: Anelastic Wave Propagation-Olsen, Day, Cui

- Software that simulates seismic wave propagation after a fault rupture
- Widely used by the Southern California Earthquake Center (SCEC) community
- In recent years has primarily run on GPU accelerated supercomputers

AWP-ODC-OS

- First ever open source release in 2016 (BSD-2 license), including port to Intel Xeon Phi processor, under development by San Diego Supercomputer Center (SDSC) at Univ. of CA, San Diego (UCSD)

- CyberShake Study 15.4 hazard map for 336 sites around Southern California
- Warm colors represent areas of high hazard
AWP-ODC-OS Development Process

1. Geophysicists use differential equations to represent velocity and stress of rock and soil during an earthquake

2. Derivatives are approximated in both time and space (only x dimension shown)

3. Equations are expanded to finite-difference stencils (this is one of 15 stencils for AWP-ODC staggered-grid formulation)

4. Stencils are coded and tuned for HPC clusters (our focus)
LATEST INTEL® PROCESSORS FOR HPC

Intel® Xeon Phi™ Processors and Intel® Xeon® Scalable Processors
Intel® Xeon Phi™ Product Family x200
(previously code-named Knights Landing, “KNL”)

Host Processor in Groveport Platform
*Self-boot Intel® Xeon Phi™ processor*
KNL Architecture Overview

**ISA**
Intel® Xeon® Processor Binary-Compatible (w/Broadwell)

**On-package memory**
16GiB MCDRAM, ~490 GB/s Stream Triad

**Platform Memory**
Up to 384GiB DDR4-2400, ~90 GB/s Stream Triad

- 2D Mesh Architecture
- Out-of-Order Cores
- 3X single-thread vs. KNC

**TILE:**
(Up to 36)

<table>
<thead>
<tr>
<th>2VPU</th>
<th>HUB</th>
<th>2VPU</th>
</tr>
</thead>
<tbody>
<tr>
<td>Core</td>
<td>1MB</td>
<td>L2</td>
</tr>
</tbody>
</table>

*Enhanced Intel® Atom™ cores based on Silvermont Microarchitecture*

Up to 72 cores with 4 hyper-threads each

**Bi-directional tile connections**
### Integrated On-Package Memory Usage Models

Model configurable at boot time and software exposed through NUMA

<table>
<thead>
<tr>
<th>Cache Model</th>
<th>Flat Model</th>
<th>Hybrid Model</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ideal for large data size (&gt;16GB) cache blocking apps</td>
<td>Maximum bandwidth for data reuse aware apps</td>
<td>Maximum flexibility for varied workloads</td>
</tr>
</tbody>
</table>

**Description**
- **Cache Model**: Hardware automatically manages the MCDRAM as a "L3 cache" between CPU and DDR memory.
- **Flat Model**: Manually manage how the app uses the integrated on-package memory and DDR for peak perf.
- **Hybrid Model**: Harness the benefits of both Cache and Flat models by segmenting the integrated on-package memory.

**Usage Model**
- **Cache Model**: App and/or data set is very large and will not fit into MCDRAM; Unknown or unstructured memory access behavior.
- **Flat Model**: App or portion of an app or data set that can be “locked” into MCDRAM so it doesn’t get flushed out.
- **Hybrid Model**: Need to "lock" in a relatively small portion of an app or data set via the Flat model; Remaining MCDRAM is configured as Cache.

**Model Options**
- **Cache Model**: 64B cache lines direct-mapped
- **Flat Model**: 8GB/16GB MCDRAM
- **Hybrid Model**: Split Options: 25/75% or 50/50%
- **Physical Address**: Up to 384 GB DRAM
- **DRAM Options**: 8GB or 16GB
- **MCDRAM Options**: 8GB or 12GB

**Other Details**
- **Physical Address**: 16GB
- **Cache Lines**: 64B
- **Address Space**: Up to 384 GB

---

1. NUMA = non-uniform memory access
Intel® Xeon® Scalable Processors
(previously code-named Skylake Xeon, “SKX”)

• Above graphic shows maximum sockets. Two-socket platforms are common in HPC installations.
• Also available: Gold (5000 Series), Silver (4000 Series), and Bronze (3000 Series)
SIMD Instruction Sets

- E5-2600 (SNB\(^1\))
  - x87/MMX
  - SSE
  - AVX
  - AVX2

- E5-2600v3 (HSW\(^1\))
  - x87/MMX
  - SSE
  - AVX
  - AVX2

- KNL (Xeon Phi)
  - x87/MMX
  - SSE
  - AVX
  - AVX2
  - AVX-512F
  - AVX-512CD
  - AVX-512PF
  - AVX-512ER

- SKX (Xeon)
  - x87/MMX
  - SSE
  - AVX
  - AVX2
  - AVX-512F
  - AVX-512CD
  - AVX-512BW
  - AVX-512DQ
  - AVX-512VL

**AVX-512:**
- Foundation
  - 512-bit FP/Integer Vectors
  - 32 SIMD registers
  - 8 mask registers
  - Vector gather/scatter
- **Conflict Detection** for vectorizing histogram-type algorithms
- **Prefetch** gather/scatter
- **Exponential and Reciprocal** instructions
- **Byte and Word** integer SIMD elements
- **Double- and Quad-word** int SIMD
- **Vector-Length** orthogonality (128 and 256-bit operations)

---

1. Previous code-names of Intel® Xeon® processors

---

Sept., 2017

Intel Corporation

13
### Intel® Xeon Phi™ & Xeon® Scalable on Top500

(Those in top 20 from June 2017 list, [https://www.top500.org](https://www.top500.org))

<table>
<thead>
<tr>
<th>Rank</th>
<th>System</th>
<th>Cores</th>
<th>Rmax (TFlop/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>6</td>
<td>Cori, Intel Xeon Phi 7250 68C 1.4GHz, DOE/SC/LBNL/NERSC, United States</td>
<td>622,336</td>
<td>14,014.7</td>
</tr>
<tr>
<td>7</td>
<td>Oakforest-PACS, Xeon Phi 7250, Joint Center for Advanced High Performance Computing, Japan</td>
<td>556,104</td>
<td>13,554.6</td>
</tr>
<tr>
<td>12</td>
<td>Stampede2, Xeon Phi 7250, Texas Advanced Computing Center/Univ. of Texas, United States</td>
<td>285,600</td>
<td>6,807.1</td>
</tr>
<tr>
<td>13</td>
<td>MareNostrum, Xeon Platinum 8160, Barcelona Supercomputing Center, Spain</td>
<td>148,176</td>
<td>6,227.2</td>
</tr>
<tr>
<td>14</td>
<td>Marconi - CINECA Cluster, Xeon Phi 7250, CINECA, Italy</td>
<td>241,808</td>
<td>6,223.0</td>
</tr>
<tr>
<td>16</td>
<td>Theta, Xeon Phi 7230, DOE/SC/Argonne National Laboratory, United States</td>
<td>231,424</td>
<td>5,884.6</td>
</tr>
</tbody>
</table>
STENCIL-CODE MODERNIZATION FOR THE INTEL® XEON PHI™ PROCESSOR
What is “Modernized” Code? (generic HPC advice)

<table>
<thead>
<tr>
<th>What</th>
<th>Defined</th>
<th>Tools of the trade</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 Thread Scaling</td>
<td>Increase concurrent thread use across coherent shared memory</td>
<td>OpenMP, TBB, Cilk Plus</td>
</tr>
<tr>
<td>2 Vector Scaling</td>
<td>Use wide-vector (512-bit) instructions</td>
<td>Vector loops, vector functions, array notations</td>
</tr>
<tr>
<td>3 Cache Blocking</td>
<td>Use algorithms to reduce memory bandwidth pressure and improve cache hit rate</td>
<td>Blocking algorithms</td>
</tr>
<tr>
<td>4 Fabric Scaling</td>
<td>Tune workload to increased node count</td>
<td>MPI</td>
</tr>
<tr>
<td>5 Data Layout</td>
<td>Optimize data layout for unconstrained performance</td>
<td>AoS→SoA, directives for alignment</td>
</tr>
</tbody>
</table>
Modernizing Stencil Code

<table>
<thead>
<tr>
<th>Category</th>
<th>Example techniques for stencil code</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 Thread Scaling</td>
<td>• Typical: Evaluate multiple blocks in parallel using hyper-threading and multi-core</td>
</tr>
<tr>
<td></td>
<td>• Advanced: Use nested parallelism to increase cooperation between hyper-threads and/or KNL cores sharing a tile</td>
</tr>
<tr>
<td>2 Vector Scaling</td>
<td>• Typical: Use wide-vector (512-bit) instructions</td>
</tr>
<tr>
<td></td>
<td>• Advanced: Use KNL reciprocal instructions to improve division performance when allowable</td>
</tr>
<tr>
<td>3 Cache Blocking</td>
<td>• Typical: Use one level of blocking within a time-step to increase L2 reuse</td>
</tr>
<tr>
<td></td>
<td>• Advanced: Use additional level of blocking with temporal wave-fronts to utilize KNL's MCDRAM cache</td>
</tr>
<tr>
<td>4 Fabric Scaling</td>
<td>• Typical: Use MPI to exchange halos between time-steps</td>
</tr>
<tr>
<td></td>
<td>• Advanced: Schedule MPI communication to occur simultaneously with calculations of internal points</td>
</tr>
<tr>
<td>5 Data Layout</td>
<td>• Typical: Align accesses on cache-line boundaries and use KNL's MCDRAM when possible</td>
</tr>
<tr>
<td></td>
<td>• Advanced: Use custom layout to enable vector-folding, which reduces memory bandwidth demand (details following)</td>
</tr>
</tbody>
</table>

Challenges

- Implementing the optimizations can be complex and error-prone
- Optimal tuning requires trading off multiple (sometimes conflicting) optimizations, each with multiple parameters
- Domain experts may reject code that obfuscates the underlying math!
Y.A.S.K. – Yet Another Stencil Kernel

What it is [and isn't]

- A *software framework* to implement and tune stencil code for Intel® Xeon® processors and Intel® Xeon Phi™ processors and coprocessors
- Not [just] a library because stencil formulation isn't known *a priori* for all problems

Goals

- Create high-performing kernel code from a straightforward specification of stencil equations in a domain-specific language (DSL)
- Provide a simple kernel-driver to test and tune stencil performance
  - Expose optimization trade-off choices without requiring code changes
  - Automate searching through the optimization design space
- Provide ability to integrate generated code into larger applications (work in progress)
YASK High-Level Flow

- Stencil specification code
- Stencil compiler
- Optimized stencil calculation and prefetch code
- Loop compiler
- Nested loops with OpenMP, prefetch code, etc.
- Other C++ code
- Executable stencil kernel binary
- Intel C++ compiler
- Performance results

Sept., 2017
Intel Corporation
Example YASK Feature: Vector-Folding

Example 3D stencil:

25 points from 3D grid $u(t)$

...as specified by the RHS of this finite-difference equation

$$u(t+1,i,j,k) = c_0 u(t,i,j,k)$$
$$+ \sum_{r=1}^{4} c_r [u(t,i-r,j,k) + u(t,i+r,j,k) + u(t,i,j-r,k)]$$

...are used to compute 1 point in $u(t+1)$

$u(t)$ $\rightarrow$ $u(t+1)$
Example 3D stencil:

\[ u(t) \rightarrow u(t + 1) \]
Example 3D stencil:

\[ u(t) \rightarrow u(t + 1) \]
Example 3D stencil:

\[ u(t) \rightarrow u(t + 1) \]

Entire problem domain—typically billions of points

“Halo” data regions

Repeat for \( u(t+2) \)...
Traditional 1D Vectorization

25 8-element input vectors (200 points) from $u(t)$

Notice overlap of vectors along x axis

$u(t)$  $\rightarrow$  $u(t + 1)$

...to compute 8 points in $u(t+1)$
Traditional 1D Vectorization

Inner 3D loop iterates in x direction, i.e., same dimension as vectorization

8 new vectors must be read for k±r points (4 for k+r and 4 for k-r for r=1..4)

Only 1 new vector must be read for i±r points due to overlap along x axis

8 new vectors must be read for j±r points

Total memory BW cost for traditional “in-line” vectors = 17 new vector inputs for each vector of output
Traditional Memory Layout and Code Gen

1D vectorization

Logical indices in 2D with 8-element SIMD in x-dimension

<p>| | | | | | | | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>5.1</td>
<td>5.2</td>
<td>5.3</td>
<td>5.4</td>
<td>5.5</td>
<td>...</td>
<td>5.8</td>
<td>5.9</td>
<td>...</td>
<td></td>
</tr>
<tr>
<td>4.1</td>
<td>4.2</td>
<td>4.3</td>
<td>4.4</td>
<td>4.5</td>
<td>...</td>
<td>4.8</td>
<td>4.9</td>
<td>...</td>
<td></td>
</tr>
<tr>
<td>3.1</td>
<td>3.2</td>
<td>3.3</td>
<td>3.4</td>
<td>3.5</td>
<td>...</td>
<td>3.8</td>
<td>3.9</td>
<td>...</td>
<td></td>
</tr>
<tr>
<td>2.1</td>
<td>2.2</td>
<td>2.3</td>
<td>2.4</td>
<td>2.5</td>
<td>...</td>
<td>2.8</td>
<td>2.9</td>
<td>...</td>
<td></td>
</tr>
<tr>
<td>1.1</td>
<td>1.2</td>
<td>1.3</td>
<td>1.4</td>
<td>1.5</td>
<td>...</td>
<td>1.8</td>
<td>1.9</td>
<td>...</td>
<td></td>
</tr>
</tbody>
</table>

Layout in memory (1D)

- Traditional 1D vectorization layout (8×1)
- Two aligned vectors are colored
- **Aligned** reads shown with bold borders done with simple and efficient aligned vector loads
Traditional Memory Layout and Code Gen

1D vectorization

Logical indices in 2D with 8-element SIMD in x-dimension

<table>
<thead>
<tr>
<th>x</th>
<th>y</th>
</tr>
</thead>
<tbody>
<tr>
<td>5,1</td>
<td>5,2</td>
</tr>
<tr>
<td>4,1</td>
<td>4,2</td>
</tr>
<tr>
<td>3,1</td>
<td>3,2</td>
</tr>
<tr>
<td>2,1</td>
<td>2,2</td>
</tr>
<tr>
<td>1,1</td>
<td>1,2</td>
</tr>
</tbody>
</table>

Layout in memory (1D)

- Traditional 1D vectorization layout (8×1)
- Two aligned vectors are colored
- **Unaligned** reads shown with bold borders done with unaligned load or two aligned loads plus a simple **shift** instruction
2D Vector-Folding

Twenty-five 4x2x1 input vectors

Notice overlap of vectors along x and y axes

\[ u(t) \quad \rightarrow \quad u(t + 1) \]

4x2x1 micro-block in x and y dimensions
2D Vector-Folding

- Inner 3D loop iterates in z direction, i.e., perpendicular to 2D vector.

- Total memory BW cost for 4x2x1 vector with z-axis loop = 7 new vector inputs for each vector of output (2.4x lower than in-line).

- 4 new 4x2x1 vectors must be read for j±r points.

- Only 1 new vector must be read for k±r points.

- 2 new vectors must be read for i±r points.
Vector-Folding Memory Layout and Code Gen

2D “4x2” vector folding

Logical indices in 2D with 8-element SIMD in x and y dimensions

<table>
<thead>
<tr>
<th></th>
<th>5.1</th>
<th>5.2</th>
<th>5.3</th>
<th>5.4</th>
<th>5.5</th>
<th>5.6</th>
<th>5.7</th>
<th>5.8</th>
<th>5.9</th>
<th>...</th>
</tr>
</thead>
<tbody>
<tr>
<td>4.1</td>
<td>4.2</td>
<td>4.3</td>
<td>4.4</td>
<td>4.5</td>
<td>4.6</td>
<td>4.7</td>
<td>4.8</td>
<td>4.9</td>
<td>...</td>
<td></td>
</tr>
<tr>
<td>3.1</td>
<td>3.2</td>
<td>3.3</td>
<td>3.4</td>
<td>3.5</td>
<td>3.6</td>
<td>3.7</td>
<td>3.8</td>
<td>3.9</td>
<td>...</td>
<td></td>
</tr>
<tr>
<td>2.1</td>
<td>2.2</td>
<td>2.3</td>
<td>2.4</td>
<td>2.5</td>
<td>2.6</td>
<td>2.7</td>
<td>2.8</td>
<td>2.9</td>
<td>...</td>
<td></td>
</tr>
<tr>
<td>1.1</td>
<td>1.2</td>
<td>1.3</td>
<td>1.4</td>
<td>1.5</td>
<td>1.6</td>
<td>1.7</td>
<td>1.8</td>
<td>1.9</td>
<td>...</td>
<td></td>
</tr>
</tbody>
</table>

Layout in memory (1D)

| 1.1 | 1.2 | 1.3 | 1.4 | 2.1 | 2.2 | 2.3 | 2.4 | 1.5 | 1.6 | 1.7 | 1.8 | 2.5 | 2.6 | 2.7 | 2.8 | 1.9 | ... |

- 2D vector folding layout (8×1)
- Two aligned vectors are colored
- Aligned reads shown with bold borders done with simple and efficient aligned vector loads
Vector-Folding Memory Layout and Code Gen

2D “4x2” vector folding

Logical indices in 2D with 8-element SIMD in x and y dimensions

| 1,1 | 1,2 | 1,3 | 1,4 | 2,1 | 2,2 | 2,3 | 2,4 | 1,5 | 1,6 | 1,7 | 1,8 | 2,5 | 2,6 | 2,7 | 2,8 | 1,9 | ...
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|

Layout in memory (1D)

- 2D vector folding layout (8×1)
- Two aligned vectors are colored
- Unaligned read shown with bold borders done by loading aligned vectors and then shuffling the requisite elements via an AVX-512 *permute* instruction

Intel Corporation

Sept., 2017
Results of Vector-Folding on AWP-ODC-OS

- Performance measured in number of Lattice Update Points per Second, updating 15 grids
- Kernel performance only (does not include surface boundary updates)
- See ISC’17 paper for more details

![Graph showing performance comparison between 1D and 2D layouts, with values in MLUPS (Millions of Lattice Update Points per Second).]

- Traditional 1D layout
- 2D layout selected

Content on this slide courtesy of UCSD
Example YASK Feature: Automatic Tuner

Challenge

- Dozens of possible optimization strategies
- Some of these can take hundreds of values (e.g., cache-block dimensions)
- Leads to combinatorial explosion in size of possible design space

Solution

- Use genetic algorithm to select optimizations and tune parameters
- Tuner repeats the following sequence until convergence
  - Chooses optimization strategies and parameters based on random values (first generation) or mutation and crossover (subsequent generations)
  - Runs stencil compiler, loop compiler, C++ compiler, and kernel itself
  - Inputs resulting performance as fitness
YASK High-Level Flow with Tuner

Automated Tuner

Stencil-specification code

Stencil compiler

Optimized stencil calculation and prefetch code

Loop compiler

Nested loops with OpenMP, prefetch code, etc.

Other C++ code

Executable stencil kernel binary

Intel C++ compiler

Performance results
Results of Auto-Tuning on AWP-ODC-OS

- 13 generations of 200 individuals each
- Best individual (highlighted) found in 8\textsuperscript{th} generation
- Performance measured in number of Lattice Update Points per Second, updating 15 grids
- Kernel performance only (does not include surface boundary updates)
Additional YASK Features

Cache-blocking with nested OpenMP
- Two levels of OpenMP parallelism applied to “block” and “sub-block” sub-domains
- Allows efficient use of level 1 and 2 local caches shared between threads and/or cores

Temporal wave-front blocking
- Allows efficient use of large shared cache (e.g., Xeon Phi MCDRAM in cache mode)

MPI halo exchange
- Allows domain decomposition across compute nodes in a cluster
## Experimental Platforms

<table>
<thead>
<tr>
<th>Products</th>
<th>KNL (Intel® Xeon Phi™ Processor 7250)</th>
<th>SkyLake/Wolf Pass (Intel® Xeon® Gold 6148/Platinum 8160/Platinum 8180)</th>
<th>Broadwell (Intel® Xeon® E5-2697 v4)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sockets - TDP</td>
<td>1S - 215W</td>
<td>2-150W/ 2-150W/ 2-205W</td>
<td>2-145W</td>
</tr>
<tr>
<td>DDR4</td>
<td>96 GiB 2400 MHz</td>
<td>192 GiB 2666 MHz</td>
<td>128 GiB 2400 MHz</td>
</tr>
<tr>
<td>MCDRAM</td>
<td>16 GiB (Quadrant Flat)</td>
<td>N/A</td>
<td>N/A</td>
</tr>
<tr>
<td>Turbo</td>
<td>On</td>
<td>On</td>
<td>On</td>
</tr>
<tr>
<td>Compiler version</td>
<td>Intel C++ compiler 17.0.4</td>
<td>Intel C++ compiler 17.0.4</td>
<td>Intel C++ compiler 17.0.4</td>
</tr>
<tr>
<td>OS</td>
<td>RHEL7.2</td>
<td>RHEL 7.3</td>
<td>RHEL7.2</td>
</tr>
</tbody>
</table>
YASK Performance

AWP:
- $1024^2 \times 128$ problem size
- N.B.: Results are in grid updates per sec, 15x the lattice updates in previous slide

Iso3DFD:
- $1024^3$ problem size
- 16\textsuperscript{th} order-accurate in space, 2\textsuperscript{nd} order-accurate in time acoustic wave

Xeon Phi:
- Both problems fit within 16 GiB MCDRAM in flat mode

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. Measurements created by UC San Diego Supercomputer Center (SDSC) as of Oct., 2016. See complete configuration details on "Configuration" slide. *Other names and brands may be claimed as the property of others
Current Work Items

Project engagements

- Imperial College London (ICL): Integrating YASK as a backend to Devito, a symbolic finite-difference solution software
- Barcelona Supercomputing Center (BSC): Applying YASK to full staggered-grid formulation for energy exploration
- Univ. of CA San Diego (UCSD): Continued support of AWP

Software development

- Creating a documented API (application-programmer interface) for both the stencil compiler and kernel [almost done]
- Generalizing the 3D grids to nD arrays to allow 1D and 2D stencils as well as mixed array sizes in stencils (e.g., for absorbing boundary-condition arrays) [almost done]
- Allowing domain subsets via conditional equations [in progress]
Development Process with Devito

1. Geophysicists use differential equations to represent velocity and stress of rock and soil during an earthquake.

\[ v = \begin{pmatrix} v_x \\ v_y \\ v_z \end{pmatrix}, \quad \sigma = \begin{pmatrix} \sigma_{xx} & \sigma_{xy} & \sigma_{xz} \\ \sigma_{xy} & \sigma_{yy} & \sigma_{yz} \\ \sigma_{xz} & \sigma_{yz} & \sigma_{zz} \end{pmatrix} \]

\[ \frac{\partial}{\partial t} v = \frac{1}{\rho} \nabla \cdot \sigma \]

\[ \frac{\partial}{\partial t} \sigma = \lambda (\nabla \cdot v) I + \mu (\nabla v + \nabla v^T) \]

2. Derivatives are approximated in both time and space (only x dimension shown).

3. Equations are expanded to finite-difference stencils (this is one of 15 stencils for AWP-ODC staggered-grid formulation).

4. Stencils are coded and tuned for HPC clusters (our focus).

\[ \sigma_{xx}[i, j, k] = \sigma_{xx}[i, j, k] + \frac{\Delta t}{h} [2\mu + \lambda] v_x[i + 1, j, k] + \lambda v_x[i, j, k + 1] + \lambda v_x[i, j, k - 1] + \lambda D_x[i, j, k] \]

\[ \lambda = 8(\lambda[i, j, k] + \lambda[i - 1, j, k] + \lambda[i + 1, j, k] + \lambda[i, j - 1, k] + \lambda[i, j + 1, k] + \lambda[i, j, k + 1] + \lambda[i, j, k - 1] + \lambda[D_x[i, j, k]] \]

\[ \mu = 8(\mu[i, j, k] + \mu[i - 1, j, k] + \mu[i + 1, j, k] + \mu[i, j - 1, k] + \mu[i, j + 1, k] + \mu[i, j, k + 1] + \mu[i, j, k - 1] + \mu[D_x[i, j, k]] \]

\[ D_x = c_1(v_x[i + 1, j, k] - v_x[i, j, k]) + c_2(v_x[i + 2, j, k] - v_x[i, j, k]) + \ldots \]

\[ D_y = c_1(v_y[i, j + 1, k] - v_y[i, j, k]) + c_2(v_y[i, j + 2, k] - v_y[i, j, k]) + \ldots \]

\[ D_z = c_1(v_z[i, j, k + 1] - v_z[i, j, k]) + c_2(v_z[i, j, k + 2] - v_z[i, j, k]) \]
YASK High-Level Flow with Devito

PDE(s) and data

Devito (Python)

PDE(s) translated to stencil AST

Stencil-compiler Python module

Optimized stencil calculation and prefetch code

Loop compiler

Nested loops with OpenMP, prefetch code, etc.

Other C++ code

Stencil-kernel Python module

Intel C++ compiler

Data to be stored in vector-folding format

Results

Sept., 2017

Intel Corporation
Future Work and Opportunities

Additional workloads (skills: HPC workload familiarity)
- Provide more real-world example stencils in YASK distribution
- Adapt more applications to use YASK

Add useful interfaces (skills: C++)
- Add file and/or network I/O to read/write data
- Integrate with data-visualization package

Improve cluster performance (skills: OpenMP, MPI, C++)
- Reorganize work to allow MPI communication to occur in parallel with computation
- Enable temporal wave-front computation on multiple MPI ranks

Improve open-source package (skills: SW eng.)
- Design and document C++ and Python APIs
- Write better documentation
- Add automated testing and performance-evaluation processes

Improve stencil robustness (skills: compiler internals, polyhedral model)
- Enable more complex stencil expressions to be efficiently compiled
- Evaluate expression dependencies correctly
- Stretch: replace compilation with JIT binary or LLVM IL code-generator
Resources

Software available

- YASK: https://github.com/01org/yask (MIT OS license) with several example stencils
- ICL's Devito: https://github.com/opesci/devito (MIT OS license) with tutorials and examples
- UCSD's AWP-ODC-OS: https://github.com/HPGeoC/awp-odc-os (BSD OS license)

Related collateral

- Devito project: http://www.opesci.org/devito-public
- High Performance GeoComputing Lab (AWP): http://hpgeoc.sdsc.edu
- Email chuck.yount@intel.com for copies of other conference papers and presentations
Summary

HPC Stencil Code
- Use numerical methods to approximate solutions to differential equations

Intel® Xeon Phi™ Processors (Knights Landing)
- New AVX-512 instruction set architecture
- Up to 72 cores of 4 hyper-threads each
- High-bandwidth on-package MCDRAM

Intel® Xeon® Scalable Processors (Skylake)
- Brings AVX-512 to Xeon product line
- Up to 28 cores per socket; 2 and 4-socket platforms available

Tuning Stencil Code for the Xeon Phi with YASK
- Framework for rapid prototyping and tuning
- Turnkey solution for application of multiple complex performance features
Y*A*S*K
Yet Another Stencil Kernel
Iso3DFD Stencil Example

\[ p(t + 1, i, j, k) \leftarrow 2p(t, i, j, k) - p(t - 1, i, j, k) + \]
\[ v(i, j, k) \left( c_0 p(t, i, j, k) + \]
\[ \sum_{r=1}^{8} c_r \left[ p(t, i - r, j, k) + p(t, i + r, j, k) + \]
\[ p(t, i, j - r, k) + p(t, i, j + r, k) + \]
\[ p(t, i, j, k - r) + p(t, i, j, k + r) \right] \]

- 51-point stencil
- 16\textsuperscript{th} order accurate in space, 2\textsuperscript{nd} order accurate in time
- 61 FP ops
#include "StencilBase.hpp"
class Iso3dfdStencil : public StencilRadiusBase {
protected:
    Grid pressure; // time-varying 3D pressure grid.
    Grid vel;     // constant 3D vel grid.
    Param coeff; // stencil coefficients.
public:
    Iso3dfdStencil(StencilList& stencils, int radius=8) :
        StencilRadiusBase("iso3dfd", stencils, radius) {
            INIT_GRID_4D(pressure, t, x, y, z);
            INIT_GRID_3D(vel, x, y, z);
            INIT_PARAM_1D(coeff, r, radius + 1);
        }

    virtual void define(const IntTuple& offsets) {
        GET_OFFSET(t); GET_OFFSET(x); GET_OFFSET(y); GET_OFFSET(z);
        GridValue np = pressure(t, x, y, z) * coeff(0);
        for (int r = 1; r <= _radius; r++) {
            np += coeff(r) *
                (pressure(t, x-r, y, z) + pressure(t, x+r, y, z) +
                 pressure(t, x, y-r, z) + pressure(t, x, y+r, z) +
                 pressure(t, x, y, z-r) + pressure(t, x, y, z+r));
        }
        np = (2.0 * pressure(t, x, y, z))
            - pressure(t-1, x, y, z) // subtract pressure at t-1.
            + (np * vel(x, y, z)); // add velocity term.
        pressure(t+1, x, y, z) IS_EQUIV_TO v;
    }
};
YASK Auto-tuner Applied to Iso3dfd

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. Intel measurements as of Oct., 2016 on Intel® Xeon Phi™ processor 7250 with 16 GiB MCDRAM, 96 GiB DDR4. See complete configuration details on “Configuration” slide.
YASK Optimizations Applied to Iso3DFD

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. Intel measurements as of Oct., 2016 on Intel® Xeon Phi™ processor 7250 with 16 GiB MCDRAM, 96 GiB DDR4. See complete configuration details on "Configuration" slide.
AWP-ODC Numerics

**Finite Difference code**
- Staggered-grid scheme
- Fourth-order accurate in space and second-order accurate in time

**Fifteen grids updated in every time-step**
- 3 velocity grids
- 6 stress grids
- 6 grids for auxiliary memory-variables required for accurate high-frequency simulation

**Fifteen stencils**
- Nine 13-point stencils
- Six 9-point stencils

**Free-surface boundary computation every time-step**
Today’s Code Investment Carries Forward

To carry forward most key code modernizations

**RECOMPILE**
Parallelization, threading, vectorization, cache-blocking, MPI+OpenMP hybridization...

For additional gains

**TUNE**
Exploit NEW features...

---

Knights Landing Enabled
Performance Libraries & Runtimes

MKL  MPI
OpenMP  TBB

Knights Landing Enabled Compilers

Intel® AVX-512

Cache Mode For
High Bandwidth Memory

KNL Enhancements
(memory, architecture, bandwidth, etc.)

Sept., 2017
Thread Scaling for Stencils

Algorithm characteristics

- Threading stencils is often straight-forward within a single grid and time-step
  - Many stencils update elements in one grid at a given time-step based only on elements in other grids or the same grid at previous time-steps
  - Some updates across multiple grids may be independent within a time-step
  - In these cases, all elements can be updated in parallel trivially
- Threading across multiple dependent grids and time-steps is more challenging
  - Requires more complex synchronization to observe dependencies

Techniques

- Example techniques to implement dependent threading include temporal wave-fronts and diamond tiling
- Threading software
  - Older code tends to use multiple single-threaded MPI tasks even on a single node
  - Often does not scale well to many threads available on KNL socket (up to 288)
  - More modern code uses OpenMP or MPI+OpenMP or MPI w/shared memory on a node
  - More advanced threading may include task scheduling to avoid global synchronization
Vector Scaling for Stencils

Algorithm characteristics

- The nature of “stencils” is application of a fixed pattern to multiple elements
- Elements within a grid are usually independent as discussed earlier
- Thus, SIMD vectorization can be applied in a straight-forward manner

Techniques

- Straight-forward vectorization along one dimension often results in many cache-line reads, many unused elements, and low reuse between vectors
- “Vector-folding” is a technique to vectorize in two or more dimensions, increasing reuse and thus decreasing memory-bandwidth requirements
  - Speed-ups of >1.5x have been observed in several real-world stencils
  - See HPCC’15 paper “Vector Folding: improving stencil performance via multi-dimensional SIMD-vector representation”
Cache Blocking for Stencils

Algorithm characteristics
- Stencil codes are very often memory-bound
- Stencil equations typically input multiple neighboring values (increasing with accuracy)
- These factors make cache-blocking critical for high performance

Techniques
- Most cache-blocking is implemented via simple loop-tiling with each OpenMP thread working on separate tiles
- Advanced techniques leverage sharing of KNL caches between threads
  - Each L1 data cache is shared by 4 hyper-threads in a core
  - Each L2 cache is shared by 2 cores in a tile
  - Tiles can be sub-divided into slabs or similar partitions, and threads that share these caches can cooperate within them, increasing reuse and decreasing evictions
- To leverage the MCDRAM cache shared by all cores, an addition level of tiling may be used
  - See PMBS’16 paper “Effective Use of Large High-Bandwidth Memory Caches in HPC Stencil Computation via Temporal Wave-Front Tiling”
- In addition, prefetching data into L1 and/or L2 cache may improve performance
Fabric Scaling for Stencils

Algorithm characteristics

- As with threading and SIMD, independence of solutions within a time-step facilitate partitioning across nodes
- Access to multiple neighboring values that are common across partitions requires synchronizing data

Techniques

- Use of “halo” or “ghost” regions is the most common solution to reduce communications within a time-step as shared data is accessed
  - Halos must be exchanged between nodes to keep data consistent
  - Application of tiling across time-steps requires more sophisticated exchanges, usually consisting of exchanging more data but less often
- MPI is the most common software used, but other alternatives are in the works
- Global synchronization can cause under-utilization of nodes on large clusters and/or on clusters with nodes of heterogeneous performance
Data Layout for Stencils

Algorithm characteristics
- Many problems consist of multi-dimensional domains across multiple grids, which could be implemented naïvely with a multi-dimensional array-of-structures (AoS)
- Access to many neighboring elements and/or grids may cause translation look-aside buffer (TLB) pressure when multiple pages are accesses

Techniques
- Structure-of-arrays layout (SoA) is typical for multi-grid problems to enable vectorization
- Options to reduce TLB pressure
  - Using huge pages, e.g., via transparent huge pages (THP) in Linux
  - Reordering index nesting in grids, e.g., TXYZ → XYTZ
  - Tiling the layout itself
- To benefit from vector-folding discussed earlier, a data-layout transformation to a grid of folded vectors is essential
Experimental Configurations

Configuration details: YASK HPC Stencils, iso3DFD Kernel

Intel® Xeon Phi™ processor 7250: Intel® Xeon Phi™ processor 7250, 68 cores, 272 threads, 1400 MHz core freq. (Turbo On), MCDRAM 16 GiB, DDR4 96GiB 2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat* Enterprise Linux Server release 6.7

Configuration details: YASK HPC Stencils, AWP-ODC Kernel

Intel® Xeon® processor E5-2680 v3: Single Socket Intel® Xeon® processor E5-2680 v3, 2.5 GHz (Turbo Off), 12 Cores, 12 Threads (HT off), DDR4 128GiB, CentOS* 6.7

Intel® Xeon Phi™ processor 7210: Intel® Xeon Phi™ processor 7210, 64 cores, 256 threads, 1300 MHz core freq. (Turbo On), MCDRAM 16 GiB, DDR4 96GiB 2133 MHz, quad cluster mode, MCDRAM flat memory mode, CentOS* 7.2

Intel® Xeon Phi™ processor 7250: Intel® Xeon Phi™ processor 7250, 68 cores, 272 threads, 1400 MHz core freq. (Turbo On), MCDRAM 16 GiB, DDR4 96GiB 2400 MHz, quad cluster mode, MCDRAM flat memory mode, CentOS* 7.2

NVIDIA Tesla* K20X (Kepler): Part number 900-22081-0030-000, 1x GK110 CPU, 2688 cores, 732 MHz core freq, 6GiB 2.6GHz GDDR5

NVIDIA M40 (Maxwell): Part number TCSM40M-PB, 3072 cores, 948 MHz base freq, 12 GiB GDDR5

NVIDIA Titan X (Pascal): 3072 cores, 1000 MHz base freq, 12 GiB GDDR5

*Other names and brands may be claimed as the property of others
Example Stencil-Compiler Feature: Automatic Prefetch Generation

This example stencil reads from 7 cache lines (after vectorization):

The stencil compiler generates the following prefetch functions:

- Full prefetch function loads all 7 cache lines
- X-direction prefetch function loads only these 3 leading cache lines
- Y-direction prefetch function loads only these 5 leading cache lines
Worldwide Training

Intel® Modern Code

FREE...Worldwide Training and Teaching Resources

Intel® Parallel Computing Centers

Colfax training with access to a 36-node cluster

Parallel Programming Reference Books

Commercial ISVs Embracing Intel® Xeon Phi™ processor Family
IXPUG Community Forum

The Intel® Xeon Phi™ User's Group (IXPUG) is an independent global users group whose mission is to provide a forum for the free exchange of information that enhances the usability and efficiency of scientific and technical applications running on large High Performance Computing (HPC) systems using the Intel® Xeon Phi™ processor. IXPUG is administered by representatives of member sites that operate large Phi-based HPC systems.

- **IXPUG Monthly Tuning Meetings**: conference calls to inform the Intel® Xeon Phi™ processor community of relevant updates and share techniques, results, and methodologies.
Bio

Chuck received his PhD degree in ECE from Carnegie Mellon University in Pittsburgh, Pennsylvania, USA. He is currently a Principal Engineer in the Software and Services Group at Intel Corporation. His work includes developing analysis and optimization techniques for HPC applications on many-core products including the YASK open-source software framework for stencil-code optimization.
Abstract

Stencil computation is an important class of algorithms used in a large variety of scientific-simulation applications, especially those arising from finite-difference solutions of differential equations representing the behavior of physical phenomenon such as heat dispersion or seismic activity. This talk provides a brief review of stencil computation and Intel® Xeon® and Xeon Phi™ processors, and it describes the YASK (Yet Another Stencil Kernel) framework that simplifies the tasks of defining stencil functions, generating high-performance code targeted for various Intel platforms, and running tuning experiments. A couple of example YASK features are explained, performance results are given, and future work is described.