# P&S Heterogeneous Systems

Parallel Patterns: Convolution

Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zürich Fall 2021

25 November 2021

### Parallel Patterns

### Reduction Operation

- A reduction operation reduces a set of values to a single value
  - Sum, Product, Minimum, Maximum are examples
- Properties of reduction
  - Associativity
  - Commutativity
  - Identity value
- Reduction is a key primitive for parallel computing
  - E.g., MapReduce programming model

### Divergence-Free Mapping (I)

All active threads belong to the same warp



### Divergence-Free Mapping (II)

#### Program with high SIMD utilization

```
__shared__ float partialSum[]
unsigned int t = threadIdx.x;

for(int stride = blockDim.x; stride > 0; stride >> 1){
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}</pre>
```

# Warp utilization is maximized



### Histogram Computation

- Histogram is a frequently used computation for reducing the dimensionality and extracting notable features and patterns from large data sets
  - Feature extraction for object recognition in images
  - Fraud detection in credit card transactions
  - Correlating heavenly object movements in astrophysics
  - **...**
- Basic histograms for each element in the data set, use the value to identify a "bin" to increment
  - Divide possible input value range into "bins"
  - Associate a counter to each bin
  - For each input element, examine its value and determine the bin it falls into and increment the counter for that bin

### Parallel Histogram Computation: Iteration 2

- All threads move to the next section of the input
  - Each thread moves to element threadID + #threads



### Histogram Privatization

- Privatization: Per-block sub-histograms in shared memory
  - Threads use atomic operations in shared memory





## Convolution

### Convolution Applications

- Convolution is a widely-used operation in signal processing, image processing, video processing, and computer vision
- Convolution applies a filter or mask or kernel\* on each element of the input (e.g., a signal, an image, a frame) to obtain a new value, which is a weighted sum of a set of neighboring input elements
  - Smoothing, sharpening, or blurring an image
  - Finding edges in an image
  - Removing noise, etc.
- Applications in machine learning and artificial intelligence
  - Convolutional Neural Networks (CNN or ConvNets)

### 1D Convolution Example

- Commonly used for audio processing
- Mask size is usually an odd number of elements for symmetry (5 in this example)
- Calculation of P[2]:



### 1D Convolution Example: Next Element

Calculation of P[3]



### 1D Convolution Boundary Condition

- Calculation of output elements near the boundaries (beginning and end) of the input array need to deal with "ghost" elements
  - Different policies (0, replicates of boundary values, etc.)



Slide credit: Hwu & Kirk

#### 1D Convolution Kernel

#### with Boundary Condition Handling\*

```
global void convolution 1D basic kernel(float *N, float *M, float *P,
                                                 int Mask Width, int Width) {
int i = blockIdx.x * blockDim.x + threadIdx.x; // Index of output element
float Pvalue = 0;
int N start point = i - (Mask Width/2); // Index of first neighbor
for(int j = 0; j < Mask Width; <math>j++) {
  // Check the boundaries
  if(N start point + j >= 0 && N start point + j < Width) {</pre>
    Pvalue += N[N start point + j] * M[j]; // Multiply and accumulate
                                                N N[0] N[1] N[2] N[3] N[4] N[5] N[6]
                                                                         P P[0] P[1] P[2] P[3] P[4] P[5] P[6]
                                                            5 6
P[i] = Pvalue; // Store output element
                                                                               Sum
                                             M M[0] M[1] M[2] M[3] M[4]
                                                                  Partial products
```

### Recall: Memory Hierarchy in CUDA Programs



### Memory in the GPU Architecture



Slide credit: Izzat El Hajj

### Storing the Mask in Constant Memory

- We can store the mask in constant memory
  - The mask is small
  - It is constant
  - It is accessed by all threads
- Constant memory is cached inside each GPU core and it is particularly fast when all threads of a warp access the same value
- 1. Declare the mask as a global variable

```
#define MASK_WIDTH 5
__constant__ float M[MASK_WIDTH];
```

2. Initialize the mask from the host

#### 1D Convolution Kernel

#### with Boundary Condition Handling and Constant Cache for M

```
global void convolution 1D basic kernel(float *N, float *P,
                                                int Mask Width, int Width) {
 int i = blockIdx.x * blockDim.x + threadIdx.x; // Index of output element
 float Pvalue = 0;
 int N start point = i - (Mask Width/2); // Index of first neighbor
 for (int j = 0; j < Mask Width; <math>j++) {
   // Check the boundaries
    if (N start point + j >= 0 && N start point + j < Width) {
      Pvalue += N[N start point + j] * M[j]; // Multiply and accumulate
                                                N N[0] N[1] N[2] N[3] N[4] N[5] N[6]
                                                                        P P[0] P[1] P[2] P[3] P[4] P[5] P[6]
                                                            5
 P[i] = Pvalue; // Store output element
Use it with caution! Constant
                                                                             Sum
                                             M M[0] M[1] M[2] M[3] M[4]
cache is small (e.g., 64 KB)
                                                                Partial products
```

### Recall: Data Reuse: Tiling

 To take advantage of data reuse, we divide the input into tiles that can be loaded into shared memory



```
__shared__ int l_data[(L_SIZE+2)*(L_SIZE+2)];
...
Load tile into shared memory
__syncthreads();
for (int i = 0; i < 3; i++){
   for (int j = 0; j < 3; j++){
      sum += gauss[i][j] * l_data[(i+l_row-1)*(L_SIZE+2)+j+l_col-1];
   }
}</pre>
```

### Tiled 1D Convolution Basic Idea



Slide credit: Hwu & Kirk

### Loading the Left Halo



Slide credit: Hwu & Kirk

### Loading the Internal Elements



```
N_ds[n + threadIdx.x] = N[blockIdx.x*blockDim.x + threadIdx.x];
```

Slide credit: Hwu & Kirk

### Loading the Right Halo



Slide credit: Hwu & Kirk

### Tiled 1D Convolution Kernel

```
global void convolution 1D basic kernel(float *N, float *P, int Mask Width, int Width) {
 int i = blockIdx.x*blockDim.x + threadIdx.x; // Index of output element
 shared float N ds[TILE SIZE + MAX MASK WIDTH - 1]; // Shared memory tile
 int n = Mask Width/2; // Halo width
 // Load left halo
 int halo index left = (blockIdx.x - 1)*blockDim.x + threadIdx.x;
 if(threadIdx.x >= blockDim.x - n) {
   N ds[threadIdx.x - (blockDim.x - n)] = (halo index left < 0) ? 0 : N[halo index left];</pre>
  }
 // Load internal elements
 N ds[n + threadIdx.x] = N[blockIdx.x*blockDim.x + threadIdx.x];
 // Load right halo
 int halo index right = (blockIdx.x + 1)*blockDim.x + threadIdx.x;
 if(threadIdx.x < n) {</pre>
   N ds[n + blockDim.x + threadIdx.x] = (halo index right >= Width) ? 0 : N[halo index right];
  }
 syncthreads(); // Intra-block synchronization
 float Pvalue = 0;
 for(int j = 0; j < Mask Width; j++) {</pre>
   Pvalue += N ds[threadIdx.x + j] * M[j]; // Convolution result
 P[i] = Pvalue;
```

Slide credit: Hwu & Kirk

### 2D Convolution (I)

- The mask is 2D
  - For example, a Gaussian filter



Gaussian mask



### 2D Convolution (II)

Same memory locations accessed by neighboring threads



```
for (int i = 0; i < 3; i++){
    for (int j = 0; j < 3; j++){
        sum += gauss[i][j] * Image[(i+row-1)*width + (j+col-1)];
    }
}</pre>
```

### Tiling in 2D Convolution

 To take advantage of data reuse, we divide the input into tiles that can be loaded into shared memory



```
__shared__ int l_data[(L_SIZE+2)*(L_SIZE+2)];
...
Load tile into shared memory
__syncthreads();
for (int i = 0; i < 3; i++){
   for (int j = 0; j < 3; j++){
      sum += gauss[i][j] * l_data[(i+l_row-1)*(L_SIZE+2)+j+l_col-1];
   }
}</pre>
```

### Loading Tiles into Shared Memory



**Challenge:** Input and output tiles have different dimensions (input tile dimension = output tile dimension +  $2 \times \text{mask radius}$ )

**Solution:** Launch enough threads per block to load the input tile to shared memory, then use a subset of them to compute and store the output tile

## Difference in Tile Sizes (I)



### Difference in Tile Sizes (II)



### Difference in Tile Sizes (III)



### 2D Convolution Examples

Original

Blur

Motion blur

Sharpen



### Canny Edge Detection

- Gaussian filtering
  - Smooth the image to remove noise
- Sobel filtering
  - Find the intensity gradients of the image
- Non-maximum suppression
  - Suppress spurious response to edge detection
- Hysteresis threshold
  - Strong edges + weak edges connected to a strong edge



### Convolutions are Stencils

### Stencil Computation

- Stencils are a class of algorithms where each output element is calculated from a set of neighboring input elements in a structured grid
  - Widely-used in high performance computing to solve partial differential equations (PDEs)



### Stencil Computation and Applications

- Stencil computations update values in a grid using a fixed pattern of grid points
- Stencils are used in ~30% of high-performance computing applications



7-point Jacobi in 3D plane







#### Heterogeneous System: CPU+FPGA



We evaluate two POWER9+FPGA systems:

**1. HBM-based board AD9H7**Xilinx Virtex Ultrascale+™ XCVU37P-2

2. DDR4-based board AD9V3
Xilinx Virtex Ultrascale+™ XCVU3P-2

#### **NERO** Application Framework

- NERO communicates to Host over CAPI2 (Coherent Accelerator Processor Interface)
- COSMO API handles offloading jobs to NERO
- SNAP (Storage, Network, and Analytics Programming) allows for seamless integration of the COSMO API



https://github.com/open-power/snap

### Accelerating Climate Modeling

 Gagandeep Singh, Dionysios Diamantopoulos, Christoph Hagleitner, Juan Gómez-Luna, Sander Stuijk, Onur Mutlu, and Henk Corporaal, "NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling"

Proceedings of the <u>30th International Conference on Field-Programmable Logic</u> <u>and Applications</u> (**FPL**), Gothenburg, Sweden, September 2020.

[Slides (pptx) (pdf)]

[<u>Lightning Talk Slides (pptx) (pdf)</u>]

[Talk Video (23 minutes)]

Nominated for the Stamatis Vassiliadis Memorial Award.

## NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling

Gagandeep Singh $^{a,b,c}$  Dionysios Diamantopoulos $^c$  Christoph Hagleitner $^c$  Juan Gómez-Luna $^b$  Sander Stuijk $^a$  Onur Mutlu $^b$  Henk Corporaal $^a$  Eindhoven University of Technology  $^b$ ETH Zürich  $^c$ IBM Research Europe, Zurich

## Convolutions in Machine Learning

#### Convolutions in Machine Learning

- Convolutions are traditionally used for feature detection in image processing
  - They can be used as neural network layers
- Convolutions have an advantage over fully connected layers (e.g., in multilayer perceptron, MLP)
  - Local weights: They compute on only a window around the element of interest
  - Data sharing via on-chip memories is feasible



#### Convolutional Neural Networks: Demo

#### Back to Yann's Home Publications

#### LeNet-5 Demos

#### Unusual Patterns

unusual styles weirdos

#### Invariance

translation (anim)
scale (anim)
rotation (anim)
squeezing (anim)
stroke width
(anim)

#### Noise Resistance

noisy 3 and 6 noisy 2 (anim) noisy 4 (anim)

#### Multiple Character

various stills
dancing 00 (anim)
dancing 384
(anim)

#### Complex cases (anim)

(anim) 35 -> 53 12 -> 4-> 21 23 -> 32 30 + noise 31-51-57-61

#### LeNet-5, convolutional neural networks

Convolutional Neural Networks are are a special kind of multi-layer neural networks. Like almost every other neural networks they are trained with a version of the back-propagation algorithm. Where they differ is in the

Convolutional Neural Networks are designed to recognize visual patterns directly from pixel images with minimal preprocessing.

They can recognize patterns with extreme variability (such as handwritten characters), and with robustness to distortions and simple geometric transformations.

LeNet-5 is our latest convolutional network designed for handwritten and machine-printed character recognition. Here is an example of LeNet-5 in action.



Many more examples are available in the column on the left:

Several papers on LeNet and convolutional networks are available on my publication page:

[LeCun et al., 1998]

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. *Proceedings of the IEEE*, november 1998. PS.QZ

[Bottou et al., 1997]

L. Bottou, Y. LeCun, and Y. Bengio. Global training of



http://yann.lecun.com/exdb/lenet/index.html

# LeNet-5, a Convolutional Neural Network for Hand-Written Digit Recognition

This is a 1024\*8 bit input, which will have a truth table of 28196 entries



43

#### An Example of 2D Convolution

#### Output feature map



Input feature map

#### Structure information

Input: 5\*5 (blue)

Kernel (filter): 3\*3 (grey)

Output: 5\*5 (green)

#### **Computation information**

Stride: 1

Padding: 1 (white)

Output Dim = (Input + 2\*Padding - Kernel) / Stride + 1

## An Example of 2D Convolution



#### Another Example of 2D Convolution



46

#### A Basic Conv. Layer Forward Kernel

(Code is Incomplete!)

```
global void ConvLayerForward Basic Kernel(int C, int W grid,
                            int K, float* X, float* W, float* Y){
  int m = blockIdx.x; // Index of the output feature map
  int h = blockIdx.y / W_grid + threadIdx.y;
  int w = blockIdx.y % W grid + threadIdx.x;
  float acc = 0.;
  for(int c = 0; c < C; c++) { // Sum over all input channels
     for(int p = 0; p < K; p++) // Loop over KxK filter</pre>
        for(int q = 0; q < K; q++)
           acc += X[c, h + p, w + q] * W[m, c, p, q];
  Y[m, h, w] = acc;
```

47

#### Power of Convolutions and Applied Courses

- In 2010, Prof. Andreas Moshovos adopted Professor Hwu's ECE498AL Programming Massively Parallel Processors Class
- Several of Prof. Geoffrey Hinton's graduate students took the course
- These students developed the GPU implementation of the Deep CNN that was trained with 1.2M images to win the ImageNet competition

Slide credit: Hwu & Kirk

### Example: AlexNet (2012)

- AlexNet wins the ImageNet classification competition with ~10% points higher accuracy than state-of-the-art
  - Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", NIPS 2012.

#### ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky
University of Toronto
kriz@cs.utoronto.ca

Ilya Sutskever
University of Toronto
ilya@cs.utoronto.ca

Geoffrey E. Hinton
University of Toronto
hinton@cs.utoronto.ca

#### Abstract

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

### Neural Network Layer Examples

| LeNet                                         | AlexNet                                                  |
|-----------------------------------------------|----------------------------------------------------------|
| Image: 28 (height) × 28 (width) × 1 (channel) | Image: 224 (height) × 224 (width) × 3 (channels)         |
| <u> </u>                                      | <u> </u>                                                 |
| Convolution with 5×5 kernel+2padding:28×28×6  | Convolution with 11 × 11 kernel + 4 stride: 54 × 54 × 96 |
| √sigmoid                                      | √ ReLu                                                   |
| Pool with 2×2 average kernel+2 stride:14×14×6 | Pool with 3×3 max. kernel+2 stride: 26×26×96             |
|                                               | V                                                        |
| Convolution with 5×5 kernel (no pad):10×10×16 | Convolution with 5×5 kernel+2 pad:26×26×256              |
| √sigmoid                                      | √ ReLu                                                   |
| Pool with 2×2 average kernel+2 stride: 5×5×16 | Pool with 3×3 max.kernel+2stride:12×12×256               |
| flatten                                       | $\bigvee$                                                |
| Dense: 120 fully connected neurons            | Convolution with 3×3 kernel+1 pad:12×12×384              |
| √ sigmoid                                     | √ ReLu                                                   |
| Dense: 84 fully connected neurons             | Convolution with 3×3 kernel+1 pad:12×12×384              |
| √ sigmoid                                     | √ ReLu                                                   |
| Dense: 10 fully connected neurons             | Convolution with 3×3 kernel+1 pad:12×12×256              |
| $\bigvee$                                     | √ ReLu                                                   |
| Output: 1 of 10 classes                       | Pool with 3×3 max.kernel+2stride:5×5×256                 |
|                                               | flatten                                                  |
|                                               | Dense: 4096 fully connected neurons                      |
|                                               | √ ReLu, dropout p=0.5                                    |
|                                               | Dense: 4096 fully connected neurons                      |
|                                               | √ ReLu, dropout p=0.5                                    |
|                                               | Dense: 1000 fully connected neurons                      |
|                                               | $\bigvee$                                                |
|                                               | Output: 1 of 1000 classes                                |

### Example: GoogLeNet (2014)

- Google improves accuracy by adding more network layers
  - From 8 in AlexNet to 22 in GoogLeNet
  - Szegedy et al., "Going Deeper with Convolutions", CVPR 2015.

#### **Going Deeper with Convolutions**

Christian Szegedy<sup>1</sup>, Wei Liu<sup>2</sup>, Yangqing Jia<sup>1</sup>, Pierre Sermanet<sup>1</sup>, Scott Reed<sup>3</sup>,
Dragomir Anguelov<sup>1</sup>, Dumitru Erhan<sup>1</sup>, Vincent Vanhoucke<sup>1</sup>, Andrew Rabinovich<sup>4</sup>

<sup>1</sup>Google Inc. <sup>2</sup>University of North Carolina, Chapel Hill

<sup>3</sup>University of Michigan, Ann Arbor <sup>4</sup>Magic Leap Inc.

1{szegedy, jiayq, sermanet, dragomir, dumitru, vanhoucke}@google.com

<sup>2</sup>wliu@cs.unc.edu, <sup>3</sup>reedscott@umich.edu, <sup>4</sup>arabinovich@magicleap.com

### Example: ResNet (2015)

He et al., "Deep Residual Learning for Image Recognition", CVPR 2016.

#### **Deep Residual Learning for Image Recognition**

Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun
Microsoft Research
{kahe, v-xiangz, v-shren, jiansun}@microsoft.com





#### Reducing Convolution Layers to Matrix Multiplications

- Convolution layers are the compute intensive parts of a CNN
- GPUs have extremely high-performance implementations of matrix multiplications
- Tiling techniques for matrix multiplication naturally reuse input features across output feature maps
- Converting convolutions in a convolution layer to a matrix multiplication helps to keep the level of parallelism stable across CNN layers

Slide credit: Hwu & Kirk

# Implementing a Convolutional Layer with Matrix Multiplication





Input Features X (unrolled)

### Tiled Matrix-Matrix Multiplication (I)



### Tiled Matrix-Matrix Multiplication (II)



$$C_{\text{tile}} = A_{\text{tile}} \times B_{\text{tile}}$$





Step 2: Each thread computes its partial sum from the tiles in shared memory (threads wait for each other to finish)

#### Deep Learning Matrix Multiplication Hierarchical Decomposition



## Joint Register and Shared Memory Tiling

- Store input M tile and output P tile elements in registers
- Store input N tile elements in shared memory
- Decouple of M and N input tile widths
  - TILE\_WIDTH\_M , TILE\_WIDTH\_N
- Key quantities
  - Number of threads = TILE\_WIDTH\_M
  - Output tile size = TILE\_WIDTH\_M \* TILE\_WIDTH\_N
  - Reuses for each N element = TILE\_WIDTH\_M
  - Reuses for each M element = TILE\_WIDTH\_N
  - Each thread calculatesTILE WIDTH N P elements

#### Example:

$$TILE_WIDTH_M = 4$$
  
 $TILE_WIDTH_N = 2$ 







58

#### NVIDIA A100 Core



- 19.5 TFLOPS Single Precision
- 9.7 TFLOPS Double Precision
- 312 TFLOPS for Deep Learning (Tensor cores)



#### Tensor Core Microarchitecture (Volta)

- Each warp utilizes two tensor cores
- Each tensor core contains two "octets"
  - 16 SIMD units per tensor core (8 per octet)
  - 4x4 matrix-multiply and accumulate each cycle per tensor core



<sup>\*</sup> M. A. Raihan, N. Goli and T. M. Aamodt, "Modeling Deep Learning Accelerator Enabled GPUs," ISPASS 2019.

#### Tensor Core Operations

Load/Store from TensorCore

Fixed size matrix multiplication: usually 4x4, 16x16, or 64x64



16x32

 NVIDIA supports nonsquare matrix dimensions

# TensorCore WMMA (WARP Matrix Multiply Accumulate) API

```
wmma::fragment< Use, M, N, K, Type, Layout>;
wmma::fill fragment(fragment<...> &, const Type &);
wmma::load matrix sync(fragment<...> &, const Type *, unsigned,
Layout);
wmma::store matrix sync( Type *, const fragment<...> &, unsigned,
Layout);
wmma::mma sync(fragment<...> &, const fragment<...> &, const
fragment<...> &,
               const fragment<...> &, bool);
```

Fairly simple low level API that operates on memory fragments (internally registers)

#### Matrix Fragments

A memory fragment (internally a set of registers)

# Matrix Fragment Initialization with Constant Values

Fills the fragment with some constant value C

## Loading Matrix Fragment from Memory

Loads data from global or shared memory with specified stride into the fragment.

If the fragment is **matrix\_a** or **matrix\_b**, the layout is inferred from the fragment type.

In the example, we load a tile of a matrix (at &a[aRow+aCol\*lda]) into a\_frag with lda as the stride.

## Storing Matrix Fragment to Memory

Stores the fragment into global or shared memory with specified stride and layout

### GEMM of Matrix Fragments

Stores in **out** the result of computing **a x b** + **c**If **satf** is true, the following saturation happens:

- Infinite saturates to MAX\_NORM
- Minus infinite saturates to –MAX\_NORM
- NaN is transformed to 0

# GEMM Kernel Using MMA in Half Precision (1 of 3)

```
#include <mma.h>
using namespace nvcuda::wmma;
 _global__ void dot_wmma(half * a, half * b, half * c,
                         int M, int N, int K, float alpha, float beta) ) {
   int lda = M; int ldb = K; int ldc = M;
   int warp m = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
   int warp n = (blockIdx.y * blockDim.y + threadIdx.y);
   fragment<matrix_a, 16, 16, 16, half, col_major> a_frag;
   fragment<matrix b, 16, 16, 16, half, row major> b frag;
   fragment<accumulator, 16, 16, 16, half> acc frag;
   fragment<accumulator, 16, 16, 16, half> c_frag;
   fill fragment(acc frag, 0.0f);
```

# GEMM Kernel Using MMA in Half Precision (2 of 3)

```
for (int i = 0; i < K; i += 16) {
   int a row = warp m * 16; int a col = i;
   int b row = i; int b col = warp n * 16;
   if (a_row < M && a_col < K && b_row < K && b_col < N) {
      /* Load fragments in the tile */
      load_matrix_sync(a_frag, a + a_row + a col * lda, lda);
      load matrix sync(b frag, b + b row + b col * ldb, ldb);
      /* Compute partial result and accumulate */
      mma_sync(acc_frag, a_frag, b_frag, acc_frag);
```

# GEMM Kernel Using MMA in Half Precision (3 of 3)

```
int c row = warp m * 16; int c col = warp n * 16;
if (c row < M && c col < N) {
  /* Load output fragment from memory */
   load_matrix_sync(c_frag, c + c_row + c_ol * ldc, ldc, mem_col_major);
   /* Update fragment with computed result */
  for(int i = 0; i < c frag.num elements; i++) {</pre>
      c_frag.x[i] = alpha * acc_frag.x[i] + beta * c_frag.x[i];
   /* Store output fragment */
   store_matrix_sync(c + c_row + c_col * ldc, c_frag, ldc, mem_col_major);
```

## Systolic Arrays

#### Google TPU Generation I (~2016)



**Figure 3.** TPU Printed Circuit Board. It can be inserted in the slot for an SATA disk in a server, but the card uses PCIe Gen3 x16.



**Figure 4.** Systolic data flow of the Matrix Multiply Unit. Software has the illusion that each 256B input is read at once, and they instantly update one location of each of 256 accumulator RAMs.

Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA 2017.

### Google TPU Generation II (2017)



https://www.nextplatform.com/2017/05/17/first-depth-look-googles-new-second-generation-tpu/

4 TPU chips vs 1 chip in TPU1

High Bandwidth Memory vs DDR3

Floating point operations vs FP16

45 TFLOPS per chip vs 23 TOPS

Designed for training and inference vs only inference

#### Google TPU Generation III (2019)



TPU v2 - 4 chips, 2 cores per chip

TPU v3 - 4 chips, 2 cores per chip

32GB HBM per chip vs 16GB HBM in TPU2 4 Matrix Units per chip 90 TFLOPS per chip vs 2 Matrix Units in TPU2

vs 45 TFLOPS in TPU2

#### Google TPU Generation IV (2019)



#### New ML applications (vs. TPU3):

- Computer vision
- Natural Language Processing (NLP)
- Recommender system
- Reinforcement learning that plays Go

250 TFLOPS per chip in 2021 vs 90 TFLOPS in TPU3



1 ExaFLOPS per board

#### An Example Modern Systolic Array: TPU (II)

As reading a large SRAM uses much more power than arithmetic, the matrix unit uses systolic execution to save energy by reducing reads and writes of the Unified Buffer [Kun80][Ram91][Ovt15b]. Figure 4 shows that data flows in from the left, and the weights are loaded from the top. A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront. The weights are preloaded, and take effect with the advancing wave alongside the first data of a new block. Control and data are pipelined to give the illusion that the 256 inputs are read at once, and that they instantly update one location of each of 256 accumulators. From a correctness perspective, software is unaware of the systolic nature of the matrix unit, but for performance, it does worry about the latency of the unit.



Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA 2017.

### Example 2D Systolic Array Computation

- Multiply two 3x3 matrices (inputs)
  - Keep the final result in PE accumulators

$$\begin{bmatrix} c_{00} & c_{01} & c_{02} \\ c_{10} & c_{11} & c_{12} \\ c_{20} & c_{21} & c_{22} \end{bmatrix} = \begin{bmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \\ a_{20} & a_{21} & a_{22} \end{bmatrix} \times \begin{bmatrix} b_{00} & b_{01} & b_{02} \\ b_{10} & b_{11} & b_{12} \\ b_{20} & b_{21} & b_{22} \end{bmatrix}$$



Figure 1: A systolic array processing element

$$P = M$$
 $Q = N$ 
 $R = R + M*N$ 



#### An Example Modern Systolic Array: TPU (III)



**Figure 1.** TPU Block Diagram. The main computation part is the yellow Matrix Multiply unit in the upper right hand corner. Its inputs are the blue Weight FIFO and the blue Unified Buffer (UB) and its output is the blue Accumulators (Acc). The yellow Activation Unit performs the nonlinear functions on the Acc, which go to the UB.

#### Lecture on Systolic Arrays



#### Recommended Readings

- Hwu and Kirk, "Programming Massively Parallel Processors,"
   Third Edition, 2017
  - Chapter 7 Parallel patterns —
     convolution: An introduction to
     stencil computation
  - Chapter 16 Application case study machine learning



## P&S Heterogeneous Systems

Parallel Patterns: Convolution

Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zürich Fall 2021

25 November 2021