**P&S Heterogeneous Systems** GPU Memory Hierarchy

> Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zürich Fall 2022 24 October 2022

# GPU Programming

#### Recommended Readings (I)

Hwu and Kirk, "Programming Massively Parallel Processors," Third Edition, 2017



### Recommended Readings (II)

Hwu, Kirk, El Hajj, "Programming Massively Parallel Processors," Fourth Edition, 2022



#### Recommended Readings (III)

#### CUDA Programming Guide

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

|                                       | CUDA TOOLKIT DOCUMENTATION                                                                                                                                                                                                                                                                                                                                                   |  |  |  |  |
|---------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|
| CUDA Toolkit v11.6.2                  | Programming Guide (PDF) - v11.6.2 (older) - Last updated March 24, 2022 - Send Feedbac                                                                                                                                                                                                                                                                                       |  |  |  |  |
| Programming Guide                     |                                                                                                                                                                                                                                                                                                                                                                              |  |  |  |  |
| ▷ 1. Introduction                     | CUDA C++ Programming Guide                                                                                                                                                                                                                                                                                                                                                   |  |  |  |  |
| 2. Programming Model                  | The programming guide to the CUDA model and interface.                                                                                                                                                                                                                                                                                                                       |  |  |  |  |
| 3. Programming Interface              |                                                                                                                                                                                                                                                                                                                                                                              |  |  |  |  |
| ▷ 4. Hardware Implementation          | Changes from Version 11.3                                                                                                                                                                                                                                                                                                                                                    |  |  |  |  |
| 5. Performance Guidelines             |                                                                                                                                                                                                                                                                                                                                                                              |  |  |  |  |
| A. CUDA-Enabled GPUs                  | Added <u>Graph Memory Nodes</u> .                                                                                                                                                                                                                                                                                                                                            |  |  |  |  |
| ▷ B. C++ Language Extensions          | Formalized <u>Asynchronous SIMT Programming Model</u> .                                                                                                                                                                                                                                                                                                                      |  |  |  |  |
| C. Cooperative Groups                 | 1. Introduction                                                                                                                                                                                                                                                                                                                                                              |  |  |  |  |
| D. CUDA Dynamic Parallelism           |                                                                                                                                                                                                                                                                                                                                                                              |  |  |  |  |
| E. Virtual Memory Management          | 1.1. The Benefits of Using GPUs                                                                                                                                                                                                                                                                                                                                              |  |  |  |  |
| F. Stream Ordered Memory<br>Allocator | The Graphics Processing Unit (GPU) <sup>1</sup> provides much higher instruction throughput and memory bandwidth than the CPU within a similar price and power envelope. Many applic leverage these higher capabilities to run faster on the GPU than on the CPU (see <u>GPU Applications</u> ). Other computing devices, like FPGAs, are also very energy efficient, but of |  |  |  |  |
| C. Graph Memory Nodes                 | much less programming flexibility than GPUs.                                                                                                                                                                                                                                                                                                                                 |  |  |  |  |
| ▷ H. Mathematical Functions           | This difference in capabilities between the GPU and the CPU exists because they are designed with different goals in mind. While the CPU is designed to excel at executing a                                                                                                                                                                                                 |  |  |  |  |
| ▷I. C++ Language Support              | sequence of operations, called a <i>thread</i> , as fast as possible and can execute a few tens of these threads in parallel, the GPU is designed to excel at executing thousands                                                                                                                                                                                            |  |  |  |  |
| ▷ J. Texture Fetching                 | parallel (amortizing the slower single-thread performance to achieve greater throughput).                                                                                                                                                                                                                                                                                    |  |  |  |  |
| ▷ K. Compute Capabilities             | The GPU is specialized for highly parallel computations and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control.                                                                                                                                                                                          |  |  |  |  |
| ▷ L. Driver API                       | The schematic Figure 1 shows an example distribution of chip resources for a CPU versus a GPU.                                                                                                                                                                                                                                                                               |  |  |  |  |
|                                       |                                                                                                                                                                                                                                                                                                                                                                              |  |  |  |  |

# GPU Computing

- Computation is offloaded to the GPU
- Three steps
  - CPU-GPU data transfer (1)
  - □ GPU kernel execution (2)
  - GPU-CPU data transfer (3)



# Traditional Program Structure

- CPU threads and GPU kernels
  - Sequential or modestly parallel sections on CPU
  - Massively parallel sections on GPU



# Traditional Program Structure in CUDA

#### Function prototypes

float serialFunction(...);
 global void kernel(...);

- main()
  - 1) Allocate memory space on the device cudaMalloc(&d\_in, bytes);
  - 2) Transfer data from host to device cudaMemCpy(d\_in, h\_in, ...);
  - a 3) Execution configuration setup: #blocks and #threads
  - 4) Kernel call kernel << execution configuration >>> (args...);
  - 5) Transfer results from device to host cudaMemCpy(h\_out, d\_out, ...);
- Kernel \_\_global\_\_\_ void kernel(type args,...)
  - Automatic variables transparently assigned to registers
  - Shared memory: \_\_\_\_\_shared\_\_\_\_
  - Intra-block synchronization: \_\_\_\_\_syncthreads();

s need

-epeat

### CUDA Programming Language

#### Memory allocation

cudaMalloc((void\*\*)&d\_in, #bytes);

#### Memory copy

cudaMemcpy(d\_in, h\_in, #bytes, cudaMemcpyHostToDevice);

#### Kernel launch

kernel<<< #blocks, #threads >>>(args);

#### Memory deallocation cudaFree(d\_in);

• Explicit synchronization

cudaDeviceSynchronize();

### Vector Addition (I)

- Our first GPU programming example
- We assign one GPU thread to each element-wise addition



#### Vector Addition (II)

- The whole set of threads is called a grid
- We need a way to assign threads to GPU cores



#### Vector Addition (III)

We group threads into blocks



#### Host Code Example: Vector Addition

void vecadd(float\* A, float\* B, float\* C, int N) {

// Allocate GPU memory

float \*A\_d, \*B\_d, \*C\_d; cudaMalloc((void\*\*) &A\_d, N\*sizeof(float)); cudaMalloc((void\*\*) &B\_d, N\*sizeof(float)); cudaMalloc((void\*\*) &C\_d, N\*sizeof(float));

```
// Copy data to GPU memory
cudaMemcpy(A_d, A, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B, N*sizeof(float), cudaMemcpyHostToDevice);
```

```
// Perform computation on GPU
const unsigned int numThreadsPerBlock = 512;
const unsigned int numBlocks = N/numThreadsPerBlock;
```

```
vecadd_kernel<<<numBlocks, numThreadsPerBlock>>>(A_d, B_d, C_d, N);
// Copy data from GPU memory
cudaMemcpy(C, C_d, N*sizeof(float), cudaMemcpyDeviceToHost);
```

```
// Deallocate GPU memory
cudaFree(A_d);
cudaFree(B_d);
cudaFree(C_d);
```

}

#### Boundary Conditions

- What if the size of the input is not a multiple of the number of threads per block?
  - Solution: use the ceiling to launch extra threads then omit the threads after the boundary

const unsigned int numBlocks = (N +numThreadsPerBlock - 1)/numThreadsPerBlock;

```
Mernel code
__global__ void vecadd_kernel(float* A, float* B, float* C, int N) {
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if(i < N) {
        C[i] = A[i] + B[i];
    }
}</pre>
```

#### Indexing and Memory Access

#### Images are 2D data structures

- height x width
- □ Image[j][i], where  $0 \le j < \text{height}$ , and  $0 \le i < \text{width}$



#### Image Layout in Memory

- Row-major layout
- Image[j][i] = Image[j x width + i]



#### Indexing and Memory Access: 2D Grid

2D blocks

□ gridDim.x, gridDim.y



Image[3][1] = Image[3 \* 8 + 1]

#### NVIDIA H100: Thread Block Clusters

- GPUs grow beyond 100 GPU cores (SMs): a new level in the software hierarchy can improve execution efficiency
  - Programmatic control of locality at a granularity larger than a single thread block on a single SM
- Thread blocks in the same cluster can synchronize and exchange data
- Thread blocks in the same cluster are guaranteed to be concurrently scheduled
  - Thread blocks in the same cluster run on the SMs within a GPU Processing Cluster (GPC)



#### NVIDIA H100: Thread Block Clusters

- GPUs grow beyond 100 GPU cores (SMs): a new level in the software hierarchy can improve execution efficiency
  - Programmatic control of locality at a granularity larger than a single thread block on a single SM



### NVIDIA H100: Thread Block Clusters

- GPUs grow beyond 100 GPU cores (SMs): a new level in the software hierarchy can improve execution efficiency
  - Programmatic control of locality at a granularity larger than a single thread block on a single SM
- Thread blocks in the same cluster can synchronize and exchange data
- Thread blocks in the same cluster are guaranteed to be concurrently scheduled
  - Thread blocks in the same cluster run on the SMs within a GPU Processing Cluster (GPC)
  - Data sharing via SM-to-SM network in a GPC





Thread - Thread block - Thread block cluster - Grid

# GPU Memories

# NVIDIA H100 Block Diagram



https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/

144 cores on the full GH100 60MB L2 cache

#### NVIDIA H100 Core



https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/

\* Preliminary performance estimates

#### Memory in the GPU Architecture



# NVIDIA V100 & A100 Memory Hierarchy

 Example of data movement between GPU global memory (DRAM) and GPU cores.



A100 improves SM bandwidth efficiency with a new load-global-store-shared asynchronous copy instruction that bypasses L1 cache and register file (RF). Additionally, A100's more efficient Tensor Cores reduce shared memory (SMEM) loads.

A100 feature: Direct copy from L2 to scratchpad, bypassing L1 and register file.

#### NVIDIA H100 Tensor Memory Accelerator

Asynchronous memory copy with LDGSTS instruction vs. TMA



TMA unit reduces addressing overhead

A single thread per warp issues the TMA operation

Support for different tensor layouts (1D-5D)



#### NVIDIA H100 Distributed Shared Memory

- Shared memory virtual address space distributed across the blocks of a cluster
- Load, store, and atomic operations to other SM's shared memory



Thread block clusters and distributed shared memory (DSMEM) are leveraged via cooperative\_groups API

TMA unit supports copies across thread blocks in a cluster

Asynchronous transaction barriers

#### Memory in the H100 GPU Architecture



### CUDA Variable Type Qualifiers

| Variable dec   | Memory                      | Scope    | Lifetime |             |
|----------------|-----------------------------|----------|----------|-------------|
|                | <pre>int LocalVar;</pre>    | register | thread   | thread      |
|                | <pre>int localArr[N];</pre> | global   | thread   | thread      |
|                | <pre>int SharedVar;</pre>   | shared   | block    | block       |
| device         | <pre>int GlobalVar;</pre>   | global   | grid     | application |
| deviceconstant | int ConstantVar;            | constant | grid     | application |

device is optional when used with \_\_shared \_, Or \_\_constant \_\_

- Recall cudaMalloc(...) allocates memory from the host
  - Constant memory can also be allocated and initialized from the host
- Automatic variables without any qualifier reside in a register
  - Except arrays that reside in global memory

#### Memory Hierarchy in CUDA Programs



#### Data Reuse

Same memory locations accessed by neighboring threads



```
for (int i = 0; i < 3; i++){
    for (int j = 0; j < 3; j++){
        sum += gauss[i][j] * Image[(i+row-1)*width + (j+col-1)];
    }
}</pre>
```

# Data Reuse: Tiling

 To take advantage of data reuse, we divide the input into tiles that can be loaded into shared memory



```
__shared__ int l_data[(L_SIZE+2)*(L_SIZE+2)];
```

#### Synchronization Function

- void \_\_syncthreads();
- Synchronizes all threads in a block
- Once all threads in a block have reached this point, execution resumes normally
- Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory

# Tiling/Blocking in On-chip Memories

#### Tiling or Blocking

- Divide loops operating on arrays into computation chunks so that each chunk can hold its data in the cache (or other onchip memory, e.g., scratchpad)
- Avoids cache conflicts between different chunks of computation
- Essentially: Divide the working set so that each piece fits in the cache
- Let's first see an example for CPUs

#### Naïve Matrix Multiplication (I)

- Matrix multiplication: C = A x B
- Consider two input matrices A and B in row-major layout
  - A size is M x P
  - B size is P x N
  - C size is M x N



#### Naïve Matrix Multiplication (II)

#### Naïve implementation of matrix multiplication has poor cache locality



# Tiled Matrix Multiplication (I)

- We can achieve better cache locality by computing on smaller tiles or blocks that fit in the cache
  - Or in the scratchpad memory and register file if we compute on a GPU



Ν

Lam+, "The cache performance and optimizations of blocked algorithms," ASPLOS 1991. <a href="https://doi.org/10.1145/106972.106981">https://doi.org/10.1016/B978-0-12-803819-2.00011-2</a> Bansal+, "Chapter 15 - Fast Matrix Computations on Heterogeneous Streams," in "High Performance Parallelism Pearls", 2015. <a href="https://doi.org/10.1016/B978-0-12-803819-2.00011-2">https://doi.org/10.1016/B978-0-12-803819-2.00011-2</a> Kirk & Hwu, "Chapter 5 - Performance considerations," in "Programming Massively Parallel Processors (Third Edition)", 2017. <a href="https://doi.org/10.1016/B978-0-12-811986-0.00005-4">https://doi.org/10.1016/B978-0-12-803819-2.00011-2</a>

tile dim

k

Ρ

Α

dim

tile

# Tiled Matrix Multiplication (II)

 Tiled implementation operates on submatrices (tiles or blocks) that fit fast memories (cache, scratchpad, RF)



#### Lecture on Advanced Caches



#### DDCA - Lecture 24: Advanced Caches (Spring 2021) https://youtu.be/89Q7OdhmQ9o

#### Example: Matrix-Matrix Multiplication (I)



#### Example: Matrix-Matrix Multiplication (II)

Parallelization approach: assign one thread to each element in the output matrix (C)



#### Example: Matrix-Matrix Multiplication (III)

```
___global___ void mm_kernel(float* A, float* B, float* C, unsigned int N) {
```

```
unsigned int row = blockIdx.y*blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x*blockDim.x + threadIdx.x;
```

```
float sum = 0.0f;
for(unsigned int i = 0; i < N; ++i) {
    sum += A[row*N + i]*B[i*N + col];
}
C[row*N + col] = sum;
```



}

#### Reuse in Matrix-Matrix Multiplication (I)



Some of the threads in the same thread block use the same input data

#### Reuse in Matrix-Matrix Multiplication (II)



Some of the threads in the same thread block use the same input data

#### Reuse in Matrix-Matrix Multiplication (III)

- Sometimes, we are lucky:
  - The thread finds the data in the L1 cache because it was recently loaded by another thread
- Sometimes, we are not lucky:
  - The data gets evicted from the L1 cache before another thread tries to load it
- Solution:
  - Let the threads work together to load part of the data and ensure that all threads that need it use it before loading more data
  - Use shared memory to ensure data stays close
  - Optimizing called tiling because divides input to tiles

# Tiled Matrix-Matrix Multiplication (I)





# $C_{tile} = A_{tile} \times B_{tile}$





#### **Step 2:** Each thread computes its partial sum from the tiles in shared memory (threads wait for each other to finish)

# Tiled Matrix-Matrix Multiplication (III)



# Tiled Matrix-Matrix Multiplication (IV)



# Tiled Matrix-Matrix Multiplication (V)

```
____shared____float A_s[TILE_DIM][TILE_DIM];
                                                     Declare arrays in shared memory
____shared____float B_s[TILE_DIM][TILE_DIM];
unsigned int row = blockIdx.y*blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x*blockDim.x + threadIdx.x;
float sum = 0.0f;
for(unsigned int tile = 0; tile < N/TILE_DIM; ++tile) {</pre>
    // Load tile to shared memory
    A_s[threadIdx.y][threadIdx.x] = A[row*N + tile*TILE_DIM + threadIdx.x];
    B_s[threadIdx.y][threadIdx.x] = B[(tile*TILE_DIM + threadIdx.y)*N + col];
    ____syncthreads();
                           Threads wait for each other to finish loading before computing
    // Compute with tile
    for(unsigned int i = 0; i < TILE_DIM; ++i) {</pre>
        sum += A_s[threadIdx.y][i]*B_s[i][threadIdx.x];
    }
    ____syncthreads(); ~
                           Threads wait for each other to finish computing before loading
}
```

C[row\*N + col] = sum;

# Tiled Matrix Multiplication on GPU



Computer Architecture - Lecture 9: GPUs and GPGPU Programming (Fall 2017) https://youtu.be/mgtlbEqn2dA?t=8157

# Recommended Readings (I)

- Hwu and Kirk, "Programming Massively Parallel Processors," Third Edition, 2017
  - Chapter 4: Memory and data locality



# Recommended Readings (II)

 Hwu and Kirk and El Hajj, "Programming Massively Parallel Processors," Fourth Edition, 2022
 Chapter 5 - Memory architecture and data locality



**P&S Heterogeneous Systems** GPU Memory Hierarchy

> Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zürich Fall 2022 24 October 2022