# P&S Processing-in-Memory Real-World Processing-in-Memory Architectures: Microbenchmarking of UPMEM PIM Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zürich Spring 2022 24 March 2022 # UPMEM Processing-in-DRAM Engine (2019) - Processing in DRAM Engine - Includes standard DIMM modules, with a large number of DPU processors combined with DRAM chips. - Replaces standard DIMMs # **System Organization** - A UPMEM DIMM contains 8 or 16 chips - Thus, 1 or 2 ranks of 8 chips each - Inside each PIM chip there are: - 8 64MB banks per chip: Main RAM (MRAM) banks - 8 DRAM Processing Units (DPUs) in each chip, 64 DPUs per rank # **DRAM Processing Unit** PIM Chip ## **DPU Pipeline** - In-order pipeline - Up to 425 MHz \* - Fine-grain multithreaded - 24 hardware threads - 14 pipeline stages - DISPATCH: Thread selection - FETCH: Instruction fetch - READOP: Register file - FORMAT: Operand formatting - ALU: Operation and WRAM - MERGE: Result formatting <sup>\* 350</sup> MHz in the UPMEM-based PIM system used for the experimental results shown in this lecture ### **DPU Instruction Set Architecture** - Specific 32-bit ISA - Aiming at scalar, inorder, and multithreaded implementation - Allowing compilation of 64-bit C code - LLVM/Clang compiler https://sdk.upmem.com/2021.2.0/201\_IS.html# # Microbenchmarking the UPMEM PIM Architecture # **DPU: Arithmetic Throughput** PIM Chip ### **Arithmetic Throughput: Microbenchmark** #### Goal Measure the maximum arithmetic throughput for different datatypes and operations #### Microbenchmark - We stream over an array in WRAM and perform read-modify-write operations - Experiments on one DPU - We vary the number of tasklets from 1 to 24 - Arithmetic operations: add, subtract, multiply, divide - Datatypes: int32, int64, float, double - We measure cycles with an accurate cycle counter that the SDK provides - We include WRAM accesses (including address calculation) and arithmetic operation ### Microbenchmark for INT32 ADD Throughput ``` #define SIZE 256 int* bufferA = mem alloc(SIZE * sizeof(int)); C-based code for(int i = 0; i < SIZE; i++){</pre> int temp = bufferA[i]; 5 temp += scalar; bufferA[i] = temp; } move r2, 0 Poop of the second seco // Loop header lsl add r3, r0, r2, 2 // Address calculation // Load from WRAM // Add Store to WRAM // Index update jneq r2, 256, .LBB0 1 // Conditional jump ``` # **Arithmetic Throughput: 11 Tasklets** #### **KEY OBSERVATION 1** The arithmetic throughput of a DRAM Processing Unit saturates at 11 or more tasklets. This observation is consistent for different datatypes (INT32, INT64, UINT32, UINT64, FLOAT, DOUBLE) and operations (ADD, SUB, MUL, DIV). # **Arithmetic Throughput: ADD/SUB** INT32 ADD/SUB are 17% faster than INT64 ADD/SUB #### Can we explain the peak throughput? Peak throughput at 11 tasklets. One instruction retires every cycle when the pipeline is full Arithmetic Throughput (in OPS) = $\frac{frequenc\overline{y}_{DPU}}{\#instructions}$ # **Arithmetic Throughput: #Instructions** Compiler explorer: <a href="https://dpu.dev">https://dpu.dev</a> ``` #define BLOCK SIZE 1024 ☐ 11010 ☐ ./a.out ☑ .LX0: ☑ .text ☑ // 1 Benchmark 32bits: typedef int T; move r2, 0 void Benchmark 32bits(T *cache A, T scalar) { 3 .LBB0 1: for (int i = 0; i < BLOCK SIZE / sizeof(T); i++){</pre> lsl add r3, r0, r2, 2 ///// WRAM READ ///// lw r4, r3, 0 T temp = cache_A[i]; add r4, r4, r1 sw r3, 0, r4 temp += scalar; // ADD add r2, r2, 1 10 jneq r2, 256, .LBB0 1 ///// WRAM WRITE ///// 11 10 jump r23 12 cache A[i] = temp; 11 Benchmark 64bits: 13 12 move r1, 0 14 13 .LBB1 1: 15 lsl add r4, r0, r1, 3 14 16 typedef long T long; ld d6, r4, 0 15 void Benchmark 64bits(T long *cache A, T long scalar) { 17 add r7, r7, r3 16 for (int i = 0; i < BLOCK SIZE / sizeof(T long); i++){</pre> 18 addc r6, r6, r2 17 ///// WRAM READ ///// 19 sd r4, 0, d6 18 20 T long temp = cache A[i]; add r1, r1, 1 19 21 jneq r1, 128, .LBB1_1 20 22 temp += scalar; // ADD 21 jump r23 23 ``` - 6 instructions in the 32-bit ADD/SUB microbenchmark - 7 instructions in the 64-bit ADD/SUB microbenchmark 24 2526 27 # **Arithmetic Throughput: ADD/SUB** INT32 ADD/SUB are 17% faster than INT64 ADD/SUB #### Can we explain the peak throughput? Peak throughput at 11 tasklets. One instruction retires every cycle when the pipeline is full Arithmetic Throughput (in OPS) = $\frac{frequency_{DPU}}{\#instructions}$ 64-bit ADD/SUB: 7 instructions $\rightarrow$ 50.00 MOPS at $frequency_{DPU}$ = 350 MHz # **Arithmetic Throughput: MUL/DIV** # **Arithmetic Throughput: Native Support** #### **KEY OBSERVATION 2** - DPUs provide native hardware support for 32-and 64-bit integer addition and subtraction, leading to high throughput for these operations. - DPUs do not natively support 32- and 64-bit multiplication and division, and floating point operations. These operations are emulated by the UPMEM runtime library, leading to much lower throughput. ### Microbenchmark: Arithmetic Throughput Arithmetic throughput for different operations and datatypes ### **DPU: WRAM Bandwidth** PIM Chip ### **WRAM Bandwidth: Microbenchmark** - Goal - Measure the WRAM bandwidth for the STREAM benchmark - Microbenchmark - We implement the four versions of STREAM: COPY, ADD, SCALE, and TRIAD - The operations performed in ADD, SCALE, and TRIAD are addition, multiplication, and addition+multiplication, respectively - We vary the number of tasklets from 1 to 16 - We show results for 1 DPU - We do not include accesses to MRAM ### STREAM Benchmark in WRAM ``` // COPY 8 bytes read, 8 bytes written, for(int i = 0; i < SIZE; i++){</pre> no arithmetic operations bufferB[i] = bufferA[i]; // ADD 16 bytes read, 8 bytes written, for(int i = 0; i < SIZE; i++){</pre> ADD bufferC[i] = bufferA[i] + bufferB[i]; // SCALE 8 bytes read, 8 bytes written, for(int i = 0; i < SIZE; i++){</pre> MUL bufferB[i] = scalar * bufferA[i]; // TRIAD 16 bytes read, 8 bytes written, for(int i = 0; i < SIZE; i++){</pre> MUL, ADD bufferC[i] = bufferA[i] + scalar * bufferB[i]; ``` ### **WRAM Bandwidth: STREAM** #### How can we estimate the bandwidth? Assuming that the pipeline is full, and *Bytes* is the number of bytes read and written: $$WRAM\ Bandwidth\ \left(in\frac{B}{S}\right) = \frac{Bytes \times frequency_{DPU}}{\#instructions}$$ ### **WRAM Bandwidth: COPY** COPY executes 2 instructions (WRAM load and store). With 11 tasklets, 11 × 16 bytes in 22 cycles: WRAM Bandwidth $$\left(in\frac{B}{S}\right) = 2,800\frac{MB}{S}$$ at 350 MHz ### **WRAM Bandwidth: ADD** $$WRAM\ Bandwidth\ \left(in\frac{B}{S}\right) = \frac{Bytes \times frequency_{DPU}}{\#instructions}$$ ADD executes 5 instructions (2 1d, add, addc, sd). With 11 tasklets, 11 × 24 bytes in 55 cycles: WRAM Bandwidth $$\left(in\frac{B}{S}\right) = 1,680\frac{MB}{S}$$ at 350 MHz ### WRAM Bandwidth: Access Patterns All 8-byte WRAM loads and stores take one cycle when the DPU pipeline is full #### **KEY OBSERVATION 3** The sustained bandwidth provided by the DPU's internal Working memory (WRAM) is **independent of the memory access pattern** (either streaming, strided, or random access pattern). **All 8-byte WRAM loads and stores take one cycle**, when the DPU's pipeline is full (i.e., with 11 or more tasklets). ``` Microbenchmark: c[a[i]]=b[a[i]]; Unit-stride: a[i]=a[i-1]+1; Strided: a[i]=a[i-1]+stride; Random: a[i]=rand(); ``` ### Microbenchmark: STREAM and WRAM STREAM benchmark and WRAM access patterns ### **DPU: MRAM Latency and Bandwidth** PIM Chip ### **MRAM Bandwidth** - Goal - Measure MRAM bandwidth for different access patterns - Microbenchmarks - Latency of a single DMA transfer for different transfer sizes ``` mram read(); // MRAM-WRAM DMA transfer ``` - mram write(); // WRAM-MRAM DMA transfer - STREAM benchmark - COPY, COPY-DMA - ADD, SCALE, TRIAD - Strided access pattern - Coarse-grain strided access - Fine-grain strided access - Random access pattern (GUPS) - We do include accesses to MRAM # MRAM Read and Write Latency (I) $$MRAM\ Bandwidth\ \left(in\frac{B}{S}\right) = \frac{size \times frequency_{DPU}}{MRAM\ Latency}$$ We can model the MRAM latency with a linear expression $MRAM\ Latency\ (in\ cycles) = \alpha + \beta \times size$ In our measurements, $\beta$ equals 0.5 cycles/byte. Theoretical maximum MRAM bandwidth = 700 MB/s at 350 MHz # MRAM Read and Write Latency (II) #### **KEY OBSERVATION 4** - The DPU's Main memory (MRAM) bank access latency increases linearly with the transfer size. - The maximum theoretical MRAM bandwidth is 2 bytes per cycle. # MRAM Read and Write Latency (III) Read and write accesses to MRAM are symmetric The sustained MRAM bandwidth increases with data transfer size #### **PROGRAMMING RECOMMENDATION 1** For data movement between the DPU's MRAM bank and the WRAM, use large DMA transfer sizes when all the accessed data is going to be used. # MRAM Read and Write Latency (IV) #### MRAM latency changes slowly between 8 and 128 bytes For small transfers, the fixed cost $(\alpha)$ dominates the variable cost $(\beta \times size)$ #### PROGRAMMING RECOMMENDATION 2 For small transfers between the MRAM bank and the WRAM, **fetch more bytes than necessary within a 128-byte limit**. Doing so increases the likelihood of finding data in WRAM for later accesses (i.e., the program can check whether the desired data is in WRAM before issuing a new MRAM access). # MRAM Read and Write Latency (V) 2,048-byte transfers are only 4% faster than 1,024-byte transfers Larger transfers require more WRAM, which may limit the number of tasklets #### PROGRAMMING RECOMMENDATION 3 Choose the data transfer size between the MRAM bank and the WRAM based on the program's WRAM usage, as it imposes a tradeoff between the sustained MRAM bandwidth and the number of tasklets that can run in the DPU (which is dictated by the limited WRAM capacity). ### **MRAM Bandwidth** - Goal - Measure MRAM bandwidth for different access patterns - Microbenchmarks - Latency of a single DMA transfer for different transfer sizes ``` mram read(); // MRAM-WRAM DMA transfer ``` - mram write(); // WRAM-MRAM DMA transfer - STREAM benchmark - COPY, COPY-DMA - ADD, SCALE, TRIAD - Strided access pattern - Coarse-grain strided access - Fine-grain strided access - Random access pattern (GUPS) - We do include accesses to MRAM ### STREAM Benchmark in MRAM ``` // COPY // Load current MRAM block to WRAM mram read(( mram ptr void const*)mram address A, bufferA, SIZE * sizeof(uint64 t)); for(int i = 0; i < SIZE; i++){ bufferB[i] = bufferA[i]; // Write WRAM block to MRAM mram write(bufferB, ( mram ptr void*)mram address B, SIZE * sizeof(uint64 t)); // COPY-DMA // Load current MRAM block to WRAM mram read(( mram ptr void const*)mram address A, bufferA, SIZE * sizeof(uint64 t)); // Write WRAM block to MRAM mram write(bufferA, ( mram ptr void*)mram address B, SIZE * sizeof(uint64 t)); ``` ### **STREAM Benchmark: COPY-DMA** The sustained bandwidth of **COPY-DMA** is close to the theoretical maximum (700 MB/s): ~1.6 TB/s for 2,556 DPUs **COPY-DMA** saturates with two tasklets, even though the DMA engine can perform only one transfer at a time Using two or more tasklets guarantees that there is always a DMA request enqueued to keep the DMA engine busy ### STREAM Benchmark: Bandwidth Saturation (I) COPY and ADD saturate at 4 and 6 tasklets, respectively #### **SCALE** and **TRIAD** saturate at 11 tasklets The latency of MRAM accesses becomes longer than the pipeline latency after 4 and 6 tasklets for COPY and ADD, respectively The pipeline latency of **SCALE** and **TRIAD** is longer than the MRAM latency for any number of tasklets (both use costly MUL) ### STREAM Benchmark: Bandwidth Saturation (II) #### **KEY OBSERVATION 5** - When the access latency to an MRAM bank for a streaming benchmark (COPY-DMA, COPY, ADD) is larger than the pipeline latency (i.e., execution latency of arithmetic operations and WRAM accesses), the performance of the DPU saturates at a number of tasklets smaller than 11. This is a memory-bound workload. - When the pipeline latency for a streaming benchmark (SCALE, TRIAD) is larger than the MRAM access latency, the performance of a DPU saturates at 11 tasklets. This is a compute-bound workload. ### **MRAM Bandwidth** - Goal - Measure MRAM bandwidth for different access patterns - Microbenchmarks - Latency of a single DMA transfer for different transfer sizes ``` mram_read(); // MRAM-WRAM DMA transfer ``` - mram write(); // WRAM-MRAM DMA transfer - STREAM benchmark - COPY, COPY-DMA - ADD, SCALE, TRIAD - Strided access pattern - Coarse-grain strided access - Fine-grain strided access - Random access pattern (GUPS) - We do include accesses to MRAM #### Strided and Random Access to MRAM ``` // COARSE-GRAINED STRIDED ACCESS // Load current MRAM block to WRAM mram read(( mram ptr void const*)mram address A, bufferA, SIZE * sizeof(uint64 t)); mram read(( mram ptr void const*)mram address B, bufferB, SIZE * sizeof(uint64 t)); for(int i = 0; i < SIZE; i += stride){</pre> bufferB[i] = bufferA[i]; // Write WRAM block to MRAM mram write(bufferB, ( mram ptr void*)mram address B, SIZE * sizeof(uint64 t)); // FINE-GRAINED STRIDED & RANDOM ACCESS for(int i = 0; i < SIZE; i += stride){</pre> int index = i * sizeof(uint64 t); // Load current MRAM element to WRAM mram read(( mram ptr void const*)(mram address A + index), bufferA, sizeof(uint64 t)); // Write WRAM element to MRAM mram write(bufferA, ( mram ptr void*)(mram address B + index), sizeof(uint64 t)); ``` ## Strided and Random Accesses (I) Large difference in maximum sustained bandwidth between coarse-grained and fine-grained DMA Coarse-grained DMA uses 1,024-byte transfers, while fine-grained DMA uses 8-byte transfers Random access achieves very similar maximum sustained bandwidth to fine-grained strided approach ## Strided and Random Accesses (II) The sustained MRAM bandwidth of coarse-grained DMA decreases as the stride increases The effective utilization of the transferred data decreases as the stride becomes larger (e.g., a stride 4 means that only one fourth of the transferred data is used) ## Strided and Random Accesses (III) For a stride of 16 or larger, the fine-grained DMA approach achieves higher bandwidth With stride 16, only one sixteenth of the maximum sustained bandwidth (622.36 MB/s) of coarse-grained DMA is effectively used, which is lower than the bandwidth of fine-grained DMA (72.58 MB/s) ## Strided and Random Accesses (IV) #### PROGRAMMING RECOMMENDATION 4 - For strided access patterns with a **stride smaller than 16 8-byte elements, fetch a large contiguous chunk** (e.g., 1,024 bytes) from a DPU's MRAM bank. - For strided access patterns with larger strides and random access patterns, fetch only the data elements that are needed from an MRAM bank. ### Microbenchmark: Strided and Random Strided and random accesses to MRAM ### DPU: Arithmetic Throughput vs. Operational Intensity ### Arithmetic Throughput vs. Operational Intensity (I) - Goal - Characterize memory-bound regions and compute-bound regions for different datatypes and operations - Microbenchmark - We load one chunk of an MRAM array into WRAM - Perform a variable number of operations on the data - Write back to MRAM - The experiment is inspired by the Roofline model\* - We define operational intensity (OI) as the number of arithmetic operations performed per byte accessed from MRAM (OP/B) - The pipeline latency changes with the operational intensity, but the MRAM access latency is fixed ### Arithmetic Throughput vs. Operational Intensity (II) ``` int repetitions = input repeat >= 1.0 ? (int)input repeat : 1; int stride = input repeat \geq 1.0 ? 1 : (int)(1 / input repeat); // Load current MRAM block to WRAM mram read(( mram ptr void const*)mram address A, bufferA, SIZE * sizeof(T)); // Update input repeat greater or equal for(int r = 0; r < repetitions; r++){</pre> to 1 indicates the (integer) for(int i = 0; i < SIZE; i+=stride){</pre> number of repetitions per input #ifdef ADD element bufferA[i] += scalar; // ADD #elif SUB input repeat smaller than 1 bufferA[i] -= scalar; // SUB indicates the fraction of elements #elif MUIL that are updated bufferA[i] *= scalar; // MUL #elif DIV bufferA[i] /= scalar; // DIV #endif // Write WRAM block to MRAM mram write(bufferA, ( mram ptr void*)mram address B, SIZE * sizeof(T)); ``` ### Arithmetic Throughput vs. Operational Intensity (III) We show results of arithmetic throughput vs. operational intensity for (a) 32-bit integer ADD, (b) 32-bit integer MUL, (c) 32-bit floating-point ADD, and (d) 32-bit floating-point MUL (results for other datatypes and operations show similar trends) ### Arithmetic Throughput vs. Operational Intensity (IV) In the memory-bound region, the arithmetic throughput increases with the operational intensity In the compute-bound region, the arithmetic throughput is flat at its maximum The throughput saturation point is the operational intensity where the transition between the memory-bound region and the compute-bound region happens The throughput saturation point is as low as ¼ OP/B, i.e., 1 integer addition per every 32-bit element fetched ### Arithmetic Throughput vs. Operational Intensity (V) #### **KEY OBSERVATION 6** The arithmetic throughput of a DRAM Processing Unit (DPU) saturates at low or very low operational intensity (e.g., 1 integer addition per 32-bit element). Thus, the DPU is fundamentally a compute-bound processor. We expect most real-world workloads be compute-bound in the UPMEM PIM architecture. #### Microbenchmark: Arithmetic Throughput vs. Operational Intensity Arithmetic Throughput versus Operational Intensity ### **Key Takeaway 1** The throughput saturation point is as low as ¼ OP/B, i.e., 1 integer addition per every 32-bit element fetched Operational Intensity (OP/B) #### KEY TAKEAWAY 1 The UPMEM PIM architecture is fundamentally compute bound. As a result, the most suitable workloads are memory-bound. ### Experimental Analysis of the UPMEM PIM Engine ### Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland IZZAT EL HAJJ, American University of Beirut, Lebanon IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece GERALDO F. OLIVEIRA, ETH Zürich, Switzerland ONUR MUTLU, ETH Zürich, Switzerland Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this *data movement bottleneck* requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as *processing-in-memory (PIM)*. Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called *DRAM Processing Units* (*DPUs*), integrated in the same chip. This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present *PrIM* (*Processing-In-Memory benchmarks*), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems. ### Understanding a Modern PIM Architecture ## Upcoming Lectures Introduction to Samsung's and SK Hynix's PIM devices Programming an UPMEM-based PIM system Workload characterization for PIM suitability # P&S Processing-in-Memory Real-World Processing-in-Memory Architectures: Microbenchmarking of UPMEM PIM Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zürich Spring 2022 24 March 2022