# **P&S Processing-in-Memory**

Real-World Processing-in-Memory Architectures: UPMEM PIM Architecture

> Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zürich Spring 2022 17 March 2022

### PIM Becomes Real

- UPMEM, founded in January 2015, announces the first real-world PIM architecture in 2016
- UPMEM's PIM-enabled DIMMs start getting commercialized in 2019
- In early 2021, Samsung announces FIMDRAM at the ISSCC conference
- Samsung's LP-DDR5 and DIMM-based PIM announced a few months later
- In early 2022, SK Hynix announces AiM at the ISSCC conference





Fabless chip company Upmem SAS (Grenoble, France), founded in January 2015, is developing a microprocessor for use in data-intensive applications in the datacenter that will sit embedded in DRAM to be close to the data.

Placing hundreds or thousands of processing elements in DRAM able to perform work for a controlling server

### Samsung Function-in-Memory DRAM (2021)

Samsung Newsroom


### Samsung Develops Industry's First High Bandwidth Memory with AI Processing Power

Korea on February 17, 2021

#### The new architecture will deliver over twice the system performance and reduce energy consumption by more than 70%

Samsung Electronics, the world leader in advanced memory technology, today announced that it has developed the industry's first High Bandwidth Memory (HBM) integrated with artificial intelligence (AI) processing power – the HBM-PIM. The new processing-in-memory (PIM) architecture brings powerful AI computing capabilities inside high-performance memory, to accelerate large-scale processing in data centers, high performance computing (HPC) systems and AI-enabled mobile applications.

Kwangil Park, senior vice president of Memory Product Planning at Samsung Electronics stated, "Our groundbreaking HBM-PIM is the industry's first programmable PIM solution tailored for diverse AI-driven workloads such as HPC, training and inference. We plan to build upon this breakthrough by further collaborating with AI solution providers for even more advanced PIM-powered applications."

### SK Hynix Accelerator-in-Memory (2022)

#### **SK**hynix NEWSROOM


#### SK hynix Develops PIM, Next-Generation AI Accelerator


#### Seoul, February 16, 2022

SK hynix (or "the Company", www.skhynix.com) announced on February 16 that it has developed PIM\*, a next-generation memory chip with computing capabilities.

\*PIM(Processing In Memory): A next-generation technology that provides a solution for data congestion issues for AI and big data by adding computational functions to semiconductor memory

It has been generally accepted that memory chips store data and CPU or GPU, like human brain, process data. SK hynix, following its challenge to such notion and efforts to pursue innovation in the next-generation smart memory, has found a breakthrough solution with the development of the latest technology.

SK hynix plans to showcase its PIM development at the world's most prestigious semiconductor conference, 2022 ISSCC\*, in San Francisco at the end of this month. The company expects continued efforts for innovation of this technology to bring the memory-centric computing, in which semiconductor memory plays a central role, a step closer to the reality in devices such as smartphones.

In Paper 11.1, SK Hynix describes a 1ynm, GDDR6-based accelerator-in-memory with a command set for deep-learning operation.

\*ISSCC: The International Solid-State Circuits Conference will be held virtually from Feb. 20 to Feb. 24 this year with a theme of "Intelligent Silicon for a Sustainable World"

For the first product that adopts the PIM technology, SK hynix has developed a sample of GDDR6-AiM (Accelerator\* in memory). The GDDR6-AiM adds computational functions to GDDR6\* memory chips, which process data at 16Gbps. A combination of GDDR6-AiM with CPU or GPU instead of a typical DRAM makes certain computation speed 16 times faster. GDDR6-AiM is widely expected to be adopted for machine learning, high-performance computing, and big data computation and storage.



#### 11.1 A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based Accelerator-in-Memory supporting 1TFLOPS MAC Operation and Various Activation Functions for Deep-Learning Applications

Seongju Lee, SK hynix, Icheon, Korea

The 8Gb design achieves a peak throughput of 1TFLOPS with 1GHz MAC operations and supports major activation functions to improve accuracy.

https://news.skhynix.com/sk-hynix-develops-pim-next-generation-ai-accelerator/

## UPMEM Processing-in-DRAM Engine (2019)

### Processing in DRAM Engine

 Includes standard DIMM modules, with a large number of DPU processors combined with DRAM chips.

### Replaces standard DIMMs

- DDR4 R-DIMM modules
  - 8GB+128 DPUs (16 PIM chips)
  - Standard 2x-nm DRAM process



Large amounts of compute & memory bandwidth



https://www.anandtech.com/show/14750/hot-chips-31-analysis-inmemory-processing-by-upmem

https://www.upmem.com/video-upmem-presenting-its-true-processing-in-memory-solution-hot-chips-2019/

### Samsung Function-in-Memory DRAM (2021)

### FIMDRAM based on HBM2



#### [3D Chip Structure of HBM with FIMDRAM]

ISSCC 2021 / SESSION 25 / DRAM / 25.4

25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications

Young-Cheon Kwon, Suk Han Lee, Jaehoon Lee, Sang-Hyuk Kwon, Je Min Ryu, Jong-Pil Son, Seongil O, Hak-Soo Yu, Haesuk Lee, Soo Young Kim, Youngmin Cho, Jin Guk Kim, Jongyoon Choi, Hyun-Sung Shin, Jin Kim, BengSeng Phuah, HyoungMin Kim, Myeong Jun Song, Ahn Choi, Daeho Kim, SooYoung Kim, Eun-Bong Kim, David Wang, Shinhaeng Kang, Yuhwan Ro, Seungwoo Seo, JoonHo Song, Jaeyoun Youn, Kyomin Sohn, Nam Sung Kim

<sup>1</sup>Samsung Electronics, Hwaseong, Korea <sup>2</sup>Samsung Electronics, San Jose, CA <sup>3</sup>Samsung Electronics, Suwon, Korea

## Samsung Function-in-Memory DRAM (2021)

## **Chip Implementation**

- Mixed design methodology to implement FIMDRAM
  - Full-custom + Digital RTL



[Digital RTL design for PCU block]


[Chip floorplan: the die is split into pseudo channel-0 and pseudo channel-1, each holding the cell arrays for banks 0 to 15. One PCU block is shared by each pair of adjacent banks (bank0 & 1, bank2 & 3, bank4 & 5, bank6 & 7, bank8 & 9, bank10 & 11, bank12 & 13, bank14 & 15). The TSV & Peri Control Block sits in the middle of the die, between banks 0-7 and banks 8-15.]

## Samsung AxDIMM (2021)

- DIMM-based PIM
  - DLRM recommendation system





AxDIMM System





# UPMEM PIM Microarchitecture and ISA

## **UPMEM DIMMs**

- E19: 8 chips/DIMM (1 rank). DPUs @ 267 MHz
- P21: 16 chips/DIMM (2 ranks). DPUs @ 350-425 MHz



### SAFARI

www.upmem.com

## **PIM's Promises**

### **UPMEM PIM massive benefits**

- Massive speed-up
  - Massive additional compute & bandwidth
- Massive energy gains
  - Most data movement on chip
- Low cost
  - ~300\$ of additional DRAM silicon
  - Affordable programming
- Massive ROI / TCO gains

| Energy efficiency when computing on or off the memory chip | Unit | Server + PIM DRAM | Server + normal DRAM |
|------------------------------------------------------------|------|-------------------|----------------------|
| DRAM to processor, 64-bit operand                          | pJ   | ~150              | ~3000*               |
| Operation                                                  | pJ   | ~20               | ~10*                 |
| Server consumption                                         | W    | ~700              | ~300                 |
| Speed-up                                                   |      | ~x20              | x1                   |
| Energy gain                                                |      | ~x10              | x1                   |
| TCO gain                                                   |      | ~x10              | x1                   |

\*Exascale Computing Trends: Adjusting to the "New Normal" for Computer Architecture; John Shalf, Computing in Science & Engineering, 2013

HOT CHIPS 31



Copyright UPMEM® 2019




# **Technology Challenges**

### The Hurdles on the road to the Graal

- DRAM process highly constrained
  - 3x slower transistors than same node digital process
  - Logic 10 times less dense vs. ASIC process
  - Routing density dramatically lower
    - 3 metals only for routing (vs. 10+), pitch x4 larger
- Strong design choices mandatory

### But the PIM Graal is worth it !


#### <u>Take away</u>

#### DRAM vs. ASIC

- Far less performing
- Wafers 2x cheaper vs. ASIC

#### Leapfrogging Moore's law


- Total Energy efficiency x10
- Massive, scalable parallelism
- Very low cost




### **UPMEM** Patent

- (12) United States Patent, Devaux et al.
- (10) Patent No.: US 10,324,870 B2
- (45) Date of Patent: Jun. 18, 2019
- (54) MEMORY CIRCUIT WITH INTEGRATED PROCESSOR
- (71) Applicant: UPMEM, Grenoble (FR)
- (72) Inventors: Fabrice Devaux, La Conversion (CH); Jean-François Roy, Grenoble (FR)
- (73) Assignee: UPMEM, Grenoble (FR)
- (\*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.
- (21) Appl. No.: 15/551,418
- (22) PCT Filed: Feb. 12, 2016
- (56) References Cited
  - U.S. patent documents: 5,666,485 A \* (9/1997, Suresh); 6,463,001 B1 (10/2002, Williams); 7,349,277 B2 \* (3/2008, Kinsley, G11C 11/406); 8,438,358 B1 \* (5/2013, Kraipak, G11C 7/04, 711/167); (Continued)
  - Foreign patent documents: EP 0780768 A1 (6/1997); JP H03109661 A (5/1991); WO 2010/141221 A1 (12/2010)

#### (57) ABSTRACT

A memory circuit having: a memory array including one or more memory banks; a first processor; and a processor control interface for receiving data processing commands directed to the first processor from a central processor, the processor control interface being adapted to indicate to the central processor when the first processor has finished accessing one or more of the memory banks of the memory array, these memory banks becoming accessible to the central processor.

## Accelerator Model (I)

- UPMEM DIMMs coexist with conventional DIMMs
- Integration of UPMEM DIMMs in a system follows an accelerator model
- UPMEM DIMMs can be seen as a loosely coupled accelerator
  - Explicit data movement between the main processor (host CPU) and the accelerator (UPMEM)
  - Explicit kernel launch onto the UPMEM processors
- This resembles GPU computing
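The accelerator model above can be sketched in plain C. This is a mock, not the UPMEM SDK: `device_t`, `copy_to_device`, `launch`, `copy_from_device` and `add_one_kernel` are hypothetical names introduced only to show the three explicit steps (transfer in, kernel launch, transfer out) that a loosely coupled accelerator requires.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { DEV_WORDS = 16 };  /* toy capacity of the mock accelerator memory */

/* Mock accelerator: it owns memory that is separate from host memory. */
typedef struct { int mem[DEV_WORDS]; } device_t;
typedef void (*kernel_fn)(device_t *dev, size_t n);

/* Explicit host-to-accelerator transfer. */
void copy_to_device(device_t *dev, const int *src, size_t n) {
    assert(n <= DEV_WORDS);
    memcpy(dev->mem, src, n * sizeof(int));
}

/* Explicit accelerator-to-host transfer. */
void copy_from_device(const device_t *dev, int *dst, size_t n) {
    assert(n <= DEV_WORDS);
    memcpy(dst, dev->mem, n * sizeof(int));
}

/* Explicit kernel launch (synchronous in this mock). */
void launch(device_t *dev, kernel_fn kernel, size_t n) { kernel(dev, n); }

/* Example kernel: it only touches accelerator memory. */
void add_one_kernel(device_t *dev, size_t n) {
    for (size_t i = 0; i < n; i++) dev->mem[i] += 1;
}

/* Host-side offload: the same three steps as in GPU computing. */
void offload_add_one(int *data, size_t n) {
    device_t dev;
    copy_to_device(&dev, data, n);     /* (1) host -> accelerator */
    launch(&dev, add_one_kernel, n);   /* (2) kernel execution    */
    copy_from_device(&dev, data, n);   /* (3) accelerator -> host */
}
```

The host never dereferences accelerator memory directly; every exchange goes through an explicit copy, which is what makes the coupling "loose".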

## GPU Computing

- Computation is offloaded to the GPU
- Three steps
  - CPU-GPU data transfer (1)
  - GPU kernel execution (2)
  - GPU-CPU data transfer (3)



https://www.youtube.com/watch?v=y40-tY5WJ8A

https://safari.ethz.ch/digitaltechnik/spring2018/lib/exe/fetch.php?media=digitaldesign-2018-lecture22-gpuprogramming-afterlecture.pdf

### Lecture on GPU Programming



## Heterogeneous Systems Course (Fall 2021)

- Short weekly lectures
- Hands-on projects

SAFARI Project & Seminars Courses (Fall 2021)

https://youtube.com/playlist?list=PL5Q2soXY2Zi\_OwkTqEyA6tk3UsoPBH737

# Accelerator Model (II)

• FIG. 6 is a flow diagram representing operations in a method of delegating a processing task to a DRAM processor according to an example embodiment



# System Organization (I)

• FIG. 1 schematically illustrates a computing system comprising DRAM circuits having integrated processors according to an example embodiment



# System Organization (II)

- In a UPMEM-based PIM system, UPMEM DIMMs coexist with regular DDR4 DIMMs



## System Organization (III)

- A UPMEM DIMM contains 8 or 16 chips
  - Thus, 1 or 2 ranks of 8 chips each
- Inside each PIM chip there are:
  - 8 64MB banks per chip: Main RAM (MRAM) banks
  - 8 DRAM Processing Units (DPUs) in each chip, 64 DPUs per rank
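The per-chip figures above multiply out to system-level totals. A minimal sketch in plain C (no UPMEM SDK involved; `pim_system_t` and the `pim_*` helpers are hypothetical names used only for this illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Per-chip and per-rank figures taken from the slide. */
enum { DPUS_PER_CHIP = 8, CHIPS_PER_RANK = 8, MRAM_MB_PER_DPU = 64 };

/* Hypothetical description of a UPMEM-based system (illustration only). */
typedef struct {
    uint32_t dimms;           /* number of UPMEM DIMMs              */
    uint32_t chips_per_dimm;  /* 8 (1 rank) or 16 (2 ranks)         */
} pim_system_t;

/* Number of ranks: each rank groups 8 PIM chips. */
uint32_t pim_ranks(pim_system_t s) {
    assert(s.chips_per_dimm % CHIPS_PER_RANK == 0);
    return s.dimms * (s.chips_per_dimm / CHIPS_PER_RANK);
}

/* Number of DPUs: 8 per chip. */
uint32_t pim_dpus(pim_system_t s) {
    return s.dimms * s.chips_per_dimm * DPUS_PER_CHIP;
}

/* Total MRAM capacity in MB: one 64MB bank per DPU. */
uint64_t pim_mram_mb(pim_system_t s) {
    return (uint64_t)pim_dpus(s) * MRAM_MB_PER_DPU;
}
```

For 20 DIMMs of 16 chips this gives 40 ranks, 2,560 DPUs and 163,840 MB (160 GB) of MRAM; for 10 DIMMs of 8 chips it gives 10 ranks, 640 DPUs and 40 GB. A single 16-chip DIMM yields 128 DPUs and 8 GB, which agrees with the "8GB+128 DPUs" figure quoted earlier.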



# 2,560-DPU System (I)

- UPMEM-based PIM system with 20 UPMEM DIMMs of 16 chips each (40 ranks)
  - P21 DIMMs

- Dual x86 socket
  - UPMEM DIMMs coexist with regular DDR4 DIMMs
  - 2 memory controllers/socket (3 channels each)
  - 2 conventional DDR4 DIMMs on one channel of one controller



## 2,560-DPU System (II)



## 640-DPU System

- UPMEM-based PIM system with 10 UPMEM DIMMs of 8 chips each (10 ranks)
  - E19 DIMMs
  - x86 socket
    - 2 memory controllers (3 channels each)
    - 2 conventional DDR4 DIMMs on one channel of one controller



# **DPU Sharing? Security Implications?**

- DPUs cannot be shared across multiple CPU processes
  - There are so many DPUs in the system that there is no need for sharing
- According to UPMEM, this assumption makes things simpler
  - No need for OS
  - Simplified security implications: No side channels

# Vector Addition (VA)

- Our first programming example
- We partition the input arrays across:
  - DPUs
  - Tasklets, i.e., software threads running on a DPU
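One way to picture the two-level partitioning is the following plain-C reference model. It assumes contiguous, near-equal chunks at both levels; the names `range_t`, `split_range`, `tasklet_range` and `vector_add_partitioned` are hypothetical, and real UPMEM code may partition differently (for example, to respect transfer-size or alignment constraints).

```c
#include <assert.h>
#include <stddef.h>

/* Half-open index range [begin, end). */
typedef struct { size_t begin, end; } range_t;

/* Split n items into `parts` contiguous chunks whose sizes differ by at
   most one, and return the chunk owned by `id` (0 <= id < parts). */
range_t split_range(size_t n, size_t parts, size_t id) {
    assert(parts > 0 && id < parts);
    size_t base = n / parts;
    size_t extra = n % parts;  /* the first `extra` chunks get one more item */
    range_t r;
    r.begin = id * base + (id < extra ? id : extra);
    r.end = r.begin + base + (id < extra ? 1 : 0);
    return r;
}

/* Elements handled by one tasklet of one DPU: split across DPUs first,
   then split that DPU's chunk across its tasklets. */
range_t tasklet_range(size_t n, size_t nr_dpus, size_t dpu,
                      size_t nr_tasklets, size_t tasklet) {
    range_t d = split_range(n, nr_dpus, dpu);
    range_t t = split_range(d.end - d.begin, nr_tasklets, tasklet);
    t.begin += d.begin;
    t.end += d.begin;
    return t;
}

/* Sequential reference model: every (DPU, tasklet) pair adds its slice. */
void vector_add_partitioned(const int *a, const int *b, int *c, size_t n,
                            size_t nr_dpus, size_t nr_tasklets) {
    for (size_t dpu = 0; dpu < nr_dpus; dpu++)
        for (size_t t = 0; t < nr_tasklets; t++) {
            range_t r = tasklet_range(n, nr_dpus, dpu, nr_tasklets, t);
            for (size_t i = r.begin; i < r.end; i++) c[i] = a[i] + b[i];
        }
}
```

Because the slices are disjoint and cover the whole array, no synchronization between tasklets or DPUs is needed for this workload.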



## **CPU-DPU/DPU-CPU** Data Transfers

- CPU-DPU and DPU-CPU transfers
  - Between host CPU's main memory and DPUs' MRAM banks



- Serial CPU-DPU/DPU-CPU transfers:
  - A single DPU (i.e., 1 MRAM bank)
- Parallel CPU-DPU/DPU-CPU transfers:
  - Multiple DPUs (i.e., many MRAM banks)
- Broadcast CPU-DPU transfers:
  - Multiple DPUs with a single buffer
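The three transfer types differ only in which MRAM banks receive which host buffer. The toy model below makes that explicit in plain C. It is not the UPMEM SDK: `pim_model_t` and the `xfer_*` functions are hypothetical, the bank size is shrunk to a few words, and the "parallel" copies run as a simple loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { NR_DPUS = 4, MRAM_WORDS = 16 };  /* toy sizes, not the real 64MB banks */

/* Toy model: one int array per DPU stands in for its MRAM bank. */
typedef struct { int mram[NR_DPUS][MRAM_WORDS]; } pim_model_t;

/* Serial transfer: one host buffer goes to a single DPU's MRAM bank. */
void xfer_serial(pim_model_t *p, size_t dpu, const int *src, size_t n) {
    assert(dpu < NR_DPUS && n <= MRAM_WORDS);
    memcpy(p->mram[dpu], src, n * sizeof(int));
}

/* Parallel transfer: each DPU receives its own host buffer.
   (Modeled sequentially here; the point is one buffer per bank.) */
void xfer_parallel(pim_model_t *p, const int *const src[NR_DPUS], size_t n) {
    assert(n <= MRAM_WORDS);
    for (size_t d = 0; d < NR_DPUS; d++)
        memcpy(p->mram[d], src[d], n * sizeof(int));
}

/* Broadcast transfer: every DPU receives a copy of the same host buffer. */
void xfer_broadcast(pim_model_t *p, const int *src, size_t n) {
    assert(n <= MRAM_WORDS);
    for (size_t d = 0; d < NR_DPUS; d++)
        memcpy(p->mram[d], src, n * sizeof(int));
}
```

In vector addition, for instance, the input chunks would use the parallel form (a different chunk per DPU), while a shared scalar or lookup table would use the broadcast form.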

## **Inter-DPU Communication**

- There is no direct communication channel between DPUs



- Inter-DPU communication takes place via the host CPU using CPU-DPU and DPU-CPU transfers
- Example communication patterns:
  - Merging of partial results to obtain the final result
    - Only DPU-CPU transfers
  - Redistribution of intermediate results for further computation
    - DPU-CPU transfers and CPU-DPU transfers
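Both patterns reduce to host-side loops over per-DPU values. A minimal sketch in plain C, where the arrays stand in for data already gathered through DPU-CPU transfers; the function names and the choice of a sum and a prefix sum as examples are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Pattern 1 (merge): the host has read one partial result per DPU
   (DPU-CPU transfers) and combines them into the final result. */
long merge_partial_sums(const long *partial, size_t nr_dpus) {
    assert(partial != NULL || nr_dpus == 0);
    long total = 0;
    for (size_t d = 0; d < nr_dpus; d++) total += partial[d];
    return total;
}

/* Pattern 2 (redistribute): the host gathers a per-DPU count (DPU-CPU),
   computes what each DPU needs for its next step, and would then send
   offset[d] back to DPU d (CPU-DPU). Here the derived value is an
   exclusive prefix sum, i.e., where each DPU's output starts globally. */
void redistribute_offsets(const long *count, long *offset, size_t nr_dpus) {
    assert((count != NULL && offset != NULL) || nr_dpus == 0);
    long running = 0;
    for (size_t d = 0; d < nr_dpus; d++) {
        offset[d] = running;
        running += count[d];
    }
}
```

The host is on the critical path in both cases, so the amount of data exchanged between DPUs directly limits how well such workloads scale.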

# DRAM Processing Unit (I)

• FIG. 4 schematically illustrates part of the computing system of FIG. 1 in more detail according to an example embodiment




## **DRAM Processing Unit (II)**

**PIM Chip** 



# **DPU Pipeline**

- In-order pipeline
  - Up to 425 MHz
- Fine-grain multithreaded
  - 24 hardware threads
- 14 pipeline stages
  - DISPATCH: Thread selection
  - FETCH: Instruction fetch
  - **READOP:** Register file
  - FORMAT: Operand formatting
  - ALU: Operation and WRAM
  - MERGE: Result formatting



## Fine-Grained Multithreading

- Idea: Hardware has multiple thread contexts (PC + registers). Each cycle, the fetch engine fetches from a different thread.
  - By the time the fetched branch/instruction resolves, no instruction is fetched from the same thread
  - Branch/instruction resolution latency overlapped with execution of other threads' instructions
- Advantage (+): No logic needed for handling control and data dependences within a thread
- Disadvantage (-): Single-thread performance suffers
- Disadvantage (-): Extra logic for keeping thread contexts
- Disadvantage (-): Does not overlap latency if not enough threads to cover the whole pipeline
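The last drawback can be quantified with a small simulation. The sketch below assumes the strictest form of the idea (a thread may not issue again until its previous instruction has left a pipeline of `depth` stages) and that threads are always ready. `fgmt_issued` is a hypothetical helper; the actual DPU may allow a thread to re-issue sooner than the full pipeline depth.

```c
#include <assert.h>

enum { MAX_THREADS = 64 };

/* Simulate `cycles` cycles of a fine-grained multithreaded pipeline.
   At most one instruction issues per cycle, chosen round-robin among
   threads that have no instruction in flight. Returns the number of
   instructions issued. */
unsigned fgmt_issued(unsigned threads, unsigned depth, unsigned cycles) {
    assert(threads > 0 && threads <= MAX_THREADS && depth > 0);
    unsigned free_at[MAX_THREADS] = {0};  /* first cycle a thread may issue again */
    unsigned issued = 0;
    unsigned next = 0;                    /* round-robin starting point */
    for (unsigned c = 0; c < cycles; c++) {
        for (unsigned k = 0; k < threads; k++) {
            unsigned t = (next + k) % threads;
            if (free_at[t] <= c) {
                free_at[t] = c + depth;   /* busy until it leaves the pipeline */
                issued++;
                next = (t + 1) % threads;
                break;                    /* one issue slot per cycle */
            }
        }
    }
    return issued;
}
```

With a 14-stage pipeline over 140 cycles, one thread issues 10 instructions, 7 threads issue 70, and 14 or 24 threads issue 140. Under this model, utilization is min(threads, depth) / depth, so extra threads beyond the pipeline depth add no throughput.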



## Fine-Grained Multithreading (II)

- Idea: Switch to another thread every cycle such that no two instructions from a thread are in the pipeline concurrently
- Tolerates the control and data dependence latencies by overlapping the latency with useful work from other threads
- Improves pipeline utilization by taking advantage of multiple threads
- Thornton, "Parallel Operation in the Control Data 6600," AFIPS 1964.
- Smith, "A pipelined, shared resource MIMD computer," ICPP 1978.

## Lecture on Fine-Grained Multithreading



## Lectures on Fine-Grained Multithreading

- Digital Design & Computer Architecture, Spring 2021, Lecture 14
  - Pipelined Processor Design (ETH, Spring 2021)
  - https://www.youtube.com/watch?v=6e5KZcCGBYw&list=PL5Q2soXY2Zi\_uej3aY39YB5pfW4SJ7LIN&index=16
- Digital Design & Computer Architecture, Spring 2020, Lecture 18c
  - Fine-Grained Multithreading (ETH, Spring 2020)
  - https://www.youtube.com/watch?v=bu5dxKTvQVs&list=PL5Q2soXY2Zi\_FRrloMa2fUYWPGiZUBQo2&index=26

https://www.youtube.com/onurmutlulectures

# **DPU Pipeline**

- In-order pipeline
  - Up to 350 MHz
- Fine-grain multithreaded
  - 24 hardware threads
- 14 pipeline stages
  - DISPATCH: Thread selection
  - FETCH: Instruction fetch
  - **READOP:** Register file
  - FORMAT: Operand formatting
  - ALU: Operation and WRAM
  - MERGE: Result formatting



## **DPU Instruction Set Architecture**

- Specific 32-bit ISA
  - Aiming at scalar, in-order, and multithreaded implementation
  - Allowing compilation of 64-bit C code
  - LLVM/Clang compiler



#### **Instruction Set Architecture**

This section covers the architecture concepts required to understand and use the UPMEM DPU processor as a software developer. It also provides an exhaustive list of the available processor instructions.

Software developers should use this section as a reference manual to develop or debug assembly code.

#### **Resources overview**

#### **Thread registers**

The system is composed of 24 hardware threads. Each of them owns a set of private resources:

- 24 general-purpose 32-bit registers, named r0 through r23
- A 16-bit-wide program counter, named PC. Note that the PC value is not the memory address of an instruction but the index of that instruction. For example, a PC equal to 1 designates the second instruction in the DPU's program memory.
- Two persistent flags, keeping information about the previous result of an arithmetic or logical instruction:
  - ZF: last result is equal to zero

https://sdk.upmem.com/2021.2.0/201\_IS.html#

### Microbenchmark for INT32 ADD Throughput

C-based code:

```c
#define SIZE 256
int* bufferA = mem_alloc(SIZE * sizeof(int));  // buffer in WRAM
for(int i = 0; i < SIZE; i++){
    int temp = bufferA[i];   // load from WRAM
    temp += scalar;          // 32-bit integer addition
    bufferA[i] = temp;       // store to WRAM
}
```

Compiled code (UPMEM DPU ISA):

```asm
    move r2, 0
.LBB0_1:                        // Loop header
    lsl_add r3, r0, r2, 2       // Address calculation
    lw r4, r3, 0                // Load from WRAM
    add r4, r4, r1              // Add
    sw r3, 0, r4                // Store to WRAM
    add r2, r2, 1               // Index update
    jneq r2, 256, .LBB0_1       // Conditional jump
```

## **Arithmetic Throughput: #Instructions**

Compiler explorer: <u>https://dpu.dev</u>



- 6 instructions in the 32-bit ADD/SUB microbenchmark
- 7 instructions in the 64-bit ADD/SUB microbenchmark
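These instruction counts translate into an upper bound on arithmetic throughput. The sketch below assumes that a fully occupied pipeline completes one instruction per cycle, which is an assumption of this example rather than a statement from the slide; `peak_mops` is a hypothetical helper name.

```c
#include <assert.h>

/* Upper bound on arithmetic throughput, in millions of operations per
   second (MOPS), for a loop that spends `instr_per_op` instructions per
   arithmetic operation on a core clocked at `freq_mhz`.
   Assumes one instruction completes per cycle. */
double peak_mops(double freq_mhz, unsigned instr_per_op) {
    assert(freq_mhz > 0.0 && instr_per_op > 0);
    return freq_mhz / (double)instr_per_op;
}
```

At 350 MHz, the 6-instruction 32-bit loop is bounded by 350 / 6 ≈ 58.3 MOPS and the 7-instruction 64-bit loop by 350 / 7 = 50 MOPS. Loop overhead (address calculation, index update, branch) accounts for most of the instructions, so the bound sits well below the clock frequency.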


## **DPU: WRAM Bandwidth**



## **DPU: MRAM Latency and Bandwidth**

**PIM Chip** 



### **DPU: Arithmetic Throughput vs. Operational Intensity**



## Upcoming Lectures

- Microbenchmarking of the UPMEM DPU
  - Compute throughput
  - MRAM and WRAM bandwidth
  - Arithmetic intensity versus compute throughput
- Programming a UPMEM-based PIM system
- Introduction to Samsung's and SK Hynix's PIM devices

### Experimental Analysis of the UPMEM PIM Engine

### Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

- Juan Gómez-Luna, ETH Zürich, Switzerland
- Izzat El Hajj, American University of Beirut, Lebanon
- Ivan Fernandez, ETH Zürich, Switzerland and University of Malaga, Spain
- Christina Giannoula, ETH Zürich, Switzerland and NTUA, Greece
- Geraldo F. Oliveira, ETH Zürich, Switzerland
- Onur Mutlu, ETH Zürich, Switzerland

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this *data movement bottleneck* requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as *processing-in-memory* (*PIM*).

Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called *DRAM Processing Units* (*DPUs*), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present *PrIM* (*Processing-In-Memory benchmarks*), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.

https://arxiv.org/pdf/2105.03814.pdf

### Understanding a Modern PIM Architecture



https://www.youtube.com/watch?v=D8Hjy2iU9I4&list=PL5Q2soXY2Zi\_tOTAYm--dYByNPL7JhwR9
