# P&S Processing-in-Memory

Real-World Processing-in-Memory Architectures:
Samsung AxDIMM

Dr. Juan Gómez Luna, Prof. Onur Mutlu, ETH Zürich, Fall 2022

22 November 2022

## UPMEM Processing-in-DRAM Engine (2019)

- Processing in DRAM Engine
- Includes standard DIMM modules, with a large number of DPU processors combined with DRAM chips.
- Replaces standard DIMMs





#### **UPMEM PIM System Organization**

- A UPMEM DIMM contains 8 or 16 chips
  - Thus, 1 or 2 ranks of 8 chips each
- Inside each PIM chip there are:
  - 8 64MB banks per chip: Main RAM (MRAM) banks
  - 8 DRAM Processing Units (DPUs) in each chip, 64 DPUs per rank
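The per-chip figures above multiply out to per-rank and per-DIMM totals. A minimal Python sketch of that arithmetic follows; the function name `upmem_dimm_totals` is made up for illustration, and it assumes every chip is populated exactly as the slide lists (8 banks of 64 MB and 8 DPUs per chip).

```python
CHIPS_PER_RANK = 8   # from the slide: ranks of 8 chips each
DPUS_PER_CHIP = 8    # from the slide
BANKS_PER_CHIP = 8   # from the slide: 8 MRAM banks per chip
MRAM_BANK_MB = 64    # from the slide: 64 MB per MRAM bank


def upmem_dimm_totals(ranks):
    """Return chip, DPU, and MRAM totals for a DIMM with the given rank count."""
    chips = ranks * CHIPS_PER_RANK
    dpus = chips * DPUS_PER_CHIP
    mram_mb = chips * BANKS_PER_CHIP * MRAM_BANK_MB
    return {"chips": chips, "dpus": dpus, "mram_mb": mram_mb}


for ranks in (1, 2):
    print(ranks, upmem_dimm_totals(ranks))
```

With two ranks this gives 16 chips, 128 DPUs, and 8192 MB of MRAM per DIMM.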



# FIMDRAM: Chip Structure

#### FIMDRAM based on HBM2



[3D Chip Structure of HBM with FIMDRAM]

# FIMDRAM: System Organization (III)

- PIM units respond to standard DRAM column commands (RD or WR)
  - Compliant with unmodified JEDEC controllers
- They execute one wide-SIMD operation commanded by a PIM instruction with deterministic latency in a lock-step manner
- A PIM unit can get 16 16-bit operands from IOSAs, a register, and/or the result bus
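To make the lock-step, wide-SIMD behaviour concrete, here is a rough functional sketch in Python with NumPy. It is not a model of the real hardware: the function name `lockstep_mac`, the choice of a multiply-accumulate as the operation, and the use of FP16 for the 16-bit operands are all assumptions made for illustration. The only details taken from the slide are the 16 lanes of 16-bit operands and the fact that every PIM unit executes the same instruction at the same time.

```python
import numpy as np

LANES = 16  # from the slide: 16 operands of 16 bits each per PIM unit


def lockstep_mac(bank_rows, registers, accumulators):
    """Apply one multiply-accumulate to all PIM units at once.

    Each array has shape (num_units, LANES). Row u holds the operands of
    PIM unit u: bank_rows stands in for data from the IOSAs, registers for
    a register operand, accumulators for the previous result (hypothetical
    operand sources chosen for this sketch).
    """
    assert bank_rows.shape == registers.shape == accumulators.shape
    assert bank_rows.shape[1] == LANES
    # One vectorized expression = one instruction broadcast to every unit.
    return (accumulators + bank_rows * registers).astype(np.float16)


units = 4
rows = np.full((units, LANES), 2, dtype=np.float16)
regs = np.full((units, LANES), 3, dtype=np.float16)
acc = np.ones((units, LANES), dtype=np.float16)
print(lockstep_mac(rows, regs, acc)[0])
```

Because all units run the same operation on the same cycle, the host can predict the latency of each instruction, which is what allows an unmodified memory controller to drive the PIM units with ordinary RD and WR commands.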



#### AiM: Chip Implementation

4 Gb AiM die with 16 processing units (PUs)

#### **AiM Die Photograph**



#### Area of One Processing Unit (PU)

| Component                | Area                 |
|--------------------------|----------------------|
| MAC                      | 0.11 mm<sup>2</sup>  |
| Activation Function (AF) | 0.02 mm<sup>2</sup>  |
| Reservoir Cap.           | 0.05 mm<sup>2</sup>  |
| Etc.                     | 0.01 mm<sup>2</sup>  |
| Total                    | 0.19 mm<sup>2</sup>  |



## AiM: System Organization

#### GDDR6-based AiM architecture



# Samsung AxDIMM

# Samsung AxDIMM (2021)



#### **DRAM Modules Powered by PIM**



The Acceleration DIMM (AXDIMM) brings processing to the DRAM module itself, minimizing large data movement between the CPU and DRAM to boost the energy efficiency of AI accelerator systems. With an AI engine built inside the buffer chip, the AXDIMM can perform parallel processing of multiple memory ranks (sets of DRAM chips) instead of accessing just one rank at a time, greatly enhancing system performance and efficiency. Since the module can retain its traditional DIMM form factor, the AXDIMM facilitates drop-in replacement without requiring system modifications. Currently being tested on customer servers, the AXDIMM can offer approximately twice the performance in AI-based recommendation applications and a 40% decrease in system-wide energy usage.

- DIMM-based PIM
  - DLRM recommendation system





#### **AxDIMM System**





#### General-Purpose Near-Rank Approach

Memory Channel Network (MCN) DIMMs



#### PnM with AxDIMM (IEEE Micro 2021)

#### Near-Memory Processing in Action: Accelerating Personalized Recommendation with AxDIMM

Liu Ke\*<sup>†</sup>, Xuan Zhang<sup>†</sup>, Jinin So<sup>‡</sup>, Jong-Geon Lee<sup>‡</sup>, Shin-Haeng Kang<sup>‡</sup>, Sukhan Lee<sup>‡</sup>, Songyi Han<sup>‡</sup>, YeonGon Cho<sup>‡</sup>, Jin Hyun Kim<sup>‡</sup>, Yongsuk Kwon<sup>‡</sup>, KyungSoo Kim<sup>‡</sup>, Jin Jung<sup>‡</sup>, Ilkwon Yun<sup>‡</sup>, Sung Joo Park<sup>‡</sup>, Hyunsun Park<sup>‡</sup>, Joonho Song<sup>‡</sup>, Jeonghyeon Cho<sup>‡</sup>, Kyomin Sohn<sup>‡</sup>, Nam Sung Kim<sup>‡</sup>, Hsien-Hsin S. Lee\*

\*Facebook, †Washington University in St. Louis, ‡Samsung

# Recommendation Systems

#### Candidate recommendations are retrieved and then ranked



Covington et al., Deep Neural Networks for YouTube Recommendations, RecSys 2016



Naumov et al., Deep Learning Recommendation Model for Personalization and Recommendation Systems, arXiv:1906.00091, 2019



Li et al., iMARS: An In-Memory-Computing Architecture for Recommendation Systems, arXiv:2202.09433, 2022

#### Overview of Recommendation Models

- Personalized recommendation: recommend content to users, e.g., Facebook's DLRM recommendation system



Dense features: continuous inputs in vectors and matrices are processed by typical DNN layers (e.g., fully connected layers)

Embedding tables are organized as a set of potentially millions of vectors. Sparse features are represented through lookup and pooling operations over these vectors, which are learned during training. These operations generally exhibit a Gather-Reduce pattern and are implemented with Caffe2's SparseLengths (SLS) operators.
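The Gather-Reduce pattern can be illustrated with a short NumPy sketch. This is a plain reimplementation of the idea behind a SparseLengths-style sum, not Caffe2's code; the function name `sparse_lengths_sum` and its argument layout (a flat index list plus one length per pooled output) are assumptions chosen for this example.

```python
import numpy as np


def sparse_lengths_sum(table, indices, lengths):
    """Gather rows of `table` and sum them in consecutive segments.

    indices: flat list of row ids for all lookups in the batch.
    lengths: how many of those ids belong to each pooled output.
    """
    indices = np.asarray(indices, dtype=np.int64)
    lengths = np.asarray(lengths, dtype=np.int64)
    assert lengths.sum() == len(indices)
    out = np.zeros((len(lengths), table.shape[1]), dtype=table.dtype)
    start = 0
    for i, n in enumerate(lengths):
        rows = table[indices[start:start + n]]  # gather (irregular accesses)
        out[i] = rows.sum(axis=0)               # reduce (element-wise sum)
        start += n
    return out


table = np.arange(12, dtype=np.float32).reshape(6, 2)  # 6 toy embedding vectors
print(sparse_lengths_sum(table, [0, 2, 5, 1, 1], [3, 2]))
# [[14. 17.]
#  [ 4.  6.]]
```

Each output element needs one addition per gathered element, so the work is dominated by reading rows from scattered locations in a large table rather than by arithmetic.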

#### DLRM Performance Characterization

Identifying key performance bottlenecks for the DLRM system







#### SparseLengths (SLS) operators:

- Low FP intensity
- Larger batch size:
  - Higher memory footprint
  - Higher memory intensity

Embedding operations can easily saturate the memory bandwidth, especially as both the batch size and the number of threads increase.

#### RecNMP Architecture

- DIMM-based NMP architecture for recommendation systems
  - Multiply the bandwidth by exploiting rank-level parallelism



Embedding entries are fetched from the concurrently activated ranks


The NMP PU performs the local embedding lookup and pooling functions at the memory side, producing the general Gather-Reduce execution pattern


Element-wise summation of the embedding entries is performed inside the NMP PU, and the final pooling result is transferred back to host
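A small functional sketch of the rank-level idea follows, in Python with NumPy. It only models the data flow: each rank reduces the embedding rows it holds, and a single pooled vector is returned instead of every gathered row. The function name `rank_parallel_pool`, the row-to-rank mapping (row `i` lives in rank `i % num_ranks`), and the final combination of per-rank partial sums are assumptions made for illustration, not details from the slides.

```python
import numpy as np


def rank_parallel_pool(table, indices, num_ranks=2):
    """Pool embedding rows using one partial sum per rank.

    Returns (pooled_vector, per_rank_partials).
    """
    indices = np.asarray(indices, dtype=np.int64)
    partials = np.zeros((num_ranks, table.shape[1]), dtype=table.dtype)
    for rank in range(num_ranks):
        # Hypothetical interleaving: row i is stored in rank i % num_ranks.
        local = indices[indices % num_ranks == rank]
        # In hardware the ranks would work concurrently; this loop is serial.
        partials[rank] = table[local].sum(axis=0)
    return partials.sum(axis=0), partials


table = np.arange(12, dtype=np.float32).reshape(6, 2)
pooled, partials = rank_parallel_pool(table, [0, 2, 5, 1, 1])
print(partials)  # per-rank partial sums
print(pooled)    # the only vector that crosses the memory channel
```

In this toy run five rows are gathered but only one vector of the same width is sent back, which is the data-movement saving the near-memory design targets.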

# AxDIMM Design: Overview

- Accelerator DIMM (AxDIMM)
  - DDR4-compatible FPGA-based platform with standard memory interfaces
- AxDIMM can potentially
  - support both in-order general-purpose processor and specialized accelerator modules
  - be an interesting prototyping platform for near-memory processing
- Personalized recommendation case study, including:
  - hardware implementation
  - software-stack support

## AxDIMM System







The system was slowed down (memory channel at 1/3 of the normal DDR4 speed; CPU clock reduced from 3.2 GHz to 1.2 GHz) so that the FPGA I/O could keep up.

# AxDIMM Hardware & Architecture



Standard DIMM Interface

#### FPGA board with standard DIMM interface:

It serves as a real-system near-memory processing implementation



Standard DIMM Interface

#### Rank-level parallelism:

Two DRAM ranks are activated in parallel to load embedding entries from memory

Element-wise summation is performed inside the FPGA module



DDR4 slave PHY receives DRAM commands and NMP instructions (via DQ pins) from the host side



The memory interface generator (MIG) supports the internal rank accesses between Rank-NMP and the DRAM device















# AxDIMM Design: Address Map

Memory map of AxDIMM





# AxDIMM Execution Flow









#### More Real-World PIM to Come





#### NeuroBladers build a processing-in-memory analytics chip and server

By Chris Mellor - October 6, 2021









An Israeli startup called NeuroBlade has exited stealth mode, built a processing-in-memory (PIM) analytics chip combining DRAM and thousands of cores, put four of them in an analytics accelerating server appliance box, and taken in \$83 million in B-round funding.

The idea is to take a GPU approach to big data-style analytics and AI software by employing a massively parallel core design, but take it further by layering the cores on DRAM with a wide I/O bus architecture design linking the cores and memory to speed processing even more. This design vastly reduces data movement between storage and memory and also accelerates data transfer between memory and processing cores.

#### NeuroBlade Patent (I)

| Field                                        | Value                                                                                                        |
|----------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| Document                                     | United States Patent, Sity et al.                                                                            |
| (10) Patent No.                              | US 10,762,034 B2                                                                                             |
| (45) Date of Patent                          | Sep. 1, 2020                                                                                                 |
| (54) Title                                   | MEMORY-BASED DISTRIBUTED PROCESSOR ARCHITECTURE                                                              |
| (71) Applicant                               | NeuroBlade, Ltd., Hod-Hashron (IL)                                                                           |
| (72) Inventors                               | Elad Sity, Kfar Saba (IL); Eliad Hillel, Kfar Saba (IL)                                                      |
| (73) Assignee                                | NeuroBlade, Ltd., Hod-Hashron (IL)                                                                           |
| (\*) Notice                                  | Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days. |
| (21) Appl. No.                               | 16/512,590                                                                                                   |
| (22) Filed                                   | Jul. 16, 2019                                                                                                |
| (56) References Cited: U.S. Patent Documents | 4,837,747 A \* 6/1989 Dosaka, G11C 8/12; 5,155,729 A 10/1992 Rysko et al. (Continued)                         |
| Foreign Patent Documents                     | CA 2 149 479 C 5/2001                                                                                        |
| Other Publications                           | Ahn et al., "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing," ISCA '15 (Jun. 13-17, 2015), pp. 105-117. |

#### (57) ABSTRACT

Distributed processors and methods for compiling code for execution by distributed processors are disclosed. In one implementation, a distributed processor may include a substrate; a memory array disposed on the substrate; and a processing array disposed on the substrate. The memory array may include a plurality of discrete memory banks, and the processing array may include a plurality of processor subunits, each one of the processor subunits being associated with a corresponding, dedicated one of the plurality of discrete memory banks. The distributed processor may further include a first plurality of buses, each connecting one of the plurality of processor subunits to its corresponding, dedicated memory bank, and a second plurality of buses, each connecting one of the plurality of processor subunits to another of the plurality of processor subunits.

#### NeuroBlade Patent (II)





#### NeuroBlade: Xiphos

- PIM XRAM chip
  - IMPU (Intensive Memory Processing Unit)
- x86 CPU, 32 NVMe SSDs
- PCIe fabric: "Everything is connected on top of PCIe fabric."
- Wide I/O bus: multiple x16 PCIe buses



Xiphos appliance.

#### NeuroBlade: Software Suite

- Xiphos SW suite: Insights API
  - APIs for third-party applications and web client
- Data I/O
  - ETL process populates and updates local storage
- Query Compiler
  - Generates query execution plans
- Tools
  - E.g., visual profiler
- TPC benchmarks and queries



#### Hybrid Bonding with PnM Engine (ISSCC 2022)

#### ISSCC 2022 / SESSION 29 / ML CHIPS FOR EMERGING A

# 29.1 184QPS/W 64Mb/mm<sup>2</sup> 3D Logic-to-DRAM Hybrid Bonding with Process-Near-Memory Engine for Recommendation System

Dimin Niu<sup>1</sup>, Shuangchen Li<sup>1</sup>, Yuhao Wang<sup>1</sup>, Wei Han<sup>1</sup>, Zhe Zhang<sup>2</sup>, Yijin Guan<sup>2</sup>, Tianchan Guan<sup>3</sup>, Fei Sun<sup>1</sup>, Fei Xue<sup>1</sup>, Lide Duan<sup>1</sup>, Yuanwei Fang<sup>1</sup>, Hongzhong Zheng<sup>1</sup>, Xiping Jiang<sup>4</sup>, Song Wang<sup>4</sup>, Fengguo Zuo<sup>4</sup>, Yubing Wang<sup>4</sup>, Bing Yu<sup>4</sup>, Qiwei Ren<sup>4</sup>, Yuan Xie<sup>1</sup>

<sup>1</sup>Alibaba DAMO Academy, Sunnyvale, CA; <sup>2</sup>Alibaba DAMO Academy, Beijing, China; <sup>3</sup>Alibaba DAMO Academy, Shanghai, China; <sup>4</sup>UniIC, Xi'an, China

#### HB-PNM: Overall Architecture (I)

- 3D-stacked logic die and DRAM die vertically bonded by hybrid bonding (HB)



#### HB-PNM: Overall Architecture (II)

- Match engine and neural engine for matching and ranking in a recommendation system



# Upcoming Lectures

More real-world PIM architectures

PUM architectures and prototypes

Enabling the adoption of PIM

#### Another Lecture on AxDIMM

