#### Computer Architecture #### Lecture 17a: Emerging Memory Technologies II Prof. Onur Mutlu ETH Zürich Fall 2021 25 November 2021 #### Solution 2: Emerging Memory Technologies - Some emerging resistive memory technologies seem more scalable than DRAM (and they are non-volatile) - Example: Phase Change Memory - Data stored by changing phase of material - Data read by detecting material's resistance - Expected to scale to 9nm (2022 [ITRS 2009]) - Prototyped at 20nm (Raoux+, IBM JRD 2008) Can they be enabled to replace/augment/surpass DRAM? #### Solution 2: Emerging Memory Technologies - Lee+, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA'09, CACM'10, IEEE Micro'10. - Meza+, "Enabling Efficient and Scalable Hybrid Memories," IEEE Comp. Arch. Letters 2012. - Yoon, Meza+, "Row Buffer Locality Aware Caching Policies for Hybrid Memories," ICCD 2012. - Kultursay+, "Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative," ISPASS 2013. - Meza+, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory," WEED 2013. - Lu+, "Loose Ordering Consistency for Persistent Memory," ICCD 2014. - Zhao+, "FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems," MICRO 2014. - Yoon, Meza+, "Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories," TACO 2014. - Ren+, "ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems," MICRO 2015. - Chauhan+, "NVMove: Helping Programmers Move to Byte-Based Persistence," INFLOW 2016. - Li+, "Utility-Based Hybrid Memory Management," CLUSTER 2017. - Yu+, "Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation," MICRO 2017. - Tavakkol+, "MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices," FAST 2018. - Tavakkol+, "FLIN: Enabling Fairness and Enhancing Performance in Modern NVMe Solid State Drives," ISCA 2018. - Sadrosadati+. 
"LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching," ASPLOS 2018. - Salkhordeh+, "An Analytical Model for Performance and Lifetime Estimation of Hybrid DRAM-NVM Main Memories," TC 2019. - Wang+, "Panthera: Holistic Memory Management for Big Data Processing over Hybrid Memories," PLDI 2019. - Song+, "Enabling and Exploiting Partition-Level Parallelism (PALP) in Phase Change Memories," CASES 2019. - Liu+, "Binary Star: Coordinated Reliability in Heterogeneous Memory Systems for High Performance and Scalability," MICRO'19. - Song+, "Improving Phase Change Memory Performance with Data Content Aware Access," ISMM 2020. - Yavits+, "WoLFRaM: Enhancing Wear-Leveling and Fault Tolerance in Resistive Memories using Programmable Address Decoders," ICCD 2020. - Song+, "Aging-Aware Request Scheduling for Non-Volatile Main Memory," ASP-DAC 2021. #### Intel Optane Persistent Memory (2019) - Non-volatile main memory - Based on 3D-XPoint Technology #### PCM as Main Memory: Idea in 2009 Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative" Proceedings of the 36th International Symposium on Computer Architecture (ISCA), pages 2-13, Austin, TX, June 2009. Slides (pdf) One of the 13 computer architecture papers of 2009 selected as Top Picks by IEEE Micro. Selected as a CACM Research Highlight. #### Architecting Phase Change Memory as a Scalable DRAM Alternative Benjamin C. Lee† Engin Ipek† Onur Mutlu‡ Doug Burger† †Computer Architecture Group Microsoft Research Redmond, WA {blee, ipek, dburger}@microsoft.com ‡Computer Architecture Laboratory Carnegie Mellon University Pittsburgh, PA onur@cmu.edu #### PCM as Main Memory: Idea in 2009 Benjamin C. 
Lee, Ping Zhou, Jun Yang, Youtao Zhang, Bo Zhao, Engin Ipek, Onur Mutlu, and Doug Burger, "Phase Change Technology and the Future of Main Memory" IEEE Micro, Special Issue: Micro's Top Picks from 2009 Computer Architecture Conferences (MICRO TOP PICKS), Vol. 30, No. 1, pages 60-70, January/February 2010. ## PHASE-CHANGE TECHNOLOGY AND THE FUTURE OF MAIN MEMORY #### More on PCM Based Main Memory HanBin Yoon, Justin Meza, Naveen Muralimanohar, Norman P. Jouppi, and Onur Mutlu, "Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories" ACM Transactions on Architecture and Code Optimization (TACO), Vol. 11, No. 4, December 2014. [Slides (ppt) (pdf)] Presented at the 10th HiPEAC Conference, Amsterdam, Netherlands, January 2015. [Slides (ppt) (pdf)] Best (student) presentation award. #### Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories HANBIN YOON\* and JUSTIN MEZA, Carnegie Mellon University NAVEEN MURALIMANOHAR, Hewlett-Packard Labs NORMAN P. JOUPPI\*\*, Google Inc. ONUR MUTLU, Carnegie Mellon University #### More on STT-MRAM as Main Memory Emre Kultursay, Mahmut Kandemir, Anand Sivasubramaniam, and Onur Mutlu, "Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative" Proceedings of the <u>2013 IEEE International Symposium on</u> <u>Performance Analysis of Systems and Software</u> (**ISPASS**), Austin, TX, April 2013. <u>Slides (pptx) (pdf)</u> ## Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative Emre Kültürsay\*, Mahmut Kandemir\*, Anand Sivasubramaniam\*, and Onur Mutlu<sup>†</sup> \*The Pennsylvania State University and <sup>†</sup>Carnegie Mellon University #### Hybrid Main Memory #### A More Viable Approach: Hybrid Memory Systems Hardware/software manage data allocation and movement to achieve the best of multiple technologies Meza+, "Enabling Efficient and Scalable Hybrid Memories," IEEE Comp. Arch. Letters, 2012. 
Yoon+, "Row Buffer Locality Aware Caching Policies for Hybrid Memories," ICCD 2012 Best Paper Award.

#### Challenge and Opportunity

# Providing the Best of Multiple Metrics with Multiple Memory Technologies

#### Challenge and Opportunity

Heterogeneous, Configurable, Programmable Memory Systems

#### Hybrid Memory Systems: Issues

- Cache vs. Main Memory
- Granularity of Data Move/Management: Fine or Coarse
- Hardware vs. Software vs. HW/SW Cooperative
- When to migrate data?
- How to design a scalable and efficient large cache?
- ...

#### Data Placement in Hybrid Memory

### Which memory do we place each page in, to maximize system performance?

- Memory A is fast, but small
- Load should be balanced on both channels?
- Page migrations have performance and energy overhead

#### Data Placement Between DRAM and PCM

- Idea: Characterize data access patterns and guide data placement in hybrid memory
- Streaming accesses: As fast in PCM as in DRAM
- Random accesses: Much faster in DRAM
- Idea: Place random access data with some reuse in DRAM; streaming data in PCM
- Yoon+, "Row Buffer Locality-Aware Data Placement in Hybrid Memories," ICCD 2012 Best Paper Award.

#### Key Observation & Idea

- Row buffers exist in both DRAM and PCM
- Row hit latency similar in DRAM & PCM [Lee+ ISCA'09]
- Row miss latency small in DRAM, large in PCM
- Place data in DRAM which
  - is likely to miss in the row buffer (low row buffer locality) → miss penalty is smaller in DRAM

    AND
  - is reused many times → cache only the data worth the movement cost and DRAM space

#### Hybrid vs. All-PCM/DRAM [ICCD'12]

#### More on Hybrid Memory Data Placement

HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael Harding, and Onur Mutlu, "Row Buffer Locality Aware Caching Policies for Hybrid Memories" Proceedings of the <u>30th IEEE International Conference on Computer</u> <u>Design</u> (**ICCD**), Montreal, Quebec, Canada, September 2012.
<u>Slides</u> (pptx) (pdf) Best paper award (in Computer Systems and Applications track). ### Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael A. Harding and Onur Mutlu Carnegie Mellon University {hanbinyoon,meza,rachata,onur}@cmu.edu, rhardin@mit.edu #### Weaknesses of Existing Solutions - They are all heuristics that consider only a *limited part* of memory access behavior - Do not directly capture the overall system performance impact of data placement decisions - Example: None capture memory-level parallelism (MLP) - Number of concurrent memory requests from the same application when a page is accessed - Affects how much page migration helps performance #### Importance of Memory-Level Parallelism Page migration decisions need to consider MLP #### Our Goal [CLUSTER 2017] A generalized mechanism that - 1. Directly estimates the performance benefit of migrating a page between any two types of memory - 2. Places **only** the **performance-critical data** in the fast memory #### Utility-Based Hybrid Memory Management - A memory manager that works for any hybrid memory - e.g., DRAM-NVM, DRAM-RLDRAM #### Key Idea - For each page, use comprehensive characteristics to calculate estimated *utility* (i.e., performance impact) of migrating page from one memory to the other in the system - Migrate only pages with the highest utility (i.e., pages that improve system performance the most when migrated) - Li+, "Utility-Based Hybrid Memory Management", CLUSTER 2017. #### Key Mechanisms of UH-MEM - For each page, estimate utility using a performance model - Application stall time reduction How much would migrating a page benefit the performance of the application that the page belongs to? - Application performance sensitivity How much does the improvement of a single application's performance increase the *overall* system performance? 
$\text{Utility}_i = \Delta \text{StallTime}_i \times \text{Sensitivity}_i$

- Migrate from slow memory to fast memory only those pages whose utility exceeds the migration threshold
- Periodically adjust the migration threshold

#### Results: System Performance

**UH-MEM improves system performance** over the best state-of-the-art hybrid memory manager

#### Results: Sensitivity to Slow Memory Latency

- We vary $t_{RCD}$ and $t_{WR}$ of the slow memory

[Figure axis label: Slow Memory Latency Multiplier]

**UH-MEM improves system performance** for a wide variety of hybrid memory systems

#### More on UH-MEM

Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, and Onur Mutlu, "Utility-Based Hybrid Memory Management" Proceedings of the <u>19th IEEE Cluster Conference</u> (**CLUSTER**), Honolulu, Hawaii, USA, September 2017. [Slides (pptx) (pdf)]

#### **Utility-Based Hybrid Memory Management**

Yang Li $^{\dagger}$ Saugata Ghose $^{\dagger}$ Jongmoo Choi $^{\ddagger}$ Jin Sun $^{\dagger}$ Hui Wang $^{\star}$ Onur Mutlu $^{\dagger\dagger}$

$^{\dagger}$ Carnegie Mellon University $^{\ddagger}$ Dankook University $^{\star}$ Beihang University $^{\dagger}$ ETH Zürich

#### Challenge and Opportunity

# Enabling an Emerging Technology to Augment DRAM

Managing Hybrid Memories

#### Another Challenge

## Designing Effective Large (DRAM) Caches

#### One Problem with Large DRAM Caches

- A large DRAM cache requires a large metadata (tag + block-based information) store
- How do we design an efficient DRAM cache?
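To get a feel for why the metadata store becomes a problem, here is a rough back-of-the-envelope sketch. The cache capacity (1 GiB), the block sizes, and the 4 bytes of metadata per block are illustrative assumptions, not figures from the lecture.

```python
# Back-of-the-envelope metadata sizing for a DRAM cache.
# All numbers below are illustrative assumptions, not values from the lecture.

GIB = 1 << 30
MIB = 1 << 20


def metadata_store_bytes(cache_bytes: int, block_bytes: int,
                         metadata_bytes_per_block: int) -> int:
    """Return the total metadata (tag + per-block info) size in bytes."""
    num_blocks = cache_bytes // block_bytes
    return num_blocks * metadata_bytes_per_block


CACHE_BYTES = 1 * GIB          # assumed DRAM cache capacity
METADATA_BYTES_PER_BLOCK = 4   # assumed tag + state bits, rounded up

sizes = {}
for block_bytes in (64, 256, 4096):
    sizes[block_bytes] = metadata_store_bytes(
        CACHE_BYTES, block_bytes, METADATA_BYTES_PER_BLOCK)
    print(f"{block_bytes:>5} B blocks -> "
          f"{sizes[block_bytes] / MIB:g} MiB of metadata")
```

Under these assumptions, 64 B blocks need 64 MiB of metadata, which is hard to hold on chip, while 4 KiB blocks need only 1 MiB but move far more data per cache fill.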
#### Idea 1: Tags in Memory

- Store tags in the same row as data in DRAM
  - Store metadata in same row as their data
  - Data and metadata can be accessed together
- Benefit: No on-chip tag storage overhead
- Downsides:
  - Cache hit determined only after a DRAM access
  - Cache hit requires two DRAM accesses

#### Idea 2: Cache Tags in SRAM

- Recall Idea 1: Store all metadata in DRAM
  - To reduce metadata storage overhead
- Idea 2: Cache frequently-accessed metadata in on-chip SRAM
  - Cache only a small amount to keep SRAM size small

#### Idea 3: Dynamic Data Transfer Granularity

- Some applications benefit from caching more data
  - They have good spatial locality
- Others do not
  - Large granularity wastes bandwidth and reduces cache utilization
- Idea 3: Simple dynamic caching granularity policy
  - Cost-benefit analysis to determine best DRAM cache block size
  - Group main memory into sets of rows
  - Different sampled row sets follow different fixed caching granularities
  - The rest of main memory follows the best granularity
    - Cost-benefit analysis: access latency versus number of caching operations
    - Performed every quantum

#### **TIMBER Performance**

Meza, Chang, Yoon, Mutlu, Ranganathan, "Enabling Efficient and Scalable Hybrid Memories," IEEE Comp. Arch. Letters, 2012.

#### TIMBER Energy Efficiency

Meza, Chang, Yoon, Mutlu, Ranganathan, "Enabling Efficient and Scalable Hybrid Memories," IEEE Comp. Arch. Letters, 2012.

#### On Large DRAM Cache Design

Justin Meza, Jichuan Chang, HanBin Yoon, Onur Mutlu, and Parthasarathy Ranganathan, "Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management" IEEE Computer Architecture Letters (CAL), February 2012.
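Idea 2 above (keep all metadata in DRAM, cache the frequently used part in SRAM) can be sketched as a small LRU structure in front of an in-DRAM tag store. This is a minimal illustration under assumed names (`SramMetadataCache`, `dram_tag_store`) and an assumed LRU policy and per-row granularity; it is not the organization used in the cited paper.

```python
from collections import OrderedDict


class SramMetadataCache:
    """Tiny LRU cache of per-row metadata (hypothetical, for illustration)."""

    def __init__(self, capacity_rows: int):
        self.capacity_rows = capacity_rows
        self.entries = OrderedDict()   # row id -> metadata for that row
        self.dram_metadata_reads = 0   # extra DRAM accesses spent on metadata

    def lookup(self, row: int, dram_tag_store: dict):
        """Return the metadata for `row`, reading DRAM only on an SRAM miss."""
        if row in self.entries:
            self.entries.move_to_end(row)       # mark as most recently used
            return self.entries[row]
        self.dram_metadata_reads += 1           # SRAM miss: fetch from DRAM
        metadata = dram_tag_store[row]
        self.entries[row] = metadata
        if len(self.entries) > self.capacity_rows:
            self.entries.popitem(last=False)    # evict least recently used
        return metadata


# Assumed in-DRAM tag store: row id -> set of block tags cached in that row.
dram_tag_store = {0: {0x10, 0x11}, 1: {0x20}, 2: {0x30, 0x31}}

cache = SramMetadataCache(capacity_rows=2)
trace = [0, 0, 1, 0, 2, 0, 1]                   # rows touched by accesses
for row in trace:
    cache.lookup(row, dram_tag_store)

print(f"{cache.dram_metadata_reads} metadata reads from DRAM "
      f"for {len(trace)} accesses")
```

With locality in the access stream, most lookups are resolved in SRAM, so hit/miss is known without first going to DRAM for the tags.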
#### Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management

Justin Meza\* Jichuan Chang† HanBin Yoon\* Onur Mutlu\* Parthasarathy Ranganathan†

\*Carnegie Mellon University †Hewlett-Packard Labs

{meza,hanbinyoon,onur}@cmu.edu {jichuan.chang,partha.ranganathan}@hp.com

#### DRAM Caches: Many Recent Options

**Table 1: Summary of Operational Characteristics of Different State-of-the-Art DRAM Cache Designs** – We assume perfect way prediction for Unison Cache. Latency is relative to the access time of the off-package DRAM (see Section 6 for baseline latencies). We use different colors to indicate the high (dark red), medium (white), and low (light green) overhead of a characteristic.

| Scheme | DRAM Cache Hit | DRAM Cache Miss | Replacement Traffic | Replacement Decision | Large Page Caching |
|---|---|---|---|---|---|
| Unison [32] | In-package traffic: 128 B (data + tag read and update); Latency: ~1x | In-package traffic: 96 B (spec. data + tag read); Latency: ~2x | On every miss; Footprint size [31] | Hardware managed, set-associative, LRU | Yes |
| Alloy [50] | In-package traffic: 96 B (data + tag read); Latency: ~1x | In-package traffic: 96 B (spec. data + tag read); Latency: ~2x | On some misses; Cacheline size (64 B) | Hardware managed, direct-mapped, stochastic [20] | Yes |
| TDC [38] | In-package traffic: 64 B; Latency: ~1x; TLB coherence | In-package traffic: 0 B; Latency: ~1x; TLB coherence | On every miss; Footprint size [28] | Hardware managed, fully-associative, FIFO | No |
| HMA [44] | In-package traffic: 64 B; Latency: ~1x | In-package traffic: 0 B; Latency: ~1x | Software managed, high replacement cost | | Yes |
| Banshee (This work) | In-package traffic: 64 B; Latency: ~1x | In-package traffic: 0 B; Latency: ~1x | Only for hot pages; Page size (4 KB) | Hardware managed, set-associative, frequency based | Yes |

Yu+, "Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation," MICRO 2017.

# Banshee [MICRO 2017]

- Tracks presence in cache using TLB and Page Table
- No tag store needed for DRAM cache
- Enabled by a new lightweight lazy TLB coherence protocol
- New bandwidth-aware frequency-based replacement policy

#### More on Banshee

Xiangyao Yu, Christopher J. Hughes, Nadathur Satish, Onur Mutlu, and Srinivas Devadas, "Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation" Proceedings of the 50th International Symposium on Microarchitecture (MICRO), Boston, MA, USA, October 2017.

# Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation

Xiangyao Yu<sup>1</sup> Christopher J.
Hughes<sup>2</sup> Nadathur Satish<sup>2</sup> Onur Mutlu<sup>3</sup> Srinivas Devadas<sup>1</sup> <sup>1</sup>MIT <sup>2</sup>Intel Labs <sup>3</sup>ETH Zürich # Other Opportunities with Emerging Technologies - Merging of memory and storage - e.g., a single interface to manage all data - New applications - e.g., ultra-fast checkpoint and restore - More robust system design - e.g., reducing data loss - Processing tightly-coupled with memory - e.g., enabling efficient search and filtering # Recall: Processing Using Memory # In-Memory Bulk Bitwise Operations - We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ - At low cost - Using analog computation capability of DRAM - Idea: activating multiple rows performs computation - 30-60X performance and energy improvement - Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology," MICRO 2017. - New memory technologies enable even more opportunities - Memristors, resistive RAM, phase change mem, STT-MRAM, ... - Can operate on data with minimal movement # In-DRAM Bulk Bitwise AND/OR Vivek Seshadri, Kevin Hsieh, Amirali Boroumand, Donghyuk Lee, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry, "Fast Bulk Bitwise AND and OR in DRAM" IEEE Computer Architecture Letters (CAL), April 2015. # Fast Bulk Bitwise AND and OR in DRAM Vivek Seshadri\*, Kevin Hsieh\*, Amirali Boroumand\*, Donghyuk Lee\*, Michael A. Kozuch<sup>†</sup>, Onur Mutlu\*, Phillip B. Gibbons<sup>†</sup>, Todd C. Mowry\* \*Carnegie Mellon University †Intel Pittsburgh # Ambit: Bulk-Bitwise in-DRAM Computation Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology" Proceedings of the <u>50th International Symposium on</u> Microarchitecture (MICRO), Boston, MA, USA, October 2017. 
[Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)] Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology Vivek Seshadri<sup>1,5</sup> Donghyuk Lee<sup>2,5</sup> Thomas Mullins<sup>3,5</sup> Hasan Hassan<sup>4</sup> Amirali Boroumand<sup>5</sup> Jeremie Kim<sup>4,5</sup> Michael A. Kozuch<sup>3</sup> Onur Mutlu<sup>4,5</sup> Phillip B. Gibbons<sup>5</sup> Todd C. Mowry<sup>5</sup> $^1$ Microsoft Research India $^2$ NVIDIA Research $^3$ Intel $^4$ ETH Zürich $^5$ Carnegie Mellon University # In-DRAM Bulk Bitwise Execution Paradigm Vivek Seshadri and Onur Mutlu, "In-DRAM Bulk Bitwise Execution Engine" Invited Book Chapter in Advances in Computers, to appear in 2020. [Preliminary arXiv version] # In-DRAM Bulk Bitwise Execution Engine Vivek Seshadri Microsoft Research India visesha@microsoft.com Onur Mutlu ETH Zürich onur.mutlu@inf.ethz.ch # SIMDRAM Framework for in-DRAM Computing Nastaran Hajinazar, Geraldo F. Oliveira, Sven Gregorio, Joao Dinis Ferreira, Nika Mansouri Ghiasi, Minesh Patel, Mohammed Alser, Saugata Ghose, Juan Gomez-Luna, and Onur Mutlu, "SIMDRAM: An End-to-End Framework for Bit-Serial SIMD Computing in DRAM" Proceedings of the 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Virtual, March-April 2021. [2-page Extended Abstract] [Short Talk Slides (pptx) (pdf)] [Talk Slides (pptx) (pdf)] [Short Talk Video (5 mins)] [Full Talk Video (27 mins)] # SIMDRAM: A Framework for Bit-Serial SIMD Processing using DRAM \*Nastaran Hajinazar<sup>1,2</sup> Nika Mansouri Ghiasi<sup>1</sup> \*Geraldo F. 
Oliveira<sup>1</sup> Minesh Patel<sup>1</sup> Juan Gómez-Luna<sup>1</sup> Sven Gregorio<sup>1</sup> Mohammed Alser<sup>1</sup> Onur Mutlu<sup>1</sup> João Dinis Ferreira<sup>1</sup> Saugata Ghose<sup>3</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Simon Fraser University <sup>3</sup>University of Illinois at Urbana–Champaign

# Lecture on RowClone & Processing using DRAM

# Lecture on Processing using Memory (I)

# Lecture on Processing using Memory (II)

# Pinatubo: RowClone and Bitwise Ops in PCM

# Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-volatile Memories

Shuangchen Li<sup>1</sup>\*, Cong Xu<sup>2</sup>, Qiaosha Zou<sup>1,5</sup>, Jishen Zhao<sup>3</sup>, Yu Lu<sup>4</sup>, and Yuan Xie<sup>1</sup>

University of California, Santa Barbara<sup>1</sup>, Hewlett Packard Labs<sup>2</sup>, University of California, Santa Cruz<sup>3</sup>, Qualcomm Inc.<sup>4</sup>, Huawei Technologies Inc.<sup>5</sup>

{shuangchenli, yuanxie}@ece.ucsb.edu<sup>1</sup>

# Pinatubo: RowClone and Bitwise Ops in PCM

Figure 2: Overview: (a) Computing-centric approach, moving tons of data to CPU and write back. (b) The proposed Pinatubo architecture, performs *n*-row bitwise operations inside NVM in one step.

# New: In-Memory Crossbar Array Operations

# In-Memory Crossbar Array Operations

- Some emerging NVM technologies have crossbar array structure
  - Memristors, resistive RAM, phase change mem, STT-MRAM, ...
- Crossbar arrays can be used to perform dot product operations using "analog computation capability"
  - Can operate on multiple pieces of data using Kirchhoff's laws
  - Bitline current is a sum of products of wordline V × (1 / cell R)
  - Computation is in analog domain inside the crossbar array
- Need peripheral circuitry for D->A and A->D conversion of inputs and outputs

# In-Memory Crossbar Computation

Fig. 1. (a) Using a bitline to perform an analog sum of products operation. (b) A memristor crossbar used as a vector-matrix multiplier.
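The sum-of-products behavior described above can be written out numerically. The following is a minimal digital sketch of the ideal analog computation, assuming ideal cells (no wire resistance, noise, or converter quantization); the function name and the example voltages and resistances are made up for illustration.

```python
def crossbar_bitline_currents(wordline_voltages, conductances):
    """Ideal crossbar: conductances[i][j] is 1/R of the cell that connects
    wordline i to bitline j. Returns the current on each bitline."""
    num_bitlines = len(conductances[0])
    currents = [0.0] * num_bitlines
    for voltage, row in zip(wordline_voltages, conductances):
        for j, conductance in enumerate(row):
            # Ohm's law gives the per-cell current V * (1/R);
            # Kirchhoff's current law sums the cell currents on the bitline.
            currents[j] += voltage * conductance
    return currents


# Assumed example values: 3 wordlines, 2 bitlines.
voltages = [1.0, 2.0, 3.0]                      # input vector (volts)
resistances = [[1.0, 2.0],
               [2.0, 4.0],
               [4.0, 1.0]]                      # cell resistances (ohms)
conductances = [[1.0 / r for r in row] for row in resistances]

currents = crossbar_bitline_currents(voltages, conductances)
print(currents)  # one dot product per bitline
```

Each bitline yields one dot product between the input voltage vector and one column of conductances, so the whole array performs a vector-matrix multiplication in a single step.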
# In-Memory Crossbar Computation

# Required Peripheral Circuitry

Shift and add: used to summarize the final output

# An Example of 2D Convolution

[Figure labels: Output feature map; Input feature map]

#### Structure information

- Input: 5 × 5 (blue)
- Kernel (filter): 3 × 3 (grey)
- Output: 5 × 5 (green)

#### **Computation information**

- Stride: 1
- Padding: 1 (white)

$\text{Output Dim} = \dfrac{\text{Input} + 2 \times \text{Padding} - \text{Kernel}}{\text{Stride}} + 1$

# Mapping Computation onto the Crossbar

[Figure labels: A convolution operation in neural network application; An NVM-based PIM array]

# An Overview of NVM-Based PIM System

- NVM-based PIM array: core processing unit for vector-matrix multiplication
- Non-linear function array: processing unit for non-linear functions (e.g., ReLU operations in neural networks)
- Multiplier array: handles element-wise operations

# Example Readings on NVM-Based PIM

- Shafiee+, "ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars", ISCA 2016.
- Chi+, "PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory", ISCA 2016.
- Prezioso+, "Training and Operation of an Integrated Neuromorphic Network based on Metal-Oxide Memristors", Nature 2015.
- Ambrogio+, "Equivalent-accuracy accelerated neural-network training using analogue memory", Nature 2018.
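The output-dimension formula from the 2D convolution example, and the idea of mapping a convolution onto a crossbar, can be checked with a short sketch. It flattens the kernel into one weight vector (standing in for one crossbar column) and feeds each flattened input patch as the input vector. The function names and the all-ones data are illustrative assumptions, and this digital loop only mimics what the analog array would compute.

```python
def conv_output_dim(input_dim, kernel, stride=1, padding=0):
    """Output Dim = (Input + 2*Padding - Kernel) / Stride + 1."""
    return (input_dim + 2 * padding - kernel) // stride + 1


def conv2d_as_vector_matrix(image, kernel, stride=1, padding=0):
    """2D convolution (as used in neural networks) computed as one dot
    product per output element between a flattened patch and the
    flattened kernel."""
    n, k = len(image), len(kernel)
    out_dim = conv_output_dim(n, k, stride, padding)

    # Zero-pad the input feature map.
    size = n + 2 * padding
    padded = [[0] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            padded[r + padding][c + padding] = image[r][c]

    # Flattened kernel: the weights one crossbar column would store.
    weights = [w for row in kernel for w in row]

    output = []
    for r in range(out_dim):
        out_row = []
        for c in range(out_dim):
            # Flattened patch: the vector applied on the wordlines.
            patch = [padded[r * stride + i][c * stride + j]
                     for i in range(k) for j in range(k)]
            out_row.append(sum(x * w for x, w in zip(patch, weights)))
        output.append(out_row)
    return output


image = [[1] * 5 for _ in range(5)]    # assumed 5x5 input, all ones
kernel = [[1] * 3 for _ in range(3)]   # assumed 3x3 kernel, all ones
output = conv2d_as_vector_matrix(image, kernel, stride=1, padding=1)

print(conv_output_dim(5, 3, stride=1, padding=1))
for row in output:
    print(row)
```

With the slide's parameters (input 5, kernel 3, stride 1, padding 1) the formula gives (5 + 2 − 3) / 1 + 1 = 5, matching the 5 × 5 output feature map.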
# Other Opportunities with Emerging Technologies

- Merging of memory and storage
  - e.g., a single interface to manage all data
- New applications
  - e.g., ultra-fast checkpoint and restore
- More robust system design
  - e.g., reducing data loss
- Processing tightly-coupled with memory
  - e.g., enabling efficient search and filtering

## TWO-LEVEL STORAGE MODEL

Non-volatile memories combine characteristics of memory and storage

# Two-Level Memory/Storage Model

- The traditional two-level storage model is a bottleneck with NVM
  - Volatile data in memory → a load/store interface
  - Persistent data in storage → a file system interface
  - Problem: Operating system (OS) and file system (FS) code to locate, translate, buffer data become performance and energy bottlenecks with fast NVM stores

# Unified Memory and Storage with NVM

- Goal: Unify memory and storage management in a single unit to eliminate wasted work to locate, transfer, and translate data
  - Improves both energy and performance
  - Simplifies programming model as well

# **PERSISTENT MEMORY**

Provides an opportunity to manipulate persistent data directly

# The Persistent Memory Manager (PMM)

```
int main(void) {
    // data in file.dat is persistent
    FILE myData = "file.dat";
    myData = new int[64];
}

void updateValue(int n, int value) {
    FILE myData = "file.dat";
    myData[n] = value; // value is persistent
}
```

[Figure labels: Persistent objects; Store; Hints from SW/OS/runtime; Software / Persistent Memory Manager / Hardware; Data Layout, Persistence, Metadata, Security, ...; DRAM, Flash, NVM, HDD]

PMM uses access and hint information to allocate, locate, migrate and access data in the heterogeneous array of devices

# The Persistent Memory Manager (PMM)

- Exposes a load/store interface to access persistent data
  - Applications can directly access persistent memory → no conversion, translation, location overhead for persistent data
- Manages data placement, location, persistence, security
  - To get the best of multiple forms of storage
- Manages metadata storage and retrieval
  - This can lead to overheads that need to be managed
- Exposes hooks and interfaces for system software
  - To enable better data placement and management decisions
- Meza+, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory," WEED 2013.

- A persistent memory exposes a large, persistent address space
  - But it may use many different devices to satisfy this goal
  - From fast, low-capacity volatile DRAM to slow, high-capacity nonvolatile HDD or Flash
  - And other NVM devices in between
- Performance and energy can benefit from good placement of data among these devices
  - Utilizing the strengths of each device and avoiding their weaknesses, if possible
  - For example, consider two important application characteristics: locality and persistence

Applications or system software can provide hints for data placement

# Evaluated Systems

#### HDD Baseline

- Traditional system with volatile DRAM memory and persistent HDD storage
- Overheads of operating system and file system code and buffering

#### NVM Baseline (NB)

- Same as HDD Baseline, but HDD is replaced with NVM
- Still has OS/FS overheads of the two-level storage model

#### Persistent Memory (PM)

- Uses only NVM (no DRAM) to ensure full-system persistence
- All data accessed using loads and stores
- Does not waste time on system calls
- Data is manipulated directly on the NVM device

## Performance Benefits of a Single-Level Store

# Energy Benefits of a Single-Level Store

## On Persistent Memory Benefits & Challenges

Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory" Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), Tel-Aviv, Israel, June 2013. Slides (pptx) Slides (pdf)

#### A Case for Efficient Hardware/Software Cooperative Management of Storage and Memory

Justin Meza\* Yixin Luo\* Samira Khan\*<sup>‡</sup> Jishen Zhao<sup>†</sup> Yuan Xie<sup>†§</sup> Onur Mutlu\*

\*Carnegie Mellon University <sup>†</sup>Pennsylvania State University <sup>‡</sup>Intel Labs <sup>§</sup>AMD Research

# Challenge and Opportunity

# Combined Memory & Storage

# Challenge and Opportunity

# A Unified Interface to All Data

# Intel Optane Persistent Memory (2019)

- Non-volatile main memory
- Based on 3D-XPoint Technology

## UPMEM Processing-in-DRAM Engine (2019)

- Processing in DRAM Engine
- Includes standard DIMM modules, with a large number of DPU processors combined with DRAM chips.
- Replaces standard DIMMs
  - DDR4 R-DIMM modules
    - 8GB+128 DPUs (16 PIM chips)
    - Standard 2x-nm DRAM process
  - Large amounts of compute & memory bandwidth

# One Key Challenge in Persistent Memory

How to ensure consistency of system/data if all memory is persistent?

- Two extremes
  - Programmer transparent: Let the system handle it
  - Programmer only: Let the programmer handle it
- Many alternatives in-between...
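To make the consistency question concrete, here is a toy sketch of inserting a node at the head of a persistent linked list. Persistent memory is modeled as a dictionary, and a crash is modeled by applying only the first few stores. The address names, the helper functions, and the particular write-back order are assumptions chosen for illustration; the sketch relies only on the general fact that stores may reach memory in an order different from program order when caches write lines back independently.

```python
# Toy model: persistent memory is a dict from address to value.
# Initial state: a one-node list, head -> A -> None.
INITIAL_NVM = {"head": "A", "A.value": 1, "A.next": None}


def apply_until_crash(stores, crash_after):
    """Apply only the first `crash_after` stores, then 'lose power'."""
    nvm = dict(INITIAL_NVM)
    for address, value in stores[:crash_after]:
        nvm[address] = value
    return nvm


def list_is_consistent(nvm):
    """Every node reachable from head must have both fields written."""
    node = nvm["head"]
    visited = set()
    while node is not None:
        if node in visited:
            return False                       # cycle
        visited.add(node)
        if f"{node}.value" not in nvm or f"{node}.next" not in nvm:
            return False                       # dangling or half-written node
        node = nvm[f"{node}.next"]
    return True


def crash_outcomes(stores):
    """Consistency of the list for a crash after 0, 1, ..., all stores."""
    return [list_is_consistent(apply_until_crash(stores, k))
            for k in range(len(stores) + 1)]


# Insert node B at the head: the three stores in program order.
program_order = [("B.value", 2), ("B.next", "A"), ("head", "B")]
# Assumed order in which the stores actually reach persistent memory.
writeback_order = [("head", "B"), ("B.value", 2), ("B.next", "A")]

print("program order  :", crash_outcomes(program_order))
print("write-back order:", crash_outcomes(writeback_order))
```

If the head pointer becomes persistent before the new node's fields, a crash in between leaves the list pointing at a half-written node, which is exactly the kind of state a consistency mechanism has to rule out or repair.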
#### CRASH CONSISTENCY PROBLEM

Add a node to a linked list

System crash can result in inconsistent memory state

#### **Explicit interfaces to manage consistency**

- NV-Heaps [ASPLOS'11], BPFS [SOSP'09], Mnemosyne [ASPLOS'11]

```
AtomicBegin {
    Insert a new node;
} AtomicEnd;
```

#### **Limits adoption of NVM**

Have to rewrite code with clear partition between volatile and non-volatile data

# **Burden on the programmers**

# **Example Code**

Update a node in a persistent hash table. Original version:

```
void hashtable_update(hashtable_t* ht, void *key, void *data)
{
    list_t* chain = get_chain(ht, key);
    pair_t* pair;
    pair_t updatePair;
    updatePair.first = key;
    pair = (pair_t*) list_find(chain, &updatePair);
    pair->second = data;
}
```

Version rewritten for an explicit consistency interface:

```
void TMhashtable_update(TMARCGDECL hashtable_t* ht, void *key, void *data)
{
    list_t* chain = get_chain(ht, key);
    pair_t* pair;
    pair_t updatePair;
    updatePair.first = key;
    pair = (pair_t*) TMLIST_FIND(chain, &updatePair);
    pair->second = data;
}
```

Annotations shown on the rewritten version across the slide builds:

- Manual declaration of persistent components
- Need a new implementation
- Third party code can be inconsistent
- Prohibited

Burden on the programmers

# **OUR APPROACH: ThyNVM**

# Goal: Software transparent consistency in persistent memory systems

Key Idea: Periodically checkpoint state; recover to the previous checkpoint on crash

# **ThyNVM: Summary**

# A new hardware-based checkpointing mechanism

- Checkpoints at multiple granularities to reduce both checkpointing latency and metadata overhead
- Overlaps checkpointing and execution to reduce checkpointing latency
- Adapts to DRAM and NVM characteristics

Performs within 4.9% of an *idealized DRAM* with zero cost consistency

# 2. OVERLAPPING CHECKPOINTING AND EXECUTION

## More About ThyNVM

Jinglei Ren, Jishen Zhao, Samira Khan, Jongmoo Choi, Yongwei Wu, and Onur Mutlu, <u>"ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems"</u> Proceedings of the <u>48th International Symposium on</u> <u>Microarchitecture</u> (**MICRO**), Waikiki, Hawaii, USA, December 2015.
[<u>Slides (pptx) (pdf)</u>] [<u>Lightning Session Slides (pptx) (pdf)</u>] [<u>Poster</u> (pptx) (pdf)] Source Code

# ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems

Jinglei Ren\*† Jishen Zhao<sup>‡</sup> Samira Khan<sup>†</sup>′ Jongmoo Choi<sup>+</sup>† Yongwei Wu\* Onur Mutlu<sup>†</sup>

†Carnegie Mellon University \*Tsinghua University ‡University of California, Santa Cruz ′University of Virginia +Dankook University

### Another Key Challenge in Persistent Memory

# Programming Ease to Exploit Persistence

# Tools/Libraries to Help Programmers

Himanshu Chauhan, Irina Calciu, Vijay Chidambaram, Eric Schkufza, Onur Mutlu, and Pratap Subrahmanyam, "NVMove: Helping Programmers Move to Byte-Based Persistence" Proceedings of the 4th Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW), Savannah, GA, USA, November 2016. [Slides (pptx) (pdf)]

#### **NVMOVE: Helping Programmers Move to Byte-Based Persistence**

| Himanshu Chauhan \* | Irina Calciu | Vijay Chidambaram |
|---|---|---|
| UT Austin | VMware Research Group | UT Austin |
| Eric Schkufza | Onur Mutlu | Pratap Subrahmanyam |
| VMware Research Group | ETH Zürich | VMware |

# Consistency Support for Persistent Memory

Youyou Lu, Jiwu Shu, Long Sun, and Onur Mutlu, "Loose-Ordering Consistency for Persistent Memory" Proceedings of the <u>32nd IEEE International Conference on Computer</u> <u>Design</u> (ICCD), Seoul, South Korea, October 2014.
[Slides (pptx) (pdf)] [Erratum] ### Loose-Ordering Consistency for Persistent Memory Youyou Lu <sup>†</sup>, Jiwu Shu <sup>† §</sup>, Long Sun <sup>†</sup> and Onur Mutlu <sup>‡</sup> Department of Computer Science and Technology, Tsinghua University, Beijing, China <sup>§</sup>State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China <sup>‡</sup>Computer Architecture Laboratory, Carnegie Mellon University, Pittsburgh, PA, USA luyy09@mails.tsinghua.edu.cn, shujw@tsinghua.edu.cn, sun-l12@mails.tsinghua.edu.cn, onur@cmu.edu ### Another Key Challenge in Persistent Memory # Security and Data Privacy Issues # Security and Privacy Issues of NVM ■ Endurance problems → Wearout attacks ■ Hybrid memories → Performance attacks ■ Data not erased after power-off → Privacy breaches #### Conclusion ### The Future of Emerging Technologies is Bright - Regardless of challenges - in underlying technology and overlying problems/requirements #### Can enable: - Orders of magnitude improvements - New applications and computing systems Yet, we have to - Think across the stack - Design enabling systems ## If In Doubt, Refer to Flash Memory - A very "doubtful" emerging technology - for at least two decades Proceedings of the IEEE, Sept. 2017 # Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives By Yu Cai, Saugata Ghose, Erich F. 
Haratsch, Yixin Luo, and Onur Mutlu ABSTRACT | NAND flash memory is ubiquitous in everyday life today because its capacity has continuously increased and KEYWORDS | Data storage systems; error recovery; fault tolerance; flash memory; reliability; solid-state drives # Many Research & Design Opportunities - Enabling completely persistent memory - Computation in/using NVM based memories - Hybrid memory systems - Security and privacy issues in persistent memory - Reliability and endurance related problems - Virtual memory systems for NVM → virtual block interface