## P&S Ramulator

# Designing and Evaluating Memory Systems and Modern Software Workloads with Ramulator

Hasan Hassan

Prof. Onur Mutlu

ETH Zürich, Fall 2021

14 October 2021

### P&S Ramulator: Content

- You will learn in detail how modern memory systems operate
- You will design new DRAM and memory controller mechanisms for improving overall system performance, energy consumption, and reliability
- You will simulate and understand the memory system behavior of modern workloads such as machine learning, graph analytics, and genome analysis

## P&S Ramulator: Key Takeaways

- This P&S is aimed at improving your
  - Knowledge in Computer Architecture and Memory Systems
  - Technical skills in simulating memory systems
  - Critical thinking and analysis
  - Interaction with a nice group of researchers
  - Familiarity with key research directions
  - Technical presentation of your project

Learn how state-of-the-art memory controllers operate, design new DRAM and memory controller mechanisms, and evaluate your mechanisms using simulation.

## Prerequisites of the Course

- Digital Design and Computer Architecture (or equivalent course)
- Good knowledge of the C/C++ programming languages
- Interest in making things efficient and solving problems
- Interest in understanding software development and hardware design, and their interaction

## Course Info: Who Are We? (I)

#### Onur Mutlu

- Full Professor @ ETH Zurich ITET (INFK), since September 2015
- Strecker Professor @ Carnegie Mellon University ECE/CS, 2009-2016, 2016-...
- PhD from UT-Austin, worked at Google, VMware, Microsoft Research, Intel, AMD
- https://people.inf.ethz.ch/omutlu/
- omutlu@gmail.com (Best way to reach me)
- https://people.inf.ethz.ch/omutlu/projects.htm

### Research and Teaching in:

- Computer architecture, computer systems, hardware security, bioinformatics
- Memory and storage systems
- Hardware security, safety, predictability
- Fault tolerance
- Hardware/software cooperation
- Architectures for bioinformatics, health, medicine
- ...

## Course Info: Who Are We? (II)

- Lead Supervisor:
  - Hasan Hassan
- Supervisors:
  - Geraldo de Oliveira
  - Lois Orosa
  - Giray Yaglikci
  - Haocong Luo
- Get to know us and our research
  - https://safari.ethz.ch/safari-group/

## Onur Mutlu's SAFARI Research Group

### Computer architecture, HW/SW, systems, bioinformatics, security, memory

https://safari.ethz.ch/safari-newsletter-april-2020/

Think BIG, Aim HIGH!

### Current Research Focus Areas

### Research Focus: Computer architecture, HW/SW, bioinformatics

- Memory and storage (DRAM, flash, emerging), interconnects
- Heterogeneous & parallel systems, GPUs, systems for data analytics
- System/architecture interaction, new execution models, new interfaces
- Energy efficiency, fault tolerance, hardware security, performance
- Genome sequence analysis & assembly algorithms and architectures
- Biologically inspired systems & system design for bio/medicine

### Course Info: How About You?

- Let us know your background, interests
- Why did you join this P&S?
- Please submit HW0

## Course Requirements and Expectations

- Attendance required for all meetings
- Study the learning materials
- Each student will carry out a hands-on project
  - Build, implement, code, and design with close engagement from the supervisors
- Participation
  - Ask questions, contribute thoughts/ideas
  - Read relevant papers

We will help in all projects! If your work is really good, you may get it published!
### Course Website

https://safari.ethz.ch/projects_and_seminars/fall2021/doku.php?id=ramulator

- Useful information about the course
- Check your email frequently for announcements

## Meeting 1

### **Learning materials:**

- An old version of Ramulator: https://github.com/CMU-SAFARI/ramulator
- Original Ramulator paper: https://people.inf.ethz.ch/omutlu/pub/ramulator_dram_simulator-ieee-cal15.pdf
- An example study of modern workloads and DRAM architectures using Ramulator: https://people.inf.ethz.ch/omutlu/pub/Workload-DRAM-Interaction-Analysis_sigmetrics19_pomacs19.pdf
- An example recent study of a new DRAM architecture using Ramulator: https://people.inf.ethz.ch/omutlu/pub/CLR-DRAM_capacity-latency-reconfigurable-DRAM_isca20.pdf
- An example recent study of a new virtual memory system architecture using Ramulator: https://people.inf.ethz.ch/omutlu/pub/VBI-virtual-block-interface_isca20.pdf
- Three examples of new ideas enabled by Ramulator-based evaluation:
  - https://people.inf.ethz.ch/omutlu/pub/rowclone_micro13.pdf
  - https://people.inf.ethz.ch/omutlu/pub/salp-dram_isca12.pdf
  - https://people.inf.ethz.ch/omutlu/pub/raidr-dram-refresh_isca12.pdf

## Meeting 2 (TBD)

- We will announce the projects and will give you some description about them
- We will give you a chance to select a project
- Then, we will have 1-1 meetings to match your interests, skills, and background with a suitable project
- It is important that you study the learning materials before our next meeting!
## Next Meetings

- Individual meetings with your mentor/s
- Tutorials and short talks
  - Simulating memory systems with Ramulator
  - Recent research works
- Presentation of your work

# An Introduction to Simulating Memory Systems with Ramulator

### Motivation

- DRAM and Memory Controller landscape is changing
- Many new and upcoming standards
- Many new controller designs
- A fast and easy-to-extend simulator is very much needed

| Segment | DRAM Standards & Architectures |
|-------------|--------------------------------|
| Commodity | DDR3 (2007) [14]; DDR4 (2012) [18] |
| Low-Power | LPDDR3 (2012) [17]; LPDDR4 (2014) [20] |
| Graphics | GDDR5 (2009) [15] |
| Performance | eDRAM [28], [32]; RLDRAM3 (2011) [29] |
| 3D-Stacked | WIO (2011) [16]; WIO2 (2014) [21]; MCDRAM (2015) [13]; HBM (2013) [19]; HMC1.0 (2013) [10]; HMC1.1 (2014) [11] |
| Academic | SBA/SSA (2010) [38]; Staged Reads (2012) [8]; RAIDR (2012) [27]; SALP (2012) [24]; TL-DRAM (2013) [26]; RowClone (2013) [37]; Half-DRAM (2014) [39]; Row-Buffer Decoupling (2014) [33]; SARP (2014) [6]; AL-DRAM (2015) [25] |

Table 1. Landscape of DRAM-based memory

### Ramulator: A Fast and Extensible DRAM Simulator

- Provides out-of-the-box support for many DRAM standards:
  - DDR3/4, LPDDR3/4, GDDR5, WIO1/2, HBM, HMC, and academic proposals (SALP, AL-DRAM, TL-DRAM, RowClone, and SARP)
- Models timing of non-volatile memories (PCM, STT-MRAM)
- Supports multiple scheduling and row buffer management policies
- Modular and extensible to different standards
- Can be paired with other simulators, e.g., gem5 and DRAMPower
- Written in C++11
- ~2.5X faster than the fastest open-source simulator

| Simulator (clang -O3) | Cycles (10<sup>6</sup>), Random | Cycles (10<sup>6</sup>), Stream | Runtime (sec.), Random | Runtime (sec.), Stream | Req/sec (10<sup>3</sup>), Random | Req/sec (10<sup>3</sup>), Stream | Memory (MB) |
|-----------|-----|-----|--------|--------|-----|-----|---------|
| Ramulator | 652 | 411 | 752 | 249 | 133 | 402 | 2.1 |
| DRAMSim2 | 645 | 413 | 2,030 | 876 | 49 | 114 | 1.2 |
| USIMM | 661 | 409 | 1,880 | 750 | 53 | 133 | 4.5 |
| DrSim | 647 | 406 | 18,109 | 12,984 | 6 | 8 | 1.6 |
| NVMain | 666 | 413 | 6,881 | 5,023 | 15 | 20 | 4,230.0 |

## Case Study: Comparison of DRAM Standards

| Standard | Rate (MT/s) | Timing (CL-RCD-RP) | Data-Bus (Width × Chan.) | Rank-per-Chan | BW (GB/s) |
|-------------------|-------|----------|---------------|---|-------|
| DDR3 | 1,600 | 11-11-11 | 64-bit × 1 | 1 | 11.9 |
| DDR4 | 2,400 | 16-16-16 | 64-bit × 1 | 1 | 17.9 |
| SALP<sup>†</sup> | 1,600 | 11-11-11 | 64-bit × 1 | 1 | 11.9 |
| LPDDR3 | 1,600 | 12-15-15 | 64-bit × 1 | 1 | 11.9 |
| LPDDR4 | 2,400 | 22-22-22 | 32-bit × 2* | 1 | 17.9 |
| GDDR5 [12] | 6,000 | 18-18-18 | 64-bit × 1 | 1 | 44.7 |
| HBM | 1,000 | 7-7-7 | 128-bit × 8* | 1 | 119.2 |
| WIO | 266 | 7-7-7 | 128-bit × 4* | 1 | 15.9 |
| WIO2 | 1,066 | 9-10-10 | 128-bit × 8* | 1 | 127.2 |

Across 22 workloads, simple CPU model

Figure 2. Performance comparison of DRAM standards

## Another Example Study with Ramulator

Saugata Ghose, Tianshi Li, Nastaran Hajinazar, Damla Senol Cali, and Onur Mutlu, "Demystifying Workload-DRAM Interactions: An Experimental Study", Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (**SIGMETRICS**), Phoenix, AZ, USA, June 2019.

## Demystifying Complex Workload-DRAM Interactions: An Experimental Study

- SAUGATA GHOSE, Carnegie Mellon University, USA
- TIANSHI LI, Carnegie Mellon University, USA
- NASTARAN HAJINAZAR, Simon Fraser University, Canada & ETH Zürich, Switzerland
- DAMLA SENOL CALI, Carnegie Mellon University, USA
- ONUR MUTLU, ETH Zürich, Switzerland & Carnegie Mellon University, USA

## Simulator Architecture

## Design Objective: Extensibility

Treats extensibility as a first-class citizen

Observation: DRAM can be abstracted as a hierarchy of state machines

- Provides a standard-agnostic state machine
- Paired with any standard at compile time

## High-Level Design: Hierarchy of State Machines

The 'DRAM' class: a template for building a hierarchy of state machines (i.e., nodes)

```
// DRAM.h
template <typename T>
class DRAM {
  DRAM<T>* parent;
  vector<DRAM<T>*> children;
  typename T::Level level;
  int index;
  // more code...
};

// DDR3.h/cpp
class DDR3 {
  enum class Level {
    Channel, Rank, Bank,
    Row, Column, MAX
  };
  // more code...
};
```

Example instance tree (figure): a `DRAM<DDR3>` channel node (`level = DDR3::Level::Channel`, `index = 0`) whose children are `DRAM<DDR3>` rank nodes, each of which has `DRAM<DDR3>` bank nodes as children.

### DRAM Node States

- status: may change when the node receives one of the DDR3 commands
- next: a lookup table specifying the earliest time the node can receive each command (for honoring DDR3 timing parameters).
```
// DRAM.h
template <typename T>
class DRAM {
  // states (queried/updated by functions below)
  typename T::Status status;
  long horizon[T::Command::MAX];            // currently named 'next'
  map<int, typename T::Status> leaf_status; // for bank only

  // functions (recursively traverses down tree)
  typename T::Command decode(typename T::Command cmd, int addr[]);
  bool check(typename T::Command cmd, int addr[], long now);
  void update(typename T::Command cmd, int addr[], long now);
};

// DDR3.h/cpp
class DDR3 {
  enum class Status {Open, Closed, ..., MAX};
  enum class Command {ACT, PRE, RD, WR, ..., MAX};
};
```

### DRAM Functions

The memory controller relies on three recursive functions to serve a memory request:

- decode(): returns the prerequisite command (e.g., ACT for a closed bank)
- check(): returns whether or not the DRAM is ready to accept a given command (i.e., timing violation check)
- update(): updates the node state based on the issued command

## decode(): Determining the *Prerequisite*

```
// DRAM.h
template <typename T>
class DRAM {
  T::Command decode(T::Command cmd, int addr[]) {
    if (prereq[level][cmd]) {
      // consult lookup-table to decode command
      T::Command p = prereq[level][cmd](this);
      if (p != T::Command::MAX)
        return p; // decoded successfully
    }
    if (children.size() == 0) // lowest-level
      return cmd; // decoded successfully
    // use addr[] to identify target child...
    // invoke decode() at the target child...
  }
};

// DDR3.h/cpp
class DDR3 {
  // declare 2D lookup-table of lambdas
  function<Command(DRAM<DDR3>*)> prereq[Level::MAX][Command::MAX];

  // populate an entry in the table
  prereq[Level::Rank][Command::REF] =
    [] (DRAM<DDR3>* node) -> Command {
      for (auto bank : node->children)
        if (bank->status == Status::Open)
          return Command::PREA;
      return Command::REF;
    };
  // populate other entries...
};
```

## check(): Satisfying the DRAM Timing

```
// Check
template <typename T>
bool DRAM<T>::check(typename T::Command cmd, const int* addr, long clk)
{
    if (next[int(cmd)] != -1 && clk < next[int(cmd)])
        return false; // stop recursion: the check failed at this level

    int child_id = addr[int(level)+1];
    if (child_id < 0 || level == spec->scope[int(cmd)] || !children.size())
        return true; // stop recursion: the check passed at all levels

    // recursively check my child
    return children[child_id]->check(cmd, addr, clk);
}
```

Verifies whether next[cmd] <= now for every node affected by cmd

## update(): Transitioning

```
// Update
template <typename T>
void DRAM<T>::update(typename T::Command cmd, const int* addr, long clk)
{
    cur_clk = clk;
    update_state(cmd, addr);
    update_timing(cmd, addr, clk);
}

// Update (State)
template <typename T>
void DRAM<T>::update_state(typename T::Command cmd, const int* addr)
{
    int child_id = addr[int(level)+1];
    if (lambda[int(cmd)])  // lambda: defined in DDR3.cpp
        lambda[int(cmd)](this, child_id); // update this level

    if (level == spec->scope[int(cmd)] || !children.size())
        return; // stop recursion: updated all levels

    // recursively update my child
    children[child_id]->update_state(cmd, addr);
}

// Update (Timing)
template <typename T>
void DRAM<T>::update_timing(typename T::Command cmd, const int* addr, long clk)
{
    // I am not a target node: I am merely one of its siblings
    if (id != addr[int(level)]) {
        for (auto& t : timing[int(cmd)]) {  // timing: defined in DDR3.cpp
            if (!t.sibling)
                continue; // not an applicable timing parameter

            assert (t.dist == 1);

            long future = clk + t.val;
            next[int(t.cmd)] = max(next[int(t.cmd)], future); // update future
        }
        return; // stop recursion: only target nodes should be recursed
    }
    // (the rest of update_timing, which handles target nodes, is not shown on the slide)
}
```

## Ramulator Paper and Source Code

- Yoongu Kim, Weikun Yang, and Onur Mutlu, "Ramulator: A Fast and Extensible DRAM Simulator", IEEE Computer Architecture Letters (CAL), March 2015. [Source Code]
- Source code is released under the liberal MIT License
  - https://github.com/CMU-SAFARI/ramulator
- https://github.com/CMU-SAFARI/ramulator-pim
  - ZSim+Ramulator: a framework for design space exploration of general-purpose Processing-in-Memory (PIM) architectures

### Conclusion

- Provides out-of-the-box support for many DRAM standards:
  - DDR3/4, LPDDR3/4, GDDR5, WIO1/2, HBM, HMC, and academic proposals (SALP, AL-DRAM, TL-DRAM, RowClone, and SARP)
- Models timing of non-volatile memories (PCM, STT-MRAM)
- Supports multiple scheduling and row buffer management policies
- Modular and extensible to different standards
- Can be paired with other simulators, e.g., gem5 and DRAMPower
- Written in C++11
- ~2.5X faster than the fastest open-source simulator

| Simulator (clang -O3) | Cycles (10<sup>6</sup>), Random | Cycles (10<sup>6</sup>), Stream | Runtime (sec.), Random | Runtime (sec.), Stream | Req/sec (10<sup>3</sup>), Random | Req/sec (10<sup>3</sup>), Stream | Memory (MB) |
|-----------|-----|-----|--------|--------|-----|-----|---------|
| Ramulator | 652 | 411 | 752 | 249 | 133 | 402 | 2.1 |
| DRAMSim2 | 645 | 413 | 2,030 | 876 | 49 | 114 | 1.2 |
| USIMM | 661 | 409 | 1,880 | 750 | 53 | 133 | 4.5 |
| DrSim | 647 | 406 | 18,109 | 12,984 | 6 | 8 | 1.6 |
| NVMain | 666 | 413 | 6,881 | 5,023 | 15 | 20 | 4,230.0 |