# Digital Design & Computer Arch.

Lecture 1: Introduction and Basics

Prof. Onur Mutlu, ETH Zürich, Spring 2022, 24 February 2022

### Brief Self Introduction

#### Onur Mutlu

- Full Professor @ ETH Zurich ITET (INFK), since Sept 2015
- Strecker Professor @ Carnegie Mellon University ECE (CS), 2009-2016, 2016-...
- Started the Comp Arch Research Group @ Microsoft Research, 2006-2009
- Worked @ Google, VMware, Microsoft Research, Intel, AMD
- PhD in Computer Engineering from University of Texas at Austin in 2006
- BS in Computer Engineering & Psychology from University of Michigan in 2000
- https://people.inf.ethz.ch/omutlu/ omutlu@gmail.com

#### Research and Teaching in:

- Computer architecture, systems, hardware security, bioinformatics
- Memory and storage systems
- Robust & dependable hardware systems: security, safety, predictability, reliability
- Hardware/software cooperation
- New computing paradigms; architectures with emerging technologies/devices
- Architectures for bioinformatics, genomics, health, medicine, AI/ML
- ...

### Current Research Mission

#### Computer architecture, HW/SW, systems, bioinformatics, security

# **Build fundamentally better architectures**

# Four Key Current Directions

- Fundamentally Secure/Reliable/Safe Architectures
- Fundamentally Energy-Efficient Architectures
  - Memory-centric (Data-centric) Architectures
- Fundamentally Low-Latency and Predictable Architectures
- Architectures for AI/ML, Genomics, Medicine, Health, ...

# Fundamentally Better Architectures

- **Data-centric**
- **Data-driven**
- **Data-aware**

# Onur Mutlu's SAFARI Research Group

Computer architecture, HW/SW, systems, bioinformatics, security, memory

https://safari.ethz.ch/safari-newsletter-april-2020/

Think BIG, Aim HIGH!

https://safari.ethz.ch

# SAFARI Newsletter January 2021 Edition

https://safari.ethz.ch/safari-newsletter-january-2021/

Think Big, Aim High, and Have a Wonderful 2021!
### SAFARI Newsletter December 2021 Edition

https://safari.ethz.ch/safari-newsletter-december-2021/

Think Big, Aim High

## SAFARI PhD and Post-Doc Alumni

#### https://safari.ethz.ch/safari-alumni/

- Minesh Patel (ETH Zurich), MICRO 2020 and DSN 2020 Best Paper Awards; ISCA Hall of Fame 2021
- Damla Senol Cali (Bionano Genomics), SRC TECHCON 2019 Best Student Presentation Award
- Nastaran Hajinazar (ETH Zurich)
- Gagandeep Singh (ETH Zurich), FPL 2020 Best Paper Award Finalist
- Amirali Boroumand (Stanford Univ → Google), SRC TECHCON 2018 Best Student Presentation Award
- Jeremie Kim (ETH Zurich), EDAA Outstanding Dissertation Award 2020; IEEE Micro Top Picks 2019; ISCA/MICRO HoF 2021
- Nandita Vijaykumar (Univ. of Toronto, Assistant Professor), ISCA Hall of Fame 2021
- Kevin Hsieh (Microsoft Research, Senior Researcher)
- Justin Meza (Facebook), HiPEAC 2015 Best Student Presentation Award; ICCD 2012 Best Paper Award
- Mohammed Alser (ETH Zurich), IEEE Turkey Best PhD Thesis Award 2018
- Yixin Luo (Google), HPCA 2015 Best Paper Session
- Kevin Chang (Facebook), SRC TECHCON 2016 Best Student Presentation Award
- Rachata Ausavarungnirun (KMUTNB, Assistant Professor), NOCS 2015 and NOCS 2012 Best Paper Award Finalist
- Gennady Pekhimenko (Univ. of Toronto, Assistant Professor), ISCA Hall of Fame 2021; ASPLOS 2015 SRC Winner
- Vivek Seshadri (Microsoft Research)
- Donghyuk Lee (NVIDIA Research, Senior Researcher), HPCA Hall of Fame 2018
- Yoongu Kim (Software Robotics → Google), TCAD'19 Top Pick Award; IEEE Micro Top Picks'10; HPCA'10 Best Paper Session
- Lavanya Subramanian (Intel Labs → Facebook)
- Samira Khan (Univ. of Virginia, Assistant Professor), HPCA 2014 Best Paper Session
- Saugata Ghose (Univ. of Illinois, Assistant Professor), DFRWS-EU 2017 Best Paper Award
- Jawad Haj-Yahya (Huawei Research Zurich, Principal Researcher)

# Our Major Courses & Lectures

### First Computer Architecture & Digital Design Course

- Digital Design and Computer Architecture
- Spring 2021 Livestream Edition: https://www.youtube.com/watch?v=LbC0EZY8yw4&list=PL5Q2soXY2Zi\_uej3aY39YB5pfW4SJ7LIN

### Advanced Computer Architecture Course

- Computer Architecture
- Fall 2021 Livestream Edition: https://www.youtube.com/watch?v=4yfkM\_5EFgo&list=PL5Q2soXY2Zi-Mnk1PxjEIG32HAGILkTOF

#### Seminar in Computer Architecture

- https://www.youtube.com/watch?v=4TcP297mdsI&list=PL5Q2soXY2Zi\_7UBNmC9B8Yr5JSwTG9yH4

# DDCA (Spring 2021)

- https://safari.ethz.ch/digitaltechnik/spring2021/doku.php?id=schedule
- https://www.youtube.com/watch?v=LbC0EZY8yw4&list=PL5Q2soXY2Zi\_uej3aY39YB5pfW4SJ7LIN
- Bachelor's course
  - 2<sup>nd</sup> semester at ETH Zurich
  - Rigorous introduction into "How Computers Work"
  - Digital Design/Logic
  - Computer Architecture
  - 10 FPGA Lab Assignments

#### **Spring 2021 Lectures/Schedule**

| Week | Date | Livestream | Lecture | Readings | Lab | HW |
|------|------|------------|---------|----------|-----|----|
| W1 | 25.02<br>Thu. | YouTube Live | L1: Introduction and Basics | Required<br>Suggested<br>Mentioned | | |
| | 26.02<br>Fri. | YouTube Live | L2a: Tradeoffs, Metrics, Mindset | Required | | |
| | | | L2b: Mysteries in Computer Architecture (PDF) (PPT) | Required<br>Mentioned | | |
| W2 | 04.03<br>Thu. | YouTube Live | L3a: Mysteries in Computer Architecture II | Required<br>Suggested | | |

# Comp Arch (Fall 2021)

- https://safari.ethz.ch/architecture/fall2021/doku.php?id=schedule
- Youtube Livestream: https://www.youtube.com/watch?v=4yfkM\_5EFgo&list=PL5Q2soXY2Zi-Mnk1PxjEIG32HAGILkTOF
- Master's level course
  - Taken by Bachelor's/Masters/PhD students
  - Cutting-edge research topics + fundamentals in Computer Architecture
  - 5 Simulator-based Lab Assignments
  - Potential research exploration
  - Many research readings

#### Fall 2021 Lectures & Schedule

| Week | Date | Livestream | Lecture | Readings | Lab | HW |
|------|------|------------|---------|----------|-----|----|
| W1 | 30.09<br>Thu. | YouTube Live | L1: Introduction and Basics | Required<br>Mentioned | Lab 1<br>Out | HW 0<br>Out |
| | 01.10<br>Fri. | YouTube Live | L2: Trends, Tradeoffs and Design Fundamentals (PDF) (PPT) | Required<br>Mentioned | | |
| W2 | 07.10<br>Thu. | YouTube Live | L3a: Memory Systems: Challenges and Opportunities | Described<br>Suggested | | HW 1<br>Out |
| | | | L3b: Course Info & Logistics (PDF) (PPT) | | | |
| | | | L3c: Memory Performance Attacks | Described<br>Suggested | | |
| | 08.10<br>Fri. | YouTube Live | L4a: Memory Performance Attacks | Described<br>Suggested | Lab 2<br>Out | |
| | | | L4b: Data Retention and Memory Refresh (PDF) (PPT) | Described<br>Suggested | | |
| | | | L4c: RowHammer | Described<br>Suggested | | |

# Seminar in Comp Arch (Fall 2021)

- https://safari.ethz.ch/architecture\_seminar/fall2021/doku.php?id=schedule
- Youtube Livestream: https://www.youtube.com/watch?v=4TcP297mdsI&list=PL5Q2soXY2Zi\_7UBNmC9B8Yr5JSwTG9yH4
- Critical analysis course
  - Taken by Bachelor's/Masters/PhD students
  - Cutting-edge research topics + fundamentals in Computer Architecture
  - 20+ research papers, presentations, analyses

# Hands-On Project Courses

https://safari.ethz.ch/projects\_and\_seminars/doku.php

#### **SAFARI Projects & Seminars Courses (Spring 2021)**

Welcome to the wiki for Project and Seminar courses SAFARI offers.
#### Courses we offer:

- Understanding and Improving Modern DRAM Performance, Reliability, and Security with Hands-On Experiments
- Designing and Evaluating Memory Systems and Modern Software Workloads with Ramulator
- Accelerating Genome Analysis with FPGAs, GPUs, and New Execution Paradigms
- Genome Sequencing on Mobile Devices
- Exploring the Processing-in-Memory Paradigm for Future Computing Systems
- Hands-on Acceleration on Heterogeneous Computing Systems
- Understanding and Designing Modern NAND Flash-Based Solid-State Drives (SSDs) by Building a Practical SSD Simulator

# PIM Course (Fall 2021)

- Fall 2021 Edition: https://safari.ethz.ch/projects\_and\_seminars/fall2021/doku.php?id=processing\_in\_memory
- Youtube Livestream: https://www.youtube.com/watch?v=9e4Chnwdovo&list=PL5Q2soXY2Zi-841fUYYUK9EsXKhQKRPyX
- Project course
  - Taken by Bachelor's/Master's students
  - Processing-in-Memory lectures
  - Hands-on research exploration
  - Many research readings

#### Fall 2021 Meetings/Schedule

| Week | Date | Livestream | Meeting | Learning Materials | Assignments |
|------|------|------------|---------|--------------------|-------------|
| W1 | 05.10<br>Tue. | YouTube Live | M1: P&S PIM Course Presentation (PDF) (PPT) | Required Materials<br>Recommended Materials | HW 0 Out |
| W2 | 12.10<br>Tue. | YouTube Live | M2: Real-World PIM Architectures (PDF) (PDF) | | |
| W3 | 19.10<br>Tue. | YouTube Live | M3: Real-World PIM Architectures II (PDF) (PDF) | | |
| W4 | 26.10<br>Tue. | YouTube Live | M4: Real-World PIM Architectures III (PDF) (PDF) | | |
| W5 | 02.11<br>Tue. | YouTube Live | M5: Real-World PIM Architectures IV (PDF) (PDF) | | |
| W6 | 09.11<br>Tue. | YouTube Live | M6: End-to-End Framework for Processing-using-Memory (PDF) (PPT) | | |
| W7 | 16.11<br>Tue. | YouTube Live | M7: How to Evaluate Data Movement Bottlenecks (PDF) (PPT) | | |
| W8 | 23.11<br>Tue. | YouTube Live | M8: Programming PIM Architectures (PDF) (PDF) | | |
| W9 | 30.11<br>Tue. | YouTube Live | M9: Benchmarking and Workload Suitability on PIM (PDF) (PPT) | | |
| W10 | 07.12<br>Tue. | YouTube Live | M10: Bit-Serial SIMD Processing using DRAM (PDF) (PPT) | | |

# Genomics (Fall 2021)

- Fall 2021 Edition: https://safari.ethz.ch/projects\_and\_seminars/fall2021/doku.php?id=bioinformatics
- Youtube Livestream: https://www.youtube.com/watch?v=MnogTeMjY8k&list=PL5Q2soXY2Zi8sngH-TrNZnDhDkPq55J9J
- Project course
  - Taken by Bachelor's/Master's students
  - Genomics lectures
  - Hands-on research exploration
  - Many research readings

#### Fall 2021 Meetings/Schedule

| Week | Date | Livestream | Meeting | Learning Materials | Assignments |
|------|------|------------|---------|--------------------|-------------|
| W1 | 5.10<br>Tue. | YouTube Live | M1: P&S Accelerating Genomics Course Introduction & Project Proposals (PDF) (PPT) (YouTube Video) | Required Materials<br>Recommended Materials | |
| W2 | 20.10<br>Wed. | YouTube Live | M2: Introduction to Sequencing (PDF) (PDF) | | |
| W3 | 27.10<br>Wed. | YouTube Live | M3: Read Mapping (PDF) (PDF) | | |
| W4 | 3.11<br>Wed. | YouTube Live | M4: GateKeeper (PDF) (PPT) | | |
| W5 | 10.11<br>Wed. | YouTube Live | M5: MAGNET & Shouji (PDF) (PDF) | | |
| W6 | 17.11<br>Wed. | | M6.1: SneakySnake (PDF) (PPT) (YouTube Video) | | |
| | | | M6.2: GRIM-Filter (PDF) (PPT) (YouTube Video) | | |
| W7 | 24.11<br>Wed. | | M7: GenASM (PDF) (PPT) (YouTube Video) | | |
| W8 | 01.12<br>Wed. | YouTube Live | M8: Genome Assembly | | |
| W9 | 13.12<br>Mon. | YouTube Live | M9: GRIM-Filter (PDF) (PPT) | | |
| W10 | 15.12<br>Wed. | YouTube Live | M10: Genomic Data Sharing Under Differential Privacy (PDF) (PDF) | | |

# Hetero. Systems (Fall'21)

- Fall 2021 Edition: https://safari.ethz.ch/projects\_and\_seminars/fall2021/doku.php?id=heterogeneous\_systems
- Youtube Livestream: https://www.youtube.com/watch?v=QY\_bjwzsfMM&list=PL5Q2soXY2Zi\_OwkTgEyA6tk3UsoPBH737
- Project course
  - Taken by Bachelor's/Master's students
  - GPU and Parallelism lectures
  - Hands-on research exploration
  - Many research readings

#### Fall 2021 Meetings/Schedule

| Week | Date | Livestream | Meeting | Learning Materials | Assignments |
|------|------|------------|---------|--------------------|-------------|
| W1 | 07.10<br>Thu. | YouTube Live | M1: P&S Course Presentation (PDF) (PPT) | Required Materials<br>Recommended Materials | HW 0 Out |
| W2 | 14.10<br>Thu. | YouTube Live | M2: SIMD Processing and GPUs (PDF) (PPT) | | |
| W3 | 21.10<br>Thu. | YouTube Live | M3: GPU Software Hierarchy (PDF) (PPT) | | |
| W4 | 28.10<br>Thu. | YouTube Live | M4: GPU Memory Hierarchy (PDF) (PPT) | | |
| W5 | 04.11<br>Thu. | YouTube Live | M5: GPU Performance Considerations (PDF) (PPT) | | |
| W6 | 11.11<br>Thu. | YouTube Live | M6: Parallel Patterns: Reduction (PDF) (PPT) | | |
| W7 | 18.11<br>Thu. | YouTube Live | M7: Parallel Patterns: Histogram (PDF) (PPT) | | |
| W8 | 25.11<br>Thu. | YouTube Live | M8: Parallel Patterns: Convolution (PDF) (PPT) | | |
| W9 | 02.12<br>Thu. | YouTube Live | M9: Parallel Patterns: Prefix Sum (Scan) (PDF) (PPT) | | |
| W10 | 09.12<br>Thu. | YouTube Live | M10: Parallel Patterns: Sparse Matrices (PPT) | | |
| W11 | 16.12<br>Thu. | YouTube Live | M11: Parallel Patterns: Graph Search (PDF) (PPT) | | |
| W12 | 22.12<br>Thu. | YouTube Live | M12: Dynamic Parallelism (PDF) (PPT) | | |
| W13 | 06.01<br>Thu. | YouTube Live | M13: Collaborative Computing (PDF) (PPT) | | |

# SAFARI Live Seminars (I)

SAFARI Live Seminars in Computer Architecture

# SAFARI Live Seminars (II)

- Nastaran Hajinazar, 27 Oct 2021
- Damla Senol Cali, 07 Nov 2021
- Gennady Pekhimenko, 08 Nov 2021
- Serghei Mangul, 11 Nov 2021
- Rahul Bera, 20 Dec 2021
- Fabrice Devaux, 2 Feb 2022
- Lois Orosa, 10 Feb 2022 (Thursday, February 10 at 5:00 pm Zurich time (CET))
- Sean Lie, Cerebras Systems (Monday, February 28 2022 at 6:00 pm Zurich time (CET))

https://www.youtube.com/watch?v=D8Hjy2iU9I4&list=PL5Q2soXY2Zi\_tOTAYm--dYByNPL7JhwR9&index=1

# Upcoming SAFARI Live Seminar (Feb 28)

https://www.youtube.com/watch?v=x2-qB0J7KHw

# Some Basic Principles We Follow

# Principle: Teaching and Research

Teaching drives Research

Research drives Teaching

Focus on Insight

Encourage New Ideas

# Principle: Learning and Scholarship

Focus on learning and scholarship

Create an environment that values free & critical exploration, openness, collaboration, hard work, creativity

# Principle: Learning and Scholarship

The quality of your work defines your impact

# Principle: Good Mindset, Goals & Focus

You can make a good impact on the world

# Principle: Passion

Follow Your Passion (Do not get derailed by naysayers)

# Principle: Build Infrastructure

Build Infrastructure to Enable Your Passion

# Principle: Work Hard

Work Hard to Enable Your Passion
# Principle: Resilience & Focus

Be Resilient & Focused

Make It Happen

# Principle: Good Mindset, Goals & Focus

You can make a good impact on the world

# Research & Teaching: Some Overview Talks

#### https://www.youtube.com/onurmutlulectures

- Future Computing Architectures
  - https://www.youtube.com/watch?v=kqiZISOcGFM&list=PL5Q2soXY2Zi8D\_5MGV6EnXEJHnV2YFBJl&index=1
- Enabling In-Memory Computation
  - https://www.youtube.com/watch?v=njX\_14584Jw&list=PL5Q2soXY2Zi8D\_5MGV6EnXEJHnV2YFBJl&index=16
- Accelerating Genome Analysis
  - https://www.youtube.com/watch?v=r7sn41lH-4A&list=PL5Q2soXY2Zi8D\_5MGV6EnXEJHnV2YFBJl&index=41
- Rethinking Memory System Design
  - https://www.youtube.com/watch?v=F7xZLNMIY1E&list=PL5Q2soXY2Zi8D\_5MGV6EnXEJHnV2YFBJl&index=3
- Intelligent Architectures for Intelligent Machines
  - https://www.youtube.com/watch?v=c6\_LgzuNdkw&list=PL5Q2soXY2Zi8D\_5MGV6EnXEJHnV2YFBJl&index=25
- The Story of RowHammer
  - https://www.youtube.com/watch?v=sqd7PHQQ1AI&list=PL5Q2soXY2Zi8D\_5MGV6EnXEJHnV2YFBJl&index=39

## An Interview on Research and Education

- Computing Research and Education (@ ISCA 2019)
  - https://www.youtube.com/watch?v=8ffSEKZhmvo&list=PL5Q2soXY2Zi\_4oP9LdL3cc8G6NIjD2Ydz
- Maurice Wilkes Award Speech (10 minutes)
  - https://www.youtube.com/watch?v=tcQ3zZ3JpuA&list=PL5Q2soXY2Zi8D\_5MGV6EnXEJHnV2YFBJl&index=15

# More Thoughts and Suggestions

- Onur Mutlu, "Some Reflections (on DRAM)", Award Speech for ACM SIGARCH Maurice Wilkes Award, at the **ISCA** Awards Ceremony, Phoenix, AZ, USA, 25 June 2019. [Slides (pptx) (pdf)] [Video of Award Acceptance Speech (Youtube; 10 minutes) (Youku; 13 minutes)] [Video of Interview after Award Acceptance (Youtube; 1 hour 6 minutes) (Youku; 1 hour 6 minutes)] [News Article on "ACM SIGARCH Maurice Wilkes Award goes to Prof. Onur Mutlu"]
- Onur Mutlu, "How to Build an Impactful Research Group", 57th Design Automation Conference Early Career Workshop (DAC), Virtual, 19 July 2020. [Slides (pptx) (pdf)]

# More Thoughts and Suggestions (II)

- Onur Mutlu, "Computer Architecture: Why Is It So Important and Exciting Today?", Invited Lecture at *Izmir Institute of Technology (IYTE)*, Virtual, 16 October 2020. [Slides (pptx) (pdf)] [Talk Video (2 hours 12 minutes)]
- Onur Mutlu, "Applying to Graduate School & Doing Impactful Research", Invited Panel Talk at the 3rd Undergraduate Mentoring Workshop, held with the 48th International Symposium on Computer Architecture (**ISCA**), Virtual, 18 June 2021. [Slides (pptx) (pdf)] [Talk Video (50 minutes)]

# A Talk on Impactful Growth

## An Interview on Computing Futures

# "Formative Experience"

# "High investment, high return"

# "Recorded lectures allowed me to go over the lectures when necessary"

# "YouTube allows me to watch the lectures on my TV"

# "The lecturer is very responsive to questions and remarks from students"

"Perhaps even better than in-person classes as questions can be asked asynchronously"

# "the course was fantastic and I would do it again at any time"

# Learning experience

- Long-term tradeoff analysis
- Critical thinking & decision making
- Concepts & Ideas
- Fundamentals & Cutting-edge
- Hands-on learning

Your mindset will determine what you get out of the course

# Find and choose the learning style that works best for you

#### Course Components

- Lectures
- Readings
- Labs
- Homeworks
- Exam
- Extra Credit Assignments
- In all, you have freedom to adapt to your learning style
- We will talk about these more later

https://safari.ethz.ch/digitaltechnik/spring2022/

# What Will We Learn in This Course?

# How Computers Work (from the ground up)

#### Answer Continued

# And Why We Care

# Why Do We Have Computers?

# Why Do We Do Computing?

# To Solve Problems

# To Gain Insight

# To Enable a Better Life & Future

# How Does a Computer Solve Problems?

# Orchestrating Electrons

In today's dominant technologies

# How Do Problems Get Solved by Electrons?
## The Transformation Hierarchy

Levels of the hierarchy, from top to bottom:

- Problem
- Algorithm
- Program/Language
- System Software (VM, OS, MM)
- ISA (Architecture)
- Microarchitecture
- Logic
- Devices
- Electrons

The figure marks two scopes on this hierarchy: Computer Architecture (narrow view) and Computer Architecture (expanded view).

#### Levels of Transformation

"The purpose of computing is [to gain] insight" (*Richard Hamming*)

We gain and generate insight by solving problems

How do we ensure problems are solved by electrons?

#### Algorithm

Step-by-step procedure that is guaranteed to terminate where each step is precisely stated and can be carried out by a computer

- Finiteness
- Definiteness
- Effective computability

Many algorithms for the same problem

#### ISA (Instruction Set Architecture)

Interface/contract between SW and HW. What the programmer assumes hardware will satisfy.

#### Microarchitecture

An implementation of the ISA

#### Digital logic circuits

Building blocks of micro-arch (e.g., gates)

## Computer Architecture

- is the science and art of designing computing platforms (hardware, interface, system SW, and programming model)
- to achieve a set of design goals
  - E.g., highest performance on earth on workloads X, Y, Z
  - E.g., longest battery life at a form factor that fits in your pocket with cost < \$\$\$ CHF
  - E.g., best average performance across all known workloads at the best performance/cost ratio
  - ...
- Designing a supercomputer is different from designing a smartphone → But, many fundamental principles are similar

**Figure 3.** TPU Printed Circuit Board. It can be inserted in the slot for a SATA disk in a server, but the card uses PCIe Gen3 x16.

**Figure 4.** Systolic data flow of the Matrix Multiply Unit. Software has the illusion that each 256B input is read at once, and they instantly update one location of each of 256 accumulator RAMs.

Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA 2017.

### New ML applications (vs. TPU3):

- Computer vision
- Natural Language Processing (NLP)
- Recommender system
- Reinforcement learning that plays Go

250 TFLOPS per chip in 2021 vs 90 TFLOPS in TPU3

1 ExaFLOPS per board

https://spectrum.ieee.org/tech-talk/computing/hardware/heres-how-googles-tpu-v4-ai-chip-stacked-up-in-training-tests

- ML accelerator: 260 mm<sup>2</sup>, 6 billion transistors, 600 GFLOPS GPU, 12 ARM 2.2 GHz CPUs.
- Two redundant chips for better safety.

Tesla Dojo Chip & System

## D1 Chip

- 362 TFLOPs BF16/CFP8
- 22.6 TFLOPs FP32
- 10TBps/dir. on-Chip Bandwidth
- 4TBps/edge. off-Chip Bandwidth
- **400W TDP**
- 645mm<sup>2</sup>, 7nm Technology
- **50 Billion** Transistors
- 11+ Miles Of Wires

The largest ML accelerator chip (2021)

### **Cerebras WSE-2**

- 850,000 cores
- 2.6 Trillion transistors
- 46,225 mm<sup>2</sup>

### **Largest GPU**

- **NVIDIA** Ampere GA100
- 54.2 Billion transistors
- 826 mm<sup>2</sup>

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning

Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, Onur Mutlu, "Accelerating Genome Analysis: A Primer on an Ongoing Journey", IEEE Micro, August 2020.

#### Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

- JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland
- IZZAT EL HAJJ, American University of Beirut, Lebanon
- IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain
- CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece
- GERALDO F. OLIVEIRA, ETH Zürich, Switzerland
- ONUR MUTLU, ETH Zürich, Switzerland

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM).

Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM (Processing-In-Memory benchmarks), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.

## Axiom

To achieve the highest energy efficiency and performance:

## we must take the expanded view of computer architecture

Co-design across the hierarchy: Algorithms to devices

Specialize as much as possible within the design goals

# What is Computer Architecture?

The science and art of designing, selecting, and interconnecting hardware components and designing the hardware/software interface to create a computing system that meets functional, performance, energy consumption, cost, and other specific goals.

# Why Study Computer Architecture?

- Enable better systems: make computers faster, cheaper, smaller, more reliable, ...
  - By exploiting advances and changes in underlying technology/circuits
- Enable new applications
  - Life-like 3D visualization 20 years ago? Virtual reality?
  - Self-driving cars?
  - Personalized genomics? Personalized medicine?
- Enable better solutions to problems
  - Software innovation is built on trends and changes in computer architecture
    - > 50% performance improvement per year has enabled this innovation
- Understand why computers work the way they do

# Computer Architecture Today (I)

- Today is a very exciting time to study computer architecture
- Industry is in a large paradigm shift (to novel architectures)
  - many different potential system designs possible
- Many difficult problems motivating and caused by the shift
  - Huge hunger for data and new data-intensive applications
  - Power/energy/thermal constraints
  - Complexity of design
  - Difficulties in technology scaling
  - Memory bottleneck
  - Reliability problems
  - Programmability problems
  - Security and privacy issues
- No clear, definitive answers to these problems

# Computer Architecture Today (II)

- Computing landscape is very different from 10-20 years ago
- Applications and technology both demand novel architectures

To achieve the highest energy efficiency and performance:

## we must take the expanded view of computer architecture

Co-design across the hierarchy: Algorithms to devices

Specialize as much as possible within the design goals

# Historical: Opportunities at the Bottom

# There's Plenty of Room at the Bottom

From Wikipedia, the free encyclopedia

"There's Plenty of Room at the Bottom: An Invitation to Enter a New Field of Physics" was a lecture given by physicist Richard Feynman at the annual American Physical Society meeting at Caltech on December 29, 1959. Feynman considered the possibility of direct manipulation of individual atoms as a more powerful form of synthetic chemistry than those used at the time. Although versions of the talk were reprinted in a few popular magazines, it went largely unnoticed and did not inspire the conceptual beginnings of the field. Beginning in the 1980s, nanotechnology advocates cited it to establish the scientific credibility of their work.
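As an aside on the "> 50% performance improvement per year" figure from the "Why Study Computer Architecture?" slide: a fixed annual rate compounds quickly. The following is a minimal sketch of that arithmetic only. The function name `cumulative_speedup` and the chosen time horizons are illustrative assumptions, not part of the lecture, and real year-over-year improvements are not constant.

```python
def cumulative_speedup(annual_improvement: float, years: int) -> float:
    """Speedup after compounding a fixed annual improvement for `years` years.

    annual_improvement=0.5 means each year is 1.5x faster than the previous one.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 + annual_improvement) ** years


# Illustrative horizons (assumed, not taken from the slides).
for years in (1, 5, 10, 20):
    speedup = cumulative_speedup(0.5, years)
    print(f"{years:>2} years at 50%/year -> {speedup:,.1f}x")
```

Under this simplified model, a decade of 50% yearly improvement compounds to roughly a 58x speedup, which is why a steady trend of this kind can enable applications that were previously infeasible.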
# Historical: Opportunities at the Bottom (II) # There's Plenty of Room at the Bottom From Wikipedia, the free encyclopedia Feynman considered some ramifications of a general ability to manipulate matter on an atomic scale. He was particularly interested in the possibilities of denser computer circuitry, and microscopes that could see things much smaller than is possible with scanning electron microscopes. These ideas were later realized by the use of the scanning tunneling microscope, the atomic force microscope and other examples of scanning probe microscopy and storage systems such as Millipede, created by researchers at IBM. Feynman also suggested that it should be possible, in principle, to make nanoscale machines that "arrange the atoms the way we want", and do chemical synthesis by mechanical manipulation. He also presented the possibility of "swallowing the doctor", an idea that he credited in the essay to his friend and graduate student Albert Hibbs. This concept involved building a tiny, swallowable surgical robot. # Historical: Opportunities at the Top #### **REVIEW** # There's plenty of room at the Top: What will drive computer performance after Moore's law? - D Charles E. Leiserson<sup>1</sup>, D Neil C. Thompson<sup>1,2,\*</sup>, D Joel S. Emer<sup>1,3</sup>, D Bradley C. Kuszmaul<sup>1,†</sup>, Butler W. Lampson<sup>1,4</sup>, D... - + See all authors and affiliations Science 05 Jun 2020: Vol. 368, Issue 6495, eaam9744 DOI: 10.1126/science.aam9744 Much of the improvement in computer performance comes from decades of miniaturization of computer components, a trend that was foreseen by the Nobel Prize—winning physicist Richard Feynman in his 1959 address, "There's Plenty of Room at the Bottom," to the American Physical Society. In 1975, Intel founder Gordon Moore predicted the regularity of this miniaturization trend, now called Moore's law, which, until recently, doubled the number of transistors on computer chips every 2 years. 
Unfortunately, semiconductor miniaturization is running out of steam as a viable way to grow computer performance—there isn't much more room at the "Bottom." If growth in computing power stalls, practically all industries will face challenges to their productivity. Nevertheless, opportunities for growth in computing performance will still be available, especially at the "Top" of the computing-technology stack: software, algorithms, and hardware architecture. # Axiom, Revisited There is plenty of room both at the top and at the bottom but much more so when you communicate well between and optimize across the top and the bottom # Hence the Expanded View Computer Architecture (expanded view) # Computer Architecture # Why Is It So Exciting Today? # Many Interesting Things Are Happening Today in Computer Architecture # Many Interesting Things Are Happening Today in Computer Architecture # Performance and Energy Efficiency # Do We Want This? 97 # Or This? **SAFARI** 98 # Challenge and Opportunity for Future # High Performance, Energy Efficient, Sustainable # Many Difficult Problems: Climate # Many Difficult Problems: Intelligence # Many Difficult Problems: Intelligence ## **Forbes** Jun 17, 2020, 11:54am EDT | 20,934 views ## Deep Learning's Carbon Emissions Problem Rob Toews Contributor ① Α *I write about the big picture of artificial intelligence.* 102 # Many Difficult Problems: Congestion # Many Difficult Problems: Public Health # Many Difficult Problems: Genome Analysis # Data Movement vs. Computation Energy A memory access consumes ~100-1000X the energy of a complex addition # Intel Optane Persistent Memory (2019) - Non-volatile main memory - Based on 3D-XPoint Technology # PCM as Main Memory: Idea in 2009 Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative" Proceedings of the <u>36th International Symposium on Computer</u> <u>Architecture</u> (**ISCA**), pages 2-13, Austin, TX, June 2009. 
<u>Slides</u> (pdf) ## Architecting Phase Change Memory as a Scalable DRAM Alternative Benjamin C. Lee† Engin Ipek† Onur Mutlu‡ Doug Burger† †Computer Architecture Group Microsoft Research Redmond, WA {blee, ipek, dburger}@microsoft.com ‡Computer Architecture Laboratory Carnegie Mellon University Pittsburgh, PA onur@cmu.edu ## PCM as Main Memory: Idea in 2009 Benjamin C. Lee, Ping Zhou, Jun Yang, Youtao Zhang, Bo Zhao, Engin Ipek, Onur Mutlu, and Doug Burger, "Phase Change Technology and the Future of Main Memory" IEEE Micro, Special Issue: Micro's Top Picks from 2009 Computer Architecture Conferences (MICRO TOP PICKS), Vol. 30, No. 1, pages 60-70, January/February 2010. ## PHASE-CHANGE TECHNOLOGY AND THE FUTURE OF MAIN MEMORY ## Cerebras's Wafer Scale ML Engine (2019) The largest ML accelerator chip 400,000 cores #### **Cerebras WSE** 1.2 Trillion transistors 46,225 mm<sup>2</sup> #### **Largest GPU** 21.1 Billion transistors 815 mm<sup>2</sup> **NVIDIA** TITAN V https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning ## Cerebras's Wafer Scale ML Engine-2 (2021) The largest ML accelerator chip (2021) 850,000 cores #### **Cerebras WSE-2** 2.6 Trillion transistors 46,225 mm<sup>2</sup> #### **Largest GPU** 54.2 Billion transistors 826 mm<sup>2</sup> **NVIDIA** Ampere GA100 https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning ## Upcoming SAFARI Live Seminar (Feb 28) #### https://www.youtube.com/watch?v=x2-qB0J7KHw 22.6K subscribers ## UPMEM Processing-in-DRAM Engine (2019) - Processing in DRAM Engine - Includes standard DIMM modules, with a large number of DPU processors combined with DRAM chips. 
- Replaces standard DIMMs
- DDR4 R-DIMM modules
- 8GB+128 DPUs (16 PIM chips)
- Standard 2x-nm DRAM process
- Large amounts of compute & memory bandwidth

## **UPMEM Memory Modules**

- E19: 8 chips DIMM (1 rank). DPUs @ 267 MHz
- P21: 16 chips DIMM (2 ranks). DPUs @ 350 MHz

## 2,560-DPU Processing-in-Memory System

#### Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland
IZZAT EL HAJJ, American University of Beirut, Lebanon
IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain
CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece
GERALDO F. OLIVEIRA, ETH Zürich, Switzerland
ONUR MUTLU, ETH Zürich, Switzerland

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM).

Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM (Processing-In-Memory benchmarks), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.

## FPGA-based Processing Near Memory

Gagandeep Singh, Mohammed Alser, Damla Senol Cali, Dionysios Diamantopoulos, Juan Gómez-Luna, Henk Corporaal, and Onur Mutlu, "FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications" IEEE Micro (IEEE MICRO), to appear, 2021.
## FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications

Gagandeep Singh<sup>⋄</sup> Mohammed Alser<sup>⋄</sup> Damla Senol Cali<sup>⋈</sup> Dionysios Diamantopoulos<sup>▽</sup> Juan Gómez-Luna<sup>⋄</sup> Henk Corporaal<sup>⋆</sup> Onur Mutlu<sup>⋄⋈</sup>

<sup>⋄</sup>ETH Zürich <sup>⋈</sup>Carnegie Mellon University <sup>⋆</sup>Eindhoven University of Technology <sup>▽</sup>IBM Research Europe

## Samsung Develops Industry's First High Bandwidth Memory with AI Processing Power

Samsung Newsroom, Korea on February 17, 2021

The new architecture will deliver over twice the system performance and reduce energy consumption by more than 70%

Samsung Electronics, the world leader in advanced memory technology, today announced that it has developed the industry's first High Bandwidth Memory (HBM) integrated with artificial intelligence (AI) processing power — the HBM-PIM. The new processing-in-memory (PIM) architecture brings powerful AI computing capabilities inside high-performance memory, to accelerate large-scale processing in data centers, high performance computing (HPC) systems and AI-enabled mobile applications.

Kwangil Park, senior vice president of Memory Product Planning at Samsung Electronics stated, "Our groundbreaking HBM-PIM is the industry's first programmable PIM solution tailored for diverse AI-driven workloads such as HPC, training and inference. We plan to build upon this breakthrough by further collaborating with AI solution providers for even more advanced PIM-powered applications."
#### FIMDRAM based on HBM2

[3D Chip Structure of HBM with FIMDRAM]

#### **Chip Specification**

- 128DQ / 8CH / 16 banks / BL4
- 32 PCU blocks (1 FIM block/2 banks)
- 1.2 TFLOPS (4H)
- FP16 ADD / Multiply (MUL) / Multiply-Accumulate (MAC) / Multiply-and-Add (MAD)

#### ISSCC 2021 / SESSION 25 / DRAM / 25.4

25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications

Young-Cheon Kwon, Suk Han Lee, Jaehoon Lee, Sang-Hyuk Kwon, Je Min Ryu, Jong-Pil Son, Seongil O, Hak-Soo Yu, Haesuk Lee, Soo Young Kim, Youngmin Cho, Jin Guk Kim, Jongyoon Choi, Hyun-Sung Shin, Jin Kim, BengSeng Phuah, HyoungMin Kim, Myeong Jun Song, Ahn Choi, Daeho Kim, SooYoung Kim, Eun-Bong Kim, David Wang, Shinhaeng Kang, Yuhwan Ro, Seungwoo Seo, JoonHo Song, Jaeyoun Youn, Kyomin Sohn, Nam Sung Kim

Samsung Electronics (Hwaseong, Korea; San Jose, CA; Suwon, Korea)

### **Programmable Computing Unit**

- Configuration of PCU block
- Interface unit to control data flow
- Execution unit to perform operations
- Register group
- 32 entries of CRF for instruction memory
- 16 GRF for weight and accumulation
- 16 SRF to store constants for MAC operations

[Block diagram of PCU in FIMDRAM]

#### [Available instruction list for FIM operation]

| Type | CMD | Description |
|----------------|------|-----------------------------|
| Floating Point | ADD | FP16 addition |
| Floating Point | MUL | FP16 multiplication |
| Floating Point | MAC | FP16 multiply-accumulate |
| Floating Point | MAD | FP16 multiply and add |
| Data Path | MOVE | Load or store data |
| Data Path | FILL | Copy data from bank to GRFs |
| Control Path | NOP | Do nothing |
| Control Path | JUMP | Jump instruction |
| Control Path | EXIT | Exit instruction |

## **Chip Implementation**

- Mixed design methodology to implement FIMDRAM
- Full-custom + Digital RTL

[Digital RTL design for PCU block]

[Chip floorplan: pseudo channel-0 and pseudo channel-1 each hold the cell arrays for banks 0 to 15, with one PCU block shared by each pair of banks (0 & 1, 2 & 3, 4 & 5, 6 & 7, 8 & 9, 10 & 11, 12 & 13, 14 & 15), arranged above and below a central peri control block]

## Samsung AxDIMM (2021)

- DDRx-PIM
- Deep learning recommendation system

#### **AxDIMM System**

## SK Hynix Accelerator-in-Memory (2022)

#### SK hynix Develops PIM, Next-Generation AI Accelerator

SK hynix Newsroom, February 16, 2022

Seoul,
February 16, 2022

SK hynix (or "the Company", www.skhynix.com) announced on February 16 that it has developed PIM\*, a next-generation memory chip with computing capabilities.

\*PIM (Processing In Memory): A next-generation technology that provides a solution for data congestion issues for AI and big data by adding computational functions to semiconductor memory

It has been generally accepted that memory chips store data and CPU or GPU, like human brain, process data. SK hynix, following its challenge to such notion and efforts to pursue innovation in the next-generation smart memory, has found a breakthrough solution with the development of the latest technology.

SK hynix plans to showcase its PIM development at the world's most prestigious semiconductor conference, 2022 ISSCC\*, in San Francisco at the end of this month. The company expects continued efforts for innovation of this technology to bring the memory-centric computing, in which semiconductor memory plays a central role, a step closer to the reality in devices such as smartphones.

\*ISSCC: The International Solid-State Circuits Conference will be held virtually from Feb. 20 to Feb. 24 this year with a theme of "Intelligent Silicon for a Sustainable World"

For the first product that adopts the PIM technology, SK hynix has developed a sample of GDDR6-AiM (Accelerator\* in memory). The GDDR6-AiM adds computational functions to GDDR6\* memory chips, which process data at 16Gbps. A combination of GDDR6-AiM with CPU or GPU instead of a typical DRAM makes certain computation speed 16 times faster. GDDR6-AiM is widely expected to be adopted for machine learning, high-performance computing, and big data computation and storage.

11.1 A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based Accelerator-in-Memory supporting 1TFLOPS MAC Operation and Various Activation Functions for Deep-Learning Applications

Seongju Lee, SK hynix, Icheon, Korea

In Paper 11.1, SK Hynix describes a 1ynm, GDDR6-based accelerator-in-memory with a command set for deep-learning operation. The 8Gb design achieves a peak throughput of 1TFLOPS with 1GHz MAC operations and supports major activation functions to improve

## Specialized Processing in Memory (2015)

Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi, "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing" Proceedings of the <u>42nd International Symposium on Computer Architecture</u> (**ISCA**), Portland, OR, June 2015. [Slides (pdf)] [Lightning Session Slides (pdf)]

#### A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Junwhan Ahn Sungpack Hong<sup>§</sup> Sungjoo Yoo Onur Mutlu<sup>†</sup> Kiyoung Choi

junwhan@snu.ac.kr, sungpack.hong@oracle.com, sungjoo.yoo@gmail.com, onur@cmu.edu, kchoi@snu.ac.kr

Seoul National University <sup>§</sup>Oracle Labs <sup>†</sup>Carnegie Mellon University

## Simple Processing in Memory (2015)

Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi, "PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture" Proceedings of the <u>42nd International Symposium on Computer Architecture</u> (**ISCA**), Portland, OR, June 2015.
[Slides (pdf)] [Lightning Session Slides (pdf)] ### PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture Junwhan Ahn Sungjoo Yoo Onur Mutlu<sup>†</sup> Kiyoung Choi junwhan@snu.ac.kr, sungjoo.yoo@gmail.com, onur@cmu.edu, kchoi@snu.ac.kr Seoul National University <sup>†</sup>Carnegie Mellon University ## Processing in Memory on Mobile Devices Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu, "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks" Proceedings of the <u>23rd International Conference on Architectural</u> <u>Support for Programming Languages and Operating</u> <u>Systems</u> (**ASPLOS**), Williamsburg, VA, USA, March 2018. ## Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand<sup>1</sup> Saugata Ghose<sup>1</sup> Youngsok Kim<sup>2</sup> Rachata Ausavarungnirun<sup>1</sup> Eric Shiu<sup>3</sup> Rahul Thakur<sup>3</sup> Daehyun Kim<sup>4,3</sup> Aki Kuusela<sup>3</sup> Allan Knies<sup>3</sup> Parthasarathy Ranganathan<sup>3</sup> Onur Mutlu<sup>5,1</sup> ## In-DRAM Processing (2013) Vivek Seshadri et al., "<u>Ambit: In-Memory Accelerator</u> for Bulk Bitwise Operations Using Commodity DRAM <u>Technology</u>," MICRO 2017. Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology ``` Vivek Seshadri^{1,5} Donghyuk Lee^{2,5} Thomas Mullins^{3,5} Hasan Hassan^4 Amirali Boroumand^5 Jeremie Kim^{4,5} Michael A. Kozuch^3 Onur Mutlu^{4,5} Phillip B. Gibbons^5 Todd C. Mowry^5 ``` $^1$ Microsoft Research India $^2$ NVIDIA Research $^3$ Intel $^4$ ETH Zürich $^5$ Carnegie Mellon University ## In-DRAM Bulk Bitwise Execution (2017) Vivek Seshadri and Onur Mutlu, "In-DRAM Bulk Bitwise Execution Engine" Invited Book Chapter in Advances in Computers, to appear in 2020. 
[Preliminary arXiv version] ### In-DRAM Bulk Bitwise Execution Engine Vivek Seshadri Microsoft Research India visesha@microsoft.com Onur Mutlu ETH Zürich onur.mutlu@inf.ethz.ch ## SIMDRAM Framework (2021) Nastaran Hajinazar, Geraldo F. Oliveira, Sven Gregorio, Joao Dinis Ferreira, Nika Mansouri Ghiasi, Minesh Patel, Mohammed Alser, Saugata Ghose, Juan Gomez-Luna, and Onur Mutlu, "SIMDRAM: An End-to-End Framework for Bit-Serial SIMD Computing in DRAM" Proceedings of the 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Virtual, March-April 2021. [2-page Extended Abstract] [Short Talk Slides (pptx) (pdf)] [Talk Slides (pptx) (pdf)] [Short Talk Video (5 mins)] [Full Talk Video (27 mins)] ## SIMDRAM: A Framework for Bit-Serial SIMD Processing using DRAM \*Nastaran Hajinazar<sup>1,2</sup> Nika Mansouri Ghiasi<sup>1</sup> \*Geraldo F. Oliveira<sup>1</sup> Minesh Patel<sup>1</sup> Juan Gómez-Luna<sup>1</sup> Sven Gregorio<sup>1</sup> Mohammed Alser<sup>1</sup> Onur Mutlu<sup>1</sup> João Dinis Ferreira<sup>1</sup> Saugata Ghose<sup>3</sup> <sup>1</sup>ETH Zürich <sup>2</sup>Simon Fraser University <sup>3</sup>University of Illinois at Urbana-Champaign ## PIM Review and Open Problems ## A Modern Primer on Processing in Memory Onur Mutlu<sup>a,b</sup>, Saugata Ghose<sup>b,c</sup>, Juan Gómez-Luna<sup>a</sup>, Rachata Ausavarungnirun<sup>d</sup> SAFARI Research Group <sup>a</sup>ETH Zürich <sup>b</sup>Carnegie Mellon University <sup>c</sup>University of Illinois at Urbana-Champaign <sup>d</sup>King Mongkut's University of Technology North Bangkok Onur Mutlu, Saugata Ghose, Juan Gomez-Luna, and Rachata Ausavarungnirun, "A Modern Primer on Processing in Memory" Invited Book Chapter in <u>Emerging Computing: From Devices to Systems -</u> Looking Beyond Moore and Von Neumann, Springer, to be published in 2021. 
#### A Modern Primer on Processing in Memory Onur Mutlu<sup>a,b</sup>, Saugata Ghose<sup>b,c</sup>, Juan Gómez-Luna<sup>a</sup>, Rachata Ausavarungnirun<sup>d</sup> SAFARI Research Group <sup>a</sup>ETH Zürich <sup>b</sup>Carnegie Mellon University <sup>c</sup>University of Illinois at Urbana-Champaign <sup>d</sup>King Mongkut's University of Technology North Bangkok #### Abstract Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely-felt in the data-intensive server and energy-constrained mobile systems of today. At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, proliferation of different main memory standards and chips, specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are an evidence of this trend. This chapter discusses recent research that aims to practically enable computation close to data, an approach we call processing-in-memory (PIM). 
PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated. While the general idea of PIM is not new, we discuss motivating trends in applications as well as memory circuits/technology that greatly exacerbate the need for enabling it in modern computing systems. We examine at least two promising new approaches to designing PIM systems to accelerate important data-intensive applications: (1) processing using memory by exploiting analog operational properties of DRAM chips to perform massively-parallel operations in memory, with low-cost changes, (2) processing near memory by exploiting 3D-stacked memory technology design to provide high memory bandwidth and low memory latency to in-memory logic. In both approaches, we describe and tackle relevant cross-layer research, design, and adoption challenges in devices, architecture, systems, and programming models. Our focus is on the development of in-memory processing designs that can be adopted in real computing platforms at low cost. We conclude by discussing work on solving key challenges to the practical adoption of PIM. 
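As one illustration of the "processing using memory" approach named in the abstract, the Ambit work cited earlier performs bulk bitwise AND and OR by activating three DRAM rows together, which yields their bitwise majority. The sketch below is a software model only, assuming Python integers stand in for DRAM rows; the names `maj`, `bulk_and`, `bulk_or`, and `ROW_BITS` are illustrative and do not come from the paper, and real hardware details (row copies, NOT via dual-contact cells) are left out.

```python
# Toy software model of majority-based bulk bitwise operations.
# Assumption: each DRAM row is modeled as a Python int used as a bit vector.

ROW_BITS = 16                      # hypothetical row width, for illustration only
ALL_ZEROS = 0                      # control row initialized to all 0s
ALL_ONES = (1 << ROW_BITS) - 1     # control row initialized to all 1s


def maj(a: int, b: int, c: int) -> int:
    """Bitwise majority of three rows: a result bit is 1 if at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)


def bulk_and(a: int, b: int) -> int:
    """AND over a whole row: majority with an all-zeros control row."""
    return maj(a, b, ALL_ZEROS)


def bulk_or(a: int, b: int) -> int:
    """OR over a whole row: majority with an all-ones control row."""
    return maj(a, b, ALL_ONES)


if __name__ == "__main__":
    row_a, row_b = 0b1100, 0b1010
    print(format(bulk_and(row_a, row_b), "04b"))
    print(format(bulk_or(row_a, row_b), "04b"))
```

The point of the model is that one primitive (majority over three rows) covers both operations by changing only the control row, which is why the operation can apply to an entire row of bits at once.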
Keywords: memory systems, data movement, main memory, processing-in-memory, near-data processing, computation-in-memory, processing using memory, processing near memory, 3D-stacked memory, non-volatile memory, energy efficiency, high-performance computing, computer architecture, computing paradigm, emerging technologies, memory scaling, technology scaling, dependable systems, robust systems, hardware security, system security, latency, low-latency computing

#### Contents

- 1 Introduction
- 2 Major Trends Affecting Main Memory
- 3 The Need for Intelligent Memory Controllers to Enhance Memory Scaling
- 4 Perils of Processor-Centric Design
- 5 Processing-in-Memory (PIM): Technology Enablers and Two Approaches
  - 5.1 New Technology Enablers: 3D-Stacked Memory and Non-Volatile Memory
  - 5.2 Two Approaches: Processing Using Memory (PUM) vs. Processing Near Memory (PNM)
- 6 Processing Using Memory (PUM)
  - 6.1 RowClone
  - 6.2 Ambit
  - 6.3 Gather-Scatter DRAM
  - 6.4 In-DRAM Security Primitives
- 7 Processing Near Memory (PNM)
  - 7.1 Tesseract: Coarse-Grained Application-Level PNM Acceleration of Graph Processing
  - 7.2 Function-Level PNM Acceleration of Mobile Consumer Workloads
  - 7.3 Programmer-Transparent Function-Level PNM Acceleration of GPU Applications
  - 7.4 Instruction-Level PNM Acceleration with PIM-Enabled Instructions (PEI)
  - 7.5 Function-Level PNM Acceleration of Genome Analysis Workloads
  - 7.6 Application-Level PNM Acceleration of Time Series Analysis
- 8 Enabling the Adoption of PIM
  - 8.1 Programming Models and Code Generation for PIM
  - 8.2 PIM Runtime: Scheduling and Data Mapping
  - 8.3 Memory Coherence
  - 8.4 Virtual Memory Support
  - 8.5 Data Structures for PIM
  - 8.6 Benchmarks and Simulation Infrastructures
  - 8.7 Real PIM Hardware Systems and Prototypes
  - 8.8 Security Considerations
- 9 Conclusion and Future Outlook

#### 1. Introduction

Main memory, built using the Dynamic Random Access Memory (DRAM) technology, is a major component in nearly all computing systems, including servers, cloud platforms, mobile/embedded devices, and sensor systems. Across all of these systems, the data working set sizes of modern applications are rapidly growing, while the need for fast analysis of such data is increasing. Thus, main memory is becoming an increasingly significant bottleneck across a wide variety of computing systems and applications [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16].
Alleviating the main memory bottleneck requires the memory capacity, energy, cost, and performance to all scale in an efficient manner across technology generations. Unfortunately, it has become increasingly difficult in recent years, especially the past decade, to scale all of these dimensions [1, 2, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], and thus the main memory bottleneck has been worsening. A major reason for the main memory bottleneck is the high energy and latency cost associated with data movement. In modern computers, to perform any operation on data that resides in main memory, the processor must retrieve the data from main memory. This requires the memory controller to issue commands to a DRAM module across a relatively slow and power-hungry off-chip bus (known as the memory channel). The DRAM module sends the requested data across the memory channel, after which the data is placed in the caches and registers. The CPU can perform computation on the data once the data is in its registers. Data movement from the DRAM to the CPU incurs long latency and consumes a significant amount of energy [7, 50, 51, 52, 53, 54]. These costs are often exacerbated by the fact that much of the data brought into the caches is not reused by the CPU [52, 53, 55, 56], providing little benefit in return for the high latency and energy cost. The cost of data movement is a fundamental issue with the processor-centric nature of contemporary computer systems. The CPU is considered to be the master in the system, and computation is performed only in the processor (and accelerators). In contrast, data storage and communication units, including the main memory, are treated as unintelligent workers that are incapable of computation. 
As a result of this processor-centric design paradigm, data moves a lot in the system between the computation units and communication/storage units so that computation can be done on it. With the increasingly data-centric nature of contemporary and emerging appli-

## A Tutorial on Processing in Memory

Onur Mutlu, "Memory-Centric Computing" Education Class at <u>Embedded Systems Week (**ESWEEK**)</u>, Virtual, 9 October 2021. [Slides (pptx) (pdf)] [Abstract (pdf)] [Talk Video (2 hours, including Q&A)] [Invited Paper at DATE 2021] ["A Modern Primer on Processing in Memory" paper]

https://www.youtube.com/watch?v=N1Ac1ov1JOM

## Detailed Lectures on PIM (I)

- Computer Architecture, Fall 2020, Lecture 6
  - Computation in Memory (ETH Zürich, Fall 2020)
  - https://www.youtube.com/watch?v=oGcZAGwfEUE&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=12
- Computer Architecture, Fall 2020, Lecture 7
  - Near-Data Processing (ETH Zürich, Fall 2020)
  - https://www.youtube.com/watch?v=j2GIigqn1Qw&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=13
- Computer Architecture, Fall 2020, Lecture 11a
  - Memory Controllers (ETH Zürich, Fall 2020)
  - https://www.youtube.com/watch?v=TeG773OgiMQ&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=20
- Computer Architecture, Fall 2020, Lecture 12d
  - Real Processing-in-DRAM with UPMEM (ETH Zürich, Fall 2020)
  - https://www.youtube.com/watch?v=Sscy1Wrr22A&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=25

## Detailed Lectures on PIM (II)

- Computer Architecture, Fall 2020, Lecture 15
  - Emerging Memory Technologies (ETH Zürich, Fall 2020)
  - https://www.youtube.com/watch?v=AlE1rD9G\_YU&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=28
- Computer Architecture, Fall 2020, Lecture 16a
  - Opportunities & Challenges of Emerging Memory
Technologies (ETH Zürich, Fall 2020)
  - https://www.youtube.com/watch?v=pmLszWGmMGQ&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=29
- Computer Architecture, Fall 2020, Guest Lecture
  - In-Memory Computing: Memory Devices & Applications (ETH Zürich, Fall 2020)
  - https://www.youtube.com/watch?v=wNmqQHiEZNk&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=41

## Many Interesting Things Are Happening Today in Computer Architecture

# Performance and Energy Efficiency

## Apple M1 System on Chip (2021)

## Apple M1 Max System on Chip (2021)

## Bigger and More Powerful Systems (2021)

M1 Max

## TESLA Full Self-Driving Computer (2019)

- ML accelerator: 260 mm<sup>2</sup>, 6 billion transistors, 600 GFLOPS GPU, 12 ARM 2.2 GHz CPUs.
- Two redundant chips for better safety.

## Tesla Dojo ML Training Chip (2021)

Tesla Dojo Chip

#### D1 Chip

- 362 TFLOPs BF16/CFP8
- 22.6 TFLOPs FP32
- 10TBps/dir. on-Chip Bandwidth
- 4TBps/edge. off-Chip Bandwidth
- **400W TDP**
- 645mm<sup>2</sup>, 7nm Technology
- **50 Billion** Transistors
- 11+ Miles Of Wires

## Tesla Dojo ML Training System (2021)

Tesla Dojo System

## Tesla Dojo ML Training System (2021)

Tesla Dojo Chip & System

## Huge Demand for Performance & Efficiency
## **Exponential Growth of Neural Networks** 1800x more compute In just 2 years Tomorrow, multi-trillion parameter models # Google Tensor Processing Unit (~2016) **Figure 3.** TPU Printed Circuit Board. It can be inserted in the slot for an SATA disk in a server, but the card uses PCIe Gen3 x16. **Figure 4.** Systolic data flow of the Matrix Multiply Unit. Software has the illusion that each 256B input is read at once, and they instantly update one location of each of 256 accumulator RAMs. Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA 2017. # Google TPU Generation II (2017) https://www.nextplatform.com/2017/05/17/first-depth-look-googles-new-second-generation-tpu/ 4 TPU chips vs 1 chip in TPU1 High Bandwidth Memory vs DDR3 Floating point operations vs FP16 45 TFLOPS per chip vs 23 TOPS Designed for training and inference vs only inference # Google TPU Generation III TPU v2 - 4 chips, 2 cores per chip TPU v3 - 4 chips, 2 cores per chip More High Bandwidth Memory More Systolic Arrays ## Google TPU Generation IV (2021) ### New ML applications (vs. TPU3): - Computer vision - Natural Language Processing (NLP) - Recommender system - Reinforcement learning that plays Go 250 TFLOPS per chip in 2021 vs 90 TFLOPS in TPU3 1 ExaFLOPS per board https://spectrum.ieee.org/tech-talk/computing/hardware/heres-how-googles-tpu-v4-ai-chip-stacked-up-in-training-tests # An Example Modern Systolic Array: TPU (II) As reading a large SRAM uses much more power than arithmetic, the matrix unit uses systolic execution to save energy by reducing reads and writes of the Unified Buffer [Kun80][Ram91][Ovt15b]. Figure 4 shows that data flows in from the left, and the weights are loaded from the top. A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront. The weights are preloaded, and take effect with the advancing wave alongside the first data of a new block. 
Control and data are pipelined to give the illusion that the 256 inputs are read at once, and that they instantly update one location of each of 256 accumulators. From a correctness perspective, software is unaware of the systolic nature of the matrix unit, but for performance, it does worry about the latency of the unit. Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA 2017. # An Example Modern Systolic Array: TPU (III) **Figure 1.** TPU Block Diagram. The main computation part is the yellow Matrix Multiply unit in the upper right hand corner. Its inputs are the blue Weight FIFO and the blue Unified Buffer (UB) and its output is the blue Accumulators (Acc). The yellow Activation Unit performs the nonlinear functions on the Acc, which go to the UB. # Many (Other) AI/ML Chips - Alibaba - Amazon - Facebook - Google - Huawei - Intel - Microsoft - NVIDIA - Tesla - Many Others and Many Startups... - Many More to Come... # Many (Other) AI/ML Chips (2019) # Many (Other) AI/ML Chips (2021) All information contained within this infographic is gathered from the internet and periodically updated, no guarantee is given that the information provided is correct, complete, and up-to-date. # Many Interesting Things Are Happening Today in Computer Architecture Reliability Safety Security Privacy # How Reliable/Secure/Safe is This Bridge? # Collapse of the "Galloping Gertie" # Another View # How Secure Are These People? Security is about preventing unforeseen consequences ## How Safe & Secure Is **This** Platform? 
# Security: RowHammer (2014)

# The Story of RowHammer

- One can predictably induce bit flips in commodity DRAM chips
  - >80% of the tested DRAM chips are vulnerable
- First example of how a simple hardware failure mechanism can create a widespread system security vulnerability

Andy Greenberg, Security, 08.31.16, 7:00 AM: "Forget Software—Now Hackers Are Exploiting Physics"

## Modern DRAM is Prone to Disturbance Errors

Repeatedly reading a row enough times (before memory gets refreshed) induces disturbance errors in adjacent rows in most real DRAM chips you can buy today

## Most DRAM Modules Are Vulnerable

- **A** company: up to **1.0×10<sup>7</sup>** errors
- **B** company: up to **2.7×10<sup>6</sup>** errors
- **C** company: up to **3.3×10<sup>5</sup>** errors

## One Can Take Over an Otherwise-Secure System

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)

Abstract. Memory isolation is a key property of a reliable and secure computing system — an access to one memory address should not have unintended side effects on data stored in other addresses. However, as DRAM process technology

Project Zero: News and updates from the Project Zero team at Google

Monday, March 9, 2015: Exploiting the DRAM rowhammer bug to gain kernel privileges (Seaborn+, 2015)

# Security: RowHammer (2014)

It's like breaking into an apartment by repeatedly slamming a neighbor's door until the vibrations open the door you were after

# More Security Implications (I)

"We can gain unrestricted access to systems of website visitors."

www.iaik.tugraz.at

Not there yet, but ... ROOT privileges for web apps!
Daniel Gruss (@lavados), Clémentine Maurice (@BloodyTangerine), December 28, 2015 — 32c3, Hamburg, Germany

Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript (DIMVA'16)

Source: https://lab.dsst.io/32c3-slides/7197.html

# More Security Implications (II)

"Can gain control of a smart phone deterministically"

Hammer And Root Millions of Androids

Drammer: Deterministic Rowhammer Attacks on Mobile Platforms, CCS'16

# More Security Implications (III)

Using an integrated GPU in a mobile system to remotely escalate privilege via the WebGL interface

"GRAND PWNING UNIT": Drive-by Rowhammer attack uses GPU to compromise an Android phone

JavaScript based GLitch pwns browsers by flipping bits inside memory chips.

**DAN GOODIN - 5/3/2018, 12:00 PM**

# Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU

- Pietro Frigo, Vrije Universiteit Amsterdam, p.frigo@vu.nl
- Cristiano Giuffrida, Vrije Universiteit Amsterdam, giuffrida@cs.vu.nl
- Herbert Bos, Vrije Universiteit Amsterdam, herbertb@cs.vu.nl
- Kaveh Razavi, Vrije Universiteit Amsterdam, kaveh@cs.vu.nl

# More Security Implications (IV)

Rowhammer over RDMA (I)

"THROWHAMMER": Packets over a LAN are all it takes to trigger serious Rowhammer bit flips

The bar for exploiting potentially serious DDR weakness keeps getting lower.
**DAN GOODIN - 5/10/2018, 5:26 PM**

## Throwhammer: Rowhammer Attacks over the Network and Defenses

- Andrei Tatar, VU Amsterdam
- Radhesh Krishnan, VU Amsterdam
- Herbert Bos, VU Amsterdam
- Elias Athanasopoulos, University of Cyprus
- Kaveh Razavi, VU Amsterdam
- Cristiano Giuffrida, VU Amsterdam

# More Security Implications (V)

Rowhammer over RDMA (II)

Nethammer—Exploiting DRAM Rowhammer Bug Through Network Requests

# Nethammer: Inducing Rowhammer Faults through Network Requests

- Moritz Lipp, Graz University of Technology
- Daniel Gruss, Graz University of Technology
- Misiker Tadesse Aga, University of Michigan
- Clémentine Maurice, Univ Rennes, CNRS, IRISA
- Lukas Lamster, Graz University of Technology
- Michael Schwarz, Graz University of Technology
- Lukas Raab, Graz University of Technology

# More Security Implications (VI)

**IEEE S&P 2020**

RAMBleed

# RAMBleed: Reading Bits in Memory Without Accessing Them

- Andrew Kwong, University of Michigan, ankwong@umich.edu
- Daniel Genkin, University of Michigan, genkin@umich.edu
- Daniel Gruss, Graz University of Technology, daniel.gruss@iaik.tugraz.at
- Yuval Yarom, University of Adelaide and Data61, yval@cs.adelaide.edu.au

# More Security Implications (VII)

USENIX Security 2019

# Terminal Brain Damage: Exposing the Graceless Degradation in Deep Neural Networks Under Hardware Fault Attacks

Sanghyun Hong, Pietro Frigo<sup>†</sup>, Yiğitcan Kaya, Cristiano Giuffrida<sup>†</sup>, Tudor Dumitraș

University of Maryland, College Park; <sup>†</sup>Vrije Universiteit Amsterdam

#### A Single Bit-flip Can Cause Terminal Brain Damage to DNNs

One specific bit-flip in a DNN's representation leads to accuracy drop over 90%

Our research found that a specific bit-flip in a DNN's bitwise representation can cause the accuracy loss up to 90%, and the DNN has 40-50% parameters, on average, that can lead to the accuracy drop over 10% when individually subjected to such single bitwise corruptions...
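
The claim that a single bit flip can wreck a DNN is easier to appreciate once the number encoding is in view. Below is a minimal sketch, assuming the weights are stored as IEEE-754 float32 values (the attacked models may use other formats); `flip_bit` and the example weight `0.5` are hypothetical names and values chosen purely for illustration.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, encoded as IEEE-754 float32, with one bit flipped.

    Bit 0 is the least significant mantissa bit, bits 23-30 are the
    exponent, and bit 31 is the sign.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 as a 32-bit unsigned integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # Flip the requested bit and reinterpret the result as float32.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


weight = 0.5  # hypothetical small DNN weight
for bit in (0, 22, 30, 31):
    print(f"flip bit {bit:2d}: {weight} -> {flip_bit(weight, bit)!r}")
```

Flipping a low mantissa bit barely changes the weight, whereas flipping the top exponent bit (bit 30) turns 0.5 into 2^127. That is one plausible reason why a few specific bits matter far more than the rest.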
# More Security Implications (VIII)

## USENIX Security 2020

## DeepHammer: Depleting the Intelligence of Deep Neural Networks through Targeted Chain of Bit Flips

- Fan Yao, University of Central Florida, fan.yao@ucf.edu
- Adnan Siraj Rakin and Deliang Fan, Arizona State University, asrakin@asu.edu, dfan@asu.edu

## Degrade the inference accuracy to the level of Random Guess

Example: ResNet-20 for CIFAR-10, 10 output classes

- Before attack, Accuracy: 90.2%
- After attack, Accuracy: ~10% (1/10)

# Can We Depend on Computers?

# RowHammer: Eight Years Ago...

Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu,
"Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors"
Proceedings of the 41st International Symposium on Computer Architecture (ISCA), Minneapolis, MN, June 2014.
[Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Source Code and Data]

# Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors

Yoongu Kim<sup>1</sup> Ross Daly\* Jeremie Kim<sup>1</sup> Chris Fallin\* Ji Hye Lee<sup>1</sup> Donghyuk Lee<sup>1</sup> Chris Wilkerson<sup>2</sup> Konrad Lai Onur Mutlu<sup>1</sup>

<sup>1</sup>Carnegie Mellon University <sup>2</sup>Intel Labs

# RowHammer: 2019 and Beyond...

Onur Mutlu and Jeremie Kim,
"RowHammer: A Retrospective"
<u>IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems</u> (**TCAD**) Special Issue on Top Picks in Hardware and Embedded Security, 2019.
[Preliminary arXiv version] [Slides from COSADE 2019 (pptx)] [Slides from VLSI-SOC 2020 (pptx) (pdf)] [Talk Video (1 hr 15 minutes, with Q&A)]

# RowHammer: A Retrospective

Onur Mutlu<sup>§‡</sup> Jeremie S. Kim<sup>‡§</sup>

<sup>§</sup>ETH Zürich <sup>‡</sup>Carnegie Mellon University

# RowHammer in 2020 & 2021

#### RowHammer is Getting Much Worse

Jeremie S. Kim, Minesh Patel, A.
Giray Yaglikci, Hasan Hassan, Roknoddin Azizi, Lois Orosa, and Onur Mutlu,
"Revisiting RowHammer: An Experimental Analysis of Modern DRAM Devices and Mitigation Techniques"
Proceedings of the <u>47th International Symposium on Computer Architecture</u> (**ISCA**), Valencia, Spain, June 2020.
[Slides (pptx) (pdf)] [Lightning Talk Slides (pptx) (pdf)] [Talk Video (20 minutes)] [Lightning Talk Video (3 minutes)]

# Revisiting RowHammer: An Experimental Analysis of Modern DRAM Devices and Mitigation Techniques

Jeremie S. Kim<sup>§†</sup> Minesh Patel<sup>§</sup> A. Giray Yağlıkçı<sup>§</sup> Hasan Hassan<sup>§</sup> Roknoddin Azizi<sup>§</sup> Lois Orosa<sup>§</sup> Onur Mutlu<sup>§†</sup>

<sup>§</sup>ETH Zürich <sup>†</sup>Carnegie Mellon University

#### Existing Solutions Do Not Work

Pietro Frigo, Emanuele Vannacci, Hasan Hassan, Victor van der Veen, Onur Mutlu, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi,
"TRRespass: Exploiting the Many Sides of Target Row Refresh"
Proceedings of the <u>41st IEEE Symposium on Security and Privacy</u> (**S&P**), San Francisco, CA, USA, May 2020.
[Slides (pptx) (pdf)] [Lecture Slides (pptx) (pdf)] [Talk Video (17 minutes)] [Lecture Video (59 minutes)] [Source Code] [Web Article]
Best paper award. Pwnie Award 2020 for Most Innovative Research. Pwnie Awards 2020

# TRRespass: Exploiting the Many Sides of Target Row Refresh

Pietro Frigo\*† Emanuele Vannacci\*† Hasan Hassan§ Victor van der Veen¶ Onur Mutlu§ Cristiano Giuffrida\* Herbert Bos\* Kaveh Razavi\*

\*Vrije Universiteit Amsterdam §ETH Zürich ¶Qualcomm Technologies Inc.

#### Hard to Guarantee RowHammer-Free Chips

Lucian Cojocar, Jeremie Kim, Minesh Patel, Lillian Tsai, Stefan Saroiu, Alec Wolman, and Onur Mutlu,
"Are We Susceptible to Rowhammer? An End-to-End Methodology for Cloud Providers"
Proceedings of the <u>41st IEEE Symposium on Security and Privacy</u> (**S&P**), San Francisco, CA, USA, May 2020.
[Slides (pptx) (pdf)] [Talk Video (17 minutes)]

# Are We Susceptible to Rowhammer?
An End-to-End Methodology for Cloud Providers Lucian Cojocar, Jeremie Kim<sup>§†</sup>, Minesh Patel<sup>§</sup>, Lillian Tsai<sup>‡</sup>, Stefan Saroiu, Alec Wolman, and Onur Mutlu<sup>§†</sup> Microsoft Research, <sup>§</sup>ETH Zürich, <sup>†</sup>CMU, <sup>‡</sup>MIT SAFARI 183 #### RowHammer Has Many Dimensions Lois Orosa, Abdullah Giray Yaglikci, Haocong Luo, Ataberk Olgun, Jisung Park, Hasan Hassan, Minesh Patel, Jeremie S. Kim, and Onur Mutlu, "A Deeper Look into RowHammer's Sensitivities: Experimental Analysis of Real DRAM Chips and Implications on Future Attacks and Defenses" Proceedings of the <u>54th International Symposium on Microarchitecture</u> (**MICRO**), Virtual, October 2021. [Slides (pptx) (pdf)] [Short Talk Slides (pptx) (pdf)] [Lightning Talk Slides (pptx) (pdf)] [Talk Video (21 minutes)] [Lightning Talk Video (1.5 minutes)] [arXiv version] #### A Deeper Look into RowHammer's Sensitivities: Experimental Analysis of Real DRAM Chips and Implications on Future Attacks and Defenses Lois Orosa\* ETH Zürich A. Giray Yağlıkçı\* ETH Zürich Haocong Luo ETH Zürich Ataberk Olgun ETH Zürich, TOBB ETÜ Jisung Park ETH Zürich Hasan Hassan ETH Zürich Minesh Patel ETH Zürich Jeremie S. Kim ETH Zürich Onur Mutlu ETH Zürich #### Industry-Adopted Solutions Do Not Work Hasan Hassan, Yahya Can Tugrul, Jeremie S. Kim, Victor van der Veen, Kaveh Razavi, and Onur Mutlu, "Uncovering In-DRAM RowHammer Protection Mechanisms: A New Methodology, Custom RowHammer Patterns, and Implications" Proceedings of the <u>54th International Symposium on Microarchitecture</u> (**MICRO**), Virtual, October 2021. [Slides (pptx) (pdf)] [Short Talk Slides (pptx) (pdf)] [Lightning Talk Slides (pptx) (pdf)] [Talk Video (25 minutes)] [Lightning Talk Video (100 seconds)] arXiv version #### **Uncovering In-DRAM RowHammer Protection Mechanisms:** A New Methodology, Custom RowHammer Patterns, and Implications Yahya Can Tuğrul<sup>†‡</sup> Jeremie S. 
Kim<sup>†</sup> Hasan Hassan<sup>†</sup> Victor van der Veen<sup>σ</sup> Onur Mutlu<sup>†</sup> Kaveh Razavi<sup>†</sup>

<sup>†</sup>ETH Zürich <sup>‡</sup>TOBB University of Economics & Technology <sup>σ</sup>Qualcomm Technologies Inc.

#### BlockHammer Solution in 2021

A. Giray Yaglikci, Minesh Patel, Jeremie S. Kim, Roknoddin Azizi, Ataberk Olgun, Lois Orosa, Hasan Hassan, Jisung Park, Konstantinos Kanellopoulos, Taha Shahroodi, Saugata Ghose, and Onur Mutlu,
"BlockHammer: Preventing RowHammer at Low Cost by Blacklisting Rapidly-Accessed DRAM Rows"
Proceedings of the <u>27th International Symposium on High-Performance Computer Architecture</u> (**HPCA**), Virtual, February-March 2021.
[Slides (pptx) (pdf)] [Short Talk Slides (pptx) (pdf)] [Talk Video (22 minutes)] [Short Talk Video (7 minutes)]

#### **BlockHammer: Preventing RowHammer at Low Cost by Blacklisting Rapidly-Accessed DRAM Rows**

A. Giray Yağlıkçı<sup>1</sup> Minesh Patel<sup>1</sup> Jeremie S. Kim<sup>1</sup> Roknoddin Azizi<sup>1</sup> Ataberk Olgun<sup>1</sup> Lois Orosa<sup>1</sup> Hasan Hassan<sup>1</sup> Jisung Park<sup>1</sup> Konstantinos Kanellopoulos<sup>1</sup> Taha Shahroodi<sup>1</sup> Saugata Ghose<sup>2</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>University of Illinois at Urbana–Champaign

#### Detailed Lectures on RowHammer

- Computer Architecture, Fall 2020, Lecture 4b
  - RowHammer (ETH Zürich, Fall 2020)
  - https://www.youtube.com/watch?v=KDy632z23UE&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=8
- Computer Architecture, Fall 2020, Lecture 5a
  - RowHammer in 2020: TRRespass (ETH Zürich, Fall 2020)
  - https://www.youtube.com/watch?v=pwRw7QqK_qA&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=9
- Computer Architecture, Fall 2020, Lecture 5b
  - RowHammer in 2020: Revisiting RowHammer (ETH Zürich, Fall 2020)
  - https://www.youtube.com/watch?v=gR7XR-Eepcg&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=10
- Computer Architecture, Fall 2020, Lecture 5c
  - Secure and
Reliable Memory (ETH Zürich, Fall 2020)
  - https://www.youtube.com/watch?v=HvswnsfG3oQ&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=11

#### The Story of RowHammer Lecture ...

Onur Mutlu,
"The Story of RowHammer"
Keynote Talk at <u>Secure Hardware, Architectures, and Operating Systems Workshop</u> (**SeHAS**), held with <u>HiPEAC 2021 Conference</u>, Virtual, 19 January 2021.
[Slides (pptx) (pdf)] [Talk Video (1 hr 15 minutes, with Q&A)]

### Security: Meltdown and Spectre (2018)

#### Meltdown and Spectre

- Someone can steal secret data from the system even though
  - your program and data are perfectly correct and
  - your hardware behaves according to the specification and
  - there are no software vulnerabilities/bugs

#### Why?

- Speculative execution leaves traces of secret data in the processor's cache (internal storage)
  - It brings data that is not supposed to be brought/accessed if there was no speculative execution
  - A malicious program can inspect the contents of the cache to "infer" secret data that it is not supposed to access
  - A malicious program can actually force another program to speculatively execute code that leaves traces of secret data

#### More on Meltdown/Spectre Vulnerabilities

### Project Zero

News and updates from the Project Zero team at Google

Wednesday, January 3, 2018

Reading privileged memory with a side-channel

Posted by Jann Horn, Project Zero

We have discovered that CPU data cache timing can be abused to efficiently leak information out of misspeculated execution, leading to (at worst) arbitrary virtual memory read vulnerabilities across local security boundaries in various contexts.
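
The "Why?" bullets above can be made concrete with a toy model. What follows is a simplified simulation, not an attack and not the code of any real exploit: the cache is just a Python set, the latencies are made-up constants, and every name (`ToyCache`, `victim_speculative_access`, `attacker_recover`) is hypothetical. It only illustrates the idea that a squashed speculative load can still leave a secret-dependent line in the cache, and that timing each candidate line reveals which one it was.

```python
class ToyCache:
    """A cache modeled as a set of line indices: hits are fast, misses are slow."""

    HIT, MISS = 1, 100  # made-up latencies in cycles, for illustration only

    def __init__(self) -> None:
        self.lines: set[int] = set()

    def access(self, line: int) -> int:
        """Access a line, return the observed latency, and leave the line cached."""
        latency = self.HIT if line in self.lines else self.MISS
        self.lines.add(line)
        return latency

    def flush(self) -> None:
        """Evict everything, as an attacker would before the measurement."""
        self.lines.clear()


def victim_speculative_access(cache: ToyCache, secret: int) -> None:
    """Model a squashed speculative load.

    The architectural result is discarded (nothing is returned), but the
    cache line selected by the secret value stays in the cache.
    """
    cache.access(secret)


def attacker_recover(cache: ToyCache, num_lines: int = 256) -> int:
    """Time one access per candidate line; the fastest line reveals the secret."""
    timings = {line: cache.access(line) for line in range(num_lines)}
    return min(timings, key=timings.get)


cache = ToyCache()
cache.flush()                         # 1. attacker empties the cache
victim_speculative_access(cache, 42)  # 2. speculation touches a secret-dependent line
print(attacker_recover(cache))        # 3. attacker infers the secret from timing
```

Real attacks must deal with noise, prefetchers, and address translation, all of which this sketch deliberately leaves out.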
# Many Interesting Things Are Happening Today in Computer Architecture

### More Demanding Workloads

#### Huge Demand for Performance & Efficiency

Sean Lie

#### **Exponential Growth of Neural Networks**

- 1800x more compute
- In just 2 years
- Tomorrow, multi-trillion parameter models

### Increasingly Demanding Applications

## Dream

## and, they will come

As applications push boundaries, computing platforms will become increasingly strained.

### New Genome Sequencing Technologies

# Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions

Damla Senol Cali, Jeremie S Kim, Saugata Ghose, Can Alkan, Onur Mutlu

Briefings in Bioinformatics, bby017, https://doi.org/10.1093/bib/bby017

Published: 02 April 2018

Oxford Nanopore MinION

#### Data → performance & energy bottleneck

#### Why Do We Care? An Example

200 Oxford Nanopore sequencers have left UK for China, to support rapid, near-sample coronavirus sequencing for outbreak surveillance

Fri 31st January 2020

Following extensive support of, and collaboration with, public health professionals in China, Oxford Nanopore has shipped an additional 200 MinION sequencers and related consumables to China. These will be used to support the ongoing surveillance of the current coronavirus outbreak, adding to a large number of the devices already installed in the country.

Each MinION sequencer is approximately the size of a stapler, and can provide rapid sequence information about the coronavirus.

700Kg of Oxford Nanopore sequencers and consumables are on their way for use by Chinese scientists in understanding the current coronavirus outbreak.
### Population-Scale Microbiome Profiling

#### City-Scale Microbiome Profiling

(A) The five boroughs of NYC include (1) Manhattan (green)

(B) The collection from the 466 subway stations of NYC across the 24 subway lines involved three main steps: (1) collection with Copan Elution swabs, (2) data entry into the database, and (3) uploading of the data. An image is shown of the current collection database, taken from http://pathomap.giscloud.com.

(C) Workflow for sample DNA extraction, library preparation, sequencing, quality trimming of the FASTQ files, and alignment with MegaBLAST and MetaPhlAn to discern taxa present

#### Example: Rapid Surveillance of Ebola Outbreak

Figure 1: Deployment of the portable genome surveillance system in Guinea.

Quick+, "Real-time, portable genome sequencing for Ebola surveillance", Nature, 2016

### High-Throughput Genome Sequencers

- Illumina MiSeq
- Illumina NovaSeq 6000
- Pacific Biosciences Sequel II
- Pacific Biosciences RS II
- Oxford Nanopore MinION
- ... and more! All produce data with different properties.

Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, Onur Mutlu
"Accelerating Genome Analysis: A Primer on an Ongoing Journey"
IEEE Micro, August 2020.

#### The Genomic Era

[Figure: Sequencing, **Genome Analysis**, Variant Calling, **Scientific Discovery**; example reads: read4: CGCTTCCAT, read5: CCATGACGC, read6: TTCCATGAC]

### Data → performance & energy bottleneck

#### Software Acceleration: Eliminate Useless Work

- Download the source code and try for yourself
  - Download link to FastHASH

Xin et al.
BMC Genomics 2013, **14**(Suppl 1):S13 http://www.biomedcentral.com/1471-2164/14/S1/S13 #### **PROCEEDINGS** **Open Access** #### Accelerating read mapping with FastHASH Hongyi Xin<sup>1</sup>, Donghyuk Lee<sup>1</sup>, Farhad Hormozdiari<sup>2</sup>, Samihan Yedkar<sup>1</sup>, Onur Mutlu<sup>1\*</sup>, Can Alkan<sup>3\*</sup> From The Eleventh Asia Pacific Bioinformatics Conference (APBC 2013) Vancouver, Canada. 21-24 January 2013 #### Shifted Hamming Distance: SIMD Acceleration https://github.com/CMU-SAFARI/Shifted-Hamming-Distance Bioinformatics, 31(10), 2015, 1553–1560 doi: 10.1093/bioinformatics/btu856 Advance Access Publication Date: 10 January 2015 Original Paper Sequence analysis # Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping Hongyi Xin<sup>1,\*</sup>, John Greth<sup>2</sup>, John Emmons<sup>2</sup>, Gennady Pekhimenko<sup>1</sup>, Carl Kingsford<sup>3</sup>, Can Alkan<sup>4,\*</sup> and Onur Mutlu<sup>2,\*</sup> Xin+, "Shifted Hamming Distance: A Fast and Accurate SIMD-friendly Filter to Accelerate Alignment Verification in Read Mapping", Bioinformatics 2015. #### GateKeeper: FPGA-Based Alignment Filtering #### GateKeeper: FPGA-Based Alignment Filtering Mohammed Alser, Hasan Hassan, Hongyi Xin, Oguz Ergin, Onur Mutlu, and Can Alkan "GateKeeper: A New Hardware Architecture for Accelerating Pre-Alignment in DNA Short Read Mapping" Bioinformatics, [published online, May 31], 2017. Source Code Online link at Bioinformatics Journal # GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping Mohammed Alser ™, Hasan Hassan, Hongyi Xin, Oğuz Ergin, Onur Mutlu ™, Can Alkan ™ Bioinformatics, Volume 33, Issue 21, 1 November 2017, Pages 3355–3363, https://doi.org/10.1093/bioinformatics/btx342 Published: 31 May 2017 Article history ▼ **SAFARI** ### In-Memory DNA Sequence Analysis Jeremie S. 
Kim, Damla Senol Cali, Hongyi Xin, Donghyuk Lee, Saugata Ghose, Mohammed Alser, Hasan Hassan, Oguz Ergin, Can Alkan, and Onur Mutlu, "GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies" **BMC Genomics**, 2018. Proceedings of the <u>16th Asia Pacific Bioinformatics Conference</u> (**APBC**), Yokohama, Japan, January 2018. [Slides (pptx) (pdf)] Source Code [arxiv.org Version (pdf)] Talk Video at AACBB 2019 # GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies Jeremie S. Kim<sup>1,6\*</sup>, Damla Senol Cali<sup>1</sup>, Hongyi Xin<sup>2</sup>, Donghyuk Lee<sup>3</sup>, Saugata Ghose<sup>1</sup>, Mohammed Alser<sup>4</sup>, Hasan Hassan<sup>6</sup>, Oguz Ergin<sup>5</sup>, Can Alkan<sup>4\*</sup> and Onur Mutlu<sup>6,1\*</sup> From The Sixteenth Asia Pacific Bioinformatics Conference 2018 Yokohama, Japan. 15-17 January 2018 ### Shouji (障子) [Alser+, Bioinformatics 2019] Mohammed Alser, Hasan Hassan, Akash Kumar, Onur Mutlu, and Can Alkan, "Shouji: A Fast and Efficient Pre-Alignment Filter for Sequence Alignment" Bioinformatics, [published online, March 28], 2019. [Source Code] [Online link at Bioinformatics Journal] Bioinformatics, 2019, 1–9 doi: 10.1093/bioinformatics/btz234 Advance Access Publication Date: 28 March 2019 Original Paper Sequence alignment ## Shouji: a fast and efficient pre-alignment filter for sequence alignment Mohammed Alser<sup>1,2,3,\*</sup>, Hasan Hassan<sup>1</sup>, Akash Kumar<sup>2</sup>, Onur Mutlu<sup>1,3,\*</sup> and Can Alkan<sup>3,\*</sup> <sup>1</sup>Computer Science Department, ETH Zürich, Zürich 8092, Switzerland, <sup>2</sup>Chair for Processor Design, Center For Advancing Electronics Dresden, Institute of Computer Engineering, Technische Universität Dresden, 01062 Dresden, Germany and <sup>3</sup>Computer Engineering Department, Bilkent University, 06800 Ankara, Turkey \*To whom correspondence should be addressed. 
Associate Editor: Inanc Birol

#### SneakySnake [Alser+, Bioinformatics 2020]

Mohammed Alser, Taha Shahroodi, Juan Gómez-Luna, Can Alkan, and Onur Mutlu,
"SneakySnake: A Fast and Accurate Universal Genome Pre-Alignment Filter for CPUs, GPUs, and FPGAs"
**Bioinformatics**, to appear in 2020.
[Source Code] [Online link at Bioinformatics Journal]

# SneakySnake: A Fast and Accurate Universal Genome Pre-Alignment Filter for CPUs, GPUs, and FPGAs

Mohammed Alser<sup>1,2,\*</sup>, Taha Shahroodi<sup>1</sup>, Juan Gómez-Luna<sup>1,2</sup>, Can Alkan<sup>4,\*</sup>, and Onur Mutlu<sup>1,2,3,4,\*</sup>

<sup>1</sup>Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland
<sup>2</sup>Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8006, Switzerland
<sup>3</sup>Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh 15213, PA, USA
<sup>4</sup>Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey

#### GenASM Framework [MICRO 2020]

Damla Senol Cali, Gurpreet S. Kalsi, Zulal Bingol, Can Firtina, Lavanya Subramanian, Jeremie S. Kim, Rachata Ausavarungnirun, Mohammed Alser, Juan Gomez-Luna, Amirali Boroumand, Anant Nori, Allison Scibisz, Sreenivas Subramoney, Can Alkan, Saugata Ghose, and Onur Mutlu,
"GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis"
Proceedings of the 53rd International Symposium on Microarchitecture (MICRO), Virtual, October 2020.
[<u>Lightning Talk Video</u> (1.5 minutes)] [<u>Lightning Talk Slides (pptx) (pdf)</u>] [<u>Talk Video</u> (18 minutes)] [<u>Slides (pptx) (pdf)</u>]

#### GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis

Damla Senol Cali<sup>†™</sup> Gurpreet S. Kalsi<sup>™</sup> Zülal Bingöl<sup>▽</sup> Can Firtina<sup>⋄</sup> Lavanya Subramanian<sup>‡</sup> Jeremie S. Kim<sup>⋄†</sup> Rachata Ausavarungnirun<sup>⊙</sup> Mohammed Alser<sup>⋄</sup> Juan Gomez-Luna<sup>⋄</sup> Amirali Boroumand<sup>†</sup> Anant Nori<sup>™</sup> Allison Scibisz<sup>†</sup> Sreenivas Subramoney<sup>™</sup> Can Alkan<sup>▽</sup> Saugata Ghose<sup>\*†</sup> Onur Mutlu<sup>⋄†▽</sup>

<sup>†</sup>Carnegie Mellon University <sup>™</sup>Processor Architecture Research Lab, Intel Labs <sup>▽</sup>Bilkent University <sup>⋄</sup>ETH Zürich <sup>‡</sup>Facebook <sup>⊙</sup>King Mongkut's University of Technology North Bangkok <sup>\*</sup>University of Illinois at Urbana–Champaign

#### FPGA-based Near-Memory Analytics

Gagandeep Singh, Mohammed Alser, Damla Senol Cali, Dionysios Diamantopoulos, Juan Gómez-Luna, Henk Corporaal, and Onur Mutlu,
"FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications"
IEEE Micro (IEEE MICRO), 2021.
# FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications Gagandeep Singh<sup>⋄</sup> Mohammed Alser<sup>⋄</sup> Damla Senol Cali<sup>⋈</sup> Dionysios Diamantopoulos<sup>▽</sup> Juan Gómez-Luna<sup>⋄</sup> Henk Corporaal<sup>⋆</sup> Onur Mutlu<sup>⋄⋈</sup> <sup>⋄</sup>ETH Zürich <sup>⋈</sup> Carnegie Mellon University \*Eindhoven University of Technology <sup>▽</sup>IBM Research Europe #### In-Storage Genome Filtering [ASPLOS 2022] Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserr, Rachata Ausavarungnirun, Nandita Vijaykumar, Mohammed Alser, and Onur Mutlu, "GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis" Proceedings of the <u>27th International Conference on Architectural Support for</u> <u>Programming Languages and Operating Systems</u> (**ASPLOS**), Virtual, February-March 2022. [<u>Lightning Talk Slides (pptx) (pdf)</u>] ## GenStore: A High-Performance In-Storage Processing System for Genome Sequence Analysis Nika Mansouri Ghiasi¹ Jisung Park¹ Harun Mustafa¹ Jeremie Kim¹ Ataberk Olgun¹ Arvid Gollwitzer¹ Damla Senol Cali² Can Firtina¹ Haiyu Mao¹ Nour Almadhoun Alserr¹ Rachata Ausavarungnirun³ Nandita Vijaykumar⁴ Mohammed Alser¹ Onur Mutlu¹ <sup>1</sup>ETH Zürich <sup>2</sup>Bionano Genomics <sup>3</sup>KMUTNB <sup>4</sup>University of Toronto ### Future of Genome Sequencing & Analysis Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, Onur Mutlu "Accelerating Genome Analysis: A Primer on an Ongoing Journey" IEEE Micro, August 2020. 
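
The pre-alignment filters cited above (Shifted Hamming Distance, GateKeeper, Shouji, SneakySnake) build on one idea: cheaply reject candidate mapping locations that cannot match before running expensive alignment. The sketch below shows only that idea, using a plain Hamming-distance check. It is not the algorithm of any of the cited tools, which also tolerate insertions and deletions. The function names, the reference string, and the threshold are hypothetical values chosen for illustration.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def passes_filter(read: str, reference: str, position: int, max_mismatches: int) -> bool:
    """Return True if the read could plausibly align at `position`.

    Only substitutions are considered here; real pre-alignment filters
    also have to account for insertions and deletions.
    """
    window = reference[position:position + len(read)]
    if len(window) < len(read):
        return False  # the candidate location runs off the end of the reference
    return hamming_distance(read, window) <= max_mismatches


def filter_candidates(read: str, reference: str, candidates: list[int],
                      max_mismatches: int) -> list[int]:
    """Keep only the candidate locations worth passing to full alignment."""
    return [p for p in candidates
            if passes_filter(read, reference, p, max_mismatches)]


reference = "ACGTTCCATGACGCAT"  # hypothetical reference fragment
read = "CCATGACGC"              # hypothetical 9-base read
print(filter_candidates(read, reference, [0, 4, 5, 10], max_mismatches=2))
```

In this toy setting only one of the four candidate locations survives, so the expensive alignment step would run once instead of four times.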
# COVID-19 Nanopore Sequencing (I)

From ONT (https://nanoporetech.com/covid-19/overview)

# COVID-19 Nanopore Sequencing (II)

From ONT (https://nanoporetech.com/covid-19/overview)

# Accelerating Genome Analysis: Overview

Mohammed Alser, Zulal Bingol, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, and Onur Mutlu,
"Accelerating Genome Analysis: A Primer on an Ongoing Journey"
IEEE Micro (IEEE MICRO), Vol. 40, No. 5, pages 65-75, September/October 2020.
[Slides (pptx) (pdf)] [Talk Video (1 hour 2 minutes)]

# Accelerating Genome Analysis: A Primer on an Ongoing Journey

- **Mohammed Alser**, ETH Zürich
- **Zülal Bingöl**, Bilkent University
- **Damla Senol Cali**, Carnegie Mellon University
- **Jeremie Kim**, ETH Zurich and Carnegie Mellon University
- **Saugata Ghose**, University of Illinois at Urbana–Champaign and Carnegie Mellon University
- **Can Alkan**, Bilkent University
- **Onur Mutlu**, ETH Zurich, Carnegie Mellon University, and Bilkent University

# More on Fast Genome Analysis ...

Onur Mutlu,
"Accelerating Genome Analysis: A Primer on an Ongoing Journey"
Invited Lecture at <u>Technion</u>, Virtual, 26 January 2021.
[Slides (pptx) (pdf)] [Talk Video (1 hour 37 minutes, including Q&A)] [Related Invited Paper (at IEEE Micro, 2020)]

# Detailed Lectures on Genome Analysis

- Computer Architecture, Fall 2020, Lecture 3a
  - Introduction to Genome Sequence Analysis (ETH Zürich, Fall 2020)
  - https://www.youtube.com/watch?v=CrRb32v7SJc&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=5
- Computer Architecture, Fall 2020, Lecture 8
  - **Intelligent Genome Analysis** (ETH Zürich, Fall 2020)
  - https://www.youtube.com/watch?v=ygmQpdDTL7o&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=14
- Computer Architecture, Fall 2020, Lecture 9a
  - **GenASM: Approx.
String Matching Accelerator** (ETH Zürich, Fall 2020)
  - https://www.youtube.com/watch?v=XoLpzmN Pas&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=15
- Accelerating Genomics Project Course, Fall 2020, Lecture 1
  - Accelerating Genomics (ETH Zürich, Fall 2020)
  - https://www.youtube.com/watch?v=rgjl8ZyLsAg&list=PL5Q2soXY2Zi9E2bBVAgCqLgwiDRQDTyId

# Many Interesting Things Are Happening Today in Computer Architecture

# More Demanding Workloads

# We Covered Until This Point in the Lecture

# Computing is Bottlenecked by Data

# Data is Key for AI, ML, Genomics, ...

- Important workloads are all data intensive
- They require rapid and efficient processing of large amounts of data
- Data is increasing
  - We can generate more than we can process

### Data is Key for Future Workloads

- **In-memory Databases** [Mao+, EuroSys'12; Clapp+ (Intel), IISWC'15]
- **In-Memory Data Analytics** [Clapp+ (Intel), IISWC'15; Awan+, BDCloud'15]
- **Graph/Tree Processing** [Xu+, IISWC'12; Umuroglu+, FPL'15]
- **Datacenter Workloads** [Kanev+ (Google), ISCA'15]

#### Data Overwhelms Modern Machines

- **In-memory Databases**
- **Graph/Tree Processing**
- **In-Memory Data Analytics** [Clapp+ (Intel), IISWC'15; Awan+, BDCloud'15]
- **Datacenter Workloads** [Kanev+ (Google), ISCA'15]

# Data → performance & energy bottleneck

### Data is Key for Future Workloads

- **Chrome**: Google's web browser
- **TensorFlow Mobile**: Google's machine learning framework
- **Video Playback** (VP9, YouTube): Google's video codec
- **Video Capture** (VP9, YouTube): Google's video codec

#### Data Overwhelms Modern Machines

Data → performance & energy bottleneck

#### Data Movement Overwhelms Modern Machines

Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu,
"Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks"
Proceedings
of the <u>23rd International Conference on Architectural Support for Programming Languages and Operating Systems</u> (ASPLOS), Williamsburg, VA, USA, March 2018.

### 62.7% of the total system energy is spent on data movement

# Data Movement vs. Computation Energy

A memory access consumes ~100-1000X the energy of a complex addition

# Many Interesting Things Are Happening Today in Computer Architecture

# Many Novel Concepts Investigated Today

- New Computing Paradigms (Rethinking the Full Stack)
  - Processing in Memory, Processing Near Data
  - Neuromorphic Computing
  - Fundamentally Secure and Dependable Computers
- New Accelerators & Systems (Algorithm-Hardware Co-Designs)
  - Artificial Intelligence & Machine Learning
  - Graph Analytics
  - Genome Analysis
- New Memories and Storage Systems
  - Non-Volatile Main Memory
  - Intelligent Memory

# Increasingly Demanding Applications

Dream, and they will come

As applications push boundaries, computing platforms will become increasingly strained.

# Increasingly Diverging/Complex Tradeoffs

# Data Movement vs.
Computation Energy

A memory access consumes ~100-1000X the energy of a complex addition

# Increasingly Complex Systems

#### Past systems

# Computer Architecture Today

- Computing landscape is very different from 10-20 years ago
- Applications and technology both demand novel architectures

# Computer Architecture Today (II)

- You can revolutionize the way computers are built, if you understand both the hardware and the software (and change each accordingly)
- You can invent new paradigms for computation, communication, and storage
- Recommended book: Thomas Kuhn, "The Structure of Scientific Revolutions" (1962)
  - Pre-paradigm science: no clear consensus in the field
  - Normal science: dominant theory used to explain/improve things (business as usual); exceptions considered anomalies
  - Revolutionary science: underlying assumptions re-examined

# Takeaways

- It is an exciting time to be understanding and designing computing architectures
- Many challenging and exciting problems in platform design
  - That no one has tackled (or thought about) before
  - That can have huge impact on the world's future
- Driven by huge hunger for data (Big Data), new applications (ML/AI, graph analytics, genomics), ever-greater realism, ...
  - We can easily collect more data than we can analyze/understand
- Driven by significant difficulties in keeping up with that hunger at the technology layer
  - Five walls: Energy, reliability, complexity, security, scalability
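
The earlier claim that a memory access consumes ~100-1000X the energy of a complex addition can be turned into a back-of-the-envelope estimate. The sketch below is illustrative only: the function name `data_movement_fraction`, the normalization of one addition to 1 energy unit, and the workload mix of 1000 additions per memory access are all assumptions made for this example, not figures from the lecture or the cited paper.

```python
def data_movement_fraction(num_adds, num_mem_accesses, mem_to_add_energy_ratio):
    """Return the fraction of total energy spent on memory accesses.

    Energy is measured in units where one addition costs 1 (an assumption
    made for illustration); a memory access costs `mem_to_add_energy_ratio`.
    """
    compute_energy = num_adds * 1.0
    movement_energy = num_mem_accesses * mem_to_add_energy_ratio
    return movement_energy / (compute_energy + movement_energy)


# Hypothetical workload: 1000 additions for every memory access.
for ratio in (100, 1000):  # the two ends of the ~100-1000X range
    frac = data_movement_fraction(
        num_adds=1000, num_mem_accesses=1, mem_to_add_energy_ratio=ratio
    )
    print(f"memory access = {ratio}x an addition: "
          f"{frac:.1%} of energy goes to data movement")
```

Even with this compute-heavy assumed mix, a single memory access accounts for roughly 9% of the energy at a 100X ratio and 50% at a 1000X ratio, which is one way to see why data movement tends to dominate.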