SAFARI Live Seminar: Gennady Pekhimenko 5 August 2021

Join us for our next SAFARI Live Seminar with Gennady Pekhimenko.

Thursday, August 5 at 5:00 pm Zurich time (CEST)

Efficient DNN Training at Scale: from Algorithms to Hardware
Gennady Pekhimenko, University of Toronto

Livestream at 5:00 pm Zurich time (CEST) on YouTube:

The recent popularity of deep neural networks (DNNs) has generated a lot of research interest in performing DNN-related computation efficiently. However, the primary focus of systems research is usually quite narrow and limited to (i) inference — i.e. how to efficiently execute already trained models and (ii) image classification networks as the primary benchmark for evaluation. In this talk, we will demonstrate a holistic approach to DNN training acceleration and scalability starting from the algorithm, to software and hardware optimizations, to special development and optimization tools.

In the first part of the talk, I will show our radically new approach on how to efficiently scale backpropagation algorithms used in DNN training (BPPSA, MLSys’20). Then I will demonstrate a new approach on how to train multiple DNN models jointly on the same hardware (HFTA, MLSys’21). I will then demonstrate several approaches to deal with one of the major limiting factors in DNN training: limited GPU/accelerator memory capacity (Echo, ISCA’20 and Gist, ISCA’18). At the end, I will show the performance and visualization tools we built in my group to understand, visualize, and optimize DNN models, and even predict their performance on different hardware.

Gennady Pekhimenko is an Assistant Professor at the University of Toronto, CS department and (by courtesy) ECE department, where he is leading the EcoSystem (Efficient Computing Systems) group. Gennady is also a Faculty Member at Vector Institute and a CIFAR AI chair. Before joining Univ. of Toronto, he spent a year in 2017 at Microsoft Research in Redmond in the Systems Research group. He got his PhD from the Computer Science Department at Carnegie Mellon University in 2016. Gennady is a recipient of Amazon Machine Learning Research Award, Facebook Faculty Research Award, Connaught New Researcher Award, NVIDIA Graduate, Microsoft Research, Qualcomm Innovation, and NSERC CGS-D Fellowships. His research interests are in the areas of systems, computer architecture, compilers, and applied machine learning.


SAFARI Live Seminar: Geraldo F. Oliveira 22 July 2021

We are pleased to have Geraldo F. Oliveira give a 3rd talk in our SAFARI Live Seminars!

Thursday, July 22 at 5:00 pm Zurich time (CEST)

DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks
Geraldo F. Oliveira, SAFARI Research Group, D-ITET, ETH Zurich

Livestream at 5:00 pm Zurich time (CEST) on YouTube:


Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ different techniques to reduce overheads caused by data movement, from traditional processor-centric mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging paradigms, such as near-data processing (NDP), where computation is moved closer to or inside memory. However, there is a lack of understanding about (1) the key metrics that identify different sources of data movement bottlenecks and (2) how different data movement bottlenecks can be alleviated by traditional and emerging data movement mitigation mechanisms.

In this work, we make two key contributions. First, we propose the first methodology to characterize data-intensive workloads based on the source of their data movement bottlenecks. This methodology is driven by insights obtained from a large-scale experimental characterization of 345 applications from 37 different benchmark suites and an evaluation of the performance of memory-bound functions from these applications with three data-movement mitigation mechanisms. Second, we release DAMOV, the first open-source benchmark suite for main memory data movement-related studies, based on our systematic characterization methodology. This suite consists of 144 functions representing different sources of data movement bottlenecks and can be used as a baseline benchmark set for future data-movement mitigation research. We show how DAMOV can aid the study of open research problems for NDP architectures via four case studies.

Our work provides new insights about the suitability of different classes of data movement bottlenecks to the different data movement mitigation mechanisms, including analyses on how the different data movement mitigation mechanisms impact performance and energy for memory bottlenecked applications. All our bottleneck analysis toolchains and DAMOV benchmarks are publicly and freely available ( We believe and hope that our work can enable further studies and research on hardware and software solutions for data movement bottlenecks, including near-data processing.

Speaker Bio:
Geraldo F. Oliveira is a Ph.D. student in the SAFARI Research Group @ETH Zurich. He received a B.S. degree in computer science from the Federal University of Viçosa, Viçosa, Brazil, in 2015, and an M.S. degree in computer science from the Federal University of Rio Grande do Sul, Porto Alegre, Brazil, in 2017. Since 2018, he has been working toward a Ph.D. degree with Onur Mutlu at ETH Zürich, Zürich, Switzerland. His current research interests include system support for processing-in-memory and processing-using-memory architectures, data-centric accelerators for emerging applications, approximate computing, and emerging memory systems for consumer devices. He has several publications on these topics.

SAFARI Live Seminar: Andrew Walker 19 July 2021

We are excited to have Andrew Walker as a speaker in July for our SAFARI Live Seminars!

Monday, July 19 at 6:00 pm Zurich time (CEST)

Andrew Walker, Schiltron Corporation & Nexgen Power Systems
An Addiction to Low Cost Per Memory Bit – How to Recognize it and What to Do About it

Livestream at 6:00 pm Zurich time (CEST) on YouTube:

Talk slides (pdf) (pptx)

The phenomenal rise in the amounts of data has put great pressure on the semiconductor industry to provide low cost memory solutions. The result is a constant drive to lower the cost per bit of DRAM, SRAM and NAND Flash. In addition, AI requires intense store and recall between processor and memory. In the rush to provide low cost solutions, other attributes have been treated as expendable as an acceptable cost of doing business. Several examples come to mind: short product lifetimes because of limited NAND Flash endurance; data insecurity because of DRAM Rowhammer; poor energy efficiency because of the need to bring growing amounts of data from DRAM into the processor chip due to SRAM area inefficiencies. All such “negative externalities” have a cost that is not included in the product cost but affects society in terms of wasted energy and resources. This talk looks into their origins and consequences and is a call to action for a more comprehensive understanding of what cost per bit really means.

Speaker Bio:
Andy Walker has been working in silicon technology since 1985. After a BSc in physics from Dundee University in Scotland he joined Philips Research Laboratory in Eindhoven, The Netherlands. His PhD from the Technical University of Eindhoven arose from his research work at Philips. In 1994 he came to Silicon Valley and worked at various companies including Cypress, Matrix and Spin Memory. He also founded Schiltron Corporation to develop new forms of monolithic memories. He has been fortunate in being able to work in many interesting areas of silicon devices and process technology including MOS device physics, nonvolatile memories, ESD and Latch-up, TFTs and MRAM. He is now active in the area of GaN high voltage devices.


SAFARI Live Seminar: Juan Gomez-Luna 12 July 2021

We are excited to kick off our summer SAFARI Live Seminars with our first talk next week!

Monday, July 12 at 5:00 pm Zurich time (CEST)

Understanding a Modern Processing-in-Memory Architecture: Benchmarking and Experimental Characterization
Dr. Juan Gomez-Luna, SAFARI Research Group, D-ITET, ETH Zurich

Livestream at 5:00 pm Zurich time (CEST) on YouTube:

Talk slides (pptx) (pdf)

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM). Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with generalpurpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM (Processing-In-Memory benchmarks), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.

Speaker Bio:
Juan Gomez-Luna is a senior researcher and lecturer in the SAFARI Research Group @ETH Zurich. He received the BS and MS degrees in Telecommunication Engineering from the University of Sevilla, Spain, in 2001, and the PhD degree in Computer Science from the University of Cordoba, Spain, in 2012. Between 2005 and 2017, he was a faculty member of the University of Cordoba. His research interests focus on processing-in-memory, memory systems, heterogeneous computing, and hardware and software acceleration of medical imaging and bioinformatics. He is the lead author of PrIM (, the first publicly-available benchmark suite for a real-world processing-in-memory architecture, and Chai (, a benchmark suite for heterogeneous systems with CPU/GPU/FPGA.

Juan Gomez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu,
“Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture”
Preprint in arXiv, 9 May 2021.
[arXiv preprint]
[PrIM Benchmarks Source Code]
[Slides (pptx) (pdf)]
[Long Talk Slides (pptx) (pdf)]
[Short Talk Slides (pptx) (pdf)]
[SAFARI Live Seminar Slides (pptx) (pdf)]
[SAFARI Live Seminar Video (2 hrs 57 mins)]
[Lightning Talk Video (3 minutes)]


Join us as ISCA 2021 for our talks

ISCA 2021 Program:

Tuesday, June 15 Session 6B: Memory II 12 pm EDT:

Lois Orosa, Yaohua Wang, Mohammad Sadrosadati, Jeremie S. Kim, Minesh Patel, Ivan Puddu, Haocong Luo, Kaveh Razavi, Juan Gomez-Luna, Hasan Hassan, Nika Mansouri-Ghiasi, Saugata Ghose, and Onur Mutlu, “CODIC: A Low-Cost Substrate for Enabling Custom In-DRAM Functionalities and Optimizations”Proceedings of the 48th International Symposium on Computer Architecture (ISCA), Virtual, June 2021.
[Slides (pptx) (pdf)]
[Short Talk Slides (pptx) (pdf)]
[Talk Video (22 minutes)]

Wednesday, June 16 Session 11B 1:15 pm EDT:

Jawad Haj-Yahya, Jeremie S. Kim, A. Giray Yaglikci, Ivan Puddu, Lois Orosa, Juan Gomez Luna, Mohammed Alser, and Onur Mutlu, “IChannels: Exploiting Current Management Mechanisms to Create Covert Channels in Modern Processors”, Proceedings of the 48th International Symposium on Computer Architecture (ISCA), Virtual, June 2021.
[Slides (pptx) (pdf)]
[Short Talk Slides (pptx) (pdf)]
[Talk Video (21 minutes)]

Wednesday, June 16 Session 11A 1:15 pm EDT:

Ataberk Olgun, Minesh Patel, A. Giray Yaglikci, Haocong Luo, Jeremie S. Kim, F. Nisa Bostanci, Nandita Vijaykumar, Oguz Ergin, and Onur Mutlu, “QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAM Chips”, Proceedings of the 48th International Symposium on Computer Architecture (ISCA), Virtual, June 2021.
[Slides (pptx) (pdf)]
[Short Talk Slides (pptx) (pdf)]
[Talk Video (25 minutes)]


SIMDRAM: A Framework for Bit-Serial SIMD Processing using DRAM

Watch our recent talks at ASPLOS 2021!

Nastaran Hajinazar, Geraldo F. Oliveira, Sven Gregorio, Joao Dinis Ferreira, Nika Mansouri Ghiasi, Minesh Patel, Mohammed Alser, Saugata Ghose, Juan Gomez-Luna, and Onur Mutlu,
SIMDRAM: A Framework for Bit-Serial SIMD Processing using DRAM”
Proceedings of the 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Virtual, March-April 2021.
[2-page Extended Abstract]
[Short Talk Slides (pptx) (pdf)]
[Talk Slides (pptx) (pdf)]
[Short Talk Video (5 mins)]
[Full Talk Video (27 mins)]


Onur Mutlu and Co-authors Receive the 2021 HPCA Test of Time Award

Congratulations to Onur Mutlu and co-authors on receiving the HPCA Test of Time Award for their 2003 HPCA paper:

Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors
Onur Mutlu, Jared Stark, Chris Wilkerson, Yale N. Patt

The IEEE International Symposium on High-Performance Computer Architecture (HPCA) Test of Time Award recognizes the most influential papers published in prior sessions of HPCA (held 18-22 years ago), and that have had a significant impact in the field.

The paper was Professor Onur Mutlu’s first publication during his PhD at the University of Texas with his PhD advisor Professor Yale Patt and colleagues from Intel, and Dr. Jared Stark and Chris Wilkerson.  The significance of the paper was described by the award committee as: “Runahead Execution is a pioneering paper that opened up new avenues in dynamic prefetching. The basic idea of run ahead execution effectively increases the instruction window very significantly, without having to increase physical resource size (e.g. the issue queue). This seminal paper spawned off a new area of ILP-enhancing microarchitecture research. This work has had strong industry impact as evidenced by IBM’s POWER6 – Load Lookahead, NVIDIA Denver, and Sun ROCK’s hardware scouting.” The award was presented last week at HPCA 2021 on March 2, 2021.

Watch Onur’s Retrospective HPCA Test of Time Award Talk Video (14 minutes)

Onur Mutlu
, Jared Stark, Chris Wilkerson, and Yale N. Patt,
“Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors”
Proceedings of the 9th International Symposium on High-Performance Computer Architecture (HPCA), pages 129-140, Anaheim, CA, February 2003.
[Talk Slides (pdf)]
[Lecture Slides (pptx) (pdf)]
[Lecture Video (1 hr 54 mins)]
[Retrospective HPCA Test of Time Award Talk Slides (pptx) (pdf)]
[Retrospective HPCA Test of Time Award Talk Video (14 minutes)]
One of the 15 computer architecture papers of 2003 selected as Top Picks by IEEE Micro.
HPCA Test of Time Award (awarded in 2021).

Interview with Damla Senol Cali: about her work, experience in SAFARI, and her future directions

Damla Senol Cali is a PhD student with SAFARI at CMU co-advised by Onur Mutlu and Saugata Ghose.
We recently interviewed Damla about her work, her experience as a PhD student in SAFARI, and her future directions.  Damla’s interview appears as the first video contribution to the SAFARI Meet our Members section in our January 2021 newsletter.

Watch Damla’s video interview here    |    Read the transcript here 

Damla Senol Cali, Gurpreet S. Kalsi, Zulal Bingol, Can Firtina, Lavanya Subramanian, Jeremie S. Kim, Rachata Ausavarungnirun, Mohammed Alser, Juan Gomez-Luna, Amirali Boroumand, Anant Nori, Allison Scibisz, Sreenivas Subramoney, Can Alkan, Saugata Ghose, and Onur Mutlu, GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence AnalysisProceedings of the 53rd International Symposium on Microarchitecture (MICRO), Virtual, October 2020.

Full paper link | Talk Video (18 mins) | Talk Slides (pptx) (pdf) |
Lecture Video (37 mins) | Lecture Slides (pptx) (pdf) | GenASM Source Code |
More information here

Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, and Onur Mutlu, Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future DirectionsBriefings in Bioinformatics (BIB), 2018.

Paper link | Paper PDF | AACBB’19 Talk Video | Slides (ppt) (pdf) |
More information here

Read the latest edition of our SAFARI Newsletter

Dear SAFARI friends,

Happy New Year!  We are excited to share our group highlights with you in this second edition of the SAFARI newsletter:

In this second edition of the SAFARI newsletter, we share our research, teaching and outreach highlights from 2020, and look ahead to a new and inspiring future in 2021.

We wish you a wonderful 2021, in all aspects of your lives!

Onur Mutlu