SAFARI Live Seminar: Jawad Haj-Yahya 16 August 2021

Join us for our next SAFARI Live Seminar with Jawad Haj-Yahya.

Monday, August 16 at 5:30 pm Zurich time (CEST)

Power Management Mechanisms in Modern Microprocessors and Their Security Implications
Jawad Haj-Yahya, Principal researcher at Huawei Research Center in Zurich

Livestream at 5:30 pm Zurich time (CEST) on YouTube:

Billions of new devices (e.g., sensors, wearables, smartphones, tablets, laptops, servers) are being deployed each year with new services and features that are driving a higher demand for high performance microprocessors, which often have high power consumption. Despite the failure of Dennard scaling, the slow-down in Moore’s Law, and the high power-density of modern processors, power management mechanisms have enabled significant advances in modern microprocessor performance and energy efficiency. Yet, current power management architectures also pose serious security implications. This is mainly because functionality rather than security has been the main consideration in the design of power management mechanisms in commodity microprocessors.

In this seminar, we provide a detailed overview of the state-of-the-art in power management mechanisms, power delivery networks (PDNs), and security vulnerabilities of current management mechanisms in modern microprocessors. We first present, analyze and enhance the advanced power management mechanisms of modern microprocessors to improve energy and performance in active and idle power states. Second, we present the design and tradeoffs of modern power delivery networks, evaluate their implications on performance and energy-efficiency, and describe new techniques to mitigate PDN inefficiencies. We will especially introduce the idea and benefits of hybrid power delivery networks. Third, we present some of the security vulnerabilities that exist in current management mechanisms of modern processors and propose mitigation techniques. We conclude that power management, power delivery and resulting security implications are critical and exciting areas to research to make modern systems both more energy-efficient and higher performance.

Jawad Haj-Yahya received his Ph.D. degree in Computer Science from Haifa University, Israel. Jawad was a processor architect for many years at Intel. His awards and honors include the Intel Achievement Award (the highest award at Intel), for his significant contribution to Intel processors. Jawad worked at Nanyang Technological University (NTU), Singapore as a cybersecurity Research Scientist where he led the architecture and design of a secure-processor project based on RISC-V architecture. He then moved to the Institute of Microelectronics (IME) at A*STAR Singapore where he was a Scientist III and worked on hardware security and an AI accelerator. Jawad next worked as a Senior Researcher in the SAFARI Research Group at ETH Zurich, where he led multiple projects on Energy-Efficient Computing and Hardware Security, before moving to his current position as principal researcher at Huawei Research Center in Zurich.

This talk is based on four papers we published respectively at HPCA 2020, ISCA 2020, MICRO 2020 and ISCA 2021. The links to individual papers and slides are below.

SAFARI Live Seminar: Gennady Pekhimenko 5 August 2021

Join us for our next SAFARI Live Seminar with Gennady Pekhimenko.

Thursday, August 5 at 5:00 pm Zurich time (CEST)

Efficient DNN Training at Scale: from Algorithms to Hardware
Gennady Pekhimenko, University of Toronto

Livestream at 5:00 pm Zurich time (CEST) on YouTube:

The recent popularity of deep neural networks (DNNs) has generated a lot of research interest in performing DNN-related computation efficiently. However, the primary focus of systems research is usually quite narrow and limited to (i) inference — i.e. how to efficiently execute already trained models and (ii) image classification networks as the primary benchmark for evaluation. In this talk, we will demonstrate a holistic approach to DNN training acceleration and scalability starting from the algorithm, to software and hardware optimizations, to special development and optimization tools.

In the first part of the talk, I will show our radically new approach on how to efficiently scale backpropagation algorithms used in DNN training (BPPSA, MLSys’20). Then I will demonstrate a new approach on how to train multiple DNN models jointly on the same hardware (HFTA, MLSys’21). I will then demonstrate several approaches to deal with one of the major limiting factors in DNN training: limited GPU/accelerator memory capacity (Echo, ISCA’20 and Gist, ISCA’18). At the end, I will show the performance and visualization tools we built in my group to understand, visualize, and optimize DNN models, and even predict their performance on different hardware.

Gennady Pekhimenko is an Assistant Professor at the University of Toronto, CS department and (by courtesy) ECE department, where he is leading the EcoSystem (Efficient Computing Systems) group. Gennady is also a Faculty Member at Vector Institute and a CIFAR AI chair. Before joining Univ. of Toronto, he spent a year in 2017 at Microsoft Research in Redmond in the Systems Research group. He got his PhD from the Computer Science Department at Carnegie Mellon University in 2016. Gennady is a recipient of Amazon Machine Learning Research Award, Facebook Faculty Research Award, Connaught New Researcher Award, NVIDIA Graduate, Microsoft Research, Qualcomm Innovation, and NSERC CGS-D Fellowships. His research interests are in the areas of systems, computer architecture, compilers, and applied machine learning.


Congratulations to Damla Senol Cali on successfully defending her PhD!

Thesis:  Accelerating Genome Sequence Analysis via Efficient Hardware/Algorithm Co-Design


Genome sequence analysis plays a pivotal role in enabling many medical and scientific advancements in personalized medicine, outbreak tracing, the understanding of evolution, and forensics. Modern genome sequencing machines can rapidly generate massive amounts of genomics data at low cost. However, the analysis of genome sequencing data is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. Our goals in this dissertation are to (1) characterize the real-system behavior of the genome sequence analysis pipeline and its associated tools, (2) expose the bottlenecks and tradeoffs of the pipeline and tools, and (3) co-design fast and efficient algorithms along with scalable and energy-efficient customized hardware accelerators for the key pipeline bottlenecks to enable faster genome sequence analysis.

First, we comprehensively analyze the tools in the genome assembly pipeline for long reads in multiple dimensions (i.e., accuracy, performance, memory usage, and scalability), uncovering bottlenecks and tradeoffs that different combinations of tools and different underlying systems lead to. We show that we need high-performance, memory-efficient, low-power, and scalable designs for genome sequence analysis in order to exploit the advantages that genome sequencing provides. Second, we propose GenASM, an acceleration framework that builds upon bitvector-based approximate string matching (ASM) to accelerate multiple steps of the genome sequence analysis pipeline. We co-design our highly-parallel, scalable and memory-efficient algorithms with low-power and area-efficient hardware accelerators. We evaluate GenASM for three different use cases of ASM in genome sequence analysis and show that GenASM is significantly faster and more power- and area-efficient than state-of-the-art software and hardware tools for each of these use cases. Third, we implement an FPGA-based prototype for GenASM, where state-of-the-art 3D-stacked memory (HBM2) offers high memory bandwidth and FPGA resources offer high parallelism by instantiating multiple copies of the GenASM accelerators. Fourth, we propose GenGraph, the first hardware acceleration framework for sequence-to-graph mapping. Instead of representing the reference genome as a single linear DNA sequence, genome graphs provide a better representation of the diversity among populations by encoding variations across individuals in a graph data structure, avoiding a bias towards any one reference. GenGraph enables the efficient mapping of a sequenced genome to a graph-based reference, providing more comprehensive and accurate genome sequence analysis.

Overall, we demonstrate that genome sequence analysis can be accelerated by co-designing scalable and energy-efficient customized accelerators along with efficient algorithms for the key steps of genome sequence analysis.
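To give a feel for what "bitvector-based approximate string matching" means in the abstract above, here is a minimal sketch, assuming the classic Bitap (Wu-Manber) formulation with up to k edits. It is only an illustration of the general technique; it is not GenASM's algorithm or hardware design, and the function name `bitap_search` and the sample strings are made up for this example.

```python
def bitap_search(text, pattern, k):
    """Return the end indices in `text` where `pattern` matches with <= k edits.

    Illustrative Bitap/Wu-Manber sketch (shift-and variant). Bit i of R[d]
    is 1 when pattern[:i+1] matches some suffix of the text read so far
    with at most d edits (substitutions, insertions, deletions).
    """
    m = len(pattern)
    full = (1 << m) - 1          # keep state vectors m bits wide
    accept = 1 << (m - 1)        # bit that signals a full-pattern match

    # Per-character bitmask: bit i is 1 when pattern[i] == character.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    # With d edits available, the first d pattern characters can be deleted.
    R = [(1 << d) - 1 for d in range(k + 1)]

    hits = []
    for j, ch in enumerate(text):
        mask = masks.get(ch, 0)
        prev_old = R[0]                          # R[d-1] before this step
        R[0] = ((R[0] << 1) | 1) & mask          # exact-match update
        prev_new = R[0]                          # R[d-1] after this step
        for d in range(1, k + 1):
            old = R[d]
            match = ((old << 1) | 1) & mask
            insertion = prev_old                 # extra character in text
            substitution = (prev_old << 1) | 1
            deletion = (prev_new << 1) | 1       # pattern character skipped
            R[d] = (match | insertion | substitution | deletion) & full
            prev_old, prev_new = old, R[d]
        if R[k] & accept:
            hits.append(j)
    return hits


if __name__ == "__main__":
    print(bitap_search("ACGTACGGACGT", "ACGT", 0))
    print(bitap_search("ACGTACGGACGT", "ACGT", 1))
```

Each text character costs only a handful of shifts, ANDs, and ORs per allowed edit, which is the property that makes this family of algorithms attractive for parallel hardware.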

Examining Committee

Onur Mutlu, Co-advisor, CMU-ECE, ETH Zurich
Saugata Ghose, Co-advisor, CMU-ECE, University of Illinois Urbana-Champaign
James C. Hoe, CMU-ECE
Can Alkan, Bilkent University

SAFARI Live Seminar: Geraldo F. Oliveira 22 July 2021

We are pleased to have Geraldo F. Oliveira give a 3rd talk in our SAFARI Live Seminars!

Thursday, July 22 at 5:00 pm Zurich time (CEST)

DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks
Geraldo F. Oliveira, SAFARI Research Group, D-ITET, ETH Zurich

Livestream at 5:00 pm Zurich time (CEST) on YouTube:


Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ different techniques to reduce overheads caused by data movement, from traditional processor-centric mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging paradigms, such as near-data processing (NDP), where computation is moved closer to or inside memory. However, there is a lack of understanding about (1) the key metrics that identify different sources of data movement bottlenecks and (2) how different data movement bottlenecks can be alleviated by traditional and emerging data movement mitigation mechanisms.

In this work, we make two key contributions. First, we propose the first methodology to characterize data-intensive workloads based on the source of their data movement bottlenecks. This methodology is driven by insights obtained from a large-scale experimental characterization of 345 applications from 37 different benchmark suites and an evaluation of the performance of memory-bound functions from these applications with three data-movement mitigation mechanisms. Second, we release DAMOV, the first open-source benchmark suite for main memory data movement-related studies, based on our systematic characterization methodology. This suite consists of 144 functions representing different sources of data movement bottlenecks and can be used as a baseline benchmark set for future data-movement mitigation research. We show how DAMOV can aid the study of open research problems for NDP architectures via four case studies.

Our work provides new insights about the suitability of different classes of data movement bottlenecks to the different data movement mitigation mechanisms, including analyses on how the different data movement mitigation mechanisms impact performance and energy for memory-bottlenecked applications. All our bottleneck analysis toolchains and DAMOV benchmarks are publicly and freely available. We believe and hope that our work can enable further studies and research on hardware and software solutions for data movement bottlenecks, including near-data processing.

Speaker Bio:
Geraldo F. Oliveira is a Ph.D. student in the SAFARI Research Group @ETH Zurich. He received a B.S. degree in computer science from the Federal University of Viçosa, Viçosa, Brazil, in 2015, and an M.S. degree in computer science from the Federal University of Rio Grande do Sul, Porto Alegre, Brazil, in 2017. Since 2018, he has been working toward a Ph.D. degree with Onur Mutlu at ETH Zürich, Zürich, Switzerland. His current research interests include system support for processing-in-memory and processing-using-memory architectures, data-centric accelerators for emerging applications, approximate computing, and emerging memory systems for consumer devices. He has several publications on these topics.

SAFARI Live Seminar: Andrew Walker 19 July 2021

We are excited to have Andrew Walker as a speaker in July for our SAFARI Live Seminars!

Monday, July 19 at 6:00 pm Zurich time (CEST)

Andrew Walker, Schiltron Corporation & Nexgen Power Systems
An Addiction to Low Cost Per Memory Bit – How to Recognize it and What to Do About it

Livestream at 6:00 pm Zurich time (CEST) on YouTube:

Talk slides (pdf) (pptx)

The phenomenal rise in the amounts of data has put great pressure on the semiconductor industry to provide low cost memory solutions. The result is a constant drive to lower the cost per bit of DRAM, SRAM and NAND Flash. In addition, AI requires intense store and recall between processor and memory. In the rush to provide low cost solutions, other attributes have been treated as expendable as an acceptable cost of doing business. Several examples come to mind: short product lifetimes because of limited NAND Flash endurance; data insecurity because of DRAM Rowhammer; poor energy efficiency because of the need to bring growing amounts of data from DRAM into the processor chip due to SRAM area inefficiencies. All such “negative externalities” have a cost that is not included in the product cost but affects society in terms of wasted energy and resources. This talk looks into their origins and consequences and is a call to action for a more comprehensive understanding of what cost per bit really means.

Speaker Bio:
Andy Walker has been working in silicon technology since 1985. After a BSc in physics from Dundee University in Scotland he joined Philips Research Laboratory in Eindhoven, The Netherlands. His PhD from the Technical University of Eindhoven arose from his research work at Philips. In 1994 he came to Silicon Valley and worked at various companies including Cypress, Matrix and Spin Memory. He also founded Schiltron Corporation to develop new forms of monolithic memories. He has been fortunate in being able to work in many interesting areas of silicon devices and process technology including MOS device physics, nonvolatile memories, ESD and Latch-up, TFTs and MRAM. He is now active in the area of GaN high voltage devices.


SAFARI Live Seminar: Juan Gomez-Luna 12 July 2021

We are excited to kick off our summer SAFARI Live Seminars with our first talk next week!

Monday, July 12 at 5:00 pm Zurich time (CEST)

Understanding a Modern Processing-in-Memory Architecture: Benchmarking and Experimental Characterization
Dr. Juan Gomez-Luna, SAFARI Research Group, D-ITET, ETH Zurich

Livestream at 5:00 pm Zurich time (CEST) on YouTube:

Talk slides (pptx) (pdf)

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM). Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM (Processing-In-Memory benchmarks), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.
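To make the notion of a "memory-bound" workload in the abstract above concrete, here is a small roofline-style calculation. It is a sketch under stated assumptions: the peak compute rate and memory bandwidth are hypothetical round numbers, not measurements of any system mentioned here, and the helper names are invented for the example.

```python
def arithmetic_intensity(operations, bytes_moved):
    """Operations performed per byte transferred to or from main memory."""
    return operations / bytes_moved


def attainable_gops(intensity, peak_gops, bandwidth_gbs):
    """Roofline bound: limited by compute peak or by memory bandwidth."""
    return min(peak_gops, intensity * bandwidth_gbs)


def is_memory_bound(intensity, peak_gops, bandwidth_gbs):
    """A kernel is memory-bound when its intensity is below machine balance."""
    return intensity < peak_gops / bandwidth_gbs


# Hypothetical machine parameters (assumptions for illustration only).
PEAK_GOPS = 500.0       # peak compute throughput, giga-operations per second
BANDWIDTH_GBS = 50.0    # main memory bandwidth, gigabytes per second

# Vector addition c[i] = a[i] + b[i] on 4-byte elements:
# one operation per element, 12 bytes moved (two loads and one store).
n = 1_000_000
vec_add = arithmetic_intensity(operations=n, bytes_moved=12 * n)

if __name__ == "__main__":
    print("intensity (ops/byte):", vec_add)
    print("memory-bound:", is_memory_bound(vec_add, PEAK_GOPS, BANDWIDTH_GBS))
    print("attainable Gops/s:", attainable_gops(vec_add, PEAK_GOPS, BANDWIDTH_GBS))
```

With so little work per byte, the attainable rate is set by the memory bandwidth term rather than by the compute peak, which is the situation PIM aims to address by computing where the data resides.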

Speaker Bio:
Juan Gomez-Luna is a senior researcher and lecturer in the SAFARI Research Group @ETH Zurich. He received the BS and MS degrees in Telecommunication Engineering from the University of Sevilla, Spain, in 2001, and the PhD degree in Computer Science from the University of Cordoba, Spain, in 2012. Between 2005 and 2017, he was a faculty member of the University of Cordoba. His research interests focus on processing-in-memory, memory systems, heterogeneous computing, and hardware and software acceleration of medical imaging and bioinformatics. He is the lead author of PrIM, the first publicly-available benchmark suite for a real-world processing-in-memory architecture, and Chai, a benchmark suite for heterogeneous systems with CPU/GPU/FPGA.

Juan Gomez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu,
“Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture”
Preprint in arXiv, 9 May 2021.
[arXiv preprint]
[PrIM Benchmarks Source Code]
[Slides (pptx) (pdf)]
[Long Talk Slides (pptx) (pdf)]
[Short Talk Slides (pptx) (pdf)]
[SAFARI Live Seminar Slides (pptx) (pdf)]
[SAFARI Live Seminar Video (2 hrs 57 mins)]
[Lightning Talk Video (3 minutes)]


Join us at ISCA 2021 for our talks

ISCA 2021 Program:

Tuesday, June 15 Session 6B: Memory II 12 pm EDT:

Lois Orosa, Yaohua Wang, Mohammad Sadrosadati, Jeremie S. Kim, Minesh Patel, Ivan Puddu, Haocong Luo, Kaveh Razavi, Juan Gomez-Luna, Hasan Hassan, Nika Mansouri-Ghiasi, Saugata Ghose, and Onur Mutlu, “CODIC: A Low-Cost Substrate for Enabling Custom In-DRAM Functionalities and Optimizations”, Proceedings of the 48th International Symposium on Computer Architecture (ISCA), Virtual, June 2021.
[Slides (pptx) (pdf)]
[Short Talk Slides (pptx) (pdf)]
[Talk Video (22 minutes)]

Wednesday, June 16 Session 11B 1:15 pm EDT:

Jawad Haj-Yahya, Jeremie S. Kim, A. Giray Yaglikci, Ivan Puddu, Lois Orosa, Juan Gomez Luna, Mohammed Alser, and Onur Mutlu, “IChannels: Exploiting Current Management Mechanisms to Create Covert Channels in Modern Processors”, Proceedings of the 48th International Symposium on Computer Architecture (ISCA), Virtual, June 2021.
[Slides (pptx) (pdf)]
[Short Talk Slides (pptx) (pdf)]
[Talk Video (21 minutes)]

Wednesday, June 16 Session 11A 1:15 pm EDT:

Ataberk Olgun, Minesh Patel, A. Giray Yaglikci, Haocong Luo, Jeremie S. Kim, F. Nisa Bostanci, Nandita Vijaykumar, Oguz Ergin, and Onur Mutlu, “QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAM Chips”, Proceedings of the 48th International Symposium on Computer Architecture (ISCA), Virtual, June 2021.
[Slides (pptx) (pdf)]
[Short Talk Slides (pptx) (pdf)]
[Talk Video (25 minutes)]


Congratulations to Nastaran Hajinazar on her successful PhD Defence

Nastaran successfully defended her PhD thesis in June 2021.  Congratulations Nastaran!  We look forward to many more collaborations with you in the future.  

Thesis: Data-Centric and Data-Aware Frameworks for Fundamentally Efficient Data Handling in Modern Computing Systems


There is an explosive growth in the size of the input and/or intermediate data used and generated by modern and emerging applications. Unfortunately, modern computing systems are not capable of handling large amounts of data efficiently. Major concepts and components (e.g., the virtual memory system) and predominant execution models (e.g., the processor-centric execution model) used in almost all computing systems are designed without having modern applications’ overwhelming data demand in mind. As a result, accessing, moving, and processing large amounts of data faces important challenges in today’s systems, making data a first-class concern and a prime performance and energy bottleneck in such systems. This thesis studies the root cause of inefficiency in modern computing systems when handling modern applications’ data demand, and aims to fundamentally address such inefficiencies, with a focus on two directions.

First, we design a new framework that aids the widespread adoption of processing-using-DRAM, a data-centric computation paradigm that improves the overall performance and efficiency of the system when computing large amounts of data by minimizing the cost of data movement and enabling computation where the data resides. To this end, we introduce SIMDRAM, an end-to-end processing-using-DRAM framework that (1) efficiently computes complex operations required by modern data-intensive applications, and (2) provides the ability to implement new arbitrary operations as required, all in an in-DRAM massively-parallel SIMD substrate that requires minimal changes to the DRAM architecture.

Second, we design a new, more scalable virtual memory framework that (1) eliminates the inefficiencies of the conventional virtual memory frameworks when handling the high memory demand in modern applications, and (2) is built from the ground up to understand, convey, and exploit data properties, to create opportunities for performance and efficiency improvements. To this end, we introduce the Virtual Block Interface (VBI), a novel virtual memory framework that (1) efficiently handles modern applications’ high data demand, (2) conveys properties of different pieces of program data (e.g., data structures) to the hardware and exploits this knowledge for performance and efficiency optimizations, (3) better extracts performance from the wide variety of new system configurations that are designed to process large amounts of data (e.g., hybrid memory systems), and (4) provides all the key features of the conventional virtual memory frameworks, at low overhead.

Keywords: Efficient Data Handling, Data-Centric Architectures, Data-Aware Architectures, Virtual Memory, Processing-in-Memory

Examining Committee

Onur Mutlu, Co-Senior Supervisor
Arrvindh Shriraman, Co-Senior Supervisor
Saugata Ghose, Supervisor
Vivek Seshadri, Supervisor
Alaa Alameldeen, Internal Examiner
Myoungsoo Jung, External Examiner
Zhenman Fang, Chair


Congratulations to Gagandeep Singh on his successful PhD Defence

Gagan successfully defended his PhD thesis in March 2021.  We are excited that Gagan will stay on with SAFARI as a postdoc and we look forward to many successful collaborations with him.  Congratulations Gagan!

Gagandeep Singh, March 2021 (defended 29 March 2021)

Thesis title: “Designing, Modeling, and Optimizing Data-Intensive Computing Systems”
[Slides (pptx) (pdf)]


The cost of moving data between the memory units and the compute units is a major contributor to the execution time and energy consumption of modern workloads in computing systems. At the same time, we are witnessing an enormous amount of data being generated across multiple application domains. Moreover, the end of Dennard scaling, the slowing of Moore’s law, and the emergence of dark silicon limit the attainable performance on current computing systems. These trends suggest a need for a paradigm shift towards a data-centric approach where computation is performed close to where the data resides. This approach allows us to overcome our current systems’ performance and energy limitations by minimizing the data movement overhead and ensuring that data does not overwhelm system components. Further, a data-centric approach can enable a data-driven view where we take advantage of vast amounts of available data to improve architectural decisions. Our current systems are designed to follow rigid and simple policies that lack adaptability. Therefore, current system policies fail to provide robust improvement across varying workloads and system conditions.

As a step towards modern architectures, this dissertation contributes to various aspects of the data-centric approach and further proposes several data-driven mechanisms.

First, we design NERO, a data-centric accelerator for a real-world weather prediction application. NERO overcomes the memory bottleneck of weather prediction stencil kernels by exploiting near-memory computation capability on specialized field-programmable gate array (FPGA) accelerators with high-bandwidth memory (HBM) that are attached to the host CPU.

Second, we explore the applicability of different number formats, including fixed-point, floating-point, and posit, for different stencil kernels. We search for the appropriate bit-width that reduces the memory footprint and improves the performance and energy efficiency with minimal loss in the accuracy.
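As a rough illustration of what a bit-width search for a stencil kernel can look like, the sketch below quantizes a toy 1D 3-point averaging stencil to fixed-point and finds the smallest number of fractional bits that keeps the error under a tolerance. This is an assumption-laden example, not the methodology of the thesis: the stencil, the error metric, the function names, and the input data are all hypothetical, and overflow and the integer part of the format are not modelled.

```python
def to_fixed(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (no overflow modelling)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def stencil3(values, quantize=lambda v: v):
    """One sweep of a 1D 3-point averaging stencil; boundary cells are copied."""
    out = list(values)
    for i in range(1, len(values) - 1):
        total = quantize(values[i - 1]) + quantize(values[i]) + quantize(values[i + 1])
        out[i] = quantize(total / 3.0)
    return out


def max_error(values, frac_bits):
    """Largest absolute difference between fixed-point and float results."""
    reference = stencil3(values)
    approx = stencil3(values, lambda v: to_fixed(v, frac_bits))
    return max(abs(a - r) for a, r in zip(approx, reference))


def smallest_frac_bits(values, tolerance, max_bits=32):
    """Smallest fractional bit count whose error stays within tolerance."""
    for bits in range(1, max_bits + 1):
        if max_error(values, bits) <= tolerance:
            return bits
    return None


if __name__ == "__main__":
    grid = [0.1 * i * i - 0.37 * i + 0.05 for i in range(16)]  # made-up input
    for tol in (1e-2, 1e-4):
        print(tol, smallest_frac_bits(grid, tol))
```

Fewer bits per value means a smaller memory footprint and less data to move, so a search like this trades accuracy against footprint; the same loop could be repeated for other formats such as floating-point or posit given suitable conversion routines.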

Third, we propose NAPEL, an ML-based application performance and energy prediction framework for data-centric architectures. NAPEL uses ensemble learning to build a model that, once trained for a fraction of programs, can predict the performance and energy consumption of different applications.

Fourth, we present LEAPER, the first use of few-shot learning to transfer FPGA-based computing models across different hardware platforms and applications. LEAPER provides the ability to reuse a prediction model built on an inexpensive low-end local system for a new, unknown, high-end FPGA-based system.

Fifth, we propose QRator, a reinforcement learning (RL)-based data-placement technique for hybrid storage systems. QRator is a data-driven technique that uses RL to develop a data-placement policy agent. The data-placement agent decides which data should be stored in which storage device to achieve the best performance while minimizing the migration overhead, taking into account the device and the workload characteristics. Our evaluation results show that QRator significantly improves a hybrid storage subsystem’s performance compared to state-of-the-art data placement techniques.

Overall, this thesis provides two key conclusions: (1) hardware acceleration on an FPGA+HBM fabric is a promising solution to overcome the data movement bottleneck of our current computing systems; (2) data should drive system and design decisions by leveraging inherent data characteristics to make our computing systems more efficient. Thus, we conclude that the mechanisms proposed by this dissertation provide promising solutions to handle data well by following a data-centric approach and further demonstrate the importance of leveraging data to devise data-driven policies. We hope that the proposed architectural techniques and detailed experimental data presented in this dissertation will enable the development of energy-efficient data-intensive computing systems and drive the exploration of new mechanisms to improve the performance and energy efficiency of future computing systems.

Reducing Solid-State Drive Read Latency by Optimizing Read-Retry

Watch our recent talk at ASPLOS 2021:

Jisung Park, Myungsuk Kim, Myoungjun Chun, Lois Orosa, Jihong Kim, and Onur Mutlu,
“Reducing Solid-State Drive Read Latency by Optimizing Read-Retry”
Proceedings of the 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Virtual, March-April 2021.
[2-page Extended Abstract]
[Short Talk Slides (pptx) (pdf)]
[Full Talk Slides (pptx) (pdf)]
[Short Talk Video (5 mins)]
[Full Talk Video (19 mins)]