Congratulations to Damla Senol Cali on successfully defending her PhD!

Thesis:  Accelerating Genome Sequence Analysis via Efficient Hardware/Algorithm Co-Design

Abstract:

Genome sequence analysis plays a pivotal role in enabling many medical and scientific advancements in personalized medicine, outbreak tracing, the understanding of evolution, and forensics. Modern genome sequencing machines can rapidly generate massive amounts of genomics data at low cost. However, the analysis of genome sequencing data is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. Our goals in this dissertation are to (1) characterize the real-system behavior of the genome sequence analysis pipeline and its associated tools, (2) expose the bottlenecks and tradeoffs of the pipeline and tools, and (3) co-design fast and efficient algorithms along with scalable and energy-efficient customized hardware accelerators for the key pipeline bottlenecks to enable faster genome sequence analysis.

First, we comprehensively analyze the tools in the genome assembly pipeline for long reads in multiple dimensions (i.e., accuracy, performance, memory usage, and scalability), uncovering bottlenecks and tradeoffs that different combinations of tools and different underlying systems lead to. We show that we need high-performance, memory-efficient, low-power, and scalable designs for genome sequence analysis in order to exploit the advantages that genome sequencing provides. Second, we propose GenASM, an acceleration framework that builds upon bitvector-based approximate string matching (ASM) to accelerate multiple steps of the genome sequence analysis pipeline. We co-design our highly parallel, scalable, and memory-efficient algorithms with low-power and area-efficient hardware accelerators. We evaluate GenASM for three different use cases of ASM in genome sequence analysis and show that GenASM is significantly faster and more power- and area-efficient than state-of-the-art software and hardware tools for each of these use cases. Third, we implement an FPGA-based prototype for GenASM, where state-of-the-art 3D-stacked memory (HBM2) offers high memory bandwidth and FPGA resources offer high parallelism by instantiating multiple copies of the GenASM accelerators. Fourth, we propose GenGraph, the first hardware acceleration framework for sequence-to-graph mapping. Instead of representing the reference genome as a single linear DNA sequence, genome graphs provide a better representation of the diversity among populations by encoding variations across individuals in a graph data structure, avoiding a bias towards any one reference. GenGraph enables the efficient mapping of a sequenced genome to a graph-based reference, providing more comprehensive and accurate genome sequence analysis.
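The bitvector-based approximate string matching that GenASM builds upon descends from the classic Bitap (Shift-And) algorithm of Baeza-Yates–Gonnet and Wu–Manber. As a rough illustration of the style of computation being accelerated (the textbook algorithm, not GenASM's modified version), the following sketch finds all positions where a pattern matches the text with up to k edit-distance errors, using only shifts and bitwise operations:

```python
def bitap_approx(text, pattern, k):
    """Textbook Wu-Manber Bitap: report end positions in `text` where
    `pattern` occurs with at most k edit-distance errors.
    Bit j of R[d] is set iff pattern[:j+1] matches a suffix of the
    text read so far with <= d errors."""
    m = len(pattern)
    masks = {}                                 # per-character bitmasks
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << j)

    R = [0] * (k + 1)
    hits = []
    for i, c in enumerate(text):
        B = masks.get(c, 0)
        prev_old = R[0]                        # R[d-1] from the previous step
        R[0] = ((R[0] << 1) | 1) & B
        for d in range(1, k + 1):
            old = R[d]
            R[d] = (((old << 1) & B)                # match: extend with c
                    | prev_old                      # insertion: consume c, stay
                    | ((prev_old | R[d - 1]) << 1)  # substitution / deletion
                    | 1)
            prev_old = old
        if R[k] & (1 << (m - 1)):
            hits.append(i)                     # a match ends at text[i]
    return hits
```

Each text character costs O(k) word-level bitwise operations with no data-dependent branches, which is the kind of regular work that maps well onto simple, parallel hardware; the GenASM framework modifies and extends this algorithm (e.g., to support traceback) to make it hardware-friendly.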

Overall, we demonstrate that genome sequence analysis can be accelerated by co-designing scalable and energy-efficient customized accelerators along with efficient algorithms for the key steps of genome sequence analysis.

Examining Committee

Onur Mutlu, Co-advisor, CMU-ECE, ETH Zurich
Saugata Ghose, Co-advisor, CMU-ECE, University of Illinois Urbana-Champaign
James C. Hoe, CMU-ECE
Can Alkan, Bilkent University

SAFARI Live Seminar: Geraldo F. Oliveira 22 July 2021

We are pleased to have Geraldo F. Oliveira give the third talk in our SAFARI Live Seminars!

Thursday, July 22 at 5:00 pm Zurich time (CEST)

DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks
Geraldo F. Oliveira, SAFARI Research Group, D-ITET, ETH Zurich

Livestream at 5:00 pm Zurich time (CEST) on YouTube:
https://www.youtube.com/watch?v=GWideVyo0nM

Paper: https://arxiv.org/pdf/2105.03725.pdf
Repository: https://github.com/CMU-SAFARI/DAMOV

Abstract:
Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ different techniques to reduce overheads caused by data movement, from traditional processor-centric mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging paradigms, such as near-data processing (NDP), where computation is moved closer to or inside memory. However, there is a lack of understanding about (1) the key metrics that identify different sources of data movement bottlenecks and (2) how different data movement bottlenecks can be alleviated by traditional and emerging data movement mitigation mechanisms.

In this work, we make two key contributions. First, we propose the first methodology to characterize data-intensive workloads based on the source of their data movement bottlenecks. This methodology is driven by insights obtained from a large-scale experimental characterization of 345 applications from 37 different benchmark suites and an evaluation of the performance of memory-bound functions from these applications with three data-movement mitigation mechanisms. Second, we release DAMOV, the first open-source benchmark suite for main memory data movement-related studies, based on our systematic characterization methodology. This suite consists of 144 functions representing different sources of data movement bottlenecks and can be used as a baseline benchmark set for future data-movement mitigation research. We show how DAMOV can aid the study of open research problems for NDP architectures via four case studies.

Our work provides new insights about the suitability of different classes of data movement bottlenecks to the different data movement mitigation mechanisms, including analyses on how the different data movement mitigation mechanisms impact performance and energy for memory bottlenecked applications. All our bottleneck analysis toolchains and DAMOV benchmarks are publicly and freely available (https://github.com/CMU-SAFARI/DAMOV). We believe and hope that our work can enable further studies and research on hardware and software solutions for data movement bottlenecks, including near-data processing.
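As a deliberately simplified illustration of one first-pass metric such a characterization can rely on: arithmetic intensity (operations per byte of DRAM traffic), compared against a machine's compute-to-bandwidth ratio, indicates whether a kernel is likely bandwidth-limited. The sketch below shows only this roofline-style test; DAMOV's actual methodology classifies bottlenecks using several metrics in combination, and all numbers here are illustrative:

```python
def arithmetic_intensity(total_ops, dram_bytes):
    """Operations performed per byte moved to/from DRAM."""
    return total_ops / dram_bytes

def is_memory_bound(total_ops, dram_bytes, peak_ops_per_s, peak_bytes_per_s):
    """Roofline-style test: a kernel whose arithmetic intensity falls below
    the machine balance (peak compute / peak DRAM bandwidth) cannot keep
    the compute units busy and is limited by data movement."""
    machine_balance = peak_ops_per_s / peak_bytes_per_s
    return arithmetic_intensity(total_ops, dram_bytes) < machine_balance

# Example: a kernel doing 1 op/byte on a machine whose balance is
# 10 ops/byte (1 Tops/s vs. 100 GB/s) is memory-bound.
```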

Speaker Bio:
Geraldo F. Oliveira is a Ph.D. student in the SAFARI Research Group @ETH Zurich. He received a B.S. degree in computer science from the Federal University of Viçosa, Viçosa, Brazil, in 2015, and an M.S. degree in computer science from the Federal University of Rio Grande do Sul, Porto Alegre, Brazil, in 2017. Since 2018, he has been working toward a Ph.D. degree with Onur Mutlu at ETH Zürich, Zürich, Switzerland. His current research interests include system support for processing-in-memory and processing-using-memory architectures, data-centric accelerators for emerging applications, approximate computing, and emerging memory systems for consumer devices. He has several publications on these topics.

SAFARI Live Seminar: Andrew Walker 19 July 2021

We are excited to have Andrew Walker as a speaker in July for our SAFARI Live Seminars!

Monday, July 19 at 6:00 pm Zurich time (CEST)

An Addiction to Low Cost Per Memory Bit – How to Recognize it and What to Do About it
Andrew Walker, Schiltron Corporation & Nexgen Power Systems

Livestream at 6:00 pm Zurich time (CEST) on YouTube:
https://www.youtube.com/watch?v=76YCpdsa5FU

Talk slides (pdf) (pptx)

Abstract:
The phenomenal rise in the amount of data has put great pressure on the semiconductor industry to provide low-cost memory solutions. The result is a constant drive to lower the cost per bit of DRAM, SRAM, and NAND Flash. In addition, AI requires intense store and recall between processor and memory. In the rush to provide low-cost solutions, other attributes have been treated as expendable, as an acceptable cost of doing business. Several examples come to mind: short product lifetimes because of limited NAND Flash endurance; data insecurity because of DRAM Rowhammer; poor energy efficiency because of the need to bring growing amounts of data from DRAM into the processor chip due to SRAM area inefficiencies. All such “negative externalities” have a cost that is not included in the product cost but affects society in terms of wasted energy and resources. This talk looks into their origins and consequences and is a call to action for a more comprehensive understanding of what cost per bit really means.

Speaker Bio:
Andy Walker has been working in silicon technology since 1985. After a BSc in physics from Dundee University in Scotland, he joined Philips Research Laboratory in Eindhoven, The Netherlands. His PhD from the Technical University of Eindhoven arose from his research work at Philips. In 1994 he came to Silicon Valley and worked at various companies including Cypress, Matrix, and Spin Memory. He also founded Schiltron Corporation to develop new forms of monolithic memories. He has been fortunate in being able to work in many interesting areas of silicon devices and process technology, including MOS device physics, nonvolatile memories, ESD and latch-up, TFTs, and MRAM. He is now active in the area of GaN high-voltage devices.

 

SAFARI Live Seminar: Juan Gomez-Luna 12 July 2021

We are excited to kick off our summer SAFARI Live Seminars with our first talk next week!

Monday, July 12 at 5:00 pm Zurich time (CEST)

Understanding a Modern Processing-in-Memory Architecture: Benchmarking and Experimental Characterization
Dr. Juan Gomez-Luna, SAFARI Research Group, D-ITET, ETH Zurich

Livestream at 5:00 pm Zurich time (CEST) on YouTube:
https://www.youtube.com/watch?v=D8Hjy2iU9l4

Paper: https://arxiv.org/pdf/2105.03814.pdf
Repository: https://github.com/CMU-SAFARI/prim-benchmarks
Talk slides (pptx) (pdf)

Abstract:
Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM). Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM (Processing-In-Memory benchmarks), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about the suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.
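The programming pattern behind such a system, a host CPU scattering data slices across many DPUs, each DPU computing only on its local DRAM bank, and the host gathering the partial results, can be sketched in a few lines. This is a structural simulation only: the real UPMEM SDK programs the DPUs in C and performs explicit host-DPU transfers, none of which appears here:

```python
def pim_style_vector_add(a, b, n_dpus=4):
    """Simulate the host-side offload pattern of a UPMEM-like PIM system:
    scatter input slices to DPUs, run the per-DPU kernel on local data,
    then gather the partial results. No inter-DPU communication is needed,
    which is one reason low-reuse streaming workloads scale well on such
    systems."""
    assert len(a) == len(b)
    chunk = (len(a) + n_dpus - 1) // n_dpus    # elements per DPU
    result = []
    for d in range(n_dpus):
        lo, hi = d * chunk, min((d + 1) * chunk, len(a))
        local_a, local_b = a[lo:hi], b[lo:hi]  # "host -> DPU" copy
        partial = [x + y for x, y in zip(local_a, local_b)]  # DPU kernel
        result.extend(partial)                 # "DPU -> host" copy
    return result
```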

Speaker Bio:
Juan Gomez-Luna is a senior researcher and lecturer in the SAFARI Research Group @ETH Zurich. He received the BS and MS degrees in Telecommunication Engineering from the University of Sevilla, Spain, in 2001, and the PhD degree in Computer Science from the University of Cordoba, Spain, in 2012. Between 2005 and 2017, he was a faculty member of the University of Cordoba. His research interests focus on processing-in-memory, memory systems, heterogeneous computing, and hardware and software acceleration of medical imaging and bioinformatics. He is the lead author of PrIM (https://github.com/CMU-SAFARI/prim-benchmarks), the first publicly-available benchmark suite for a real-world processing-in-memory architecture, and Chai (https://github.com/chai-benchmarks/chai), a benchmark suite for heterogeneous systems with CPU/GPU/FPGA.


Juan Gomez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu,
“Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture”
Preprint in arXiv, 9 May 2021.
[arXiv preprint]
[PrIM Benchmarks Source Code]
[Slides (pptx) (pdf)]
[Long Talk Slides (pptx) (pdf)]
[Short Talk Slides (pptx) (pdf)]
[SAFARI Live Seminar Slides (pptx) (pdf)]
[SAFARI Live Seminar Video (2 hrs 57 mins)]
[Lightning Talk Video (3 minutes)]

 

Join us at ISCA 2021 for our talks

ISCA 2021 Program:  https://www.iscaconf.org/isca2021/program/

Tuesday, June 15, Session 6B: Memory II, 12 pm EDT:

Lois Orosa, Yaohua Wang, Mohammad Sadrosadati, Jeremie S. Kim, Minesh Patel, Ivan Puddu, Haocong Luo, Kaveh Razavi, Juan Gomez-Luna, Hasan Hassan, Nika Mansouri-Ghiasi, Saugata Ghose, and Onur Mutlu, “CODIC: A Low-Cost Substrate for Enabling Custom In-DRAM Functionalities and Optimizations”, Proceedings of the 48th International Symposium on Computer Architecture (ISCA), Virtual, June 2021.
[Slides (pptx) (pdf)]
[Short Talk Slides (pptx) (pdf)]
[Talk Video (22 minutes)]


Wednesday, June 16, Session 11B, 1:15 pm EDT:

Jawad Haj-Yahya, Jeremie S. Kim, A. Giray Yaglikci, Ivan Puddu, Lois Orosa, Juan Gomez Luna, Mohammed Alser, and Onur Mutlu, “IChannels: Exploiting Current Management Mechanisms to Create Covert Channels in Modern Processors”, Proceedings of the 48th International Symposium on Computer Architecture (ISCA), Virtual, June 2021.
[Slides (pptx) (pdf)]
[Short Talk Slides (pptx) (pdf)]
[Talk Video (21 minutes)]


Wednesday, June 16, Session 11A, 1:15 pm EDT:

Ataberk Olgun, Minesh Patel, A. Giray Yaglikci, Haocong Luo, Jeremie S. Kim, F. Nisa Bostanci, Nandita Vijaykumar, Oguz Ergin, and Onur Mutlu, “QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAM Chips”, Proceedings of the 48th International Symposium on Computer Architecture (ISCA), Virtual, June 2021.
[Slides (pptx) (pdf)]
[Short Talk Slides (pptx) (pdf)]
[Talk Video (25 minutes)]


 

Congratulations to Nastaran Hajinazar on her successful PhD Defence

Nastaran successfully defended her PhD thesis in June 2021. Congratulations, Nastaran! We look forward to many more collaborations with you in the future.

Thesis: Data-Centric and Data-Aware Frameworks for Fundamentally Efficient Data Handling in Modern Computing Systems

Abstract:

There is an explosive growth in the size of the input and/or intermediate data used and generated by modern and emerging applications. Unfortunately, modern computing systems are not capable of handling large amounts of data efficiently. Major concepts and components (e.g., the virtual memory system) and predominant execution models (e.g., the processor-centric execution model) used in almost all computing systems are designed without having modern applications’ overwhelming data demand in mind. As a result, accessing, moving, and processing large amounts of data faces important challenges in today’s systems, making data a first-class concern and a prime performance and energy bottleneck in such systems. This thesis studies the root cause of inefficiency in modern computing systems when handling modern applications’ data demand, and aims to fundamentally address such inefficiencies, with a focus on two directions.

First, we design a new framework that aids the widespread adoption of processing-using-DRAM, a data-centric computation paradigm that improves the overall performance and efficiency of the system when computing large amounts of data by minimizing the cost of data movement and enabling computation where the data resides. To this end, we introduce SIMDRAM, an end-to-end processing-using-DRAM framework that (1) efficiently computes complex operations required by modern data-intensive applications, and (2) provides the ability to implement new arbitrary operations as required, all in an in-DRAM massively-parallel SIMD substrate that requires minimal changes to the DRAM architecture.
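To make the massively-parallel bit-serial SIMD style concrete: data is laid out "vertically," one bit-plane per DRAM row, and an n-bit operation is built from n passes of bulk bitwise operations over entire rows, with every bit position of a row acting as an independent SIMD lane. The sketch below emulates that computation style in software with AND/OR/XOR; SIMDRAM itself synthesizes operations from in-DRAM primitives such as row copy and majority, so this illustrates the data layout and computation pattern, not the hardware mechanism:

```python
def bit_serial_add(a_planes, b_planes):
    """Add two sets of unsigned integers stored bit-serially: a_planes[j]
    is a machine word whose bit i holds bit j of lane i's value. Each
    bitwise operation below acts on all lanes at once (a ripple-carry
    full adder, one bit position per step)."""
    carry = 0
    out = []
    for aj, bj in zip(a_planes, b_planes):   # one bit position at a time
        out.append(aj ^ bj ^ carry)                     # sum bit-plane
        carry = (aj & bj) | (carry & (aj ^ bj))         # carry bit-plane
    out.append(carry)                        # final carry-out plane
    return out

def transpose(values, n_bits):
    """Pack lane values into bit-planes: bit i of plane j = bit j of values[i]."""
    return [sum(((v >> j) & 1) << i for i, v in enumerate(values))
            for j in range(n_bits)]

def untranspose(planes, n_lanes):
    """Recover per-lane integer values from bit-planes."""
    return [sum(((planes[j] >> i) & 1) << j for j in range(len(planes)))
            for i in range(n_lanes)]
```

A 64-bit word gives 64 lanes per operation in this software emulation; in DRAM, the lane count is the row width (thousands of bits), which is where the throughput of processing-using-DRAM comes from.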

Second, we design a new, more scalable virtual memory framework that (1) eliminates the inefficiencies of the conventional virtual memory frameworks when handling the high memory demand in modern applications, and (2) is built from the ground up to understand, convey, and exploit data properties, to create opportunities for performance and efficiency improvements. To this end, we introduce the Virtual Block Interface (VBI), a novel virtual memory framework that (1) efficiently handles modern applications’ high data demand, (2) conveys properties of different pieces of program data (e.g., data structures) to the hardware and exploits this knowledge for performance and efficiency optimizations, (3) better extracts performance from the wide variety of new system configurations that are designed to process large amounts of data (e.g., hybrid memory systems), and (4) provides all the key features of the conventional virtual memory frameworks, at low overhead.

Keywords: Efficient Data Handling, Data-Centric Architectures, Data-Aware Architectures, Virtual Memory, Processing-in-Memory

Examining Committee

Onur Mutlu, Co-Senior Supervisor
Arrvindh Shriraman, Co-Senior Supervisor
Saugata Ghose, Supervisor
Vivek Seshadri, Supervisor
Alaa Alameldeen, Internal Examiner
Myoungsoo Jung, External Examiner
Zhenman Fang, Chair

 

Congratulations to Gagandeep Singh on his successful PhD Defence

Gagan successfully defended his PhD thesis in March 2021. We are excited that Gagan will stay on with SAFARI as a postdoc, and we look forward to many successful collaborations with him. Congratulations, Gagan!

Gagandeep Singh, March 2021 (defended 29 March 2021)

Thesis title: “Designing, Modeling, and Optimizing Data-Intensive Computing Systems”
[Slides (pptx) (pdf)]

Abstract:  

The cost of moving data between the memory units and the compute units is a major contributor to the execution time and energy consumption of modern workloads in computing systems. At the same time, we are witnessing an enormous amount of data being generated across multiple application domains. Moreover, the end of Dennard scaling, the slowing of Moore’s law, and the emergence of dark silicon limit the attainable performance on current computing systems. These trends suggest a need for a paradigm shift towards a data-centric approach where computation is performed close to where the data resides. This approach allows us to overcome our current systems’ performance and energy limitations by minimizing the data movement overhead and ensuring that data does not overwhelm system components. Further, a data-centric approach can enable a data-driven view where we take advantage of vast amounts of available data to improve architectural decisions. Our current systems are designed to follow rigid and simple policies that lack adaptability. Therefore, current system policies fail to provide robust improvement across varying workloads and system conditions.

As a step towards modern architectures, this dissertation contributes to various aspects of the data-centric approach and further proposes several data-driven mechanisms.

First, we design NERO, a data-centric accelerator for a real-world weather prediction application. NERO overcomes the memory bottleneck of weather prediction stencil kernels by exploiting near-memory computation capability on specialized field-programmable gate array (FPGA) accelerators with high-bandwidth memory (HBM) that are attached to the host CPU.

Second, we explore the applicability of different number formats, including fixed-point, floating-point, and posit, for different stencil kernels. We search for the appropriate bit-width that reduces the memory footprint and improves the performance and energy efficiency with minimal loss in the accuracy.

Third, we propose NAPEL, an ML-based application performance and energy prediction framework for data-centric architectures. NAPEL uses ensemble learning to build a model that, once trained for a fraction of programs, can predict the performance and energy consumption of different applications.

Fourth, we present LEAPER, the first use of few-shot learning to transfer FPGA-based computing models across different hardware platforms and applications. LEAPER provides the ability to reuse a prediction model built on an inexpensive, low-end local system for a new, unknown, high-end FPGA-based system.

Fifth, we propose QRator, a reinforcement learning (RL)-based data-placement technique for hybrid storage systems. QRator is a data-driven technique that uses RL to develop a data-placement policy agent. The data-placement agent decides which data should be stored in which storage device to achieve the best performance, while minimizing the migration overhead, taking into account the device and workload characteristics. Our evaluation results show that QRator significantly improves a hybrid storage subsystem’s performance compared to state-of-the-art data placement techniques.
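The decide/observe-reward/update loop at the heart of such an RL-based placer can be illustrated with a deliberately tiny epsilon-greedy agent choosing between a fast and a slow device. This toy ignores QRator's actual state features, reward function, and learning algorithm; every name and number below is made up for illustration:

```python
import random

def epsilon_greedy_placer(requests, epsilon=0.1, seed=0):
    """Toy decide/reward/update loop of an RL-style data placer: for each
    request, choose a device (mostly greedily, occasionally exploring),
    observe a latency-based reward, and update action-value estimates.
    `requests` is a list of booleans (True = frequently-reused "hot" data)."""
    rng = random.Random(seed)
    q = {"fast": 0.0, "slow": 0.0}        # action-value estimates
    counts = {"fast": 0, "slow": 0}
    placements = []
    for hot in requests:
        if rng.random() < epsilon:
            device = rng.choice(["fast", "slow"])   # explore
        else:
            device = max(q, key=q.get)              # exploit
        # Illustrative reward model: hot data benefits from the fast device.
        reward = 1.0 if (device == "fast") == hot else 0.0
        counts[device] += 1
        q[device] += (reward - q[device]) / counts[device]  # running mean
        placements.append(device)
    return placements, q
```

A real agent would condition its choice on device and workload features (queue depth, access frequency, migration cost) rather than learning a single context-free preference, but the loop structure is the same.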

Overall, this thesis provides two key conclusions: (1) hardware acceleration on an FPGA+HBM fabric is a promising solution to overcome the data movement bottleneck of our current computing systems; (2) data should drive system and design decisions by leveraging inherent data characteristics to make our computing systems more efficient. Thus, we conclude that the mechanisms proposed by this dissertation provide promising solutions to handle data well by following a data-centric approach, and further demonstrate the importance of leveraging data to devise data-driven policies. We hope that the proposed architectural techniques and detailed experimental data presented in this dissertation will enable the development of energy-efficient data-intensive computing systems and drive the exploration of new mechanisms to improve the performance and energy efficiency of future computing systems.


Reducing Solid-State Drive Read Latency by Optimizing Read-Retry

Watch our recent talk at ASPLOS 2021:

Jisung Park, Myungsuk Kim, Myoungjun Chun, Lois Orosa, Jihong Kim, and Onur Mutlu,
“Reducing Solid-State Drive Read Latency by Optimizing Read-Retry”
Proceedings of the 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Virtual, March-April 2021.
[2-page Extended Abstract]
[Short Talk Slides (pptx) (pdf)]
[Full Talk Slides (pptx) (pdf)]
[Short Talk Video (5 mins)]
[Full Talk Video (19 mins)]

SIMDRAM: A Framework for Bit-Serial SIMD Processing using DRAM

Watch our recent talks at ASPLOS 2021!

Nastaran Hajinazar, Geraldo F. Oliveira, Sven Gregorio, Joao Dinis Ferreira, Nika Mansouri Ghiasi, Minesh Patel, Mohammed Alser, Saugata Ghose, Juan Gomez-Luna, and Onur Mutlu,
“SIMDRAM: A Framework for Bit-Serial SIMD Processing using DRAM”
Proceedings of the 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Virtual, March-April 2021.
[2-page Extended Abstract]
[Short Talk Slides (pptx) (pdf)]
[Talk Slides (pptx) (pdf)]
[Short Talk Video (5 mins)]
[Full Talk Video (27 mins)]

 

Join us at ASPLOS 2021 online

We are at ASPLOS 2021 this week and next.  Join us for our talks and learn more about our recent works:

Session 2: Memory Systems, Monday, April 19, 4:00 PM Pacific Time:

Irina Calciu, M. Talha Imran, Ivan Puddu, Sanidhya Kashyap, Hasan Al Maruf, Onur Mutlu, and Aasheesh Kolli,
“Rethinking Software Runtimes for Disaggregated Memory”
Proceedings of the 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Virtual, March-April 2021.
[2-page Extended Abstract]
[Source Code (Officially Artifact Evaluated)]


Session 8: Tools & Frameworks, Tuesday, April 20, 4:00 PM Pacific Time:

Nastaran Hajinazar, Geraldo F. Oliveira, Sven Gregorio, Joao Dinis Ferreira, Nika Mansouri Ghiasi, Minesh Patel, Mohammed Alser, Saugata Ghose, Juan Gomez-Luna, and Onur Mutlu,
“SIMDRAM: An End-to-End Framework for Bit-Serial SIMD Computing in DRAM”
Proceedings of the 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Virtual, March-April 2021.
[2-page Extended Abstract]
[Short Talk Slides (pptx) (pdf)]
[Talk Slides (pptx) (pdf)]
[Short Talk Video (5 mins)]
[Full Talk Video (27 mins)]


Session 17: Solid State Drives, Thursday, April 22, 7:00 AM Pacific Time:

Jisung Park, Myungsuk Kim, Myoungjun Chun, Lois Orosa, Jihong Kim, and Onur Mutlu,
“Reducing Solid-State Drive Read Latency by Optimizing Read-Retry”
Proceedings of the 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Virtual, March-April 2021.
[2-page Extended Abstract]
[Short Talk Slides (pptx) (pdf)]
[Full Talk Slides (pptx) (pdf)]
[Short Talk Video (5 mins)]
[Full Talk Video (19 mins)]

 

ASPLOS Program:  https://asplos-conference.org/program/