HiPEAC Blog

Oct. 23, 2018

‘It’s the memory, stupid’: A conversation with Onur Mutlu

‘We’re beyond computation; we know how to do computation really well, we can optimize it, we can build all sorts of accelerators … but the memory – how to feed the data, how to get the data into the accelerators – is a huge problem.’

Onur Mutlu

This was how ETH Zürich and Carnegie Mellon Professor Onur Mutlu opened his course on memory systems and memory-centric computing systems at HiPEAC’s summer school, ACACES18. A prolific publisher – he recently bagged the top spot in the International Symposium on Computer Architecture (ISCA) hall of fame – Onur is passionate about computation and communication that are efficient and secure by design. In advance of our Computing Systems Week focusing on data centres, storage, and networking, which takes place next week in Heraklion, HiPEAC picked his brains on all things data.

How important is storing and moving data to the correct functioning of a computer system?
Critically important: all systems need to have reliable and efficient mechanisms to store and move data.

Fundamentally, a computing system consists of three components:

  • computation units that operate on data
  • data storage units – that is, memories – that keep the data intact
  • data communication units that enable the movement of data to where it is needed

Most of the real estate of the processors we build is dedicated to data storage and communication units. Computation units actually occupy a very small area, thanks to Moore’s Law and over 70 years dedicated to optimizing them. This arose from a design mindset in which all data has to be processed in the computation units.

This mindset has led to significant performance improvements in computation units. However, the same mindset came at the expense of the data storage and movement units, which have essentially remained 'slaves' to the computation units. There are a few major downsides to the computation-unit-centric design approach.

First, in many cases, more than 80% of the chip area is dedicated to caches, memories, memory controllers, interconnects and so on, whose sole purpose is to buffer data or control the buffering of data. Add main memory and storage, and we soon see 95-99% of the entire real estate of a system being dedicated to units that simply store and move data, or which control the storage and movement of data. If we were to design systems where, say, storage units could do computations on the data they store, we would likely have a much more balanced system.

Second, to get data quickly from storage units to the processor, we’ve created workarounds which are making systems ever more complex. These include sophisticated out-of-order and speculative execution engines, many levels of cache hierarchy, complex prefetching mechanisms and large amounts of multithreading. As well as making design harder, these workarounds have an adverse impact on reliability and energy efficiency.

Third, since processors are the only place where data can be processed, data moves around the system a lot, even if all we want to do is copy data from one place to another inside memory. Our results on four major mobile system workloads developed by Google show that, in a mobile system, more than 62% of the entire measured system energy is spent on moving data between memory and the computation units. We’ve also shown that, by processing data close to memory, using either very simple cores or specialized accelerators in the logic layer of 3D-stacked memory, we can halve the energy used while at the same time doubling system performance on a mobile device.

What kind of applications could be enabled or improved thanks to enhanced data storage and movement?
The amount of data we store and manipulate is increasing at an unprecedented pace. Take bioinformatics, for example. Today we can sequence genomes much faster than our computational capacity allows us to analyse them. A major bottleneck is data movement between the memory and processor / field-programmable gate array (FPGA). If we could speed up genome analysis, we could accelerate scientific discovery, enable doctors to make personalized medical decisions in real time, increase security and enable many more things that might not be imaginable to us at the moment.

One of the goals of our bioinformatics research is to enable an embedded device that can perform comprehensive genome analysis in real time. To build such a device, you want to maximize energy efficiency, minimize latency and maximize security. All of these require minimizing data movement. Data movement is naturally energy hungry – a single memory access costs two to three orders of magnitude more energy than a complex arithmetic operation! – and it also causes performance loss and increases security vulnerabilities by exposing the data to the outside world for longer.
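
To put that gap in perspective, here is a back-of-the-envelope sketch in Python. The per-operation energy figures are assumed, order-of-magnitude values chosen only to reflect the ‘two to three orders of magnitude’ ratio above; they are not measured numbers from any of the papers discussed here:

```python
# Back-of-the-envelope energy estimate for a simple kernel.
# ASSUMED, order-of-magnitude per-operation energies (arbitrary units),
# chosen only to reflect the ~100-1000x memory/compute energy gap.
E_ALU_OP = 1.0         # one arithmetic operation
E_DRAM_ACCESS = 500.0  # one off-chip memory access

n_ops = 1_000_000      # arithmetic operations performed
n_accesses = 100_000   # off-chip accesses: only 1 per 10 operations

e_compute = n_ops * E_ALU_OP
e_movement = n_accesses * E_DRAM_ACCESS

print(f"compute energy:  {e_compute:,.0f}")
print(f"movement energy: {e_movement:,.0f}")
print(f"movement share:  {e_movement / (e_compute + e_movement):.0%}")
# Even with 10 operations per memory access, data movement consumes
# roughly 98% of the energy - hence the focus on minimizing movement.
```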

Two directions we’re pursuing to minimize data movement are:

  • New algorithms for genome analysis: we’ve shown that by designing algorithms (e.g., FastHASH) that take advantage of the structure of the genome being analysed, one can reduce communication (and computation) by 10 to 100 times.
  • The processing near / inside memory paradigm: we have been examining massively parallel processing of the genome inside main memory, using very simple bitwise operations and leveraging 3D-stacked memory + logic techniques. This can speed up genome analysis by almost four times, according to our initial work, GRIM-Filter (a simplified sketch of the filtering idea follows this list).
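
For illustration, here is a deliberately simplified Python sketch of the bitvector-filtering idea behind GRIM-Filter, not its actual implementation; the k-mer length, vocabulary and threshold are made-up toy values. Each genome region is summarized as a bitvector of the k-mers it contains, and a candidate mapping location is discarded, before any expensive alignment, if the region shares too few k-mers with the read:

```python
from itertools import product

# Deliberately simplified bitvector filter in the spirit of GRIM-Filter
# (not the actual implementation). One bit per possible k-mer; a genome
# region's bitvector marks which k-mers occur in it. The AND + count
# below is exactly the kind of bulk bitwise work that can run inside
# 3D-stacked memory. K and the threshold are made-up toy values.

K = 3
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def kmer_bits(seq):
    """Bitvector (as a Python int) of the k-mers present in seq."""
    bits = 0
    for i in range(len(seq) - K + 1):
        bits |= 1 << VOCAB[seq[i:i + K]]
    return bits

def passes_filter(read, region_bits, threshold):
    """Keep a candidate location only if the region shares at least
    `threshold` k-mers with the read: a bitwise AND + population count."""
    shared = bin(kmer_bits(read) & region_bits).count("1")
    return shared >= threshold

region_bits = kmer_bits("ACGTACGGAAACCTTGG")  # precomputed per region
print(passes_filter("ACGTACG", region_bits, threshold=4))  # True: worth aligning
print(passes_filter("TTTTTTT", region_bits, threshold=4))  # False: rejected cheaply
```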

Combining these two approaches – algorithmic and architecture design – can lead to an overall speedup of more than two orders of magnitude compared to state-of-the-art systems in genome read mapping.

Data movement also greatly bottlenecks other key workloads we use daily, such as graph processing, which is foundational to numerous applications, including machine learning, social networks, information networks such as Wikipedia, and bioinformatics. As these workloads require frequent random memory accesses with little computation per access, their scalability is limited by how fast we can move the data between the memory and the processor.

Here, again, processing with minimal data movement – that is, close to where the data resides, using near-memory computation units – can greatly alleviate the data movement problem and accelerate graph processing. To this end, we’ve devised a graph-processing accelerator called Tesseract, where graph computation is performed in simple in-order cores at the logic layer of a 3D-stacked memory + logic chip. Tesseract delivers, on average, almost 14 times better performance and eight times lower energy consumption across five key graph processing workloads, including the PageRank algorithm underlying search engines such as Google. By fine-tuning this system, I believe we could reach up to 50 times greater performance and 30 times less energy.
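
The access pattern that makes graph workloads memory-bound is easy to see in code. The toy PageRank step below is a hypothetical sketch, not Tesseract’s implementation: it performs one irregular memory read per edge, with only a multiply-add of computation alongside it to hide that access latency behind:

```python
# Toy PageRank-style step showing the access pattern (a sketch, not
# Tesseract code): one irregular memory access per edge, with only a
# multiply-add of computation per access.

def pagerank_step(out_edges, ranks, damping=0.85):
    n = len(ranks)
    new_ranks = [(1 - damping) / n] * n
    for src, dsts in enumerate(out_edges):
        share = ranks[src] / len(dsts)         # negligible compute...
        for dst in dsts:
            new_ranks[dst] += damping * share  # ...one random access per edge
    return new_ranks

graph = [[1, 2], [2], [0], [0, 2]]  # node i -> its successors
ranks = [0.25] * 4
for _ in range(20):
    ranks = pagerank_step(graph, ranks)
print([round(r, 3) for r in ranks])
```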

I could go on and on about the importance of reducing data movement, but I’ll stick to one more example: databases. Databases are everywhere in our lives, storing massive amounts of data and responding to many types of query. The latency and throughput of these queries, which directly affect user experience and the types of problem we can solve in a given time, are largely dictated by how fast we can move data from the memory to the processor and how fast we can perform operations in parallel. Hence, processing inside the memory or storage unit is extremely promising for database workloads.

We’ve come up with a system, Ambit, which minimally modifies memory (DRAM) chips such that they can perform bulk-bitwise operations in parallel. The key idea is to exploit the analogue computation (charge sharing) capability present in all DRAM chips. We’ve shown that in-memory processing using this system can deliver 12 times faster query execution. Since bulk bitwise operations are common in workloads ranging from genome analysis to cryptography to web search, the impact could be significant.
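
To see how a database query turns into bulk bitwise work, consider this hypothetical bitmap-index example. Python integers stand in for whole DRAM rows; in Ambit, the AND of two rows happens in place, via charge sharing, without moving them to the CPU:

```python
# Hypothetical bitmap-index query. Bit i of each bitmap is 1 if table
# row i satisfies the predicate; Python ints stand in for DRAM rows.

is_eu_customer = 0b10110100  # rows 2, 4, 5, 7 are EU customers
paid_over_100  = 0b11010110  # rows 1, 2, 4, 6, 7 paid over 100

# "WHERE eu AND paid_over_100" becomes one bulk bitwise AND over the
# two bitmaps; Ambit performs this kind of AND inside DRAM itself.
matches = is_eu_customer & paid_over_100

print(f"{matches:08b}")          # 10010100 -> rows 2, 4 and 7 match
print(bin(matches).count("1"))   # COUNT(*) = 3
```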

What about storing data? Could this be improved?
Absolutely: another way to reduce energy wastage and performance loss is to make sure that the vast amounts of data we store are actually stored efficiently. There are two approaches.

The first is to design new technologies that are fundamentally more energy efficient, as they do not need to be refreshed as often as the memory technologies we have today (dynamic random access memory, or DRAM). Promising contenders include phase-change memory, spin-transfer torque magnetic memory (STT-MRAM) and resistive random access memory (RRAM). We have shown that we can more than halve main memory energy consumption by carefully replacing DRAM with STT-MRAM (see ‘Further reading’, below).

The second involves changing the design mindset to maximize energy efficiency in current technologies like DRAM (used in main memory) and flash memory (used in solid state drives). Our solutions (called RAIDR and AVATAR) show that one can reduce the number of refreshes necessary to keep data intact in modern DRAM by four times through intelligent design of the memory controller. We can also reduce performance loss by two to four times with a combination of techniques that make the memory controller more intelligent in handling latency.
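
The intuition behind retention-aware refresh is simple enough to sketch. The snippet below is an illustrative model, not RAIDR’s actual mechanism (which tracks weak rows with Bloom filters in the memory controller), and the row counts and refresh periods are assumed:

```python
# Illustrative model of retention-aware refresh (not RAIDR's actual
# mechanism, which uses Bloom filters in the memory controller).
# ASSUMED numbers: 10,000 rows, of which 2 are "weak" and must be
# refreshed every 64 ms; the rest retain data for 256 ms or more.

WEAK_ROWS = {104, 9_871}
WEAK_PERIOD_MS = 64     # conservative baseline refresh period
NORMAL_PERIOD_MS = 256  # safe for the vast majority of rows

def rows_due(time_ms, all_rows):
    """Rows whose refresh falls due at this point in time."""
    due = set()
    if time_ms % WEAK_PERIOD_MS == 0:
        due |= WEAK_ROWS
    if time_ms % NORMAL_PERIOD_MS == 0:
        due |= set(all_rows) - WEAK_ROWS
    return due

all_rows = range(10_000)
refreshes = sum(len(rows_due(t, all_rows)) for t in range(64, 257, 64))
print(refreshes)  # 10,006 refreshes per 256 ms window, versus
                  # 4 * 10,000 = 40,000 if every row used the 64 ms
                  # period: roughly the 4x reduction mentioned above.
```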

Can we just incrementally improve existing technologies to overcome data bottlenecks? Or do we need revolutionary change?
I think that both are needed. Improving existing technologies is easier in the short term, but their benefit is limited. Radically new approaches, like processing in memory, require a paradigm shift and a lot of cross-layer effort, but can provide huge benefits that may not be possible with incremental improvements.

That said, we cannot ignore the world if we want to change it; we should make it easy to gradually move from the existing paradigm to any radically new paradigm. This is why we’ve developed mechanisms which take advantage of processing in memory with minimal changes to the existing system, such as our work on “PIM-enabled instructions” (see ‘Further reading’, below). Even this very simple approach improves performance by 40-50% and reduces energy consumption by about 25% on average across key data-intensive workloads. Sure, it doesn’t come close to the massive improvements achieved by Tesseract, but it only requires minor modifications, as opposed to the complete system redesign required by Tesseract.
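
As a rough illustration of the locality-aware idea, here is a hypothetical Python model, not the ISA or hardware from the paper: the same operation runs on the host processor when its data is likely cached, and is shipped to a memory-side unit when it is not:

```python
# Hypothetical model of locality-aware execution (not the ISA or
# hardware from the PIM-enabled instructions paper). One logical
# operation either runs on the host CPU, when its operand is likely
# cache-resident, or is offloaded to a memory-side compute unit.

class LocalityAwarePIM:
    def __init__(self):
        self.recently_touched = set()  # crude stand-in for cache state

    def pim_add(self, memory, addr, value):
        if addr in self.recently_touched:
            memory[addr] += value  # hot data: compute on the CPU
        else:
            self.offload_add(memory, addr, value)  # cold data: compute in memory
        self.recently_touched.add(addr)

    def offload_add(self, memory, addr, value):
        # Stands in for the in-memory compute unit: the operand never
        # crosses the memory bus to the processor.
        memory[addr] += value

memory = {0x10: 5}
pim = LocalityAwarePIM()
pim.pim_add(memory, 0x10, 3)  # cold -> executed at the memory side
pim.pim_add(memory, 0x10, 3)  # hot  -> executed on the CPU
print(memory[0x10])           # 11
```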

We’ve recently written a book chapter on this topic (see ‘Further reading’, below) and would encourage interested readers to get in touch with any feedback.

How else can we make computing systems more efficient?
As we face increasingly hard problems in technology scaling at the device and circuit levels, I believe we have an exciting opportunity to reinvent how we architect computing systems, with a focus on energy efficiency. We’ve talked about processing in memory or storage, and more efficient new technologies for memory and storage. There’s a third paradigm which is already taking hold, but which needs to gather more speed: specialized architectures for key applications, such as genome sequence analysis. We need research across the hardware / software interface, intersecting all layers of the computing stack, to enable new paradigms that are fundamentally more efficient and sustainable.

Videos of Onur Mutlu’s ACACES18 course are available on the HiPEAC YouTube channel: bit.ly/ACACES_playlist

Further reading:

Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu, ‘Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks’
Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018.

Jeremie S. Kim, Damla Senol Cali, Hongyi Xin, Donghyuk Lee, Saugata Ghose, Mohammed Alser, Hasan Hassan, Oguz Ergin, Can Alkan, and Onur Mutlu, ‘GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies’
BMC Genomics, 2018.
Proceedings of the 16th Asia Pacific Bioinformatics Conference (APBC), Yokohama, Japan, January 2018.

Mohammed Alser, Hasan Hassan, Hongyi Xin, Oguz Ergin, Onur Mutlu, and Can Alkan, ‘GateKeeper: A New Hardware Architecture for Accelerating Pre-Alignment in DNA Short Read Mapping’
Bioinformatics, published online, May 31, 2017.

Hongyi Xin, Donghyuk Lee, Farhad Hormozdiari, Samihan Yedkar, Onur Mutlu, and Can Alkan, ‘Accelerating Read Mapping with FastHASH’
BMC Genomics, 14(Suppl 1):S13, 21 January 2013.
Also appears in Proceedings of the 11th Asia Pacific Bioinformatics Conference (APBC), Vancouver, BC, Canada, January 2013.

Hongyi Xin, John Greth, John Emmons, Gennady Pekhimenko, Carl Kingsford, Can Alkan, and Onur Mutlu, ‘Shifted Hamming Distance: A Fast and Accurate SIMD-friendly Filter to Accelerate Alignment Verification in Read Mapping’
Bioinformatics, published online, January 10, 2015.

Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi, ‘PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture’
Proceedings of the 42nd International Symposium on Computer Architecture (ISCA), Portland, OR, June 2015.

Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi, ‘A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing’
Proceedings of the 42nd International Symposium on Computer Architecture (ISCA), Portland, OR, June 2015.

Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry, ‘Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology’
Proceedings of the 50th International Symposium on Microarchitecture (MICRO), Boston, MA, USA, October 2017.

Saugata Ghose, Kevin Hsieh, Amirali Boroumand, Rachata Ausavarungnirun, and Onur Mutlu, ‘Enabling the Adoption of Processing-in-Memory: Challenges, Mechanisms, Future Research Directions’
Invited Book Chapter, to appear in 2018. Preliminary version available on arxiv.org.

Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger, ‘Architecting Phase Change Memory as a Scalable DRAM Alternative’
Proceedings of the 36th International Symposium on Computer Architecture (ISCA), pages 2-13, Austin, TX, June 2009.

Emre Kultursay, Mahmut Kandemir, Anand Sivasubramaniam, and Onur Mutlu, ‘Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative’
Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, April 2013.

Jamie Liu, Ben Jaiyen, Richard Veras, and Onur Mutlu, ‘RAIDR: Retention-Aware Intelligent DRAM Refresh’
Proceedings of the 39th International Symposium on Computer Architecture (ISCA), Portland, OR, June 2012.

Moinuddin Qureshi, Dae Hyun Kim, Samira Khan, Prashant Nair, and Onur Mutlu, ‘AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems’
Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Rio de Janeiro, Brazil, June 2015.

Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, and Onur Mutlu, ‘Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case’
Proceedings of the 21st International Symposium on High-Performance Computer Architecture (HPCA), Bay Area, CA, February 2015.

Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, and Onur Mutlu, ‘Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture’
Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013.

Donghyuk Lee, Samira Khan, Lavanya Subramanian, Saugata Ghose, Rachata Ausavarungnirun, Gennady Pekhimenko, Vivek Seshadri, and Onur Mutlu, ‘Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms’
Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Urbana-Champaign, IL, USA, June 2017.

Kevin Chang, Abhijith Kashyap, Hasan Hassan, Samira Khan, Kevin Hsieh, Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Tianshi Li, and Onur Mutlu, ‘Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization’
Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Antibes Juan-Les-Pins, France, June 2016.

Yaohua Wang, Arash Tavakkol, Lois Orosa, Saugata Ghose, Nika Mansouri Ghiasi, Minesh Patel, Jeremie S. Kim, Hasan Hassan, Mohammad Sadrosadati, and Onur Mutlu, ‘Reducing DRAM Latency via Charge-Level-Aware Look-Ahead Partial Restoration’
Proceedings of the 51st International Symposium on Microarchitecture (MICRO), Fukuoka, Japan, October 2018.

Jeremie S. Kim, Minesh Patel, Hasan Hassan, and Onur Mutlu, ‘Solar-DRAM: Reducing DRAM Access Latency by Exploiting the Variation in Local Bitlines’
Proceedings of the 36th IEEE International Conference on Computer Design (ICCD), Orlando, FL, USA, October 2018.

