SAFARI Live Seminar: Minesh Patel 21 September 2021

Join us for our SAFARI Live Seminar with Minesh Patel.
Tuesday, September 21 at 5:00 pm Zurich time (CEST)

Enabling Effective Error Mitigation in Memory Chips That Use On-Die Error-Correcting Codes
Minesh Patel, SAFARI Research Group, ETH Zurich

Livestream at 5:00 pm Zurich time (CEST) on YouTube: Link

Abstract: 

Improvements in main memory storage density are primarily driven by process technology shrinkage (i.e., technology scaling), which negatively impacts reliability by exacerbating various circuit-level error mechanisms. To offset growing error rates, both memory manufacturers and consumers develop and incorporate error-mitigation mechanisms that improve manufacturing yield and allow system designers to meet desired reliability targets. Developing effective error-mitigation techniques requires understanding the errors’ characteristics (e.g., worst-case behavior, statistical properties). Unfortunately, we observe that the proprietary on-die Error-Correcting Codes (ECC) used in modern memory chips introduce new challenges to efficient error mitigation by obfuscating CPU-visible error characteristics in an unpredictable, ECC-dependent manner.

In this seminar, we experimentally study memory errors, examine how on-die ECC obfuscates their statistical characteristics, and develop new testing techniques to overcome the obfuscation through four key steps. First, we experimentally study DRAM data-retention error characteristics to identify the challenges inherent in understanding and mitigating technology-scaling-related errors. Second, we study how on-die ECC affects these characteristics to develop Error Inference (EIN), a statistical inference methodology for inferring details of the on-die ECC mechanism and the pre-correction errors. Third, we examine the on-die ECC mechanism in detail to understand exactly how on-die ECC obfuscates raw bit error patterns. Using this knowledge, we introduce Bit-Exact ECC Recovery (BEER), a new testing methodology that exploits uncorrectable error patterns to (1) reverse-engineer the exact on-die ECC implementation used in a given chip and (2) identify the bit-exact locations of pre-correction errors that correspond to a given set of observed post-correction errors. Fourth, we study how on-die ECC impacts error profiling and show that on-die ECC introduces three key challenges that impact profiling practicality and effectiveness. To overcome these challenges, we introduce Hybrid Active-Reactive Profiling (HARP), a new profiling strategy that uses simple modifications to the on-die ECC mechanism to quickly and effectively identify bits at risk of error. Finally, we conclude by discussing the need for transparency in DRAM reliability characteristics in order to enable DRAM consumers to better understand and adapt commodity DRAM chips to their system-specific needs. In general, we hope and believe that these new testing techniques will enable scientists and engineers to make informed decisions towards building smarter systems.
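
As a rough illustration of how on-die ECC can obfuscate raw bit error patterns, the sketch below uses a textbook single-error-correcting Hamming(7,4) code. This code is an assumption chosen for brevity: the ECC functions in real chips are proprietary and operate on much longer words, and the helper names (`encode`, `correct`, `flip`, `error_positions`) are hypothetical and not taken from the papers. The point is only that a decoder built to fix one error can miscorrect when two raw errors occur, so the errors visible after correction differ from the pre-correction errors.

```python
from functools import reduce

def syndrome(codeword):
    """XOR of the 1-based positions of all set bits; 0 means a valid codeword."""
    return reduce(lambda acc, pos: acc ^ pos,
                  (i + 1 for i, bit in enumerate(codeword) if bit), 0)

def encode(data):
    """Place 4 data bits at positions 3, 5, 6, 7 and set parity at 1, 2, 4."""
    cw = [0] * 7
    for pos, bit in zip((3, 5, 6, 7), data):
        cw[pos - 1] = bit
    s = syndrome(cw)
    for p in (1, 2, 4):
        if s & p:
            cw[p - 1] = 1  # each parity bit cancels one bit of the syndrome
    return cw

def correct(codeword):
    """Single-error correction: flip the bit that the syndrome points to."""
    cw = list(codeword)
    s = syndrome(cw)
    if s:
        cw[s - 1] ^= 1
    return cw

def flip(codeword, positions):
    """Inject raw (pre-correction) errors at the given 1-based positions."""
    cw = list(codeword)
    for pos in positions:
        cw[pos - 1] ^= 1
    return cw

def error_positions(reference, observed):
    """1-based positions where the observed word differs from the reference."""
    return [i + 1 for i, (x, y) in enumerate(zip(reference, observed)) if x != y]

stored = encode([1, 0, 1, 1])

# One raw error is corrected, so nothing is visible after correction.
one_error = error_positions(stored, correct(flip(stored, [5])))

# Two raw errors (positions 3 and 5) yield syndrome 3 ^ 5 = 6, so the
# decoder flips position 6 as well and a third error appears.
two_errors = error_positions(stored, correct(flip(stored, [3, 5])))

print(one_error)   # []
print(two_errors)  # [3, 5, 6]
```

In this toy case, two pre-correction errors surface as three post-correction errors, and the position of the extra error depends entirely on the code's parity-check structure. That ECC-dependent behavior is what makes post-correction observations hard to interpret without knowing the ECC function.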

Bio:

Minesh Patel is a Ph.D. candidate at ETH Zurich working with Prof. Onur Mutlu. He received B.S. degrees in ECE and Physics from the University of Texas in 2015. Since then, he has been working toward his Ph.D. degree with a focus on memory systems reliability. His current research interests broadly span computer systems and architecture topics, including support for speculative and/or unreliable systems, performance modeling and analysis, and application characterization and optimization.


This talk is based on four papers published at ISCA 2017, DSN 2019, MICRO 2020, and MICRO 2021 (to appear), respectively. Links to the available papers and slides are below.

Minesh Patel, Jeremie S. Kim, and Onur Mutlu, “The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions”, Proceedings of the 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, June 2017.
[Slides (pptx) (pdf)]
[Lightning Session Slides (pptx) (pdf)]

Minesh Patel, Jeremie S. Kim, Hasan Hassan, and Onur Mutlu, “Understanding and Modeling On-Die Error Correction in Modern DRAM: An Experimental Study Using Real Devices”, Proceedings of the 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Portland, OR, USA, June 2019.
[Slides (pptx) (pdf)]
[Talk Video (26 minutes)]
[Full Talk Lecture (29 minutes)]
[Source Code for EINSim, the Error Inference Simulator]
Best paper award.

Minesh Patel, Jeremie S. Kim, Taha Shahroodi, Hasan Hassan, and Onur Mutlu, “Bit-Exact ECC Recovery (BEER): Determining DRAM On-Die ECC Functions by Exploiting DRAM Data Retention Characteristics”, Proceedings of the 53rd International Symposium on Microarchitecture (MICRO), Virtual, October 2020.
[Slides (pptx) (pdf)]
[Short Talk Slides (pptx) (pdf)]
[Lightning Talk Slides (pptx) (pdf)]
[Lecture Slides (pptx) (pdf)]
[Talk Video (15 minutes)]
[Short Talk Video (5.5 minutes)]
[Lightning Talk Video (1.5 minutes)]
[Lecture Video (52.5 minutes)]
[BEER Source Code]
Best paper award.

Minesh Patel, Geraldo Francisco de Oliveira Jr., and Onur Mutlu, “HARP: Practically and Effectively Identifying Uncorrectable Errors in Main Memory Chips That Use On-Die ECC”, Proceedings of the 54th International Symposium on Microarchitecture (MICRO), Virtual, October 2021.
