Congratulations to Minesh Patel on successfully defending his PhD

Minesh Patel defended his PhD thesis on October 1, 2021
Thesis title: “Enabling Effective Error Mitigation in Modern Memory Chips that Use On-Die Error-Correcting Codes”
[Slides (pptx) (pdf)]
[Thesis (pdf)]
[SAFARI Live Seminar Video]

>> Minesh is currently on the job market. He is seeking a research position in an industry lab in the area of computer systems and architecture. His graduate work focuses primarily on improvements to memory systems. Going forward, Minesh is interested in broader topics pertaining to cutting-edge computing systems.

Improvements in main memory storage density are primarily driven by process technology shrinkage (i.e., technology scaling), which negatively impacts reliability by exacerbating various circuit-level error mechanisms. To compensate for growing error rates, both memory manufacturers and consumers develop and incorporate error-mitigation mechanisms that improve manufacturing yield and allow system designers to meet reliability targets. Developing effective error mitigation techniques requires understanding the errors’ characteristics (e.g., worst-case behavior, statistical properties). Unfortunately, we observe that proprietary on-die Error-Correcting Codes (ECC) used in modern memory chips introduce new challenges to efficient error mitigation by obfuscating CPU-visible error characteristics in an unpredictable, ECC-dependent manner.

In this dissertation, we experimentally study memory errors, examine how on-die ECC obfuscates their statistical characteristics, and develop new testing techniques to overcome the obfuscation through four key steps. First, we experimentally study DRAM data-retention error characteristics to understand the challenges inherent in understanding and mitigating memory errors that are related to technology scaling. Second, we study how on-die ECC affects these characteristics to develop Error Inference (EIN), a new statistical inference methodology for inferring key details of the on-die ECC mechanism and the raw errors that it obfuscates. Third, we examine the on-die ECC mechanism in detail to understand exactly how on-die ECC obfuscates raw bit error patterns. Using this knowledge, we introduce Bit-Exact ECC Recovery (BEER), a new testing methodology that exploits uncorrectable error patterns to (1) reverse-engineer the exact on-die ECC implementation used in a given memory chip and (2) identify the bit-exact locations of the raw bit errors responsible for a set of errors that are observed after on-die ECC correction. Fourth, we study how on-die ECC impacts error profiling and show that on-die ECC introduces three key challenges that negatively impact profiling practicality and effectiveness. To overcome these challenges, we introduce Hybrid Active-Reactive Profiling (HARP), a new error profiling strategy that uses simple modifications to the on-die ECC mechanism to quickly and effectively identify bits at risk of error. Finally, we conclude by discussing the critical need for transparency in DRAM reliability characteristics in order to enable DRAM consumers to better understand and adapt commodity DRAM chips to their system-specific needs.
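As a rough illustration of the obfuscation described above, the sketch below uses a toy Hamming(7,4) single-error-correcting code. This code, the bit layout, and the helper names (`encode`, `decode`, `post_correction_errors`) are assumptions made for this example only. They are not the proprietary on-die ECC of any real memory chip, and they are not the EIN, BEER, or HARP implementations. The sketch shows that a single raw bit error is silently corrected, while two raw bit errors cause the decoder to flip a third, unrelated bit, so the errors seen outside the chip differ in both count and location from the raw errors inside it.

```python
# Toy model: Hamming(7,4) single-error-correcting code (an assumption for
# illustration; real on-die ECC codes are longer and proprietary).
DATA_POSITIONS = (3, 5, 6, 7)   # 1-indexed codeword positions holding data bits
PARITY_POSITIONS = (1, 2, 4)    # 1-indexed codeword positions holding parity bits


def syndrome(codeword):
    """XOR of the 1-indexed positions of all set bits; 0 means no error detected."""
    s = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            s ^= pos
    return s


def encode(data):
    """Place 4 data bits in the codeword and set parity so the syndrome is 0."""
    codeword = [0] * 7
    for pos, bit in zip(DATA_POSITIONS, data):
        codeword[pos - 1] = bit
    s = syndrome(codeword)
    for p in PARITY_POSITIONS:
        if s & p:
            codeword[p - 1] = 1
    return codeword


def decode(stored):
    """Flip the bit the syndrome points at (if any), then return the data bits."""
    word = list(stored)
    s = syndrome(word)
    if s:
        word[s - 1] ^= 1  # correct for one raw error, a miscorrection for two
    return [word[pos - 1] for pos in DATA_POSITIONS]


def post_correction_errors(data, raw_error_positions):
    """Return the data-bit indices that are wrong after ECC decoding."""
    stored = encode(data)
    for pos in raw_error_positions:
        stored[pos - 1] ^= 1  # inject a raw bit error at this codeword position
    read = decode(stored)
    return [i for i, (a, b) in enumerate(zip(data, read)) if a != b]


data = [1, 0, 1, 1]
print(post_correction_errors(data, [5]))     # one raw error: fully hidden
print(post_correction_errors(data, [3, 5]))  # two raw errors: three visible errors
print(post_correction_errors(data, [1, 2]))  # two parity errors: a data error appears
```

In this toy model, the last case is the most telling: neither raw error touches a data bit, yet a data-bit error is observed after correction. An observer who sees only post-correction data cannot tell which raw errors occurred without knowing the code itself.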

This dissertation builds a detailed understanding of how on-die ECC obfuscates the statistical properties of main memory error mechanisms using a combination of real-chip experiments and statistical analyses. Our results show that the error characteristics that on-die ECC obfuscates can be recovered using new memory testing techniques that exploit the interaction between on-die ECC and the statistical characteristics of memory error mechanisms to expose physical cell behavior. We hope and believe that the analysis, techniques, and results we present in this dissertation will enable the community to better understand and tackle current and future reliability challenges as well as adapt commodity memory to new advantageous applications.

Minesh Patel, Geraldo F. de Oliveira Jr., and Onur Mutlu,
“HARP: Practically and Effectively Identifying Uncorrectable Errors in Memory Chips That Use On-Die Error-Correcting Codes”
Proceedings of the 54th International Symposium on Microarchitecture (MICRO), Virtual, October 2021.
[Slides (pptx) (pdf)]
[Short Talk Slides (pptx) (pdf)]
[Lightning Talk Slides (pptx) (pdf)]
[Talk Video (20 minutes)]
[Lightning Talk Video (1.5 minutes)]
[HARP Source Code (Officially Artifact Evaluated with All Badges)]
[arXiv version]

Minesh Patel, Jeremie S. Kim, Taha Shahroodi, Hasan Hassan, and Onur Mutlu,
“Bit-Exact ECC Recovery (BEER): Determining DRAM On-Die ECC Functions by Exploiting DRAM Data Retention Characteristics”
Proceedings of the 53rd International Symposium on Microarchitecture (MICRO), Virtual, October 2020.
[Slides (pptx) (pdf)]
[Short Talk Slides (pptx) (pdf)]
[Lightning Talk Slides (pptx) (pdf)]
[Lecture Slides (pptx) (pdf)]
[Talk Video (15 minutes)]
[Short Talk Video (5.5 minutes)]
[Lightning Talk Video (1.5 minutes)]
[Lecture Video (52.5 minutes)]
[BEER Source Code]
Best paper award

Minesh Patel, Jeremie S. Kim, Hasan Hassan, and Onur Mutlu,
“Understanding and Modeling On-Die Error Correction in Modern DRAM: An Experimental Study Using Real Devices”
Proceedings of the 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Portland, OR, USA, June 2019.
[Slides (pptx) (pdf)]
[Talk Video (26 minutes)]
[Full Talk Lecture (29 minutes)]
[Source Code for EINSim, the Error Inference Simulator]
Best paper award

Minesh Patel, Jeremie S. Kim, and Onur Mutlu,
“The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions”
Proceedings of the 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, June 2017.
[Slides (pptx) (pdf)]
[Lightning Session Slides (pptx) (pdf)]
