Talk Slides [pdf]
Title: HBM3 RAS: The Journey to Enhancing Die-Stacked DRAM Resilience at Scale
HBM3 is the next-generation technology of the JEDEC High Bandwidth Memory™ DRAM standard. HBM3 is expected to be widely used in future SoCs to accelerate data center and automotive workloads. Reliability, Availability, and Serviceability (RAS) are key requirements in most of these computing domains and use cases, and are essential for attaining sufficient resilience at scale. In the first part of the talk, we will review some key terminology and concepts, explain the set of RAS challenges that HBM3 faced, and discuss certain key considerations for standardization. We will present data and analyses that justified the need for a new RAS architecture for HBM3. Next, we will present the overall solution space that was explored and the specific direction taken for HBM3, and explain why this path was chosen. Finally, we will present the details of the HBM3 RAS architecture and an evaluation of its resilience at scale.
Sudhanva Gurumurthi is a Fellow at AMD, where he leads advanced development in RAS. Prior to joining industry, Sudhanva was an Associate Professor with tenure in the Computer Science Department at the University of Virginia. He is a recipient of an NSF CAREER Award, a Google Focused Research Award, an IEEE Computer Society Distinguished Contributor recognition, and several other awards and recognitions. Sudhanva has served as an editor for the IEEE Micro Top Picks from Computer Architecture Conferences special issue, IEEE Transactions on Computers, and IEEE Computer Architecture Letters. He also serves on the Advisory Council of the College of Science and Engineering at Texas State University. Sudhanva received his PhD in Computer Science and Engineering from Penn State in 2005.