Join us for our upcoming SAFARI Live Seminar
Title: Understanding the Reliability and Power-Efficiency Trade-offs of Modern FPGAs through Undervolting
FPGAs are a common type of reconfigurable computing system. They offer the best of both worlds: programmability (close to general-purpose CPU/GPU processors) and efficiency (close to application-specific ASIC circuits). Yet their power efficiency is significantly lower than that of equivalent specialized ASIC designs. To tackle this issue, we propose aggressive undervolting, i.e., scaling the supply voltage below the nominal, safe value set by the manufacturer.
To ensure correct chip functionality under the worst-case fabrication process and extremely harsh environmental conditions (e.g., variations in temperature, humidity, or radiation), manufacturers often set the nominal voltage conservatively, leaving a voltage guardband between the nominal level and the minimum voltage at which the chip still operates correctly. By experimenting on several FPGA architectures from AMD/Xilinx, a major FPGA vendor, we found that this voltage guardband is large and overly conservative for real-world scenarios. Eliminating this voltage margin yields significant power savings without compromising performance or reliability. Nevertheless, undervolting further, below the voltage guardband, may introduce reliability issues: circuit delay increases as the voltage drops, which causes faults to appear in the underlying hardware components. Our study characterizes, in detail, the rates, locations, and types of these faults, as well as their sensitivity to temperature. In addition, to prevent the negative effects of these undervolting-related faults, we propose efficient fault mitigation techniques, e.g., application-specific intelligent data mapping, built-in Error Correction Code (ECC), and frequency underscaling. We validate our study on FPGA-based Convolutional Neural Networks (CNNs). As computationally intensive applications, CNNs can take full advantage of undervolting to achieve significant power savings. Further, CNNs are inherently robust against faults, which means that undervolting them below the voltage guardband can save even more power without sacrificing significant accuracy.
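As a rough illustration of why lowering the supply voltage saves power, the sketch below applies the textbook CMOS relation that dynamic power scales with V² · f. It is a simplified model, not the measurement methodology of the study: it ignores static (leakage) power, and the voltage values and function names are hypothetical placeholders rather than results from the talk.

```python
def dynamic_power_ratio(v_scaled, v_nominal, f_scaled=1.0, f_nominal=1.0):
    """Dynamic power at (v_scaled, f_scaled) relative to nominal operation.

    Uses the first-order CMOS model P_dyn ~ C * V^2 * f, so the switched
    capacitance C cancels out of the ratio. Static power is not modeled.
    """
    if v_nominal <= 0 or f_nominal <= 0:
        raise ValueError("nominal voltage and frequency must be positive")
    return (v_scaled / v_nominal) ** 2 * (f_scaled / f_nominal)


def dynamic_power_saving(v_scaled, v_nominal, f_scaled=1.0, f_nominal=1.0):
    """Fraction of dynamic power saved relative to nominal operation."""
    return 1.0 - dynamic_power_ratio(v_scaled, v_nominal, f_scaled, f_nominal)


if __name__ == "__main__":
    # Hypothetical values for illustration only (not data from the study).
    V_NOMINAL = 1.00  # assumed nominal supply voltage, in volts
    V_SCALED = 0.80   # assumed reduced supply voltage, in volts

    saving = dynamic_power_saving(V_SCALED, V_NOMINAL)
    print(f"Estimated dynamic power saving at fixed frequency: {saving:.0%}")

    # Frequency underscaling (one of the mitigation techniques named above)
    # lowers dynamic power further, at the cost of performance.
    saving_slow = dynamic_power_saving(V_SCALED, V_NOMINAL, f_scaled=0.5)
    print(f"With the clock frequency also halved: {saving_slow:.0%}")
```

Because the voltage term is squared, even a modest reduction in supply voltage gives a disproportionately large reduction in dynamic power, which is why removing an overly conservative guardband is attractive.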
In this presentation, we will comprehensively cover our findings on the undervolting of multiple components of FPGAs, i.e., SRAM-based on-chip memories (BRAMs), Look-Up Tables (LUTs), and DRAM-based High-Bandwidth Memories (HBMs). The talk will conclude with a discussion of open problems and possible directions for future research.
Behzad Salami holds bachelor's and master's degrees in Computer Engineering from Iran University of Science and Technology (IUST) and Amirkabir University of Technology (AUT), respectively. He received his Ph.D. (Hons.) in Computer Architecture from Universitat Politècnica de Catalunya (UPC) in 2018, followed by a postdoc at Barcelona Supercomputing Center (BSC). He has had research visits at the University of Manchester (UK) and the Institute for Research in Fundamental Sciences (Iran), and is currently working with the SAFARI Research Group remotely. He has collaborated with academia and industry worldwide on joint research projects and has also contributed to multiple EU-funded research projects as a researcher. He has received several grants and awards for his research, e.g., HiPEAC Paper Award, HiPEAC Collaboration Grant, Tetramax Technology Transfer Grant (led the project as PI), MSCA Seal of Excellence, I4MS-SAE Certificate of Excellence, Severo Ochoa Grant, Heidelberg Laureate Forum (HLF) Participation Grant as a Young Researcher, and OPRECOM Summer of Code Project Grant. His research interests are Reconfigurable Computing, Low-Power and Fault-Resilient Hardware Accelerators, and Near-Data Processing Systems.
 Behzad Salami, Erhan Baturay Onural, İsmail Emir Yüksel, Fahrettin Koc, Oguz Ergin, Adrián Cristal Kestelman, Osman Unsal, Hamid Sarbazi-Azad, and Onur Mutlu. “An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration.” In 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 138-149. IEEE, 2020.
 Seyed Saber Nabavi Larimi, Behzad Salami, Osman S. Unsal, Adrián Cristal Kestelman, Hamid Sarbazi-Azad, and Onur Mutlu. “Understanding Power Consumption and Reliability of High-Bandwidth Memory with Voltage Underscaling.” In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 517-522. IEEE, 2021.
 Behzad Salami, Osman S. Unsal, and Adrián Cristal Kestelman. “On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation.” In 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, 2018.
 Behzad Salami, Osman S. Unsal, and Adrián Cristal Kestelman. “Comprehensive Evaluation of Supply Voltage Underscaling in FPGA On-Chip Memories.” In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2018.