## Computer Architecture

Lecture 7: Computation in Memory II

Prof. Onur Mutlu
ETH Zürich
Fall 2019
10 October 2019

## Sub-Agenda: In-Memory Computation

- Major Trends Affecting Main Memory
- The Need for Intelligent Memory Controllers
- Bottom Up: Push from Circuits and Devices
- Top Down: Pull from Systems and Applications
- Processing in Memory: Two Directions
- Minimally Changing Memory Chips
- Exploiting 3D-Stacked Memory
- How to Enable Adoption of Processing in Memory
- Conclusion

## Processing in Memory: Two Approaches

- 1. Minimally changing memory chips
- 2. Exploiting 3D-stacked memory

## Recall: RowClone

Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Michael A. Kozuch, Phillip B. Gibbons, and Todd C. Mowry, "RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization," Proceedings of the 46th International Symposium on Microarchitecture (**MICRO**), Davis, CA, December 2013.

## RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

Vivek Seshadri (vseshadr@cs.cmu.edu), Yoongu Kim (yoongukim@cmu.edu), Chris Fallin\* (cfallin@c1f.net), Donghyuk Lee (donghyuk1@cmu.edu), Rachata Ausavarungnirun (rachata@cmu.edu), Gennady Pekhimenko (gpekhime@cs.cmu.edu), Yixin Luo (yixinluo@andrew.cmu.edu), Onur Mutlu (onur@cmu.edu), Phillip B. Gibbons† (phillip.b.gibbons@intel.com), Michael A. Kozuch† (michael.a.kozuch@intel.com), Todd C. Mowry (tcm@cs.cmu.edu)

Carnegie Mellon University, †Intel Pittsburgh

## Recall: End-to-End System Design

**Application**, **Operating System**, ISA, Microarchitecture, DRAM (RowClone)

- How to communicate occurrences of bulk copy/initialization across layers?
- How to ensure cache coherence?
- How to maximize latency and energy savings?
- How to handle data reuse?
## Memory as an Accelerator

Memory similar to a "conventional" accelerator

## In-Memory Bulk Bitwise Operations

- We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ
- At low cost
- Using analog computation capability of DRAM
- Idea: activating multiple rows performs computation
- 30-60X performance and energy improvement
- Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology," MICRO 2017.
- New memory technologies enable even more opportunities
- Memristors, resistive RAM, phase change mem, STT-MRAM, ...
- Can operate on data with minimal movement

## In-DRAM AND/OR: Triple Row Activation

## In-DRAM Bulk Bitwise AND/OR Operation

- BULKAND A, B $\rightarrow$ C
- Semantics: Perform a bitwise AND of two rows A and B and store the result in row C
- R0: reserved zero row; R1: reserved one row
- D1, D2, D3: designated rows for triple activation
- 1. RowClone A into D1
- 2. RowClone B into D2
- 3. RowClone R0 into D3
- 4. ACTIVATE D1, D2, D3
- 5. RowClone Result into C

## More on In-DRAM Bulk AND/OR

Vivek Seshadri, Kevin Hsieh, Amirali Boroumand, Donghyuk Lee, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry, "Fast Bulk Bitwise AND and OR in DRAM," IEEE Computer Architecture Letters (CAL), April 2015.

## Fast Bulk Bitwise AND and OR in DRAM

Vivek Seshadri\*, Kevin Hsieh\*, Amirali Boroumand\*, Donghyuk Lee\*, Michael A. Kozuch<sup>†</sup>, Onur Mutlu\*, Phillip B. Gibbons<sup>†</sup>, Todd C. Mowry\*

\*Carnegie Mellon University, <sup>†</sup>Intel Pittsburgh

## In-DRAM NOT: Dual Contact Cell

Figure 5: A dual-contact cell connected to both ends of a sense amplifier

Idea: Feed the negated value in the sense amplifier into a special row

Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology," MICRO 2017.
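The BULKAND sequence above can be mimicked in software to see why copying the zero row into D3 yields an AND. The following is a minimal sketch, not the hardware mechanism: it assumes, following the Ambit paper cited above, that activating three rows at once leaves the bitwise majority (MAJ) of the three rows, and it models each DRAM row as a Python integer. The names `ROW_BITS`, `triple_row_activate`, `bulk_and`, `bulk_or`, and `bulk_not` are hypothetical and chosen only for illustration.

```python
# Toy software model of in-DRAM bulk bitwise operations.
# Rows are modeled as Python ints; this is an illustration, not a DRAM simulator.

ROW_BITS = 16                   # hypothetical row width (real rows are far wider)
ALL_ONES = (1 << ROW_BITS) - 1

R0 = 0                          # reserved all-zeros row
R1 = ALL_ONES                   # reserved all-ones row


def triple_row_activate(d1: int, d2: int, d3: int) -> int:
    """Assumed behavior: each bit position settles to the majority of the three rows."""
    return (d1 & d2) | (d2 & d3) | (d1 & d3)


def bulk_and(a: int, b: int) -> int:
    d1, d2, d3 = a, b, R0                      # steps 1-3: RowClone A, B, R0 into D1, D2, D3
    result = triple_row_activate(d1, d2, d3)   # step 4: ACTIVATE D1, D2, D3
    return result                              # step 5: RowClone the result into C


def bulk_or(a: int, b: int) -> int:
    # Same sequence as bulk_and, but with the all-ones row as the control row.
    return triple_row_activate(a, b, R1)


def bulk_not(a: int) -> int:
    """Stand-in for the dual-contact-cell NOT: invert every bit of the row."""
    return ~a & ALL_ONES


if __name__ == "__main__":
    a, b = 0b1100101011110000, 0b1010011001011100
    print(f"{bulk_and(a, b):016b}")
    print(f"{bulk_or(a, b):016b}")
    print(f"{bulk_not(a):016b}")
```

With the control row set to 0, the majority of (a, b, 0) is 1 only where both a and b are 1; with the control row set to 1, it is 1 wherever either input is 1. Under the majority assumption, this is why one activation primitive can serve both AND and OR.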
## In-DRAM NOT Operation

Figure 5: Bitwise NOT using a dual contact capacitor

Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology," MICRO 2017.

## Performance: In-DRAM Bitwise Operations

Figure 9: Throughput of bitwise operations on various systems.

## Energy of In-DRAM Bitwise Operations

DRAM & channel energy (nJ/KB):

| Design | not | and/or | nand/nor | xor/xnor |
|----------------|-------|--------|----------|----------|
| DDR3 | 93.7 | 137.9 | 137.9 | 137.9 |
| Ambit | 1.6 | 3.2 | 4.0 | 5.5 |
| $(\downarrow)$ | 59.5X | 43.9X | 35.1X | 25.1X |

Table 3: Energy of bitwise operations. $(\downarrow)$ indicates energy reduction of Ambit over the traditional DDR3-based design.

Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology," MICRO 2017.

## Ambit vs. DDR3: Performance and Energy

- Performance Improvement
- Energy Reduction

## Bulk Bitwise Operations in Workloads

## Example Data Structure: Bitmap Index

- Alternative to B-tree and its variants
- Efficient for performing range queries and joins
- Many bitwise operations to perform a query

## Performance: Bitmap Index on Ambit

Figure 10: Bitmap index performance. The value above each bar indicates the reduction in execution time due to Ambit.

Result: >5.4-6.6X performance improvement

Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology," MICRO 2017.

## Performance: BitWeaving on Ambit

Figure 11: Speedup offered by Ambit over baseline CPU with SIMD for BitWeaving

Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology," MICRO 2017.
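To make the "many bitwise operations to perform a query" point concrete, here is a small sketch of a bitmap-index query as it would run on a CPU. The data, the bitmap names, and the queries are invented for illustration; each bitmap is a Python integer whose bit i is set when user i satisfies a predicate. These are the kinds of bulk AND/OR operations that a design like Ambit aims to execute inside DRAM instead of moving the bitmaps to the processor.

```python
from functools import reduce

# Hypothetical bitmap index over 8 users: bit i describes user i.
male = 0b10110101
active_in_week = [            # one bitmap per week
    0b11110111,
    0b10111101,
    0b11100111,
]


def popcount(bitmap: int) -> int:
    """Number of set bits, i.e. number of users matching the predicate."""
    return bin(bitmap).count("1")


# Users active in every week: AND across all weekly bitmaps.
active_every_week = reduce(lambda x, y: x & y, active_in_week)

# Male users active in every week: one more AND.
male_active_every_week = active_every_week & male

# Users active in at least one week: OR across all weekly bitmaps.
active_any_week = reduce(lambda x, y: x | y, active_in_week)

if __name__ == "__main__":
    print(popcount(active_every_week))
    print(popcount(male_active_every_week))
    print(popcount(active_any_week))
```

The number of bulk bitwise operations grows with the number of weeks and predicates in the query, while the final count is a single reduction, so the bitwise part dominates the data movement for long bitmaps.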
## More on In-DRAM Bitwise Operations

Vivek Seshadri et al., "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology," MICRO 2017.

Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology

Vivek Seshadri<sup>1,5</sup>, Donghyuk Lee<sup>2,5</sup>, Thomas Mullins<sup>3,5</sup>, Hasan Hassan<sup>4</sup>, Amirali Boroumand<sup>5</sup>, Jeremie Kim<sup>4,5</sup>, Michael A. Kozuch<sup>3</sup>, Onur Mutlu<sup>4,5</sup>, Phillip B. Gibbons<sup>5</sup>, Todd C. Mowry<sup>5</sup>

<sup>1</sup>Microsoft Research India, <sup>2</sup>NVIDIA Research, <sup>3</sup>Intel, <sup>4</sup>ETH Zürich, <sup>5</sup>Carnegie Mellon University

## More on In-DRAM Bulk Bitwise Execution

Vivek Seshadri and Onur Mutlu, "In-DRAM Bulk Bitwise Execution Engine," Invited Book Chapter in Advances in Computers, to appear in 2020. [Preliminary arXiv version]

## In-DRAM Bulk Bitwise Execution Engine

Vivek Seshadri, Microsoft Research India, visesha@microsoft.com
Onur Mutlu, ETH Zürich, onur.mutlu@inf.ethz.ch

## Challenge: Intelligent Memory Device

# Does memory have to be dumb?

## Challenge and Opportunity for Future

# Computing Architectures with Minimal Data Movement

## A Detour on the Review Process

## Ambit Sounds Good, No?

## **Review from ISCA 2016**

## **Paper summary**

The paper proposes to extend DRAM to include bulk, bit-wise logical operations directly between rows within the DRAM.

## **Strengths**

- Very clever/novel idea.
- Great potential speedup and efficiency gains.

### Weaknesses

- Probably won't ever be built. Not practical to assume DRAM manufacturers with change DRAM in this way.
## Another Review

## **Another Review from ISCA 2016**

## **Strengths**

The proposed mechanisms effectively exploit the operation of the DRAM to perform efficient bitwise operations across entire rows of the DRAM.

## Weaknesses

This requires a modification to the DRAM that will only help this type of bitwise operation. It seems unlikely that something like that will be adopted.

## Yet Another Review

## **Yet Another Review from ISCA 2016**

## Weaknesses

The core novelty of Buddy RAM is almost all circuits-related (by exploiting sense amps). I do not find architectural innovation even though the circuits technique benefits architecturally by mitigating memory bandwidth and relieving cache resources within a subarray. The only related part is the new ISA support for bitwise operations at DRAM side and its induced issue on cache coherence.

## The Reviewer Accountability Problem

## **Acknowledgments**

We thank the reviewers of ISCA 2016/2017, MICRO 2016/2017, and HPCA 2017 for their valuable comments. We

## We Have a Mindset Issue...

- There are many other similar examples from reviews...
- For many other papers...
- And, we are not even talking about JEDEC yet...
- How do we fix the mindset problem?
- By doing more research, education, implementation in alternative processing paradigms

## We need to work on enabling the better future...

## Aside: A Recommended Book

Raj Jain, "The Art of Computer Systems Performance Analysis," Wiley, 1991.

### DECISION MAKER'S GAMES

Even if the performance analysis is correctly done and presented, it may not be enough to persuade your audience (the decision makers) to follow your recommendations. The list shown in Box 10.2 is a compilation of reasons for rejection heard at various performance analysis presentations. You can use the list by presenting it immediately and pointing out that the reason for rejection is not new and that the analysis deserves more consideration.
Also, the list is helpful in getting the competing proposals rejected!

There is no clear end of an analysis. Any analysis can be rejected simply on the grounds that the problem needs more analysis. This is the first reason listed in Box 10.2. The second most common reason for rejection of an analysis and for endless debate is the workload. Since workloads are always based on the past measurements, their applicability to the current or future environment can always be questioned. Actually workload is one of the four areas of discussion that lead a performance presentation into an endless debate. These "rat holes" and their relative sizes in terms of time consumed are shown in Figure 10.26. Presenting this cartoon at the beginning of a presentation helps to avoid these areas.

Raj Jain, "The Art of Computer Systems Performance Analysis," Wiley, 1991.

FIGURE 10.26 Four issues in performance presentations that commonly lead to endless discussion.

## Box 10.2 Reasons for Not Accepting the Results of an Analysis

- 1. This needs more analysis.
- 2. You need a better understanding of the workload.
- 3. It improves performance only for long I/O's, packets, jobs, and files, and most of the I/O's, packets, jobs, and files are short.
- 4. It improves performance only for short I/O's, packets, jobs, and files, but who cares for the performance of short I/O's, packets, jobs, and files; it's the long ones that impact the system.
- 5. It needs too much memory/CPU/bandwidth and memory/CPU/bandwidth isn't free.
- 6. It only saves us memory/CPU/bandwidth and memory/CPU/bandwidth is cheap.
- 7. There is no point in making the networks (similarly, CPUs/disks/...) faster; our CPUs/disks (any component other than the one being discussed) aren't fast enough to use them.
- 8. It improves the performance by a factor of x, but it doesn't really matter at the user level because everything else is so slow.
- 9. It is going to increase the complexity and cost.
- 10. Let us keep it simple stupid (and your idea is not stupid).
- 11. It is not simple. (Simplicity is in the eyes of the beholder.)
- 12. It requires too much state.
- 13. Nobody has ever done that before. (You have a new idea.)
- 14. It is not going to raise the price of our stock by even an eighth. (Nothing ever does, except rumors.)
- 15. This will violate the IEEE, ANSI, CCITT, or ISO standard.
- 16. It may violate some future standard.
- 17. The standard says nothing about this and so it must not be important.
- 18. Our competitors don't do it. If it was a good idea, they would have done it.
- 19. Our competition does it this way and you don't make money by copying others.
- 20. It will introduce randomness into the system and make debugging difficult.
- 21. It is too deterministic; it may lead the system into a cycle.
- 22. It's not interoperable.
- 23. This impacts hardware.
- 24. That's beyond today's technology.
- 26. Why change - it's working OK.

Raj Jain, "The Art of Computer Systems Performance Analysis," Wiley, 1991.

## Suggestion to Community

## We Need to Fix the Reviewer Accountability Problem

## Main Memory Needs Intelligent Controllers

## Research Community Needs Accountable Reviewers

#### Suggestions to Reviewers

- Be fair; you do not know it all
- Be open-minded; you do not know it all
- Be accepting of diverse research methods: there is no single way of doing research
- Be constructive, not destructive
- Do not have double standards...
#### Do not block or delay scientific progress for non-reasons

#### RowClone & Bitwise Ops in Real DRAM Chips

## ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs

Fei Gao (feig@princeton.edu), Georgios Tziantzioulis (georgios.tziantzioulis@princeton.edu), David Wentzlaff (wentzlaf@princeton.edu)

Department of Electrical Engineering, Princeton University

#### Pinatubo: RowClone and Bitwise Ops in PCM

# Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-volatile Memories

Shuangchen Li<sup>1</sup>\*, Cong Xu<sup>2</sup>, Qiaosha Zou<sup>1,5</sup>, Jishen Zhao<sup>3</sup>, Yu Lu<sup>4</sup>, and Yuan Xie<sup>1</sup>

University of California, Santa Barbara<sup>1</sup>, Hewlett Packard Labs<sup>2</sup>, University of California, Santa Cruz<sup>3</sup>, Qualcomm Inc.<sup>4</sup>, Huawei Technologies Inc.<sup>5</sup>

{shuangchenli, yuanxie}@ece.ucsb.edu<sup>1</sup>

# Other Examples of "Why Change? It's Working OK!"

#### Mindset Issues Are Everywhere

- "Why Change? It's Working OK!" mindset limits progress
- There are many such examples in real life
- Examples of Bandwidth Waste in Real Life
- Examples of Latency and Queueing Delays in Real Life
- Example of Where to Build a Bridge on the Road

# Another Example

#### Initial RowHammer Reviews

# Disturbance Errors in DRAM: Demonstration, Characterization, and Prevention

Rejected (R2) 863kB Friday 31 May 2013 2:00:53pm PDT b9bf06021da54cddf4cd0b3565558a181868b972

You are an author of this paper.
| | OveMer | Nov | WriQua | RevExp |
|-------------|--------|-----|--------|--------|
| Review #66A | 1 | 4 | 4 | 4 |
| Review #66B | 5 | 4 | 5 | 3 |
| Review #66C | 2 | 3 | 5 | 4 |
| Review #66D | 1 | 2 | 3 | 4 |
| Review #66E | 4 | 4 | 4 | 3 |
| Review #66F | 2 | 4 | 4 | 3 |

# Missing the Point

Reviews from MICRO 2013

#### PAPER WEAKNESSES

This is an excellent test methodology paper, but there is no micro-architectural or architectural content.

#### PAPER WEAKNESSES

- Whereas they show disturbance may happen in DRAM array, authors don't show it can be an issue in realistic DRAM usage scenario
- Lacks architectural/microarchitectural impact on the DRAM disturbance analysis

#### PAPER WEAKNESSES

The mechanism investigated by the authors is one of many well known disturb mechanisms. The paper does not discuss the root causes to sufficient depth and the importance of this mechanism compared to others. Overall the length of the sections restating known information is much too long in relation to new work.
### Experimental DRAM Testing Infrastructure

# Tested DRAM Modules (129 total)

[Table: per-module data for DDR3 modules from three manufacturers: A (A<sub>1</sub> to A<sub>43</sub>, total of 43 modules), B (B<sub>1</sub> to B<sub>54</sub>, total of 54 modules), and C (C<sub>1</sub> to C<sub>32</sub>, total of 32 modules). Columns: Manufacturer; Module; Date\* (yy-ww); Timing† (Freq in MT/s, t<sub>RC</sub> in ns); Organization (Size in GB, Chips); Chip (Size in Gb‡, Pins, Die Version§); Victims-per-Module (Average, Minimum, Maximum); RI<sub>th</sub> in ms (Min).]

\* We report the manufacture date marked on the chip packages, which is more accurate than other dates that can be gleaned from a module.

† We report timing constraints stored in the module's on-board ROM [33], which is read by the system BIOS to calibrate the memory controller.

‡ The maximum DRAM chip size supported by our testing platform is 2Gb.

§ We report DRAM die versions marked on the chip packages, which typically progress in the following manner: $\mathcal{M} \to \mathcal{A} \to \mathcal{B} \to \mathcal{C} \to \cdots$.

Table 3.
Sample population of 129 DDR3 DRAM modules, categorized by manufacturer and sorted by manufacture date

#### Fast Forward 6 Months

### More Reviews...

Reviews from ISCA 2014

#### PAPER WEAKNESSES

- 1) The disturbance error (a.k.a coupling or cross-talk noise induced error) is a known problem to the DRAM circuit community.
- 2) What you demonstrated in this paper is so called DRAM row hammering issue you can even find a Youtube video showing this! http://www.youtube.com/watch?v=i3-gQSnBcdo
- 3) The architectural contribution of this study is too insignificant.

#### PAPER WEAKNESSES

- Row Hammering appears to be well-known, and solutions have already been proposed by industry to address the issue.
- The paper only provides a qualitative analysis of solutions to the problem. A more robust evaluation is really needed to know whether the proposed solution is necessary.

#### Final RowHammer Reviews

#### Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors

Accepted 639kB 21 Nov 2013 10:53:11pm CST | f039be2735313b39304ae1c6296523867a485610

You are an **author** of this paper.

| | OveMer | Nov | WriQua | RevConAnd |
|-------------|--------|-----|--------|-----------|
| Review #41A | 8 | 4 | 5 | 3 |
| Review #41B | 7 | 4 | 4 | 3 |
| Review #41C | 6 | 4 | 4 | 3 |
| Review #41D | 2 | 2 | 5 | 4 |
| Review #41E | 3 | 2 | 3 | 3 |
| Review #41F | 7 | 4 | 4 | 3 |

### RowHammer: Hindsight & Impact (I)

#### Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors

Abstract. Memory isolation is a key property of a reliable and secure computing system: an access to one memory address should not have unintended side effects on data stored in other addresses.
However, as DRAM process technology

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)

# Project Zero

News and updates from the Project Zero team at Google

Monday, March 9, 2015

Exploiting the DRAM rowhammer bug to gain kernel privileges (Seaborn, 2015)

### RowHammer: Hindsight & Impact (II)

Onur Mutlu and Jeremie Kim, "RowHammer: A Retrospective," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (**TCAD**) Special Issue on Top Picks in Hardware and Embedded Security, 2019. [Preliminary arXiv version]

### RowHammer: A Retrospective

Onur Mutlu<sup>§‡</sup>, Jeremie S. Kim<sup>‡§</sup>

<sup>§</sup>ETH Zürich, <sup>‡</sup>Carnegie Mellon University

Suggestion to Researchers: Principle: Passion

# Follow Your Passion (Do not get derailed by naysayers)

Suggestion to Researchers: Principle: Resilience

# Be Resilient

Principle: Learning and Scholarship

# Focus on learning and scholarship

Principle: Learning and Scholarship

# The quality of your work defines your impact

#### Sub-Agenda: In-Memory Computation

- Major Trends Affecting Main Memory
- The Need for Intelligent Memory Controllers
- Bottom Up: Push from Circuits and Devices
- Top Down: Pull from Systems and Applications
- Processing in Memory: Two Directions
- Minimally Changing Memory Chips
- Exploiting 3D-Stacked Memory
- How to Enable Adoption of Processing in Memory
- Conclusion

# We Need to Think Differently from the Past Approaches

### Memory as an Accelerator

Memory similar to a "conventional" accelerator

# Processing in Memory: Two Approaches

- 1. Minimally changing memory chips
- 2. Exploiting 3D-stacked memory

#### Opportunity: 3D-Stacked Logic+Memory

#### DRAM Landscape (circa 2015)

| Segment | DRAM Standards & Architectures |
|-------------|--------------------------------|
| Commodity | DDR3 (2007) [14]; DDR4 (2012) [18] |
| Low-Power | LPDDR3 (2012) [17]; LPDDR4 (2014) [20] |
| Graphics | GDDR5 (2009) [15] |
| Performance | eDRAM [28], [32]; RLDRAM3 (2011) [29] |
| 3D-Stacked | WIO (2011) [16]; WIO2 (2014) [21]; MCDRAM (2015) [13]; HBM (2013) [19]; HMC1.0 (2013) [10]; HMC1.1 (2014) [11] |
| Academic | SBA/SSA (2010) [38]; Staged Reads (2012) [8]; RAIDR (2012) [27]; SALP (2012) [24]; TL-DRAM (2013) [26]; RowClone (2013) [37]; Half-DRAM (2014) [39]; Row-Buffer Decoupling (2014) [33]; SARP (2014) [6]; AL-DRAM (2015) [25] |

Table 1. Landscape of DRAM-based memory

Kim+, "Ramulator: A Flexible and Extensible DRAM Simulator", IEEE CAL 2015.

#### Several Questions in 3D-Stacked PIM

- What are the performance and energy benefits of using 3D-stacked memory as a coarse-grained accelerator?
- By changing the entire system
- By performing simple function offloading
- What is the minimal processing-in-memory support we can provide?
- With minimal changes to system and programming

#### Another Example: In-Memory Graph Processing

Large graphs are everywhere (circa 2015)

- 36 Million Wikipedia Pages
- 1.4 Billion Facebook Users
- 300 Million Twitter Users
- 30 Billion Instagram Photos

Scalable large-scale graph processing is challenging

#### Key Bottlenecks in Graph Processing

```
for (v: graph.vertices) {
  for (w: v.successors) {
    w.next_rank += weight * v.rank;
  }
}
```

- 1. Frequent random memory accesses
- 2. Little amount of computation

#### Tesseract System for Graph Processing

Interconnected set of 3D-stacked memory+logic chips with simple cores

#### Communications In Tesseract (I)

```
for (v: graph.vertices) {
  for (w: v.successors) {
    w.next_rank += weight * v.rank;
  }
}
```

#### Communications In Tesseract (II)

```
for (v: graph.vertices) {
  for (w: v.successors) {
    w.next_rank += weight * v.rank;
  }
}
```

#### Communications In Tesseract (III)

```
for (v: graph.vertices) {
  for (w: v.successors) {
    // Non-blocking remote function call:
    // can be delayed until the nearest barrier
    put(w.id, function() { w.next_rank += weight * v.rank; });
  }
}
barrier();
```

#### Remote Function Call (Non-Blocking)

- 1. Send function address & args to the remote core
- 2. Store the incoming message to the message queue
- 3. Flush the message queue when it is full or a synchronization barrier is reached

`put(w.id, function() { w.next_rank += value; })`

### Tesseract System for Graph Processing

#### Evaluated Systems

#### Tesseract Graph Processing Performance

#### Effect of Bandwidth & Programming Model

#### Tesseract Graph Processing System Energy

Ahn+, "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing" ISCA 2015.
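The non-blocking remote function call described above (send, enqueue, flush when the queue is full or at a barrier) can be sketched in a few lines of software. This is a toy, single-threaded model under stated assumptions, not the Tesseract hardware or its real API: the class `ToyTesseract`, the modulo-based vertex-to-vault mapping in `owner`, the `queue_capacity` parameter, and the helper `propagate_ranks` are all hypothetical names introduced for illustration.

```python
from collections import defaultdict, deque


class Vertex:
    def __init__(self, vid, rank):
        self.id = vid
        self.rank = rank
        self.next_rank = 0.0
        self.successors = []


class ToyTesseract:
    """Toy model: each vault owns some vertices and has one message queue."""

    def __init__(self, num_vaults, queue_capacity=4):
        self.num_vaults = num_vaults
        self.queue_capacity = queue_capacity
        self.queues = defaultdict(deque)

    def owner(self, vertex_id):
        # Assumed partitioning: vertices are spread across vaults by id.
        return vertex_id % self.num_vaults

    def put(self, vertex_id, func):
        # Steps 1-2: "send" the function to the owning vault and enqueue it.
        vault = self.owner(vertex_id)
        self.queues[vault].append(func)
        # Step 3 (first case): flush when the queue is full.
        if len(self.queues[vault]) >= self.queue_capacity:
            self._flush(vault)

    def _flush(self, vault):
        queue = self.queues[vault]
        while queue:
            queue.popleft()()   # execute the queued update at the owning vault

    def barrier(self):
        # Step 3 (second case): a barrier drains every queue.
        for vault in list(self.queues):
            self._flush(vault)


def propagate_ranks(vertices, system, weight):
    """The loop from 'Communications In Tesseract (III)', using put() and barrier()."""
    for v in vertices:
        for w in v.successors:
            contribution = weight * v.rank
            # Default arguments bind w and contribution at enqueue time.
            system.put(
                w.id,
                lambda w=w, c=contribution: setattr(w, "next_rank", w.next_rank + c),
            )
    system.barrier()
```

The caller never waits for an individual update to finish; it only relies on all updates being visible after `barrier()`. That is the property that lets updates be batched and executed next to the data that they modify.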
#### Tesseract: Advantages & Disadvantages

#### Advantages

- Specialized graph processing accelerator using PIM
- Large system performance and energy benefits
- Takes advantage of 3D stacking for an important workload
- More general than just graph processing

#### Disadvantages

- Changes a lot in the system
- New programming model
- Specialized Tesseract cores for graph processing
- Cost
- Scalability limited by off-chip links or graph partitioning

#### More on Tesseract

Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi, "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing," Proceedings of the 42nd International Symposium on Computer Architecture (**ISCA**), Portland, OR, June 2015.

#### A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Junwhan Ahn, Sungpack Hong<sup>§</sup>, Sungjoo Yoo, Onur Mutlu<sup>†</sup>, Kiyoung Choi

junwhan@snu.ac.kr, sungpack.hong@oracle.com, sungjoo.yoo@gmail.com, onur@cmu.edu, kchoi@snu.ac.kr

Seoul National University, <sup>§</sup>Oracle Labs, <sup>†</sup>Carnegie Mellon University