Bachelor’s Seminar in Computer Architecture

- Fall 2023 (offered every Fall and Spring Semester)
- 2 credit units

- Rigorous seminar on fundamental and cutting-edge topics in computer architecture

- Critical paper presentation, review, and discussion of seminal and cutting-edge works in computer architecture
  - We will cover many ideas & issues, analyze their tradeoffs, perform critical thinking and brainstorming

- Participation, presentation, synthesis report, lots of discussion

- You can register for the course online

  [https://safari.ethz.ch/architecture_seminar](https://safari.ethz.ch/architecture_seminar)
<table>
<thead>
<tr>
<th>Week</th>
<th>Date</th>
<th>Livestream</th>
<th>Lecture</th>
<th>Readings</th>
<th>Assignments</th>
</tr>
</thead>
<tbody>
<tr>
<td>W1</td>
<td>23.08 Thu.</td>
<td>Live</td>
<td>L1a: Course Logistics</td>
<td>Suggested</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>L1b: Introduction and Basics</td>
<td>(PDF) (PPT)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>L1c: Architectural Design Fundamentals</td>
<td>(PDF) (PPT)</td>
<td></td>
</tr>
<tr>
<td>W2</td>
<td>30.09 Thu.</td>
<td>Live</td>
<td>L2: GateKeeper</td>
<td>Suggested</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PDF) (PPT)</td>
<td></td>
</tr>
<tr>
<td>W3</td>
<td>07.10 Thu.</td>
<td>Live</td>
<td>L3: RowClone (Processing using DRAM)</td>
<td>Suggested</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PDF) (PPT)</td>
<td></td>
</tr>
<tr>
<td>W4</td>
<td>14.10 Thu.</td>
<td>Live</td>
<td>L4: Memory Channel Partitioning</td>
<td>Suggested</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PDF) (PPT)</td>
<td></td>
</tr>
<tr>
<td>W5</td>
<td>4.11 Thu.</td>
<td>Live</td>
<td>S1.1: Bottleneck Identification and Scheduling in Multithreaded Applications</td>
<td>Mentioned</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PDF) (PPT)</td>
<td></td>
</tr>
<tr>
<td>W6</td>
<td>11.11 Thu.</td>
<td>Live</td>
<td>S2.1: Profiling a Warehouse-Scale Computer</td>
<td>Mentioned</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PDF) (PPT)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>S2.2: Understanding Sources of Inefficiency in General-Purpose Chips</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PDF) (ODP) (PPT)</td>
<td></td>
</tr>
<tr>
<td>W7</td>
<td>18.11 Thu.</td>
<td>Live</td>
<td>S3.1: Python: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning, MICRO2021</td>
<td>Mentioned</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PDF) (PPT)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>S3.2: Branch Runahead: An Alternative to Branch Prediction for Impossible to Predict Branches, MICRO 2021</td>
<td>Mentioned</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PDF) (PPT)</td>
<td></td>
</tr>
<tr>
<td>W8</td>
<td>25.11 Thu.</td>
<td>Live</td>
<td>S4.1: Very Long Instruction Word Architectures and the ELI-512, ISCA1983</td>
<td>Mentioned</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PDF) (KEY)</td>
<td></td>
</tr>
<tr>
<td>W9</td>
<td>02.12 Thu.</td>
<td>Live</td>
<td>S5.1: Quantifying Server Memory Frequency Margin and Using It to Improve Performance in HPC Systems, ISCA 2021</td>
<td>Mentioned</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PDF) (PPT)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>S5.2: SIMDRAM: An End-to-End Framework for Bit-Serial SIMD Computing in DRAM, ASPLOS 2021</td>
<td>Mentioned</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PDF) (KEY)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PDF)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>S6.2: BlockHammer: Preventing RowHammer at Low Cost by Blacklisting Rapidly-Accessed DRAM Rows, HPCA 2021</td>
<td>Mentioned</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PDF) (PPT)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PPT) (PDF)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PPT) (PDF)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PPT) (PDF)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>S8.2: SquiggleFilter: An Accelerator for Portable Virus Detection, MICRO 2021</td>
<td>Mentioned</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PDF)</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>S8.3: Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks, ASPLOS 2018</td>
<td>Mentioned</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(PPT) (PDF)</td>
<td></td>
</tr>
</tbody>
</table>
### Spring 2022 Lectures/Schedule

<table>
<thead>
<tr>
<th>Week</th>
<th>Date</th>
<th>Livestream</th>
<th>Lecture</th>
<th>Readings</th>
<th>Assignments</th>
</tr>
</thead>
<tbody>
<tr>
<td>W1</td>
<td>24.02 Thu.</td>
<td><img src="https://example.com" alt="Livestream" /></td>
<td>L1a: Course Logistics</td>
<td>Suggested</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>L1b: Introduction and Basics</td>
<td>Suggested</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>L1c: Architectural Design Fundamentals</td>
<td>Suggested</td>
<td></td>
</tr>
<tr>
<td>W2</td>
<td>03.03 Thu.</td>
<td><img src="https://example.com" alt="Livestream" /></td>
<td>L2: Memory-Centric Computing</td>
<td>Suggested</td>
<td></td>
</tr>
<tr>
<td>W3</td>
<td>10.03 Thu.</td>
<td><img src="https://example.com" alt="Livestream" /></td>
<td>L3: Memory-Centric Computing II</td>
<td>Suggested</td>
<td></td>
</tr>
<tr>
<td>W4</td>
<td>17.03 Thu.</td>
<td><img src="https://example.com" alt="Livestream" /></td>
<td>L4: Memory-Centric Computing III</td>
<td>Suggested</td>
<td></td>
</tr>
<tr>
<td>W5</td>
<td>24.03 Thu.</td>
<td><img src="https://example.com" alt="Livestream" /></td>
<td>L5: Accelerating Genome Analysis</td>
<td>Suggested</td>
<td></td>
</tr>
<tr>
<td>W6</td>
<td>31.03 Thu.</td>
<td><img src="https://example.com" alt="Livestream" /></td>
<td>L6a: Rethinking Virtual Memory I</td>
<td>Suggested</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>L6b: Rethinking Virtual Memory II</td>
<td>Suggested</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>L7b: SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-Memory Systems, MICRO 2021</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Research Opportunities

If you are interested in learning more and doing research in Computer Architecture, three suggestions:

- Email me with your interest (CC: Mohammad, Juan)
- Take the seminar course and the “Computer Architecture” course
- Do readings and assignments on your own & talk with us

There are many exciting projects and research positions, e.g.:

- Novel memory/storage/computation/communication systems
- New execution paradigms (e.g., in-memory computing)
- Hardware security, safety, reliability, predictability
- GPUs, TPUs, FPGAs, PIM, heterogeneous systems, ...
- Security-architecture-reliability-energy-performance interactions
- Architectures for genomics/proteomics/medical/health/AI/ML
- A limited list is here: https://safari.ethz.ch/theses/

https://people.inf.ethz.ch/omutlu/projects.htm
Think BIG, Aim HIGH!

https://safari.ethz.ch
Onur Mutlu’s SAFARI Research Group

Computer architecture, HW/SW, systems, bioinformatics, security, memory

https://safari.ethz.ch/safari-newsletter-january-2021/

Think BIG, Aim HIGH!

https://safari.ethz.ch
SAFARI PhD and Post-Doc Alumni

https://safari.ethz.ch/safari-alumni/

- Hasan Hassan (Rivos), EDAA Outstanding Dissertation Award 2023; S&P 2020 Best Paper Award, 2020 Pwnie Award, IEEE Micro TP HM 2020
- Christina Giannoula (Univ. of Toronto)
- Minesh Patel (ETH Zurich), DSN Carter Award for Best Thesis 2022; ETH Medal 2023; MICRO’20 & DSN’20 Best Paper Awards; ISCA HoF 2021
- Damla Senol Cali (Bionano Genomics), SRC TECHCON 2019 Best Student Presentation Award; RECOMB-Seq 2018 Best Poster Award
- Nastaran Hajinazar (Intel)
- Gagandeep Singh (AMD/Xilinx), FPL 2020 Best Paper Award Finalist
- Amirali Boroumand (Stanford Univ \rightarrow Google), SRC TECHCON 2018 Best Presentation Award
- Jeremie Kim (Apple), EDAA Outstanding Dissertation Award 2020; IEEE Micro Top Picks 2019; ISCA/MICRO HoF 2021
- Nandita Vijaykumar (Univ. of Toronto, Assistant Professor), ISCA Hall of Fame 2021
- Kevin Hsieh (Microsoft Research, Senior Researcher)
- Justin Meza (Facebook), HiPEAC 2015 Best Student Presentation Award; ICCD 2012 Best Paper Award
- Mohammed Alser (ETH Zurich), IEEE Turkey Best PhD Thesis Award 2018
- Yixin Luo (Google), HPCA 2015 Best Paper Session
- Kevin Chang (Facebook), SRC TECHCON 2016 Best Student Presentation Award
- Rachata Ausavarungnirun (KMUNTB, Assistant Professor), NOCS 2015 and NOCS 2012 Best Paper Award Finalist
- Gennady Pekhimenko (Univ. of Toronto, Assistant Professor), ISCA Hall of Fame 2021; ASPLOS 2015 SRC Winner
- Vivek Seshadri (Microsoft Research)
- Donghyuk Lee (NVIDIA Research, Senior Researcher), HPCA Hall of Fame 2018
- Yoongu Kim (Software Robotics \rightarrow Google), TCAD’19 Top Pick Award; IEEE Micro Top Picks’10; HPCA’10 Best Paper Session
- Lavanya Subramanian (Intel Labs \rightarrow Facebook)
- Samira Khan (Univ. of Virginia, Assistant Professor), HPCA 2014 Best Paper Session
- Saugata Ghose (Univ. of Illinois, Assistant Professor), HPCA 2015 Best Paper Session, DFRWS-EU 2017 Best Paper Award
- Jawad Haj-Yahya (Huawei Research Zurich, Principal Researcher)
- Lois Orosa (Galicia Supercomputing Center, Director)
- Jisung Park (POSTECH, Assistant Professor)
- Gagandeep Singh (AMD/Xilinx, Researcher), FPL 2020 Best Paper Award Finalist
You Can Join Us!

- [https://safari.ethz.ch/apply/](https://safari.ethz.ch/apply/)

SAFARI Researcher Applications

Sign in

This is the application submission site to be considered for being a researcher in the [SAFARI Research Group](https://safari.ethz.ch/apply/), directed by [Professor Onur Mutlu](https://safari.ethz.ch/apply/) ([Publications and Teaching](https://safari.ethz.ch/apply/)).

If you are interested in doing research in the [SAFARI Research Group](https://safari.ethz.ch/apply/), please make sure you apply through this submissions site and supply as many of the requested documents and information as possible. Please read and follow the provided instructions and submit as complete an application as possible (given the position you are applying for).

We suggest studying the following materials before submission:
[SAFARI Publications and Courses](https://safari.ethz.ch/apply/)
[Onur Mutlu's Online Lectures and Course Materials](https://safari.ethz.ch/apply/)

We strongly recommend that you read and analyze critically as many recent papers from our group as possible. This is the best way to prepare for the application process. Our recommendation is that you use professor Mutlu's methodology for critically analyzing papers.

[Guide On Reviewing Papers](https://safari.ethz.ch/apply/)

Good luck!

Welcome to the SAFARI at ETH Zurich -- PhD, Postdoc, Internship, Visiting Researcher Applications ([SAFARI Researcher Applications](https://safari.ethz.ch/apply/)) submissions site.
Approaches to (Instruction-Level) Concurrency

- Pipelining
- Fine-Grained Multithreading
- Out-of-order Execution
- Dataflow (at the ISA level)
- Superscalar Execution
- VLIW
- Systolic Arrays
- Decoupled Access Execute
- SIMD Processing (Vector and array processors, GPUs)
Readings for Today

- **Required**

- **Recommended**
Readings for Next Week

- **Required**

- **Recommended**
Systolic Arrays
Systolic Arrays: Motivation

- Goal: design an accelerator that has
  - Simple, regular design (keep # unique parts small and regular)
  - High concurrency $\rightarrow$ high performance
  - Balanced computation and I/O (memory) bandwidth

- Idea: Replace a single processing element (PE) with a regular array of PEs and carefully orchestrate flow of data between the PEs
  - such that they collectively transform a piece of input data before outputting it to memory

- Benefit: Maximizes computation done on a single piece of data element brought from memory
Systolic Arrays


Figure 1. Basic principle of a systolic system.

Memory: heart
Data: blood
PEs: cells
Memory pulses data through PEs
Why Systolic Architectures?

- **Idea:** Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory

- Similar to **blood flow:** heart → many cells → heart
  - Different cells “process” the blood
  - Many veins operate simultaneously
  - Can be many-dimensional

- **Why?** Special purpose accelerators/architectures need
  - **Simple, regular design** (keep # unique parts small and regular)
  - **High concurrency** → high performance
  - **Balanced computation and I/O (memory) bandwidth**
Systolic Architectures

- Basic principle: Replace a single PE with a regular array of PEs and carefully orchestrate flow of data between the PEs
  - Balance computation and memory bandwidth

- Differences from pipelining:
  - These are individual PEs
  - Array structure can be non-linear and multi-dimensional
  - PE connections can be multidirectional (and different speed)
  - PEs can have local memory and execute kernels (rather than a piece of the instruction)
Systolic Computation Example

- **Convolution**
  - Used in filtering, pattern matching, correlation, polynomial evaluation, etc ...
  - Many image processing tasks
  - **Machine learning**: up to hundreds of convolutional layers in Convolutional Neural Networks (CNN)

\[
\text{Given the sequence of weights } \{w_1, w_2, \ldots, w_k\}, \text{ and the input sequence } \{x_1, x_2, \ldots, x_n\}, \text{ compute the result sequence } \{y_1, y_2, \ldots, y_{n+1-k}\} \text{ defined by } y_i = w_1 x_i + w_2 x_{i+1} + \ldots + w_k x_{i+k-1} \]
LeNet-5, a Convolutional Neural Network for Hand-Written Digit Recognition

This is a 1024*8 bit input, which will have a truth table of $2^{8196}$ entries.
An Example of 2D Convolution

**Structure information**
- Input: 5*5 (blue)
- Kernel (filter): 3*3 (grey)
- Output: 5*5 (green)

**Computation information**
- Stride: 1
- Padding: 1 (white)

Output Dim = (Input + 2*Padding - Kernel) / Stride + 1
An Example of 2D Convolution

Input Layer

CNN kernel

Output Layer
Convolutional Neural Networks are designed to recognize visual patterns directly from pixel images with minimal preprocessing. They can recognize patterns with extreme variability (such as handwritten characters), and with robustness to distortions and simple geometric transformations.

LeNet-5 is our latest convolutional network designed for handwritten and machine-printed character recognition. Here is an example of LeNet-5 in action.

Many more examples are available in the column on the left:

Several papers on LeNet and convolutional networks are available on my publication page:

[LeCun et al., 1998]
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.

[Bottou et al., 1997]
L. Bottou, Y. LeCun, and Y. Bengio. Global training of

http://yann.lecun.com/exdb/lenet/index.html
Implementing a Convolutional Layer with Matrix Multiplication

\[
\begin{bmatrix}
1 & 1 & 2 & 2 \\
1 & 1 & 2 & 0
\end{bmatrix}
\times
\begin{bmatrix}
1 & 0 & 0 & 1 & 2 & 1 & 2 \\
1 & 0 & 1 & 0
\end{bmatrix}
= 
\begin{bmatrix}
12 & 18 & 13 & 22 \\
10 & 20 & 15 & 22
\end{bmatrix}
\]

Convolution Filters \( W' \)

Output Features \( Y \)

Input Features \( X \) (unrolled)

Input Features \( X \) (unrolled)

Slide credit: Reproduced from Hwu & Kirk
In 2010, Prof. Andreas Moshovos adopted Professor Hwu’s ECE498AL Programming Massively Parallel Processors Class

Several of Prof. Geoffrey Hinton’s graduate students took the course

These students developed the GPU implementation of the Deep CNN that was trained with 1.2M images to win the ImageNet competition
Example: AlexNet (2012)

- AlexNet wins the **ImageNet classification competition** with ~10% points higher accuracy than state-of-the-art
  - Krizhevsky et al., “**ImageNet Classification with Deep Convolutional Neural Networks**”, NIPS 2012.
Example: GoogLeNet (2014)

- Google improves accuracy by adding more network layers
  - From 8 in AlexNet to 22 in GoogLeNet
  - Szegedy et al., “Going Deeper with Convolutions”, CVPR 2015.

**Going Deeper with Convolutions**

Christian Szegedy\(^1\), Wei Liu\(^2\), Yangqing Jia\(^1\), Pierre Sermanet\(^1\), Scott Reed\(^3\), Dragomir Anguelov\(^1\), Dumitru Erhan\(^1\), Vincent Vanhoucke\(^1\), Andrew Rabinovich\(^4\)

\(^1\)Google Inc. \(^2\)University of North Carolina, Chapel Hill
\(^3\)University of Michigan, Ann Arbor \(^4\)Magic Leap Inc.

\(^1\)\{szegedy, jiayq, sermanet, dragomir, dimitru, vanhoucke\}@google.com
\(^2\)wliu@cs.unc.edu, \(^3\)reedscott@umich.edu, \(^4\)arabinovich@magic leap.com

Neural Network Layer Examples

<table>
<thead>
<tr>
<th>LeNet</th>
<th>AlexNet</th>
</tr>
</thead>
<tbody>
<tr>
<td>Image: 28 (height) × 28 (width) × 1 (channel)</td>
<td>Image: 224 (height) × 224 (width) × 3 (channels)</td>
</tr>
<tr>
<td>Convolution with 5×5 kernel+2 padding:28×28×6 ↓ sigmoid</td>
<td>Convolution with 11×11 kernel+4 stride:54×54×96 ↓ ReLu</td>
</tr>
<tr>
<td>Pool with 2×2 average kernel+2 stride:14×14×6 ↓</td>
<td>Pool with 3×3 max. kernel+2 stride:26×26×96 ↓</td>
</tr>
<tr>
<td>Convolution with 5×5 kernel (no pad):10×10×16 ↓ sigmoid</td>
<td>Convolution with 5×5 kernel+2 pad:26×26×256 ↓ ReLu</td>
</tr>
<tr>
<td>Pool with 2×2 average kernel+2 stride:5×5×16 ↓ flatten</td>
<td>Pool with 3×3 max. kernel+2 stride:12×12×256 ↓</td>
</tr>
<tr>
<td>Dense: 120 fully connected neurons ↓ sigmoid</td>
<td>Convolution with 3×3 kernel+1 pad:12×12×384 ↓ ReLu</td>
</tr>
<tr>
<td>Dense: 84 fully connected neurons ↓ sigmoid</td>
<td>Convolution with 3×3 kernel+1 pad:12×12×384 ↓ ReLu</td>
</tr>
<tr>
<td>Dense: 10 fully connected neurons ↓</td>
<td>Convolution with 3×3 kernel+1 pad:12×12×256 ↓ ReLu</td>
</tr>
<tr>
<td>Output: 1 of 10 classes</td>
<td>Pool with 3×3 max. kernel+2 stride:5×5×256 ↓ flatten</td>
</tr>
<tr>
<td>Dense: 4096 fully connected neurons ↓ ReLu, dropout p=0.5</td>
<td>Dense: 4096 fully connected neurons ↓ ReLu, dropout p=0.5</td>
</tr>
<tr>
<td>Dense: 1000 fully connected neurons ↓</td>
<td>Output: 1 of 1000 classes</td>
</tr>
</tbody>
</table>

By Cmglee - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=104937230
Systolic Computation Example: Convolution (I)

- Convolution
  - Used in filtering, pattern matching, correlation, polynomial evaluation, etc ...
  - Many image processing tasks
  - Machine learning: up to hundreds of convolutional layers in Convolutional Neural Networks (CNN)

Given the sequence of weights \(\{w_1, w_2, \ldots, w_k\}\) and the input sequence \(\{x_1, x_2, \ldots, x_n\}\), compute the result sequence \(\{y_1, y_2, \ldots, y_{n+1-k}\}\) defined by

\[
y_i = w_1 x_i + w_2 x_{i+1} + \ldots + w_k x_{i+k-1}
\]
Systolic Computation Example: Convolution (II)

- \( y_1 = w_1x_1 + w_2x_2 + w_3x_3 \)
- \( y_2 = w_1x_2 + w_2x_3 + w_3x_4 \)
- \( y_3 = w_1x_3 + w_2x_4 + w_3x_5 \)

Figure 8. Design W1: systolic convolution array (a) and cell (b) where \( w_i \)'s stay and \( x_i \)'s and \( y_i \)'s move systolically in opposite directions.
Worthwhile to implement adder and multiplier separately to allow overlapping of add/mul executions
Systolic Computation Example: Convolution (IV)

- One needs to carefully orchestrate when data elements are input to the array
- And when output is buffered

- This gets more involved when
  - Array dimensionality increases
  - PEs are less predictable in terms of latency
Example 2D Systolic Array Computation

- Multiply two 3x3 matrices (inputs)
  - Keep the final result in PE accumulators

\[
\begin{bmatrix}
c_{00} & c_{01} & c_{02} \\
c_{10} & c_{11} & c_{12} \\
c_{20} & c_{21} & c_{22}
\end{bmatrix}
= \begin{bmatrix}
a_{00} & a_{01} & a_{02} \\
a_{10} & a_{11} & a_{12} \\
a_{20} & a_{21} & a_{22}
\end{bmatrix}
\times \begin{bmatrix}
b_{00} & b_{01} & b_{02} \\
b_{10} & b_{11} & b_{12} \\
b_{20} & b_{21} & b_{22}
\end{bmatrix}
\]

P = M
Q = N
R = R + M*N

Figure 1: A systolic array processing element
Two-Dimensional Systolic Arrays

To a given problem there could be both one- and two-dimensional systolic array solutions. For example, two-dimensional convolution can be performed by a onedimensional systolic array\textsuperscript{24,25} or a two-dimensional systolic array.\textsuperscript{6} When the memory speed is more than cell speed, two-dimensional systolic arrays such as those depicted in Figure 11 should be used. At each cell cycle, all the I/O ports on the array boundaries can input or output data items to or from the memory; as a result, the available memory bandwidth can be fully utilized. Thus, the choice of a one- or two-dimensional scheme is very dependent on how cells and memories will be implemented.
Combinations

- Systolic arrays can be chained together to form powerful systems.

- This systolic array is capable of producing on-the-fly least-squares fit to all the data that has arrived up to any given moment.

Figure 12. On-the-fly least-squares solutions using one- and two-dimensional systolic arrays, with $p = 4$. 
Systolic Arrays: Pros and Cons

Advantages:
- **Principled**: Efficiently makes use of limited memory bandwidth, balances computation to I/O bandwidth availability
- **Specialized** (computation should fit the PE organization/functions)
  - improved efficiency, simple design, high concurrency/performance
  - good to do more with less memory bandwidth requirement

Downside:
- **Specialized**
  - not generally applicable because computation needs to fit the PE functions/organization
More Programmability in Systolic Arrays

- Each PE in a systolic array
  - Can store multiple “weights”
  - Weights can be selected on the fly
  - Eases implementation of, e.g., adaptive filtering

- Taken further
  - Each PE can have its own data and instruction memory
  - Data memory → to store partial/temporary results, constants
  - Leads to stream processing, pipeline parallelism
    - More generally, staged execution
Figure 1. (a) The code of a loop, (b) Each iteration is split into 3 pipeline stages: A, B, and C. Iteration $i$ comprises $A_i$, $B_i$, $C_i$. (c) Sequential execution of 4 iterations. (d) Parallel execution of 6 iterations using pipeline parallelism on a three-core machine. Each stage executes on one core.

Stages of Pipelined Programs

- Loop iterations are divided into code segments called **stages**
- Threads execute stages on different cores

```plaintext
loop {
    Compute1
    Compute2
    Compute3
}
```
Figure 3. File compression algorithm executed using pipeline parallelism
Systolic Array: Advantages & Disadvantages

- Advantages
  - Special purpose → high efficiency
  - Makes multiple uses of each data item → reduced need for fetching/refetching → better use of memory bandwidth
  - High concurrency
  - Regular design (both data and control flow) → easier to implement

- Disadvantages
  - Not good at exploiting irregular parallelism
  - Special purpose → not generally applicable
    - Needs software, programmer support to become more general purpose
Example Systolic Array: The WARP Computer

- HT Kung, CMU, 1984-1988

- Linear array of 10 cells, each cell a 10 Mflop programmable processor
- Attached to a general purpose host machine
- HLL and optimizing compiler to program the systolic array
- Used extensively to accelerate vision and robotics tasks

The WARP Computer

Figure 1: Warp system overview
The WARP Cell

Figure 2: Warp cell data path
**An Example Modern Systolic Array: TPU (I)**

**Figure 3.** TPU Printed Circuit Board. It can be inserted in the slot for an SATA disk in a server, but the card uses PCIe Gen3 x16.

**Figure 4.** Systolic data flow of the Matrix Multiply Unit. Software has the illusion that each 256B input is read at once, and they instantly update one location of each of 256 accumulator RAMs.

As reading a large SRAM uses much more power than arithmetic, the matrix unit uses systolic execution to save energy by reducing reads and writes of the Unified Buffer [Kun80][Ram91][Ovt15b]. Figure 4 shows that data flows in from the left, and the weights are loaded from the top. A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront. The weights are preloaded, and take effect with the advancing wave alongside the first data of a new block. Control and data are pipelined to give the illusion that the 256 inputs are read at once, and that they instantly update one location of each of 256 accumulators. From a correctness perspective, software is unaware of the systolic nature of the matrix unit, but for performance, it does worry about the latency of the unit.

Recall: Example 2D Systolic Array Computation

- Multiply two 3x3 matrices (inputs)
  - Keep the final result in PE accumulators

\[
\begin{bmatrix}
  c_{00} & c_{01} & c_{02} \\
  c_{10} & c_{11} & c_{12} \\
  c_{20} & c_{21} & c_{22}
\end{bmatrix}
= \begin{bmatrix}
  a_{00} & a_{01} & a_{02} \\
  a_{10} & a_{11} & a_{12} \\
  a_{20} & a_{21} & a_{22}
\end{bmatrix}
\times
\begin{bmatrix}
  b_{00} & b_{01} & b_{02} \\
  b_{10} & b_{11} & b_{12} \\
  b_{20} & b_{21} & b_{22}
\end{bmatrix}
\]

Figure 1: A systolic array processing element

\[P = M\]
\[Q = N\]
\[R = R + M \times N\]
**Figure 1.** TPU Block Diagram. The main computation part is the yellow Matrix Multiply unit in the upper right hand corner. Its inputs are the blue Weight FIFO and the blue Unified Buffer (UB) and its output is the blue Accumulators (Acc). The yellow Activation Unit performs the nonlinear functions on the Acc, which go to the UB.
An Example Modern Systolic Array: TPU2

4 TPU chips
vs 1 chip in TPU1

High Bandwidth Memory
vs DDR3

Floating point operations
vs FP16

45 TFLOPS per chip
vs 23 TOPS

Designed for training
and inference
vs only inference

An Example Modern Systolic Array: TPU3

32GB HBM per chip vs 16GB HBM in TPU2
4 Matrix Units per chip vs 2 Matrix Units in TPU2
90 TFLOPS per chip vs 45 TFLOPS in TPU2
An Example Modern Systolic Array: TPU4

250 TFLOPS per chip in 2021 vs 90 TFLOPS in TPU3
1 ExaFLOPS per board

New ML applications (vs. TPU3):
• Computer vision
• Natural Language Processing (NLP)
• Recommender system
• Reinforcement learning that plays Go

https://spectrum.ieee.org/tech-talk/computing/hardware/heres-how-googles-tpu-v4-ai-chip-stacked-up-in-training-tests
Cerebras’s Wafer Scale Engine (2019)

- The largest ML accelerator chip
- 400,000 cores

Cerebras WSE
- 1.2 Trillion transistors
- 46,225 mm²

Largest GPU
- 21.1 Billion transistors
- 815 mm²

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning
https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/
Cerebras’s Wafer Scale Engine-2 (2021)

- The largest ML accelerator chip
- 850,000 cores

Cerebras WSE-2
2.6 Trillion transistors
46,225 mm²

Largest GPU
54.2 Billion transistors
826 mm²

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning
https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/
Huge Demand for Performance & Efficiency

Exponential Growth of Neural Networks

Memory and compute requirements

Total training compute, PFLOP-days

Model memory requirement, GB

- 2018
- 2019
- 2020+
- MSFT-1T (1T)
- MT-NLG (530B)
- GPT-3 (175B)
- T5 (11B)
- T-NLG (17B)
- Megatron-LM (8B)
- GPT-2 (1.5B)
- BERT Large (340M)
- BERT Base (110M)

1800x more compute
In just 2 years

Tomorrow, multi-trillion parameter models

https://www.youtube.com/watch?v=x2-qB0J7KHW
More on the Cerebras WSE

https://www.youtube.com/watch?v=x2-qB0J7KHw