Digital Design & Computer Arch.
Lecture 14: Pipelined Processor Design

Prof. Onur Mutlu

ETH Zürich
Spring 2021
22 April 2021
Required Readings

Last week & This week

- Pipelining
  - H&H, Chapter 7.5
- Pipelining Issues
  - H&H, Chapter 7.8.1-7.8.3

This week & Next week

- Out-of-order execution
  - H&H, Chapter 7.8-7.9
  - More advanced pipelining
  - Interrupt and exception handling
  - Out-of-order and superscalar execution concepts
Agenda for Today & Next Few Lectures

■ Earlier
  - Single-cycle Microarchitectures
  - Multi-cycle Microarchitectures

■ Last week & today
  - Pipelining
  - Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, ...

■ Tomorrow & Next week
  - Out-of-Order Execution
  - Issues in OoO Execution: Load-Store Handling, ...
Review: Single-Cycle MIPS Processor
Review: Single-Cycle MIPS FSM

- Single-cycle machine

AS: Architectural State
Can We Do Better?
Review: Multi-Cycle MIPS Processor
Review: Multi-Cycle MIPS FSM

What is the shortcoming of this design?

What does this design assume about memory?
Can We Do Better?
Review: Pipelining Basic Idea

Of course, we need to be more careful than this!
Review: Pipelined Datapath & Control

- Same control unit as single-cycle processor
- Control delayed to proper pipeline stage
Review: Execution of Four Independent ADDs

- Multi-cycle: 4 cycles per instruction
  
  ![Multi-cycle Diagram]

- Pipelined: 4 cycles per 4 instructions (steady state)
  
  ![Pipelined Diagram]

Is life always this beautiful?
Review: Issues in Pipeline Design

- Balancing work in pipeline stages
  - How many stages and what is done in each stage

- Keeping the pipeline correct, moving, and full in the presence of events that disrupt pipeline flow
  - Handling dependences
    - Data
    - Control
  - Handling resource contention
  - Handling long-latency (multi-cycle) operations

- Handling exceptions, interrupts

- Advanced: Improving pipeline throughput
  - Minimizing stalls
Data Dependence Handling: Concepts and Implementation
Review: Data Dependence Types

**Flow dependence**

\[ r_3 \leftarrow r_1 \text{ op } r_2 \quad \text{Read-after-Write} \]
\[ r_5 \leftarrow r_3 \text{ op } r_4 \quad \text{(RAW)} \]

**Anti dependence**

\[ r_3 \leftarrow r_1 \text{ op } r_2 \quad \text{Write-after-Read} \]
\[ r_1 \leftarrow r_4 \text{ op } r_5 \quad \text{(WAR)} \]

**Output-dependence**

\[ r_3 \leftarrow r_1 \text{ op } r_2 \quad \text{Write-after-Write} \]
\[ r_5 \leftarrow r_3 \text{ op } r_4 \quad \text{(WAW)} \]
\[ r_3 \leftarrow r_6 \text{ op } r_7 \]
Review: How to Handle Data Dependences

- Anti and output dependences are easier to handle
  - write to the destination only in last stage and in program order

- Flow dependences are more interesting

- Six fundamental ways of handling flow dependences
  - Detect and wait until value is available in register file
  - Detect and forward/bypass data to dependent instruction
  - Detect and eliminate the dependence at the software level
    - No need for the hardware to detect dependence
  - Detect and move it out of the way for independent instructions
  - Predict the needed value(s), execute “speculatively”, and verify
  - Do something else (fine-grained multithreading)
    - No need to detect
Review: Pipeline Stall: Resolving Data Dependence

Stall = make the dependent instruction wait until its source data value is available

1. stop all up-stream stages
2. drain all down-stream stages

Inst_h
Inst_i
Inst_j
Inst_k
Inst_l

i: \( r_x \leftarrow _\) bubble

j: _ \leftarrow r_x  \quad \text{dist}(i,j)=4
How to Implement Stalling

- **Stall**
  - disable **PC** and **IF/ID** latching; ensure stalled instruction stays in its stage
  - Insert "**invalid**" instructions/nops into the stage following the stalled one (called "**bubbles**")

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
One instruction writes a register ($s0) and next instructions read this register => read after write (RAW) dependence.

- `add` writes into $s0 in the first half of cycle 5
- `sub` reads $s0 in 2nd half of cycle 5, getting the correct value
- subsequent instructions read the correct value of $s0

Only if the pipeline handles data dependences incorrectly!

```
add $s0, $s2, $s3
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
```
Add $s0, $s2, $s3

No operation

No operation

And $t0, $s0, $s1

Or $t1, $s4, $s0

Sub $t2, $s0, $s5

- Insert enough NOPs for the required result to be ready
- Or (if you can) move independent useful instructions up
  - Reorder/reschedule instructions at the compiler level
Data Forwarding

- Also called Data Bypassing

- We have already seen the basic idea before
  - Forward the result value to the dependent instruction as soon as the value is available

- Remember dataflow?
  - Data value supplied to dependent instruction as soon as it is available
  - Instruction executes when all its operands are available

- Data forwarding brings a pipeline closer to data flow execution principles
Data Forwarding

add $s0, $s2, $s3

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5
Data Forwarding
Data Forwarding

- Forward to Execute stage from either:
  - Memory stage or
  - Writeback stage

- When should we forward from either Memory or Writeback stage?
  - If that stage will write to a destination register and the destination register matches the source register.
  - If both the Memory and Writeback stages contain matching destination registers, the Memory stage should have priority, because it contains the *more recently executed* instruction.
Data Forwarding (in Pseudocode)

- Forward to Execute stage from either:
  - Memory stage or
  - Writeback stage

- Forwarding logic for \textit{ForwardAE (pseudo code)}:
  \begin{verbatim}
  if ((rsE \neq 0) \text{ AND } (rsE == \text{WriteRegM}) \text{ AND RegWriteM}) \text{ then}
    \text{ForwardAE} = 10 \quad \# \text{ forward from Memory stage}
  \text{else if } ((rsE \neq 0) \text{ AND } (rsE == \text{WriteRegW}) \text{ AND RegWriteW}) \text{ then}
    \text{ForwardAE} = 01 \quad \# \text{ forward from Writeback stage}
  \text{else}
    \text{ForwardAE} = 00 \quad \# \text{ no forwarding}
  \end{verbatim}

- Forwarding logic for \textit{ForwardBE} same, but replace \textit{rsE} with \textit{rtE}
Forwarding is not always possible

Forwarding is sufficient to resolve RAW data dependences

Unfortunately, there are cases when forwarding is not possible

- due to pipeline design and instruction latencies
- The `lw` instruction does not finish reading data until the end of the Memory stage
  - its result cannot be forwarded to the Execute stage of the next instruction
Stalling

lw $s0, 40($0)
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
Hardware Needed for Stalling

- Stalls are supported by
  - adding enable inputs (EN) to the Fetch and Decode pipeline registers
  - and a synchronous reset/clear (CLR) input to the Execute pipeline register
    - or an INV bit associated with each pipeline register, indicating that contents are INValid

- When a lw stall occurs
  - StallD and StallF are asserted to force the Decode and Fetch stage pipeline registers to hold their old values.
  - FlushE is also asserted to clear the contents of the Execute stage pipeline register, introducing a bubble
Stalling and Dependence Detection Hardware
A Special Case of Data Dependence

- Control dependence
  - Data dependence on the Instruction Pointer / Program Counter
Control Dependence

Question: What should the fetch PC be in the next cycle?
Answer: The address of the next instruction
   - All instructions are control dependent on previous ones. Why?

If the fetched instruction is a non-control-flow instruction:
   - Next Fetch PC is the address of the next-sequential instruction
   - Easy to determine if we know the size of the fetched instruction

If the instruction that is fetched is a control-flow instruction:
   - How do we determine the next Fetch PC?

In fact, how do we know whether or not the fetched instruction is a control-flow instruction?
Control Dependences

- Special case of data dependence: dependence on PC

- **beq:**
  - branch is not resolved until the fourth stage of the pipeline
  - Instructions after the branch are fetched before branch is resolved
    - Always predict that the next sequential instruction is fetched
    - Called “Always not taken” prediction
  - These instructions must be flushed if the branch is taken

- **Branch misprediction penalty**
  - number of instructions flushed when branch is taken
  - May be reduced by resolving the branch earlier
Control Dependence: Original Pipeline

[Diagram of a pipelined processor with various control signals and data paths, including Op, Func, Instruction Memory, Register File, ALU, Data Memory, and Control Unit.]
Control Dependence

20  beq $t1, $t2, 40
24  and $t0, $s0, $s1
28  or $t1, $s4, $s0
2C  sub $t2, $s0, $s5
30  ...
...
64  slt $t3, $s2, $s3
Introduces another data dependence in Decode stage.
Early Branch Resolution

20  beq $t1, $t2, 40
24  and $t0, $s0, $s1
28  or $t1, $s4, $s0
2C  sub $t2, $s0, $s5
30  ...
...
64  slt $t3, $s2, $s3
Early Branch Resolution: Good Idea?

- **Advantages**
  - Reduced branch misprediction penalty
    - Reduced CPI (cycles per instruction)

- **Disadvantages**
  - Potential increase in clock cycle time?
    - Higher clock period and lower frequency?
  - Additional hardware cost
    - Specialized and likely not used by other instructions
Data Forwarding for Early Branch Resolution

Data forwarding for early branch resolution.
Forwarding and Stalling Hardware Control

// Forwarding logic:
assign ForwardAD = (rsD != 0) & (rsD == WriteRegM) & RegWriteM;
assign ForwardBD = (rtD != 0) & (rtD == WriteRegM) & RegWriteM;

// Stalling logic:
assign lwstall = ((rsD == rtE) | (rtD == rtE)) & MemtoRegE;
assign branchstall = (BranchD & RegWriteE &
                     (WriteRegE == rsD | WriteRegE == rtD))
                   |
                     (BranchD & MemtoRegM &
                      (WriteRegM == rsD | WriteRegM == rtD));

// Stall signals;
assign StallF = lwstall | branchstall;
assign StallD = lwstall | branchstall;
assign FLushE = lwstall | branchstall;
Final Pipelined MIPS Processor (H&H)

Figure 7.58 Pipelined processor with full hazard handling

Includes data dependence detection, early br resolution, forwarding, stall logic
Doing Better: Smarter Branch Prediction

- Guess whether branch will be taken
  - Backward branches are usually taken (loops)
  - Consider history of whether branch was previously taken to improve the guess

- Good prediction reduces the fraction of branches requiring a flush
More on Branch Prediction (I)

Importance of The Branch Problem

- Assume N = 20 (20 pipe stages), W = 5 (5 wide fetch)
- Assume: 1 out of 5 instructions is a branch
- Assume: Each 5 instruction-block ends with a branch

How long does it take to fetch 500 instructions?

- 100% accuracy
  - 100 cycles (all instructions fetched on the correct path)
  - No wasted work; IPC = 500/100
- 99% accuracy
  - 98 cycles (correct path) + 2 cycles (wrong path) = 120 cycles
  - 20% extra instructions fetched; IPC = 500/120
- 90% accuracy
  - 98 cycles (correct path) + 2 cycles (wrong path) = 100 cycles
  - 200% extra instructions fetched; IPC = 500/300
- 80% accuracy
  - 98 cycles (correct path) + 2 cycles (wrong path) = 120 cycles
  - 50% extra instructions fetched; IPC = 500/600
- 60% accuracy
  - 98 cycles (correct path) + 2 cycles (wrong path) = 120 cycles
  - 50% extra instructions fetched; IPC = 500/900

---

https://www.youtube.com/onurmutlulectures
More on Branch Prediction (II)

https://www.youtube.com/onurmutlulectures
More on Branch Prediction (III)

Superblock Code Optimization Example

Original Code

Code After Superblock Formation

Code After Common Subexpression Elimination

18-740 Computer Architecture - Advanced Branch Prediction - Lecture 5

4,696 views • Sep 23, 2015

Carnegie Mellon Computer Architecture

Lecture 5: Advanced Branch Prediction
Lecturer: Prof. Onur Mutlu (http://users.ece.cmu.edu/~omutlu/)
Date: September 16, 2014.

Lecture 5 slides (ppt): http://www.ece.cmu.edu/~ece740/f15/slides/lecture5.ppt

https://www.youtube.com/onurmutlulectures
Lectures on Branch Prediction

- Digital Design & Computer Architecture, Spring 2020, Lecture 16b
  - Branch Prediction I (ETH Zurich, Spring 2020)
  - https://www.youtube.com/watch?v=h6l9yYSyZHM&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=22

- Digital Design & Computer Architecture, Spring 2020, Lecture 17
  - Branch Prediction II (ETH Zurich, Spring 2020)
  - https://www.youtube.com/watch?v=z77VpggShvg&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=23

- Computer Architecture, Spring 2015, Lecture 5
  - Advanced Branch Prediction (CMU, Spring 2015)
  - https://www.youtube.com/watch?v=yDjsr-jTOtk&list=PL5PHm2jkkXmgVhh8CHAu9N76TShJqfYDt&index=4

https://www.youtube.com/onurmutlulectures
Pipelined Performance Example
Pipelined Performance Example

- **SPECINT2017 benchmark:**
  - 25% loads
  - 10% stores
  - 11% branches
  - 2% jumps
  - 52% R-type

- **Suppose:**
  - 40% of loads used by next instruction
  - 25% of branches mispredicted

- **All jumps flush next instruction**

- **What is the average CPI?**
Pipelined Performance Example Solution

- Load/Branch CPI = 1 when no stall/flush, 2 when stall/flush. Thus:
  - $CPI_{lw} = 1(0.6) + 2(0.4) = 1.4$  
    *Average CPI for load*
  - $CPI_{beq} = 1(0.75) + 2(0.25) = 1.25$  
    *Average CPI for branch*

- And
  - $Average\ CPI = \ $
Pipelined Performance Example Solution

- **Load/Branch CPI = 1 when no stall/flush, 2 when stall/flush.**

  Thus:
  - \( \text{CPI}_{\text{lw}} = 1(0.6) + 2(0.4) = 1.4 \)  
    \( \text{Average CPI for load} \)
  - \( \text{CPI}_{\text{beq}} = 1(0.75) + 2(0.25) = 1.25 \)  
    \( \text{Average CPI for branch} \)

- **And**

  - **Average CPI**  
    \[
    = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1)
    
    = 1.15
    \]  
    \( \text{load} \)
    \( \text{store} \)
    \( \text{beq} \)
    \( \text{jump} \)
    \( r\text{-type} \)
Pipelined Performance

- There are 5 stages, and 5 different timing paths:

\[ T_c = \max \{ \]

\[ t_{pcq} + t_{mem} + t_{setup} + 2(t_{RFread} + t_{mux} + t_{eq} + t_{AND} + t_{mux} + t_{setup}) \]

\[ t_{pcq} + t_{mux} + t_{mux} + t_{ALU} + t_{setup} \]

\[ t_{pcq} + t_{memwrite} + t_{setup} \]

\[ 2(t_{pcq} + t_{mux} + t_{RFwrite}) \]

\}
## Pipelined Performance Example

<table>
<thead>
<tr>
<th>Element</th>
<th>Parameter</th>
<th>Delay (ps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Register clock-to-Q</td>
<td>$t_{pcq_PC}$</td>
<td>30</td>
</tr>
<tr>
<td>Register setup</td>
<td>$t_{setup}$</td>
<td>20</td>
</tr>
<tr>
<td>Multiplexer</td>
<td>$t_{mux}$</td>
<td>25</td>
</tr>
<tr>
<td>ALU</td>
<td>$t_{ALU}$</td>
<td>200</td>
</tr>
<tr>
<td>Memory read</td>
<td>$t_{mem}$</td>
<td>250</td>
</tr>
<tr>
<td>Register file read</td>
<td>$t_{RFread}$</td>
<td>150</td>
</tr>
<tr>
<td>Register file setup</td>
<td>$t_{RFsetup}$</td>
<td>20</td>
</tr>
<tr>
<td>Equality comparator</td>
<td>$t_{eq}$</td>
<td>40</td>
</tr>
<tr>
<td>AND gate</td>
<td>$t_{AND}$</td>
<td>15</td>
</tr>
<tr>
<td>Memory write</td>
<td>$T_{memwrite}$</td>
<td>220</td>
</tr>
<tr>
<td>Register file write</td>
<td>$t_{RFwrite}$</td>
<td>100</td>
</tr>
</tbody>
</table>

\[
T_c = 2(t_{RFread} + t_{mux} + t_{eq} + t_{AND} + t_{mux} + t_{setup}) \\
= 2(150 + 25 + 40 + 15 + 25 + 20) \text{ ps} \\
= 550 \text{ ps}
\]
Pipelined Performance Example

- For a program with 100 billion instructions executing on a pipelined MIPS processor:
  - CPI = 1.15
  - $T_c = 550$ ps

- Execution Time
  \[
  \text{Execution Time} = (\# \text{ instructions}) \times \text{CPI} \times T_c \\
  = (100 \times 10^9)(1.15)(550 \times 10^{-12}) \\
  = 63 \text{ seconds}
  \]
## Performance Summary for MIPS arch.

<table>
<thead>
<tr>
<th>Processor</th>
<th>Execution Time (seconds)</th>
<th>Speedup (single-cycle is baseline)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single-cycle</td>
<td>95</td>
<td>1</td>
</tr>
<tr>
<td>Multicycle</td>
<td>133</td>
<td>0.71</td>
</tr>
<tr>
<td>Pipelined</td>
<td>63</td>
<td>1.51</td>
</tr>
</tbody>
</table>

- Fastest of the three MIPS architectures is **Pipelined**.
- However, even though we have 5 fold pipelining, it is not 5 times faster than single cycle.
Recall: How to Handle Data Dependences

- Anti and output dependences are easier to handle
  - write to the destination only in last stage and in program order

- Flow dependences are more interesting

- Six fundamental ways of handling flow dependences
  - Detect and wait until value is available in register file
  - Detect and forward/bypass data to dependent instruction
  - Detect and eliminate the dependence at the software level
    - No need for the hardware to detect dependence
  - Detect and move it out of the way for independent instructions
  - Predict the needed value(s), execute “speculatively”, and verify
  - Do something else (fine-grained multithreading)
    - No need to detect
Recall: How to Handle Data Dependences

- Anti and output dependences are easier to handle
  - write to the destination only in last stage and in program order

- Flow dependences are more interesting

- Six fundamental ways of handling flow dependences
  - Detect and wait until value is available in register file
  - Detect and forward/bypass data to dependent instruction
  - Detect and eliminate the dependence at the software level
    - No need for the hardware to detect dependence
  - Detect and move it out of the way for independent instructions
  - Predict the needed value(s), execute “speculatively”, and verify
  - Do something else (fine-grained multithreading)
    - No need to detect
Questions to Ponder

- What is the role of the hardware vs. the software in data dependence handling?
  - Software based interlocking
  - Hardware based interlocking
  - Who inserts/manages the pipeline bubbles?
  - Who finds the independent instructions to fill “empty” pipeline slots?
  - What are the advantages/disadvantages of each?
    - Think of the performance equation as well
Questions to Ponder

- What is the role of the hardware vs. the software in the order in which instructions are executed in the pipeline?
  - Software based instruction scheduling → static scheduling
  - Hardware based instruction scheduling → dynamic scheduling

- How does each impact different metrics?
  - Performance (and parts of the performance equation)
  - Complexity
  - Power consumption
  - Reliability
  - ...
More on Software vs. Hardware

- **Software based scheduling of instructions → static scheduling**
  - Compiler orders the instructions, hardware executes them in that order
  - Contrast this with **dynamic scheduling** (in which hardware can execute instructions out of the compiler-specified order)
  - How does the compiler know the latency of each instruction?

- **What information does the compiler not know that makes static scheduling difficult?**
  - Answer: Anything that is determined at run time
    - Variable-length operation latency, memory addr, branch direction

- **How can the compiler alleviate this (i.e., estimate the unknown)?**
  - Answer: Profiling
More on Static Instruction Scheduling


7,136 views • Feb 26, 2015

Carnegie Mellon Computer Architecture
23K subscribers

Lecture 16: Static Instruction Scheduling
Lecturer: Prof. Onur Mutlu (http://users.ece.cmu.edu/~omutlu/)
Date: Feb 23rd, 2015

Lecture 16 slides (pdf): http://www.ece.cmu.edu/~ece447/s15/lib...
Lectures on Static Instruction Scheduling

- Computer Architecture, Spring 2015, Lecture 16
  - Static Instruction Scheduling (CMU, Spring 2015)
  - [https://www.youtube.com/watch?v=isBEVkJjgGA&list=PL5PHm2jkkXmi5CxxI7b3JCLO1TWybTDtKq&index=18](https://www.youtube.com/watch?v=isBEVkJjgGA&list=PL5PHm2jkkXmi5CxxI7b3JCLO1TWybTDtKq&index=18)

- Computer Architecture, Spring 2013, Lecture 21
  - Static Instruction Scheduling (CMU, Spring 2013)
  - [https://www.youtube.com/watch?v=XdDUn2WtkRg&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=21](https://www.youtube.com/watch?v=XdDUn2WtkRg&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=21)

[https://www.youtube.com/onurmutlulectures](https://www.youtube.com/onurmutlulectures)
Recall: How to Handle Data Dependences

- Anti and output dependences are easier to handle
  - write to the destination only in last stage and in program order

- Flow dependences are more interesting

- Six fundamental ways of handling flow dependences
  - Detect and wait until value is available in register file
  - Detect and forward/bypass data to dependent instruction
  - Detect and eliminate the dependence at the software level
    - No need for the hardware to detect dependence
  - Detect and move it out of the way for independent instructions
  - Predict the needed value(s), execute “speculatively”, and verify
  - Do something else (fine-grained multithreading)
    - No need to detect
Fine-Grained Multithreading
Fine-Grained Multithreading

- **Idea:** Hardware has multiple thread contexts (PC+registers). Each cycle, fetch engine fetches from a different thread.
  - By the time the fetched branch/instruction resolves, no instruction is fetched from the same thread.
  - Branch/instruction resolution latency overlapped with execution of other threads’ instructions.

+ No logic needed for handling control and data dependences within a thread
— Single thread performance suffers
— Extra logic for keeping thread contexts
— Does not overlap latency if not enough threads to cover the whole pipeline
Fine-Grained Multithreading (II)

- Idea: Switch to another thread every cycle such that no two instructions from a thread are in the pipeline concurrently.

- Tolerates the control and data dependence latencies by overlapping the latency with useful work from other threads.

- Improves pipeline utilization by taking advantage of multiple threads.


Fine-Grained Multithreading: History

- CDC 6600’s peripheral processing unit is fine-grained multithreaded
  - Processor executes a different I/O thread every cycle
  - An operation from the same thread is executed every 10 cycles

- Denelcor HEP (Heterogeneous Element Processor)
  - 120 threads/processor
  - available queue vs. unavailable (waiting) queue for threads
  - each thread can have only 1 instruction in the processor pipeline; each thread independent
  - to each thread, processor looks like a non-pipelined machine
  - system throughput vs. single thread performance tradeoff
Fine-Grained Multithreading in HEP

- Cycle time: 100ns
- 8 stages $\rightarrow$ 800 ns to complete an instruction
  - assuming no memory access
- No control and data dependence checking

Burton Smith
(1941-2018)
Multithreaded Pipeline Example
Sun Niagara Multithreaded Pipeline

Fine-Grained Multithreading

- **Advantages**
  + No need for dependence checking between instructions
    (only one instruction in pipeline from a single thread)
  + No need for branch prediction logic
  + Otherwise-bubble cycles used for executing useful instructions from different threads
  + Improved system throughput, latency tolerance, utilization

- **Disadvantages**
  - Extra hardware complexity: multiple hardware contexts (PCs, register files, ...), thread selection logic
  - Reduced single thread performance (one instruction fetched every N cycles from the same thread)
  - Resource contention between threads in caches and memory
  - Some dependence checking logic *between* threads remains (load/store)
Modern GPUs are FGMT Machines
NVIDIA GeForce GTX 285 “core”

- data-parallel (SIMD) func. unit, control shared across 8 units
  - multiply-add
  - multiply

- instruction stream decode
- execution context storage

64 KB of storage for thread contexts (registers)

Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285 “core”

- Groups of 32 **threads** share instruction stream (each group is a Warp): they execute the same instruction on different data
- **Up to 32 warps are interleaved in an FGMT manner**
- Up to 1024 thread contexts can be stored

Slide credit: Kayvon Fatahalian
30 cores on the GTX 285: 30,720 threads
A PIPELINED, SHARED RESOURCE MIMD COMPUTER

Burton J. Smith
Denelcor, Inc.
Denver, Colorado 80205

Further Reading for the Interested (I)

Burton Smith (1941-2018)

Architecture and applications of the HEP multiprocessor computer system

Burton J. Smith
Denelcor, Inc., 14221 E. 4th Avenue, Aurora, Colorado 80011
The Tera Computer System*

Robert Alverson  
David Callahan  
Allan Porterfield  
Daniel Cummings  
Burton Smith  
Brian Koblenz

Tera Computer Company  
Seattle, Washington USA

4 Processors

Each processor in a Tera computer can execute multiple instruction streams simultaneously. In the current implementation, as few as one or as many as 128 program counters may be active at once. On every tick of the clock, the processor logic selects a stream that is ready to execute and allows it to issue its next instruction. Since instruction interpretation is completely pipelined by the processor and by the network and memories as well, a new instruction from a different stream may be issued in each tick without interfering with its predecessors. When an instruction finishes, the stream to which it belongs thereby becomes ready to execute the next instruction. As long as there are enough instruction streams in the processor so that the average instruction latency is filled with instructions from other streams, the processor is being fully utilized. Thus, it is only necessary to have enough streams to hide the expected latency (perhaps 70 ticks on average); once latency is hidden the processor is running at peak performance and additional streams do not speed the result.
We did not cover the following slides. They are for your benefit. We will cover them in future lectures.
Pipelining and Precise Exceptions: Preserving Sequential Semantics
Multi-Cycle Execution

- Not all instructions take the same amount of time for “execution”

- Idea: Have multiple different functional units that take different number of cycles
  - Can be pipelined or not pipelined
  - Can let independent instructions start execution on a different functional unit before a previous long-latency instruction finishes execution
Issues in Pipelining: Multi-Cycle Execute

- Instructions can take different number of cycles in EXECUTE stage
  - Integer ADD versus FP MULtiply

```
FMUL R4 ← R1, R2
ADD  R3 ← R1, R2
FMUL R2 ← R5, R6
ADD  R7 ← R5, R6
```

- What is wrong with this picture in a Von Neumann architecture?
  - Sequential semantics of the ISA NOT preserved!
  - What if FMULL incurs an exception?
Exceptions vs. Interrupts

- **Cause**
  - Exceptions: internal to the running thread
  - Interrupts: external to the running thread

- **When to Handle**
  - Exceptions: when detected (and known to be non-speculative)
  - Interrupts: when convenient
    - Except for very high priority ones
      - Power failure
      - Machine check (error)

- **Priority**: process (exception), depends (interrupt)

- **Handling Context**: process (exception), system (interrupt)
Precise Exceptions/Interrupts

- The architectural state should be consistent (precise) when the exception/interrupt is ready to be handled

1. All previous instructions should be completely retired.

2. No later instruction should be retired.

Retire = commit = finish execution and update arch. state
Checking for and Handling Exceptions in Pipelining

- When the oldest instruction ready-to-be-retired is detected to have caused an exception, the control logic
  - Ensures architectural state is precise (register file, PC, memory)
  - Flushes all younger instructions in the pipeline
  - Saves PC and registers (as specified by the ISA)
  - Redirects the fetch engine to the appropriate exception handling routine
Why Do We Want Precise Exceptions?

- Semantics of the von Neumann model ISA specifies it
  - Remember von Neumann vs. Dataflow
- Aids software debugging
- Enables (easy) recovery from exceptions
- Enables (easily) restartable processes
- Enables traps into software (e.g., software implemented opcodes)
Ensuring Precise Exceptions in Pipelining

- **Idea:** Make each operation take the same amount of time

```
FMUL R3 ← R1, R2
ADD  R4 ← R1, R2
```

- **Downside**
  - Worst-case instruction latency determines all instructions’ latency
  - What about memory operations?
  - Each functional unit takes worst-case number of cycles?
Solutions

- Reorder buffer
- History buffer
- Future register file
- Checkpointing

We will not cover these

Suggested reading
Recall: Solution I: Reorder Buffer (ROB)

- **Idea:** Complete instructions out-of-order, but reorder them before making results visible to architectural state
- When instruction is decoded it reserves the next-sequential entry in the ROB
- When instruction completes, it writes result into ROB entry
- When instruction oldest in ROB and it has completed without exceptions, its result moved to reg. file or memory
Reorder Buffer

- Buffers information about all instructions that are decoded but not yet retired/committed
What’s in a ROB Entry?

- Everything required to:
  - correctly reorder instructions back into the program order
  - update the architectural state with the instruction’s result(s), if instruction can retire without any issues
  - handle an exception/interrupt precisely, if an exception/interrupt needs to be handled before retiring the instruction

- Need valid bits to keep track of readiness of the result(s) and find out if the instruction has completed execution

<table>
<thead>
<tr>
<th>V</th>
<th>DestRegID</th>
<th>DestRegVal</th>
<th>StoreAddr</th>
<th>StoreData</th>
<th>PC</th>
<th>Valid bits for reg/data + control bits</th>
<th>Exception?</th>
</tr>
</thead>
</table>

Reorder Buffer: Independent Operations

- Result first written to ROB on instruction completion
- Result written to register file at commit time

<table>
<thead>
<tr>
<th>F</th>
<th>D</th>
<th>E</th>
<th>E</th>
<th>E</th>
<th>E</th>
<th>E</th>
<th>E</th>
<th>E</th>
<th>R</th>
<th>W</th>
</tr>
</thead>
<tbody>
<tr>
<td>F</td>
<td>D</td>
<td>E</td>
<td>R</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>W</td>
<td></td>
</tr>
<tr>
<td>F</td>
<td>D</td>
<td>E</td>
<td>R</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>W</td>
<td></td>
</tr>
<tr>
<td>F</td>
<td>D</td>
<td>E</td>
<td>R</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>W</td>
<td></td>
</tr>
<tr>
<td>F</td>
<td>D</td>
<td>E</td>
<td>R</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>W</td>
<td></td>
</tr>
</tbody>
</table>

- What if a later instruction needs a value in the reorder buffer?
  - One option: stall the operation \(\rightarrow\) stall the pipeline
  - Better: Read the value from the reorder buffer. How?
Reorder Buffer: How to Access?

A register value can be in the register file, reorder buffer, (or bypass/forwarding paths)

Instruction Cache

Register File

Reorder Buffer

Random Access Memory (indexed with Register ID, which is the address of an entry)

Content Addressable Memory (searched with register ID, which is part of the content of an entry)

bypass paths

Func Unit

Func Unit

Func Unit
Simplifying Reorder Buffer Access

- Idea: **Use indirection**

- Access register file first (check if the register is valid)
  - If register not valid, register file stores the ID of the reorder buffer entry that contains (or will contain) the value of the register
  - **Mapping of the register to a ROB entry**: Register file maps the register to a reorder buffer entry if there is an in-flight instruction writing to the register

- Access reorder buffer next

- Now, reorder buffer does not need to be content addressable
Reorder Buffer in Intel Pentium III

Important: Register Renaming with a Reorder Buffer

- Output and anti dependencies are **not true dependencies**
  - WHY? The same register refers to values that have nothing to do with each other
  - They exist due to lack of register ID’s (i.e. names) in the ISA

- The register ID is **renamed** to the reorder buffer entry that will hold the register’s value
  - Register ID $\rightarrow$ ROB entry ID
  - Architectural register ID $\rightarrow$ Physical register ID
  - After renaming, ROB entry ID used to refer to the register

- This eliminates anti and output dependencies
  - Gives the illusion that there are a large number of registers
Recall: Data Dependence Types

True (flow) dependence
\[ r_3 \leftarrow r_1 \text{ op } r_2 \quad \text{Read-after-Write (RAW)} \quad -- \quad \text{True} \]
\[ r_5 \leftarrow r_3 \text{ op } r_4 \]

Anti dependence
\[ r_3 \leftarrow r_1 \text{ op } r_2 \quad \text{Write-after-Read (WAR)} \quad -- \quad \text{Anti} \]
\[ r_1 \leftarrow r_4 \text{ op } r_5 \]

Output-dependence
\[ r_3 \leftarrow r_1 \text{ op } r_2 \quad \text{Write-after-Write (WAW)} \quad -- \quad \text{Output} \]
\[ r_5 \leftarrow r_3 \text{ op } r_4 \]
\[ r_3 \leftarrow r_6 \text{ op } r_7 \]
In-Order Pipeline with Reorder Buffer

- **Decode (D):** Access regfile/ROB, allocate entry in ROB, check if instruction can execute, if so **dispatch** instruction
- **Execute (E):** Instructions can complete out-of-order
- **Completion (R):** Write result to reorder buffer
- **Retirement/Commit (W):** Check for exceptions; if none, write result to architectural register file or memory; else, flush pipeline and start from exception handler
- **In-order dispatch/execution, out-of-order completion, in-order retirement**
Reorder Buffer Tradeoffs

- **Advantages**
  - Conceptually simple for supporting precise exceptions
  - Can eliminate false dependences

- **Disadvantages**
  - Reorder buffer needs to be accessed to get the results that are yet to be written to the register file
    - CAM or indirection → increased latency and complexity

- **Other solutions aim to eliminate the disadvantages**
  - History buffer
  - Future file
  - Checkpointing

*We will not cover these*