Pipelined MIPS Microarchitecture

Digital Design and Computer Architecture
Mohammad Sadrosadati
Frank K. Gürkaynak

http://safari.ethz.ch/ddca
What Will We Learn

- How to do more per unit time
  - Parallelism
  - Pipelining

- Single Cycle vs Pipelined

- Pipelined MIPS architecture

- Hazards and how to solve them
  - Data hazards
  - Control hazards

- Performance of the Pipelined architecture
Parallelism

- Two types of parallelism:
  - Spatial parallelism
    - duplicate hardware performs multiple tasks at once
  - Temporal parallelism
    - task is broken into multiple stages
    - also called pipelining
    - for example, an assembly line
Parallelism Definitions

- Some definitions:
  - **Token**: A group of inputs processed to produce a group of outputs
  - **Latency**: Time for one token to pass from start to end
  - **Throughput**: The number of tokens that can be produced per unit time

- Parallelism increases throughput.
Parallelism Example

Example:
Ben Bitdiddle is baking cookies to celebrate the installation of his traffic light controller. It takes 5 minutes to roll the cookies and 15 minutes to bake them. After finishing one batch he immediately starts the next batch. What is the latency and throughput if Ben doesn’t use parallelism?

Latency =  
Throughput =
Parallelism Example

Example:

Ben Bitdiddle is baking cookies to celebrate the installation of his traffic light controller. It takes 5 minutes to roll the cookies and 15 minutes to bake them. After finishing one batch he immediately starts the next batch. What is the latency and throughput if Ben doesn’t use parallelism?

Latency = 5 + 15 = 20 minutes = 1/3 hour

Throughput = 1 tray/ 1/3 hour = 3 trays/hour
Parallelism Example

- What is the latency and throughput if Ben uses parallelism?
  - **Spatial parallelism**: Ben asks Allysa P. Hacker to help, using her own oven
  - **Temporal parallelism**: Ben breaks the task into two stages: roll and baking. He uses two trays. While the first batch is baking he rolls the second batch, and so on.
Spatial Parallelism

Latency: time to first tray

Time

Tray 1
- Ben 1
- Ben 1

Tray 2
- Alyssa 1
- Alyssa 1

Tray 3
- Ben 2
- Ben 2

Tray 4
- Alyssa 2
- Alyssa 2

Legend:
- Roll
- Bake

Latency = 
Throughput =

Spatial Parallelism

Latency = 5 + 15 = 20 minutes = \( \frac{1}{3} \) hour

Throughput = 2 trays/ \( \frac{1}{3} \) hour = 6 trays/hour
Temporal Parallelism

Latency =

Throughput =
Temporal Parallelism

Latency: $\frac{1}{3}$ hour

Throughput: 4 trays/hour

Using both techniques, the throughput would be 8 trays/hour
Pipelined MIPS Processor

- Temporal parallelism

- Divide single-cycle processor into 5 stages:
  - Fetch
  - Decode
  - Execute
  - Memory
  - Writeback

- Add pipeline registers between stages
Single-Cycle vs. Pipelined Performance

**Single-Cycle**

- Fetch Instruction
- Decode Read Reg
- Execute ALU
- Memory Read / Write
- Write Reg

**Pipelined**

- Fetch Instruction
- Decode Read Reg
- Execute ALU
- Memory Read / Write
- Write Reg

- Fetch Instruction
- Decode Read Reg
- Execute ALU
- Memory Read / Write
- Write Reg

- Fetch Instruction
- Decode Read Reg
- Execute ALU
- Memory Read / Write
- Write Reg
Pipelining Abstraction

lw  $s2, 40($0)
add $s3, $t1, $t2
sub $s4, $s1, $s5
and $s5, $t5, $t6
sw  $s6, 20($s1)
or  $s7, $t3, $t4
Single-Cycle and Pipelined Datapath
Corrected Pipelined Datapath

- WriteReg must arrive at the same time as Result
Same control unit as single-cycle processor
Control delayed to proper pipeline stage
Pipeline Hazard

- Occurs when an instruction depends on results from previous instruction that hasn’t completed.

- Types of hazards:
  - *Data hazard*: register value not written back to register file yet
  - *Control hazard*: next instruction not decided yet (caused by branches)
Data Hazard

The register file can be read and written in the same cycle:

- write takes place during the 1st half of the cycle
- read takes place during the 2nd half of the cycle => no hazard !!!
- However operations that involve register file have only half a clock cycle to complete the operation!!
Data Hazard

- One instruction writes a register ($s0$) and next instructions read this register => read after write (RAW) hazard.
  - *add* writes into $s0$ in the first half of cycle 5
  - *and* reads $s0$ on cycle 3, obtaining the wrong value
  - *or* reads $s0$ on cycle 4, again obtaining the wrong value.
  - *sub* reads $s0$ in the second half of cycle 5, obtaining the correct value
  - subsequent instructions read the correct value of $s0$
How Can You Handle Data Hazards?

- Insert “NOP”s (No OPeration) in code at compile time
- Rearrange code at compile time
- Forward data at run time
- Stall the processor at run time
Compile-Time Hazard Elimination

- Insert enough NOPs for result to be ready
- Or (if you can) move independent useful instructions forward
Data Forwarding

\[
\begin{align*}
&\text{add } s0, s2, s3 \\
&\text{and } t0, s0, s1 \\
&\text{or } t1, s4, s0 \\
&\text{sub } t2, s0, s5
\end{align*}
\]
Data Forwarding
Data Forwarding

- Forward to Execute stage from either:
  - Memory stage or
  - Writeback stage

- When should we forward from one either Memory or Writeback stage?
  - If that stage will write a destination register and the destination register matches the source register.
  - If both the Memory and Writeback stages contain matching destination registers, the Memory stage should have priority, because it contains the more recently executed instruction.
Data Forwarding

- Forward to Execute stage from either:
  - Memory stage or
  - Writeback stage

- Forwarding logic for ForwardAE (pseudo code):

  ```
  if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then
    ForwardAE = 10  # forward from Memory stage
  else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then
    ForwardAE = 01  # forward from Writeback stage
  else
    ForwardAE = 00  # no forwarding
  ```

- Forwarding logic for ForwardBE same, but replace rsE with rtE
Stalling

lw $s0, 40($0)
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5

- Forwarding is sufficient to solve RAW data hazards
- *but ...*
Stalling

lw $s0, 40($0)
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5

The `lw` instruction does not finish reading data until the end of the Memory stage, so its result cannot be forwarded to the Execute stage of the next instruction.
Stalling

The **lw** instruction has a **two-cycle latency**, therefore a dependent instruction cannot use its result until two cycles later.

The **lw** instruction receives data from memory at the end of cycle 4. But the **and** instruction needs that data as a source operand at the beginning of cycle 4. **There is no way to solve this hazard with forwarding.**
Stalling

1w $s0, 40($0)

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5
Stalling Hardware

- **Stalls are supported by:**
  - adding enable inputs (EN) to the Fetch and Decode pipeline registers
  - and a synchronous reset/clear (CLR) input to the Execute pipeline register.

- **When a lw stall occurs**
  - StallD and StallF are asserted to force the Decode and Fetch stage pipeline registers to hold their old values.
  - FlushE is also asserted to clear the contents of the Execute stage pipeline register, introducing a bubble
Stalling Hardware
Control Hazards

- **beq:**
  - Branch is not determined until the fourth stage of the pipeline.
  - Instructions after the branch are fetched before branch occurs.
  - These instructions must be flushed if the branch happens.

- **Branch misprediction penalty**
  - Number of instructions flushed when branch is taken.
  - May be reduced by determining branch earlier.
Control Hazards: Original Pipeline
Control Hazards

20  beq $t1, $t2, 40
24  and $t0, $s0, $s1
28  or $t1, $s4, $s0
2C  sub $t2, $s0, $s5
30  ...
   ...
64  slt $t3, $s2, $s3
Control Hazards: Early Branch Resolution

Introduced another data hazard in Decode stage... this is another story
Early Branch Resolution

Carnegie Mellon

37

Time (cycles)

beq $t1, $t2, 40

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

... ...

slt $t3, $s2, $s3
Handling Data and Control Hazards

Possible solution to data hazard in Decode stage.
Control Forwarding and Stalling Hardware

// Forwarding logic:
assign ForwardAD = (rsD != 0) & (rsD == WriteRegM) & RegWriteM;
assign ForwardBD = (rtD != 0) & (rtD == WriteRegM) & RegWriteM;

// Stalling logic:
assign lwstall = ((rsD == rtE) | (rtD == rtE)) & MemtoRegE;

assign branchstall = (BranchD & RegWriteE &
                      (WriteRegE == rsD | WriteRegE == rtD)) |
                      (BranchD & MemtoRegM &
                      (WriteRegM == rsD | WriteRegM == rtD));

// Stall signals;
assign StallF = lwstall | branchstall;
assign StallD = lwstall | branchstall;
assign FLushE = lwstall | branchstall;
Branch Prediction

- Guess whether branch will be taken
  - Backward branches are usually taken (loops)
  - Perhaps consider history of whether branch was previously taken to improve the guess

- Good prediction reduces the fraction of branches requiring a flush
Pipelined Performance Example

- **SPECINT2000 benchmark:**
  - 25% loads
  - 10% stores
  - 11% branches
  - 2% jumps
  - 52% R-type

- **Suppose:**
  - 40% of loads used by next instruction
  - 25% of branches mispredicted

- All jumps flush next instruction

- What is the average CPI?
Pipelined Performance Example Solution

- Load/Branch CPI = 1 when no stalling, 2 when stalling.
  Thus:
  - \( \text{CPI}_{lw} = 1(0.6) + 2(0.4) = 1.4 \) 
  - \( \text{CPI}_{beq} = 1(0.75) + 2(0.25) = 1.25 \)

- And
  - \( \text{Average CPI} = \)
Pipelined Performance Example Solution

Load/Branch CPI = 1 when no stalling, 2 when stalling.
Thus:
- \( \text{CPI}_{lw} = 1(0.6) + 2(0.4) = 1.4 \)  
  Average CPI for load
- \( \text{CPI}_{beq} = 1(0.75) + 2(0.25) = 1.25 \)  
  Average CPI for branch

And
- **Average CPI**  
  \[ = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1) \]
  load
  store
  beq
  jump
  r-type

\[ = 1.15 \]
Pipelined Performance

- There are 5 stages, and 5 different timing paths:

\[ T_c = \max \{ \]
\[ t_{pcq} + t_{mem} + t_{setup} \quad \text{fetch} \\
2(t_{RFread} + t_{mux} + t_{eq} + t_{AND} + t_{mux} + t_{setup}) \quad \text{decode} \\
2(t_{RFwrite}) \quad \text{writeback} \]
\[ t_{pcq} + t_{mux} + t_{mux} + t_{ALU} + t_{setup} \quad \text{execute} \\
2(t_{PC} + t_{memwrite} + t_{setup}) \quad \text{memory} \]
\[ t_{PC} + t_{mem} + t_{setup} \quad \text{memory} \]
\[ } \]

- The operation speed depends on the slowest operation

- Decode and Writeback use register file and have only half a clock cycle to complete, that is why there is a 2 in front of them
## Pipelined Performance Example

<table>
<thead>
<tr>
<th>Element</th>
<th>Parameter</th>
<th>Delay (ps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Register clock-to-Q</td>
<td>$t_{\text{pcq_PC}}$</td>
<td>30</td>
</tr>
<tr>
<td>Register setup</td>
<td>$t_{\text{setup}}$</td>
<td>20</td>
</tr>
<tr>
<td>Multiplexer</td>
<td>$t_{\text{mux}}$</td>
<td>25</td>
</tr>
<tr>
<td>ALU</td>
<td>$t_{\text{ALU}}$</td>
<td>200</td>
</tr>
<tr>
<td>Memory read</td>
<td>$t_{\text{mem}}$</td>
<td>250</td>
</tr>
<tr>
<td>Register file read</td>
<td>$t_{\text{RFread}}$</td>
<td>150</td>
</tr>
<tr>
<td>Register file setup</td>
<td>$t_{\text{RFsetup}}$</td>
<td>20</td>
</tr>
<tr>
<td>Equality comparator</td>
<td>$t_{\text{eq}}$</td>
<td>40</td>
</tr>
<tr>
<td>AND gate</td>
<td>$t_{\text{AND}}$</td>
<td>15</td>
</tr>
<tr>
<td>Memory write</td>
<td>$T_{\text{memwrite}}$</td>
<td>220</td>
</tr>
<tr>
<td>Register file write</td>
<td>$t_{\text{RFwrite}}$</td>
<td>100</td>
</tr>
</tbody>
</table>

\[
T_c = 2(t_{\text{RFread}} + t_{\text{mux}} + t_{\text{eq}} + t_{\text{AND}} + t_{\text{mux}} + t_{\text{setup}}) \\
= 2[150 + 25 + 40 + 15 + 25 + 20] \text{ ps} \\
= 550 \text{ ps}
\]
Pipelined Performance Example

- For a program with 100 billion instructions executing on a pipelined MIPS processor:
  - CPI = 1.15
  - $T_c = 550$ ps

- Execution Time
  \[
  \text{Execution Time} = (\# \text{ instructions}) \times \text{CPI} \times T_c \\
  = (100 \times 10^9)(1.15)(550 \times 10^{-12}) \\
  = 63 \text{ seconds}
  \]
Performance Summary for MIPS arch.

<table>
<thead>
<tr>
<th>Processor</th>
<th>Execution Time (seconds)</th>
<th>Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single-cycle</td>
<td>95</td>
<td>1</td>
</tr>
<tr>
<td>Multicycle</td>
<td>133</td>
<td>0.71</td>
</tr>
<tr>
<td>Pipelined</td>
<td>63</td>
<td>1.51</td>
</tr>
</tbody>
</table>

- Fastest of the three MIPS architectures is **Pipelined**.

- However, even though we have 5 fold pipelining, it is not 5 times faster than single cycle.
What Did We Learn?

- **How to design a pipelined architecture**
  - Break down long combinational path by registers
  - Shortens the clock period
  - You can start processing next instruction once the first part is complete

- **Problems of the pipelined architecture**
  - Data you need for the next instruction may not be ready (*Data Hazard*)
  - You may not yet know the next instruction (*Control Hazard*)

- **Solutions to Hazards**

- **Performance of pipelined MIPS architecture**