## Digital Design & Computer Arch.

## Lecture 11: Microarchitecture Fundamentals II

Prof. Onur Mutlu

ETH Zürich
Spring 2021
15 April 2021

#### Readings

- Last time and today
  - Introduction to microarchitecture and single-cycle microarchitecture
    - H&H, Chapter 7.1-7.3
    - P&P, Appendices A and C
  - Multi-cycle microarchitecture
    - H&H, Chapter 7.4
    - P&P, Appendices A and C
- Tomorrow and next week
  - Pipelining
    - H&H, Chapter 7.5
  - Pipelining Issues
    - H&H, Chapter 7.8.1-7.8.3

#### Agenda for Today & Next Few Lectures

- Instruction Set Architectures (ISA): LC-3 and MIPS
- Assembly programming: LC-3 and MIPS
- Microarchitecture (principles & single-cycle uarch)
- Multi-cycle microarchitecture
- Pipelining
- Issues in Pipelining: Control & Data Dependence Handling,
   State Maintenance and Recovery, ...
- Out-of-Order Execution

#### Recall: A Very Basic Instruction Processing Engine

- Each instruction takes a single clock cycle to execute
- Only combinational logic is used to implement instruction execution
  - □ No intermediate, programmer-invisible state updates

AS = Architectural (programmer visible) state at the beginning of a clock cycle

Process instruction in one clock cycle

AS' = Architectural (programmer visible) state at the end of a clock cycle

#### Recall: The Instruction Processing "Cycle"



#### Instruction Processing "Cycle" vs. Machine Clock Cycle

#### Single-cycle machine:

 All six phases of the instruction processing cycle take a single machine clock cycle to complete

#### Multi-cycle machine:

- All six phases of the instruction processing cycle can take multiple machine clock cycles to complete
- In fact, each phase can take multiple clock cycles to complete

#### Recall: Single-Cycle Machine

Single-cycle machine



#### Recall: Datapath and Control Logic

- An instruction processing engine consists of two components
  - Datapath: Consists of hardware elements that deal with and transform data signals
    - functional units that operate on data
    - hardware structures (e.g., wires, muxes, decoders, tri-state bufs) that enable the flow of data into the functional units and registers
    - storage units that store data (e.g., registers)
  - Control logic: Consists of hardware elements that determine control signals, i.e., signals that specify what the datapath elements should do to the data

## Single-Cycle Datapath for Arithmetic and Logical Instructions

### Datapath for R- and I-Type ALU Insts.



if MEM[PC] == ADDI rt rs immediate
 GPR[rt] ← GPR[rs] + sign-extend (immediate)
 PC ← PC + 4

Combinational state update logic 10

## Single-Cycle Datapath for Data Movement Instructions

#### Datapath for Non-Control-Flow Insts.



## Single-Cycle Datapath for Control Flow Instructions

### Jump Instruction

Unconditional branch or jump

```
\begin{array}{|c|c|c|c|c|} \hline j & target \\ \hline \hline j & (2) & immediate \\ \hline 6 & bits & 26 & bits \\ \hline \end{array}
```

- $\square$  2 = opcode
- □ immediate (target) = target address

#### Semantics

```
if MEM[PC]== j immediate<sub>26</sub>
  target = { PC + [31:28], immediate<sub>26</sub>, 2' b00 }
  PC ← target
```

#### Unconditional Jump Datapath



#### Other Jumps in MIPS

- jal: jump and link (function calls)
  - Semantics

```
if MEM[PC]== jal immediate<sub>26</sub>

$ra \leftarrow PC + 4

target = { PC ^{+}[31:28], immediate<sub>26</sub>, 2' b00 }

PC \leftarrow target
```

- jr: jump register
  - Semantics

```
if MEM[PC] == jr rs

PC \leftarrow GPR(rs)
```

- jalr: jump and link register
  - Semantics

```
if MEM[PC]== jalr rs

ra \leftarrow PC + 4

PC \leftarrow GPR(rs)
```

#### Aside: MIPS Cheat Sheet

https://safari.ethz.ch/digitaltechnik/spring2021/lib/exe/fetc h.php?media=mips\_reference\_data.pdf

#### On the course website





#### Conditional Branch Instructions

beq (Branch if Equal)



Semantics (assuming no branch delay slot)

```
if MEM[PC] == beq rs rt immediate<sub>16</sub>
  target = PC<sup>+</sup> + sign-extend(immediate) x 4
  if GPR[rs]==GPR[rt] then PC ← target
  else PC ← PC + 4
```

Variations: beq, bne, blez, bgtz

#### Conditional Branch Datapath (for you to finish)



How to uphold the delayed branch semantid??

## Putting It All Together



## Single-Cycle Control Logic

#### Single-Cycle Hardwired Control

#### As combinational function of Inst=MEM[PC]



#### Consider

- All R-type and I-type ALU instructions
- Iw and sw
- beq, bne, blez, bgtz
- □ j, jr, jal, jalr

## Generate Control Signals (in Orange Color)



### Single-Bit Control Signals (I)

|          | When De-asserted                                                | When asserted                                                        | Equation                                                               |
|----------|-----------------------------------------------------------------|----------------------------------------------------------------------|------------------------------------------------------------------------|
| RegDest  | GPR write select according to rt, i.e., inst[20:16]             | GPR write select according to rd, i.e., inst[15:11]                  | opcode==0                                                              |
| ALUSrc   | 2 <sup>nd</sup> ALU input from 2 <sup>nd</sup><br>GPR read port | 2 <sup>nd</sup> ALU input from sign-<br>extended 16-bit<br>immediate | (opcode!=0) &&<br>(opcode!=BEQ) &&<br>(opcode!=BNE)                    |
| MemtoReg | Steer ALU result to GPR write port                              | steer memory load to<br>GPR write port                               | opcode==LW                                                             |
| RegWrite | GPR write disabled                                              | GPR write enabled                                                    | (opcode!=SW) &&<br>(opcode!=Bxx) &&<br>(opcode!=J) &&<br>(opcode!=JR)) |

### Single-Bit Control Signals (II)

|                    | When De-asserted                | When asserted                                             | Equation                                 |
|--------------------|---------------------------------|-----------------------------------------------------------|------------------------------------------|
| MemRead            | Memory read disabled            | Memory read port return load value                        | opcode==LW                               |
| MemWrite           | Memory write disabled           | Memory write enabled                                      | opcode==SW                               |
| PCSrc <sub>1</sub> | According to PCSrc <sub>2</sub> | next PC is based on 26-<br>bit immediate jump<br>target   | (opcode==J)   <br>(opcode==JAL)          |
| PCSrc <sub>2</sub> | next PC = PC + 4                | next PC is based on 16-<br>bit immediate branch<br>target | (opcode==Bxx) &&<br>"bcond is satisfied" |

#### R-Type ALU



\*\*Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

### I-Type ALU









#### Branch (Not Taken)

## Some control signals are dependent on the processing of data



#### Branch (Taken)

## Some control signals are dependent on the processing of data





\*\*Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

#### What is in That Control Box?

- Combinational Logic → Hardwired Control
  - Idea: Control signals generated combinationally based on bits in instruction encoding
- Sequential Logic → Sequential Control
  - Idea: A memory structure contains the control signals associated with an instruction
    - Called Control Store
- Both types of control structure can be used in single-cycle processors
  - Choice depends on latency of each structure + how much on the critical path control signal generation is, etc.

### Review: Complete Single-Cycle Processor



# Another Single-Cycle MIPS Processor (from H&H)

See backup slides to reinforce the concepts we have covered.

They are to complement your reading:

H&H, Chapter 7.1-7.3, 7.6

#### Another Complete Single-Cycle Processor



## Example: Single-Cycle Datapath: 1w fetch

#### STEP 1: Fetch instruction







## Single-Cycle Datapath: 1w register read

STEP 2: Read source operands from register file







## Single-Cycle Datapath: 1w immediate

STEP 3: Sign-extend the immediate





lw \$s3, 1(\$0) # read memory word 1 into \$s3

I-Type
op rs rt imm
6 bits 5 bits 5 bits 16 bits

## Single-Cycle Datapath: 1w address

STEP 4: Compute the memory address





### Single-Cycle Datapath: 1w memory read

STEP 5: Read from memory and write back to register file



lw \$s3, 1(\$0) # read memory word 1 into \$s3

#### **I-Type**

| op     | rs     | rt     | imm     |
|--------|--------|--------|---------|
| 6 bits | 5 bits | 5 bits | 16 bits |

## Single-Cycle Datapath: 1w PC increment

STEP 6: Determine address of next instruction



lw \$s3, 1(\$0) # read memory word 1 into \$s3

#### **I-Type**

| op     | rs     | rt     | imm     |
|--------|--------|--------|---------|
| 6 bits | 5 bits | 5 bits | 16 bits |

## Similarly, We Need to Design the Control Unit

Control signals are generated by the decoder in control unit

| Instruction | Op <sub>5:0</sub> | RegWrite | RegDst | AluSrc | Branch | MemWrite | MemtoReg | ALUOp <sub>1:0</sub> | Jump |
|-------------|-------------------|----------|--------|--------|--------|----------|----------|----------------------|------|
| R-type      | 000000            | 1        | 1      | 0      | 0      | 0        | 0        | 10                   | 0    |
| lw          | 100011            | 1        | 0      | 1      | 0      | 0        | 1        | 00                   | 0    |
| SW          | 101011            | 0        | X      | 1      | 0      | 1        | X        | 00                   | 0    |
| beq         | 000100            | 0        | X      | 0      | 1      | 0        | X        | 01                   | 0    |
| addi        | 001000            | 1        | 0      | 1      | 0      | 0        | 0        | 00                   | 0    |
| j           | 000010            | 0        | X      | X      | X      | 0        | X        | XX                   | 1    |

## Another Complete Single-Cycle Processor (H&H)



## Your Reading Assignment

- Please read the Lecture Slides & the Backup Slides
- Please do your readings from the H&H Book
  - □ H&H, Chapter 7.1-7.3, 7.6

## Single-Cycle Uarch I (We Developed in Lectures)



## Single-Cycle Uarch II (In Your Readings)



## Evaluating the Single-Cycle Microarchitecture

## A Single-Cycle Microarchitecture

Is this a good idea/design?

When is this a good design?

When is this a bad design?

How can we design a better microarchitecture?

## Performance Analysis Basics

## Recall: Performance Analysis Basics

- Execution time of a single instruction
  - □ {CPI} x {clock cycle time}
    - CPI: Number of cycles it takes to execute an instruction
- Execution time of an entire program
  - Sum over all instructions [{CPI} x {clock cycle time}]

- How fast is my program?
  - Every program consists of a series of instructions
  - Each instruction needs to be executed

- How fast is my program?
  - Every program consists of a series of instructions
  - Each instruction needs to be executed
- How fast are my instructions?
  - Instructions are realized on the hardware
  - Each instruction can take one or more clock cycles to complete
  - Cycles per Instruction = CPI

#### How fast is my program?

- Every program consists of a series of instructions
- Each instruction needs to be executed.

#### How fast are my instructions?

- Instructions are realized on the hardware
- Each instruction can take one or more clock cycles to complete
- Cycles per Instruction = CPI

#### How long is one clock cycle?

- The critical path determines how much time one cycle requires = clock period.
- 1/clock period = clock frequency = how many cycles can be done each second.

#### As a general formula

- Our program consists of executing N instructions
- Our processor needs CPI cycles (on average) for each instruction
- The clock frequency of the processor is f
  - → the clock period is therefore T=1/f

- As a general formula
  - Our program consists of executing N instructions
  - Our processor needs CPI cycles (on average) for each instruction
  - The clock frequency of the processor is f
    - → the clock period is therefore T=1/f

Our program executes in

$$N \times CPI \times (1/f) =$$

N x CPI x T seconds

# Performance Analysis of Our Single-Cycle Design

## A Single-Cycle Microarchitecture: Analysis

- Every instruction takes 1 cycle to execute
  - CPI (Cycles per instruction) is strictly 1
- How long each instruction takes is determined by how long the slowest instruction takes to execute
  - Even though many instructions do not need that long to execute
- Clock cycle time of the microarchitecture is determined by how long it takes to complete the slowest instruction
  - Critical path of the design is determined by the processing time of the slowest instruction

#### What is the Slowest Instruction to Process?

- Let's go back to the basics
- All six phases of the instruction processing cycle take a single machine clock cycle to complete
- Fetch
- Decode
- Evaluate Address
- Fetch Operands
- Execute
- Store Result

- 1. Instruction fetch (IF)
- 2. Instruction decode and register operand fetch (ID/RF)
- 3. Execute/Evaluate memory address (EX/AG)
- 4. Memory operand fetch (MEM)
- 5. Store/writeback result (WB)

Do each of the above phases take the same time (latency) for all instructions?

#### Let's Find the Critical Path



[Based on original figure from P&HCO&D, COPYRIGHT 2004 Elsevier, ALL RIGHTS RESERVED.]

## Example Single-Cycle Datapath Analysis

- Assume (for the design in the previous slide)
  - memory units (read or write): 200 ps
  - ALU and adders: 100 ps
  - register file (read or write): 50 ps
  - other combinational logic: 0 ps

| steps     | IF  | ID | EX  | MEM | WB | Delay |  |
|-----------|-----|----|-----|-----|----|-------|--|
| resources | mem | RF | ALU | mem | RF |       |  |
| R-type    | 200 | 50 | 100 |     | 50 | 400   |  |
| I-type    | 200 | 50 | 100 |     | 50 | 400   |  |
| LW        | 200 | 50 | 100 | 200 | 50 | 600   |  |
| SW        | 200 | 50 | 100 | 200 |    | 550   |  |
| Branch    | 200 | 50 | 100 |     |    | 350   |  |
| Jump      | 200 |    |     |     |    | 200   |  |

#### Let's Find the Critical Path



[Based on original figure from P&HCO&D, COPYRIGHT 2004 Elsevier, ALL RIGHTS RESERVED.]

## R-Type and I-Type ALU









#### Branch Taken



## Jump



## What About Control Logic?

- How does that affect the critical path?
- Food for thought for you:
  - Can control logic be on the critical path?
  - Historical example:
    - CDC 5600: control store access too long...

#### What is the Slowest Instruction to Process?

- Real world: Memory is slow (not magic)
- What if memory sometimes takes 100ms to access?
- Does it make sense to have a simple register to register add or jump to take {100ms+all else to do a memory operation}?
- And, what if you need to access memory more than once to process an instruction?
  - Which instructions need this?
  - Do you provide multiple ports to memory?

## Single Cycle uArch: Complexity

#### Contrived

All instructions run as slow as the slowest instruction

#### Inefficient

- All instructions run as slow as the slowest instruction
- Must provide worst-case combinational resources in parallel as required by any instruction
- Need to replicate a resource if it is needed more than once by an instruction during different parts of the instruction processing cycle
- Not necessarily the simplest way to implement an ISA
  - Single-cycle implementation of REP MOVS (x86) or INDEX (VAX)?
- Not easy to optimize/improve performance
  - Optimizing the common case does not work (e.g. common instructions)
  - Need to optimize the worst case all the time

## (Micro)architecture Design Principles

#### Critical path design

- Find and decrease the maximum combinational logic delay
- Break a path into multiple cycles if it takes too long
- Bread and butter (common case) design
  - Spend time and resources on where it matters most
    - i.e., improve what the machine is really designed to do
  - Common case vs. uncommon case

#### Balanced design

- Balance instruction/data flow through hardware components
- Design to eliminate bottlenecks: balance the hardware for the work

## Single-Cycle Design vs. Design Principles

- Critical path design
- Bread and butter (common case) design
- Balanced design

How does a single-cycle microarchitecture fare with respect to these principles?

## Aside: System Design Principles

- When designing computer systems/architectures, it is important to follow good principles
  - Actually, this is true for \*any\* system design
    - Real architectures, buildings, bridges, ...
    - Good consumer products
    - **...**

- Remember: "principled design" from our second lecture
  - Frank Lloyd Wright: "architecture [...] based upon principle, and not upon precedent"

#### Aside: From Lecture 2

"architecture [...] based upon principle, and not upon precedent"



## This



## That



#### Recall: Takeaways

 It all starts from the basic building blocks and design principles

And, knowledge of how to use, apply, enhance them

- Underlying technology might change (e.g., steel vs. wood)
  - but methods of taking advantage of technology bear resemblance
  - methods used for design depend on the principles employed

## Aside: System Design Principles

- We will continue to cover key principles in this course
- Here are some references where you can learn more
- Yale Patt, "Requirements, Bottlenecks, and Good Fortune: Agents for Microprocessor Evolution," Proc. of IEEE, 2001. (Levels of transformation, design point, etc)
- Mike Flynn, "Very High-Speed Computing Systems," Proc. of IEEE, 1966. (Flynn's Bottleneck → Balanced design)
- Gene M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS Conference, April 1967. (Amdahl's Law → Common-case design)
- Butler W. Lampson, "Hints for Computer System Design," ACM Operating Systems Review, 1983.
  - http://research.microsoft.com/pubs/68221/acrobat.pdf

## A Key System Design Principle

- Keep it simple
- "Everything should be made as simple as possible, but no simpler."
  - Albert Einstein
- And, keep it low cost: "An engineer is a person who can do for a dime what any fool can do for a dollar."
- For more, see:
  - Butler W. Lampson, "Hints for Computer System Design," ACM Operating Systems Review, 1983.
  - http://research.microsoft.com/pubs/68221/acrobat.pdf

# Multi-Cycle Microarchitectures

#### Multi-Cycle Microarchitectures

 Goal: Let each instruction take (close to) only as much time it really needs

#### Idea

- Determine clock cycle time independently of instruction processing time
- Each instruction takes as many clock cycles as it needs to take
  - Multiple state transitions per instruction
  - The states followed by each instruction is different

## Recall: The "Process Instruction" Step

- ISA specifies abstractly what AS' should be, given an instruction and AS
  - It defines an abstract finite state machine where
    - State = programmer-visible state
    - Next-state logic = instruction execution specification
  - From ISA point of view, there are no "intermediate states" between AS and AS' during instruction execution
    - One state transition per instruction
- Microarchitecture implements how AS is transformed to AS'
  - There are many choices in implementation
  - We can have programmer-invisible state to optimize the speed of instruction execution: multiple state transitions per instruction
    - Choice 1:  $AS \rightarrow AS'$  (transform AS to AS' in a single clock cycle)
    - Choice 2: AS → AS+MS1 → AS+MS2 → AS+MS3 → AS' (take multiple clock cycles to transform AS to AS')

#### Multi-Cycle Microarchitecture

AS = Architectural (programmer visible) state at the beginning of an instruction



Step 1: Process part of instruction in one clock cycle



Step 2: Process part of instruction in the next clock cycle





AS' = Architectural (programmer visible) state at the end of a clock cycle

## Benefits of Multi-Cycle Design

#### Critical path design

 Can keep reducing the critical path independently of the worstcase processing time of any instruction

#### Bread and butter (common case) design

Can optimize the number of states it takes to execute "important" instructions that make up much of the execution time

#### Balanced design

- No need to provide more capability or resources than really needed
  - An instruction that needs resource X multiple times does not require multiple X's to be implemented
  - Leads to more efficient hardware: Can reuse hardware components needed multiple times for an instruction

## Downsides of Multi-Cycle Design

- Need to store the intermediate results at the end of each clock cycle
  - Hardware overhead for registers
  - Register setup/hold overhead paid multiple times for an instruction

### Remember: Performance Analysis

- Execution time of a single instruction
  - □ {CPI} x {clock cycle time} CPI: Cycles Per Instruction
- Execution time of an entire program
  - Sum over all instructions [{CPI} x {clock cycle time}]
- Single-cycle microarchitecture performance
  - □ CPI = 1
  - Clock cycle time = long
- Multi-cycle microarchitecture performance
  - CPI = different for each instruction
    - Average CPI → hopefully small
  - □ Clock cycle time = short

In multi-cycle, we have two degrees of freedom to optimize independently

# A Multi-Cycle Microarchitecture A Closer Look

#### How Do We Implement This?

 Maurice Wilkes, "The Best Way to Design an Automatic Calculating Machine," Manchester Univ. Computer Inaugural Conf., 1951.

## THE BEST WAY TO DESIGN AN AUTOMATIC CALCULATING MACHINE

By M. V. Wilkes, M.A., Ph.D., F.R.A.S.



- An elegant implementation:
  - The concept of microcoded/microprogrammed machines

#### Multi-Cycle Microarchitectures

- Key Idea for Realization
  - One can implement the "process instruction" step as a finite state machine that sequences between states and eventually returns back to the "fetch instruction" state
  - A state is defined by the control signals asserted in it
  - Control signals for the next state are determined in current state

## Recall: The Instruction Processing "Cycle"



#### A Basic Multi-Cycle Microarchitecture

- Instruction processing cycle divided into "states"
  - A stage in the instruction processing cycle can take multiple states
- A multi-cycle microarchitecture sequences from state to state to process an instruction
  - The behavior of the machine in a state is completely determined by control signals in that state
- The behavior of the entire processor is specified fully by a finite state machine
- In a state (clock cycle), control signals control two things:
  - How the datapath should process the data
  - How to generate the control signals for the (next) clock cycle

# One Example Multi-Cycle Microarchitecture

#### Remember: Single-Cycle MIPS Processor



#### **Multi-Cycle MIPS Processor**

#### Single-cycle microarchitecture:

- cycle time limited by longest instruction  $(1w) \rightarrow low clock frequency$
- three adders/ALUs and two memories → high hardware cost

#### Multi-cycle microarchitecture:

- + higher clock frequency
- + simpler instructions take few clock cycles
- reuse expensive hardware across multiple cycles
- sequencing overhead paid many times
- hardware overhead for storing intermediate results

#### Multi-cycle requires the same design steps as single cycle:

- datapath
- control logic

#### What Do We Want To Optimize?

- Single-cycle microarchitecture uses two memories
  - One memory stores instructions, the other data
  - We want to use a single memory (lower cost)

#### What Do We Want To Optimize?

- Single-cycle microarchitecture uses two memories
  - One memory stores instructions, the other data
  - We want to use a single memory (lower cost)
- Single-cycle microarchitecture needs three adders
  - ALU, PC, Branch address calculation
  - We want to use the ALU for all operations (lower cost)

#### What Do We Want To Optimize?

- Single-cycle microarchitecture uses two memories
  - One memory stores instructions, the other data
  - We want to use a single memory (lower cost)
- Single-cycle microarchitecture needs three adders
  - ALU, PC, Branch address calculation
  - We want to use the ALU for all operations (lower cost)
- Single-cycle microarchitecture: each instruction takes one cycle
  - The slowest instruction slows down every single instruction
  - We want to determine clock cycle time independently of instruction processing time
    - Divide each instruction into multiple clock cycles
    - Simpler instructions can be very fast (compared to the slowest)

## Let's Construct the Multi-Cycle Datapath

#### Consider the lw Instruction

■ For an instruction such as: lw \$t0, 0x20(\$t1)

#### We need to:

- Read the instruction from memory
- Then read \$t1 from register array
- Add the immediate value (0x20) to calculate the memory address
- Read the content of this address.
- Write to the register \$t0 this content

#### Multi-Cycle Datapath: Instruction Fetch

- We will consider lw, but fetch is the same for all instructions
  - STEP 1: Fetch instruction



read from the memory location [rs]+imm to location [rt]

| I- I ype |        |        |         |
|----------|--------|--------|---------|
| op       | rs     | rt     | imm     |
| 6 bits   | 5 bits | 5 bits | 16 bits |

#### Multi-Cycle Datapath: 1w register read





#### Multi-Cycle Datapath: 1w immediate



#### Multi-Cycle Datapath: 1w address



#### Multi-Cycle Datapath: 1w memory read





#### Multi-Cycle Datapath: 1w write register





#### Multi-Cycle Datapath: increment PC



#### Multi-Cycle Datapath: sw

Write data in rt to memory



#### Multi-Cycle Datapath: R-type Instructions

#### Read from rs and rt

- Write ALUResult to register file
- Write to rd (instead of rt)



# Multi-Cycle Datapath: beq

- Determine whether values in rs and rt are equal
  - Calculate branch target address:
     Target Address = (sign-extended immediate << 2) + (PC+4)</li>



# **Complete Multi-Cycle Processor**



# Let's Construct the Multi-Cycle Control Logic

#### **Control Unit**



#### Main Controller FSM: Fetch





#### Main Controller FSM: Fetch





#### Main Controller FSM: Decode





#### **Main Controller FSM: Address Calculation**



#### **Main Controller FSM: Address Calculation**



#### Main Controller FSM: 1w



#### Main Controller FSM: sw



# Main Controller FSM: R-Type



# Main Controller FSM: beq



# Complete Multi-Cycle Controller FSM



#### Main Controller FSM: addi



#### Main Controller FSM: addi



# **Extended Functionality: j**



# Control FSM: j



# Control FSM: j



# Review: Single-Cycle MIPS Processor



# Review: Single-Cycle MIPS FSM

Single-cycle machine



129

# Review: Multi-Cycle MIPS Processor



# Review: Multi-Cycle MIPS FSM



What is the shortcoming of this design?

What does this design assume about memory?

# What If Memory Takes > One Cycle?

- Stay in the same "memory access" state until memory returns the data
- "Memory Ready?" bit is an input to the control logic that determines the next state

# Another Example: Microprogrammed Multi-Cycle Microarchitecture

# Recall: How Do We Implement This?

 Maurice Wilkes, "The Best Way to Design an Automatic Calculating Machine," Manchester Univ. Computer Inaugural Conf., 1951.

# THE BEST WAY TO DESIGN AN AUTOMATIC CALCULATING MACHINE

By M. V. Wilkes, M.A., Ph.D., F.R.A.S.



- An elegant implementation:
  - The concept of microcoded/microprogrammed machines

# Example uProgrammed Control & Datapath



# For More on Microprogrammed Designs



# Detailed Lectures on Microprogramming

- Design of Digital Circuits, Spring 2018, Lecture 13
  - Microprogramming (ETH Zürich, Spring 2018)
  - https://www.youtube.com/watch?v=u4GhShuBP3Y&list=PL5Q2soXY2Zi\_QedyPWtR mFUJ2F8DdYP7l&index=13
- Computer Architecture, Spring 2013, Lecture 7
  - Microprogramming (CMU, Spring 2013)
  - https://www.youtube.com/watch?v=\_igvSl5h8cs&list=PL5PHm2jkkXmidJOd59REog
     9jDnPDTG6IJ&index=7

# Digital Design & Computer Arch.

# Lecture 11: Microarchitecture Fundamentals II

Prof. Onur Mutlu

ETH Zürich
Spring 2021
15 April 2021

# Backup Slides on Single-Cycle Uarch for Your Own Study

Please study these to reinforce the concepts we covered in lectures.

Please do the readings together with these slides: H&H, Chapter 7.1-7.3, 7.6

# Another Single-Cycle MIPS Processor (from H&H)

These are slides for your own study. They are to complement your reading H&H, Chapter 7.1-7.3, 7.6

# What to do with the Program Counter?

- The PC needs to be incremented by 4 during each cycle (for the time being).
- Initial PC value (after reset) is 0x00400000

### We Need a Register File

- Store 32 registers, each 32-bit
  - $2^5 == 32$ , we need 5 bits to address each
- Every R-type instruction uses 3 register
  - Two for reading (RS, RT)
  - One for writing (RD)
- We need a special memory with:
  - 2 read ports (address x2, data out x2)
  - 1 write port (address, data in)

### Register File

```
input [4:0] a_rs, a_rt, a_rd;
input [31:0] di_rd;
input we_rd;
output [31:0] do rs, do rt;
 reg [31:0] R_arr [31:0]; // Array that stores regs
 // Circuit description
 assign do_rs = R_arr[a_rs];  // Read RS
  assign do rt = R arr[a rt];
                           // Read RT
  always @ (posedge clk)
     if (we_rd) R_arr[a_rd] <= di_rd; // write RD</pre>
```

### Register File

```
input [4:0] a_rs, a_rt, a_rd;
input [31:0] di rd;
input we_rd;
output [31:0] do rs, do rt;
 reg [31:0] R arr [31:0]; // Array that stores regs
 // Circuit description; add the trick with $0
  assign do_rs = (a_rs != 5'b00000)? // is address 0?
                 R_arr[a_rs] : 0;  // Read RS or 0
 assign do_rt = (a_rt != 5'b00000)? // is address 0?
                 R_arr[a_rt] : 0;  // Read RT or 0
  always @ (posedge clk)
     if (we_rd) R_arr[a_rd] <= di_rd; // write RD</pre>
```

#### **Data Memory Example**

Will be used to store the bulk of data

```
input [15:0] addr; // Only 16 bits in this example
input [31:0] di;
input
     we;
output [31:0] do;
 reg [31:0] M_arr [0:65535]; // Array for Memory
 // Circuit description
 assign do = M arr[addr];
                                 // Read memory
 always @ (posedge clk)
     if (we) M_arr[addr] <= di;  // write memory</pre>
```

#### Single-Cycle Datapath: 1w fetch

#### STEP 1: Fetch instruction







#### Single-Cycle Datapath: 1w register read

■ *STEP 2:* Read source operands from register file





147

lw \$s3, 1(\$0) # read memory word 1 into \$s3

I-Type
op rs rt imm
6 bits 5 bits 5 bits 16 bits

#### Single-Cycle Datapath: 1w immediate

STEP 3: Sign-extend the immediate







#### Single-Cycle Datapath: 1w address

STEP 4: Compute the memory address





#### Single-Cycle Datapath: 1w memory read

■ STEP 5: Read from memory and write back to register file



| op     | rs     | rt     | imm     |
|--------|--------|--------|---------|
| 6 bits | 5 bits | 5 bits | 16 bits |

### Single-Cycle Datapath: 1w PC increment

STEP 6: Determine address of next instruction



lw \$s3, 1(\$0) # read memory word 1 into \$s3

#### **I-Type**

| ор     | rs     | rt     | imm     |
|--------|--------|--------|---------|
| 6 bits | 5 bits | 5 bits | 16 bits |

#### Single-Cycle Datapath: sw

Write data in rt to memory



sw \$t7, 44(\$0) # write t7 into memory address 44

#### **I-Type**

| op     | rs     | rt     | imm     |
|--------|--------|--------|---------|
| 6 bits | 5 bits | 5 bits | 16 bits |

#### Single-Cycle Datapath: R-type Instructions

Read from rs and rt, write ALUResult to register file





#### Single-Cycle Datapath: beq



```
beq $s0, $s1, target # branch is taken
```

Determine whether values in rs and rt are equal Calculate BTA = (sign-extended immediate << 2) + (PC+4)</p>

#### **Complete Single-Cycle Processor**



#### **Our MIPS Datapath has Several Options**

- ALU inputs
  - Either RT or Immediate (MUX)
- Write Address of Register File
  - Either RD or RT (MUX)
- Write Data In of Register File
  - Either ALU out or Data Memory Out (MUX)
- Write enable of Register File
  - Not always a register write (MUX)
- Write enable of Memory
  - Only when writing to memory (sw) (MUX)
    - All these options are our control signals

#### **Control Unit**



| ALUOp | Meaning             |  |  |  |  |
|-------|---------------------|--|--|--|--|
| 00    | add                 |  |  |  |  |
| 01    | subtract            |  |  |  |  |
| 10    | look at funct field |  |  |  |  |
| 11    | n/a                 |  |  |  |  |

#### **ALU Does the Real Work in a Processor**



| F <sub>2:0</sub> | Function |
|------------------|----------|
| 000              | A & B    |
| 001              | A   B    |
| 010              | A + B    |
| 011              | not used |
| 100              | A & ~B   |
| 101              | A   ~B   |
| 110              | A - B    |
| 111              | SLT      |

#### **ALU Internals**



| F <sub>2:0</sub> | Function |
|------------------|----------|
| 000              | A & B    |
| 001              | A   B    |
| 010              | A + B    |
| 011              | not used |
| 100              | A & ~B   |
| 101              | A   ~B   |
| 110              | A - B    |
| 111              | SLT      |

#### **Control Unit: ALU Decoder**



| ALUOp <sub>1:0</sub> | Meaning       |
|----------------------|---------------|
| 00                   | Add           |
| 01                   | Subtract      |
| 10                   | Look at Funct |
| 11                   | Not Used      |

| ALUOp <sub>1:0</sub> | Funct        | ALUControl <sub>2:0</sub> |
|----------------------|--------------|---------------------------|
| 00                   | X            | 010 (Add)                 |
| X1                   | X            | 110 (Subtract)            |
| 1X                   | 100000 (add) | 010 (Add)                 |
| 1X                   | 100010 (sub) | 110 (Subtract)            |
| 1X                   | 100100 (and) | 000 (And)                 |
| 1X                   | 100101 (or)  | 001 (Or)                  |
| 1X                   | 101010(slt)  | 111 (SLT)                 |

IbU

| Instruction | Op <sub>5:0</sub> | RegWrite | RegDst | AluSrc | MemWrite | MemtoReg | ALUOp |
|-------------|-------------------|----------|--------|--------|----------|----------|-------|
|             |                   |          |        |        |          |          |       |
|             |                   |          |        |        |          |          |       |
|             |                   |          |        |        |          |          |       |
|             |                   |          |        |        |          |          |       |

• RegWrite: Write enable for the register file

• RegDst: Write to register RD or RT

AluSrc: ALU input RT or immediate

MemWrite: Write Enable

MemtoReg: Register data in from Memory or ALU

| Instruction | Op <sub>5:0</sub> | RegWrite | RegDst | AluSrc | MemWrite | MemtoReg | ALUOp |
|-------------|-------------------|----------|--------|--------|----------|----------|-------|
| R-type      | 000000            | 1        | 1      | 0      | 0        | 0        | funct |
|             |                   |          |        |        |          |          |       |
|             |                   |          |        |        |          |          |       |

RegWrite: Write enable for the register file

• RegDst: Write to register RD or RT

AluSrc: ALU input RT or immediate

MemWrite: Write Enable

MemtoReg: Register data in from Memory or ALU

| Instruction | Op <sub>5:0</sub> | RegWrite | RegDst | AluSrc | MemWrite | MemtoReg | ALUOp |
|-------------|-------------------|----------|--------|--------|----------|----------|-------|
| R-type      | 000000            | 1        | 1      | 0      | 0        | 0        | funct |
| lw          | 100011            | 1        | 0      | 1      | 0        | 1        | add   |
|             |                   |          |        |        |          |          |       |

RegWrite: Write enable for the register file

RegDst: Write to register RD or RT

AluSrc: ALU input RT or immediate

MemWrite: Write Enable

MemtoReg: Register data in from Memory or ALU

| Instruction | Op <sub>5:0</sub> | RegWrite | RegDst | AluSrc | MemWrite | MemtoReg | ALUOp |
|-------------|-------------------|----------|--------|--------|----------|----------|-------|
| R-type      | 000000            | 1        | 1      | 0      | 0        | 0        | funct |
| lw          | 100011            | 1        | 0      | 1      | 0        | 1        | add   |
| SW          | 101011            | 0        | Χ      | 1      | 1        | X        | add   |

RegWrite: Write enable for the register file

RegDst: Write to register RD or RT

AluSrc: ALU input RT or immediate

MemWrite: Write Enable

MemtoReg: Register data in from Memory or ALU

#### **More Control Signals**

| Instruction | Op <sub>5:0</sub> | RegWrite | RegDst | AluSrc | Branch | MemWrite | MemtoReg | ALUOp |
|-------------|-------------------|----------|--------|--------|--------|----------|----------|-------|
| R-type      | 000000            | 1        | 1      | 0      | 0      | 0        | 0        | funct |
| lw          | 100011            | 1        | 0      | 1      | 0      | 0        | 1        | add   |
| SW          | 101011            | 0        | Χ      | 1      | 0      | 1        | X        | add   |
| beq         | 000100            | 0        | X      | 0      | 1      | 0        | X        | sub   |

#### New Control Signal

**Branch:** Are we jumping or not?

#### **Control Unit: Main Decoder**

| Instruction | Op <sub>5:0</sub> | RegWrite | RegDst | AluSrc | Branch | MemWrite | MemtoReg | ALUOp <sub>1:0</sub> |
|-------------|-------------------|----------|--------|--------|--------|----------|----------|----------------------|
| R-type      | 000000            | 1        | 1      | 0      | 0      | 0        | 0        | 10                   |
| lw          | 100011            | 1        | 0      | 1      | 0      | 0        | 1        | 00                   |
| SW          | 101011            | 0        | Χ      | 1      | 0      | 1        | X        | 00                   |
| beq         | 000100            | 0        | X      | 0      | 1      | 0        | X        | 01                   |



#### Single-Cycle Datapath Example: or



#### Extended Functionality: addi



No change to datapath

### Control Unit: addi

| Instruction | Op <sub>5:0</sub> | RegWrite | RegDst | AluSrc | Branch | MemWrite | MemtoReg | ALUOp <sub>1:0</sub> |
|-------------|-------------------|----------|--------|--------|--------|----------|----------|----------------------|
| R-type      | 000000            | 1        | 1      | 0      | 0      | 0        | 0        | 10                   |
| lw          | 100011            | 1        | 0      | 1      | 0      | 0        | 1        | 00                   |
| SW          | 101011            | 0        | Χ      | 1      | 0      | 1        | X        | 00                   |
| beq         | 000100            | 0        | X      | 0      | 1      | 0        | X        | 01                   |
| addi        | 001000            | 1        | 0      | 1      | 0      | 0        | 0        | 00                   |

### **Extended Functionality: j**



#### **Control Unit: Main Decoder**

| Instruction | Op <sub>5:0</sub> | RegWrite | RegDst | AluSrc | Branch | MemWrite | MemtoReg | ALUOp <sub>1:0</sub> | Jump |
|-------------|-------------------|----------|--------|--------|--------|----------|----------|----------------------|------|
| R-type      | 000000            | 1        | 1      | 0      | 0      | 0        | 0        | 10                   | 0    |
| lw          | 100011            | 1        | 0      | 1      | 0      | 0        | 1        | 00                   | 0    |
| SW          | 101011            | 0        | X      | 1      | 0      | 1        | X        | 00                   | 0    |
| beq         | 000100            | 0        | X      | 0      | 1      | 0        | X        | 01                   | 0    |
| j           | 000100            | 0        | X      | X      | X      | 0        | X        | XX                   | 1    |

### Review: Complete Single-Cycle Processor (H&H)



# A Bit More on Performance Analysis

- Reduce the number of instructions
  - Make instructions that 'do' more (CISC)
  - Use better compilers

- Reduce the number of instructions
  - Make instructions that 'do' more (CISC)
  - Use better compilers
- Use less cycles to perform the instruction
  - Simpler instructions (RISC)
  - Use multiple units/ALUs/cores in parallel

#### Reduce the number of instructions

- Make instructions that 'do' more (CISC)
- Use better compilers

#### Use less cycles to perform the instruction

- Simpler instructions (RISC)
- Use multiple units/ALUs/cores in parallel

#### Increase the clock frequency

- Find a 'newer' technology to manufacture
- Redesign time critical components
- Adopt pipelining

## Performance Analysis of Single-Cycle vs. Multi-Cycle Designs

## Single-Cycle Performance

T<sub>C</sub> is limited by the critical path (1w)



## Single-Cycle Performance

- Single-cycle critical path:
  - $T_c = t_{pcq\_PC} + t_{mem} + max(t_{RFread}, t_{sext} + t_{mux}) + t_{ALU} + t_{mem} + t_{mux} + t_{RFsetup}$
- In most implementations, limiting paths are:
  - memory, ALU, register file.
  - $T_c = t_{pcq\_PC} + 2t_{mem} + t_{RFread} + t_{mux} + t_{ALU} + t_{RFsetup}$



180

| Element             | Parameter            | Delay (ps) |
|---------------------|----------------------|------------|
| Register clock-to-Q | t <sub>pcq_PC</sub>  | 30         |
| Register setup      | t <sub>setup</sub>   | 20         |
| Multiplexer         | t <sub>mux</sub>     | 25         |
| ALU                 | t <sub>ALU</sub>     | 200        |
| Memory read         | t <sub>mem</sub>     | 250        |
| Register file read  | t <sub>RFread</sub>  | 150        |
| Register file setup | t <sub>RFsetup</sub> | 20         |

$$T_c =$$

| Element             | Parameter            | Delay (ps) |
|---------------------|----------------------|------------|
| Register clock-to-Q | t <sub>pcq_PC</sub>  | 30         |
| Register setup      | t <sub>setup</sub>   | 20         |
| Multiplexer         | t <sub>mux</sub>     | 25         |
| ALU                 | t <sub>ALU</sub>     | 200        |
| Memory read         | t <sub>mem</sub>     | 250        |
| Register file read  | t <sub>RFread</sub>  | 150        |
| Register file setup | t <sub>RFsetup</sub> | 20         |

$$T_c = t_{pcq\_PC} + 2t_{mem} + t_{RFread} + t_{mux} + t_{ALU} + t_{RFsetup}$$
  
= [30 + 2(250) + 150 + 25 + 200 + 20] ps  
= 925 ps

#### Example:

For a program with 100 billion instructions executing on a single-cycle MIPS processor:

#### Example:

For a program with 100 billion instructions executing on a single-cycle MIPS processor:

```
Execution Time = # instructions x CPI x T_c
= (100 \times 10^9)(1)(925 \times 10^{-12} \text{ s})
= 92.5 seconds
```

### Multi-Cycle Performance: CPI

Instructions take different number of cycles:

```
□ 3 cycles: beq, j
```

CPI is weighted average, e.g. SPECINT2000 benchmark:

```
25% loads
```

• Average CPI = 
$$(0.11 + 0.02) 3 + (0.52 + 0.10) 4 + (0.25) 5$$
  
=  $4.12$ 

### Multi-Cycle Performance: Cycle Time

Multi-cycle critical path:

$$T_c =$$



### Multi-Cycle Performance: Cycle Time

Multi-cycle critical path:

$$T_c = t_{pcq} + t_{mux} + max(t_{ALU} + t_{mux}, t_{mem}) + t_{setup}$$



### Multi-Cycle Performance Example

| Element             | Parameter            | Delay (ps) |
|---------------------|----------------------|------------|
| Register clock-to-Q | t <sub>pcq_PC</sub>  | 30         |
| Register setup      | t <sub>setup</sub>   | 20         |
| Multiplexer         | t <sub>mux</sub>     | 25         |
| ALU                 | t <sub>ALU</sub>     | 200        |
| Memory read         | t <sub>mem</sub>     | 250        |
| Register file read  | t <sub>RFread</sub>  | 150        |
| Register file setup | t <sub>RFsetup</sub> | 20         |

 $T_c =$ 

### Multi-Cycle Performance Example

| Element             | Parameter            | Delay (ps) |
|---------------------|----------------------|------------|
| Register clock-to-Q | t <sub>pcq_PC</sub>  | 30         |
| Register setup      | t <sub>setup</sub>   | 20         |
| Multiplexer         | t <sub>mux</sub>     | 25         |
| ALU                 | t <sub>ALU</sub>     | 200        |
| Memory read         | t <sub>mem</sub>     | 250        |
| Register file read  | t <sub>RFread</sub>  | 150        |
| Register file setup | t <sub>RFsetup</sub> | 20         |

$$T_c$$
 =  $t_{pcq\_PC} + t_{mux} + max(t_{ALU} + t_{mux}, t_{mem}) + t_{setup}$   
=  $[30 + 25 + 250 + 20] ps$   
=  $325 ps$ 

### Multi-Cycle Performance Example

 For a program with 100 billion instructions executing on a multi-cycle MIPS processor

```
□ CPI = 4.12
□ T_c = 325 \text{ ps}
```

```
■ Execution Time = (# instructions) × CPI × T_c
= (100 \times 10^9)(4.12)(325 \times 10^{-12})
= 133.9 seconds
```

- This is slower than the single-cycle processor (92.5 seconds). Why?
- Did we break the stages in a balanced manner?
- Overhead of register setup/hold paid many times
- How would the results change with different assumptions on memory latency and instruction mix?

### Review: Single-Cycle MIPS Processor



## Review: Single-Cycle MIPS FSM

Single-cycle machine



### Review: Multi-Cycle MIPS Processor



### Review: Multi-Cycle MIPS FSM



What is the shortcoming of this design?

What does this design assume about memory?

### What If Memory Takes > One Cycle?

- Stay in the same "memory access" state until memory returns the data
- "Memory Ready?" bit is an input to the control logic that determines the next state

# Backup Slides on Microprogrammed Multi-Cycle Microarchitectures

### These Slides Are Covered in A Past Lecture



### Lectures on Microprogrammed Designs

- Design of Digital Circuits, Spring 2018, Lecture 13
  - Microprogramming (ETH Zürich, Spring 2018)
  - https://www.youtube.com/watch?v=u4GhShuBP3Y&list=PL5Q2soXY2Zi\_QedyPWtR mFUJ2F8DdYP7l&index=13
- Computer Architecture, Spring 2013, Lecture 7
  - Microprogramming (CMU, Spring 2013)
  - https://www.youtube.com/watch?v=\_igvSl5h8cs&list=PL5PHm2jkkXmidJOd59REog
     9jDnPDTG6IJ&index=7

## Another Example: Microprogrammed Multi-Cycle Microarchitecture

### How Do We Implement This?

 Maurice Wilkes, "The Best Way to Design an Automatic Calculating Machine," Manchester Univ. Computer Inaugural Conf., 1951.

## THE BEST WAY TO DESIGN AN AUTOMATIC CALCULATING MACHINE

By M. V. Wilkes, M.A., Ph.D., F.R.A.S.



- An elegant implementation:
  - The concept of microcoded/microprogrammed machines

### Recall: A Basic Multi-Cycle Microarchitecture

- Instruction processing cycle divided into "states"
  - A stage in the instruction processing cycle can take multiple states
- A multi-cycle microarchitecture sequences from state to state to process an instruction
  - The behavior of the machine in a state is completely determined by control signals in that state
- The behavior of the entire processor is specified fully by a finite state machine
- In a state (clock cycle), control signals control two things:
  - How the datapath should process the data
  - How to generate the control signals for the (next) clock cycle

## Microprogrammed Control Terminology

- Control signals associated with the current state
  - Microinstruction
- Act of transitioning from one state to another
  - Determining the next state and the microinstruction for the next state
  - Microsequencing
- Control store stores control signals for every possible state
  - Store for microinstructions for the entire FSM
- Microsequencer determines which set of control signals will be used in the next clock cycle (i.e., next state)



### What Happens In A Clock Cycle?

- The control signals (microinstruction) for the current state control two things:
  - Processing in the data path
  - Generation of control signals (microinstruction) for the next cycle
  - See Supplemental Figure 1 (next-next slide)
- Datapath and microsequencer operate concurrently
- Question: why not generate control signals for the current cycle in the current cycle?
  - This could lengthen the clock cycle
  - Why could it lengthen the clock cycle?
  - See Supplemental Figure 2

### Example uProgrammed Control & Datapath



### A Clock Cycle



### A Bad Clock Cycle!



### A Simple LC-3b Control and Datapath



### What Determines Next-State Control Signals?

- What is happening in the current clock cycle
  - See the 9 control signals coming from "Control" block
    - What are these for?
- The instruction that is being executed
  - □ IR[15:11] coming from the Data Path
- Whether the condition of a branch is met, if the instruction being processed is a branch
  - BEN bit coming from the datapath
- Whether the memory operation is completing in the current cycle, if one is in progress
  - R bit coming from memory

### A Simple LC-3b Control and Datapath



### The State Machine for Multi-Cycle Processing

- The behavior of the LC-3b uarch is completely determined by
  - the 35 control signals and
  - additional 7 bits that go into the control logic from the datapath
- 35 control signals completely describe the state of the control structure
- We can completely describe the behavior of the LC-3b as a state machine, i.e. a directed graph of
  - Nodes (one corresponding to each state)
  - Arcs (showing flow from each state to the next state(s))

### An LC-3b State Machine

- Patt and Patel, Revised Appendix C, Figure C.2
- Each state must be uniquely specified
  - Done by means of state variables
- 31 distinct states in this LC-3b state machine
  - Encoded with 6 state variables
- Examples
  - State 18,19 correspond to the beginning of the instruction processing cycle
  - $\square$  Fetch phase: state 18, 19  $\rightarrow$  state 33  $\rightarrow$  state 35
  - Decode phase: state 32



### The FSM Implements the LC-3b ISA



- P&P Appendix A (revised):
  - https://safari.ethz.ch/digi taltechnik/spring2018/lib/ exe/fetch.php?media=pp -appendixa.pdf

### LC-3b State Machine: Some Questions

- How many cycles does the fastest instruction take?
- How many cycles does the slowest instruction take?
- Why does the BR take as long as it takes in the FSM?
- What determines the clock cycle time?

### LC-3b Datapath

- Patt and Patel, Revised Appendix C, Figure C.3
- Single-bus datapath design
  - At any point only one value can be "gated" on the bus (i.e., can be driving the bus)
  - Advantage: Low hardware cost: one bus
  - Disadvantage: Reduced concurrency if instruction needs the bus twice for two different things, these need to happen in different states
- Control signals (26 of them) determine what happens in the datapath in one clock cycle
  - Patt and Patel, Revised Appendix C, Table C.1





#### Remember the MIPS datapath



| Signal Name             | Signal Values                              |                                                                                                    |
|-------------------------|--------------------------------------------|----------------------------------------------------------------------------------------------------|
| LD.MAR/1:<br>LD.MDR/1:  | NO, LOAD<br>NO, LOAD                       |                                                                                                    |
| LD.MDR/1:               | NO, LOAD                                   |                                                                                                    |
| LD.BEN/1:               | NO, LOAD                                   |                                                                                                    |
| LD.REG/1:               | NO, LOAD                                   |                                                                                                    |
| LD.CC/1:                | NO, LOAD                                   |                                                                                                    |
| LD.PC/1:                | NO, LOAD                                   |                                                                                                    |
| GatePC/1:<br>GateMDR/1: | NO, YES                                    |                                                                                                    |
| GateALU/1:              | NO, YES<br>NO, YES                         |                                                                                                    |
| GateMARMUX/1:           | NO, YES                                    |                                                                                                    |
| GateSHF/1:              | NO, YES                                    |                                                                                                    |
| PCMUX/2:                | PC+2<br>BUS<br>ADDER                       | ;select pc+2<br>;select value from bus<br>;select output of address adder                          |
| DRMUX/1:                | 11.9<br>R7                                 | ;destination IR[11:9]<br>;destination R7                                                           |
| SR1MUX/1:               | 11.9<br>8.6                                | ;source IR[11:9]<br>;source IR[8:6]                                                                |
| ADDR1MUX/1:             | PC, BaseR                                  |                                                                                                    |
| ADDR2MUX/2:             | ZERO<br>offset6<br>PCoffset9<br>PCoffset11 | ;select the value zero<br>;select SEXT[IR[5:0]]<br>;select SEXT[IR[8:0]]<br>;select SEXT[IR[10:0]] |
| MARMUX/1:               | 7.0<br>ADDER                               | ;select LSHF(ZEXT[IR[7:0]],1)<br>;select output of address adder                                   |
| ALUK/2:                 | ADD, AND, X                                | COR, PASSA                                                                                         |
| MIO.EN/1:               | NO, YES                                    |                                                                                                    |
| R.W/1:                  | RD, WR                                     |                                                                                                    |
| DATA.SIZE/1:            | BYTE, WORL                                 |                                                                                                    |
| LSHF1/1:                | NO, YES                                    |                                                                                                    |

Table C.1: Data path control signals

#### LC-3b Datapath: Some Questions

- How does instruction fetch happen in this datapath according to the state machine?
- What is the difference between gating and loading?
  - Gating: Enable/disable an input to be connected to the bus
    - Combinational: during a clock cycle
  - □ Loading: Enable/disable an input to be written to a register
    - Sequential: e.g., at a clock edge (assume at the end of cycle)
- Is this the smallest hardware you can design?

#### LC-3b Microprogrammed Control Structure

- Patt and Patel, Appendix C, Figure C.4
- Three components:
  - Microinstruction, control store, microsequencer
- Microinstruction: control signals that control the datapath (26 of them) and help determine the next state (9 of them)
- Each microinstruction is stored in a unique location in the control store (a special memory structure)
- Unique location: address of the state corresponding to the microinstruction
  - Remember each state corresponds to one microinstruction
- Microsequencer determines the address of the next microinstruction (i.e., next state)





|   |          |              |          |   |   |              |   |              |              |              |          |           |            |           |              | ۸- | 5 á          | No.       |              |                | 1      | S.                                               | Į.           | A                                                |      |         |          | Ŕ        | ži                                     |
|---|----------|--------------|----------|---|---|--------------|---|--------------|--------------|--------------|----------|-----------|------------|-----------|--------------|----|--------------|-----------|--------------|----------------|--------|--------------------------------------------------|--------------|--------------------------------------------------|------|---------|----------|----------|----------------------------------------|
|   | R)       | OBB          |          |   | 5 |              |   | 4            | 345          |              | e 6      |           | ي<br>بي دي | ر<br>چې د |              |    | San Constant | T ALL SOL | grist.       | الم<br>المحققة |        | and of                                           | 907.         | A CAREE                                          | ş* . | 100 A   | , of     |          |                                        |
| [ | Ì        | ÷            |          | _ | - | Т            | T | Ī            | Ţ            | Ţ            | Ĺ        | Ì         | Ì          |           | Ī            | Ĺ  | Ţ            |           | ,<br>Ц       | Ţ              | Ţ      | Ľ                                                | Ť            | Ľ                                                |      | Ţ       | Ĺ        | j        | 000000 (State 0)                       |
|   | +        | ÷            | H        | _ |   | <del>'</del> | ÷ | +            | +            | +            | $\vdash$ | Н         | $\dashv$   | +         | +            | Н  | +            | +         | $\dashv$     | +              | +      | H                                                | +            | H                                                | +    | +       |          |          | 000001 (State 1)<br>000010 (State 2)   |
|   |          | -            |          | Т | Т | Т            | _ | t            | t            | $^{\dagger}$ |          |           |            |           |              |    |              |           |              |                |        | <u> </u>                                         |              |                                                  | 1    |         |          |          | 000010 (State 2) 000011 (State 3)      |
| - | 4        | -            |          |   |   | <del>'</del> | + | +            | +            | $\perp$      |          | Н         | _          | _         | +            | Н  | 4            | -         | $\Box$       | _              | +      | H                                                | +            | Ľ                                                | 4    | _       |          |          | 000100 (State 4)                       |
| ł | 1        | -            | H        | _ | - | -            | - | +            | +            | +            |          | Н         |            |           |              | Н  | +            |           | Н            |                |        | Η-                                               |              | H                                                | +    | +       |          |          | 000101 (State 5)<br>000110 (State 6)   |
|   |          | -            |          | _ |   | _            | _ | 1            | Ţ            |              |          |           |            |           |              |    |              |           |              |                |        |                                                  |              |                                                  | 1    |         |          |          | 000111 (State 7)                       |
| ł | -        | <del>.</del> | H        |   |   | ÷            | ÷ | +            | +            | +            | H        |           | -          |           | +            | Н  | +            |           | $\dashv$     | +              | +      | H                                                | +            | H                                                | +    | +       |          |          | 001000 (State 8)<br>001001 (State 9)   |
|   |          | 1            |          |   | Т | _            | _ | t            | t            | $^{\dagger}$ |          |           |            |           | $^{\dagger}$ | П  |              |           |              |                |        |                                                  |              | $\vdash$                                         |      |         |          |          | 001010 (State 10)                      |
|   |          | -            |          |   |   | -            | - | 1            |              |              |          |           |            |           |              |    | 1            |           |              |                |        | H                                                |              | Ľ                                                |      | $\perp$ |          |          | 001011 (State 11)                      |
| - | -        | _            | H        | _ | - | -            | - | +            | +            | +            |          |           | -          |           | +            | Н  | +            |           | $\dashv$     |                | +      | ╁                                                | +            | ┢᠇                                               | +    | +       |          |          | 001100 (State 12)<br>001101 (State 13) |
|   |          |              |          |   |   | Т            |   | 1            | İ            |              |          |           |            |           |              |    |              |           |              |                | İ      | Ľ                                                | İ            |                                                  |      |         |          |          | 001110 (State 14)                      |
| - | $\dashv$ | 1            | H        |   | _ | 1            | 1 | +            | +            | +            |          | H         | -          | $\perp$   | +            | H  | +            | +         | -            | 4              | +      | <del>                                     </del> | +            | +                                                | +    | +       |          | $\dashv$ | 001111 (State 15)                      |
| ł | $\dashv$ | -            |          | - | - | -            | - | $\dagger$    | $^{\dagger}$ | +            | H        | H         | $\dashv$   | +         | $\dagger$    | H  | +            | +         | +            | +              | +      | $\vdash$                                         | +            | $\vdash$                                         | +    | +       | $\vdash$ |          | 010000 (State 16)<br>010001 (State 17) |
|   | $\Box$   | -            |          |   | - | _            | - | 1            |              |              |          |           |            | 1         |              | П  |              |           |              |                | 1      | H                                                | 1            |                                                  | 1    | 1       |          |          | 010010 (State 18)                      |
|   | $\dashv$ | 1            | H        | _ | - | ·            | · | +            | +            | +            | $\vdash$ | Н         | $\dashv$   | +         | +            | Н  | +            | +         | $\dashv$     | +              | +      | H                                                | +            | H                                                | +    | +       | $\vdash$ |          | 010011 (State 19)<br>010100 (State 20) |
| ŀ |          |              | L        | _ | _ | _            | _ | t            | 1            | $^{\dagger}$ | Ħ        | H         | _†         | _         | $^{\dagger}$ | H  | _            | 1         | ᅥ            |                | 1      | Ľ                                                | ✝            | Ħ                                                | _†   |         | L        |          | 010100 (State 20) 010101 (State 21)    |
|   | 7        | _            |          | _ | _ | T            | _ | Ţ            | T            | Γ            |          | П         | $\Box$     | 1         | T            | П  | T            | I         | П            | I              | I      |                                                  | Ţ            | Г                                                | 1    | T       |          |          | 010110 (State 22)                      |
|   | +        | <del>-</del> | H        | _ | _ | -            | 1 | +            | +            | +            | Н        | Н         | $\dashv$   | +         | +            | Н  | +            | +         | +            | +              | +      | H                                                | +            | <del>                                     </del> | +    | +       | $\vdash$ |          | 010111 (State 23)<br>011000 (State 24) |
|   |          | _            |          |   |   | _            | _ | t            | İ            |              |          |           |            |           |              |    |              |           |              |                |        |                                                  |              |                                                  |      |         |          |          | 011001 (State 25)                      |
|   | 4        | <del>'</del> |          |   |   | <del>'</del> | + | +            | +            | $\perp$      |          | Н         |            | -         | +            | Н  | +            | -         | <del>'</del> |                | +      | <del>  '</del>                                   | +            | <del>  '</del>                                   | 4    | +       |          |          | 011010 (State 26)                      |
| - |          | -            | H        | - | Т | -            | - | +            | +            | +            |          | Н         |            | +         | +            | Н  | +            |           | $\dashv$     |                | +      | +                                                | +            | ╁                                                | +    | +       |          |          | 011011 (State 27)<br>011100 (State 28) |
|   |          |              |          |   |   |              | _ | 1            |              |              |          |           |            |           |              |    |              |           |              |                |        |                                                  |              |                                                  |      |         |          |          | 011101 (State 29)                      |
|   | 4        | <del>'</del> | H        | _ |   | ÷            | ÷ | +            | +            | +            |          | Н         | _          | -         | +            | Н  | +            | -         | $\dashv$     |                | +      | H                                                | +            | H                                                | 4    | +       |          |          | 011110 (State 30)<br>011111 (State 31) |
| ł |          | -            |          | - | Т | -            | - | $^{\dagger}$ | $^{+}$       | +            |          |           |            |           | +            | Н  | $\pm$        |           |              |                | +      | <u> </u>                                         |              |                                                  | +    | +       |          |          | 100000 (State 32)                      |
|   | 4        | <u> </u>     |          |   |   | <del>'</del> | + | 1            | $\perp$      | $\perp$      |          | Ц         | _          | 4         | $\perp$      | Н  | _            | -         | -            | 4              | +      | H                                                | $\downarrow$ | H                                                | 4    | _       | -        |          | 100001 (State 33)                      |
| - |          | -            | H        | - | - | -            | - | +            | +            | +            |          | Н         |            | +         | +            | Н  | +            | +         | -            |                | +      | Η-                                               | +            | ┝                                                | +    | +       |          |          | 100010 (State 34)<br>100011 (State 35) |
|   |          |              |          | _ | Т | _            | _ | t            |              |              |          |           |            |           |              |    |              |           |              |                |        |                                                  |              |                                                  |      |         |          |          | 100100 (State 36)                      |
|   | _        | <u>'</u>     | L        |   |   | <del>'</del> | + | +            | +            | $\perp$      |          | Ш         |            | _         | -            | Н  | 4            | -         | ;            |                | +      | H                                                | 1            | <b>!</b>                                         | 4    | +       |          |          | 100101 (State 37)                      |
|   | +        | -            | H        |   | - | Т            | Т | +            | +            | +            | Н        | Н         | $\dashv$   | +         | +            | Н  | +            | +         | $\dashv$     | +              | +      | -                                                | +            | -                                                | +    | +       | $\vdash$ |          | 100110 (State 38)<br>100111 (State 39) |
|   |          | _            |          |   | T | Ţ            | Ţ | 1            |              | t            |          |           |            |           |              |    | #            |           |              |                |        | П                                                |              |                                                  |      |         |          |          | 101000 (State 40)                      |
|   | $\dashv$ | 1            | H        | _ |   | <del>'</del> | ÷ | +            | +            | +            | $\vdash$ | Н         | $\dashv$   | +         | +            | Н  | +            | +         | $\dashv$     | +              | +      | H                                                | +            | H                                                | +    | +       | -        |          | 101001 (State 41)<br>101010 (State 42) |
|   | $\dashv$ | -            | H        | - | - | 1            | - | +            | t            | +            | H        | H         | $\dashv$   | +         | +            | H  | +            | +         | +            | +              | $^{+}$ | +                                                | +            | $\vdash$                                         | +    | +       | t        |          | 101010 (State 42) 101011 (State 43)    |
|   | $\Box$   | 1            |          |   | _ |              | - | 1            | ļ            | L            |          |           |            | 1         | Ţ            | П  | 1            | L         |              |                | 1      | Ľ                                                | 1            | Ľ                                                | 1    | 1       |          |          | 101100 (State 44)                      |
|   | $\dashv$ | <del>'</del> | H        | _ | · | ÷            | ÷ | +            | +            | +            | $\vdash$ | Н         | $\dashv$   | +         | +            | Н  | +            | +         | $\dashv$     | +              | +      | H                                                | +            | ⊢                                                | +    | +       | -        |          | 101101 (State 45)<br>101110 (State 46) |
| ŀ | ╛        | _            |          | _ | _ | _            | _ | t            | İ            | t            | L        | Н         |            |           | İ            |    | 士            | İ         | ╛            | ╅              | t      |                                                  |              | Ľ                                                | ╛    | $\pm$   | L        |          | 101111 (State 47)                      |
| - | 7        | -            | F        | 7 | T |              | _ | Ŧ            | F            | F            | F        | П         | $\dashv$   | Ŧ         | F            | П  | 7            | F         | П            | $\mp$          | F      | H                                                | Ŧ            | F                                                | 7    | Ŧ       | F        | $\Box$   | 110000 (State 48)                      |
| ł | $\dashv$ | -            | $\vdash$ | - | - | -            | - | +            | +            | +            | $\vdash$ | $\forall$ | $\dashv$   | +         | +            | Н  | +            | +         | $\dashv$     | +              | +      | +                                                | +            | +                                                | +    | +       | $\vdash$ |          | 110001 (State 49)<br>110010 (State 50) |
|   |          | 1            |          |   | _ |              | _ | 1            | ļ            | İ            |          | П         |            | 1         | t            |    |              |           | I            |                | t      |                                                  | 1            |                                                  | 1    | 1       |          |          | 110011 (State 51)                      |
| } | $\dashv$ | -            | H        |   |   | +            | + | +            | +            | +            |          | $\vdash$  | $\dashv$   | +         | +            | Н  | +            | +         | $\dashv$     | +              | +      | ₩;                                               | +            | ₽,                                               | +    | +       | -        |          | 110100 (State 52)                      |
| } | $\dashv$ | -            | $\vdash$ | _ | - | -            | - | $\dagger$    | +            | +            | $\vdash$ | H         | $\dashv$   | +         | +            | Н  | +            | +         | +            | +              | +      | +                                                | +            | $\vdash$                                         | +    | +       | $\vdash$ |          | 110101 (State 53)<br>110110 (State 54) |
|   | 1        | -            |          |   |   | <u> </u>     | Ţ | 1            | 1            |              |          |           |            |           |              |    |              | L         |              |                |        |                                                  | 1            |                                                  | 1    | #       |          |          | 110111 (State 55)                      |
| ł | $\dashv$ | -            |          |   |   | +            |   | +            | +            | +            | $\vdash$ | Н         | $\dashv$   | +         | +            | Н  | +            | +         | $\dashv$     | +              | +      | H                                                | +            | H                                                |      | +       | $\vdash$ |          | 111000 (State 56)<br>111001 (State 57) |
|   | _†       | _            | L        | - |   |              | _ | t            | $\pm$        | 1            | Ħ        | H         |            |           | 1            | Ħ  | _            | 1         | _            | ╅              | 1      | Ľ                                                |              | Ħ                                                | _†   | _       | L        |          | 11100 (State 57) 111010 (State 58)     |
|   | 1        | _            |          | _ |   | _            | _ | Ţ            | I            |              |          | П         |            | 1         |              | П  | 7            |           | П            | 1              |        |                                                  |              |                                                  | 7    | Ţ       |          | $\Box$   | 111011 (State 59)                      |
| - | $\dashv$ | -            | H        | _ | _ | +            |   | +            | +            | +            | $\vdash$ | $\vdash$  | $\dashv$   | +         | +            | Н  | +            | +         | $\dashv$     | +              | +      | H                                                | +            | ╁                                                | +    | +       | $\vdash$ | $\dashv$ | 111100 (State 60)<br>111101 (State 61) |
| ŀ | ╛        | _            |          |   | _ | _            | _ | 1            | İ            | İ            | T        | ⇈         |            |           | İ            |    | 士            | İ         |              | 1              | İ      |                                                  | I            | Ľ                                                | ╛    | 士       | İ        |          | 111110 (State 62)                      |
| 1 | Т        |              |          |   | Т |              | _ | Ι            | Ι            |              |          |           |            |           | Γ            |    |              | Ι         | ' Τ          |                | Ι      |                                                  |              |                                                  |      |         |          |          | 111111 (State 63)                      |

#### LC-3b Microsequencer

- Patt and Patel, Appendix C, Figure C.5
- The purpose of the microsequencer is to determine the address of the next microinstruction (i.e., next state)
  - Next state could be conditional or unconditional

Next state address depends on 9 control signals (plus 7

data signals)

| Signal Name     | Signal Values                                                                    |                                                                |  |  |  |  |  |  |  |
|-----------------|----------------------------------------------------------------------------------|----------------------------------------------------------------|--|--|--|--|--|--|--|
| J/6:<br>COND/2: | COND <sub>0</sub><br>COND <sub>1</sub><br>COND <sub>2</sub><br>COND <sub>3</sub> | ;Unconditional<br>;Memory Ready<br>;Branch<br>;Addressing Mode |  |  |  |  |  |  |  |
| IRD/1:          | NO, YES                                                                          |                                                                |  |  |  |  |  |  |  |



#### The Microsequencer: Some Questions

- When is the IRD signal asserted?
- What happens if an illegal instruction is decoded?
- What are condition (COND) bits for?
- How is variable latency memory handled?
- How do you do the state encoding?
  - Minimize number of state variables (~ control store size)
  - Start with the 16-way branch
  - Then determine constraint tables and states dependent on COND

# An Exercise in Microprogramming

#### Handouts

- 7 pages of Microprogrammed LC-3b design
- https://safari.ethz.ch/digitaltechnik/spring2018/lib/exe/fetc h.php?media=lc3b-figures.pdf

## A Simple LC-3b Control and Datapath







A Simple Datapath Can Become Very Powerful







| Signal Name   | Signal Values      |                                 |
|---------------|--------------------|---------------------------------|
| LD.MAR/1:     | NO, LOAD           |                                 |
| LD.MDR/1:     | NO, LOAD           |                                 |
| LD.IR/1:      | NO, LOAD           |                                 |
| LD.BEN/1:     | NO, LOAD           |                                 |
| LD.REG/1:     | NO, LOAD           |                                 |
| LD.CC/1:      | NO, LOAD           |                                 |
| LD.PC/1:      | NO, LOAD           |                                 |
| GatePC/1:     | NO, YES<br>NO, YES |                                 |
| GateMDR/1:    |                    |                                 |
| GateALU/1:    | NO, YES            |                                 |
| GateMARMUX/1: | NO, YES            |                                 |
| GateSHF/1:    | NO, YES            |                                 |
| PCMUX/2:      | PC+2               | ;select pc+2                    |
|               | BUS                | ;select value from bus          |
|               | ADDER              | ;select output of address adder |
| DRMUX/1:      | 11.9               | ;destination IR[11:9]           |
|               | R7                 | ;destination R7                 |
| SR1MUX/1:     | 11.9               | ;source IR[11:9]                |
|               | 8.6                | ;source IR[8:6]                 |
| ADDR1MUX/1:   | PC, BaseR          |                                 |
| ADDR2MUX/2:   | ZERO               | ;select the value zero          |
|               | offset6            | select SEXT[IR[5:0]]            |
|               | PCoffset9          | ;select SEXT[IR[8:0]]           |
|               | PCoffset11         | ;select SEXT[IR[10:0]]          |
| MARMUX/1:     | 7.0                | ;select LSHF(ZEXT[IR[7:0]],1)   |
|               | ADDER              | ;select output of address adder |
| ALUK/2:       | ADD, AND, X        | OR, PASSA                       |
|               |                    | ,                               |
| MIO.EN/1:     | NO, YES            |                                 |
| R.W/1:        | RD, WR             | _                               |
| DATA.SIZE/1:  | BYTE, WORL         | )                               |
| LSHF1/1:      | NO, YES            |                                 |

Table C.1: Data path control signals







# End of the Exercise in Microprogramming

## Variable-Latency Memory

- The ready signal (R) enables memory read/write to execute correctly
  - Example: transition from state 33 to state 35 is controlled by the R bit asserted by memory when memory data is available
- Could we have done this in a single-cycle microarchitecture?
- What did we assume about memory and registers in a single-cycle microarchitecture?

## The Microsequencer: Advanced Questions

- What happens if the machine is interrupted?
- What if an instruction generates an exception?
- How can you implement a complex instruction using this control structure?
  - Think REP MOVS instruction in x86
    - string copy of N elements starting from address A to address B

#### The Power of Abstraction

- The concept of a control store of microinstructions enables the hardware designer with a new abstraction: microprogramming
- The designer can translate any desired operation to a sequence of microinstructions
- All the designer needs to provide is
  - The sequence of microinstructions needed to implement the desired operation
  - The ability for the control logic to correctly sequence through the microinstructions
  - Any additional datapath elements and control signals needed (no need if the operation can be "translated" into existing control signals)

# Let's Do Some More Microprogramming

- Implement REP MOVS in the LC-3b microarchitecture
- What changes, if any, do you make to the
  - state machine?
  - datapath?
  - control store?
  - microsequencer?
- Show all changes and microinstructions
- Optional HW Assignment

#### x86 REP MOVS (String Copy) Instruction

```
REP MOVS (DEST SRC)
                                                                                                              DEST \leftarrow SRC:
                                                                                                              IF (Byte move)
                                                                                                                 THEN IF DF = 0
                                                                                                                     THEN
                                                                                                                         (R|E)SI \leftarrow (R|E)SI + 1;
IF AddressSize = 16
                                                                                                                         (R|E)DI \leftarrow (R|E)DI + 1;
     THEN
                                                                                                                     ELSE
                                                                                                                          (R|E)SI \leftarrow (R|E)SI - 1;
            Use CX for CountReg;
                                                                                                                         (R|E)DI \leftarrow (R|E)DI - 1;
     ELSE IF AddressSize = 64 and REX.W used
                                                                                                                 ELSE IF (Word move)
            THEN Use RCX for CountReg; FI;
                                                                                                                     THEN IF DF = 0
                                                                                                                         (R|E)SI \leftarrow (R|E)SI + 2;
     ELSE
                                                                                                                         (R|E)DI \leftarrow (R|E)DI + 2;
            Use ECX for CountReg;
                                                                                                                     ELSE
FI;
                                                                                                                          (R|E)SI \leftarrow (R|E)SI - 2;
                                                                                                                          (R|E)DI \leftarrow (R|E)DI - 2;
WHILE CountReg \neq 0
                                                                                                                 ELSE IF (Doubleword move)
     D0
                                                                                                                     THEN IF DF = 0
            Service pending interrupts (if any);
                                                                                                                         (R|E)SI \leftarrow (R|E)SI + 4;
                                                                                                                         (R|E)DI \leftarrow (R|E)DI + 4;
            Execute associated string instruction;
            CountReg \leftarrow (CountReg - 1);
                                                                                                                     ELSE
                                                                                                                         (R|E)SI \leftarrow (R|E)SI - 4;
            IF CountReg = 0
                                                                                                                         (R|E)DI \leftarrow (R|E)DI - 4;
                   THEN exit WHILE loop; FI;
                                                                                                                 ELSE IF (Quadword move)
            IF (Repeat prefix is REPZ or REPE) and (ZF = 0)
                                                                                                                     THEN IF DF = 0
                                                                                                                         (R|E)SI \leftarrow (R|E)SI + 8;
            or (Repeat prefix is REPNZ or REPNE) and (ZF = 1)
                                                                                                                         (R|E)DI \leftarrow (R|E)DI + 8;
                   THEN exit WHILE loop; FI;
                                                                                                                     ELSE
     OD:
                                                                                                                         (R|E)SI \leftarrow (R|E)SI - 8;
                                                                                                                         (R|E)DI \leftarrow (R|E)DI - 8;
                                                                                                                     FI:
```

How many instructions does this take in MIPS ISA?

### Aside: Alignment Correction in Memory

- Unaligned accesses
- LC-3b has byte load and byte store instructions that move data not aligned at the word-address boundary
  - Convenience to the programmer/compiler
- How does the hardware ensure this works correctly?
  - Take a look at state 29 for LDB
  - States 24 and 17 for STB
  - Additional logic to handle unaligned accesses
- P&P, Revised Appendix C.5

# Aside: Memory Mapped I/O

- Address control logic determines whether the specified address of LDW and STW are to memory or I/O devices
- Correspondingly enables memory or I/O devices and sets up muxes
- An instance where the final control signals of some datapath elements (e.g., MEM.EN or INMUX/2) cannot be stored in the control store
  - These signals are dependent on memory address
- P&P, Revised Appendix C.6

# Advantages of Microprogrammed Control

- Allows a very simple design to do powerful computation by controlling the datapath (using a sequencer)
  - High-level ISA translated into microcode (sequence of u-instructions)
  - Microcode (u-code) enables a minimal datapath to emulate an ISA
  - Microinstructions can be thought of as a user-invisible ISA (u-ISA)
- Enables easy extensibility of the ISA
  - Can support a new instruction by changing the microcode
  - Can support complex instructions as a sequence of simple microinstructions (e.g., REP MOVS, INC [MEM])
- Enables update of machine behavior
  - A buggy implementation of an instruction can be fixed by changing the microcode in the field
    - Easier if datapath provides ability to do the same thing in different ways

#### Update of Machine Behavior

- The ability to update/patch microcode in the field (after a processor is shipped) enables
  - Ability to add new instructions without changing the processor!
  - Ability to "fix" buggy hardware implementations

#### Examples

- IBM 370 Model 145: microcode stored in main memory, can be updated after a reboot
- □ IBM System z: Similar to 370/145.
  - Heller and Farrell, "Millicode in an IBM zSeries processor," IBM JR&D, May/Jul 2004.
- B1700 microcode can be updated while the processor is running
  - User-microprogrammable machine!
  - Wilner, "Microprogramming environment on the Burroughs B1700", CompCon 1972.

# Multi-Cycle vs. Single-Cycle uArch

- Advantages
- Disadvantages
- For you to fill in