Name: ____________________________
First Name: _______________________
Student ID: ________________________

1st session examination
Design of Digital Circuits SS2013
(252-0014-00S)
Markus Püschel, Frank K. Gürkaynak

Examination Rules:

1. Written exam, 90 minutes total.
2. No books, no calculators, no computers or communication devices. Five pages of handwritten notes are allowed.
3. Write all your answers on this document, space is reserved for your answers after each question. Blank pages are available at the end of the exam.
4. Put your Student ID card visible on the desk during the exam.
5. If you feel disturbed, immediately call an assistant.
6. Answers will only be evaluated if they are readable
7. Write with a black or blue pen (no pencil, no green or red color).
8. Show all your work. For some questions, you may get partial credit even if the end result is wrong due to a calculation mistake.

<table>
<thead>
<tr>
<th>Question</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Points:</td>
<td>5</td>
<td>10</td>
<td>15</td>
<td>10</td>
<td>12</td>
<td>5</td>
<td>12</td>
<td>6</td>
<td>75</td>
</tr>
<tr>
<td>Score:</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
1. (a) (3 points) You receive the following 12-bit binary sequence:

0000 0111 1100

Which decimal numbers are encoded in this sequence, if you were told that the sequence contained:

Two 6-bit numbers using two’s complement: 1, −4
A single 12-bit unsigned number: 124
Three 4-bit numbers using sign/magnitude: 0, 7, −4

(b) (2 points) In the lecture, it was explained that the two's complement was the better alternative to represent negative numbers. Name two main advantages of the two's complement representation over a sign/magnitude representation:

<table>
<thead>
<tr>
<th>Solution:</th>
</tr>
</thead>
<tbody>
<tr>
<td>1. Zero is represented only once</td>
</tr>
<tr>
<td>2. Standard binary addition works with two's complement numbers without additional effort</td>
</tr>
<tr>
<td>3. Associativity law holds</td>
</tr>
</tbody>
</table>
2. The following Verilog code defines a combinational circuit. We are interested in finding out the timing properties of this circuit.

```verilog
module gandalf (input [3:0] a, input e, output z);

wire b, c, d;
reg f;


always @ (*)
  f <= a[3] & b;

assign z = (~e) ? d : f;
assign b = a[0] & a[1];
endmodule
```

The circuit is implemented using only the following basic logic building blocks: 2-input AND, 2-input OR, 2:1 Multiplexer, Inverter. The delay from any input to the output for each basic building block is given in the table below:

<table>
<thead>
<tr>
<th>Description</th>
<th>Delay [ps]</th>
</tr>
</thead>
<tbody>
<tr>
<td>2-input AND gate</td>
<td>100</td>
</tr>
<tr>
<td>2-input OR gate</td>
<td>120</td>
</tr>
<tr>
<td>Inverter</td>
<td>50</td>
</tr>
<tr>
<td>2:1 Multiplexer</td>
<td>180</td>
</tr>
</tbody>
</table>

Continue to the next page.
(a) (4 points) Draw a gate-level circuit diagram of the circuit using only the following basic logic gates: 2-input AND, 2-input OR, 2:1 Multiplexer, Inverter. Note: there is no need for optimizations.

![Circuit Diagram]

Solution:

(b) (3 points) Determine the propagation delay of the circuit. Draw it on your schematic, and calculate the propagation delay using the delay values from the table.

\[
\begin{align*}
t_{pd} &= t_{pd,AND} + t_{pd,OR} + t_{pd,AND} + t_{pd,INV} + t_{pd,MUX} \\
     &= 100 \text{ ps} + 120 \text{ ps} + 100 \text{ ps} + 50 \text{ ps} + 180 \text{ ps} \\
     &= 550 \text{ ps}
\end{align*}
\]

Solution:

(c) (3 points) Determine the contamination delay of the circuit. Draw it on your schematic, and calculate the contamination delay using the delay values from the table.

\[
\begin{align*}
t_{pd} &= t_{pd,INV} + t_{pd,MUX} \\
     &= 50 \text{ ps} + 180 \text{ ps} \\
     &= 230 \text{ ps}
\end{align*}
\]
3. The following Verilog code defines a Finite State Machine (FSM).

```verilog
module fsm (input a, input b, output [1:0] z,
            input clk, input reset);

reg [2:0] state, nextstate;

parameter INIT = 3’b000;
parameter DECODE = 3’b001;
parameter LOOP = 3’b100;
parameter JUMP = 3’b111;
parameter NEXT = 3’b010;
// DEF1 = 3’b011;
// DEF2 = 3’b101;
// DEF3 = 3’b110;

// next state calculation
always @( * )
  case (state)
  begin
    INIT: if ((a==1’b0) & (b==1’b0)) nextstate = DECODE;
          else nextstate = INIT;
    DECODE: if (a) nextstate = NEXT;
             else nextstate = LOOP;
    LOOP: nextstate = JUMP;
    JUMP: if (b) nextstate = INIT;
          else nextstate = DECODE;
    NEXT: nextstate = INIT;
    default: nextstate = INIT;
  endcase

// state register
always @(posedge clk, negedge reset)
  if (reset == 1’b0) state <= INIT;
  else state <= nextstate;

// output logic
assign z = state[1:0];
endmodule
```

(a) (1 point) Is this a Moore or a Mealy FSM? Briefly explain.

**Solution:** Moore, outputs depend only on the present state and nothing else.
(b) (4 points) Draw the State Transition Diagram corresponding to the Verilog code given above.

```
Solution:

reset

INIT
  z=00

DECODE
  z=01
    a=0\&b=0
    a=1|b=1

LOOP
  z=00
    a=0

NEXT
  z=10
    b=1

JUMP
  z=11
    b=0
```

*Note: The diagram shows the state transitions and the conditions for moving between states.*
(c) (5 points) Complete the following state transition table for the FSM described by the Verilog code. To make writing easier, denote the state bits by $S_2, S_1, S_0$ and the nextstate bits by $N_2, N_1, N_0$. Note that the default behavior for the nextstate is to move to the INIT state. Since only five states have been defined, there are three additional states which we named DEF1, DEF2, DEF3. As an example, entries for these three default states and the NEXT state have been entered.

<table>
<thead>
<tr>
<th>State</th>
<th>Inputs</th>
<th>Next State</th>
</tr>
</thead>
<tbody>
<tr>
<td>name</td>
<td>$S_2$</td>
<td>$S_1$</td>
</tr>
<tr>
<td>INIT</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>INIT</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>INIT</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>INIT</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>DECODE</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>DECODE</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>LOOP</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>JUMP</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>JUMP</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>NEXT</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>DEF1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>DEF2</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>DEF3</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

Note that there are different ways of writing this table to represent the same result.
(d) (1 point) For describing the next state is it better to use Products-of-Sums (POS) or Sums-of-Products (SOP)? Briefly explain.

**Solution:** Sums-of-Products, there are fewer entries with a 1 in them, and SOP requires one entry for each.

(e) (3 points) Write down the Boolean Equations for the next state bits $N_2, N_1, N_0$, in either POS or SOP. You don’t need to minimize the equations.

**Solution:**

\[
N_0 = \overline{S}_2 \overline{S}_1 \overline{S}_0 A \overline{B} + S_2 \overline{S}_1 \overline{S}_0 + S_2 S_1 S_0 \overline{B}
\]

\[
N_1 = \overline{S}_2 S_1 S_0 A + S_2 \overline{S}_1 \overline{S}_0
\]

\[
N_2 = \overline{S}_2 S_1 S_0 A + S_2 \overline{S}_1 \overline{S}_0
\]

(f) (1 point) Briefly explain how the output $z$ could be obtained in an actual circuit implementation, what kind of logic circuit would be needed?

**Solution:** The output is directly obtained from the 2 least significant bits of the state, or $S_1 S_0$. 
4. (10 points) There are four Verilog code snippets in this section. Only one of these codes is syntactically correct. All others have a problem with the syntax. For each code, first state whether or not there is a mistake. If there is a mistake explain how to correct it. Note: Assume that the behavior as described, is correct.

(a)

```verilog
module one (input [1:0] sel, input [3:0] data, output z);

assign z = sel[1] ? (sel[0] ? data[0] : data[3])

endmodule
```

**Solution:** This code is correct. The distribution of the data bits may seem strange, but we are not checking for behavior.

(b)

```verilog
module mux2 (input [1:0] i, input sel, output z);

assign z = (sel) ? i[1]:i[0];

endmodule

module two (input [3:0] data, input sel1, input sel2, output z);

mux2 i0 (.i(data[1:0]), .sel(sel1), .z(m[0]) );
mux2 i1 (.i(data[3:2]), .sel(sel1), .z(m[1]) );
mux2 i2 (.i(m), .sel(sel2), .z(z) );

endmodule
```

**Solution:** This code has mistakes. In module two there is an additional signal m used. This has not been declared, it should be declared as wire [1:0] m;.
(c)

```vhdl
module three (input [1:0] sel, output reg [7:0] z);

always @ (sel)
    if (sel = 2'b01) z=8'b01010101;
    else if (sel = 2'b10) z=8'b10101010;
    else z=8'b00000000;

endmodule
```

**Solution:** This code has mistakes. The condition checking for `sel` has been written as `=` which is an assignment. It should be `==` in both instances.

(d)

```vhdl
module four (input [1:0] sel, input neg, output reg [3:0] z);

always @ (neg, sel)
    if (neg) z = 4'b1111;
    else if (sel[1]) z = 4'b0001;
    else if (sel[0]) z = 4'b0010;

endmodule
```

**Solution:** This code has mistakes. There are 3 separate `if` statements following `always`. These should be within a `begin ... end` block. Note that, it would not be correct to have three separate `always` statements as this would mean driving the signal `z` from three different processes.
5. In this section, you will compare three structures to add 32-bit binary numbers in terms of Latency, Throughput, Area and Maximum Operating Frequency. Assume the following performance numbers for the components in the question. Note that the registers are considered ideal for timing: no propagation delay and no setup delay.

<table>
<thead>
<tr>
<th>Description</th>
<th>Delay [ns]</th>
<th>Area [μm²]</th>
</tr>
</thead>
<tbody>
<tr>
<td>32-bit Ripple Carry Adder</td>
<td>4.0</td>
<td>4'000</td>
</tr>
<tr>
<td>16-bit Ripple Carry Adder</td>
<td>2.0</td>
<td>2'000</td>
</tr>
<tr>
<td>32-bit Carry Lookahead Adder</td>
<td>2.5</td>
<td>6'000</td>
</tr>
<tr>
<td>64-bit register</td>
<td>0.0</td>
<td>670</td>
</tr>
<tr>
<td>49-bit register</td>
<td>0.0</td>
<td>500</td>
</tr>
<tr>
<td>32-bit register</td>
<td>0.0</td>
<td>330</td>
</tr>
</tbody>
</table>

(a) (4 points) Consider the following 32-bit ripple carry adder pipeline stage and answer the following questions:

- What is the area occupied by the entire pipeline?
- How long does it take to compute one addition?
- What is the maximum operating frequency (in GHz) of this pipeline?
- How many additions can be completed in 1000 ns?

Solution:

\[
\text{Area} = A_{FF,64} + A_{RCA,32} + A_{FF,32}
\]
\[
= 670 + 4000 + 330
\]
\[
= 5000
\]

\[
\text{Latency} = 4 \text{ ns}
\]

\[
\text{MaxFrequency} = \frac{1}{4 \text{ ns}}
\]
\[
= 0.250 \text{ GHz}
\]

\[
\text{Throughput} = \frac{1000}{4 \text{ ns}}
\]
\[
= 250 \text{ additions per 1000 ns}
\]

Hint: \( \frac{1}{1 \text{ ns}} = 1 \text{ GHz} \), a clock with 1 GHz has a period of 1 ns.
(b) (2 points) Consider the following 32-bit carry lookahead pipeline stage and answer the following questions:

- What is the area occupied by the entire pipeline?
- How long does it take to compute one addition?
- What is the maximum operating frequency (in GHz) of this pipeline?
- How many additions can be completed in 1000 ns?

Solution:

\[
\text{Area} = A_{\text{FF,64}} + A_{\text{CLA,32}} + A_{\text{FF,32}} \\
= 670 + 6000 + 330 \\
= 7000 \\
\text{Latency} = 2.5 \text{ ns} \\
\text{MaxFrequency} = \frac{1}{2.5 \text{ ns}} \\
= 0.400 \text{ GHz} \\
\text{Throughput} = \frac{1000}{2.5 \text{ ns}} \\
= 400 \text{ additions per 1000 ns}
\]
(c) (2 points) Consider the following 32-bit adder with a 2 stage pipeline built out of two 16-bit ripple carry adders and answer the following questions:

- What is the area occupied by the entire pipeline?
- How long does it take to compute one addition?
- What is the maximum operating frequency (in GHz) of this pipeline?
- How many additions can be completed in 1000 ns?

Solution:

\[
\begin{align*}
Area &= A_{FF,64} + A_{RCA,16} + A_{FF,49} + A_{RCA,16} + A_{FF,32} \\
&= 670 + 2000 + 500 + 2000 + 330 \\
&= 5500 \\
Latency &= 4 \text{ ns} \\
MaxFrequency &= 1/2 \text{ ns} \\
&= 0.500 \text{ GHz} \\
Throughput &= 1000/2 \text{ ns} \\
&= 500 \text{ additions per 1000 ns}
\end{align*}
\]

(d) (4 points) The \textit{Latency} is the time it takes to calculate one addition, whereas the \textit{Throughput} is the number of additions that can be calculated per unit time. It is obvious that the throughput will increase if you can reduce the latency. Is it possible to increase the throughput, even if you cannot reduce the latency? Briefly explain.

\textbf{Solution:} Yes. One solution is to introduce pipelining it is possible to improve the throughput as seen in the section c) of this question. Pipelining does not reduce latency, the computation of one data item still takes the same amount of time, however, the operation is broken down into smaller pieces, and as soon as the first part is completed, a new data item can be accepted, this improves the throughput. Another solution is to increase parallelism by, for example, duplicating the hardware.
6. (5 points) As covered in class, the execution speed of a program on a processor can be given as:

\[ \text{Execution Time} = N \times CPI \times \frac{1}{f} \]

Where \( N \) is the number of instructions, \( CPI \) is clocks per instruction and \( f \) is the clock frequency. Execution Time will improve by either reducing \( N \) and \( CPI \), or increasing \( f \) (or a combination thereof). List at least five improvements that can be made in order to improve the Execution Time.

**Solution:** Any five of the following could be accepted:

- **Reduce number of instructions**
  - adopt CISC, that uses instructions that can do more
  - improve the compiler so that it produces more optimized code

- **Reduce clocks per instruction**
  - adopt RISC, simpler instructions can be executed faster
  - add parallel execution units, do more per clock cycle

- **Increase clock frequency**
  - migrate to a more modern manufacturing technology
  - adopt pipelining
  - redesign and improve timing critical components in the circuit (adders, alu etc)
  - *Could also be accepted:* overclock the system (use higher voltage, clock frequency)
7. (12 points) Consider the following MIPS program. For clarity the addresses have been written using only 4 hexadecimal digits. Leading hexadecimal digits are all zeroes (the real start address is 0x00003000).

<table>
<thead>
<tr>
<th>Address</th>
<th>Instruction</th>
<th>R1, R2, R3, Immediate</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x3000</td>
<td>start: addi $s0, $0, 4</td>
<td></td>
</tr>
<tr>
<td>0x3004</td>
<td>xor $s1, $s1, $s1</td>
<td></td>
</tr>
<tr>
<td>0x3008</td>
<td>addi $s2, $0, 10</td>
<td></td>
</tr>
<tr>
<td>0x300C</td>
<td>sw $s2, 0($s1)</td>
<td></td>
</tr>
<tr>
<td>0x3010</td>
<td>addi $s2, $s2, 6</td>
<td></td>
</tr>
<tr>
<td>0x3014</td>
<td>add $s1, $s1, $s0</td>
<td></td>
</tr>
<tr>
<td>0x3018</td>
<td>sw $s2, 0($s1)</td>
<td></td>
</tr>
<tr>
<td>0x301C</td>
<td>addi $a0, $0, 11</td>
<td></td>
</tr>
<tr>
<td>0x3020</td>
<td>sll $t1, $a0, 1</td>
<td></td>
</tr>
<tr>
<td>0x3024</td>
<td>and $a1, $a0, $t1</td>
<td></td>
</tr>
<tr>
<td>0x3028</td>
<td>jal absdiff</td>
<td></td>
</tr>
<tr>
<td>0x302C</td>
<td>sw $v0, 4($s1)</td>
<td></td>
</tr>
<tr>
<td>0x3030</td>
<td>lw $a0, 0($0)</td>
<td></td>
</tr>
<tr>
<td>0x3034</td>
<td>lw $a1, 0($s0)</td>
<td></td>
</tr>
<tr>
<td>0x3038</td>
<td>jal absdiff</td>
<td></td>
</tr>
<tr>
<td>0x303C</td>
<td>lw $t3, 8($0)</td>
<td></td>
</tr>
<tr>
<td>0x3040</td>
<td>sub $t2, $t3, $v0</td>
<td></td>
</tr>
<tr>
<td>0x3044</td>
<td>done: j done</td>
<td></td>
</tr>
<tr>
<td>0x3048</td>
<td>absdiff: sub $t1, $a0, $a1</td>
<td></td>
</tr>
<tr>
<td>0x304C</td>
<td>slt $t2, $t1, $0</td>
<td></td>
</tr>
<tr>
<td>0x3050</td>
<td>beq $t2, $0, pos</td>
<td></td>
</tr>
<tr>
<td>0x3054</td>
<td>sub $t1, $a1, $a0</td>
<td></td>
</tr>
<tr>
<td>0x3058</td>
<td>pos: add $v0, $0, $t1</td>
<td></td>
</tr>
<tr>
<td>0x305C</td>
<td>jr $ra</td>
<td></td>
</tr>
</tbody>
</table>
We are interested in determining the value of some registers at the end of the program execution when the program reaches line 0x3044. Fill in the following table, writing the value of the indicated registers at the end of the program, and at which line these values have been written into these registers.

As an example: at the end of execution the register $s0 will have the value 4. This value has been written into the register while executing line 0x3000.

<table>
<thead>
<tr>
<th>Register</th>
<th>Value</th>
<th>Assigned on line</th>
</tr>
</thead>
<tbody>
<tr>
<td>$s0</td>
<td>4</td>
<td>0x3000</td>
</tr>
<tr>
<td>$s2</td>
<td>16</td>
<td>0x3010</td>
</tr>
<tr>
<td>$t1</td>
<td>6</td>
<td>0x3054</td>
</tr>
<tr>
<td>$t2</td>
<td>3</td>
<td>0x3040</td>
</tr>
<tr>
<td>$t3</td>
<td>9</td>
<td>0x303c</td>
</tr>
<tr>
<td>$ra</td>
<td>0x303c</td>
<td>0x3038</td>
</tr>
</tbody>
</table>
8. You are involved in designing a computing system using a cache (256 kbyte, 4-way set associative cache using 1 kbyte blocks). Your first design has some cache performance problems. Your colleagues made the following suggestions. For each suggestion, first state whether or not the idea will work, and then briefly explain why. If the idea works explain under what conditions.

(a) (2 points) Alain: "We have too many cache misses due to conflicts. We need to reduce the degree of associativity, so that we reduce conflict misses in the cache":

**Solution:**
This idea will not work. Just the opposite: increasing set associativity gives data more possibilities to be stored in the cache without replacing other data. This reduces conflict misses.

(b) (2 points) Beatrice: "There are many compulsory cache misses. To combat this, we should increase our block size"

**Solution:** This idea could work. A larger block size will take advantage of spatial locality and assume that nearby data items will also be accessed by the program. If the program has such accesses, the first data access will result in a compulsory miss, but the subsequent accesses will find data in the cache.
(c) (2 points) Cathy: "Our cache has many capacity misses. Instead of using a set associative cache, we should convert it to a direct mapped cache of the same size. This will allow more sets to be stored in the cache, hence reducing capacity misses."

**Solution:** This idea will not work. The organization of the cache does not change its capacity. The capacity miss occurs because data that is needed cannot fit into the cache.