Center for Reliable and High-Performance Computing

IMPACT: AN ARCHITECTURAL FRAMEWORK FOR MULTIPLE-INSTRUCTION-ISSUE PROCESSORS

Pohua P. Chang
Scott A. Mahlke
William Y. Chen
Nancy J. Warter
Wen-mei W. Hwu

Coordinated Science Laboratory
College of Engineering
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

Approved for Public Release. Distribution Unlimited.
IMPACT: An Architectural Framework for Multiple-Instruction-Issue Processors

The performance of multiple-instruction-issue processors can be severely limited by the compiler's ability to generate efficient code for concurrent hardware. In the IMPACT project, we have developed IMPACT-I, a highly optimizing C compiler to exploit instruction level concurrency. The optimization capabilities of the IMPACT-I C compiler are summarized in this paper. Using the IMPACT-I C compiler, we ran experiments to analyze the performance of multiple-instruction-issue processors executing some important non-numerical programs. The multiple-instruction-issue processors have achieved solid speedup over a high-performance single-instruction-issue processor.

To address architecture design issues, we ran experiments to characterize the engineering tradeoffs such as the code scheduling model, the instruction issue rate, the memory load latency, and the function unit resource limitations. Based on the experimental results, we propose the IMPACT Architectural Framework, a set of architectural features that best support the IMPACT-I C compiler to generate efficient code for multiple-instruction-issue processors. By supporting these architectural features, multiple-instruction-issue implementations of existing and new architectures receive immediate compilation support from the IMPACT-I C compiler.
IMPACT: An Architectural Framework for Multiple-Instruction-Issue Processors

Pohua P. Chang, Scott A. Mahlke, William Y. Chen, Nancy J. Warter and Wen-mei W. Hwu

Center for Reliable and High-Performance Computing
University of Illinois
hwu@crhc.uiuc.edu
January 31, 1991

Abstract

The performance of multiple-instruction-issue processors can be severely limited by the compiler's ability to generate efficient code for concurrent hardware. In the IMPACT project, we have developed IMPACT-I, a highly optimizing C compiler to exploit instruction level concurrency. The optimization capabilities of the IMPACT-I C compiler is summarized in this paper. Using the IMPACT-I C compiler, we ran experiments to analyze the performance of multiple-instruction-issue processors executing some important non-numerical programs. The multiple-instruction-issue processors have achieved solid speedup over a high-performance single-instruction-issue processor.

To address architecture design issues, we ran experiments to charaterize the engineering tradeoffs such as the code scheduling model, the instruction issue rate, the memory load latency, and the function unit resource limitations. Based on the experimental results, we propose the IMPACT Architectural Framework, a set of architectural features that best support the IMPACT-I C compiler to generate efficient code for multiple-instruction-issue processors. By supporting these architectural features, multiple-instruction-issue implementations of existing and new architectures receive immediate compilation support from the IMPACT-I C compiler.
1 Introduction

Computer engineers have been striving to improve uniprocessor performance since the invention of the computer. This paper is concerned with exploiting instruction level concurrency to achieve high performance. The traditional approach to exploiting concurrency is to provide the necessary support for instruction pipelining and overlapping [Kogge 81]. By optimizing a simple instruction pipeline structure, current pipelined processors can execute nearly one operation per cycle [Hennessy 81]. A natural extension to instruction pipelining is to provide parallel datapaths in order to fetch, decode, and execute several operations per cycle. Such processors have been referred to as multiple-instruction-issue processors in recent literature. Many hardware and software techniques for improving the performance and cost-effectiveness of multiple-instruction-issue processors have been studied [Fisher 81] [Rau 81] [Fisher 83] [Nicolau 85] [Patt 85] [Ellis 86] [Hwu 86] [Colwell 87] [Howland 87] [Weiss 87] [Cohn 89] [Jouppi 89] [Rau 89] [Smith 89] [Sohi 89] [Golumbic 90] [Warren 90] [Smith 90].

1.1 Our Approach And Contribution

An important problem in the design of multiple-instruction-issue processors is to ensure that the compiler can generate efficient code for the hardware. To solve this problem, we have constructed the IMPACT-I C compiler, a retargetable compiler with code optimization components especially developed for multiple-instruction-issue processors. These code improving techniques include function inline expansion, instruction placement, loop unrolling, memory disambiguation, register renaming, branch prediction, critical path depth reduction, and an integrated register allocation and code scheduling algorithm.
Using the IMPACT-I C compiler, we conducted experiments to characterize the performance implications of engineering tradeoffs such as alternative code scheduling models, memory load latency, and function unit resource limitations. All experimental results are derived from important non-numerical programs with realistic input data. Based on the experimental results, we have identified a set of architectural features that best support the IMPACT-I C compiler to generate efficient code for multiple-instruction-issue processors. We call the collection of these architectural features the IMPACT Architectural Framework.

The IMPACT-I C compiler generates highly optimized code for processors designed within the IMPACT Architectural Framework. Experimental results show that multiple-instruction-issue processors in this framework have achieved solid speedup over a high-performance single-instruction-issue processor. Supported by the advanced code optimization capabilities of the IMPACT-I C compiler, these multiple-instruction-issue processors have achieved a very encouraging performance level for non-numerical C programs.

1.2 Related Work

Fisher demonstrated that trace scheduling can find sufficient instruction-level parallelism to exploit VLIW architectures [Fisher 81]. Code scheduling and resource allocation for VLIW machines are done at compile-time [Fisher 81] [Ellis 86] [Colwell 87]. Rau has designed the ESL Polycyclic processor [Rau 81] and the Cydra 5 supercomputer [Rau 89] that issue several operations in a single cycle. Cohn, Gross, Lam, and Tseng have studied the architecture and compiler tradeoffs in the design of iWarp which is capable of specifying up to nine operations in an instruction [Cohn 89]. Weiss and Smith have shown that loop unrolling and software pipelining are effective in increasing parallelism [Weiss 87]. Sohi has shown that restricted horizontal instruction formats are compa-
rable to a format that can issue every operation to every function unit [Sohi 89]. These studies have focused mainly on numerical kernels and applications. In this paper, we have dealt with non-numerical programs.

Patt and Hwu have described a single-chip microarchitecture which allows the compiler to schedule several operations into an instruction word, and also supports dynamic code scheduling and branch lookahead to further explore concurrences between instructions [Patt 85]. This paper focuses on compile-time optimizations for simple in-order execution architectures.

Jouppi and Wall have measured the instruction-level parallelism of some non-numerical Modula-2 and C programs using an optimizing compiler that performs local code scheduling. Assuming unit-time operation latency, they reported that there are between 1.6 and 2.1 concurrently executable operations per cycle [Jouppi 89]. In this paper, we have implemented more aggressive code scheduling techniques, and have considered non-unit-time operation latencies and machine constraints.

Smith, Johnson, and Horowitz have used trace-based simulations to determine that dynamic scheduling can achieve an execution rate of about two operations per cycle for non-numerical C programs [Smith 89]. This paper focuses on the static code scheduling approaches.

Smith, Lam, and Horowitz have proposed and studied a static scheduling architecture that allows operations to be moved across a preceding branch operation [Smith 90]. They have reported about 1.63 speedup against a scalar processor which executes approximately 0.9 operations per cycle, using a four-issue microarchitecture. In this paper, we have implemented more levels of static scheduling support. With aggressive code transformation optimizations, we have achieved a higher level of performance.
1.3 Organization Of This Paper

This paper is organized into five sections. Section 2 presents our compiler technology and static code scheduling techniques. Section 3 presents experimental results. Section 4 describes the IMPACT Architectural Framework. Section 5 provides concluding remarks.

2 The IMPACT-I C Compiler

The IMPACT-I C compiler serves two important purposes. First, it is intended to generate highly optimized code for existing commercial microprocessors. We have constructed code generators for the MIPS-R2000 [Kane 87] and the SPARC [Sparc 87] processors. We are constructing code generators for the AMD29K [Amd], the IBM RS/6000 [IBM 90], and the i860 [Intel 89] processors. Second, it provides a platform for studying new code optimization techniques for multiple-instruction-issue architectures. These new code optimization techniques, once validated, can be immediately applied to the multiple-instruction-issue implementations of existing and new commercial architectures.

2.1 Code Optimizations

Code improving techniques in the IMPACT-I C compiler can be categorized into two groups: machine-independent optimizations and machine-dependent optimizations. Machine-independent optimizations include classical local and global code optimizations [Aho 86], function inline expansion [Hwu 89.2], instruction placement optimization [Chang 88] [Hwu 89.1], loop unrolling, intelligent generation of switch statements [Chang 89.2], and jump optimization. Machine-dependent optimizations include profile-based branch prediction, constant preloading, graph-coloring-based register allocation [Chaitin 82] [Chow 84], and code scheduling. A profiler has been integrated into
<table>
<thead>
<tr>
<th>name</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>cccp</td>
<td>GNU C preprocessor</td>
</tr>
<tr>
<td>cmp</td>
<td>compare files</td>
</tr>
<tr>
<td>compress</td>
<td>compress files</td>
</tr>
<tr>
<td>eqn</td>
<td>typeset mathematical formulas for troff</td>
</tr>
<tr>
<td>eqntott</td>
<td>boolean minimization</td>
</tr>
<tr>
<td>espresso</td>
<td>boolean minimization</td>
</tr>
<tr>
<td>grep</td>
<td>string search</td>
</tr>
<tr>
<td>lex</td>
<td>lexical analysis program generator</td>
</tr>
<tr>
<td>qsort</td>
<td>quick sort</td>
</tr>
<tr>
<td>tbl</td>
<td>format tables for troff</td>
</tr>
<tr>
<td>wc</td>
<td>word count</td>
</tr>
<tr>
<td>yacc</td>
<td>parsing program generator</td>
</tr>
</tbody>
</table>

Table 1: Benchmarks.

the IMPACT-I C compiler. When hardware resources are scarce, the profile information helps to identify the most frequently executed program sections and variables.

2.2 Base Performance

It is important to measure the performance of multiple-instruction-issue architectures using highly optimized code, because a naive compiler may produce redundant operations that show deceptive parallelism. To calibrate the quality of the code generated by the IMPACT-I C compiler, we have compared the execution time of its output code with those of a leading commercial compiler (MIPS CC) and a leading public domain compiler (GNU CC) on the DEC 3100 workstation.

Table 1 shows the benchmark programs that are used in this paper. The name column shows the names of the benchmark programs. The description column briefly describes the nature of each benchmark program. Table 2 shows the speedup of the IMPACT-I C compiler and the GNU

1** GNU -O failed, reported speedup is that without optimization.
The quality of the code generated by the IMPACT-I C compiler is comparable to that of the MIPS C compiler which is known for its excellent code optimization capabilities. Therefore, the speedup numbers that we report for multiple-instruction-issue architectures in the Section 3 are based on very efficient sequential code.

### Code Generation For Multiple-Instruction-Issue Machine

#### Code Optimization Techniques

A code generator for a parameterized multiple-instruction-issue architecture has been implemented. The code generator performs profile-based branch prediction to support the squashing branch scheme [McFarling 86] [Chang 89.1]. The IMPACT-I C compiler performs several code transformations that enlarge the scope of static scheduling, including function inline expansion, instruction

**Table 2: Code Quality.**

<table>
<thead>
<tr>
<th>name</th>
<th>MIPS-03</th>
<th>GNU-O</th>
<th>IMPACT</th>
</tr>
</thead>
<tbody>
<tr>
<td>ccmp</td>
<td>1.0</td>
<td>0.99</td>
<td>1.07</td>
</tr>
<tr>
<td>cmp</td>
<td>1.0</td>
<td>0.99</td>
<td>1.01</td>
</tr>
<tr>
<td>compress</td>
<td>1.0</td>
<td>0.77 **</td>
<td>1.00</td>
</tr>
<tr>
<td>eqn</td>
<td>1.0</td>
<td>0.98</td>
<td>1.07</td>
</tr>
<tr>
<td>eqntott</td>
<td>1.0</td>
<td>0.85</td>
<td>0.98</td>
</tr>
<tr>
<td>espresso</td>
<td>1.0</td>
<td>0.88</td>
<td>1.00</td>
</tr>
<tr>
<td>grep</td>
<td>1.0</td>
<td>0.86</td>
<td>1.01</td>
</tr>
<tr>
<td>lex</td>
<td>1.0</td>
<td>0.97</td>
<td>1.01</td>
</tr>
<tr>
<td>qsort</td>
<td>1.0</td>
<td>0.94</td>
<td>1.00</td>
</tr>
<tr>
<td>tbl</td>
<td>1.0</td>
<td>0.93</td>
<td>1.00</td>
</tr>
<tr>
<td>wc</td>
<td>1.0</td>
<td>0.83</td>
<td>1.04</td>
</tr>
<tr>
<td>yacc</td>
<td>1.0</td>
<td>0.62 **</td>
<td>0.96</td>
</tr>
</tbody>
</table>
for (i=0; i<120; i++) {
    c[i] = max(a[i], b[i]);
    m = max(m, c[i]);
}

(1) mov i, 0;
LO: (2) load $P1, _a, i; /* a[i], $P1 is the 1st parameter register */
(3) load $P2, _b, i; /* b[i], $P2 is the 2nd parameter register */
(4) jsr _max; /* max(a[i], b[i]), $P0 is the return register */
(5) store _c, i, $P0; /* c[i] = max(a[i], b[i]) */
(6) load $P1, _m, 0; /* m */
(7) mov, $P2, $P0; /* c[i] */
(8) jsr _max; /* max(m, c[i]) */
(9) store _m, 0, $P0; /* m = max(m, c[i]) */
(10) add i, i, 4; /* i++ */
(11) blt i, 120, LO; /* i<120 */
L1:

Figure 1: A Simple Program

placement, super-block formation, loop unrolling, loop peeling, and branch expansion. The compiler also performs several code transformations that reduce the depth of critical paths, including induction variable expansion, register renaming, global variable register allocation, operation folding, and memory disambiguation.

Figure 1 illustrates a simple program that determines the maximum of array elements. This program is used in the following paragraphs to show the functionality of each code optimization.

Branch prediction: There are two unconditional jumps (subroutine calls) and one conditional branch in Figure 1. If the code segment is invoked exactly once, the profile information should indicate that the two subroutine calls (4, 8) have been executed 120 times, and the conditional branch (11) has been taken 119 times and not taken 1 time. For machines that support squashing branch, all three branches should be marked as likely branches and the branch slots should be filled
by instructions from the target paths.

**Function inline expansion:** To perform static code scheduling in the inter-function (inter-procedure) level in Figure 1, the function (procedure) calls must be inline expanded. Figure 2 illustrates the result of function inline expansion.

**Instruction placement:** Although we have eliminated the two function calls from the simple program in Figure 1, the expanded program has several additional branches (4a,4c,8a,8c). Instruction placement is an optimization that reorganizes code to increase sequential locality and to reduce the number of taken branches. The basic idea is to group basic blocks that tend to execute in a sequence together into a linear trace. We use the code segment in Figure 2 as an example. Assume that the elements in b[] are usually larger than the elements in a[]. The most
Figure 3: After instruction placement

likely instruction sequence is (1 → 2 → 3 → 4a → 4b → 4c → 5). Assume that the elements in the beginning of a[] and b[] tend to be bigger than the rest. Then, the most likely instruction sequence is (6 → 7 → 8a → 8b → 8c → 9). The profiler can capture the sequencing information that is needed to guide the instruction placement optimization. The result of instruction placement is shown in Figure 3. Instruction placement forms groups of basic blocks that can be compacted by trace scheduling [Fisher 81]. Instruction (2) to (11) in Figure 3 forms a single trace. After instruction placement, the loop body contains no likely branch, except the loop back-edge. Note that (4c) and (8c) have been eliminated and (4e) and (8e) have been added to the program. Because (4c) and (8c) appear on frequently executed paths and (4e) and (8e) appear on infrequently executed paths, we can expect some speedup for single-instruction-issue architectures as well as for multiple-instruction-issue architectures.
Super-block formation: The flow-dependence arc \((4b \rightarrow 5)\) can be eliminated if the result of \((4b)\) can be copy propagated to \((5)\). This is impossible because instruction \((4d)\) modifies \(SP0\). We propose an optimization called super-block formation which allows traces to be customized. A super-block is a sequence of instructions that can be executed only from the top instruction and may contain multiple branch instructions. If a trace is not already a super-block, the trace can be converted to a super-block by creating a copy of the trace and redirecting all control transfers to the middle of the trace to the duplicate copy. From Figure 3, super-block formation produces Figure 4.

In Figure 4, label \(L3\) and label \(L5\) can be eliminated because all control transfers have been redirected to label \(L6\) and label \(L7\). More classical code optimizations can be applied to the super-block. The result is shown in Figure 5.

Loop unrolling: To increase the number of instructions in the super-block in Figure 5, we can unroll the loop \(N\) times by duplicating the loop body \((N - 1)\) times. Figure 6 shows the new super-block after loop unrolling \((N = 2)\). For multiple-instruction-issue processors, the IMPACT-I C compiler typically unrolls small loops 8 or more times.

Loop peeling: Many loops iterate very few times \((< 10)\) in the benchmark programs that we have studied (see Table 1). For these loops, loop unrolling and software pipelining are less effective because the execution time that is spent in the parallel section (optimized loop body) is not substantially longer than in the sequential section (loop prologue and epilogue). An alternative approach is to peel off enough iterations, such that the loop typically executes as a piece of straight-line code. Figure 7 shows the result of applying loop peeling \((N = 1)\) on the code segment in Figure 5. Typically, the IMPACT-I C compiler peels off about 4 iterations. The peeled iterations
(1) mov i, 0;
L0: (2) load $P1, _a, i;
   (3) load $P2, _b, i;
   (4a) bge $P1, $P2, L2; /* unlikely */
   (4b) mov $P0, $P2;
L3: (5) store _c, i, $P0;
   (6) load $P1, _m, 0;
   (7) mov $P2, $P0;
   (8a) bge $P1, $P2, L4; /* unlikely */
   (8b) mov $P0, $P2;
L5: (9) store _m, 0, $P0;
   (10) add i, i, 4;
   (11) blt i, 120, L0; /* likely */
L1: 
   ..... 
L6: (5') store _c, i, $P0;
   (6') load $P1, _m, 0;
   (7') mov $P2, $P0;
   (8a') bge $P1, $P2, L4; /* unlikely */
   (8b') mov $P0, $P2;
L7: (9') store _m, 0, $P0;
   (10') add i, i, 4;
   (11') blt i, 120, L0; /* likely */
   (12) jump L1; /* likely */
   ..... 
L2: (4d) mov $P0, $P1;
   (4e) jump L6; /* likely */
L4: (8d) mov $P0, $P1;
   (8e) jump L7; /* likely */

Figure 4: After super-block formation
(1) mov i, 0;
L0: (2) load $P1, _a, i;
(3) load $P2, _b, i;
(4a) bge $P1, $P2, L2; /* unlikely */
/* delete (4b) */
(5) store _c, i, $P2;
(6) load $P1, _m, 0;
(8a) bge $P1, $P2, L4; /* unlikely */
/* delete (8b) */
(9) store _m, 0, $P2;
(10) add i, i, 4;
(11) blt i, 120, L0; /* likely */
L1: ...

Figure 5: After super-block formation and classical code optimization

(1) mov i, 0;
L0: (2) load $P1, _a, i;
(3) load $P2, _b, i;
(4a) bge $P1, $P2, L2; /* unlikely */
(5) store _c, i, $P2;
(6) load $P1, _m, 0;
(8a) bge $P1, $P2, L4; /* unlikely */
(9) store _m, 0, $P2;
(10) add i, i, 4;
(11) bge i, 120, L1; /* unlikely */
(2") load $P1, _a, i;
(3") load $P2, _b, i;
(4a") bge $P1, $P2, L2; /* unlikely */
(5") store _c, i, $P2;
(6") load $P1, _m, 0;
(8a") bge $P1, $P2, L4; /* unlikely */
(9") store _m, 0, $P2;
(10") add i, i, 4;
(11") blt i, 120, L0; /* likely */
L1: ...

Figure 6: After loop unrolling
(1) mov i, 0;
(2") load $P1, _a, 0;
(3") load $P2, _b, 0;
(4a") bge $P1, $P2, L2; /* unlikely */
(5") store _c, 0, $P2;
(6") load $P1, _m, 0;
(8a") bge $P1, $P2, L4; /* unlikely */
(9") store _m, 0, $P2;
(10") mov i, 4;

L0: (2) load $P1, _a, i;
(3) load $P2, _b, i;
(4a) bge $P1, $P2, L2; /* unlikely */
(5) store _c, i, $P2;
(6) load $P1, _m, 0;
(8a) bge $P1, $P2, L4; /* unlikely */
(9) store _m, 0, $P2;
(10) add i, i, 4;
(11) blt i, 120, L0; /* unlikely */

L1: 

Figure 7: After loop peeling
can usually be further optimized by classical code optimizations, e.g. copy propagation.

**Branch expansion:** Instruction placement and super-block formation introduce many spurious branch instructions (12), (4e), and (8e) in Figure 4. In realistic programs where there are many nested if — then — else statements, the situation is much worst than in Figure 4. Branch expansion helps to eliminate some branch instructions by replacing them by their target basic blocks. The number of static instructions increases due to this optimization. But the number of dynamic (executed) instructions decreases. Figure 8 shows the result of branch expansion. Note that two super-blocks (L2, L4) are developed by this optimization.

**Induction variable expansion:** After loop unrolling (see Figure 6), induction variables and variables that are used as accumulators are often the source of anti-dependencies and output-
mov i, 0;
L0: add i2, i, 4; /* rename i to i2 for the 2nd iteration */
    load $P1, _a, i;
    load $P2, _b, i;
    bge $P1, $P2, L2;
    store _c, i, $P2;
    load $P1, _m, 0;
    bge $P1, $P2, L4;
    store _m, 0, $P2;
    add i, i, 4; /* replace i by i2 after here */
    bge i2, 120, L1;
    load $P1, _a, i2;
    load $P2, _b, i2;
    bge $P1, $P2, L2;
    store _c, i2, $P2;
    load $P1, _m, 0;
    bge $P1, $P2, L4;
    store _m, 0, $P2;
    add i, i2, 4;
    bit i, 120, L0;
L1: ......

Figure 9: After induction variable expansion
dependencies across iterations. In order to overlap the execution of different iterations, it is necessary to rename these induction variables. Unfortunately, register renaming cannot be applied because an induction variable is alive in the loop. Induction variable expansion is designed specifically to rename induction variables. Figure 9 shows the result of induction variable expansion.

For a loop which is unrolled $N$ times, $(N - 1)$ temporary variables are introduced to represent the induction variable in iteration $2..N$. In Figure 9, instruction (3) to (9) use the original induction variable. Instruction (2) creates a new induction for the second loop iteration. Instruction (11) to (18) use the new induction variable. Note that instruction (10) and (19) update the original induction variable to enforce consistency.

**Register renaming:** Register renaming needs to be applied to the code segment in Figure 9 in order to overlap the execution of different loop iterations. The result is shown in Figure 10. $P1$ in instruction (3) is renamed to $temp1$ and $P2$ in instruction (4) is renamed to $temp2$. Subsequent uses of $P1$, prior to a new definition of $P1$, are also renamed to $temp1$. When $P1$ is redefined in instruction (12), it is renamed to a new temporary variable $temp5$. When a register is alive in the taken path of a branch, a bookkeep instruction needs to be inserted between the branch and the target instruction to undo the effect of register renaming. For example, $P1$ is used before defined in $L2$ super-block. Therefore, instruction (5) must first branch to $L2'$. In $L2'$, $P1$ is updated before jumping to $L2$.

Note that the bookkeep cost can be reduced with branch expansion and classical code optimization by forming $(L2' \rightarrow L2)$, $(L2'' \rightarrow L2)$, $(L4' \rightarrow L4)$, and $(L4'' \rightarrow L4)$ super-blocks. Repeated applications of branch expansion and classical code optimization increase the static program size and decrease the number of dynamic (executed) instructions. However, increasing program size
(1) mov i, 0;
L0: (2) add i2, i, 4;
(3) load temp1, _a, i;
(4) load temp2, _b, i;
(5) bge temp1, temp2, L2'; /* $P1 is live in L2 */
(6) store _c, i, temp2;
(7) load temp3, _m, 0;
(8) bge temp3, temp2, L4'; /* $P1 is live in L4 */
(9) store _m, 0, temp2;
(10) add i, i, 4;
(11) bge i2, 120, L1;
(12) load temp5, _a, i2;
(13) load temp6, _b, i2;
(14) bge temp5, temp6, L2"; /* $P1 is live in L2 */
(15) store _c, i2, temp6;
(16) load temp7, _m, 0;
(17) bge temp7, temp6, L4"; /* $P1 is live in L4 */
(18) store _m, 0, temp6;
(19) add i, i2, 4;
(20) blt i, 120, L0;

L1:

L2': (100) mov $P1, temp1;
(101) jump L2;
L2": (102) mov $P1, temp5;
(103) jump L2;
L4': (104) mov $P1, temp3;
(105) jump L4;
L4": (106) mov $P1, temp7;
(107) jump L4;

Figure 10: After register renaming
may degrade instruction cache performance. The tradeoff between optimizing for efficiency and optimizing for instruction cache performance is not clear.

**Operation folding:** Operation folding eliminates flow-dependencies between instructions by observing special instruction patterns of the form \( f(g(src_g), src_f) \) which can be transformed to \( (g(src_g), f'(src_g, src_f)) \). For example, the flow-dependence between instruction (19) and (20) in Figure 10 can be eliminated as shown in Figure 11.

**Global variable register allocation:** Global variable register allocation identifies all memory accesses to a specific memory location in a loop, replaces all memory accesses by register accesses, and adds appropriate code to load and restore the memory location in the loop prologue and epilogue sections. Some flow-dependencies that are due to memory load and store instructions can be eliminated by global variable register allocation. In Figure 10, all memory accesses to \(_m\) can be converted to register accesses. Figure 11 shows the result of global variable register allocation for the most important super-block of this example. Note that after global variable register allocation, there are more opportunities for register renaming.

**Memory disambiguation:** Memory disambiguation does not by itself perform code transformation. Its duty is to discover optimization opportunities for other code optimizations, such as global variable register allocation and code scheduling. For the C programming language, memory disambiguation is very difficult due to pointers.
/* tm = mem[_m] */

(1) mov i, 0;
(100) load tm, _m, 0; /* tm = mem[_m] */
L0: (2) add i2, i, 4;
(3) load temp1, _a, i;
(4) load temp2, _b, i;
(5) bge temp1, temp2, L2';
(6) store _c, i, temp2;
(7) mov temp3, tm;
(8) bge tm, temp2, L4';
(9) mov tm, temp2;
(10) add i, i, 4;
(11) bge i2, 120, L1;
(12) load temp5, _a, i2;
(13) load temp6, _b, i2;
(14) bge temp5, temp6, L2'';
(15) store _c, i2, temp6;
(16) mov temp7, temp2;
(17) bge temp2, temp6, L4'';
(18) mov tm, temp6;
(19) add i, i2, 4;
(20) bit i2, 116, L0;

L1: (101) store _m, 0, tm; /* mem[_m] = tm */

Figure 12: After global variable register allocation
2.3.2 Code Scheduling Algorithm

Prepass code scheduling is performed prior to register allocation to reduce the effect of artificial data dependencies that are introduced by register assignment [Hwu 88] [Goodman 88]. Postpass code scheduling is performed after register allocation.

Both prepass and postpass code scheduling algorithms consist of the following steps: 1) Form traces from basic blocks that are likely to be executed as a sequence. 2) Form a large superblock from each trace of basic blocks by code duplication. 3) Construct a dependence graph for each super-block. 4) Improve the dependence graph by removing dependence arcs that can be resolved at compile-time. 5) Compute live-variable information. For each branch path, live-variable information tells us what variables must not be destroyed when that branch path is taken. 6) Schedule the refined dependence graph according to machine constraints.

We have shown how super-blocks are formed in the previous subsection (Figure 1 - 12).

The construction of a dependence graph from a super-block is very straight forward by observing every pair of instructions in the super-block. There are three major categories of dependencies: data dependencies, control dependencies, and synchronization dependencies. Data dependencies can be further partitioned to six different types: register flow-dependence, register anti-dependence, register output-dependence, memory flow-dependence, memory anti-dependence, and memory output-dependence. For example, instruction (5) in Figure 12 is register flow-dependent on instruction (3) because instruction (5) uses the result of instruction (3). Instruction (9) is register anti-dependent on instruction (8) because instruction (9) modifies a source register of instruction (8). Instruction (19) is register output-dependent on instruction (10) because they modify the same register. Without memory disambiguation, we have to assume that all memory load and store instructions
may access the same location. For example, instruction (12) is memory flow-dependent on instruction (6). Instruction (6) is memory anti-dependent on instruction (3). Instruction (15) is memory output-dependent on instruction (6). Control dependencies exist between a branch instruction and any other instruction in the super-block. For example, instruction (5) is control-dependent on instruction (4). Instruction (6) is control-dependent on instruction (5). Dependencies due to synchronization instructions are beyond the scope of this paper and will not be discussed further.

After a dependence graph is constructed for a super-block, some memory dependencies can be eliminated by memory disambiguation.

Dataflow analysis [Aho 86] is applied to a super-block to determine for each branch what variable values must not be destroyed when that branch is taken. Let \( \text{live-out}(x) \) denote the set of variables which may be used before defined when a branch \( x \) is taken. The result of dataflow analysis on Figure 12 is shown in Figure 13.

### 2.3.3 Code Scheduling Models

Our code scheduler moves code both upward and downward across branch operations in a super-block. If not because of branch operations, the order of any two operations may be reversed if there is no data dependencies between the two operations. Let \( X \) and \( Y \) denote two operations in the same super-block and \( X \) precedes \( Y \) in the original code. If \( X \) and \( Y \) are both branch operations, their order may not be changed, except when they branch to the same location and are consecutive in the super-block. For simplicity, a static code scheduler typically does not schedule a branch before another preceding branch. If \( Y \) is a branch and \( X \) is not a branch, then the static code scheduler is allowed to schedule \( Y \) ahead of \( X \). If \( Y \) is to be scheduled ahead of \( X \) and the destination register of \( X \) is in \( \text{live-out}(Y) \), then a copy of \( X \) must be inserted between \( Y \) and its
(1) mov i, 0;
(100) load tm, _m, 0;
LO: (2) add i2, i, 4;
(3) load temp1, _a, i;
(4) load temp2, _b, i;
(5) bge temp1, temp2, L2'; /* live-out(5) = {temp1, tm, i} */
(6) store _c, i, temp2;
(7) mov temp3, tm;
(8) bge tm, temp2, L4'; /* live-out(8) = {temp3, tm, i} */
(9) mov tm, temp2;
(10) add i, i, 4;
(11) bge i2, 120, L1; /* live-out(11) = {tm, i} */
(12) load temp5, _a, i2;
(13) load temp6, _b, i2;
(14) bge temp5, temp6, L2'"; /* live-out(14) = {temp5, tm, i} */
(15) store _c, i2, temp6;
(16) mov temp7, temp2;
(17) bge temp2, temp6, L4"; /* live-out(17) = {temp7, tm, i} */
(18) mov tm, temp6;
(19) add i, 12, 4;
(20) bit i2, 116, L0; /* live-out(20) = {tm, i} */
L1: (101) store _m, 0, tm;
......

Figure 13: After dataflow analysis
target instruction. Moving operations from above a branch operation to below is always safe. On the other hand, moving operations from below a branch to above is not always safe. If $X$ is a branch and $Y$ is not a branch, then the static code scheduler is allowed to schedule $Y$ ahead of $X$ if the following two restrictions are not violated.

**Restriction 1:** $Y$ is not in $\text{live-out}(X)$.

**Restriction 2:** $Y$ must not cause an exception that may terminate the program execution.

For example, it is not safe to move a division operation above a branch because of the possibility of dividing by zero. As another example, it is not safe to move a memory load operation above a branch because of the possibility of memory access violation. We have implemented a code scheduling algorithm that enforces the above two restrictions. We refer to this algorithm as **restricted code percolation**.

It is possible to free the code scheduler from the second restriction if the division operation and the memory load operation do not cause exceptions. Instead of trapping on divide by zero or illegal memory access, a magic value is returned. Page faults can be handled in the usual manner. We refer to this code scheduling model as **general code percolation**.

Using aggressive hardware support, the first restriction can also be removed. Smith, Lam, and Horowitz have described such a scheme [Smith 90]. This scheme squashes percolated operations if the branch direction is incorrectly predicted. We have implemented a scheduling method where operations can be freely moved above $N$ branch operations in the same super-block, where $N$ is a design parameter. We refer to this scheduling model as **speculative execution**.

In Section 3, we show the relative performance of the three static code scheduling models. We set $N$ to 32 for the speculative execution model.
Assuming that all operation latencies are unit time and we have infinite computation resource, we apply static code scheduling to the code segment in Figure 13 using aforementioned code scheduling models. There is an implicit linear precedence order among operations that are issued concurrently. We assume that a taken branch can squash all lower precedence operations that are issued concurrently with the taken branch. A legal schedule using the restricted code percolation model is shown in Figure 14. Each row gives the operations that may be issued concurrently in a single cycle. A legal schedule using the general code percolation model is shown in Figure 15. A legal schedule using the speculative execution model is shown in Figure 16.

* = speculative execution

Assuming that all operation latencies are unit time and we have infinite computation resource, we apply static code scheduling to the code segment in Figure 13 using aforementioned code scheduling models. There is an implicit linear precedence order among operations that are issued concurrently. We assume that a taken branch can squash all lower precedence operations that are issued concurrently with the taken branch. A legal schedule using the restricted code percolation model is shown in Figure 14. Each row gives the operations that may be issued concurrently in a single cycle. A legal schedule using the general code percolation model is shown in Figure 15. A legal schedule using the speculative execution model is shown in Figure 16.

* = speculative execution
3 Experiments

3.1 Evaluation Methodology

A machine description file has been written to describe the instruction set, the microarchitecture, and the code scheduling model of each processor architecture under study. The machine description file is used to guide the IMPACT-I C compiler to optimize each benchmark program for each processor architecture. Using a profiler, we measure the execution count of every operation and collect branch statistics. From the profile information, we can derive the best and the worst case execution time of each super-block, assuming ideal cache. The worst case is due to long operation latencies that protrude from one super-block to another super-block. For the benchmark programs used in this paper, the difference between the best case and the worst case execution time is always negligible. In the following discussion, we use the worst case execution time measure.

The experiment is to study the speedup of multiple-instruction-issue processors versus a single-instruction-issue processor for various scheduling models, memory load latencies, and function unit resource limitations. The experiment produces a total of \((X \times Y)\) numbers, where \(X\) is the number of processor configurations under study, and \(Y\) is the number of benchmark programs. Let \(\text{cycle}(i,j)\) denote a function that returns the number of cycles to execute the benchmark program \(j\) on the machine \(i\). Let \(\text{cycle}(1,j)\) denote the number of cycles to execute the benchmark \(j\) on the base architecture. We define the \(\text{speedup}(i)\) function as the harmonic ² mean of \((\text{cycle}(1,.)/\text{cycle}(i,.))\) over all benchmarks.

²To report results conservatively, the harmonic mean is used instead of the arithmetic mean.
### Table 3: Operation latencies.

<table>
<thead>
<tr>
<th>fn</th>
<th>base</th>
<th>MIPS-R3000</th>
<th>SPARC</th>
<th>i860</th>
</tr>
</thead>
<tbody>
<tr>
<td>integer alu</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>barrel shifter</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>integer mul</td>
<td>3</td>
<td>12</td>
<td>47</td>
<td>11</td>
</tr>
<tr>
<td>integer div</td>
<td>25</td>
<td>35</td>
<td>?</td>
<td>59</td>
</tr>
<tr>
<td>load</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>store</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>FP alu</td>
<td>3</td>
<td>6</td>
<td>10</td>
<td>3</td>
</tr>
<tr>
<td>FP conv</td>
<td>3</td>
<td>4</td>
<td>10</td>
<td>4</td>
</tr>
<tr>
<td>FP mul</td>
<td>4</td>
<td>6</td>
<td>12</td>
<td>5</td>
</tr>
<tr>
<td>FP div</td>
<td>25</td>
<td>12</td>
<td>64</td>
<td>38</td>
</tr>
</tbody>
</table>

#### 3.2 Base Architecture

The base architecture is a single-instruction-issue processor that uses the general code percolation model. We have chosen an instruction set that is a superset of the MIPS instruction set to establish a strong single-instruction-issue base architecture. All function units are pipelined. The base column of Table 3 shows the operation latencies. We assume in-order execution and deterministic operation latencies. Each processor includes a 64-entry integer register bank and a 32-entry floating-point register bank. The architecture uses a squashing branch scheme and profile-based branch prediction. One branch slot (one instruction) is automatically allocated for each instruction that contains a predict-taken branch operation.

Considering one cycle branch latency, the base architecture has achieved an execution rate of better than 0.95 operations per cycle for the benchmark programs.

---

3 The integer multiplication and division latencies of the commercial processors are based on software implementation.
3.3 Comparison Of The Three Scheduling Models

Figures 17 through 19 show the speedup of all three code scheduling models over the base architecture for issue rates from one to eight. The graphs show the speedup when the memory load operation latency is one, two, and three cycles respectively. Except for the memory load latency, operation latencies are the same as that of the base architecture. No limitation has been placed on the function unit resources. Every operation code can be executed from every operation slot of an instruction.

The experimental results show that general code percolation and speculative execution substantially out-perform restricted code percolation. They also show that speculative execution consistently performs better than general code percolation, but the improvement is insignificant.

The experimental results indicate that increasing the memory load operation latency substantially degrades the performance of multiple-instruction-issue architectures. This degradation is
Figure 18: Comparison of Scheduling Models for Load Delay 2

Figure 19: Comparison of Scheduling Models for Load Delay 3
most pronounced for high issue rates.

3.4 Limited Resource

The cost to replicate all function units for each additional operation slot in the instruction format can be very high. Therefore, we have evaluated performance degradations due to limited function unit resources. The results are shown in Figures 20 and 21.

The experimental results indicate that the ability to issue multiple branch and memory load operations per cycle is important for high issue rate architectures.

4 The IMPACT Architectural Framework

Figure 22 shows a high-level block diagram of the IMPACT Architectural Framework. In the ideal case, instructions are fetched and decoded every cycle. The control unit issues instructions to the function units in the order they are fetched.
Figure 21: Effects of Limited Resources for Load Delay 2

Figure 22: Block diagram of the IMPACT Architectural Framework.
4.1 Instruction Issue Rate

The number of operations that can be packed into an instruction is an architectural parameter. We have developed a variant of squashing branch, called inline target insertion [Hwu 90] [Chang 89.1], that uses profile-based branch prediction. Inline target insertion allows multiple branch operations to be issued per cycle, and allows branch operations to be fetched from branch slots. Independent of the length of the control unit pipeline, only one program counter needs to be saved in order to return from interrupt.

The ability to execute multiple branch operations per cycle is important for high issue rate architectures. Without this ability, the execution rate of four-issue architectures will be limited to below two times speedup over the base architecture.

4.2 Limited Function Unit Resources

The decoded instructions are forwarded directly to several independent function units. Figure 22 shows four function units for an issue rate of four operations per cycle. Each function unit can be as simple as an integer ALU, or as complex as a composite of a cache interface, a floating-point ALU, an integer ALU, and branch logic.

The experimental results in Figures 20 and 21 indicate that the ability to execute multiple branch and memory load operations is important for high issue rate architectures.

4.3 Support For General Code Percolation

The experimental results show that general code percolation significantly out-performs restricted code percolation. They also show that speculative execution achieves very little speedup beyond general code percolation. Therefore, multiple-instruction-issue architectures should support the
general code percolation model. The hardware support includes disabling exceptions that are caused by divide by zero and by illegal memory accesses. Some recent processors already support a set of arithmetic operations that do not signal overflow exception [Kane 87] [Sparc 87] [Amd] [IBM 90] [Intel 89].

4.4 Memory Load Latency

The experimental results in Figure 23 clearly indicate that increasing the memory load latency substantially degrades the performance of high issue rate architectures. Memory load operations frequently appear on critical paths. As instruction issue bandwidth increases, critical paths become more visible. Therefore, we strongly recommend a short memory load operation latency for multiple-instruction-issue machines.
5 Conclusion

Our approach to the design of multiple-instruction-issue processors can be summarized into three steps. In step one, we constructed IMPACT-I, a highly optimizing C compiler for multiple-instruction-issue processors. In step two, we conducted experiments to derive the IMPACT Architectural Framework. Multiple-instruction-issue processors designed within this framework receive effective compilation support from the IMPACT-I C compiler. In step three, we ran experiments to verify that multiple-instruction-issue processors designed within the IMPACT Architectural Framework have achieved solid speedup over a high-performance single-instruction-issue processor.

The IMPACT-I C compiler has code generators for several major commercial architectures either available or under construction. Within the IMPACT Architectural Framework, multiple-instruction-issue implementations of these architectures receive immediate compilation support from the IMPACT-I C compiler. This would significantly reduce the uncertainties and misconceptions about software during hardware development. Furthermore, processors designed within the framework would have compiler support to deliver promised performance in their production use.

In terms of engineering tradeoffs, we recommend the following. First, general code percolation should be supported as the code scheduling model for multiple-instruction-issue processors. Second, high issue rate processors should support short memory load latency. Third, high issue rate processors should have the ability to execute multiple branch and load operations in each cycle. All these features can be incorporated into existing commercial architectures in an upward compatible manner.

Future directions of the IMPACT Architectural Framework project include supporting multiple-instruction-issue implementations of major commercial architectures, enhancing the code optimiza-
tion capabilities of the IMPACT-I C compiler, and extending the framework to support multipro-
cessor architectures.

Acknowledgements The authors would like to thank all members of the IMPACT research
group for their support, comments and suggestions. This research has been supported by the
National Science Foundation (NSF) under Grant MIP-8809478, a donation from NCR, the National
Aeronautics and Space Administration (NASA) under Contract NASA NAG 1-613 in cooperation
with the Illinois Computer Laboratory for Aerospace Systems and Software (ICLASS), and the
Office of Naval Research under Contract N00014-88-K-0656.
References


[Intel 89] Intel, "i860(TM) 64-Bit Microprocessor", Order Number 240296-002, Santa Clara, California, April, 1989.


