### Chapter 3

# Instruction Level Parallelism and Its Dynamic Exploitation

Slides: D. Patterson, W. Gross, V. Hayward, T. Arbel

### **Recall from Pipelining Review**

- Pipeline CPI = Ideal pipeline CPI + Structural
   Stalls + Data Hazard Stalls + Control Stalls
  - <u>Ideal pipeline CPI</u>: measure of the maximum performance attainable by the implementation
  - <u>Structural hazards</u>: HW cannot support this combination of instructions
  - <u>Data hazards</u>: Instruction depends on result of prior instruction still in the pipeline
  - <u>Control hazards</u>: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

## **Ideas to Reduce Stalls**

|           | Technique                                  | Reduces                            |
|-----------|--------------------------------------------|------------------------------------|
| <b>▲</b>  | Dynamic scheduling                         | Data hazard stalls                 |
|           | Dynamic branch prediction                  | Control stalls                     |
| Chapter 3 | Issuing multiple<br>instructions per cycle | Ideal CPI                          |
| onuprer o | Speculation                                | Data and control stalls            |
|           | Dynamic memory                             | Data hazard stalls involving       |
| •         | disambiguation                             | memory                             |
| <b>↓</b>  | Loop unrolling                             | Control hazard stalls              |
|           | Basic compiler pipeline<br>scheduling      | Data hazard stalls                 |
| Chapter 4 | Compiler dependence<br>analysis            | Ideal CPI and data hazard stalls   |
|           | Software pipelining and trace scheduling   | Ideal CPI and data hazard stalls   |
| •         | Compiler speculation                       | Ideal CPI, data and control stalls |

# Instruction-Level Parallelism (ILP)

- Basic Block (BB) ILP is quite small
  - BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
  - average dynamic branch frequency 15% to 25%
     => 4 to 7 instructions execute between a pair of branches
  - Plus instructions in BB likely to depend on each other
- To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
- Simplest: <u>loop-level parallelism</u> to exploit parallelism among iterations of a loop
  - Vector is one way
  - If not vector, then either dynamic via branch prediction or static via loop unrolling by compiler

### Data Dependence and Hazards

Instr<sub>J</sub> is data dependent on Instr<sub>I</sub>
 Instr<sub>J</sub> tries to read operand before Instr<sub>I</sub> writes it

I: add r1,r2,r3
J: sub r4,r1,r3

- $\cdot$  or  $Instr_J$  is data dependent on  $Instr_K$  which is dependent on  $Instr_I$
- Caused by a "True Dependence" (compiler term)
- If true dependence caused a hazard in the pipeline, called a Read After Write (RAW) hazard

# Data Dependence and Hazards

- Dependences are a property of programs
- Presence of dependence indicates potential for a hazard, but actual hazard and length of any stall is a property of the pipeline
- Importance of the data dependencies
- 1) indicates the possibility of a hazard
- 2) determines order in which results must be calculated
- 3) sets an upper bound on how much parallelism can possibly be exploited
- Today looking at HW schemes to avoid hazard

#### Name Dependence #1: Anti-dependence

- Name dependence: when 2 instructions use same register or memory location, called a name, but no flow of data between the instructions associated with that name; 2 versions of name dependence
- Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> reads it

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1"

 If anti-dependence caused a hazard in the pipeline, called a Write After Read (WAR) hazard

#### Name Dependence #2: Output dependence

• Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> writes it.

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

- Called an "output dependence" by compiler writers This also results from the reuse of name "r1"
- If anti-dependence caused a hazard in the pipeline, called a Write After Write (WAW) hazard

# **ILP and Data Hazards**

- HW/SW must preserve program order: order instructions would execute in if executed sequentially 1 at a time as determined by original source program
- HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
- Instructions involved in a name dependence can execute simultaneously if name used in instructions is changed so instructions do not conflict
  - Register renaming resolves name dependence for regs
  - Either by compiler or by HW

# **Control Dependencies**

 Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order

- }
- S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

### **Control Dependence Ignored**

- · Control dependence need not be preserved
  - willing to execute instructions that should not have been executed, thereby violating the control dependences, if can do so without affecting correctness of the program
- Instead, 2 properties critical to program correctness are exception behavior and data flow

### **Exception Behavior**

- Preserving exception behavior => any changes in instruction execution order must not change how exceptions are raised in program (=> no new exceptions)
- Example:

| DADDU | R2,R3,R4 |
|-------|----------|
| BEQZ  | R2,L1    |
| LW    | R1,0(R2) |
| L1:   |          |

• Problem with moving LW before BEQZ?

# **Data Flow**

- Data flow: actual flow of data values among instructions that produce results and those that consume them
  - branches make flow dynamic, determine which instruction is supplier of data
- Example:

| DADDU | R1,R2,R3 |
|-------|----------|
| BEQZ  | R4,L     |
| DSUBU | R1,R5,R6 |
| L:    |          |
| OR    | R7,R1,R8 |

• OR depends on DADDU or DSUBU? Must preserve data flow on execution

# Advantages of Dynamic Scheduling

- Handles cases when dependences unknown at compile time
  - (e.g., because they may involve a memory reference)
- It simplifies the compiler
- Allows code that compiled for one pipeline to run efficiently on a different pipeline
- Hardware speculation, a technique with significant performance advantages, that builds on dynamic scheduling

The limiting factor for standard pipelines is *in-order execution*: a stalling instruction stalls all subsequent instructions, hence magnifying the problem.

Example:

DIV.D F0, F2, F4 ADD.D F10, F0, F8 SUB.D F12, F8, F14 // could execute // out of order



However *out-of-order execution* has new consequences:



© 2002, 2003 V. Hayward, T. Arbel, W. Gross, Department of Electrical and Computer Engineering, McGill University, Textbook Figures © Elsevier Science 1990, 1996, 2003

But using more registers can systematically eliminate name dependencies:



It is evident that this has to be managed systematically as the number of registers is limited. The solution is called *register renaming* which involves making a hardware copy of a register only when needed.

© 2002, 2003 V. Hayward, T. Arbel, W. Gross, Department of Electrical and Computer Engineering, McGill University, Textbook Figures © Elsevier Science 1990, 1996, 2003

#### **Register renaming**:

Eliminates WAR and WAW hazards by renaming destination registers.

```
DIV.D F0, F2, F4
ADD.D F6, F0, F8 // may finish later than MUL.D (WAW)
S.D F6, 0(R1)
SUB.D F8, F10, F14 // anti-dependence with ADD.D (WAR)
MUL.D F6, F10, F8 // output dependence with ADD.D
Register renaming:
Temp registers: S, T
DIV.D F0, F2, F4
ADD.D S, F0, F8 // S replaces F6
S.D S, 0(R1)
SUB.D T, F10, F14 // T replaces F8
MUL.D F6, F10, T
```

Also need to replace any subsequent uses of F8 with T. This can be tricky since there may be branches between uses of F8.

<sup>© 2002, 2003</sup> V. Hayward, T. Arbel, W. Gross, Department of Electrical and Computer Engineering, McGill University, Textbook Figures © Elsevier Science 1990, 1996, 2003

There is a dynamic scheduling scheme which can do register renaming across branches!

Tomasulo's Algorithm – hardware for out-of-order execution

Basic ideas:

- 1. Track dependencies: allow execution as soon as operands are available.
- 2. Register renaming to eliminate WAR and WAW hazards.

© 2002, 2003 V. Hayward, T. Arbel, W. Gross, Department of Electrical and Computer Engineering, McGill University, Textbook Figures © Elsevier Science 1990, 1996, 2003

### HW Schemes: Instruction Parallelism

- Key idea: Allow instructions behind stall to proceed
   DIVD F0,F2,F4
   ADDD F10,F0,F8
   SUBD F12,F8,F14
- Enables out-of-order execution and allows out-of-order completion
- Will distinguish when an instruction begins execution and when it completes execution; between 2 times, the instruction is in execution
- In a dynamically scheduled pipeline, all instructions pass through issue stage in order (in-order issue)

# Dynamic Scheduling Step 1

- Simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
- Split the ID pipe stage of simple 5-stage pipeline into 2 stages:
- *Issue*—Decode instructions, check for structural hazards
- Read operands—Wait until no data hazards, then read operands

#### A Dynamic Algorithm: Tomasulo's Algorithm

- For IBM 360/91 (before caches!)
- Goal: High Performance without special compilers
- Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations
  - This led Tomasulo to try to figure out how to get more effective registers renaming in hardware!
- Why Study 1966 Computer?
- The descendants of this have flourished!
  - Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, ...

# **Tomasulo Algorithm**

- Control & buffers distributed with Function Units (FU)
  - FU buffers called "reservation stations"; have pending operands
- Registers in instructions replaced by values or pointers to reservation stations(RS); called <u>register</u> <u>renaming</u>;
  - avoids WAR, WAW hazards
  - More reservation stations than registers, so can do optimizations compilers can't
- Results to FU from RS, <u>not through registers</u>, over <u>Common Data Bus</u> that broadcasts results to all FUs
- $\cdot$  Load and Stores treated as FUs with RSs as well
- Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue

# **Tomasulo Organization**



Common Data Bus (CDB)

#### **Reservation Station Components**

Op: Operation to perform in the unit (e.g., + or -) Vj, Vk: Value of Source operands

- Store buffers has V field, result to be stored

Qj, Qk: Reservation stations producing source registers (value to be written)

- Note: Qj,Qk=0 => ready
- Store buffers only have Qi for RS producing result

**Busy:** Indicates reservation station or FU is busy

**Register result status**—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.

# Three Stages of Tomasulo Algorithm

#### 1. Issue-get instruction from FP Op Queue

If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).

#### 2. Execute—operate on operands (EX)

When both operands ready then execute; if not ready, watch Common Data Bus for result

#### 3. Write result—finish execution (WB)

Write on Common Data Bus to all awaiting units; mark reservation station available

- Normal data bus: data + destination ("go to" bus)
- <u>Common data bus</u>: data + <u>source</u> ("<u>come from</u>" bus)
  - 64 bits of data + 4 bits of Functional Unit source address
  - Write if matches expected Functional Unit (produces result)
  - Does the broadcast
- Example speed:
   3 clocks for Fl .pt. +,-; 10 for \*; 40 clks for /

# **Tomasulo Example**



Instruction stream







#### Register result status:

| Clock |    | F0 | <b>F</b> 2 | F4 | F6    | F8 | <i>F10</i> | <i>F12</i> | ••• | <i>F30</i> |
|-------|----|----|------------|----|-------|----|------------|------------|-----|------------|
| 2     | FU |    | Load2      |    | Load1 |    |            |            |     |            |
|       |    |    |            |    |       |    |            |            |     |            |

#### Note: Can have multiple loads outstanding



- Note: registers names are removed ("renamed") in Reservation Stations; MULT issued
- Load1 completing; what is waiting for Load1?



Load2 completing; what is waiting for Load2?



Timer starts down for Add1, Mult1



• Issue ADDD here despite name dependency on F6?



Add1 (SUBD) completing; what is waiting for it?



| Clock |    | FO    | F2    | F4 | <i>F6</i> | <i>F8</i> | <i>F10</i> | <i>F12</i> | ••• | <i>F30</i> |
|-------|----|-------|-------|----|-----------|-----------|------------|------------|-----|------------|
| 8     | FU | Mult1 | M(A2) |    | Add2      | (M-M)     | Mult2      |            |     |            |

| Instructio              | on sta                | tus:      |            |       | Exec          | Write     |       |       |      |         |
|-------------------------|-----------------------|-----------|------------|-------|---------------|-----------|-------|-------|------|---------|
| Instructi               | on                    | j         | k          | Issue | Comp          | Result    |       |       | Busy | Address |
| LD                      | F6                    | 34+       | R2         | 1     | 3             | 4         |       | Load1 | No   |         |
| LD                      | F2                    | 45+       | <b>R</b> 3 | 2     | 4             | 5         |       | Load2 | No   |         |
| MULTD                   | <b>F0</b>             | F2        | F4         | 3     |               |           |       | Load3 | No   |         |
| SUBD                    | F8                    | <b>F6</b> | F2         | 4     | 7             | 8         |       |       |      |         |
| DIVD                    | F10                   | FO        | <b>F6</b>  | 5     |               |           |       |       |      |         |
| ADDD                    | F6                    | F8        | F2         | 6     |               |           |       |       |      |         |
| Reservati               | Reservation Stations: |           |            |       |               | <i>S2</i> | RS    | RS    |      |         |
|                         | Time                  | Name      | Busy       | Ор    | Vj            | Vk        | Qj    | Qk    | _    |         |
|                         |                       | Add1      | No         |       |               |           |       |       |      |         |
|                         | 1                     | Add2      | Yes        | ADDD  | (M-M)         | M(A2)     |       |       |      |         |
|                         |                       | Add3      | No         |       |               |           |       |       |      |         |
|                         | 6                     | Mult1     | Yes        | MULTE | <b>M</b> (A2) | R(F4)     |       |       |      |         |
|                         |                       | Mult2     | Yes        | DIVD  |               | M(A1)     | Mult1 |       |      |         |
| Register result status: |                       |           |            |       |               |           |       |       |      |         |

| Clock |    | F0    | <i>F2</i> | F4 | <i>F6</i> | F8    | <i>F10</i> | <i>F12</i> | ••• | F30 |
|-------|----|-------|-----------|----|-----------|-------|------------|------------|-----|-----|
| 9     | FU | Mult1 | M(A2)     |    | Add2      | (M-M) | Mult2      |            |     |     |

| Instructio              | n sta | tus:    |            |       | Exec          | Write         |       |       |            |            |     |     |
|-------------------------|-------|---------|------------|-------|---------------|---------------|-------|-------|------------|------------|-----|-----|
| Instructio              | on    | j       | k          | Issue | Comp          | Result        |       |       | Busy       | Address    |     |     |
| LD                      | F6    | 34+     | R2         | 1     | 3             | 4             |       | Load1 | No         |            |     |     |
| LD                      | F2    | 45+     | <b>R3</b>  | 2     | 4             | 5             |       | Load2 | No         |            |     |     |
| MULTD                   | FO    | F2      | <b>F</b> 4 | 3     |               |               |       | Load3 | No         |            |     |     |
| SUBD                    | F8    | F6      | F2         | 4     | 7             | 8             |       |       |            |            |     |     |
| DIVD                    | F10   | F0      | <b>F6</b>  | 5     |               |               |       |       |            |            |     |     |
| ADDD                    | F6    | F8      | F2         | 6     | 10            |               |       |       |            |            |     |     |
| Reservati               | on St | tations | •          |       | <i>S1</i>     | <i>S2</i>     | RS    | RS    |            |            |     |     |
|                         | Time  | Name    | Busy       | Ор    | Vj            | Vk            | Qj    | Qk    | _          |            |     |     |
|                         |       | Add1    | No         |       |               |               |       |       |            |            |     |     |
|                         | (     | ) Add2  | Yes        | ADDD  | (M-M)         | M(A2)         |       |       |            |            |     |     |
|                         |       | Add3    | No         |       |               |               |       |       |            |            |     |     |
|                         | 4     | 5 Mult1 | Yes        | MULTE | <b>M</b> (A2) | <b>R</b> (F4) |       |       |            |            |     |     |
|                         |       | Mult2   | Yes        | DIVD  |               | M(A1)         | Mult1 |       |            |            |     |     |
| Register result status: |       |         |            |       |               |               |       |       |            |            |     |     |
| Clock                   |       |         |            | FO    | F2            | F4            | F6    | F8    | <i>F10</i> | <i>F12</i> | ••• | F30 |
| 10                      |       |         | FU         | Mult1 | M(A2)         |               | Add2  | (M-M) | Mult2      |            |     |     |

Add2 (ADDD) completing; what is waiting for it?



| Clock |    | FO    | <i>F2</i> | F4 | 4 F6   | F8    | <i>F10</i> | <i>F12</i> | • • • | <i>F30</i> |
|-------|----|-------|-----------|----|--------|-------|------------|------------|-------|------------|
| 11    | FU | Mult1 | M(A2)     |    | (M-M+N | (M-M) | Mult2      |            |       |            |

- Write result of ADDD here?
- All quick instructions complete in this cycle!

| In | structio   | n sta     | tus:      |            |       | Exec          | Write            |       |            |               |           |             |
|----|------------|-----------|-----------|------------|-------|---------------|------------------|-------|------------|---------------|-----------|-------------|
|    | Instructio | n         | j         | k          | Issue | Comp          | Result           |       |            | Busy          | Address   |             |
|    | LD         | F6        | 34+       | R2         | 1     | 3             | 4                |       | Load1      | No            |           |             |
|    | LD         | F2        | 45+       | <b>R</b> 3 | 2     | 4             | 5                |       | Load2      | No            |           |             |
|    | MULTD      | <b>F0</b> | F2        | <b>F</b> 4 | 3     |               |                  |       | Load3      | No            |           |             |
|    | SUBD       | <b>F8</b> | <b>F6</b> | F2         | 4     | 7             | 8                |       |            |               |           |             |
|    | DIVD       | F10       | FO        | <b>F6</b>  | 5     |               |                  |       |            |               |           |             |
|    | ADDD       | F6        | F8        | F2         | 6     | 10            | 11               |       |            |               |           |             |
| Re | eservatio  | on St     | ations    | 5:         |       | <i>S1</i>     | <i>S2</i>        | RS    | RS         |               |           |             |
|    |            | Time      | Name      | Busy       | Ор    | Vj            | Vk               | Qj    | Qk         | _             |           |             |
|    |            |           | Add1      | No         |       |               |                  |       |            |               |           |             |
|    |            |           | Add2      | No         |       |               |                  |       |            |               |           |             |
|    |            |           | Add3      | No         |       |               |                  |       |            |               |           |             |
|    |            | 3         | Mult1     | Yes        | MULTE | <b>M</b> (A2) | <b>R</b> (F4)    |       |            |               |           |             |
|    |            |           | Mult2     | Yes        | DIVD  |               | M(A1)            | Mult1 |            |               |           |             |
| Re | egister r  | esult     | statu     | s:         |       |               |                  |       |            |               |           |             |
|    | $O_1$ 1    |           |           |            |       | $\Gamma$      | $\Gamma \Lambda$ |       | $\Gamma O$ | $\Gamma_{10}$ | <b>F1</b> | <b>E</b> 20 |

| Clock |    | FO    | F2    | F4 | F6      | F8    | F10   | <i>F12</i> | ••• | <i>F30</i> |
|-------|----|-------|-------|----|---------|-------|-------|------------|-----|------------|
| 12    | FU | Mult1 | M(A2) |    | (M-M+N) | (M-M) | Mult2 |            |     |            |



| Clock |    | FO    | F2    | F4 | F6     | F8    | F10   | <i>F12</i> | ••• | <i>F30</i> |
|-------|----|-------|-------|----|--------|-------|-------|------------|-----|------------|
| 13    | FU | Mult1 | M(A2) |    | (M-M+N | (M-M) | Mult2 |            |     |            |

| Instructio | on sta     | tus:    |           |       | Exec          | Write         |       |       |      |         |   |
|------------|------------|---------|-----------|-------|---------------|---------------|-------|-------|------|---------|---|
| Instructi  | on         | j       | k         | Issue | Comp          | Result        |       |       | Busy | Address |   |
| LD         | F6         | 34+     | R2        | 1     | 3             | 4             |       | Load1 | No   |         |   |
| LD         | F2         | 45+     | <b>R3</b> | 2     | 4             | 5             |       | Load2 | No   |         |   |
| MULTD      | <b>F</b> 0 | F2      | <b>F4</b> | 3     |               |               |       | Load3 | No   |         | l |
| SUBD       | <b>F8</b>  | F6      | F2        | 4     | 7             | 8             |       |       |      |         |   |
| DIVD       | F10        | FO      | <b>F6</b> | 5     |               |               |       |       |      |         |   |
| ADDD       | F6         | F8      | F2        | 6     | 10            | 11            |       |       |      |         |   |
| Reservati  | on St      | ations  | 5:        |       | <i>S1</i>     | <i>S2</i>     | RS    | RS    |      |         |   |
|            | Time       | Name    | Busy      | Op    | Vj            | Vk            | Qj    | Qk    |      |         |   |
|            |            | Add1    | No        |       |               |               |       |       |      |         |   |
|            |            | Add2    | No        |       |               |               |       |       |      |         |   |
|            |            | Add3    | No        |       |               |               |       |       |      |         |   |
|            | 1          | Mult1   | Yes       | MULTE | <b>M</b> (A2) | <b>R</b> (F4) |       |       |      |         |   |
|            |            | Mult2   | Yes       | DIVD  |               | M(A1)         | Mult1 |       |      |         |   |
| Register   | result     | t statu | s:        |       |               |               |       |       |      |         |   |

| Clock |    | FO    | F2    | F4 | F6     | F8    | <i>F10</i> | <i>F12</i> | ••• | <i>F30</i> |
|-------|----|-------|-------|----|--------|-------|------------|------------|-----|------------|
| 14    | FU | Mult1 | M(A2) |    | (M-M+N | (M-M) | Mult2      |            |     |            |



Mult1 (MULTD) completing; what is waiting for it?



Just waiting for Mult2 (DIVD) to complete

#### Faster than light computation (skip a couple of cycles)

| Instructio | n sta     | tus:      |            |       | Exec      | Write     |    |       |      |         |     |
|------------|-----------|-----------|------------|-------|-----------|-----------|----|-------|------|---------|-----|
| Instructio | on        | j         | k          | Issue | Comp      | Result    |    |       | Busy | Address | _   |
| LD         | F6        | 34+       | R2         | 1     | 3         | 4         |    | Load1 | No   |         |     |
| LD         | F2        | 45+       | <b>R</b> 3 | 2     | 4         | 5         |    | Load2 | No   |         |     |
| MULTD      | <b>F0</b> | <b>F2</b> | <b>F</b> 4 | 3     | 15        | 16        |    | Load3 | No   |         |     |
| SUBD       | F8        | F6        | F2         | 4     | 7         | 8         |    |       |      |         |     |
| DIVD       | F10       | FO        | F6         | 5     |           |           |    |       |      |         |     |
| ADDD       | F6        | F8        | F2         | 6     | 10        | 11        |    |       |      |         |     |
| Reservatio | on St     | ations    | 5.         |       | <i>S1</i> | <i>S2</i> | RS | RS    |      |         |     |
|            | Time      | Name      | Busy       | Ор    | Vj        | Vk        | Qj | Qk    | _    |         |     |
|            |           | Add1      | No         |       |           |           |    |       |      |         |     |
|            |           | Add2      | No         |       |           |           |    |       |      |         |     |
|            |           | Add3      | No         |       |           |           |    |       |      |         |     |
|            |           | Mult1     | No         |       |           |           |    |       |      |         |     |
|            | 1         | Mult2     | Yes        | DIVD  | M*F4      | M(A1)     |    |       |      |         |     |
| Register r | esult     | t statu   | <i>s:</i>  |       |           |           |    |       |      |         |     |
| Clock      |           |           |            | FO    | F2        | F4        | F6 | F8    | F10  | F12     | F30 |

| Clock |    | FO   | F2    | F4 | F6     | F8    | F10   | <i>F12</i> | ••• | <i>F30</i> |
|-------|----|------|-------|----|--------|-------|-------|------------|-----|------------|
| 55    | FU | M*F4 | M(A2) |    | (M-M+N | (M-M) | Mult2 |            |     |            |



Mult2 (DIVD) is completing; what is waiting for it?



 Once again: In-order issue, out-of-order execution and out-of-order completion.

#### **Tomasulo Drawbacks**

- · Complexity
  - delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon!
- Many associative stores (CDB) at high speed
- Performance limited by Common Data Bus
  - Each CDB must go to multiple functional units ⇒high capacitance, high wiring density
  - Number of functional units that can complete per cycle limited to one!
    - » Multiple CDBs  $\Rightarrow$  more FU logic for parallel assoc stores
- Non-precise interrupts!
  - We will address this later

#### Tomasulo Loop Example

| Loop:LD | FO | 0    | R1 |
|---------|----|------|----|
| MULTD   | F4 | FO   | F2 |
| SD      | F4 | 0    | R1 |
| SUBI    | R1 | R1   | #8 |
| BNEZ    | R1 | Loop |    |

- This time assume Multiply takes 4 clocks
- Assume 1st load takes 8 clocks
   (L1 cache miss), 2nd load takes 1 clock (hit)
- To be clear, will show clocks for SUBI, BNEZ - Reality: integer instructions ahead of Fl. Pt. Instructions
- Show 2 iterations

#### Loop Example



Value of Register used for address, iteration control







#### Implicit renaming sets up data flow graph



Dispatching SUBI Instruction (not in FP queue)



And, BNEZ instruction (not in FP queue)



Notice that FO never sees Load from location 80



- Register file completely detached from computation
- First and Second iteration completely overlapped





- Load1 completing: who is waiting?
- Note: Dispatching SUBI



- Load2 completing: who is waiting?
- Note: Dispatching BNEZ



Next load in sequence



• Why not issue third multiply?

| Instruction statu  | <i>s:</i>  |       | Write      |               |           |        |            |            |            |            |
|--------------------|------------|-------|------------|---------------|-----------|--------|------------|------------|------------|------------|
| ITER Instruct      | ion        | j     | k          | Issue         | Сотр      | Result |            | Busy       | Addr       | Fu         |
| 1 LD               | <b>F0</b>  | 0     | <b>R</b> 1 | 1             | 9         | 10     | Load1      | No         |            |            |
| 1 MULTD            | F4         | F0    | F2         | 2             |           |        | Load2      | No         |            |            |
| 1 SD               | F4         | 0     | <b>R</b> 1 | 3             |           |        | Load3      | Yes        | 64         |            |
| 2 LD               | <b>F0</b>  | 0     | <b>R</b> 1 | 6             | 10        | 11     | Store1     | Yes        | 80         | Mult1      |
| 2 MULTD            | <b>F</b> 4 | FO    | F2         | 7             |           |        | Store2     | Yes        | 72         | Mult2      |
| 2 SD               | F4         | 0     | <b>R</b> 1 | 8             |           |        | Store3     | No         |            |            |
| Reservation Stat   | ions:      |       |            | <i>S1</i>     | <i>S2</i> | RS     |            |            |            |            |
| Time Name          | Busy       | Ор    | Vj         | Vk            | Qj        | Qk     | Code:      |            |            |            |
| Add1               | No         |       |            |               |           |        | LD         | FO         | 0          | <b>R</b> 1 |
| Add2               | No         |       |            |               |           |        | MULTD      | F4         | F0         | F2 <       |
| Add3               | No         |       |            |               |           |        | SD         | F4         | 0          | <b>R</b> 1 |
| 1 Mult1            | Yes        | Multd | M[80]      | <b>R</b> (F2) |           |        | SUBI       | <b>R</b> 1 | <b>R</b> 1 | #8         |
| 2 Mult2            | Yes        | Multd | M[72]      | <b>R</b> (F2) |           |        | BNEZ       | <b>R</b> 1 | Loop       |            |
| Register result st | tatus      |       |            |               |           |        |            |            |            |            |
| Clock R1           |            | FO    | <i>F2</i>  | <i>F4</i>     | <i>F6</i> | F8     | <i>F10</i> | <i>F12</i> | •••        | F30        |
| 13 64              | Fu         | Load3 |            | Mult2         |           |        |            |            |            |            |

#### • Why not issue third store?



Mult1 completing. Who is waiting?



Mult2 completing. Who is waiting?



| Instructi | on statu  | <i>s:</i>  |       |            | Write     |           |        |            |            |            |            |
|-----------|-----------|------------|-------|------------|-----------|-----------|--------|------------|------------|------------|------------|
| ITER      | Instruct  | ion        | j     | k          | Issue     | Comp      | Result |            | Busy       | Addr       | Fu         |
| 1         | LD        | F0         | 0     | <b>R</b> 1 | 1         | 9         | 10     | Load1      | No         |            |            |
| 1         | MULTD     | F4         | F0    | F2         | 2         | 14        | 15     | Load2      | No         |            |            |
| 1         | SD        | <b>F</b> 4 | 0     | <b>R</b> 1 | 3         |           |        | Load3      | Yes        | 64         |            |
| 2         | LD        | <b>F0</b>  | 0     | <b>R</b> 1 | 6         | 10        | 11     | Store1     | Yes        | 80         | [80]*R2    |
| 2         | MULTD     | <b>F</b> 4 | F0    | F2         | 7         | 15        | 16     | Store2     | Yes        | 72         | [72]*R2    |
| 2         | SD        | <b>F4</b>  | 0     | <b>R</b> 1 | 8         |           |        | Store3     | Yes        | 64         | Mult1      |
| Reservat  | tion Stat | ions:      |       |            | <i>S1</i> | <i>S2</i> | RS     |            |            |            |            |
| Time      | Name      | Busy       | Ор    | Vj         | Vk        | Qj        | Qk     | Code:      |            |            |            |
|           | Add1      | No         |       |            |           |           |        | LD         | F0         | 0          | <b>R</b> 1 |
|           | Add2      | No         |       |            |           |           |        | MULTD      | F4         | F0         | F2         |
|           | Add3      | No         |       |            |           |           |        | SD         | F4         | 0          | R1 🔶       |
|           | Mult1     | Yes        | Multd |            | R(F2)     | Load3     |        | SUBI       | <b>R</b> 1 | <b>R</b> 1 | #8         |
|           | Mult2     | No         |       |            |           |           |        | BNEZ       | <b>R</b> 1 | Loop       |            |
| Register  | result si | tatus      |       |            |           |           |        |            |            |            |            |
| Clock     | <b>R1</b> |            | FO    | <i>F2</i>  | F4        | <i>F6</i> | F8     | <i>F10</i> | <i>F12</i> | •••        | <i>F30</i> |
| 17        | 64        | Fu         | Load3 |            | Mult1     |           |        |            |            |            |            |







and out-of-order completion.

# Why can Tomasulo overlap iterations of loops?

- Register renaming
  - Multiple iterations use different physical destinations for registers (dynamic loop unrolling).
- Reservation stations
  - Permit instruction issue to advance past integer control flow operations
  - Also buffer old values of registers totally avoiding the WAR stall that we saw in the scoreboard.
- Other perspective: Tomasulo building data flow dependency graph on the fly.

#### Tomasulo's scheme offers 2 major advantages

(1) the distribution of the hazard detection logic

- distributed reservation stations and the CDB
- If multiple instructions waiting on single result, & each instruction has other operand, then instructions can be released simultaneously by broadcast on CDB
- If a centralized register file were used, the units would have to read their results from the registers when register buses are available.
- (2) the elimination of stalls for WAW and WAR hazards

#### What about Precise Interrupts?

• Tomasulo had:

In-order issue, out-of-order execution, and out-of-order completion

 Need to "fix" the out-of-order completion aspect so that we can find precise breakpoint in instruction stream.

# Relationship between precise interrupts and specultation:

- Speculation is a form of guessing.
- Important for branch prediction:
  - Need to "take our best shot" at predicting branch direction.
- If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly:
  - This is exactly same as precise exceptions!
- Technique for both precise interrupts/exceptions and speculation: *in-order completion or commit*

## HW support for precise interrupts

- Need HW buffer for results of uncommitted instructions: reorder buffer
  - 3 fields: instr, destination, value
  - Use reorder buffer number instead of reservation station when execution completes
  - Supplies operands between execution complete & commit
  - (Reorder buffer can be operand source => more registers like RS)
  - Instructions commit
  - Once instruction commits, result is put into register
  - As a result, easy to undo speculated instructions on mispredicted branches <u>or exceptions</u>



#### Four Steps of Speculative Tomasulo Algorithm

#### 1. Issue-get instruction from FP Op Queue

If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called "dispatch")

#### 2. Execution—operate on operands (EX)

When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called "issue")

#### **3. Write result**—finish execution (WB)

Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.

#### 4. Commit—update register with reorder result

When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called "graduation")

## What are the hardware complexities with reorder buffer (ROB)?



- How do you find the latest version of a register?
  - (As specified by Smith paper) need associative comparison network
  - Could use future file or just use the register result status buffer to track which specific reorder buffer has received the value
- Need as many ports on ROB as register file

#### **Dynamic Memory Disambiguation**

- WAR and WAW hazards are eliminated by Tomasulo's algorithm by register renaming
- Easy to do since the names are exposed
- What about if two instructions share the same memory address ?

• What if 40(R6) = 64(R3) ???

#### **Dynamic Memory Disambiguation**

L.D. F1, 40(R6) S.D F4, 64(R3)

 If the load and store are executed out-oforder... WAR hazard

> S.D F4, 64(R3) L.D. F1, 40(R6)

• RAW hazard

#### **Dynamic Memory Disambiguation**

- Loads/Stores have to wait until any uncompleted Stores/Loads sharing the same effective address that precede that instruction in program order complete
- To detect these hazards, we need to know the effective address of any earlier memory operation
- Solution: perform the EA calculations in program order

### Computing EAs in Program Order

- E.g. Consider a load
- When the load completes EA calculation, check the address fields of all the active store buffers
- If the load address matches any active store buffer entry, then do not send the load to the load buffer until the conflicting store completes
- Stores are similar except must check both load and store buffers

#### **Review Tomasulo**

- Reservations stations: *implicit register renaming* to larger set of registers + buffering source operands
  - Prevents registers as bottleneck
  - Avoids WAR, WAW hazards
  - Allows loop unrolling in HW
- Not limited to basic blocks (integer units gets ahead, beyond branches)
- Today, helps cache misses as well
  - Don't stall for L1 Data cache miss (insufficient ILP for L2 miss?)
- Lasting Contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- 360/91 descendants are Pentium III, 4; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

#### Chapter 3

## **Dynamic Branch Prediction**

#### **Branch Prediction**

- Now that we have considered how to reduce stalls due to data hazards, let's tackle control hazards.
- The fundamental problem:
  - There is a delay between the cycle which we find out if the instruction is a branch, what it's target is, whether it is taken or not,....and the cycle from which we need to fetch the next instruction.
- One way to get around this is to guess whether a branch is taken or not taken...if we are correct then there could potentially be no penalty.
- We suffer a penalty if the guess was wrong

#### Static vs. Dynamic Prediction

- The performance of branch prediction rests on how accurate our predictions are.
- We have seen a compiler scheme for filling the branch delay (static branch prediction).
- Analyze each branch and try to fill the delay slot with an instruction from the branch target or the fall through.
- The problem is... it is very hard to predict the direction of branches in the compiler...we really need to consider the dynamic branch behaviour.

#### Looking Ahead...

- To lower the IDEAL CPI, we will consider machines that can ISSUE more than one instruction in a clock cycle...
  - "multiple issue" (Superscalar and VLIW)

#### Case for Branch Prediction when Issue N instructions per clock cycle

- 1. Branches will arrive up to *n* times faster in an *n*-issue processor
- 2. Amdahl's Law => relative impact of the control stalls will be larger with the lower potential CPI in an *n*-issue processor

#### **Dynamic Branch Prediction**

• IDEA: predict the outcome of a branch based on its past behaviour



Small memory (branch prediction buffer)

#### **7 Branch Prediction Schemes**

- 1. 1-bit Branch-Prediction Buffer
- 2. 2-bit Branch-Prediction Buffer
- 3. Correlating Branch Prediction Buffer
- 4. Tournament Branch Predictor
- 5. Branch Target Buffer
- 6. Integrated Instruction Fetch Units
- 7. Return Address Predictors

#### **Dynamic Branch Prediction**



#### Performance = f(accuracy, cost of misprediction)

#### **1-bit Predictors**

- Branch History Table: Lower bits of PC address index table of 1-bit values
  - Says whether or not branch taken last time
  - No address check (saves HW, but may not be right branch)
  - Adequate performance for numerical code with many loops
- Problem: in a loop, 1-bit BHT will cause
  2 mispredictions
  - End of loop case, when it exits instead of looping as before
  - First time through loop on *next* time through code, when it predicts *exit* instead of looping
  - Only 80% accuracy even if loop 90% of the time

### Example

- Loop with 10 iterations. First 9 are taken and then the last is not.
- Mispredict 2 times for every 10 instructions
- 80% prediction accuracy
- (mispredict at twice the rate of branch not taken...should be able to at least match the taken branch frequency for highly regular loops)

#### 1-bit predictors

- Prediction is wrong whenever there is a transition in the branching pattern.
- Example
  - NTNTNT
  - 1-bit predictor is never correct ! (0%)
  - Tossing a coin (no prediction at all) gives 50%
- However, real code has bias
- A branch taken several times is likely to be taken again
- Solution: keep more "memory" than is possible by just one bit...try two bits

#### 2-bit predictors

- Count the number of 'taken' (not taken) outcomes
- Two taken (not taken) in a row → predict "taken" (not taken)
- A single not taken (taken) branch will not affect the prediction – there need to be two in a row to affect the prediction
- In general, with n prediction bits, it takes 2<sup>n-1</sup> mispredictions before the predictor changes its mind



- ...NNNNN TNTNTN TTTTT...
- 50 % prediction accuracy
- ... TTTN TTTTTTT N TTT...
- 90% prediction accuracy

#### **2-Bit Branch Prediction**



• A branch that strongly favours taken or not taken will be mispredicted less often than with a 1-bit © 2003 Elsevier Science (USA). All rights reserved. predictor

#### **Counter Implementation**



#### Accuracy of 2-bit predictors

- 99-100% on heavy matrix code
- 80 90% on integer code (e.g. gcc)
- Statistics show virtualy no gain in accuracy with more states and buffers of more than 1K entries.
- However, there are some cases where we can do better...

#### **Correlating Branch Predictors**

- Why is the performance of integer code so low?
- We assumed that different branches' behaviour was not correlated..
- But, they often are...

```
If (a == 2)
a =0
If( b == 2)
b= 0
If (a != b)
```

• A simple predictor that considers only one branch can't capture this behaviour

#### **Correlating Branch Predictors**

- Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behavior)
- Simple predictor: keep a history of 1 branch and each predictor is 1-bit (1,1)

#### **Example without Correlation**

• E.g.

- B1: If (d == 0) d = 1
- B2: If (d == 1)

...

Try a 1-bit predictor

|     | Branch | Pred | outcome | update |
|-----|--------|------|---------|--------|
| d=2 | B1     | Ν    | N       | Ν      |
|     | B2     | Ν    | N       | Ν      |
| d=0 | B1     | Ν    | Т       | Т      |
|     | B2     | Ν    | Т       | Т      |
| d=2 | B1     | Т    | N       | Ν      |
|     | B2     | Т    | N       | Ν      |
| d=0 | B1     | Ν    | Т       | Т      |
|     | B2     | Ν    | Т       | Т      |

## Simple Correlating Predictor

|                                    | Prediction bits         |                   |  |
|------------------------------------|-------------------------|-------------------|--|
|                                    | Use this one if last    | Use this one if   |  |
| Each branch has 2.1 hit predictors | branch <b>not taken</b> | last branch taken |  |
| Each branch has 2 1-bit predictors | N                       | Ν                 |  |
| yielding four possibilities: (1,1) | N                       | Т                 |  |
| correlating branch predictor       | Т                       | N                 |  |
|                                    | Т                       | Т                 |  |

|     | Branch | Pred. Bits | Prediction | Outcome | Update |
|-----|--------|------------|------------|---------|--------|
|     |        |            |            |         |        |
| d=2 | B1     | NN         | N          | N       | NN     |
|     | B2     | NN         | N          | N       | NN     |
| d=0 | B1     | NN         | N          | т       | TN     |
|     | B2     | NN         | N          | т       | NT     |
| d=2 | B1     | TN         | N          | N       | TN     |
|     | B2     | NT         | N          | N       | NT     |
| d=0 | B1     | TN         | т          | т       | TN     |
|     | B2     | NT         | Т          | Т       | NT     |

## **Correlating Branches**

Behavior of recent branches selects between, say, 4 predictions of next branch, updating just that prediction

- (2,2) predictor: 2-bit global, 2-bit local
- General: (m,n) uses behaviour of the last m branches to choose from 2<sup>m</sup> predictors each of which is an n-bit predictor



#### Accuracy of Different Schemes



#### **BHT** Accuracy

- Mispredict because either:
  - Wrong guess for that branch
  - Got branch history of wrong branch when index the table
- 4096 entry table programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
- For SPEC92,
  4096 about as good as infinite table

#### **Tournament Predictors**

- Motivation for correlating branch predictors is 2-bit predictor failed on important branches; by adding global information, performance improved
- Tournament predictors: use 2 predictors, 1 based on global information and 1 based on local information, and combine with a selector
- Hopes to select right predictor for right branch

#### **Tournament Predictor in Alpha 21264**

- 4K 2-bit counters to choose from among a global predictor and a local predictor
- Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
  - 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken;
- Local predictor consists of a 2-level predictor:
  - Top level a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. 10-bit history allows patterns 10 branches to be discovered and predicted.
  - Next level Selected entry from the local history table is used to index a table of 1K entries consisting a 3-bit saturating counters, which provide the local prediction
- Total size: 4K\*2 + 4K\*2 + 1K\*10 + 1K\*3 = 29K bits!
   (~180,000 transistors)

#### % of predictions from local predictor in Tournament Prediction Scheme



#### **Accuracy of Branch Prediction**



#### Accuracy v. Size (SPEC89)



## Instruction Delivery

- Classic 5-stage pipeline: branch target address and branch direction (outcome) are known early (in ID)
  - 1 cycle branch delay
- Predictors don't give much benefit for this pipeline unless they can give the prediction in the IF stage
- Seems impossible: don't even know the instruction yet!

## Branch Target Buffer



Table look up can be done in hardware for small tables

Usually, only store predicted taken branches in BTB

## Branch Target Buffer



Example of cost in CC for all cases. This, plus statistics allows to predict speedup.

| Ĥit | Prediction | Outcome   | Penalty                                              |
|-----|------------|-----------|------------------------------------------------------|
| yes | taken      | taken     | 0: correct instruction fetched next clock cycle      |
| yes | taken      | not taken | 2: 1 CC to updtate buffer + 1 CC to restart fetching |
| no  | don't care | taken     | 2: 1 CC                                              |
| no  | don't care | not taken | 0: correct instruction getched next clock cycle      |

# **Branch Folding**

- Next step in this idea is to store the target instruction (instead of just it's address)
- Works perfectly for jumps (unconditional branches) – eliminates them completely (negative penalty!)

## **Integrated Instruction Fetch Units**

- Fetching instructions becomes the bottleneck in multiple-issue processors
- Integrated instruction fetch/prediction unit
- Instruction prefetch
  - Fetch ahead several instructions (Chapter 5)

#### **Return Address Predictors**

- For the procedure call instruction, the return PC is typically stored in a stack in memory
- Instead of loading the return address from memory, some processors provide for a small buffer of the 8-16 most recent return addresses
- Just the knowledge of the PC of a return instruction provides the return address directly without decoding.

# **Dynamic Branch Prediction Summary**

- Prediction becoming important part of scalar execution
- Branch History Table: 2 bits for loop accuracy
- Correlation: Recently executed branches correlated with next branch.
  - Either different branches
  - Or different executions of same branches
- Tournament Predictor: more resources to competitive solutions and pick between them
- Branch Target Buffer: include branch address & prediction
- Return address stack for prediction of return addresses

#### Chapter 3

# Multiple-Issue Processors

#### Getting CPI < 1: Issuing Multiple Instructions/Cycle

- Basic idea: parallel pipelines.
  - Allow the fetching, issuing, and completion of more than one instruction every clock cycle
- Superscalar: varying no. instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo)
  - IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
- Very Long Instruction Words (VLIW): fixed number of instructions (4-16) scheduled by the compiler; put ops into wide templates (TBD)
  - Intel Architecture-64 (IA-64) 64-bit address
    - » Renamed: "Explicitly Parallel Instruction Computer (EPIC)"
  - Will discuss in next chapter

# Multiple-Issue Processors

| Common<br>Name               | Issue<br>Structure | Hazard<br>detection | Scheduling                  | Distinguishing<br>Characteristic                 | Examples                                                                            |
|------------------------------|--------------------|---------------------|-----------------------------|--------------------------------------------------|-------------------------------------------------------------------------------------|
| Superscalar<br>(static)      | Dynamic            | Hardware            | Static                      | In-order<br>execution                            | Sun<br>UltraSPARC<br>II/III                                                         |
| Superscalar<br>(dynamic)     | Dynamic            | Hardware            | Dynamic                     | Some out-of-<br>order<br>execution               | IBM Power2                                                                          |
| Superscalar<br>(speculative) | Dynamic            | Hardware            | Dynamic with<br>speculation | Out-of-order<br>execution with<br>speculation    | Pentium<br>III/4,<br>MIPS R10K,<br>Alpha<br>21264, HP<br>PA 8500,<br>IBM<br>RS64III |
| VLIW                         | Static             | Software            | Static                      | No hazards<br>between issue<br>packets           | Trimedia,<br>i860,<br>Transmeta<br>Crusoe                                           |
| EPIC                         | Mostly static      | Mostly<br>software  | Mostly static               | Explicit<br>dependences<br>marked by<br>compiler | Itanium                                                                             |

## Getting CPI < 1: Issuing Multiple Instructions/Cycle

#### • Superscalar MIPS: 2 instructions, 1 FP & 1 integer

- Fetch 64-bits/clock cycle; Int on left, FP on right
- Can only issue 2nd instruction if 1st instruction issues
- More ports for FP registers to do FP load & FP op in a pair

| Туре                  | Pipes | Stage. | 5  |     |     |     |    |
|-----------------------|-------|--------|----|-----|-----|-----|----|
| Int. instruction      | IF    | ID     | EX | MEM | WB  |     |    |
| <b>FP</b> instruction | IF    | ID     | EX | MEM | WB  |     |    |
| Int. instruction      |       | IF     | ID | EX  | MEM | WB  |    |
| <b>FP</b> instruction |       | IF     | ID | EX  | MEM | WB  |    |
| Int. instruction      |       |        | IF | ID  | EX  | MEM | WB |
| <b>FP</b> instruction |       |        | IF | ID  | EX  | MEM | WB |

- 1 cycle load delay expands to 3 instructions in SS
  - instruction in right half can't use it, nor instructions in next slot

## Multiple Issue Issues

- issue packet: group of instructions from fetch unit that could potentially issue in 1 clock
  - If instruction causes structural hazard or a data hazard either due to earlier instruction in execution or to earlier instruction in issue packet, then instruction does not issue
  - O to N instruction issues per clock cycle, for N-issue
- Performing issue checks in 1 cycle could limit clock cycle time: O(n<sup>2</sup>-n) comparisons
  - => issue stage usually split and pipelined
  - 1st stage decides how many instructions from within this packet can issue, 2nd stage examines hazards among selected instructions and those already been issued
  - => higher branch penalties => prediction accuracy important
- While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
  - Exactly 50% FP operations AND No hazards

#### Dynamic scheduling approach:

Similarly the Tomasulo algorithm can be applied to the superscalar case. Instead of issuing one instruction per clock cycle, two or more can be issued. Here is an example for the scalar-add-to-vector loop for dual issue. The CC# when events occur are indicated.

| Iter. | Instruction | Issue    | Exec | Mem | Write CDB | Comment |                 |
|-------|-------------|----------|------|-----|-----------|---------|-----------------|
| 1     | L.D         | F0,0(R1) | 1    | 2   | 3         | 4       |                 |
| 1     | ADD.D       | F4,F0,F2 | 1    | 5   |           | 8       | Wait for L.D    |
| 1     | S.D         | F4,0(R1) | 2    | 3   | 9         |         | Wait for ADD.D  |
| 1     | DADDIU      | R1,R1,-8 | 2    | 4   |           | 5       | Wait for result |
| 1     | BNE         | R1,R2,L  | 3    | 6   |           |         | Wait for DADDIU |
| 2     | L.D         | F0,0(R1) | 4    | 7   | 8         | 9       | Wait for BNE    |
| 2     | ADD.D       | F4,F0,F2 | 4    | 10  |           | 13      | Etc.            |
| 2     | S.D         | F4,0(R1) | 5    | 8   | 14        |         |                 |
| 2     | DADDIU      | R1,R1,-8 | 5    | 9   |           | 10      |                 |
| 2     | BNE         | R1,R2,L  | 6    | 11  |           |         |                 |

Quite complex to analyze: many things happen simultaneously.

#### Chapter 3

# Speculation

## Dynamic Scheduling with Hardware Speculation

- What is speculation? (or speculative execution)
- Let's consider dynamic scheduling (Tomasulo) with hardware branch prediction
- Make a branch prediction and *execute the* program as if the guess was correct
  - The speculatively executed sequence of instructions probably includes other branches (which need to be predicted).
  - This is especially true in multiple-issue processors (possibly one branch per clock cycle)
- Need the ability to undo the effects of an incorrectly speculated sequence

#### Dynamic Scheduling with Hardware Speculation

- Dynamic scheduling without speculation only partially overlaps basic blocks
  - It requires that a branch be resolved before executing instructions in the successor basic block
- Speculation allows us to overcome control dependencies (data flow execution)
- To implement speculation we will modify Tomasulo's algorithm

## Hardware Speculation

- What is needed to speculatively execute a stream of instructions?
- We must avoid updating the state of the processor until we know for sure that an instruction should have been executed (we then say that it is no longer speculative)
- Registers must not be written until an instruction is no longer speculative
  - Rely on forwarding results among instructions
  - The values forwarded might not be correct.

#### **Instruction Commit**

- When we finally know that an instruction is no longer speculative then we allow it to write to the register file or memory
- This extra pipeline stage is called instruction commit
- Key idea: allow instructions to execute outof-order but to commit in-order
  - Need to prevent any irrecoverable action (state update, or exception) until the instruction commits
- Instructions may finish execution considerably before they are ready to commit

#### **Reorder Buffer**

- Need a reorder buffer to hold the results of instructions that have finished execution but have not yet commited
- The ROB also passes results between instructions
  - Register file is updated only when the instruction commits
  - Takes over the role of register renaming from the reservation stations (still need them as buffers between instruction issue and execution)
  - ROB performs the same functionality as the store buffers and it replaces them



#### Four Steps of Speculative Tomasulo Algorithm

#### 1. Issue-get instruction from Op Queue

- If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination

#### 2. Execution—operate on operands (EX)

 When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW

#### 3. Write result—finish execution (WB)

Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.

#### 4. Commit—update register with reorder result

When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer

## **Reorder Buffer**

#### ROB

| Entry | Busy | Instruction     | State        | Dest | Value            |
|-------|------|-----------------|--------------|------|------------------|
| 1     | no   | L.D. F6,34(R2)  | commit       | F6   | Mem[34+Regs[R2]] |
| 2     | yes  | MUL.D F0,F6,F4  | Write result | FO   | #1 × Regs[F4]    |
| 3     | yes  | DIV.D F10,F0,F6 | Execute      | F10  |                  |

#### **Reservation stations**

| Name  | Busy | Ор    | Vj               | Vk               | Qj | Qk | Dest | A |
|-------|------|-------|------------------|------------------|----|----|------|---|
| Mult1 | no   | MUL.D | Mem[34+Regs[R2]] | Regs[F4]         |    |    | #2   |   |
| Mult2 | yes  | DIV.D |                  | Mem[34+Regs[R2]] | #2 |    | #3   |   |

#### **FP Register Status**

| Field     | F0  | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | F10 |
|-----------|-----|----|----|----|----|----|----|----|----|----|-----|
| Reorder # | 2   |    |    |    |    |    |    |    |    |    | 3   |
| Busy      | yes | no | yes |

#### What about Precise Interrupts?

• Tomasulo had:

In-order issue, out-of-order execution, and out-of-order completion

 Need to "fix" the out-of-order completion aspect so that we can find precise breakpoint in instruction stream.

# Relationship between precise interrupts and specultation:

- Speculation is a form of guessing.
- Important for branch prediction:
  - Need to "take our best shot" at predicting branch direction.
- If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly:
  - Need to "fix" the out-of-order completion aspect so that we can find precise breakpoint in instruction stream.
- What about precise exceptions?
  - Need to "fix" the out-of-order completion aspect so that we can find precise breakpoint in instruction stream.
- Technique for both precise interrupts/exceptions and speculation: *in-order completion or commit*

Again we take the example of the add-scalar-to-vector loop example. We assume that all integer operations are finished and focus on FP.

| Entry | Busy | Instruction |          | State Destination |           | Value          |  |
|-------|------|-------------|----------|-------------------|-----------|----------------|--|
| 1     | no   | L.D         | F0,0(R1) | Commit            | FO        | Mem[0+Reg[R1]] |  |
| 2     | no   | ADD.D       | F4,F0,F2 | Commit            | F4        | #1 * Reg[F2]   |  |
| 3     | yes  | S.D         | F4,0(R1) | Write result      | 0+Reg[R1] | #2             |  |
| 4     | yes  | DADDIU      | R1,R1,-8 | Write result      | R1        | Regs[R1] - 8   |  |
| 5     | yes  | BNE         | R1,R2,L  | Write result      |           |                |  |
| 6     | yes  | L.D         | F0,0(R1) | Write result      | FO        | Mem[#4]        |  |
| 7     | yes  | ADD.D       | F4,F0,F2 | Write result      | F4        | #6 * Reg[F2]   |  |
| 8     | yes  | S.D         | F4,0(R1) | Write result      | 0+#4      | #7             |  |
| 9     | yes  | DADDIU      | R1,R1,-8 | Write result      | R1        | #4 - 8         |  |
| 10    | yes  | BNE         | R1,R2,L  | Write result      |           |                |  |

At this particular time, two complete loops have issued and the first two entries have committed, freeing the ROB entries (fields are left filled for clarity but could be rewritten by other instructions). When the head (currently at entry 3) reaches the branch, if the BNE at entry 5 was mispredicted when issued, then instructions following it are flushed and will never commit.

In essence, the ROB executes **in-order** a simplified version of the original code: it performs just the writes according the actual outcomes of branches. The actual computations are done speculatively according to branch predictions.

#### Multiple Issue with Speculation No speculation:

| Iter.                      | Instruction                                                    | on                                                                           | Issue                      | Exec                                    | Mem            | Write CDB               | Comme                            | ent                                                                              |
|----------------------------|----------------------------------------------------------------|------------------------------------------------------------------------------|----------------------------|-----------------------------------------|----------------|-------------------------|----------------------------------|----------------------------------------------------------------------------------|
| 1                          | LD                                                             | R2,0(R1)                                                                     | 1                          | 2                                       | 3              | 4                       |                                  |                                                                                  |
| 1                          | DADDIU                                                         | R2,R2,#1                                                                     | 1                          | 5                                       |                | 6                       | Wait f                           | or LD                                                                            |
| 1                          | SD                                                             | R2,0(R1)                                                                     | 2                          | 3                                       | 7              |                         | Wait f                           | or DADDIU                                                                        |
| 1                          | DADDIU                                                         | R1,R1,-8                                                                     | 2                          | 3                                       |                | 4                       | Execut                           | es directly                                                                      |
| 1                          | BNE                                                            | R1,R2,L                                                                      | 3                          | 7                                       |                |                         | Wait f                           | or DADDIU                                                                        |
| 2                          | LD                                                             | R2,0(R1)                                                                     | 4                          | 8                                       | <mark>9</mark> | <mark>10</mark>         | Wait f                           | or BNE                                                                           |
| 2                          | DADDIU                                                         | R2,R2,#1                                                                     | 4                          | 11                                      |                | 12                      | Etc.                             |                                                                                  |
| 2                          | SD                                                             | R2,0(R1)                                                                     | 5                          | 9                                       | 13             |                         |                                  |                                                                                  |
| 2                          | DADDIU                                                         | R1,R1,-8                                                                     | 5                          | 8                                       |                | 9                       |                                  |                                                                                  |
| 2                          | BNE                                                            | R1,R2,L                                                                      | 6                          | 13                                      |                |                         |                                  |                                                                                  |
| Speci                      | ulation:                                                       |                                                                              |                            |                                         |                |                         |                                  |                                                                                  |
| Iter                       | Instruction                                                    | - 13                                                                         | Lagua                      | Erree                                   | Deed           | Walte                   | C                                | <u></u>                                                                          |
| nu.                        | msuucuv                                                        | on                                                                           | Issue                      | Exec                                    | Read           | Write                   | Commits                          | Comment                                                                          |
| ner.                       | mstructio                                                      | on                                                                           | Issue                      | Exec                                    | Acces          |                         | Commits                          | Comment                                                                          |
| 1                          | LD R2,0                                                        | )(R1)                                                                        | 1 Issue                    | 2 Exec                                  |                |                         | 5                                | Comment                                                                          |
| 1                          | LD R2,0                                                        | )(R1)<br>R2,R2,#1                                                            | 1<br>1                     |                                         | Acces          | CDB                     |                                  | Wait for LD                                                                      |
| 1<br>1<br>1                | LD R2,0<br>DADDIU<br>SD                                        | )(R1)<br>R2,R2,#1<br>R2,0(R1)                                                | 1<br>1<br>2                | 2                                       | Acces          | CDB<br>4                |                                  |                                                                                  |
| 1<br>1<br>1<br>1           | LD R2,0<br>DADDIU<br>SD                                        | )(R1)<br>R2,R2,#1                                                            | 1<br>1<br>2<br>2           | 2<br>5                                  | Acces          | CDB<br>4                |                                  | Wait for LD                                                                      |
| 1<br>1<br>1<br>1<br>1<br>1 | LD R2,0<br>DADDIU<br>SD                                        | )(R1)<br>R2,R2,#1<br>R2,0(R1)                                                | 1<br>1<br>2                | 2<br>5<br>3<br>7                        | Acces<br>3     | CDB<br>4<br>6           | 5<br>7<br>7                      | Wait for LD<br>Wait for DADDIU                                                   |
| 1<br>1<br>1<br>1<br>1<br>2 | LD R2,0<br>DADDIU<br>SD<br>DADDIU                              | )(R1)<br>R2,R2,#1<br>R2,0(R1)<br>R1,R1,-8                                    | 1<br>1<br>2<br>2           | 2<br>5<br>3<br>7                        | Acces          | CDB<br>4<br>6           | 5<br>7<br>7<br>8                 | Wait for LD<br>Wait for DADDIU<br>Commit in order                                |
| 1<br>1<br>1<br>1<br>1      | LD R2,0<br>DADDIU<br>SD<br>DADDIU<br>BNE<br>LD                 | )(R1)<br>R2,R2,#1<br>R2,0(R1)<br>R1,R1,-8<br>R1,R2,L                         | 1<br>1<br>2<br>2<br>3      | 2<br>5<br>3                             | Acces<br>3     | CDB<br>4<br>6           | 5<br>7<br>7<br>8<br>8            | Wait for LD<br>Wait for DADDIU<br>Commit in order<br>Wait for DADDIU             |
| 1<br>1<br>1<br>1<br>1<br>2 | LD R2,0<br>DADDIU<br>SD<br>DADDIU<br>BNE<br>LD                 | )(R1)<br>R2,R2,#1<br>R2,0(R1)<br>R1,R1,-8<br>R1,R2,L<br>R2,0(R1)             | 1<br>1<br>2<br>2<br>3<br>4 | 2<br>5<br>3<br>7<br><mark>5</mark>      | Acces<br>3     | CDB<br>4<br>6<br>4<br>7 | 5<br>7<br>7<br>8<br>8<br>9       | Wait for LD<br>Wait for DADDIU<br>Commit in order<br>Wait for DADDIU<br>No delay |
| 1<br>1<br>1<br>1<br>2<br>2 | LD R2,0<br>DADDIU<br>SD<br>DADDIU<br>BNE<br>LD<br>DADDIU<br>SD | )(R1)<br>R2,R2,#1<br>R2,0(R1)<br>R1,R1,-8<br>R1,R2,L<br>R2,0(R1)<br>R2,R2,#1 | 1<br>2<br>2<br>3<br>4<br>4 | 2<br>5<br>3<br>7<br><mark>5</mark><br>8 | Acces<br>3     | CDB<br>4<br>6<br>4<br>7 | 5<br>7<br>7<br>8<br>8<br>9<br>10 | Wait for LD<br>Wait for DADDIU<br>Commit in order<br>Wait for DADDIU<br>No delay |

This concludes the bulk of this chapter. The text contains additional material (236 to 240) regarding the fine point of design tradeoffs with the techniques we have seen.

This is followed by a detailed quantitative study of speedup enhancements given by these techniques (240 to 259). One important result is in Fig. 3.45. It shows that even if a processor can be perfectly designed (?), could fetch infinite numbers of instructions at once, and execute them out of order with the data flow as the only limitation, even the most inherently parallel benchmarks saturate at around 60 instructions of ILP. This means that to get more parallelism out of a program, this program must be written specifically for a parallel processor using special parallel commands.

Pages 259 to 273 are case studies.

To push performance further, some current research is directed at predicting, not only branch outcomes, but the results themselves a.k.a. *value prediction* so that speculative computations can be done deeper. In the meantime, everything we have seen is this chapter is today's reality. See the table next page:

# Multiple-Issue Processors

| Common<br>Name               | Issue<br>Structure | Hazard<br>detection | Scheduling                  | Distinguishing<br>Characteristic                 | Examples                                                                            |
|------------------------------|--------------------|---------------------|-----------------------------|--------------------------------------------------|-------------------------------------------------------------------------------------|
| Superscalar<br>(static)      | Dynamic            | Hardware            | Static                      | In-order<br>execution                            | Sun<br>UltraSPARC<br>II/III                                                         |
| Superscalar<br>(dynamic)     | Dynamic            | Hardware            | Dynamic                     | Some out-of-<br>order<br>execution               | IBM Power2                                                                          |
| Superscalar<br>(speculative) | Dynamic            | Hardware            | Dynamic with<br>speculation | Out-of-order<br>execution with<br>speculation    | Pentium<br>III/4,<br>MIPS R10K,<br>Alpha<br>21264, HP<br>PA 8500,<br>IBM<br>RS64III |
| VLIW                         | Static             | Software            | Static                      | No hazards<br>between issue<br>packets           | Trimedia,<br>i860,<br>Transmeta<br>Crusoe                                           |
| EPIC                         | Mostly static      | Mostly<br>software  | Mostly static               | Explicit<br>dependences<br>marked by<br>compiler | Itanium                                                                             |

#### **Processors with ROB**

- PowerPC 603/604/G3/G4
- MIPS R10000/12000
- Pentium II/III/4
- Alpha 21264
- AMD K5/K6/Athlon

Recent high performance processors characteristics:

| Processor     | Year | Max<br>CR | Pow.<br>(W) | Trans.<br>(M) | Window<br>size | Rename<br>Registers | lssue rate<br>max/mem/<br>int/FP/branch | Branch<br>Predict<br>Buffer | Pipe<br>Stages |
|---------------|------|-----------|-------------|---------------|----------------|---------------------|-----------------------------------------|-----------------------------|----------------|
| MIPS R14000   | 00   | 400       | 25          | 7             | 48             | 32/32               | 4/1/2/2/1                               | 2K x 2                      | 6              |
| UltraSparc    | 01   | 900       | 65          | 29            | NA             | 0                   | 4/1/4/3/1                               | 16K x 2                     | 14/15          |
| Pentium III   | 00   | 1000      | 30          | 24            | 40             | 40                  | 3/2/2/1/1                               | 512                         | 12/14          |
| Pentium 4     | 01   | 1700      | 64          | 42            | 126            | 128                 | 3/2/3/2/1                               | 4K x 2                      | 22/24          |
| HP PA 8600    | 01   | 550       | 60          | 130           | 56             | 56                  | 4/2/2/2/1                               | 2K x 2                      | 7/9            |
| Alpha 21264B  | 01   | 833       | 75          | 15            | 80             | 41/41               | 4/2/4/2/1                               | tournan't                   | 7/9            |
| PowerPC G4    | 00   | 450       | 5           | 7             | 5              | 6/6                 | 3/1/2/1/1                               | 512 x 2                     | 4/5            |
| AMD Athlon    | 01   | 1330      | 76          | 37            | 72             | 36/36               | 3/2/3/3/1                               | 4K x 2                      | 9/11           |
| IBM Power3-II | 00   | 450       | 36          | 23            | 32             | 16/24               | 4/2/2/2/2                               | 2K x 2                      | 7/8            |

This table is quite interesting considering that all these processors have comparable performance: their characteristics are not (e.g. the thrifty G4, clock rates that vary by a factor three, and power by a factor 15)!

These processors still appear to the programmer as ordinary one instruction-at-a-time processors.

#### Pitfalls and Fallacies:

Fallacy: Processors with lower CPI will be faster.
Fallacy: Processors with faster clock rates will be faster.
See the performance equation of Chap 1

**Pitfall:** Improvement in CPI by a high issue rate while sacrificing clock rate can lead to lower improvement.

Depends on the task (e.g. heavy FP code). Simulating before implementing plus experience is the true test.

**Pitfall:** Improvement of one aspect of a superscalar processor and expecting overall performance enhancement.

See Amdahl's Law. Superscalars processors have so many small enhancements that each contribute a small part to the overall result.

Pitfall: Sometimes bigger and dumber is better.

Different applications produce different performance. Many small improvements make a processor more general purpose.