# Robust Systems From Today to the *N3XT 1,000X*

## Subhasish Mitra



Department of EE & Department of CS, Stanford University

## **World Relies on Computing**








Computational demands exceed
Processing capability











## **World Relies on Computing**





- Ensure robust operation
- Meet computation demands
- New application horizon













## Research Topics

- Robust operation
  - Bugs, reliability, security

- Revolutionize nanosystems
  - 1,000X opportunity

- Program human brain
  - SNI Big Ideas in Neuroscience initiative

## **Outline**

Robust operation: silicon CMOS reliability

Beyond silicon

Conclusion

## Silicon CMOS Reliability Challenges

- Radiation-induced soft errors
  - Fatal flip-flop errors

- Early-life failures (ELF)
  - Burn-in: difficult, expensive

- Variations: V<sub>dd</sub>, thermal, circuit aging
  - Worst-case guardbands expensive

#### **Definitions**

- Malfunction (often referred to as failure)
  - Deviation from specified behavior
  - Underlying cause: failure
- Error: incorrect signal value
- Fault model
  - (Logic) representation of effect of failure

## System Output Response to Failure

- Error on output: non-critical apps. (e.g., games ?)
- Fault-secure: correct outputs or error indication
  - Retry adequate (e.g., banks)
- Fault masked: correct outputs
  - Fault in specified class (e.g., spacecraft)
- Fail safe: correct or "safe" outputs

#### **Definitions**

- Reliability: *R*(*t*)
  - Probability system works correctly up to time t
- Exponential model
  - $R(t) = e^{-\lambda t}$ ,  $\lambda = \text{failure rate}$
- Mean Time to Failure (MTTF)

$$MTTF = \int_{0}^{\infty} R(t)\, dt = \frac{1}{\lambda}$$
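As a quick numerical check of the exponential model, the area under $R(t)$ can be integrated and compared with $1/\lambda$. This is only a sketch: the failure rate value and the function names are illustrative assumptions, not taken from the slides.

```python
import math

def reliability(t: float, lam: float) -> float:
    """Exponential reliability model R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

def mttf_numeric(lam: float, steps: int = 200_000) -> float:
    """Approximate the area under R(t) with the trapezoidal rule.

    The integration is truncated at 40 / lam, where R(t) is negligible.
    """
    t_end = 40.0 / lam
    dt = t_end / steps
    total = 0.5 * (reliability(0.0, lam) + reliability(t_end, lam))
    total += sum(reliability(i * dt, lam) for i in range(1, steps))
    return total * dt

lam = 1e-3  # assumed failure rate: one failure per 1,000 hours
print(mttf_numeric(lam), 1.0 / lam)
```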

#### **Definitions**

- Availability: A(t)
  - Probability system works correctly AT time t
- Assume: system repaired after failure
  - Mean Time to Repair (MTTR)
- Steady-state availability:

$$\frac{MTTF}{MTTF + MTTR}$$

How to improve availability ?
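One way to explore the question is to evaluate the steady-state formula for a few repair times. The numbers below are assumed for illustration only; the sketch simply shows that availability rises when MTTF grows or MTTR shrinks.

```python
def steady_state_availability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

# Assumed values in hours: fixed MTTF, progressively faster repair.
for mttr in (10.0, 1.0, 0.1):
    print(mttr, steady_state_availability(1000.0, mttr))
```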

## **Error Effects: Vanished**



# **Error Effects: Output Mismatch**



No error indication

# **Silent Data Corruption (SDC)**



- Output file incorrect
- No error indication

## **Error Effects: Unexpected Termination**



e.g.,

- Divide-by-zero
- Memory access violation
- Application-detected errors

**Finish** 

- ...

# Hang



Execution time > 2 × error-free execution time

Does not finish / terminate
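The outcome categories above (vanished, output mismatch, unexpected termination, hang) can be sketched as a classifier that compares an error-injection run against an error-free run. The `Run` record and its fields are hypothetical names chosen for this illustration; only the 2× execution-time threshold for a hang comes from the slide.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Run:
    output: Optional[bytes]  # None if the run produced no output
    exit_ok: bool            # False on crash or application-detected error
    runtime: float           # seconds

def classify(run: Run, golden: Run) -> str:
    """Label an error-injection run against an error-free (golden) run."""
    if run.runtime > 2 * golden.runtime:
        return "hang"
    if not run.exit_ok:
        return "unexpected termination"
    if run.output != golden.output:
        return "output mismatch"
    return "vanished"

golden = Run(output=b"result", exit_ok=True, runtime=1.0)
print(classify(Run(b"resulT", True, 1.1), golden))
```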

## **Soft Error Effects**



Detected but Uncorrected Errors (DUE)

[Cho DAC 13]

## Soft Error Effects: BZip2 on IBM Power6



## **Fault-Tolerance: Rich Literature**

## **Expensive**





## **How Low Cost?**

#### **Applications**

**Approach** 

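| Applications: output errors tolerated? | Approach |
|----------------------------------------|----------|
| No  | **Low-cost** detection / correction: BISER, LEAP; circuit failure prediction |
| Yes | **Light-weight** correction: Error Resilient System Arch. (ERSA) [Cho IEEE TCAD 12] |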

## **Low-Cost Techniques**



## **How Low Cost?**

#### Solution: cross-layer resilience?

"multiple error resilience techniques from different layers of the system stack cooperate to achieve cost-effective error resilience"

[DARPA, PERFECT BAA 12]

[Borkar Intel, IEEE Micro 05] [Carter Intel, DATE 10]

[Pedram, NSF 12] [Chandra ARM, DAC 14]

[Gupta IBM, IRPS 14] [Henkel, DAC 14]

## **Existing Resilience Techniques**

- Many point solutions
  - Some cross-layer, some single-layer
- Missing
  - End-to-end cross-layer resilience framework

#### CLEAR

## **Cross-Layer Exploration for Architecting Resilience**

[Figure: CLEAR framework. Reliability analysis / execution time evaluation (BEE3 emulation cluster, Stampede supercomputer); layout evaluation (28nm library cells, Synopsys Design Compiler, PrimeTime, SRAM compiler); resilience library (application, software, architecture, logic, circuit); RTL: ARM, IVM, Leon, OpenSPARC T2 SoC, uncore, accelerators; metrics: reliability, area, power, energy, clock frequency, application runtime.]

[Cheng DAC 16]

## Today's Focus

- Radiation-induced soft errors in flip-flops
  - Single-Event Upsets (SEUs)
  - Single-Event Multiple Upsets (SEMUs)

Combinational logic soft errors not critical

## **CLEAR: Extensive Study**

- Designs: wide variety
  - ARM, LEON3, Alpha, OpenSPARC multi-core SoC, accelerators
- Thorough <u>flip-flop</u> error injections
  - FPGA clusters, Stampede supercomputer (522,080 cores)
  - Full workloads (SPEC, PARSEC, PERFECT, proprietary)

- Detailed physical design
  - Wire routing, process / voltage / temperature corners

## Many Cross-Layer Questions Answered

Cross-layer always best?

All cross-layer solutions equally good?

Application constraints (e.g., soft real-time)?

Benchmark dependence?

Definitive guidelines for new resilience techniques

## **Key Message**

From black art to science

Several layers, Numerous combinations

- 5-50x resilience, 0.2-6% energy cost
  - Circuit + logic + micro-arch. recovery

- Circuit alone (application-guided)
  - ~1% extra energy vs. best cross-layer



#### **CLEAR**

## Cross-Layer Exploration for Architecting Resilience

[Figure: CLEAR framework. Reliability analysis / execution time evaluation, layout evaluation, resilience library; metrics: reliability, area, power, energy, clock frequency, application runtime.]

## Representative Resilience Techniques

#### Algorithm

- Algorithm Based Fault Tolerance (ABFT) Detection
- Algorithm Based Fault Tolerance (ABFT) Correction
- Algorithmic Noise Tolerance (ANT)

#### Software

- Control Flow Checking by Software Signatures (CFCSS)
- Error Detection by Duplicated Instructions (EDDI)

#### Logic

- Residue Code
- Bose-Lin Code

#### Circuit

- Razor
- Built-In Soft Error Resilience (BISER)
- Layout design through Error-Aware transistor Positioning (LEAP)
- Bistable Cross-coupled Dual Modular Redundancy (BCDMR)
- Error Detection Sequential (EDS)
- Reinforced Charge Collection (RCC)

10 error detection / correction techniques + 4 recovery techniques: 798 combinations

#### Error Detection / Correction

1. Algorithm Based Fault Tolerance (ABFT) Correction
2. ABFT Detection
3. Software Assertions
4. Control Flow Checking by Software Signatures (CFCSS)
5. Error Detection by Duplicated Instructions (EDDI)
6. Data Flow Checking (DFC)
7. Monitor Cores
8. Logic Parity
9. Layout design through Error-Aware transistor Positioning (LEAP)
10. Error Detection Sequential (EDS)

#### Micro-arch. Recovery

1. Flush
2. Reorder Buffer (RoB)
3. Instruction Replay (IR)
4. Extended IR (EIR)

## **BISER: Built-In Soft Error Resilience**



45nm: up to 1,000X benefits

# Single Error Assumption Inadequate

Single-event multiple upsets (SEMUs)

## **LEAP: Layout Design Through Error-Aware Transistor Positioning**



Errors Corrected: SEUs and SEMUs

## **Extensive LEAP Characterization**

- Radiation beam experiments
  - 40nm, 28nm, 20nm, 14nm
  - Bulk, SOI

VDD: nominal, near-threshold

| Flip-flop | Soft Error Rate (SER) | Area | Power | Delay | Energy |
|-----------|-----------------------|------|-------|-------|--------|
| Baseline  | 1                     | 1    | 1     | 1     | 1      |
| LEAP-DICE | 2×10<sup>-4</sup>     | 2    | 1.8   | 1     | 1.8    |

# **Memory ECC and SEMUs**

Don't implement multiple error correction blindly

Multiple physically adjacent errors

## **Memory ECC and SEMUs**

#### **Option 1**

Memory interleaving

- Bit1/W1 Bit1/W2 Bit2/W1
- 2 physically adjacent errors
  - Single errors in 2 separate words
- Drawback: cost, difficult for smaller geometries
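The interleaving idea can be sketched with a simple column mapping. This is an illustrative model, not a description of any specific memory: words and bits are 0-indexed here, and the two-word interleaving factor is an assumption that mirrors the Bit1/W1, Bit1/W2, Bit2/W1 ordering.

```python
def physical_column(word: int, bit: int, num_words: int) -> int:
    """Column of (word, bit) in a row that interleaves num_words words."""
    return bit * num_words + word

def owner(column: int, num_words: int) -> tuple[int, int]:
    """Inverse mapping: physical column -> (word, bit)."""
    return column % num_words, column // num_words

NUM_WORDS = 2    # assumed interleaving factor
upset = [2, 3]   # two physically adjacent cells hit by one particle
hit = [owner(c, NUM_WORDS) for c in upset]
print(hit)       # each adjacent cell belongs to a different word
```

With this mapping, two adjacent upsets appear as single-bit errors in two separate words, which single-error-correcting ECC per word can handle.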

#### **Option 2**

Adjacent bit error correction [Dutta ITC 07]

## **Memory ECC Challenges**

- Performance overhead
  - Pipelining
    - Additional latency, verification effort
  - Detection followed by correction
    - Variable latency, verification effort
- Small distributed memories

## **Error Masking**

- No error on outputs
  - Triple Modular Redundancy (TMR)



## TMR Reliability

$$R_{TMR} = R_{voter} \times [R_m^3 + 3 R_m^2 \times (1 - R_m)]$$

- $R_m$ : individual module reliability
- Pessimistic: non-overlapping errors
- Optimistic: correlated / common-mode failures
- TMR MTTF < Simplex MTTF

## TMR Reliability



- TMR reliability = simplex reliability
  - Time = log<sub>e</sub>2 × Simplex MTTF (perfect voter)
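The crossover can be checked numerically. The sketch below assumes the exponential model for each module, a perfect voter by default, and a 2-out-of-3 majority (all three modules correct, or any one of the three possible pairs correct); the function names and the normalized failure rate are illustrative.

```python
import math

def r_simplex(t: float, lam: float) -> float:
    return math.exp(-lam * t)

def r_tmr(t: float, lam: float, r_voter: float = 1.0) -> float:
    """2-out-of-3 majority: all three modules good, or exactly two good."""
    r = r_simplex(t, lam)
    return r_voter * (r**3 + 3 * r**2 * (1 - r))

lam = 1.0                      # normalized so that simplex MTTF = 1
t_cross = math.log(2) / lam    # log_e(2) x simplex MTTF
print(r_tmr(t_cross, lam), r_simplex(t_cross, lam))
print(r_tmr(0.1, lam) > r_simplex(0.1, lam))   # short mission: TMR better
print(r_tmr(2.0, lam) < r_simplex(2.0, lam))   # long mission: TMR worse
```

Under the same assumptions, integrating the TMR curve gives an MTTF of $5/(6\lambda)$, which is below the simplex value of $1/\lambda$.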

## TMR Reliability vs. Mission Time



- TMR effective for "short" mission times
- Other options: TMR-Simplex, TMR + Duplex-Repair

## **Concurrent Error Detection (CED)**

Normal system operation

- Preserve data integrity
  - Correct outputs or
  - Error indicated
    - Incorrect outputs
  - aka fault-secure



## **Output "Characteristics"**

- Output itself: duplication
  - Major challenges if not "fine-grained"
- Output parity
- Output residue
- 1s or 0s count in output word
- Many others (extensive literature)
- Self-checking checkers

## **Processor Duplication Challenges**

- Synchronization !!
- False DUEs when out of sync
  - e.g., error correction event in one processor
  - Mismatch when output pins compared

## Single-Bit Logic Parity Prediction

- $P = Z_1 \oplus Z_2 \oplus ... \oplus Z_m$
- Disjoint output logic (no logic sharing)
  - Only for combinational logic errors



## Multiple-Bit Logic Parity Prediction

- Main purpose: cost reduction (sharing, routing, logic)
  - Can be still expensive



# Logic Parity Checking: Flip-Flop Errors

Logic parity

- ☐ Original design
- ☐ Parity logic
- ☐ Pipeline flip-flops



| Implementation                    | Impact                                       |
|-----------------------------------|----------------------------------------------|
| Logic parity: Naïve               | 200 MHz clock speed impact                   |
| Logic parity: Incorrect heuristic | 80% additional energy impact                 |
| Logic parity: CLEAR heuristic     | No clock speed impact, minimal energy impact |

Parameters: parity size, flip-flop vulnerability, floorplan location, timing path slack, etc.

## **Parity Prediction for Datapath Circuits**

- S = A + B (*n*-bit operation)
- Parity $(S) = S_1 \oplus S_2 \oplus S_3 \oplus \dots \oplus S_n$
  - $= (A_1 \oplus B_1 \oplus C_1) \oplus (A_2 \oplus B_2 \oplus C_2) \oplus \dots \oplus (A_n \oplus B_n \oplus C_n)$
  - $= (A_1 \oplus A_2 \oplus \dots \oplus A_n) \oplus (B_1 \oplus B_2 \oplus \dots \oplus B_n) \oplus (C_1 \oplus C_2 \oplus \dots \oplus C_n)$
  - = Parity (A) ⊕ Parity (B) ⊕ Parity (internal carries)
- Parity (internal carries) expensive
  - Several strategies for high-performance adders
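The parity identity for addition can be verified exhaustively for a small word size. The sketch below assumes a ripple-carry view in which $C_i$ is the carry into bit position $i$ (with carry-in 0); the function names are illustrative.

```python
def parity(x: int) -> int:
    """XOR of all bits of x."""
    return bin(x).count("1") & 1

def carries_into_bits(a: int, b: int, n: int) -> int:
    """Bit i of the result holds the carry INTO position i of a + b."""
    carries, c = 0, 0
    for i in range(n):
        carries |= c << i
        ai, bi = (a >> i) & 1, (b >> i) & 1
        c = (ai & bi) | (c & (ai ^ bi))
    return carries

def predicted_sum_parity(a: int, b: int, n: int) -> int:
    """Parity(A) xor Parity(B) xor Parity(internal carries)."""
    return parity(a) ^ parity(b) ^ parity(carries_into_bits(a, b, n))

n = 4
ok = all(
    predicted_sum_parity(a, b, n) == parity((a + b) & ((1 << n) - 1))
    for a in range(1 << n) for b in range(1 << n)
)
print(ok)
```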

## Residue Codes for Datapath

- $y = x \mod b$ : y is residue of x (modulo b)
- Residue (A + B) = Residue (A) + Residue (B) (mod b)
- Residue  $(A \times B)$  = Residue  $(A) \times$  Residue (B)
- Choice of b: Mersenne prime (form 2<sup>m</sup>-1)
  - Coverage, checker complexity
- Issues: bit-wise logic, operand residue, checker cost
  - Often used for multipliers
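A residue check on a multiplier can be sketched as follows. The modulus 3 is an assumed choice of the form $2^m - 1$, and `faulty_result` is a hypothetical hook used here only to imitate an erroneous multiplier output.

```python
B = 3  # assumed check modulus of the form 2**m - 1 (here m = 2)

def residue(x: int) -> int:
    return x % B

def checked_mul(a: int, b: int, faulty_result=None) -> int:
    """Multiply and verify the result against the operand residues."""
    result = a * b if faulty_result is None else faulty_result
    if residue(result) != residue(residue(a) * residue(b)):
        raise ArithmeticError("residue check failed")
    return result

print(checked_mul(12, 7))
```

Coverage is not complete: an erroneous result that differs from the correct one by a multiple of the modulus has the same residue and escapes the check, which is one reason the choice of b matters.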

## **Application-Specific CED**

- LZ compression: loss-less, invertible
  - Compression: complex
  - Decompression: simple



9% area overhead, 0.5% delay overhead [Huang 00]
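The invertibility idea can be illustrated in software: run the simple inverse operation on the compressor output and compare it with the original input. This sketch uses Python's `zlib` (DEFLATE, an LZ77-based scheme) as a stand-in; it is not the hardware design of [Huang 00].

```python
import zlib

def compress_with_check(data: bytes) -> bytes:
    """Compress, then decompress the result and compare with the input."""
    packed = zlib.compress(data)
    if zlib.decompress(packed) != data:
        raise RuntimeError("compression error detected")
    return packed

packed = compress_with_check(b"abcabcabc" * 100)
print(len(packed))
```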

# **Error Detection Sequentials (EDS)**



## **Errors in Processors**





- Computational error: incorrect computation
- Memory error: incorrect value or address

# Program Representation: Control Flow Graph



#### SIHFT

- Software Implemented Hardware Fault Tolerance
  - Automated by compiler
  - EDDI [Oh, IEEE Trans. Reliability 02]
  - CFCSS [Oh IEEE Trans. Reliability 02b]
  - ED<sup>4</sup>I [Oh IEEE Trans. Computers 02]
  - Lots of recent publications

#### **EDDI**

- Error Detection by Duplicated Instructions
- Duplicate instructions inside basic blocks
  - Different registers
- Duplicate data structures
- Comparison before memory stores
- Performance penalty: 13% to 111%
  - Reduced by Instruction Level Parallelism (ILP)

## **EDDI Example**

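Original code:

| Instruction      | Comment                                |
|------------------|----------------------------------------|
| `ADD R3, R1, R2` | R3 ← R1 + R2                           |
| `MUL R4, R3, R5` | R4 ← R3 × R5                           |
| `ST 0(SP), R4`   | store R4 in location pointed to by SP  |

With EDDI:

| Instruction                  | Comment                   |
|------------------------------|---------------------------|
| `ADD R3, R1, R2`             | R3 ← R1 + R2 (master)     |
| `ADD R23, R21, R22`          | R23 ← R21 + R22 (shadow)  |
| `MUL R4, R3, R5`             | R4 ← R3 × R5 (master)     |
| `MUL R24, R23, R25`          | R24 ← R23 × R25 (shadow)  |
| `BNE R4, R24, Error Handler` | compare                   |
| `ST 0(SP), R4`               | store master result       |
| `ST offset(SP), R24`         | store shadow result       |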
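The same master/shadow pattern can be mimicked in a high-level sketch. This is a Python analogy, not compiler output: the variable names echo the registers in the example, and the `inject` parameter is a hypothetical hook that flips bits in the master result to imitate a transient error.

```python
class ErrorDetected(Exception):
    pass

def eddi_store(memory: dict, r1: int, r2: int, r5: int, inject: int = 0) -> None:
    # Master computation
    r3 = r1 + r2
    r4 = (r3 * r5) ^ inject   # 'inject' corrupts the master result only
    # Shadow computation on separate copies of the inputs
    r21, r22, r25 = r1, r2, r5
    r23 = r21 + r22
    r24 = r23 * r25
    # Compare before the store, as the BNE does in the example
    if r4 != r24:
        raise ErrorDetected("master/shadow mismatch")
    memory["master"] = r4     # store master result
    memory["shadow"] = r24    # store shadow result

mem = {}
eddi_store(mem, 2, 3, 4)
print(mem)
```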

## **EDDI Design Choices**

- Check after each instruction?
- Storeless basic blocks (SBB) ?
  - No branch or store except final instruction
- Why SBB ?
  - Correctness defined by program output
  - Erroneous branches: stores skipped?
    - Check at branches too

#### **CFCSS**

Control Flow Checking by Software Signatures

- Each node
  - Unique signature
- Each edge
  - Transition between 2 signatures
  - Difference function: XOR



## **CFCSS**

- Runtime signature G
- Basic block i to j
  - $G = s_i \oplus d_{i,j}$, where $d_{i,j} = s_i \oplus s_j$
  - Check $G = s_j$

- $G_n$: run-time signature at node $V_n$
- $s_n$: signature assigned to node $V_n$
- $d_n$: signature difference
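The signature update and check can be sketched for a tiny control flow graph in which every block has a single predecessor. The block names, signature values, and helper names below are assumptions made for illustration; branch fan-in would additionally need the run-time adjusting signature and is not modeled here.

```python
class ControlFlowError(Exception):
    pass

SIG = {"A": 0b0001, "B": 0b0010, "C": 0b0100}          # s_i, assumed values
PRED = {"B": "A", "C": "B"}                            # single predecessor each
DIFF = {j: SIG[i] ^ SIG[j] for j, i in PRED.items()}   # d = s_i XOR s_j

def run(path):
    """Walk a sequence of basic blocks, updating and checking G."""
    g = SIG[path[0]]
    for block in path[1:]:
        g ^= DIFF[block]        # G = G XOR d at the start of the block
        if g != SIG[block]:     # legal transfer yields the block's signature
            raise ControlFlowError(block)
    return g

print(run(["A", "B", "C"]))     # legal path passes every check
```

An illegal jump such as A directly to C leaves G different from the signature of C, so the check at the start of C fails.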

## **CFCSS Implementation**

- Global variable G holds run-time signature
- Compute & check signature: start of each basic block



#### **CFCSS: Branch Fan-in**

- Basic block with multiple predecessors
- Run-time adjusting signature D differentiates fan-in



## ED<sup>4</sup>I

- Error Detection by Diverse Data & Duplicated Instructions
- Duplicated instructions, data diversity
  - Expressions in shadows multiply by k (-1, -2, ...)
- k = -2: good choice
- Transient errors & most permanent faults detected
- Issues: floating point, pointers
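Why data diversity helps with permanent faults can be sketched with a toy adder that has one result bit stuck at 1. The fault model and function names are assumptions for illustration. Plain duplication would compute the same wrong value twice, whereas the shadow computation on inputs scaled by k = -2 exercises different bit patterns.

```python
K = -2  # diversity factor

def stuck_adder(x: int, y: int, stuck_bit=None) -> int:
    """Adder model; optionally forces one result bit to 1 (stuck-at-1)."""
    s = x + y
    return s | (1 << stuck_bit) if stuck_bit is not None else s

def checked_sum(values, stuck_bit=None) -> int:
    master = shadow = 0
    for v in values:
        master = stuck_adder(master, v, stuck_bit)       # original data
        shadow = stuck_adder(shadow, K * v, stuck_bit)   # data multiplied by k
    if shadow != K * master:
        raise RuntimeError("ED4I mismatch")
    return master

print(checked_sum([1, 2, 3]))
```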

## SIHFT Results [Lovelette 02]

- COTS in space: no hardware redundancy
- ARGOS satellite experiment
  - Compare rad-hard processor vs. COTS
- Undetected errors in rad-hard processor
- COTS: 5.55 SEUs / Mbyte / day, 99.7% coverage
- 98.8% successful recovery: software ECC + restart
- COTS + SIHFT: faster than rad-hard

## **Multi-Threading for CED**



- Same application computed by two threads
  - [Rotenberg 99, Saxena 00, Mukherjee 02]

## **What About Recovery?**

Instruction Replay (IR)



[Figure: processor pipeline with instruction replay recovery. Legend: cross-layer protected, LEAP-DICE protected, recovery logic.]

|                                | Instruction Replay      | Flush recovery            |
|--------------------------------|-------------------------|---------------------------|
| Overhead for recovery hardware | 16% area,<br>21% energy | 0.6% area,<br>0.9% energy |
| Recovery latency               | 47 cycles               | 7 cycles                  |

#### **CLEAR**

#### **Cross-Layer Exploration for Architecting Resilience**

[Figure: CLEAR framework. Reliability analysis / execution time evaluation (BEE3 FPGA emulation cluster, Stampede supercomputer), layout evaluation, resilience library; metrics: reliability, area, power, energy, clock frequency, application runtime.]

### Radiation-Induced Soft Errors

Soft errors

Radiation beam testing

Flip-flop error injection









The Los Alamos Neutron Science Center

Simulation / Emulation

## **Soft Error Injection**

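| Error injection level                               | Simulation speed            |
|-----------------------------------------------------|-----------------------------|
| Flip-flop                                           | 10<sup>2</sup> cycles / sec |
| Architectural register (high-level error injection) | 10<sup>7</sup> cycles / sec |
| Program variable (high-level error injection)       | 10<sup>9</sup> cycles / sec |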

## **Soft Error Injection**

Accuracy

| Error injection level                              | Simulation speed            |
|----------------------------------------------------|-----------------------------|
| Flip-flop (ground truth)                           | 10<sup>2</sup> cycles / sec |
| Architecture register (high-level error injection) | 10<sup>7</sup> cycles / sec |
| Program variable (high-level error injection)      | 10<sup>9</sup> cycles / sec |

## Silent Data Corruption (SDC)



- Output file incorrect
- No error indication

Detected but Uncorrected Error (DUE) also considered

# Perils of Inaccurate Estimation



#### **Little Prior Work**

#### Error injection studies

```
[Sloan DSN 07] [Seward ITC 03] [Li HPCA 07] [Lanigan DSN ...] [Chen PRDC 08]
[Meixner MICRO 07] [Reddy DSN 07] [Wang Trans. Dependable and Secure Computing ...]
[Ramachandran DSN 08] [Hari MICRO 09] [Dimitrov PACT 10]
[Kalbarczyk Trans. Software Eng. 99] [Wang ISCA 07] [Sakata DSN 07]
[Feng ASPLOS 10] [Choi Trans. Reliability 90] [Lee DAC 01] [Ejlali DSN 03]
[Arlat TCOMP 03] [Ray MICRO 01] [Hari ASPLOS 12] [Christmansson ISSRE 98]
[Racunas HPCA 07] [Nakka DSN 07] [Goswami FTCS 93] [Ferna TCAD 12]
[Sahoo DSN 08] [de Kruijf ISCA 10] [Ignat DATE 06] [Miskov-Živanov TCAD 10]
[Pellegrini DATE 12] [Mukherjee HPCA 05] [Kanawati AIAA 93] [Gschwind ICCD 11]
[Blome Workshop on Architectural Reliability 08] [... DSN 07] [Rimen FTCS 94]
[Li HPCA 09] [Li ASPLOS 08] [Reis TACO 05] [Reddy ICCD 06]
[Rebaudengo IOLTW 02] [Maniatakos TCOMP 11] [Zhang PACT 10] [Chen ASPDAC 06]
[Pattabiraman Trans. Dependable and Secure Computing 11] [Li DSN 05]
[Gracia DFT 01] [Pandit DSN 09] [Constantinescu DSN 12] [Sterpone DDECS 11]
[Michalak Trans. Device and Materials Reliability 12] [Baraza TVLSI 08]
[Narayanasam DATE 07] [Stott DSN 02] [Wang DSN 04] [Cheng TCAD 99]
[Andres TVLSI 08] [Alderighi Trans. Nucl. ...] [Romanescu PACT 08] [Gu DSN ...]
[... DAC 10] [Lima DAC 03] [Saggese DSN 05] [Pattabiraman DSN 08]
```

#### Quantified comparison

[Rimen FTCS 94] [Rebaudengo IOLTW 02]

## What We Found

Naïve high-level injections highly inaccurate

How inaccurate?

#### Designs:

LEON3 (in-order, single-issue), ALPHA (out-of-order, superscalar)

**Applications: SPEC 2000** 

Error injection samples: 6 million



#### What We Found

Naïve high-level injections highly inaccurate

- How inaccurate?
  - Up to 45X
  - Neither optimistic nor pessimistic

- Why inaccurate?
  - Only 3% flip-flop error propagations modeled

## **Uncore Components**

#### OpenSPARC T2 SoC





#### Intel i7 quad-core SoC



#### **Power**

Processor cores 60.2%

Uncore 39.8%

[Gupta USENIX 12]

## **Existing Work**

#### Errors in processor cores

```
[Sloan DSN 07] [Seward ITC 03] [Li HPCA 07] [Lanigan DSN ...] [Chen PRDC 08]
[Meixner MICRO 07] [Reddy DSN 07] [Wang Trans. Dependable and Secure Computing ...]
[Ramachandran DSN 08] [Hari MICRO 09] [Dimitrov PACT 10]
[Kalbarczyk Trans. Software Eng. 99] [Wang ISCA 07] [Sakata DSN 07]
[Feng ASPLOS 10] [Choi Trans. Reliability 90] [Lee DAC 01] [Ejlali DSN 03]
[Arlat TCOMP 03] [Ray MICRO 01] [Hari ASPLOS 12] [Christmansson ISSRE 98]
[Racunas HPCA 07] [Nakka DSN 07] [Goswami FTCS 93] [Ferna TCAD 12]
[de Kruijf ISCA 10] [Sahoo DSN 08] [Ignat DATE 06] [Miskov-Živanov TCAD 10]
[Pellegrini DATE 12] [Mukherjee HPCA 05] [Kanawati AIAA 93] [Gschwind ICCD 11]
[Blome Workshop on Architectural Reliability 08] [... DSN 07] [Rimen FTCS 94]
[Li HPCA 09] [Li ASPLOS 08] [Reis TACO 05] [Rebaudengo IOLTW 02]
[Reddy ICCD 06] [Maniatakos TCOMP 11] [Zhang PACT 10] [Chen ASPDAC 06]
[Pattabiraman Trans. Dependable and Secure Computing 11] [Li DSN 05]
[Pandit DSN 09] [Sterpone DDECS 11] [Gracia DFT 01] [Constantinescu DSN 12]
[Michalak Trans. Device and Materials Reliability 12] [Baraza TVLSI 08]
[Narayanasam DATE 07] [Stott DSN 02] [Wang DSN 04] [Vadlamani DATE 10]
[Cheng TCAD 99] [Andres TVLSI 08] [Alderighi Trans. Nucl. ...]
[Romanescu PACT 08] [Gu DSN ...] [... DAC 10] [Lima DAC 03]
[Saggese DSN 05] [Pattabiraman DSN 08]
```

#### Errors in uncore?

#### **HUNDREDs** of publications

# **Uncore Soft Errors: First Extensive Study**

- New error injection: fast <u>&</u> accurate
  - 20,000x speedup vs. RTL
- Reliability impact: uncore ≈ processor cores
  - BUT, long error propagation latency



[Cho DAC 15]

## **Lots of CLEAR Results**



# **High-Level Enough?**



Circuit

Silent Data Corruption (SDC) Rate























# Architecture & Software: **Too Expensive**





# Circuit-only (Application-guided): **Highly Effective**

[Figure: circuit-only (application-guided) protection compared against architecture, software, and algorithm techniques.]

#### What About ABFT?





SDC improvement

## **CLEAR Insights**

Hidden costs & inefficiencies

Implementation matters

Inaccurate analysis

## **Example: "Hidden" Costs**



| Flip-flop | Area | Energy |
|-----------|------|--------|
| Nominal   | 1    | 1      |
| LEAP-DICE | 2    | 1.8    |
| EDS       | 1.5  | 1.4    |

Not just flip-flop overhead Routing, recovery impact

## **Example: Inefficiencies**

#### Few Flip-flops Protected

Data Flow Checking (DFC)

57%

Low SDC Coverage per Flip-flop

Data Flow Checking (DFC)

30%



#### Result: Low SDC Improvement

Data Flow Checking (DFC)

1.2x

[Meixner, MICRO 07]

# **Example: Implementation Matters**

Logic parity

- ☐ Original design
- ☐ Parity logic
- ☐ Pipeline flip-flops



| Implementation                    | Impact                                       |
|-----------------------------------|----------------------------------------------|
| Logic parity: Naïve               | 200 MHz clock speed impact                   |
| Logic parity: Incorrect heuristic | 80% additional energy impact                 |
| Logic parity: CLEAR heuristic     | No clock speed impact, minimal energy impact |

Parameters: parity size, flip-flop vulnerability, floorplan location, timing path slack, etc.

## **Example: Inaccurate Analysis**

- Software assertions: SDC improvement
  - Prior publications: 3.9x
    - Inaccurate error injection
  - Accurate analysis: 1.5x

|                 | Flip-flop error injection | Register uniform error injection |
|-----------------|---------------------------|----------------------------------|
| SDC improvement | 1.5x                      | 4.8x                             |

# **How About Benchmark Dependence?**

- 50 (training, evaluation) pairs
  - Training: 4 SPEC, Evaluation: 7 SPEC

| Trained SDC improvement   | 5x   | 50x | 500x |
|---------------------------|------|-----|------|
| Evaluated SDC improvement | 4.8x | 39x | 433x |



Add "lightweight" hardening (e.g., LHL)

| Extra energy cost (additive) | 2%  | 1%   | 0.8%   |
|------------------------------|-----|------|--------|
| Final SDC improvement        | 19x | 152x | 1,326x |

# **Light-Hardened LEAP (LHL)**

| Flip-Flop | Soft Error<br>Rate (SER) | Area | Power | Delay | Energy |
|-----------|--------------------------|------|-------|-------|--------|
| Baseline  | 1                        | 1    | 1     | 1     | 1      |
| LHL       | 2.5×10<sup>-1</sup>      | 1.2  | 1.1   | 1.2   | 1.3    |
| LEAP-DICE | 2×10<sup>-4</sup>        | 2    | 1.8   | 1     | 1.8    |

## **Target for Future Resilience Techniques**



### **Outline**

Robust operation: silicon CMOS reliability

Beyond silicon

Conclusion

# COMPUTING PERFORMANCE

# Game Over or Next Level?

Samuel H. Fuller and Lynette I. Millett, Editors

# **Improve Computing Performance**

System integration

Device performance

# **Option 1: Better Transistors**

System integration

- Few experimental demos
- Transistors ≠ system



# **Option 2: Design Tricks**

System integration



## **Improve Computing Performance**

System integration



## Solution: NanoSystems

Transform new nanotech into new systems, enable new applications

- New devices
- New architectures
- New fabrication
- New sensors

## **Abundant-Data Explosion**

"Swimming in sensors, drowning in data"





- Mine, search, analyze: near real-time
  - Data centers, mobile phones, robots

# Today's System Bottlenecks

- Separate compute & memory chips
- Not enough on-chip memory
- Capacity & bandwidth critical



# Abundant-Data Applications

Huge memory wall: processors, accelerators



# Nano-Engineered Computing Systems Technology























# **N3XT NanoSystems**

#### **Computation immersed in memory**



#### **N3XT NanoSystems**

#### **Computation immersed in memory**

Increased functionality



Impossible with today's technologies

#### **N3XT Computation Immersed in Memory**

- 3D Resistive RAM: massive storage
- 1D CNFET, 2D FET: compute, RAM access
- MRAM: quick access
- 1D CNFET, 2D FET: compute, RAM access
- 1D CNFET, 2D FET: compute, power, clock



# Carbon Nanotube FET (CNFET)



#### **Energy Delay Product**

~10× benefit

Full-chip case studies [IBM, IMEC, Stanford, other commercial]

#### **CNFET Inverter**



# Big Promise, Major Obstacles

Process advances alone inadequate

#### Mis-positioned CNTs



#### **Metallic CNTs**



Imperfection-immune paradigm

#### CNT Growth circa 2005

Highly mis-positioned



# First Wafer-Scale Aligned CNT Growth



Aligned CNT growth



**Quartz wafer with CNTs** 



99.5% aligned CNTs



Stanford Nanofabrication Facility

#### Wafer-Scale CNT Transfer

High-temperature CNT growth

Low-temperature circuit fabrication



**CNT** transfer

120 °C



**Before** transfer



Quartz



After transfer

SiO<sub>2</sub>/Si





#### **Mis-Positioned CNT-Immune NAND**

1. Grow CNTs



[Patil IEEE TCAD 09]

#### **Mis-Positioned CNT-Immune NAND**

1. Grow CNTs

2. Extended gate, contacts



[Patil IEEE TCAD 09]

#### **Mis-Positioned CNT-Immune NAND**

1. Grow CNTs
2. Extended gate, contacts
3. Etch gate & CNTs
4. Dope P & N regions
- Arbitrary logic functions
  - Graph algorithms



#### Imperfection-Immune VLSI

Mis-positioned CNTs

Metallic CNTs



Arbitrary logic functions



Scalable m-CNT Removal

#### **Most Importantly**

- VLSI processing
  - No per-unit customization

- VLSI design
  - Immune CNT library

# **CNT Computer**



[Shulaker Nature 13]

#### **CNT Computer**

Turing-complete processor: entirely CNFETs



[Shulaker Nature 13]

# 10× EDP, BUT....

# How can we do better?

#### **N3XT Computation Immersed in Memory**

- 3D Resistive RAM: massive storage
- 1D CNFET, 2D FET: compute, RAM access
- MRAM: quick access
- 1D CNFET, 2D FET: compute, RAM access
- 1D CNFET, 2D FET: compute, power, clock



#### Many Nano-scale Innovations

#### Memory & logic devices



3D Resistive RAM (RRAM)







2D FETs: large-area monolayer MoS<sub>2</sub>

#### **Embedded cooling**



Phase change: hotspots suppressed



Vertical metal nanowire arrays

#### 3D Integration

Massive ILV density >> TSV density





# **Realizing Monolithic 3D**

Low-temperature fabrication: < 400 °C

#### **Device + Architecture Benefits**

Naturally enabled





[Shulaker Nature 17]

#### >2 Million CNFETs, 1 Mbit Resistive RAM



[Shulaker Nature 17]

Interwoven compute + memory + sensing



Abundant data: Terabytes / second



[Shulaker Nature 17]

#### 3D NanoSystem Results



[Shulaker Nature 17]

#### **N3XT Simulation Framework**

Joint technology, design & app. exploration













#### 2D

- 64 GB off-chip DRAM
- DDR3 interface
- SRAM cache
- 64 processor cores

#### Single-chip N3XT

- 64 GB on-chip 3D RRAM
- "Simple" interface
- STTRAM cache
- 64 processor cores









#### ~1,000× benefits, existing software













#### 851× benefits



- 100× to 1,000× benefits (energy × execution time)
  - Deep learning accelerators



Chip stacking: 2 - 4× benefits

#### **Complement with Software Solutions**



optimization

s/w + h/w

reliability

#### **More Opportunities**

Co-optimized hardware + software

Brain-inspired

Technology innovations



"Brain-Inspired Computing Exploiting Carbon Nanotube FETs and Resistive RAM: Hyperdimensional Computing Case Study," *ISSCC* 2018.

#### **Outline**

Robust operation: silicon CMOS reliability

Beyond silicon

Conclusion

# **Thanks to Research Group**



#### Thanks to Sponsors & Collaborators



















































#### Conclusion

- Robust systems
  - New solutions: elegantly simple, effective

#### **Silicon CMOS Reliability**

BISER, LEAP

Failure prediction

**CLEAR** cross-layer

#### **Beyond silicon**

**CNFET** nanosystems

*N3XT* monolithic 3D

1,000X opportunity