# Ten Lessons From Three Generations Shaped Google's TPUv4i

Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, Thomas Norrie, Nishant Patil, Sushma Prasad, Cliff Young, Zongwei Zhou, and David Patterson, Google LLC

Published in 2021 at the ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)

### **Executive summary**

#### **Problem & Motivation**

- **Growing demand** for AI applications
- State-of-the-art DNN architectures change continuously
- Training and inference are expensive

#### Goal

- Flexible, cost-effective, scalable HW for efficient, low-latency inference

#### **Key Ideas**

- *Identify* and *exploit* **Lessons** from previous TPU versions
- Leverage technological improvement
- Bring memory closer to processing elements
- Exploit existing compiler optimizations
- Expand computational capacity where appropriate

#### Results

- Compared to TPUv3: 2.3x perf/TDP using 1.6x transistors
- Compared to Nvidia T4: 1.3-1.6x speed @ 0.9-1.0x perf/TDP

#### Outline

- Background: What is a TPU?
- Summary of the 10 Lessons
- How have these 10 Lessons shaped the design of TPUv4i
- Performance evaluation
- Strengths and Weaknesses
- Takeaways and Insights
- Discussion

#### What is a TPU?

- Tensor Processing Unit
- Domain-Specific Architecture (DSA)
- Developed by Google
- Training & Inference
  - *TPUv4i*: inference only
- AI Workloads
  - Matrix multiplications
  - Convolutions
  - Activation evaluation
- Better efficiency compared to CPUs or GPUs
  - 30-80X higher performance/Watt

## How did the TPU design evolve?
# 2.3x better perf/TDP

| Feature | TPUv1 | TPUv2 | TPUv3 | TPUv4i | NVIDIA T4 |
|---|---|---|---|---|---|
| Peak TFLOPS / Chip | 92 (8b int) | 46 (bf16) | 123 (bf16) | 138 (bf16/8b int) | 65 (ieee fp16) / 130 (8b int) |
| First deployed (GA date) | Q2 2015 | Q3 2017 | Q4 2018 | Q1 2020 | Q4 2018 |
| DNN Target | Inference only | Training & Inf. | Training & Inf. | Inference only | Inference only |
| Network links x Gbits/s / Chip | | 4 x 496 | 4 x 656 | 2 x 400 | |
| Max chips / supercomputer | | 256 | 1024 | | |
| Chip Clock Rate (MHz) | 700 | 700 | 940 | 1050 | 585 / (Turbo 1590) |
| Idle Power (Watts) Chip | 28 | 53 | 84 | 55 | 36 |
| TDP (Watts) Chip / System | 75 / 220 | 280 / 460 | 450 / 660 | 175 / 275 | 70 / 175 |
| Die Size (mm²) | < 330 | < 625 | < 700 | < 400 | 545 |
| Transistors (B) | 3 | 9 | 10 | 16 | 14 |
| Chip Technology | 28 nm | 16 nm | 16 nm | 7 nm | 12 nm |
| Memory size (on-/off-chip) | 28MB / 8GB | 32MB / 16GB | 32MB / 32GB | 144MB / 8GB | 18MB / 16GB |
| Memory GB/s / Chip | 34 | 700 | 900 | 614 | 320 (if ECC is disabled) |
| MXU Size / Core | 1 256x256 | 1 128x128 | 2 128x128 | 4 128x128 | 8 8x8 |
| Cores / Chip | 1 | 2 | 2 | 1 | 40 |
| Chips / CPUHost | 4 | 4 | 4 | 8 | 8 |

#### Outline

- Background: What is a TPU?
- Summary of the 10 Lessons
  - 1. Lessons applying to any DSA
  - 2. Lessons focusing on DNN DSAs
  - 3. Lessons regarding DNN applications
- Strengths and Weaknesses
- Takeaways and Insights
- Discussion

#### **Summary of the 10 Lessons**

# 1. Lessons applying to any DSA ...
and potentially also to CPUs and GPUs in general

# 1 Logic, wires, SRAM, & DRAM improve unequally

- Updated version of Horowitz's energy-per-operation table
- Logic improves faster than wires or SRAM
  - Logic is relatively "free"
- High Bandwidth Memory (HBM): short DRAM stacks placed close to the DSA

#### Energy per operation [pJ] 45nm vs 7nm

| | Operation | 45 nm (pJ) | 7 nm (pJ) | 45 / 7 |
|---|---|---|---|---|
| + | Int 8 | 0.03 | 0.007 | 4.3 |
| + | Int 32 | 0.1 | 0.03 | 3.3 |
| + | BFloat 16 | | 0.11 | |
| + | IEEE FP 16 | 0.4 | 0.16 | 2.5 |
| + | IEEE FP 32 | 0.9 | 0.38 | 2.4 |
| × | Int 8 | 0.2 | 0.07 | 2.9 |
| × | Int 32 | 3.1 | 1.48 | 2.1 |
| × | BFloat 16 | | 0.21 | |
| × | IEEE FP 16 | 1.1 | 0.34 | 3.2 |
| × | IEEE FP 32 | 3.7 | 1.31 | 2.8 |
| SRAM | 8 KB SRAM | 10 | 7.5 | 1.3 |
| SRAM | 32 KB SRAM | 20 | 8.5 | 2.4 |
| SRAM | 1 MB SRAM<sup>1</sup> | 100 | 14 | 7.1 |
| GeoMean<sup>1</sup> | | | | 2.6 |
| DRAM | | Circa 45 nm | Circa 7 nm | |
| DRAM | DDR3/4 | 1300<sup>2</sup> | 1300<sup>2</sup> | 1.0 |
| DRAM | HBM2 | | 250-450<sup>2</sup> | |
| DRAM | GDDR6 | | 350-480<sup>2</sup> | |

# 2 Leverage prior compiler optimizations

- C compilers improve 1-2% annually
- Nvidia CUDA (2007): 1.8x
- Google TPU XLA (2016): **2.2**x
- The performance of a DSA depends on the quality of its compiler
- Significant compiler optimizations arrive after the hardware is available
- HW must stay compiler compatible to exploit those optimizations in the future

## Relative DSA compiler gains over 20 months on MLPerf benchmark

# 3 Design for performance per TCO vs per CapEx

- Capital Expense (CapEx): the initial purchase cost of an item
- Operation Expense (OpEx): the cost of electricity and provisioning over the lifetime of an item
- Computer hardware lifetime: 3-5 years
- Total Cost of Ownership (TCO)

$$TCO = CapEx + 3 \times OpEx$$

#### **Correlation of System TDP and TCO**

- Trendline for DSA TCO vs TDP: R<sup>2</sup> = 0.982
- CPUs and GPUs aim at best performance/CapEx
- Companies aim at good performance/TCO

#### **Summary of the 10 Lessons**

# 2. Lessons focusing on DNN DSAs

# 4 Support Backwards ML Compatibility

- Developers don't want to change existing DNNs
- Quantization is time-consuming and can lose accuracy
- Time-to-market constraints of deployed DNNs
- Minimize effort in migrating to new hardware

# 5 Inference DSAs need air cooling for global scale

| TPU version | TDP | Cooling | Peak TFLOPS/Chip |
|---|---|---|---|
| TPUv1 (inf) | 75W | Air | 92 (8b int) |
| TPUv2 (train + inf) | 280W | Air | 46 (bf16) |
| TPUv3 (train + inf) | 450W | Liquid | 123 (bf16) |
| TPUv4i (inf) | 175W | Air | 138 (bf16/8b int) |

- High TDP requires liquid cooling
  - Expensive at small scales
- Low-latency, worldwide, user-facing inference
  - Air cooling is easier to deploy

## 6 Some inference apps need floating point arithmetic

- Training is performed in floating point (fp32, bf16)
- Inference is sometimes quantized to int8
  - Better area and power
  - Reduced accuracy and delayed deployment
- Some applications don't work with quantization
  - ImageNet accuracy improved by 1% from 2019 to 2020
- ④ Support Backwards ML Compatibility
- Inference hardware should support floating-point operations

# Inferior performance of quantized image segmentation model

#### **Summary of the 10 Lessons**

3. Lessons regarding DNN applications

## 7 Production inference normally needs multi-tenancy

- Sharing can lower cost and reduce latency
- Flexible SW engineering
- Hardware should support fast model switching

## Multi-tenancy Requirements across Google ML Workload

| Name | Avg. Size (MB) | Max Size (MB) | Multi-tenancy? | Avg. Number of Programs (StdDev), Range | % Use 2016/2020 (per DNN class) |
|---|---|---|---|---|---|
| MLP0 | 580 | 2500 | Yes | 27 (±17), 1-93 | 61%-25% |
| MLP1 | 90 | N.A. | Yes | 5 (±0.3), 1-5 | 61%-25% |
| CNN0 | 60 | 454 | No | 1 | 5%-18% |
| CNN1 | 120 | 680 | Yes | 6 (±10), 1-34 | 5%-18% |
| RNN0 | 1300 | 1300 | Yes | 13 (±3), 1-29 | 0%-29% |
| RNN1 | 120 | 400 | No | 1 | 0%-29% |
| BERT0 | 3000 | 3000 | Yes | 9 (±2), 1-14 | 0%-28% |
| BERT1 | 90 | N.A. | Yes | 5 (±0.3), 1-5 | 0%-28% |

## Lessons Regarding DNN Application evolution

- 8 DNNs grow ~1.5x/year in memory and compute
- 9 DNN workloads evolve with DNN breakthroughs
  - DNNs are continuously updated
  - TPU needs sufficient hardware
  - What will come next…?
  - Programmability and flexibility

#### **Google ML Workload**

| Name | Avg. Size (MB) | Max Size (MB) | Multi-tenancy? | Avg. Number of Programs (StdDev), Range | % Use 2016/2020 (per DNN class) |
|---|---|---|---|---|---|
| MLP0 | 580 | 2500 | Yes | 27 (±17), 1-93 | 61%-25% |
| MLP1 | 90 | N.A. | Yes | 5 (±0.3), 1-5 | 61%-25% |
| CNN0 | 60 | 454 | No | 1 | 5%-18% |
| CNN1 | 120 | 680 | Yes | 6 (±10), 1-34 | 5%-18% |
| RNN0 | 1300 | 1300 | Yes | 13 (±3), 1-29 | 0%-29% |
| RNN1 | 120 | 400 | No | 1 | 0%-29% |
| BERT0 | 3000 | 3000 | Yes | 9 (±2), 1-14 | 0%-28% |
| BERT1 | 90 | N.A. | Yes | 5 (±0.3), 1-5 | 0%-28% |

# 10 Inference limited by latency, not throughput

- Larger batch size = higher throughput
- In datacenters, latency is the major limitation
- Hardware should hide latency

| Suite | DNN | Latency limit (ms) | Batch size |
|---|---|---|---|
| Production | MLP0 | 7 | 200 |
| Production | MLP1 | 20 | 168 |
| Production | CNN0 | 10 | 8 |
| Production | CNN1 | 32 | 32 |
| Production | RNN0 | 60 | 8 |
| Production | RNN1 | 10 | 32 |
| Production | BERT0 | 5 | 128 |
| Production | BERT1 | 10 | 64 |
| MLPerf 0.7 | Resnet50 | 15 | 16 |
| MLPerf 0.7 | SSD | 100 | 4 |
| MLPerf 0.7 | GNMT | 250 | 16 |

Table 5. Latency limit in ms and batch size picked for TPUv4i.

#### Outline

- Background: What is a TPU?
- Summary of the 10 Lessons
- How have these 10 Lessons shaped the design of TPUv4i
- Performance evaluation
- Strengths and Weaknesses
- Takeaways and Insights
- Discussion

## Compiler compatibility

- Based on TPUv3 HW design
  - Compiler optimization ②
  - Maintain backwards ML compatibility ④
- PyTorch model definition:

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class LeNet(nn.Module):
          def __init__(self):
              super(LeNet, self).__init__()
              # 1 input image channel (black & white), 6 output channels,
              # 3x3 square convolution kernel
              self.conv1 = nn.Conv2d(1, 6, 3)
              self.conv2 = nn.Conv2d(6, 16, 3)
              # an affine operation: y = Wx + b
              self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
              self.fc2 = nn.Linear(120, 84)
              self.fc3 = nn.Linear(84, 10)

          def forward(self, x):
              # Max pooling over a (2, 2) window
              x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
              # If the size is a square you can only specify a single number
              x = F.max_pool2d(F.relu(self.conv2(x)), 2)
              # The slide stops here; the remaining layers are filled in so the model runs
              x = torch.flatten(x, 1)
              x = F.relu(self.fc1(x))
              x = F.relu(self.fc2(x))
              return self.fc3(x)

- Compiler compatibility
  - XLA compiler separates:
    - High-Level Operations (HLO)
      - Hardware agnostic
    - Low-Level Operations (LLO)
      - Hardware dependent
  - Maintains compiler compatibility ②

## On-chip storage & DMA

#### HBM kept from TPUv3

- Multi-tenancy ⑦
- Rapid DNN growth ⑧
- Better energy efficiency than DDR DRAM ①

#### SRAM - CMEM

- 128MB CMEM, 28% of die
- 20x more efficient than DRAM ①
- Significant
fraction of TCO ③

#### 4D tensor DMA

- Programmable & flexible ⑧ ⑨
- Support for various striding techniques
- Compiler compatible ②
- Synchronizing partial completion progress, hiding DMA ramp-up and ramp-down latency ⑩ ②

Figure 5. TPUv4i chip block diagram.

Figure 6. TPUv4i chip floorplan.

## Custom on-chip interconnect (OCI)

Point-to-point routing infeasible ①

#### **Custom OCI**

- Connects all components on die
- Flexible & scalable topology ⑧ ⑨
- HBM bandwidth/core
  - 1.3x over TPUv3 ⑧
- 4 non-overlapping network groups
  - Provides locality + reduced latency ①
  - 153GB/s HBM bandwidth

#### Arithmetic unit & TDP

- Retain both int8 and bf16 support
  - Logic is "free" ①
  - TPUv1-TPUv3 ML compatibility ④
  - Not requiring quantization ⑥
- 4 MXUs per chip
  - XLA can handle 2x MXUs ① ②
- Custom 4-input FP adders
  - Minimal numerical difference ④
  - 40% area saved, reducing CapEx ③
  - 12% lower peak power ➤ Lower TDP ⑤
- TDP 175W @ 1.05GHz
  - Closer to TPUv1 (75W)
  - Allows air cooling ⑤
  - Reduces TCO (OpEx) ③

## ICI scaling & Workload analysis

#### 2 ICI links

- 4 chips/board access nearby memory quickly for future DNN growth ⑧
- Extensive tracing and HW performance counters included
  - Analyze system-level bottlenecks
  - Increased design time, but worth it since the target is perf/TCO, not perf/CapEx ③
  - System-level performance improvements (compiler) ②
  - Boost developer productivity ⑦ ⑧ ⑨

Figure 7. TPUv4i board with 4 chips that are connected by ICI.

#### Outline

- Background: What is a TPU?
- Summary of the 10 Lessons
- How have these 10 Lessons shaped the design of TPUv4i
- Performance evaluation
- Strengths and Weaknesses
- Takeaways and Insights
- Discussion

## Performance/Watt

- TPUv3 and TPUv4i are both ~1.9x faster than TPUv2
  - TPUv2 & TPUv3 have 2 cores
  - TPUv4i has 1 core, winning on perf/TCO
- TPUv4i has 2.3x perf/TDP vs TPUv3

#### Breakdown:

- CMEM ~1.5x
- 7nm ~1.3x
- Other contributions ~1.2x

Figure 8. Performance (top) and performance/system Watt ③ for production apps ⑨ relative to TPUv2 for the other TPUs.

## Strengths

- **5 years** of the design team's **experience**
- Detailed overview of the main principles guiding modern DSA architecture decisions
- Detailed analysis of workload requirements and TPU benchmarks
- Pragmatic, production-focused, industry approach

#### Weaknesses

- Extremely broad coverage
  - Design details are covered superficially
- Some Lessons are extremely restricting
  - Production & compatibility focus ② ④ ➤ Restricts innovation
- **Limited benchmark** comparison to competing DSAs and alternative architectures: CPU, GPU (Nvidia T4)
- Some Lessons are redundant
  - ④ Support Backwards ML Compatibility & ⑥ Some inference apps need floating point arithmetic
  - ⑧ DNNs grow ~1.5x/year in memory and compute & ⑨ DNN workloads evolve with DNN breakthroughs
- Benchmarks mainly on Google workloads

## Key Takeaways and insights

- Documentation of unequal technological improvement ①
- Significance of **compiler** optimization ②
- Design for perf/TCO vs perf/CapEx ③
- Significance of **Backwards ML compatibility** in production ④
- Know your workload
- **Iterative improvement** is key for evolving problems

#### **Discussion and Questions**

As Moore's law reaches a plateau and Dennard scaling ends, what are the next steps to keep pace with growing DNN models?

#### **Discussion and Questions**

Is compiler, ML and hardware backwards compatibility restricting innovation in production hardware?
#### **Discussion and Questions**

Is **sacrificing flexibility**, that Google seeks in its TPU hardware, a viable approach in extracting more performance from current hardware?

As Moore's law reaches a plateau and Dennard scaling ends, what are the next steps to **keep pace with growing DNN models**?

- Alternative model architectures?
- ICI scaling: high-bandwidth interchip communication
- Near- or in-memory computing?
- New computing paradigms?
  - Spiking neural networks
  - Memristor computing

Is compiler, ML and hardware backwards compatibility **restricting innovation** in production hardware?

- Production hardware generally has a slower adoption rate
- Exploit compilers to provide a bridge between new paradigms

Is **sacrificing flexibility**, that Google seeks in its TPU hardware, a viable approach in extracting more performance from current hardware?

- Edge processing hardware does not always require flexibility
- Personal devices have a shorter lifespan + don't care about OpEx

# Thank you to my advisors

Gagandeep Singh, Joël Lindegger, Nika Mansouri Ghiasi

## Backup Slides: Roofline model

Figure 12. Roofline model showing apps without CMEM (low point) vs with CMEM (high point). Operational intensity (OI) here is operations divided by memory accesses to HBM *or* to CMEM. If OI were relative to HBM only, CMEM would increase OI and move the points to the right as well as up.

#### Backup Slides: CMEM size

Figure 13. Percent of 128 MB speed as CMEM varies 0–128 MB for the apps and MLPerf Inference 0.5-0.7 server code.

## Comparison to NVIDIA T4

- TPUs used bf16
- NVIDIA used int8 (fp16 on NMT)
- TPUv4i wins on speed
  - 1.3-1.6x faster
- 0.9-1.0x for perf/TDP
  - 1.3x perf/TDP for NMT (both use fp)
- Measuring average power
  - 1.6-2x of T4 for NMT
- For Google, backwards ML compatibility is more important than a small int8 perf/TDP gain

Figure 9. Performance (top) and performance/system TDP ③ relative to T4 for TPUv3/v4i in our datacenter (§7.B).
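The roofline model from the backup slide (Figure 12) can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: it assumes a single memory level (HBM only, ignoring CMEM), assumes operational intensity is measured in operations per byte, and uses the TPUv4i peak (138 TFLOPS) and HBM bandwidth (614 GB/s) from the feature table. The function names are hypothetical.

```python
# Minimal roofline sketch (assumption: one memory level, HBM only; CMEM is ignored).
# Peak and bandwidth are the TPUv4i values from the feature table in this deck.

TPUV4I_PEAK_TFLOPS = 138.0   # bf16 / int8 peak per chip
TPUV4I_HBM_GB_S = 614.0      # HBM bandwidth per chip


def attainable_tflops(oi_ops_per_byte, peak_tflops, mem_gb_s):
    """Attainable performance = min(compute roof, memory roof).

    The memory roof is OI * bandwidth: (ops/byte) * (GB/s) = Gops/s,
    divided by 1000 to express it in Tops/s.
    """
    memory_roof_tflops = oi_ops_per_byte * mem_gb_s / 1000.0
    return min(peak_tflops, memory_roof_tflops)


def ridge_point(peak_tflops, mem_gb_s):
    """Operational intensity (ops/byte) at which the two roofs meet."""
    return peak_tflops * 1000.0 / mem_gb_s


if __name__ == "__main__":
    for oi in (10, 100, 1000):
        perf = attainable_tflops(oi, TPUV4I_PEAK_TFLOPS, TPUV4I_HBM_GB_S)
        print(f"OI = {oi:5d} ops/byte -> {perf:6.1f} TFLOPS attainable")
    print(f"ridge point ~ {ridge_point(TPUV4I_PEAK_TFLOPS, TPUV4I_HBM_GB_S):.0f} ops/byte")
```

Under these assumptions an app below the ridge point is bandwidth bound, which is consistent with the caption's observation that serving data from CMEM moves the points up.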
## Backup Slides: 10 Lessons summary

- ① Logic, wires, SRAM, & DRAM improve unequally
- ② Leverage prior compiler optimizations
- ③ Design for performance per TCO vs per CapEx
- ④ Support Backwards ML Compatibility
- ⑤ Inference DSAs need air cooling for global scale
- ⑥ Some inference apps need floating point arithmetic
- ⑦ Production inference normally needs multi-tenancy
- ⑧ DNNs grow ~1.5x/year in memory and compute
- ⑨ DNN workloads evolve with DNN breakthroughs
- ⑩ Inference SLO limit is P99 latency, not batch size
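As a small worked example for the third lesson, the formula TCO = CapEx + 3 × OpEx can be turned into a comparison of perf/CapEx against perf/TCO. This is a sketch under stated assumptions: the factor 3 is read as a three-year lifetime with OpEx given per year, and both chips and all their numbers are hypothetical, chosen only to show how the two metrics can rank designs differently.

```python
# Sketch of the lesson's formula TCO = CapEx + 3 * OpEx.
# Assumption: the factor 3 is a three-year lifetime and OpEx is a per-year cost.
# The two chips below are hypothetical and do not describe any real hardware.


def total_cost_of_ownership(capex, opex_per_year, years=3.0):
    """TCO = CapEx + years * OpEx (the deck uses years = 3)."""
    return capex + years * opex_per_year


def perf_per_tco(perf, capex, opex_per_year, years=3.0):
    """Performance divided by total cost of ownership."""
    return perf / total_cost_of_ownership(capex, opex_per_year, years)


def perf_per_capex(perf, capex):
    """Performance divided by purchase price only."""
    return perf / capex


if __name__ == "__main__":
    # Hypothetical chips with equal performance in arbitrary units.
    cheap_hot = {"perf": 100.0, "capex": 1000.0, "opex_per_year": 500.0}
    pricey_cool = {"perf": 100.0, "capex": 1500.0, "opex_per_year": 200.0}

    for name, chip in (("cheap_hot", cheap_hot), ("pricey_cool", pricey_cool)):
        print(
            name,
            "perf/CapEx =", round(perf_per_capex(chip["perf"], chip["capex"]), 4),
            "perf/TCO =", round(perf_per_tco(**chip), 4),
        )
```

With these made-up numbers the cheaper chip wins on perf/CapEx while the lower-power chip wins on perf/TCO, which is the trade-off the lesson describes.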