# Computer Architecture Lecture 23: On-Chip Networks Prof. Onur Mutlu ETH Zürich Fall 2020 28 December 2020 # Buffering and Flow Control in On-Chip Networks # **On-Chip Networks** - Connect cores, caches, memory controllers, etc - Buses and crossbars are not scalable - Packet switched - 2D mesh: Most commonly used topology - Primarily serve cache misses and memory requests - Router - PE Processing Element (Cores, L2 Banks, Memory Controllers, etc) # On-chip Networks # On-Chip vs. Off-Chip Interconnects - On-chip advantages - Low latency between cores - No pin constraints - Rich wiring resources - → Very high bandwidth - → Simpler coordination - On-chip constraints/disadvantages - 2D substrate limits implementable topologies - Energy/power consumption a key concern - Complex algorithms undesirable - Logic area constrains use of wiring resources # On-Chip vs. Off-Chip Interconnects (II) #### Cost - Off-chip: Channels, pins, connectors, cables - On-chip: Cost is storage and switches (wires are plentiful) - Leads to networks with many wide channels, few buffers #### Channel characteristics - □ On chip short distance → low latency - □ On chip RC lines → need repeaters every 1-2mm - Can put logic in repeaters #### Workloads Multi-core cache traffic vs. supercomputer interconnect traffic # On-Chip vs. Off-Chip Tradeoffs George Nychis, Chris Fallin, Thomas Moscibroda, Onur Mutlu, and Srinivasan Seshan, "On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-core Interconnects" Proceedings of the 2012 ACM SIGCOMM Conference (SIGCOMM), Helsinki, Finland, August 2012. Slides (pptx) # On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-Core Interconnects George Nychis†, Chris Fallin†, Thomas Moscibroda§, Onur Mutlu†, Srinivasan Seshan† † Carnegie Mellon University § Microsoft Research Asia {gnychis,cfallin,onur,srini}@cmu.edu moscitho@microsoft.com ### Buffers in NoC Routers - Buffers are necessary for high network throughput - → buffers increase total available bandwidth in network # Buffers in NoC Routers - Buffers are necessary for high network - → buffers increase total available - Dynamic er - Static - Buff mow control uire significant chip area z.g., in TRIPS prototype chip, input buffers occupy 75% of total on-chip network area [Gratz et al, ICCD' 06] # Going Bufferless...? - How much throughput do we lose? - → How is latency affected? - Up to what injection rates can we use bufferless routing? - → Are there realistic scenarios in which NoC is operated at injection rates below the threshold? - Can we achieve energy reduction? - $\rightarrow$ If so, how much...? - Can we reduce area, complexity, etc...? # **BLESS: Bufferless Routing** - Always forward *all* incoming flits to some output port - If no productive direction is available, send to another direction - → packet is deflected - → Hot-potato routing [Baran' 64, etc] ## **BLESS: Bufferless Routing** Flit-Ranking I. Create a ranking over all incoming flits Port-Prioritization 2. For a given flit in this ranking, find the best free output-port Apply to each flit in order of ranking # FLIT-BLESS: Flit-Level Routing - Each flit is routed independently. - Oldest-first arbitration (other policies evaluated in paper) - Network Topology: - → Can be applied to most topologies (Mesh, Torus, Hypercube, Trees, ...) - 1) #output ports , #input ports at every router - 2) every router is reachable from every other router - Flow Control & Injection Policy: - → Completely local, inject whenever input port is free - Absence of Deadlocks: every flit is always moving - Absence of Livelocks: with oldest-first ranking # BLESS: Advantages & Disadvantages #### **Advantages** - No buffers - Purely local flow control - Simplicity - no credit-flows - no virtual channels - simplified router design - No deadlocks, livelocks - Adaptivity - packets are deflected around congested areas! - Router latency reduction - Area savings #### **Disadvantages** - Increased latency - Reduced bandwidth - Increased buffering at receiver - Header information at each flit - Oldest-first arbitration complex - QoS becomes difficult # Evaluation – Synthetic Traces - First, the bad news © - Uniform random injection - BLESS has significantly lower saturation throughput compared to buffered baseline. # Evaluation – Homogenous Case Study - milc benchmarks (moderately intensive) - Perfect caches! - Very little performance degradation with BLESS (less than 4% in dense network) - With router latency I, BLESS can even outperform baseline (by ~10%) - Significant energy improvements (almost 40%) # Evaluation – Homogenous Case Study #### **BLESS Conclusions** - For a very wide range of applications and network settings, buffers are not needed in NoC - Significant energy savings (32% even in dense networks and perfect caches) - Area-savings of 60% - Simplified router and network design (flow control, etc...) - Performance slowdown is minimal (can even increase!) - A strong case for a rethinking of NoC design! - Future research: - Support for quality of service, different traffic classes, energymanagement, etc... # Bufferless Routing in NoCs - Moscibroda and Mutlu, "A Case for Bufferless Routing in On-Chip Networks," ISCA 2009. - https://users.ece.cmu.edu/~omutlu/pub/bless\_isca09.pdf #### A Case for Bufferless Routing in On-Chip Networks Thomas Moscibroda Microsoft Research moscitho@microsoft.com Onur Mutlu Carnegie Mellon University onur@cmu.edu # Issues In Bufferless Deflection Routing - Livelock - Resulting Router Complexity - Performance & Congestion at High Loads - Quality of Service and Fairness - Chris Fallin, Greg Nazario, Xiangyao Yu, Kevin Chang, Rachata Ausavarungnirun, and Onur Mutlu, "Bufferless and Minimally-Buffered Deflection Routing" Invited Book Chapter in Routing Algorithms in Networks-on-Chip, pp. 241-275, Springer, 2014. # Low-Complexity Bufferless Routing Chris Fallin, Chris Craik, and Onur Mutlu, "CHIPPER: A Low-Complexity Bufferless Deflection Router" Proceedings of the <u>17th International Symposium on High-</u> <u>Performance Computer Architecture</u> (**HPCA**), pages 144-155, San Antonio, TX, February 2011. <u>Slides (pptx)</u> <u>An extended version</u> as <u>SAFARI Technical Report</u>, TR-SAFARI2010-001, Carnegie Mellon University, December 2010. #### CHIPPER: A Low-complexity Bufferless Deflection Router Chris Fallin Chris Craik Onur Mutlu cfallin@cmu.edu craik@cmu.edu onur@cmu.edu Computer Architecture Lab (CALCM) Carnegie Mellon University # **CHIPPER**: A Low-complexity Bufferless Deflection Router Chris Fallin, Chris Craik, and Onur Mutlu, "CHIPPER: A Low-Complexity Bufferless Deflection Router" Proceedings of the <u>17th International Symposium on High-Performance</u> <u>Computer Architecture</u> (**HPCA**), pages 144-155, San Antonio, TX, February 2011. <u>Slides (pptx)</u> # SAFARI Carnegie Mellon #### Motivation - Recent work has proposed bufferless deflection routing (BLESS [Moscibroda, ISCA 2009]) - □ Energy savings: ~40% in total NoC energy - □ Area reduction: ~40% in total NoC area - Minimal performance loss: ~4% on average - Unfortunately: unaddressed complexities in router - → long critical path, large reassembly buffers - Goal: obtain these benefits while simplifying the router in order to make bufferless NoCs practical. #### Problems that Bufferless Routers Must Solve - 1. Must provide livelock freedom - → A packet should not be deflected forever 2. Must reassemble packets upon arrival **Flit**: atomic routing unit Packet: one or multiple flits 0 1 2 3 # A Bufferless Router: A High-Level View ### Complexity in Bufferless Deflection Routers #### 1. Must provide livelock freedom Flits are sorted by age, then assigned in age order to output ports → 43% longer critical path than buffered router #### 2. Must reassemble packets upon arrival Reassembly buffers must be sized for worst case → 4KB per node (8x8, 64-byte cache block) #### Problem 1: Livelock Freedom #### Livelock Freedom in Previous Work - What stops a flit from deflecting forever? - All flits are timestamped - Oldest flits are assigned their desired ports - Total order among flits But what is the cost of this? # Age-Based Priorities are Expensive: Sorting - Router must sort flits by age: long-latency sort network - Three comparator stages for 4 flits # Age-Based Priorities Are Expensive: Allocation - After sorting, flits assigned to output ports in priority order - Port assignment of younger flits depends on that of older flits - sequential dependence in the port allocator # Age-Based Priorities Are Expensive Overall, deflection routing logic based on Oldest-First has a 43% longer critical path than a buffered router Question: is there a cheaper way to route while guaranteeing livelock-freedom? #### Solution: Golden Packet for Livelock Freedom What is really necessary for livelock freedom? **Key Insight**: No total order. it is enough to: - 1. Pick one flit to prioritize until arrival - 2. Ensure any flit is eventually picked Flit age forms total order partial ordering is sufficient! #### Which Packet is Golden? - We select the Golden Packet so that: - 1. a given packet stays golden long enough to ensure arrival - → maximum no-contention latency - 2. the selection rotates through all possible packet IDs - → static rotation schedule for simplicity ### What Does Golden Flit Routing Require? - Only need to properly route the Golden Flit - First Insight: no need for full sort - Second Insight: no need for sequential allocation # Golden Flit Routing With Two Inputs Let's route the Golden Flit in a two-input router first - Step 1: pick a "winning" flit: Golden Flit, else random - Step 2: steer the winning flit to its desired output and deflect other flit - → Golden Flit is always routed toward its destination # Golden Flit Routing with Four Inputs - Each block makes decisions independently! - Deflection is a distributed decision ## Permutation Network Operation ### Permutation Network-based Pipeline ## Problem 2: Packet Reassembly ## Reassembly Buffers are Large - Worst case: every node sends a packet to one receiver - Why can't we make reassembly buffers smaller? #### Small Reassembly Buffers Cause Deadlock What happens when reassembly buffer is too small? ## Reserve Space to Avoid Deadlock? - What if every sender asks permission from the receiver before it sends? - → adds additional delay to every request ## Escaping Deadlock with Retransmissions - Sender is optimistic instead: assume buffer is free - If not, receiver drops and NACKs; sender retransmits - → no additional delay in best case Drop, NACK Other nacked - Other packet completes → transmit buffering overhead for all packets cket - → potentially many retransmits 5. ACK - Sender frees data 1. Send (2 flits) # Solution: Retransmitting Only Once - Key Idea: Retransmit only when space becomes available. - → Receiver drops packet if full; notes which packet it drops - → When space frees up, receiver reserves space so retransmit is successful - → Receiver notifies sender to retransmit ## Use MSHRs as Reassembly Buffers Miss Status Handling Register (MSHR) Outstanding Block 0x3C Pending Cache Misses **Data Buffer** Status Address Reassembly buffering for "free" → A truly bufferless NoC! ### Using MSHRs as Reassembly Buffers #### CHIPPER: Cheap Interconnect Partially-Permuting Router ## CHIPPER: Cheap Interconnect Partially-Permuting Router #### **EVALUATION** ## Methodology - Multiprogrammed workloads: CPU2006, server, desktop - □ 8x8 (64 cores), 39 homogeneous and 10 mixed sets - Multithreaded workloads: SPLASH-2, 16 threads - □ 4x4 (16 cores), 5 applications #### System configuration - Buffered baseline: 2-cycle router, 4 VCs/channel, 8 flits/VC - Bufferless baseline: 2-cycle latency, FLIT-BLESS - Instruction-trace driven, closed-loop, 128-entry OoO window - 64KB L1, perfect L2 (stresses interconnect), XOR mapping ## Methodology #### Hardware modeling - Verilog models for CHIPPER, BLESS, buffered logic - Synthesized with commercial 65nm library - ORION for crossbar, buffers and links #### Power - Static and dynamic power from hardware models - Based on event counts in cycle-accurate simulations ## Results: Performance Degradation Minimal loss for low-to-medium-intensity workloads <del>7.0 /0</del> **49.8**%<sup>∢</sup> #### Results: Power Reduction - Removing buffers -> majority of power savings - Slight savings from BLESS to CHIPPER #### Results: Area and Critical Path Reduction **CHIPPER maintains area savings** of BLESS Critical path becomes competitive to buffered #### Conclusions - Two key issues in bufferless deflection routing - livelock freedom and packet reassembly - Bufferless deflection routers were high-complexity and impractical - □ Oldest-first prioritization → long critical path in router - No end-to-end flow control for reassembly → prone to deadlock with reasonably-sized reassembly buffers - CHIPPER is a new, practical bufferless deflection router - □ Golden packet prioritization → short critical path in router - □ Retransmit-once protocol → deadlock-free packet reassembly - □ Cache miss buffers as reassembly buffers → truly bufferless network - CHIPPER frequency comparable to buffered routers at much lower area and power cost, and minimal performance loss #### More on CHIPPER Chris Fallin, Chris Craik, and Onur Mutlu, "CHIPPER: A Low-Complexity Bufferless Deflection Router" Proceedings of the <u>17th International Symposium on High-</u> Performance Computer Architecture (HPCA), pages 144-155, San Antonio, TX, February 2011. <u>Slides (pptx)</u> An extended version as <u>SAFARI Technical Report</u>, TR-SAFARI2010-001, Carnegie Mellon University, December 2010. #### CHIPPER: A Low-complexity Bufferless Deflection Router Chris Fallin Chris Craik Onur Mutlu cfallin@cmu.edu craik@cmu.edu onur@cmu.edu Computer Architecture Lab (CALCM) Carnegie Mellon University # Minimally-Buffered Deflection Routing - Bufferless deflection routing offers reduced power & area - But, high deflection rate hurts performance at high load - MinBD (Minimally-Buffered Deflection Router) introduces: - Side buffer to hold only flits that would have been deflected - Dual-width ejection to address ejection bottleneck - Two-level prioritization to avoid unnecessary deflections - MinBD yields reduced power (31%) & reduced area (36%) relative to buffered routers - MinBD yields improved performance (8.1% at high load) relative to **bufferless** routers → closes half of perf. gap - MinBD has the best energy efficiency of all evaluated designs with competitive performance # Minimally-Buffered Deflection Routing Chris Fallin, Greg Nazario, Xiangyao Yu, Kevin Chang, Rachata Ausavarungnirun, and Onur Mutlu, "MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect" Proceedings of the 6th ACM/IEEE International Symposium on Networks on Chip (NOCS), Lyngby, Denmark, May 2012. Slides (pptx) (pdf) #### MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect Chris Fallin, Greg Nazario, Xiangyao Yu<sup>†</sup>, Kevin Chang, Rachata Ausavarungnirun, Onur Mutlu Carnegie Mellon University {cfallin,gnazario,kevincha,rachata,onur}@cmu.edu <sup>†</sup>Tsinghua University & Carnegie Mellon University yxythu@gmail.com #### MinBD: # Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect Chris Fallin, Greg Nazario, Xiangyao Yu, Kevin Chang, Rachata Ausavarungnirun, and <u>Onur Mutlu</u>, "MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect" Proceedings of the <u>6th ACM/IEEE International Symposium on Networks on Chip</u> (**NOCS**), Lyngby, Denmark, May 2012. <u>Slides (pptx) (pdf)</u> # SAFARI Carnegie Mellon University # Bufferless Deflection Routing - Key idea: Packets are never buffered in the network. When two packets contend for the same link, one is deflected. - Removing **buffers** yields significant benefits - Reduces power (CHIPPER: reduces NoC power by 55%) - Reduces die area (CHIPPER: reduces NoC area by 36%) - But, at high network utilization (load), bufferless deflection routing causes unnecessary link & router traversals - Reduces network throughput and application performance - Increases dynamic power - Goal: Improve high-load performance of low-cost deflection networks by reducing the deflection rate. #### Outline: This Talk #### Motivation - Background: Bufferless Deflection Routing - MinBD: Reducing Deflections - Addressing Link Contention - Addressing the Ejection Bottleneck - Improving Deflection Arbitration - Results - Conclusions #### Outline: This Talk - Motivation - Background: Bufferless Deflection Routing - MinBD: Reducing Deflections - Addressing Link Contention - Addressing the Ejection Bottleneck - Improving Deflection Arbitration - Results - Conclusions # Issues in Bufferless Deflection Routing - Correctness: Deliver all packets without livelock - CHIPPER<sup>1</sup>: Golden Packet - Globally prioritize one packet until delivered - Correctness: Reassemble packets without deadlock - □ CHIPPER¹: Retransmit-Once - Performance: Avoid performance degradation at high load - MinBD ## Key Performance Issues - **1. Link contention**: no buffers to hold traffic → any link contention causes a deflection - → use side buffers - 2. Ejection bottleneck: only one flit can eject per router per cycle → simultaneous arrival causes deflection → eject up to 2 flits/cycle - **3. Deflection arbitration**: practical (fast) deflection arbiters deflect unnecessarily - → new priority scheme (silver flit) #### Outline: This Talk - Motivation - Background: Bufferless Deflection Routing - MinBD: Reducing Deflections - Addressing Link Contention - Addressing the Ejection Bottleneck - Improving Deflection Arbitration - Results - Conclusions #### Outline: This Talk - Motivation - Background: Bufferless Deflection Routing - MinBD: Reducing Deflections - Addressing Link Contention - Addressing the Ejection Bottleneck - Improving Deflection Arbitration - Results - Conclusions ## Addressing Link Contention - Problem 1: Any link contention causes a deflection - Buffering a flit can avoid deflection on contention - But, input buffers are expensive: - □ All flits are buffered on every hop → high dynamic energy - □ Large buffers necessary → high static energy and large area Key Idea 1: add a small buffer to a bufferless deflection router to buffer only flits that would have been deflected #### How to Buffer Deflected Flits <sup>&</sup>lt;sup>1</sup> Fallin et al CHIPPER: A Low-complexity Bufferless Deflection Router", HPCA 2011 #### How to Buffer Deflected Flits ## Why Could A Side Buffer Work Well? - Buffer some flits and deflect other flits at per-flit level - Relative to **bufferless routers**, deflection rate reduces (need not deflect all contending flits) - → 4-flit buffer reduces deflection rate by 39% - Relative to **buffered routers**, buffer is more efficiently used (need not buffer all flits) - → similar performance with 25% of buffer space #### Outline: This Talk - Motivation - Background: Bufferless Deflection Routing - MinBD: Reducing Deflections - Addressing Link Contention - Addressing the Ejection Bottleneck - Improving Deflection Arbitration - Results - Conclusions ## Addressing the Ejection Bottleneck - Problem 2: Flits deflect unnecessarily because only one flit can eject per router per cycle - In 20% of all ejections, ≥ 2 flits could have ejected - → all but one flit must **deflect** and try again - → these deflected flits cause additional contention - Ejection width of 2 flits/cycle reduces deflection rate 21% Key idea 2: Reduce deflections due to a single-flit ejection port by allowing two flits to eject per cycle #### Addressing the Ejection Bottleneck #### Addressing the Ejection Bottleneck #### Outline: This Talk - Motivation - Background: Bufferless Deflection Routing - MinBD: Reducing Deflections - Addressing Link Contention - Addressing the Ejection Bottleneck - Improving Deflection Arbitration - Results - Conclusions #### Improving Deflection Arbitration - Problem 3: Deflections occur unnecessarily because fast arbiters must use simple priority schemes - Age-based priorities (several past works): full priority order gives fewer deflections, but requires slow arbiters - State-of-the-art deflection arbitration (Golden Packet & two-stage permutation network) - Prioritize one packet globally (ensure forward progress) - Arbitrate other flits randomly (fast critical path) - Random common case leads to uncoordinated arbitration ## Fast Deflection Routing Implementation Let's route in a two-input router first: - Step 1: pick a "winning" flit (Golden Packet, else random) - Step 2: steer the winning flit to its desired output and deflect other flit - → Highest-priority flit always routes to destination ## Fast Deflection Routing with Four Inputs - Each block makes decisions independently - Deflection is a distributed decision #### Unnecessary Deflections in Fast Arbiters - How does lack of coordination cause unnecessary deflections? - 1. No flit is golden (pseudorandom arbitration) - 2. Red flit wins at first stage - 3. Green flit loses at first stage (must be deflected now) - 4. Red flit loses at second stage; Red and Green are deflected ## Improving Deflection Arbitration Key idea 3: Add a priority level and prioritize one flit to ensure at least one flit is not deflected in each cycle - Highest priority: one Golden Packet in network - Chosen in static round-robin schedule - Ensures correctness - Next-highest priority: one silver flit per router per cycle - Chosen pseudo-randomly & local to one router - Enhances performance ## Adding A Silver Flit - Randomly picking a silver flit ensures one flit is not deflected - 1. No flit is golden but Red flit is silver - 2. Red flit wins at first stage (silver) - 3. Green flit is deflected at first stage 4. Red flit wins at second stage (silver); not deflected #### Minimally-Buffered Deflection Router #### Outline: This Talk - Motivation - Background: Bufferless Deflection Routing - MinBD: Reducing Deflections - Addressing Link Contention - Addressing the Ejection Bottleneck - Improving Deflection Arbitration #### Outline: This Talk - Motivation - Background: Bufferless Deflection Routing - MinBD: Reducing Deflections - Addressing Link Contention - Addressing the Ejection Bottleneck - Improving Deflection Arbitration - Results - Conclusions ## Methodology: Simulated System #### Chip Multiprocessor Simulation - 64-core and 16-core models - Closed-loop core/cache/NoC cycle-level model - Directory cache coherence protocol (SGI Origin-based) - 64KB L1, perfect L2 (stresses interconnect), XOR-mapping - Performance metric: Weighted Speedup (similar conclusions from network-level latency) - Workloads: multiprogrammed SPEC CPU2006 - 75 randomly-chosen workloads - Binned into network-load categories by average injection rate #### Methodology: Routers and Network - Input-buffered virtual-channel router - 8 VCs, 8 flits/VC [Buffered(8,8)]: large buffered router - □ 4 VCs, 4 flits/VC [Buffered(4,4)]: typical buffered router - □ 4 VCs, 1 flit/VC [Buffered(4,1)]: smallest deadlock-free router - □ All power-of-2 buffer sizes up to (8, 8) for perf/power sweep - Bufferless deflection router: CHIPPER<sup>1</sup> - Bufferless-buffered hybrid router: AFC<sup>2</sup> - Has input buffers and deflection routing logic - Performs coarse-grained (multi-cycle) mode switching #### Common parameters - 2-cycle router latency, 1-cycle link latency - 2D-mesh topology (16-node: 4x4; 64-node: 8x8) - Dual ejection assumed for baseline routers (for perf. only) #### Methodology: Power, Die Area, Crit. Path #### Hardware modeling - Verilog models for CHIPPER, MinBD, buffered control logic - Synthesized with commercial 65nm library - ORION 2.0 for datapath: crossbar, muxes, buffers and links #### Power - Static and dynamic power from hardware models - Based on event counts in cycle-accurate simulations - Broken down into buffer, link, other ## Reduced Deflections & Improved Perf. #### Overall Performance Results - Similar perf. to Buffered (4,1) @ 25% of buffering space - Within 2.7% of Buffered (4,4) (8.3% at high load) #### Overall Power Results ## Performance-Power Spectrum Most energy-efficient (perf/watt) of any evaluated network router design #### Die Area and Critical Path - Only 3% area increase over CHIPPER (4-flit buffer) - Increases by 7% over CHIPPER, 8% over Buffered (4,4) #### Conclusions - Bufferless deflection routing offers reduced power & area - But, high deflection rate hurts performance at high load - MinBD (Minimally-Buffered Deflection Router) introduces: - Side buffer to hold only flits that would have been deflected - Dual-width ejection to address ejection bottleneck - Two-level prioritization to avoid unnecessary deflections - MinBD yields reduced power (31%) & reduced area (36%) relative to buffered routers - MinBD yields improved performance (8.1% at high load) relative to **bufferless** routers → closes half of perf. gap - MinBD has the best energy efficiency of all evaluated designs with competitive performance ## Minimally-Buffered Deflection Routing Chris Fallin, Greg Nazario, Xiangyao Yu, Kevin Chang, Rachata Ausavarungnirun, and Onur Mutlu, "MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect" Proceedings of the 6th ACM/IEEE International Symposium on Networks on Chip (NOCS), Lyngby, Denmark, May 2012. Slides (pptx) (pdf) #### MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect Chris Fallin, Greg Nazario, Xiangyao Yu<sup>†</sup>, Kevin Chang, Rachata Ausavarungnirun, Onur Mutlu Carnegie Mellon University {cfallin,gnazario,kevincha,rachata,onur}@cmu.edu <sup>†</sup>Tsinghua University & Carnegie Mellon University yxythu@gmail.com ## **HAT: Heterogeneous Adaptive Throttling for On-Chip Networks** Kevin Chang, Rachata Ausavarungnirun, Chris Fallin, and Onur Mutlu, "HAT: Heterogeneous Adaptive Throttling for On-Chip Networks" Proceedings of the <u>24th International Symposium on Computer Architecture and</u> High Performance Computing (SBAC-PAD), New York, NY, October 2012. Slides (pptx) (pdf) ## **Executive Summary** <u>Problem</u>: Packets contend in on-chip networks (NoCs), causing congestion, thus reducing performance #### Observations: - 1) Some applications are more sensitive to network latency than others - 2) Applications must be throttled differently to achieve peak performance - Key Idea: Heterogeneous Adaptive Throttling (HAT) - 1) Application-aware source throttling - 2) Network-load-aware throttling rate adjustment - **Result:** Improves performance and energy efficiency over state-of-the-art source throttling policies ## Source Throttling in Bufferless NoCs Kevin Chang, Rachata Ausavarungnirun, Chris Fallin, and Onur Mutlu, <u>"HAT: Heterogeneous Adaptive Throttling for On-Chip</u> Networks" Proceedings of the <u>24th International Symposium on Computer</u> <u>Architecture and High Performance Computing</u> (**SBAC-PAD**), New York, NY, October 2012. <u>Slides (pptx) (pdf)</u> #### HAT: Heterogeneous Adaptive Throttling for On-Chip Networks Kevin Kai-Wei Chang, Rachata Ausavarungnirun, Chris Fallin, Onur Mutlu Carnegie Mellon University {kevincha, rachata, cfallin, onur}@cmu.edu ## "Bufferless" Hierarchical Rings - Ausavarungnirun et al., "Design and Evaluation of Hierarchical Rings with Deflection Routing," SBAC-PAD 2014. - http://users.ece.cmu.edu/~omutlu/pub/hierarchical-rings-withdeflection\_sbacpad14.pdf - Discusses the design and implementation of a mostlybufferless hierarchical ring # Design and Evaluation of Hierarchical Rings with Deflection Routing ``` Rachata Ausavarungnirun Chris Fallin Xiangyao Yu† Kevin Kai-Wei Chang Greg Nazario Reetuparna Das§ Gabriel H. Loh‡ Onur Mutlu ``` Carnegie Mellon University §University of Michigan †MIT ‡Advanced Micro Devices, Inc. #### "Bufferless" Hierarchical Rings (II) - Rachata Ausavarungnirun, Chris Fallin, Xiangyao Yu, Kevin Chang, Greg Nazario, Reetuparna Das, Gabriel Loh, and Onur Mutlu, "A Case for Hierarchical Rings with Deflection Routing: An Energy-Efficient On-Chip Communication Substrate" Parallel Computing (PARCO), to appear in 2016. - <u>arXiv.org version</u>, February 2016. Achieving both High Energy Efficiency and High Performance in On-Chip Communication using Hierarchical Rings with Deflection Routing Rachata Ausavarungnirun Chris Fallin Xiangyao Yu† Kevin Kai-Wei Chang Greg Nazario Reetuparna Das§ Gabriel H. Loh‡ Onur Mutlu Carnegie Mellon University §University of Michigan †MIT ‡AMD #### Summary of Six Years of Research Chris Fallin, Greg Nazario, Xiangyao Yu, Kevin Chang, Rachata Ausavarungnirun, and Onur Mutlu, "Bufferless and Minimally-Buffered Deflection Routing" Invited Book Chapter in Routing Algorithms in Networks-on-Chip, pp. 241-275, Springer, 2014. # Chapter 1 Bufferless and Minimally-Buffered Deflection Routing Chris Fallin, Greg Nazario, Xiangyao Yu, Kevin Chang, Rachata Ausavarungnirun, Onur Mutlu ## More Readings - Studies of congestion and congestion control in on-chip vs. internet-like networks - George Nychis, Chris Fallin, Thomas Moscibroda, Onur Mutlu, and Srinivasan Seshan, "On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-core Interconnects" Proceedings of the 2012 ACM SIGCOMM Conference (SIGCOMM), Helsinki, Finland, August 2012. Slides (pptx) - George Nychis, Chris Fallin, Thomas Moscibroda, and <u>Onur Mutlu</u>, <u>"Next Generation On-Chip Networks: What Kind of Congestion Control Do We Need?"</u> Proceedings of the <u>9th ACM Workshop on Hot Topics in Networks</u> (**HOTNETS**), Monterey, CA, October 2010. <u>Slides (ppt) (key)</u> ## On-Chip vs. Off-Chip Tradeoffs George Nychis, Chris Fallin, Thomas Moscibroda, Onur Mutlu, and Srinivasan Seshan, "On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-core Interconnects" Proceedings of the 2012 ACM SIGCOMM Conference (SIGCOMM), Helsinki, Finland, August 2012. Slides (pptx) # On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-Core Interconnects George Nychis†, Chris Fallin†, Thomas Moscibroda§, Onur Mutlu†, Srinivasan Seshan† † Carnegie Mellon University § Microsoft Research Asia {gnychis,cfallin,onur,srini}@cmu.edu moscitho@microsoft.com ## Packet Scheduling #### Packet Scheduling - Which packet to choose for a given output port? - Router needs to prioritize between competing flits - Which input port? - Which virtual channel? - Which application's packet? - Common strategies - Round robin across virtual channels - Oldest packet first (or an approximation) - Prioritize some virtual channels over others - Better policies in a multi-core environment - Use application characteristics ## Application-Aware Packet Scheduling Das et al., "Application-Aware Prioritization Mechanisms for On-Chip Networks," MICRO 2009. #### The Problem: Packet Scheduling Network-on-Chip is a critical resource shared by multiple applications #### The Problem: Packet Scheduling #### The Problem: Packet Scheduling #### The Problem: Packet Scheduling #### The Problem: Packet Scheduling #### The Problem: Packet Scheduling - Existing scheduling policies - Round Robin - Age - Problem 1: Local to a router - Lead to contradictory decision making between routers: packets from one application may be prioritized at one router, to be delayed at next. - Problem 2: Application oblivious - Treat all applications packets equally - But applications are heterogeneous - Solution : Application-aware global scheduling policies. **Packet Injection Order at Processor** | STALL CYCLES | | | | Avg | |--------------|---|---|----|-----| | RR | 8 | 6 | 11 | 8.3 | | Age | | | | | | STC | | | | | | STALL CYCLES | | | Avg | | |--------------|---|---|-----|-----| | RR | 8 | 6 | 11 | 8.3 | | Age | 4 | 6 | 11 | 7.0 | | STC | | | | | | STALL CYCLES | | | Avg | | |--------------|---|---|-----|-----| | RR | 8 | 6 | 11 | 8.3 | | Age | 4 | 6 | 11 | 7.0 | | STC | 1 | 3 | 11 | 5.0 | #### Application-Aware Prioritization in NoCs - Das et al., "Application-Aware Prioritization Mechanisms for On-Chip Networks," MICRO 2009. - https://users.ece.cmu.edu/~omutlu/pub/app-awarenoc\_micro09.pdf ## Application-Aware Prioritization Mechanisms for On-Chip Networks Reetuparna Das<sup>§</sup> Onur Mutlu<sup>†</sup> Thomas Moscibroda<sup>‡</sup> Chita R. Das<sup>§</sup> §Pennsylvania State University †Carnegie Mellon University ‡Microsoft Research {rdas,das}@cse.psu.edu onur@cmu.edu moscitho@microsoft.com #### Slack-Based Packet Scheduling Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das, "Aergia: Exploiting Packet Latency Slack in On-Chip Networks" Proceedings of the <u>37th International Symposium on Computer</u> Architecture (ISCA), pages 106-116, Saint-Malo, France, June 2010. <u>Slides (pptx)</u> # Aérgia: Exploiting Packet Latency Slack in On-Chip Networks Reetuparna Das<sup>§</sup> Onur Mutlu<sup>†</sup> Thomas Moscibroda<sup>‡</sup> Chita R. Das<sup>§</sup> §Pennsylvania State University †Carnegie Mellon University ‡Microsoft Research {rdas,das}@cse.psu.edu onur@cmu.edu moscitho@microsoft.com #### Low-Cost QoS in On-Chip Networks (I) Boris Grot, Stephen W. Keckler, and Onur Mutlu, "Preemptive Virtual Clock: A Flexible, Efficient, and Costeffective QOS Scheme for Networks-on-Chip" Proceedings of the <u>42nd International Symposium on</u> <u>Microarchitecture</u> (MICRO), pages 268-279, New York, NY, December 2009. Slides (pdf) # Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip **Boris Grot** Stephen W. Keckler Onur Mutlu<sup>†</sup> Department of Computer Sciences The University of Texas at Austin {bgrot, skeckler@cs.utexas.edu} <sup>†</sup>Computer Architecture Laboratory (CALCM) Carnegie Mellon University onur@cmu.edu ## Low-Cost QoS in On-Chip Networks (II) Boris Grot, Joel Hestness, Stephen W. Keckler, and Onur Mutlu, "Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees" Proceedings of the 38th International Symposium on Computer Proceedings of the <u>38th International Symposium on Computer</u> <u>Architecture</u> (**ISCA**), San Jose, CA, June 2011. <u>Slides (pptx)</u> ## Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees Boris Grot¹ bgrot@cs.utexas.edu Joel Hestness<sup>1</sup> hestness@cs.utexas.edu Stephen W. Keckler<sup>1,2</sup> skeckler@nvidia.com Onur Mutlu<sup>3</sup> onur@cmu.edu <sup>1</sup>The University of Texas at Austin Austin, TX <sup>2</sup>NVIDIA Santa Clara, CA <sup>3</sup>Carnegie Mellon University Pittsburgh, PA #### Kilo-NoC: Topology-Aware QoS Boris Grot, Joel Hestness, Stephen W. Keckler, and <u>Onur Mutlu</u>, "Kilo-NOC: A <u>Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees"</u> Proceedings of the <u>38th International Symposium on Computer</u> <u>Architecture</u> (**ISCA**), San Jose, CA, June 2011. <u>Slides (pptx)</u> #### Motivation - Extreme-scale chip-level integration - Cores - Cache banks - Accelerators - □ I/O logic - Network-on-chip (NOC) - 10-100 cores today - 1000+ assets in the near future #### Kilo-NOC requirements - High efficiency - Area - Energy - Good performance - Strong service guarantees (QoS) - Problem: QoS support in each router is expensive (in terms of buffering, arbitration, bookkeeping) - E.g., Grot et al., "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip," MICRO 2009. - Goal: Provide QoS guarantees at low area and power cost #### Idea: - Isolate shared resources in a region of the network, support QoS within that area - Design the topology so that applications can access the region without interference #### Baseline QOS-enabled CMP #### Conventional NOC QOS #### Contention scenarios: - Shared resources - memory access - Intra-VM traffic - shared cache access - Inter-VM traffic - VM page sharing #### Conventional NOC QOS #### Contention scenarios: - Shared resources - memory access - Intra-VM traffic - shared cache access - Inter-VM traffic - VM page sharing Q Network-wide guarantees without network-wide QOS support #### Kilo-NOC QOS - Insight: leverage rich network connectivity - Naturally reduce interference among flows - > Limit the extent of hardware QOS support - Requires a low-diameter topology - □ This work: Multidrop Express Channels (MECS) Grot et al., HPCA 2009 - Dedicated, QOS-enabled regions - Rest of die: QOS-free - Richly-connected topology - Traffic isolation - Special routing rules - Manage interference - Dedicated, QOS-enabled regions - Rest of die: QOS-free - Richly-connected topology - Traffic isolation - Special routing rules - Manage interference - Dedicated, QOS-enabled regions - Rest of die: QOS-free - Richly-connected topology - Traffic isolation - Special routing rules - Manage interference - Dedicated, QOS-enabled regions - Rest of die: QOS-free - Richly-connected topology - Traffic isolation - Special routing rules - Manage interference #### Kilo-NOC view - Topology-aware QOS support - Limit QOS complexity to a fraction of the die - Optimized flow control - Reduce buffer requirements in QOSfree regions ## **Evaluation Methodology** | Parameter | Value | |-------------|-----------------------------------------------------| | Technology | 15 nm | | Vdd | 0.7 V | | System | 1024 tiles: | | | 256 concentrated nodes (64 shared resources) | | Networks: | | | MECS+PVC | VC flow control, QOS support (PVC) at each node | | MECS+TAQ | VC flow control, QOS support only in shared regions | | MECS+TAQ+EB | EB flow control outside of SRs, | | | Separate Request and Reply networks | | K-MECS | Proposed organization: TAQ + hybrid flow control | ## Area comparison ## **Energy comparison** ## Summary Kilo-NOC: a heterogeneous NOC architecture for kilo-node substrates - Topology-aware QOS - Limits QOS support to a fraction of the die - Leverages low-diameter topologies - Improves NOC area- and energy-efficiency - Provides strong guarantees ## Low-Cost QoS in On-Chip Networks (II) Boris Grot, Joel Hestness, Stephen W. Keckler, and Onur Mutlu, "Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees" Proceedings of the 29th International Symposium on Computer Proceedings of the <u>38th International Symposium on Computer</u> <u>Architecture</u> (**ISCA**), San Jose, CA, June 2011. <u>Slides (pptx)</u> ## Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees Boris Grot¹ bgrot@cs.utexas.edu Joel Hestness<sup>1</sup> hestness@cs.utexas.edu Stephen W. Keckler<sup>1,2</sup> skeckler@nvidia.com Onur Mutlu<sup>3</sup> onur@cmu.edu <sup>1</sup>The University of Texas at Austin Austin, TX <sup>2</sup>NVIDIA Santa Clara, CA <sup>3</sup>Carnegie Mellon University Pittsburgh, PA # Computer Architecture Lecture 23: On-Chip Networks Prof. Onur Mutlu ETH Zürich Fall 2020 28 December 2020