# MorphCore

### An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP

Khubaib, Milad Hashemi, Yale N. Patt, M. Aater Suleman, Chris Wilkerson

**Presented by Lukas Fluri**

### **Executive summary**

- Problem: Modern workloads require a microarchitecture that delivers good single-threaded and multi-threaded performance without wasting energy. Current cores do not provide this because each is specialized for one of these workload types.
- MorphCore: a microarchitecture based on a big out-of-order core that can switch to a highly parallel in-order SMT execution mode
- Results: MorphCore
  - Performs very close to the best single-thread-optimized core on single-threaded workloads
  - Achieves 2/3 of the performance improvement of the best multi-thread-optimized architecture on multi-threaded workloads
  - Performs best on average over all workloads compared to the other measured core architectures
  - Achieves these performance improvements with significantly less energy than the other cores

#### Outline

- Background, Problem and Goal
- Novelty, Key approach and Ideas
- Mechanisms (in some detail)
- Key Results: Methodology and Evaluation
- Summary
- Strengths
- Weaknesses
- Takeaways
- Thoughts, Ideas and Discussion starters

#### 2 important concepts for this paper

- Out-of-order execution
- Simultaneous Multithreading

### **Out-of-order execution (OOO)**

#### In-order execution

#### Out-of-order execution

| Instr. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|--------|---|---|---|---|---|---|---|---|---|----|----|----|
| 1 | F | D | E | E | E | E | R | W |   |   |   |   |
| 2 |   | F | D | - | - | - | E | R | W |   |   |   |
| 3 |   |   | F | D | E | R | - | - | - | W |   |   |
| 4 |   |   |   | F | D | E | E | E | E | R | W |   |
| 5 |   |   |   |   | F | D | - | - | - | E | R | W |

#### Program to execute:

1. `R3 ← MUL R1, R2`
2. `R3 ← ADD R3, R1`
3. `R1 ← ADD R6, R7`
4. `R5 ← MUL R6, R8`
5. `R7 ← ADD R8, R5`

#### Dependencies!

Instruction 2 reads R3, which instruction 1 produces, and instruction 5 reads R5, which instruction 4 produces, so both must wait for their producers. Instruction 3 is independent and can execute early.
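The true (read-after-write) dependencies in the five-instruction program above can be found mechanically. The following is a minimal sketch, not part of the paper or the slides: the `Instr` type and the `raw_dependencies` helper are hypothetical names introduced only for illustration, and the sketch tracks nothing but which earlier instruction last wrote each source register.

```python
from typing import Dict, List, NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical three-operand instruction: dest <- op src1, src2."""
    dest: str
    op: str
    srcs: Tuple[str, ...]


# The program from the slide (indices 0..4 correspond to instructions 1..5).
PROGRAM = [
    Instr("R3", "MUL", ("R1", "R2")),
    Instr("R3", "ADD", ("R3", "R1")),
    Instr("R1", "ADD", ("R6", "R7")),
    Instr("R5", "MUL", ("R6", "R8")),
    Instr("R7", "ADD", ("R8", "R5")),
]


def raw_dependencies(program: List[Instr]) -> List[Tuple[int, int, str]]:
    """Return (consumer, producer, register) for every read-after-write dependency."""
    last_writer: Dict[str, int] = {}  # register -> index of the latest instruction writing it
    deps = []
    for i, instr in enumerate(program):
        for src in instr.srcs:
            if src in last_writer:
                deps.append((i, last_writer[src], src))
        last_writer[instr.dest] = i
    return deps


deps = raw_dependencies(PROGRAM)
for consumer, producer, reg in deps:
    print(f"instruction {consumer + 1} waits for {reg} from instruction {producer + 1}")
```

Only instructions 2 and 5 have a producer inside the program, which matches the stalled rows in the pipeline diagram. An out-of-order core uses exactly this kind of information to let the independent instruction 3 execute ahead of instruction 2.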
### Simultaneous Multithreading (SMT)

### Industry builds 2 types of cores

#### Large out-of-order cores

- Exploit Instruction-Level Parallelism (ILP)
- (+) High single-thread performance
- (−) Power-inefficient for multi-threaded programs

#### **Small cores**

- Exploit Thread-Level Parallelism (TLP)
- (+) High parallel throughput
- (−) Poor single-thread performance

### Problem

Modern workloads require a microarchitecture capable of delivering both good single-threaded and good multi-threaded performance. Currently this is only possible with a big OOO core, which wastes huge amounts of energy on multi-threaded workloads.

### Early approach: ACMP

#### **Asymmetric Chip Multiprocessor**

- One or a few large cores for fast single-threaded execution
- Many small cores for high throughput in multi-threaded execution
- (−) The number of cores is fixed at design time and cannot adapt dynamically to the workload

Image: Morad, Weiser et al. 2005

### Recent approach: Core Fusion

#### **Core Fusion**

- Many small cores for high throughput in multi-threaded execution
- Ability to dynamically fuse into larger cores when executing single-threaded code
- (+) Can dynamically adapt to the workload
- (−) Fused cores have low performance and high power/energy consumption

### Goal

#### **Propose a core architecture that:**

- Can adapt to its workload
- Provides high performance in single-threaded execution
- Provides high parallel throughput in multi-threaded execution
- Uses no more energy/power than necessary

### MorphCore

### Key insight 1

A highly threaded in-order core can achieve the same or better performance as an out-of-order
core, while using much less energy.

### Key insight 2

Such a core can be built using what is almost a subset of the hardware required to build an aggressive OOO core.

### Idea

- Use a big out-of-order core as the base substrate
- Add the capability to switch between out-of-order and highly threaded in-order SMT execution mode
- In the in-order SMT execution mode, turn off power-hungry OOO structures

### MorphCore

- Can switch between out-of-order and in-order SMT execution mode
- Can dynamically adapt to different workloads
- Runs as a normal OOO core in single-threaded programs
  - → Provides high-performance single-thread execution
- Runs as a highly threaded in-order core in multi-threaded programs
  - → Provides high parallel throughput while not wasting vast amounts of energy

### An out-of-order core microarchitecture

### Area, power & frequency overhead

### When to switch between modes?

- Based on the number of active threads
- Threshold t = 2
  - When # active threads <= t, switch to OOO mode
  - When # active threads > t, switch to in-order mode
- Uses MONITOR/MWAIT, two existing ISA instructions, to get information about waiting threads
- No changes to operating systems, compilers or ISAs, and no recompilation of programs necessary!
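The threshold rule above can be written down in a few lines. This is a minimal sketch under stated assumptions, not the hardware mechanism: `Mode`, `select_mode`, `count_mode_switches` and the example trace are hypothetical names and data, and the number of active threads is passed in as a plain integer, whereas the real design derives it from MONITOR/MWAIT.

```python
from enum import Enum
from typing import Iterable


class Mode(Enum):
    OUT_OF_ORDER = "OOO"
    IN_ORDER_SMT = "in-order SMT"


def select_mode(active_threads: int, threshold: int = 2) -> Mode:
    """Apply the stated rule: OOO if active threads <= t, otherwise in-order SMT."""
    if active_threads <= threshold:
        return Mode.OUT_OF_ORDER
    return Mode.IN_ORDER_SMT


def count_mode_switches(thread_counts: Iterable[int],
                        start: Mode = Mode.OUT_OF_ORDER) -> int:
    """Count how often the core would change mode over a trace of active-thread counts."""
    current = start
    switches = 0
    for n in thread_counts:
        wanted = select_mode(n)
        if wanted is not current:
            switches += 1
            current = wanted
    return switches


# Hypothetical trace of the number of active threads over time.
trace = [1, 2, 3, 8, 2, 1, 4]
modes = [select_mode(n) for n in trace]
switches = count_mode_switches(trace)
print([m.value for m in modes], "switches:", switches)
```

Counting switches this way shows why a workload whose active-thread count hovers around the threshold could pay the mode-change cost repeatedly.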
### Switching from OOO to in-order

Handled by a micro-code routine that performs the following tasks:

1. Drains the core pipeline
2. Spills the architectural registers of all threads (into reserved memory regions)
3. Turns off the Renaming unit, the OOO Wakeup and Select blocks, and the Load Queue (clock-gated)
4. Fills register values back into each thread's PRF partition

### Switching from in-order to OOO

Handled by a micro-code routine that performs the following tasks:

1. Drains the core pipeline
2. Spills the architectural registers of all threads and stores pointers to the architectural state of the inactive threads in the Active Thread Table
3. Turns on the Renaming unit, the OOO Wakeup and Select blocks, and the Load Queue
4. Fills the architectural registers of only the active threads into pre-determined locations in the PRF, and updates the speculative and permanent RAT

### Overhead of changing the mode

#### Two main contributors to overhead:

- Draining of the pipeline (depends on the instructions still in the pipeline)
- Spilling of the architectural register state of the threads (~250 cycles)

### The cores

| Core | Type | Freq (GHz) | Issue width | Num of cores | SMT threads per core | Total threads | Total norm. area | Peak ST throughput | Peak MT throughput |
|-----------|-----------------|-------|---|---|---------------------|--------|-------|-------------|-------------|
| OOO-2     | OOO             | 3.4   | 4 | 1 | 2                   | 2      | 1     | 4 ops/cycle | 4 ops/cycle |
| OOO-4     | OOO             | 3.23  | 4 |   |                     |        |       | 4 ops/cycle | 4 ops/cycle |
| MED       | OOO             | 3.4   | 2 | 3 |                     |        |       | 2 ops/cycle | 6 ops/cycle |
| SMALL     | in-order        | 3.4   | 2 | 3 | 2                   | 6      | 0.97  | 2 ops/cycle | 6 ops/cycle |
| MorphCore | OOO or in-order | 3.315 | 4 | 1 | OOO: 2, in-order: 8 | 2 or 8 | 1.015 | 4 ops/cycle | 4 ops/cycle |

### The workloads

14 single-threaded and 14 multi-threaded workloads

| Workload | Problem description | Input set |
|---------------------------|---------------------------|------------------|
| **Multi-Threaded Workloads** | | |
| web | web cache [29] | 500K queries |
| qsort | Quicksort [8] | 20K elements |
| tsp | Traveling salesman [19] | 11 cities |
| OLTP-1 | MySQL server [2] | OLTP-simple [3] |
| OLTP-2 | MySQL server [2] | OLTP-complex [3] |
| OLTP-3 | MySQL server [2] | OLTP-nontrx [3] |
| black | Black-Scholes [23] | 1M options |
| barnes | SPLASH-2 [34] | 2K particles |
| fft | SPLASH-2 [34] | 16K points |
| lu (contig) | SPLASH-2 [34] | 512x512 matrix |
| ocean (contig) | SPLASH-2 [34] | 130x130 grid |
| radix | SPLASH-2 [34] | 300000 keys |
| ray | SPLASH-2 [34] | teapot.env |
| water (spatial) | SPLASH-2 [34] | 512 molecules |
| **Single-Threaded Workloads** | | |
| SPEC 2006 | 7 INT and 7 FP benchmarks | 200M instrs |

Image source: Khubaib, Suleman et al. "MorphCore...", 2012

### Result: Single-thread workloads

### Result: Multi-threaded workloads

- MorphCore achieves a 22% performance improvement over OOO-2
- Stays behind MED and SMALL (30% and 33% improvement, respectively)
- But beats MED in three workloads
- Gets beaten by OOO-4 in three workloads

### Speedup summary

On average, MorphCore outperforms all other cores

### Result: Power & Energy

#### Overall result

MorphCore has the lowest ED<sup>2</sup> (Energy-Delay-Squared, lower is better), 22% lower than the baseline

### Comparison to CoreFusion

- CoreFusion is better on multi-threaded workloads (by 8% on average)
- MorphCore outperforms CoreFusion overall (by 5% on average)
- MorphCore significantly reduces power (19%), energy (29%) and ED<sup>2</sup> (29%) compared to CoreFusion

### Strengths

- Novel but simple and elegant solution
- Low hardware overhead and low frequency penalty (1.5% area, 2.5% frequency)
- Does not need changes to software, compilers or the OS; the ISA remains unchanged
- Solves many of the issues of CoreFusion
- Well structured paper

### Weaknesses

- Performance on MT workloads is better than that of the other OOO cores but still weak compared to small cores (only ~2/3 of their performance improvement)
- The mode-switching policy may cause a large performance overhead
- The overhead of mode switching is not predictable
- The paper sometimes lacks detail

### Takeaways

- A new microarchitecture that can handle both single- and multi-threaded workloads while delivering good performance and not wasting energy
- No changes to software necessary
- Well structured paper,
although sometimes lacking in detail
- Possibility of further improvement and extensions

### Thoughts, Ideas and Discussion starters

- Increase issue-width. Can this approach achieve a higher total peak throughput and close the performance gap on MT workloads between MorphCore and SMALL/MED?
  - → Yes, see Khubaib Ph.D. Dissertation 2014
- Is the concept of MorphCore the only approach to the problem of providing good single- and multi-threaded performance while not wasting energy?
  - → No, see Shruti Padmanabha et al. "Mirage Cores..." MICRO 2017
- Fetch from several threads in each cycle instead of fetching several instructions from one thread each cycle. Can this improve SMT performance?
- Gather statistics about thread behaviour to achieve smarter mode switching (similar to branch prediction). Is this a good approach?
- Does frequent switching of modes lead to cache thrashing?

#### Increase issue-width

- Increased width yields better performance
  - At least in almost all cases (see lu)
- Comes at the cost of higher energy consumption

#### Mirage Cores

- Use a few OOO cores to analyze the execution of a program
- Instruction schedules of parts that repeat often (e.g. loops) get saved ("memoized")
- All further executions of these parts run on the in-order cores
  - → In-order cores perform nearly as well as the big out-of-order cores but use less energy
- High system throughput
- Shorter execution latency