#### Self-Optimizing Memory Controllers: A Reinforcement Learning Approach Engin İpek<sup>1,2</sup> Onur Mutlu<sup>2</sup> José F. Martínez<sup>1</sup> Rich Caruana<sup>1</sup> <sup>1</sup>Cornell University, Ithaca, NY 14850 USA $^{2}$ Microsoft Research, Redmond, WA 98052 USA 35th International Symposium on Computer Architecture (ISCA 2008), June 21-25, 2008, Beijing, China. DINFK Ege Karaismailoglu 2019-04-11 #### Problem: Scheduling policies • Can not anticipate the long term effects of their scheduling decisions **DINFK** Ege Karaismailoglu 2019-04-11 #### Problem: Scheduling policies - Cannot anticipate the long term effects of their scheduling decisions - Cannot take lessons from the consequences of their past actions #### Problem: Scheduling policies - Cannot anticipate the long term effects of its scheduling decisions - Cannot take lessons from the consequences of its past actions Solution: RL-based controller computes learning, far-sighted policy $\rightarrow$ efficient bandwidth utilization #### Problem: Scheduling policies - Cannot anticipate the long term effects of its scheduling decisions - Cannot take lessons from the consequences of its past actions Solution: RL-based controller computes learning, far-sighted policy $\rightarrow$ efficient bandwidth utilization #### Results: 19% speedup, 21% more bandwidth utilization over best static policy #### Problem: Scheduling policies - Cannot anticipate the long term effects of its scheduling decisions - Cannot take lessons from the consequences of its past actions Solution: RL-based controller computes learning, far-sighted policy $\rightarrow$ efficient bandwidth utilization #### Results: - 19% speedup, 21% more bandwidth utilization over best static policy - Scales as well as the best static policy Problem, Background and Goal #### DRAM bandwidth is the bottleneck ### Goal: Efficiently utilize DRAM bandwidth Accepts cache misses and write-back requests, puts them in the memory transaction queue - Accepts cache misses and write-back requests, puts them in the memory transaction queue - Issues activate, read, write and precharge commands to satisfy these requests - Accepts cache misses and write-back requests, puts them in the memory transaction queue - Issues activate, read, write and precharge commands to satisfy these requests - Must obey many local and global DRAM timing constraints <sup>&</sup>lt;sup>0</sup>Onur Mutlu and Thomas Moscibroda, "Parallelism-Aware Batch Scheduling: Enabling High-Performance and Fair Memory Controllers" IEEE Micro, 2009 • Provides the best average performance - Provides the best average performance - (1) Prioritizes read/write over activate/precharge. - Provides the best average performance - (1) Prioritizes read/write over activate/precharge. - (2) Prioritizes older commands over younger commands - Provides the best average performance - (1) Prioritizes read/write over activate/precharge. - (2) Prioritizes older commands over younger commands - Can't anticipate the long term effects of its scheduling decisions - Provides the best average performance - (1) Prioritizes read/write over activate/precharge. - (2) Prioritizes older commands over younger commands - Can't anticipate the long term effects of its scheduling decisions - Can't take lessons from the consequences of its past actions to decide better in the future **Novelty, Key Approach & Ideas** # Key Idea: Enable the Memory Controller to Learn From Its **Actions** #### **Benefits** #### Benefits Allows the hardware designer to focus on what specific goal the MC should accomplish #### **Benefits** - Allows the hardware designer to focus on what specific goal the MC should accomplish - Enables the MC to adapt to workload changes - Allows the hardware designer to focus on what specific goal the MC should accomplish - Enables the MC to adapt to workload changes - Hopefully enables the MC to make farsighted decisions - Allows the hardware designer to focus on what specific goal the MC should accomplish - Enables the MC to adapt to workload changes - Hopefully enables the MC to make farsighted decisions - Creates an MC which can use its experience in new, but similar situations - Allows the hardware designer to focus on what specific goal the MC should accomplish - Enables the MC to adapt to workload changes - Hopefully enables the MC to make farsighted decisions - Creates an MC which can use its experience in new, but similar situations - More efficiently utilize DRAM bandwidth (21% more utilization over FR-FCFS) - Allows the hardware designer to focus on what specific goal the MC should accomplish - Enables the MC to adapt to workload changes - Hopefully enables the MC to make farsighted decisions - Creates an MC which can use its experience in new, but similar situations - More efficiently utilize DRAM bandwidth (21% more utilization over FR-FCFS) - Improved system performance (19% performance improvement over FR-FCFS) # **Mechanisms** Infinite-horizon task: An endless task where we observe the state, take an action, and get a reward in each turn. **DINFK** Ege Karaismailoglu 2019-04-11 Infinite-horizon task: An endless task where we observe the state, take an action, and get a reward in each turn. Infinite-horizon task: An endless task where we observe the state, take an action, and get a reward in each turn. Formulate memory access scheduling as an *infinite horizon (continuous) task* Scheduler always has **1 DRAM clock cycle** to decide what it wants to do Formulate memory access scheduling as an *infinite horizon (continuous) task* Scheduler always has **1 DRAM clock cycle** to decide what it wants to do Figure 4: High-level overview of an RL-based scheduler. Formulate memory access scheduling as an infinite horizon (continuous) task Scheduler always has 1 DRAM clock cycle to decide what it wants to do Figure 4: High-level overview of an RL-based scheduler. Reward = 1 if a read/write command was issued. Otherwise, Reward = 0 At every time step, the scheduler will make the decision it thinks will maximize the reward function, which is defined as<sup>1</sup> $$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$ <sup>&</sup>lt;sup>1</sup>R. Sutton and A. Barto. Reinforcement Learning. MIT Press, Cambridge, MA, 1998. At every time step, the scheduler will make the decision it thinks will maximize the reward function, which is defined as<sup>2</sup> $$\begin{array}{ll} G_t \; \doteq \; R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \; = \; \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \\ \gamma \in [\mathtt{0},\mathtt{1}] \\ \gamma \approx \mathtt{1} \to \mathit{farsighted} \\ \gamma \approx \mathtt{0} \to \mathit{greedy} \end{array}$$ <sup>&</sup>lt;sup>2</sup>R. Sutton and A. Barto. Reinforcement Learning. MIT Press, Cambridge, MA, 1998. At every time step, the scheduler will make the decision it thinks will maximize the reward function, which is defined as<sup>3</sup> $$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{55} \gamma^k R_{t+k+1}$$ Note that we also have $$egin{aligned} &\gamma \in [0,1] \ G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \ &= R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots ight) \ &= R_{t+1} + \gamma G_{t+1} \end{aligned} egin{aligned} &\gamma \approx \mathbf{1} ightarrow \mathit{farsighted} \ &\gamma \approx \mathbf{0} ightarrow \mathit{greedy} \end{aligned}$$ <sup>&</sup>lt;sup>3</sup>R. Sutton and A. Barto. Reinforcement Learning. MIT Press, Cambridge, MA, 1998. # Q-values ## Q-values • Q-values enable us to assign credit & blame to past actions ## Q-values - Q-values enable us to assign credit & blame to past actions - We define the Q-value of a state-action pair for a specific policy as the expected value of the reward function if we take action a in state s and follow policy $\pi$ afterwards: ## O-values - Q-values enable us to assign credit & blame to past actions - We define the Q-value of a state-action pair for a specific policy as the expected value of the reward function if we take action a in state s and follow policy $\pi$ afterwards:4 $$q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] = \mathbb{E}_{\pi} \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a \right]$$ <sup>&</sup>lt;sup>4</sup>R. Sutton and A. Barto. Reinforcement Learning. MIT Press, Cambridge, MA, 1998. For each candidate command, the associated state has 6 attributes: 1. # Reads in the queue - 1. # Reads in the queue - 2. # Reads in the queue that are load misses - 1. # Reads in the queue - 2. # Reads in the queue that are load misses - 3. # Writes in the queue - 1. # Reads in the queue - 2. # Reads in the queue that are load misses - 3. # Writes in the queue - 4. # Writes in the queue waiting for the row referenced by this command - 1. # Reads in the queue - 2. # Reads in the queue that are load misses - 3. # Writes in the queue - 4. # Writes in the queue waiting for the row referenced by this command - 5. # Load misses (which are the oldest load misses from their cores) in the queue waiting for the row referenced by this command - 1. # Reads in the queue - 2. # Reads in the queue that are load misses - 3. # Writes in the queue - 4. # Writes in the queue waiting for the row referenced by this command - 5. # Load misses (which are the oldest load misses from their cores) in the queue waiting for the row referenced by this command - 6. The order of the load relative to other loads from C (if this command is *related to* a load miss by core C) For each candidate command, the associated state has 6 attributes: - 1. # Reads in the queue - 2. # Reads in the queue that are load misses - 3. # Writes in the queue - 4. # Writes in the queue waiting for the row referenced by this command - 5. # Load misses (which are the oldest load misses from their cores) in the queue waiting for the row referenced by this command - 6. The order of the load relative to other loads from C (if this command is *related to* a load miss by core C) All attributes available in the controller's transaction queue $\rightarrow$ fast access DINFK Ege Karaismailoglu 2019-04-11 In every DRAM cycle: In every DRAM cycle: Issue the command you selected in the last cycle ### In every DRAM cycle: Issue the command you selected in the last cycle Observe reward In every DRAM cycle: Issue the command you selected in the last cycle Observe reward With small probability $\epsilon$ : Select a random command #to explore the states In every DRAM cycle: Issue the command you selected in the last cycle Observe reward With small probability $\epsilon$ : Select a random command #to explore the states With probability 1 - $\epsilon$ : Select the command with the highest Q-value among all legal commmands In every DRAM cycle: Issue the command you selected in the last cycle Observe reward With small probability $\epsilon$ : Select a random command #to explore the states With probability 1 - $\epsilon$ : Select the command with the highest Q-value among all legal commands Update the Q-value of the previous command ### RL-Based DRAM Scheduling Algorithm #### In every DRAM cycle: Issue the command you selected in the last cycle Observe reward With small probability $\epsilon$ : Select a random command #to explore the states With probability 1 - $\epsilon$ : Select the command with the highest Q-value among all legal commmands $$Q_{prev} \leftarrow (1 - \alpha)Q_{prev} + \alpha(r + \gamma Q_{selected})$$ Note: Correct operation is ensured by adding a set of extra constraints # How to store all the Q-values efficiently Figure: Fine-grain # How to store all the Q-values efficiently Figure: Fine-grain Figure: Coarse-grain # How to store all the Q-values efficiently Figure: Fine-grain Figure: Coarse-grain Figure: CMAC $cmd \leftarrow \text{select\_command\_with\_max\_Q-value}(C)$ **DINFK** Ege Karaismailoglu ``` cmd \leftarrow \text{select\_command\_with\_max\_Q-value}(C) ``` Assumption: Scheduler's pipe can be clocked 10 times each DRAM cycle $cmd \leftarrow \text{select\_command\_with\_max\_Q-value}(C)$ Assumption: Scheduler's pipe can be clocked 10 times each DRAM cycle (a) 2-wide Q-value Estimation Pipeline $cmd \leftarrow \text{select\_command\_with\_max\_Q-value}(C)$ - Assumption: Scheduler's pipe can be clocked 10 times each DRAM cycle - Scheduler can consider 12 commands every cycle (a) 2-wide Q-value Estimation Pipeline A constant number per action type → prevent generalization across different commands A constant number per action type → prevent generalization across different commands #### (b) CMAC Index Generation Logic #### (c) CMAC Indexing using Attribute Shifting - A constant number per action type → prevent generalization across different commands - random shifts of attributes implement the shiftedness of CMAC arrays #### (b) CMAC Index Generation Logic #### (c) CMAC Indexing using Attribute Shifting 1. logic to compute state attributes $\rightarrow$ counters that are updated every DRAM cycle - 1. logic to compute state attributes $\rightarrow$ counters that are updated every DRAM cycle - 2. logic to update Q-values → single pipelined 16-bit fixed-point multiplier - 1. logic to compute state attributes $\rightarrow$ counters that are updated every DRAM cycle - 2. logic to update Q-values $\rightarrow$ single pipelined 16-bit fixed-point multiplier - 3. storing the Q-values $\rightarrow$ 32 kB on-chip storage **Key Results: Methodology & Evaluation** # Methodology Some important parameters of the simulated CMP: - Frequency: 4 GHz - #Cores: 4 (each 2-way simultaneously multithreaded) - iL1/dL1 size: 32 kB - Shared L2 cache: 4MB, 8-way # Methodology Some important parameters of the simulated CMP: - Frequency: 4 GHz - #Cores: 4 (each 2-way simultaneously multithreaded) - iL1/dL1 size: 32 kB - Shared L2 cache: 4MB, 8-way Some important parameters of the simulated DRAM: - DDR2-800 SDRAM - Transaction Queue: 64 entries - Peak Data Rate: 6.4 GB/s - DRAM Bus Frequency: 400 MHz - Single rank with 4 DRAM chips - #Banks: 4 per DRAM chip - Row Buffer Size: 2 KB # Applications & Benchmarks | Benchmark | Description | Problem size | |-------------|------------------------|--------------------------| | Data Mining | | | | SCALPARC | Decision Tree | 125k pts., 32 attributes | | NAS OpenMP | | | | MG | Multigrid Solver | Class A | | CG | Conjugate Gradient | Class A | | SPEC OpenMP | | | | SWIM-OMP | Shallow water model | MinneSpec-Large | | EQUAKE-OMP | Earthquake model | MinneSpec-Large | | ART-OMP | Self-Organizing Map | MinneSpec-Large | | Splash-2 | | | | OCEAN | Ocean movements | $514 \times 514$ ocean | | FFT | Fast Fourier transform | 1M points | | RADIX | Integer radix sort | 2M integers | Table 3: Simulated applications and their input sizes. • A conventional in-order MC - A conventional in-order MC - MC implementing FR-FCFS - A conventional in-order MC - MC implementing FR-FCFS - RL-based controller proposed by this paper - A conventional in-order MC - MC implementing FR-FCFS - RL-based controller proposed by this paper - An ideal scheduler with an ideal memory that can sustain 100% peak bandwidth #### Results: Data Bus Utilization ### Results: Performance Improvement 101 #### Results: Scaled to More Cores # **Executive Summary** ### **Executive Summary** #### Problem: Scheduling policies - Cannot anticipate the long term effects of their scheduling decisions - Cannot take lessons from the consequences of their past actions Solution: RL-based controller computes learning, far-sighted policy $\rightarrow$ efficient bandwidth utilization #### Results: - 19% speedup, 21% more bandwidth utilization over best static policy - Scales as well as the best static policy # **Strengths & Weaknesses** # Strengths ### Strengths • A fundamentally more powerful approach than its predecessors #### Strengths - A fundamentally more powerful approach than its predecessors - Tries to solve an important problem that will always be relevant ### Strengths - A fundamentally more powerful approach than its predecessors - Tries to solve an important problem that will always be relevant - Significantly improves overall performance and data bus utilization DINFK ## Strengths - A fundamentally more powerful approach than its predecessors - Tries to solve an important problem that will always be relevant - Significantly improves overall performance and data bus utilization - The paper accurately predicts that the DRAM bandwidth is going to be the main problem, 11 years ago! ## Strengths - A fundamentally more powerful approach than its predecessors - Tries to solve an important problem that will always be relevant - Significantly improves overall performance and data bus utilization - The paper accurately predicts that the DRAM bandwidth is going to be the main problem, 11 years ago! - Well-written paper DINFK 112 • It does not provide fairness across multiple competing threads. It does not even provide non-starvation. - It does not provide fairness across multiple competing threads. It does not even provide non-starvation. - More complicated hardware than FR-FCFS - It does not provide fairness across multiple competing threads. It does not even provide non-starvation. - More complicated hardware than FR-FCFS - Extending it is hard since hardware will get even more complicated - It does not provide fairness across multiple competing threads. It does not even provide non-starvation. - More complicated hardware than FR-FCFS - Extending it is hard since hardware will get even more complicated - · Heterogeneous workloads are not tested • Can we solve the fairness problem? 118 - Can we solve the fairness problem? - "ATLAS" [Y. Kim, D. Han, O. Mutlu and M. Harchol-Balter, HPCA 2010] DINFK 119 ## ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers Yoongu Kim Dongsu Han Onur Mutlu Mor Harchol-Balter Carnegie Mellon University DINFK ## ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers Yoongu Kim Dongsu Han Onur Mutlu Mor Harchol-Balter Carnegie Mellon University Periodically order threads based on the service they have attained from the memory controllers so far **DINFK** ## ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers Yoongu Kim Dongsu Han Onur Mutlu Mor Harchol-Balter Carnegie Mellon University - Periodically order threads based on the service they have attained from the memory controllers so far - Prioritize the threads that have attained the least service over others in each period - Can we solve the fairness problem? - "ATLAS" [Y. Kim, D. Han, O. Mutlu and M. Harchol-Balter, HPCA 2010] - TCM scheduling [Y. Kim, M. Papamichael, O. Mutlu, M. Harchol-Balter, MICRO 2010] DINFK # THREAD CLUSTER MEMORY **SCHEDULING** Dynamically group threads into a latency-sensitive cluster and a bandwidth-sensitive cluster DINFK # THREAD CLUSTER MEMORY SCHEDULING Dynamically group threads into a latency-sensitive cluster and a bandwidth-sensitive cluster Prioritize 1st cluster over 2nd cluster # THREAD CLUSTER MEMORY SCHEDULING - Dynamically group threads into a latency-sensitive cluster and a bandwidth-sensitive cluster - Prioritize 1st cluster over 2nd cluster - Employ different policies within each cluster - Can we solve the fairness problem? - "ATLAS" [Y. Kim, D. Han, O. Mutlu and M. Harchol-Balter, HPCA 2010] - TCM scheduling [Y. Kim, M. Papamichael, O. Mutlu, M. Harchol-Balter, MICRO 2010] - "BLISS" [L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, O. Mutlu, IEEE Transactions on Parallel and Distributed Systems 2016] # BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, and Onur Mutlu DINFK ## BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, and Onur Mutlu Mark each application either as vulnerable-to-interference or as interference-causing. ## BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, and Onur Mutlu - Mark each application either as vulnerable-to-interference or as interference-causina. - Prioritize requests from 1st group over requests from 2nd group **DINFK** 131 • A novel approach to utilize the data bus 132 - A novel approach to utilize the data bus - Effective in terms of performance gain - A novel approach to utilize the data bus - Effective in terms of performance gain - Comes with some hardware cost - A novel approach to utilize the data bus - Effective in terms of performance gain - · Comes with some hardware cost - QoS-unaware $\rightarrow$ no fairness - A novel approach to utilize the data bus - Effective in terms of performance gain - · Comes with some hardware cost - QoS-unaware → no fairness - Seemingly hard to improve ## **Open Discussion** How can we solve the fairness problem while keeping our RL-based approach? Are there other flaws in this approach? Can this approach be used to solve other scheduling problems? When are machine-learning based approaches applicable in computer architecture? ## **Backup slides** ## **Ensuring correct operation** - the scheduler is not permitted to select NOPs when other legal commands are available - the scheduler is only allowed to activate rows due to pending requests in the transaction queue (i.e., the scheduler cannot choose to activate an arbitrary row with no pending requests) - the scheduler is not allowed to precharge a newly activated row until it issues a reador write command to it. ## Results: RL versus Family-BEST ## Results: RL versus Family-BEST Preference relations used in Family-BEST: - Row commands over column commands - Older commands over younger commands - Reads over writes - · Load misses over store misses - More critical load misses over less critical ones, based on sequence numbers #### Results: Online versus Offline