## DESIGN OF DIGITAL CIRCUITS (252-0028-00L), SPRING 2019 OPTIONAL HW 7: CACHES AND VIRTUAL MEMORY Instructor: Prof. Onur Mutlu TAs: Mohammed Alser, Can Firtina, Hasan Hassan, Juan Gomez Luna, Lois Orosa, Giray Yaglikci Released: Friday, May 31, 2019 ### 1 Instruction and Data Caches Consider the following loop is executed on a system with a small instruction cache (I-cache) of size 16 B. The data cache (D-cache) is fully associative of size 1 KB. Both caches use 16-byte blocks. The instruction length is 4 B. The initial value of register \$1 is 40. The value of \$0 is 0. ``` Loop: lw $6, X($1) addi $6, $6, 1 sw $6, Y($1) subi $1, $1, 4 beq $1, $0, Exit j Loop Exit: ... ``` - (a) Compute I-cache and D-cache miss rates, considering: - X and Y are different arrays. - X and Y are the same array. | (b) | Compute the average number of cycles per instruction (CPI), using a baseline ideal CPI (ideal caches) equal to 2, and a miss latency equal to 10 clock cycles. | |-----|----------------------------------------------------------------------------------------------------------------------------------------------------------------| | | | | (c) . | A compiler could unroll this loop for optimization. How would this affect CPI? | |-------|--------------------------------------------------------------------------------| | | | | | | | | | | | | | | | | | | | | | | (d) 1 | How would the previous results change with a 32-byte I-cache? | | | | | | | | | | | | | #### Reverse Engineering Caches I $\mathbf{2}$ You're trying to reverse-engineer the characteristics of a cache in a system so that you can design a more efficient, machine-specific implementation of an algorithm you're working on. To do so, you've come up with four patterns that access various bytes in the system in an attempt to determine the following four cache characteristics: - Cache block size (8, 16, 32, 64, or 128 B) - Cache associativity (2-, 4-, or 8-way) - Cache size (4 or 8 KB) - Cache replacement policy (LRU or FIFO) However, the only statistic that you can collect on this system is cache hit rate after performing the access pattern. Here is what you observe: | Access Pattern | | $Addresses\ Accessed\ (Oldest \rightarrow Youngest)$ | | | | | | | | Hit Rate | |-----------------|-----|------------------------------------------------------|------|-------|-------|------|------|------|-----|----------| | A | 0 | 4096 | 8192 | 12288 | 16384 | 4096 | 0 | | | 1/7 | | В | 0 | 1024 | 2048 | 3072 | 4096 | 5120 | 6144 | 3072 | 0 | 1/9 | | $^{\mathrm{C}}$ | 0 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 4/9 | | D | 128 | 1152 | 2176 | 3200 | 128 | 4224 | 1152 | | | 2/7 | Based on what you observe, what are the following characteristics of the cache? (Be sure to justify clearly | you | r answer for full credit.) | |-----|---------------------------------------------| | (a) | Cache block size (8, 16, 32, 64, or 128 B)? | | | | | | | | | | | | | | | | | (b) | Cache associativity (2-, 4-, or 8-way)? | | | | | | | | | | | | | | | | | (c) | Cache size (4 or 8KB)? | | | | | | | | | | | | | ### 3 Reverse Engineering Caches II You are trying to reverse-engineer the characteristics of a cache in a system, so that you can design a more efficient, machine-specific implementation of an algorithm you are working on. To do so, you have come up with three patterns that access various *bytes* in the system in an attempt to determine the following four cache characteristics: - Cache block size (8, 16, 32, 64, or 128 B) - Cache associativity (1-, 2-, 4-, or 8-way) - Cache size (4 or 8 KB) - Cache replacement policy (LRU or FIFO) (a) Cache block size (8, 16, 32, 64, or 128 B)? However, the only statistic that you can collect on this system is *cache hit rate* after performing the access pattern. Here is what you observe: | Sequence | nce Addresses Accessed (Oldest $\rightarrow$ Youngest) | | | | | | | | Hit Rate | |----------------|--------------------------------------------------------|----------------|----------------|---------------------|--------------------|---------------------|----|-------|-------------------| | 1.<br>2.<br>3. | $0 \\ 31 \\ 32768$ | 4<br>8192<br>0 | 8<br>63<br>129 | 16<br>16384<br>1024 | 64<br>4096<br>3072 | 128<br>8192<br>8192 | 64 | 16384 | 1/2<br>5/8<br>1/3 | Assume that the cache is initially empty at the beginning of the first sequence, but not at the beginning of the second and third sequences. The sequences are executed back-to-back, i.e., no other accesses take place between the three sequences. Thus, at the beginning of the second (third) sequence, the contents are the same as at the end of the first (second) sequence. Based on what you observe, what are the following characteristics of the cache? Explain to get points. | ) Cache as | ssociativity (1-, 2-, | 4-, or 8-way)? | | | |------------|-----------------------|----------------|--|--| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cache siz | ze (4 or 8 KB)? | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | # 4 Analyzing Cache Structure Below, we have given you four different sequences of addresses generated by a program running on a processor with a data cache. Cache hit ratio for each sequence is also shown below. Assuming that the cache is initially empty at the beginning of each sequence, find out the following parameters of the processor's data cache: - Associativity (1, 2 or 4 ways) - Block size (1, 2, 4, 8, 16, or 32 bytes) - Total cache size (256 B, or 512 B) - Replacement policy (LRU or FIFO) Assumptions: all memory accesses are one byte accesses. All addresses are byte addresses. | Sequence No. | Address Sequence | Hit Ratio | |--------------|----------------------------------------------|-----------| | 1 | 0, 2, 4, 8, 16, 32 | 0.33 | | 2 | 0, 512, 1024, 1536, 2048, 1536, 1024, 512, 0 | 0.33 | | 3 | 0, 64, 128, 256, 512, 256, 128, 64, 0 | 0.33 | | 4 | 0, 512, 1024, 0, 1536, 0, 2048, 512 | 0.25 | | 8/13 | | |------|--| # 5 Caches | perfect LRU r<br>bits of storage | replacement. The early with the second second is the bloom as a single bit. (H | tag store (includi<br>ock size of the cae | ng the tag and alche? Assume tha | l other meta-data | a) requires a tota | l of 4352 | |----------------------------------|--------------------------------------------------------------------------------|-------------------------------------------|----------------------------------|-------------------|--------------------|-----------| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ### 6 Memory Hierarchy An enterprising computer architect is building a new machine for high-frequency stock trading and needs to choose a CPU. She will need to optimize her setup for *memory access latency* in order to gain a competitive edge in the market. She is considering two different prototype enthusiast CPUs that advertise high memory performance: - (A) Dragonfire-980 Hyper-Z - (B) Peregrine G-Class XTreme She needs to characterize these CPUs to select the best one, and she knows from Prof. Mutlu's course that she is capable of reverse-engineering everything she needs to know. Unfortunately, these CPUs are not yet publicly available, and their exact specifications are unavailable. Luckily, important documents were recently leaked, claiming that all three CPUs have: - Exactly 1 high-performance core - LRU replacement policies (for any set-associative caches) - Inclusive caching (i.e., data in a given cache level is present upward throughout the memory hierarchy. For example, if a cache line is present in L1, the cache line is also present in L2 and L3 if available.) - Constant-latency memory structures (i.e., an access to any part of a given memory structure takes the same amount of time) - Cache line, size, and associativity are all size aligned to powers of two Being an ingenious engineer, she devises the following simple application in order to extract all of the information she needs to know. The application uses a high-resolution timer to measure the amount of time it takes to read data from memory with a specific pattern parameterized by *STRIDE* and *MAX\_ADDRESS*: Assume 1) this code runs for a long time, so all memory structures are fully warmed up, i.e., repeatedly accessed data is already cached, and 2) N is large enough such that the timer captures **only** steady-state information. By sweeping *STRIDE* and *MAX\_ADDRESS*, the computer architect can glean information about the various memory structures in each CPU. She produces Figure 1 for CPU A and Figure 2 for CPU B. **Your task:** Using the data from the graphs, reverse-engineer the following system parameters. If the parameter does not make sense (e.g., L3 cache in a 2-cache system), mark the box with an "X". If the graphs provide insufficient information to ascertain a desired parameter, simply mark it as "N/A". ### (a) Fill in the blanks for Dragonfire-980 Hyper-Z. Figure 1: Execution time of the test code on CPU A for various values of *STRIDE* and *MAX\_ADDRESS*. *STRIDE* values are labeled on curves themselves for clarity. Note that the curves for strides 1, 2, 4, and 8 overlap in the figure. Table 1: Fill in the following table for CPU A (Dragonfire-980 Hyper-Z) | System Parameter | CPU A: Dragonfire-980 Hyper-Z | | | | | | | |----------------------------------|-------------------------------|----|----|------|--|--|--| | | L1 | L2 | L3 | DRAM | | | | | Cache Line Size (B) | | | | | | | | | Cache Associativity | | | | | | | | | Total Cache Size (B) | | | | | | | | | Access Latency (ns) <sup>1</sup> | | | | | | | | <sup>&</sup>lt;sup>1</sup> e.g., DRAM access latency means the latency of fetching the data from DRAM to L3, *not* the latency of bringing the data from the DRAM all the way down to the CPU. Similarly, L3 access latency means the latency of fetching the data from L3 to L2. L1 access latency is the latency to bring the data to the CPU from the L1 cache. ### (b) Fill in the blanks for Peregrine G-Class XTreme. Figure 2: Execution time of the test code on CPU B for various values of *STRIDE* and *MAX\_ADDRESS*. *STRIDE* values are labeled on curves themselves for clarity. Note that the curves for strides 1, 2, 4, 8, 16, and 32 overlap in the figure. Table 2: Fill in the following table for CPU B (Peregrine G-Class XTreme) | System Parameter | CPU B: Peregrine G-Class XTreme | | | | |----------------------------------|---------------------------------|----|----|------| | | L1 | L2 | L3 | DRAM | | Cache Line Size (B) | | | | | | Cache Associativity | | | | | | Total Cache Size (B) | | | | | | Access Latency (ns) <sup>1</sup> | | | | | <sup>&</sup>lt;sup>1</sup> e.g., DRAM access latency means the latency of fetching the data from DRAM to L3, *not* the latency of bringing the data from the DRAM all the way down to the CPU. Similarly, L3 access latency means the latency of fetching the data from L3 to L2. L1 access latency is the latency to bring the data to the CPU from the L1 cache. ### 7 Virtual Memory An ISA supports an 8-bit, byte-addressable virtual address space. The corresponding physical memory has only 128 bytes. Each page contains 16 bytes. A simple, one-level translation scheme is used and the page table resides in physical memory. The initial contents of the frames of physical memory are shown below. | Frame Number | Frame Contents | |--------------|----------------| | 0 | Empty | | 1 | Page 13 | | 2 | Page 5 | | 3 | Page 2 | | 4 | Empty | | 5 | Page 0 | | 6 | Empty | | 7 | Page Table | A three-entry translation lookaside buffer that uses Least Recently-Used (LRU) replacement is added to this system. Initially, this TLB contains the entries for pages 0, 2, and 13. For the following sequence of references, put a circle around those that generate a TLB hit and put a rectangle around those that generate a page fault. What is the hit rate of the TLB for this sequence of references? (Note: LRU policy is used to select pages for replacement in physical memory.) | Ref | Ferences (to pages): 0, 13, 5, 2, 14, 14, 13, 6, 6, 13, 15, 14, 15, 13, 4, 3. | |-----|-------------------------------------------------------------------------------| | | | | | | | | | | (a) | At the end of this sequence, what three entries are contained in the TLB? | | | | | | | | | | | (b) | What are the contents of the 8 physical frames? | | | | | | | | | |