Recall: Hybrid Memory Systems

Hardware/software manage data allocation and movement to achieve the best of multiple technologies

Yoon+, “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012 Best Paper Award.
Challenge and Opportunity

Providing the Best of Multiple Metrics with Multiple Memory Technologies
Hybrid Memory Systems: Issues

- Cache vs. Main Memory
- Granularity of Data Movement/Management: Fine or Coarse
- Hardware vs. Software vs. HW/SW Cooperative
- When to migrate data?
- How to design a scalable and efficient large cache?
- ...
Another Challenge

Designing Effective Large (DRAM) Caches
One Problem with Large DRAM Caches

- A large DRAM cache requires a large metadata (tag + block-based information) store
- How do we design an efficient DRAM cache?
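
To make the metadata problem concrete, the following is a back-of-the-envelope sketch of how the metadata store grows with cache size. The function name `metadata_bytes` and every number used with it (cache size, block size, bytes of metadata per block) are illustrative assumptions, not figures from the lecture.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of metadata needed to track every block of a cache.
 * All parameters are illustrative assumptions. */
uint64_t metadata_bytes(uint64_t cache_bytes, uint64_t block_bytes,
                        uint64_t entry_bytes)
{
    assert(block_bytes > 0);
    uint64_t num_blocks = cache_bytes / block_bytes; /* one entry per block */
    return num_blocks * entry_bytes;
}
```

Under these assumptions, a 1 GB cache with 64 B blocks and 4 B of metadata per block needs 64 MB of metadata, which is too large to keep in on-chip SRAM. With 4 KB blocks the same cache needs only 1 MB, which is one reason the caching granularity matters.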
Idea 1: Tags in Memory

- Store tags in the same row as data in DRAM
  - Store metadata in same row as their data
  - Data and metadata can be accessed together

- Benefit: No on-chip tag storage overhead
- Downsides:
  - Cache hit determined only after a DRAM access
  - Cache hit requires two DRAM accesses
Idea 2: Cache Tags in SRAM

- Recall Idea 1: Store all metadata in DRAM
  - To reduce metadata storage overhead

- Idea 2: Cache in on-chip SRAM frequently-accessed metadata
  - Cache only a small amount to keep SRAM size small
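
Ideas 1 and 2 can be combined in a small model that counts DRAM accesses. This is a minimal sketch, assuming a direct-mapped DRAM cache and a tiny direct-mapped SRAM metadata cache; the names (`dram_cache_t`, `dram_cache_lookup`) and sizes are hypothetical and not taken from the cited paper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SRAM_ENTRIES 4    /* assumed: tiny on-chip metadata cache */
#define DRAM_BLOCKS  64   /* assumed: number of DRAM cache blocks */

typedef struct { bool valid; uint32_t index; uint32_t tag; } sram_entry_t;

typedef struct {
    uint32_t dram_tags[DRAM_BLOCKS];  /* metadata kept in DRAM (Idea 1)   */
    sram_entry_t sram[SRAM_ENTRIES];  /* cached metadata in SRAM (Idea 2) */
    unsigned dram_accesses;           /* metadata reads plus data reads   */
} dram_cache_t;

/* Returns true on a DRAM cache hit for block address `addr`. */
bool dram_cache_lookup(dram_cache_t *c, uint32_t addr)
{
    assert(c != NULL);
    uint32_t index = addr % DRAM_BLOCKS;   /* direct-mapped DRAM cache */
    uint32_t tag   = addr / DRAM_BLOCKS;
    sram_entry_t *e = &c->sram[index % SRAM_ENTRIES];

    if (!(e->valid && e->index == index)) {
        /* SRAM miss: read the tag from DRAM, then keep it on chip. */
        c->dram_accesses++;
        e->valid = true;
        e->index = index;
        e->tag   = c->dram_tags[index];
    }
    if (e->tag != tag)
        return false;        /* DRAM cache miss, no data access */
    c->dram_accesses++;      /* DRAM access for the data itself */
    return true;
}
```

In this model the first hit to a block costs two DRAM accesses (metadata, then data), while later hits whose metadata is still in SRAM cost one.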
Idea 3: Dynamic Data Transfer Granularity

- Some applications benefit from caching more data
  - They have good spatial locality
- Others do not
  - Large granularity wastes bandwidth and reduces cache utilization

Idea 3: **Simple dynamic caching granularity policy**
- Cost-benefit analysis to determine best DRAM cache block size
- Group main memory into sets of rows
- Different sampled row sets follow different fixed caching granularities
- The rest of main memory follows the best granularity
  - Cost-benefit analysis: access latency versus number of caching operations
  - Performed every quantum
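
The per-quantum decision can be sketched as follows. This is a simplified model, assuming three candidate granularities, a benefit measured as cycles of access latency saved, and a cost proportional to the bytes moved by caching operations; the names and the cost model are assumptions for illustration, not the paper's exact formulation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_GRANULARITIES 3
/* Assumed candidate caching granularities in bytes. */
static const uint32_t granularity_bytes[NUM_GRANULARITIES] = {64, 512, 4096};

/* Statistics gathered during one quantum from the sampled row sets
 * that are pinned to one fixed granularity. */
typedef struct {
    uint64_t latency_saved;  /* cycles saved by DRAM cache hits (benefit) */
    uint64_t num_cachings;   /* caching operations performed (cost)       */
} sample_stats_t;

/* Pick the granularity with the highest net benefit; the rest of main
 * memory would follow it during the next quantum. */
uint32_t best_granularity(const sample_stats_t stats[NUM_GRANULARITIES],
                          uint64_t cycles_per_byte_moved)
{
    assert(stats != NULL);
    size_t best = 0;
    int64_t best_net = INT64_MIN;
    for (size_t i = 0; i < NUM_GRANULARITIES; i++) {
        int64_t cost = (int64_t)(stats[i].num_cachings *
                                 granularity_bytes[i] * cycles_per_byte_moved);
        int64_t net = (int64_t)stats[i].latency_saved - cost;
        if (net > best_net) {
            best_net = net;
            best = i;
        }
    }
    return granularity_bytes[best];
}
```

An application with good spatial locality would show a large `latency_saved` in the coarse-granularity samples and end up with a large block size; one with poor locality would not recover the cost of the extra bytes moved.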
TIMBER Performance

Reduced channel contention and improved spatial locality

TIMBER Energy Efficiency

More migrations reduce transmitted data and channel contention

On Large DRAM Cache Design

Justin Meza, Jichuan Chang, HanBin Yoon, Onur Mutlu, and Parthasarathy Ranganathan,
"Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management"
IEEE Computer Architecture Letters (CAL), February 2012.

Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management

Justin Meza* Jichuan Chang† HanBin Yoon* Onur Mutlu* Parthasarathy Ranganathan†
*Carnegie Mellon University †Hewlett-Packard Labs
{meza,hanbinyoon,onur}@cmu.edu {jichuan.chang,partha.ranganathan}@hp.com
Table 1: Summary of Operational Characteristics of Different State-of-the-Art DRAM Cache Designs  

We assume perfect way prediction for Unison Cache. Latency is relative to the access time of the off-package DRAM (see Section 6 for baseline latencies). We use different colors to indicate the high (dark red), medium (white), and low (light green) overhead of a characteristic. 

<table>
<thead>
<tr>
<th>Scheme</th>
<th>DRAM Cache Hit</th>
<th>DRAM Cache Miss</th>
<th>Replacement Traffic</th>
<th>Replacement Decision</th>
<th>Large Page Caching</th>
</tr>
</thead>
<tbody>
<tr>
<td>Unison [32]</td>
<td>In-package traffic: 128 B (data + tag read and update) Latency: ~1x</td>
<td>In-package traffic: 96 B (spec. data + tag read) Latency: ~2x</td>
<td>On every miss Footprint size [31]</td>
<td>Hardware managed, set-associative, LRU</td>
<td>Yes</td>
</tr>
<tr>
<td>Alloy [50]</td>
<td>In-package traffic: 96 B (data + tag read) Latency: ~1x</td>
<td>In-package traffic: 96 B (spec. data + tag read) Latency: ~2x</td>
<td>On some misses Cacheline size (64 B)</td>
<td>Hardware managed, direct-mapped, stochastic [20]</td>
<td>Yes</td>
</tr>
<tr>
<td>TDC [38]</td>
<td>In-package traffic: 64 B Latency: ~1x TLB coherence</td>
<td>In-package traffic: 0 B Latency: ~1x TLB coherence</td>
<td>On every miss Footprint size [28]</td>
<td>Hardware managed, fully-associative, FIFO</td>
<td>No</td>
</tr>
<tr>
<td>HMA [44]</td>
<td>In-package traffic: 64 B Latency: ~1x</td>
<td>In-package traffic: 0 B Latency: ~1x</td>
<td colspan="2">Software managed, high replacement cost</td>
<td>Yes</td>
</tr>
<tr>
<td>Banshee (This work)</td>
<td>In-package traffic: 64 B Latency: ~1x</td>
<td>In-package traffic: 0 B Latency: ~1x</td>
<td>Only for hot pages Page size (4 KB)</td>
<td>Hardware managed, set-associative, frequency based</td>
<td>Yes</td>
</tr>
</tbody>
</table>

Banshee [MICRO 2017]

- Tracks presence in cache using TLB and Page Table
  - No tag store needed for DRAM cache
  - Enabled by a new lightweight lazy TLB coherence protocol

- New bandwidth-aware frequency-based replacement policy
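
A frequency-based replacement decision of this kind can be sketched as below. This is a loose illustration, assuming per-page access counters and a fixed threshold; `cache_set_t`, `choose_victim`, the associativity, and the threshold are hypothetical and do not reproduce Banshee's actual counter sampling or metadata layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAYS 4   /* assumed associativity */

typedef struct {
    uint64_t page[WAYS];   /* page numbers currently cached in this set */
    uint32_t count[WAYS];  /* access-frequency counter per cached page  */
} cache_set_t;

/* Decide whether a candidate page with frequency `cand_count` should
 * replace the coldest cached page. Replacement happens only when the
 * candidate is hotter by more than `threshold`, which limits the
 * bandwidth spent on replacement traffic.
 * Returns the victim way, or -1 to leave the set unchanged. */
int choose_victim(const cache_set_t *set, uint32_t cand_count,
                  uint32_t threshold)
{
    assert(set != NULL);
    size_t coldest = 0;
    for (size_t w = 1; w < WAYS; w++)
        if (set->count[w] < set->count[coldest])
            coldest = w;
    if ((uint64_t)cand_count > (uint64_t)set->count[coldest] + threshold)
        return (int)coldest;
    return -1;
}
```

The threshold is the bandwidth-aware part of the sketch: a page that is only slightly hotter than the coldest resident page is not worth a 4 KB transfer.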
More on Banshee


Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation

Xiangyao Yu¹  Christopher J. Hughes²  Nadathur Satish²  Onur Mutlu³  Srinivas Devadas¹

¹MIT  ²Intel Labs  ³ETH Zürich
Other Opportunities with Emerging Technologies

- **Merging of memory and storage**
  - e.g., a single interface to manage all data

- **New applications**
  - e.g., ultra-fast checkpoint and restore

- **More robust system design**
  - e.g., reducing data loss

- **Processing tightly-coupled with memory**
  - e.g., enabling efficient search and filtering
TWO-LEVEL STORAGE MODEL

[Figure: the CPU reaches memory (DRAM: volatile, fast, byte-addressable) through loads/stores, and reaches storage (HDD: nonvolatile, slow, block-addressable) through file I/O.]
Non-volatile memories combine characteristics of memory and storage.
Two-Level Memory/Storage Model

- The traditional two-level storage model is a bottleneck with NVM
  - Volatile data in memory → a load/store interface
  - Persistent data in storage → a file system interface
  - Problem: Operating system (OS) and file system (FS) code to locate, translate, and buffer data becomes a performance and energy bottleneck with fast NVM stores

[Figure: Two-Level Store. The processor and caches reach main memory with loads/stores through virtual memory and address translation, and reach persistent (e.g., Phase-Change) storage (SSD/HDD) with fopen, fread, fwrite, ... through the operating system and file system.]
**Unified Memory and Storage with NVM**

- **Goal:** Unify memory and storage management in a single unit to eliminate wasted work to locate, transfer, and translate data
  - Improves both energy and performance
  - Simplifies programming model as well

---

**Unified Memory/Storage**

- Processor and caches
- Load/Store
- Feedback
- Persistent Memory Manager
- Persistent (e.g., Phase-Change) Memory

---

PERSISTENT MEMORY

Provides an opportunity to manipulate persistent data directly
The Persistent Memory Manager (PMM)

PMM uses access and hint information to allocate, locate, migrate and access data in the heterogeneous array of devices.

```c
int main(void) {
    // data in file.dat is persistent
    FILE myData = "file.dat";
    myData = new int[64];
}

void updateValue(int n, int value) {
    FILE myData = "file.dat";
    myData[n] = value; // value is persistent
}
```
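
The snippet above is pseudocode for an interface that does not exist in standard C. On a conventional POSIX system, the closest approximation is a memory-mapped file, sketched here. Note that this still goes through the operating system and file system, so it only imitates the load/store programming model and does not remove the two-level overheads. The helper names (`map_values`, `update_value`, `read_value`) and the fixed array size are assumptions for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_VALUES 64

/* Map `path` as an array of NUM_VALUES ints, creating it if needed.
 * Returns NULL on failure. */
static int *map_values(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)(NUM_VALUES * sizeof(int))) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, NUM_VALUES * sizeof(int), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);   /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : (int *)p;
}

/* Store `value` at index n with an ordinary store, then flush it. */
int update_value(const char *path, size_t n, int value)
{
    assert(n < NUM_VALUES);
    int *data = map_values(path);
    if (data == NULL)
        return -1;
    data[n] = value;   /* plain store into the mapped file */
    int rc = msync(data, NUM_VALUES * sizeof(int), MS_SYNC);
    munmap(data, NUM_VALUES * sizeof(int));
    return rc;
}

/* Read index n back through a fresh mapping. */
int read_value(const char *path, size_t n, int *out)
{
    assert(n < NUM_VALUES && out != NULL);
    int *data = map_values(path);
    if (data == NULL)
        return -1;
    *out = data[n];
    munmap(data, NUM_VALUES * sizeof(int));
    return 0;
}
```

Calling `update_value("file.dat", 3, 42)` and later `read_value("file.dat", 3, &v)` returns the stored value even across program runs, which is the behavior the pseudocode expresses with `myData[n] = value`.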

Software

Persistent Memory Manager

Data Layout, Persistence, Metadata, Security, ...

Hardware

Load | Store | Hints from SW/OS/runtime

DRAM | Flash | NVM | HDD

The Persistent Memory Manager (PMM)

- Exposes a load/store interface to access persistent data
  - Applications can directly access persistent memory → no conversion, translation, location overhead for persistent data

- Manages data placement, location, persistence, security
  - To get the best of multiple forms of storage

- Manages metadata storage and retrieval
  - This can lead to overheads that need to be managed

- Exposes hooks and interfaces for system software
  - To enable better data placement and management decisions

A persistent memory exposes a large, persistent address space
- But it may use many different devices to satisfy this goal
- From fast, low-capacity volatile DRAM to slow, high-capacity non-volatile HDD or Flash
- And other NVM devices in between

Performance and energy can benefit from good placement of data among these devices
- Utilizing the strengths of each device and avoiding their weaknesses, if possible
- For example, consider two important application characteristics: locality and persistence
Efficient Data Mapping among Heterogeneous Devices

Columns in a column store that are scanned through only infrequently
→ place on Flash

Frequently-updated index for a Content Delivery Network (CDN)
→ place in DRAM

Applications or system software can provide hints for data placement
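
A hint-driven placement decision can be sketched as a small function over the two characteristics named earlier, locality and persistence. The hint structure, the device list, and the policy itself (in particular, sending hot persistent data to NVM) are assumptions made for illustration; the lecture only gives the two examples above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { DEV_DRAM, DEV_NVM, DEV_FLASH } device_t;

/* Hypothetical hints supplied by an application or system software. */
typedef struct {
    bool frequently_accessed;  /* locality hint    */
    bool must_persist;         /* persistence hint */
} placement_hint_t;

/* Assumed policy: hot data goes to a fast device, cold data to the
 * high-capacity one. */
device_t choose_device(const placement_hint_t *hint)
{
    assert(hint != NULL);
    if (hint->frequently_accessed)
        return hint->must_persist ? DEV_NVM : DEV_DRAM;
    return DEV_FLASH;
}
```

With these assumptions, the frequently-updated CDN index maps to DRAM and the rarely scanned columns map to Flash, matching the two examples.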
Evaluated Systems

- HDD Baseline
  - Traditional system with volatile DRAM memory and persistent HDD storage
  - Overheads of operating system and file system code and buffering

- NVM Baseline (NB)
  - Same as HDD Baseline, but HDD is replaced with NVM
  - Still has OS/FS overheads of the two-level storage model

- Persistent Memory (PM)
  - Uses only NVM (no DRAM) to ensure full-system persistence
  - All data accessed using loads and stores
  - Does not waste time on system calls
  - Data is manipulated directly on the NVM device
Performance Benefits of a Single-Level Store

[Figure: normalized execution time, broken down into User CPU, User Memory, Syscall CPU, and Syscall I/O, for three systems: HDD 2-level, NVM 2-level, and Persistent Memory. Values that survive from the chart: 0.044 and 0.009, with speedup annotations of ~24X and ~5X.]

Energy Benefits of a Single-Level Store

On Persistent Memory Benefits & Challenges

- Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu,

"A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory"

Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), Tel-Aviv, Israel, June 2013.

A Case for Efficient Hardware/Software Cooperative Management of Storage and Memory

Justin Meza* Yixin Luo* Samira Khan*‡ Jishen Zhao† Yuan Xie†§ Onur Mutlu*
*Carnegie Mellon University †Pennsylvania State University ‡Intel Labs §AMD Research

SAFARI
Challenge and Opportunity

Combined Memory & Storage
A Unified Interface to All Data
Intel Optane Persistent Memory (2019)

- Non-volatile main memory
- Based on 3D-XPoint Technology
UPMEM Processing-in-DRAM Engine (2019)

- Processing in DRAM Engine
- Includes **standard DIMM modules**, with a **large number of DPU processors** combined with DRAM chips.

- Replaces **standard** DIMMs
  - DDR4 R-DIMM modules
    - 8GB+128 DPUs (16 PIM chips)
    - Standard 2x-nm DRAM process
  - **Large amounts of** compute & memory bandwidth

---

One Key Challenge in Persistent Memory

- How to ensure consistency of system/data if all memory is persistent?

- Two extremes
  - Programmer transparent: Let the system handle it
  - Programmer only: Let the programmer handle it

- Many alternatives in-between...
CRASH CONSISTENCY PROBLEM

Add a node to a linked list

1. Link to next
2. Link to prev

System crash can result in inconsistent memory state
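
The problem can be reproduced with a doubly-linked list and a simulated crash between the two pointer updates. This is a sketch under one interpretation of the two steps (first the forward link, then the backward link); the `steps` parameter is a hypothetical device for stopping the insertion partway, standing in for a power failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct node {
    struct node *prev, *next;
    int value;
} node_t;

/* Insert `n` after `pos`. `steps` is how many linking steps complete
 * before a simulated crash (2 means the insertion finishes). */
void insert_after(node_t *pos, node_t *n, int steps)
{
    assert(pos != NULL && n != NULL);
    node_t *succ = pos->next;
    n->prev = pos;               /* new node is not reachable yet */
    n->next = succ;
    if (steps < 1)
        return;
    pos->next = n;               /* step 1: link to next */
    if (steps < 2)
        return;                  /* crash here leaves succ->prev stale */
    if (succ != NULL)
        succ->prev = n;          /* step 2: link to prev */
}

/* A list is consistent if every forward link is mirrored by a back link. */
bool list_consistent(const node_t *head)
{
    for (const node_t *p = head; p != NULL && p->next != NULL; p = p->next)
        if (p->next->prev != p)
            return false;
    return true;
}
```

If all memory is persistent, the state left by `insert_after(&a, &n, 1)` survives the crash: walking forward finds the new node, walking backward skips it.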
CURRENT SOLUTIONS

Explicit interfaces to manage consistency

– NV-Heaps [ASPLOS’11], BPFS [SOSP’09], Mnemosyne [ASPLOS’11]

```
AtomicBegin {
    Insert a new node;
}
AtomicEnd;
```

Limits adoption of NVM
Have to rewrite code with clear partition between volatile and non-volatile data

Burden on the programmers
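
To show what such an explicit interface asks of the programmer, here is a minimal undo-log sketch of an atomic section. It is not the API of NV-Heaps, BPFS, or Mnemosyne; the names (`atomic_begin`, `atomic_write`, `atomic_end`, `atomic_recover`) are hypothetical, and the cache flushes and ordering needed to make the log itself durable on real persistent memory are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LOG_CAPACITY 16   /* assumed maximum writes per atomic section */

typedef struct { int *addr; int old_value; } undo_entry_t;

typedef struct {
    undo_entry_t entries[LOG_CAPACITY];
    size_t count;
    bool active;          /* true between atomic_begin and atomic_end */
} undo_log_t;

void atomic_begin(undo_log_t *log)
{
    assert(log != NULL && !log->active);
    log->count = 0;
    log->active = true;
}

/* Record the old value before overwriting it. */
void atomic_write(undo_log_t *log, int *addr, int value)
{
    assert(log != NULL && log->active && log->count < LOG_CAPACITY);
    log->entries[log->count].addr = addr;
    log->entries[log->count].old_value = *addr;
    log->count++;
    *addr = value;
}

/* Commit: the updates stay and the log is discarded. */
void atomic_end(undo_log_t *log)
{
    assert(log != NULL && log->active);
    log->count = 0;
    log->active = false;
}

/* After a crash inside an atomic section, undo its writes newest-first. */
void atomic_recover(undo_log_t *log)
{
    assert(log != NULL);
    if (!log->active)
        return;
    while (log->count > 0) {
        log->count--;
        *log->entries[log->count].addr = log->entries[log->count].old_value;
    }
    log->active = false;
}
```

Every store to persistent data has to be routed through `atomic_write`, which is the kind of code rewriting the slide refers to.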

Example Code

`update a node in a persistent hash table`

```c
void hashtable_update(hashtable_t* ht,
                      void *key, void *data)
{
    list_t* chain = get_chain(ht, key);
    pair_t* pair;
    pair_t updatePair;
    updatePair.first = key;
    pair = (pair_t*) list_find(chain,
                               &updatePair);
    pair->second = data;
}
```
Manual declaration of persistent components

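```c
// Transactional-memory (TM) variant: the signature and the list lookup
// are rewritten with TM macros so that the updates are tracked.
void TMhashtable_update(TM_ARGDECL hashtable_t* ht, void* key, void* data)
{
    list_t* chain = get_chain(ht, key);
    pair_t* pair;
    pair_t updatePair;
    updatePair.first = key;
    pair = (pair_t*) TMLIST_FIND(chain, &updatePair);
    pair->second = data;
}
```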

Need a new implementation

Third party code can be inconsistent
Prohibited Operation


Burden on the programmers
OUR APPROACH: ThyNVM

Goal:
Software transparent consistency in persistent memory systems

Key Idea:
Periodically checkpoint state; recover to previous checkpoint on crash
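
The key idea can be illustrated with a software analogy. ThyNVM itself is a hardware mechanism; this sketch only shows the checkpoint and rollback semantics, with a full copy of a small state array standing in for the checkpoint. The names and the state size are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STATE_WORDS 8   /* assumed size of the tracked state */

typedef struct {
    int working[STATE_WORDS];     /* state the program updates */
    int checkpoint[STATE_WORDS];  /* last consistent snapshot  */
} ckpt_mem_t;

/* Ordinary store: the program is unaware of checkpointing. */
void ckpt_store(ckpt_mem_t *m, size_t i, int value)
{
    assert(m != NULL && i < STATE_WORDS);
    m->working[i] = value;
}

/* Called periodically, at the end of each epoch. */
void ckpt_take(ckpt_mem_t *m)
{
    assert(m != NULL);
    memcpy(m->checkpoint, m->working, sizeof m->working);
}

/* Called after a crash: roll back to the previous checkpoint. */
void ckpt_recover(ckpt_mem_t *m)
{
    assert(m != NULL);
    memcpy(m->working, m->checkpoint, sizeof m->working);
}
```

Updates made after the last `ckpt_take` are lost on recovery, but the state that remains is one the program actually passed through, so it is consistent without any change to the application code.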
ThyNVM: Summary

A new hardware-based checkpointing mechanism

- **Checkpoints** at *multiple granularities* to reduce both checkpointing latency and metadata overhead
- **Overlaps** *checkpointing* and *execution* to reduce checkpointing latency
- **Adapts** to *DRAM and NVM* characteristics

Performs within 4.9% of an *idealized DRAM-only system* with zero-cost consistency
2. OVERLAPPING CHECKPOINTING AND EXECUTION
More About ThyNVM

- Jinglei Ren, Jishen Zhao, Samira Khan, Jongmoo Choi, Yongwei Wu, and Onur Mutlu,

"ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems"

Proceedings of the 48th International Symposium on Microarchitecture (MICRO), Waikiki, Hawaii, USA, December 2015.


ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems

Jinglei Ren*† Jishen Zhao‡ Samira Khan†′ Jongmoo Choi+† Yongwei Wu* Onur Mutlu†

†Carnegie Mellon University *Tsinghua University
‡University of California, Santa Cruz ′University of Virginia +Dankook University

Another Key Challenge in Persistent Memory

Programming Ease to Exploit Persistence
Tools/Libraries to Help Programmers

- Himanshu Chauhan, Irina Calciu, Vijay Chidambaram, Eric Schkufza, Onur Mutlu, and Pratap Subrahmanyam,

"NVMove: Helping Programmers Move to Byte-Based Persistence"

Proceedings of the 4th Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW), Savannah, GA, USA, November 2016.


**NVMove: Helping Programmers Move to Byte-Based Persistence**

| Himanshu Chauhan * | Irina Calciu | Vijay Chidambaram |
| UT Austin | VMware Research Group | UT Austin |
| Eric Schkufza | Onur Mutlu | Pratap Subrahmanyam |
| VMware Research Group | ETH Zürich | VMware |

The Future of Emerging Technologies is Bright

- Regardless of challenges
  - in underlying technology and overlying problems/requirements

Can enable:
- Orders of magnitude improvements
- New applications and computing systems

Yet, we have to
- Think across the stack
- Design enabling systems
If In Doubt, Refer to Flash Memory

- A very “doubtful” emerging technology
  - for at least two decades

---

**Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives**

By Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu

**Abstract** | NAND flash memory is ubiquitous in everyday life today because its capacity has continuously increased and its cost has continuously decreased. This persistent growth in performance and low cost has led to widespread deployment of flash memory in applications, including computers and mobile devices, storage systems, and data centers. However, the error rate of flash memory has increased as a result of the growing demand for high-density and high-performance flash memory, which in turn has increased the demand for error correction and recovery techniques. This paper presents an overview of error characterization, mitigation, and recovery techniques for flash memory.

**Keywords** | Data storage systems; error recovery; fault tolerance; flash memory; reliability; solid-state drives

Many Research & Design Opportunities

- Enabling completely persistent memory
- Hybrid memory systems
- Security and privacy issues in persistent memory
- Reliability and endurance related problems
- ...
