



## HIGH-THROUGHPUT SEQUENCE ALIGNMENT USING REAL PROCESSING-IN-MEMORY SYSTEMS

Safaa Diab, Amir Nassereldine, Mohammed Alser, Juan Gómez Luna, Onur Mutlu, <u>Izzat El Hajj</u>

April 13, 2023 BIO-Arch Workshop









Observation: limited performance improvement as the number of CPU threads grows

## Sequence Alignment Scaling on CPUs







# Observation: limited performance improvement as the number of CPU threads grows

As the number of CPU threads grows, IPC decreases meaning threads spend more time idle

## Sequence Alignment Scaling on CPUs









#### **Observation:** limited performance improvement as the number of CPU threads grows

As the number of CPU threads grows, IPC decreases meaning threads spend more time idle

cycle

per

3.0

2.0

1.0

# threads | 1 8 16 32 48 | 1

NW

SWG

## Sequence Alignment Scaling on CPUs







## © All rights reserved. American University of Beirut 2023

## **Memory Bandwidth Bottleneck**

#### **CPU Chip**



Conventional CPU processing







## Processing-in-Memory





Conventional CPU processing

#### **CPU Chip**



Processing-in-memory (PIM)







#### **UPMEM:** The First Real PIM Hardware











#### **UPMEM: The First Real PIM Hardware**









































































## **Supported Algorithms**

| D |    | Α  | Т | Α  |
|---|----|----|---|----|
|   | 0  | 4  | 8 | 12 |
| Α | 4  | 0  | 4 | 8  |
| Т | 8  | 4  | Q | 4  |
| С | 12 | 8  | 4 | 2  |
| Α | 16 | 12 | 8 | 4  |

#### Needleman-Wunsch (NW)



#### GenASM



#### Smith-Waterman-Gotoh (SWG)



#### Wavefront Algorithm (WFA)

Cali, Damla Senol, et al. "GenASM: A high-performance, low-power approximate string matching acceleration framework for genome sequence analysis." MICRO, 2020. Marco-Sola, Santiago, et al. "Fast gap-affine pairwise alignment using the wavefront algorithm." Bioinformatics 37.4 (2021): 456-463.





## Managing the UPMEM Memory Hierarchy



Using WRAM only for intermediate data structures







## Managing the UPMEM Memory Hierarchy



Using WRAM only for intermediate data structures



Using WRAM and MRAM for intermediate data structures









Observation #1: The best performing CPU system is the one with the largest L3 cache (demonstrates memory-boundedness of sequence alignment)









Observation #2: PIM outperforms CPU in the majority of cases (up to 4.06× for SWG, up to 1.83× for WFA, and up to 2.56× for WFA-adaptive)









Observation #3: When data transfer time is not included, PIM outperforms CPU even more (up to 25.93× for WFA, up to 28.14× for WFA-adaptive)



Observation #3: When data transfer time is not included, PIM outperforms CPU even more (up to 25.93× for WFA, up to 28.14× for WFA-adaptive)









Observation #4: PIM does not outperform CPU for algorithms with regular access patterns at small read lengths







## PIM vs. CPU for WFA-adaptive with Large Read Lengths



Observation #1: PIM continues to outperform CPU for very large read lengths







## PIM vs. CPU for WFA-adaptive with Large Read Lengths



Observation #1: PIM continues to outperform CPU for very large read lengths

Observation #2: Scalability currently limited by WRAM capacity







## PIM vs. CPU for WFA-adaptive with Large Read Lengths









#### PIM vs. GPU

| Sequence | Edit     | Throughput (alignments per second) |                       | Throughput    |
|----------|----------|------------------------------------|-----------------------|---------------|
| length   | distance | WFA-GPU                            | UPMEM (with transfer) | improvement   |
| 150      | 2%       | 9.09M                              | 12.97M                | $1.42 \times$ |
|          | 5%       | 5.56M                              | 7.03M                 | $1.27 \times$ |
| 1,000    | 2%       | 1.43M                              | 1.10M                 | $0.77 \times$ |
|          | 5%       | 370K                               | 434K                  | $1.17 \times$ |
| 10,000   | 2%       | 25.0K                              | 66.9K                 | $2.68 \times$ |
|          | 5%       | 5.56K                              | 11.81K                | $2.12 \times$ |

**Observation:** PIM outperforms GPU in the majority of cases

















Observation #1: For algorithms that use large data structures (NW and SWG), WRAM only does not scale









Observation #2: For algorithms that use small data structures (GenASM), WRAM only is better









Observation #3: For algorithms that use medium-sized data structures (WFA, WFA-adaptive), WRAM only is better for short reads while WRAM+MRAM is better for long reads







## Summary

- Sequence alignment on traditional systems is limited by the memory bandwidth bottleneck
- Processing-in-memory (PIM) overcomes this bottleneck by placing cores near the memory
- Our framework, Alignment-in-Memory (AIM), is a PIM framework that supports multiple alignment algorithms (NW, SWG, GenASM, WFA)
  - Implemented on UPMEM, the first real PIM system.
- Results show substantial speedups over CPUs and GPUs
- AIM is available at: <a href="https://github.com/safaad/aim">https://github.com/safaad/aim</a>

