SAFARI Newsletter January 2021: transcript video interview Damla Senol Cali

Transcript of Damla Senol Cali’s video interview. The video can be found here.

Hello. I am Damla Senol Cali. I am a 6th year PhD student in the SAFARI Research Group at Carnegie Mellon. I started here at CMU in 2015 when Onur was physically here. Now, I am the last SAFARI member remaining at CMU. Hopefully, I will also be graduating this year.

You can find my contact information here.

My research interests include 1) computational methods for genome sequence analysis, 2) hardware/software co-design, 3) processing-in-memory, and 4) memory systems.

My thesis research statement is “Genome sequence analysis can be accelerated by co-designing fast and efficient algorithms along with scalable and power-efficient customized hardware”.

Can you tell us about the significance of your recently published and presented paper, GenASM [MICRO’20]?

Recently, we published and presented our GenASM work at MICRO 2020. GenASM is a high-performance, low-power approximate string matching (ASM) acceleration framework for genome sequence analysis. Approximate string matching is one of the main bottlenecks in genome sequence analysis, and it is used at multiple points during the analysis. Thus, it is a crucial problem to solve. We modify an existing algorithm (called Bitap) to expand its algorithmic functionality and then co-design this modified algorithm with an area- and power-efficient hardware accelerator.
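For readers unfamiliar with Bitap, here is a minimal software sketch of the textbook bit-parallel formulation for approximate matching (in the style of Wu and Manber). It is not GenASM's modified algorithm and has no traceback; the function name `bitap_search` and the 1-means-match bit convention are assumptions made for illustration.

```python
def bitap_search(text, pattern, k):
    """Report (end_index, errors) for each text position where `pattern`
    matches a substring ending there with at most k edits.
    Illustrative sketch only; not the GenASM implementation."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    full = (1 << m) - 1          # keep only the m state bits
    accept = 1 << (m - 1)        # bit set when the whole pattern has matched

    # Per-character bitmasks: bit j is set if pattern[j] == character.
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)

    def shift(x):
        # Advance every partial match by one and allow a fresh start.
        return ((x << 1) | 1) & full

    # R[d] holds the pattern prefixes matched so far with at most d edits.
    R = [((1 << d) - 1) & full for d in range(k + 1)]
    hits = []
    for i, ch in enumerate(text):
        pm = masks.get(ch, 0)
        old = R[:]
        R[0] = shift(old[0]) & pm
        for d in range(1, k + 1):
            R[d] = ((shift(old[d]) & pm)   # exact character match
                    | old[d - 1]           # extra character in the text
                    | shift(old[d - 1])    # substituted character
                    | shift(R[d - 1]))     # missing character in the text
        for d in range(k + 1):
            if R[d] & accept:
                hits.append((i, d))        # smallest error count at this end
                break
    return hits
```

For example, `bitap_search("ACCT", "ACGT", 1)` reports a match ending at index 3 with one edit. Because each text character updates all pattern positions with a handful of bitwise operations, the inner work maps naturally onto simple hardware, which is the property the interview alludes to.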

In our MICRO’20 paper, we describe and rigorously evaluate three use cases of GenASM. First, we show that GenASM can effectively accelerate the read alignment step of read mapping. Second, we illustrate that GenASM can be employed as the most efficient (to date) pre-alignment filter for short reads. Third, we demonstrate how GenASM can efficiently find the edit distance (i.e., Levenshtein distance) between two sequences of arbitrary lengths. For all three use cases that we evaluate in detail, we find that GenASM provides significant performance and energy benefits over state-of-the-art SW and HW tools. We also believe the GenASM framework can be utilized in several other steps in genome sequence analysis, as well as for generic text search.
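As a point of reference for the third use case, the edit (Levenshtein) distance can be computed in software with the classic dynamic-programming recurrence. The sketch below is that standard baseline, not GenASM's Bitap-based method; the function name `edit_distance` is an assumption for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b, using the
    standard two-row dynamic-programming table (baseline sketch)."""
    # prev[j] = distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

This baseline takes time proportional to the product of the two sequence lengths, which is why accelerating the computation matters for long reads.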

When we look at the significance of GenASM, we should highlight that, to our knowledge, GenASM is the first acceleration framework to support multiple use cases of approximate string matching (ASM) via HW/SW co-design, which we demonstrate with several applications in genome sequence analysis. In addition, GenASM is the first work to enhance and accelerate Bitap, and we develop the first Bitap-compatible traceback algorithm, enabling Bitap for use in both software- and hardware-based genome sequence analysis.

In your opinion, what will be the long term impact of GenASM?

We believe that the long-term impact of GenASM is three-fold.

Enabling Portable, Fast, and Efficient Genome Sequence Analysis. Recent advances have enabled genome sequencing anywhere in the world with cheap, portable sequencing machines (e.g., ONT’s MinION). In the near future, even smaller sequencing devices can enable sequencing using smartphones. Such readily-available sequencing technologies can open up a number of new applications, such as bringing personalized medicine to rural or remote areas, near-patient testing, and rapid infection diagnosis and outbreak tracing (e.g., COVID-19, Ebola, Zika). However, these applications require memory-efficient, low-power and area-efficient systems to process the generated genome sequence data, as laptops and mobile phones have limited resources (e.g., greater memory constraints, limited battery life).

GenASM’s careful co-design of scalable and memory-efficient algorithms with area- and power-efficient hardware accelerators is an important milestone, allowing genome sequence analysis to be performed in highly-resource-constrained environments. GenASM can even be implemented in the sequencing machine itself, eliminating expensive sequencer-to-computer data movement and providing a single embedded solution for portable sequencing and sequence analysis.

Rapid Genome Sequence Analysis for Pandemics. Rapid genome sequence analysis plays a critical role during pandemics such as the current COVID-19 (i.e., SARS-CoV-2) crisis. Rapid analysis can (1) detect the virus in human DNA samples; (2) track the mutations, sources, and transmission modes of the virus; (3) help with the development of new treatments; and (4) help uncover why some people experience more severe symptoms and higher mortality than others.

Given the fast pace at which viruses can proliferate and mutate during a pandemic, there is a need to perform large volumes of viral genomic analysis rapidly and widely, as lost time or limited availability can hinder tracking and harm our ability to control spread and mutations. Today, rapid genome sequence analysis is bottlenecked by the limited computational power and memory bandwidth of existing systems. We believe it is more important than ever to overcome these bottlenecks through the development of high-efficiency, low-cost solutions. By revisiting the core algorithms used for ASM in genomics, GenASM unlocks large improvements in both of these directions, with significantly greater efficiency over state-of-the-art solutions and a flexible framework with many applications. Beyond the benefits that GenASM already yields, we hope that our co-design approach sparks further research from both academia and industry on developing even more powerful and efficient solutions for rapid genome sequence analysis of viruses.

Reducing Genomic Accelerator Costs with Multi-Purpose Frameworks. Unlike prior works that build a specific accelerator for only one use case, we intentionally design GenASM as an acceleration framework that can be employed both (1) in multiple steps of genome sequence analysis and (2) for non-genomic purposes. While there is a pressing need for genomic sequence analysis hardware, any fixed-function hardware incurs high per-unit costs, as the non-recurring engineering (NRE) costs can be amortized over only the number of platforms that perform the specific function. To significantly lower NRE costs, we design GenASM to provide substantial benefits for generic ASM (a widely-used primitive for any text search or error-aware pattern matching), while still optimizing its design to maximize benefits for genomic use cases. This significantly increases the number of platforms that can use GenASM. We believe that such an approach, with flexible frameworks that can serve as general-purpose accelerators but include domain-specific optimizations, opens a promising pathway for low-cost acceleration of other tasks. For example, other bioinformatics workloads (e.g., a graph processing acceleration framework for genome assembly, a neural network acceleration framework for nanopore basecalling) can take a similar approach as GenASM, making what would otherwise be high-cost hardware much cheaper, and addressing key cost concerns in the healthcare industry.
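The amortization argument can be written as a rough cost model. The symbols are illustrative assumptions, not taken from the GenASM paper: $C_{\text{NRE}}$ is the one-time engineering cost, $N$ is the number of units that use the design, and $C_{\text{mfg}}$ is the per-unit manufacturing cost.

```latex
C_{\text{unit}} \approx \frac{C_{\text{NRE}}}{N} + C_{\text{mfg}}
```

Under this model, a framework that serves more use cases increases $N$, which shrinks the NRE share of each unit's cost.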

What are your current research directions?

I am currently working on two smaller projects, which are extensions of our GenASM paper. One of them is the FPGA implementation of GenASM. The other is exploring algorithmic enhancements for GenASM that would increase the accuracy of the underlying algorithms and provide more functionality.

Besides these, I am also looking for other use cases that can benefit from GenASM or some modified version of GenASM.

In addition, in our Briefings in Bioinformatics 2018 paper, we comprehensively analyze the multiple steps and the associated state-of-the-art tools in genome assembly pipelines using nanopore sequence data in terms of accuracy, performance, memory usage, and scalability, and we reveal the bottlenecks and tradeoffs that different combinations of tools lead to. Nanopore sequencing is a very promising technology, with its ability to generate long reads, provide portability, and enable real-time analysis. However, the high error rates of the technology pose a challenge for sequence analysis. The tools used for nanopore sequence analysis are of critical importance, as they must overcome these high error rates. Our GenASM work covers the read mapping and read-to-read overlap finding steps of the analyzed pipeline. Thus, looking at other steps and important functions of that pipeline (such as basecalling, which requires neural-network-based processing, or assembly, which requires graph-based processing) is another interesting direction for me.

What are your thoughts on working in Computer Architecture and Systems for Genome Analysis?

This is a relatively new direction for both the comp arch community and the genomics community. But I believe that it is very timely considering the current pandemic that the whole world is trying to deal with. Genome sequence analysis plays a pivotal role in enabling many medical and scientific advancements in personalized medicine, virus outbreak tracing, evolutionary theory, and forensics. However, in order to enable high-performance, low-power, and scalable analysis, support from the HW side is crucial due to the limitations of existing computing systems. Thus, I believe that being part of this newly emerging and very important area is very exciting. Many tech companies are also interested in this direction, and they are trying to enable research and develop commercial solutions in this area. I hope that the interest of both communities will increase even more in the near future and will enable more research and solutions in this direction.

What is your experience in SAFARI as the last remaining CMU student? 🙂

I believe I am both lucky and unlucky. Of course, it is sometimes very difficult to be physically away from your advisor, but luckily, I had my co-advisor Saugata and some other senior SAFARI members here at CMU during my initial years, and they helped me a lot. And in the last two years, I have really had a lot of opportunities to collaborate with many SAFARI folks at ETH. The remote life that emerged after COVID made these remote collaborations even easier, of course, since everything is online now. I was already used to that, so it worked pretty well for me.

We have many internal meetings at SAFARI, both project meetings and subgroup meetings where we brainstorm on specific topics, such as bioinformatics or processing-in-memory. I am leading the bioinformatics subgroup meetings, where folks from Bilkent University also join us. So, I believe this remote collaboration is both very effective and beneficial. That is why I count myself as lucky. I have colleagues from all over the world 🙂

I really like the collaborative environment Onur provides to us, both internally and externally with academia and industry partners. I believe that it is very important. PhD life would not be bearable without this environment. This is really vital for your growth because you have the opportunity to get familiar with many topics, directions, and of course people. This also enabled me to do two internships at Intel Labs and get familiar with industry-based research.

What are you planning to do after your PhD?

Since I really like my research direction, I would like to continue doing research in this domain. I will be looking for research positions at companies that I know are interested in this direction or are already working on it. Since I will hopefully be graduating sometime this year, I will start looking for specific positions soon.

Do you have any advice that you want to give to new and prospective students?

Being active in your communication with Onur is very important. We are a huge group so this is very important for your growth.

Although we have many meetings going on, you never know what topic will be interesting to you. Especially for the new students, I believe that these meetings are a great opportunity to learn more about many directions going on in the group, get exposed to new problems, and find new ideas.

In general, being a PhD student is a tough commitment and a long journey. I know this will sound very cliché, but having a social life is very, very important to survive.

And don’t be shy. Again, in order to do well and maximize your growth, you really need to be proactive in your communication with both Onur and other folks in SAFARI.

Final words

Thank you for your time! I wish everyone and their loved ones a great, happy, successful, and most importantly healthy year! Bye!

Further reading and watching:

On Damla’s homepage:
https://damlasenolcali.github.io/publication/genasm/
https://damlasenolcali.github.io/publication/nanopore/

Damla Senol Cali, Gurpreet S. Kalsi, Zulal Bingol, Can Firtina, Lavanya Subramanian, Jeremie S. Kim, Rachata Ausavarungnirun, Mohammed Alser, Juan Gomez-Luna, Amirali Boroumand, Anant Nori, Allison Scibisz, Sreenivas Subramoney, Can Alkan, Saugata Ghose, and Onur Mutlu, GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis
Proceedings of the 53rd International Symposium on Microarchitecture (MICRO), Virtual, October 2020.
Slides (pptx) (pdf)
Short Talk Slides (pptx) (pdf)
Lightning Talk Slides (pptx) (pdf)
ARM Research Summit Talk Slides (pptx) (pdf)
ARM Research Summit Short Talk Slides (pptx) (pdf)
Lecture Slides (pptx) (pdf)
MICRO 2020 Talk Video (18 minutes)
MICRO 2020 Short Talk Video (6 minutes)
MICRO 2020 Lightning Talk Video (1.5 minutes)
ARM Research Summit Talk Video (21 minutes)
ARM Research Summit Short Talk Video (15 minutes)
ARM Research Summit Short Talk Video and Q&A (31 minutes)
Lecture Video (37 minutes)
GenASM Source Code

Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, and Onur Mutlu, Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions, Briefings in Bioinformatics (BIB), 2018.
Paper in Bioinformatics
Paper PDF arXiv
Slides (pptx) (pdf)
Talk Video at AACBB 2019