Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data

He, Dongze; Zakeri, Mohsen; Sarkar, Hirak; Soneson, Charlotte; Srivastava, Avi; Patro, Rob

doi:10.1038/s41592-022-01408-3

Article
Published: 11 March 2022

Nature Methods volume 19, pages 316–322 (2022)

Abstract

The rapid growth of high-throughput single-cell and single-nucleus RNA-sequencing (scRNA-seq and snRNA-seq) technologies has produced a wealth of data over the past few years. The size, volume and distinctive characteristics of these data necessitate the development of new computational methods to accurately and efficiently quantify sc/snRNA-seq data into count matrices that constitute the input to downstream analyses. We introduce the alevin-fry framework for quantifying sc/snRNA-seq data. In addition to being faster and more memory frugal than other accurate quantification approaches, alevin-fry ameliorates the memory scalability and false-positive expression issues that are exhibited by other lightweight tools. We demonstrate how alevin-fry can be effectively used to quantify sc/snRNA-seq data, and also how the spliced and unspliced molecule quantification required as input for RNA velocity analyses can be seamlessly extracted from the same preprocessed data used to generate normal gene expression count matrices.

**Fig. 1: Overview of the alevin-fry pipeline (operating in USA quantification mode).**

**Fig. 2: Comprehensive analysis of the performance of alevin-fry on real and simulated datasets.**

Data availability

All data analyzed in this paper are publicly available. The mouse pancreas dataset, the mouse placenta dataset and the zebrafish pineal dataset analyzed during the current study are available on NCBI Gene Expression Omnibus, under accession numbers GSM3852755, GSM4609872 and GSM3511193, respectively. The PBMC5k and PBMC10k datasets are available from 10x Genomics at https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_v3 and https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3, respectively. The description of the references that were used for mapping the sequencing reads from each dataset can be found in Supplementary Table 1.

Code availability

Alevin-fry is written in Rust (https://www.rust-lang.org), and is available under the BSD 3-Clause license, as a free and open-source tool at https://github.com/COMBINE-lab/alevin-fry. The specific version used in this work⁴⁰ has been uploaded to Zenodo at https://doi.org/10.5281/zenodo.5799568. The generation of RAD files is implemented as part of the alevin command of the salmon tool, available at https://github.com/COMBINE-lab/salmon. Both tools are also available through bioconda⁴¹,⁴². The roe R package has been developed for the construction of splici reference sequences, and it is available at https://github.com/COMBINE-lab/roe as free and open-source software under the BSD 3-Clause license. Useful scripts and functions for simplifying reference preparation and quantification, as well as utilities for reading alevin-fry output in Python and R, are available at https://github.com/COMBINE-lab/usefulaf. Support for reading alevin-fry output (including USA mode output) has been integrated into the fishpond package, available at https://github.com/mikelove/fishpond as well as through Bioconductor⁴³. The scripts used to perform the analyses in this paper are available at https://github.com/COMBINE-lab/alevin-fry-paper-scripts.

References

1. Svensson, V., da Veiga Beltrame, E. & Pachter, L. A curated database reveals trends in single-cell transcriptomics. Database 2020, baaa073 (2020).
2. Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A. & Dewey, C. N. RNA-seq gene expression estimation with read mapping uncertainty. Bioinformatics 26, 493–500 (2010).
3. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
4. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
5. Srivastava, A., Malik, L., Smith, T., Sudbery, I. & Patro, R. Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biol. 20, 65 (2019).
6. Niebler, S., Müller, A., Hankeln, T. & Schmidt, B. RainDrop: rapid activation matrix computation for droplet-based single-cell RNA-seq reads. BMC Bioinformatics 21, 274 (2020).
7. Melsted, P. et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nat. Biotechnol. 39, 813–818 (2021).
8. Melsted, P., Ntranos, V. & Pachter, L. The barcode, UMI, set format and BUStools. Bioinformatics 35, 4472–4473 (2019).
9. Kaminow, B., Yunusov, D. & Dobin, A. STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus RNA-seq data. Preprint at bioRxiv https://doi.org/10.1101/2021.05.05.442755 (2021).
10. Shainer, I. et al. Agouti-related protein 2 is a new player in the teleost stress response system. Curr. Biol. 29, 2009–2019.e7 (2019).
11. Shainer, I. & Stemmer, M. Choice of preprocessing pipeline influences clustering quality of scRNA-seq datasets. BMC Genomics 22, 661 (2021).
12. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021).
13. Cau, E., Ronsin, B., Bessière, L. & Blader, P. A notch-mediated, temporal asymmetry in BMP pathway activation promotes photoreceptor subtype diversification. PLoS Biol. 17, e2006250 (2019).
14. Lun, A. T. L. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 20, 63 (2019).
15. Crespo, C., Soroldoni, D. & Knust, E. A novel transgenic zebrafish line for red opsin expression in outer segments of photoreceptor cells. Dev. Dyn. 247, 951–959 (2018).
16. Wada, S. et al. Color opponency with a single kind of bistable opsin in the zebrafish pineal organ. Proc. Natl Acad. Sci. USA 115, 11310–11315 (2018).
17. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
18. Brüning, R. S., Tombor, L., Schulz, M. H., Dimmeler, S. & John, D. Comparative analysis of common alignment tools for single-cell RNA sequencing. GigaScience 11, giac001 (2022).
19. La Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498 (2018).
20. Bergen, V., Lange, M., Peidli, S., Wolf, F. A. & Theis, F. J. Generalizing RNA velocity to transient cell states through dynamical modeling. Nat. Biotechnol. 38, 1408–1414 (2020).
21. Soneson, C., Srivastava, A., Patro, R. & Stadler, M. B. Preprocessing choices affect RNA velocity results for droplet scRNA-seq data. PLoS Comput. Biol. 17, e1008585 (2021).
22. Marsh, B. & Blelloch, R. Single nuclei RNA-seq of mouse placental labyrinth development. eLife https://doi.org/10.7554/elife.60266 (2020).
23. Woods, L., Perez-Garcia, V. & Hemberger, M. Regulation of placental development and its impact on fetal growth—new insights from mouse models. Front. Endocrinol. https://doi.org/10.3389/fendo.2018.00570 (2018).
24. 10k Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor (v3 Chemistry) (10x Genomics, 2018); https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3
25. Srivastava, A. et al. Alignment and mapping methodology influence transcript abundance estimation. Genome Biol. 21, 239 (2020).
26. You, Y. et al. Benchmarking UMI-based single-cell RNA-seq preprocessing workflows. Genome Biol. 22, 339 (2021).
27. Sarkar, H., Srivastava, A. & Patro, R. Minnow: a principled framework for rapid simulation of dscRNA-seq data at the read level. Bioinformatics 35, i136–i144 (2019).
28. Almodaresi, F., Sarkar, H., Srivastava, A. & Patro, R. A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics 34, i169–i177 (2018).
29. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
30. Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
31. Srivastava, A., Sarkar, H., Gupta, N. & Patro, R. RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes. Bioinformatics 32, i192–i200 (2016).
32. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
33. Smith, T., Heger, A. & Sudbery, I. UMI-tools: modeling sequencing errors in unique molecular identifiers to improve quantification accuracy. Genome Res. 27, 491–499 (2017).
34. Zhu, A., Srivastava, A., Ibrahim, J. G., Patro, R. & Love, M. I. Non-parametric expression analysis using inferential replicate counts. Nucleic Acids Res. 47, e105–e105 (2019).
35. 5k Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor (v3 Chemistry) (10x Genomics, 2019); https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_v3
36. Bastidas-Ponce, A. et al. Massive single-cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development https://doi.org/10.1242/dev.173849 (2019).
37. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).
38. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
39. Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).
40. He, D. et al. Alevin-fry v0.4.0 for manuscript "Alevin-fry unlocks rapid, accurate, and memory-frugal quantification of single-cell RNA-seq data". Zenodo https://doi.org/10.5281/zenodo.5806834 (2021).
41. Grüning, B. et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat. Methods 15, 475–476 (2018).
42. He, D. et al. Additional data for manuscript "Alevin-fry unlocks rapid, accurate, and memory-frugal quantification of single-cell RNA-seq data" [Data set]. Zenodo https://doi.org/10.5281/zenodo.5799568 (2021).
43. Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5(10), 1–16 (2004).

Acknowledgements

This work is supported by the National Institutes of Health under grant award numbers R01HG009937 to R.P. and K99CA267677 to A.S., and the National Science Foundation under grant award numbers CCF-1750472 to R.P. and CNS-1763680 to R.P. Also, this project has been made possible in part by grant number CZIF2020-004893 from the Chan Zuckerberg Initiative Foundation to R.P. The funders had no role in the design of the method, data analysis, decision to publish or preparation of the manuscript.

Author information

Authors and Affiliations

Department of Cell Biology and Molecular Genetics and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
Dongze He
Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
Mohsen Zakeri & Rob Patro
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Hirak Sarkar
Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
Charlotte Soneson
SIB Swiss Institute of Bioinformatics, Basel, Switzerland
Charlotte Soneson
New York Genome Center, New York City, NY, USA
Avi Srivastava

Contributions

All authors conceptualized the method. D.H., A.S., R.P., M.Z. and H.S. implemented the software. M.Z. and R.P. benchmarked the tools. D.H., R.P. and C.S. analyzed the results. All authors wrote and approved the manuscript.

Corresponding author

Correspondence to Rob Patro.

Ethics declarations

Competing interests

R.P. is a cofounder of Ocean Genomics, Inc. The other authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Davis McCarthy, David van Dijk and Matthew Ritchie for their contribution to the peer review of this work. Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary analyses and figures.

Reporting Summary

Peer Review Information

Supplementary Table 1

Details of the datasets used in this work.

About this article

Cite this article

He, D., Zakeri, M., Sarkar, H. et al. Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nat Methods 19, 316–322 (2022). https://doi.org/10.1038/s41592-022-01408-3

Received: 09 July 2021
Accepted: 27 January 2022
Published: 11 March 2022
Issue Date: March 2022
DOI: https://doi.org/10.1038/s41592-022-01408-3
