Abstract
The rapid growth of high-throughput single-cell and single-nucleus RNA-sequencing (scRNA-seq and snRNA-seq) technologies has produced a wealth of data over the past few years. The size, volume and distinctive characteristics of these data necessitate the development of new computational methods to accurately and efficiently quantify sc/snRNA-seq data into count matrices that constitute the input to downstream analyses. We introduce the alevin-fry framework for quantifying sc/snRNA-seq data. In addition to being faster and more memory frugal than other accurate quantification approaches, alevin-fry ameliorates the memory scalability and false-positive expression issues that are exhibited by other lightweight tools. We demonstrate how alevin-fry can be effectively used to quantify sc/snRNA-seq data, and also how the spliced and unspliced molecule quantification required as input for RNA velocity analyses can be seamlessly extracted from the same preprocessed data used to generate normal gene expression count matrices.
Data availability
All data analyzed in this paper are publicly available. The mouse pancreas, mouse placenta and zebrafish pineal datasets analyzed during the current study are available on NCBI Gene Expression Omnibus under accession numbers GSM3852755, GSM4609872 and GSM3511193, respectively. The PBMC5k and PBMC10k datasets are available from 10x Genomics at https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_v3 and https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3, respectively. The references used for mapping the sequencing reads from each dataset are described in Supplementary Table 1.
Code availability
Alevin-fry is written in Rust (https://www.rust-lang.org), and is available under the BSD 3-Clause license, as a free and open-source tool at https://github.com/COMBINE-lab/alevin-fry. The specific version used in this work (ref. 40) has been uploaded to Zenodo at https://doi.org/10.5281/zenodo.5799568. The generation of RAD files is implemented as part of the alevin command of the salmon tool, available at https://github.com/COMBINE-lab/salmon. Both tools are also available through bioconda (refs. 41,42). The roe R package has been developed for the construction of splici reference sequences, and it is available at https://github.com/COMBINE-lab/roe as free and open-source software under the BSD 3-Clause license. Useful scripts and functions for simplifying reference preparation and quantification as well as utilities for reading alevin-fry output in Python and R are available at https://github.com/COMBINE-lab/usefulaf. Support for reading alevin-fry output (including USA mode output) has been integrated into the fishpond package available at https://github.com/mikelove/fishpond as well as through Bioconductor (ref. 43). The scripts used to perform the analyses in this paper are available at https://github.com/COMBINE-lab/alevin-fry-paper-scripts.
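As an illustration of how USA mode output (unspliced, spliced and ambiguous counts per gene) might be turned into the matrices used downstream, the following minimal Python sketch combines three cell-by-gene count arrays. The function name combine_usa, the mode names and the summation rules chosen here are assumptions made for this example; they are not taken from the alevin-fry code base, and real output should be read with the utilities listed above.

```python
import numpy as np


def combine_usa(unspliced, spliced, ambiguous, mode="scrna"):
    """Combine U/S/A count arrays (cells x genes) into analysis-ready matrices.

    Hypothetical helper for illustration only. Assumed conventions:
      - "scrna":    expression = spliced + ambiguous
      - "snrna":    expression = unspliced + spliced + ambiguous
      - "velocity": spliced layer = spliced + ambiguous, unspliced layer = unspliced
    """
    u = np.asarray(unspliced)
    s = np.asarray(spliced)
    a = np.asarray(ambiguous)
    if not (u.shape == s.shape == a.shape):
        raise ValueError("U, S and A arrays must have identical shapes")

    if mode == "scrna":
        return {"X": s + a}
    if mode == "snrna":
        return {"X": u + s + a}
    if mode == "velocity":
        return {"X": s + a, "spliced": s + a, "unspliced": u}
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # Toy data: 2 cells x 2 genes.
    U = np.array([[1, 0], [2, 3]])
    S = np.array([[4, 1], [0, 2]])
    A = np.array([[1, 1], [0, 0]])
    for mode in ("scrna", "snrna", "velocity"):
        print(mode, {k: v.tolist() for k, v in combine_usa(U, S, A, mode).items()})
```

The point of the sketch is that all three outputs derive from the same three arrays, so a single quantification run can serve both a standard expression analysis and an RNA velocity analysis. How ambiguous counts should be assigned is a modeling choice; the rules above are one plausible convention rather than a recommendation from the paper.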
References
Svensson, V., da Veiga Beltrame, E. & Pachter, L. A curated database reveals trends in single-cell transcriptomics. Database 2020, baaa073 (2020).
Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A. & Dewey, C. N. RNA-seq gene expression estimation with read mapping uncertainty. Bioinformatics 26, 493–500 (2010).
Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
Srivastava, A., Malik, L., Smith, T., Sudbery, I. & Patro, R. Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biol. 20, 65 (2019).
Niebler, S., Müller, A., Hankeln, T. & Schmidt, B. RainDrop: rapid activation matrix computation for droplet-based single-cell RNA-seq reads. BMC Bioinformatics 21, 274 (2020).
Melsted, P. et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nat. Biotechnol. 39, 813–818 (2021).
Melsted, P., Ntranos, V. & Pachter, L. The barcode, UMI, set format and BUStools. Bioinformatics 35, 4472–4473 (2019).
Kaminow, B., Yunusov, D. & Dobin, A. STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus RNA-seq data. Preprint at bioRxiv https://doi.org/10.1101/2021.05.05.442755 (2021).
Shainer, I. et al. Agouti-related protein 2 is a new player in the teleost stress response system. Curr. Biol. 29, 2009–2019.e7 (2019).
Shainer, I. & Stemmer, M. Choice of preprocessing pipeline influences clustering quality of scRNA-seq datasets. BMC Genomics 22, 661 (2021).
Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021).
Cau, E., Ronsin, B., Bessière, L. & Blader, P. A notch-mediated, temporal asymmetry in BMP pathway activation promotes photoreceptor subtype diversification. PLoS Biol. 17, e2006250 (2019).
Lun, A. T. L. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 20, 63 (2019).
Crespo, C., Soroldoni, D. & Knust, E. A novel transgenic zebrafish line for red opsin expression in outer segments of photoreceptor cells. Dev. Dyn. 247, 951–959 (2018).
Wada, S. et al. Color opponency with a single kind of bistable opsin in the zebrafish pineal organ. Proc. Natl Acad. Sci. USA 115, 11310–11315 (2018).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Brüning, R. S., Tombor, L., Schulz, M. H., Dimmeler, S. & John, D. Comparative analysis of common alignment tools for single-cell RNA sequencing. GigaScience 11, giac001 (2022).
La Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498 (2018).
Bergen, V., Lange, M., Peidli, S., Wolf, F. A. & Theis, F. J. Generalizing RNA velocity to transient cell states through dynamical modeling. Nat. Biotechnol. 38, 1408–1414 (2020).
Soneson, C., Srivastava, A., Patro, R. & Stadler, M. B. Preprocessing choices affect RNA velocity results for droplet scRNA-seq data. PLoS Comput. Biol. 17, e1008585 (2021).
Marsh, B. & Blelloch, R. Single nuclei RNA-seq of mouse placental labyrinth development. eLife https://doi.org/10.7554/elife.60266 (2020).
Woods, L., Perez-Garcia, V. & Hemberger, M. Regulation of placental development and its impact on fetal growth—new insights from mouse models. Front. Endocrinol. https://doi.org/10.3389/fendo.2018.00570 (2018).
10k Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor (v3 Chemistry) (10x Genomics, 2018); https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3
Srivastava, A. et al. Alignment and mapping methodology influence transcript abundance estimation. Genome Biol. 21, 239 (2020).
You, Y. et al. Benchmarking UMI-based single-cell RNA-seq preprocessing workflows. Genome Biol. 22, 339 (2021).
Sarkar, H., Srivastava, A. & Patro, R. Minnow: a principled framework for rapid simulation of dscRNA-seq data at the read level. Bioinformatics 35, i136–i144 (2019).
Almodaresi, F., Sarkar, H., Srivastava, A. & Patro, R. A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics 34, i169–i177 (2018).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
Srivastava, A., Sarkar, H., Gupta, N. & Patro, R. RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes. Bioinformatics 32, i192–i200 (2016).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Smith, T., Heger, A. & Sudbery, I. UMI-tools: modeling sequencing errors in unique molecular identifiers to improve quantification accuracy. Genome Res. 27, 491–499 (2017).
Zhu, A., Srivastava, A., Ibrahim, J. G., Patro, R. & Love, M. I. Non-parametric expression analysis using inferential replicate counts. Nucleic Acids Res. 47, e105–e105 (2019).
5k Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor (v3 Chemistry) (10x Genomics, 2019); https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_v3
Bastidas-Ponce, A. et al. Massive single-cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development https://doi.org/10.1242/dev.173849 (2019).
Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).
He, D. et al. Alevin-fry v0.4.0 for manuscript "Alevin-fry unlocks rapid, accurate, and memory-frugal quantification of single-cell RNA-seq data". Zenodo https://doi.org/10.5281/zenodo.5806834 (2021).
Grüning, B. et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat. Methods 15, 475–476 (2018).
He, D. et al. Additional data for manuscript "Alevin-fry unlocks rapid, accurate, and memory-frugal quantification of single-cell RNA-seq data" [Data set]. Zenodo https://doi.org/10.5281/zenodo.5799568 (2021).
Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).
Acknowledgements
This work is supported by the National Institutes of Health under grant award numbers R01HG009937 to R.P. and K99CA267677 to A.S., and by the National Science Foundation under grant award numbers CCF-1750472 and CNS-1763680 to R.P. This project has also been made possible in part by grant number CZIF2020-004893 from the Chan Zuckerberg Initiative Foundation to R.P. The funders had no role in the design of the method, data analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
All authors conceptualized the method. D.H., A.S., R.P., M.Z. and H.S. implemented the software. M.Z. and R.P. benchmarked the tools. D.H., R.P. and C.S. analyzed the results. All authors wrote and approved the manuscript.
Ethics declarations
Competing interests
R.P. is a cofounder of Ocean Genomics, Inc. The other authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Davis McCarthy, David van Dijk and Matthew Ritchie for their contribution to the peer review of this work. Lin Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary analyses and figures.
Supplementary Table 1
Details of the datasets used in this work.
About this article
Cite this article
He, D., Zakeri, M., Sarkar, H. et al. Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nat Methods 19, 316–322 (2022). https://doi.org/10.1038/s41592-022-01408-3
DOI: https://doi.org/10.1038/s41592-022-01408-3