Main

Cell type annotation is a fundamental step in single-cell RNA sequencing (scRNA-seq) analysis. This process is often laborious and time-consuming, requiring a human expert to compare genes highly expressed in each cell cluster with canonical cell type marker genes. Although automated cell type annotation methods have been developed (Supplementary Table 1), manual annotation using marker genes remains widely used.

Generative pre-trained transformers (GPT), including GPT-3.5 and GPT-4, are large language models designed for language understanding and generation. Recent studies have demonstrated their effectiveness in biomedical contexts1,2. In this Brief Communication, we hypothesize that GPT-4 can accurately annotate cell types, transitioning the annotation process from manual to a semi- or even fully automated procedure (Fig. 1a). GPT-4 offers cost-efficiency and seamless integration into existing single-cell analysis pipelines such as Seurat3, avoiding the need for building additional pipelines and collecting high-quality reference datasets. The vast training data of GPT-4 enables broader applications across various tissues and cell types, and its chatbot nature allows for user-driven annotation refinement (Fig. 1a,b).

Fig. 1: Examples of GPT-4’s cell type annotation and comparisons with other methods.

a, Comparison of cell type annotations by human experts, GPT-4, and other automated methods. b, Example of GPT-4 annotating human prostate cells with increasing granularity. c, Example of GPT-4 annotating single, mixed and new cell types.

We systematically assessed GPT-4’s cell type annotation performance across ten datasets4,5,6,7,8,9,10,11,12, covering five species and hundreds of tissue and cell types, and including both normal and cancer samples (Supplementary Table 2). GPT-4 was queried using GPTCelltype, a software tool we developed (Methods). For competing methods, we evaluated GPT-3.5, a prior version of GPT-4, and CellMarker2.013, SingleR14 and ScType15, which are automatic cell type annotation methods that provide references applicable to a large number of tissues (Methods and Supplementary Table 1). Cell type annotations by GPT-4 or competing methods were evaluated based on their agreement with manual annotations provided by the original studies. The degree of agreement was measured using a numeric score (Methods). Supplementary Table 3 presents an example of evaluating GPT-4 cell type annotations in human prostate tissue, and details of all cell type annotations and their evaluation results are included in Supplementary Table 4.

We first explored different factors that may affect the annotation accuracy of GPT-4 (Fig. 2a and Supplementary Table 5). We found that GPT-4 performs best when using the top ten differential genes, and when differential genes are derived using the two-sided Wilcoxon test. GPT-4 exhibits similar accuracy across various prompt strategies, including a basic prompt strategy, a chain-of-thought16-inspired prompt strategy that includes reasoning steps, and a repeated prompt strategy (Methods). In subsequent analyses, both GPT-4 and GPT-3.5 used the basic prompt strategy with the top ten differential genes obtained from the two-sided Wilcoxon test as inputs for applicable datasets.

Fig. 2: Performance evaluation.

a, Average agreement scores for varying numbers of top differential genes, statistical tests for differential analysis, and prompt strategies. b, Proportion of cell types with varying agreement levels in each study and tissue, most abundant broad cell types, malignant cells, different cell population sizes, and major cell types versus cell subtypes. c, log2-transformed ratio of type I (COL1A1 and COL1A2) and II (COL2A1) collagen gene expression. d,e, Comparison of average agreement scores (d) and running times (e). In e, n = 59 for GPT-4 and GPT-3.5 and n = 36 for ScType and SingleR. Each boxplot shows the distribution (center: median; bounds of box: first and third quartiles; bounds of whiskers: data points within 1.5× interquartile range from the box; minima; maxima) of running time. f, Financial cost of querying GPT-4 API versus cell type numbers. g, GPT-4’s performance in identifying mixed/single cell types and known/unknown cell types, and under different subsampling and noise levels in multiple simulation rounds (dots). h, Reproducibility of GPT-4 annotations. i, Consistency of agreement scores between two versions of GPT-4.

GPT-4’s annotations fully or partially match manual annotations in over 75% of cell types in most studies and tissues (Fig. 2b), demonstrating its competency in generating expert-comparable cell type annotations. This agreement is particularly high for marker genes from literature searches, with a full-match rate of at least 70% in most tissues. Though lower for genes identified by differential analysis, the agreement remains high. However, results from datasets published before September 2021 should be interpreted cautiously as they predate GPT-4’s training cutoff. GPT-4 performs better for immune cells like granulocytes compared to other cell types (Fig. 2b). It identifies malignant cells in colon and lung cancer datasets but struggles with B lymphoma, potentially due to a lack of distinct gene sets. The identification of malignant cells could benefit from other approaches such as copy number variation9. Performance dips slightly in small cell populations comprising no more than ten cells (Fig. 2b), possibly due to the limited available information. GPT-4 annotations fully match manual annotations more frequently in major cell types (for example, T cells) than in subtypes (for example, CD4 memory T cells), while over 75% of subtypes still achieve full or partial matches (Fig. 2b).

The low agreement between GPT-4 and manual annotations in some cell types does not necessarily imply that GPT-4’s annotation is incorrect. For instance, cell types classified as stromal cells include fibroblasts and osteoblasts expressing type I collagen genes, and chondrocytes expressing type II collagen genes. For cells manually annotated as stromal cells, GPT-4 assigns cell type annotations with higher granularity (for example, fibroblasts and osteoblasts), resulting in partial matches and a lower agreement. For cell types that are manually annotated as stromal cells but identified by GPT-4 as fibroblasts or osteoblasts, type I collagen genes show substantially higher expression than type II collagen genes (Fig. 2c). This agrees with the pattern observed in cells manually annotated as chondrocytes, fibroblasts, and osteoblasts (Fig. 2c), suggesting that GPT-4 provides more accurate cell type annotations for stromal cells.

GPT-4 substantially outperforms other methods based on average agreement scores (Methods and Fig. 2d). Using GPTCelltype as the interface, GPT-4 is also notably faster (Fig. 2e), partly because it uses differential genes from standard single-cell analysis pipelines such as Seurat3. Given the integral role of these pipelines, we regard the differential genes as immediately available for GPT-4. In contrast, other methods like SingleR and ScType require additional steps to reprocess the gene expression matrices. Compared to other methods that are free of charge, GPT-4 incurs a $20 monthly fee for using the online web portal. The cost of the GPT-4 API is linearly correlated with the number of queried cell types and does not exceed $0.1 for all queries in this study (Fig. 2f).

We further assessed GPT-4’s robustness in complex real data scenarios (Fig. 1c) with simulated datasets (Methods). GPT-4 can distinguish between pure and mixed cell types with 93% accuracy, and differentiate between known and unknown cell types with 99% accuracy (Fig. 2g). When the input gene set includes fewer genes or is contaminated with noise, GPT-4’s performance decreases but remains high (Fig. 2g). These results demonstrate GPT-4’s robustness in various scenarios.

Finally, we assessed the reproducibility of GPT-4’s annotations using prior simulation studies (Methods). GPT-4 generated identical annotations for the same marker genes in 85% of cases (Fig. 2h), indicating high reproducibility. Annotations of two GPT-4 versions showed identical agreement scores in most cases, with a Cohen’s κ of 0.65, demonstrating substantial consistency (Fig. 2i).

While GPT-4 surpasses existing methods in cell type annotation, there are limitations to consider. Firstly, the undisclosed nature of GPT-4’s training corpus makes verifying the basis of its annotations challenging, thus requiring human evaluation to ensure annotation quality and reliability. Secondly, human involvement in the optional fine-tuning of the model may affect reproducibility due to subjectivity and could limit the scalability of the model in large datasets. Thirdly, high noise levels in scRNA-seq data and unreliable differential genes can adversely affect GPT-4’s annotations. Lastly, over-reliance on GPT-4 risks artificial intelligence hallucination. We recommend validation of GPT-4’s cell type annotations by human experts before proceeding with downstream analyses.

While this study focuses on the standard version of GPT-4, fine-tuning GPT-4 with high-quality reference marker gene lists could further improve cell type annotation performance, utilizing services such as ‘GPTs’ provided by OpenAI.

Methods

Dataset collection

For the HuBMAP Azimuth project, manually annotated cell types and their marker genes were downloaded from the Azimuth website (https://azimuth.hubmapconsortium.org/). Azimuth provides cell type annotations for each tissue at different granularity levels. We selected the level of granularity with the fewest number of cell types, provided that there are more than ten cell types within that level. Details of how marker genes were generated are not reported by Azimuth.

For the GTEx5 dataset, manually annotated cell types, differential gene lists and gene expression matrices were downloaded directly from the publication5. In the original study, gene expression raw counts were library-size-normalized and log-transformed after adding a pseudocount of 1 with SCANPY17. ComBat18 was used to account for the protocol- and sex-specific effects with SCANPY17. Welch’s t-test was then performed to identify differential genes by comparing one cell type against the rest. For each cell type, genes were ranked increasingly by P values, and genes with the same P values were further ranked decreasingly by t-statistics. Top 10, 20 and 30 differential genes were used in this study. Lists of marker genes identified through literature search and the corresponding cell types were downloaded from the same study5, and only cell types with at least five marker genes were used.

For the HCL6 dataset, manually annotated cell types, differential gene lists and the gene expression matrix were downloaded directly from the publication6. In the original study, gene expression raw counts underwent a batch removal process to facilitate cross-tissue comparison and were subsequently normalized by library size and log-transformed after adding a pseudocount of 1. A two-sided Wilcoxon rank-sum test was then performed to identify differential genes by comparing one cell type against the rest using Seurat3. Differential genes were further required to have a log fold change larger than 0.25, a Bonferroni-adjusted P value smaller than 0.1, and expression in at least 15% of cells in either population. For each cell type, genes were ranked increasingly by P values, and genes with the same P values were further ranked decreasingly by two-sided Wilcoxon test statistics. Top 10, 20 and 30 differential genes were used in this study.

For the Mouse Cell Atlas (MCA)7 dataset, manually annotated cell types, differential gene lists and gene expression matrix were downloaded directly from the publication6. In the original study, gene expression raw counts underwent a batch removal process to facilitate cross-tissue comparison, and Seurat3 was used to perform preprocessing and differential analysis. For each cell type, genes were ranked increasingly by P values, and genes with the same P values were further ranked decreasingly by log fold change. Top 10, 20 and 30 differential genes were used in this study.

For the non-model mammal dataset12, manually annotated cell types and lists of marker genes identified through literature search were downloaded directly from the original study.

For the Tabula Sapiens (TS)8, B-cell lymphoma (BCL)9, lung cancer11 and colon cancer10 datasets, manually annotated cell types and raw gene expression count matrices were downloaded directly from the original studies. Raw counts were normalized by library size and log-transformed after adding a pseudocount of 1. The Seurat FindAllMarkers() function with default settings was used to obtain differential genes by comparing one cell type with the rest within each tissue. Briefly, genes with at least 0.25 log fold change between two cell populations and detected in at least 10% of cells in either cell population were retained. A two-sided Wilcoxon rank-sum test was then performed for differential analysis. In addition, a two-sided two-sample t-test was also performed for differential analysis using the FindAllMarkers() function with default settings. For each cell type, genes were ranked increasingly by P values, and genes with the same P values were further ranked decreasingly by log fold changes. Top 10, 20 and 30 differential genes were used in this study.
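The filtering and ranking steps described above can be illustrated with a minimal Python sketch. It is not the Seurat implementation: the record layout (keys gene, p_value, log_fc, pct_1, pct_2) and the function name are hypothetical, and applying the log fold change threshold to upregulated genes only is an assumption.

```python
def top_differential_genes(results, n_top=10, min_log_fc=0.25, min_pct=0.10):
    """Return the top-ranked gene names for one cell type.

    `results` is a list of dicts with the hypothetical keys
    gene, p_value, log_fc, pct_1 and pct_2 (fractions of expressing cells
    in the cell type of interest and in the remaining cells).
    """
    # Keep genes passing the fold-change and detection-rate thresholds.
    # Assumption: only positive log fold changes are considered.
    kept = [
        r for r in results
        if r["log_fc"] >= min_log_fc and max(r["pct_1"], r["pct_2"]) >= min_pct
    ]
    # Rank by increasing P value; break ties by decreasing log fold change.
    kept.sort(key=lambda r: (r["p_value"], -r["log_fc"]))
    return [r["gene"] for r in kept[:n_top]]
```

Calling the function with n_top set to 10, 20 or 30 would give the three gene list sizes used in this study.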

Cell type annotation methods

GPT-4 and GPT-3.5

All GPT-4 (13 June 2023 version) and GPT-3.5 (13 June 2023 version) cell type annotations in this study were performed using GPTCelltype, an R software package we developed as an interface for GPT models. GPTCelltype takes marker genes or top differential genes as input, and automatically generates a prompt message using the following template with the basic prompt strategy:

‘Identify cell types of TissueName cells using the following markers separately for each row. Only provide the cell type name. Do not show numbers before the name. Some can be a mixture of multiple cell types.\n GeneList’.

Here ‘TissueName’ is a variable that will be replaced with the actual name of the tissue (for example, human prostate), and ‘GeneList’ is a list of marker genes or top differential genes. Genes for the same cell population are joined by commas (,), and gene lists for different cell populations are separated by the newline character (\n). GPT-4 or GPT-3.5 was then queried using the generated prompt message through the OpenAI API, and the returned information was parsed and converted to cell type annotations.
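As an illustration of how the template is filled, the following Python sketch assembles a basic prompt message. It is not the GPTCelltype source code (GPTCelltype is an R package): the function name is hypothetical, the exact whitespace around the newline is an assumption, and the API call itself is omitted.

```python
def build_basic_prompt(tissue_name, gene_lists):
    """Fill the basic prompt template for one tissue.

    `gene_lists` holds one list of gene symbols per cell population.
    """
    # Genes of one population are comma-joined; populations are
    # separated by newline characters, one population per row.
    gene_block = "\n".join(",".join(genes) for genes in gene_lists)
    return (
        f"Identify cell types of {tissue_name} cells using the following "
        "markers separately for each row. Only provide the cell type name. "
        "Do not show numbers before the name. "
        "Some can be a mixture of multiple cell types.\n"
        + gene_block
    )


# Illustrative input: two cell populations with two genes each.
prompt = build_basic_prompt("human prostate", [["CD3D", "CD3E"], ["MS4A1", "CD79A"]])
print(prompt)
```

Because each population occupies one row of the prompt, the rows of the returned text can be mapped back to the input populations in order.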

For the chain-of-thought prompt strategy, the following sentence was added to the beginning of the message generated by the basic prompt strategy: ‘Because CD3 gene is a marker gene of T cells, if CD3 gene is included in the marker gene list of an unknown cell type, the cell type is likely to be T cells, a subtype of T cells, or a mixed cell type containing T cells’.

For the repeated prompt strategy, GPT-4 was queried five times with the basic prompt strategy. The annotation result that appears most frequently among the five queries was selected as the final cell type annotation.
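The selection step of the repeated prompt strategy amounts to a majority vote, sketched below in Python. The function name is hypothetical, and the tie-breaking rule (the annotation seen first wins) is an assumption, since ties are not addressed here.

```python
from collections import Counter


def majority_annotation(annotations):
    """Return the most frequent annotation among repeated queries."""
    counts = Counter(annotations)
    # most_common() sorts by count; equal counts keep first-seen order,
    # which serves as the (assumed) tie-breaking rule.
    return counts.most_common(1)[0][0]


# Illustrative answers from five repeated queries for one cell population.
answers = ["T cells", "T cells", "NK cells", "T cells", "CD8 T cells"]
print(majority_annotation(answers))
```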

GPT-4 (23 March 2023 version) cell type annotations were performed by manually copying and pasting prompt messages to the GPT-4 online web interface (https://chat.openai.com/). The prompt message was constructed using the following template:

‘Identify cell types of TissueName cells using the following markers. Identify one cell type for each row. Only provide the cell type name. \n GeneList’.

Computationally identified differential genes in eight scRNA-seq datasets and canonical marker genes identified through literature search in two datasets were used as inputs to GPT-4 and GPT-3.5 (Supplementary Table 2). Cell type annotation for HCL and MCA was performed and evaluated once by aggregating all tissues, similar to the original studies. In other studies, cell type annotation was performed and evaluated within each tissue.

SingleR

The SingleR14 (version 1.4.1) R package was used to perform cell type annotations with default settings. For HCL and MCA datasets, the gene expression matrices after batch effect removal, library size normalization and log transformation across all tissues were used as input. For all other datasets, SingleR was run separately within each tissue, and the input was the log-transformed and library-size-normalized gene expression matrix. The built-in Human Primary Cell Atlas reference19 was used as the reference dataset for all SingleR annotations. SingleR generates single-cell level cell type annotations by returning an assignment score matrix for each single cell and each cell type label in the reference. To convert single-cell level annotations to cell-cluster level annotations, for each manually annotated cell type, we summed the assignment scores across all single cells in that manually annotated cell type and assigned the reference label with the highest summed score as the predicted cell type annotation.
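The conversion from single-cell level to cluster-level annotations can be sketched in Python as follows. This is an illustration, not SingleR code: the data layout and function name are hypothetical, and it assumes that the reference label with the largest summed score is the one selected.

```python
def cluster_level_labels(scores, cell_clusters, reference_labels):
    """Pick one reference label per cluster from per-cell assignment scores.

    scores[i][j] is the score of cell i for reference_labels[j];
    cell_clusters[i] is the manually annotated cell type of cell i.
    """
    summed = {}
    for cell_scores, cluster in zip(scores, cell_clusters):
        # Accumulate the scores of all cells belonging to the same cluster.
        totals = summed.setdefault(cluster, [0.0] * len(reference_labels))
        for j, score in enumerate(cell_scores):
            totals[j] += score
    # Assumption: the label with the largest summed score is assigned.
    return {
        cluster: reference_labels[max(range(len(totals)), key=totals.__getitem__)]
        for cluster, totals in summed.items()
    }
```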

ScType

The ScType15 (version 1.0) R package was used to perform cell type annotations with default settings. To meet the need for computational efficiency when working with large datasets, we developed an in-house version of ScType. We used vectorization to optimize the most time-consuming steps, while still generating the same output as the original ScType software. The input gene expression matrices to ScType were the same as those used for SingleR described above. The built-in cell type marker database was used as the reference for all ScType annotations. Manually annotated cell types were treated as cell clusters and given as inputs to ScType. ScType directly generates cluster-level cell type annotations.

CellMarker2.0

CellMarker2.0 (ref. 13) only provides an online user interface and does not have a software implementation. As inputs to CellMarker2.0, we used exactly the same marker gene sets, or top ten differential gene sets identified by two-sided Wilcoxon tests, that were used for GPT-4 and GPT-3.5 cell type annotations.

Evaluations of cell type annotations

Cell type annotations by GPT-4 or competing methods were compared to manual annotations provided by the original studies. Each manually or automatically identified cell type annotation was assigned an unambiguous cell ontology (CL) name20 and a broad cell type name when applicable. A pair of manually and automatically identified cell type annotations was classified as ‘fully match’ if they have the same annotation term or available CL cell ontology name, ‘partially match’ if they have the same or subordinate (for example, fibroblast and stromal cell) broad cell type name but different annotations and CL cell ontology names, and ‘mismatch’ if they have different broad cell type names, annotations and CL cell ontology names.

To facilitate comparison, we assigned agreement scores of 1, 0.5 and 0 to cases of ‘fully match’, ‘partially match’ and ‘mismatch’ respectively, and calculated average scores within each dataset across cell types and tissues.
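The scoring scheme can be written as a short Python sketch. The function and constant names are hypothetical; the three score values are those stated above.

```python
# Agreement scores assigned to each match level.
AGREEMENT_SCORES = {"fully match": 1.0, "partially match": 0.5, "mismatch": 0.0}


def average_agreement(match_levels):
    """Average the agreement scores over a collection of cell types."""
    return sum(AGREEMENT_SCORES[level] for level in match_levels) / len(match_levels)


# Illustrative evaluation of four cell types in one dataset.
levels = ["fully match", "partially match", "mismatch", "fully match"]
print(average_agreement(levels))  # (1 + 0.5 + 0 + 1) / 4 = 0.625
```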

Simulation studies and reproducibility

To generate simulation datasets, we used three templates: canonical markers of human breast cell types identified through literature search in the GTEx study, the top ten differential genes from the human colon cancer dataset, and the top ten differential genes from the vasculature tissue of the TS dataset. Simulation studies were performed separately for the three tissue types.

To generate simulation datasets of mixed cell types, marker genes for each mixed cell type were created by combining the marker gene lists of two randomly selected cell types. Ten mixed cell types were generated in each simulation iteration. Additionally, we incorporated the original cell type markers of ten randomly chosen cell types as negative controls of single cell types. This entire simulation process was repeated five times. Subsequently, GPT-4 was queried using these simulated marker gene lists, and its performance in differentiating between mixed and single cell types was assessed.
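One simulation iteration for mixed cell types might look like the following Python sketch. The function name, the record layout and the removal of duplicated genes when two lists are combined are assumptions made for illustration.

```python
import random


def simulate_mixed_cell_types(markers, n_mixed=10, n_controls=10, seed=0):
    """Generate mixed cell types plus single-cell-type negative controls.

    `markers` maps each cell type name to its list of marker genes.
    """
    rng = random.Random(seed)
    names = sorted(markers)
    simulated = []
    for _ in range(n_mixed):
        # Combine the marker lists of two randomly selected cell types.
        first, second = rng.sample(names, 2)
        # Assumption: genes shared by both lists are kept only once.
        genes = list(dict.fromkeys(markers[first] + markers[second]))
        simulated.append({"truth": "mixed", "sources": (first, second), "genes": genes})
    # Negative controls: original marker lists of randomly chosen cell types.
    for name in rng.sample(names, min(n_controls, len(names))):
        simulated.append({"truth": "single", "sources": (name,), "genes": list(markers[name])})
    return simulated
```

Repeating the call with five different seeds would correspond to the five repetitions of the simulation process.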

To generate simulation datasets of unknown cell types, we compiled a list of all human genes using the Bioconductor org.Hs.eg.db package21. In each simulation iteration, ten simulated unknown cell types were generated. The marker genes for each unknown cell type were produced by combining ten randomly selected human genes. Additionally, we included ten real cell types and their marker genes as negative controls of known cell types, similar to the previous simulation study. This entire simulation process was repeated five times. Subsequently, GPT-4 was queried using these simulated marker gene lists, and its performance in distinguishing between known and unknown cell types was assessed.
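The generation of unknown cell types can be sketched in Python as below. The gene universe here is a placeholder list; in this study the full list of human genes came from org.Hs.eg.db, an R package that the sketch does not access. The function name is hypothetical.

```python
import random


def simulate_unknown_cell_types(gene_universe, n_unknown=10, genes_per_type=10, seed=0):
    """Draw random gene sets that should not correspond to any real cell type."""
    rng = random.Random(seed)
    # Each unknown cell type receives ten distinct, randomly selected genes.
    return [rng.sample(gene_universe, genes_per_type) for _ in range(n_unknown)]


# Placeholder universe standing in for the list of all human genes.
universe = [f"GENE{i}" for i in range(1000)]
unknown_types = simulate_unknown_cell_types(universe)
print(len(unknown_types), len(unknown_types[0]))
```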

To generate simulation datasets with partial marker gene information, we randomly subsampled 25%, 50% or 75% of the original marker genes. The simulation process was repeated five times. Subsequently, GPT-4 was queried using these subsampled marker gene lists, and the performance was assessed by agreement scores.

To generate simulation datasets with contaminated information, we added randomly selected human genes to the original marker gene list. The numbers of randomly selected genes were 25%, 50% or 75% of the number of original marker genes. The simulation process was repeated five times. Subsequently, GPT-4 was queried using these contaminated marker gene lists, and the performance was assessed by agreement scores.
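The subsampling and contamination procedures can be illustrated with the Python sketch below. The function names are hypothetical, and two details are assumptions: gene counts are rounded to the nearest integer, and noise genes are drawn from genes that are not already in the marker list.

```python
import random


def subsample_markers(genes, fraction, rng):
    """Keep a random subset (for example 25%, 50% or 75%) of the marker genes."""
    # Assumption: round to the nearest integer and keep at least one gene.
    k = max(1, round(len(genes) * fraction))
    return rng.sample(genes, k)


def contaminate_markers(genes, fraction, gene_universe, rng):
    """Append random genes amounting to `fraction` of the original list size."""
    original = set(genes)
    # Assumption: noise genes are drawn from genes outside the marker list.
    candidates = [g for g in gene_universe if g not in original]
    k = round(len(genes) * fraction)
    return list(genes) + rng.sample(candidates, k)
```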

We assessed the reproducibility of GPT-4 responses by leveraging the repeated querying of GPT-4 with identical marker gene lists of the same negative control cell types in simulation studies. For each cell type, reproducibility is defined as the proportion of instances in which GPT-4 generates the most prevalent cell type annotation. For instance, in the case of vascular endothelial cells, GPT-4 produces ‘endothelial cells’ eight times and ‘blood vascular endothelial cells’ once. Consequently, the most prevalent cell type annotation is ‘endothelial cells’, and the reproducibility is calculated as \(\frac{8}{9}=0.89\).
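The reproducibility measure can be reproduced with a few lines of Python; the function name is hypothetical, and the input mirrors the vascular endothelial cell example above.

```python
from collections import Counter


def reproducibility(annotations):
    """Proportion of repeated queries returning the most prevalent annotation."""
    counts = Counter(annotations)
    most_prevalent_count = counts.most_common(1)[0][1]
    return most_prevalent_count / len(annotations)


# Eight 'endothelial cells' and one 'blood vascular endothelial cells'.
answers = ["endothelial cells"] * 8 + ["blood vascular endothelial cells"]
print(round(reproducibility(answers), 2))  # 8 / 9, rounded to 0.89
```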

GPT-4 API financial cost

According to information provided by OpenAI, the application programming interface (API) cost for running GPT-4 13 June 2023 version is $0.03 for every thousand input tokens and $0.06 for every thousand output tokens. For each query, we obtained i and o, which represent the numbers of input tokens and output tokens respectively, through the OpenAI API. The total API financial cost is thus calculated as $(0.00003i + 0.00006o).
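The cost formula translates directly into Python. The function and parameter names are hypothetical; the default rates are the per-thousand-token prices stated above.

```python
def gpt4_api_cost(input_tokens, output_tokens, input_rate=0.03, output_rate=0.06):
    """API cost in US dollars; rates are prices per thousand tokens."""
    return input_tokens / 1000 * input_rate + output_tokens / 1000 * output_rate


# Illustrative query with i = 1,000 input tokens and o = 500 output tokens:
# 0.00003 * 1000 + 0.00006 * 500 = 0.06 dollars.
print(gpt4_api_cost(1000, 500))
```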

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.