The impacts of active and self-supervised learning on efficient annotation of single-cell expression data

Random forest models and prior marker knowledge are best suited for active learning

We first sought to replicate existing active learning findings17 under real-world conditions. To accomplish this, we collated six single-cell datasets comprising different modalities and existing ground truth labels: two scRNASeq datasets from breast35 and lung36 cancer cell lines where each cell line forms a “cell type” to predict, a pancreatic cancer single nucleus RNA sequencing (snRNASeq) dataset in which cell types were assigned using a combination of clustering and copy number profiles37, a CyTOF dataset of mouse bone marrow cells38 in which cell types were assigned using a gating strategy, a scRNASeq dataset from healthy donors of the liver atlas dataset39, and the vasculature Tabula Sapiens scRNASeq dataset40. These datasets cover cell type labels previously designated as gold (cell lines) and silver (gating) standard ground truth41. Each dataset was subsampled to several thousand cells (see methods) for computational efficiency, given that over 200,000 unique experiments were run for this analysis (Supplementary Table 1). The resulting datasets were composed of cell types with varying similarity (Supplementary Fig. 1) and cell type imbalance (Fig. 1A). To benchmark the active learning approach, we split each dataset into ten train/test splits. To mimic the number of cells an end user would manually annotate, we selected a total of 100, 250 and 500 cells from the training set. These subsets were then used to train six cell type assignment methods using ground truth cell type labels. We then evaluated the trained classifiers on the held-out test set using five different accuracy metrics (methods) (Fig. 1B). Note, however, that the main goal of this work is not to evaluate the accuracies of the classifiers, but rather to benchmark the improvement in classification performance gained by creating an informative training dataset.
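The evaluation loop can be summarized in a few lines. The sketch below is illustrative rather than the authors' implementation: `select_cells` stands in for any of the selection strategies (random, active learning, adaptive reweighting) and `train_and_score` for any of the six cell type classifiers and five accuracy metrics; both names, and the use of scikit-learn for stratified 50/50 splitting, are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def benchmark(X, y, select_cells, train_and_score,
              subset_sizes=(100, 250, 500), n_splits=10, seed=0):
    """Evaluation loop sketched in Fig. 1B (illustrative, not the authors' code)."""
    splitter = StratifiedShuffleSplit(n_splits=n_splits, train_size=0.5, random_state=seed)
    results = []
    for split_id, (train_idx, test_idx) in enumerate(splitter.split(X, y)):
        for n_cells in subset_sizes:
            # Choose which training cells get "manually" annotated
            # (random, active learning or adaptive reweighting).
            chosen = select_cells(X[train_idx], n_cells)
            labelled = train_idx[chosen]
            # Fit a cell type classifier on the annotated subset and score it
            # on the held-out half of the dataset.
            score = train_and_score(X[labelled], y[labelled], X[test_idx], y[test_idx])
            results.append({"split": split_id, "n_cells": n_cells, "score": score})
    return results
```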

Fig. 1: Benchmarking overview.

A TSNE embedding of datasets used in this benchmarking colored by cell type (top) along with bar charts of cell type composition (bottom). B Schematic of the evaluation procedure: each dataset is split into 10 different train-test splits using a 50/50 split. Datasets of size 100, 250 and 500 cells are then sampled from the training dataset using active learning, adaptive reweighting and random sampling (baseline). SingleR, scmap, CyTOF-LDA, a random forest model, singleCellNet, and a support vector machine are then trained using ground truth labels and evaluated by quantifying cell type prediction accuracy on the held-out test set. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.

Fig. 2: Active learning is most effective with random forest classifiers and when selecting the initial set of cells by marker expression.

A Performance comparison of classifiers trained using a random forest and logistic regression to provide predictive uncertainty estimates for active learning. The relative F1 improvement score is calculated as the difference in F1-score between the random forest and logistic regression models, standardized by the logistic regression F1-score. This score is averaged across train-test splits and cell type prediction methods. The initial set of cells was selected randomly. B Proportion of all cell types represented in the ground truth dataset selected by the random and ranking selection procedures for each of the 10 train-test splits. Boxplots depict the median as the center line, the boxes define the interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. C Same as (A) with the improvement score as the difference between the performance when the initial cells are selected based on cell type marker information and random selection. This score is averaged across train-test splits, active learning algorithms, and cell type prediction methods. The cell number specifies the total size of the training set. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.

We first set out to replicate existing findings that random forest models outperform logistic regression active learning models17. Rather than ensuring one cell of each type was present in the training dataset, we randomly selected 20 cells as the initial training set for the active learning model without regard for cell type composition. After training the model on this initial set, we used its predicted cell type probabilities to select the next set of 10 cells with maximum uncertainty at each iteration, labeled them, and added them to the training set. We quantified uncertainty using two metrics: (i) the highest entropy of the predicted cell type probabilities and (ii) the lowest maximum probability predicted for a cell across cell types. Both are well established active learning techniques to quantify which samples (cells) a classifier is least certain about42, and which would therefore benefit most from receiving a label. While doublets were removed by the dataset authors, the approaches used for this task are generally imperfect and doublets or mislabeled cells may still exist in the ground truth dataset43. Such cells would likely have the highest entropy and lowest maximum probability, potentially corrupting the efficacy of our active learning approach. To protect against these cells being preferentially selected, we selected cells at three different certainty thresholds for each metric: for entropy, cells with the highest entropy and cells at the 95th and 75th percentiles of the entropy distribution; for maximum probability, cells with the lowest maximum probability and cells at the 5th and 25th percentiles of the probability distribution. This should be an effective way to ensure singlets are selected, as the multiplet rate is generally below these values44. Finally, to ensure our active learning methodology is valid, we calculated the performance of the active learning classifier at each iteration. This showed a steady increase in accuracy (Supplementary Figs. 2 and 3), indicating that our implementation works as intended. Using this active learning setup, we then created training datasets of sizes 100, 250 and 500 cells, and labeled these cells with ground truth labels. Using our benchmarking pipeline (Fig. 1B) we replicated existing results17 and found that our random forest model also outperformed the logistic regression model (Fig. 2A, Supplementary Figs. 4 and 5).
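For concreteness, the sketch below shows how the two uncertainty criteria and the percentile thresholds described above could be computed with any probabilistic scikit-learn classifier (random forest or logistic regression). Function and argument names are illustrative assumptions, not the authors' exact code; the `percentile` cap expresses the doublet-avoidance thresholds, with the lowest maximum probability rewritten as the highest value of one minus the maximum probability so both criteria share the same ranking logic.

```python
import numpy as np
from scipy.stats import entropy

def rank_by_uncertainty(clf, X_unlabeled, n_cells=10, criterion="entropy", percentile=100):
    """Return indices of the n_cells most uncertain cells, capped at a percentile.

    criterion="entropy": highest entropy of the predicted cell type probabilities.
    criterion="maxp":    lowest maximum predicted probability (expressed as 1 - max).
    percentile=100 selects the most extreme cells; 95 or 75 skips the most extreme
    cells, which may correspond to doublets or mislabeled cells.
    """
    probs = clf.predict_proba(X_unlabeled)
    if criterion == "entropy":
        uncertainty = entropy(probs, axis=1)          # higher = more uncertain
    else:
        uncertainty = 1.0 - probs.max(axis=1)         # higher = more uncertain
    cap = np.percentile(uncertainty, percentile)
    candidates = np.where(uncertainty <= cap)[0]
    ranked = candidates[np.argsort(uncertainty[candidates])[::-1]]
    return ranked[:n_cells]
```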

Next, we explored how the initial set of cells upon which the active learning model is trained impacts performance. We hypothesized that exploiting known information about marker genes with cell type-specific expression could help select the initial cells and improve active learning results. To test this, we ranked all cells by the expression of a set of cell type marker genes that were either provided by the dataset authors, derived from the data, or identified from an external database45. We then iterated through all expected cell types and selected the cell with the highest score for each type. We repeated this process until we had selected 20 cells to serve as the initial set for training an active learning model. As expected, this approach created datasets with an increased number of represented cell types relative to a random selection of cells (Fig. 2B).
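A minimal sketch of this marker-ranked initialization is given below, assuming `expr` is a cells-by-genes expression DataFrame and `markers` maps each expected cell type to a list of marker genes present in `expr`. Both names are hypothetical, and scoring each cell by its mean marker expression is one simple choice consistent with the description above.

```python
import pandas as pd

def select_initial_cells(expr: pd.DataFrame, markers: dict, n_initial: int = 20):
    """Pick an initial training set by cycling through expected cell types and
    taking the highest-scoring unselected cell for each type."""
    # Score each cell for each expected cell type by its mean marker expression.
    scores = pd.DataFrame(
        {cell_type: expr[genes].mean(axis=1) for cell_type, genes in markers.items()},
        index=expr.index,
    )
    selected, cell_types = [], list(markers)
    i = 0
    while len(selected) < n_initial:
        cell_type = cell_types[i % len(cell_types)]
        remaining = scores.loc[~scores.index.isin(selected), cell_type]
        selected.append(remaining.idxmax())
        i += 1
    return selected
```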

When benchmarking as previously described (Fig. 1B), selecting the initial set of cells by ranking their marker expression resulted in improved classification performance across datasets. This was particularly notable when few cells were labeled (Fig. 2C, Supplementary Figs. 6 and 7), likely because a larger diversity of cell types is present from the initial training set onwards, an advantage that becomes less important as more cells are labeled. Overall, we replicate existing results17 suggesting that random forest-based active learning approaches outperform logistic regression in real-world circumstances. In addition, we show that active learning can be further improved by selecting the initial set of training cells through a prior-knowledge-informed ranking procedure.

Marker-informed adaptive reweighting complements active learning as a cell selection procedure

Next, we considered the failure modes of existing active learning approaches on single-cell data. While active learning approaches prioritize cells with high predictive uncertainty, they require an accurate prediction model, which may be difficult to achieve in certain circumstances. To address this, we developed adaptive reweighting, a straightforward heuristic procedure that attempts to generate an artificially balanced cell set for labeling. Since clusters derived from unsupervised methods are often representative of individual cell types, we hypothesized that sampling a fixed number of cells from each cluster could yield an approximately balanced dataset with respect to the ground truth cell type labels (Fig. 3A). However, this heuristic is not perfect, as cells of a single cell type can be represented by multiple clusters. Therefore, we introduced a cell-type-aware strategy that putatively assigns each cluster to an expected cell type using the average expression of marker genes (methods) and samples evenly from cell types rather than clusters. We tested several clustering parameters but found no difference in their performance (Supplementary Figs. 8 and 9).
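The sketch below illustrates both variants of adaptive reweighting under stated assumptions: an AnnData object `adata` carrying precomputed Leiden clusters in `adata.obs["leiden"]`, and an optional `markers` dictionary mapping expected cell types to marker genes found in `adata.var_names`. The object names and AnnData conventions are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
import pandas as pd

def adaptive_reweighting(adata, n_cells, markers=None, seed=0):
    """Sample an approximately balanced set of cells to label from precomputed clusters."""
    groups = adata.obs["leiden"].astype(str)
    if markers is not None:
        # Marker-informed variant: assign each cluster to the expected cell type
        # whose markers are most highly expressed on average, then sample per type.
        mean_expr = pd.DataFrame(
            {ct: np.asarray(adata[:, genes].X.mean(axis=1)).ravel()
             for ct, genes in markers.items()},
            index=adata.obs_names,
        )
        cluster_to_type = mean_expr.groupby(groups.values).mean().idxmax(axis=1)
        groups = groups.map(cluster_to_type)
    cells = pd.DataFrame({"cell": adata.obs_names, "group": groups.values})
    per_group = max(1, n_cells // cells["group"].nunique())
    sampled = (cells.groupby("group", group_keys=False)
                    .apply(lambda df: df.sample(min(per_group, len(df)), random_state=seed)))
    return sampled["cell"].tolist()[:n_cells]
```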

Fig. 3: Investigating the effect of different cell selection methods on predictive performance.

A Schematic depicting the adaptive reweighting algorithm. First, the full dataset is clustered using existing methods. In the non-marker-informed case (top) a subsampled dataset to be labeled is created by randomly selecting a set number of cells from each cluster. In the marker-informed case (bottom), each cluster is assigned a putative cell type based on the average expression of marker genes. A subsampled dataset of the size requested by the user is then created by sampling an equal number of cells from all putative cell types. B Performance of all selection methods tested across ten different train-test splits (AL active learning, AR adaptive reweighting). Each selection method is ranked by the median balanced accuracy and sensitivity across seeds and cell type assignment methods. For the active learning results, the initial cell selection was ranked, and a random forest was used. C The difference in balanced accuracy between an imbalanced and a balanced dataset, standardized by the balanced-dataset value, for the snRNASeq cohort, indicating improved classification accuracy by active learning approaches in imbalanced settings. Selection procedures are ordered by the average change in balanced accuracy. The balanced dataset is composed of two cell types with 250 cells each, while the imbalanced dataset is composed of 50 cells of one type and 450 of another. In the snRNASeq cohort the similar dataset is composed of tumor and atypical ductal cells, while the different dataset is composed of tumor and immune cells. Boxplots depict the median as the center line, the boxes define the interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.

Overall, no single method is consistently best. However, active learning outperforms random selection and adaptive reweighting across most datasets, though adaptive reweighting remains competitive in some situations (Fig. 3B). Specifically, the highest entropy and lowest maximum probability selection strategies consistently outperform random cell selection. Selecting cells at the 75th entropy and 25th maximum probability percentile thresholds, however, consistently performed worse than random. As expected, the marker-aware adaptive reweighting strategy generally outperforms the non-marker-aware strategy, likely because it has access to prior knowledge in the form of marker genes. Nonetheless, care should be taken when defining a set of markers, as corruption of these markers can lead to decreases in performance (Supplementary Fig. 10). While we focused our analysis on balanced accuracy and sensitivity for the sake of clarity, all five metrics are highly correlated (Supplementary Fig. 11). Finally, we found that no selection strategy was too run-time intensive for practical purposes (Supplementary Fig. 12).

Active learning outperforms alternative methods in imbalanced settings

We next sought to understand the effect of cell type imbalance on active learning and adaptive reweighting, given that such imbalance is common within the field of single-cell biology27 and has been shown to affect active learning results in adjacent fields46,47. Across all datasets we sampled 450 cells of one type and 50 of another to create artificially imbalanced datasets of 500 cells. Since cell type similarity can influence performance (classifying cells of similar types is harder)48, we repeated this analysis twice: once with two distinct cell types and once with two similar cell types. In the case of the snRNASeq dataset, the similar dataset was composed of tumor and atypical ductal cells as the majority and minority cell types respectively, while the distinct dataset was composed of tumor and immune cells as the majority and minority cell types respectively (Supplementary Fig. 1). In addition, we created a balanced dataset with 250 cells of each type as a control (Table 1). Next, we sampled 100 cells from these artificially imbalanced datasets using active learning, adaptive reweighting and random selection. We then used this set of 100 cells as the training set in our benchmarking pipeline (Fig. 1B).
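As a concrete illustration, the imbalanced and balanced control datasets can be constructed by subsampling indices from the ground truth labels. The helper below is a hypothetical sketch; the cell type names in the usage comments are placeholders for the dataset-specific labels listed in Table 1.

```python
import numpy as np

def make_pairwise_dataset(y, majority, minority, n_major=450, n_minor=50, seed=0):
    """Return indices forming a two-cell-type dataset with a chosen imbalance.

    Example: make_pairwise_dataset(y, "Tumor", "Immune")            # imbalanced (450/50)
             make_pairwise_dataset(y, "Tumor", "Immune", 250, 250)  # balanced control
    """
    rng = np.random.default_rng(seed)
    major_idx = rng.choice(np.where(np.asarray(y) == majority)[0], size=n_major, replace=False)
    minor_idx = rng.choice(np.where(np.asarray(y) == minority)[0], size=n_minor, replace=False)
    return np.concatenate([major_idx, minor_idx])
```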

Table 1 Cell types used to generate imbalanced datasets for the first analysis considering cell type similarity

We found that active learning approaches generally outperformed both random selection and adaptive reweighting in imbalanced settings (Fig. 3C, Supplementary Figs. 13–27). As expected, the drop in performance in imbalanced settings was larger for cell types that were highly similar and less pronounced when the selected cell types were distinct from each other. Overall, these results indicate that active learning approaches should be considered first if a large cell type imbalance is suspected. We next sought to understand the impact of dataset imbalance in a complex dataset with more than two cell types. We created balanced datasets containing 100 cells of each of five different cell types and imbalanced datasets with 400 cells from one cell type and 25 cells from each of four other cell types (Table 2). Overall, we found that active learning also outperformed other selection approaches in these settings (Supplementary Figs. 28–30).

Table 2 Cell types used to generate the second set of imbalanced datasets

Active learning can identify distinct novel cell types

Next, we tested the ability of active learning approaches to identify cell types that were completely unlabeled in the initial training set. We trained our active learning models with 20 initial cells, ensuring that this set contained 0, 1, 2 or 3 cells of a specific cell type, while the remaining cells were selected either randomly or by ranking their marker expression as previously described. After training on this set of 20 cells, we predicted cell type probabilities for the unannotated cells, calculated their entropies, and contextualized these values using the ground truth labels.
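A minimal sketch of this held-out cell type experiment is given below, using a logistic regression classifier and entropy scaled by the maximum possible value (as in Fig. 4). Function and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.linear_model import LogisticRegression

def heldout_entropy(X_initial, y_initial, X_rest, y_rest, heldout_type):
    """Compare scaled predictive entropy of a cell type absent from the initial
    training set against all other cells."""
    clf = LogisticRegression(max_iter=1000).fit(X_initial, y_initial)
    probs = clf.predict_proba(X_rest)
    # Scale by the maximum possible entropy (log of the number of classes) so
    # experiments with different numbers of cell types are comparable.
    scaled = entropy(probs, axis=1) / np.log(probs.shape[1])
    return {
        "heldout_median": float(np.median(scaled[np.asarray(y_rest) == heldout_type])),
        "other_median": float(np.median(scaled[np.asarray(y_rest) != heldout_type])),
    }
```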

The results show that when a cell type was excluded from the training set, the entropies for that cell type were generally higher relative to training sets that included 1, 2 or 3 cells of that type (Fig. 4A). While this increase in predictive entropy varied across datasets, it was most drastic when using logistic regression, though it was still appreciable when using a random forest classifier (Fig. 4A, Supplementary Fig. 31). However, even the logistic regression classifier showed little change in entropy values when some cell types (e.g. Schwann cells) were removed (Fig. 4C). This is likely due to the similarity between Schwann and endothelial cells (Supplementary Fig. 1): when both were removed, Schwann cells had appreciably higher entropies than when a few cells of these types were present (Fig. 4C, last panel). Based on these results, we conclude that logistic regression-based active learning approaches are likely to quickly identify novel cell types, even if these were not selected in the initial training phase, provided the cell types are sufficiently distinct from one another.

Fig. 4: Unlabeled cells have higher entropy values.

A Median scaled entropy values for each cohort when all cells of a type were removed (purple) versus all other cells (yellow). B Entropy of the cell type predictive distribution for all cells not in the initial training set of 20 cells for the scRNASeq breast cancer dataset. Boxplots are colored by the number of cells present of a particular type (shown in the plot title), while the x axis shows the ground truth cell type label. C As in (B) for the snRNASeq pancreatic cancer dataset, with the bottom right panel depicting the effect of removing both endothelial and Schwann cells, and (D) the effect of removing CD8 T cells and classical monocytes from the CyTOF bone marrow dataset. As entropy is bounded by the total number of classes, the entropy values depicted were scaled by the maximum possible value for each experiment. Shown are the results across the 10 different train-test splits. All boxplots depict the median as the center line, the boxes define the interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Abbreviations: logistic regression (LR), random forest (RF). Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.

Self-training can further improve classification performance and detect mis-annotated cell types

Next, we investigated the utility of self-training—a form of self-supervised learning—to boost cell type classification performance without requiring additional manual labeling. Self-training or pseudo-labelling is a technique that uses a small, labeled dataset to train a classifier that is then used to predict the label of all remaining (unlabeled) samples49. The most confidently labeled cells (based on the lowest entropy) are combined with the manually labeled cells to create a larger labeled dataset, which can then be used to train subsequent cell type annotation algorithms. In adjacent fields, self-training has been demonstrated to improve classification performance34, though its efficacy for efficient cell type annotation remains unexplored.
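A hedged sketch of this pseudo-labelling step is shown below, using a random forest as the self-training classifier and predictive entropy to rank confidence. The function name, defaults, and use of dense arrays are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.ensemble import RandomForestClassifier

def self_train(X_labeled, y_labeled, X_unlabeled, top_fraction=0.1):
    """Augment a manually labeled training set with confidently pseudo-labeled cells."""
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X_labeled, y_labeled)
    probs = clf.predict_proba(X_unlabeled)
    # Rank unlabeled cells by predictive entropy; lowest entropy = most confident.
    confident = np.argsort(entropy(probs, axis=1))[: int(top_fraction * len(X_unlabeled))]
    pseudo_labels = clf.classes_[probs[confident].argmax(axis=1)]
    # Combine manually labeled and pseudo-labeled cells for downstream classifiers.
    X_aug = np.vstack([X_labeled, X_unlabeled[confident]])
    y_aug = np.concatenate([np.asarray(y_labeled), pseudo_labels])
    return X_aug, y_aug
```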

To investigate this, we implemented random forest and logistic regression classifiers as self-training algorithms and labeled the top 10%, 50% and 100% most confident cells with their predicted label on the three datasets from before. As expected, the accuracy of these classifiers decreased as larger fractions of less confident cells were included (Fig. 5A, Supplementary Figs. 32 and 33).

Fig. 5: Self-training can increase performance of some cell type assignment algorithms.

A F1-score of each self-training algorithm for each cohort decreases as a larger set of predictively-labeled cells is included. The x-axis shows the percentage of the most confident cells in the overall dataset that were labeled using the self-training method. The initial set of training cells was picked using active learning. B Overall improvement in F1-score when including cells labeled using self-training relative to the baseline accuracy (training only on the cells labeled with ground truth values). Shown are the results using an initial training set of 100 cells that were selected using active learning and a self-training strategy. The percentage specifies the percent of most confident cells that are self-labeled. C Correlation between self-training improvement and original performance. D Entropy of all cells represented in the randomly selected 250-cell training datasets, shown for each of the 10 train splits. The x-axis denotes the ground truth cell type, while each boxplot is colored by whether the cell was corrupted to a different cell type. As in Fig. 4, because entropy is bounded by the total number of classes, the entropy values depicted were scaled by the maximum possible value for each experiment. All boxplots depict the median as the center line, the boxes define the interquartile range (IQR), the whiskers extend up to 1.5 times the IQR and all points depict outliers from this range. Source data are provided on zenodo: https://doi.org/10.5281/zenodo.10403475.

We benchmarked the impact of self-training on cell type annotation performance by combining the manually annotated dataset with a varying percentage of the most confidently labeled cells. We then used these datasets to train all cell type annotation methods previously implemented (Fig. 1B) and evaluated their performance on the test set. When comparing the performance of each classifier trained only on the manually annotated data to the same classifier trained on the manually annotated data plus the self-trained labels, we found that self-training generally increases predictive performance, especially for datasets with dissimilar cell types (Fig. 5B, Supplementary Figs. 1, 34–39). The performance gain from self-training is most noticeable when few cells have been annotated and is lost once 500 cells have been annotated (Supplementary Fig. 40). To further understand whether any selection procedures particularly benefit from including self-training data, we correlated the classification improvement gained by adding self-trained data with the baseline F1-score achieved on only the initially labeled cells. We find that selection procedures with lower accuracies benefit most from self-training (Fig. 5C). This is likely because little additional performance can be gained when a classifier already achieves a near-optimal classification accuracy (F1-score).

Finally, we investigated whether self-training can be used to identify mis-labeled cells. To address this, we took a random sample of 250 cells from each of the train splits and corrupted the ground truth cell type for 10% of cells, such that their label was mis-assigned. We then trained logistic regression and random forest models on these 250 cells, including the misannotated ones. Next, we used this classifier to calculate the entropy for each cell in the training dataset. The results of this analysis clearly show increased entropy levels for those cells whose cell type labels were misassigned (Fig. 5D). Thus, we conclude that self-training can also be used to detect mislabeled cells within the training set used for the self-training classifier.
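The mislabelling experiment can be sketched as follows; the corruption procedure, classifier choice, and names are illustrative assumptions consistent with the description above, with entropy scaled as in Fig. 5D.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.linear_model import LogisticRegression

def flag_mislabeled(X_train, y_train, corrupt_fraction=0.1, seed=0):
    """Corrupt a fraction of labels, refit, and return scaled per-cell entropy
    plus the indices that were corrupted (expected to show elevated entropy)."""
    rng = np.random.default_rng(seed)
    y_corrupt = np.asarray(y_train).copy()
    labels = np.unique(y_corrupt)
    flipped = rng.choice(len(y_corrupt), size=int(corrupt_fraction * len(y_corrupt)),
                         replace=False)
    for i in flipped:
        # Reassign the cell to a randomly chosen incorrect cell type.
        y_corrupt[i] = rng.choice(labels[labels != y_corrupt[i]])
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_corrupt)
    scaled_entropy = entropy(clf.predict_proba(X_train), axis=1) / np.log(len(labels))
    return scaled_entropy, flipped
```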