Gene-expression memory-based prediction of cell lineages from scRNA-seq datasets – Nature Communications

Cell culture

CGR8 ES cells (Sigma, Cat307032901-1VL) were routinely cultured at 37 °C and 5% CO2 on 10 cm dishes coated with 0.1% gelatin type B (Sigma, Cat#G9391-100G) in 10 ml GMEM (Sigma, Cat#G5154‐500 ML) supplemented with 10% ES cell‐qualified fetal bovine serum (Gibco, Cat#16141‐079), 1% nonessential amino acids (Gibco, Cat#11140‐050), 2 mM l‐glutamine (Gibco, Cat#25030‐024) and sodium pyruvate (Sigma, Cat#S8636‐100 ML), 100 μM 2‐mercaptoethanol (Sigma, Cat#63689‐25 ML‐F), 1% penicillin and streptomycin (Pen/Strep) (BioConcept, Cat#4‐01F00‐H), homemade leukemia inhibitory factor (LIF), CHIR99021 (Merck, Cat#361559‐5MG) at 3 μM and PD184352 (Sigma, Cat#PZ0181‐25MG) at 0.8 μM. Cells were passaged by trypsinization (2 ml of trypsin, Sigma, Cat#T4049‐100 ML) every 2–3 days at a ratio of 1:10. For scRNA-seq experiments, cells were switched to N2B27 + 2i/LIF medium 2 passages beforehand. N2B27 + 2i/LIF medium was composed of a 1:1 mix of DMEM/F12 (Gibco, Cat#11320‐033) and Neurobasal medium (Gibco, Cat#21103‐049), supplemented with N2 (Gibco, Cat#17502‐001), B27 (Gibco, Cat#17504‐001), 1% Pen/Strep (BioConcept, Cat#4‐01F00‐H), 2 mM L-glutamine (Gibco, Cat#25030‐024), 100 μM 2-mercaptoethanol (Sigma, Cat#63689‐25 ML‐F), LIF, CHIR99021 (Merck, Cat#361559‐5MG) at 3 μM and PD184352 (Sigma, Cat# PZ0181‐25MG) at 0.8 μM. Cells were split every 2–3 days at a ratio of 1:10, using 2 ml of accutase (Innovative Cell Technologies, Cat#31195) and centrifugation. HEK293T cells were routinely cultured at 37 °C and 5% CO2 on dishes in DMEM high glucose medium (Gibco, Cat#41966) supplemented with 10% Fetal bovine serum (Life Technologies, Cat#10270-106) and 1% Pen/Strep (BioConcept, Cat#4‐01F00‐H).

Lentiviral barcoding library production

The LARRY lentiviral barcoding library12 was purchased from Addgene (https://www.addgene.org/pooled-library/camargo-plarry-egfp-barcoding-v1/). The barcodes of the library are composed of 40 base pairs (bp) with 28 random bp separated by 6 fixed bp doublets and are located in the 3’ untranslated region of EGFP expressed from the EF-1ɑ promoter. The library was amplified and the lentiviral vector was produced according to the associated protocol with small modifications. Briefly, plasmids were introduced into ElectroMAX Stbl4 Competent Cells (Life technologies, Cat#11635018) using MicroPulser Electroporator (Bio-Rad, Cat#1652100) EcoRI program. Cells were incubated for 1 h at 37 °C and spread over 20 large (24.5 × 14.5 cm) Agar+Ampicillin plates (Ampicillin at 100 ug/ml, AppliChem, Cat# A08390025). After 24 h, colonies were harvested through scraping using pre-warmed LB medium supplemented with 100 ug/ml Ampicillin. The resulting 1.5 L bacterial culture was incubated at 37 °C for 2 h and a Maxiprep was performed using a Qiagen MaxiPlus kit (Qiagen, Cat#12963) to a resulting 1 mg of plasmid DNA. A lentiviral vector was produced by transforming the produced LARRY plasmid into HEK293T cells using the Trans-IT 293 transfection reagent (Mirus, Cat#MIR2700). For each of eighteen 10 cm dishes of HEK293T cells, 1.5 ml Opti-MEM (Gibco, Cat#51935), 16 μg plasmid (8 μg LARRY plasmid, 6 μg psPAX2 (Addgene, PRID: Addgene_12260), 2 μg pMD2.G (Addgene, PRID: Addgene_12259)) were mixed with 45 μl TransIT 293 transfection reagent, incubated for 15-30 min at RT, and added to the cells dropwise. Lentiviral vector particles were harvested 48 h and 72 h after transfection. Medium was collected and filtered through a 0.45 μm PVDF filter (Millipore, Cat#SLHV033RS), and centrifuged in a Beckman Optima XL-80K Ultracentrifuge (Beckman, Cat#8043-30-1211) at 140,000 × g for 1h 30 min at 4 °C. The supernatant was removed, and pellets were resuspended with 100 μl of GMEM with serum and all additives as above but without 2i/LIF. The lentiviral vector preparation was incubated on ice for 30 min, aliquoted, and stored at –80 °C. Titration of the lentiviral barcoding library was performed on CGR8 cells cultured in GMEM+2i/LIF, with a read-out 5 days after infection.

Barcode reference library generation

A reference library was made through the sequencing of PCR-amplified barcodes from the LARRY plasmid library in triplicates. 500 ng of LARRY plasmids were taken as input for a two-step PCR using Phusion high-fidelity DNA polymerase (Thermo Fisher, Cat#F-5305) adapted for LARRY from ref. 48. The first step amplifies the barcodes and adds Illumina Read1 and Read2 sequences (5’ACACTCTTTCCCTACACG ACGCTCTTCCGATCTTGTGACGTCACAGGTCGACACCAGTCTCATT3′ and 5’GTGACTGGAGTTCAG ACGTGTGCTCTTCCGATCGAGTAACCGTTGCTAGGAGAGACCATA3′). The second step adds the P5 and P7 flow cell attachment sequences and a sample index of 7 bp (P5 5’AATGATA CGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT3′ and P7 5’CAAGCAGA

AGACGGCATACGAGANNNNNNNGTGACTGGAGTTCAGACGTGCTCTTCCGATC3′). 200 ng of PCR1 product was taken as input for PCR2. (PCR programs: 98 °C 2 min, 15 cycles of 98 °C 10 s, 67.2 °C (PCR1), 72 °C (PCR2) 30 s, 72 °C 30 s, followed by final elongation 72 °C 5 min and 4 °C indefinitely). In between PCR1 and PCR2 and after PCR2, PCR purification was performed using the QIAquick PCR purification kit (Qiagen, Cat#28106). The mix of the three samples was sequenced on a MiSeq instrument (Illumina) at the Gene Expression Core Facility of EPFL. Sequencing results were first filtered for a perfect match to the plate index pattern using XCALIBR (https://github.com/NKI-GCF/xcalibr). The resulting read files were filtered for a perfect match to the barcode pattern using customized R scripts. The resulting list and the LARRY “barcode_list” available on Addgene (https://www.addgene.org/pooled-library/camargo-plarry-egfp-barcoding-v1/) were merged and used as a reference list.

mESCs scRNA-seq memory experiment run

CGR8 cells were transduced with 3 ul of the lentiviral LARRY barcoding library per 10 cm dish with 10 ml medium to obtain roughly 1% of GFP+ cells. 72 h after infection, live GFP+ cells (Fig. S1a; live stain 1:500 propidium iodide solution (BioLegend, Cat#421301)) were sorted on an FACSAria III (BD) at the Flow Cytometry Core Facility of EPFL into Cellcarrier 96-Ultra 96-well plates (PerkinElmer, Cat#6055308). Plates were coated with recombinant human E-cadherin-Fc chimera (BioLegend, Cat#BLG-779904-25ug) to reduce colony formation and thereby limit the potential impact of paracrine and direct cell signaling in regulating gene expression of related cells49. Coating was performed at 10 μg/ml for 1 h at 37 °C. After washing the plates with PBS, 1,200 cells were sorted into two wells and both were collected after 48 h for sequencing. Before collection, cells were imaged on an IN Cell analyzer 2200 (GE Healthcare) at the Biomolecular Screening facility of EPFL. ScRNA-seq library preparation was performed on a 10X Genomics Chromium platform of the Gene Expression Core Facility of EPFL using the SingleCell 3’ Reagent Kit v3.1. The sample was sequenced on a Hiseq4000 instrument (Illumina). The experiment was performed once.

mESC scRNA-seq data processing and cell lineage inference

ScRNA-seq data was analyzed using 10X Genomics Cell Ranger (v. 5.0.1), Seurat (v. 4.1.0), customized R (v. 4.0.2), and Python (v. 3.9.7) scripts. Raw sequencing reads were processed using 10X Genomics Cell Ranger (v. 5.0.1) using default parameters and refdata-gex-mm10-2020-A as reference genome, with or without the include-introns option. Cell Ranger outputs a unique molecular identifier (UMI) corrected read count matrix. Cells with a percentage of mitochondrial reads between 1.75 and 7.5 and with more than 10,000 reads were further analyzed. Data was normalized to 40,000 reads per cell (similar to RPM, one normalized count equals one read on median). Lineage barcodes were extracted from the data using the CellTag pipeline9 available for download at (https://github.com/morris-lab/BiddyetalWorkflow) and adapted to the LARRY barcode design. Briefly, the CellTag pipeline extracts reads containing a CellTag motif from the processed, filtered, and unmapped reads BAM files produced in intermediate steps of the 10X Genomics Cell Ranger pipeline. To extract LARRY barcodes, the CellTag motif was changed to “([ACTG]{4}TG[ACTG]{4}CA[ACTG]{4}AC[ACTG]{4}GA[ACTG]{4}GT [ACTG]{4}

AG[ACTG] {4})” in all scripts. Barcodes with reads from only one UMI, and without perfect match to the reference library were filtered out. A Jaccard similarity score of >0.7 was used to identify cell lineages. No filtering on the number of barcodes expressed per cell was performed. Lineages were called on unfiltered data. For the mESC dataset, 6 lineages had sizes above 5 cells, which is larger than expected based on cell cluster sizes after 48 h of culture and the expected loss of cell lineage members in the preparation for scRNA-seq. They were therefore excluded from further analysis.

Processing of public scRNA-seq data

Data from Biddy et al.9 was extracted as a BAM file from SRA links specified under GSE99915. BAM files were converted back to fastq format using 10X Genomics’ bamtofastq (v. 1.3.2) and Cell Ranger (v. 5.0.1) was run on the resulting data as described above with or without the include-introns option. Cells were filtered on the percentage of mitochondrial reads and the number of reads as indicated in Table S2. Data was normalized to 40,000 reads per cell. Cell lineages were assigned using the CellTag pipeline as specified above with the original CellTag motif of “(GGT([ACTG]{8})GAATTC)”(V1), “(GTGATG([ACTG]{8})GAATTC)”(V2) or “(TGTACG([ACTG]{8})GAATTC)”(V3) and the respective whitelists and sample-specific cell barcodes available at (https://github.com/morris-lab/BiddyetalWorkflow). As in the original publication, cells with >20 or <2 CellTags expressed were not considered for lineage assignment. Lineages were called on unfiltered data using a Jaccard similarity score of >0.7. Cell lineages were called on individual datasets for all analyses of a single time point. Data from Kimmerling et al.21 is GSE74923_L1210_CD8_processed_data.txt from GEO. Ground truth cell lineage information was extracted from the GSE74923_series_matrix.txt file. Data from Weinreb et al.12 was downloaded from (https://github.com/AllonKleinLab/paper-data/blob/master/Lineage_tracing_on_transcriptional_landscapes_links_state_to_fate_during_differentiation/README.md). Data from Jindal et al. was directly obtained from the authors in the form of Seurat objects including the unnormalized but filtered count matrix and a metadata file comprising the lineage information. All datasets were RPM normalized to 40,000 reads/cell. Data from Wehling et al.23 was directly obtained from the authors in the form of an unnormalized count matrix and the metadata file (available on GEO GSE167317 asGSE167317_CountMatrix_Seq5.csv, GSE167317_CountMatrix_Seq4.csv, GSE167317_Metadata_Seq5.csv, GSE167317_Metadata_Seq4.csv). Cells with under 100,000 reads were removed and data was normalized to 40,000 reads per cell. Lineage information was extracted from the metadata file. Data from Harmange et al. was downloaded from https://drive.google.com/drive/folders/1-C78090Z43w5kGb1ZW8pXgysjha35jlU?usp=sharing (accessed begin July 2022) in the form of 10×1_Filterd_BatchCor_unnorm_sctrans.rds from experiment one. Lineages were assigned using the corresponding script section in the file 10×1_r1_r2_Analysis_unorm_sctrans.Rmd. During filtering, cells with <3% of mitochondrial reads, and cells with <4000 reads were removed. Data was normalized to 40,000 reads/cell. Data from Simeonov et al.42 was obtained in the form of CellRanger output files for the lung metastasis scRNA-seq data of mouse 1 and 2 from GEO (GSE173958; GSM5283486 and GSM5283491). scRNA-seq data was normalized to 40,000 reads/cell. Lineage annotation was obtained from Mendeley data (DOI: 10.17632/t98pjcd7t6.1) in the form of “Barcodes-of-barcodes” files for all clones. This metadata was then filtered for the annotations of clones present in the scRNA-seq datasets of the lung metastases of mouse 1 and 2. Data from Bues et al.,43 was directly obtained from the authors in the form of a Seurat Object including a normalized count matrix with crypt and organoid annotation. Briefly, organoids were generated by sorting single Lgr5+ intestinal stem cells from dissociated organoids into a Matrigel matrix. After culture for 3, 4, 5, or 6 days, single organoids were hand-picked and dissociated individually before loading for scRNA-seq. Organoids are derived from 3 batches. Lgr5+ intestinal stem cells seeded can be derived from the same organoid. Crypts were collected over 5 batches of 3 pooled mice from 10 mm sections of the ileum. Crypts within the same batch can be derived from the same mouse but were several mm apart in the ileum. Data and scripts from Miller et al.27 were downloaded from https://github.com/vangalenlab/MAESTER-2021 and https://vangalenlab.bwh.harvard.edu/resources/maester-2021/ as TenX_CellLineMix_cells.rds, TenX_CellLineMix_All_mr3_maegatk.rds, TenX_CellLineMix_Seurat_Keep.rds (for the K562 dataset) and BPDCN712_Maegatk_Final.rds, BPDCN712_Seurat_Final.rds, 4.2_vois.txt (for the primary human bone marrow cells). Lineage assignment was performed as in the scripts 3.4_TenX_K562_clones.R and 4.3_variants_Of_Interest.R for K562 and bone marrow cells respectively. “GroupIDs” were then unlisted to extract the cell-barcode lineage correspondence. The count-matrix for the K562 dataset was extracted from the TenX_CellLine_Mix Seurat object by filtering for K562 cells. For the primary bone marrow cells, the BPDCN712 Seurat object was split out by replicate to generate the GEMLI input count matrices. For both cell types, lineages were filtered to be defined by a single informative mitochondrial variant. Data from Janesick et al.44 was downloaded from https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast. The scRNA-seq dataset was filtered to a > 1000 reads/cell and a mitochondrial fraction of 1-50%. Cell types were annotated through supervised machine learning. In brief, a UMAP was built upon a linear discriminant analysis (LDA; R package ‘MASS’ (v. 7.3-58.3) performed on log molecule numbers) which in turn was based on provided cluster annotations for Xenium data in situ sequencing data. Single-cell RNA-seq Chromium data was then projected onto this LDA-UMAP to assign cell types. Assignments were specific and unique. Finally, each cluster was assigned its cell type annotation through the assessment of marker genes provided in Janesick et al.44.

ScRNA-seq data representation

ScRNA-seq data is represented as tSNE, UMAP, or SPRING graph. tSNE embeddings were generated using a custom function based on the tSNE R package (v. 0.1-3.1) on exonic, or both intronic and exonic reads. PCA was calculated on the 3,000 top variable expressed genes. Subsequently, a tSNE was computed on the PCAs top 10 components. SPRING graphs were generated from precalculated values in public datasets (HSPC data was represented as SPRING graph)). UMAPS (mESC, crypts, and organoid datasets) were generated in a standard Seurat analysis. Briefly, variable features were called using the FindVariableFeatures function (parameters selection.method =”vst”, and number of features = 2000), data was scaled using the ScaleData function on all genes, and a PCA was run using the RunPCA function on variable features. Subsequently, a UMAP was built using the RunUMAP function on the top 14 components.

Assessment of gene expression similarity

Gene expression correlation within cell lineages or random samples for the whole transcriptome was calculated as Pearson correlation. Correlation distance is calculated as 1- the correlation value. To assess the similarity of related cells with respect to cell cycle, genes were classified as cell cycle-dependent or independent using a cell cycle assignment by the cyclone function within scran (v. 1.18.7). Dependence of gene expression on the cyclone assigned cell cycle phases (numerized) was tested using Hoeffding’s D statistics. Gene expression correlation was then calculated using only genes classified as cell-cycle dependent. To assess the similarity in the velocity of related cells, velocity analysis was performed using the velocyto package (v. 0.6). Velocity momenti were extracted for each cell, and their Pearson correlation was calculated. Complexity was defined as the number of genes expressed in a cell and the complexity range within cell lineages was calculated for analysis. For all metrics (transcriptomic similarity in all genes, in cell-cycle dependent genes, similarity in velocity momenti, and in complexity) values between related cells (all lineage sizes) and repeated (100x) random samples of cells of the same size were compared. For the scoring of transcriptomic similarity in symmetric and asymmetric lineages, symmetric and asymmetric lineages were defined based on entire lineages.

Estimation of variation mediated by cell lineages

Assuming that the variation (here CV2) within each lineage is independent of cross-lineage effects, we built a linear model using the stats package (v. 3.6.2) to capture the relationship of the variation across lineages (variation of means of each lineage) and total variation (variation across all cells). We then generated mock lineages through the permutation of ground truth lineages and used our model to estimate the total variation in these control sets. Lineages of sizes 3–5 for mESCs and WM989 cells, of size 4 for CD8 and L1210, and of size 2 for all other cell types were considered. This ensured to inclusion of around 90% of the cells in each dataset.

Memory-gene identification and categorization

We defined memory genes as genes with high variability (coefficient of variability squared; CV2) in mean gene expression in different cell lineages. We compared the lineage value to the distribution of values from repeated (20x) random size-matched samples of cells to calculate a p-value. Memory genes were defined as genes with a significantly (p < =0.05) higher variability between lineages as compared to random samples. To categorize memory genes into quantitative and qualitative memory genes, we used the skewness function of the R package moments (v. 0.14.1). We defined a quantitative memory gene as a gene with a skewness in expression level of under 3, and a qualitative memory gene, as a gene with a skewness in expression level equal or above 3. Memory genes were called on cell lineages of 3–5 (mESCs), 2 members (invasive tumor), or 2–5 members (all other cell types). All predicted cell lineages were used when calling memory genes on these. Other criteria to call memory genes resulted in highly overlapping gene sets (Fig. S6a, b). These additional memory gene selection criteria included a high correlation in gene expression within cell lineages and a low variability (CV2) in gene expression between cells of the same cell lineage (intra-lineage CV2; intraCV2). Furthermore, we analyzed marker genes of cell lineages using the Seurat FindMarker function with a range of test.use parameters (bimod, roc, t (T-test), negbinom, poisson, LR, MAST). We then considered as memory genes the markers with a high sharing across cell lineages, using thresholds defined independently for each FindMarker run to optimize the Pearson correlation in gene expression within cell lineages. Finally, to identify memory genes using machine learning, we used common dimension reduction methods, mutual information maximizer (MIM), and ANOVA F-test feature selection (ANOVA). Other feature selection techniques frequently used in the literature were also investigated, but MIM and ANOVA outperformed all the other methods. For both, each gene was attributed a memory score either by computing the mutual information or ANOVA F-test of the gene expression across cells. To determine the ideal number of genes N to include in the final gene set, we predicted cell lineages using the top N genes with the highest memory score. Each of the resulting clusters was evaluated and the number of genes N with the best precision was chosen as the optimal number of memory genes. For the comparison of memory genes and variable genes in the Janesick et al.44 dataset, memory genes were called on lineages predicted at confidence level 50 with a lineage size parameter of 2–20 in each cell type independently. Genes were considered to be variable when having a residual to a least total square fit on mean expression and CV² above zero.

GO-term enrichment analysis

GO-term enrichment analysis was performed on memory genes called using the lineage ground truth (derived from cellular barcoding, microfluidics, or sister cell picking), and on memory genes called using predicted cell lineages. The topGO R package (v. 2.42.0) was used with the GO category “biological process”. Memory genes and background (all genes detected) were considered to be binary. The top 10, 20, 100, or 500 enriched GOs from each set were visualized as indicated. Spearman rank correlation between GO-term enrichment values was calculated for memory genes called on-ground truth and predicted cell lineages. For GO-term enrichment of memory genes called based on predicted lineages in the Janesick data, GO-terms were further compared to expression-matched random gene sets. Enriched terms were significant (Fisher’s exact test, p < 0.001) and enriched over 100 expression-matched random gene sets (estimated p < 0.05). A selection of terms is shown in Fig. S22a and the full list can be found in Supplementary Data file 5.

Gene selection for GEMLI predictions

Genes for cell lineage predictions were selected to enrich memory genes based on the quantiles of the mean and variability of gene expression. The variability of gene expression was defined as mean-corrected CV2 calculated in the form of the residual of the CV2 to a linear fit of CV2 and mean expression. The genes selected for lineage predictions are the 2% highest expressed genes, the 60% most variable among the 10% highest expressed genes, and the 10% most variable among the 40% highest expressed genes: (mean quantile >= 98) or (mean quantile >= 90 and variability quantile >= 40) or (mean quantile >= 60 & and variability quantile >= 90)). The selected genes are taken as input for the repetitive iterative clustering algorithm. The selected genes do only partly overlap with highly variable genes and cell type markers called during a standard Seurat-based scRNA-seq analysis (Figs. S4f and  S7f). For this comparison, highly variable genes were called as top 2000 variable genes with standard Seurat parameters. Cell type markers were called using Seurat’s FindMarkers function for each cell type present with at least 5 cells. Also using machine learning to select genes based on variability and mean gene expression did not further enrich memory genes (Fig. S7c). Briefly, different models were trained including logistic regression with L2 regularization (LR), support-vector machine (SVM) with the radial basis function (RBF) kernel, and a neural network. The mESC and Biddy et al. MEF datasets of 48 h and 72 h culture time (n = 1 and n = 6, respectively) were used for model training, the datasets of CD8, L1210 (n = 1 each), and HSPC (48 h; n = 22) were used for model validation. The mean expression outliers in each dataset were removed using the inter-quartile range (IQR) method. All the outliers were considered to be memory genes and were added in the final gene selection. Then, the data was normalized to a range between 0 and 1. Each dataset was preprocessed individually. The LR and SVM models were trained using their sklearn (v. 1.1.1) implementation. The neural network was optimized using the Pytorch framework (v. 1.11.0). To optimize the model, a balanced version of the binary cross entropy loss and accuracy was used to account for the fact that the proportion of memory genes is low in all datasets while most machine learning models assume that the data has approximately the same number of samples of both classes. To take this into account, a bigger weight was given to a misclassification of the minority class, the memory genes. To find the optimal hyperparameters for the LR and SVM a grid search and cross-validation were performed. Based on this, the L2 regularization was adjusted to 103 and 105 for LR and SVM respectively. For the neural network, cross-validation was deemed too computationally expensive, and the Optuna Python package (v. 2.10.0) was used to perform a random search for the optimal hyperparameters on training and validation set (Supplementary Data file 6). The best performing model was the neural network and its results are show in the Supplementary Figs.

GEMLI cell lineage prediction algorithm

The GEMLI algorithm is a repetitive iterative clustering approach to predict cell lineages (Scheme in Fig. 2a, see Table S1 for a comparison to other clustering algorithms), taking a gene set as input. For de novo cell lineage predictions on scRNA-seq data without lineage annotation the gene set is selected based on mean gene expression and variability (see previous section). Also other gene sets can be used as input, notably memory genes when a ground truth lineage annotation is available (see below). During each iterative clustering, the input gene set is randomly subsampled to 75% (sampling value parameter; see section on GEMLI parameters below). This gene set subsample is then used to cluster the cells of the dataset iteratively as follows. During the first iteration, two clusters (cluster cut parameter; see below) are generated using cutree function from the dendextend (v.1.16.0) R package and the hclust hierarchical clustering function of the R package stats (v.4.0.2) on the gene expression correlation distance using the agglomeration method ward.D2. In every subsequent iteration round, the cells of each previously defined cluster are clustered again into two clusters using the same procedure. The iteration is ended for clusters meeting a predetermined size of 2–3 cells (lineage size parameter, see below). This results in a cell-by-cell matrix indicating for each cell-pair if it did cluster together in the final cluster of 2–3 cells. This iterative hierarchical clustering represents one-cell lineage prediction. It is repeated 100 times (repetition number parameter, see below) to generate a confidence level for the lineage predictions. More in detail, the cell-by-cell matrices generated in the 100 repetitions of the iterative hierarchical clustering are summed into a confidence level cell-by-cell matrix. Two cells ending up in the same cluster in 30 of the 100 predictions will here for example receive the value of 30. Setting a threshold on the confidence level can then be used for lineage assignment. Cell lineages are assigned as cell groups of which members are clustering together above a confidence level threshold. Lineage assignments are scored against the ground truth lineages (barcode, microfluidic, or sister-cell picking) on precision (true positives (TP)/(TP+ false positives (FP))), sensitivity (TP/(TP+ false negatives (FN))) and false positive rate (FPR; FP/(FP+ true negatives (TN))). All are calculated for each confidence level X by comparing the predictions with a confidence level >=X to the ground truth cell lineage relationships. Thereby precision, sensitivity and FPR are calculated on cell pairs. For the majority of datasets, also AUC values were calculated using the R package Metrics (v. 0.1.4) by the confidence level of predictions as a probability value.

Comparison of GEMLI to other lineage assignment approaches

To test the performance of GEMLI predictions, characteristics-based gene selection was compared to other starting gene sets, as all genes and memory genes called using the lineage ground truth (from cellular barcoding, microfluidics, or sister cell picking). Further, different memory gene definitions (see above) and Seurat’s highly variable genes (see above) were tested as input gene sets. Finally, memory gene enriched gene sets selected based on mean expression and variability using a neural network (see above) were tested in predictions. GEMLI predictions had a slightly lower precision and sensitivity than predictions using memory genes, but performed better than all genes (Fig. S8). FPR was comparable between all three genesets (GEMLI, memory genes, all genes) and consistently very low (<1%) at confidence levels >=50. Other memory gene definitions did not improve prediction performance (Fig. S6c–h). Also, gene selection based on a neural network did not improve lineage predictions as compared to GEMLI (Fig. S9d). Seurat’s highly variable genes likewise performed worse than the GEMLI gene selection (Fig. S9f). Furthermore, the use of a memory gene set called using the ground truth lineage information in one dataset, did also not allow for improved predictions in other datasets of the same or related/other starting cell type and culture condition (Fig. S9e). Finally, GEMLI performance was also compared to other lineage assignment approaches, namely random cell pair assignment and kNN-clustering-based lineage assignments. GEMLI predictions on randomly lineage-assigned cells had as expected a very low precision and sensitivity and a high FPR (Fig. S9a). Furthermore, cell pairs generated using kNN-clustering with different inputs (Seurat top 2000 variable genes, top 14 components of a PCA on these genes, UMAP on this PCA) resulted in a very low precision, comparable sensitivity, and higher FPR than GEMLI predictions (Fig. S9b, c).

GEMLI predictions for symmetric and asymmetric lineages and different lineage sizes

Scoring predictions for symmetric (one cell type) and asymmetric (several cell types) lineages were performed in three ways. First, only two-cell lineages were scored, which can be unequivocally defined as symmetric (both members of one cell type) and asymmetric (each member of another cell type). Second, lineages of all sizes were scored for which symmetry was defined considering all cell members. Lineages with all members in one cell type were here considered symmetric. Lineages in which any members were part of two different cell types were considered to be asymmetric. Third, cell pairs were scored on the entire prediction matrix by taking pairs where both cells are of the same cell type as symmetric lineage, and pairs where both cells are of a different cell type as asymmetric lineage, respectively. For lineage size analyses the prediction matrix was stratified to only score pairs in which one or both cells are members of a lineage of the indicated ground truth lineage size.

KNN-clustering-based lineage assignments for comparison to GEMLI predictions

For kNN analysis a standard Seurat pipeline was used. Data was normalized and the 2000 most variable features were extracted using default parameters. The data was then scaled, a PCA was performed on previously defined variable genes and a UMAP was conducted on the first 14 principal components. The kNN analysis was then performed using the dbscan package (v. 1.1-11) with k = 1 or 2 as indicated, on either the scaled data for variable features, the 14 first principal components or the UMAP embedding. Pairs of nearest neighbors were used as the lineage for comparison with GEMLI predictions.

GEMLI prediction algorithm parameters

Several parameters can be set during GEMLI predictions: (1) the sampling value (which fraction of genes is selected for each repetition of the iterative hierarchical clustering), (2) the number of repetitions of the iterative clustering (based on which the confidence level is calculated), (3) the number of clusters in which clusters are split during each iteration (“cluster cut”) and (4) the size of clusters at which clustering iterations are stopped (“lineage size parameter”). The default values used throughout the main figures are a sampling value of 75%, 100 repetitions, splitting clusters into 2 during each iteration, and lineage size parameter of 2–3 cell members unless otherwise indicated. The influence of changes in single parameters on predictions was tested on the unique mESC, CD8, L1210, and one HSPC, WM989, and HSC dataset (Figs. S10 and  S12). A lower sampling value increased precision while decreasing sensitivity (Fig. S10a, b). When the number of clusters into which each cluster is split is higher than the maximal lineage size to be predicted, precision values increase and sensitivity decreases (Fig. S10c, d). The number of repetitions did only had a small effect on precision and sensitivity values, with higher repetition numbers increasing precision and decreasing sensitivity of the predictions (Fig. S12e, f). Also, the lineage size parameter, meaning the size of clusters at which clustering iterations are stopped influences both precision and sensitivity. Precision will be high for predictions with a lineage size parameter of 2–3 until the average size of ground truth lineages is present (Fig. S12). FPR stayed low throughout. Only a size parameter greatly exceeding the average ground truth lineage size (Fig. S12c–e) can increase FPR to values > 1%, but still <5%. For predictions of multicellular structures (intestinal crypts and organoids) for which related cells are in spatial proximity to each other, the cluster stability of lineages predicted at increasing lineage size parameters, (cluster stability calculated using the sc3 package within the R package clustree (v.0.5.0)), reflected well the ground truth lineage size and plateaued when reaching the maximal ground truth lineage size present. In contrast, for predictions in datasets with primarily small lineages without a spatial proximity or confinement, sc3 cluster stability of predictions increased gradually with increasing lineage size parameter and did not plateau. This allowed for a de novo estimation of the lineage size until which prediction could be performed with high precision in datasets with multicellular structures (fig. S19). Depending on the aim of cell lineage predictions and the dataset characteristics (see next section), parameters and confidence level threshold can thereby be adjusted to increase precision or sensitivity respectively.

Influence of sequencing depth and cell number on GEMLI predictions

To estimate the influence of sequencing depth and dataset size on lineage predictions, GEMLI predictions were performed after downsampling the number of reads or cells in the mESC dataset, and one MEF and HSPC dataset (Fig. S15 and Supplementary Data file 1). For read downsampling, all reads of a given cell were vectorized, and subsequently, a fraction (66%, 50%, 33%, 10%) of these reads were sampled. Cells were subsampled separately to 66%, 50%, 33%, and 10% of the initial cell numbers. Cells without lineage assignment or being the only member of a cell lineage were subsampled directly to the desired fraction. Cells that were members of cell lineages with several members were subsampled together with their respective cell lineage members. Sensitivity of lineage predictions decreased with decreasing sequencing depth but the precision of lineage predictions stayed high above values of 5000–8000 reads/cell as commonly recommended for available scRNA-seq technologies37,38,39 (Fig. S15a, b). Subsamples with fewer cells generally had a slightly better precision and sensitivity, especially for very low read counts (Fig. S15c, d). FPR stayed <1% for all conditions.

Cell fate decision analysis

To assess how well GEMLI can recapitulate the prevalence of different types of cell fate decisions, the HSPC (N = 44) and WM989 (N = 8) datasets were analyzed. HSPCs datasets are derived from cells cultured in myeloid differentiation conditions and are annotated according to their cell type as basophil (Baso), eosinophil (Eosino), erythroid, lymphoid (Lymph), megakaryocyte (Meg), monocyte (Mono), neutrophil (Neutro), mast cell, plasmacytoid dendritic cell (pDC (1/2)), migratory dendritic cell (Ccr7+), classic dendritic cell (cDC) or undifferentiated (Undiff) cells (published metadata12). Only certain cell type combinations of related cells exist and all have a different prevalence, meaning that HSPCs are restricted in their fate decisions. For each possible cell type combination (for example, erythroid-eosinophil or undifferentiated-mast cell) the number of cell pairs in barcode and GEMLI lineages was determined and summed for all datasets. WM989 cells are annotated as drug-susceptible or drug-resistant (published metadata4). Members of each cell lineage are either entirely in one of these two states (symmetric cell pairs) or distributed between these two states (asymmetric cell pairs), with the latter implying that cells recently switched fate. We compared the number of barcodes- and GEMLI-predicted related cell pairs being entirely drug-susceptible, primed, or distributed between the two states. Both for HSPC and WM989 datasets, also DEGs called between different related asymmetric and symmetric cell pairs were compared in barcode and GEMLI lineages. DEG was called using the FindMarkers function of Seurat with parameters min.pct=0.05 (HSPC) or 0.25 (WM989) and logfc.threshold=0.1. For HSPCs, not all comparisons of cell type pairs allowed DEG identification. To compare GEMLI and barcode lineages, the Spearman rank correlation between enrichment scores was calculated. The performance of GEMLI in recapitulating the prevalence of lineages and encompassing different cell types (HSPC) of states (WM989) was also compared to random cell lineage assignments for HSPC and WM989 datasets. For the HSPC datasets, it was further compared to a kNN-clustering based lineage assignment. Likewise, the performance of GEMLI in recovering DEG specific to cells in lineages composed of specific cell types or states was also compared to random cell type and kNN-based lineage assignments as follows. For the DEG analysis in Fig. 3e–d the correlation between the enrichment of DEG in ground truth lineages and predicted lineages for which members were exchanged against random cells of the same cell type was calculated (Fig. S17e, f). For the HSPC datasets, DEG analysis was also performed for random cell samples and kNN-based lineage assignments in all datasets (Fig. S17a, d). Cell-type combinations predicted by GEMLI were scored and compared. Trajectory analyses were performed using the scran (v. 1.28.2) and slingshot (2.8.0) packages. Analyses were performed on the 200 (HSPC) or 2000 (WM989) most variable genes on a UMAP with cluster annotation based on a nearest-neighbor graph. Cells in transition was identified based on pseudo-time as those cells of each type that are located close to the transition point between undifferentiated cells and neutrophil cells (HSPC) or drug-susceptible and primed cells (drug-resistant; WM989) respectively. To analyze cell fate decisions in the human breast cancer cell subsets of the Janesick et al.44 dataset, DEG were called as described above for all cell pairs being members of lineages predicted with lineage size parameter 2–20 cells at confidence level 50 annotated as encompassing only DCIS or invasive tumor cells (symmetric DCIS and symmetric invasive tumor lineages respectively) or being members of lineages encompassing both cell types (asymmetric lineages). For the latter category, only DCIS or invasive tumor cells were considered when indicated.

Datasets considered for analysis

For datasets considered in each figure and analysis see Supplementary Data file 1. For all figures in which one dataset of seven cell types is represented the unique mESC, CD8, L1210, as well as the MEF dataset BIDDY_D0_2, the WM989 dataset WM989_well1, the HSC dataset HSC_seq2, and the HSPC dataset LK_D2_exp1_library_d2_2 are represented. To show a dataset encompassing a large number of asymmetric lineages, the dataset LK_D4_well1_exp1_library_d4_1_2 is used in Fig. 3e. For other panels see Supplementary Data file 1.

Statistics and reproducibility

A two-sided Mann–Whitney U-test, which does not require normally distributed data, was used to test statistical significance when indicated. No statistical method was used to predetermine sample sizes. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.