HiDDEN: a machine learning method for detection of disease-relevant populations in case-control single-cell transcriptomics data


Intuition

In a case-control experiment, typically all cells in control samples will be unaffected, and possibly only a subset of the cells in case samples will be affected by the perturbation (Fig. 1a). Using the gene expression profiles alone can fail to separate the affected from unaffected cells (Fig. 1b). Using the sample-level labels alone can fail to recover the perturbation markers (Fig. 1c). However, combining the two allows us to leverage the fact that at least some of the labels are correct, and lets a prediction model exploit the shared variability in the features of correctly labeled cells. As a result, we transform the sample-level labels into cell-specific perturbation effect scores and can assign binary cell labels representing their status as affected or unaffected (Fig. 1d). We use this information to find hard-to-detect affected subpopulations of cells, characterize their marker genes, and contrast their cellular communication patterns with those of unaffected cells.

Notation

Let $X \in \mathbb{R}^{N \times M}$ denote the matrix containing the gene expression profiles of $N$ cells across $M$ genes. Let $Z \in \mathbb{R}^{N \times K}$ denote the reduced representation of the $N$ cells in a $K$-dimensional latent space of features. Let $Y \in \{0,1\}^{N}$ denote the binary vector encoding the sample-level label of each cell, where $0$ stands for control and $1$ stands for case. We train a predictive model denoted by $h(\cdot)$ on the reduced representation $Z$ and the binary sample-level labels $Y$. Due to the binary nature of the case-control labels $Y$, the predictive model is a binary classifier modeling the probability of label $1$ given the input features, i.e., $P(Y=1 \mid Z) = h(Z)$. We train the parameters of the classifier on a dataset of interest and denote the fitted value of $P(Y=1 \mid Z)$ with $\hat{p}$. Finally, we cluster the continuous scores $\hat{p}$ to derive refined binary labels, denoted by $\widetilde{Y}$, reflecting the status of each cell as affected or unaffected by the perturbation, regardless of which sample it originated from.

Construction of the latent space

We transform the $N \times M$ gene expression matrix $X$ into an $N \times K$ matrix $Z$ containing an information-rich reduced representation of the gene expression profile of each cell in the dataset. Throughout the applications to real data in this work we used principal component analysis (PCA) for this task, which is a commonly used dimensionality reduction technique for single-cell RNA-seq data. In principle, any other information-preserving dimensionality reduction method can be used to construct the latent features in lieu of PCA, such as non-negative matrix factorization (NMF) or the latent representation from a state-of-the-art single-cell expression autoencoder11. We did not find convincing evidence that using more sophisticated dimensionality reduction techniques improved model performance (Supplementary Fig. 5) and thus adopted PCA as the most computationally efficient option.
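As a concrete illustration, here is a minimal sketch of this step using scanpy; the input file name and the choice of 50 components are placeholders, and the dataset-specific preprocessing is described in the corresponding training subsections below.

```python
import scanpy as sc

# Sketch of constructing the latent space Z with PCA (file name and parameter
# values are placeholders, not the exact settings used for any dataset).
adata = sc.read_h5ad("case_control.h5ad")   # hypothetical input
sc.pp.filter_cells(adata, min_genes=20)     # drop nearly empty cells
sc.pp.filter_genes(adata, min_cells=1)      # drop genes expressed in no cell
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.scale(adata)
sc.tl.pca(adata, n_comps=50)                # Z = adata.obsm["X_pca"][:, :K]
```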

Estimation of the continuous perturbation score

For each cell, we derive a continuous score, $\hat{p} \in [0, 1]$, reflective of the strength of the perturbation effect on that cell relative to the rest of the cells in the dataset. Throughout this work we use logistic regression for this task. That is, the predicted probability of label $1$ given the input features is given by

$$\hat{p}=P(Y=1 \mid Z)=h(Z\hat{\beta})=g^{-1}(Z\hat{\beta})$$

(1)

where $g$ is the logit link function, i.e., $g^{-1}$ is the logistic function, and $\hat{\beta}$ is the $K$-dimensional vector of fitted regression parameters. In principle, we can use any classifier $h: \mathbb{R}^{N \times K} \to [0, 1]^{N}$, including large-parameter non-linear models such as neural networks. In practice, we opted for logistic regression as a simple yet powerful model with a canonical parameter optimization routine that does not introduce additional hyperparameters, training heuristics, or increased demands on computational resources and time.
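A minimal sketch of Eq. (1) with scikit-learn, using stand-in latent features and labels in place of a real dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, K = 500, 20
Z = rng.normal(size=(N, K))         # stand-in latent features (e.g., PCA scores)
Y = rng.integers(0, 2, size=N)      # stand-in sample-level labels (0 = control, 1 = case)

# Eq. (1): fit the classifier h on (Z, Y) and take the fitted probabilities.
clf = LogisticRegression(max_iter=1000).fit(Z, Y)
p_hat = clf.predict_proba(Z)[:, 1]  # continuous perturbation score per cell
```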

Derivation of the refined binary label

Each cell in a dataset from a case-control experiment possesses a binary sample-level label reflecting whether it originated from a case or control sample. As these coarse labels do not reflect the individual cell identity of being affected or unaffected by the perturbation, we derive a new refined binary label that captures the presence or absence of a perturbation effect in each cell. When we want to distinguish between affected and unaffected cells in the case sample, we cluster the continuous perturbation scores $\hat{p}$ of all cells with initial label $1$ into two groups. The cells in the group with lower $\hat{p}$ scores receive new label $\widetilde{Y}=0$ and the cells in the group with higher $\hat{p}$ scores get a new label $\widetilde{Y}=1$, matching their old label. To do this, we use k-means clustering with $k=2$. In principle, any other clustering algorithm, such as a Gaussian mixture model with two components, can be utilized in lieu of k-means. We compared these two clustering methods across our ground truth Naive B / Memory B mixtures and concluded that they tend to perform similarly, especially for the datasets with 15% to 75% Memory B cells in the case condition. For datasets with less than 15% Memory B cells, the Gaussian mixture model approach had overall more power to detect ground truth marker genes. However, for datasets with more than 75% Memory B cells in the case condition, k-means performed better, especially due to a lower false discovery rate.
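A minimal sketch of this binarization step, continuing the stand-in example from the previous subsection (variable names are ours):

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster the scores of case cells (Y == 1) into two groups; the cluster with
# the higher mean score keeps label 1, the other is relabeled 0.
case = np.where(Y == 1)[0]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(p_hat[case].reshape(-1, 1))
affected = np.argmax(km.cluster_centers_.ravel())

Y_tilde = np.zeros_like(Y)
Y_tilde[case[km.labels_ == affected]] = 1   # refined binary labels
```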

Choosing the number of latent dimensions

When selecting $K$, the number of latent dimensions, our guiding principle is that $K$ should be chosen in a data-dependent manner, aiming to retain informative transcriptional heterogeneity while avoiding overfitting the not-entirely-correct sample-level labels. For example, note that using $K > \mathrm{rank}(X)$ will yield predictions $\hat{p}$ synonymous with the sample-level labels. In practice, this implies we should choose $K \ll \min(N, M)$. Therefore, we develop two novel data-driven heuristics to quantify the amount of informative heterogeneity retained in the latent space by measuring the perturbation signal downstream of redefining the binary labels.

The first heuristic is to use the number of differentially expressed (DE) genes defined by the refined labels $\widetilde{Y}$. A large number of DE genes indicates a meaningful signal in $\widetilde{Y}$, and in turn in $\hat{p}$ and the latent space $Z$. Intuitively, when $K$ is too small, the continuous scores $\hat{p}$ do not contain enough heterogeneity to yield labels $\widetilde{Y}$ that distinguish DE genes. Conversely, when $K$ is too large, we are overfitting to the sample-level labels, resulting in low power to detect the perturbation markers. To achieve an appropriate balance, we scan a range of values for $K$ that traverses the concave relationship between $K$ and the number of DE genes and choose the number of latent dimensions maximizing it.

The second heuristic is to use the strength of the difference between the values of $\hat{p}$ for cells in the case sample with new label $\widetilde{Y}=1$ and new label $\widetilde{Y}=0$. We quantify the probability that these two sets of values are drawn from the same distribution using the two-sample Kolmogorov-Smirnov (KS) test32. The larger the value of the KS test statistic, the more different the sample distributions of the perturbation score are between the cells predicted to be affected and unaffected. This value is generally an increasing function of $K$; therefore, we pick the smallest value of $K$ that maximizes the KS test statistic.
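The sketch below illustrates how both heuristics can be computed for a given $K$, assuming an AnnData object adata with log-normalized counts, PCA scores in adata.obsm["X_pca"], and sample-level labels (0/1) in adata.obs["label"]; the object and column names are ours, not prescribed by the method.

```python
import numpy as np
import pandas as pd
import scanpy as sc
from scipy.stats import ks_2samp
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def heuristics_for_K(adata, K):
    """Return (number of DE genes, KS statistic) for a given number of PCs K."""
    Z = adata.obsm["X_pca"][:, :K]
    Y = adata.obs["label"].to_numpy()       # assumed: 0 = control, 1 = case
    p_hat = LogisticRegression(max_iter=1000).fit(Z, Y).predict_proba(Z)[:, 1]

    # Refine the labels of case cells by 2-means clustering of p_hat.
    case = np.where(Y == 1)[0]
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(p_hat[case].reshape(-1, 1))
    hi = np.argmax(km.cluster_centers_.ravel())
    Y_tilde = Y.copy()
    Y_tilde[case] = (km.labels_ == hi).astype(int)

    # Heuristic 1: number of DE genes between the refined groups (Wilcoxon test).
    adata.obs["Y_tilde"] = pd.Categorical(Y_tilde.astype(str))
    sc.tl.rank_genes_groups(adata, "Y_tilde", groups=["1"], reference="0",
                            method="wilcoxon")
    padj = adata.uns["rank_genes_groups"]["pvals_adj"]["1"]
    n_de = int((padj < 0.05).sum())

    # Heuristic 2: KS statistic between p_hat of case cells relabeled 1 vs 0.
    ks = ks_2samp(p_hat[case][Y_tilde[case] == 1],
                  p_hat[case][Y_tilde[case] == 0]).statistic
    return n_de, ks
```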

Note that we do not need access to ground truth labels of the perturbation effect for either heuristic. When ground truth data is available, we find that the ability of the refined binary labels $\widetilde{Y}$ to represent the true perturbation effect per cell is similarly high for a wide range of values of $K$ and that either heuristic yields a choice in that range (Supplementary Fig. 9).

Assessing performance on semi-simulated ground truth data

To demonstrate and quantify the problem difficulty and to assess the power of our method, we conducted simulations using a real single-cell dataset. We used the RNA profiles of $n=1900$ Naive B and $n=1630$ Memory B cells from a dataset of peripheral blood mononuclear cells (PBMC) freely available from 10x Genomics16. We first describe our design of simulated case-control datasets by combining the two B cell subpopulations. We then describe the challenge of separating the two subpopulations using the standard single-cell clustering analysis workflow. Finally, we describe how we train our method and the metrics we use to assess its power to detect the biological signal and to compare it against related methods.

Generation of ground truth datasets

The RNA profiles of all $n=30672$ cells in the human PBMC data from 10x Genomics were clustered and annotated independently of the ATAC-seq profiles, following standard approaches detailed in the Seurat Weighted Nearest Neighbor Analysis vignette33. This resulted in 27 annotated cell types, of which we subset all Naive B and Memory B cells for the subsequent generation of ground truth datasets.

For the tSNE representation of all Naive B and all Memory B cells in Fig. 2a, we normalized the gene counts, performed variable gene selection, scaled the normalized counts, performed dimensionality reduction using PCA, built the nearest-neighbor graph, and ran tSNE, all with default hyperparameter values using the standard functions in Seurat v 3.2.326.

To comprehensively describe the problem difficulty and test the performance of our method, we constructed a collection of ground truth case-control datasets by varying two aspects (Fig. 2b). In each dataset, the control sample consists entirely of Naive B cells, which we refer to as unperturbed, whereas the case sample consists of both unperturbed and perturbed cells, which are either Memory B cells or hybrids of Naive B and Memory B cells. Each dataset is indexed by (1) the percent of perturbed cells in the case sample and (2) the strength of the perturbation.

To explore the effect of the percent of perturbed cells in the case sample, we randomly drew $(100-p)\%$ of the cells in the case sample from the Naive B cells and $p\%$ from the Memory B cells. The remaining Naive B cells were all allocated to the control sample. We explored 18 values of the percent of perturbed cells, $p \in \{5, 10, 15, \ldots, 85, 90\}\%$. The resulting number of Naive B and Memory B cells across case and control per dataset is provided in Supplementary Table 1.

To explore the effect of the strength of the perturbation, we varied the extent to which perturbed and unperturbed cells in the case sample differ from each other. Let $W^{(m)}$ and $W^{(n)}$ denote the weight of the Memory B and Naive B contribution, respectively, for each hybrid cell, where $W^{(m)}+W^{(n)}=1$ and $W^{(m)}, W^{(n)} \ge 0$; i.e., $W^{(m)}$ denotes the strength of the perturbation. Let $i$ index the perturbed (Memory B) cells in the case sample and let $N_i$ denote the total number of UMIs in cell $i$. Each Memory B / Naive B hybrid cell has the same number of UMIs, $N_i$, as the Memory B cell it originates from, $N_i^{(m)}$ of which are subsampled from the originating Memory B cell and $N_i^{(n)}$ of which are drawn from the Naive B centroid profile, as described below.

Let $\vec{x}_i$ denote the gene expression profile of the original Memory B cell $i$. Let $\vec{p}_i$ denote the normalized counts, i.e., the relative proportion of counts across genes. We drew a Memory B / Naive B hybrid gene expression profile $\vec{h}_i$ by first subsampling $N_i^{(m)}=\mathrm{ceiling}\left(W^{(m)} \cdot N_i\right)$ counts from the original Memory B expression profile $\vec{x}_i$:

$$\vec{h}_i^{(m)} \sim \mathrm{Multinomial}\left(N_i^{(m)}, \vec{p}_i\right),$$

where the ceiling function of any real number $x$ is defined as $\mathrm{ceiling}(x)=\min\{z \in \mathbb{Z} \mid z \ge x\}$, with $\mathbb{Z}$ denoting the set of integers, i.e., the ceiling is the smallest integer greater than or equal to $x$; and $\mathrm{Multinomial}$ denotes the Multinomial probability distribution.

We then drew $N_i^{(n)} := N_i - N_i^{(m)}$ counts from the Naive B centroid, $\vec{p}^{(n)}$, defined as the average normalized count vector across all Naive B cells in the dataset:

$$\vec{h}_i^{(n)} \sim \mathrm{Multinomial}\left(N_i^{(n)}, \vec{p}^{(n)}\right),$$

and finally, summed the two count vectors, $\vec{h}_i=\vec{h}_i^{(m)}+\vec{h}_i^{(n)}$, to compose the hybrid profile.
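A minimal sketch of how one hybrid profile can be drawn with numpy, following the multinomial scheme above (function and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def hybrid_profile(x_memory, p_naive_centroid, w_m):
    """Draw one Memory B / Naive B hybrid cell with perturbation strength w_m.

    x_memory         : integer count vector of the originating Memory B cell
    p_naive_centroid : average normalized count vector across all Naive B cells
    """
    N_i = int(x_memory.sum())                     # total UMIs of the original cell
    N_m = int(np.ceil(w_m * N_i))                 # counts kept from the Memory B cell
    N_n = N_i - N_m                               # counts drawn from the Naive B centroid

    p_i = x_memory / N_i                          # normalized Memory B profile
    h_m = rng.multinomial(N_m, p_i)               # subsampled Memory B counts
    h_n = rng.multinomial(N_n, p_naive_centroid)  # Naive B centroid counts
    return h_m + h_n                              # hybrid expression profile
```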

We explored 17 values of the perturbation strength parameter $W^{(m)} \in \{0.25, 0.3, 0.35, \ldots, 0.95, 1\}$. Overall, spanning both the percent of perturbed cells in case and the perturbation strength axes, we generated 72 datasets to characterize the problem difficulty and assess the performance of our method, as described below.

Clustering analysis of ground truth datasets

For the tSNE representation in Fig. 2c, we focused on the simulated dataset containing 5% Memory B cells in the case. We used the standard Seurat functions and normalized the gene counts, performed variable gene selection with default parameter values, scaled the normalized counts, performed dimensionality reduction using PCA with default parameter values, built the nearest-neighbor graph with the default number of nearest neighbors, used the Leiden algorithm with the default value of the resolution parameter to find clusters, and ran tSNE. For the bar plots in Fig. 2d and Supplementary Fig. 1, we computed the abundance of case and control-labeled cells and the abundance of Naive B and Memory B cells in each cluster.

Several hyperparameters influence the clustering results, and we varied each one to study their effects on the problem difficulty. The challenge of capturing the biological signal and separating Memory B from Naive B cells using the standard pipeline lies in (1) the construction of the latent space and (2) the resolution parameter of the clustering algorithm. To quantify the problem difficulty, we investigated the degree of separability of Naive B and Memory B cells in the latent space via the distribution of the number of Memory B nearest neighbors across Memory B cells. For a given simulated dataset, varying the number of principal components (PCs) used to build the nearest-neighbor graph impacts the separability of the latent space, with the inclusion of more PCs resulting in a more mixed latent space (Supplementary Fig. 1a). The choice of feature selection strategy, along with the choice of the resolution parameter of the Leiden clustering algorithm, has a significant impact on the results. We observed that using highly variable gene selection with the resolution parameter chosen to yield two clusters (since we are aiming to separate two cell types) fails to isolate the Memory B cells (Supplementary Fig. 1b). In this simulated ground truth setting, we can compute the differentially expressed (DE) genes, i.e., marker genes, between the two classes and use them as the selected features. However, that choice alone is also not sufficient to yield improved clustering when using the default value of the resolution parameter (Supplementary Fig. 1c).

HiDDEN model training

The HiDDEN model training was done in python and consists of three steps: (1) we preprocess the raw gene expression counts, (2) we train a logistic regression model, and (3) we binarize the predictions for cells in the case sample. First, we followed the standard preprocessing routine for single-cell RNA-seq data in scanpy26, which consists of filtering out cells expressing fewer than 20 genes and filtering out genes not expressed in any cells, followed by log-normalization; we then used the standard PCA dimensionality reduction routine in scanpy on the scaled gene features. For the comparison of the choice of dimensionality reduction technique in Supplementary Fig. 5, we trained a state-of-the-art gene expression autoencoder using the scVI framework11, varying the size of the latent space $K \in \{10, 25, 50\}$. Second, we used the LogisticRegression function from the sklearn.linear_model python library to train a logistic regression on the binary sample-level labels and the first $K$ features. Finally, we used the KMeans function from the sklearn.cluster python library with n_clusters = 2 on the continuous perturbation scores output by the logistic regression for cells in the case sample.

Note that HiDDEN does not require parameter tuning. Since we use all genes when computing the PC embedding of the data, the only parameter in the model is $K$, the number of PCs used in the training of the logistic regression. As described earlier in this section, we provide two data-driven heuristics for automatically choosing an appropriate value for $K$. For each dataset and for each heuristic, we scanned all integer values for $K$ in the range $[2, 60]$. The first heuristic is to choose the $K$ that maximizes the number of DE genes defined by the HiDDEN refined binary labels. To compute the number of DE genes downstream of a given value of $K \in [2, 60]$, we used the Wilcoxon rank-sum differential expression test in scanpy with adjusted p-value < 0.05 (Supplementary Fig. 9a). The second heuristic is to choose the smallest value of $K$ that maximizes the value of the two-sample Kolmogorov-Smirnov (KS) test statistic comparing the sampling distributions of $\hat{p}$ for cells in the case sample with new refined label $\widetilde{Y}=1$ and $\widetilde{Y}=0$ (Supplementary Fig. 9b). To compute the value of the KS test statistic, we used the ks_2samp function from the scipy.stats python library.
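A short sketch of this scan, reusing the heuristics_for_K helper from the sketch in "Choosing the number of latent dimensions":

```python
import numpy as np

Ks = list(range(2, 61))                   # integer values of K in [2, 60]
n_de, ks = zip(*(heuristics_for_K(adata, K) for K in Ks))

K_heuristic1 = Ks[int(np.argmax(n_de))]   # K maximizing the number of DE genes
K_heuristic2 = Ks[int(np.argmax(ks))]     # smallest K attaining the maximal KS statistic
                                          # (np.argmax returns the first maximum)
```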

Assessing agreement between HiDDEN continuous perturbation scores and ground truth labels

To quantify the ability of the continuous perturbation scores output by HiDDEN to capture the biological difference between Memory B and Naive B cells in each simulated dataset, we computed the Area Under the Receiver Operating Characteristic Curve (AUROC) using the roc_auc_score function from the sklearn.metrics python library (Fig. 2f, Supplementary Figs. 2a, 5a, 6a, 7I-K). The AUROC can take values in the range $[0, 1]$, with higher values indicating better agreement between the continuous prediction scores and the ground truth Naive B / Memory B binary labels.

Assessing agreement between HiDDEN-refined binary labels and ground truth labels

To evaluate the agreement between ground truth Naive B / Memory B labels and the binary labels refined by HiDDEN in each simulated dataset, we measured the accuracy of retrieved ground truth markers. Here we considered two scenarios – (1) working with the (unclustered) dataset as a whole and (2) looking for perturbation markers across clusters produced by the standard Seurat workflow.

The unclustered case: To define the set of ground truth markers for Naive B and Memory B cells in each simulated dataset, we computed the DE genes from the Naive B / Memory B ground truth labels using the Wilcoxon rank-sum differential expression test in scanpy with adjusted p-value < 0.05. Analogously, we defined the set of HiDDEN-derived marker genes, as well as a baseline set of marker genes, using the refined binary labels and the sample-level case-control labels, respectively, under the same testing procedure. We then computed the number of True Positives (TP), False Negatives (FN), and False Positives (FP), and the associated metrics of Recall, Precision, and F1-score (Fig. 2g, Supplementary Figs. 2b, 3, 4a, 5b, 6b). Recall can take values in the range $[0, 1]$, with higher values indicating a higher fraction of correctly retrieved ground truth markers. Precision can take values in the range $[0, 1]$, with lower values indicating a higher fraction of falsely discovered marker genes. The F1-score is calculated as the harmonic mean of precision and recall; it can take values in the range $[0, 1]$, with higher values indicating better agreement between the HiDDEN refined binary labels and the ground truth Naive B / Memory B binary labels.

The clustered case: We proceeded analogously to the unclustered case with the difference that DE testing was performed per cluster and the union of all DE genes across clusters defined the final gene set (Supplementary Fig. 3).
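In both scenarios, the retrieval metrics reduce to simple set operations on gene lists; a minimal sketch (set names are ours):

```python
def marker_metrics(truth, predicted):
    """Precision, recall, and F1-score of a predicted marker-gene set
    against the ground truth marker-gene set (both are Python sets)."""
    tp = len(truth & predicted)            # True Positives
    fp = len(predicted - truth)            # False Positives
    fn = len(truth - predicted)            # False Negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```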

Comparing HiDDEN to CNA, MELD, Milo, and Mixscape

While these methods are developed with different objectives within the larger question of characterizing the effect of a perturbation in single-cell data, all of them utilize expression profiles (or a neighborhood graph derived from them) and cell labels reflecting the condition of the sample a cell comes from (as well as other metadata, optionally) as input. All methods output a continuous score measuring the effect of the perturbation in each cell, which can further be binarized whenever appropriate. Therefore, we can compare the performance of these five methods along both continuous scores and binary labels. Towards that end, we use the ground truth datasets of Naive B and Memory B cell mixtures.

Training of CNA was performed in python following the jupyter notebook tutorial provided by the authors of the method34. The original CNA implementation has a hard-coded assumption that the dataset to be analyzed is composed of at least five samples. We relaxed this assumption to accommodate our B cell mixtures and obtained continuous perturbation scores as the per-cell neighborhood coefficient using the CNA association function with case/control status as the sample-level attribute of interest and case/control status as sample id. The resulting CNA continuous score is a correlation value ranging from −1 to 1.

Training of MELD was done in python following the code used by the authors of Milo in their comparison section35. We computed the k-nearest neighbors graph based on the expression matrix subsetted to the top 2000 highly variable genes. Then we used the meld_op.transform function to compute the density of the case/control sample labels and transformed the density to likelihood per condition using the meld.utils.normalize_densities function. Continuous perturbation scores, which take on values between 0 and 1, were obtained as the likelihood of label 1, which denotes the case condition.

Training of Milo was performed in python following the Differential abundance analysis in python with milopy jupyter notebook tutorial provided by the authors of the method36. The underlying implementation of Milo is done in R and calls on the glmFit routine from the edgeR R package. This routine cannot estimate the negative binomial dispersion parameter if it is not given at least three samples. To overcome this hard-coded assumption, we randomly created three samples per condition. We created the partially overlapping cellular neighborhoods using the milo.make_nhoods function. This results in a collection of neighborhoods, each identified by an index cell. A fraction of the cells in the dataset are deliberately excluded from the Milo analysis as outliers. Cells included in the analysis can also belong to one or more neighborhoods at the same time. Then we used the case/control sample-level labels to count the number of cells from each sample in each neighborhood using the milo.count_nhoods function. We used the milo.DA_nhoods function to perform differential abundance testing, which outputs a log fold-change test statistic per neighborhood. Since we aim to make comparisons at the level of individual cells, when a cell belonged to more than one neighborhood we reported the average log fold-change across its neighborhoods. For cells excluded from the analysis, the average log fold-change is NA. Continuous perturbation scores were obtained as the average log fold-change and can take on values between minus and plus infinity.

Training of Mixscape was performed in python following the code provided by the Theis lab as part of pertpy tools on github at https://github.com/theislab/pertpy/blob/development/pertpy/tools/_mixscape.py. We calculated perturbation signatures by subtracting the averaged expression profile of the 20 control neighbors from the expression profile of each cell using the perturbation_signature function. We then identified perturbed and non-perturbed cells within the case condition using the mixscape function with default hyperparameter values.

Comparison of the continuous perturbation scores from HiDDEN, CNA, MELD, Milo, and Mixscape (Supplementary Figs. 6a, 7) was performed using the AUROC described earlier in the Methods section as a metric for assessing agreement between continuous perturbation scores and ground truth labels.

Converting the continuous perturbation scores into binarized labels was done in an identical manner for HiDDEN, CNA, MELD, and Milo, as described earlier in the Methods section. Mixscape has a built-in function for computing binary perturbed/non-perturbed labels. Comparison of the binary labels across all five methods (Supplementary Figs. 6b, 8) was performed using the F1-score described earlier in the Methods section as a metric for assessing the agreement between HiDDEN-refined binary labels and ground truth labels.

Using HiDDEN-refined binary labels as input to CNA, MELD, and Milo

HiDDEN-refined binary labels demonstrate better agreement with ground truth labels than the sample-level case/control labels. Therefore, we explored the performance of CNA, MELD, and Milo when given HiDDEN binary labels in lieu of case/control labels as input (Supplementary Figs. 7, 8). CNA continuous perturbation scores were obtained as the per-cell neighborhood coefficient using the CNA association function with Memory B / Naive B ground truth labels as the sample-level attribute of interest and HiDDEN-refined binary labels as sample id. MELD continuous perturbation scores were obtained as the density of the HiDDEN-refined binary labels transformed to likelihood of label 1. Milo continuous perturbation scores were obtained as the average log fold-change downstream of differential abundance testing per neighborhood using the HiDDEN-refined binary labels to count the number of cells per condition. All continuous scores were binarized in the same manner as above.

Assessing performance on human multiple myeloma and precursor states data

The first real dataset we analyzed consists of single-cell RNA-seq profiles of human bone marrow plasma cells from patients with multiple myeloma (MM) ($n=8$ patients, $N=10790$ cells), its precursor conditions smoldering multiple myeloma (SMM) ($n=12$ patients, $N=8431$ cells) and monoclonal gammopathy of undetermined significance (MGUS) ($n=6$ patients, $N=817$ cells), and healthy donors with normal bone marrow (NBM) ($n=9$ patients, $N=9329$ cells)4.

Precursor samples can contain a mixture of neoplastic and normal cells (Fig. 3a) and the authors of the original study define two sources of ground truth describing the malignancy status of precursor samples and their cells. The first source is a manual annotation of binary labels reflecting whether a cell is healthy or malignant. The second source is a tumor-purity estimate of the proportion of malignant cells in each precursor sample.

HiDDEN model training

This dataset has pronounced patient-specific batch effects (Supplementary Fig. 12a). Therefore, echoing the analysis in the original study, we first deployed a batch-sensitive strategy to refine the malignancy status of cells in precursor samples using HiDDEN. We also developed a batch-agnostic strategy to explore the ability of our method to perform well in the presence of strong batch effects. Mirroring the within-patient annotation approach in the original study, our batch-sensitive fitting approach considers each precursor sample one at a time. We trained the model on all NBM samples, one precursor sample, and all MM samples, where we gave all NBM cells label $0$, or healthy, and all the rest label $1$, reflecting that they do not originate from healthy donors. The batch-agnostic fitting approach consists of fitting the model to all NBM samples, all precursor samples manually annotated to be mixed, and all MM samples together. Apart from this, all other aspects of model training were carried out the same way between the two strategies. The results from the batch-sensitive strategy are featured in Fig. 3, and the results of the batch-agnostic approach, along with a comparison of the two, are included in Supplementary Fig. 12.

The HiDDEN model training was done in python. First, we followed the standard preprocessing routine for single-cell RNA-seq data in scanpy and log-normalized each sample separately. We then used the standard PCA dimensionality reduction routine in scanpy on the scaled gene features. Next, we used the LogisticRegression function from the sklearn.linear_model python library to train a logistic regression on the binary NBM / non-NBM labels and the first $K$ PCs. Finally, we used the KMeans function from the sklearn.cluster python library with n_clusters = 2 on the continuous perturbation scores output by the logistic regression for cells in each precursor sample separately (Supplementary Table 2). We used all genes to compute the PC dimensionality reduction and automatically chose $K$, the number of PCs used in the logistic regression, using the heuristic that maximizes the number of DE genes downstream of the HiDDEN refined binary labels (Supplementary Fig. 13). The specific strategy for defining the DE genes in this dataset, which is characterized by strong batch effects, is described in detail below.

Under both fitting strategies, we used the same downstream metrics to evaluate the performance of our method at recovering the manually annotated cell-level labels and the purity sample-level estimates, as described below.

Assessing agreement between HiDDEN continuous perturbation scores and ground truth manually annotated labels

To quantify the ability of the continuous perturbation scores produced by HiDDEN to capture the manually annotated healthy-malignant binary labels in all mixed precursor samples, we computed the AUROC using the roc_auc_score function from the sklearn.metrics python library. The average AUROC across samples per precursor state from the batch-sensitive training strategy is depicted in Fig. 3b, and the per-sample curves and distributions of perturbation scores are included in Supplementary Fig. 12.

Assessing agreement between HiDDEN refined binary labels and tumor-purity sample estimates

The sample-level source of ground truth provided in the original study is an estimate of the proportion of malignant cells from a Bayesian hierarchical model based only on the expression of immunoglobulin light chain genes. Additionally, we estimated the per-sample tumor-purity using the manual labels and the HiDDEN-refined binary labels. Confidence bounds in all three cases were derived following the same approach as in the original study (Fig. 3c, e).

There are three quantities relevant to the ability of HiDDEN and manual annotation labels to recapitulate the ground truth point estimates of sample purity, i.e., the proportion of neoplastic cells in a sample:

1. sample purity estimates derived from HiDDEN-refined binary labels, denoted as $p_1$,

2. manual annotation-based estimates of sample purity reported in the paper the data originated from4, denoted as $p_2$,

3. ground truth point estimates of the proportion of malignant cells4, denoted as $p_3$.

Deriving significance values that quantify the ability of HiDDEN-derived labels and manual annotation labels to match the ground truth point estimates of malignant cell proportions

We treat $p_1$ and $p_2$ as random variables and the point estimate $p_3$ as a scalar. The probability distributions of $p_1$ and $p_2$ follow from the derivation of confidence bounds in the original study4 and are described in more detail below.

For a given sample, let $p$ be the proportion of neoplastic cells (denoting the population parameter and not just the estimate based on the sequenced cells from that sample). We model the observed data as

$$x \sim \mathrm{Binomial}(n, p),$$

where $n$ denotes the total number of sequenced cells in the sample, $x$ of which are labeled as neoplastic. We put a noninformative uniform prior on $p$:

$$p \sim \mathrm{Beta}(1, 1).$$

Due to the Beta-Binomial conjugacy, the posterior distribution of $p$ is available in closed form:

$$p \mid n, x \sim \mathrm{Beta}(x+1, n-x+1),$$

with $p$ taking values in $[0, 1]$.

For each of the 11 precursor samples we consider in Fig. 3c, e, we test a pair of hypotheses:

$$H_0: p_1 = p_3$$

$$H_A: p_1 \ne p_3$$

and

$$H_0: p_2 = p_3$$

$$H_A: p_2 \ne p_3.$$

Since we are testing analogous hypotheses for $p_1$ and $p_2$, below we describe the hypothesis test with respect to $p_1$.

The distribution of the test statistic (the difference between $p_1$ and $p_3$) under the null hypothesis is:

$$p_1 - p_3 \mid n, H_0 \sim \mathrm{Beta}(x_3+1, n-x_3+1) - p_3,$$

with support $[-p_3, 1-p_3]$, where $x_3$ is the expected number of neoplastic cells under the null, calculated as $p_3 \cdot n$ rounded to the nearest integer.

Therefore, the p-value is calculated as the probability of observing a test statistic equal to or more extreme (in either direction away from 0) according to the null distribution:

$$\mathbb{P}\left(\mathrm{Beta}\left(x_3+1, n-x_3+1\right) - p_3 \le -\left|p_1-p_3\right|\right) + \mathbb{P}\left(\mathrm{Beta}\left(x_3+1, n-x_3+1\right) - p_3 \ge \left|p_1-p_3\right|\right).$$

(2)

The resulting p-values for testing $p_1 - p_3$ and $p_2 - p_3$ across all 11 mixed precursor samples are reported in Supplementary Table 2. The significance depicted in Fig. 3c, e is with respect to a Bonferroni-adjusted threshold of $\alpha = 0.01/22 = 4.55 \times 10^{-4}$. A smaller p-value indicates a larger discrepancy with the ground truth. Whenever the HiDDEN-based approach produced a larger p-value than the manual annotation, we concluded that HiDDEN agreed better with the ground truth than the manual annotation approach.
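A minimal sketch of the p-value computation in Eq. (2) for one sample, using the scipy Beta distribution (function and variable names are ours):

```python
from scipy.stats import beta

def purity_pvalue(p_est, p3, n):
    """Two-sided p-value of Eq. (2) for one precursor sample.

    p_est : purity estimate from HiDDEN (p1) or manual annotation (p2)
    p3    : ground truth point estimate of the malignant-cell proportion
    n     : total number of sequenced cells in the sample
    """
    x3 = round(p3 * n)               # expected number of neoplastic cells under H0
    null = beta(x3 + 1, n - x3 + 1)  # Beta(x3+1, n-x3+1); the statistic is this minus p3
    d = abs(p_est - p3)              # observed absolute difference
    # P(statistic <= -d) + P(statistic >= d)
    return null.cdf(p3 - d) + null.sf(p3 + d)
```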

Differential expression analysis using manually annotated and HiDDEN binary labels

To demonstrate the ability of the HiDDEN-refined binary labels to discover additional malignancy markers from precursor samples, we computed the DE genes using the manual annotation in NBM and MM samples and contrasted them against the DE genes found using the HiDDEN refined labels in precursor samples (Fig. 3d). Due to the presence of strong batch effects in the data, and mirroring the DE testing strategy in the original paper, we found the DE genes per patient and took the union across patients. For a gene to be considered DE, it had to have an adjusted p-value < 0.05 from the t-test differential expression testing routine in scanpy and a maximum absolute log fold-change > 1.5.

To assess the significance of the overlap between the two sets of DE genes, we ran a hypergeometric test using the dhyper function from the stats R package. The background number of genes was calculated based on genes expressed in at least one cell from all precursor samples (18,770 genes).
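The original analysis used dhyper in R; an equivalent upper-tail computation in Python is sketched below, with hypothetical set sizes (only the background of 18,770 genes comes from the text).

```python
from scipy.stats import hypergeom

M = 18770        # background genes expressed in at least one precursor-sample cell
n_set1 = 1200    # hypothetical size of the first DE gene set
n_set2 = 900     # hypothetical size of the second DE gene set
overlap = 700    # hypothetical number of shared genes

# Probability of observing an overlap at least this large by chance.
p_value = hypergeom.sf(overlap - 1, M, n_set1, n_set2)
```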

Validation of HiDDEN binary labels in non-mixed MGUS samples

The authors of the original study computed a Bayesian non-negative matrix factorization (NMF) to highlight gene signatures that are active in this patient cohort and validated them in external cohorts. Several signatures were annotated with a biological interpretation. There were three MGUS samples considered to consist of only healthy cells, according to the manual annotation. They are also the three precursor samples with lowest, although not zero, estimated sample purity according to the Bayesian purity model from the original study. The HiDDEN refined binary labels for these patients annotate some of their cells as malignant. To validate this annotation, we plotted the mean activity of the genes identified by the original study for each signature in the cells labeled as healthy and as malignant for each sample (Fig. 3f, Supplementary Fig. 11). The confidence bounds (SEM) in Fig. 3f were derived following the same approach as in the original study.

Analysis of mouse endothelial cells from a time-course demyelination experiment

Generation of demyelination and control tissue

The second real dataset analyzed consists of single-nucleus RNA-seq profiles of mouse endothelial cells from a demyelination model with matched controls. Case and control animals received 500 nl of 1% lysophosphatidylcholine (LPC, Cat# 440154, Millipore Sigma, US) or saline vehicle (PBS) injection, respectively. This was delivered intracranially, using standard approaches, with a Nanoject III (Drummond, US) into the corpus callosum at the following stereotaxic coordinates: Anterior-Posterior: −1.2, Medio-lateral: −0.5 relative to bregma, and a depth of 1.4 mm normalized to the surface of the skull.

All mice were housed on a 12-h light/dark cycle between 68 °F and 79 °F and 30–70% humidity. All animal work was approved by the Broad's Institutional Animal Care and Use Committee (IACUC). The only mouse strain used was C57BL/6J, which was purchased directly from The Jackson Labs (cat# 000664). Mice were sacrificed at four time points: 3, 7, 12, and 18 days post injection (dpi), with $n=3$ animals per time point per condition, totaling $n=24$ animals (Fig. 4a). At the appropriate time point, mice were perfused with ice-cold pH 7.4 HEPES buffer (containing 110 mM NaCl, 10 mM HEPES, 25 mM glucose, 75 mM sucrose, 7.5 mM MgCl2, and 2.5 mM KCl) to remove blood from the brain. Brains were fresh frozen for 3 min in liquid nitrogen vapor and all tissue was stored at −80 °C for long-term storage. The full dataset will be described in a forthcoming paper (Dolan et al., in preparation).

Generation of single-nucleus RNA profiles

Frozen mouse brains were mounted onto cryostat chucks with OCT embedding compound within a cryostat. Brains were sectioned until reaching the injection site location, which was confirmed by the presence of hypercellularity using a Nissl stain (Histogene Staining Solution, KIT0415, Thermofisher). For saline controls, anatomical landmarks were used to determine the injection site. Lesions or control white matter was microdissected using a 1 mm biopsy punch (Integra Miltex, US), whose circular punch was bent into a rectangle shape with a sterile hemostat. Lesion punches were 300 μm deep.

Each excised tissue punch was placed into a pre-cooled 0.25 ml PCR tube using pre-cooled forceps and stored at −80 °C for a maximum of 24 hours. Nuclei were extracted from this frozen tissue using gentle, detergent-based dissociation, according to a protocol available at protocols.io (https://doi.org/10.17504/protocols.io.bck6iuze) with minor changes to maximize nuclei extraction, which will be described in a forthcoming paper (Dolan et al., in preparation). Nuclei were loaded into the 10x Chromium V3 system. Reverse transcription and library generation were performed according to the manufacturer's protocol (10x Genomics). Sequencing reads from mouse cerebellum experiments were demultiplexed and aligned to a mouse (mm10) pre-mRNA reference using CellRanger v3.0.2 with default settings. Digital gene expression matrices were generated with the CellRanger count function. Initial analysis and generation of the overall UMAP and clustering (Fig. 4b) were performed with Seurat v326.

Clustering analysis of endothelial cells

For the UMAP representations in Fig. 4c, d and Supplementary Fig. 14 of the profiles of $N=891$ endothelial cells from $n=6$ animals spanning both case and control conditions and all four timepoints, we used the standard scanpy functions to log-normalize and scale the gene counts, ran PCA, computed the nearest-neighbor graph with 10 neighbors in the latent space defined by the first 50 PCs, and ran the UMAP algorithm.
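A minimal sketch of this embedding in scanpy, assuming endo is an AnnData object holding the raw counts of the 891 endothelial cells (the object name is ours):

```python
import scanpy as sc

sc.pp.normalize_total(endo)
sc.pp.log1p(endo)
sc.pp.scale(endo)
sc.tl.pca(endo, n_comps=50)
sc.pp.neighbors(endo, n_neighbors=10, n_pcs=50)  # 10 neighbors, first 50 PCs
sc.tl.umap(endo)
```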

To cluster the endothelial cells (Fig. 4d), we used the standard preprocessing and clustering workflow in Seurat. We normalized the gene counts, performed variable gene selection with default parameter values, scaled the normalized counts, performed dimensionality reduction using PCA with default parameter values, built the nearest-neighbor graph with the default number of nearest neighbors, and used the Leiden algorithm with resolution parameter $0.5$ to find clusters.

For the heatmaps in Fig. 4e and Supplementary Fig. 14c, we computed the abundance of case (LPC) and control (PBS) labels in each cluster and across time points, respectively.

HiDDEN model training

Model training was performed in python on all $N=891$ endothelial cells together. We followed the standard preprocessing routine for single-cell RNA-seq data in scanpy and log-normalized the gene counts, followed by PCA dimensionality reduction of the scaled gene features. For the continuous perturbation score in Fig. 4f and Supplementary Fig. 14d, we used the LogisticRegression function from the sklearn.linear_model python library to train a logistic regression on the binary PBS / LPC labels and the first $K$ PCs. We used all genes to compute the PC embedding and automatically chose $K=5$, the number of PCs used in the logistic regression, using the heuristic that maximizes the number of DE genes downstream of the HiDDEN refined binary labels. The strategy for defining the DE genes in this dataset is described in detail below.

For the HiDDEN refined binary labels in Fig. 4g, we used the KMeans function from the sklearn.cluster python library with n_clusters = 2 to split the continuous perturbation scores of all LPC endothelial cells at 3 dpi into two groups. We denote the group with lower perturbation scores LPC0, corresponding to endothelial cells unaffected by the LPC injection and similar to endothelial cells in the PBS control condition, and the group with higher perturbation scores LPC1, as the subset of endothelial cells affected by the LPC injection.

Differential expression analysis to define endothelial LPC1 markers

To define the set of endothelial LPC1 markers depicted in the dotplot in Fig. 5a, we took the perturbation-enriched genes found uniquely in the DE analysis using the HiDDEN refined labels and not in the DE analysis using the original PBS / LPC labels. We performed both DE analyses using the Wilcoxon rank-sum test in scanpy with adjusted p-value < 0.05. The comprehensive output from both tests can be found in Supplementary Table 3, with the unique genes found using the HiDDEN refined binary labels highlighted in bold font.

Validation of endothelial LPC1 markers using RNAscope

Fresh-frozen, 14 μm sections of 3 days post injection (dpi) demyelinating or saline control tissue were mounted on cold Superfrost plus slides (Fisher Scientific, US). These slides were stored at −80 °C. We performed RNAscope Multiplex Fluorescent v2 (Advanced Cell Diagnostics, US) using probes targeting S100a6 (412981), Lgals1 (897151-C2) and Flt1 (415541-C3), where Flt1 is a general marker for endothelial cells. RNAscope was performed following the manufacturer’s protocol for fresh frozen tissue and the following dyes were used at a concentration of 1/1500 to label specific mRNAs (TSA Plus fluorescein, TSA Plus Cyanine 3, TSA Plus Cyanine 5 from PerkinElmer, USA). Imaging was performed on an Andor CSU-X spinning disk confocal system coupled to a Nikon Eclipse Ti microscope equipped with an Andor iKon-M camera. Images were acquired using 20x air and 60x oil immersion objectives (Nikon). All images shown in Fig. 5b are representative images taken from at least 2 independent experiments.

Interpretation of endothelial LPC1 markers using gene ontology analysis

For the identification of gene ontology (GO) categories summarizing the list of unique endothelial LPC1 markers (Fig. 5c, d), we performed GO enrichment analysis in g:Profiler37 with default settings and ReviGo38 to summarize and visualize the results.

Ligand-receptor analysis to identify cell-cell communication changes between endothelial LPC1 and LPC0

To contrast the ligand-receptor communication of the two endothelial LPC subtypes with neighboring cell types in the tissue (Fig. 5e), we used a modification (Goeva et al., in preparation) of CellphoneDB39. We separated the output for each interaction into three bins based on the sign of the test statistic (Supplementary Fig. 15) and the magnitude of the p-value, as reflected in the figure legend: significantly depleted in LPC1 with respect to LPC0, not significant (p-value $\ge 0.05$), and significantly enriched in LPC1 with respect to LPC0.

Statistics and reproducibility

HiDDEN was evaluated on data originating from two publicly available datasets and one novel dataset, using as many samples as possible in these datasets (no statistical method was used to predetermine the sample size and no data were excluded from the analyses). Preprocessing steps were performed according to standard practice and reported for each dataset independently. The experiments involving running computational methods on previously published publicly available datasets did not require randomization. The investigators were not blinded to allocation during experiments and assessment of outcome. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.