DELVE: feature selection for preserving biological trajectories in single-cell data – Nature Communications

DELVE

DELVE identifies a subset of dynamically-changing features that preserve the local structure of the underlying cellular trajectory. In this section, we will (1) describe computational methods for the identification and ranking of features that have non-random patterns of dynamic variation, (2) explain DELVE’s relation to previous work, and (3) provide context for the mathematical foundations behind discarding information-poor features prior to performing trajectory inference.

Problem formulation

Let ({{{{{{{bf{X}}}}}}}}={{{{{{{{{{bf{x}}}}}}}}}_{i}}}_{i=1}^{n}) denote a single-cell dataset, where ({{{{{{{{bf{x}}}}}}}}}_{i}in {{mathbb{R}}}^{d}) represents the vector of d measured features (e.g., genes or proteins) measured in cell i. We assume that the data have an inherent trajectory structure, or biologically-meaningful ordering, that can be directly inferred by a limited subset of p features where pd. Therefore, our goal is to identify this limited set of p features from the original high-dimensional feature set that best approximate the transitions of cells through each stage of the underlying dynamic process.

Step 1: Dynamic seed selection

Graph construction

Our approach DELVE extends previous similarity-based29,55,56 or subspace-learning58 feature selection methods by computing the dependence of each gene on the underlying cellular trajectory. In step 1, DELVE models cell states using a weighted k-nearest neighbor affinity graph of cells (k = 10), where nodes represent cells and edges describe the transcriptomic or proteomic similarity amongst cells according to the d profiled features encoded in X. More specifically, let ({{{{{{{mathcal{G}}}}}}}}=({{{{{{{mathcal{V}}}}}}}},{{{{{{{mathcal{E}}}}}}}})) denote a between-cell affinity graph, where ({{{{{{{mathcal{V}}}}}}}}) represents the cells and the edges, ({{{{{{{mathcal{E}}}}}}}}), are weighted according to a Gaussian kernel as,

$${{{{{{{{rm{w}}}}}}}}}_{ij}=left{begin{array}{ll}exp left(-frac{parallel {{{{{{{{bf{x}}}}}}}}}_{{v}_{i}}-{{{{{{{{bf{x}}}}}}}}}_{{v}_{j}}{parallel }^{2}}{2{sigma }_{i}^{2}}right),quad &,{{mbox{if}}},,{v}_{j}in {{{{{{{{mathcal{N}}}}}}}}}_{i} 0,hfill &,{{mbox{otherwise}}},.end{array}right.$$

(1)

Here, W is a n × n between-cell similarity matrix, where cells vi and vj are connected with an edge with edge weight wij if the cell vj is within the set of vi’s neighbors, as denoted by notation ({{{{{{{{mathcal{N}}}}}}}}}_{i}). Moreover, σi, specific for a particular cell i, represents the Gaussian kernel bandwidth parameter that controls the decay of cell similarity edge weights. We chose a bandwidth parameter as the distance to the 3rd nearest neighbor as this has been shown previously in refs. 56,154 to provide reasonable decay in similarity weights.

Identification of feature modules

To identify groups of features with similar co-expression variation, DELVE clusters features according to changes in expression across prototypical cell neighborhoods. First, cellular neighborhoods are defined according to the average expression of each set of k nearest neighbors (({{{{{{{{mathcal{N}}}}}}}}}_{i})) as, ({{{{{{{bf{Z}}}}}}}}={{{{{{{{{{bf{z}}}}}}}}}_{i}in {{mathbb{R}}}^{d}}}_{i=1}^{n}), where each ({{{{{{{{bf{z}}}}}}}}}_{i}=frac{1}{k}{sum }_{{{{{{{{{mathcal{N}}}}}}}}}_{i}}{{{{{{{{bf{x}}}}}}}}}_{i}) represents the center of the k nearest neighbors for cell i across all measured features. Next, DELVE leverages Kernel Herding sketching69 to effectively sample m representative cell neighborhoods, or rows, from the per-cell neighbor averaged feature matrix, Z, as (tilde{{{{{{{{bf{Z}}}}}}}}}={{{tilde{{{{{{{{bf{z}}}}}}}}}}_{i}in {{mathbb{R}}}^{d}}}_{i=1}^{m}). This sampling approach ensures that cellular neighborhoods are more reflective of the original distribution of cell states, while removing redundant cell states to aid in the scalability of estimating expression dynamics. DELVE then computes the average pairwise change in expression of features across representative cellular neighborhoods, Δ, as,

$${{{{Delta }}}}=frac{1}{m-1}mathop{sum }limits_{i=1}^{m}left(tilde{{{{{{{{bf{Z}}}}}}}}}-{{{{{{{{bf{j}}}}}}}}}_{m}{tilde{{{{{{{{bf{z}}}}}}}}}}_{i}^{{{{{{{{rm{T}}}}}}}}}right),$$

(2)

where jm is a column vector of ones with length m, such that ({{{{{{{{bf{j}}}}}}}}}_{m}in {{mathbb{R}}}^{m}). Here, Δ is a m × d neighborhood by feature matrix, where each row corresponds to a neighborhood and contains the average pairwise change in expression for that neighborhood across all features. Lastly, features are clustered according to the transpose of their average pairwise change in expression across the representative cellular neighborhoods, ΔT, using the KMeans + + algorithm105. In this context, each DELVE module contains a set of features with similar local changes in co-variation across cell states along the cellular trajectory.

Dynamic expression variation permutation testing

To assess whether modules of features have coordinated or noisy expression variation, we compare the average sample variance of features within a DELVE module to random assignment using a permutation test as follows. Let ({bar{S}}_{c}^{2}left({P}_{c}right)) denote the average sample variance of the average pairwise change in expression across m cell neighborhoods for the set of p features (a set of features denoted as Pc) within a DELVE cluster c as,

$${bar{S}}_{c}^{2}left({P}_{c}right)=frac{1}{| {P}_{c}| }mathop{sum }limits_{p=1}^{| {P}_{c}| }mathop{sum }limits_{i=1}^{m}frac{{left({Delta }_{i,p}-{bar{Delta }}_{p}right)}^{2}}{m-1}.$$

(3)

Moreover, let Rq denote a set of randomly selected features sampled without replacement from the full feature space d, such that Pc = Rq, and ({tilde{S}}_{c}^{2}({R}_{q})) denote the average sample variance of randomly selected feature sets averaged across t random permutations as,

$${tilde{S}}_{c}^{2}({R}_{q})=frac{1}{t}mathop{sum }limits_{q=1}^{t}{bar{S}}_{c}^{2}({R}_{q}).$$

(4)

Here, DELVE considers a module of features as being dynamically-expressed if the average sample variance of the change in expression of the set of features within a DELVE cluster (or specifically feature set Pc), is greater than random assignment, Rq, across randomly permuted trials as,

$${bar{S}}_{c}({P}_{c}) , > , {tilde{S}}_{c}({R}_{q}).$$

(5)

In doing so, this approach is able to identify and exclude modules of features that have static, random, or noisy patterns of expression variation, while retaining dynamically expressed features for ranking feature importance. Given that noisy or irrelevant features can confound inference of the underlying cellular trajectory45,46 and have been shown to corrupt the graph Laplacian for feature selection61,62, this exclusion step is crucial prior to performing for feature selection. Of note, given that KMeans++ clustering is used to initially assign features to a group, feature-to-cluster assignments can tend to vary due to algorithm stochasticity. Therefore, to reduce the variability and find a core set of features that are consistently dynamically-expressed, this process is repeated across ten random clustering initializations and the set of dynamically-expressed features are defined as the intersection across runs.

Step 2: Feature ranking

Following dynamic seed selection, in step two, DELVE ranks features according to their association with the underlying cellular trajectory graph. First, DELVE approximates the underlying cellular trajectory by constructing a between-cell affinity graph, where the nodes represent the cells and edges are now re-weighted according to a Gaussian kernel between all cells based on the core subset of dynamically expressed regulators from step 1, such that (tilde{{{{{{{{bf{X}}}}}}}}}={{tilde{{{{{{{{bf{x}}}}}}}}}}_{i}in {{mathbb{R}}}^{p}}) where pd as,

$${tilde{{{{{{{{rm{w}}}}}}}}}}_{ij}=left{begin{array}{ll}exp left(-frac{parallel {tilde{{{{{{{{bf{x}}}}}}}}}}_{{v}_{i}}-{tilde{{{{{{{{bf{x}}}}}}}}}}_{{v}_{j}}{parallel }^{2}}{2{sigma }_{i}^{2}}right),quad &,{{mbox{if}}},,{v}_{j}in {{{{{{{{mathcal{N}}}}}}}}}_{i} 0,hfill &,{{mbox{otherwise}}},.end{array}right.$$

(6)

Here, (tilde{{{{{{{{bf{W}}}}}}}}}) is a n × n between-cell similarity matrix, where cells vi and vj are connected with an edge with edge weight ({tilde{w}}_{ij}) if the cell is within the set of vi’s neighbors, denoted as ({{{{{{{{mathcal{N}}}}}}}}}_{i}). Moreover, as previously mentioned, σi represents the Gaussian kernel bandwidth parameter for a particular cell i as the distance to the 3rd nearest neighbor.

Features are then ranked according to their association with the underlying cellular trajectory graph using graph signal processing techniques55,70,71. By modeling data as a set of signals on a weighted graph, graph signal processing techniques have led to the development of new machine learning models for improved biological prediction, including regression or classification70,155, identification of prototypical cells associated with experimental perturbations82, or inference of cell signaling dynamics156. A graph signal f is any function that has a real defined value on all of the nodes, such that ({{{{{{{boldsymbol{f}}}}}}}}in {{mathbb{R}}}^{n}) and fi gives the signal at the ith node. Intuitively, in DELVE, we consider all features as graph signals and rank them according to their variation in expression along the approximate cell trajectory graph to see if they should be included or excluded from downstream analysis. Let L denote the unnormalized graph Laplacian, with ({{{{{{{bf{L}}}}}}}}={{{{{{{bf{D}}}}}}}}-tilde{{{{{{{{bf{W}}}}}}}}}), where D is a diagonal degree matrix with each element as ({d}_{ii}={sum }_{j}{tilde{{{{{{{{bf{w}}}}}}}}}}_{ij}). The local variation in the expression of feature signal f can then be defined as the weighted sum of differences in signals around a particular cell i as,

$$left({{{{{{{bf{L}}}}}}}}{{{{{{{bf{f}}}}}}}}right)(i)=mathop{sum}limits_{j}{tilde{{{{{{{{bf{w}}}}}}}}}}_{ij}left({{{{{{{bf{f}}}}}}}}(i)-{{{{{{{bf{f}}}}}}}}(j)right).$$

(7)

This metric effectively measures the similarity in expression of a particular node’s graph signal, denoted by the feature vector, f, around its k nearest neighbors. By summing the local variation in expression across all neighbors along the cellular trajectory, we can define the total variation in expression of feature graph signal f as,

$${{{{{{{{bf{f}}}}}}}}}^{{{{{{{{rm{T}}}}}}}}}{{{{{{{bf{L}}}}}}}}{{{{{{{bf{f}}}}}}}}=mathop{sum}limits_{ij}{tilde{{{{{{{{bf{w}}}}}}}}}}_{ij}{left({{{{{{{bf{f}}}}}}}}(i)-{{{{{{{bf{f}}}}}}}}(j)right)}^{2}.$$

(8)

Otherwise known as the Laplacian quadratic form71,157,158, in this context, the total variation represents the global smoothness of the particular graph signal encoded in f (e.g., expression of a particular gene or protein) along the approximate cellular trajectory graph. Intuitively, DELVE aims to retain features that have a low total variation in expression, or have similar expression values amongst similar cells along the approximate cellular trajectory graph. In contrast, DELVE excludes features that have a high total variation in expression, or those which have expression values that are rapidly oscillating amongst neighboring cells, as these features are likely noisy or not involved in the underlying dynamic process that was initially seeded.

In this work, we ranked features according to their association with the cell-to-cell affinity graph defined by a core set of dynamically expressed regulators from DELVE dynamic modules using the Laplacian score55. This measure takes into account both the total variation in expression, as well as the overall global variance. For each of the original d measured features, or graph signals encoded in f with ({{{{{{{bf{f}}}}}}}}in {{mathbb{R}}}^{n}), the Laplacian score Lf is computed as,

$${L}_{f}=frac{{tilde{{{{{{{{bf{f}}}}}}}}}}^{T}{{{{{{{bf{L}}}}}}}}tilde{{{{{{{{bf{f}}}}}}}}}}{{tilde{{{{{{{{bf{f}}}}}}}}}}^{T}{{{{{{{bf{D}}}}}}}}tilde{{{{{{{{bf{f}}}}}}}}}}.$$

(9)

Here, L represents the unnormalized graph Laplacian, such that ({{{{{{{bf{L}}}}}}}}={{{{{{{bf{D}}}}}}}}-tilde{{{{{{{{bf{W}}}}}}}}},,{{{{{{{bf{D}}}}}}}}) is a diagonal degree matrix with the ith element of the diagonal dii as ({d}_{ii}={sum }_{j}{tilde{{{{{{{{bf{w}}}}}}}}}}_{ij},tilde{{{{{{{{bf{f}}}}}}}}}) represents the mean centered expression of feature f as (tilde{{{{{{{{bf{f}}}}}}}}}={{{{{{{bf{f}}}}}}}}-frac{{{{{{{{{bf{f}}}}}}}}}^{T}{{{{{{{bf{D}}}}}}}}{{{{{{{bf{1}}}}}}}}}{{{{{{{{{bf{1}}}}}}}}}^{T}{{{{{{{bf{D}}}}}}}}{{{{{{{bf{1}}}}}}}}}), and 1 = [1, …, 1]T. By sorting features in ascending order according to their Laplacian score, DELVE effectively ranks features that best preserve the local trajectory structure (e.g., an ideal numerator has a small local variation in expression along neighboring cells), as well as best preserve cell types (e.g., an ideal denominator has large variance in expression for discriminitive power).

Guidelines on parameter selection

In this section, we provide practical recommendations for selecting hyperparameters in DELVE and describe DELVE’s sensitivity to choices in hyperparameters for both simulated single-cell RNA sequencing data and proteomic imaging data. We recommend standard quality control preprocessing prior to performing feature selection.

Number of nearest neighbors (k)

DELVE uses a k-nearest neighbor between-cell affinity graph to (1) identify dynamic modules of features and (2) rank features according to their association with the underlying cellular trajectory. The choice of number of nearest neighbors, k, in the k-nearest neighbor graph construction directly influences the connectivity of the graph. Selecting a small value of k prioritizes local relationships amongst cells, whereas selecting a large value of k connects more dissimilar cells together and prioritizes global relationships. Since DELVE ranks features by summing the differences in feature expression around nearest neighbors along the cellular trajectory graph (See Equation 8), the choice of number of nearest neighbors, k, may influence the overall resolution of feature dynamics and feature ranking. Overall, we recommend selecting a small number of nearest neighbors, such that the local connectivity of similar cell states are preserved. Moreover, we recommend constructing a k-nearest neighbors graph based on Euclidean distances in principal component space for single-cell RNA sequencing datasets that have not been variance-filtered or contain many more genes than cells, pn72,159.

Number of subsampled neighborhoods (m)

We have previously shown that selecting a small representative subset of cells with Kernel Herding sketching (1) preserves the original distribution of cell states, (2) achieves the same fidelity of downstream performance with far fewer cells, and (3) results in improved scalability and faster runtimes69. However, it is important to note that selecting too small of a sketch size (i.e., too few subsampled neighborhoods, mn) has the potential to neglect rare cell states. This may impact the identification of associated feature dynamics. Therefore, we recommend selecting a larger number of subsampled neighborhoods from the original dataset to retain expected rare cell populations that occur at a small frequency.

Number of modules (c)

To identify modules of features similar expression dynamics, DELVE clusters features according to their average pairwise change in expression across prototypical cellular neighborhoods using the K-means++ algorithm. The number of clusters, c, should be chosen according to the number and granularity of desired dynamic and static modules. Selecting too few clusters (e.g., c < 3) can result in modules of features with more heterogeneity in their expression and/or retention of correlated noisy features. Moreover, as the number of clusters becomes very large proportionally to the total number of features, clusters will contain far fewer features, which may result in reduced power to detect grouped feature dynamics in the permutation test. Of note, for particularly noisy imaging datasets with many correlated features (e.g., PDAC 4i cell cycle), we found that selecting more modules (c = 10) improved the separation and retention of a core set of dynamic cell cycle features from noisy correlated modules.

DELVE’s sensitivity to parameter choices

To evaluate DELVE’s sensitivity to choices in hyperparameters, we varied the number of neighbors per cell (k = 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100), the number of subsampled neighborhoods (m = 250, 500, 1000, 1500, 2000), and the number of KMeans++ clusters in dynamic module identification (c = 3, 4, 5, 6, 7, 8, 9, 10). Feature selection method performance was assessed on datasets with ground truth reference information (i.e., 4i RPE cell cycle dataset using 30 selected proteomic features, SymSim simulated single-cell RNA sequencing dataset using 1000 selected genes) by computing several metrics of trajectory preservation, including (1) k nearest neighbor classification accuracy between predicted and ground truth cell type labels, (2) NMI clustering score between predicted and ground truth cell type labels, and (3) Kendall rank correlation between estimated pseudotime and the ground truth measurements using either the diffusion pseudotime algorithm31 or Slingshot28 for trajectory inference. Results were computed across 20 random trials for each dataset and parameter configuration. Overall, we found DELVE was robust to choices in hyperparameters (Supplementary Figs. 14, 15). However, we note that parameter choices should be selected according to the profiled dataset.

Benchmarked feature selection methods

In this section, we describe the twelve feature selection methods evaluated for representing biological trajectories. For more details on implementation and hyperparameters, see Supplementary Table 1 and Guidelines on parameter selection.

Random forest

To quantitatively compare feature selection approaches on preserving biologically relevant genes or proteins, we aimed to implement an approach that would leverage ground truth cell type labels to determine feature importance. Random forest classification49 is a supervised ensemble learning algorithm that uses an ensemble of decision trees to partition the feature space such that all of the samples (cells) with the same class (cell type labels) are grouped together. Each decision or split of a tree was chosen by minimizing the Gini impurity score as,

$$G(m)=mathop{sum }limits_{i=1}^{C}{p}_{mi}(1-{p}_{mi}).$$

(10)

Here, pmi is the proportion of cells that belong to class i for a feature node m, and C is the total number of classes (e.g., cell types). We performed random forest classification using the scikit-learn v0.23.2 package in python. Nested 10-fold cross-validation was performed using stratified random sampling to assign cells to either a training or test set. The number of trees was tuned over a grid search within each fold prior to training the model. Feature importance scores were subsequently determined by the average Gini importance across folds.

Max variance

Max variance is an unsupervised feature selection approach that uses sample variance as a criterion for retaining discriminative features, where ({hat{S}}_{f}^{2}) represents the sample variance for feature ({{{{{{{bf{f}}}}}}}}in {{mathbb{R}}}^{n}) as,

$${S}_{f}^{2}=frac{1}{n-1}mathop{sum }limits_{i=1}^{n}{({f}_{i}-bar{f})}^{2},$$

(11)

where fi indicates the expression value of feature f in cell i. We performed max variance feature selection by sorting features in descending order according to their variance score and selecting the top p maximally varying features.

Neighborhood variance

Neighborhood variance29 is an unsupervised feature selection approach that uses a local neighborhood variance metric to select gradually-changing features for building biological trajectories. Namely, the neighborhood variance metric ({tilde{S}}_{f}^{2}) quantifies how much feature f varies across neighboring cells as,

$${tilde{S}}_{f}^{2}=frac{1}{n{k}_{c}-1}mathop{sum }limits_{i=1}^{n}mathop{sum }limits_{j=1}^{{k}_{c}}{({f}_{i}-{f}_{{{{{{{{{mathcal{N}}}}}}}}}_{(i,j)}})}^{2}.$$

(12)

Here, fi represents the expression value of feature f for cell (i,{{{{{{{{mathcal{N}}}}}}}}}_{(i,j)}) indicates the j nearest neighbor of cell i, and kc is the minimum number of k-nearest neighbors required to form a fully connected graph. Features were subsequently selected if they had a smaller neighborhood variance ({tilde{S}}_{f}^{2}) than global variance ({S}_{f}^{2}),

$$frac{{S}_{f}^{2}}{{tilde{S}}_{f}^{2}} , > , 1.$$

(13)

Highly variable genes

Highly variable gene selection53 is an unsupervised feature selection approach that selects features according to a normalized dispersion measure. First, features are binned based on their average expression. Within each bin, genes are then z-score normalized to identify features that have a large variance, yet a similar mean expression. We selected the top p features using the highly variable genes function in Scanpy v1.9.3 (flavor = Seurat, bins = 20, n_top_genes = p).

Laplacian score

Laplacian score (LS)55 is a locality-preserving unsupervised feature selection method that ranks features according to (1) how well a feature’s expression is consistent across neighboring cells defined by a between-cell similarity graph define by all profiled features and (2) the feature’s global variance. First, a weighted k-nearest neighbor affinity graph of cells (k = 10) is constructed according to pairwise Euclidean distances between cells based on all features, X. More specifically, let ({{{{{{{mathcal{G}}}}}}}}=({{{{{{{mathcal{V}}}}}}}},{{{{{{{mathcal{E}}}}}}}})), where ({{{{{{{mathcal{V}}}}}}}}) represents the cells and edges, ({{{{{{{mathcal{E}}}}}}}}), are weighted using a Gaussian as follows. Specifically, edge weights between cells i and j can be defined as,

$${{{{{{{{rm{w}}}}}}}}}_{ij}=left{begin{array}{ll}exp left(-frac{parallel {{{{{{{{bf{x}}}}}}}}}_{{v}_{i}}-{{{{{{{{bf{x}}}}}}}}}_{{v}_{j}}{parallel }^{2}}{2{sigma }_{i}^{2}}right),quad &,{{mbox{if}}},,{v}_{j}in {{{{{{{{mathcal{N}}}}}}}}}_{i} 0,hfill &,{{mbox{otherwise}}},.end{array}right.$$

(14)

Here W is a n × n between-cell similarity matrix, where cells vi and vj are connected with an edge with edge weight wij if the cell vj is within the set of vi’s neighbors, ({{{{{{{{mathcal{N}}}}}}}}}_{i}). Moreover, as previously described, σi represents the bandwidth parameter for cell i defined as the distance to the 3rd nearest neighbor. For each feature f, where ({{{{{{{bf{f}}}}}}}}in {{mathbb{R}}}^{n}) represents the value of the feature across all n cells, we compute the Laplacian score, Lf as,

$${L}_{f}=frac{{tilde{{{{{{{{bf{f}}}}}}}}}}^{T}{{{{{{{bf{L}}}}}}}}tilde{{{{{{{{bf{f}}}}}}}}}}{{tilde{{{{{{{{bf{f}}}}}}}}}}^{T}{{{{{{{bf{D}}}}}}}}tilde{{{{{{{{bf{f}}}}}}}}}}.$$

(15)

Here, L represents the unnormalized graph Laplacian, with L = D − W, D is a diagonal degree matrix with the ith element of the diagonal dii as ({d}_{ii}={sum }_{j}{w}_{ij},,tilde{{{{{{{{bf{f}}}}}}}}}) represents the mean centered expression of feature f as (tilde{{{{{{{{bf{f}}}}}}}}}={{{{{{{bf{f}}}}}}}}-frac{{{{{{{{{bf{f}}}}}}}}}^{T}{{{{{{{bf{D}}}}}}}}{{{{{{{bf{1}}}}}}}}}{{{{{{{{{bf{1}}}}}}}}}^{T}{{{{{{{bf{D}}}}}}}}{{{{{{{bf{1}}}}}}}}}), and 1 = [1, …, 1]T. We performed feature selection by sorting features in ascending order according to their Laplacian score and selecting the top p features.

MCFS

Multi-cluster feature selection (MCFS)58 is an unsupervised feature selection method that selects for features that best preserve the multi-cluster structure of data by solving an L1 regularized least squares regression problem on the spectral embedding. Similar to the Laplacian score, first k-nearest neighbor affinity graph of cells (k = 10) is computed to encode the similarities in feature expression between cells i and j using a Gaussian kernel as,

$${{{{{{{{rm{w}}}}}}}}}_{ij}=left{begin{array}{ll}exp left(-frac{parallel {{{{{{{{bf{x}}}}}}}}}_{{v}_{i}}-{{{{{{{{bf{x}}}}}}}}}_{{v}_{j}}{parallel }^{2}}{2{sigma }_{i}^{2}}right),quad &,{{mbox{if}}},,{v}_{j}in {{{{{{{{mathcal{N}}}}}}}}}_{i} 0,hfill &,{{mbox{otherwise}}},.end{array}right.$$

(16)

Similar to previous formulations above, W is an n × n between cell affinity matrix, where a pair of cells vi and vj are connected with an edge with weight wij if cell vj is within the set of vi’s neighbors, ({{{{{{{{mathcal{N}}}}}}}}}_{i}). Further, σi represents the kernel bandwidth parameter chosen to be the distance to the third nearest neighbor from cell i. Next, to represent the intrinsic dimensionality of the data, the spectral embedding160 is computed through eigendecomposition of the unnormalized graph Laplacian L, where L = D − W as,

$${{{{{{{bf{L}}}}}}}}{{{{{{{bf{y}}}}}}}}=lambda {{{{{{{bf{D}}}}}}}}{{{{{{{bf{y}}}}}}}}.$$

(17)

Here, ({{{{{{{bf{Y}}}}}}}}={{{{{{{{{bf{y}}}}}}}}}}_{k=1}^{K}) are the eigenvectors corresponding to the K smallest eigenvalues, W is a symmetric affinity matrix encoding cell similarity weights, and D represents a diagonal degree matrix with each element as dii = ∑jwij. Given that eigenvectors of the graph Laplacian represent frequency harmonics71 and low frequency eigenvectors are considered to capture the informative structure of the data, MCFS computes the importance of each feature along each intrinsic dimension yk by finding a relevant subset of features by minimizing the error using an L1 norm penalty as,

$$mathop{min }limits_{{{{{{{{{bf{a}}}}}}}}}_{k}}parallel !{{{{{{{{bf{y}}}}}}}}}_{k}-{{{{{{{{bf{X}}}}}}}}}^{T}{{{{{{{{bf{a}}}}}}}}}_{k}{parallel }^{2}quad {{{{{{{rm{s}}}}}}}}.{{{{{{{rm{t}}}}}}}}.quad parallel !{{{{{{{{bf{a}}}}}}}}}_{k}!{parallel }_{1}le gamma .$$

(18)

Here, the non-zero coefficients, ak, indicate the most relevant features for distinguishing clusters from the embedding space, yk and γ controls the sparsity and ensures the least relevant coefficients are shrunk to zero. The optimization is solved using the least angles regression algorithm74, where for every feature, the MCFS score is defined as,

$${{{{{{{rm{MCFS}}}}}}}}(j)=mathop{max }limits_{k}parallel !!{a}_{k,j}!!parallel .$$

(19)

Here, j and k index feature and eigenvector, respectively. We performed multi-cluster feature selection with the number of eigenvectors K chosen to be the number of ground truth cell types present within the data, as this is the traditional convention in spectral clustering60 and the number of nonzero coefficients was set to the number of selected features, p.

SCMER

Single-cell manifold-preserving feature selection (SCMER)57 selects a subset of p features that represent the embedding structure of the data by learning a sparse weight vector w by formulating an elastic net regression problem that minimizes the KL divergence between a cell similarity matrix defined by all features and one defined by a reduced subset of features. More specifically, let P denote a between-cell pairwise similarity matrix defined in UMAP59 computed with the full data matrix ({{{{{{{bf{X}}}}}}}}in {{mathbb{R}}}^{ntimes d}) and Q denote a between-cell pairwise similarity matrix defined in UMAP computed with the dataset following feature selection ({{{{{{{bf{Y}}}}}}}}in {{mathbb{R}}}^{ntimes p}), where Y = Xw and pd. Here, elastic net regression is used to find a sparse and robust solution of w that minimizes the KL divergence as,

$${{{{{{{rm{KL}}}}}}}}left({{{{{{{bf{P}}}}}}}}parallel {{{{{{{bf{Q}}}}}}}}right)=mathop{sum}limits_{i}mathop{sum}limits_{j}{p}_{ij}log frac{{p}_{ij}}{{q}_{ij}}.$$

(20)

Features with non-zero weights in w are considered useful for preserving the embedding structure and selected for downstream analysis. We performed SCMER feature selection using the scmer v.0.1.0a3 package in python by constructing a k-nearest neighbor graph (k = 10) according to pairwise Euclidean distances of cells based on their first 50 principal components and using the default regression penalty weight parameters (lasso = 3.87e − 4, ridge = 0).

Hotspot

Hotspot56 is an unsupervised gene module identification approach that performs feature selection through a test statistic that measures the association of a gene’s expression with the between-cell similarity graph defined based on the full feature matrix, X. More specifically, first, a k-nearest neighbor cell affinity graph (k = 10) is defined based on pairwise Euclidean distances between all pairs of cells using a Gaussian kernel as,

$${{{{{{{{rm{w}}}}}}}}}_{ij}=left{begin{array}{ll}exp left(-frac{parallel {{{{{{{{bf{x}}}}}}}}}_{{v}_{i}}-{{{{{{{{bf{x}}}}}}}}}_{{v}_{i}j}{parallel }^{2}}{{sigma }_{i}^{2}}right),quad &,{{mbox{if}}},,{v}_{j}in {{{{{{{{mathcal{N}}}}}}}}}_{i} 0,hfill &,{{mbox{otherwise}}},.end{array}right.$$

(21)

Here, cells vi and vj are connected with an edge with edge weight wij if the cell vj is within the set of vi’s neighbors such that ∑jwij = 1 for each cell and σi represents the bandwidth parameter for cell i defined as the distance to the (frac{k}{3}) neighbor. For a given feature ({{{{{{{bf{f}}}}}}}}in {{mathbb{R}}}^{n}), representing expression across all n cells where fi is the mean-centered and standardized expression of feature f in cell i according to a null distribution model of gene expression, the local autocorrelation test statistic representing the dependence of each gene on the graph structure is defined as,

$${H}_{f}=mathop{sum}limits_{i}mathop{sum}limits_{jne i}{w}_{ij}{f}_{i}{f}_{j}.$$

(22)

Hotspot was implemented using the hotspotsc v1.1.1 package in python, where we selected the top p features by sorting features in ascending order according to their significance with respect to a null model. For the single-cell RNA sequencing datasets where count data were available (splatter simulation, SymSim simulation, DE differentiation datasets), or for the the indirect immunofluorescence imaging datasets (RPE, PDAC), the null model of counts was defined by a negative binomial distribution. For the single-cell RNA sequencing datasets where only log normalized data were available (CD8 differentiation), the null model of normalized counts was defined as a normal distribution.

All features

To consider a baseline representation without feature selection, we evaluated performance using all features from each dataset following quality control preprocessing.

Random features

As a second baseline strategy, we simply selected a subset of random features without replacement. Results were computed across twenty random initializations for each dataset.

DELVE

DELVE was run as previously described. Here, we constructed a weighted k-nearest neighbor affinity graph of cells (k = 10), and 1000 neighborhoods were sketched to identify dynamic seed feature clusters (c = 3 for the Splatter simulated datasets, c = 5 for the SymSim simulated, RPE cell cycle, CD8 T cell differentiation, and DE differentiation datasets, or c = 10 for the PDAC cell cycle datasets). For practical guidelines on parameter selection, see Guidelines on parameter selection.

Datasets

We evaluated feature selection methods based on how well retained features could adequately recover biological trajectories under various noise conditions, biological contexts, and single-cell technologies.

Splatter simulation

Splatter75 is a single-cell RNA sequencing simulation software that generates count data using a gamma-Poisson hierarchical model with modifications to alter the mean-variance relationship amongst genes, the library size, or sparsity. We used splatter to simulate a total of 90 ground truth datasets with different trajectory structures (e.g., 30 linear datasets, 30 bifurcation datasets, and 30 tree datasets). First, we estimated simulation parameters by fitting the model to a real single-cell RNA sequencing dataset consisting of human pluripotent stem cells differentiating into mesoderm progenitors138. We then used the estimated parameters (mean_rate = 0.0173, mean_shape = 0.54, lib_loc = 12.6, lib_scale = 0.423, out_prob = 0.000342, out_fac_loc = 0.1, out_fac_scale = 0.4, bcv = 0.1, bcv_df = 90.2, dropout = None) to simulate a diverse set of ground truth reference trajectory datasets with the splatter paths function (python wrapper scprep SplatSimulate v1.2.3 of splatter v1.18.2). Here, a reference trajectory structure (e.g., bifurcation) was used to simulate linear and nonlinear changes in the mean expression of genes along each step of the specified differentiation path. We simulated differentiation datasets (1500 cells, 500 genes, 6 clusters) for each trajectory type (linear, bifurcation, tree) by modifying (1) the probability of a cell belonging to a cluster by randomly sampling from a Dirichlet distribution with six categories and a uniform concentration of one and (2) the path skew by randomly sampling from a beta distribution (α = 10, β = 10). The output of each simulation is a ground truth reference consisting of cell-to-cluster membership, differentially expressed genes per cluster or path, as well as a latent step vector that indicates the progression of each cell within a cluster. Lastly, we modified the step vector to be monotonically increasing across clusters within the specified differentiation path to obtain a reference pseudotime measurement.

To estimate how well feature selection methods can identify genes that represent cell populations and are differentially expressed along a differentiation path in noisy single-cell RNA sequencing data, we added relevant sources of biological and technical noise to the reference datasets.

  1. 1.

    Biological Coefficient of Variation (BCV): To simulate the effect of stochastic gene expression, we modified the biological coefficient of variation parameter within splatter (BCV = 0.1, 0.25, 0.5). This scaling factor controls the mean-variance relationship between genes, where lowly expressed genes are more variable than highly expressed genes, following a γ distribution. This corresponded to an approximate mean coefficient of variation 1.55, 1.60, and 1.75 when averaged across all genes and max coefficient of variation of 2.90, 2.95, and 3.10.

  2. 2.

    Library size: The total number of profiled mRNA transcripts per cell, or library size, can vary between cells within a single-cell RNA sequencing experiment and can influence the detection of differentially expressed genes76, as well as impact the reproducibility of the lower-dimensional representation of the data77. To simulate the effect of differences in sequencing depth, we proportionally adjusted the gene means for each cell by modifying the location parameter (lib_loc = 12, 11, 10) of the log-normal distribution within splatter that estimates the library size scaling factors. This corresponded to an average library size of approximately 3e4, 1.7e4 and 9e3.

  3. 3.

    Technical dropout: Single-cell RNA sequencing data contain a large proportion of zeros, where only a small fraction of total transcripts are detected due to capture inefficiency and amplification noise161. To simulate the inefficient capture of mRNA molecules and account for the trend that lowly expressed genes are more likely to be affected by dropout, we undersampled mRNA counts by sampling from a binomial distribution with the scale parameter or dropout rate proportional to the mean expression of each gene as previously described in ref. 162 as,

    $${r}_{i}=exp left(-lambda {mu }_{i}^{2}right).$$

    (23)

    Here, μi represents the log mean expression of gene i, and λ is a hyperparameter that controls the magnitude of dropout (λ = 0, 0.05, 0.1). This corresponded to an approximate undersampling percentage of 0, 10, and 20 percent when averaged across all cells for all genes.

In our subsequent feature selection method analyses, we selected the top p = 100 features under each feature selection approach.

SymSim simulation

SymSim86 is a single cell RNA sequencing simulation software that uses a kinetic model of gene expression followed by library preparation and sequencing simulation to model intrinsic, extrinsic, and technical variability in single-cell RNA sequencing data. This approach was shown in a recent benchmarking study87 to be amongst the top ranking methods for reasonably simulating single-cell RNA sequencing data as measured by their accuracy in estimating data properties (e.g., library size, TMM, mean expression, scaled variance, fractions of zeros, cell and gene correlations) and their ability to preserve biological signals (e.g., differentially expressed genes, differentially variable genes). We used SymSim v0.0.0.9000 in R v4.1.1 to simulate tree differentiation trajectories with 10000 cells, 20000 genes, and 4 cell types. Simulation parameters (σ = 0.6, 0.4, α_mean = (0.01, 0.02, 0.03, 0.04, 0.05), α_sd = 0.02, evf_type = continuous, protocol = UMI, vary = s, nevf = 1.5% totalgenes, n_de_evf = 1% totalgenes, depth_mean = 5e5, depth_sd = 3e4) were chosen from Table 2 in ref. 87, as this table contained simulation parameters that achieved the most similar distributions of statistics to real single-cell RNA sequencing data. To evaluate how well feature selection methods could identify genes that define cellular trajectories when subjected to a reduction of the total mRNA count, we performed two experiments where we simulated five differentiation trajectories by modifying the mean mRNA capture efficiency rate α_mean = (0.01, 0.02, 0.03, 0.04, 0.05) for cells with a high within population variability (σ = 0.6) and low within population variability (σ = 0.4). The output of each simulated dataset was a ground truth reference trajectory containing cell-to-cluster membership, as well as a latent pseudotime vector that indicated cell progression along each population in the trajectory. In our subsequent feature selection method analyses, we first performed principal components analysis (n_pcs = 50) and then selected the top p = 2000, 1000, 500, 250, 100, or 50 features under each feature selection approach.

RPE analysis

The retinal pigmented epithelial (RPE) dataset17 is an iterative indirect immunofluorescence imaging (4i) dataset consisting of RPE cells undergoing the cell cycle. Here, time-lapse imaging was performed on an asynchronous population of non-transformed human retinal pigmented epithelial cells expressing a PCNA-mTurquoise2 reporter in order to record the cell cycle phase (G0/G1, S, G2, M) and age (time since last mitosis) of each cell. Following time-lapse imaging, cells were fixed and 48 core cell cycle effectors were profiled using 4i8. For preprocessing, we min-max normalized the data and performed batch effect correction on the replicates using the ComBat163 function in Scanpy v1.9.3. Lastly, to refine phase annotations and distinguish G0 from G1 cells, we selected cycling G1 cells according to the bimodal distribution of pRB/RB expression as described in ref. 17. Of note, cells were excluded if they did not have ground truth phase or age annotations. The resultant dataset consisted of 2759 cells × 241 imaging-derived features describing the expression and localization of different protein markers. For our subsequent analysis, we selected the top p = 30 imaging-derived features for each feature selection approach.

PDAC analysis

The pancreatic ductal adenocarcinoma (PDAC) cell cycle dataset164 is an iterative indirect immunofluorescence imaging dataset that profiled 63 core cell cycle effectors in 9 human PDAC cell lines: BxPC3, CFPAC, HPAC, MiaPaCa, Pa01C, Pa02C, Pa16C, PANC1, UM53. Each dataset resulted in d = 253 imaging-derived features representing the expression and localization of different protein markers. For each cell line (e.g., BxPC3) under control conditions, we min-max normalized the data. Cell cycle phases (G0, G1, S, G2, M) were annotated a priori based on manual gating cells according to the abundance of core cell cycle markers. Phospho-RB (pRB) was used to distinguish proliferative cells (G1/S/G2/M, high pRB) from arrested cells (G0, low pRB). DNA content, E2F1, cyclin A (cycA), and phospho-p21 (p-p21) were used to distinguish G1 (DNA content = 2C, low cycA), S (DNA content = 2-4C, high E2F1), G2 (DNA content = 4C, high cycA), and M (DNA content = 4C, high p-p21). For our subsequent analysis, we selected the top p = 30 features for each feature selection approach.

CD8+ T cell differentiation analysis

The CD8+ T cell differentiation dataset107 is a single-cell RNA sequencing dataset consisting of mouse splenic CD8+ T cells profiled over 12-time points (d = day) following infection with the Armstrong strain of the lymphocytic choriomeningitis virus: Naive, d3-, d4-, d5-, d6-, d7-, d10-, d14-, d21-, d32-, d60-, d90- post-infection. Spleen single-cell RNA sequencing data were accessed from the Gene Expression Omnibus using the accession code GSE131847 and concatenated into a single matrix. The dataset was subsequently quality control filtered according to the distribution of molecular counts. To remove dead or dying cells, we filtered cells that had more than twenty percent of their total reads mapped to mitochondrial transcripts. Genes that were observed in less than three cells or had less than 400 counts were also removed. Following cell and gene filtering, the data were transcripts-per-million normalized, log+1 transformed, and variance filtered using highly variable gene selection, such that the resulting dataset consisted of 29893 cells × 2000 genes (See Highly variable genes). When evaluating feature selection methods, we first performed principal components analysis n_pcs = 50, and then selected the top p = 500 features for each feature selection approach.

DE differentiation analysis

The DE differentiation dataset is a multiplexed single-cell RNA sequencing dataset consisting of human embryonic stem cells (hESCs) differentiating into the DE. Cells were profiled over a 2 day time course following induction of TGFβ and WNT signaling pathways: day 0, day 1-, day2-post treatment.

  1. 1.

    Cell culture of human embryonic stem cells: H9 human embryonic stem cells (WiCell) were grown in the medium mTesR1 (STEMCELL Technologies) in tissue culture dishes coated with Matrigel (Corning; 1:100 in DMEM/F12) overnight at 4 oC. Media changes were performed daily. Cells were routinely passaged every 2–3 days using Dulbecco’s PBS/0.5 mM EDTA and treated with ROCK inhibitor Y27672 (10 μM; STEMCELL Technologies) for 24 h after passaging. The cells were kept at 37 oC, and 5% CO2 in a tissue culture incubator.

  2. 2.

    Definitive endoderm differentiation: Definitive endoderm differentiation of H9 human embryonic stem cells (WiCell) was performed according by adapting the modified D’Amour protocol previously described in ref. 129. For each treatment group, 100K cells were seeded into a single well of a 12-well plate with mTeSR (Stemcell Technologies) and ROCK inhibitor Y-27632 (10 μM; STEMCELL Technologies). After 24 h, cells were switched to mTeSR without ROCK inhibitor Y-27632 for an additional 24 h. To induce definitive endoderm differentiation, cells were treated with Activin A (100ng/mL; Peprotech) and CHIR90992 (5 μM; Peprotech) in RPMI 1640 media with B27 supplement for 24 h (d1-post treatment) followed by Activin A (100ng/mL; Peprotech) in RPMI 1640 media with B27 supplement for the final 24 h (d2-post treatment). Cell seeding was staggered for simultaneous cell collection and labeling across time points.

  3. 3.

    Multiplexed single-cell RNA sequencing: We performed multiplexed single-cell RNA sequencing by adapting an approach previously outlined in ref. 128 that uses Click Chemistry to tag cells with unique oligo barcodes for condition-specific cell labeling and multiplexing prior to single-cell RNA sequencing. Condition-specific oligo barcodes were designed according to 10X Genomics’ specifications for feature barcoding of cell surface proteins. Each condition-specific oligo barcode contained a capture sequence 1, tru seq read 2, a unique feature barcode selected from the list of whitelisted barcodes provided by 10X Genomics, and an amine group (Am6C) on the 5′ end to allow for the required click chemistry alterations. Condition-specific oligo barcodes (desalted, 250 nmol, IDT) were reconstituted in 250 μL distilled water and spun for 10 min to remove any insoluble debris from synthesis. Barcodes were then subjected to ethanol precipitation and resuspended in 100 muL 1X borate buffer (ThermoFisher Scientific). Concentrations in μM were calculated as (A260/ϵ × Dilution Factor × 106). Then 35 nmol was diluted to 50 μL with water and mixed with 7 muL 10X BBS (0.5 M borate + 1.5 M NaCl) and 7 μL DMSO. The barcodes were then reacted with 2 additions of 3.5 μL freshly prepared 100 mM MTZ-PEG4-NHS (Click Chemistry Tools) in DMSO added at 15 min intervals. The reaction was quenched with 1.45 μL 1 M glycine for 5 min, then ethanol precipitated and resuspended in 100 μL HEPES buffer. The concentration of each barcode was measured and samples were normalized to 100 μM. Prior to cell labeling, cells were washed twice with HBSS (Gibco) to remove any extracellular proteins. Cells were then incubated in TCO-PEG4-NHS (Click Chemistry Tools) in HBSS for 15 min at room temperature on a rotating platform. This solution was prepared immediately prior to incubation with cells to minimize hydrolysis of NHS groups. After incubation, the TCO-PEG4-NHS solution was removed and the cells were incubated with MTZ-activated feature barcodes at ~10 μM in DMEM:F12 for an additional 20 min at 37 oC. The reaction was then quenched by incubation for 5 min with Tris HCl and MTZ-DBCO. Cells were washed once in PBS-EDTA then passaged using PBS-EDTA and collected with PBS-BSA. Cells were filtered and resuspended to a single cell suspension then equal numbers of cells from each treatment were combined and loaded onto a single lane of a 10X Genomics chip.

  4. 4.

    Single-cell RNA sequencing preprocessing and quality control: cDNA libraries were prepared for single-cell RNA sequencing using the 10X Genomics Version 3 platform and analyzed with the Cell Ranger pipeline v3.0.2. Raw sequencing Binary Base Call files were computationally demultiplexed by sample indices and converted into FASTQ files with the cellranger mkfastq function. This pipeline outputs two sets of FASTQ files, one corresponding to gene expression profiles and the other containing feature barcodes. Reads were then aligned to either the human reference transcriptome (Hg19) or the custom feature barcode reference file with the cellranger count function. Cells were then filtered by the inflection point on the barcode rank plot to eliminate background-associated cell barcodes. Loom files were generated using default parameters with the Velocyto v0.17 package in python. To computationally demultiplex cells following single-cell RNA sequencing, unique condition-specific oligo barcodes from the unified filtered feature-barcode matrix were median normalized to correct for differences in signal-to-noise and linked back to a sampling time point according to its maximum condition-specific feature barcode (e.g., day 0). Single-cell RNA sequencing quality control was performed according to the distributions of count depth, genes per cell, and the fraction of mitochondrial genes per cell. To remove dead or dying cells, we filtered cells that had more than twenty percent of their total reads mapped to mitochondrial transcripts. To remove empty droplets or cell doublets, cells were required to have a minimum number of 1500 transcript counts or a maximum number of 90000 transcript counts and were filtered out if they had less than 650 genes per cell or more than 7200 genes per cell. Following cell and gene filtering, the data were counts-per-million normalized, log+1 transformed, and variance filtered using highly variable gene selection, such that the resulting dataset consisted of 5397 cells × 2000 genes (See Highly variable genes).

  5. 5.

    RNA velocity estimation: RNA velocity125,126 was calculated using the dynamical model implementation in ScVelo v0.2.5. More specifically, first and second order moments for velocity estimation were computed for each cell based on a k-nearest neighbor graph (k = 10) constructed according to pairwise Euclidean distances between cells using the first 50 principal components. The full dynamical model was then solved for all genes to obtain a high dimensional velocity vector for every cell. We then performed a likelihood ratio test for differential kinetics amongst the cell populations defined according to the time point labels to account for any differences in mRNA splicing or degradation kinetics. Groups of cells that exhibited different kinetic regimes were fit independently and velocity vectors were corrected.

When evaluating feature selection methods, we first performed principal components analysis, n_pcs = 50, and then selected the top p = 500 features for each feature selection approach.

Statistics and reproducibility

For all publicly available datasets used in this study (CD8T, RPE, PDAC), the sample sizes, number of replicates, randomization, and blinding were determined by the original authors. No sample size calculations were performed for the RPE (ref. 17), PDAC (ref. 164), CD8T (ref. 107), or DE datasets. The RPE study was performed in technical duplicates (ref. 17), the PDAC study had a single well per cell line (ref. 164), and the CD8T cell differentiation study collected CD8T cells that were pooled from approximately 1-6 mice at each timepoint. Here each timepoint of the analysis represented an independent experiment (ref. 107). The RPE (ref. 17) and PDAC (ref. 164) studies had no randomization as there was no treatment induction for the control data. For the CD8T cell study, mice were randomly allocated into groups before adoptive transfer and mice were randomly selected for cell harvesting at specific time points (ref. 107). For the DE dataset, cells were randomly placed into three replicate wells for each condition, treated with differentiation stimuli, and pooled prior to acquisition. There was no blinding to experimental allocation or outcome association in this study. Data exclusions were determined according to (1) quality control measures such as the distribution of molecular counts and expression of mitochondrial markers in single-cell RNA sequencing data, and/or (2) the availability of ground truth cellular annotations.

Evaluation

Classification and regression

k-nearest neighbor classification

To quantitatively compare feature selection methods on retaining features that are representative of cell types, we aimed to implement an approach that would assess the quality of the graph structure. k − nearest neighbors classification is a supervised learning algorithm that classifies data based on labels of the k-most similar cells according to their gene or protein expression, where the output of this algorithm is a set of labels for every cell. We performed k-nearest neighbors classification to predict cell type labels from simulated single-cell RNA sequencing datasets as follows. First, 3-fold cross-validation was performed using stratified random sampling to assign cells to either a training or a test set. Stratified random sampling was chosen to mitigate the effect of cell type class imbalance. Within each fold, feature selection was then performed on the training data to identify the top p relevant features according to a feature selection strategy. Next, a k-nearest neighbor classifier (k = 3) was fit on the feature-selected training data to predict the cell type labels of the feature selected test query points. Here, labels were predicted as the mode of the cell type labels from the closest training data points according to Euclidean distance. Classification performance was subsequently assessed according to the median classification accuracy with respect to the ground truth cell type labels across folds. k-nearest neighbors classification was implemented using the scikit-learn v0.23.2 package in python.

Support Vector Machine

The Support Vector Machines (SVM)165 is a supervised learning algorithm that constructs hyperplanes in the high-dimensional feature space to separate classes. We implemented SVM classification or regression using the scikit-learn v0.23.2 package in python. SVM classification was used to predict cell cycle phase labels for both RPE and PDAC 4i datasets, whereas SVM regression was used to predict age measurements from time lapse imaging for the RPE dataset. Here, Nested 10-fold cross-validation was performed using random sampling to assign cells to either a training set or a test set. Within each fold, feature selection was performed to identify the p most relevant features according to a feature selection strategy. SVM hyperparameters were then tuned over a grid search and phase labels were subsequently predicted from the test data according to those p features. Classification performance was assessed according to the median classification accuracy with respect to the ground truth cell type labels across folds. Regression performance was assessed according to the average root mean squared error with respect to ground truth age measurements across folds.

Precision@k

To evaluate the biological relevance of selected features from each method, we computed precision@k (p@k) as the proportion of top k selected features that were considered to be biologically relevant according to a ground truth reference as,

$$p@k=frac{| {{{{{{{{rm{F}}}}}}}}}_{s,k}cap {{{{{{{{rm{F}}}}}}}}}_{r}| }{| {{{{{{{{rm{F}}}}}}}}}_{s,k}| },$$

(24)

where Fs,k indicates the set of selected features at threshold k, where Fs,k Fs, and Fr indicates the set of reference features. Reference features were defined as either (1) the ground truth differentially expressed features within a cluster or along a differentiation path from the Splatter single-cell RNA sequencing simulation study (see Splatter simulation) or (2) the features determined to be useful for classifying cells according to cell cycle phase using a random forest classifier trained on ground truth phase annotations from time-lapse imaging for the protein immunofluorescence imaging datasets (See Random forest, RPE analysis, PDAC analysis).

Unsupervised clustering

To evaluate feature selection method performance on retaining features that are informative for identifying canonical cell types, we performed unsupervised clustering on the data defined by the top p ranked features from a feature selection strategy. More specifically, for each feature selection approach, clustering was performed on the selected data using the KMeans++ algorithm105 with the number of centroids set as the same number of ground truth cell cycle phase labels for the protein immunofluorescence imaging datasets (RPE: c = 4, PDAC: c = 5).

To assess the accuracy of clustering assignments, we quantified a NMI score between the predicted cluster labels and the ground truth cell type labels. Normalized mutual information166 is a clustering quality metric that measures the amount of shared information between two cell-to-cluster partitions (u and v, such that the ith entry ui gives the cluster assignment of cell i) as,

$${{{{{{{rm{NMI}}}}}}}}=frac{2{{{{{{{rm{I}}}}}}}}({{{{{{{bf{u}}}}}}}};{{{{{{{bf{v}}}}}}}})}{{{{{{{{rm{H}}}}}}}}({{{{{{{bf{u}}}}}}}}){{{{{{{rm{H}}}}}}}}({{{{{{{bf{v}}}}}}}})},$$

(25)

where, I(u; v) measures the mutual information between ground truth cell type labels u and cluster labels v, and H(u) or H(v) indicates the Shannon entropy or the amount of uncertainty for a given set of labels. Here, a score of 1 indicates that clustering on the selected features perfectly recovers the ground truth cell type labels. KMeans++ clustering was implemented using the KMeans function in scikit-learn v0.23.2.

Protein-protein interaction networks

In this work, we aimed to test whether features within DELVE dynamic clusters had experimental evidence of co-regulation as compared to random assignment. The STRING (search tool for the retrieval of interacting genes/proteins) database122 is a relational database that computes protein association scores according to information derived from several evidence channels, including computational predictions (e.g., neighborhood, fusion, co-occurance), co-expression, experimental assays, pathway databases, and literature text mining. To assess the significance of protein interactions amongst features within a DELVE cluster, we performed a permutation test with a test statistic derived from STRING association scores using experimental evidence as follows.

Let ({{{{{{{{mathcal{G}}}}}}}}}_{p}=({{{{{{{{mathcal{N}}}}}}}}}_{p},,{{{{{{{{mathcal{E}}}}}}}}}_{p})) denote a graph of p proteins from a DELVE cluster comprising the nodes ({{{{{{{{mathcal{N}}}}}}}}}_{p}), and ({{{{{{{{mathcal{E}}}}}}}}}_{p}) denote the set of edges, where edge weights encode the association scores of experimentally-derived protein-protein interaction evidence from the STRING database. Moreover, let ({{{{{{{{mathcal{G}}}}}}}}}_{r}=({{{{{{{{mathcal{N}}}}}}}}}_{r},,{{{{{{{{mathcal{E}}}}}}}}}_{r})) denote a graph of r proteins randomly sampled without replacement from the full feature space d such that r = p comprising the nodes ({{{{{{{{mathcal{N}}}}}}}}}_{r}), and ({{{{{{{{mathcal{E}}}}}}}}}_{r}) denote the set of edges encoding the experimentally-derived association scores between those r proteins from the STRING database. We compute the permutation p-value as described previously in ref. 167 as,

$$p,{{mbox{-value}}},=frac{N+1}{R+1}.$$

(26)

Here N indicates the number of times that Tr ≥ Tobs out of R random permutations (R = 1000), where Tr is the average degree of a STRING association network from randomly permuted features as ({T}_{{{{{{{{rm{r}}}}}}}}}{=}^{| {{{{{{{{mathcal{N}}}}}}}}}_{r}| }{/}_{| {{{{{{{{mathcal{E}}}}}}}}}_{r}| }), and Tobs is the average degree of a STRING association network from the features identified within a DELVE cluster as ({T}_{{{{{{{{rm{obs}}}}}}}}}{=}^{| {{{{{{{{mathcal{N}}}}}}}}}_{p}| }{/}_{| {{{{{{{{mathcal{E}}}}}}}}}_{p}| }). Of note, networks with higher degree are more connected, and thus show greater experimental evidence of protein-protein interactions. Experimental evidence-based association scores were obtained from the STRING database (stringdb) and networks were generated using networkx v3.1 package in python.

Trajectory inference and analysis

To evaluate how well feature selection methods can identify features that (1) recapitulate the underlying cellular trajectory and (2) can be used for trajectory analysis, we computed three metrics to assess trajectory preservation at different stages of inference: accuracy of the inferred cell-state trajectory graph, correlation of estimated pseudotime to the ground truth cell progression measurements, and significance and biological relevance of dynamic features identified following trajectory inference and regression analysis.

To obtain predicted cellular trajectories following feature selection, we performed trajectory inference using two approaches that were shown to outperform alternative methods for inference of simple or tree differentiation trajectories23. First, cellular trajectories were inferred using the diffusion pseudotime algorithm31 based on 20 diffusion map components generated from a k-nearest neighbor graph (k = 10), where edge weights were determined by pairwise Euclidean distances between cells according to selected feature expression. For each feature selection approach, we estimated pseudotime using ten random root cells selected according to a priori biological knowledge: simulated datasets (cells with the smallest ground truth time annotation), 4i cell cycle datasets (cells with the youngest age from time-lapse imaging for the arrested (G0 phase) or proliferative (G1, S, G2, M phases) trajectories), CD8 differentiation dataset (cells from the d3 population), and DE differentiation dataset (cells from the d0 population). As a secondary approach, we inferred cellular trajectories using Slingshot28. Here, a minimum spanning tree was fit through cluster centroids defined according to a priori biological knowledge (e.g., cell type labels or time point labels), then pseudotime was estimated by projecting cells onto the principal curves fit through the PHATE embedding generated from selected feature expression (See PHATE visualizations). The root cluster for each dataset was defined as cluster with the smallest ground truth time annotation as introduced previously. Feature selection trajectory performance was subsequently assessed as follows.

  1. 1.

    Trajectory graph similarity: Partition-based graph abstraction (PAGA)30 performs trajectory inference by constructing a coarse grained trajectory graph of the data. First cell populations are determined either through unsupervised clustering, graph partitioning, or a prior experimental annotations. Next, a statistical measure of edge connectivity is computed between cell populations to estimate the confidence of a cell population transition. To assess if feature selection methods retained features that represented coarse cell type transitions, we compared predicted PAGA trajectory graphs to ground truth cell cycle reference trajectories curated from the literature17. First, PAGA connectivity was estimated between ground truth cell cycle phase groups using the k-nearest neighbor graph (k = 10) based on pairwise Euclidean distances between cells according to selected feature expression. We then computed the Jaccard distance between predicted and reference trajectories as,

    $${d}_{j}left({{{{{{{{bf{W}}}}}}}}}_{p},{{{{{{{{bf{W}}}}}}}}}_{r}right)=1-frac{| {{{{{{{{bf{W}}}}}}}}}_{p}cap {{{{{{{{bf{W}}}}}}}}}_{r}| }{| {{{{{{{{bf{W}}}}}}}}}_{p}cup {{{{{{{{bf{W}}}}}}}}}_{r}| }.$$

    (27)

    Wp indicates the predicted cell type transition adjacency matrix, where each entry Wp,ij represents the connectivity strength between cell populations i and j from PAGA and Wr indicates the reference trajectory adjacency matrix with entries encoding ground truth cell type transitions curated from the literature. Here, a lower Jaccard distance indicates that predicted trajectories better capture known cellular transitions.

  2. 2.

    Pseudotime correlation: To evaluate if feature selection methods retained features that accurately represent a cell’s progression through a biological trajectory, we computed the Kendall rank correlation coefficient between estimated pseudotime following feature selection and ground truth cell progression annotations (e.g., the ground truth pseudotime labels generated from simulations, time-lapse imaging molecular age measurements).

  3. 3.

    Regression analysis: To identify genes associated with the inferred differentiation trajectory (e.g., CD8+ T cell, definitive endoderm differentiation trajectory) following feature selection, we performed regression analysis for each gene (d = 500) along estimated pseudotime using a negative binomial GAM. Genes were considered to be differentially expressed along the inferred lineage if they had a q value < 0.05 following Benjamini-Hochberg false discovery rate correction123.

  4. 4.

    Gene Ontology: To identify the biological relevance of differentially expressed genes associated with the different differentiation trajectories specific to each feature selection strategy, we performed gene set enrichment analysis on the set of significant genes from either highly variable gene, DELVE, Hotspot, Laplacian score, or RNA velocity feature selection using Enrichr124. Here, we considered the gene sets (e.g., mouse – CD8+ T cell differentiation, human – DE differentiation) from GO Biological Process 2023.

Diffusion pseudotime was implemented using the dpt function in Scanpy v1.9.3 in python, Slingshot was implemented using the slingshot v2.1.1 package in R v4.1.1, PAGA was implemented using the paga function in Scanpy v1.9.3 in python, GAM regression was implemented using the statsmodels v0.14.0 package in python, and gene set enrichment analysis was performed using the enrichr function in gseapy v1.0.4 package in python.

PHATE visualizations

To qualitatively compare lower dimensional representations from each feature selection strategy, we performed nonlinear dimensionality reduction using PHATE (potential of heat-diffusion for affinity-based transition embedding)78 as this approach performs reasonably well for representing complex continuous biological trajectories. PHATE was implemented using the phate v1.0.11 package in python. Here, we used the same set of hyperparameters across all feature selection strategies (knn = 30, t = 10, decay = 40).

Aggregate scores

To rank feature selection methods on preserving biological trajectories in the presence of single-cell noise, we computed rank aggregate scores by taking the mean of scaled method scores across simulated single-cell RNA sequencing datasets from a trajectory type and noise condition (e.g., linear trajectory, dropout noise). More specifically, we first defined an overall method score per dataset as the median of each metric. Method scores were subsequently min-max scaled to ensure datasets were equally weighted prior to computing the average.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.