DELVE
DELVE identifies a subset of dynamicallychanging features that preserve the local structure of the underlying cellular trajectory. In this section, we will (1) describe computational methods for the identification and ranking of features that have nonrandom patterns of dynamic variation, (2) explain DELVE’s relation to previous work, and (3) provide context for the mathematical foundations behind discarding informationpoor features prior to performing trajectory inference.
Problem formulation
Let ({{{{{{{bf{X}}}}}}}}={{{{{{{{{{bf{x}}}}}}}}}_{i}}}_{i=1}^{n}) denote a singlecell dataset, where ({{{{{{{{bf{x}}}}}}}}}_{i}in {{mathbb{R}}}^{d}) represents the vector of d measured features (e.g., genes or proteins) measured in cell i. We assume that the data have an inherent trajectory structure, or biologicallymeaningful ordering, that can be directly inferred by a limited subset of p features where p ≪ d. Therefore, our goal is to identify this limited set of p features from the original highdimensional feature set that best approximate the transitions of cells through each stage of the underlying dynamic process.
Step 1: Dynamic seed selection
Graph construction
Our approach DELVE extends previous similaritybased^{29,55,56} or subspacelearning^{58} feature selection methods by computing the dependence of each gene on the underlying cellular trajectory. In step 1, DELVE models cell states using a weighted knearest neighbor affinity graph of cells (k = 10), where nodes represent cells and edges describe the transcriptomic or proteomic similarity amongst cells according to the d profiled features encoded in X. More specifically, let ({{{{{{{mathcal{G}}}}}}}}=({{{{{{{mathcal{V}}}}}}}},{{{{{{{mathcal{E}}}}}}}})) denote a betweencell affinity graph, where ({{{{{{{mathcal{V}}}}}}}}) represents the cells and the edges, ({{{{{{{mathcal{E}}}}}}}}), are weighted according to a Gaussian kernel as,
$${{{{{{{{rm{w}}}}}}}}}_{ij}=left{begin{array}{ll}exp left(frac{parallel {{{{{{{{bf{x}}}}}}}}}_{{v}_{i}}{{{{{{{{bf{x}}}}}}}}}_{{v}_{j}}{parallel }^{2}}{2{sigma }_{i}^{2}}right),quad &,{{mbox{if}}},,{v}_{j}in {{{{{{{{mathcal{N}}}}}}}}}_{i} 0,hfill &,{{mbox{otherwise}}},.end{array}right.$$
(1)
Here, W is a n × n betweencell similarity matrix, where cells v_{i} and v_{j} are connected with an edge with edge weight w_{ij} if the cell v_{j} is within the set of v_{i}’s neighbors, as denoted by notation ({{{{{{{{mathcal{N}}}}}}}}}_{i}). Moreover, σ_{i}, specific for a particular cell i, represents the Gaussian kernel bandwidth parameter that controls the decay of cell similarity edge weights. We chose a bandwidth parameter as the distance to the 3rd nearest neighbor as this has been shown previously in refs. ^{56,154} to provide reasonable decay in similarity weights.
Identification of feature modules
To identify groups of features with similar coexpression variation, DELVE clusters features according to changes in expression across prototypical cell neighborhoods. First, cellular neighborhoods are defined according to the average expression of each set of k nearest neighbors (({{{{{{{{mathcal{N}}}}}}}}}_{i})) as, ({{{{{{{bf{Z}}}}}}}}={{{{{{{{{{bf{z}}}}}}}}}_{i}in {{mathbb{R}}}^{d}}}_{i=1}^{n}), where each ({{{{{{{{bf{z}}}}}}}}}_{i}=frac{1}{k}{sum }_{{{{{{{{{mathcal{N}}}}}}}}}_{i}}{{{{{{{{bf{x}}}}}}}}}_{i}) represents the center of the k nearest neighbors for cell i across all measured features. Next, DELVE leverages Kernel Herding sketching^{69} to effectively sample m representative cell neighborhoods, or rows, from the percell neighbor averaged feature matrix, Z, as (tilde{{{{{{{{bf{Z}}}}}}}}}={{{tilde{{{{{{{{bf{z}}}}}}}}}}_{i}in {{mathbb{R}}}^{d}}}_{i=1}^{m}). This sampling approach ensures that cellular neighborhoods are more reflective of the original distribution of cell states, while removing redundant cell states to aid in the scalability of estimating expression dynamics. DELVE then computes the average pairwise change in expression of features across representative cellular neighborhoods, Δ, as,
$${{{{Delta }}}}=frac{1}{m1}mathop{sum }limits_{i=1}^{m}left(tilde{{{{{{{{bf{Z}}}}}}}}}{{{{{{{{bf{j}}}}}}}}}_{m}{tilde{{{{{{{{bf{z}}}}}}}}}}_{i}^{{{{{{{{rm{T}}}}}}}}}right),$$
(2)
where j_{m} is a column vector of ones with length m, such that ({{{{{{{{bf{j}}}}}}}}}_{m}in {{mathbb{R}}}^{m}). Here, Δ is a m × d neighborhood by feature matrix, where each row corresponds to a neighborhood and contains the average pairwise change in expression for that neighborhood across all features. Lastly, features are clustered according to the transpose of their average pairwise change in expression across the representative cellular neighborhoods, Δ^{T}, using the KMeans + + algorithm^{105}. In this context, each DELVE module contains a set of features with similar local changes in covariation across cell states along the cellular trajectory.
Dynamic expression variation permutation testing
To assess whether modules of features have coordinated or noisy expression variation, we compare the average sample variance of features within a DELVE module to random assignment using a permutation test as follows. Let ({bar{S}}_{c}^{2}left({P}_{c}right)) denote the average sample variance of the average pairwise change in expression across m cell neighborhoods for the set of p features (a set of features denoted as P_{c}) within a DELVE cluster c as,
$${bar{S}}_{c}^{2}left({P}_{c}right)=frac{1}{ {P}_{c} }mathop{sum }limits_{p=1}^{ {P}_{c} }mathop{sum }limits_{i=1}^{m}frac{{left({Delta }_{i,p}{bar{Delta }}_{p}right)}^{2}}{m1}.$$
(3)
Moreover, let R_{q} denote a set of randomly selected features sampled without replacement from the full feature space d, such that ∣P_{c}∣ = ∣R_{q}∣, and ({tilde{S}}_{c}^{2}({R}_{q})) denote the average sample variance of randomly selected feature sets averaged across t random permutations as,
$${tilde{S}}_{c}^{2}({R}_{q})=frac{1}{t}mathop{sum }limits_{q=1}^{t}{bar{S}}_{c}^{2}({R}_{q}).$$
(4)
Here, DELVE considers a module of features as being dynamicallyexpressed if the average sample variance of the change in expression of the set of features within a DELVE cluster (or specifically feature set P_{c}), is greater than random assignment, R_{q}, across randomly permuted trials as,
$${bar{S}}_{c}({P}_{c}) , > , {tilde{S}}_{c}({R}_{q}).$$
(5)
In doing so, this approach is able to identify and exclude modules of features that have static, random, or noisy patterns of expression variation, while retaining dynamically expressed features for ranking feature importance. Given that noisy or irrelevant features can confound inference of the underlying cellular trajectory^{45,46} and have been shown to corrupt the graph Laplacian for feature selection^{61,62}, this exclusion step is crucial prior to performing for feature selection. Of note, given that KMeans++ clustering is used to initially assign features to a group, featuretocluster assignments can tend to vary due to algorithm stochasticity. Therefore, to reduce the variability and find a core set of features that are consistently dynamicallyexpressed, this process is repeated across ten random clustering initializations and the set of dynamicallyexpressed features are defined as the intersection across runs.
Step 2: Feature ranking
Following dynamic seed selection, in step two, DELVE ranks features according to their association with the underlying cellular trajectory graph. First, DELVE approximates the underlying cellular trajectory by constructing a betweencell affinity graph, where the nodes represent the cells and edges are now reweighted according to a Gaussian kernel between all cells based on the core subset of dynamically expressed regulators from step 1, such that (tilde{{{{{{{{bf{X}}}}}}}}}={{tilde{{{{{{{{bf{x}}}}}}}}}}_{i}in {{mathbb{R}}}^{p}}) where p ≪ d as,
$${tilde{{{{{{{{rm{w}}}}}}}}}}_{ij}=left{begin{array}{ll}exp left(frac{parallel {tilde{{{{{{{{bf{x}}}}}}}}}}_{{v}_{i}}{tilde{{{{{{{{bf{x}}}}}}}}}}_{{v}_{j}}{parallel }^{2}}{2{sigma }_{i}^{2}}right),quad &,{{mbox{if}}},,{v}_{j}in {{{{{{{{mathcal{N}}}}}}}}}_{i} 0,hfill &,{{mbox{otherwise}}},.end{array}right.$$
(6)
Here, (tilde{{{{{{{{bf{W}}}}}}}}}) is a n × n betweencell similarity matrix, where cells v_{i} and v_{j} are connected with an edge with edge weight ({tilde{w}}_{ij}) if the cell is within the set of v_{i}’s neighbors, denoted as ({{{{{{{{mathcal{N}}}}}}}}}_{i}). Moreover, as previously mentioned, σ_{i} represents the Gaussian kernel bandwidth parameter for a particular cell i as the distance to the 3rd nearest neighbor.
Features are then ranked according to their association with the underlying cellular trajectory graph using graph signal processing techniques^{55,70,71}. By modeling data as a set of signals on a weighted graph, graph signal processing techniques have led to the development of new machine learning models for improved biological prediction, including regression or classification^{70,155}, identification of prototypical cells associated with experimental perturbations^{82}, or inference of cell signaling dynamics^{156}. A graph signal f is any function that has a real defined value on all of the nodes, such that ({{{{{{{boldsymbol{f}}}}}}}}in {{mathbb{R}}}^{n}) and f_{i} gives the signal at the ith node. Intuitively, in DELVE, we consider all features as graph signals and rank them according to their variation in expression along the approximate cell trajectory graph to see if they should be included or excluded from downstream analysis. Let L denote the unnormalized graph Laplacian, with ({{{{{{{bf{L}}}}}}}}={{{{{{{bf{D}}}}}}}}tilde{{{{{{{{bf{W}}}}}}}}}), where D is a diagonal degree matrix with each element as ({d}_{ii}={sum }_{j}{tilde{{{{{{{{bf{w}}}}}}}}}}_{ij}). The local variation in the expression of feature signal f can then be defined as the weighted sum of differences in signals around a particular cell i as,
$$left({{{{{{{bf{L}}}}}}}}{{{{{{{bf{f}}}}}}}}right)(i)=mathop{sum}limits_{j}{tilde{{{{{{{{bf{w}}}}}}}}}}_{ij}left({{{{{{{bf{f}}}}}}}}(i){{{{{{{bf{f}}}}}}}}(j)right).$$
(7)
This metric effectively measures the similarity in expression of a particular node’s graph signal, denoted by the feature vector, f, around its k nearest neighbors. By summing the local variation in expression across all neighbors along the cellular trajectory, we can define the total variation in expression of feature graph signal f as,
$${{{{{{{{bf{f}}}}}}}}}^{{{{{{{{rm{T}}}}}}}}}{{{{{{{bf{L}}}}}}}}{{{{{{{bf{f}}}}}}}}=mathop{sum}limits_{ij}{tilde{{{{{{{{bf{w}}}}}}}}}}_{ij}{left({{{{{{{bf{f}}}}}}}}(i){{{{{{{bf{f}}}}}}}}(j)right)}^{2}.$$
(8)
Otherwise known as the Laplacian quadratic form^{71,157,158}, in this context, the total variation represents the global smoothness of the particular graph signal encoded in f (e.g., expression of a particular gene or protein) along the approximate cellular trajectory graph. Intuitively, DELVE aims to retain features that have a low total variation in expression, or have similar expression values amongst similar cells along the approximate cellular trajectory graph. In contrast, DELVE excludes features that have a high total variation in expression, or those which have expression values that are rapidly oscillating amongst neighboring cells, as these features are likely noisy or not involved in the underlying dynamic process that was initially seeded.
In this work, we ranked features according to their association with the celltocell affinity graph defined by a core set of dynamically expressed regulators from DELVE dynamic modules using the Laplacian score^{55}. This measure takes into account both the total variation in expression, as well as the overall global variance. For each of the original d measured features, or graph signals encoded in f with ({{{{{{{bf{f}}}}}}}}in {{mathbb{R}}}^{n}), the Laplacian score L_{f} is computed as,
$${L}_{f}=frac{{tilde{{{{{{{{bf{f}}}}}}}}}}^{T}{{{{{{{bf{L}}}}}}}}tilde{{{{{{{{bf{f}}}}}}}}}}{{tilde{{{{{{{{bf{f}}}}}}}}}}^{T}{{{{{{{bf{D}}}}}}}}tilde{{{{{{{{bf{f}}}}}}}}}}.$$
(9)
Here, L represents the unnormalized graph Laplacian, such that ({{{{{{{bf{L}}}}}}}}={{{{{{{bf{D}}}}}}}}tilde{{{{{{{{bf{W}}}}}}}}},,{{{{{{{bf{D}}}}}}}}) is a diagonal degree matrix with the ith element of the diagonal d_{ii} as ({d}_{ii}={sum }_{j}{tilde{{{{{{{{bf{w}}}}}}}}}}_{ij},tilde{{{{{{{{bf{f}}}}}}}}}) represents the mean centered expression of feature f as (tilde{{{{{{{{bf{f}}}}}}}}}={{{{{{{bf{f}}}}}}}}frac{{{{{{{{{bf{f}}}}}}}}}^{T}{{{{{{{bf{D}}}}}}}}{{{{{{{bf{1}}}}}}}}}{{{{{{{{{bf{1}}}}}}}}}^{T}{{{{{{{bf{D}}}}}}}}{{{{{{{bf{1}}}}}}}}}), and 1 = [1, …, 1]^{T}. By sorting features in ascending order according to their Laplacian score, DELVE effectively ranks features that best preserve the local trajectory structure (e.g., an ideal numerator has a small local variation in expression along neighboring cells), as well as best preserve cell types (e.g., an ideal denominator has large variance in expression for discriminitive power).
Guidelines on parameter selection
In this section, we provide practical recommendations for selecting hyperparameters in DELVE and describe DELVE’s sensitivity to choices in hyperparameters for both simulated singlecell RNA sequencing data and proteomic imaging data. We recommend standard quality control preprocessing prior to performing feature selection.
Number of nearest neighbors (k)
DELVE uses a knearest neighbor betweencell affinity graph to (1) identify dynamic modules of features and (2) rank features according to their association with the underlying cellular trajectory. The choice of number of nearest neighbors, k, in the knearest neighbor graph construction directly influences the connectivity of the graph. Selecting a small value of k prioritizes local relationships amongst cells, whereas selecting a large value of k connects more dissimilar cells together and prioritizes global relationships. Since DELVE ranks features by summing the differences in feature expression around nearest neighbors along the cellular trajectory graph (See Equation 8), the choice of number of nearest neighbors, k, may influence the overall resolution of feature dynamics and feature ranking. Overall, we recommend selecting a small number of nearest neighbors, such that the local connectivity of similar cell states are preserved. Moreover, we recommend constructing a knearest neighbors graph based on Euclidean distances in principal component space for singlecell RNA sequencing datasets that have not been variancefiltered or contain many more genes than cells, p ≫ n^{72,159}.
Number of subsampled neighborhoods (m)
We have previously shown that selecting a small representative subset of cells with Kernel Herding sketching (1) preserves the original distribution of cell states, (2) achieves the same fidelity of downstream performance with far fewer cells, and (3) results in improved scalability and faster runtimes^{69}. However, it is important to note that selecting too small of a sketch size (i.e., too few subsampled neighborhoods, m ≪ n) has the potential to neglect rare cell states. This may impact the identification of associated feature dynamics. Therefore, we recommend selecting a larger number of subsampled neighborhoods from the original dataset to retain expected rare cell populations that occur at a small frequency.
Number of modules (c)
To identify modules of features similar expression dynamics, DELVE clusters features according to their average pairwise change in expression across prototypical cellular neighborhoods using the Kmeans++ algorithm. The number of clusters, c, should be chosen according to the number and granularity of desired dynamic and static modules. Selecting too few clusters (e.g., c < 3) can result in modules of features with more heterogeneity in their expression and/or retention of correlated noisy features. Moreover, as the number of clusters becomes very large proportionally to the total number of features, clusters will contain far fewer features, which may result in reduced power to detect grouped feature dynamics in the permutation test. Of note, for particularly noisy imaging datasets with many correlated features (e.g., PDAC 4i cell cycle), we found that selecting more modules (c = 10) improved the separation and retention of a core set of dynamic cell cycle features from noisy correlated modules.
DELVE’s sensitivity to parameter choices
To evaluate DELVE’s sensitivity to choices in hyperparameters, we varied the number of neighbors per cell (k = 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100), the number of subsampled neighborhoods (m = 250, 500, 1000, 1500, 2000), and the number of KMeans++ clusters in dynamic module identification (c = 3, 4, 5, 6, 7, 8, 9, 10). Feature selection method performance was assessed on datasets with ground truth reference information (i.e., 4i RPE cell cycle dataset using 30 selected proteomic features, SymSim simulated singlecell RNA sequencing dataset using 1000 selected genes) by computing several metrics of trajectory preservation, including (1) k nearest neighbor classification accuracy between predicted and ground truth cell type labels, (2) NMI clustering score between predicted and ground truth cell type labels, and (3) Kendall rank correlation between estimated pseudotime and the ground truth measurements using either the diffusion pseudotime algorithm^{31} or Slingshot^{28} for trajectory inference. Results were computed across 20 random trials for each dataset and parameter configuration. Overall, we found DELVE was robust to choices in hyperparameters (Supplementary Figs. 14, 15). However, we note that parameter choices should be selected according to the profiled dataset.
Benchmarked feature selection methods
In this section, we describe the twelve feature selection methods evaluated for representing biological trajectories. For more details on implementation and hyperparameters, see Supplementary Table 1 and Guidelines on parameter selection.
Random forest
To quantitatively compare feature selection approaches on preserving biologically relevant genes or proteins, we aimed to implement an approach that would leverage ground truth cell type labels to determine feature importance. Random forest classification^{49} is a supervised ensemble learning algorithm that uses an ensemble of decision trees to partition the feature space such that all of the samples (cells) with the same class (cell type labels) are grouped together. Each decision or split of a tree was chosen by minimizing the Gini impurity score as,
$$G(m)=mathop{sum }limits_{i=1}^{C}{p}_{mi}(1{p}_{mi}).$$
(10)
Here, p_{mi} is the proportion of cells that belong to class i for a feature node m, and C is the total number of classes (e.g., cell types). We performed random forest classification using the scikitlearn v0.23.2 package in python. Nested 10fold crossvalidation was performed using stratified random sampling to assign cells to either a training or test set. The number of trees was tuned over a grid search within each fold prior to training the model. Feature importance scores were subsequently determined by the average Gini importance across folds.
Max variance
Max variance is an unsupervised feature selection approach that uses sample variance as a criterion for retaining discriminative features, where ({hat{S}}_{f}^{2}) represents the sample variance for feature ({{{{{{{bf{f}}}}}}}}in {{mathbb{R}}}^{n}) as,
$${S}_{f}^{2}=frac{1}{n1}mathop{sum }limits_{i=1}^{n}{({f}_{i}bar{f})}^{2},$$
(11)
where f_{i} indicates the expression value of feature f in cell i. We performed max variance feature selection by sorting features in descending order according to their variance score and selecting the top p maximally varying features.
Neighborhood variance
Neighborhood variance^{29} is an unsupervised feature selection approach that uses a local neighborhood variance metric to select graduallychanging features for building biological trajectories. Namely, the neighborhood variance metric ({tilde{S}}_{f}^{2}) quantifies how much feature f varies across neighboring cells as,
$${tilde{S}}_{f}^{2}=frac{1}{n{k}_{c}1}mathop{sum }limits_{i=1}^{n}mathop{sum }limits_{j=1}^{{k}_{c}}{({f}_{i}{f}_{{{{{{{{{mathcal{N}}}}}}}}}_{(i,j)}})}^{2}.$$
(12)
Here, f_{i} represents the expression value of feature f for cell (i,{{{{{{{{mathcal{N}}}}}}}}}_{(i,j)}) indicates the j nearest neighbor of cell i, and k_{c} is the minimum number of knearest neighbors required to form a fully connected graph. Features were subsequently selected if they had a smaller neighborhood variance ({tilde{S}}_{f}^{2}) than global variance ({S}_{f}^{2}),
$$frac{{S}_{f}^{2}}{{tilde{S}}_{f}^{2}} , > , 1.$$
(13)
Highly variable genes
Highly variable gene selection^{53} is an unsupervised feature selection approach that selects features according to a normalized dispersion measure. First, features are binned based on their average expression. Within each bin, genes are then zscore normalized to identify features that have a large variance, yet a similar mean expression. We selected the top p features using the highly variable genes function in Scanpy v1.9.3 (flavor = Seurat, bins = 20, n_top_genes = p).
Laplacian score
Laplacian score (LS)^{55} is a localitypreserving unsupervised feature selection method that ranks features according to (1) how well a feature’s expression is consistent across neighboring cells defined by a betweencell similarity graph define by all profiled features and (2) the feature’s global variance. First, a weighted knearest neighbor affinity graph of cells (k = 10) is constructed according to pairwise Euclidean distances between cells based on all features, X. More specifically, let ({{{{{{{mathcal{G}}}}}}}}=({{{{{{{mathcal{V}}}}}}}},{{{{{{{mathcal{E}}}}}}}})), where ({{{{{{{mathcal{V}}}}}}}}) represents the cells and edges, ({{{{{{{mathcal{E}}}}}}}}), are weighted using a Gaussian as follows. Specifically, edge weights between cells i and j can be defined as,
$${{{{{{{{rm{w}}}}}}}}}_{ij}=left{begin{array}{ll}exp left(frac{parallel {{{{{{{{bf{x}}}}}}}}}_{{v}_{i}}{{{{{{{{bf{x}}}}}}}}}_{{v}_{j}}{parallel }^{2}}{2{sigma }_{i}^{2}}right),quad &,{{mbox{if}}},,{v}_{j}in {{{{{{{{mathcal{N}}}}}}}}}_{i} 0,hfill &,{{mbox{otherwise}}},.end{array}right.$$
(14)
Here W is a n × n betweencell similarity matrix, where cells v_{i} and v_{j} are connected with an edge with edge weight w_{ij} if the cell v_{j} is within the set of v_{i}’s neighbors, ({{{{{{{{mathcal{N}}}}}}}}}_{i}). Moreover, as previously described, σ_{i} represents the bandwidth parameter for cell i defined as the distance to the 3rd nearest neighbor. For each feature f, where ({{{{{{{bf{f}}}}}}}}in {{mathbb{R}}}^{n}) represents the value of the feature across all n cells, we compute the Laplacian score, L_{f} as,
$${L}_{f}=frac{{tilde{{{{{{{{bf{f}}}}}}}}}}^{T}{{{{{{{bf{L}}}}}}}}tilde{{{{{{{{bf{f}}}}}}}}}}{{tilde{{{{{{{{bf{f}}}}}}}}}}^{T}{{{{{{{bf{D}}}}}}}}tilde{{{{{{{{bf{f}}}}}}}}}}.$$
(15)
Here, L represents the unnormalized graph Laplacian, with L = D − W, D is a diagonal degree matrix with the ith element of the diagonal d_{ii} as ({d}_{ii}={sum }_{j}{w}_{ij},,tilde{{{{{{{{bf{f}}}}}}}}}) represents the mean centered expression of feature f as (tilde{{{{{{{{bf{f}}}}}}}}}={{{{{{{bf{f}}}}}}}}frac{{{{{{{{{bf{f}}}}}}}}}^{T}{{{{{{{bf{D}}}}}}}}{{{{{{{bf{1}}}}}}}}}{{{{{{{{{bf{1}}}}}}}}}^{T}{{{{{{{bf{D}}}}}}}}{{{{{{{bf{1}}}}}}}}}), and 1 = [1, …, 1]^{T}. We performed feature selection by sorting features in ascending order according to their Laplacian score and selecting the top p features.
MCFS
Multicluster feature selection (MCFS)^{58} is an unsupervised feature selection method that selects for features that best preserve the multicluster structure of data by solving an L1 regularized least squares regression problem on the spectral embedding. Similar to the Laplacian score, first knearest neighbor affinity graph of cells (k = 10) is computed to encode the similarities in feature expression between cells i and j using a Gaussian kernel as,
$${{{{{{{{rm{w}}}}}}}}}_{ij}=left{begin{array}{ll}exp left(frac{parallel {{{{{{{{bf{x}}}}}}}}}_{{v}_{i}}{{{{{{{{bf{x}}}}}}}}}_{{v}_{j}}{parallel }^{2}}{2{sigma }_{i}^{2}}right),quad &,{{mbox{if}}},,{v}_{j}in {{{{{{{{mathcal{N}}}}}}}}}_{i} 0,hfill &,{{mbox{otherwise}}},.end{array}right.$$
(16)
Similar to previous formulations above, W is an n × n between cell affinity matrix, where a pair of cells v_{i} and v_{j} are connected with an edge with weight w_{ij} if cell v_{j} is within the set of v_{i}’s neighbors, ({{{{{{{{mathcal{N}}}}}}}}}_{i}). Further, σ_{i} represents the kernel bandwidth parameter chosen to be the distance to the third nearest neighbor from cell i. Next, to represent the intrinsic dimensionality of the data, the spectral embedding^{160} is computed through eigendecomposition of the unnormalized graph Laplacian L, where L = D − W as,
$${{{{{{{bf{L}}}}}}}}{{{{{{{bf{y}}}}}}}}=lambda {{{{{{{bf{D}}}}}}}}{{{{{{{bf{y}}}}}}}}.$$
(17)
Here, ({{{{{{{bf{Y}}}}}}}}={{{{{{{{{bf{y}}}}}}}}}}_{k=1}^{K}) are the eigenvectors corresponding to the K smallest eigenvalues, W is a symmetric affinity matrix encoding cell similarity weights, and D represents a diagonal degree matrix with each element as d_{ii} = ∑_{j}w_{ij}. Given that eigenvectors of the graph Laplacian represent frequency harmonics^{71} and low frequency eigenvectors are considered to capture the informative structure of the data, MCFS computes the importance of each feature along each intrinsic dimension y_{k} by finding a relevant subset of features by minimizing the error using an L1 norm penalty as,
$$mathop{min }limits_{{{{{{{{{bf{a}}}}}}}}}_{k}}parallel !{{{{{{{{bf{y}}}}}}}}}_{k}{{{{{{{{bf{X}}}}}}}}}^{T}{{{{{{{{bf{a}}}}}}}}}_{k}{parallel }^{2}quad {{{{{{{rm{s}}}}}}}}.{{{{{{{rm{t}}}}}}}}.quad parallel !{{{{{{{{bf{a}}}}}}}}}_{k}!{parallel }_{1}le gamma .$$
(18)
Here, the nonzero coefficients, a_{k}, indicate the most relevant features for distinguishing clusters from the embedding space, y_{k} and γ controls the sparsity and ensures the least relevant coefficients are shrunk to zero. The optimization is solved using the least angles regression algorithm^{74}, where for every feature, the MCFS score is defined as,
$${{{{{{{rm{MCFS}}}}}}}}(j)=mathop{max }limits_{k}parallel !!{a}_{k,j}!!parallel .$$
(19)
Here, j and k index feature and eigenvector, respectively. We performed multicluster feature selection with the number of eigenvectors K chosen to be the number of ground truth cell types present within the data, as this is the traditional convention in spectral clustering^{60} and the number of nonzero coefficients was set to the number of selected features, p.
SCMER
Singlecell manifoldpreserving feature selection (SCMER)^{57} selects a subset of p features that represent the embedding structure of the data by learning a sparse weight vector w by formulating an elastic net regression problem that minimizes the KL divergence between a cell similarity matrix defined by all features and one defined by a reduced subset of features. More specifically, let P denote a betweencell pairwise similarity matrix defined in UMAP^{59} computed with the full data matrix ({{{{{{{bf{X}}}}}}}}in {{mathbb{R}}}^{ntimes d}) and Q denote a betweencell pairwise similarity matrix defined in UMAP computed with the dataset following feature selection ({{{{{{{bf{Y}}}}}}}}in {{mathbb{R}}}^{ntimes p}), where Y = Xw and p ≪ d. Here, elastic net regression is used to find a sparse and robust solution of w that minimizes the KL divergence as,
$${{{{{{{rm{KL}}}}}}}}left({{{{{{{bf{P}}}}}}}}parallel {{{{{{{bf{Q}}}}}}}}right)=mathop{sum}limits_{i}mathop{sum}limits_{j}{p}_{ij}log frac{{p}_{ij}}{{q}_{ij}}.$$
(20)
Features with nonzero weights in w are considered useful for preserving the embedding structure and selected for downstream analysis. We performed SCMER feature selection using the scmer v.0.1.0a3 package in python by constructing a knearest neighbor graph (k = 10) according to pairwise Euclidean distances of cells based on their first 50 principal components and using the default regression penalty weight parameters (lasso = 3.87e − 4, ridge = 0).
Hotspot
Hotspot^{56} is an unsupervised gene module identification approach that performs feature selection through a test statistic that measures the association of a gene’s expression with the betweencell similarity graph defined based on the full feature matrix, X. More specifically, first, a knearest neighbor cell affinity graph (k = 10) is defined based on pairwise Euclidean distances between all pairs of cells using a Gaussian kernel as,
$${{{{{{{{rm{w}}}}}}}}}_{ij}=left{begin{array}{ll}exp left(frac{parallel {{{{{{{{bf{x}}}}}}}}}_{{v}_{i}}{{{{{{{{bf{x}}}}}}}}}_{{v}_{i}j}{parallel }^{2}}{{sigma }_{i}^{2}}right),quad &,{{mbox{if}}},,{v}_{j}in {{{{{{{{mathcal{N}}}}}}}}}_{i} 0,hfill &,{{mbox{otherwise}}},.end{array}right.$$
(21)
Here, cells v_{i} and v_{j} are connected with an edge with edge weight w_{ij} if the cell v_{j} is within the set of v_{i}’s neighbors such that ∑_{j}w_{ij} = 1 for each cell and σ_{i} represents the bandwidth parameter for cell i defined as the distance to the (frac{k}{3}) neighbor. For a given feature ({{{{{{{bf{f}}}}}}}}in {{mathbb{R}}}^{n}), representing expression across all n cells where f_{i} is the meancentered and standardized expression of feature f in cell i according to a null distribution model of gene expression, the local autocorrelation test statistic representing the dependence of each gene on the graph structure is defined as,
$${H}_{f}=mathop{sum}limits_{i}mathop{sum}limits_{jne i}{w}_{ij}{f}_{i}{f}_{j}.$$
(22)
Hotspot was implemented using the hotspotsc v1.1.1 package in python, where we selected the top p features by sorting features in ascending order according to their significance with respect to a null model. For the singlecell RNA sequencing datasets where count data were available (splatter simulation, SymSim simulation, DE differentiation datasets), or for the the indirect immunofluorescence imaging datasets (RPE, PDAC), the null model of counts was defined by a negative binomial distribution. For the singlecell RNA sequencing datasets where only log normalized data were available (CD8 differentiation), the null model of normalized counts was defined as a normal distribution.
All features
To consider a baseline representation without feature selection, we evaluated performance using all features from each dataset following quality control preprocessing.
Random features
As a second baseline strategy, we simply selected a subset of random features without replacement. Results were computed across twenty random initializations for each dataset.
DELVE
DELVE was run as previously described. Here, we constructed a weighted knearest neighbor affinity graph of cells (k = 10), and 1000 neighborhoods were sketched to identify dynamic seed feature clusters (c = 3 for the Splatter simulated datasets, c = 5 for the SymSim simulated, RPE cell cycle, CD8 T cell differentiation, and DE differentiation datasets, or c = 10 for the PDAC cell cycle datasets). For practical guidelines on parameter selection, see Guidelines on parameter selection.
Datasets
We evaluated feature selection methods based on how well retained features could adequately recover biological trajectories under various noise conditions, biological contexts, and singlecell technologies.
Splatter simulation
Splatter^{75} is a singlecell RNA sequencing simulation software that generates count data using a gammaPoisson hierarchical model with modifications to alter the meanvariance relationship amongst genes, the library size, or sparsity. We used splatter to simulate a total of 90 ground truth datasets with different trajectory structures (e.g., 30 linear datasets, 30 bifurcation datasets, and 30 tree datasets). First, we estimated simulation parameters by fitting the model to a real singlecell RNA sequencing dataset consisting of human pluripotent stem cells differentiating into mesoderm progenitors^{138}. We then used the estimated parameters (mean_rate = 0.0173, mean_shape = 0.54, lib_loc = 12.6, lib_scale = 0.423, out_prob = 0.000342, out_fac_loc = 0.1, out_fac_scale = 0.4, bcv = 0.1, bcv_df = 90.2, dropout = None) to simulate a diverse set of ground truth reference trajectory datasets with the splatter paths function (python wrapper scprep SplatSimulate v1.2.3 of splatter v1.18.2). Here, a reference trajectory structure (e.g., bifurcation) was used to simulate linear and nonlinear changes in the mean expression of genes along each step of the specified differentiation path. We simulated differentiation datasets (1500 cells, 500 genes, 6 clusters) for each trajectory type (linear, bifurcation, tree) by modifying (1) the probability of a cell belonging to a cluster by randomly sampling from a Dirichlet distribution with six categories and a uniform concentration of one and (2) the path skew by randomly sampling from a beta distribution (α = 10, β = 10). The output of each simulation is a ground truth reference consisting of celltocluster membership, differentially expressed genes per cluster or path, as well as a latent step vector that indicates the progression of each cell within a cluster. Lastly, we modified the step vector to be monotonically increasing across clusters within the specified differentiation path to obtain a reference pseudotime measurement.
To estimate how well feature selection methods can identify genes that represent cell populations and are differentially expressed along a differentiation path in noisy singlecell RNA sequencing data, we added relevant sources of biological and technical noise to the reference datasets.

1.
Biological Coefficient of Variation (BCV): To simulate the effect of stochastic gene expression, we modified the biological coefficient of variation parameter within splatter (BCV = 0.1, 0.25, 0.5). This scaling factor controls the meanvariance relationship between genes, where lowly expressed genes are more variable than highly expressed genes, following a γ distribution. This corresponded to an approximate mean coefficient of variation 1.55, 1.60, and 1.75 when averaged across all genes and max coefficient of variation of 2.90, 2.95, and 3.10.

2.
Library size: The total number of profiled mRNA transcripts per cell, or library size, can vary between cells within a singlecell RNA sequencing experiment and can influence the detection of differentially expressed genes^{76}, as well as impact the reproducibility of the lowerdimensional representation of the data^{77}. To simulate the effect of differences in sequencing depth, we proportionally adjusted the gene means for each cell by modifying the location parameter (lib_loc = 12, 11, 10) of the lognormal distribution within splatter that estimates the library size scaling factors. This corresponded to an average library size of approximately 3e^{4}, 1.7e^{4} and 9e^{3}.

3.
Technical dropout: Singlecell RNA sequencing data contain a large proportion of zeros, where only a small fraction of total transcripts are detected due to capture inefficiency and amplification noise^{161}. To simulate the inefficient capture of mRNA molecules and account for the trend that lowly expressed genes are more likely to be affected by dropout, we undersampled mRNA counts by sampling from a binomial distribution with the scale parameter or dropout rate proportional to the mean expression of each gene as previously described in ref. ^{162} as,
$${r}_{i}=exp left(lambda {mu }_{i}^{2}right).$$
(23)
Here, μ_{i} represents the log mean expression of gene i, and λ is a hyperparameter that controls the magnitude of dropout (λ = 0, 0.05, 0.1). This corresponded to an approximate undersampling percentage of 0, 10, and 20 percent when averaged across all cells for all genes.
In our subsequent feature selection method analyses, we selected the top p = 100 features under each feature selection approach.
SymSim simulation
SymSim^{86} is a single cell RNA sequencing simulation software that uses a kinetic model of gene expression followed by library preparation and sequencing simulation to model intrinsic, extrinsic, and technical variability in singlecell RNA sequencing data. This approach was shown in a recent benchmarking study^{87} to be amongst the top ranking methods for reasonably simulating singlecell RNA sequencing data as measured by their accuracy in estimating data properties (e.g., library size, TMM, mean expression, scaled variance, fractions of zeros, cell and gene correlations) and their ability to preserve biological signals (e.g., differentially expressed genes, differentially variable genes). We used SymSim v0.0.0.9000 in R v4.1.1 to simulate tree differentiation trajectories with 10000 cells, 20000 genes, and 4 cell types. Simulation parameters (σ = 0.6, 0.4, α_mean = (0.01, 0.02, 0.03, 0.04, 0.05), α_sd = 0.02, evf_type = continuous, protocol = UMI, vary = s, nevf = 1.5% total genes, n_de_evf = 1% total genes, depth_mean = 5e^{5}, depth_sd = 3e^{4}) were chosen from Table 2 in ref. ^{87}, as this table contained simulation parameters that achieved the most similar distributions of statistics to real singlecell RNA sequencing data. To evaluate how well feature selection methods could identify genes that define cellular trajectories when subjected to a reduction of the total mRNA count, we performed two experiments where we simulated five differentiation trajectories by modifying the mean mRNA capture efficiency rate α_mean = (0.01, 0.02, 0.03, 0.04, 0.05) for cells with a high within population variability (σ = 0.6) and low within population variability (σ = 0.4). The output of each simulated dataset was a ground truth reference trajectory containing celltocluster membership, as well as a latent pseudotime vector that indicated cell progression along each population in the trajectory. In our subsequent feature selection method analyses, we first performed principal components analysis (n_pcs = 50) and then selected the top p = 2000, 1000, 500, 250, 100, or 50 features under each feature selection approach.
RPE analysis
The retinal pigmented epithelial (RPE) dataset^{17} is an iterative indirect immunofluorescence imaging (4i) dataset consisting of RPE cells undergoing the cell cycle. Here, timelapse imaging was performed on an asynchronous population of nontransformed human retinal pigmented epithelial cells expressing a PCNAmTurquoise2 reporter in order to record the cell cycle phase (G0/G1, S, G2, M) and age (time since last mitosis) of each cell. Following timelapse imaging, cells were fixed and 48 core cell cycle effectors were profiled using 4i^{8}. For preprocessing, we minmax normalized the data and performed batch effect correction on the replicates using the ComBat^{163} function in Scanpy v1.9.3. Lastly, to refine phase annotations and distinguish G0 from G1 cells, we selected cycling G1 cells according to the bimodal distribution of ^{pRB}/_{RB} expression as described in ref. ^{17}. Of note, cells were excluded if they did not have ground truth phase or age annotations. The resultant dataset consisted of 2759 cells × 241 imagingderived features describing the expression and localization of different protein markers. For our subsequent analysis, we selected the top p = 30 imagingderived features for each feature selection approach.
PDAC analysis
The pancreatic ductal adenocarcinoma (PDAC) cell cycle dataset^{164} is an iterative indirect immunofluorescence imaging dataset that profiled 63 core cell cycle effectors in 9 human PDAC cell lines: BxPC3, CFPAC, HPAC, MiaPaCa, Pa01C, Pa02C, Pa16C, PANC1, UM53. Each dataset resulted in d = 253 imagingderived features representing the expression and localization of different protein markers. For each cell line (e.g., BxPC3) under control conditions, we minmax normalized the data. Cell cycle phases (G0, G1, S, G2, M) were annotated a priori based on manual gating cells according to the abundance of core cell cycle markers. PhosphoRB (pRB) was used to distinguish proliferative cells (G1/S/G2/M, high pRB) from arrested cells (G0, low pRB). DNA content, E2F1, cyclin A (cycA), and phosphop21 (pp21) were used to distinguish G1 (DNA content = 2C, low cycA), S (DNA content = 24C, high E2F1), G2 (DNA content = 4C, high cycA), and M (DNA content = 4C, high pp21). For our subsequent analysis, we selected the top p = 30 features for each feature selection approach.
CD8+ T cell differentiation analysis
The CD8+ T cell differentiation dataset^{107} is a singlecell RNA sequencing dataset consisting of mouse splenic CD8+ T cells profiled over 12time points (d = day) following infection with the Armstrong strain of the lymphocytic choriomeningitis virus: Naive, d3, d4, d5, d6, d7, d10, d14, d21, d32, d60, d90 postinfection. Spleen singlecell RNA sequencing data were accessed from the Gene Expression Omnibus using the accession code GSE131847 and concatenated into a single matrix. The dataset was subsequently quality control filtered according to the distribution of molecular counts. To remove dead or dying cells, we filtered cells that had more than twenty percent of their total reads mapped to mitochondrial transcripts. Genes that were observed in less than three cells or had less than 400 counts were also removed. Following cell and gene filtering, the data were transcriptspermillion normalized, log+1 transformed, and variance filtered using highly variable gene selection, such that the resulting dataset consisted of 29893 cells × 2000 genes (See Highly variable genes). When evaluating feature selection methods, we first performed principal components analysis n_pcs = 50, and then selected the top p = 500 features for each feature selection approach.
DE differentiation analysis
The DE differentiation dataset is a multiplexed singlecell RNA sequencing dataset consisting of human embryonic stem cells (hESCs) differentiating into the DE. Cells were profiled over a 2 day time course following induction of TGFβ and WNT signaling pathways: day 0, day 1, day2post treatment.

1.
Cell culture of human embryonic stem cells: H9 human embryonic stem cells (WiCell) were grown in the medium mTesR1 (STEMCELL Technologies) in tissue culture dishes coated with Matrigel (Corning; 1:100 in DMEM/F12) overnight at 4 ^{o}C. Media changes were performed daily. Cells were routinely passaged every 2–3 days using Dulbecco’s PBS/0.5 mM EDTA and treated with ROCK inhibitor Y27672 (10 μM; STEMCELL Technologies) for 24 h after passaging. The cells were kept at 37 ^{o}C, and 5% CO2 in a tissue culture incubator.

2.
Definitive endoderm differentiation: Definitive endoderm differentiation of H9 human embryonic stem cells (WiCell) was performed according by adapting the modified D’Amour protocol previously described in ref. ^{129}. For each treatment group, 100K cells were seeded into a single well of a 12well plate with mTeSR (Stemcell Technologies) and ROCK inhibitor Y27632 (10 μM; STEMCELL Technologies). After 24 h, cells were switched to mTeSR without ROCK inhibitor Y27632 for an additional 24 h. To induce definitive endoderm differentiation, cells were treated with Activin A (100^{ng}/_{mL}; Peprotech) and CHIR90992 (5 μM; Peprotech) in RPMI 1640 media with B27 supplement for 24 h (d1post treatment) followed by Activin A (100^{ng}/_{mL}; Peprotech) in RPMI 1640 media with B27 supplement for the final 24 h (d2post treatment). Cell seeding was staggered for simultaneous cell collection and labeling across time points.

3.
Multiplexed singlecell RNA sequencing: We performed multiplexed singlecell RNA sequencing by adapting an approach previously outlined in ref. ^{128} that uses Click Chemistry to tag cells with unique oligo barcodes for conditionspecific cell labeling and multiplexing prior to singlecell RNA sequencing. Conditionspecific oligo barcodes were designed according to 10X Genomics’ specifications for feature barcoding of cell surface proteins. Each conditionspecific oligo barcode contained a capture sequence 1, tru seq read 2, a unique feature barcode selected from the list of whitelisted barcodes provided by 10X Genomics, and an amine group (Am6C) on the 5′ end to allow for the required click chemistry alterations. Conditionspecific oligo barcodes (desalted, 250 nmol, IDT) were reconstituted in 250 μL distilled water and spun for 10 min to remove any insoluble debris from synthesis. Barcodes were then subjected to ethanol precipitation and resuspended in 100 muL 1X borate buffer (ThermoFisher Scientific). Concentrations in μM were calculated as (^{A260}/_{ϵ} × Dilution Factor × 10^{6}). Then 35 nmol was diluted to 50 μL with water and mixed with 7 muL 10X BBS (0.5 M borate + 1.5 M NaCl) and 7 μL DMSO. The barcodes were then reacted with 2 additions of 3.5 μL freshly prepared 100 mM MTZPEG4NHS (Click Chemistry Tools) in DMSO added at 15 min intervals. The reaction was quenched with 1.45 μL 1 M glycine for 5 min, then ethanol precipitated and resuspended in 100 μL HEPES buffer. The concentration of each barcode was measured and samples were normalized to 100 μM. Prior to cell labeling, cells were washed twice with HBSS (Gibco) to remove any extracellular proteins. Cells were then incubated in TCOPEG4NHS (Click Chemistry Tools) in HBSS for 15 min at room temperature on a rotating platform. This solution was prepared immediately prior to incubation with cells to minimize hydrolysis of NHS groups. After incubation, the TCOPEG4NHS solution was removed and the cells were incubated with MTZactivated feature barcodes at ~10 μM in DMEM:F12 for an additional 20 min at 37 ^{o}C. The reaction was then quenched by incubation for 5 min with Tris HCl and MTZDBCO. Cells were washed once in PBSEDTA then passaged using PBSEDTA and collected with PBSBSA. Cells were filtered and resuspended to a single cell suspension then equal numbers of cells from each treatment were combined and loaded onto a single lane of a 10X Genomics chip.

4.
Singlecell RNA sequencing preprocessing and quality control: cDNA libraries were prepared for singlecell RNA sequencing using the 10X Genomics Version 3 platform and analyzed with the Cell Ranger pipeline v3.0.2. Raw sequencing Binary Base Call files were computationally demultiplexed by sample indices and converted into FASTQ files with the cellranger mkfastq function. This pipeline outputs two sets of FASTQ files, one corresponding to gene expression profiles and the other containing feature barcodes. Reads were then aligned to either the human reference transcriptome (Hg19) or the custom feature barcode reference file with the cellranger count function. Cells were then filtered by the inflection point on the barcode rank plot to eliminate backgroundassociated cell barcodes. Loom files were generated using default parameters with the Velocyto v0.17 package in python. To computationally demultiplex cells following singlecell RNA sequencing, unique conditionspecific oligo barcodes from the unified filtered featurebarcode matrix were median normalized to correct for differences in signaltonoise and linked back to a sampling time point according to its maximum conditionspecific feature barcode (e.g., day 0). Singlecell RNA sequencing quality control was performed according to the distributions of count depth, genes per cell, and the fraction of mitochondrial genes per cell. To remove dead or dying cells, we filtered cells that had more than twenty percent of their total reads mapped to mitochondrial transcripts. To remove empty droplets or cell doublets, cells were required to have a minimum number of 1500 transcript counts or a maximum number of 90000 transcript counts and were filtered out if they had less than 650 genes per cell or more than 7200 genes per cell. Following cell and gene filtering, the data were countspermillion normalized, log+1 transformed, and variance filtered using highly variable gene selection, such that the resulting dataset consisted of 5397 cells × 2000 genes (See Highly variable genes).

5.
RNA velocity estimation: RNA velocity^{125,126} was calculated using the dynamical model implementation in ScVelo v0.2.5. More specifically, first and second order moments for velocity estimation were computed for each cell based on a knearest neighbor graph (k = 10) constructed according to pairwise Euclidean distances between cells using the first 50 principal components. The full dynamical model was then solved for all genes to obtain a high dimensional velocity vector for every cell. We then performed a likelihood ratio test for differential kinetics amongst the cell populations defined according to the time point labels to account for any differences in mRNA splicing or degradation kinetics. Groups of cells that exhibited different kinetic regimes were fit independently and velocity vectors were corrected.
When evaluating feature selection methods, we first performed principal components analysis, n_pcs = 50, and then selected the top p = 500 features for each feature selection approach.
Statistics and reproducibility
For all publicly available datasets used in this study (CD8T, RPE, PDAC), the sample sizes, number of replicates, randomization, and blinding were determined by the original authors. No sample size calculations were performed for the RPE (ref. ^{17}), PDAC (ref. ^{164}), CD8T (ref. ^{107}), or DE datasets. The RPE study was performed in technical duplicates (ref. ^{17}), the PDAC study had a single well per cell line (ref. ^{164}), and the CD8T cell differentiation study collected CD8T cells that were pooled from approximately 16 mice at each timepoint. Here each timepoint of the analysis represented an independent experiment (ref. ^{107}). The RPE (ref. ^{17}) and PDAC (ref. ^{164}) studies had no randomization as there was no treatment induction for the control data. For the CD8T cell study, mice were randomly allocated into groups before adoptive transfer and mice were randomly selected for cell harvesting at specific time points (ref. ^{107}). For the DE dataset, cells were randomly placed into three replicate wells for each condition, treated with differentiation stimuli, and pooled prior to acquisition. There was no blinding to experimental allocation or outcome association in this study. Data exclusions were determined according to (1) quality control measures such as the distribution of molecular counts and expression of mitochondrial markers in singlecell RNA sequencing data, and/or (2) the availability of ground truth cellular annotations.
Evaluation
Classification and regression
knearest neighbor classification
To quantitatively compare feature selection methods on retaining features that are representative of cell types, we aimed to implement an approach that would assess the quality of the graph structure. k − nearest neighbors classification is a supervised learning algorithm that classifies data based on labels of the kmost similar cells according to their gene or protein expression, where the output of this algorithm is a set of labels for every cell. We performed knearest neighbors classification to predict cell type labels from simulated singlecell RNA sequencing datasets as follows. First, 3fold crossvalidation was performed using stratified random sampling to assign cells to either a training or a test set. Stratified random sampling was chosen to mitigate the effect of cell type class imbalance. Within each fold, feature selection was then performed on the training data to identify the top p relevant features according to a feature selection strategy. Next, a knearest neighbor classifier (k = 3) was fit on the featureselected training data to predict the cell type labels of the feature selected test query points. Here, labels were predicted as the mode of the cell type labels from the closest training data points according to Euclidean distance. Classification performance was subsequently assessed according to the median classification accuracy with respect to the ground truth cell type labels across folds. knearest neighbors classification was implemented using the scikitlearn v0.23.2 package in python.
Support Vector Machine
The Support Vector Machines (SVM)^{165} is a supervised learning algorithm that constructs hyperplanes in the highdimensional feature space to separate classes. We implemented SVM classification or regression using the scikitlearn v0.23.2 package in python. SVM classification was used to predict cell cycle phase labels for both RPE and PDAC 4i datasets, whereas SVM regression was used to predict age measurements from time lapse imaging for the RPE dataset. Here, Nested 10fold crossvalidation was performed using random sampling to assign cells to either a training set or a test set. Within each fold, feature selection was performed to identify the p most relevant features according to a feature selection strategy. SVM hyperparameters were then tuned over a grid search and phase labels were subsequently predicted from the test data according to those p features. Classification performance was assessed according to the median classification accuracy with respect to the ground truth cell type labels across folds. Regression performance was assessed according to the average root mean squared error with respect to ground truth age measurements across folds.
Precision@k
To evaluate the biological relevance of selected features from each method, we computed precision@k (p@k) as the proportion of top k selected features that were considered to be biologically relevant according to a ground truth reference as,
$$p@k=frac{ {{{{{{{{rm{F}}}}}}}}}_{s,k}cap {{{{{{{{rm{F}}}}}}}}}_{r} }{ {{{{{{{{rm{F}}}}}}}}}_{s,k} },$$
(24)
where F_{s,k} indicates the set of selected features at threshold k, where F_{s,k} ⊂ F_{s}, and F_{r} indicates the set of reference features. Reference features were defined as either (1) the ground truth differentially expressed features within a cluster or along a differentiation path from the Splatter singlecell RNA sequencing simulation study (see Splatter simulation) or (2) the features determined to be useful for classifying cells according to cell cycle phase using a random forest classifier trained on ground truth phase annotations from timelapse imaging for the protein immunofluorescence imaging datasets (See Random forest, RPE analysis, PDAC analysis).
Unsupervised clustering
To evaluate feature selection method performance on retaining features that are informative for identifying canonical cell types, we performed unsupervised clustering on the data defined by the top p ranked features from a feature selection strategy. More specifically, for each feature selection approach, clustering was performed on the selected data using the KMeans++ algorithm^{105} with the number of centroids set as the same number of ground truth cell cycle phase labels for the protein immunofluorescence imaging datasets (RPE: c = 4, PDAC: c = 5).
To assess the accuracy of clustering assignments, we quantified a NMI score between the predicted cluster labels and the ground truth cell type labels. Normalized mutual information^{166} is a clustering quality metric that measures the amount of shared information between two celltocluster partitions (u and v, such that the ith entry u_{i} gives the cluster assignment of cell i) as,
$${{{{{{{rm{NMI}}}}}}}}=frac{2{{{{{{{rm{I}}}}}}}}({{{{{{{bf{u}}}}}}}};{{{{{{{bf{v}}}}}}}})}{{{{{{{{rm{H}}}}}}}}({{{{{{{bf{u}}}}}}}}){{{{{{{rm{H}}}}}}}}({{{{{{{bf{v}}}}}}}})},$$
(25)
where, I(u; v) measures the mutual information between ground truth cell type labels u and cluster labels v, and H(u) or H(v) indicates the Shannon entropy or the amount of uncertainty for a given set of labels. Here, a score of 1 indicates that clustering on the selected features perfectly recovers the ground truth cell type labels. KMeans++ clustering was implemented using the KMeans function in scikitlearn v0.23.2.
Proteinprotein interaction networks
In this work, we aimed to test whether features within DELVE dynamic clusters had experimental evidence of coregulation as compared to random assignment. The STRING (search tool for the retrieval of interacting genes/proteins) database^{122} is a relational database that computes protein association scores according to information derived from several evidence channels, including computational predictions (e.g., neighborhood, fusion, cooccurance), coexpression, experimental assays, pathway databases, and literature text mining. To assess the significance of protein interactions amongst features within a DELVE cluster, we performed a permutation test with a test statistic derived from STRING association scores using experimental evidence as follows.
Let ({{{{{{{{mathcal{G}}}}}}}}}_{p}=({{{{{{{{mathcal{N}}}}}}}}}_{p},,{{{{{{{{mathcal{E}}}}}}}}}_{p})) denote a graph of p proteins from a DELVE cluster comprising the nodes ({{{{{{{{mathcal{N}}}}}}}}}_{p}), and ({{{{{{{{mathcal{E}}}}}}}}}_{p}) denote the set of edges, where edge weights encode the association scores of experimentallyderived proteinprotein interaction evidence from the STRING database. Moreover, let ({{{{{{{{mathcal{G}}}}}}}}}_{r}=({{{{{{{{mathcal{N}}}}}}}}}_{r},,{{{{{{{{mathcal{E}}}}}}}}}_{r})) denote a graph of r proteins randomly sampled without replacement from the full feature space d such that r = p comprising the nodes ({{{{{{{{mathcal{N}}}}}}}}}_{r}), and ({{{{{{{{mathcal{E}}}}}}}}}_{r}) denote the set of edges encoding the experimentallyderived association scores between those r proteins from the STRING database. We compute the permutation pvalue as described previously in ref. ^{167} as,
$$p,{{mbox{value}}},=frac{N+1}{R+1}.$$
(26)
Here N indicates the number of times that T_{r} ≥ T_{obs} out of R random permutations (R = 1000), where T_{r} is the average degree of a STRING association network from randomly permuted features as ({T}_{{{{{{{{rm{r}}}}}}}}}{=}^{ {{{{{{{{mathcal{N}}}}}}}}}_{r} }{/}_{ {{{{{{{{mathcal{E}}}}}}}}}_{r} }), and T_{obs} is the average degree of a STRING association network from the features identified within a DELVE cluster as ({T}_{{{{{{{{rm{obs}}}}}}}}}{=}^{ {{{{{{{{mathcal{N}}}}}}}}}_{p} }{/}_{ {{{{{{{{mathcal{E}}}}}}}}}_{p} }). Of note, networks with higher degree are more connected, and thus show greater experimental evidence of proteinprotein interactions. Experimental evidencebased association scores were obtained from the STRING database (stringdb) and networks were generated using networkx v3.1 package in python.
Trajectory inference and analysis
To evaluate how well feature selection methods can identify features that (1) recapitulate the underlying cellular trajectory and (2) can be used for trajectory analysis, we computed three metrics to assess trajectory preservation at different stages of inference: accuracy of the inferred cellstate trajectory graph, correlation of estimated pseudotime to the ground truth cell progression measurements, and significance and biological relevance of dynamic features identified following trajectory inference and regression analysis.
To obtain predicted cellular trajectories following feature selection, we performed trajectory inference using two approaches that were shown to outperform alternative methods for inference of simple or tree differentiation trajectories^{23}. First, cellular trajectories were inferred using the diffusion pseudotime algorithm^{31} based on 20 diffusion map components generated from a knearest neighbor graph (k = 10), where edge weights were determined by pairwise Euclidean distances between cells according to selected feature expression. For each feature selection approach, we estimated pseudotime using ten random root cells selected according to a priori biological knowledge: simulated datasets (cells with the smallest ground truth time annotation), 4i cell cycle datasets (cells with the youngest age from timelapse imaging for the arrested (G0 phase) or proliferative (G1, S, G2, M phases) trajectories), CD8 differentiation dataset (cells from the d3 population), and DE differentiation dataset (cells from the d0 population). As a secondary approach, we inferred cellular trajectories using Slingshot^{28}. Here, a minimum spanning tree was fit through cluster centroids defined according to a priori biological knowledge (e.g., cell type labels or time point labels), then pseudotime was estimated by projecting cells onto the principal curves fit through the PHATE embedding generated from selected feature expression (See PHATE visualizations). The root cluster for each dataset was defined as cluster with the smallest ground truth time annotation as introduced previously. Feature selection trajectory performance was subsequently assessed as follows.

1.
Trajectory graph similarity: Partitionbased graph abstraction (PAGA)^{30} performs trajectory inference by constructing a coarse grained trajectory graph of the data. First cell populations are determined either through unsupervised clustering, graph partitioning, or a prior experimental annotations. Next, a statistical measure of edge connectivity is computed between cell populations to estimate the confidence of a cell population transition. To assess if feature selection methods retained features that represented coarse cell type transitions, we compared predicted PAGA trajectory graphs to ground truth cell cycle reference trajectories curated from the literature^{17}. First, PAGA connectivity was estimated between ground truth cell cycle phase groups using the knearest neighbor graph (k = 10) based on pairwise Euclidean distances between cells according to selected feature expression. We then computed the Jaccard distance between predicted and reference trajectories as,
$${d}_{j}left({{{{{{{{bf{W}}}}}}}}}_{p},{{{{{{{{bf{W}}}}}}}}}_{r}right)=1frac{ {{{{{{{{bf{W}}}}}}}}}_{p}cap {{{{{{{{bf{W}}}}}}}}}_{r} }{ {{{{{{{{bf{W}}}}}}}}}_{p}cup {{{{{{{{bf{W}}}}}}}}}_{r} }.$$
(27)
W_{p} indicates the predicted cell type transition adjacency matrix, where each entry W_{p,ij} represents the connectivity strength between cell populations i and j from PAGA and W_{r} indicates the reference trajectory adjacency matrix with entries encoding ground truth cell type transitions curated from the literature. Here, a lower Jaccard distance indicates that predicted trajectories better capture known cellular transitions.

2.
Pseudotime correlation: To evaluate if feature selection methods retained features that accurately represent a cell’s progression through a biological trajectory, we computed the Kendall rank correlation coefficient between estimated pseudotime following feature selection and ground truth cell progression annotations (e.g., the ground truth pseudotime labels generated from simulations, timelapse imaging molecular age measurements).

3.
Regression analysis: To identify genes associated with the inferred differentiation trajectory (e.g., CD8+ T cell, definitive endoderm differentiation trajectory) following feature selection, we performed regression analysis for each gene (d = 500) along estimated pseudotime using a negative binomial GAM. Genes were considered to be differentially expressed along the inferred lineage if they had a q value < 0.05 following BenjaminiHochberg false discovery rate correction^{123}.

4.
Gene Ontology: To identify the biological relevance of differentially expressed genes associated with the different differentiation trajectories specific to each feature selection strategy, we performed gene set enrichment analysis on the set of significant genes from either highly variable gene, DELVE, Hotspot, Laplacian score, or RNA velocity feature selection using Enrichr^{124}. Here, we considered the gene sets (e.g., mouse – CD8+ T cell differentiation, human – DE differentiation) from GO Biological Process 2023.
Diffusion pseudotime was implemented using the dpt function in Scanpy v1.9.3 in python, Slingshot was implemented using the slingshot v2.1.1 package in R v4.1.1, PAGA was implemented using the paga function in Scanpy v1.9.3 in python, GAM regression was implemented using the statsmodels v0.14.0 package in python, and gene set enrichment analysis was performed using the enrichr function in gseapy v1.0.4 package in python.
PHATE visualizations
To qualitatively compare lower dimensional representations from each feature selection strategy, we performed nonlinear dimensionality reduction using PHATE (potential of heatdiffusion for affinitybased transition embedding)^{78} as this approach performs reasonably well for representing complex continuous biological trajectories. PHATE was implemented using the phate v1.0.11 package in python. Here, we used the same set of hyperparameters across all feature selection strategies (knn = 30, t = 10, decay = 40).
Aggregate scores
To rank feature selection methods on preserving biological trajectories in the presence of singlecell noise, we computed rank aggregate scores by taking the mean of scaled method scores across simulated singlecell RNA sequencing datasets from a trajectory type and noise condition (e.g., linear trajectory, dropout noise). More specifically, we first defined an overall method score per dataset as the median of each metric. Method scores were subsequently minmax scaled to ensure datasets were equally weighted prior to computing the average.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
 SEO Powered Content & PR Distribution. Get Amplified Today.
 PlatoData.Network Vertical Generative Ai. Empower Yourself. Access Here.
 PlatoAiStream. Web3 Intelligence. Knowledge Amplified. Access Here.
 PlatoESG. Carbon, CleanTech, Energy, Environment, Solar, Waste Management. Access Here.
 PlatoHealth. Biotech and Clinical Trials Intelligence. Access Here.
 Source: https://www.nature.com/articles/s4146702446773z