Close this search box.

scCASE: accurate and interpretable enhancement for single-cell chromatin accessibility sequencing data – Nature Communications

The model of scCASE

Given a scCAS count matrix, we first filter out the peaks expressed in fewer than 1% of the cells (Supplementary Text S6)7,9,59, and then use TF-IDF transformation to reweight peaks by their occurrence frequencies7,59. The formula of TF-IDF transformation is shown in Eq. (1):

$${x}_{{ij}}=left(frac{hat{{x}_{{ij}}}}{mathop{sum }nolimits_{k=1}^{m}hat{{x}_{{kj}}}}right)cdot log left(frac{n}{mathop{sum }nolimits_{t=1}^{n}hat{{x}_{{it}}}}right)$$


where (hat{{x}_{{ij}}}) is the read count in peak (i) of cell (j). We denote the preprocessed count matrix as ({{{{{bf{X}}}}}}in {{mathbb{R}}}^{mtimes n}), where (m) is the number of peaks and (n) is the number of cells. We then minimize the loss function (F) in Eq. (2) to enhance the data in an iterative manner:

$$mathop{min }limits_{{{{{{bf{W}}}}}}{{{{{boldsymbol{,}}}}}}{{{{{bf{H}}}}}}{{{{{boldsymbol{,}}}}}}{{{{{bf{Z}}}}}}ge 0}F={Vert {{{{{bf{X}}}}}}({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})-{{{{{bf{W}}}}}}{{{{{bf{H}}}}}}Vert }_{{{{{F}}}}}^{2}+lambda {Vert {{{{{bf{Z}}}}}}-{{{{{{bf{H}}}}}}}^{T}{{{{{bf{H}}}}}}Vert }_{{{{{F}}}}}^{2}+{gamma }_{1}{Vert {{{{{bf{W}}}}}}Vert }_{F}^{2}+{gamma }_{2}{Vert {{{{{bf{H}}}}}}Vert }_{{{{{F}}}}}^{2}$$


where ({{{{{bf{Z}}}}}}in {{mathbb{R}}}^{ntimes n}) is the cell-to-cell similarity matrix in which each column sums to 1, and ({z}_{{ij}}) represents the similarity between cell (i) and cell (j). ({{{{{bf{R}}}}}}in {{mathbb{R}}}^{ntimes n}) is the random sampling matrix, where each element is either 0 or 1 generated through binomial distribution. ({{{{{bf{R}}}}}}) is Hadamard multiplied with ({{{{{bf{Z}}}}}}) in the computation to avoid the similar cells exhibiting almost the same accessible peaks which improperly reduces the cellular heterogeneity. ({{{{{bf{W}}}}}}in {{mathbb{R}}}^{mtimes k}) is the factorized projection matrix while ({{{{{bf{H}}}}}}in {{mathbb{R}}}^{ktimes n}) is the corresponding cell embedding matrix.

More specifically, the scCASE model can be divided into two parts. On the one hand, ({{{{{bf{X}}}}}}left({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}}right)) is the matrix enhanced by similarity. On the other hand, ({{{{{bf{WH}}}}}}) is the matrix reconstructed by matrix factorization. We aim to minimize the difference between the two matrices. The first term in Eq. (2) uses the Frobenius norm to approach ({{{{{bf{X}}}}}}left({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}}right)) and ({{{{{bf{WH}}}}}}). The second term minimizes the Frobenius norm of the cell-to-cell similarity matrix and the product of ({{{{{{bf{H}}}}}}}^{{{{{{rm{T}}}}}}}) and ({{{{{bf{H}}}}}}). This term makes ({{{{{{bf{Z}}}}}}}_{{ij}}) similar to the Cartesian product of the cell embedding vectors of cell (i) and cell (j). Similar cells should have smaller angles between their embedding vectors and more oversized Cartesian products, corresponding to larger elements in the similarity matrix. The third and fourth terms represent two regularization terms that constrain the projection matrix ({{{{{bf{W}}}}}}) and the cell embedding ({{{{{bf{H}}}}}}), respectively, to prevent model overfitting. The Frobenius norm is commonly used in non-negative matrix factorization due to its differentiability, which facilitates the optimization process. Coefficients of (lambda), ({gamma }_{1}), and ({gamma }_{2}) serve as weights of different terms. We also discuss the differences and advantages of scCASE to PCA/SVD in Supplementary Text S7.

The model of scCASE with reference data

We also introduce a variant of scCASE, named scCASER, to incorporate available omics data as reference data to further improve the data enhancement performance. The data preprocessing strategy in scCASER is the same as that in scCASE. We perform conventional NMF on the reference data and obtain the factorized projection matrix ({{{{{bf{P}}}}}}in {{mathbb{R}}}^{mtimes {k}_{1}}). We introduce the projection matrix learned from reference data to the loss function F, as shown in Eq. (3):

$$mathop{min }limits_{{{{{{{bf{W}}}}}}}_{1,2},{{{{{bf{H}}}}}},{{{{{bf{Z}}}}}}ge 0}F= |{{{{{bf{X}}}}}}({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})-[{{{{{{bf{W}}}}}}}_{1},{{{{{{bf{W}}}}}}}_{2}]{{{{{{bf{H}}}}}}} |_{F}^{2}+lambda |{{{{{bf{Z}}}}}} -{{{{{{bf{H}}}}}}}^{T}{{{{{{bf{H}}}}}}} |_{F}^{2}+{gamma }_{1} | {{{{{{bf{W}}}}}}} _{{{{{{bf{m}}}}}}} | _{{F}}^{2}+{gamma }_{2} | {{{{{{bf{H}}}}}}} | _{F}^{2}+alpha | {{{{{bf{P}}}}}}-{{{{{{bf{W}}}}}}} _1| _{F}^{2}$$


The projection matrix ({{{{{{bf{W}}}}}}}_{{{{{{bf{m}}}}}}}in {{mathbb{R}}}^{mtimes k}) is decomposed into two parts and can be written as (left[{{{{{{bf{W}}}}}}}_{{{{{{bf{1}}}}}}},{{{{{{bf{W}}}}}}}_{{{{{{bf{2}}}}}}}right]), where ({{{{{{bf{W}}}}}}}_{1}in {{mathbb{R}}}^{mtimes {k}_{1}}) is used to transfer the epigenetic information from reference data, while ({{{{{{bf{W}}}}}}}_{2}in {{mathbb{R}}}^{mtimes {k}_{2}}) is the projection matrix learned during fitting the target scCAS data. Additionally, ({k}_{1}) and ({k}_{2}) are the numbers of columns in ({{{{{{bf{W}}}}}}}_{1}) and ({{{{{{bf{W}}}}}}}_{2}), respectively ((k={k}_{1}+{k}_{2})), and they can be adjusted to control the dominance of the reference data on the model. The last regularization term of the optimization problem imposes a constraint on ({{{{{{bf{W}}}}}}}_{1}) using the matrix ({{{{{bf{P}}}}}}), to mitigate the difference between the projection matrix learned from reference data and the projection matrix learned from target data. Other settings of scCASER are similar to that of scCASE. Coefficients of (lambda), ({gamma }_{1}), ({gamma }_{2}), and (alpha) serve as weights of different terms.

Iterative optimization of scCASE

For scCASE, to minimize the loss function in Eq. (2), we can update the three matrices ({{{{{bf{W}}}}}}), ({{{{{bf{H}}}}}}), and ({{{{{bf{Z}}}}}}) iteratively using the gradient descent algorithm. To achieve this, we first compute the partial derivatives of the loss function with respect to each matrix. According to the definition of the Frobenius norm and the trace of a matrix, we can convert the loss function from a norm form to a trace form as Eq. (4), making it easier to compute the gradient:

$$F= {tr}({({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})}^{T}{{{{{{bf{X}}}}}}}^{T}{{{{{bf{X}}}}}}({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})-{{{{{{bf{H}}}}}}}^{T}{{{{{{bf{W}}}}}}}^{T}{{{{{bf{X}}}}}}({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})-{({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})}^{T}{{{{{{bf{X}}}}}}}^{T}{{{{{bf{WH}}}}}}+{{{{{{bf{H}}}}}}}^{T}{{{{{{bf{W}}}}}}}^{T}{{{{{bf{WH}}}}}}) +lambda {tr}({{{{{{bf{Z}}}}}}}^{T}{{{{{bf{Z}}}}}}-{{{{{{bf{H}}}}}}}^{T}{{{{{bf{HZ}}}}}}-{{{{{{bf{Z}}}}}}}^{T}{{{{{{bf{H}}}}}}}^{T}{{{{{bf{H}}}}}}+{{{{{{bf{H}}}}}}}^{T}{{{{{bf{H}}}}}}{{{{{{bf{H}}}}}}}^{T}{{{{{bf{H}}}}}})+{gamma }_{1}{tr}({{{{{{bf{W}}}}}}}^{T}{{{{{bf{W}}}}}})+{gamma }_{2}{tr}({{{{{{bf{H}}}}}}}^{T}{{{{{bf{H}}}}}})$$


After obtaining the partial derivatives of (F) with respect to ({{{{{bf{W}}}}}}), ({{{{{bf{H}}}}}}), and ({{{{{bf{Z}}}}}}) (Eq. (5)), we use gradient descent to optimize the model:

$$frac{partial F}{partial {{{{{bf{W}}}}}}} =-2X(Zcirc R){{{{{{bf{H}}}}}}}^{{{{{{rm{T}}}}}}}+2WH{{{{{{bf{H}}}}}}}^{T}+2{gamma }_{1}W frac{partial F}{partial {{{{{bf{H}}}}}}} =-2{{{{{{bf{W}}}}}}}^{T}X(Zcirc R)+2{{{{{{bf{W}}}}}}}^{T}WH-2lambda H({{{{{bf{Z}}}}}}+{{{{{{bf{Z}}}}}}}^{T})+4lambda H{{{{{{bf{H}}}}}}}^{T}H+2{gamma }_{2}H frac{partial F}{partial {{{{{bf{Z}}}}}}} =2({{{{{{bf{X}}}}}}}^{T}X({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}}))circ R-2({{{{{{bf{X}}}}}}}^{T}WH)circ R+2lambda Z-2lambda {{{{{{bf{H}}}}}}}^{T}H$$


The iteration will stop if the change between two consecutive iterations is less than ({10}^{-6}). Finally, we multiply the raw data matrix ({{{{{bf{X}}}}}}) by the iteratively optimized cell-to-cell similarity matrix ({{{{{bf{Z}}}}}}), resulting the enhanced scCAS data ({{{{{bf{XZ}}}}}}).

To improve the efficiency of the optimization algorithm, we propose a strategy to choose the optimal step size for each iteration. We let (frac{partial F}{partial {{{{{bf{W}}}}}}}={{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}}), (frac{partial F}{partial {{{{{bf{Z}}}}}}}={{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}}) and (frac{partial F}{partial {{{{{bf{H}}}}}}}={{{{{{bf{D}}}}}}}_{{{{{{bf{3}}}}}}}). For ({{{{{bf{W}}}}}}), we assume the most appropriate step size ({delta }_{1}) should be the one that minimizes (Fleft({{{{{bf{W}}}}}}-{delta }_{1}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}},{{{{{bf{H}}}}}},{{{{{bf{Z}}}}}}right)), as shown in Eq. (6):

$$F= {tr}({({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})}^{T}{{{{{{bf{X}}}}}}}^{T}{{{{{bf{X}}}}}}({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})-{{{{{{bf{H}}}}}}}^{T}{({{{{{bf{W}}}}}}-{{{{{{boldsymbol{delta }}}}}}}_{{{{{{boldsymbol{1}}}}}}}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}})}^{T}{{{{{bf{X}}}}}}({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})) +{tr}(-{({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})}^{T}{{{{{{bf{X}}}}}}}^{T}({{{{{bf{W}}}}}}-{delta }_{1}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}}){{{{{bf{H}}}}}}+{{{{{{bf{H}}}}}}}^{T}{({{{{{bf{W}}}}}}-{delta }_{1}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}})}^{T}({{{{{bf{W}}}}}}-{delta }_{1}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}}){{{{{bf{H}}}}}}) +lambda {tr}({{{{{{bf{Z}}}}}}}^{T}{{{{{bf{Z}}}}}}-{{{{{{bf{H}}}}}}}^{T}{{{{{bf{HZ}}}}}}-{{{{{{bf{Z}}}}}}}^{T}{{{{{{bf{H}}}}}}}^{T}{{{{{bf{H}}}}}}+{{{{{{bf{H}}}}}}}^{T}{{{{{bf{H}}}}}}{{{{{{bf{H}}}}}}}^{T}{{{{{bf{H}}}}}})+{gamma }_{1}{tr}({{{{{{bf{W}}}}}}}^{T}{{{{{bf{W}}}}}})+{gamma }_{2}{tr}({{{{{{bf{H}}}}}}}^{T}{{{{{bf{H}}}}}})$$


By substituting the parameters into Eq. (6) and taking the derivative, we can obtain the following:

$$frac{{dF}}{d{delta }_{1}}= {tr}({{{{{{bf{H}}}}}}}^{T}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}}^{T}{{{{{bf{X}}}}}}({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}}))+{tr}({({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})}^{T}{{{{{{bf{X}}}}}}}^{T}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}}{{{{{bf{H}}}}}}) -{tr}({{{{{{bf{H}}}}}}}^{T}{{{{{{bf{W}}}}}}}^{T}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}}{{{{{bf{H}}}}}}) -{tr}({{{{{{bf{H}}}}}}}^{T}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}}^{T}{{{{{bf{WH}}}}}})+2{delta }_{1}{tr}({{{{{{bf{H}}}}}}}^{T}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}}^{T}{{{{{bf{DH}}}}}})$$


When obtaining the minimum value of (F), the derivative of (F) with respect to (delta) is zero. Let Eq. (7) equal to 0, and we can solve the value ({delta }_{1}) as:

$${delta }_{1}=frac{-{tr}({{{{{{bf{H}}}}}}}^{T}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}}^{T}{{{{{bf{X}}}}}}left({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}}right))-{tr}({({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})}^{T}{{{{{{bf{X}}}}}}}^{T}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}}{{{{{bf{H}}}}}})+{tr}left({{{{{{bf{H}}}}}}}^{T}{{{{{{bf{W}}}}}}}^{T}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}}{{{{{bf{H}}}}}}right)+{tr}left({{{{{{bf{H}}}}}}}^{T}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}}^{T}{{{{{bf{WH}}}}}}right)}{2{tr}left({{{{{{bf{H}}}}}}}^{T}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}}^{T}{{{{{{bf{D}}}}}}}_{{{{{{bf{1}}}}}}}{{{{{bf{H}}}}}}right)}$$


For ({{{{{bf{Z}}}}}}), we suppose the best step size is ({delta }_{2}), and then the ({delta }_{2}) will let (Fleft({{{{{bf{W}}}}}}{{{{{boldsymbol{,}}}}}}{{{{{bf{H}}}}}}{{{{{boldsymbol{,}}}}}}{{{{{bf{Z}}}}}}-{delta }_{2}{{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}}right)) get minimum, (frac{{dF}left({{{{{bf{W}}}}}}{{{{{boldsymbol{,}}}}}}{{{{{bf{H}}}}}}{{{{{boldsymbol{,}}}}}}{{{{{bf{Z}}}}}}-{delta }_{2}{{{{{{bf{D}}}}}}}_{{{{{{boldsymbol{2}}}}}}}right)}{d{delta }_{2}}=0), and the calculation of loss function (F) is as below:

$$F= {tr}({(({{{{{bf{Z}}}}}}{{{{{boldsymbol{-}}}}}}{delta }_{2}{{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}})circ {{{{{bf{R}}}}}})}^{T}{{{{{{bf{X}}}}}}}^{T}{{{{{bf{X}}}}}}({{{{{bf{Z}}}}}}{{{{{boldsymbol{-}}}}}}{delta }_{2}{{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}})circ {{{{{bf{R}}}}}}-{{{{{{bf{H}}}}}}}^{T}{{{{{{bf{W}}}}}}}^{T}{{{{{bf{X}}}}}}({{{{{bf{Z}}}}}}{{{{{boldsymbol{-}}}}}}{delta }_{2}{{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}})circ {{{{{bf{R}}}}}}) +{tr}(-{(({{{{{bf{Z}}}}}}{{{{{boldsymbol{-}}}}}}{delta }_{2}{{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}})circ {{{{{bf{R}}}}}})}^{T}{{{{{{bf{X}}}}}}}^{T}{{{{{bf{WH}}}}}}+{{{{{{bf{H}}}}}}}^{T}{{{{{{bf{W}}}}}}}^{T}{{{{{bf{WH}}}}}}) +lambda {tr}({({{{{{bf{Z}}}}}}{{{{{boldsymbol{-}}}}}}{delta }_{2}{{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}})}^{T}({{{{{bf{Z}}}}}}{{{{{boldsymbol{-}}}}}}{delta }_{2}{{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}})-{{{{{{bf{H}}}}}}}^{T}{{{{{bf{H}}}}}}({{{{{bf{Z}}}}}}{{{{{boldsymbol{-}}}}}}{delta }_{2}{{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}})-{({{{{{bf{Z}}}}}}{{{{{boldsymbol{-}}}}}}{delta }_{2}{{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}})}^{T}{{{{{{bf{H}}}}}}}^{T}{{{{{bf{H}}}}}} +{{{{{{bf{H}}}}}}}^{T}{{{{{bf{H}}}}}}{{{{{{bf{H}}}}}}}^{T}{{{{{bf{H}}}}}})+{gamma }_{1}{tr}({{{{{{bf{W}}}}}}}^{T}{{{{{bf{W}}}}}})+{gamma }_{2}{tr}({{{{{{bf{H}}}}}}}^{T}{{{{{bf{H}}}}}})$$


$$frac{{dF}}{d{delta }_{2}}= 0=-{tr}({({{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}}circ {{{{{bf{R}}}}}})}^{{{{{{rm{T}}}}}}}{{{{{{bf{X}}}}}}}^{{{{{{rm{T}}}}}}}{{{{{bf{X}}}}}}({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})+{({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})}^{{{{{{rm{T}}}}}}}{{{{{{bf{X}}}}}}}^{{{{{{bf{T}}}}}}}{{{{{bf{X}}}}}}({{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}}{{{{{boldsymbol{circ }}}}}}{{{{{bf{R}}}}}})) +2{delta }_{2}{tr}({({{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}}circ {{{{{bf{R}}}}}})}^{{{{{{rm{T}}}}}}}{{{{{{bf{X}}}}}}}^{{{{{{rm{T}}}}}}}{{{{{bf{X}}}}}}({{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}}circ {{{{{bf{R}}}}}})) +{tr}({{{{{{bf{H}}}}}}}^{{{{{{rm{T}}}}}}}{{{{{{bf{W}}}}}}}^{{{{{{rm{T}}}}}}}{{{{{bf{X}}}}}}({{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}}circ {{{{{bf{R}}}}}})+{({{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}}circ {{{{{bf{R}}}}}})}^{{{{{{rm{T}}}}}}}{{{{{{bf{X}}}}}}}^{{{{{{rm{T}}}}}}}{{{{{bf{WH}}}}}}) -lambda {tr}({{{{{{bf{Z}}}}}}}^{{{{{{rm{T}}}}}}}{{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}}+{{{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}}}^{{{{{{rm{T}}}}}}}{{{{{bf{Z}}}}}})+2lambda {delta }_{2}{tr}({{{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}}}^{{{{{{rm{T}}}}}}}{{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}}) +lambda {tr}({{{{{{bf{H}}}}}}}^{{{{{{rm{T}}}}}}}{{{{{bf{H}}}}}}{{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}}+{{{{{{{bf{D}}}}}}}_{{{{{{bf{2}}}}}}}}^{{{{{{rm{T}}}}}}}{{{{{{bf{H}}}}}}}^{{{{{{rm{T}}}}}}}{{{{{bf{H}}}}}})$$


Then, we get the optimal ({delta }_{2}) value as Eq. (11):

$${delta }_{2}=frac{{tr}left({{{{{bf{eq}}}}}}{{{{{bf{1}}}}}}right)-{tr}left({{{{{bf{eq}}}}}}{{{{{bf{3}}}}}}right)+lambda {tr}left({{{{{bf{eq}}}}}}{{{{{bf{4}}}}}}right)-lambda {tr}left({{{{{bf{eq}}}}}}{{{{{bf{6}}}}}}right)}{2{tr}left({{{{{bf{eq}}}}}}{{{{{bf{2}}}}}}right)+2lambda {tr}left({{{{{bf{eq}}}}}}{{{{{bf{5}}}}}}right)}$$



$${{{{{rm{eq1}}}}}} =,{({{{{{{bf{D}}}}}}}_{2}circ {{{{{bf{R}}}}}})}^{{{{{{rm{T}}}}}}}{{{{{{bf{X}}}}}}}^{{{{{{rm{T}}}}}}}{{bf{X}}}({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})+{({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})}^{{{{{{rm{T}}}}}}}{{{{{{bf{X}}}}}}}^{{{{{{bf{T}}}}}}}{{bf{X}}}({{{{{{bf{D}}}}}}}_{2}circ {{{{{bf{R}}}}}}) {{{{{rm{eq}}}}}}2 =,{({{{{{{bf{D}}}}}}}_{2}circ {{{{{bf{R}}}}}})}^{{{{{{rm{T}}}}}}}{{{{{{bf{X}}}}}}}^{{{{{{rm{T}}}}}}}{{bf{X}}}({{{{{{bf{D}}}}}}}_{2}circ {{{{{bf{R}}}}}}) {{{{{rm{eq}}}}}}3 =,{{{{{{bf{H}}}}}}}^{{{{{{rm{T}}}}}}}{{{{{{bf{W}}}}}}}^{{{{{{rm{T}}}}}}}{{bf{X}}}({{{{{{bf{D}}}}}}}_{2}circ {{{{{bf{R}}}}}})+{({{{{{{bf{D}}}}}}}_{2}circ {{{{{bf{R}}}}}})}^{{{{{{rm{T}}}}}}}{{{{{{bf{X}}}}}}}^{{{{{{rm{T}}}}}}}{{bf{WH}}} {{{{{rm{eq4}}}}}} =,{{{{{{bf{Z}}}}}}}^{{{{{{rm{T}}}}}}}{{{{{{bf{D}}}}}}}_{2}+{{{{{{{bf{D}}}}}}}_{2}}^{{{{{{rm{T}}}}}}}{{bf{Z}}} {{{{{rm{eq5}}}}}} =,{{{{{{{bf{D}}}}}}}_{2}}^{{{{{{rm{T}}}}}}}{{bf{D}}}_{2} {{{{{rm{eq}}}}}}6 =,{{{{{{bf{H}}}}}}}^{{{{{{rm{T}}}}}}}{{bf{H}}}{{{{{{bf{D}}}}}}}_{2}+{{{{{{{bf{D}}}}}}}_{2}}^{{{{{{rm{T}}}}}}}{{{{{{bf{H}}}}}}}^{{{{{{rm{T}}}}}}}{{bf{H}}}$$


At each iteration, the optimal step size will be calculated and then used to update the variable matrix, accelerating convergence speed and reducing the computational cost. For scCASE, the optimal step size for ({{{{{bf{W}}}}}}) and ({{{{{bf{Z}}}}}}) can be adaptively obtained, while for ({{{{{bf{H}}}}}}), due to its high order terms, we cannot find the optimal step size in a similar way. We initialize the step size ({delta }_{3}) of ({{{{{bf{H}}}}}}) to 0.2. If this step size cannot reduce the loss function during the iteration process, we will reduce the step size and try updating ({{{{{bf{H}}}}}}) again.

In a similar fashion to the scCASE approach, we additionally introduce two mask matrices of ({{{{{bf{M}}}}}}) and ({{{{{bf{N}}}}}}) when implementing scCASER, and Eq. (3) becomes:

$$mathop{min }limits_{{{{{{{bf{W}}}}}}}_{1,2},{{{{{bf{H}}}}}},{{{{{bf{Z}}}}}}ge 0}F= |{{{{{bf{X}}}}}}({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})-({{{{{bf{M}}}}}}circ {{{{{{bf{W}}}}}}}_{1}+{{{{{bf{N}}}}}}circ {{{{{{bf{W}}}}}}}_{2}){{{{{{bf{H}}}}}}} |_{{{{{F}}}}}^{2}+lambda | {{{{{bf{Z}}}}}}-{{{{{{bf{H}}}}}}}^{T}{{{{{{bf{H}}}}}}} | _{{{{{F}}}}}^{2} +{gamma }_{1}|{{{{{{bf{W}}}}}}}_{{{{{{bf{m}}}}}}}|_{{{{F}}}}^{2}+{gamma }_{2}|{{{{{{bf{H}}}}}}} | _{{{{{F}}}}}^{2}+alpha |{{{{{bf{P}}}}}}-{{{{{{bf{W}}}}}}}_{1}|_{{{{F}}}}^{2}$$


In Eq. (13), ({{{{{{bf{W}}}}}}}_{1}) is expanded from the original ({{{{{{bf{W}}}}}}}_{1}) to (left[{{{{{{bf{W}}}}}}}_{1},{{{{{bf{0}}}}}}right]) to have the same dimension as ({{{{{{bf{W}}}}}}}_{{{{{{bf{m}}}}}}}).({{{{{bf{M}}}}}}) is a masking matrix with 1 in the region where ({{{{{{bf{W}}}}}}}_{1}) acts and 0 elsewhere. The same is done for ({{{{{{bf{W}}}}}}}_{2}) and ({{{{{bf{N}}}}}}). Hadamard multiplying ({{{{{bf{M}}}}}}), ({{{{{bf{N}}}}}}) by ({{{{{{bf{W}}}}}}}_{1}), ({{{{{{bf{W}}}}}}}_{2}) respectively, and adding them yields the equivalent effect of Eq. (3). The loss function (F) can then be written as Eq. (14):

$$F= {tr}({({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})}^{T}{{{{{{bf{X}}}}}}}^{T}{{{{{bf{X}}}}}}({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})-{{{{{{bf{H}}}}}}}^{T}{({{{{{bf{M}}}}}}circ {{{{{{bf{W}}}}}}}_{{{{{{bf{1}}}}}}}+{{{{{bf{N}}}}}}circ {{{{{{bf{W}}}}}}}_{{{{{{bf{2}}}}}}})}^{T}{{{{{bf{X}}}}}}({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}}) -{({{{{{bf{Z}}}}}}circ {{{{{bf{R}}}}}})}^{T}{{{{{{bf{X}}}}}}}^{T}({{{{{bf{M}}}}}}circ {{{{{{bf{W}}}}}}}_{{{{{{bf{1}}}}}}}+{{{{{bf{N}}}}}}circ {{{{{{bf{W}}}}}}}_{{{{{{bf{2}}}}}}}){{{{{bf{H}}}}}} +{{{{{{bf{H}}}}}}}^{T}{({{{{{bf{M}}}}}}circ {{{{{{bf{W}}}}}}}_{{{{{{bf{1}}}}}}}+{{{{{bf{N}}}}}}circ {{{{{{bf{W}}}}}}}_{{{{{{bf{2}}}}}}})}^{{{{{{rm{T}}}}}}}({{{{{bf{M}}}}}}circ {{{{{{bf{W}}}}}}}_{{{{{{bf{1}}}}}}}+{{{{{bf{N}}}}}}circ {{{{{{bf{W}}}}}}}_{{{{{{bf{2}}}}}}}){{{{{bf{H}}}}}})left.right) +lambda {tr}left({{{{{{bf{Z}}}}}}}^{T}{{{{{bf{Z}}}}}}-{{{{{{bf{H}}}}}}}^{T}{{{{{bf{HZ}}}}}}-{{{{{{bf{Z}}}}}}}^{{{{{{bf{T}}}}}}}{{{{{{bf{H}}}}}}}^{{{{{{bf{T}}}}}}}{{{{{bf{H}}}}}}+{{{{{{bf{H}}}}}}}^{{{{{{bf{T}}}}}}}{{{{{bf{H}}}}}}{{{{{{bf{H}}}}}}}^{{{{{{bf{T}}}}}}}{{{{{bf{H}}}}}}right) +{gamma }_{1}{tr}left({{{{{{bf{W}}}}}}}_{{{{{{bf{m}}}}}}}^{T}{{{{{{bf{W}}}}}}}_{{{{{{bf{m}}}}}}}right)+{gamma }_{2}{tr}left({{{{{{bf{H}}}}}}}^{T}{{{{{bf{H}}}}}}right) +alpha {tr}({{{{{{bf{P}}}}}}}^{T}{{{{{bf{P}}}}}}-{left({{{{{bf{M}}}}}}circ {{{{{{bf{W}}}}}}}_{{{{{{bf{1}}}}}}}right)}^{T}{{{{{bf{P}}}}}}-{{{{{{bf{P}}}}}}}^{T}left({{{{{bf{M}}}}}}circ {{{{{{bf{W}}}}}}}_{{{{{{bf{1}}}}}}}right)+{left({{{{{bf{M}}}}}}circ {{{{{{bf{W}}}}}}}_{{{{{{bf{1}}}}}}}right)}^{T}left({{{{{bf{M}}}}}}circ {{{{{{bf{W}}}}}}}_{{{{{{bf{1}}}}}}}right))$$


Details of the optimization process for scCASER are similar to that of scCASE and can be found in Supplementary Texts S2.

Initialization and parameter selection of scCASE

Due to the high-dimensional characteristic of the variable matrix, random initialization cannot guarantee that the algorithm converges quickly to the desired solution. Therefore, we initialize ({{{{{bf{W}}}}}}), ({{{{{bf{H}}}}}}), and ({{{{{bf{Z}}}}}}) specifically. We initialize ({{{{{bf{Z}}}}}}) as the Jaccard similarity matrix between cells18, and ({{{{{bf{W}}}}}}) and ({{{{{bf{H}}}}}}) as the projection matrix and cell embedding matrix obtained by performing conventional NMF on the target scCAS matrix ({{{{{bf{X}}}}}}), respectively. In scCASER, ({{{{{{bf{W}}}}}}}_{1}) is initialized as the projection matrix obtained by NMF on the reference data and ({{{{{{bf{W}}}}}}}_{2}) is initialized as the projection matrix obtained by NMF on the target scCAS data. We use ASTER to estimate the number of cell types as the number of latent factors in non-negative matrix factorization, which is empirically suitable for various scCAS datasets24. The default value of lambda in the model is 106.

As pointed out in the existing literature, enhancement may lead to over-smoothing, resulting in the removal of true cell-cell heterogeneity signals11,60. We validated the impact of different initializations and parameter choices on over-smoothing. We designed three additional metrics, namely, over-smoothing score, under-smoothing score, and smoothing score (Supplementary Text S1), to assess the degree of over-smoothing. First, for model initialization, random initialization of ({{{{{bf{H}}}}}}) would severely affect the matrix ({{{{{bf{Z}}}}}}) and lead to unsatisfactory outcomes (Supplementary Fig. S24). The model with random initialization of ({{{{{bf{Z}}}}}}) will still convergent, though leading to a worse performance of enhancement, and the random initialization of ({{{{{bf{Z}}}}}}) does not result in over-smoothing (Supplementary Fig. S24). Secondly, we ran scCASE with varying lambda values within the range of 105 to 108. As lambda varies, different metrics remain stable with few changes (Supplementary Fig. S25). This indicates the high robustness of scCASE to the choice of lambda, suggesting that within a certain range, the choice of lambda does not lead to over-smoothing. Finally, we validated the impact of different values of parameter (K). The results indicate that when (K) is small ((K)<7), it can be observed that at lower dimensions, the model struggles to capture differences in the data effectively, leading to the elimination of heterogeneity between different cell types and over-smoothing of the data (Supplementary Figs. S26S28). As (K) gradually increases, the degree of over-smoothing eases. However, though not lead to over-smoothing, large values of (K) ((K)>20) may introduce excessive noise, making the model learning more challenging and resulting in a lower under-smoothing score (Supplementary Figs. S26S28).

Run-time and memory usage of scCASE

scCASE consistently exhibited commendable performance of run-time and peak memory usage. In terms of run-time, scCASE exhibits significant advantages compared to other methods, especially on smaller datasets such as BM0828, Blood, and LungA, where scCASE can operate several times faster than baseline methods (Supplementary Fig. S29). Even on larger datasets, scCASE still maintains a notable speed advantage. In terms of peak memory usage, SCALE and scBasset are GPU-based methods, they make more usage of GPU memory, so their memory usage is typically smaller than the methods that utilize CPUs. scCASE demonstrates a certain advantage in peak memory usage on smaller datasets such as BM0828, Blood, and LungA (Supplementary Fig. S29). Although the peak memory usage of scCASE increases in larger datasets, its memory usage is still comparable to that of scOpen, the state-of-the-art scCAS data enhancement method (Supplementary Fig. S29). Moreover, the memory usage of scCASE growth remains manageable. On two larger datasets of Muto and Simulated, which have a similar number of peaks but a fivefold increase in the number of cells (from 20k to 100k), the peak memory usage of scCASE increased by 7.67 times, while that of scOpen increased by 8.62 times. Note that scBFA is unable to run on datasets with 100k cells due to out-of-memory errors.

Implementation details of baseline methods

scBFA: scBFA is a detection-based model to remove technical variation in scRNA-seq and scATAC-seq data, available at We utilized the raw count matrices as input and performed scBFA using their default parameters. We executed scBFA following the same benchmarking procedure as in scOpen (

SCALE: SCALE integrates both the variational auto-encoder (VAE) and the Gaussian mixture model (GMM) to characterize the distribution of scATAC-seq data9. We obtained the SCALE program from and used the raw count matrices as input. When executing the program, we set the option “impute” as TRUE to obtain the imputed data, while keeping other parameters at their default settings.

scBasset: scBasset is a sequence-based convolutional neural network method to model scATAC data and predict chromatin accessibility17. The program and training tutorial of scBasset can be downloaded from We trained scBasset with default parameters using the raw count matrices and peaks as the input. Genome fasta file used in scBasset can be downloaded from ( After obtaining the trained model, we referred to their tutorial to obtain the enhanced data (

scOpen: scOpen is a scCAS-seq imputation method based on regularized non-negative matrix factorization10. The raw scCAS count matrix serves as the input for scOpen, and the output is an imputed matrix. We followed the tutorial and examples of scOpen provided at and executed it with default parameters.

Implementation details of downstream analyses

t-SNE and UMAP Visualization: we first preprocessed the raw data and the data enhanced by different methods using TF-IDF. Then, we performed PCA to reduce the dimensionality, created a “neighbors” graph, and used t-SNE/UMAP to obtain two-dimensional visualizations of the data, respectively following the scATAC-seq data analysis workflow provided by EpiScanpy26 ( The above steps were performed using the default parameters in the EpiScanpy pipeline.

SNPsea: SNPsea is an algorithm to identify cell types and pathways likely to be affected by risk loci. Specifically, genome-wide association studies (GWAS) have discovered multiple genomic loci associated with risk for different types of disease. SNPsea provides a simple way to determine the types of cells influenced by genes in these risk loci. SNPsea supposes disease-associated alleles influence a small number of pathogenic cell types, and assumes that a gene’s specificity to a cell type is a reasonable indicator of its importance to the unique function of that cell type. We performed SNPsea analysis with default settings in each set of cell type-specific peaks and the set of background peaks, respectively. The enrichments of tissue-specific expression in profiles of 17,581 genes across 79 human tissues (Gene Atlas) were quantified61. Specifically, we first obtained SNP site data for the whole genome from HapMap3 SNPs, which can be downloaded at To obtain the SNP sites corresponding to each group of cell type-specific peaks, we utilized GenomicRanges to identify SNP sites present within the cell type-specific peak regions62. GenomicRanges is an R/Bioconductor package for representing and manipulating genomic intervals, available at Then we obtained the SNP sites corresponding to each group of cell type-specific peaks, which serve as the input for SNPsea. We specified the same additional data including phenotype data and parameters for SNPsea as in its tutorial. These additional data can be downloaded from ( Their sources and detailed explanations are described at The parameters and specific running tutorials of the method can be found at We quantified the enrichments of each set of peaks in tissue-specific accessibility profiles across 79 tissues, and the top 30 significantly enriched tissues are illustrated in the figures.

LDSC: LDSC is a command line tool for estimating heritability and genetic correlation from GWAS summary statistics40. After identifying cell type-specific peaks and background peaks in the Blood dataset, we quantified the enrichment of heritability for blood-related phenotypes within cell type-specific peaks for each cell type using partitioned LDSC with default settings. We ran LDSC using HapMap3 SNPs and used European samples from the 1000 Genomes Project as the LD reference panel. All the summary statistics provided by LDCS including SNPs and phenotypes were downloaded from the Broad LD Hub ( Specifically, in this analysis, our input consists of the detected cell type-specific peaks or background peaks. Similar to SNPsea, we first utilized GenomicRanges to identify SNP sites present within the specific peak regions. Subsequently, we used the LDSC program to calculate the LD Scores for these SNP sites. The LDSC process in this step can be referred to at Finally, we invoked the LDSC program again, using the obtained LD Scores as input, to calculate heritability and genetic correlation with blood-related phenotypes. The LDSC process in this step can be referred to

scABC: scABC is an R package for the analysis of scATAC-seq data41. With the clustering assignments obtained from scCASE and Louvain clustering, we followed the scABC workflow, utilized the function “getClusterSpecificPvalue()”, calculated the p value using hypothesis testing procedure, and finally identified cluster-specific peaks ( These identified cluster-specific peaks will be used as the input of chromVAR42 to perform motif analysis similar to RA37.

chromVAR: chromVAR is an R package for the analysis of sparse chromatin accessibility data from single-cell/bulk ATAC-seq/DNase-seq42. The package aims to identify motifs or other genomic annotations associated with variability in chromatin accessibility between individual cells or samples. We downloaded chromVAR from The motifs database is obtained from the “getJasparMotifs()” function within the chromVAR42 method, which is sourced from the JASPAR63 database. Following the workflow in RA37, we used the data enhanced by scCASE as the input and applied chromVAR42 to infer the enriched transcription factor (TF) binding motifs within the top 1000 cluster-specific peaks with the smallest p values calculated by scABC. Subsequently, we visualized the deviations calculated by chromVAR for the top 50 TF binding motifs.

Data collection and preprocessing

We utilized 13 datasets in this study, including a simulated scCAS dataset, eight publicly available scCAS datasets, two scCAS datasets annotated based on paired scRNA-seq data, and two mixed scCAS datasets (Supplementary Table S1). The simulated scCAS dataset is created by simCAS22, the state-of-the-art method for scCAS data simulation, in discrete mode to construct a dataset consisting of five clusters, each containing 500 cells. We then used the EpiScanpy pipeline to identify the top 3000 differentially accessible peaks for each cell type26. In this way, we obtained a simulated dataset with 2500 cells and 15,000 peaks. We collected eight real scCAS datasets for benchmarking. The Blood dataset contains ten types of human hemocytes from bone marrow, while the BM0828 dataset is a subset of the Blood dataset with the donor label of BM0828, containing seven types of human hemocytes23. To investigate the applicability of methods to datasets from different species and tissues, we further collected six datasets of WholeBrainA, WholeBrainB, LungA, LungB, LargeIntestine, and Spleen, which were profiled from the whole brain, lung, large intestine, and spleen tissues of adult mice25. These datasets vary in terms of protocol, species, tissue, number of cells, number of cell types, number of peaks, and imbalance degree, but all exhibit high sparsity and dimension (Supplementary Table S1). For the two scCAS datasets which were annotated solely based on paired scRNA-seq data, the Muto dataset was profiled via snATAC-seq and snRNA-seq and contains human kidney cells64. The snRNA-seq dataset was annotated based on lineage-specific marker expression, and the annotated snRNA-seq dataset was leveraged to predict snATAC-seq cell types with the label transfer function in Seurat65. The PBMC dataset was profiled by “10x Genomics via Single Cell Multiome ATAC + Gene Expression Sequencing” and contains the cryopreserved human peripheral blood mononuclear cells (PBMCs) of a healthy female donor. The cell type labels were annotated in the original study using only the scRNA-seq data. The Mixed-tissues and Mixed-protocols datasets are obtained by concatenating datasets from various tissues and protocols, respectively, to evaluate the robustness of different methods. The Mixed-tissues dataset was generated from the LungA and Spleen datasets while the Mixed-protocols dataset contains two batches assayed by snATAC-seq and 10X55,56. Cell types that accounted for less than three percent of the total cells were discarded along with the unknown categories to ensure evaluation credibility. We determine the imbalance degree by estimating the normalized entropy of the cell-type size distribution24 using Eq. (15):

$$I=1+frac{1}{log C}mathop{sum }limits_{c=1}^{C}frac{{n}_{c}}{N}log frac{{n}_{c}}{N}$$


where (C) is the number of cell types in the dataset, ({n}_{c}) is the number of cells in the cell type (c) and (N) is the total number of cells in the dataset. The more imbalanced the dataset is, the higher the value (I) has. This metric will have a value of 1 if all the cells have the same type and 0 if all the cell types have the same number of cells.

Construction of reference data

The construction approaches of reference data for scCASER are flexible and diverse. Firstly, we can obtain references from existing bulk data. For the Blood and BM0828 datasets, the reference data was constructed using bulk data from 17 hematopoietic cell types23. We counted the reads aligned to peaks of the scCAS data for the bulk samples, resulting in a count matrix that shares the same peaks as the target scCAS data. Then, we obtained the reference data by scaling the count matrix based on the total mapped reads of each bulk sample7.

Secondly, we can construct pseudo-bulk reference data by aggregating scCAS data of an external dataset from a similar tissue as the target data. The WholeBrainA and WholeBrainB datasets, as well as the LungA and LungB datasets, are paired data derived from similar tissues. We grouped the cells in one of the paired datasets by cell type, took the sum of each peak over cells of the same type, and then obtained the pseudo-bulk reference data for another dataset.

Thirdly, we can aggregate the scCAS data by its own clustering labels to construct the pseudo-bulk self-reference data. For scCAS datasets without external reference data, such as the Spleen and LargeIntestine datasets, we applied Louvain clustering with the default resolution, took the sum of each peak over cells of the same cluster, and then obtained the pseudo-bulk self-reference data for the target scCAS data.

Performance evaluation

We evaluated data enhancement performance via numerical accuracy, cell clustering, and data visualization. For numerical accuracy, we utilized the area under the precision-recall curve (auPRC) and the area under the receiver operating characteristic curve (auROC) to test if the enhanced matrix can recover the true signal cell-wise and peak-wise, respectively66. auPRC considers the relationship between precision and recall, while auROC considers the relationship between correctly classified positive examples and the number of incorrectly classified negative examples. auPRC is preferred for tasks with a large skew in the class distribution.

For cell clustering, we adopted the widely-used Louvain algorithm26,65,67,68 and utilized a binary search strategy to ensure the number of clusters equals the number of cell types7,24,26,27,69. We assessed the clustering results by four widely used metrics: adjusted Rand index (ARI)28, adjusted mutual information (AMI)29, Fowlkes-Mallows index(FMI)30, and silhouette score31. Rand index (RI) computes a similarity measure between the cluster labels and the cell-type labels. ARI is adjusted based on RI and accounts for chance agreement. Mutual information (MI) quantifies the correlation between the cluster labels and the cell-type labels and NMI is a normalized variant of MI. The Fowlkes-Mallows score FMI is defined as the geometric mean of the pairwise precision and recall. The silhouette score measures the similarity between an object and its own cluster compared to that of other clusters. While silhouette score is commonly utilized clustering labels, we replaced it with the cell type labels to evaluate the performance of the impact of enhancement on the estimation of the distance between cells as scOpen10, and the higher the silhouette score, the better the performance. We used 1- Pearson correlation coefficients as the cell-to-cell distance matrix as scOpen10.

For data visualization, we performed PCA to reduce the dimensionality of the raw scCAS data or data enhanced by various methods to 50 and then used the t-SNE70 and UMAP71 algorithms to further reduce the dimension to two. Cells in the visualization could be colored by cell-type labels and batch indices. More detailed mathematical equations and formulas for the aforementioned evaluation metrics are provided in Supplementary Text S1.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.