
scButterfly: a versatile single-cell cross-modality translation method via dual-aligned variational autoencoders – Nature Communications

The basic model of scButterfly

The basic model of scButterfly (scButterfly-B) is based on a dual-aligned variational autoencoder framework. Taking the translation between scRNA-seq and scATAC-seq data as an example, scButterfly-B consists of seven primary components (Fig. 1b): an RNA encoder and an ATAC encoder denoted as $\mathrm{En_r}$ and $\mathrm{En_a}$, an RNA decoder and an ATAC decoder denoted as $\mathrm{De_r}$ and $\mathrm{De_a}$, a translator $\mathrm{T}$, and two discriminators $\mathrm{Dis_r}$ and $\mathrm{Dis_a}$. Given pre-processed paired training data of scRNA-seq $\mathbf{X}_\mathrm{r}\in\mathbb{R}^{n\times I_\mathrm{r}}$ and scATAC-seq $\mathbf{X}_\mathrm{a}\in\mathbb{R}^{n\times I_\mathrm{a}}$, where $n$ denotes the number of cells and $I_\mathrm{r}$/$I_\mathrm{a}$ denote the feature dimensions of the RNA/ATAC data, scButterfly-B translates the input $\mathbf{X}_\mathrm{r}$ to chromatin profiles by $\mathrm{De_a}(\mathrm{T_{r\to a}}(\mathrm{En_r}(\mathbf{X}_\mathrm{r})))$ and the input $\mathbf{X}_\mathrm{a}$ to transcriptome profiles by $\mathrm{De_r}(\mathrm{T_{a\to r}}(\mathrm{En_a}(\mathbf{X}_\mathrm{a})))$. Details of each component of scButterfly-B are as follows.

The encoders $\mathrm{En_r}$ and $\mathrm{En_a}$ embed the RNA and ATAC inputs into low-dimensional representations, respectively. In $\mathrm{En_r}$, we utilize fully connected layers with the LeakyReLU activation function to map the input gene expression vector $\mathbf{X}_\mathrm{r}^{k}$ of the $k$-th cell ($k\le n$) to 256 dimensions and subsequently to 128 dimensions. The activation function is defined as $\mathrm{LeakyReLU}(x)=x\,I_{x\ge 0}+ax\,I_{x<0}$, where $a$ is set to 0.01 to address the vanishing gradient problem. In $\mathrm{En_a}$, inspired by the insight that most chromatin accessibility interactions occur at an intra-chromosomal level21 and to alleviate the computational burden, we prune the inter-chromosomal connections and focus on intra-chromosomal biological patterns: the profiles of each chromosome in the input $\mathbf{X}_\mathrm{a}^{k}$ are mapped to a 32-dimensional space, and the concatenated results are then projected onto a 128-dimensional latent space using fully connected layers with LeakyReLU. Both $\mathrm{En_r}$ and $\mathrm{En_a}$ use a drop-out mechanism with a probability of 0.1 for all latent layers. Additionally, inspired by masked autoencoders (MAE)57, we adopt a masking strategy for the input data by randomly setting 50% of the elements in $\mathbf{X}_\mathrm{r}$ and 30% of the elements in $\mathbf{X}_\mathrm{a}$ to zero.
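As a concrete illustration of the input masking described above, the following NumPy sketch randomly zeroes a fraction of the entries of a profile matrix; the function and variable names are illustrative and not taken from the scButterfly codebase:

```python
import numpy as np

def mask_input(X, mask_rate, rng):
    # Randomly set a fraction `mask_rate` of the entries to zero,
    # mimicking the MAE-style input masking (illustrative sketch).
    keep = rng.random(X.shape) >= mask_rate
    return X * keep

rng = np.random.default_rng(0)
X_r = rng.random((100, 3000))            # toy RNA matrix: 100 cells x 3000 HVGs
X_r_masked = mask_input(X_r, 0.5, rng)   # ~50% of entries zeroed, as for RNA input
```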

The translator $\mathrm{T}$ operates in the latent space and facilitates both the translation between modalities and the end-to-end mapping within each individual modality (Fig. 1b). We first map the input $\mathrm{En_r}(\mathbf{X}_\mathrm{r})$ from the RNA encoder into a 128-dimensional mean vector $\mathbf{X}_\mathrm{mean}^\mathrm{r}$ and log-variance vector $\mathbf{X}_\mathrm{var}^\mathrm{r}$ with two blocks of fully connected layer and LeakyReLU, respectively. Then, following the assumption of variational autoencoders, the translator $\mathrm{T}$ obtains the latent embedding by sampling from the multivariate Gaussian distribution parameterized by $\mathbf{X}_\mathrm{mean}^\mathrm{r}$ and $\mathbf{X}_\mathrm{var}^\mathrm{r}$ as follows:

$$\mathbf{X}_\mathrm{embed}=\mathbf{X}_\mathrm{mean}^\mathrm{r}+e^{\frac{\mathbf{X}_\mathrm{var}^\mathrm{r}}{2}}\times \mathbf{eps},\quad \mathbf{eps}\sim \mathcal{N}\left(\mathbf{0},\mathbf{I}\right).$$


Ultimately, the 128-dimensional $\mathrm{T_{r\to a}}(\mathrm{En_r}(\mathbf{X}_\mathrm{r}))$ and $\mathrm{T_{r\to r}}(\mathrm{En_r}(\mathbf{X}_\mathrm{r}))$, namely the translated ATAC embeddings and the mapped RNA embeddings, are generated from $\mathbf{X}_\mathrm{embed}$ with two blocks of fully connected layer and LeakyReLU, respectively. These translated/mapped embeddings serve as the input for the decoders. Analogously, we use another two blocks to map the input $\mathrm{En_a}(\mathbf{X}_\mathrm{a})$ from the ATAC encoder into a 128-dimensional mean vector $\mathbf{X}_\mathrm{mean}^\mathrm{a}$ and log-variance vector $\mathbf{X}_\mathrm{var}^\mathrm{a}$, respectively, and generate the 128-dimensional translated RNA embeddings $\mathrm{T_{a\to r}}(\mathrm{En_a}(\mathbf{X}_\mathrm{a}))$ and mapped ATAC embeddings $\mathrm{T_{a\to a}}(\mathrm{En_a}(\mathbf{X}_\mathrm{a}))$ using the same blocks as the RNA-to-RNA mapping and the RNA-to-ATAC translation, respectively (Fig. 1b).
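The reparameterized sampling used by the translator can be sketched as follows (a minimal NumPy illustration of the sampling equation above; names are ours, not from the scButterfly code):

```python
import numpy as np

def reparameterize(mean, logvar, rng):
    # X_embed = mean + exp(logvar / 2) * eps, with eps ~ N(0, I)
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(logvar / 2.0) * eps

rng = np.random.default_rng(0)
mean = np.zeros((4, 128))     # toy 128-dimensional mean vectors for 4 cells
logvar = np.zeros((4, 128))   # log-variance of zero -> unit variance
z = reparameterize(mean, logvar, rng)
```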

The discriminators $\mathrm{Dis_r}$ and $\mathrm{Dis_a}$ aim to distinguish the original embeddings from the translated embeddings, enabling adversarial training that improves the similarity between the two58,59. Taking the discriminator for RNA embeddings as a specific instance, $\mathrm{Dis_r}$ receives original RNA embeddings $\mathrm{En_r}(\mathbf{X}_\mathrm{r})$ and translated embeddings $\mathrm{T_{a\to r}}(\mathrm{En_a}(\mathbf{X}_\mathrm{a}))$ with equal probability during scButterfly-B training, and is expected to judge them according to the following equation:

$$\left\{\begin{array}{c}\mathrm{Dis_r}\left(\mathrm{T_{a\to r}}\left(\mathrm{En_a}\left(\mathbf{X}_\mathrm{a}\right)\right)\right)=0\\ \mathrm{Dis_r}\left(\mathrm{En_r}\left(\mathbf{X}_\mathrm{r}\right)\right)=1\end{array}\right..$$


Similarly, the discriminator $\mathrm{Dis_a}$ for ATAC embeddings is designed to differentiate $\mathrm{En_a}(\mathbf{X}_\mathrm{a})$ from $\mathrm{T_{r\to a}}(\mathrm{En_r}(\mathbf{X}_\mathrm{r}))$. For the network structures of $\mathrm{Dis_r}$ and $\mathrm{Dis_a}$, we separately employ a two-layer network: a 128-dimensional latent block of fully connected layer with LeakyReLU and an output Sigmoid layer (Fig. 1b).

Decoders of Der and Dea are responsible for reconstructing the original high-dimensional representations of transcriptome and chromatin profiles, respectively, based on the mapped and translated embeddings in latent space (Fig. 1b). Der follows an inverse process of Enr by decoding the embeddings from 128 dimensions to 256 dimensions and to the original space of transcriptome profiles with fully connected layers and LeakyReLU. Similarly, Dea inverts the process of Ena but adopts a Sigmoid activation in the output layer given the near-binary nature of chromatin profiles. We also adopt a drop-out mechanism with a probability of 0.1 for all the latent layers of Der and Dea.

The training procedure of scButterfly

Based on the architecture of the scButterfly-B model, we propose the use of a step-wise training strategy, consisting of a pretraining phase and an integrative training phase (Fig. 1a). Using the translation between transcriptome and chromatin profiles as an example again, in the pretraining phase, we first independently train the RNA encoder Enr with its corresponding decoder Der, as well as the ATAC encoder Ena with its decoder Dea. During the integrative training phase, we then initialize the parameters of the encoders and decoders with the pre-trained parameters, and train the scButterfly-B model using both modalities simultaneously to capture interdependencies between modalities. The detailed process is as follows.

For the pretraining phase, we mainly focus on the reconstruction loss and the evidence lower bound (ELBO) loss for variational inference. Translators are incorporated only for the end-to-end mapping within each individual modality, to maintain consistency with the integrative training phase. In each iteration, the processed RNA profiles $\mathbf{X}_\mathrm{r}$ are forward-propagated through the encoder $\mathrm{En_r}$, the translator $\mathrm{T_{r\to r}}$ and the decoder $\mathrm{De_r}$, resulting in the reconstruction $\mathbf{X}_\mathrm{r}^\mathrm{pred}=\mathrm{De_r}(\mathrm{T_{r\to r}}(\mathrm{En_r}(\mathbf{X}_\mathrm{r})))$. We use the mean square error (MSE) as the reconstruction loss, calculated as:

$$\mathrm{L_r}\left(\mathbf{X}_\mathrm{r}^\mathrm{pred},\mathbf{X}_\mathrm{r}\right)=\mathrm{MSE}\left(\mathbf{X}_\mathrm{r}^\mathrm{pred},\mathbf{X}_\mathrm{r}\right)=\frac{1}{n}\sum_{k=1}^{n}\left|\mathbf{X}_\mathrm{r}^\mathrm{pred}[k]-\mathbf{X}_\mathrm{r}[k]\right|^{2}.$$


The ELBO loss can be described as the Kullback-Leibler divergence between the shared embedding $\mathbf{X}_\mathrm{embed}^{\mathrm{r}_{\mathrm{pretrain}}}$ of $\mathrm{T_{r\to r}}$ and the standard multivariate Gaussian distribution:

$$\mathrm{L_{ELBO}}\left(\mathbf{X}_\mathrm{embed}^{\mathrm{r}_{\mathrm{pretrain}}}\right)=\mathrm{KL}\left(\mathbf{X}_\mathrm{embed}^{\mathrm{r}_{\mathrm{pretrain}}}\,\big\|\,\mathcal{N}\left(\mathbf{0},\mathbf{I}\right)\right)=-\frac{1}{2}\left(1+\mathbf{X}_\mathrm{var}^{\mathrm{r}_{\mathrm{pretrain}}}-\left(\mathbf{X}_\mathrm{mean}^{\mathrm{r}_{\mathrm{pretrain}}}\right)^{2}-e^{\mathbf{X}_\mathrm{var}^{\mathrm{r}_{\mathrm{pretrain}}}}\right),$$


where $\mathbf{X}_\mathrm{mean}^{\mathrm{r}_{\mathrm{pretrain}}}$ and $\mathbf{X}_\mathrm{var}^{\mathrm{r}_{\mathrm{pretrain}}}$ denote the mean and log-variance vectors obtained from the translator $\mathrm{T_{r\to r}}$, respectively. We train $\mathrm{En_r}$, $\mathrm{T_{r\to r}}$ and $\mathrm{De_r}$ with a combination of the two parts of loss as follows:

$$\mathrm{Loss}_{\mathrm{pretrain}}=w_\mathrm{r}\,\mathrm{L_r}\left(\mathbf{X}_\mathrm{r}^\mathrm{pred},\mathbf{X}_\mathrm{r}\right)+w_{\mathrm{ELBO}}\,\mathrm{L_{ELBO}}\left(\mathbf{X}_\mathrm{embed}^{\mathrm{r}_{\mathrm{pretrain}}}\right).$$

We set $w_\mathrm{r}=1$ and $w_\mathrm{ELBO}=\frac{20}{I_\mathrm{r}}$, and train the networks for 100 epochs using the Adam optimizer60 with a learning rate of 0.001 and early stopping with a patience of 50 epochs. The strategy for ATAC pretraining is nearly identical to that for RNA pretraining, the only difference being the use of a binary cross-entropy (BCE) reconstruction loss, denoted $\mathrm{L_a}$, instead of $\mathrm{L_r}$. With the reconstructed output $\mathbf{X}_\mathrm{a}^\mathrm{pred}=\mathrm{De_a}(\mathrm{T_{a\to a}}(\mathrm{En_a}(\mathbf{X}_\mathrm{a})))$, $\mathrm{L_a}$ can be calculated by:

$$\mathrm{L_a}\left(\mathbf{X}_\mathrm{a}^\mathrm{pred},\mathbf{X}_\mathrm{a}\right)=\mathrm{BCE}\left(\mathbf{X}_\mathrm{a}^\mathrm{pred},\mathbf{X}_\mathrm{a}\right)=-\sum_{k=1}^{n}\left(\mathbf{X}_\mathrm{a}[k]\log \left(\mathbf{X}_\mathrm{a}^\mathrm{pred}[k]\right)+\left(1-\mathbf{X}_\mathrm{a}[k]\right)\log \left(1-\mathbf{X}_\mathrm{a}^\mathrm{pred}[k]\right)\right).$$
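The pretraining loss terms (MSE reconstruction, BCE reconstruction, and the closed-form KL/ELBO term) can be written compactly as below; this is a NumPy sketch for illustration, whereas the actual model is trained within a deep learning framework:

```python
import numpy as np

def mse_loss(pred, target):
    # L_r: squared error per cell, averaged over the n cells
    return np.mean(np.sum((pred - target) ** 2, axis=1))

def bce_loss(pred, target, eps=1e-12):
    # L_a: binary cross-entropy for near-binary ATAC profiles
    pred = np.clip(pred, eps, 1 - eps)
    return -np.sum(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def kl_loss(mean, logvar):
    # L_ELBO: closed-form KL to N(0, I): -1/2 * (1 + logvar - mean^2 - exp(logvar))
    return -0.5 * np.sum(1 + logvar - mean ** 2 - np.exp(logvar))
```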


For the integrative training, we further calculate the discriminator loss $\mathrm{L_{Dis}}$, using the BCE loss $\mathrm{L_D}$ for the classification task of the discriminators and implementing the soft-labels technique61. Specifically, we first sample a smoothed positive label $l_\mathrm{pos}\sim \mathrm{U}[0.8,1]$ and a smoothed negative label $l_\mathrm{neg}\sim \mathrm{U}[0,0.2]$ in place of the hard labels 1 and 0, respectively. We then derive the discriminator loss $\mathrm{L_{Dis}}$ as follows:

$$\mathrm{L_{Dis}}=\mathrm{L_D}\left(\mathrm{Dis_a}\left(\mathrm{T_{r\to a}}\left(\mathrm{En_r}\left(\mathbf{X}_\mathrm{r}\right)\right)\right),l_\mathrm{neg}\right)+\mathrm{L_D}\left(\mathrm{Dis_a}\left(\mathrm{En_a}\left(\mathbf{X}_\mathrm{a}\right)\right),l_\mathrm{pos}\right)+\mathrm{L_D}\left(\mathrm{Dis_r}\left(\mathrm{T_{a\to r}}\left(\mathrm{En_a}\left(\mathbf{X}_\mathrm{a}\right)\right)\right),l_\mathrm{neg}\right)+\mathrm{L_D}\left(\mathrm{Dis_r}\left(\mathrm{En_r}\left(\mathbf{X}_\mathrm{r}\right)\right),l_\mathrm{pos}\right).$$
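The soft-labels discriminator loss can be sketched as follows, assuming toy discriminator outputs; the helper names are illustrative, not from the scButterfly code:

```python
import numpy as np

def soft_labels(n, positive, rng):
    # Smoothed targets: U[0.8, 1] for original embeddings, U[0, 0.2] for translated ones
    return rng.uniform(0.8, 1.0, n) if positive else rng.uniform(0.0, 0.2, n)

def bce(pred, target, eps=1e-12):
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

rng = np.random.default_rng(0)
# Toy discriminator outputs for a mini-batch of 8 embeddings
d_original = rng.uniform(0.6, 1.0, 8)     # Dis(En(X)): should be judged "original"
d_translated = rng.uniform(0.0, 0.4, 8)   # Dis(T(En(X))): should be judged "translated"
loss_dis = bce(d_original, soft_labels(8, True, rng)) \
         + bce(d_translated, soft_labels(8, False, rng))
```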


For updating the discriminators, we employ an SGD optimizer62 with a learning rate of 0.005.

For each iteration, we train the encoders, translator, and decoders after updating the discriminators. Finally, we obtain two reconstruction results and two translation results as follows:

$$\begin{array}{c}\mathbf{X}_{\mathrm{r}\to \mathrm{r}}^\mathrm{pred}=\mathrm{De_r}\left(\mathrm{T_{r\to r}}\left(\mathrm{En_r}\left(\mathbf{X}_\mathrm{r}\right)\right)\right)\\ \mathbf{X}_{\mathrm{r}\to \mathrm{a}}^\mathrm{pred}=\mathrm{De_a}\left(\mathrm{T_{r\to a}}\left(\mathrm{En_r}\left(\mathbf{X}_\mathrm{r}\right)\right)\right)\\ \mathbf{X}_{\mathrm{a}\to \mathrm{r}}^\mathrm{pred}=\mathrm{De_r}\left(\mathrm{T_{a\to r}}\left(\mathrm{En_a}\left(\mathbf{X}_\mathrm{a}\right)\right)\right)\\ \mathbf{X}_{\mathrm{a}\to \mathrm{a}}^\mathrm{pred}=\mathrm{De_a}\left(\mathrm{T_{a\to a}}\left(\mathrm{En_a}\left(\mathbf{X}_\mathrm{a}\right)\right)\right),\end{array}$$


and train the encoders, translator, and decoders with a combination of reconstruction loss, translation loss, ELBO loss, and the recalculated discriminator loss with updated discriminators:

$$\mathrm{Loss}=w_\mathrm{r}\left(\mathrm{L_r}\left(\mathbf{X}_{\mathrm{r}\to \mathrm{r}}^\mathrm{pred},\mathbf{X}_\mathrm{r}\right)+\mathrm{L_r}\left(\mathbf{X}_{\mathrm{a}\to \mathrm{r}}^\mathrm{pred},\mathbf{X}_\mathrm{r}\right)\right)+w_\mathrm{a}\left(\mathrm{L_a}\left(\mathbf{X}_{\mathrm{a}\to \mathrm{a}}^\mathrm{pred},\mathbf{X}_\mathrm{a}\right)+\mathrm{L_a}\left(\mathbf{X}_{\mathrm{r}\to \mathrm{a}}^\mathrm{pred},\mathbf{X}_\mathrm{a}\right)\right)+w_{\mathrm{ELBO}}\left(\mathrm{L_{ELBO}}\left(\mathbf{X}_\mathrm{embed}^\mathrm{r}\right)+\mathrm{L_{ELBO}}\left(\mathbf{X}_\mathrm{embed}^\mathrm{a}\right)\right)-w_{\mathrm{dis}}\,\mathrm{L_{Dis}},$$


where $\mathbf{X}_\mathrm{embed}^\mathrm{r}$ and $\mathbf{X}_\mathrm{embed}^\mathrm{a}$ denote the shared embeddings of RNA and ATAC, respectively, $\mathrm{L_{Dis}}$ is recalculated with the updated discriminators for each mini-batch, and the weights $w_\mathrm{r}$, $w_\mathrm{a}$, $w_\mathrm{ELBO}$ and $w_\mathrm{dis}$ are set to 1, 2, $\frac{20}{I_\mathrm{r}}+\frac{20}{I_\mathrm{a}}$, and 1, respectively. We update the encoders, translator, and decoders using the Adam optimizer60 with a learning rate of 0.001. The integrative training consists of 200 epochs with an early-stopping patience of 50 epochs.

The data augmentation strategy of scButterfly

To further enhance the translation performance and address the challenges posed by the limited number of training cells and the high noise in multi-omics data, we introduce two variants of the scButterfly-B model, namely scButterfly-T (Type) and scButterfly-C (Cluster), for the scenarios with and without cell-type labels, respectively. scButterfly-T and scButterfly-C share the same model structure as scButterfly-B, while additionally employing data augmentation strategies to generate synthetic samples before model training. The generated samples effectively simulate the original dataset and alleviate the issue of the limited number of cells measured by multi-omics protocols (Supplementary Text 11).

For scButterfly-T, we partition all training samples by cell-type label and, for each cell type, generate artificial samples by pairing the profiles of one modality from randomly shuffled cells with the profiles of the other modality from the non-shuffled cells (Fig. 1c). To strike a balance between efficiency and performance, we perform two rounds of shuffling, resulting in twice as many synthetic samples as the original training set. The additional samples are mixed with the original training samples, yielding a training dataset three times as large as the original, for both the pretraining and integrative training phases.
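The shuffling-based augmentation of scButterfly-T can be sketched as below (illustrative names; two rounds of per-type shuffling yield a training set three times the original size):

```python
import numpy as np

def augment_by_shuffling(X_rna, X_atac, labels, n_rounds=2, seed=0):
    # For each cell type, pair shuffled profiles of one modality with
    # non-shuffled profiles of the other, then append them to the originals.
    rng = np.random.default_rng(seed)
    rna_parts, atac_parts = [X_rna], [X_atac]
    for _ in range(n_rounds):
        for t in np.unique(labels):
            idx = np.where(labels == t)[0]
            perm = rng.permutation(idx)
            rna_parts.append(X_rna[perm])    # shuffled modality
            atac_parts.append(X_atac[idx])   # non-shuffled modality
    return np.vstack(rna_parts), np.vstack(atac_parts)

rng = np.random.default_rng(1)
X_rna = rng.random((30, 5))
X_atac = rng.random((30, 7))
labels = np.repeat(np.array(["A", "B", "C"]), 10)
X_rna_aug, X_atac_aug = augment_by_shuffling(X_rna, X_atac, labels)
```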

In scenarios where cell-type annotation is unavailable, scButterfly-C serves as a more general alternative by leveraging multi-omics integration techniques (Fig. 1d). scButterfly-C first trains a MultiVI22 model with default settings to obtain joint cell embeddings based on scRNA-seq and scATAC-seq profiles. Subsequently, Leiden63 clustering is used to obtain cluster labels that compensate for the lack of cell-type annotation. scButterfly-C employs a high clustering resolution of three to ensure high purity of cell types within each cluster. Finally, for each Leiden cluster, scButterfly-C generates artificial samples by pairing the profiles of one modality from randomly shuffled cells with the profiles of the other modality from non-shuffled cells for two rounds, likewise resulting in a training dataset three times as large as the original, for both the pretraining and integrative training phases. scButterfly is a flexible framework that can also incorporate other methods for data augmentation, such as anchors detected between the two modalities with Seurat (Supplementary Text 12).

Data collection, pre-processing, and post-processing

We collected multiple datasets with different modalities, species, tissues, and protocols to evaluate the performance of the scButterfly model from a comprehensive perspective. For the translation between transcriptome and chromatin profiles, we collected seven paired RNA and ATAC datasets: the BMMC dataset composed of bone marrow mononuclear cells from 10 healthy human donors by 10x-Multiome23, the MB and MDS datasets of adult mouse brain and dorsal skin, respectively, profiled by SHARE-seq2, the MK dataset of adult mouse kidney profiled by sci-CAR3, the MCC dataset of adult mouse cerebral cortices profiled by SNARE-seq4, the CL dataset of multiple cancer cell lines profiled by scCAT-seq5, and the PBMC dataset of peripheral blood mononuclear cells profiled by 10x-Multiome (Supplementary Fig. 3).

Additionally, we considered the diagonal analysis of unpaired data and collected eight unpaired multi-omics RNA and ATAC datasets: the UP_HK dataset of snRNA-seq and snATAC-seq profiles from five healthy human kidney samples41, the UP_MPMC dataset of 10x RNA v3 and snATAC-seq profiles from mouse primary motor cortex42, the UP_eye, UP_muscle, UP_pancreas, UP_spleen, UP_stomach, UP_thymus datasets of sci-RNA-seq43 and sci-ATAC-seq44 profiles from various human fetal organs (Supplementary Fig. 3).

We further extended scButterfly to single-cell perturbation-response prediction and collected the PT_PBMC dataset of seven cell types of control and interferon-beta-stimulated human peripheral blood mononuclear cells47. In addition to epigenome and transcriptome, we also investigated the translation between transcriptome and proteome profiles with two datasets: the CITE_BMMC dataset of bone marrow mononuclear cells from the same human donors as the BMMC dataset23 and the CITE_BM dataset of scRNA-seq profiles alongside a panel of antibodies from human bone marrow51 (Supplementary Fig. 3).

We provide a summary of the above-mentioned datasets (Supplementary Fig. 3), including more detailed information such as the number of cells, features, batches and cell types, as well as the imbalance of cell types, sparsity, protocol and species of these datasets.

For data pre-processing, we applied conventional methods specific to different modalities (Fig. 1a). For the count matrices of scRNA-seq profiles, we normalized the total count of each cell to the median of the per-cell total counts before normalization, log-transformed the normalized values with a pseudo-count of one, and selected the top 3000 highly variable genes (HVGs)27,51 for scButterfly training and downstream analysis.

For the count matrices of scATAC-seq profiles, we binarized the matrices, filtered out the peaks activated in less than 0.5% of all cells, and performed term frequency-inverse document frequency (TF-IDF) transformation as follows25,38,64:

$$\begin{array}{c}\mathbf{TF}[i][j]=\frac{\mathbf{X}_\mathrm{a}^\mathrm{bin}[i][j]}{\sum_{k=1}^{I_\mathrm{a}}\mathbf{X}_\mathrm{a}^\mathrm{bin}[i][k]}\\ \mathbf{IDF}[i][j]=\log \left(1+\frac{n}{\sum_{k=1}^{n}\mathbf{X}_\mathrm{a}^\mathrm{bin}[k][j]}\right)\\ \mathbf{X}_\mathrm{a}^\mathrm{TFIDF}=\mathbf{TF}\times \mathbf{IDF},\end{array}$$


where $\mathbf{X}_\mathrm{a}^\mathrm{bin}$ and $\mathbf{X}_\mathrm{a}^\mathrm{TFIDF}$ are the matrices before and after the TF-IDF transformation, respectively. The matrix was then scaled to the range $[0,1]$ by dividing its elements by the maximum value $S$ of the matrix.
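The TF-IDF transformation and subsequent scaling can be sketched in NumPy as follows; the rare-peak filter is mimicked here by simply dropping all-zero peaks in the toy data, and all names are illustrative:

```python
import numpy as np

def tfidf_transform(X_bin):
    # TF-IDF on a binarized cell x peak matrix, then scale to [0, 1] by the max S
    n = X_bin.shape[0]
    tf = X_bin / X_bin.sum(axis=1, keepdims=True)            # per-cell term frequency
    idf = np.log(1 + n / X_bin.sum(axis=0, keepdims=True))   # per-peak IDF
    X_tfidf = tf * idf
    S = X_tfidf.max()
    return X_tfidf / S, S

rng = np.random.default_rng(0)
X_bin = (rng.random((50, 200)) < 0.1).astype(float)
X_bin[:, 0] = 1.0                          # guarantee a non-empty row in the toy data
X_bin = X_bin[:, X_bin.sum(axis=0) > 0]    # mimic the rare-peak filter
X_scaled, S = tfidf_transform(X_bin)
```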

For the perturbation-response dataset PT_PBMC, we retrieved the processed data from scGen45 and performed no additional processing to maintain consistency in evaluation.

For the count matrices of proteome profiles, we performed the centered log ratio (CLR) transformation across cells51, with the formula as follows:

$$\mathrm{CLR}\left(\mathbf{x}\right)=\log \left(\frac{x_{1}}{\mathrm{g}\left(\mathbf{x}\right)},\ldots,\frac{x_{n}}{\mathrm{g}\left(\mathbf{x}\right)}\right),$$


where $\mathbf{x}=(x_{1},\ldots,x_{n})$ represents the count vector of protein epitopes for each cell, and $\mathrm{g}(\mathbf{x})$ denotes the geometric mean of $(x_{1},\ldots,x_{n})$.
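A minimal sketch of the CLR transformation for one cell, using the identity $\log(x_i/\mathrm{g}(\mathbf{x}))=\log x_i-\frac{1}{n}\sum_j \log x_j$; the small pseudo-count `eps` is an illustrative choice to avoid $\log(0)$, not a detail from the paper:

```python
import numpy as np

def clr_transform(x, eps=1e-8):
    # CLR: log(x_i / g(x)) = log(x_i) - mean(log(x)); eps avoids log(0)
    log_x = np.log(x + eps)
    return log_x - log_x.mean()

counts = np.array([10.0, 10.0, 10.0, 10.0])  # a toy cell with uniform epitope counts
clr = clr_transform(counts)                  # all zeros: every ratio equals g(x)
```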

For data post-processing, we set values in the predicted RNA and ATAC matrices to zero if they fell below a threshold of $10^{-4}$, considering the highly sparse nature of the original profiles. Additionally, we devised a method to recover count matrices from the ATAC predictions, so that downstream methods requiring count matrices as input, such as MultiVI22, can be applied. Specifically, the inverse process reverses the scaling and TF-IDF transformation of the pre-processing stage:

$$\mathbf{X}_{\mathrm{r}\to \mathrm{a}}^\mathrm{count}[i][j]=\mathbf{X}_{\mathrm{r}\to \mathrm{a}}^\mathrm{sparse}[i][j]\times S/\mathbf{IDF}[i][j]\times \sum_{k=1}^{I_\mathrm{a}}\mathbf{X}_\mathrm{a}^\mathrm{bin}[i][k],$$


where $\mathbf{X}_{\mathrm{r}\to \mathrm{a}}^\mathrm{sparse}$ represents the sparse version of the scButterfly output $\mathbf{X}_{\mathrm{r}\to \mathrm{a}}^\mathrm{pred}$, $S$ is the scale factor from pre-processing, and $\mathbf{IDF}$ and $\sum_{k=1}^{I_\mathrm{a}}\mathbf{X}_\mathrm{a}^\mathrm{bin}[i][k]$ are the terms of the TF-IDF transformation in pre-processing. We finally set the elements of $\mathbf{X}_{\mathrm{r}\to \mathrm{a}}^\mathrm{count}$ whose values are greater than both the corresponding column and row means to one, and to zero otherwise1, resulting in binarized count matrices for downstream analysis.
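The count-recovery and binarization steps above can be sketched as follows (illustrative names and toy inputs, not the scButterfly implementation):

```python
import numpy as np

def recover_counts(X_sparse, S, idf, row_sums):
    # Undo the scaling and TF-IDF transform, then binarize each entry by
    # comparing it against its row mean and column mean.
    counts = X_sparse * S / idf * row_sums[:, None]
    row_mean = counts.mean(axis=1, keepdims=True)
    col_mean = counts.mean(axis=0, keepdims=True)
    return ((counts > row_mean) & (counts > col_mean)).astype(int)

rng = np.random.default_rng(0)
X_sparse = rng.random((20, 30))                      # toy sparse ATAC prediction
idf = rng.uniform(0.5, 2.0, (1, 30))                 # toy per-peak IDF terms
row_sums = rng.integers(100, 200, 20).astype(float)  # toy per-cell peak totals
X_count = recover_counts(X_sparse, S=3.5, idf=idf, row_sums=row_sums)
```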

Unpaired data training with scButterfly

In addition to translating paired multi-modal profiles, scButterfly can also perform the more challenging diagonal analysis of unpaired data using a method similar to the data augmentation strategy of scButterfly-T. Taking the translation between transcriptome and chromatin profiles as an example, we first identify the shared cell types of the unpaired scRNA-seq and scATAC-seq single-modal datasets. Suppose there are RNA profiles $\mathbf{X}_\mathrm{r}\in \mathbb{R}^{n_\mathrm{r}\times I_\mathrm{r}}$ and ATAC profiles $\mathbf{X}_\mathrm{a}\in \mathbb{R}^{n_\mathrm{a}\times I_\mathrm{a}}$ with $n_\mathrm{r}$ and $n_\mathrm{a}$ cells, respectively. Let $m$ denote the number of shared cell types, and for type $i\le m$, let $s_{i}^\mathrm{r}$ and $s_{i}^\mathrm{a}$ denote the proportion of this type in $\mathbf{X}_\mathrm{r}$ and $\mathbf{X}_\mathrm{a}$, respectively. The sampling proportion for type $i$ is then given by:

$$s_{i}=\frac{s_{i}^\mathrm{r}+s_{i}^\mathrm{a}}{\sum_{k=1}^{m}\left(s_{k}^\mathrm{r}+s_{k}^\mathrm{a}\right)},$$


which represents the normalized average proportion of type $i$. Next, we sample $\frac{s_{i}\left(n_\mathrm{r}+n_\mathrm{a}\right)}{2}$ paired profiles for model training by randomly matching the RNA profile of one cell of type $i$ with the ATAC profile of another cell of type $i$. Although the training process only covers the cell types shared between the two single-modal datasets, we included all cell types of each single-modal dataset in the testing phase.
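The sampling proportions for the shared cell types can be computed as in this short sketch (toy proportions; names are illustrative):

```python
import numpy as np

def sampling_proportions(s_r, s_a):
    # Normalized average proportion s_i across the shared cell types
    s = np.asarray(s_r) + np.asarray(s_a)
    return s / s.sum()

s_r = [0.5, 0.3, 0.2]   # toy per-type proportions in the RNA dataset
s_a = [0.4, 0.4, 0.2]   # toy per-type proportions in the ATAC dataset
s = sampling_proportions(s_r, s_a)
n_r, n_a = 1000, 800
n_sampled = s * (n_r + n_a) / 2   # paired profiles sampled per shared type
```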

Single-cell perturbation-response prediction with scButterfly

The scButterfly framework can be generalized to translate between transcriptome profiles before and after perturbation, enabling the prediction of single-cell perturbation responses. In terms of the model architecture, we replace the ATAC encoder and decoder with the RNA encoder and decoder, respectively. One pair of RNA encoder and decoder is dedicated to modeling gene expression data from the control group, while another pair of RNA encoder and decoder is responsible for modeling gene expression data from the stimulated group.

For model training, we utilize optimal transport to match cells and generate paired training samples48. Specifically, we first divide cells into groups by cell type and perform principal component analysis (PCA) to reduce the dimension of each group to 50. Denoting the number of cell types as $m$, for type $k\le m$ we calculate the Euclidean distance cost matrix $\mathbf{M}_{k}$ between the 50-dimensional representations of the control and stimulated data. In the absence of prior knowledge, a uniform distribution is assumed for both the control and stimulated groups, represented by the weight vectors $\mathbf{w}_{k}^\mathrm{ctr}=\left(\frac{1}{n_{k}^\mathrm{ctr}},\ldots,\frac{1}{n_{k}^\mathrm{ctr}}\right)\in \mathbb{R}^{1\times n_{k}^\mathrm{ctr}}$ and $\mathbf{w}_{k}^\mathrm{sti}=\left(\frac{1}{n_{k}^\mathrm{sti}},\ldots,\frac{1}{n_{k}^\mathrm{sti}}\right)\in \mathbb{R}^{1\times n_{k}^\mathrm{sti}}$, where $n_{k}^\mathrm{ctr}$ and $n_{k}^\mathrm{sti}$ denote the cell counts of the control and stimulated groups of type $k$, respectively. The optimal transport problem can then be formulated as the Earth Mover's Distance (EMD) problem:

$$\boldsymbol{\gamma}_{k}=\mathop{\mathrm{argmin}}\limits_{\boldsymbol{\gamma}}\left\langle \boldsymbol{\gamma},\mathbf{M}_{k}\right\rangle_{\mathrm{F}}\quad \mathrm{s.t.}\ \left\{\begin{array}{l}\boldsymbol{\gamma}\cdot \mathbf{1}=\mathbf{w}_{k}^\mathrm{ctr}\\ \boldsymbol{\gamma}^{T}\cdot \mathbf{1}=\mathbf{w}_{k}^\mathrm{sti}\\ \boldsymbol{\gamma}\ge \mathbf{0}\end{array}\right.,$$


where $\boldsymbol{\gamma}_{k}$ is the optimal transport matrix for the control and stimulated data of type $k$, and $\left\langle \boldsymbol{\gamma},\mathbf{M}_{k}\right\rangle_{\mathrm{F}}$ is the Frobenius inner product, defined as $\sum_{i=1}^{n_{k}^\mathrm{ctr}}\sum_{j=1}^{n_{k}^\mathrm{sti}}\boldsymbol{\gamma}[i][j]\mathbf{M}_{k}[i][j]$. We use the algorithm in ref. 48 to solve this problem. Finally, for cell type $k$, we select the stimulated cell with the highest value in $\boldsymbol{\gamma}_{k}$ for each cell in the control group, thereby generating paired samples by pairing each control profile with its corresponding stimulated profile to train scButterfly. Note that some stimulated cells may be paired with multiple control cells; however, this kind of reuse does not significantly affect the prediction performance of scButterfly (Supplementary Text 13).
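The final matching step can be sketched as below. For illustration we build a surrogate plan directly from the cost matrix instead of solving the EMD problem, which the paper delegates to an optimal transport solver; with such a surrogate, selecting the highest-mass stimulated cell per control cell reduces to a nearest-neighbor match:

```python
import numpy as np

def pair_by_plan(gamma):
    # For each control cell (row), pick the stimulated cell carrying the
    # highest transport mass in the plan gamma (n_ctr x n_sti).
    return np.argmax(gamma, axis=1)

rng = np.random.default_rng(0)
ctr = rng.random((5, 50))    # 5 control cells in a 50-dim PCA space
sti = rng.random((4, 50))    # 4 stimulated cells
M = np.linalg.norm(ctr[:, None, :] - sti[None, :, :], axis=2)  # Euclidean cost matrix
gamma = np.exp(-M)           # surrogate plan, NOT the true EMD solution
pairs = pair_by_plan(gamma)  # matched stimulated index per control cell
```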

Translation between transcriptome and proteome with scButterfly

scButterfly can be easily extended to facilitate the translation between transcriptome and proteome profiles through specific modifications to the model structure and training strategy. In terms of the model structure, we keep the RNA encoder, RNA decoder, translator, and discriminators consistent with the translation between transcriptome and chromatin profiles. For the ADT (antibody-derived tag) encoder, the processed data are projected into the latent space via two blocks of 128-dimensional fully connected layer and LeakyReLU. The ADT decoder performs the reverse process, recovering the mapped/translated embeddings to the original dimension. Given the relatively high quality of ADT data, we do not use the masking strategy for the ADT input. We also discussed the impact of different embedding dimensions on translation performance in Supplementary Text 14.

During the model training phase, we follow the same procedure as for the translation between transcriptome and chromatin profiles. However, due to the generally lower dimensionality of ADT profiles in comparison to RNA or ATAC profiles, we assign a constant weight of $\frac{1}{150}$ to the KL divergence term associated with the ADT component. This precautionary measure mitigates the potential occurrence of posterior collapse, which can happen if the weight of the ELBO term is set too high and the variational posterior distribution closely matches the prior for a subset of latent variables65.

For the data augmentation strategy, no modifications are made for the scButterfly-T variant. However, for the scButterfly-C variant, instead of MultiVI22, we employ totalVI66, a multi-omics integration method specifically designed for transcriptome and proteome profiles. Specifically, we train totalVI with default settings to obtain joint embeddings of paired RNA and ADT profiles, then perform Leiden clustering63 with a resolution of three based on the cell embeddings, and finally augment the dataset and train scButterfly-C using the same approach as for the translation between RNA and ATAC profiles.

Evaluation metrics

To quantitatively evaluate the cell heterogeneity preserved in translated profiles, we first performed PCA to reduce the dimensionality of translated profiles to 50, then performed cell clustering by the Leiden algorithm with the default resolution of one63 based on the dimensionality reduction results, and finally assessed the clustering results by four widely-used metrics24,25,26: adjusted Rand index (ARI), adjusted mutual information (AMI), normalized mutual information (NMI), and homogeneity (HOM). The Rand index (RI) computes a similarity measure between the cluster labels and the cell-type labels; ARI adjusts RI to account for chance agreement. Mutual information (MI) quantifies the correlation between the cluster labels and the cell-type labels; NMI is a normalized variant of MI, while AMI further accounts for chance agreement. HOM measures the purity of cell types within each cluster, and it equals one if all the cells within the same cluster belong to the same cell type. Note that cell population sizes in most single-cell data are unbalanced, so AMI is more appropriate in most cases, as it is preferred when cluster sizes are unbalanced, whereas ARI is preferred when clusters are of nearly equal size54.
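All four clustering metrics are available in scikit-learn; the toy labels below are illustrative. Because each metric is invariant to a permutation of cluster labels, a clustering that matches the cell types up to relabelling scores 1.0 on every metric:

```python
from sklearn.metrics import (adjusted_rand_score, adjusted_mutual_info_score,
                             normalized_mutual_info_score, homogeneity_score)

cell_types = [0, 0, 1, 1, 2, 2]   # ground-truth cell-type labels (toy example)
clusters   = [1, 1, 0, 0, 2, 2]   # Leiden cluster assignments

ari = adjusted_rand_score(cell_types, clusters)
ami = adjusted_mutual_info_score(cell_types, clusters)
nmi = normalized_mutual_info_score(cell_types, clusters)
hom = homogeneity_score(cell_types, clusters)
# this relabelled-but-perfect clustering scores 1.0 on all four metrics
```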

To evaluate the performance of cell type annotation, we adopted four metrics as suggested by recent studies38,39: accuracy (Acc), Cohen’s kappa value (Kappa), F1-macro, and F1-weighted. Acc provides a direct measure of the agreement between the annotated and ground-truth cell-type labels. Kappa takes chance agreement into consideration and is particularly suitable for non-ordinal categorical variables. Both F1-macro and F1-weighted are derived from the F1-score: F1-macro is the arithmetic mean of the F1-scores over all cell-type labels, while F1-weighted is a weighted sum of the F1-scores, with weights proportional to the number of cells of each type. Compared to Acc, F1-macro and F1-weighted give relatively equal attention to both common and rare cell types, providing a more comprehensive evaluation of the annotation performance.
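These four annotation metrics can likewise be computed with scikit-learn; the cell-type labels below are a toy illustration:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

y_true = ["B", "B", "T", "T", "NK"]   # ground-truth cell types (toy example)
y_pred = ["B", "T", "T", "T", "NK"]   # annotated labels

acc   = accuracy_score(y_true, y_pred)                 # 4/5 correct
kappa = cohen_kappa_score(y_true, y_pred)              # chance-corrected agreement
f1_m  = f1_score(y_true, y_pred, average="macro")      # unweighted mean over types
f1_w  = f1_score(y_true, y_pred, average="weighted")   # weighted by cells per type
```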

To evaluate the performance of single-cell perturbation-response prediction, we adopted two metrics that are universally used in existing studies45,46. First, we performed differential gene expression analysis between the control data and the real stimulated data, yielding the real differentially expressed genes (DEGs), and between the control data and the predicted stimulated data, yielding the predicted DEGs. To assess the capability of preserving the biological variance in real data, we then counted the number of common DEGs between the top 100 real DEGs and the top 100 predicted DEGs. Second, to examine the consistency between the predicted perturbation responses and the ground-truth responses, we randomly sampled 80% of the test data with replacement 100 times and computed the squared Pearson correlation (R2) for the mean gene expression of the top 100 real DEGs between predicted and real stimulated data.
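Both metrics reduce to a few lines of numpy; the function names and bootstrap parameters below are illustrative, not from the scButterfly codebase:

```python
import numpy as np

rng = np.random.default_rng(0)

def common_deg_count(real_degs, pred_degs, top=100):
    # overlap between the top-ranked real and predicted DEG lists
    return len(set(real_degs[:top]) & set(pred_degs[:top]))

def bootstrap_r2(real_sti, pred_sti, deg_idx, n_boot=100, frac=0.8):
    # squared Pearson correlation of mean DEG expression between real and
    # predicted stimulated data, over bootstrap resamples of the test cells
    n = real_sti.shape[0]
    r2 = []
    for _ in range(n_boot):
        idx = rng.choice(n, size=int(frac * n), replace=True)
        mean_real = real_sti[idx][:, deg_idx].mean(axis=0)
        mean_pred = pred_sti[idx][:, deg_idx].mean(axis=0)
        r = np.corrcoef(mean_real, mean_pred)[0, 1]
        r2.append(r ** 2)
    return np.array(r2)
```

A perfect prediction (predicted stimulated data identical to the real data) gives R2 = 1 in every resample.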

For the translation between transcriptome and proteome profiles, in addition to the above four clustering metrics for evaluating the cell heterogeneity preserved in translated profiles, we further evaluated the translated proteome profiles from a numerical accuracy standpoint, given the low dimensionality and dense nature of ADT data10,66. Specifically, for each cell, we computed both the Pearson and Spearman correlation coefficients between the real and predicted protein expression levels.
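The per-cell correlations can be computed with scipy; the helper below is a sketch assuming cells are rows and proteins are columns:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def per_cell_correlations(real_adt, pred_adt):
    # one Pearson and one Spearman coefficient per cell, across proteins
    pears = np.array([pearsonr(r, p)[0] for r, p in zip(real_adt, pred_adt)])
    spear = np.array([spearmanr(r, p)[0] for r, p in zip(real_adt, pred_adt)])
    return pears, spear
```

A prediction that is a perfect linear transform of the real values scores 1.0 on both coefficients for every cell.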

More detailed mathematical equations and formulas for the aforementioned metrics are provided in Supplementary Text 15.


For data visualization, we performed PCA to reduce the dimensionality of translated profiles to 50 and then adopted the t-SNE67 method to further reduce the dimension to two. Cells in the visualization can be colored by cell-type labels, batch indices, or clustering labels.
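This PCA-then-t-SNE pipeline maps directly onto scikit-learn; the random data below is a toy stand-in for translated profiles, and the perplexity is an illustrative choice:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
profiles = rng.normal(size=(60, 200))   # toy stand-in for translated profiles

pcs = PCA(n_components=50).fit_transform(profiles)      # reduce to 50 PCs
embedding = TSNE(n_components=2, perplexity=10,
                 random_state=0).fit_transform(pcs)     # then to 2-D for plotting
```

The resulting `embedding` has one 2-D coordinate per cell, which can then be scatter-plotted and colored by cell type, batch, or cluster.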

Baseline methods

For the translation between transcriptome and chromatin profiles, we compared the performance of scButterfly against three state-of-the-art methods: BABEL1, Polarbear8, and JAMIE15. We implemented BABEL and JAMIE using their respective GitHub repositories with the default parameter settings. For Polarbear, we used the default parameters on the NGC docker image with pre-compiled TensorFlow v1.15. Because of the extremely high dimensionality of the chromatin profiles (1,050,819 peaks) in the UP_eye, UP_muscle, UP_pancreas, UP_spleen, UP_stomach, and UP_thymus datasets, we additionally adopted the same feature selection strategy as scButterfly when implementing the baseline methods, as memory errors occurred otherwise. Note that we did not include the cross-modal autoencoder9 or UnitedNet16 in the comparison: the former is mainly designed for translation between scRNA-seq data and chromatin images and does not provide source code for its intricate data processing steps, while the latter provides neither a guideline for determining the number of features to select nor code for data processing, and performs translation between epigenome and transcriptome in a supervised manner that requires ground-truth cell-group identification.

For the cell type annotation of scATAC-seq data, we compared the performance of scButterfly with the state-of-the-art EpiAnno38 and Cellcano39 methods. For the prediction of single-cell perturbation responses, we compared the performance of scButterfly with the advanced scGen45 and scPreGAN46 methods. For the translation between transcriptome and proteome profiles, we compared the performance of scButterfly with the latest sciPENN10 and JAMIE15 methods. Note that we implemented the above baseline methods following their tutorials and with their default settings.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.