
BIDCell: Biologically-informed self-supervised learning for segmentation of subcellular spatial transcriptomics data

Datasets and preprocessing

We used publicly available data resources from four different SST platforms (10x Genomics Xenium, NanoString CosMx, Vizgen MERSCOPE, and Stereo-seq), and sequencing data from the Human Cell Atlas.

Subcellular spatial transcriptomics data

For all datasets and for each gene, detected transcripts were converted into a 2D image where the value of each pixel represents the number of detected transcripts at its location. The images were combined channel-wise, resulting in an image volume $\boldsymbol{X} \in \mathbb{R}^{H \times W \times n_{genes}}$, where H is the height of the sample, W is the width of the sample, and ngenes is the number of genes in the panel.
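For illustration, this rasterisation step can be sketched as follows; the transcript table and its column names ('x', 'y', 'gene') are assumptions for the sketch, not the BIDCell implementation.

```python
import numpy as np
import pandas as pd

def rasterise_transcripts(tx: pd.DataFrame, genes: list, h: int, w: int) -> np.ndarray:
    """Bin detected transcripts into an h x w x n_genes count volume.
    Assumes `tx` has pixel coordinates 'x', 'y' and a 'gene' column."""
    gene_idx = {g: i for i, g in enumerate(genes)}
    vol = np.zeros((h, w, len(genes)), dtype=np.uint16)
    counts = tx.groupby(["y", "x", "gene"]).size()  # transcripts per (pixel, gene)
    for (y, x, g), n in counts.items():
        vol[int(y), int(x), gene_idx[g]] += n
    return vol
```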

(i) Xenium-BreastCancer1 and Xenium-BreastCancer2

The Breast Cancer datasets included in this study were downloaded from https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast (accessed 9 Feb 2023), and included two replicates. Low-quality transcripts for 10x Genomics Xenium data with a Phred-scaled quality value score below 20 were removed, as suggested by the vendor (ref. 1). Negative control transcripts, blanks, and antisense transcripts were also filtered out. This resulted in 313 unique genes, with overall image dimensions of 5475 × 7524 × 313 pixels for Xenium breast cancer replicate 1 (Xenium-BreastCancer1) and 5474 × 7524 × 313 pixels for Xenium breast cancer replicate 2 (Xenium-BreastCancer2).
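A minimal sketch of this quality filter, assuming a Xenium-style transcript table with 'qv' and 'feature_name' columns (column names and control prefixes may differ between releases):

```python
import pandas as pd

tx = pd.read_csv("transcripts.csv.gz")  # illustrative path
tx = tx[tx["qv"] >= 20]                 # drop Phred-scaled QV < 20
# Control/blank/antisense prefixes are indicative, not an exhaustive list
controls = ("NegControlProbe_", "NegControlCodeword_", "BLANK_", "antisense_")
tx = tx[~tx["feature_name"].str.startswith(controls)]
```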

(ii) Xenium-MouseBrain

The Mouse Brain data included in this study were downloaded from https://www.10xgenomics.com/resources/datasets/fresh-frozen-mouse-brain-replicates-1-standard (accessed 14 Feb 2023) and processed following the steps in (i). There were 248 unique genes, and the resulting image size was 7038 × 10,277 × 248 pixels.

(iii) CosMx-Lung

The CosMx NSCLC Lung dataset included in this study was downloaded from https://nanostring.com/products/cosmx-spatial-molecular-imager/nsclc-ffpe-dataset/ (accessed 24 Mar 2023). We used data for Lung5-1, which comprised 30 fields of view. Transcripts containing “NegPrb” were removed, resulting in 960 unique genes and an overall image dimension of 7878 × 9850 × 960 pixels.

(iv) MERSCOPE-Melanoma

The MERSCOPE melanoma data included in this study were downloaded from https://info.vizgen.com/merscope-ffpe-solution (for patient 2, accessed 26 Mar 2023). Transcripts with “Blank-” were filtered out, resulting in 500 unique genes and an image with 6841 × 7849 × 500 pixels.

(v) Stereo-seq-MouseEmbryo

The Stereo-seq data used in this study, including the DAPI image and detected gene expressions (bin 1), were downloaded from https://db.cngb.org/stomics/mosta/download/ for sample E12.5_E1S3. Stereo-seq data contain a far greater number of genes than Xenium, CosMx, and MERSCOPE. For efficiency, we selected a panel of 275 highly variable genes (HVGs) as the input to BIDCell. The HVGs were the genes common to the top 1000 HVGs of the Stereo-seq data and the top 1000 HVGs of the single-cell reference data.

Nuclei segmentation

DAPI images were downloaded directly from the websites of their respective datasets. In cases where the maximum intensity projection (MIP) DAPI image was not provided, we computed the MIP DAPI by taking the maximum intensity value at each (x, y) location across the DAPI z-stack. DAPI images were resized to align with the lateral resolutions of the spatial transcriptomic maps using bilinear interpolation. Nuclei segmentation was performed on the MIP DAPI using the pretrained Cellpose model with automatic estimation of nuclei diameter (ref. 5). We used the “cyto” model, as we found the “nuclei” model to undersegment or omit a considerable number of nuclei (e.g., 21k for Xenium-BreastCancer1) given the same MIP DAPI image, which is consistent with another study (ref. 29). Other nuclei segmentation methods may be used with BIDCell, as our framework is not limited to Cellpose.
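A sketch of the MIP computation and Cellpose call described above; the stack layout is an assumption:

```python
import numpy as np
from cellpose import models

# dapi_stack: (z, h, w) array of DAPI planes; a single-plane image can skip the MIP
mip = dapi_stack.max(axis=0)                 # maximum intensity projection

model = models.Cellpose(model_type="cyto")   # "cyto" rather than "nuclei" (see text)
masks, flows, styles, diams = model.eval(
    mip, diameter=None, channels=[0, 0]      # diameter=None -> automatic estimation
)
```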

Transcriptomics sequencing data

We used six publicly available single-cell RNA-seq data collections as references to guide cell segmentation in BIDCell and evaluation with CellSPA. For reference collections comprising multiple datasets, we constructed cell-type-specific profiles by aggregating the gene expression by cell type per dataset, as sketched below.
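A minimal sketch of this aggregation, assuming an AnnData object with cell type and dataset labels in .obs (the key names are illustrative):

```python
import anndata as ad
import pandas as pd

def cell_type_profiles(adata: ad.AnnData,
                       type_key: str = "cell_type",
                       dataset_key: str = "dataset") -> pd.DataFrame:
    """Average gene expression per (dataset, cell type) pair."""
    df = adata.to_df()                        # cells x genes expression matrix
    df[type_key] = adata.obs[type_key].values
    df[dataset_key] = adata.obs[dataset_key].values
    return df.groupby([dataset_key, type_key]).mean()
```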

(i) TISCH-BRCA

The reference for Xenium-BreastCancer used in BIDCell was based on 10 single-cell breast cancer datasets downloaded from The Tumor Immune Single Cell Hub 2 (TISCH2; ref. 17) at http://tisch.comp-genomics.org/gallery/?cancer=BRCA&species=Human, which contains the gene-by-cell expressions and cell annotations of the data. We used the “celltype major lineage” as the cell type labels. We combined “CD4Tconv” and “Treg” into “CD4Tconv/Treg”, and “CD8T” and “CD8Tex” into “CD8T/CD8Tex”, resulting in 17 cell types in total.

(ii) Chromium-BreastCancer

To evaluate performance on Xenium-BreastCancer, we downloaded the Chromium scFFPE-seq data from the same experiment from https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast (accessed 22 March 2023), which contains 30,365 cells and 18,082 expressed genes. We performed Louvain clustering on the k-nearest neighbour graph with k = 20, based on the top 50 principal components (PCs), to obtain 22 clusters. We then annotated each cluster based on the markers and annotation provided in the original publication (ref. 1).
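The clustering steps can be reproduced along these lines with scanpy; the file name and normalisation steps are assumptions rather than the exact original pipeline:

```python
import scanpy as sc

adata = sc.read_10x_h5("chromium_scffpe.h5")  # illustrative path
sc.pp.normalize_total(adata)                  # assumed preprocessing
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=50)                  # top 50 PCs
sc.pp.neighbors(adata, n_neighbors=20)        # kNN graph with k = 20
sc.tl.louvain(adata)                          # Louvain clustering
```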

(iii) Allen Brain Map

The reference for the Xenium-MouseBrain data was based on the Mouse Whole Cortex and Hippocampus SMART-seq data downloaded from https://portal.brain-map.org/atlases-and-data/rnaseq/mouse-whole-cortex-and-hippocampus-smart-seq, which contains both gene-by-cell expressions and cell annotations of the data. We used the cluster annotation from “cell_type_alias_label” as the cell type labels and combined some of the labels that had small numbers of cells. For example, we combined all “Sst” subtypes into “Sst” and all “Vip” subtypes into “Vip”, resulting in 59 cell types in total.

(iv) HLCA and TISCH-NSCLC

The reference for CosMx-Lung for both BIDCell and CellSPA was based on the Human Lung Cell Atlas (HLCA; ref. 30), provided in the “HLCA_v1.h5ad” file from https://beta.fastgenomics.org/p/hlca, including both gene expressions and cell type annotations of the data. We used “ann_finest_level” as the cell type labels, which contained 50 cell types in total.

As HLCA only contains single-cell datasets from non-cancer lung tissue, we complemented the reference data with malignant cells provided in TISCH2, where we downloaded 6 single-cell NSCLC datasets with tumour samples from http://tisch.comp-genomics.org/gallery/?cancer=NSCLC&species=Human. We only included the cells labelled as malignant cells in the reference.

(v) TISCH-SKCM

The reference for MERSCOPE-Melanoma for both BIDCell and CellSPA was based on 10 single-cell melanoma datasets downloaded from TISCH2 at http://tisch.comp-genomics.org/gallery/?cancer=SKCM&species=Human, which contains the gene-by-cell expressions and cell annotations of the data. We used the “celltype major lineage” as the cell type labels. We combined “CD4Tconv” and “Treg” into “CD4Tconv/Treg”, and “CD8T” and “CD8Tex” into “CD8T/CD8Tex”, resulting in 15 cell types in total.

(vi) Mouse Embryo reference

The reference for Stereo-seq-MouseEmbryo was downloaded from the GEO database under accession code GSE119945 (ref. 31), which contains both counts and cell type annotation data. The E12.5 data was then used as the reference.

Biologically-informed deep learning-based cell segmentation (BIDCell) overview

BIDCell is a self-supervised deep learning framework that computes biologically-informed loss functions to optimise learnable parameters for the prediction of cell segmentation masks for spatial transcriptomic data. BIDCell uses three types of data: (i) spatial transcriptomic maps of genes, (ii) the corresponding DAPI image, and (iii) average gene expression profiles of cell types from a reference dataset, such as the Human Cell Atlas. A major innovation of BIDCell is the use of biologically-informed prior knowledge via the SSL paradigm, which enables DL models to learn complex structures in SST data and to derive cell segmentations that are visually more realistic and better capture expression profiles.

The BIDCell framework has the following four key characteristics:

  • BIDCell predicts diverse cell shapes for datasets containing various cell types to better capture cell expressions (see section Elongated and non-elongated shapes).

  • BIDCell uses positive and negative markers from sequencing data to enhance the guidance for learning relationships between spatial gene expressions and cell morphology in the form of cell segmentations (see section Positive and negative cell-type markers).

  • BIDCell is parameterised by a deep learning architecture that learns to segment cells from spatial transcriptomic images (see section Deep learning-based segmentation).

  • BIDCell uses biologically-informed, self-supervised loss functions to train the deep learning architecture without the need for manual annotations and better capture cell expressions (see section BIDCell training and loss functions).

Elongated and non-elongated shapes

BIDCell is capable of generating cell segmentations that exhibit different morphologies for different cell types, rather than assuming a generally circular profile for all cell types. In particular, BIDCell can distinguish between cell types that typically appear more elongated, such as fibroblasts and smooth muscle cells, and those that are typically more rounded or circular, such as B cells. Elongated cell types can be directly specified for each tissue sample as desired, based on existing biological knowledge.

We used the expression within the nuclei (see section Nuclei segmentation) of cells to perform an initial classification of elongated and non-elongated cell types. Transcripts were mapped to nuclei using the nuclei segmentations, and the Spearman correlation was computed between nuclei expression profiles and reference cell types of the Human Cell Atlas. Each nucleus was classified as the cell type with which its expression was most highly correlated. This initial classification, coupled with the eccentricity of the nuclei, was used to inform the cell-calling loss function (described in section Cell-calling loss) to produce segmentation morphologies with more variation that are more appropriate for different cell types. We considered epithelial cells, fibroblasts, myofibroblasts, and smooth muscle cells to be elongated for samples of breast cancer and melanoma. Endothelial cells, fibroblasts, myofibroblasts, fibromyocytes, and pericytes were deemed elongated for NSCLC. We considered all cell types in the mouse brain sample to be elongated.
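A sketch of this correlation-based assignment, assuming nuclei expression profiles and reference cell type profiles share the same gene order:

```python
import numpy as np
from scipy.stats import spearmanr

def classify_nuclei(nuc_expr: np.ndarray, ref_profiles: np.ndarray,
                    ref_types: list) -> list:
    """Assign each nucleus (rows of nuc_expr) the reference cell type whose
    profile (rows of ref_profiles) has the highest Spearman correlation."""
    labels = []
    for profile in nuc_expr:
        rhos = [spearmanr(profile, ref).correlation for ref in ref_profiles]
        labels.append(ref_types[int(np.argmax(rhos))])
    return labels
```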

Positive and negative cell-type markers

BIDCell learns relationships between the spatial distribution of gene expressions and cell morphology in the form of cell segmentations. This relationship can be enhanced by incorporating biological knowledge in the form of cell-type markers, specifically, the genes that are typically more expressed (positive markers) and less expressed (negative markers) in different cell types, which allows BIDCell to predict segmentations that lead to more accurate cell expression profiles. Cell-type marker knowledge is drawn from the Human Cell Atlas, which allows BIDCell to be applied without requiring a matched single-cell reference for the same sample of interest. Markers were incorporated into BIDCell through our positive and negative marker losses (described in section Positive and negative marker losses).

Deep learning-based segmentation

BIDCell is parameterised by a set of learnable parameters θ of a deep learning segmentation model. We used the popular UNet 3+ (ref. 19) as the backbone of our framework to perform cell segmentation by predicting the probability of cell instances at each pixel. This architecture may be swapped out for other segmentation architectures. UNet 3+ was originally proposed for organ segmentation in computed tomography (CT) images. It was built on the original U-Net (ref. 12) and incorporated full-scale skip connections that combine low-level details with high-level features across different scales (resolutions). UNet 3+ comprises an encoding branch and a decoding branch with five levels of feature scales. We did not adopt the deep supervision component proposed by UNet 3+, and instead only computed training losses at the lateral resolution of the original input.

Input

The input to the UNet 3+ model was a cropped multichannel spatial transcriptomic image $\mathbf{x} \in \mathbb{R}^{h \times w \times n_{genes}}$, where ngenes represents the channel axis corresponding to the total number of genes in the dataset, h is the height of the input patch, and w is the width of the input patch. Prior to being fed into the first convolutional layer, the input was reshaped to [ncells, ngenes, h, w], effectively placing ncells in the batch size dimension. In this way, all the cells in a patch were processed simultaneously, and the model could flexibly support an arbitrary number of cells without requiring extra padding or preprocessing. ncells was determined by the corresponding patch of nuclei to ensure consistency with predicted cell instances. Input volumes that were empty of nuclei were disregarded during training and yielded no cells during prediction.

Output and segmentation prediction

The softmax function was applied to the output of UNet 3+ to yield probabilities of foreground and background pixels for each cell instance. This produced multiple probabilities for background pixels (i.e., ncells probabilities per pixel for a patch containing ncells cells), due to the placement of cell instances in the batch size dimension. These probabilities were aggregated by averaging across all the background predictions per pixel. The argmax function was applied pixel-wise to the foreground probabilities for all cells and the averaged background probabilities. This produced a segmentation map corresponding to the object (cell instance or background) with the highest probability at each pixel.
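A minimal sketch of this aggregation and argmax step; the channel ordering (background first) is an assumption:

```python
import torch
import torch.nn.functional as F

def predict_segmentation(logits: torch.Tensor, cell_ids: torch.Tensor) -> torch.Tensor:
    """logits: (n_cells, 2, h, w) with channel 0 = background, 1 = foreground
    (channel order assumed). Returns an (h, w) map of cell IDs, 0 = background."""
    probs = F.softmax(logits, dim=1)
    fg = probs[:, 1]                  # (n_cells, h, w) per-cell foreground
    bg = probs[:, 0].mean(dim=0)      # average the n_cells background maps
    stacked = torch.cat([bg.unsqueeze(0), fg], dim=0)  # (1 + n_cells, h, w)
    winner = stacked.argmax(dim=0)    # 0 = background, k = k-th cell
    ids = torch.cat([torch.zeros(1, dtype=cell_ids.dtype), cell_ids])
    return ids[winner]
```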

Morphological processing

The initial segmentation output by the deep learning model was further refined to ensure pixel connectivity within each cell (i.e., all sections of the cell were connected). The process applied standard morphological image processing techniques to each cell, including dilation, erosion, hole-filling, and removal of isolated islands, while ensuring that the nucleus was captured. First, dilation followed by erosion was applied using a 5 × 5 circular kernel with two iterations each. Hole-filling was then carried out on the cell section with the largest overlap with the nucleus. Any remaining pixels initially predicted for the cell that were still not connected to the main cell section were discarded. After morphological processing, the number of transcripts captured within each cell is slightly higher, while purity metrics and correlation with Chromium are the same or slightly higher (Supplementary Fig. 24).
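A sketch of this refinement for a single cell, using OpenCV and SciPy; parameter choices follow the text, though the exact operation order in the BIDCell codebase may differ slightly:

```python
import cv2
import numpy as np
from scipy.ndimage import binary_fill_holes, label

def refine_cell(cell_mask: np.ndarray, nucleus_mask: np.ndarray) -> np.ndarray:
    """Dilation + erosion (5x5 circular kernel, 2 iterations each), then keep
    and hole-fill the component with the largest overlap with the nucleus."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    m = cv2.dilate(cell_mask.astype(np.uint8), kernel, iterations=2)
    m = cv2.erode(m, kernel, iterations=2)
    comps, n = label(m)
    if n == 0:
        return m.astype(bool)
    # component with the largest overlap with the nucleus
    overlap = [((comps == i) & (nucleus_mask > 0)).sum() for i in range(1, n + 1)]
    keep = comps == (1 + int(np.argmax(overlap)))  # isolated islands discarded
    return binary_fill_holes(keep)
```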

Mapping transcripts to predicted cells

The detected transcripts were mapped to cells using the final predicted segmentations. The segmentation map was resized back to the original pixel resolution using nearest neighbour interpolation. Transcripts located in the mask of a cell were added to the expression profile of that cell. This produced an ncells × ngenes gene-cell matrix, which was used for performance evaluation and downstream analysis.

BIDCell training and loss functions

The BIDCell framework combines several loss functions that automatically derive supervisory signals from the input data and/or predicted segmentations at each step of the training process. This approach to learning is a core aspect of SSL (ref. 32). Furthermore, the modular and additive design of the loss functions allows each loss to be swapped out with alternative approaches to compute training signals. The SSL label describes the ability of the framework to automatically learn relationships between gene expressions and cell morphology from its inputs.

Our approach for learning the parameters θ of the segmentation model relies on minimising a total of six loss functions that we propose with our framework. Some of the losses effectively increase the number of pixels predicted for a cell, while others reduce the size of its segmentation. The nuclei encapsulation, cell-calling, over-segmentation, and overlap losses guide the basic morphology of cells. The positive and negative marker losses refine the cell morphologies learned through the other loss functions, by further guiding the model to learn biologically-informed relationships between gene expressions and cell morphology. This is reminiscent of the pretext and downstream (fine-tuning) stages commonly encountered in SSL, where the pretext task aids the model to learn better representations or intermediate weights, while the fine-tuning task refines the weights and further improves performance for a particular prediction task. Taken together, the losses ensure that the segmentation model learns relationships between spatially-localised, high-dimensional gene expression information and the morphology of individual cells.

(A) Nuclei encapsulation loss

The segmentation of a cell must contain all the pixels of the cell’s nucleus. Additionally, the expressed genes in nuclei can guide the model to learn which genes should be predicted within cells. Hence, we included a loss function Lne that incentivises the model to learn to correctly predict nuclei pixels:

$$L_{ne}(\mathbf{x}_{nuc}, \hat{\mathbf{y}}) = -\mathbf{x}_{nuc} \log(\hat{\mathbf{y}}) - (1 - \mathbf{x}_{nuc}) \log(\mathbf{1} - \hat{\mathbf{y}}),$$

(1)

where $\mathbf{x}_{nuc}$ is the binary nucleus segmentation mask, and $\hat{\mathbf{y}}$ is the predicted segmentation for all cells of the corresponding training patch.
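Equation (1) is the standard binary cross-entropy; a minimal PyTorch sketch (the mean reduction is an assumption, as the reduction is not specified above):

```python
import torch
import torch.nn.functional as F

def nuclei_encapsulation_loss(y_hat: torch.Tensor, x_nuc: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy between predicted cell probabilities y_hat (values
    in (0, 1)) and the binary nucleus mask x_nuc of the same shape (Eq. 1)."""
    return F.binary_cross_entropy(y_hat, x_nuc.float())
```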

(B) Cell-calling loss

The aim of the cell-calling loss was to increase the number of transcripts assigned to cells. We also designed the cell-calling loss to allow BIDCell to capture cell-type specific morphologies. Unique expansion masks $\mathbf{e}_c \in \{0, 1\}^{h \times w}$ were computed for each cell based on the shape of its nucleus and whether its nucleus expression profile was indicative of an elongated cell type. The expansion mask of a non-elongated cell was computed by applying a single iteration of the morphological dilation operator with a circular kernel of 20 × 20 pixels to its binary nucleus mask.

The expansion mask of an elongated cell was computed based on the elongation of its nucleus, defined as the eccentricity of an ellipse fitted to its nucleus mask:

$$ecc = \sqrt{1 - \frac{b^2}{a^2}},$$

(2)

where a represents the length of the major axis, and b is the length of the minor axis.

We found that elongated cell types tended to have nuclei with higher eccentricity (Supplementary Fig. 1). Hence, the eccentricity of a nucleus could serve as a proxy for the shape of its cell via an elongated expansion mask. We computed each cell-specific elongated expansion mask using an elliptical dilation kernel applied to the nucleus. The horizontal and vertical lengths of the elliptical kernel were computed by:

$$l_h = \alpha \times ecc_{nuc} \times l_t,$$

(3)

$$l_v = \begin{cases} l_t - l_h, & \text{if } l_t - l_h > l_{vm} \\ l_{vm}, & \text{otherwise,} \end{cases}$$

(4)

where α is a scaling factor set to 0.9, eccnuc is the eccentricity of the nucleus, lt is the sum of lh and lv, which was set to 60 pixels, and lvm is the minimum vertical length, which was set to 3 pixels. These values were selected based on visual inspection (e.g., the cells appear reasonably sized), and were kept consistent across the different elongated cell types and datasets used in this study. The elliptical dilation kernel was rotated to align with the nucleus and applied to the nucleus mask to produce the elongated expansion mask of the cell.
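A direct transcription of Eqs. (3) and (4) with the stated constants:

```python
def kernel_lengths(ecc_nuc: float, alpha: float = 0.9,
                   l_t: float = 60.0, l_vm: float = 3.0):
    """Horizontal and vertical lengths of the elliptical dilation kernel
    (Eqs. 3 and 4) for an elongated cell, given its nucleus eccentricity."""
    l_h = alpha * ecc_nuc * l_t              # Eq. 3
    l_v = l_t - l_h if l_t - l_h > l_vm else l_vm  # Eq. 4
    return l_h, l_v
```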

The expansion masks were used in our cell-calling loss function that was minimised during training:

$$L_{cc}(\mathbf{e}, \hat{\mathbf{y}}) = \frac{1}{M} \sum_{c}^{M} -\mathbf{e}_c \log(\hat{\mathbf{y}}_c) - (1 - \mathbf{e}_c) \log(\mathbf{1} - \hat{\mathbf{y}}_c),$$

(5)

where $\mathbf{e}_c$ is the expansion mask and $\hat{\mathbf{y}}_c$ is the predicted segmentation of cell c of M cells in an input patch.

(C) Over-segmentation loss

We introduced the over-segmentation loss to counter the cell size-increasing effects of the cell-calling loss to prevent the segmentations becoming too large and splitting into separate segments. This loss function elicited a penalty whenever the sum of cytoplasmic predictions exceeded the sum of nuclei predictions for a cell in a given patch:

$$p_{nuc,c} = \sum_{i} \sum_{j} \sigma(\hat{q}_{ijc}\, x_{nuc,ij} - 0.5),$$

(6)

$$p_{cyto,c} = \sum_{i} \sum_{j} \sigma(\hat{q}_{ijc}\,(1 - x_{nuc,ij}) - 0.5),$$

(7)

$$L_{os} = \begin{cases} \frac{1}{M} \sum_{c}^{M} (p_{cyto,c} - p_{nuc,c}), & \text{if } \sum_{c}^{M} (p_{cyto,c} - p_{nuc,c}) > 0 \\ 0, & \text{otherwise,} \end{cases}$$

(8)

where, for cell c at pixel (i, j), $\hat{q}_{ijc}$ is the predicted foreground probability for cell c, $x_{nuc,ij} \in \{0, 1\}$ is the binary nucleus mask, and σ is the sigmoid function. $L_{os}$ was normalised by the number of cells M to aid smooth training.
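A sketch of Eqs. (6)–(8) in PyTorch, assuming per-cell foreground probabilities stacked along the first dimension:

```python
import torch

def oversegmentation_loss(q_hat: torch.Tensor, x_nuc: torch.Tensor) -> torch.Tensor:
    """Eqs. 6-8: penalise cytoplasm predictions exceeding nucleus predictions.
    q_hat: (M, h, w) per-cell foreground probabilities; x_nuc: (h, w) binary mask."""
    x = x_nuc.float().unsqueeze(0)                       # (1, h, w), broadcast over cells
    p_nuc = torch.sigmoid(q_hat * x - 0.5).sum(dim=(1, 2))        # Eq. 6
    p_cyto = torch.sigmoid(q_hat * (1 - x) - 0.5).sum(dim=(1, 2))  # Eq. 7
    diff = (p_cyto - p_nuc).sum()
    m = q_hat.shape[0]
    return diff / m if diff > 0 else q_hat.new_zeros(())           # Eq. 8
```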

(D) Overlap loss

Cells are often densely-packed together in samples of various human tissues. This poses a challenge to segmentation models in predicting clear boundaries and coherent segmentations for neighbouring cells without overlap. We introduced the overlap loss to penalise the prediction of multiple cells occurring at each pixel:

$$s_{ov,ij} = -(1 - x_{nuc,ij}) + \sum_{c}^{M} \sigma(\hat{q}_{ijc}\,(1 - x_{nuc,ij}) - 0.5),$$

(9)

$$L_{ov} = \begin{cases} \frac{\sum_{i} \sum_{j} s_{ov,ij}}{M h w}, & \text{if } s_{ov} > 0 \\ 0, & \text{otherwise.} \end{cases}$$

(10)

$L_{ov}$ was normalised by the number of cells M and the lateral dimensions h and w of the input to aid smooth training.

(E) Positive and negative marker losses

The purpose of our positive and negative marker losses was to encourage the model to capture pixels containing positive cell-type markers, and to penalise segmentations capturing pixels containing negative cell-type markers for each cell. The marker losses refine the initial morphology learned through the other loss functions, by further guiding the model to learn biologically-informed relationships between gene expressions and cell morphology.

The positive and negative markers for the training loss were those with expressions in the highest and lowest 10 percentiles for each cell type of a tissue sample. In our experiments, we found that a higher number of positive markers tended to increase the size of predicted cells as the model learns to capture more markers, and vice versa. We found that removing positive markers that were common to at least a third of cell types in each tissue type was appropriate across the different datasets for training.

The one-hot encoded lists of positive and negative markers of the cell type for cell c were converted into sparse maps $\mathbf{m}_{pos,c} \in \{0, 1\}^{h \times w}$ and $\mathbf{m}_{neg,c} \in \{0, 1\}^{h \times w}$. At each pixel, 0 indicated the absence of all markers, while 1 indicated the presence of any positive or negative marker for its respective map. $\mathbf{m}_{pos,c}$ and $\mathbf{m}_{neg,c}$ were then multiplied element-wise by the expansion mask $\mathbf{e}_c$ to remove markers far away from the current cell. Each marker map was dilated by a 3 × 3 kernel, based on the assumption that pixels in a 3 × 3 region around each marker were most likely from the same cell. We found this dilation to improve training guidance and segmentation quality, as the maps tended to be quite sparse.

The marker maps were then used to compute the positive and negative marker losses:

$$L_{pos}(\mathbf{m}_{pos}, \hat{\mathbf{y}}) = \frac{1}{M} \sum_{c}^{M} -\mathbf{m}_{pos,c} \log(\hat{\mathbf{y}}_c) - (1 - \mathbf{m}_{pos,c}) \log(\mathbf{1} - \hat{\mathbf{y}}_c),$$

(11)

$$L_{neg}(\mathbf{m}_{neg}, \hat{\mathbf{q}}) = \frac{1}{M} \sum_{c}^{M} \sigma(\hat{\mathbf{q}}_c \mathbf{m}_{neg,c} - 0.5).$$

(12)
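A sketch of Eqs. (11) and (12), assuming per-cell probability maps and dilated marker maps of matching shape (the mean reduction over pixels in Eq. (12) is an assumption):

```python
import torch
import torch.nn.functional as F

def marker_losses(y_hat: torch.Tensor, q_hat: torch.Tensor,
                  m_pos: torch.Tensor, m_neg: torch.Tensor):
    """Eqs. 11 and 12. y_hat, q_hat: (M, h, w) per-cell foreground
    probabilities; m_pos, m_neg: (M, h, w) dilated binary marker maps."""
    l_pos = F.binary_cross_entropy(y_hat, m_pos.float())        # Eq. 11
    l_neg = torch.sigmoid(q_hat * m_neg.float() - 0.5).mean()   # Eq. 12
    return l_pos, l_neg
```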

Total loss

The model was trained by minimising the sum of all the loss functions over N training patches:

$$\min_{\theta} \sum_{n}^{N} \left[ \lambda_{ne} L_{ne} + \lambda_{cc} L_{cc} + \lambda_{os} L_{os} + \lambda_{ov} L_{ov} + \lambda_{pos} L_{pos} + \lambda_{neg} L_{neg} \right],$$

(13)

where each λ represents a hyperparameter that scales its respective L. The value of λ for all loss functions was set to 1.0 (except in the ablation and λ studies); this ensured our losses were not fine-tuned to any particular dataset.

Practical implementation

Details

To address computational efficiency concerns related to memory usage, we partitioned the spatial transcriptomic maps into patches of 48 × 48 × ngenes for input into UNet 3+. BIDCell has been verified for datasets containing up to 960 genes on a 12 GB GPU. It is also important to note that the number of genes primarily affects the weights of the first convolutional layer, thus having a minor impact on memory usage.

Patch-based prediction could result in boundary effects such as sharp or cut-off cells. When dividing the transcriptomic maps into patches, we created two sets of patches of the same lateral dimensions, offset from one another by half the lateral patch size. The predictions for the patches were combined (see Supplementary Fig. 25), without additional operations to resolve potential disagreement between predictions of the two sets. Only patches from the first set (no overlaps) were used during training, while all patches were used during inference.

One image patch was input into the model at a time, though the effective batch size was ncells due to the reshaping (see section Deep learning-based segmentation, Input). Neither normalisation nor standardisation was applied to the input image patches, so that the pixels depicted raw transcript detections.

The model was trained end-to-end from scratch for 4000 iterations (i.e., using 4000 training patches). This amounted to a maximum of 22% of the entire image, thereby leaving the rest of the image unseen by the model during inference. Weights of the convolutional layers were initialised using the method of He et al. (ref. 33). We employed standard on-the-fly image data augmentation by randomly applying a flip (horizontal or vertical) and a rotation (of 90, 180, or 270 degrees) in the (x, y) plane. The order of training samples was randomised prior to training. We employed the Adam optimiser (ref. 34) to minimise the sum of all losses at a fixed learning rate of 0.00001, with a first moment estimate of 0.9, second moment estimate of 0.999, and weight decay of 0.0001, as sketched below.
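The stated initialisation and optimiser settings correspond to the following PyTorch configuration (the `model` object is assumed to be the UNet 3+ backbone, not defined here):

```python
import torch

# He initialisation of convolutional weights
for m in model.modules():
    if isinstance(m, torch.nn.Conv2d):
        torch.nn.init.kaiming_normal_(m.weight)

# Adam with lr = 1e-5, betas = (0.9, 0.999), weight decay = 1e-4
optimiser = torch.optim.Adam(
    model.parameters(), lr=1e-5, betas=(0.9, 0.999), weight_decay=1e-4
)
```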

Time and system considerations

We ran BIDCell on a Linux system with a 12GB NVIDIA GTX Titan V GPU, Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz with 16 threads, and 64GB RAM. BIDCell was implemented in Python using PyTorch. For Xenium-BreastCancer1, which contained 109k detected nuclei, 41M pixels (x, y), and 313 genes, training was completed after approximately 10 minutes for 4000 steps. Inference time was about 50 minutes for the complete image. Morphological processing required approximately 30 minutes to generate the final segmentation. A comparison of the runtimes between different methods is included in Supplementary Fig. 26.

Ablation study

We performed an ablation study to determine the contributions from each loss function and effects of different hyperparameter values (Supplementary Figs. 4, 5). We used Xenium-BreastCancer1 for these experiments. We evaluated BIDCell without each of the different loss functions by individually setting their corresponding weights λ to zero. Furthermore, we evaluated different parameterisations of the cell-calling loss. We experimented with different diameters for the dilation kernel for non-elongated cells, including 10, 20, and 30 pixels, and different total lengths of the minor and major axes lt of the dilation kernel for elongated cells, including 50, 60, and 70 pixels. We also ran BIDCell without shape-specific expansions, thereby assuming a non-elongated shape for all cells.

Performance evaluation

We compared our BIDCell framework to vendor-provided cell segmentations and to methods designed to identify cell bodies via cell segmentation. Table 2 provides a summary of all methods compared, from adaptations of classical approaches, including Voronoi expansion, nuclei dilation, and the watershed algorithm, to recently proposed approaches for SST images, including Baysor, JSTA, and Cellpose. Methods were excluded from the evaluations if they focused on the assignment of transcripts to cells without considering cell boundaries, underperformed on the public datasets, lacked code and instructions to prepare data into the required formats, or failed to detect any cells (Supplementary Table 1).

Settings used for other methods

We used publicly available code for Baysor, JSTA, and Cellpose with default parameters unless stated otherwise. All comparison methods that required nuclei information used identical nuclei as BIDCell, which were detected using Cellpose (v2.1.1) (see Nuclei segmentation).

  • Baysor – Version 0.5.2 was applied either without a prior, or with prior nuclei segmentation at the default prior segmentation confidence of 0.2. In both cases, we followed the recommended settings (ref. 35), including 15 for the minimum number of transcripts expected per cell, and did not set a scale value, since the sample contained cells of varying sizes. We found the scale parameter to have a considerable effect on segmentation predictions; when explicitly set, it often resulted in cells with unrealistically uniform appearances.

  • JSTA – default parameters were used. We encountered high CPU loading and issues with two regions of Xenium-BreastCancer1, which yielded empty predictions for those regions despite multiple attempts and efforts to reduce input size.

  • Cellpose – Version 2.1.1 was applied to the channel-wise concatenated image comprising DAPI as the “nuclei” channel, and sum of spatial transcriptomic maps across all genes as the “cells” channel, using the pre-trained “cyto” model with automatic estimation of cell diameter.

  • Voronoi – Classical Voronoi expansion was seeded on nuclei centroids and applied using the SciPy library (v1.9.3).

  • Watershed – The watershed algorithm was performed on the sum of transcriptomic maps across all genes. Seeded watershed used nuclei centroids and was applied using OpenCV (v4.6.0).

  • Cellpose nuclei dilation – we applied dilation to nuclei masks as a comparison segmentation method. Each nucleus was enlarged by about 1 micron in radius by applying morphological dilation using a 3 × 3 circular kernel for one iteration. Overlaps between adjacent cell expansions were permitted.

Evaluation metrics and settings

We introduce the CellSPA framework, which captures evaluation metrics across five complementary categories. A summary of this information is provided in Supplementary Table 2.

[A] Baseline metrics

Overall characteristics

Cell-level QC metrics

  • Proportion of cells expressing each gene

  • Number of transcripts per cell

  • Number of genes expressed per cell

  • Cell area

$$\text{Density} = \frac{\sum_{i \in I} n_i}{A},$$

(14)

where $\sum_{i \in I} n_i$ represents the sum of all transcripts over the set I, and A represents the cell area.

Cell morphology metrics

We evaluated multiple morphology-based metrics and provide diagrammatic illustrations in Supplementary Fig. 27.

• Elongation =

$$\frac{W_{\text{bb}}}{H_{\text{bb}}},$$

(15)

where Wbb represents the width of the bounding box, and Hbb represents the height of the bounding box.

Elongation measures the ratio of the width to the height of the bounding box (Supplementary Fig. 27f). Elongation is insensitive to concave irregularities and holes present in the shape of the cell. The value of this metric will be 1 for a square bounding box. As the cell becomes more elongated, the value will either increase far above 1 or decrease far below 1, depending on whether the elongation occurs along the width or the height of the bounding box.

• Circularity =

$$\frac{4\pi \times A}{P_{\text{convex}}^{2}},$$

(16)

where A represents the area, and Pconvex represents the convex perimeter.

Circularity measures the area-to-perimeter ratio while excluding local irregularities of the cell. We used the convex perimeter of the object, as opposed to its true perimeter, to avoid concave irregularities. The value is 1 for a circle and decreases as a cell becomes less circular.

• Sphericity =

$$\frac{R_{\text{I}}}{R_{\text{C}}},$$

(17)

where RI represents the radius of the inscribing circle, and RC represents the radius of the circumscribing circle.

Sphericity measures how closely an object approaches the shape of a sphere, while accounting for the largest local irregularity of the cell, by comparing the radius of the largest circle that fits inside the cell (inscribing circle) to the radius of the smallest circle that contains the whole cell (circumscribing circle). The value is 1 for a sphere and decreases as the cell becomes less spherical.

• Compactness =

$$\frac{4\pi \times A}{P_{\text{cell}}^{2}},$$

(18)

where A represents the area, and Pcell represents the cell perimeter.

Compactness measures the ratio of the area of an object to the area of a circle with the same perimeter. Compactness uses the perimeter of the cell, and thus considers local irregularities in the cell perimeter. A circle will have a value of 1, and the less smooth or more irregular the perimeter of a cell, the smaller the value will be. For most cells, the numerical values for compactness and circularity are expected to be similar. Identifying cells with large differences between these metrics can reveal cells with highly irregular perimeters, which may be of interest for downstream analysis and quality control of segmentation.

• Convexity =

$$\frac{P_{\text{convex}}}{P_{\text{cell}}},$$

(19)

where Pconvex represents the convex perimeter and Pcell represents the cell perimeter.

Convexity measures the ratio of the convex perimeter of a cell to its perimeter. The value will be 1 for a circle and decreases as the perimeter of a cell becomes more irregular, similar to compactness.

• Eccentricity =

$$\frac{L_{\text{minor}}}{L_{\text{major}}},$$

(20)

where Lminor represents the length of the minor axis and Lmajor represents the length of the major axis.

Eccentricity (or ellipticity) measures the ratio of the minor axis to the major axis of a cell. The major axis is the longest possible line that can be drawn within the boundary of a cell without intersecting it. The minor axis is the longest possible line that can be drawn within the boundary of a cell while also being perpendicular to the major axis. This gives a value of 1 for a circle and decreases the flatter the cell becomes.

• Solidity =

$$\frac{A}{A_{\text{convex}}},$$

(21)

where A represents the area, and Aconvex represents the convex area.

Solidity measures the ratio of the area of a cell to the convex area of a cell. This measures the density of a cell by detecting holes and irregular boundaries in the cell shape. The maximum value will be 1 for a cell with a perfectly convex and smooth boundary and will decrease as the cell shape becomes more concave and/or irregular.
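Several of the above metrics can be computed from a single-cell binary mask with skimage region properties; this is a convenience sketch, not the CellSPA implementation:

```python
import numpy as np
from skimage.measure import regionprops

def morphology_metrics(cell_mask: np.ndarray) -> dict:
    """Elongation, compactness, eccentricity, and solidity (Eqs. 15, 18, 20, 21)
    for a binary mask containing a single cell."""
    p = regionprops(cell_mask.astype(int))[0]
    h = p.bbox[2] - p.bbox[0]  # bounding box height
    w = p.bbox[3] - p.bbox[1]  # bounding box width
    return {
        "elongation": w / h,                                        # Eq. 15
        "compactness": 4 * np.pi * p.area / p.perimeter ** 2,       # Eq. 18
        "eccentricity": p.minor_axis_length / p.major_axis_length,  # Eq. 20
        "solidity": p.solidity,                                     # Eq. 21
    }
```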

Gene-level QC characteristics

• Proportion of cells expressing each gene

[B] Segmented cell expression purity. We implemented two broad classes of statistics to capture (i) the concordance of expression profile with scRNA-seq data and (ii) the expression purity or homogeneity of cell type markers. The scRNA-seq data used are described in Section Datasets and preprocessing and listed in Table 1.

  • Concordance with scRNA-seq data – We calculated the similarity of the expression pattern between the segmented cells and publicly available single-cell datasets. Here, the similarity was measured by the Pearson correlation of the average log-normalised gene expression for each cell type. We also calculated the concordance of the proportion of non-zero expression for each cell type between the segmented cells and scRNA-seq data. For data with paired Chromium data from the same experiment, i.e., Xenium-BreastCancer, we also compared the cell type proportions and quantified the concordance using the Pearson correlation. We annotated the cell types of segmented cells using scClassify (ref. 36) with the scRNA-seq data as reference.

  • Purity of expression – We first curated a list of positive markers and negative markers from the scRNA-seq reference data. For each cell type, we selected genes in the highest and lowest 10 percentiles of the difference in expression compared to other cell types. We also removed positive markers that were common to more than 25% of cell types to obtain a purer positive marker list. For each segmented cell, we then considered the genes in the highest 10 percentile of expression as positive markers and those in the lowest 10 percentile as negative markers. We then calculated the precision, recall, and F1 score for both positive and negative markers. We further summarised the average positive marker F1 scores and negative marker F1 scores into one Purity F1 score for each method, where we first scaled the average positive and negative marker F1 scores into the range [0, 1] and then calculated the F1 score of the transformed metrics as follows:

$$F1_{\text{purity}} = 2 \cdot \frac{(1 - F1_{\text{negative}}) \cdot F1_{\text{positive}}}{1 - F1_{\text{negative}} + F1_{\text{positive}}}.$$

(22)
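Eq. (22) as a one-line function, taking the scaled average positive and negative marker F1 scores:

```python
def purity_f1(f1_positive: float, f1_negative: float) -> float:
    """Purity F1 (Eq. 22) from scaled positive and negative marker F1 scores."""
    neg = 1 - f1_negative
    return 2 * neg * f1_positive / (neg + f1_positive)
```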

[C] Spatial characteristics

In this category, we measured the association between cell type diversity in local spatial regions and all the cell-level baseline characteristics provided in [A]. We first divided each image into multiple small regions. Then, for each local spatial region, we calculated the cell type diversity using Shannon entropy with the R package 'entropy', where a higher entropy indicates a more diverse cell type composition. Next, we assessed the variability of cell-level baseline characteristics within each local region using the coefficient of variation. Subsequently, for each of the cell-level baseline characteristics mentioned in [A], we calculated the Pearson correlation between the cell type diversity (measured using Shannon entropy) and the coefficient of variation of these characteristics across all local regions. Here, we anticipate that regions with more diverse cell type compositions will exhibit higher variability in cell-level characteristics, leading to a stronger correlation between these two metrics. A sketch of this computation follows.
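Although the original analysis used the R package 'entropy', the per-region computation can be illustrated in Python (region partitioning is assumed to have been done upstream):

```python
import numpy as np
from scipy.stats import entropy, pearsonr, variation

def diversity_vs_variability(region_types: list, region_values: list) -> float:
    """Pearson correlation between per-region cell type diversity (Shannon
    entropy) and the coefficient of variation of a cell-level characteristic.
    region_types: per-region arrays of cell type labels;
    region_values: matching per-region arrays of the characteristic."""
    ents = [entropy(np.unique(t, return_counts=True)[1]) for t in region_types]
    cvs = [variation(v) for v in region_values]
    return pearsonr(ents, cvs)[0]
```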

[D] Neighbouring contamination

This metric is designed to ensure that cell segmentation does not contaminate the expression signals between neighbouring cells. For a pair of cell types (e.g., cell types A and B), we computed the Euclidean distance from each cell of cell type A to its nearest neighbour belonging to cell type B. We then grouped the cells of cell type A based on a range of distances. Within each group, we calculated the proportion of cells expressing a selected negative marker, which is a cell type marker for cell type B. We anticipate that a method with less contamination will result in segmented cells expressing lower levels of the negative marker, even when the distance to a different cell type is minimal.

[E] Replicability

Our analysis assessed the agreement between the Xenium-BreastCancer1 and Xenium-BreastCancer2 datasets in terms of all the cell-level baseline characteristics provided in [A]. As these datasets are considered to be sister regions, we anticipated that the distributions of all the baseline characteristics, as well as the cell type composition, would be similar. We used the Pearson correlation to quantify the degree of concordance.

Statistics and reproducibility

All analysis was done in R version 4.3.0. No statistical method was used to predetermine sample size. No data were excluded from the analyses. All cells that passed quality control were included in the analyses. The experiments were not randomised. The investigators were not blinded to allocation during experiments and outcome assessment.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.