
Culture of pluripotent stem cells
PSC lines were cultured under feeder-free conditions on Matrigel-coated plates at 37 °C with 5% CO2 and 3% O2. To coat culture dishes, Matrigel solution was prepared by diluting Matrigel (Corning, 354277) 1:40 in ice-cold DMEM/F12 medium (Gibco, 11330057) and stored at 4 °C until use. Before plating cells, 6-cm dishes were coated with 1 ml Matrigel solution and incubated for 30 min at 37 °C. PSCs were maintained in TeSR/E8 medium (Stemcell Technologies, 05990) supplemented with 100 U ml−1 penicillin and 100 µg ml−1 streptomycin (Thermo Fisher, 15140122) with daily medium changes and passaged 1:8 to 1:10 when confluency reached around 70%. To detach cells, plates were rinsed with 2–3 ml PBS (Gibco, 10010023) and treated with 0.5 ml ReLeSR reagent (Stemcell Technologies, 05872) for 5 min at 37 °C. When single-cell dissociation was needed, Accutase (Sigma-Aldrich, A6964) was used instead of ReLeSR, and 5 µM ROCK inhibitor Y-27632 (Tocris, 1254) was added to the medium to enhance cell survival in the first 24 h. All participants signed an informed consent form, and the use of PSCs was approved by the ethical committee of the University of Milan. All iPSC lines were reprogrammed by at least 15 passages. All PSCs have been routinely verified to be Mycoplasma free by routine PCR testing, and their identity was confirmed by short tandem repeat profiling. Details about the PSC lines can be found in Supplementary Table 1.
Multiplexing strategies
In the experimental design, we adopted two distinct multiplexing strategies, namely, mosaic and downstream, that differed in the moment at which the different cell lines were mixed.
In the former multiplexing approach, CBOs were generated by mixing equal amounts of PSCs derived from each cell line to obtain mosaic brain organoids. Briefly, PSC lines were dissociated in parallel at the single-cell level, and cells were counted separately and then mixed in equal proportions to obtain a mosaic cellular suspension. After diluting the cell suspension to the desired concentration of 2 × 105 cells per ml, organoids were generated as explained in the following chapter entitled ‘Cortical brain organoids’.
In the downstream approach, CBOs were independently generated from each individual PSC line and grown separately. When reaching the desired analytical time point, organoids from each cell line were dissociated in parallel, and cells were counted and mixed in equal proportions to obtain a pooled cell suspension.
For the comparison of neurodevelopmental cell types and trajectories between mosaic and downstream multiplexed CBOs, the design included four iPSC lines (CTL08A, CTL01, CTL02A, CTL04E) across replicates in both multiplexing modalities. For Census-seq-based assessment of mosaic organoid scalability, mCBOs were generated with different combinations of PSC lines as shown in Fig. 6a.
Cortical brain organoids
Pure line-derived and mosaic brain organoids were generated using an adaptation of the previously described protocol78, which allows one to obtain dorsal telencephalon cortical organoids, introducing orbital shaking on day 12 of differentiation as previously published by us in refs. 5,11,15. PSCs were grown on Matrigel-coated plates to a confluency of approximately 60–70%, dissociated with Accutase (Sigma, A6964) and resuspended in TeSR/E8 medium supplemented with 5 µM ROCK inhibitor Y-27632 (Tocris, 1254) to reach a final concentration of 2 × 105 cells per ml. The cell suspension (100 µl per well) was seeded in ultra-low-attachment, U-bottom 96-well plates (System Biosciences, MS9096UZ) and then centrifuged for 3 min at 150 rcf to promote formation of embryoid bodies. Plates were incubated at 37 °C with 5% CO2 and 3% O2 for 2 d, and then the first medium change was performed, substituting TeSR/E8 with neural induction medium containing 80% DMEM/F12 medium (Gibco, 11330057), 20% knockout serum (Gibco, 10828028), non-essential amino acids (1:100, Sigma, M7145), 0.1 mM cell culture-grade 2-mercaptoethanol solution (Gibco, 31350010), GlutaMAX (1:100, Gibco, 35050061), penicillin at 100 U ml−1 and streptomycin at 100 μg ml−1 (Thermo Fisher, 15140122), 7 µM dorsomorphin (Sigma, P5499) and 10 µM TGF-β inhibitor SB431542 (MedChemExpress, HY-10431). Since that moment, defined as day 1, cultures were grown in normal oxygen conditions (21% O2). Medium changes were performed daily for the subsequent 4 d, and, on the fifth day, neural induction medium was substituted with complete neurobasal medium, composed of neurobasal medium (Gibco, 12348017), B-27 supplement without vitamin A (1:50, Gibco, 12587001), GlutaMAX (1:100, Gibco, 35050061), penicillin at 100 U ml−1 and streptomycin at 100 μg ml−1 (Thermo Fisher, 15140122) and 0.1 mM cell culture-grade 2-mercaptoethanol solution (Gibco, 31350010) supplemented with 20 ng ml−1 FGF2 (PeproTech, 100-18B) and 20 ng ml−1 EGF (PeproTech, AF-100-15). On day 12, organoids were transferred by pipetting with cut-end pipette tips from 96-well to 9-cm ultra-low-attachment dishes (System Biosciences, MS-90900Z) and placed on a standard orbital shaker (VWR Standard Orbital Shaker, Model 1000). From day 12 onward, medium changes were performed every other day. On day 23, FGF and EGF were replaced with 20 ng ml−1 BDNF (PeproTech, 450-02) and 20 ng ml−1 neurotrophin 3 (PeproTech, 450-03) to promote differentiation of neural progenitors. From day 42 onward, complete neurobasal medium without BDNF and NT3 was used, performing medium changes every other day.
Cell cycle analysis
Pure line-derived organoids were subjected to cell cycle analysis at multiple time points after their generation along with cell suspensions employed in their generation. Organoids were collected at differentiation days 0, 5, 12, 25, 50 and 75, dissociated at 37 °C for 5 min with trypsin-EDTA (Euroclone, ECB3042) (differentiation days 0, 5 and 12) or for 30 min with papain (Stemcell Technologies, 07466) (differentiation days 25, 50 and 75); papain was more efficient for the dissociation of mature organoids.
As for the first analytical replicate (Fig. 2c), cells were fixed with cold ethanol and stored at +4 °C. Upon completing the longitudinal cohort, cell suspensions were rinsed with PBS, stained overnight with 3 µM propidium iodide solution (Thermo Fisher, P1304MP) in the presence of 25 µg ml−1 RNase I (Thermo Fisher, EN0601) and analyzed with a FACSCelesta instrument (BD Biosciences) to measure DNA content. Analyses were performed with FlowJo version 10 software (BD Biosciences).
In the second analytical replicate (Fig. 2d), to avoid the deterioration of early time point samples, 1 million cells per sample were resuspended in PBS supplemented with 0.1% BSA (Sigma, A9418), stained for 20 min at 37 °C with 2 µg ml−1 Hoecst 33342 (Sigma-Aldrich, B2261) in the presence of 5 µM verapamil (Sigma, V4629) and analyzed with a CytoFLEX instrument (Beckman Coulter Life Sciences) to measure DNA content.
Analyses were performed with FlowJo (replicate 1, BD Biosciences) or FCS Express 7 software (replicate 2, De Novo Software). Raw counts are shown in Supplementary Data 1, and the gating strategy is shown in Supplementary Fig. 2.
Histological analysis
Organoids were collected on differentiation days 50 and 100, washed with PBS and fixed overnight at 4 °C in 4% paraformaldehyde–PBS solution (Santa Cruz, sc-281692). After rinsing with PBS twice, samples were embedded in 2% low melting agarose, placed in 70% (vol/vol) ethanol and immediately given to the Tissue Processing Unit for paraffin embedding, sectioning and routine hematoxylin–eosin staining.
Deparaffinization and rehydration were achieved by consecutive passages of 5 min each in the following solutions: 2× histolemon (Carlo Erba, 454912), 100% ethanol, 95% ethanol, 80% ethanol and 2× ddH2O. Sections were then incubated for 45 min at 95 °C with 10 mM sodium citrate buffer (VWR Chemicals, 27833) with 0.05% Tween-20 (Sigma, P1379) for simultaneous antigen retrieval and permeabilization and then equilibrated at room temperature for at least 2 h. After 30 min of blocking with 5% normal donkey serum (Jackson ImmunoResearch, 017-000-121) in PBS, incubation with primary antibodies in PBS with 5% normal donkey serum was performed. The following primary antibodies were used: anti-PAX6 (rabbit, 1:200, BioLegend), anti-TUJ1 (mouse, 1:1,000, BioLegend), anti-nestin (mouse, 1:500, Millipore), anti-reelin (mouse, 1:400, Millipore), anti-CTIP2 (rat, 1:400, Abcam), anti-SATB2 (mouse, 1:400, Abcam), anti-HOPX (rabbit, 1:500, Sigma), anti-SOX2 (goat, 1:1,000, R&D Systems), anti-MAP2 (guinea pig, 1:200, Synaptic Systems), anti-NeuN (mouse, 1:200, Abcam), anti-Ki-67 (rabbit, 1:250, Abcam). The day after, secondary antibodies conjugated with Alexa Fluor 488, 594 or 647 (donkey, 1:400, Thermo Fisher) were diluted in PBS and applied to the sections for 1 h, followed by a 5-min incubation with 1 µg ml−1 DAPI solution. After each incubation, three 5-min washing steps with TBS buffer were performed. After a final rinse with deionized water, slides were dried and mounted using Mowiol mounting solution. Images at ×20, ×40 and ×63 magnification were acquired using a DM6 B MultiFluo microscope (Leica) equipped with an Andor Zyla VSC-04470 sCMOS camera (Fig. 2) or with an ECLIPSE Ti2 Crest microscope (Nikon) coupled with a Photometrics Prime 95B camera at ×20 magnification (Extended Data Fig. 1) and then processed with Fiji software. For Extended Data Fig. 1, maximum intensity projection was performed, choosing the ‘maximum intensity’ option, and a gamma correction of 0.7 was applied to the AF488 channel.
Cortical brain organoid processing for single-cell transcriptomic analysis
Organoids were collected on differentiation days 50, 100, 250 and 300 (±3 d). Three to five organoids per condition were dissociated by incubation with a solution of 0.5 mg ml−1 Collagenase/Dispase (Sigma) with 0.22 mg ml−1 EDTA (Euroclone) and 10 µl DNase I at 1,000 U ml−1 (Zymo Research) for 30–45 min according to organoid size. Digested suspensions were filtered through 70-µm-pore Flowmi Cell Strainers (Sigma, BAH136800040), resuspended in PBS and counted using a TC20 Automated Cell Counter (Bio-Rad). For the day 100 downstream sample, the single-cell suspension was further centrifuged at 500 rcf for 3 min and resuspended in 200 µl cold PBS–0.04% BSA. Pre-chilled (800 µl) 100% methanol was added dropwise for a final concentration of 80%. Cells were fixed for 30 min and stored at −80 °C for 6 months. For recovery of a single-cell suspension, fixed cells were thawed at 4 °C (all steps at 4 °C) and centrifuged at 1,000 rcf for 5 min, the supernatant was completely removed, and pre-chilled SSC cocktail (3× SSC, 0.04% BSA, 1% SUPERase-In, 40 mM DTT) was added. Cells were counted and resuspended at a concentration 1,000 cells per µl. Droplet-based single-cell partitioning and scRNA-seq libraries were generated using the Chromium Single Cell 3′ Reagent v2 Kit (10x Genomics) following the manufacturer’s instructions79. Briefly, a small volume (6–8 μl) of single-cell suspension at a density of 1,000 cells per μl was mixed with RT–PCR master mix and immediately loaded together with Single Cell 3′ gel beads and partitioning oil into a Single Cell 3′ Chip. The gel beads were coated with unique primers bearing 10x cell barcodes, unique molecular identifiers and poly(dT) sequences. The chip was then loaded onto a Chromium instrument (10x Genomics) for single-cell GEM generation and barcoding. RNA transcripts from single cells were reverse transcribed within droplets to generate barcoded full-length cDNA. After emulsion disruption, cDNA molecules from each sample were pooled and pre-amplified. Finally, amplified cDNA was fragmented, and adaptor and sample indices were incorporated into finished libraries that were compatible with Illumina sequencing. The final libraries were quantified by real-time quantitative PCR and calibrated with an in-house control sequencing library. The size profiles of the pre-amplified cDNA and sequencing libraries were examined on the Agilent Bioanalyzer 2100 using a High Sensitivity DNA chip (Agilent). Two indexed libraries were pooled equimolarly and sequenced on the Illumina NovaSeq 6000 platform using the v2 Kit (Illumina) with a customized paired-end, dual-indexing (26/8/0/98-bp) format according to the recommendation of 10x Genomics. Using proper cluster density, a coverage of around 250 million reads per sample (2,000–5,000 cells) was obtained, corresponding to at least 50,000 reads per cell.
Generation of CellPlex libraries
Two independent downstream multiplexed datasets were generated for external validation of the SCanSNP pipeline, labeling each genotype with a specific barcode from the 3′ CellPlex Kit (10x Genomics, PN-1000261). Briefly, organoids were collected on day 50 and dissociated with an HBSS-based solution containing 30 U ml−1 papain (Stemcell, 07466) and 125 U ml−1 DNase I (Zymo Research) for 30–45 min according to organoid size. After filtering through 70-µm-pore Flowmi Cell Strainers, cells were counted, and 1 million live cells per sample were taken and labeled with a unique barcode from the 3′ CellPlex Kit Set A according to the manufacturer’s instructions. Once the washing steps were completed, 1 × 105 live cells per sample were pooled together, and the concentration was adjusted to be within 1.2–1.6 × 106 cells per ml for droplet-based single-cell partitioning. Libraries were generated using the Chromium Next GEM Single Cell 3′ v3.1 Kit (10x Genomics) similarly as described previously. Target coverage for gene expression was set at 50,000 reads per cell, whereas, for CellPlex multiplexing oligonucleotides, a sequencing depth of 5,000 reads per cell was chosen as recommended.
Generation of bulk RNA-seq-based variant-calling file (VCF)
Reference genotypes for the deconvolution were obtained from bulk RNA-seq data of pure line-derived organoids.
Total RNA was isolated from pure line organoids with the RNeasy Micro Kit (Qiagen) according to the manufacturer’s instructions. RNA quantification and integrity was assessed by electrophoretic analysis with the Agilent 2100 Bioanalyzer. The TruSeq Stranded Total RNA LT Sample Prep Kit (Illumina) was used to run the library for each sample using 500 ng total RNA as the starting material. Sequencing was performed with the Illumina NovaSeq 6000 platform, sequencing on average 35 million 50-bp paired-end reads per sample.
First, raw reads were aligned to the GRCh38 version 93 Ensembl reference genome using STAR80 in two-pass mode to learn splicing junctions from data, subsequently, read group tags were made uniform, and optical and PCR-duplicated reads were marked using GATK MarkDuplicates. The resulting BAM file (in silico multiplexed BAM creation is described in the Supplementary Methods) was recalibrated using BaseRecalibrator and ApplyBQSR with the sorted 00-All.vcf.gz (b151_GRCh38p7) file from dbSNP. Next, we performed variant calling using GATK HaplotypeCaller81 with the option ‘dont-use-soft-clipped-bases’–standard-min-confidence-threshold-for-calling set to 30 and –min-base-quality-score equal to 20. We applied thresholds for MIN_DP, DP and quality (GQ) of 10, 10 and 20, respectively. Finally, we used CombineGVCFs and GenotypeGVCFs exclusively to merge together GVCFs of genotypes mixed in the corresponding single-cell experiment, ignoring the aggregated INFO field.
Alignment and genotype demultiplexing
scRNA-seq data were aligned using Cell Ranger 3.0.0 count and the matched reference provided from 10x. For subsequent demultiplexing and downstream analyses, only droplets passing the Cell Ranger filter were considered. For demultiplexing, we applied demuxlet, souporcell, Vireo and SCanSNP. The final identities used are the result of the consensus call (consensus call setup is described in the Supplementary Methods). With the exception of souporcell, all the tools were provided with bulk RNA-seq-derived VCFs and were embedded in a collective singularity image. To optimize the deconvolution process for our samples, we used demuxlet with default parameters and the setting ‘–doublet-prior’ according to the number of retained droplets after Cell Ranger filtering and V3 kit specs doublet expectation. CellSNP, the Vireo companion tool for single-cell variant calling, was launched ‘with -p 10–minMAF 0.1–minCOUNT 20’, and subsequently Vireo was launched on the combined VCF file cleaned from loci containing missing calls. Souporcell was launched using the available singularity image specifying only the number of genotypes in the mixture (k). hiPSC lines included in this analysis are listed in Supplementary Table 1.
Single-cell data preparation
All downstream analyses were performed within the SCANPY single-cell analysis framework82.
Basic filtering was done right after importing count matrices from Cell Ranger. We inspected the number of genes, mitochondrial gene counts and ribosomal gene count distributions and adopted dataset-specific thresholds to remove droplets with likely technical issues. Next, we removed droplets called for low quality or doublets according to the consensus call. After merging the seven datasets, cell counts were normalized and log transformed using the ‘sc.pp.normalize_total’ and ‘sc.pp.log1p’ SCANPY functions. Finally, we regressed out the effect of total counts and the percentage of mitochondrial transcripts with the ‘sc.pp.regress_out’ and ‘sc.pp.scale’ functions. All functions were run with default parameters.
Cell filtering and annotation
We relied on multiple tiers of annotation, filtering and dimensionality reduction. First, we removed proteoglycan (WLS, ANXA2, TPBG, RSPO3, DCN, BGN, MYH3)-expressing cells, considered non-relevant for our downstream analyses. We next iteratively partitioned cells and assessed the top marker via Leiden and rank_gene_groups, respectively, to remove cells highly expressing adhesion (WLS, TPBG) and stress (BNIP3, PGK1, MT-CO2) markers while manually annotating cell types using literature markers. Finally, we used SCANPY’s score_genes function providing ER stress and hypoxia signatures to remove clusters of cells scoring >0 and >0.3, respectively. Finally, we repartitioned the remaining cells with a Leiden resolution of 0.6 and manually transferred the annotation to new clusters, merging, if needed, different partitions into coherent cell types when key markers were overlapping.
Highly variable gene selection
For HGV detection, we took advantage of having at least two datasets per time point and detected HVGs by time point with the ‘sc.pp.highly_variable_genes’ SCANPY function, providing each dataset as a separate batch and min_mean = 0.0125, max_mean = 5 and min_disp= 0.5 as parameters. For each time point, we kept only genes found as HVGs in at least two datasets. After major filtering and single-trajectory isolation, HVGs were recomputed on the cell subset in the same fashion. Moreover, we included some relevant neurodevelopmental genes from the literature regardless of whether they were detected as HVGs (Supplementary Data 2).
Dimensionality reduction and dataset integration
After filtering, PCA was computed on defined HVGs, and the seven datasets were integrated via SCANPY’s Harmony implementation83 with a maximum of 20 iterations; the cells’ neighborhood graph was then computed on top 15 PCs, specifying 100 as the neighborhood size (n). Upon single-trajectory isolation, PCA was recomputed on each cell subset and branch-specific HVGs, Harmony was run with a maximum of 20 iterations, lambda = 2 and theta = 1, and cell neighborhood graphs were computed with ten PCs and 50 n; in the case of the excitatory neuron lineage, we used nine PCs and 60 n, given the greater differences between early and late branches than those within lineage-continuous cell states of other cases.
Trajectory isolation
To analyze the cell state transitions with trajectory-wise magnification, we isolated the most relevant neurodevelopmental trajectories. For this purpose, we first partitioned cells with the Leiden algorithm with double resolution (1.2) with respect to the previous partitioning and used it as the basis for PAGA84, obtaining a PAGA graph. The PAGA graph was refined by removing edges with weight <0.05. After complementing the edges to 1 to obtain the equivalent distance graph, we computed the shortest path from root r to each endpoint e using NetworkX85 with the Bellman–Ford method, where r is the partition of cells with highest counts for the ‘TOP2A’ gene and e are partitions with the highest rate of the cell types considered one of the endpoints (astrocytes, Cajal–Retzius-like neurons, early excitatory neurons, late excitatory neurons, interneurons and migrating neurons).
Differential abundance analysis
We adopted Milo45 for differential abundance analysis. We tested for differential abundance between the mosaic organoid dataset and non-mosaic datasets via direct comparison following the Milo standard workflow. The differential abundance test was between each time point and the other two simultaneously; therefore, we provided as model contrast CTn = Tn − (Tx + Ty)/2 for each of the n assessed time points, where x and y are the other two time points, T is the timepoint and CTn is the model contrast for timepoint n. Plots were generated displaying enriched/depleted cells’ neighbors for each time point with spatial FDR < 0.1.
Differential expression analysis
To compare the molecular impact of mosaic co-culture, we performed direct differential expression analysis between the same genotypes either grown individually (downstream multiplexing) or in mosaics. Given the impact of methanol fixation on the day 100 downstream dataset, (Supplementary Fig. 1b), we relied on day 50 datasets (Supplementary Fig. 1c). To provide a reference of the expected batch-to-batch variability, we also compared the same genotypes when grown in two different mosaic batches (Supplementary Fig. 1d). For both comparisons, we kept the two most abundant genotypes (Supplementary Fig. 1c,d, left) and the three most abundant cell types (that is, proliferating progenitors, radial glial progenitors and neurons; Supplementary Fig. 1c,d, right) present in both experiments. For each genotype and cell type combination, single-cell counts were aggregated to obtain two pseudoreplicates. Gene filtering and normalization were carried out within edgeR86 using the functions filterByExpr() and calcNormFactors(), respectively; the latter was performed to account for different numbers of cells aggregated into the various pseudoreplicate combinations. Finally, count fitting (glmQLFit edgeR function) and the DE test (glmQLFTest edgeR function) were repeated individually for each cell type comparing the condition (multiplexing paradigm or replicate) while providing the genotype as the blocking variable.
Developmental trajectory analysis
We aimed to assess the distribution of cells along pseudotime by different covariates for migrating neuron, astrocyte, Cajal–Retzius-like neuron and interneuron trajectories after their isolation. We wished to assess whether (1) time point differences mirror the asynchronous development of specific cell types in our in vitro system, (2) the multiplexing paradigm impact on the developmental timing of such populations and (3) whether we had the resolution to capture developmental differences among control genotypes. For each trajectory individually, we computed diffusion map87 and dpt88. Next, for the different time points and the multiplexing paradigm, we computed the kernel density of each dataset (sklearn.neighbors.KernelDensity, kernel = ‘gaussian’, bandwidth corresponding to 5% of the whole pseudotime window) and plotted mean and ±1 s.d. among dataset densities. For genotype comparison, we kept the most relevant time points for each trajectory (early and mid for migrating neurons and Cajal–Retzius-like neurons; early, mid and late for the other trajectories) and genotypes for which at least 50 cells were retrieved at each retained time point. Finally, the genotypes were balanced via random sampling to have the same amount of cells across time points. Mean and standard deviations were computed on kernel densities of 50 sampling iterations to assess the stability.
For the isolated migrating neuron trajectory, we confirmed the dpt-based genotype grouping (Fig. 5c) by assessing their behavior along PC1. We started by confirming that PC1 was mainly driven by differentiation, and thus could be used as an alternative of pseudotime measure. To do so, we quantified the PC1 variance (adjusted R2) explained by the annotated Leiden covariate using the ordinary least-square implementation of the statsmodels Python library (Extended Data Fig. 7a, left). Similarly, we then assessed the PC1 variance explained by the genotype (Extended Data Fig. 7a, right) and genotype distributions along PC1 (Extended Data Fig. 7b).
Subsequently, we used tradeSeq89 (fitGAM function was run with nknots = 8 after trimming cells in the first and 99th dpt percentiles to increase stability at lowly sampled extremes) and detected pseudotime-driven transcriptional difference within each lineage for all but excitatory neuron trajectories, whereas, for excitatory neurons, we used tradeSeq to find key transcriptional differences between early–mid and late neuronal trajectories.
Within lineage differences, to extract key driver genes along pseudotime, we isolated HVGs between lineage extremes (tradeSeq startVsEndTest(), pVal ≤ 0.001 and log (FC) ≥ 2).
For early-versus-late excitatory neuron differences, to detect key differences between the two lineages, we considered both the greatest divergently expressed genes at the terminal states (tradeseq diffEndTest(), pVal ≤ 0.001 and log (FC) ≥ 2) and the transiently varying genes, defined as simultaneously low ranked from the tradeSeq:diffEndTest function and high ranked from the tradeSeq:patternTest(pVal ≤ 0.001) function.
Allele-specific expression analysis
We leveraged our multigenotype design to carry out ASE analysis for each cell type annotated. We first produced the read pileup at genomic (biallelic-only) loci displaying variability within our cohort using SCanSNP pileup mode, which provided an anndata82 with nCells × nLoci dimensionality and two layers: ‘Refreads’ and ‘Altreads’, with the count of reads presenting or not presenting the variant, respectively. To perform the analysis, we summed the reads mapping to variant sites (at least one heterozygous genotype) of the same cell type. If multiple genotypes were heterozygous at the given location, reads were included in the sum regardless of the original genotype. As it is not granted that, among different individuals, if present, the dominant allele is the same, we first checked that, for loci with detected ASE, the dominant allele was coherent before merging coverages from different individuals. We observed minimal discrepancy, that is, at a given locus, the dominant allele was vastly the same (always Ref or Alt) across genotypes (information available in the Github repository of the paper, Notebook 04_ASE/13.1_SanityCheck). For each cluster, we kept only loci with at least 20 reads, computed the β value (Alt reads/(Alt reads + Ref reads)) and performed binomial test and FDR correction (q < 0.05) provided by the SciPy90 and statsmodels implementations, respectively. To calculate the correlation among cell types, we used β values of loci with detected ASE in at least one cell type and covered with at least 20 reads in all cell types. Correlation was computed in pandas with ‘spearman’ metrics.
SCanSNP
In our benchmark, the presence of low-quality droplets and doublets was observed to be an open challenge also for well-established methods when assigning genotypes (IDs) to droplets in the genetic demultiplexing. With those challenges in mind, we developed SCanSNP (available at https://github.com/GiuseppeTestaLab/SCanSNP) by dividing the demultiplexing and filtering in 3(+1) steps:
-
1.
Best ID detection per droplet: here, as for other approaches, we leveraged the accessibility of bulk RNA-seq data to generate a function that maximizes the score difference of each ID to the sequenced droplets:
$${S}_{g}=mathop{sum }limits_{i=1}^{n}left(left(frac{{A}_{i}times {a}_{ig}}{{t}_{i}}+left(frac{{R}_{i}times {r}_{ig}}{{T}_{i}}right)right)right),$$
where Sg is the score for ID g in each droplet, i are the loci for which allelic information is accessible from bulk RNA-seq, A and R are, respectively, the number of reads supporting alternative and reference alleles, a and r are the number of alternative and reference alleles in g and t and T are total alternative and reference alleles in the cohort at locus i.
-
2.
Second-best ID determination: given an m-by-g contribution matrix, where m are the droplets and g are the multiplexed IDs containing the number of reads supporting genotype-specific alleles. We used this matrix to iteratively train a multinomial logistic regression model to predict which is the most likely ID after the first one, assuming ambient contamination consistent across droplets. We split the contribution matrix into groups of droplets sharing the best ID according to step 1; for each group, we trained the model on counts and labels from other groups to predict the second-best ID of barcodes in the current group.
-
3.
Doublet detection: to allow doublet detection to be specific and flexible while accommodating genetic contributions ranging from balanced doublets to the presence of a cell and debris in the same drop, we implemented a method similar to the one adopted in ref. 22. Starting from the previous m-by-g contribution matrix, for every genotype g, we define as negative droplets the ones that do not contain that genotype as the best ID according to the first step and fit a negative binomial distribution via the fitdistrplus91 R function on counts supporting private g alleles. We therefore used the 99% quantile of the fitted distribution as the positivity threshold. Droplets positive for more than one ID are considered multiplets.
-
4.
We finally took advantage of the mixed-genotype design to structure an added layer of a low-quality droplet detection to be used during consensus call aggregation.
We applied a Gaussian mixture model expectation-maximization algorithm (implemented through the R mixtools92 package) to separate droplets with ‘low’ and ‘high’ signal-to-noise ratios by computing log (FC) between the first- and second-best predicted IDs. We started by preparing a new contribution matrix similar to the one in passage 2 but considering only non-ambiguous loci between each possible pair of best and second-best IDs in the dataset. Additionally, before log (FC) calculation, we add pseudocounts, which mimics average ambient RNA contamination coming from each ID, calculated as the average rate of reads deriving from the other genotype’s unambiguous reads when they are not labeled as first ID or second ID across all droplets (according to the contribution matrix); similar to the approach proposed in the hashedDrops function from the package MarioniLab/DropletUtils93,94, this step ensures that log (FC) is always defined for all droplets. Given the nature of the model, the resulting classification assumes the presence of two distinct populations that can be separated based on the proportion of the two IDs, and, given that it is computed after doublet detection, it will likely detect those droplets that embed enough ambient RNA to pass the Cell Ranger emptyDrops filter, while it should not be used if any sort of prior filtering of low-quality droplets has already been done. Benchmarking of SCanSNP and genetic demultiplexing in barcode-tagged samples are described in the Supplementary Methods.
Power estimation for single-cell eQTLs
We estimated the eQTL power using our R package scPower (version 1.0.2)50 for sample sizes between 25 and 200 and for number of cells per sample between 250 and 1,500, keeping the read depth as in our experiment. We fitted the required expression priors per cell type using the complete scRNA-seq dataset, combining the different multiplexing strategies and time points, and took effects from a previously published single-cell eQTL study in IPS cells35, excluding eQTLs from ROT-treated cells. Genes were defined as expressed with at least three counts in at least 9.5% of the samples.
Modeling mCBO clonal dynamics
We integrated the longitudinal Census-seq data with mCBO imaging to model mCBO clonal dynamics (modeling of PSC and CBO growth with imaging; Census-seq and Census-seq ranking of iPSC lines in mosaic organoids are described in the Supplementary Methods). First, we used already published imaging of CBOs stained for nuclear markers to estimate the density of cells inside CBOs (0.000767495 per µm3). Next, for each replicate of each mosaic combination that we longitudinally profiled with imaging, we multiplied the measured area of the mCBOs (converting this to the equivalent sphere volume simply through the sphere volume formula) by the density to estimate the number of cells present in each mCBO at each time point. We computed the average of this value for each mosaic combination across the replicates for each time point. We then multiplied this value by the average contribution of each iPSC line in each mosaic combination, as given by the corresponding Census-seq data (correlation across multimodal ranking is described in the Supplementary Methods). This represents the estimated number of cells for each PSC line in each mCBO at the different time points. The ratio between sequential time points was computed to estimate the line-specific growth rates, and their stability across mosaic combinations (Fig. 6b and Extended Data Fig. 10b) as defined by nit = rit · nt and gi = (nit+1)/nit with nt as the total number of cells in the organoid at time t, rit is the ratio of cells measured by Census-seq, nit is the number of cell per genotype at time t, gi is the growth rate of the genotype i. The empirical distribution is defined by expanding the distribution of gi∀ genotypes i from our measured data. We used the computed line-specific growth rates to estimate an empirical distribution of possible growth rates. We then generated the growth rates of the lines inside the mCBO randomly (10,000 cycles of sampling) from the probability distribution calculated over the domain. The process was repeated for sample sizes between two and 20 theoretically multiplexed lines.
Because we had a constant number of starting cells (20,000) when generating mCBOs, we multiplied this by the Monte Carlo simulated growth rates, thus modeling the expected number of cells per line for all the possible conditions (from two to 20 starting lines).
To translate this into useful guidelines for experimental disease-modeling pipelines, we considered that at least 100 cells per neurodevelopmental cell type should be recovered for each PSC line grown in mCBOs and that 100,000 cells can be sequenced. On the basis of these parameters, Supplementary Fig. 10c shows the number of PSC lines (y axis) retrieved with the highest empirical probability given the number of PSC lines used to generate mCBOs (x axis). We finally employed the curve_fit function from the SciPy python library to estimate the coefficients a and b of the power function N = aIb + c linking the number of mixed cell lines I and the average number of recovered lines N (Fig. 6c). The scalability of mosaic experiments in terms of experimental timeline is described in the Supplementary Methods.
Statistics and reproducibility
Statistical analyses were carried out using tests appropriate for each assessed modality using SCANPY, tradeSeq and Milo for single-cell transcriptomics analyses.
The threshold for statistical significance was spatial FDR < 0.1 for differential abundance (Milo) and P < 0.05 for other statistical tests. All details on sample size, number of replicates, statistical tests and significance are provided in the relevant figure legends. CBOs were differentiated in multiple independent batches, and the number of replicates was chosen on the basis of previous published studies on brain organoids.
The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assignments.
Graphics and figures
Final figure panels were assembled using Adobe Illustrator version 27.0.1. For the organoids, cells and human shapes in Fig. 1, templates were downloaded from BioRender and subsequently modified.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
- SEO Powered Content & PR Distribution. Get Amplified Today.
- PlatoData.Network Vertical Generative Ai. Empower Yourself. Access Here.
- PlatoAiStream. Web3 Intelligence. Knowledge Amplified. Access Here.
- PlatoESG. Carbon, CleanTech, Energy, Environment, Solar, Waste Management. Access Here.
- PlatoHealth. Biotech and Clinical Trials Intelligence. Access Here.
- Source: https://www.nature.com/articles/s41592-024-02555-5