
Correcting PCR amplification errors in unique molecular identifiers to generate accurate numbers of sequencing molecules – Nature Methods

Cell lines and reagents

The 5TGM1, Jurkat and RM82 cell lines were cultured in complete Roswell Park Memorial Institute medium. All parental cell lines were tested twice per year for mycoplasma contamination and authenticated by short tandem repeat profiling during this project. For cell culture experiments, the SGC-CLK-1 (Structural Genomics Consortium) inhibitor was incubated with cells for 24 h. Dimethylsulfoxide (DMSO) was used as a negative control. Jurkat (TIB-152) cells were purchased from the American Type Culture Collection, while 5TGM1 cells were a kind gift from C. Edwards (University of Oxford) and RM82 cells were gifts from N. Athanasou and B. Hassan, as part of the EuroBioNet network.

Oligonucleotide synthesis

Homotrimer phosphoramidites were purchased as a custom product from Metkinen Chemistry and reverse homotrimer phosphoramidites were a custom synthesis product from Chemgenes. Solid-phase phosphoramidite oligonucleotide synthesis on Toyopearl HW-65S resin (Tosoh Biosciences, catalog no. 0019815) was performed by ATDBio as described previously13, in the 5′–3′ direction (using reverse amidites), using a method adapted from ref. 20. The sequence of the capture oligonucleotide is as follows: Bead-5′-[spacer]-TTTTTTTAAGCAGTGGTATCAACGCAGAGTACJJJJJJJJJJJJNNNNNNNNTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT-3′, where ‘J’ indicates a nucleotide trimer block added via split-and-pool synthesis using reverse monomer phosphoramidites. ‘N’ indicates a degenerate trimer nucleotide (added using an equimolar mixture of the four reverse trimer phosphoramidites). [spacer] is hexaethylene glycol added using DMT-protected hexaethylene glycol phosphoramidite, and all the other bases are standard (monomeric) DNA bases added using reverse phosphoramidites. AAGCAGTGGTATCAACGCAGAGTAC is the PCR handle.

Before oligonucleotide synthesis, capping was performed to reduce the initial loading of hydroxyl groups on the beads, by suspending the resin in a 1:1 mixture of Cap A (tetrahydrofuran:lutidine:acetic anhydride 8:1:1) and Cap B (tetrahydrofuran:pyridine:1-methylimidazole 8:1:1) at room temperature for 30 min. Oligonucleotide synthesis was then performed using an ABI 394 DNA synthesizer with a modified 1 μmol synthesis cycle, using an extended coupling time of 5 min for monomer bases and 10 min for trimer bases; the capping step was omitted for the trimer bases in the UMI region and for the poly-T region. The barcode was generated using 12 split-and-pool synthesis cycles. Before the first split-and-pool synthesis cycle, beads were removed from the synthesis column, pooled and mixed, and divided into four equal aliquots. The bead aliquots were then transferred to separate synthesis columns before three consecutive couplings with monomer reverse amidites. This process was repeated 11 times. Following the final split-and-pool cycle, the beads were pooled, mixed and divided between four columns, ready for the next part of the synthesis. An equimolar mixture of the four trimer phosphoramidites was used in the synthesis of the degenerate UMI (poly(N)) region, and (monomeric) T reverse amidite was used for the poly(T) tail. After oligonucleotide synthesis, the resin was washed with acetonitrile and dried with argon before deprotection in aqueous ammonia (room temperature for 17 h followed by 55 °C, 6 h). The beads were then washed with water followed by acetonitrile and dried with argon gas.
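The combinatorics of this scheme can be illustrated with a short sketch (hypothetical code, not part of the synthesis protocol): 12 split-and-pool cycles over four columns give each bead a 36 nt barcode drawn from a space of 4¹² (~16.8 million) sequences.

```python
import random

TRIMERS = ["AAA", "CCC", "GGG", "TTT"]  # one homotrimer per synthesis column

def split_pool_barcodes(n_beads, n_cycles=12, seed=0):
    """Simulate split-and-pool synthesis: in each cycle every bead is
    randomly assigned to one of four columns and gains that column's
    trimer block."""
    rng = random.Random(seed)
    barcodes = [""] * n_beads
    for _ in range(n_cycles):
        barcodes = [bc + rng.choice(TRIMERS) for bc in barcodes]
    return barcodes

barcodes = split_pool_barcodes(1000)
```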

Template switch oligonucleotide was synthesized using standard phosphoramidites: 5′-AAGCAGTGGTATCAACGCAGAGTNNNNNNNNNNGAATrGrGrG-3′. The oligonucleotides were PAGE purified and shipped lyophilized. Primers containing CMIs were synthesized by Sigma Aldrich using the following polyA oligonucleotide sequence: 5′-AAGCAGTGGTATCAACGCAGAGTACNNNNNNNNNNTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT-3′.

Generating bulk homotrimer UMI-tagged cDNA

Total messenger RNA (mRNA) was isolated using a Quick-RNA MiniPrep kit (Zymo), following the manufacturer’s protocol. RNA sample quality and quantity were measured using an RNA ScreenTape on the TapeStation (Agilent). cDNA synthesis was performed with modifications to the SMART approach21. An oligo(dT) adapter containing a homotrimer 30-base UMI sequence and a SMART primer sequence was used to initiate the reverse transcriptase reaction. Briefly, RNA was denatured at 72 °C for 2 min and then reverse transcribed with Maxima H minus reverse transcriptase (2,000 U) in a total volume of 50 μl with the supplied buffer, 1 mM dNTPs, 2 mM dithiothreitol and 4% Ficoll PM-400. The reaction was performed for 90 min at 42 °C before the enzyme was heat inactivated at 80 °C for 5 min. The library was then purified using a 0.8× SPRI bead (Beckman Coulter) cleanup, followed by PCR using KAPA HiFi master mix for 20 cycles (unless otherwise stated)20 with the SMART PCR primer (AAGCAGTGGTATCAACGCAGAGT), before being purified using SPRI beads. To achieve a high concentration of cDNA, the input was subjected to up to 30 cycles of PCR amplification followed by a second cleanup. Optionally, 10 ng of PCR product was subjected to 12 further cycles of PCR using primers that contained trimer sample barcodes (Supplementary Table 1). Finally, cDNA was quantified on a TapeStation (Agilent Technologies) using a DNA high-sensitivity D5000 tape before being split for Illumina or Oxford Nanopore library generation. To reduce PCR artifacts and improve sequencing return, we performed PCR using the primer 5-PCBio-TACACGACGCTCTTCCGATCT for a further 3–5 cycles.

ONT bulk RNA sequencing (RNA-seq) library preparation and sequencing

A total of 1,200 ng of purified cDNA was used as a template for ONT library preparation. We used the SQK-LSK109, SQK-LSK112 and SQK-LSK114 (also referred to as ONT latest kit14 chemistry) ligation sequencing kits, following the manufacturer’s protocol. Samples were sequenced on a MinION device using R9.4.1 (FLO-MIN106D) or R10.4 (FLO-MIN112) flow cells. Barcoding using the Native Barcoding Amplicon kit (EXP-NBD104) was performed for RM82 cells treated with DMSO or the CLK1 inhibitor. These samples were sequenced on the PromethION sequencing platform using R9.4.1 FLO-PRO002 flow cells at the Deep Seq facility at the University of Nottingham.

PacBio bulk RNA-seq library preparation and sequencing

A total of 1,200 ng of purified cDNA was used as a template for PacBio library preparation and sequencing at the Centre for Genomic Research at the University of Liverpool (https://www.liverpool.ac.uk/genomic-research/technologies/next-generation-sequencing/). cDNA was end-repaired and A-tailed with T4 polynucleotide kinase (New England Biolabs). The sequencing library was prepared using the SMRTbell Express Template Prep Kit v.2.0 following the standard protocol. Sequencing was then performed on a Sequel IIe using a SMRT Cell 8M in CCS mode, following the standard protocol. CCS reads were generated using CCS v.6.3.0 (https://github.com/PacificBiosciences/ccs) with default settings.

Illumina bulk RNA-seq library preparation and sequencing

Purified cDNA was used as input for the Nextera XT DNA library preparation kit (Illumina). Library quality and size were determined using a TapeStation (Agilent Technologies) High Sensitivity D1000 tape, and libraries were then sequenced on a NextSeq 500 sequencer (Illumina) using a 75-cycle High Output kit with a custom read1 primer (GCCTGTCCGCGGAAGCAGTGGTATCAACGCAGAGTAC). Read1 was 30 bp long and read2 was 52 bp long.

ONT bulk RNA-seq analysis

We performed base calling on the raw fast5 data to generate fastq files using Guppy (v.6.4.8) (guppy_basecaller --compress-fastq -c [cfg file] -x 'cuda:0') in graphical processing unit (GPU) mode from ONT running on an RTX3090 graphics card. The fastq data were processed using a custom pipeline ‘pipeline_count’ written using cgatcore22 and included within the TallyTriN repository. Briefly, the quality of each fastq file was evaluated using the FastQC toolkit23 and summary statistics were collated using MultiQC24. We then identified the polyA-associated UMI sequence by searching for the polyA region and reverse complementing the read if it did not appear in the correct orientation. The 30 bp UMI was identified upstream of the SMART primer by pattern matching for GTACTCTGCGTTGATACCACTGCTT. The set cover method for collapsing homotrimer UMIs (described below) was then applied for UMI demultiplexing; if the UMI contained more than five errors, the read was removed. The demultiplexed UMI sequence was then added to the read name. Next, the template switch oligo (TSO)-associated UMI was identified using the SMART primer sequence AAGCAGTGGTATCAACGCAGAGTAAT. The 30 bp UMI sequence was either subjected to error correction or removed from the read, depending on the number of UMI errors detected, and the corrected UMI, if applicable, was appended to the read name. Both the TSO- and polyA-associated UMIs and primer sequences were removed from the read sequence. For transcript level analysis, the fastq file was then mapped against the transcriptome using minimap2 (v.2.25) with the following settings: -ax map-ont -p 0.9 --end-bonus 10 -N 3. The resulting .sam file was then sorted and indexed using samtools25. A custom script, in which pysam (v.0.21.0) was used to parse the output .sam file, was then used to add the transcript name to the XT tag of the .sam file for downstream counting by homotrimer deduplication or UMI-tools.
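The orientation and polyA-UMI extraction steps above can be sketched as follows (a simplified illustration; the actual pipeline_count implementation in TallyTriN may differ in thresholds and coordinate handling, and the placement of the UMI immediately 5′ of the GTACTCTGCGTTGATACCACTGCTT match is an assumption here):

```python
import re

RC = str.maketrans("ACGT", "TGCA")
# reverse complement of the SMART handle used for pattern matching
SMART_RC = "GTACTCTGCGTTGATACCACTGCTT"

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(RC)[::-1]

def extract_polya_umi(read, umi_len=30):
    """Orient the read so the polyA tract is on the forward strand, then
    take the umi_len bases immediately upstream of the SMART handle match."""
    if "T" * 15 in read and "A" * 15 not in read:
        read = revcomp(read)  # read was in the reverse orientation
    m = re.search(SMART_RC, read)
    if m is None or m.start() < umi_len:
        return None  # no usable match; pipeline would drop this read
    return read[m.start() - umi_len:m.start()]
```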
For gene level analysis, the fastq data were mapped using minimap2 with the following settings: -ax splice -k 14 --sam-hit-only --secondary=no --junc-bed. The resulting .sam file was then sorted and indexed, followed by feature annotation using featureCounts (v.2.0.1)26 with the following settings to generate an annotated .bam file: featureCounts -a (gtf) -o (output) -R BAM. This .bam file was then used for downstream counting by UMI-tools or homotrimer correction. The reference transcriptome and genomes used for the analysis were hg38_ensembl98 and mm10_ensembl88. The resulting count tables were then used for differential gene expression analysis, which was performed using DESeq2 v.1.40.2 (ref. 27) within the R statistical framework v.4.3.1.

Illumina bulk RNA-seq analysis

The data were processed using ‘pipeline_illumina’, a custom pipeline written using cgatcore28 (v.0.6.15). Briefly, the UMIs contained in read1 were corrected based on homotrimer complementarity or were removed from the analysis depending on a set error threshold. The paired fastq files were then mapped using hisat2 (v.2.2.1)29 before features were counted using featureCounts30 (v.2.0.3) with the following command: featureCounts -a (gtf) -o (output) -R BAM. The resulting XT-tagged .bam file was then used for downstream counting using homotrimer deduplication or UMI-tools. The resulting count tables were then used for differential gene expression analysis, which was performed using DESeq2 v.1.40.2 (ref. 27) within the R statistical framework v.4.3.1.

UMI-tools deduplication

Following gene or transcript level mapping, the UMI was extracted from the read. Since UMI-tools was not designed to correct homotrimer sequences, we collapsed the UMI into a single nucleotide sequence by selecting the first base within each of the individual trimers. Reads were then deduplicated using the directional method with the command: umi_tools count --per-gene --gene-tag=XT.
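For example, collapsing a homotrimer UMI to the monomer sequence passed to UMI-tools amounts to taking every third base:

```python
def collapse_first_base(trimer_umi):
    """Collapse a homotrimer UMI to a monomer UMI by taking the first
    base of each trimer block, as done before running UMI-tools."""
    assert len(trimer_umi) % 3 == 0, "UMI length must be a multiple of 3"
    return trimer_umi[::3]
```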

Homotrimer set coverage deduplication

Following gene or transcript level mapping, the UMI was extracted from the read and collapsed into a single nucleotide sequence using a majority vote approach where applicable, or by resolving inconsistencies through a combinatorial optimization scheme otherwise. Briefly, reads were first filtered to exclude those with more than three errors in the UMI sequence. For UMI sequences in which every trimer contains at least two identical nucleotides, a majority vote was performed to collapse each trimer into a monomer. If at least one trimer is inconclusive, containing three different nucleotides, we no longer treat each UMI sequence independently when collapsing trimers into monomers. Instead, we select one of the nucleotides in each trimer block so as to achieve maximal consistency between duplicates, that is, to minimize the number of distinct collapsed UMI sequences. We formulate this task as a set cover problem for each gene as follows31. Let $S$ be the set of sequenced homotrimer UMIs of a given gene (in a given cell). For $s \in S$, let $C(s)$ denote the set of collapsed UMIs that can be obtained by combining single nucleotides that occur in each trimer block of $s$. Each such collapsed sequence $c \in C(s)$, for some $s \in S$, can potentially explain multiple homotrimer UMIs $s'$ if $c$ is also contained in $C(s')$. We therefore include one subset $S_c \subseteq S$ for each $c \in \bigcup_{s \in S} C(s)$ that contains all $s \in S$ for which $c \in C(s)$. The collection of sets $S_c$ of smallest cardinality that together include (‘cover’) all sequenced UMIs in $S$ therefore corresponds to the smallest set of collapsed UMIs that explain all $s \in S$. To find this smallest set of collapsed UMIs, we use a greedy algorithm that starts from the empty set and in each iteration adds the subset $S_c$ (that is, collapsed UMI $c$) that explains the largest number of yet unexplained sequenced UMIs.
The solution returned by this algorithm is guaranteed to be within a logarithmic factor of the optimal solution31. In our experiments, the solution of the greedy approach was identical to the optimal solution for more than 90% of the genes. We computed the optimal solution using an integer linear programming approach, in which decision variables model the inclusion or exclusion of sets $S_c$ and linear inequalities enforce each sequenced UMI to be covered by at least one such set, that is, to be explained by at least one collapsed UMI.
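The set cover formulation above can be sketched as follows (a simplified reimplementation for illustration; the repository's greedy.py is the authoritative version):

```python
from itertools import product

def candidate_collapses(trimer_umi):
    """All monomer UMIs obtainable by picking one nucleotide that occurs
    in each trimer block: the set C(s) in the text."""
    blocks = [trimer_umi[i:i + 3] for i in range(0, len(trimer_umi), 3)]
    choices = [sorted(set(b)) for b in blocks]
    return {"".join(p) for p in product(*choices)}

def greedy_set_cover(umis):
    """Greedy approximation to the smallest set of collapsed UMIs that
    together explain every sequenced homotrimer UMI."""
    cover_sets = {}  # collapsed UMI c -> set S_c of sequenced UMIs it explains
    for s in set(umis):
        for c in candidate_collapses(s):
            cover_sets.setdefault(c, set()).add(s)
    uncovered = set(umis)
    chosen = []
    while uncovered:
        # pick the collapsed UMI explaining the most yet-unexplained UMIs
        c = max(cover_sets, key=lambda c: len(cover_sets[c] & uncovered))
        chosen.append(c)
        uncovered -= cover_sets[c]
    return chosen
```

Here the final deduplicated count for a gene is simply `len(greedy_set_cover(umis))`.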

Settings for simulated UMIs

We simulated UMI data of length 30 (ten blocks of nucleotide trimers) to test the accuracy of our UMI correction methodology using the ResimPy tool. We mimicked the PCR amplification and sequencing errors seen with ONT sequencing, as this sequencing methodology suffers from indels and base calling errors more frequently than PacBio or Illumina sequencing. UMIs were generated following an approach first described by UMI-tools11. Briefly, we simulated homotrimer blocks of UMIs at random, with an amplification rate (-ampl_rate) ranging between 0.8 and 1.0, and then simulated PCR cycles so that each UMI was duplicated according to its probability of amplification. PCR errors were then randomly added and assigned new probabilities of amplification. A predefined number of UMIs were randomly sampled to simulate sequencing depth, and sequencing errors were introduced with a specified probability. Finally, errors were detected by assessing the complementarity of homotrimers across the full UMI sequence. If no errors were detected, the homotrimers were collapsed into single nucleotide bases. If errors were identified, collapsing into single nucleotides was performed using the most common nucleotide within the trimer. If a most common nucleotide could not be determined, a single nucleotide was selected at random for collapsing. The following values were used within our simulations: sequencing depth 400; number of UMIs 50 (-umi_num); UMI length 12 (-umi_len); sequencing error rate 3.6 × 10⁻⁶ (-seq_err); PCR error rate 1 × 10⁻¹ to 1 × 10⁻⁷; number of PCR cycles 12 (-pcr_num); permutation tests 50 (-perm_num).
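The collapsing rule used in these simulations (majority vote within each trimer, with a random fallback when all three bases differ) can be sketched as:

```python
import random
from collections import Counter

def collapse_majority(trimer_umi, rng=random):
    """Collapse each trimer block by majority vote; if all three bases
    differ, pick one at random, as in the simulation settings."""
    monomer = []
    for i in range(0, len(trimer_umi), 3):
        block = trimer_umi[i:i + 3]
        base, count = Counter(block).most_common(1)[0]
        monomer.append(base if count >= 2 else rng.choice(block))
    return "".join(monomer)
```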

ResimPy: simulating PCR artifacts in UMI-attached reads

We developed ResimPy for simulating UMI-attached reads. The total number of reads $m^{(i+1)}$ at PCR cycle $i+1$ comes from two sources: reads that are PCR amplified and those that are not. This can be described by a Galton–Watson branching process32,33,34 as follows

$$m^{(i+1)} = m^{(i)} + n^{(i)}$$

(1)

Here, $n^{(i)}$ is the number of reads to be amplified, determined by an amplification rate $\alpha$. According to Chen et al.32, $n^{(i)}$ follows a binomial distribution $\mathrm{Binom}(m^{(i)}, \alpha)$. The $n^{(i)}$ reads to be amplified are randomly selected from the set $\{1, 2, \ldots, m^{(i)}\}$ without replacement, ensuring that each read has an equal chance of being amplified. The same process is applied for every two adjacent PCR cycles.
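A minimal sketch of one such amplification cycle (illustrative only; ResimPy's actual implementation may differ in detail):

```python
import random

def pcr_cycle(reads, ampl_rate, rng):
    """One Galton-Watson PCR cycle: n(i) ~ Binom(m(i), ampl_rate) reads are
    chosen uniformly without replacement and duplicated, giving
    m(i+1) = m(i) + n(i)."""
    n = sum(rng.random() < ampl_rate for _ in reads)  # binomial draw
    amplified = rng.sample(range(len(reads)), n)      # without replacement
    return reads + [reads[i] for i in amplified]

rng = random.Random(1)
pool = [f"read{i}" for i in range(100)]
for _ in range(12):  # 12 PCR cycles
    pool = pcr_cycle(pool, 0.85, rng)
```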

To simulate PCR errors, we implemented another Galton–Watson branching process. The total number of base errors $u^{(i+1)}$ at PCR cycle $i+1$ is modeled by:

$$u^{(i+1)} = u^{(i)} + v^{(i)}$$

(2)

Here, $v^{(i)}$ represents the number of base errors synthesized at PCR cycle $i+1$. Following Rabadan et al.35, $v^{(i)}$ is generated using a negative binomial distribution $\mathrm{NBinom}(r, q)$, where $q$ represents the probability of a base being successfully synthesized, derived by subtracting the base error probability $P_\mathrm{e}$ from 1 (that is, $q = 1 - P_\mathrm{e}$). The variable $r$ is given by the number of successfully synthesized bases, calculated as $q \times t^{(i)}$, where $t^{(i)}$ represents the total number of bases to be synthesized at PCR cycle $i+1$. The positions of these $v^{(i)}$ PCR errors were randomly chosen from the set $\{1, 2, \ldots, t^{(i)}\}$. Finally, the base at each error position was substituted by one of the remaining three bases, drawn from a discrete uniform distribution $U(1, 3)$, where 1 and 3 index the first and third remaining bases, so that each has an equal chance of substitution. We use the same method to simulate sequencing errors. While simulation data provide some evidence for UMI deduplication performance, it is important to note that simulations can be biased. Therefore, we complement our simulations with experimentally derived data using our CMI approach described below.
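The error-injection step can be sketched as follows (illustrative only; the negative binomial draw is implemented here as a naive Bernoulli loop, and equal-length sequences are assumed):

```python
import random

def sample_nbinom(r, q, rng):
    """Number of failures before the r-th success, success probability q:
    a naive draw from NBinom(r, q)."""
    failures, successes = 0, 0
    while successes < int(r):
        if rng.random() < q:
            successes += 1
        else:
            failures += 1
    return failures

def add_pcr_errors(seqs, p_err, rng):
    """Inject v ~ NBinom(r, q) base errors across all bases synthesized
    this cycle, with q = 1 - p_err and r = q * total bases."""
    bases = "ACGT"
    flat = [list(s) for s in seqs]
    total = sum(len(s) for s in seqs)
    q = 1.0 - p_err
    v = sample_nbinom(q * total, q, rng)
    positions = rng.sample(range(total), min(v, total))
    for pos in positions:
        i, j = divmod(pos, len(flat[0]))  # assumes equal-length sequences
        wrong = [b for b in bases if b != flat[i][j]]
        flat[i][j] = rng.choice(wrong)    # uniform over the 3 other bases
    return ["".join(s) for s in flat]
```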

CMI and error evaluation in bulk sequencing

To measure the error rate and evaluate the accuracy of our UMIs following library preparation and sequencing, we synthesized a common sequence (GGGAAACCCTTTGGGCCCTTTAAACCCTTT) in place of the UMI in our polyA capture oligonucleotide. Following sequencing, the CMI sequence was identified upstream of the SMART primer by pattern matching for GTACTCTGCGTTGATACCACTGCTT. The accuracy of the CMI was then determined by comparing the expected synthesized sequence to the extracted CMI sequence. The percentage of CMIs showing full complementarity with the expected sequence was counted and the number of errors was determined for the inaccurate CMIs.
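Scoring extracted CMIs against the known synthesized sequence can be sketched as follows (position-wise mismatch counting is an assumption here; the actual pipeline may score indels differently):

```python
EXPECTED_CMI = "GGGAAACCCTTTGGGCCCTTTAAACCCTTT"

def cmi_error_count(observed, expected=EXPECTED_CMI):
    """Number of mismatched positions between an extracted CMI and the
    synthesized sequence; length differences also count as errors."""
    mismatches = sum(a != b for a, b in zip(observed, expected))
    return mismatches + abs(len(observed) - len(expected))

def perfect_fraction(cmis):
    """Fraction of extracted CMIs that fully match the expected sequence."""
    return sum(cmi_error_count(c) == 0 for c in cmis) / len(cmis)
```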

Comparison between UMI-tools and homotrimer CMI deduplication methods

After mapping the reads to the reference genome at the gene level, we processed the data using two different strategies: UMI-tools and homotrimer deduplication. For homotrimer deduplication, we used the full length of the CMI sequence, while for UMI-tools we collapsed the CMI into a monomer by selecting the first base of each trimer block. The inclusion of the CMI sequence in our reads provides an experimental ground truth with which to evaluate the accuracy of each deduplication strategy. To assess the accuracy of the final deduplicated counts, we compared them to the expected ground truth CMI gene count of 1.

10X Chromium library preparation

We prepared a single-cell suspension of JJN3 and 5TGM1 cells using the standard 10X Genomics Chromium protocol as per the manufacturer’s instructions. Briefly, cells were filtered into a single-cell suspension using a 40 μm Flowmi cell strainer before being counted. We performed 10X Chromium library preparation following the manufacturer’s protocol. Briefly, we loaded 3,300 JJN3:5TGM1 cells at a 50/50 split into a single channel of the 10X Chromium instrument. Cells were barcoded and reverse transcribed into cDNA using the Chromium Single Cell 3′ library and gel bead kit v.3.1. We performed ten cycles of PCR amplification before cleaning up the library using 0.6× SPRIselect beads. The library was split and a further 20 or 25 PCR cycles were performed using a biotin oligonucleotide (5-PCBio-CTACACGACGCTCTTCCGATCT), and then cDNA was enriched using Dynabeads MyOne streptavidin T1 magnetic beads (Invitrogen). The beads were washed in 2× binding buffer (10 mM Tris-HCl pH 7.5, 1 mM EDTA and 2 M NaCl), then samples were added to an equal volume of 2× binding buffer and incubated at room temperature for 10 min. Beads were placed in a magnetic rack and then washed twice in 1× binding buffer. The beads were resuspended in H2O, incubated at room temperature and subjected to long-wave ultraviolet light (~366 nm) for 10 min. The magnetic beads were removed, and the library was quantified using the Qubit High Sensitivity kit. Libraries were then prepared for sequencing.

Drop-seq library preparation

Single-cell capture and reverse transcription were performed as previously described20. Briefly, JJN3 and 5TGM1 cells (20:80 ratio) were filtered into a single-cell suspension using a 40 μm Flowmi cell strainer before being counted. Cells were loaded into the DolomiteBio Nadia Innovate system at a concentration of 310 cells per μl. Custom synthesized beads were loaded into the microfluidic cartridge at a concentration of 620,000 beads per ml. Cell capture was then performed using the standard Nadia Innovate protocol according to the manufacturer’s instructions. The droplet emulsion was then incubated for 10 min before being disrupted with 1H,1H,2H,2H-perfluoro-1-octanol (Sigma), and beads were released into aqueous solution. After several washes, the beads were subjected to reverse transcription. Before PCR amplification, beads were treated with ExoI exonuclease for 45 min. PCR amplification was then performed using the SMART PCR primer (AAGCAGTGGTATCAACGCAGAGT) and cDNA was subsequently purified using AMPure beads (Beckman Coulter). The library was split and a further 20 or 25 PCR cycles20 were performed using a biotin oligonucleotide (5-PCBio-TACACGACGCTCTTCCGATCT), and then cDNA was enriched using Dynabeads MyOne streptavidin T1 magnetic beads (Invitrogen). The beads were washed in 2× binding buffer (10 mM Tris-HCl pH 7.5, 1 mM EDTA and 2 M NaCl), then samples were added to an equal volume of 2× binding buffer and incubated at room temperature for 10 min. Beads were placed in a magnetic rack and then washed twice with 1× binding buffer. The beads were resuspended in H2O, incubated at room temperature and subjected to long-wave ultraviolet light (~366 nm) for 10 min. The magnetic beads were removed, and the library was quantified using the Qubit High Sensitivity kit. Libraries were then prepared for sequencing.

Bulk and single-cell library preparation and ONT sequencing

A total of 500 ng of single-cell PCR input was used as a template for ONT library preparation. Library preparation was performed using the SQK-LSK114 (kit V14) ligation sequencing kit, following the manufacturer’s protocol. Samples were then sequenced on either a Flongle device or a PromethION device using R10.4 (FLO-PRO114M) flow cells.

10X analysis workflow

We performed base calling on the raw fast5 data to generate fastq files using Guppy (v.6.4.8) (guppy_basecaller --compress-fastq -c dna_r10.4_e8.1_sup.cfg -x 'cuda:0') in GPU mode from ONT running on an RTX3090 graphics card. To process the 10X Chromium data, we wrote a custom cgatcore pipeline (https://github.com/cribbslab/TallyTriN/blob/main/tallytrin/pipeline_10x.py)22. We first determined the orientation of the reads and, if a poly-T sequence was detected, reverse complemented the read. Next, we identified the barcode and UMI based on the pairwise alignment of the sequences AGATCGGAAGAGCGT and AAAAAAAAA and extracted the sequence between these alignments. We retained reads in which this region was at least 28 bp long and isolated the barcode as the first 16 bp and the UMI as the following 12 bp. The barcode and UMI sequence were then appended to the name of the fastq read using an underscore delimiter. Next, to remove barcode errors, we parsed the barcodes from each read in the fastq file and then selected the most common barcode sequences using the number of expected cells in our library as the threshold. For every read in the fastq file, we then identified the closest barcode match, allowing for two mismatches. Mapping was performed using minimap2 (v.2.25)36 with the following settings: -ax splice -uf --MD --sam-hit-only --junc-bed, using the reference transcriptomes for human hg38 and mouse mm10. The resulting .bam file was sorted and indexed before the transcript name was added to the XT tag within the .bam file. Counting was then performed using UMI-tools count --method=directional before being converted to a market matrix format. Raw transcript expression matrices generated by UMI-tools count were processed using R/Bioconductor (v.4.3.0); the raw market matrix files were imported into R using bustools (v.0.42.0) and the Seurat37,38 package (v.4.3.0). Transcript matrices were cell-level scaled and log-transformed.
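The whitelist construction and two-mismatch barcode assignment described above can be sketched as follows (simplified; shown with toy 4 bp barcodes rather than the real 16 bp ones):

```python
from collections import Counter

def build_whitelist(barcodes, n_cells):
    """Take the n_cells most frequent barcodes as the whitelist, using the
    expected cell number as the threshold."""
    return [bc for bc, _ in Counter(barcodes).most_common(n_cells)]

def assign_barcode(bc, whitelist, max_mm=2):
    """Return the closest whitelist barcode within max_mm mismatches
    (Hamming distance), or None if no barcode is close enough."""
    best, best_d = None, max_mm + 1
    for w in whitelist:
        d = sum(a != b for a, b in zip(bc, w))
        if d < best_d:
            best, best_d = w, d
    return best
```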
The top 2,000 highly variable genes were then selected based on a variance stabilizing transformation and used for principal component analysis. Clustering was performed within Seurat using the Louvain algorithm. To visualize the single-cell data, we projected the data using uniform manifold approximation and projection (UMAP)39.

Drop-seq analysis workflow

We performed base calling on the raw fast5 data to generate fastq files using Guppy (v.6.4.8) (guppy_basecaller --compress-fastq -c dna_r10.4_e8.1_sup.cfg -x 'cuda:0') in GPU mode from ONT running on an RTX3090 graphics card. To process the drop-seq data, we wrote a custom cgatcore pipeline (https://github.com/cribbslab/TallyTriN)22. We followed the workflow previously described for identifying barcodes and UMIs using scCOLOR-seq sequencing analysis13. Briefly, to determine the orientation of our reads, we first searched for the presence of a polyA or poly-T sequence. In cases where a poly-T was identified, we reverse complemented the read. We next identified the barcode sequence by searching for the polyA region and the flanking regions before and after the barcode. The trimer UMI was identified based on the primer sequence GTACTCTGCGTT at the TSO-distal end of the read, allowing for two mismatches. Barcodes and UMIs with a length of less than 48 base pairs were filtered out. To conduct monomer-based analyses, a random base was selected from each homotrimer in the UMI or CMI and collapsed into a monomer. Homotrimer UMI correction was performed following mapping using minimap2 (v.2.25)36. Mapping settings were as follows: -ax splice -uf --MD --sam-hit-only --junc-bed, using the reference transcriptomes for human hg38 and mouse mm10. The resulting .sam file was sorted and indexed using samtools25. For monomer UMIs, counting was performed using UMI-tools before being converted to a market matrix format. For homotrimer UMI correction, counting was performed using the script greedy.py within the TallyTriN repository. Raw transcript expression matrices generated by UMI-tools count and greedy.py were processed using R and Bioconductor (v.4.3.0), and custom scripts were used to generate barnyard plots showing the proportions of mouse and human cells. Transcript matrices were cell-level scaled and center log ratio transformed.
The top 3,000 highly variable genes were then selected based on a variance stabilizing transformation and used for principal component analysis. Clustering was performed within Seurat using the Louvain algorithm. To visualize the single-cell data, we projected the data onto a UMAP embedding39.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.