Targeted phasing of 2–200 kilobase DNA fragments with a short-read sequencer and a single-tube linked-read library method – Scientific Reports

Adapting whole-genome sequencing (WGS) TELL-Seq to targeted applications

We capitalized on our experience in adapting WGS TELL-Seq to the processing of genomes of different sizes to devise a series of protocol modifications that should enable high-quality targeted phasing as summarized in Fig. 1A,B. We anticipated that the key elements of an optimized TELL-Seq protocol for targeted applications would include: (i) using low amounts of input material and TELL microbeads (e.g., 100 pg or less and approximately 2 million microbeads or 6 μL in a 25 μL barcoding reaction, respectively); (ii) subsampling the amount of TELL microbeads before library amplification (e.g., using only 0.75 μL out of 6 μL for PCR); and, (iii) increasing the number of PCR cycles during library amplification (e.g., twenty-one). These adjustments aim to balance sufficient linked-read efficiency and library yield with a low risk for DNA collisions and sequencing costs. The specific volumes, amounts, and number of PCR cycles will depend on the level of target enrichment, the amount of off-target background, and the target size, as examined in this study.

Figure 1
figure 1

WGS and Targeted TELL-Seq protocols. (A) The TELL-Seq protocol was developed to assemble genomes and metagenomes, and for phasing. Although there is only one protocol, the specific amounts of starting material and TELL microbeads, as well as some aspects in the barcoding and library amplification reactions must be adjusted according to WGS applications (i.e., genomes size and sample complexity). For example, the recommended volume for the barcoding reaction is 150 μL when processing a human genome and 50 μL when processing a bacterial isolate. In this study, we propose conditions to adapt the WGS TELL-Seq protocol to the particularities of targeted applications: 1–10 (use workflow in B as reference). Based on target purity (i.e., the amount of off-target DNA or background), we anticipate two major classes of targets: largely impure targets, representing less than 1–5% of the total amount of DNA in the sample; and largely pure targets, representing 25–100% of the total amount of DNA in the sample (see Targeted Applications columns). Intermediate situations would be adjusted on a case-by-case basis, using as reference the examples provided in this study. (B) Briefly, ultra-low amounts of high-molecular weight (HMW) genomic DNA are mixed with transpososome and barcoded TELL microbeads. DNA (target for targeted applications) is then captured by transpososome and tagged. Transpososomes allow DNA recruitment on microbeads based on universal sequence homology. There are millions of microbeads in a TELL-Seq reaction, and each microbead provides a unique barcode to tagged DNA. A second transpososome complex allows tagging an intermediate position in the barcoded fragment and breaking and washing release transposase components. Barcoded DNA is finally released and amplified to build a library for sequencing.

Phasing long, partially enriched targets

In general, we anticipate two major types of targets based on target size and purity: relatively large, partially enriched targets (between 15 and 200 kb with up to 95% off-target background); and relatively small, highly enriched targets (between 2 and 15 kb with less than 1–2% off-target background). To test the optimized TELL-Seq protocol with the first type of targets, we focused on seven genomic regions associated with hereditary predisposition to cancer: BRCA1 (target length: 198.4 kb), BRCA2 (target length: 187.5 kb), APC (target length: 199.6 kb), MSH2 (target length: 201.1 kb), MSH6 (target length: 197.5 kb), MLH1 (target length: 201.7 kb), and PMS2 (target length: 197.5 kb).

To isolate these loci, all within the 185–205 kb range, we used the HLS-CATCH method, which can isolate targets of up to 1 Mb in size27. In the pipetting-free HLS-CATCH system, targets are excised from lysed cells using guide (g)RNA-directed Cas9 ribonucleoprotein (RNP) complexes, and the excised fragments are isolated using preparative pulse-field electrophoresis (Fig. 2A and Supplementary Fig. S1A)27,28,29,30. CATCH-isolated targets are highly enriched relative to off-target DNA (typically, 150-fold and more), but still they can represent a relatively small fraction of the total extracted DNA (e.g., 0.5–5%). We postulated that this high level of off-target background should be almost as permissive to DNA collisions as those allowed when processing a small bacterial genome (Fig. 1, table, targeted application: < 1–5% target DNA over total DNA).

Figure 2
figure 2

Phasing 185–205 kb HLS-CATCH-enriched targets. (A) Experimental workflow. See also Supplementary Fig. S1A. (B) A representative example of linked reads obtained with the targeted TELL-Seq protocol. Shown a 186 kb region in the BRCA2 target. Short reads (boxes) with identical barcode are connected by lines (i.e., linked reads). (C) Read profiles (coverage indicated) from TELL-Seq libraries showing robust on-target recovery compared to background. Targets: 200 kb APC and MLH1. Arrowheads represent locations of a trio of gRNA binding sites on each target end. Black bars represent the APC and MLH1 genes. (D) Joint screenshots from the Integrative Genomics Viewer (IGV) Portal showing TELL-Seq results for the APC target. Tracks (from top to bottom): GIAB phased haplotypes in HG002 (benchmark); phased haplotypes (numbers indicate distances between some phased sites; coverage; inferred haplotype 1; inferred haplotype 2; unphased reads; gene annotations; and inferred phase block. (E, G) Read behaviors across representative regions in the MLH1 (E) and PMS2 (G) targets showing correct genotyping but incomplete phasing (squared cells) and heterozygosity loss (incomplete genotyping and phasing) affecting Hap2 (squared cells), respectively. Left columns represent GIAB data (NA24385), right columns represent TELL-Seq data. Counts per positions are indicated by nucleotide G, A, T, C columns). Variants calls are also indicated (Variant columns). (F) Distances between the annotated neighbor heterozygous sites. Green columns represent phased sites. Black columns represent unphased sites. (H) Average read counts (allele burden) by target relative to the total read count. Sites selected based on GIAB annotated haplotypes.

We designed three gRNAs complementary to each end of every target to maximize the options of excision, minimizing also the risk of haplotype dropout due to target-specific allelic differences at the gRNA binding site (Fig. 2A). We assembled Cas9 RNP complexes in three pools: a pool with six BRCA1-targeting gRNAs, another pool with six BRCA2-targeting gRNAs, and a third pool with thirty gRNAs targeting the APC, MSH2, MSH6, MLH1, and PMS2 loci. As source of genomic DNA, we used GM24149/HG003 and GM24385/HG002 B lymphocytes from Genome in a Bottle (GIAB), two well-characterized reference materials. We used HG003 DNA (NA24149) to isolate the BRCA1 and BRCA2 targets and HG002 DNA (NA24385) to isolate the other five targets.

It is important to highlight that HG002 has two sequence singularities relevant to our phasing purposes. The first singularity is a low heterozygosity density across the MLH1 locus with a median distance between adjacent heterozygous sites of more than 13 kb, which is more than nineteen times the median distance between heterozygous sites in the other six targets (Supplementary Fig. S1B). The second singularity is the presence of a large polymorphic inversion relocating the three 3′ gRNA binding sites more than 700 bp downstream from the expected positions in the PMS2 locus31,32. We confirmed this inversion using GIAB-generated WGS linked-read data (Supplementary Fig. S1C). The mixture of typical and singular targets should help us better understand the process of targeted phasing with TELL-Seq.

After recovery from the HLS-CATCH system, we mixed 100 pg of each of the BRCA1 and BRCA2 targets to generate a single TELL-Seq library and used 270 pg of the APC, MSH2, MSH6, MLH1, and PMS2 co-enriched mixture to generate a second TELL-Seq library. After sequencing and duplicate removal, we obtained 10.4 million and 23.1 million reads, respectively. The mapping profiles of linked-read data were consistent with those found in conventional WGS TELL-Seq (Fig. 2B). On-target read recovery was 3.7% and 5.4% with an average mean coverage of 85.4 × and 108.5 ×, respectively. Figure 2C shows representative read density profiles for two targets.

As expected, the lowest recovery was observed for the targets with genomic singularities in HG002, 0.9% on-target recovery and 72.9 × average mean coverage for MLH1 and 0.4% on-target recovery and 36.1 × average mean coverage for PMS2, compared to 1.9% and 106.4 × for BRCA1, 1.8% and 108.5 × for BRCA2, 1.3% and 106.5 × for APC, 1.4% and 112.9 × for MSH2, and 1.2% and 98.3 × for MSH6. We correctly phased 808 out of the 809 GIAB-annotated heterozygous sites in the five loci without singularities, which represents a 99.88% phasing accuracy: 308 out of 308 for BRCA1; 71 out of 71 for BRCA2; 110 out of 110 for APC; 195 out of 195 for MSH2; and 124 out of 125 for MSH6. The only phasing inconsistency with GIAB data was a flip error along the MSH6 locus—a swap of maternal and paternal alleles. Importantly, we were also able to infer a single block of contiguous phased positions in all targets (for example, Fig. 2D, see a single blue line at the bottom; the apparent break results from joining two screen snapshots). Furthermore, we were able to de novo phase twenty-four GIAB-annotated-but-unphased heterozygous sites in our tests: 8 sites in APC, 13 sites in MSH2, and 3 sites in MSH6. However, we missed four GIAB-annotated heterozygous sites due to an insufficient read recovery underlying these sites in the APC and MSH6 targets, which could be considered as failed genotyping.

For the singular MLH1 region in HG002, genotyping was 100% accurate (16 out of 16 correctly recalled heterozygous sites), but phasing was largely incomplete with only 9 correctly phased positions out of 16 (56%, Fig. 2E, Hap1 and Hap2 columns). Moreover, we could not infer a continuous phase block for this target. We attributed this low performance to an insufficiency of linked reads among heterozygous sites, which were more than 10 kb apart (Fig. 2F), combined with a relatively low read coverage compared to the other targets.

For PMS2, the second target with a singularity in HG002, all GIAB-annotated heterozygous sites were incorrectly recalled as homozygous in our data, which is consistent with a full haplotype dropout generated by the large inversion (Fig. 2G). In agreement, we observed a high coverage imbalance between the two sets of GIAB-annotated alleles with an average of 91.98% and 8.02% relative abundance, compared to 48.52% and 51.48% in HG002 GIAB WGS data (Fig. 2H). Interestingly, the alleles of the ‘lost’ haplotype were still detectable to some degree (8.02%), suggesting that there is some level of inversion mosaicism affecting primarily one of the two haplotypes (Fig. 2G, compare A, C, G, and T counts), which is also supported by a multiplicity of breakpoints inferred from WGS linked-read data (Supplementary Fig. S1C, large heatmap).

Together, these analyses show that the targeted TELL-Seq protocol is an efficient method for phasing long, partially enriched targets. However, for targets with an unusually low density of heterozygous sites, we have observed an expected low linked-read rate, which is detrimental of the phasing process. Likewise, for targets with large inversions, our analyses highlight the risk of using size-based selection as DNA pre-enrichment method. We note that some frequency of large inversion polymorphisms (e.g., 700 kb and more) have been reported in the human population31.

Phasing mid-size, partially enriched targets

Next, we tested the targeted TELL-Seq protocol with mid-size, partially enriched targets. We used a modified version of the CaBagE protocol (Cas9-based Background Elimination)33,34, which leverages the stability of the Cas9/gRNA RNP complex bound to each end of a target to protect it from exonuclease digestion. Our modification consists of the use of a catalytically deactivated Cas9 form (dCas9) that enables dCas9 binding within the target—referred to as dCaBagE (Supplementary Fig. S2A). In the reaction, dCas9 binds to the target DNA, thereby shielding specific genomic segments from degradation by exonucleases (Supplementary Fig. S2A,B). This strategy can be applied to protect mid-size target segments in DNA extracted using pipetting-based procedures, in which high-molecular weight (HMW) genomic DNA will be invariably fragmented even for HMW DNA methods, generating fragments not longer than a few kb. As target, we selected a 80 kb region in the MSH2 locus, and designed gRNAs to guide dCas9 to the flanks of this region and along internal intervals of 20 kb, which could protect subfragments hereafter referred to as T1-T4 (Fig. 3A and Supplementary Fig. S2B)35. Such gRNA strategy based on five RNP binding sites and complexes (#1–#5) should enable the recovery of 20 kb, 40 kb, 60 kb, and 80 kb fragments if fragments of all these sizes were preserved during genomic DNA extraction36. The most frequent expected target size would be 20–40 kb when working with a genomic DNA prep showing average fragment sizes in the 30–50 kb range (as in our case).

Figure 3
figure 3

TELL-Seq-mediated phasing of four adjacent 20 kb targets enriched with dCaBagE, a modified CaBagE method that leverages dCas9 to protect end and internal sites of the target. (A) Experimental workflow. The entire (80 kb) or sub-targets (20, 40, or 60 kb) can be individually phased based on the quality of the input genomic DNA and the presence of free DNA ends, which serve as sites for exonuclease initiation of degradation. The protected DNA (target) is represented in dark red; unprotected DNA (non-target) is represented in grey. (B) Coverage across the MSH2 locus (black bar indicates the position of the MSH2 gene). This profile supports high on-target recovery over background for all sub-targets (T1–T4) but higher for T3–T4 than for T1–T2. Arrowheads indicate gRNA binding sites (#1–#5). (C) Screenshot from the IGV Portal showing phased data and coverage using targeted TELL-Seq with dCas9/exonuclease or HLS-CATCH/Cas9 as DNA pre-enrichment methods. Shown on top HG002 WGS linked-read phased data (benchmark). Phased data includes distances between phased sites (we note that distances are not shown when not allowed at the selected resolution on IGV). Bottom tracks show phased data and inferred phase blocks with targeted TELL-Seq using dCaBagE. Arrowheads indicate gRNA binding sites.

After RNP assembly, we incubated the pool of five RNP complexes with 1 μg of HG002 HMW genomic DNA followed by exonuclease digestion. We then applied the targeted TELL-Seq protocol used for HLS-CATCH-enriched targets but with approximately half of the amount of starting DNA material. After sequencing and duplicate removal, we obtained 6.3 million reads with 0.8% on-target recovery and 68.7 × average coverage depth.

Overall, T3 and T4 were more efficiently recovered than T1 and T2 (0.3%, 0.3%, 0.1%, and 0.1% on-target recovery and 104.5 ×, 85.6 ×, 40.5 ×, and 44.9 × average mean coverage, respectively), suggesting that RNPs #3–5 provided better protection than RNPs #1–2 (Fig. 3B). In addition, T3 and T4 showed a 100% recall and phasing accuracies: 27 out of 27 and 32 out of 32, respectively (Fig. 3C, T4). Also, T3 and T4 could be both inferred as a single block, suggesting that ~ 40 kb and larger fragments were rare in the sample of genomic DNA (Fig. 3C, bottom). For the underperforming T2, three GIAB-annotated heterozygous sites (out of 33) were missed by our method (Supplementary Fig. S2C). The rest of T2 heterozygous sites (30) were recalled and correctly phased except for one site that remained unphased (Supplementary Fig. S2D), thus representing a ~ 96.7% phasing accuracy of recalled sites. For the underperforming T1, phasing was more problematic. Despite that all heterozygous sites (9 out of 9) were recalled, two sites could not be phased, leading to a ~ 77.8% phasing accuracy (Supplementary Fig. S2E,F). Additionally, we inferred a discontinuous phase block for T1 (Supplementary Fig. S2E,F, bottom track). As in the case of the MLH1 locus in HG002, we attributed the problems with T1 to a distinctively low heterozygosity (9 heterozygous sites in T1 compared to 27, 32, and 30 in T2–T4). Nonetheless, our analyses reveal no flip or switch phasing errors. Overall, these results demonstrate that TELL-Seq is also capable of phasing partially enriched, mid-size targets, although a target with distinctively low heterozygosity might be challenging.

Minimizing DNA collisions with highly pure targets

Next, we sought to validate the targeted TELL-Seq protocol for relatively short but highly pure targets (e.g., PCR amplicons). Phasing amplicons will require further optimization of the protocol. For instance, whole-genome TELL-Seq tolerates an average of 6–8 collisions on the same microbead due to the extremely low possibility that maternal and paternal copies of the same locus collude on the same microbead24. In contrast, an average of 6–8 collisions on the same microbead would be detrimental when processing a single amplicon due to the almost certain possibility that maternal and paternal copies of the amplified locus will collude on the same microbead. Additionally, amplicons will require a much lower barcode diversity for phasing than whole genomes or partially enriched targets.

To determine conditions less likely to generate collisions, we generated five TELL-Seq libraries using 20, 50, 100, 200, and 400 pg of the BstP I-digested lambda phage genome. BstP I digestion splits the 48.5 kb lambda phage genome into fourteen non-overlapping fragments, which mimics a pool of 0.1–8.5 kb targets with identical number of molecules in every case (117, 224, 702, 1264, 1371, 1929, 2323, 3675, 4324, 4822, 5687, 6369, 7242, and 8453 bp). This diversity should allow us to quantify collisions with the only caveat that coincidence of two molecules of the same fragment will be missed. After sequencing the libraries, we recovered all but the 117 bp fragment (Supplementary Fig. S3A). On-target recovery correlated with fragment size (R2 = 0.8111–0.8461; Fig. 4A), although the fragment ends of every target showed distinctively lower coverage (Supplementary Fig. S3B). The highest fraction of collision-free barcoding events corresponded to the 20 pg input (84.8 ± 6.5%) and the lowest fraction corresponded to the 400 pg input (38.7 ± 4.8%), gradually decreasing with increasing DNA amounts: 50 pg, 70.7 ± 4.9; 100 pg, 61.0 ± 4.7; and 200 pg, 46.5 ± 5.1 (Fig. 4B). Within the fraction of estimated collision-free barcoding events, we also calculated linked-read efficiencies (i.e., linked reads over total short reads). Linked-read efficiency was the highest with the 20 pg input (40.9 ± 10.9%) and the lowest with the 400 pg input (29.4 ± 11.3%), but relatively uniform between both amounts: 50 pg, 40.6 ± 11.2%; 100 pg, 37.7 ± 9.6%; and 200 pg, 35.4 ± 11.4% (Fig. 4C). As expected, linked-read efficiencies increased with coverage (Supplementary Fig. S3C), but showed no strong dependency on fragment size when normalized by coverage for 0.7 kb and larger fragments (Supplementary Fig. S3D).

Figure 4
figure 4

Determining collision rates and linked-read efficiencies using BstP I-digested lambda phage genomic DNA. (A) Relative coverage (average across every fragment) relative to fragment size. On top indicated input amounts of digested fragments. Y-axis represents the average coverage relative to the average coverage with the 8.5 kb fragment, X-axis represents fragment size (each datapoint is one fragment). (B) DNA collision rates (%) relative to fragment size by input amounts. (C) Percentage of linked reads relative to the total number of short reads by fragment size and input amounts. Analysis limited to microbeads with only one associated fragment (single-barcode fragments or collision-free cases).

Phasing amplicons

We have estimated that 20 pg target inputs allows ~ 85% collision-free DNA co-barcoding events (Fig. 4B). The caveat is that, in our experience, inputs of 20 pg and lower can generate inconsistent library yields. We, therefore, reasoned that consistency in library yields could be achieved by adding 60–80 pg of non-human fragmented DNA, hereafter referred to as ‘filling’ DNA (e.g., Escherichia coli or BstP I-digested lambda phage genomic DNA). To test whether 20 pg inputs supplemented with 60–80 pg of filling DNA are suitable conditions for efficient phasing, we aimed to generate a series of long PCR products to use as input material.

Long-range PCR (3–30 kb) can be notoriously challenging, especially when amplicons are used for the purpose of phasing. First, long-range PCR often requires substantial optimization37,38. Second, PCR conditions must be established that reduce the risk of haplotype dropouts and chimeras. Haplotype dropouts can be generated when either the maternal or paternal copy of the target is inefficiently amplified39. Chimeras are crossovers of maternal and paternal haplotypes and can occur when an incomplete PCR product acts as a primer on the wrong amplicon copy during PCR amplification, leading to a switch error (or an artificial haplotype)40. In part, the risk of haplotype dropouts and chimeras can be reduced by using long extension times and avoiding over-amplification. In this context, it is advantageous that TELL-Seq requires only picogram amounts of input DNA, which allows minimizing the number of PCR cycles.

With these precautions, we amplified a 13 kb region in the SCN5ASCN10A locus containing seven polymorphic sites near gene-regulatory elements associated with a high risk for sudden cardiac arrest41,42,43. These seven sites are spaced 288, 4299, 23, 69, 5560, and 2505 bp apart (from 5′ to 3′)43. As a second target, we selected the 3.4 kb region spanning the exons of the PIK3CA gene. PIK3CA is the second most frequently mutated gene across all cancer types, and it is mutated at least twice (i.e., double mutations) in 8–13% of clinical cases44,45. Phasing double PIK3CA mutations can help to predict oncogenicity and sensitivity to treatment, such as when the E545K and L866F mutations coincide in the same chromosomal copy (in cis)45. These mutations can also be in a different chromosomal copy (in trans) relative to a nearby allelic variant, I391M (rs2230461)45. The distances between these three sites are 459 and 964 bp in cDNA.

For our phasing analyses, we amplified the 13 kb SCN10A and 3.4 kb PIK3CA fragments as well as shorter products within the same regions to examine the reproducibility of the phasing data: 3.9, 4.1, and 2.7 kb for SCN10A (from 5′ to 3′); and 1.8 kb for PIK3CA (Fig. 5A). For SCN10A targets, we used the GIAB reference HG001 as a source of genomic DNA (NA12878). Analysis of WGS linked-read data24,32 reveals that HG001 is a carrier of so-called Hap1 and Hap3 haplotypes in the SCN5ASCN10A locus (TACCATT and CGGGGGC)43. Further analysis of the same data reveals that HG001 carries another subset of fifteen heterozygous sites, some of which have also been recently associated with heart arrhythmia41. The twenty-two heterozygous sites are split by major and minor allele frequencies, which builds upon the previously proposed model that the 13 kb Hap1/Hap3 genotype represents fully separated all-major/all-minor alleles43. On the other hand, to generate PKI3CA amplicons, we used cDNA generated from human breast cancer HCC202 cells. HCC202 cells carry the GAG and AGC haplotypes underlying the three heterozygous sites under investigation (E545K, L866F, and I391M)45. All amplicons appeared as weak, mainly single PCR products on E-gels (before cleanup), suggesting no over-amplification (Fig. 5B), and were generated without major apparent haplotype biases (Supplementary Fig. S4).

Figure 5
figure 5

Amplicons used in this study. (A) SCN10A and PIK3CA amplicon sets phased in this study (13, 3.9, 4.1, and 2.7 kb from genomic DNA and 3.4 and 1.8 kb from cDNA, respectively). HG001 carries previously annotated Hap1 and Hap3 haplotypes. Red bars indicate the position of 7 clinically relevant heterozygous sites; orange bars indicate the position of 15 additional heterozygous sites. Alleles are colored if represent the alternate allele, i.e., different from the human reference. HCC202 cells carry two clinically relevant alleles (in red) and one non-clinically relevant variant (in orange). (B) 2% E-gels and 2% agarose (for 13 kb fragment) with PCR products as indicated. Non-specific DNA and primers were removed as indicated (e.g., compare DNA traces before and after MagBio cleanup, dashed squares; see Methods for technical details). We selected minimally amplified PCR products (see 3.4 kb PIK3CA 1:20 diluted cDNA as an example). DNA concentrations are indicated at the bottom.

For a comprehensive analysis, we generated forty TELL-Seq libraries to assess different amplicon amounts, target sizes, and cDNA priming methods, as well as to acquire replicates (between 2 and 6). Six of these libraries were generated with a pool of amplicons (3.4 kb PIK3CA and 2.7, 3.9, and 4.1 kb SCN10A). The total input amounts varied from 5 to 20 pg with one case using as low as 0.4 pg (Fig. 6A, target amount), and were supplemented with filling DNA when processing single targets (either E. coli genomic DNA or BstP I-digested lambda phage genomic DNA), or without filling DNA when processing a pool. For pools, we used equal amounts for each target (20 pg). The rationale for using equal amounts for each target in a pool (rather than equal number of molecules) is that the expected lower coverage for small fragments should be compensated by the higher molecule number compared to large fragments, as suggested by our analyses with the digested lambda phage genome (Fig. 4). For cDNA-based products, we tested reverse transcriptase primed by random hexamers and oligo-dT. Finally, we mapped SCN10A reads to genomic DNA and PIK3CA reads to the target PIK3CA cDNA with the allele frequency (AF) cutoff set to 0.1 (see Methods). An integrated analysis of all these experiments revealed that the average on-target coverage ranged between 58.8 × and 3574 × with most cases falling between 200 × and 800 × (Fig. 6A, average on-target coverage).

Figure 6
figure 6

Phasing SCN10A and PIK3CA amplicons (sizes: 1.8–13 kb). (A) Summary table of genotyping and phasing results using SCN10A and PIK3CA amplicons as input DNA; PCR from NA12878 genomic DNA (n = 33 tests) or breast cancer HCC202 cDNA (n = 25 tests). In red, highlighted the phasing errors. R6 (random hexamers) and dT (oligo-dT) for cDNA priming, as indicated. (B) Screenshot showing the phasing of three technical replicates on the IGV Portal. Fragments corresponds to 4.1 kb SCN10A amplicon. The top track shows the phasing of one 4.1 kb replicate in phased.pseudo.bam format. Heterozygous sites are separated by numbers that represent distances between neighboring sites. The third track show coverage, inferred haplotype 1 reads, inferred haplotype 2 reads, and unphased reads. The last tracks show gene annotations and an inferred phase block (blue bar).

Out of the 366 annotated heterozygous sites—collecting the numbers from the fifty-eight replicates—we correctly phased 364, while the remaining two sites represented flip errors. Overall, we achieved 100% genotyping and 99.45% phasing accuracies (Fig. 6A,B). Importantly, the two flip errors were not reproduced in replicate samples. Additionally, we genotyped and phased a heterozygous site not previously annotated by GIAB, but that it was consistent across fifteen libraries (Fig. 6A, de novo phasing). Together, these results indicate that targeted TELL-Seq can achieve highly accurate phasing of amplicons when supported by replicates.

Benchmarking TELL-Seq with long-reads

Next, we benchmarked the targeted TELL-Seq protocol with previously generated long-read results of 13 kb SCN10A products amplified from peripheral blood-derived genomic DNA43. From this previous study, we selected three Hap1/Hap3 carriers (referred here to as individuals #1–3)43, and obtained the original amplicons. Our results agreed with the previously reported long-read-based (ONT) genotypes in individuals #1 and #2, recalling and correctly phasing the seven clinically relevant heterozygous positions in each sample (Fig. 7A, Ind.#1 and #2). For individual #3, the same seven sites were also recalled as heterozygous in our data, although the most 5’ site remained as unphased and the most 3’ site was phased inconsistently with prior data (Fig. 7A, Ind.#3, Replicate #1). Since coverage was lower for individual #3 than for individuals #1–2 (54.6 × compared to 89.9 × and 92.0 ×, respectively), we suspected that our call for the most 3’ site in individual #3 was erroneous. In support, we generated a second TELL-Seq library from the same amplicon (Ind. #3, Replicate #2) and reamplified the first TELL-Seq library from the remaining target-bound microbeads (Ind. #3, Replicate #1.2). These two new tests were based on higher coverage (105.1 × and 290.4 ×) and agreed with a Hap1/Hap3 genotype (Fig. 7A, Ind.#3, Replicate #2 and #1.2). We also noticed that linked-read coverage at the two haplotype ends in the first replicate from individual #3 had been the lowest of all tests, again connecting phasing problems to low coverage (Supplementary Fig. S5A, orange columns). For reference, the published ONT data was based on at least 600 × coverage for reads containing all seven heterozygous sites (Supplementary Fig. S5B)43.

Figure 7
figure 7

Targeted TELL-Seq with three 13 kb amplicons generated from peripheral blood-extracted genomic DNA from three known Hap1/Hap3 SCN10A carrier individuals. (A) Previous work identified the three Hap1/Hap3 carrier individuals from whom 13 kb amplicons were generated and now have been processed in this study. We confirmed the Hap1/Hap3 genotype in all three individuals with targeted TELL-Seq. However, this conclusion needed the three replicates for Individual #3. In Replicate #1, we observed an unphased and a flip error. We note that this library provided the lowest coverage, 54.6x (Individual #3). We generated a new library from the same amplicon (Replicate #2) and re-do the library amplification with leftover barcoded microbeads from Replicate #1 (Replicate #1.2). In both cases, we validated the expected Hap1/Hap3 genotype. We note that the sequencing depth was also higher for both replicates, 105.1 × and 290.4 ×, respectively. Coverage indicated (x) on top of each genotype and number of correctly re-called and phased heterozygous positions in Hap1 and Hap3 are also indicated. (B) Graph shows average allele burden for all Hap1/Hap3 sites by library. In all cases, Hap1 was amplified slightly more efficiently than Hap3, although it had no effect on genotyping and phasing.

On a different note, we observed a slight haplotype bias (Hap1 > Hap3) with blood-based targets that we did not observe when amplifying the 13 kb fragment from cell culture-derived HG001 DNA (compare allele burden plots between Supplementary Fig. S4 and Fig. 7B). However, this haplotype bias did not apparently have a negative effect on the process of phasing.

Minimum sequencing depth for accurate phasing

Throughout this study, we have repeatedly observed the importance of sufficient sequencing coverage for the process of phasing. To examine in more detail how sequencing coverage (as a proxy for linked-read capture) affects genotyping and phasing accuracy with the targeted TELL-Seq protocol, we subsampled nine of the sequencing outputs generated from the libraries using amplicons and repeated the phasing analyses. We subsampled the sequencing outputs at 50%, 25%, 12.5%, 6.25%, 3.125%, 1.56%, and 0.78% of the original dataset. After duplicate removal, we segregated the data into four groups based on phasing quality. Group 1 integrates results without genotyping or phasing errors and a single phase block; group 2 integrates results without genotyping errors and a single phase block but with one flip or switch error; group 3 integrates results with incorrect genotyping, discontinuous phase blocks, and flip/switch errors; and group 4 integrates results with mostly unphased heterozygous sites (Supplementary Fig. S6). These data separation revealed that lowering the coverage below 180 × for the 13 kb fragment and 150 × for the 2.7–4.1 kb fragments led to phasing errors, and that further lowering the coverage added genotyping errors (Supplementary Fig. S6, compare group 1 and groups 2–4). Still, we note that results based on 50–60 × coverage often show no phasing errors (see cases in Fig. 6A). These subsampling tests reinforce the importance of sufficient coverage to reduce the risk of phasing errors.