Search
Close this search box.

A universal molecular control for DNA, mRNA and protein expression – Nature Communications

Design of synthetic pREF control

We first designed a synthetic control, that we term pREF, that encodes 21 synthetic genes encompassing a wide range of nucleotide sequences, GC content and other difficult features (Fig. 1a). It has been deposited with Addgene, who are responsible for it’s storage and distribution. pREF includes a full representation of all possible 6-mers (excluding unintended Restriction Enzyme recognition sequences), thereby providing a comprehensive evaluation of sequencing accuracy under different nucleotide contexts (Fig. S1a). pREF also includes eight genes containing difficult sequences, including six gece genes (gc-content), that encompass different percentages of GC content from 21% to 65%, as well as two repe genes (repeat) that include repeats ranging from 6 to 18 nt in length (Fig. 1d). A comparison of these synthetic genes shows they provide a greater representation of nucleotide diversity, repeat sequences, and GC content than the PhiX-174 genome (Fig. S1a-b). Furthermore, the synthetic genes have no homology (greater than 18nt) to natural gene sequences included in the NCBI nr/nt database and can be easily distinguished from natural RNA or DNA samples, allowing the use of pREF as a spike-in control in the study of any organism included in the NCBI nr/nt database10.

Fig. 1: Design of pREF.
figure 1

a Overview of pREF shows the organization of synthetic conu, gece, repe and proco genes that are suitable for DNA and RNA sequencing, and mass spectrometry. b The digestion of pREF (with EcoRI) generates a DNA fragment size ladder (nt = nucleotides). c conu genes are represented at multiple copy-numbers in pREF so that when sequenced, they form a staggered reference ladder able to measure quantitative features of DNA and RNA sequencing libraries (n = 1 biologically independent samples). Box plot extends from 25th to 75th percentiles, centre line is the median, and whiskers cover the 10th and 90th percentiles. d gece and repe genes can act as sequencing controls that measure accuracy at difficult GC-rich or repetitive sequences, respectively (n = 1 biologically independent samples). Box plot extends from 25th to 75th percentiles, centre line is the median, and whiskers cover the 10th and 90th percentiles. e In vitro transcription of conu, gece, proco and repe synthetic genes generate matched mRNA controls for use in RNAseq experiments. f In vitro translation of proco genes generates synthetic protein controls for use in proteomic experiments. Source data are provided in a Source Data File.

Synthetic control genes are flanked by a range of restriction enzyme sites enabling linearization of genes into DNA fragments of known sizes and sequence (Fig. 1b). For example, digestion with EcoRI generates a ladder of DNA fragments that range from 100 – 3500 nt, which are suitable for use as a DNA spike-in. Synthetic control genes are also preceded by T7 and Sp6 promoters that enable in vitro transcription and are followed by a 30nt poly-adenine tract (see Fig. 1e). These transcriptional units can be liberated by cleavage with HindIII, and transcribed using T7 RNA polymerase to yield transcripts of ~1500 nt in length. Alternatively, transcripts can be digested with BamHI and transcribed using Sp6 RNA polymerase to yield transcripts of varying lengths (see Methods).

We also designed a suite of conu genes (copy-number genes) to act as quantitative controls (Fig. 1c). The conu genes are repeated at different copy-numbers (1x, 2x, 3x and 4x) to form “paralogous gene families”. Due to their repeated copy-numbers, the sequenced read count for each conu gene will be proportional to its copy-number, and in fixed and constant ratio to the other conu genes. Therefore, a comparison of observed conu gene count to expected copy-number generates a staggered, graduated reference scale that is able to measure quantitative performance of NGS and MS methods11.

Additionally, we designed three proco genes (protein control) that can provide quantitative reference standards for proteomic analysis. The proco genes each encode an open-reading frame with known amino-acid sequences, as well as a Shine-Dalgarno and 5’ and 3’ untranslated sequences to enable efficient in vitro translation (Fig. 1f). The proco genes can be translated to form protein spike-in controls, and trypsin digestion of proco proteins liberates peptide sequences of known size, charge and retention time for the calibration of mass spectrometric experiments.

The pREF sequence also encodes a pMK-RQ backbone sequence containing an antibiotic resistance gene (Kanamycin) and an origin of replication (Ori). These regulatory elements enable ongoing and sustainable production of pREF by independent laboratories through transformation and propagation in a stable E. coli line and purification using standard plasmid preparation techniques (see Methods). This ability to reproduce and propagate pREF, and independently prepare matched mRNA and protein controls using standard molecular biology techniques distinguishes pREF from other molecular standards12,13,14,15.

Measuring next-generation sequencing accuracy with pREF

pREF is designed to be used as a molecular standard in NGS, where it provides a detailed and comprehensive evaluation of NGS accuracy and performance. To demonstrate this approach, we first sequenced four technical replicates of pREF using Illumina short-read and Oxford Nanopore (ONT) long-read sequencing (Fig. S14).

We first measured sequencing accuracy across the conu, repe and gece genes, which include all 6-mers, providing a detailed signature error profile for the Illumina DNA libraries (Fig. 2a,b, Fig. S2a-c, Fig. S3a). We observed a mean 0.009 error-rate with a wide 200-fold range in error rate between the least accurate (0.1200 for TCTTGT) and most accurate k-mers (0.0006 for ACTAGT; Fig. S3a) in one replicate. This sequencing error profile is systematic, with little variation in performance between three further technical replicate libraries (SD 0.0014; Fig. S3b). For comparison, we also sequenced the PhiX-174 genome, which exhibited similar sequencing accuracy within the narrower window of represented k-mers (Fig. S3c, S1a-b). pREF encodes genes with extensive repeats (repe genes), and strong GC biases (gece genes) that enable evaluation of NGS accuracy and performance at these difficult sequences. We found that the presence of repeats had a complex impact on Illumina sequencing accuracy, with higher rates of deletion and mismatch errors than insertion errors, and an unexpectedly higher error rate for shorter repeats (Fig. S2b-c, Fig. S4a). By contrast, GC-rich regions in gece genes were dominated by mismatch transversion errors (Fig. S2b, Fig. S4b).

Fig. 2: Using pREF to measure errors in Illumina and ONT sequencing.
figure 2

a Annotations of conu, gece and repe genes in pREF, with heat maps showing k-mer coverage, GC-content and repetitiveness. Error profile of (b) Illumina and (c) ONT alignments across the pREF sequence. d Histogram of individual k-mers ranked according to error rate for Illumina and ONT sequencing. e Individual error profiles for selected k-mers (ONT = Oxford Nanopore Technologies). ae A single replicate from each sequencing method was used to create each plot. Source data are provided in a Source Data File.

ONT sequencing provides real-time, long read and single-molecule sequencing, but suffers from errors that can confound analysis3. We evaluated the error profile of ONT sequencing using pREF (Fig. 2c, Fig. S5a-c), which showed a higher mean error frequency than Illumina sequencing (0.0613; SD 0.03455) (Fig. S6a), with a 33-fold difference between the most and least accurate k-mers (from 0.0167 to 0.55794) (Fig. 2d). Comparing ONT and Illumina sequencing showed different per nucleotide error rates within different k-mers (Fig. 2e). We also evaluated the ONT sequencing error profile at repeats within the repe genes (Fig. S6b,c), finding that error rate has a positive linear relationship with repeat length (Pearson’s correlation (R2) = 0.93) up to 0.3949 for an 18-nt homopolymer. The majority (93%) of errors at repeats were deletions, resulting in reads being ~16% shorter than expected (Fig. S6c).

Sequencing of gece genes also showed that ONT sequencing errors scaled linearly with GC content (R2 = 0.9442), from 0.0454 at 21% GC, up to 0.0728 at 65% GC content (Fig. S6d). Unlike Illumina sequencing, we observed similar proportions of all error types in our gece genes (compare Fig. S2cto Fig. S6d). This demonstrates how direct analysis of the pREF sequence provides a comprehensive evaluation of different reagents, instruments and libraries used in NGS.

Measuring quantitative accuracy using pREF conu genes

pREF includes repeated conu genes which, when sequenced, generate a staggered ladder that can evaluate the quantitative accuracy and variation in NGS libraries (Fig. 3a). To demonstrate this approach, we analysed the read counts across 31-mer sliding windows for all conu genes in Illumina and ONT sequenced DNA libraries (Fig. S14, Fig. 3b,c, Fig. S7a). For the Illumina DNA libraries, we observed a strong quantitative correlation between conu gene copy number and read count (Pearsons R2 = 0.9875, Fig. 3c). Consistent ratios were observed between the successive conu genes (mean = 1.991, SD = 0.1490). However, this relationship was weaker for ONT DNA sequencing, due to the lower quantitative accuracy of these libraries (R2 = 0.79, SD = 0.1490, Fig. 3c, Fig. S7a).

Fig. 3: Measuring the quantitative accuracy of synthetic conu genes.
figure 3

a Schematic diagram illustrates the design and use of conu genes as quantitative controls. b Density histogram from k-mer counts for conu gene families illustrates the distribution of technical variation in 31-mer normalised read count, calculated using a sliding window approach. Bounds of technical variation are a visual representation of read count variation and are not based on a statistical calculation. c Quantitative accuracy of simulated, DNA and RNA sequencing using Illumina and ONT sequencing technologies as measured from conu genes (ONT = Oxford Nanopore Technologies). d Density plots show the spread of technical variation in conu read counts in DNA and RNA libraries, prepared for ONT and Illumina sequencing. Source data are provided in a Source Data File.

We next used pREF to investigate the impact of technical variation introduced during the experimental steps of library preparation and sequencing. This technical variation can be measured by the symmetric, unimodal distribution of k-mer counts for conu genes (Fig. 3d). In a simulated library, which excludes experimental variation, we measured only minor technical variation due to random sampling error (coefficient of variation (CoV) = 1.26%, SD = 0.0807). By contrast, Illumina libraries showed additional variation due to experimental steps (CoV = 5.63%, SD = 0.1922), whilst the ONT libraries showed the highest degree of variation during their preparation (CoV = 12.56%, SD = 0.6537). This provides a useful estimate of variation within an NGS library, even in the absence of experimental replicate libraries.

Given this ability to estimate technical variation, pREF can be used to normalize technical differences between different libraries, whilst retaining biological differences16. To demonstrate this, we spiked pREF into two mock microbial communities13. The two mock communities, which contain synthetic microbial sequences at known concentrations, were then sequenced in triplicate using ONT DNA sequencing (Fig. S14). The observed differences in abundance were plotted against their expected abundances, demonstrating the technical variation introduced by sequencing (Fig. S7b) Sequencing data were normalized using removal of unwanted variation (RUVg), with the pREF genes providing negative scaling factors16. We found that RUVg normalization using pREF as a spike-in control improved normalization compared to either unnormalized or Upper Quartile (UQ) normalisation methods17 (see Fig. S7c). Accordingly, the use of pREF for normalization resulted in improved detection of known fold-change differences in microbial abundance between communities, and demonstrates how pREF can be routinely used as an internal control to normalise technical variation and improves detection of fold-change differences.

Use of synthetic mRNA controls during RNA sequencing

The synthetic genes encoded within the pREF sequence can be in vitro transcribed to generate a suite of matching mRNA controls for use in RNA sequencing experiments (Fig. 4a). To demonstrate this, we performed in vitro transcription using T7 RNA polymerase to generate a matched suite of mRNA controls. We then prepared mRNA controls as cDNA libraries for both Illumina and ONT sequencing (see Methods).

Fig. 4: In vitro transcription of pREF mRNA controls.
figure 4

a Synthetic control genes are preceded by a T7 promoter that enables in vitro transcription into matched mRNA controls. b Scatter plots indicate the fold-differences in k-mer sequencing error rates between RNA and DNA libraries for Illumina and ONT sequencing. c Violin plot illustrates the enrichment of sequencing errors at k-mers that form hairpins. d Use of synthetic conu gene during differential gene analysis of lung adenocarcinoma cells with torkinib. The synthetic RNA controls (coloured points) indicate the accuracy for detecting fold change differences in gene expression (grey) between treated and untreated cells. Source data are provided in a Source Data File.

We first evaluated the accuracy of RNA sequencing for Illumina and ONT sequencing across repe, conu and gece genes (Fig. S14). We observed markedly higher error rates and greater technical variation in RNA compared to DNA (Fig. 4b, Fig. S8a-c). For Illumina RNAseq libraries, we observed a mean k-mer error rate of 0.0154, which is 2.6-fold higher than the corresponding DNA sequencing (Fig. S9a). We also measured the quantitative accuracy of conu gene expression, showing that correlation in conu gene expression was lower for RNA (R2 = 0.79), than for DNA (R2 = 0.9875, Fig. 3c). Similar to our Illumina sequencing results, we observed higher error rates for ONT RNA compared to DNA (Fig. S8a-c, S9a-b). Moreover, we observed a higher error rate and technical variability for ONT RNAseq than for Illumina RNAseq (Figs. 3d, 4b).

Analysis of mRNA controls also showed the impact of RNA secondary structure on sequencing accuracy. Examination of RNAseq error rates showed that short, inverted repeats were particularly error-prone (such as an error rate of 0.555 for ACTAGT, or 0.4851 for CTGCAG kmers) for RNA sequencing. This is likely to be due to the formation of hairpin loops that may impact transcription by RNA polymerase, as RNA inverted repeats exhibited an 83-fold higher error rate than corresponding DNA sequencing (Fig. 4c). We also observed enrichment for errors in repe and gece genes in ONT RNA sequencing datasets, relative to complementary DNA sequencing results (Compare Fig. S6c-d to S10a-b). This indicates the additional exacerbation of errors at repetitive and GC-biased sequences during RNAseq library preparation and sequencing.

We next used the pREF control genes as reference standards to measure the enzyme performance of either Sp6 or T7 RNA polymerases (Fig. S14). We first performed in vitro transcription using either of Sp6 or T7 RNA, and then ONT sequenced in triplicate. Data were analysed using per-nucleotide normalisation against the matched DNA sequencing libraries to normalise for sequencing-specific errors (see Methods). We found that Sp6 polymerase (median error rate = 0.004579, SD 0.03401) performed markedly better than T7 polymerase, which was less accurate and exhibited a higher variance (see Fig. 4b). Closer comparison of T7 and Sp6 RNA polymerase accuracy provided a high-resolution analysis of polymerase processivity and performance. For example, we found RNA polymerase performance varies depending on the nucleotide context, with Sp6 performing more poorly on pyrimidine-rich (C,T) k-mers (1.4-fold enrichment), while performing comparatively better on purine-rich (A,G) rich k-mers (see Fig. 4b). Both Sp6 and T7 performance was impacted by the presence of repeats, with a 4.12 and 5.75 fold-enrichment in sequencing error compared to the background, respectively.

To demonstrate the use of pREF conu genes in an RNA sequencing experiment, we performed a drug-treatment experiment that evaluated the impact of torkinib treatment, a selective and ATP-competitive mTOR inhibitor, on gene expression in lung adenocarcinoma A549 cells (Fig. 4d, Fig. S14; see Methods). We spiked the conu mRNA controls into duplicate RNA samples harvested from treated and untreated cells prior to library preparation and sequencing. We first assessed the quantitative accuracy of our RNA sequencing experiment by comparing the expression conu spike-in mRNA controls between libraries (R2 = 0.8233). As previously indicated, the conu mRNA controls enabled us to estimate the technical variation between libraries (CoV=18.23%) and could be used as negative scaling factors for accurate RUVg normalisation between the samples (see Methods).

RUVg can remove technical variation through factor analysis using reference standards16. We tested whether RUVg normalisation with pREF could also be used for normalisation of an RNA-seq experiment. Following normalisation with RUVg, we used the conu genes to benchmark the detection of fold-differences in gene expression between control and treated libraries. The endogenous genes FOSB, LOXL2, TRIB and ITGB4, were differentially expressed in response to torkinib, in a pattern that is expected from mTOR inhibition (Fig. 4d). This spike-in use of conu mRNA controls allows normalisation between RNA sequencing libraries, without relying on housekeeping genes, which can be highly variable18. This method will be particularly useful for experiments that induce global gene expression changes or with low biological replication, where negative controls and analyses of variation are not an option.

Use of synthetic proco genes to improve accuracy of protein quantification

We designed synthetic proco genes that can be in vitro translated for use as a matched peptide controls during proteomic analysis (see Fig. 5a). The trypsin digestion of proco proteins liberates smaller peptides of differing size, charge and retention time, enabling the calibration of MS experiments. Furthermore, a subset of peptides is repeated at variable copy-numbers (2x, 4x and 8x) so that when they are digested with trypsin, they form a staggered, quantitative peptide ladder (Fig. 5a).

Fig. 5: In vitro translation of synthetic proco protein controls.
figure 5

a Schematic illustrates the design of proco genes that are translated to form protein controls. Trypsin digestion of the proco proteins then liberates control peptides of differing size, charge and retention time, enabling the calibration of LC-MS/MS. b Quantification of each fully cleaved proco peptide. Relative peptide abundance is measured by the proportion of detected peptides relative to all peptides in the proco protein. Data are presented as mean values +/- SD (n = 3 biologically independent samples). c Schematic diagram indicates how peptides are also present at differing copy-number, thereby forming a staggered quantitative reference ladder for evaluating quantitative performance of proteomic experiments. Data plots are for illustrative purposes only, and are not based on Mass Spectrometric measurements. d Measurement of relative peptide abundance for proco proteins and housekeeping E. coli proteins (where each peptide is expected to be in equal abundance) in replicate (n = 3). Source data are provided in a Source Data File.

We first expressed the proco proteins for use in the proteomic analysis of the E.coli proteome (Fig. S14, see Methods). Samples were trypsin digested and analysed using LC-MS/MS, and we identified the majority of peptides encoded within the three proco genes, alongside 2,306 endogenously expressed E. coli proteins (Fig. S13). We observed high levels of digestion (97.5%, 94.5% and 99.8%) of the proco proteins, with few undigested peptides, which compared similarly to a subset of highly abundant E. coli housekeeping genes (a mean ~91.8% trypsin digestion) that were analysed (Fig. S13 and S11a).

We next measured peptide counts using LC-MS/MS in untargeted Data Independent Acquisition (DIA) mode, which is thought to yield accurate quantification, even in the absence of standards19. We first used the proco genes to evaluate the quantitative accuracy of LC-MS/MS DIA datasets by comparing the measured abundance of individual proco peptides relative to their expected copy numbers (Fig. 5c). We found that observed proco peptide abundance correlated poorly with expected copy numbers (R2 = 3.057 e-06; Fig. 5d). These observed discrepancies in detected MS signals may be attributed to the sequence-dependent variation in ionization efficiencies between different peptides. However, despite these limitations, we observed high reproducibility in peptide quantification across three technical replicates indicating that while LC-MS/MS may not provide accurate quantification between peptides with different sequences, it does offer a reproducible and quantitative means of assessing the relative abundance of a given peptide between different samples (Fig. 5b).

To investigate the source of this variation, we evaluated whether the physicochemical properties of proco peptides confounded their quantification. We performed a multiple linear regression to compare peptide quantification against a range of predicted peptide physicochemical properties. Our analysis indicated that Molecular Weight, Hydrophobicity, Aliphatic and Instability Indexes did not confound peptide quantification. However, the Isoelectric Point and Extinction Coefficient of peptides both had a significant, albeit weak, negative correlation with peptide quantification (p = 0.047, p = 0.0076, respectively; Fig S12). Further variables such as post-translational modifications, degradation or cellular export of proteins may also have impacted peptide detection; however, their measurement is outside the scope of this experiment.

We also considered whether the observed discrepancies in LC-MS/MS quantification measured with the proco peptide controls similarly impacted the measurements of the accompanying E. coli proteins. To evaluate the variation in quantification, we assumed that each unique peptide within the selected E. coli housekeeping proteins was present once and was therefore equally abundant (Fig. S11a). Therefore, by comparing the variation in peptide quantification within E.coli proteins, we could estimate the quantitative accuracy of our LC-MS/MS experiment and compare it to proco controls. Similar to our proco peptide results, we observed a wide variation in the quantification of different E. coli peptides, with strong reproducibility for the same peptide between biological replicates (Fig. S11b-c). This demonstrates how proco proteins can evaluate and validate the quantification of protein abundance using MS methods.