This research complies with all relevant ethical regulations and was conducted with the approval and under the supervision of the Institutional Review Board of Washington University School of Medicine in St. Louis, MO. Samples were obtained from a previously published cohort study of women with rUTI27. Participants provided written informed consent for the work presented here.
Probe design and synthesis
E. coli reference genome selection, clustering, and annotation
All E. coli and Shigella (hereafter collectively referred to as E. coli) complete genomes were downloaded from NCBI RefSeq as of June 2017 (total of 295 genomes). To supplement the Refseq collection with additional diverse genomes, 3141 publicly available, high-quality (L50 < 20) genomes of E. coli that were listed in the NCBI Pathogen Detection database were downloaded from GenBank from July to August 2017 (Supplementary Dataset 4). In order to remove nearly identical Genbank genomes, we performed k-mer-based clustering. All 3141 Genbank genomes were k-merized (using 23mers) with the StrainGST “kmerize” tool from StrainGE26, then their pairwise all-vs-all Jaccard similarities were calculated. Single-linkage clustering was performed on these similarities at a 95% threshold to construct 1485 genome clusters, of which 1124 contained a single genome and 361 contained two or more Genbank genomes. Of the 361 multi-genome clusters, 67 included a genome identical to one of the RefSeq references, and were discarded. For the remaining 294, the largest reference was chosen as a representative for the cluster. Our final set included these 294 representatives, the 1124 singletons, and the 295 RefSeq genomes. This final set of 1713 genomes represented a large, diverse collection, with references from all eight major phylogroups of E. coli (Supplementary Table 1; Supplementary Fig. 1), as determined by the tool ClermonTyping69, and 515 distinct multi-locus sequence types, as determined by the tool mlst (https://github.com/tseemann/mlst).
The 1713 genomes were then uniformly re-annotated with the Broad Institute prokaryotic genome pipeline10,70. Genes were clustered into orthogroups using SynerClust71, which resulted in a total of 174,584 orthogroups containing 8,334,026 total genes. As the computational time to analyze all these orthogroups using CATCH was prohibitive, we filtered out rare orthogroups found in fewer than three genomes, leaving 64,146 orthogroups (containing a total of 8,165,358 genes). In order to ensure that our set contained all potential instances of key genes important in clinically relevant E. coli, we retained all instances of orthogroups containing 59 Pfam domains of interest, obtained from a curated list (Supplementary Table 2). Using this list, we added back a total of 2434 orthogroups (3479 genes) that were found in fewer than three genomes. Our final set contained 64,580 orthogroups comprising 8,168,837 genes. In order to reduce design constraints in CATCH, thereby decreasing computational cost and time, we further clustered each orthogroup using UCLUST72, with an 80% identity threshold. This generated one or more clusters of genes within each orthogroup, in which all cluster members had ≥80% identity to one other. This generated 87,218 gene clusters from the 64,580 orthogroups. These gene clusters were the input for the CATCH probe design.
Probe design and filtering
CATCH25 was run to generate probes for each gene cluster using the following parameters: 2 bp mismatch allowed; 25 bp cover extension; expand “N” to ACGT; 30 bp island of the exact match; 60-75 bp length. In addition, three non-E. coli Enterobacteriaceae assemblies were used as part of the CATCH algorithm to “blacklist” probes that matched off-target sequences with a mismatch tolerance of 8 bp: a Citrobacter, a Salmonella, and a Klebsiella genome (Genbank accessions GCA_000648515.1, GCF_000195995.1, and GCA_000240185.2, respectively). Representatives from these three genera were chosen as they represent well-characterized Enterobacteriaceae, closely related to E. coli; blacklisting them was thought to help to improve the specificity of the probe set to E. coli vs. other similar organisms. Duplicate probes were removed, resulting in a total of 911,618 unique probe sequences.
We also used an additional set of filters to remove remaining probes that might capture off-target sequences. We used the blastn tool from BLAST+73,74 to search probe sequences for homology against the NCBI prokaryote reference genome database (downloaded in October 2017), using the following parameters: max_target_seqs 30; evalue 1e-5; qcov_hsp_perc 80; perc_identity 80. Using these results, we removed probes that had matches of 65 bp or more to: 1) ≥ 100 references in the database (1798 probes removed); 2) Bacteroidetes references (2470 probes removed); 3) Firmicutes references (14,935 probes removed). We were left with a final total of 892,415 probes that were unlikely to hit other commonly found bacterial species in the human gut.
In silico probe set validation
To verify that our probe set would actually be able to capture the vast majority of genes in the E. coli pangenome, we used blastn from BLAST+ to query our probe sequences against the entire pangenome from our set of 1713 references, which included genes that had previously been filtered out at the probe design stage. We used probe sequences as queries for blastn with the following parameters: max_target_seqs 30; e-value 1e-5; qcov_hsp_perc 80; perc_identity 80. We retained alignments with >65 bp length and no more than 8 mismatches in the entire alignment. The probe set was considered to capture a gene if one or more probes met these criteria for a given gene.
Probe synthesis
For this study, the probes were synthesized by Roche, though the probe set was not specifically tailored to their technology and could be synthesized by other manufacturers. All probes could be synthesized, although 330,387 (37%) probes had one or more bases truncated from the 3’ end. The average number of bases trimmed per probe was 1.27 ± 2.16. Only 5423 probes had 10 or more bases trimmed. All of the most highly truncated probes had low nucleotide complexity, primarily due to long stretches of homopolymers. As these changes were unlikely to affect the performance of the probe set as a whole, we used this slightly modified probe set in our experiments. The average probe length after synthesis was 73.7 ± 2.2 bp.
Analysis of four-strain E. coli mock community
Library construction and sequencing
We used a previously reported mock community26, which included 99% human DNA and 1% E. coli DNA. The E. coli DNA was composed of: i) H10407 (phylogroup A; 80%), ii) E24337A (phylogroup B1; 15%), UTI89 (phylogroup B2; 4.9%), and Sakai (phylogroup E; 0.1%). The Nextera XT library construction kit (Illumina) was used to generate sequencing libraries following the manufacturer’s recommended protocol. To enrich E. coli sequences in the mock library (~100 ng into the HS reaction), we performed HS using a Roche SeqCap EZ Hypercap kit with our designed custom capture probe set. Hybridization and target capture followed the SeqCap kit instructions except that we diluted the probe pool 1:2 before use, and substituted custom Nextera adapter-blocking oligonucleotides25 for the SeqCap HE Universal adapter and index-blocking oligonucleotides. After hybridization (18 h), bead capture and washes, we performed 10 cycles of PCR with generic universal Illumina P7 and P5 primers. The final libraries were quantified by Qubit fluorometry (Thermo Fisher Scientific), and the size distribution was analyzed by TapeStation electrophoresis (Agilent) before Illumina sequencing. Then, pre- and post-HS libraries were sequenced on an Illumina HiSeqX, generating 21,460,598 and 75,576,717 paired-end 151 bp reads for the pre- and post-HS libraries, respectively.
Downsampling and quality control
Pre-HS and HS libraries were downsampled to equal depth (3 Gb, or approximately 20,000,000 paired-end reads) with Picard-Tools prior to analysis (https://broadinstitute.github.io/picard/). Quality control was performed with FastQC75 and MultiQC76. Due to observed heightened rates of PCR duplications in the HS libraries, both HS and pre-HS sequencing datasets were de novo deduplicated with FastUniq (Supplementary Results A, Supplementary Table 3)77.
Calculation of enrichment, bias, and coverage levels
We assessed the enrichment of total E. coli within the metagenome with a one-sample t-test of the log2 fold-change for all four strains. In order to examine enrichment bias between different strains within a sample, we compared the ratios of the RAs of individual strains in HS metagenomes to the ratios in pre-HS metagenomes with paired t-tests. RAs and depth of coverage for each of the four strains were estimated using StrainGST26. We first built a StrainGST database consisting of just the reference genomes of the four strains of E. coli in the mock mixture–H10407, UTI89, Sakai, and E24377A–all downloaded from NCBI Genbank. Then, we k-merized both pre- and post-HS data and ran StrainGST (without k-mer fingerprinting) against the database to determine the RAs and depth of coverages for all four strains.
Coverage levels for each of the four strains were obtained by aligning downsampled and deduplicated data with Bowtie2 v. 2.3.4.378 with default parameters to a concatenation of all four strains’ reference genomes. Only properly paired aligned reads with a minimum mapping quality (MQ) of 5 were retained with samtools (http://www.htslib.org/). This filtering was done to exclude reads and regions of the genomes where reads aligned equally well to different strains, with the goal of reducing bias in less abundant strains due to sequence homology to sequences deriving from the more abundant strains. Then, coverages of MQ ≥ 5 reads were assessed using Bedtools (https://bedtools.readthedocs.io/en/latest/).
Assembly of E. coli mock community
Downsampled and deduplicated data were used to generate metagenomic assemblies. First, pre- and post-HS data were digitally normalized with the program khmer79. Then, downsampled data were processed with Trim Galore (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) to remove leftover adapter content. Then, reads were aligned to the hg38 reference using Bowtie2 v. 2.3.4.3 (with the “very sensitive” flag). Reads that did not align to the human genome were assembled with MetaSPAdes80 with default parameters. Contigs and scaffolds <1 kb were removed and GAEMR was used to assess assembly metrics and determine the taxonomy of each remaining contig/scaffold.
We used BLAST+ to search for coverage of the strain reference genomes by the assembled E. coli contigs >1 kb. Hits >1 kb and with >90% identity were included in the final coverage calculation. To identify strain-specific genes in each strain in the mock community, we used an all vs. all BLAST+ approach to look for homologous genes between all reference genomes. Any gene that did not match a gene in any other strain (E-value < 1e-10) was considered strain-specific.
Comparison of actual to expected probe coverages
To determine probe hybridization sites on each strain’s reference genome, all probe sequences were aligned using Bowtie2 v. 2.3.4.3 to each of the four reference genomes, individually. The intervals where probes aligned were designated as putative probe hybridization sites. Bedtools was used to calculate the probe coverage of the four reference genomes, as well as the coverage of the reference genomes by the pre-HS and post-HS metagenomes.
Statistical analysis and reproducibility
No statistical methods were used to pre-determine sample sizes, but our sample sizes are similar to those reported in previous longitudinal microbiome studies which were able to detect significant effects26,81,82. No data were excluded from the analyses. Our study was an observational cohort study and no replication was performed, although we have described the recruitment process and sampling strategy sufficiently such that the study may be replicated. Our study was an observational study with no intervention and cohorts based on pre-determined criteria; as such, no randomization was required. Control participants were age-matched to rUTI participants, and few dietary differences existed between the cohorts based on survey responses. We adjusted for race in cohort comparisons of microbiome structure.
Clinical cohorts from urinary microbiome (UMB) study
This study recruited only women, since women are the vast majority of patients who suffer from frequent recurrent UTI, the subject of this study. Sex was self-reported. The number of participants used in each experiment, along with their age, is presented in Supplementary Tables 6 and 11.
Sequencing of clinical stool samples
Total nucleic acid was extracted from stool metagenomes collected into ethanol during the Urinary Microbiome (UMB) project, as previously described (Bioproject PRJNA400628)27. The total (~100 µL) nucleic acid from each sample was divided into equal aliquots for DNA and RNA sequencing. 191 samples were used for pre- and post-HS metagenomic (DNA) sequencing, of which 130 samples were used for pre- and post-HS metatranscriptomic (RNA) sequencing (Supplementary Table 5). The RNA aliquots were treated with DNase and Agencourt AMPure beads for a SPRI clean-up. The 130 non-enriched RNA libraries were sequenced with an Illumina NovaSeq, generating an average of 56.0 ± 56.7 million paired-end 151 bp reads. The 191 non-enriched DNA libraries were sequenced with a combination of Illumina HiSeq 2500 and HiSeq X10 technologies, as previously reported27.
Enrichment and sequencing of E. coli PanSelect libraries
We enriched E. coli sequences from both DNA and RNA samples with multiplex solution HS using a Roche SeqCap EZ Hypercap kit with our designed custom capture probe set. DNA-Seq libraries were processed in pools of 8 libraries (~200 ng each). RNA-Seq libraries were prepared as pools of multiplex RNAtag-Seq libraries from 24 RNA samples83,84 and amplified by 14 cycles of PCR to generate at least 100 (mean 140) ng of each library pool for HS (one 24-plex pool per reaction). Hybridization and target capture were performed as for the mock community above. All post-HS libraries were run on an Illumina NovaSeq, generating an average of 9.6 ± 3.9 million paired-end 151 bp reads for post-HS DNA libraries, and 10.2 ± 10.7 million paired-end 151 bp reads for post-HS RNA libraries. Three samples for which HS DNA sequencing failed (low post-QC read depth) were removed from the analysis (UMB13_09, UMB08_04, UMB24_08).
Quality assessment
The quality of sequencing files was assessed with FastQC and MultiQC. We did not de novo deduplicate DNA reads, as we observed far lower PCR duplication in the HS DNA libraries than for the mock community (Supplementary Results A, Supplementary Dataset 1). We processed DNA and RNA data with KneadData (https://huttenhower.sph.harvard.edu/kneaddata/) to remove low-quality sequence, adapter content, and human contamination.
Benchmarking E. coli PanSelect enrichment using stool samples
Enrichment estimation
For both DNA and RNA, per-sample E. coli enrichment was estimated as the fold change in the depth-normalized read coverage of the UTI89 reference genome between pre-HS and post-HS sample pairs. Global alignments (with Bowtie2 v. 2.3.4.3) were used for estimating fold change in reference coverage. Strain composition was estimated with StrainGST. To track changes to strain RAs with HS, strain calls were paired between pre- and post-HS samples. First, pre- and post-HS strains assigned to the same reference were paired (184 strain pairs). Next, strains assigned to different references of the same phylogroup were paired (14 strain pairs). There were 16 instances where one strain in one sample appeared to match two strains in the corresponding sample. Because it was unclear if this was due to a strain-calling error, these 16 instances were removed from the analysis. The remaining strains with no clear pair were assumed to only be detected in one sample (55 strain discoveries).
The expression levels of individual transcripts were used to illustrate RNA enrichment. First, ordinary least squares regression was used to find the average relationship between log10-transformed pre-HS and post-HS E. coli RA. Three samples were identified as enrichment outliers via t-tests on studentized residuals and were removed (Benjamini-Hochberg FDR correction, p < 0.1). Of the remaining samples, the 94 with >1 million post-QC pre- and post-HS reads were downsampled to 1 million reads. Reads were aligned to the UTI89 reference with bowtie2 and counted with FADU v.1.885, run without the expectation-maximization algorithm. Transcript expression levels were estimated in counts per million (CPM). Transcripts expressed below 10 CPM were classified as not detected (n.d.). For visualization, transcripts from all 94 pre-HS/post-HS sample pairs were plotted on the same axes (Fig. 3b). To show density, clusters of transcripts expressed at similar rates were formed with hierarchical clustering.
Background metagenome and metatranscriptome analyses
Pre- and post-HS metagenomes and metatranscriptomes were downsampled to a maximum depth of 3.5 Gb. Taxonomic and transcriptomic profiles for all metagenomes and metatranscriptomes (pre- and post-HS) were calculated with MetaPhlAn3 and HUMAnN328 for samples with more than 1,000,000 post-QC reads. E. coli and Shigella were removed from the MetaPhlAn RA table and the HUMAnN gene family output. The remaining values were sum normalized. Bray-Curtis dissimilarity values were calculated with the Python package scipy v.1.7.1 (https://scipy.org/). PERMANOVA was implemented by the permanova function from the python package scikit-bio v.0.5.6 (https://scikit.bio/index.html).
Sequencing and assembly of E. coli isolate for benchmarking of metagenomic assembly
We sequenced a single E. coli isolate from Participant 8, timepoint 2, using a combination of Illumina and Oxford Nanopore Sequencing. Sequencing and hybrid assembly were performed as previously published86. This assembly has been submitted to NCBI Genbank with accession GCF_011751425.1.
Metagenomic assembly
Metagenomic assemblies were generated for ten samples using all available sequencing data. Digital normalization, adapter trimming, assembly, and calculation of assembly metrics were performed as for the mock community data. Metrics were calculated based on only contigs assigned to E. coli using blastn. After assemblies were produced, a binning program, MetaBat287, was used to produce metagenome-assembled genomes (MAGs). MAGs were analyzed with CheckM88 to determine taxonomy and assembly completeness for MAGs that were classified as Enterobacteriaceae by CheckM.
Analyses of full set of enriched stool samples
Comparison of gene content between cohorts
We calculated gene family presence/absence profiles for enriched metagenomes with PanPhlAn3 (v. 3.1), run with the UniRef90 Escherichia coli pangenome generated on Nov 2, 2020 by the Segata Lab28. Samples were filtered based on the evenness of E. coli coverage with the profiling workflow contained within PanPhlAn (panphlan_profiling.py), run with the ‘very sensitive’ parameters (–min_coverage 1 –left_max 1.70 –right_min 0.30).
We used omnibus and per-gene tests to test for differences in gene content between the recurrer and healthy cohorts. First, we quantified differences in overall gene content profiles between samples with the Jaccard Index and tested for differences between cohorts with PERMANOVA. We used the scikit-bio implementation of PERMANOVA to assess the marginal effect of cohort, controlling for subject, by permuting subject labels while keeping samples from the same subject together. In order to test for differential abundance of individual gene families, we used logistic regression, and Fisher’s exact tests for genes where we had issues with model fitting. For logistic regression, we selected the genes that were found in between 10% and 90% of all samples, and detected in at least one sample from both the recurrer and healthy cohorts (4,114 gene families). We fit models with the function pglmm from R package phyr v.1.1.289, using random intercepts for subjects and a phylogenetic covariance structure based on the most abundant strain identified within each sample by StrainGST. No gene families were significantly differentially abundant after Benjamini-Hochberg false discovery rate correction. Samples from individuals in the rUTI cohort who did not have a UTI during the study period (e.g. nonrecurrers) were included in modeling, but only the contrast between the recurrer and healthy cohorts was reported. To evaluate the remainder of gene families for which we had issues with model fitting, we used Fisher’s exact tests to calculate the significance of the association between gene families and rUTI history, at both the sample and subject carriage levels. For the subject level, we called a gene present in a subject if it was identified in at least one sample in a subject series. For both sets of models (subject and sample level), the most significantly differentially abundant genes were those that were already shown to be insignificant by the more comprehensive logistic regression model. Thus, we concluded that the remaining genes were not differentially abundant either.
fimS structural variation profiling
We estimated the fraction of fimS in the ON orientation within enriched metagenomes by aligning reads (Bowtie2 v. 2.3.4.3; MQ > 5) to a reference containing a copy of the UTI89 genome with fimS in the ON orientation and a copy with fimS in the OFF orientation. For many samples, we observed a mixture of alignments to both the ON and OFF references, indicating subpopulations of fim-expressing (piliated) and non-expressing (smooth) E. coli within the same samples. We used the proportion of reads aligning uniquely to the ON orientation as an estimate for the size of the piliated population. Because there was significant inter-sample variation in E. coli RA (and thus fimS coverage), we were able to estimate the RA of the piliated population more accurately in some samples than others. In order to reflect this variation in our estimates of average fimS activation, we used weighted averages and regressions with weights proportional to per-sample fimS coverage (ON + OFF), and filtered samples with <5 fimS-aligning reads from analysis (Supplementary Table 5).
Selection of E. coli coverage threshold for fim and differential expression analysis
Post-HS RNA and DNA samples were aligned to the UTI89 reference genome with bwa-mem v.0.7.17-r118890, and alignments were counted with FADU85 (run with parameters: -M -p). We then filtered samples based upon total post-HS DNA and RNA E. coli content, using a threshold of 60,000 UTI89 coding sequence-aligning reads. We determined this threshold by examining the relationship between RNA E. coli content and observed RNA transcript diversity. We used ‘relative abundance-weighted transcript diversity’ as a metric, defined as the sum of average RAs of all unique transcripts detected in a sample. We selected our threshold of 60,000 E. coli reads, because unique transcripts representative of 80% of the average E. coli transcriptome could be observed in samples with >60,000 E. coli reads (Supplementary Fig. 11).
Relationship between fimS orientation and fim operon expression
For testing the relationship between fimS orientation and fim operon expression, we used the subset of samples for which we could both estimate fimS orientation (5 fimS-aligning reads) and measure gene coverage and expression (60,000 E. coli DNA and RNA reads) (Supplementary Table 5) We quantified fim operon coverage and expression as the sum of coverage of all genes in the protein-coding fim operon (fimAICDFGH). We used mixed effects models for metagenome-controlled differential expression testing, as described in Zhang et al.91. Models were fit with the form: ({fim; RNA}(log {cpm}), sim {fim; DNA}(log {cpm}),+{fimS}(log%)+(1{|subject})) with the function lme from the R package nlme. To control for variation in E. coli content between samples, we used weights proportional to sample E. coli content: weights=varFixed(~ (1/DNA_ecoli + 1/RNA_ecoli)).
Selection of samples and genes for differential expression testing
For differential expression testing between cohorts, we filtered samples based upon total coverage of E. coli (60,000 DNA and RNA reads, see above) and coverage evenness, as determined by the profiling workflow contained within PanPhlAn3, run with ‘very sensitive’ parameters (–min_coverage 1 –left_max 1.70 –right_min 0.30) on alignments to the UTI89 reference (bwa-mem v.0.7.17-r1188, counted with FADU, run with parameters -M -p; see above) (Supplementary Table 5). We used genes that were classified as ‘single copy’ within metagenomes by PanPhlAn. For differential expression testing, we selected a subset of genes present in at least 50% of all metagenomes from both the recurrer and healthy cohorts (15 samples/cohort) and expressed in at least 15 metagenome/metatranscriptome pairs across the full study. Gene presence was defined as coverage by at least 20 reads within a metagenome. Expression was defined as presence with additional coverage by at least 20 reads within the associated metatranscriptome. We set the read threshold (20) based on the E. coli content range of samples included in DE testing. The sample with the 15th greatest RNA E. coli content had 10-fold more E. coli than the sample with the least E. coli. Thus, setting the threshold at 20 reads ensured that for any gene classified as expressed in 15 samples, we would have sensitivity to detect at least two reads in all samples where the gene was expressed at an equal or greater rate.
Between cohort differential expression testing
We used mixed effects models for per-gene metagenome-controlled differential expression tests91. We fit models of the form:
$${{{rm{RNA}}}}left(log {{{rm{cpm}}}}right) sim , {{{rm{DNA}}}}left(log {{{rm{cpm}}}}right)+{{{rm{cohort}}}}+{{{rm{E}}}}.{{{rm{coli; RNA}}}}left(log%right)+(1|{{{rm{subject}}}})$$
with the function lme from the R package nlme. To control for variation in E. coli content between samples, we used weights proportional to sample E. coli content: weights = varFixed(~ (1/DNA_ecoli + 1/RNA_ecoli)). We included E. coli RA as a fixed effect based on model comparisons using Akaike’s information criterion (AIC). We quantified the coverage of genes within metagenomes and metatranscriptomes as copies per million scaled by E. coli content (CPM), and used a natural log transformation for variance stabilization of RNA and DNA CPM values, as well as E. coli RAs. RNA zeroes were replaced before transformation with a gene-specific pseudocount equal to half the lowest non-zero RPM value measured for each gene, as done in Zhang et al.91. There were no zero values for DNA (because we used single copy genes) or E. coli RA.
Gene set enrichment analysis
We used Gene Set Enrichment Analysis (GSEA)42 to test for the overrepresentation of gene sets among genes over and under-expressed in the recurrer cohort. We grouped genes into TF regulons, using gene:TF interactions reported in RegulonDB39, as well as KEGG Modules and Pathways. For TFs with dual activity, we grouped genes into separate regulons consisting of genes activated and repressed by the TF. Because RegulonDB reports regulatory information for E. coli K12 and we used a E. coli UTI89 reference, the TF:gene interactions were not immediately transferable to our dataset. We used SynerClust71 to pair orthologs between the two reference genomes, and transferred annotations from the K12 reference to UTI89 orthologs. For annotation of KEGG Pathway and Module membership, we used KEGG gene name annotations from RegulonDB and the KEGG E. coli K12 (eco) Pathway and Module maps. For GSEA, we used gene sets that were five genes in size or larger. We used the t-scores from the differential expression tests (reported above) as input for GSEA. We reported GSEA results at an FDR-corrected significance threshold of 0.05.
Metagenomic growth rate estimation
The growth rate of E. coli strains within metagenomes was estimated as the difference in coverage between the origin and terminus of replication, as implemented in SMEG46. We constructed a SMEG species database using the strain reference genomes reported by StrainGST, and ran SMEG using the ‘reference based’ mode with the strain RA estimated by StrainGST, using the post-HS sequencing data. Per sample, we calculated the average E. coli growth rate as the average of strain growth rates, weighted by strain RA. We used linear mixed effects models to compare strain growth rate between cohorts, of the form ({growth; rate} sim {cohort}+(1{|subject})).
Statistics and graphical plotting
All statistical analysis and plotting was performed in R v3.692 and Python. R was primarily used for validation of the probe set and differential expression testing with the libraries ggplot293, data.table (https://rdatatable.gitlab.io/data.table/), Rmisc (https://cran.r-project.org/package=Rmisc), and nlme94. Python was used for other analysis of samples from the UMB study with libraries: pandas v.1.3.295, numpy v.1.21.296, scipy v.1.7.197, statsmodels v.0.12.198, pysam v.0.19.199, scikit-bio v0.5.6100, matplotlib v.3.4.3101, and seaborn v.0.11.281.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
- SEO Powered Content & PR Distribution. Get Amplified Today.
- PlatoData.Network Vertical Generative Ai. Empower Yourself. Access Here.
- PlatoAiStream. Web3 Intelligence. Knowledge Amplified. Access Here.
- PlatoESG. Carbon, CleanTech, Energy, Environment, Solar, Waste Management. Access Here.
- PlatoHealth. Biotech and Clinical Trials Intelligence. Access Here.
- Source: https://www.nature.com/articles/s41467-024-53829-7