Integrating taxonomic signals from MAGs and contigs improves read annotation and taxonomic profiling of metagenomes – Nature Communications

Natural microbial communities consist of many different microorganisms that can be identified and characterized via their DNA with shotgun metagenomic sequencing. For an overview of all microorganisms and their relative abundances in a sample, a comprehensive approach is to obtain taxonomic annotations for as many of the sequencing reads as possible. The resulting taxonomic profile reflects the amount of DNA that was contributed by the community members to the sequencing machine, as opposed to their cell count or number of genome copies35. The accuracy of the taxonomic profile depends on the reliability of the taxonomic annotations. While contigs and MAGs can be more reliably annotated than individual reads, in most metagenomic datasets, not all reads are assembled into contigs, and not all contigs are binned into MAGs (Fig. 1). To address this trade-off between annotation reliability and the fraction of data that can be explained in a metagenome, we developed Read Annotation Tool (RAT) (Fig. 1b, c).

Fig. 1: The RAT workflow.

a Overview of a standard state-of-the-art metagenomics pipeline. b Overview of the RAT workflow: reads are mapped with BWA-MEM to contigs, which are either binned into MAGs or left unbinned. MAGs and contigs are taxonomically annotated using BAT and CAT, respectively. Unmapped reads and, thus far, unclassified contigs are annotated using DIAMOND. c Left: composition of an integrated taxonomic profile as reconstructed by RAT -mcr (for ‘MAGs and contigs and reads’, includes direct mapping of thus far unclassified contigs) and RAT -mc (for ‘MAGs and contigs’). Right: schematic bar plot showing the fraction of the metagenome that can be annotated as reads, contigs, and MAGs.

RAT annotates contigs and MAGs with the previously published tools Contig Annotation Tool (CAT) and Bin Annotation Tool (BAT), respectively. CAT and BAT predict ORFs on these longer sequences with Prodigal and query them against a protein reference database with DIAMOND blastp33. The taxonomy of the sequence is assigned based on the combined taxonomic signal of the individual ORFs, selecting higher-ranking taxa in cases where many conflicting signals are present21. Default options for the reference database include the NCBI non-redundant protein database (nr)36 and, in the latest RAT update, the non-redundant set of proteins of the representative genomes in the Genome Taxonomy Database (GTDB)37. Alternatively, any protein database with taxonomic annotations can be supplied by the user. Next, individual reads are mapped to the contigs with BWA-MEM, and each read inherits the taxonomic annotation with the highest reliability: the MAG annotation if the contig is binned and the contig annotation if it is unbinned. Finally, the remaining sequences (reads that do not map to a contig and contigs that cannot be annotated by CAT) are annotated individually by querying them directly against the protein database with DIAMOND blastx in default sensitivity mode33. Thus, by assigning reads to the taxonomic annotation with the highest reliability, RAT reconstructs a comprehensive taxonomic profile with high accuracy (Fig. 1c, Supplementary Fig. 1). The final step in which sequences are individually queried to the protein database is optional, and depending on whether this step is included, we distinguish two RAT modes: in -mc mode, RAT only uses the most reliable read annotations, which are based on MAGs and contigs with ORFs. In -mcr mode, RAT also uses the read and contig annotations with DIAMOND blastx, which will include more tentative annotations while representing more of the data.
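The priority order described above (MAG annotation first, then contig annotation, then direct annotation) can be illustrated with a minimal Python sketch. This is not RAT's actual code: the function name, the dictionary-based inputs, and the fall-back to the contig annotation when a binned contig's MAG has no annotation are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple


def annotate_read(
    read_id: str,
    read_to_contig: Dict[str, str],   # read -> contig it maps to (e.g. from BWA-MEM)
    contig_to_mag: Dict[str, str],    # contig -> MAG it was binned into
    mag_taxonomy: Dict[str, str],     # MAG -> taxon (e.g. from BAT)
    contig_taxonomy: Dict[str, str],  # contig -> taxon (e.g. from CAT)
    direct_taxonomy: Dict[str, str],  # read or contig -> taxon from a direct search
    use_direct: bool = True,          # True mimics -mcr, False mimics -mc
) -> Tuple[Optional[str], str]:
    """Return (taxon, source) using the most reliable signal available."""
    contig = read_to_contig.get(read_id)
    if contig is None:
        # Unmapped read: only a direct annotation is possible.
        if use_direct and read_id in direct_taxonomy:
            return direct_taxonomy[read_id], "read (direct)"
        return None, "unannotated"

    mag = contig_to_mag.get(contig)
    if mag is not None and mag in mag_taxonomy:
        return mag_taxonomy[mag], "MAG"
    # Assumption: fall back to the contig annotation if no MAG annotation exists.
    if contig in contig_taxonomy:
        return contig_taxonomy[contig], "contig"
    # Contig that could not be annotated via its ORFs: direct annotation.
    if use_direct and contig in direct_taxonomy:
        return direct_taxonomy[contig], "contig (direct)"
    return None, "unannotated"
```

Setting `use_direct=False` in this sketch corresponds to the -mc mode, where reads without a MAG or contig annotation simply remain unannotated.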

We evaluated RAT’s performance of read annotation and how well the final taxonomic profile represents the microbial community. First, we addressed the trade-off between annotation accuracy and the fraction of reads that can be annotated by the different steps in RAT, using 28 samples of simulated data from three different datasets in the second round of the Critical Assessment of Metagenome Interpretation (CAMI2) challenge38. Second, with the same datasets, we compared taxonomic profiles predicted by RAT to those predicted by other commonly used state-of-the-art profilers. Third, we assessed the performance of RAT and the best-performing other profiler on real metagenomes. To this end, we analyzed 18 samples from three groundwater monitoring wells, a relatively unexplored high-diversity environment that contains many novel taxa39.

Including taxonomic signals from MAGs and contigs improves read annotation

To evaluate how the integration of different taxonomic signals influences the annotation of individual reads, we annotated simulated metagenomic datasets from the second CAMI challenge38 (CAMI2) with RAT. CAMI2 simulated well-characterized microbiomes of the mouse gut and microbiomes with more taxonomic novelty from marine and rhizosphere environments. The 28 samples contained between 78 and 381 species and included raw reads, gold standard assemblies (the best possible assembly of the sequencing reads in a sample), and genome sequences of these species. In our benchmarks, we used the gold standard assemblies as contig input. Each dataset of the CAMI2 challenge has different strengths for benchmarking. The mouse gut dataset only includes taxa from known species. However, as shown in Supplementary Fig. 2a, reads from the mouse gut dataset show high sequence divergence from taxa in known databases (as represented by the NCBI nucleotide (nt) database) due to simulation of sequencing errors, thereby posing a challenge for metagenomic profilers and effectively simulating unexplored environments. The rhizosphere dataset is the only dataset that contains eukaryotes, and its samples contain the highest fraction of reads belonging to unknown taxa, with up to 36.4% of reads not having a known species representative in nt (Supplementary Table 1). However, the rhizosphere samples have the highest fraction of reads mapping to MAGs, which is not representative of many biological samples, where reads are often unmapped or mapped to unbinned contigs (Supplementary Fig. 2b). Finally, the marine dataset contains 10.8–18.2% reads that belong to taxa without a known species representative in nt (Supplementary Fig. 2c). Conversely, reads from known taxa are highly similar to sequences in nt, making the samples representative of microbiomes that have been well-characterized (Supplementary Table 1, Supplementary Fig. 2a).

We compared five different methods for read annotation: (i) we annotated all reads directly with DIAMOND blastx, without mapping them to contigs or MAGs, (ii) RAT -cr for taxonomic annotations via contigs but ignoring MAG annotations, and direct read annotations for reads that did not map to contigs, (iii) RAT -mcr for annotations via MAGs, contigs, and reads, using the MAGs included in CAMI2 (‘CAMI genomes’), (iv) RAT -mcr for annotations via MAGs, contigs, and reads, using MAGs binned by MetaBAT2 (ref. 26) (<10% contamination), and (v) RAT -mc for annotations via MAGs and contigs, using MetaBAT2 MAGs, but no direct read annotations (Fig. 2). Results were assessed at six taxonomic ranks (phylum, class, order, family, genus, and species) and we scored whether a read was correctly or incorrectly annotated, or unannotated (Fig. 2, Supplementary Fig. 3).
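As a minimal illustration of this scoring at a single taxonomic rank, the following Python sketch classifies each read as correct, incorrect, or unannotated, and computes the true positive rate as the fraction of correctly annotated reads among all annotated reads. The function names and the dictionary inputs are hypothetical; the benchmark's actual scoring scripts are not shown in this text.

```python
from collections import Counter
from typing import Dict, Optional


def score_reads(predicted: Dict[str, Optional[str]], truth: Dict[str, str]) -> Counter:
    """Count correct, incorrect, and unannotated reads at one taxonomic rank."""
    counts: Counter = Counter()
    for read_id, true_taxon in truth.items():
        taxon = predicted.get(read_id)
        if taxon is None:
            counts["unannotated"] += 1
        elif taxon == true_taxon:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return counts


def true_positive_rate(counts: Counter) -> float:
    """Correctly annotated reads divided by all annotated reads."""
    annotated = counts["correct"] + counts["incorrect"]
    return counts["correct"] / annotated if annotated else 0.0
```

Note that unannotated reads do not lower the TPR in this definition; they only reduce the fraction of the data that is explained.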

Fig. 2: Outcome of incorporating different taxonomic signals into read annotations on 10 samples of the CAMI2 mouse gut dataset.

‘DIAMOND’ refers to using only direct read annotation in default sensitivity mode. ‘RAT -mcr (CAMI2)’ refers to a RAT -mcr run (integrating MAGs, contigs, and reads) using the genomes that were provided by the CAMI2 challenge as MAG input. ‘RAT -mcr (MetaBAT2)’ refers to a RAT -mcr run with contigs binned by MetaBAT2. ‘RAT -cr’ refers to a RAT run without MAG input. ‘RAT -mc’ refers to a RAT -mc run, using only read annotation via mapping to MetaBAT2 MAGs and contigs, but no direct read annotation. The mean TPR refers to the fraction of correctly annotated reads per fraction of annotated reads averaged across the ten samples. The white section of the pie charts shows the fraction of unannotated reads. The same figure including results for all taxonomic ranks and for all benchmarked tools, as well as for the marine and rhizosphere datasets can be found in Supplementary Fig. 3. TPR true positive rate. Source data are provided as a Source Data file.

Direct annotation with DIAMOND blastx resulted in low accuracy at low taxonomic ranks like genus and species with a high fraction of mis-annotated reads (Fig. 2, Supplementary Fig. 3), revealing spurious annotations when mapping short sequences to a reference database. Accuracy in the mouse gut dataset is particularly low at species rank, with a true positive rate (TPR) of 14.1 ± 4.5% (mean ± standard deviation) (marine: 40.3 ± 2.3%, rhizosphere: 25.0 ± 20.6%) for DIAMOND. Despite using DIAMOND with the same reference database, RAT runs reduced mis-annotations and improved the fraction of correctly annotated reads at deep taxonomic ranks, highlighting the value of integrating information from taxonomically annotated MAGs and contigs (Fig. 2, Supplementary Fig. 3).

When only taxonomic signals from contigs and direct read annotations are integrated (RAT -cr), the TPR increases compared to direct annotation with DIAMOND blastx, while the fraction of incorrectly annotated reads drops to 0.1–1% for all datasets. In addition, the fraction of reads with an annotation increases. This indicates that many previously mis- or unannotated reads are correctly annotated if they map to contigs (Fig. 2).

When taxonomic signals from both contigs and MAGs are integrated, the fraction of unclassified reads decreases compared to annotating without MAGs. In the CAMI2 mouse gut dataset, using the CAMI2 genomes as MAG input and binning the contigs with MetaBAT2 gave very similar results, indicating that current binning tools accurately group contigs from the same species together. Without using DIAMOND blastx to annotate the remaining unmapped reads and unclassified contigs (RAT -mc), the fraction of annotated reads decreases, while the true positive rate stays the same in the mouse gut dataset. In real biological datasets, RAT -mc is likely to annotate fewer reads on low ranks like species and genus. This is because higher diversity makes it more difficult to assemble reads into contigs than in the simulated CAMI2 samples, and in turn, fewer or shorter contigs lead to a smaller fraction of the reads being associated with longer sequences, which leads to less reliable annotations (see below, Supplementary Figs. 7 and 8).

In the marine and rhizosphere datasets, the same patterns are visible. RAT, on average, annotated the highest fraction of reads in the marine samples, followed by the mouse gut and rhizosphere dataset. All different tool settings showed a higher TPR on the marine dataset compared to the mouse gut or rhizosphere (Supplementary Fig. 3), likely due to the much higher similarity between the reads and the reference databases in the marine samples (Supplementary Table 1, Supplementary Fig. 2).

In conclusion, using the taxonomic signals from contigs and MAGs for read annotation leads to more reliable annotations than using direct querying of individual reads with DIAMOND.

Including information from contigs and MAGs improves accuracy of taxonomic profiling

Metagenomics is used to analyze high-complexity microbial communities, including many different taxa with orders of magnitude of difference in their abundances. Taxonomic profilers aim to chart the community composition by estimating the relative abundance of all taxa in a sample. A good taxonomic profile contains as many members of the microbial community as possible while avoiding taxa that are not present in the sample. In practice, this often leads to a compromise between sensitivity (finding all taxa that are present and maybe some false positives) and precision (avoiding taxa that are not present and maybe some false negatives). To assess how the inclusion of contigs and MAGs affects the accuracy of taxonomic profiles, we used four metrics (sensitivity, precision, L1 distance, and weighted UniFrac distance) to compare the taxonomic profiles reconstructed by RAT and four state-of-the-art taxonomic profilers that carry out annotations via direct read mapping to the CAMI2 reference taxonomic profiles (Fig. 3). Centrifuge12 compares reads to the nucleotide database using a Burrows–Wheeler transform, Kaiju13 annotates sequences in protein space, Kraken2 (ref. 11) uses exact k-mer matches, and Bracken40 uses the Kraken2 annotations for a Bayesian re-estimation of the abundances of taxa in the sample. The RAT output includes all taxa that are represented by at least a single read. However, as direct read annotations are known to be inaccurate, we limited the amount of noise by only considering taxa with a minimum abundance of 0.001% (which represents 4 reads in the CAMI2 samples) and applied this cut-off for all tools in the benchmark. This cut-off was applied separately on every taxonomic rank. Thus, even if a species with an abundance below 0.001% was removed, its reads could still be included at the genus rank if the genus was >0.001%.
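The per-rank cut-off and the two detection metrics can be sketched in a few lines of Python. This is an illustration under assumptions, not the benchmark code: the nested-dictionary profile layout and the function names are hypothetical, abundances are given in percent, and sensitivity and precision follow their standard definitions (true positives divided by all reference taxa, and by all predicted taxa, respectively).

```python
from typing import Dict, Set, Tuple

Profile = Dict[str, Dict[str, float]]  # rank -> taxon -> abundance in percent


def apply_cutoff(profile: Profile, min_abundance: float = 0.001) -> Profile:
    """Filter every rank independently, keeping taxa at or above the cut-off."""
    return {
        rank: {taxon: ab for taxon, ab in taxa.items() if ab >= min_abundance}
        for rank, taxa in profile.items()
    }


def sensitivity_precision(predicted: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); precision = TP / (TP + FP)."""
    true_positives = len(predicted & reference)
    sensitivity = true_positives / len(reference) if reference else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, precision


# Two rare species fall below the cut-off, but their genus stays above it.
example = {
    "species": {"sp. A": 0.0005, "sp. B": 0.0008, "sp. C": 2.0},
    "genus": {"Genus AB": 0.0013, "Genus C": 2.0},
}
filtered = apply_cutoff(example)
print(sorted(filtered["species"]), sorted(filtered["genus"]))
```

Because each rank is filtered on its own, reads from the two removed species in this toy profile still contribute to their genus, as described above.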

Fig. 3: Similarities between taxonomic profiles reconstructed by different tools and the reference taxonomic profiles of the CAMI2 mouse gut dataset using different similarity measures.

We only counted taxa as detected if their relative abundance was at least 0.001% (a minimum of 4 reads). a L1 distances between CAMI reference and reconstructed profiles by the tools in the study (mean ± 2 times the standard deviation, n = 10 samples). An L1 value of 0 means that the two profiles are identical; lower is better. The blue and white background shading facilitates differentiation between ranks and has no biological meaning. b Heatmap of weighted UniFrac distances between reconstructed and true profiles; a shorter distance is better. c Sensitivity vs. precision of the different tools. Different shapes signify the sensitivity/precision on different taxonomic ranks, different colors indicate tools; high precision + sensitivity is better. The same figures for the marine and rhizosphere datasets can be found in Supplementary Figs. 4 and 5. Source data are provided as a Source Data file.
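For reference, the L1 distance between two profiles at one rank is the sum of absolute differences in relative abundance over all taxa found in either profile. The short Python sketch below assumes profiles are dictionaries of relative abundances expressed as fractions; the function name is hypothetical, and any normalization applied in the benchmark itself is not shown here.

```python
from typing import Dict


def l1_distance(profile_a: Dict[str, float], profile_b: Dict[str, float]) -> float:
    """Sum of absolute abundance differences over the union of taxa."""
    taxa = set(profile_a) | set(profile_b)
    return sum(abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa)
```

With abundances that sum to 1 in each profile, this value is 0 for identical profiles and reaches 2 when the profiles share no taxa.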

In line with our first benchmark, the incorporation of taxonomic signals from MAGs led to more accurate profiles than using only taxonomic signals from contigs, as seen in the L1 distance and in the weighted UniFrac distance of RAT -cr and RAT -mcr. RAT -mcr slightly outperformed RAT -mc (Fig. 3, Supplementary Figs. 4, 5) in L1 distance and sensitivity, indicating that including direct read annotation leads to reconstructed profiles that are more similar to the reference profile than when relying solely on assembly-based profiling. Taxonomic profiles reconstructed by RAT consistently had lower L1 distances to the reference profiles than profiles reconstructed by Bracken, Centrifuge, and Kraken2 across all three CAMI2 challenge datasets (Fig. 3a, Supplementary Figs. 4 and 5). In comparison to taxonomic profiles reconstructed by Kaiju, RAT runs had slightly higher L1 distances on family, genus and species rank. Taxonomic profiles reconstructed by RAT had lower weighted UniFrac distances to the reference profiles than Bracken, Centrifuge, and Kraken2, except in the marine dataset (Fig. 3b, Supplementary Figs. 4, 5), while Kaiju performed similarly. The high performance of Bracken, Centrifuge, and Kraken2 on the marine dataset can likely be explained by the high similarity of the reads to the sequences in the database reflecting well-characterized microbiomes. In well-characterized communities, annotations in nucleotide and k-mer space are put at an advantage compared to unexplored environments due to the high likelihood of long exact matches in the database. For communities containing mostly well-known organisms, tools such as Kraken2/Bracken or Centrifuge are therefore suitable. Conversely, more sensitive methods that annotate in protein space, especially ones that include a last-common-ancestor approach, such as both RAT and Kaiju, are at a relative disadvantage in well-characterized communities. As proteins are more conserved than nucleic acids, the annotations by these tools are more likely to be based on multiple taxa with the same or similar amino acid sequences. This leads to a higher likelihood of missing annotations on low ranks compared to tools that search in nucleotide or k-mer space. Methods such as Kaiju or RAT are, therefore, particularly suitable for characterizing environments with organisms that exhibit high sequence divergence compared to their closest relatives in the databases.

RAT had a higher precision on all taxonomic ranks than the other evaluated tools (Fig. 3c). This means that RAT had fewer falsely detected taxa, in line with earlier observations of high precision of CAT and BAT annotations21. In the mouse gut dataset, RAT -mc maintained >0.94 precision on all taxonomic ranks, even when detected taxa were not limited by a minimum relative abundance cut-off (Supplementary Fig. 5). The same pattern can be seen in the marine (>0.93) and rhizosphere (>0.83) datasets, where RAT consistently showed higher precision than the other evaluated tools. Thus, like CAT and BAT on which its annotations are based, RAT -mc tends to avoid spurious annotations at low taxonomic ranks like genus and species in cases where conflicting taxonomic signals arise. For RAT -mcr, precision remained higher than that of the other evaluated tools across taxonomic ranks, but precision was lower than that of RAT -mc. The minimum relative abundance cut-off greatly improved the precision of RAT -mcr (cf. Fig. 3c and Supplementary Fig. 6). Spurious annotations are introduced when short sequencing reads are directly annotated in the direct annotation step of RAT and by the other evaluated tools. However, because of the prioritization of taxonomic signals in RAT, a smaller fraction of reads is annotated directly, leading to fewer spurious annotations in the first place. By setting an abundance cut-off (e.g., 0.001% of reads as in this benchmark), RAT can profit from the high sensitivity of the DIAMOND blastx step (finding taxa that might not be detected using just contig and MAG annotations) while further minimizing the number of falsely detected taxa (by excluding spurious annotations that have a very low abundance).

RAT’s overall high precision can be explained by its integrated taxonomic profiling approach, which improves annotations in most of the challenges discussed above. Reads that map to conserved or horizontally transferred regions, or map to novel genomic regions of a known taxon, are likely to get the correct annotation with RAT because the surrounding regions of the genome are considered in the annotation via the contig and/or MAG. Reads belonging to novel taxa within known clades are also more likely to get correctly annotated, as when the reads are assembled into contigs or MAGs, RAT may annotate them on a higher taxonomic rank. For example, if the closest related sequences in the reference database are found among different species in a genus, the sequence will be annotated at the genus rank (see Methods). The difference in precision between the different approaches shows that reads that are annotated via direct read mapping instead of by being associated with a contig or MAG are far more likely to get falsely annotated. RAT’s approach reduces the number of falsely detected taxa from 200–4000 by the other evaluated tools to between 0 (RAT -mc) and 38 (RAT -mcr).

All evaluated tools showed high sensitivity from phylum down to family rank, detecting most of the taxa that were present in the reference profiles (Fig. 3c, Supplementary Figs. 4, 5). This is consistent with increased barriers to horizontal gene transfer at higher taxonomic ranks41. Including direct read annotation consistently increased RAT’s sensitivity compared to RAT -mc on all ranks and across all three datasets. One may expect that these classifications are less robust than those annotated via MAGs or contigs. All tools displayed the highest sensitivity in the marine dataset and the lowest in the rhizosphere samples. RAT’s high performance on the CAMI2 datasets is in part due to the fact that a large fraction of the reads map back to annotated contigs (mouse gut: 81.6 ± 6.6% (mean ± standard deviation), marine: 97.4 ± 0.2%, rhizosphere: 95 ± 4.7%) and MAGs (mouse gut: 75.8 ± 6.4%, marine: 90.5 ± 0.5%, rhizosphere: 91.7 ± 8.0%, Supplementary Fig. 7, Supplementary Table 1). These numbers are often lower in real metagenomic datasets (see below). The result is that most reads are annotated in the most reliable MAG and contig annotation steps, and few reads are annotated directly with DIAMOND, reducing the probability of spurious annotations (Supplementary Figs. 7 and 8). To show the effect of using simulated vs. biological data, we also tested RAT on a set of 18 groundwater samples (see below).

Usage, runtime, and memory requirements

Next, we compared the runtime and memory requirement of RAT to the other tools on the mouse gut samples 6 and 13 (Table 1). RAT does not assemble and bin metagenomes but rather takes assembled contigs and associated MAGs as input from the user. Other user input includes the CAT database and taxonomy folders, as well as the sequencing reads. If a previous RAT run was interrupted, the intermediate files can be used as input to shorten runtime. If CAT and/or BAT have already been run on a dataset, the output files can also be used as input for RAT. Although assembly and contig binning can take hours or days to run (for example, the two mouse gut samples took around 2 h to assemble and bin, Supplementary Table 2), they are a common procedure in many metagenomics studies, as they provide valuable genomic context information to short sequencing reads with relatively little risk of generating chimeras42.

Table 1 Runtime and memory usage of RAT and four other tools

Kraken2 was the fastest tool (01m49s), RAT -mcr was the slowest (02h05m10s), and all other tools, including RAT -mc, performed the jobs in 16 min or less. In terms of memory usage, all tools can be run on a 256 GB server. RAT -mcr had a higher memory footprint than Kraken2, but lower than Kaiju and Centrifuge. RAT -mcr varied in RAM and runtime between the two samples because it loads different amounts of unclassified reads and contigs into memory depending on the sample.

The expanded CAT pack facilitates the detection and annotation of unknown microorganisms

The simulated data provided by the CAMI2 challenge differs from real biological datasets. In the CAMI2 datasets, Illumina sequencing experiments were simulated of relatively low-diverse microbiomes containing mostly reads of known species (Supplementary Fig. 2). Annotations are facilitated by the fact that on average >80% of the reads mapped back to a MAG or contig from a gold-standard assembly, while in biological datasets, this percentage can be much lower (Supplementary Figs. 4 and 5). In addition, particularly in microbiomes from under-studied environments, unknown lineages are often detected that are only distantly related to known taxa in reference databases. Awaiting taxonomic classification of these microorganisms, a higher-rank taxonomic annotation of the sequence at e.g., family or phylum rank may be appropriate in these cases.

RAT provides a framework for assessing these unknowns. Because reads are classified via CAT and BAT, annotations are made at the appropriate taxonomic rank. CAT and BAT assign individual ORFs to the last common ancestor of all hits that have a similar bit-score to the best hit and annotate the contig or MAG using a bit-score-based voting scheme that selects the taxon at which a certain fraction (in the RAT workflow, the majority) of the ORF assignments agree21. Novel sequences have many distinct hits and are thus only annotated at a high taxonomic rank, reflecting their unknownness. MAGs that only receive a high taxonomic rank annotation by BAT may be further investigated with phylogenomic software for strain-level resolution. Since the quality of RAT results is highly dependent on the quality of the input data, we recommend using high-quality assemblies and only including MAGs with low contamination (e.g. <10% contamination according to CheckM43). Contaminated MAGs can be mis-annotated or annotated at a trivially high taxonomic rank, in which case a contig annotation is more reliable. MAG completeness is less relevant for RAT, as MAGs with low completeness typically still include more than one contig from the same microorganism, creating a stronger taxonomic signal than present on the individual contigs.
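The two-stage scheme described above (a last common ancestor per ORF, followed by bit-score-weighted voting per contig or MAG) can be sketched as follows. This is a simplified illustration and not the CAT or BAT implementation: lineages are plain root-to-leaf lists, and the parameter names `score_range` and `min_fraction` as well as their default values are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Hit = Tuple[float, List[str]]  # (bit-score, root-to-leaf lineage of the hit)


def lowest_common_ancestor(lineages: List[List[str]]) -> List[str]:
    """Return the longest lineage prefix shared by all given lineages."""
    shared: List[str] = []
    for taxa in zip(*lineages):
        if len(set(taxa)) != 1:
            break
        shared.append(taxa[0])
    return shared


def classify_orf(hits: List[Hit], score_range: float = 0.1) -> Tuple[float, List[str]]:
    """Assign an ORF to the LCA of all hits scoring close to the best hit."""
    top = max(score for score, _ in hits)
    near_top = [lineage for score, lineage in hits if score >= (1 - score_range) * top]
    return top, lowest_common_ancestor(near_top)


def classify_sequence(orf_hits: List[List[Hit]], score_range: float = 0.1,
                      min_fraction: float = 0.5) -> List[str]:
    """Pick the deepest lineage supported by more than min_fraction of the
    summed top bit-scores of all ORFs with hits."""
    support: Dict[Tuple[str, ...], float] = defaultdict(float)
    total = 0.0
    for hits in orf_hits:
        if not hits:
            continue
        score, lineage = classify_orf(hits, score_range)
        total += score
        # An ORF supports its assigned taxon and every ancestor of it.
        for depth in range(1, len(lineage) + 1):
            support[tuple(lineage[:depth])] += score
    best: Tuple[str, ...] = ()
    for lineage, score in support.items():
        if score > min_fraction * total and len(lineage) > len(best):
            best = lineage
    return list(best)


orfs = [
    [(100.0, ["Bacteria", "Proteobacteria", "Escherichia", "E. coli"]),
     (95.0, ["Bacteria", "Proteobacteria", "Escherichia", "E. albertii"])],
    [(80.0, ["Bacteria", "Proteobacteria", "Escherichia", "E. coli"])],
    [(50.0, ["Bacteria", "Firmicutes", "Bacillus", "B. subtilis"])],
]
print(classify_sequence(orfs))
```

In this toy example, the first ORF has near-equal hits to two species and is therefore assigned to their genus. No single species gathers a majority of the summed bit-scores, so the sequence as a whole is annotated at genus rank, which mirrors how conflicting signals push annotations to higher ranks.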

To challenge RAT with real datasets, we selected relatively unexplored groundwater samples taken 12–64 m below the surface level from three different monitoring wells in a Dutch agricultural area, which we previously found had high microbial diversity and contained many novel taxa39. We performed a metagenomic analysis including quality control, assembly24, and binning26,27,44, which produced 514 MAGs. We supplied the reads, 2,770,251 contigs, and 423 medium- to high-quality MAGs (completeness ≥ 50%, contamination < 10%; see ref. 45) to RAT to reconstruct taxonomic profiles of the groundwater samples, using nr as a reference database. In addition, the medium- to high-quality MAGs were dereplicated46, and the resulting 195 representative MAGs were placed in a phylogenetic tree showing their relationships and abundance across samples (Supplementary Fig. 6).

RAT annotated 22.0 ± 8.7% (mean ± standard deviation) of reads by mapping them to MAGs, much less than in the simulated CAMI2 datasets (see Supplementary Table 1, Supplementary Figs. 4 and 5), reflecting the high complexity of the groundwater samples. RAT classified 20.9 ± 3.2% of reads via unbinned contigs annotated by CAT and 0.35 ± 0.23% via contigs annotated by DIAMOND. Finally, DIAMOND blastx annotated an additional 23.0 ± 3.3% of the reads. These unmapped reads represent sequences with low coverage that could not be assembled into contigs and, based on the results with simulated data above, we expect them to include more spurious annotations.

The taxonomic profile reconstructed by RAT -mcr showed that most reads belonged to unclassified bacteria, including the phyla Chloroflexi and Deltaproteobacteria (Fig. 4a). Chloroflexi bacteria utilize a variety of electron acceptors, including oxidized nitrogen or sulfur compounds. A comparison of the 18 reconstructed taxonomic profiles showed that Sample 23-2 contained relatively many Chloroflexi reads, while the Deltaproteobacteria were rare. Although many of the microorganisms in this sample could only be classified on high taxonomic ranks, 22 MAGs from these phyla represented 31.1% of the reads in the sample (see Supplementary Fig. 6).

Fig. 4: Taxonomic profiling of groundwater metagenomes.

a Microbial profiles of groundwater samples on taxonomic rank class as reconstructed by RAT -mcr. b Rarefaction curves of the number of taxa detected in sample W23-2 by RAT -mc, RAT -mcr, and Kaiju. Triangles indicate the number of taxa detected in profiles when a minimum abundance is required to consider an organism as detected. Circles indicate the number of taxa detected without a cut-off. Source data are provided as a Source Data file.

Next, we compared the taxonomic profiles of the groundwater metagenomes as predicted by RAT and Kaiju, as Kaiju was the best-performing other tool in the previous benchmark. Both tools classified two-thirds of the data (RAT: 68.9 ± 5.8% of reads, Kaiju: 63.8 ± 5.6% of reads, Supplementary Table 2). However, RAT classified these reads as belonging to roughly 20% of the taxa that Kaiju predicted (Fig. 4b). Bearing in mind the high precision of RAT (Fig. 3), we propose that the taxa predicted by RAT are a more parsimonious interpretation of the metagenomic data than those predicted by Kaiju. To visualize the potential overestimation of taxa due to spurious annotations, we made rarefaction curves for the results of RAT -mcr, RAT -mc, and Kaiju. Without a minimum relative abundance cut-off, rarefaction curves of RAT -mcr and Kaiju results did not level off. This pattern was also observed in simulated data containing a known number of 110 species (Supplementary Fig. 10) and thus points to an overestimation of taxa richness. This reflects the spurious annotations of individual reads and indicates that, without a cut-off, deeper sequencing of the same sample would lead to higher predicted richness. The rarefaction curve of RAT -mc leveled off in the groundwater data, indicating robustness towards falsely detected taxa in the RAT -mc workflow. With a minimum relative abundance cut-off of 0.001%, all rarefaction curves leveled off, although the different tools predicted different taxa richness. Kaiju estimated a much higher richness than RAT in both -mc and -mcr mode (Fig. 4c). Combined with the RAT results on simulated data where RAT -mc underestimated richness while RAT -mcr included some false positives (Supplementary Fig. 10), this shows that: (i) RAT -mc is the best-suited RAT workflow in experiments where reliability is crucial, but it will likely not detect all of the rarer taxa, while (ii) RAT -mcr is more sensitive and will detect more taxa at the risk of including a few of them spuriously.
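To make the rarefaction analysis concrete, the sketch below counts the taxa detected in nested random subsamples of the annotated reads, with and without a minimum relative abundance cut-off. It is an illustration only: the exact subsampling procedure used in the study is not described in this text, and the function name, the nested-prefix subsampling, and the fixed random seed are assumptions.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence


def rarefaction_curve(read_taxa: Sequence[str], depths: Sequence[int],
                      min_abundance_pct: Optional[float] = None,
                      seed: int = 0) -> List[int]:
    """Number of taxa detected at each subsampling depth.

    read_taxa holds one taxon label per annotated read. Subsamples are
    prefixes of a single shuffle, so smaller samples are nested in larger ones.
    """
    shuffled = list(read_taxa)
    random.Random(seed).shuffle(shuffled)
    curve = []
    for depth in depths:
        counts = Counter(shuffled[:depth])
        if min_abundance_pct is None:
            detected = len(counts)
        else:
            # Keep only taxa whose share of the subsample reaches the cut-off.
            detected = sum(1 for c in counts.values()
                           if 100.0 * c / depth >= min_abundance_pct)
        curve.append(detected)
    return curve


# One dominant taxon plus ten taxa that are each supported by a single read.
reads = ["dominant"] * 990 + [f"singleton_{i}" for i in range(10)]
depths = [100, 250, 500, 1000]
print(rarefaction_curve(reads, depths))
print(rarefaction_curve(reads, depths, min_abundance_pct=1.0))
```

In this toy data, every singleton taxon adds to the richness count once it is sampled, so the curve without a cut-off keeps rising with depth, whereas at full depth the cut-off leaves only the dominant taxon.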

Since RAT annotates all reads in a metagenome, the resulting taxonomic profiles reflect sequence abundance as opposed to taxonomic abundance35. This means that RAT reports the abundance of a taxon as a fraction of total DNA in the sample rather than as the number of genome copies, which can, for example, be estimated by querying marker genes6,7,8. It may be expected that the resulting relative abundance profile is skewed towards microorganisms with larger genomes since they provide more DNA to the sequencing machine and thus contribute more reads than organisms with small genomes. To convert sequence abundance to genome copies, relative abundances have to be normalized by genome length, which is often unknown and can vary widely even between strains of the same species47. For novel microorganisms, genome sizes of closely related species might not be available. For these reasons, RAT, by default, does not convert sequence abundance to taxonomic abundance. However, the CAT pack provides a table with weighted mean genome sizes for most known bacterial and archaeal taxa at all ranks based on genomes deposited in the BV-BRC database48. This allows users to estimate the relative genome abundances from relative sequence abundances if they wish.
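The normalization by genome length mentioned above amounts to dividing each taxon's sequence abundance by its genome size and rescaling the results to sum to one. The Python sketch below illustrates this arithmetic; the function name and the genome sizes in the example are made up for illustration, and it does not read the genome size table shipped with the CAT pack.

```python
from typing import Dict


def sequence_to_genome_abundance(sequence_abundance: Dict[str, float],
                                 genome_size_bp: Dict[str, float]) -> Dict[str, float]:
    """Convert relative sequence abundance into relative genome abundance.

    Raises KeyError if a taxon has no genome size, since the conversion is
    undefined for taxa whose genome length is unknown.
    """
    per_genome = {taxon: abundance / genome_size_bp[taxon]
                  for taxon, abundance in sequence_abundance.items()}
    total = sum(per_genome.values())
    return {taxon: value / total for taxon, value in per_genome.items()}


# Two taxa contribute equal amounts of DNA, but one genome is three times larger.
converted = sequence_to_genome_abundance(
    {"small genome": 0.5, "large genome": 0.5},
    {"small genome": 2_000_000, "large genome": 6_000_000},
)
print(converted)
```

In this example, equal shares of DNA translate into three times as many genome copies for the taxon with the smaller genome, which is the skew towards large genomes that the conversion corrects for.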

GTDB compatibility provides lower-rank annotations on biological data

The performance of a taxonomic profiler can only be as good as the underlying database that is used to annotate the data. Further, the database used can only be as good as the taxonomy that it is based on. The taxonomy of living organisms is still regularly being updated37,49. Curated databases, such as GTDB37 may provide better precision for profilers, but they might reduce sensitivity as they do not contain all known taxa. Conversely, comprehensive databases such as nr and nt36 contain more sequences, increasing the sensitivity of the profiler but the taxonomic annotation of those sequences might be of lower quality, so precision might be sacrificed.

To provide the user with as much freedom as possible, the CAT pack is now compatible with the GTDB database as well as the nr database, which includes NCBI Taxonomy. Although the nr database is larger than GTDB (591,417,602 versus 250,802,978 protein sequences as of November 2023), GTDB’s automated classification based on genome phylogeny makes the database more robust and less noisy than nr. CAT now includes automatic download, database preparation, and sequence annotation with CAT, BAT, and RAT based on GTDB. To show the difference in annotation between the two databases, we ran RAT on the groundwater samples described above with the nr database and with GTDB release 202 (Supplementary Figs. 9, 11, Supplementary Table 5). In the MAG step, RAT annotated more MAGs with GTDB than with nr (GTDB/nr: 94 ± 4%/63.6 ± 16.5% (mean ± standard deviation) on phylum rank, 28.9 ± 11.2%/3 ± 3.2% on genus rank, Supplementary Table 5). Considering all data, RAT annotated a larger fraction of reads down to genus rank using GTDB (GTDB/nr: 28 ± 4.1%/3.7 ± 1.9%). On species rank, RAT annotated a slightly larger fraction of the reads using nr compared to GTDB (GTDB/nr: 14.7 ± 2.2%/16.2 ± 2.4%). This is the result of incomplete taxonomic annotations in nr, where floating species are annotated that have not been assigned to a genus.

In this study, we presented the Read Annotation Tool (RAT), a tool to strengthen the CAT pack metagenome analysis suite. We showed how annotating each read by using the best available taxonomic information (based on MAGs, contigs, or direct read mapping) leads to fewer falsely detected taxa and improves the accuracy of taxonomic profiles. RAT is flexible to future improvements in sequencing technologies, as well as in assembly and binning software, as they are run by the user before the mapping and classification steps. RAT will be useful in the exploration and understanding of metagenomic datasets by robust classification of most sequencing reads, even in unexplored environments that are rich in novel microorganisms.