Shotgun metagenomic insights into secondary metabolite biosynthetic gene clusters reveal taxonomic and functional profiles of microbiomes in natural farmland soil – Scientific Reports

Soil physicochemical property and taxonomic annotation

The physicochemical analysis indicated a high organic carbon content, a slightly alkaline pH, and a moderate temperature range for both samples. A recent study has suggested that a high organic content in soil provides a favorable environment for microbial growth, metabolism, and nutrient cycling12. The soil texture results classified both samples as clay loam. This texture class reflects a balanced mixture of particle sizes and therefore provides a suitable environment for microbial colonization, nutrient solubility, and nutrient availability13. Overall, the physicochemical results suggest that both sampling sites support the growth of diverse microbial domains and phyla, which in turn might have the functional potential to produce unique secondary metabolites14.

SSU rRNA analysis revealed the dominance of the bacterial domain in both samples: 87.24% of the contigs in sample BNFC and 83.33% of the contigs in sample BNFW were assigned to the bacterial domain. This result is supported by previous studies that have indicated bacteria to be dominant in various agricultural soil environments15,16,17.
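The domain-level percentages above are proportions of contigs assigned to each domain. As a minimal sketch of that arithmetic, assuming a hypothetical list of per-contig domain labels (the function name, input format, and counts below are illustrative placeholders, not the study's pipeline or data), the calculation might look like this:

```python
from collections import Counter


def domain_percentages(assignments):
    """Return the percentage of contigs assigned to each domain.

    `assignments` is an iterable of domain labels, one per contig.
    This input format is an assumption made for illustration only.
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {domain: round(100.0 * n / total, 2) for domain, n in counts.items()}


# Placeholder data: 8 contigs, 7 of them labeled as bacterial.
example = ["Bacteria"] * 7 + ["Archaea"]
percentages = domain_percentages(example)
```

With real data, the labels would come from the SSU rRNA taxonomic assignment of each contig, and the same function could be applied per sample to compare BNFC with BNFW.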

The phylum composition of both samples was also determined by SSU rRNA analysis; additional tables show this in more detail for sample BNFC [see Supplementary Table 3] and for sample BNFW [see Supplementary Table 4]. Proteobacteria was the dominant phylum in both samples. This abundance aligns with previous studies that pointed out the high prevalence of Proteobacteria in various agricultural soil environments18,19. Proteobacteria is an extremely diverse phylum responsible for the production of a wide range of bioactive products, including antibiotics, antifungal agents, and antitumor agents20. The dominance of Proteobacteria in this study can be related to the potential of this phylum for secondary metabolite production and to its fundamental ecological role. Actinobacteria, a phylum well recognized for secondary metabolite biosynthesis, was also abundant in both samples. A study on agricultural soil of maize plants reported an abundance of Actinobacteria together with other phyla, which supports the present study16. The large proportion of this phylum in the microbial communities of both samples implies that the microbes in these environments have the potential to produce a wide range of secondary metabolites. The other abundant phylum in both samples was Acidobacteria. According to the analysis, this phylum was more abundant in sample BNFC than in sample BNFW. This variation might be attributed to the fact that sample BNFC was collected from the agricultural soil of maize plants, including both rhizospheric and bulk soil. This is supported by a study that demonstrated the dominance of Acidobacteria in samples collected from the rhizospheric soil of maize plants17.
Even though the phylum Acidobacteria has been relatively little explored for natural product production in previous studies, this study can serve as a starting point for gaining insight into its functional capability. Moreover, the identification of unassigned phyla in a significant proportion is a notable finding. These unassigned phyla represent poorly understood and poorly classified groups of microbes that may hold considerable potential for novel microbial diversity and functional capabilities, including the production of novel natural products.

Functional annotation

The results from the InterPro database21 suggested that the majority of the predicted coding sequences (pCDSs) from the studied samples matched the “Winged helix-like DNA-binding domain superfamily” InterPro entry. This entry consists of DNA-binding proteins that share a common structural motif22. Proteins with this domain can be classified as transcription factors, which can be associated with the biosynthesis of secondary metabolites such as antibiotics, because regulatory elements play a substantial role in the expression of genes within BGCs23. Many pCDSs from both samples matched the “Alpha/Beta hydrolase fold” InterPro entry, which consists of hydrolytic enzymes that share a common enzyme core: an α/β-sheet of eight β-strands connected by α-helices24. The presence of enzymes with hydrolytic properties can be associated with the modification, arrangement, and processing of molecules that are used as precursors for natural product assembly. The “FAD/NAD(P)-binding domain superfamily” had a high number of pCDS matches from sample BNFC and a reasonably comparable number from sample BNFW. During redox reactions, this domain functions as an electron-binding and electron-transferring region, and its occurrence in both samples might suggest the availability of enzymes that use FAD or NAD(P) as cofactors. These enzymes are commonly needed in different biosynthetic processes, including the pathways involved in the biosynthesis of secondary metabolites. In addition, a high number of pCDSs from both samples matched the “ABC transporter-like” InterPro entry, whose functional implications relate to the transport of different molecules across cellular membranes. The availability of the “ABC transporter-like” domain in both samples may suggest the presence of transport systems essential for the biosynthesis of natural products and their export25.
Similarly, a large number of pCDSs in both samples matched the “Aldolase-type TIM barrel” InterPro entry, which consists of class I aldolases that contain the TIM barrel domain, as revealed by X-ray crystallography26. Class I aldolases catalyze the formation of carbon–carbon bonds and are involved in the formation of the complex carbon frameworks found in natural products. The availability of these enzymes in both samples may imply the presence of biosynthetic enzymes that function in the assembly of components for natural product biosynthesis.

The results from the Pfam database suggest that the “ABC transporter” entry had the highest number of pCDS matches from both samples. This entry consists of about 3 M proteins that are involved in the transport of different types of compounds across biological membranes. ABC transporters frequently take part in the translocation of secondary metabolite precursors, intermediates, and final products across cell membranes. Interestingly, a recent study reported an ABC transporter that constitutes an essential part of the nonribosomal peptide biosynthetic machinery for the biosynthesis of secondary metabolites27. Another study related an ABC transporter to the maturation of the natural product lasso peptide cochonodin I, which is synthesized from a RiPP BGC28. Consequently, the abundance of these proteins in both samples signifies the considerable functional potential of the microbial community within the samples for secondary metabolite biosynthesis. Similarly, several pCDSs from both samples matched the “Major Facilitator Superfamily” Pfam entry, which consists of 2 M membrane proteins that represent the largest family of secondary transporters in all life forms. These proteins include transporters that translocate natural product precursors, intermediates, and final products across the cell membrane. A recently published study focused on characterizing the Major Facilitator Superfamily transporters associated with the antibacterial Pantoea natural product29. Therefore, the availability of these proteins in both samples can be a strong indication of the presence of microbes with the capacity to produce secondary metabolites. Furthermore, many pCDSs from both samples matched the Pfam entry “TonB-dependent receptor” (TBDR), which comprises 473 K proteins.
TBDRs mediate substrate-specific transport across the outer membrane by employing energy derived from the proton motive force, transmitted from the TonB complex located in the inner membrane30. During the biosynthesis of secondary metabolites, bacterial cells may need to take up molecules such as minerals, vitamins, aromatic compounds, and plant-derived compounds, which are imported by the TonB-dependent receptors30. Another study reported that a TonB-dependent receptor regulates the biosynthesis of antifungal secondary metabolites in bacterial cells31. Several pCDSs from both samples also matched the Pfam entry “MacB-like periplasmic core domain”, which consists of 206 K proteins. This protein family entry represents the periplasmic core domains found in a variety of ABC transporters. The MacB-like proteins have functions related to the efflux system of the cell, which removes toxic molecules from the cell. In addition, the MacB-like periplasmic proteins may be involved in the export of secondary metabolites such as antibiotics from the cell32. The presence of these proteins in both samples is strong evidence pointing to the production of secondary metabolites such as antibiotics by the soil microbiomes at the selected sampling sites.

The GO slim annotation results of both samples represent generalized, high-level functional categories across the three GO term classifications. In terms of biological process annotation, a significant number of pCDSs from both samples were assigned to the GO term “metabolic process.” This broad GO slim term encompasses the chemical reactions and pathways that occur in living organisms. Within the context of secondary metabolite biosynthesis, this GO term includes key biological processes such as the synthesis of precursor molecules and the enzymatic conversion of these molecules into complex secondary metabolites. Notably, the more specific complete GO term annotation for sample BNFC [see Supplementary Table 5] and for sample BNFW [see Supplementary Table 6] revealed an intriguing finding: 5517 pCDSs from sample BNFC and 3293 pCDSs from sample BNFW were assigned to the “biosynthetic process” GO term, which is directly linked to secondary metabolite biosynthesis. Another biological process GO slim term that may be relevant to secondary metabolite biosynthesis is “DNA conformation change.” This term refers to cellular processes that change the structural conformation of DNA molecules. Conformational changes in DNA can be associated with the transcriptional regulation of genes within BGCs, thereby influencing the biosynthesis of secondary metabolites. The “cell redox homeostasis” GO slim term encompasses all the biological processes that maintain the redox environment within the cell and its organelles. Maintaining a balanced environment within the cell is crucial for the proper functioning of the enzymes and pathway reactions involved in the biosynthesis of secondary metabolites. The “cytochrome complex assembly” and “iron-sulfur cluster assembly” GO terms for biological processes describe enzymatic reactions that involve electron-carrying or redox-active molecules.
Redox-active molecules and cofactors play a crucial role in secondary metabolite biosynthesis, where they mediate electron transfer in steps that may require specific cofactors. The cellular component GO terms mainly consisted of terms associated with membrane components, such as “membrane”, “intrinsic to membrane”, “plasma membrane”, and “outer membrane.” As supported by previous studies, membrane-associated regions are a potential location for secondary metabolite biosynthesis in microbial systems33,34. These regions function by regulating, binding, and transporting enzymes involved in the biosynthesis of secondary metabolites. The cellular component GO term “extracellular region” is used to annotate gene products that are located outside the outermost layer of the cell. Unlike membrane-associated components, these gene products are not attached to the surface of the cell. A previous study has associated extracellular gene products that function as signaling molecules for regulatory proteins with the expression of secondary metabolite biosynthetic gene clusters35. Consequently, the availability of these gene products in both samples can be associated with the regulation of BGCs for secondary metabolite biosynthesis. For the molecular function GO slim annotation, most of the pCDSs of both samples were annotated with the “catalytic activity” GO term. This general GO term describes the activity of naturally occurring enzymes. Enzymatic activity can also be described by GO terms such as “hydrolase activity”, which involves the hydrolysis of bonds including peptide, ester, and glycosidic bonds, and “isomerase activity”, which involves the rearrangement of bonds in molecules. In secondary metabolite biosynthesis, enzymes with various and intricate catalytic activities play an essential role36.
The availability of pCDSs with different enzymatic activities in both samples can be considered a major indicator of the potential of the microbes in both samples for secondary metabolite biosynthesis.

KEGG orthologs and pathway analysis

The results from the KAAS server revealed the dominance of the ortholog group “RNA polymerase sigma-70 factor, ECF subfamily” in both samples, to which a reasonable number of pCDSs were annotated. This gene-product group consists of sigma-70 factor proteins, which are primary sigma factors associated with bacterial RNA polymerase. Specifically, this entry consists of the extracytoplasmic function (ECF) subfamily proteins, which respond to environmental cues such as stress. These proteins play an important role in initiating gene expression by guiding RNA polymerase to the promoter region of the DNA sequence during transcription. The presence of this protein group in both samples can strongly suggest the potential of the microbiomes for secondary metabolite biosynthesis, as these proteins are involved in the transcriptional activation of BGCs. A previous study has related this group of proteins to metabolic pathways in Streptomyces spp.37, which are known for secondary metabolite biosynthesis, specifically of antibiotics. A considerable number of pCDSs from the investigated samples were functionally associated with the ortholog group “acetyl-CoA C-acetyltransferase”, an entry that represents one of the key enzymes in the mevalonate pathway38. The mevalonate pathway is known for the biosynthesis of isoprenoids and polyketides, which are well-known classes of secondary metabolites. The presence of this KO entry in both samples might suggest the availability of microbes with the potential to produce bioactive molecules such as isoprenoids and polyketides.

The results from the KEGG module database revealed the abundance of KEGG modules for the biosynthesis of terpenoids and polyketides in both samples. Among the identified KEGG modules, those with 100% or near 100% completeness that belong to the Biosynthesis of terpenoids and polyketides and Terpenoid backbone biosynthesis categories were “C5 isoprenoid biosynthesis, non-mevalonate pathway” (ID: M00096), “C10-C20 isoprenoid biosynthesis, bacteria” (ID: M00364), and “C5 isoprenoid biosynthesis, mevalonate pathway” (ID: M00095), while the module belonging to the Biosynthesis of terpenoids and polyketides and Polyketide sugar unit biosynthesis categories was “dTDP-L-rhamnose biosynthesis” (ID: M00793). Terpenoids and polyketides are well-recognized secondary metabolites. Terpenoids are a diverse class of organic compounds that are primarily synthesized by plants and have isoprene as their building block. Even though bacteria are not well-recognized sources of terpenoids, recent advances in genomics are revealing the genomic potential of bacteria for terpenoid biosynthesis39. In contrast, polyketides are a prominent class of bacterial secondary metabolites. These complex and diverse organic compounds primarily exhibit antimicrobial properties and are renowned sources of antibiotics40,41. In both samples, the abundant occurrence of KEGG modules for terpenoid and polyketide biosynthesis strongly suggests the presence of microorganisms with the potential to biosynthesize these secondary metabolites. KEGG pathway modules for the biosynthesis of other secondary metabolites were also identified. These modules include “Rebeccamycin biosynthesis, tryptophan => rebeccamycin” (ID: M00789), “Aurachin biosynthesis, anthranilate => aurachin A” (ID: M00848), and “Pyrrolnitrin biosynthesis, tryptophan => pyrrolnitrin” (ID: M00790).
The secondary metabolite rebeccamycin is recognized for its anticancer activity42, while pyrrolnitrin and aurachin are known for their antimicrobial activity43,44. The identification of these KEGG modules related to antimicrobial secondary metabolites illustrates the potential of the microbiomes in both samples for secondary metabolite biosynthesis, particularly the biosynthesis of antibiotics.
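The module completeness referred to above can be read as the share of a module's steps for which at least one ortholog was detected. The following is a minimal sketch under that simplified assumption; real KEGG module definitions also encode enzyme complexes and optional subunits, which are ignored here, and the function name and the placeholder ortholog labels are hypothetical rather than taken from the study or from KEGG.

```python
def module_completeness(steps, observed_kos):
    """Percentage of module steps covered by at least one observed ortholog.

    `steps` is a list of sets; each set holds the alternative ortholog
    labels that can carry out that step. `observed_kos` is the set of
    orthologs annotated in a sample. Simplified model for illustration.
    """
    if not steps:
        raise ValueError("module has no steps")
    covered = sum(1 for alternatives in steps if alternatives & observed_kos)
    return 100.0 * covered / len(steps)


# Hypothetical four-step module with placeholder ortholog labels.
toy_module = [{"KO_a"}, {"KO_b", "KO_c"}, {"KO_d"}, {"KO_e"}]
observed = {"KO_a", "KO_c", "KO_e"}
completeness = module_completeness(toy_module, observed)  # 3 of 4 steps covered
```

Under this reading, a module at 100% has every step represented in the sample, while a near-complete module lacks an ortholog for only one or a few steps.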

Predictive identification of putative novel BGCs for secondary metabolites

Among the antiSMASH-predicted BGCs, none of the NRPSs or PKSs matched BGCs in the MIBiG repository that have experimentally characterized products. Further analysis of the domains of these BGCs using the NaPDoS2 web server revealed that all the ‘KS’ and ‘C’ domains of PKS and NRPS, respectively, had a percent identity of less than 85 with the existing domains in the database. According to NaPDoS2, a domain sequence whose best hit falls below 85% identity may potentially play a role in the production of uncharacterized (novel) natural products45. Although the percent identity for all the ‘C’ domains is considerably less than 85%, the maximum likelihood phylogenetic tree for the ‘C’ domain indicates closeness to domains associated with natural products such as pyridomycin, pristinamycin, arthrofactin, calcium-dependent antibiotics, syringomycin, and mycosubtilin, albeit at a much lower percent identity. Similarly, the maximum likelihood phylogenetic tree for the ‘KS’ domain indicates closeness to domains associated with natural products such as hedamycin, Aliivibrio fischeri aryl polyene, and rishirilide. As illustrated in the phylogenetic tree, all the compounds in close proximity to the ‘C’ and ‘KS’ domains of the NRPS and PKS BGCs exhibit antibiotic properties. Interestingly, this potentially indicates that the predicted novel putative BGCs might code for the machinery needed for the production of novel antibiotics. This is supported by previous studies that identified PKS and NRPS BGCs as potential sources of secondary metabolites, specifically antibiotics46. Polyketides and nonribosomal peptides are diverse classes of natural compounds that are widely recognized for their antimicrobial activities. The genetic information required for the assembly machinery that produces these complex compound classes is stored in the PKS and NRPS BGCs for polyketides and nonribosomal peptides, respectively.
The presence of novel putative PKS and NRPS BGCs in the metagenomic dataset of this study provides valuable insight into the microbial potential of these samples to produce novel secondary metabolites, most likely antibiotics.
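The 85% identity criterion described above amounts to a simple threshold filter on the best database hit of each domain. As a hedged sketch, assuming a hypothetical mapping from domain identifiers to best-hit percent identities (the identifiers and values below are placeholders, not NaPDoS2 output from this study), the filtering step could be expressed as:

```python
NOVELTY_THRESHOLD = 85.0  # percent identity cutoff described in the text


def flag_potentially_novel(domain_hits, threshold=NOVELTY_THRESHOLD):
    """Return the IDs of domains whose best-hit identity is below the threshold.

    `domain_hits` maps a domain ID to the percent identity of its best
    database hit. The input format is an assumption for illustration.
    """
    return [
        domain_id
        for domain_id, identity in domain_hits.items()
        if identity < threshold
    ]


# Placeholder identities for two C domains and one KS domain.
hits = {"KS_1": 62.0, "C_1": 48.5, "C_2": 91.0}
novel = flag_potentially_novel(hits)
```

A domain flagged this way is only a candidate: low identity suggests, but does not demonstrate, that the corresponding BGC encodes an uncharacterized product.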

Among the antiSMASH-identified BGC types, two of the seven Terpene BGCs from sample BNFC and one of the two RiPP BGCs matched BGCs in the MIBiG repository that have experimentally characterized natural products. The Terpene BGCs from sample BNFC were associated with carotenoid and hopene, while the RiPP BGC from sample BNFW was associated with mildiomycin. Further analysis of both the Terpene and RiPP BGCs involved a similarity search against the NCBI Metagenomics proteins (env nr) database using the BLASTn algorithm. This analysis revealed that the RiPP BGC associated with the natural product mildiomycin in the MIBiG repository aligned with the protein coenzyme PQQ synthesis protein D. The other RiPP BGC did not have a match in the repository and aligned with a hypothetical protein. Similarly, the BLASTn similarity search revealed that the two Terpene BGCs associated with the natural products carotenoid and hopene in the MIBiG repository aligned with the proteins lycopene cyclase and squalene–hopene cyclase, respectively. Lycopene cyclase is an essential enzyme in carotenoid biosynthesis; it catalyzes the cyclization of lycopene into various cyclic carotenoids. Certain carotenoids have been found to possess antimicrobial properties. Squalene–hopene cyclase is an enzyme involved in a crucial step in the biosynthesis of the natural product hopene; it catalyzes the conversion of squalene into hopene. Of the remaining Terpene BGCs that did not match BGCs in the MIBiG repository, the similarity search against the NCBI Metagenomics proteins (env nr) database revealed that three aligned with hypothetical proteins, one aligned with an unannotated protein, and one aligned with an uncharacterized protein. Terpene BGCs were found to dominate sample BNFC.
This dominance of Terpene BGCs in soil environments is supported by other studies7. Although most secondary metabolites with antimicrobial activities are produced by the assembly machinery of the PKS and NRPS BGCs47,48, Terpene BGCs from various life forms are widely known for their biotechnological applications49.

Overall, the discovery of novel putative PKS and NRPS BGCs, coupled with the abundant occurrence of Terpene BGCs in sample BNFC and the identification of the RiPP BGC in sample BNFW, unambiguously highlights the profound functional potential contained within the microbiomes, particularly in sample BNFC, for secondary metabolite biosynthesis, including that of novel antibiotics.