L-arginine is a ubiquitous proteinogenic amino acid necessary for the basic physiological function of all organisms. De novo L-arginine biosynthesis proceeds via a critical L-ornithine precursor in both animals and bacteria. Distinctly from animals, however, bacterial L-ornithine is generated from a cascade of N-acetylated, rather than non-acetylated, intermediate (see Supplementary Fig. 1)13.
The first committed step in E. coli L-ornithine biosynthesis is the transacetylation of L-glutamate using acetyl-CoA to yield NAG, catalysed by N-acetyl-L-glutamate synthase (NAGS). In the second biosynthetic step, the nascent NAG is phosphorylated to furnish N-acetyl-L-glutamyl 5-phosphate (NAGP); this ATP-dependent conversion is catalysed by aforementioned EcNAGK, encoded in E. coli by the argB gene. Importantly, the binding of ATP to EcNAGK triggers a major inter-domain conformational shift that greatly tightens the active site, thereby permitting the facile phosphoryl transfer from ATP to NAG. This reaction is further facilitated by invariant residues Lys8, Gly11, Gly44, Gly45 and Lys217, which work to correctly orient the substrates and stabilise the transition state. Following phosphoryl transfer, the enzyme relaxes into an open conformation and liberates the products NAGP and ADP in preparation for the next catalytic cycle.
Here we used our PREVENT model (see Supplementary Fig. 2) to replace the wildtype EcNAGK enzyme with new, unseen variants thereof. To do that, we began by performing in silico mutagenesis of the EcNAGK wildtype sequence (Uniprot id: P0A6C8; PDB id: 1GS5) to build the input dataset required to approximate the sequence-to-energy relationship for this enzyme (see “Methods”). With our procedure, we generated 117,387 unique variants, with each variant harbouring up to 38 mutations (≈ 15% of the residues), and then computed the associated free energy, ΔG, using FOLDX14. Each position of the wildtype enzyme, except the first one, is mutated on average in 6.58% of the variants; as expected, most of the introduced mutations are destabilizing, with a significant positive correlation between number of mutations and free energy.
We then split the dataset into a training and validation sets (90%, 106,238 variants) (see Fig. 1A, B) and a held-out test set (10%, 11,149 variants) (see Supplementary Fig. 3) to evaluate the performance of our model. We maintained a similar ΔG distribution in the training and test sets for a fair comparison.
A Thermodynamic landscape approximation (ΔG) as a function of the number of mutations in EcNAGK variants in the training dataset. Black dashed line denotes the free energy of the wildtype EcNAGK. B Amino acid entropy of EcNAGK variants in the training dataset. Every position, except the first methionine, is mutated in 6.58% of the generated variants. C PREVENT estimates of ΔG values against free energy estimates simulated using FOLDX for different training set sizes. RMSE is 9.28 for 100K training set, 9.74 for 75K training set, 11.24 for 50K training set and 12.12 for 25K training set. Spearman correlation coefficient is 95.53% for 100K training set, 95.25% for 75K training set, 93.58% for 50K training set and 91.79% for 25K training set. D concordance at the top (CAT) analysis of sequences ranked using free energy information. PREVENT achieves a concordance 75% with FOLDX simulations regardless of the size of the training set, albeit moderate differences in ranking top sequences are observed when considering the top 100 and 1000 variants.
In order to study how PREVENT performances depend on the size of the input mutagenesis dataset, we further subsampled the initial training dataset by selecting 25K, 50K and 75K sequences at random, albeit maintaining a ΔG distribution similar to the original dataset.
Using these datasets, we then trained our default transformer VAE model (see “Methods”) with 128-dimensional latent space for 1400 epochs using mini batches of 256 sequences, and by setting the learning rate to 10−4 and the dropout probability to 0.2 for regularization. Moreover, to learn effective latent space encodings, we masked 25% of each sequence residues at random. With this experimental setup, we tested our model on 4 different configurations.
We first assessed model performances using a held-out test set by looking both at the sequence reconstruction error and the Root Mean Square Error (RMSE) of the free energy across the 4 experiments. After attaining the prefixed number of epochs, the reconstruction error was 114.80 (perplexity: 0.66) for the full dataset, suggesting that our model is able to effectively reconstruct sequences from the latent space, a trend that was consistent regardless of the size of the training dataset (see Supplementary Table 1). We then looked at the accuracy of the predicted free energy, and obtained the best results when using the full dataset (RMSE = 9.27); nonetheless, even when using only 25K sequences, the model achieves comparable performances (RMSE = 12.12). Importantly, while the predicted energy values might differ from those estimated by FOLDX, they are strongly correlated, with Spearman correlation ranging from ρ = 0.96 when training on the full dataset and only dropping to ρ = 0.92 when using only 25K sequences, suggesting that predicted values are a robust metric for variants ranking. This is further confirmed by the concordance at the top (CAT) analysis of variants ranked by free energy values; in particular, we found a concordance higher than 80% when looking at the top 1000 ranked sequences, a trend observed regardless of the size of the mutagenesis dataset used (see Fig. 1C, D).
We then analyzed the EcNAGK encodings to check whether PREVENT would map different variants to different regions of the latent space. To achieve this, we computed the empirical variance of each latent component, defined as (,{mbox{diag}}({{mbox{cov}}}_{{{{boldsymbol{x}}}}}{{mathbb{E}}}_{{q}_{{{{boldsymbol{phi }}}}}({{{boldsymbol{z}}}}| {{{boldsymbol{x}}}})}[{{{boldsymbol{z}}}}]))15, and considering a component as being active if its standard deviation was greater than 0.1. Here we found that the vast majority (126) of the 128 components to be active, albeit to a different level, suggesting that our model maps variants to different regions of the latent space. We then proceeded with the analysis of the latent space organization by performing Principal Component Analysis (PCA) of the expected value of the variational posterior distribution qϕ(z∣x) for the sequences in the mutagenesis dataset, which revealed a structural organization, where sequences with lower free energy were located in the central part of the latent space, whereas sequences with higher ΔG values were located in the periphery, remotely resembling a folding funnel (see Fig. 2).
A Principal component analysis of the ({{mathbb{E}}}_{{q}_{{{{boldsymbol{phi }}}}}({{{boldsymbol{z}}}}| {{{boldsymbol{x}}}})}[{{{boldsymbol{z}}}}]) embeddings generated by PREVENT for the sequences in the training set and color coded according to the corresponding free energy ΔG values. B Activation plot for the 128-dimensional latent space learned. The x-axis represents each latent component and y-axis the standard deviation of the corresponding value learned during training; the dashed line represents the expected value of each component under a null model of random value assignment.
Taken together, we found that PREVENT is able to effectively reconstruct EcNAGK sequences and the expected free energy from the latent space.
Hyperparameter analysis
We also performed a hyperparameter analysis to understand how the choice of the latent dimension, the transformer size and the number of training epochs affect the performance of PREVENT on the held-out test set. We used our original transformer model configuration as well as another, twice as small, configuration with 4 encoder layers, 2 decoder layers, 256 embedding size and 4 heads. First, we noticed that training for 500 epochs is usually sufficient to obtain a minimal validation loss, as most models converge between 100 and 200 epochs. Then, we confirmed that the choice of the latent dimension does not affect sequence reconstruction error, regardless of the size of the model and latent space. However, the model size and the latent dimension do affect the RMSE of the free energy, with the smaller model with 128-dimensional latent space achieving the lowest RMSE of 6.84 across all experiments. Nevertheless, the Spearman correlation remained high across all models, ranging from 0.97 to 0.98, which is comparable to the original model results, used in all downstream tasks (see Supplementary Table 2).
Sequence analysis of the EcNAGK variants generated by PREVENT
We then proceeded to perform sequence analysis of variants generated by the two PREVENT strategies, namely prior optimization of free energy (POE) and seeded non optimal free energy ranking (SNE). We generated 2000 variants using POE and 30,000 samples for SNE, and then removed those sequences that were i) present in the training set, ii) have mutations in the binding site, or iii) are duplicated.
With this criteria, we obtained 1282 new variants (66%) for POE and 13,323 variants for SNE (44%), which is expected given that the first strategy freely explores the free energy landscape of EcNAGK, whereas the second generates sequences within the region of the wildtype sequence. Importantly, POE variants are more diverse (see Fig. 3A) than SNE, with an average of 7.09 and 3.57 mutations respectively. Surprisingly, we observed that only ≈ 25% of mutations were conservative, as measured by BLOSUM62 substitution matrix; this result suggests that PREVENT might select energetically favorable mutations regardless of evolutionary information (see Fig. 3B). Finally, while the type of mutations was highly similar across strategies, they are located differently across the wildtype EcNAGK sequence (see Fig. 3C); in particular, POE predominantly mutates the C-terminus harboring the ATP-binding domain of the protein, which harbours highly non conservative mutations, such as the tryptophan introduced at position 212 in place of the aspartic acid residue (see Fig. 3D), whereas SNE introduced mutations almost uniformly across the wildtype sequence (see Fig. 3E).
A Distribution of the number of mutations in variants generated using the POE and SNE strategies; the POE strategy generates more diverse variants compared to SNE. B Distribution of the BLOSUM62 substitution matrix scores for the 9083 mutations introduced by the POE strategy and for the 47,573 mutations introduced by the SNE strategy; the vast majority of the mutations are non-conservative. C Residue level entropy for variants obtained using POE and SNE strategies; POE preferentially adds mutations at the C-terminus of the protein. Most frequent mutations in variants generated by the POE strategy (D) and the SNE strategy (E). While SNE variants have a similar mutational profile to the training set, POE variants have a vastly different mutational profile, with surprisingly non-conservative mutations highly represented in all variants.
Taken together, we found that PREVENT can generate highly diverse protein variants, which carry not only conservative mutations but also non conservative ones, which would not be selected by using homology information alone.
Design of a library of EcNAGK variants
We then built a library of EcNAGK variants for experimental validation, in order to estimate the expected fraction of functional variants generated by PREVENT.
To do that, we took the 10 variants with lowest free energy obtained with the POE strategy and the 10 variants with the lowest free energy generated with the SNE strategy. We further augmented this library by adding 10 variants with the highest ELBO, denoted as SNL, and 10 with the highest identity with respect to the wildtype, hereby denoted as SNI, in order to compare our energy-driven design against common generative strategies for protein engineering (see Fig. 4A). Variants in the library have an average free energy ranging from -29.8 kcal/mol for the SNE group, to -18 kcal/mol for the SNL group, with a number of mutations ranging from 6.8 for the SNE group to 1 for the SNL group, mostly located in the C-terminus of the protein for the variants generated by energy-driven strategies (see Fig. 4B). Taken together, we screened a library of 40 new and diverse variants with a significant variability in free energy.
A Free energy and number of mutations of each variant colored by design strategy. B Normalized sequence entropy of variants generated using different design strategies; energy-based strategies (POE, SNE) introduce mutation preferably in the C-terminus of the wildtype protein.
Before proceeding with experimental work, we performed a series of quality control steps, first, by comparing the predicted free energy values of each variant to the FOLDX estimates where, as expected, we found a strong correlation between the two estimates (Spearman correlation; ρ = 0.77, p-value: 5.7 × 10−9). We additionally checked whether the correlation changes a lot when using various model and latent space sizes and found that the correlation remains high across all models, ranging from 0.87 to 0.93 (see Supplementary Table 3).
We then assessed the structural properties of the variants by predicting their three dimensional structure using ESMFOLD, performing inverse folding sequence scoring using the wildtype EcNAGK structure and then downstream dynamics analyses using ensemble Normal Mode Analysis (eNMA). The predicted structures have an average pLDDT of 89.64 (see Fig. 5A), which confirms high accuracy of the predictions, closely resembling the structure of the wildtype enzyme (average RMSD: 1.70Å). This result was further confirmed when scoring our variants against the wildtype EcNAGK structure, where the average inverse folding score of -0.077 (see Fig. 5B, C), suggesting that variants are highly consistent with the wildtype fold albeit to a lesser extent than the wildtype sequence (wildtype log-likelihood: -0.991; average variants log-likelihood: -1.069). We then looked at the dynamics of each variant and, in general, we found strong agreement with the wildtype dynamics (average RMSF: 0.15, (see Fig. 6A)). Nonetheless, we found that the region spanning residues 58 and 65, and 211 and 216, showed increased flexibility, which could potentially lead to phenotypic differences (see Fig. 6B).
A Accuracy of the EcNAGK variants structures predicted by ESMFOLD reported as predicted local distance difference test (pLDDT) at the residue level; regions with pLDDT > 70 are expected to be accurately modelled. B Inverse folding scores of the variants library. C Structure of the top variant for each design strategy, with mutations color coded based on the BLOSUM90 substitution matrix scores.
A Residue level Root Mean Square Fluctuation (RMSF) of EcNAGK variants compared to wildtype. The most significant changes with respect to the wildtype are in the regions comprising residues 58 and 65, and 211 and 216. B Dimer structure of EcNAGK with region of increased flexibility compared to the wildtype colored in red.
Taken together, the overall dynamics of the protein is preserved despite the introduced mutations.
Establishing the auxotrophic screen for EcNAGK
The EcNAGK encoded by the argB gene is conditionally essential for protein synthesis in the absence of exogenous L-arginine (or a suitable precursor). In this context, we hypothesised that argB knockouts could be rescued by expressing EcNAGK (or functional variants) via an auxiliary plasmid, which is supplemented by DNA transformation. Without access to such a plasmid, we reasoned that the auxotroph will not survive in L-arginine-deficient media. Therefore, by using a simple viability screen, it becomes possible to identify functional EcNAGK variants in a manner amenable to high-throughput screening.
To test this hypothesis, we cured (see Supplementary Fig. 4) and transformed commercial BW25113 ΔargB using a bespoke constitutive expression vector encoding the wildtype E. coli argB gene (pKCHU-argB) and the production of EcNAGK was subsequently confirmed by SDS-PAGE (see Supplementary Fig. 5A–C). As anticipated, L-arginine auxotrophy could be remedied in M9 minimal salts media by either the supplementation of L-arginine or by complementation with pKCHU-argB (see Supplementary Fig. 5D). However, using this plasmid-based expression system, we observed that argB knockouts required over 24 h of incubation to emerge from lag phase. Based on our SDS-PAGE analysis, we hypothesised that the strong EcNAGK overexpression was severely penalising cell growth in minimal media. By recoding the argB coding sequence using E. coli codon usage bias data (see Supplementary Fig. 6)16, we measured an average 25-fold reduction in gene expression by RT-qPCR, which was associated with a drastic improvement in growth rate using both solid and liquid minimal media (see Supplementary Fig. 7). Given the clear benefit to cell viability, we decided to backtranslate all our variants using the same codon usage bias strategy.
Screening the EcNAGK variant library
A sequence-perfect library of 40 EcNAGK variants (Neochromosome, Inc.) cloned into our pKCHU expression vector was used to transform BW25113 ΔargB. To this end, we developed a robust and high-throughput heat-shock transformation protocol using an Opentrons OT-2 robot, which allowed us to reliably and reproducibly transform E. coli using both pET23b-EFGP and pKCHU-argB with minimal user input (see Supplementary Fig. 8). Whilst many EcNAGK variants yielded colonies overnight following heat-shock transformation, several more variants exhibited poorer transformation efficiency and required up to 2 consecutive nights of incubation to develop colonies (see Supplementary Fig. 9). Despite numerous attempts, 7 EcNAGK variants (namely poe4, snl3, snl6, snl7, sni10, sne1, sne3) did not produce viable transformants. The observed array of cell viability suggests that the EcNAGK variants are not tolerated equally, and some variants may considerably affect cell fitness.
The remaining 33 viable variants were studied by auxotrophic selection, as previously outlined. Remarkably, 22 of the 33 viable EcNAGK variants could quickly recover the BW25113 ΔargB phenotype in standard M9 minimal salts solid media without L-arginine supplementation. Interestingly, these variants include the top-ranked viable candidates in each model category (poe1, snl1, sni1, sne2). Further 6 variants (poe5, sni9, sne5, sne7, sne8, sne9) facilitated slow-to-intermediate recovery, thus bringing the total count of active variants to 28. The remainder (poe2, poe3, poe8, snl5, snl10) did not rescue the phenotype after 48 hours. Conversely, all cells carrying our variants could proliferate readily and rapidly by supplementing the minimal media with L-arginine (see Fig. 7A).
A Top plate: perfect recovery of transformed EcNAGK variants, including wildtype and ΔargB, in L-arginine rich media. Bottom plate: different recovery rates of transformed EcNAGK variants in the minimal media. B Average growth curves of BW25113 (argB) and EcNAGK transformants grown in auxotrophic M9 minimal salts media. Optical density (OD) measurements were recorded every 2 h for up to 12 h. Error bars represent the standard error of three biological replicates. C The average rate of exponential growth supported by wildtype (EcNAGK) and variants. Error bars represent the standard error of three biological replicates.
We then checked whether using the predicted energy information for designing or prioritizing variants was associated with either a higher number of functional variants or more diverse sequences. 14 out of the 28 viable variants were obtained by using either POE (6 variants) or SNE (8 variants); as expected, since they carry only 1 mutation, 9 out of 10 SNI variants recovered the phenotype, whereas only 5 out of the 10 SNL variants were functional.
It is interesting to note that prioritizing variants by ELBO did yield the lowest number of viable variants, suggesting that it might not be the most effective strategy. Interestingly, functional variants designed using free energy information (POE, SNE) harbour, on average, significantly more mutations (5.64) compared to the 1.64 mutations found on variants selected by sequence only information (Student t-test; t: 4.9173, df: 22.177, p-value = 3.151 × 10−5), suggesting that exploiting free energy information could yield more diverse library with minimal impact on the overall number of functional proteins obtained in a screen.
We subsequently quantified the growth supported by the top candidates for each design strategy (namely poe1, snl1, sni1, and sne2) in a cell density time-course assay. We observed that variants poe1, sni1, and sne2 enabled similar rates of exponential growth compared to the wildtype EcNAGK (see Fig. 7B, C), albeit cells expressing sne2 exhibited a 23−34% shorter lag time compared to all other experiments (see Supplementary Table 7). Conversely, cells expressing variant snl1 doubled 39% slower on average than those expressing the wildtype enzyme.
We then analyzed whether the introduced mutations in the top candidates are located at more buried or more accessible sites in the protein structure. We computed the smoothed hydropathy index17 for each residue in the wildtype and checked the location of the mutations in the top candidates. Around 33% of the mutations are located in the most hydrophobic regions of the protein, which suggests that the mutations are not biased towards the surface of the protein (see Supplementary Fig. 10).
Notably, variant snl1 features a radical and sterically-intrusive Gly123Trp mutation in an otherwise highly-conserved position on β7, which likely destabilizes the β6-β7 hairpin in the NAG-binding subdomain (see Supplementary Fig. 11)18. Furthermore, the expression of variant snl1 was discovered to be 5-fold greater than the wildtype by RT-qPCR analysis (see Supplementary Fig. 12). Therefore, we propose that the observed growth penalty may be attributed to the divergent nature of the Gly123Trp mutation and/or less favorable transcription regulation, which is mediated by our gene re-coding strategy.
In conclusion, 85% of the designed and transformed EcNAGK variants enabled survival on auxotrophic media, with 67% permitting similar recovery levels to the wild-type sequence.
- SEO Powered Content & PR Distribution. Get Amplified Today.
- PlatoData.Network Vertical Generative Ai. Empower Yourself. Access Here.
- PlatoAiStream. Web3 Intelligence. Knowledge Amplified. Access Here.
- PlatoESG. Carbon, CleanTech, Energy, Environment, Solar, Waste Management. Access Here.
- PlatoHealth. Biotech and Clinical Trials Intelligence. Access Here.
- Source: https://www.nature.com/articles/s41467-024-54814-w