Datasets
In this study, we adopt the widely used non-homologous dataset SPOT-1D17 to train the proposed method. SPOT-1D consists of a training set of 10029 protein chains and a validation set of 983 protein chains, none of which exceeds 700 amino acid residues in length. These protein chains were culled from the PISCES server in February 2017 with the constraints of resolution < 2.5 Å, R-free < 1, and sequence identity ≤ 25% according to BlastClust. To evaluate the proposed method, we conduct experiments on six publicly available test sets: TEST2016, TEST2018, TEST2020_HQ, CASP12, CASP13 and CASP-FM. Table 1 summarizes the detailed dataset statistics. TEST2016 consists of 1213 protein chains released in the PDB between June 2015 and February 2017. TEST2018 is composed of 250 high-quality protein chains (resolution < 2.5 Å and R-free < 0.25) released in the PDB between January 2018 and July 2018. Both TEST2016 and TEST2018 share ≤ 25% sequence identity with the training and validation sets. TEST2020_HQ22 consists of 121 protein chains released between May 2018 and April 2020; it was obtained by first removing close and remote homologs of all proteins released before 2018 using HMM models, then filtering with the same constraints as TEST2018, and finally removing protein chains longer than 700 residues. The remaining three test sets, CASP12, CASP13 and CASP-FM, were collected from http://predictioncenter.org/ by the literature35, where CASP stands for Critical Assessment of protein Structure Prediction and FM stands for template-Free Modeling. Note that homology reduction was performed to remove sequences with more than 25% sequence similarity to the training set. CASP12, CASP13 and CASP-FM contain 55, 32 and 56 protein chains, respectively. Of the 56 FM proteins in CASP-FM, 8 are from CASP10, 16 from CASP11, 22 from CASP12, and 10 from CASP13.
Feature representation
To predict protein torsion angles from amino acid sequences, each amino acid residue in a sequence must be converted into a numerical vector. In this study, we adopt eight feature representations to perform this conversion: one base feature representation and seven embedding feature representations. For the base feature representation, the feature matrix of each protein chain consists of a PSSM profile feature, an HMM profile feature and a physicochemical property feature. In particular, for a protein chain of length L, its PSSM profile feature is a matrix of size L × 20, generated by running three iterations of the PSI-BLAST program with default parameters against the UniRef90 database updated in April 2018. The corresponding HMM profile feature is a matrix of size L × 30, generated by running HHblits v3.0.3 against the Uniprot database updated in October 2017. The physicochemical property feature is a matrix of size L × 7 comprising seven property values for each amino acid: hydrophobicity, van der Waals volume, isoelectric point, sheet probability, helix probability, polarizability and graph shape index. Thus, the base feature is a matrix of size L × 57. For each embedding feature representation, the feature matrix is generated by feeding the amino acid sequence of a given protein chain into a pretrained protein language model, which is obtained by training a Transformer on a large-scale non-redundant protein sequence dataset with self-supervised learning. To investigate the impact of different embedding feature representations on torsion angle prediction performance, we constructed seven different embedding features based on the seven pretrained protein language models prot_t5_xl_uniref5020, ankh-base36, ankh-large36, esm-1b21, esm2_t33_650M_UR50D37, esm2_t36_3B_UR50D37 and esm2_t48_15B_UR50D37, whose embedding feature dimensions are 1024, 768, 1536, 1280, 1280, 2560, and 5120, respectively. Among these seven pretrained protein language models, the first three use the T5 model based on an encoder-decoder architecture, while the last four employ the BERT model based on an encoder-only architecture. For the T5 models, we use only their encoders to generate embedding features.
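To make the embedding extraction step concrete, the snippet below is a minimal sketch of generating per-residue embeddings with a pretrained protein language model through the HuggingFace transformers API. The checkpoint name, tokenization details and special-token trimming shown here are assumptions for illustration and are not taken from the paper's code.

```python
# Hedged sketch: per-residue embeddings from a pretrained protein language model.
# The checkpoint "facebook/esm2_t33_650M_UR50D" and the BOS/EOS trimming are
# assumptions; the paper's exact extraction pipeline may differ.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "facebook/esm2_t33_650M_UR50D"            # assumed HuggingFace checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"        # toy amino acid sequence
with torch.no_grad():
    inputs = tokenizer(sequence, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state        # shape (1, L + 2, 1280)

embedding = hidden[0, 1:-1]                           # drop special tokens -> (L, 1280)
print(embedding.shape)
```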
Experimental settings
All our experiments are conducted with PyTorch. For training, we use the AdamW optimizer with a learning rate of 1e-3, a weight decay of 1e-5, and a batch size of 32. Unless otherwise specified, the indirect prediction loss with α = 2 and the embedding representation generated by ankh-large serve as our default loss function and feature, respectively. For the proposed method, the default values of the hyperparameters R and l are set to 96 and 2, respectively, and the dropout ratio is set to 0.2. For embedding features with dimension greater than 1024, the output channel parameter C is set to the feature dimension; otherwise, C is set to 1024. To evaluate the performance of different protein torsion angle prediction methods, we adopt the widely used mean absolute error (MAE). Let y and z be the observed and predicted values of a given torsion angle, both in the range [-180, 180]; the absolute error is then defined as min(|z - y|, 360 - |z - y|). For a given test set, the final reported MAE is the average of the absolute errors over all valid torsion angles. Moreover, to alleviate overfitting and reduce training costs, we adopt an early stopping strategy with a patience of 5 based on the average MAE of the two torsion angles on the validation set.
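As a concrete illustration of this periodic error measure, the short sketch below computes the MAE with the wrap-around term min(|z - y|, 360 - |z - y|). Variable names are ours, and masking of invalid angles (e.g., at chain termini) is omitted.

```python
# Sketch of the periodic mean absolute error for torsion angles (degrees).
# Angles are assumed to lie in [-180, 180]; invalid-angle masking is omitted.
import numpy as np

def angular_mae(y_true, y_pred):
    diff = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    return float(np.mean(np.minimum(diff, 360.0 - diff)))

# Example: a wrap-around case (170 vs -175 differs by 15 degrees, not 345).
print(angular_mae([170.0, -90.0], [-175.0, -80.0]))  # -> 12.5
```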
Comparison to state-of-the-art methods
In this section, we compare PHAngle with nine representative torsion angle prediction networks, BiLSTM22, MS-ResNet22, MS-Res-LSTM22, OPUS-TASS18, ESIDEN19, NetSurfP3.023, DeepRIN16, SAINT-Angle (ProtTrans)24 and SAINT-Angle (Residual)24, on the six test sets TEST2016, TEST2018, TEST2020_HQ, CASP12, CASP13, and CASP-FM. Among these nine networks, the first three were first proposed in the literature38 and were subsequently used to construct the ensemble prediction methods SPOT-1D17 and SPOT-1D-LM22. BiLSTM is a two-layer bidirectional Long Short-Term Memory (LSTM) network. MS-ResNet is a multiscale residual network with a pre-activation architecture. MS-Res-LSTM is a hybrid network combining a bidirectional LSTM and a multiscale residual network. OPUS-TASS is a hybrid network stacking a convolutional network, a Transformer network and a bidirectional LSTM. ESIDEN is a prediction network consisting of three LSTM modules. NetSurfP3.0 is constructed by stacking two 1-dimensional convolutional layers with large kernels (sizes 129 and 257) and two bidirectional LSTM layers. DeepRIN is a deep residual inception network capable of capturing local and global interactions between amino acids. SAINT-Angle (ProtTrans) is a network based on the attention augmented inception-inside-inception (2A3I) module35, while SAINT-Angle (Residual) adopts a novel RES-2A3I module, obtained by extending the 2A3I module with residual connections in each of the inception and self-attention modules. For a fair comparison, all methods should use the same training data and feature representation. Therefore, we trained all existing networks on the SPOT-1D dataset according to their default training settings and used the embedding features generated by the pretrained protein language model ankh-large as the input for each network. In particular, to isolate the impact of network architecture on torsion angle prediction performance, we did not use techniques such as ensemble learning and multitask learning.
The comparison results on the six test sets are given in Tables 2 and 3. As can be seen from Table 2, the performance of the proposed method PHAngle on the three larger test sets TEST2016, TEST2018 and TEST2020_HQ is significantly better than that of the nine existing methods. On the largest test set, TEST2016, PHAngle’s mean absolute errors for the torsion angles φ and ψ are 0.36 and 0.47 lower, respectively, than those of the second-best method DeepRIN. On the three smaller test sets CASP12, CASP13, and CASP-FM, PHAngle outperforms the nine existing methods in most cases. Note that although SAINT-Angle (Residual) is a complex convolutional network incorporating the inception-inside-inception module, an attention mechanism, and residual connections, it is clearly inferior to the relatively simple networks DeepRIN and PHAngle in terms of torsion angle prediction performance. This shows that blindly increasing network complexity does not improve torsion angle prediction accuracy. In addition, Table 4 lists the model sizes of the proposed method and the nine state-of-the-art methods. As can be seen from the table, the model size of PHAngle is only 14.6 MB, significantly smaller than that of the other nine methods. This means that the proposed method achieves state-of-the-art performance with the fewest parameters.
In addition, for the proposed method PHAngle, we can replace the network layers PHLinear and PHConv1D in its backbone with standard fully connected layers and 1-dimensional convolutional layers, respectively; we refer to the resulting method as PHAngle (standard). As can be seen from Tables 2 and 3, PHAngle (standard) is only slightly better than PHAngle in predicting the torsion angle φ on the test sets TEST2016 and CASP12, while the latter outperforms the former in most cases. Moreover, the model size of PHAngle (standard) is 56.5 MB, which is 3.87 times that of the PHAngle model. This implies that the proposed parameter-efficient model maintains, and even improves, torsion angle prediction performance while reducing network parameter redundancy.
To objectively evaluate the proposed method, we further performed a 10-fold cross-validation experiment. Specifically, we first merged the training and validation sets of SPOT-1D and TEST2016 into a single dataset containing 12225 protein chains. We then randomly divided this dataset into 10 subsets, where each of the first 5 subsets contains 1223 protein chains and each of the last 5 subsets contains 1222 protein chains. Finally, we iteratively selected each subset as the test set; for the 9 subsets remaining after selecting a test set, we used one as the validation set and combined the other 8 as the training set. Table 5 gives the mean and standard deviation of the proposed method and the nine state-of-the-art methods over the 10-fold cross-validation experiment. As can be seen from the table, PHAngle consistently outperforms the other nine methods.
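For clarity, the following sketch reproduces the splitting protocol just described: one subset serves as the test set, one of the remaining nine as the validation set, and the other eight as the training set. The chain identifiers and the rule for picking the validation subset are placeholders, not the paper's actual code.

```python
# Hedged sketch of the 10-fold protocol: 12225 chains shuffled into 10 subsets
# (five of 1223 chains and five of 1222), with rotating test/validation/training roles.
import random

chains = [f"chain_{i}" for i in range(12225)]        # placeholder identifiers
random.seed(0)
random.shuffle(chains)
folds = [chains[i::10] for i in range(10)]           # sizes 1223 (first 5) and 1222 (last 5)

for k in range(10):
    test_set = folds[k]
    val_set = folds[(k + 1) % 10]                    # one of the remaining nine subsets
    train_set = [c for i, f in enumerate(folds)
                 if i not in (k, (k + 1) % 10) for c in f]
    # train on train_set, early-stop on val_set, report MAE on test_set
```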
Ablation study
To analyze the impact of feature representation, hyperparameter R, channel size, network depth and loss function on the performance of the proposed method, we conduct extensive experiments on the TEST2016 test set and the validation set. In each experiment, we change only the parameter of interest while keeping the other hyperparameters at their default values.
Impact of feature representation To investigate the impact of different feature representations on the torsion angle prediction performance of the proposed method, we perform comparative experiments on the test set TEST2016. Note that the dimension of the base feature is 57, significantly lower than that of the seven embedding features. For fairness of comparison, we first project it to 1024 dimensions using a fully connected layer (illustrated in the sketch below) and then feed the projected feature into our backbone network. We use the names of the pretrained models to denote the types of embedding features. The mean absolute errors of the proposed method under the eight feature representations for the torsion angles φ and ψ are shown in Fig. 4. As can be seen from the figure, the embedding feature esm-1b yields the worst prediction performance, and its prediction errors on both torsion angles are consistently higher than those of the other seven feature representations. In terms of prediction error, the base feature is only marginally better than esm-1b but significantly inferior to the other six embedding features. Among the eight feature representations, ankh-large achieves the best prediction performance, which is why we use it as the default feature representation. The mean absolute errors of ankh-large on the torsion angles φ and ψ are 0.51 and 0.8 lower, respectively, than those of esm2_t48_15B_UR50D. Note that the pretrained model esm2_t48_15B_UR50D is 60.5 GB in size, while ankh-large is 11.4 GB. This suggests that embedding features generated by a larger pretrained model are not guaranteed to yield better torsion angle prediction performance. Moreover, the prediction performance of the remaining five embedding features decreases in the order esm2_t36_3B_UR50D, esm2_t33_650M_UR50D, ankh-base, prot_t5_xl_uniref50, and esm-1b.
The mean absolute errors of the proposed method under different feature representations on the test set TEST2016.
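A minimal sketch of the base-feature projection mentioned above (mapping the L × 57 base feature to 1024 dimensions before the backbone); the layer and tensor names are illustrative only.

```python
# Hedged sketch: project the L x 57 base feature to L x 1024 with a fully
# connected layer so it matches the backbone's expected input width.
import torch
import torch.nn as nn

project = nn.Linear(57, 1024)
base_feature = torch.randn(1, 300, 57)    # (batch, sequence length L, 57)
backbone_input = project(base_feature)    # (1, 300, 1024)
print(backbone_input.shape)
```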
Impact of the hyperparameter R The hyperparameter R controls the number of parameters in the network layers PHLinear and PHConv1D. The mean absolute errors of the proposed method PHAngle under different R values are shown in Fig. 5. We can see from the figure that the mean absolute errors of the torsion angles φ and ψ on the validation set are lowest at R = 96; therefore, we set the default value of R to 96. In addition, when R > 96, further increasing R does not guarantee a reduction in the prediction errors of the torsion angles, while it does increase the model size of the proposed method.
The mean absolute errors of the proposed method under different R values on the validation set and test set TEST2016.
Impact of the channel size To explore the impact of channel size on the performance of PHAngle, we perform comparative experiments with C values of 512, 768, 1024, 1280, 1536, 1792, and 2048. The mean absolute errors of PHAngle on the validation set and the test set TEST2016 are shown in Fig. 6. As can be seen from the figure, when C ≥ 1024, the MAEs of the two torsion angles on both the validation set and the test set fluctuate within a range of 0.1. This means that PHAngle’s torsion angle prediction performance is not sensitive to changes in channel size.
The mean absolute errors of the proposed method at different channel sizes on the validation set and test set TEST2016.
Impact of the network depth The hyperparameter l controls the depth of our backbone network. The prediction errors of the proposed method under varying l values are shown in Fig. 7. As can be seen from the figure, the prediction errors of the torsion angles φ and ψ on the validation set both reach their minimum at l = 3, and further increasing the network depth does not guarantee a reduction in the prediction errors. Note that for inference speed considerations, we set the default value of l to 2.
The mean absolute errors of the proposed method at different network depths on the validation set and test set TEST2016.
Impact of the loss function In this work, three loss functions can be used to train the proposed method: the direct prediction loss, the indirect prediction loss (α = 1) and the indirect prediction loss (α = 2). To investigate their impact on the performance of PHAngle, we perform comparative experiments on the validation set and the test set TEST2016. The prediction results of PHAngle under the different loss functions are shown in Table 6. It can be observed that the indirect prediction loss (α = 1) outperforms both the direct prediction loss and the indirect prediction loss (α = 2). Nevertheless, since many existing works, such as SPOT-1D17, DeepRIN16, and OPUS-TASS18, adopt the mean squared error to define the regression loss of torsion angles, we use the indirect prediction loss (α = 2) as the default loss function. Moreover, compared with indirect prediction, the advantage of direct prediction is that there is no need to compute the sine and cosine values of the angles during training and evaluation.
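As one plausible reading of these losses (not the paper's exact formulation), the sketch below defines an indirect prediction loss that penalizes sine/cosine residuals raised to the power α and recovers angles via atan2 for evaluation; a direct prediction loss would instead operate on the angle values themselves. Function and variable names are ours.

```python
# Hedged sketch of an indirect prediction loss on torsion angles: the network
# predicts [sin, cos] pairs, the loss is the mean |residual|^alpha (alpha = 1 or 2),
# and angles are recovered with atan2 for evaluation. Masking of invalid angles
# is omitted; this is an illustration, not the paper's exact implementation.
import torch

def indirect_loss(pred_sincos, true_angles_deg, alpha=2):
    rad = torch.deg2rad(true_angles_deg)
    target = torch.stack([torch.sin(rad), torch.cos(rad)], dim=-1)  # (N, 2)
    return (pred_sincos - target).abs().pow(alpha).mean()

def angles_from_sincos(pred_sincos):
    return torch.rad2deg(torch.atan2(pred_sincos[..., 0], pred_sincos[..., 1]))
```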