
QSPR/QSAR study of antiviral drugs modeled as multigraphs by using TI’s and MLR method to treat COVID-19 disease – Scientific Reports

Computation of M-polynomial and NM-polynomial of Lopinavir

In this section, we present the main computational results of our study. We analyze the molecular multigraph of lopinavir and derive its M-polynomial and NM-polynomial, as stated in the theorem below. We then extend the analysis to seven additional molecular drug structures, obtaining the M-polynomial and NM-polynomial of each; the corresponding values can be found in Table 3. Only the computation for lopinavir is shown in detail. Fig. 2 shows the molecular multigraph of lopinavir, and Fig. 3 shows the 3D plots of its M-polynomial and NM-polynomial. The differences in the surface patterns of the two plots imply that the degree-based and neighborhood degree-based topological indices derived from these polynomials will also differ in their numerical values and interpretations. Determining whether one index is superior to another requires further analysis, such as comparing their performance in QSPR/QSAR models, evaluating their correlation coefficients with experimental data, and assessing their ability to discriminate between different molecular structures.

Figure 2

Molecular multigraph of Lopinavir.

Theorem 1

Let \(\mathscr{L}\) be the molecular multigraph of Lopinavir. Then we have,

$$\begin{aligned} M(\mathscr{L};x,y) &= 3xy^{3}+2xy^{4}+4x^{2}y^{2}+7x^{2}y^{3}+13x^{2}y^{4}+18x^{3}y^{3}+11x^{3}y^{4}+3x^{4}y^{4} \\ NM^{*}(\mathscr{L};x,y) &= 2x^{3}y^{5}+x^{3}y^{6}+x^{4}y^{4}+x^{4}y^{5}+3x^{4}y^{6}+4x^{4}y^{7}+2x^{4}y^{8}+x^{5}y^{9}+x^{5}y^{10}+10x^{6}y^{6}+14x^{6}y^{7}+x^{6}y^{10} \\ &\quad +3x^{7}y^{7}+11x^{7}y^{8}+x^{7}y^{9}+x^{7}y^{10}+3x^{8}y^{10}+x^{9}y^{10} \end{aligned}$$

Proof

Consider \(\mathscr{L}\), the molecular multigraph representing Lopinavir (see Fig. 2). It has a total of 61 edges. Let \(\Gamma_{(j,k)}\) denote the set of edges whose endpoints have degrees j and k, respectively, i.e. \(\Gamma_{(j,k)} = \{uv \in E(\mathscr{L}): \Delta(u) = j, \Delta(v) = k\}\). Let \(m_{(j,k)}\) be the number of edges in \(\Gamma_{(j,k)}\). From Fig. 2 it is clear that \(m_{(1,3)} = 3, m_{(1,4)} = 2, m_{(2,2)} = 4, m_{(2,3)} = 7, m_{(2,4)} = 13, m_{(3,3)} = 18, m_{(3,4)} = 11, m_{(4,4)} = 3\). To derive the M-polynomial of \(\mathscr{L}\), we use Eq. (1).

$$\begin{aligned} M(\mathscr{L};x,y) &= \sum_{j \le k} m_{(j,k)}x^{j}y^{k} \\ &= m_{(1,3)}x^{1}y^{3}+ m_{(1,4)}x^{1}y^{4}+ m_{(2,2)}x^{2}y^{2} + m_{(2,3)}x^{2}y^{3}+ m_{(2,4)}x^{2}y^{4} \\ &\quad + m_{(3,3)}x^{3}y^{3} + m_{(3,4)}x^{3}y^{4}+ m_{(4,4)}x^{4}y^{4}. \end{aligned}$$

By using the values of \(m_{(j,k)}\), we get

$$\begin{aligned} M(\mathscr{L};x,y) = 3xy^{3}+2xy^{4}+4x^{2}y^{2}+7x^{2}y^{3}+13x^{2}y^{4}+18x^{3}y^{3}+11x^{3}y^{4}+3x^{4}y^{4}. \end{aligned}$$
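The edge-partition step can be illustrated on a small example. The following is a minimal sketch, not the authors' procedure: it assumes the multigraph is given as an edge list in which a double bond appears as two parallel edges, each counted as its own edge (consistent with the 61-edge count above). The toy graph and the function names are hypothetical.

```python
from collections import Counter


def edge_partition(edges):
    """Return {(j, k): count} with j <= k for a multigraph edge list.

    Assumption: parallel edges are listed as repeated pairs, so each one
    raises the degree of both endpoints and is counted as a separate edge.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    partition = Counter()
    for u, v in edges:
        j, k = sorted((degree[u], degree[v]))
        partition[(j, k)] += 1
    return dict(partition)


def m_polynomial_terms(partition):
    """Format an edge partition as a readable M-polynomial string."""
    return " + ".join(
        f"{m}*x^{j}*y^{k}" for (j, k), m in sorted(partition.items())
    )


# Hypothetical toy multigraph: a double bond a=b plus a single bond b-c.
toy_edges = [("a", "b"), ("a", "b"), ("b", "c")]
toy_partition = edge_partition(toy_edges)
print(m_polynomial_terms(toy_partition))  # 1*x^1*y^3 + 2*x^2*y^3
```

Here the degrees are 2, 3 and 1 for a, b and c, so the two parallel edges fall in the class (2,3) and the single bond in (1,3).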

Let \(\Gamma^{*}_{(j,k)}\) denote the set of all edges whose endpoints have neighborhood degree sums j and k, respectively, i.e. \(\Gamma^{*}_{(j,k)}\) is the set of edges \(uv \in E(\mathscr{L})\) such that the neighborhood degree sum of u is j and that of v is k. Let \(nm^{*}_{(j,k)}\) be the number of edges in \(\Gamma^{*}_{(j,k)}\). From Fig. 2 it is clear that \(nm^{*}_{(3,5)} = 2, nm^{*}_{(3,6)} = 1, nm^{*}_{(4,4)} = 1, nm^{*}_{(4,5)} = 1, nm^{*}_{(4,6)} = 3, nm^{*}_{(4,7)} = 4, nm^{*}_{(4,8)} = 2, nm^{*}_{(5,9)} = 1, nm^{*}_{(5,10)} = 1, nm^{*}_{(6,6)} = 10, nm^{*}_{(6,7)} = 14, nm^{*}_{(6,10)} = 1, nm^{*}_{(7,7)} = 3, nm^{*}_{(7,8)} = 11, nm^{*}_{(7,9)} = 1, nm^{*}_{(7,10)} = 1, nm^{*}_{(8,10)} = 3, nm^{*}_{(9,10)} = 1\). To derive the NM-polynomial of \(\mathscr{L}\), we use Eq. (2).

$$\begin{aligned} NM^{*}(\mathscr{L};x,y) &= \sum_{j \le k} nm^{*}_{(j,k)}x^{j}y^{k} \\ &= nm^{*}_{(3,5)}x^{3}y^{5}+ nm^{*}_{(3,6)}x^{3}y^{6}+ nm^{*}_{(4,4)}x^{4}y^{4}+ nm^{*}_{(4,5)}x^{4}y^{5}+ nm^{*}_{(4,6)}x^{4}y^{6} + nm^{*}_{(4,7)}x^{4}y^{7} + nm^{*}_{(4,8)}x^{4}y^{8} \\ &\quad +nm^{*}_{(5,9)}x^{5}y^{9}+nm^{*}_{(5,10)}x^{5}y^{10}+nm^{*}_{(6,6)}x^{6}y^{6}+nm^{*}_{(6,7)}x^{6}y^{7}+nm^{*}_{(6,10)}x^{6}y^{10} +nm^{*}_{(7,7)}x^{7}y^{7} + nm^{*}_{(7,8)}x^{7}y^{8} \\ &\quad +nm^{*}_{(7,9)}x^{7}y^{9}+nm^{*}_{(7,10)}x^{7}y^{10}+nm^{*}_{(8,10)}x^{8}y^{10}+nm^{*}_{(9,10)}x^{9}y^{10}. \end{aligned}$$

Substituting the values of \(nm^{*}_{(j,k)}\) gives the NM-polynomial stated in the theorem. The M-polynomial and NM-polynomial are then used to derive a range of ‘D’ and ‘NBD’ TI’s for the molecular multigraph representing Lopinavir. These findings are summarized in the following theorem. \(\square\)
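As a quick consistency check on Theorem 1, both polynomials evaluated at x = y = 1 should return the number of edges, since every edge belongs to exactly one class of each partition. The sketch below copies the coefficients from the theorem; the dictionary and function names are illustrative only.

```python
# Coefficients m_(j,k) of the M-polynomial of Lopinavir (Theorem 1).
M_PARTITION = {
    (1, 3): 3, (1, 4): 2, (2, 2): 4, (2, 3): 7,
    (2, 4): 13, (3, 3): 18, (3, 4): 11, (4, 4): 3,
}

# Coefficients nm*_(j,k) of the NM-polynomial of Lopinavir (Theorem 1).
NM_PARTITION = {
    (3, 5): 2, (3, 6): 1, (4, 4): 1, (4, 5): 1, (4, 6): 3, (4, 7): 4,
    (4, 8): 2, (5, 9): 1, (5, 10): 1, (6, 6): 10, (6, 7): 14, (6, 10): 1,
    (7, 7): 3, (7, 8): 11, (7, 9): 1, (7, 10): 1, (8, 10): 3, (9, 10): 1,
}


def evaluate(partition, x, y):
    """Evaluate sum over (j, k) of coefficient * x**j * y**k."""
    return sum(m * x**j * y**k for (j, k), m in partition.items())


edges_from_m = evaluate(M_PARTITION, 1, 1)
edges_from_nm = evaluate(NM_PARTITION, 1, 1)
print(edges_from_m, edges_from_nm)  # 61 61
```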

Theorem 2

Let \(\mathscr{L}\) be the molecular multigraph of Lopinavir. Then the values of its topological indices listed in Table 3 hold.

Figure 3

3D-plot generation of (a) M-polynomial and (b) NM-polynomial of Lopinavir.

Proof

Initially, we determine the degree-based indices by referring to Table 1. Let \(M(\mathscr{L};x,y) = t(x,y) = 3xy^{3}+2xy^{4}+4x^{2}y^{2}+7x^{2}y^{3}+13x^{2}y^{4}+18x^{3}y^{3}+11x^{3}y^{4}+3x^{4}y^{4}\). Then we have,

  1. \(M_1(\mathscr{L}) = (D_x+D_y)t(x,y)|_{x=y=1} = 12xy^{3}+10xy^{4}+16x^{2}y^{2}+35x^{2}y^{3}+78x^{2}y^{4}+108x^{3}y^{3}+77x^{3}y^{4}+24x^{4}y^{4} = 360\)

  2. \(M_2(\mathscr{L}) = (D_xD_y)t(x,y)|_{x=y=1} = 9xy^{3}+8xy^{4}+16x^{2}y^{2}+42x^{2}y^{3}+104x^{2}y^{4}+162x^{3}y^{3}+132x^{3}y^{4}+48x^{4}y^{4} = 521\)

  3. \(mM_2(\mathscr{L}) = S_xS_yt(x,y)|_{x=y=1} = xy^{3}+\frac{2}{4}xy^{4}+x^{2}y^{2}+\frac{7}{6}x^{2}y^{3}+\frac{13}{8}x^{2}y^{4}+\frac{18}{9}x^{3}y^{3}+\frac{11}{12}x^{3}y^{4}+\frac{3}{16}x^{4}y^{4} = 8.3958\)

  4. \(ReZG_3(\mathscr{L}) = D_xD_y(D_x+D_y)t(x,y)|_{x=y=1} = 36xy^{3}+40xy^{4}+64x^{2}y^{2}+210x^{2}y^{3}+624x^{2}y^{4}+972x^{3}y^{3}+924x^{3}y^{4}+384x^{4}y^{4} = 3254\)

  5. \(F(\mathscr{L}) = (D_x^{2}+D_y^{2})t(x,y)|_{x=y=1} = 30xy^{3}+34xy^{4}+32x^{2}y^{2}+91x^{2}y^{3}+260x^{2}y^{4}+324x^{3}y^{3}+275x^{3}y^{4}+96x^{4}y^{4} = 1142\)

  6. \(SDD(\mathscr{L}) = (S_xD_y+S_yD_x)t(x,y)|_{x=y=1} = \frac{30}{3}xy^{3}+\frac{34}{4}xy^{4}+\frac{32}{4}x^{2}y^{2}+\frac{91}{6}x^{2}y^{3}+\frac{260}{8}x^{2}y^{4}+\frac{324}{9}x^{3}y^{3}+\frac{275}{12}x^{3}y^{4}+\frac{96}{16}x^{4}y^{4} = 139.0833\)

  7. \(H(\mathscr{L}) = 2S_xJt(x,y)|_{x=1} = 2\left(\frac{7}{4}x^{4}+\frac{9}{5}x^{5}+\frac{31}{6}x^{6}+\frac{11}{7}x^{7}+\frac{3}{8}x^{8}\right) = 21.3262\)

  8. \(I(\mathscr{L}) = S_xJD_xD_yt(x,y)|_{x=1} = \frac{25}{4}x^{4}+\frac{50}{5}x^{5}+\frac{266}{6}x^{6}+\frac{132}{7}x^{7}+\frac{48}{8}x^{8} = 85.4405\)

  9. \(A(\mathscr{L}) = S_x^{3}Q_{-2}JD_x^{3}D_y^{3}t(x,y)|_{x=1} = 42.125x^{2}+60.7407x^{3}+309.0313x^{4}+152.064x^{5}+56.8889x^{6} = 620.8499\)

  10. \(R_{\alpha}(\mathscr{L}) = D_x^{\alpha}D_y^{\alpha}t(x,y)|_{x=y=1} = 3(3)^{\alpha}+2(4)^{\alpha}+4(4)^{\alpha}+7(6)^{\alpha}+13(8)^{\alpha}+18(9)^{\alpha}+11(12)^{\alpha}+3(16)^{\alpha} = 22.1114\) for \(\alpha = -\frac{1}{2}\)
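The ten values above can be reproduced directly from the edge partition, because applying each operator to a term \(m_{(j,k)}x^{j}y^{k}\) and evaluating at 1 simply multiplies \(m_{(j,k)}\) by a weight that depends on j and k. The following is a minimal sketch of that shortcut, not the authors' code; the weights are read off from the operator expressions in the list, and \(\alpha = -\frac{1}{2}\) is assumed for \(R_{\alpha}\) because that choice reproduces 22.1114.

```python
from math import sqrt

# Edge partition m_(j,k) of the Lopinavir multigraph (from the proof of Theorem 1).
M_PARTITION = {
    (1, 3): 3, (1, 4): 2, (2, 2): 4, (2, 3): 7,
    (2, 4): 13, (3, 3): 18, (3, 4): 11, (4, 4): 3,
}


def index(partition, weight):
    """Sum m_(j,k) * weight(j, k) over all edge classes."""
    return sum(m * weight(j, k) for (j, k), m in partition.items())


# Per-edge weights implied by the operators applied to x^j y^k.
WEIGHTS = {
    "M1": lambda j, k: j + k,
    "M2": lambda j, k: j * k,
    "mM2": lambda j, k: 1 / (j * k),
    "ReZG3": lambda j, k: j * k * (j + k),
    "F": lambda j, k: j**2 + k**2,
    "SDD": lambda j, k: (j**2 + k**2) / (j * k),
    "H": lambda j, k: 2 / (j + k),
    "I": lambda j, k: j * k / (j + k),
    "A": lambda j, k: (j * k / (j + k - 2)) ** 3,
    "R(-1/2)": lambda j, k: 1 / sqrt(j * k),  # assumed alpha = -1/2
}

indices = {name: index(M_PARTITION, w) for name, w in WEIGHTS.items()}
for name, value in indices.items():
    print(f"{name:8s} {value:.4f}")
```

The same `index` helper would apply to the NM-polynomial coefficients with j and k read as neighborhood degree sums.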

Next, we compute the neighborhood degree sum-based indices by taking \(NM^{*}(\mathscr{L};x,y) = t(x,y) = 2x^{3}y^{5}+x^{3}y^{6}+x^{4}y^{4}+x^{4}y^{5}+3x^{4}y^{6}+4x^{4}y^{7}+2x^{4}y^{8}+x^{5}y^{9}+x^{5}y^{10}+10x^{6}y^{6}+14x^{6}y^{7}+x^{6}y^{10}+3x^{7}y^{7}+11x^{7}y^{8}+x^{7}y^{9}+x^{7}y^{10}+3x^{8}y^{10}+x^{9}y^{10}\). Applying the operators of Table 1 to this polynomial, which is built from the edge partition \(\Gamma^{*}_{(j,k)}\), yields the ‘NBD’ indices in the same way, thus concluding the proof. The obtained values of the ‘D’ and ‘NBD’ indices, calculated using the M-polynomial and NM-polynomial, are displayed in Tables 3 and 4, respectively. \(\square\)

Table 3 Selected antiviral drugs with degree based TI’s.
Table 4 Selected antiviral drugs with neighborhood degree sum based TI’s.

QSPR analysis of selected antiviral drugs with its target properties

Regression analyses

Table 5 Correlation coefficients (r) of degree based indices and the physicochemical properties of antiviral drugs modeled as molecular multigraphs using linear regression model.
Table 6 Correlation coefficients (r) between ‘NBD’ and the physicochemical properties of antiviral drugs, modeled as molecular multigraphs using a linear regression model.

To clarify the physical significance of our results, we have included concise discussions on the effectiveness of the computed topological indices. These quantitative measures reveal key structural attributes, with higher values indicating enhanced stability and lower reactivity, and lower values suggesting potential reactivity sites. Our study validates the predictive power of these indices by demonstrating strong correlations with experimental properties, supporting their use in understanding structure-property relationships and guiding drug design and development. We highlight the practical applications in drug delivery and material design while acknowledging the need to consider molecular context and explore advanced methods for improved accuracy. The correlations between the ‘D’ and ‘NBD’ based TI’s and the physicochemical properties of the antiviral drugs (COVID-19 drugs) are given in Tables 5 and 6. From Table 5 we observe that the inverse sum indeg index (predictor variable) shows a strong positive relationship with boiling point (outcome variable), as depicted in Fig. 4.

Figure 4

Inverse sum indeg index versus predicted boiling point.

Figure 5

Comparison chart of ‘r’ values for multigraph versus simple graph: ‘D’.

From Fig. 5 we observe that the correlation coefficients ‘r’ for surface tension (ST), molar refractivity (MR), molar volume (MV) and polarizability (P) are higher for the multigraph representation of the selected antiviral drugs than for the simple graph representation. The existence of a double bond in a molecule can greatly impact its properties, including polarity, conjugation, and reactivity. These changes, in turn, can impact the molecule’s solubility, stability, and biological activity. For example, when a molecule contains a double bond, it introduces regions of different electron density, resulting in a shift in polarity. The presence of the double bond can make the molecule more polar or less polar depending on the surrounding atoms and functional groups. We observe that molecular multigraphs can provide a more detailed and nuanced representation of the chemical structure. For comparison, the highest correlation coefficients ‘r’ obtained with degree based indices for the simple graph representation of seven drugs in ref. 11 are r = 0.9709 for MR, r = 0.9710 for P, r = 0.5115 for ST and r = 0.9108 for MV. The corresponding high ‘r’ values for the molecular multigraph are shown in Table 5 as bold values marked with an asterisk (*). Similarly, from Table 6 we observe that the neighborhood inverse sum indeg index (NI) (predictor variable) shows a strong positive relationship with boiling point (outcome variable), as depicted in Fig. 6.

Figure 6

Neighborhood inverse sum indeg index versus predicted boiling point.

Figure 7

Comparison chart of ‘r’ values for multigraph versus simple graph: ‘NBD’.

From Fig. 7 we observe that the correlation coefficients ‘r’ for flash point (FP) and surface tension (ST) are higher for the multigraph representation of the selected antiviral drugs than for the simple graph representation. For comparison, the highest correlation coefficients ‘r’ obtained with neighborhood degree sum based indices for the simple graph representation of seven drugs in ref. 11 are r = 0.9629 for FP and r = 0.6682 for ST. The corresponding high ‘r’ values for the molecular multigraph are shown in Table 6 as bold values marked with an asterisk (*).

Note: We also observed that, for both ‘D’ and ‘NBD’ based indices, the highest correlation values for the multigraph are nearly identical to those for the simple graph. For example, ref. 11 reports BP with 0.9920 and E with 0.9887 for simple graphs, whereas for multigraphs we obtain BP with 0.9864 and E with 0.9827. The correlation values thus differ only slightly, and some multigraph values are higher than their simple graph counterparts. However, when there is a low correlation between chemical structure descriptors and a target property, it suggests that additional factors may play a more significant role in determining the target property. Further analysis or experimentation might be necessary to identify and understand those factors.

QSAR analyses of biological activity \(pIC_{50}\) versus degree based & nbd degree sum-based indices as predictors

In this section, we used IBM SPSS Statistics Version 27.0.1.0 (available at https://www.ibm.com/support/pages/downloading-ibm-spss-statistics-27010) to carry out multiple linear regression analyses. \(pIC_{50}\) values were used as the dependent variable and several ‘D’ and ‘NBD’ based indices (see Table 1) were used as independent variables. \(IC_{50}\), also known as the half maximal inhibitory concentration, is a parameter that measures the effectiveness of a drug or compound in inhibiting a specific biological or biochemical process. It represents the concentration at which the drug blocks the target protein’s function by 50 %. \(pIC_{50}\) is a transformed version of \(IC_{50}\), where the “p” stands for the negative logarithm (base 10) of the \(IC_{50}\) value. \(pIC_{50}\) is used in regression analyses instead of \(IC_{50}\) because it is more nearly linearly related to drug potency. The selection of the optimal multiple linear regression model was based on these statistical criteria: Fisher ratio (F), squared multiple correlation coefficient (\(R^2\)), adjusted squared correlation coefficient (\(R^{2}_{adj}\)), Durbin–Watson value (DW), variance inflation factor (VIF), tolerance value and significance (Sig). The main difference between QSPR and QSAR is the type of property that is being predicted. QSPR models use statistical and mathematical methods to link the molecular structure of compounds to their physicochemical properties. QSAR models, on the other hand, use statistical and machine learning techniques to correlate the molecular structure of compounds with their biological activities.
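The \(IC_{50}\) to \(pIC_{50}\) transformation described above can be sketched as follows. This is an illustrative helper, not part of the original workflow: it assumes the usual convention that the logarithm is taken of the concentration expressed in mol/L, and the function name, unit table and sample value are hypothetical.

```python
from math import log10

# Conversion factors to mol/L for common concentration units (assumed convention).
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def pic50_from_ic50(ic50, unit="nM"):
    """Return pIC50 = -log10(IC50 in mol/L)."""
    molar = ic50 * UNIT_TO_MOLAR[unit]
    if molar <= 0:
        raise ValueError("IC50 must be positive")
    return -log10(molar)


# Hypothetical example: an IC50 of 1 micromolar corresponds to a pIC50 of about 6.
example_pic50 = pic50_from_ic50(1.0, "uM")
print(round(example_pic50, 3))
```

A tenfold drop in \(IC_{50}\) raises \(pIC_{50}\) by one unit, which is why the transformed scale is more convenient as a regression target.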

MLR model and MLR analyses

Multiple linear regression (MLR; ref. 55) is a statistical technique that explores the linear relationship between a dependent (target) variable Y, here \(pIC_{50}\), and multiple independent (predictor) variables X, here 2D descriptors. Its purpose is to find the best-fitting linear equation, i.e. the one that minimizes the differences between the predicted and actual values of the dependent variable. The regression coefficients are estimated by least squares fitting, and the squared correlation coefficient (\(r^2\)) is used to assess the quality of the fit. The regression equation is formulated as follows:

$$\begin{aligned} Y = b_1 I_1 + b_2 I_2 + b_3 I_3 + c \end{aligned}$$

(3)

In the regression equation, the dependent variable is represented as Y, and the regression coefficients ‘b’ correspond to the independent variables ‘I’. The intercept or regression constant is denoted as ‘c’ (ref. 56). Kirmani et al. (ref. 11) conducted a QSAR analysis on antiviral drugs represented as simple graphs, which suggested a weak association between biological activity (\(pIC_{50}\)) and TI’s. Inspired by their approach, we applied a similar analysis using molecular multigraphs for our selected drugs and obtained a well-fitting QSAR model by the backward elimination method, which is elaborated in the upcoming section.
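To make the least squares step behind Eq. (3) concrete, here is a minimal sketch that solves the normal equations in plain Python. It is not the SPSS procedure used in the study; the data are hypothetical, generated from known coefficients so that the fit can be checked, and the helper names are illustrative.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def fit_mlr(rows, y):
    """Return [b_1, ..., b_p, c] minimising the sum of squared residuals."""
    design = [list(r) + [1.0] for r in rows]  # trailing 1.0 models the intercept c
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical descriptor values (I_1, I_2, I_3) for six compounds.
rows = [(1, 2, 0), (2, 1, 3), (3, 5, 1), (4, 2, 2), (5, 3, 6), (6, 1, 4)]
# Responses generated from Y = 2*I_1 - 1*I_2 + 0.5*I_3 + 4 (no noise).
y = [2 * a - b + 0.5 * c + 4 for a, b, c in rows]

coeffs = fit_mlr(rows, y)
print([round(v, 6) for v in coeffs])
```

With noise-free data the fit recovers the generating coefficients; with real data the residuals are nonzero and the statistics discussed below describe how well the line fits.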

Multicollinearity and VIF57

Multicollinearity refers to high correlation among independent variables, which can result in unstable and unreliable regression coefficient estimates. The variance inflation factor (VIF) is a measure used to evaluate the presence of multicollinearity in regression analysis and is reported by tools such as SPSS. For a given predictor it is defined as \(VIF = \frac{1}{1-R^2}\), where \(R^2\) is the coefficient of determination obtained by regressing that predictor on the remaining predictors. By definition VIF is at least 1; values from 1 to 10 are taken to indicate no serious multicollinearity, while values above 10 suggest the presence of multicollinearity. Our regression models showed signs of multicollinearity, as some independent variables had correlation coefficients near 1 and corresponding VIF values outside the ideal range of 1 to 10. This implies that the model may struggle to accurately estimate the individual effects of these correlated variables. Hence, it is crucial to address this issue to ensure the reliability and accuracy of our regression results.

QSAR model for \(pIC_{50}\)
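For intuition, the VIF formula can be evaluated by hand in the special case of exactly two predictors, where the \(R^2\) of regressing one predictor on the other equals their squared Pearson correlation. The sketch below covers only that special case (the general case needs a regression on all remaining predictors); the data and function names are hypothetical.

```python
def squared_correlation(a, b):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov * cov / (var_a * var_b)


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2), with R^2 = r^2 because there are only two predictors."""
    r2 = squared_correlation(x1, x2)
    if r2 >= 1.0:
        return float("inf")  # perfectly collinear predictors
    return 1.0 / (1.0 - r2)


# Hypothetical descriptor columns with correlation r = 0.8.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vif = vif_two_predictors(x1, x2)
print(round(vif, 4))  # 2.7778
```

Here \(r^2 = 0.64\), so VIF is 1/0.36, comfortably below the threshold of 10; as \(r^2\) approaches 1 the VIF grows without bound.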

The correlation matrix is a helpful tool for detecting multicollinearity in regression models. It displays the pairwise correlations between multiple variables, indicating the strength and direction of their relationship. By examining the matrix for high correlations between independent variables, we can identify multicollinearity and take appropriate measures to address it. In Supplementary Table S1, we present the correlation matrix between various ‘D’ and ‘NBD’ based indices. In QSAR analysis, one of the primary goals is to identify the most important molecular descriptors or features that are correlated with the target property. When dealing with numerous molecular descriptors, including all of them in the model may not be practical. To tackle this issue, variable selection techniques are used to identify the most significant descriptors that exhibit strong correlations with the target property. This process helps improve the predictive performance of the model. Stepwise regression is one such variable selection method that is commonly used in QSAR analysis. It involves iteratively adding or removing descriptors based on their statistical significance in predicting the target property. The process continues until no more significant descriptors remain, resulting in an effective model.

We began by constructing simple linear regression models using the pair of topological indices with the lowest mutual correlation (specifically, 0.1170 between \(NDe_3\) and \(NmM_2\)). This led to the development of two mono-parameter models. However, both models demonstrated a weak correlation with \(pIC_{50}\).

$$\begin{aligned} pIC_{50} = 6.183921-0.48734(\pm 0.502904)NmM_2 \end{aligned}$$

(Model 1)

\(n=7, r=0.3976, R^2=0.1581, R_A^{2} = -0.01026, SE=0.4512, F=0.9390, PE=0.2121\)

Here n: number of drugs used, r (R): simple (multiple) correlation coefficient, \(R_A^{2}\): adjusted \(R^{2}\), SE: standard error, F: Fisher’s statistic, PE: probable error.

By employing stepwise regression analysis, various combinations of two topological indices were examined. The following bi-parametric model shows significantly improved statistical measures compared with the mono-parametric Model 1.

$$\begin{aligned} pIC_{50} = 6.782221-1.9\times 10^{-5}(\pm 1.06\times 10^{-5})NDe_3-0.39912(\pm 0.422226)NmM_2 \end{aligned}$$

(Model 2)

\(n=7, r=0.7292, R^2=0.5317, R_A^{2}=0.2976, SE=0.3762, F=2.2711, PE= 0.1179\).

To improve the statistical parameters of the models, trials were conducted to determine the correlation between three combined TI’s and the biological activity (\(pIC_{50}\)). However, the resulting model exhibited only marginal improvements in its statistical measures.

$$\begin{aligned} pIC_{50} = 5.76991-0.00392(\pm 0.001944)S+0.000313(\pm 0.000165)NDe_3+1.170587(\pm 0.84103)NmM_2 \end{aligned}$$

(Model 3)

\(n=7, r=0.8950, R^2=0.8011, R_A^{2}=0.6022, SE=0.2831, F=4.0282, PE= 0.0501\).

By applying successive Stepwise regression, a tetra-parametric model was derived, showcasing notable enhancements in the statistical parameters.

$$\begin{aligned} pIC_{50} &= 6.945062 + 0.001272(\pm 0.000599)NF - 0.00388(\pm 0.00132)S \\ &\quad + 0.000167(\pm 0.00131)NDe_3 - 0.58105(\pm 1.003055)NmM_2 \end{aligned}$$

(Model 4)

\(n=7, r=0.9689, R^2=0.9389, R_A^{2}=0.8167, SE=0.1921, F=7.6844, PE= 0.0154\).

After employing successive Stepwise regression, a penta-parametric model was obtained, demonstrating enhanced statistical parameters.

$$\begin{aligned} pIC_{50} &= 6.274774 + 0.030819(\pm 0.036622)NM_2 - 0.01093(\pm 0.014519)NF \\ &\quad - 0.01637(\pm 0.014921)S + 0.000939(\pm 0.000928)NDe_3 + 0.726002(\pm 1.8948)NmM_2 \end{aligned}$$

(Model 5)

\(n=7, r=0.9819, R^2=0.9642, R_A^{2}=0.7854, SE=0.2079, F=5.3922, PE= 0.0090\).

In the aforementioned QSAR models, the F-value is the ratio between the variability accounted for by the model and the remaining variability ascribed to error. It is used as an indicator of the model’s statistical significance, with a higher F-value suggesting a greater probability of statistical significance. The probable error of the correlation coefficient is computed as \(PE = \frac{2(1-r^2)}{3\sqrt{n}}\) (ref. 56). The p-value is a statistical measure that evaluates the likelihood of observing the given outcomes if the null hypothesis is true. It quantifies the level of evidence against the null hypothesis, indicating the strength of the observed results. A predetermined significance level (alpha), commonly set at 0.05, is used as a threshold to determine the statistical significance of the study findings and to decide whether to reject the null hypothesis. In our QSAR models, the results were not significant, as the p-values were greater than 0.05. Pairwise correlation refers to the correlation between two variables. Although selecting the least correlated variables reduces the problem of pairwise correlation, it does not account for the possibility of higher-order correlations among the variables (multicollinearity). Because all p-values were greater than 0.05, we discarded the predictor variables included in these models. To mitigate this problem, we used the backward elimination method. The objective was to identify a subset of predictor variables that exhibited the most robust association with the response variable (\(pIC_{50}\)) while avoiding over-fitting the model through an excessive number of predictors.
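The summary statistics reported for Models 1 to 5 can be cross-checked from r, n and the number of predictors p. The sketch below uses the PE formula quoted above together with the standard textbook formulas for adjusted \(R^2\) and F; that these standard formulas are what the software applied is an assumption, although the recomputed values agree with the reported ones to within the rounding of r.

```python
from math import sqrt


def model_statistics(r, n, p):
    """Return (R^2, adjusted R^2, F, PE) for a model with p predictors and n samples."""
    r2 = r * r
    adjusted = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    f_value = (r2 / p) / ((1 - r2) / (n - p - 1))
    probable_error = 2 * (1 - r2) / (3 * sqrt(n))
    return r2, adjusted, f_value, probable_error


N_DRUGS = 7
# (reported r, number of predictors) for Models 1 to 5.
REPORTED = [(0.3976, 1), (0.7292, 2), (0.8950, 3), (0.9689, 4), (0.9819, 5)]

for model_number, (r, p) in enumerate(REPORTED, start=1):
    r2, adjusted, f_value, pe = model_statistics(r, N_DRUGS, p)
    print(f"Model {model_number}: R2={r2:.4f} R2_adj={adjusted:.4f} "
          f"F={f_value:.4f} PE={pe:.4f}")
```

The check also shows why the adjusted \(R^2\) drops from Model 4 to Model 5: with n = 7 and p = 5 only one residual degree of freedom remains, so the penalty for the extra predictor outweighs the small gain in \(R^2\).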

Backward elimination method and validation

Backward elimination is a feature selection method used in statistical modeling and machine learning. It aims to identify the most relevant subset of features (independent variables) for a given predictive model. The method starts with a full model that includes all available features and iteratively eliminates features that are found to be non-significant. See ref. 58 for a QSAR study that uses TI’s with the backward elimination method. By conducting a 2D-QSAR analysis on the biological activity \(pIC_{50}\) of antiviral drugs, we generated multiple QSAR models. During the stepwise regression process, we identified and eliminated five independent variables that exhibited insignificant associations with the \(pIC_{50}\) (biological activity) outcome. Initially, our study encompassed a total of 18 independent (predictor) variables; after removing the insignificant features, 13 predictors remained. Backward elimination therefore started from all 13 predictors: \(M_1\), F, \(M_2\), H, SDD, \(mM_2\), A, NH, I, \(NM_1\), \(ReZG_3\), \(NDe_5\) and NI. The aim was to identify the best subset of predictors (independent variables) that displayed a strong association with \(pIC_{50}\). The best linear model for \(pIC_{50}\) contains three topological indices: \(ReZG_3\), \(NDe_5\) and NH. This selected model, model 3 from Table 7, demonstrated the best combination of predictors based on various statistical parameters.

Table 7 Backward elimination: QSAR models.

Validation: Durbin–Watson statistics and tolerance59

The Durbin–Watson (DW) statistic is used to measure autocorrelation in regression residuals. Autocorrelation occurs when residuals are correlated with one another in sequence, violating the assumption of independence. The statistic ranges from 0 to 4: a value below 2 indicates positive autocorrelation, a value above 2 suggests negative autocorrelation, and a value of 2 indicates the absence of autocorrelation. A DW value close to 2 therefore indicates no significant autocorrelation in the residuals, which suggests that the model effectively represents the relationship between the variables. In our final QSAR model 3, the DW value is around 2, indicating that the errors are uncorrelated. Tolerance is used as an indicator of multicollinearity, measuring the correlation among independent variables in a model. It is represented on a scale from 0 to 1. A tolerance value near 1 indicates a low degree of correlation among predictor variables and thus reduced multicollinearity. Conversely, a tolerance value close to 0 indicates high correlation among predictors, suggesting a potential multicollinearity problem.
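Both diagnostics are simple to compute once the residuals and the \(R^2\) of each predictor against the others are known. The following is a minimal sketch using the standard definitions (DW as the sum of squared successive residual differences divided by the sum of squared residuals, tolerance as \(1 - R^2\), the reciprocal of VIF); the residual sequences are hypothetical and chosen so the arithmetic is easy to follow.

```python
def durbin_watson(residuals):
    """DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2, which lies between 0 and 4."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def tolerance(r_squared):
    """Tolerance = 1 - R^2 of a predictor regressed on the other predictors (= 1 / VIF)."""
    return 1.0 - r_squared


# Hypothetical residual sequences.
alternating = [1, -1, 1, -1]            # sign flips every step: negative autocorrelation
balanced = [1, -1, -1, 1, 1, -1, -1, 1]  # no systematic pattern between neighbours

dw_alternating = durbin_watson(alternating)
dw_balanced = durbin_watson(balanced)
print(dw_alternating, dw_balanced)  # 3.0 2.0
print(tolerance(0.64))
```

A constant residual sequence would give DW = 0, the extreme of positive autocorrelation, while the balanced sequence sits exactly at the no-autocorrelation value of 2.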