Leveraging AHP and transfer learning in machine learning for improved prediction of infectious disease outbreaks

Mosquito bites are a major vector for transmitting diseases that can lead to epidemics, particularly in tropical and subtropical regions15. Accordingly, this paper introduces a methodological model that integrates the Analytic Hierarchy Process (AHP) for feature prioritization with advanced ensemble machine-learning techniques to anticipate potential outbreaks of epidemic diseases. The researchers specifically focus on Dengue, Zika, and Chikungunya, which are among the fastest-growing viral diseases globally, transmitted by female Aedes mosquitoes15. The proposed model consists of six layers: the data source layer, preprocessing layer, feature engineering layer, data splitting layer, modelling layer, and evaluation layer, as illustrated in Figure 1. To identify common risk factors associated with infectious disease outbreaks, this study employs a predefined search strategy across major online databases, including PubMed/Medline, Scopus, and CINAHL, using search terms such as outbreak*, epidemic*, pandemic*, emerging disease*, and re-emerged disease*.

Fig. 1
figure 1

Proposed model for epidemic prediction using AHP and ensemble machine learning.

Data source layer (data acquisition)

Various datasets are utilized to implement the proposed model. Initially, climate datasets covering 2007 to 2017 are collected from the NASA Langley Research Center, DANE (National Administrative Department of Statistics of Colombia), and SIVIGILA (National Public Health Surveillance System, Colombian National Institute of Health).

These datasets encompass both climate and socioeconomic data, detailed as follows:

Climatic data: this dataset provides extensive information on climatic variables for each municipality, including average temperature (tavg), minimum temperature (tmin), maximum temperature (tmax), average humidity (havg), wind speeds (maximum (wsmax), minimum (wsmin), and average (wsavg)).

Socioeconomic data: this includes sociodemographic indicators critical for public health analysis, such as illiteracy, low educational achievement, multidimensional poverty index (mpi), child labor, school absence, informal work, school lag, population, lack of health insurance, and dependency rate.

Upon integration of these data, the final dataset comprises 1716 entries, focusing on the diseases dengue, chikungunya, and zika, each distinguished by 27 unique features.

Data pre-processing layer

This layer encompasses exploratory data analysis (EDA), which addresses initial errors, missing values, and inconsistencies in the dataset. The EDA process transforms the data into a suitable format for feature engineering and predictive modeling, ensuring that the dataset is cleaned and standardized for subsequent analysis stages.

Data cleaning

During this phase, the researchers address error detection, including the identification of negative case counts and misclassified disease types. Additionally, outliers are removed by calculating the interquartile range (IQR) for each variable. Values identified as outliers are those that fall below Q1 − 1.5 × IQR or exceed Q3 + 1.5 × IQR.
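The IQR-based outlier rule described above can be sketched as follows. This is a minimal illustration using Python's standard library and made-up temperature values, not the authors' exact cleaning code:

```python
import statistics

def iqr_bounds(values):
    """Return the (lower, upper) fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q = statistics.quantiles(values, n=4)  # [Q1, Q2, Q3]
    q1, q3 = q[0], q[2]
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def remove_outliers(values):
    """Keep only values inside the IQR fences."""
    lo, hi = iqr_bounds(values)
    return [v for v in values if lo <= v <= hi]

# Hypothetical average-temperature readings; 40.2 is an outlier
data = [24.1, 25.3, 23.8, 24.9, 26.0, 25.5, 40.2]
print(remove_outliers(data))
```

In practice each variable in the dataset would be filtered this way independently.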

Handling missing values

To preserve the integrity of the dataset for analysis, it is essential to address the missing values using appropriate correction or imputation techniques. For instance, the missing values for the average temperature (Tavg) feature are replaced with the mean temperature calculated from the available temperature records.

Data transformation

In this phase, Min–Max normalization is employed to scale continuous data, including population metrics, temperature, and precipitation, to a uniform range between 0 and 1. Additionally, one-hot encoding is applied to convert categorical variables, such as “Municipality,” into a numerical format represented by binary indicators.
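The two transformations above can be sketched in a few lines of plain Python; the municipality names are hypothetical placeholders, and a real pipeline would typically use a library such as scikit-learn or pandas:

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(categories):
    """Map each category to a binary indicator vector (levels sorted alphabetically)."""
    levels = sorted(set(categories))
    return [[1 if c == lv else 0 for lv in levels] for c in categories]

temps = [20.0, 25.0, 30.0]
print(min_max_scale(temps))                  # smallest -> 0.0, largest -> 1.0
print(one_hot(["Cali", "Bello", "Cali"]))    # one binary column per municipality
```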

Feature engineering layer

Feature Engineering is a critical step in developing predictive models. It involves selecting the most relevant features and constructing new variables to enhance both model accuracy and interpretability. This process encompasses feature selection and feature extraction, both of which are vital for effectively preparing the data for modeling.

Feature selection

Feature selection involves identifying and selecting the most pertinent characteristics for forecasting outbreaks16. In this phase, a semi-automated process is employed to optimize feature selection. The study’s predictive model integrates the systematic consistency of the analytic hierarchy process (AHP) with expert domain knowledge to calculate weights for all diseases collectively. This approach utilizes the combined insights from various diseases to identify the most influential features, ensuring their relevance across different contexts. The steps of the AHP model are as follows17:

Step 1: Determine the goal of the AHP model.

Identify the most influential risk factors of infectious disease outbreaks.

Step 2: Determine criteria/sub-criteria.

Identify criteria such as climate factors, population demographics, and socioeconomic elements. These criteria are further decomposed into various components that influence the spread of disease.

Step 3: Expert input and pairwise comparisons:

Expert manual input

Expert manual input serves as the foundation for the data used in the AHP model. Domain experts evaluate potential features based on their knowledge and familiarity with the epidemiological factors influencing disease outbreaks. They assign initial weights reflecting the significance of each feature’s impact on disease transmission. Subsequently, these manual input weights are normalized using the following equation:

$$\text{Normalized Weight} = \frac{\text{Given Weight}}{\text{Sum of all Given Weights}}$$

Conducting pairwise comparisons

In this step, pairwise comparisons are employed to derive standardized, objective weights for each feature based on the manually assigned weights. Utilizing the AHP technique, a pairwise comparison matrix is created among all criteria and sub-criteria. Each pair of factors is evaluated to determine which is more significant, specifically in the context of outbreak prediction18,19.

Step 4: Calculate weights and check consistency. This step ensures that each factor’s relative importance is accurately represented. A consistency check is then performed to validate the judgments, ensuring the reliability of the weight assignments.

Step 5: Conduct a consistency ratio (CR) evaluation to verify the reliability and consistency of expert inputs and the pairwise comparison process18.

Step 6: Extract the weighted priorities of the selected risk factors.
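The steps above can be sketched numerically. The following is a minimal illustration of AHP weight derivation and the consistency check on a hypothetical 3 × 3 pairwise comparison matrix (the criteria and judgment values are invented for illustration, not taken from the study):

```python
# Hypothetical pairwise comparison matrix over three criteria
# (e.g. climate, demographics, socioeconomic), using Saaty's 1-9 scale.
A = [
    [1,     3,     5],
    [1 / 3, 1,     3],
    [1 / 5, 1 / 3, 1],
]
n = len(A)

# Priority weights: normalize each column, then average across each row.
col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
weights = [sum(A[i][j] / col_sums[j] for j in range(n)) / n for i in range(n)]

# Consistency check: lambda_max, consistency index (CI), consistency ratio (CR).
Aw = [sum(A[i][j] * weights[j] for j in range(n)) for i in range(n)]
lam_max = sum(Aw[i] / weights[i] for i in range(n)) / n
ci = (lam_max - n) / (n - 1)
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}  # Saaty's random indices
cr = ci / RI[n]
print(weights, cr)  # CR < 0.1 is conventionally taken as acceptable consistency
```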

figure a

In the final AHP model, a formula is developed for calculating risk-factor weights. This formula includes multiple factors, each multiplied by a weight derived from the correlation coefficient matrix. To accurately assess each factor’s significance, the AHP weights are calculated by integrating expert opinions with the correlation coefficient matrix. This approach enhances the model’s ability to rank epidemiological risk factors effectively, thereby improving its capacity to predict potential disease outbreaks. Table 1 below presents the weight of each risk factor, ordered from highest to lowest rank.

Table 1 Risk factors ranks.

Based on the rankings of risk factors shown in Fig. 2, it is concluded that the highest-ranked factors (barriers to health services, dependency rate, and lack of health insurance) play a critical role in disease outbreaks. Other notable factors, such as barriers to childhood services, precipitation, and informal employment, reflect important socioeconomic and environmental influences. Conversely, lower-ranked factors such as population, average temperatures (Tavg, Tmax, Tmin), and inadequate excreta disposal have a comparatively smaller impact on predicting outbreaks. This ranking aids in prioritizing the most influential factors, thereby enhancing efforts in disease prevention and control.

Fig. 2
figure 2

Distributions of AHP weights and ranks for risk factors.

Feature extraction

Feature extraction involves transforming raw data into a format more suitable for modeling, a crucial step in enhancing the dataset’s predictive power for disease outbreaks. In this study, the dataset initially lacked classification and a defined target variable. To address this issue, the 75th percentile method is used to establish an outbreak threshold, effectively distinguishing between normal conditions and outbreak situations and creating a target variable from disease incidence data. This statistical approach analyzes historical disease occurrence rates to identify a critical value that indicates an unusually high incidence of disease. By setting the threshold at the 75th percentile, it is ensured that only the top 25% of data points, representing unusually high incidences, are classified as outbreaks. To compute the 75th percentile of the dataset, the researchers use the standard percentile formula:

$$P = \frac{n + 1}{100} \times 75$$

where:

P = Percentile value.

n = Total number of observations.

After establishing the threshold, the final outbreak target is calculated, a binary variable indicating the presence or absence of an outbreak. If disease incidence exceeds the defined threshold, the outbreak target is set to 1 (indicating an outbreak); otherwise, it is set to 0 (indicating no outbreak).
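The thresholding procedure above can be sketched as follows. The incidence counts are hypothetical, and linear interpolation between the two neighboring ranks is assumed when the position P = 75(n + 1)/100 is not an integer:

```python
def percentile_75(values):
    """75th percentile via the rank position P = 75*(n+1)/100, with
    linear interpolation between neighboring order statistics."""
    xs = sorted(values)
    n = len(xs)
    pos = 0.75 * (n + 1)   # 1-based rank position
    k = int(pos)
    frac = pos - k
    if k >= n:             # position beyond the last observation
        return xs[-1]
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

# Hypothetical weekly incidence counts for one municipality
incidence = [2, 5, 1, 8, 3, 12, 4, 30, 6, 7]
threshold = percentile_75(incidence)

# Binary outbreak target: 1 if incidence exceeds the threshold, else 0
outbreak = [1 if x > threshold else 0 for x in incidence]
print(threshold, outbreak)
```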

Data splitting layer

In this phase, the dataset is split into 80% for training and 20% for testing to assess the model’s predictive accuracy on unseen data. This approach allows the researchers to build and validate a model initially trained on Dengue data, which is then evaluated with data from the Zika and Chikungunya viruses.

Modeling layer

After feature selection and data preprocessing, transfer learning is applied to use Dengue data as a base model for forecasting Zika and Chikungunya outbreaks. The model is developed using the Random Forest, XGBoost, and Gradient Boosting algorithms. Additionally, an ensemble technique is employed to combine predictions from these individual models, enhancing overall predictive performance. The algorithm below details the outbreak prediction process using transfer learning. Note that the dataset is imbalanced due to the unequal distribution of target classes.
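One common way to combine predictions from several classifiers is majority voting; the sketch below illustrates that idea on hypothetical binary predictions from the three base models (it is a simplified stand-in, not the authors' exact ensemble):

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model binary predictions (one list per model) by majority vote."""
    combined = []
    for sample_preds in zip(*predictions_per_model):
        # The most common label among the models wins for each sample
        combined.append(Counter(sample_preds).most_common(1)[0][0])
    return combined

# Hypothetical outbreak predictions from Random Forest, XGBoost, Gradient Boosting
rf  = [1, 0, 1, 0]
xgb = [1, 1, 0, 0]
gb  = [1, 0, 0, 0]
print(majority_vote([rf, xgb, gb]))  # [1, 0, 0, 0]
```

Averaging predicted probabilities (soft voting) is an alternative that often works better on imbalanced data.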

figure b

To generalize the researchers’ predictive model for Chikungunya and Zika outbreaks across different geographical regions and potentially other diseases, transfer learning techniques are employed. Transfer learning is particularly valuable when the training dataset is insufficient for developing highly accurate models, as demonstrated in this study.

The model is trained on a comprehensive dataset of Dengue outbreaks, which shares similarities with Chikungunya and Zika in terms of transmission vectors (Aedes mosquitoes) and influencing socioeconomic and environmental factors. This pre-training phase enables the model to learn essential patterns and relationships from the extensive Dengue data, capturing critical features such as climatic conditions, socioeconomic indicators, and historical disease incidence rates.

After pre-training the model, the learned features, patterns, and insights derived from the Dengue data are extracted. These features serve as a foundational knowledge base, enabling the model to transfer valuable insights to the target tasks of predicting Chikungunya and Zika outbreaks.

In the fine-tuning phase, the extracted features and parameters from the pre-trained Dengue model are used to initialize the models for predicting Chikungunya and Zika outbreaks. This process involves adjusting the final layers to fit the new tasks and retraining them with the target data. Fine-tuning enables the model to refine its understanding, focusing on the unique characteristics of Chikungunya and Zika while preserving the generalized knowledge acquired from Dengue.

Modeling evaluation layer

The modeling evaluation layer is a crucial step in the prediction model pipeline. At this layer, the performance of each model is evaluated using metrics commonly applied to outbreak prediction tasks, namely accuracy, precision, recall, area under the receiver operating characteristic curve (AUC), and F1-score, to select the model that provides the highest classification accuracy. The values of these metrics are calculated from the entries of the confusion matrix: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). These metrics are defined mathematically in the following equations:

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

where true positive (TP) indicates a positive sample is correctly classified (correctly predicted outbreak cases), and true negative (TN) occurs when a negative sample is correctly classified (correct classification of outbreak negative). False positive (FP) occurs when a negative sample is mistakenly classified as positive (outbreak negative classified as outbreak positive). False negative (FN) occurs when a positive sample is mistakenly classified as negative (outbreak positive classified as outbreak negative). The AUC value evaluates overall performance across all classification thresholds, indicating the model’s ability to distinguish between classes; a higher AUC indicates better model performance. The final model is selected as the one that best combines precision, recall, AUC, and F1-score, along with the highest classification accuracy, after evaluating all the models.
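The four threshold-based metrics can be computed directly from the confusion-matrix counts, as the short sketch below shows (the counts are hypothetical):

```python
def metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1-score, and accuracy from
    confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Hypothetical counts for an outbreak classifier
p, r, f1, acc = metrics(tp=40, tn=45, fp=10, fn=5)
print(round(p, 3), round(r, 3), round(f1, 3), round(acc, 3))
```

Note that on an imbalanced dataset like this one, accuracy alone can be misleading, which is why precision, recall, F1, and AUC are reported alongside it.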