Data Analysis and Predictive Modelling on Heart Disease based on People’s Lifestyle

Coronary Artery Disease (CAD) is a form of heart disease primarily influenced by lifestyle choices. Although preventative measures can mitigate CAD risk, a significant proportion of the population remains unaware of its severity and consequently neglects the necessary precautions. As a result, the burden of CAD continues to rise. This project aims to curb CAD cases by developing an early warning detection and educational tool accessible to the general population, leveraging Machine Learning and Data Visualization technologies. Research indicates that while Coronary Artery Disease can be mitigated through a shift towards a healthier lifestyle, some risk remains due to factors such as age and natural health deterioration.


Introduction
Heart disease encompasses conditions affecting the heart or blood vessels [1]. In 2019, prior to the COVID-19 pandemic, it was the leading cause of death globally, accounting for 19 million deaths or 34% of all fatalities [2]. Globally, 1 in 14 individuals lives with some form of heart disease, marking a 93% increase from the 1990s to the 2010s [2]. Unfortunately, the number of deaths from heart disease is projected to continue rising.
Coronary heart disease (CHD), also known as coronary artery disease (CAD), is the most prevalent type of heart disease [3]. It is primarily caused by poor lifestyle choices and related conditions such as smoking, unhealthy diet, excessive alcohol consumption, obesity, diabetes, and physical inactivity [3]. These unhealthy behaviors are often seen as Personal Key Indicators (PKIs). However, many individuals fail to recognize that these PKIs significantly increase the risk of heart disease, particularly during early adulthood, when health may not be a top priority due to the "You Only Live Once" mentality [4]. As people age, they may mistakenly believe it is too late to adopt a healthy lifestyle, despite evidence to the contrary: healthy habits can benefit anyone at any age [5]. Establishing healthy habits is easier in childhood than in adulthood because children are more adaptable [6]. Thus, increasing awareness of the risks of coronary artery disease across all age groups is crucial to prevent its onset later in life.
Machine Learning (ML) is a branch of Artificial Intelligence (AI) focused on using data and algorithms to enable machines to solve problems intelligently [7]. ML models to some degree mimic how a human learns, improving in accuracy as data quantity and quality increase, without requiring explicit programming for each task [8]. The most common uses of Machine Learning are prediction, classification, and finding patterns and insights through data mining. In real-world applications, ML is used in recommendation engines, prediction models, spam filtering, and malware detection [9]. With so many different use cases, choosing a suitable machine learning algorithm is critical, as each model has specific strengths and applications, and model performance depends on both data quantity and quality [10]. A robust ML model continually learns from new data to enhance accuracy.
This paper aims to develop a machine learning-based heart disease prediction model that assesses an individual's risk based on their Personal Key Indicators (PKIs). It also highlights the most important PKIs to educate the general public about the lifestyle choices that influence heart disease risk.

Dataset
The dataset used in this paper originates from two distinct sources. The first source is the CDC's 2020 annual survey on the health status of American adults. The dataset initially contained 279 columns/attributes, which were reduced to 18: 17 features and 1 target variable. The attributes chosen relate solely to general health conditions, such as difficulty in walking, BMI, and age [11]. The target variable "HeartDisease" is a binary variable indicating whether or not the observed person suffered from coronary artery disease (CAD). The first dataset contains a total of 319,795 observations.
The second source is a survey questionnaire conducted beforehand to gather additional data that could supplement the main dataset. The data gathered mostly matched the existing data, with some changes to make data collection from participants easier, such as asking for participants' Body Weight and Body Height rather than Body Mass Index (BMI), since BMI can be derived from the two. The questionnaire was offered in two languages, English and Bahasa Indonesia, to allow a greater variety of responses. Combining both versions, the survey collected 99 observations across 33 columns/attributes, which were reduced to 19 columns during pre-processing, as the other 14 columns contained questionnaire responses not relevant to the research. The target variable of the questionnaire mirrored the target variable of the first source.
The inclusion of general health conditions and lifestyle choices ensured the outcomes, including the developed Machine Learning models, could be accessible to the general population without the need for any professional medical assistance.

Data Pre-Processing
Before utilizing the data to construct the Machine Learning model, data pre-processing is essential to ensure the data is suitable for Machine Learning applications. The exact steps of data pre-processing can vary widely depending on the initial condition of the data [12]. Through data pre-processing, the data undergoes cleaning, imputation, normalization, transformation, and encoding to optimize the Machine Learning result. In this project, data pre-processing is divided into two main phases: first, pre-processing of the survey data so that it can be merged smoothly into the main data; second, comprehensive pre-processing of the combined dataset.
First, the survey data is pre-processed so that it can be added seamlessly to the main data. The first step is to fix the issues present in the data; this includes renaming attributes so that they match the names in the main dataset. The next step involves standardizing inconsistencies such as capitalization, spacing, and units of measurement to ensure uniformity and reduce the number of unique values in the dataset. Following this, data transformation is applied to convert the Body Weight and Body Height data into Body Mass Index (BMI), which indicates whether a person's mass is within the healthy range, using the following formula:

$$\mathrm{BMI} = \frac{\text{weight (kg)}}{\left(\text{height (m)}\right)^{2}}$$

Once BMI is calculated, both the Body Weight and Body Height attributes are discarded. Finally, the survey dataset is rearranged to align its structure with the main dataset, facilitating seamless concatenation.
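The weight-and-height transformation can be sketched as follows. This is a minimal illustration, not the project's code: the column names `BodyWeight` and `BodyHeight`, the sample values, and the assumption that height was recorded in centimetres are all hypothetical.

```python
def bmi_from_weight_height(weight_kg: float, height_cm: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_cm <= 0:
        raise ValueError("weight and height must be positive")
    height_m = height_cm / 100.0  # assumed unit: centimetres
    return weight_kg / (height_m ** 2)


# Hypothetical survey rows; the column names are illustrative only.
survey_rows = [
    {"BodyWeight": 70.0, "BodyHeight": 175.0},
    {"BodyWeight": 54.0, "BodyHeight": 160.0},
]

for row in survey_rows:
    bmi = bmi_from_weight_height(row["BodyWeight"], row["BodyHeight"])
    row["BMI"] = round(bmi, 2)
    # Weight and height are dropped once BMI has been derived.
    del row["BodyWeight"], row["BodyHeight"]
```

After the loop each row carries only the derived `BMI` value, which matches the single BMI attribute of the main dataset.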
The second phase encompasses comprehensive pre-processing of the entire dataset. Initially, any duplicate entries in the dataset are removed to prevent data leakage during model development, where identical data appears in both the training and testing datasets [13]. Subsequently, further inconsistency fixing is performed to standardize any remaining outliers or inconsistent values. The next pre-processing step covers missing data imputation: missing numerical data is imputed with the mean (average), and missing categorical data is imputed with the mode (most frequent value). Finally, all categorical attributes are encoded to prepare them for the machine learning models during the model building process [14]. For attributes with two unique values, Label Encoder assigns each unique value a numerical label ("0" for the first unique value and "1" for the second). For attributes with more than two unique values, One Hot Encoding is utilized. One Hot Encoding creates a new binary attribute for each unique category. For each observation, the attribute corresponding to its category is marked with "1", while the others are marked with "0".
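The three steps described above (duplicate removal, mean/mode imputation, and encoding) can be illustrated with a small library-free sketch. It only shows what each step does; the rows, attribute names, and values are hypothetical, and the choice to label two-valued attributes in sorted order is an assumption.

```python
from collections import Counter
from statistics import mean

# Hypothetical rows; attribute names and values are illustrative only.
columns = ["BMI", "Smoking", "GenHealth"]
numeric = {"BMI"}
rows = [
    {"BMI": 22.9, "Smoking": "Yes", "GenHealth": "Good"},
    {"BMI": 22.9, "Smoking": "Yes", "GenHealth": "Good"},  # exact duplicate
    {"BMI": None, "Smoking": "No", "GenHealth": "Fair"},
    {"BMI": 31.1, "Smoking": None, "GenHealth": "Good"},
    {"BMI": 27.0, "Smoking": "No", "GenHealth": "Excellent"},
]

# 1. Remove exact duplicates, keeping the first occurrence.
seen, unique_rows = set(), []
for row in rows:
    key = tuple(row[c] for c in columns)
    if key not in seen:
        seen.add(key)
        unique_rows.append(dict(row))

# 2. Impute: mean for numerical attributes, mode for categorical ones.
for col in columns:
    present = [r[col] for r in unique_rows if r[col] is not None]
    fill = mean(present) if col in numeric else Counter(present).most_common(1)[0][0]
    for r in unique_rows:
        if r[col] is None:
            r[col] = fill

# 3. Encode: 0/1 labels for two-valued attributes, one-hot otherwise.
categories = {
    c: sorted({r[c] for r in unique_rows}) for c in columns if c not in numeric
}
encoded = []
for r in unique_rows:
    out = {c: r[c] for c in columns if c in numeric}
    for col, cats in categories.items():
        if len(cats) == 2:
            out[col] = cats.index(r[col])  # label encoding
        else:
            for cat in cats:  # one binary attribute per category
                out[f"{col}_{cat}"] = int(r[col] == cat)
    encoded.append(out)
```

Here `Smoking` has two values and becomes a single 0/1 attribute, while `GenHealth` has three and expands into `GenHealth_Excellent`, `GenHealth_Fair`, and `GenHealth_Good`.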

Model Building
Once the data has undergone pre-processing, it is ready for the Machine Learning Model Building phase. Prior to fitting the data into the ML models, several preparation steps are necessary. Firstly, the dataset is separated by a feature-target split, resulting in two datasets: "x", containing all feature attributes, and "y", containing the target attribute. Following this, the data is further separated by a train-test split with a 70:30 ratio, with 70% allocated to training and 30% to testing. This yields four datasets: "x_train", "x_test", "y_train", and "y_test". Next, the numerical data is normalized to a common scale to allow better optimization. In this case, Min Max Scaler is used for data normalization. Normalization is applied only to the feature datasets, "x_train" and "x_test". Finally, since the target class distribution is heavily imbalanced (274,551 instances of "0", the Majority class, against only 27,261 instances of "1", the Minority class), various data sampling techniques are applied to achieve a balanced class distribution. For testing purposes, three different data sampling techniques are utilized: Oversampling (where the Minority class data is synthetically increased to match the Majority), Undersampling (reducing the Majority class to match the Minority), and Combined Sampling (utilizing both oversampling and undersampling to achieve a balanced dataset).
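The preparation steps above can be sketched without any libraries as follows. The data is synthetic, and two details are assumptions rather than statements about the project: the scaling range is learned from the training rows only, and undersampling is applied to the training rows only so that the test set keeps its original class distribution.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Synthetic stand-in data: a single numeric feature and roughly 10% positives.
x = [[rng.uniform(15.0, 45.0)] for _ in range(1000)]
y = [1 if rng.random() < 0.1 else 0 for _ in range(1000)]

# 70:30 train-test split on shuffled row indices.
indices = list(range(len(x)))
rng.shuffle(indices)
cut = len(indices) * 7 // 10
x_train = [x[i] for i in indices[:cut]]
x_test = [x[i] for i in indices[cut:]]
y_train = [y[i] for i in indices[:cut]]
y_test = [y[i] for i in indices[cut:]]

# Min-max scaling: the range is learned from the training rows only
# (an assumption here) and then applied to both feature sets.
lo = min(row[0] for row in x_train)
hi = max(row[0] for row in x_train)
x_train = [[(row[0] - lo) / (hi - lo)] for row in x_train]
x_test = [[(row[0] - lo) / (hi - lo)] for row in x_test]

# Random undersampling: keep every minority row and an equally sized
# random subset of the majority rows (training data only, by assumption).
minority = [i for i, label in enumerate(y_train) if label == 1]
majority = [i for i, label in enumerate(y_train) if label == 0]
kept = minority + rng.sample(majority, len(minority))
rng.shuffle(kept)
x_train_bal = [x_train[i] for i in kept]
y_train_bal = [y_train[i] for i in kept]
```

In practice, library implementations of the same steps would normally be used; the hand-written version is only meant to make the order of operations explicit.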
With the data prepared, the Model Building phase commences by evaluating eight different ML algorithms to see which best suits the data. These models are Multi-Layer Perceptron (MLP), Linear SVC (Support Vector Classification), Random Forest, XGBoost, Decision Tree, AdaBoost, K-Nearest Neighbors, and Gradient Boosting. These algorithms are chosen based on their prevalence in prior research. Each model is tested with each of the data sampling techniques to determine the optimal combination of ML algorithm and data sampling approach for the dataset.
Once the most suitable ML model is identified, Hyperparameter Tuning (HP Tuning) is performed to fine-tune its parameters for enhanced prediction performance. HP Tuning employs three techniques, Keras Tuner, Randomized Search CV, and Grid Search CV, to identify the best parameters for the chosen model.
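As an illustration of one of the tuning techniques named above, the sketch below runs scikit-learn's `RandomizedSearchCV` over a Gradient Boosting classifier. The synthetic data, the search space, the number of iterations, and the choice of recall as the scoring metric are assumptions made for the example; they are not the project's actual settings, and the sampling step is omitted for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (about 10% positives).
X, y = make_classification(
    n_samples=1500, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical search space; real values would depend on the dataset.
param_distributions = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,          # number of sampled parameter combinations
    scoring="recall",  # optimise for few False Negatives
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_recall = recall_score(y_test, search.predict(X_test))
```

`GridSearchCV` follows the same pattern but evaluates every combination in the grid instead of a random sample, which is more thorough and correspondingly slower.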

Results and Discussion
The model performance is evaluated by comparing all models with different data sampling techniques to determine the best configuration for the project's objective.
First, it is essential to explain the performance metrics. In Machine Learning, a model's performance is often measured using a confusion matrix. A confusion matrix is a 2x2 table that displays the outcome of a prediction: True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). It is used to visualize the results of the prediction [15]. In this project, the target variable "HeartDisease" has two unique values: zero (0), indicating a person has a low risk of coronary artery disease (CAD), and one (1), indicating a high risk or a prior diagnosis of coronary artery disease. The confusion matrix evaluates the model's results as follows:
• True Positive (TP): The prediction outcome is 1, the actual data is 1.
• False Positive (FP): The prediction outcome is 1, the actual data is 0.
• False Negative (FN): The prediction outcome is 0, the actual data is 1.
• True Negative (TN): The prediction outcome is 0, the actual data is 0.
Based on these four results, four performance metrics can be derived [16]:
• Accuracy: The number of correct predictions over the total number of predictions.
• Precision: The positive predictive value, that is, the number of True Positives over the sum of True Positives and False Positives. Thus, precision becomes higher when False Positives decrease.
• Recall: The sensitivity, that is, the number of True Positives over the sum of True Positives and False Negatives. Thus, recall becomes higher when False Negatives decrease.
• F-1 Score: A metric that combines precision and recall as their harmonic mean.
These performance metrics are computed with the formulas below [16]:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F-1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Since this project's main objective is to be a warning detection tool for coronary artery disease (CAD), Recall is the most important metric to measure. Recall covers the False Negative (FN) situation, where the individual is predicted to be CAD-free despite having a high risk or an existing diagnosis of CAD. A high recall score indicates a lower number of False Negatives, making it a crucial measure for ensuring that at-risk individuals are accurately identified.
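For concreteness, the four metrics can be computed from the confusion matrix counts as in the sketch below. It follows the standard definitions, with 1 as the positive (high-risk) class; the label lists are made-up toy data, not results from this project.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted 1, actual 1
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted 1, actual 0
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted 0, actual 1
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # predicted 0, actual 0
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-1 score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
scores = metrics(*counts)
```

With these toy labels there are 2 True Positives, 2 False Positives, 1 False Negative, and 5 True Negatives, giving an accuracy of 0.7, a precision of 0.5, and a recall of 2/3.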
The next performance metric to evaluate is the accuracy of the predictions. Accuracy measures how often the Machine Learning model predicts the outcome correctly. Precision, while important, is less critical in this case. Low precision means the model often predicts that someone has a high risk of CAD when they do not, but this still prompts individuals to take preventive measures and make lifestyle changes to reduce their risk. Given these considerations, Recall and Accuracy are the primary metrics for determining the most suitable Machine Learning model for this project.

Figure 3. Combined Sampling Performance's Results
Based on the results, considering performance metrics like Accuracy, Recall, and the time required for the Model Fitting Process, the Gradient Boosting algorithm with undersampled data emerges as the best ML model for this project.
During the Model Building phase, combining data undersampled using RandomUnderSampler with the Gradient Boosting algorithm and tuning using Keras Tuner produced the best results. This approach aligns well with the established criteria for evaluating model performance for this dataset and the paper's objective. The above figure shows the results of Gradient Boosting with undersampled data tuned using Keras Tuner. The model achieved an accuracy of 74%, a precision of 23%, and a recall of 78%. This indicates that the model correctly classifies the data 74% of the time. The low precision suggests that the model tends to classify individuals as high risk for CAD even when they are actually low risk, which is problematic. However, the relatively high recall means that the model is less likely to produce False Negatives, where individuals with a high risk of CAD are predicted to be low risk. This is crucial given the project's objective to minimize such errors. These results are considered satisfactory, especially given the dataset's challenges, such as the low correlation between features and target and the heavily imbalanced target distribution.
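As a rough consistency check on these figures, the sketch below derives the F-1 score from the reported precision and recall, and estimates what accuracy they would imply. It assumes, purely for illustration, that the test set has the same share of positive cases as the full class distribution quoted earlier (27,261 of 301,812 observations); the actual test split may differ.

```python
# Reported figures from the text.
precision, recall = 0.23, 0.78
n_negative, n_positive = 274_551, 27_261

# F-1 score: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Assumption: the test set has the same positive share as the full dataset.
prevalence = n_positive / (n_positive + n_negative)

# Confusion matrix shares (fractions of all test rows) implied by the
# reported precision and recall under that assumption.
tp = recall * prevalence
fn = prevalence - tp
fp = tp * (1 - precision) / precision
tn = (1 - prevalence) - fp

implied_accuracy = tp + tn
false_positive_rate = fp / (1 - prevalence)
```

Under these assumptions the F-1 score comes out at roughly 0.36 and the implied accuracy at roughly 0.74, in line with the reported 74%. The implied false positive rate of about 0.26 shows how a precision of 23% can arise on heavily imbalanced data even when most low-risk individuals are classified correctly.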
Finally, comparing the project's Machine Learning model results with those from other developers' works can further validate the results. Such a comparison can demonstrate that this project's outcomes meet its objectives and are suitable for the target users. For benchmarking, three other developers' works, posted on Kaggle in the Code section of the main data page, are used. Compared with the other developers' results, this project's outcomes show the best balance between Accuracy and Recall. Specifically, compared to Elsayed, the project's Recall score is significantly higher, although its Accuracy is lower. Compared to Mohaimin, this project's performance metrics are superior across all scores. Finally, this project's performance metrics are lower than Hossen's, but there is a major flaw in Hossen's methodology: their process likely caused data leakage during data sampling, leading to unrealistically high results. Overall, this project's results are sufficient and valid, especially when benchmarked against other developers' work. While the results achieved were not optimal, they are still viable given the nature of the project, the chosen Personal Key Indicators (PKIs), and the gathered data.

Conclusions
In conclusion, the primary objective of this project was to enhance public awareness about the risk factors influencing Coronary Artery Disease (CAD) based on lifestyle choices. The main outcome is an early warning detection tool that utilizes Machine Learning to predict a user's risk of CAD based on their current lifestyle. The results of the project indicate that the model is valid and viable for deployment and use by the general population.
The project faced limitations primarily related to the dataset and the resulting Machine Learning models. While the results are acceptable and align with the project's objectives, there is room for improvement in accuracy. These issues stem from the dataset's lack of variance and highly imbalanced class distribution. Improved variance and balanced classes in the dataset could lead to better results.
As future work, the Machine Learning models could be enhanced by conducting a larger-scale survey with more detailed questions that correlate more strongly with CAD.
Overall, the developer feels that satisfactory results have been achieved, with all steps carefully executed to ensure the project's outcomes align with its objectives. Care was taken throughout the procedure to meet the aim of the project, which was to produce models that are useful for the general population, particularly in raising public awareness about the risk of coronary artery disease (CAD) based on lifestyle factors.

Figure 5. Comparison between All Developers' Final Results