Early Detection of Cardiovascular Disease with Different Machine Learning Approaches

E. Singh et al.


Introduction
Cardiovascular diseases (CVDs) have become a leading cause of mortality across the globe. According to the World Health Organization, roughly 17.9 million people died from CVDs in 2019, representing 32% of all global deaths; 85% of these deaths were due to heart attacks and strokes [1]. Most cardiovascular diseases can be prevented by addressing behavioral risk factors, which include inadequate physical activity, tobacco consumption, an unhealthy diet leading to obesity, and excessive alcohol consumption. Because of these factors, young people have started facing issues such as obesity, high cholesterol, and high blood pressure, resulting in premature heart failure and death [2]. It is therefore crucial to detect the early symptoms of CVD so that early medication can prevent casualties.

Machine learning has emerged as a significant technological advancement in healthcare, with the potential to transform the sector and bring substantial benefits to both patients and healthcare providers. Machine learning is an analytical approach that automates the construction of models by leveraging algorithms, enabling the extraction of hidden insights from data. Its iterative nature allows a system to adapt its techniques and outcomes in response to unfamiliar circumstances and new data [3]. Major applications of machine learning in healthcare include personalizing treatment, detecting diseases in their early stages, robot-assisted surgery [4], analyzing errors in prescriptions [5], assisting in clinical research and trials [6], drug discovery and creation [7], and automating image diagnosis [8].

In our approach, we aim to build and train machine learning models to detect the onset of cardiovascular malfunction, using ANN (Artificial Neural Network), KNN (K-Nearest Neighbors), Decision Tree, and XGBoost. All the data needed to train and test our models was obtained from Kaggle. These models can be applied to assess whether an individual exhibits indications of a cardiovascular malfunction, taking into account attributes such as age, height, weight, gender, systolic blood pressure, diastolic blood pressure, cholesterol level, glucose, smoking, alcohol intake, physical activity, and the presence or absence of cardiovascular disease [9].

The rest of this paper is organized as follows. A literature review covers heart-related diseases and their causes [10]. The next section presents the methodology, in which the data is analyzed to predict the presence or absence of cardiovascular disease in patients using ML [11]. The subsequent section presents the outcomes, and the final section summarizes our findings and provides suggestions for future research.

Literature Review
Machine learning has recently emerged as a promising tool for the detection of CVD. Maini E. et al. [9] proposed an unsupervised approach that uses clustering techniques to develop a variety of models. Modepalli et al. [12] combined Random Forest, Decision Tree, and Hybrid Model methods, achieving an accuracy of 88.7%. Bharti R. et al. [13] conducted a comparative examination of multiple Machine Learning (ML) and Deep Learning (DL) models on the Archive Coronary Heart Disease dataset. Ashish et al. [14] developed a rapid and accurate computerized system for coronary heart disease detection using SVM classification and XGBoost boosting algorithms. Taken together, current research suggests that ML algorithms have great potential for the detection, diagnosis, and risk prediction of CVD [15]. However, the need for large and diverse datasets remains an open challenge for optimizing ML algorithms and integrating them into clinical practice [16]. Even so, the use of ML algorithms for CVD detection is a rapidly evolving field of research with great potential for improving patient outcomes through early diagnosis and appropriate treatment [17].

Methodology
In this paper, we predict whether a patient is suffering from cardiovascular disease based on various attributes. The dataset was taken from Kaggle, and its values were recorded at the time of medical examination. Using this dataset, we aim to classify each patient as healthy or suffering from cardiovascular disease. The dataset consists of 69,301 patients and 13 health attributes.

Data Preparation and Analysis
We performed exploratory data analysis (EDA) on this dataset. EDA is a preprocessing step used to gain a better understanding of the data. It was carried out using pandas profiling in a Jupyter notebook.

Data Info
This step displays a sample of the dataset and lists the attributes it contains.

Checking for null values
The first step in EDA is importing the dataset and cleaning it. We checked for the presence of null values, duplicate values, and missing values. Figure 1 shows that there are no null values in our dataset. We then performed a statistical summary analysis of the dataset, followed by outlier detection. For the features age, height, weight, systolic blood pressure, and diastolic blood pressure, the outlier counts were 4, 515, 1802, 1419, and 4584, respectively. These outliers were then removed from the dataset.
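
The cleaning steps described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the paper does not state which outlier criterion was applied, so the 1.5 times interquartile range (IQR) rule is an assumption, and the helper names (`iqr_bounds`, `clean`) and the toy records are hypothetical.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences using the k * IQR rule (an assumed criterion)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clean(rows, numeric_columns):
    """Drop rows with missing values, exact duplicates, and IQR outliers."""
    # 1. Remove rows containing a missing (None) value.
    complete = [r for r in rows if all(v is not None for v in r.values())]
    # 2. Remove exact duplicate rows while keeping the first occurrence.
    seen, unique = set(), []
    for r in complete:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # 3. Compute fences per column on the de-duplicated data, then filter.
    bounds = {c: iqr_bounds([r[c] for r in unique]) for c in numeric_columns}
    return [
        r for r in unique
        if all(bounds[c][0] <= r[c] <= bounds[c][1] for c in numeric_columns)
    ]

# Hypothetical toy records; the real dataset has 13 attributes.
rows = [
    {"height": 165, "weight": 70.0},
    {"height": 165, "weight": 70.0},   # duplicate
    {"height": 170, "weight": None},   # missing value
    {"height": 168, "weight": 72.0},
    {"height": 172, "weight": 75.0},
    {"height": 160, "weight": 68.0},
    {"height": 250, "weight": 71.0},   # implausible height
]
cleaned = clean(rows, ["height", "weight"])
```

In this toy example the duplicate row, the row with a missing weight, and the row with the implausible height are dropped. A different outlier rule (for example, a z-score threshold) would give different counts.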

Distribution of data
Figure 4 shows the gender distribution according to the target variable using a stacked bar plot. Males have a count of 20,000, whereas females have a much lower count of approximately 10,000.

Fig. 4. Packages imported
Figure 5 shows the cholesterol distribution according to the target variable using a bar plot. Patients with above-normal and well-above-normal cholesterol are more likely to have cardiovascular disease than patients with normal cholesterol levels.

Fig. 5. Cholesterol Distribution according to the target
Figure 6 shows the glucose distribution according to the target variable using a bar plot. Patients with above-normal and well-above-normal glucose levels exhibit a higher propensity for cardiovascular disease than patients with normal glucose levels.

Heat Map
The heat map visually represents the strength of the relationships between variables using color gradients. It reveals that the majority of the correlations between parameters are positive, indicating a relatively strong dependence or association among them [18]. Among the factors examined, height and gender, as well as cholesterol and glucose, emerge as the most influential variables, with a significant impact on assessing the likelihood of cardiovascular disease (CVD).
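
The values behind such a heat map are pairwise correlation coefficients. The sketch below computes them by hand, assuming the Pearson coefficient (the paper does not name the coefficient used). The column names, the 1/2/3 ordinal coding for cholesterol and glucose, and the toy values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map each pair of column names to its Pearson coefficient."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Hypothetical toy columns; 1/2/3 stands for normal / above normal /
# well above normal (an assumed coding).
columns = {
    "cholesterol": [1, 1, 2, 3, 3, 1],
    "gluc":        [1, 1, 2, 3, 2, 1],
    "active":      [1, 0, 1, 0, 1, 1],
}
corr = correlation_matrix(columns)
```

Each entry `corr[(a, b)]` lies between -1 and 1. A heat map maps these values to colors, so strongly correlated pairs stand out.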

Pair Plot
The pair plot visualization for the data is shown below in Figure 8.

describe() command
The describe() function is a useful tool for computing statistical measures from numerical data within a data frame. It reports the count, mean, standard deviation, minimum, and maximum of each parameter, along with percentiles, including the 25th and 75th percentiles (the lower and upper quartiles). The 50th percentile is the median of the data. By using describe(), we can gain insight into the central tendency and dispersion of the numerical variables in the dataset.
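
To make the reported quantities concrete, the following sketch computes the same summary with the Python standard library only. It mirrors the fields of the data frame describe() output but is not that function. The helper name and the sample readings are hypothetical, and percentiles use linear interpolation between data points.

```python
from statistics import mean, stdev, quantiles

def describe(values):
    """Summary statistics in the order a data frame describe() reports them."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": mean(values),
        "std": stdev(values),      # sample standard deviation (n - 1)
        "min": min(values),
        "25%": q1,                 # lower quartile
        "50%": median,             # the median
        "75%": q3,                 # upper quartile
        "max": max(values),
    }

# Hypothetical systolic blood pressure readings, for illustration only.
summary = describe([110, 120, 120, 130, 140, 150, 160])
```

The gap between the 25% and 75% entries gives the interquartile range, a simple measure of dispersion that is less sensitive to extreme readings than the standard deviation.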

Machine Learning and Deep Learning Models
The processed data was then divided into training and validation sets, and the base models were built on the training set. The models used to predict the presence of cardiovascular disease are Decision Tree, XGBoost, KNN, and ANN [19]. The data is divided in an 80-20 ratio, with 80% allocated to the training set and 20% held out as the test set.
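
An 80-20 split of this kind can be sketched as below. The function name, the fixed seed, and the stand-in records are assumptions made for illustration; the paper does not state how the split was randomized.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for the cleaned patient records.
records = list(range(1000))
train, test = train_test_split(records)
```

Shuffling before splitting guards against any ordering in the source file leaking into the test set. The test portion should be used only for the final evaluation.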

Decision Tree
A decision tree is a hierarchical, tree-like model used for both regression and classification. It is a non-parametric supervised learning algorithm [20]. We used it for predictive analysis on our dataset because it requires little data preparation, is highly interpretable, handles non-linear relationships, and is highly versatile [21]. The decision tree produced using this algorithm is shown below.

XGBoost (Extreme Gradient Boosting)
XGBoost is also a popular model for regression and classification. It is an ensemble learning method that combines multiple weak models into a robust and reliable predictive model [22]. It reduces overfitting and thus improves the performance of the model. The depth parameter chosen for this algorithm is 6, and the accuracy obtained is 73.373%. The precision is 67.678%, whereas the recall is 75.254%. The F1 score for this model is 71.265%, which is the same as that of the decision tree.

KNN (k-Nearest Neighbors)
KNN is a supervised machine learning algorithm capable of handling both classification and regression tasks. It predicts the label or value of a new data point by identifying the k nearest data points in the training set and taking the most frequent label (for classification) or the average of their values (for regression) [23]. The accuracy obtained on our dataset using KNN is 64.273%.
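
The classification rule just described can be sketched in a few lines. The value of k, the Euclidean distance metric, and the toy readings are assumptions; the paper does not report the settings used.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # The most frequent label among the k neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical (systolic, diastolic) readings with a 0/1 disease label.
points = [(110, 70), (120, 80), (115, 75), (150, 95), (160, 100), (155, 90)]
labels = [0, 0, 0, 1, 1, 1]
prediction = knn_predict(points, labels, query=(148, 92), k=3)
```

Because KNN relies on raw distances, features on larger scales can dominate the vote. Scaling the features before fitting is a common precaution.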

ANN (Artificial Neural Networks)
ANN is a deep learning model, used here as a binary classifier [24]. Since ML models use very few parameters, the ANN, a deep learning model, is used to provide a comparable computation on the dataset [25]. The accuracy obtained using the ANN is 72.253%.

Result
The comparison of accuracy, precision, and recall values of the various algorithms [26] used in the paper is shown in the table below. We also used a deep learning algorithm, the Artificial Neural Network (ANN) [27], which achieved an accuracy of 72.253%, close to that of XGBoost. With more entries in the dataset, the accuracy of this algorithm could improve substantially. In summary, our research identifies XGBoost as the best algorithm for the early detection of cardiovascular diseases on this dataset, with an accuracy of 73.373%, the highest among the algorithms evaluated.

Conclusion and Future Scope
Cardiovascular diseases are among the most difficult medical challenges faced by doctors and researchers in this field [28], largely because it is hard to detect CVDs early enough to prevent serious complications. Using preexisting datasets that contain individuals' information such as age, gender, blood pressure readings, and other relevant attributes, it becomes possible to train a model that predicts the occurrence of cardiovascular disease [29]. Our study is a comparative analysis and assessment of four machine learning algorithms for predicting cardiovascular disease, with promising outcomes. In our research [30], the algorithm that achieved the highest accuracy was XGBoost, at 73.373%. In future work, the accuracy of our model could be improved by using a larger dataset and selecting features that better suit the task. Various deep learning techniques may also help to further improve accuracy. Analyzing and combining different datasets to produce a more meaningful dataset, together with careful feature selection, should yield more productive results. In the future, this model could be used to develop web or mobile apps that put its predictions to use in helping doctors and researchers.

The confusion matrix for our dataset is shown in the figure. The True-Positive value is 6692, the False-Positive value is 1961, the True-Negative value is 2833, and the False-Negative value is 5840. The accuracy for this algorithm is approximately 72.753%. The precision is 67.678%, whereas the recall is 75.254%. The F1 score for this model is 71.265%.
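
For reference, the standard definitions of these metrics in terms of confusion-matrix counts can be sketched as follows. The counts in the example are hypothetical and chosen only for illustration; the function names are likewise assumptions.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }

# Hypothetical counts, for illustration only.
example = classification_metrics(tp=80, fp=20, tn=70, fn=30)

# F1 computed from the precision and recall reported in the text.
reported_f1 = f1_from(0.67678, 0.75254)
```

With a precision of 67.678% and a recall of 75.254%, the harmonic mean evaluates to approximately 71.265%, which agrees with the F1 score stated in the text.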

Fig. 11. Confusion Matrix for the Decision Tree results

Fig. 14. The training loss for the ANN (x-axis denoting the number of epochs; y-axis denoting the MSE training loss)

Table 1. Description of Dataset

Table 2. Comparison of algorithms

As shown in the above table, all the machine learning algorithms, namely Decision Tree, KNN, and XGBoost, yield significant results, with XGBoost achieving the highest accuracy: the model trained with XGBoost gives the correct result 73.373% of the time.