A Review: Machine Learning and Data Mining Approaches for Cardiovascular Disease Diagnosis and Prediction

INTRODUCTION: Cardiovascular disease (CVD) is the most common cause of death worldwide, and its prevalence is rising in low-resource settings and among those with lower incomes. OBJECTIVES: Machine learning (ML) algorithms are evolving quickly and are being implemented in medical procedures for CVD diagnosis and treatment decisions. The healthcare industry generates massive amounts of data every day, yet most of it is inadequately utilized, and efficient techniques for extracting knowledge from these datasets for clinical diagnosis or other uses are scarce. METHODS: ML is being applied in the healthcare industry all over the world. Applied to health datasets, ML approaches are useful in the prevention of locomotor disorders and heart disease. RESULTS: Uncovering such vital information gives researchers significant insight into how to select the proper diagnosis and treatment for a specific patient. Researchers study enormous volumes of complex healthcare data using various ML approaches, which helps healthcare professionals predict disease. CONCLUSION: The goal of this study is to summarize some of the current research on predicting heart disease using machine learning and data mining techniques, analyze the various mining algorithm combinations employed, and determine which techniques are useful and efficient. Future directions in prediction systems are also considered.


Introduction
The heart is one of the most vital organs in the human body. It is a muscular organ that pumps blood through the body and serves as the center of the cardiovascular system. The cardiovascular system contains all blood vessels, including veins, capillaries, and arteries, which together form an intricate network for blood circulation throughout the body [1][2][3]. Any restriction or aberration in the normal flow of blood from the heart can lead to significant heart disease complications. These are referred to as CVDs, and they are among the deadliest illnesses in the world. CVDs include heart disorders, vascular diseases of the brain, and blood vessel diseases [4,5].
Although CVDs can be managed through lifestyle changes and other related measures, several WHO studies indicate that they are increasing worldwide, which is highly concerning. CVDs kill more people than any other cause worldwide, killing an estimated 17.5 million individuals in 2012 [6,7]. According to numerous WHO estimates, mortality from heart disease is on the rise, which is mostly linked to insufficient preventative action despite rising risk factors. Clinical information has revealed that some risk factors increase a person's chances of developing CVD. These include a family history of cardiovascular disease, a high level of LDL ("bad") cholesterol, a low level of "good" cholesterol, a high-fat diet, hypertension, obesity, and a lack of physical exercise. Smoking, diabetes, age, and gender are also risk factors [8]. Using these and other characteristics, physicians typically form a diagnosis by examining a patient's present health status as well as previous diagnoses made for other patients in the same situation.
The increasing incidence of heart disease has become a widespread problem. As a result, the healthcare industry must examine and improve the way these illnesses are treated to reduce their impact. Huge amounts of data are available in the healthcare industry, particularly data on CVD, which must be analyzed efficiently to reach sound conclusions. According to statistics, hospital administration data, and clinical records, medical data doubles every three years, making the health sector a multibillion-dollar domain. Medical data analysis and knowledge extraction rely heavily on machine learning and data mining methods [9,10]. The rising mortality and morbidity rates from CVD have motivated authors to perform numerous studies aimed at reducing them. Machine learning and data mining approaches have been commonly utilized in the development of cardiac disease prediction. Data mining applications are used to prevent clinical mistakes, improve health policy, support early illness identification, and reduce preventable hospital fatalities. The main contributions of this review are:

• We review current research on heart disease prediction in different domains from 2021 to 2023.

• We focus on the application of ML and data mining (DM) approaches to heart disease recognition, including IoT-based applications used in CVD.

• We examine the performance of heart disease recognition systems based on their accuracy.

The review is structured as follows: Sections 2-6 outline the research methodologies utilized to choose the primary studies. Section 7 delves into and evaluates the CVD category review. Section 8 depicts CVD's research gap. Section 9 brings the paper to a close.

Machine Learning Based Heart Disease Prediction
Srinivas & Katarya [11] introduced HyOPTXg, an expert model that predicts heart illness using an optimized XGBoost classifier. Because a well-performing classifier depends on good hyperparameter tuning, they tuned XGBoost's hyperparameters and trained the model with the tuned values. OPTUNA, a hyperparameter optimization framework, was used for the tuning. The method was evaluated on the Kaggle Heart Disease UCI dataset, the Kaggle Heart Failure Prediction dataset, and the Cleveland dataset.
Rani et al. [12] presented a hybrid method to aid in the early detection of CVD. To deal with missing values, the authors used multivariate imputation by chained equations. To select the necessary attributes from the dataset, they used a hybrid attribute selection technique combining a genetic algorithm (GA) with recursive attribute elimination. SMOTE (Synthetic Minority Oversampling Technique) and standard scaling were also employed for data pre-processing. The hybrid model used logistic regression (LR), naive Bayes (NB), random forest (RF), support vector machine (SVM), and AdaBoost classifiers. The RF classifier was found to produce the most accurate results.
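The oversampling idea behind SMOTE can be illustrated with a minimal sketch. It is not the implementation used in [12]: the function name `smote`, the parameter defaults, and the sample records are assumptions made for illustration, and only the core step (interpolating between a minority-class point and one of its nearest minority-class neighbours) is shown.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class records: (age, resting blood pressure)
minority = [(63.0, 145.0), (67.0, 160.0), (58.0, 150.0), (61.0, 138.0)]
new_points = smote(minority, n_new=4)
```

Because every synthetic record lies on a segment between two real minority records, the new points stay inside the range already covered by the minority class.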
Nadeem et al. [13] presented an SVM-based structure for heart disease detection that is enabled by fuzzy-based decision-level fusion. The pre-processing layer collects the original data, which may contain missing values and noise. In this layer, approaches such as the mean and mode are used to impute missing data, and normalization is used to reduce noise. The processed data is then passed to the application layer and used to train the supervised ML approach known as SVM. Within the suggested structure, the same procedure is implemented in parallel.
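Mean and mode imputation of the kind mentioned for the pre-processing layer can be sketched as follows. This is only an illustration under assumptions: the helper `impute`, the use of `None` to mark a missing entry, and the sample columns are hypothetical and are not taken from [13].

```python
from statistics import mean, mode


def impute(column, strategy="mean"):
    """Fill None entries with the mean (numeric) or mode (categorical) of observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in column]


# Hypothetical columns with missing entries marked as None
cholesterol = [233, None, 250, 204, None]
chest_pain = ["typical", None, "atypical", "typical"]

cholesterol_filled = impute(cholesterol, strategy="mean")
chest_pain_filled = impute(chest_pain, strategy="mode")
```

The mean suits numeric attributes, while the mode is the usual choice for categorical ones, since an average of category labels has no meaning.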
Budholiya et al. [14] suggested a diagnostic system that predicts cardiac illness using an optimized XGBoost (Extreme Gradient Boosting) classifier. They used Bayesian optimization, a very effective approach for parameter optimization, to tune the hyperparameters of XGBoost. Table 1 demonstrates the disadvantages and advantages of ML approaches in heart disease prediction.

Heart Disease Prediction Based on Ensemble Models
Gao et al. [15] used supervised learning methods to predict cardiac disease in its early stages. The HD dataset is used to train and test the systems. It has 1025 records, 13 attributes, and 1 target column. The target column has two classes: 1 for heart disease and 0 for no heart disease. The attributes are scaled to the range [0, 1], and records with missing values are removed from the dataset. Extracting the optimal features is critical, since irrelevant features frequently degrade an ML classifier's performance. To identify the essential features of the dataset, linear discriminant analysis (LDA) and principal component analysis (PCA) are used in this phase. To determine whether the people tested have heart disease or are healthy, ensemble methods and several algorithms, namely SVM, NB, decision tree (DT), k-nearest neighbours (KNN), and RF, are employed.
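The two pre-processing steps described above, removing records with missing values and scaling each attribute to [0, 1], can be sketched as below. The function names and the sample rows are assumptions for illustration only, and `None` is assumed to mark a missing value.

```python
def drop_missing(rows):
    """Keep only the rows that contain no missing (None) value."""
    return [row for row in rows if None not in row]


def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for value, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]


# Hypothetical records: (age, serum cholesterol)
rows = [(63, 233), (37, None), (41, 204), (56, 294)]
scaled = min_max_scale(drop_missing(rows))
```

Scaling matters most for distance-based classifiers such as KNN and SVM, where an attribute with a large numeric range would otherwise dominate the distance.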
Uddin & Halder [16] suggested a multilayer dynamic system (MLDS) based on an ensemble method that can increase its knowledge in each layer. For attribute selection, the approach employs the Extra Trees classifier (ETC), Gain Ratio Attribute Evaluator (GRAE), Correlation Attribute Evaluator (CAE), Lasso, and Information Gain Attribute Evaluator (IGAE). The ensemble used for classification is then built from RF, NB, and gradient boosting (GB) classifiers. When these base classifiers failed to classify accurately in a layer, the KNN technique was used to find the neighbouring data points of the test instance.
Mahesh et al. [17] suggested employing ML to determine whether or not an individual has a cardiac illness. This work implements both forms of ensemble classifiers, namely heterogeneous and homogeneous classifiers. SMOTE-based pre-processing was used to deal with class imbalance as well as noise. The work consists of two stages. In the first stage, SMOTE is employed to lessen the influence of data imbalance, and in the second stage, the data is classified using NB, DT, and their ensembles. Table 2 demonstrates the disadvantages and advantages of ensemble models in heart disease prediction.
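Attribute evaluators such as the IGAE rank attributes by information gain, that is, by how much knowing an attribute reduces the entropy of the class label. The sketch below shows that criterion for categorical attributes. It is a generic illustration rather than the evaluator used in [16]; the function names and the toy data are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting on feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: 1 = heart disease, 0 = no heart disease
labels = [1, 1, 0, 0]
exercise_angina = ["yes", "yes", "no", "no"]  # separates the classes perfectly
sex = ["m", "f", "m", "f"]                    # carries no information here

gain_angina = information_gain(exercise_angina, labels)
gain_sex = information_gain(sex, labels)
```

Attributes are then ranked by their gain and the lowest-ranked ones are discarded before training the ensemble.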

Data Mining-Based Heart Disease Prediction
Kavitha et al. [18] suggested an ML method for predicting cardiac disease. The research makes use of the Cleveland heart disease dataset, as well as data mining methods including classification and regression. Three ML models are implemented: RF, DT, and a hybrid of RF and DT. The hybrid model detects heart illness from the user's input parameters.
Deepika & Balaji [22] suggested a model for accurately predicting CVD using an attribute selection and classification method. An optimized unsupervised algorithm is proposed for attribute selection, and MLP-EBMDA is proposed for CVD classification. The heart disease dataset is taken as input and pre-processed. The optimized unsupervised method then selects the attributes, and the hybrid MLP-EBMDA approach classifies HD based on the selected attributes. Table 3 demonstrates the disadvantages and advantages of data mining methods in heart disease forecasting.

IoT-Based Heart Disease Prediction
The IoT is an essential technology in a healthcare system. Malibari [24] suggested the EO-LWAMCNet framework to reliably forecast a patient's chronic health state. A sensor implanted in the patient's body collects the data and transmits it to the cloud via a gateway. The EO-LWAMCNet framework then classifies the sensor data to forecast chronic disease. The model goes through training and testing, and CKD and HD databases are used to forecast the disease. The processed information is used in the training stage for classification. After training is complete, the CS sensor data is tested and classified as normal or abnormal. In the event of an abnormal result, the doctor receives an alert message to treat the patient.
Yaqoob et al. [25] introduced a system with an M-ABC optimization method to address privacy concerns and improve the detection technique for heart disease identification. Using an updated federated learning method for user sites and the cloud, they propose a privacy-aware system for predicting heart disease in healthcare. At the client end, the M-ABC optimizer is used for the optimal selection of features from the heart illness data. A framework for a global cloud system is investigated using a federated matched averaging (FedMA)-based approach. Table 4 shows the disadvantages and advantages of IoT-based approaches in heart disease prediction.
Malibari [24]; EO-LWAMCNet; CKD and HD datasets; the model effectively predicts disease.
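The server-side aggregation step in a federated system such as the one attributed to Yaqoob et al. [25] can be illustrated with plain federated averaging (FedAvg). This is a deliberately simpler stand-in for FedMA, which additionally matches corresponding neurons across clients before averaging; that matching is not shown. The function name, the flat weight vectors, and the client sizes are assumptions for illustration.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by each client's sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical setup: two hospitals train locally and share only model weights
client_weights = [[0.2, 1.0], [0.4, 0.0]]
client_sizes = [100, 300]  # number of local training records per client

global_weights = federated_average(client_weights, client_sizes)
```

Only the weight vectors leave each client, which is why this style of training is attractive when patient records cannot be pooled in one place.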

Deep Learning-Based Heart Disease Prediction
Mehmood et al. [26] developed CardioHelp, a method that recognizes the existence of CVD in a patient by applying a Deep Learning (DL) technique known as convolutional neural networks (CNN). At its earliest stage, the approach is concerned with temporal information modeling, employing a CNN for HD forecasting. The authors prepared the heart disease dataset and compared their outcomes against state-of-the-art methods. In terms of performance assessment measures, the experimental results show that the suggested strategy outperforms the existing approaches.
Dileep et al. [27] developed C-BiLSTM to increase the accuracy of existing approaches. A real-time dataset and the UCI heart disease dataset are used to compare DL approaches with traditional approaches. Both datasets are passed through the K-Means clustering method to remove duplicate information, after which HD is predicted by the C-BiLSTM method. C-BiLSTM was compared with traditional classifier approaches such as RT, SVM, LR, KNN, Gated Recurrent Unit, and Ensemble. Table 5 shows the disadvantages and advantages of DL methods in HD prediction.

Results and Discussion
This review focuses on the role of IoT, data mining, DL, and ML approaches in heart disease prediction. These algorithms were tested using public datasets such as the UCI open dataset, the Heart Disease Dataset, the Cleveland heart disease dataset, and the Cardiovascular Disease Dataset, as well as local datasets. Table 6 shows the systems that were reviewed. According to the findings of the previous study, the machine learning algorithms Random Forest, Naive Bayes, and Gradient Boosting produced higher detection accuracy than other studies. To choose the features needed to predict heart disease, that study used GRAE, CAE, Lasso, IGAE, and ETC. The RF, NB, and GB classifiers were then used to determine whether or not the patient had heart disease. The technique's performance was examined using the Hungarian, Cleveland, and Cleveland-Hungary-Switzerland-Long Beach datasets. Figure 1 shows the DL and ML techniques utilized in PA applications.

Limitations
The limitations and advantages of the reviewed HD prediction methods have been summarized in the tables for a better understanding of the suggested methodologies. All approaches used different methodologies to recognize HD in its early stages. However, all of these algorithms have low prediction accuracy and a long computation time for HD detection.
The prediction accuracy of HD identification approaches requires further refinement to enable efficient and precise detection at early stages, and thereby improved treatment and recovery. The key concerns with the existing methodologies are thus long computation times and low accuracy, which may be attributed to the use of irrelevant attributes in the dataset. To overcome these issues, new approaches for identifying HD are required. Improving prediction accuracy remains a significant challenge for researchers.

Conclusion
The use of several data mining and ML methods for predicting the occurrence of heart disease has been summarized. The prediction accuracy of each algorithm should be determined so that the most suitable approach can be applied to the required area, and algorithm accuracy can be improved by using more relevant attribute selection methods. Data mining and ML can extract an enormous amount of information from an appropriate dataset. In conclusion, the literature survey revealed that only marginal success has been achieved in developing predictive models for heart disease patients, indicating the need for combinational and more complex systems to improve the accuracy of identifying the early onset of heart disease. The more data that is supplied to such a system, the more intelligent it can become.
An automated model that can assist in choosing suitable treatment approaches for a heart disease patient may be developed in the future. Considerable research has already gone into establishing systems that can detect whether or not an individual is going to develop cardiovascular disease. When an individual is diagnosed with a specific form of heart disease, he or she has several treatment options. Machine learning can be quite useful in selecting the type of treatment by retrieving information from relevant databases. In addition, we are interested in treating prediction as a multi-class problem in order to determine the level of the disease. To anticipate heart problems more accurately and effectively, an ideal method still needs to be developed.
Nancy et al. [23] suggested a system that gathers data from IoT devices and applies predictive analytics to these data together with electronic clinical data on patient history stored in the cloud. The suggested smart medical model for forecasting the risk of HD contains three phases: (1) data gathering, (2) pre-processing, and (3) illness prediction. At the cloud layer, the information acquired from IoT sensors for HD risk forecasting is subjected to pre-processing steps such as filtering and cleaning. The resulting data is transmitted to the FIS for a first classification. The suggested Bi-LSTM algorithm is then employed to accurately forecast a patient's risk of heart illness.
EAI Endorsed Transactions on Pervasive Health and Technology | Volume 10 | 2024 |

Fig. 1. DL and ML techniques utilized in PA applications

Table 1. ADVANTAGES AND DISADVANTAGES OF ML APPROACHES IN HEART DISEASE PREDICTION.

Table 2. ADVANTAGES AND DISADVANTAGES OF ENSEMBLE MODELS IN HEART DISEASE PREDICTION.
Yazdani et al. [19] used the heart illness dataset from the UCI ML Repository. During the pre-processing stage, the six instances with missing values were eliminated from the dataset. Using WARM, they evaluated every attribute as well as a selected set of important attributes. According to the findings, the important attributes outperformed all other attributes, achieving the greatest confidence score in forecasting HD.
Premsmith & Ketmaneechairat [20] suggested a methodology for detecting cardiac disease using data mining approaches. The data mining step uses a Logistic Regression (LR) model and a Neural Network (NN) model. The research dataset is based on heart disease data from UCI, with 303 instances and 75 attributes. The confusion matrix is used to compute evaluation measures such as precision, accuracy, F-measure, and recall. According to the results, the LR model outperforms the NN model.
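The evaluation measures named for Premsmith & Ketmaneechairat [20] all follow from the four cells of a binary confusion matrix. A minimal sketch is given below; the function name and the example label vectors are assumptions for illustration and do not reproduce any result reported in [20].

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F-measure from binary labels (1 = disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical ground truth and predictions for eight patients
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

In a screening setting, recall deserves particular attention, because a false negative corresponds to a patient with heart disease who is not flagged.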

Table 3. DISADVANTAGES AND ADVANTAGES OF DATA MINING METHODS IN HEART DISEASE FORECASTING.

Table 4. DISADVANTAGES AND ADVANTAGES OF IOT-BASED APPROACHES IN HEART DISEASE PREDICTION.

Table 5. DISADVANTAGES AND ADVANTAGES OF DL METHODS IN HD FORECASTING.

Table 6. DISADVANTAGES AND ADVANTAGES OF IOT-BASED APPROACHES IN HEART DISEASE PREDICTION.