Support vector machine with optimized parameters for the classification of patients with COVID-19

Introduction. The COVID-19 pandemic has had a significant impact worldwide, especially on health systems, where early identification of patients at high risk of clinical deterioration is crucial. Objective. This study aimed to design a model based on the support vector machine (SVM) algorithm, optimizing its parameters, to classify patients with suspected COVID-19. Methodology. One thousand patient records from two health establishments in Peru were used. After data preprocessing and feature engineering, the sample was reduced to 700 records. The model was built following a machine learning methodology, using the linear, polynomial, sigmoid, and radial basis function (RBF) kernels with their estimated optimal parameters. Results. The SVM model with the linear and sigmoid kernels achieved an accuracy of 95%, surpassing the polynomial and RBF kernels, each with 94%. In addition, Cohen's kappa, which measures the degree of agreement between the model's predictions and the actual results, was 0.92 for the linear and sigmoid kernels, indicating excellent agreement. Conclusions. The SVM model with linear or sigmoid kernels could be a valuable tool for identifying patients at high risk of clinical deterioration in the context of the COVID-19 pandemic.


INTRODUCTION
On December 21, 2019, doctors in China reported cases of atypical pneumonia in dozens of patients in the city of Wuhan (Mojica Crespo & Morales Crespo, 2020). Given the accelerated growth of infections and deaths in early 2020, the Chinese authorities submitted a report to the World Health Organization (WHO) (Reyes Nuñez & Simón Domínguez, 2020). On March 11, 2020, the WHO, considering the dynamics of the spread and its great danger, declared the SARS-CoV-2 pandemic, noting that the virus was highly lethal, virologically similar to SARS-CoV-1 (HCoV-229E), which appeared in 2009, and would change the course of human history (De León et al., 2020). The SARS-CoV-2 pandemic has altered the way institutions operate worldwide (Zarei et al., 2022). In health care, many activities require immediate attention, such as timely diagnosis and early care in health establishments, which are fundamental tasks of these organizations (Rodríguez Yago et al., 2020). In health facilities worldwide, a broad clinical spectrum has been observed in patients with COVID-19, with levels of severity ranging from an asymptomatic course to acute respiratory distress syndrome (ARDS) and even death (Aljameel, Khan, Aslam, Aljabri, & Alsulmi, 2021). This has created a complex problem for health establishments that lacked the equipment and personnel needed to provide timely patient care (Martínez Chamorro, Díez Tascón, Ibañez Sanz, Ossaba Vélez, & Borruel Nacenta, 2021). The lack of an early diagnosis and of timely treatment increases the clinical risk of the patient with COVID-19. As a consequence, many countries, including Peru, have recorded a high percentage of deaths. In many cases, these patients present high comorbidity, multiple pathologies, and specific signs and symptoms (Rivera & del Pino Casado, 2020).
Mechanisms are needed that allow patients to be classified in real time, to optimize treatment time in a scenario of accelerated growth in SARS-CoV-2 infections (Mohammad, Aljabri, Aboulnour, Mirza, & Alshobaiki, 2022). In many countries, the use of respiratory triage and of scales that identify severity and mortality risk has been suggested for patients with suspected infection. The British Thoracic Society maintains that the National Early Warning Score (NEWS) makes it possible to detect patients with SARS-CoV-2 infection in emergency services and to decide on admission or hospitalization, complementing clinical judgment and scales such as the pneumonia severity index (PSI) or CURB-65 (confusion, urea, respiratory rate, blood pressure, and age ≥ 65 years) (Romero Hernández et al., 2020). Efficient classification mechanisms are needed for patients with COVID-19, who go through potentially complex and changing pathophysiological situations (Li et al., 2021). Knowing the patient's clinical condition and classifying it according to the clinical range is the first step toward treatment and stabilization, and should be done as early as possible (Lalueza et al., 2015; Castellanos & Figueroa, 2023; Aljameel, Khan, Aslam, Aljabri, & Alsulmi, 2021). In hospitals with a continuous increase in patients with COVID-19, the classification process must be automated (Aftab et al., 2022) so that it supports the health professional and provides a timely classification (Dinar et al., 2022; Olusegun Oyetola et al., 2023). Among the traditional alternatives, statistical classifiers such as minimum distance and maximum likelihood depend on the data of each class following a multivariate normal distribution.
If the assumption of a normal distribution for each class holds, the classification has a minimum probability of error, and the maximum likelihood classifier is the appropriate choice (Kavzoglu & Colkesen, 2009). However, the maximum likelihood method has limitations related to the normality assumption and to restrictions on the input data (Fuentes Marmolejo & Medina Parra, 2021; Simhan & Basupi, 2023). An alternative is to design a machine learning model that classifies patients according to the clinical range with the support vector machine (SVM) (Jain, Shankar, & Devi, 2020), an algorithm that can be trained and learn from data (Sánchez Gómez, 2019). The importance of machine learning lies in the possibility of making classifications and predictions with different models, such as k-nearest neighbors, Bayes classifiers, decision trees, and the support vector machine (Véliz Capuñay, 2020). The support vector machine efficiently solves non-linear and multivariate classification and prediction problems (Guhathakurata, Kundu, Chakraborty, & Banerjee, 2021). Its success is due to its solid foundation in mathematical optimization; in addition, it has powerful tools and algorithms for finding solutions in a non-linear context. A support vector machine aims to find an optimal hyperplane that separates one class from another, maximizing the distance between the hyperplane and the nearest points of each class (Pisner & Schnyer, 2020). In real classification or regression applications, the data are often not linearly separable, so a non-linear support vector machine is needed; one way to obtain it is to add polynomial features (Ahmad, 2018; Campos Sánchez et al., 2022). Adding polynomial features is simple to implement and works with any machine learning algorithm. However, at a low polynomial degree, this method cannot handle complex data sets.
At high polynomial degrees, it creates many features, making the model too slow (Géron, 2020; Sebo et al., 2023). A viable way to avoid this combinatorial explosion is to apply an SVM with a polynomial kernel, which gives the same result as adding polynomial features without actually adding them (Zohair et al., 2021). Another strategy, when a set of elements is not linearly separable, is to transform the original space into a Hilbert space by means of a non-linear function (Díaz-Chieng et al., 2022; Campo León, 2017; Rincon Soto & Sanchez Leon, 2022). The kernel functions (linear, polynomial, radial basis, and sigmoid) serve to maximize the margin of the separating hyperplane in that space (Ahmad, 2018; Marinho de Sousa et al., 2022). The theory of reproducing kernel Hilbert spaces shows that kernel functions correspond to a dot product in a linear space of greater dimension than the original space, possibly infinite (Aronszajn, 1944). This allows any linear algorithm that relies only on dot products to be reproduced in a Hilbert space or, equivalently, gives such an algorithm a non-linear version. This is known as the kernel trick (Tharwat, 2019). In the studies reviewed, no SVM-based model was found that optimizes its parameters and compares kernel functions to classify patients with suspected COVID-19 accurately. Therefore, this research aims to design a model based on the SVM algorithm, optimizing its parameters, for the classification of patients with suspected COVID-19.
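The kernel trick mentioned in the introduction can be illustrated with a minimal sketch. The four kernel formulas below follow the parameterization documented for scikit-learn (gamma, coef0, degree); the function names, the feature map `poly2_feature_map`, and the sample points are hypothetical and chosen only for illustration. The check at the end shows that a degree-2 polynomial kernel on 2-D inputs equals an ordinary dot product in an explicit 3-D feature space, without that space ever being built by the SVM.

```python
import math


def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x, z):
    return dot(x, z)


def poly_kernel(x, z, degree=3, gamma=1.0, coef0=0.0):
    return (gamma * dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, z) + coef0)


def poly2_feature_map(x):
    """Explicit feature map for 2-D input whose dot product equals (x . z)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


# Two illustrative points (not taken from the study data).
x = (1.0, 2.0)
z = (3.0, -1.0)

# Kernel evaluated in the original 2-D space.
k_implicit = poly_kernel(x, z, degree=2, gamma=1.0, coef0=0.0)
# Same quantity computed as a dot product in the explicit 3-D feature space.
k_explicit = dot(poly2_feature_map(x), poly2_feature_map(z))

print(k_implicit, k_explicit)
```

The two values agree up to floating-point rounding, which is the practical content of the kernel trick: the SVM only ever needs kernel values, never the coordinates in the higher-dimensional space.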

LITERATURE REVIEW
Significant advances have been made in classifying patients with suspected COVID-19, for example, by identifying chest X-ray images using the support vector machine. Kesav & Jubukumar (2022) mention that machine learning has advanced to solve a wide range of biomedical problems with high precision. Their research uses a deep learning mechanism to identify chest X-ray images of patients with COVID-19 and, with the Bayesian optimization technique, found the support vector machine to be the most accurate among several compared classifiers. A review of research on the application of the support vector machine to the classification of patients with suspected COVID-19 identified 52 articles in Scopus, 11 in IEEE Xplore, and 17 in Web of Science. These studies address classification from X-rays, comparison of models for predicting COVID-19, and sentiment analysis on the management of the pandemic, among other topics that diverge from the present investigation. Dilmi (2022) developed an approach for automatically diagnosing COVID-19 from chest X-ray images, using the AlexNet, VGG16, and VGG19 deep learning architectures to extract useful and relevant features. These features were then used as inputs to a support vector machine with two discrete outputs: COVID-19 or No-findings. Furthermore, the Bayesian optimization (BO) algorithm was used to fit the SVM classifier and choose its optimal parameters. The study's results indicate that the VGG16-SVM-BO and VGG19-SVM-BO models give the best performance, with an accuracy of 99.47%. Kesav & M.G. (2022) present an investigation that uses a deep learning mechanism to identify chest X-ray images of patients with COVID-19 and of other patients with pneumonia in two- and three-class scenarios. The proposed approach employs the GoogLeNet architecture to extract features that are fed into different classifiers.
With the Bayesian optimization technique, the kernel SVM was the most accurate among several compared classifiers. The model showed an overall accuracy of 98.31% for the two-class classification between COVID-19 and non-COVID-19 chest X-ray images and 98.60% for the three-class classification between COVID-19, healthy, and viral pneumonia radiographs. The proposed system outperformed several existing architectures and was tested using smaller data sets to ensure robustness. Jeng & Hsieh (2021) used the SVM supervised machine learning algorithm to build a model that analyzes and predicts the presence of COVID-19 in a person based on the symptoms experienced. Hyperparameters such as degree, cost, and gamma, and kernels including linear, radial, polynomial, and sigmoid, were tuned using RStudio to achieve the best possible model performance. The model was tested by ten-fold cross-validation, and the results show that the polynomial kernel with optimized hyperparameters produced the best accuracy, 98.02%. Singh et al. (2020) set out to produce real-time forecasts using the SVM model. They investigated the prediction of confirmed, deceased, and recovered cases of coronavirus disease 2019 (COVID-19). Such predictions help plan resources, determine government policy, provide survivors with immunity passports, and use survivors' plasma for care. Worldwide data collected from January 22, 2020, to April 25, 2020, included attributes such as location, confirmed cases, deaths, recoveries, longitude, and latitude. SVM was used to explore the impact on identification, deaths, and recovery. The research of Zoabi, Deri, & Shomron (2021) and Ramírez Moncada et al. (2022) has been oriented towards the effective detection of SARS-CoV-2, which allows a rapid and efficient diagnosis of COVID-19 and can mitigate the burden on medical care; to this end, prediction models have been developed that combine several characteristics to estimate the risk of infection.
These models are intended to assist medical personnel worldwide in triaging patients, especially where healthcare resources are limited. A machine learning approach was established and trained on the records of 51,831 people tested (of whom 4,769 were confirmed to have COVID-19). The resulting model predicted COVID-19 test results with high accuracy using only eight binary characteristics: sex, age ≥60 years, known contact with an infected individual, and the occurrence of five initial clinical symptoms. After reviewing the literature, no studies were found that designed a machine learning model based on the support vector machine that optimizes its parameters and compares the different kernels. Therefore, we aim to build a model based on the support vector machine algorithm to classify patients. The model was implemented in Python, a general-purpose programming language with a variety of packages for data science.

MATERIALS AND METHODS
This research follows the positivist paradigm, with a quantitative approach, an observational design without intervention, and a predictive level. The machine learning model was developed following the methodology for machine learning projects described by Géron (2020).
3.1. Data collection. One thousand records of patients diagnosed with SARS-CoV-2 infection and admitted through the emergency service of health centers in Peru were collected. The variables considered were age, gender, weight, height, respiratory rate, oxygen saturation, systolic blood pressure, heart rate, and temperature. Since the study is observational without intervention, informed consent was not requested, and the confidentiality of the data has been maintained.
3.2. Data preprocessing. Preprocessing is a fundamental phase performed before the analysis; it ensures that the data are suitable for training and testing the model. In the present investigation, the data were analyzed with the Pandas package of the Python programming language. The data and records were validated with respect to clinical risk to guarantee adequate classification by the model; missing data were imputed; the features were standardized; and the class imbalance was corrected. After analysis, feature reduction, and balancing, a sample of 700 records was retained.
3.3. Selection of learning algorithm. Various machine learning algorithms are available. The support vector machine was chosen for this study because it supports classification and prediction and allows the different kernels to be optimized and their results compared.
3.4. Implementation of the model and training. The program was developed using the Python package scikit-learn. A training set with 70% of the sample and a validation set with the remaining 30% were used, and a program aimed at improving the results was designed around them. In this way, the error is reduced with each training run, giving an incremental improvement in the model's performance.
3.5. Evaluation and validation. The confusion matrix was calculated to verify that the model classifies efficiently.
The performance of the model was optimized for the linear, polynomial, radial, and sigmoid kernels using cross-validation together with the Grid Search algorithm (Rios, Ulloa, & Borello Gianni, 2019), which allowed the C and gamma parameters to be selected automatically. To compare the performance of the different kernels, the sensitivity, specificity, and Cohen's kappa metrics were used. False positives (FP) are observations erroneously classified as positive; in this research, they are people who do not have COVID-19 and whom the model classified as COVID-19 (+). False negatives (FN) are observations erroneously classified as negative, that is, people who have COVID-19 and whom the model classified as COVID-19 (-). Correctly classified results correspond to true positives (TP), people who have COVID-19 and whom the model classified as COVID-19 (+), and to true negatives (TN), people who do not have COVID-19 and whom the model classified as COVID-19 (-).
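The evaluation metrics named above can all be derived from a confusion matrix. The following is a minimal sketch in plain Python, assuming a matrix whose rows are actual classes and whose columns are predicted classes; for more than two classes, sensitivity and specificity are computed one-vs-rest. The function names and the 3x3 matrix are hypothetical: the counts are made up for illustration and are not the study's results.

```python
def per_class_rates(cm):
    """Return (sensitivity, specificity) lists, one value per class.

    cm is a square confusion matrix: cm[i][j] counts observations of
    actual class i that were predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    n_classes = len(cm)
    sensitivity, specificity = [], []
    for i in range(n_classes):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                                # actual i, predicted other
        fp = sum(cm[r][i] for r in range(n_classes)) - tp   # actual other, predicted i
        tn = total - tp - fn - fp
        sensitivity.append(tp / (tp + fn))
        specificity.append(tn / (tn + fp))
    return sensitivity, specificity


def cohen_kappa(cm):
    """Cohen's kappa: agreement between predictions and labels beyond chance."""
    total = sum(sum(row) for row in cm)
    n_classes = len(cm)
    observed = sum(cm[i][i] for i in range(n_classes)) / total
    expected = sum(
        sum(cm[i]) * sum(cm[r][i] for r in range(n_classes))
        for i in range(n_classes)
    ) / (total * total)
    return (observed - expected) / (1 - expected)


# Made-up confusion matrix for three clinical ranges (illustration only).
example_cm = [
    [50, 0, 0],   # actual mild
    [2, 45, 3],   # actual moderate
    [0, 5, 45],   # actual severe
]

sens, spec = per_class_rates(example_cm)
kappa = cohen_kappa(example_cm)
print(sens, spec, kappa)
```

Reading the per-class sensitivity alongside kappa is useful in a clinical setting, because a high overall agreement can still hide a class with many false negatives.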

RESULTS
During the data collection phase, 1,000 medical records of patients diagnosed with COVID-19 were identified in two health facilities in Peru. The data of patients who met the validation criteria for clinical risk were recorded. After preprocessing and analysis of the variables, which included reviewing missing values, selecting variables, and resolving the class imbalance across clinical ranges, a sample of 700 records was retained. The sample had a mean age of 59 years with a standard deviation of 15.90 years. In addition, it was found that 53.28% of the patients were male, and 63.71% were female. The sklearn package of the Python programming language was used to implement the model. The model was trained on a random 70% of the clinical records, and the remaining 30% were used for testing. The results of the evaluation show that, when using the sigmoid kernel, the support vector machine model obtained a high overall accuracy of 95%. When predicting patients with a severe clinical range, it presented a precision of 96% and a sensitivity (recall) of 91%. For patients with a moderate clinical range, it achieved a precision of 94% and a sensitivity of 93%. For patients with a mild clinical range, it exhibited a precision of 94% and a sensitivity of 100%. With the radial kernel (RBF), the model achieved an overall accuracy of 94%. When predicting patients with a severe clinical range, it showed a precision of 93% and a sensitivity of 91%. For patients with a moderate clinical range, it achieved a precision of 94% and a sensitivity of 92%. For patients with a mild clinical range, it exhibited a precision of 96% and a sensitivity of 100%.
To support the external validity of the classification model and avoid dependence on the number of features in the data, different values of the parameters C (0.1, 1, 10, 50, 100, 500, 1000, 10000) and gamma (1, 0.1, 0.01, 0.001, 0.0001) were explored through an optimization process. Cross-validation was applied to the training set, and the Grid Search algorithm was used to obtain the best parameters for each kernel. To compare the kernels, the model's accuracy, that is, the percentage of cases classified correctly, was evaluated, and Cohen's kappa index was calculated to measure the agreement between the model's predictions and the reference labels. Together, these measures assess the accuracy and consistency of the model in classifying cases correctly. The results showed that both the linear kernel and the sigmoid kernel obtained an accuracy of 95%, higher than the 94% of the polynomial kernel and of the radial kernel (RBF). Cohen's kappa was 0.92 for the linear and sigmoid kernels, which indicates excellent agreement between the model predictions and the actual results (Table 2). Since the model is intended for use in the health system, we are interested in keeping false negatives low, because it would be detrimental to give a patient a negative diagnosis of COVID-19 when the patient in reality has the disease. We must therefore opt for the kernels with the higher sensitivity, for which false negatives are lowest.
We opted for the linear and sigmoid kernels over the polynomial and radial kernels, as they have an accuracy of 95%, a higher sensitivity, and a Cohen's kappa of 0.92, higher than that of the polynomial and radial kernels.
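The parameter search described in this section can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the study's program: the data are a synthetic stand-in generated with `make_classification`, and the median imputation, the stratified split, the five folds, the fixed random seeds, and the `max_iter` cap (added only to keep the sketch fast) are all assumptions. The C and gamma grids, the 70/30 split, and the four kernels are taken from the text.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a 700-record, three-class data set (NOT the study data).
X, y = make_classification(
    n_samples=700, n_features=9, n_informative=6, n_redundant=0,
    n_classes=3, random_state=0,
)

# 70% training, 30% test; stratification is an assumption of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0,
)

# Parameter grids reported in the text.
C_VALUES = [0.1, 1, 10, 50, 100, 500, 1000, 10000]
GAMMA_VALUES = [1, 0.1, 0.01, 0.001, 0.0001]


def search_kernel(kernel):
    """Cross-validated grid search over C (and gamma) for one kernel."""
    pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # assumed imputation rule
        ("scale", StandardScaler()),
        ("svm", SVC(kernel=kernel, max_iter=10_000)),   # cap is an assumption
    ])
    grid = {"svm__C": C_VALUES}
    if kernel != "linear":  # gamma has no effect on the linear kernel
        grid["svm__gamma"] = GAMMA_VALUES
    search = GridSearchCV(pipeline, grid, cv=5, scoring="accuracy")
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ConvergenceWarning)
        search.fit(X_train, y_train)
    return search


results = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    search = search_kernel(kernel)
    y_pred = search.predict(X_test)  # uses the refitted best estimator
    results[kernel] = {
        "best_params": search.best_params_,
        "accuracy": accuracy_score(y_test, y_pred),
        "kappa": cohen_kappa_score(y_test, y_pred),
    }

for kernel, summary in results.items():
    print(kernel, summary)
```

Placing imputation and scaling inside the pipeline means they are re-fitted on each training fold, so the cross-validation scores are not contaminated by information from the held-out fold. The numbers printed for the synthetic data have no relation to the results reported in this paper.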

CONCLUSIONS
The data for the proposed model were obtained from 1,000 records of patients with suspected SARS-CoV-2 infection who were admitted through the emergency service of health centers in Peru. The variables considered were age, gender, weight, height, respiratory rate, oxygen saturation, systolic blood pressure, heart rate, temperature, and diagnosis based on mild, moderate, and severe clinical risk. The data were analyzed with the Pandas package of the Python programming language. The data and records were validated with respect to clinical risk to guarantee adequate classification by the model; missing data were imputed; the features were standardized; and the class imbalance was corrected. After analysis, feature reduction, and balancing, a sample of 700 records was retained. The linear, radial, sigmoid, and polynomial kernel functions were used for the performance analysis, and their parameters were optimized. The kernels were compared by accuracy, which measures the percentage of cases that the model classified correctly: 95% for the linear and sigmoid kernels and 94% for the polynomial and radial kernels. Likewise, Cohen's kappa was calculated to evaluate the predictions adequately; the value obtained for the linear and sigmoid kernels was 0.92, which indicates excellent agreement, whereas the polynomial and radial kernels obtained a value of 0.90. Rapid triage of patients with COVID-19 can reduce adverse clinical consequences and healthcare costs. The timely classification of patients with COVID-19 will allow better management of hospital surveillance, prevention, and control strategies.

Ethical aspects.
The Office of the Vice President for Research of the José Faustino Sánchez Carrión National University approved this study (RCU No. 0334 2020-CU-UNJFSC). Informed consent was not requested because the information is secondary: it was obtained directly from medical records, and the confidentiality of the data was respected.

Acknowledgements.
We thank the Office of the Vice President for Research of the José Faustino Sánchez Carrión National University for promoting and funding this research.

Financing.
This research was supported with ordinary resources from the José Faustino Sánchez Carrión National University, Huacho, and by the authors.

Conflicts of interest.
We declare that we have no conflicts of interest in this study.