Classification model for student dropouts using machine learning: A case study

Information and communication technologies play a highly relevant role in different fields of knowledge, addressing problems in various disciplines, and data mining has increased the capacity to identify patterns and anomalies in an organization's data. In this context, the study aimed to develop a classification model for student dropout, applying machine learning with the AutoML method of the H2O.ai framework. The dimensionality of the socioeconomic and academic characteristics was taken into account so that directors can make reasoned decisions to counteract student dropout in the study programs. The methodology used was of a technological type, purposeful level, incremental innovation, temporal scope, and synchronous; data collection was prospective. For this, a 20-item questionnaire was administered to 237 students enrolled in the master's degree programs in education of the Graduate School. The research resulted in a supervised machine learning model, a Gradient Boosting Machine (GBM), that classifies student dropout and identifies the main associated factors that influence it, obtaining a Gini coefficient of 92.20%, an AUC of 96.10%, and a LogLoss of 24.24%, which represents a model with efficient performance.


Introduction
Education is essential for the development and well-being of a society; therefore, students are the raison d'être of any educational institution. A country's social and economic growth is directly related to the academic performance of its students (Mushtaq and Khan, 2012). In the last decade, the Peruvian state has implemented various quality assurance measures in university higher education to guarantee that the country's youth have access to a comprehensive and continuous educational service that promotes development through research. In 2014, with the publication of University Law No. 30220, the National Superintendence of Higher University Education (SUNEDU, in Spanish) was created, an organization that has played a leading role in ensuring that educational institutions comply with the basic quality conditions during the institutional licensing process. Faced with the requirement to implement basic quality conditions in the higher education system, universities must manage their resources efficiently. In this sense, managing information technologies in the university higher education system is an excellent option, according to the proposal of Villarreal et al. (2021), for having information available at the right time.
One of the problems that arise in public universities is the insufficient allocation of financial resources. However, the Graduate School of the José Faustino Sánchez Carrión National University carries out its academic and administrative management with autonomy since it has two sources of income: the ordinary budget allocation and directly collected resources. University dropout is a problem for which the student is directly responsible, and it concerns the directors, who need to know the probability of dropout because only a small number of students manage to complete their studies. Student dropout negatively influences the academic and economic development of the organic unit. For this reason, the aim is to use data mining to identify behavior patterns in students, analyzing the socioeconomic and academic factors, so that specific strategies can be implemented that reduce the dropout rate and help maintain a sustainable economy over time.
The report prepared by OECD (2022), with actual cohort data from 25 countries, describes that, on average, 39% of students admitted to a full-time undergraduate program graduate within the theoretical duration; when the students who had not obtained their degree within the theoretical duration are followed for three more years, the average completion rate rises to 68%. It is highlighted that, on average, 12% drop out of tertiary education before the start of the second year of studies. In the case of master's students, 51% on average complete the program within its theoretical duration, and following the remaining students for three more years raises the average to 77%. OECD (2021) estimates that in 2019, on average, 38% of students graduated for the first time before turning 30, excluding international students, and that first-time graduates at the master's level or its equivalent accounted for 8% on average. By comparison, OECD (2020) reports for 2018 that most first-time graduates (78% on average) obtained a bachelor's degree and 10% a master's degree; it also notes that the three fields of study with the highest average share are business, administration, and law (25%), health and well-being (15%), and engineering, manufacturing, and construction (14%). The status of students at the end of their first year of studies can be very significant for understanding the effectiveness of orientation or preparation. On average, 12% of students were not enrolled, more than 2% had transferred to another program, and 85% were enrolled in the same or another degree program; in addition, an average of 64% of students had graduated from a bachelor's program and only 1% from a master's program (OECD, 2019).
In Peru, according to SUNEDU (2021), undergraduate enrollment was 1.59 million students in 2018 and fell to 1.34 million in 2020, a difference of 15.7% between periods. Postgraduate enrollment fell by 27.7%, from 131.9 thousand students in 2018 to 95.4 thousand in 2020. The official newspaper El Peruano (2021) details that, among universities licensed at the national level, the percentage of interruption of studies decreased by 4.7 percentage points, from 16.2% to 11.5%, between the semesters 2020-II and 2021-I. The regions with the greatest impact were Loreto (16.7%), Callao (14.2%), Áncash (13.9%), Ayacucho (12.8%), and Lima (12.4%), in contrast to Amazonas (4.2%), Huancavelica (6.3%), and Tacna (7.9%), where there was less interruption. Among the causes were connectivity problems, student welfare services, and economic conditions, among others. To reverse this situation to some extent, an investment of 61 million soles has been made to contract internet service for students and teachers.
The research was framed in the production of new knowledge through the proposal of the classification model; in addition, the theory of student dropout supported by Díaz (2008) was corroborated. The objective of the research was to develop a dropout classification model for students of education study programs through machine learning and data mining techniques, applying H2O.ai's AutoML.

Data mining
Data mining is the process of discovering useful information in very large data structures. It is based on mathematical and statistical analysis aimed at deducing the patterns and trends in the data. Typically, these patterns cannot be detected through traditional exploration because the relationships are too complex or the volume of data is too large (Microsoft, 2019).
Kodelja (2019) argues that some experts in machine learning, a subset of artificial intelligence, claim that machine learning is learning and not something else, while others, including philosophers, reject the claim that machine learning is real learning; for them, human learning is the highest form of learning. For their part, Xu & Li (2014) argue that machine learning is becoming an essential method for dealing with knowledge acquisition problems; it is defined as a branch of artificial intelligence and refers to the construction and study of systems that can learn from data. Machine learning is typically concerned with how to build computer programs that improve automatically as they process data; Samuel (1959) defined machine learning as a field of study that allows computers to learn without being explicitly programmed. Dwi et al. (2019) describe machine learning as a part of artificial intelligence that focuses on developing systems capable of learning patterns on their own from a training data set, without human intervention. It is applied in various fields, such as education.

Types of machine learning
Jung (2022) describes three types of machine learning. Supervised learning uses a labeled data set for prediction and is divided into regression and classification (Chatterjee et al., 2023). Unsupervised learning works with a data set that does not need labels; it allows analysts to discover behavior patterns or similarities between features, and it is not intended to detect or predict anything, since it is based only on subdivision or grouping (Chatterjee et al., 2023). Reinforcement learning is similar to unsupervised learning in that it learns from an unlabeled data set. The difference from the previous types is that the loss function can be evaluated; in these cases, the agent learns from trial-and-error experience, depending on the feedback it receives, in order to perform efficiently (Andrade-Girón et al., 2023; Junco Luna, 2023; Sharmeela et al., 2022).

AutoML
He et al. (2020) state that deep learning (DL) algorithms have achieved extraordinary results in various tasks, such as language modeling, object detection, and image recognition; however, creating a world-class deep learning system for a particular activity is highly dependent on human expertise, which limits its widespread application. This drawback can be addressed by introducing the AutoML approach, which in recent years has attracted the attention of various sectors. Many information and communication technology service providers have chosen to implement their own platforms, such as H2O.ai, DataRobot, DarwinAI, and OneClick.ai. Existing AutoML libraries such as AutoWeka, MLBox, AutoKeras, Google Cloud AutoML, Amazon AutoGluon, IBM Watson AutoAI, and Microsoft Azure AutoML provide solutions that automatically generate ML-based models (Olusegun Oyetola et al., 2023; Vakhrushev et al., 2021). Nagarajah & Poravi (2019) describe AutoML, or automatic machine learning, as a process that can develop custom models while considerably reducing human intervention. In addition to performing data preprocessing, feature engineering, model building, hyperparameter optimization, and the analysis and evaluation of prediction results, automatic machine learning has made it possible, to a certain extent, to streamline time-consuming machine learning development operations. It aims to reduce the demand on data scientists and makes it possible to build well-performing machine learning applications without requiring extensive knowledge of statistics and machine learning (Zöller et (Ajgaonkar, 2022).

Feature Selection
To develop a machine learning model for the prediction of the target variable, it is necessary to carry out the feature selection process, which aims to identify the interaction of the variables that gives the best predictive performance. This process is relevant because it reveals the variables that contribute significantly to the predictive model and reduces the number of variables, the training time, and the deployment effort, making the model less complex and easier to explain (Haque,
Tinto (1982) defines dropout as a situation in which a student fails to finish their education; therefore, a dropout would be one who is enrolled in a higher education institution but has no academic activity for three consecutive academic semesters. Gonzales (2005) differentiates two types of dropout in university higher education: the first concerning time (initial, early, and late), and the second concerning space (institutional, internal, and from the educational system). Likewise, Tinto (1989) sustains the existence of several critical periods that influence student dropout. The first is the admission process, where the interested parties form their first social and intellectual impressions of the institution, which generates the applicant's expectations. The second period is the transition between secondary education and the institution, after admission (Driss Hanafi et al., 2023; González Vallejo, 2023; Montes, 2022; Rincón Soto et al., 2023), due to the adjustment from school life to the new way of university life, which influences the student's mental state. Tinto (1989) states that dropouts occur during the transition period, with voluntary dropouts being more frequent. Díaz (2008) presented the analysis models of student dropout to analyze the phenomenon of dropout inherent to university student life and to describe the theories from different points of view: a) Psychological Model, which indicates the personality traits that establish the differences between students who complete and students who drop out of their university studies; it is based on the proposals of Fishbein and Ajzen (1975), who support the theory of Reasoned Action; Ethington (1990) (1976).

Dimensions of Student Dropout
The variables most frequently considered in the theoretical models related to student dropout were consolidated in the study carried out by Díaz (2008), which considers four categories: the individual ones (age, gender, family group, and social integration); the academic ones (professional orientation, intellectual development, academic performance, study methods, admission processes, degree of career satisfaction, and academic load); the institutional ones (academic regulations, student financing, university resources, quality of the program or career, and relationship with professors and peers); and the socioeconomic ones (socioeconomic stratum, employment situation of the student, employment situation of the parents, and educational level of the parents).

Methodology
The methodology focused on applying data mining and supervised machine learning techniques, using a set of preclassified elements to develop the model. The data set was obtained from two sources of information: first, a questionnaire containing 20 items grouped into four dimensions (academic, individual, environmental, and institutional) was administered to 237 participants; second, data from the evaluation record were collected through observation. The process involves splitting the data into two sets: the first is used for training, allowing the construction of the classification model, and the second is used for testing, thus obtaining the adjustment parameters. Table 1 presents the items of the instrument. In addition, the variables were checked for null values, outliers, and cardinality, all of which affect machine learning models.
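The checks and the partition described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the toy records, the 80/20 split ratio, the random seed, and the use of Tukey's 1.5 × IQR rule for outliers are all assumptions, since the text does not state which procedures were applied.

```python
import random

def quality_report(rows, columns):
    """Count missing values and distinct values (cardinality) per column."""
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        report[col] = {
            "nulls": len(values) - len(present),
            "cardinality": len(set(present)),
        }
    return report

def iqr_outliers(values):
    """Return the values lying outside 1.5 * IQR from the quartiles."""
    data = sorted(values)
    n = len(data)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return data[lo] + (data[hi] - data[lo]) * (pos - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    spread = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical records: 237 respondents, two of the 20 items, a few gaps in P02.
rows = [
    {"P01": (i % 5) + 1, "P02": None if i % 50 == 0 else (i % 3) + 1}
    for i in range(237)
]
report = quality_report(rows, ["P01", "P02"])
train, test = train_test_split(rows)
```

The report makes it easy to decide, before modeling, which items need imputation or recoding; the split ratio would be adjusted to whatever partition the study actually used.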

Results
The descriptive analysis of the scores given by the participants was carried out, as shown in Table 3. To develop the models, the 20 items of the instrument were defined as independent variables and student dropout as the dependent variable. In addition, two aspects of vital importance were considered: feature selection and the percentages used to partition the data set into training, validation, and test sets for each of the models. For feature selection, different algorithms were used, and two sets of variables were obtained based on the variables the algorithms selected in common: the first set is made up of 11 variables (P01, P02, P03, P04, P09, P10, P12, P13, P14, P16, P20); the second adds the variables P07, P11, P17, P18, and P19 to the first, for a total of 16 variables. Subsequently, the parameters for invoking the AutoML method of the H2O object were established: the set of independent variables as predictors, the target variable defined as the dependent variable, the stopping parameter max_models = 100, and the option balance_classes = TRUE. The results obtained with this configuration are presented in Table 4. As seen in Table 5, the scores obtained in each metric are very similar and significant during the training process; tests were subsequently carried out to obtain the performance metrics of each of the indicated models.
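A hedged sketch of this configuration with the H2O Python client is shown below. The two predictor sets and the values max_models = 100 and balance_classes come from the text; the target column name `dropout`, the seed, and the helper names are hypothetical. The `balance_classes = TRUE` spelling suggests the study may have used the R client, so this Python version is an equivalent illustration rather than the authors' script.

```python
# Predictor sets reported in the text (questionnaire item codes).
SET_11 = ["P01", "P02", "P03", "P04", "P09", "P10",
          "P12", "P13", "P14", "P16", "P20"]
SET_16 = SET_11 + ["P07", "P11", "P17", "P18", "P19"]

TARGET = "dropout"  # hypothetical name for the dependent variable

def automl_settings():
    """AutoML arguments from the text; the seed is an added assumption."""
    return {"max_models": 100, "balance_classes": True, "seed": 1}

def run_automl(train_frame, predictors, target=TARGET):
    """Train H2O AutoML on an H2OFrame and return the fitted AutoML object."""
    # Imported lazily so the constants above can be inspected without h2o.
    from h2o.automl import H2OAutoML

    # A categorical target makes AutoML treat the task as classification.
    train_frame[target] = train_frame[target].asfactor()
    aml = H2OAutoML(**automl_settings())
    aml.train(x=predictors, y=target, training_frame=train_frame)
    return aml
```

Under these assumptions, a run would call `h2o.init()`, load the data into an H2OFrame, and then inspect `run_automl(frame, SET_11).leaderboard` to compare the generated models.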
Classification models have a variety of performance metrics. Among the most relevant is the Gini coefficient, used to measure the quality of the prediction model: a value of zero means perfect equality, that is, a deficient model, whereas a value close to unity means maximum inequality and corresponds to a perfect classifier. The area under the ROC curve (AUC) is a metric that evaluates the capacity of the classification model to differentiate between true positives and false positives; a value close to unity is considered a perfect model. Unlike the AUC, the area under the precision-recall curve does not take true negatives into account, which is why it is widely used with unbalanced data sets. The log loss metric looks at how close a model's predicted values are to the actual target classes, where a value close to zero means the model assigns the probabilities correctly. Table 6 contains the metrics of each execution and the tests carried out with the automatically generated models. The metrics are similar, except for the third and fourth models, which are overfitted due to the number of observations partitioned into three data sets. Likewise, most of the models with fewer items perform better in the metrics. In this sense, following the principle of parsimony, the models with 11 items are chosen according to the algorithms used for feature selection, which benefits their future implementation. Thus, slightly better performance is observed in the tenth Gradient Boosting Machine model, followed by the second DeepLearning model.
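The relationship between these metrics can be illustrated with a short sketch. It assumes the common convention Gini = 2 × AUC − 1, which is consistent with the values reported in this study (AUC 96.10%, Gini 92.20%); the labels and scores in the example are made up for illustration.

```python
import math

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative
    (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(auc_value):
    """Gini coefficient derived from the AUC (assumed convention)."""
    return 2 * auc_value - 1

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

# Illustrative labels and predicted probabilities (not study data).
labels = [1, 1, 1, 0, 0]
predicted = [0.9, 0.8, 0.4, 0.5, 0.2]
example_auc = auc(labels, predicted)
example_gini = gini(example_auc)
example_loss = log_loss(labels, predicted)
```

Applied to the reported AUC, `gini(0.9610)` gives 0.9220, matching the Gini coefficient stated for the selected model.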
Figure 1 shows the variables ordered from highest to lowest according to their importance in the model's prediction, based on percentage values scaled to 100%. The participants' experience at the secondary level has a strong influence: academic performance (29.65%), failed subjects (22.67%), repetition of the year (13.65%), and teacher performance (14.03%). Of less relevance are personal stress (6.35%), undergraduate performance (5.99%), number of children (3.40%), motivation (2.23%), economic situation (1.28%), work related to the student's career (0.62%), and finally the financing of studies (0.10%). Additionally, model metrics were obtained from the confusion matrix; they are detailed in Table 7. Accuracy is the proportion of correct predictions out of the total number of predictions made; a score close to unity represents optimal performance. From Table 7 we obtain an accuracy of 92%, that is, the model predicts 92 of every 100 observations correctly; the sensitivity is 90%, indicating a successful prediction of 90 out of 100 cases of the positive class; finally, the specificity is 100%, meaning that all cases of the negative class were predicted correctly.
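The metrics derived from a confusion matrix can be computed as in the sketch below. The four counts are hypothetical: they were chosen only because they reproduce the reported rates (92% accuracy, 90% sensitivity, 100% specificity) and are not the values of Table 7.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard rates derived from the four cells of a binary confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,   # correct predictions over all cases
        "sensitivity": tp / (tp + fn),   # recall on the positive class
        "specificity": tn / (tn + fp),   # recall on the negative class
        "precision": tp / (tp + fp),     # correct among predicted positives
    }

# Hypothetical counts for a 50-case test set (not taken from Table 7).
metrics = confusion_metrics(tp=36, fn=4, fp=0, tn=10)
```

With no false positives, precision and specificity both equal 1.0, while the four false negatives account for the gap between sensitivity and a perfect score.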
The ROC curve is a graph that represents the relationship between the true positive rate (sensitivity) and the false positive rate (1 − specificity). Figure 2 shows a curve near the upper left corner, indicating optimum performance. It should be noted that the closer the curve comes to the 45° diagonal, or baseline, the less accurate the model is, which corresponds to poor performance. Likewise, the lower left side of the graph represents a lower tolerance for false positives, while the upper right side represents a higher tolerance for false positives. Figure 3 shows the behavior of the GBM classification model through the learning curve, which presents the logarithmic loss on the training and validation data sets. In addition, the curves are stable beyond 50 trees; that is, adding more trees to the model would not improve its performance much.
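How such a curve is traced can be sketched in a few lines: each distinct score is used as a threshold, and the resulting false positive and true positive rates form one point. The labels and scores are illustrative only, and the function names are hypothetical.

```python
def roc_points(y_true, scores):
    """(false positive rate, true positive rate) pairs, one per threshold,
    starting from the origin."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Illustrative labels and scores (not study data).
labels = [1, 1, 1, 0, 0]
predicted = [0.9, 0.8, 0.4, 0.5, 0.2]
curve = roc_points(labels, predicted)
area = trapezoid_area(curve)
```

A curve that hugs the upper left corner encloses an area close to 1, whereas a curve along the 45° diagonal encloses 0.5.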
In short, the Gradient Boosting Machine (GBM) is a supervised machine learning method used for classification problems. It is built from decision trees. The generated GBM model consists of 51 internal trees and has a size of 8,910 bytes. The trees have a minimum depth of 4 and a maximum depth of 6, with an average depth of 5.29. The minimum number of leaves is 7 and the maximum is 13, with an average of 9.24. This configuration indicates that the internal decision trees have a reasonable depth and a moderate number of leaves. This means that the GBM model has a good fit and can provide an appropriate classification for the data, as evidenced by the performance metrics.
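The idea behind boosting decision trees can be shown with a toy sketch. It is deliberately simplified and is not H2O's implementation: it uses depth-1 trees (stumps) on a single feature, a fixed learning rate of 0.1, and made-up data; only the number of trees (51) echoes the generated model. Each stump is fitted to the residuals of the current prediction, and the log loss after each tree gives a learning curve of the kind shown in Figure 3.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(xs, residuals):
    """Depth-1 regression tree: threshold with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest value would leave one side empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def mean_log_loss(ys, scores):
    return -sum(y * math.log(sigmoid(s)) + (1 - y) * math.log(1 - sigmoid(s))
                for y, s in zip(ys, scores)) / len(ys)

def fit_gbm(xs, ys, n_trees=51, learning_rate=0.1):
    """Boost stumps on the gradient of the log loss; return model and curve."""
    base = math.log(sum(ys) / (len(ys) - sum(ys)))  # initial log-odds
    scores = [base] * len(xs)
    stumps, curve = [], [mean_log_loss(ys, scores)]
    for _ in range(n_trees):
        # Negative gradient of the log loss with respect to the raw score.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        scores = [s + learning_rate * (lm if x <= t else rm)
                  for x, s in zip(xs, scores)]
        curve.append(mean_log_loss(ys, scores))
    return {"base": base, "lr": learning_rate, "stumps": stumps}, curve

def predict_proba(model, x):
    score = model["base"] + sum(model["lr"] * (lm if x <= t else rm)
                                for t, lm, rm in model["stumps"])
    return sigmoid(score)

# Made-up, cleanly separable data: low values stay, high values drop out.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model, curve = fit_gbm(xs, ys)
```

In this toy run the loss falls with every added stump; on real data the validation curve flattens once extra trees stop helping, which is the behavior described for the model beyond 50 trees.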

Conclusions
Once the GBM model for student dropout classification has been generated, it can be concluded that it is adequate for the task: it offers adequate accuracy, sensitivity, and specificity for the prediction of student dropout cases, and it has a high performance capacity, with a depth and number of leaves suitable for training. Therefore, the use of this model for the analysis of student dropout is appropriate. It offers several advantages, such as the ability to work with unbalanced data, to improve results by tuning model parameters, to use cross-validation to assess model accuracy, and to make forecasts in real time, which allows managers to make quick and effective decisions to combat student dropout and provides a useful tool for the detection and prevention of dropout. The H2O.ai platform offers a method called H2O.AutoML for the automatic generation of learning models, which allows the user to select a data set, partition it, and generate the model. H2O.ai selects the best model according to the specified parameters. This tool saves the user time and resources since they do not need to choose the model parameters manually. Therefore, using the H2O.ai platform and the AutoML method is a good option for model building.
A relevant aspect of the research was transversality. In the first instance, machine learning can use algorithms to extrapolate insights from a data set; in the case of data mining, the technique has made it possible to identify patterns in the data within the context of university higher education, allowing users to share and reuse the acquired knowledge and best practices in other knowledge areas.
Implementing the generated student dropout classification model is recommended since it has great practical utility for those responsible for education and for the university community: the tool predicts the risk of student dropout early, allowing preventive measures to be taken to reduce it. These measures may include offering financial aid, academic advising, tutoring programs, remedial classes, and other forms of student support that can help students stay in college and achieve their educational goals. Academically, the model allows researchers to save time and resources by evaluating different classification and prediction models automatically, and it offers the ability to perform a sensitivity analysis to better understand the factors that influence attrition, which makes it a good choice for research. Likewise, it improves the teaching approach and provides a greater understanding of the needs of students so that they can be given appropriate support.

Figure 1. Importance of variables in the classification model.

Figure 2. ROC chart of the GBM classification model.
2022; Simhan & Basupi, 2023; Zambrano Verdesoto et al., 2023). There are three kinds of methodologies for feature selection. According to Kuhn & Johnson (2022), intrinsic methods comprise models based on trees and rules, multivariate adaptive regression models, and regularization models; their advantage is that they are relatively fast since selection is integrated into the model fit. Filter methods use a supervised analysis to determine the essential characteristics for the model simply and quickly, but they are prone to over-selecting predictors.
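As an illustration of a filter method, the sketch below ranks candidate items by the absolute Pearson correlation between each item and the target and keeps the top k. This is one simple filter among many and an assumption on our part: the text does not specify which selection algorithms were used, and the item values here are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def filter_rank(features, target, top_k):
    """Names of the top_k features by absolute correlation with the target."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical responses for three items and a binary dropout target.
target = [0, 0, 1, 1, 1, 0]
features = {
    "P01": [1, 2, 4, 5, 5, 1],  # tracks the target closely
    "P02": [3, 3, 3, 3, 3, 3],  # constant, carries no information
    "P03": [2, 1, 2, 1, 2, 1],  # weakly related
}
selected = filter_rank(features, target, top_k=2)
```

Because each item is scored independently of the others, a filter of this kind is fast but can keep redundant predictors, which is the over-selection risk noted above.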

Table 1. Data collection instrument for participants.

Table 3. Descriptive analysis of the data set of the participants.

Table 4. Machine learning models based on the size of data sets for training, testing, and validation.

Table 5. Model performance metrics with the training and validation data set.

Table 6. Model performance metrics using the test data set.

Table 7. Confusion matrix of the generated GBM model.