Predicting Student Dropout based on Machine Learning and Deep Learning: A Systematic Review



Introduction
Student dropout is widely recognized as one of the most complex challenges facing education systems worldwide 1,2, and it increased significantly during the COVID-19 pandemic 3. The issue has economic, social, and educational consequences for stakeholders across the global education system, ranging from the psychological impact on students to the management challenges faced by government entities 4,5. Predicting and managing early signs of student dropout is therefore relevant [6][7][8][9]: it enables educational institutions to act promptly, implementing preventive and proactive measures that reduce the dropout rate 10. Various governments have designed and implemented early warning systems for school dropout to tackle this problem [11][12][13]. An alternative of great relevance is the use of Machine Learning and Deep Learning algorithms 14. These models predict dropout and provide early alerts to the relevant authorities, enabling them to take measures targeted at at-risk students 15. The behavior of each Machine Learning and Deep Learning model depends on the underlying algorithm, the optimized hyperparameters, the training and test datasets used 16, the variables and behavior of the data, and the performance metrics chosen 17. As a result, multiple alternatives are observed, offering different results in each specific application 13,18. In addition, Machine Learning and Deep Learning approaches 19 have been criticized for their "black box" methodology in predicting student dropout, which leaves the models without proper interpretability for humans 20. Therefore, conducting a comprehensive systematic review of the application of Machine Learning and Deep Learning to predicting student dropout is imperative [21][22][23][24]. This study aims to identify the algorithms that have demonstrated the best predictive capabilities and the different variants of each model. A thorough search for systematic reviews related to student dropout was conducted in the major databases; however, no research specifically oriented toward our purpose was found. Our main objective is therefore to fill this knowledge gap and answer the question of which Machine Learning and Deep Learning algorithms perform best in predicting student dropout.

Methods
The present research was developed using the systematic review methodology 25 based on the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines 26,27. The systematic review followed these phases. First, the research question guiding the study was formulated. Then, a research protocol was developed describing the design of the systematic review, including the criteria for study selection, the databases used for the search, the search strategies, and the methods of data extraction and analysis. Inclusion and exclusion criteria for the studies were also established. Subsequently, an exhaustive search of the scientific literature was conducted in different databases, using the defined search terms and applying the inclusion and exclusion criteria to select relevant studies. The titles and abstracts of the articles identified in the search were reviewed, and those that met the established inclusion criteria were selected. The selected studies were then read in full to verify their compliance with the inclusion and exclusion criteria. The relevant data from the selected studies were extracted and organized in a database. Next, the results were interpreted and the findings were synthesized, identifying possible limitations. In the last phase, a detailed report of the systematic review was written, including a complete description of the methodology used, the results obtained, and the conclusions reached [28][29][30].

Search Strategy
To conduct this systematic review, an exhaustive search was performed in specialized databases to find information relevant to our research. Table 1 presents a detailed description of the search strategy used.

Inclusion and exclusion criteria
The inclusion and exclusion criteria are the standards and rules established to determine which studies or articles are considered in the systematic review and which are excluded. These criteria are based on the research objectives and the questions being addressed.

Sample Selection Process
After applying the inclusion and exclusion criteria, the sample was restricted to those articles that provided information relevant to the stated objective. The flow chart shows that 246 articles were initially identified in the three databases. After eliminating duplicate articles and applying the inclusion and exclusion criteria, 43 articles remained. From this selection, additional exclusions were made for various reasons. In the end, a total of 23 articles were included in the analysis (Figure 1).

In the analysis of these articles, 16 Machine Learning and Deep Learning algorithms were identified. Among them, RandomForest exhibited the best performance, achieving an accuracy of 99% (Table 3). Next, we discuss the theoretical rationale for why RandomForest outperformed the other algorithms [50][51][52][53][54].

Results and Discussion
The application of Machine Learning and Deep Learning algorithms poses two main challenges. First, it is essential to determine the optimal algorithm, which is a complex task when multiple candidate systems meet the established criteria 55. This problem becomes particularly challenging when the learning algorithm is prone to diverse local optima and the available training data are insufficient 56. Second, discarding less successful models carries the risk of losing potentially valuable information 56,57.
RandomForest is a learning algorithm based on creating an ensemble of decision trees and combining their results to obtain a final prediction 13. Each tree in the ensemble is constructed independently using the technique known as "bagging" 58, which involves taking random samples with replacement from the original training dataset and building a decision tree from each of these samples 59.
A theoretical justification for the superior performance of RandomForest lies in its nature as a Machine Learning ensemble [60][61][62][63][64][65]. Ensembles are techniques that combine multiple individual models to improve predictive capability and system robustness 66. RandomForest is characterized by creating a set of decision trees, each representing an individual model in the ensemble 60,[67][68][69][70][71][72]. Each tree is constructed using a random sample with replacement from the original training dataset and a random selection of features at each node 73.
Another key aspect supporting the advantage of RandomForest is its ability to address local optima. Some algorithms, such as decision trees, can generate highly non-convex cost functions, which can cause the methods used to solve them to become trapped in local optima 59. Combining several hypotheses, each obtained through a different approach, increases the probability of approximating the true hypothesis more accurately [74][75][76][77][78]. This is because different solutions are explored, reducing the reliance on a single local optimum 79. Indeed, the RandomForest algorithm has demonstrated superior performance to other Machine Learning and Deep Learning algorithms when applied to predicting student dropout. Its ability to address local optima and overfitting, and to leverage diversity and independence among the trees, makes it a suitable choice 80. The ensemble approach to machine learning, of which RandomForest is an example, has been shown to be beneficial: combining multiple models improves predictive capability and system robustness, reduces the risk of selecting an incorrect hypothesis, and expands the hypothesis space to approximate the target function more effectively.
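The construction described above can be illustrated with a minimal sketch in Python. It is not the implementation used in any of the reviewed studies: the toy data, the feature names (attendance rate, grade average), and all function names are hypothetical, and each "tree" is reduced to a single split (a decision stump) so that the random feature selection at each node amounts to one random choice per tree.

```python
import random
from collections import Counter

# Hypothetical toy data: (attendance_rate, grade_average) -> 1 = dropout, 0 = retained.
TOY_STUDENTS = [
    ((0.95, 8.5), 0), ((0.90, 7.0), 0), ((0.85, 9.0), 0), ((0.80, 6.5), 0),
    ((0.40, 4.0), 1), ((0.35, 5.0), 1), ((0.50, 3.5), 1), ((0.30, 4.5), 1),
]

def bootstrap_sample(rows, rng):
    """Bagging step: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows, feature_indices):
    """Fit a one-split tree using only the candidate features.

    Chooses the (feature, threshold) pair that misclassifies the fewest rows.
    """
    best = None
    for f in feature_indices:
        for threshold in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= threshold]
            right = [y for x, y in rows if x[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]  # (feature, threshold, left_label, right_label)

def predict_stump(stump, x):
    f, threshold, left_label, right_label = stump
    return left_label if x[f] <= threshold else right_label

def fit_forest(rows, n_trees=25, n_features=1, seed=0):
    """Build n_trees stumps, each on its own bootstrap sample and feature subset."""
    rng = random.Random(seed)
    n_total = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(rows, rng)                # random rows, with replacement
        features = rng.sample(range(n_total), n_features)   # random feature subset
        forest.append(fit_stump(sample, features))
    return forest

def predict_forest(forest, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]

forest = fit_forest(TOY_STUDENTS)
print(predict_forest(forest, (0.92, 8.0)))  # a high-attendance, high-grade student
print(predict_forest(forest, (0.20, 3.0)))  # a low-attendance, low-grade student
```

A production model would grow full trees and would typically rely on an established library rather than hand-written code; the sketch only makes the two sources of randomness (rows and features) and the final vote explicit.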
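The claim that combining hypotheses reduces the risk of selecting an incorrect one can be illustrated with a short calculation. The sketch below assumes an idealized setting that is not taken from the reviewed studies: every model in the ensemble errs independently with the same probability, and the ensemble decides by majority vote. Trees in a real RandomForest are correlated, so the actual gain is smaller than this idealization suggests. The function name and the error rate of 0.30 are illustrative.

```python
from math import comb

def majority_vote_error(n_models, p_error):
    """Probability that more than half of n_models independent classifiers,
    each wrong with probability p_error, are wrong at the same time.
    n_models should be odd so that the vote cannot tie."""
    first_majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_error**k * (1 - p_error) ** (n_models - k)
        for k in range(first_majority, n_models + 1)
    )

# Ensemble error for a fixed individual error rate of 0.30 (illustrative value).
for n in (1, 5, 25, 101):
    print(n, majority_vote_error(n, 0.30))
```

Under these assumptions the ensemble error falls as models are added, provided each individual model is better than chance (error below 0.5).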

Conclusions
This systematic review has provided an overview of predicting student dropout using Machine Learning and Deep Learning techniques. The most promising algorithms and their variants in terms of predictive capability were identified. Timely prediction of student dropout has significant potential to improve educational management and contribute to achieving quality standards in the educational field. The analysis of 23 scientific articles identified the application of 16 different Machine Learning and Deep Learning algorithms. The most utilized algorithm in these studies was RandomForest, representing approximately 21.73% of the total. Additionally, RandomForest demonstrated outstanding performance, achieving an accuracy of 99%.
A key advantage of the RandomForest model, which is based on an ensemble of Machine Learning algorithms, lies in its ability to overcome local optima and overfitting. However, more variables related to student dropout and further research using large volumes of data are required to obtain more robust and generalizable results. Overall, this study highlights the potential of Machine Learning and Deep Learning techniques in addressing the challenge of student dropout. The findings and recommendations presented in this article are expected to serve as a starting point for future research and practical applications in the educational field, aiming to improve student retention and academic success.

Figure 1: Flowchart of the search and selection method for the systematic review references.

Table 3 presents the most relevant characteristics for the systematic review. The following attributes were considered: author, country, sample, number of variables, training strategy, cross-validation, modality, Machine Learning and Deep Learning algorithm, performance metric, best-performing algorithm, and results (accuracy, sensitivity/recall, F1 score). The results show the distribution of the selected articles by country of origin. The largest share of articles (21%) was published in China, while 17.39% originated from the United States. Additionally, Korea, India, and Spain each accounted for 8.69% of the selected articles. Other countries contributing to the sample were Turkey, Hungary, Germany, Malaysia, Chile, Ecuador, Slovakia, and the Netherlands, each representing 4.34% of the selected articles. These findings suggest that student dropout is a relevant research topic in various parts of the world. However, the choice of databases may have influenced this distribution. All of the studies included in the systematic review relied predominantly on supervised Machine Learning algorithms for classification. The total sum of samples was 1,912,653, with a mean of 57,959. Furthermore, the total number of variables was 373, with a mean of 16.21.
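As a consistency check on the shares reported above, the following sketch recomputes them from per-country article counts. The counts are not stated in the text; they are assumptions inferred from the reported percentages and the total of 23 articles. The percentages in the text appear to be truncated rather than rounded to two decimals, so the helper truncates as well.

```python
from math import floor

N_ARTICLES = 23  # articles included in the review (stated in the text)

# Assumed counts per country, inferred from the reported percentages.
COUNTS = {
    "China": 5, "United States": 4,
    "Korea": 2, "India": 2, "Spain": 2,
    "Turkey": 1, "Hungary": 1, "Germany": 1, "Malaysia": 1,
    "Chile": 1, "Ecuador": 1, "Slovakia": 1, "Netherlands": 1,
}

def truncated_percent(count, total):
    """Share of the total in percent, truncated (not rounded) to two decimals."""
    return floor(count / total * 10000) / 100

def truncated_mean(total_value, n):
    """Arithmetic mean, truncated to two decimals."""
    return floor(total_value / n * 100) / 100

for country, count in COUNTS.items():
    print(country, truncated_percent(count, N_ARTICLES))

# Mean number of variables per article: 373 variables over 23 articles.
print(truncated_mean(373, N_ARTICLES))
```

Under these assumed counts, the country totals add up to the 23 included articles.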