Comparative analysis of performance of AutoML algorithms: Classification model of payment arrears in students of a private university

Artificial intelligence is increasingly important to society because data science makes it possible to innovate processes. In higher education, it can reveal the academic and sociodemographic factors that contribute to late payments among university students, so that at-risk students can be identified and timely decisions made to implement prevention and correction programs, avoiding dropout caused by this economic problem and supporting success in their education. In this sense, the research aims to compare the performance metrics of classification models for late payments among students of a private university, using AutoML algorithms from several existing platforms and solutions (AutoKeras, AutoGluon, HyperOPT, MLJar, and H2O) on a data set of 8,495 records to which data balancing techniques were applied. The algorithms produced similar metrics, depending on the parameters and optimization functions each tool selects automatically; the best performance was obtained by the H2O platform through the Stacked Ensemble algorithm, with accuracy = 0.778, F1 = 0.870, recall = 0.904, and precision = 0.839. The research can be extended to other contexts or areas of knowledge given the growing interest in automated machine learning, which provides researchers with a valuable data science tool without requiring deep technical knowledge.


Introduction
The use of AutoML has become increasingly important due to the exponential increase in data availability and the need to perform effective predictive analyses in the educational system and other productive sectors. The study evaluated, in terms of predictive capacity and ease of use, several popular and emerging automated machine learning (AutoML) platforms (AutoKeras, AutoGluon, HyperOPT, MLJar, and H2O) to develop the classification model, providing objective information for choosing the most suitable AutoML platform for the specific requirements of the data set, with a view to the prevention and management of late payments among university students.
According to various sources of information worldwide, artificial intelligence has been an innovative factor in educational technology, highlighting three domains: academic support services, and institutional and administrative services 1,2. It has also been an innovative factor in other fields such as medicine, finance, and law. In the case of educational institutions, there is evidence of its use. Learning analytics improves the teaching experience and the management of higher education institutions, so its use is recommended 7. Data mining is becoming increasingly relevant in various productive sectors; in higher education, it has a significant impact oriented towards a new-age university, providing data-based models 8. Artificial intelligence is also associated with challenges and ethical aspects: the accuracy and fairness of algorithms must be ensured in order to promote creativity, critical thinking, and analysis, diversifying and expanding the most complex behavioral patterns through the adoption of innovative and emerging tools and technologies that prepare students to succeed in a reality of constant change 9,10.
Machine learning is considered a multidisciplinary and interdisciplinary area of knowledge. It is located between data science and artificial intelligence, at the intersection of statistics and computer science. It has been driven by the development of algorithms, online data, and low computational cost, bringing with it growing interest in the software engineering community, where it provides solutions to reduce costs. On the other hand, one obstacle is the adjustment of hyperparameters 11,12,13,14,55. It has become an innovative technology in higher education, providing a series of perspectives on various dimensions of educational quality, with academic performance, students at risk of dropping out, and default rates as application variables 15.
Machine learning takes three different forms depending on the problem's characteristics and the data availability: supervised learning, unsupervised learning, and reinforcement learning 16. Implementing the algorithms involves three basic tasks: feature selection, choosing the appropriate algorithm, and evaluating performance metrics 17. Supervised learning uses properly mapped data, with input data and previously labeled output data, from which the algorithm develops a model based on the training data. Unsupervised learning models a collection of inputs without labeled instances; in this case, the algorithm organizes the instances into different categories. In reinforcement learning, the algorithm learns from the environment a policy of how to act, receives the corresponding feedback, and is used in decision-making processes 12,18.
Automated machine learning (AutoML) aims to identify the appropriate algorithm and hyperparameters, thus reducing the guesswork of the optimization process, with results that can match or improve on the performance of a human expert, even in a shorter period 19,20; its relevant components are the selection, combination, and parameterization of machine learning algorithms 21. AutoML automates specific tasks, such as model selection and hyperparameter optimization, generating complete machine learning models with notable performance. However, the capacity of the algorithms is influenced by other factors, such as data cleaning and understanding of the field or area of knowledge 22,23,24,25.
One of the greatest advantages of AutoML is its ease of use: any member of the academic and scientific community can easily and independently develop prediction models for their area of knowledge. In addition, there are cloud platforms for implementing this type of solution, such as Amazon Web Services SageMaker, Microsoft Azure, Google Colab Pro, and Google Cloud Vertex AI 26. Among the existing limitations of this type of processing is that making the systems work efficiently at large scale and on unbalanced data sets remains a frequent challenge 20,27.
The AutoML process automatically identifies and selects relevant features, optimizes hyperparameters, and evaluates performance metrics based on quality criteria, without relying on human work and manual testing 28,29. For the study, certain solutions and platforms were considered for comparison. The first is AutoKeras, an open-source machine learning toolkit designed to facilitate the adoption of deep learning by people with minimal knowledge of machine learning and programming. Its main strength is that it does not require any prior knowledge of deep learning 30,31.
AutoGluon is a publicly accessible AutoML framework that trains machine learning models on tabular data sets, text, and images. It can achieve high accuracy with little effort thanks to its model-independent and model-specific preprocessing and its combination of multiple layers of several models, which reduces training time; it is recognized for its cutting-edge capabilities and its easy-to-use interface 32,33,34,29.
The HyperOPT library provides a set of algorithms and a parallelization infrastructure for hyperparameter optimization 35, supporting a variety of data mining techniques and selecting the one with the best F1 score and the least information loss 36. It also presents an optimization interface that distinguishes between an evaluation function and a configuration space 37. Furthermore, it enables the construction of a search space that includes various standard components 38.
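The separation between an evaluation function and a configuration space can be illustrated with a minimal, dependency-free sketch. It is not the HyperOPT API and not the study's configuration: plain random search stands in for HyperOPT's optimization algorithms, and the names `SPACE`, `objective`, and `random_search`, as well as the hyperparameter ranges and the toy loss surface, are assumptions made for illustration only.

```python
import math
import random

# Hypothetical configuration space: each entry maps a hyperparameter name
# to a function that draws one candidate value from its range.
SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
    "max_depth": lambda rng: rng.randint(2, 10),            # integer in [2, 10]
}


def objective(config):
    """Toy evaluation function returning a loss to minimize.

    In a real study this would train a classifier with `config` and return,
    for example, 1 - F1 on a validation split. Here it is a made-up surface
    whose minimum lies at learning_rate = 0.1 and max_depth = 6.
    """
    lr_term = (math.log10(config["learning_rate"]) + 1) ** 2
    depth_term = 0.01 * (config["max_depth"] - 6) ** 2
    return lr_term + depth_term


def random_search(objective, space, max_evals=200, seed=0):
    """Sample configurations from `space` and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(max_evals):
        config = {name: draw(rng) for name, draw in space.items()}
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


best_config, best_loss = random_search(objective, SPACE)
print(best_config, round(best_loss, 4))
```

Because the search loop only sees `objective` and `SPACE`, either can be swapped without touching the other, which is the design point the interface description makes.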
The MLJar AutoML framework is an open-source project that enables fast and automated training of models. It presents a simple interface that provides consolidated information about the candidate models, statistics, and graphs, with feature engineering, preprocessing, feature selection, and explanation capabilities for the various algorithms, reducing development time (preprocessing, construction, hyperparameter tuning, and model selection); likewise, it is considered among the seven frameworks of the ten most used 39,40.
The H2O ecosystem is an open-source distributed machine learning platform with libraries for the R, Python, Java, and Scala programming languages, supporting large data sets. Its AutoML method trains GBM, Random Forest, Deep Neural Network, and GLM algorithms, giving the candidate models a healthy level of diversity that stacked ensembles can use to create a robust model, and it offers tools and frameworks that are easy to use 41,42,43.
At the global level, there is constant uncertainty due to substantial changes in the various productive sectors, including higher education, to improve the work being carried out. In this sense, the European Commission has outlined its strategy for universities in the face of the challenges that lie ahead. In our region, the Institute of Higher Education of Latin America and the Caribbean (IESALC) of UNESCO provides the results of consultations carried out with 25 world experts; the report highlights a demand for quality and comprehensive education for all, adapting the structures of the institution to the needs and characteristics of its students, in order to strengthen higher education as a human right and make it inclusive. These changes require new values, behaviors, knowledge, and skills. To improve effectiveness and efficiency in the university system, interrelated components must be reevaluated, going through a technological transition and a culture of quality, transparency, and accountability 44.
The COVID-19 epidemic has transformed teaching, professional training, learning, and student well-being 45. In this sense, educational artificial intelligence is considered a relatively new and emerging field of specialization, with the potential to revolutionize teaching methods and students' learning experiences; its application can be oriented towards institutional strategy and management as well as teaching and learning 46. In addition, extensive data analysis can be used in the various areas of the education industry 47.
Delays in payments among university students have several causes, one of which is that students have difficulties obtaining financing. That is, loan eligibility and repayment conditions may make it difficult to obtain financial aid, and the difficulty is greater if students have to start repaying loans before graduating. Likewise, family finances can cause delays in payments due to the inability to pay. This financial burden can make it difficult for students to pay tuition on time. Other reasons are the increase in pensions and educational fees. Universities and authorities should strengthen financial aid to part-time students to reduce payment delays 48.
Other reasons are economic recession, the charging of interest for late payments, and the lack of financial assistance and scholarships. Without sufficient financial aid, students may struggle to meet their financial commitments. The government could address such payment delays in collaboration with universities by providing more assistance, especially to disadvantaged students, and greater flexibility through financial support programs, administrative improvements, and flexible payment options 49.
The report prepared by Ilie et al. 50 on access to higher education indicates that access remains unequal in many countries, although it is one of the most important factors in achieving multiple sustainable development goals. In Peru, public higher education institutions are free, but admission is highly selective and many students attend private institutions, which presents another potential barrier for poor students, with a significant gap among study participants of 55% and 5%, respectively, whose ages are between 18 and 22 years old; women are more likely to enroll in the higher education system in most socioeconomic groups. Demographic characteristics, particularly gender, ethnicity, and urban location, also predict participation in higher education, evidencing intersectionality and educational success according to classification models developed for payment behavior. Together with classification models for student dropout, these are tools aimed at the institutional management of universities to ensure the permanence and well-being of students 50,51,52.
In Peru, the VI Delinquency Report, prepared by Equifax in collaboration with the Universidad del Pacifico, was published in 2022. It puts the delinquent debt at 10,629 million soles, distributed as 23% mortgage loans, 19% personal loans, 27% loans to small businesses, 9% loans to microbusinesses, and 8% credit cards; likewise, the regions most affected by non-performing debt are Lima with 15,242 million soles, followed by Arequipa with 2,527 million soles and La Libertad with 1,426 million soles 53.
Regarding comparative analyses of AutoML algorithms, Ferreira et al. 54 carried out a study under two approaches. The first contrasts 10 open-source technologies for supervised learning, and the second focuses on one-class learning using grammatical evolution. The data set was collected from a software company to predict the number of days until equipment would fail, and the compared solutions obtained similar results. Gijsbers et al. 56 evaluated nine AutoML frameworks for developing classification and regression models, in which the AutoGluon library demonstrated better performance metrics in less execution time. The evaluation also revealed some limitations, such as the impossibility of attributing the performance of an AutoML tool to a specific aspect, because the tools were used with their default settings.

Methodology
Automated machine learning algorithms such as AutoKeras, AutoGluon, HyperOPT, MLJar, and H2O were used to develop student payment delinquency classification models. Subsequently, a comparative analysis was carried out to determine the model with the best performance metrics. The data set comprised 8,495 records and 12 variables or characteristics. The exogenous attributes considered were sex, type of educational institution, employment status, socioeconomic level, disability, family burden, marital status, study scholarship, faculty, branch, and enrollment number. The endogenous variable, delinquency, is dichotomous and constitutes the target characteristic to be predicted. The exploratory analysis showed that the data set does not present anomalies (atypical or missing data).
The classification model for late payments in university students was developed using the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology, which provides a guide for creating data science projects and guarantees the quality of the results efficiently and systematically. Its phases are business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

Results
A descriptive statistical analysis of the default variable with respect to the exogenous variables is presented to understand and identify the distributions in the data set. Students who do not work have higher default rates. On the other hand, students who do not have any disability and do not have family responsibilities are more likely to make their payments; the same is true of those who have a partner.
Table 1 shows an imbalance in the data with respect to the target variable, delinquency. The polytomous variables faculty and enrollment were encoded using the OneHotEncoder algorithm from the Sklearn library, resulting in a data set comprising 20 dichotomous exogenous variables. For the third data set, Table 5 shows that the StackedEnsemble model has the optimal metrics for classifying delinquent students, followed by the ExtraTreesEnt and Ensemble models with very similar performance indicators. Table 6 presents the classification models with the best performance metrics achieved using the AutoML algorithms for each of the undersampling, oversampling, and combined sampling data sets defined using the data balancing technique.
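The two preparation steps described above, encoding a polytomous variable into dichotomous columns and balancing the target classes, can be sketched in a few lines. This is a dependency-free illustration under stated assumptions, not the study's pipeline: `one_hot` only mimics the kind of output scikit-learn's OneHotEncoder produces, `rebalance` uses plain random undersampling and oversampling rather than the specific balancing algorithms of the study, and the column names and toy rows are hypothetical.

```python
import random
from collections import Counter


def one_hot(rows, column):
    """Replace a polytomous column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = int(row[column] == cat)
        encoded.append(new_row)
    return encoded


def rebalance(rows, target, mode, seed=0):
    """Equalize class sizes: mode='under' shrinks to the smallest class,
    mode='over' grows every class to the size of the largest one."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[target], []).append(row)
    sizes = [len(group) for group in groups.values()]
    n = min(sizes) if mode == "under" else max(sizes)
    balanced = []
    for group in groups.values():
        if len(group) >= n:
            balanced.extend(rng.sample(group, n))  # keep n rows, drop any surplus
        else:
            extra = rng.choices(group, k=n - len(group))  # duplicate rows at random
            balanced.extend(group + extra)
    return balanced


# Hypothetical toy records: five delinquent students and one non-delinquent.
rows = [
    {"faculty": f, "delinquent": d}
    for f, d in [("Engineering", 1), ("Health", 1), ("Law", 1),
                 ("Health", 1), ("Engineering", 0), ("Law", 1)]
]
encoded = one_hot(rows, "faculty")
under = rebalance(encoded, "delinquent", mode="under")
over = rebalance(encoded, "delinquent", mode="over")
print(Counter(r["delinquent"] for r in under), Counter(r["delinquent"] for r in over))
```

A combined strategy would apply both directions, oversampling the minority class and then removing part of the majority class; it is omitted here to keep the sketch short.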
The metrics obtained are very similar in the three experiments, with no significant difference found between the three models. Likewise, the H2O platform performs better on large data sets, as shown by the StackedEnsemble model on the combined sampling data set. Its accuracy of 0.779 indicates that the model correctly classifies around 78% of the observations, which is an acceptable level of accuracy. The F1-score of 0.871 is the harmonic mean of precision and sensitivity; a value close to 1 indicates a good balance between the two. The recall of 0.904 shows that the model detects around 90% of the real positive cases; that is, a high level of sensitivity. The precision of 0.840 means that when the model predicts the positive class it is correct in 84% of the cases, which is considered a high level of precision.
Finally, the research was carried out from a multidisciplinary approach integrating several areas of knowledge, allowing a comprehensive and holistic understanding of late payments among university students. It considers the multiple underlying factors and the financial, emotional, academic, and health needs at stake, ethically and responsibly, in order to develop the most effective strategies for student support in the complex context of university higher education.
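The four metrics discussed above all derive from the confusion matrix of a binary classifier. The sketch below uses their standard definitions; the function name `classification_metrics` and the example counts are hypothetical and do not come from the study. The last lines check that the reported F1-score is consistent with the reported precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of positive predictions that are correct
    recall = tp / (tp + fn)     # share of real positives that are detected
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to illustrate the formulas.
example = classification_metrics(tp=90, fp=10, fn=30, tn=70)
print(example)

# Consistency check on the values reported for the StackedEnsemble model:
# the harmonic mean of precision 0.840 and recall 0.904.
reported_precision, reported_recall = 0.840, 0.904
implied_f1 = 2 * reported_precision * reported_recall / (reported_precision + reported_recall)
print(round(implied_f1, 3))  # 0.871
```

Accuracy cannot be recovered from precision and recall alone, because it also depends on the number of true negatives, which is why it is reported separately.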

Table 1. Description of data set variables.

Table 1. Descriptive analysis of the data set.

Table 2. Parameters for data balancing.

Table 3. Performance metrics of AutoML algorithms for NearMiss sampling.

Table 6. Summary of performance metrics for models with optimal performance.

On the other hand, the National Artificial Intelligence Strategy and the promulgation of Law No. 31814 provide regulations that promote the use of artificial intelligence in favor of the economic and social development of the country. The study makes a significant contribution in this regard, as it considers the legal implications that are vital in this type of prediction model: the protection and privacy of personal data, where the model must comply with current laws and regulations; algorithmic equity, meaning that the model must be objective and non-discriminatory, being fair to all groups of people; transparency and explainability, which seek ways to make complex models more understandable for people; and the ethics and regulation of artificial intelligence. The study thus represents an innovative tool and a methodological advance in the application of automated machine learning techniques in university management.