Development of a Classification Model for Predicting Student Payment Behavior Using Artificial Intelligence and Data Science Techniques

Artificial intelligence has become a valuable tool for decision-making, and universities must adapt and optimize their processes to improve the quality of their services. In this context, income from collections is vital for sustainability. Several kinds of problems can contribute to student delinquency: economic, financial, academic, family, and personal. For this reason, the study aimed to develop a classification model to predict the payment behavior of enrolled students. The methodology is a proactive, technological study of incremental innovation with a synchronous temporal scope. The study population consisted of 8,495 undergraduate students enrolled in the 2022-II academic semester, with information on academic performance, financial situation, and personal factors. The result is a classification model built with the H2O.ai platform, discretization algorithms, data balancing, and the R language. The database was obtained from the institution's computer system using data science algorithms. The data set was split 70% for training and 30% for testing, yielding the GBM Grid model, whose performance metrics are an AUC of 0.905, an AUCPR of 0.926, and a logLoss of 0.311; that is, the model classifies student debtors effectively, so that they can be offered early intervention services that help them complete their studies.


Financing in Education
Total public spending on education averages 10.6%, according to the OECD (2022), and most of it is destined for the primary and secondary levels; the tertiary level relies more on contributions from private sources. Likewise, between 2015 and 2019 the proportion of public spending allocated to education decreased slightly, by around 1%, among OECD countries; this figure then grew because of the global crisis caused by the COVID-19 pandemic, which prompted governments to spend more to reactivate their economies. Average private spending on educational institutions remained stable. On the other hand, regardless of the educational level, staff remuneration and other current expenses represent an average of 90% of expenditure in educational institutions.

In the affordability (Wiesmüller, 2023; Chatterjee et al., 2023; Vähäkainu & Lehto, 2023; Chryssolouris et al., 2023). To date, there is a wide catalog of artificial intelligence solutions. The arrival of OpenAI ChatGPT, Google Bard, Microsoft Bing, and Adobe Firefly, among others, has brought substantial change to people, businesses, and organizations. These innovations have been made possible by computing capacity and the enormous amount of available data, and artificial intelligence has come to be adopted in daily life and in many sectors, revolutionizing society. One of the barriers encountered is explainability, which has generated a new field, explainable artificial intelligence. It provides explanations for the predictions, recommendations, and decisions of machine learning and deep learning systems, offering greater transparency, interpretability, and explainability for the users of its algorithms (Angelov et al.; Miller, 2017).

In education, the development and use of information and communication technologies has given rise to artificial intelligence, and institutions have adopted and incorporated integrated systems that perform functions similar to those of teachers or instructors, substantially improving the efficiency and effectiveness of their teachers and giving better educational quality to their students (Chen et al. argue that data mining is proliferating and is defined as an interdisciplinary field derived from computer science, considered a process for discovering patterns in vast volumes of data by applying one or more solid algorithms or techniques, according to the types of data and the proposed objective. It is also considered the process of extracting significant information from a database, an effective tool whose phases (cleaning, integration, selection, transformation, and data visualization) help the development of applications for different areas of knowledge (Tan et al., 2022; Yağcı, 2022).

Educational mining, for its part, explores and analyzes data that come from multiple sources originated by educational services, with the support of data mining. In this sense, higher education institutions that use educational mining techniques and tools can make better decisions and explain educational phenomena (Sharma & Sharma, 2018; Romero & Ventura, 2012). Educational mining draws on multiple disciplines, including cognitive science, computer science, cognitive psychology, education, and statistics, which raises the challenge of improving the quality of educational processes by implementing strategies and plans based on the information extracted (Koedinger, 2015; do Carmo & da Silva Lemos, 2022; Sumitha & Vinothkumar, 2016). The importance of data mining lies in: a) collecting large amounts of data to extract information; b) interpreting the data for its subsequent transformation into information; and c) evaluating the behaviors and ideas of consumers, resulting in data-driven organizational growth (Bolaño García et al., 2023; Olufemi, 2021).

Machine Learning
Machine learning has become an indispensable tool with an advanced approach to data analysis. It is defined as the ability of systems to learn from problem-specific training data in order to generate analytical models and solve the associated tasks. Deep learning, in turn, is a machine learning concept based on neural networks; deep learning models are reported to outperform other machine learning models in most applications ( The first is a machine's ability to develop cognitive functions: perception, reasoning, learning, interaction with the environment, problem-solving, decision-making, and creativity, while machine learning is a technique used in artificial intelligence to allow machines to learn from data and improve their performance in a specific task (Benito, 2022; Corrêa da Silva, 2022; Kühl et al., 2022; Silva, 2022).

Although machine learning models are successful at solving problems and making decisions, their complexity makes them a black box whose internal functioning is hard for a person to understand. This is where interpretability plays a vital role: it addresses the lack of transparency and helps avoid negative consequences, such as producing correct answers for the wrong reasons in high-risk areas (Rudin et al., 2021; Zhou et al., 2021).

The three types of machine learning are: a) supervised learning, in which labeled data are provided to train the model, and the model learns to make predictions from the training data and the corresponding labels; b) unsupervised learning, in which no labels are provided and the model must find patterns and structures in the data on its own; and c) reinforcement learning, in which the model learns to make decisions in a given environment to maximize a reward, receiving feedback through rewards or punishments depending on its decisions (Kühl et al., 2022).

Data Balancing
For Chawla (2005), a data set is unbalanced if the classification categories are not equally represented, which affects training and leads to erroneous predictions and poor performance of the resulting models. The study by Ghanem, Venkatesh & West (2008) describes machine learning methods that take structured data files as their main input under the assumption that the classes of categorical variables are similar in quantity. In reality, however, data are stored primarily in relational database systems and are unbalanced; that is, one class holds a greater proportion of the data than the others. Machine learning with unbalanced data has also become a research area of its own, aimed at finding efficient methods for solving real problems, which requires a broad vision to understand the nature of learning (Chawla, 2009; Lali et al., 2023a; Lali et al., 2023b). The problem has accordingly received attention aimed at improving the performance of classification models, considering factors such as distribution and cost-sensitive learning, among others (Yin et al., 2020). In this sense, to obtain relevant data in higher education institutions, there is a proposal for the management model of

Feature Selection
A characteristic (feature) is conceived as an individual measurable property of the observed process. In the digital age, data are generated by many different information systems, which increases the dimensionality of the data. Having more characteristics should result in more discrimination; practice, however, indicates that this is not always the case. Several factors affect the success of machine learning, among them the quality of the data set. Feature selection consists of selecting a specific subset of variables from the original set that can efficiently describe the input data while reducing the effects of irrelevant variables and still providing good prediction results. That is, it identifies and eliminates irrelevant and redundant information to reduce dimensionality, so that algorithms run faster and more efficiently and reach optimal or desired performance (Shi, ).

Higher education in Latin America and the Caribbean, according to Gazzola (2021), presents disruptions and instability that try to stop its transformative incidence, driven by the interest of the elites in betting on the privatization of higher education institutions. Another factor indicated is corruption, which has been affecting resources and generating an ethical and moral crisis. A further factor is corporatism, which fails to defend the changes needed to give new meaning to higher education. Finally, there is discontinuity, since each government in power has impacts that do not guarantee stable regulatory frameworks and resources. In Peru, the Federation of Private Institutions of Higher Education (FIPES), through its president Juan Ostopa, indicates that 15% of students dropped out of university during the state of emergency, and estimates that in the following semester university desertion would reach 35%, with approximately 350,000 students ceasing to study. In addition, payment delinquency reaches 50%, making it difficult or even impossible to sustain the universities, which could reverse the university reform and would definitely leave them without qualified personnel (FIPES cited by Quinto, 2020). SUNEDU (2021) argues that 28% of young Peruvians had access to university, and only 10.3% of young adults had access to postgraduate studies. This is due to the health crisis, which has significantly impacted master's and doctoral programs. Likewise, one in five students did not have a computer at home, and 22% did not have an Internet connection at home, mitigating these inconveniences with mobile Internet devices. The interruption of studies increased significantly, rising in private universities from 6% in 2019 to 18% in 2020.

Higher education institutions, throughout their institutional life, according to Mense et al. (2020), have been strategically looking for reliable methods and means to improve the learning process. This has been considered one of the current challenges because of the growth of educational data and the question of how to use it to improve the quality of decision-making associated with the efficiency, objectivity, transparency, and innovation of organizations. On the other hand, machine learning has been achieving a significant impact on society thanks to the diversity of solutions it offers for complicated real-world problems such as classification, prediction, and grouping occurring during the pandemic (

The private university under study is an academic community with social responsibility that aligns its research, teaching, cultural outreach, and social projection activities to provide comprehensive education with a clear awareness of the country as a multicultural reality; according to its internal regulations, it conforms to University Law No. 30220. Its economic and financial resources derive from the commitments of the students, generated by tuition fees, educational fees, and debts receivable, among other income or contributions. These commitments are breached with respect to the established payment schedule, for several reasons. First, students have generalized an inadequate payment culture: they pay at the end of the academic semester, protected by Law 29947 of November 28, 2012, the Law for the Protection of the Family Economy, which allows students to continue their studies without fulfilling the economic commitments generated, causing high delinquency rates that affect the economic flow of the organization. Second, the collection strategies and policies of the organization are inadequate; they generate reluctance to pay tuition on the part of the students and even complaints to the regulatory and consumer protection entity, resulting in fines and sanctions against the university. Third, the deceleration of macroeconomic variables, corruption, insecurity, the COVID-19 pandemic, social upheaval, and natural disasters converge in this problem, influencing the drop in employment rates. All this contributes to an irresponsible culture on the part of the users of the university service. That is why it is necessary to know the current situation of payment behavior among university students through a system based on data mining and artificial intelligence. For these reasons, we ask ourselves: What is the classification model based on machine learning algorithms and data mining to predict the payment behavior of students in a private university?

Objective
Develop a classification model to predict the payment behavior of students in a private university by applying machine learning algorithms and data mining techniques.

Methodology
The research focused on developing a classification model of payment behavior to predict which students will not pay. A total of 8,495 students were considered as participants, and a detailed analysis of the data was carried out using machine learning techniques and methods, with techniques for unbalanced data and feature selection as fundamental tools that help find patterns in the data. In addition, with the help of the H2O platform, it was possible to understand the patterns and predict the students who have late or owed commitments. This tool made it possible to identify the model with the greatest precision and the best performance in quality metrics. In this way, it will be possible to classify in advance the students who are likely to be unpaid and thus develop assertive strategies to counteract the situation in a timely manner. The systems used as tools in the generation of the classification model for payment behavior were the R Statistical Software language (v4.2.

Results
The data set was extracted from the computer system of the higher education institution. The characteristics or variables are detailed in Table 1. The data analysis process was developed in several stages. First, data cleaning and preparation were performed by removing outliers, coding categorical variables, and creating additional variables, also called variable discretization.
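The coding and discretization step described above can be illustrated with a minimal sketch. It is written in Python for illustration, whereas the study itself used R; the column names (`grade_avg`, `late`), the cut points, and the helper functions `discretize` and `to_dichotomous` are hypothetical and do not come from the study's data set.

```python
from bisect import bisect_right

def discretize(value, cut_points, labels):
    """Return the label of the interval that contains value.

    cut_points must be sorted and len(labels) == len(cut_points) + 1.
    """
    return labels[bisect_right(cut_points, value)]

def to_dichotomous(rows, column):
    """Replace a categorical column with one 0/1 indicator per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

# Hypothetical records; the real variables are those described in the paper's table.
students = [
    {"grade_avg": 9.5, "late": 1},
    {"grade_avg": 13.0, "late": 0},
    {"grade_avg": 17.2, "late": 0},
]

# Step 1: turn the numeric average into an ordered band (assumed cut points).
for s in students:
    s["grade_band"] = discretize(s.pop("grade_avg"), [10.5, 15.0], ["low", "mid", "high"])

# Step 2: expand the band into dichotomous (0/1) indicator variables.
encoded = to_dichotomous(students, "grade_band")
print(encoded[0])
```

Each resulting row contains only 0/1 values, which is the kind of dichotomous data set the text describes.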
The discretization of the variables produced a dichotomous data set comprising 33 predictor (exogenous) variables or characteristics and one response (endogenous) variable, which allows better performance in machine learning algorithms. Subsequently, resampling was carried out, that is, under-sampling and over-sampling of the participants, to guarantee adequate proportions of the objective or response variable. Next, the number of variables was reduced using the feature selection method, eliminating five independent variables. Then the machine learning algorithms were trained, with 70% of the data used for training and 30% for testing, and the h2o.automl method was executed, allowing us to find the best prediction models for each case with their respective metrics.
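The resampling and the 70/30 partition can be sketched as follows. This is a simplified illustration in Python, not the study's R and H2O code: random under- and over-sampling toward the mean class size is an assumed strategy, and the names `rebalance`, `split_train_test`, and the `late` target column are hypothetical.

```python
import random

def rebalance(rows, target, seed=42):
    """Randomly under- and over-sample so every class reaches the mean class size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target], []).append(row)
    size = sum(len(group) for group in by_class.values()) // len(by_class)
    balanced = []
    for group in by_class.values():
        if len(group) >= size:
            # Majority class: keep a random subset (under-sampling).
            balanced.extend(rng.sample(group, size))
        else:
            # Minority class: add random duplicates (over-sampling).
            balanced.extend(group + rng.choices(group, k=size - len(group)))
    rng.shuffle(balanced)
    return balanced

def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and split them into training and test partitions."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical unbalanced data: 80 students who paid (0) and 20 who are late (1).
data = [{"id": i, "late": 0} for i in range(80)] + [{"id": i, "late": 1} for i in range(80, 100)]

balanced = rebalance(data, "late")
train, test = split_train_test(balanced, train_fraction=0.7)
print(len(train), len(test))
```

The order here mirrors the text (resample, then split). A common alternative is to resample only the training partition, which keeps duplicated rows out of the test set.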

Conclusions
Having developed the GBM Grid classification model for the prediction of payment behavior in students of a private university in Peru, we can conclude that it meets the performance metrics, with a Gini of 0.810, an AUC of 0.905, an AUCPR of 0.926, and a logLoss of 0.311. In addition, the precision, sensitivity, and specificity demonstrated a high success rate in obtaining satisfactory results. Likewise, the model supports unbalanced data and multiclass characteristics. The results were improved by adjusting the model parameters. Cross-validation made it possible to evaluate the model's accuracy and to make predictions in real time. These results show an efficient classification model that can use algorithms to infer the knowledge obtained from a data set, which allows rigorous monitoring and analysis to detect delinquency in university students, achieving greater financial control over the student body and helping the authorities apply tools to promote financial responsibility.

The research contributes significantly to knowledge by providing a tool for decision-making in finance to predict student delinquency and, at the same time, to understand in depth the causes that lead to this phenomenon. Regarding the practical implications, a model has been satisfactorily established to identify students with delinquent tendencies, which allows the university to take precautions and measures to avoid non-compliance with tuition payments so that the impact is reduced. The generated classification model can be used by those responsible for the financial administration of the institution to apply policies and improvements in collection procedures and prevent future inconveniences related to delinquency. That is, it opens the possibility of early intervention in potentially problematic situations through strategies that grant students financial benefits such as scholarships, discounts, incentives, promotions, and educational loans, among others, in order to achieve greater satisfaction among them and, at the same time, reduce administrative costs. In this way, it improves the quality of services and the financial security of the university community.

As a suggestion, the classification model for predicting student payment behavior should be implemented in the university for timely intervention by the responsible personnel. Later, it will be possible to follow up on the students who have received the intervention programs. Finally, it will be possible to evaluate the effectiveness of the intervention programs offered by the institution. On the other hand, the functionality and processing capacity provided by the H2O.ai platform for the automatic generation of learning models save time and resources, allowing users to perform data preprocessing and model generation with training data easily and, later, to evaluate the model's metrics with test data, so that H2O can identify the optimal model according to the defined parameters.
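The relationship between the reported Gini and AUC values can be checked with a short sketch. It assumes the usual definition Gini = 2 × AUC − 1, under which the reported AUC of 0.905 gives the reported Gini of 0.810. The code is Python for illustration only; the function names and the toy labels and scores are hypothetical and are not taken from the study.

```python
import math

def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(auc_value):
    """Gini coefficient derived from AUC (assumed definition: 2 * AUC - 1)."""
    return 2 * auc_value - 1

def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

# Consistency check on the reported figures.
print(round(gini(0.905), 3))

# Toy example (hypothetical labels and predicted probabilities of late payment).
y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.2]
print(auc(y_true, y_prob), log_loss(y_true, y_prob))
```

A lower logLoss indicates better-calibrated probabilities, while AUC and Gini measure only how well the model ranks late payers above the rest.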
., 2020; Mejías et al., 2022; Subbarayan & Gunaseelan, 2022).

Data Mining
Currently, in a digital world, huge volumes of data are provided by IoT, cybersecurity, mobility, social networks, commerce, health, and other sources; these data must be analyzed intelligently to build intelligent applications in which artificial intelligence and, specifically, machine learning play the leading role. Machine learning algorithms are classified into supervised, unsupervised, semi-supervised, and reinforcement learning (Silva Coimbra & Rodrigues Dias, 2022; Marinho de Sousa & Shintaku, 2022; Takaki & Dutra, 2022; Sarker, 2021), and many software tools are available (Bartschat et al., 2019). Garg & Sharma (2013) and Francis et al. (2019)

information and communication technologies for higher education institutions (Villarreal et al., 2021). Furthermore, the analysis performed by Liu et al. (2020) reports degradation and low performance in classification algorithms when the data set is unbalanced with a minority class. Likewise, using inappropriate or incorrect metrics to evaluate the performance of the algorithms can affect the experimental results of a classification model with highly unbalanced data (Hancock, Khoshgoftaar & Johnson, 2023; Martín Ferron, 2022; de Araújo Telmo et al., 2021).
2022; Zebari et al., 2020; Venkatesh & Anuradha, 2019; Chandrashekar & Sahin, 2014; Hall, 1999). Dimensionality reduction contemplates two main methods: a) feature selection, an important method that effectively solves dimensionality problems, for example by decreasing redundancy and improving the understandability of the results; and b) feature extraction, which searches for the most distinctive, informative, and reduced subset of features, improving data processing and storage (Driss Hanafi et al., 2023; Macea-Anaya et al., 2023; Olusegun Oyetola et al., 2023; Zebari et al., 2020). According to Chandrashekar & Sahin (2014), the methods for feature selection are: a) filter methods, which use variable ranking techniques, with ordering as the selection criterion; b) wrapper methods, which use the predictor as a black box and evaluate each subset of variables using the performance metrics as the objective function; and c) embedded methods, which reduce the computation time spent reclassifying different subsets by incorporating feature selection into the training process. Li et al. (2016) state that feature selection is an effective and efficient strategy for creating simpler and more understandable models, improving performance, and preparing clean and understandable data. Saeys et al. (2007) argue that the most critical objectives of feature selection are: a) to avoid overfitting and improve model performance; b) to generate faster and more reliable models; and c) to gain insight into the underlying processes that generated the data.

Problem
In the United States, in 2012, the outstanding balance of student loans exceeded one trillion dollars; between 2005 and 2012, the delinquency rate on student loans increased by 77%. This figure was negatively associated with the suicide rate among people aged 20 to 34 (Jones, 2019). On the other hand, in South Africa, one of the first challenges faced by future university students is the need to obtain financing for their studies, so universities are being pressured to grant scholarships and ensure that students are not excluded from the university system (López Pérez et al., 2022; McKay et al., 2021; Abdul et al., 2022; Cárdenas Espinosa et al., 2023; Correa Moreno & González Castro, 2023; Junco Luna, 2023; Silva-Sánchez, 2022);

Thus, academic and industrial areas have included the recognition of patterns and trends; computer vision and natural language processing have demonstrated the capacity of deep networks, influencing performance and improving results and achieving the development of the sector (Wang et al., 2016; Albarracín Vanoy, 2022; Khalaf, 2021).

Figure 1. ROC curve of the GBM Grid model.

Table 2. Description of the data set with its respective characteristics.

Table 3. Description of quality metrics of models with training data.

Table 3. Description of quality metrics of models with test data.