Financial Fraud: Identifying Corporate Tax Report Fraud Under the XGBoost Algorithm

INTRODUCTION: With the development of the economy, financial fraud has become more and more frequent. OBJECTIVES: This paper aims to study the identification of corporate tax report falsification. METHODS: Firstly, financial fraud was briefly introduced; then, samples were selected from the CSMAR database, 18 indicators related to fraud were selected from corporate tax reports, and 13 indicators were retained after information screening; finally, the XGBoost algorithm was used to recognize tax report falsification. RESULTS: The XGBoost algorithm had the highest accuracy (94.55%) when identifying corporate tax statement falsification, while the accuracy of the other algorithms, such as the logistic regression algorithm, was below 90%; the F1 value of the XGBoost algorithm was also high, reaching 90.1%; it also had the shortest running time (55 s). CONCLUSION: The results support the reliability of the XGBoost algorithm in the identification of corporate tax report falsification. It can be applied in practice.


Introduction
With the continuous development of the capital market [1], the development of enterprises has also been subject to great challenges. In order to further regulate the capital market, its supervision has become more and more strict [2]. However, there are still many enterprises that take the risk and choose financial fraud driven by profit [3]. Financial fraud refers to the act of intentionally fabricating false financial reports to mislead information users and make improper profits, which is a great threat to economic security [4]. Enterprises obtain illegal benefits by falsifying their income and expenses, which not only seriously affects the judgment of investors but also is detrimental to the stability of the capital market [5]. Therefore, identifying financial fraud in advance is of great importance for effectively avoiding risks and maintaining market order. From a machine learning perspective, the identification of financial fraud is a binary classification problem [6]. With the development of computer technology, artificial intelligence, machine learning and other methods have been widely used in the financial field [7], where they can effectively detect and predict fraud [8]. There are also many studies on text mining [9], which focus on mining clues from textual information such as annual reports and announcements of enterprises. At the same time, the study of financial fraud from interdisciplinary perspectives, such as psychology and sociology, has become a new way of thinking.
Houssou et al. [10] studied the situation of data imbalance in financial fraud prediction and analyzed fraud prediction using homogeneous and non-homogeneous Poisson processes. They found through experiments that the method exhibited a prediction ability superior to that of a baseline approach. Swa et al. [11] designed a knowledge graph (KG) framework and then analyzed the performance of four machine learning algorithms on fraud detection and found that the support vector machine (SVM) performed best on the test set. Akra et al. [12] analyzed the roles of the Altman and Beneish models in detecting early profit manipulation and applied them to the Kuwaiti stock market. They found that the Beneish model had good power in predicting possible earnings manipulation or report falsification by firms. Zhou et al. [13] designed a convolutional neural network (CNN)-based method for fraud in supply chain finance and found through tests that the method had high accuracy and recall rate. Burke et al. [14] found that a brief online educational intervention could reduce fraud susceptibility. Davidson [15] analyzed 1805 executives and found that executives suspected of fraud had stronger equity incentives than executives in similar positions in nonfraudulent companies, and that equity incentives for all members of the top management team could be considered when identifying financial fraud. Novatiani et al. [17] analyzed the data of 90 state-owned enterprises by SEM-PLS and found that an effective internal audit function could prevent financial statement falsification.
The study of financial fraud focuses on how to detect, identify, and prevent companies or individuals who intentionally fabricate false information in their financial reports to deceive investors. For example, some companies may misrepresent their net profits to attract investments or fabricate false accounts to commit financial fraud. The identification of fraudulent corporate tax reports can effectively detect fraudulent behavior of enterprises, thus protecting the interests of investors and the public, which is of great importance for maintaining the stability of the market and ensuring compliance. This paper used a relatively novel machine learning algorithm, i.e., the XGBoost algorithm, to study financial fraud, screened the falsification identification indicators by the indicator information value (IV), and examined the reliability of the method by comparing it with other methods. The research in this paper provides a new method for identifying fraudulent behavior of enterprises in reports, which can be applied in practice to better detect fraudulent behavior of enterprises and thus promote the stability of the capital market.

Overview of Financial Fraud and Fudging
As the economy grows, the number of frauds occurring in companies is increasing [18]. Table 1 lists recent financial fraud cases. These cases show that most frauds are related to the falsification and beautification of reports and have the following characteristics: ① the more backward the economic development of the region, the higher the possibility of fraud in enterprises; ② the ways of fraud are more complex, and more items are used to manipulate profits; ③ the motives of fraud are more diversified, and some even involve criminal offenses; ④ the frauds are more concealed and more difficult to detect.
Financial fraud and falsification will lead to punishment of the enterprise, damage its corporate image, and cause its stock price to fall, which is not conducive to the long-term development of the enterprise. For investors, financial fraud and falsification will lead them to make wrong judgments due to false information and suffer economic losses [19]; for the capital market, the endless fraud and falsification will disrupt the market order and affect investors' investment confidence, which is not conducive to the healthy development of the capital market [20]. Therefore, the identification of financial fraud is very important: it helps improve the quality of audits, reduces the investment risk for investors, creditors and other information users, and supports the anti-fraud work of regulators.

Data selection and processing
Reports can reflect the economic situation of enterprises [21]; therefore, this paper studied frauds from the tax reports of enterprises. Samples were selected from the China Stock Market & Accounting Research (CSMAR) database, including 482 fraudulent tax reports of 227 Shanghai and Shenzhen A-share listed companies between 2011 and 2020 and another 482 normal tax reports of 227 non-fraudulent enterprises belonging to the same industry as the fraudulent enterprises in the same period. For the experimental data, after eliminating invalid, duplicate, and abnormal data, the missing values were filled using the mean value. The data were normalized in order to avoid the errors caused by the index magnitude, and the corresponding formula is:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$
where $x$ is the original data, $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of the original data, and $x'$ is the processed data, normalized to between 0 and 1.
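The preprocessing described above (mean filling followed by min-max normalization) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the sample column are hypothetical, and missing values are assumed to be encoded as `None`.

```python
from typing import List, Optional


def fill_missing_with_mean(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values to [0, 1] using x' = (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0 (an assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical indicator column with one missing value.
column = [2.0, None, 4.0, 6.0]
filled = fill_missing_with_mean(column)
scaled = min_max_normalize(filled)
print(filled)
print(scaled)
```

In practice each indicator column would be processed separately, so that indicators with large magnitudes do not dominate those with small ones.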
In order to obtain a high identification accuracy, indicators related to fraud were selected from the tax reports. The selection of indicators was considered mainly from the following aspects.
(1) Debt service: In the case of a high proportion of enterprise liabilities, the management of an enterprise may be more inclined to disclose good news and avoid bad news. At this time, there is a possibility of falsifying the reports, including using accounts receivable and inventory to make adjustments, inflating profits, and beautifying the reports. Therefore, the selection of indicators needs to consider the debt service of an enterprise.
(2) Operation: The operation of an enterprise reflects the use and management of its capital. In fraud, profits can be manipulated by reducing the inventory turnover rate and increasing the proportion of inventories. Therefore, the selection of indicators needs to consider the operation of an enterprise.
(3) Profitability: When a company has a poor level of profit, not only will managers' earnings be reduced, but the company's ability to raise capital will also be affected, and the possibility of fraud also exists at this time. Thus, the profit situation of an enterprise should be paid much attention to.
(4) Risk: In a certain period of time, the cost and structural changes of an enterprise directly affect the revenue; therefore, the risk profile of an enterprise also needs to be considered in the selection of indicators.
(5) Cash flow: The cash flow situation of an enterprise is related to its ability to pay, and abnormal changes in the relevant indicators are likely to indicate fraud; therefore, the cash flow situation of an enterprise can be used as one of the indicators for fraud identification.
Based on the above aspects, the falsification identification indicators shown in Table 2 were selected.

The indicators in Table 2 were further screened. The predictive ability of the indicators was determined by calculating the indicator information value (IV). The IV was calculated based on the weight of evidence (WOE). The calculation formula of WOE for group $i$ is:
$$WOE_i = \ln \frac{p(i \mid \text{falsified})}{p(i \mid \text{normal})},$$
where $p(i \mid \text{falsified})$ refers to the proportion of falsified samples in the current group to the total falsified samples after grouping and $p(i \mid \text{normal})$ refers to the proportion of normal samples in the current group to the total normal samples after grouping. The larger the WOE value, the larger the share of falsified samples in that group relative to normal samples. On this basis, IV is calculated:
$$IV = \sum_{i} \left[ p(i \mid \text{falsified}) - p(i \mid \text{normal}) \right] \cdot WOE_i.$$
The value of IV can reflect the contribution of an indicator to label differentiation. It is generally considered that indicators with IV values below 0.02 do not have valid information. The calculation results of the IV values of the indicators in Table 2 are shown in Table 3.
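As a rough illustration of the WOE/IV screening, the sketch below computes both quantities from per-group counts using the standard definitions, WOE as the log ratio of the two group proportions and IV as the proportion difference weighted by WOE. The function name, the binning into two groups, and the counts are hypothetical; the sketch also assumes every group contains at least one falsified and one normal sample, since a zero count would make the logarithm undefined.

```python
import math
from typing import List, Tuple


def woe_iv(groups: List[Tuple[int, int]]) -> Tuple[List[float], float]:
    """Compute per-group WOE and the total IV of one indicator.

    groups holds (falsified_count, normal_count) for each bin of the indicator.
    """
    total_falsified = sum(f for f, _ in groups)
    total_normal = sum(n for _, n in groups)
    woes = []
    iv = 0.0
    for falsified, normal in groups:
        p_falsified = falsified / total_falsified  # share of all falsified samples
        p_normal = normal / total_normal           # share of all normal samples
        woe = math.log(p_falsified / p_normal)
        woes.append(woe)
        iv += (p_falsified - p_normal) * woe
    return woes, iv


# Hypothetical indicator split into two bins: (falsified, normal) counts.
woes, iv = woe_iv([(30, 10), (10, 30)])
keep_indicator = iv >= 0.02  # threshold quoted in the text
print(woes, iv, keep_indicator)
```

An indicator whose bins hold the same proportion of falsified and normal samples yields an IV of zero and would be screened out under the 0.02 threshold.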

XGBoost recognition algorithm
The XGBoost algorithm is a relatively new machine learning algorithm. It handles high-dimensional, unbalanced and complex data well, effectively mitigates the problem of overfitting, and achieves high accuracy and efficiency in solving classification and regression problems; therefore, this paper used the XGBoost algorithm for report falsification identification. The XGBoost algorithm is an optimization of the gradient boosted decision tree (GBDT) algorithm [22]. Compared with the GBDT algorithm, the XGBoost algorithm adds a regular term to the objective function, which reduces complexity and also improves efficiency [23]. The objective of the GBDT algorithm is to find each optimal single regression tree $f_j(x)$, and its objective function is written as:
$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{j} \Omega(f_j),$$
where $n$ is the sample size, $l(y, \hat{y}) = (\hat{y} - y)^2$, and $\Omega(f_j)$ is the regular term of the $j$-th regression tree:
$$\Omega(f_j) = \gamma T_j + \frac{1}{2} \lambda \sum_{t=1}^{T_j} w_j(t)^2,$$
where $T_j$ is the number of leaf nodes of the $j$-th regression tree, $\gamma$ represents the minimum loss per additional leaf node branch, $\lambda$ is the regularization coefficient, and $w_j(t)$ is the value of the $t$-th leaf node of the $j$-th regression tree.
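To make the role of the regular term concrete, the following sketch evaluates a regularized objective of this kind for a toy ensemble. It assumes the usual XGBoost-style penalty, gamma times the number of leaves plus half of lambda times the sum of squared leaf values, together with the squared-error loss stated above. The function names and all numbers are hypothetical.

```python
from typing import List


def tree_penalty(leaf_values: List[float], gamma: float, lam: float) -> float:
    """Penalty for one tree: gamma * (number of leaves) + 0.5 * lam * sum(w^2)."""
    return gamma * len(leaf_values) + 0.5 * lam * sum(w * w for w in leaf_values)


def regularized_objective(
    y_true: List[float],
    y_pred: List[float],
    trees: List[List[float]],
    gamma: float,
    lam: float,
) -> float:
    """Squared-error loss over all samples plus the penalty of every tree."""
    loss = sum((p - y) ** 2 for y, p in zip(y_true, y_pred))
    penalty = sum(tree_penalty(leaves, gamma, lam) for leaves in trees)
    return loss + penalty


# Toy example: three samples and two trees described only by their leaf values.
y_true = [1.0, 0.0, 1.0]
y_pred = [0.5, 0.5, 1.0]
trees = [[0.5, -0.5], [1.0]]
objective_value = regularized_objective(y_true, y_pred, trees, gamma=1.0, lam=2.0)
print(objective_value)
```

Larger values of gamma discourage adding leaves and larger values of lambda shrink leaf values, which is how the penalty limits model complexity.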
The objective function of the XGBoost algorithm is written as:

The comparison of overall performance showed that the XGBoost algorithm performed better in terms of accuracy and precision, indicating that it could identify and classify falsified and normal reports more accurately and thus help determine whether there is fraud in the enterprise.
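For reference, accuracy, precision, recall, and the F1 value can all be derived from the four counts of a binary confusion matrix, with falsified reports treated as the positive class. The sketch below uses the standard definitions; the function name and the counts are made up for illustration and are not the paper's results.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard binary metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 45 falsified reports caught, 15 missed,
# 5 normal reports flagged by mistake, 35 normal reports passed.
metrics = classification_metrics(tp=45, fp=5, fn=15, tn=35)
print(metrics)
```

Because the F1 value is the harmonic mean of precision and recall, it can sit below the accuracy when falsified reports are missed more often than normal reports are misflagged.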
Finally, the running time of these algorithms was compared, and the results are shown in Figure 1.

Figure 1. Comparison results of the running time of different algorithms
It was found from Figure 1 that the running times of the logistic regression, SVM, and RF algorithms were long, above 1500 s, while the running time of the GBDT algorithm was 654 s, which was clearly shorter than those of the first three algorithms. The running time of the method proposed by Ma et al. [27] was 372 s, which was 43.12% shorter than that of the GBDT algorithm. The running time of the XGBoost algorithm was 55 s in report falsification identification, which was about one-fortieth of that of the logistic regression algorithm. Compared with the GBDT algorithm, the running time of the XGBoost algorithm was reduced by 91.6%; compared with the method proposed by Ma et al. [27], it was reduced by 85.22%. It was found from Table 7 and Figure 1 that the XGBoost algorithm had not only good recognition performance but also higher recognition efficiency in the identification of corporate tax report falsification, so it can be applied in practice to achieve better and faster identification of financial fraud.
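The percentage reductions quoted above follow from the reported running times as (baseline - new) / baseline. A short check, with a hypothetical helper name:

```python
def relative_reduction(baseline_s: float, new_s: float) -> float:
    """Percentage by which new_s is shorter than baseline_s."""
    return (baseline_s - new_s) / baseline_s * 100.0


# Running times in seconds as reported in the text.
gbdt, ma_et_al, xgboost = 654.0, 372.0, 55.0

ma_vs_gbdt = round(relative_reduction(gbdt, ma_et_al), 2)
xgb_vs_gbdt = round(relative_reduction(gbdt, xgboost), 2)
xgb_vs_ma = round(relative_reduction(ma_et_al, xgboost), 2)
print(ma_vs_gbdt, xgb_vs_gbdt, xgb_vs_ma)
```

The values agree with the text: 43.12%, 91.59% (reported as 91.6% after rounding to one decimal place), and 85.22%.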

Discussion
Financial fraud is a common problem in countries around the world and is becoming more prevalent as the economy grows [28]. It often takes a long time from the implementation of fraud to its exposure; by then, the wrong decisions of investors and other stakeholders can no longer be changed, and the economic losses have already been incurred [29]. Moreover, with the development of technology, the means of fraud are becoming more and more diverse and hidden [30], and the amount involved is getting larger and larger, which seriously threatens the stability of the capital market [31].
Therefore, it is of practical importance for both investors and regulators to identify fraudulent behavior of enterprises in advance [32].
The main means of fraud is the manipulation of report entries [33], such as fictitious profit, fictitious reduction of liabilities, etc. At present, commonly used report falsifications include the following aspects: ① fictitious profit: increase revenue by forging contracts and other means, recognize revenue in advance by taking advantage of time lags, record fewer costs, or not record some costs; ② fictitious assets: fabricate monetary funds, cash flow, etc., or recognize the potential loss as impairment through asset restructuring or appraisal; ③ fictitious liabilities: conceal the liabilities of the enterprise, not record bank debits and repayments, etc.; ④ related transactions: change the profit situation through the transfer of assets between the parent company and subsidiaries or related purchases and sales.
Faced with complex fraudulent means and increasingly massive data, it is increasingly difficult to recognize fraudulent enterprises. This paper identified whether the reports are fraudulent through the XGBoost algorithm and compared it with the logistic regression algorithm and other algorithms. The results of the experimental analysis suggested that the XGBoost algorithm was more advantageous in terms of falsification recognition accuracy and running time, supporting the usability of this algorithm in actual enterprise report falsification identification.
However, the research in this paper also has some shortcomings: for example, the research samples were not comprehensive due to insufficient public data, the report content was lagging, and the performance of the algorithm needs further improvement. Therefore, in future work, more in-depth research on these issues is needed to further improve the report falsification identification method.

Conclusion
This paper focused on the identification of corporate tax statement falsification. In order to better identify enterprises that commit financial fraud, this paper selected indicators from corporate tax reports and used the XGBoost algorithm for report falsification identification. Through experimental analysis, it was found that, compared with algorithms such as the logistic regression algorithm, the XGBoost algorithm had the highest accuracy in report falsification identification, reaching 94.55%; its F1-score was 90.1%, and its running time was also short, only 55 s. These results show good performance, and the algorithm can be further applied in practice.

Table 1. Recent financial fraud cases

Table 2. Indicators for identifying fraudulent reports

Table 3. Results of the calculation of the IV value of the indicator

Table 4. Screened report falsification identification indicators