The data preprocessing in improving the classification quality of network intrusion detection systems

Stream-based intrusion detection is a growing problem in computer network security environments. Many previous studies have applied machine learning to detect attacks in network intrusion detection systems. However, these methods still suffer from low accuracy and a high false alarm rate. To improve the quality of classification, this paper proposes two solutions for the data preprocessing stage: feature selection and resampling of the training dataset before it is used to train the classifiers. The motivation is twofold: the training datasets used for network intrusion detection systems are heavily class-imbalanced, and many of their features are irrelevant to the classification goal, which reduces the quality of classification and increases the computation time. The data preprocessed by the proposed algorithms is used to train classifiers with different machine learning algorithms: Decision Trees, Naive Bayes, Logistic Regression, Support Vector Machines, k Nearest Neighbor and Artificial Neural Network. The training and testing results on the UNSW-NB15 dataset show that, for example, for the Reconnaissance attack type, the proposed feature selection solution achieves an F-Measure of 96.31%, an increase of 19.64%; the proposed oversampling solution achieves an F-Measure of 96.99%, an increase of 3.17%; and the proposed undersampling solution achieves an F-Measure of 94.65%, an increase of 11.42%.


Introduction
The internet plays an increasingly important role in all areas of social life. On this platform, e-commerce is growing strongly and has become an indispensable part of business activities. Besides its benefits, businesses also face the negative aspects of the internet, one of which is the problem of cyber attacks. A cyber attack is any form of unauthorized intrusion into a computer system, website, database, network infrastructure or equipment of individuals and businesses through the internet for illegal purposes. The targets of cyber attacks are very diverse: an attack can breach data (stealing, altering, encrypting, destroying), target the integrity of the system (disruption, service obstruction), or take advantage of the victim's resources (displaying ads, running malicious code, mining virtual currency, ...). To protect the network, one of the systems used by network administrators today is the network intrusion detection system (NIDS).
A NIDS monitors network traffic to detect abnormalities and illegal activities intruding into the networks of agencies and enterprises. It can detect anomalies based on specific signatures of known threats or by comparing current network traffic with system benchmarks. There are three methods to detect attacks: (1) signature-based detection; (2) anomaly-based detection; and (3) hybrid-based detection.
Signature-based detection is designed to detect known attacks using the signatures of those attacks. It is an effective method for detecting known attacks whose signatures are stored in the NIDS database, and it is therefore much more accurate in identifying an intrusion attempt by a known attack. However, the NIDS cannot detect new or variant attacks because their signatures are not stored. To solve this problem, anomaly-based detection compares current user activities with predefined profiles. Anomaly-based detection is effective against unknown or zero-day attacks without any updates to the system. However, these methods, which rely mainly on machine learning (ML), still face challenges in improving accuracy, reducing the false alarm rate (FAR) and detecting new attacks [1].
In ML, data preprocessing is an important stage because it has a major impact on the accuracy and capability of the NIDS. With the increasing volume of network traffic, ML techniques need a lot of time to train on and classify the data. Using big data techniques for NIDS can address challenges such as speed and computational time and help develop accurate NIDS [2]. This paper deals with improving the quality of classification (QoC) of NIDS through two data preprocessing techniques: feature selection and dataset resampling.

Feature selection
Feature selection is a method to remove irrelevant or noisy features and select the most suitable features to better classify instances belonging to various attack types. According to previous studies, this needs to be done because:
(1) A single selection strategy is not sufficient to obtain consistency across multiple datasets, since network traffic behavior is constantly changing [3].
(2) A suitable subset for each attack type must be determined, since a common subset is not sufficient to represent all of the various attacks [3].
(3) Feature selection can greatly improve not only the detection accuracy but also the computational efficiency:
- Irrelevant or noisy features can lead to poor detection rates, so reducing them can increase detection accuracy [4][5][6].
- Having more features results in higher computational cost and complexity, so reducing extraneous features increases computational efficiency [3,7].
(4) Finally, some known types of attacks have become challenging to identify because they are too isolated and can be mislabeled as normal data. Studies and experiments have shown that feature selection can solve this problem by defining a subset of features adapted to the behavior of each attack type [5][6][7].

Resampling dataset
For many years, the problem of imbalanced data has been an important issue that has received the attention of many researchers [8]. A dataset is said to be imbalanced when the number of instances belonging to one class label is much smaller than that of other class labels. To solve the problem, resampling techniques have been proposed. There are two main approaches: removing some instances from the majority class, called undersampling (US), and cloning some instances of the minority class, called oversampling (OS). Both aim to change the ratio between the majority and minority classes [9]. It is also possible to combine both techniques to create a more balanced dataset. In this way, resampling allows the various classes to have relatively similar influence on the results of the classification model. Studies show that resampling the training dataset improves the accuracy of NIDS [10,11].
One of the commonly used oversampling techniques is SMOTE (Synthetic Minority Over-Sampling Technique) [12]. SMOTE works as follows: take an instance a from the minority class of the dataset and randomly select one instance b from among the k nearest neighbors of a that belong to the same class (in the feature space). A new synthetic instance x = a + w(b − a) is created and added to the dataset, where a, b and x are feature vectors and w is a random weight in the interval [0, 1].
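The interpolation step described above can be illustrated with a minimal sketch. It is not the reference SMOTE implementation: the helper name `smote_sample`, the toy data and the use of plain Euclidean distance are assumptions made for illustration only.

```python
import math
import random

def smote_sample(a, minority, k=5, rng=random):
    """Create one synthetic instance from the minority instance `a`.

    Hypothetical helper: picks one of the k nearest minority neighbours b
    of a and returns x = a + w * (b - a) with a random weight w.
    """
    # Candidate neighbours are all other minority instances.
    others = [p for p in minority if p is not a]
    neighbours = sorted(others, key=lambda p: math.dist(a, p))[:k]
    b = rng.choice(neighbours)
    w = rng.random()  # random weight in [0, 1)
    return [ai + w * (bi - ai) for ai, bi in zip(a, b)]

# Toy minority class: three close points and one far outlier (made-up data).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0]]
rng = random.Random(0)
x = smote_sample(minority[0], minority, k=2, rng=rng)
print(x)
```

With k = 2 the outlier is never chosen as a neighbour, so the synthetic point always lies on a segment between the first instance and one of its two close neighbours.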
Several techniques have been built on SMOTE. The first is the Cluster SMOTE technique [12]. In this technique, the training data is first partitioned into k clusters using the k-Means algorithm, and the imbalance ratio is calculated for each cluster:
IR = (Number of minority class instances in the cluster) / (Number of majority class instances in the cluster)
Then, SMOTE is used to clone minority class instances in the clusters with imbalance ratio IR > 1.
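As a small worked example of the imbalance ratio, the sketch below computes IR per cluster and selects the clusters with IR > 1. The cluster assignments are assumed to come from a k-Means run that is not shown; the function name, the labels and the convention of treating a cluster without majority instances as having an infinite ratio are illustrative assumptions.

```python
from collections import Counter

def cluster_imbalance_ratios(cluster_ids, labels, minority_label):
    """Return {cluster id: minority count / majority count}."""
    minority = Counter()
    majority = Counter()
    for c, y in zip(cluster_ids, labels):
        if y == minority_label:
            minority[c] += 1
        else:
            majority[c] += 1
    ratios = {}
    for c in set(cluster_ids):
        # Assumption: a cluster with no majority instances gets ratio +inf.
        ratios[c] = minority[c] / majority[c] if majority[c] else float("inf")
    return ratios

# Made-up cluster assignments and class labels for nine instances.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2, 2]
labels = ["atk", "atk", "norm", "norm", "norm", "norm", "atk", "atk", "atk"]

ratios = cluster_imbalance_ratios(cluster_ids, labels, "atk")
selected = sorted(c for c, ir in ratios.items() if ir > 1)
print(ratios, selected)
```

In this toy data only clusters 0 and 2 would be passed on to SMOTE.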
Next is the Adaptive Synthetic Sampling technique (ADASYN), which shifts the importance of the classification boundary toward difficult minority instances. ADASYN uses a weighted distribution over minority class instances according to their learning difficulty, so that more synthetic data is generated for the minority class instances that are harder to learn [13].
Another SMOTE-based innovation is Borderline-SMOTE, which has two variants: Borderline-SMOTE1 and Borderline-SMOTE2. This method oversamples only the minority instances near the class boundary, using their nearest neighbors of the same class. The difference between the two versions is that Borderline-SMOTE2 uses both positive and negative nearest neighbors. Compared to conventional SMOTE, Borderline-SMOTE does not create synthetic instances for noise but concentrates its efforts near the boundary, thereby helping the decision function to create better boundaries between classes. Borderline-SMOTE has also been reported to perform better than SMOTE [14].
The first known undersampling technique is the Tomek Link, which is defined as follows: given a pair of instances (x_i, x_j), where x_i ∈ S_min, x_j ∈ S_max and d(x_i, x_j) is the distance between x_i and x_j, the pair (x_i, x_j) is called a Tomek Link if there is no instance x_k such that d(x_i, x_k) < d(x_i, x_j) or d(x_j, x_k) < d(x_i, x_j). In this way, if two instances form a Tomek Link, either one of them is noisy or both are near the class boundary. One can therefore use Tomek Links to clean up the overlap between classes. Removing overlapping instances establishes well-defined clusters in the training dataset and leads to improved classification quality. Another approach is the Neighborhood Cleaning Rule (NCR), which can be described as follows: for each instance E_i in the training dataset of a binary classification problem, its three nearest neighbors are found. If E_i belongs to the majority class and the class given by its three nearest neighbors contradicts the class of E_i, then E_i is deleted. If E_i belongs to the minority class and its three nearest neighbors misclassify E_i, then the nearest neighbors belonging to the majority class are discarded [9].
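A Tomek Link can equivalently be read as a pair of instances from different classes that are each other's nearest neighbor. The following sketch detects such pairs by brute force; the function name and the one-dimensional toy data are assumptions for illustration, and ties in distance are resolved arbitrarily by `min`.

```python
import math

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek Links.

    Illustrative brute-force version: a pair is a link when the two
    instances have different labels and are mutual nearest neighbours.
    """
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))

    links = []
    for i in range(len(X)):
        j = nearest(i)
        if i < j and y[i] != y[j] and nearest(j) == i:
            links.append((i, j))
    return links

# Made-up data: instances 2 and 3 sit on the class boundary.
X = [[0.0], [0.1], [1.0], [1.05], [3.0]]
y = [0, 0, 0, 1, 1]
links = tomek_links(X, y)
print(links)
```

An undersampling step based on this would then drop the majority-class member of each detected pair (or both members, when used purely for cleaning).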
Similar to Tomek Link is the Edited Nearest Neighbors (ENN) algorithm. ENN tends to remove more instances than Tomek Links, so it should provide more in-depth data cleaning. Unlike NCR, which is an undersampling method, ENN is used to remove instances from both classes: any instance that is misclassified by its three nearest neighbors is removed from the training dataset [15].
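The ENN rule stated above can be sketched in a few lines. This is a brute-force illustration under simple assumptions (Euclidean distance, majority vote among k = 3 neighbours, made-up data), not a tuned implementation.

```python
import math
from collections import Counter

def edited_nearest_neighbours(X, y, k=3):
    """Return the indices of instances that ENN keeps.

    An instance is kept only if the majority label among its k nearest
    neighbours agrees with its own label.
    """
    keep = []
    for i in range(len(X)):
        neighbours = sorted((j for j in range(len(X)) if j != i),
                            key=lambda j: math.dist(X[i], X[j]))[:k]
        vote = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if vote == y[i]:
            keep.append(i)
    return keep

# Made-up data: instance 3 (label 1) lies inside the region of label 0.
X = [[0.0], [0.1], [0.2], [0.15], [1.0], [1.1], [1.2], [1.3]]
y = [0, 0, 0, 1, 1, 1, 1, 1]
kept = edited_nearest_neighbours(X, y)
print(kept)
```

Only the instance surrounded by the other class is removed; all other instances agree with their neighbourhood and are kept.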
Several studies have compared oversampling and undersampling methods for dealing with the class imbalance problem. Douzas and Bacao [16] developed a method that estimates the distribution of the real data and generates data for the minority classes of various imbalanced datasets. Douzas et al. [17] presented an oversampling method based on k-Means clustering and SMOTE that avoids generating noise and overcomes imbalances between classes.
Amin et al. [18] investigated several well-known oversampling techniques: Mega-trend Diffusion Function (MTDF), SMOTE, ADASYN, Couples Top-N Reverse k-Nearest Neighbor (TRkNN), Majority Weighted Minority Oversampling Technique (MWMOTE) and Immune Centroids Oversampling Technique (ICOTE). The study showed that MTDF combined with genetic algorithm-based rule generation achieved the best overall prediction performance among the oversampling methods evaluated.

Dataset and evaluation metrics
The dataset is a main component in training classifiers to detect attacks, and choosing the right dataset is important to ensure proper model building. The most frequently used datasets in the literature are KDDCup99, NSL-KDD, ISCX2012 and UNSW-NB15. The UNSW-NB15 dataset was chosen for the experiments of this paper because it has several advantages over the other datasets: (1) it contains contemporary composite attack activities; (2) the probability distributions of the training and testing datasets are similar; (3) it consists of a set of features from the packet's payload and header that effectively reflect the network packets; and (4) it contains many complex data samples [19].
The choice of evaluation metric plays an important role when building and evaluating NIDS models. In this paper, the F1 Score (F-Measure with β = 1) was chosen to evaluate the classification quality of NIDS, for the following reasons: (1) the dataset used for training the NIDS is inherently imbalanced; (2) in the NIDS, the positive class, that is, the class of instances labeled as attacks, plays an important role; (3) false positive and false negative alerts are equally important; and (4) interpreting a normal access as an attack and interpreting an attack as a normal access are both costly errors.
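For reference, the F-Measure can be computed from the counts of true positives, false positives and false negatives using the standard definition in terms of precision and recall. The sketch below uses made-up counts; the function name is illustrative.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-Measure from confusion-matrix counts (beta = 1 gives the F1 Score)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up counts: 80 attacks detected, 20 false alarms, 20 attacks missed.
score = f_measure(tp=80, fp=20, fn=20)
print(round(score, 4))
```

Because true negatives do not enter the formula, a large number of correctly classified normal instances cannot mask poor detection of a rare attack class, which is why the metric suits imbalanced data.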

Shortcomings and Challenges
It is becoming increasingly important to protect computer systems using NIDS for intrusion detection. The above sections have detailed related work on the methodology and technology of data preprocessing. The following shortcomings and challenges need further research:
(1) The use of old datasets such as KDDCup99 and NSL-KDD can stall progress in NIDS, while intrusion attacks are constantly evolving with new technologies and user behaviors. It is therefore important to use a new dataset that is representative of the current environment, both software and hardware.
(2) Studies also show the effectiveness of reducing the features of the training datasets, which not only increases the accuracy of the ML algorithm but also reduces the training time and cost. A subset of suitable features for each attack type should also be determined.
(3) Finally, as with most imbalanced data sources in other fields, improved data resampling algorithms that raise the classification quality of NIDS should also be researched.

Solution of feature selection (FS)
The proposed feature selection solution uses ML as a fitness function to determine the best-suited subset of features for each attack type among all feature subsets of the training dataset. Because the training datasets used in NIDS are often very large and have many [...]
Backward Feature Elimination Algorithm. The Backward Feature Elimination (BFE) algorithm [20] is proposed on the basis of the assumption that the features of the dataset are independent of each other, and is presented in Algorithm 1. In this algorithm, at each iteration, a classification model is trained on a dataset of n input features. We then eliminate one input feature at a time and train the same model on the n−1 remaining input features, n times. The input feature whose removal produces the smallest increase in error rate is discarded, leaving n−1 input features. The classification is then repeated on n−2 features, and so on. In the k-th iteration, the model is trained on n−k features and has an error rate e(k). By choosing a maximum acceptable error rate, we determine the minimum number of features needed to achieve the required classification accuracy with the chosen ML algorithm.
Proposition 1: Algorithm 1 has a time complexity of O(N!), where N is the number of features of the dataset.
Proof: Let T(N) be the time complexity of the algorithm. According to lines (12) to (19), we have, due to the recursion of the algorithm: [...]
Forward Feature Construction Algorithm. The Forward Feature Construction (FFC) algorithm [20] is also proposed on the basis of the assumption that the features of the dataset are independent of each other, and is presented in Algorithm 2. This is the reverse process of the BFE algorithm: we start with one feature and then add one feature at a time, where the feature that produces the highest quality is selected to be added to the resulting feature set. Both algorithms, BFE and FFC, are time consuming and computationally expensive.
Pseudocode excerpt:
    Train C with features ∈ S on the dataset D
    11: e ← Error rate of classifier C when testing
    12: for i ← 1 to N do
    13: Train C with features ∈ S_1 on the dataset D
    15: if Error rate of classifier C < e + δ then
Proposition 2: Algorithm 2 has a time complexity of O(N!), where N is the number of features of the dataset.
Proof: Let T(N) be the time complexity of the algorithm. According to lines (11) to (22), we have, due to the recursion of the algorithm: [...]
Proposed feature selection algorithm (pFSA). As shown above, for a dataset with N features, if BFE or FFC is used to select the optimal set of features, the time complexity of the algorithm is O(N!) (according to Proposition 1 and Proposition 2). This is not suitable for datasets with a large number of features.
Pseudocode excerpt:
    R ← ∅ ▷ Selected best set of features
    5: bq ← 0 ▷ the best QoC
    6: FindBest(S, D, C)
    7: return R
    8: end
    9: procedure FindBest(S, D, C) ▷ Find best features
    10: best ← ∅
    11: for each s_i ∈ {S \ R} do
    12: Train C with features ∈ R_1 on the dataset D
    14: if the QoC of C > bq then
    15: bq ← the QoC of C
The proposed algorithm is built on the following ideas:
(1) Combine BFE and FFC with feature ranking to reduce calculation time and cost, which is especially suitable for large datasets.
(2) Consider the correlation between features when adding or removing a feature. This is intended to address the limitations of BFE and FFC on datasets whose features are not independent of each other.
(3) The order of adding or removing a feature is based on the feature's rank, which in turn is based on the relevance of the feature to the class label.
Various studies [21] have suggested metrics of the importance and relevance of features. In this paper, we propose to use Information Gain (IG), Gain Ratio (GR) and Correlation Attribute (CA) to rank features.
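To make the ranking step concrete, the sketch below ranks features by Information Gain using the standard entropy-based definition. It assumes discrete feature values (continuous features would need to be discretized first), and the feature names and data are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(labels) - H(labels | feature), for a discrete feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy dataset with two discrete features.
labels = ["atk", "atk", "norm", "norm"]
features = {
    "proto": ["tcp", "tcp", "udp", "udp"],  # separates the classes perfectly
    "flag": ["a", "b", "a", "b"],           # carries no class information
}
gains = {name: information_gain(vals, labels) for name, vals in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(gains, ranking)
```

Gain Ratio would additionally divide each gain by the entropy of the feature's own value distribution; that step is omitted here.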
This paper proposes two algorithms. The first, denoted pFFC, uses a wrapper model improved from the FFC algorithm, combined with feature ranking, and at the same time considers the correlation between features. The algorithm starts from an empty set of features; features are then considered for addition in turn, and a feature is added if its addition improves the classification quality of the NIDS. In addition, features already in the selected set that are correlated with the newly added feature are considered for removal if such removal also improves the classification quality. Features with higher importance are considered first. The importance measures used here are IG, GR and CA.
The second algorithm, denoted pBFE, uses a wrapper model improved from the BFE algorithm, combined with feature ranking, while also considering the correlation between features. The algorithm starts from the full set of 42 features; features are then considered for removal in turn, and a feature is removed if its removal improves the classification quality. In addition, before the selected feature is removed, the features in the current set that correlate with it are also evaluated in order to choose the best feature to remove. Features with lower importance are considered for removal first. The importance measures used here are again IG, GR and CA.
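The forward variant described above can be sketched as follows. This is only an interpretation of the prose, not the paper's Algorithm 3: the names `pffc`, `quality` and `correlated` are hypothetical, and `toy_quality` is a made-up stand-in for training a classifier and measuring its F-Measure on the test set.

```python
# Made-up stand-in for "train and test a classifier, return its F-Measure".
CONTRIB = {"a": 0.5, "b": 0.3, "b2": 0.25, "noise": 0.0, "c": 0.1}
GROUP = {"a": "a", "b": "b", "b2": "b", "noise": "noise", "c": "c"}

def toy_quality(features):
    # Correlated features (same group) contribute only once; every
    # feature carries a small cost, so redundant ones lower the score.
    best_per_group = {}
    for f in features:
        g = GROUP[f]
        best_per_group[g] = max(best_per_group.get(g, 0.0), CONTRIB[f])
    return sum(best_per_group.values()) - 0.02 * len(features)

def correlated(f, g):
    return GROUP[f] == GROUP[g]

def pffc(ranked_features, quality, correlated):
    """Forward selection over features ranked from most to least important."""
    s_opt = [ranked_features[0]]
    best = quality(s_opt)
    for s in ranked_features[1:]:
        s1 = s_opt + [s]
        q = quality(s1)
        if q > best:  # adding s improves quality: keep it
            best, s_opt = q, s1
            # Try dropping previously selected features correlated with s.
            for c in [f for f in s_opt if f != s and correlated(f, s)]:
                s2 = [f for f in s_opt if f != c]
                q2 = quality(s2)
                if q2 > best:
                    best, s_opt = q2, s2
    return s_opt, best

selected, score = pffc(["a", "b2", "b", "noise", "c"], toy_quality, correlated)
print(selected, round(score, 2))
```

In this toy run, "b2" is first accepted, then displaced once its better correlated counterpart "b" is added, and "noise" is never accepted.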
The pseudocode of the first algorithm, pFFC, is shown in Algorithm 3. First, the importance of the 42 features in the UNSW-NB15 dataset is calculated and the features are sorted in descending order. The importance of features (IoF) measures used are IG, GR and CA. Initially, S_opt contains one feature, and the F-Measure achieved when training and testing on the UNSW-NB15 dataset with that feature is the starting value in the search for a better F-Measure over the next 41 iterations. At each subsequent iteration, a feature s_i ∈ S is added to S_opt to form S_1; the more important features (with a larger information metric) are added first. Next, the data restricted to the features in S_1 is used to train and test classifiers using various ML techniques. The classifiers are evaluated on the independent testing dataset of UNSW-NB15, and their classification quality is measured by the F-Measure. If the F-Measure of the classifier trained with the features in S_1 is better than that of S_opt, the feature set with the best F-Measure stored so far, then the feature s_i is retained. Then, the features in S_opt that correlate with s_i are considered for removal if such removal improves the F-Measure. The resulting feature set is assigned to S_opt. Otherwise, the feature s_i is dropped, because adding it does not improve the classification quality. This process is repeated until all the features have been added in turn, yielding the set of features S_opt with the best F-Measure.
Proof: Let T(N) be the time complexity of the algorithm. According to lines (6) to (21), we have: [...]
The pseudocode of the second algorithm, pBFE, is shown in Algorithm 4. First, the importance of the 42 features in the UNSW-NB15 dataset is calculated and the features are sorted in descending order. The importance measures used are IG, GR and CA. The initial S_opt includes all 42 features of the dataset, and the F-Measure achieved when training and testing on the UNSW-NB15 dataset with all 42 features is the starting value in the search for a better F-Measure over the next 41 iterations. At each subsequent iteration, a feature s_i is considered for removal from S_opt to form S_1; the less important features (with a smaller information metric) are considered first. Next, the data restricted to the features in S_1 is used to train and test classifiers using various ML techniques. The classifiers are again evaluated on the independent testing dataset of UNSW-NB15, and their classification quality is measured by the F-Measure. If the F-Measure of the classifier trained with the features in S_1 is better than that of S_opt, the feature set with the best F-Measure stored so far, then the feature s_i is considered for elimination. Then, features in S_opt that are correlated with s_i and have less importance than s_i are considered for removal instead of s_i if that removal improves the classification quality (measured by the F-Measure). The resulting set of features is assigned to S_opt. Otherwise, the removed feature is restored, because removing it does not improve the classification quality. This process is repeated until all the features have been considered for elimination in turn, yielding the set of features S_opt with the best F-Measure.
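The backward variant can be sketched in the same spirit. Again this is an interpretation of the description, not the paper's Algorithm 4; `pbfe`, `quality` and `correlated` are hypothetical names, and `toy_quality` is a made-up stand-in for the F-Measure obtained by training and testing a classifier.

```python
# Made-up stand-in for "train and test a classifier, return its F-Measure".
CONTRIB = {"a": 0.5, "b": 0.3, "b2": 0.25, "c": 0.1, "noise": 0.0}
GROUP = {"a": "a", "b": "b", "b2": "b", "c": "c", "noise": "noise"}

def toy_quality(features):
    best_per_group = {}
    for f in features:
        g = GROUP[f]
        best_per_group[g] = max(best_per_group.get(g, 0.0), CONTRIB[f])
    return sum(best_per_group.values()) - 0.02 * len(features)

def correlated(f, g):
    return GROUP[f] == GROUP[g]

def pbfe(ranked_features, quality, correlated):
    """Backward elimination; `ranked_features` is most important first."""
    rank = {f: i for i, f in enumerate(ranked_features)}
    s_opt = list(ranked_features)
    best = quality(s_opt)
    for s in reversed(ranked_features):  # least important first
        if s not in s_opt:
            continue
        candidate = [f for f in s_opt if f != s]
        q = quality(candidate)
        if q > best:  # removing s improves quality
            chosen, chosen_q = candidate, q
            # A correlated, less important feature may be removed instead.
            for c in s_opt:
                if c != s and correlated(c, s) and rank[c] > rank[s]:
                    alt = [f for f in s_opt if f != c]
                    q_alt = quality(alt)
                    if q_alt > chosen_q:
                        chosen, chosen_q = alt, q_alt
            s_opt, best = chosen, chosen_q
        # Otherwise s stays in s_opt (the removal is undone).
    return s_opt, best

selected, score = pbfe(["a", "b", "b2", "c", "noise"], toy_quality, correlated)
print(selected, round(score, 2))
```

With the toy scoring function, "noise" and the redundant "b2" are eliminated, while removing any of the remaining features would lower the score.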
Proposition 4: Algorithm 4 has a time complexity of O(N × (N − 1)/2), where N is the number of features of the dataset.
Proof: Let T(N) be the time complexity of the algorithm. According to lines (6) to (21), we have: [...]
Pseudocode excerpt:
    for i ← 2 to N do
    7: if the QoC of S_1 better than S_opt then
    9: Best ← the QoC of S_1
    10: for each s_ca in S_opt do
    11: if s_ca correlates with s_i then
    12: if the QoC of S_2 > Best then

Solution of dataset resampling
The second proposed solution for improving the classification quality of the NIDS is resampling of the training dataset. As shown, the UNSW-NB15 training dataset is quite imbalanced: several attack types account for a very small proportion of the dataset, for example Worms 0.05%, Shellcode 0.46%, Backdoor 0.71% and Analysis 0.82%. Such imbalance between classes means that minority classes have little influence on the classification model, which reduces the quality of classification.
However, because the training datasets used in NIDS are often very large and have many irrelevant or noisy features, data resampling techniques (which are based on the Euclidean distance computed over the features) are adversely affected: they clone bad instances of the minority class and eliminate good instances of the majority class. To improve on this, this paper proposes not to use irrelevant or noisy features in the calculations when resampling the dataset.
Pseudocode excerpt:
    for i ← 1 to N do
    7: if the QoC of S_1 better than S_opt then
    9: Best ← the QoC of S
    return S_opt
    23: end
Solution of proposed oversampling. The oversampling techniques above all rely on the k nearest neighbors to create synthetic data instances, with all features participating. The problem is that irrelevant or noisy features enter the distance calculation used to determine the k nearest neighbors, which can affect the quality of the oversampling. To eliminate these irrelevant or noisy features, this paper proposes two solutions, presented in Algorithm 5 and Algorithm 6.
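The effect of a noisy feature on the neighbor search can be illustrated with a small sketch in which the distance is computed only over a chosen subset of feature indices. The function name `k_nearest`, the `feature_idx` parameter and the data are assumptions for illustration; in the proposed approach the subset would come from the feature selection step.

```python
import math

def k_nearest(i, X, k, feature_idx):
    """Indices of the k nearest neighbours of X[i], with the Euclidean
    distance computed only over the features listed in `feature_idx`."""
    def dist(u, v):
        return math.sqrt(sum((u[f] - v[f]) ** 2 for f in feature_idx))
    others = [j for j in range(len(X)) if j != i]
    return sorted(others, key=lambda j: dist(X[i], X[j]))[:k]

# Made-up data: column 0 is relevant, column 1 is a noisy feature.
X = [[1.0, 90.0], [1.1, 10.0], [5.0, 88.0], [1.2, 50.0]]

with_all_features = k_nearest(0, X, k=1, feature_idx=[0, 1])
with_selected = k_nearest(0, X, k=1, feature_idx=[0])
print(with_all_features, with_selected)
```

With both columns, the noisy feature dominates the distance and instance 2 is chosen as the neighbour of instance 0; restricted to the relevant column, the neighbour is instance 1. A SMOTE-style interpolation between instance 0 and its neighbour would therefore produce very different synthetic points in the two cases.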
The first algorithm (Algorithm 5) uses the solution proposed by Algorithm 3 (the pFFC algorithm) to determine the best-fit features participating in the distance calculation when determining the k nearest neighbors used in oversampling. Algorithm 5 is implemented specifically as follows: First, the set S_max consisting of the 42 features of the UNSW-NB15 dataset is calculated and sorted in descending order of importance; the importance of features can be IG, GR or CA. S_min is the initial minimum set of features, that is, the features obtained for each type of attack through the feature selection algorithms presented in Section 3.1 (see the feature selection results in Table 2). S_1 is the set of features to be evaluated and has an initial value of S_min. At each loop, the remaining features (S_max \ S_min) are added to S_1 in turn; the more important features (with a larger information metric) are added first. Then, oversampling techniques such as SMOTE, ADASYN, Cluster SMOTE and Borderline SMOTE are each used to add synthetic data instances to the original training dataset to generate the new training dataset; the difference is that only the features in S_1 are used when calculating the distance to determine the k nearest neighbors in the oversampling algorithms. The new training datasets with additional synthetic instances are used to train classifiers using ML techniques. The classifiers are evaluated on the testing dataset, which is an independent dataset in UNSW-NB15, and their classification quality is measured by the F-Measure. If the F-Measure of the classifier trained with the features in S_1 is better than that of S_opt, the feature set with the best F-Measure stored so far, then the added feature (denoted s) is recorded. Next, we find the features s_ca in S_opt that correlate with s, remove s_ca and add s to S_opt if such removal improves the F-Measure. Otherwise, the added feature is dropped, because adding it does not improve the classification quality. This process is repeated until all the remaining features other than S_min have been added in turn, yielding the set of features S_opt with the best F-Measure.
Pseudocode excerpt:
    if QoC of S_1 is better than S_opt after OS then
    14: Best ← the QoC of S_1 after OS
    15: for each s_ca in S_opt do
    16: if s_ca correlates with s then
    17: if QoC of S_2 after OS > Best then
    return S_opt
    28: end
Proof: Let T(N) be the time complexity of the algorithm. According to lines (9) to (26), we have: [...]
The second algorithm (Algorithm 6) uses the solution proposed by Algorithm 4 (the pBFE algorithm) to determine the best-fit features participating in the distance calculation when determining the k nearest neighbors used in oversampling algorithms. Algorithm 6 is implemented specifically as follows: First, the set S_max consisting of the 42 features of the UNSW-NB15 dataset is again calculated and sorted in descending order of importance; the importance of features can be IG, GR or CA. S_min is the initial minimum set of features, which are again the features obtained for each attack type through the feature selection algorithms presented in Section 3.1 (see the results of feature selection in Table 2). S_1 is the set of features to be evaluated and has an initial value of S_max, including all 42 features. At each iteration, the features under consideration (S_max \ S_min) are considered for removal from S_1 in turn; the less important features (with a smaller information metric) are considered first. Next, oversampling techniques such as SMOTE, ADASYN, Cluster SMOTE and Borderline SMOTE are each used to add synthetic data instances to the original training dataset to generate the new training dataset; the difference is that only the features in S_1 are used when calculating the distance to determine the k nearest neighbors in the oversampling algorithms. The new training datasets with additional synthetic instances are used to train classifiers using ML techniques. The classifiers are evaluated on the testing dataset, which is an independent dataset in UNSW-NB15, and their classification quality is measured by the F-Measure.
If the F-Measure of the best of the trained classifiers is better than the best F-Measure obtained from S_opt, the feature set with the best F-Measure stored so far, then the feature (denoted by s) is considered for removal. Next, we find the features s_ca in S_opt that are correlated with s and have less importance than s. The removal of s is replaced by the removal of s_ca from S_opt if that replacement improves the F-Measure. Otherwise, the removed feature is restored, because removing it would degrade the classification quality. This process is repeated until all remaining features other than S_min have been considered for removal in turn, yielding the set of features S_opt with the best F-Measure.
Pseudocode excerpt:
    if QoC of S_1 is better than S_opt after OS then
    14: Best ← the QoC of S_1 after OS
    15: for each s_ca in S_opt do
    16: if s_ca correlates and has IoF < s then
Proof: Let T(N) be the time complexity of the algorithm. According to lines (9) to (26), we have: [...]
Solution of proposed undersampling. The undersampling techniques also rely on the k nearest neighbors to remove unwanted overlap between classes, with all features involved. The problem is that irrelevant or noisy features enter the distance calculation used to determine the k nearest neighbors, which can affect the quality of removing majority class instances during undersampling. To eliminate these irrelevant or noisy features, this paper proposes two solutions, presented in Algorithm 7 and Algorithm 8.
The first algorithm (Algorithm 7) uses the solution proposed by Algorithm 3 (the pFFC algorithm) to determine the best-fit features participating in the distance calculation when determining the k nearest neighbors used in undersampling algorithms. Algorithm 7 is implemented specifically as follows: First, the set S_max consisting of the 42 features of the UNSW-NB15 dataset is calculated and sorted in descending order of importance; the importance of features can be IG, GR or CA. S_min is the initial minimum set of features, that is, the features obtained for each type of attack through the feature selection algorithms presented in Section 3.1 (see the feature selection results in Table 2). S_1 is the set of features to be evaluated and has an initial value of S_min. At each loop, the remaining features (S_max \ S_min) are added to S_1 in turn; the more important features (with a larger information metric) are added first. Then, undersampling techniques such as Tomek Link (TML), NCR and ENN are each used to remove noisy and overlapping data instances from the original training dataset to generate a new training dataset. The difference is that only the features in S_1 are used when calculating the distance to determine the k nearest neighbors in the undersampling algorithm. The new training datasets with eliminated data instances are used to train classifiers using ML techniques. The classifiers are evaluated on the testing dataset, which is an independent dataset in UNSW-NB15, and their classification quality is measured by the F-Measure. If the F-Measure of the above classifier is better than the best F-Measure obtained from S_opt, the feature set with the best F-Measure stored so far, then the added feature (denoted s) is recorded. Next, we find the features s_ca in S_opt that correlate with s, remove s_ca and add s to S_opt if such removal improves the F-Measure. Otherwise, the added feature is dropped, because adding it does not improve the classification quality. This process is repeated until all remaining features other than S_min have been added in turn, yielding the set of features S_opt with the best F-Measure.
Proposition 7: Algorithm 7 has a time complexity of O(N × (N − 1)/2), where N is the number of features of the dataset.
Proof: Let T(N) be the time complexity of the algorithm. According to lines (9) to (26) of the algorithm, we have T(N) = O(N × (N − 1)/2).

Algorithm 7 (excerpt):
13: if QoC of S_1 is better than S_opt after US then
14:     Best ← the QoC of S_1 after US
15:     for each s_ca in S_opt do
16:         if s_ca correlates with s then
17:             if QoC of S_2 after US > Best then
...
27: return S_opt
28: end

The second algorithm (Algorithm 8) uses the approach of Algorithm 4 (the pBFE algorithm) to determine the best-fit features that participate in the distance calculation when the undersampling algorithms determine the k nearest neighbors. Algorithm 8 works as follows. First, the importance of each of the 42 features of the UNSW-NB15 dataset is calculated, and the features are sorted in descending order of importance to form S_max; as in Algorithm 7, the importance measure can be IG, GR or CA. S_min is the initial minimum set of features, again the features obtained for each attack type by the feature selection algorithms presented in Section 3.1 (see the feature selection results in Table 2). S_1 is the set of features to be evaluated and is initialized to S_max, which includes all 42 features. In each iteration, one of the candidate features (S_max \ S_min) is considered for removal from S_1, with the less important features (those with a smaller information metric) considered first. The undersampling techniques TML, NCR and ENN are then each used to remove noisy and overlapping data instances from the original training dataset, creating a new training dataset. Again, only the features in S_1 are used when calculating the distances to determine the k nearest neighbors. The new training datasets are used to train classifiers with ML techniques, and the classifiers are evaluated on the testing dataset, which is an independent dataset in UNSW-NB15. Classification quality is measured by the F-Measure.
If the F-Measure of the trained classifier is better than the best F-Measure obtained so far, which was generated from the stored feature set S_opt, the feature (denoted s) is considered for removal. Next, the features s_ca in S_opt that are correlated with s and less important than s are found. The removal of s is replaced by the removal of s_ca from S_opt if that replacement improves the F-Measure. Otherwise, the removed feature is restored, because removing it degrades the classification quality. This process is repeated until all remaining features outside S_min have been considered for removal in turn, yielding the feature set S_opt with the best F-Measure.

Algorithm 8 (excerpt):
13: if QoC of S_1 is better than S_opt after US then
14:     Best ← the QoC of S_1 after US
15:     for each s_ca in S_opt do
16:         if s_ca correlates and has IoF < s then
...
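The backward procedure described for Algorithm 8 can be sketched in the same spirit. This is an illustrative reading of the prose, not the paper's code: `score` and `correlated` are hypothetical callbacks, the toy scores are invented, and protecting the features of S_min from removal in the swap step is an assumption.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Set


def backward_eliminate_with_swap(
    s_max_ranked: List[str],
    s_min: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
    correlated: Callable[[str, str], bool],
) -> Set[str]:
    """Greedy backward elimination with a correlation-based swap step.

    s_max_ranked lists all features, most important first.
    """
    rank = {f: i for i, f in enumerate(s_max_ranked)}  # 0 = most important
    protected = set(s_min)  # assumption: S_min features are never removed
    s_opt = set(s_max_ranked)
    best = score(frozenset(s_opt))
    for s in reversed(s_max_ranked):  # least important features first
        if s in protected or s not in s_opt:
            continue
        s_1 = s_opt - {s}
        quality = score(frozenset(s_1))
        if quality <= best:
            continue  # removing s does not help: keep (restore) it
        best, s_opt = quality, s_1
        # Try removing a correlated, less important feature instead of s.
        for s_ca in sorted(s_opt):
            if s_ca in protected or rank[s_ca] <= rank[s]:
                continue
            if not correlated(s_ca, s):
                continue
            s_2 = (s_opt - {s_ca}) | {s}
            quality = score(frozenset(s_2))
            if quality > best:
                best, s_opt = quality, s_2
                break  # s is back in the set, so stop looking for swaps
    return s_opt


# Invented scores for the feature sets visited in this toy run.
TOY_SCORES: Dict[FrozenSet[str], float] = {
    frozenset("abxc"): 0.60,
    frozenset("abx"): 0.58,
    frozenset("abc"): 0.65,
    frozenset("ac"): 0.70,
    frozenset("ab"): 0.75,
}


def toy_score(features: FrozenSet[str]) -> float:
    return TOY_SCORES.get(features, 0.0)


def toy_correlated(x: str, y: str) -> bool:
    return {x, y} == {"b", "c"}


selected = backward_eliminate_with_swap(["a", "b", "x", "c"], {"a"}, toy_score, toy_correlated)
print(sorted(selected))  # ['a', 'b']
```

In the toy run, removing c alone does not help, x is removed, and the removal of b is then replaced by the removal of the correlated, less important feature c.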

Solution of feature selection
The training and testing datasets are the full training and testing datasets of UNSW-NB15. The features of the UNSW-NB15 dataset are numbered sequentially as shown in Table 1. The results of feature selection using pFFC and pBFE, taking the correlation of features into account, are shown in Table 2. For each type of attack, the selected features differ in both identity and number. The classifier used is mainly a decision tree. Figure 1 shows the improvement in classification quality for each attack type when the proposed feature selection algorithms are used.
pBFE-IG selected features (table fragment): 3, 4, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 24, 27, 28, 29, 31, 33, 37, 38, 39, 42, 43.
From these results, the following observations can be made:
(1) Evaluation results on the UNSW-NB15 dataset show that this dataset has many complex instances, especially for the Generic and Fuzzers attack types.
(2) The use of feature selection techniques not only reduces the computational cost and time (according to Proposition 3 and Proposition 4) but also improves the classification quality in IDS.
(3) The pBFE-IG, pBFE-GR and pBFE-CA algorithms give better feature selection than other known algorithms.
(4) For each type of attack, different features and ML algorithms should be selected to best improve the classification quality of the intrusion detection system.

Solution of dataset resampling
For oversampling, the SMOTE, ADASYN, Cluster SMOTE, Borderline SMOTE1 and Borderline SMOTE2 techniques are used. For undersampling, the TML, ENN and NCR techniques are used. In both cases, the features that participate in the Euclidean distance used to find the nearest neighbors are selected with the pFFC and pBFE algorithms.
Oversampling the Dataset. As described in the proposed solution, the oversampling techniques rely on the kNN algorithm to create synthetic data instances, with all features participating in the distance calculation. Removing irrelevant or noisy features from the distance calculation that determines the k nearest neighbors can improve the quality of the oversampling techniques. In the experiments, both proposed solutions, pFFC and pBFE, were applied. The oversampling techniques used are ADASYN, SMOTE, Cluster SMOTE, Borderline SMOTE1 and Borderline SMOTE2.
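To illustrate why the choice of features in the distance matters, the following sketch finds nearest neighbors on a feature subset and then creates one SMOTE-style synthetic instance. It is a simplified illustration under stated assumptions: the function names and the toy data are hypothetical, and interpolating over all features while measuring distance on the subset is an assumption about how the two steps combine.

```python
import math
import random
from typing import List, Sequence


def subset_distance(u: Sequence[float], v: Sequence[float], features: Sequence[int]) -> float:
    """Euclidean distance computed only over the given feature indices."""
    return math.sqrt(sum((u[j] - v[j]) ** 2 for j in features))


def k_nearest(points: List[List[float]], index: int, features: Sequence[int], k: int) -> List[int]:
    """Indices of the k points closest to points[index] on the feature subset."""
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: subset_distance(points[index], points[i], features))
    return others[:k]


def smote_like_sample(points: List[List[float]], index: int, features: Sequence[int],
                      k: int, rng: random.Random) -> List[float]:
    """Interpolate between points[index] and one of its k nearest neighbors."""
    neighbour = rng.choice(k_nearest(points, index, features, k))
    gap = rng.random()
    base, other = points[index], points[neighbour]
    return [b + gap * (o - b) for b, o in zip(base, other)]


# Toy minority class: the third feature is noisy and dominates the distance.
minority = [[0.0, 0.0, 9.0], [1.0, 0.0, -9.0], [5.0, 5.0, 9.1]]

nearest_all = k_nearest(minority, 0, [0, 1, 2], 1)   # all features
nearest_subset = k_nearest(minority, 0, [0, 1], 1)   # noisy feature excluded
synthetic = smote_like_sample(minority, 0, [0, 1], 1, random.Random(0))
print(nearest_all, nearest_subset)  # [2] [1]
```

With all three features, the noisy third feature makes the distant point 2 look closest to point 0; on the first two features only, the neighbor is point 1, so the synthetic instance is generated between points 0 and 1.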
The results of using oversampling in combination with feature selection are presented in Table 3. The line with symbol G is the evaluation result obtained without oversampling; the line with symbol O is the result obtained when oversampling on all 42 features; the line with symbol F is the result obtained when oversampling in combination with pFFC; and the line with symbol B is the result obtained when oversampling in combination with pBFE. The line in bold gives the best F-Measure for each attack type. Table 4 summarizes the results obtained by using oversampling in combination with feature selection for each attack type. The Selected Features column lists the features numbered as shown in Table 1.
Undersampling the Dataset. Similar to oversampling, the undersampling techniques rely on the kNN algorithm to remove data that overlaps between classes and noisy data, with all features participating in the distance calculation. Removing irrelevant or noisy features from the distance calculation that determines the k nearest neighbors can improve the quality of the undersampling techniques.
In the experiments, both proposed solutions, pFFC and pBFE, were used. The undersampling techniques used are TML, NCR and ENN.
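As a rough illustration of undersampling with a restricted distance, the sketch below applies an ENN-style rule on a chosen feature subset: a majority-class instance is dropped when most of its k nearest neighbors carry a different label. The function names, labels and toy data are hypothetical, and removing only majority-class instances is an assumption about the variant used.

```python
from collections import Counter
from typing import List, Sequence


def squared_distance(u: Sequence[float], v: Sequence[float], features: Sequence[int]) -> float:
    """Squared Euclidean distance over the given feature indices."""
    return sum((u[j] - v[j]) ** 2 for j in features)


def enn_undersample(X: List[List[float]], y: List[str], features: Sequence[int],
                    k: int, majority_label: str) -> List[int]:
    """Return the indices kept after an ENN-style pass over the majority class."""
    keep = []
    for i in range(len(X)):
        if y[i] != majority_label:
            keep.append(i)  # minority instances are always kept
            continue
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: squared_distance(X[i], X[j], features),
        )[:k]
        votes = Counter(y[j] for j in neighbours)
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)  # neighbors agree with the label: keep it
    return keep


# Toy data: the third feature is unrelated to the class and acts as noise.
X = [
    [0.0, 0.0, 0.0], [0.1, 0.0, 40.0], [0.0, 0.1, 0.0], [0.1, 0.1, 40.0],  # normal
    [5.0, 5.0, 0.0],                                                       # normal, overlaps attacks
    [5.1, 5.0, 40.0], [5.0, 5.1, 40.0], [5.1, 5.1, 0.0],                   # attack
]
y = ["normal"] * 5 + ["attack"] * 3

kept_subset = enn_undersample(X, y, [0, 1], 3, "normal")
kept_all = enn_undersample(X, y, [0, 1, 2], 3, "normal")
print(kept_subset)  # [0, 1, 2, 3, 5, 6, 7]
```

On the first two features, the overlapping normal instance (index 4) is surrounded by attacks and is removed. With the noisy third feature included, that instance is kept and two clean normal instances are removed instead.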
The results of using undersampling in combination with feature selection are presented in Table 5. The line with symbol G is the evaluation result obtained without undersampling; the line with symbol U is the result obtained when undersampling on all 42 features; the line with symbol F is the result obtained when undersampling in combination with pFFC; and the line with symbol B is the result obtained when undersampling in combination with pBFE. The line in bold gives the best F-Measure for each attack type. As in the oversampling results, the selected features are numbered as shown in Table 1. These results show that undersampling combined with feature selection greatly improves the classification quality. It also helps to remove many noisy instances in the majority class and much of the overlap between classes. However, feature selection incurs an additional cost, with a time complexity of O(N × (N − 1)/2) (according to Proposition 7 and Proposition 8).
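For a sense of scale, the bound in Propositions 7 and 8 can be evaluated for the 42 features of UNSW-NB15. Reading the bound as a count of elementary steps is an interpretation made here for illustration, not something the propositions state:

```latex
\[
\left.\frac{N(N-1)}{2}\right|_{N=42} = \frac{42 \times 41}{2} = 861
\]
```

Because each evaluation of a candidate feature set, as described for Algorithms 7 and 8, involves resampling the training dataset and training a classifier, the practical running time depends heavily on the cost of those two operations.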

Conclusion
Experimental results have demonstrated the effectiveness of the proposed solutions for improving the classification quality of NIDS, which include:
(1) Techniques to improve feature selection on the training datasets used in NIDS.
(2) Techniques to improve the handling of the imbalanced data inherent in NIDS, through improved oversampling and undersampling techniques.
In the experiments, the UNSW-NB15 dataset was used for training and testing; it contains many contemporary synthetic attack instances and has not yet been used by many researchers. The paper proposes using the F-Measure to evaluate the classification quality of NIDS, which contributes to a more effective evaluation.
Besides the results obtained, the work has the following limitations, which also point to directions for future development:
(1) The correct combination of data preprocessing algorithms and classifiers to build a hybrid, multi-label classifier with real-time response needs further research.
(2) The system's data processing and computing capacity plays an important role in exploiting ML algorithms. Improving processing efficiency through parallel processing, and selecting and optimizing the parameters of the ML techniques, remain open issues.