Augmentation of Predictive Competence of Non-Small Cell Lung Cancer Datasets through Feature Pre-Processing Techniques

The major objective of the study is to augment the predictive analytics of Non-Small Cell Lung Cancer (NSCLC) datasets with a Feature Pre-Processing (FPP) technique in three stages: removing base errors through common analytics on empty, non-numerical or missing values in the dataset, removing repeated features through regression analysis, and eliminating irrelevant features through clustering methods. The FPP model is validated using classifiers such as Simple and Complex Tree, Linear and Gaussian SVM, Weighted KNN and Boosted Trees in terms of accuracy, sensitivity, specificity, kappa, and positive and negative likelihood. The results showed that the NSCLC dataset formed after FPP outperformed the raw NSCLC dataset at all performance levels, demonstrating good augmentation in the predictive analytics of NSCLC datasets. The research proved that pre-processing is essential for better prediction of complex medical datasets.


Introduction
Non-Small Cell Lung Cancer (NSCLC) is one of the escalating cancers found in many parts of the world. The study aimed to provide a solution for building an effective predictive system for NSCLC in complex, high-dimensional datasets. Medical datasets are generally complex, which makes their utilization in the prediction of lung cancer difficult. The NSCLC datasets [1] are complex in nature due to the presence of biomarker features. The complexity of the features depends on the quantity and quality of the features present in a dataset. Numerous NSCLC datasets were recognized to have over 50 features, which makes the complexity an even more hard-hitting scenario. Also, the quality of the recorded samples was not appealing, as they contained missing values, irrelevant values [2] and redundant features that make prediction more complicated. Hence, the major problem addressed in this paper is to minimize the irrelevant features to augment the predictive competency of NSCLC datasets through a series of feature reduction methods, collectively represented as Feature Pre-Processing (FPP), using data mining techniques.
The major objective of the research is to increase the predictive competency of the Non-Small Cell Lung Cancer dataset through Feature Pre-Processing (FPP) techniques and to test the predictive performance of the dataset before and after the FPP process. The specific objectives are to identify the class and predictive features of the NSCLC dataset and to perform analytics based on the efficiency, performance and likelihood nature of the dataset after identifying the flaws in the features present before the FPP stage. The scope of the research work applies to Non-Small Cell Lung Cancer (NSCLC) datasets collected from primary or secondary sources, with the pre-processing techniques modeled within the limits of data mining techniques. The research work focused on the importance of choosing the right features for better prediction. Hence, the study is recommended for the prediction of complex biological datasets and of complicated diseases like Non-Small Cell Lung Cancer (NSCLC), which seldom shows symptoms at its earliest stage of infection.

Literature reviews on pre-processing techniques for lung cancer
The research work, concentrating on the competency of predictive performance after pre-processing, requires a study of relevant works completed in different scenarios. Chief pre-processing methods were applied in different medical datasets to counteract major problems like redundancy, missing values, irrelevancy of data and high dimensionality. A deep learning approach to identify lung cancer using chest X-rays and CT scan images was proposed by Yu. Gordienko et al. (2018) [3], where pre-processing techniques like segmentation and bone shadow exclusion were applied on the BSE-JSRT dataset; the removal of unrelated bone data in the dataset enhanced the accuracy of prediction. Choon Sen Seah et al. (2018) [4] developed a pre-processing model called Significant Directed Random Walk (SDRW) in three stages: during the first stage, unwanted attributes were removed along with missing values and the data was arranged; secondly, normalization techniques were applied, followed by filtering methods at the third stage. Shigang Liu et al. (2020) [5] identified that the Feature Selection with SVM Classifier pre-processing technique overwhelmed the existing KNN model of pre-processing methods applied in biomedical datasets. Biological datasets are highly complex, and hence pre-processing methods are expected to be highly reliable.
Various other pre-processing methods were also proposed and tested with biological datasets like lung cancer datasets. Gur Amrit Pal Singh & P. K. Gupta (2018) [6] analyzed lung cancer CT images using various classifiers like KNN, SVM, decision tree, RFT and Multi-Layer Perceptron based on 15750 lung images, where the class variable of the dataset classified 6910 as early-stage lung cancer and 8840 as advanced malignant lung cancer respectively. The accuracy after pre-processing was found to be highest at 88.55% with the MLP model among all tested models. Anna Meldo et al. (2019) [7] applied a Computer Aided Diagnostic (CAD) pre-processing method for lung cancer on the intellectual dataset called LIRA. A similar automated lung cancer prediction on Kaggle datasets was proposed by Gustavo Perez & Pablo Arbelaez (2020) [8], with an accuracy of 99.6% based on a Malignation Prediction Pre-processing test. Negar Maleki et al. (2020) [9] proved that hybrid usage of pre-processing with KNN and a Genetic Algorithm could enhance the prediction accuracy of complex lung cancer datasets using classifiers. Chip M. Lynch et al. (2017) [10] showed that statistical methods like Root Mean Squared Error (RMSE) and unsupervised learners like k-means could enhance pre-processing methods and improve predictions. M. S. Kavitha et al. (2019) [11] utilized pre-processing techniques like the Gabor filter for lung image enhancement and the Gaussian filter for smoothing of lung images for effective prediction using classifiers like SVM and Fuzzy C-Means clustering. Thus, pre-processing techniques serve a significant role in enhancing the prediction of lung cancer datasets like the Luna16 dataset, as shown by Nasibeh and Mortezapour (2019) [12], who were also able to classify the images into different categories based on the effectiveness of the pre-processing methods. Even in the recent analysis by Ankush Kumar Gulia et al. (2021) [13], the prediction of lung cancer datasets was proved to be effective after proper pre-processing techniques were applied to the existing models. Some of the pre-processing methods and the classifiers used for lung cancer in the existing scenario are presented in Table 1. The above pre-processing models were used in complex lung cancer datasets to predict benign and malignant tumors using the position of lymph in lung cells.

Research Gaps of the Study
Based on the analysis of various pre-processing models in the existing scenario, a few flaws were identified in the existing frameworks. Some of them are presented as follows:
• The existing models were mostly based on image pre-processing techniques applied on lung cancer images. The application of pre-processing techniques in numerical analysis was found to be missing among the models.
• A pre-processing framework with a sequence of stages was found earlier in the Significant Directed Random Walk (SDRW) of Choon Sen Seah et al. (2018) [4]. However, the stages were generally made with no specific algorithm generated in novel form.
• The models tested with the classifiers were not benchmarked against the existing methods before and after pre-processing to establish the importance of pre-processing in lung cancer datasets.
• The datasets utilized in the research comprised not more than 50 features at a time. High dimensionality could be addressed better with at least fifty features or above, for better scope and reliability of the pre-processing techniques.
• Supervised and unsupervised models were not tested at the same time during the pre-processing stage, which is essential for effective identification of relevant and irrelevant features in a biological dataset.
To counteract all the above disadvantages of the existing pre-processing methods, a novel framework is specifically designed for the pre-processing technique, combining data mining, regression and clustering methods.

Materials and Methods
After analyzing the disadvantages of the existing systems and their flaws, it is important to identify better techniques in combination that would assist in resolving all the problems in the given complex NSCLC dataset to encourage better predictions using classifiers. Pre-processing is an important stage of the data mining process, where features containing irrelevant, redundant and unprecedented data [23] have to be removed. Hence, various pre-processing techniques have to be performed on the numerical dataset to be used. The dataset should be high-dimensional, with a minimum of fifty features or more along with a class feature. The pre-processing architecture proposed in this research work, "Feature Pre-Processing (FPP)", is a novel method comprising three major phases, as shown in Figure 1. The first phase tests the relevancy of the data to the NSCLC dataset based on numeric nature [24], Null data [25] or non-medical data [26] out of range. During the second phase, regression analysis measures the correlation between the existing data and the overall mean data. In the third phase, based on the centroid of the dataset, the relevant and irrelevant data can be measured. Thus, after establishing relevance in three different phases, the feature corresponding to the irrelevant data in each phase can be removed, and the remaining features can be developed into a best feature set for further processing. The architecture applies primarily to numerical datasets of the NSCLC type rather than other types of datasets. The dataset should also be of high dimension, with above fifty features and a class feature, to measure the competency of the prediction after reducing the irrelevant features.

NSCLC Dataset Analysis
The multivariate dataset was collected from the UCI repository based on the T, N and M values of predictions on Non-Small Cell Lung Cancer (NSCLC) cells. The dataset was donated by Aeberhard et al., where initially accuracies of 62.5% for RDA, 53.1% for KNN and 59.4% for Opt. Disc. Plane [27] were achieved. The dataset comprised 57 features that described three major types of pathological lung cancers. The donor of the dataset did not furnish the names of the features; hence they are identified as hidden features. The dataset was found to have unknown values indicated as '?' due to the non-reliability of unknown data. Feature 1 is the class feature, the remainder being predictive in type, as shown in Table 2. The remaining 56 features were considered predictive in nature. The values of the features frequently range from 1 to 3, indicating T, N and M values respectively. The predictive features have to be examined by the proposed pre-processing methods to identify the eligible features and remove the irrelevant features to form a best feature set.
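The handling of '?' as a missing-value marker can be sketched as below. This is a hedged Python illustration (the study itself used MATLAB), using a toy three-row extract in the same layout as the UCI file — class feature first, predictive features after — rather than the real 32-record, 57-column file.

```python
import io
import pandas as pd

# Toy extract in the same layout as the UCI file: the class feature
# first, then predictive features, with '?' marking unknown values.
raw = "1,2,3,?\n2,1,?,3\n3,0,1,2\n"
df = pd.read_csv(io.StringIO(raw), header=None, na_values="?")

y = df.iloc[:, 0]   # feature 1: the class feature
X = df.iloc[:, 1:]  # remaining features: predictive
print(X.isna().sum().tolist())  # missing values per predictive feature
```

Passing `na_values="?"` makes the unreliable entries visible as NaN, so the Phase-I relevancy tests can detect them directly.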

Phases of Feature Pre-Processing
The pre-processing of the NSCLC dataset has been carried out with the novel Feature Pre-Processing (FPP) architecture in three different phases, as explained below with schematic diagrams, formulations and algorithms.

Phase-I: Test for Relevancy of Features
The first phase tests the raw NSCLC dataset for the relevancy of data by testing behavioral properties like the presence of '?' instead of negative data [27], null data and empty sets. It also tests the numerical nature of the data, as the major research work targets numerical data rather than image analysis. The major processes in Phase-I are represented in Equation 1,
where minimize indicates the minimization of features to form the subset. The data in the dataset ranges from k = 1 to n features, where every individual datum fti is tested for empty, Null or NAN values. The raw dataset is initially loaded and normalized before testing for relevancy of data. Conditional constructs were applied to test for null values, for emptiness of data without any values, and for non-numeric data represented as NAN (Not A Number) [28] values in the dataset. If the selected data tests true for any one of the above behavioral relevancy problems, the corresponding feature is removed from the dataset, and the remaining features are added to the refined or processed NSCLC dataset for further testing, as indicated in Figure 2.

Figure.2. Phase-I: Test for Relevancy of data in NSCLC dataset
The explained process is given as an algorithm in Table 3. Every individual element is tested against the empty, Null and NAN ruleset. If the data falls into any one of the categories, it is identified as corrupted data and the corresponding feature is deleted. After consequent iterations, all the feature elements are tested and the subset A1() is formed from the remaining features in the source dataset A(). Thus, this stage separates externally identifiable corrupted features and removes them.
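The Phase-I ruleset can be sketched as follows. This is a hedged Python rendering of the Table 3 algorithm (the study used MATLAB); the function name and toy data are illustrative, and non-numeric entries are coerced to NaN so that one test covers the empty/Null/NAN cases.

```python
import pandas as pd

def phase1_relevancy(A: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the Table 3 algorithm: test every feature against the
    empty/Null/NAN ruleset and delete corrupted features, forming A1()."""
    keep = []
    for col in A.columns:
        values = pd.to_numeric(A[col], errors="coerce")  # non-numeric -> NaN
        if values.isna().any():   # empty, Null or NAN detected
            continue              # corrupted feature: delete it
        keep.append(col)
    return A[keep]

# Feature 'b' holds a '?' and feature 'c' a missing value, so both go.
A = pd.DataFrame({"a": [1, 2, 3], "b": [1, "?", 3], "c": [1.0, None, 2.0]})
A1 = phase1_relevancy(A)
print(list(A1.columns))  # -> ['a']
```

The whole feature (column) is dropped as soon as any of its values fails the ruleset, matching the paper's feature-level removal rather than row-level imputation.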

Phase-II: Test for Regression Analysis
During the second phase, the dataset from the first stage undergoes regression analysis to test the relationship that exists between the data in the NSCLC dataset. The regression analysis process is indicated in Equation 2.

The algorithm in Table 4 shows the regression analysis in FPP, where the dataset F1() is the dataset after the first-level pre-processing and A2() is the dataset to be created after regression analysis. The fitness function finds the sum and mean of the entire data in the dataset. Later, the regression coefficient is calculated by subtracting the individual f(i) value from the mean. After testing the regression coefficient value for zero, the selected features are stored in the new dataset. This phase separates the redundant features from the independent features.
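Under one reading of the Table 4 rule, the "regression coefficient" of a feature is its mean minus the grand mean of the whole dataset, and a zero coefficient marks the feature redundant. The hedged Python sketch below follows that reading; the function name, tolerance and toy data are illustrative, not the paper's MATLAB implementation.

```python
import pandas as pd

def phase2_regression(F1: pd.DataFrame, tol: float = 1e-12) -> pd.DataFrame:
    """Sketch of the Table 4 rule: drop features whose coefficient
    (feature mean minus grand mean) is zero, forming A2()."""
    grand_mean = F1.to_numpy().mean()          # mean of the entire data
    keep = [c for c in F1.columns
            if abs(F1[c].mean() - grand_mean) > tol]  # non-zero coefficient
    return F1[keep]                            # A2(): non-redundant features

# f2's mean equals the grand mean (3), so it is dropped as redundant.
F1 = pd.DataFrame({"f1": [1, 2, 3], "f2": [3, 3, 3], "f3": [4, 4, 4]})
A2 = phase2_regression(F1)
print(list(A2.columns))  # -> ['f1', 'f3']
```

A small tolerance replaces the exact zero test, since floating-point means rarely compare equal exactly.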

Phase-III: Test for Clustering Analysis
The third and final stage of pre-processing applies segmentation and clustering methods to find one cluster with irrelevant features and another cluster with relevant features. The clustering model applies the k-means clustering technique [31], where the major intention is to find the centroid of the dataset and compare it with the individual data. The overall process of the third phase is indicated in Equation 3,
where again minimize is used to indicate the reduction of irrelevant features from the total NSCLC dataset (x). The centroid is calculated by identifying the median of the features in statistical analysis and dividing it by the total number of values (n). Initially, the dataset from the first phase is loaded for the test of segmentation and clustering in the third phase. The median is calculated by finding the middle feature among the existing features; this median is the centroid of the entire dataset. After computing the centroid as the threshold value, the fitness function is calculated by subtracting the individual feature from the centroid value. Finally, the segmentation fitness coefficient is tested to find whether it is greater than the centroid [31]. If the condition is true, the feature is added to the irrelevant feature set; if it is false, it is added to the refined dataset of relevant features. The overall process of the third phase is shown in Figure 5.

The algorithm in Table 5 indicates that segmentation and clustering assist in identifying the irrelevant features by differentiating the features using centroid values. It is also applied to hidden data in the existing dataset to identify whether it is required for further processing or can be removed due to the non-relevant or non-classifiable nature of the data. Thus, the entire Feature Pre-Processing stage represents the best possible method to identify the highly irrelevant features, thereby enhancing the predictive analytics of the NSCLC dataset.
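The centroid-based segmentation of Table 5 can be sketched as below. This is a hedged Python illustration (the study used MATLAB): one summary value per feature stands in for f(i), and the function name and toy data are assumptions for demonstration.

```python
import pandas as pd

def phase3_segmentation(A1: pd.DataFrame):
    """Sketch of the Table 5 algorithm: the median of the per-feature
    values acts as the centroid fc; the fitness seg(i) = f(i) - fc
    segments each feature into irrelevant (A3) or relevant (A4)."""
    f = A1.mean()               # one summary value f(i) per feature
    fc = f.median()             # centroid of the dataset
    seg = f - fc                # fitness function
    A3 = [c for c in A1.columns if seg[c] > fc]   # irrelevant features
    A4 = [c for c in A1.columns if seg[c] <= fc]  # relevant features
    return A1[A4], A3

# f3 lies far from the centroid, so it is segmented as irrelevant.
A1 = pd.DataFrame({"f1": [1, 1, 1], "f2": [2, 2, 2], "f3": [9, 9, 9]})
refined, irrelevant = phase3_segmentation(A1)
print(list(refined.columns), irrelevant)  # -> ['f1', 'f2'] ['f3']
```

The threshold test `seg(i) > fc` follows the paper's rule literally; a two-cluster k-means over the feature summaries would be a natural alternative reading.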

Experimentation and Evaluation
The first phase of implementation was performed in MATLAB by loading the initial raw dataset obtained from the UCI repository. The dataset contained errors like missing values ('?'), as clearly witnessed in Figure 6. It is evident from Figure 6 that the regions marked in yellow with '?' are not in readable form and hence could not be involved in the prediction of NSCLC without proper modification.
To conduct the test for relevancy of the data present in the dataset and the values associated with it, the Feature Pre-Processing (FPP) technique was designed in MATLAB and applied to the current dataset. The raw dataset is loaded in the MATLAB interface and tested for the various pre-processing criteria mentioned in Phase-I of FPP, known as the Test for Relevancy of Data.
The data was then normalized based on rows and columns. It was identified that there are 57 features (columns) and 32 records (rows). The data is stored in the form of an array. Each feature's data is individually tested for numeric or non-numeric nature, then checked for empty data or negative values, and finally checked for the Null condition in the raw dataset. The outcomes are shown in Table 7. After the first stage, the refined dataset is formed, and this forms the basis for the second and third phase evaluations respectively. During the second phase, the newly formed Excel file was loaded into the MATLAB interface and tested with regression analysis as described in the methodology. After testing for regression, the irrelevant features were separately loaded in a list box; in this experiment, they were found to be features 28, 42, 46 and 47. In Figure 9, it is shown that the relevant features are accepted as the refined set, whereas the irrelevant features VarName2, VarName5, VarName9, VarName12, VarName19, VarName20 and VarName30 were removed from the existing dataset. After completing the overall implementation, all features identified as irrelevant and unnecessary were removed from the original dataset. The remaining features are considered effective for prediction and tested with classifiers in MATLAB.

Results and discussion
The performance comparison is carried out in two phases, viz. 1) Before Feature Pre-Processing (BFPP) and 2) After Feature Pre-Processing (AFPP), based on the removal of irrelevant features and testing with the selected classifiers listed below:
• Simple Tree,
• Complex Tree,
• Linear SVM,
• Gaussian SVM,
• Weighted KNN,
• Boosted Tree.
Among the classifiers chosen for testing the performance, the Simple and Complex Tree [32] are tree-based models, Linear SVM [33] and Gaussian SVM [34] are supervised kernel models, and Weighted KNN [35] and Boosted Tree [36] are optimization models respectively. Hence, they are well suited to test the competency of complex NSCLC datasets.

NSCLC Dataset Competency Analytics
The NSCLC dataset competency analytics begins with the retrieval of classifier performance in the form of a confusion matrix obtained after training the models. The total number of predictions possible in the dataset is 32. Hence, the confusion matrix [37] is limited to 32 values across four criteria, viz. 1) True Positive, 2) True Negative, 3) False Positive and 4) False Negative. In the predictive analytics of NSCLC datasets, True Positives and True Negatives [38] are correct predictions, whereas False Positives and False Negatives [39] are wrong predictions.
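The competency measures used throughout the evaluation follow from the four confusion-matrix counts. The Python sketch below computes them with the standard definitions; the counts fed in are illustrative, not the paper's reported values.

```python
def competency_measures(tp, tn, fp, fn):
    """Standard measures from a 2x2 confusion matrix, as used in the
    competency analytics (counts are illustrative)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)               # true-positive rate
    specificity = tn / (tn + fp)               # true-negative rate
    lr_pos = sensitivity / (1 - specificity)   # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity   # negative likelihood ratio
    # Cohen's kappa: agreement beyond chance
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - p_e) / (1 - p_e)
    return accuracy, sensitivity, specificity, lr_pos, lr_neg, kappa

# Illustrative counts summing to the 32 predictions in the NSCLC dataset
acc, sens, spec, lrp, lrn, kap = competency_measures(tp=18, tn=8, fp=3, fn=3)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

With these definitions, a classifier improves after FPP when accuracy, kappa, sensitivity and the positive likelihood ratio rise while the negative likelihood ratio falls.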

Before Feature Pre-Processing
Initially, the raw dataset without FPP is trained and tested with the classifiers to obtain the accuracy shown in Table 9.

After Feature Pre-Processing
After performing FPP in three phases, the irrelevant features were removed and a new dataset was created in Excel format. That dataset is loaded again in MATLAB to train and test it using the classifiers in the benchmark model. The results, in the form of confusion matrices, are tabulated in Table 12.

Discussion and Findings
The overall competency augmentation is measured based on the minimization of features and the maximization of performance. The overall comparative analysis of the NSCLC dataset based on the performance of the classifiers before and after FPP is summarized in Table 14.

Accuracy and Kappa Competency Analysis
The Accuracy and Kappa [40] are calculated based on the correct predictions of the classifiers on the dataset.
Hence, both are considered efficient only when the value increases from the benchmark model. Based on the values in Table 14, the accuracy of the Simple Tree remains the same and the Complex Tree shows a reduction in prediction, whereas the remaining classifiers showed good improvement. The average accuracy enhancement across the algorithms is 7%. This showed that accuracy is highly competent after the FPP process. The Accuracy and Kappa analysis is graphically represented in Figure 10. The Kappa is lower with the Complex Tree classifier, whereas it showed highly competent improvement in the other models. Even the Simple Tree classifier, whose accuracy remained the same, moved from 0.90 to 0.63 in the kappa competency analysis.

Sensitivity and Specificity Competency Analysis
The Sensitivity and Specificity [41] of the NSCLC dataset are calculated based on the ability of the classifier to predict true data as true and false data as false. The sensitivity showed improvement in all supervised and unsupervised classifiers, whereas it remained the same in the optimization models, as shown in Figure 11(a). However, specificity, as shown in Figure 11(b), showed a reduction in values, thereby showing good competency in prediction. Since specificity here is treated as the error in prediction, a reduction in its value after FPP indicates good prediction. Thus, the test for sensitivity and specificity competency analysis was successful.

Positive and Negative Likelihood Competency Analysis
The likelihood competency analysis [42] indicates the possibility of the NSCLC dataset being predicted correctly or wrongly in the future. The positive likelihood either improved or remained the same for each of the classifiers, whereas the negative likelihood decreased or remained the same in all the classifiers except the Boosted Tree classifier, as shown in Figure 12(a) and Figure 12(b). Based on the overall competency analysis, the following findings are identified at the end of the research. The overall competency of NSCLC dataset prediction has been augmented with good performance, except for a few flaws like the reduction of accuracy in the Complex Tree and the increase of negative likelihood in the Boosted Tree. The total features reduced in each phase are given in Table 15. The findings show that the minimization of features achieved a 22.81% reduction of features in the overall dataset, as shown in Table 15. The competency analysis parameters like accuracy, kappa, sensitivity, specificity, and positive and negative likelihood were found to be effective in testing with the classifiers. The classifiers included all three categories of testing models: supervised, unsupervised and optimization. Thus, it proves that the pre-processing methods are justified with all types of models. The benchmark model with the raw dataset was outperformed by the proposed model; thereby the alternate model is acceptable.
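The 22.81% figure can be verified from the per-phase removals reported in the experiments: 2 features in Phase I, 4 in Phase II and 7 in Phase III, out of 57 total.

```python
# Features removed in each FPP phase, as reported in the experiments
removed = {
    "Phase I (relevancy)": 2,    # VarName5, VarName39
    "Phase II (regression)": 4,  # VarName28, VarName42, VarName46, VarName47
    "Phase III (clustering)": 7, # seven clustered-out features
}
total_features = 57
reduction = sum(removed.values()) / total_features * 100
print(f"{sum(removed.values())} of {total_features} features "
      f"removed = {reduction:.2f}%")  # -> 13 of 57 features removed = 22.81%
```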
Based on the findings, the proposed model of this research work showed a high level of competency in augmenting prediction with NSCLC datasets.

Conclusion
The research work proposed a novel architecture for the pre-processing of complicated datasets like NSCLC datasets. The dataset is of a hidden type, and hence the anonymous data was handled in the three phases of the Feature Pre-Processing (FPP) model. The NSCLC dataset was also very complicated in terms of the high number of features and the high prediction performance expected. The research study used three different phases to test the relevancy of data in behavioral, regression and segment-based categories. The overall proposed model showed high competency against the existing NSCLC prediction performance. This model can further be extended with high-dimensionality reduction methods and feature extraction methods to provide more competency in the future.

Figure 1. Phases of Feature Pre-Processing (FPP) for NSCLC dataset
Table 3 shows two datasets: A() as the source raw dataset and A1() as the destination refined dataset. Other variables include s for the size of features, fn as the individual element and n as the total number of elements in the dataset respectively. After initial normalization [29] of the data, every individual element f(i) is tested for null, emptiness and NAN.
EAI Endorsed Transactions on Pervasive Health and Technology, 11 2022 – 12 2022 | Volume 8 | Issue 5 | e1

where minimize represents the reduction of non-regressive features that show similarities, F represents the feature set, X̄ the mean of the data in the dataset, fi the individual data and n the total number of values in the tested dataset. The overall function of the regression analysis is shown in Figure 4.

Figure 4. Phase-II: Test for Regression Analysis in Feature Pre-Processing (FPP)
Figure 4 indicates that the refined dataset is loaded in the training platform to test for regression of the data. The regression indicates the relationship that exists between the individual data representing a feature. The total sum of all data in the NSCLC dataset is calculated in the initial stage. Later, the mean for the dataset is found by dividing the sum by the number of samples in the dataset [30]. Every feature is then tested for correlation by finding the difference of every element from the mean value. After finding the regression coefficient, the value is tested to find whether it equals zero. If the condition is true, the feature is identified as redundant and deleted. If the condition is false, the features are considered different and hence added to the new refined dataset.

Figure 5. Phase-III: Test for Segmentation and Clustering in FPP
Table 5. Algorithm for Test for Segmentation and Clustering in FPP
Algorithm TFSCPP
Input: NSCLC Refined Dataset A1(); output sets A3(), A4()
Initialize counters to 1; A3(), A4() and seg() to empty
∀ i ∈ A1 do
    fc ← median(f(i))
End For
// Fitness Function
∀ i ∈ A1 do
    seg(i) ← f(i) − fc
End For
∀ i ∈ A1 do
    If (seg(i) > fc) add(A3(f(i))) else add(A4(f(i)))
End For
It was identified that the features VarName5 and VarName39 have errors in the form of missing values or non-numeric values. This may affect the quality of the prediction. Hence, both features are removed and a new dataset is formed through 'Generate Corrected Dataset', as shown in Figure 7.

Figure 7. Refined dataset after loading the raw dataset and removing features 5 & 39
The regression coefficient values are listed in another list box with graphical representations, as shown in Figure 8.

Figure 8. Phase-II: Test for Regression in FPP implemented using MATLAB
Figure 8 shows that both positive and negative values are non-redundant, whereas a regression factor of zero indicates that the features are similar and will not be useful for prediction. Hence, features VarName28, VarName42, VarName46 and VarName47 can be removed from the dataset to form a further refined dataset. However, the changes were not applied before the completion of the third phase.

Figure 10(a). Accuracy Competency Analysis; Figure 10(b). Kappa Competency Analysis
The sensitivity of the NSCLC dataset is calculated based on the ability of the classifier to predict true data as true and false data as false. The sensitivity of the NSCLC dataset is expected to show improvement for high competency. It showed improvement in all supervised and unsupervised classifiers, whereas it remained the same in the optimization models, as shown in Figure 11(a).

Figure 11(a). Sensitivity Competency Analysis; Figure 11(b). Specificity Competency Analysis
However, specificity, as shown in Figure 11(b), showed a reduction in values, thereby showing good competency in prediction. Since specificity here is treated as the error in prediction, a reduction in its value after FPP indicates good prediction. Thus, the test for sensitivity and specificity competency analysis was successful.

Table 1. Existing pre-processing methods and classifiers for lung cancer datasets

Table 2. Feature names, types and range of values in the NSCLC dataset
As shown in Table 2, VarName1 is the class feature that represents the outcome received through medical analysis.

Table 9. Confusion matrix and accuracy values of the NSCLC dataset before FPP
Based on the confusion matrix values obtained in Table 9, the competency analysis measures like accuracy, kappa, sensitivity, specificity, positive likelihood and negative likelihood were calculated and summarized in Table 10.
Table 10. Competency analysis of the NSCLC dataset before FPP
Table 12. Confusion matrix and accuracy values of the NSCLC dataset after FPP
As identified in Table 12, the confusion matrix values are used to find the competency analysis measures for the classifiers, presented in Table 13.
Table 13. Competency analysis of the NSCLC dataset after FPP
Table 14. Overall comparative competency analysis of the NSCLC dataset before and after FPP
Table 15. Minimization of features