NNFSRR: Nearest Neighbor Feature Selection and Redundancy Removal Method for Nearest Neighbor Search in Microarray Gene Expression Data
Keywords: High dimension, Feature, gene expressions, Symmetric Uncertainty
INTRODUCTION: Gene expression data analysis is a critical aspect of disease prediction and classification, playing a pivotal role in the field of bioinformatics and biomedical research. High-dimensional gene expression datasets hold a wealth of information, but their effective utilization is hindered by the presence of irrelevant dimensions and noise. The challenge lies in extracting meaningful features from these datasets to enhance the accuracy of disease prediction and classification while maintaining computational efficiency.
Feature selection is a crucial step in addressing these challenges, as it aims to identify and retain only the most informative characteristics from large high-dimensional microarray datasets. In the context of microarray gene expression data, characterized by its substantial dimensionality, selecting relevant features is essential for efficient nearest neighbor search, a fundamental component of various analytical tasks in bioinformatics and data mining.
Existing feature selection methods in high-dimensional data often face issues related to the trade-off between search accuracy and computational efficiency. This paper introduces a novel approach, the Nearest Neighbor Feature Selection with Symmetrical Uncertainty-based Redundancy Removal (NNFSRR) method, designed to enhance the classification of microarray gene expression data through feature selection. The NNFSRR method focuses on reducing the dimensionality of the dataset by identifying and removing redundant features, allowing subsequent searches to operate solely on relevant dimensions.
OBJECTIVES: The primary goal is to evaluate the NNFSRR method's effectiveness in improving nearest neighbor search in microarray gene expression datasets by reducing dimensionality. This method utilizes Symmetrical Uncertainty-based correlation between dimensions for feature selection and aims to enhance accuracy and efficiency compared to existing methods.
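For reference, Symmetrical Uncertainty between two discrete variables X and Y is commonly defined as a normalized form of mutual information. The abstract does not state the exact formulation used, so the standard definition below is an assumption, with H denoting Shannon entropy and I denoting mutual information:

$$
SU(X, Y) = \frac{2\, I(X; Y)}{H(X) + H(Y)}, \qquad I(X; Y) = H(X) + H(Y) - H(X, Y), \qquad H(X) = -\sum_{x} p(x) \log_2 p(x)
$$

Under this definition SU lies in [0, 1]: a value of 0 means the two variables are independent, and a value of 1 means each fully determines the other.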
METHODS: The NNFSRR method uses Symmetrical Uncertainty to identify and remove redundant features from microarray gene expression datasets. Reduced datasets are used for nearest neighbor search, improving accuracy and efficiency. Experiments are conducted using real-world datasets, and comparisons with existing methods are made based on search time and accuracy.
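The steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the algorithm, so the sketch assumes equal-frequency discretization of expression values, the standard entropy-based definition of Symmetrical Uncertainty, a greedy redundancy filter in the style of Yu and Liu's relevance and redundancy analysis, and a brute-force Euclidean search on the retained features. All function names and parameters (`discretize`, `select_features`, `relevance_threshold`, and so on) are hypothetical.

```python
import numpy as np


def entropy(values):
    """Shannon entropy (in bits) of a 1-D array of discrete values."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def symmetrical_uncertainty(x, y):
    """SU(x, y) = 2 * I(x; y) / (H(x) + H(y)) for discrete 1-D arrays."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0.0:
        return 0.0  # both variables are constant
    # Encode each (x, y) pair as one integer to get the joint entropy.
    xi = np.ravel(np.unique(x, return_inverse=True)[1])
    yi = np.ravel(np.unique(y, return_inverse=True)[1])
    hxy = entropy(xi * (yi.max() + 1) + yi)
    return 2.0 * (hx + hy - hxy) / (hx + hy)


def discretize(X, bins=3):
    """Equal-frequency binning of each column (an assumed preprocessing step)."""
    Xd = np.empty(X.shape, dtype=int)
    inner = np.linspace(0.0, 1.0, bins + 1)[1:-1]
    for j in range(X.shape[1]):
        Xd[:, j] = np.digitize(X[:, j], np.quantile(X[:, j], inner))
    return Xd


def select_features(Xd, y, relevance_threshold=0.0):
    """Greedy filter: rank features by SU with the class, then drop any feature
    whose SU with an already-kept feature is at least its SU with the class."""
    relevance = np.array(
        [symmetrical_uncertainty(Xd[:, j], y) for j in range(Xd.shape[1])]
    )
    order = np.argsort(-relevance, kind="stable")
    selected = []
    for j in order:
        if relevance[j] <= relevance_threshold:
            continue  # not relevant enough to the class
        if all(
            symmetrical_uncertainty(Xd[:, j], Xd[:, k]) < relevance[j]
            for k in selected
        ):
            selected.append(int(j))
    return selected


def nearest_neighbor(X, query, features):
    """Brute-force Euclidean nearest neighbor restricted to the given features."""
    distances = np.linalg.norm(X[:, features] - query[features], axis=1)
    return int(np.argmin(distances))
```

In this sketch a caller would discretize the expression matrix, pass the result and the class labels to `select_features`, and then run `nearest_neighbor` on the original values using only the returned column indices, so that each distance computation touches the retained dimensions only.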
RESULTS: The NNFSRR method demonstrates improved nearest neighbor search performance, outperforming both a basic brute-force search and existing feature selection techniques. The selected feature sets exhibit strong associations with the class while keeping correlations among features low, which enhances classification precision.
CONCLUSION: The NNFSRR method is a promising approach to the challenges posed by high-dimensional gene expression data. It reduces dimensionality, improves search accuracy, and makes nearest neighbor search more efficient. Our experimental results show that it outperforms existing techniques in search time and accuracy, making it a valuable tool for applications in bioinformatics, data mining, pattern recognition, and biological information retrieval. The method has the potential to advance our understanding of complex biological processes and to support more accurate disease prediction and classification.
Copyright (c) 2023 Rupali Bhartiya, Gend Lal Prajapati
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the original work is properly cited.