Smart Data Prefetching Using KNN to Improve Hadoop Performance
DOI: https://doi.org/10.4108/eetsis.9110

Keywords: Hadoop Performance, Smart prefetch technique, K-Nearest Neighbor Clustering, MapReduce, Machine Learning, Cache Replacement

Abstract
Hadoop is an open-source framework that enables parallel processing of large data sets across a cluster of machines. It faces several challenges that can degrade performance, including heavy I/O operations, network data transmission, and high data access times. In recent years, researchers have explored prefetching techniques that reduce data access time as a potential solution to these problems. Nevertheless, several issues must be addressed to optimize the prefetching mechanism: launching the prefetch at an appropriate time to avoid conflicts with other operations and minimize waiting time, determining the amount of prefetched data to avoid both overload and underload, and placing the prefetched data where it can be accessed efficiently when required. In this paper, we propose a smart prefetch mechanism consisting of three phases designed to address these issues. First, we enhance the task progress rate estimate to calculate the optimal time for triggering prefetch operations. Next, we utilize K-Nearest Neighbor clustering to identify which data blocks should be prefetched in each round, and we employ the data locality feature to determine the placement of prefetched data. Our experimental results demonstrate that the proposed smart prefetch mechanism reduces job execution time by an average of 28.33% by increasing the rate of local tasks.
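The trigger-time and block-selection phases described above can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the block IDs, the two-dimensional feature vectors (here assumed to be, e.g., access frequency and recency), and the simple remaining-time estimate are all hypothetical, since the abstract does not specify the paper's actual progress-rate formula or KNN features.

```python
import math

def prefetch_trigger(progress, progress_rate, fetch_time):
    """Phase 1 (sketch): fire the prefetch when the estimated time left in
    the current task drops to the time needed to fetch the next block, so
    the data arrives just as it is required."""
    remaining = (1.0 - progress) / progress_rate  # estimated seconds left
    return remaining <= fetch_time

def knn_candidates(history, candidates, k=3):
    """Phase 2 (sketch): rank candidate blocks by their mean distance to
    the k nearest recently accessed blocks in feature space; the closest
    (most similar) candidates are prefetched first."""
    ranked = []
    for block_id, feat in candidates.items():
        dists = sorted(math.dist(feat, h) for h in history)
        ranked.append((sum(dists[:k]) / k, block_id))
    ranked.sort()  # smaller mean distance = better prefetch candidate
    return [block_id for _, block_id in ranked]

# Hypothetical usage: features are (access frequency, recency), normalized.
history = [(0.9, 0.1), (0.8, 0.2)]              # recently accessed blocks
candidates = {"blk_1": (0.85, 0.15), "blk_2": (0.1, 0.9)}
order = knn_candidates(history, candidates, k=2)
print(order)                                     # blk_1 ranks first
print(prefetch_trigger(progress=0.7, progress_rate=0.1, fetch_time=4.0))
```

The third phase, placement, would then write each selected block to the cache of the node expected to run the consuming task, preserving data locality.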
References
[1] Apache Hadoop, https://hadoop.apache.org/, last accessed 2021/02/15
[2] Khezr, S. N. & Navimipour, N. J. MapReduce and Its Applications, Challenges, and Architecture: a Comprehensive Review and Directions for Future Research. Journal of Grid Computing 15, 295–321 (2017).
[3] Merceedi, K. J. & Sabry, N. A. A Comprehensive Survey for Hadoop Distributed File System. Asian Journal of Research in Computer Science 46–57 (2021).
[4] T. M. Cover and P. E. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[5] Ghazali, R., Adabi, S., Down, D. G. & Movaghar, A. A classification of hadoop job schedulers based on performance optimization approaches. Cluster Computing 24, 3381–3403 (2021).
[6] Gandomi A, Reshadi M, Movaghar A, Khademzadeh A. HybSMRP: a hybrid scheduling algorithm in Hadoop MapReduce framework. J Big Data (2019).
[7] Ghazali, R., Adabi, S., Rezaee, A., Down, D. G. & Movaghar, A. CLQLMRS: improving cache locality in MapReduce job scheduling using Q-learning. Journal of Cloud Computing 11, (2022).
[8] Luo, Y., Shi, J. & Zhou, S. JeCache: Just-Enough Data Caching with Just-in-Time Prefetching for Big Data Applications. Proceedings - International Conference on Distributed Computing Systems 2405–2410 (2017).
[9] Vinutha, D. C. & Raju, G. T. Data Prefetching for Heterogeneous Hadoop Cluster. 2019 5th International Conference on Advanced Computing and Communication Systems, ICACCS 2019 554–558 (2019).
[10] Lee, J., Kim, K. T. & Youn-Chen, T. MapReduce Performance Scaling Using Data Prefetching, 9, 26–31 (2022).
[11] Kalia, K. et al. Improving MapReduce heterogeneous performance using KNN fair share scheduling. Robotics and Autonomous Systems 157, 104228 (2022).
[12] Dong, B. et al. Correlation-based file prefetching approach for Hadoop. Proceedings - 2nd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2010 41–48 (2010).
[13] Singh, G., Chandra, P. & Tahir, R. A Dynamic Caching Mechanism for Hadoop using Memcached (2012).
[14] Chen, Q., Zhang, D., Guo, M., Deng, Q. & Guo, S.: SAMR: a self-adaptive MapReduce scheduling algorithm in a heterogeneous environment. In: Proceedings—10th IEEE International Conference on Computer and Information Technology, CIT-2010, 7th IEEE International Conference on Embedded Software and Systems, ICESS-2010, ScalCom-2010. pp. 2736–2743 (2010).
[15] Naik, N.S., Negi, A., Sastry, V.N.: Performance improvement of MapReduce framework in heterogeneous context using reinforcement learning. Procedia Comput. Sci. 50, 169–175 (2015)
[16] Kwak, J., Hwang, E., Yoo, T., Nam, B. & Choi, Y. In-memory Caching Orchestration for Hadoop. (2016).
[17] H. Li, H. Jiang, D. Wang, B. Han, An improved KNN algorithm for text classification, Eighth International Conference on Instrumentation & Measurement, Computer, Communication and Control IMCCC, 2018, pp. 1081–1085. (2018)
[18] TULGAR, T., HAYDAR, A. & ERŞAN, İ. A Distributed K Nearest Neighbor Classifier for Big Data. Balkan Journal of Electrical and Computer Engineering 6, 105–111 (2018).
[19] Maillo, J., Triguero, I. & Herrera, F. A MapReduce-Based k-Nearest Neighbor Approach for Big Data Classification. Proceedings - 14th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2015 2, 167–172 (2015).
[20] Huang, S., Huang, J., Dai, J., Xie, T. & Huang, B. The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis (2014).
[21] HiBench, https://github.com/Intel-bigdata/HiBench, last accessed 2023/06/25
License
Copyright (c) 2025 Rana Ghazali, Douglas G. Down

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the original work is properly cited.