The Cutting-Edge Hadoop Distributed File System: Unleashing Optimal Performance

Anish Gupta; P. Santhiya; C. Thiyagarajan; Anurag Gupta; Manish Gupta; Rajendra Kr. Dwivedi

doi:10.4108/eetsis.9027

Authors

Anish Gupta, Chandigarh Engineering College
P. Santhiya, Sathyabama Institute of Science and Technology
C. Thiyagarajan, Panimalar Engineering College, Chennai
Anurag Gupta, ABES Engineering College
Manish Gupta, Madan Mohan Malaviya University of Technology
Rajendra Kr. Dwivedi, Madan Mohan Malaviya University of Technology

DOI:

https://doi.org/10.4108/eetsis.9027

Keywords:

Hadoop, HDFS, DataNode, NameNode, Write Operation, Read Operation

Abstract

Despite the widespread adoption of 1000-node Hadoop clusters by the end of 2022, Hadoop deployments still face several challenges. Hadoop, a key software framework for managing big data, relies on the Hadoop Distributed File System (HDFS), which replicates data for fault tolerance: each data block is duplicated across multiple DataNodes (DNs) to ensure reliability and availability. Replication is effective, but it is inefficient because blocks are transferred through a single pipeline, which wastes time. To address this limitation and improve HDFS performance, a novel approach is proposed that transfers data blocks over multiple pipelines instead of a single one. The approach also incorporates dynamic reliability evaluation: each DN updates its reliability value after each round and sends it to the NameNode (NN), which sorts the DNs by reliability. When a client requests to upload a data block, the NN responds with a list of high-reliability DNs, ensuring high-performance data transfer. The approach has been fully implemented and tested experimentally. The results show significant improvements in HDFS write operations, offering a promising way to overcome the limitations of traditional HDFS implementations. By combining multiple pipelines with dynamic reliability assessment, the approach improves the overall performance and responsiveness of HDFS.
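The reliability-based selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `NameNode` class, the `report` and `choose_targets` methods, the DataNode ids, and the reliability values are all hypothetical, and the way a DN computes its reliability value is left out because the abstract does not specify it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class NameNode:
    """Toy stand-in for the NN: keeps the latest reliability value per DN."""

    reliability: dict[str, float] = field(default_factory=dict)

    def report(self, dn_id: str, value: float) -> None:
        # A DN sends its updated reliability value after each round
        # (how the value is computed is not modelled here).
        self.reliability[dn_id] = value

    def choose_targets(self, replicas: int = 3) -> list[str]:
        # Sort DNs by reliability, highest first; ties are broken by id
        # so the result is deterministic.
        ranked = sorted(self.reliability, key=lambda dn: (-self.reliability[dn], dn))
        return ranked[:replicas]


# Hypothetical cluster of four DataNodes with made-up reliability values.
nn = NameNode()
for dn_id, value in [("dn1", 0.72), ("dn2", 0.95), ("dn3", 0.40), ("dn4", 0.88)]:
    nn.report(dn_id, value)

# A client upload request is answered with the most reliable DNs.
targets = nn.choose_targets(replicas=3)
print(targets)  # ['dn2', 'dn4', 'dn1']
```

In this sketch the ranking is recomputed on every request, which keeps the example short; a real NameNode would more plausibly maintain the ordering incrementally as reports arrive.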

