Implementing Machine Learning on AWS Standalone Cluster
DOI:
https://doi.org/10.4108/v8.9432
Keywords:
AWS, EC2, Logistic Regression, Scikit-learn, Machine Learning, Cloud Computing, Iris Dataset, Hadoop
Abstract
INTRODUCTION: This paper presents the implementation of a machine learning classification pipeline on an Amazon Web Services (AWS) EC2 instance configured as a standalone Apache Hadoop cluster, using the scikit-learn library.
OBJECTIVES: To configure a cloud-based standalone cluster, deploy a Logistic Regression classifier on the Iris dataset, and evaluate its performance using standard metrics.
METHODS: An AWS EC2 t2.micro instance running Ubuntu 24.04 LTS was provisioned and configured as a Hadoop standalone cluster. The Python libraries scikit-learn, pandas, Matplotlib, and Seaborn were installed. The 150-sample Iris dataset was split 80/20 into training and test sets (120 and 30 samples). Logistic Regression was applied with a maximum of 200 iterations.
RESULTS: The model achieved 100% classification accuracy on the test set, with precision, recall, and F1-score of 1.00 for all three Iris species. The confusion matrix confirmed zero misclassifications. Training completed in under two seconds on the free-tier instance.
CONCLUSION: Cloud-based standalone clusters on AWS EC2 provide a cost-effective environment for deploying machine learning workloads. Logistic Regression with scikit-learn reached 100% accuracy on the held-out Iris test split, demonstrating the viability of cloud-hosted ML pipelines for educational and small-scale production use cases.
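The classification step described in METHODS can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the abstract gives only the 80/20 split and the 200-iteration limit, so the random seed, the stratified split, the function name `run_iris_pipeline`, and the use of scikit-learn's bundled copy of the Iris data are all assumptions made here. Results may differ from the reported figures under a different split.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


def run_iris_pipeline(seed=42):
    """Train and evaluate Logistic Regression on Iris (hypothetical helper)."""
    # 150 samples, 4 features, 3 species
    X, y = load_iris(return_X_y=True)

    # 80/20 split as stated in the abstract; the seed and stratification
    # are assumptions, since the abstract does not specify them.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )

    # Maximum of 200 iterations, as stated in the abstract.
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    return {
        "n_train": len(y_train),
        "n_test": len(y_test),
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion": confusion_matrix(y_test, y_pred),
        "report": classification_report(y_test, y_pred),
    }


result = run_iris_pipeline()
print("Test samples:", result["n_test"])
print("Accuracy:", result["accuracy"])
print(result["confusion"])   # rows: true species, columns: predicted species
print(result["report"])      # per-class precision, recall, and F1-score
```

A confusion matrix whose off-diagonal entries are all zero corresponds to the "zero misclassifications" outcome reported in RESULTS; with only 30 test samples, a single misclassified flower would change accuracy by about 3.3 percentage points.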
License
Copyright (c) 2026 Dia Aldeen Farabi

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This is an open access article distributed under the terms of the CC BY-NC-SA 4.0 license, which permits copying, redistributing, remixing, transforming, and building upon the material in any medium, so long as the original work is properly cited.