Combining Lexical, Host, and Content-based features for Phishing Websites detection using Machine Learning Models
DOI:
https://doi.org/10.4108/eetsis.4421
Keywords:
Phishing URLs detection, Machine learning algorithms, Classification, Lexical-based features, Host-based features, Content-based features, Feature selection
Abstract
In the cybersecurity field, identifying and handling threats from malicious websites (for example phishing, spam, and drive-by downloads) is a major concern for the community, and effective detection methods have become a necessity. Recent advances in Machine Learning (ML) have renewed interest in its application to a variety of cybersecurity challenges. To detect phishing URLs, ML models rely on specific attributes, such as lexical, host-based, and content-based features. The main objective of our work is to propose, implement, and evaluate a solution for identifying phishing URLs based on a combination of these feature sets. This paper focuses on using a new balanced dataset, extracting useful features from it, and selecting the optimal features with different feature selection techniques, in order to build four ML models (SVM, Decision Tree, Random Forest, and XGBoost) and compare their performance. The results show that the XGBoost model outperformed the other models, with an accuracy of 95.70% and a false negative rate of 1.94%.
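The two metrics reported in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions, with phishing as the positive class: accuracy is the share of all URLs classified correctly, and the false negative rate is the share of phishing URLs labelled as legitimate. The abstract does not state which definition the authors used, and the function names and toy labels below are hypothetical, not taken from the paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count (tp, tn, fp, fn), treating `positive` as the phishing label."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred != positive:
            tn += 1
        elif truth != positive and pred == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all URLs that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def false_negative_rate(tp, fn):
    """Fraction of phishing URLs that were labelled legitimate (assumed definition)."""
    positives = tp + fn
    return fn / positives if positives else 0.0


# Toy labels for illustration only (1 = phishing, 0 = legitimate).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"accuracy = {accuracy(tp, tn, fp, fn):.2%}")
print(f"false negative rate = {false_negative_rate(tp, fn):.2%}")
```

For phishing detection, the false negative rate is worth reporting next to accuracy because a missed phishing URL reaches the user, whereas a false positive only blocks a legitimate page.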
License
Copyright (c) 2023 Samiya Hamadouche, Ouadjih Boudraa, Mohamed Gasmi
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the original work is properly cited.