Learning to Detect Phishing Web Pages Using Lexical and String Complexity Analysis

Dharmaraj Patil; Tareek Pattewar; Shailendra Pardeshi; Vipul Punjabi; Rajnikant Wagh

doi:10.4108/eai.20-4-2022.173950

Authors

Dharmaraj Patil, SES’s RC Patel Institute of Technology, Shirpur, India
Tareek Pattewar, Vishwakarma University
Shailendra Pardeshi, SES’s RC Patel Institute of Technology, Shirpur, India
Vipul Punjabi, SES’s RC Patel Institute of Technology, Shirpur, India
Rajnikant Wagh, SES’s RC Patel Institute of Technology, Shirpur, India

Keywords:

Phishing detection, Lexical analysis, Entropy, Kolmogorov complexity, Huffman coding complexity, online machine learning, cyber security

Abstract

Phishing is the most common and effective type of attack that cybercriminals use to deceive innocent Web users and steal their sensitive information. Researchers have developed major solutions to this problem in recent years, but a number of open challenges remain because phishing attacks change constantly. To discriminate between benign and phishing URLs, this paper proposes a static method based on lexical and string complexity analysis and distinguishing URL features. The proposed approach is evaluated with two state-of-the-art online learning classifiers. The confidence-weighted learning classifier achieved a phishing URL detection accuracy of 98.35%, an error rate of 1.65%, an FPR of 0.026, and an FNR of 0.005. The adaptive regularization of weights classifier achieved an accuracy of 97.28%, an error rate of 2.72%, an FPR of 0.000, and an FNR of 0.052. Compared with similar approaches, the proposed approach shows improved detection of phishing web pages.
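The string complexity measures named in the keywords (entropy, Kolmogorov complexity, Huffman coding complexity) can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulas, so everything below is an assumption: the function names are hypothetical, Kolmogorov complexity (which is not computable) is approximated by a zlib compression ratio, and Huffman coding complexity is taken as the average Huffman code length per character.

```python
import heapq
import math
import zlib
from collections import Counter


def shannon_entropy(url: str) -> float:
    """Shannon entropy of the URL's character distribution, in bits per character."""
    if not url:
        return 0.0
    n = len(url)
    return -sum((c / n) * math.log2(c / n) for c in Counter(url).values())


def compression_ratio(url: str) -> float:
    """Compressed size divided by raw size (assumed proxy for Kolmogorov complexity)."""
    raw = url.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)


def huffman_bits_per_char(url: str) -> float:
    """Average Huffman code length per character for the URL's own symbol frequencies."""
    counts = Counter(url)
    if not counts:
        return 0.0
    if len(counts) == 1:
        return 1.0  # a single distinct symbol still needs one bit per character
    # The total encoded length equals the sum of all merged node weights
    # produced while building the Huffman tree.
    heap = list(counts.values())
    heapq.heapify(heap)
    total_bits = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total_bits += merged
        heapq.heappush(heap, merged)
    return total_bits / len(url)


def url_complexity_features(url: str) -> dict:
    """Bundle the three illustrative complexity features for one URL."""
    return {
        "entropy": shannon_entropy(url),
        "compression_ratio": compression_ratio(url),
        "huffman_bits_per_char": huffman_bits_per_char(url),
    }


if __name__ == "__main__":
    # Hypothetical example URL, used only for illustration.
    print(url_complexity_features("http://example.com/login?id=a8f3k2z9"))
```

A feature dictionary like this could be combined with lexical features and passed to an online classifier. Note that compression ratios on very short strings are dominated by the compressor's fixed overhead, so they are more informative when compared across URLs of similar length.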
