A malware detection method based on LLM to mine semantics of API

Ronghao Hou; Xiaoping Tian; Guanggang Geng

doi:10.4108/airo.8880

Authors

Ronghao Hou, Jinan University, https://orcid.org/0009-0009-6210-5055
Xiaoping Tian, Beijing Normal University
Guanggang Geng, Jinan University

Keywords:

malware detection, deep learning, feature engineering, API sequence

Abstract

In recent years, large language models (LLMs) have played a growing role in many fields, including network security. Attackers already use LLMs to generate malicious code, write phishing emails, and analyze software for vulnerabilities, which motivates us to use LLMs for defense as well. Previous malware detection research relied heavily on feature engineering by human experts, work that is difficult and resource-intensive because malware is updated frequently. In this paper, we propose a malware detection method based on the intrinsic semantics of APIs. We first design an API intrinsic semantic feature encoder, which uses LLM prompt engineering and sentence embedding to extract intrinsic semantic features from API names and Microsoft's official API definitions. We then design an API co-occurrence feature encoder, which uses word2vec to mine contextual co-occurrence features from API call sequences. The semantic features and co-occurrence features are combined to improve detection performance, and a TCN-GRU network captures dependencies between API calls. Results on several public datasets show that our method outperforms other methods, and an ablation study demonstrates the important role of intrinsic semantics in malware detection.
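The feature fusion step described in the abstract can be illustrated with a small sketch. It is a toy illustration under stated assumptions, not the authors' implementation: a hashed bag-of-words stands in for the LLM-based sentence embedding, windowed neighbour counts stand in for word2vec, the API descriptions are short paraphrases written for this example, and all function names are hypothetical. The TCN-GRU classifier is omitted; the sketch stops at the per-call feature vectors such a sequence model would consume.

```python
import hashlib

# Hypothetical toy inputs: short paraphrased descriptions, not official text.
API_DESCRIPTIONS = {
    "CreateFileW": "creates or opens a file or device",
    "WriteFile": "writes data to a file or device",
    "RegSetValueExW": "sets the data of a registry value",
}

def semantic_vector(text, dim=8):
    """Stand-in for an LLM sentence embedding: hashed bag-of-words."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.sha256(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    return vec

def cooccurrence_vectors(sequences, vocab, window=1):
    """Stand-in for word2vec: count neighbouring APIs within a window."""
    index = {api: i for i, api in enumerate(vocab)}
    counts = {api: [0.0] * len(vocab) for api in vocab}
    for seq in sequences:
        for i, api in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[api][index[seq[j]]] += 1.0
    return counts

def encode_sequence(seq, sem, cooc):
    """Concatenate semantic and co-occurrence features for each API call."""
    return [sem[api] + cooc[api] for api in seq]

# Two made-up API call traces.
vocab = sorted(API_DESCRIPTIONS)
traces = [
    ["CreateFileW", "WriteFile", "WriteFile"],
    ["CreateFileW", "RegSetValueExW"],
]
sem = {api: semantic_vector(API_DESCRIPTIONS[api]) for api in vocab}
cooc = cooccurrence_vectors(traces, vocab)
features = encode_sequence(traces[0], sem, cooc)
print(len(features), len(features[0]))
```

Each call in the trace becomes one vector whose first part depends only on what the API is documented to do and whose second part depends on how the API is used alongside others, which is the distinction the abstract draws between intrinsic semantics and co-occurrence.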
