A malware detection method based on LLM to mine semantics of API
DOI:
https://doi.org/10.4108/airo.8880
Keywords:
malware detection, deep learning, feature engineer, API sequence
Abstract
In recent years, large language models (LLMs) have been applied in a growing number of fields, including network security. Some attackers use LLMs to generate malicious code, write phishing emails, and analyze software vulnerabilities. This inspires us to use LLMs to defend network security as well. In past research on malware detection, much of the feature engineering had to be carried out by experts, and this work is difficult and resource-consuming because malware is updated frequently. In this paper, we propose a malware detection method based on the intrinsic semantics of APIs. The method first designs an API intrinsic semantic feature encoder, which extracts intrinsic semantic features from API names and Microsoft's official API definitions using LLM prompt engineering and sentence embedding techniques. It then designs an API co-occurrence feature encoder, which mines the contextual co-occurrence features of APIs from API call sequences using word2vec. The API semantic features and API co-occurrence features are combined to improve malware detection performance, and a TCN-GRU is used to capture dependencies between API calls. Results on several public datasets show that our method outperforms other methods, and ablation study results demonstrate the important role of intrinsic semantics in malware detection algorithms.
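The feature fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a hashed bag-of-words stands in for the LLM-based sentence embedding, windowed neighbour counts stand in for word2vec, and the TCN-GRU classifier is omitted. All function names, the vector dimension, the window size, and the paraphrased API descriptions are assumptions made for illustration.

```python
import hashlib
from collections import Counter


def semantic_vector(description, dim=8):
    """Stand-in for a sentence embedding: hashed bag-of-words over the text."""
    vec = [0.0] * dim
    for token in description.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec


def cooccurrence_vectors(sequences, vocab, window=2):
    """Stand-in for word2vec: count neighbours inside a context window."""
    counts = {api: Counter() for api in vocab}
    for seq in sequences:
        for i, api in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[api][seq[j]] += 1
    # One row per API, one column per vocabulary entry.
    return {api: [float(counts[api][other]) for other in vocab] for api in vocab}


def encode_sequence(seq, semantic, cooccur):
    """Concatenate semantic and co-occurrence features for each API call."""
    return [semantic[api] + cooccur[api] for api in seq]


# Hypothetical, paraphrased descriptions (not quoted from Microsoft's docs).
descriptions = {
    "CreateFileW": "creates or opens a file or device",
    "WriteFile": "writes data to a file or device",
    "CloseHandle": "closes an open object handle",
}
# Hypothetical API call traces.
sequences = [
    ["CreateFileW", "WriteFile", "CloseHandle"],
    ["CreateFileW", "CloseHandle"],
]

vocab = sorted(descriptions)
semantic = {api: semantic_vector(text) for api, text in descriptions.items()}
cooccur = cooccurrence_vectors(sequences, vocab)
features = encode_sequence(sequences[0], semantic, cooccur)
```

Each element of `features` is one fused vector per API call. In the pipeline the abstract describes, a sequence of such vectors would then be passed to the sequence model.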
License
Copyright (c) 2025 Ronghao Hou, Xiaoping Tian, Guanggang Geng

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the original work is properly cited.
Funding data
Ministry of Industry and Information Technology of the People's Republic of China
Grant numbers TC220H078