Synthetic Malware Using Deep Variational Autoencoders and Generative Adversarial Networks
DOI: https://doi.org/10.4108/eetiot.6566
Keywords: Malware, Synthetic Malware, GAN, VAE
Abstract
The effectiveness of malicious-file detection relies heavily on the quality of the training dataset, particularly its size and authenticity. However, the lack of high-quality training data remains one of the biggest obstacles to the widespread adoption of malware detection based on trained machine learning and deep learning models. In response to this challenge, researchers have made initial strides by employing generative techniques to create synthetic malware samples. This work uses deep variational autoencoders (VAE) and generative adversarial networks (GAN) to produce malware samples as opcode sequences. As validation, machine learning and deep learning classifiers are then used to distinguish the generated opcode sequences from authentic ones. The primary objective of this study was to compare synthetic malware generated using VAE and GAN techniques. The results showed that neither approach could create synthetic malware that deceived machine learning classification. However, the WGAN-GP algorithm showed more promise: more of its synthetic malware samples were required in the training set before they could be detected effectively, indicating that it is the better approach to synthetic malware generation.
Copyright (c) 2024 EAI Endorsed Transactions on Internet of Things
This work is licensed under a Creative Commons Attribution 3.0 Unported License.
This is an open-access article distributed under the terms of the Creative Commons Attribution CC BY 3.0 license, which permits unlimited use, distribution, and reproduction in any medium so long as the original work is properly cited.