Synthetic Malware Using Deep Variational Autoencoders and Generative Adversarial Networks

Authors

  • Aaron Choi, San Jose State University
  • Albert Giang, San Jose State University
  • Sajit Jumani, San Jose State University
  • David Luong, San Jose State University
  • Fabio Di Troia, San Jose State University

DOI:

https://doi.org/10.4108/eetiot.6566

Keywords:

Malware, Synthetic Malware, GAN, VAE

Abstract

The effectiveness of malicious-file detection depends heavily on the quality of the training dataset, particularly its size and authenticity. However, the lack of high-quality training data remains one of the biggest obstacles to the widespread adoption of malware detection based on trained machine learning and deep learning models. In response to this challenge, researchers have made initial strides by employing generative techniques to create synthetic malware samples. This work uses deep variational autoencoders (VAE) and generative adversarial networks (GAN) to produce malware samples as opcode sequences. As validation, machine learning and deep learning classifiers are then trained to distinguish the generated opcode sequences from authentic opcode samples. The primary objective of this study was to compare synthetic malware generated with the VAE and GAN approaches. The results showed that neither approach could create synthetic malware capable of deceiving machine learning classification. However, the WGAN-GP algorithm showed more promise: it required a larger number of synthetic malware samples in the training set before its output could be detected effectively, which makes it the better of the two approaches to synthetic malware generation.

Published

09-07-2024

How to Cite

[1] A. Choi, A. Giang, S. Jumani, D. Luong, and F. Di Troia, “Synthetic Malware Using Deep Variational Autoencoders and Generative Adversarial Networks”, EAI Endorsed Trans IoT, vol. 10, Jul. 2024, doi: 10.4108/eetiot.6566.