A Framework for Utilizing Permutational Multiple Analysis of Variance as a Precursor for Nonparametric Statistical Learning with Cyber Network Data

Authors

Woolman T, Pickard J

DOI:

https://doi.org/10.4108/eetcasa.v9i1.2929

Keywords:

Artificial intelligence, machine learning, Internet of Things, Malware, network flows, multinomial, PERMANOVA, NPMANOVA, multinomial classification

Abstract

INTRODUCTION: Although scientific hypothesis testing methodologies are well established, their application to falsifiable hypothesis testing for assessing causal relationships potentially identified by machine learning and artificial intelligence models is rare due to the primarily nonparametric statistical nature of these systems.

OBJECTIVES: The primary objective of this study is to demonstrate the potential for applying nonparametric statistical tests to a mixed qualitative and quantitative cyber network dataset as a method to pre-assess the feasibility of applying forms of statistical hypothesis testing before a machine learning algorithm models the data.

METHODS: The research data were produced by a mixture of permuted analysis of variance models, augmented by transformed non-Euclidean multivariate distances between curated dependent variable classes. The network flow events came from quasi-experimental data collected in an enclosed laboratory environment, using a monitored, locally unrestricted network into which known Internet of Things (IoT) malware was introduced.
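The permutation approach described in METHODS can be illustrated with a minimal sketch. It is not the authors' code: it assumes a single grouping factor (the study uses several terms added sequentially), synthetic data in place of the network flows, and hypothetical names `pseudo_f` and `permanova`. It computes the distance-based pseudo-F statistic and R² from any square distance matrix, then compares the observed statistic with values obtained by shuffling the class labels.

```python
import numpy as np


def pseudo_f(dist, labels):
    """One-way PERMANOVA pseudo-F and R^2 from a square distance matrix."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    groups = np.unique(labels)
    a = len(groups)
    sq = dist ** 2

    # Total sum of squares: squared distances over all pairs, divided by n.
    ss_total = sq[np.triu_indices(n, k=1)].sum() / n

    # Within-group sum of squares: same quantity computed inside each group.
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = sq[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)

    ss_between = ss_total - ss_within
    f_stat = (ss_between / (a - 1)) / (ss_within / (n - a))
    r2 = ss_between / ss_total
    return f_stat, r2


def permanova(dist, labels, n_perm=200, seed=0):
    """Permutation test: shuffle labels and recompute pseudo-F each time."""
    rng = np.random.default_rng(seed)
    f_obs, r2 = pseudo_f(dist, labels)
    hits = 0
    for _ in range(n_perm):
        f_perm, _ = pseudo_f(dist, rng.permutation(labels))
        if f_perm >= f_obs:
            hits += 1
    # The observed ordering counts as one of the permutations.
    p_value = (hits + 1) / (n_perm + 1)
    return f_obs, r2, p_value


# Synthetic stand-in for two traffic classes (assumption, not study data).
rng = np.random.default_rng(42)
x = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(3, 1, (20, 3))])
labels = np.array(["benign"] * 20 + ["malware"] * 20)

# Euclidean distances are used here for simplicity; any distance matrix works.
dist = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))

f_obs, r2, p_value = permanova(dist, labels, n_perm=200, seed=1)
print(f"pseudo-F={f_obs:.3f}, R2={r2:.3f}, p={p_value:.5f}")
```

Because the statistic depends only on the distance matrix, the same function applies to non-Euclidean distances on mixed data, provided the matrix is symmetric with a zero diagonal.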

RESULTS: A PERMANOVA model was executed against 62,000 records of network flow observations, using Euclidean distance measurements with variable-dependent relationship ordering and 200 permutations, with terms added sequentially (first to last) in the order encountered in the raw network flow dataset. This precursor test produced a p-value of 0.02985 for the sequential-terms PERMANOVA model, with an F value of 0.00017 with which to determine the ratio of explained to unexplained variance. Utilizing an analysis of the F values for all of the residuals, we show 29,998 degrees of freedom with a residual F model score of 0.99983, indicating that there is a strong proportion of explained to unexplained variance across all of the independent variables contained in the model. The model is thus statistically significant, with a p-value below the alpha significance level of 0.05.
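As a worked check on the reported p-value, the sketch below enumerates the p-values that a 200-permutation test can produce. It assumes the common convention p = (k + 1) / (N + 1), where k is the number of permuted statistics at least as large as the observed one and N is the number of permutations; whether the authors' software used exactly this convention is an assumption. The helper name `permutation_p_value` is hypothetical.

```python
from fractions import Fraction


def permutation_p_value(n_extreme, n_perm):
    """Permutation p-value with the observed statistic counted once."""
    return Fraction(n_extreme + 1, n_perm + 1)


n_perm = 200
reported = 0.02985

# Smallest p-value attainable with this many permutations.
smallest = permutation_p_value(0, n_perm)

# Find which counts of extreme permutations reproduce the reported value
# to five decimal places.
k_match = [
    k
    for k in range(n_perm + 1)
    if abs(float(permutation_p_value(k, n_perm)) - reported) < 5e-6
]

print("smallest attainable p:", smallest, "=", float(smallest))
print("counts matching the reported p-value:", k_match)
```

Under that convention, the attainable p-values are multiples of 1/201, so the resolution of the test is set by the number of permutations rather than by the number of records.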

CONCLUSION: This research has demonstrated that it is possible to build tests of falsifiability with reproducible methods into a quasi-experimental design and to apply them to the field of machine learning. Applied to AI/ML (artificial intelligence/machine learning) models, this pre-assessment methodology supports the appropriateness of cyber network flow datasets in which a final test of statistical significance would be required. The authors believe that this represents a substantially useful precursor stage for assessing the suitability and reliability of any nonparametric statistical learning algorithm applied to predictive analytics on cyber network data.

Published

12-07-2023

How to Cite

1. Woolman T, Pickard J. A Framework for Utilizing Permutational Multiple Analysis of Variance as a Precursor for Nonparametric Statistical Learning with Cyber Network Data. EAI Endorsed Trans Context Aware Syst App [Internet]. 2023 Jul. 12 [cited 2024 Dec. 27];9. Available from: https://publications.eai.eu/index.php/casa/article/view/2929