Large data density peak clustering based on sparse auto-encoder and data space meshing via evidence probability distribution

Authors

DOI: https://doi.org/10.4108/eetsis.6758

Keywords: data density clustering, sparse auto-encoder, data space meshing, evidence probability distribution, transfer probability distribution strategy

Abstract

The development of big data analysis technology has brought new opportunities to the production and management of many industries. By mining and analyzing the data generated during enterprise operations, big data technology can reveal the associations hidden within an enterprise and even across an entire industry. As a common method for large-scale statistical data analysis, clustering can effectively mine the relationships within massive, heterogeneous, multidimensional data, classify unlabeled data, and provide data support for various big data analysis models. However, common density-based clustering methods for big data are time-consuming and prone to errors when allocating points by density, which reduces clustering accuracy. We therefore propose a novel large-data density peak clustering algorithm based on a sparse auto-encoder and data space meshing via evidence probability distribution. First, a sparse auto-encoder from deep learning is trained to perform feature extraction and dimensionality reduction on the input high-dimensional data matrix. Second, the data space is meshed to reduce the number of distance calculations between sample points. When computing the local density, not only the density of a grid cell itself but also the densities of its nearest neighboring cells are considered, which reduces the influence of the subjectively chosen truncation distance on the clustering results and improves clustering accuracy. A grid density threshold is set to ensure the stability of the clustering results. Using the K-nearest-neighbor information of the sample points, a transfer probability distribution strategy and an evidence probability distribution strategy are proposed to optimize the allocation of the remaining sample points and avoid joint allocation errors. Experimental results on artificial and real data sets show that the proposed algorithm achieves higher clustering accuracy and better overall clustering performance than other state-of-the-art clustering algorithms.
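To make the grid-based density idea concrete, the following is a minimal, self-contained Python sketch of data space meshing combined with density peak clustering as described above: samples are binned into grid cells, each point's local density combines the count of its own cell with the counts of the adjacent cells, cells below a density threshold are treated as noise, and the remaining points are assigned to the cluster of their nearest higher-density point. The function name, grid size, threshold, and the simple assignment rule are illustrative assumptions only; the paper's sparse auto-encoder stage and its transfer/evidence probability distribution strategies are not reproduced here.

```python
# Illustrative sketch only -- not the authors' implementation.
import numpy as np
from itertools import product


def grid_density_peaks(X, n_cells=8, cell_threshold=1, n_centers=2):
    """Toy grid-based density-peak clustering; returns one label per sample
    (-1 for points discarded by the grid-density threshold)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    # 1) Mesh the data space: map every sample to a grid cell.
    mins, maxs = X.min(axis=0), X.max(axis=0)
    widths = (maxs - mins) / n_cells + 1e-12
    cells = np.clip(((X - mins) / widths).astype(int), 0, n_cells - 1)

    # 2) Count samples per occupied cell.
    counts = {}
    for c in map(tuple, cells):
        counts[c] = counts.get(c, 0) + 1

    # 3) Local density of a sample = count of its own cell plus the counts
    #    of the immediately adjacent cells (the "neighbouring grids").
    offsets = list(product((-1, 0, 1), repeat=d))
    rho = np.array([sum(counts.get(tuple(np.add(c, o)), 0) for o in offsets)
                    for c in map(tuple, cells)], dtype=float)

    # 4) Grid-density threshold: samples in sparse cells are treated as noise.
    keep = np.array([counts[tuple(c)] > cell_threshold for c in cells])
    kept = [i for i in np.argsort(-rho) if keep[i]]

    # 5) Classic density-peak step: delta = distance to the nearest
    #    higher-density point; centres maximise rho * delta.
    delta = np.zeros(n)
    parent = np.full(n, -1)
    for rank, i in enumerate(kept):
        if rank == 0:
            delta[i] = np.inf  # densest point has no denser neighbour
            continue
        prev = kept[:rank]
        dists = np.linalg.norm(X[prev] - X[i], axis=1)
        j = int(np.argmin(dists))
        delta[i], parent[i] = dists[j], prev[j]

    centers = sorted(kept, key=lambda i: -(rho[i] * delta[i]))[:n_centers]
    labels = np.full(n, -1)
    for lab, i in enumerate(centers):
        labels[i] = lab

    # 6) Remaining kept points inherit the label of their nearest
    #    higher-density point (processed in decreasing density order).
    for i in kept:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([0, 0], 0.3, (100, 2)),
                   rng.normal([3, 3], 0.3, (100, 2))])
    labels = grid_density_peaks(X)
    print("cluster sizes:", np.bincount(labels[labels >= 0]))
```

Run on the two synthetic Gaussian blobs in the demo block, the sketch recovers two clusters of roughly 100 points each, with a handful of boundary points in sparse cells labelled -1 as noise; because densities are computed per cell rather than per point pair, the number of distance calculations stays small, which is the motivation for meshing in the proposed method.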


Published

20-11-2024

How to Cite

1. Lu F. Large data density peak clustering based on sparse auto-encoder and data space meshing via evidence probability distribution. EAI Endorsed Scal Inf Syst [Internet]. 2024 Nov. 20 [cited 2024 Dec. 22];12(1). Available from: https://publications.eai.eu/index.php/sis/article/view/6758