Enhancing Document Clustering with Hybrid Recurrent Neural Networks and Autoencoders: A Robust Approach for Effective Semantic Organization of Large Textual Datasets

Authors

  • Ratnam Dodda, Jawaharlal Nehru Technological University Anantapur
  • Suresh Babu Alladi, Jawaharlal Nehru Technological University Anantapur

Keywords:

Document Clustering, Recurrent Neural Network, Autoencoders, Hybrid model, Diverse Datasets

Abstract

This research presents an innovative document clustering method that combines recurrent neural networks (RNNs) with autoencoders: the RNNs capture sequential dependencies in text, while the autoencoders learn improved feature representations. The hybrid model, tested on diverse datasets (20-Newsgroups, Reuters, BBC Sports), outperforms traditional clustering methods, revealing semantic relationships and remaining robust to noise. Preprocessing includes denoising techniques (stemming, lemmatization, tokenization, stopword removal) to ensure a refined dataset. Evaluation metrics (adjusted Rand index, normalized mutual information, completeness, homogeneity, V-measure, accuracy) validate the effectiveness of the model, which provides a powerful solution for organizing and understanding large text datasets.
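The external evaluation metrics named in the abstract correspond directly to scikit-learn's clustering metrics. As a minimal sketch (the helper name `clustering_scores` is ours, and the paper does not state which library it used), the scores can be computed from ground-truth and predicted cluster labels as follows:

```python
from sklearn import metrics

def clustering_scores(true_labels, pred_labels):
    """Score a predicted clustering against ground-truth document labels
    using the external metrics named in the abstract."""
    return {
        "adjusted_rand": metrics.adjusted_rand_score(true_labels, pred_labels),
        "nmi": metrics.normalized_mutual_info_score(true_labels, pred_labels),
        "homogeneity": metrics.homogeneity_score(true_labels, pred_labels),
        "completeness": metrics.completeness_score(true_labels, pred_labels),
        "v_measure": metrics.v_measure_score(true_labels, pred_labels),
    }

# All five metrics are invariant to cluster relabeling, so a perfect
# clustering scores 1.0 even when the cluster ids are permuted.
scores = clustering_scores([0, 0, 1, 1], [1, 1, 0, 0])
```

Accuracy, by contrast, is not label-permutation invariant: computing it requires first matching predicted cluster ids to true classes (e.g. via the Hungarian algorithm) before comparison.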

Published

18-03-2024

How to Cite

[1]
R. Dodda and S. B. Alladi, “Enhancing Document Clustering with Hybrid Recurrent Neural Networks and Autoencoders: A Robust Approach for Effective Semantic Organization of Large Textual Datasets”, EAI Endorsed Trans Int Sys Mach Lear App, vol. 1, Mar. 2024.