Use of Neural Topic Models in conjunction with Word Embeddings to extract meaningful topics from short texts
DOI:
https://doi.org/10.4108/eetiot.v8i3.2263
Keywords:
Neural Topic Models, Pre-training word embedding, Short text, Topic coherence
Abstract
Topic modeling uses unsupervised machine learning to discover latent topics hidden within a large collection of documents. A topic model can help with understanding, organizing, and summarizing large amounts of text, and it can reveal hidden topics that vary across the texts in a corpus. Traditional topic models such as pLSA (probabilistic latent semantic analysis) and LDA (latent Dirichlet allocation) lose performance on short texts because each short text carries little word co-occurrence information. One technique being developed to address this problem, and to make topic modeling on short texts interpretable, is to combine topic models with word embeddings pre-trained on an external corpus (PWE). Recent advances in deep neural networks (DNN) and deep generative models have allowed neural topic models (NTM) to achieve flexibility and efficiency in topic modeling. However, few studies have examined neural topic models with pre-trained word embeddings for producing meaningful topics from short texts. We conducted an extensive study of five NTMs to test whether adding PWE helps generate comprehensible topics, with experiments on Arabic and French datasets of Moroccan news published on Facebook pages. The extracted topics are evaluated with several metrics, including topic coherence and topic diversity. Our results show that the topic coherence of short texts can be significantly improved by using word embeddings pre-trained on an external corpus.
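The abstract names topic coherence and topic diversity as evaluation metrics but does not give formulas. The following is a minimal sketch under assumed, commonly used definitions: topic diversity as the share of unique words among the top words of all topics, and coherence as the average normalized pointwise mutual information (NPMI) of word pairs, estimated from document-level co-occurrence. The paper's exact variants may differ, and the function names and toy corpus are purely illustrative.

```python
import math
from itertools import combinations


def topic_diversity(topics, top_k=3):
    """Share of unique words among the top_k words of every topic (assumed definition)."""
    top_words = [w for topic in topics for w in topic[:top_k]]
    return len(set(top_words)) / len(top_words)


def npmi_coherence(topic, documents):
    """Average NPMI over all word pairs in a topic, using document co-occurrence."""
    doc_sets = [set(d) for d in documents]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n

    scores = []
    for w1, w2 in combinations(topic, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: NPMI lower bound
            continue
        denom = -math.log(p12)
        if denom == 0:
            scores.append(0.0)  # pair occurs in every document; assumed convention
            continue
        scores.append(math.log(p12 / (p1 * p2)) / denom)
    return sum(scores) / len(scores)


# Hypothetical toy corpus of tokenized short texts (not data from the study).
docs = [
    ["election", "vote", "party"],
    ["election", "vote", "minister"],
    ["match", "goal", "team"],
    ["match", "team", "coach"],
]
# Hypothetical topics, each a ranked list of top words.
topics = [
    ["election", "vote", "party"],
    ["match", "team", "goal"],
    ["vote", "team", "minister"],
]

diversity = topic_diversity(topics, top_k=3)
coherences = [npmi_coherence(t, docs) for t in topics]
print(f"topic diversity: {diversity:.3f}")
print("NPMI per topic:", [round(c, 3) for c in coherences])
```

With short texts, many word pairs never co-occur in the same document, so the pair probabilities that NPMI relies on are estimated from very few observations. This sparsity is the problem the abstract attributes to short-text corpora.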
Copyright (c) 2022 EAI Endorsed Transactions on Internet of Things
This work is licensed under a Creative Commons Attribution 3.0 Unported License.
This is an open-access article distributed under the terms of the Creative Commons Attribution CC BY 3.0 license, which permits unlimited use, distribution, and reproduction in any medium so long as the original work is properly cited.