Use of Neural Topic Models in conjunction with Word Embeddings to extract meaningful topics from short texts

Topic modeling is an unsupervised machine learning technique for discovering latent topics hidden within a large collection of documents. A topic model helps with the comprehension, organization, and summarization of large amounts of text, and it can reveal hidden topics that vary across the texts of a corpus. Traditional topic models such as pLSA (probabilistic latent semantic analysis) and LDA (latent Dirichlet allocation) lose performance on short texts because each short text carries little word co-occurrence information. One technique being developed to solve this problem, and to make topic modeling on short texts interpretable, is to combine topic models with pre-trained word embeddings (PWE) learned from an external corpus. Recent advances in deep neural networks (DNN) and deep generative models have allowed neural topic models (NTM) to achieve flexibility and efficiency in topic modeling. However, few studies have examined neural topic models with pre-trained word embeddings for producing meaningful topics from short texts. We carried out an extensive study of five NTMs to test how well additional PWE helps generate comprehensible topics, through experiments on Arabic and French datasets of Moroccan news published on Facebook pages. Several metrics, including topic coherence and topic diversity, are used to evaluate the extracted topics. Our results show that word embeddings trained on an external corpus can significantly improve the topic coherence of short texts.
Article distributed under the terms of the CC BY-NC-SA 4.0 license, which permits copying, redistributing, remixing, transforming, and building upon the material in any medium so long as the original work is properly cited.


Introduction
People are becoming increasingly attached to sharing information through diverse online social platforms such as Twitter, Facebook, and web pages, driven by the fast development of information and communication technologies and extensive internet use. The messages sent via the web and social networks carry vital information about current social trends and situations, people's opinions on various services and products, advertisements, government policy announcements, etc. An effective text processing technique is required to read through this huge number of messages quickly and extract the relevant information. Topic modeling has proven to be an efficient method for the semantic understanding of textual data in traditional natural language processing (NLP). Conventional topic models [1], [2], such as LDA [3] or pLSA [4] and their variants, are extremely effective at uncovering latent semantic structures in unlabeled text and are widely used in topic identification, comment summarization, document classification, and event tracking.
Compared with largely formal texts such as scientific articles or newspapers, messages published on social media are usually short. These short texts share the following major characteristics:
1. A limited number of words per document.
2. The use of unique and informal terminology.
3. Post length restrictions.
4. Word meanings and usage may differ based on the posting.
5. Inappropriate comments (or "spam").

N. Habbat et al.

EAI Endorsed Transactions on Internet of Things
Applying traditional topic models (TTM) to short texts yields poor results because each short document contains little word co-occurrence information, a consequence of the characteristics listed above.
To address the limitations of TTMs on short texts, neural topic models (NTMs) have emerged with promising results. In parallel, significant advances in word embeddings provide an efficient way to capture semantic relations between words from a large text corpus, which can aid the development of models that produce more comprehensible and coherent topics.
The current study investigates simple and computationally efficient techniques to enhance the comprehensibility of topics extracted from real-world short texts using NTMs. The hardest part of topic modeling for short texts is learning context information, and incorporating pre-trained word embeddings into an NTM appears to be among the most effective ways of explicitly enriching that contextual knowledge.
In summary, our contribution is as follows: we examine the effectiveness of different NTMs in terms of the quality of the generated topics, as measured by several topic coherence and topic diversity metrics, with two pre-trained word embedding models.
The next section presents related work on neural topic models (NTM), followed by brief definitions of the neural topic models used. Section 4 presents the simulation experiments and results, and Section 5 summarises this paper.

Related works
The most common neural topic models (NTMs) are based on the variational autoencoder (VAE) [5], a deep generative model, and amortised variational inference (AVI) [6]. The following section describes the basic framework of VAE-based NTMs, in which the inference and generative mechanisms are modelled by neural network-based encoders and decoders, respectively. Inference in NTMs is computationally easier than in traditional Bayesian probabilistic topic models (BPTM), their implementation is simplified by the abundance of advanced deep learning techniques, and they can therefore easily be combined with PWEs to acquire prior knowledge. Diverse VAE-based NTMs have been introduced. Neural Variational Latent Dirichlet Allocation (NVLDA) [7], the Neural Variational Document Model (NVDM) [8], the Dirichlet Variational Autoencoder (DirVAE) [9], the Dirichlet Variational Autoencoder topic model (DVAE) [10], iTM-VAE [11], and the Gaussian Softmax Model (GSM) [12] are a few examples. This list is not exhaustive, and it is still growing.
Only a few researchers have used NTMs instead of conventional topic models to extract meaningful, coherent, and understandable topics from short texts by integrating contextual and semantic information. In [13], [14], an NTM was combined with either a memory network or a recurrent neural network (RNN), and the topics produced by the NTM were used for classification by the memory network or the RNN. In both works, the NTM outperforms conventional topic models in topic coherence and classification performance. Lin et al. [15] used Archimedean copulas to make the distributions of multiple topics in a short text more distinct. Wu et al. [16] proposed a novel NTM with a topic-distribution quantization approach, which yields better topic distributions, together with a negative-sampling decoder that learns to reduce redundant topics. As a consequence, their technique outperforms standard topic models.
Niu et al. [17] merged short texts into longer documents and used document embeddings to create word co-occurrence data. Zhao et al. [18] proposed a variational autoencoder topic model (VAETM) and its supervised variant (SVAETM), which combine embedded representations of entities and words with an external dataset. To improve contextual information, Zhu et al. [19] used a graph neural network as the NTM encoder, which takes a biterm graph of words as input and outputs the topic distribution of the corpus. Feng et al. [20] presented a context-reinforced NTM based on the assumption that each short text covers only a few pertinent topics, with pre-trained word embeddings informing the topic-word distributions.

The proposed model
In this section, we outline the architecture we applied and briefly discuss the neural topic models we employed.

Global architecture
According to our analysis of current research on NTMs for short-text analysis, using auxiliary data from an external corpus is one of the most common and successful ways to deal with short-text sparsity. As shown in Figure 1, we used web scraping (specifically the Requests and BeautifulSoup Python libraries) to collect posts from Facebook pages; those posts were then pre-processed using NLP techniques. For pre-trained word embeddings, we used GloVe [21] and Word2Vec [22]. For topic modelling, we compared five neural topic models, which are defined in the next part.

Neural Topic Models for Analysis
In this section, we briefly describe the neural topic models used in this investigation. The meaning of the notation used to describe the models is given in Table 1. [...] through the words in the vocabulary. They are named auto-encoders because the decoder attempts to reconstruct the input's word distribution. In a VAE, h is sampled from a Gaussian distribution, and θ is obtained by transforming it.

NVDM
To our knowledge, the Neural Variational Document Model, or NVDM [8], is the first VAE-based document model, and it uses a multilayer perceptron encoder. In this model, the sample h drawn from the Gaussian distribution serves as the input to the decoder, and variational inference amounts to minimizing a KL divergence. NVDM is a general VAE, whereas most subsequent NTMs transform h to obtain θ, a vector of topic proportions.
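The two ingredients mentioned above, sampling h from the encoder's Gaussian and the KL term of variational inference, can be sketched in a few lines. This is a minimal illustration in plain Python, assuming a diagonal Gaussian posterior and a standard normal prior; the function names and the example values of `mu` and `log_var` are hypothetical and are not taken from the implementation used in this study.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: h = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


# Hypothetical encoder outputs for a 3-dimensional latent space.
rng = random.Random(0)
mu, log_var = [0.5, -1.0, 0.0], [0.0, -0.5, 0.2]
h = sample_latent(mu, log_var, rng)       # one latent sample
kl = kl_to_standard_normal(mu, log_var)   # regularization term of the loss
```

The KL term is zero only when the posterior equals the prior (zero mean, unit variance) and grows as the encoder's output moves away from it.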

NVLDA
Neural Variational Latent Dirichlet Allocation, or NVLDA [7], is another variant of NVDM; it uses neural variational inference to replicate LDA. Here, the softmax function is used to transform z into θ. The Logistic-Normal distribution, used as a surrogate for the Dirichlet distribution, is the probability distribution obtained by passing Gaussian samples through the softmax function. Furthermore, the decoder is p(x) = softmax(β)·θ. In this topic model, as opposed to NVDM, both the topic proportions and the topic-word distribution are probability distributions. The following is the definition of the Logistic-Normal distribution:
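In its standard form, the Logistic-Normal construction can be written as follows. The symbols μ and σ (the Gaussian parameters) and K (the number of topics) are assumptions introduced here for illustration; the notation in Table 1 may differ.

```latex
z \sim \mathcal{N}\!\left(\mu, \operatorname{diag}(\sigma^{2})\right),
\qquad
\theta = \operatorname{softmax}(z),
\qquad
\theta_k = \frac{\exp(z_k)}{\sum_{k'=1}^{K} \exp(z_{k'})}, \quad k = 1, \dots, K .
```

By construction, every θ_k is positive and the θ_k sum to one, so θ can be read as a vector of topic proportions.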

ProdLDA
Product-of-Experts Latent Dirichlet Allocation, or ProdLDA [7], is an extension of NVLDA in which the decoder is built as a product of experts and the topic-word distribution is left unnormalized.
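The difference between the two decoders can be illustrated with a small sketch. It assumes β is a K-by-V matrix of unnormalized topic-word weights and θ a vector of K topic proportions; it leaves out any additional layers a full implementation may use, and the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def nvlda_decoder(beta, theta):
    """Mixture model: normalize each topic row over the vocabulary, then mix by theta."""
    topics = [softmax(row) for row in beta]  # K rows, each summing to 1
    vocab_size = len(beta[0])
    return [sum(theta[k] * topics[k][v] for k in range(len(beta)))
            for v in range(vocab_size)]


def prodlda_decoder(beta, theta):
    """Product of experts: mix the unnormalized rows first, then apply one softmax."""
    vocab_size = len(beta[0])
    logits = [sum(theta[k] * beta[k][v] for k in range(len(beta)))
              for v in range(vocab_size)]
    return softmax(logits)
```

Both functions return a probability distribution over the vocabulary. They coincide when θ puts all its mass on a single topic and differ otherwise, because the product form multiplies the experts' word preferences instead of averaging them.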

WLDA
Wasserstein Latent Dirichlet Allocation, or WLDA [23], is a topic model based on the Wasserstein auto-encoder (WAE) (Figure 3). Although various probability distributions may be used as the prior on θ, in this work we used the Dirichlet distribution, the most basic choice. WAE supports training based on either a GAN (Generative Adversarial Network) or MMD (Maximum Mean Discrepancy), but WLDA uses MMD because its training loss converges more easily. Note that WAE generates θ directly using the softmax function, so no sampling is needed.
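To make the MMD term concrete, the sketch below estimates the squared MMD between encoder outputs and samples from a Dirichlet prior. It is only an illustration: it substitutes a Gaussian RBF kernel for whichever kernel WLDA uses, and the function names, the bandwidth, and the Dirichlet parameters are assumptions.

```python
import math
import random


def rbf(a, b, bandwidth=1.0):
    """Gaussian RBF kernel (a stand-in kernel chosen for this sketch)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2(xs, ys, kernel=rbf):
    """Unbiased estimate of the squared MMD between two samples."""
    n, m = len(xs), len(ys)
    k_xx = sum(kernel(xs[i], xs[j])
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    k_yy = sum(kernel(ys[i], ys[j])
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_xy = sum(kernel(x, y) for x in xs for y in ys) / (n * m)
    return k_xx + k_yy - 2.0 * k_xy


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]
```

In training, `xs` would hold the θ vectors produced by the encoder for a batch and `ys` an equally sized batch drawn with `sample_dirichlet`; minimizing the estimate pulls the distribution of encoded θ toward the prior.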

NSTM
Like WLDA, the Neural Sinkhorn Topic Model, or NSTM [24], is trained using optimal transport [42]. Assuming that q maps x into a low-dimensional latent space while preserving sufficient information about x, the Sinkhorn algorithm computes the optimal transport distance between x and θ. The loss function is the sum of the negative log-likelihood and the optimal transport distance.
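The Sinkhorn iterations themselves are short. The sketch below computes an entropy-regularized transport cost between two discrete distributions, assuming x has been normalized to a distribution over V words, θ is a distribution over K topics, and a V-by-K cost matrix is supplied by the caller. The function name, the regularization strength `epsilon`, and the iteration count are hypothetical choices.

```python
import math


def sinkhorn_distance(a, b, cost, epsilon=0.1, n_iters=200):
    """Entropy-regularized optimal transport cost between histograms a and b.

    a: source distribution (length n), b: target distribution (length m),
    cost: n-by-m matrix of transport costs. Returns (distance, plan).
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)]
              for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    distance = sum(plan[i][j] * cost[i][j]
                   for i in range(n) for j in range(m))
    return distance, plan
```

The returned plan has column sums equal to `b` and, after convergence, row sums close to `a`. A smaller `epsilon` approximates the unregularized optimal transport distance more closely but needs more iterations and can underflow for large costs.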

Results from Simulation Experiments
This part presents the datasets, the evaluation metrics used in this work, and the results.

Datasets
We focused our analysis on Arabic and French Facebook posts published by Moroccan news pages, selecting four Facebook pages (Le Matin.ma, L'Economiste, Hespress, and Medi1TV). We collected approximately 81 000 posts written between November 11, 2021, and April 28, 2022 (details are shown in Table 2).

Evaluation metrics
Multiple measures, following two main directions, have been introduced to assess the quality of the top-N words of each topic. The first determines whether the meanings of the top-N words are coherent with one another, which is known as topic coherence (TC). The second assesses the topic diversity (TD), or topic uniqueness, of the top-N words across pairs of topics. For topic coherence, we used the NPMI and WETC metrics described below.

NPMI
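NPMI, or Normalized Pointwise Mutual Information, measures the semantic coherence of a group of words. It is regarded as having the strongest correlation with human evaluations and is calculated from the following equation:

$$\mathrm{NPMI}(m) = \frac{1}{N(N-1)} \sum_{j=2}^{N} \sum_{i=1}^{j-1} \frac{\log \dfrac{P(m_i, m_j)}{P(m_i)\,P(m_j)}}{-\log P(m_i, m_j)}$$

where m is the list of the top N words for a given topic, P(m_i) is the probability of occurrence of word m_i, and P(m_i, m_j) is the probability that m_i and m_j occur together.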
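An NPMI score for one topic can be sketched as follows. The sketch assumes that probabilities are estimated as document frequencies in a reference corpus and that the pair scores are summed over i < j and divided by N(N-1); both are assumptions, and other implementations average over the number of pairs instead. The function name is hypothetical.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """NPMI of a topic's top words, with probabilities from document frequencies."""
    docs = [set(doc) for doc in documents]
    n_docs = len(docs)

    def prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in d for w in words) for d in docs) / n_docs

    n = len(top_words)
    total = 0.0
    for w_i, w_j in combinations(top_words, 2):
        p_i, p_j, p_ij = prob(w_i), prob(w_j), prob(w_i, w_j)
        if p_ij == 0.0:
            score = -1.0  # words never co-occur: limiting value of NPMI
        elif p_ij == 1.0:
            score = 0.0   # degenerate 0/0 case, set to 0 by assumption
        else:
            score = math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)
        total += score
    return total / (n * (n - 1))
```

Under this normalization, a topic whose words always appear together scores 0.5 and a topic whose words never co-occur scores -0.5; dividing by the number of pairs instead rescales the range to [-1, 1].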

WETC
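Word Embeddings Topic Coherence, or WETC, denotes topic coherence based on word embeddings. The pair-wise WETC for a specific topic k is computed as:

$$\mathrm{WETC}_{PW}\left(E^{(k)}\right) = \frac{1}{N(N-1)} \sum_{j=2}^{N} \sum_{i=1}^{j-1} \left\langle E^{(k)}_{i,:},\, E^{(k)}_{j,:} \right\rangle$$

where $\langle \cdot, \cdot \rangle$ represents the inner product and $E^{(k)}$ is the sequence of Word2vec/GloVe word embedding vectors corresponding to the top N words of topic k. Pretrained Word2vec/GloVe weights were used to calculate the WETC score. $E^{(k)}_{i,:}$ denotes the i-th vector of this sequence, and all vectors are normalized so that $\lVert E^{(k)}_{i,:} \rVert = 1$. N is set to 10.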
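A pair-wise embedding coherence of this kind can be sketched in a few lines, assuming the embedding vectors of a topic's top words have already been looked up in a pre-trained model. The function names are hypothetical, and the N(N-1) normalization over pairs i < j is an assumption of this sketch.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def wetc_pairwise(embeddings):
    """Sum of inner products over word pairs i < j, divided by N(N-1)."""
    unit = [normalize(v) for v in embeddings]
    n = len(unit)
    total = sum(sum(a * b for a, b in zip(unit[i], unit[j]))
                for j in range(1, n) for i in range(j))
    return total / (n * (n - 1))
```

Because the vectors are normalized, each inner product is a cosine similarity, so topics whose top words lie close together in the embedding space obtain a higher score.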

Topic Diversity (TD)
Topic diversity [25] is the proportion of distinct words among the top 25 words of all topics. A TD near 0 indicates redundant topics, while a TD near 1 reflects more diverse topics.
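Topic diversity is simple to compute once each topic is available as a ranked word list. The sketch below assumes topics are given as lists of words sorted by decreasing weight; the function name and the `top_k` parameter are hypothetical.

```python
def topic_diversity(topics, top_k=25):
    """Share of distinct words among the top_k words of all topics."""
    top_words = [word for topic in topics for word in topic[:top_k]]
    return len(set(top_words)) / len(top_words)
```

For example, two topics whose top two words are ("a", "b") and ("a", "c") share one word, so three of the four listed words are distinct and `topic_diversity(..., top_k=2)` gives 0.75.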
We also employed an additional metric, inverted rank-biased overlap, or InvertedRBO [26], which assesses the disjointness between topics based on their top-N words, weighted according to word rankings. The higher this metric, the better.

Results analysis
The simulation experiments were conducted on several datasets, and the performance of the topic models was assessed using topic coherence and topic diversity metrics. Tables 3 and 4 show the detailed results of the WETC topic coherence metric and the topic diversity metrics, respectively, for the various NTMs and datasets. The values in bold are the best performances. As word embedding models, we used GloVe and Word2vec. According to the findings, the TC metric differs significantly depending on the word embedding model used. This finding indicates that the quality of the word embeddings may have a substantial effect on topic model training. Figure 6 shows the word cloud of the six relevant topics generated using ProdLDA + GloVe on the French datasets.

Summary and future work
Short-text data are becoming more and more common in the real world through social networking sites, and understanding what these short messages mean is becoming more important every day. Because short texts are much shorter than regular documents, they carry less information about how words co-occur. This makes it hard for popular topic modeling techniques to produce topics that are coherent and easy to understand.
Using pre-trained word embeddings in neural topic models is an effective way to rapidly enhance the quality of the generated topics, as evaluated by topic diversity and topic coherence. In our experiments, ProdLDA with GloVe as the PWE gave the best performance and the most coherent topics.
As a future direction, we aim to investigate the use of contextualized word embeddings based on transformers, such as BERT, with neural topic models.