Word Embedding for Text Classification: Efficient CNN and Bi-GRU Fusion Multi Attention Mechanism

Y. Salini
EAI Endorsed Transactions on Scalable Information Systems

Introduction
Kim [1] used Convolutional Neural Networks (CNNs) for sentence-level classification, feeding pre-trained word vectors into an English classification model. Similarly, Xing et al. [2] employed CNNs to tackle opinion polarity on Twitter. Although CNNs have significantly advanced text categorization, they favor local features over word context, which can reduce classification accuracy. To overcome this limitation, [3] developed CNN-based capsule networks with dynamic routing, which outperformed traditional CNNs in classification tasks. [4] suggested Recurrent Neural Networks (RNNs) for text categorization, which consider the text's word structure. However, RNNs are susceptible to gradient dispersion. Long Short-Term Memory (LSTM) can address this problem but is computationally expensive, memory-intensive, and relies heavily on previous knowledge. Bidirectional Gated Recurrent Units (BiGRU) combine the advantages of Gated Recurrent Units (GRUs) and bidirectional LSTMs, compressing the BiLSTM structure into two gates (update and reset) and addressing the problem of semantic information. BiGRUs consider the meaning of words in context and can converge faster because they have fewer parameters.
RNNs have been recommended for their capacity to closely correlate the feature terms that come before and after a given position in the text, but on lengthy text sequences they run into vanishing and exploding gradients. [5] proposed the Long Short-Term Memory (LSTM) network in 1997 as a solution to this problem. This network consists of three gates designed to improve the processing of long texts and text levels. [6] later introduced a bidirectional LSTM model for text categorization, while [7] created the GRU as a more efficient alternative to LSTM. In this work, the GRU collects contextual semantic information about the text's feature words and mitigates the adverse effects of the CNN's inability to extract context features.
The attention mechanism was first utilized in natural language processing by [8]. Wu proposed a method for analyzing the emotional content of Chinese text that applied a self-attention and BiLSTM model to word vectors. Li expanded on Wu's work by proposing a Self-Attention+Bi-LSTM+CNN model that used sentence vectors. Later, Li introduced the BiGRU and attention mechanism to model sentences and documents hierarchically. Drawing on Li's theoretical framework, this study uses self-attention to sharpen the focus on salient information.
The proposed model uses the following layers:
• The CNN layer is used to extract local features from the word vectors.
• The BiGRU layer is used to capture long-term dependencies between words.
• The attention mechanism is used to focus on the most important words in the text.
• A softmax layer is used to classify the text into one of the target classes.
To address the existing issues in sentiment classification of short texts, this study puts forward a text classification model that combines CNN and BiGRU through feature fusion. Building on prior research, the model also incorporates a multi-attention mechanism.
The proposed model is made up of two sub-models, CNN and BiGRU. To determine the emotional polarity of the target keyword in the phrase, the CNN model incorporates the target-specific sentiment classification approach [10]. After the BiGRU evaluates the emotional polarity at the sentence level, the two features are combined into a fused global feature vector.

PROPOSED METHOD
This section presents an overview of the proposed architecture, which comprises the input, embedding, recurrent, output, and classification (softmax) layers [11]. Embedded words serve as inputs, and a GRU layer extracts the lexical characteristics. A classifier layer that performs the final classification concludes the architecture, as shown in Figure 1.

Embedding layer
Word embedding represents words as distributed vectors, which can be trained with deep learning techniques on large data sets to capture contextual meaning. One-hot representations, by contrast, suffer from the curse of dimensionality and are not suitable for deep neural networks [12]. Previous research has shown that neural networks can achieve better results with unsupervised pre-training. In the proposed architecture, the sizes of both the GRU hidden layer and the word embedding are fixed in advance. We initialized all word embeddings with the 200-dimensional GloVe word vectors pretrained by [13], which are considered state-of-the-art for a variety of NLP applications. The GloVe model attempts to categorize the target word by providing inputs to a neural network.
In our architecture, an embedding layer transforms the input words into lower-dimensional vectors and refines them through backpropagation. This embedding layer is essentially the first layer of our recurrent network, with the word embeddings serving as its weights. Specifically, the relationship between two words' co-occurrence probabilities is encoded in their difference vector, as represented in Figure 2.

CNN-Attention
Initially, the word vector matrix of the text is fed into the convolution layer. The convolution kernel is applied to the incoming text vector matrix by taking dot products, and the activation function then processes the convolution result to produce the layer's output features.
The convolution operates between a kernel matrix and a matrix of text word vectors, each defined by its width and height. The output features of the convolutional layer are then passed through a pooling layer, which condenses the output by extracting its most essential features, as represented in Figure 3.
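The convolution and pooling steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `conv_feature_map` and `max_over_time`, the ReLU activation, the use of max pooling, and the toy matrices are all assumptions chosen for the example.

```python
def conv_feature_map(text_matrix, kernel, bias=0.0):
    """Slide a (k x d) kernel down an (n x d) text matrix, one window per word offset."""
    n, k = len(text_matrix), len(kernel)
    feature_map = []
    for start in range(n - k + 1):
        total = bias
        for offset in range(k):
            row = text_matrix[start + offset]
            # Dot product between one word vector and one kernel row.
            total += sum(x * w for x, w in zip(row, kernel[offset]))
        # ReLU is an assumed activation; the text does not name one for this layer.
        feature_map.append(max(0.0, total))
    return feature_map


def max_over_time(feature_map):
    """Pooling step (assumed to be max pooling): keep the strongest response."""
    return max(feature_map)


# Toy input: 4 words with 2-dimensional vectors, and one kernel spanning 2 words.
text_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, -1.0], [0.5, 0.5]]

feature_map = conv_feature_map(text_matrix, kernel)
pooled = max_over_time(feature_map)
```

A kernel that spans k words yields n - k + 1 responses for a text of n words, and pooling reduces them to a single feature per kernel.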
The attention computation proceeds as follows. $h_i$ denotes the word vector generated by the bidirectional GRU, and $X_i$ denotes the word vector produced by the tanh activation function. The average of the word vectors is then used to produce the word vector encoding $u_i$, where $W_u$ represents the attention weight of the word vector in the sequence vector $s = h_1, h_2, \ldots, h_n$. A randomly initialized vector $W_u$ is used for this purpose.
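One common way to realize this kind of attention is to pass each hidden vector through a tanh projection, score it against a context vector, normalize the scores with softmax, and take the weighted sum. The sketch below follows that pattern as an assumption; the paper's exact equations are not available here, and the names `attention_pool`, `projection`, and `context` (standing in for the randomly initialized $W_u$) are hypothetical.

```python
import math


def attention_pool(hidden_states, projection, bias, context):
    """Return (attention weights, weighted sum of hidden states)."""
    scores = []
    for h in hidden_states:
        # u_i = tanh(W h_i + b): hidden representation of one word.
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(projection, bias)]
        # Score u_i against the context vector (assumed role of W_u).
        scores.append(sum(c * v for c, v in zip(context, u)))
    # Numerically stable softmax over the sequence positions.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(a * h[j] for a, h in zip(weights, hidden_states))
              for j in range(dim)]
    return weights, pooled


# Toy sequence of three 2-dimensional hidden vectors.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
projection = [[1.0, 0.0], [0.0, 1.0]]  # identity, for readability only
bias = [0.0, 0.0]
context = [1.0, 0.0]                   # in practice initialized randomly and trained

weights, pooled = attention_pool(hidden_states, projection, bias, context)
```

In this toy setup the context vector only looks at the first dimension, so the first and third words receive equal weight and the second word receives less.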

GRU-Attention
The GRU-Attention layer takes pre-trained word vectors as input and uses the attention mechanism, together with the GRU, to compute feature vectors with associated weight values [14]. A section of the GRU-Attention structure is illustrated in Figure 4.
The bidirectional GRU proceeds through the three stages described in equations (8), (9), and (10), after which the GRU network produces a global feature that includes the feature word vectors along with their weight scores, as shown in equation (10).
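For readers unfamiliar with the bidirectional GRU, the sketch below shows a standard GRU cell with update and reset gates, run once forward and once backward, with the two hidden states concatenated per word. It is an illustration under assumptions and may differ from the paper's equations (8) to (10): the parameter names (`Wz`, `Uz`, and so on), the helper `make_params` with its constant weights, and the gate convention `h = (1 - z) * h_prev + z * candidate` are all choices made for the example (some libraries swap the roles of `z` and `1 - z`).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def make_params(input_size, hidden_size, value):
    """Hypothetical constant initialisation, used only to keep the demo small."""
    w = [[value] * input_size for _ in range(hidden_size)]
    u = [[value] * hidden_size for _ in range(hidden_size)]
    b = [0.0] * hidden_size
    return {"Wz": w, "Uz": u, "bz": b, "Wr": w, "Ur": u, "br": b,
            "Wh": w, "Uh": u, "bh": b}


def gru_step(x, h_prev, p):
    # Update gate z and reset gate r.
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev), p["bz"])]
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev), p["br"])]
    # Candidate state uses the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b + c) for a, b, c in
            zip(matvec(p["Wh"], x), matvec(p["Uh"], gated), p["bh"])]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


def run_gru(sequence, p, hidden_size):
    h = [0.0] * hidden_size
    states = []
    for x in sequence:
        h = gru_step(x, h, p)
        states.append(h)
    return states


def bigru(sequence, forward_p, backward_p, hidden_size):
    forward = run_gru(sequence, forward_p, hidden_size)
    # Run over the reversed sequence, then restore the original word order.
    backward = run_gru(sequence[::-1], backward_p, hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]


sequence = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 1.0, 0.0]]
states = bigru(sequence, make_params(3, 2, 0.1), make_params(3, 2, -0.1), 2)
```

Each entry of `states` holds the forward and backward hidden states for one word, so its length is twice the hidden size; these per-word vectors are what an attention layer would then weight.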

Fully connected layer
The input vector for the fully connected layer is created by combining the GRU vector $h$ with the output vector $v$ obtained from the CNN-attention calculation. The resulting vector $F$ is then computed through attention computation [15] and serves as the final output of the model. In the hidden-layer result, $W_c$ and $b_c$ stand for the weight matrix and bias vector of the fully connected layer, respectively; likewise, $W_H$ stands for the fully connected layer's weight matrix and $b_H$ for its bias vector. The softmax layer receives its input from the fully connected layer. The model's parameters are continually updated by computing the loss function, with the aim of achieving optimal training performance.
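The fusion and classification steps can be illustrated with a small sketch: the GRU feature and the CNN-attention feature are concatenated, passed through one dense layer, and normalized with softmax. Several details are assumptions rather than the paper's method: plain concatenation as the fusion, a single dense layer, cross-entropy as the loss (the text says only "loss function"), and all numeric values.

```python
import math


def dense(vector, weights, bias):
    """Affine map W x + b for one input vector."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Assumed loss: negative log-probability of the correct class."""
    return -math.log(probs[target_index])


h = [0.2, -0.1]  # toy GRU feature vector
v = [0.5]        # toy CNN-attention feature vector
fused = h + v    # fusion by concatenation (an assumption)

W_c = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 1.0]]  # 2 classes x 3 fused features, illustrative values
b_c = [0.0, 0.0]

logits = dense(fused, W_c, b_c)
probs = softmax(logits)
loss = cross_entropy(probs, target_index=1)
```

During training, the gradient of this loss with respect to `W_c`, `b_c`, and the upstream layers would drive the parameter updates mentioned above.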

Data set
THUCNews Data set: The THUCNews data set is a Chinese text classification data set consisting of news articles collected from the Chinese internet. It covers a wide range of categories such as politics, finance, sports, entertainment, and technology. Each article is labeled with a specific category, allowing for supervised learning tasks. The data set is significant in size, containing tens of thousands to hundreds of thousands of articles, making it suitable for training and evaluating models for Chinese text analysis tasks.
IMDB Data set: The IMDB data set is a popular collection of English text used for sentiment analysis. It comprises movie reviews sourced from the Internet Movie Database (IMDB). Each review is assigned a binary sentiment label, indicating whether it is positive or negative. The data set is well balanced, containing an equal number of positive and negative reviews. It is a valuable resource for training and evaluating sentiment analysis models, as it offers a wide range of opinions and sentiments expressed in movie reviews. Refer to Table 1 for an overview of the data sets.

Experimental hyperparameters
Experimental hyperparameters are the settings chosen for a model's parameters during experimentation. They are not learned from the data but are set manually by the practitioner or researcher to optimize the model's performance. The values used are listed in Table 2.

Model evaluation standard
Model evaluation standards are the metrics used to assess the performance of machine learning models. They provide numerical criteria for comparing models, selecting the most suitable model for a task, and making informed decisions about deployment. The frequently used metrics are described below. Table 3 presents the confusion matrix from [16].
1. Accuracy: Accuracy measures how often the model's predictions are correct. It is the number of correctly classified instances divided by the total number of instances. It is a simple measure, but its usefulness is limited when the data set exhibits class imbalance, i.e., when the classes are unevenly distributed.
2. Precision and Recall: Precision and recall are standard metrics for evaluating binary classification. Precision is the ratio of true positive predictions to the total number of instances predicted as positive. Recall is the ratio of true positive predictions to the total number of actual positive instances. These metrics are useful when the positive and negative classes are imbalanced in the data set.
3. F1-Score: The F1-score is the harmonic mean of precision and recall and thus accounts for both. It is considered a balanced measure of a model's effectiveness and is frequently used when precision and recall are equally important.
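The four metrics above follow directly from the counts in a binary confusion matrix. The sketch below computes them; the function name `classification_metrics` and the example counts are illustrative assumptions, not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these counts, accuracy is 70/100, precision is 40/50, and recall is 40/60; the zero-denominator guards return 0.0 when a class is never predicted or never present.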

Model comparison and analysis
Model comparison and analysis refers to evaluating and contrasting the efficacy of distinct machine learning models for a specific task.

CONCLUSION
The purpose of this research is to improve classification performance by developing a text classification model based on the CNN+LSTM technique [17], [18], [19]. To reduce training time while preserving the model's capacity for accurate classification, the model uses a GRU structure rather than LSTM units. In addition, the attention mechanism was included to improve the model's classification accuracy by centering attention on the words that had the greatest influence on the categorization effect.
[9] presented a recurrent convolutional neural network that integrates a Bi-directional Recurrent Neural Network (Bi-RNN) and a Convolutional Neural Network (CNN). Zhang, She, and Li proposed hybrid models that combine Long Short-Term Memory (LSTM) with CNN, CNN with LSTM, and bidirectional LSTM with CNN, respectively. Wang proposed the attention-based bidirectional LSTM with a convolution layer (AC-BiLSTM), which comprises a convolution layer, a bidirectional LSTM (BiLSTM), and an attention mechanism. Luo proposed the use of Latent Dirichlet Allocation (LDA) as a method for text representation. Ma proposed a text classification model that integrates a CNN and a Bidirectional Gated Recurrent Unit (BiGRU) with a fusion attention mechanism. The model demonstrated promising results on standard data sets. The fusion network model proposed for text categorization consists of the following components:
• Word embedding technology is used to convert the feature words of the text into a matrix of word vectors.

Figure 1. The proposed model architecture

In the given sentence sequence $w_1^d, w_2^d, \ldots, w_n^d$, $n$ represents the length of each text sequence and $d$ the dimension of the word vector. The input size of this layer is $l \times b$, which refers to the text matrix, and the output size is $b \times l \times d$, representing the word vector matrix. Here, $b$ is the number of text batches, $l$ the fixed length of the text, and $d$ the dimension of the word vector.
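The shape bookkeeping for the embedding layer can be made concrete with a small lookup example. This is a sketch under assumptions: the helper names `pad_or_truncate` and `embed_batch`, the padding id 0, the batch-major ordering of the ids, and the tiny 3-dimensional table are all hypothetical; a real setup would load pretrained vectors instead.

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Force every text to the fixed length l by padding or cutting."""
    return (token_ids + [pad_id] * length)[:length]


def embed_batch(batch_ids, embedding_table):
    """Map a (b x l) batch of token ids to a (b x l x d) nested list of vectors."""
    return [[list(embedding_table[i]) for i in ids] for ids in batch_ids]


# Toy vocabulary of 4 tokens with d = 3; row 0 is the padding vector.
embedding_table = [
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

fixed_length = 4                       # l
raw_batch = [[1, 2], [3, 1, 2, 2, 1]]  # b = 2 texts of different lengths
batch_ids = [pad_or_truncate(ids, fixed_length) for ids in raw_batch]
embedded = embed_batch(batch_ids, embedding_table)
```

Here the short text is padded with id 0 and the long one is cut to four tokens, so `embedded` has shape 2 x 4 x 3, matching the $b \times l \times d$ output described above.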

Figure 4. The structure of Bi-GRU attention

The present study compares the model presented in this paper with recently proposed models in order to establish its superiority. Four models are selected for comparative analysis, namely CNN-BiGRU, CNN-LSTM-Attention (CLA), LSTM-CNN-Attention (LCA), and CNN+LSTM, as shown in Figures 5 and 6. According to the experimental results, the proposed model shows varying degrees of improvement in accuracy, recall, and F1 value compared with the CLA, LCA, CNN-BiGRU, and LSTM+CNN models [23].

Table 1. Dataset information

Table 2. Model parameters