A Hybrid Deep Learning GRU based Approach for Text Classification using Word Embedding

Text categorization has become an increasingly important issue for businesses that handle massive volumes of data generated online, and it has found substantial use in the field of NLP. The capacity to group texts into separate categories is crucial for users to effectively retain and utilize important information. Through this study, we aim to improve upon existing recurrent neural network (RNN) techniques for text classification by developing a deep learning strategy. The main difficulty in text classification is raising the quality of the classifications, because overall efficacy is often hampered by the inadequate context sensitivity of the data semantics. To address this difficulty, our study presents a unified approach for examining the effects of word embedding and the gated recurrent unit (GRU) on text classification. We use the TREC standard dataset. RCNN has four convolution layers, four LSTM layers, and two GRU layers; RNN, on the other hand, has four GRU layers and four LSTM layers. The GRU is a kind of RNN that is well known for its ability to model sequential data. In our tests, we found that words with comparable meanings are typically located near each other in the embedding space. The experimental findings demonstrate that our hybrid GRU model is capable of efficiently learning word usage patterns from the provided training set. The depth and breadth of the training data greatly influence the model's effectiveness. Our suggested method performs well when compared with other well-known recurrent algorithms such as RNN, MV-RNN, and LSTM on a single benchmark dataset. The proposed model achieves an F-measure of 0.982, compared with 0.952 for the hybrid GRU.
We compared the performance of the proposed method with that of three popular recurrent neural network designs (RNNs, MV-RNNs, and LSTMs) and found that the new method achieved better results on a single benchmark dataset, in terms of both accuracy and error rate.


Introduction
Rapid progress in the social and technical information domain, coupled with the exponential growth of digital text, heralds the arrival of the age of massive text data [1]. Text classification has promising future uses in information retrieval, digital libraries, and other areas [2]. The ability to effectively organize and leverage these massive textual datasets is therefore crucial. Automatic text classification (TC) has become one of the most important issues for large companies that need to handle huge amounts of data, and it is an essential tool for managing the massive amounts of data found on the internet. In recent years, deep learning (DL) algorithms have attracted a lot of interest [3] because of their ability to learn layered, hierarchical representations of high-dimensional data. Text categorization, however, has not received as much attention as pattern recognition [4], sentiment analysis [5], and computer vision (CV) [6]. Because of the curse of dimensionality, data sparsity, and other difficulties, traditional text representation has become a limiting factor in the efficiency of many NLP operations.
Many novel sentiment analysis methods have emerged with the rise of deep learning technologies. Word and phrase meanings can be learned from the abundant unlabelled textual data that is readily available. Word2vec [1] does this by learning word embeddings from unlabelled text samples. It learns by predicting a word from its context (CBOW) and by predicting the context from a specific word (skip-gram). These word embeddings are used in current methods, such as TF-IDF and others, to create dictionaries and to reduce dimensionality. Additional methods, such as the recursive neural tensor network (RNTN) [7], have been proposed for capturing representations at the phrase level. In addition to their success in image classification, CNNs have been demonstrated to be useful in text categorization [8].
The fundamental issue is that sentences in natural language tend to be of varying lengths. Fixing the size of the context window helps, but it does not help with extracting semantics that go beyond the window's scope. Although recurrent neural networks can handle text sequences of varying lengths, they are notoriously difficult to train. This led to the use of RNN variants such as LSTM and GRU. LSTM has attracted wide attention in numerous NLP tasks, including sentiment analysis, translation, and sequence generation, since it was proposed in 1997 by Hochreiter et al. [9]. The concept of GRUs was first presented in 2014 by K. Cho [10]. GRUs are likely more practical because of their simpler form. In this paper, we attempt to demonstrate their benefits over LSTM when applied to sentiment analysis. TF-IDF, Word2Vec, k-means terms, and ensemble models are only some of the methods that have been applied by other researchers. It was found that GRU performed better than any of the other single models, and that this result was further improved by employing an ensemble model. Figure 1 shows the basic architecture of query-based text classification.

Figure 1. Basic architecture of query-based text classification
The significant contributions of the proposed work are outlined below.
In this research, we test a hybrid GRU network's ability to classify texts by using a pre-trained word embedding strategy (GloVe). Using a GRU network allowed us to circumvent the exploding or vanishing gradient issue that plagues regular RNNs. Experiments on the TREC benchmark dataset compare the hybrid GRU model's performance with that of the RNN, the Matrix-Vector Recurrent Neural Network (MV-RNN), and the LSTM.
For text classification, GRUs proved beneficial due to their ability to remember long-term dependencies and capture the meanings among words. GRU approaches perform particularly well on sequence data. The experimental findings demonstrate that the hybrid GRU model (our method) provides superior accuracy and error rate.
The hybrid model is superior to the individual components since it combines the advantages of GRU, RNN, and LSTM to classify text data. Many applications, from sentiment analysis to content recommendation, stand to profit from the accuracy boost.
The method incorporates word embedding techniques, which improve the model's capacity to interpret and represent words in the context of written texts. This addition aids the capture of semantic information.
The approach's strength comes from the fact that it can be adapted to noisy or unstructured text data. This is of paramount importance in contexts where user-generated material, social media, or other noisy data sources are used.
For text categorization, we employ the TREC dataset in conjunction with our hybrid model (GRU-LSTM-RNN). We compared the performance of the proposed method with that of three popular recurrent neural network designs (RNNs, MV-RNNs, and LSTMs) and found that the new method achieved better results on a single benchmark dataset, in terms of both accuracy and error rate.

Literature survey
Classification is one of the most well-known and widely used NLP applications. Text mining involves more than just classifying documents; it also includes recognizing themes, filtering out spam, labeling semantic roles, and more. Natural language processing is one area where deep learning architectures have been widely recognized as producing state-of-the-art outcomes. The process of organizing unstructured data involves four steps: pre-processing, term weighting, feature selection, and, ultimately, obtaining text vectors. Many deep learning models have been applied to a variety of natural language processing (NLP) applications, such as chunking, language modelling, semantically linked word recognition, and web news classification. A natural language model can predict the likelihood that w will follow t in a sequence [11]. For instance, deep recursive neural network models have been applied to parsing and sentiment analysis [12], question answering [13], and logical inference [14]. Recurrent neural networks can be used to model language [15], recognize speech [16], and generate sentences from images [17].
Bengio et al. [17] introduced a novel strategy based on a neural network language model (NNLM) to learn word embeddings from each word's preceding context. The C&W model generates word embeddings using a convolutional network that considers both local and global contexts. Using a straightforward single-layer architecture, the CBOW and skip-gram algorithms developed by Mikolov et al. [18] enable rapid learning of word embeddings from large datasets. The methods described in [19] relied solely on linear, local contexts (typically just a few words before and after the target word), whereas the methods presented in [20] relied on dependencies between words and introduced global vectors (GloVe). To address the limitation of local contexts, the GloVe method considers global word-to-word co-occurrence statistics, whereas semantics-based word embeddings employ a dependency parser to produce syntactic con grammes.
Recent advances in RNN language models have shown the value of distributed representations and the capacity to model arbitrarily long dependencies [20]. After training on a character-level dataset, the RNN variant described in [21] is able to produce natural-sounding phrases. More recently, [22] showed how an RNN language model (RNN-LM) can be used to generate image descriptions by conditioning the network on a pre-trained convolutional image feature representation. The ability to train deep networks also paves the way for more sophisticated strategies that capitalize on connections between labels and features, leading to more accurate predictions. This has opened the door for RNNs to be employed for sequential tasks such as text categorization and named entity recognition. More complicated network topologies are made possible by the tree-LSTM introduced in [23], a variant of RNN that allows each LSTM unit to incorporate information from multiple child units.

Ghosh et al. (2023) conducted a comprehensive study to assess water quality through predictive machine learning. Their research underscored the potential of machine learning models to assess and classify water quality effectively. The dataset included parameters such as pH, dissolved oxygen, BOD, and TDS. Among the models they employed, the Random Forest model was the most accurate, achieving an accuracy of 78.96%, whereas the SVM model registered the lowest accuracy, 68.29% [31].
For underwater image enhancement, one approach employs a unique Markov random field to refine image edges. Performance evaluations using metrics such as UCIQE and UIQM demonstrated the superiority of this method over existing techniques, resulting in sharper, clearer, and more colorful underwater images [32].
Sharma et al. (2020) presented a comprehensive study on the impact of COVID-19 on global financial indicators, emphasizing its swift and significant disruption. The research highlighted the massive economic downturn, with global markets losing over US $6 trillion in a week in February 2020. Their multivariate analysis provided insights into the influence of containment policies on various financial metrics. The study underscores the profound effects of the pandemic on economic activities and the potential of using advanced algorithms for detection and analysis [33].

Steps for Classifying Text
The process of classifying text usually has four main steps, beginning with the steps needed to gather and prepare the information that will be used for the work.

Data Preparation
Preprocessing methods play a crucial part in enhancing the effectiveness of models. Improving a text dataset for different text mining applications begins with transforming unstructured text into structured text. We look at three essential parts of data preprocessing: tokenization, stop word removal, and stemming.

i. Tokenization
Tokenization is the process of separating individual words, phrases, symbols, or other recognized units from a continuous stream of text. Its primary objective is to separate the words within a sentence.

ii. Stop word removal
Common but unimportant words are eliminated at this stage. These terms are typically removed either before or after natural language data (text) is processed. Commonly removed words include "the," "an," "is," "at," "of," and "but."

iii. Stemming
Stemming unifies various word forms into a single representation called a stem. For instance, the words "presentation," "presented," and "presenting" can all be reduced to the common representation "present." Stemming is prevalent in text processing, especially in information retrieval (IR), based on the idea that a query containing the term "presenting" signals a demand for documents containing the words "presentation" and "presented."
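The three preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the stop list is a tiny hypothetical sample, and `toy_stem` is a naive suffix stripper standing in for a real stemming algorithm such as Porter's.

```python
import re

# Assumption: a tiny illustrative stop list; real systems use larger curated lists.
STOP_WORDS = {"the", "an", "a", "is", "at", "of", "but", "and", "are", "to"}

# Assumption: naive suffix stripping stands in for a real stemmer.
SUFFIXES = ("ation", "ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The presentation is presented at the presenting of results"))
```

On this sentence the three forms of "present" collapse to the same stem, which is the behaviour the stemming step relies on.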

Representation of Documents
The data must first be represented in a form that the classification algorithm can process. One of the most common strategies is the "Bag of Words" (BOW), which represents a text by the frequency with which individual words appear in it.
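A bag-of-words representation can be sketched as follows. The helper names `build_vocabulary` and `bag_of_words` and the two sample documents are hypothetical, chosen only to show how word frequencies become fixed-length vectors.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({token for doc in documents for token in doc})
    return {token: index for index, token in enumerate(vocab)}


def bag_of_words(doc, vocabulary):
    """Return a count vector for one tokenized document."""
    counts = Counter(doc)
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:  # tokens outside the vocabulary are ignored
            vector[vocabulary[token]] = count
    return vector


# Two already-tokenized sample documents (illustrative only).
docs = [["what", "city", "host", "olympics"], ["what", "is", "what"]]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Word order is discarded in this representation, which is one reason sequence models such as GRUs are considered later in the paper.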

Reduced Dimensions
It is impractical to use all the words in a text corpus as potential features for a classification system because of the large number of words involved; analysing such large datasets can be computationally challenging. This highlights the need to carefully select truly representative attributes as inputs to the subsequent classification stage.
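The text does not say which selection criterion is used, so the following sketch assumes one simple option: keep the k tokens with the highest document frequency. The function names and the sample data are hypothetical.

```python
from collections import Counter


def select_features(documents, k):
    """Keep the k tokens that occur in the most documents (ties broken alphabetically)."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each token once per document
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:k]]


def project(doc, features):
    """Drop every token of a document that is not a selected feature."""
    keep = set(features)
    return [t for t in doc if t in keep]


docs = [["a", "b", "b"], ["a", "c"], ["a", "b"]]
features = select_features(docs, k=2)
print(features)
print(project(["c", "a", "b"], features))
```

Other criteria, such as information gain or a chi-squared score, would slot into the same place by changing how `ranked` is computed.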

Model Development
This is a crucial part of the text classification procedure. A subset of the text data in the dataset is chosen to serve as the training set. The classification model is then developed by training the chosen model on this training set.
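Choosing a training subset can be sketched as a seeded random split. The helper name `split_dataset` and the 80/20 ratio are illustrative assumptions; a benchmark that ships with a fixed training and test partition would use that partition instead.

```python
import random


def split_dataset(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and cut it into training and held-out parts."""
    rng = random.Random(seed)  # a local generator keeps the split reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_set, held_out = split_dataset(list(range(10)), train_fraction=0.8, seed=1)
print(len(train_set), len(held_out))
```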

Proposed Method
Loops between the units are a feature of RNNs, a form of artificial neural network. This structure is well suited to extracting valuable linguistic information from lengthy word sequences because it is designed to analyse sequential events, such as word sequences. As shown in Figure 2, at time-step t the present input and the previous hidden state are weighted by U and W and added together, and the output is obtained through V:
$h_t = \sigma_h(U x_t + W h_{t-1} + b^{(h)})$ (1)
$O_t = \sigma_o(V h_t)$ (2)
Equation 1 gives the value of the hidden state, $h_t$, and Equation 2 gives the output, $O_t$. The weight matrix W captures the recurrent connection between hidden states, while the matrices U and V connect the hidden layer to the input and the output, respectively. However, conventional RNNs are known to be hard to train because of frequently exploding and vanishing gradients. If the gradients become too small, the network cannot learn; if they become too large, the weights grow excessively and the network stops learning. With the addition of gating mechanisms, RNNs have spawned two variants, LSTM networks and GRUs, that are better able to deal with vanishing gradients. Input, forget, and output gates all play critical roles in the LSTM design, determining how information flows through the network. Together, these gates determine how much weight to give to adding the current step's data, preserving the previous state, and exposing the state to the surrounding network. The relationships between the input, forget, and output gates of an LSTM network are formulated mathematically in Equations 3 to 8.
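One time-step of the plain RNN in Equations 1 and 2 can be sketched in pure Python. The choice of tanh for the hidden activation and softmax for the output activation is an assumption, since the text does not name the two activation functions; matrices are lists of rows and U, W, V follow the roles described above.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(values):
    """Normalize scores into probabilities, shifting by the max for stability."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_step(x_t, h_prev, U, W, V, b_h):
    """Hidden state from Eq. (1) and output from Eq. (2) for one time-step."""
    pre = [a + b + c for a, b, c in zip(matvec(U, x_t), matvec(W, h_prev), b_h)]
    h_t = [math.tanh(p) for p in pre]  # assumed hidden activation
    o_t = softmax(matvec(V, h_t))      # assumed output activation
    return h_t, o_t


# Run a two-word toy sequence through a 2-unit network with hand-picked weights.
U = [[1.0, 0.0], [0.0, 1.0]]
W = [[0.5, 0.0], [0.0, 0.5]]
V = [[1.0, -1.0], [-1.0, 1.0]]
h = [0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0]):
    h, o = rnn_step(x, h, U, W, V, [0.0, 0.0])
print(h, o)
```

Because the same W is applied at every step, gradients flowing back through many steps are repeatedly scaled by it, which is the source of the exploding and vanishing behaviour noted above.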

Figure 3. LSTM Units
$i_t = \sigma(U^{(i)} x_t + W^{(i)} h_{t-1} + b^{(i)})$ (3)
$f_t = \sigma(U^{(f)} x_t + W^{(f)} h_{t-1} + b^{(f)})$ (4)
$o_t = \sigma(U^{(o)} x_t + W^{(o)} h_{t-1} + b^{(o)})$ (5)
$\hat{c}_t = \tanh(U^{(c)} x_t + W^{(c)} h_{t-1} + b^{(c)})$ (6)
$c_t = f_t \circ c_{t-1} + i_t \circ \hat{c}_t$ (7)
$h_t = o_t \circ \tanh(c_t)$ (8)
The input gate, forget gate, and output gate control the flow of data through the network. The input gate determines how much new information is added to the cell's state, and the output gate determines how much of the current state is exposed to the rest of the network. The forget gate determines how much of the previous state is kept. The weights and biases of both the gates and the candidate state are learned during training. The candidate state is computed using Eq. (6), and Eq. (7) updates the actual cell state $c_t$. The hidden state $h_t$ is then computed using Eq. (8). The symbol $\circ$ denotes element-wise multiplication.
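The gate descriptions above match the standard LSTM formulation, so one time-step can be sketched as below. This is an illustration under that assumption, not the authors' implementation; the parameter dictionary `p` and its keys "i", "f", "o", "c" are hypothetical names, with each entry holding an input matrix, a recurrent matrix, and a bias.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(params, x_t, h_prev):
    """Compute (input matrix) x_t + (recurrent matrix) h_prev + bias for one gate."""
    U, W, b = params
    return [u + w + bias for u, w, bias in zip(matvec(U, x_t), matvec(W, h_prev), b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time-step; returns the new hidden state and cell state."""
    i_t = [sigmoid(v) for v in affine(p["i"], x_t, h_prev)]      # input gate
    f_t = [sigmoid(v) for v in affine(p["f"], x_t, h_prev)]      # forget gate
    o_t = [sigmoid(v) for v in affine(p["o"], x_t, h_prev)]      # output gate
    c_hat = [math.tanh(v) for v in affine(p["c"], x_t, h_prev)]  # candidate state
    # Keep part of the old cell state and add part of the candidate.
    c_t = [f * c + i * ch for f, c, i, ch in zip(f_t, c_prev, i_t, c_hat)]
    # Expose a gated view of the cell state as the hidden state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# With all-zero parameters every gate is 0.5 and the candidate is 0.
zero_params = {k: ([[0.0]], [[0.0]], [0.0]) for k in "ifoc"}
print(lstm_step([1.0], [0.0], [2.0], zero_params))
```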

Figure 4. GRU Units
Instead of a dedicated memory cell, the GRU relies on a gating mechanism to regulate the flow of information. To control the flow of information from the previous activation, the GRU computes two gates, the update gate and the reset gate, and uses the reset gate when calculating a new candidate state.
The update gate mixes the prior activation and the new candidate activation in the proper proportion. The gates at time-step t are computed as follows:
Update gate: $z_t = \sigma(U^{(z)} x_t + W^{(z)} h_{t-1} + b^{(z)})$ (9)
Reset gate: $r_t = \sigma(U^{(r)} x_t + W^{(r)} h_{t-1} + b^{(r)})$ (10)
The candidate state and the final output are then computed from these two gates.
Previous work has demonstrated that each neural network architecture has drawbacks that can be overcome by combining several architectures. In this work, we investigate how adding layers from different designs to the current RNN model affects the model's performance. Comparing the LSTM, which is the memory layer (Figure 3), with the GRU (Figure 4), we find that the former is better able to remember long-distance relations. While the sigmoid activation function is used in the storage layers of both LSTM and GRU units, the hyperbolic tangent activation function in the output layer facilitates data retrieval even after long-term storage. GRUs, on the other hand, can perform better than LSTMs and are simpler to train because they need less data. This is why GRUs are often preferred over LSTMs when long-term retrieval is not necessary.
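A single GRU time-step can be sketched as follows. The update and reset gates follow Equations 9 and 10; the candidate and final-output formulas are not legible in the text, so the sketch assumes one widely used variant, h_hat = tanh(U x_t + W (r_t * h_prev) + b) and h_t = (1 - z_t) * h_prev + z_t * h_hat. Some formulations swap the roles of z_t and 1 - z_t. The parameter keys "z", "r", "h" are hypothetical names.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(params, x_t, h_vec):
    """Compute (input matrix) x_t + (recurrent matrix) h_vec + bias."""
    U, W, b = params
    return [u + w + bias for u, w, bias in zip(matvec(U, x_t), matvec(W, h_vec), b)]


def gru_step(x_t, h_prev, p):
    """One GRU time-step under the assumed candidate and mixing formulas."""
    z_t = [sigmoid(v) for v in affine(p["z"], x_t, h_prev)]  # update gate, Eq. (9)
    r_t = [sigmoid(v) for v in affine(p["r"], x_t, h_prev)]  # reset gate, Eq. (10)
    gated = [r * h for r, h in zip(r_t, h_prev)]             # reset gate scales the old state
    h_hat = [math.tanh(v) for v in affine(p["h"], x_t, gated)]  # assumed candidate
    # Assumed mixing rule: z_t weights the candidate, (1 - z_t) the old state.
    return [(1.0 - z) * h + z * hh for z, h, hh in zip(z_t, h_prev, h_hat)]


# With all-zero parameters both gates are 0.5 and the candidate is 0,
# so the new state is half of the previous one.
zero_params = {k: ([[0.0]], [[0.0]], [0.0]) for k in "zrh"}
print(gru_step([1.0], [2.0], zero_params))
```

Compared with the LSTM, there is no separate cell state and one fewer gate, which is the structural simplification the text refers to.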
The proposed model, shown in Figure 5, employs an RNN composed of three LSTM layers and two GRU layers. The output gate layer enables the management of important long-term dependencies. Considering the current state of the cell, the output vector from the previous cell, and the input vector, the forget gate layer assigns a value and decides whether to forward that value to the input layer. In the LSTM's forget gate layer, the output of a sigmoid function is applied through a point-wise multiplication to generate values. The input gate layer has a dual purpose: the incoming input vector is used to update a value through a sigmoid activation function, and this value is then combined with another value obtained through a hyperbolic tangent activation function. The result is then added to the cell's existing state. The output gate layer combines the sigmoid output computed from the input vector with the hyperbolic tangent of the updated cell state.
In comparison with an LSTM, a fully gated GRU needs only two gates (the update gate and the reset gate). The GRU has been around for more than half a decade, and it is still favored in some situations because of its shorter training time and smaller dataset needs. Because it has separate input and forget gates, the LSTM is more complicated. The LSTM's complexity allows GRU units to be incorporated alongside it, which improves control over the model. In light of these findings, we compare the capabilities of four models featuring MV-RNN, RNN, LSTM, and GRU units, respectively. The TREC dataset is used to train the models, with GloVe word vectors used because they can accurately capture syntactic and semantic representations of the words. A large corpus is needed to train GloVe, and the memory required grows in the long run because of the model's inability to handle out-of-vocabulary terms. GloVe operates similarly to Word2Vec; however, its training does not focus on the weights associated with frequently used word pairs. The points above support the use of GloVe to train the models considered in this study. Previous work has shown that the limitations of individual neural network topologies can be mitigated by employing a hybrid approach. In this study, we look into how incorporating new layer types from various designs into the existing RNN model can improve its performance. Comparing the LSTM memory layer with the GRU [28], we find that the former is superior at remembering long-distance relationships.
We performed tests on the reference dataset TREC [24] to evaluate the performance of the suggested model. TREC includes six distinct question categories, including "LOC," "NUM," and "ENTY," as shown in Table 1. The training dataset consists of 5,452 labelled questions, and the test dataset consists of 500 questions. To improve the performance of the suggested model, we first employed stemming to reduce related word forms to a single form, and we also deleted stop words (such as "and," "are," "of," "the," and "to") from the input sequence. Both the word embedding size, d, and the GRU size, u, were held constant throughout the experiment. Next, GloVe word vectors pre-trained by Pennington et al. [24] were used to initialize all word embeddings for the text data. Several researchers have modified word vector training approaches [25] to improve performance when classifying phrases by their emotional tone.
To depict the model's generalization abilities more accurately, we try to use the same embedded data across datasets whenever possible. When there was no clear classification source word in a text, we devised the self-attention mechanism to instead treat all of the surrounding terms as source words for classification. Each GRU unit in the deep learning networks was set to have 200 hidden states, which required modifying all of the network layers. We applied AdaDelta optimization [26] to the proposed model, setting the learning rate to 0.001 and the mini-batch size to 64. To prevent overfitting in the GRU layers, we used L2 regularization (with a coefficient r of 105) and the dropout technique [27]. We measured performance against the state of the art in the text classification task in terms of error rate and accuracy.
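Initializing an embedding layer from pre-trained vectors, as described above, can be sketched as follows. The function name, the toy two-dimensional vectors, and the policy of giving out-of-vocabulary words small random vectors are all assumptions for illustration; the text does not state how unseen words are handled.

```python
import random


def build_embedding_matrix(vocabulary, pretrained, dim, seed=0):
    """Return one row per vocabulary index, copying pre-trained vectors where available."""
    rng = random.Random(seed)
    matrix = []
    # Visit tokens in index order so that row i belongs to the token with index i.
    for token in sorted(vocabulary, key=vocabulary.get):
        if token in pretrained:
            matrix.append(list(pretrained[token]))
        else:
            # Assumption: out-of-vocabulary words get small random vectors.
            matrix.append([rng.uniform(-0.05, 0.05) for _ in range(dim)])
    return matrix


# Toy stand-ins for real pre-trained vectors (illustrative values only).
pretrained = {"what": [0.1, 0.2], "city": [0.3, 0.4]}
vocabulary = {"what": 0, "city": 1, "olympiad": 2}
embeddings = build_embedding_matrix(vocabulary, pretrained, dim=2)
print(embeddings)
```

Each input word is then replaced by its row in this matrix before being fed to the recurrent layers.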

Performance Evaluation
We used the TREC benchmark dataset and other custom parameters to evaluate our suggested hybrid GRU model. As a benchmark, we compared the suggested hybrid GRU model's performance with that of four well-known classical recurrent neural networks: the RNN, the recursive MV-RNN, the gated recurrent unit (GRU), and the LSTM [29,30]. To train the models, we used stochastic gradient descent on randomly created mini-batches. We also found that networks with fewer gated units and fewer embedding dimensions perform better than those with more gated units and more embedding dimensions. For this, we employed a word-level embedding layer initialized with pre-trained GloVe vectors to assign real-valued vectors to each word of the text in order.
In this experiment, we examined the state-of-the-art RNN, MV-RNN, and LSTM models based on their classification results, using the conventional assessment metrics of accuracy and mean squared error. Figure 6 compares the four models' results on the TREC dataset. As Figure 6 shows, the efficacy of all models is similar, with minor variations. The proposed model has the best result, with an accuracy of 0.982, while the hybrid GRU has a similar F-measure of 0.952.
In this section, as shown in Figure 7, we compare the error rate of the proposed hybrid GRU model with the error rates of the LSTM, MV-RNN, GRU, and RNN models.
To conduct this study, we created an execution environment in which to train these models using the TREC benchmark dataset. The experiment showed that the mean squared error decreased steadily as the number of epochs increased, and the final mean squared error value for TREC was 0.529. In this analysis, we calculated the error rate by dividing the total squared difference between observed and predicted values by the total number of data points (n). Word embedding size and the number of units were both set to 64, as described in the model's implementation settings. Compared with other RNN models, the suggested one converged quickly and had a lower error rate even after many training iterations. All the models had the same basic framework so that the evaluation would be consistent. We evaluated our suggested model against the state-of-the-art RNN models using the TREC dataset. Table 2 shows that the suggested model has a lower error rate than the MV-RNN, LSTM, RNN, and GRU.
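The evaluation metrics named in this section can be sketched as follows. The mean squared error follows the definition given above (summed squared differences divided by n); the F-measure is computed here for a single class as the harmonic mean of precision and recall, which is an assumption because the text does not say how per-class scores are averaged. The labels in the example are made up and are not results from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def mean_squared_error(observed, predicted):
    """Sum of squared differences divided by the number of data points."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def f_measure(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels using category names from the TREC description.
y_true = ["LOC", "NUM", "LOC", "ENTY"]
y_pred = ["LOC", "LOC", "LOC", "ENTY"]
print(accuracy(y_true, y_pred))
print(f_measure(y_true, y_pred, "LOC"))
print(mean_squared_error([1.0, 0.0, 0.0], [0.5, 0.0, 0.5]))
```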

Conclusion
This study uses the hybrid GRU method to address the problem of document-level text categorization efficiently. Traditional recurrent neural networks (RNNs) were investigated on standard text classification datasets, and gated recurrent units were found to outperform other models in terms of accuracy and error rate. GloVe word vectors and other pre-trained word embeddings were used to derive semantic characteristics between words in documents.
According to this research, the proposed model with attention is suitable for text classification of sequential data, particularly with large quantities of training data, because it can efficiently capture long sequences for natural language comprehension. Empirical findings on the TREC standard dataset show that the suggested model outperformed conventional RNN, LSTM, GRU, and MV-RNN models in text classification accuracy and error rate. The proposed model's F-measure is 0.982, compared with 0.952 for the GRU. Further investigation into the suggested model's applicability to additional NLP tasks is recommended.
Alenezi et al. (2021) developed a novel Convolutional Neural Network (CNN) integrated with a block-greedy algorithm to enhance underwater image dehazing. The method addresses color channel attenuation and optimizes local and global pixel values [32].

Figure 6. Comparative analysis of Classification on TREC dataset

Table 1. TREC Dataset Statistics Summary

Table 2. Comparison of Error Rate (%) with Existing Models