DTT: A Dual-domain Transformer model for Network Intrusion Detection

With the rapid evolution of network technologies, network attacks have become increasingly intricate and threatening. The escalating frequency of network intrusions has exerted a profound influence on both industrial settings and everyday activities. This underscores the urgent necessity for robust methods to detect malicious network traffic. While intrusion detection techniques employing Temporal Convolutional Networks (TCN) and Transformer architectures have exhibited commendable classification efficacy, most are confined to the temporal domain. These methods frequently fall short of encompassing the entirety of the frequency spectrum inherent in network data, thereby resulting in information loss. To mitigate this constraint, we present DTT, a novel dual-domain intrusion detection model that amalgamates TCN and Transformer architectures. DTT adeptly captures both high-frequency and low-frequency information, thereby facilitating the simultaneous extraction of local and global features. Specifically, we introduce a dual-domain feature extraction (DFE) block within the model. This block effectively extracts global frequency information and local temporal features through distinct branches, ensuring a comprehensive representation of the data. Moreover, we introduce an input encoding mechanism to transform the input into a format suitable for model training. Experiments conducted on two distinct datasets address concerns regarding data duplication and diverse attack types, respectively. Comparative experiments with recent intrusion detection models unequivocally demonstrate the superior performance of the proposed DTT model.


Introduction
In the contemporary information society, the ubiquity of computer networks and the Internet has enhanced the daily lives of individuals. However, this progress has also led to an increase in the frequency and complexity of network intrusions, posing cybersecurity threats [1]. Major incidents like the Capital One data breach [2] and the SolarWinds attack [3] underscore the urgency of addressing cybersecurity issues. To tackle this challenge, network security professionals utilize advanced technologies to detect malicious network traffic [4]. They aim to swiftly identify and thwart potential attacks to safeguard cyberspace integrity. Consequently, developing efficient and accurate models for network intrusion detection (NID) has emerged as a pivotal research focus within the field of network security.
* Corresponding author. Email: ly0446@hati.edu.cn
Early intrusion detection research primarily relied on recognized attack patterns or characteristics, such as rule- and signature-based approaches [5]. However, attackers continuously refine their strategies to evade detection. This poses challenges for traditional methods, which are unable to proactively detect new and emerging threats, such as zero-day vulnerabilities [6]. To address this challenge, cybersecurity researchers are enhancing intrusion detection systems (IDS) by incorporating machine learning, deep learning, and behavioral analysis [7].
Models like Transformer and Temporal Convolutional Networks (TCN) demonstrate exceptional sequence modeling capabilities, making them well-suited for processing network traffic data [8,9]. Researchers have widely adopted these models as tools for intrusion detection, leading to significant research advancements. For instance, Liang et al. [10] introduced a multi-level intrusion detection model based on Transformer, which demonstrates remarkable performance in detecting intrusion actions. Cheng et al. [11] proposed a global attentional TCN-based IDS for in-vehicle applications. However, each of these works uses only one of the two architectures. We believe that combining the attention mechanism inherent in Transformer with the long-term dependency-capturing capability of TCN could further enhance the performance of IDS.
Current intrusion detection methods based on TCN and Transformer typically explore the time domain while overlooking the frequency domain. Fourier analysis reveals that these models exhibit learning biases toward specific types of frequency components. They fail to perceive the entire spectrum of the network time series data, leading to information loss [12,13]. However, specific intrusions may involve anomalous traffic patterns or frequent periodic signals within certain frequency ranges. For instance, Fu et al. [14] demonstrated the efficacy of leveraging frequency domain analysis for robust detection. Therefore, the inability to comprehensively extract information from the frequency domain poses a challenge in accurately identifying specific attack types.
Additionally, in terms of input encoding, certain studies focus exclusively on categorical fields. They neglect numeric fields [15] or resort to simplistic encoding techniques like one-hot encoding for the categorical fields [16]. Consequently, crucial feature information is lost during the data processing stage, potentially compromising the effectiveness of IDS. Inspired by [17], we propose a flow-level projection (FLP) encoding method to combine categorical and numeric fields.
• We introduce a novel intrusion detection model termed DTT. We construct the DFE block based on Transformer. This block extracts multi-level frequency information to capture both global and local features. Additionally, the model integrates TCN to capture long-term dependencies in the network data, thereby improving computational efficiency.
• We integrate the FLP encoding module into the model. It improves the model's ability to comprehensively capture data feature information, subsequently enhancing its overall performance.
• We perform comprehensive experiments to assess the performance of DTT using two extensive datasets. Experimental results showcase the superior performance of the DTT model compared to recent intrusion detection models.
The paper is organized as follows: Section 1 provides an introduction to our research, Section 2 explores recently published related works, Section 3 outlines the proposed DTT model, Section 4 presents experimental analyses, and Section 5 concludes the paper.

Research on intrusion detection based on Transformer
This section reviews studies on NID that use Transformer or TCN. As a thorough assessment of Transformer, Manocchio et al. [18] introduced the FlowTransformer framework, which is capable of directly replacing various Transformer components. It lays the foundation for further exploration of Transformer-based approaches in NID. Han et al. [19] merged n-gram frequency with a time-aware Transformer to capture session-level and packet-level features. Despite achieving improved performance compared to previous models, the approach faces hindrances in processing encrypted traffic. Nguyen et al. [20] utilized bidirectional encoder representations from Transformers for intrusion detection research. Wu et al. [21] employed a robust Transformer-based IDS to reconstruct feature representations. Nam et al. [22] proposed a bidirectional generative pretrained transformer for attack detection in the Controller Area Network. These Transformer-based models consistently demonstrate superior performance for NID in the supervised domain, providing a solid foundation for our study. In the semi-supervised domain, Li et al. [17] introduced an extreme semi-supervised framework based on Transformer, enhancing performance in the presence of limited labeled data. Its multi-level feature extraction module inspired us to design the FLP encoding module.

Research on intrusion detection based on TCN
Cheng et al. [11] introduced an approach for in-vehicle NID by incorporating global attention into a TCN. It shows that TCN can not only focus on local features but also extract temporal relationships within the context. Sadique et al. [23] utilized TCN for sequential and predictive analysis of heterogeneous threat data, aiming to detect and thwart botnets effectively. Cai et al. [24] proposed a model for detecting malicious network traffic, leveraging bidirectional TCN (BiTCN) and a multi-head self-attention mechanism. This model successfully mitigates data imbalance issues and enhances the accuracy of NID. XGBoost-TCN [25] integrates Extreme Gradient Boosting Decision Tree and TCN for mobile edge computing scenarios, with TCN refining deep timing information within the features. The SSAE-TCN model combines a stacked sparse encoder with TCN within a federated learning framework [26]. It achieves efficient NID and preserves privacy in Internet of Things data.

Other deep learning models for intrusion detection
Hassan et al. [27] presented a hybrid deep learning model for efficient NID in large-scale data environments. This model integrates a convolutional neural network with a weight-dropped LSTM network, showcasing improved performance. Sheykhkanloo et al. [28] addressed the challenge of insider threat detection in an extremely imbalanced dataset using a spread subsample technique. Xiao et al. [29] proposed Extended Byte-Byte Segmentation Neural Networks (EBSNN) for NID. While it showed effectiveness on their collected dataset, its generalization to other common datasets was poor. In the realm of mobile ad-hoc networks (MANETs), Madhu [30] proposed a model that utilizes COOT optimization and a hybrid LSTM-KNN classifier to bolster network security. Venkateswaran et al. [31] introduced a neuro deep learning wireless IDS tailored for MANETs.
Zipperle et al. and Yang et al. [32,33] provided comprehensive summaries of recent intrusion detection research. However, few studies organically combine Transformer and TCN techniques in the field of NID. In view of this, our study builds upon Transformer and TCN frameworks. We optimize the multi-head self-attention module in Transformer and introduce a frequency domain module to capture frequency domain features. Based on the above ideas, we propose a novel NID model called DTT.

Methodology
Fig. 1 presents the overall architecture of the DTT. We preprocess the input data before feeding it into the DTT for training and classification. The DTT model comprises three components: the input encoding module FLP, the DTT module, and the classifier. FLP combines categorical and numerical fields of network data, automatically capturing their advanced features. The DTT module consists of the DFE module and the TCN module. DFE extracts both frequency-domain and time-domain information from the data stream. TCN is used to capture long-term dependencies and supplement global features.
The complete processing flow of the data stream, from the initial stage to the final classification stage, is shown in Algorithm 1. Initially, the data undergoes pre-processing, where one-hot encoding and min-max normalization are applied (line 1). This step converts categorical variables into numerical form, enhancing the interpretability and manageability of these features for machine learning models. Following pre-processing, the data is converted into continuous feature vectors using the FLP encoding method (line 3). Subsequently, the processed data is used to train the DTT model (lines 4-6), which classifies the attack flow based on header information (line 7).
Min-max normalization is given in Equation (1):

$$x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

In the formula and algorithm, $x$ is the value of any column of the original dataset, $x_{\min}$ is the minimum value over the whole column, $x_{\max}$ is the maximum value, and $x^{*}$ is the normalized value. The FLP encoding $\mathcal{E}$ is a fully connected layer that maps the data onto continuous feature vectors.
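The pre-processing step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy flows, the field names, and the helper functions `min_max_normalize` and `one_hot` are hypothetical and are not taken from the paper's Algorithm 1.

```python
def min_max_normalize(column):
    """Scale a numeric column to [0, 1]: x* = (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(column), max(column)
    span = x_max - x_min
    if span == 0:
        # Constant column: map every value to 0.0 to avoid dividing by zero.
        return [0.0 for _ in column]
    return [(x - x_min) / span for x in column]


def one_hot(column):
    """One-hot encode a categorical column; categories are sorted for determinism."""
    categories = sorted(set(column))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for value in column:
        row = [0.0] * len(categories)
        row[index[value]] = 1.0
        encoded.append(row)
    return encoded, categories


# Hypothetical toy flows: one categorical field (protocol) and one numeric field (bytes).
protocols = ["tcp", "udp", "tcp", "icmp"]
byte_counts = [200.0, 50.0, 800.0, 50.0]

proto_vectors, proto_names = one_hot(protocols)
bytes_scaled = min_max_normalize(byte_counts)

# Concatenate the encoded categorical field with the normalized numeric field.
features = [vec + [b] for vec, b in zip(proto_vectors, bytes_scaled)]
```

Each row of `features` is then a purely numeric vector that an input encoding layer could consume.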

Input encoding options
The function of data pre-processing is to transform the network stream into a format appropriate for model training. Unlike the preprocessing stage, the input encoding constitutes part of the model and integrates into the training phase. The input encoding is not mandatory, which means we can enter the data stream directly into the DTT model for training after the preprocessing stage. Our proposed FLP encoding approach encodes the categorical and numerical fields together. First, the categorical fields are one-hot encoded by the pre-processing part, followed by concatenation with the numerical fields and then normalization. The data is then passed to the FLP layer, which is a fully connected layer that maps the data flow into continuous feature vectors and captures high-dimensional features.
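As a rough illustration of the idea that FLP is a single fully connected layer applied to the whole preprocessed flow, the following sketch projects one flow vector onto a continuous feature vector. The layer size, the random initialization, and the names `make_flp_layer` and `flp_encode` are assumptions made for this example; in the actual model the weights would be learned during training.

```python
import random


def make_flp_layer(in_dim, out_dim, seed=0):
    """Create weights and biases for one fully connected layer (random init)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def flp_encode(flow, weights, bias):
    """Project one preprocessed flow vector onto a continuous feature vector."""
    return [sum(w * x for w, x in zip(row, flow)) + b for row, b in zip(weights, bias)]


# A preprocessed flow: one-hot protocol (3 values) followed by 2 normalized numeric fields.
flow = [0.0, 1.0, 0.0, 0.2, 0.75]
weights, bias = make_flp_layer(in_dim=len(flow), out_dim=8)
embedding = flp_encode(flow, weights, bias)
```

Because categorical and numerical fields enter the same projection, neither group is discarded before the model sees the flow.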
The typical input encodings comprise the Lookup-Based Embedding Layer, Dense Embedding Layer, Linear Projection Layer [34], and Flow Level Embedding [18]. The first three encode solely categorical fields and omit numerical fields, whereas the last one encodes both categorical fields and numerical values. In addition, there is a further option called No Encoding, which involves preprocessing the data stream and directly inputting it into the DTT model without encoding the input. This paper investigates the impact of these six encodings on the model's performance in the experimental part of Section 4.

DFE block
The Dual-domain Feature Extraction (DFE) block comprises two principal branches, as illustrated in Fig. 2. The time domain branch employs an attention mechanism similar to the encoding layer in the Transformer, specifically designed for extracting local time domain features. In contrast, the spectral domain branch employs a Fast Fourier Transform (FFT) layer to substitute the self-attention block in the Transformer, facilitating frequency domain information extraction. This design allows our DFE block to simultaneously extract both time- and frequency-domain information.
The spectral domain branch of DFE focuses on extracting frequency domain features based on attributes like arrival time, payload length, and protocol type. This is achieved through a spectral gating network with FFT and 1D-CNN layers. As illustrated in Fig. 2, the FFT layer comprises an FFT operation, two convolutional layers, and an IFFT process. The FFT operation transforms data from physical to spectral space, as shown in Equation (2).

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2\pi k n / N}, \quad 0 \le k \le N-1 \tag{2}$$

where $X_k$ is a frequency component of $x_n$ with the frequency of $2\pi k/N$. The FFT principles indicate that each spectrum frequency is a composite of all time domain points. Thus, frequency domain representation equates to global time series feature extraction. Notably, FFT outputs complex number vectors, which are unsuitable for direct input to deep learning algorithms. To resolve this, real and imaginary parts are channel-wise concatenated, as shown in Equation (3).
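To make the transform and the real/imaginary split concrete, here is a small sketch using a naive discrete Fourier transform from the Python standard library. The toy sequence and the function names (`dft`, `idft`, `split_real_imag`) are illustrative assumptions; a real implementation would normally call an FFT routine from a numerical library rather than this O(N^2) loop.

```python
import cmath


def dft(signal):
    """Discrete Fourier transform: X_k = sum_n x_n * exp(-2j*pi*k*n/N)."""
    n_points = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * n / n_points) for n, x in enumerate(signal))
        for k in range(n_points)
    ]


def idft(spectrum):
    """Inverse transform: x_n = (1/N) * sum_k X_k * exp(2j*pi*k*n/N)."""
    n_points = len(spectrum)
    return [
        sum(X * cmath.exp(2j * cmath.pi * k * n / n_points) for k, X in enumerate(spectrum))
        / n_points
        for n in range(n_points)
    ]


def split_real_imag(spectrum):
    """Stack real and imaginary parts as two real-valued channels."""
    return [[c.real for c in spectrum], [c.imag for c in spectrum]]


# Hypothetical per-flow feature sequence with a clear periodic pattern.
sequence = [1.0, 0.0, -1.0, 0.0]
spectrum = dft(sequence)
channels = split_real_imag(spectrum)           # 2 channels x N frequency bins
recovered = [c.real for c in idft(spectrum)]   # round trip back to the time domain
```

Every `X_k` is a sum over all time steps, which is why the frequency representation acts as a global feature extractor, and the two real-valued channels are what a convolutional layer can consume directly.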
Two 1D convolutional neural networks (1D-CNNs) are used to extract the frequency domain features, see Equation (4).

The extracted features are fed into the IFFT layer, which reverts frequency domain features back to the time domain through IFFT, see Equation (5). Learnable weight parameters are used to emphasize different frequency components, efficiently capturing the data's frequency domain characteristics. These parameters are optimized via back-propagation. After the IFFT, the spectral data undergoes layer normalization and is processed through a 1D-CNN for feature calibration. Ultimately, DFE efficiently extracts and characterizes frequency domain features through a spectral gating network, which fully utilizes the FFT and IFFT operations alongside the learnable weighting parameters.
The time domain branch of DFE focuses on extracting time domain features. It is also used to extract local high-frequency details, complementing the global low-frequency semantics captured by the frequency domain branch. The branch comprises a sequence of normalization and transformation components. The attention layer begins with a normalization layer, which normalizes the input data. Next, the Multi-Head Self-Attention (MHSA) layer considers the relationships between all input positions simultaneously, improving the capture of contextual information. The MHSA uses a self-attention mechanism founded on a trainable triad (query, key, value). The query and key are used to calculate the weight score allocated to each value, and the output is then computed as a weighted sum of the values. This is dot-product attention, which enables parallel computation and reduces training time. Its calculation is shown in Equation (6).

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \tag{6}$$

where Q, K, and V represent the matrices for Query, Key, and Value, respectively, and $d_k$ is the dimension of Key.
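The weighted-sum mechanism described above can be sketched for a single attention head in plain Python. This is a minimal illustration of standard scaled dot-product attention, not the paper's implementation: the toy token vectors are hypothetical, and the learned query, key, and value projections are omitted by using the same vectors for all three roles.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        all_weights.append(weights)
        # Output is the weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs, all_weights


# Three hypothetical token vectors; Q, K and V share them for simplicity.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attn_weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `attn_weights` sums to one, so every output vector is a convex combination of the value vectors; a multi-head layer would run several such heads on different learned projections and concatenate the results.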
The formula for MHSA is presented in Equation (7).
After MHSA, layer normalization is carried out once again to ensure stability within the layers. It is immediately followed by the MLP module for further feature mapping and extraction.
Finally, the features from each branch of the DFE are integrated with the input features (see Equation (8)).

TCN module
The Temporal Convolutional Network (TCN) integrates dilations and residual connections with causal convolutions. Causal convolutions prevent information leakage from the future to the past. Specifically, the output at time t is computed by convolving only elements from time t and earlier in the previous layer. Dilated convolutions enable an exponentially large receptive field, which allows the model to capture intricate temporal patterns while maintaining computational efficiency. Formally, for a sequence input $x = \{x_0, \dots, x_{t-1}, x_t\} \in \mathbb{R}^{n}$ and a filter $f: \{0, \dots, k-1\} \to \mathbb{R}$, the output of the hidden layer at $x_t$ is defined as Equation (9).

$$F(x_t) = \sum_{i=0}^{k-1} f(i) \, x_{t - d \cdot i} \tag{9}$$
where d is the dilation factor and k is the filter size. The dilated convolutions enable the incorporation of information from distant time steps through the parameter d. This enhances TCN's capacity to efficiently capture intricate temporal patterns. Given that the receptive field of TCN relies on the network depth, stabilizing a deeper and larger TCN is critical. Residual connections have been shown to enhance the performance of very deep networks. The process is shown in Equation (10).

$$o = \sigma\big(x + \mathcal{F}(x)\big) \tag{10}$$

where $\mathcal{F}(x)$ refers to the transformations of $x$ and $\sigma$ denotes the sigmoid activation.
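The two TCN ingredients discussed above, a dilated causal convolution and a residual connection, can be sketched as follows. This is a simplified single-channel illustration under assumptions: zero padding on the left, a hand-picked kernel instead of learned weights, and hypothetical function names.

```python
import math


def dilated_causal_conv(x, kernel, dilation):
    """Compute sum_i f(i) * x[t - d*i] for each t; positions before the start count as zero."""
    out = []
    for t in range(len(x)):
        total = 0.0
        for i, f_i in enumerate(kernel):
            src = t - dilation * i
            if src >= 0:  # causal: only the current and earlier time steps contribute
                total += f_i * x[src]
        out.append(total)
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def residual_block(x, kernel, dilation):
    """Residual connection: activation applied to the input plus its transformation."""
    fx = dilated_causal_conv(x, kernel, dilation)
    return [sigmoid(a + b) for a, b in zip(x, fx)]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
conv_out = dilated_causal_conv(signal, kernel=[1.0, 1.0], dilation=2)
block_out = residual_block(signal, kernel=[1.0, 1.0], dilation=2)
```

With kernel size k and dilation d, one layer sees (k - 1) * d + 1 time steps, so stacking layers with dilations such as 1, 2, 4 makes the receptive field grow exponentially with depth.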
The unique formulation of TCN and its emphasis on parallelization during training distinguish it as a powerful tool in the realm of time series analysis. It offers enhanced performance and scalability in comparison to traditional recurrent architectures.

Experimental configuration
All experiments are conducted on a Windows operating system using a computer equipped with an Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50 GHz and an NVIDIA GeForce RTX 2080 Ti graphics processor. Each experiment is repeated at least three times to ensure result stability. The best-performing results from these repeated experiments are chosen as the final evaluation metrics to avoid any negative influence from poor model initialization.
• The NCCI dataset is a NetFlow-based dataset generated from the original pcap file of CSE-CIC-IDS2018 [36]. The total number of data flows is 8,392,401, of which 1,019,203 are attack samples and 7,373,198 are benign samples.
• The NUB dataset is the NetFlow format of UNSW-NB15 [37]. The total number of data flows is 2,390,275, with 95,053 being attack samples and 2,295,222 being benign samples. These attacks are further categorized into nine sub-categories.

Data pre-processing
After obtaining the dataset, the next step is data preprocessing. In this paper, we follow the preprocessing method of [38] and divide the dataset into a training dataset and a validation dataset at a ratio of 90% to 10%. The subsequent stage converts the data into a format that can be recognized by the input encoding module, known as the input encoding session. Various hyperparameter configurations can influence both model convergence speed and experimental outcomes, as detailed in Table 1. When training the DTT model, we use the Adam optimizer to adjust parameters automatically, including the learning rate. However, to avoid the risk of convergence at suboptimal local points, we undertake experiments to establish an appropriate learning rate.
In this experiment, two evaluation metrics are used, namely balanced accuracy and F1-score [39]. The metrics are computed from the confusion matrix presented in Table 2. The specific formulas for these metrics are presented in Equations (11) and (12).
The formulas show that the F1-score is the more comprehensive indicator. Therefore, we focus on analysing the F1-score in the experiments.
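For reference, the two metrics can be computed from confusion-matrix counts as in the sketch below. It uses the standard binary definitions of balanced accuracy and F1-score, which is an assumption here because the paper's own Equations (11) and (12) are not reproduced in this text; the counts are made up for illustration.

```python
def balanced_accuracy(tp, fp, tn, fn):
    """Mean of the true positive rate and the true negative rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion-matrix counts for an imbalanced binary task.
tp, fp, tn, fn = 80, 10, 900, 20
bal_acc = balanced_accuracy(tp, fp, tn, fn)
f1 = f1_score(tp, fp, fn)
```

Both functions assume non-zero denominators; a production version would need to handle classes with no positive predictions or no positive samples.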

Results and Discussion
The experiments in the study are organized into four main parts. First, we examine the impact of different input encoding methods on model performance. Second, we focus on the role of each component within the DFE block. Third, we explore how adjusting the number of layers of the DFE block affects outcomes. This allows us to evaluate the performance of DTT across varying complexity levels and scrutinize the influence of the selected number of layers on model effectiveness. Finally, we compare our proposed model with recent intrusion detection models, assessing its advantages and competitiveness within the same experimental framework and dataset.

Input encodings
In this section, we investigate the effect of different input encoding methods on model performance.For this experiment, we set the  value of the DFE block to 2 and       4. For all experimental models, FLP encoding is chosen as the input encoding, and the parameter  is illustrated in the figure.The other hyperparameters remained constant.
The experiment results presented in Fig. 5 and Table 5 reveal distinct performance differences among the four groups. Group b outperforms Group a across both datasets, indicating the effectiveness of incorporating spectral domain information within the DFE block. Similarly, Group c shows considerable performance improvement over Group a, highlighting the effectiveness of using the TCN block to capture long-term dependencies. Notably, Group d exhibits the highest model performance among all groups. This suggests that the integration of DFE and TCN blocks yields the most substantial enhancement in model performance.

Conclusion
In this study, we introduced a Dual-domain intrusion detection model based on TCN and Transformer, named DTT. It integrates a frequency domain module with Transformer and TCN. We also utilized an efficient input encoding method tailored to the attention heads of Transformer. The Transformer with the spectral domain branch comprehensively extracts frequency and time domain information. This, coupled with TCN's capability to capture long-term dependencies, enhances the accuracy and efficiency of DTT for NID. As a result, DTT contributes to safeguarding network security and preventing system breakdowns. Extensive experiments conducted on two large-scale intrusion detection datasets demonstrated the performance of DTT, suggesting its potential to advance the field of NID. However, it is worth noting that this approach requires extensive pre-training, which may not be suitable for real-time intrusion detection scenarios. Additionally, the black-box nature of deep learning models poses challenges in gaining trust in network security management. In future work, we aim to optimize the model's real-time processing capabilities and explore interpretable methods for NID.
In light of these limitations, this paper introduces DTT, a Dual-domain intrusion detection model based on TCN and Transformer. The DTT model bridges the gap between time and frequency domain analysis and adopts robust input encoding strategies. This addresses the shortcomings of existing intrusion detection methods, particularly in capturing comprehensive feature information and enhancing model performance. The model comprises three modules: the FLP encoding module, the Dual-domain Feature Extraction (DFE) block, and the TCN. The FLP encodes the categorical and numerical fields together, aiming to preserve feature information in the data more effectively. The DFE block extracts both time and spectral domain information to learn high-frequency and low-frequency details. It enables comprehensive analysis of network traffic data, thereby enhancing the model's detection capabilities. The TCN module focuses on capturing long-term dependencies, thus supplementing global features. Experimental results demonstrate the superiority of the DTT model over recent intrusion detection methods on two public datasets. The contributions of this paper are summarized below.

Figure 2. Overall architecture of the DTT model.

Figure 4. Structure of DTT for each group.

Table 3. Effect of different input encoding on the model.

Table 4. Configuration of ablation experiment groups.

Table 5. Results of ablation experiments with DTT module (F1-score / Balanced Accuracy).

Performance comparison of the DFE block with varying numbers of layers
This experiment aims to examine how the number of layers of the DFE block influences the overall effectiveness of the model. Understanding how various layers affect performance helps us achieve optimal model design and selection. Fig. 6 displays the model diagram with varying numbers of layers. The results of the experiment are displayed in Table 6. They reveal that the DTT model performs more effectively on the NUB dataset when the number of layers is set to 2.

Table 6. Impact of various numbers of DFE layers on model performance.

EAI Endorsed Transactions on Scalable Information Systems Online First

Research comparison experiments
In this section, the DTT model proposed in this paper is compared with other recent intrusion detection models. To ensure fairness, the comparison experiments are conducted under the same conditions, including identical datasets, preprocessing methods, and input encoding methods. Details of the comparative outcomes are presented in Table 7. The DTT model exhibits improvements in F1-score ranging from 0.6 to 6.8 on the NCCI dataset and from 0.4 to 3.5 on the NUB dataset compared to other models. These findings indicate that our model surpasses other recent models under the same experimental conditions. Some studies excel in the preprocessing of the data or other aspects rather than the model itself.

Table 7. Impact of Various  on Model Performance.