Article

Abnormal Traffic Detection System Based on Feature Fusion and Sparse Transformer

1 State Grid Jiangsu Electric Power Co., Ltd., Information & Telecommunication Branch, Nanjing 210024, China
2 School of Cyber Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
* Author to whom correspondence should be addressed.
Submission received: 21 April 2024 / Revised: 15 May 2024 / Accepted: 22 May 2024 / Published: 24 May 2024

Abstract

This paper presents a feature fusion and sparse transformer-based anomalous traffic detection system (FSTDS). FSTDS encodes the traffic data sequences with a feature fusion network, extracting features through shallow and deep convolutional networks and fusing them into coding vectors; a deep encoder based on a sparse transformer then captures the complex relationships between network flows; finally, a multilayer perceptron classifies the traffic to detect anomalies. The feature fusion network improves feature extraction from small-sample data, the deep encoder enhances the understanding of complex traffic patterns, and the sparse transformer reduces computational and storage overhead while improving the scalability of the model. Experiments demonstrate that FSTDS reduces the parameter count by up to nearly half compared to the baseline while achieving an anomalous flow detection success rate close to 100%.

1. Introduction

With society’s increasing emphasis on renewable energy and the rapid development of distributed energy, distributed resource aggregation platforms play an increasingly critical role in the information interaction of power network enterprises. These platforms not only realize the monitoring, control, and optimal dispatch of distributed energy resources but also promote the intelligent and efficient operation of the network system. However, as the scale of distributed energy access grows, anomalous network traffic has become a serious challenge [1]. Accurate evaluation of the collected network traffic data is necessary for researchers to spot objectionable behaviors in the traffic. To improve the identification of anomalous network activity, network traffic packets are typically separated into flows based on the source IP, destination IP, source port, destination port, protocol, and timestamp [2]. With the rapid development of machine learning, detecting anomalous traffic through network traffic classification has become a hotspot of common concern among academia, industry, and network regulators. Its essence is to categorize mixed traffic into distinct traffic classes according to the features or attributes of the various network applications or protocols. On the one hand, the network security field needs to identify intrusion traffic; on the other hand, network management needs to classify the traffic of different applications so as to reasonably control and allocate resources and keep the network running smoothly. With the massive increase in data volume and the variety of network traffic, traditional classification methods struggle to meet these requirements, and deep learning-based algorithms have become a research hotspot for network traffic anomaly detection [3].
Abnormal network traffic may be caused by a variety of factors, such as malicious attacks, equipment failures, and data communication anomalies, which seriously affect the stable operation of the system and the quality of data interaction. Malicious attackers may exploit vulnerabilities or malicious code to interfere with the normal operation of distributed resource aggregation platforms and network enterprise systems, or even steal sensitive information. Meanwhile, device failures and misconfigurations may generate anomalous data that interferes with the operation of the network system. In addition, failures or congestion of data communication links may cause data transmission anomalies, creating problems for network information interaction. In such a complex environment, it is particularly important to recognize and handle anomalous traffic in a timely and accurate manner.
ChatGPT has experienced a significant increase in usage across academic, private, and professional sectors because of its capacity to respond to intricate prompts and produce human-like text of exceptional quality [4]. ChatGPT, like numerous other large language models, is built upon the transformer architecture. Transformer architectures are highly effective for natural language processing (NLP) tasks due to their ability to capture distant dependencies and interactions among the elements of a sequence without requiring prior domain-specific knowledge or feature engineering. Although first intended for NLP, the transformer architecture has demonstrated its versatility and power in capturing intricate patterns and relationships in several forms of sequential data, including images, graphs, and speech. Transformer-based models are therefore well suited to deep learning-based network traffic anomaly detection systems, which capture traffic data as sequences of packets or flows and must identify intricate patterns to perform effectively [5].
While many deep learning-based anomaly traffic detection systems have reached satisfactory detection performance, there remain three notable issues that have not yet been resolved:
  • Most traffic datasets are unbalanced. In reality, the majority of collected traffic is normal, and anomalous attack traffic is extremely under-represented by comparison. Under-sampling and over-sampling are commonly employed to address data imbalance; however, under-sampling removes data, losing certain features, while over-sampling introduces additional data, altering the original data distribution. Both approaches affect the accuracy of the experiment;
  • Conventional detection methods struggle to acquire prior information from the concealed aspects of past traffic. Although network communications follow a sequential pattern, current machine learning (ML)-based anomaly traffic detection research often ignores this sequential structure and instead classifies individual network flow records in isolation;
  • Transformer models tend to have very large parameter counts and high training costs. For example, the GPT model has 110 M parameters, the BERT model has 340 M, and the T5 model has as many as 11.7 B. This not only places higher demands on the hardware but also greatly increases the training time.
This paper proposes a feature fusion and sparse transformer-based anomalous traffic detection system (FSTDS). When the encoded traffic data sequence is input, a feature fusion network with two parallel convolutional branches extracts shallow and deep features from the preprocessed flow data and fuses them into encoded flow data vectors; a deep encoder based on a sparse transformer then captures the complex relationships between different network flows; finally, a multi-layer perceptron classifies the traffic to detect abnormal traffic. The main contributions are as follows:
  • Introducing a feature fusion network in the encoding stage improves feature extraction from small-sample data;
  • Combining it with a deep encoder enhances the capture of long-range information in complex network traffic patterns;
  • The sparse transformer reduces computing and storage overhead and increases the scalability of the model.

2. Related Work

2.1. NetFlow

Monitoring network traffic is a crucial component of network security and management. There are two primary techniques employed for this objective: packet-based monitoring and flow-based monitoring. Packet-based monitoring entails capturing the headers and payloads of packets as they travel over the network, while flow-based monitoring gathers summarized information based on the sequence of packets between two endpoints [6]. However, packet-based continuous monitoring is difficult to implement in large-scale networks due to its resource-intensive nature. Moreover, packet capture gives rise to privacy problems, as it has the potential to gather sensitive information. On the other hand, flow-based monitoring offers a significantly condensed overview of network traffic, which makes it a more efficient and adaptable option. Flow-based traffic export and collection are extensively utilized in expansive networks, with a plethora of technologies readily accessible for this purpose.
NetFlow is a popular protocol developed by Cisco for collecting and monitoring flow-based network traffic statistics. The operation involves combining a sequence of packets in a transmission, which can be either one-way or two-way, that share common attributes such as the same source and destination IP, source and destination port, and transfer protocol [7]. Bidirectional NetFlow has the ability to capture packets and bytes in both directions, along with other functionalities.

2.2. Traffic Anomaly Detection

Machine learning has become extensively utilized for detecting traffic anomalies thanks to the ongoing advancements in artificial intelligence and cloud computing technology. In 1980, Anderson [8] introduced the notion of intrusion detection technology with the goal of rapidly detecting anomalous behaviors in the network and mitigating the resulting losses. Anomaly detection technology continues to benefit from new techniques, whose purpose is to perceive anomalies with precise prediction accuracy and improve real-time prediction efficiency by extracting different patterns from network traffic to distinguish abnormal traffic from regular traffic.
Researchers have dedicated significant effort to proposing diverse network intrusion detection systems with the aim of identifying and thwarting irregular attacks on network traffic. Currently, the predominant machine learning techniques used in intrusion detection are supervised learning approaches, including random forest (RF) [9], K-nearest neighbor (KNN) [10], and support vector machine (SVM) [11]. Chowdhury et al. [12] suggested merging two machine learning techniques to categorize signature-based intrusions; their study utilized the simulated annealing approach to produce three random feature sets and employed the SVM algorithm to detect unusual behaviors. Yang Min and colleagues [13] introduced an advanced semi-supervised framework (ESeT) for network intrusion detection. The framework comprises a multi-level feature extraction module and a semi-supervised learning module, which leverage a limited quantity of labelled data to enhance detection performance. Nevertheless, because such approaches inherently recognize only familiar threats, and because of the constraints of conventional machine learning techniques, a significant proportion of advanced cyber attacks can currently evade signature repositories. The aforementioned methods exhibit a high false alarm rate and a low detection rate when identifying traffic irregularities; developing an efficient detection system remains an open research problem.
Furthermore, deep learning exhibits strong flexibility, self-organization, and generalization capabilities and can consequently address the issues that arise in conventional machine learning. In recent years, scholars have extensively researched the application of deep learning to enhance detection system efficiency. Tuor et al. [14] used deep neural network autoencoders for unsupervised network anomaly detection, using temporal aggregation statistics as features. Yan [15] constructed an intrusion detection system utilizing a convolutional neural network (CNN) and employed a generative adversarial network to produce synthetic attack traces. The empirical findings confirmed the efficacy of the approach.

2.3. Transformer Architecture

Although network communications are inherently sequential, current ML-based research often neglects sequential data and instead prioritizes the classification of individual network flow records in isolation. The introduction of the transformer architecture in the field of natural language processing marked a significant advancement in applying machine learning to sequential data. Vaswani [16] first built a transformer architecture based on the attention mechanism. Devlin et al. [17] introduced BERT, a novel language representation model that uses transformers to pre-train on unlabeled text by jointly conditioning on left and right context. BERT achieved state-of-the-art performance on 11 natural language processing tasks at the time.
In addition, effectively identifying a significant fraction of attacks in network traffic requires taking into account the extended patterns and attributes of the network [18]. The transformer’s ability to access long-range information helps detect complex network traffic patterns that may indicate an attack, so studying the transformer architecture has huge potential. Wu et al. [19] introduced a transformer-based intrusion detection system named RTIDS. This system utilizes position-embedding technology to establish connections between sequence information and features. Alkhatib et al. [20] showed that BERT can effectively learn arbitration identifier (ID) sequences in a controller area network (CAN) using a “masked language model” unsupervised training objective.
However, although combining the transformer architecture brings great improvements, applying transformers in the network field is not as straightforward as in natural language. While the architecture for processing text data with transformers is well established, the same is not true for network traffic: key decisions about how to feed network data in and how to generate classifications from the transformer output must be made independently of the transformer model itself [21]. To solve this problem, this paper proposes FSTDS, an anomaly traffic detection system based on feature fusion and a sparse transformer. FSTDS leverages the distinct benefits of feature fusion and the sparse transformer architecture to improve both the accuracy of detecting aberrant network traffic and the speed at which the model is trained.

3. System Design

The overall framework of FSTDS is shown in Figure 1. In this section, we delve into each key step of FSTDS, including preprocessing technology, feature fusion network, sparse transformer, and classification head.

3.1. Data Preprocessing

First, we collect the NetFlow streaming dataset of traffic through the NetFlow Analyzer tool and preprocess the dataset. The main steps of preprocessing are as follows:
  • One-hot encoding: The NetFlow streaming dataset contains categorical features that need to be translated into numerical values for optimal prediction results from our deep learning model. Thus, the categorical columns are transformed into numerical values using the get_dummies function from the pandas package in Python during preparation. We chose one-hot encoding over a label encoder because a label encoder produces multiple numbers in the same column, and the model may misinterpret the values as having a specific order, which would affect classification;
  • Normalization: Normalization refers to rescaling data to a given range in order to reduce redundancy and expedite model training. The study employs min–max normalization to rescale the data range to [0, 1]:
$$x_i' = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \qquad (1)$$
  • S-fold stratified cross-validation: Stratification involves reorganizing the data to guarantee that each subset is a reliable representation of the entire dataset. The stratified S-fold cross-validation technique partitions the dataset into S subsets; the model is trained on S−1 subsets and validated on the remaining subset, and this process continues until every fold has been used for validation. Stratification guarantees that every fold accurately represents the full dataset, facilitating parameter optimization and enhancing the model’s ability to categorize attacks. A minimal sketch of these preprocessing steps follows this list.
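To make the three steps concrete, the sketch below chains them in Python. The column names ("protocol", "label") and the fold count are illustrative assumptions, not fields prescribed by the datasets used here.

```python
# A minimal preprocessing sketch, assuming a pandas DataFrame `df` of NetFlow
# records with a hypothetical categorical column "protocol" and a numeric
# binary "label" column.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # One-hot encode categorical fields so the model cannot infer a
    # spurious ordering from integer-coded categories.
    df = pd.get_dummies(df, columns=["protocol"])
    # Min-max normalization rescales every numeric feature to [0, 1].
    num_cols = df.select_dtypes("number").columns.drop("label")
    df[num_cols] = (df[num_cols] - df[num_cols].min()) / (
        df[num_cols].max() - df[num_cols].min()
    )
    return df

# Stratified S-fold split (S = 5 here): every fold preserves the
# normal/anomalous class proportions of the full dataset.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# for train_idx, val_idx in skf.split(X, y): train on S-1 folds, validate on 1
```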

3.2. CNN-Based Feature Fusion

For the preprocessed stream records, the one-hot encodings of the categorical fields are concatenated with the numerical fields and fed into the feature fusion network as input. The feature fusion network structure is shown in Figure 2. In this paper, we denote the stride by the lowercase letter “s” and the number of folds in S-fold cross-validation by the uppercase letter “S”; this distinction keeps the two concepts clear throughout the paper.
The feature fusion network has two parallel convolutional branches. The first branch consists of two stacked convolutional layers: the first has a stride of 1, and the second has a stride of 2 and a kernel size of 3. The second branch consists of a convolutional layer with a kernel size of 3 and a stride of 1, followed by a pooling layer with a stride of 2. A padding size of 1 is used in both branches. To fully utilize the features obtained from the convolutional and pooling layers, the extracted features are combined to generate the encoded stream data.
In order to maintain the size of the input matrix during the convolution operation, it is necessary to execute a padding procedure:
$$X = \mathrm{Padding}(X_0, 1) = \begin{pmatrix} 0 & \cdots & \cdots & \cdots & 0 \\ \vdots & x_{11} & \cdots & x_{1W} & \vdots \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \vdots & x_{H1} & \cdots & x_{HW} & \vdots \\ 0 & \cdots & \cdots & \cdots & 0 \end{pmatrix} \qquad (2)$$
where $X_0$ represents the preprocessed stream record matrix, $W$ the width of the matrix, and $H$ its height.
The calculation process of the first branch of the feature fusion network is as follows:
$X_1^{(1)}$ represents the feature matrix obtained after the first convolution operation. Since the stride is 1, the output size remains unchanged.
$$X_1^{(1)} = X * C = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{k1} & \cdots & x_{kk} \end{pmatrix} * \begin{pmatrix} c_{11} & \cdots & c_{1k} \\ \vdots & \ddots & \vdots \\ c_{k1} & \cdots & c_{kk} \end{pmatrix} \qquad (3)$$
where $C$ is the convolution kernel matrix, $c_{ij}$ denotes a specific value within the kernel, and $k$ denotes the kernel size.
$$X_1^{(2)} = \mathrm{Padding}\bigl(X_1^{(1)}, 1\bigr) \qquad (4)$$
$X_1^{(3)}$ represents the feature matrix derived from the second convolution operation. Since the stride is 2, the output size is halved.
$$X_1^{(3)} = X_1^{(2)} * C = \begin{pmatrix} x_{11}^{(2)} & \cdots & x_{1k}^{(2)} \\ \vdots & \ddots & \vdots \\ x_{k1}^{(2)} & \cdots & x_{kk}^{(2)} \end{pmatrix} * \begin{pmatrix} c_{11} & \cdots & c_{1k} \\ \vdots & \ddots & \vdots \\ c_{k1} & \cdots & c_{kk} \end{pmatrix} \qquad (5)$$
The calculation process of the second branch of the feature fusion network is as follows:
$X_2^{(1)}$ represents the feature matrix extracted after the convolution operation. The stride is 1, so the output size does not change.
$$X_2^{(1)} = X * C = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{k1} & \cdots & x_{kk} \end{pmatrix} * \begin{pmatrix} c_{11} & \cdots & c_{1k} \\ \vdots & \ddots & \vdots \\ c_{k1} & \cdots & c_{kk} \end{pmatrix} \qquad (6)$$
The matrix $X_2^{(2)}$ is the result of applying max pooling, which halves the size of the output feature matrix.
$$X_2^{(2)} = \mathrm{Maxpooling}\bigl(X_2^{(1)}\bigr) = \max\bigl\{x_{ij}^{(1)}\bigr\}, \quad i, j \in [1, k] \qquad (7)$$
Then, the matrices produced by the two parallel convolutional branches are feature-fused to obtain the encoded traffic feature $X_f$:
$$X_f = X_1^{(3)} \oplus X_2^{(2)} \qquad (8)$$
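The two branches can be written down compactly in Keras (the experimental environment in Table 1 lists TensorFlow 2.0). In the sketch below, the strides, the kernel size of 3, the "same" padding, the max pooling, and the concatenation follow the description above; the channel counts, activations, input size, and the kernel size of the first convolution are assumptions.

```python
# A sketch of the two-branch feature fusion network; layer widths are
# illustrative, not the authors' exact configuration.
import tensorflow as tf
from tensorflow.keras import layers

def feature_fusion_network(inp):
    # Branch 1: two stacked convolutions; the second (stride 2) halves the
    # spatial size, yielding the deeper features X_1^(3).
    b1 = layers.Conv2D(16, 3, strides=1, padding="same", activation="relu")(inp)
    b1 = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(b1)
    # Branch 2: one convolution followed by max pooling with stride 2,
    # yielding the shallower features X_2^(2).
    b2 = layers.Conv2D(16, 3, strides=1, padding="same", activation="relu")(inp)
    b2 = layers.MaxPooling2D(pool_size=2, strides=2)(b2)
    # Fuse both branches into the encoded traffic feature X_f.
    return layers.Concatenate()([b1, b2])

inp = layers.Input(shape=(32, 32, 1))  # illustrative input size
model = tf.keras.Model(inp, feature_fusion_network(inp))
```

Both branches end at half the input resolution, so the concatenation along the channel axis is shape-compatible.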

3.3. Sparse Transformer Based on Memory Compression

Utilizing a transformer-based strategy offers numerous advantages to the system. The transformer’s support for parallel execution sets it apart from other approaches to processing sequential data, enabling efficient and scalable analysis of network data. Moreover, the transformer model adeptly captures intricate connections among various network flows, making it highly suitable for NetFlow processing.
The encoded stream data are input into a deep encoder based on a sparse transformer and converted into a fixed-length sequence of feature vectors. The encoder is composed of several blocks, each comprising a multi-head sparse attention block and a multilayer perceptron (MLP) block. Normalization is applied before each block, and a residual connection is applied after each block.
The attention computation multiplies the input vector sequence by the linear matrices $W_q$, $W_k$, and $W_v$ to generate the query vector $Q$, key vector $K$, and value vector $V$ for each position. The inner product of $Q$ and $K$ is computed and normalized to obtain the attention of each position with respect to all positions, which is then multiplied by $V$ to obtain the self-attended representation of the input sequence:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V \qquad (9)$$
To design a sparse self-attention block, we modify the transformer’s multi-head self-attention to reduce memory usage by limiting the dot product between $Q$ and $K$: strided convolution compresses $K$ and $V$ along the sequence length, while the number of queries $Q$ remains unchanged. First, a convolution sliding along the sequence length dimension downsamples that dimension to obtain smaller key and value matrices:
$$K_s^T = \mathrm{Conv}(K^T) = \mathrm{Conv}\begin{pmatrix} k_{11} & \cdots & k_{1N} \\ \vdots & \ddots & \vdots \\ k_{D1} & \cdots & k_{DN} \end{pmatrix} = \begin{pmatrix} k_{11}' & \cdots & k_{1L}' \\ \vdots & \ddots & \vdots \\ k_{D1}' & \cdots & k_{DL}' \end{pmatrix} \qquad (10)$$
$$V_s^T = \mathrm{Conv}(V^T) = \mathrm{Conv}\begin{pmatrix} v_{11} & \cdots & v_{1N} \\ \vdots & \ddots & \vdots \\ v_{D1} & \cdots & v_{DN} \end{pmatrix} = \begin{pmatrix} v_{11}' & \cdots & v_{1L}' \\ \vdots & \ddots & \vdots \\ v_{D1}' & \cdots & v_{DL}' \end{pmatrix} \qquad (11)$$
The memory compression principle of the sparse transformer is shown in Figure 3. $N$ is the initial length of the sequence, $D$ the dimension, and $L$ the length of the sequence after downsampling. The complexity of the sparse self-attention matrix multiplication is $O(NL)$, while the complexity of the standard $Q$-$K$ computation is $O(N^2)$. Because strided convolution is used for downsampling, $L$ is often half of $N$ or less. This sparse method greatly decreases the number of parameters that the transformer needs to train, better utilizes the information in the input sequence, and removes irrelevant information.
Then, the multi-head attention value is obtained by concatenating the attention of each head, and weighted fusion through the learnable parameter $W^o$ produces the final output sequence.
$$\mathrm{head}_i = \mathrm{Attention}\bigl(QW_i^Q, KW_i^K, VW_i^V\bigr) \qquad (12)$$
$$\mathrm{Multihead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^o \qquad (13)$$
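The sketch below illustrates the memory-compression idea for a single head: $K$ and $V$ are downsampled along the sequence length with a strided 1-D convolution before the usual scaled dot-product attention, shrinking the attention matrix from N × N to N × L. The stride of 2 matches the observation above that L is roughly half of N; everything else (layer sizes, a single head) is an illustrative assumption rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class CompressedSelfAttention(layers.Layer):
    """Single-head attention with memory-compressed keys and values."""
    def __init__(self, d_model, stride=2):
        super().__init__()
        self.wq = layers.Dense(d_model)
        self.wk = layers.Dense(d_model)
        self.wv = layers.Dense(d_model)
        # Strided convolutions compress K and V along the length axis.
        self.ck = layers.Conv1D(d_model, kernel_size=stride, strides=stride)
        self.cv = layers.Conv1D(d_model, kernel_size=stride, strides=stride)

    def call(self, x):                        # x: (batch, N, d_model)
        q = self.wq(x)                        # (batch, N, d)
        k = self.ck(self.wk(x))               # (batch, L, d), L ~ N / stride
        v = self.cv(self.wv(x))               # (batch, L, d)
        d = tf.cast(tf.shape(q)[-1], tf.float32)
        # Attention matrix is (batch, N, L) instead of (batch, N, N).
        scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d)
        return tf.matmul(tf.nn.softmax(scores, axis=-1), v)
```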

3.4. MLP-Based Anomaly Classifier

The transformer is a sequence-to-sequence model: both its input and output are sequences. For a traffic anomaly detection system, however, we want a classification as output. The classification head takes the output token sequence from the transformer components and transforms it into predictions for one or several classes. The primary obstacle for the classification head is dimensionality: transformers handle sequences of any length efficiently, but directly feeding the full output into a dense neural network can lead to rapid growth in the number of parameters. Classification heads are therefore usually designed to select particular components of, or condense, the information obtained from the transformer, avoiding the rise in dimensionality as the sequence length increases.
For each feature in the contextual representation, we use a densely connected multi-layer perceptron (MLP). This time-distributed technique enables the model to assign varying weights to different flows within the sequence. Typically, the classification task of a traffic anomaly detection system involves only the category of the last flow (e.g., benign or anomalous), so using only the embedding vector of the last context is sufficient. We use the last context embedding vector as the input of the multi-layer perceptron; thanks to multi-head self-attention in the transformer, it still incorporates information from the preceding flow data.
Next, we compute the loss during the training process through the binary cross-entropy loss function:
$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\Bigl[y_i \log p(y_i) + (1 - y_i)\log\bigl(1 - p(y_i)\bigr)\Bigr] \qquad (14)$$
where $p(y_i)$ is the probability output by the MLP for sample $i$, and $N$ is the total number of samples in the training set.
The number of predefined MLP hidden layer neurons is 128, and the dropout rate is 10%. The gradient descent method is used for parameter training to obtain the final model for malicious network traffic detection, enabling anomaly traffic detection.
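A minimal sketch of this head, using the stated hyperparameters (128 hidden neurons, 10% dropout) with a sigmoid output trained by binary cross-entropy; the hidden-layer activation is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def classification_head(last_embedding):      # (batch, d_model)
    # Only the last context embedding is fed in; self-attention has already
    # mixed in information from the preceding flows.
    h = layers.Dense(128, activation="relu")(last_embedding)
    h = layers.Dropout(0.1)(h)
    return layers.Dense(1, activation="sigmoid")(h)  # P(flow is anomalous)

# The loss of Formula (14): binary cross-entropy averaged over the batch.
loss_fn = tf.keras.losses.BinaryCrossentropy()
```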

4. Experiment

4.1. Experimental Setup

4.1.1. Experimental Environment

This study performed experiments on FSTDS in the specific setting described in Table 1.

4.1.2. Training Details

During model training, following FlowTransformer [4], we employed early stopping and an epoch limit. Early stopping was set to a patience of 10 epochs, and the maximum number of epochs was restricted to 40; we chose 40 because in our initial experiments most models had converged to within 1% of their final performance by the 40th epoch. We used the Adam optimizer with a learning rate of 0.025.
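A sketch of this training configuration in Keras; `model`, `X_train`, and `y_train` are placeholders for the earlier steps, and restoring the best weights on early stopping is an assumption.

```python
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.025),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
# Early stopping with 10 epochs of patience, capped at 40 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    patience=10, restore_best_weights=True
)
model.fit(X_train, y_train, epochs=40, validation_split=0.1,
          callbacks=[early_stop])
```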

4.1.3. Dataset

In this paper, we conduct a series of experiments using KDD-CUP99, UNSW-NB15, and Grid-Flow datasets.
  • KDD-CUP99 dataset: This dataset was used in the Third International Knowledge Discovery and Data Mining Tools Competition, held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The objective of the competition was to construct a network intrusion detector, a predictive model able to differentiate between “malicious” connections, known as intrusions or attacks, and “benign” normal connections. The dataset comprises a standardized collection of audit data covering a variety of intrusions simulated in a military network environment;
  • UNSW-NB15 dataset: The University of New South Wales released this dataset in 2015, and it has been extensively utilized since. The UNSW-NB15 dataset encompasses a wider range of attack families, a larger set of extracted features, and a distinct number of IP addresses employed for simulation and data collection [10]. The dataset combines genuine, up-to-date normal network traffic with current, extensive attack efforts;
  • Grid-Flow dataset: The Grid-Flow dataset consists of grid flows collected during October 2023 by the Information and Communication Branch of State Grid Jiangsu Electric Power Co., Ltd. on the distributed resource aggregation platform during information interaction between power grid enterprises, including but not limited to real-time flow data from various power sources, grid flow types, regional distribution, and peak and valley periods.
Table 2 shows the attack categories in the Grid-Flow dataset. The feature distributions of the KDD-CUP99 and UNSW-NB15 datasets follow the work of J. Sinha et al. [22], whose findings we utilize to support our analysis in this paper.

4.1.4. Evaluation Metrics

Several common metrics are used to evaluate the performance of binary classifiers:
  • True positive (TP)—the number of attack flows correctly classified as attacks;
  • False positive (FP)—the number of benign flows incorrectly classified as attacks;
  • True negative (TN)—the number of benign flows correctly classified as normal;
  • False negative (FN)—the number of attack flows incorrectly classified as normal.
Formula (15) defines accuracy (ACC), the percentage of all classifications that are correct. Precision, recall, and F1-score are defined in Formulas (16)–(18), respectively. The first two quantify how misclassifications affect the correct classifications, while the F1-score provides a comprehensive evaluation of the balance between precision and recall.
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (15)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (16)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (17)$$
$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (18)$$
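For reference, Formulas (15)–(18) reduce to a few lines of Python given the four confusion-matrix counts (the sketch assumes non-degenerate counts, i.e., no zero denominators):

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    acc = (tp + tn) / (tp + tn + fp + fn)               # Formula (15)
    precision = tp / (tp + fp)                          # Formula (16)
    recall = tp / (tp + fn)                             # Formula (17)
    f1 = 2 * precision * recall / (precision + recall)  # Formula (18)
    return {"acc": acc, "precision": precision, "recall": recall, "f1": f1}
```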

4.2. Overall Performance

We report the results of accuracy and F1-score evaluation of all methods in Table 3. The following observations are made:
  • Our proposed FSTDS model achieves the best performance: compared to existing methods, it improves by 1.78% and 3.67% on the KDD-CUP99 and UNSW-NB15 datasets, respectively. FSTDS outperforms all compared baselines on both datasets, showing relative ACC improvements of 1.76% and 0.88% over the strongest baseline, respectively, which demonstrates its effectiveness;
  • Among the conventional machine learning models on UNSW-NB15, the support vector machine (SVM) and K-nearest neighbor (KNN) achieve F1-scores of 92.96% and 93.17%, respectively, while isolation forest (IF) performs worst with an F1-score of 55.05%. Comparing these models to the deep learning models shows a substantial enhancement in accuracy: in particular, the convolutional neural network (CNN) and transformer models significantly outperform all traditional ML models in the comparison. This demonstrates the effectiveness of our proposed FSTDS in practical application, affirming its superiority over both conventional ML and sophisticated deep learning methods in terms of model precision;
  • In Table 3, the performance of FSTDS and the other baselines on the datasets is evaluated using ACC and F1. The superior performance of FSTDS on both datasets demonstrates that the feature fusion and sparse transformer architecture significantly improve accuracy on security-related tasks.

4.3. Ablation Study

In this section, we discuss the changes in parameter count and the impact on experimental results after introducing the sparse transformer.
As shown in Figure 4, introducing the sparse transformer significantly reduces the parameter count while leaving detection performance largely unchanged. For example, on the UNSW-NB15 dataset, the number of parameters decreases by 47.95% compared to the baseline, while the accuracy remains essentially unchanged, dropping only from 99.39% to 99.34%. This illustrates the effectiveness of our proposed sparse transformer for detecting network traffic anomalies.

4.4. Visualization

We randomly selected 2000 traffic records from the Grid-Flow dataset to visualize their embedding representations before and after passing through FSTDS. The dimensionality was first reduced to 20 by PCA [30] and then to 2 by UMAP [31]. The visualization results are shown in Figure 5.
As Figure 5 shows, before entering FSTDS, the feature representations of the traffic data are chaotically distributed in the feature space, and no single feature distinguishes abnormal from normal traffic. After FSTDS classification, abnormal and normal traffic are clearly separated in the feature space.
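A sketch of this visualization pipeline, assuming the scikit-learn and umap-learn packages; `embeddings` stands in for the 2000 sampled flow representations.

```python
import umap
from sklearn.decomposition import PCA

# First reduce to 20 dimensions with PCA, then to 2 with UMAP.
reduced = PCA(n_components=20).fit_transform(embeddings)   # (2000, 20)
coords = umap.UMAP(n_components=2).fit_transform(reduced)  # (2000, 2)
# `coords` can then be scatter-plotted and colored by normal/anomalous label.
```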

5. Conclusions

In this paper, we propose a powerful abnormal traffic detection system based on feature fusion and a sparse transformer, called FSTDS, for accurate identification of abnormal traffic in the network. FSTDS provides an integrated abnormal traffic detection solution consisting of four modules: preprocessing technology, a feature fusion network, a sparse transformer, and a classification head. We designed a feature fusion network for encoding the input traffic data sequences, which uses two parallel convolutional branches to extract shallow and deep features from the preprocessed flow data and fuses them into encoded flow data vectors. Subsequently, a deep encoder based on a sparse transformer captures the complex relationships between different network flows. Finally, a multi-layer perceptron classifies the traffic to detect abnormal traffic. The experimental results demonstrate that FSTDS outperforms the mainstream classical and deep learning intrusion detection algorithms used in abnormal traffic detection systems.
The traffic detection system combining feature fusion and a sparse transformer performs exceptionally well at anomalous traffic detection; however, the model’s scalability is limited, and given the evolving complexity of network environments and the emergence of novel attack types, the accuracy of detecting novel attack traffic that does not occur in the training set still requires improvement.
Our future work will focus on improving the speed of the sparse transformer in the abnormal traffic intrusion detection system so as to significantly reduce the damage caused by abnormal events. Furthermore, we will consider adopting prompt learning methods to address the few-shot classification problem.

Author Contributions

Methodology, X.Z. and Y.J.; validation, X.Z., W.M. and G.Y.; formal analysis, X.Z., W.M. and S.Z.; investigation, X.Z., G.Y. and Y.J.; resources, Q.L.; writing—original draft, X.Z.; writing—review and editing, X.Z. and W.M.; visualization, X.Z.; supervision, Q.L.; project administration, W.M.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Project of State Grid Jiangsu Electric Power Company Ltd., under Grant J2023124.

Data Availability Statement

The data will be made available by the authors on request.

Conflicts of Interest

Authors Xinjian Zhao, Weiwei Miao, Guoquan Yuan and Song Zhang were employed by the company State Grid Jiangsu Electric Power Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Wang, K.; Fu, Y.; Duan, X.; Liu, T.; Xu, J. Abnormal traffic detection system in SDN based on deep learning hybrid models. Comput. Commun. 2024, 216, 183–194. [Google Scholar] [CrossRef]
  2. Xu, J. Research on abnormal traffic detection strategy for the Internet of Things based on machine learning. Software 2022, 43, 162–164. [Google Scholar]
  3. Farnaaz, N.; Jabbar, M.A. Random forest modeling for network intrusion detection system. Procedia Comput. Sci. 2016, 89, 213–217. [Google Scholar] [CrossRef]
  4. Manocchio, L.D.; Layeghy, S.; Lo, W.W.; Kulatilleke, G.K.; Sarhan, M.; Portmann, M. Flowtransformer: A transformer framework for flow-based network intrusion detection systems. Expert Syst. Appl. 2024, 241, 122564. [Google Scholar] [CrossRef]
  5. Zhang, J.; Zulkernine, M.; Haque, A. Random-forests-based network intrusion detection systems. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2008, 38, 649–659. [Google Scholar] [CrossRef]
  6. Li, Y.; Miao, R.; Kim, C.; Yu, M. FlowRadar: A better NetFlow for data centers. In Proceedings of the 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), Santa Clara, CA, USA, 17–18 March 2016. [Google Scholar]
  7. Sarhan, M.; Layeghy, S.; Moustafa, N.; Portmann, M. NetFlow datasets for machine learning-based network intrusion detection systems. In Big Data Technologies and Applications: 10th EAI International Conference, BDTA 2020, and 13th EAI International Conference on Wireless Internet, WiCON 2020, Virtual Event, 11 December 2020, Proceedings 10; Springer International Publishing: Berlin/Heidelberg, Germany, 2021. [Google Scholar]
  8. dos Santos, F.P.; Ribeiro, L.S.; Ponti, M.A. Generalization of feature embeddings transferred from different video anomaly detection domains. J. Vis. Commun. Image Represent. 2019, 60, 407–416. [Google Scholar] [CrossRef]
  9. Kuang, F.; Xu, W.; Zhang, S. A novel hybrid KPCA and SVM with GA model for intrusion detection. Appl. Soft Comput. 2014, 18, 178–184. [Google Scholar] [CrossRef]
  10. Salman, O.; Elhajj, I.H.; Chehab, A.; Kayssi, A. A machine learning based framework for IoT device identification and abnormal traffic detection. Trans. Emerg. Telecommun. Technol. 2022, 33, e3743. [Google Scholar] [CrossRef]
  11. Reddy, R.R.; Ramadevi, Y.; Sunitha, K.V.N. Effective discriminant function for intrusion detection using SVM. In Proceedings of the 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Jaipur, India, 21–24 September 2016; IEEE: Piscataway, NJ, USA, 2016. [Google Scholar]
  12. Nam, M.; Park, S.; Kim, D.S. Intrusion detection method using bidirectional GPT for in-vehicle controller area networks. IEEE Access 2021, 9, 124931–124944. [Google Scholar] [CrossRef]
  13. Chowdhury, M.N.; Ferens, K.; Ferens, M. Network intrusion detection using machine learning. In Proceedings of the International Conference on Security and Management (SAM), Las Vegas, NV, USA, 25–28 July 2016; The Steering Committee of the World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp): Las Vegas, NV, USA, 2016. [Google Scholar]
  14. Li, Y.; Yuan, X.; Li, W. An Extreme Semi-supervised Framework Based on Transformer for Network Intrusion Detection. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 4204–4208. [Google Scholar]
  15. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 84–90. [Google Scholar]
  16. Tuor, A.; Kaplan, S.; Hutchinson, B.; Nichols, N.; Robinson, S. Deep learning for unsupervised insider threat detection in structured cybersecurity data streams. In Proceedings of the Workshops at the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  17. Yan, Q.; Wang, M.; Huang, W.; Luo, X.; Yu, F.R. Automatically synthesizing DoS attack traces using generative adversarial networks. Int. J. Mach. Learn. Cybern. 2019, 10, 3387–3396. [Google Scholar] [CrossRef]
  18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  19. Jiang, Y.; Liang, L.; Li, Q. Black-box Speech Adversarial Attack with Genetic Algorithm and Generic Attack Ideas. In 2023 IEEE International Conference on Dependable, Autonomic and Secure Computing, International Conference on Pervasive Intelligence and Computing, International Conference on Cloud and Big Data Computing, International Conference on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech); IEEE: Piscataway, NJ, USA, 2023. [Google Scholar]
  20. Wu, Z.; Zhang, H.; Wang, P.; Sun, Z. RTIDS: A robust transformer-based approach for intrusion detection system. IEEE Access 2022, 10, 64375–64387. [Google Scholar] [CrossRef]
  21. Shi, D.; Xia, Y.; Peng, T. Network abnormal traffic detection model based on semi-supervised deep reinforcement learning. IEEE Trans. Netw. Serv. Manag. 2021, 18, 4197–4212. [Google Scholar]
  22. Sinha, J.; Manollas, M. Efficient deep CNN-BiLSTM model for network intrusion detection. In Proceedings of the 2020 3rd International Conference on Artificial Intelligence and Pattern Recognition, Chengdu, China, 28–30 August 2020. [Google Scholar]
  23. Yun, S.; Jeong, M.; Kim, R.; Kang, J.; Kim, H.J. Graph transformer networks. arXiv 2019, arXiv:1911.06455. [Google Scholar]
  24. Li, W.; Yi, P.; Wu, Y.; Pan, L.; Li, J. A new intrusion detection system based on KNN classification algorithm in wireless sensor network. J. Electr. Comput. Eng. 2014, 2014, 240217. [Google Scholar] [CrossRef]
  25. Duan, X.; Fu, Y.; Wang, K. Network traffic anomaly detection method based on multi-scale residual classifier. Comput. Commun. 2023, 198, 206–216. [Google Scholar] [CrossRef]
  26. Yin, C.; Zhu, Y.; Fei, J.; He, X. A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access 2017, 5, 21954–21961. [Google Scholar] [CrossRef]
  27. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. arXiv 2013, arXiv:1310.4546. [Google Scholar]
  28. Deng, H.; Li, X. Network traffic anomaly identification and detection based on deep learning. Comput. Syst. Appl. 2023, 32, 274–280. [Google Scholar]
  29. Alkhatib, N.; Mushtaq, M.; Ghauch, H.; Danger, J.-L. Can-bert do it? controller area network intrusion detection system based on bert language model. In Proceedings of the 2022 IEEE/ACS 19th International Conference on Computer Systems and Applications (AICCSA), Abu Dhabi, United Arab Emirates, 5–8 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–8. [Google Scholar]
  30. Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2, 559–572. [Google Scholar] [CrossRef]
  31. McInnes, L.; Healy, J.; Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426. [Google Scholar]
Figure 1. FSTDS overall framework diagram.
Figure 2. Feature fusion network structure.
Figure 3. The difference in memory between standard self-attention and sparse self-attention.
Figure 4. Comparison of ablation results before and after introducing the sparse transformer.
Figure 5. The embedding representation before and after the traffic enters FSTDS.
Table 1. Experimental environment.
CPU: Intel(R) Core(TM) i5-10400 @ 2.90 GHz (Intel Corporation, Santa Clara, CA, USA)
RAM: 16 GB DDR4 (Kingston Technology, Fountain Valley, CA, USA)
GPU: NVIDIA GeForce RTX 2060 6G (NVIDIA Corporation, Santa Clara, CA, USA)
Compiler environment: Python 3.8 + TensorFlow 2.0
Operating system: Windows 10 (Microsoft Corporation, Redmond, WA, USA)
Table 2. Grid-Flow dataset attack categories.
Category | Count
Normal | 898,195
DoS | 4268
Brute Force | 167,412
Spoofing | 25,112
DDoS | 259,565
Recon | 24,829
Web-based | 35,738
Mirai | 1364
Total | 1,416,483
Table 3. Performance of FSTDS on KDD-CUP99, UNSW-NB15, and Grid-Flow.
Dataset | Method | Acc | F1
KDD-CUP99 | GAN [23] | 70.80 | 87.00
KDD-CUP99 | HELAD | 90.10 | 89.10
KDD-CUP99 | uPU [24] | 84.14 | 81.55
KDD-CUP99 | nnPU [24] | 89.62 | 89.14
KDD-CUP99 | VPU [25] | 90.30 | 89.58
KDD-CUP99 | NB | 98.00 | 98.00
KDD-CUP99 | SVM | 99.00 | 99.00
KDD-CUP99 | FastRNN [26] | 99.60 | 99.71
KDD-CUP99 | FSTDS | 99.73 | 99.75
UNSW-NB15 | SVM | 99.13 | 92.96
UNSW-NB15 | BERT Model | 98.90 | 76.23
UNSW-NB15 | CNN | 99.23 | 93.51
UNSW-NB15 | IF [27] | 86.33 | 55.05
UNSW-NB15 | NB [28] | 99.25 | 93.68
UNSW-NB15 | KNN | 99.30 | 93.17
UNSW-NB15 | NN [29] | 99.25 | 93.68
UNSW-NB15 | FSTDS | 99.34 | 93.75
Grid-Flow | GAN [23] | 82.26 | 81.69
Grid-Flow | CNN | 98.12 | 86.75
Grid-Flow | IF [27] | 84.97 | 79.23
Grid-Flow | NN [29] | 97.25 | 90.68
Grid-Flow | FSTDS | 98.47 | 91.77
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
