Article

A Comparison Study of Generative Adversarial Network Architectures for Malicious Cyber-Attack Data Generation

by Nikolaos Peppes 1, Theodoros Alexakis 1, Konstantinos Demestichas 2 and Evgenia Adamopoulou 1,*

1 School of Electrical and Computer Engineering, National Technical University of Athens, Zografou, 15773 Athens, Greece
2 Department of Agricultural Economics and Rural Development, Agricultural University of Athens, 11855 Athens, Greece
* Author to whom correspondence should be addressed.
Submission received: 26 April 2023 / Revised: 11 June 2023 / Accepted: 12 June 2023 / Published: 14 June 2023
(This article belongs to the Special Issue Machine/Deep Learning: Applications, Technologies and Algorithms)

Abstract:
The digitization trend that prevails nowadays has led to increased vulnerabilities in the tools and technologies of everyday life. Botnets are one of many types of software vulnerabilities and attacks: they enable attackers to gain remote control of infected machines, often with disastrous consequences. Cybersecurity experts engage machine learning (ML) and deep learning (DL) technologies to design and develop smart, proactive cybersecurity systems in order to tackle such infections. The development of such systems is often hindered by a lack of data that can be used to train them. To address this problem, this study proposes and describes a methodology for generating botnet-type data in tabular format. The methodology involves the design and development of two generative adversarial network (GAN) models, one with six layers and one with eight, in order to identify the more efficient and reliable of the two in terms of the similarity of the generated data to the real data. Each GAN model produced data after training for 25, 50, 100, 250, 500 and 1000 epochs. The results are quite encouraging: for both models, the similarity between the synthetic and the real data is around 80%. The eight-layer solution performs slightly better; after running for 1000 epochs, it achieved a similarity degree of 82%, outperforming the six-layer one, which achieved 77%. These results indicate that such data augmentation solutions in the cybersecurity domain are feasible and reliable and can lead to new standards for developing and training trustworthy ML and DL solutions for detecting and tackling botnet attacks.

1. Introduction

The digitization of almost every aspect of humans’ daily lives has led to an increasing number of connected sensors and digital devices according to the Internet of Things (IoT) paradigm. The online presence of such devices renders them vulnerable to cyber-attacks of various kinds. Cyber-attacks exist in various forms and formats such as DDoS, malware, ransomware and botnets. It is indicative that in 2022 there was a 38% increase in cyber-attacks compared to 2021, according to the Check Point Research blog [1]. Botnets are one of the major cyber-attack types since they can affect many devices in parallel, especially in IoT networks.
The increasing complexity of modern networks, due to the large number of interconnected devices, also increases the number of vulnerabilities exposed to potential attackers. This creates a major problem for cybersecurity experts, as new types of attacks, or more complex variants of existing ones, constantly appear. These ever-expanding capabilities of attackers, combined with outdated cybersecurity systems and the lack of data for properly training advanced systems, impose the need for new technologies and tools able to tackle even unforeseen attacks.
Most cyber-attacks are addressed after they happen, and cybersecurity experts are often called to react and minimize the consequences. In this light, the data available on cyber-attacks mainly consist of data from previous attacks. In practice, this means that almost all cybersecurity systems are trained on past attacks and are therefore vulnerable to new attack forms. Moreover, most organizations do not share attack data, so such data are scarce, which leads to inefficient training of ML models.
During the last two decades, network threats, including botnets, have attracted the interest of the academic and research community. Shinan et al. [2] worked on a review article about machine learning and botnet detection, where they define a “bot” as a software program installed on a compromised host which can perform a series of malicious activities. Usually, bots are installed using various ways such as back doors, external drives and infected websites or files. Botnets are defined as a group of bot-infected hosts which wait for instructions/commands in order to execute malicious code or other activities on the network to which they are attached [3,4].
Botnets are a major threat to the security and stability of computer networks worldwide. These networks of compromised devices can be used for a range of malicious activities, including distributed denial-of-service (DDoS) attacks, spamming, phishing and data theft [2]. One of the challenges in combating botnets is their ability to evolve and adapt to changes in the security landscape. To address this challenge, this study examines the use of generative adversarial networks (GANs) for generating botnet data samples. The quality of the generated data, in terms of similarity to the original data, is of utmost importance in order to assure that the synthetically generated data are equivalent in terms of quality for training purposes of machine learning models. Thus, the main goal and the added value of the present work can be summarized as follows:
  • Exploring different GAN architectures for data augmentation.
  • Examining and highlighting the significance of GANs in the generation process of data that can aid ML solutions.
  • Providing a data quality assessment procedure through graphical data quality Indicators (DQIs), including cumulative sums, absolute log mean and standard deviation (STD) diagrams, as well as correlation matrices along with heatmaps.
  • Evaluating the results of the DQIs and indicating the feasibility of the proposed solution.
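The graphical DQIs listed above reduce to simple per-column statistics. As a minimal sketch (the toy columns and the final-sum ratio below are illustrative assumptions, not the computation used in this study), cumulative sums and absolute log mean/STD values for a real and a synthetic column could be computed as follows:

```python
import math
import statistics

def cumulative_sum(values):
    """Running total of a numeric column, as plotted in cumulative-sum DQIs."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out

def abs_log_mean_std(values):
    """Absolute log of a column's mean and standard deviation."""
    return (abs(math.log(abs(statistics.mean(values)))),
            abs(math.log(statistics.stdev(values))))

# Toy stand-ins for a real feature column and its GAN-generated counterpart.
real = [1.0, 2.0, 3.0, 4.0, 5.0]
synthetic = [1.1, 1.9, 3.2, 3.8, 5.1]

real_cs, synth_cs = cumulative_sum(real), cumulative_sum(synthetic)
real_lm, synth_lm = abs_log_mean_std(real), abs_log_mean_std(synthetic)

# A crude column-level similarity signal: ratio of the final cumulative sums.
similarity = min(real_cs[-1], synth_cs[-1]) / max(real_cs[-1], synth_cs[-1])
print(round(similarity, 3))  # 0.993
```

In practice, such per-column statistics are compared side by side for every feature of the real and synthetic datasets, with heatmaps and correlation matrices covering the cross-feature structure.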
The remainder of this paper is structured as follows: In Section 2, related works in the domain of botnet attack generation methods are explored. In Section 3, the methodology of designing the GAN models is detailed, while Section 4 delves into the assessment of the synthetic dataset’s value. In conclusion, Section 5 presents and discusses the noteworthy findings.

2. Related Works

The continuous increase in botnet attacks and their damaging effects on individuals and organizations alike has created a growing need for new solutions or the enhancement of existing ones. This need has also attracted academic and research interest; thus, several studies focusing in depth on botnet attack generation methods are available. Given the focus of this study, which is to evaluate generated botnet attack samples, the remainder of this section reviews relevant works published during the last decade that mainly concern cyber-attack data generation using GANs.
Firstly, an interesting approach regarding network traffic data generation was presented by Anande and Leeson [5]. Their study indicated that solutions engaging GANs can overcome the limitations of other approaches such as Poisson models. More specifically, the authors studied the findings of six different GANs: ITCGAN (Imbalanced Traffic Classification GAN) [6], PAC-GAN (Packet GAN) [7], PcapGAN [8], Facebook Chat Network Traffic GAN [9], Flow-Based Network Traffic Generation GAN [10] and ZipNet GAN [11]. All of these approaches feature certain strengths; e.g., ITCGAN paved a new way for addressing the data imbalance issue, whilst PAC-GAN achieved the generation of network traffic flows at the packet byte level [5]. Another interesting approach, presented by Yin et al. [12], was NetShare, an end-to-end framework that used a GAN architecture to produce IP-header trace data. NetShare achieved 46% better fidelity than other baselines, as well as a better scalability–fidelity trade-off compared to other existing solutions [12]. In a similar direction, Wu et al. [13] proposed the Synthetic Packet Traffic Generative Adversarial Network (SPATGAN) approach. The SPATGAN framework consisted of a server agent and a client agent that simulate the exchange of packets over a network. More specifically, this architecture engaged two further GANs, namely the Timing Synthesis GAN (TSynGAN) and the Packet Synthesis GAN (PSyGAN), one for each agent. The results indicated that the distribution of the generated samples was very close to that of the real ones. Furthermore, the Fréchet Traffic Distance (FTD) score indicated that the distribution of the synthetically generated data was also very close to other random distributions [13].
Zhong et al. [14] presented the MalFox solution, which aims to demonstrate the inefficiency of existing black-box detectors. MalFox is based on a convolutional GAN, and its ultimate aim is to bypass malware detectors. Their solution adopts a confrontational approach to produce perturbation paths, each formed by up to three methods (namely Obfusmal, Stealmal and Hollowmal), to generate adversarial malware examples. The performance results obtained were quite encouraging, with an accuracy of around 99%, while the detection rate of the generated samples dropped to almost 45% [14]. The significance of GANs for data augmentation, especially in the cybersecurity domain, was also demonstrated by the Conditional Tabular GAN (CTGAN) model proposed by Habibi et al. [15]. In their work, the authors tested different CTGAN versions with alternate parameters to identify the most efficient one. The results of their experiments indicated that the CTGAN could preserve the structure of both continuous and discrete data. This provides a solution for ML classifiers and detectors by alleviating dataset imbalance and, also, by training these ML algorithms for unknown threats, as the data generated by the GANs are new and unseen [15]. Lingam et al. [16] also performed a study on imbalanced data related to bot identification. They aimed to address the problem of imbalanced data for ML classifiers by engaging a GAN with a gated recurrent unit (GRU). This enabled them to produce synthetic data very similar to the real data and, thus, to balance the classes of benign users and bots. The results indicated that their solution outperformed ML solutions trained only with the original Twitter dataset, and the methods evaluated with the GAN-generated dataset achieved an average precision of around 91% [16].
Yin et al. [17] also focused on a solution that enhances botnet detection. Their study included a GAN which could create almost realistic botnet attack samples in order to train better ML classifiers. The proposed Bot-GAN constantly provided “fake” data to the discriminator, which classified the samples through a softmax function. In this way, accuracy and precision were improved compared to models pretrained with the original imbalanced dataset [17]. Following the same path of lifting the limitations of an imbalanced dataset, Song et al. [18] proposed the GAN-Efficient Lifelong Learning Algorithm (ELLA) solution. Their approach indicated that a dataset expanded via a GAN architecture enhanced both the results of traditional ML solutions for identifying botnets and the lifelong learning approach of the ELLA algorithm [18]. Furthermore, Saurabh et al. [19] proposed the GANIBOT solution, a semi-supervised GAN model for IoT botnet detection. Their solution engaged a GAN model which performed better for binary classification of the N-BaIoT dataset [20], as well as a supervised learning classifier for multiclass classification. Both were integrated into a common framework to provide a semi-supervised solution for botnet detection; thus, their solution is referred to as a semi-supervised GAN (SGAN). The results of their study indicated that the SGAN outperformed other ML methods such as artificial neural networks (ANNs) and convolutional neural networks (CNNs) and, after fine-tuning of the model parameters, achieved even better results in terms of both accuracy and computational efficiency [19]. In order to tackle an imbalanced dataset and ingest more samples of the minority class, Kalleshappa and Savadatti [21] proposed a GAN with many convolutional layers.
Their solution proved quite efficient, as the fused bidirectional long short-term memory attention model (SFBAM) achieved an F1-score of around 96% on the IoT-23 dataset when it was augmented with GAN-generated samples [21]. Following a similar path, Randhawa et al. [22] proposed the Botshot framework. Botshot is based on two GANs, one vanilla and one conditional, which augment already imbalanced IoT datasets so as to enhance the performance of ML classifiers [22].
In Table 1, an overview of the related works is presented, along with some key observations. The Data column indicates the type of data used for each study, the Result column briefly describes the main findings/contribution of the corresponding study, the Multiple GAN Architectures column indicates if this study contains more than one GAN implementation, the ML Classifiers column indicates if a study uses ML classifier performance to evaluate GANs, and finally, the GAN Layers column gives information on the number of layers used for generator/discriminator implementations.
As can be seen from Table 1, most of the research efforts discussed in this section do not examine more than one GAN implementation. Habibi et al. [15], who presented two GAN implementations, differentiate them only by performing a feature selection process on the input dataset. In addition, Randhawa et al. [22] engaged two different GAN implementations, without diving deep into the implementation details, but rather evaluated the generated data by using the precision, recall and F1-score over six different ML classifiers trained with them. Another point of the relevant literature that was considered in the context of the present study is the number of layers used. As can be observed, the corresponding median value is close to six. Thus, after studying and examining related works, the present study concluded that almost none of them presented a detailed description of the different GAN implementations developed. To this end, herein, the generated data are evaluated using data quality indicators which provide an explanatory and in-depth data profiling of those data, always in comparison to the original ones. Therefore, this study focuses on GANs and synthetically generated data and not on the performance of ML or DL classifiers or detectors. Its focal point is the step before ML and DL training with specific attention to data quality and similarity to the original data that are scarce and difficult to acquire in the domain of cybersecurity.
As can be inferred from the related works presented in this section, the need for data augmentation is of utmost importance in the cybersecurity domain. Thus, this study is mainly focused on data generation for dataset augmentation concerning botnet attacks. Through the methodology presented in the next section and the results provided in Section 4, different GAN models are evaluated using quality metrics in order to explore the generated data quality.

3. Material and Methods

3.1. Dataset

The CTU-13 dataset [23] provided by Stratosphere IPS is a collection of network traffic captures that have been widely used in the field of cybersecurity research. This dataset contains thirteen captures of different malware samples that were captured in a controlled environment, as well as one capture of normal traffic. The logs were collected over a period of several months and contain a total of more than 32 million packets. Each capture is labeled with information about the malware sample it contains, including the type of malware, the timestamp of the capture, and the network protocols used. The dataset was collected in a controlled environment, with the botnet activity being generated by a malware family known as “Neris”.
The dataset contains a total of 52 pcap files, each corresponding to a network traffic capture. The files are labeled according to their type, with the “botnet” captures numbered from 1 to 5 and the “normal” captures numbered from 11 to 15. The captures were taken at different times during the botnet attack, with the first capture taken shortly after the botnet was activated and the subsequent captures taken at different intervals throughout the duration of the attack.
Each capture in the dataset includes detailed information about the network traffic, the source and destination IP addresses, the network protocols used, and the timestamps of each packet. This dataset is commonly used in the research and evaluation of network security systems, as well as for the development of ML models for botnet detection. It provides a realistic and representative sample of the types of network traffic that may be encountered in a real-world botnet attack, making it a valuable resource for researchers and practitioners in the field of cybersecurity. Table 2 presents the data type of each of the 15 features included in the initial dataset used.
In the current study, the (original) CTU-13 dataset used consisted of 13 captures of different, labeled network traffic data with normal and botnet (malware) traffic samples. Overall, the dataset includes over 32 million packets. The training dataset consists of 216,352 records, out of which 140,849 records are marked as “0”, denoting malware, and the remaining 75,503 records are labeled as “1”, signifying legitimate activity. The evaluation dataset, on the other hand, is composed of 88,258 unlabeled records. The training dataset encompasses 57 features, including the ID of each sample; these features were reduced for the purposes of the present study, as described in the paragraphs below. Figure 1 provides some general statistics of the (initial) dataset before the preprocessing (feature selection) stage. As can be seen, the dataset consists of 15 features: 1 Boolean, 6 numeric and 8 categorical.
Figure 2 provides some general statistics of the dataset after the feature selection stage. The resulting dataset, following feature selection (the preprocessing stage), includes 6 features in total: 1 Boolean and 5 numeric.

3.2. Feature Selection

The selection of relevant independent features from the initial dataset is a crucial step in the preprocessing (or feature selection) stage that precedes the training and evaluation procedures of a deep neural network. This process, also known as feature engineering, aims to enhance the performance of the predictive model by choosing and working only with the necessary features [24]. However, the execution of such procedures comes with a cost as it requires significant computational resources. Some popular techniques for feature selection, as outlined in [25], that were also applied during the preprocessing stage are presented as follows:
  • Feature selection—the process of selecting a subset of relevant features from a larger set of features in the dataset to be used to train a machine learning model. The objective of feature selection is to improve the accuracy and efficiency of the model by reducing the dimensionality of the data by removing irrelevant or redundant features. Some of the most popular feature selection methods include univariate feature selection, recursive feature elimination, principal component analysis and feature importance estimation using decision trees or random forests.
  • Feature selection graphs, also known as scree plots, can be used to visualize the results of feature selection techniques. These plots display the performance of a predictive model against the number of features used: the x-axis represents the number of selected features, while the y-axis displays the performance metric produced by the selected method (e.g., accuracy, area under the ROC curve (AUC), F1-score).
  • Correlation matrix with heatmap—a data visualization technique used to depict the correlation between the features of a dataset in a graphical form. To this end, this method extracts a color-coded matrix to display the strength and direction of the linear relationship between two or more variables of the dataset used.
The first step performed during the feature engineering procedure was the detection and removal of any non-available (missing) values from the provided dataset. Next, the categorical features in the dataset were identified and transformed using a label encoder, according to their respective values. Subsequently, the study applied the two previously selected feature selection methods to identify the features most relevant to the independent “Label” (namely “Legitimate” or “Malware”) element.
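The label-encoding step above can be illustrated with a minimal sketch. The protocol values below are hypothetical, and this is a simplified stand-in for the encoder actually used in the study:

```python
def fit_label_encoder(values):
    """Map each distinct categorical value to an integer code (sorted for determinism)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

# Hypothetical values of a categorical "Proto" column.
proto = ["tcp", "udp", "icmp", "tcp", "udp"]
encoder = fit_label_encoder(proto)
encoded = [encoder[v] for v in proto]
print(encoder)   # {'icmp': 0, 'tcp': 1, 'udp': 2}
print(encoded)   # [1, 2, 0, 1, 2]
```

Sorting the distinct values before assigning codes makes the mapping reproducible across runs, which matters when the same encoder must be applied to both the training and evaluation splits.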
Feature selection is a critical step in the machine learning pipeline, as it can greatly impact the performance of the model and its ability to generalize to new data. There are several methods for performing feature selection, ranging from manual selection to automated algorithms. Manual feature selection relies on expert domain knowledge and intuition to identify the relevant features, whereas automated methods use statistical or machine learning techniques to evaluate the importance of each feature in the dataset. In this study, univariate feature selection (more specifically, the SelectKBest class method [26]) was chosen to evaluate the correlation between each data feature and the target variable and thus expose the correlation scores of the included features. The “chi2” scoring function was used, and the method was applied to 14 of the 15 features of the dataset (the target feature, namely “Label”, was excluded). Table 3 presents the 15 features with their importance scores, following the application of univariate analysis (UVA) feature selection and the SelectKBest algorithm. As is evident, the feature most relevant to the independent one (the “Label” column) appears to be the “Proto” variable. Furthermore, to reduce the dimensionality of the dataset, making it more manageable while reducing the computational resource requirements, the variables “StartTime”, “Dir”, “SrcAddr” and “DstAddr” were excluded, since they contain dynamically assigned values (e.g., IP addresses). Additionally, the variables “sTos” and “dTos” were dropped, since they are not expected to have a significant impact on the target variable, based also on the feature selection results depicted in Table 3.
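The chi-squared scoring idea behind SelectKBest can be sketched in a few lines. The snippet below is a simplified reimplementation of the statistic for illustration only, not the library call used in the study, and the feature values and labels are hypothetical:

```python
def chi2_score(feature, labels):
    """Chi-squared statistic between a nonnegative feature and a class label:
    compare the class-wise sums of the feature against what the class
    priors alone would predict (the idea behind sklearn's chi2 scorer)."""
    classes = sorted(set(labels))
    n = len(labels)
    total = sum(feature)
    score = 0.0
    for c in classes:
        idx = [i for i, y in enumerate(labels) if y == c]
        observed = sum(feature[i] for i in idx)
        expected = total * len(idx) / n
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score

labels = [0, 0, 0, 1, 1]
proto  = [3, 4, 3, 9, 8]   # hypothetical feature that differs strongly between classes
dur    = [5, 5, 5, 5, 5]   # hypothetical feature identical across classes

scores = {"Proto": chi2_score(proto, labels), "Dur": chi2_score(dur, labels)}
best = max(scores, key=scores.get)
print(best)  # the class-discriminative feature scores higher
```

A feature whose class-wise sums match the class priors exactly scores zero, which is why constant or class-independent columns are natural candidates for removal.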
The final dataset, after the feature selection stage, eventually included 8 variables (7 features and the target column), as presented in Table 4.
In order to create the feature selection graph presented in Figure 3, the following steps were performed: the “Extra Trees Classifier” feature selection method was applied to the final dataset to rank the features based on their importance scores; the predictive model was trained with 7 different feature variables, and its performance was evaluated using the accuracy metric for each number of features provided; and the results were plotted on a graph to visually assess the importance of each variable in correlation with the target one. Based on the results of this method, the most important variable in correlation with the target variable is “Sport”.
Figure 4 displays the correlation matrix, along with a heatmap, presenting the correlation coefficients for the selected eight variables of the final dataset, as listed in Table 4. Upon analyzing the obtained results, it is evident that the “State” variable is the most significant among the selected features. On the other hand, the “Sport” feature appears to be the least relevant to the “Label” variable. Thus, Figure 3 is a visualization that presents the eight most crucial features in correlation with the target variable, namely “Label”. These features were selected using the SelectKBest and Extra Trees Classifier methods described previously.

3.3. GAN Overview

This study focuses on GANs as proposed and modeled by Ian Goodfellow et al. [27] in 2014. The standard GAN architecture is composed of two deep neural network models, the generator and the discriminator, acting in a competitive manner. The generator is a deep neural network that receives as input a vector of random numbers, referred to as random noise, and its main task is the generation of high-quality, realistic data resembling the real training data. The discriminator, on the other hand, is a simple feedforward deep neural network that classifies input samples as original or generated. The generator loss and the discriminator loss are calculated separately and combined in a min–max game based on Equation (1) [27], where G is the generator, D is the discriminator and V(D, G) is the value function of the min–max game.
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
More specifically, the generator’s distribution p_g over data x can be learned once the input noise variables p_z(z) are defined. Following the noise variables’ definition, a mapping to the data space is defined as G(z; θ_g), where G is a differentiable function represented by a multilayer perceptron with parameters θ_g. A second multilayer perceptron D(x; θ_d) is then defined, which outputs a single scalar as described previously. D(x) represents the probability that x came from the data rather than from p_g. Finally, the discriminator is trained to maximize the probability of assigning the correct label to both original and generated samples, whilst the generator is trained to minimize log(1 − D(G(z))). In addition, during the generation of botnet samples, the generator and discriminator losses can be calculated separately, based on Equations (2) and (3), respectively [27].
\min_G V(G):\quad \nabla_{\theta_g}\,\frac{1}{m}\sum_{i=1}^{m}\log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right)
\max_D V(D):\quad \nabla_{\theta_d}\,\frac{1}{m}\sum_{i=1}^{m}\left[\log D\left(x^{(i)}\right) + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right)\right]
As this study focuses on the evaluation of different parameterizations, the mathematical modeling described above and Equations (1)–(3) fully describe the functionality of the GANs developed and evaluated for the purposes of this study.
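As a concrete illustration of the objectives in Equations (2) and (3), the per-batch quantities can be evaluated directly from the discriminator’s outputs. The probability values below are hypothetical, and this sketch only evaluates the averaged log terms, omitting the gradient step that the equations describe:

```python
import math

def generator_loss(d_on_fake):
    """Averaged term from Eq. (2): the generator minimises log(1 - D(G(z_i)))."""
    return sum(math.log(1.0 - d) for d in d_on_fake) / len(d_on_fake)

def discriminator_objective(d_on_real, d_on_fake):
    """Averaged term from Eq. (3): the discriminator maximises
    log D(x_i) + log(1 - D(G(z_i)))."""
    return sum(math.log(dr) + math.log(1.0 - df)
               for dr, df in zip(d_on_real, d_on_fake)) / len(d_on_real)

# Hypothetical discriminator outputs (probabilities that a sample is real).
d_real = [0.9, 0.8, 0.95]   # confident on genuine samples
d_fake = [0.1, 0.2, 0.05]   # confident on generated samples

print(round(generator_loss(d_fake), 4))
print(round(discriminator_objective(d_real, d_fake), 4))
```

When the discriminator is confident, the generator’s loss log(1 − D(G(z))) saturates near zero, which is the well-known motivation for the alternative “non-saturating” generator objective in practice.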
One of the most crucial stages when exploring the differences between different GAN architectures is the definition of the models used to achieve optimal performance. It is important to explore and evaluate different GAN architectures because different kinds of datasets and/or tasks may require different model architectures to achieve the best performance. The optimal number of layers for the generator and discriminator depends on the complexity of the data distribution, the size of the training dataset, the computational resources available, and the defined architecture and hyperparameters of the GAN model.
A GAN model with 4–6 layers in both the generator and discriminator networks is often a good starting point for generating realistic synthetic data. However, the optimal number of layers and architecture of a GAN model should be chosen based on a trade-off between model complexity, training stability, computational resources, and the desired quality and fidelity of the generated samples.
Hence, it is quite important to experiment with different GAN architectures and their hyperparameters to find the best model for a specific dataset and task. Techniques such as hyperparameter tuning and architecture experimentation can help identify the optimal GAN architecture and hyperparameters for a given task.
If the generator is shallow in terms of layers, it may not be able to capture the complexity of the data distribution, resulting in low-quality generated samples. On the other hand, if the generator is too deep or complex, it may suffer from training instability issues, such as mode collapse or vanishing gradients, which may lead to difficulties in generating high-quality samples. Similarly, if the discriminator is too shallow, it may not be able to distinguish between real and fake samples, whereas if it is too deep, it may cause overfitting issues during the training stage and consequently reduce the diversity and quality of the generated samples.
The present study focuses on comparing two different GAN architectures, one of 6 layers and the other of 8 layers both for the generator and the discriminator, for varying numbers of epochs. The aim of this comparison is to discover the optimal performance and draw conclusions about the combination of the parameters considered above (number of epochs and different GAN architectures).
Table 5 and Table 6 depict the outputs of the generator and the discriminator, respectively, for the suggested 6-layer GAN architecture. To create the generator model, the sequential API was used, which allowed stacking multiple layers in a sequential manner. As presented in Table 5, the generator model consists of (a) an input layer that accepts scaled, randomly generated noise of the desired size, followed by (b) four hidden layers activated by the “ReLU” function and (c) an output layer that uses a “linear” activation function and has the same dimension as the preprocessed dataset, which comprises eight variables, as detailed in Section 3.2.
Moving on to the discriminator model, Table 6 provides an overview of its structure. The discriminator, itself, is also implemented as a sequential model and consists of six dense layers. The first five layers use the “ReLU” function for activation, while the final layer uses the “sigmoid” function to classify input samples as either true (legitimate) or false (malware). In order to improve the model’s accuracy, a dropout rate of 20% was applied to both the visible (or input) and the four hidden layers of the discriminator model. The dropout rate was selected through experimentation and evaluation procedures. Various dropout rates were tested during the model training process, and the rate of 20% was found to strike a balance between preventing overfitting and preserving the model’s ability to learn relevant patterns in the data. This specific dropout rate was chosen based on empirical evidence and its performance in achieving satisfactory results in terms of accuracy and generalization.
The generator and the discriminator model for the 6-layer architecture are presented in Figure 5.
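The 6-layer models described above can be sketched as a plain forward pass. The layer widths, weight initialization and toy batch below are illustrative assumptions, not the paper’s exact configuration; only the layer counts, the ReLU/linear/sigmoid activations and the 20% dropout follow the description:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b, activation):
    """One fully connected layer with the given activation."""
    z = x @ w + b
    if activation == "relu":
        return np.maximum(z, 0.0)
    if activation == "sigmoid":
        return 1.0 / (1.0 + np.exp(-z))
    return z  # linear

# Illustrative layer widths: 6 dense layers each, as in the 6-layer architecture.
noise_dim, data_dim = 32, 8
gen_sizes = [noise_dim, 128, 256, 256, 128, 64, data_dim]
gen_acts = ["relu"] * 5 + ["linear"]
disc_sizes = [data_dim, 128, 64, 64, 32, 16, 1]
disc_acts = ["relu"] * 5 + ["sigmoid"]

def init(sizes):
    """Small random weights and zero biases for each consecutive layer pair."""
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(x, params, acts, dropout=0.0, training=False):
    """Forward pass; optional inverted dropout on all but the output layer."""
    last = len(params) - 1
    for i, ((w, b), act) in enumerate(zip(params, acts)):
        x = dense(x, w, b, act)
        if training and dropout > 0.0 and i < last:
            mask = rng.random(x.shape) >= dropout
            x = x * mask / (1.0 - dropout)
    return x

gen_params, disc_params = init(gen_sizes), init(disc_sizes)
noise = rng.normal(size=(4, noise_dim))
fake = forward(noise, gen_params, gen_acts)                      # 4 synthetic rows
probs = forward(fake, disc_params, disc_acts, dropout=0.2, training=True)
print(fake.shape, probs.shape)  # (4, 8) (4, 1)
```

The sigmoid output of the last discriminator layer yields a probability in (0, 1) per sample, matching the true/false (legitimate/malware) classification role described above.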
Table 7 and Table 8 illustrate the models of the 8-layer GAN architecture. The generator model, as shown in Table 7, is constructed using the sequential API and comprises (a) an input layer that accepts scaled, randomly generated noise of the desired size, followed by (b) six hidden layers activated by the “ReLU” function and (c) an output layer whose activation function is “linear” and matches the dimension of the preprocessed dataset.
Similarly, Table 8 outlines the output of the discriminator model. The discriminator, itself, is a sequential model, consisting of eight dense layers. The first seven layers use the “ReLU” function for activation, while the final layer uses the “sigmoid” function to classify input samples as either true (legitimate) or false (malware). To enhance the model’s accuracy, a dropout rate of 20% was applied to both the visible (input) layer and the six hidden layers of the discriminator model. The final selection of this dropout rate was determined through iterative experimentation, considering its impact on preventing overfitting while preserving the model’s ability to capture relevant patterns in the data.
The generator and the discriminator of the 8-layer architecture are depicted in Figure 6.

4. Results

Diagrams are an effective way to compare and visualize the similarity scores between real and generated datasets from a GAN model. These scores provide valuable insight into the quality and accuracy of the generated dataset and help identify areas where the GAN model may need improvement in order to produce more realistic synthetic data. The choice of diagram type depends on the characteristics of the data being analyzed as well as on the specific research purposes.
During the implementation of the current study, two specific GAN architectures were selected with the aim of reproducing synthetic data using a real input dataset, namely the CTU dataset, described in Section 3. These models were trained for different numbers of epochs, i.e., for 25, 50, 100, 250, 500 and 1000 epochs. Afterward, the generated datasets were compared to the real dataset in order to extract their similarity scores in terms of the features (variables) included.
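The text does not spell out the exact formula used for the similarity score. Purely as a hypothetical illustration of the idea (not the authors' implementation), one simple per-feature score compares the column means of the real and generated tables and averages the result over all features:

```python
# Hypothetical sketch of a per-feature similarity score between a real and
# a generated tabular dataset. The paper does not specify its exact metric;
# here, similarity is 1 minus the normalized absolute difference of each
# column's mean, averaged over all columns.
from statistics import mean

def column_similarity(real_col, fake_col):
    m_real, m_fake = mean(real_col), mean(fake_col)
    denom = max(abs(m_real), abs(m_fake), 1e-12)
    return 1.0 - min(abs(m_real - m_fake) / denom, 1.0)

def dataset_similarity(real, fake):
    # real/fake: dict mapping feature name -> list of values
    return mean(column_similarity(real[f], fake[f]) for f in real)

# Toy two-feature example with fabricated values
real = {"Dur": [1.0, 2.0, 3.0], "TotPkts": [10, 20, 30]}
fake = {"Dur": [1.1, 2.2, 2.7], "TotPkts": [12, 18, 33]}
score = dataset_similarity(real, fake)
print(round(score, 3))
```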
To this end, one common and effective way to represent these similarity scores is through three types of diagrams, which are presented in Section 4.1 and Section 4.2 for the six- and eight-layer architectures, respectively, and described below. Each of the figures presented in Section 4.1 and Section 4.2 includes the following:
  • Correlation matrices with heatmaps that highlight the differences between the real and generated datasets. Heatmaps are particularly useful for identifying patterns of similarity that relate to specific data features.
  • Cumulative sum (cumsum) diagrams: a graphical representation of the running sum of each feature's values in the real and generated datasets. Overlaying the two curves visualizes how closely the accumulated values of the generated dataset track those of the real one.
  • Log mean and standard deviation (STD) diagrams comparing the real and generated datasets. A log mean diagram displays the average similarity score between the two datasets at each training epoch, allowing an evaluation of whether the generated dataset becomes more or less similar to the real one as training progresses. A standard deviation diagram displays the variation in similarity scores at each epoch, indicating how consistent the similarity is and whether there are significant fluctuations between epochs.
By comparing these diagrams between the real and generated datasets, it is possible to determine how well the GAN model is performing in generating synthetic data that are similar to the real data.
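The three diagnostics above can be computed without any plotting machinery. The sketch below, a minimal stand-alone example with fabricated two-feature data, derives the quantities the figures visualize: the difference of correlation coefficients, per-feature cumulative sums, and the log of each feature's absolute mean and standard deviation.

```python
# Minimal sketch of the three diagnostics, computed for a toy two-feature
# example using only the standard library.
import math
from statistics import mean, pstdev

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length sequences
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cumsum(values):
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out

real = {"TotPkts": [10, 20, 30, 40], "TotBytes": [100, 210, 290, 410]}
fake = {"TotPkts": [12, 18, 33, 39], "TotBytes": [90, 250, 260, 430]}

# (a) difference between real and generated feature correlations
corr_diff = pearson(*real.values()) - pearson(*fake.values())

# (b) cumulative sums, overlaid per feature in the paper's figures
real_cs = {f: cumsum(v) for f, v in real.items()}

# (c) log absolute mean and log standard deviation per feature
log_stats = {f: (math.log(abs(mean(v))), math.log(pstdev(v)))
             for f, v in real.items()}

print(round(corr_diff, 4), real_cs["TotPkts"][-1])
```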

4.1. Six-Layer Architecture

Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12 depict the selected diagrams, as previously described, for the starting epochs (25) and the final number of epochs (1000) as defined in Section 3.
It is evident that for five of the eight variables (Dur, TotPkts, TotBytes, SrcBytes and Label), the cumulative sum diagrams show a steady increase both for the real and the generated datasets, indicating a consistent and continuous pattern in the data. On the other hand, for three of the eight variables (Sport, Dport and State), a fluctuating pattern with sudden spikes and drops was observed. This indicates that the synthetic dataset generated by the GAN model contains certain data points that deviate significantly from the overall pattern, leading to a lower overall similarity score for these specific variables. Additionally, for small numbers of epochs, the cumulative sum curves of the generated data rise more slowly than those of the real dataset, indicating that the GAN model may need more training epochs to produce a synthetic dataset closer to the real one.
Moreover, the correlation matrix diagrams for the real dataset indicate a strong positive correlation between the included variables. In the generated synthetic dataset, the positive correlation between the variables was weaker, and no significant negative correlation was detected. In addition, a strong positive correlation was observed between the features of the real and generated data in the "Difference" section, indicating that, as the epochs increased, the generated data replicated the patterns and characteristics present in the real dataset in a realistic way.
Finally, the absolute log mean and standard deviation diagrams show that the synthetic dataset contains higher values for some features than the real dataset, indicating that the generated data for those features were not as close to the real data as for others. As the number of training epochs increased, the generated dataset became more similar to the real one.
In Appendix A, the same diagrams for 50, 100, 250 and 500 epochs are also included.

4.2. Eight-Layer Architecture

As previously stated, Figure 13, Figure 14, Figure 15, Figure 16, Figure 17 and Figure 18 depict the selected diagrams for the eight-layer architecture described in Section 3 (for 25 and 1000 epochs).
Once again, for five of the eight variables (Dur, TotPkts, TotBytes, SrcBytes and Label), the cumulative sum diagrams show a steady increase both for the real and the generated datasets, indicating a consistent and continuous pattern in the data, while for three of the eight variables (Sport, Dport and State) a fluctuating pattern with sudden spikes and drops was observed. As before, this indicates that the synthetic dataset generated by the GAN model contains certain data points that deviate significantly from the overall pattern, leading to a lower overall similarity score for these specific features. Furthermore, for small numbers of epochs, the cumulative sum curves of the generated data again rise more slowly than those of the real dataset, indicating that the GAN model may need more training epochs to produce a synthetic dataset closer to the real one; indeed, as the number of epochs increased, the generated dataset showed higher similarity to the real one.
Moreover, in the correlation matrix diagrams, a positive correlation was observed between the features of the real and generated data in the "Difference" section, indicating that, as the epochs gradually increased, the generated data replicated the patterns and characteristics present in the real dataset in a realistic way.
Finally, as expected, the absolute log mean and standard deviation diagrams show that the generated dataset became more similar to the real dataset as the number of epochs increased, since fewer inflated values were observed among the generated features compared to the real ones.
In Appendix B, the same diagrams for 50, 100, 250 and 500 epochs are also included.

4.3. Comparison of Similarity Scores between the Architectures

The primary aim of this study was to compare the performance of two different GAN architectures, one with six layers and the other with eight, for both the generator and the discriminator. The number of epochs was varied among 25, 50, 100, 250, 500 and 1000, with the objective of determining the optimal performance and investigating the relationship between the considered parameters, namely the number of epochs and the number of layers in the GAN model.
The results, as detailed in Section 4, demonstrated that the eight-layer GAN architecture achieved the highest similarity score between the real and generated data samples, especially after being trained for 1000 epochs, as depicted in Figure 19. This suggests that the additional layers provide more capacity for the model to learn complex patterns in the data. The number of epochs also significantly affected the efficiency of the GAN model, as more epochs allowed the model to learn more complex patterns and generate higher-quality synthetic samples. It should be noted that, despite the increased complexity of the eight-layer model, the similarity between the original and generated samples was treated as the primary evaluation metric in this study, and under this metric increasing the number of epochs led to better performance as the architecture grew deeper. Regarding the six-layer architecture, the similarity score increased gradually and then remained stable from 100 to 1000 epochs without further improvement, suggesting that this model cannot produce better results for this type of dataset. The eight-layer model, on the other hand, showed varying performance across the selected epoch counts, which may indicate that its architecture was not fully optimized or that further hyperparameter tuning is needed.
Another interesting observation, based on Figure 19, was the decrease in the similarity score between 50 and 250 epochs for the eight-layer architecture compared to the corresponding six-layer one. This indicates that, for this combination of epoch counts and architecture, a better selection of hyperparameters may be required, as the model yielded a lower similarity score between the generated and real datasets. Another possible cause is the specific sampling used, which may have led to these lower similarity results.

5. Conclusions

The present study aimed to provide insights into the effectiveness of different GAN architectures for generating synthetic data that accurately represent malicious cyber-attacks. To compare the performance of different GAN models in generating synthetic data that accurately represent malicious cyber-attacks (botnet attacks), a well-known collection of network traffic captures widely utilized in cybersecurity research was used, namely the Stratosphere IPS CTU-13 dataset [23]. The dataset comprises thirteen captures of diverse malware samples and one capture of normal traffic obtained in a controlled environment. These captures, containing over 32 million packets, were amassed over several months and feature metadata labeling for each capture, such as the type of malware, timestamp and network protocols employed. The “botnet” captures are numbered from 1 to 5, and the “normal” captures are numbered from 11 to 15. These captures were taken at various stages of the botnet attack, with the initial capture taken shortly after activation and the subsequent captures taken at different intervals during the attack.
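For readers unfamiliar with the dataset's format, the CTU-13 captures are distributed as NetFlow-style CSV files whose columns match those listed in Table 2. The sketch below shows one plausible way to load such a capture and retain the eight features kept after feature selection (Table 4); the two inline flow records are fabricated for illustration and are not taken from the actual dataset.

```python
# Hypothetical sketch of loading a CTU-13-style flow capture and keeping
# the eight features retained after feature selection (Table 4). The two
# inline records are fabricated examples; real captures hold millions of
# rows.
import csv
import io

SELECTED = ["Dur", "Sport", "Dport", "State", "TotPkts", "TotBytes",
            "SrcBytes", "Label"]

sample = io.StringIO(
    "StartTime,Dur,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,State,sTos,dTos,"
    "TotPkts,TotBytes,SrcBytes,Label\n"
    "2011/08/10 09:46:53,3.02,tcp,147.32.84.165,1025,->,147.32.80.9,80,"
    "SRPA_SPA,0,0,4,392,184,flow=From-Botnet-TCP\n"
    "2011/08/10 09:46:59,0.11,udp,147.32.84.170,53,<->,147.32.80.9,53,"
    "CON,0,0,2,186,92,flow=Background-UDP\n"
)

rows = []
for record in csv.DictReader(sample):
    row = {k: record[k] for k in SELECTED}
    # Binary label: botnet flows are the positive class.
    row["Label"] = "Botnet" in row["Label"]
    rows.append(row)

print(len(rows), rows[0]["Label"], rows[1]["Label"])
```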
The main objective of this study was to compare the performance of two distinct GAN architectures—one with six layers and the other with eight layers—for both the generator and discriminator, while varying the number of epochs from 25 to 1000. Thus, the aim of this study was to identify the model with the optimal performance and draw conclusions about the relationship between the considered parameters, including the number of epochs and GAN architectures (number of layers). According to the results, as presented in Section 4, the best similarity score, between the real and generated data samples, was achieved for the suggested eight-layer GAN architecture, after the model was trained for 1000 epochs. Since the quality of the generated samples was the primary evaluation metric, the results indicated that the eight-layer GAN may perform better, as the additional layers provide more capacity for the model to learn complex patterns in the data. Moreover, it was observed that the number of epochs greatly affected the efficiency of the GAN model, based on the experimentation and evaluation of the model’s performance using the selected metrics, for the similarity score calculation. In fact, increasing the number of epochs allowed the model to learn more complex patterns in the data and generate higher-quality synthetic samples.
The field of cybersecurity is constantly evolving, and the use of GANs for data generation has become an important tool for detecting and preventing malicious cyber-attacks. The current study focused on two GAN architectures for generating synthetic botnet data. Future studies could investigate additional GAN architectures, such as progressive GANs, cycle GANs and attention GANs, to determine whether they can produce even more accurate and diverse synthetic botnet data. Incorporating real-world data could also be a great asset in future steps of this work: while the study managed to generate synthetic botnet data that accurately represented real-world botnets, adding real-world data to the training process could further improve the quality and diversity of the synthetic data, and future studies could explore the feasibility of doing so. Regarding the impact of data preprocessing on GAN performance, future experiments could investigate different preprocessing techniques, such as data normalization. Finally, hybrid methods that combine GANs with other machine learning techniques, such as autoencoders or deep neural networks, could be developed to generate even more accurate and diverse synthetic botnet data; future studies could evaluate the performance of such hybrid models.
Finally, it is important to acknowledge the limitations of the present study and identify areas for future research. While the current research focused on comparing the performance of different GAN architectures and varying the number of epochs, other factors may influence the effectiveness of synthetically generated datasets for malicious cyber-attacks. For instance, exploring the impact of different hyperparameters, such as batch sizes or learning rates, could provide further insights into optimizing GANs for generating high-quality synthetic datasets. Moreover, examining the transferability of already trained GAN models to different types of attack scenarios or datasets could contribute to the generalization of the findings as well as to the practical applicability of the generated synthetic data. Future studies could also revisit the evaluation metrics used for assessing the quality of the generated samples, exploring alternative measures or domain-specific metrics that capture specific characteristics of malicious cyber-attacks. By addressing these limitations and conducting further investigations, the field of synthetic data generation for cybersecurity can continue to advance, ultimately enhancing the ability to detect, prevent and mitigate malicious cyber-attacks.

Author Contributions

Conceptualization, N.P., T.A. and E.A.; methodology, N.P. and T.A.; software, N.P.; validation, N.P., T.A., K.D. and E.A.; formal analysis, E.A. and N.P.; investigation, N.P. and T.A.; resources, T.A. and N.P.; data curation, N.P., T.A. and K.D.; writing—original draft preparation, N.P. and T.A.; writing—review and editing, E.A., N.P., T.A. and K.D.; visualization, N.P. and T.A.; supervision, E.A. and K.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Six Layers

Correlation matrix, cumulative sums and absolute log mean and STDs for 50, 100, 250 and 500 epochs.
Figure A1. Correlation matrix, 50 epochs.
Figure A2. Correlation matrix, 100 epochs.
Figure A3. Correlation matrix, 250 epochs.
Figure A4. Correlation matrix, 500 epochs.
Figure A5. Cumulative sum, 50 epochs.
Figure A6. Cumulative sum, 100 epochs.
Figure A7. Cumulative sum, 250 epochs.
Figure A8. Cumulative sum, 500 epochs.
Figure A9. Absolute log mean and standard deviation, 50 epochs.
Figure A10. Absolute log mean and standard deviation, 100 epochs.
Figure A11. Absolute log mean and standard deviation, 250 epochs.
Figure A12. Absolute log mean and standard deviation, 500 epochs.

Appendix B. Eight Layers

Correlation matrix, cumulative sums and absolute log mean and STDs for 50, 100, 250 and 500 epochs.
Figure A13. Correlation matrix, 50 epochs.
Figure A14. Correlation matrix, 100 epochs.
Figure A15. Correlation matrix, 250 epochs.
Figure A16. Correlation matrix, 500 epochs.
Figure A17. Cumulative sum, 50 epochs.
Figure A18. Cumulative sum, 100 epochs.
Figure A19. Cumulative sum, 250 epochs.
Figure A20. Cumulative sum, 500 epochs.
Figure A21. Absolute log mean and standard deviation, 50 epochs.
Figure A22. Absolute log mean and standard deviation, 100 epochs.
Figure A23. Absolute log mean and standard deviation, 250 epochs.
Figure A24. Absolute log mean and standard deviation, 500 epochs.

References

  1. Check Point Check Point Research Reports a 38% Increase in 2022 Global Cyberattacks. Available online: https://blog.checkpoint.com/2023/01/05/38-increase-in-2022-global-cyberattacks/ (accessed on 22 February 2023).
  2. Shinan, K.; Alsubhi, K.; Alzahrani, A.; Ashraf, M.U. Machine Learning-Based Botnet Detection in Software-Defined Network: A Systematic Review. Symmetry 2021, 13, 50866. [Google Scholar] [CrossRef]
  3. Silva, S.S.C.; Silva, R.M.P.; Pinto, R.C.G.; Salles, R.M. Botnets: A Survey. Comput. Netw. 2013, 57, 378–403. [Google Scholar] [CrossRef]
  4. Limarunothai, R.; Amin Munlin, M. Trends and Challenges of Botnet Architectures and Detection Techniques. J. Inf. Sci. Technol. 2015, 5, 51–57. [Google Scholar] [CrossRef]
  5. Anande, T.J.; Al-Saadi, S.; Leeson, M.S. Generative Adversarial Networks for Network Traffic Feature Generation. Int. J. Comput. Appl. 2023, 45, 297–305. [Google Scholar] [CrossRef]
  6. Guo, Y.; Xiong, G.; Li, Z.; Shi, J.; Cui, M.; Gou, G. Combating Imbalance in Network Traffic Classification Using GAN Based Oversampling. In Proceedings of the 2021 IFIP Networking Conference (IFIP Networking), Virtual, 21–24 June 2021; pp. 1–9. [Google Scholar]
  7. Cheng, A. PAC-GAN: Packet Generation of Network Traffic Using Generative Adversarial Networks. In Proceedings of the 2019 IEEE 10th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 17–19 October 2019; pp. 728–734. [Google Scholar]
  8. Dowoo, B.; Jung, Y.; Choi, C. PcapGAN: Packet Capture File Generator by Style-Based Generative Adversarial Networks. In Proceedings of the 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; pp. 1149–1154. [Google Scholar]
  9. Rigaki, M.; García, S. Bringing a GAN to a Knife-Fight: Adapting Malware Communication to Avoid Detection. In Proceedings of the 2018 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 24 May 2018; pp. 70–75. [Google Scholar]
  10. Ring, M.; Schlör, D.; Landes, D.; Hotho, A. Flow-Based Network Traffic Generation Using Generative Adversarial Networks. Comput. Secur. 2019, 82, 156–172. [Google Scholar] [CrossRef]
  11. Zhang, C.; Ouyang, X.; Patras, P. ZipNet-GAN: Inferring Fine-Grained Mobile Traffic Patterns via a Generative Adversarial Neural Network. In Proceedings of the CoNEXT ’17 13th International Conference on emerging Networking EXperiments and Technologies, New York, NY, USA, 12 December 2017; pp. 363–375. [Google Scholar]
  12. Yin, Y.; Lin, Z.; Jin, M.; Fanti, G.; Sekar, V. Practical GAN-Based Synthetic IP Header Trace Generation Using NetShare. In Proceedings of the ACM SIGCOMM 2022 Conference, Association for Computing Machinery, New York, NY, USA, 10–14 September 2022; pp. 458–472. [Google Scholar]
  13. Wu, C.; Chen, Y.; Chou, P.; Wang, C. Synthetic Traffic Generation with Wasserstein Generative Adversarial Networks. In Proceedings of the GLOBECOM 2022—2022 IEEE Global Communications Conference, Rio de Janeiro, Brazil, 4–8 December 2022; pp. 1503–1508. [Google Scholar]
  14. Zhong, F.; Cheng, X.; Yu, D.; Gong, B.; Song, S.; Yu, J. MalFox: Camouflaged Adversarial Malware Example Generation Based on Conv-GANs Against Black-Box Detectors. IEEE Trans. Comput. 2023, 1–14. [Google Scholar] [CrossRef]
  15. Habibi, O.; Chemmakha, M.; Lazaar, M. Imbalanced Tabular Data Modelization Using CTGAN and Machine Learning to Improve IoT Botnet Attacks Detection. Eng. Appl. Artif. Intell. 2023, 118, 105669. [Google Scholar] [CrossRef]
  16. Lingam, G.; Yasaswini, B.; Jagadamba, P.V.S.L.; Kolliboyana, N. An Improved Bot Identification with Imbalanced Data Using GG-XGBoost. In Proceedings of the 2022 2nd International Conference on Intelligent Technologies (CONIT), Hubballi, India, 24–26 June 2022; pp. 1–6. [Google Scholar]
  17. Yin, C.; Zhu, Y.; Liu, S.; Fei, J.; Zhang, H. An Enhancing Framework for Botnet Detection Using Generative Adversarial Networks. In Proceedings of the 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, 26–28 May 2018; pp. 228–234. [Google Scholar]
  18. Song, C.; Wushouer, M.; Tuerho, G. Botnet Detection Based on Generative Adversarial Network and Efficient Lifelong Learning Algorithm. In Proceedings of the 2022 International Conference on Big Data, Information and Computer Network (BDICN), Sanya, China, 20–22 January 2022; pp. 48–54. [Google Scholar]
  19. Saurabh, K.; Singh, A.; Singh, U.; Vyas, O.P.; Khondoker, R. GANIBOT: A Network Flow Based Semi Supervised Generative Adversarial Networks Model for IoT Botnets Detection. In Proceedings of the 2022 IEEE International Conference on Omni-layer Intelligent Systems (COINS), Barcelona, Spain, 1–3 August 2022; pp. 1–5. [Google Scholar]
  20. Meidan, Y.; Bohadana, M.; Mathov, Y.; Mirsky, Y.; Shabtai, A.; Breitenbacher, D.; Elovici, Y. N-BaIoT: Network-Based Detection of IoT Botnet Attacks Using Deep Autoencoders. IEEE Pervasive Comput. 2018, 17, 12–22. [Google Scholar] [CrossRef]
  21. Kalleshappa, G.; Savadatti, B. Effective Internet of Things Botnet Classification by Data Upsampling Using Generative Adversarial Network and Scale Fused Bidirectional Long Short Term Memory Attention Model. Concurr. Comput. Pract. Exp. 2022, 34. [Google Scholar] [CrossRef]
  22. Randhawa, R.H.; Aslam, N.; Alauthman, M.; Rafiq, H.; Comeau, F. Security Hardening of Botnet Detectors Using Generative Adversarial Networks. IEEE Access 2021, 9, 78276–78292. [Google Scholar] [CrossRef]
  23. García, S.; Grill, M.; Stiborek, J.; Zunino, A. An Empirical Comparison of Botnet Detection Methods. Comput. Secur. 2014, 45, 100–123. [Google Scholar] [CrossRef]
  24. Rawat, T.; Khemchandani, V. Feature Engineering (FE) Tools and Techniques for Better Classification Performance. Int. J. Innov. Eng. Technol. 2019, 8, 169–179. [Google Scholar] [CrossRef]
  25. Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature Selection: A Data Perspective. ACM Comput. Surv. 2017, 50, 1–45. [Google Scholar] [CrossRef]
  26. Chandrashekar, G.; Sahin, F. A Survey on Feature Selection Methods. Comput. Electr. Eng. 2014, 40, 16–28. [Google Scholar] [CrossRef]
  27. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
Figure 1. Initial dataset overview.
Figure 2. Final dataset (after feature selection).
Figure 3. Feature selection graph.
Figure 4. Correlation matrix.
Figure 5. Generator and discriminator for 6 layers.
Figure 6. Generator and discriminator for 8 layers.
Figure 7. Correlation matrix for 6 layers, 25 epochs.
Figure 8. Cumulative sums per feature for 6 layers, 25 epochs.
Figure 9. Absolute log mean and standard deviation for 6 layers, 25 epochs.
Figure 10. Correlation matrix for 6 layers, 1000 epochs.
Figure 11. Cumulative sums per feature for 6 layers, 1000 epochs.
Figure 12. Absolute log mean and standard deviation for 6 layers, 1000 epochs.
Figure 13. Correlation matrix for 8 layers, 25 epochs.
Figure 14. Cumulative sums per feature for 8 layers, 25 epochs.
Figure 15. Absolute log mean and standard deviation for 8 layers, 25 epochs.
Figure 16. Correlation matrix for 8 layers, 1000 epochs.
Figure 17. Cumulative sums per feature for 8 layers, 1000 epochs.
Figure 18. Absolute log mean and standard deviation for 8 layers, 1000 epochs.
Figure 19. Similarity score chart for 6-layer and 8-layer architecture per number of epochs.
Table 1. Overview of related literature (Yes, when the study covers the corresponding attribute; No, when it does not; N/A, when the information for this attribute is not available in the study. G = generator, D = discriminator).
Study | Data | Result | Multiple GAN Architectures | ML Classifiers | GAN Layers
Guo et al. [6] | Network traffic | Data augmentation (oversampling) improved ML classifier precision on minority class | No | Yes | 6G/4D
Cheng [7] | Network traffic | Non-sequential scalable network traffic generator using GAN | No | No | 6
Dowoo et al. [8] | Network traffic | Improved ML classifier accuracy when trained with generated data; indirect evaluation | No | Yes | 6
Rigaki et al. [9] | Network traffic (real Facebook chat) | Improved malware data generation mimicking real data | No | No | N/A
Ring et al. [10] | Network traffic | Data augmentation and comparison of generated data with real data using Euclidean distance | No | No | N/A
Yin et al. [12] | IP-header traces | Better scalability–fidelity trade-off | No | No | N/A
Wu et al. [13] | Packet exchange | Distribution of synthetically generated data is also very close to other random distributions | No | No | 3G/3D
Zhong et al. [14] | Malware | Decreased detection rate (around 45%) for the generated samples | No | Yes | 13G/15D
Habibi et al. [15] | IoT botnet attacks | GANs can preserve continuous and discrete data structures, alleviating data imbalance | Yes | Yes | N/A
Lingam et al. [16] | Twitter dataset | Benign class data augmentation using GANs helped ML solutions achieve an accuracy of around 91% | No | No | 5
Yin et al. [17] | Botnet attacks | Improved detection rate when ML classifiers were trained with a balanced synthetically generated dataset | No | Yes | 3
Song et al. [18] | Botnet attacks | Improved detection rate for both ML classifiers and the ELLA algorithm when trained with a balanced synthetically generated dataset | No | No | N/A
Saurabh et al. [19] | Botnet attacks | SGAN outperformed other ML methods both in accuracy and in computational efficiency | No | Yes | N/A
Kalleshappa and Savadatti [21] | Botnet attacks | Tackled imbalanced dataset and enhanced SFBAM performance, achieving an F1-score of 96% | No | Yes | 10G/4D
Randhawa et al. [22] | Botnet attacks | Tackled the imbalanced dataset problem to enhance ML classifier performance | Yes | Yes | N/A
Table 2. Initial dataset features and their data types.

| Feature | Type |
|---|---|
| StartTime | object |
| Dur | float64 |
| Proto | object |
| SrcAddr | object |
| Sport | object |
| Dir | object |
| DstAddr | object |
| Dport | object |
| State | object |
| sTos | float64 |
| dTos | float64 |
| TotPkts | int64 |
| TotBytes | int64 |
| SrcBytes | int64 |
| Label | bool |
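Type listings such as the one in Table 2 mirror the output of pandas' `DataFrame.dtypes` (in a raw `df.dtypes` printout, a trailing `dtype: object` line describes the returned Series itself, not a dataset feature). A minimal sketch of how such a listing is produced; the sample rows below are illustrative, not taken from the botnet dataset:

```python
import pandas as pd

# Illustrative rows mimicking a few columns of a netflow-style dataset
df = pd.DataFrame({
    "Dur": [0.5, 1.2],        # flow duration in seconds -> float64
    "Proto": ["tcp", "udp"],  # protocol name (string)    -> object
    "TotPkts": [10, 42],      # packet count              -> int64
    "Label": [True, False],   # botnet flag               -> bool
})

# Prints one line per column, e.g. "Dur    float64"
print(df.dtypes)
```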
Table 3. Importance scores for the 14 input features of the initial dataset (the Label target is excluded).

| Feature | Score |
|---|---|
| TotBytes | 4.22387 × 10^9 |
| Sport | 8.21231 × 10^8 |
| SrcBytes | 6.60375 × 10^8 |
| Dport | 6.17780 × 10^8 |
| Dur | 3.30228 × 10^8 |
| State | 1.41850 × 10^8 |
| TotPkts | 4.97291 × 10^7 |
| Proto | 4.55909 × 10^2 |
| dTos | 3.11329 × 10^−2 |
| sTos | 2.86774 × 10^−2 |
| StartTime | 5.23553 × 10^−4 |
| Dir | 4.90686 × 10^−4 |
| SrcAddr | 2.38865 × 10^−4 |
| DstAddr | 2.19013 × 10^−4 |
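The feature-selection step connecting Tables 3 and 4 can be reproduced directly from the scores: the seven highest-scoring features are exactly those retained in the final dataset (plus the Label target). A small sketch; the cutoff of seven is inferred from Table 4, not stated explicitly in the tables:

```python
# Importance scores transcribed from Table 3 (feature -> score)
scores = {
    "TotBytes": 4.22387e9, "Sport": 8.21231e8, "SrcBytes": 6.60375e8,
    "Dport": 6.17780e8, "Dur": 3.30228e8, "State": 1.41850e8,
    "TotPkts": 4.97291e7, "Proto": 4.55909e2, "dTos": 3.11329e-2,
    "sTos": 2.86774e-2, "StartTime": 5.23553e-4, "Dir": 4.90686e-4,
    "SrcAddr": 2.38865e-4, "DstAddr": 2.19013e-4,
}

# Keep the seven highest-scoring features; together with the Label
# target these form the reduced feature set of Table 4.
top7 = sorted(scores, key=scores.get, reverse=True)[:7]
print(top7)
```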
Table 4. Final dataset features and their data types.

| Feature | Type |
|---|---|
| Dur | float64 |
| Sport | object |
| Dport | object |
| State | object |
| TotPkts | int64 |
| TotBytes | int64 |
| SrcBytes | int64 |
| Label | bool |
Table 5. Generator model output for the 6-layer architecture.

Generator

| Layer (Type) | Output Shape | Value |
|---|---|---|
| dense (Dense) | (None, 1536) | 16,896 |
| dense_1 (Dense) | (None, 1278) | 1,964,286 |
| dense_2 (Dense) | (None, 512) | 654,848 |
| dense_3 (Dense) | (None, 128) | 65,664 |
| dense_4 (Dense) | (None, 16) | 2064 |
| dense_5 (Dense) | (None, 8) | 136 |
Table 6. Discriminator model output for the 6-layer architecture.

Discriminator

| Layer (Type) | Output Shape | Value |
|---|---|---|
| dense (Dense) | (None, 256) | 2304 |
| dropout_1 (Dropout) | (None, 256) | 0.2 |
| dense_1 (Dense) | (None, 128) | 32,896 |
| dropout_2 (Dropout) | (None, 128) | 0.2 |
| dense_2 (Dense) | (None, 64) | 8256 |
| dropout_3 (Dropout) | (None, 64) | 0.2 |
| dense_3 (Dense) | (None, 32) | 2080 |
| dropout_4 (Dropout) | (None, 32) | 0.2 |
| dense_4 (Dense) | (None, 16) | 528 |
| dropout_5 (Dropout) | (None, 16) | 0.2 |
| dense_5 (Dense) | (None, 1) | 17 |
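The parameter counts in Tables 5 and 6 are consistent with fully connected (Dense) layers, where a layer mapping n_in inputs to n_out outputs has n_in × n_out weights plus n_out biases; Dropout layers add no trainable parameters (the 0.2 in Table 6 is the dropout rate). Working backwards from the first generator layer (16,896 parameters into 1536 units) implies a latent input of size 10, and the first discriminator layer (2304 parameters into 256 units) implies the 8 features of Table 4; both input sizes are our inference, not stated in the tables. A quick check:

```python
def dense_params(n_in, n_out):
    # Trainable parameters of a fully connected layer: weights + biases
    return n_in * n_out + n_out

# 6-layer generator: latent vector (size 10, inferred) -> 8 features
gen_sizes = [10, 1536, 1278, 512, 128, 16, 8]
gen_params = [dense_params(a, b) for a, b in zip(gen_sizes, gen_sizes[1:])]
print(gen_params)   # matches the Value column of Table 5

# 6-layer discriminator: 8 features -> single real/fake score
disc_sizes = [8, 256, 128, 64, 32, 16, 1]
disc_params = [dense_params(a, b) for a, b in zip(disc_sizes, disc_sizes[1:])]
print(disc_params)  # matches the Dense rows of Table 6
```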
Table 7. Generator model output for the 8-layer architecture.

Generator

| Layer (Type) | Output Shape | Value |
|---|---|---|
| dense (Dense) | (None, 1536) | 16,896 |
| dense_1 (Dense) | (None, 1278) | 1,964,286 |
| dense_2 (Dense) | (None, 512) | 654,848 |
| dense_3 (Dense) | (None, 384) | 196,992 |
| dense_4 (Dense) | (None, 128) | 49,280 |
| dense_5 (Dense) | (None, 64) | 8256 |
| dense_6 (Dense) | (None, 16) | 1040 |
| dense_7 (Dense) | (None, 8) | 136 |
Table 8. Discriminator model output for the 8-layer architecture.

Discriminator

| Layer (Type) | Output Shape | Value |
|---|---|---|
| dense (Dense) | (None, 1024) | 9216 |
| dropout (Dropout) | (None, 1024) | 0.2 |
| dense_1 (Dense) | (None, 512) | 524,800 |
| dropout_1 (Dropout) | (None, 512) | 0.2 |
| dense_2 (Dense) | (None, 256) | 131,328 |
| dropout_2 (Dropout) | (None, 256) | 0.2 |
| dense_3 (Dense) | (None, 128) | 32,896 |
| dropout_3 (Dropout) | (None, 128) | 0.2 |
| dense_4 (Dense) | (None, 64) | 8256 |
| dropout_4 (Dropout) | (None, 64) | 0.2 |
| dense_5 (Dense) | (None, 32) | 2080 |
| dropout_5 (Dropout) | (None, 32) | 0.2 |
| dense_6 (Dense) | (None, 16) | 528 |
| dropout_6 (Dropout) | (None, 16) | 0.2 |
| dense_7 (Dense) | (None, 1) | 17 |
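The same parameter-count check applies to the 8-layer architecture of Tables 7 and 8, again assuming (our inference) a latent input of size 10 for the generator and the 8 input features of Table 4 for the discriminator:

```python
def dense_params(n_in, n_out):
    # Trainable parameters of a fully connected layer: weights + biases
    return n_in * n_out + n_out

# 8-layer generator: latent vector (size 10, inferred) -> 8 features
gen_sizes = [10, 1536, 1278, 512, 384, 128, 64, 16, 8]
gen_params = [dense_params(a, b) for a, b in zip(gen_sizes, gen_sizes[1:])]

# 8-layer discriminator: 8 features -> single real/fake score
# (dropout layers, rate 0.2, contribute no trainable parameters)
disc_sizes = [8, 1024, 512, 256, 128, 64, 32, 16, 1]
disc_params = [dense_params(a, b) for a, b in zip(disc_sizes, disc_sizes[1:])]

print(gen_params)   # matches the Value column of Table 7
print(disc_params)  # matches the Dense rows of Table 8
```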
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Peppes, N.; Alexakis, T.; Demestichas, K.; Adamopoulou, E. A Comparison Study of Generative Adversarial Network Architectures for Malicious Cyber-Attack Data Generation. Appl. Sci. 2023, 13, 7106. https://0-doi-org.brum.beds.ac.uk/10.3390/app13127106
