Data stream mining is a sub-domain of data mining that continuously processes incoming data on the fly. Data stream mining imposes several challenges, such as the evolving nature of the data and its huge, potentially infinite, size. These challenges require efficient and optimized mining methods, in terms of processing time and memory usage, which are adapted to the stream setting. The stream processing must also be performed under the one-pass (aka single-pass) constraint, where instances can be accessed a limited number of times (in some cases only once). Multiple applications deal with data streams, such as finance, network monitoring, telecommunications, and sensor networks. The continuous arrival of data streams in a rapid, time-varying, and possibly unbounded way raises new fundamental research problems. As research in the data stream environment keeps progressing, the problem of how to efficiently raise an alert when facing abnormal behavior, data, or patterns (aka outliers) remains a significant challenge. In fact, the outliers should be detected as fast as possible to obtain insights for mining purposes. Stream anomaly detection algorithms refer to the methods that are able to extract enough knowledge from the data to compute anomaly scores while dealing with evolving data streams.
Many solutions have recently been developed for processing data streams that arrive at a single location or at multiple locations. The main idea of these approaches is to process data streams on the fly, before storing them. In this context, a recent framework called scikit-multiflow [1] has been released for multi-output, multi-label, and data streaming approaches (see Section 2.2). This framework also includes the popular half-space-trees algorithm [2], a fast anomaly detection method for data streams. The principal motivation of the scikit-multiflow developers is to encourage researchers and practitioners to easily integrate and share their approaches within the framework and make their work easily accessible to the growing Python community. Therefore, we decided to extend the framework with other anomaly detection approaches that can be applied in the data stream context.
In this paper, we first list the existing frameworks for evolving data streams and highlight their characteristics. Subsequently, we present the anomaly detection algorithms that have been adapted to the stream setting and discuss the advantages and disadvantages of each of them. Moreover, we present an implementation, within the scikit-multiflow framework, of some methods that have proved to be efficient in the literature. In particular, we implement the isolation-based anomaly detection (IForestASD) algorithm [3] and highlight the simplicity of the framework, which eases and accelerates its evolution. Afterwards, we perform different experiments to evaluate the predictive performance and resource consumption (in terms of memory and time) of the IForestASD method and compare it with half-space-trees, the well-known state-of-the-art anomaly detection algorithm for data streams. As far as we know, this is the first open source implementation of this algorithm. Finally, we present some improvements of the method in order to take the evolution of the stream into account, especially the appearance of drifts in the data. Because IForestASD does not handle drifts in the data, we extend this algorithm with three approaches that incorporate efficient drift detection methods (ADWIN and KSWIN, as explained in Section 5) inside the core of the IForestASD algorithm. Therefore, our main contributions are summarized as follows:
The paper is organized as follows: Section 1 introduces our work and its motivations. Section 2 provides some background on data stream frameworks and the associated challenges. In Section 3, we provide a survey on anomaly detection for data streams. In Section 4, we discuss isolation-based methods with a focus on the IForestASD algorithm. Section 5 introduces our proposed approaches and how they deal with drifts. We discuss the experimental evaluation of the four implemented methods in Section 6. Finally, Section 7 concludes the work and provides future research directions.
4. Isolation Forest, IForestASD and HSTrees Methods
Liu et al. proposed the original version of Isolation Forest (IForest) in 2008 [26]. This version addressed static datasets. In 2013, Ding and Fei designed an adaptation of IForest to the data stream context using sliding windows [3]: the Isolation Forest Algorithm for Stream Data (IForestASD).
There are many other improvements and extensions of Isolation Forest. As an example, Extended Isolation Forest (EIF) [34] addressed a particular weakness of IForest caused by the successive splits along randomly chosen dimensions. In fact, this splitting method engenders fictitious zones with inconsistent scores, which leads to classification errors. The key idea of EIF is to split data according to a randomly chosen hyperplane generated from the data dimensions. Functional Isolation Forest [35] is the adaptation of IForest to functional data. Liu et al. proposed the SCiForest method [36] to detect local anomalies. Marteau et al. proposed an improvement of IForest, named Hybrid IForest, in order to make IForest able to detect anomalies in various datasets with different distributions [37]. In another context, the authors of [38] recently proposed an adaptation of IForest for categorical and missing data. The Entropy IForest (E-IForest) method has been proposed in [39] for the choice of the optimal dimension to split a node. In [40], Zhang et al. proposed LSHIForest (Locality-Sensitive Hashing IForest) to generalize IForest to other distance measures and data spaces. Simulated Annealing IForest (SA-IForest) [41] is an improvement of IForest that aims to create a more generic forest in the training phase.
Recently, Bandaragoda et al. proposed the isolation using Nearest Neighbor Ensemble (iNNE) [42] method to overcome some weaknesses of IForest: detecting local anomalies, anomalies with a high percentage of irrelevant attributes, anomalies that are masked by axis-parallel clusters, and anomalies in multi-modal datasets.
All of the adaptations and improvements cited above are designed for static datasets and are not adapted to data streams.
A detailed description of IForest and IForestASD is given in the following sections.
4.1. Isolation Forest Method
Isolation Forest uses successive data splits to isolate anomalies. It is based on the following two characteristics of anomalies: first, anomalies are supposed to be the exception, representing a very small proportion of the whole dataset; second, they behave differently compared to normal data. With those properties, a relatively small number of partitions is needed to isolate anomalies, as shown in Figure 2.
IForest is based on a forest of random distinct itrees. Every tree has to decide whether a considered observation is an anomaly or not. If the forest of random itrees globally considers an observation as an anomaly, then it is very likely to be an actual anomaly.
IForest is composed of two phases: the first one is the training phase, which is the construction of the forest of random itrees, and the second one is the scoring phase, where IForest gives an anomaly score for every observation in the dataset.
In the training phase, all of the t random and independent itrees of the forest are built. Using a sample of ψ randomly selected data points, every internal node of the binary itree is split into two other nodes (left and right). In order to split one node, IForest randomly chooses one attribute d from the m data attributes. Subsequently, it randomly chooses a split value v between the min and the max value of d in the considered node. IForest splits internal nodes until complete data isolation or until reaching a maximal tree depth, called hmax, which is equal to ⌈log2(ψ)⌉.
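The split procedure described above can be sketched in plain Python. This is a minimal illustration, not the scikit-multiflow implementation; the `build_itree` helper and its dictionary-based nodes are our own naming:

```python
import math
import random

def build_itree(data, depth=0, max_depth=None):
    """Recursively build one isolation tree (itree) on a random subsample.
    Splitting stops when a node is fully isolated or when the maximal
    depth hmax = ceil(log2(sample size)) is reached."""
    if max_depth is None:
        max_depth = math.ceil(math.log2(max(len(data), 2)))
    if depth >= max_depth or len(data) <= 1:
        return {"size": len(data)}                  # external (leaf) node
    d = random.randrange(len(data[0]))              # random attribute d
    lo = min(row[d] for row in data)
    hi = max(row[d] for row in data)
    if lo == hi:                                    # cannot split further
        return {"size": len(data)}
    v = random.uniform(lo, hi)                      # random split value v
    left = [row for row in data if row[d] < v]
    right = [row for row in data if row[d] >= v]
    return {"attr": d, "split": v,
            "left": build_itree(left, depth + 1, max_depth),
            "right": build_itree(right, depth + 1, max_depth)}
```

Building the forest then amounts to calling `build_itree` t times, each time on a fresh random subsample of size ψ.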
After the forest training phase, the scoring phase can begin. In this phase, every new observation x has to pass through all of the t itrees to obtain its path length h(x) [26]. The anomaly score of x is computed with this formula:

s(x, n) = 2^(−E(h(x)) / c(n))

where E(h(x)) is the average path length of x over the t itrees, and c(n) is the average path length of an unsuccessful search in a Binary Search Tree (BST) of n nodes:

c(n) = 2H(n − 1) − 2(n − 1)/n

where H(i) is the harmonic number, which can be estimated by ln(i) + 0.5772156649 (0.5772156649 is Euler's constant). Finally, if s(x, n) is close to 1, then x is considered as an anomaly. If s(x, n) is less than 0.5, x is considered as normal data.
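The scoring formula above can be checked numerically. The short sketch below (our own helper names) computes c(n) and s(x, n) for a forest trained on subsamples of n = 256 points:

```python
import math

EULER = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Average path length of an unsuccessful search in a BST of n nodes:
    c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) ~ ln(i) + Euler's constant."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2^(-E(h(x)) / c(n)); close to 1 => anomaly, < 0.5 => normal."""
    return 2.0 ** (-avg_path_length / c(n))

# A short average path (an easily isolated point) yields a score near 1,
# while a long average path yields a score below 0.5:
print(round(anomaly_score(1.0, 256), 3))   # ≈ 0.935 -> anomaly
print(round(anomaly_score(12.0, 256), 3))  # ≈ 0.444 -> normal
```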
4.2. IForestASD: Isolation Forest Algorithm for Stream Data Method
Isolation Forest is an efficient method for anomaly detection with relatively low complexity, CPU, and time consumption. However, it requires all of the data at the beginning in order to build the t random samples. It also needs many passes over the dataset to build the whole random forest. Accordingly, it is not adapted to the data stream context. In [3], the authors proposed the so-called IForestASD method, which is an improvement of IForest for ADiDS. IForestASD uses a sliding window to deal with streaming data. On the current complete window, IForestASD executes the standard IForest method to obtain the random forest. This is the IForest detector model of IForestASD.
The IForestASD method can also deal with concept drift in the data stream by maintaining one input desired anomaly rate (u). If the anomaly rate in the considered window is higher than u, then a concept drift is assumed to have occurred, so IForestASD deletes its current IForest detector and builds a new one using all of the data in the considered window.
Figure 3 represents the workflow used in IForestASD to update the model. In Section 6, we present the experimental results of this algorithm in the scikit-multiflow framework, using incremental fit and prediction methods.
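The window-based update policy of IForestASD can be sketched as follows. This is a schematic step function with placeholder callables (`train_iforest` and `score_window` are hypothetical hooks standing in for the actual IForest training and scoring routines):

```python
def iforestasd_step(model, window, u, train_iforest, score_window):
    """One IForestASD window step: score the window with the current model,
    then rebuild the model on this window if the observed anomaly rate
    exceeds the desired rate u (a drift is assumed to have occurred)."""
    predictions = score_window(model, window)        # 1 = anomaly, 0 = normal
    anomaly_rate = sum(predictions) / len(predictions)
    if anomaly_rate > u:
        model = train_iforest(window)                # discard and retrain
    return model, predictions
```

The key design point is that the model is replaced wholesale rather than updated incrementally, which makes the choice of u critical (as discussed in Section 5).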
4.3. Streaming Half-Space Trees
Streaming half-space trees (which we will denote by HSTrees) is an online anomaly detection algorithm proposed by Tan et al. [2] that deals with streams. This method uses an unsupervised learning approach that can learn patterns from data streams with continuous numerical attributes, where the number of features is constant.
The principle of the algorithm consists of splitting the feature space into an ensemble of sub-spaces using binary trees (called half-space trees) to build the ensemble. Each node of a tree has a randomly chosen splitting attribute and represents a given work space (a range of values for a given feature), and its two children nodes refer to two sub-spaces, each representing one half of the parent node's space.
Thus, any data point in the domain traverses a unique path from the root of an HST to one of its leaves, passing through all of the nodes whose work space contains the data point. The anomaly score of half-space trees relies on the mass profile computation, which refers to the number of points or instances observed in a given half-space. The lower the mass, the higher the anomaly probability.
The data stream is then partitioned into short windows of a fixed number of data points. At any given time, we only consider two consecutive windows: the current one, called the latest window, and the preceding one, referred to as the reference window. Once the data points have been recorded in a latest window, it is considered to be full and becomes the reference window; when the next data point arrives, a new latest window begins. The stream's content is used to compute and update the mass profile of each node traversed by the instances in the stream, where each node has two counters: r for the reference mass profile and l for the latest mass profile.
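The two-counter mechanism described above can be sketched as follows (a minimal illustration; the `MassProfile` and `end_of_window` names are ours, not from the HSTrees implementation):

```python
class MassProfile:
    """Per-node mass counters of streaming half-space trees:
    r holds the mass recorded during the reference window,
    l holds the mass accumulated in the latest window."""
    def __init__(self):
        self.r = 0  # reference mass profile
        self.l = 0  # latest mass profile

def end_of_window(nodes):
    """When the latest window is full it becomes the reference window:
    each node's latest mass l is moved into r, and l restarts at 0
    for the new latest window."""
    for node in nodes:
        node.r = node.l
        node.l = 0
```

Scoring then uses the reference masses r, while incoming instances of the latest window only update the l counters, so the model adapts at each window boundary without any retraining.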
One key advantage of the algorithm is that the training phase has constant amortized time complexity and constant memory requirement. When compared with a state-of-the-art method (Hoeffding Trees), it performs favorably in terms of detection accuracy and run-time performance.
4.4. Main Differences between HSTrees and IForestASD
Although HSTrees and IForestASD have both been proposed to detect anomalies in data streams, they differ on several points: the training, testing, and model updating phases.
The training phase: before the trees are built, IForestASD chooses from the original dataset a sample of ψ data points without replacement for each tree planned for the forest, whereas HSTrees creates a fictitious data space. Indeed, for each dimension q of the original dataset, HSTrees creates a minimum and a maximum value, s_q − 2·max(s_q, 1 − s_q) and s_q + 2·max(s_q, 1 − s_q), s_q being a random value in [0, 1]. IForestASD needs a sample of the original dataset to build the ensemble, while HSTrees can build its tree structures without any data. During the construction of the trees, to split a node, IForestASD chooses a random split value v between the min and the max values of the chosen attribute in the node, while for HSTrees v is the middle of the (fictitious) range of the node, (min + max)/2. Note that the maximum size of a tree (hmax) with IForestASD is a function of the sample size (ψ), while it is a user-defined parameter for HSTrees. Accordingly, to build the trees in the forest, HSTrees needs no knowledge of the dataset except the number of attributes, while IForestASD is closely dependent on the dataset.
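The fictitious work-space construction described above can be sketched as follows (a hedged illustration of the range formula given in the text; the `hst_workspace` helper is our own name). Note that the resulting range always covers the unit interval [0, 1]:

```python
import random

def hst_workspace(n_dims, rng=random):
    """Fictitious work space of streaming half-space trees: for each
    dimension q, a random s_q in [0, 1] defines the range
    [s_q - 2*max(s_q, 1 - s_q), s_q + 2*max(s_q, 1 - s_q)],
    so the trees can be built with no data, only the number of attributes."""
    mins, maxs = [], []
    for _ in range(n_dims):
        s = rng.random()
        half = 2 * max(s, 1 - s)   # half-width of the work space, >= 1
        mins.append(s - half)
        maxs.append(s + half)
    return mins, maxs
```

This captures the contrast made in the text: HSTrees only needs the number of attributes, while IForestASD needs an actual sample of the data.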
Score computation: to calculate the score of a given data point, IForestASD is based on the length of the path of this data point in each tree (as described in Section 4.1), whereas HSTrees is based on the score of the data point for each tree. The overall score of a data point is therefore, for IForestASD, based on the average of the path lengths over all the trees in the forest, whereas it is a sum of the scores obtained at the level of each tree in the forest for HSTrees. Unlike IForestASD, which normalizes the average path length with c(n) when calculating the score, HSTrees limits the number of instances that a node can have in a tree, i.e., sizeLimit, a parameter that is predefined by the user.
Drift handling approach: concept drift consists of a change over time in the distribution of the observations' values. Both methods update the base model (the forest) in order to handle concept drift. However, while IForestASD updates its model only on the condition that the anomaly rate in a window is greater than the preset u parameter, HSTrees automatically updates its model for each new incoming window.
Model update policies: HSTrees updates the model after the completion of each batch by resetting to 0 the mass value of each node in each tree. However, IForestASD updates the model only when the rate of anomalies in the window is greater than the predefined parameter u, by retraining a new Isolation Forest model on the last window.
To recap, although IForestASD and HSTrees are both based on the principle of isolation, they have many major differences. In order to compare the performance of these two anomaly detection methods, we carry out, in the following sections, several comparative experiments on different datasets.
5. Drift Detection Methods
IForestASD [3] (also called IFA in the rest of the paper) is an adaptation of IForest to data streams obtained by simply adding a windowing approach (applying IForest to a window of data). The authors also added a drift detection phase. Because the stream is evolving, the underlying distribution may change over time. Therefore, the currently learned model may no longer be representative of the next incoming instances, which can impact the quality of learning. In order to cope with this issue, stream algorithms are generally coupled with a drift detection mechanism to deal with new trends as soon as they appear. More details about concept drift are given by Gama et al. in their survey [8].

In order to detect concept drift, IForestASD maintains a parameter u that corresponds to the rate of anomalies that should not be exceeded in the windows. This parameter is given as an input to the algorithm. Thus, if, in a window, the rate of anomalies detected by the model is greater than u, then IForestASD updates its model with the data in the current window, assuming that a data drift has occurred. Two scenarios are possible: the user gives a u lower than the real rate, in which case the model will be updated in every window; or the user gives a u higher than the real rate, which implies that the model will never be updated, since drift will never be detected even if it occurred. Thus, u should be correctly defined in order to detect drift in the data stream. In an unsupervised mode where there is no information or knowledge about the data, providing the u parameter becomes problematic, and IForestASD's ability to detect drift depends on it. Therefore, in this extended version, we propose improving IForestASD with approaches based on drift detection that can be done at two levels: model drift detection with ADWIN (ADaptive WINdowing) or data drift detection with KSWIN (Kolmogorov-Smirnov WINdowing).
5.1. ADWIN for IFA
Several change detection methods have been proposed in the literature [8]. For instance, ADWIN [43] is a well-known change detector and estimator that maintains a variable-length window W that keeps the most recent instances of the stream, with the constraint that the window has the maximal length statistically consistent with the hypothesis "there has been no change in the average value inside the window" [43]. Technically, ADWIN compares the mean difference of any two sub-windows of W: W0 of old instances and W1 of recent instances. If the difference is statistically significant, then ADWIN erases all of the instances of W0, which represent an old concept, and keeps W1.

In the following, we use ADWIN, in a first attempt, to detect changes in the distribution, thanks to its strong theoretical guarantees and its drift detection capabilities, which have been proved in a wide range of stream algorithms, such as adaptive Hoeffding trees [4] and ensembles [44].
ADWIN provides two methods for detecting drift. First, every instance is added to ADWIN using the add_element() function. This function takes a one-dimensional numeric value (int or float) and adds it to the adaptive window. The second function is detected_change(): it computes the drift detection explained above on the current adaptive window and returns a boolean value that indicates whether a drift was detected or not. The reader can refer to [1] for more information regarding the implementation of ADWIN.
In the IForest model, two results are returned: the data score and the prediction of the model (0 or 1); both can be used as input values for ADWIN. Figure 4 presents our proposed ADWIN-based improvements of IForestASD for drift detection. The drift detection phase is managed by ADWIN. Two improvements have been proposed: one based on the model prediction, called PADWIN IFA (Section 5.1.1), and the other one based on the data score given by the model, named SADWIN IFA (Section 5.1.2).
5.1.1. PADWIN IFA
The idea behind this improvement is that, in the case of a data drift, the predictions will also drift. That is why we decided to use the prediction given by IForest as input to ADWIN. In the PADWIN IFA version, the drift detection is based on the predictions of the model (1 or 0), which are given to the add_element() function. If ADWIN detects a drift, we update the model. Otherwise, we handle the following windows using the existing model.
5.1.2. SADWIN IFA
The prediction of IForest is based on a score computed for each data point. Thus, the score is more representative of the actual data than its predicted class, since the prediction is dependent on a predefined threshold. Therefore, we propose using the score as input to the add_element() function of ADWIN and call this new version SADWIN IFA. If ADWIN detects a drift, then we update the model; otherwise, we handle the following windows using the current IForest model.
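The PADWIN/SADWIN workflow can be sketched as follows. In practice the detector would be scikit-multiflow's ADWIN; here, to keep the sketch self-contained, we use a simplified stand-in detector (`MeanShiftDetector`, our own toy class) exposing the same add_element()/detected_change() interface, and `predict`/`retrain` are hypothetical hooks for the IForest model:

```python
class MeanShiftDetector:
    """Stand-in drift detector with the same interface as ADWIN
    (add_element / detected_change); it flags a change when the mean of
    the most recent values deviates strongly from the older mean."""
    def __init__(self, recent=30, threshold=0.3):
        self.values, self.recent, self.threshold = [], recent, threshold

    def add_element(self, value):
        self.values.append(value)

    def detected_change(self):
        if len(self.values) < 2 * self.recent:
            return False
        new = self.values[-self.recent:]
        old = self.values[:-self.recent]
        return abs(sum(new) / len(new) - sum(old) / len(old)) > self.threshold

def padwin_window(model, window, detector, predict, retrain):
    """PADWIN IFA step: feed each model prediction (0/1) to the detector
    and retrain the forest on the current window if a change is flagged.
    SADWIN IFA is identical but feeds the anomaly scores instead."""
    for x in window:
        detector.add_element(predict(model, x))
    if detector.detected_change():
        model = retrain(window)
    return model
```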
As explained above, all of the ADWIN-based improvements detect the drift of the model after the anomaly detection has been computed. This means that the previous window is lost, because all of its predictions are probably false. To solve this problem, we propose another improvement that detects the drift and updates the model before detecting the anomalies. Because ADWIN is designed for assessing a model's quality, it cannot be directly applied to raw data. Recently, a new data drift detection method, called Kolmogorov-Smirnov WINdowing (KSWIN) [5], has been proposed in the literature and has been implemented in the scikit-multiflow framework.
5.2. KSWIN for IForestASD
KSWIN [5] is a recent method for concept drift detection based on the well-known Kolmogorov-Smirnov (KS) statistical test [45]. The KS-Test is a non-parametric test that does not need any assumption about the underlying data distribution. The KS-Test can only handle one-dimensional data. It compares the absolute distance between two empirical cumulative distributions.
Similar to ADWIN, KSWIN also implements the two functions: add_element(), which adds new one-dimensional data to the local sliding window, and detected_change(), which indicates whether a drift is detected or not. The sliding window has a fixed size; thus, before any new data are added, the oldest data are removed, and the new data are appended at the tail of the sliding window. Figure 5 describes this process.
Because the KS-Test needs two windows to compute the distance, in KSWIN the authors divide the local sliding window into two parts. The first one, R, contains the r (a predefined parameter) most recently arrived data. The second window, W, is obtained by uniformly sampling from the older data in the sliding window. A concept drift is detected in KSWIN when

dist(R, W) > sqrt(−ln(α) / r)

where dist(R, W) is the KS distance between the empirical cumulative distributions of R and W, and α is the probability for the statistic of the KS-Test.
The KS-Test is a one-dimensional method. However, data streams can be multidimensional; thus, it is important to adapt KSWIN to this context. For this purpose, we propose an N-Dimensional KSWIN (which we call NDKSWIN in the rest of the paper). The idea of NDKSWIN is to declare a drift if a change is detected (using KSWIN) in at least one of the n dimensions.
Figure 6 shows the workflow of using NDKSWIN as an improvement of IForestASD in order to detect drift before anomalies. A first window is used to create the NDKSWIN instance with all of the needed parameters. It is also used to compute the anomaly detection model using the IForest approach. For the following windows, before applying the anomaly detection, every window must pass through the NDKSWIN step. The add_element() function randomly chooses n samples from the window and randomly selects n dimensions. For every selected data point, each value of every selected dimension must pass the KSWIN detected_change() test. If a change is detected, the model is updated with all of the data in the current window. Whatever the result of NDKSWIN, the current window must then pass through the IForest model for anomaly detection.
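The "drift in at least one dimension" rule of NDKSWIN can be sketched as a thin wrapper over per-dimension detectors (a minimal sketch; the `NDKSWIN` class below is our own simplification, assuming each per-dimension detector exposes the add_element()/detected_change() interface of KSWIN):

```python
class NDKSWIN:
    """N-dimensional drift detection sketch: one KSWIN-like detector per
    dimension; a drift is declared as soon as at least one of the n
    dimensions detects a change."""
    def __init__(self, detectors):
        self.detectors = detectors            # one detector per dimension

    def add_element(self, sample):
        """Feed each value of a multidimensional sample to the detector
        that is in charge of the corresponding dimension."""
        for value, detector in zip(sample, self.detectors):
            detector.add_element(value)

    def detected_change(self):
        return any(d.detected_change() for d in self.detectors)
```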
6. Experimental Evaluation
This section aims to discuss the experimental results supporting our two contributions: first, the implementation of the IForestASD algorithm in a streaming framework and, second, the improvement of this algorithm by incorporating relevant drift detection methods. The section is organized as follows: we first explain the experimental setup and the material used to reproduce the work; we then provide the results of our implementation with a deep comparison with half-space trees; and, finally, we compare the novel proposed algorithms and their performance with respect to the original version.
6.1. Datasets and Evaluation Metrics
6.1.1. Datasets Description
We tested the different methods discussed above on three well-known real datasets [47], which were used in the original paper of IForestASD. We also used simulated stream data with different configurations to compare IForestASD to our proposed improvements. We performed a benchmark with the half-space trees method on a variety of configurations (12 for each dataset). We generated the simulated stream using the SEAGenerator available in scikit-multiflow, which is based on the data stream generator proposed in [48]. As described in the scikit-multiflow documentation (https://github.com/scikit-multiflow/scikit-multiflow/blob/master/src/skmultiflow/data/sea_generator.py), "It generates 3 numerical attributes, that vary from 0 to 10, where only 2 of them are relevant to the classification task. A classification function is chosen, among four possible ones. These functions compare the sum of the two relevant attributes with a threshold value, which is unique for each of the classification functions. Depending on the comparison, the generator will classify an instance as one of the two possible labels". These stream data, called SEA, have noise but do not contain any drift. Thus, we expect that the experimented methods will not detect any drift. Table 2 summarizes the characteristics of the datasets.
6.1.2. Experimental Setup
In order to compare HSTrees and IForestASD on the one hand, and IForestASD to SADWIN IFA, PADWIN IFA, and NDKSWIN IFA on the other hand, we performed experiments using the hyper-parameters described in Table 2. The number of observations (samples) is 10,000 for both datasets in each experiment. The anomaly threshold is set to 0.5. We set the value of the parameter u to the real anomaly rate in the datasets, as was done for IForestASD in [3]. This value is used to detect drift and reset the anomaly detection model, as shown in Figure 3. According to the results reported in Table 3, with 30 itrees, IForestASD gives good performances. Regarding the window size, the performances decrease after 500. Accordingly, we use window sizes of 100 and 500 with 30 trees for the comparison between IForestASD (IFA), SADWIN IFA, PADWIN IFA, and NDKSWIN IFA.
6.1.3. Evaluation Metrics
We addressed the following three metrics: the F1 score, the Running Time (training + testing) ratio (IForestASD/HSTrees), and the Model Size ratio (HSTrees/IForestASD) for the implementation of IForestASD.
The performance of the models is measured following the prequential (test-then-train) evaluation strategy, which is specifically designed for stream settings, in the sense that each sample serves two purposes (test then train) and that samples are analyzed sequentially, in order of arrival, becoming immediately inaccessible. This approach ensures that the model is tested on new samples that have not yet been used for training. All of the metrics are provided by the scikit-multiflow framework through its prequential evaluator (https://scikit-multiflow.github.io/scikit-multiflow/documentation.html).
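The test-then-train loop described above can be sketched generically as follows (a minimal illustration with placeholder `predict` and `fit` callables; scikit-multiflow's EvaluatePrequential automates this loop and the metric computation):

```python
def prequential(stream, model, predict, fit):
    """Prequential (test-then-train) evaluation: each sample is first
    used to test the current model, then immediately used to update it,
    in strict order of arrival."""
    predictions = []
    for x, y in stream:
        predictions.append(predict(model, x))   # test first ...
        model = fit(model, x, y)                # ... then train
    return predictions, model
```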
6.1.4. F1 Metric
It corresponds to the harmonic mean of precision and recall, and it is computed as follows:

F1 = 2 · (precision · recall) / (precision + recall)

F1 is a useful metric in the case of an imbalanced class distribution, which is the case here, since abnormal data are less represented than normal data.
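The F1 computation can be written directly from its definition (a small helper of our own, expressed in terms of true positives, false positives, and false negatives):

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall),
    with precision = tp / (tp + fp) and recall = tp / (tp + fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# With 5 true positives, 5 false positives, and no false negatives:
# precision = 0.5, recall = 1.0, F1 = 2/3.
print(f1_score(5, 5, 0))
```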
6.1.5. Model Size Ratio (HSTrees Coefficient Ratio)—HRa
In contrast to the running time, when we consider the model size, we observe that IFA always uses less memory than HST. Hence, to compare the two methods, we compute the ratio of the higher value (HST) over the smaller value (IFA) to obtain the HST coefficient. Figure 7 reports the impact of the parameter choices on the resources used (model size, absolute value in kilobytes) and the evolution of the resources used for the three datasets when the window size and the number of trees vary. Interpretations are provided below.
6.1.6. Running Time Ratio (IForestASD Coefficient Ratio)—IRa
Because half-space trees (HST) is always faster than IForestASD (IFA), for both training and testing time, by a factor of up to 400 in the worst case, we report the ratio between the running times (training and testing) of the two models: IFA over HST. For example, for the Forest-Cover data, IFA is three times slower than HST in the best configuration, while in the worst configuration the running time ratio reported in Table 3 is 391, which means that IFA is 391 times slower than HST. The absolute running time values (in seconds) and their evolution as the hyper-parameters vary are reported in Figure 8.
6.2. Comparison of Original IForestASD (IFA) and Streaming Half-Space Trees (HST)
Table 3 reports the results obtained for each hyper-parameter combination (window size and number of trees) for the three real datasets. We highlight the best records in bold.
We observed that, based on the F1 score, IForestASD performs better than HS-Trees for both the Shuttle and SMTP datasets, no matter the window size or the number of trees. Three further points are observed:
First, we noticed that the number of trees does not have a significant impact on the prediction performance for any of the datasets. However, when the window size increases (varying from 50, 100, 500, to 1000), the score gets better, as the F1 score increases for all datasets. Second, while the F1 score keeps improving for HSTrees for window sizes ≥500, it decreases for IForestASD. This can be explained by the fact that, unlike HSTrees, IForestASD deletes its current anomaly detector model (isolation forest) and builds another one using all of the data in the current window when the anomaly rate in the window is higher than u. Third, there is a large difference between the performance of the two models depending on the dataset; in particular, on Shuttle, which has a larger anomaly rate of 7%, IForestASD outperforms HSTrees with 80%. Therefore, we assume that IForestASD performs better on datasets with a relatively high anomaly rate.
In complement to the predictive performance, two important notions in data streaming applications are the time and memory usage of the models. Here, we analyze the impact of the hyper-parameters (window size W and number of trees T) on the amount of resources used by each model and then compare them.
Models Memory Consumption Comparison
Regarding the model size evolution in Figure 7, where the model sizes of both models are represented on two Y axes (IForestASD on the right), we can highlight three points:
IForestASD uses less memory than HSTrees (≈20× less); this is explained by the fact that IForestASD's update policy consists of completely discarding the old model when the anomaly rate in the sliding window exceeds u (update by replacement), while HSTrees continuously updates the model at each observation.
For HSTrees, the window size W has no impact on the model size, only the number of trees T increases the memory used (barplots). This is consistent with the HSTrees update policy, which consists of updating statistics in Tree nodes with a fixed ensemble size.
For IForestASD, both the window size W and the number of trees T have a positive impact on the model size, for the three datasets (the three line plots). This is due to the fact that IForestASD uses all of the instances in the window (W) to build the ensemble of T trees.
When it comes to critical streaming applications, such as a predictive model for network security intrusion, a fast but less accurate model is often preferred over a slow and more accurate one. Indeed, having a high rate of false positive anomalies to investigate is preferred over missing critical true positive anomalies (attacks). Therefore, below, we analyze the behavior of the models in terms of training and testing time.
We can observe from Figure 8 that IForestASD is faster with a small window size, while half-space trees is faster with a bigger window size. The number of estimators increases the running time of both. This is consistent with the time complexity of the two base estimators, the isolation tree and the half-space tree. Indeed, the worst-case time complexity for training an iForest is O(t·ψ·log(ψ)) [26], while for the streaming HS-Trees the (average-case) amortized time complexity for n streaming points is O(t(h + ψ)) in the worst case [3], where t is the number of trees in the ensemble, ψ is the subsampling size of the iForest as well as the window size in HS-Trees, and h is the maximum depth (level) of a tree.

Furthermore, the testing time of IForestASD (on the right axis, in the red line) can be 100× longer than the HSTrees testing time (40,000 vs. 400), which is a constraint for critical applications that need anomaly scoring quickly. The difference between the two models is significant because, in IForestASD, each instance in the window is tested across each of the isolation trees to compute the anomaly score, hence the sharp increase with respect to the hyper-parameters.
Our results highlight that a trade-off must be made between resource consumption, processing time, and predictive performance (F1 for anomalies) to obtain acceptable results. In terms of anomaly detection performance, we can conclude that IForestASD achieves a better F1 score than HS-Trees. However, in data streaming applications, running time is an important factor and faster models are preferred. In this sense, half-space trees is a good option due to its lower processing time, in contrast to IForestASD, whose running time grows steeply with the window size. We have seen that the best model can be either HSTrees or IForestASD, depending on the dataset and application requirements; we therefore recommend testing the models on a sample of the dataset using the scikit-multiflow framework to identify the best parameters for each application.
6.3. Proposed Algorithms to Deal with Drift Detection: SADWIN, PADWIN and NDKSWIN
In this section, we discuss the experimental results obtained by comparing IForestASD with the three newly proposed methods: SADWIN IFA (SAd), PADWIN IFA (PAd), and NDKSWIN IFA (KSi). For this purpose, we used the real datasets and the synthetic data stream SEA presented in Table 2
. In the following experiments, we aim to show that, instead of updating the model in every window, as IForestASD does in the worst case, it is better to integrate an existing drift detection method with IForestASD in order to deal with both anomaly and drift detection.
6.3.1. Experiments Results
Based on the previous experimental results, we observed that the number of trees does not significantly impact the performance of the models beyond 30 trees, and that window sizes greater than 500 bring few improvements. Therefore, in the following experiments, we use window sizes of 100 and 500 with 30 trees in the ensemble. We compared the four methods on the two real datasets (Shuttle, SMTP) and the synthetic dataset SEA. We chose these real datasets in particular because their characteristics are opposite: Shuttle is composed of 9 attributes, while SMTP only has 3 dimensions, and Shuttle contains a sizeable proportion of anomalies compared with SMTP, which only has 0.03% anomalies. We generated the SEA stream, as mentioned above, in order to test the methods under a neutral condition, i.e., a real data stream that does not contain any anomalies, only 0.1% noise.
Moreover, the value of the drift_rate parameter supplied to IForestASD corresponds to the percentage of noise or anomalies present in the data stream. For the other methods, however, no information was provided regarding the percentage of anomalies in the dataset.
In Table 4
, we provide the performance in terms of the F1-score of the four methods with the different aforementioned configurations. We highlighted the best records in bold. For window sizes 100 and 500, we noticed that the testing time needed to predict the anomalies with the 3 new methods (SAd, PAd, KSi) is lower than that of the streaming IForestASD. Thus, we provide a metric equal to the ratio of the IForestASD total running time over the running time of the other methods (SAd, PAd, KSi); if this metric is equal to 3, the method is three times faster than the original one. We did the same for the model size metric. The results are reported in Table 4
as Time ratio and Size ratio, respectively.
Table 4 shows that the IForestASD method almost systematically updates the model in each window for SMTP and SEA. Because there are very few anomalies in SMTP and none in SEA, the method's false alarms lead IForestASD to conclude that a drift has occurred, since the predefined percentage is exceeded. This is due to the fact that IForestASD does not integrate an explicit drift detection mechanism: relying on the percentage of anomalies detected in the window makes it highly dependent on the model's ability to detect anomalies, and especially true anomalies.
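The update policy described above can be sketched as follows (a simplified illustration with hypothetical names, not the actual implementation):

```python
def iforestasd_should_update(window_predictions, drift_rate):
    """IForestASD-style policy: rebuild the forest whenever the fraction of
    instances flagged as anomalous in the current window exceeds the
    user-defined drift_rate threshold."""
    anomaly_rate = sum(window_predictions) / len(window_predictions)
    return anomaly_rate > drift_rate

# With few true anomalies in the stream, a handful of false positives
# already exceeds a small threshold and triggers a (costly) model rebuild.
preds = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # 20% flagged, mostly false alarms
print(iforestasd_should_update(preds, drift_rate=0.03))  # True: model is rebuilt
```

This makes concrete why IForestASD rebuilds almost every window on SMTP and SEA: the flagged rate only has to exceed the small predefined percentage, regardless of whether the flags are true anomalies.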
The best performance is obtained on the Shuttle dataset, which contains many anomalies. In terms of drift detection, we noticed that the ADWIN-based methods did not detect any drift on Shuttle and SEA. This behavior is expected given the absence of drifts in the SEA dataset. On the other hand, NDKSWIN IFA detects drifts in some windows and updates the model accordingly, thanks to its strategy of considering several dimensions during the search for drifts: if the statistics change for one dimension, the method assumes that a drift has occurred. This poses the problem of choosing which dimensions to consider and how much importance to give to each of them; to mitigate it, a feature selection technique could be integrated to make the method more robust and efficient. Using the same model from the beginning, without updating it as the stream evolves, leads IForestASD, SADWIN IFA, and PADWIN IFA to similar results in terms of F1-score and execution time. However, SADWIN IFA and PADWIN IFA are slightly slower than IForestASD, which is expected under these circumstances.
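The idea behind this multi-dimensional test can be sketched with a per-dimension two-sample Kolmogorov–Smirnov statistic, signalling a drift as soon as any dimension exceeds a threshold (a simplified sketch, not the actual NDKSWIN implementation; the threshold value is illustrative):

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    values = sorted(set(a) | set(b))
    cdf = lambda sample, v: sum(x <= v for x in sample) / len(sample)
    return max(abs(cdf(a, v) - cdf(b, v)) for v in values)

def any_dimension_drifted(old_window, new_window, threshold=0.5):
    """Flag a drift if the distribution changed in at least one dimension."""
    n_dims = len(old_window[0])
    return any(
        ks_statistic([x[d] for x in old_window],
                     [x[d] for x in new_window]) > threshold
        for d in range(n_dims)
    )

old = [(0.10, 5.0), (0.20, 5.1), (0.30, 4.9), (0.15, 5.2)]
new = [(0.10, 9.0), (0.20, 9.1), (0.30, 8.9), (0.25, 9.2)]  # dim 1 shifted
print(any_dimension_drifted(old, new))  # True: dimension 1 drifted
```

The `any(...)` aggregation is exactly what makes the choice and weighting of dimensions matter: one noisy dimension crossing the threshold is enough to trigger a model update.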
The proposed methods are generally faster than the standard IForestASD. Indeed, when updating its model, IForestASD consumes additional time to build the 30 trees from the 100 or 500 instances in the window, according to Table 4
. One could initially assume that the SADWIN IFA, PADWIN IFA and NDKSWIN IFA methods should take more time than IForestASD, since they integrate ADWIN and NDKSWIN, which should also consume time during the search for drifts. However, this is not the case, since these methods only update the model when a drift is detected. Therefore, IForestASD ultimately takes more time than they do. On the other hand, IForestASD consumes less memory than these methods, which is explained by the fact that IForestASD completely removes the old model in order to create a new one. Nevertheless, ADWIN and NDKSWIN keep a sliding window in memory in order to detect possible drifts and every time new data is added to the window, ADWIN and NDKSWIN perform statistical calculations for drift detection. In Table 4
, we noticed that the performance in terms of F1-score of the new methods is either similar to or better than that of IForestASD, even though IForestASD performs more model updates. This implies that there is no need to update the model in every window; hence the importance of integrating an explicit drift detection method.
In summary, the main finding points are the following:
IForestASD does not have a real drift detection method, as its model update policy depends on a user-defined threshold (anomaly rate), which is generally not available in real applications.
Integrating a drift detection method in conjunction with the vanilla IForestASD reduces the training time while maintaining similar performance.