An outlier is a data point that differs significantly from the other points in the dataset [1
]. Outliers affect the performance of data analysis algorithms and lead up to misleading results [2
]. Thus, detecting outliers becomes an important step in data analysis and data mining. Outlier detection algorithms detect rare or abnormal behavior that can be considered important than normal behavior in many applications such as cancer diagnosis, product defects detection, fraud credit card transactions and hacking network traffic data. Outlier detection algorithms are also used as a preprocessing step for the data mining algorithms to filter datasets from outliers [3
Outlier detection algorithms have extensively been tackled in the past fifteen years. Many algorithms with different approaches have been introduced in the literature [4
] which can be, in general, categorized into [12
] and clustering-based
approaches aim at finding the probability distribution/model of the underlying normal data and define outliers as those points that do not conform to that model. However, a single distribution may not model the entire data that may originate from multiple unknown distributions limiting the practical adoption of such approaches especially in high dimensional data. Distance-based
approaches define outliers as the points that are located far away from the majority points using any distance metric like Manhattan distance or Euclidean distance metrics. These approaches fail when the data points have different spatial densities (sparsity) and thus defining an outlierness distance is not practically feasible [24
]. To combat this issue, Density-based
approaches are proposed to investigate not only the local density of the point in question but also the local densities of its nearest neighbors. Density-based approaches feature a stronger modeling capability of outliers but require more expensive computation at the same time.
Clustering techniques, on the other hand, aim at finding patterns characterizing the underlying data then organize the data accordingly into normal clusters (the majority) and outliers deviating from them. In particular, clustering-based approaches can be considered as an unsupervised anomaly detection techniques which receive a great deal of attention because they can construct detection models without using labeled training data. This advantage boosts their applicability in real environments since labeled data or purely normal data cannot be readily obtained in practice. This motivates us to leverage clustering techniques in the outlier detection problem.
Nevertheless, the main challenge of the clustering-based outlier detection approaches is to get rid of the strong tie between using the outlier detection as a preprocessing step for clustering outlier-free data and the dependency of outlier detection on clustering which may be considered a deadlock problem. Additionally clustering algorithms require a large number of distance calculations which have negative impact on the scalability of the algorithm, in both time and memory [27
]. Unlike density-based
approaches do not include a measure for the outlierness degree [29
] for data points which is an essential demand in some applications. Majority of the outlier detection approaches are considered One-At-A-Time query approaches that require to re-run the algorithm from scratch for every one-time query in order to identify the outlier points in query [5
]. One-At-A-Time query approach is a time-consuming process which is clearly infeasible in real-time systems that require in-time response such as fraud credit card detector and hacking network traffic identifier.
To solve the above-mentioned challenges, an OFCOD: On-the-Fly Clustering-based Outlier Detection framework is proposed in this paper. OFCOD is a hybrid outlier detection framework which mixes clustering with a density-based approach in a novel way. The combination of clustering techniques and density measures yields a significant improvement as it enjoys the advantages of each individual method. Specifically, the proposed algorithm consists of two stages; an offline stage and an online stage. In the offline stage, OFCOD employs the K-Medoids clustering algorithm [30
]. A proposed enhancement is presented to reduce the distance calculations performed by K-Medoids to avoid repeated/redundant distance calculations and optimize the utilization of the computer memory. Thereafter, normal clusters and outlier clusters are identified based on multiple criteria and a density-based outlierness factor is only calculated for probable outlier points leading to a severe reduction in the computational cost. This paper extends our earlier work in Reference [23
]. Specifically, The cluster representative points and the detected outliers can be then leveraged to classify novel instances in the online phase. Therefore, unlike Reference [23
] or the baseline approaches, OFCOD is capable of detecting outliers on-the-fly at run time without re-running the whole process from scratch enabling its adoption in real-time applications. Finally, OFCOD provides a unique feature which enables the constructed clusters in the offline phase to be updated at run time by adjusting the clusters’ representative points upon new instances (queries).
Evaluation of OFCOD is done on two realistic datasets (of two different applications): Wisconsin Breast Cancer (WBC) dataset and KDD’99 dataset. The former contains 699 diagnostic data points dedicated for identifying breast cancer and the latter contains five millions network connection instances used to build intrusion detection systems. Our results show that OFCOD outperforms the baseline approaches on both datasets. Specifically, OFCOD provides high true positive rate (at least 99.39%) without compromising on the false positive rates.
The rest of this paper is organized as follows—the OFCOD framework is introduced in Section 2
. The efficacy of the proposed framework is experimentally verified in Section 3
. At the end, Section 4
concludes the whole paper and points out future research directions.
3. Results and Discussion
The performance of the proposed framework is evaluated using two real datasets: Wisconsin Breast Cancer (WBC) and KDD’99 datasets.
The Wisconsin Breast Cancer (WBC) [37
] is a real dataset originally collected from Wisconsin hospitals. WBC dataset consists of 699 points each with 9 numerical features representing the values of diagnosis. Each point belongs to one of two classes: benign or malignant. The WBC dataset includes 458 points (65.5%) as benign class and 241 points (34.5%) as malignant. The KDD’99 dataset [38
] consists of 41 features of network connections. These connections are labeled either normal
of four different types. KDD’99 dataset consists of a training set which has 4,898,431 (5 millions) connection instances and a test set of 311,029 (300 K) connections. These datasets include normal connections in addition to four main types of intrusions (outliers). For the clarity of presentation, we consider WBC as the default dataset and we have also tested our proposed framework on the KDD’99 in Section 3.2.6
3.1. The Offline Stage of the Proposed Framework
We adopt the techniques used in References [1
] by randomly removing some of the malignant points to form a very unbalanced distribution. As a result, the dataset has 39 (8%) malignant points (considered as outliers) and 444 (92%) benign points (considered as inliers). The removed points will be used to test the performance of the online stage of the algorithm. True positive rate (TPR) and false positive rate (FPR) [36
] are used as metrics to evaluate the performance of proposed outlier detection algorithm.
The data points are normalized to be in the range from 0 to 1 using Min-Max normalization. The number of clusters k is used as the only input to the offline stage of the proposed framework. k is set only one time. A 10% of the dataset is randomly chosen as a sample for the preliminary clustering step. The experimental results show that by choosing k = 8, the best clustering results are obtained. The value of k is fed into the offline stage of the framework only once.
The performance of the proposed offline stage is compared with the traditional algorithms; PLDOF [9
], CBOD [22
] and FindCBLOF [24
]. TPR and FPR are used as evaluation metrics. The top-n outlier points are requested with random sequence to ensure the validity of results. Table 1
shows the results obtained by the offline stage of the proposed framework compared the other approaches.
It is noted from Figure 7
that the offline stage of our algorithm is the first to reach the maximum detection rate of 100% while keeping FPR at the minimum value of 1.8% as shown in Figure 8
and Figure 7
and Figure 8
exhibit the direct proportional relation between the number of points and the detection rate which may be considered as a good performance evidence for all algorithms. However, with the increase of the number of requested outlier points, the FPR increases as well for all algorithms with no limits except for the proposed algorithm. The proposed algorithm reaches the maximum value of FPR 1.8% at a number of points equals 47. This value is kept constant when the number of points is greater than 47. Also, the average TPR, and FPR along with the percentage of outlierness factor calculations for all algorithms are shown in Figure 9
. As seen, the proposed algorithm outperforms all other algorithms.
The computational time of the proposed algorithm is measured in order to show the effect of each proposed modification on the response time. All experiments have been run on a 64-bit 2.2 GHz i5 CPU, Asus UX303 machine, Windows 10 and 12 GB of physical memory. Figure 10
shows the time performance of the proposed algorithm before and after modifying the distance calculations of K-Medoids algorithm.
The effect of outlier factor calculation reduction is illustrated in Figure 11
which ensures the importance of this proposed modification. We could not compare the response time of the proposed algorithm to the other three algorithms (PLDOF, CBOD and FindCBLOF) as they did not consider the time evaluation in their work. However, as they use repeated calculations, their time is expected to be larger than the proposed algorithm.
3.2. The Online Stage of the Proposed Framework
In this section, a diverse set of experiments is executed to find the best configuration of our framework. The configuration parameters of our framework are systematically defined by executing a separate experiment for each adjustable parameter. First, we study the effect of the number of clusters (k) which controls the number of classes and defines a trade-off between accuracy and efficiency. Second, we study the effect of the number of top outliers queried (N). As we have used only 483 points from the dataset in the offline stage (the proposed hybrid algorithm), the remaining points, which represent 31% of the original dataset, are used to test the performance of the online stage.
3.2.1. The Effect of the Number of Clusters k
shows the effect of the number of clusters on the TPR and FPR of the OFCOD. The figure shows that the TPR is first enhanced by increasing the number of clusters until reaching its maximum value at k
= [4,8] then it tends to decrease again. On the other side, the FPR decreases with the increase of the number of clusters. Furthermore, the increase of the number of clusters highly affects the computation time and complexity of the clustering process. Hence, the optimal results have been obtained at k
= 8 where TPR = 100% (at the left y-axis) and FPR = 1.8% (at the right y-axis), which is selected and fed into the clustering step of the OFCOD offline stage.
3.2.2. The Effect of the Number of Top Outliers Queried
In this section, we investigate the queried top outlier points (records) which have the highest outlierness factor. Figure 13
shows the true positive rate (TPR) for the proposed algorithm used in the offline phase at different number of queried outlier points (N). As shown, the detection rate of the proposed algorithm raises (until reaching 100% on the left scale) with the increase of the number of requested outlier points. Even with a number of points is as low as 47, OFCOD can achieve its maximum value of 100%. However, the FPR increases until it saturates at only 1.8% (at the right scale) without further increase. This highlights the stability of the proposed algorithm and how it outperforms the compared schemes (as shown in Figure 7
and Figure 8
) whose FPRs are negatively affected by the increase of the number of queried outliers.
3.2.3. The OFCOD Performance When Varying the Test Set
Five different sized test sets are randomly constructed (from the WBC dataset) consisting of a number of inliers (normal points) and outliers. In order to validate the generalization performance of the proposed framework, majority of these points are new points which are not invoked in the offline stage. Each dataset has a different percentage of outliers. The generated datasets are with 5%, 15%, 20%, 25% and 31% outliers out of all outliers exist in the original dataset.
and Table 3
exhibit the true positive rates and the false positive rates of the online stage, for each dataset with respect to the number of clusters. For instance, at
, Table 2
shows that the TPR equals 100% for 5%, 15% and 20% datasets and slightly decreases with the increased number of points (25% dataset).This is acceptable due to the possibility of misclassification of at least one point. For the next requests of the same datasets (25% dataset) or the 31% dataset, the TPR increases again to reach 100%. This is due to the centroid re-adjustment feature of the online stage. A similar scenario occurs in Table 3
for the false positive rates. The results show consistent favorable performance of the true positive rates (high values) and the false positive rates (low values). These results confirm the generalization ability of the proposed framework due to its adaptive nature to adjust clusters’ centroids upon new requests.
3.2.4. The OFCOD Response Time
The computational time of the proposed online stage of the framework is measured in order to show the response time. All experiments have been run on a 64-bit 2.2 GHz i5 CPU, Asus UX303 machine, Windows 10 and 12 GB of physical memory. Figure 14
shows the in-time response of the proposed on-the-fly framework is significantly better than the proposed traditional hybrid outlier detection algorithm. This is because the proposed framework has reduced run-time computations without the need to launch clustering from scratch.
3.2.5. Comparative Evaluation
We compute the TPR and FPR of the proposed algorithm on a test set which results in 99.58%. The accuracy of the proposed OFCOD framework is compared to K-SVM approach [39
], KNN [40
] and Lasso [40
] on the same dataset to evaluate the performance of the proposed framework. In K-SVM approach [39
], a k one-class support vector machine models are trained for normal data. It worths noting that, one-class SVM is a typical unsupervised learning technique which can define a hypersphere (i.e., a cluster) in which most of the normal points fall. During the testing stage (i.e., run-time), the K-SVM approach calculates the distance of each data point in question to the center of the K hyperspheres created in the training phase. Thereafter, the SVM model of the closest hypersphere to the point, is queried to classify either the point is inlier or outlier. Similarly, the K-Nearest Neighbor (KNN [40
]) approach creates clusters inside each class of points. Therefore, any new instance is compared to the created clusters to find the most similar one to which the new instance belongs. On the other hand, the Lasso approach [40
] transforms the anomaly detection problem into a linear regression model. It defines the detection parameters as regression variables and constructs the model (finding the regression variables) that maximizes the detection accuracy.
shows that the proposed OFCOD platform outperforms all approaches in both the TPR and the FPR while maintaining the response time as small as low as 23 ms. This can be justified as the performance of K-SVM approach [39
] depends heavily on the assumption that all hyperspheres only model normal data. This assumption cannot be justified and guaranteed since outliers can be grouped in small hyperspheres (clusters) as discussed in Section 2.1.3
and Section 2.1.4
. The KNN approach [40
] has the similar problem in addition to its limited generalization ability to unseen test instances. These issues lead to the worst results among others. On the contrary, the Lasso approach [40
] relaxed this assumption leading to better results compared to the K-SVM and KNN approaches. In conclusion, the proposed OFCOD framework has the advantage of finding outliers in both outlier clusters and normal clusters which is the general case. Moreover, it has the ability to update clusters and re-adjust their centroids based on new queries.
3.2.6. Scalability Evaluation
In this section, we evaluate the scalability of the proposed framework when applied to large datasets. For this purpose, the KDD’99 dataset described above is adopted. We used five million connections for learning purposes in the offline phase, and 300 K connections are used to test the overall OFCOD’s performance. To ensure the generalization ability of the proposed framework, the five million connections used in the offline phase are sampled to randomly generate four different datasets including 10% (500 K), 20% (1 M), 50% (2.5 M), and 100% (5 M) connections. The number of clusters that scores the best results are obtained at 52.
shows the TPR and FPR corresponding to each dataset. The figure shows that, the more available data for the offline phase of the framework, the better performance (i.e., increasing TPR and decreasing FPR). This can be justified due to ability of the proposed framework to build clusters with near perfect separable hyperplanes which facilitates the anomaly detection tasks. On the other hand, even with the availability of only relatively small subsets (e.g., 10% (500 K) and 20% (1 M)), the OFCOD framework is able to learn from the new queries and adjust the cluster centroids incrementally.