1. Introduction
Clustering is one of the core techniques in data mining. Its purpose is to form groups from data in such a way that the observations within one group, the cluster, are similar to each other and dissimilar to observations in other groups. Prototype-based clustering algorithms, such as the popular K-means [1], are known to be sensitive to initialization [2,3], i.e., to the selection of the initial prototypes. A proper set of initial prototypes can improve the clustering result and decrease the number of iterations needed for the convergence of an algorithm [3,4]. The initialization of K-means was remarkably improved by the work of Arthur and Vassilvitskii [5], where they proposed the K-means++ method. There, the initial prototypes are determined sequentially by favoring distinct prototypes, which with high probability are dissimilar to the already selected ones.
A drawback of K-means++ is that the initialization phase requires K inherently sequential passes over the data, since the selection of a new initial prototype depends on the previously selected prototypes. Bahmani et al. [6] proposed a parallel initialization method called K-means‖ (Scalable K-means++). K-means‖ speeds up the initialization by sampling each point independently and by updating the sampling probabilities less frequently. Independent sampling of the points enables parallelization of the initialization, thus providing a speedup over K-means++. However, a MapReduce-based implementation of K-means‖, for example, needs multiple MapReduce jobs for the initialization. The MapReduce K-means++ method [7] tries to address this issue, as it uses one MapReduce job to select the K initial prototypes, which speeds up the initialization compared to K-means‖. Suggestions for parallelizing the second, search phase of K-means have been given in several papers (see, e.g., [8,9]). On a single machine, distance pruning approaches can be used to speed up K-means without affecting the clustering results [10,11,12,13,14]. Besides parallelization and distance pruning, data summarization is also a viable option for speeding up K-means clustering [15,16].
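The independent-sampling idea behind K-means‖ can likewise be sketched compactly. The following illustrates only the oversampling phase of [6]; the final weighted reclustering of the candidate set into K prototypes is omitted, and the number of rounds and the oversampling factor l are left as free parameters:

```python
import numpy as np

def kmeans_par_oversample(X, K, l, rounds=5, rng=np.random.default_rng()):
    """K-means||-style oversampling [6]: each point is sampled
    independently, and the sampling probabilities are updated only
    once per round, which makes each round embarrassingly parallel.

    Returns a candidate set C of roughly l * rounds points; in the
    full method, C is then reclustered (with weights) into K prototypes.
    """
    N = X.shape[0]
    C = X[rng.integers(N)][None, :]  # one uniform seed point
    for _ in range(rounds):
        # Squared distance to the closest candidate (parallelizable).
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        # Independent Bernoulli sampling with probability l * d2 / cost.
        p = np.minimum(1.0, l * d2 / d2.sum())
        C = np.vstack([C, X[rng.random(N) < p]])
    return C
```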
Dimension reduction has had an important role in making clustering algorithms more efficient. Over the years, various dimension reduction methods have been applied to decrease the dimension of data in order to speed up clustering algorithms [17,18,19,20]. The key idea for improved efficiency is to compute an approximate solution to the clustering problem in a lower-dimensional space. Dimension reduction methods are usually divided into two categories: feature selection methods and feature extraction methods [21]. Feature selection methods aim to select a subset of the most relevant variables from the original variables. Correspondingly, feature extraction methods aim to transform the original dataset into a lower-dimensional space while trying to preserve the characteristics of the original data, especially the distances between the observations and the overall variability.
A particular dimension reduction approach for processing large datasets is the random projection (RP) method [22]. The main characteristic of RP is that it projects data from the original space to a lower-dimensional space while preserving the distances. This makes RP very appealing in clustering, whose core concept is dissimilarity. Moreover, classical dimension reduction methods such as principal component analysis (PCA) [23] become expensive to compute for high-dimensional spaces, whereas RP remains computationally efficient [24].
Fern and Brodley [18] proposed an ensemble clustering method based on RP. They showed empirically that aggregating clustering results from multiple lower-dimensional spaces produced by RP leads to better clustering results than a single clustering in a lower-dimensional space produced by PCA or RP. Other combinations of K-means and RP have been studied in several papers [17,25,26,27]. RP for K-means++ was analyzed in [28]. Generally, the main idea is to create a lower-dimensional dataset with RP and to solve the ensuing K-means clustering problem with less computational effort. On the other hand, one can also optimize a clustering method's proximity measure for small datasets [29].
In general, the K-means clustering procedure typically uses a non-deterministic initialization, such as K-means++, followed by Lloyd's iterations [1], with multiple restarts. The prototypes corresponding to the smallest sum-of-squares clustering error are selected as the final clustering result. In [30], such a multistart strategy was carried out already during the initialization phase, thus reducing the need to repeat the whole clustering algorithm. More precisely, a parallel method based on K-means++ clustering of subsets produced by the distribution optimally balanced stratified cross-validation (DOB-SCV) algorithm [31] was proposed and tested. Here, such an approach is developed further with the help of K-means‖ and RP. More precisely, we run the K-means‖ method in low-dimensional subsets created by RP. In contrast to the previous work [30], the new methods also restrict the number of Lloyd's iterations in the subsets.
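Schematically, this approach can be read as follows. The sketch below is illustrative only (the exact subset construction, the weighted reclustering, and the restricted Lloyd's iterations are specified in Section 3), and it reuses the `random_projection`, `kmeans_par_oversample`, and `kmeans_pp_init` sketches from above:

```python
import numpy as np

def subset_rp_init(X, K, n_subsets, p, rng=np.random.default_rng()):
    """Illustrative SRPK-means||-style initialization: cluster each
    subset in a coarse RP space and keep the prototypes with the
    smallest SSE, evaluated in the original space."""
    best_sse, best_centers = np.inf, None
    for S in np.array_split(rng.permutation(X), n_subsets):
        S_low = random_projection(S, p, rng)  # coarse, low-dim space
        C = kmeans_par_oversample(S_low, K, l=2 * K, rng=rng)
        # Recluster the candidates to K prototypes (unweighted here
        # for brevity; K-means|| uses weighted reclustering).
        C_K = kmeans_pp_init(C, K, rng)
        d2 = ((S_low[:, None, :] - C_K[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Prototypes as original-space cluster means (assumes every
        # cluster receives at least one point).
        centers = np.array([S[labels == k].mean(axis=0) for k in range(K)])
        sse = sum(((S[labels == k] - centers[k]) ** 2).sum()
                  for k in range(K))
        if sse < best_sse:
            best_sse, best_centers = sse, centers
    return best_centers
```

Since each subset is processed independently, the loop parallelizes directly over the subsets.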
Vattani [32] showed by construction that the number of iterations, and thus the running time, of the randomly initialized K-means algorithm can grow exponentially already in small-dimensional spaces. As stated in the original papers [5,6], K-means++ and K-means‖ readily provide improvements to this both in theory and in practice. Concerning our work, we provide a time complexity analysis for SK-means‖ in Section 3.1 and for SRPK-means‖ in Section 3.2. In terms of time complexity, SRPK-means‖ reduces the time complexity of the initialization for large-scale, high-dimensional data (this was also confirmed by our experimental results) and provides better clustering results; thus, it reduces the need for restarts compared to the baseline methods. The reduced need for restarts also improves the overall time complexity of the K-means algorithm. In terms of clustering accuracy, SK-means‖ has the same effect for large-scale, lower-dimensional datasets. Moreover, we showed for the synthetic datasets (M-spheres) that the random projection variant of the initialization (SRPK-means‖) can provide a clear advantage in very high-dimensional cases, where the distances can become meaningless for other distance-based initialization methods.
The main purpose of this article is to propose two new algorithms for clustering initialization and to compare them experimentally to the initializations of K-means++ and K-means‖ using several large-scale datasets. To summarize the main justification of the proposed methods: they provide better results than the baseline methods with equal or better running time. The proposed initialization methods reduce data processing through sampling, subsampling, and dimension reduction, solving the K-means clustering problem in a coarse fashion. Moreover, from the perspective of parallel computing, using a parallelizable clustering method in the subset clustering allows fixing the number of subsets and treating each subset locally in parallel, hence improving the scalability.
For quantified testing and comparison of the methods, we introduce a novel clustering problem generator for high-dimensional spaces (see Section 3.4). Currently, challenging simulated datasets for high-dimensional clustering problems are difficult to find. For instance, the experiments with the DIM datasets of tens or hundreds of dimensions in [4] were inconclusive: all clustering results and cluster validation index comparisons behaved perfectly, without any errors. Therefore, better experimental datasets are needed, and they can be produced with the proposed algorithm.
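The actual generator is specified in Section 3.4; purely to illustrate the kind of construction involved, a hypothetical minimal generator that draws each cluster uniformly from the surface of a sphere around a random center could look as follows (all names and parameters here are illustrative, not the proposed algorithm):

```python
import numpy as np

def m_spheres(M, n_per_cluster, P, radius=1.0, sep=4.0,
              rng=np.random.default_rng()):
    """Hypothetical M-spheres-style generator: each cluster consists
    of points sampled uniformly from the surface of a P-dimensional
    sphere of the given radius around a randomly drawn center.

    'sep' scales the spread of the centers; the difficulty of the
    problem is controlled by the ratio of 'sep' to 'radius'.
    """
    centers = rng.normal(0.0, sep, size=(M, P))
    data, labels = [], []
    for m in range(M):
        # Normalized Gaussian vectors are uniform on the unit sphere.
        V = rng.normal(size=(n_per_cluster, P))
        V /= np.linalg.norm(V, axis=1, keepdims=True)
        data.append(centers[m] + radius * V)
        labels.append(np.full(n_per_cluster, m))
    return np.vstack(data), np.concatenate(labels)
```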
5. Conclusions
In this paper, we proposed two parallel initialization methods for large-scale K-means clustering and a new high-dimensional clustering data generation algorithm to support their empirical evaluation. The methods are based on a divide-and-conquer type of K-means‖ approach and on random projections. The proposed initialization methods are scalable and fairly easy to implement with different parallel programming models.
The experimental results for an extensive set of benchmark and novel synthetic datasets showed that the proposed methods improve the clustering accuracy and the speed of convergence compared to state-of-the-art approaches. Moreover, the deteriorating behavior of the K-means++ and K-means‖ initialization methods in high dimensions can be remedied with the proposed RP-based approach, which provides accurate initialization also for high-dimensional data. Our experiments also confirmed the finding (e.g., [52]) that the difference between the errors (SSE) of good and bad clustering results in high-dimensional spaces can be surprisingly small; this can also challenge cluster validation and cluster validation indices (see [4] and references therein) in such cases.
Experiments with the SRPK-means‖ method demonstrate that the use of RP and K-means‖ is beneficial for clustering large-scale, high-dimensional datasets. In particular, SRPK-means‖ is an appealing approach as a standalone algorithm for clustering very high-dimensional, large-scale datasets. In future work, it would be interesting to test an RP-based local SSE selection for SRPK-means‖, which uses the same RP matrix in each subset for the initial prototype selection. In this case, the use of sparse RP variants [54] or the mailman algorithm [17] for the matrix multiplication could be beneficial, particularly in applications where K is close to P. Furthermore, integrating the proposed methods into the robust K-means‖ [37] would be beneficial for clustering noisy data, because the clustering problems in these cases are especially challenging.
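As an illustration of the sparse RP idea (we do not claim this is the exact variant of [54]), a well-known database-friendly construction due to Achlioptas draws each entry from {+1, 0, -1} so that two thirds of the entries vanish:

```python
import numpy as np

def sparse_rp_matrix(P, p, rng=np.random.default_rng()):
    """Achlioptas-style sparse RP matrix: entries are
    sqrt(3/p) * (+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6),
    so two thirds of the multiplications can be skipped while
    Johnson-Lindenstrauss-type distance guarantees still hold.
    """
    signs = rng.choice([1.0, 0.0, -1.0], size=(P, p), p=[1 / 6, 2 / 3, 1 / 6])
    return np.sqrt(3.0 / p) * signs
```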