## 2. The Kd-Tree and Data Clustering

Developed around 1962, K-means is the most popular data clustering algorithm; it is designed to create clusters that minimize the squared error between each cluster's points and the cluster's centroid (mean) [11,12]. Improving the K-means algorithm has been the focus of many studies to date, producing multiple variations and methods that focus on improving speed [13], the selection of initial cluster centers [14,15], or the reduction in the number of iterations [16].

To introduce our novel framework, we first need to introduce the Kd-tree [17,18], a multi-dimensional binary tree used for organizing data points in K-dimensional space in order to decrease the time of the nearest-neighbor search. The tree consists of a number of nodes and leaves, where the root node contains all data points. Each internal node has two child nodes, while leaves have no children. Data points, or references to them, are stored in leaves only.

The Kd-tree uses splitting planes that are perpendicular to one of the coordinate system axes, chosen so that each plane passes through one of the points in the tree. Several variants of the Kd-tree exist, differing in the stopping criterion or the applied splitting plane.
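As a concrete illustration, the following is a minimal sketch of one common Kd-tree variant (median split along a cycling axis, with the plane passing through a point); the names `KdNode` and `build_kdtree` are illustrative, not taken from the cited works.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KdNode:
    axis: int = -1                    # splitting dimension (-1 marks a leaf)
    split: float = 0.0                # coordinate of the splitting plane
    left: Optional["KdNode"] = None   # points on the low side of the plane
    right: Optional["KdNode"] = None  # points on the high side (includes the plane point)
    points: Optional[List[List[float]]] = None  # data points, stored in leaves only


def build_kdtree(points: List[List[float]], depth: int = 0) -> KdNode:
    """Recursively split at the median point along a cycling axis."""
    if len(points) <= 1:
        return KdNode(points=points)
    axis = depth % len(points[0])     # cycle through the k dimensions
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2               # the plane passes through this point
    return KdNode(axis=axis,
                  split=pts[mid][axis],
                  left=build_kdtree(pts[:mid], depth + 1),
                  right=build_kdtree(pts[mid:], depth + 1))
```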

The idea of using the Kd-tree for clustering was first introduced by Moore [19]. It was used for estimating the parameters of a mixture of Gaussian clusters to reduce the cost of EM-based clustering.

To speed up the K-means algorithm and make it tractable for large datasets, the blacklisting algorithm [20,21] was proposed. This algorithm updates data points in each cluster in bulk instead of updating each data point in each cluster separately. It is based on a variant of the Kd-tree called the MRKd-tree (Multiple Resolution K-dimensional tree) [22].

In 2002, the filtering algorithm [23] was introduced; it is an efficient variant of K-means that uses the Kd-tree as a data structure for storing multidimensional data points. Each node of this tree is represented by a box containing a set of data points. The root node is the box that contains the entire point set, while a leaf is a node that contains one point. The splitting criterion used in this algorithm relies on hyperplanes that split the node orthogonally to its longest side, through the median coordinate of the associated points.

The filtering algorithm starts by creating the Kd-tree for the given data points. For each node in the tree, the algorithm maintains a set of candidate centers; the candidate centers for the root consist of all “k” centers. For each node, the algorithm identifies the candidate closest to the midpoint of the node. A filtering process then selects the nearest center from the candidates by filtering (pruning) any candidate for which no part of the node is closer to it than to the closest candidate. If a single nearest center cannot be determined for all points in the node, the process recurs on the node's children.
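The pruning test at the heart of this process can be sketched as follows, after the geometric idea described by Kanungo et al. [23]; the function and variable names are illustrative. A candidate `z` is pruned relative to the closest candidate `z_star` when even the box corner that is extreme in the direction `z - z_star` is no closer to `z` than to `z_star`.

```python
import numpy as np


def is_pruned(z: np.ndarray, z_star: np.ndarray,
              h_min: np.ndarray, h_max: np.ndarray) -> bool:
    """Return True if candidate center z can be filtered out for this node.

    h_min and h_max are the lower and upper corners of the node's box.
    If even the corner farthest in the direction of z is at least as
    close to z_star as to z, then so is every point in the box.
    """
    u = z - z_star
    v = np.where(u > 0, h_max, h_min)   # box corner extreme toward z
    return np.dot(v - z, v - z) >= np.dot(v - z_star, v - z_star)
```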

To improve the filtering algorithm, the fast-greedy algorithm and the restricted filtering algorithm were proposed [24]. The former modified the boundaries of the node, and the latter added a new threshold for the direct computation of distances and for Kd-tree splitting.

Aiming to increase the number of seeds incrementally until “k” seeds are found, the global K-means algorithm [25] was proposed. This algorithm employs the Kd-tree and uses the centers of the “k” created buckets as seeds for the K-means algorithm. It relies on a variant of the Kd-tree with a splitting criterion that splits each bucket along the direction of maximum variance.

In 2007, Redmond and Heneghan described their method for initializing the K-means algorithm [26]. The Kd-tree is used to perform a density estimation of the data, and the algorithm then chooses “k” seeds for the K-means algorithm from the highest-density nodes of the Kd-tree. This method uses a variation of the Kd-tree that splits data along the median. Additionally, existing studies have used the Kd-tree to improve the speed of K-means by utilizing cluster center displacement [27].

In this paper, we develop an algorithm that enhances the performance of the K-means algorithm using an improved Kd-tree structure. The improved Kd-tree is used as a data structure to improve the choice of initial centers and to reduce the number of nearest-neighbor searches. It is also combined with an efficient center insertion technique to provide an incremental operation that overcomes the instability problem of the K-means algorithm.

## 3. The Proposed Algorithm

The proposed algorithm depends on an inline construction, performed during execution rather than pre-prepared, of an improved variant of the Kd-tree. This is an important feature when dealing with different datasets, as the algorithm can adjust the number of levels in the Kd-tree according to the required number of clusters. In this variant, the region of each node is represented by a hyper-rectangle identified by two vectors, “Hmax” and “Hmin”.

When creating a new node, the algorithm computes a number of attributes for it and uses them in later steps to reduce the mathematical calculations required at each step. Compared to the structures used in the filtering and fast-greedy algorithms, the proposed structure includes several additional attributes that make the computing process faster. The attributes of each node need to be computed only once, when the node is created.
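As a sketch, these per-node attributes might be grouped as follows. The attribute names “density”, “number_of_items”, “total_sum”, “Hmax”, and “Hmin” are those used in the step descriptions below; the class itself, and the density formula in particular (points per unit of bounding-box volume), are assumptions made for illustration.

```python
import numpy as np


class Node:
    """Node of the improved Kd-tree; attributes are computed once at creation."""

    def __init__(self, points: np.ndarray):
        self.points = points                    # data points held by the node
        self.number_of_items = len(points)      # cached point count
        self.total_sum = points.sum(axis=0)     # per-dimension sum (yields the mean cheaply)
        self.Hmin = points.min(axis=0)          # lower corner of the hyper-rectangle
        self.Hmax = points.max(axis=0)          # upper corner of the hyper-rectangle
        # Density as points per unit of bounding-box volume (an assumption;
        # the exact formula is not spelled out in this section).
        volume = np.prod(np.maximum(self.Hmax - self.Hmin, 1e-12))
        self.density = self.number_of_items / volume
        self.left = None
        self.right = None
```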

This tree is further improved by implementing a new splitting method that is suitable for clustering (Figure 1). Whenever the algorithm splits a leaf, the split is applied to the widest dimension, and the splitting point is selected as the average of the two farthest-apart neighboring data points in that dimension.
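A minimal sketch of this splitting rule, reusing the illustrative `Node` class above (it assumes the leaf holds at least two distinct points):

```python
def split_leaf(node: Node) -> tuple[Node, Node]:
    """Split a leaf along its widest dimension at the largest gap."""
    d = int(np.argmax(node.Hmax - node.Hmin))        # widest dimension
    coords = np.sort(node.points[:, d])
    gaps = np.diff(coords)                           # gaps between sorted neighbors
    i = int(np.argmax(gaps))                         # farthest-apart neighboring pair
    split_point = (coords[i] + coords[i + 1]) / 2.0  # their average
    mask = node.points[:, d] <= split_point
    return Node(node.points[mask]), Node(node.points[~mask])
```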

The mechanism of assigning a leaf “L” to a cluster is simple (Figure 2); it is described in detail by Kanungo et al. [23]. It depends on the extreme point “p” of the leaf “L” in the direction determined by the two clusters' centers “c1” and “c2”. By comparing the distances from “p” to “c1” and “c2”, the algorithm determines which of the two centers is closer to “L” and whether all the data points in “L” belong to the cluster of center “c1” or that of center “c2”.

The extreme point “p” can be determined as the corner of the leaf, as follows:

$$p_d = \begin{cases} \mathrm{Hmax}_d, & \text{if } c_{2,d} - c_{1,d} > 0 \\ \mathrm{Hmin}_d, & \text{otherwise} \end{cases}$$

where “d” indexes the dimensions, and “Hmax” and “Hmin” are the upper- and lower-bound vectors of the hyper-rectangle of “L”; “p” is thus the corner of “L” that lies farthest in the direction from “c1” toward “c2”.
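In code, the resulting corner test might read as follows (a sketch reusing the illustrative `Node` class; the function name is an assumption):

```python
def leaf_belongs_to_c1(node: Node, c1: np.ndarray, c2: np.ndarray) -> bool:
    """True if every point in the leaf is at least as close to c1 as to c2.

    p is the leaf corner extreme in the direction from c1 toward c2;
    if even p is closer to c1, then so is every other point in the box.
    """
    u = c2 - c1
    p = np.where(u > 0, node.Hmax, node.Hmin)   # the extreme corner
    return np.dot(p - c1, p - c1) <= np.dot(p - c2, p - c2)
```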

Since the proposed algorithm is an incremental method that uses an efficient initialization procedure, it was necessary to propose a new tree-building method with dynamic construction at runtime.

Unlike the filtering, restricted, and fast-greedy algorithms, the proposed algorithm does not create the whole tree starting from all data points in the root node and continuing until each node contains one data point (as the filtering algorithm does), nor does it create the whole tree with nodes until a predefined threshold (a specific number of items in each node) is met (as the fast-greedy algorithm does). The proposed algorithm starts with one node that contains all data points. According to the required number of clusters, the algorithm decides, for each node separately, whether to continue the splitting process or to stop.

Figure 3 illustrates the pseudo-code of the proposed algorithm.

The algorithm starts by adding references to all data points in one node, which is the root of the tree (Step 1). When creating any node in the tree, the algorithm calculates the density of the node (“density”), the number of items (“number_of_items”), the total sum of all items in each dimension (“total_sum”), and the upper and lower bounds of the node dimensions (“Hmax” and “Hmin”, respectively).

Step 2: the algorithm finds the widest dimension of the whole node and splits the node into two leaves. If the widest dimension is “d”, the algorithm finds the two data points that are farthest apart in this dimension and uses the mean of these two data points as the splitting point of the node.

Step 3: while the required number of clusters has not yet been reached, and the current number of clusters is equal to or greater than the current number of leaves, the lowest-density leaf is split (Step 3.1).

Step 3.1: the algorithm splits the lowest-density leaf to produce more leaves, so that the new center can be inserted in one of them.

Step 3.2: the algorithm inserts a new center at the mean of the highest-density leaf of the tree.

Step 3.3: the main steps of the K-means algorithm are applied using the created tree.

The algorithm assigns all leaves to an appropriate center, and then each center is moved to the mean of all points of the leaves that belong to it.

Step 3.4: if any center is inserted with no leaves belonging to it, the algorithm eliminates that center.
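Putting Steps 1 through 3.4 together, the control flow can be sketched as follows. This is illustrative Python built on the sketches above, not the exact procedure of Figure 3; in particular, the helper `kmeans_on_leaves` is an assumption that assigns each leaf via its mean rather than the corner test, as a simplification.

```python
def kmeans_on_leaves(leaves: list, centers: list) -> list:
    """One K-means pass at leaf granularity (Steps 3.3 and 3.4, simplified)."""
    groups = [[] for _ in centers]
    for leaf in leaves:
        mean = leaf.total_sum / leaf.number_of_items
        j = min(range(len(centers)),
                key=lambda i: float(np.dot(mean - centers[i], mean - centers[i])))
        groups[j].append(leaf)                       # assign leaf to nearest center
    new_centers = []
    for group in groups:
        if group:                                    # Step 3.4: eliminate empty centers
            total = sum(l.total_sum for l in group)
            count = sum(l.number_of_items for l in group)
            new_centers.append(total / count)        # move center to its members' mean
    return new_centers


def improved_kdtree_kmeans(points: np.ndarray, k: int) -> np.ndarray:
    root = Node(points)                              # Step 1: all points in the root
    leaves = list(split_leaf(root))                  # Step 2: split the widest dimension
    centers = []
    while len(centers) < k:                          # Step 3: grow toward k clusters
        if len(centers) >= len(leaves):              # Step 3.1: make room for a new center
            lowest = min(leaves, key=lambda n: n.density)
            leaves.remove(lowest)
            leaves.extend(split_leaf(lowest))
        densest = max(leaves, key=lambda n: n.density)
        centers.append(densest.total_sum / densest.number_of_items)  # Step 3.2
        centers = kmeans_on_leaves(leaves, centers)                  # Steps 3.3 and 3.4
    return np.array(centers)
```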