We propose an optimized agglomerative clustering algorithm, AggCLOPE, to deal with the problem caused by movement of only one transaction illustrated in the previous section. Firstly, the terminologies from original CLOPE algorithm are extended to support cluster merge operation. Then a cluster graph structure is applied to optimize the traditional bottomtoup clustering approach to achieve a stable clustering result with much better profit value.
3.1. Extension from CLOPE
As the CLOPE algorithm moves only one transaction from one cluster to another, it is required to extend the definitions and the corresponding data structures to support cluster merge operation.
Definition 6 (Extension of Cluster). A cluster c is a quadruple$<id,\overline{T},\overline{H},\overline{L}>$where id is the unique identification of c,
$\overline{L}$is a list of ids from other clusters related to c, and the rest are the same as those in Definition 1.
Definition 7 (Extension of Profit).
The profit value of a cluster c is defined as follows:
Then the profit of a clustering
C would be:
Definition 8 (Delta):
Suppose there are two clusters c_{i} and c_{j},
and c_{i/j} is the merge of these two clusters. Then the Delta value of c_{i} and c_{j} is defined as follows:
We use symbol D
_{i,j} short for Delta(
c_{i},
c_{j}). Obviously the Delta function is symmetric that we have D
_{i,j} = D
_{j,i}. We also define D
_{i,i} = 0 for the relationship of the same cluster. To maximize the profit value in each merge operation, the Delta value for a certain cluster should be maximal. Suppose there are totally
p clusters, the maximum Delta value for cluster
i is denoted as:
If c_{j} could maximize the Delta value of c_{i}, then c_{j} is the merge candidate of c_{i} denoted as
${c}_{i}\Rightarrow {c}_{j}$. A cluster might have more than one merge candidates, and all the unique ids of those merge candidates are stored in
$\overline{L}$
according to Definition 6 for further process.
3.2. Optimization on Traditional Agglomerative Approach
Traditional agglomerative approach starts with each transaction staying in its own cluster, and then merges pairs of clusters to move up the hierarchy [
13]. However, the cluster merge operation could not be undone, namely transactions could not be taken away from one cluster. It is required to perform merge operations on as optimal clusters as possible. For a single cluster, the best candidates should be picked out to be merged with. The following example expresses a situation of unsuitable candidates.
Example 2. Suppose there are three initial clusters {i:{abc}, j:{abd}, h:{bd}}, we have${c}_{i}\Rightarrow {c}_{j}$,
${c}_{j}\Rightarrow {c}_{h}$and$\mathrm{M}({c}_{i})<\mathrm{M}({c}_{j})$.
In Example 2,
c_{j} is the merge candidate of
c_{i}, and
c_{h} is the merge candidate of
c_{j}. Although
c_{i} and
c_{j} comes before
c_{h},
c_{j} should be merged with
c_{h} rather than
c_{i}, as the former would produce larger profit value for the current clustering. In other case, if M(
c_{i}) and M(
c_{j}) are both maximum among all the delta values, it is obvious that
c_{i},
c_{j} and
c_{h} could be merge together to achieve optimal profit value. To deal with this phenomenon, we define a Global Maximum (
GM) Delta value for all the Delta values.
If
$\text{M}({c}_{i})=\text{M}({c}_{j})=GM$, it is sure that merging c_{i} with c_{j} is best, denoting as “${c}_{i}\iff {c}_{j}$”, as well as merging c_{j} with c_{h}. On this basis, we could make improvement on the traditional agglomerative approach.
Traditional agglomerative approach could only perform merge operation on two clusters at a time, which would not stop until all the transactions are in one cluster or the number of result clusters is specified. However, in this optimized approach, all the clusters linked with the GM value are able to merge at once, which saves a lot of time. AggCLOPE would automatically terminate until all Delta values are no more than zero and no further merge operations could be performed. We define the cluster graph to support this feature.
Definition 9 (Cluster Graph): A cluster graph is an undirected graph
$G=\left(V,E\right)$
where V is a set of vertices each of which representing a cluster and E contains edges that connect two vertices (clusters) with Delta value equals to GM.
Figure 2.
(a) Initial state of G with each transaction in a cluster; (b) clusters meeting with Global Maximum (GM) value are linked; (c) Linked clusters are merged; (d) The remaining clusters are linked; (e) The final result contains only one cluster.
Figure 2.
(a) Initial state of G with each transaction in a cluster; (b) clusters meeting with Global Maximum (GM) value are linked; (c) Linked clusters are merged; (d) The remaining clusters are linked; (e) The final result contains only one cluster.
Figure 2 illustrates the clustering process on the transactions {
ab,
ac,
bc,
abd,
acd,
bcd} using AggCLOPE with
r = 2.0. At the very beginning, each transaction is a cluster symbolized as a vertex in graph
G (see
Figure 2a). By calculating the Delta values of each two clusters, those meeting with the
GM value are linked and to be merged (see
Figure 2b). So at the end of the first round and the beginning of the second round, there are three new clusters remaining (see
Figure 2c). The second round actually does the same as those in the first round. As all Delta values equal to
GM, the clusters are linked with each other for the further merge operation (see
Figure 2d). After that, there are no more clusters to be merged, and the algorithm ends with a final cluster including all the transactions (see
Figure 2e).
Algorithm 1 AggCLOPE 
/*Phase 1  Initialization*/
1: while not end of the database file
2: read the next transaction t;
3: c_{i}←t; V←c_{i};
4: end while
/*Phase 2  Iteration*/
5: do
6: merged=false; GM=0;
7: for i=0 to V.size
8: for j=i+1 to V.size
9: if(D_{i,j}>GM)
10: GM=D_{i,j}; clear(E); E.add(i, j);
11: clear(c_{i}.L ); c_{i}.L.add(j);
12: clear(c_{j}.L ); c_{j}.L.add(i);
13: else if(D_{i,j}==GM)
14: E.add(i, j);.c_{i}.L.add(j);c_{j}.L.add(i);
15: end if
16: end for
17: end for
18: if (E.size>0)
19: for each distinct c in E
20: LinkMerge(c);
21: End For
22: clear(E); clear(M); merged=true;
23: end if
24: while(merged)

The implementation of AggCLOPE is listed in Algorithm 1. Similar to CLOPE, this algorithm also contains two phases. In the initialization phase, each transaction is read and placed into its own cluster to form vertices in G (Lines 1–4). In the iteration phase, the delta values are calculated between each two clusters. Each calculation performs a comparison to the current GM value (Line 9 and Line 13). The content in E and current cluster’s
$\overline{L}$
is cleared and renewed as the GM value is assigned to the current maximum delta value (Lines 10–12), otherwise the current edge is appended in E and the current vertices are added to the opponent’s
$\overline{L}$(Line 14). If E is not empty, all the connective subgraphs of G would be merged (Lines 18–23). Note that the main part of merge operation is done by LinkMerge, a recursive function shown in Algorithm 2. AggCLOPE would automatically terminate until no clusters to be merge, i.e., E is empty.
Suppose the total number of transactions is N, the current number of clusters is K and the average length of a transactions is A. Obviously the initialization phase takes O(N) time. In the iteration phase, the calculation of delta values requires O($\frac{K\times K\times A}{2}$) time, and the recursive function is O(K) that is the same as depth first search on graph G. As K = N in the first round, it requires at least O($\frac{{N}^{2}\times A}{2}$) + 2O(N), then the K value gradually shrinks in the next few rounds. The worst case would take N − 1 rounds, in each of which only two clusters are merged, resulting in O($\frac{{N}^{3}\times A}{3}$) approximately. The best case takes only two rounds, where all the clusters are merged in the first round and no further operations are performed in the second round. However, compared with the time complexity of CLOPEO( N × K × A), AggCLOPE is generally much slower to be its only and fatal defect.
Algorithm 2 LinkMerge 
Parameter: c is a cluster(vertex) in cluster graph G
1: If(c is not merged)
2: for each k in c.L
3: LinkMerge(c_{k});
4: End For
5: End If
