3.1. Data Acquisition
To verify the effectiveness of the proposed algorithm, we invited 30 volunteers to collect their GPS trajectories in their daily lives. The experiment lasted for a month. All volunteers were students and teachers from Wuhan University. The data collection application was installed on the volunteers' smartphones, which ran Android or iOS. To ensure the accuracy of the data, the HOLUX GPSlim236 was used as the GPS receiving device and connected to the smartphone via Bluetooth; therefore, GPS was the only GNSS in our experiment. As the HOLUX GPSlim236 updates data once per second, the application adjusts the sampling interval according to the moving speed to reduce redundant data. The shortest interval is not less than 5 s.
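The speed-adaptive thinning described above can be sketched as follows. This is only an illustration: the paper fixes the 5 s floor, but the speed bands and interval values here are our own assumptions.

```python
def sampling_interval(speed_mps: float) -> int:
    """Choose a GPS logging interval (seconds) from the current speed.

    The receiver reports a fix every second; the app thins this stream.
    The 5 s minimum comes from the paper; the speed bands are assumed.
    """
    if speed_mps > 10:       # fast movement (e.g., bus): keep detail
        interval = 5
    elif speed_mps > 1.5:    # walking or cycling
        interval = 15
    else:                    # near-stationary: log sparsely
        interval = 60
    return max(interval, 5)  # never below the 5 s minimum

print(sampling_interval(12.0))  # -> 5
print(sampling_interval(0.2))   # -> 60
```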
During the trajectory data acquisition, we did not impose any restrictions or manual intervention on the volunteers. The volunteers were asked to carry the GPS device during the day, to keep the application running in the background as much as possible, and to charge the device at night, so that the application could collect enough data to reflect the volunteers' daily lives. The trajectory data was stored in the smartphone's memory on a daily timescale. The application can not only automatically record track information but also manually mark locations. The volunteers were not forced to record locations. When a volunteer entered a place, or while staying there, the volunteer could record the current coordinates and time through the application, thereby creating a record of the location. Each recorded location has a type attribute; in the application, there were five types for volunteers to choose from: home or dormitory, office or classroom, restaurant, entertainment (e.g., sight spot, cinema, etc.) and public service (e.g., hospital, bus station, etc.). In this paper, we define a recorded location as RL = (lng, lat, type, t), which represents a registered location's longitude, latitude, type and time. If a volunteer records locations in the same place multiple times, the application allows the volunteer to select those recorded locations and group them together. The recorded locations are mainly used for the verification of the stop point extraction algorithm, and the groups are used for the location clustering algorithm.
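The recorded-location tuple RL = (lng, lat, type, t) and the manual grouping can be modeled as a minimal sketch; the class, field, and function names here are our own, not from the paper's implementation.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class RecordedLocation:
    lng: float       # longitude in degrees
    lat: float       # latitude in degrees
    type: str        # one of the five categories offered by the app
    t: float         # timestamp of the manual record
    group: int = -1  # group id assigned when the volunteer merges records

def group_records(records):
    """Collect recorded locations that the volunteer grouped together."""
    groups = defaultdict(list)
    for r in records:
        groups[r.group].append(r)
    return dict(groups)

# Two records of the same office made on different days, grouped as 1:
rls = [RecordedLocation(114.36, 30.54, "office or classroom", 1000.0, group=1),
       RecordedLocation(114.36, 30.54, "office or classroom", 90000.0, group=1)]
print(len(group_records(rls)[1]))  # -> 2
```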
During the whole experiment, we collected the trajectory data of 23 volunteers, eleven of whom recorded many locations. The acquired data are shown in Table 1. Dataset 1 contains 11 volunteers' data over 20 days, whose trajectory data is relatively complete. Datasets 2 and 3 are selected from Dataset 1.
Before verifying the algorithm, we need to select its parameters. As mentioned above, there are two parameters: the time threshold and the distance threshold.
In the TDBC algorithm, the time threshold parameter δt determines the number of clusters, because the algorithm considers a cluster a stop point only when the cluster's time duration exceeds δt. If δt is too large, clusters with a relatively short duration are ignored; otherwise, these clusters are considered stop points and added to the set.
The distance threshold parameter δd determines the size of the clusters. For each point, the algorithm calculates the distance between the point and the centroid of the current cluster; the point joins the current cluster only when this distance is less than δd. A larger δd merges clusters that are adjacent to each other, while a smaller δd produces smaller clusters, which may be treated as noise. Due to the limited precision of GPS equipment, even if a person stays in the same place for a long time, the GPS readings are not identical. If δd is less than the level of precision, the algorithm splits a single extended visit into several segmented clusters, which may then be missed because their durations do not reach δt.
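The interaction of the two thresholds amounts to a simple streaming procedure: grow a cluster while points stay within δd of its centroid, and keep the cluster as a stop point only if it spans at least δt. A minimal sketch under simplifying assumptions (planar coordinates in meters with Euclidean distance, and none of the paper's signal-loss or data-drift handling):

```python
import math

def extract_stop_points(points, delta_d, delta_t):
    """points: list of (x, y, t) with x, y in meters and t in seconds.

    Returns clusters lasting at least delta_t seconds, each as
    (centroid_x, centroid_y, t_start, t_end). Sketch only: the real
    TDBC also treats signal loss and data drift, omitted here.
    """
    stops, cluster = [], []

    def centroid(c):
        return (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))

    def flush(c):
        # Keep the cluster only if its duration reaches delta_t.
        if c and c[-1][2] - c[0][2] >= delta_t:
            cx, cy = centroid(c)
            stops.append((cx, cy, c[0][2], c[-1][2]))

    for p in points:
        if not cluster:
            cluster = [p]
            continue
        cx, cy = centroid(cluster)
        if math.hypot(p[0] - cx, p[1] - cy) < delta_d:
            cluster.append(p)  # still within delta_d of the centroid
        else:
            flush(cluster)     # cluster ended; test it against delta_t
            cluster = [p]
    flush(cluster)
    return stops

# A 600 s dwell within a few meters, then a jump 200 m away:
track = [(0, 0, 0), (5, 0, 300), (0, 5, 600), (200, 0, 610)]
print(len(extract_stop_points(track, delta_d=60, delta_t=500)))  # -> 1
```

With δt raised above 600 s the dwell would be discarded, illustrating how a large time threshold ignores short-duration clusters.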
Meanwhile, the distance threshold also affects the location clustering algorithm. An individual location is a subjective concept with different definitions at different scales. Considering the applications of locations (e.g., navigation and location-based recommenders), a location should be at the building level; therefore, the distance threshold should not be too large.
To find the appropriate range of parameters, this paper selects Dataset 2 as the experimental data, which contains four volunteers' trajectories over one day. The graphs in Figure 6 show the number of stop points extracted from Dataset 2 for different distance and time thresholds.
From the graphs, we can observe that the curves decrease sharply as δt increases from 0 s to 300 s: a short time threshold δt results in many clusters with short durations being treated as stop points. When δt > 300 s, the curves become flat, which indicates that the influence of the time threshold is clearly weakened. Meanwhile, the greater the time threshold, the more similar the numbers of stop points for the different distance thresholds. Therefore, the time threshold δt should be selected from [300, 800] s.
Compared with the other distance thresholds, the curves decrease more rapidly for δd = 20 m, because this δd is close to the GPS equipment precision (the precision of the HOLUX GPSlim236 is 5–25 m). When δd = 200 m, the curves are too flat to recognize the stop points, especially at the building level. When δd = 125 m, the curves are similar to those for 200 m, which means this distance threshold is already too large. For distance thresholds such as 50 m and 100 m, the curves have an obvious knee point. Therefore, the distance threshold δd should be selected from (20, 125) m.
To further estimate the parameters of the TDBC, this article uses the Precision (P), Recall (R), and F1-Measure (F) as the judging criteria. The locations recorded by the volunteers are regarded as the reference sample.
Definition 7. Match: a stop point SPi = (lngi, lati, typei, ti_start, ti_end) matches a recorded location RLj = (lngj, latj, typej, tj) if ti_start < tj < ti_end and distance(SPi, RLj) < δd. As shown in Figure 7, a matched point means the recorded location RL1 was recorded during the stop point SP1 and the distance between SP1 and RL1 is less than the distance threshold.
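Definition 7 can be checked mechanically. A sketch follows, using the haversine formula for distance(SPi, RLj); the paper does not state which geographic distance it uses, so that choice and the function names are our assumptions.

```python
import math

def haversine_m(lng1, lat1, lng2, lat2):
    """Great-circle distance in meters between two (lng, lat) points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def matches(sp, rl, delta_d):
    """sp = (lng, lat, type, t_start, t_end); rl = (lng, lat, type, t).

    Implements Definition 7: the record falls inside the stop's time
    window and within delta_d meters of the stop point.
    """
    in_time = sp[3] < rl[3] < sp[4]
    in_range = haversine_m(sp[0], sp[1], rl[0], rl[1]) < delta_d
    return in_time and in_range

sp = (114.3600, 30.5400, "office or classroom", 100.0, 800.0)
rl = (114.3601, 30.5401, "office or classroom", 400.0)  # roughly 15 m away
print(matches(sp, rl, delta_d=60))  # -> True
```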
The precision rate is the proportion of stop points that match recorded locations, Equation (3), where p+ stands for the number of matched stop points and p− for the number of unmatched ones:

P = p+ / (p+ + p−)  (3)

The recall rate is the proportion of recorded locations that are matched with stop points, Equation (4), where r+ stands for the number of matched recorded locations and r− for the number of unmatched ones:

R = r+ / (r+ + r−)  (4)
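Equations (3) and (4) reduce to two counting ratios over the match results; a small sketch (the variable names are ours):

```python
def precision_recall(stop_matched, rl_matched):
    """stop_matched: one bool per extracted stop point (matched or not).
    rl_matched: one bool per recorded location.

    Implements P = p+ / (p+ + p-) and R = r+ / (r+ + r-).
    """
    p_plus = sum(stop_matched)            # matched stop points
    r_plus = sum(rl_matched)              # matched recorded locations
    precision = p_plus / len(stop_matched)
    recall = r_plus / len(rl_matched)
    return precision, recall

print(precision_recall([True, True, False, True], [True, False]))  # -> (0.75, 0.5)
```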
We used Dataset 1 as the experimental data, which contains 11 volunteers' trajectory data and their recorded locations; they marked seven locations per day on average. Figure 8 shows the graphs of the precision and recall for different distance and time thresholds. When the precision rises, the recall decreases. On the whole, a reasonable precision rate is about 0.8 and the recall rate is about 0.7. A possible reason is that the recorded locations are not precise. For example, suppose there were three visits A, B, and C during a day with durations of 10, 15, and 30 min, but when the volunteer logged the locations, he recorded one visit at B and two at C. Besides, recording locations was not mandatory during the data acquisition. Therefore, the matching accuracy is affected.
When the distance threshold δd is fixed, the precision rate increases with the time threshold δt. This means that the stop points with long durations recognized by the TDBC are exactly those the volunteers tended to record, which is consistent with the actual situation: the longer a volunteer stays, the more likely he is to log a recorded location. Meanwhile, the recall rate decreases, which means the number of stop points is reduced and many stop points are no longer recognized.
When the time threshold δt is fixed, both the precision and recall rates increase with the distance threshold δd. However, when δd exceeds 60 m, the precision and recall rates no longer improve significantly, which indicates that a distance parameter that is too small or too large is inappropriate, and δd = 60 m is the inflection point. To choose the appropriate parameters, the F1-Measure is further investigated.
The F1-Measure is a comprehensive evaluation index based on the precision and recall rates. Figure 9 shows the graph of the F1-Measure for different distance and time thresholds. When δd is over 60 m, the F1-Measure remains almost constant. When δd = 60 m and δt = 500 s, the F1-Measure reaches its maximum; the corresponding precision rate is 0.81 and the recall rate is 0.68.
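The F1-Measure is the harmonic mean of the two rates, F1 = 2PR / (P + R); plugging in the operating point above:

```python
def f1_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall: F1 = 2PR / (P + R)."""
    return 2 * p * r / (p + r)

# Operating point chosen above: delta_d = 60 m, delta_t = 500 s.
print(round(f1_measure(0.81, 0.68), 3))  # -> 0.739
```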
3.4. Stop Point Extraction Experiment
In the following, two groups of experiments were designed to compare the TDBC with four other algorithms: K-Medoids, DJ-Cluster [13], CB-SMoT [19] and Time-Based Clustering [3]. The first experiment used Dataset 1, which has a large amount of data, to compare the performance of the algorithms. The second experiment used Dataset 3 to compare the pros and cons of the different algorithms under specific conditions.
The results of the first experiment are listed in Table 2. The parameters of the TDBC are determined by the discussion in Section 3.3, while the other four algorithms selected parameters according to the optimal values of the Precision, Recall, and F1-Measure. For the CB-SMoT, we only completed the first step to obtain the potential stops, without the follow-up operations. From the table, K-Medoids, DJ-Cluster and CB-SMoT are more time-consuming than the other two algorithms due to their complex computation. The efficiency of the TDBC is similar to that of Time-Based Clustering, and both have high clustering efficiency. In terms of precision, the TDBC and DJ-Cluster are over 0.8, which is significantly higher than the other three algorithms. In terms of recall, the TDBC is slightly improved compared with the others. Therefore, while ensuring time efficiency, the performance of the TDBC is remarkably improved.
The treatment of GPS signal loss and data drift greatly improves the accuracy of the algorithm, and the recognition of the type I stop points improves the recall rate to a degree. As we can see from Table 2, the recall rate is not high as a whole. We made return visits after the experiment and found that many volunteers did not log the recorded locations strictly. For example, volunteers might record locations before they arrived at or after they left the places, or mark a location repeatedly. These factors affect the matching accuracy and lower the recall rates of the different algorithms.
Figure 10 represents the stop point extraction results of the TDBC for the various types from Dataset 1. The left graph shows the proportion of the different types and the right graph their precision. From the graphs, we can see that the proportion of type III is 0.52 and its precision is 0.88, which indicates that the TDBC has a high recognition rate for stop points where the signal was lost for a long time. The high accuracy also means that the volunteers were more inclined to log recorded locations inside buildings, which accords with reality: since most of the volunteers are students, their stop points are more often in buildings such as teaching buildings, laboratories, student dormitories, and libraries. The proportion of type II, which is clustered from consecutive GPS points, is 0.42, and its precision is 0.79, which indicates that the GPS device received the signal with short-range data drift in partly enclosed buildings, in accordance with the raw data. Both the proportion and the precision of type I are low, which means different volunteers had different touring habits and logged relatively few locations at the initial position of a trajectory.
The second experiment's results come from Dataset 3, which contains only one day's trajectory data from one volunteer. We know the volunteer went for an outing on that day and recorded nine locations. The parameters of the other algorithms are set to the same values as in the first experiment, except that K-Medoids' k is set to 9.
As shown in Table 3, the clustering results of the different algorithms are generally in accordance with the results of the first experiment: the precision of the DJ-Cluster and the recall of the TDBC are the highest.
The results of the comparison between the TDBC and the four algorithms are shown in Figure 11. There are five recorded locations in the visible area. The TDBC correctly found the five points and extracted two other points that were not recorded by the volunteer: points A and E in Figure 11a. The analysis shows that point A is a ferry, where the volunteer stopped and waited to cross the Yangtze River, and point E is a stop in a snack street. The volunteer had not logged either of the two points, so the TDBC's results are reasonable. Point B in Figure 11b is a stop that experienced signal loss for a long time. Point C in Figure 11d is a stop with short-range data drift, and point D in Figure 11d is a circular stop.
Figure 11a shows the result of K-Medoids, which accurately extracted three stop points. As the algorithm is sensitive to the initial points, different initial points may give different results. In this paper, we ran the experiment many times with different random initial points, but some points, e.g., point F in Figure 11a, were still unavoidable. Obviously, these points are not reasonable stop points. Meanwhile, time is an important characteristic of trajectory data, and it is not taken into account by K-Medoids. Considering both time complexity and precision, the performance of the TDBC is clearly better than that of K-Medoids.
Figure 11b shows the result of the DJ-Cluster. The found stop points are the dense neighborhoods of GPS points, and the volunteer logged locations only in these places. Since the DJ-Cluster is essentially a density-based clustering algorithm, it merges GPS points recorded within the same area at different times into a single cluster and takes that cluster as a stop point without considering the influence of time. Although such results are not reasonable, the accuracy of the DJ-Cluster is very high if time is not taken into account. At the same time, the algorithm has inherent defects for stop points that contain signal loss for an extended period, e.g., point B in Figure 11b, or that form a circular trajectory with low density, e.g., point D in Figure 11b. That is the reason why its recall is small.
Figure 11c shows the result of the CB-SMoT. Considering the influence of time, the CB-SMoT essentially uses the average speed to find stop points. Therefore, the algorithm can recognize stop points where the volunteer stayed outdoors for a long time or where the signal was lost, e.g., point B in Figure 11c. Precisely because of this characteristic, the algorithm easily mistakes road junctions or traffic jams for stop points, e.g., point G in Figure 11c. Moreover, it cannot handle circular stop points. Therefore, the accuracy of the CB-SMoT is not high.
Figure 11d shows the results of Time-Based Clustering, which has a higher precision. Since the algorithm considers the influence of time and has corresponding treatment measures for signal loss, stop points with signal loss are correctly identified, e.g., point B in Figure 11d. However, the algorithm divides places with data drift or a circular region into several stop points, such as points C and D in Figure 11d, both of which are divided into two parts. This increases the total number of stop points and decreases the accuracy of the results. Meanwhile, the recall of the Time-Based Clustering algorithm is limited because the segmentation produces more small clusters that may not meet the requirements of a stop point and are therefore ignored.
3.5. Location Extraction Experiment
The improved clustering by fast search and find of density peaks in this research clustered the stop points from Dataset 1 into locations. Compared with the algorithms in the literature [15], the density-based clustering proposed in the literature [3] has better performance. Therefore, our improved algorithm is compared with the density-based one.
As shown in Figure 12, Correct means each stop point matches a recorded location in the same group, and False means the opposite. Merged means stop points are matched with recorded locations from different groups. Lost stands for recorded locations from the same group that have no matched stop points. Divided stands for recorded locations that are matched with stop points from different clusters. The precision is the proportion of Correct clusters and the recall is the proportion of Correct groups. The evaluation indexes include the elapsed time of the algorithm, the false rate (FR), merged rate (MR), lost rate (LR), divided rate (DR), precision (P), recall (R) and F1-Measure.
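The five outcome labels reduce to proportions over the clustering result. A counting sketch follows; the label strings, and the assumption that FR/MR/DR are taken over output clusters while LR and recall are taken over recorded-location groups, are ours.

```python
from collections import Counter

def evaluation_rates(cluster_outcomes, group_outcomes):
    """cluster_outcomes: one label per output cluster, from
    {"correct", "false", "merged", "divided"}.
    group_outcomes: one label per recorded-location group, from
    {"correct", "lost"}. Returns the rates used in the text.
    """
    c = Counter(cluster_outcomes)
    g = Counter(group_outcomes)
    n_c, n_g = len(cluster_outcomes), len(group_outcomes)
    return {
        "FR": c["false"] / n_c,    # false rate over clusters
        "MR": c["merged"] / n_c,   # merged rate over clusters
        "DR": c["divided"] / n_c,  # divided rate over clusters
        "LR": g["lost"] / n_g,     # lost rate over groups
        "P": c["correct"] / n_c,   # precision: Correct clusters
        "R": g["correct"] / n_g,   # recall: Correct groups
    }

rates = evaluation_rates(["correct", "correct", "merged", "false"],
                         ["correct", "lost"])
print(rates["P"], rates["R"])  # -> 0.5 0.5
```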
Table 4 shows the comparison between our algorithm and the density-based clustering, with the parameters of both algorithms set to 60 m. Compared with the density-based method, our improved algorithm has distinct advantages in efficiency, and it has a lower Merged rate and a higher Divided rate. That is because our improved algorithm does not merge locations that are close to each other, while the density-based one does. As shown in Figure 13, the density-based approach does not recognize locations B and C but merges them together; this is why the precision of our algorithm is higher. Furthermore, the false rate, lost rate and recall of the two algorithms are similar to each other: both have a high lost rate and a low recall, because the overall recall rate of the algorithms is low, as discussed in Section 3.4. On the whole, under the prerequisite of ensuring the precision and recall rates, our improved algorithm has higher performance and efficiency.