# A Framework for Discovering Evolving Domain Related Spatio-Temporal Patterns in Twitter

State Key Laboratory of Information Engineering in Surveying, Mapping & Remote Sensing, Wuhan University, Wuhan 430079, China

Collaborative Innovation Center of Geospatial Technology, Wuhan University, Wuhan 430079, China

Department of Geo-informatics, Central South University, Changsha 410083, China

Department of Computer Science, Virginia Tech, Falls Church, VA 22043, USA

Author to whom correspondence should be addressed.

Academic Editors: Marguerite Madden and Wolfgang Kainz

Received: 28 March 2016 / Revised: 1 October 2016 / Accepted: 9 October 2016 / Published: 18 October 2016

In massive Twitter datasets, tweets deriving from different domains, e.g., civil unrest, can be extracted to constitute spatio-temporal Twitter events for spatio-temporal distribution pattern detection. Existing algorithms generally employ scan statistics to detect spatio-temporal hotspots from Twitter events and do not consider the spatio-temporal evolving process of Twitter events. In this paper, a framework is proposed to discover evolving domain related spatio-temporal patterns from Twitter data. Given a target domain, a dynamic query expansion is employed to extract related tweets to form spatio-temporal Twitter events. The new spatial clustering approach proposed here is based on the use of multi-level constrained Delaunay triangulation to capture the spatial distribution patterns of Twitter events. An additional spatio-temporal clustering process is then performed to reveal spatio-temporal clusters and outliers that are evolving into spatial distribution patterns. Extensive experiments on Twitter datasets related to an outbreak of civil unrest in Mexico demonstrate the effectiveness and practicability of the new method. The proposed method will be helpful to accurately predict the spatio-temporal evolution process of Twitter events, which belongs to a deeper geographical analysis of spatio-temporal Big Data.

Spatio-temporal Big Data is characterized by volume, variety, velocity, veracity and value. Knowledge discovery from spatio-temporal Big Data currently focuses mainly on summarization, obfuscated outliers, rare associations, and obfuscated process prediction, which extend traditional spatio-temporal data mining. Among location-based social networks, Twitter has attracted the largest number of users since its launch in 2006 [1]. As mobile phones become more intelligent and wireless network coverage expands, anyone with a mobile phone can send tweets almost anywhere, anytime. As a result, Twitter has experienced explosive growth in its user base [2]. Most smartphones are now GPS-enabled, so geographical location information is often included as an additional tag in tweets. Combined with the time annotation, this spatio-temporal information describes where and when each tweet was broadcast, so Twitter data has become a kind of spatio-temporal Big Data. Because Twitter is so free and open, a massive amount of the information broadcast is unrelated to significant events and simply reports common interactions among friends. Moreover, Twitter can be considered a large black box that contains numerous topics reflecting various events from different domains, e.g., disasters [3], crimes [4], traffic [5], and epidemics [6]. Extracting hidden, unknown and significant events from the huge mass of Twitter data has thus become a research hotspot in computer science [7,8], human science [9,10] and GIS [11,12,13,14] in recent years.
The research approaches applied can be roughly classified into three categories depending on which of the above three fields is the focus: (1) scholars in computer science consider tweets as textual information that changes over time, so topics related to different domains can be extracted by text classification methods such as Latent Dirichlet Allocation (LDA) and clustering; (2) in human science, scholars usually treat Twitter as a tool to record human behaviors; for example moving behaviors can be reflected by the changes in number of Twitter users coming into and going out of a certain region; and (3) researchers in GIS commonly extract domain related events to identify spatio-temporal outliers or hotspots. The research reported here utilized the third of these approaches to spatio-temporal pattern detection from Twitter.

In a spatio-temporal event dataset, each entity represents an event that occurred at the location and time tagged [15]. Further, spatio-temporal Twitter events are defined as a series of point entities with geo-location and time information embedded in domain related tweets. Taking Figure 1 as an example, this depicts the spatio-temporal Twitter events related to ‘civil unrest’ for the month of July, 2012 throughout Mexico. Unlike previous research in this area, the spatio-temporal approach proposed here focuses specifically on the evolution of domain related spatio-temporal patterns in Twitter. The major contributions of this study are as follows:

- **Development of a mining framework:** A unified framework is proposed to discover evolving domain-related spatio-temporal patterns in Twitter. Prior knowledge is not required in the new framework.
- **Extraction of domain related Twitter events by dynamic query expansion:** For the target domain, related tweets can be obtained using a dynamic query expansion strategy. These tweets tagged with geo-location and time information constitute spatio-temporal Twitter events.
- **Discovery of evolving spatio-temporal patterns from Twitter events:** For the extracted domain related spatio-temporal Twitter events, spatial clusters and outliers are detected by spatial clustering, after which the spatio-temporal patterns are discovered by spatio-temporal clustering as they evolve.
- **Experimental evaluation using real Twitter data:** The proposed framework was extensively tested for spatio-temporal Twitter events related to ‘civil unrest’ in Mexico. The advantages and effectiveness of the new method are demonstrated by comparing the results with alternative methods and baseline data.

The rest of this paper is organized as follows. Section 2 reviews the related work and Section 3 explains our motivation and research strategy. Section 4 describes the model used to extract the domain related Twitter events, after which the approach used to discover the spatio-temporal patterns in the Twitter events as they evolved is presented in Section 5. Section 6 reports on the extensive experiments on real world Twitter data and their analysis, and the paper concludes by summarizing the study’s important findings in Section 7.

Existing Twitter event extraction methods mainly derive from machine learning, with approaches such as LDA (Latent Dirichlet Allocation), SVM (Support Vector Machine), and HMM (Hidden Markov Model). LDA is an unsupervised learning algorithm that was originally developed to classify general texts [16] but has more recently been employed to classify Twitter data into different topics [7,8]. SVM is a supervised learning algorithm for classification. Given a target domain, it begins by requiring users to label sections of domain related tweets as samples, after which these training samples are used to extract related tweets [17]. Chakrabarti and Punera (2011) [18] took a different approach, employing a modified HMM to learn the characteristics of sample tweets and then extract related tweets.

In the field of spatio-temporal data mining, spatio-temporal clustering [15,19], outlier detection [20,21] and hotspot detection [22] are all key research techniques. As both geo-location and time information are often embedded in tweets, spatio-temporal data mining can readily be applied to Twitter data. Research in this area can be classified into two types: (1) spatio-temporal distribution pattern detection from initial Twitter data; and (2) spatio-temporal distribution pattern detection from domain related Twitter events.

In summary, most previous work in this area has focused on detecting fixed spatio-temporal distribution patterns from Twitter. However, there is also an evolving relationship between the spatio-temporal development of a Twitter event and its final spatial distribution. In this paper, we propose a new framework that combines dynamic query expansion with a spatio-temporal mining approach to discover newly evolving domain related spatio-temporal patterns from Twitter.

Existing domain related tweet extraction methods mostly fail to consider the hidden relationships among tweets. For example, if an earthquake occurs at place ‘A’ then any tweets containing phrases such as ‘A, damage, collapsed buildings’, even if they do not specifically say ‘earthquake’, are also likely to be related to the earthquake. Therefore, it is necessary to analyze the hidden relationships in Twitter data if we are to adequately extract domain related tweets.

Further, existing research on mining spatio-temporal patterns from Twitter events focuses primarily on detecting outliers or hotspots directly from the distribution of the tweets; over a given time period these spatio-temporal Twitter events can evolve into certain spatial distribution patterns, e.g., spatial clusters or outliers. However, to the best of our knowledge there have not been studies seeking to discover the spatio-temporal evolution process for each spatial pattern. For example, a group of tweets representing a spatio-temporal event dataset with 10 time stamps is simulated in Figure 2. Figure 2a gives the spatio-temporal distribution while Figure 2b shows the spatial projection of all events. Figure 2c is the spatial projection at each time stamp. The spatial distribution patterns formed by all spatio-temporal events for this time period are hidden in Figure 2b, which contains four types of patterns: spatial clusters, global spatial outliers, local spatial outliers and inner spatial outliers. In Figure 2c, events at each stamp are differently labeled based on the patterns in Figure 2b. The evolving process by which each of the three distinct spatial patterns develops is as follows: (1) a dense cluster derives from its center at T = 1 and extends until the whole cluster is formed at T = 4, after which this cluster gradually diminishes from its center and disappears at T = 8; (2) For a sparse cluster, events appear to arise randomly in its upper section from T = 1 to 4 and only after T = 5 does the lower part of this cluster gradually come into being. At T = 7, some events in the upper section begin to appear again; and (3) Global outliers are present at all times, but the local outliers appear only between T = 3 and 7. The inner spatial outliers are formed gradually from the center from T = 4 to 6 and then do not change.

By integrating the spatial distribution of Twitter events at different time stamps shown in Figure 2c, this research aims to discover those spatio-temporal clusters or outliers, i.e., evolving spatio-temporal patterns, which will evolve into the final spatial distribution patterns shown in Figure 2b.

To discover and visualize the spatio-temporal patterns evolving in the Twitter data for a given domain, a framework is proposed here that is based on a dynamic query expansion and spatio-temporal pattern mining approach, as shown in Figure 3. There are two main parts in our proposed framework, described in turn below.

In this section, a model for the dynamic query expansion is built that is capable of extracting domain related spatio-temporal Twitter events. In Section 4.1, we provide definitions for the terms ‘Twitter information graph’, ‘seed query’, ‘expanded query’ and ‘weight measurement’. Section 4.2 moves on to consider the process of dynamic query expansion, after which spatio-temporal Twitter events are defined in Section 4.3.

By considering the multiple relationships among tweets and features, those tweets related to a given domain can be extracted by a kind of dynamic query expansion. Two parts, the seed query and the expanded query, are included and these can be described as follows:

$$W(F)=IDF_{F}\cdot E_{F\leftrightarrow T}\cdot W(T)$$

$$W(T)=\omega_{1}\cdot E_{T\leftrightarrow F}\cdot W(F)+\omega_{2}\cdot E_{T\leftrightarrow T}\cdot W(T)$$

Here, W(F) and W(T) denote the weights of features and tweets, respectively. E_{F↔T} denotes a matrix describing the relationship between features and tweets. If a feature belongs to a tweet, the corresponding value in the matrix is equal to 1 and otherwise it is equal to 0. E_{T↔F} is the transpose of E_{F↔T}. Similarly, E_{T↔T} describes the relationship between tweets and other tweets. If a tweet replies to another tweet, the corresponding value in E_{T↔T} is equal to 1 and otherwise it is equal to 0. IDF_{F} is the inverse document frequency matrix for the features [24]. ω_{1} and ω_{2} denote the degree of influence from features and other tweets, respectively, on the analyzed tweet.

Based on these basic definitions, a dynamic query expansion can be described in the following:

$$W[F^{(k)}]=IDF_{F}\cdot E_{F\leftrightarrow T}\cdot W[T^{(k-1)}]$$

$$W[T^{(k)}]=\omega_{1}\cdot E_{T\leftrightarrow F}\cdot W[F^{(k)}]+\omega_{2}\cdot E_{T\leftrightarrow T}\cdot W[T^{(k-1)}]$$

Then, for tweets in T^{(k)} and T − T^{(k)}, if the maximal weight in T − T^{(k)} is larger than the minimal weight in T^{(k)}, the two corresponding tweets are swapped. This swapping continues until max {W[T − T^{(k)}]} ≤ min {W[T^{(k)}]}.
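The update and selection steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: IDF_{F} is treated as a diagonal matrix stored as one value per feature, the matrices are small dense lists, rows of E_{T↔T} are taken to be the replying tweets, and the names `expansion_step` and `select_top` are hypothetical. Selecting the highest-weighted tweets directly reaches the same end state as the pairwise swapping, i.e., max {W[T − T^{(k)}]} ≤ min {W[T^{(k)}]}.

```python
def mat_vec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Return the transpose of a dense matrix."""
    return [list(col) for col in zip(*matrix)]


def expansion_step(idf, e_ft, e_tt, w_t_prev, omega1, omega2):
    """One weight update for features and tweets.

    idf      : per-feature IDF values (assumed diagonal of IDF_F)
    e_ft     : features x tweets 0/1 matrix, E_{F<->T}
    e_tt     : tweets x tweets 0/1 reply matrix, E_{T<->T}
    w_t_prev : tweet weights from the previous iteration, W[T^(k-1)]
    """
    # W[F^(k)] = IDF_F . E_{F<->T} . W[T^(k-1)]
    w_f = [i * x for i, x in zip(idf, mat_vec(e_ft, w_t_prev))]
    # W[T^(k)] = w1 . E_{T<->F} . W[F^(k)] + w2 . E_{T<->T} . W[T^(k-1)]
    from_features = mat_vec(transpose(e_ft), w_f)
    from_tweets = mat_vec(e_tt, w_t_prev)
    w_t = [omega1 * a + omega2 * b for a, b in zip(from_features, from_tweets)]
    return w_f, w_t


def select_top(weights, size):
    """Indices of the `size` highest-weighted tweets (ties keep input order)."""
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    return sorted(order[:size])


# Toy data (made up): 3 features, 4 tweets, tweet 0 is the seed tweet.
e_ft = [[1, 1, 0, 0],   # feature 0 occurs in tweets 0 and 1
        [0, 1, 1, 0],   # feature 1 occurs in tweets 1 and 2
        [0, 0, 0, 1]]   # feature 2 occurs in tweet 3
e_tt = [[0, 0, 0, 0],
        [0, 0, 0, 0],
        [1, 0, 0, 0],   # tweet 2 replies to tweet 0
        [0, 0, 0, 0]]
w_f, w_t = expansion_step([1.0, 1.0, 2.0], e_ft, e_tt,
                          [1.0, 0.0, 0.0, 0.0], omega1=0.5, omega2=0.5)
selected = select_top(w_t, 3)
```

In this toy run, tweet 1 gains weight through the shared feature and tweet 2 through the reply link, while tweet 3, which has no connection to the seed, stays at zero and is left out of the selected set.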

The time complexity mainly derives from the dynamic query expansion and is approximately O{n_{i}[n_{F}*n_{TF} + n_{T}*(n_{TF} + n_{TT})]}, where n_{i} is the number of iterations performed, n_{F} and n_{T} are the number of features and tweets, respectively, n_{TF} is the number of connections between tweets and a feature, and n_{TT} is the number of connections between two different tweets. Note that n_{TF} << n_{F} and n_{TT} << n_{T}.

Combining the geo-location and time information embedded in the tweets, the domain related spatio-temporal Twitter events can be defined as:

This section describes two steps that are performed on the **STTE**: (1) Spatial distribution pattern detection; and (2) the discovery of evolving spatio-temporal patterns. Section 5.1 examines the approach used for the spatial distribution pattern detection, while the process of discovering spatio-temporal patterns as they evolve is described in Section 5.2. Finally, the algorithms are described in Section 5.3.

In order to detect spatial distribution patterns from spatial point events, a number of spatial clustering [25,26] and spatial outlier detection [27,28] methods have been proposed. However, these methods cannot accurately detect different types of spatial clusters and outliers simultaneously. Delaunay triangulation has been proven to be an efficient tool for constructing spatial proximity relationships for spatial datasets and has thus been successfully employed in spatial clustering [25,26]. Unfortunately, spatial point events may involve multiple types of clusters and outliers, as described in Section 3.1, and existing methods are unable to accurately obtain these spatial patterns. For example, Figure 4a shows the Delaunay triangulation for the spatial events in Figure 2b, with three types of inconsistent long edges connecting different types of spatial patterns: **(1) I-long edges intersected with green dashed lines**, where global long edges connect global spatial outliers such as the point and the small cluster on the right side of Figure 4a with other patterns; **(2) II-long edges intersected with blue dashed lines**, where local long edges connect local spatial outliers such as the point and the small cluster in the middle of Figure 4a with other patterns; and **(3) III-long edges intersected with red dashed lines**, which are usually located in a relatively even cluster due to the existence of inner spatial outlier regions such as the small dense cluster in the sparse quasi-circular cluster in Figure 4a. In order to accurately extract various types of spatial outliers and clusters from **STE**, a strategy of multi-constrained Delaunay triangulation is proposed that removes the above three kinds of long edges hierarchically. This is described in detail below.

$$Long\_Edge{s}^{\mathrm{I}}\left(DT\right)=\left\{{E}_{i}|\left|{E}_{i}\right|\ge Mean\left(DT\right)+\frac{Mean\left(DT\right)}{\left|{E}_{i}\right|}*Std\left(DT\right)\right\},\text{}{E}_{i}\in DT$$

Here, $\frac{Mean\left(DT\right)}{\left|{E}_{i}\right|}$ is an adjusting coefficient that is inversely proportional to the length of edges. Mean(DT) and Std(DT) are both constants, so a longer edge will correspond to a smaller $Mean\left(DT\right)+\frac{Mean\left(DT\right)}{\left|{E}_{i}\right|}*Std\left(DT\right)$. As a result, the coefficient $\frac{Mean\left(DT\right)}{\left|{E}_{i}\right|}$ is sufficient to identify I-long edges. By removing all I-long edges, a series of sub-graphs for the remaining edges can be obtained, i.e., G^{(1)}_{1}, G^{(1)}_{2}, G^{(1)}_{3} in Figure 4b. In these sub-graphs the global spatial outliers have been separated from other patterns. II-long edges and III-long edges are further identified below in order to isolate other spatial patterns.

$$Long\_Edge{s}^{\mathrm{II}}({G}^{(1)}{}_{k})=\left\{Local\_Edge(j)|\left|Local\_Edge(j)\right|\ge Mean\left(L{E}_{i}\right)+\frac{Mean\left(L{E}_{i}\right)}{\left|Local\_Edge(j)\right|}*Std\left({G}^{(1)}{}_{k}\right)\right\}$$

$$where\text{\hspace{1em}}Local\_Edge(j)\in L{E}_{i}\text{\hspace{1em}}and\text{\hspace{1em}}Std({G}^{(1)}{}_{k})=\frac{{\displaystyle \sum _{i=1}^{\left|{G}^{(1)}{}_{k}\right|}Std(L{E}_{i})}}{\left|{G}^{(1)}{}_{k}\right|}$$

After removing all II-long edges in each G^{(1)}_{k}, a new series of sub-graphs can be obtained, i.e., G^{(2)}_{1}, G^{(2)}_{2}, …, G^{(2)}_{6} in Figure 4c, and the local spatial outliers are further separated. However, some relatively long edges remain in the magnified region in Figure 4c, so the inner spatial outlier region, which is connected to its surroundings by III-long edges, cannot yet be divided. These III-long edges need to be identified and dealt with.

Figure 4c shows that III-long edges are usually located in locally extremely uneven regions, which must therefore be identified first. This problem can be translated into finding those events whose local edges have an extremely large standard deviation of length.

$$Mea{n}_{Std}(Co{n}_{st{e}_{i}})=\frac{{\displaystyle \sum _{j=1}^{\left|Co{n}_{st{e}_{i}}\right|}Std(L{E}_{j})}}{\left|Co{n}_{st{e}_{i}}\right|}\text{\hspace{1em}}and\text{\hspace{1em}}St{d}_{Std}\left(Co{n}_{st{e}_{i}}\right)=\sqrt{\frac{{\displaystyle \sum _{j=1}^{\left|Co{n}_{st{e}_{i}}\right|}Std(L{E}_{j})}}{\left|Co{n}_{st{e}_{i}}\right|-1}}$$

Then any locally extremely uneven regions LEUR(G^{(2)}_{k}) can be defined as:

$$LEUR({G}^{(2)}{}_{k})=\left\{st{e}_{i}|Std(L{E}_{i})\ge Mea{n}_{Std}(Co{n}_{st{e}_{i}})+2\frac{Mea{n}_{Std}(Co{n}_{st{e}_{i}})}{Std(L{E}_{i})}\ast St{d}_{Std}({G}^{(2)}{}_{k})\right\},\text{\hspace{1em}}st{e}_{i}\in {G}^{(2)}{}_{k}$$

$$where\text{\hspace{1em}}St{d}_{Std}({G}^{(2)}{}_{k})=\frac{{\displaystyle \sum _{i=1}^{\left|{G}^{(2)}{}_{k}\right|}St{d}_{Std}\left(Co{n}_{st{e}_{i}}\right)}}{\left|{G}^{(2)}{}_{k}\right|}$$

$$Long\_Edge{s}^{\mathrm{III}}({G}^{(2)}{}_{k})=\left\{Local\_Edge(j)|\left|Local\_Edge(j)\right|\ge Mean\left(L{E}_{i}\right)+2\frac{Mean\left(L{E}_{i}\right)}{\left|Local\_Edge(j)\right|}*Std\left({G}^{(1)}{}_{k}\right)\right\}$$

$$where\text{\hspace{1em}}Local\_Edge(j)\in L{E}_{i}\text{\hspace{1em}}and\text{\hspace{1em}}L{E}_{i}\text{\hspace{0.17em}}\in LEUR\left({G}^{(2)}{}_{k}\right)$$

Equation (9) shows that III-long edges must be located in LEUR(G^{(2)}_{k}) and their lengths need to be larger than an indicator that is similar to the one defining II-long edges. Finally, all types of spatial patterns, i.e., G^{(3)}_{1}, G^{(3)}_{2}, …, G^{(3)}_{7} in Figure 4d, are separated after removing III-long edges. To determine which types of spatial patterns these sub-graphs are, in the following an indicator will be defined that considers the volumes of these sub-graphs.

It should be pointed out that previous multi-constrained Delaunay triangulation was mainly designed to detect various types of spatial clusters with different shapes and densities [25,26]. The multi-constrained Delaunay triangulation proposed in this study gives a more detailed analysis of the characteristics of edges at different levels, so that various spatial clusters and outliers can be detected simultaneously. For example, III-long edges in Figure 4c are usually located in locally extremely uneven regions; the approach proposed in this paper is able to identify these uneven regions and then extract and delete the hidden III-long edges. This is the main difference from the multi-constrained Delaunay triangulation used before.

Spatial outliers usually contain very few ste_{i} and so are defined as those relatively small sub-graphs after the elimination of long edges in the Delaunay triangulation [29]. In addition, those aggregated structures except spatial outliers are defined as spatial clusters in this study. Therefore, following the example of identification of long edges in Section 5.1.1, Section 5.1.2 and Section 5.1.3, the volume of each connected sub-graph will be used to define an indicator for the identification of spatial outliers and clusters.

$$SC=\left\{{G}^{(3)}{}_{k}|Vol({G}^{(3)}{}_{k})>Mean\left(RVol\right)-\frac{Vol({G}^{(3)}{}_{k})}{Mean\left(RVol\right)}*Std\left(RVol\right)\right\}$$

$$SO=\left\{{G}^{(3)}{}_{k}|Vol({G}^{(3)}{}_{k})\le Mean\left(RVol\right)-\frac{Vol({G}^{(3)}{}_{k})}{Mean\left(RVol\right)}*Std\left(RVol\right)\right\}$$
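The two definitions partition the sub-graphs by a single volume threshold, which the following Python sketch illustrates. It assumes that Vol is the number of events in a sub-graph and that RVol is the collection of volumes of all sub-graphs G^{(3)}_{k}; neither reading is confirmed by the text here, and the function name `classify_subgraphs` is hypothetical.

```python
import statistics


def classify_subgraphs(volumes):
    """Split sub-graphs into spatial clusters (SC) and outliers (SO).

    volumes : list with the volume (assumed: event count) of each sub-graph
    Returns two lists of sub-graph indices: (clusters, outliers).
    """
    mean_vol = statistics.mean(volumes)
    std_vol = statistics.pstdev(volumes)  # population std is an assumption
    clusters, outliers = [], []
    for index, vol in enumerate(volumes):
        # Threshold: Mean(RVol) - Vol / Mean(RVol) * Std(RVol)
        threshold = mean_vol - (vol / mean_vol) * std_vol
        if vol > threshold:
            clusters.append(index)
        else:
            outliers.append(index)
    return clusters, outliers


# Made-up volumes: two large sub-graphs and three very small ones.
clusters, outliers = classify_subgraphs([50, 40, 2, 1, 1])
```

With these made-up volumes the mean is 18.8, so the large sub-graphs get a negative threshold and are kept as clusters, while the sub-graphs with one or two events fall below thresholds of roughly 16 to 18 and are labeled outliers.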

The spatio-temporal distribution patterns for the given time period reflect the evolving process by which the ultimate spatial distribution patterns are formed, i.e., the evolving spatio-temporal patterns. In this section, these will be discovered based on spatio-temporal clustering using the following procedures.

- (i) all spatio-temporal Twitter events belonging to SN^{δ}(stte_{i});
- (ii) all spatio-temporal Twitter events belonging to TN^{ε}(stte_{i}); and
- (iii) all spatio-temporal Twitter events corresponding to spatial Twitter events in SN^{δ}(ste_{i’}) with IsOccur_T_{twi}(tw_{i}∈TW^{ε})=1, where ste_{i’} is the spatial Twitter event of stte_{i}.

Figure 7a shows the spatio-temporal Twitter events occurring at T = 5–7; the regions surrounded by red circles are produced by the amplification process. Given TW^{1} = [t − 1, t + 1], the red, blue and green points represent the above three treatments of STN^{1,1}(stte_{i}), respectively.

Based on the related definitions introduced in Section 5.1 and Section 5.2, the proposed algorithm for discovering evolving spatio-temporal patterns from Twitter events can be described as follows:

**Input:** Spatio-temporal Twitter events **STTE**, projected spatial Twitter events **STE**, thresholds δ and ε

**Output:** Evolving spatio-temporal patterns

- (i) Construct the Delaunay triangulation for **STE** to obtain the initial spatial proximity graph;
- (ii) Identify and remove inconsistent long edges, i.e., I-long edges, II-long edges and III-long edges, from the Delaunay triangulation;
- (iii) Extract connected sub-graphs and identify spatial clusters and outliers based on the volume of each connected sub-graph.

- (i) Determine the spatial neighborhoods of each spatial Twitter event and the spatial neighborhoods of each spatio-temporal Twitter event based on δ;
- (ii) Construct time windows based on ε and determine the temporal neighborhoods of each spatio-temporal Twitter event;
- (iii) Determine the spatio-temporal neighborhoods of each spatio-temporal Twitter event; and
- (iv) Extract spatio-temporal connected graphs based on the spatio-temporal proximity relationships and identify spatio-temporal clusters and outliers based on the volume of each spatio-temporal connected graph.
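Both lists end by extracting connected graphs from a proximity relationship. A minimal Python sketch of that extraction step, using a breadth-first search, is given below. It assumes the proximity relationships are already available as pairs of event indices; how those pairs are derived from the long-edge removal or from δ and ε is not modeled, and the function name `connected_components` is hypothetical.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Group events into connected graphs.

    nodes : iterable of event identifiers
    edges : iterable of (a, b) proximity pairs between events
    Returns a list of sorted components, in order of first appearance.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


# Six events: a chain of three, an isolated event, and a pair.
groups = connected_components(range(6), [(0, 1), (1, 2), (4, 5)])
# groups == [[0, 1, 2], [3], [4, 5]]
```

The size of each returned component could then serve as the volume on which clusters and outliers are distinguished.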

In this algorithm, constructing the Delaunay triangulation requires O(NlogN) time, where N is the number of spatial Twitter events. Removing I-long edges and updating the graph requires about O(N_{1} + N), where N_{1} is the number of edges in the Delaunay triangulation. Similarly, removing II-long edges and updating the graph requires about O(N_{2} + N), where N_{2} is the number of edges remaining after the I-long edges are removed. The next step, which involves finding extremely uneven regions, removing III-long edges and updating the graph again, requires about O(N_{3} + 2N), where N_{3} is the number of edges located in the extremely uneven regions. Finally, determining the spatio-temporal neighborhoods of spatio-temporal Twitter events and clustering the spatio-temporal connected graphs requires about O(N’), where N’ is the number of spatio-temporal Twitter events.

This section evaluates the effectiveness and practicality of the new framework proposed here by testing it experimentally on a real life dataset. In Section 6.1, the dataset and labels utilized in the experiments are described in detail, after which the experimental analysis is presented in Section 6.2. Finally, Section 6.3 examines the results of the analysis of evolving spatio-temporal patterns.

The Twitter dataset was purchased from www.datasift.com after a data reduction process. It consisted of 10% of all the tweets sent from 21 June 2012 to 31 May 2013 in 10 Latin American countries and covered the target domain ‘civil unrest’. The tweets sent from 21 June 2012 to 1 September 2012 in one country, Mexico, were selected for the case study. Errors in the Twitter data influence the detection results, so tweets with significant errors, for example those geotagged in the ocean, were deleted before the experiments were performed. This case study provides an appropriate experimental test for validating the framework because ground truth data is available for this scenario. Here the ground truth consists of a group of significant events from a Gold Standard Report (GSR) provided by http://www.mitre.org/. Specifically, among the top 100 newspapers in Latin America listed by International Media and Newspapers, the top 3 in Mexico, i.e., La Jornada, Reforma and Milenio, were selected to collect news related to ‘civil unrest’, with input from both the most influential international news outlets and subject matter experts. Events in the news reported through these sources were defined as conflict events. The use of authoritative news outlets and experts ensures that the events from the GSR are reliable.

For the seed query, 10 tweets related to civil unrest were chosen by users based on the guidance of domain experts to initiate the process [13]. All terms in the 10 tweets were ranked in descending order based on their corresponding DFIDF values [24]. The top 5 terms were selected as the seed terms and included both ‘protest’ and ‘march’. Based on these 5 seed terms, a dynamic query expansion was performed to extract spatio-temporal Twitter events and projected spatial Twitter events. Significant events from the GSR were also projected into the spatio-temporal cube based on their spatial and time tags. Figure 9a,b show the spatio-temporal distribution and spatial projection of both the extracted domain related Twitter events (shown by black points) and the significant events (shown by red triangles) provided by the GSR, respectively.

In a previous study we demonstrated that dynamic query expansion is an effective tool for extracting domain related Twitter events [13]. Therefore, given the extracted domain related Twitter events, two clustering methods for spatio-temporal point events, namely ST-DBSCAN [19] and STSNN [15], are utilized here for comparison. In all experimental results, the symbol “×” represents spatial and spatio-temporal outlier points, while spatial/spatio-temporal clusters and outlier regions are represented by symbols with different shapes and colors.

Figure 10 shows the spatial distribution patterns for **STE** produced by our new method, where Figure 10a depicts the spatial clusters and Figure 10b both the spatial outlier points and regions. Figure 10a reveals that 8 spatial clusters with different shapes and densities are obtained and that these can be further divided into three main regions, R1, R2 and R3. R1 and R3 are composed of SC5 and SC7, respectively, while R2 covers all the remaining 6 clusters. Comparing these results with the significant events reported in the GSR shows that these events are mainly distributed in R2, especially in SC1. The spatial outlier points and regions in Figure 10b are mainly distributed in the area surrounding R2 and in northern Mexico. Some of the spatial outliers cover all the remaining significant events that are not covered by spatial clusters. This indicates that spatial outliers are not just useless noise but, like spatial clusters, can indicate important events.

Based on these spatial distribution patterns, the evolving spatio-temporal patterns can be discovered after setting the thresholds δ and ε. To observe how the results vary for different parameters, δ and ε are assigned values of 1, 2 and 3 to generate a total of 9 pairs of parameters. Figure 11 illustrates all the evolving spatio-temporal patterns for each pair of parameters, where **STOP**, **STOR** and **STC** are shown from left to right, respectively. The figure shows that as δ and ε increase, **STOP** diminishes while both **STOR** and **STC** increase their spatio-temporal ranges. When δ and ε are set to infinity, the spatio-temporal Twitter events whose spatial projections belong to the same spatial distribution pattern are clustered together. Note that because **STOP**, **STOR** and **STC** evolve into their corresponding spatial distribution patterns, each **STOP**, **STOR** or **STC** contains only those spatio-temporal Twitter events located in the same spatial distribution pattern. The proposed method also extracts those spatio-temporal clusters (e.g., those regions signified by ellipses in Figure 11) that form spatial outlier regions. The most significant characteristic of this type of spatio-temporal cluster is that it locally aggregates in the spatial dimension and is continuous in the time dimension.

For ST-DBSCAN, the threshold Eps is set as 70 km, 85 km and 100 km, in turn, and MinPts is set as 5, 10 and 15. Repeated experiments revealed that the clustering results are mainly affected by Eps and MinPts, so ΔT is set at 2 days throughout. The results for the 9 sets of parameters are shown in Figure 12, clearly revealing that a larger Eps and a smaller MinPts correspond to larger spatio-temporal clusters. In Figure 12a, two spatio-temporal regions are represented by two black ellipses, labelled STR1 and STR2. As Eps increases and MinPts diminishes, the spatio-temporal cluster in STR1 expands significantly while STR2 is formed by a series of small clusters at all times. ST-DBSCAN only identifies the dense clusters in STR1 and cannot discover the clusters with small spatial ranges but large temporal ranges such as the ones in STR2. However, when Eps is 100 km and MinPts is 5 the spatio-temporal cluster in STR1 contains spatio-temporal Twitter events located in different spatial distribution patterns, as shown in Figure 12c.

For STSNN, the threshold k is set as 6, 10, 16 and 20 and, based on a suggestion by Liu et al. (2014), k_{T} and MinPts are both set at 0.5k. The threshold ΔT is again set as 2 days. The clustering results for each group of parameters are shown in Figure 13. Here, only a number of discrete small clusters are obtained for k = 6 and 10, but when k is set as 16, a single large spherical spatio-temporal cluster appears in STR1, as shown in Figure 13c. However, this approach suffers from the same problem as ST-DBSCAN, in that both ignore the final spatial distribution patterns of Twitter events. In addition, neither is able to accurately identify those clusters with small spatial ranges and large temporal ranges, such as the one in STR2. In the spatial dimension, this kind of cluster only represents spatial outliers located collectively in a local region, but when both space and time are considered it indicates that these events take place continuously over a long period of time. It is therefore an important type of spatio-temporal cluster that is evolving into a spatial outlier region. For k = 20, more significant spatio-temporal clusters are obtained, represented by STR3 and STR4 in Figure 13d. However, the cluster in STR2 of Figure 13c is still not completely detected; see, for example, STR5 in Figure 13d.

For the results obtained by the new method and reported in Section 6.2, the analysis of the evolving spatio-temporal patterns first examines how the spatio-temporal clusters that go on to form spatial clusters vary as the parameters change. A single set of these results is then analyzed in more detail to trace the evolution of the spatio-temporal patterns, and the results are compared with the significant events identified from the GSR.

The details of the spatio-temporal clusters that evolve into spatial clusters are visualized in Figure 14. For each group of results, the spatio-temporal distribution, spatial locations and time spans (denoted by ‘$\leftrightarrow$’) of the spatio-temporal clusters are shown from left to right. The spatio-temporal clusters with (δ, ε) = (1, 1 day) have small spatial and temporal ranges, as shown in Figure 14a, and are mostly distributed in sporadic spatial clusters, with no significant spatio-temporal clusters forming in SC5 and SC8. In addition, these clusters mainly occur in two date ranges, [2012.6.21, 2012.7.28] and [2012.8.16, 2012.9.01], each over a short period of time. As δ increases and ε remains the same, the spatial ranges of these spatio-temporal clusters expand significantly, as can be seen by comparing Figure 14d,g with Figure 14a. Similarly, as ε increases and δ remains the same, each spatio-temporal cluster extends over a longer time period, as shown in Figure 14b,c. δ can also affect the time periods of the spatio-temporal clusters. For example, in Figure 14d, for (δ, ε) = (2, 1 day), not only does the spatial range of STC1 in SC1 expand, but its time period also lengthens from [2012.6.21, 2012.7.28] to [2012.6.21, 2012.8.02]. Conversely, ε can also affect the spatial ranges of the spatio-temporal clusters; when δ remains the same and ε increases to 2 or 3 days, Figure 14b,c reveal a new spatio-temporal cluster, STC5, in SC5 that is not visible in Figure 14a.

For the obtained evolving spatio-temporal patterns, δ and ε largely reflect the outbreak degree of **STTE**. For example, spatio-temporal clusters obtained with small δ and ε indicate that those **STTE** extend only a short distance in the spatial dimension and continuously in the time dimension, whereas the new members that join the original spatio-temporal clusters as δ and ε increase represent a process of wide and discontinuous extension.

Furthermore, a more detailed analysis illustrates how those spatio-temporal Twitter events evolve into the final spatial distribution patterns. The setting (δ, ε) = (2, 2 days) is selected for this purpose as an intermediate choice of parameters. Figure 15a,c show the obtained spatio-temporal outlier points, regions and spatio-temporal clusters from left to right, respectively, while the corresponding spatial locations and time periods are shown in Figure 15b,d. The figures reveal 12 spatio-temporal clusters that are in the process of evolving into spatial outliers, mainly located in central and northern Mexico. In Figure 15b, STC1–STC11 are present from late June to mid- or late July, while STC12 first appears on 20 August 2012 and lasts until 1 September 2012. A number of spatio-temporal outlier points and regions also form spatial outliers. Spatial clusters generally evolve from spatio-temporal clusters, as shown in Figure 15c,d; most of these occur between late June and late July or early August, with the final four lasting from mid-August to about 31 August 2012. Spatio-temporal outlier points and regions are also involved in the evolution of spatial clusters, especially those located in central Mexico.

To compare our results with the significant events identified by the GSR, three typical cities where significant numbers of events were reported, namely Ciudad de México, Pachuca de Soto and Monterrey, are selected for further analysis. Figure 16a gives the spatial locations of these three cities, and the reported dates of significant events are listed in Figure 16b. By combining Figure 16a with Figure 15b,d, one can see that Ciudad de México and Pachuca de Soto both fall within the range of SC1. Ciudad de México is also located in both STC3 and STC15, while Pachuca de Soto is located in STC3. In addition, Monterrey is in STC1 and STC12, both of which form spatial outliers and are represented by ‘▲’ in Figure 15b. Figure 16b reveals that significant events were reported in Ciudad de México almost daily throughout July and August, while Pachuca de Soto is reported to have had significant events during mid-July, late July and on 13 August 2012. STC3 and STC15 in Figure 15d exist during the periods [2012.6.21, 2012.8.15] and [2012.8.16, 2012.9.01], respectively. For Monterrey, significant events were reported during early July, mid-July and late August. STC1 and STC12 in Figure 15b exist during the periods [2012.6.21, 2012.8.02] and [2012.8.20, 2012.9.01], respectively. Therefore, the evolving spatio-temporal patterns obtained using the new method are highly consistent with the reported significant events.
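
The consistency check described above amounts to testing whether reported dates fall inside the time spans of the clusters that contain each city. The following minimal sketch encodes only the cluster periods and city-to-cluster assignments stated in the text; the helper names `covering_clusters` and `uncovered_days` are hypothetical, and the only concrete reported date used is 13 August 2012 for Pachuca de Soto, since the other dates are given in the text only as approximate periods.

```python
from datetime import date, timedelta

# Cluster time spans as stated in the text (inclusive).
CLUSTER_PERIODS = {
    "STC1": (date(2012, 6, 21), date(2012, 8, 2)),
    "STC3": (date(2012, 6, 21), date(2012, 8, 15)),
    "STC12": (date(2012, 8, 20), date(2012, 9, 1)),
    "STC15": (date(2012, 8, 16), date(2012, 9, 1)),
}

# Clusters in which each city is located, as stated in the text.
CITY_CLUSTERS = {
    "Ciudad de México": ["STC3", "STC15"],
    "Pachuca de Soto": ["STC3"],
    "Monterrey": ["STC1", "STC12"],
}


def covering_clusters(city, day):
    """Names of the city's clusters whose time span contains the given day."""
    return [
        name
        for name in CITY_CLUSTERS[city]
        if CLUSTER_PERIODS[name][0] <= day <= CLUSTER_PERIODS[name][1]
    ]


def uncovered_days(city, start, end):
    """Days in [start, end] not covered by any of the city's clusters."""
    days = (start + timedelta(n) for n in range((end - start).days + 1))
    return [d for d in days if not covering_clusters(city, d)]


# The one exact reported date given in the text.
print(covering_clusters("Pachuca de Soto", date(2012, 8, 13)))

# Coverage of July and August 2012 for two of the cities.
for city in ("Ciudad de México", "Monterrey"):
    gaps = uncovered_days(city, date(2012, 7, 1), date(2012, 8, 31))
    print(city, len(gaps))
```

A check of this kind only confirms temporal overlap; it does not by itself measure how closely the cluster boundaries track the onset and end of the reported events.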

This paper proposes a framework for discovering evolving domain related spatio-temporal patterns from Twitter data. In the new framework, a dynamic query expansion is employed to extract spatio-temporal Twitter events from the initial Twitter data for a given target domain, after which a spatio-temporal approach specifically developed to discover the evolving spatio-temporal patterns of the domain related Twitter events is applied. Using Twitter datasets from Mexico for the domain of civil unrest, an experimental comparison with ST-DBSCAN and STSNN was conducted to illustrate the effectiveness of the proposed method, and its practicality was demonstrated by comparing the results obtained with the significant events identified in the Gold Standard Report (GSR).

In summary, the GSR only collected those dates when events reached their climax, but these events were usually preceded by a period during which minor conflicts escalated and were followed by the subsequent fallout from the event. The evolving spatio-temporal patterns of the Twitter events can reflect the characteristics of reported events based on the reactions of human observers and participants. It would thus be helpful to refine this approach further in order to accurately predict the evolution process for different types of events in each representative region (i.e., those spatial clusters and outliers). However, to perform an effective geographical analysis of Big Data, such as the social media data considered in this study, data quality cannot be ignored, because the initial data commonly contain numerous errors. Social media data are also usually a biased sample of the population, so correcting this bias so that the data reflect spatio-temporal patterns correctly remains a challenge [30]. Therefore, our future work will focus on analyzing the quality, incompleteness and uncertainty of Twitter data and on further modifying our proposed methods. The modifiable temporal unit problem (MTUP) can affect the detection results, so how to select the optimal width of the time window, taking into account the MTUP and specific practical applications, will also be investigated in the future [31,32]. Given the variety of spatio-temporal Big Data, mining potential spatio-temporal patterns from multiple datasets across different domains with different representations, distributions, scales, densities and so on is a further challenge [33,34]. In addition, methods of geographical visualization should be developed to present the complicated analysis results to users vividly and comprehensibly.

This work was supported by the National High Technology Research and Development Program of China (863 Program), No. 2013AA122301, the Hunan Natural Science Fund for Distinguished Young Scholars, No. 14JJ1007, and the National Science Foundation of China (NSFC), No. 41471385.

Yan Shi and Min Deng conceived the idea for the research and wrote the paper; Yan Shi designed the experiments; Xuexi Yang performed the experiments and analyzed the data; Qiliang Liu interpreted the results.

The authors declare no conflict of interest.

- Java, A.; Song, X.; Finin, T.; Tseng, B. Why we twitter: Understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNAKDD 2007 Workshop on Web Mining and Social Network Analysis, San Jose, CA, USA, 12–15 August 2007; pp. 56–65.
- Cheng, A.; Mark, E.; Harshdee, S. Inside Twitter: An in-Depth Look Inside the Twitter World; SYSOMOS: Toronto, ON, Canada, June 2009.
- De Albuquerque, J.P.; Herfort, B.; Brenning, A.; Zipf, A. A geographic approach for combining social media and authoritative data towards identifying useful information for disaster management. Int. J. Geogr. Inf. Sci. **2015**.
- Heverin, T.; Zach, L. Microblogging for crisis communication: Examination of twitter use in response to a 2009 violent crisis in Seattle-Tacoma, Washington area. In Proceedings of the 7th International ISCRAM Conference, Seattle, WA, USA, 2–5 May 2010.
- Pan, B.; Zheng, Y.; Wilkie, D.; Shahabi, C. Crowd sensing of traffic anomalies based on human mobility and social media. In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Orlando, FL, USA, 5–8 November 2013; pp. 334–343.
- Chew, C.; Eysenbach, G. Pandemics in the age of Twitter: Content analysis of Tweets during the 2009 H1N1 outbreak. PLoS ONE **2009**, 5, e14118.
- Ramage, D.; Dumais, S.; Liebling, D. Characterizing microblogs with topic models. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media, Washington, DC, USA, 23–26 May 2010; pp. 130–137.
- Markman, V. Unsupervised discovery of fine-grained topic clusters in Twitter posts. Pap. AAAI Workshop Anal. Microtext **2011**, WS-11–05, 32–37.
- Fujisaka, T.; Lee, R.; Sumiya, K. Detection of unusually crowded places through micro-blogging sites. In Proceedings of 2010 IEEE 24th International Conference on Advanced Information Networking and Applications Workshops, Perth, Australia, 20–23 April 2010; pp. 467–472.
- Lee, R.; Wakamiya, S.; Sumiya, K. Discovery of unusual regional social activities using geo-tagged microblogs. World Wide Web **2011**, 14, 321–349.
- Chae, J.; Thom, D.; Bosch, H.; Jang, Y.; Maciejewski, R. Spatiotemporal social media analytics for abnormal event detection an examination using seasonal-trend decomposition. In Proceedings of the 2012 IEEE Conference on Visual Analytics Science and Technology (VAST), Seattle, WA, USA, 14–19 October 2012; pp. 143–152.
- Cheng, T.; Wicks, T. Event detection using Twitter: A spatio-temporal approach. PLoS ONE **2014**, 9, e97807.
- Zhao, L.; Chen, F.; Dai, J.; Hua, T.; Lu, C.-T.; Ramakrishnan, N. Unsupervised spatial event detection in targeted domains with applications to civil unrest modeling. PLoS ONE **2014**, 9, e110206.
- Bakillah, M.; Li, R.Y.; Liang, S.H. Geo-located community detection in Twitter with enhanced fast-greedy optimization of modularity: The case study of typhoon Haiyan. Int. J. Geogr. Inf. Sci. **2014**.
- Liu, Q.; Deng, M.; Bi, J.; Yang, W. A novel method for discovering spatio-temporal clusters of different sizes, shapes and densities in the presence of noise. Int. J. Digit. Earth **2014**, 7, 138–157.
- Blei, D.; Ng, A.; Jordan, M. Latent Dirichlet Allocation. J. Mach. Learn. Res. **2003**, 3, 993–1022.
- Signorini, A.; Segre, A.M.; Polgreen, P.M. The use of Twitter to track levels of disease activity and public concern in the US during the influenza A H1N1 pandemic. PLoS ONE **2011**, 6, e19467.
- Chakrabarti, D.; Punera, K. Event summarization using tweets. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media, Barcelona, Spain, 17–21 July 2011; pp. 66–73.
- Wang, M.; Wang, A.; Li, A. Mining spatial-temporal clusters from geo-database. Lect. Notes Artif. Intell. **2006**, 4093, 263–270.
- Cheng, T.; Li, Z. A multiscale approach for spatio-temporal outlier detection. Trans. GIS **2006**, 10, 253–263.
- Wu, E.; Liu, W.; Chawla, S. Spatio-temporal outlier detection in precipitation data. Knowl. Discov. Sens. Data **2010**, 5840, 115–133.
- Kulldorff, M.; Heffernan, R.; Hartman, J.; Assunção, R.; Mostashari, F. A space-time permutation scan statistic for disease outbreak detection. PLoS Med. **2005**, 2, e59.
- Liu, P.; Zhou, D.; Wu, N. VDBSCAN: Varied density based spatial clustering of application with noise. In Proceedings of 2007 International Conference on Service Systems and Service Management, Chengdu, China, 9–11 June 2007; pp. 528–531.
- Weng, J.; Lee, B.S. Event detection in Twitter. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media, Barcelona, Spain, 17–21 July 2011; pp. 401–408.
- Estivill-Castro, V.; Lee, I. Argument free clustering for large spatial point-data sets. Comput. Environ. Urban Syst. **2002**, 26, 315–334.
- Deng, M.; Liu, Q.; Cheng, T.; Shi, Y. An adaptive spatial clustering algorithm based on Delaunay triangulation. Comput. Environ. Urban Syst. **2011**, 35, 320–332.
- Jiang, M.-F.; Tseng, S.-S.; Su, C.-M. Two-phase clustering process for outliers detection. Pattern Recognit. Lett. **2001**, 22, 691–700.
- Al-Zoubi, M.B.; Al-Dahoud, A.A.; Yahya, A. New outlier detection method based on fuzzy clustering. WSEAS Trans. Inf. Sci. Appl. **2010**, 7, 681–690.
- Shi, Y.; Deng, M.; Yang, X.; Liu, Q. Adaptive detection of spatial point event outliers using multilevel constrained Delaunay triangulation. Comput. Environ. Urban Syst. **2016**.
- Wang, J.; Ge, Y.; Li, L.; Meng, B.; Wu, J.; Bo, Y.; Du, S.; Liao, Y.; Hu, M.; Xu, C. Spatiotemporal data analysis in geography. Acta Geogr. Sin. **2014**, 69, 1326–1345.
- Cheng, T.; Adepeju, M. Modifiable temporal unit problem (MTUP) and its effect on space-time cluster detection. PLoS ONE **2014**, 9, e100465.
- Huang, Q.; Wong, D.W.S. Modeling and visualizing regular human mobility patterns with uncertainty: An example using Twitter data. Ann. Assoc. Am. Geogr. **2015**, 105, 1179–1197.
- Zheng, Y. Methodologies for cross-domain data fusion: An overview. IEEE Trans. Big Data **2015**, 1, 16–34.
- Zheng, Y.; Zhang, H.; Yu, Y. Detecting collective anomalies from multiple spatio-temporal datasets across different domains. In Proceedings of the 23rd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Bellevue, WA, USA, 3–6 November 2015; pp. 1–10.

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).