2.1. Method Overview
Twitter messages can be retrieved using keywords or geographic coordinates. Section 2.3
describes how to extract spatial information from a disaster database and to estimate the impact area. Toponyms (cities, towns and villages) from the area are used as keywords to retrieve messages, and messages with embedded coordinates are selected using the area polygon. In the next step (Section 2.5
), hashtags are extracted from the retrieved messages, and several heuristics are used to filter out non-relevant hashtags (Figure 1).
The key element of the filtering process is the classification of messages by their content. Such classification models are a common element of nearly all research on disaster media, but usually, the quality of these models is not high (see Table 3 for a comparison). Imran et al. [4
] concluded that “pre-trained classifiers significantly drop in classification accuracy when used in different but similar disasters” and proposed an infrastructure to train a new individual model for every disaster event. Our classifier has higher quality than existing models, and once trained, the model can recognize messages about a wide variety of natural and technological disasters. Therefore, we expect that our model will be useful for future research, even research unrelated to the hashtag retrieval method. Details of the classifier design can be found in Section 2.4.
We have to deal with two main challenges. The first one follows from the short length of the messages. Very often, users do not mention the cities where they live and where a disaster happens. This can be explained by the private character of the messages: their addressees (friends) already know where the author lives. News media, especially foreign news agencies, can use generic toponyms, e.g., a country name, assuming that detailed information can be found on the website referenced in the message. For example, two messages published on 12 November 2013 from the account CNN Breaking News (@cnnbrk) testify to both cases:
Typhoon #Haiyan has displaced at least 800,000 people, U.N. estimates.
Typhoon #Haiyan deaths likely 2000 to 2500—not 10,000, Philippine president tells
The second challenge follows from the correctness of the impact area estimation. Because the area is not recorded in the database, but estimated using several rules, it can include cities not really affected by the disaster or omit actually harmed places. This increases the number of non-relevant messages and reduces the number of useful messages that are used in the next step to extract hashtags.
2.3. Impact Area
Results of the instrumental monitoring of disaster processes are recorded in disaster databases. A record can be a direct impact area registration (e.g., the contour of a flooded territory or burned region) or an indirect reference to a key hazard characteristic (epicenter coordinates or a cyclone track). In the latter case, the reference should be expanded to cover adjacent areas within some distance in order to estimate the potential emergency zone.
In this research, we experiment with floods and tropical cyclones. Other types of disasters should be processed in a similar way.
The Global Active Archive of Large Flood Events (GAALFE [20
]) contains contours of submerged regions. The georeference accuracy can be illustrated by the example of the 2013 Alberta flood in Canada (Entry Number 4068 in GAALFE). This flood affected two river basins, the South Saskatchewan and Elk rivers [22
], but the most disastrous damage was observed in 5 cities and towns in the basin of South Saskatchewan [23
] (Figure 2
). The solid line in the figure delineates the South Saskatchewan river basin [24
], and the dashed polygon shows the location of the event as it is recorded in GAALFE. We can consider that the georeference of the database fairly represents the river; nevertheless, the flood occurred in the highland western part of the basin and did not affect the lower course.
Therefore, the polygon from GAALFE includes populated places not affected by the event and omits important harmed places. Because a list of cities and towns is used in the following steps to find hashtags, this inaccuracy leads to a significant increase in noise, as will be shown later.
The International Best Track Archive for Climate Stewardship (IBTrACS [18
]) represents information about cyclones in track form. The affected area should be reconstructed using additional process parameters. One of the cyclone hazard factors is wind speed. It is recorded as the radii at which the wind speed exceeds the 34-knot threshold (higher-level radii are not important in this case). For a reasonable approximation, these four radii, one for every cardinal direction, can be reduced to the single maximal radius (Figure 3
). If the maximal wind speed of a cyclone is less than 34 knots (such as a “tropical depression”, according to some classifications), then it can be recommended to use a constant radius, e.g., 20 nautical miles.
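The reduction of the four quadrant radii to a single impact radius can be sketched as follows (function and parameter names are illustrative, not taken from our codebase):

```python
def impact_radius_nm(max_wind_kt, radii_34kt_nm, default_nm=20.0):
    """Approximate the impact radius at one cyclone track point.

    radii_34kt_nm: the 34-knot wind radii for the four quadrants
    (NE, SE, SW, NW), in nautical miles; missing values may be None.
    """
    if max_wind_kt < 34:
        # Tropical depression: no 34-kt radii are recorded,
        # so fall back to a constant radius.
        return default_nm
    known = [r for r in radii_34kt_nm if r]  # drop None/zero entries
    return max(known) if known else default_nm
```

The fallback also guards against track points where the storm reaches 34 knots but the radii fields are empty, which occurs in some historical records.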
The second hazard factor is rainfall. Total precipitation from the ECMWF ERA Interim dataset [25
] was summed throughout the whole time of cyclone activity (Figure 4
a). Rainfall can also be represented by cloudiness throughout the cyclone region (Figure 4b).
Rainfall is the most dangerous cyclone factor; it triggers floods and landslides and contributes to much of the human loss. In 2013, two hurricanes, Manuel and Ingrid, hit the territory of Mexico almost at the same time: 13–19 September (Manuel) and 12–17 September (Ingrid) [26
]. Most of the victims of the hurricanes were reported from three spots (yellow and red colors on Figure 4
a). The total precipitation represents the most accurate disaster impact and is consistent with on-site reports. However, in the case of two simultaneous disasters in the same region, e.g., cyclones Manuel and Ingrid, it is practically impossible to separate the rainfall into two parts according to the originating cyclones. Cloudiness also does not facilitate disambiguation (Figure 4b).
The wind speed boundary is not as accurate as rainfall, but can be easily referenced to the cyclone, because it was constructed from the cyclone track recorded in the database. In the following experiments, we use wind speed boundaries to define the impact area. This approach has the above-mentioned limitations. However, we expect that the situation in the places not included in that area is described using the same hashtags.
For our experimental system, flood polygons from GAALFE and cyclone tracks from IBTrACS were imported into the PostGIS database (PostgreSQL). Flood polygons are left unchanged. A cyclone track consists of points (one point every 6 hours) with a 34-knot wind radius. These points are converted into buffers and joined into one polygon for every cyclone. Hence, we lose information about the dynamics of a cyclone, but this seems to be a reasonable simplification; otherwise, data processing would have to be done for every day separately, which complicates the algorithm. The transformation of the points into a polygon is a simple and straightforward operation. The updating of data and the addition of new records can be fully automated. The source code of all processing functions can be found in the Supplementary Materials.
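In our implementation, the buffering and union are done inside PostGIS (ST_Buffer followed by ST_Union). The following standalone sketch only illustrates the geometric idea of turning one track point into an approximately circular polygon; the 32-vertex approximation and the simple latitude scaling are simplifying assumptions:

```python
import math

NM_PER_DEG = 60.0  # 1 degree of latitude is roughly 60 nautical miles


def point_buffer(lon, lat, radius_nm, n=32):
    """Approximate a circular buffer around one track point as a
    polygon (a list of lon/lat vertices).  Longitude offsets are
    divided by cos(lat) so the circle keeps its shape away from
    the equator."""
    r_deg = radius_nm / NM_PER_DEG
    pts = []
    for i in range(n):
        a = 2 * math.pi * i / n
        pts.append((lon + r_deg * math.cos(a) / math.cos(math.radians(lat)),
                    lat + r_deg * math.sin(a)))
    return pts
```

The per-point polygons would then be dissolved into a single impact polygon per cyclone, which is exactly what ST_Union does in the database.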
The estimated contour of the disaster area is used to find affected populated places (cities, towns and villages). We use a gazetteer based on a preprocessed OpenStreetMap dataset: city polygons were converted to central points, with population information added [21].
Disaster polygons (flooded areas and cyclone buffers) are used to find populated places inside the impact area. For optimization, the 20 most populated places are chosen to make the list of affected places. PostgreSQL provides an effective full-text search infrastructure; therefore, all Twitter messages were indexed, and every message that mentions a place from the list is collected for the subsequent processing and extraction of hashtags. Some toponyms in the texts can be misprinted; however, we do not make any text correction or apply fuzzy search methods. Since we are dealing with big data, the loss of a few messages is expected to be negligible.
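A minimal stand-in for this retrieval step, using plain case-insensitive substring matching in place of the PostgreSQL full-text index that we actually use, might look like:

```python
def messages_mentioning(messages, places):
    """Pick messages whose text mentions any affected place.
    Exact (case-insensitive) matching stands in for the full-text
    index; no spelling correction or fuzzy search is applied."""
    names = [p.lower() for p in places]
    hits = []
    for msg in messages:
        text = msg.lower()
        if any(name in text for name in names):
            hits.append(msg)
    return hits
```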
Disaster polygons can also be used to directly find messages with embedded coordinates (geotweets). The proportion of geotweets among all Twitter records for September–November 2013 is only 2.7%. However, these geotweets are expected to be strongly related to an event. In our experiments, we use the two approaches (city list and coordinates) together.
2.4. Message Classification
The classification of messages by their content is the main element of nearly all research on disaster media and is also the key element of the proposed method. The task of message filtering can be formalized as a binary classification problem with two classes, disaster and non-disaster. In our research, we use labeled datasets prepared in former research; hence, the classes are not strictly defined. For example, CrisisLexT6 (see below) defines several classes: “not applicable”, “not related”, “related and informative” and “related, but not informative”. In our research, the first two classes are treated as the non-disaster category, and the last two classes are joined into the disaster category.
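The label rearrangement for CrisisLexT6 reduces to a fixed mapping over the class names quoted above (the function name is illustrative):

```python
# CrisisLexT6 multi-class labels collapsed to the binary task.
BINARY_LABEL = {
    "not applicable": "non-disaster",
    "not related": "non-disaster",
    "related and informative": "disaster",
    "related, but not informative": "disaster",
}


def to_binary(label):
    """Map one CrisisLexT6 label to the disaster/non-disaster task."""
    return BINARY_LABEL[label.lower()]
```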
The main goal of the designed classifier is to find all kinds of social response to a disaster, regardless of credibility and usefulness. The model should be as general as possible, because this one model is to be used in the processing of a series of disasters without adaptation or modification. The aim of our research is to retrieve messages that are highly relevant to an event. We expect that the retrieved messages will then be classified using other task-specific models according to the goals of the real application.
Former research used relatively small datasets and usually did not publish them [28
]. To the best of our knowledge, the first public disaster-related dataset was CrisisLexT6 in 2014 [10
]. To build our corpus, we used 3 independent datasets. The natural balance of the classes is not constant and varies widely. For this reason, the records were balanced at a 1:1 class proportion.
CrisisLexT6 is a well-prepared dataset (this and the following datasets can be downloaded from the website http://crisislex.org
). In our corpus, 5 of 6 events were used, including 2 floods, 2 storms (hurricane and tornado) and 1 technological disaster. A terrorist incident was excluded. Approximately 5000 Tweets describe each of these events.
CrisisLexT26 is a collection of 26 events represented in several languages (mostly English and Spanish). Only English messages were picked for the corpus. This dataset uses multi-class labels, which were rearranged according to the binary classification task. Most of the messages (80–90%) were disaster related. This dataset added the following events to the corpus: 2 wildfires, 4 earthquakes, 4 floods, 2 storms and 7 technological disasters. The resulting number of messages varied from 85 to 900 per event.
The third dataset is a comparatively noisy one with broken character encoding. However, it contains some rare collections. We used 4 of them, including 1 storm, 1 earthquake, 1 volcano and 1 collection of landslides (including avalanches). The number of records was up to 2500 per event. The dataset was not balanced: the number of non-disaster messages was twice the number of disaster messages. This surplus of non-disaster messages was used to balance the corpus.
Imran et al. [4
] showed that disaster messages use a highly distinguishable lexicon, in which every event can be recognized using a small set of specific words. For example, almost every message about Hurricane Sandy in 2012 contained the words: nyc, obama, romney, sandy, #hurricane, #romneystormtips. To make the model more general, event-specific words were removed from the corpus, such as toponyms, responsible persons, proper names and dates.
Inter-annotator agreement was not recorded for any of these datasets. According to a similar project, it can be expected to equal 75–87% [30].
The resulting dataset contained 36,122 disaster messages and an equal number of non-disaster messages (Table 1).
The lack of some disaster categories (for example, tsunamis, heat waves and epidemic diseases) places a limitation on the classification model. It is not clear how biological (epidemic and epizootic) and long-term climatological (heat waves and droughts) disasters can be incorporated into one general model. Terrorist incidents seem similar to technological disasters, but are not included in this research. The study of the lexicon of these disaster categories is a task for future research.
Some categories (volcanoes and wildfires) are underrepresented in the corpus. However, it will be shown later that different categories share a common disaster vocabulary (victims, losses and damage) and complement each other so that the designed classification model has some generalizability.
Corpus preparation varies across previous research. However, all non-words (URLs, hashtags, usernames) are usually removed or replaced by placeholders. The feature set is constructed from unigrams [28
] or is extended by POS tags and Verbnet classes [32
]. We follow this common practice, and the processing includes the removal of URLs, user names and punctuation, as well as lower case transformation and stemming. Messages are transformed into a document-term matrix with a reduction of low-frequency terms. A very simple feature set was chosen for the future adaptation of the model to new national languages.
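The preprocessing steps can be rendered as a self-contained sketch (this illustrative Python stands in for the R text-mining infrastructure we actually used, and stemming is omitted; in the real pipeline a stemmer is applied to every token):

```python
import re
from collections import Counter


def preprocess(text):
    """Clean one tweet: drop URLs and @usernames, strip punctuation
    and digits, lower-case, and tokenize."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)           # user names
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # punctuation, digits
    return text.lower().split()


def document_term_matrix(texts, min_freq=2):
    """Bag-of-words counts with low-frequency terms removed."""
    docs = [preprocess(t) for t in texts]
    freq = Counter(tok for doc in docs for tok in doc)
    vocab = sorted(t for t, c in freq.items() if c >= min_freq)
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc:
            if tok in index:
                row[index[tok]] += 1
        rows.append(row)
    return vocab, rows
```

Note that stripping punctuation turns a hashtag such as #flood into the plain token flood, so hashtag words stay in the vocabulary while the marker itself is discarded.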
Several machine-learning methods can be used to train the model, and the best results in previous research were achieved using logistic regression and supervised Latent Dirichlet Allocation (sLDA) [28
], random forest [29
] and naive Bayes [32
]. In our research, two methods were tested: Support Vector Machine (SVM) and sLDA implemented in the R packages RTextTools [33
] and lda [35
]. The SVM method was used with default settings: C-classification, radial basis kernel. The optimal number of topics for sLDA was estimated using the methods of Cao Juan et al. [36
] and Griffiths and Steyvers [37
] (R package ldatuning [38]).
At the training stage, the model is very sensitive to event-specific words. For this reason, 5 events (earthquake, flood, storm, wildfire and technological disaster) were separated from the corpus to construct the testing dataset; the remaining part composed the training dataset (8% and 92%, respectively). In this way, the training and testing datasets contained messages from non-overlapping events, so the model was always being tested by messages from unseen disasters.
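The event-wise split can be sketched as follows (the record layout and event names are illustrative):

```python
def event_split(records, held_out_events):
    """Split labeled messages so that the training and testing sets
    contain non-overlapping events.  Each record is a tuple
    (event, text, label)."""
    train = [r for r in records if r[0] not in held_out_events]
    test = [r for r in records if r[0] in held_out_events]
    return train, test
```

Because no event appears on both sides of the split, the model is always evaluated on messages from disasters it has never seen.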
The designed models should successfully recognize disaster-related texts; hence, the performance of the classifier is measured using standard information retrieval metrics: the portion of event-related messages among all retrieved messages (precision) and the portion of recognized messages among all event-related messages (recall). The area under curve (AUC) is utilized to compare the models.
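For reference, both metrics can be computed directly from the predicted and true binary labels:

```python
def precision_recall(y_true, y_pred, positive="disaster"):
    """Precision and recall for the binary disaster classification."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred)
             if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred)
             if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```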
Two models were trained: SVM and sLDA. Latent Dirichlet allocation is a stochastic method; for that reason, the sLDA model was trained ten times, and the model with the highest AUC value was selected. SVM showed a better result, with an AUC of 0.937 versus 0.909 for sLDA. Therefore, it was chosen for further use in the research.
The examination of the testing dataset shows that the model demonstrates good generalizability (Table 2
). One of the underrepresented categories, “wildfires” (1648 records in the training and 1659 in the testing datasets) was classified with acceptable quality, which can be explained by the existence of a common disaster vocabulary.
Disaster message classification models have been used in many previous studies [29
]. All of them solve slightly different problems. However, in all cases, the quality of the models is not high. For example, the classifier developed by Ashktorab et al. [28
] has a precision of 78% and recall of 57%.
Two models from former research were chosen for comparison. Ashktorab et al. [28
] found messages about damage and casualty in the set of texts representing 12 crises; Cobo et al. [29
] built a classifier that filters tweets relevant and non-relevant to an earthquake. The new model, compared with models from previous research, was significantly improved in terms of recall (Table 3
). The substantial difference in the number of records in the training datasets should be noted. As noted above, large labeled disaster datasets have been published only during the past two years.
To estimate a sufficient size for the training dataset, messages were chosen from the category “flood”, and the model was trained using datasets of different size. Figure 5
shows that the learning curve stabilized when the size of the training dataset reached approximately 10,000 messages for the sLDA model (and 13,000 for SVM). These values can be treated as the minimum required dataset size for every disaster category.
LDA topics show a good ability to separate classes. It is expected that incorporating LDA topics as additional SVM features could improve the model.
2.5. Hashtag Extraction and Filtering
Toponyms and coordinates defined using disaster databases (Section 2.3
) are used to retrieve messages from the 1% Twitter archive. Hashtags extracted from these messages represent several social phenomena in the defined region: the disaster itself, ordinary life (e.g., shopping, birthdays) and long-term overlapping events of different origin (e.g., election campaigns). For this reason, we propose several heuristics (filters) to remove non-disaster hashtags.
To illustrate the filtering process, two disastrous floods were chosen: the so-called “Halloween flood” in south-central Texas (Entry Number 2013-0510 in EM-DAT, 4101 in GAALFE) [40
] and river flooding in northern Colorado (Entry Number 4089 in GAALFE) [41
]. Both events occurred in Autumn 2013. In spite of the fact that GAALFE scores both floods as extreme events (severity class 2), the Colorado flood received relatively higher consideration on social media. Therefore, we expect a high level of noise in messages related to the flood in Texas.
Using the estimated impact region, we can find messages about the Texas flood in the 1% Twitter archive. The coordinates of 224 messages fall within the impact region; 19,187 messages mention toponyms in their text (Section 2.3). From all discovered messages, we extract a total of 4314 hashtags.
Filter 1. Low-frequency hashtags: Hashtags used in a small number of messages can be removed from the hashtag set as falling below a certain noise level. Even if these hashtags are event specific, it should be expected that they are not productive and are not important for data retrieval.
In our example, 3412 hashtags (79%) occur only once. Messages are collected from the 1% archive, so the full Twitter dataset may contain more messages with a given hashtag. However, a disaster hashtag should symbolize a process important for the region and be highly represented in social media; therefore, low-frequency hashtags can be removed as insignificant.
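Filter 1 then reduces to a simple frequency threshold (the threshold value is a tunable noise level, not fixed by the method):

```python
from collections import Counter


def filter_low_frequency(hashtags, min_count=2):
    """Filter 1: keep only hashtags used at least min_count times.
    `hashtags` is the flat list extracted from all retrieved
    messages; counting is case-insensitive."""
    counts = Counter(h.lower() for h in hashtags)
    return {h: c for h, c in counts.items() if c >= min_count}
```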
For the Colorado flood, the 3 most frequent hashtags are event related: #boulderflood, #coflood, #longmontflood. The first hashtag related to the Texas flood can be found only at Position 55 of the hashtag list. Hence, noisy hashtags cannot simply be cut using a frequency threshold; rather, a more complex approach should be applied.
Filter 2. Pre-disaster hashtags: Assuming that topics discussed in the region before the disaster are not related to the event, pre-disaster hashtags can be filtered out. This filter effectively removes hashtags related to local toponyms, everyday life and other disasters (current or lapsed).
The same geographic method (coordinates and toponyms) is used to retrieve new messages from the impact area, but up to a week before the event. Hashtags extracted from new messages are clearly not related to the event. In our example, these hashtags correspond to local geographic places (#atx, #texas, #georgetown), recent events discussed in news-media (#news, #mtvema, #nascar), everyday life (#jobs, #atxtraffic, #hiring), leisure activity (#photo, #concert, #music), and so on (single-underlined hashtags in Figure 6).
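Filter 2 is essentially a set difference between candidate hashtags and pre-disaster hashtags:

```python
def filter_pre_disaster(candidates, pre_disaster):
    """Filter 2: remove hashtags already in use in the impact area
    during the week before the event (local places, everyday life,
    overlapping events); comparison is case-insensitive."""
    before = {h.lower() for h in pre_disaster}
    return [h for h in candidates if h.lower() not in before]
```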
Filter 3. Classification: To address classifier mistakes and odd uses of hashtags, up to 200 messages are retrieved for every hashtag. These messages are classified using the model described in Section 2.4
. A hashtag is defined as “disaster” if the ratio of supporting messages is higher than 0.5.
It should be noted that we always use the same model described in Section 2.4
without modification. Because the precision of the model is estimated at 89%, some messages are expected to be classified incorrectly.
Simultaneously with the flood, people from the impact area discussed other non-disaster events, such as the MTV Europe Music Awards or the gubernatorial debates between Wendy Davis and Greg Abbott. These events were reflected in social media as hashtags: #ema, #emabestfemale, #voteaustinmahone, #votearianagrande (MTV); #wendydavis (debates). Most of the messages related to such events were classified as non-disaster. Most of the misclassified messages are very short (Examples 1–2) or contain words from the disaster lexicon (Examples 3–4):
#EMABestFemale Selena Gomez . my queen
Everyone RT this. #EMABestFemale Miley Cyrus
omg i woke up at 6:30am today and im just dead #votearianagrande
#Iraq’i government complicit in deadly attack at #CampAshraf on #Iran’ian refugees
Because of that, we use a supporting ratio of messages: a hashtag is recognized as disaster related only if the majority of its messages were classified as disaster related. For the same reason, we do not recommend classifying only the messages retrieved at the beginning of this section: a small proportion of Tweets mention toponyms in the text, so this sample does not represent the real meaning of the hashtags well, and some important hashtags would be lost. Nevertheless, that approach can be used for computational optimization.
After applying Filter 3, the hashtag set for the flood in Texas still contains non-disaster hashtags: #wendydavis and #alienwarehangar. Only a few records matching these hashtags can be found in the Twitter 1% archive: 19 and 4, respectively. Such small sets of messages are very sensitive to classifier mistakes; hence, these hashtags should be removed according to the total number of supporting messages.
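The supporting-ratio test and the minimum sample size can be combined in one predicate (the minimum of 5 messages is an illustrative choice, not a value fixed in this research):

```python
def is_disaster_hashtag(message_labels, threshold=0.5, min_messages=5):
    """Filter 3: keep a hashtag only if the ratio of its messages
    classified as disaster exceeds the threshold.  min_messages
    guards against tiny samples, which are overly sensitive to
    classifier mistakes."""
    if len(message_labels) < min_messages:
        return False
    ratio = (sum(1 for label in message_labels if label == "disaster")
             / len(message_labels))
    return ratio > threshold
```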
Filter 4. Common hashtags: At this point, the hashtag list already contains only disaster-related hashtags. However, some of them can be used among many disasters, such as #flood, #redcross and #donations. These “common” hashtags can be easily found through the intersection of hashtag lists collected for several disasters. For example, hashtag #floods can be found in messages about both flood events. These hashtags should be removed from the resulting set because they are not event-specific.
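Filter 4 can be sketched as removing every hashtag that appears in the sets of two or more disasters:

```python
def filter_common(hashtag_sets):
    """Filter 4: drop 'common' hashtags shared between disasters
    (e.g. #flood, #redcross); each input set holds the hashtags
    collected for one disaster."""
    common = set()
    seen = set()
    for s in hashtag_sets:
        common |= (seen & s)  # anything seen before is common
        seen |= s
    return [s - common for s in hashtag_sets]
```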
Additional control over hashtag filtering can be exercised by choosing the frequency limit in Filter 1 and the supporting level in Filter 3. Network analysis and clustering of pairwise co-occurrences of hashtags in texts should be studied in future research.