1. Summary
Social media platforms generate increasing volumes of data with their growing popularity and widespread user bases [1]. These data contain a wide range of information, including public opinions that are important to diverse groups such as the scientific community and the business world for capturing open challenges and making marketing decisions [2]. However, due to the vast volume and high dynamicity of the data, it is impractical to analyse them manually to extract important events (incidents or activities which happened at a certain time and were discussed or reported significantly in social media [3]) or their sentiments. Thus, many researchers focus on automated mechanisms to extract events and sentiments from social media data streams. To support them, different datasets with ground-truth events [4,5] and sentiments [6,7,8,9] have been published by previous work. However, to the best of our knowledge, no dataset contains both event and sentiment labels together. Furthermore, a clear majority of the sentiment datasets consist of random sets of social media posts rather than postings during a continuous period.
Considering these limitations and the importance of social media event sentiment analysis, in this paper, we release TED-S, a Twitter dataset with both sub-event and sentiment labels. We specifically targeted Twitter, considering its suitability for social network analysis based on its popularity, simple data model and limited restrictions on data access [10]. Since events have different characteristics depending on the domain, we focus on data from two diverse domains, sports and politics, which have different sub-events, evolution rates and audiences. Deviating from traditional sentiment labelling, we provide, per tweet, the probabilities/confidences for each sentiment category to further support aggregated sentiment analyses, such as analysing different sentiments expressed in a single tweet and the overall sentiment during a period or sub-event.
As our initial data source, we use Twitter Event Data 2019 (TED) [3], considering its recency and event coverage. TED consists of complete subsets of the Twitter data stream collected using the Twitter Developer Standard API during the considered main events, together with their sub-event details extracted from published media reports. The considered main events are (1) MUNLIV, the English Premier League 19/20 match between Manchester United Football Club (FC) and Liverpool FC on 20 October 2019, and (2) BrexitVote, Brexit Super Saturday 2019, the UK parliament session on Saturday, 19 October 2019. In our work, we assign sentiment labels to each tweet in these datasets (MUNLIV and BrexitVote) to support future event sentiment research.
We target annotating the sentiment expressed by each tweet based on three categories as follows:
positive: Hopeful, confident or expressing the good/positive aspect of a situation
negative: Discouraging, refusing or expressing the bad/negative aspect of a situation
neutral: No positive or negative expression
These categories are commonly used in previous research on Twitter sentiment analysis, considering their simplicity and coverage [8,9,11]. In addition, since we target event data in our approach, using quantitative scales or more fine-grained categories could introduce unnecessary complexity because the class boundaries could be thin, and definitions may need adjustment depending on the main event.
For sentiment labelling, different approaches have been used in previous research. Manual labelling is the most common among them: a group of annotators with sufficient background knowledge of the targeted data label it manually, and a curator then finalises the labels based on the majority opinion [6,7]. However, due to this process's time consumption and cost, its usage is mostly limited to small dataset annotations. To overcome these limitations, previous research has proposed different approaches to generate large sentiment datasets, which can be mainly categorised as unsupervised lexicon-, distant supervision- and supervised machine learning-based approaches.
Unsupervised Lexicon-based Approaches: Given the unavailability of pre-labelled data at labelling time, there has been a tendency to use unsupervised lexicon-based approaches for sentiment labelling. VADER [12,13,14,15] and TextBlob [8,14,16] were found to be the most popularly used such tools. VADER (Valence Aware Dictionary for sEntiment Reasoning) is a simple lexicon- and rule-based model designed for general sentiment analysis [17]. TextBlob is also a lexicon-based Python library designed for textual data processing, covering a wide range of Natural Language Processing (NLP) tasks, including sentiment analysis [18]. However, due to the generic design of these tools, they fail to capture event-specific sentiment expressions accurately, as indicated by the results in Section 3.3.2.
Distant Supervision-based Approaches: Distant supervision uses an existing knowledge base as a source to generate data labels, combining the benefits of semi-supervised and unsupervised approaches [19]. Following this technique, Go et al. [20] proposed using emoticons to derive sentiment labels of tweets: they identified emoticons which express positive and negative sentiments and labelled the tweets containing them accordingly. The same idea was also implemented using hashtags [21]. However, these approaches depend heavily on the initial categorisation of emoticons/hashtags and can only label tweets which contain at least one of those emoticons/hashtags. In addition, they may require event-specific customisations to the original knowledge base depending on the targeted events.
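The emoticon-based distant supervision idea can be sketched in a few lines of Python; the emoticon sets and function name here are illustrative assumptions, not the exact lists used by Go et al. [20]:

```python
# Illustrative distant supervision sketch: emoticon sets are assumptions,
# not the original study's exact lists.
POSITIVE_EMOTICONS = {":)", ":-)", ":D", "=)"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}

def distant_label(tweet: str):
    """Return 'positive'/'negative' if the tweet contains a known emoticon,
    otherwise None (the tweet cannot be labelled by this approach)."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE_EMOTICONS for t in tokens)
    has_neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # ambiguous or no emoticon: left unlabelled
```

The `None` branch makes the key limitation visible: any tweet without a listed emoticon simply cannot be labelled by this approach.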
Supervised Machine Learning-based Approaches: With supervised learning, already available sentiment data or trained models are used to predict labels for large datasets [9,22]. These predictions are highly biased towards the training data and learning algorithm used. Thus, similar to the unsupervised lexicon-based approaches, supervised approaches are also less suitable for event-specific sentiment labelling. The comparatively low F1 scores we observed during our experiments from models trained only on available data further confirm this (Section 3.3.2).
Considering the limitations of the above approaches used in previous research, we propose a novel data annotation approach for sentiment labelling, following the idea of supervised machine learning-based approaches. Rather than using a single model, we propose using an ensemble of multiple neural network models to mitigate the impact of individual model biases on the final predictions. The models are specifically picked based on recent trends in NLP to obtain more accurate predictions. In addition, we manually label small subsets of each large dataset and let the models learn the specifics of the targeted events from that data alongside an available large labelled dataset. The implementation of our approach is publicly available with the labelled data to support similar data annotation tasks. A comprehensive data description is available in Section 2, and the data labelling approach is further described along with the data statistics in Section 3.
In summary, the contributions of this paper are as follows.
1. To the best of our knowledge, TED-S is the first dataset that contains Twitter data corresponding to particular events (two diverse events from the sports and political domains) throughout a continuous period with both sub-event and sentiment labels.
2. Along with the data, an ensembled data annotation approach appropriate for large datasets is proposed, involving multiple state-of-the-art neural network models, and its implementation is released to support similar data annotation tasks.
3. As sentiments, non-manual annotations made for the complete datasets, which hold a combination of confidence values for all targeted sentiment categories, are released, providing the ability to customise the data for either direct or aggregated sentiment analysis. In addition, manual annotations made for small fractions of the data, which hold the mainly expressed sentiment of a tweet, are released to support future research avenues such as comparing manual and non-manual annotations and designing semi-supervised learning approaches which strengthen models by iteratively learning from manual and predicted labels.
4. The availability of both event and sentiment labels, unlike in existing datasets, makes TED-S beneficial for a wide range of social media research, including event detection, sentiment classification, sentiment evolution, event sentiment extraction and event sentiment forecasting.
3. Methods
This section presents our data annotation approach. Section 3.1 details the Twitter data collection we used. Section 3.2 describes the manual annotation approach we followed to label subsets of each large dataset, which are utilised during model learning to provide the specifics of the targeted events to the models. Finally, Section 3.3 describes the ensembled approach proposed for data annotation, involving several neural network architectures, the conducted experiments and data statistics.
3.1. Twitter Data Collection
We used Twitter Event Data 2019 (TED) (available at https://github.com/HHansi/Twitter-Event-Data-2019 (accessed on 25 November 2021)) [3] as our initial data source to acquire event-related tweets. To the best of our knowledge, this is the most recent social media dataset released with ground-truth (GT) event details covering two diverse domains, sports and politics. In addition, this dataset has complete subsets of the Twitter data stream collected using the Twitter Developer Standard API during the considered event periods without any gaps. The sports dataset was generated focusing on the English Premier League 19/20 match between Manchester United FC and Liverpool FC on 20 October 2019, and we refer to it as 'MUNLIV', as in the original study. The political dataset was generated focusing on Brexit Super Saturday 2019, a UK parliament session that occurred on Saturday, 19 October 2019, and it is referred to as 'BrexitVote'. The statistics of the collected data are summarised in Table 5.
3.2. Manual Annotation
We annotated random subsets of the MUNLIV and BrexitVote datasets using a manual annotation process. We considered three sentiment categories, positive, negative and neutral, following the definitions stated in Section 1, and assigned the most appropriate category to each tweet based on its textual content. In cases where the main text had a sentiment conflicting with the hashtags, we gave priority to the main text, because such hashtags are mostly used to connect the tweet to a particular topic.
We involved two annotators for this task, each with at least a master's-level qualification in computer science or linguistics. The annotators familiarised themselves with the targeted events before starting the annotation process by reading available resources. We also provided them with sample annotations per category to familiarise them with the task. Both annotators worked on the same 150 samples during the first annotation round to measure inter-annotator agreement. The outputs of this round indicated Cohen's kappa [23] values of 0.7393 and 0.6367 between the annotations for the MUNLIV and BrexitVote samples, respectively. Considering the high agreement achieved, we then annotated the remaining samples from both selected subsets using one annotator per instance. On completing the manual annotation process, we obtained 8344 labelled tweets from the MUNLIV dataset and 2016 labelled tweets from the BrexitVote dataset. The distribution of the labelled tweets among the three sentiment categories is summarised in Table 6.
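For reference, the agreement measure used above can be computed with a few lines of standard-library Python; this is a generic Cohen's kappa sketch for two annotators, not the authors' evaluation script:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same instances:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of instances where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Values around 0.6 to 0.8, as reported above, are conventionally read as substantial agreement.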
3.3. Ensembled Annotation
Considering the cost of manually annotating large datasets and the limitations of non-manual approaches for event-specific sentiment annotation, we propose an ensembled approach to annotate the complete MUNLIV and BrexitVote datasets. For this approach, we utilise the data annotation strategy proposed with democratic co-learning [24], considering its successful applications in different areas such as time-series prediction [25] and offensive language identification [26]. In this approach, a set of classifiers is trained on available labelled data using different learning algorithms. Involving different algorithms with different inductive biases helps resolve individual model biases and produces predictions with lower noise. The trained models are then used to make predictions on unlabelled data, and the outputs are aggregated to generate the final labels.
Since our annotation task focuses on three sentiment categories (positive, negative and neutral), we build multi-class classification models with the ability to predict the confidence of each category, given an instance. Then, we aggregate the confidence values predicted by each model to generate the labels for the unlabelled data. As the final label, rather than providing an exact category, we provide the mean and standard deviation of the confidence values predicted by the models per category. Given this more detailed label, users can adapt the data to targeted applications. The standard deviation values are specifically useful for filtering out instances with high model disagreement to reduce the noise in the dataset, depending on user requirements. In addition, providing confidence values for each sentiment category is helpful in scenarios where a single instance contains a mix of sentiments. A summary of our approach is as follows; it is also illustrated in more detail in Algorithm 1.
1. Train N diverse supervised models M_1, …, M_N using the available labelled data to predict the sentiment categories K = {positive, negative, neutral}.
2. For each instance x in the unlabelled data, predict the confidence c_i(x, k) of each category k ∈ K using each built model M_i.
3. Aggregate the predicted confidences per category of each instance x to generate the final label (μ_k(x), σ_k(x)), where μ_k(x) and σ_k(x) are the mean and standard deviation of c_1(x, k), …, c_N(x, k).
Algorithm 1: Ensembled Annotation
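A minimal Python sketch of this aggregation, assuming each model is a callable returning a confidence per category; the interface and the use of the population standard deviation are our assumptions, as the paper does not fix an implementation:

```python
from statistics import mean, pstdev

CATEGORIES = ("positive", "negative", "neutral")

def ensemble_label(instance, models):
    """Aggregate per-category confidences from N models into the released
    label format: a (mean, standard deviation) pair per sentiment category.
    `models` is a list of callables mapping an instance to a dict of
    category -> confidence."""
    predictions = [model(instance) for model in models]
    return {
        cat: (mean(p[cat] for p in predictions),
              pstdev(p[cat] for p in predictions))
        for cat in CATEGORIES
    }
```

A high standard deviation for a category signals model disagreement, which is exactly what the released σ values let users filter on.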
The rest of this section explains the used learning algorithms/models, model evaluations and a summary of final labels. Under model evaluation, we summarise the details of training and testing data we used, obtained results and criteria we used to select the best models to label unlabelled data.
3.3.1. Models
The supervised sentiment analysis approaches developed in previous research range from traditional machine learning (ML) [27,28] to deep learning (DL) [29,30,31]. However, more focus has been given to DL-based methods in recent research, considering their improved performance over traditional ML-based approaches [10,32]. Among the different DL methods, Long Short-Term Memory (LSTM) [33] and Convolutional Neural Network (CNN) [34] models were found to be the most commonly used for sentiment analysis [35]. Transformer models [36] have also recently been applied to sentiment analysis, following their success in several NLP applications [37,38,39,40]. Following these trends, we constructed three classification models based on the LSTM, CNN and Transformer architectures, which have diverse inductive biases, to use with the ensembled annotation approach.
LSTM: The LSTM model consists of five layers. The first is an embedding layer initialised with concatenated GloVe and fastText embeddings. We used GloVe's Common Crawl (840B tokens) 300-dimensional model (GloVe pre-trained models are available at https://nlp.stanford.edu/projects/glove/ (accessed on 28 December 2021)) and fastText's Common Crawl (with subword information) 300-dimensional model (fastText pre-trained models are available at https://fasttext.cc/docs/en/english-vectors.html (accessed on 28 December 2021)) to generate the embeddings. In initial experiments, we compared the separate embeddings with their concatenation, and the concatenation performed best. Following the embedding layer, the architecture has two bi-directional LSTM layers with a dense layer on top. Finally, a dense layer with softmax activation generates the predictions. We adapted this architecture from the Toxic Comment Classification Challenge on Kaggle (available at https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge (accessed on 28 December 2021)).
CNN: The CNN model consists of four 2D convolutional layers. As with the LSTM, the first layer is an embedding layer initialised with concatenated GloVe and fastText embeddings. Then, following a spatial dropout layer, the architecture has the convolutional layers, each followed by a max-pooling layer. The final layer is a dense layer with softmax activation to make the predictions. We adapted this architecture from the Quora Insincere Questions Classification Kaggle competition (available at https://www.kaggle.com/c/quora-insincere-questions-classification (accessed on 28 December 2021)).
3.3.2. Model Evaluation
This section describes the labelled data we used to train the models, model hyper-parameters, evaluation results and the best performing model selection for label generation. We used an Nvidia Tesla K80 GPU to conduct all our experiments.
Training Data:
In addition to the manually annotated data from the MUNLIV and BrexitVote datasets, we used two available datasets, SemEval and FIFA, to train the models. The SemEval dataset was obtained by merging all the datasets developed for the SemEval sentiment analysis tasks from 2013 to 2016 [6]. It consists of tweets corresponding to trending topics on Twitter covering different domains, including sports and politics. The FIFA dataset consists of tweets corresponding to the FIFA World Cup 2014, representing the sports domain [7]. The sentiment distribution of each dataset is illustrated in Figure 1. For model evaluation purposes, we split each of these datasets into train and test splits, whose details are summarised in Table 7. Additionally, the distribution of tweet sequence lengths in each split is shown in Figure 2. As can be seen in these graphs, recent tweets (MUNLIV and BrexitVote) tend to have higher sequence lengths than earlier tweets (SemEval and FIFA), following the increase in the Twitter character limit.
Data Preprocessing:
For data preprocessing, we mainly considered cleaning uninformative tokens and formatting the data according to model requirements. As uninformative data, we removed links and retweet notations from tweets. We tokenised the data using the Natural Language Toolkit (NLTK)'s TweetTokenizer (NLTK documentation is available at https://www.nltk.org/ (accessed on 28 December 2021)) with the 'reduce length' option, which generalises word forms by trimming highly repeated characters. We converted the text to lowercase for the LSTM and CNN models because the embedding models used were trained on lowercased text.
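The cleaning steps described above can be approximated with standard-library regexes; this is only an illustrative sketch, since the actual pipeline uses NLTK's TweetTokenizer rather than these hand-written patterns:

```python
import re

def preprocess(tweet: str, lowercase: bool = True) -> str:
    """Approximate the described cleaning: remove links and retweet
    notation, cap highly repeated characters (as TweetTokenizer's
    'reduce length' option does), and optionally lowercase."""
    tweet = re.sub(r"https?://\S+", "", tweet)      # remove links
    tweet = re.sub(r"^RT\s+@\w+:\s*", "", tweet)    # remove retweet notation
    tweet = re.sub(r"(.)\1{2,}", r"\1\1\1", tweet)  # cap repeats at three
    if lowercase:                                   # for LSTM/CNN embeddings
        tweet = tweet.lower()
    return tweet.strip()
```

Lowercasing is applied only for the LSTM and CNN inputs, matching the casing of the pre-trained GloVe and fastText vocabularies.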
Hyper-Parameters:
For all models, we fixed the maximum sequence length to 96, considering the sequence length distributions in the datasets (Figure 2), and used 10% of the training data to validate the models during training. For the LSTM and CNN models, we used a batch size of 64, a learning rate of … with the Adam optimizer and 20 epochs with an early stopping patience of 5. For the Transformer model, considering its high memory requirement, we used a batch size of 16, a learning rate of … with the Adam optimizer and 3 epochs with an early stopping patience of 10. In addition, we set the evaluation steps to allow 6–11 evaluations per training epoch, depending on the size of the dataset.
Results:
We evaluated the performance of each built model on all test datasets (Table 7) to select the best performing models. For training, we used separate as well as combined datasets. Combining datasets increases the data size and also lets us analyse inter-domain and inter-dataset capabilities in predicting sentiment. When combining datasets, we ensured that each shares a domain with at least one other set in the combination, so that the final set has sufficient instances from each domain. The macro F1 score is used to measure model performance. The obtained results are summarised in Table 8.
Additionally, we evaluated the performance of previously proposed approaches for large dataset annotation to compare against the proposed approach. Among the three method categories described in Section 1, we could only involve the unsupervised lexicon- and supervised machine learning-based approaches in this comparison, because distant supervision requires generating a knowledge base based on a specific data component (e.g., hashtags) and can only process tweets which contain at least one of the defined components. As the unsupervised lexicon-based approaches, we used both the VADER [17] and TextBlob [18] models. VADER returns a compound score within the range [−1, 1], which is commonly used for sentiment analysis; the extreme negative and positive cases are indicated by −1 and 1. Following the common practice in previous research, we mapped compound scores ≥0.05 to positive, ≤−0.05 to negative and the remainder to neutral during our evaluations [13,14]. Similarly, TextBlob returns a polarity score within [−1, 1], but negative, zero and positive values are commonly recognised as negative, neutral and positive sentiments, respectively [14,16]. Their results are in the bottom section of Table 8. As the supervised machine learning-based models, we consider the models trained only on the other training data, excluding data from the dataset to which the test set belongs.
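The score-to-label mappings described above are simple enough to state directly in code; the ±0.05 VADER thresholds follow the cited convention, and the function names are ours:

```python
def vader_to_label(compound: float) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment category,
    using the commonly adopted +/-0.05 thresholds."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

def textblob_to_label(polarity: float) -> str:
    """Map a TextBlob polarity score in [-1, 1]: the sign decides the
    label, with exactly zero treated as neutral."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"
```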
According to the results, in the majority of cases, the proposed neural network models outperformed the unsupervised lexicon-based approaches (VADER and TextBlob), indicating the effectiveness of supervision. Among these neural networks, the models trained only on other datasets mostly returned lower F1 measures than the models which had at least seen a fraction of the dataset to which the test set belongs, emphasising the importance of capturing event-specific sentiment expressions. For the SemEval test set, all the models trained on the combination of all training sets performed best. For the FIFA test set, the LSTM and BERTweet models trained on the combination of the SemEval, FIFA and MUNLIV training sets and the CNN model trained on the FIFA and MUNLIV training sets achieved the highest F1 values. The same LSTM and BERTweet models performed best on the MUNLIV test set, but for the CNN architecture, the model trained on the MUNLIV training set outperformed the others. For the BrexitVote test set, different training combinations (LSTM trained on FIFA; CNN trained on SemEval, FIFA and MUNLIV; and BERTweet trained on FIFA and MUNLIV) achieved the best F1 values, highlighting the inter-domain capabilities in predicting sentiments. Overall, the LSTM and CNN models achieved nearly similar F1 values, and the BERTweet model performed better than both.
Model Selection:
Among the trained models, we selected the best-performing model of each architecture to automatically generate the sentiment labels of the unlabelled tweets in MUNLIV and BrexitVote. For this selection, we did not rely directly on the MUNLIV or BrexitVote test F1 values, considering the smaller size of these datasets. Instead, we used a weighted F1 measure calculated by combining either MUNLIV or BrexitVote with the test data of one of the already available datasets (FIFA or SemEval). With MUNLIV, we combined the FIFA results because both represent football events. With BrexitVote, we combined the SemEval results, as this dataset covers general topics, including politics. While calculating the weighted F1, we gave a higher weight to MUNLIV and BrexitVote because the selected models are used to predict sentiments on their complete datasets (Equations (1) and (2)). Table 9 shows the weighted F1 values achieved by each model. According to them, for the MUNLIV predictions, we selected the LSTM and BERTweet models trained on the SemEval, FIFA and MUNLIV training sets, and the CNN model trained on the combination of all training sets. For the BrexitVote predictions, we selected the LSTM model trained on the SemEval and BrexitVote training sets, the CNN model trained on the SemEval, FIFA and MUNLIV training sets and the BERTweet model trained on all training sets.
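The selection criterion can be illustrated as a weighted mean of two test F1 scores. The exact weight from Equations (1) and (2) is not reproduced here, so w = 0.75 is an assumed placeholder that merely gives the event data the higher weight:

```python
def weighted_f1(f1_event: float, f1_general: float, w: float = 0.75) -> float:
    """Weighted F1 combining the event test set (MUNLIV or BrexitVote)
    with a related available test set (FIFA or SemEval). The weight w is
    an illustrative assumption, not the value used in Equations (1)/(2);
    it only encodes that the event data is weighted more heavily."""
    return w * f1_event + (1 - w) * f1_general
```

For example, a model scoring 0.80 on MUNLIV and 0.60 on FIFA would receive a weighted F1 of 0.75 under this placeholder weight.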
3.3.3. Final Labels
Using the best models selected, we predicted the labels for the complete MUNLIV and BrexitVote datasets (Table 5). As the final label per instance, we provide three mean values, μ_positive, μ_negative and μ_neutral, and three standard deviation values, σ_positive, σ_negative and σ_neutral, computed based on the confidences predicted by the models.
We release labels for both the non-duplicate and the complete (all tweets) versions of the selected event streams, considering the different use cases of sentiment analysis associated with social media. Without duplicate data, we can efficiently analyse the sentiment of the ideas expressed. However, in social networks, people share others' posts (on Twitter, retweeting) mostly to indicate agreement; if we remove duplicates from a social media data stream, we remove these social aspect-based details. Thus, with duplicates, we can analyse public opinions and their evolution. For example, assume a tweet p with a positive sentiment and a tweet n with a negative sentiment are posted during time t, and p is retweeted ten times while n is not. If we ignore the duplicates (retweets), we recognise one positive and one negative opinion from the data. If we consider the duplicates, they indicate that the majority hold a positive opinion.
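The retweet example can be made concrete with a small sketch; the data layout here (text, sentiment, retweet count) is an illustrative assumption, not the released file format:

```python
from collections import Counter

def majority_sentiment(tweets, count_duplicates: bool) -> str:
    """tweets: list of (text, sentiment, retweet_count) tuples. With
    count_duplicates=True, each retweet counts as one more expression
    of the same opinion, as in the example in the text."""
    votes = Counter()
    for _text, sentiment, retweets in tweets:
        votes[sentiment] += 1 + (retweets if count_duplicates else 0)
    return votes.most_common(1)[0][0]

# Tweet p is retweeted ten times; tweet n is not. Without duplicates the
# opinions tie (one each); with duplicates, positive dominates 11 to 1.
stream = [("great session!", "positive", 10),
          ("terrible outcome", "negative", 0)]
```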
Sentiment Distribution:
Using the predicted labels, we illustrate the distribution of sentiments for each event during the targeted period in Figure 3, Figure 4, Figure 5 and Figure 6. Since more than one tweet can be posted at a particular time, we aggregated the repeated values and show the mean with a 95% confidence interval in these line graphs.
For the MUNLIV non-duplicate data, the majority expressed a negative sentiment, and the minority a neutral sentiment (Figure 3). However, when we consider the sentiment distribution of all tweets, the majority is positive (Figure 4). This indicates that users retweeted most of the positive tweets during this event rather than writing their own. Additionally, the clear peaks in positive sentiment indicate the discussions that happened during or after the major sub-events of this football match (e.g., 16:06, first goal; 17:10, second goal). Similar to the MUNLIV sentiment distribution, the BrexitVote non-duplicate and all-tweet datasets also show different positive and negative distributions. For the non-duplicate data, the positive and negative distributions are nearly similar throughout the considered time except at the end (Figure 5). In contrast, for all BrexitVote data, there was a mix of positive and negative opinions at the beginning, but then the majority became negative, then positive, and negative again at the end (Figure 6).
Summary:
In this paper, we proposed a novel data annotation approach involving several neural network models, overcoming major limitations of available approaches for large dataset annotation, mainly the inability to capture event-specific sentiment expressions and the high impact of model biases. Using our approach, we assigned sentiment labels to all tweets (273,915 tweets) in TED [3], covering two main events, MUNLIV and BrexitVote, in the domains of sports and politics. As the sentiment label per instance, we provide a composition of six values: the means and standard deviations of the confidences predicted per class (positive, negative and neutral) by the built models. We release this new dataset under the name TED-S, to be used in a wide range of social media research, including event detection, sentiment classification and event sentiment extraction. In addition, we release the implementation of our data annotation approach as an open-source project to support similar annotation tasks.