Article

Robust Sentimental Class Prediction Based on Cryptocurrency-Related Tweets Using Tetrad of Feature Selection Techniques in Combination with Filtered Classifier

by
Saad Awadh Alanazi
Department of Computer Science, College of Computer and Information Sciences, Jouf University, Sakaka 72341, Saudi Arabia
Submission received: 13 May 2022 / Revised: 12 June 2022 / Accepted: 13 June 2022 / Published: 15 June 2022
(This article belongs to the Special Issue Natural Language Processing: Recent Development and Applications)

Abstract

Individuals' mental feelings and reactions are becoming more significant, as they help researchers, domain experts, businesses, companies, and other individuals understand people's overall responses in specific situations or circumstances. Every pure and compound sentiment can be classified using a dataset, which can take the form of Twitter text posted by various Twitter users. Twitter is one of the vital platforms for individuals to participate and share their ideas about different topics; it is also considered one of the most famous and largest micro-blogging websites on the Internet. One of the key purposes of this study is to classify pure and compound sentiments based on text related to cryptocurrencies, an innovative form of trading that is flourishing daily. The cryptocurrency market experiences many fluctuations in coin values, and a small piece of positive or negative news can sway the whole outlook for a specific cryptocurrency. In this paper, individuals' pure and compound sentiments based on cryptocurrency-related Twitter text are classified. The dataset is collected through the Twitter API. In WEKA, two deployment schemes are compared: the first uses a single feature selection technique (Tweet to lexicon feature vector), and the second uses a tetrad of feature selection techniques (Tweet to lexicon feature vector, Tweet to input lexicon feature vector, Tweet to SentiStrength feature vector, and Tweet to embedding feature vector) to purify the data before it is passed to the LibLINEAR (LL) classifier, which contains fast algorithms for linear classification using L2-regularized, L2-loss support vector machines (dual SVM). The LL classifier differs in that it can optionally minimize the sum of the absolute values of errors rather than the sum of the squared errors and is typically much faster. Based on the overall performance parameters, the deployment scheme containing the tetrad of feature selection techniques with the LL classifier is considered the best choice for classification. Among machine learning techniques, LL produces effective results and gives an efficient performance compared to other prevailing techniques. The findings of this research would be beneficial for Twitter users as well as cryptocurrency traders.

1. Introduction

Cryptocurrency is a digital or virtual currency created to function as a means of trade and secured using a blockchain; it is also comparable to conventional electronic money. Bitcoin's price as a virtual currency has risen dramatically during the last several months. The motivation for inventing Bitcoin and any subsequent virtual agreements is to address perceived shortcomings in exchanging money between parties. Bitcoin is the most well-known cryptocurrency, as it piques everyone's interest in the subject of encryption due to its rapid growth and status as the de facto standard for cryptocurrencies. While there are several cryptocurrencies, Bitcoin and Ethereum are the market leaders, and they serve distinct functions. Bitcoin was founded as a replacement for fiat currency; it serves as a means of exchange and value storage, whereas Ethereum is being developed as a platform for promoting peer-to-peer contracts and applications via its currency tools [1,2].
Despite their bubble values, cryptocurrencies have been transacted and have grown to be a significant investment in the Decentralized Finance (DeFi) sector. Binance reports that the global cryptocurrency market capitalization reached USD 1.3 trillion at the time of this study. The argument about cryptocurrencies has raged on among authorities, specialists, and academics globally. Bitcoin had the most incredible market valuation in the recent DeFi market, followed by Ethereum, Tether (USDT), and Binance Coin [3]. The virtual world’s public perception (e.g., people’s tweets) impacts the price fluctuation of cryptocurrencies. For instance, Elon Musk’s tweets in mid-2021 led Bitcoin prices to vary substantially, and then reports of a future US government effort to control digital assets may have spurred Bitcoin’s current price decline [4,5].
Scholarly articles on cryptocurrency have grown rapidly in volume, covering a variety of related issues, including Blockchain, Bitcoin, Ethereum, network security, encryption, exchanges, and electronic money. However, existing studies have primarily examined cryptocurrencies within computer science [6]. Thus, sentiment analysis studies that use tweet data to understand the social consequences of cryptocurrencies through the perspective of emotion theory remain limited. Sentiment analysis can gauge the public's reaction to such dynamic fluctuations in cryptocurrency values. This examination is crucial for comprehending the social ramifications of the phenomenon, which emphasizes the study's significance [7].
The first stage in big data analysis is collecting data; this is called "data mining." These records may come from any source, and there are numerous data sources from which a massive amount of data can be acquired. Twitter is an excellent source for data science. It is a free social networking platform that enables users to stream tweets, which are short messages. Users broadcast tweets for various reasons, including pride, attention, enjoyment, boredom, assistance, and the desire to become famous. Most users use Twitter for recreational purposes, sending messages to the world and ensuring that ideas are distributed within communities. Unlike on other social media networks, users' tweets are completely public and searchable. The public's perspective on, and perception of, various subjects can be acquired through Twitter data. The Twitter API enables applications to retrieve data; however, certain restrictions are involved [8,9].
Sentiment analysis evaluates the emotional tone of a collection of words to comprehend the perspectives, thoughts, and feelings expressed in online references [10]. It is extremely advantageous for social media monitoring since it gives us a more comprehensive view of public thinking on particular subjects. Additionally, it can play a vital role in marketing and customer service [11].
Twitter is among the most popular and widely used microblogging services, where people can create status updates called tweets [12]. As a result, these tweets typically express opinions on various subjects. Over the years, communication and information technologies have profoundly impacted the world. Most of the articles analyzed used the Twitter API to retrieve data from Twitter, while others used premium APIs. According to several articles, there is a strong association between Twitter users' chances of influencing others and their chances of being influenced, and most users maintain emotional stability in both roles [13]. Twitter provides social network functionality to determine whether users are exposed to environments in the online social realm that influence their feelings. The models developed to learn both the influencing and influenced emotional tendencies of users, and to provide observations, are based on accurate social network data. Each tweet made by a user is subjected to an emotional analysis to assess its polarity, i.e., whether it is positive or negative [14].
Interpersonal interactions, communication patterns, social debates, and political debates have been changed by the latest media and technology [15]. Media and communication scholars, sociologists, international-relations scholars, and political scientists have studied hundreds of different facets of social media use. Social computing is an inventive and developing computing model for analyzing and modeling the social actions and events happening on different platforms [16]. It also creates interactive and intelligent applications to achieve effective results. Social media gives individuals a place to offer their views or sentiments on specific events, issues, and products, and it is instrumental to break down these casual, unstructured data to reach conclusions in different areas. However, the largely formless format of these data, accessible on the web, makes the mining procedure challenging. With the development of weblogs and social networking sites, numerous organizations and data providers place their advertisements on many websites and blogs [17]. Today, all over the world, numerous data groups share their advertisements in the form of short messages on micro-blogging services; Twitter is an example [18]. When these short messages are managed and processed, they can yield a substantial amount of information relevant to numerous areas of social research. Finally, the system classified these short messages into thirteen different categories, selected to cover critical areas of sentimental analysis [19].
Several data portals are currently available for the retrieval of short text. The Twitter micro-blog was chosen for collecting short messages for the four reasons below.
  • Various individuals use concise posts to express their views on various subjects, making them credible sources of opinion.
  • The number of text posts on Twitter grows each day, so an arbitrarily large collection can be gathered.
  • Twitter's audience and its regular users range from company representatives and celebrities to politicians and even countries' presidents. Therefore, text posts capture users from dissimilar social and interest events/groups.
  • Twitter's audience is made up of users from numerous countries.
With the day-to-day growth and expansion of machine learning methods, numerous researchers tend to use machine learning techniques for the classification of data and text. Machine learning analysis enables the extraction of emotional responses from tweet data to characterize netizens' attitudes about cryptocurrency [20,21,22]. This technique is theoretically significant because it explains how emotion theory is quantified through machine learning and provides insight into the social consequences of cryptocurrency value fluctuation. There are two main types of machine learning methods: supervised learning, in which the training data are provided with labels supplied by the user, and unsupervised learning, in which the structure of the data is discovered, for example through clustering, by examining the dataset itself [23,24,25,26,27,28,29]. For the current study, supervised learning techniques are used, since the thirteen categories do not change regularly. To classify Twitter data as short text using machine learning techniques, a suitable dataset is needed from which features can be extracted from these short messages. Once the dataset is created, it is important to find appropriate methods for preprocessing and classifying the short messages. Within the Filtered Classifier (FC), LibLINEAR (LL) was used to classify the data, as it is capable of handling a massive dataset.
The main steps that the proposed research takes are listed below:
  • Firstly, the dataset was collected through the Twitter API, and pure and compound sentiments were classified based on Twitter text.
  • Secondly, text-based short messages are classified into thirteen major pure and compound sentimental attributes: Not_Relevant, Neutral, Happy, Surprise, Happy_Surprise, Sad, Happy_Sad, Angry, Sad_Angry, Disgust, Sad_Disgust, Angry_Disgust, and Sad_Angry_Disgust.
  • Thirdly, the study discussed the sentimental classification through feature selection and classification techniques based on the cryptocurrency-related Twitter textual dataset using a single feature selection technique and a tetrad of feature selection techniques deployment schemes along with an LL classifier.
The following are some of the study’s unique features:
  • The presented research is novel in that it provides a technique comprised of a tetrad of feature selection techniques; it produces more refined, concise, and presentable outcomes. Rarely are multiple feature selection techniques employed in published works.
  • To determine the optimal deployment scheme for the classification of the gained dataset, WEKA, a popular tool among researchers for its use in traditional and innovative classification algorithms for the given dataset, is chosen. However, it is employed for sentimental analysis through novel natural language processing techniques in this case, which is uncommon in the literature.
Hereunder are the objectives of the research:
  • The research aids a diverse audience (i.e., readers, Twitter users, cryptocurrency traders, monitoring agencies, and policymakers) by providing a global picture of public sentiments and their reactions to various cryptocurrencies’ volatile and unpredictable values.
The following is the intended study's essential contribution:
  • Essentially, the study provides a technique to the cryptocurrency relevant group (i.e., readers, Twitter users, cryptocurrency traders, monitoring agencies, and policymakers) for assessing public sentiments at any given time for any given cryptocurrency, which can aid in the planning and development of future strategies for the investment in cryptocurrencies.
The paper is organized as follows: Related Research is presented in Section 2, Materials and Methods in Section 3, and Experimental Results and Performance Analysis in Section 4; Section 5 comprises the Discussion, and the Conclusion is given in Section 6.

2. Related Research

In this section, a literature review is conducted to shed light on attempts made by various scholars to better comprehend emotional analysis as a contribution to text mining and categorization associated with cryptocurrencies. Modeling and predicting people's sentiments about cryptocurrency-related matters can assist investors in modifying and developing their investment policies. The proposed idea of cryptocurrency investor sentiment analysis and classification through cryptocurrency-related text on Twitter using straight and tetrad configurations has proven to be a great source of guidance. Several pieces of research related to natural language processing, sentimental analysis, and machine learning-based algorithms highlight their applications in various fields.
In one study, the authors recommended a technique for classifying Twitter-generated student data into different groupings to identify students' numerous problems. The authors presented a logical methodology for shaping the feelings shared across numerous dissimilar social media programs. They also analyzed the text and data using complex annotation, grammar, semantic networks, and vocabulary acquisition, and fundamental techniques of text classification and data collection were presented [30].
Another study presented an approach for normalizing irrelevant and immaterial tweets so that they could be classified according to polarity, such as positive or negative. In addition, the authors used mixed-model approaches to generate dissimilar emotional words, which were then used as key signs in the classification model. They also introduced a new approach for predicting opinions regarding stock markets using numerous financial communication boards and made an automatic projection for the stock market [31].
Scientists have described social computing as an inventive and developing computing model for analyzing the social actions and events happening on different platforms [32]. The exactness of the classification procedure on a chosen dataset is verified using a range of performance parameters; accuracy, recall, precision, F1-score, confusion matrix, log-loss, and ROC area are some of the most popular metrics. Authors have scrutinized the performance of various classifiers, such as Sequential Minimal Optimization, Random Forest, Naïve Bayes, and Support Vector Machine, for classifying Twitter data [33].
Another research experiment was performed using Naïve Bayes, based on different individual parameters taken from a Facebook dataset, to predict an individual's personality [34]. These characteristics are in the form of English language words and are based upon the categories in the Linguistic Inquiry and Word Count (LIWC), such as different programs or plans, activity records, structural networks, and some other important personal information. The whole analysis was performed using the Waikato Environment for Knowledge Analysis (WEKA) [35].
Text-based sentimental analysis is a well-known technique for better understanding and expressing individual opinions, feelings, and thoughts, since individuals typically express emotions, moods, feelings, thoughts, and reactions in subjective text [36]. The critical challenge in sentimental analysis is that most real-world data are shapeless and unstructured. Hence, in recent years, various studies have attempted to extract significant and valuable information from these types of unstructured, shapeless datasets [37].

3. Materials and Methods

3.1. System Specifications

Experiments were conducted on a Lenovo mobile workstation equipped with a 10th Generation Intel Core i7 processor, Windows 10 Pro 64-bit, 32 GB DDR4 memory, a 512 GB SSD, and NVIDIA RTX A3000 graphics. WEKA 3.8.4 was used for the experimentation and results of the proposed scheme.

3.2. Dataset Development

The Twitter Application Programming Interface (API) is used to collect tweets as raw data, and the whole procedure is explained in Figure 1. There are various Twitter scraping APIs available, each with a somewhat different set of capabilities, including the widely used Twitter API for tweet retrieval. To utilize the Twitter API, a user must first apply for developer access. Apart from scraping tweets, the Twitter API is capable of performing a variety of other functions, and it imposes several limits based on the account tier. The Twitter API allows for data collection based on keywords, hashtags, dates, and locations. It was used with keywords and hashtags to acquire a dataset of English tweets expressing people's thoughts regarding the value volatility of cryptocurrencies during a specific time period. The collected data are critical for machine learning since they help the model recognize patterns, make judgments, and perform other tasks. For supervised learning, a dataset with labels mapped to the input features is required. The data include the following attributes: TweetID, ReTweetCount, TweetText, TweetLanguage, TweetSource, and UserID. The text-based sentimental classification based on a Twitter dataset is applied to the text of short messages from the Twitter micro-blog, so it is necessary to collect short Twitter messages. Twitter imposed a character limit that restricted a single brief message to only 140 characters; as a result, the end-user is compelled to convey information using few words or short sentences, which makes the words of these tiny messages usable as keywords. The Twitter API offers the capability to retrieve such tiny messages for a specific retriever in the XML file format. These text-based short messages are classified into thirteen diverse sentimental attributes: Not_Relevant, Neutral, Happy, Surprise, Happy_Surprise, Sad, Happy_Sad, Angry, Sad_Angry, Disgust, Sad_Disgust, Angry_Disgust, and Sad_Angry_Disgust.
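For illustration only, the collection step can be sketched with the Tweepy library's Twitter API v2 client; the bearer token placeholder, the query keywords, and the requested fields below are assumptions for this sketch rather than the exact configuration used in this study, which retrieved tweets in XML format.

```python
# Hypothetical collection sketch using Tweepy's v2 search endpoint.
import csv
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder credential

# Keyword/hashtag query restricted to English tweets, excluding retweets.
query = "(bitcoin OR ethereum OR #crypto OR #cryptocurrency) lang:en -is:retweet"

response = client.search_recent_tweets(
    query=query,
    max_results=100,
    tweet_fields=["lang", "source", "public_metrics", "author_id"],  # availability varies by tier
)

with open("crypto_tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["TweetID", "ReTweetCount", "TweetText", "TweetLanguage", "TweetSource", "UserID"])
    for t in response.data or []:
        writer.writerow([t.id, t.public_metrics["retweet_count"], t.text, t.lang, t.source, t.author_id])
```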
In machine learning, there are essentially two types of learning processes. The first is supervised learning, and the second is unsupervised learning. In supervised learning, the developer supplies a labeled dataset to the system to train it. Unsupervised learning is a technique in which the system discovers patterns on its own from the dataset. Regarding the present situation, a supervised mode of learning is much more pertinent due to the versatility of the dataset. The selected field data frequencies are depicted in Figure 2. So, the values have been selected as cut-off lower and upper values that maximize efficiency.
Sentiment analysis is performed on tweets, which are unstructured writings containing slang terms, acronyms, and orthographic errors. They must be converted to a proper format by preprocessing procedures for the machine learning model to assess the texts and deliver trustworthy, high-accuracy outputs. Thus, preprocessing is a critical stage in natural language processing and consists of multiple stages depending on the language’s nature and the analysis’s objective. Due to cryptocurrencies’ unique and dynamic character, researching tweets about them presents more significant hurdles than analyzing tweets about other well-known conventional currencies. The difficulties include spelling inconsistencies and a broad range of currencies that differs from regular currencies, which is essential for identifying text properties. Due to the high number of inherited irregularities, natural language processing lacks robust methods and resources for extracting cryptocurrency-related attitudes from the text.
The following steps illustrate how the dataset was preprocessed, as clarified in Figure 3 (a code sketch follows the list):
  • By manually removing extraneous tweets that contained advertisements or were unrelated to the topic of cryptocurrencies, the first dataset was reduced to 3085 tweets.
  • Elimination of non-English letters
  • Eliminating emoticons, symbols, numerals, and the hashtag sign.
  • URLs and user mentions are being removed.
  • Removing punctuation.
  • Removing repeated characters.
  • Removing stop words.
  • Applying tokenization is a process that separates the text into smaller units called tokens.
  • Applying normalization, i.e., the unification of certain characters having many forms.
  • Applying Lemmatization to reduce words to their source.
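A minimal Python sketch of such a preprocessing pipeline is given below; it uses NLTK's stopword list, tokenizer, and WordNet lemmatizer as stand-ins for whatever resources were actually used (assuming the corresponding NLTK data have been downloaded), and the example tweet is invented.

```python
# Illustrative preprocessing pipeline approximating the steps listed above.
import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(tweet):
    text = re.sub(r"http\S+|www\.\S+", " ", tweet)   # remove URLs
    text = re.sub(r"@\w+", " ", text)                # remove user mentions
    text = text.replace("#", " ")                    # drop the hashtag sign, keep the word
    text = re.sub(r"[^A-Za-z\s]", " ", text)         # keep English letters only (drops emojis, numerals)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)       # squeeze repeated characters
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = word_tokenize(text.lower())             # tokenization and normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    return [LEMMATIZER.lemmatize(t) for t in tokens]       # lemmatization

print(preprocess("Loving #Bitcoin!!! soooo much @friend https://example.com"))
```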
To carry out the experiments in this research, the tweets of the dataset were labeled (Not_Relevant, Neutral, Happy, Surprise, Happy_Surprise, Sad, Happy_Sad, Angry, Sad_Angry, Disgust, Sad_Disgust, Angry_Disgust, and Sad_Angry_Disgust) at the manual annotation stage and divided into the following parts:
  • Not_Relevant: If the tweet expressed no relevant information about cryptocurrencies, it was labeled Not_Relevant.
  • Neutral: If the tweet expressed no sentiment and agreement about cryptocurrencies, it was labeled Neutral.
  • Happy: If the tweet communicated positive sentiment and agreement about cryptocurrencies, it was labeled Happy.
  • Surprise: If the tweet communicated shocking sentiment and agreement about cryptocurrencies, it was labeled Surprise.
  • Happy_Surprise: If the tweet communicated amazement, positive sentiments, and agreement about cryptocurrencies, it was labelled Happy_Surprise.
  • Sad: If the tweet expressed down sentiment and partial agreement about cryptocurrencies, it was labeled as Sad.
  • Happy_Sad: If the tweet stated amalgamated positive, down sentiments and partial agreement about cryptocurrencies, it was labeled Happy_Sad.
  • Angry: If the tweet stated fuming sentiment and discrepancy about cryptocurrencies, it was labeled as Angry.
  • Sad_Angry: If the tweet stated fused down negative sentiments and discrepancies about cryptocurrencies, it was labeled as Sad_Angry.
  • Disgust: If the tweet stated repulsion sentiment and discrepancy about cryptocurrencies, it was labeled as Disgust.
  • Sad_Disgust: If the tweet communicated merged down negative sentiments and discrepancies about cryptocurrencies, it was labeled Sad_Disgust.
  • Angry_Disgust: If the tweet conveyed merged annoyed, repulsion sentiments, and discrepancy about cryptocurrencies, it was labeled Angry_Disgust.
  • Sad_Angry_Disgust: If the tweet voiced mixed negative sentiments and discrepancies about cryptocurrencies, it was labeled Sad_Angry_Disgust.
The training set included a dataset in which each input was associated with the correct label, enabling the model to learn and improve through training.
The testing set comprised newly discovered data distinct from the training data; the constructed model predicts the label. The predictions were then compared to the actual labels to evaluate and compute the model’s accuracy.
The dataset was collected from 1 December 2021 to 31 December 2021 and consisted of 3085 tweets, as shown in Table 1.

3.3. Features Extraction and Selection

Machine learning algorithms cannot operate directly on natural language text, because raw text cannot be computed on numerically. As a result, text input is converted into numerical vectors that the algorithms can process and operate on, using feature extraction techniques (tweet-level filters).

3.3.1. Tweet to Sparse Feature Vector

It retrieves sparse features such as word and character n-grams from tweets (see the sketch after this list). There are options for excluding rare features (for example, n-grams appearing in fewer than m tweets) and adjusting the weighting mechanism (Boolean or frequency-based).
  • The word n-grams function extracts words from n = 1 to a specified highest value.
  • Negation handling appends a prefix to words that appear in negated contexts when building the n-gram features; the negated scope ends with the next punctuation expression ([.|,|:|;|!|-]+).
  • Character n-grams extract features from sequences of characters of the specified lengths.
  • Part-of-speech (PoS) tags are produced with the Carnegie Mellon University (CMU) Tweet Natural Language Processing (NLP) tool, which generates a vector space model based on the sequence of PoS tags.
  • Brown clusters convert the words in a tweet to Brown word clusters, resulting in a low-dimensional vector space model. It applies to n-grams of word clusters.
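The scikit-learn sketch below illustrates the same idea of word and character n-gram extraction with Boolean weighting and a rare-feature cutoff; it is a conceptual stand-in rather than the WEKA filter used in this study, and the example tweets are invented.

```python
# Word and character n-gram extraction standing in for the sparse tweet-level features.
from sklearn.feature_extraction.text import CountVectorizer

tweets = ["bitcoin is pumping again", "ethereum gas fees are not happy news"]

# Word n-grams from n = 1 up to a chosen maximum (here 2); min_df drops rare n-grams.
word_ngrams = CountVectorizer(analyzer="word", ngram_range=(1, 2), min_df=1, binary=True)
# Character n-grams restricted to word boundaries.
char_ngrams = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4), binary=True)

Xw = word_ngrams.fit_transform(tweets)   # sparse Boolean word n-gram matrix
Xc = char_ngrams.fit_transform(tweets)   # sparse Boolean character n-gram matrix
print(Xw.shape, Xc.shape)
```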

3.3.2. Tweet to Lexicon Feature Vector

It calculates features from a tweet using several lexicons (a toy sketch follows the list):
  • Multi-Perspective Question Answering (MPQA) sums up the amount of positive and negative words in the MPQA subjectivity lexicon.
  • Bing Liu (BL) keeps track of the positive and negative terms in the BL lexicon.
  • Finn Årup Nielsen (AFINN) derives positive and negative variables from the positive and negative word scores provided by AFINN’s lexicon.
  • Sentiment140 produces positive and negative factors by averaging the positive and negative word scores offered by this lexicon developed from emoticon-annotated tweets.
  • National Research Council’s (NRC) Hashtag Sentiment lexicon calculates positive and negative variables by aggregating positive and negative word values derived from tweets marked with emotional hashtags.
  • NRC Lexicon of Word-Emotion Association counts the number of words corresponding to each emotion in this lexicon.
  • NRC-10 Expanded includes the emotion associations for the terms that fit the NRC Word-Emotion Association Lexicon’s Twitter-specific expansion.
  • NRC Hashtag Emotion Association Lexicon augments this lexicon by including the emotion connections for the terms that match.
  • SentiWordNet uses SentiWordNet to determine positive and negative scores. It computes a weighted average of the synsets’ sentiment distributions for words that appear in several synsets.
  • Emoticons generate a positive and negative score based on the word associations associated with a collection of emoticons. The list was compiled as part of the AFINN project.
  • Negation is a metric that indicates the quantity of negating terms in a tweet.
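A toy sketch of lexicon-based feature computation is given below; the word list is invented for illustration and does not reproduce any of the lexicons named above.

```python
# Count positive and negative matches and sum word scores against a toy lexicon.
TOY_LEXICON = {"gain": 1, "happy": 1, "moon": 1, "crash": -1, "scam": -1, "loss": -1}

def lexicon_features(tokens):
    pos = sum(1 for t in tokens if TOY_LEXICON.get(t, 0) > 0)   # positive word count
    neg = sum(1 for t in tokens if TOY_LEXICON.get(t, 0) < 0)   # negative word count
    score = sum(TOY_LEXICON.get(t, 0) for t in tokens)          # summed word scores
    return {"positive_count": pos, "negative_count": neg, "lexicon_score": score}

print(lexicon_features("bitcoin crash is a scam but i am happy".split()))
```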

3.3.3. Tweet to Input Lexicon Feature Vector

It extracts information from a tweet by utilizing a predefined set of affective lexicons, each represented by an ARFF file. The characteristics are computed by summing or counting the emotive associations associated with the words in the specified lexicons. Each lexicon’s numeric and nominal properties are considered, the numerical scores are added, and the nominal scores are counted. By default, the NRC-Affect-Intensity lexicon is utilized.

3.3.4. Tweet to Sentiment Strength Feature Vector

It utilizes SentiStrength to determine the positive and negative sentiment strengths of a tweet.

3.3.5. Tweet to Embeddings Feature Vector

It generates a feature representation at the tweet level using pre-trained word embeddings. A dummy word embedding composed of zeroes is utilized for words that do not have a corresponding embedding. The following approaches can be used to calculate the tweet vectors (a sketch follows the list):
  • Average word embeddings.
  • Add word embeddings.
  • Concatenation of the first k embeddings. Dummy values are added if the tweet has fewer than k words.
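The sketch below illustrates the three aggregation modes with an invented toy embedding table; real pre-trained embeddings would replace TOY_EMBEDDINGS.

```python
# Tweet-level embedding features: average, add, or concatenate the first k word vectors,
# using zero vectors for out-of-vocabulary words.
import numpy as np

DIM = 4  # toy embedding size; pre-trained embeddings are usually far larger
TOY_EMBEDDINGS = {"bitcoin": np.ones(DIM), "happy": np.full(DIM, 0.5)}

def tweet_vector(tokens, mode="average", k=3):
    vecs = [TOY_EMBEDDINGS.get(t, np.zeros(DIM)) for t in tokens]  # dummy zeros for unknown words
    if mode == "average":
        return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)
    if mode == "add":
        return np.sum(vecs, axis=0) if vecs else np.zeros(DIM)
    # "concat": first k embeddings, padded with zero vectors if the tweet is shorter
    vecs = (vecs + [np.zeros(DIM)] * k)[:k]
    return np.concatenate(vecs)

print(tweet_vector(["bitcoin", "happy", "unknownword"], mode="concat"))
```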

3.4. Sentimental Analysis Using Single and Tetrad Feature Selection Techniques

Sentimental classification is a technique used in supervised machine learning that uses labeled training data to learn and then predict the test data for which the model does not know the real class labels. This model is then validated using the true positive rate (TPR), false positive rate (FPR), precision, recall, F-measure, Matthews correlation coefficient (MCC), receiver operating characteristic (ROC) area, and precision–recall curve (PRC) area. Numerous classification models are available, including decision trees, naive Bayes, support vector machines, k-nearest neighbor, and rule-based classifiers. In this paper, LL was employed from FC as a classification model to automatically categorize a Twitter dataset containing fine-grained emotion. LL is a free, open-source program well-suited to training large-scale problems. Experiments were conducted to determine the appropriate feature set and classifiers using the data mining software WEKA (Waikato Environment for Knowledge Analysis).
WEKA is a collection of numerous machine learning algorithms written in the Java programming language. It is an excellent tool for pre-processing data, classification, clustering, regression, visualization, and feature selection, and it is free software distributed under the GNU General Public License. There are numerous classic classifiers used for classification in the realm of machine learning algorithms; given the enormous number of classifiers available, it can be challenging to select the optimal classifier for a specific task. Classification performance is determined by the number and type of features and by the classifier used, including the configuration of the specific classifier. As illustrated in Figure 4, the WEKA tool is a tremendous aid in selecting them optimally. One of the study's primary objectives is to categorize pure and compound feelings using text regarding cryptocurrency-related topics from Twitter, which is frequently utilized in our conversations for soliciting opinions, providing feedback, and responding to any conversation.

3.4.1. Filtered Classifiers

WEKA, an open-source tool, provides a collection of machine learning algorithms. In the WEKA tool, the dataset is loaded first; the filtered classifiers are found under the meta classifier group.
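As a conceptual analogue only, the sketch below chains a feature-extraction filter and a LIBLINEAR-backed linear classifier into a single scikit-learn pipeline, mirroring the filtered-classifier idea of fitting the filter on training data only; the TF-IDF filter and parameter values are placeholders, not the WEKA filters used in this study.

```python
# Conceptual analogue of a filtered classifier: the filter and the classifier are
# chained so that the filter is fitted on the training texts only.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

filtered_classifier = Pipeline([
    ("filter", TfidfVectorizer(ngram_range=(1, 2))),  # stand-in for the tweet-level filters
    ("classifier", LinearSVC(C=1.0)),                  # LinearSVC wraps LIBLINEAR
])

# Typical usage (train_texts, train_labels, test_texts are assumed to exist):
# filtered_classifier.fit(train_texts, train_labels)
# predictions = filtered_classifier.predict(test_texts)
```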

LibLINEAR

LibLINEAR is an open-source software library that effectively classifies enormous sparse datasets with a high number of attributes and instances. For large linear binary and multiclass classification, it offers L2-regularized logistic regression and L2-loss and L1-loss linear SVMs. The "L2-regularized, L2-loss support vector classification" variant of SVM is used in LL, which offers efficient linear classification methods, whereas LibSVM generates non-linear SVMs. Both use SVMs, which WEKA already offers as the SMO approach. The distinction is that LL is usually quicker than SMO (and can optionally minimize the sum of absolute values of errors rather than the sum of squared errors), whereas LibSVM is significantly more versatile. Using different kernels, SVMs may be used to build many types of non-linear decision boundaries, and the effect can be studied using WEKA's boundary visualizer. They benefit immensely from parameter optimization, which may be accomplished with WEKA's grid-search meta-classifier.
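For reference, the specific setting named above, the L2-regularized, L2-loss (squared hinge) SVM solved in the dual, can be expressed with scikit-learn's LinearSVC, which wraps LIBLINEAR; the parameter values shown are illustrative defaults, not the configuration used in the experiments.

```python
# Hedged sketch: mapping the LL setting described above onto the LIBLINEAR-backed LinearSVC.
from sklearn.svm import LinearSVC

clf = LinearSVC(
    penalty="l2",          # L2 regularization
    loss="squared_hinge",  # L2-loss (squared hinge) SVM
    dual=True,             # solve the dual problem, as in the dual SVM named above
    C=1.0,                 # plays the role of the penalty parameter D in the formulation below
    max_iter=10000,
)
```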
Large linear classification (binary or multiclass) can be performed through LibLINEAR, which supports two popular binary linear classifiers: the linear support vector machine (SVM) and logistic regression (LR). Given a set of instances with label pairs $(v_k, w_k)$, where

$$v_k \in \mathbb{R}^{l}, \quad w_k \in \{-1, +1\}, \quad k = 1, \dots, s,$$

both methods solve the following unconstrained optimization problem with different loss functions $\xi(t; v_k, w_k)$:

$$\min_{t} \;\; \frac{1}{2} t^{T} t + D \sum_{k=1}^{s} \xi(t; v_k, w_k), \qquad D > 0$$
In this case, $D$ denotes a penalty parameter. The two most often used loss functions in SVM are

$$\max\!\left(1 - w_k t^{T} v_k,\; 0\right)$$

and

$$\max\!\left(1 - w_k t^{T} v_k,\; 0\right)^{2}.$$

The former is known as L1-SVM, while the latter is known as L2-SVM. The loss function for LR is as follows:

$$\log\!\left(1 + e^{-w_k t^{T} v_k}\right)$$
It is calculated using a probabilistic model. In some circumstances, the classifier's discriminant function contains a bias factor, $m$. LibLINEAR implements this term by adding an extra feature to the vector $t$ and to each instance $v_k$:

$$t^{T} \leftarrow [\,t^{T},\; m\,], \qquad v_k^{T} \leftarrow [\,v_k^{T},\; M\,],$$

where $M$ is a user-specified constant. For L1-SVM and L2-SVM, a coordinate descent algorithm is used, and LibLINEAR includes a trust-region Newton algorithm for LR and L2-SVM. During the testing phase, it predicts a data point $v$ as positive if $t^{T} v > 0$, and negative otherwise. For multiclass problems, it incorporates the one-vs-the-rest approach and the Crammer and Singer method.
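As a small numerical check, the three loss functions above can be evaluated for one invented instance and weight vector; the numbers are illustrative only.

```python
# Toy evaluation of the L1-SVM, L2-SVM, and LR losses for a single (v, w) pair.
import numpy as np

t = np.array([0.5, -0.25])   # weight vector t
v = np.array([1.0, 2.0])     # instance v_k
w = -1                       # label w_k in {-1, +1}

margin = w * t.dot(v)                  # w_k * t^T v_k = -1 * 0.0 = 0.0
l1_svm_loss = max(1 - margin, 0)       # hinge loss          -> 1.0
l2_svm_loss = max(1 - margin, 0) ** 2  # squared hinge loss  -> 1.0
lr_loss = np.log(1 + np.exp(-margin))  # logistic loss       -> log(2), about 0.693
print(l1_svm_loss, l2_svm_loss, lr_loss)
```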

4. Experimental Results and Performance Analysis

This section discusses the experimentation conducted in this research using straight configuration (single feature selection technique), describing the experimental setup in Table 2 and the outcomes of the model used in all trials.
Classification accuracy (CA) is the ratio of correctly classified data to all input data. The confusion matrix (CM) generates a matrix that describes the model’s performance. Table 3 represents the evaluation of the test split using a single feature selection technique in combination with LL, and its graphical representation is shown in Figure 5.
The information given below illustrates the four critical terms of the confusion matrix: true positives (TP) are the number of tweets predicted as positive that are actually positive; true negatives (TN) are the number of tweets predicted as negative that are actually negative; false positives (FP) are the number of tweets predicted as positive that are actually negative; and false negatives (FN) are the number of tweets predicted as negative that are actually positive.
Accuracy is determined by dividing the number of correctly classified instances by the total number of classified instances; in the trials, a model's performance is evaluated using this accuracy metric. Along with accuracy, precision and recall (which account for the classifier's false positives and false negatives, respectively) and the F1-score (their weighted harmonic average) were calculated.
In contrast, the Matthews correlation coefficient (MCC) is a more precise statistical measure that only yields a high score if the prediction did well in each of the four categories of the confusion matrix (true positives, false negatives, true negatives, and false positives), proportionally to the volume of both positive and negative elements in the input data. Many academics believe that the most intuitive performance statistic is the ratio of correctly classified samples to total samples, referred to as accuracy, and by definition it also works when there are more than two labels (the multiclass case). However, when the dataset is unbalanced (the number of samples in one class is significantly greater than the number of samples in the other classes), accuracy becomes unreliable since it delivers an overoptimistic estimate of the classifier's skill on the majority class. The MCC provides an excellent approach for resolving the class imbalance issue.
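For illustration, the measures discussed above can be computed with scikit-learn as sketched below; the label sequences are invented toy data, not results from this study.

```python
# Toy computation of accuracy, MCC, weighted precision/recall/F1, and the confusion matrix.
from sklearn.metrics import (accuracy_score, confusion_matrix, matthews_corrcoef,
                             precision_recall_fscore_support)

y_true = ["Happy", "Neutral", "Sad", "Happy", "Neutral", "Angry"]
y_pred = ["Happy", "Neutral", "Happy", "Happy", "Sad", "Angry"]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))   # robust under class imbalance
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print("Weighted precision/recall/F1:", precision, recall, f1)
print(confusion_matrix(y_true, y_pred, labels=["Angry", "Happy", "Neutral", "Sad"]))
```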
A receiver operating characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate (100-specificity) at various parameter cut-off points. Each point on the ROC curve corresponds to a pair of sensitivity/specificity values associated with a specific decision threshold. The ROC Area measures a parameter’s ability to discriminate across distinct classes.
Precision–recall curves (PRC) are frequently employed in binary classification to analyze a classifier’s output. It is essential to binarize the output to extend the precision–recall curve and average precision to multiclass or multi-label classification. While one curve can be produced for each label, a precision–recall curve can also be drawn by treating each label’s indication matrix element as a binary prediction (micro-averaging).
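A minimal sketch of the micro-averaging idea described above follows; the class names and per-class scores are invented for illustration.

```python
# Micro-averaged precision-recall: binarize the multiclass labels and treat every
# element of the indicator matrix as one binary prediction.
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import average_precision_score, precision_recall_curve

classes = ["Angry", "Happy", "Neutral"]
y_true = label_binarize(["Happy", "Neutral", "Angry", "Happy"], classes=classes)
y_score = np.array([[0.1, 0.8, 0.1],    # toy per-class decision scores
                    [0.2, 0.3, 0.5],
                    [0.6, 0.2, 0.2],
                    [0.3, 0.4, 0.3]])

precision, recall, _ = precision_recall_curve(y_true.ravel(), y_score.ravel())
print("Micro-averaged average precision:",
      average_precision_score(y_true, y_score, average="micro"))
```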
Table 4 represents the precise accuracy by class using a single feature selection technique combined with LL, and its graphical representation is shown in Figure 6.
Table 5 represents the confusion matrix by class using a single feature selection technique in combination with LL, and its graphical representation is shown in Figure 7.
This section covers the experiments conducted in this research using tetrad configuration (four feature selection procedures), describing the experimental setup in Table 6 and the findings of the model used in all the tests.
Table 7 represents the evaluation of the split test using the tetrad of feature selection technique in combination with LL, and its graphical representation is shown in Figure 8.
Table 8 represents the precise accuracy by class using the tetrad of feature selection technique in combination with LL, and its graphical representation is shown in Figure 9.
Table 9 represents the confusion matrix using the tetrad of feature selection technique in combination with LL, and its graphical representation is shown in Figure 10.
Table 10 represents the evaluation accuracy by class using the tetrad of feature selection technique in combination with LL, and its graphical representation is shown in Figure 11.
Table 11 presents the cost/benefit assessment summary for the 1085 instances dataset achieved by experiment measurements, and its graphical representation is shown in Figure 12.
The first part of Table 12 shows the threshold plot curve as the ratio between the sample size (X) and the true positive rate (Y) from 0 to 1. The second part shows the cost/benefit curve between the sample size (X) and the cost/benefit (Y) clusters with an acceptable accuracy rate. In the last column, the cost curve represents the ratio between the probability cost function (X) and the normalized expected cost (Y).

5. Discussion

As evident from the results presented in Table 7, Table 8 and Table 9, the performance of LL with the tetrad of feature selection techniques outperforms the combination with the single feature selection technique, exhibited in Table 3, Table 4 and Table 5, for cryptocurrency-related text obtained from Twitter. Increasing the structural complexity of machine learning and deep learning models improves sentiment classification performance. To begin, incorporating the Tweet to lexicon feature vector, Tweet to input lexicon feature vector, Tweet to SentiStrength feature vector, and Tweet to embedding feature vector modules enhances the classification performance, particularly for textual inputs. In the overall evaluation summary presented in Table 10, the parameters relative absolute error, root relative squared error, coverage of cases (0.95 level), and mean relative region size (0.95 level) for LL with the tetrad of feature selection techniques were significantly lower than those for the single feature selection technique, especially for configuration C in the first case and configuration B in the other case with textual inputs on the same dataset. The TTTR, TTTE, CCI, ICI, KS, MAS, and RMSE performance measures shown in Figure 8 for LL with the tetrad of feature selection techniques using textual inputs were generally more significant than those for the textual inputs with the single feature selection technique, as depicted in Figure 5. The results indicate that the tetrad of feature selection techniques with LL was more effective than the other configuration in the experiment.
This is a quantitative analysis of public sentiments obtained through tweets on Twitter in the aftermath of trembling cryptocurrency rates. The analysis was performed on public sentiments toward various cryptocurrencies using data from Twitter. Despite the negative consequences of the shaking rates of cryptocurrencies, it was found that public sentiments were more positive than negative. Although most tweets are deemed pure sentiments rather than hybrid sentiments, neutral and happy sentiments are predominant among all sentimental categories. Additionally, it is reassuring that angry sentiments outweigh happy sentiments in none of the cases. The findings of these analyses can be used to better understand Twitter users' perceptions of their cryptocurrency-related decisions and the policies adopted by specific coins and governmental institutions. The current discoveries provide a baseline for measuring public discourse on cryptocurrency matters. A performance comparison of the proposed text classification models with existing models is presented in Table 13.

6. Conclusions and Future Work

This paper analyzed the impact of text pre-processing techniques combined with the text classification process, and the study shows the importance of the different stages in the classification process. It compared two configurations, one with a single feature selection technique and the second with a tetrad of feature selection techniques, to perform text-based sentimental analysis using LL from the filtered classifiers. Sentimental classification based on cryptocurrency-related text from Twitter is an emergent area that needs more attention. A dataset about cryptocurrencies was collected through the Twitter API and classified with precision into thirteen primary pure and compound sentimental attributes.
In this research, thirteen pure and compound sentimental attributes were identified, and the given dataset was labeled with them. A single feature selection technique-based configuration was compared with a tetrad of feature selection techniques that quickly purified the dataset, after which LL classified the given text-based dataset into the most common pure and compound sentimental attributes. Through the daily growth and expansion of feature selection methods, numerous researchers tend to use these techniques to classify text-based datasets. As evident from the results, the tetrad of feature selection techniques is considered the best configuration compared to the single feature selection technique because it yields optimized outcomes for all performance measures. The results also demonstrated that LibLINEAR is computationally efficient and achieves the best performance.
Classifying tweets into pure and hybrid sentiments to fully comprehend and reveal the sentiment of tweets without labeling them is another intriguing future direction. The size of the dataset and the time period over which it was collected are the constraints on this exploration. It would also be interesting to have data spanning a more extended time period to observe how sentiments change over time.

Funding

The author extends his appreciation to the Deanship of Scientific Research at Jouf University for funding this work through research grant No. DSR-2021-02-0207.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is available from the corresponding author upon request.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Baur, D.G.; Dimpfl, T. The volatility of Bitcoin and its role as a medium of exchange and a store of value. Empir. Econ. 2021, 61, 2663–2683.
  2. Ranasinghe, H.; Halgamuge, M.N. Twitter sentiment data analysis of user behavior on cryptocurrencies: Bitcoin and ethereum. In Analyzing Global Social Media Consumption; IGI Global: Hershey, PA, USA, 2021; pp. 277–291.
  3. Minutolo, M.C.; Kristjanpoller, W.; Dheeriya, P. Impact of COVID-19 effective reproductive rate on cryptocurrency. Financ. Innov. 2022, 8, 1–27.
  4. Hassan, M.K.; Hudaefi, F.A.; Caraka, R.E. Mining netizen’s opinion on cryptocurrency: Sentiment analysis of Twitter data. Stud. Econ. Financ. 2021, 39, 365–385.
  5. Köhler, S. Sustainable Blockchain Technologies: An Assessment of Social and Environmental Impacts of Blockchain-Based Technologies; Aalborg Universitetsforlag: Aalborg, Denmark, 2021.
  6. Ghosh, J. The blockchain: Opportunities for research in information systems and information technology. J. Glob. Inf. Technol. Manag. 2019, 22, 235–242.
  7. Guo, Y.-M.; Huang, Z.-L.; Guo, J.; Guo, X.-R.; Li, H.; Liu, M.-Y.; Ezzeddine, S.; Nkeli, M.J. A bibliometric analysis and visualization of blockchain. Future Gener. Comput. Syst. 2021, 116, 316–332.
  8. Chen, E.; Lerman, K.; Ferrara, E. Tracking social media discourse about the covid-19 pandemic: Development of a public coronavirus twitter data set. JMIR Public Health Surveill. 2020, 6, e19273.
  9. Boon-Itt, S.; Skunkan, Y. Public perception of the COVID-19 pandemic on Twitter: Sentiment analysis and topic modeling study. JMIR Public Health Surveill. 2020, 6, e21978.
  10. Renz, S.M.; Carrington, J.M.; Badger, T.A. Two strategies for qualitative content analysis: An intramethod approach to triangulation. Qual. Health Res. 2018, 28, 824–831.
  11. Bhattacharya, S.; Sarkar, D.; Kole, D.K.; Jana, P. Recent trends in recommendation systems and sentiment analysis. In Advanced Data Mining Tools and Methods for Social Computing; Academic Press: New York, NY, USA, 2022; pp. 163–175. ISBN 9780323857086.
  12. Rodrigues, A.P.; Chiplunkar, N.N. A new big data approach for topic classification and sentiment analysis of Twitter data. Evol. Intell. 2019, 15, 877–887.
  13. Xiong, X.; Li, Y.; Qiao, S.; Han, N.; Wu, Y.; Peng, J.; Li, B. An emotional contagion model for heterogeneous social media with multiple behaviors. Phys. A Stat. Mech. Appl. 2018, 490, 185–202.
  14. Valle-Cruz, D.; Fernandez-Cortez, V.; López-Chau, A.; Sandoval-Almazán, R. Does twitter affect stock market decisions? financial sentiment analysis during pandemics: A comparative study of the h1n1 and the covid-19 periods. Cognit. Comput. 2021, 14, 372–387.
  15. Campbell, H.A.; Evolvi, G. Contextualizing current digital religion research on emerging technologies. Hum. Behav. Emerg. Technol. 2020, 2, 5–17.
  16. Li, Z.; Huang, Q.; Emrich, C.T. Introduction to social sensing and big data computing for disaster management. Int. J. Digit. Earth 2019, 12, 1198–1204.
  17. Wirtz, J.G.; Zimbres, T.M. A systematic analysis of research applying ‘principles of dialogic communication’ to organizational websites, blogs, and social media: Implications for theory and practice. J. Public Relat. Res. 2018, 30, 5–34.
  18. Arumugam, S. Development of argument based opinion mining model with sentimental data analysis from twitter content. Concurr. Comput. Pract. Exp. 2022, 34, e6956.
  19. Hrazi, M.M.; Althagafi, A.M.; Aljuhani, A.T.; Rahman, J.; Rahman, M.M.; Shorfuzzaman, M. Sentiment Analysis of Tweets from Airlines in the Gulf Region Using Machine Learning. In Proceedings of the 2021 International Conference of Women in Data Science at Taif University (WiDSTaif), Taif, Saudi Arabia, 30–31 March 2021; pp. 1–6.
  20. Yadav, J.; Misra, M.; Rana, N.P.; Singh, K.; Goundar, S. Netizens’ behavior towards a blockchain-based esports framework: A TPB and machine learning integrated approach. Int. J. Sports Mark. Spons. 2021; ahead-of-print.
  21. Hasan, T.; Ahmad, F.; Rizwan, M.; Alshammari, N.; Alanazi, S.A.; Hussain, I.; Naseem, S. Edge Caching in Fog-Based Sensor Networks through Deep Learning-Associated Quantum Computing Framework. Comput. Intell. Neurosci. 2022, 2022, 6138434.
  22. Shabbir, M.; Ahmad, F.; Shabbir, A.; Alanazi, S.A. Cognitively managed multi-level authentication for security using Fuzzy Logic based Quantum Key Distribution. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 1468–1485.
  23. Mehmood, M.; Ayub, E.; Ahmad, F.; Alruwaili, M.; Alrowaili, Z.A.; Alanazi, S.; Rizwan, M.H.M.; Naseem, S.; Alyas, T. Machine learning enabled early detection of breast cancer by structural analysis of mammograms. Comput. Mater. Contin. 2021, 67, 641–657.
  24. Shahzadi, S.; Ahmad, F.; Basharat, A.; Alruwaili, M.; Alanazi, S.; Humayun, M.; Rizwan, M.; Naseem, S. Machine learning empowered security management and quality of service provision in SDN-NFV environment. Comput. Mater. Contin. 2021, 66, 2723–2749.
  25. Alanazi, S.A.; Alruwaili, M.; Ahmad, F.; Alaerjan, A.; Alshammari, N. Estimation of Organizational Competitiveness by a Hybrid of One-Dimensional Convolutional Neural Networks and Self-Organizing Maps Using Physiological Signals for Emotional Analysis of Employees. Sensors 2021, 21, 3760.
  26. Mehmood, M.; Alshammari, N.; Alanazi, S.A.; Ahmad, F. Systematic Framework to Predict Early-Stage Liver Carcinoma Using Hybrid of Feature Selection Techniques and Regression Techniques. Complexity 2022, 2022, 7816200.
  27. Khan, W.A.; Ahmad, F.; Alanazi, S.A.; Hasan, T.; Naseem, S.; Nisar, K.S. Trust identification through cognitive correlates with emphasizing attention in cloud robotics. Egypt. Inform. J. 2022, 23, 259–269.
  28. Orangi-Fard, N.; Akhbardeh, A.; Sagreiya, H. Predictive Model for ICU Readmission Based on Discharge Summaries Using Machine Learning and Natural Language Processing. In Proceedings of the Informatics, Kowloon, Hongkong, 7–15 August 2022; p. 10.
  29. Mehmood, M.; Alshammari, N.; Alanazi, S.A.; Basharat, A.; Ahmad, F.; Sajjad, M.; Junaid, K. Improved Colorization and Classification of Intracranial Tumor Expanse in MRI Images via Hybrid Scheme of Pix2Pix-cGANs and NASNet-Large. J. King Saud Univ.-Comput. Inf. Sci. 2022; in press.
  30. Wang, D.; Su, J.; Yu, H. Feature extraction and analysis of natural language processing for deep learning english language. IEEE Access 2020, 8, 46335–46345.
  31. Eke, C.I.; Norman, A.A.; Shuib, L.; Nweke, H.F. Sarcasm identification in textual data: Systematic review, research challenges and open directions. Artif. Intell. Rev. 2020, 53, 4215–4258.
  32. Rahman, A.; Saleem, N.; Shabbir, A.; Shabbir, M.; Rizwan, M.; Naseem, S.; Ahmad, F. ANFIS based hybrid approach identifying correlation between decision making and online social networks. EAI Endorsed Trans. Scalable Inf. Syst. 2021, 8, e4.
  33. Ghoshal, S.; Bruckman, A. The role of social computing technologies in grassroots movement building. ACM Trans. Comput. Hum. Interact. 2019, 26, 1–36.
  34. Başaran, S.; Ejimogu, O.H. A neural network approach for predicting personality from Facebook data. SAGE Open 2021, 11, 21582440211032156.
  35. Giuntini, F.T.; Cazzolato, M.T.; dos Reis, M.d.J.D.; Campbell, A.T.; Traina, A.J.; Ueyama, J. A review on recognizing depression in social networks: Challenges and opportunities. J. Ambient. Intell. Humaniz. Comput. 2020, 11, 4713–4729.
  36. Wankhade, M.; Rao, A.C.S.; Kulkarni, C. A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 2022, 1–50.
  37. Fischer, C.; Pardos, Z.A.; Baker, R.S.; Williams, J.J.; Smyth, P.; Yu, R.; Slater, S.; Baker, R.; Warschauer, M. Mining big data in education: Affordances and challenges. Rev. Res. Educ. 2020, 44, 130–160.
  38. Sohail, M.N.; Jiadong, R.; Uba, M.M.; Irshad, M.; Iqbal, W.; Arshad, J.; John, A.V. A hybrid Forecast Cost Benefit Classification of diabetes mellitus prevalence based on epidemiological study on Real-life patient’s data. Sci. Rep. 2019, 9, 1–10.
  39. Prasetijo, A.B.; Isnanto, R.R.; Eridani, D.; Soetrisno, Y.A.A.; Arfan, M.; Sofwan, A. Hoax detection system on Indonesian news sites based on text classification using SVM and SGD. In Proceedings of the 2017 4th International Conference on Information Technology, Computer, and Electrical Engineering (ICITACEE), Semarang, Indonesia, 18–19 October 2017; pp. 45–49.
Figure 1. Twitter API access procedure.
Figure 2. Frequencies ranges of Twitter text.
Figure 3. Tweets preprocessing.
Figure 4. Text-based sentimental classification.
Figure 5. Graphical representation of evaluation on split test (single feature selection technique).
Figure 6. Detailed accuracy by class (single feature selection technique).
Figure 7. Graphical representation of confusion matrix (single feature selection technique).
Figure 8. Graphical representation of evaluation on split test (tetrad feature selection techniques).
Figure 9. Graphical representation of detailed accuracy by class (tetrad feature selection techniques).
Figure 10. Graphical representation of confusion matrix (tetrad feature selection techniques).
Figure 11. Graphical representation of evaluation accuracy (tetrad feature selection techniques).
Figure 12. Graphical representation of cost/benefit (tetrad feature selection techniques).
Table 1. Tweets distribution in sentimental classes.

No. | Sentiment Class | Instances | No. | Sentiment Class | Instances
1 | Not_Relevant | 214 | 8 | Angry | 57
2 | Neutral | 1572 | 9 | Sad_Angry | 2
3 | Happy | 1137 | 10 | Disgust | 6
4 | Surprise | 35 | 11 | Sad_Disgust | 2
5 | Happy_Surprise | 11 | 12 | Angry_Disgust | 7
6 | Sad | 32 | 13 | Sad_Angry_Disgust | 1
7 | Happy_Sad | 9 | | Total Instances | 3085
Table 2. Run information (single feature selection technique).
 | Configuration A | Configuration B | Configuration C | Configuration D
Instances | 3085 | 3085 | 3085 | 3085
Attributes | 44 | 44 | 44 | 44
Training Split | 66% | 70% | 80% | 90%
Testing Split | 34% | 30% | 20% | 10%
Preprocess | WEKA (configuration screenshot)
Classification | WEKA (configuration screenshot)
Table 3. Evaluation of test split (single feature selection technique).
 | Configuration A | Configuration B | Configuration C | Configuration D
Time Taken for Training (TTTR) | 4.34 s | 2.01 s | 1.99 s | 1.98 s
Time Taken for Testing (TTTE) | 1.31 s | 0.01 s | 0.02 s | 0.02 s
Correctly Classified Instances (CCI) | 769 (73.313%) | 684 (73.953%) | 448 (72.618%) | 220 (71.427%)
Incorrectly Classified Instances (ICI) | 280 (26.705%) | 241 (26.056%) | 169 (27.397%) | 88 (28.5714%)
Kappa Statistic | 0.518 | 0.5313 | 0.5145 | 0.4863
Mean Absolute Error (MAE) | 0.0411 | 0.0401 | 0.0421 | 0.044
Root Mean Squared Error (RMSE) | 0.2026 | 0.2002 | 0.2053 | 0.2097
Total Number of Instances | 1049 | 925 | 617 | 308
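As an illustrative cross-check of Table 3, the accuracy and error-rate percentages follow directly from the correctly and incorrectly classified counts; for Configuration A:
\[
\text{Accuracy} = \frac{\text{CCI}}{\text{CCI}+\text{ICI}} = \frac{769}{769+280} \approx 73.3\%,
\qquad
\text{Error rate} = \frac{\text{ICI}}{\text{CCI}+\text{ICI}} = \frac{280}{1049} \approx 26.7\%.
\]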
Table 4. Detailed accuracy by class (single feature selection technique).
Class | TP Rate | FP Rate | Precision | Recall | F-Measure | MCC | ROC Area | PRC Area
Neutral | 0.875 | 0.354 | 0.720 | 0.875 | 0.790 | 0.536 | 0.760 | 0.694
Happy | 0.728 | 0.124 | 0.775 | 0.728 | 0.751 | 0.612 | 0.802 | 0.665
Not_Relevant | 0.203 | 0.002 | 0.875 | 0.203 | 0.329 | 0.406 | 0.600 | 0.230
Angry | 0.182 | 0.003 | 0.571 | 0.182 | 0.276 | 0.315 | 0.589 | 0.121
Angry_Disgust | 0.000 | 0.000 | - | 0.000 | - | - | 0.500 | 0.003
Disgust | - | 0.004 | 0.000 | - | - | - | - | -
Happy_Surprise | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 | −0.002 | 0.500 | 0.005
Sad | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | −0.005 | 0.499 | 0.010
Surprise | 0.000 | 0.000 | - | 0.000 | - | - | 0.500 | 0.011
Happy_Sad | 0.000 | 0.002 | 0.000 | 0.000 | 0.000 | −0.002 | 0.499 | 0.003
Sad_Disgust | - | 0.001 | 0.000 | - | - | - | - | -
Sad_Angry | 0.000 | 0.000 | - | 0.000 | - | - | 0.500 | 0.001
Sad_Angry_Disgust | - | 0.000 | - | - | - | - | - | -
Weighted Average | 0.733 | 0.227 | - | 0.733 | - | - | 0.753 | 0.618
Table 5. Confusion matrix by class (single feature selection technique).
a | b | c | d | e | f | g | h | i | j | k | l | m | Classified as
468 | 59 | 2 | 1 | 0 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | a = Neutral
103 | 283 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | b = Happy
45 | 8 | 14 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | c = Not_Relevant
15 | 3 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | d = Angry
1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | e = Angry_Disgust
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | f = Disgust
1 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | g = Happy_Surprise
5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | h = Sad
11 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | i = Surprise
0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | j = Happy_Sad
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | k = Sad_Disgust
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | l = Sad_Angry
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | m = Sad_Angry_Disgust
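As a consistency check, the per-class figures in Table 4 can be recovered from the confusion matrix in Table 5. For the Neutral class (row/column a), using the row total of 535 actual Neutral instances and the column total of 650 Neutral predictions:
\[
P = \frac{468}{650} \approx 0.720, \qquad
R = \frac{468}{535} \approx 0.875, \qquad
F_1 = \frac{2PR}{P+R} \approx 0.790,
\]
which matches the Neutral row of Table 4.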
Table 6. Run information (tetrad feature selection techniques).
 | Configuration A | Configuration B | Configuration C | Configuration D
Instances | 3085 | 3085 | 3085 | 3085
Attributes | 44 | 44 | 44 | 44
Training Split | 66% | 70% | 80% | 90%
Testing Split | 34% | 30% | 20% | 10%
Preprocess | WEKA (configuration screenshots)
Classification | WEKA (configuration screenshot)
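For readers who wish to reproduce a run such as the one summarized in Table 6, the following Java sketch shows one way to wire the tetrad of tweet filters into a FilteredClassifier with LibLINEAR using the WEKA API. It is a minimal illustration rather than the study's exact code: it assumes the AffectiveTweets and LibLINEAR WEKA packages are installed (which provide the filter and classifier classes named below), the dataset path is a placeholder, and filter options (e.g., the index of the tweet-text attribute) are left at their defaults.

// Minimal, illustrative sketch: tetrad of AffectiveTweets filters inside a
// FilteredClassifier with LibLINEAR, evaluated on a 66%/34% percentage split
// analogous to Configuration A.
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.LibLINEAR;                 // WEKA LibLINEAR package
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.unsupervised.attribute.TweetToEmbeddingsFeatureVector;
import weka.filters.unsupervised.attribute.TweetToInputLexiconFeatureVector;
import weka.filters.unsupervised.attribute.TweetToLexiconFeatureVector;
import weka.filters.unsupervised.attribute.TweetToSentiStrengthFeatureVector;

public class TetradLibLinearSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file of labelled cryptocurrency tweets; the sentiment
        // class label is assumed to be the last attribute.
        Instances data = DataSource.read("cryptocurrency_tweets.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Tetrad of feature selection/extraction filters applied in sequence.
        MultiFilter tetrad = new MultiFilter();
        tetrad.setFilters(new Filter[] {
            new TweetToLexiconFeatureVector(),
            new TweetToInputLexiconFeatureVector(),
            new TweetToSentiStrengthFeatureVector(),
            new TweetToEmbeddingsFeatureVector()
        });

        // Filtered classifier: the filters are fitted on the training data only
        // and re-applied to the test data before LibLINEAR classifies it.
        FilteredClassifier model = new FilteredClassifier();
        model.setFilter(tetrad);
        model.setClassifier(new LibLINEAR());

        // 66% training / 34% testing percentage split.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        model.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);

        System.out.println(eval.toSummaryString());
        System.out.println("Kappa statistic: " + eval.kappa());
    }
}

The same pipeline can equally be assembled in the WEKA Explorer by selecting the four filters inside a FilteredClassifier, which is the style of setup the run-information tables summarize.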
Table 7. Evaluation of test split (tetrad feature selection techniques).
 | Configuration A | Configuration B | Configuration C | Configuration D
Time Taken for Training (TTTR) | 18.06 s | 16.47 s | 13.64 s | 11.81 s
Time Taken for Testing (TTTE) | 18.97 s | 5.42 s | 4.37 s | 2.67 s
Correctly Classified Instances (CCI) | 1014 (96.739%) | 896 (96.865%) | 601 (97.475%) | 450 (97.322%)
Incorrectly Classified Instances (ICI) | 35 (3.260%) | 29 (3.135%) | 16 (2.528%) | 13 (2.678%)
Kappa Statistic | 0.5842 | 0.5875 | 0.6019 | 0.6008
Mean Absolute Error (MAE) | 0.0358 | 0.0356 | 0.0347 | 0.0349
Root Mean Squared Error (RMSE) | 0.1892 | 0.1887 | 0.1862 | 0.1868
Total Number of Instances | 1049 | 925 | 617 | 463
Table 8. Detailed accuracy by class (tetrad feature selection techniques).
Class | TP Rate | FP Rate | Precision | Recall | F-Measure | MCC | ROC Area | PRC Area
Neutral | 0.907 | 0.289 | 0.751 | 0.907 | 0.822 | 0.629 | 0.809 | 0.726
Happy | 0.809 | 0.105 | 0.827 | 0.809 | 0.818 | 0.707 | 0.852 | 0.742
Not_Relevant | 0.233 | 0.000 | 1.000 | 0.233 | 0.377 | 0.469 | 0.616 | 0.286
Angry | 0.231 | 0.005 | 0.500 | 0.231 | 0.316 | 0.330 | 0.613 | 0.132
Angry_Disgust | 0.000 | 0.002 | 0.000 | 0.000 | 0.000 | −0.003 | 0.499 | 0.005
Disgust | - | 0.002 | 0.000 | - | - | - | - | -
Happy_Surprise | 0.000 | 0.002 | 0.000 | 0.000 | 0.000 | −0.003 | 0.499 | 0.005
Sad | 0.000 | 0.000 | - | 0.000 | - | - | 0.500 | 0.011
Surprise | 0.000 | 0.000 | - | 0.000 | - | - | 0.500 | 0.010
Happy_Sad | 0.000 | 0.002 | 0.000 | 0.000 | 0.000 | −0.003 | 0.499 | 0.005
Sad_Disgust | - | 0.002 | 0.000 | - | - | - | - | -
Sad_Angry | 0.000 | 0.000 | - | 0.000 | - | - | 0.500 | 0.002
Sad_Angry_Disgust | - | 0.000 | - | - | - | - | - | -
Weighted Average | 0.775 | 0.182 | - | 0.775 | - | - | 0.797 | 0.662
Table 9. Confusion matrix (tetrad configuration).
a | b | c | d | e | f | g | h | i | j | k | l | m | Classified as
274 | 26 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | a = Neutral
44 | 191 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | b = Happy
28 | 4 | 10 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | c = Not_Relevant
9 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | d = Angry
0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | e = Angry_Disgust
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | f = Disgust
2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | g = Happy_Surprise
3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | h = Sad
4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | i = Surprise
0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | j = Happy_Sad
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | k = Sad_Disgust
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | l = Sad_Angry
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | m = Sad_Angry_Disgust
Table 10. Evaluation summary (tetrad feature selection techniques).
 | Configuration A | Configuration B | Configuration C | Configuration D
Relative Absolute Error | 0.387187 | 0.384864 | 0.372557 | 0.374826
Root Relative Squared Error | 0.382413 | 0.378699 | 0.359832 | 0.362554
Coverage of Cases (0.95 Level) | 0.767398 | 0.768649 | 0.774716 | 0.773218
Mean Relative Region Size (0.95 Level) | 0.76923 | 0.76923 | 0.76923 | 0.76923
Total Number of Instances | 1049 | 925 | 617 | 463
Table 11. Cost/benefit summary (tetrad of feature selection techniques).
Class | Cost/Benefit | Random | Gain | Percentage of Accuracy | Percentage of Population | Percentage of Target
Neutral | 119 | 309.69 | ±190.69 | 80.71% | 59.16% | 90.72%
Happy | 85 | 290.29 | ±205.29 | 86.22% | 37.43% | 80.93%
Not_Relevant | 33 | 51.61 | ±18.61 | 94.65% | 1.62% | 23.25%
Angry | 13 | 18.75 | ±5.75 | 97.89% | 0.97% | 23.07%
Angry_Disgust | 3 | 3 | 0 | 99.51% | 0% | 0%
Disgust | 0 | 0 | 0 | 100% | 0% | NaN
Happy_Surprise | 3 | 3 | 0 | 99.51% | 0% | 0%
Sad | 7 | 7 | 0 | 98.86% | 0% | 0%
Surprise | 6 | 6 | 0 | 99.03% | 0% | 0%
Happy_Sad | 3 | 3 | 0 | 99.51% | 0% | 0%
Sad_Disgust | 0 | 0 | 0 | 100% | 0% | NaN
Sad_Angry | 1 | 1 | 0 | 99.83% | 0% | 0%
Sad_Angry_Disgust | 0 | 0 | 0 | 100% | 0% | NaN
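As read from Table 11, the Gain column appears to be the difference between the Random and Cost/Benefit columns; for example:
\[
\text{Gain}_{\text{Neutral}} = 309.69 - 119 = 190.69, \qquad
\text{Gain}_{\text{Happy}} = 290.29 - 85 = 205.29.
\]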
Table 12. Cost/benefit summary for pure and compound sentiments (tetrad of feature selection techniques). For each class (Neutral, Happy, Not_Relevant, Angry, Angry_Disgust, Disgust, Happy_Surprise, Sad, Surprise, Happy_Sad, Sad_Disgust, Sad_Angry, Sad_Angry_Disgust), the table presents three plots: the threshold curve (sample size (X) vs. true positive rate (Y)), the cost/benefit curve (sample size (X) vs. cost/benefit (Y)), and the cost curve (probability cost function (X) vs. normalized expected cost (Y)).
Table 13. Performance comparison of the anticipated text classification models with existing models.
Studies | Classifier | Time Taken | Accuracy | Error Rate | Kappa Stats
Sohail et al. [38] | J48 consolidation | 0.15 | 0.9893 | 0.01 | 0.97
Sohail et al. [38] | J48 graft | 0.02 | 0.9893 | 0.01 | 0.97
Sohail et al. [38] | Hoeffding Tree | 0.11 | 0.9110 | 0.13 | 0.78
Anticipated (cryptocurrency-related sentiment classification) | LibLINEAR (single feature selection technique) | 2.01 | 0.7395 | 0.2002 | 0.5315
Anticipated (cryptocurrency-related sentiment classification) | LibLINEAR (tetrad feature selection techniques) | 13.64 | 0.9747 | 0.1862 | 0.6019

Studies | Classifier | AUC | CA | F1-Score | Precision | Recall
Prasetijo et al. [39] | SVM | - | 0.784 | - | 0.548 | 1.00
Alanazi et al. [25] | ODCNN | 0.990 | 0.998 | - | - | -
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
