Article

Robust Sentimental Class Prediction Based on Cryptocurrency-Related Tweets Using Tetrad of Feature Selection Techniques in Combination with Filtered Classifier

by
Saad Awadh Alanazi
Department of Computer Science, College of Computer and Information Sciences, Jouf University, Sakaka 72341, Saudi Arabia
Submission received: 13 May 2022 / Revised: 12 June 2022 / Accepted: 13 June 2022 / Published: 15 June 2022
(This article belongs to the Special Issue Natural Language Processing: Recent Development and Applications)

Abstract

Individuals' mental feelings and reactions are becoming more significant, as they help researchers, domain experts, businesses, companies, and other individuals understand people's overall responses in specific situations or circumstances. Every pure and compound sentiment can be classified using a dataset, which can take the form of Twitter text posted by various Twitter users. Twitter is one of the vital platforms for individuals to participate and share their ideas about different topics; it is also considered one of the most famous and largest micro-blogging websites on the Internet. One of the key purposes of this study is to classify pure and compound sentiments based on text related to cryptocurrencies, an innovative form of trading that is flourishing daily. The cryptocurrency market experiences many fluctuations in coin values, and a small piece of positive or negative news can sway the whole outlook for a specific cryptocurrency. In this paper, individuals' pure and compound sentiments based on cryptocurrency-related Twitter text are classified. The dataset is collected through the Twitter API. In WEKA, two deployment schemes are compared: the first uses a single feature selection technique (Tweet to lexicon feature vector), and the second uses a tetrad of feature selection techniques (Tweet to lexicon feature vector, Tweet to input lexicon feature vector, Tweet to SentiStrength feature vector, and Tweet to embedding feature vector) to purify the data before it is passed to the LibLINEAR (LL) classifier, which contains fast algorithms for linear classification using L2-regularized, L2-loss support vector machines (dual SVM). The LL classifier differs in that it can optionally minimize the sum of the absolute values of errors rather than the sum of the squared errors and is typically much faster. Based on the overall performance parameters, the deployment scheme containing the tetrad of feature selection techniques with the LL classifier is considered the best choice for classification. Among machine learning techniques, LL produces effective results and gives an efficient performance compared to other prevailing techniques. The findings of this research would be beneficial for Twitter users as well as cryptocurrency traders.

1. Introduction

Cryptocurrency is a digital or virtual currency created to function as a means of trade and secured using a blockchain; it is also comparable to conventional electronic money. Bitcoin's price as a virtual currency has risen dramatically during the last several months. The motivation for inventing Bitcoin and any subsequent virtual agreements is to address perceived shortcomings in exchanging money between parties. Bitcoin is the most well-known cryptocurrency, as it piques everyone's interest in the subject of encryption due to its rapid growth and status as the de facto standard for cryptocurrencies. While there are several cryptocurrencies, Bitcoin and Ethereum are the market leaders, and they serve distinct functions. Bitcoin was founded as a replacement for fiat currency; it serves as a means of exchange and value storage, whereas Ethereum is being developed as a platform for promoting peer-to-peer contracts and applications via its currency tools [1,2].
Despite their bubble values, cryptocurrencies have been transacted and have grown to be a significant investment in the Decentralized Finance (DeFi) sector. Binance reports that the global cryptocurrency market capitalization reached USD 1.3 trillion at the time of this study. The argument about cryptocurrencies has raged on among authorities, specialists, and academics globally. Bitcoin had the most incredible market valuation in the recent DeFi market, followed by Ethereum, Tether (USDT), and Binance Coin [3]. The virtual world’s public perception (e.g., people’s tweets) impacts the price fluctuation of cryptocurrencies. For instance, Elon Musk’s tweets in mid-2021 led Bitcoin prices to vary substantially, and then reports of a future US government effort to control digital assets may have spurred Bitcoin’s current price decline [4,5].
Scholarly articles on cryptocurrency have grown rapidly in volume, covering a variety of related issues, including Blockchain, Bitcoin, Ethereum, network security, encryption, exchanges, and electronic money. However, existing studies have primarily examined cryptocurrencies within computer science [6]. Thus, sentiment analysis studies that use tweet data to understand the social consequences of cryptocurrencies through the perspective of emotion theory remain limited. Sentiment analysis can gauge the public's reaction to such dynamic fluctuations in cryptocurrency values. This examination is crucial for comprehending the social ramifications of the phenomenon, which emphasizes the study's significance [7].
The first stage in big data analysis is collecting data; this is called "data mining." These records may come from any source, and there are numerous data sources from which a massive amount of data can be acquired. Twitter is an excellent source for data science. It is a free social networking platform that enables users to stream tweets, which are short messages. Users broadcast tweets for various reasons, including pride, attention, enjoyment, boredom, assistance, and the desire to become famous. Most users use Twitter for recreational purposes, sending messages to the world and ensuring that ideas are distributed within communities. Unlike on other social media networks, users' tweets are completely public and searchable. The public's perspective on, and perception of, various subjects can be acquired through Twitter data. The Twitter API enables applications to retrieve data; however, certain restrictions are involved [8,9].
Sentiment analysis evaluates the emotional tone of a collection of words to comprehend the perspectives, thoughts, and feelings expressed in online references [10]. It is extremely advantageous for social media monitoring since it gives us a more comprehensive view of public thinking on particular subjects. Additionally, it can play a vital role in marketing and customer service [11].
Twitter is among the most popular and widely used microblogging services, where people can create status updates called tweets [12]. As a result, these tweets typically express opinions on various subjects. Over the years, communication and information technologies have profoundly impacted the world. Most of the articles analyzed used the Twitter API to retrieve data from Twitter, while others used premium APIs. According to several articles, there is a strong association between Twitter users' chances of influencing others and their chances of being influenced, and most users maintain emotional stability in both roles [13]. Twitter provides social network functionality to determine whether users are exposed to environments in the online social realm that influence their feelings. The models developed to learn both the influencing and influenced emotional tendencies of users, and to provide observations, are based on accurate social network data. Each tweet made by a user is subjected to an emotional analysis to assess its polarity, i.e., whether it is positive or negative [14].
Interpersonal interactions, communication patterns, social debates, and political debates have been changed by the latest media and technology [15]. Media and communication scholars, sociologists, international-relations scholars, and political scientists have studied hundreds of different facets of social media use. Social computing is an inventive and developing computing model for analyzing and modeling the social actions and events happening on different platforms [16]. It also creates interactive and intelligent applications to achieve effective results. Social media gives individuals a place to offer their views or sentiments on specific events, issues, and products, and it is instrumental to break down these casual, unstructured data to reach conclusions in different areas. However, the largely formless format of these data, accessible on the web, makes the mining procedure challenging. With the development of weblogs and social networking sites, numerous organizations and data providers place their advertisements on many websites and blogs [17]. Today, all over the world, numerous data groups share their advertisements in the form of short messages on micro-blogging services; Twitter is an example [18]. When these short messages are managed and processed, they can yield a substantial amount of information relevant to numerous areas of social research. Finally, the system classified these short messages into thirteen different categories, selected to cover critical areas of sentimental analysis [19].
Several data portals are currently available for the retrieval of short text. The Twitter micro-blog was chosen for collecting short messages for the four reasons below.
  • Various individuals use concise posts to express their views on various subjects, making them credible sources of opinion.
  • The number of text posts on Twitter grows each day, so an arbitrarily large collection can be gathered.
  • Twitter's audience and its regular users range from company representatives and celebrities to politicians and even countries' presidents. Therefore, text posts capture users from dissimilar social and interest events/groups.
  • Twitter's audience is made up of users from numerous countries.
With the day-to-day growth and expansion of machine learning methods, numerous researchers tend to use machine learning techniques for the classification of data and text. Machine learning analysis enables the extraction of emotional responses from tweet data to characterize netizens' attitudes about cryptocurrency [20,21,22]. This technique is theoretically significant because it explains how emotion theory is quantified through machine learning and provides insight into the social consequences of cryptocurrency value fluctuation. There are two main types of machine learning methods: supervised learning, in which the training data are provided with labels supplied by the user, and unsupervised learning, in which the structure of the data is discovered, for example through clustering, by examining the dataset itself [23,24,25,26,27,28,29]. For the current study, supervised learning techniques are used, since the thirteen categories do not change regularly. To classify Twitter data as short text using machine learning techniques, a suitable dataset is needed from which features can be extracted from these short messages. Once the dataset is created, it is important to find appropriate methods for preprocessing and classifying the short messages. Within the Filtered Classifier (FC), LibLINEAR (LL) was used to classify the data, as it is capable of handling a massive dataset.
The main steps that the proposed research takes are listed below:
  • Firstly, the dataset was collected through the Twitter API, and pure and compound sentiments were classified based on Twitter text.
  • Secondly, text-based short messages are classified into thirteen major pure and compound sentimental attributes: Not_Relevant, Neutral, Happy, Surprise, Happy_Surprise, Sad, Happy_Sad, Angry, Sad_Angry, Disgust, Sad_Disgust, Angry_Disgust, and Sad_Angry_Disgust.
  • Thirdly, the study discussed the sentimental classification through feature selection and classification techniques based on the cryptocurrency-related Twitter textual dataset using a single feature selection technique and a tetrad of feature selection techniques deployment schemes along with an LL classifier.
The following are some of the study’s unique features:
  • The presented research is novel in that it provides a technique comprised of a tetrad of feature selection techniques; it produces more refined, concise, and presentable outcomes. Rarely are multiple feature selection techniques employed in published works.
  • To determine the optimal deployment scheme for the classification of the gained dataset, WEKA, a popular tool among researchers for its use in traditional and innovative classification algorithms for the given dataset, is chosen. However, it is employed for sentimental analysis through novel natural language processing techniques in this case, which is uncommon in the literature.
Hereunder are the objectives of the research:
  • The research aids a diverse audience (i.e., readers, Twitter users, cryptocurrency traders, monitoring agencies, and policymakers) by providing a global picture of public sentiments and their reactions to various cryptocurrencies’ volatile and unpredictable values.
The following is the intended study's essential contribution:
  • Essentially, the study provides a technique to the cryptocurrency relevant group (i.e., readers, Twitter users, cryptocurrency traders, monitoring agencies, and policymakers) for assessing public sentiments at any given time for any given cryptocurrency, which can aid in the planning and development of future strategies for the investment in cryptocurrencies.
The paper is organized as follows: Related Research is presented in Section 2, Materials and Methods in Section 3, and Experimental Results and Performance Analysis in Section 4; Section 5 comprises the Discussion, and the Conclusion is given in Section 6.

2. Related Research

In this section, a literature review is conducted to shed light on attempts made by various scholars to better comprehend emotional analysis as a contribution to text mining and categorization associated with cryptocurrencies. Modeling and predicting people's sentiments about cryptocurrency-related matters can assist investors in modifying and developing their investment policies. The proposed idea of cryptocurrency investor sentiment analysis and classification through cryptocurrency-related text on Twitter using straight and tetrad configurations has proven to be a great source of guidance. Several pieces of research related to natural language processing, sentimental analysis, and machine learning-based algorithms highlight their applications in various fields.
In one study, the authors recommended a technique for classifying Twitter-generated student data into different groupings to identify students' numerous problems. The authors presented a logical methodology for shaping the feelings shared across numerous dissimilar social media programs. They also analyzed the text and data using complex annotation, grammar, semantic networks, and vocabulary acquisition, and fundamental techniques of text classification and data collection were presented [30].
Another study presented an approach for normalizing irrelevant and immaterial tweets so that they could be classified according to polarity, such as positive or negative. In addition, the authors used mixed-model approaches to generate dissimilar emotional words, which were then used as key signs in the classification model. They also introduced a new approach for predicting opinions regarding stock markets using numerous financial communication boards and made an automatic projection for the stock market [31].
Scientists have described social computing as an inventive and developing computing model for analyzing the social actions and events happening on different platforms [32]. The exactness of the classification procedure on a chosen dataset is verified using a range of performance parameters; accuracy, recall, precision, F1-score, confusion matrix, log-loss, and ROC area are some of the most popular metrics. Authors have scrutinized the performance of various classifiers, such as Sequential Minimal Optimization, Random Forest, Naïve Bayes, and Support Vector Machine, for classifying Twitter data [33].
Another research experiment was performed using Naïve Bayes, based on different individual parameters taken from a Facebook dataset, to predict an individual's personality [34]. These characteristics are in the form of English language words and are based upon the categories in the Linguistic Inquiry and Word Count (LIWC), such as different programs or plans, activity records, structural networks, and some other important personal information. The whole analysis was performed using the Waikato Environment for Knowledge Analysis (WEKA) [35].
Text-based sentimental analysis is a well-known technique for better understanding and expressing individual opinions, feelings, and thoughts, since individuals typically express emotions, moods, feelings, thoughts, and reactions in subjective text [36]. The critical challenge in sentimental analysis is that most real-world data are shapeless and unstructured. Hence, in recent years, various studies have attempted to extract significant and valuable information from these types of unstructured, shapeless datasets [37].

3. Materials and Methods

3.1. System Specifications

Experiments were conducted on a Lenovo mobile workstation equipped with a 10th Generation Intel Core i7 processor, Windows 10 Pro 64-bit, 32 GB DDR4 memory, a 512 GB SSD, and NVIDIA RTX A3000 graphics. WEKA 3.8.4 was used for the experimentation and results of the proposed scheme.

3.2. Dataset Development

The Twitter Application Programming Interface (API) is used to collect tweets as raw data, and the whole procedure is explained in Figure 1. There are various Twitter scraping APIs available, each with a somewhat different set of capabilities, including the widely used Twitter API for tweet retrieval. To utilize the Twitter API, a user must first apply for developer access. Apart from scraping tweets, the Twitter API is capable of performing a variety of other functions, and it imposes several limits based on the account tier. The Twitter API allows for data collection based on keywords, hashtags, dates, and locations. It was used with keywords and hashtags to acquire a dataset of English tweets expressing people's thoughts regarding the value volatility of cryptocurrencies during a specific time period. The collected data are critical for machine learning since they help the model recognize patterns, make judgments, and perform other tasks. For supervised learning, a dataset with labels mapped to the input features is required. The data include the following attributes: TweetID, ReTweetCount, TweetText, TweetLanguage, TweetSource, and UserID. The text-based sentimental classification based on a Twitter dataset is applied to the text of short messages from the Twitter micro-blog, so it is necessary to collect short Twitter messages. Twitter imposed a character limit that restricted a single brief message to only 140 characters; as a result, the end-user is compelled to convey information using few words or short sentences, which makes the words of these tiny messages usable as keywords. The Twitter API offers the capability to retrieve such tiny messages for a specific retriever in the XML file format. These text-based short messages are classified into thirteen diverse sentimental attributes: Not_Relevant, Neutral, Happy, Surprise, Happy_Surprise, Sad, Happy_Sad, Angry, Sad_Angry, Disgust, Sad_Disgust, Angry_Disgust, and Sad_Angry_Disgust.
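For illustration only, the collection step can be sketched with the Tweepy library's Twitter API v2 client; the bearer token placeholder, the query keywords, and the requested fields below are assumptions for this sketch rather than the exact configuration used in this study, which retrieved tweets in XML format.

```python
# Hypothetical collection sketch using Tweepy's v2 search endpoint.
import csv
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder credential

# Keyword/hashtag query restricted to English tweets, excluding retweets.
query = "(bitcoin OR ethereum OR #crypto OR #cryptocurrency) lang:en -is:retweet"

response = client.search_recent_tweets(
    query=query,
    max_results=100,
    tweet_fields=["lang", "source", "public_metrics", "author_id"],  # availability varies by tier
)

with open("crypto_tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["TweetID", "ReTweetCount", "TweetText", "TweetLanguage", "TweetSource", "UserID"])
    for t in response.data or []:
        writer.writerow([t.id, t.public_metrics["retweet_count"], t.text, t.lang, t.source, t.author_id])
```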
In machine learning, there are essentially two types of learning processes. The first is supervised learning, and the second is unsupervised learning. In supervised learning, the developer supplies a labeled dataset to the system to train it. Unsupervised learning is a technique in which the system discovers patterns on its own from the dataset. Regarding the present situation, a supervised mode of learning is much more pertinent due to the versatility of the dataset. The selected field data frequencies are depicted in Figure 2. So, the values have been selected as cut-off lower and upper values that maximize efficiency.
Sentiment analysis is performed on tweets, which are unstructured writings containing slang terms, acronyms, and orthographic errors. They must be converted to a proper format by preprocessing procedures for the machine learning model to assess the texts and deliver trustworthy, high-accuracy outputs. Thus, preprocessing is a critical stage in natural language processing and consists of multiple stages depending on the language’s nature and the analysis’s objective. Due to cryptocurrencies’ unique and dynamic character, researching tweets about them presents more significant hurdles than analyzing tweets about other well-known conventional currencies. The difficulties include spelling inconsistencies and a broad range of currencies that differs from regular currencies, which is essential for identifying text properties. Due to the high number of inherited irregularities, natural language processing lacks robust methods and resources for extracting cryptocurrency-related attitudes from the text.
The following steps illustrate how the dataset was preprocessed, as clarified in Figure 3 (a code sketch follows the list):
  • By manually removing extraneous tweets that contained advertisements or were unrelated to the topic of cryptocurrencies, the first dataset was reduced to 3085 tweets.
  • Elimination of non-English letters
  • Eliminating emoticons, symbols, numerals, and the hashtag sign.
  • URLs and user mentions are being removed.
  • Removing punctuation.
  • Removing repeated characters.
  • Removing stop words.
  • Applying tokenization is a process that separates the text into smaller units called tokens.
  • Applying normalization, i.e., the unification of certain characters having many forms.
  • Applying Lemmatization to reduce words to their source.
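A minimal Python sketch of such a preprocessing pipeline is given below; it uses NLTK's stopword list, tokenizer, and WordNet lemmatizer as stand-ins for whatever resources were actually used (assuming the corresponding NLTK data have been downloaded), and the example tweet is invented.

```python
# Illustrative preprocessing pipeline approximating the steps listed above.
import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(tweet):
    text = re.sub(r"http\S+|www\.\S+", " ", tweet)   # remove URLs
    text = re.sub(r"@\w+", " ", text)                # remove user mentions
    text = text.replace("#", " ")                    # drop the hashtag sign, keep the word
    text = re.sub(r"[^A-Za-z\s]", " ", text)         # keep English letters only (drops emojis, numerals)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)       # squeeze repeated characters
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = word_tokenize(text.lower())             # tokenization and normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    return [LEMMATIZER.lemmatize(t) for t in tokens]       # lemmatization

print(preprocess("Loving #Bitcoin!!! soooo much @friend https://example.com"))
```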
To carry out the experiments in this research, the tweets of the dataset were labeled (Not_Relevant, Neutral, Happy, Surprise, Happy_Surprise, Sad, Happy_Sad, Angry, Sad_Angry, Disgust, Sad_Disgust, Angry_Disgust, and Sad_Angry_Disgust) at the manual annotation stage and divided into the following parts:
  • Not_Relevant: If the tweet expressed no relevant information about cryptocurrencies, it was labeled Not_Relevant.
  • Neutral: If the tweet expressed no sentiment and agreement about cryptocurrencies, it was labeled Neutral.
  • Happy: If the tweet communicated positive sentiment and agreement about cryptocurrencies, it was labeled Happy.
  • Surprise: If the tweet communicated shocking sentiment and agreement about cryptocurrencies, it was labeled Surprise.
  • Happy_Surprise: If the tweet communicated amazement, positive sentiments, and agreement about cryptocurrencies, it was labelled Happy_Surprise.
  • Sad: If the tweet expressed down sentiment and partial agreement about cryptocurrencies, it was labeled as Sad.
  • Happy_Sad: If the tweet stated amalgamated positive, down sentiments and partial agreement about cryptocurrencies, it was labeled Happy_Sad.
  • Angry: If the tweet stated fuming sentiment and discrepancy about cryptocurrencies, it was labeled as Angry.
  • Sad_Angry: If the tweet stated fused down negative sentiments and discrepancies about cryptocurrencies, it was labeled as Sad_Angry.
  • Disgust: If the tweet stated repulsion sentiment and discrepancy about cryptocurrencies, it was labeled as Disgust.
  • Sad_Disgust: If the tweet communicated merged down negative sentiments and discrepancies about cryptocurrencies, it was labeled Sad_Disgust.
  • Angry_Disgust: If the tweet conveyed merged annoyed, repulsion sentiments, and discrepancy about cryptocurrencies, it was labeled Angry_Disgust.
  • Sad_Angry_Disgust: If the tweet voiced mixed negative sentiments and discrepancies about cryptocurrencies, it was labeled Sad_Angry_Disgust.
The training set included a dataset in which each input was associated with the correct label, enabling the model to learn and improve through training.
The testing set comprised newly discovered data distinct from the training data; the constructed model predicts the label. The predictions were then compared to the actual labels to evaluate and compute the model’s accuracy.
The dataset was collected from 1 December 2021 to 31 December 2021 and consisted of 3085 tweets, as shown in Table 1.

3.3. Features Extraction and Selection

Machine learning algorithms cannot operate directly on natural language text, because raw text cannot be computed on numerically. As a result, text input is converted into numerical vectors that the algorithms can process and operate on, using feature extraction techniques (tweet-level filters).

3.3.1. Tweet to Sparse Feature Vector

It retrieves sparse features such as word and character n-grams from tweets (see the sketch after this list). There are options for excluding rare features (for example, n-grams appearing in fewer than m tweets) and adjusting the weighting mechanism (Boolean or frequency-based).
  • The word n-grams function extracts words from n = 1 to a specified highest value.
  • Negation handling appends a prefix to words that appear in negated contexts when building the n-gram features; the negated scope ends with the next punctuation expression ([.|,|:|;|!|-]+).
  • Character n-grams extract features from sequences of characters of the specified lengths.
  • Part-of-speech (PoS) tags are produced with the Carnegie Mellon University (CMU) Tweet Natural Language Processing (NLP) tool, which generates a vector space model based on the sequence of PoS tags.
  • Brown clusters convert the words in a tweet to Brown word clusters, resulting in a low-dimensional vector space model. It applies to n-grams of word clusters.
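The scikit-learn sketch below illustrates the same idea of word and character n-gram extraction with Boolean weighting and a rare-feature cutoff; it is a conceptual stand-in rather than the WEKA filter used in this study, and the example tweets are invented.

```python
# Word and character n-gram extraction standing in for the sparse tweet-level features.
from sklearn.feature_extraction.text import CountVectorizer

tweets = ["bitcoin is pumping again", "ethereum gas fees are not happy news"]

# Word n-grams from n = 1 up to a chosen maximum (here 2); min_df drops rare n-grams.
word_ngrams = CountVectorizer(analyzer="word", ngram_range=(1, 2), min_df=1, binary=True)
# Character n-grams restricted to word boundaries.
char_ngrams = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4), binary=True)

Xw = word_ngrams.fit_transform(tweets)   # sparse Boolean word n-gram matrix
Xc = char_ngrams.fit_transform(tweets)   # sparse Boolean character n-gram matrix
print(Xw.shape, Xc.shape)
```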

3.3.2. Tweet to Lexicon Feature Vector

It calculates features from a tweet using several lexicons (a toy sketch follows the list):
  • Multi-Perspective Question Answering (MPQA) sums up the amount of positive and negative words in the MPQA subjectivity lexicon.
  • Bing Liu (BL) keeps track of the positive and negative terms in the BL lexicon.
  • Finn Årup Nielsen (AFINN) derives positive and negative variables from the positive and negative word scores provided by AFINN’s lexicon.
  • Sentiment140 produces positive and negative factors by averaging the positive and negative word scores offered by this lexicon developed from emoticon-annotated tweets.
  • National Research Council’s (NRC) Hashtag Sentiment lexicon calculates positive and negative variables by aggregating positive and negative word values derived from tweets marked with emotional hashtags.
  • NRC Lexicon of Word-Emotion Association counts the number of words corresponding to each emotion in this lexicon.
  • NRC-10 Expanded includes the emotion associations for the terms that fit the NRC Word-Emotion Association Lexicon’s Twitter-specific expansion.
  • NRC Hashtag Emotion Association Lexicon augments this lexicon by including the emotion connections for the terms that match.
  • SentiWordNet uses SentiWordNet to determine positive and negative scores. It computes a weighted average of the synsets’ sentiment distributions for words that appear in several synsets.
  • Emoticons generate a positive and negative score based on the word associations associated with a collection of emoticons. The list was compiled as part of the AFINN project.
  • Negation is a metric that indicates the quantity of negating terms in a tweet.
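A toy sketch of lexicon-based feature computation is given below; the word list is invented for illustration and does not reproduce any of the lexicons named above.

```python
# Count positive and negative matches and sum word scores against a toy lexicon.
TOY_LEXICON = {"gain": 1, "happy": 1, "moon": 1, "crash": -1, "scam": -1, "loss": -1}

def lexicon_features(tokens):
    pos = sum(1 for t in tokens if TOY_LEXICON.get(t, 0) > 0)   # positive word count
    neg = sum(1 for t in tokens if TOY_LEXICON.get(t, 0) < 0)   # negative word count
    score = sum(TOY_LEXICON.get(t, 0) for t in tokens)          # summed word scores
    return {"positive_count": pos, "negative_count": neg, "lexicon_score": score}

print(lexicon_features("bitcoin crash is a scam but i am happy".split()))
```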

3.3.3. Tweet to Input Lexicon Feature Vector

It extracts information from a tweet by utilizing a predefined set of affective lexicons, each represented by an ARFF file. The characteristics are computed by summing or counting the emotive associations associated with the words in the specified lexicons. Each lexicon’s numeric and nominal properties are considered, the numerical scores are added, and the nominal scores are counted. By default, the NRC-Affect-Intensity lexicon is utilized.

3.3.4. Tweet to Sentiment Strength Feature Vector

It utilizes SentiStrength to determine the positive and negative sentiment strengths of a tweet.

3.3.5. Tweet to Embeddings Feature Vector

It generates a feature representation at the tweet level using pre-trained word embeddings. A dummy word embedding composed of zeroes is utilized for words that do not have a corresponding embedding. The following approaches can be used to calculate the tweet vectors (a sketch follows the list):
  • Average word embeddings.
  • Add word embeddings.
  • Concatenation of the first k embeddings. Dummy values are added if the tweet has fewer than k words.
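The sketch below illustrates the three aggregation modes with an invented toy embedding table; real pre-trained embeddings would replace TOY_EMBEDDINGS.

```python
# Tweet-level embedding features: average, add, or concatenate the first k word vectors,
# using zero vectors for out-of-vocabulary words.
import numpy as np

DIM = 4  # toy embedding size; pre-trained embeddings are usually far larger
TOY_EMBEDDINGS = {"bitcoin": np.ones(DIM), "happy": np.full(DIM, 0.5)}

def tweet_vector(tokens, mode="average", k=3):
    vecs = [TOY_EMBEDDINGS.get(t, np.zeros(DIM)) for t in tokens]  # dummy zeros for unknown words
    if mode == "average":
        return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)
    if mode == "add":
        return np.sum(vecs, axis=0) if vecs else np.zeros(DIM)
    # "concat": first k embeddings, padded with zero vectors if the tweet is shorter
    vecs = (vecs + [np.zeros(DIM)] * k)[:k]
    return np.concatenate(vecs)

print(tweet_vector(["bitcoin", "happy", "unknownword"], mode="concat"))
```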

3.4. Sentimental Analysis Using Single and Tetrad Feature Selection Techniques

Sentimental classification is a technique used in supervised machine learning that uses labeled training data to learn and then predict the test data for which the model does not know the real class labels. This model is then validated using the true positive rate (TPR), false positive rate (FPR), precision, recall, F-measure, Matthews correlation coefficient (MCC), receiver operating characteristic (ROC) area, and precision–recall curve (PRC) area. Numerous classification models are available, including decision trees, naive Bayes, support vector machines, k-nearest neighbor, and rule-based classifiers. In this paper, LL was employed from FC as a classification model to automatically categorize a Twitter dataset containing fine-grained emotion. LL is a free, open-source program well-suited to training large-scale problems. Experiments were conducted to determine the appropriate feature set and classifiers using the data mining software WEKA (Waikato Environment for Knowledge Analysis).
WEKA is a collection of numerous machine learning algorithms written in the Java programming language. It is an excellent tool for pre-processing data, classification, clustering, regression, visualization, and feature selection, and it is free software distributed under the GNU General Public License. There are numerous classic classifiers used for classification in the realm of machine learning algorithms; given the enormous number of classifiers available, it can be challenging to select the optimal classifier for a specific task. Classification performance is determined by the number and type of features and by the classifier used, including the configuration of the specific classifier. As illustrated in Figure 4, the WEKA tool is a tremendous aid in selecting them optimally. One of the study's primary objectives is to categorize pure and compound feelings using text regarding cryptocurrency-related topics from Twitter, which is frequently utilized in our conversations for soliciting opinions, providing feedback, and responding to any conversation.

3.4.1. Filtered Classifiers

WEKA, an open-source tool, provides a collection of machine learning algorithms. In the WEKA tool, the dataset is loaded first; the filtered classifiers are found under the meta classifier group.
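As a conceptual analogue only, the sketch below chains a feature-extraction filter and a LIBLINEAR-backed linear classifier into a single scikit-learn pipeline, mirroring the filtered-classifier idea of fitting the filter on training data only; the TF-IDF filter and parameter values are placeholders, not the WEKA filters used in this study.

```python
# Conceptual analogue of a filtered classifier: the filter and the classifier are
# chained so that the filter is fitted on the training texts only.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

filtered_classifier = Pipeline([
    ("filter", TfidfVectorizer(ngram_range=(1, 2))),  # stand-in for the tweet-level filters
    ("classifier", LinearSVC(C=1.0)),                  # LinearSVC wraps LIBLINEAR
])

# Typical usage (train_texts, train_labels, test_texts are assumed to exist):
# filtered_classifier.fit(train_texts, train_labels)
# predictions = filtered_classifier.predict(test_texts)
```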

LibLINEAR

LibLINEAR is an open-source software library that effectively classifies enormous sparse datasets with a high number of attributes and instances. For large linear binary and multiclass classification, it offers L2-regularized logistic regression and L2-loss and L1-loss linear SVMs. The "L2-regularized, L2-loss support vector classification" variant of SVM is used in LL, which offers efficient linear classification methods, whereas LibSVM generates non-linear SVMs. Both use SVMs, which WEKA already offers as the SMO approach. The distinction is that LL is usually quicker than SMO (and can optionally minimize the sum of absolute values of errors rather than the sum of squared errors), whereas LibSVM is significantly more versatile. Using different kernels, SVMs may be used to build many types of non-linear decision boundaries, and the effect can be studied using WEKA's boundary visualizer. They benefit immensely from parameter optimization, which may be accomplished with WEKA's grid-search meta-classifier.
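For reference, the specific setting named above, the L2-regularized, L2-loss (squared hinge) SVM solved in the dual, can be expressed with scikit-learn's LinearSVC, which wraps LIBLINEAR; the parameter values shown are illustrative defaults, not the configuration used in the experiments.

```python
# Hedged sketch: mapping the LL setting described above onto the LIBLINEAR-backed LinearSVC.
from sklearn.svm import LinearSVC

clf = LinearSVC(
    penalty="l2",          # L2 regularization
    loss="squared_hinge",  # L2-loss (squared hinge) SVM
    dual=True,             # solve the dual problem, as in the dual SVM named above
    C=1.0,                 # plays the role of the penalty parameter D in the formulation below
    max_iter=10000,
)
```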
Large linear classification (binary or multiclass) can be performed through LibLINEAR, which supports two popular binary linear classifiers: the linear support vector machine (SVM) and logistic regression (LR). Given a set of instances with label pairs $(v_k, w_k)$, where

$$v_k \in \mathbb{R}^{l}, \quad w_k \in \{-1, +1\}, \quad k = 1, \dots, s,$$

both methods solve the following unconstrained optimization problem with different loss functions $\xi(t; v_k, w_k)$:

$$\min_{t} \;\; \frac{1}{2} t^{T} t + D \sum_{k=1}^{s} \xi(t; v_k, w_k), \qquad D > 0$$
In this case, $D$ denotes a penalty parameter. The two most often used loss functions in SVM are

$$\max\!\left(1 - w_k t^{T} v_k,\; 0\right)$$

and

$$\max\!\left(1 - w_k t^{T} v_k,\; 0\right)^{2}.$$

The former is known as L1-SVM, while the latter is known as L2-SVM. The loss function for LR is as follows:

$$\log\!\left(1 + e^{-w_k t^{T} v_k}\right)$$
It is calculated using a probabilistic model. In some circumstances, the classifier's discriminant function contains a bias factor, $m$. LibLINEAR implements this term by adding an extra feature to the vector $t$ and to each instance $v_k$:

$$t^{T} \leftarrow [\,t^{T},\; m\,], \qquad v_k^{T} \leftarrow [\,v_k^{T},\; M\,],$$

where $M$ is a user-specified constant. For L1-SVM and L2-SVM, a coordinate descent algorithm is used, and LibLINEAR includes a trust-region Newton algorithm for LR and L2-SVM. During the testing phase, it predicts a data point $v$ as positive if $t^{T} v > 0$, and negative otherwise. For multiclass problems, it incorporates the one-vs-the-rest approach and the Crammer and Singer method.
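As a small numerical check, the three loss functions above can be evaluated for one invented instance and weight vector; the numbers are illustrative only.

```python
# Toy evaluation of the L1-SVM, L2-SVM, and LR losses for a single (v, w) pair.
import numpy as np

t = np.array([0.5, -0.25])   # weight vector t
v = np.array([1.0, 2.0])     # instance v_k
w = -1                       # label w_k in {-1, +1}

margin = w * t.dot(v)                  # w_k * t^T v_k = -1 * 0.0 = 0.0
l1_svm_loss = max(1 - margin, 0)       # hinge loss          -> 1.0
l2_svm_loss = max(1 - margin, 0) ** 2  # squared hinge loss  -> 1.0
lr_loss = np.log(1 + np.exp(-margin))  # logistic loss       -> log(2), about 0.693
print(l1_svm_loss, l2_svm_loss, lr_loss)
```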

4. Experimental Results and Performance Analysis

This section discusses the experimentation conducted in this research using straight configuration (single feature selection technique), describing the experimental setup in Table 2 and the outcomes of the model used in all trials.
Classification accuracy (CA) is the ratio of correctly classified data to all input data. The confusion matrix (CM) generates a matrix that describes the model’s performance. Table 3 represents the evaluation of the test split using a single feature selection technique in combination with LL, and its graphical representation is shown in Figure 5.
The information given below illustrates the four critical terms of the confusion matrix: true positives (TP) are the number of tweets predicted as positive that are actually positive; true negatives (TN) are the number of tweets predicted as negative that are actually negative; false positives (FP) are the number of tweets predicted as positive that are actually negative; and false negatives (FN) are the number of tweets predicted as negative that are actually positive.
Accuracy is determined by dividing the number of correctly classified instances by the total number of classified instances; in the trials, a model's performance is evaluated using this accuracy metric. Along with accuracy, precision and recall (which account for the classifier's false positives and false negatives, respectively) and the F1-score (their weighted harmonic average) were calculated.
In contrast, the Matthews correlation coefficient (MCC) is a more precise statistical measure that only yields a high score if the prediction did well in each of the four categories of the confusion matrix (true positives, false negatives, true negatives, and false positives), proportionally to the volume of both positive and negative elements in the input data. Many academics believe that the most intuitive performance statistic is the ratio of correctly classified samples to total samples, referred to as accuracy, and by definition it also works when there are more than two labels (the multiclass case). However, when the dataset is unbalanced (the number of samples in one class is significantly greater than the number of samples in the other classes), accuracy becomes unreliable since it delivers an overoptimistic estimate of the classifier's skill on the majority class. The MCC provides an excellent approach for resolving the class imbalance issue.
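For illustration, the measures discussed above can be computed with scikit-learn as sketched below; the label sequences are invented toy data, not results from this study.

```python
# Toy computation of accuracy, MCC, weighted precision/recall/F1, and the confusion matrix.
from sklearn.metrics import (accuracy_score, confusion_matrix, matthews_corrcoef,
                             precision_recall_fscore_support)

y_true = ["Happy", "Neutral", "Sad", "Happy", "Neutral", "Angry"]
y_pred = ["Happy", "Neutral", "Happy", "Happy", "Sad", "Angry"]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))   # robust under class imbalance
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print("Weighted precision/recall/F1:", precision, recall, f1)
print(confusion_matrix(y_true, y_pred, labels=["Angry", "Happy", "Neutral", "Sad"]))
```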
A receiver operating characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate (100-specificity) at various parameter cut-off points. Each point on the ROC curve corresponds to a pair of sensitivity/specificity values associated with a specific decision threshold. The ROC Area measures a parameter’s ability to discriminate across distinct classes.
Precision–recall curves (PRC) are frequently employed in binary classification to analyze a classifier’s output. It is essential to binarize the output to extend the precision–recall curve and average precision to multiclass or multi-label classification. While one curve can be produced for each label, a precision–recall curve can also be drawn by treating each label’s indication matrix element as a binary prediction (micro-averaging).
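A minimal sketch of the micro-averaging idea described above follows; the class names and per-class scores are invented for illustration.

```python
# Micro-averaged precision-recall: binarize the multiclass labels and treat every
# element of the indicator matrix as one binary prediction.
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import average_precision_score, precision_recall_curve

classes = ["Angry", "Happy", "Neutral"]
y_true = label_binarize(["Happy", "Neutral", "Angry", "Happy"], classes=classes)
y_score = np.array([[0.1, 0.8, 0.1],    # toy per-class decision scores
                    [0.2, 0.3, 0.5],
                    [0.6, 0.2, 0.2],
                    [0.3, 0.4, 0.3]])

precision, recall, _ = precision_recall_curve(y_true.ravel(), y_score.ravel())
print("Micro-averaged average precision:",
      average_precision_score(y_true, y_score, average="micro"))
```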
Table 4 represents the precise accuracy by class using a single feature selection technique combined with LL, and its graphical representation is shown in Figure 6.
Table 5 represents the confusion matrix by class using a single feature selection technique in combination with LL, and its graphical representation is shown in Figure 7.
This section covers the experiments conducted in this research using tetrad configuration (four feature selection procedures), describing the experimental setup in Table 6 and the findings of the model used in all the tests.
Table 7 represents the evaluation of the split test using the tetrad of feature selection technique in combination with LL, and its graphical representation is shown in Figure 8.
Table 8 represents the precise accuracy by class using the tetrad of feature selection technique in combination with LL, and its graphical representation is shown in Figure 9.
Table 9 represents the confusion matrix using the tetrad of feature selection technique in combination with LL, and its graphical representation is shown in Figure 10.
Table 10 represents the evaluation accuracy by class using the tetrad of feature selection technique in combination with LL, and its graphical representation is shown in Figure 11.
Table 11 presents the cost/benefit assessment summary for the 1085 instances dataset achieved by experiment measurements, and its graphical representation is shown in Figure 12.
The first part of Table 12 shows the threshold plot curve as the ratio between the sample size (X) and the true positive rate (Y) from 0 to 1. The second part shows the cost/benefit curve between the sample size (X) and the cost/benefit (Y) clusters with an acceptable accuracy rate. In the last column, the cost curve represents the ratio between the probability cost function (X) and the normalized expected cost (Y).

5. Discussion

As evident from the results presented in Table 7, Table 8 and Table 9, the performance of LL with the tetrad of feature selection techniques outperforms the combination with the single feature selection technique, exhibited in Table 3, Table 4 and Table 5, for cryptocurrency-related text obtained from Twitter. Increasing the structural complexity of machine learning and deep learning models improves sentiment classification performance. To begin, incorporating the Tweet to lexicon feature vector, Tweet to input lexicon feature vector, Tweet to SentiStrength feature vector, and Tweet to embedding feature vector modules enhances the classification performance, particularly for textual inputs. In the overall evaluation summary presented in Table 10, the parameters relative absolute error, root relative squared error, coverage of cases (0.95 level), and mean relative region size (0.95 level) for LL with the tetrad of feature selection techniques were significantly lower than those for the single feature selection technique, especially for configuration C in the first case and configuration B in the other case with textual inputs on the same dataset. The TTTR, TTTE, CCI, ICI, KS, MAS, and RMSE performance measures shown in Figure 8 for LL with the tetrad of feature selection techniques using textual inputs were generally more significant than those for the textual inputs with the single feature selection technique, as depicted in Figure 5. The results indicate that the tetrad of feature selection techniques with LL was more effective than the other configuration in the experiment.
This is a quantitative analysis of public sentiments obtained through tweets on Twitter in the aftermath of trembling cryptocurrency rates. The analysis was performed on public sentiments toward various cryptocurrencies using data from Twitter. Despite the negative consequences of the shaking rates of cryptocurrencies, it was found that public sentiments were more positive than negative. Although most tweets are deemed pure sentiments rather than hybrid sentiments, neutral and happy sentiments are predominant among all sentimental categories. Additionally, it is reassuring that angry sentiments outweigh happy sentiments in none of the cases. The findings of these analyses can be used to better understand Twitter users' perceptions of their cryptocurrency-related decisions and the policies adopted by specific coins and governmental institutions. The current discoveries provide a baseline for measuring public discourse on cryptocurrency matters. A performance comparison of the proposed text classification models with existing models is presented in Table 13.

6. Conclusions and Future Work

This paper analyzed the impact of text pre-processing techniques combined with the text classification process, and the study shows the importance of the different stages in the classification process. It compared two configurations, one with a single feature selection technique and the second with a tetrad of feature selection techniques, to perform text-based sentimental analysis using LL from the filtered classifiers. Sentimental classification based on cryptocurrency-related text from Twitter is an emergent area that needs more attention. A dataset about cryptocurrencies was collected through the Twitter API and classified with precision into thirteen primary pure and compound sentimental attributes.
In this research, thirteen pure and compound sentimental attributes were identified, and the given dataset was labeled with them. A single feature selection technique-based configuration was compared with a tetrad of feature selection techniques that quickly purified the dataset, after which LL classified the given text-based dataset into the most common pure and compound sentimental attributes. Through the daily growth and expansion of feature selection methods, numerous researchers tend to use these techniques to classify text-based datasets. As evident from the results, the tetrad of feature selection techniques is considered the best configuration compared to the single feature selection technique because it yields optimized outcomes for all performance measures. The results also demonstrated that LibLINEAR is computationally efficient and achieves the best performance.
Classifying tweets into pure and hybrid sentiments to fully comprehend and reveal the sentiment of tweets without labeling them is another intriguing future direction. The size of the dataset and the time period over which it was collected are the constraints on this exploration. It would also be interesting to have data spanning a more extended time period to observe how sentiments change over time.

Funding

The author extends his appreciation to the Deanship of Scientific Research at Jouf University for funding this work through research grant No. DSR-2021-02-0207.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is available from the corresponding author upon request.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Baur, D.G.; Dimpfl, T. The volatility of Bitcoin and its role as a medium of exchange and a store of value. Empir. Econ. 2021, 61, 2663–2683.
  2. Ranasinghe, H.; Halgamuge, M.N. Twitter sentiment data analysis of user behavior on cryptocurrencies: Bitcoin and ethereum. In Analyzing Global Social Media Consumption; IGI Global: Hershey, PA, USA, 2021; pp. 277–291.
  3. Minutolo, M.C.; Kristjanpoller, W.; Dheeriya, P. Impact of COVID-19 effective reproductive rate on cryptocurrency. Financ. Innov. 2022, 8, 1–27.
  4. Hassan, M.K.; Hudaefi, F.A.; Caraka, R.E. Mining netizen’s opinion on cryptocurrency: Sentiment analysis of Twitter data. Stud. Econ. Financ. 2021, 39, 365–385.
  5. Köhler, S. Sustainable Blockchain Technologies: An Assessment of Social and Environmental Impacts of Blockchain-Based Technologies; Aalborg Universitetsforlag: Aalborg, Denmark, 2021.
  6. Ghosh, J. The blockchain: Opportunities for research in information systems and information technology. J. Glob. Inf. Technol. Manag. 2019, 22, 235–242.
  7. Guo, Y.-M.; Huang, Z.-L.; Guo, J.; Guo, X.-R.; Li, H.; Liu, M.-Y.; Ezzeddine, S.; Nkeli, M.J. A bibliometric analysis and visualization of blockchain. Future Gener. Comput. Syst. 2021, 116, 316–332.
  8. Chen, E.; Lerman, K.; Ferrara, E. Tracking social media discourse about the covid-19 pandemic: Development of a public coronavirus twitter data set. JMIR Public Health Surveill. 2020, 6, e19273.
  9. Boon-Itt, S.; Skunkan, Y. Public perception of the COVID-19 pandemic on Twitter: Sentiment analysis and topic modeling study. JMIR Public Health Surveill. 2020, 6, e21978.
  10. Renz, S.M.; Carrington, J.M.; Badger, T.A. Two strategies for qualitative content analysis: An intramethod approach to triangulation. Qual. Health Res. 2018, 28, 824–831.
  11. Bhattacharya, S.; Sarkar, D.; Kole, D.K.; Jana, P. Recent trends in recommendation systems and sentiment analysis. In Advanced Data Mining Tools and Methods for Social Computing; Academic Press: New York, NY, USA, 2022; pp. 163–175. ISBN 9780323857086.
  12. Rodrigues, A.P.; Chiplunkar, N.N. A new big data approach for topic classification and sentiment analysis of Twitter data. Evol. Intell. 2019, 15, 877–887.
  13. Xiong, X.; Li, Y.; Qiao, S.; Han, N.; Wu, Y.; Peng, J.; Li, B. An emotional contagion model for heterogeneous social media with multiple behaviors. Phys. A Stat. Mech. Appl. 2018, 490, 185–202.
  14. Valle-Cruz, D.; Fernandez-Cortez, V.; López-Chau, A.; Sandoval-Almazán, R. Does twitter affect stock market decisions? financial sentiment analysis during pandemics: A comparative study of the h1n1 and the covid-19 periods. Cognit. Comput. 2021, 14, 372–387.
  15. Campbell, H.A.; Evolvi, G. Contextualizing current digital religion research on emerging technologies. Hum. Behav. Emerg. Technol. 2020, 2, 5–17.
  16. Li, Z.; Huang, Q.; Emrich, C.T. Introduction to social sensing and big data computing for disaster management. Int. J. Digit. Earth 2019, 12, 1198–1204.
  17. Wirtz, J.G.; Zimbres, T.M. A systematic analysis of research applying ‘principles of dialogic communication’ to organizational websites, blogs, and social media: Implications for theory and practice. J. Public Relat. Res. 2018, 30, 5–34.
  18. Arumugam, S. Development of argument based opinion mining model with sentimental data analysis from twitter content. Concurr. Comput. Pract. Exp. 2022, 34, e6956.
  19. Hrazi, M.M.; Althagafi, A.M.; Aljuhani, A.T.; Rahman, J.; Rahman, M.M.; Shorfuzzaman, M. Sentiment Analysis of Tweets from Airlines in the Gulf Region Using Machine Learning. In Proceedings of the 2021 International Conference of Women in Data Science at Taif University (WiDSTaif), Taif, Saudi Arabia, 30–31 March 2021; pp. 1–6.
  20. Yadav, J.; Misra, M.; Rana, N.P.; Singh, K.; Goundar, S. Netizens’ behavior towards a blockchain-based esports framework: A TPB and machine learning integrated approach. Int. J. Sports Mark. Spons. 2021; ahead-of-print.
  21. Hasan, T.; Ahmad, F.; Rizwan, M.; Alshammari, N.; Alanazi, S.A.; Hussain, I.; Naseem, S. Edge Caching in Fog-Based Sensor Networks through Deep Learning-Associated Quantum Computing Framework. Comput. Intell. Neurosci. 2022, 2022, 6138434.
  22. Shabbir, M.; Ahmad, F.; Shabbir, A.; Alanazi, S.A. Cognitively managed multi-level authentication for security using Fuzzy Logic based Quantum Key Distribution. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 1468–1485.
  23. Mehmood, M.; Ayub, E.; Ahmad, F.; Alruwaili, M.; Alrowaili, Z.A.; Alanazi, S.; Rizwan, M.H.M.; Naseem, S.; Alyas, T. Machine learning enabled early detection of breast cancer by structural analysis of mammograms. Comput. Mater. Contin. 2021, 67, 641–657.
  24. Shahzadi, S.; Ahmad, F.; Basharat, A.; Alruwaili, M.; Alanazi, S.; Humayun, M.; Rizwan, M.; Naseem, S. Machine learning empowered security management and quality of service provision in SDN-NFV environment. Comput. Mater. Contin. 2021, 66, 2723–2749.
  25. Alanazi, S.A.; Alruwaili, M.; Ahmad, F.; Alaerjan, A.; Alshammari, N. Estimation of Organizational Competitiveness by a Hybrid of One-Dimensional Convolutional Neural Networks and Self-Organizing Maps Using Physiological Signals for Emotional Analysis of Employees. Sensors 2021, 21, 3760.
  26. Mehmood, M.; Alshammari, N.; Alanazi, S.A.; Ahmad, F. Systematic Framework to Predict Early-Stage Liver Carcinoma Using Hybrid of Feature Selection Techniques and Regression Techniques. Complexity 2022, 2022, 7816200.
  27. Khan, W.A.; Ahmad, F.; Alanazi, S.A.; Hasan, T.; Naseem, S.; Nisar, K.S. Trust identification through cognitive correlates with emphasizing attention in cloud robotics. Egypt. Inform. J. 2022, 23, 259–269.
  28. Orangi-Fard, N.; Akhbardeh, A.; Sagreiya, H. Predictive Model for ICU Readmission Based on Discharge Summaries Using Machine Learning and Natural Language Processing. In Proceedings of the Informatics, Kowloon, Hongkong, 7–15 August 2022; p. 10.
  29. Mehmood, M.; Alshammari, N.; Alanazi, S.A.; Basharat, A.; Ahmad, F.; Sajjad, M.; Junaid, K. Improved Colorization and Classification of Intracranial Tumor Expanse in MRI Images via Hybrid Scheme of Pix2Pix-cGANs and NASNet-Large. J. King Saud Univ.-Comput. Inf. Sci. 2022; in press.
  30. Wang, D.; Su, J.; Yu, H. Feature extraction and analysis of natural language processing for deep learning english language. IEEE Access 2020, 8, 46335–46345.
  31. Eke, C.I.; Norman, A.A.; Shuib, L.; Nweke, H.F. Sarcasm identification in textual data: Systematic review, research challenges and open directions. Artif. Intell. Rev. 2020, 53, 4215–4258.
  32. Rahman, A.; Saleem, N.; Shabbir, A.; Shabbir, M.; Rizwan, M.; Naseem, S.; Ahmad, F. ANFIS based hybrid approach identifying correlation between decision making and online social networks. EAI Endorsed Trans. Scalable Inf. Syst. 2021, 8, e4.
  33. Ghoshal, S.; Bruckman, A. The role of social computing technologies in grassroots movement building. ACM Trans. Comput. Hum. Interact. 2019, 26, 1–36.
  34. Başaran, S.; Ejimogu, O.H. A neural network approach for predicting personality from Facebook data. SAGE Open 2021, 11, 21582440211032156.
  35. Giuntini, F.T.; Cazzolato, M.T.; dos Reis, M.d.J.D.; Campbell, A.T.; Traina, A.J.; Ueyama, J. A review on recognizing depression in social networks: Challenges and opportunities. J. Ambient. Intell. Humaniz. Comput. 2020, 11, 4713–4729.
  36. Wankhade, M.; Rao, A.C.S.; Kulkarni, C. A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 2022, 1–50.
  37. Fischer, C.; Pardos, Z.A.; Baker, R.S.; Williams, J.J.; Smyth, P.; Yu, R.; Slater, S.; Baker, R.; Warschauer, M. Mining big data in education: Affordances and challenges. Rev. Res. Educ. 2020, 44, 130–160.
  38. Sohail, M.N.; Jiadong, R.; Uba, M.M.; Irshad, M.; Iqbal, W.; Arshad, J.; John, A.V. A hybrid Forecast Cost Benefit Classification of diabetes mellitus prevalence based on epidemiological study on Real-life patient’s data. Sci. Rep. 2019, 9, 1–10.
  39. Prasetijo, A.B.; Isnanto, R.R.; Eridani, D.; Soetrisno, Y.A.A.; Arfan, M.; Sofwan, A. Hoax detection system on Indonesian news sites based on text classification using SVM and SGD. In Proceedings of the 2017 4th International Conference on Information Technology, Computer, and Electrical Engineering (ICITACEE), Semarang, Indonesia, 18–19 October 2017; pp. 45–49.
Figure 1. Twitter API access procedure.
Figure 2. Frequencies ranges of Twitter text.
Figure 3. Tweets preprocessing.
Figure 4. Text-based sentimental classification.
Figure 5. Graphical representation of evaluation on split test (single feature selection technique).
Figure 6. Detailed accuracy by class (single feature selection technique).
Figure 7. Graphical representation of confusion matrix (single feature selection technique).
Figure 8. Graphical representation of evaluation on split test (tetrad feature selection techniques).
Figure 9. Graphical representation of detailed accuracy by class (tetrad feature selection techniques).
Figure 10. Graphical representation of confusion matrix (tetrad feature selection techniques).
Figure 11. Graphical representation of evaluation accuracy (tetrad feature selection techniques).
Figure 12. Graphical representation of cost/benefit (tetrad feature selection techniques).
Table 1. Tweets distribution in sentimental classes.

No. | Sentiment Class | Instances | No. | Sentiment Class | Instances
1 | Not_Relevant | 214 | 8 | Angry | 57
2 | Neutral | 1572 | 9 | Sad_Angry | 2
3 | Happy | 1137 | 10 | Disgust | 6
4 | Surprise | 35 | 11 | Sad_Disgust | 2
5 | Happy_Surprise | 11 | 12 | Angry_Disgust | 7
6 | Sad | 32 | 13 | Sad_Angry_Disgust | 1
7 | Happy_Sad | 9 | | Total Instances | 3085
Table 2. Run information (single feature selection technique).
 | Configuration A | Configuration B | Configuration C | Configuration D
Instances | 3085 | 3085 | 3085 | 3085
Attributes | 44 | 44 | 44 | 44
Training Split | 66% | 70% | 80% | 90%
Testing Split | 34% | 30% | 20% | 10%
Preprocess | WEKA (configuration screenshot)
Classification | WEKA (configuration screenshot)
Table 3. Evaluation of test split (single feature selection technique).
 | Configuration A | Configuration B | Configuration C | Configuration D
Time Taken for Training (TTTR) | 4.34 s | 2.01 s | 1.99 s | 1.98 s
Time Taken for Testing (TTTE) | 1.31 s | 0.01 s | 0.02 s | 0.02 s
Correctly Classified Instances (CCI) | 769 (73.313%) | 684 (73.953%) | 448 (72.618%) | 220 (71.427%)
Incorrectly Classified Instances (ICI) | 280 (26.705%) | 241 (26.056%) | 169 (27.397%) | 88 (28.5714%)
Kappa Statistic | 0.518 | 0.5313 | 0.5145 | 0.4863
Mean Absolute Error (MAE) | 0.0411 | 0.0401 | 0.0421 | 0.044
Root Mean Squared Error (RMSE) | 0.2026 | 0.2002 | 0.2053 | 0.2097
Total Number of Instances | 1049 | 925 | 617 | 308
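As an illustrative cross-check of Table 3, the accuracy and error-rate percentages follow directly from the correctly and incorrectly classified counts; for Configuration A:
\[
\text{Accuracy} = \frac{\text{CCI}}{\text{CCI}+\text{ICI}} = \frac{769}{769+280} \approx 73.3\%,
\qquad
\text{Error rate} = \frac{\text{ICI}}{\text{CCI}+\text{ICI}} = \frac{280}{1049} \approx 26.7\%.
\]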
Table 4. Detailed accuracy by class (single feature selection technique).
Class | TP Rate | FP Rate | Precision | Recall | F-Measure | MCC | ROC Area | PRC Area
Neutral | 0.875 | 0.354 | 0.720 | 0.875 | 0.790 | 0.536 | 0.760 | 0.694
Happy | 0.728 | 0.124 | 0.775 | 0.728 | 0.751 | 0.612 | 0.802 | 0.665
Not_Relevant | 0.203 | 0.002 | 0.875 | 0.203 | 0.329 | 0.406 | 0.600 | 0.230
Angry | 0.182 | 0.003 | 0.571 | 0.182 | 0.276 | 0.315 | 0.589 | 0.121
Angry_Disgust | 0.000 | 0.000 | - | 0.000 | - | - | 0.500 | 0.003
Disgust | - | 0.004 | 0.000 | - | - | - | - | -
Happy_Surprise | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 | −0.002 | 0.500 | 0.005
Sad | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | −0.005 | 0.499 | 0.010
Surprise | 0.000 | 0.000 | - | 0.000 | - | - | 0.500 | 0.011
Happy_Sad | 0.000 | 0.002 | 0.000 | 0.000 | 0.000 | −0.002 | 0.499 | 0.003
Sad_Disgust | - | 0.001 | 0.000 | - | - | - | - | -
Sad_Angry | 0.000 | 0.000 | - | 0.000 | - | - | 0.500 | 0.001
Sad_Angry_Disgust | - | 0.000 | - | - | - | - | - | -
Weighted Average | 0.733 | 0.227 | - | 0.733 | - | - | 0.753 | 0.618
Table 5. Confusion matrix by class (single feature selection technique).
a | b | c | d | e | f | g | h | i | j | k | l | m | Classified as
468 | 59 | 2 | 1 | 0 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | a = Neutral
103 | 283 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | b = Happy
45 | 8 | 14 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | c = Not_Relevant
15 | 3 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | d = Angry
1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | e = Angry_Disgust
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | f = Disgust
1 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | g = Happy_Surprise
5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | h = Sad
11 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | i = Surprise
0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | j = Happy_Sad
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | k = Sad_Disgust
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | l = Sad_Angry
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | m = Sad_Angry_Disgust
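As a consistency check, the per-class figures in Table 4 can be recovered from the confusion matrix in Table 5. For the Neutral class (row/column a), using the row total of 535 actual Neutral instances and the column total of 650 Neutral predictions:
\[
P = \frac{468}{650} \approx 0.720, \qquad
R = \frac{468}{535} \approx 0.875, \qquad
F_1 = \frac{2PR}{P+R} \approx 0.790,
\]
which matches the Neutral row of Table 4.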
Table 6. Run information (tetrad feature selection techniques).
 | Configuration A | Configuration B | Configuration C | Configuration D
Instances | 3085 | 3085 | 3085 | 3085
Attributes | 44 | 44 | 44 | 44
Training Split | 66% | 70% | 80% | 90%
Testing Split | 34% | 30% | 20% | 10%
Preprocess | WEKA (configuration screenshots)
Classification | WEKA (configuration screenshot)
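For readers who wish to reproduce a run such as the one summarized in Table 6, the following Java sketch shows one way to wire the tetrad of tweet filters into a FilteredClassifier with LibLINEAR using the WEKA API. It is a minimal illustration rather than the study's exact code: it assumes the AffectiveTweets and LibLINEAR WEKA packages are installed (which provide the filter and classifier classes named below), the dataset path is a placeholder, and filter options (e.g., the index of the tweet-text attribute) are left at their defaults.

// Minimal, illustrative sketch: tetrad of AffectiveTweets filters inside a
// FilteredClassifier with LibLINEAR, evaluated on a 66%/34% percentage split
// analogous to Configuration A.
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.LibLINEAR;                 // WEKA LibLINEAR package
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.unsupervised.attribute.TweetToEmbeddingsFeatureVector;
import weka.filters.unsupervised.attribute.TweetToInputLexiconFeatureVector;
import weka.filters.unsupervised.attribute.TweetToLexiconFeatureVector;
import weka.filters.unsupervised.attribute.TweetToSentiStrengthFeatureVector;

public class TetradLibLinearSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file of labelled cryptocurrency tweets; the sentiment
        // class label is assumed to be the last attribute.
        Instances data = DataSource.read("cryptocurrency_tweets.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Tetrad of feature selection/extraction filters applied in sequence.
        MultiFilter tetrad = new MultiFilter();
        tetrad.setFilters(new Filter[] {
            new TweetToLexiconFeatureVector(),
            new TweetToInputLexiconFeatureVector(),
            new TweetToSentiStrengthFeatureVector(),
            new TweetToEmbeddingsFeatureVector()
        });

        // Filtered classifier: the filters are fitted on the training data only
        // and re-applied to the test data before LibLINEAR classifies it.
        FilteredClassifier model = new FilteredClassifier();
        model.setFilter(tetrad);
        model.setClassifier(new LibLINEAR());

        // 66% training / 34% testing percentage split.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        model.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);

        System.out.println(eval.toSummaryString());
        System.out.println("Kappa statistic: " + eval.kappa());
    }
}

The same pipeline can equally be assembled in the WEKA Explorer by selecting the four filters inside a FilteredClassifier, which is the style of setup the run-information tables summarize.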
Table 7. Evaluation of test split (tetrad feature selection techniques).
 | Configuration A | Configuration B | Configuration C | Configuration D
Time Taken for Training (TTTR) | 18.06 s | 16.47 s | 13.64 s | 11.81 s
Time Taken for Testing (TTTE) | 18.97 s | 5.42 s | 4.37 s | 2.67 s
Correctly Classified Instances (CCI) | 1014 (96.739%) | 896 (96.865%) | 601 (97.475%) | 450 (97.322%)
Incorrectly Classified Instances (ICI) | 35 (3.260%) | 29 (3.135%) | 16 (2.528%) | 13 (2.678%)
Kappa Statistic | 0.5842 | 0.5875 | 0.6019 | 0.6008
Mean Absolute Error (MAE) | 0.0358 | 0.0356 | 0.0347 | 0.0349
Root Mean Squared Error (RMSE) | 0.1892 | 0.1887 | 0.1862 | 0.1868
Total Number of Instances | 1049 | 925 | 617 | 463
Table 8. Detailed accuracy by class (tetrad feature selection techniques).
Class | TP Rate | FP Rate | Precision | Recall | F-Measure | MCC | ROC Area | PRC Area
Neutral | 0.907 | 0.289 | 0.751 | 0.907 | 0.822 | 0.629 | 0.809 | 0.726
Happy | 0.809 | 0.105 | 0.827 | 0.809 | 0.818 | 0.707 | 0.852 | 0.742
Not_Relevant | 0.233 | 0.000 | 1.000 | 0.233 | 0.377 | 0.469 | 0.616 | 0.286
Angry | 0.231 | 0.005 | 0.500 | 0.231 | 0.316 | 0.330 | 0.613 | 0.132
Angry_Disgust | 0.000 | 0.002 | 0.000 | 0.000 | 0.000 | −0.003 | 0.499 | 0.005
Disgust | - | 0.002 | 0.000 | - | - | - | - | -
Happy_Surprise | 0.000 | 0.002 | 0.000 | 0.000 | 0.000 | −0.003 | 0.499 | 0.005
Sad | 0.000 | 0.000 | - | 0.000 | - | - | 0.500 | 0.011
Surprise | 0.000 | 0.000 | - | 0.000 | - | - | 0.500 | 0.010
Happy_Sad | 0.000 | 0.002 | 0.000 | 0.000 | 0.000 | −0.003 | 0.499 | 0.005
Sad_Disgust | - | 0.002 | 0.000 | - | - | - | - | -
Sad_Angry | 0.000 | 0.000 | - | 0.000 | - | - | 0.500 | 0.002
Sad_Angry_Disgust | - | 0.000 | - | - | - | - | - | -
Weighted Average | 0.775 | 0.182 | - | 0.775 | - | - | 0.797 | 0.662
Table 9. Confusion matrix (tetrad configuration).
a | b | c | d | e | f | g | h | i | j | k | l | m | Classified as
274 | 26 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | a = Neutral
44 | 191 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | b = Happy
28 | 4 | 10 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | c = Not_Relevant
9 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | d = Angry
0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | e = Angry_Disgust
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | f = Disgust
2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | g = Happy_Surprise
3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | h = Sad
4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | i = Surprise
0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | j = Happy_Sad
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | k = Sad_Disgust
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | l = Sad_Angry
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | m = Sad_Angry_Disgust
Table 10. Evaluation summary (tetrad feature selection techniques).
 | Configuration A | Configuration B | Configuration C | Configuration D
Relative Absolute Error | 0.387187 | 0.384864 | 0.372557 | 0.374826
Root Relative Squared Error | 0.382413 | 0.378699 | 0.359832 | 0.362554
Coverage of Cases (0.95 Level) | 0.767398 | 0.768649 | 0.774716 | 0.773218
Mean Relative Region Size (0.95 Level) | 0.76923 | 0.76923 | 0.76923 | 0.76923
Total Number of Instances | 1049 | 925 | 617 | 463
Table 11. Cost/benefit summary (tetrad of feature selection techniques).
Class | Cost/Benefit | Random | Gain | Percentage of Accuracy | Percentage of Population | Percentage of Target
Neutral | 119 | 309.69 | ±190.69 | 80.71% | 59.16% | 90.72%
Happy | 85 | 290.29 | ±205.29 | 86.22% | 37.43% | 80.93%
Not_Relevant | 33 | 51.61 | ±18.61 | 94.65% | 1.62% | 23.25%
Angry | 13 | 18.75 | ±5.75 | 97.89% | 0.97% | 23.07%
Angry_Disgust | 3 | 3 | 0 | 99.51% | 0% | 0%
Disgust | 0 | 0 | 0 | 100% | 0% | NaN
Happy_Surprise | 3 | 3 | 0 | 99.51% | 0% | 0%
Sad | 7 | 7 | 0 | 98.86% | 0% | 0%
Surprise | 6 | 6 | 0 | 99.03% | 0% | 0%
Happy_Sad | 3 | 3 | 0 | 99.51% | 0% | 0%
Sad_Disgust | 0 | 0 | 0 | 100% | 0% | NaN
Sad_Angry | 1 | 1 | 0 | 99.83% | 0% | 0%
Sad_Angry_Disgust | 0 | 0 | 0 | 100% | 0% | NaN
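As read from Table 11, the Gain column appears to be the difference between the Random and Cost/Benefit columns; for example:
\[
\text{Gain}_{\text{Neutral}} = 309.69 - 119 = 190.69, \qquad
\text{Gain}_{\text{Happy}} = 290.29 - 85 = 205.29.
\]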
Table 12. Cost/benefit summary for pure and compound sentiments (tetrad of feature selection techniques). For each class (Neutral, Happy, Not_Relevant, Angry, Angry_Disgust, Disgust, Happy_Surprise, Sad, Surprise, Happy_Sad, Sad_Disgust, Sad_Angry, Sad_Angry_Disgust), the table presents three plots: the threshold curve (sample size (X) vs. true positive rate (Y)), the cost/benefit curve (sample size (X) vs. cost/benefit (Y)), and the cost curve (probability cost function (X) vs. normalized expected cost (Y)).
Table 13. Performance comparison of the anticipated text classification models with existing models.
Studies | Classifier | Time Taken | Accuracy | Error Rate | Kappa Stats
Sohail et al. [38] | J48 consolidation | 0.15 | 0.9893 | 0.01 | 0.97
Sohail et al. [38] | J48 graft | 0.02 | 0.9893 | 0.01 | 0.97
Sohail et al. [38] | Hoeffding Tree | 0.11 | 0.9110 | 0.13 | 0.78
Anticipated (cryptocurrency-related sentiment classification) | LibLINEAR (single feature selection technique) | 2.01 | 0.7395 | 0.2002 | 0.5315
Anticipated (cryptocurrency-related sentiment classification) | LibLINEAR (tetrad feature selection techniques) | 13.64 | 0.9747 | 0.1862 | 0.6019

Studies | Classifier | AUC | CA | F1-Score | Precision | Recall
Prasetijo et al. [39] | SVM | - | 0.784 | - | 0.548 | 1.00
Alanazi et al. [25] | ODCNN | 0.990 | 0.998 | - | - | -
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
