Real-Time Big Data Architecture for Processing Cryptocurrency and Social Media Data: A Clustering Approach Based on k-Means

Barradas, Adrian; Tejeda-Gil, Acela; Cantón-Croda, Rosa-María

doi:10.3390/a15050140

by Adrian Barradas *,†, Acela Tejeda-Gil † and Rosa-María Cantón-Croda †

Graduate School of Engineering, UPAEP-University, Puebla 72410, Mexico

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Algorithms 2022, 15(5), 140; https://doi.org/10.3390/a15050140

Submission received: 16 March 2022 / Revised: 6 April 2022 / Accepted: 7 April 2022 / Published: 22 April 2022

(This article belongs to the Special Issue Machine Learning in Data Structures)

Abstract

Cryptocurrencies have recently emerged as financial assets that allow their users to execute transactions in a decentralized manner. Their popularity has led to the generation of huge amounts of data, specifically on social media networks such as Twitter. In this study, we propose an iterative kappa architecture that collects, processes, and temporarily stores data regarding transactions and tweets of two of the major cryptocurrencies according to their market capitalization: Bitcoin (BTC) and Ethereum (ETH). We applied a k-means clustering approach to group data according to their principal characteristics. Data are categorized into three groups: BTC typical data, ETH typical data, BTC and ETH atypical data. Findings show that activity on Twitter correlates to activity regarding the transactions of cryptocurrencies. It was also found that around 14% of data relate to extraordinary behaviors regarding cryptocurrencies. These data contain higher transaction volumes of both cryptocurrencies, and about 9.5% more social media publications in comparison with the rest of the data. The main advantages of the proposed architecture are its flexibility and its ability to relate data from various datasets.

Keywords: kappa architecture; iterative data processing; document-oriented No-SQL database; Bitcoin; Ethereum; Twitter

1. Introduction

During the past few years, the use of digital currencies has emerged as a novel manner of executing financial transactions [1]. A digital currency works the same way a real currency does, with the particularity that it is not issued by a central bank; thus it is a decentralized currency [2]. Digital currencies are generated using a cryptographic algorithm called blockchain, which employs mathematical encryption methods to create and verify a continuously growing data structure. Therefore, blockchain protects data by transforming it into an unreadable format, which can only be decrypted employing the corresponding decryption algorithm. Blockchain transactions flow through a computer network without the need for intermediaries as the algorithm links users directly [1]. That kind of network is known as a cryptocurrency network as it enables the establishment of decentralized peer-to-peer data exchange [3].

In terms of trading volume, Bitcoin is currently the most popular cryptocurrency; it allows electronic cash transactions directly from one party to another without going through a financial institution [4]. Diverse studies serve as evidence that Bitcoin has been highly volatile since its establishment, and this volatility has made it popular among speculators [5]. Although it has so far been used mostly for speculation, since at least 2010 numerous intermediaries have begun to transact with Bitcoin [6]. It has been reported that the market capitalization of the one hundred largest cryptocurrencies exceeded the equivalent of USD 2.65 trillion by November 2021; according to CoinMarketCap, Bitcoin is the largest cryptocurrency, with a market capitalization that surpasses USD 1.1 trillion, while Ethereum is the second largest, with a market capitalization equivalent to USD 543 billion [7]. Both Bitcoin and Ethereum use the same principles of blockchain technology; nevertheless, while Bitcoin's purpose is limited to functioning as a digital currency, Ethereum is designed to be a general-purpose programmable blockchain, which can manage the transactions of a digital currency, but also any kind of data expressible as a key-value tuple [8]. This makes Ethereum suitable for other decentralized applications; however, this study focuses only on its use as a digital currency.

Cryptocurrencies became a trend largely because of their popularity on social media. In that context, one of the main sources of information about cryptocurrencies is Twitter. It allows users to share their thoughts and opinions regarding cryptocurrencies; therefore it is, among other social networks, a medium that boosts the cryptocurrency world [9,10]. According to data from BitInfoCharts, the number of daily tweets related to Bitcoin during 2021 fluctuated between 30,540 and 363,566; the latter corresponds to around 0.072 percent of the average daily tweets worldwide [11]. This shows how widely Twitter is used as an information medium for cryptocurrencies [12,13]. It is worth mentioning that Twitter is considered a leading social media platform and a rich source of real-time information [14]. On the other hand, during the same period, daily Bitcoin transactions averaged 332,355 [15]. In that context, a large amount of data is generated every day, i.e., around 136 tweets and 230 transactions per minute. Big data refers to large and complex datasets which require advanced data storage, management, and analysis technologies [3]. One of the sources of big data is social media, which has an increasing number of users [16] who integrate their background and daily activities into the networks. This contributes to the rapid generation of very large datasets [17]. As data are generated rapidly, it is valuable to obtain information and insights in real time in order to react appropriately to events and trends surrounding large volumes of data. In this case, it concerns the analysis of social media posts and cryptocurrency transactions [18].

Given the popularity of cryptocurrencies, there is a vast number of recent studies and projects focused on analyzing data from social media and cryptocurrencies in real time utilizing novel data processing tools and methodologies. Mohapatra et al. [19] proposed a distributed architectural design for handling large volumes of data from Twitter and Bitcoin transactions in real time to predict price fluctuations; by means of a combined machine learning and lexicon approach, they determined the sentiments of the tweets and related them with the price of Bitcoin to predict the next minute's price. Bandi [20] utilized a lambda architecture to process and visualize real-time data regarding cryptocurrencies' prices. On the other hand, Horvat et al. [21] proposed an architecture for real-time cryptocurrency data processing and analysis based on the lambda architectural approach to obtain insights through the relation of different data sources such as social media, cryptocurrencies, and the stock market. A kappa architecture was proposed by Bandi and Hurtado [18] to process real-time data from Twitter to visualize analytics, such as trends and tweet volume. In addition, a relation between tweets and cryptocurrencies' prices was studied by Abraham et al. [22] as a way to predict the direction of the price variation, from which it was found that the volume of tweets is more significant than their sentiments. It was also found by Park and Lee [23] that the volume of tweets correlates with Bitcoin prices. Garcia et al. [24] found that an increase in Bitcoin's price led to a higher number of tweets, which in turn would drive the price further up [25]. Some other studies focused only on the relation between tweets and cryptocurrencies, treating the methods involved in the management and processing of data as secondary. Aharon et al. [26] found that there is a causal relationship between the uncertainty associated with sentiments in social media and cryptocurrency returns.

In addition, we have found evidence of works that aim to identify behavioral patterns regarding cryptocurrencies by means of clustering algorithms. Baek et al. [27] applied a k-means clustering approach to identify suspicious transactions of Ethereum. Aspembitova et al. [28] identified four types of cryptocurrency users through the application of k-means clustering and support vector machines (SVMs) on Bitcoin and Ethereum transactional data. Fang et al. [29] used k-means to classify positive and negative publications from Twitter related to Bitcoin.

The previous research serves as a reference and basis for our study; although similar approaches have been proposed, to the best of our knowledge there is no evidence of related papers that utilize an iterative kappa architecture to process, relate and manage data from Twitter and cryptocurrency markets in real time. In this context, this study proposes the application of a novel kappa architecture, derived from the lambda architecture, for processing and analyzing real-time data from Twitter and the cryptocurrency market. It integrates a temporary batch step which allows the relation of data from different data sources in a specific time span. The proposed architecture focuses on the processing of data in real time while looking for insights and patterns regarding the number of tweets, their sentiment, and the number, type, and volume of cryptocurrency transactions. Data are collected through application programming interfaces (APIs) and streamed to be processed and stored in a document-oriented No-SQL database (MongoDB™). Afterward, data are related with the purpose of finding meaningful patterns.

The present work aims to demonstrate the use and benefits of the proposed architecture as a choice for relating data from cryptocurrencies and social media while identifying patterns in real time; for that purpose, data from a defined period of time are used.

This paper is organized as follows: Section 2 describes in detail the materials and methods used for the study’s development. Section 3 presents the results obtained by processing and relating data using the proposed kappa architecture. Finally, Section 4 summarizes the main findings and future works for this study.

2. Materials and Methods

This study follows an approach based on the kappa architecture for big data, as shown in Figure 1. The kappa architecture was first introduced by Kreps in 2014 [30]. It derives from the lambda architecture, which is considered one of the industry's best practices for scalable real-time big data processing [21]. Lambda architecture consists of three layers: batch layer, speed layer, and serving layer. The batch layer processes data and stores them so that precomputed data can be queried on demand instead of being computed on the fly. The speed layer processes data in real time to compensate for the high latency of updates in the batch layer. Thus, data are processed in a parallel manner in both layers. Finally, the serving layer stores the views from the previous two layers [31]. Kappa architecture is similar to the lambda architecture, with the difference that it does not include a batch layer; therefore, it processes data only in real time [30]. In this context, the main characteristics of the kappa architecture are its simplicity and its flexibility in comparison with other big data architectures [32]; thus it is suitable for online processing of data flows [33].

The proposed architecture consists of a real-time streaming layer that receives and processes new incoming data and a serving layer that stores data in MongoDB™ to be displayed or queried on demand. At the streaming layer, the processing is executed by means of Apache Kafka™ and Apache Spark™, which process data in a distributed manner and consequently faster than non-distributed approaches [34]. At the serving layer of the kappa architecture, the processed, modeled, and evaluated data coming from the real-time streaming layer are finally loaded into a database management system (DBMS), i.e., MongoDB™. In this case, as we handle huge volumes of unstructured data from Twitter, a document-oriented No-SQL database is better suited than a traditional relational database due to its advantages regarding horizontal scalability and the storage of unstructured data.

The kappa architecture that we present is iterative. In the first iteration, single datasets from Twitter and CryptoCompare are processed and transformed in order to be related; thereafter, a second iteration is executed to classify the related datasets and obtain insights. In that context, data are collected as they are generated and then streamed, transformed, and stored in MongoDB™, from which datasets are queried. In this case, MongoDB™ serves as a temporary batch that stores data from the last 120 s with the purpose of relating them over a time span of one minute, thereby obtaining one record for each minute. Finally, the queried dataset is returned to the streaming layer to be processed by means of a machine learning approach; in this case, k-means clustering. K-means clustering is one of the most popular algorithms for unsupervised machine learning. It groups data with similar characteristics under a determined number of clusters while separating them according to their dissimilarities [35]. Clustering is defined as a method for finding homogeneous groups of data points in a dataset; in that sense, it allows the recognition of patterns in data [36].
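The clustering step described above can be illustrated with a minimal sketch of Lloyd's algorithm, the standard iterative procedure usually meant by k-means. It is written in plain Python for readability and is not the Apache Spark™ implementation used in the study; the function names `kmeans` and `nearest`, the random initialization, and the iteration cap are assumptions made for this illustration.

```python
import random


def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance to point."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((x - y) ** 2 for x, y in zip(point, centroids[c])),
    )


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k randomly chosen input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its current members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
    return labels, centroids
```

Called as `labels, centroids = kmeans(rows, k=3)` on standardized per-minute records, it returns one cluster index per record. Because the result depends on the initial centroids, production implementations typically use several restarts or a k-means++ style initialization.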

For this study, data related to the two largest cryptocurrencies according to their market capitalization, i.e., Bitcoin (BTC) and Ethereum (ETH), were collected [7]. The corresponding tweets were mined considering only publications made in English. Parameters for the k-means clustering approach were calculated for data collected on 14 January 2022 corresponding to a period of 8 h from 06:59:00 (UTC-6) to 16:59:00 (UTC-6). The algorithms for the proposed architecture were executed on a single computer; nevertheless, the architecture is suitable for execution on a computer cluster, which distributes the computational load among its member computers.

Figure 2 shows a representation of the steps involved in the development of the study. First, data mining is executed in real time by means of public APIs [37,38] that enable the retrieval of the latest available raw data from Twitter and CryptoCompare. Collected data are then streamed and immediately transformed. Datasets are cleaned by deleting unneeded variables, and the remaining ones are transformed so that they can be correctly processed and related. Additionally, a standard notation for the data is defined, and derived attributes are calculated when needed. Thereafter, data pass to the serving layer, where they are stored in MongoDB™ and then queried to relate the corresponding datasets according to their most relevant attributes. The queried and related data are then returned to the real-time streaming layer, where a k-means clustering approach is executed to categorize data in groups according to their characteristics. In that sense, data flow through the architecture in a second iteration, in parallel with the first, with the purpose of obtaining more information from them in real time.

3. Results

Data for this study were obtained from two different sources (Twitter and CryptoCompare) in the form of a JSON real-time stream, by means of an API [37,38]. To query data, a set of keywords were given which correspond to the name and symbol of the cryptocurrencies, i.e., Bitcoin (BTC), and Ethereum (ETH). As shown in Table 1, data collected from Twitter contain several attributes related to each tweet such as id, timestamp, and text, but also attributes related to the user such as user mentions, number of followers, and location, among others. On the other hand, data from CryptoCompare contain transaction-inherent attributes, i.e., timestamp [TS], market [M], symbol [FSYM], price [P], and volume [Q].

Data streams feed their corresponding topic (Twitter and Crypto) in Apache Kafka™. Data streaming is executed in a parallel manner, so that both streams can be processed simultaneously. Then, data processing is sequentially carried out in Apache Spark™, which allows the computation tasks to be divided between various processors forming a cluster. Data from Twitter in Table 1 contain fields related to the user that, for the purposes of this study, are not representative. Only the following attributes were kept: timestamp, id, and text. In the case of data obtained from CryptoCompare, none of their attributes were discarded, as they contain representative information regarding each trade. At this point, data are transformed into a binary object which can be managed by Apache Kafka™. Text data from each tweet are processed in the real-time streaming layer by means of Spark NLP, which is one of the most widely used libraries for natural language processing [39,40]. The attribute text is split into sentences and, for each one, sentiment analysis is executed to identify whether it is positive or negative. Thus, a new attribute, sentiment, is generated for each tweet. On the other hand, data on the Apache Kafka™ topic Crypto are transformed to have the same notation as data from the topic Twitter, so they can be related. Finally, data are immediately uploaded to the corresponding collection in the database hosted at MongoDB™.
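The cleaning and notation steps of this paragraph can be sketched as two small transformations. This is only an illustration in plain Python, not the Spark job used in the study. The tweet fields (timestamp, id, text) and the trade fields (TS, M, FSYM, P, Q) come from the text; the target names in `TRADE_RENAMES`, the function names, and the assumption that TS is a Unix timestamp in seconds are hypothetical.

```python
from datetime import datetime, timezone

# Attributes kept from each tweet, as listed in the text.
TWEET_FIELDS = ("timestamp", "id", "text")

# Hypothetical mapping from CryptoCompare's short keys to a shared notation.
TRADE_RENAMES = {
    "TS": "timestamp",
    "M": "market",
    "FSYM": "currency",
    "P": "price",
    "Q": "volume",
}


def clean_tweet(raw):
    """Drop user-related fields and keep only the attributes used downstream."""
    return {field: raw[field] for field in TWEET_FIELDS}


def normalize_trade(raw):
    """Rename trade attributes and express the timestamp as ISO 8601 in UTC."""
    trade = {new: raw[old] for old, new in TRADE_RENAMES.items()}
    # Assumption: TS holds seconds since the Unix epoch.
    trade["timestamp"] = datetime.fromtimestamp(raw["TS"], tz=timezone.utc).isoformat()
    return trade
```

Giving both datasets the same timestamp format and a common currency key is what makes the later join on timestamp and currency straightforward.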

By following the process presented in Figure 2, a new dataset that relates the individual data from topics Twitter and Crypto is queried from the database. Attributes timestamp and currency are defined as keys to establish a relationship that allows generating a new dataset containing facts regarding the transactions. Considering the speculative nature of cryptocurrencies dominated by short-term investors [25], data are analyzed on a time basis of minutes; thereafter, new attributes are calculated: number of tweets, accumulated sentiment, transaction volume, average currency price, and number of transactions. The obtained dataset, as shown in Table 2, is then sent to a new topic (Query) in Apache Kafka™ to be streamed to Apache Spark™ and thus processed in a second iteration.
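As a rough illustration of this relation step, the sketch below groups tweets and trades by minute and currency and derives the attributes named above. In the study this is done by querying MongoDB™; here plain Python dictionaries stand in for the collections. The record layout, the per-tweet `currency` and numeric `sentiment` fields, the use of a simple (unweighted) mean for the average price, and the function names are assumptions; the split between buy and sell transactions is omitted for brevity.

```python
from collections import defaultdict


def minute_of(timestamp):
    """Truncate an ISO 8601 timestamp to the minute, e.g. '2022-01-14T12:59'."""
    return timestamp[:16]


def relate_per_minute(tweets, trades):
    """Join tweets and trades on (minute, currency) and compute derived attributes."""
    rows = defaultdict(
        lambda: {"n_tweets": 0, "sentiment": 0, "volume": 0.0, "price_sum": 0.0, "n_trades": 0}
    )
    for tweet in tweets:
        row = rows[(minute_of(tweet["timestamp"]), tweet["currency"])]
        row["n_tweets"] += 1
        row["sentiment"] += tweet["sentiment"]  # accumulated sentiment
    for trade in trades:
        row = rows[(minute_of(trade["timestamp"]), trade["currency"])]
        row["n_trades"] += 1
        row["volume"] += trade["volume"]
        row["price_sum"] += trade["price"]
    related = []
    for (minute, currency), row in sorted(rows.items()):
        price_sum = row.pop("price_sum")
        # Minutes without trades have no average price.
        avg_price = price_sum / row["n_trades"] if row["n_trades"] else None
        related.append({"minute": minute, "currency": currency, **row, "avg_price": avg_price})
    return related
```

Each returned dictionary corresponds to one record per minute and currency, the shape that is then sent on for the second iteration.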

With the purpose of demonstrating the application of the proposed algorithm, we collected data for a period of 8 h, from 06:59:00 (UTC-6) to 16:59:00 (UTC-6) of 14 January 2022. This corresponds to 248,313 tweets, 73,506 sell transactions, and 114,493 buy transactions of both cryptocurrencies. Figure 3 and Figure 4 show a graphical representation of the behavior of the collected data regarding the cryptocurrencies Bitcoin (BTC) and Ethereum (ETH), respectively.

It is noticeable that, in the case of Bitcoin (BTC), as the price increases, the sentiment does too. A similar behavior is seen when the price remains steady, in which case the sentiment stays within a stable range. On the other hand, buy and sell transactions seem to behave according to the change in price; that is, an increase or decrease in price is related to a larger or smaller number of buy and sell transactions, respectively. Nevertheless, this behavior appears to happen only when there is an abrupt change in price. It is worth mentioning that the number of tweets and transactions tends toward lower values as the day goes by. This may indicate that the vast majority of activities regarding cryptocurrencies are carried out during normal working hours. The behavior of Ethereum (ETH) is similar to that of Bitcoin (BTC). As shown in Figure 4, there is a relation between the number of tweets, the sentiment around them, and price, but only when the price change is abrupt. When the price remains steady, the rest of the variables seem to behave in the same manner. In this case, it can also be seen that during the final minutes of the graph, the sentiment does not affect the price, which remains largely unchanged. Finally, as in the previous graph, the number of tweets and transactions tends to decrease as the day passes.

To determine whether there is a correlation between variables, a Pearson correlation analysis was executed. For this purpose, data were standardized so that all attributes are expressed on the same scale and can be correctly related. Table 3 presents a correlation matrix for the corresponding variables of the dataset, from which only the statistically significant values (p-value ≤ 0.05) were considered. It was found that there is a positive correlation between the number of tweets and the buy and sell volumes (0.34, 0.43). Additionally, there is a positive correlation between the sentiment and the buy and sell prices of the cryptocurrencies (0.30, 0.30), while the correlation between the latter and the number of tweets is negative (−0.69, −0.69). In addition, a correlation between volume, avg. price, and number of transactions of both buy and sell transactions was found, which was expected because of their mutually dependent nature. This approach complements the findings from Figure 3 and Figure 4.
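For reference, the Pearson coefficient of two standardized variables is simply the mean of the products of their z-scores. The following minimal sketch shows that computation in plain Python; the function names are illustrative, the population standard deviation is assumed for standardization, and the significance test (p-value) reported in Table 3 is not reproduced here.

```python
from math import sqrt


def standardize(values):
    """Return z-scores: subtract the mean and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def pearson(x, y):
    """Pearson correlation as the average product of paired z-scores."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(zx)
```

For example, `pearson(n_tweets, buy_volume)` on two equally long per-minute series yields a value between -1 and 1; both inputs must have non-zero variance.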

Before returning data to the streaming layer for the execution of the k-means clustering approach, an optimal number of clusters is defined by means of the silhouette method, which measures compactness and separation of data [41]. Compactness refers to how similar each data point is to the other points in its own cluster, while separation refers to how dissimilar it is from the points in other clusters [42]. In this case, the optimal number of clusters is determined according to the collected data; therefore, a silhouette coefficient was calculated for an arbitrary range of clusters, from k = 3 to k = 9. As our data consider two cryptocurrencies, we excluded a k-value of 2 with the purpose of grouping data beyond their cryptocurrency symbol. The silhouette coefficient ranges between −1 and 1, with 1 denoting that clusters are well separated from each other and that the data points belonging to them are close to their centroid, while −1 denotes that data points are grouped in the wrong clusters and that their centers are not well separated [43]. In that context, the higher the value of the coefficient, the better the behavior of the clusters. We selected the optimal number of clusters according to these criteria. Figure 5 shows the calculated values of the silhouette coefficient for the numbers of clusters within the defined range. The highest coefficient is obtained by grouping data in 3 clusters; therefore, this is the number that we consider for k.
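To make the criterion concrete, the sketch below computes the mean silhouette coefficient for a given labeling, using the usual definition s = (b − a) / max(a, b), where a is the mean distance from a point to the other members of its cluster and b is the smallest mean distance to the members of any other cluster. It is a plain-Python illustration with Euclidean distance, not the routine used in the study; the function name is an assumption, and at least two clusters are required.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (requires at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # by convention a singleton cluster contributes 0
        # a: mean distance to the other members of the same cluster (compactness).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of another cluster (separation).
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

Selecting k then amounts to clustering the data once for each candidate value (here k = 3 to k = 9), scoring each labeling, and keeping the k with the highest score.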

Now that the optimal number of clusters is selected, data are modeled at the streaming layer in a second iteration. Thereafter, it was found that data are grouped according to their symbol in the first and second clusters; nevertheless, the third cluster concentrates data from both cryptocurrencies whose numbers of buy and sell transactions are significantly higher in comparison with the rest of the data; in consequence, the volume of bought and sold cryptocurrencies is also higher. In those cases, on average, the sentiment tends to be more positive as well as the number of tweets. Table 4 shows the average values of the grouped data, which indicate that cluster 3 groups data related to an increase in the activity over cryptocurrencies. In that sense, and in relation to findings from graphs in Figure 3 and Figure 4, we consider that clusters 1 and 2 contain data corresponding to a steady behavior of the cryptocurrencies while cluster 3 corresponds to data whose behavior is more volatile.
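A per-cluster summary such as the one in Table 4 can be derived from the cluster labels with a simple aggregation. The sketch below is illustrative only: the function name, the dictionary-based record layout, and the returned structure are assumptions, and every record is expected to carry the same numeric attributes.

```python
from collections import defaultdict


def cluster_profile(rows, labels):
    """Share of records and mean of every numeric attribute for each cluster."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    profile = {}
    for label, members in grouped.items():
        profile[label] = {
            # Fraction of all records that fall in this cluster.
            "share": len(members) / len(rows),
            # Attribute-wise average over the cluster's members.
            "means": {
                key: sum(member[key] for member in members) / len(members)
                for key in members[0]
            },
        }
    return profile
```

Comparing the `means` of one cluster against another is how relative differences between clusters (for instance, in the number of transactions or tweets) can be expressed as percentages.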

4. Discussion

Results show that the proposed iterative kappa architecture is useful for processing data and for determining patterns in real time. From the correlation analysis, it was found that there is a relation between the activity in social networks, i.e., Twitter, and the behavior of cryptocurrency markets. This evidences a positive correlation between the number of tweets and the buy and sell volumes of the cryptocurrencies. The findings support previous studies [19,22,23,24], in which it was found that the number of tweets and sentiment were positively correlated with cryptocurrencies’ transaction volumes and prices. In addition, by means of the k-means clustering approach, it was found that some data lie outside the common trends regarding transaction volumes of the cryptocurrencies. We have identified the outliers by grouping data in three clusters; two of them correspond to a steady behavior of the cryptocurrencies, while the third gathers data related to unusual transaction volumes. Thus, this latter group is useful for identifying anomalous behaviors in the market which are characterized mainly by a higher volume of tweets with a more positive sentiment, and higher transaction volumes.

From the executed k-means clustering approach, we have found that around 14% of data fall in the third cluster. In that cluster, on average, Bitcoin (BTC) was sold and bought around 128% and 170% more times than in cluster 1, while for Ethereum (ETH), the percentages were 54% and 52%, respectively, in comparison with cluster 2, thus resulting in higher transaction volumes. In both cases, the number of tweets was around 9.5% higher than in the first two clusters. Additionally, the sentiment of the tweets shows higher values (12% for Bitcoin (BTC) and 25% for Ethereum (ETH)) in the third cluster. The previous findings demonstrate that positive sentiment in the environment regarding cryptocurrencies promotes the activity in the market, which is consistent with the correlation found between the number of tweets and the buy and sell volumes.

The proposed architecture may be mistaken for a lambda architecture because both have a batch step; nevertheless, the two accomplish different tasks and serve different purposes. While the lambda architecture contains an extra batch layer that receives data simultaneously with the streaming layer, our proposed variant of the kappa architecture applies a batch step inside the existing serving layer to temporarily store processed data. In that sense, and in comparison with the simple kappa architecture, our proposal has the advantage of being able to relate various datasets in the second iteration by considering a different time span than the one selected for data streaming in the first iteration. It is a flexible architecture that offers an alternative to traditional techniques, i.e., relational databases, for real-time data processing and modeling [44].

The application of our proposal is not limited to the execution of a k-means clustering approach. Other unsupervised machine learning algorithms may be explored, such as hierarchical cluster analysis (HCA) or fuzzy C-means clustering, which could help find different patterns regarding the behavior of cryptocurrencies. In addition, supervised machine learning algorithms may be supported. Some other studies proposed a similar application of the kappa architecture to process and model data in real time [33,45]; nevertheless, our proposal differs in the way data are processed. None of the previous studies we found combined an iterative approach with a batch step involving a database management system and machine learning processing together. The iterative kappa architecture proposed in this study contributes to expanding the alternatives for real-time data processing with machine learning techniques. Even though this study considers only data from Twitter for a limited period of time and in a specific language, in future works, data from different social networks, e.g., Reddit and Telegram [14], over a longer period and in other languages can be evaluated. In addition, other machine learning algorithms may be explored within the architecture in order to widen the knowledge regarding the data. The integration of data from new data sources in order to analyze the architecture from a multidimensional approach also remains open for further studies. Finally, a higher volume of data and more attributes may be considered with the purpose of identifying whether other variables correlate with specific trends in the cryptocurrency market.

Author Contributions

Methodology, A.B.; Supervision, R.-M.C.-C.; Writing—review and editing, A.B. and A.T.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by UPAEP-University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained in real time from Twitter and CryptoCompare and are available at https://twitter.com, accessed on 14 January 2022 and https://www.cryptocompare.com, accessed on 14 January 2022 with the permission of Twitter and CryptoCompare.

Conflicts of Interest

The authors declare no conflict of interest.

References

Peters, G.; Panayi, E.; Chapelle, A. Trends in Cryptocurrencies and Blockchain Technologies: A Monetary Theory and Regulation Perspective. J. Financ. Perspect. 2017, 3, 1–46. [Google Scholar]
de Albuquerque, B.S.; de Castro Callado, M. Understanding Bitcoins: Facts and Questions. Rev. Bras. Econ. 2015, 69, 3–16. [Google Scholar] [CrossRef] [Green Version]
Hassani, H.; Huang, X.; Silva, E.S. Fusing Big Data, Blockchain, and Cryptocurrency. In Fusing Big Data, Blockchain and Cryptocurrency: Their Individual and Combined Importance in the Digital Economy; Hassani, H., Huang, X., Silva, E.S., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 99–117. [Google Scholar] [CrossRef]
Shen, D.; Urquhart, A.; Wang, P. Does Twitter Predict Bitcoin? Econ. Lett. 2019, 174, 118–122. [Google Scholar] [CrossRef]
Mallikarjuna, B.; Ramana, T.; Kallam, S.; Patan, R.; Manikandan, R. Visualizing Bitcoin Using Big Data Mempool Visualization, Visualization, Peer Visualization, Attack Visual Analysis, High-Resolution Visualization of Bitcoin Systems, Effectiveness. In Blockchain, Big Data and Machine Learning, 1st ed.; CRC Press: Boca Raton, FL, USA, 2020; pp. 155–176. [Google Scholar] [CrossRef]
Harwick, C. Cryptocurrency and the Problem of Intermediation. Independ. Rev. 2016, 20, 569–588. [Google Scholar]
CoinMarketCap. Bitcoin. Available online: https://coinmarketcap.com/currencies/bitcoin/ (accessed on 28 December 2021).
Antonopoulos, A.M.; Wood, G. Mastering Ethereum: Building Smart Contracts and DApps; O’Reilly Media, Inc.: Sevastopol, CA, USA, 2018. [Google Scholar]
Nizzoli, L.; Tardelli, S.; Avvenuti, M.; Cresci, S.; Tesconi, M.; Ferrara, E. Charting the Landscape of Online Cryptocurrency Manipulation. IEEE Access 2020, 8, 113230–113245. [Google Scholar] [CrossRef]
Tandon, C.; Revankar, S.; Palivela, H.; Parihar, S.S. How Can We Predict the Impact of the Social Media Messages on the Value of Cryptocurrency? Insights from Big Data Analytics. Int. J. Inf. Manag. Data Insights 2021, 1, 100035. [Google Scholar] [CrossRef]
Bitcoin Tweets Chart. Available online: https://bitinfocharts.com/comparison/bitcoin-tweets.html (accessed on 28 December 2021).
Internet Live Stats. Twitter Usage Statistics. Available online: https://www.internetlivestats.com/twitter-statistics/ (accessed on 28 December 2021).
Sayce, D. The Number of Tweets per Day in 2020. 2019. Available online: https://www.dsayce.com/social-media/tweets-day/ (accessed on 28 December 2021).
Rothman, T. Trading the Dream: Does Social Media Affect Investors Activity—The Story of Twitter, Telegram and Reddit. Int. J. Financ. Res. 2019, 10, 147–152. [Google Scholar] [CrossRef] [Green Version]
Nasdaq Data Link. Bitcoin Number of Transactions. 2021. Available online: https://data.nasdaq.com (accessed on 28 December 2021).
Campbell, S. Twitter Statistics 2022: How Many People Use Twitter? 2021. Available online: https://thesmallbusinessblog.net/twitter-statistics/ (accessed on 29 December 2021).
Ghani, N.A.; Hamid, S.; Targio Hashem, I.A.; Ahmed, E. Social Media Big Data Analytics: A Survey. Comput. Hum. Behav. 2019, 101, 417–428.
Bandi, A.; Hurtado, J.A. Big Data Streaming Architecture for Edge Computing Using Kafka and Rockset. In Proceedings of the 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 8–10 April 2021; pp. 323–329.
Mohapatra, S.; Ahmed, N.; Alencar, P. KryptoOracle: A Real-Time Cryptocurrency Price Prediction Platform Using Twitter Sentiments. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 5544–5551.
Bandi, A. Data Streaming Architecture for Visualizing Cryptocurrency Temporal Data. In Computer Networks, Big Data and IoT; Pandian, A., Fernando, X., Islam, S.M.S., Eds.; Springer: Singapore, 2021; Volume 66, pp. 651–661.
Horvat, N.; Ivkovic, V.; Todorovic, N.; Ivančević, V.; Gajić, D.; Lukovic, I. Big Data Architecture for Cryptocurrency Real-time Data Processing. In Proceedings of the ICIST 2020 Proceedings, Information Society of Serbia—ISOS, Belgrade, Serbia, 8–11 March 2020; pp. 150–155.
Abraham, J.; Higdon, D.; Nelson, J.; Ibarra, J. Cryptocurrency Price Prediction Using Tweet Volumes and Sentiment Analysis. SMU Data Sci. Rev. 2018, 1, 1–21.
Park, H.W.; Lee, Y. How Are Twitter Activities Related to Top Cryptocurrencies’ Performance? Evidence from Social Media Network and Sentiment Analysis. Drustvena Istrazivanja 2019, 28, 435–460.
Garcia, D.; Tessone, C.J.; Mavrodiev, P.; Perony, N. The Digital Traces of Bubbles: Feedback Cycles between Socio-Economic Signals in the Bitcoin Economy. J. R. Soc. Interface 2014, 11, 20140623.
Kjærland, F.; Meland, M.; Oust, A.; Øyen, V. How Can Bitcoin Price Fluctuations Be Explained? Int. J. Econ. Financ. Issues 2018, 8, 323–332.
Aharon, D.Y.; Demir, E.; Lau, C.K.M.; Zaremba, A. Twitter-Based Uncertainty and Cryptocurrency Returns; SSRN Scholarly Paper ID 3735435; Social Science Research Network: Rochester, NY, USA, 2020.
Baek, H.; Oh, J.; Kim, C.Y.; Lee, K. A Model for Detecting Cryptocurrency Transactions with Discernible Purpose. In Proceedings of the 2019 Eleventh International Conference on Ubiquitous and Future Networks (ICUFN), Zagreb, Croatia, 2–5 July 2019; pp. 713–717.
Aspembitova, A.T.; Feng, L.; Chew, L.Y. Behavioral Structure of Users in Cryptocurrency Market. PLoS ONE 2021, 16, e0242600.
Fang, J.; Chiu, D.K.W.; Ho, K.K.W. Exploring Cryptocurrency Sentiments with Clustering Text Mining on Social Media. In Intelligent Analytics with Advanced Multi-Industry Applications; Sun, Z., Ed.; IGI Global: Hershey, PA, USA, 2021; pp. 157–171.
Kreps, J. Questioning the Lambda Architecture. 2014. Available online: https://www.oreilly.com/radar/questioning-the-lambda-architecture/ (accessed on 28 December 2021).
Marz, N.; Warren, J. Lambda Architecture. In Big Data: Principles and Best Practices of Scalable Real-Time Data Systems; Manning Publications: Westhampton, NY, USA, 2015; p. 328.
Domínguez, J. De Lambda a Kappa: Evolución de las Arquitecturas Big Data. 2018. Available online: https://www.paradigmadigital.com/techbiz/de-lambda-a-kappa-evolucion-de-las-arquitecturas-big-data/ (accessed on 29 December 2021).
Nkamla Penka, J.B.; Mahmoudi, S.; Debauche, O. A New Kappa Architecture for IoT Data Management in Smart Farming. Procedia Comput. Sci. 2021, 191, 17–24.
ProjectPro. How Data Partitioning in Spark Helps Achieve More Parallelism? 2021. Available online: https://www.projectpro.io/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297 (accessed on 29 December 2021).
Sinaga, K.P.; Yang, M.S. Unsupervised K-Means Clustering Algorithm. IEEE Access 2020, 8, 80716–80727.
Likas, A.; Vlassis, N.; Verbeek, J. The Global K-Means Clustering Algorithm. Patt. Recognit. 2003, 36, 451–461.
Cryptocompare. Cryptocurrency API, Historical & Real-Time Market Data. Available online: https://min-api.cryptocompare.com (accessed on 14 January 2022).
Roesslein, J. Tweepy. Available online: https://www.tweepy.org/ (accessed on 4 January 2022).
Kuilboer, J.P.; Stull, T. Text Analytics and Big Data in the Financial Domain. In Proceedings of the 2021 16th Iberian Conference on Information Systems and Technologies (CISTI), Chaves, Portugal, 23–26 June 2021; pp. 1–4.
John Snow Labs. Spark NLP. Available online: https://nlp.johnsnowlabs.com/ (accessed on 4 January 2022).
Lengyel, A.; Botta-Dukát, Z. Silhouette Width Using Generalized Mean—A Flexible Method for Assessing Clustering Efficiency. Ecol. Evol. 2019, 9, 13231–13243.
Yuan, C.; Yang, H. Research on K-Value Selection Method of K-Means Clustering Algorithm. J 2019, 2, 226–235.
Hmwe, T.T.; Thein, N.Y.T.; Cho, K.M. Improving Clustering Quality Using Silhouette Score. J. Comput. Appl. Res. 2020, 1, 58–62.
IBM Cloud Education. What Is Data Modeling? 2020. Available online: https://www.ibm.com/cloud/learn/data-modeling (accessed on 20 January 2022).
Zschörnig, T.; Wehlitz, R.; Franczyk, B. A Personal Analytics Platform for the Internet of Things—Implementing Kappa Architecture with Microservice-based Stream Processing. In Proceedings of the 19th International Conference on Enterprise Information Systems, Porto, Portugal, 26–29 April 2017; SCITEPRESS—Science and Technology Publications: Porto, Portugal, 2017; pp. 733–738.

Figure 1. Proposed kappa architecture. Source: compiled by authors with data from [30]. “Apache Kafka” and “Apache Spark” are trademarks of the Apache Software Foundation. “TWITTER, TWEET, RETWEET and the Twitter Bird logo are trademarks of Twitter Inc. or its affiliates”.

Figure 2. Process diagram for the proposed kappa architecture. Source: compiled by authors.

Figure 3. Graphical representation of data related to cryptocurrency Bitcoin (BTC). Source: compiled by authors.

Figure 4. Graphical representation of data related to cryptocurrency Ethereum (ETH). Source: compiled by authors.

Figure 5. Silhouette coefficient related to the number of clusters. Source: compiled by authors.

Table 1. Attributes of each raw dataset obtained.

Twitter:

created at: 'Fri Jan 14 07:00:00 +0000 2022',
id: 61e173da2e853f6c8c8c92ff,
id str: '148197437765466521',
text: 'RT @Saki5786: @WatcherGuru A big transformation is on the way! The TIME HAS COME for #CryptoIslandDAO!NOW is the best time to start thi…',
truncated: True,
entities:
    hashtags: [],
    followers: [],
    user mentions: [],
    urls: [
        url: '',
        display url: 'twitter.com/i/web/status/1…',
        location: []],
metadata:
    iso language code: 'en',
    result type: 'recent',
href="https://mobile.twitter.com, accessed on 14 January 2022" rel="nofollow">Twitter Web App,

CryptoCompare:

date: "2022-01-14 07:00:00"
TYPE: "0"
M: "Coinbase"
FSYM: "BTC"
TSYM: "USD"
F: "2"
ID: "263883436"
TS: "1642165200"
Q: "0.00059115"
P: "42070.6406"
TOTAL: "24.8704"
RTS: "1642165200"
TSNS: "7000000000"
RTSNS: "392000000"

Table 2. Relation between Twitter and CryptoCompare datasets on a time basis of minutes.

Timestamp	Symb.	Tweets	Sent.	Sell Vol.	Sell Avg.	Sell No.	Buy Vol.	Buy Avg.	Buy No.
14 January 2022 T07:00:00.00	BTC	693	−165	1.77	42.1 *	101	3.98	42.1 *	160
14 January 2022 T07:00:00.00	ETH	878	−352	63.19	3.21 *	182	23.5	3.21 *	160
14 January 2022 T07:01:00.00	BTC	618	−124	4.9	42.0 *	154	5.11	42.0 *	213
14 January 2022 T07:01:00.00	ETH	809	−238	24.7	3.21 *	176	24.4	3.21 *	155
14 January 2022 T07:02:00.00	BTC	620	−160	0.38	42.0 *	95	2.06	42.0 *	165
14 January 2022 T07:02:00.00	ETH	815	−272	76.2	3.21 *	135	66.7	3.21 *	135

* Expressed in thousands.
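
The per-minute relation shown in Table 2 can be sketched as a bucketing step over the two raw record types of Table 1. This is a minimal illustration in plain Python and not the authors' Spark implementation. The function names `minute_key` and `aggregate` are hypothetical, the mapping of the CryptoCompare flag `F` to sell (`"1"`) and buy (`"2"`) is an assumption made for the example, bucket keys are expressed in UTC as an arbitrary choice, and the sentiment column is omitted.

```python
from collections import defaultdict
from datetime import datetime, timezone


def minute_key(ts_seconds):
    """Truncate a Unix timestamp (seconds) to the start of its UTC minute."""
    dt = datetime.fromtimestamp(int(ts_seconds), tz=timezone.utc)
    return dt.replace(second=0, microsecond=0).isoformat()


def aggregate(trades, tweets):
    """Group trades and tweets by (minute, symbol).

    trades: dicts with the CryptoCompare keys TS, FSYM, F, Q, P (Table 1).
    tweets: (unix_seconds, symbol) pairs; symbol tagging is assumed done upstream.
    Assumption: F == "1" marks a sell and any other value marks a buy.
    """
    rows = defaultdict(lambda: {
        "tweets": 0,
        "sell_vol": 0.0, "sell_no": 0, "sell_sum": 0.0,
        "buy_vol": 0.0, "buy_no": 0, "buy_sum": 0.0,
    })
    for t in trades:
        side = "sell" if t["F"] == "1" else "buy"
        row = rows[(minute_key(t["TS"]), t["FSYM"])]
        row[side + "_vol"] += float(t["Q"])
        row[side + "_no"] += 1
        row[side + "_sum"] += float(t["P"])
    for ts, symbol in tweets:
        rows[(minute_key(ts), symbol)]["tweets"] += 1

    # Convert the price sums into per-minute average prices.
    result = {}
    for key, row in rows.items():
        out = {"tweets": row["tweets"]}
        for side in ("sell", "buy"):
            n = row[side + "_no"]
            out[side + "_vol"] = row[side + "_vol"]
            out[side + "_no"] = n
            out[side + "_avg"] = row[side + "_sum"] / n if n else None
        result[key] = out
    return result


# Usage with the single CryptoCompare record of Table 1 and two made-up tweets.
sample_trade = {"TS": "1642165200", "FSYM": "BTC", "F": "2",
                "Q": "0.00059115", "P": "42070.6406"}
sample_tweets = [(1642165230, "BTC"), (1642165259, "BTC")]
for key, row in aggregate([sample_trade], sample_tweets).items():
    print(key, row)
```

Keying both streams by the same (minute, symbol) pair is what lets tweet counts and transaction figures end up in one row, as in Table 2.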

Table 3. Pearson correlation matrix.

	Tweets	Sent	Sell Vol.	Sell Avg.	Sell No.	Buy Vol.	Buy Avg.	Buy No.
symb.	-	-	-	-	-	-	-	-
tweets	1	-	-	-	-	-	-	-
sent	*	1	-	-	-	-	-	-
sell vol.	0.34	−0.07	1	-	-	-	-	-
sell avg.	−0.69	0.30	−0.39	1	-	-	-	-
sell no.	*	0.09	0.34	0.12	1	-	-	-
buy vol.	0.43	−0.10	0.52	−0.49	0.24	1	-	-
buy avg.	−0.69	0.30	−0.39	-	0.12	−0.49	1	-
buy no.	*	0.04	0.19	0.09	*	0.39	-	1

* Omitted: p-value < 0.05.
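
For reference, each coefficient in Table 3 is a Pearson correlation: the covariance of two columns divided by the product of their standard deviations. The sketch below computes it in plain Python. The helper name `pearson` is hypothetical, and the sample values are only the six Tweets and Buy Vol. entries printed in Table 2, so the result is an illustration and is not expected to reproduce the 0.43 reported in Table 3.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Illustrative input only: Tweets and Buy Vol. columns of Table 2.
tweets = [693, 878, 618, 809, 620, 815]
buy_vol = [3.98, 23.5, 5.11, 24.4, 2.06, 66.7]
r = pearson(tweets, buy_vol)
print(f"r = {r:.2f}")
```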

Table 4. Average values separated by cluster.

Cluster	Symb.	Avg. Tweets	Avg. Sent.	Avg. Sell Vol.	Avg. Sell Price	Avg. Sell No.	Avg. Buy Vol.	Avg. Buy Price	Avg. Buy No.	% Data
1	BTC	515	−82	4.7	42.8 *	145	5.65	42.8 *	223.4	42%
2	ETH	720	−103	31.0	3.27 *	125	39.24	3.27 *	189.9	38%
3	BTC	564	−72	16.4	43.0 *	332	33.25	43.0 *	603.7	14%
	ETH	789	−77	87.8	3.28 *	194	105.24	3.28 *	290.0	

* Expressed in thousands.
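
The averages in Table 4 allow a quick arithmetic check of how much more Twitter activity the atypical cluster shows. The sketch below compares the average tweet count of cluster 3 with that of the typical cluster for the same cryptocurrency. The helper name `relative_increase` is hypothetical, and comparing per symbol is only one possible reading of "the rest of the data".

```python
def relative_increase(atypical, typical):
    """Percentage by which `atypical` exceeds `typical`."""
    return (atypical - typical) / typical * 100


# Average tweet counts taken from Table 4.
btc = relative_increase(564, 515)  # cluster 3 (BTC) vs. cluster 1
eth = relative_increase(789, 720)  # cluster 3 (ETH) vs. cluster 2
print(f"BTC: {btc:.1f}%  ETH: {eth:.1f}%")  # BTC: 9.5%  ETH: 9.6%
```

Both values are consistent with the increase of about 9.5% in social media publications stated in the abstract.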

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

