Feature Papers in Big Data

A special issue of Informatics (ISSN 2227-9709). This special issue belongs to the section "Big Data Mining and Analytics".

Deadline for manuscript submissions: closed (31 December 2023) | Viewed by 30068

Special Issue Editor


Dr. Weitian Tong
Guest Editor
Department of Computer Science, Georgia Southern University, Statesboro, GA 30458, USA
Interests: big data processing; design of efficient algorithms; smart city; operations research

Special Issue Information

Dear Colleagues,

Big data is a field concerned with ways to analyze, systematically extract information from, or otherwise handle data sets that are too large or complex for traditional data-processing application software. Big data is now used extensively in healthcare, international development, education, media, insurance, the Internet of Things (IoT), and other fields.

This Special Issue of Informatics welcomes papers in the field of big data. The scope includes, but is not limited to, data capture and storage; search, sharing, and analytics; big data technologies; data visualization; architectures for massively parallel processing; data mining tools and techniques; machine learning algorithms for big data; cloud computing platforms; distributed file systems and databases; and scalable storage systems. Newly submitted papers must be pre-peer-reviewed by the Editorial Board. If your paper is well prepared and approved for further publication, you might be eligible for discounts on your publication.

Dr. Weitian Tong
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Informatics is an international peer-reviewed open access quarterly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • data capture and storage
  • big data technologies
  • data visualization
  • data mining tools and techniques
  • machine learning algorithms for big data
  • cloud computing platforms
  • distributed file systems and databases
  • scalable storage systems

Published Papers (8 papers)


Research

24 pages, 3743 KiB  
Article
Classifying Crowdsourced Citizen Complaints through Data Mining: Accuracy Testing of k-Nearest Neighbors, Random Forest, Support Vector Machine, and AdaBoost
by Evaristus D. Madyatmadja, Corinthias P. M. Sianipar, Cristofer Wijaya and David J. M. Sembiring
Informatics 2023, 10(4), 84; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics10040084 - 01 Nov 2023
Viewed by 1866
Abstract
Crowdsourcing has gradually become an effective e-government process to gather citizen complaints over the implementation of various public services. In practice, the collected complaints form a massive dataset, making it difficult for government officers to analyze the big data effectively. It is consequently vital to use data mining algorithms to classify the citizen complaint data for efficient follow-up actions. However, different classification algorithms produce varied classification accuracies. Thus, this study aimed to compare the accuracy of several classification algorithms on crowdsourced citizen complaint data. Taking the case of the LAKSA app in Tangerang City, Indonesia, this study included k-Nearest Neighbors, Random Forest, Support Vector Machine, and AdaBoost for the accuracy assessment. The data were taken from crowdsourced citizen complaints submitted to the LAKSA app, including those aggregated from official social media channels, from May 2021 to April 2022. The results showed SVM with a linear kernel as the most accurate among the assessed algorithms (89.2%). In contrast, AdaBoost (base learner: Decision Trees) produced the lowest accuracy. Still, the accuracy levels of all algorithms varied in parallel to the amount of training data available for the actual classification categories. Overall, the assessments on all algorithms indicated that their accuracies were insignificantly different, with an overall variation of 4.3%. The AdaBoost-based classification, in particular, showed its large dependence on the choice of base learners. Looking at the method and results, this study contributes to e-government, data mining, and big data discourses. This research recommends that governments continuously conduct supervised training of classification algorithms over their crowdsourced citizen complaints to seek the highest accuracy possible, paving the way for smart and sustainable governance.
(This article belongs to the Special Issue Feature Papers in Big Data)

18 pages, 1329 KiB  
Article
Analysis of Factors Associated with Highway Personal Car and Truck Run-Off-Road Crashes: Decision Tree and Mixed Logit Model with Heterogeneity in Means and Variances Approaches
by Thanapong Champahom, Panuwat Wisutwattanasak, Chamroeun Se, Chinnakrit Banyong, Sajjakaj Jomnonkwao and Vatanavongs Ratanavaraha
Informatics 2023, 10(3), 66; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics10030066 - 18 Aug 2023
Viewed by 1266
Abstract
Among several approaches to crash analysis, machine learning and econometric analysis have shown potential. This study aims to empirically examine factors influencing the single-vehicle crash for personal cars and trucks using decision trees (DT) and mixed binary logit with heterogeneity in means and variances (RPBLHMV) and compare model accuracy. The data in this study were obtained from the Department of Highway during 2011–2017, and the results indicated that the RPBLHMV was superior due to its higher overall prediction accuracy, sensitivity, and specificity values when compared to the DT model. According to the RPBLHMV results, car models showed that injury severity was associated with driver gender, seat belt, mount the island, defect equipment, and safety equipment. For the truck model, it was found that crashes located at intersections or medians, mounts on the island, and safety equipment have a significant influence on injury severity. DT results also showed that running off-road and hitting safety equipment can reduce the risk of death for car and truck drivers. This finding can illustrate the difference causing the dependent variable in each model. The RPBLHMV showed the ability to capture random parameters and unobserved heterogeneity. However, DT can be easily used to provide variable importance and show which factor has the most significance by sequencing. Each model has advantages and disadvantages. The study findings can give relevant authorities choices for measures and policy improvement based on two analysis methods in accordance with their policy design. Therefore, whether advocating road safety or improving policy measures, the use of appropriate methods can increase operational efficiency.
(This article belongs to the Special Issue Feature Papers in Big Data)

21 pages, 1931 KiB  
Article
Risk Factors Influencing Fatal Powered Two-Wheeler At-Fault and Not-at-Fault Crashes: An Application of Spatio-Temporal Hotspot and Association Rule Mining Techniques
by Reuben Tamakloe
Informatics 2023, 10(2), 43; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics10020043 - 12 May 2023
Cited by 1 | Viewed by 1408
Abstract
Studies have explored the factors influencing the safety of PTWs; however, very little has been carried out to comprehensively investigate the factors influencing fatal PTW crashes while considering the fault status of the rider in crash hotspot areas. This study employs spatio-temporal hotspot analysis and association rule mining techniques to discover hidden associations between crash risk factors that lead to fatal PTW crashes considering the fault status of the rider at statistically significant PTW crash hotspots in South Korea from 2012 to 2017. The results indicate the presence of consecutively fatal PTW crash hotspots concentrated within Korea’s densely populated capital, Seoul, and new hotspots near its periphery. According to the results, violations such as over-speeding and red-light running were critical contributory factors influencing PTW crashes at hotspots during summer and at intersections. Interestingly, while reckless riding was the main traffic violation leading to PTW rider at-fault crashes at hotspots, violations such as improper safety distance and red-light running were strongly associated with PTW rider not-at-fault crashes at hotspots. In addition, while PTW rider at-fault crashes are likely to occur during summer, PTW rider not-at-fault crashes mostly occur during spring. The findings could be used for developing targeted policies for improving PTW safety at hotspots.
(This article belongs to the Special Issue Feature Papers in Big Data)

12 pages, 1298 KiB  
Article
Predicting Future Promising Technologies Using LSTM
by Seol-Hyun Noh
Informatics 2022, 9(4), 77; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics9040077 - 27 Sep 2022
Cited by 1 | Viewed by 2728
Abstract
With advances in science and technology and changes in industry, research on promising future technologies has emerged as important. Furthermore, with the advent of a ubiquitous and smart environment, governments and enterprises are required to predict future promising technologies on which new important core technologies will be developed. Therefore, this study aimed to establish science and technology development strategies and support business activities by predicting future promising technologies using big data and deep learning models. The names of the “TOP 10 Emerging Technologies” from 2018 to 2021 selected by the World Economic Forum were used as keywords. Next, patents collected from the United States Patent and Trademark Office and the Science Citation Index (SCI) papers collected from the Web of Science database were analyzed using a time-series forecast. For each technology, the number of patents and SCI papers in 2022, 2023 and 2024 were predicted using the long short-term memory model with the number of patents and SCI papers from 1980 to 2021 as input data. Promising technologies are determined based on the predicted number of patents and SCI papers for the next three years. Keywords characterizing future promising technologies are extracted by analyzing abstracts of patent data collected for each technology and the term frequency-inverse document frequency is measured for each patent abstract. The research results can help business managers make optimal decisions in the present situation and provide researchers with an understanding of the direction of technology development.
(This article belongs to the Special Issue Feature Papers in Big Data)

27 pages, 3042 KiB  
Article
Bagging Machine Learning Algorithms: A Generic Computing Framework Based on Machine-Learning Methods for Regional Rainfall Forecasting in Upstate New York
by Ning Yu and Timothy Haskins
Informatics 2021, 8(3), 47; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics8030047 - 21 Jul 2021
Cited by 8 | Viewed by 3447
Abstract
Regional rainfall forecasting is an important issue in hydrology and meteorology. Machine learning algorithms, especially deep learning methods, have emerged as a part of prediction tools for regional rainfall forecasting. This paper aims to design and implement a generic computing framework that can assemble a variety of machine learning algorithms as computational engines for regional rainfall forecasting in Upstate New York. The algorithms that have been bagged in the computing framework include the classical algorithms and the state-of-the-art deep learning algorithms, such as K-Nearest Neighbors, Support Vector Machine, Deep Neural Network, Wide Neural Network, Deep and Wide Neural Network, Reservoir Computing, and Long Short Term Memory methods. Through the experimental results and the performance comparisons of these various engines, we have observed that the SVM- and KNN-based methods are outstanding models over other models in classification while DWNN- and KNN-based methods outstrip other models in regression, particularly those prevailing deep-learning-based methods, for handling uncertain and complex climatic data for precipitation forecasting. Meanwhile, the normalization methods such as Z-score and Minmax are also integrated into the generic computing framework for the investigation and evaluation of their impacts on machine learning models.
(This article belongs to the Special Issue Feature Papers in Big Data)

10 pages, 15347 KiB  
Article
Deep Learning and Parallel Processing Spatio-Temporal Clustering Unveil New Ionian Distinct Seismic Zone
by Antonios Konstantaras
Informatics 2020, 7(4), 39; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics7040039 - 29 Sep 2020
Cited by 9 | Viewed by 2364
Abstract
This research work employs theoretical and empirical expert knowledge in constructing an agglomerative parallel processing algorithm that performs spatio-temporal clustering upon seismic data. This is made possible by exploiting the spatial and temporal sphere of influence of the main earthquakes solely, clustering seismic events into a number of fuzzy bordered, interactive and yet potentially distinct seismic zones. To evaluate whether the unveiled clusters indeed depict a distinct seismic zone, deep learning neural networks are deployed to map seismic energy release rates with time intervals between consecutive large earthquakes. Such a correlation fails should there be influence by neighboring seismic areas, hence casting the seismic region as non-distinct, or if the extent of the seismic zone has not been captured fully. For the deep learning neural network to depict such a correlation requires a steady seismic energy input flow. To address that, the western area of the Hellenic seismic arc has been selected as a test case due to the nearly constant motion of the African plate that sinks beneath the Eurasian plate at a steady yearly rate. This causes a steady flow of strain energy stored in tectonic underground faults, i.e., the seismic energy storage elements; a partial release of which, when propagated all the way to the surface, casts as an earthquake. The results are complementary two-fold with the correlation between the energy release rates and the time interval amongst large earthquakes supporting the presence of a potential distinct seismic zone in the Ionian Sea and vice versa.
(This article belongs to the Special Issue Feature Papers in Big Data)

22 pages, 18775 KiB  
Article
Exploring Casual COVID-19 Data Visualizations on Twitter: Topics and Challenges
by Milka Trajkova, A’aeshah Alhakamy, Francesco Cafaro, Sanika Vedak, Rashmi Mallappa and Sreekanth R. Kankara
Informatics 2020, 7(3), 35; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics7030035 - 15 Sep 2020
Cited by 14 | Viewed by 11901
Abstract
Social networking sites such as Twitter have been a popular choice for people to express their opinions, report real-life events, and provide a perspective on what is happening around the world. In the outbreak of the COVID-19 pandemic, people have used Twitter to spontaneously share data visualizations from news outlets and government agencies and to post casual data visualizations that they individually crafted. We conducted a Twitter crawl of 5409 visualizations (from the period between 14 April 2020 and 9 May 2020) to capture what people are posting. Our study explores what people are posting, what they retweet the most, and the challenges that may arise when interpreting COVID-19 data visualization on Twitter. Our findings show that multiple factors, such as the source of the data, who created the chart (individual vs. organization), the type of visualization, and the variables on the chart influence the retweet count of the original post. We identify and discuss five challenges that arise when interpreting these casual data visualizations, and discuss recommendations that should be considered by Twitter users while designing COVID-19 data visualizations to facilitate data interpretation and to avoid the spread of misconceptions and confusion.
(This article belongs to the Special Issue Feature Papers in Big Data)

25 pages, 574 KiB  
Article
Automated Configuration of NoSQL Performance and Scalability Tactics for Data-Intensive Applications
by Davy Preuveneers and Wouter Joosen
Informatics 2020, 7(3), 29; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics7030029 - 08 Aug 2020
Cited by 6 | Viewed by 3518
Abstract
This paper presents the architecture, implementation and evaluation of a middleware support layer for NoSQL storage systems. Our middleware automatically selects performance and scalability tactics in terms of application specific workloads. Enterprises are turning to NoSQL storage technologies for their data-intensive computing and analytics applications. Comprehensive benchmarks of different Big Data platforms can help drive decisions on which solutions to adopt. However, selecting the best performing technology, configuring the deployment for scalability and tuning parameters at runtime for an optimal service delivery remain challenging tasks, especially when application workloads evolve over time. Our middleware solves this problem at runtime by monitoring the data growth, changes in the read-write-query mix, as well as other system metrics that are indicative of sub-optimal performance. Our middleware employs supervised machine learning on historic and current monitoring information and corresponding configurations to select the best combinations of high-level tactics and adapt NoSQL systems to evolving workloads. This work has been driven by two real world case studies with different QoS requirements. The evaluation demonstrates that our middleware can adapt to unseen workloads of data-intensive applications, and automate the configuration of different families of NoSQL systems at runtime to optimize the performance and scalability of such applications.
(This article belongs to the Special Issue Feature Papers in Big Data)