Research

24 pages, 3743 KiB

Open Access Article

Classifying Crowdsourced Citizen Complaints through Data Mining: Accuracy Testing of k-Nearest Neighbors, Random Forest, Support Vector Machine, and AdaBoost

by Evaristus D. Madyatmadja, Corinthias P. M. Sianipar, Cristofer Wijaya and David J. M. Sembiring

Informatics 2023, 10(4), 84; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics10040084 - 01 Nov 2023

Viewed by 1866

Abstract

Crowdsourcing has gradually become an effective e-government process to gather citizen complaints over the implementation of various public services. In practice, the collected complaints form a massive dataset, making it difficult for government officers to analyze the big data effectively. It is consequently vital to use data mining algorithms to classify the citizen complaint data for efficient follow-up actions. However, different classification algorithms produce varied classification accuracies. Thus, this study aimed to compare the accuracy of several classification algorithms on crowdsourced citizen complaint data. Taking the case of the LAKSA app in Tangerang City, Indonesia, this study included k-Nearest Neighbors, Random Forest, Support Vector Machine, and AdaBoost for the accuracy assessment. The data were taken from crowdsourced citizen complaints submitted to the LAKSA app, including those aggregated from official social media channels, from May 2021 to April 2022. The results showed SVM with a linear kernel as the most accurate among the assessed algorithms (89.2%). In contrast, AdaBoost (base learner: Decision Trees) produced the lowest accuracy. Still, the accuracy levels of all algorithms varied in parallel to the amount of training data available for the actual classification categories. Overall, the assessments on all algorithms indicated that their accuracies were insignificantly different, with an overall variation of 4.3%. The AdaBoost-based classification, in particular, showed its large dependence on the choice of base learners. Looking at the method and results, this study contributes to e-government, data mining, and big data discourses. This research recommends that governments continuously conduct supervised training of classification algorithms over their crowdsourced citizen complaints to seek the highest accuracy possible, paving the way for smart and sustainable governance.

(This article belongs to the Special Issue Feature Papers in Big Data)

18 pages, 1329 KiB

Open Access Article

Analysis of Factors Associated with Highway Personal Car and Truck Run-Off-Road Crashes: Decision Tree and Mixed Logit Model with Heterogeneity in Means and Variances Approaches

by Thanapong Champahom, Panuwat Wisutwattanasak, Chamroeun Se, Chinnakrit Banyong, Sajjakaj Jomnonkwao and Vatanavongs Ratanavaraha

Informatics 2023, 10(3), 66; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics10030066 - 18 Aug 2023

Viewed by 1266

Abstract

Among the several approaches to crash analysis, machine learning and econometric methods have both shown potential. This study empirically examines the factors influencing single-vehicle crashes for personal cars and trucks using decision trees (DT) and a mixed binary logit model with heterogeneity in means and variances (RPBLHMV), and compares the accuracy of the two models. The data were obtained from the Department of Highway for 2011–2017. The results indicated that the RPBLHMV was superior, with higher overall prediction accuracy, sensitivity, and specificity than the DT model. According to the RPBLHMV results, in the car model injury severity was associated with driver gender, seat belt use, mounting the island, defective equipment, and safety equipment. In the truck model, crashes located at intersections or medians, mounting the island, and safety equipment had a significant influence on injury severity. The DT results also showed that running off-road and hitting safety equipment can reduce the risk of death for car and truck drivers. This finding illustrates the differences in what influences the dependent variable in each model. The RPBLHMV was able to capture random parameters and unobserved heterogeneity, whereas DT can easily be used to provide variable importance and to show, by sequencing, which factor is most significant. Each model has advantages and disadvantages. The study findings can give relevant authorities choices for measures and policy improvement based on two analysis methods in accordance with their policy design. Therefore, whether advocating road safety or improving policy measures, the use of appropriate methods can increase operational efficiency.

(This article belongs to the Special Issue Feature Papers in Big Data)

21 pages, 1931 KiB

Open Access Article

Risk Factors Influencing Fatal Powered Two-Wheeler At-Fault and Not-at-Fault Crashes: An Application of Spatio-Temporal Hotspot and Association Rule Mining Techniques

by Reuben Tamakloe

Informatics 2023, 10(2), 43; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics10020043 - 12 May 2023

Cited by 1 | Viewed by 1408

Abstract

Studies have explored the factors influencing the safety of PTWs; however, very little has been carried out to comprehensively investigate the factors influencing fatal PTW crashes while considering the fault status of the rider in crash hotspot areas. This study employs spatio-temporal hotspot analysis and association rule mining techniques to discover hidden associations between crash risk factors that lead to fatal PTW crashes considering the fault status of the rider at statistically significant PTW crash hotspots in South Korea from 2012 to 2017. The results indicate the presence of consecutively fatal PTW crash hotspots concentrated within Korea’s densely populated capital, Seoul, and new hotspots near its periphery. According to the results, violations such as over-speeding and red-light running were critical contributory factors influencing PTW crashes at hotspots during summer and at intersections. Interestingly, while reckless riding was the main traffic violation leading to PTW rider at-fault crashes at hotspots, violations such as improper safety distance and red-light running were strongly associated with PTW rider not-at-fault crashes at hotspots. In addition, while PTW rider at-fault crashes are likely to occur during summer, PTW rider not-at-fault crashes mostly occur during spring. The findings could be used for developing targeted policies for improving PTW safety at hotspots.

(This article belongs to the Special Issue Feature Papers in Big Data)

12 pages, 1298 KiB

Open Access Article

Predicting Future Promising Technologies Using LSTM

by Seol-Hyun Noh

Informatics 2022, 9(4), 77; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics9040077 - 27 Sep 2022

Cited by 1 | Viewed by 2728

Abstract

With advances in science and technology and changes in industry, research on promising future technologies has emerged as important. Furthermore, with the advent of a ubiquitous and smart environment, governments and enterprises are required to predict future promising technologies on which new important core technologies will be developed. Therefore, this study aimed to establish science and technology development strategies and support business activities by predicting future promising technologies using big data and deep learning models. The names of the “TOP 10 Emerging Technologies” from 2018 to 2021 selected by the World Economic Forum were used as keywords. Next, patents collected from the United States Patent and Trademark Office and the Science Citation Index (SCI) papers collected from the Web of Science database were analyzed using a time-series forecast. For each technology, the number of patents and SCI papers in 2022, 2023 and 2024 were predicted using the long short-term memory model with the number of patents and SCI papers from 1980 to 2021 as input data. Promising technologies are determined based on the predicted number of patents and SCI papers for the next three years. Keywords characterizing future promising technologies are extracted by analyzing abstracts of patent data collected for each technology and the term frequency-inverse document frequency is measured for each patent abstract. The research results can help business managers make optimal decisions in the present situation and provide researchers with an understanding of the direction of technology development.

(This article belongs to the Special Issue Feature Papers in Big Data)
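
The abstract above mentions measuring term frequency-inverse document frequency (TF-IDF) for each patent abstract but does not say which TF-IDF variant was used. The following is a minimal sketch of one common formulation (relative term frequency multiplied by the natural log of the inverse document frequency). The function name `tf_idf`, the whitespace tokenizer, and the sample texts are illustrative assumptions, not the authors' pipeline.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / documents containing the term).
    """
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical stand-ins for patent abstracts.
abstracts = [
    "quantum sensor array",
    "quantum battery cell",
    "battery cell electrode",
]
abstract_scores = tf_idf(abstracts)
```

Under this formulation a term that appears in every document scores zero, so terms concentrated in a few abstracts rank highest as candidate keywords.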

27 pages, 3042 KiB

Open Access Feature Paper Article

Bagging Machine Learning Algorithms: A Generic Computing Framework Based on Machine-Learning Methods for Regional Rainfall Forecasting in Upstate New York

by Ning Yu and Timothy Haskins

Informatics 2021, 8(3), 47; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics8030047 - 21 Jul 2021

Cited by 8 | Viewed by 3447

Abstract

Regional rainfall forecasting is an important issue in hydrology and meteorology. Machine learning algorithms, especially deep learning methods, have emerged as prediction tools for regional rainfall forecasting. This paper aims to design and implement a generic computing framework that can assemble a variety of machine learning algorithms as computational engines for regional rainfall forecasting in Upstate New York. The algorithms that have been bagged in the computing framework include classical algorithms and state-of-the-art deep learning algorithms, such as K-Nearest Neighbors, Support Vector Machine, Deep Neural Network, Wide Neural Network, Deep and Wide Neural Network, Reservoir Computing, and Long Short-Term Memory methods. Through the experimental results and the performance comparisons of these various engines, we have observed that the SVM- and KNN-based methods outperform the other models in classification, while the DWNN- and KNN-based methods outstrip the other models in regression, particularly the prevailing deep-learning-based methods, when handling uncertain and complex climatic data for precipitation forecasting. Meanwhile, normalization methods such as Z-score and Minmax are also integrated into the generic computing framework to investigate and evaluate their impact on machine learning models.

(This article belongs to the Special Issue Feature Papers in Big Data)
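
The abstract above names Z-score and Minmax normalization without giving formulas. As a rough illustration, the standard textbook definitions can be sketched as follows. The function names, the use of the population standard deviation, the handling of constant inputs, and the sample rainfall values are all assumptions made here for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def z_score(values):
    """Rescale values to zero mean and unit population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Constant input: every value equals the mean (assumed convention).
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: map everything to 0.0 (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical daily rainfall readings in millimetres.
rainfall_mm = [0.0, 2.5, 5.0, 7.5, 10.0]
scaled = min_max(rainfall_mm)
standardized = z_score(rainfall_mm)
```

Min-max scaling preserves the shape of the distribution within fixed bounds but is sensitive to outliers, whereas z-scores are unbounded and centered on the mean, which is one reason a framework might evaluate both.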

10 pages, 15347 KiB

Open Access Article

Deep Learning and Parallel Processing Spatio-Temporal Clustering Unveil New Ionian Distinct Seismic Zone

by Antonios Konstantaras

Informatics 2020, 7(4), 39; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics7040039 - 29 Sep 2020

Cited by 9 | Viewed by 2364

Abstract

This research work employs theoretical and empirical expert knowledge in constructing an agglomerative parallel processing algorithm that performs spatio-temporal clustering upon seismic data. This is made possible by exploiting solely the spatial and temporal sphere of influence of the main earthquakes, clustering seismic events into a number of fuzzy-bordered, interactive and yet potentially distinct seismic zones. To evaluate whether the unveiled clusters indeed depict a distinct seismic zone, deep learning neural networks are deployed to map seismic energy release rates to the time intervals between consecutive large earthquakes. Such a correlation fails if there is influence from neighboring seismic areas, which casts the seismic region as non-distinct, or if the extent of the seismic zone has not been captured fully. For the deep learning neural network to depict such a correlation, a steady seismic energy input flow is required. To address this, the western area of the Hellenic seismic arc has been selected as a test case due to the nearly constant motion of the African plate, which sinks beneath the Eurasian plate at a steady yearly rate. This causes a steady flow of strain energy stored in tectonic underground faults, i.e., the seismic energy storage elements; a partial release of this energy, when propagated all the way to the surface, manifests as an earthquake. The results are complementary in two ways: the correlation between the energy release rates and the time intervals between large earthquakes supports the presence of a potentially distinct seismic zone in the Ionian Sea, and vice versa.

(This article belongs to the Special Issue Feature Papers in Big Data)

22 pages, 18775 KiB

Open Access Article

Exploring Casual COVID-19 Data Visualizations on Twitter: Topics and Challenges

by Milka Trajkova, A’aeshah Alhakamy, Francesco Cafaro, Sanika Vedak, Rashmi Mallappa and Sreekanth R. Kankara

Informatics 2020, 7(3), 35; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics7030035 - 15 Sep 2020

Cited by 14 | Viewed by 11901

Abstract

Social networking sites such as Twitter have been a popular choice for people to express their opinions, report real-life events, and provide a perspective on what is happening around the world. In the outbreak of the COVID-19 pandemic, people have used Twitter to spontaneously share data visualizations from news outlets and government agencies and to post casual data visualizations that they individually crafted. We conducted a Twitter crawl of 5409 visualizations (from the period between 14 April 2020 and 9 May 2020) to capture what people are posting. Our study explores what people are posting, what they retweet the most, and the challenges that may arise when interpreting COVID-19 data visualization on Twitter. Our findings show that multiple factors, such as the source of the data, who created the chart (individual vs. organization), the type of visualization, and the variables on the chart influence the retweet count of the original post. We identify and discuss five challenges that arise when interpreting these casual data visualizations, and discuss recommendations that should be considered by Twitter users while designing COVID-19 data visualizations to facilitate data interpretation and to avoid the spread of misconceptions and confusion.

(This article belongs to the Special Issue Feature Papers in Big Data)

25 pages, 574 KiB

Open Access Article

Automated Configuration of NoSQL Performance and Scalability Tactics for Data-Intensive Applications

by Davy Preuveneers and Wouter Joosen

Informatics 2020, 7(3), 29; https://0-doi-org.brum.beds.ac.uk/10.3390/informatics7030029 - 08 Aug 2020

Cited by 6 | Viewed by 3518

Abstract

This paper presents the architecture, implementation and evaluation of a middleware support layer for NoSQL storage systems. Our middleware automatically selects performance and scalability tactics based on application-specific workloads. Enterprises are turning to NoSQL storage technologies for their data-intensive computing and analytics applications. Comprehensive benchmarks of different Big Data platforms can help drive decisions about which solutions to adopt. However, selecting the best-performing technology, configuring the deployment for scalability, and tuning parameters at runtime for optimal service delivery remain challenging tasks, especially when application workloads evolve over time. Our middleware solves this problem at runtime by monitoring data growth, changes in the read-write-query mix, and other system metrics that are indicative of sub-optimal performance. Our middleware employs supervised machine learning on historic and current monitoring information and corresponding configurations to select the best combinations of high-level tactics and adapt NoSQL systems to evolving workloads. This work has been driven by two real-world case studies with different QoS requirements. The evaluation demonstrates that our middleware can adapt to unseen workloads of data-intensive applications, and automate the configuration of different families of NoSQL systems at runtime to optimize the performance and scalability of such applications.

(This article belongs to the Special Issue Feature Papers in Big Data)
