Trends of Data Science and Knowledge Discovery

A special issue of Future Internet (ISSN 1999-5903). This special issue belongs to the section "Big Data and Augmented Intelligence".

Deadline for manuscript submissions: closed (5 October 2022) | Viewed by 63648

Special Issue Information

Dear Colleagues,

In an increasingly digital world, the importance of data in our lives is growing significantly, and new approaches and solutions are emerging everywhere, in different formats and contexts. Data science (DS) is an interdisciplinary field that combines various areas, including computer science, machine learning, mathematics and statistics, domain/business knowledge, software development, and traditional research. As a research topic, DS applies scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

Knowledge discovery (KD) is the basis of data science and consists of creating knowledge from structured and unstructured sources (e.g., text, data, and images, among others).

Trending topics such as gamification, chatbots, and blockchain, among others, use data science and knowledge discovery to improve their solutions and to create emerging and pervasive environments.

This Special Issue is an excellent opportunity to share scientific knowledge and disseminate findings and achievements across several communities. It will discuss trends and new approaches in this area and present innovative solutions to show the importance of data science and knowledge discovery to researchers, managers, industry, society, and other communities.

Dr. Carlos Filipe Da Silva Portela
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Future Internet is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Applied data science
  • Artificial intelligence
  • Big data
  • Blockchain
  • Business intelligence
  • Chatbots
  • Data analytics
  • Data mining, text mining and image mining
  • Expert systems
  • Gamification
  • Intelligent data systems
  • Machine learning
  • Pervasive data
  • Smart cities

Published Papers (25 papers)

Research

30 pages, 2787 KiB  
Article
Detection of Malicious Websites Using Symbolic Classifier
by Nikola Anđelić, Sandi Baressi Šegota, Ivan Lorencin and Matko Glučina
Future Internet 2022, 14(12), 358; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14120358 - 29 Nov 2022
Cited by 4 | Viewed by 1911
Abstract
Malicious websites are web locations that attempt to install malware, the general term for anything that causes problems in computer operation, gathers confidential information, or gains total control over the computer. In this paper, a novel approach is proposed that applies the genetic programming symbolic classifier (GPSC) algorithm to a publicly available dataset to obtain a simple symbolic expression (mathematical equation) that can detect malicious websites with high classification accuracy. Due to a large class imbalance in the initial dataset, several data sampling methods (random undersampling/oversampling, ADASYN, SMOTE, BorderlineSMOTE, and KMeansSMOTE) were used to balance the dataset classes. For this investigation, a hyperparameter search method was developed to find the combination of GPSC hyperparameters with which high classification accuracy could be achieved. The first investigation was conducted using GPSC with a random hyperparameter search method, and each dataset variation was divided into train and test datasets in a ratio of 70:30. To evaluate each symbolic expression, its performance was measured on the train and test datasets, and the mean and standard deviation values of accuracy (ACC), AUC, precision, recall, and F1-score were obtained. The second investigation was also conducted using GPSC with the random hyperparameter search method; however, 70% of the data, i.e., the train dataset, was used to perform 5-fold cross-validation. If the mean accuracy, AUC, precision, recall, and F1-score values were above 0.97, then final training and testing (train/test 70:30) were performed with GPSC using the same randomly chosen hyperparameters as in the 5-fold cross-validation process, and the final mean and standard deviation values of the aforementioned evaluation metrics were obtained.
In both investigations, the best symbolic expression was obtained when the dataset balanced with the KMeansSMOTE method was used for training and testing. The best symbolic expression obtained using GPSC with the random hyperparameter search method and the classic train–test procedure (70:30) on a dataset balanced with the KMeansSMOTE method achieved values of $\overline{\mathrm{ACC}}$, $\overline{\mathrm{AUC}}$, $\overline{\mathrm{Precision}}$, $\overline{\mathrm{Recall}}$, and $\overline{\mathrm{F1\text{-}score}}$ (mean ± standard deviation) of $0.9992 \pm 2.249 \times 10^{-5}$, $0.9995 \pm 9.945 \times 10^{-6}$, $0.9995 \pm 1.09 \times 10^{-5}$, $0.999 \pm 5.17 \times 10^{-5}$, and $0.9992 \pm 5.17 \times 10^{-6}$, respectively. The best symbolic expression obtained using GPSC with the random hyperparameter search method and 5-fold cross-validation on a dataset balanced with the KMeansSMOTE method achieved values of $\overline{\mathrm{ACC}}$, $\overline{\mathrm{AUC}}$, $\overline{\mathrm{Precision}}$, $\overline{\mathrm{Recall}}$, and $\overline{\mathrm{F1\text{-}score}}$ (mean ± standard deviation) of $0.9994 \pm 1.13 \times 10^{-5}$, $0.9994 \pm 1.2 \times 10^{-5}$, $1.0 \pm 0$, $0.9988 \pm 2.4 \times 10^{-5}$, and $0.9994 \pm 1.2 \times 10^{-5}$, respectively.
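The evaluation protocol described in this abstract (balance the classes, split 70:30, and proceed to final training only if every mean cross-validation metric exceeds 0.97) might be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function names are hypothetical, plain random oversampling stands in for the more elaborate samplers such as KMeansSMOTE, and balancing before splitting is an assumption read from the order of the abstract.

```python
import random
from statistics import mean


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes have equal counts."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def train_test_split_70_30(rows, labels, seed=0):
    """Shuffle the indices and cut them at 70% (hypothetical helper)."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(round(0.7 * len(idx)))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


def passes_cv_gate(fold_scores, threshold=0.97):
    """fold_scores maps a metric name to its per-fold values.

    Returns True only if the mean of every metric is above the threshold,
    mirroring the 0.97 condition stated in the abstract.
    """
    return all(mean(values) > threshold for values in fold_scores.values())
```

For example, `passes_cv_gate({"acc": [0.98, 0.99], "f1": [0.96, 0.97]})` rejects the hyperparameter set because the mean F1 is 0.965.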
(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)

29 pages, 6579 KiB  
Article
Integrating ISA and Part-of Domain Knowledge into Process Model Discovery
by Alessio Bottrighi, Marco Guazzone, Giorgio Leonardi, Stefania Montani, Manuel Striani and Paolo Terenziani
Future Internet 2022, 14(12), 357; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14120357 - 28 Nov 2022
Cited by 1 | Viewed by 1564
Abstract
The traces of process executions are a strategic source of information, from which a model of the process can be mined. In our recent work, we have proposed SIM (semantic interactive miner), an innovative process mining tool to discover the process model incrementally: it supports the interaction with domain experts, who can selectively merge parts of the model to achieve compactness, generalization, and reduced redundancy. We now propose a substantial extension of SIM, making it able to exploit (both automatically and interactively) pre-encoded taxonomic knowledge about the refinement (ISA relations) and composition (part-of relations) of process activities, as is available in many domains. The extended approach allows analysts to move from a process description where activities are reported at the ground level to more user-interpretable/compact descriptions, in which sets of such activities are abstracted into the “macro-activities” subsuming them or constituted by them. An experimental evaluation based on a real-world setting (stroke management) illustrates the advantages of our approach.

18 pages, 20430 KiB  
Article
Recursive Feature Elimination for Improving Learning Points on Hand-Sign Recognition
by Rung-Ching Chen, William Eric Manongga and Christine Dewi
Future Internet 2022, 14(12), 352; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14120352 - 26 Nov 2022
Cited by 7 | Viewed by 1784
Abstract
Hand gestures and poses allow us to perform non-verbal communication. Sign language is becoming more important with the increase in the number of deaf and hard-of-hearing communities. However, learning to understand sign language is very difficult and time consuming. Researchers are still trying to find a better way to understand sign language with the help of technology. The accuracy of most hand-sign detection methods still needs to be improved for real-life usage. In this study, MediaPipe is used for hand feature extraction. MediaPipe can extract 21 hand landmarks from a hand image. Hand-pose detection using hand landmarks is chosen since it reduces the interference from the image background and uses fewer parameters than traditional hand-sign classification using pixel-based features and CNNs. The Recursive Feature Elimination (RFE) method, using a novel distance from the hand landmark to the palm centroid, is proposed for feature selection to improve the accuracy of digit hand-sign detection. We used three different datasets in this research to train models with different numbers of features: the original 21 features, 15 features, and 10 features. A fourth dataset, which was not used to train any model, was used to evaluate the performance of these trained models. The results of this study show that removing the non-essential hand landmarks can improve the accuracy of the models in detecting digit hand signs. Models trained using fewer features have higher accuracy than models trained using the original 21 features. The model trained with 10 features also shows better accuracy than the models trained using 21 features and 15 features.
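The landmark-to-palm-centroid distance mentioned in this abstract could be computed along the following lines. This is only a sketch under stated assumptions: the abstract does not define the palm centroid, so here it is taken as the mean of the wrist and the four finger-base landmarks (indices 0, 5, 9, 13, 17 in the MediaPipe Hands numbering). The function names are hypothetical, and the RFE step itself, which needs a trained model to rank features, is not shown.

```python
import math

# Assumption: palm centroid = mean of wrist (0) and finger-base joints (5, 9, 13, 17),
# following the MediaPipe Hands landmark numbering. The paper may define it differently.
PALM_IDS = (0, 5, 9, 13, 17)


def centroid_distances(landmarks, palm_ids=PALM_IDS):
    """Return one feature per landmark: its Euclidean distance to the palm centroid.

    landmarks is a sequence of 21 (x, y) pairs.
    """
    cx = sum(landmarks[i][0] for i in palm_ids) / len(palm_ids)
    cy = sum(landmarks[i][1] for i in palm_ids) / len(palm_ids)
    return [math.hypot(x - cx, y - cy) for x, y in landmarks]


def keep_features(features, selected_ids):
    """Keep only the features at the selected landmark indices (e.g. 10 of 21)."""
    return [features[i] for i in selected_ids]
```

A feature selector would then choose which of the 21 distances to pass to `keep_features`; which landmarks count as non-essential is determined by the selection procedure, not by this sketch.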

10 pages, 832 KiB  
Article
Hate and False Metaphors: Implications to Emerging E-Participation Environment
by Sreejith Alathur, Naganna Chetty, Rajesh R. Pai, Vishal Kumar and Sahraoui Dhelim
Future Internet 2022, 14(11), 314; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14110314 - 31 Oct 2022
Cited by 4 | Viewed by 1580
Abstract
This study aims to investigate the effect of metaphorical content on e-participation in healthcare. With this objective, the study assesses the awareness and capability of e-participants to navigate through healthcare metaphors during their participation. Healthcare-related e-participation data were collected from the Twitter platform. Data analysis includes (i) awareness measurements by topic modelling and sentiment analysis and (ii) participation abilities by problem-based learning models. Findings show that a lack of effort to validate metaphors harms e-participation levels and awareness, resulting in a problematic health environment. Exploring metaphors in these intricate forums has the potential to enhance service delivery. Improving web service delivery requires valuable input from stakeholders on the application of metaphors in the health domain.

17 pages, 2789 KiB  
Article
Decision Support Using Machine Learning Indication for Financial Investment
by Ariel Vieira de Oliveira, Márcia Cristina Schiavi Dazzi, Anita Maria da Rocha Fernandes, Rudimar Luis Scaranto Dazzi, Paulo Ferreira and Valderi Reis Quietinho Leithardt
Future Internet 2022, 14(11), 304; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14110304 - 25 Oct 2022
Cited by 2 | Viewed by 1875
Abstract
To support the decision-making process of new investors, this paper aims to implement Machine Learning algorithms to generate investment indications, considering the Brazilian scenario. Three artificial intelligence techniques were implemented, namely: Multilayer Perceptron, Logistic Regression and Decision Tree, which performed the classification of investments. The database used was the one provided by the website Oceans14, containing the history of Fundamental Indicators and the history of Quotations, considering BOVESPA (São Paulo State Stock Exchange). The results of the different algorithms were compared to each other using the following metrics: accuracy, precision, recall, and F1-score. The Decision Tree was the algorithm that obtained the best classification metrics and an accuracy of 77%.

22 pages, 841 KiB  
Article
Analysis of Data from Surveys for the Identification of the Factors That Influence the Migration of Small Companies to eCommerce
by William Villegas-Ch., Santiago Criollo-C, Walter Gaibor-Naranjo and Xavier Palacios-Pacheco
Future Internet 2022, 14(11), 303; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14110303 - 25 Oct 2022
Cited by 2 | Viewed by 1601
Abstract
Currently, medium and small businesses face a significant change in the way consumers purchase their services, products, or goods. This change is fundamentally due to the pandemic caused by the 2019 coronavirus disease, during which people were forced to use information and communication technologies to satisfy their needs and interact with other people. After the pandemic, people’s dependence on technology increased exponentially, to such an extent that the Internet has become the channel through which any product can be purchased in an agile and varied way, from the comfort of home, and regardless of schedules. For companies, moving from the traditional market to eCommerce is therefore a necessity, but the change must take place efficiently. Consequently, identifying the factors that influence consumers to access a brand, a service, or a product is a characteristic of eCommerce. This paper presents an analysis of the factors that influence the use of electronic commerce. To this end, a review of similar works was carried out to design the surveys and to identify the critical points considered by consumers. These data were analyzed in a granular way with tools used in business intelligence to improve decision making in the migration to a digital market.

23 pages, 1292 KiB  
Article
Towards Reliable Baselines for Document-Level Sentiment Analysis in the Czech and Slovak Languages
by Ján Mojžiš, Peter Krammer, Marcel Kvassay, Lenka Skovajsová and Ladislav Hluchý
Future Internet 2022, 14(10), 300; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14100300 - 19 Oct 2022
Cited by 3 | Viewed by 1726
Abstract
This article helps establish reliable baselines for document-level sentiment analysis in highly inflected languages like Czech and Slovak. We revisit an earlier study representing the first comprehensive formulation of such baselines in Czech and show that some of its reported results need to be significantly revised. More specifically, we show that its online product review dataset contained more than 18% of non-trivial duplicates, which incorrectly inflated its macro F1-measure results by more than 19 percentage points. We also establish that part-of-speech-related features have no damaging effect on machine learning algorithms (contrary to the claim made in the study) and rehabilitate the Chi-squared metric for feature selection as being on par with the best performing metrics such as Information Gain. We demonstrate that in feature selection experiments with Information Gain and Chi-squared metrics, the top 10% of ranked unigram and bigram features suffice for the best results regarding online product and movie reviews, while the top 5% of ranked unigram and bigram features are optimal for the Facebook dataset. Finally, we reiterate an important but often ignored warning by George Forman and Martin Scholz that different possible ways of averaging the F1-measure in cross-validation studies of highly unbalanced datasets can lead to results differing by more than 10 percentage points. This can invalidate the comparisons of F1-measure results across different studies if incompatible ways of averaging F1 are used.
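The warning about averaging F1 across cross-validation folds can be made concrete with a small sketch. The two functions below contrast the mean of per-fold F1 scores with a single F1 computed from pooled counts. The fold counts in the usage note are made up for illustration, and treating an undefined F1 as 0 is a convention assumed here, not something the abstract specifies.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when undefined (an assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_mean_of_folds(folds):
    """Compute F1 separately in each fold, then average the scores."""
    return sum(f1(tp, fp, fn) for tp, fp, fn in folds) / len(folds)


def f1_pooled(folds):
    """Sum TP, FP, and FN over all folds, then compute one F1."""
    tp = sum(fold[0] for fold in folds)
    fp = sum(fold[1] for fold in folds)
    fn = sum(fold[2] for fold in folds)
    return f1(tp, fp, fn)
```

With the illustrative folds `[(1, 0, 0), (0, 1, 1), (5, 5, 0)]`, given as (TP, FP, FN), the mean of per-fold scores is 5/9 while the pooled score is 12/19, so the same predictions yield two different reported numbers. The gap tends to widen when positives are rare and some folds contain few or no true positives.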

20 pages, 2244 KiB  
Article
Complex Cases of Source Code Authorship Identification Using a Hybrid Deep Neural Network
by Anna Kurtukova, Aleksandr Romanov, Alexander Shelupanov and Anastasia Fedotova
Future Internet 2022, 14(10), 287; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14100287 - 30 Sep 2022
Cited by 1 | Viewed by 2223
Abstract
This paper is a continuation of our previous work on solving source code authorship identification problems. The analysis of heterogeneous source code is a relevant issue for copyright protection in commercial software development. This is related to the specificity of development processes and the usage of collaborative development tools (version control systems). As a result, there are source codes written according to different programming standards by a team of programmers with different skill levels. Another application field is information security, in particular identifying the authors of computer viruses. We apply our technique, based on a hybrid of the Inception-v1 and Bidirectional Gated Recurrent Units architectures, to heterogeneous source codes and consider the most common complex cases in commercial development that negatively affect the authorship identification process. The paper is devoted to the possibilities and limitations of the authors’ technique in various complex cases. For situations where a programmer was proficient in two programming languages, the average accuracy was 87%; for proficiency in three or more, it was 76%. For the artificially generated source code case, the average accuracy was 81.5%. Finally, the average accuracy for source codes generated from commits was 84%. The comparison with state-of-the-art approaches showed that the proposed method has no full-functionality analogs covering actual practical cases.

14 pages, 3771 KiB  
Article
Deep Learning Based Semantic Image Segmentation Methods for Classification of Web Page Imagery
by Ramya Krishna Manugunta, Rytis Maskeliūnas and Robertas Damaševičius
Future Internet 2022, 14(10), 277; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14100277 - 27 Sep 2022
Cited by 3 | Viewed by 2209
Abstract
Semantic segmentation is the task of clustering together parts of an image that belong to the same object class. Semantic segmentation of webpages is important for inferring contextual information from the webpage. This study examines and compares deep learning methods for classifying webpages based on imagery that is obscured by semantic segmentation. Fully convolutional neural network architectures (UNet and FCN-8) with defined hyperparameters and loss functions are used to demonstrate how they can support efficient classification of this type on custom-prepared webpage imagery data labeled with multi-class, semantically segmented masks for HTML elements such as paragraph text, images, logos, and menus. The proposed Seg-UNet model achieved the best accuracy of 95%. A comparison with various optimizer functions demonstrates the overall efficacy of the proposed semantic segmentation approach.

21 pages, 2780 KiB  
Article
An Efficient Blockchain Transaction Retrieval System
by Hangwei Feng, Jinlin Wang and Yang Li
Future Internet 2022, 14(9), 267; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14090267 - 15 Sep 2022
Cited by 4 | Viewed by 1794
Abstract
In the era of the digital economy, blockchain has developed well in various fields, such as finance and digital copyright, due to its unique decentralization and traceability characteristics. However, blockchain has gradually exposed a storage problem, and current blockchains store block data in third-party storage systems to reduce the storage pressure on nodes. This new storage method creates a transaction retrieval problem: when the block containing a transaction cannot be located, the user must fetch the entire blockchain ledger from the third-party storage system, resulting in huge communication overhead. To address this problem, we exploit the semi-structured data in the blockchain and extract universal blockchain transaction characteristics, such as account address and time. We then establish a blockchain transaction retrieval system. In response to the lack of an efficient retrieval data structure, we propose a scalable secondary search data structure, the BB+ tree, for account address and introduce the I2B+ tree for time. Finally, we analyze the proposed scheme’s performance through experiments. The experimental results show that our system is superior to the existing methods in single-feature retrieval, concurrent retrieval, and multi-feature hybrid retrieval. The retrieval time is reduced by 40.54% under single-feature retrieval and by 43.16% under multi-feature hybrid retrieval. The system also has better stability across different block sizes and concurrent retrieval scales.

15 pages, 339 KiB  
Article
Use of Data Augmentation Techniques in Detection of Antisocial Behavior Using Deep Learning Methods
by Viera Maslej-Krešňáková, Martin Sarnovský and Júlia Jacková
Future Internet 2022, 14(9), 260; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14090260 - 31 Aug 2022
Cited by 8 | Viewed by 1985
Abstract
The work presented in this paper focuses on the use of data augmentation techniques applied in the domain of the detection of antisocial behavior. Data augmentation is a frequently used approach to overcome issues related to the lack of data or problems related to imbalanced classes. Such techniques are used to generate artificial data samples that increase the volume of the training set or balance the target distribution. In the antisocial behavior detection domain, we frequently face both issues: the lack of quality labeled data as well as class imbalance. As the majority of the data in this domain is textual, we must consider augmentation methods suitable for NLP tasks. Easy data augmentation (EDA) represents a group of such methods utilizing simple text transformations to create new, artificial samples. Our main motivation is to explore the usability of EDA techniques on selected tasks from the antisocial behavior detection domain. We focus on the class imbalance problem and apply EDA techniques to two problems: fake news and toxic comments classification. In both cases, we train a convolutional neural network classifier and compare its performance on the original and EDA-extended datasets. EDA techniques prove to be very task-dependent, with certain limitations resulting from the data they are applied to. The model’s performance on the extended toxic comments dataset improved only marginally, gaining only 0.01 in the F1 metric when applying only a subset of EDA methods. EDA techniques in this case were not suitable enough to handle texts written in more informal language. On the other hand, on the fake news dataset, the performance improved more significantly, boosting the F1 score by 0.1. Improvement was most significant in the prediction of the minority class, where F1 improved from 0.67 to 0.86.
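The "simple text transformations" that EDA relies on can be illustrated with two of the operations from the original EDA method, random swap and random deletion, which need no synonym dictionary. The sketch below is a minimal standard-library version with hypothetical function names. The abstract does not say which subset of EDA operations the authors applied, so this should not be read as their configuration.

```python
import random


def random_swap(words, n_swaps, rng):
    """Swap the positions of two randomly chosen words, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word independently with probability p, keeping at least one word."""
    words = list(words)
    if not words:
        return words
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]
```

To rebalance a dataset, one would apply such transformations to minority-class texts only, for example `random_swap(text.split(), 1, random.Random(0))`, and add the results to the training set while leaving the test set untouched.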

17 pages, 5766 KiB  
Article
Translating Speech to Indian Sign Language Using Natural Language Processing
by Purushottam Sharma, Devesh Tulsian, Chaman Verma, Pratibha Sharma and Nancy Nancy
Future Internet 2022, 14(9), 253; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14090253 - 25 Aug 2022
Cited by 7 | Viewed by 4121
Abstract
Language plays a vital role in the communication of ideas, thoughts, and information to others. Hearing-impaired people also understand our thoughts using a language known as sign language. Every country has a different sign language, which is based on its native language. In our research paper, our major focus is on Indian Sign Language, which is mostly used by hearing- and speaking-impaired communities in India. While communicating our thoughts and views with others, one of the most essential factors is listening. What if the other party is not able to hear or grasp what you are talking about? This situation is faced by nearly every hearing-impaired person in our society. This led to the idea of introducing an audio-to-Indian Sign Language translation system that can close this gap in communication between hearing-impaired people and society. The system accepts audio and text as input and matches it with the videos present in the database created by the authors. If a match is found, it shows the corresponding sign movements based on the grammar rules of Indian Sign Language as output; if not, the input goes through the processes of tokenization and lemmatization. The heart of the system is natural language processing, which equips the system with tokenization, parsing, lemmatization, and part-of-speech tagging.

11 pages, 1342 KiB  
Article
Zero-Inflated Patent Data Analysis Using Generating Synthetic Samples
by Daiho Uhm and Sunghae Jun
Future Internet 2022, 14(7), 211; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14070211 - 16 Jul 2022
Cited by 3 | Viewed by 1462
Abstract
Due to the expansion of the internet, we encounter various types of big data such as web documents or sensing data. Compared to traditional small data such as experimental samples, big data provide more chances to find hidden and novel patterns through big data analysis using statistics and machine learning algorithms. However, as the use of big data increases, problems also occur. One of them is the zero-inflated problem in structured data preprocessed from big data. Most count values are zeros because a specific word is found in only some documents. In particular, since most patent data are in the form of text documents, they are more affected by the zero-inflated problem. To solve this problem, we propose generating synthetic samples using statistical inference and a tree structure. Using patent documents and simulation data, we verify the performance and validity of our proposed method. In this paper, we focus on patent keyword analysis as text big data analysis, and we encounter the zero-inflated problem just as with other text data.

15 pages, 1704 KiB  
Article
N-Trans: Parallel Detection Algorithm for DGA Domain Names
by Cheng Yang, Tianliang Lu, Shangyi Yan, Jianling Zhang and Xingzhan Yu
Future Internet 2022, 14(7), 209; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14070209 - 13 Jul 2022
Cited by 3 | Viewed by 1797
Abstract
Domain name generation algorithms are widely used in malware, such as botnet binaries, to generate large sequences of domain names, some of which are registered by cybercriminals. Accurate detection of malicious domains can effectively defend against cyber attacks. Many researchers have explored the detection of such malicious domain names with traditional machine learning algorithms, but the results remain imperfect. To improve on this, we propose a novel parallel detection model named N-Trans that combines the N-gram algorithm with the Transformer model. First, we add flag bits to the first and last positions of the domain name so that the N-gram algorithm and the Transformer framework can be combined in parallel to detect a domain name. The model can effectively extract letter combination features and capture the position features of letters in the domain name, such as the first and last letters and the positional relationship between letters. In addition, it can accurately distinguish between legitimate and malicious domain names. The experimental dataset consists of legitimate domain names from Alexa and malicious domain names collected by 360 Security Lab. The experimental results show that the parallel detection model based on N-gram and Transformer achieves 96.97% accuracy for DGA malicious domain name detection. It can effectively and accurately identify malicious domain names and outperforms mainstream malicious domain name detection algorithms.
(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)
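The preprocessing step mentioned in the abstract above, adding flag bits at the first and last positions of a domain name before extracting N-grams, might look roughly like the following sketch. The flag characters `^` and `$`, the function name `char_ngrams`, and the default `n=2` are assumptions made for illustration; the abstract does not specify them, and the Transformer part of the model is not shown.

```python
def char_ngrams(domain, n=2, start_flag="^", end_flag="$"):
    """Return character n-grams of a domain name wrapped in boundary flags.

    The flags make the first and last letters distinguishable from
    the same letters occurring in the middle of the name.
    """
    marked = f"{start_flag}{domain.lower()}{end_flag}"
    # Slide a window of width n over the flagged string.
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


# Hypothetical domain label used only as an example input.
bigrams = char_ngrams("abc")
```

Because the boundary markers are part of the string, the grams `^a` and `c$` encode position as well as letter identity, which is one simple way to expose first-letter and last-letter features to a downstream model.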

13 pages, 3898 KiB  
Article
Correlation between Human Emotion and Temporal·Spatial Contexts by Analyzing Environmental Factors
by Minwoo Park and Euichul Lee
Future Internet 2022, 14(7), 203; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14070203 - 30 Jun 2022
Viewed by 1446
Abstract
In this paper, we propose a method for extracting emotional factors through audiovisual quantitative feature analysis from images of the surrounding environment. Nine features were extracted: time complexity, spatial complexity (horizontal and vertical), color components (hue and saturation), intensity, contrast, sound amplitude, and sound frequency. These nine features were used to infer “pleasant-unpleasant” and “arousal-relaxation” scores through two support vector regressions (SVR). First, the inference accuracy for each of the nine features was calculated as a hit ratio to check the distinguishing power of the features. Next, the difference between the position in the two-dimensional emotional plane inferred through SVR and the ground truth determined subjectively by the subject was examined. The experiment confirmed that the time-complexity feature had the best classification performance and that the emotion inferred through SVR can be valid when the two-dimensional emotional plane is divided into a 3 × 3 grid.
(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)

23 pages, 717 KiB  
Article
A Multi-View Framework to Detect Redundant Activity Labels for More Representative Event Logs in Process Mining
by Qifan Chen, Yang Lu, Charmaine S. Tam and Simon K. Poon
Future Internet 2022, 14(6), 181; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14060181 - 09 Jun 2022
Cited by 2 | Viewed by 1610
Abstract
Process mining aims to gain knowledge of business processes via the discovery of process models from event logs generated by information systems. The insights revealed by process mining rely heavily on the quality of the event logs. Activities extracted from different data sources, or entered as free text within the same system, may have inconsistent labels. Such inconsistency leads to redundancy in activity labels, that is, labels that differ in syntax but share the same behaviour. Redundant activity labels can introduce unnecessary complexity into the event logs. Identifying these labels through data-driven process discovery is difficult and relies heavily on human intervention. Neither existing process discovery algorithms nor event data preprocessing techniques can resolve such redundancy efficiently. In this paper, we propose a multi-view approach to automatically detect redundant activity labels by using not only context-aware features such as control-flow relations and attribute values but also semantic features from the event logs. Our evaluation on several publicly available datasets and a real-life case study demonstrates that our approach can efficiently detect redundant activity labels even with low occurrence frequencies. The proposed approach can add value to the preprocessing step to generate more representative event logs.
(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)

22 pages, 5436 KiB  
Article
Gamifying Community Education for Enhanced Disaster Resilience: An Effectiveness Testing Study from Australia
by Nayomi Kankanamge, Tan Yigitcanlar and Ashantha Goonetilleke
Future Internet 2022, 14(6), 179; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14060179 - 09 Jun 2022
Cited by 4 | Viewed by 3937
Abstract
Providing convenient and effective online education is important for the public to be better prepared for disaster events. Nonetheless, the effectiveness of such education is questionable due to the limited use of online tools and platforms, which also results in narrow community outreach. Correspondingly, understanding public perceptions of disaster education methods and experiences for the adoption of novel methods is critical, but this is an understudied area of research. The aim of this study is to understand public perceptions of online disaster education practices for disaster preparedness and to evaluate the effectiveness of the gamification method in increasing public awareness. This study utilizes social media analytics and conducts a gamification exercise. The analysis covered Twitter posts (n = 13,683) related to the 2019–2020 Australian bushfires and a survey of participants (n = 52) before and after they used the gamified application STOP Disasters! The results revealed that: (a) the public satisfaction level is relatively low for traditional bushfire disaster education methods; (b) the study participants’ satisfaction level is relatively high for an online gamified application used for disaster education; and (c) the use of virtual and augmented reality was found to be promising for increasing the appeal of gamified applications, along with a blended traditional and gamified approach.
(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)

19 pages, 770 KiB  
Article
Optimization of the System of Allocation of Overdue Loans in a Sub-Saharan Africa Microfinance Institution
by Andreia Araújo, Filipe Portela, Filipe Alvelos and Saulo Ruiz
Future Internet 2022, 14(6), 163; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14060163 - 27 May 2022
Cited by 1 | Viewed by 1700
Abstract
In microfinance, a growing number of loans carries a high risk of more overdue loans, because the resources available to act on repayment become overloaded. Three experiments were therefore conducted to search for a distribution of loans among the available officers that maximizes the probability of recovery. First, the relation between the loan and some characteristics of the officers was analyzed. The results were not strong, with F1 scores between 0 and 0.74 and considerable variation among the scores of the good predictions. Second, loans were classified as paid or unpaid based on an analysis of the characteristics of the loan. The Support Vector Machine showed potential as a solution, with an average F1 score of 0.625; however, when predicting unpaid loans, it proved to be random, with a score of 0.55. Finally, the third experiment focused on segmenting the overdue loans into groups, from which their prioritization could be derived. Three clusters were clearly visible in the data through Principal Component Analysis. To reinforce this good visualization, the final silhouette score was 0.194, which is taken to indicate that the model can be trusted. Thus, clustering loans into three groups, with a corresponding prioritization scale, would be the best strategy to organize and assign the loans to maximize recovery.
(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)
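For readers who want to interpret the silhouette score quoted in the abstract above, the following is a minimal sketch of how the mean silhouette coefficient is commonly computed. The coefficient ranges from -1 to 1, with higher values indicating tighter, better separated clusters. The function name and the toy points are assumptions for illustration only and are unrelated to the loan data used in the paper.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def mean_distance(point, members):
        return sum(math.dist(point, m) for m in members) / len(members)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the distance of the point to itself contributes zero to the sum).
        a = sum(math.dist(point, m) for m in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(mean_distance(point, members)
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Two well separated toy clusters on a line (assumed data).
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
toy_labels = [0, 0, 1, 1]
toy_score = silhouette_score(toy_points, toy_labels)
```

On this toy input the score is close to 0.9 because the two groups are far apart relative to their internal spread; values near zero indicate overlapping clusters.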

21 pages, 1194 KiB  
Article
The Whole Is Greater than the Sum of the Parts: A Multilayer Approach on Criminal Networks
by Annamaria Ficara, Giacomo Fiumara, Salvatore Catanese, Pasquale De Meo and Xiaoyang Liu
Future Internet 2022, 14(5), 123; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14050123 - 20 Apr 2022
Cited by 7 | Viewed by 2312
Abstract
Traditional social network analysis can be generalized to model some networked systems by multilayer structures where the individual nodes develop relationships in multiple layers. A multilayer network is called multiplex if each layer shares at least one node with some other layer. In this paper, we built a unique criminal multiplex network from the pre-trial detention order by the Preliminary Investigation Judge of the Court of Messina (Sicily) issued at the end of the Montagna anti-mafia operation in 2007. Montagna focused on two families who infiltrated several economic activities through a cartel of entrepreneurs close to the Sicilian Mafia. Our network possesses three layers which share 20 nodes. The first captures meetings between suspected criminals, the second records phone calls, and the third records crimes committed by pairs of individuals. We used measures from multilayer network analysis to characterize the actors in the network based on their local edges and their relevance to each specific layer. Then, we used measures of layer similarity to study the relationships between different layers. By studying the actor connectivity and the layer correlation, we demonstrated that a complete picture of the structure and the activities of a criminal organization can be obtained only by considering the three layers as a whole multilayer network and not as single-layer networks. Specifically, we showed the usefulness of the multilayer approach by bringing out the importance of actors, which does not emerge when the three layers are studied separately.
(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)
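The idea of layers that share nodes, as described in the abstract above, can be sketched with plain Python sets. The layer names mirror the abstract (meetings, phone calls, crimes), but the actors, the edges, and the helper names are entirely hypothetical, and reading "shared" as "present in every layer" is an assumption of this sketch rather than the paper's stated definition.

```python
def layer_nodes(edges):
    """Set of actors that appear in at least one edge of a layer."""
    return {node for edge in edges for node in edge}


def shared_nodes(layers):
    """Actors that appear in every layer of the network."""
    node_sets = [layer_nodes(edges) for edges in layers.values()]
    return set.intersection(*node_sets)


def layer_degree(layers, actor):
    """Number of edges touching an actor, reported per layer."""
    return {name: sum(actor in edge for edge in edges)
            for name, edges in layers.items()}


# Hypothetical toy network with anonymous actors A to D.
layers = {
    "meetings": [("A", "B"), ("B", "C")],
    "phone_calls": [("A", "B"), ("A", "D")],
    "crimes": [("B", "D"), ("A", "B")],
}

common = shared_nodes(layers)
degree_of_a = layer_degree(layers, "A")
```

Comparing an actor's degree across layers, rather than within a single layer, is the kind of per-layer view that the multilayer measures in the abstract build on.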

11 pages, 1445 KiB  
Article
Automated Business Goal Extraction from E-mail Repositories to Bootstrap Business Understanding
by Marco Spruit, Marcin Kais and Vincent Menger
Future Internet 2021, 13(10), 243; https://0-doi-org.brum.beds.ac.uk/10.3390/fi13100243 - 23 Sep 2021
Viewed by 1659
Abstract
The Cross-Industry Standard Process for Data Mining (CRISP-DM), despite being the most popular data mining process for more than two decades, is known to leave organizations that lack operational data mining experience puzzled and unable to start their data mining projects. This is especially apparent in the first phase, Business Understanding, at the conclusion of which the data mining goals of the project at hand should be specified, which arguably requires at least a conceptual understanding of the knowledge discovery process. We propose to bridge this knowledge gap from a Data Science perspective by applying Natural Language Processing (NLP) techniques to the organizations’ e-mail exchange repositories to extract explicitly stated business goals from the conversations, thus bootstrapping the Business Understanding phase of CRISP-DM. Our NLP-Automated Method for Business Understanding (NAMBU) generates a list of business goals which can subsequently be used for further specification of data mining goals. Validation of the results against manual business goal extraction from the Enron corpus demonstrates the usefulness of our NAMBU method when applied to large datasets.
(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)

22 pages, 18330 KiB  
Article
A Review on Clustering Techniques: Creating Better User Experience for Online Roadshow
by Zhou-Yi Lim, Lee-Yeng Ong and Meng-Chew Leow
Future Internet 2021, 13(9), 233; https://0-doi-org.brum.beds.ac.uk/10.3390/fi13090233 - 13 Sep 2021
Cited by 7 | Viewed by 3354
Abstract
The online roadshow is a relatively new concept that offers higher flexibility and scalability than the physical roadshow, because it is accessible through digital devices anywhere and anytime. In a physical roadshow, organizations can measure the effectiveness of the roadshow by interacting with the customers. However, organizations cannot monitor the effectiveness of an online roadshow in the same way. A good user experience is important to increase the advertising effects of the online roadshow website. In web usage mining, clustering can discover user access patterns from the weblog. By applying a clustering technique, the online roadshow website can be further improved to provide a better user experience. This paper presents a review of clustering techniques used in web usage mining, namely the partition-based, hierarchical, density-based, and fuzzy clustering techniques. These clustering techniques are analyzed from three perspectives: their similarity measures, the evaluation metrics used to determine the optimality of the clusters, and the functional purpose of applying the techniques to improve the user experience of the website. By applying clustering techniques at different stages of user activity on the online roadshow website, the advertising effectiveness of the website can be enhanced in terms of its affordance, flow, and interactivity.
(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)

17 pages, 1279 KiB  
Article
Improving RE-SWOT Analysis with Sentiment Classification: A Case Study of Travel Agencies
by Shu-Fen Tu, Ching-Sheng Hsu and Yu-Tzu Lu
Future Internet 2021, 13(9), 226; https://0-doi-org.brum.beds.ac.uk/10.3390/fi13090226 - 30 Aug 2021
Cited by 1 | Viewed by 4926
Abstract
Nowadays, many companies collect online user reviews to determine how users evaluate their products. Dalpiaz and Parente proposed the RE-SWOT method to automatically generate a SWOT matrix based on online user reviews. The SWOT matrix is an important basis for a company to perform competitive analysis; therefore, RE-SWOT is a very helpful tool for organizations. Dalpiaz and Parente calculated feature performance scores based on user reviews and ratings to generate the SWOT matrix. However, the authors did not propose a solution for situations in which user ratings are not available. Unfortunately, it is not uncommon for forums to have only user reviews and no user ratings. In this paper, sentiment analysis is used to deal with the situation where user ratings are not available. We also use KKday, a start-up online travel agency in Taiwan, as an example to demonstrate how to use the proposed method to build a SWOT matrix.
(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)

15 pages, 752 KiB  
Article
Implementation of a Virtual Assistant for the Academic Management of a University with the Use of Artificial Intelligence
by William Villegas-Ch, Joselin García-Ortiz, Karen Mullo-Ca, Santiago Sánchez-Viteri and Milton Roman-Cañizares
Future Internet 2021, 13(4), 97; https://0-doi-org.brum.beds.ac.uk/10.3390/fi13040097 - 13 Apr 2021
Cited by 17 | Viewed by 4870
Abstract
Currently, as a result of the pandemic that the world is facing, private universities are going through very delicate moments in several areas, both academic and financial. Academically, there are learning problems that are directly related to the dropout rate, which in turn brings financial problems. Added to this are the economic problems caused by the pandemic, during which the number of students who want to access a private education has dropped considerably. For this reason, all private universities need support to improve their student income and avoid cuts in budgets and resources. However, the academic side requires great effort to fulfill its academic activities, which are the priority, while also attending to those interested in pursuing a training program. To solve these problems, it is important to integrate technologies such as chatbots, which use artificial intelligence so that tasks such as providing information on academic courses are handled by them, reducing the administrative burden and improving the user experience. At the same time, this encourages people to become part of the college.
(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)

Review

33 pages, 2126 KiB  
Review
A Systematic Literature Review on Applications of GAN-Synthesized Images for Brain MRI
by Sampada Tavse, Vijayakumar Varadarajan, Mrinal Bachute, Shilpa Gite and Ketan Kotecha
Future Internet 2022, 14(12), 351; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14120351 - 25 Nov 2022
Cited by 7 | Viewed by 6015
Abstract
With the advances in brain imaging, magnetic resonance imaging (MRI) is evolving as a popular radiological tool in clinical diagnosis. Deep learning (DL) methods can detect abnormalities in brain images without an extensive manual feature extraction process. Generative adversarial network (GAN)-synthesized images have many applications in this field besides augmentation, such as image translation, registration, super-resolution, denoising, motion correction, segmentation, reconstruction, and contrast enhancement. The existing literature was reviewed systematically to understand the role of GAN-synthesized dummy images in brain disease diagnosis. The Web of Science and Scopus databases were extensively searched to find relevant studies from the last six years for this systematic literature review (SLR). Predefined inclusion and exclusion criteria helped in filtering the search results. Data extraction is based on related research questions (RQs). This SLR identifies various loss functions used in the above applications and software to process brain MRIs. A comparative study of existing evaluation metrics for GAN-synthesized images helps in choosing the proper metric for an application. GAN-synthesized images will have a crucial role in the clinical sector in the coming years, and this paper gives a baseline for other researchers in the field.
(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)

Other

24 pages, 5158 KiB  
Concept Paper
Generating Indicators of Disruptive Innovation Using Big Data
by Roger C. Brackin, Michael J. Jackson, Andrew Leyshon, Jeremy G. Morley and Sarah Jewitt
Future Internet 2022, 14(11), 327; https://0-doi-org.brum.beds.ac.uk/10.3390/fi14110327 - 11 Nov 2022
Cited by 1 | Viewed by 1423
Abstract
Technological evolution and its potential impacts are of significant interest to governments, corporate organizations and academic enquiry, but assessments of technology progression are often highly subjective. This paper prototypes potential objective measures to assess technology progression using internet-based data. These measures may help reduce the subjective nature of such assessments and, in conjunction with other techniques, reduce the uncertainty of technology progression assessment. The paper examines one part of the technology ecosystem, namely, academic research and publications. It uses analytics performed against a large body of academic paper abstracts and metadata published over 20 years to propose and demonstrate candidate indicators of technology progression. The measures prototyped are: (i) the overall occurrence of technologies used over time in research; (ii) the fields in which this use was made; (iii) the geographic spread of specific technologies within research; and (iv) the clustering of technology research over time. An outcome of the analysis is an ability to assess the measures of technology progression against a set of inputs and a set of commentaries and forecasts made publicly in the subject area over the last 20 years. The potential automated indicators of research are discussed together with other indicators which might help working groups assess technology progression using more quantitative methods.
(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)