Data Science, Statistics and Visualization

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (31 December 2022) | Viewed by 16202

Special Issue Editors


Guest Editor
Departamento de Matemática, Facultad de Ingeniería, Universidad de Atacama, Copiapó 1530000, Chile
Interests: data science; biostatistics; multivariate time series; bayesian statistics; causal inference; machine learning; statistical process monitoring; information visualization

Guest Editor
Institute of Mathematical Science and Computing, University of Sao Paulo, Sao Carlos 13566-590, Brazil
Interests: data science and statistics; survival/reliability analysis; machine learning; statistics inference

Special Issue Information

Dear Colleagues,

Data science is a growing field that supports quantitative, data-driven decision-making. Although it is a broad and rapidly expanding research area, its potential across interdisciplinary domains is not yet well understood, which leaves a gap to be explored.

With this in mind, we have organized this Special Issue, entitled Data Science, Statistics, and Visualization, to explore the role of data science applications in identifying problems that can benefit from new research in the mathematical, statistical, and computational sciences, and to highlight proposed solutions through high-quality papers. We aim to provide an enabling environment for applied research in the productive sector, within the computing and artificial intelligence field, centered on the dissemination of knowledge in applied sciences. In doing so, we want to contribute towards building a more solid and lasting multidisciplinary community, fostering solutions to practical problems.

The topics related to this Special Issue include, but are not limited to, the following:

  • Information visualization;
  • Quantum machine learning;
  • Symbolic data analysis;
  • Applied statistical learning;
  • Artificial intelligence, machine learning, and big data analytics;
  • Statistical methods in healthcare.

Prof. Dr. Diego Carvalho do Nascimento
Prof. Dr. Francisco Louzada
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • statistical learning
  • artificial intelligence
  • machine learning
  • big data analytics
  • decision-making based on data-driven models

Published Papers (7 papers)

Research

20 pages, 695 KiB  
Article
Determining the Quality of a Dataset in Clustering Terms
by Alicja Rachwał, Emilia Popławska, Izolda Gorgol, Tomasz Cieplak, Damian Pliszczuk, Łukasz Skowron and Tomasz Rymarczyk
Appl. Sci. 2023, 13(5), 2942; https://0-doi-org.brum.beds.ac.uk/10.3390/app13052942 - 24 Feb 2023
Cited by 5 | Viewed by 1870
Abstract
The purpose of the theoretical considerations and research conducted was to indicate the instruments with which the quality of a dataset can be verified for the segmentation of observations occurring in the dataset. The paper proposes a novel way to deal with mixed datasets containing categorical and continuous attributes in a customer segmentation task. The categorical variables were embedded using an innovative unsupervised model based on an autoencoder. The customers were then divided into groups using different clustering algorithms, based on similarity matrices. In addition to the classic k-means method and the more modern DBSCAN, three graph algorithms were used: the Louvain algorithm, the greedy algorithm and the label propagation algorithm. The research was conducted on two datasets: one containing retail customers and the other containing wholesale customers. The Calinski–Harabasz index, Davies–Bouldin index, NMI index, Fowlkes–Mallows index and silhouette score were used to assess the quality of the clustering. It was noted that the modularity parameter for graph methods was a good indicator of whether a given set could be meaningfully divided into groups.
(This article belongs to the Special Issue Data Science, Statistics and Visualization)
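
As a rough illustration of one of the clustering quality measures named in the abstract above, the silhouette score can be computed from scratch as sketched below. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `silhouette_score`, the Euclidean distance, and the toy points are all hypothetical choices made for this example.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    Illustrative helper only; names and data are assumptions.
    """
    # Group the points by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # By convention a singleton cluster contributes 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so divide by n - 1).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(p, q) for q in members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Toy data (hypothetical): two well-separated pairs of points.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # matches the geometry
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # cuts across the pairs
```

Values close to 1 suggest compact, well-separated groups, while values near or below 0 suggest that the labels cut across the natural structure of the data.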

17 pages, 478 KiB  
Article
Analysis and Prediction of MOOC Learners’ Dropout Behavior
by Zengxiao Chi, Shuo Zhang and Lin Shi
Appl. Sci. 2023, 13(2), 1068; https://0-doi-org.brum.beds.ac.uk/10.3390/app13021068 - 13 Jan 2023
Cited by 8 | Viewed by 2738
Abstract
With the wide spread of massive open online courses (MOOCs), millions of people have enrolled in many courses, but the dropout rate of most courses is more than 90%. Accurately predicting the dropout rate of MOOCs is of great significance for preventing learners’ dropout behavior and reducing the dropout rate of students. Using the PH278x curriculum data on the Harvard X platform in spring 2013, and based on a statistical analysis of the factors that may affect learners’ final completion of the curriculum from two aspects, learners’ own characteristics and learners’ learning behavior, we established MOOC dropout rate prediction models based on logistic regression, K-nearest neighbor and random forest, respectively. Experiments with five evaluation metrics (accuracy, precision, recall, F1 and AUC) show that the prediction model based on random forest has the highest accuracy, precision, F1 and AUC, which are 91.726%, 93.0923%, 95.4145% and 0.925341, respectively. Its performance is better than that of the prediction models based on logistic regression and K-nearest neighbor, whose values for these metrics are 91.395%, 92.8674%, 95.2337%, 0.912316 and 91.726%, 93.0923%, 95.4145%, 0.925341, respectively. As for recall, the value of random forest is higher than that of KNN, but slightly lower than that of logistic regression, which are 0.992476, 0.977239 and 0.978555, respectively. We therefore conclude that random forests perform best in predicting the dropout rate of MOOC learners. This study can help education staff to know the trend of learners’ dropout behavior in advance, so that measures can be taken to reduce the dropout rate before it occurs, thus improving the completion rate of the curriculum.
(This article belongs to the Special Issue Data Science, Statistics and Visualization)
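
To make the evaluation metrics named in the abstract above concrete, the sketch below derives accuracy, precision, recall and F1 from a confusion matrix. It is only an illustration: the function name `classification_metrics`, the choice of label 1 as the positive class, and the toy labels are assumptions, not data or code from the paper. AUC is omitted because it requires predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classifier.

    Illustrative helper; `positive` marks the class treated as the
    event of interest (an assumption for this sketch).
    """
    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

With strongly imbalanced classes, as in dropout data, accuracy alone can look high even for a trivial model, which is one reason precision, recall and F1 are usually reported alongside it.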

21 pages, 3118 KiB  
Article
Visualization System for Transparency Requirement Analytics
by Samiha Fadloun, Souham Meshoul, Mahmood Hosseini, Abdennour Amokrane and Hichem Bennaceur
Appl. Sci. 2022, 12(23), 12423; https://0-doi-org.brum.beds.ac.uk/10.3390/app122312423 - 05 Dec 2022
Cited by 1 | Viewed by 1218
Abstract
Access to corporate information systems by consumers via the Internet has increased dramatically over the past several decades. In a separate organization, extensive research has been conducted on the free flow of information generated by both external and internal keywords. Research on transparency should aid the audience in making informed decisions. Few have, however, created clear and compelling visual representations of transparency requirements (stakeholders, data, process, policy, and their relationships) utilizing current information visualization and visual analytics methodologies. Maintaining both the quality and visual representation of transparency requirements is a difficult challenge. In this paper, we propose TranspVis, a new visual analytics tool designed for transparency analytics. It consists of multiple views that aid domain experts in efficiently analyzing, updating, and saving application transparency datasets. TranspVis is an interactive tool for displaying TranspLan (i.e., Transparency Language) representations manually generated by experts utilizing the Shield, Infolet, and SitReq forms. In addition to the new circle view, TranspVis generates and synchronizes these latter representations automatically. TranspVis is evaluated using AWS and WhatsApp policy datasets as two case studies. Results show that TranspVis extends the initial TranspLan representation and significantly improves transparency requirement analytics in terms of visual encoding, interactions, and insight extraction.
(This article belongs to the Special Issue Data Science, Statistics and Visualization)

17 pages, 1844 KiB  
Article
CircleVis: A Visualization Tool for Circular Labeling Arrangements and Overlap Removal
by Samiha Fadloun, Souham Meshoul and Kheireddine Choutri
Appl. Sci. 2022, 12(22), 11390; https://0-doi-org.brum.beds.ac.uk/10.3390/app122211390 - 10 Nov 2022
Cited by 1 | Viewed by 1519
Abstract
Information visualization refers to the practice of representing data in a meaningful, visual way that users can interpret and easily comprehend. Geometric or visual encoding shapes such as circles, rectangles, and bars have grown in popularity in data visualization research over time. Circles are a common shape used by domain experts to solve real-world problems and analyze data. As a result, data can be encoded using a simple circle with a set of labels associated with an arc or portion of the circle. Labels can then be arranged in various ways based on human perception (easy to read) or by optimizing the available space around the circle. However, overlaps can occur in one or more arrangements. This paper proposes CircleVis, a new visualization tool for label arrangement and overlap removal in circle visual encoding. First, a mathematical model is presented in order to formulate existing arrangements such as angular, path, and linear. Furthermore, based on user interaction, a new arrangement approach is proposed to optimize available space in each circle arc and delete label overlaps. Finally, users test and evaluate the designed tool using the COVID-19 dataset for validation purposes. The obtained results demonstrate the efficacy of the proposed method for label arrangement and overlapping removal in circular layout.
(This article belongs to the Special Issue Data Science, Statistics and Visualization)

36 pages, 1792 KiB  
Article
Getting over High-Dimensionality: How Multidimensional Projection Methods Can Assist Data Science
by Evandro S. Ortigossa, Fábio Felix Dias and Diego Carvalho do Nascimento
Appl. Sci. 2022, 12(13), 6799; https://0-doi-org.brum.beds.ac.uk/10.3390/app12136799 - 05 Jul 2022
Cited by 2 | Viewed by 2666
Abstract
The exploration and analysis of multidimensional data can be pretty complex tasks, requiring sophisticated tools able to transform large amounts of data bearing multiple parameters into helpful information. Multidimensional projection techniques figure as powerful tools for transforming multidimensional data into visual information according to similarity features. Integrating this class of methods into a framework devoted to data sciences can contribute to generating more expressive means of visual analytics. Although the Principal Component Analysis (PCA) is a well-known method in this context, it is not the only one, and, sometimes, its abilities and limitations are not adequately discussed or taken into consideration by users. Therefore, knowing in-depth multidimensional projection techniques, their strengths, and the possible distortions they can create is of significant importance for researchers developing knowledge-discovery systems. This research presents a comprehensive overview of current state-of-the-art multidimensional projection techniques and shows example codes in Python and R languages, all available on the internet. The survey segment discusses the different types of techniques applied to multidimensional projection tasks from their background, application processes, capabilities, and limitations, opening the internal processes of the methods and demystifying their concepts. We also illustrate two problems, from a genetic experiment (supervised) and text mining (non-supervised), presenting solutions through multidimensional projection application. Finally, we brought elements that reverberate the competitiveness of multidimensional projection techniques towards high-dimension data visualization, commonly needed in data sciences solutions.
(This article belongs to the Special Issue Data Science, Statistics and Visualization)
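
Since the abstract above singles out PCA as the best-known projection method, a minimal sketch of the idea may help: center the data, form the covariance matrix, find its leading eigenvector, and project onto it. This is not the example code the paper refers to. The function name `pca_first_component`, the use of power iteration, and the toy data are assumptions for illustration, and real analyses would normally rely on an established numerical library.

```python
from math import sqrt


def pca_first_component(rows, iterations=200):
    """Return the first principal direction and the 1-D projected scores.

    Illustrative only: uses plain power iteration on the sample
    covariance matrix, which can stall if the start vector happens to be
    orthogonal to the leading eigenvector.
    """
    n, d = len(rows), len(rows[0])
    # Center every column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly apply cov and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Project the centered rows onto the leading direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Toy data (hypothetical): points lying exactly on the line y = 2x.
direction, scores = pca_first_component([(1, 2), (2, 4), (3, 6), (4, 8)])
```

For this toy input the recovered direction is parallel to (1, 2), so a single coordinate keeps all of the variance. With real data some variance is always lost, which is the kind of distortion the abstract asks users to keep in mind.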

20 pages, 14436 KiB  
Article
SDA-Vis: A Visualization System for Student Dropout Analysis Based on Counterfactual Exploration
by Germain Garcia-Zanabria, Daniel A. Gutierrez-Pachas, Guillermo Camara-Chavez, Jorge Poco and Erick Gomez-Nieto
Appl. Sci. 2022, 12(12), 5785; https://0-doi-org.brum.beds.ac.uk/10.3390/app12125785 - 07 Jun 2022
Cited by 3 | Viewed by 2734
Abstract
High and persistent dropout rates represent one of the biggest challenges for improving the efficiency of the educational system, particularly in underdeveloped countries. A range of features influence college dropouts, with some belonging to the educational field and others to non-educational fields. Understanding the interplay of these variables to identify a student as a potential dropout could help decision makers interpret the situation and decide what they should do next to reduce student dropout rates based on corrective actions. This paper presents SDA-Vis, a visualization system that supports counterfactual explanations for student dropout dynamics, considering various academic, social, and economic variables. In contrast to conventional systems, our approach provides information about feature-perturbed versions of a student using counterfactual explanations. SDA-Vis comprises a set of linked views that allow users to identify variable alterations that change predefined student situations. This involves perturbing the variables of a dropout student to achieve synthetic non-dropout students. SDA-Vis has been developed under the guidance and supervision of domain experts, in line with some analytical objectives. We demonstrate the usefulness of SDA-Vis through case studies run in collaboration with domain experts, using a real data set from a Latin American university. The analysis reveals the effectiveness of SDA-Vis in identifying students at risk of dropping out and proposes corrective actions, even for particular cases that have not been shown to be at risk with the traditional tools that experts use.
(This article belongs to the Special Issue Data Science, Statistics and Visualization)

14 pages, 735 KiB  
Article
High Definition tDCS Effect on Postural Control in Healthy Individuals: Entropy Analysis of a Crossover Clinical Trial
by Diandra B. Favoretto, Eduardo Bergonzoni, Diego Carvalho Nascimento, Francisco Louzada, Tenysson W. Lemos, Rosangela A. Batistela, Renato Moraes, João P. Leite, Brunna P. Rimoli, Dylan J. Edwards and Taiza G. S. Edwards
Appl. Sci. 2022, 12(5), 2703; https://0-doi-org.brum.beds.ac.uk/10.3390/app12052703 - 05 Mar 2022
Cited by 1 | Viewed by 1960
Abstract
Objective: Converging evidence supporting an effect of transcranial direct current stimulation (tDCS) on postural control and human verticality perception highlights this strategy as promising for post-stroke rehabilitation. We have previously demonstrated polarity-dependent effects of high-definition tDCS (HD-tDCS) on weight-bearing asymmetry. However, there is no investigation regarding the time-course of effects on postural control induced by HD-tDCS protocols. Thus, we performed a nonlinear time series analysis focusing on the entropy of the ground reaction force as a secondary investigation of our randomized, double-blind, placebo-controlled, crossover clinical trial. Materials and Methods: Twenty healthy right-handed young adults received the following conditions (random order, separate days): anode center HD-tDCS, cathode center HD-tDCS or sham HD-tDCS at 1, 2, and 3 mA over the right temporo-parietal junction (TPJ). Using summarized time series of transfer entropy, we evaluated the information exchange (causal direction) between both force plates and compared the dose-response across the healthy subjects with a Generalized Linear Hierarchical/Mixed Model (GLMM). Results: We found significant variation during the dynamic information flow (p < 0.001) among the dominant bodyside (and across time). A greater force transfer entropy was observed from the right to the left side during the cathode-center HD-tDCS up to 2 mA, with a causal relationship in the information flow (equilibrium force transfer) from right to left that decreased over time. Conclusions: HD-tDCS intervention induced a dynamic influence over time on postural control entropy. Right hemisphere TPJ stimulation using cathode-center HD-tDCS can induce an asymmetry of body weight distribution towards the ipsilateral side of stimulation. These results support the clinical potential of HD-tDCS for post-stroke rehabilitation.
(This article belongs to the Special Issue Data Science, Statistics and Visualization)