Statistical Methods in Data Mining

A special issue of Mathematics (ISSN 2227-7390). This special issue belongs to the section "Mathematics and Computer Science".

Deadline for manuscript submissions: closed (31 May 2022) | Viewed by 18846

Special Issue Editor

Department of Mathematics, Faculty of Science, University of Porto, 4169-007 Porto, Portugal
Interests: statistical learning; discriminant analysis; classification; clustering analysis; regression; correlation

Special Issue Information

Dear Colleagues, 

Statistics is now more important than ever because large amounts of data are available in many domains, such as science, finance, engineering, and medicine. For a long time, statistics developed as a subdiscipline of mathematics. Nevertheless, computing is also a very important tool for statistics. This is particularly true of statistical methods in data mining, an interdisciplinary field involving the analysis of large existing databases in order to discover patterns and relationships in the data. Data mining differs from traditional statistics in the size of the datasets and in the fact that the data were not initially collected according to an experimental design but rather for other purposes. On the other hand, asymptotic analysis, which has long been an important area of statistics addressing problems where the sample size (and, more recently, the number of variables) tends to infinity, is also appropriate in data mining for dealing with huge amounts of data.

The aim of this Special Issue is to publish original research articles covering advances in statistical methods in data mining. Potential topics include, but are not limited to, the following: data mining algorithms; statistical approaches; practical applications involving innovative algorithms or statistical approaches; novel supervised classification and clustering problems; and new methodologies for regression.

Dr. Joaquim Fernando Pinto da Costa
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Mathematics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • variable selection for high-dimensional data
  • feature screening for ultrahigh-dimensional data
  • multiple comparisons
  • statistical learning and data mining
  • clustering
  • regression
  • decision trees
  • visualization
  • neural networks
  • support vector machines
  • deep learning

Published Papers (6 papers)

Research

23 pages, 1096 KiB  
Article
Dynamic Surgical Waiting List Methodology: A Networking Approach
by Fabián Silva-Aravena and Jenny Morales
Mathematics 2022, 10(13), 2307; https://doi.org/10.3390/math10132307 - 01 Jul 2022
Cited by 5 | Viewed by 2445
Abstract
In Chile and the world, the supply of medical hours to provide care has been reduced due to the health crisis caused by COVID-19. As of December 2021, the outlook has been critical in Chile, both in medical and surgical care, where 1.7 million people wait for care, and the wait for surgery has risen from 348 to 525 days on average. This occurs mainly when the demand for care exceeds the supply available in the public system, which has caused serious problems for patients who remain on hold; health teams have implemented prioritization measures so that patients are treated on time. In this paper, we propose a network-based methodology for predicting the prioritization of patients on surgical waiting lists (SWL), embodied with a machine learning scheme, for a high complexity hospital (HCH) in Chile. The prioritization is linked to the risk of each waiting patient. The work presents the following contributions. The first contribution is a network method that predicts the priority order of anonymous patients entering the SWL. The second contribution is a dynamic quantification of the risk of waiting patients. The third contribution is a patient selection protocol based on a dynamic update of the SWL that draws on the components of prioritization, risk, and clinical criteria. The optimization of the process was measured by a simulation of the total times of the system in the HCH. The proposed prioritization strategy saves medical hours, allowing 20% additional surgeries to be performed and thus reducing the SWL by 10%. The risk of waiting patients could drop by up to 8% annually. We hope to implement this methodology in real health care units.
(This article belongs to the Special Issue Statistical Methods in Data Mining)

13 pages, 6613 KiB  
Article
Quasi-Unimodal Distributions for Ordinal Classification
by Tomé Albuquerque, Ricardo Cruz and Jaime S. Cardoso
Mathematics 2022, 10(6), 980; https://doi.org/10.3390/math10060980 - 18 Mar 2022
Cited by 2 | Viewed by 2202
Abstract
Ordinal classification tasks are present in a large number of different domains. However, common losses for deep neural networks, such as cross-entropy, do not properly weight the relative ordering between classes. For that reason, many losses have been proposed in the literature that model the output probabilities as following a unimodal distribution. This manuscript reviews many of these losses on three different datasets and suggests a potential improvement that focuses the unimodal constraint on the neighborhood around the true class, allowing for a more flexible distribution, aptly called quasi-unimodal loss. For this purpose, two constraints are proposed: a first constraint concerns the relative order of the top-three probabilities, and a second constraint ensures that the remaining output probabilities are not higher than the top three. Therefore, gradient descent focuses on improving the decision boundary around the true class to the detriment of the more distant classes. The proposed loss is found to be competitive in several cases.
(This article belongs to the Special Issue Statistical Methods in Data Mining)
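
The two constraints described in the abstract above can be illustrated with a small checker. This is only a sketch under stated assumptions: it reads "top-three probabilities" as the true class plus its immediate neighbors, requires the true class to be at least as probable as each neighbor, and requires every other class to be no more probable than the smallest of those three. The function name `is_quasi_unimodal` is hypothetical, and the code checks the constraints on a probability vector rather than implementing the loss from the paper.

```python
from typing import Sequence


def is_quasi_unimodal(probs: Sequence[float], true_class: int) -> bool:
    """Check one possible reading of the two constraints (an assumption,
    not the paper's exact formulation)."""
    k = true_class
    n = len(probs)
    # Neighborhood: the classes adjacent to the true class, where they exist.
    neighbors = [i for i in (k - 1, k + 1) if 0 <= i < n]
    # Constraint 1 (assumed): the true class is at least as probable
    # as each adjacent class.
    if any(probs[k] < probs[i] for i in neighbors):
        return False
    # Constraint 2 (assumed): every remaining class is no more probable
    # than the least probable class in the neighborhood.
    floor = min(probs[i] for i in neighbors + [k])
    rest = [probs[i] for i in range(n) if i != k and i not in neighbors]
    return all(p <= floor for p in rest)


# Strictly unimodal around class 2: satisfies both constraints.
print(is_quasi_unimodal([0.05, 0.2, 0.5, 0.15, 0.1], 2))
# Not unimodal in the tail (0.1 > 0.05), yet still accepted, which is
# the extra flexibility the abstract refers to.
print(is_quasi_unimodal([0.1, 0.05, 0.15, 0.4, 0.3], 3))
# A distant class (0.3) outranks the neighborhood: rejected.
print(is_quasi_unimodal([0.3, 0.1, 0.4, 0.15, 0.05], 2))
```

Under this reading, the constraint is enforced only near the true class, so the shape of the distribution over distant classes is left free as long as those classes stay below the neighborhood.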

22 pages, 1117 KiB  
Article
Local Linear Approximation Algorithm for Neural Network
by Mudong Zeng, Yujie Liao, Runze Li and Agus Sudjianto
Mathematics 2022, 10(3), 494; https://doi.org/10.3390/math10030494 - 03 Feb 2022
Cited by 4 | Viewed by 2696
Abstract
This paper aims to develop a new training strategy to improve efficiency in the estimation of weights and biases in a feedforward neural network (FNN). We propose a local linear approximation (LLA) algorithm, which approximates ReLU with a linear function at the neuron level and estimates the weights and biases of a one-hidden-layer neural network iteratively. We further propose the layer-wise optimized adaptive neural network (LOAN), in which we use the LLA to estimate the weights and biases in the LOAN layer by layer adaptively. We compare the performance of the LLA with the commonly used procedures in machine learning based on seven benchmark data sets. The numerical comparison implies that the proposed algorithm may outperform the existing procedures in terms of both training time and prediction accuracy.
(This article belongs to the Special Issue Statistical Methods in Data Mining)

15 pages, 880 KiB  
Article
Classification Comparison of Machine Learning Algorithms Using Two Independent CAD Datasets
by Meliz Yuvalı, Belma Yaman and Özgür Tosun
Mathematics 2022, 10(3), 311; https://doi.org/10.3390/math10030311 - 20 Jan 2022
Cited by 20 | Viewed by 3707
Abstract
In the last few decades, statistical methods and machine learning (ML) algorithms have become efficient in medical decision-making. Coronary artery disease (CAD) is a common type of cardiovascular disease that causes many deaths each year. In this study, two CAD datasets from different countries (TRNC and Iran) are tested to understand the classification efficiency of different supervised machine learning algorithms. The Z-Alizadeh Sani dataset contained 303 individuals (216 patients, 87 controls), while the Near East University (NEU) Hospital dataset contained 475 individuals (305 patients, 170 controls). This study was conducted in three stages: (1) Each dataset, as well as their merged version, was subject to review separately with a random sampling method to obtain train-test subsets. (2) The NEU Hospital dataset was assigned as the training data, while the Z-Alizadeh Sani dataset was the test data. (3) The Z-Alizadeh Sani dataset was assigned as the training data, while the NEU Hospital dataset was the test data. Among all ML algorithms, the Random Forest showed successful results for its classification performance at each stage. The least successful ML method was kNN, which underperformed at all stages. Other methods, including logistic regression, had varying classification performances at every stage.
(This article belongs to the Special Issue Statistical Methods in Data Mining)
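
The three-stage protocol in the abstract above (random splits within each dataset and the merged set, then training on one dataset and testing on the other in both directions) can be sketched as follows. This is a minimal illustration, not the study's code: the 1-nearest-neighbor classifier is a stand-in for any supervised learner, the function names and the 30% test fraction are assumptions, and the two tiny datasets are synthetic toy data, unrelated to the NEU Hospital or Z-Alizadeh Sani data.

```python
import random
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (features, label)


def nearest_neighbor_predict(train: List[Sample], x: Sequence[float]) -> int:
    """1-nearest-neighbor stand-in for any supervised classifier."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(train, key=lambda s: sq_dist(s[0], x))[1]


def accuracy(train: List[Sample], test: List[Sample]) -> float:
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return hits / len(test)


def random_split(data: List[Sample], test_fraction: float,
                 seed: int) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle a copy of the data and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def three_stage_evaluation(first: List[Sample], second: List[Sample],
                           test_fraction: float = 0.3,
                           seed: int = 0) -> Dict[str, float]:
    results: Dict[str, float] = {}
    # Stage 1: random train-test split within each dataset and the merged one.
    for name, data in (("first", first), ("second", second),
                       ("merged", first + second)):
        train, test = random_split(data, test_fraction, seed)
        results[f"stage1/{name}"] = accuracy(train, test)
    # Stage 2: train on one dataset, test on the other.
    results["stage2/train=first,test=second"] = accuracy(first, second)
    # Stage 3: swap the roles of the two datasets.
    results["stage3/train=second,test=first"] = accuracy(second, first)
    return results


# Synthetic, well-separated toy data (labels 0 and 1); purely illustrative.
toy_first = [((0.0, 0.1), 0), ((0.2, 0.0), 0), ((0.1, 0.3), 0),
             ((1.0, 0.9), 1), ((0.9, 1.1), 1), ((1.1, 1.0), 1)]
toy_second = [((0.1, 0.1), 0), ((0.3, 0.2), 0), ((0.0, 0.2), 0),
              ((1.2, 1.0), 1), ((0.8, 0.9), 1), ((1.0, 1.2), 1)]

for stage, score in three_stage_evaluation(toy_first, toy_second).items():
    print(stage, score)
```

Stages 2 and 3 are the informative ones for generalization: a model that scores well on a random split of its own dataset may still degrade when the test data come from a different hospital or country.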

25 pages, 2944 KiB  
Article
Towards Parameter Identification of a Behavioral Model from a Virtual Reality Experiment
by Nathalie Verdière, Oscar Navarro, Aude Naud, Alexandre Berred and Damienne Provitolo
Mathematics 2021, 9(24), 3175; https://doi.org/10.3390/math9243175 - 09 Dec 2021
Cited by 4 | Viewed by 1987
Abstract
In this paper, we investigate the calibration of a mathematical model describing different behaviors occurring during a natural, a societal, or a technological catastrophe. This model was developed in collaboration with geographers and psychologists. To collect information on the level of stress, psychologists of the LPPL laboratory of Nantes (France) led virtual reality experiments. These experiments consisted of immersing individuals in a situation of catastrophe and measuring their electrocardiogram. From the physical and biological data collected, we present the methodology to calibrate the behavioral model. First, a theoretical analysis is carried out to determine (i) whether the parameters can be uniquely estimated and (ii) the minimal number of discrete measurements required for the estimation. Then, from these analyses, an estimation procedure is performed to calibrate the mathematical model or at least to obtain an order of magnitude of the model parameters. Through this work, we will show from simulations that the proposed system makes it possible to apprehend non-observable human processes.
(This article belongs to the Special Issue Statistical Methods in Data Mining)

Review

22 pages, 3548 KiB  
Review
Statistical Methods with Applications in Data Mining: A Review of the Most Recent Works
by Joaquim Fernando Pinto da Costa and Manuel Cabral
Mathematics 2022, 10(6), 993; https://doi.org/10.3390/math10060993 - 19 Mar 2022
Cited by 9 | Viewed by 4126
Abstract
The importance of statistical methods in finding patterns and trends in otherwise unstructured and complex large sets of data has grown over the past decade, as the amount of data produced keeps growing exponentially and knowledge obtained from understanding data allows quick and informed decisions to be made that save time and provide a competitive advantage. For this reason, we have seen considerable advances over the past few years in statistical methods in data mining. This paper is a comprehensive and systematic review of these recent developments in the area of data mining.
(This article belongs to the Special Issue Statistical Methods in Data Mining)