Statistical Data Modeling and Machine Learning with Applications

A special issue of Mathematics (ISSN 2227-7390). This special issue belongs to the section "Mathematics and Computer Science".

Deadline for manuscript submissions: closed (30 June 2021) | Viewed by 29465

Printed Edition Available!
A printed edition of this Special Issue is available.

Special Issue Editor


Prof. Dr. Snezhana Gocheva-Ilieva
Guest Editor
Department of Mathematical Analysis, Faculty of Mathematics and Informatics, University of Plovdiv Paisii Hilendarski, 24 Tzar Assen St., 4000 Plovdiv, Bulgaria
Interests: computational statistics; applied mathematics; data mining; computer modeling in physics and engineering

Special Issue Information

Dear Colleagues,

Statistics and machine learning are two intertwined fields of mathematics and computer science. In recent years, very powerful classification and predictive methods have been developed in these fields. New methods for statistical data modeling and machine learning provide enormous opportunities for developing further approaches, as well as for applying them to solve practical problems more effectively than ever before.

The proposed Special Issue aims to publish review papers, research articles, and communications presenting new original methods, applications, data analyses, case studies, comparative studies, and other results. Special attention will be given, though not exclusively, to the theory and application of statistical data modeling and machine learning in diverse areas such as computer science, economics, industry, medicine, environmental sciences, forex and finance, education, engineering, marketing, agriculture, and more.

Prof. Dr. Snezhana Gocheva-Ilieva
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Mathematics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Computational statistics
  • Dimensionality reduction and variable selection
  • Nonparametric statistical modeling
  • Supervised learning (classification, regression)
  • Clustering methods
  • Financial statistics and econometrics
  • Statistical algorithms
  • Time series analysis and forecasting
  • Machine learning algorithms
  • Decision trees
  • Ensemble methods
  • Neural networks
  • Deep learning
  • Hybrid models
  • Data analysis

Published Papers (11 papers)


Editorial


3 pages, 174 KiB  
Editorial
Special Issue “Statistical Data Modeling and Machine Learning with Applications”
by Snezhana Gocheva-Ilieva
Mathematics 2021, 9(23), 2997; https://0-doi-org.brum.beds.ac.uk/10.3390/math9232997 - 23 Nov 2021
Cited by 1 | Viewed by 1051
Abstract
Give Us Data to Predict Your Future! [...] Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications)

Research


22 pages, 551 KiB  
Article
Self-Expressive Kernel Subspace Clustering Algorithm for Categorical Data with Embedded Feature Selection
by Hui Chen, Kunpeng Xu, Lifei Chen and Qingshan Jiang
Mathematics 2021, 9(14), 1680; https://0-doi-org.brum.beds.ac.uk/10.3390/math9141680 - 16 Jul 2021
Cited by 4 | Viewed by 1672
Abstract
Kernel clustering of categorical data is a useful tool for processing separable datasets and has been employed in many disciplines. Despite recent efforts, kernel clustering remains a significant challenge for existing methods because of the assumption of feature independence and equal weights. In this study, we propose a self-expressive kernel subspace clustering algorithm for categorical data (SKSCC) using the self-expressive kernel density estimation (SKDE) scheme, as well as a new feature-weighted non-linear similarity measurement. In the SKSCC algorithm, we propose an effective non-linear optimization method to solve the clustering algorithm’s objective function, which not only considers the relationship between attributes in a non-linear space but also assigns a weight to each attribute in the algorithm to measure the degree of correlation. A series of experiments on some widely used synthetic and real-world datasets demonstrated the greater effectiveness and efficiency of the proposed algorithm compared with other state-of-the-art methods in terms of exploring non-linear relationships among attributes. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications)
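
For readers who want a concrete starting point, the sketch below illustrates the general idea of a feature-weighted similarity (kernel) for categorical data: each attribute contributes to the similarity in proportion to a weight reflecting its relevance. The inverse-entropy weighting rule and all names are illustrative assumptions; this is not the SKDE/SKSCC formulation of the paper.

```python
# Illustrative sketch only: a feature-weighted overlap kernel for categorical
# data. The weighting rule (inverse entropy) is an assumption made for this
# example, not the paper's self-expressive kernel density estimation scheme.
import numpy as np

def attribute_weights(X):
    """Weight each categorical attribute by 1 / (entropy + 1): attributes
    whose values are more concentrated receive a larger weight."""
    n, d = X.shape
    w = np.empty(d)
    for j in range(d):
        _, counts = np.unique(X[:, j], return_counts=True)
        p = counts / n
        entropy = -np.sum(p * np.log(p + 1e-12))
        w[j] = 1.0 / (entropy + 1.0)
    return w / w.sum()

def weighted_overlap_kernel(X, w):
    """K[i, k] = sum_j w[j] * 1{X[i, j] == X[k, j]}."""
    n, d = X.shape
    K = np.zeros((n, n))
    for j in range(d):
        match = (X[:, j][:, None] == X[:, j][None, :]).astype(float)
        K += w[j] * match
    return K

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(8, 5)).astype(str)  # toy categorical data
    K = weighted_overlap_kernel(X, attribute_weights(X))
    print(K.round(2))
```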

14 pages, 928 KiB  
Article
A Cascade Deep Forest Model for Breast Cancer Subtype Classification Using Multi-Omics Data
by Ala’a El-Nabawy, Nahla A. Belal and Nashwa El-Bendary
Mathematics 2021, 9(13), 1574; https://0-doi-org.brum.beds.ac.uk/10.3390/math9131574 - 04 Jul 2021
Cited by 9 | Viewed by 3020
Abstract
Automated diagnosis systems aim to reduce the cost of diagnosis while maintaining the same efficiency. Many methods have been used for breast cancer subtype classification. Some use a single data source, while others integrate many data sources, which improves accuracy at the cost of computational performance. Breast cancer data, especially biological data, are known for their imbalance and for the lack of extensive amounts of histopathological images as biological data. Recent studies have shown that the cascade Deep Forest ensemble model achieves a competitive classification accuracy compared with other alternatives, such as general ensemble learning methods and conventional deep neural networks (DNNs), especially for imbalanced training sets, by learning hyper-representations through cascade ensembles of decision trees. In this work, a cascade Deep Forest is employed to classify breast cancer subtypes, IntClust and Pam50, using multi-omics datasets and different configurations. The results obtained recorded an accuracy of 83.45% for 5 subtypes and 77.55% for 10 subtypes. The significance of this work is that using gene expression data alone with the cascade Deep Forest classifier achieves accuracy comparable to other techniques with higher computational performance, where the time recorded is about 5 s for 10 subtypes and 7 s for 5 subtypes. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications)
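
The cascade idea behind Deep Forest can be outlined in a few lines of scikit-learn: each level appends the class-probability outputs of several forests to the original features before passing them to the next level. The sketch below is a simplified illustration (no out-of-fold probability generation, arbitrary layer count and forest sizes), not the configuration used in the paper.

```python
# Minimal cascade-forest sketch; a real deep forest would use cross-validated
# class probabilities at each level and an adaptive stopping rule.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def cascade_fit_predict(X_tr, y_tr, X_te, n_levels=3, seed=0):
    F_tr, F_te = X_tr, X_te
    for level in range(n_levels):
        forests = [RandomForestClassifier(n_estimators=200, random_state=seed + level),
                   ExtraTreesClassifier(n_estimators=200, random_state=seed + level)]
        P_tr, P_te = [], []
        for f in forests:
            f.fit(F_tr, y_tr)
            P_tr.append(f.predict_proba(F_tr))
            P_te.append(f.predict_proba(F_te))
        # augment the original features with this level's probability vectors
        F_tr = np.hstack([X_tr] + P_tr)
        F_te = np.hstack([X_te] + P_te)
    # final prediction: average the last level's class probabilities
    return np.mean(P_te, axis=0).argmax(axis=1)

X, y = make_classification(n_samples=400, n_features=30, n_classes=3,
                           n_informative=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
print("accuracy:", accuracy_score(y_te, cascade_fit_predict(X_tr, y_tr, X_te)))
```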

12 pages, 598 KiB  
Article
Damped Newton Stochastic Gradient Descent Method for Neural Networks Training
by Jingcheng Zhou, Wei Wei, Ruizhi Zhang and Zhiming Zheng
Mathematics 2021, 9(13), 1533; https://0-doi-org.brum.beds.ac.uk/10.3390/math9131533 - 29 Jun 2021
Cited by 10 | Viewed by 2058
Abstract
First-order methods such as stochastic gradient descent (SGD) have recently become popular optimization methods for training deep neural networks (DNNs) with good generalization; however, they need a long training time. Second-order methods, which can lower the training time, are scarcely used because of the prohibitive computational cost of obtaining the second-order information. Thus, many works have approximated the Hessian matrix to cut the cost of computing, although the approximate Hessian matrix can deviate considerably from the true one. In this paper, we explore the convexity of the Hessian matrix of partial parameters and propose the damped Newton stochastic gradient descent (DN-SGD) method and the stochastic gradient descent damped Newton (SGD-DN) method to train DNNs for regression problems with mean square error (MSE) and classification problems with cross-entropy loss (CEL). In contrast to other second-order methods, which estimate the Hessian matrix of all parameters, our methods compute it accurately for only a small subset of the parameters, which greatly reduces the computational cost and makes the convergence of the learning process much faster and more accurate than SGD and Adagrad. Several numerical experiments on real datasets were performed to verify the effectiveness of our methods for regression and classification problems. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications)
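
As a point of reference, the generic damped Newton update w <- w - (H + lambda*I)^(-1) g on a mini-batch looks as follows for logistic regression with cross-entropy loss. How DN-SGD/SGD-DN restrict this update to a subset of the network parameters is not reproduced here, and all settings below are illustrative.

```python
# Toy damped Newton step on mini-batches for logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def damped_newton_step(w, Xb, yb, damping=1e-2):
    p = sigmoid(Xb @ w)
    g = Xb.T @ (p - yb) / len(yb)               # gradient of the CE loss
    S = p * (1 - p)
    H = (Xb * S[:, None]).T @ Xb / len(yb)      # Hessian of the CE loss
    H += damping * np.eye(len(w))               # damping term
    return w - np.linalg.solve(H, g)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
w_true = rng.normal(size=5)
y = (rng.random(500) < sigmoid(X @ w_true)).astype(float)

w = np.zeros(5)
for epoch in range(20):
    idx = rng.permutation(len(y))
    for b in np.array_split(idx, 10):           # mini-batches
        w = damped_newton_step(w, X[b], y[b])
print("recovered weights:", np.round(w, 2))
```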

15 pages, 515 KiB  
Article
Kernel Based Data-Adaptive Support Vector Machines for Multi-Class Classification
by Jianli Shao, Xin Liu and Wenqing He
Mathematics 2021, 9(9), 936; https://0-doi-org.brum.beds.ac.uk/10.3390/math9090936 - 23 Apr 2021
Cited by 9 | Viewed by 2409
Abstract
Imbalanced data exist in many classification problems. The classification of imbalanced data poses remarkable challenges in machine learning. The support vector machine (SVM) and its variants are popular among machine learning classifiers thanks to their flexibility and interpretability. However, the performance of SVMs is impacted when the data are imbalanced, which is a typical data structure in the multi-category classification problem. In this paper, we employ the data-adaptive SVM with scaled kernel functions to classify instances for a multi-class population. We propose a multi-class data-dependent kernel function for the SVM by considering class imbalance and the spatial association among instances so that the classification accuracy is enhanced. Simulation studies demonstrate the superb performance of the proposed method, and a real multi-class prostate cancer image dataset is employed as an illustration. Not only does the proposed method outperform the competitor methods in terms of the commonly used accuracy measures such as the F-score and G-means, but it also successfully detects more than 60% of the instances from the rare class in the real data, whereas the competitors detect less than 20% of the rare class instances. The proposed method will benefit other scientific research fields, such as multiple region boundary detection. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications)
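
A rough sense of what a "scaled kernel" SVM can look like is sketched below: a first-pass RBF SVM locates the support vectors, and the kernel is then conformally rescaled by a factor that is largest near them, combined with class weights for imbalance. This is a generic illustration in the spirit of conformal kernel scaling; the scaling rule and all parameters are assumptions, not the multi-class data-dependent kernel proposed in the paper.

```python
# Two-pass conformally scaled RBF SVM on synthetic imbalanced data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=600, n_features=10, n_classes=3,
                           n_informative=6, weights=[0.7, 0.2, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Pass 1: an ordinary RBF SVM whose support vectors mark the class boundaries.
base = SVC(kernel="rbf", gamma=0.1, class_weight="balanced").fit(X_tr, y_tr)
sv = base.support_vectors_

def conformal_factor(A, sv, tau=0.5):
    """c(x) is close to 1 near support vectors and decays away from them."""
    d2 = ((A[:, None, :] - sv[None, :, :]) ** 2).sum(-1).min(axis=1)
    return np.exp(-tau * d2)

def scaled_kernel(A, B, sv, gamma=0.1):
    """Conformal rescaling c(x) * K(x, y) * c(y) of the RBF kernel."""
    return (conformal_factor(A, sv)[:, None]
            * rbf_kernel(A, B, gamma=gamma)
            * conformal_factor(B, sv)[None, :])

# Pass 2: an SVM trained on the precomputed, conformally scaled kernel.
clf = SVC(kernel="precomputed", class_weight="balanced")
clf.fit(scaled_kernel(X_tr, X_tr, sv), y_tr)
pred = clf.predict(scaled_kernel(X_te, X_tr, sv))
print("macro F1:", round(f1_score(y_te, pred, average="macro"), 3))
```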

20 pages, 1053 KiB  
Article
Visualizing Profiles of Large Datasets of Weighted and Mixed Data
by Aurea Grané and Alpha A. Sow-Barry
Mathematics 2021, 9(8), 891; https://0-doi-org.brum.beds.ac.uk/10.3390/math9080891 - 16 Apr 2021
Cited by 4 | Viewed by 2634
Abstract
This work provides a procedure with which to construct and visualize profiles, i.e., groups of individuals with similar characteristics, for weighted and mixed data by combining two classical multivariate techniques: multidimensional scaling (MDS) and the k-prototypes clustering algorithm. The well-known drawback of classical MDS on large datasets is circumvented by selecting a small random sample of the dataset, whose individuals are clustered by means of an adapted version of the k-prototypes algorithm and mapped via classical MDS. Gower’s interpolation formula is used to project the remaining individuals onto the previous configuration. Throughout the process, Gower’s distance is used to measure the proximity between individuals. The methodology is illustrated on a real dataset obtained from the Survey of Health, Ageing and Retirement in Europe (SHARE), which was carried out in 19 countries and represents over 124 million aged individuals in Europe. The performance of the method was evaluated through a simulation study, whose results indicate that the new proposal overcomes the high computational cost of classical MDS with low error. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications)
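
Two of the building blocks, Gower's distance for mixed data and classical MDS on a random sample, can be sketched as follows. The k-prototypes clustering step (available, for instance, in the third-party kmodes package) and Gower's interpolation of the remaining individuals are omitted, and the toy variables are invented for illustration.

```python
# Gower distance for mixed numeric/categorical data plus classical MDS.
import numpy as np
import pandas as pd

def gower_distance(df, numeric_cols, cat_cols):
    n = len(df)
    D = np.zeros((n, n))
    for c in numeric_cols:
        x = df[c].to_numpy(float)
        rng = np.ptp(x) or 1.0                     # range-normalized L1 part
        D += np.abs(x[:, None] - x[None, :]) / rng
    for c in cat_cols:
        x = df[c].to_numpy()
        D += (x[:, None] != x[None, :]).astype(float)   # simple mismatch part
    return D / (len(numeric_cols) + len(cat_cols))

def classical_mds(D, k=2):
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                    # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "age": rng.integers(50, 90, 200),
    "income": rng.lognormal(10, 0.5, 200),
    "country": rng.choice(["ES", "DE", "SE"], 200),
    "chronic": rng.choice(["yes", "no"], 200),
})
D = gower_distance(sample, ["age", "income"], ["country", "chronic"])
print(classical_mds(D)[:5])
```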

15 pages, 1137 KiB  
Article
A Conceptual Probabilistic Framework for Annotation Aggregation of Citizen Science Data
by Jesus Cerquides, Mehmet Oğuz Mülâyim, Jerónimo Hernández-González, Amudha Ravi Shankar and Jose Luis Fernandez-Marquez
Mathematics 2021, 9(8), 875; https://0-doi-org.brum.beds.ac.uk/10.3390/math9080875 - 15 Apr 2021
Cited by 5 | Viewed by 2577
Abstract
Over the last decade, hundreds of thousands of volunteers have contributed to science by collecting or analyzing data. This public participation in science, also known as citizen science, has contributed to significant discoveries and led to publications in major scientific journals. However, little attention has been paid to data quality issues. In this work we argue that being able to determine the accuracy of data obtained by crowdsourcing is a fundamental question and we point out that, for many real-life scenarios, mathematical tools and processes for the evaluation of data quality are missing. We propose a probabilistic methodology for the evaluation of the accuracy of labeling data obtained by crowdsourcing in citizen science. The methodology builds on an abstract probabilistic graphical model formalism, which is shown to generalize some already existing label aggregation models. We show how to make practical use of the methodology through a comparison of data obtained from different citizen science communities analyzing the earthquake that took place in Albania in 2019. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications)
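
To make the aggregation problem concrete, the toy sketch below aggregates crowdsourced labels under a deliberately simple probabilistic model: each volunteer has a single accuracy, annotations are conditionally independent given the true label, and accuracies are re-estimated iteratively starting from a majority vote. It is a stand-in for the idea only, not the graphical-model framework developed in the paper; all names and labels are invented.

```python
# Simple iterative probabilistic label aggregation for crowdsourced annotations.
import numpy as np
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def aggregate(annotations, classes, n_iter=5):
    """annotations: dict item -> list of (annotator, label)."""
    truth = {it: majority([l for _, l in anns]) for it, anns in annotations.items()}
    for _ in range(n_iter):
        # estimate each annotator's accuracy against the current labels
        correct, total = Counter(), Counter()
        for it, anns in annotations.items():
            for a, l in anns:
                total[a] += 1
                correct[a] += (l == truth[it])
        acc = {a: (correct[a] + 1) / (total[a] + 2) for a in total}  # smoothed
        # recompute the most probable true label of each item
        for it, anns in annotations.items():
            logp = {c: 0.0 for c in classes}
            for a, l in anns:
                for c in classes:
                    p = acc[a] if l == c else (1 - acc[a]) / (len(classes) - 1)
                    logp[c] += np.log(p)
            truth[it] = max(logp, key=logp.get)
    return truth

annotations = {
    "photo_1": [("ann_a", "damage"), ("ann_b", "damage"), ("ann_c", "no_damage")],
    "photo_2": [("ann_a", "no_damage"), ("ann_c", "no_damage")],
    "photo_3": [("ann_b", "damage"), ("ann_c", "damage")],
}
print(aggregate(annotations, classes=["damage", "no_damage"]))
```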

16 pages, 1093 KiB  
Article
Artificial Neural Network, Quantile and Semi-Log Regression Modelling of Mass Appraisal in Housing
by Jose Torres-Pruñonosa, Pablo García-Estévez and Camilo Prado-Román
Mathematics 2021, 9(7), 783; https://0-doi-org.brum.beds.ac.uk/10.3390/math9070783 - 06 Apr 2021
Cited by 13 | Viewed by 2818
Abstract
We used a large sample of 188,652 properties, which represented 4.88% of the total housing stock in Catalonia from 1994 to 2013, to make a comparison between different real estate valuation methods based on artificial neural networks (ANNs), quantile regressions (QRs) and semi-log regressions (SLRs). A literature gap in regard to the comparison between ANN and QR modelling of hedonic prices in housing was identified, with this article being the first paper to include this comparison. Therefore, this study aimed to answer (1) whether QR valuation modelling of hedonic prices in the housing market is an alternative to ANNs, (2) whether it is confirmed that ANNs produce better results than SLRs when assessing housing in Catalonia, and (3) which of the three mass appraisal models should be used by Spanish banks to assess real estate. The results suggested that the ANNs and SLRs obtained similar and better performances than the QRs and that the SLRs performed better when the datasets were smaller. Therefore, (1) QRs were not found to be an alternative to ANNs, (2) it could not be confirmed whether ANNs performed better than SLRs when assessing properties in Catalonia and (3) whereas small and medium banks should use SLRs, large banks should use either SLRs or ANNs in real estate mass appraisal. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications)
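
Two of the three model families compared here, semi-log regression and quantile regression, are straightforward to fit with statsmodels; the hedged sketch below does so on synthetic data with invented variable names (the ANN model and the Catalan dataset are not reproduced, and the specification is illustrative only).

```python
# Semi-log (OLS on log price) and median quantile regression on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "area": rng.uniform(40, 200, n),
    "rooms": rng.integers(1, 6, n),
    "age": rng.uniform(0, 80, n),
})
df["price"] = np.exp(10 + 0.008 * df["area"] + 0.05 * df["rooms"]
                     - 0.003 * df["age"] + rng.normal(0, 0.15, n))

# Semi-log regression: hedonic characteristics enter the price multiplicatively.
slr = smf.ols("np.log(price) ~ area + rooms + age", data=df).fit()

# Median (q = 0.5) quantile regression on the same specification.
qr = smf.quantreg("np.log(price) ~ area + rooms + age", data=df).fit(q=0.5)

print(slr.params.round(4))
print(qr.params.round(4))
```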

15 pages, 5405 KiB  
Article
Motor Imagery Classification Based on a Recurrent-Convolutional Architecture to Control a Hexapod Robot
by Tat’y Mwata-Velu, Jose Ruiz-Pinales, Horacio Rostro-Gonzalez, Mario Alberto Ibarra-Manzano, Jorge Mario Cruz-Duarte and Juan Gabriel Avina-Cervantes
Mathematics 2021, 9(6), 606; https://0-doi-org.brum.beds.ac.uk/10.3390/math9060606 - 12 Mar 2021
Cited by 14 | Viewed by 3201
Abstract
Advances in the field of Brain-Computer Interfaces (BCIs) aim, among other applications, to improve the movement capacities of people suffering from the loss of motor skills. The main challenge in this area is to achieve real-time and accurate bio-signal processing for pattern recognition, especially in Motor Imagery (MI). The significant interaction between brain signals and controllable machines requires instantaneous brain data decoding. In this study, an embedded BCI system based on fist MI signals is developed. It uses an Emotiv EPOC+ Brainwear®, an Altera SoCKit® development board, and a hexapod robot for testing locomotion imagery commands. The system is tested to detect the imagined movements of closing and opening the left and right hand to control the robot locomotion. Electroencephalogram (EEG) signals associated with the motion tasks are sensed on the human sensorimotor cortex. Next, the SoCKit processes the data to identify the commands allowing the controlled robot locomotion. The classification of MI-EEG signals from the F3, F4, FC5, and FC6 sensors is performed using a hybrid architecture of Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks. This method takes advantage of the deep learning recognition model to develop a real-time embedded BCI system, where signal processing must be seamless and precise. The proposed method is evaluated using k-fold cross-validation on both created and public Scientific-Data datasets. Our dataset comprises 2400 trials obtained from four test subjects, each lasting three seconds of imagined fist closing and opening. The recognition tasks reach 84.69% and 79.2% accuracy using our data and a state-of-the-art dataset, respectively. Numerical results support that the motor imagery EEG signals can be successfully applied in BCI systems to control mobile robots and related applications such as intelligent vehicles. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications)
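
The recurrent-convolutional pattern itself is easy to prototype in Keras: 1-D convolutions extract local features from each multi-channel EEG window, an LSTM summarizes them over time, and a softmax layer produces the class. Layer sizes, the 128 Hz sampling rate, and the number of classes below are assumptions rather than the network actually used with the Emotiv/SoCKit pipeline.

```python
# Minimal CNN+LSTM classifier for windows of 4-channel EEG (F3, F4, FC5, FC6).
import numpy as np
from tensorflow.keras import layers, models

n_channels, fs, seconds, n_classes = 4, 128, 3, 4   # assumed sampling rate/classes
timesteps = fs * seconds

model = models.Sequential([
    layers.Input(shape=(timesteps, n_channels)),
    layers.Conv1D(32, kernel_size=7, activation="relu", padding="same"),
    layers.MaxPooling1D(pool_size=4),
    layers.Conv1D(64, kernel_size=5, activation="relu", padding="same"),
    layers.MaxPooling1D(pool_size=4),
    layers.LSTM(64),                      # temporal summary of the CNN features
    layers.Dropout(0.3),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

# Toy forward pass with random "EEG" windows just to check the shapes.
X = np.random.randn(8, timesteps, n_channels).astype("float32")
print(model.predict(X, verbose=0).shape)   # (8, n_classes)
```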

21 pages, 4261 KiB  
Article
Improving the Accuracy of Dam Inflow Predictions Using a Long Short-Term Memory Network Coupled with Wavelet Transform and Predictor Selection
by Trung Duc Tran, Vinh Ngoc Tran and Jongho Kim
Mathematics 2021, 9(5), 551; https://0-doi-org.brum.beds.ac.uk/10.3390/math9050551 - 05 Mar 2021
Cited by 20 | Viewed by 2600
Abstract
Accurate and reliable dam inflow prediction models are essential for effective reservoir operation and management. This study presents a data-driven model that couples a long short-term memory (LSTM) network with robust input predictor selection, input reconstruction by wavelet transformation, and efficient hyper-parameter optimization by K-fold cross-validation and random search. First, a robust analysis using a “correlation threshold” for partial autocorrelation and cross-correlation functions is proposed, and only variables greater than this threshold are selected as input predictors and their time lags. This analysis indicates that a model trained on a threshold of 0.4 returns the highest Nash–Sutcliffe efficiency value; as a result, six principal inputs are selected. Second, using additional subseries reconstructed by the wavelet transform improves predictability, particularly for flow peaks. The peak error values of the LSTM with the transform are approximately one-half to one-quarter the size of those without the transform. Third, for a K of 5 as determined by the Silhouette coefficients and the distortion score, the wavelet-transformed LSTMs require a larger number of hidden units, epochs, dropout, and batch size. This complex configuration is needed because the number of inputs used by these LSTMs is five times greater than that of other models. Last, an evaluation of accuracy performance reveals that the model proposed in this study, called SWLSTM, provides superior predictions of the daily inflow of the Hwacheon dam in South Korea compared with three other LSTM models by 84%, 78%, and 65%. These results strengthen the potential of data-driven models for efficient and effective reservoir inflow predictions, and should help policy-makers and operators better manage their reservoir operations. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications)
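
The input-construction step can be illustrated with PyWavelets: a stationary wavelet transform keeps each subseries at the original length, so the subseries can be stacked with the raw series and cut into sliding windows for an LSTM (such as the Keras model sketched above). The wavelet family, decomposition level, and window length below are assumptions, not the settings selected in the paper.

```python
# Stationary wavelet decomposition of a synthetic inflow series and
# construction of lagged windows suitable as LSTM inputs.
import numpy as np
import pywt

rng = np.random.default_rng(0)
n = 512                                   # a multiple of 2**level, as pywt.swt requires
inflow = np.cumsum(rng.normal(size=n)) + 10 * np.sin(np.arange(n) / 20)

level = 2
coeffs = pywt.swt(inflow, "db4", level=level)        # [(cA2, cD2), (cA1, cD1)]
subseries = [c for pair in coeffs for c in pair]      # 2 * level subseries

# Stack the raw inflow and its wavelet subseries into a feature matrix, then
# cut sliding windows of length `lag` to predict the next-day inflow.
features = np.column_stack([inflow] + subseries)      # shape (n, 1 + 2*level)
lag = 7
X = np.stack([features[t - lag:t] for t in range(lag, n)])   # (n - lag, lag, features)
y = inflow[lag:]
print(X.shape, y.shape)
```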

17 pages, 3373 KiB  
Article
Assessment of Students’ Achievements and Competencies in Mathematics Using CART and CART Ensembles and Bagging with Combined Model Improvement by MARS
by Snezhana Gocheva-Ilieva, Hristina Kulina and Atanas Ivanov
Mathematics 2021, 9(1), 62; https://0-doi-org.brum.beds.ac.uk/10.3390/math9010062 - 30 Dec 2020
Cited by 11 | Viewed by 3138
Abstract
The aim of this study is to evaluate students’ achievements in mathematics using three machine learning regression methods: classification and regression trees (CART), CART ensembles and bagging (CART-EB) and multivariate adaptive regression splines (MARS). A novel ensemble methodology is proposed based on the combination of CART and CART-EB models in a new ensemble to regress the actual data using MARS. Results of a final exam test, control and home assignments, and other learning activities to assess students’ knowledge and competencies in applied mathematics are examined. The exam test combines problems on elements of mathematical analysis, statistics and a small practical project. The project is the new competence-oriented element, which requires students to formulate problems themselves, to choose different solutions and to use or not use specialized software. Initially, the empirical data are statistically modeled using six CART and six CART-EB competing models. The models achieve a goodness-of-fit of up to 96% to the actual data. The impact of the examined factors on the students’ success at the final exam is determined. Using the best of these models and the proposed novel ensemble procedure, final MARS models are built that outperform the other models in predicting the achievements of students in applied mathematics. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications)
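
The two-stage combination can be mimicked in a few lines: fit a CART model and a bagged-CART model, then regress the target on their predictions with a second-stage model. In the sketch below a plain linear model stands in for MARS (MARS implementations exist in third-party packages such as py-earth), and the data are synthetic stand-ins for the actual assessment scores.

```python
# Two-stage sketch: CART and bagged CART as first-stage learners, with a
# second-stage regressor (here linear, as a stand-in for MARS) on their outputs.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 400
X = rng.uniform(0, 100, size=(n, 4))   # e.g. exam, homework, project, activity scores (invented)
y = 0.4 * X[:, 0] + 0.2 * X[:, 1] + 0.3 * X[:, 2] + 0.1 * X[:, 3] + rng.normal(0, 5, n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

cart = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)
bag = BaggingRegressor(DecisionTreeRegressor(max_depth=5), n_estimators=100,
                       random_state=0).fit(X_tr, y_tr)

# Second stage: combine the first-stage predictions.
Z_tr = np.column_stack([cart.predict(X_tr), bag.predict(X_tr)])
Z_te = np.column_stack([cart.predict(X_te), bag.predict(X_te)])
stack = LinearRegression().fit(Z_tr, y_tr)

print("CART R2:   ", round(r2_score(y_te, cart.predict(X_te)), 3))
print("Bagging R2:", round(r2_score(y_te, bag.predict(X_te)), 3))
print("Stacked R2:", round(r2_score(y_te, stack.predict(Z_te)), 3))
```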
