Applied Statistical Modeling and Data Mining

A special issue of Mathematics (ISSN 2227-7390). This special issue belongs to the section "Mathematics and Computer Science".

Deadline for manuscript submissions: closed (31 December 2023) | Viewed by 25174

Special Issue Editors


Guest Editor
Department of Statistics and Operations Research, Faculty of Sciences, University of Granada, 18071 Granada, Spain
Interests: data mining; statistical modeling; classification; regression; noisy data

Guest Editor
Department of Statistics and Operations Research, Faculty of Sciences, University of Granada, 18071 Granada, Spain
Interests: multivariate analysis; random fields; spatio-temporal risk assessment; spatio-temporal modeling

Special Issue Information

Dear Colleagues,

This Special Issue focuses on applied statistical modeling and data mining. Papers on theoretical aspects of classical statistics and of spatial and spatio-temporal statistics, as well as any application of statistical modeling to real-world problems, are welcome. Manuscripts that address the challenges of data mining and machine learning are also considered, including work on traditional classification, regression, and unsupervised learning, among others.

Submissions are expected in, but not limited to, the following areas of interest:

  1. Data mining
  2. Machine learning
  3. Multivariate analysis
  4. Spatial risk assessment
  5. Spatio-temporal modeling
  6. Statistical modeling and applications
  7. Supervised and unsupervised learning, including research on classification, regression, and clustering, among others

Dr. Jose Antonio Sáez Muñoz
Dr. José Luis Romero Béjar
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Mathematics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • data mining
  • machine learning
  • multivariate analysis
  • spatio-temporal
  • statistical modeling
  • risk assessment

Published Papers (15 papers)

Research

54 pages, 44875 KiB  
Article
Research on Emotional Infection of Passengers during the SRtP of a Cruise Ship by Combining an SIR Model and Machine Learning
by Gaohan Xiong, Wei Cai, Min Hu and Zhiyan Yu
Mathematics 2023, 11(21), 4461; https://0-doi-org.brum.beds.ac.uk/10.3390/math11214461 - 27 Oct 2023
Viewed by 1002
Abstract
The Safe Return to Port issue regarding cruise ships has been extensively researched, covering aspects such as performance, operations, and electrical systems. However, an often overlooked aspect is the potential eruption of negative emotions among passengers during SRtP. This study aims to investigate the prediction of collective emotions to facilitate timely safety planning and enhance the safety of the Safe Return to Port process. To achieve this objective, an improved susceptible-infectious-recovered model with bidirectional infection is proposed to describe the emotional contagion process during the Safe Return to Port process. This model classifies the population into five emotional (extremely anxious–anxious–normal–calm–very calm) states and introduces two sources of infection. Moreover, it allows for emotions to transition both positively and negatively, making it a more realistic representation of scenarios resembling long-term refuge scenarios. In this study, questionnaire data, collected and statistically analyzed, serve as the primary dataset. A machine learning technique (the weighted random forest algorithm) is integrated with the model to make predictions. The accuracy, precision, recall, and the F-measure of prediction results demonstrate good performance. Additionally, through simulation, this study illustrates the fluctuating nature of emotional changes during the Safe Return to Port process of the cruise ship and analyzes the effects of varying parameters. The findings suggest that the improved susceptible-infectious-recovered model proposed in this paper can provide valuable insights for cruise ship emergency planning and positively contribute to maintaining passenger emotional stability during the Safe Return to Port process. Full article
(This article belongs to the Special Issue Applied Statistical Modeling and Data Mining)
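
For orientation, the classic three-compartment susceptible-infectious-recovered (SIR) dynamics that the abstract above builds on can be sketched as follows. This is only the textbook model, not the paper's improved five-state, bidirectional variant; the function name, the forward Euler integration, and all parameter values are illustrative assumptions.

```python
def simulate_sir(beta, gamma, s0, i0, r0, dt=0.1, steps=1000):
    """Integrate the classic SIR equations with forward Euler steps.

    beta  : contact (infection) rate, hypothetical value supplied by the caller
    gamma : recovery rate, hypothetical value supplied by the caller
    Returns a list of (S, I, R) tuples, one per time step.
    """
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r  # total population, constant in the classic model
    history = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i / n * dt  # flow S -> I
        new_recoveries = gamma * i * dt         # flow I -> R
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


if __name__ == "__main__":
    # Hypothetical scenario: 1000 passengers, 10 initially "infected".
    trajectory = simulate_sir(beta=0.5, gamma=0.1, s0=990, i0=10, r0=0)
    print(trajectory[-1])
```

A model with more emotional states and transitions in both directions would add compartments and flows to the same update loop.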

21 pages, 2521 KiB  
Article
BLogic: A Bayesian Model Combination Approach in Logic Regression
by Yu-Chung Wei
Mathematics 2023, 11(20), 4353; https://0-doi-org.brum.beds.ac.uk/10.3390/math11204353 - 19 Oct 2023
Viewed by 785
Abstract
With the increasing complexity and dimensionality of datasets in statistical research, traditional methods of identifying interactions are often more challenging to apply due to the limitations of model assumptions. Logic regression has emerged as an effective tool, leveraging Boolean combinations of binary explanatory variables. However, the prevalent simulated annealing approach in logic regression sometimes faces stability issues. This study introduces the BLogic algorithm, a novel approach that amalgamates multiple runs of simulated annealing on a dataset and synthesizes the results via the Bayesian model combination technique. This algorithm not only facilitates predicting response variables using binary explanatory ones but also offers a score computation for prime implicants, elucidating key variables and their interactions within the data. In simulations with identical parameters, conventional logic regression, when executed with a single instance of simulated annealing, exhibits reduced predictive and interpretative capabilities as soon as the ratio of explanatory variables to sample size surpasses 10. In contrast, the BLogic algorithm maintains its effectiveness until this ratio approaches 50. This underscores its heightened resilience against challenges in high-dimensional settings, especially the large p, small n problem. Moreover, employing real-world data from the UK10K Project, we also showcase the practical performance of the BLogic algorithm. Full article

30 pages, 3307 KiB  
Article
Dynamic Generation Method of Highway ETC Gantry Topology Based on LightGBM
by Fumin Zou, Weihai Wang, Qiqin Cai, Feng Guo and Rouyue Shi
Mathematics 2023, 11(15), 3413; https://0-doi-org.brum.beds.ac.uk/10.3390/math11153413 - 04 Aug 2023
Viewed by 960
Abstract
In Electronic Toll Collection (ETC) systems, accurate gantry topology data are crucial for fair and efficient toll collection. Currently, inaccuracies in the topology data can cause tolls to be based on the shortest route rather than the actual distance travelled, contradicting the ETC system’s purpose. To address this, we adopt a novel Gradient Boosting Decision Tree (GBDT) algorithm, Light Gradient Boosting Machine (LightGBM), to dynamically update ETC gantry topology data on highways. We use ETC gantry and toll booth transaction data from a province in southeast China, where ETC usage is high at 72.8%. From this data, we generate a candidate topology set and extract five key characteristics. We then use Amap API and QGIS map analysis to annotate the candidate set, and, finally, apply LightGBM to train on these features, generating the dynamic topology. Our comparison of LightGBM with 14 other machine learning algorithms showed that LightGBM outperformed the others, achieving an impressive accuracy of 97.6%. This methodology can help transportation departments maintain accurate and up-to-date toll systems, reducing errors and improving efficiency. Full article

16 pages, 584 KiB  
Article
Evaluation of Convergent, Discriminant, and Criterion Validity of the Cuestionario Burnout Granada-University Students
by Elena Ortega-Campos, Gustavo R. Cañadas, Raimundo Aguayo-Estremera, Tania Ariza, Carolina S. Monsalve-Reyes, Nora Suleiman-Martos and Emilia I. De la Fuente-Solana
Mathematics 2023, 11(15), 3315; https://0-doi-org.brum.beds.ac.uk/10.3390/math11153315 - 28 Jul 2023
Viewed by 968
Abstract
Burnout is a health problem that affects professionals and students or professionals in training, especially those in health areas. For this reason, it is necessary that it is properly identified to prevent the impact it can have on the work and personal areas of the people who suffer from it. The aim of this work is to study the convergent, discriminant, and criterion validity of the Cuestionario Burnout Granada-University Students. The sample consisted of 463 undergraduate nursing students, selected by non-probabilistic convenience sampling, who participated voluntarily and anonymously in the study. The mean age of the participants was 21.9 (5.12) years, mostly female (74.1%), single (95.8%), and childless (95.6%). Information was collected face-to-face, and the instruments were completed on paper. Comparisons were made in the three dimensions of burnout of the CBG-USS between students with and without burnout, finding statistically significant differences in all three dimensions: Emotional Exhaustion (p < 0.001, d = 0.674), Cynicism (p < 0.001, d = 0.479), and Academic Efficacy (p < 0.001, d = −0.607). The Cuestionario Burnout Granada-University Students presents adequate reliability and validity indices, which demonstrates its usefulness in the identification of burnout. This syndrome has traditionally been measured in professionals, but students also present burnout, so it is necessary to have specific burnout instruments for students, since the pre-work situation and stressors of students are different from those of workers. In order to work on the prevention of university burnout, it is essential to have specific instruments for professionals in training that help in the detection of students with burnout. Full article

18 pages, 830 KiB  
Article
On the Quality of Synthetic Generated Tabular Data
by Erica Espinosa and Alvaro Figueira
Mathematics 2023, 11(15), 3278; https://0-doi-org.brum.beds.ac.uk/10.3390/math11153278 - 26 Jul 2023
Cited by 2 | Viewed by 1447
Abstract
Class imbalance is a common issue while developing classification models. In order to tackle this problem, synthetic data have recently been developed to enhance the minority class. These artificially generated samples aim to bolster the representation of the minority class. However, evaluating the suitability of such generated data is crucial to ensure their alignment with the original data distribution. Utility measures come into play here to quantify how similar the distribution of the generated data is to the original one. For tabular data, there are various evaluation methods that assess different characteristics of the generated data. In this study, we collected utility measures and categorized them based on the type of analysis they performed. We then applied these measures to synthetic data generated from two well-known datasets, Adults Income, and Liar+. We also used five well-known generative models, Borderline SMOTE, DataSynthesizer, CTGAN, CopulaGAN, and REaLTabFormer, to generate the synthetic data and evaluated its quality using the utility measures. The measurements have proven to be informative, indicating that if one synthetic dataset is superior to another in terms of utility measures, it will be more effective as an augmentation for the minority class when performing classification tasks. Full article

21 pages, 1807 KiB  
Article
Validity Evidence for the Internal Structure of the Maslach Burnout Inventory-Student Survey: A Comparison between Classical CFA Model and the ESEM and the Bifactor Models
by Raimundo Aguayo-Estremera, Gustavo R. Cañadas, Elena Ortega-Campos, Tania Ariza and Emilia Inmaculada De la Fuente-Solana
Mathematics 2023, 11(6), 1515; https://0-doi-org.brum.beds.ac.uk/10.3390/math11061515 - 21 Mar 2023
Cited by 5 | Viewed by 2297
Abstract
Academic burnout is a psychological problem characterized by three dimensions: emotional exhaustion, depersonalization, and personal accomplishment. This paper studies the internal structure of the MBI-SS, the most widely used instrument to assess burnout in students. The bifactor model and the ESEM approach have been proposed as alternatives, capable of overcoming the classical techniques of CFA to address this issue. Our study considers the internal structure of the MBI-SS by testing the models most frequently referenced in the literature, along with the bifactor model and the ESEM. After determining which model best fits the data, we calculate the most appropriate reliability index. In addition, we examined the validity evidence using other variables, namely the concurrent relationships with depression, anxiety, neuroticism, and conscientiousness, and the discriminant relationships with the dimensions of engagement, extraversion, and agreeableness. The results obtained indicate that the internal structure of the MBI-SS is well reflected by the three-factor congeneric oblique model, reaching good values of reliability and convergent and discriminant validity. Therefore, when the scale is used in applied contexts, we recommend considering the total scores obtained for each of the dimensions. Finally, we recommend using the omega coefficient and not the alpha coefficient as an estimator of reliability. Full article

28 pages, 4953 KiB  
Article
A Dynamic Spatio-Temporal Stochastic Modeling Approach of Emergency Calls in an Urban Context
by David Payares-Garcia, Javier Platero and Jorge Mateu
Mathematics 2023, 11(4), 1052; https://0-doi-org.brum.beds.ac.uk/10.3390/math11041052 - 19 Feb 2023
Viewed by 1436
Abstract
Emergency calls are defined by an ever-expanding utilisation of information and sensing technology, leading to extensive volumes of spatio-temporal high-resolution data. The spatial and temporal character of the emergency calls is leveraged by authorities to allocate resources and infrastructure for an effective response, to identify high-risk event areas, and to develop contingency strategies. In this context, the spatio-temporal analysis of emergency calls is crucial to understanding and mitigating distress situations. However, modelling and predicting crime-related emergency calls remain challenging due to their heterogeneous and dynamic nature with complex underlying processes. In this context, we propose a modelling strategy that accounts for the intrinsic complex space–time dynamics of some crime data on cities by handling complex advection, diffusion, relocation, and volatility processes. This study presents a predictive framework capable of assimilating data and providing confidence estimates on the predictions. By analysing the dynamics of the weekly number of emergency calls in Valencia, Spain, for ten years (2010–2020), we aim to understand and forecast the spatio-temporal behaviour of emergency calls in an urban environment. We include putative geographical variables, as well as distances to relevant city landmarks, into the spatio-temporal point process modelling framework to measure the effect deterministic components exert on the intensity of emergency calls in Valencia. Our results show how landmarks attract or repel offenders and act as proxies to identify areas with high or low emergency calls. We are also able to estimate the weekly average growth and decay in space and time of the emergency calls. Our proposal is intended to guide mitigation strategies and policy. Full article

20 pages, 3956 KiB  
Article
A Malware Attack Enabled an Online Energy Strategy for Dynamic Wireless EVs within Transportation Systems
by Fahad Alsokhiry, Andres Annuk, Toivo Kabanen and Mohamed A. Mohamed
Mathematics 2022, 10(24), 4691; https://0-doi-org.brum.beds.ac.uk/10.3390/math10244691 - 10 Dec 2022
Cited by 4 | Viewed by 1220
Abstract
Developing transportation systems (TSs) under the structure of a wireless sensor network (WSN) along with great preponderance can be an Achilles’ heel from the standpoint of cyber-attacks, which is worthy of attention. Hence, a crucial security concern facing WSNs embedded in electrical vehicles (EVs) is malware attacks. With this in mind, this paper addressed a cyber-detection method based on the offense–defense game model to ward off malware attacks on smart EVs developed by a wireless sensor for receiving data in order to control the traffic flow within TSs. This method is inspired by the integrated Nash equilibrium result in the game and can detect the probability of launching malware into the WSN-based EV technology. For effective realization, modeling the malware attacks in conformity with EVs was discussed. This type of attack can inflict untraceable detriments on TSs by moving EVs out of their optimal paths for which the EVs’ power consumption tends toward ascending thanks to the increasing traffic flow density. In view of this, the present paper proposed an effective traffic-flow density-based dynamic model for EVs within transportation systems. Additionally, on account of the uncertain power consumption of EVs, an uncertainty-based UT function was presented to model its effects on the traffic flow. It was inferred from the results that there is a relationship between the power consumption and traffic flow for the existence of malware attacks. Additionally, the results revealed the importance of repressing malware attacks on TSs. Full article

24 pages, 6645 KiB  
Article
Machine Learning-Based Prediction Models of Acute Respiratory Failure in Patients with Acute Pesticide Poisoning
by Yeongmin Kim, Minsu Chae, Namjun Cho, Hyowook Gil and Hwamin Lee
Mathematics 2022, 10(24), 4633; https://0-doi-org.brum.beds.ac.uk/10.3390/math10244633 - 07 Dec 2022
Cited by 2 | Viewed by 1511
Abstract
The prognosis of patients with acute pesticide poisoning depends on their acute respiratory condition. Here, we propose machine learning models to predict acute respiratory failure in patients with acute pesticide poisoning using a decision tree, logistic regression, random forests, support vector machine, adaptive boosting, gradient boosting, multi-layer boosting, recurrent neural network, long short-term memory, and gated recurrent unit. We collected medical records of patients with acute pesticide poisoning at the Soonchunhyang University Cheonan Hospital from 1 January 2016 to 31 December 2020. We applied the k-Nearest Neighbor Imputer algorithm, the MissForest Imputer, and the average imputation method to handle the problems of missing values and outliers in electronic medical records. In addition, we used the min–max scaling method for feature scaling. Using the most recent medical research, p-values, tree-based feature selection, and recursive feature reduction, we selected 17 out of 81 features. We applied a sliding window of 3 h to every patient’s medical record within 24 h. As the prevalence of acute respiratory failure in our dataset was 8%, we employed oversampling. We assessed the performance of our models in predicting acute respiratory failure. The proposed long short-term memory demonstrated a positive predictive value of 98.42%, a sensitivity of 97.91%, and an F1 score of 0.9816. Full article
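
The 3 h sliding window over a 24 h record mentioned in the abstract above can be illustrated with a small sketch. It assumes one record per hour and a stride of one hour; neither detail is stated in the abstract, and the function name is hypothetical.

```python
def sliding_windows(records, width=3, stride=1):
    """Split a time-ordered list of records into overlapping windows.

    width  : number of consecutive records per window (assumed: 3 hourly records)
    stride : step between window starts (assumed: 1 record)
    """
    last_start = len(records) - width
    return [records[start:start + width]
            for start in range(0, last_start + 1, stride)]


if __name__ == "__main__":
    hourly_records = list(range(24))  # placeholder for 24 hourly observations
    windows = sliding_windows(hourly_records)
    print(len(windows), windows[0], windows[-1])
```

Each window can then be fed to a sequence model such as a recurrent network as one training sample.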

17 pages, 10256 KiB  
Article
An Energy Efficient Specializing DAG Federated Learning Based on Event-Triggered Communication
by Xiaofeng Xue, Haokun Mao, Qiong Li, Furong Huang and Ahmed A. Abd El-Latif
Mathematics 2022, 10(22), 4388; https://0-doi-org.brum.beds.ac.uk/10.3390/math10224388 - 21 Nov 2022
Cited by 2 | Viewed by 1604
Abstract
Specializing Directed Acyclic Graph Federated Learning (SDAGFL) is a new federated learning framework with the advantages of decentralization, personalization, and resistance to single points of failure and poisoning attacks. Instead of training a single global model, the clients in SDAGFL update their models asynchronously from the devices with similar data distribution through Directed Acyclic Graph Distributed Ledger Technology (DAG-DLT), which is designed for IoT scenarios. Because of the many features inherited from DAG-DLT, SDAGFL is suitable for IoT scenarios in many aspects. However, the training process of SDAGFL is quite energy consuming: each client needs to compute the confidence and rating of the nodes selected by multiple random walks, traversing the ledger to a depth of 15–25, to obtain the “reference model” used to judge whether or not to broadcast the newly trained model. Energy consumption is an important issue for IoT scenarios, as most devices are battery-powered with strict energy restrictions. To optimize SDAGFL for IoT, an energy-efficient SDAGFL based on an event-triggered communication mechanism, i.e., ESDAGFL, is proposed in this paper. In ESDAGFL, the new model is broadcast only when it is significantly different from the previous one, instead of traversing the ledger to search for the “reference model”. We evaluate ESDAGFL on the FMNIST-clustered and Poets datasets. The simulation is performed on a platform with an Intel® Core™ i7-10700 CPU (CA, USA). The simulation results demonstrate that ESDAGFL can reach a balance between training accuracy and specialization as good as SDAGFL. What is more, ESDAGFL can reduce the energy consumption by 42.5% and 51.7% for the FMNIST-clustered and Poets datasets, respectively. Full article
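
The event-triggered idea in the abstract above, broadcasting a model only when it differs significantly from the previous one, might look like the sketch below. The abstract does not say how the difference is measured; the Euclidean distance between flattened weight vectors, the threshold, and the function name are all assumptions made for illustration.

```python
import math


def should_broadcast(new_weights, last_broadcast_weights, threshold):
    """Return True when the new model has drifted far enough to be worth sending.

    Assumption: "significantly different" is approximated by the Euclidean
    distance between two equally long weight vectors exceeding a threshold.
    """
    squared_diff = sum((a - b) ** 2
                       for a, b in zip(new_weights, last_broadcast_weights))
    return math.sqrt(squared_diff) > threshold


if __name__ == "__main__":
    previous = [1.0, 2.0]
    print(should_broadcast([1.0, 2.05], previous, threshold=0.1))  # small change
    print(should_broadcast([0.0, 2.0], previous, threshold=0.1))   # large change
```

Skipping broadcasts for small updates is what saves communication energy; the threshold trades that saving against how stale other clients' view of the model becomes.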

27 pages, 5785 KiB  
Article
Smart Patrolling Based on Spatial-Temporal Information Using Machine Learning
by Cesar Guevara and Matilde Santos
Mathematics 2022, 10(22), 4368; https://0-doi-org.brum.beds.ac.uk/10.3390/math10224368 - 20 Nov 2022
Viewed by 2126
Abstract
With the aim of improving security in cities and reducing the number of crimes, this research proposes an algorithm that combines artificial intelligence (AI) and machine learning (ML) techniques to generate police patrol routes. Real data on crimes reported in Quito City, Ecuador, during 2017 are used. The algorithm, which consists of four stages, combines spatial and temporal information. First, crimes are grouped around the points with the highest concentration of felonies, and future hotspots are predicted. Then, the probability of crimes committed in any of those areas at a time slot is studied. This information is combined with the spatial way-points to obtain real surveillance routes through a fuzzy decision system that considers distance and time (computed with the OpenStreetMap API), and probability. Computing time has been analyzed and routes have been compared with those proposed by an expert. The results prove that using spatial–temporal information allows the design of patrolling routes in an effective way and thus improves citizen security and decreases spending on police resources. Full article

14 pages, 4588 KiB  
Article
Preprocessing of Spectroscopic Data Using Affine Transformations to Improve Pattern-Recognition Analysis: An Application to Prehistoric Lithic Tools
by Francisco Javier Esquivel, José Antonio Esquivel, Antonio Morgado, José L. Romero-Béjar and Luis F. García del Moral
Mathematics 2022, 10(22), 4250; https://0-doi-org.brum.beds.ac.uk/10.3390/math10224250 - 13 Nov 2022
Viewed by 1457
Abstract
The analysis of spectral reflectance data is an important tool for obtaining relevant information about the mineral composition of objects and has been used for research in chemistry, geology, biology, archaeology, pharmacy, medicine, anthropology, and other disciplines. In archaeology, the use of spectroscopic data allows us to characterize and classify artifacts and ecofacts, to analyze patterns, and to study the exchange of materials, etc., as well as to explain some properties, such as color or post-depositional processes. The spectroscopic data are of the so-called “big data” type and must be analyzed using multivariate statistical techniques, usually principal component analysis and cluster analysis. Although there are different transformations of the raw data, in this paper, we propose preprocessing by means of an affine transformation. From a mathematical point of view, this process modifies the values of reflectance for each spectral signature scaling them into a [0, 1] interval using minimum and maximum values of reflectance, thus highlighting the features of spectral curves. This method optimizes the characteristics of amplitude and shape, reduces the influence of noise, and improves results by highlighting relevant features as peaks and valleys that may remain hidden using the raw data. This methodology has been applied to a case study of prehistoric chert (flint) artifacts retrieved in archaeological excavations in the Andévalo area located in the Archaeological Museum of Huelva (Huelva, Andalusia). The use of transformed data considerably improves the results obtained with raw data, highlighting the peaks, valleys, and the shape of spectral signatures. Full article
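
The affine preprocessing described in the abstract above, rescaling each spectral signature into [0, 1] using its own minimum and maximum reflectance, can be sketched as follows. The function name and the handling of a perfectly flat signature are assumptions; the abstract gives only the general rule.

```python
def rescale_signature(reflectance):
    """Map one spectral signature onto [0, 1] with x' = (x - min) / (max - min)."""
    lowest, highest = min(reflectance), max(reflectance)
    if highest == lowest:
        # Assumption: a flat signature has no shape to preserve, so return zeros.
        return [0.0 for _ in reflectance]
    span = highest - lowest
    return [(value - lowest) / span for value in reflectance]


if __name__ == "__main__":
    signature = [10.0, 20.0, 30.0, 25.0]  # made-up reflectance values
    print(rescale_signature(signature))
```

Because every signature is rescaled independently, curves with different overall brightness become comparable by shape, which is what the subsequent principal component and cluster analyses rely on.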

18 pages, 688 KiB  
Article
Determinants of Investment Awareness: A Moderating Structural Equation Modeling-Based Model in the Saudi Arabian Context
by Mohamed Ali Shabeeb Ali, Mohammed Abdullah Ammer and Ibrahim A. Elshaer
Mathematics 2022, 10(20), 3829; https://0-doi-org.brum.beds.ac.uk/10.3390/math10203829 - 17 Oct 2022
Cited by 9 | Viewed by 3011
Abstract
In line with today’s economy, investment and financial awareness are necessary for success and an individual’s well-being, specifically for the younger generations. Therefore, this study aims to examine the relationships between financial literacy, saving behavior, a lack of self-control, family financial socialization, and investment awareness. Further, it investigates the moderating role of both family financial socialization and the lack of self-control in these relationships. Employing a quantitative study technique and partial least squares structural equation modeling (PLS-SEM), we analyzed a sample of 409 students representing young adults at King Faisal University, specifically in the School of Business. Our results indicate that financial literacy, saving behavior, and family financial socialization are significantly and positively related to investment awareness. As expected, a lack of self-control negatively and significantly affects investment awareness. Regarding the moderating effects, the relationships of financial literacy and saving behavior with investment awareness are positively and strongly moderated by family financial socialization. Likewise, a lack of self-control significantly and negatively moderates the associations of financial literacy and saving behavior with investment awareness. The results of this study have substantial implications for regulators, educational organizations, individuals, and their families.
(This article belongs to the Special Issue Applied Statistical Modeling and Data Mining)

14 pages, 484 KiB  
Article
Impact of Regressand Stratification in Dataset Shift Caused by Cross-Validation
by José A. Sáez and José L. Romero-Béjar
Mathematics 2022, 10(14), 2538; https://0-doi-org.brum.beds.ac.uk/10.3390/math10142538 - 21 Jul 2022
Cited by 1 | Viewed by 1335
Abstract
Data that have not been modeled cannot be correctly predicted. Under this assumption, this research studies how k-fold cross-validation can introduce dataset shift in regression problems. This shift implies that the data distributions in the training and test sets differ and, therefore, that the estimation of model performance deteriorates. Even though stratification of the output variable is widely used in classification to reduce the impact of dataset shift induced by cross-validation, its use in regression is not widespread in the literature. This paper analyzes the consequences for dataset shift of including different regressand stratification schemes in cross-validation with regression data. The results show that these schemes create more similar training and test sets, reducing the presence of dataset shift related to cross-validation. Using the largest numbers of strata improves the bias and deviation of the performance estimates obtained by regression algorithms, as well as the number of cross-validation repetitions necessary to obtain these better results.
(This article belongs to the Special Issue Applied Statistical Modeling and Data Mining)
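
To make the idea of regressand stratification concrete, here is a minimal sketch of one common scheme, which is not necessarily one of the schemes studied in the paper: sort the samples by target value, cut the sorted order into strata of k neighbouring samples, and give each fold one sample from every stratum. The function name `stratified_regression_folds` and the toy targets are hypothetical.

```python
import numpy as np

def stratified_regression_folds(y, k=5, seed=0):
    """Assign each sample a fold id in [0, k) so folds cover similar target ranges."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    order = np.argsort(y, kind="stable")   # sample indices sorted by target value
    folds = np.empty(len(y), dtype=int)
    for start in range(0, len(y), k):
        # One stratum: up to k samples with neighbouring target values.
        block = order[start:start + k]
        # Each sample in the stratum goes to a different, randomly chosen fold.
        folds[block] = rng.permutation(k)[:len(block)]
    return folds

# Toy regression targets (hypothetical): 20 evenly spaced values.
y = np.arange(20, dtype=float)
folds = stratified_regression_folds(y, k=5)
```

In this sketch the number of strata grows with the sample size (n / k strata of size k), so every test fold contains low, middle, and high target values, which is the property that plain random partitioning does not guarantee.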

Review

20 pages, 475 KiB  
Review
Noise Models in Classification: Unified Nomenclature, Extended Taxonomy and Pragmatic Categorization
by José A. Sáez
Mathematics 2022, 10(20), 3736; https://0-doi-org.brum.beds.ac.uk/10.3390/math10203736 - 11 Oct 2022
Cited by 3 | Viewed by 2251
Abstract
This paper presents the first review of noise models in classification covering both label and attribute noise. Studying these models reveals the lack of a unified nomenclature in this field. To address this problem, a tripartite nomenclature based on the structural analysis of existing noise models is proposed. Additionally, the current taxonomies are revised, combined, and updated to better reflect the nature of any model. Finally, a categorization of noise models is proposed from a practical point of view, depending on the characteristics of the noise and the purpose of the study. These contributions provide a variety of models for introducing noise, their characteristics according to the proposed taxonomy, and a unified way of naming them, which will facilitate their identification and study, as well as the reproducibility of future research.
(This article belongs to the Special Issue Applied Statistical Modeling and Data Mining)