Multivariate Statistics: Theory and Its Applications

A special issue of Mathematics (ISSN 2227-7390). This special issue belongs to the section "Probability and Statistics".

Deadline for manuscript submissions: closed (31 May 2022) | Viewed by 43769

Special Issue Editor


Prof. Dr. María Purificación Galindo Villardón
Guest Editor
Department of Statistics, University of Salamanca, 37008 Salamanca, Spain
Interests: multivariate analysis; computational statistics; data mining; machine learning; big data

Special Issue Information

Dear Colleagues,

It is well known that data is knowledge and knowledge is a powerful tool when used correctly. Multivariate statistical methods have been increasingly used in different areas, where factorial techniques (PCA and its generalizations, biplot methods, correspondence analysis, STATIS methods, Tucker models, ...) are very relevant. The importance of these techniques has been reinforced by the need to analyze large volumes of data, which has required adapting them for use in data mining, business intelligence and big data.

This special issue invites papers on data analysis topics with potential application in the life and social sciences. We solicit papers that incorporate new ideas, results, innovative and modern methodologies and algorithms with application to real-life problems. We also encourage the submission of papers that include modules and computational packages that allow the reproduction and implementation of the results.

Research papers, review articles and short communications are invited.

Topics of interest include but are not limited to the following:

  • Multivariate statistical methods for data analysis;
  • Data mining;
  • Text mining;
  • Sentiment analysis;
  • Learning association rules;
  • Computational statistics;
  • High-dimensional statistics with a view toward applications.

Prof. Dr. María Purificación Galindo Villardón
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Mathematics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • PCA
  • sparse PCA
  • weighted PCA
  • functional PCA
  • HJ-biplot
  • GGE biplot
  • logistic biplot
  • canonical biplot
  • disjoint biplot
  • compositional biplot
  • sparse biplot
  • correspondence analysis
  • STATIS
  • STATIS dual
  • STATIS-4
  • canonical STATIS
  • CoSTATIS
  • STATICo
  • TUCKER models
  • CoTUCKER
  • TUCKERCo
  • cluster

Published Papers (19 papers)

Research

23 pages, 1216 KiB  
Article
Partial Least Squares Regression for Binary Responses and Its Associated Biplot Representation
by Laura Vicente-Gonzalez  and Jose Luis Vicente-Villardon
Mathematics 2022, 10(15), 2580; https://0-doi-org.brum.beds.ac.uk/10.3390/math10152580 - 25 Jul 2022
Cited by 4 | Viewed by 2497
Abstract
In this paper, we propose a generalization of Partial Least Squares Regression (PLS-R) for a matrix of several binary responses and a set of numerical predictors. We call the method Partial Least Squares Binary Logistic Regression (PLS-BLR). This is equivalent to a PLS-2 model for binary responses. Biplot and even triplot graphical representations for visualizing PLS-BLR models are described, and an application to real data is presented. Software packages for the calculation of the main results are also provided. We conclude that the proposed method and its visualization using triplots are powerful tools for the interpretation of the relations among predictors and responses. Full article
(This article belongs to the Special Issue Multivariate Statistics: Theory and Its Applications)

17 pages, 3320 KiB  
Article
HJ-Biplot as a Tool to Give an Extra Analytical Boost for the Latent Dirichlet Assignment (LDA) Model: With an Application to Digital News Analysis about COVID-19
by Luis Pilacuan-Bonete, Purificación Galindo-Villardón and Francisco Delgado-Álvarez
Mathematics 2022, 10(14), 2529; https://0-doi-org.brum.beds.ac.uk/10.3390/math10142529 - 20 Jul 2022
Viewed by 1677
Abstract
The objective of this work is to generate an HJ-biplot representation for the content analysis obtained by latent Dirichlet assignment (LDA) of the headlines of three Spanish newspapers, in their web versions, on the pandemic caused by the SARS-CoV-2 virus (COVID-19), which has affected more than 500 million people and caused almost six million deaths to date. The HJ-biplot is used to give an extra analytical boost to the model. It is an easy-to-interpret multivariate technique that does not require in-depth knowledge of statistics, and it captures the relationship between the COVID-19 news topics and the three digital newspapers. Compared with LDAvis and heatmap representations, the HJ-biplot provides a better representation and visualization, allowing us to analyze the relationship between each newspaper (column markers represented by vectors) and the 14 topics obtained from the LDA model (row markers represented by points) in the plane with the greatest informative capacity. It is concluded that the newspapers El Mundo and 20 M present greater homogeneity among the topics published during the pandemic, while El País presents topics that are less related to the other two newspapers, highlighting topics such as t_12 (Government_Madrid) and t_13 (Government_millions). Full article
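As a rough sketch of what the row and column markers mentioned above are, the HJ-biplot is usually described in terms of the singular value decomposition of the (suitably centred or scaled) data matrix. Here X is assumed to be the topic-by-newspaper matrix, q is the number of retained dimensions (q = 2 for a plane), and the symbols J and H are the conventional names for the markers rather than notation taken from this abstract:

```latex
X \approx U_{(q)} \Sigma_{(q)} V_{(q)}^{\top}, \qquad
J = U_{(q)} \Sigma_{(q)} \quad \text{(row markers, here the topics)}, \qquad
H = V_{(q)} \Sigma_{(q)} \quad \text{(column markers, here the newspapers)}
```

Under this convention both sets of markers are in principal coordinates, so rows and columns are represented with the same quality in the same plane. The price is that the product J H^T no longer reproduces X, which is why the HJ-biplot is read in terms of distances, angles and projections rather than as a reconstruction of the data.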

20 pages, 890 KiB  
Article
Modelling Bimodal Data Using a Multivariate Triangular-Linked Distribution
by Daan de Waal, Tristan Harris, Alta de Waal and Jocelyn Mazarura
Mathematics 2022, 10(14), 2370; https://0-doi-org.brum.beds.ac.uk/10.3390/math10142370 - 06 Jul 2022
Viewed by 1789
Abstract
Bimodal distributions have rarely been studied although they appear frequently in datasets. We develop a novel bimodal distribution based on the triangular distribution and then expand it to the multivariate case using a Gaussian copula. To determine the goodness of fit of the univariate model, we use the Kolmogorov–Smirnov (KS) and Cramér–von Mises (CVM) tests. The contributions of this work are that a simplistic yet robust distribution was developed to deal with bimodality in data, a multivariate distribution was developed as a generalisation of this univariate distribution using a Gaussian copula, a comparison between parametric and semi-parametric approaches to modelling bimodality is given, and an R package called btld is developed from the workings of this paper. Full article
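The goodness-of-fit step can be illustrated with a minimal sketch of the one-sample Kolmogorov–Smirnov statistic. It assumes the ordinary triangular distribution as a stand-in for the paper's bimodal triangular-linked model, whose exact form is not given in the abstract; the function names and parameter values are hypothetical and are not taken from the btld package.

```python
import random


def triangular_cdf(x, low, mode, high):
    """CDF of the ordinary triangular distribution on [low, high] (stand-in model)."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    if x <= mode:
        return (x - low) ** 2 / ((high - low) * (mode - low))
    return 1.0 - (high - x) ** 2 / ((high - low) * (high - mode))


def ks_statistic(sample, cdf):
    """One-sample KS statistic D_n = sup_x |F_n(x) - F(x)|."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Illustrative data: draws from the same triangular law that is being tested.
rng = random.Random(0)
sample = [rng.triangular(0.0, 1.0, 0.3) for _ in range(500)]
d_n = ks_statistic(sample, lambda x: triangular_cdf(x, 0.0, 0.3, 1.0))
```

A small value of d_n indicates that the empirical and hypothesised distribution functions are close. Turning it into a test requires a critical value or p-value, and the standard tables only apply when the parameters are fixed in advance rather than estimated from the same sample.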

23 pages, 449 KiB  
Article
A Two-Sample Test of High Dimensional Means Based on Posterior Bayes Factor
by Yuanyuan Jiang and Xingzhong Xu
Mathematics 2022, 10(10), 1741; https://0-doi-org.brum.beds.ac.uk/10.3390/math10101741 - 19 May 2022
Cited by 1 | Viewed by 1168
Abstract
In classical statistics, the primary test statistic is the likelihood ratio. However, for high dimensional data, the likelihood ratio test is no longer effective and sometimes does not work altogether. By replacing the maximum likelihood with the integral of the likelihood, the Bayes factor is obtained. The posterior Bayes factor is the ratio of the integrals of the likelihood function with respect to the posterior. In this paper, we investigate the performance of the posterior Bayes factor in high dimensional hypothesis testing through the problem of testing the equality of two multivariate normal mean vectors. The asymptotic normality of the linear function of the logarithm of the posterior Bayes factor is established. Then we construct a test with an asymptotically nominal significance level. The asymptotic power of the test is also derived. Simulation results and an application example are presented, which show good performance of the test. Hence, taking the posterior Bayes factor as a statistic in high dimensional hypothesis testing is a reasonable methodology. Full article

17 pages, 4884 KiB  
Article
An Analysis of Travel Patterns in Barcelona Metro Using Tucker3 Decomposition
by Elisa Frutos-Bernal, Ángel Martín del Rey, Irene Mariñas-Collado and María Teresa Santos-Martín
Mathematics 2022, 10(7), 1122; https://0-doi-org.brum.beds.ac.uk/10.3390/math10071122 - 31 Mar 2022
Cited by 1 | Viewed by 1573
Abstract
In recent years, a growing number of large, densely populated cities have emerged, which need urban traffic planning and therefore knowledge of mobility patterns. Knowledge of space-time distribution of passengers in cities is necessary for effective urban traffic planning and restructuring, especially in large cities. In this paper, the inbound ridership in the Barcelona metro is modelled as a three-way tensor so that each element contains the number of passengers in the ith station at the jth time on the kth day. Tucker3 decomposition is used to discover spatial clusters, temporal patterns, and the relationships between them. The results indicate that travel patterns differ between weekdays and weekends; in addition, rush and off-peak hours of each day have been identified, and a classification of stations has been obtained. Full article
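To make the data layout concrete, the following is a minimal sketch of how entry records could be arranged into a station × time × day array and unfolded along the station mode. The record format, the toy counts and the function names are assumptions made for illustration; the Tucker3 fit itself is not shown.

```python
def build_tensor(records, n_stations, n_times, n_days):
    """Return X with X[i][j][k] = passengers entering station i in time slot j on day k."""
    X = [[[0] * n_days for _ in range(n_times)] for _ in range(n_stations)]
    for station, time_slot, day, count in records:
        # Records with the same (station, time slot, day) are accumulated.
        X[station][time_slot][day] += count
    return X


def unfold_mode1(X):
    """Mode-1 unfolding: one row per station, columns running over (time, day) pairs."""
    return [[x for time_row in station for x in time_row] for station in X]


# Hypothetical records: (station index, time-slot index, day index, entries).
records = [
    (0, 0, 0, 120), (0, 1, 0, 80),
    (1, 0, 0, 45), (1, 1, 1, 60),
    (0, 0, 1, 30), (0, 0, 0, 5),
]
X = build_tensor(records, n_stations=2, n_times=2, n_days=2)
X1 = unfold_mode1(X)
```

Unfoldings of this kind (one per mode) are what typical Tucker fitting procedures operate on: a component matrix is estimated for stations, time slots and days, and a small core array links the three sets of components.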

16 pages, 477 KiB  
Article
TAID-LCA: Segmentation Algorithm Based on Ternary Trees
by Claudio Castro-López, Purificación Vicente-Galindo, Purificación Galindo-Villardón and Oscar Borrego-Hernández
Mathematics 2022, 10(4), 560; https://0-doi-org.brum.beds.ac.uk/10.3390/math10040560 - 11 Feb 2022
Viewed by 1234
Abstract
In this work, a statistical method for the segmentation of samples and/or populations is presented, which is based on a ternary tree structure. This approach overcomes known limitations of other segmentation methods such as CHAID, concerning the multivariate response and the non-symmetric relationship between explanatory and response variables. The multivariate response segmentation problem is handled through latent class models, while the factorial decomposition of the explanatory capability of variables is based on the Non-Symmetrical Correspondence Analysis. Stop criteria based on the CATANOVA index and impurity measures are proposed. A Simulated Annealing based post-pruning strategy is considered to avoid over-fitting relative to the training set and guarantee a better generalization capability for the method. Full article

19 pages, 2779 KiB  
Article
Are Worldwide Governance Indicators Stable or Do They Change over Time? A Comparative Study Using Multivariate Analysis
by Isabel Gallego-Álvarez, Miguel Rodríguez-Rosa and Purificación Vicente-Galindo
Mathematics 2021, 9(24), 3257; https://0-doi-org.brum.beds.ac.uk/10.3390/math9243257 - 15 Dec 2021
Cited by 5 | Viewed by 2328
Abstract
Governance is a characteristic of political systems that indicates the degrees of cooperation and interaction between a state and non-state actors when it comes to decision making that will have an impact on society. The aim of our research focuses on analysing the behaviour of the Worldwide Governance Indicators (WGI) over the 2002–2019 period, since we are interested in learning whether such indicators varied or remained constant. Moreover, we will gain insight into the evolution of these indicators across countries in different geographical areas. The techniques we have chosen for this research are as follows: Partial Triadic Analysis, also known as X-STATIS, to highlight the stable structure of the evolution of the indicators and countries along the years by means of building an average year; Tucker3 to highlight deeper relationships among countries, indicators and years. A comparative analysis of these methods will allow us to check whether the WGI are stable over the years studied or whether they vary over time, providing information about the differences between the Worldwide Governance Indicators (WGI) in several countries or geographical areas. Full article

20 pages, 886 KiB  
Article
A New Goodness of Fit Test for Multivariate Normality and Comparative Simulation Study
by Jurgita Arnastauskaitė, Tomas Ruzgas and Mindaugas Bražėnas
Mathematics 2021, 9(23), 3003; https://0-doi-org.brum.beds.ac.uk/10.3390/math9233003 - 23 Nov 2021
Cited by 7 | Viewed by 2054
Abstract
The testing of multivariate normality remains a significant scientific problem. Although it is being extensively researched, it is still unclear how to choose the best test based on the sample size, variance, covariance matrix and other factors. In order to contribute to this field, a new goodness of fit test for multivariate normality is introduced. This test is based on the mean absolute deviation of the empirical distribution density from the theoretical distribution density. The new test was compared with the most popular tests in terms of empirical power. The power of the tests was estimated for the selected alternative distributions and examined by the Monte Carlo modeling method for the chosen sample sizes and dimensions. Based on the modeling results, it can be concluded that the new test is one of the most powerful tests for checking multivariate normality, especially for smaller samples. In addition, the assumption of normality of two real data sets was checked. Full article

16 pages, 7374 KiB  
Article
Using HJ-Biplot and External Logistic Biplot as Machine Learning Methods for Corporate Social Responsibility Practices for Sustainable Development
by Joel A. Martínez-Regalado, Cinthia Leonora Murillo-Avalos, Purificación Vicente-Galindo, Mónica Jiménez-Hernández and José Luis Vicente-Villardón
Mathematics 2021, 9(20), 2572; https://0-doi-org.brum.beds.ac.uk/10.3390/math9202572 - 14 Oct 2021
Cited by 10 | Viewed by 1905
Abstract
In recent years, social responsibility has been revolutionizing sustainable development. After the development of new mathematical techniques, the improvement of computers’ processing capacity and the greater availability of possible explanatory variables, the analysis of these topics is moving towards the use of different machine learning techniques. However, within the field of machine learning, the use of Biplot techniques is little known for these analyses. For this reason, in this paper we explore the performance of two of the most popular techniques in multivariate statistics, the External Logistic Biplot and the HJ-Biplot, to analyse the data structure in social responsibility studies. The results obtained from the sample of companies representing the Fortune Global 500 list indicate that the most frequently reported indicators on social aspects concern labour practices and decent work, and society. In contrast, indicators related to human rights and product responsibility are disclosed less frequently. Additionally, we have identified the countries and sectors with the highest CSR in social matters. We discovered that both machine learning algorithms are extremely competitive and practical to apply in CSR since they are simple to implement and work well with relatively big datasets. Full article

15 pages, 3259 KiB  
Article
Frequency of Neuroendocrine Tumor Studies: Using Latent Dirichlet Allocation and HJ-Biplot Statistical Methods
by Karime Montes Escobar, José Luis Vicente-Villardon, Javier de la Hoz-M, Lelly María Useche-Castro, Daniel Fabricio Alarcón Cano and Aline Siteneski
Mathematics 2021, 9(18), 2281; https://0-doi-org.brum.beds.ac.uk/10.3390/math9182281 - 16 Sep 2021
Cited by 7 | Viewed by 2322
Abstract
Background: Neuroendocrine tumors (NETs) are severe and relatively rare and may affect any organ of the human body. The prevalence of NETs has increased in recent years; however, there seem to be more data on particular types, even though, despite the efforts of different guidelines, there is no consensus on how to identify different types of NETs. In this review, we investigated the countries that published the most articles about NETs, the most frequent organs affected, and the most common related topics. Methods: This work used the Latent Dirichlet Allocation (LDA) method to identify and interpret scientific information in relation to the categories in a set of documents. The HJ-Biplot method was also used to determine the relationship between the analyzed topics, by taking into consideration the years under study. Results: In this study, a literature review was conducted, from which a total of 7658 abstracts of scientific articles published between 1981 and 2020 were extracted. The United States, Germany, United Kingdom, France, and Italy published the majority of studies on NETs, of which pancreatic tumors were the most studied. The five most frequent topics were t_21 (clinical benefit), t_11 (pancreatic neuroendocrine tumors), t_13 (patients one year after treatment), t_17 (prognosis of survival before and after resection), and t_3 (markers for carcinomas). Finally, the results were put through a two-way multivariate analysis (HJ-Biplot), which generated a new interpretation: we grouped topics by year and discovered which NETs were the most relevant for which years. Conclusions: The most frequent topics found in our review highlighted the severity of NETs: patients have a poor prognosis of survival and a high probability of tumor recurrence. Full article

17 pages, 1429 KiB  
Article
Subsampling and Aggregation: A Solution to the Scalability Problem in Distance-Based Prediction for Mixed-Type Data
by Amparo Baíllo and Aurea Grané
Mathematics 2021, 9(18), 2247; https://0-doi-org.brum.beds.ac.uk/10.3390/math9182247 - 13 Sep 2021
Viewed by 1580
Abstract
The distance-based linear model (DB-LM) extends the classical linear regression to the framework of mixed-type predictors or when the only available information is a distance matrix between regressors (as it sometimes happens with big data). The main drawback of these DB methods is their computational cost, particularly due to the eigendecomposition of the Gram matrix. In this context, ensemble regression techniques provide a useful alternative to fitting the model to the whole sample. This work analyzes the performance of three subsampling and aggregation techniques in DB regression on two specific large, real datasets. We also analyze, via simulations, the performance of bagging and DB logistic regression in the classification problem with mixed-type features and large sample sizes. Full article
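The general idea of subsampling and aggregation can be sketched as follows. This is an illustration under loud assumptions: ordinary least squares on a single numeric predictor stands in for the distance-based model, subsamples are drawn without replacement (bagging in the strict sense resamples with replacement), and all names and values are hypothetical rather than the three techniques compared in the paper.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    return my - b * mx, b


def subsample_and_aggregate(points, x_new, n_subsamples, subsample_size, seed=0):
    """Fit the model on several small subsamples and average the predictions at x_new."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_subsamples):
        sub = rng.sample(points, subsample_size)  # drawn without replacement
        a, b = fit_line(sub)
        preds.append(a + b * x_new)
    return sum(preds) / len(preds)


# Illustrative data: y = 2 + 3x plus Gaussian noise, 1000 observations.
rng = random.Random(1)
data = [(i / 100, 2.0 + 3.0 * (i / 100) + rng.gauss(0.0, 0.5)) for i in range(1000)]
pred = subsample_and_aggregate(data, x_new=5.0, n_subsamples=20, subsample_size=50)
```

The point of the design is cost: each fit only sees a small subsample, so a step that scales badly with the sample size (in the distance-based setting, the eigendecomposition of the Gram matrix) is applied to small matrices many times instead of to one very large matrix.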

15 pages, 1502 KiB  
Article
Sparse STATIS-Dual via Elastic Net
by Carmen C. Rodríguez-Martínez, Mitzi Cubilla-Montilla, Purificación Vicente-Galindo and Purificación Galindo-Villardón
Mathematics 2021, 9(17), 2094; https://0-doi-org.brum.beds.ac.uk/10.3390/math9172094 - 30 Aug 2021
Cited by 3 | Viewed by 1744
Abstract
Multi-set multivariate data analysis methods provide a way to analyze a series of tables together. In particular, the STATIS-dual method is applied in data tables where individuals can vary from one table to another, but the variables that are analyzed remain fixed. However, when you have a large number of variables or indicators, interpretation through traditional multiple-set methods is complex. For this reason, in this paper, a new methodology is proposed, which we have called Sparse STATIS-dual. This implements the elastic net penalty technique which seeks to retain the most important variables of the model and obtain more precise and interpretable results. As a complement to the new methodology and to materialize its application to data tables with fixed variables, a package is created in the R programming language, under the name Sparse STATIS-dual. Finally, an application to real data is presented and a comparison of results is made between the STATIS-dual and the Sparse STATIS-dual. The proposed method improves the informative capacity of the data and offers more easily interpretable solutions. Full article
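For orientation, the elastic net penalty in its generic regression form combines a ridge term and a lasso term. The abstract does not state how the penalty is embedded in the STATIS-dual computations, so the criterion below is the standard textbook version, with y, X, β, λ1 and λ2 as generic symbols rather than the paper's notation:

```latex
\hat{\beta} = \arg\min_{\beta}\;
  \lVert y - X\beta \rVert_2^2
  + \lambda_2 \lVert \beta \rVert_2^2
  + \lambda_1 \lVert \beta \rVert_1 ,
\qquad \lambda_1, \lambda_2 \ge 0
```

The L1 term can set some coefficients exactly to zero, which is what makes the solution sparse and easier to interpret, while the L2 term stabilises the estimates when variables are strongly correlated.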

13 pages, 21005 KiB  
Article
Evaluation of Paris MoU Maritime Inspections Using a STATIS Approach
by Jose Manuel Prieto, Victor Amor, Ignacio Turias, David Almorza and Francisco Piniella
Mathematics 2021, 9(17), 2092; https://0-doi-org.brum.beds.ac.uk/10.3390/math9172092 - 29 Aug 2021
Cited by 5 | Viewed by 2295
Abstract
Port state control inspections implemented under the Paris Memorandum of Understanding (MoU) have become known as one of the best instruments for maritime administrations in European Union (EU) Member States to ensure that the ships docked in their ports comply with all maritime safety requirements. This paper focuses on the analysis of all inspections made between 2013 and 2018 in the top ten EU ports incorporated in the Paris MoU (17,880 inspections). The methodology consists of a multivariate statistical information system (STATIS) analysis using the inspected ship’s characteristics as explanatory variables. The variables used describe both the inspected ships (classification society, flag, age and gross tonnage) and the inspection (type of inspection and number of deficiencies), yielding a dataset with more than 600,000 elements in the data matrix. The most important results are that the classifications obtained match the performance lists published annually by the Paris MoU and the classification societies. Therefore, the approach is a potentially valid classification method and would then be useful to maritime authorities as an additional indicator of a ship’s risk profile to decide inspection priorities and as a tool to measure the evolution in the risk profile of the flag over time. Full article

19 pages, 504 KiB  
Article
Logistic Biplot by Conjugate Gradient Algorithms and Iterated SVD
by Jose Giovany Babativa-Márquez and José Luis Vicente-Villardón
Mathematics 2021, 9(16), 2015; https://0-doi-org.brum.beds.ac.uk/10.3390/math9162015 - 23 Aug 2021
Cited by 2 | Viewed by 2275
Abstract
Multivariate binary data are increasingly frequent in practice. Although some adaptations of principal component analysis are used to reduce dimensionality for this kind of data, none of them provide a simultaneous representation of rows and columns (biplot). Recently, a technique named logistic biplot (LB) has been developed to represent the rows and columns of a binary data matrix simultaneously, even though the algorithm used to fit the parameters is too computationally demanding to be useful in the presence of sparsity or when the matrix is large. We propose the fitting of an LB model using nonlinear conjugate gradient (CG) or majorization–minimization (MM) algorithms, and a cross-validation procedure is introduced to select the hyperparameter that represents the number of dimensions in the model. A Monte Carlo study that considers scenarios with several sparsity levels and different dimensions of the binary data set shows that the procedure based on cross-validation is successful in the selection of the model for all algorithms studied. The comparison of the running times shows that the CG algorithm is more efficient in the presence of sparsity and when the matrix is not very large, while the performance of the MM algorithm is better when the binary matrix is balanced or large. As a complement to the proposed methods and to give practical support, a package has been written in the R language called BiplotML. To complete the study, real binary data on gene expression methylation are used to illustrate the proposed methods. Full article

26 pages, 4582 KiB  
Article
Epistemological Considerations of Text Mining: Implications for Systematic Literature Review
by Daniel Caballero-Julia and Philippe Campillo
Mathematics 2021, 9(16), 1865; https://0-doi-org.brum.beds.ac.uk/10.3390/math9161865 - 06 Aug 2021
Cited by 4 | Viewed by 2730
Abstract
In the era of big data, the capacity to produce textual documents is increasing day by day. Our ability to generate large amounts of information has impacted our lives at both the individual and societal levels. Science has not escaped this evolution either, and it is often difficult to quickly and reliably “stand on the shoulders of giants”. Text mining is presented as a promising mathematical solution. However, it has not yet convinced qualitative analysts who are usually wary of mathematical calculation. For this reason, this article proposes to rethink the epistemological principles of text mining, by returning to the qualitative analysis of its meaning and structure. It presents alternatives, applicable to the process of constructing lexical matrices for the analysis of a complex textual corpus. At the same time, the need for new multivariate algorithms capable of integrating these principles is discussed. We take a practical example in the use of text mining, by means of Multivariate Analysis of Variance Biplot (MANOVA-Biplot) when carrying out a systematic review of the literature. The article will show the advantages and disadvantages of exploring and analyzing a large set of publications quickly and methodically. Full article
(This article belongs to the Special Issue Multivariate Statistics: Theory and Its Applications)

13 pages, 842 KiB  
Article
Comparing COSTATIS and Generalized Procrustes Analysis with Multi-Way Public Education Expenditure Data
by María Concepción Vega-Hernández and Carmen Patino-Alonso
Mathematics 2021, 9(15), 1816; https://0-doi-org.brum.beds.ac.uk/10.3390/math9151816 - 31 Jul 2021
Cited by 1 | Viewed by 1927
Abstract
Governments serve a variety of purposes, and where governments spend their money has always been of concern to society. In particular, spending on public education is of great interest. However, the volume of this information can be difficult to manage. Therefore, the purpose of this work is to compare the COSTATIS method and generalized Procrustes analysis (GPA) when working with multi-way data. Although each method has its own characteristics, they present similarities and differences that, when analyzed together, can provide complementary results to researchers. COSTATIS consists of a co-inertia analysis of the compromise of two k-table analyses. The GPA method provides an optimal superimposed representation of individual configurations, and a common consensus configuration is constructed as the mean of all transformed configurations. In addition, the GPA method includes the translation, rotation and scaling of coordinates. In this study, both methods were applied, and the advantages and disadvantages of each are presented. The data analyzed are a sequence of tables from various countries in which different public expenditures on education have been measured over time.
(This article belongs to the Special Issue Multivariate Statistics: Theory and Its Applications)
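
The GPA steps named in the abstract above (translation, scaling, rotation, and a consensus built as the mean of the transformed configurations) can be sketched in a few lines. This is a minimal illustration for two-dimensional configurations only, not the authors' implementation: the function names are hypothetical, reflections are not handled, the consensus is not rescaled between iterations, and every configuration is assumed to have at least two distinct points.

```python
import math

def center(points):
    """Translation: move the centroid of a configuration to the origin."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx, y - cy) for x, y in points]

def scale_to_unit(points):
    """Scaling: divide by the centroid size (assumes a non-degenerate configuration)."""
    size = math.sqrt(sum(x * x + y * y for x, y in points))
    return [(x / size, y / size) for x, y in points]

def rotate_onto(points, target):
    """Rotation: apply the 2D rotation that minimizes squared distance to target."""
    num = sum(x * v - y * u for (x, y), (u, v) in zip(points, target))
    den = sum(x * u + y * v for (x, y), (u, v) in zip(points, target))
    theta = math.atan2(num, den)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def mean_configuration(configs):
    """Consensus: pointwise mean of all transformed configurations."""
    k = len(configs)
    n = len(configs[0])
    return [(sum(cfg[i][0] for cfg in configs) / k,
             sum(cfg[i][1] for cfg in configs) / k) for i in range(n)]

def gpa_2d(configs, tol=1e-12, max_iter=100):
    """Iterate rotation and consensus updates until the consensus stops moving."""
    aligned = [scale_to_unit(center(cfg)) for cfg in configs]
    consensus = aligned[0]  # assumption: start from the first configuration
    for _ in range(max_iter):
        aligned = [rotate_onto(cfg, consensus) for cfg in aligned]
        new_consensus = mean_configuration(aligned)
        shift = sum((a - c) ** 2 + (b - d) ** 2
                    for (a, b), (c, d) in zip(new_consensus, consensus))
        consensus = new_consensus
        if shift < tol:
            break
    return aligned, consensus

# Usage: a triangle and a copy rotated by 90 degrees, scaled by 3 and shifted.
base = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
moved = [(5.0, -2.0), (5.0, 4.0), (2.0, -2.0)]
aligned, consensus = gpa_2d([base, moved])
```

Because the second triangle differs from the first only by translation, rotation and scaling, the two aligned configurations coincide up to rounding. With real multi-way data the configurations differ in shape, and the residual distances to the consensus are what the analysis interprets.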

21 pages, 27776 KiB  
Article
LDAShiny: An R Package for Exploratory Review of Scientific Literature Based on a Bayesian Probabilistic Model and Machine Learning Tools
by Javier De la Hoz-M, Mª José Fernández-Gómez and Susana Mendes
Mathematics 2021, 9(14), 1671; https://0-doi-org.brum.beds.ac.uk/10.3390/math9141671 - 16 Jul 2021
Cited by 9 | Viewed by 4692
Abstract
In this paper we propose an open-source application called LDAShiny, which provides a graphical user interface to perform a review of scientific literature using the latent Dirichlet allocation algorithm and machine learning tools in an interactive and easy-to-use way. The procedures implemented follow the familiar stages of topic modeling: preprocessing, modeling, and postprocessing. The tool can be used by researchers or analysts who are not familiar with the R environment. We demonstrated the application by reviewing the literature published in the last three decades on the species Oreochromis niloticus. In total we reviewed 6196 abstracts of articles recorded in Scopus. LDAShiny allowed us to create the matrix of terms and documents. In the preprocessing phase, the number of unique terms went from 530,143 to 3268. Thus, the implemented options reduced both the number of unique terms and the computational needs. The results showed that 14 topics were sufficient to describe the corpus of the example used in the demonstration. We also found that the general research topics on this species were related to growth performance, body weight, heavy metals, genetics and water quality, among others.
(This article belongs to the Special Issue Multivariate Statistics: Theory and Its Applications)

20 pages, 368 KiB  
Article
Variable Selection for the Spatial Autoregressive Model with Autoregressive Disturbances
by Xuan Liu and Jianbao Chen
Mathematics 2021, 9(12), 1448; https://0-doi-org.brum.beds.ac.uk/10.3390/math9121448 - 21 Jun 2021
Cited by 2 | Viewed by 1887
Abstract
With the rapid development of geographic information systems, high-dimensional spatially heterogeneous data have emerged, bringing theoretical and computational challenges to statistical modeling and analysis. As a result, effective dimensionality reduction and spatial effect recognition have become very important. This paper focuses on variable selection in the spatial autoregressive model with autoregressive disturbances (SARAR), which contains a more comprehensive spatial effect. The variable selection procedure is based on the so-called penalized quasi-likelihood approach. Under suitable regularity conditions, we obtain the rate of convergence and the asymptotic normality of the estimators. The theoretical results ensure that the proposed method can effectively identify spatial effects of dependent variables, find spatial heterogeneity in error terms, reduce the dimension, and estimate unknown parameters simultaneously. Based on step-by-step transformation, a feasible iterative algorithm is developed to realize spatial effect identification, variable selection, and parameter estimation. In the setting of finite samples, Monte Carlo studies and real data analysis demonstrate that the proposed penalized method performs well and is consistent with the theoretical results.
(This article belongs to the Special Issue Multivariate Statistics: Theory and Its Applications)
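
As context for the abstract above, the SARAR model is commonly written as a spatial lag model whose disturbances follow their own spatial autoregression. The display below uses generic notation that is assumed here for illustration ($W_1$, $W_2$, $\lambda$, $\rho$, $\tau$ and $p_\tau$ are not taken from the article), and the penalized criterion is only a generic template, not the authors' exact objective.

```latex
% y: n x 1 response; X: n x p covariate matrix; W_1, W_2: known spatial weight matrices
% \lambda: spatial lag coefficient; \rho: spatial error coefficient (assumed notation)
y = \lambda W_1 y + X\beta + u,
\qquad
u = \rho W_2 u + \varepsilon

% Generic penalized quasi-likelihood template: \ell_n is the quasi log-likelihood,
% p_\tau a sparsity-inducing penalty with tuning parameter \tau
Q_n(\lambda, \rho, \beta, \sigma^2)
  = \ell_n(\lambda, \rho, \beta, \sigma^2)
  - n \sum_{j=1}^{p} p_\tau\!\left( |\beta_j| \right)
```

In this template, coefficients $\beta_j$ shrunk exactly to zero correspond to variables dropped from the model, which is how selection and estimation can happen in a single fit.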

20 pages, 1895 KiB  
Article
Hierarchical Modeling for Diagnostic Test Accuracy Using Multivariate Probability Distribution Functions
by Johny Pambabay-Calero, Sergio Bauz-Olvera, Ana Nieto-Librero, Ana Sánchez-García and Puri Galindo-Villardón
Mathematics 2021, 9(11), 1310; https://0-doi-org.brum.beds.ac.uk/10.3390/math9111310 - 07 Jun 2021
Cited by 3 | Viewed by 2961
Abstract
Models implemented in statistical software for the precision analysis of diagnostic tests include random-effects modeling (bivariate model) and hierarchical regression (hierarchical summary receiver operating characteristic). However, these models do not provide an overall mean, but calculate the mean of a central study when the random effect is equal to zero; hence, it is difficult to calculate the covariance between sensitivity and specificity when the number of studies in the meta-analysis is small. Furthermore, the estimation of the correlation between specificity and sensitivity is affected by the number of studies included in the meta-analysis and by the variability among the analyzed studies. To model the relationship of diagnostic test results, a binary covariance matrix is assumed. Here we used copulas as an alternative to capture the dependence between sensitivity and specificity. The posterior values were estimated using Markov chain Monte Carlo sampling, and the estimates were compared with the results of the bivariate model, which assumes statistical independence in the test results. To illustrate the applicability of the models and their respective comparisons, data from 14 published studies reporting estimates of the accuracy of the Alcohol Use Disorder Identification Test were used. Using simulations, we investigated the performance of four copula models in scenarios designed to replicate realistic situations for meta-analyses of the diagnostic accuracy of tests. The models’ performances were evaluated based on p-values using the Cramér–von Mises goodness-of-fit test. Our results indicated that copula models are valid when the assumptions of the bivariate model are not fulfilled.
(This article belongs to the Special Issue Multivariate Statistics: Theory and Its Applications)