Control Chart T2Qv for Statistical Control of Multivariate Processes with Qualitative Variables

Rojas-Preciado, Wilson; Rojas-Campuzano, Mauricio; Galindo-Villardón, Purificación; Ruiz-Barzola, Omar

doi:10.3390/math11122595

¹ Faculty of Social Sciences, Technical University of Machala (UTMACH), Machala 070102, Ecuador

² Department of Statistics, University of Salamanca, 37004 Salamanca, Spain

³ Center for Statistical Studies and Research, Polytechnic School of the Littoral, Guayaquil 090150, Ecuador

⁴ Center for Statistical Studies Management, Milagro State University (UNEMI), Milagro 33950, Ecuador

* Author to whom correspondence should be addressed.

Mathematics 2023, 11(12), 2595; https://0-doi-org.brum.beds.ac.uk/10.3390/math11122595

Submission received: 10 April 2023 / Revised: 3 June 2023 / Accepted: 4 June 2023 / Published: 6 June 2023

(This article belongs to the Special Issue Statistical Process Control and Application)

Simple Summary

The T2Qv control chart is presented as a multivariate statistical process control technique that analyzes qualitative data through Multiple Correspondence Analysis (MCA), Multiple Factor Analysis (MFA), and the Hotelling T2 chart.

Abstract

The scientific literature is abundant regarding control charts in multivariate environments for numerical and mixed data; however, there are few publications for qualitative data. Qualitative variables provide valuable information on processes in various industrial, productive, technological, and health contexts. Social processes are no exception. There are multiple nominal and ordinal categorical variables used in economics, psychology, law, sociology, and education, whose analysis adds value to decision-making; therefore, their representation in control charts would be useful. When there are many variables, there is a risk of redundant or excessive information, so it is viable to apply multivariate dimension-reduction methods that retain a few latent variables, i.e., recombinations of the original variables that synthesize most of the information. In this context, the T2Qv control chart is presented as a multivariate statistical process control technique that analyzes qualitative data through Multiple Correspondence Analysis (MCA) and the Hotelling T2 chart. Out-of-control points are interpreted by comparing MCA charts and analyzing the $\chi^{2}$ distance between the categories of the concatenated table and those that represent out-of-control points. Sensitivity analysis determined that the T2Qv control chart performs well when working with high dimensions. To test the methodology, an analysis was performed with simulated data and with a real case applied to the graduate follow-up process in the context of higher education. To facilitate the dissemination and application of the proposal, a reproducible computational package was developed in R, called T2Qv, and is available on the Comprehensive R Archive Network (CRAN).

Keywords:

multivariate; statistical process control; qualitative; control charts; R; T2 hotelling; graduate tracking; higher education

MSC:

62H25; 62P30

1. Introduction

Statistical control plays a very important role in the continuous improvement of processes, and, within it, control charts, which help monitor processes, have been extensively used since their creation by Walter Shewhart [1].

Shewhart established two phases in process control. The first, called the Development Phase, describes the statistical behavior of the analyzed variable, determines control limits for the analyzed parameter estimator, and contributes to the elimination of assignable or special causes of variability. The second, the Maturity Phase, assesses the process capability to meet requirements, identifies the mean number of samples before obtaining false alarms, and promotes the decrease of sample size for detecting process changes [2]. Montgomery [3] refers to this as the tracking of future production.

From univariate charts, countless proposals have been developed that incorporate the option of monitoring several variables at once [4,5], thereby opening up the field of multivariate statistical process control (MSPC).

The most well-known options in MSPC are: Hotelling’s T2 control chart [6], which could be considered the multivariate version of Shewhart’s mean chart; MEWMA [7], which is the multivariate version of the exponentially weighted moving average (EWMA) chart [8]; and MCUSUM [9], which is the multivariate version of the cumulative sum control chart CUSUM [10].

Several improvements have been made to these multivariate control charts, such as optimizing their parameters, either analytically [11,12,13] or heuristically [2]. Other proposals work without distributional assumptions, through nonparametric versions [14,15,16], for continuous or batch processes [4].

All these multivariate control charts have a quantitative focus, meaning that the monitored variables are essentially quantitative, whether discrete or continuous. Initially, different authors used the Mahalanobis distance [17] for this purpose. Subsequently, for the analysis of a combination of continuous and categorical variables, a chart based on the Gower distance [18] was developed. However, addressing problems such as high correlation between features and the presence of mixed data requires the incorporation of classical multivariate statistical techniques, such as Principal Component Analysis [19], Biplot Methods [20,21], Correspondence Analysis [22], STATIS [23,24,25], Parallel Coordinates [26], and Cluster Analysis [27].

Among the contributions related to control charts that incorporate multivariate techniques, the following stand out: the STATIS-based chart for monitoring batch processes in nonparametric environments [28]; robust bagplot diagrams using Dual STATIS and Parallel Coordinates [29]; the PCA-based multivariate control chart for mixed data, which applies a combination of Principal Component Analysis and Multiple Correspondence Analysis [30]; the Density-sensitive Novelty Weight control chart (DNW) that uses the k-Nearest Neighbor (kNN) algorithm [15]; the Kernel PCA Mix-based chart [31,32]; the $T^{2}$ chart based on a combination of PCA for continuous and qualitative data with outlier detection [33]; and the PCA-based control charts for nonparametric environments [15,34].

However, contributions to the development of multivariate control charts for qualitative variables have not been numerous. In this field, proposals have been developed around the analysis of variables that follow a Poisson distribution and the analysis of multinomial variables. The first proposal was made by Holgate [35], who presented a paper on the bivariate Poisson distribution for correlated variables. This model was used as input in the research of authors such as Chiu and Kuo [36], Lee and Costa [37], Laungrungrong et al. [38], and Epprecht et al. [39].

Another notable proposal is that of Lu [40], who developed a Shewhart control chart for multivariate processes with qualitative variables, when the quality characteristic takes binary values, which they called a multivariate $np$ (MNP) chart. In the multinomial context, Ranjan-Mukhopadhyay [41] proposed a multivariate control chart using the Mahalanobis $D^{2}$ statistic for attributes that follow a multinomial distribution. In addition, for multinomial processes under the fuzzy approach [42], Taleb [43] introduced control charts for monitoring multivariate processes with multidimensional linguistic data, based on two procedures: probability theory and fuzzy theory. Pastuizaca-Fernández et al. [44] presented a fuzzy-focused multivariate multinomial T2 control chart.

Saltos Segura et al. [45] claim that quality-control tools can be considered not only for monitoring industrial processes but also for processes related to education, such as student performance evaluation. These authors applied the concept of depth, which transforms a multivariate observation into a univariate index that can be monitored on a control chart; for this, they used the r chart. They also used cluster analysis to establish thresholds that facilitate the formation of groups and establish student profiles through descriptive measures.

In the study of processes that occur in the social environment, qualitative variables are very frequently used. It is not that quantitative data are absent, but in the databases used for these analyses, nominal and ordinal qualitative variables are abundant, sometimes more so than numeric variables.

López [46] points out that when many variables are observed on a sample, it is likely that some of the collected information is redundant or excessive. In such cases, multivariate methods for reducing dimensionality attempt to eliminate this information by combining many observed variables to arrive at a few latent variables that, while not observed, are a combination of the real variables and synthesize most of the information contained in the data. The type of variables being handled should be taken into account. For quantitative variables, suitable techniques include Principal Component Analysis [19,47] and Factor Analysis [48,49,50], while for qualitative variables, it is recommended to apply Multiple Correspondence Analysis, Homogeneity Analysis, or Multidimensional Scaling Analysis.

On the other hand, the use of computational tools has contributed to the development of multivariate statistical process control charts. Various authors have implemented statistical packages that facilitate the application of classical control charts, including Curran and Hersh [51], who developed an application for the Hotelling T2 chart, and Scrucca [52], who developed a software package for MCUSUM and MEWMA charts. Other researchers published their contributions on control charts together with a statistical tool that facilitates their application, including Ruiz-Barzola [2] and Epprecht et al. [39].

In addition, as a complement to the statistical analysis of multivariate processes carried out through control charts, statistical applications based on R can be used to enable the use of biplots and other multivariate methods for two and three ways, such as MULTBIPLOT [53], ade4 [54,55], FactoMineR [56], SparseBiplots [57], and biplotbootGUI [58]. Similarly, Python libraries such as SciPy [59] or statsmodels [60] can be used.

In statistical process control, contributions to the development of control charts for qualitative variables are still in their infancy, with few publications focusing on the analysis of quality characteristics in industrial processes, but not social processes. Upon analyzing the procedures published, limitations are detected that could restrict their application, such as the analysis of a few quality characteristics, the use of samples composed of individual elements instead of groups, and the difficulty of working with many categories simultaneously. Thus, the need arises for a control chart for the representation of p qualitative variables that can work with multiple nominal and ordinal categories and facilitate the identification of the causes that can lead to the process becoming out of control, and that can be applied to social processes.

This article addresses the aforementioned limitations regarding control charts for qualitative variables and their application in social environments. For this reason, its objective is to develop a control chart for qualitative variables using multivariate statistical methodologies, to contribute to the diversification of techniques in phase I of statistical process control.

The Analytic-Synthetic method is used, which allows the mental disaggregation of a whole into its parts and qualities, establishes the connection of previously analyzed parts, and discovers relationships and general characteristics among elements of reality [61]. This method is used to identify the essential elements that characterize the control charts reviewed in the literature, such as authors, years, and relevant contributions, as well as in the identification of regularities to establish conclusions in the research.

Furthermore, the modeling method is used, which involves obtaining new knowledge through the creation of models that represent reality [61]. This method, which requires a high capacity for abstraction, involves representing the specific characteristics of quality, with their categories and associations through control charts that underlie abstract mathematical models. From the analysis and operation of these models, new results can be extrapolated to different specific application scenarios.

On the other hand, the systemic–structural–functional method is applied, which considers the object of study as a unique and composite reality, based on the interrelationship and interdependence between the parts of the whole. The structural–functional approach distinguishes essential from secondary elements and aims to model the object as a system, determining its components, structure, hierarchy, and functional relationships [61]. In this research, this method is applied in the analysis of processes characterized by different interrelated variables, which in turn are made up of associated categories. The degree of association that exists between these categories can influence the final behavior of the process.

According to Martínez et al. [62], documentary analysis is the technique that systematically investigates, collects, organizes, analyzes, and interprets information on a topic related to the research objective to provide theoretical support for the development of scientific studies. In this research, documentary analysis is used to provide theoretical support for scientific studies on the development of control charts in multivariate statistical processes. Theses, books, and scientific articles indexed in databases such as WoS, Scopus, Scielo, Taylor & Francis, ScienceDirect, Redalyc, Latindex, and Google Scholar were analyzed within a chronological framework that covers classical authors to the most recent contributions.

Furthermore, this study systematically employs the statistical analysis technique to identify patterns, relationships, and trends in the data, contributing to the verification of assumptions and informed decision-making. To apply the proposal, a database of simulated data, called Datak10Contaminated, was generated, which is described in Section 4.1.1.

This article is organized as follows: the Introduction establishes the conceptual and referential background of multivariate control charts applied to qualitative variables; Section 2 details the procedure followed in the development of the proposed control chart; Section 3 describes the computational complement that facilitates the application of this methodology; Section 4 shows the results through the analysis of simulated data and of a real case applied to the graduate follow-up process in the context of higher education; Section 5 corresponds to the sensitivity analysis that relates the number of dimensions analyzed to the reliability of the results; Section 6 presents the discussion through a comparative analysis between the T2Qv control chart and the proposals of other authors. Finally, Section 7 establishes the conclusions.

2. Methodology

2.1. Notation

Table 1 contains elements, representation, and examples of how the algebraic elements addressed in the methodology are presented.

Throughout the article, letters will be used to refer to necessary parameters, which are listed in Table 2:

2.2. Multiple Correspondence Analysis (MCA)

Given that we are working with qualitative variables, Multiple Correspondence Analysis [22] is applied to analyze the similarity between categories [46] based on the $\chi^{2}$ distance, in an analysis similar to Principal Component Analysis.

MCA is the application of the simple correspondence analysis (CA) method to multivariate categorical data encoded in the form of an indicator matrix or a Burt matrix [63]. It is an exploratory factor analysis technique for multivariate categorical data that describes, in a low-dimensional space, the structure of associations between a group of categorical variables, as well as the similarities and differences between the individuals to which those variables apply. MCA has been “reinvented” on several occasions by various authors, under different names or approaches [64].

In this article, we are not using the French approach [65], but the Anglo-Saxon approach, where MCA is called Homogeneity Analysis or Dual Scaling. We use the Burt table [22] and start from a data matrix with p qualitative variables, each with h categories $(h > 1)$.

The matrix is composed of the n rows or observations and p columns or variables, where each cell contains one of the aforementioned categories. It is equivalent to the disjunctive matrix Z, which breaks down the variables into each of their modalities and records the occurrence of events in a binary form [22].

The Burt table is given by:

$$B = Z' Z \tag{1}$$

The matrix B in Equation (1) is formed by the absolute frequencies, which are transformed into relative frequencies by dividing the values in the matrix by the total frequency, resulting in the matrix P.

The row and column mass vectors, mf and mc, respectively, are obtained through the row and column margins of the matrix P.

The standardized residuals matrix S is then obtained:

$$S = D_{row}^{-\frac{1}{2}} (P - mf \, mc') D_{column}^{-\frac{1}{2}} \tag{2}$$

where $D_{row}$ is a diagonal matrix containing the row masses and $D_{column}$ is a diagonal matrix containing the column masses.

Singular value decomposition (SVD) is applied to the matrix S (Equation (2)):

$$S = U D V' \tag{3}$$

where $U$ and $V$ are orthogonal matrices, and $D$ is a diagonal matrix containing the singular values.

Then, the standardized coordinates are obtained by applying Equations (4) and (5).

$$X = D_{row}^{-\frac{1}{2}} U \tag{4}$$

$$Y = D_{column}^{-\frac{1}{2}} V \tag{5}$$
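
The steps in Equations (1) to (5) can be sketched in base R. This is a minimal illustration on a toy data set, not the code of the T2Qv package: the data values, the helper `indicator()`, and all object names are assumptions introduced here for the example.

```r
# Toy data (illustrative values): n = 6 observations, p = 2 qualitative variables
dat <- data.frame(
  V1 = factor(c("High", "Low", "Medium", "High", "Low", "Medium")),
  V2 = factor(c("Low", "Low", "High", "Medium", "High", "Medium"))
)

# Hypothetical helper: binary indicator columns for one factor
indicator <- function(f) {
  m <- outer(as.integer(f), seq_len(nlevels(f)), "==") * 1
  colnames(m) <- levels(f)
  m
}

# Disjunctive matrix Z: one binary column per category of each variable
Z <- do.call(cbind, lapply(names(dat), function(v) {
  m <- indicator(dat[[v]])
  colnames(m) <- paste(v, colnames(m), sep = "_")
  m
}))

B  <- t(Z) %*% Z     # Burt table, Equation (1)
P  <- B / sum(B)     # relative frequencies
mf <- rowSums(P)     # row masses
mc <- colSums(P)     # column masses

# Standardized residuals, Equation (2)
S <- diag(1 / sqrt(mf)) %*% (P - outer(mf, mc)) %*% diag(1 / sqrt(mc))

dec <- svd(S)        # S = U D V', Equation (3)

X <- diag(1 / sqrt(mf)) %*% dec$u   # standardized row coordinates, Equation (4)
Y <- diag(1 / sqrt(mc)) %*% dec$v   # standardized column coordinates, Equation (5)
```

Because the Burt table is symmetric, the row and column masses coincide, so `X` and `Y` describe the same set of categories.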

2.2.1. Generalization to K Tables

The T2Qv chart proposed in this research is not limited to processing a single data table, but can handle databases with multiple tables taken at K different moments and represent them as points on the chart. This can be configured as a data cube of $n \times p \times K$.

If there are K tables, with the same structure and composed of qualitative variables, what is described in Section 2.2 is applied to each of the K tables, obtaining the set of K tables with the initial format (Figure 1).

Each of the K sets of coordinates obtained in the previous step is denoted as C. To detect the magnitude of the latent variables, the absolute value of the elements of the matrix $C_{k}$ $(k = 1, \dots, K)$ is taken. Thus, a set of K tables of coordinates (loadings) is obtained, whose rows correspond to the observed variables and the columns to the latent variables.

2.2.2. Normalization of Tables

Normalization [66] from Multiple Factor Analysis (MFA) is applied to the K tables C.

Let $\lambda_{1}^{k}$ be the first eigenvalue obtained from the singular value decomposition of the k-th table C. The table is normalized by multiplying it by $1 / \lambda_{1}^{k}$. This results in the table $C'$, which corresponds to the normalized coordinate table. For the k-th matrix, the following expression is obtained:

$$C_{k}' = \frac{1}{\lambda_{1}^{k}} C_{k} \tag{6}$$

Up to this point, we have a set of normalized coordinate matrices, whose rows contain the observed variables and columns contain the latent variables.

The expression in Equation (6) applied to K tables is represented in Figure 2, which shows the scheme for preparing the tables prior to obtaining centrality vectors used by the multivariate control chart.

By unifying the K normalized tables $C_{k}'$ into a single one, we obtain the concatenated matrix $C'$, which contains all the elements of the K normalized tables.

$$C' = [C_{1}' \mid C_{2}' \mid \dots \mid C_{K}']^{T} \tag{7}$$

The normalization performed by MFA is responsible for weighting the K tables, with the aim of avoiding any imbalance when carrying out the joint analysis of the tables.

From the matrices $C'$ and $C_{k}'$, the median vectors are obtained, as shown in Figure 2.

The vector $\tilde{x}_{C_{k}'}$ explains the central behavior of table k, and the vector $\tilde{x}_{C'}$ explains the behavior of the concatenated matrix.
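
The normalization, concatenation, and median steps of Equations (6) and (7) can be sketched in base R as follows. The sizes, the random coordinate tables, and the helper `first_eigenvalue()` are hypothetical; in particular, taking $\lambda_{1}^{k}$ as the squared first singular value of $C_{k}$ is an assumption of this sketch, since the text does not spell out the computation.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical sizes: K tables, q categories (rows), d latent dimensions (columns)
K <- 3; q <- 6; d <- 2

# Stand-ins for the K coordinate tables C_k (absolute values, as in Section 2.2.1)
C_list <- lapply(seq_len(K), function(k) abs(matrix(rnorm(q * d), q, d)))

# Assumption: lambda_1^k is the squared first singular value of C_k
first_eigenvalue <- function(C) svd(C)$d[1]^2

# Equation (6): C'_k = (1 / lambda_1^k) * C_k
C_norm <- lapply(C_list, function(C) C / first_eigenvalue(C))

# Equation (7): stack the K normalized tables one below the other
C_concat <- do.call(rbind, C_norm)

# Column-wise medians: one vector per table and one for the concatenated matrix
med_k <- t(sapply(C_norm, function(C) apply(C, 2, median)))  # K x d matrix
med_0 <- apply(C_concat, 2, median)                           # vector of length d
```

Each row of `med_k` plays the role of $\tilde{x}_{C_{k}'}$, and `med_0` plays the role of $\tilde{x}_{C'}$.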

2.3. T2Qv Control Chart

2.3.1. Obtaining the Control Chart

To define the Hotelling $T^{2}$ control chart, the following considerations must be taken into account:

The matrix $C'$ (Equation (7)) is called Concatenated and serves as a reference for the in-control scenario in phase I of the process control.
The Hotelling $T^{2}$ statistic is usually calculated with the mean vectors and covariance matrix of the in-control process. The proposal of this research is to adopt robustness concepts, using the median vector instead of the mean vector, because medians are not affected by outliers [67].
From the concatenated matrix $C'$, we obtain $\tilde{x}_{0}$ (median vector of the concatenated matrix) and $S_{0}$ (covariance matrix of the concatenated matrix).
Each matrix $C_{k}'$ has the same number of columns.
The median vector $\tilde{x}_{k}$ is tied to the table $C_{k}'$, meaning that the control chart will depend on the differences between the matrices $C_{k}'$ and the concatenated matrix $C'$.
The matrices $C_{k}'$ follow a multivariate normal distribution with median vector $\tilde{x}_{k}$ and covariance matrix $S_{k}$.

The statistic $T^{2}$ is given by:

$$T^{2} = n (\mu_{k} - \mu_{0})' \Sigma_{0}^{-1} (\mu_{k} - \mu_{0}) \tag{8}$$

Taking into account the aforementioned considerations, the statistic $T_{med}^{2}$ is obtained.

$$T_{med}^{2} = n (\tilde{x}_{k} - \tilde{x}_{0})' \Sigma_{0}^{-1} (\tilde{x}_{k} - \tilde{x}_{0}) \tag{9}$$

It is known that the $T^{2}$ distribution converges to a Chi-squared distribution with p degrees of freedom when the data come from a multivariate normal distribution [11]. In addition, authors such as Gneri and Pimentel-Barbosa [68] argue that T2 converges to a chi-squared distribution, even in scenarios where the data do not come from a multivariate normal distribution, under certain conditions.

In this case, this principle can be applied, as the Concatenated matrix ($C'$) is used, which represents the in-control scenario.

Since this control chart is based on weighted Mahalanobis distances, it only has an upper control limit, given by Equation (10).

$$UCL = \chi_{\alpha, p}^{2} \tag{10}$$

where p is the number of dimensions and $\alpha$ is the predetermined significance level, here set to $\alpha = 0.0027$. When the $T_{med}^{2}$ statistic of a sample exceeds this limit, the process is considered out of control.
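
Equations (9) and (10) can be sketched in base R as shown below. This is an illustration under stated assumptions rather than the package's implementation: the function `t2_med()`, the random stand-in tables, and the choice of n as the number of rows of each table are hypothetical, and $\chi_{\alpha, p}^{2}$ is read here as the upper $\alpha$ quantile, i.e., `qchisq(1 - alpha, df = p)`.

```r
# Hypothetical helper implementing Equation (9) for one normalized table C'_k
t2_med <- function(C_k, C_concat, n) {
  x_k  <- apply(C_k, 2, median)        # median vector of table k
  x_0  <- apply(C_concat, 2, median)   # median vector of the concatenated matrix
  S_0  <- cov(C_concat)                # covariance matrix of the concatenated matrix
  diff <- x_k - x_0
  as.numeric(n * t(diff) %*% solve(S_0) %*% diff)
}

set.seed(2)  # arbitrary seed, for reproducibility only

# Stand-ins for K normalized coordinate tables with q rows and d columns
K <- 5; q <- 30; d <- 2
C_list <- lapply(seq_len(K), function(k) abs(matrix(rnorm(q * d), q, d)))
C_list[[K]] <- C_list[[K]] + 1.5       # shift the last table to mimic an anomaly
C_concat <- do.call(rbind, C_list)

# Equation (10): upper control limit with alpha = 0.0027 and p = d dimensions
alpha <- 0.0027
ucl   <- qchisq(1 - alpha, df = d)

# One statistic per table; n is assumed to be the number of rows per table
t2 <- sapply(C_list, t2_med, C_concat = C_concat, n = q)
out_of_control <- which(t2 > ucl)
```

Tables whose index appears in `out_of_control` would be plotted above the UCL line on the chart.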

2.3.2. Interpretation of Out-of-Control Points

The multivariate chart for qualitative variables, T2Qv, is capable of indicating that the process has gone out of control, but it does not reveal the causes. Each point plotted on the chart represents a table (sample), consisting of a group of individuals (observations) and p variables that can have many categories, some of which may exhibit anomalous behavior. Therefore, it is necessary to carefully analyze what is happening with the data of the reported tables to identify the variable(s) that caused the process to go out of control.

This analysis is performed by comparing the location of the points representing the categories of the variables in the MCA of the concatenated table and the location of the points in the MCA charts of each table reported as out of control (Figure 3). The categories that are influencing the out-of-control state are those that show noticeable differences in their location when comparing both tables. To quantify the magnitude of these differences or the anomalous behavior of these categories, the Chi-squared distances between the masses of the columns of the table reported as out of control and the columns of the concatenated table, taken as a reference, are calculated. The higher the value of the statistic, the greater its incidence in the displacement of the centrality of the process that can ultimately lead to an out-of-control state.
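
One plausible reading of this comparison can be sketched in base R: for each category, the squared difference between its mass in the out-of-control table and its mass in the concatenated table, scaled by the reference mass. The exact formula is not given in the text, so the function `chi2_contrib()`, the scaling, and the counts below are assumptions made for illustration only.

```r
# Hypothetical helper: per-category chi-squared contribution between the
# masses of a table and the masses of a reference (concatenated) table
chi2_contrib <- function(counts_k, counts_ref) {
  m_k <- counts_k / sum(counts_k)       # masses in the out-of-control table
  m_0 <- counts_ref / sum(counts_ref)   # masses in the reference table
  (m_k - m_0)^2 / m_0
}

# Illustrative category counts for a single variable (not data from the article)
ref <- c(V1_High = 300, V1_Medium = 300, V1_Low = 300)
tab <- c(V1_High = 67,  V1_Medium = 13,  V1_Low = 20)

contrib <- chi2_contrib(tab, ref)
most_shifted <- names(which.max(contrib))
```

Under this reading, the category with the largest contribution is the one whose frequency departs most from the reference.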

3. Computational Complement

To facilitate the dissemination and application of the proposed method, a reproducible package has been developed in R. The T2Qv package [69] performs the analysis of control of K tables through multivariate control charts for qualitative variables, using the theoretical foundations of multiple correspondence analysis and multiple factor analysis, as well as the conceptual idea of STATIS.

The charts can be displayed in flat or interactive form; all outputs can be shown in an interactive Shiny panel, and the graphical and numerical results can be exported.

3.1. Description of the T2Qv Package

The statistical package T2Qv performs Multiple Correspondence Analysis on the original tables ($T_{k}$), generating latent variable matrices ($C_{k}$) whose coordinates are subjected to a normalization process, multiplying them by $1 / \lambda_{1}^{k}$. The normalized coordinate matrices ($T_{k}'$) are ordered one below the other to form a concatenated table ($C'$), from which the median vector $\tilde{x}_{C'}$ is extracted, as well as the median vectors ($\tilde{x}_{C_{k}'}$) of each matrix that composes it.

With these vectors, the statistics $T_{med}^{2} = n (\tilde{x}_{k} - \tilde{x}_{0})' \Sigma_{0}^{-1} (\tilde{x}_{k} - \tilde{x}_{0})$ are obtained for each of the analyzed tables, which are represented as points in the T2Qv control chart. Points that fall outside the limit ($UCL = \chi_{\alpha, p}^{2}$) are reported as out of control.

The T2Qv statistical package allows the interpretation of the anomalous behavior of out-of-control points through the comparison of the MCA charts of a table ${TC}_{k}$, which results from concatenating the initial matrices, and each initial table $T_{k}'$. The package allows for the selection of the $T_{k}'$ tables, so that the researcher can focus their analysis on those identified as out of control.

In addition, the T2Qv package generates an interactive bar chart that represents the $\chi^{2}$ distances between the column masses of the variables in the ${TC}_{k}$ table and the $T_{k}'$ table. Taller bars identify the variables that contribute most strongly to the out-of-control signal of the k-th table. This interactive chart includes, through a nested circular chart, a representation of the distribution of the observed variable categories corresponding to the k-th table, as well as a circular chart of the distribution of categories in the concatenated table (${TC}_{k}$), facilitating the identification of changes in category distribution.

Thus, the T2Qv package consolidates the methodology proposed in this research and allows for an explanation of when and why the process went out of control.

The functions included in the package and their description are listed in Table 3.

3.2. Availability

The package is available on the official R repository, The Comprehensive R Archive Network (CRAN), and can be downloaded as follows:

install.packages("T2Qv")

4. Results

To test the proposed methodology of the Hotelling $T^{2}$ control chart for qualitative variables, an analysis was conducted using simulated data and another using real data applied in the context of higher education. The results were obtained through the application of the T2Qv package.

4.1. Results with Simulated Data

4.1.1. Simulated Data Generation

For this study, a simulated database was generated, called Datak10Contaminated. It consists of 10 tables, each one composed of 100 rows (observations) and 11 columns, of which the first 10 correspond to the analyzed variables (V1, V2, …, V10), which contain 3 categories (High, Medium, and Low), while column 11, called GroupLetter, contains the classification factor of the groups. For their identification, the tables have been named with the letters of the alphabet, from a to j. Table j has a different distribution than the other nine.

The first 9 tables have their 10 variables with the following distribution:

$$u \sim U[0, 1]$$

$$t_{1, \dots, 9} = \begin{cases} \text{Low} & \text{if } u \leq 1/3 \\ \text{Medium} & \text{if } 1/3 < u < 2/3 \\ \text{High} & \text{if } u \geq 2/3 \end{cases}$$

Table j (Table 10), in all its 10 variables, follows the distribution presented below:

$$u \sim U[0, 1]$$

$$t_{j} = \begin{cases} \text{Low} & \text{if } u \leq 1/5 \\ \text{Medium} & \text{if } 1/5 < u < 2/6 \\ \text{High} & \text{if } u \geq 2/6 \end{cases}$$
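
A data set with this structure can be generated in base R along the following lines. This is a sketch of the stated rules, not the script used to build Datak10Contaminated: the seed, the helper names `draw_categories()` and `simulate_table()`, and the object `sim_data` are hypothetical, so the resulting values will differ from the published data set.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

# Map uniform draws to categories using the two cut points of the piecewise rule
draw_categories <- function(n, lower, upper) {
  u <- runif(n)
  factor(ifelse(u <= lower, "Low", ifelse(u < upper, "Medium", "High")),
         levels = c("Low", "Medium", "High"))
}

# One table: n rows, p qualitative variables V1..Vp, plus the GroupLetter column
simulate_table <- function(letter, lower, upper, n = 100, p = 10) {
  vars <- replicate(p, draw_categories(n, lower, upper), simplify = FALSE)
  names(vars) <- sprintf("V%d", seq_len(p))
  out <- as.data.frame(vars)
  out$GroupLetter <- letter
  out
}

# Tables a to i use cut points 1/3 and 2/3; table j uses 1/5 and 2/6
tables_in <- lapply(letters[1:9], simulate_table, lower = 1/3, upper = 2/3)
table_out <- simulate_table("j", lower = 1/5, upper = 2/6)

sim_data <- do.call(rbind, c(tables_in, list(table_out)))
```

With these cut points, the High category is expected to occur far more often in table j than in the other nine tables.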

The database is presented in the format shown in Table 4.

To verify the difference between the distributions of table 10 and the others, the average of the relative frequencies in the three categories was calculated from table a to i, for the 10 variables (Appendix A). Then, the average of the mean relative frequencies of the 10 variables was calculated. The result allows the comparison of the distribution of the categories of the Datak10Contaminated table with the theoretical uniform distribution, as shown in Table 5.

The corresponding goodness-of-fit Chi-square tests were applied to confirm the distribution of the generated data, as well as the comparison of table j with the other tables, confirming significant differences between the distributions (p-value $< 0.05$), as shown in Appendix B.
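
Tests of this kind can be run with `chisq.test()` from base R. The counts below are illustrative values chosen to be close to the expected proportions of each distribution; they are not the frequencies reported in the appendices.

```r
# Illustrative category counts for one variable in two tables of 100 rows each
obs_a <- c(Low = 33, Medium = 34, High = 33)   # close to the uniform case
obs_j <- c(Low = 20, Medium = 13, High = 67)   # close to the table j proportions

# Goodness of fit against the theoretical uniform distribution over 3 categories
gof_a <- chisq.test(obs_a, p = rep(1/3, 3))
gof_j <- chisq.test(obs_j, p = rep(1/3, 3))

# Comparison of table j with another table (test on the 2 x 3 contingency table)
homog <- chisq.test(rbind(obs_j, obs_a))

c(gof_a = gof_a$p.value, gof_j = gof_j$p.value, homog = homog$p.value)
```

With these counts, the uniform hypothesis is not rejected for the first table, while it is rejected for the second, and the two tables differ significantly from each other at the 0.05 level.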

4.1.2. Application of T2Qv Package with Simulated Data

The first result is the T2Qv control chart (Figure 4), which is based on the adjusted Hotelling’s T2 statistic ($T_{med}^{2}$) and applied to detect anomalies in any of the K analyzed tables. Each table in the database is represented by a point on the chart, referred to as “Table Point” followed by the specific name of the table, for example, Table Point a, Table Point b, or Table Point j. The chart also displays a horizontal line that represents the upper control limit (UCL). The lower control limit (LCL) is set to zero.

Furthermore, the point representing table j is located above the upper control limit, indicating that Table Point j has been identified as an out-of-control value. Therefore, it is necessary to carefully analyze what is happening with the data in this table in order to identify the causes of variation and take appropriate actions. To analyze the out-of-control point, a chart of the MCA of table j is compared with the chart of the concatenated table, as shown in Figure 5.

Another result is the MCA plot applied to the Concatenated table (Figure 6). This table is considered the visual reference for the in-control scenario in the subsequent analysis of tables that are reported as out-of-control points by the T2Qv chart.

Hoffman and Leeuw [70] provide insightful guidelines for effectively interpreting an MCA plot. They propose that the categories of the variables are best visualized in a two-dimensional plot as points, where the inter-point distances reflect the degree of homogeneity of the profiles, and closely located points correspond to identical response patterns. The position of the points in the plot serves as an indicator of the marginal frequency of each category, with low-frequency category points positioned towards the edge of the map, while high-frequency category points cluster closer to the origin of the graph.

The authors additionally underscore the importance of establishing a clear association between the axes of the plot and relevant process characteristics, which facilitates the labeling of dimensions one and two and enhances the discrimination of cases based on these attributes. Although higher eigenvalues may suggest the need for additional dimensions, the authors recommend using only the first two for purposes of simplification and ease of interpretation.

Figure 6 shows a total inertia of 63.35%, where dimension 1 represents 53.64% of the information and dimension 2 represents 9.71%. The graph displays the location of categories for each of the 10 variables across their three levels: High, Medium, and Low. Observations located in the center of the graph represent the most frequently occurring categories, while those farthest from the center represent rare cases. In this regard, the concatenated table does not have any observations located in the center of the graph; instead, they are distributed in groups around the center. This can be explained by the uniform distribution of categories for the variables across most of the tables, with no variable dominating over the others.

It is evident that the High categories of the represented variables are grouped to the left on the first principal axis, while the Low categories are arranged approximately from the center to the right, and the Medium categories lie to the right of the vertical axis.

It is interpreted that, in general, there is an inverse relationship between the observations of the High and Medium categories, which means that as the frequencies of one category increase, those of the other decrease. This is the case with the High category of variable V01 and the Medium category of variable V03, which form an obtuse angle with respect to the center of the graph. Likewise, there are High categories located opposite the Low ones, but there are also others that form narrower, almost right angles, indicating little or no association between categories. This is the case with the High category of variable V02 and the Low category of variable V01.
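The angle reading described above can be made concrete with a short sketch. The coordinates are hypothetical (they are not read from Figure 6), and `angle_between` is a helper name introduced here. An angle close to 180 degrees corresponds to opposed categories, while an angle close to 90 degrees corresponds to little or no association.

```python
import math


def angle_between(u, v):
    """Angle in degrees between two category points, seen from the origin."""
    dot = u[0] * v[0] + u[1] * v[1]
    norm = math.hypot(*u) * math.hypot(*v)
    # Clamp to [-1, 1] to guard against rounding just outside the acos domain.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


# Hypothetical plane coordinates of three category points.
high_v01 = (-1.0, 0.2)
medium_v03 = (0.9, -0.3)
low_v01 = (0.1, 1.0)

opposed = angle_between(high_v01, medium_v03)   # obtuse: inverse relationship
unrelated = angle_between(high_v01, low_v01)    # close to a right angle

print(round(opposed, 1), round(unrelated, 1))
```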

Furthermore, the T2Qv application allows the selection of any of the tables in the database for analysis. The user must choose a table for the application to perform an MCA, and as a result, a graph of the MCA of the Point table will be obtained, which can be compared to the MCA graph of the Concatenated table. Then, the behavior of the variables can be interpreted by the location of the categories on the plane of both graphs.

As shown in the T2Qv plot (Figure 4), point j is located above the UCL limit, indicating that the Point j table is out of control. Therefore, it is necessary to analyze in more detail what is happening and what could be the causes that originated this anomalous behavior in the table in question. Accordingly, a comparative analysis of the MCA plot of the Concatenated table and the Point j table is appropriate. The T2Qv application can present these two plots separately or together (Figure 5).

Figure 5 displays the distribution of observations from the Concatenated and Point j tables using MCA graphics. The MCA graph for the Concatenated table, which serves as a reference, was already analyzed in Figure 6. The MCA graph for the Point j table shows Medium categories of some variables located on the left side of the principal axis, moving away from the center of the graph, indicating that they are infrequent. This is the case for variables 1, 2, and 5, but especially variable 3, which records an observation for the Medium level with the farthest value from the group. The remaining Medium observations and all Low observations have been placed around the second axis, leaving the central location of the graph to the High categories, which means that this category has a higher frequency than the others. This makes sense considering that the distribution of table j is High 0.724, Medium 0.092, and Low 0.184 (Table 5).

To obtain a more complete understanding of the behavior of variables across different tables in the database, an MCA plot can be generated for any Point table that is different from the one reported as anomalous. In this example, we examine the MCA plot of Point table b (Figure 7).

In Figure 7, it is evident that the High, Medium, and Low categories of all variables are randomly distributed in all quadrants of the graph, and a specific grouping pattern cannot be determined. The same would happen if the MCA analysis were performed on any of the other tables in the Datak10Contaminated database because they share the same uniform distribution, except for table j, which was designed with a different distribution.

By comparing the plots, it is evident that the data distribution in the MCA plot of the j table is different from the distributions of the other tables and especially from the data distribution in the MCA plot of the concatenated table, which explains why the j point was identified as out of control in the T2Qv plot. This difference is explained in Table 6, which shows the Chi-square distance between the observations of the concatenated table and the j table.

Another way to visualize this information is through a bar chart generated by the T2Qv application (Figure 8).

The bar chart in Figure 8 also shows the $\chi^{2}$ distance between the masses of the Concatenated table and those of the k tables in the Datak10Contaminated database, in this case table j. Table 6 shows that variables V03, V01, and V06 exhibit the highest $\chi^{2}$ distances between the masses of the concatenated table and table j (0.07700, 0.06968, 0.05938), which are represented by the tallest bars in Figure 8. These variables are the ones introducing the most variability to the model, which generates greater changes in the process median; consequently, they contribute the most to the out-of-control output of point j.
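A per-variable comparison of category masses can be sketched as follows. The formula used here is the usual Chi-square distance weighted by the reference masses, which is an assumption: the exact normalization behind Table 6 is not reproduced, so the magnitudes will not match the reported values. The variable names and masses are hypothetical.

```python
def chi2_distance(reference, point):
    """Chi-square distance between two category-mass profiles of one variable."""
    return sum((point[c] - reference[c]) ** 2 / reference[c] for c in reference)


# Reference masses (concatenated table) and masses in one point table.
reference = {"High": 1 / 3, "Medium": 1 / 3, "Low": 1 / 3}
point_table = {
    "V_a": {"High": 0.72, "Medium": 0.09, "Low": 0.19},  # hypothetical, shifted
    "V_b": {"High": 0.36, "Medium": 0.33, "Low": 0.31},  # hypothetical, near uniform
}

distances = {var: chi2_distance(reference, masses) for var, masses in point_table.items()}
# Variables ordered from the largest to the smallest distance, as in a bar chart.
ranked = sorted(distances, key=distances.get, reverse=True)
print(ranked, distances)
```

Sorting the variables by this distance gives the same kind of ordering that the bar chart conveys visually: the variables at the top of the list are the first candidates to inspect.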

The interactivity of this chart facilitates the observation of the distribution of the variable categories in the Point table, and their comparison with the distribution of the variable categories in the concatenated table, as shown in Figure 9.

Figure 9 presents, in pie charts, the distribution of categories for variables V03, V01, and V06, which recorded the highest Chi-squared distances between the masses of the concatenated table and the j table. The charts corresponding to the concatenated table show sectors with equivalent areas, which is explained by the uniform distribution of the variables, while those of the j table show areas with varying sizes, where the High category has a relatively high frequency in all three cases, and Low has a low frequency. Comparing these charts makes it evident that the distribution of categories presents significant differences between the concatenated table and the j table.

It is confirmed that the behavior of variables V03, V01, and V06 has a greater impact on the displacement of the process’s central tendency, which ultimately leads to an out-of-control state. However, in a multivariate context, all variables contribute to explaining the process’s behavior to a greater or lesser extent, so the out-of-control output cannot be attributed to the individual action of one variable or the separate action of a group of variables, but to the combined effect of correlated variables.

4.2. Insights from Data Applied to the Higher-Education Context

In this exemplar case, we undertake an exploration of the outcomes of the Graduate Tracking process for the Medical Sciences program at the Technical University of Machala (UTMACH), Ecuador. For this endeavor, we employ a database called CMSG, sourced from reports that are accessible to university administrators through their Information System SIUTMACH. The CMSG database houses 166 observations and 16 qualitative variables, derived from the results of a graduate follow-up survey from the Medical Sciences program, spanning from 2017 through to 2021.

The data have been arranged into four tables corresponding to the four periods during which the surveys were administered: 2021, 2020, 2019, and 2018–2017. This latter period amalgamates the tracking data of graduates from the initial two years (Appendix C).

There exist additional outcomes pertaining to the tracking of graduates from the Medical Sciences program at UTMACH; however, these correspond to periods prior to 2017 and were gathered via a different survey. Consequently, these constitute disparate variables, and as such, their results do not feature within this present analysis.

The variables encompassed within the CMSG database, each classified under their respective categories, are elucidated in detail within Appendix D.

Figure 10 presents the T2Qv control chart for the depiction of the $K = 4$ tables that compose the CMSG database. Each of these tables is represented by points on the graph and corresponds to the four periods considered in this longitudinal study of graduates. The first three points are situated beneath the control limit (UCL = 33.2), whereas the fourth, which corresponds to the table for 2021, registers a value exceeding the UCL ($T_{med}^{2} = 34.83$). Thus, it can be concluded that the 2021 table is causing the process to deviate from control. As such, a comparative analysis of this table (termed the “point table”) versus the concatenated table, which serves as a reference, is necessitated in order to identify the causes of this variation and facilitate informed decision-making aimed at rectifying the identified deviations. This process employs Multiple Correspondence Analysis (MCA) of the aforementioned tables for visualization.

Multiple Correspondence Analysis of the Concatenated Table: CMSG

Figure 11 presents the plot from the Multiple Correspondence Analysis of the concatenated table. The total inertia is 44.3%. The graph displays a cloud of points, representing the placement on the plane of the different categories of the variables analyzed. The first axis preserves the most explained variance and is associated with the graduates’ satisfaction regarding various aspects queried. On the left side are categories expressing low levels of satisfaction, whereas the right side corresponds to higher satisfaction. Notably, three distinct groups have formed, which are characterized as follows.

The first group, on the left side of the graph, includes the lowest satisfaction levels of graduates with the knowledge and skills acquired, the curriculum, the applied learning strategies, formative research, application of research to community engagement, and the dissemination of research results. These low levels are associated with unemployment scenarios; however, there is also little effort by the graduates to connect to a job—they have merely relied on the press, radio, and television for information. Graduates from this group express there is little or no relationship between work and the professional profile of the doctor, and disinterest in pursuing postgraduate programs.

The second group, at the other end of the horizontal axis, shows a strong association between the categories expressing the highest satisfaction among variables related to the professional training of UTMACH medical graduates. These are professionals who have put effort into two strategies to find work: the UTMACH Job Exchange and the Socioemployment Network, driven by the central government. As a result, these professionals are working, mostly as treating or rural doctors, linked to the Ministry of Public Health and their income is between $901 and $1200 per month. Additionally, these graduates, mainly from the 2014 cohort, assert there is a high relationship between work and the professional profile of the doctor and have an interest in pursuing postgraduate programs; some are already studying programs in medical specialties, especially in the clinical area.

The third group associates the categories that represent medium and medium-high satisfaction values regarding the variables evaluated. This group includes most graduates from different cohorts. They have based their strategy to find employment on the use of social networks and personal contacts. In response, they are working in the private sector and as independent doctors, resident doctors, and in administrative tasks in health houses. Their income is between $1201 and $1800. The interest in pursuing postgraduate programs is expressed around areas of surgery, higher medical education, gynecology-obstetrics, pediatrics, and community health. Within this group of graduates, there are also those who are not encouraged to study a postgraduate program, likely because they consider employment in private companies not as stable as in state institutions.

Figure 12 displays the MCA graph of the point table, i.e., the table containing the data from the survey applied to the 2021 graduates. The total inertia is 47.26%. Comparing this table with the concatenated one, it is observed that there is a broader distribution of the points on the plane formed by the two dimensions, implying a higher incidence of other latent variables in the representation of the points. In contrast, in the concatenated table, the points were practically located around the first axis, which represented satisfaction.

Figure 12 also forms three groups that have a configuration similar to the concatenated table, but there are differences in the association of different variable categories which are described below.

The first group, located in the first quadrant of the plane, contains the lowest satisfaction levels of graduates with almost all the evaluated variables. However, unlike in the concatenated table, these low levels are not associated with unemployment scenarios. This is concerning. The graduates of this group express dissatisfaction with relevant elements of their training, and they are professionals working in their field of knowledge, either as independent professionals or in the role of treating physicians. This could be due to a perception of little job stability and low economic income, which do not exceed $600.

This dissatisfaction and perception of instability can be related to their denial or little interest in growing in their professional training at the postgraduate level, at least until achieving a more solid employment situation and economic position, perhaps when they join state health institutions.

On the other side of the graph, in the second quadrant, the second group is formed. This group shows a strong association between the categories expressing the highest satisfaction of graduates with all the evaluated variables. Still, it is noticeable that amid this cloud of points, a category reveals dissatisfaction with the curriculum. This is a significant difference between the 2021 table and the concatenated one. To highlight the difference, this group of graduates, represented by the 2011 and 2013 cohorts, showing conformity with the most important characteristics of their training process, is unemployed and asserts that they find little or no relationship between work and the professional profile of the doctor. Apparently, their strategy to find work, the UTMACH Job Exchange, has not yielded the expected results.

The third group, as in the graph of the concatenated table, associates the categories representing medium and medium-high satisfaction values regarding the evaluated variables. Unlike what is observed in the graph of the concatenated table, the graduates of this group bet on finding work using personal references and through the Socioemployment Network, promoted by the state, and managed to position themselves as rural doctors in the Ministry of Health, although this is temporary. Even so, the graduates of this group, mainly those of the 2012 and 2014 cohorts, value as high or very high the relationship between work and the professional profile of the doctor.

The variable categories that have changed their location in the compared tables and their degree of association in a sensitive manner may be causing the out-of-control state of the process. The identification of these variables is facilitated when analyzing Figure 13.

Figure 13 presents, in a bar graph, the Chi-square distance between the column masses of the concatenated table and the point table, which in the T2Qv graph (Figure 10) was shown with a value greater than the control limit. The taller the variable bars, the greater this distance. The variables with the greatest Chi-square distances are those most strongly causing the shift from the centrality of the process and leading it to an out-of-control state. This is consistent with the findings of the comparative analysis of the concatenated and point tables through MCA.

In Figure 13, the tallest bar represents the Hierarchical Job Level, followed at a distance by the other variables. The analysis requires comparing the distribution of the categories of each variable in the two tables. For this purpose, the T2Qv application generates interactive circular graphs of the analyzed variables, which appear when passing the cursor over the bar graph (Figure 14).

Upon comparing, it is observed that the category with the greatest differences is Rural Doctor. This refers to recent graduates who practice in rural or remote areas for a period (usually a year), as a prerequisite to then practice the profession in the country. In the 2021 table, graduates practicing as rural doctors represent 70%, while in the Concatenated table, it is 35%.

On the other hand, the Administrative category, which in the Concatenated table represents 5% of the graduates, does not appear in the 2021 point table. This is coherent because this category is of the highest hierarchy, includes roles related to the administration and management of the health system, and consequently requires the participation of more experienced professionals, while the 2021 graduates are fresh out of university. For this reason, even in the Concatenated table, which collects information from the graduates of the latest cohorts, this category has little participation.

Another category that does not appear in the 2021 table is Others, although it does appear in the Concatenated table (11%). This group contains the graduates who have already met the Rural Doctor requirement but have not yet been able to find work in the health field; consequently, they are engaged in other activities until they find a doctor’s job.

The Resident Doctors group also presents significant differences. In the Concatenated table, these graduates have a participation of 17%, while in the point table, only 3%. Resident doctors are those who have already met the rural doctor requirement and now work in health centers through a contract with the Ministry of Health, under the supervision of more experienced doctors, and may be pursuing postgraduate studies. It is normal that the graduates of the last cohort have a low percentage in this category when compared to the total graduates of all the studied cohorts, who have been in the job market for longer.

Other categories present less pronounced differences, such as Private Doctor, which shows low levels in both tables (0.11 in the Concatenated table and 0.03 in the 2021 table), or the Nothing category, which has 0.19 in the Concatenated table and 0.18 in the point table.
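The comparison of proportions described for the Hierarchical Job Level can be summarized in a short sketch. The proportions are the ones quoted in the text for the Concatenated and 2021 tables; only the categories mentioned there are included, so they do not add up to 1. The dictionary and variable names are assumptions introduced for illustration.

```python
# Proportions quoted in the text for the Hierarchical Job Level variable.
concatenated = {
    "Rural Doctor": 0.35,
    "Administrative": 0.05,
    "Others": 0.11,
    "Resident Doctors": 0.17,
    "Private Doctor": 0.11,
    "Nothing": 0.19,
}
table_2021 = {
    "Rural Doctor": 0.70,
    "Administrative": 0.00,  # category absent in the 2021 table
    "Others": 0.00,          # category absent in the 2021 table
    "Resident Doctors": 0.03,
    "Private Doctor": 0.03,
    "Nothing": 0.18,
}

# Absolute gap per category, ordered from the largest to the smallest.
gaps = {c: abs(table_2021[c] - concatenated[c]) for c in concatenated}
ranking = sorted(gaps, key=gaps.get, reverse=True)

for category in ranking:
    print(f"{category:17s} {gaps[category]:.2f}")
```

The ordering places Rural Doctor first and Nothing last, in line with the reading of the pie charts.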

As already noted in Figure 13, after the Hierarchical Job Level, the other variables have smaller Chi-square distances between the column masses of the concatenated table and the point table, indicating a lesser incidence in the shift of the process median; consequently, they have less impact on the process going out of control. As an example, the behavior of the variable “Relationship between the world of work and the professional profile of the doctor” is analyzed next (Figure 15).

This variable seeks to determine the level of graduate satisfaction with the relationship between the world of work and the professional profile of the doctor. As shown in Figure 15, the proportion of the categories of the variable in the point table and in the Concatenated table is not very different; the only notable change is that the Low relationship category does not appear in the 2021 table, whereas it does in the Concatenated table, although with a low value (0.01). However, the potential to influence the process going out of control, in addition to the proportion of the categories, lies in the types and levels of association that the categories of this variable have with others, as observed in the location of the points in the MCA graphics in the Concatenated and 2021 tables.

This is inherent to the multivariate approach in statistical process control, which considers that all variables contribute to a greater or lesser extent to the process behavior. The process going out of control is not attributed to the individual action of a variable or group of them, but to the combined effect of correlated variables. This approach allows understanding the factors that affect the process, identifying the various causes of its variation, and adopting measures that allow correcting deviations even in complex situations where multiple variables interact with each other and can have indirect effects on the process behavior.

5. Sensitivity Analysis

As mentioned, in the T2Qv chart, an out-of-control point is interpreted as a table ($k_{i}$) that includes a quantity or proportion of contaminated variables. In these cases, the points on the T2Qv chart are expected to generalize the behavior of these differences in their distribution and thus surpass the upper control limit (UCL). The location of this control limit varies depending on the number of dimensions represented, since it is based on Hotelling’s T2 chart, whose upper control limit (UCL) depends on the number of variables considered [2]. In the case of T2Qv, this is the number of latent dimensions considered (Equation 10), and therefore, high dimensionality achieves optimal performance, while decreasing the number of dimensions that can be represented introduces instability and reduces the reliability of the results.

The proposed control chart can detect an out-of-control point even with a low number of contaminated variables when working with a high number of dimensions. It is recommended to use $p - 1$ dimensions, where p is the total number of dimensions in the initial matrix (Figure 1). When the number of dimensions is decreased, the height of the upper control limit (UCL) also decreases, resulting in an increased number of out-of-control points, although the variables may not necessarily express significant differences in their values, increasing the probability of obtaining a type I error.
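The dependence of the control limit on the number of dimensions can be illustrated with a simplified sketch. It does not reproduce Equation 10: it assumes the large-sample Chi-square limit that is commonly used for T2-type statistics, computes the quantile with the Wilson-Hilferty approximation, and uses a hypothetical false-alarm rate of 0.05. The function name `chi2_quantile` is an assumption introduced here.

```python
from statistics import NormalDist


def chi2_quantile(prob, df):
    """Wilson-Hilferty approximation to the Chi-square quantile with df degrees of freedom."""
    z = NormalDist().inv_cdf(prob)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * a ** 0.5) ** 3


alpha = 0.05  # hypothetical false-alarm rate
# Approximate upper control limit for 1 to 9 retained dimensions.
ucl_by_dims = {d: chi2_quantile(1 - alpha, d) for d in range(1, 10)}

for dims, ucl in ucl_by_dims.items():
    print(f"{dims} dimensions -> approximate UCL {ucl:.2f}")
```

Under these assumptions the limit grows with the number of retained dimensions, so dropping dimensions lowers the limit, in agreement with the behavior described above.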

Therefore, the question arises as to how many dimensions can be reduced in the analysis without losing reliability in the result. The importance of this question lies in the need for a reliable chart that identifies out-of-control points even if a dimensionality reduction technique has been applied to the data, without falling into cases of false positives.

Sensitivity analysis uses contour plots and response surface plots (Figure 16), for which the persp3D and contour2D functions from the plot3D package [71] were used. A simulation of databases with different parameters was developed, considering a variation in the percentage of contaminated variables in the $k_{i}$ table and the number of represented dimensions. The T2Qv chart was applied to each simulation to evaluate the behavior of the control chart under small perturbations in the case of a few contaminated variables, and large perturbations in the case of most contaminated variables.

The test data used in the model are recorded in 10 tables, each of which includes 10 variables and each variable has three categories: High, Medium, and Low. Table 10 (or table j) has a different distribution from the others, this being the contaminated table.
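A data set with this structure can be generated with a short simulation sketch. It is an assumption-laden illustration, not the code used to build Datak10Contaminated: the function name, the seed, and the choice of contaminating the first variables of the last table are hypothetical, and the shifted weights simply reuse the proportions reported for table j (High 0.724, Medium 0.092, Low 0.184).

```python
import random

LEVELS = ["High", "Medium", "Low"]


def simulate_tables(n_tables=10, n_rows=100, n_vars=10, contaminated_share=1.0,
                    shifted_weights=(0.724, 0.092, 0.184), seed=1):
    """Return a list of tables; only the last table has contaminated variables."""
    rng = random.Random(seed)
    n_shifted = round(contaminated_share * n_vars)
    tables = []
    for t in range(n_tables):
        is_last = t == n_tables - 1
        columns = []
        for v in range(n_vars):
            contaminated = is_last and v < n_shifted
            weights = shifted_weights if contaminated else (1, 1, 1)
            columns.append(rng.choices(LEVELS, weights=weights, k=n_rows))
        # Transpose the columns so that each row is one observation.
        tables.append([list(row) for row in zip(*columns)])
    return tables


# 30% of the variables contaminated in the last table (hypothetical setting).
tables = simulate_tables(contaminated_share=0.3)
high_share = sum(row[0] == "High" for row in tables[-1]) / len(tables[-1])
print(len(tables), len(tables[0]), len(tables[0][0]), high_share)
```

Varying `contaminated_share` over a grid, together with the number of retained dimensions, gives the kind of parameter sweep on which the sensitivity analysis is based.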

It is observed that the model can identify an out-of-control point when working with p − 1 dimensions (9), even with a low percentage of contaminated variables. When the number of dimensions decreases to p − 2 (8) and the percentage of contaminated variables is close to 100%, it correctly detects 1 out-of-control point. It is also observed that when the number of dimensions is lower, stability is lost and the power of the test is reduced. Consequently, the sensitivity analysis confirms that the T2Qv control chart performs well when working with high dimensions.

6. Discussion

In statistical process control, there are still few published proposals for control charts for qualitative variables. Differences between procedures for determining statistics and control charts in this field make comparison difficult.

The structure of the database required for the application of the T2Qv control chart involves a set of overlapping tables, where each of them constitutes a sample. The tables must have the same variables, which are in columns. One of these variables records the data used to identify the tables, such as the year, while the others provide the categories that operate in the MCA. From these variables, latent variables arise, which are the dimensions that intervene in the analysis.
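The data layout described above can be sketched as follows. This is not the T2Qv package interface: the function names, the `year` identifier, and the records are hypothetical. The sketch splits the rows by the table identifier and builds the indicator (complete disjunctive) coding that an MCA takes as input.

```python
from collections import defaultdict


def split_by_table(rows, id_field):
    """Group the rows by the table identifier and drop that field from each record."""
    tables = defaultdict(list)
    for row in rows:
        record = {key: value for key, value in row.items() if key != id_field}
        tables[row[id_field]].append(record)
    return dict(tables)


def indicator_matrix(records):
    """One column per (variable, category) pair; 1 if the record shows that category."""
    columns = sorted({(var, cat) for record in records for var, cat in record.items()})
    matrix = [[1 if record.get(var) == cat else 0 for var, cat in columns]
              for record in records]
    return columns, matrix


# Hypothetical rows: one identifier column plus two categorical variables.
rows = [
    {"year": 2020, "V01": "High", "V02": "Low"},
    {"year": 2020, "V01": "Low", "V02": "Low"},
    {"year": 2021, "V01": "High", "V02": "Medium"},
]

tables = split_by_table(rows, "year")
all_records = [record for table in tables.values() for record in table]
columns, matrix = indicator_matrix(all_records)
print(columns)
print(matrix)
```

Each row of the indicator matrix sums to the number of categorical variables, because every observation shows exactly one category per variable.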

Considering that MCA is a multivariate analysis technique that involves dimension reduction, and that the data processing for the T2Qv chart works with $p - 1$ dimensions, a database with p variables ($p > 3$) is required from the beginning for the analysis, in addition to the identification variable of the tables. That is to say, the chart cannot function with a data set that has fewer than five variables, including the table classification variable. This characteristic is a restriction in the use of T2Qv, especially when the sensitivity analysis results indicate that the chart loses stability at low dimensions, and that when it operates with high dimensions, it performs well.

In several multivariate control chart studies reviewed in the literature, the examples analyze only two or three variables as application cases. This can be observed in the publications of Epprecht et al. [39]; Ali and Aslam [72]; Jiang et al. [73]; Pastuizaca-Fernández et al. [44]; Taleb [43]; Taleb et al. [42]. These cases could not be treated with the T2Qv because they have fewer dimensions than required.

Meanwhile, in the example of application with simulated data presented in this research, the T2Qv analyzes the behavior of 10 variables, and the Datak10Contaminated database has 11 columns (Table 4). It can be stated, then, that a strength of the multivariate control chart T2Qv is its good performance when working with high dimensions, while its weakness is associated with working with low dimensions, and that it cannot work with fewer than five variables.

Compared to the publications reviewed in the literature, the multivariate control chart T2Qv proposed in this research is aimed at analyzing databases that have K tables, where each $k_{i}$ table is a sample consisting of n observations (rows) and p variables (columns), taken at K different analysis moments and represented as a point in the T2Qv chart. This means that the T2Qv analysis can be configured as a data cube ($n \times p \times k$), allowing the monitoring of process stability at multiple moments and with multiple variables.

The simulated dataset, Datak10Contaminated, used to illustrate the proposal in this research includes a set of 10 tables and 11 variables, where each table, represented as a point in the T2Qv plot, is a sample composed of 100 observations, for a total of 1000 rows. Therefore, one of the most valuable features of the T2Qv control chart is its ability to work with databases containing K tables, which can be made up of many individuals and multiple variables.

In the proposal presented in this article, each of the individuals (rows) that make up the different samples can have different configurations based on the number of categories of the multiple variables. Sociodemographic variables, which are very common in social context research, have different numbers of categories. For example, sex, age group, education level, marital status, state or province of residence, type of housing, presence of disability, type of disability, ethnic self-identification, employment status, and income level, among others. The registered individuals (rows) in a database can have different configurations based on the selected categories for each of the variables that characterize them. The T2Qv control chart, as well as the other charts that complement this proposal, represent the behavior of the variables well, even if there are dichotomous or polytomous variables with three, 10, or more categories, which is another strength of the proposal.

Meanwhile, there are publications on multivariate control charts for attribute data that, although they consider several quality characteristics in their analysis, ultimately classify each individual by only one of the analyzed variables. This is the case with the proposal by Ranjan-Mukhopadhyay [41], which is demonstrated with a case study that controls 7 quality characteristics in 24 samples whose size ranges from 20 to 404 individuals. The variables correspond to 6 types of defects in the paint on ceiling fan covers: insufficient coverage, overflow, blistering, pinholes, paint defects, and polishing defects. The seventh characteristic is the absence of defects. Each individual is classified by their most predominant defect; therefore, only one type of defect or absence of defects appears in their record, resulting in a loss of information about the combined effect of variables on the process.

Continuing with the analysis, we can observe numerous proposals published on multivariate charts that correspond to phase II of statistical process control and consequently, can make adjustments to their performance with a view to optimization. This contributes to the improvement of the effectiveness in detecting small changes in the process mean [7,9,74], as well as increasing efficiency by optimizing sample sizes [11], reducing sampling costs [2], minimizing the average time to detect changes outside of control [39], using tracking statistics based on classification algorithms in complex high-dimensional data [15], and improving data quality by cleaning outlier values [33].

The T2Qv chart is a tool that handles qualitative variables in phase I of statistical process control. Consequently, its effectiveness and efficiency have not yet been evaluated. This feature of the proposed chart constitutes a weakness when compared to phase II charts, but it also presents an opportunity for improvement to be considered in future studies aimed at optimizing it, by establishing control limits that adjust to specific analysis parameters, or, following Aparisi [11], Ruiz-Barzola [2], and Soriano [75], by delving deeper into the relationship between sample size (n) and the Average Run Length ($ARL_{1}$) in Phase II, as increasing n reduces the $ARL_{1}$.
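The link between detection probability and the Average Run Length can be illustrated with a minimal sketch. It assumes the plotted points are independent, so that the run length is geometric and the ARL is the reciprocal of the probability that a point signals; the detection probabilities below are hypothetical and are not results for the T2Qv chart.

```python
def average_run_length(signal_probability):
    """ARL of a chart whose points signal independently with a fixed probability."""
    if not 0 < signal_probability <= 1:
        raise ValueError("signal_probability must be in (0, 1]")
    return 1.0 / signal_probability


# Hypothetical detection probabilities for a small and a large sample size.
arl1_small_n = average_run_length(0.2)
arl1_large_n = average_run_length(0.5)
print(arl1_small_n, arl1_large_n)
```

If a larger sample size raises the probability of detecting a given shift, the out-of-control ARL falls accordingly, which is the relationship the cited studies examine.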

The MCA is a factorial technique that seeks associations of variability and makes information that is common to most cases have little discrimination in the principal axes of the graph. Categories with low marginal frequency will be located at the edge of the plane, while those with high marginal frequency will be located closer to the origin [70]. In the T2Qv application, if a variable is expressed in only one category in a database table, it cannot be represented, and an error is reported, as its association with other categories of the other variables could not be measured. However, if there is at least one different case, it will be represented as a point very far away on some end of the principal axes. This is a limitation inherent to the methodology of the MCA that is inherited by the T2Qv application.
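A simple pre-check can reveal this situation before the analysis is run. The sketch below is not part of the T2Qv application; the function name and the records are hypothetical. It lists the variables that show a single category within one table.

```python
def constant_variables(records):
    """Names of the variables that take only one category in the given table."""
    seen = {}
    for record in records:
        for var, cat in record.items():
            seen.setdefault(var, set()).add(cat)
    return sorted(var for var, cats in seen.items() if len(cats) < 2)


# Hypothetical table in which V01 is expressed in only one category.
table = [
    {"V01": "High", "V02": "Low"},
    {"V01": "High", "V02": "Medium"},
    {"V01": "High", "V02": "Low"},
]

print(constant_variables(table))
```

Running a check of this kind on every table makes it possible to decide in advance whether a variable should be recoded or excluded.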

An opportunity for future research related to multivariate control for qualitative variables would be the development of a methodology that incorporates not only MCA but also three-way multivariate techniques. It could be feasible to incorporate, for example, JK-Meta Biplot [21] or STATIS Dual [23,76], techniques that facilitate the understanding of the internal structure of the data cube, after performing a previous qualitative coding analysis in a classical qualitative analysis phase [77].

Another opportunity for future research is the use of multivariate statistical process control in the context of big data. This topic is still in its infancy in scientific literature. Big data analysis requires monitoring the underlying sequential process of observed data to monitor the longitudinal performance of processes [78], or to determine how their distribution changes over time [79]. Traditional SPC control charts currently have difficulties in designing, recognizing, and interpreting patterns in machine-learning environments. Machine-learning algorithms can be integrated into SPC control charts to solve these problems [80] using new methods, detecting early anomalies, and making better decisions.

New data analysis techniques and tools are needed that can be effectively integrated into existing quality-control processes and address challenges such as model validation and software applications. Among the potential applications of these new research areas are dynamic disease detection, real-time profile, or image recognition.

7. Conclusions

This article presents a tool for multivariate statistical process control that performs analysis of qualitative data, which is called the T2Qv control chart, based on Multiple Correspondence Analysis. Normalized coordinates are represented by the robust Hotelling T2 chart. The T2Qv meets the need for a multivariate statistical process control chart for qualitative variables used in various production, industrial, environmental, administrative, and health processes, but especially in social processes, where the use of nominal and ordinal variables is very common.

The T2Qv chart can detect anomalous behaviors in the process and supports the interpretation of the behavior of the variables and their impact on the out-of-control state. To aid in this interpretation, the proposed method generates an MCA chart of the concatenated table as a reference for comparison with the MCA charts of tables identified as out-of-control in the T2Qv chart. Additionally, a Chi-squared distance analysis is performed between the categories of the tables, and interactive charts are presented to analyze the percentage distribution of the categories of the variables. The sensitivity analysis determined that the T2Qv control chart performs well when working with high dimensions but loses stability at low dimensions.

To facilitate the dissemination and application of the proposed method, a reproducible computational statistical package has been developed in R, called T2Qv, which is available on CRAN. This package allows for the visualization of results in either a flat or interactive format, and includes a Shiny dashboard that contains all integrated functions in one space.

The T2Qv chart has advantages such as its adaptability to qualitative databases of n individuals with p variables at K distinct moments. It performs well when working with high dimensions, is stable in the presence of potentially atypical values, and represents the behavior of variables well, even when they include dichotomous or polytomous variables with different numbers of categories. It is easy to apply thanks to its computational complement, which includes a dashboard offering a complete, sequential analysis designed with user experience in mind. The dashboard allows users to detect which table went out of control, compare the associations in that table with those of the in-control scenario, and inspect which variables had the greatest impact on the out-of-control state of the process.

One of the limitations of T2Qv is the requirement of a database with a minimum of four variables, including the table classification variable. The variables in the tables must have at least one variation for the association to be measurable. The method loses stability when working with low dimensions.
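These requirements can be checked before running an analysis. The sketch below is a hypothetical pre-check, not part of the T2Qv package: it assumes the data arrive as a list of dictionaries, one per individual, and that the name of the table classification variable is passed explicitly. It tests the two conditions stated above, the minimum of four variables and at least two observed categories per variable within each table.

```python
def check_t2qv_input(rows, table_column):
    """Return a list of problems that would prevent the analysis (hypothetical helper)."""
    columns = list(rows[0].keys())
    if table_column not in columns:
        return [f"missing table classifier '{table_column}'"]
    problems = []
    if len(columns) < 4:
        problems.append("fewer than four variables (classifier included)")
    levels = {}  # (table, variable) -> set of categories observed in that table
    for row in rows:
        for col in columns:
            if col != table_column:
                levels.setdefault((row[table_column], col), set()).add(row[col])
    for (table, col), seen in sorted(levels.items()):
        if len(seen) < 2:
            problems.append(f"table {table}: variable '{col}' has a single category")
    return problems

# Hypothetical records with a table classifier and three qualitative variables
rows = [
    {"Grad_year": "2019", "Employment": "Public", "Income": "601 to 900", "Cohort": "2009"},
    {"Grad_year": "2019", "Employment": "Private", "Income": "901 to 1200", "Cohort": "2010"},
    {"Grad_year": "2020", "Employment": "Public", "Income": "601 to 900", "Cohort": "2009"},
    {"Grad_year": "2020", "Employment": "Public", "Income": "901 to 1200", "Cohort": "2010"},
]
print(check_t2qv_input(rows, "Grad_year"))
```

In this example the 2020 table contains only public employees, so the check reports that variable as not measurable in that table.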

As an opportunity for improvement, an optimization study for phase II is proposed for future research, since the T2Qv chart currently focuses solely on phase I of the process. It may be beneficial to incorporate three-way multivariate techniques such as JK-Meta Biplot and STATIS Dual in future research and software updates. Additionally, T2Qv could be optimized for big data environments, where methodological quality must be balanced against computational resource management.

In a multivariate context, all variables contribute to a greater or lesser extent to explain the behavior of the process, so that the out-of-control output cannot be attributed to the individual action of a variable, or to the separate action of a group of them, but to the combined effect of correlated variables. This is why a multivariate approach is necessary in statistical process control.

Author Contributions

Conceptualization, O.R.-B.; Methodology, W.R.-P. and O.R.-B.; Software, W.R.-P. and M.R.-C.; Validation, P.G.-V.; Formal analysis, W.R.-P.; Investigation, W.R.-P.; Resources, W.R.-P.; Data curation, M.R.-C.; Writing—original draft, W.R.-P.; Writing—review & editing, M.R.-C.; Visualization, W.R.-P. and M.R.-C.; Supervision, O.R.-B.; Project administration, P.G.-V. and O.R.-B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

We would like to thank the Linkage Department of the Technical University of Machala for providing the graduate-tracking database in the field of Medical Sciences, which was used for the practical demonstration of the methodology proposed in this article.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Average of mean relative frequencies in the three categories, from table a to i, Data10Contaminated.

Table	Category	V01	V02	V03	V04	V05	V06	V07	V08	V09	V10
a	High	0.29	0.25	0.36	0.38	0.38	0.35	0.36	0.29	0.33	0.37
a	Medium	0.36	0.49	0.34	0.34	0.31	0.41	0.38	0.28	0.38	0.31
a	Low	0.35	0.26	0.3	0.28	0.31	0.24	0.26	0.43	0.29	0.32
b	High	0.31	0.44	0.37	0.29	0.31	0.34	0.3	0.36	0.29	0.34
b	Medium	0.4	0.31	0.3	0.35	0.37	0.32	0.35	0.3	0.39	0.36
b	Low	0.29	0.25	0.33	0.36	0.32	0.34	0.35	0.34	0.32	0.3
c	High	0.34	0.33	0.25	0.35	0.32	0.3	0.39	0.4	0.41	0.43
c	Medium	0.36	0.33	0.25	0.32	0.32	0.32	0.27	0.35	0.32	0.35
c	Low	0.3	0.34	0.5	0.33	0.36	0.38	0.34	0.25	0.27	0.22
d	High	0.32	0.34	0.34	0.38	0.41	0.33	0.35	0.46	0.34	0.45
d	Medium	0.35	0.3	0.28	0.31	0.27	0.35	0.3	0.24	0.33	0.24
d	Low	0.33	0.36	0.38	0.31	0.32	0.32	0.35	0.3	0.33	0.31
e	High	0.32	0.32	0.36	0.26	0.36	0.31	0.29	0.28	0.32	0.41
e	Medium	0.34	0.4	0.34	0.4	0.38	0.37	0.27	0.37	0.32	0.23
e	Low	0.34	0.28	0.3	0.34	0.26	0.32	0.44	0.35	0.36	0.36
f	High	0.31	0.29	0.27	0.32	0.36	0.32	0.26	0.41	0.34	0.26
f	Medium	0.41	0.29	0.36	0.31	0.31	0.38	0.36	0.33	0.3	0.37
f	Low	0.28	0.42	0.37	0.37	0.33	0.3	0.38	0.26	0.36	0.37
g	High	0.27	0.39	0.34	0.38	0.28	0.31	0.35	0.38	0.27	0.34
g	Medium	0.42	0.27	0.32	0.35	0.37	0.32	0.35	0.36	0.41	0.26
g	Low	0.31	0.34	0.34	0.27	0.35	0.37	0.3	0.26	0.32	0.4
h	High	0.32	0.47	0.34	0.38	0.47	0.34	0.32	0.35	0.35	0.31
h	Medium	0.28	0.31	0.29	0.27	0.27	0.43	0.39	0.35	0.36	0.4
h	Low	0.4	0.22	0.37	0.35	0.26	0.23	0.29	0.3	0.29	0.29
i	High	0.32	0.42	0.29	0.3	0.26	0.28	0.38	0.38	0.36	0.36
i	Medium	0.35	0.34	0.29	0.33	0.47	0.38	0.25	0.29	0.33	0.31
i	Low	0.33	0.24	0.42	0.37	0.27	0.34	0.37	0.33	0.31	0.33
j	High	0.75	0.71	0.78	0.71	0.7	0.73	0.69	0.66	0.73	0.78
j	Medium	0.08	0.1	0.01	0.06	0.1	0.12	0.11	0.12	0.12	0.1
j	Low	0.17	0.19	0.21	0.23	0.2	0.15	0.2	0.22	0.15	0.12
Avg a,b,…,i	High	0.31	0.37	0.33	0.33	0.35	0.32	0.33	0.37	0.33	0.36
Avg a,b,…,i	Medium	0.37	0.34	0.31	0.33	0.34	0.36	0.33	0.32	0.35	0.32
Avg a,b,…,i	Low	0.32	0.30	0.36	0.33	0.31	0.32	0.34	0.32	0.32	0.32
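As a small worked example of how the reference rows relate to the individual tables, the sketch below averages the High-category frequencies of V01 over tables a to i and compares the result with table j. The values are copied from the table above; the variable names are illustrative, and only this one cell is reproduced.

```python
# High-category relative frequencies of V01 for tables a..i, copied from the table
v01_high = {"a": 0.29, "b": 0.31, "c": 0.34, "d": 0.32, "e": 0.32,
            "f": 0.31, "g": 0.27, "h": 0.32, "i": 0.32}
v01_high_j = 0.75  # same cell for table j

reference = sum(v01_high.values()) / len(v01_high)
print(round(reference, 2))               # average over tables a..i
print(round(v01_high_j - reference, 2))  # gap between table j and that average
```

The average agrees with the 0.31 reported in the last block of the table, and the gap for table j is large compared with the spread among tables a to i.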

Appendix B

Test statistics for the comparison of the distributions of the categories of the 10 variables between table j and the others, Datak10Contaminated.

Table	Statistic	V01	V02	V03	V04	V05	V06	V07	V08	V09	V10
a	Chi-Squared	0.86	11.06	0.56	1.52	0.98	4.46	2.48	4.22	1.22	0.62
	p-value	0.651	0.004	0.756	0.468	0.613	0.108	0.289	0.121	0.543	0.733
b	Chi-Squared	2.06	5.66	0.74	0.86	0.62	0.08	0.5	0.56	1.58	0.56
	p-value	0.357	0.059	0.691	0.651	0.733	0.961	0.779	0.756	0.454	0.756
c	Chi-Squared	0.56	0.02	12.5	0.14	0.32	1.04	2.18	3.5	3.02	6.74
	p-value	0.756	0.99	0.002	0.932	0.852	0.595	0.336	0.174	0.221	0.034
d	Chi-Squared	0.14	0.56	1.52	0.98	3.02	0.14	0.5	7.76	0.02	6.86
	p-value	0.932	0.756	0.468	0.613	0.221	0.932	0.779	0.021	0.99	0.032
e	Chi-Squared	0.08	2.24	0.56	2.96	2.48	0.62	5.18	1.34	0.32	5.18
	p-value	0.961	0.326	0.756	0.228	0.289	0.733	0.075	0.512	0.852	0.075
f	Chi-Squared	2.78	3.38	1.82	0.62	0.38	1.04	2.48	3.38	0.56	2.42
	p-value	0.249	0.185	0.403	0.733	0.827	0.595	0.289	0.185	0.756	0.298
g	Chi-Squared	3.62	2.18	0.08	1.94	1.34	0.62	0.5	2.48	3.02	2.96
	p-value	0.164	0.336	0.961	0.379	0.512	0.733	0.779	0.289	0.221	0.228
h	Chi-Squared	2.24	9.62	0.98	1.94	8.42	6.02	1.58	0.5	0.86	2.06
	p-value	0.326	0.008	0.613	0.379	0.015	0.049	0.454	0.779	0.651	0.357
i	Chi-Squared	0.14	4.88	3.38	0.74	8.42	1.52	3.14	1.22	0.38	0.38
	p-value	0.932	0.087	0.185	0.691	0.015	0.468	0.208	0.543	0.827	0.827
j	Chi-Squared	79.34	65.06	95.78	68.18	62.00	70.94	58.46	49.52	70.94	89.84
	p-value	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
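The tabulated values can be reproduced from the relative frequencies in Appendix A under two assumptions that the text does not state and that are inferred here from the numbers alone: each table holds n = 100 observations, and the statistic is a Chi-squared goodness-of-fit test against equal expected counts in the three categories (2 degrees of freedom). The function name is illustrative, and the sketch uses only the standard library.

```python
import math

def chi2_uniform(freqs, n=100):
    """Chi-squared goodness-of-fit statistic against equal expected counts,
    with its p-value for 2 degrees of freedom (exactly three categories)."""
    expected = n / len(freqs)
    stat = sum((f * n - expected) ** 2 / expected for f in freqs)
    p_value = math.exp(-stat / 2.0)  # closed-form upper tail, valid only for df = 2
    return stat, p_value

# Relative frequencies (High, Medium, Low) of V02 in table a and of V01 in table j
stat_a, p_a = chi2_uniform([0.25, 0.49, 0.26])
stat_j, p_j = chi2_uniform([0.75, 0.08, 0.17])
print(round(stat_a, 2), round(p_a, 3))
print(round(stat_j, 2), round(p_j, 3))
```

Under these assumptions the first call matches the entry for table a, V02 (11.06 with p = 0.004) and the second matches table j, V01 (79.34 with p = 0.000).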

Appendix C

Section of the CMSG database.

Work Relationship and Professional Profile; Satisfaction with Knowledge and Skills Acquired; Satisfaction Curricular Mesh; Satisfaction Learning Strategies; Satisfaction Formative Research; Satisfaction Application of Investigation Linking; Satisfaction Diffusion of Research Results; Type of Graduate Studying; Interest by Specialty; Strategy to Get a Job; Employment Situation; Hierarchical Labor Level; Seniority at Work; Monthly Income; Cohort; Grad Year

Low Relationship; High Moderate; Satisfactory; None; Pediatrics; Personal References; Unemployed; None; Before 2009; 2020
High Relationship; High Moderate; Satisfactory; Master’s Degree; Higher Medical Education; Press_Radio_TV; Independent Professional; Other; Less than 6 months; 901 to 1200; Before 2009; 2017–2018
Very High Relationship; High; Very Satisfactory; None; Surgery; UTMACH Job Board; Public Employee; Other; Less than 6 months; 901 to 1200; Before 2009; 2017–2018
Very High Relationship; High; Very Satisfactory; None; Surgery; Personal References; Independent Professional; Private MD; 6 months to 1 year; 901 to 1200; Before 2009; 2017–2018
Very High Relationship; Moderately; Satisfactory; Low Satisfactory; None; Community Health; Personal References; Private Employee; Private MD; Less than 6 months; 1201 to 1500; Before 2009; 2019
Moderate Relationship; Moderately; Low Satisfactory; None; Surgery; Socioemployment Network; Private Employee; Private MD; 1 to 2 years; 376 to 600; Before 2009; 2020
Very High Relationship; High; Satisfactory; None; Surgery; Personal References; Private Employee; Resident MD; Less than 6 months; 901 to 1200; 2009; 2017–2018
Very High Relationship; High Moderate; Satisfactory; Low Satisfactory; Unsatisfactory; None; Surgery; Socioemployment Network; Public Employee; Rural MD; Less than 6 months; 601 to 900; 2009; 2017–2018
Moderate Relationship; High Moderate; Satisfactory; Very Satisfactory; Master’s Degree; Higher Medical Education; Personal References; Independent Professional; Private MD; 1 to 2 years; 601 to 900; 2009; 2020
High Relationship; High Moderate; Very Satisfactory; None; Pediatrics; Personal References; Public Employee; Rural MD; 6 months to 1 year; 601 to 900; 2009; 2017–2018
Very High Relationship; High Moderate; Satisfactory; None; Surgery; Socioemployment Network; Public Employee; Rural MD; 6 months to 1 year; 901 to 1200; 2009; 2017–2018
Moderate Relationship; High Moderate; Satisfactory; Low Satisfactory; Satisfactory; None; Pediatrics; Personal References; Unemployed; None; 2009; 2017–2018
Very High Relationship; High; Very Satisfactory; Master’s Degree; Surgery; Personal References; Private Employee; Other; 1 to 2 years; 601 to 900; 2009; 2019
Very High Relationship; High; Satisfactory; Very Satisfactory; Satisfactory; None; Higher Medical Education; UTMACH Job Board; Public Employee; Rural MD; 6 months to 1 year; 901 to 1200; 2009; 2017–2018
High Relationship; High Moderate; Satisfactory; Low Satisfactory; None; Higher Medical Education; Personal References; Unemployed; None; 2009; 2019
Very High Relationship; High Moderate; Very Satisfactory; Satisfactory; None; Surgery; Press_Radio_TV; Unemployed; None; 2009; 2017–2018
Very High Relationship; High; Very Satisfactory; Satisfactory; None; Surgery; Socioemployment Network; Public Employee; Rural MD; 6 months to 1 year; 901 to 1200; 2009; 2017–2018
Very High Relationship; High; Very Satisfactory; None; Higher Medical Education; Socioemployment Network; Public Employee; Rural MD; 6 months to 1 year; 901 to 1200; 2009; 2017–2018
Very High Relationship; High Moderate; Very Satisfactory; Master’s Degree; Surgery; Personal References; Private Employee; Administrative; 6 months to 1 year; 1201 to 1500; 2009; 2017–2018
Very High Relationship; High Moderate; Satisfactory; Master’s Degree; Gynecology and Obstetrics; Personal References; Private Employee; Resident MD; 6 months to 1 year; 601 to 900; 2009; 2017–2018
High Relationship; Moderately; Low Satisfactory; Satisfactory; Low Satisfactory; None; Surgery; Personal References; Unemployed; None; 2009; 2017–2018
Very High Relationship; High; Very Satisfactory; None; Surgery; Socioemployment Network; Public Employee; Rural MD; 6 months to 1 year; 901 to 1200; 2010; 2017–2018
Very High Relationship; High Moderate; Satisfactory; None; Clinical; Personal References; Public Employee; Rural MD; 6 months to 1 year; 601 to 900; 2010; 2017–2018
Very High Relationship; High; Very Satisfactory; None; Surgery; Socioemployment Network; Private Employee; Resident MD; Less than 6 months; 901 to 1200; 2010; 2017–2018
High Relationship; High Moderate; Satisfactory; None; Pediatrics; Socioemployment Network; Public Employee; Resident MD; 6 months to 1 year; 1201 to 1500; 2010; 2019
Very High Relationship; High Moderate; Satisfactory; None; Pediatrics; Personal References; Public Employee; Rural MD; 6 months to 1 year; 601 to 900; 2010; 2017–2018
Very High Relationship; High; Satisfactory; None; Surgery; Socioemployment Network; Public Employee; Rural MD; 6 months to 1 year; 601 to 900; 2010; 2017–2018
Very High Relationship; High; Very Satisfactory; Specialization; Surgery; Socioemployment Network; Public Employee; Rural MD; 6 months to 1 year; 901 to 1200; 2010; 2017–2018
Very High Relationship; Moderately; Low Satisfactory; Very Satisfactory; Low Satisfactory; None; Higher Medical Education; Socioemployment Network; Public Employee; Rural MD; 6 months to 1 year; 601 to 900; 2010; 2017–2018
High Relationship; Moderately; Satisfactory; None; Community Health; Socioemployment Network; Public Employee; Rural MD; 6 months to 1 year; 601 to 900; 2010; 2017–2018
Very High Relationship; High Moderate; Very Satisfactory; Satisfactory; Very Satisfactory; Satisfactory; None; Gynecology and Obstetrics; Personal References; Private Employee; Resident MD; Less than 6 months; 901 to 1200; 2010; 2017–2018

Appendix D

Variables and categories from the CMSG database.

Variables	Description	Categories
Grad_year	It serves as a table classifier within the database, dividing the dataset into groups corresponding to the four periods of study.	2017–2018. 2019. 2020. 2021.
Work Relationship and Professional Profile	This ordinal categorical variable communicates the degrees of association between the most relevant professional activities undertaken by UTMACH medical graduates and their professional profile, understood as a combination of essential knowledge, skills, and values necessary for medical practice within the country.	Very High Relationship. High Relationship. Medium Relationship. Medium Low Relationship. Low Relationship.
Satisfaction with Knowledge and Skills Acquired	It measures the level of satisfaction among graduates regarding the knowledge and skills acquired during their professional training process.	High. Medium High. Medium. Medium Low. Low.
Satisfaction Curricular Mesh	It assesses the level of satisfaction among graduates regarding the curriculum they studied during their academic journey.	Very Satisfactory. Satisfactory. Somewhat Satisfactory. Unsatisfactory.
Satisfaction with Learning Strategies	This variable evaluates the level of satisfaction among graduates regarding the teaching and learning strategies applied during their academic processes throughout their formative period.	Very Satisfactory. Satisfactory. Somewhat Satisfactory. Unsatisfactory.
Satisfaction with Formative Research	This variable evaluates the level of satisfaction among graduates regarding research processes conducted to promote and enrich student learning and training.	Very Satisfactory. Satisfactory. Somewhat Satisfactory. Unsatisfactory.
Satisfaction with Application of Research - Community Engagement	This variable measures the level of satisfaction among graduates regarding the application of research processes in programs and projects connected with society during their professional training stage.	Very Satisfactory. Satisfactory. Somewhat Satisfactory. Unsatisfactory.
Satisfaction with dissemination of Research Results	It measures the level of satisfaction among graduates regarding the dissemination and communication of research findings, results, and conclusions to the academic and scientific community, as well as to the general public, during their professional training stage.	Very Satisfactory. Satisfactory. Somewhat Satisfactory. Unsatisfactory.
Type of Postgraduate Studies	This nominal categorical variable identifies graduates who are pursuing postgraduate programs.	None. Diploma. Master’s. Specialization.
Interest in Specialty	This variable complements the previous one, determining the interest of graduates in a specific area for postgraduate studies.	Surgery. Clinical. Higher Medical Education. Gynecology-Obstetrics. Pediatrics. Community Health.
Strategy to Secure Employment	This variable is directed towards the analysis of occupational demand and professional fields for graduates, to provide a clear and updated vision of employment prospects and possibilities for professional development in the field of health.	UTMACH Job Exchange. Press_Radio_TV. Personal Preferences. Socioemployment Network. Social Networks.
Employment Situation	This variable determines what the graduate is currently doing for work.	Independent Professional. Public Employee. Private Employee. Unemployed.
Hierarchical Job Level	The position and responsibility that an employee has within the organizational structure of a company or institution.	Administrative. Private Physician (MD). Resident MD. Rural MD. Treating MD. Other. None.
Tenure at work	This quantifies the elapsed time since an individual embarked on a particular job.	More than 2 years. 1 to 2 years. 6 months to 1 year. Less than 6 months. None.
Monthly income	This signifies the monetary earnings received in a single month.	2101 to 2400. 1801 to 2100. 1501 to 1800. 1201 to 1500. 901 to 1200. 601 to 900. 376 to 600. Less than 375. None.
Cohort	This refers to a group of graduates who enrolled together in the same academic course in a specific year and are tracked throughout their educational journey to analyze their performance, achievements, or behavior.	Prior to 2009. 2009. 2010. 2011. 2012. 2013. 2014.

References

1. Gutiérrez, H.; de la Vara Salazar, R. Control Estadístico de la Calidad y Seis Sigma; McGraw Hill Education: New York, NY, USA, 2013; Volume 3, pp. 152–253.
2. Ruiz-Barzola, O. Gráficos de Control de Calidad Multivariantes con Dimension Variable. Ph.D. Thesis, Universitat Politècnica de València, Valencia, Spain, 2013.
3. Montgomery, D.C. Statistical Quality Control; Wiley Global Education: Hoboken, NJ, USA, 2012.
4. Ramos, M. Una Alternativa a los Métodos Clásicos de Control de Procesos Basada en Coordenadas Paralelas, Métodos Biplot y Statis. Ph.D. Thesis, University of Salamanca, Salamanca, Spain, 2017.
5. Li, J.; Tsung, F.; Zou, C. Directional control schemes for multivariate categorical processes. J. Qual. Technol. 2012, 44, 136–154.
6. Hotelling, H. Multivariate quality control. In Techniques of Statistical Analysis; McGraw-Hill: New York, NY, USA, 1947.
7. Lowry, C.A.; Woodall, W.H.; Champ, C.W.; Rigdon, S.E. A multivariate exponentially weighted moving average control chart. Technometrics 1992, 34, 46–53.
8. Roberts, S. Control chart tests based on geometric moving averages. Technometrics 2000, 42, 97–101.
9. Crosier, R.B. Multivariate Generalizations of Cumulative Sum Quality-Control Schemes. Technometrics 1988, 30, 291–303.
10. Page, E. Continuous inspection schemes. Biometrika 1954, 41, 100–115.
11. Aparisi, F. Hotelling’s T2 control chart with adaptive sample sizes. Int. J. Prod. Res. 1996, 34, 2853–2862.
12. Aparisi, F.; Haro, C.L. Hotelling’s T2 control chart with variable sampling intervals. Int. J. Prod. Res. 2001, 39, 3127–3140.
13. Faraz, A.; Parsian, A. Hotelling’s T2 control chart with double warning lines. Stat. Pap. 2006, 47, 569–593.
14. Shabbak, A.; Midi, H. An improvement of the hotelling statistic in monitoring multivariate quality characteristics. Math. Probl. Eng. 2012, 2012, 531864.
15. Liu, Y.; Liu, Y.; Jung, U. Nonparametric multivariate control chart based on density-sensitive novelty weight for non-normal processes. Qual. Technol. Quant. Manag. 2020, 17, 203–215.
16. Xue, L.; Qiu, P. A nonparametric CUSUM chart for monitoring multivariate serially correlated processes. J. Qual. Technol. 2020, 53, 396–409.
17. Mahalanobis, P. On the generalised distance in statistics. Proc. Natl. Inst. Sci. India 1936, 12, 49–55.
18. Tuerhong, G.; Kim, S.B. Gower distance-based multivariate control charts for a mixture of continuous and categorical variables. Expert Syst. Appl. 2014, 41, 1701–1707.
19. Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinburgh Dublin Philos. Mag. J. Sci. 1901, 2, 559–572.
20. Gabriel, K.R. The biplot graphic display of matrices with application to principal component analysis. Biometrika 1971, 58, 453–467.
21. Galindo-Villardón, P.; Vicente-Villardón, J.; Zarza, C.A.; Fernandez-Gómez, M.J.; Martín, J. JK-META-BIPLOT: Una alternativa al método STATIS para el estudio espacio temporal de ecosistemas. In Proceedings of the Conferencia Internacional de Estadística en Estudios Medioambientales, Cádiz, Spain, 21–23 November 2001.
22. Benzecri, J. L’analyse des Correspondances. En L’Analyse des Données: Leçons sur L’analyse Factorielle et la Reconnaissance des Formes et Travaux; Dunod: Paris, France, 1973.
23. Des Plantes, L. Structuration des Tableaux à trois Indices de la Statistique. Ph.D. Thesis, Universite des Sciences et Techniques du Languedoc, Montpellier, France, 1976.
24. Robert, P.; Escoufier, Y. A unifying tool for linear multivariate statistical methods: The RV-coefficient. J. R. Stat. Soc. Ser. C Appl. Stat. 1976, 25, 257–265.
25. Lavit, C. Présentation de la méthode STATIS permettant l’analyse conjointe de plusieurs tableaux de données quantitatives. Les Cah. Rech. Dev. 1988, 18, 49–60.
26. Inselberg, A.; Dimsdale, B. Parallel coordinates: A tool for visualizing multi-dimensional geometry. In Proceedings of the First IEEE Conference on Visualization: Visualization’90, San Francisco, CA, USA, 23–26 October 1990; pp. 361–378.
27. Edwards, A.; Cavalli-Sforza, L. A method for cluster analysis. Biometrics 1965, 21, 362–375.
28. Filho, D.; de Oliveira, L. Multivariate quality control of batch processes using STATIS. Int. J. Adv. Manuf. Technol. 2016, 82, 867–875.
29. Ramos-Barberán, M.; Hinojosa-Ramos, M.V.; Ascencio-Moreno, J.; Vera, F.; Ruiz-Barzola, O.; Galindo-Villardón, M.P. Batch process control and monitoring: A Dual STATIS and Parallel Coordinates (DS-PC) approach. Prod. Manuf. Res. 2018, 6, 470–493.
30. Ahsan, M.; Mashuri, M.; Kuswanto, H.; Prastyo, D.D.; Khusna, H. Multivariate control chart based on PCA mix for variable and attribute quality characteristics. Prod. Manuf. Res. 2018, 6, 364–384.
31. Ahsan, M.; Mashuri, M.; Wibawati; Khusna, H.; Lee, M.H. Multivariate Control Chart Based on Kernel PCA for Monitoring Mixed Variable and Attribute Quality Characteristics. Symmetry 2020, 12, 1838.
32. Ahsan, M.; Mashuri, M.; Khusna, H. Comparing the performance of Kernel PCA Mix Chart with PCA Mix Chart for monitoring mixed quality characteristics. Sci. Rep. 2022, 12, 15723.
33. Ahsan, M.; Mashuri, M.; Kuswanto, H.; Prastyo, D.D.; Khusna, H. Outlier detection using PCA mix based T2 control chart for continuous and categorical data. Commun. Stat.-Simul. Comput. 2021, 50, 1496–1523.
34. Farokhnia, M.; Niaki, S.T.A. Principal component analysis-based control charts using support vector machines for multivariate non-normal distributions. Commun. Stat.-Simul. Comput. 2020, 49, 1815–1838.
35. Holgate, P. Estimation for the bivariate Poisson distribution. Biometrika 1964, 51, 241–287.
36. Chiu, J.E.; Kuo, T.I. Attribute control chart for multivariate Poisson distribution. Commun. Stat.-Theory Methods 2007, 37, 146–158.
37. Lee, L.H.; Costa, A.F.B. Control charts for individual observations of a bivariate Poisson process. Int. J. Adv. Manuf. Technol. 2009, 43, 744–755.
38. Laungrungrong, B.; Borror, C.M.; Montgomery, D.C. EWMA control charts for multivariate Poisson-distributed data. Int. J. Qual. Eng. Technol. 2011, 2, 185–211.
39. Epprecht, E.K.; Aparisi, F.; García-Bustos, S. Optimal linear combination of Poisson variables for multivariate statistical process control. Comput. Oper. Res. 2013, 40, 3021–3032.
40. Lu, X. Control chart for multivariate attribute processes. Int. J. Prod. Res. 1998, 36, 3477–3489.
41. Ranjan-Mukhopadhyay, A. Multivariate attribute control chart using Mahalanobis D2 statistic. J. Appl. Stat. 2008, 35, 421–429.
42. Taleb, H.; Limam, M.; Hirota, K. Multivariate fuzzy multinomial control charts. Qual. Technol. Quant. Manag. 2006, 3, 437–453.
43. Taleb, H. Control charts applications for multivariate attribute processes. Comput. Ind. Eng. 2009, 56, 399–410.
44. Pastuizaca-Fernández, M.N.; Carrión-García, A.; Ruiz-Barzola, O. Multivariate multinomial T2 control chart using fuzzy approach. Int. J. Prod. Res. 2015, 53, 2225–2238.
45. Saltos Segura, G.; Flores Sánchez, M.; Horna Huaraca, L.; Morales Quinga, K. New methodologies applied to multivariate monitoring of student performance using control charts and threshold systems. Perfiles 2020, 1, 68–74.
46. López, C.P. Técnicas de Análisis Multivariante de Datos; Pearson Educación: London, UK, 2004.
47. Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 417–441.
48. Spearman, C. General intelligence objectively determined and measured. Am. J. Psychol. 1904, 15, 201–293.
49. Thurstone, L.L. Multiple-Factor Analysis: A Development and Expansion of the Vectors of Mind; University of Chicago Press: Chicago, IL, USA, 1947.
50. Kaiser, H. The varimax criterion for analytic rotation in factor analysis. Psychometrika 1958, 23, 187–200.
51. Curran, J.; Hersh, T. Hotelling: Hotelling’s T2 Test and Variants, R Package Version 1.0-8. 2021. Available online: https://cran.r-project.org/web/packages/Hotelling/Hotelling.pdf (accessed on 9 September 2021).
52. Scrucca, L. qcc: An R package for quality control charting and statistical process control. R News 2004, 4/1, 11–17.
53. Vicente-Villardón, J. MULTBIPLOT: A Package for Multivariate Analysis Using Biplots; Computer Software; Departamento de Estadística, Universidad de Salamanca: Salamanca, Spain, 2010; Available online: https://www.researchgate.net/publication/263442299_MULTBIPLOT_A_package_for_multivariate_analysis_using_biplots (accessed on 1 January 2020).
54. Thioulouse, J.; Dray, S. Interactive multivariate data analysis in R with the ade4 and ade4TkGUI packages. J. Stat. Softw. 2007, 22, 1–14.
55. Bougeard, S.; Dray, S. Supervised multiblock analysis in R with the ade4 package. J. Stat. Softw. 2018, 86, 1–17.
56. Lê, S.; Josse, J.; Husson, F. FactoMineR: An R package for multivariate analysis. J. Stat. Softw. 2008, 25, 1–18.
57. Cubilla-Montilla, M.; Nieto-Librero, A.; Galindo-Villardón, P.; Torres-Cubilla, C. Sparse HJ biplot: A new methodology via elastic net. Mathematics 2021, 9, 1298.
58. Nieto-Librero, A. Package ‘BiplotbootGUI’ 2015. Available online: http://cran.nexr.com/web/packages/biplotbootGUI/index.html (accessed on 30 July 2019).
59. Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 2020, 17, 261–272.
60. Seabold, S.; Perktold, J. statsmodels: Econometric and statistical modeling with python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28 June–3 July 2010.
61. Jiménez, A.R.; Jacinto, A.P. Métodos Científicos de Indagação e de Construção do Conhecimento. In Revista Ean. 2017, pp. 179–200. Available online: https://www.passeidireto.com/arquivo/105478374/metodos-cientificos-de-indagacao-e-de-construcao-do-conhecimento (accessed on 21 December 2016).
62. Martínez, J.C.; Almó, E.; Beatriz, O.G.D. Guía para la revisión y el análisis documental: Propuesta desde el enfoque investigativo. Ximhai Rev. Cient. Soc. Cult. Desarro. Sosten. 2023, 19, 67–83.
63. Nenadic, O.; Greenacre, M. Correspondence analysis in R, with two-and three-dimensional graphics: The ca package. J. Stat. Softw. 2007, 20, 1–13.
64. Ledesma, R. Software de análisis de correspondencias múltiples: Una revisión comparativa. Metodol. Encuestas 2008, 10, 59–75.
65. Michailidis, G.; Leeuw, J.D. The Gifi system of descriptive multivariate analysis. Stat. Sci. 1998, 13, 307–336.
66. Escofier, B.; Pagès, J. Multiple factor analysis (AFMULT package). Comput. Stat. Data Anal. 1994, 18, 121–140.
67. Rousseeuw, P.J.; Hubert, M. Robust statistics for outlier detection. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2011, 1, 73–79.
68. Gneri, M.; Pimentel-Barbosa, E. Robustez asintótica de la estadística de Hotelling. Rev. Educ. Mat. 2012, 27, 28–36.
69. Rojas-Preciado, W.; Rojas-Campuzano, M.; Galindo-Villardón, P.; Ruiz-Barzola, O. T2Qv: Control Qualitative Variables, R Package Version 0.1.0. 2022. Available online: https://cran.r-project.org/web/packages/T2Qv/index.html (accessed on 18 May 2022).
70. Hoffman, D.; Leeuw, J.D. Interpreting multiple correspondence analysis as a multidimensional scaling method. Mark. Lett. 1992, 3, 259–272.
71. Soetaert, K. plot3D: Plotting Multi-Dimensional Data, R Package Version 1.4. 2021. Available online: https://cran.r-project.org/web/packages/plot3D/index.html (accessed on 22 May 2021).
72. Ali, M.R.; Aslam, M. Design of control charts for multivariate Poisson distribution using generalized multiple dependent state sampling. Qual. Technol. Quant. Manag. 2019, 16, 629–650.
73. Jiang, W.; Au, S.; Tsui, K.L.; Xie, M. Process Monitoring with Univariate and Multivariate c-Charts; Technical Report; The Logistics Institute, Georgia Tech, and the Logistics Institute-Asia Pacific: Singapore, 2002.
74. Pignatiello, J.; Runger, G. Comparisons of multivariate CUSUM charts. J. Qual. Technol. 1990, 22, 173–186.
75. Soriano, E. Estudio de la Influencia de la Fase I en el Desempeño de la Fase II en el Gráfico T2 de Hotelling. 2017. Available online: https://riunet.upv.es/handle/10251/89552 (accessed on 22 May 2021).
76. Escoufier, Y. Objectifs et procédures de l’analyse conjointe de plusieurs tableaux de données. Stat. Anal. Donnees 1985, 10, 1–10.
77. Caballero-Juliá, D.; Villardón, M.P.G.; García, M.C. JK-Meta-Biplot y STATIS Dual como herramientas de análisis de tablas textuales múltiples. Rev. Iber. Sist. Tecnol. Inf. 2017, 25, 18–33.
Qiu, P. Statistical process control charts as a tool for analyzing big data. In Big and Complex Data Analysis: Methodologies and Applications; Springer: Berlin/Heidelberg, Germany, 2017; pp. 123–138. [Google Scholar]
Qiu, P. Big data? Statistical process control can help! Am. Stat. 2020, 74, 329–344. [Google Scholar] [CrossRef]
Tran, P.H.; Nadi, A.A.; Nguyen, T.H.; Tran, K.D.; Tran, K.P. Application of machine learning in statistical process control charts: A survey and perspective. In Control Charts and Machine Learning for Anomaly Detection in Manufacturing; Springer: Berlin/Heidelberg, Germany, 2022; pp. 7–42. [Google Scholar]

Figure 1. Procedure of MCA for K tables.

Figure 2. Scheme of the process for obtaining median vectors.

Figure 3. T2Qv Control Chart.

Figure 4. Multivariate control chart T2 Hotelling applied to qualitative variables, Datak10Contaminated.

Figure 5. Plot of the MCA for the Concatenated Table and the Point Table, Datak10Contaminated.

Figure 6. Multiple Correspondence Analysis applied to the concatenated table.

Figure 7. Multiple Correspondence Analysis applied to Table b.

Figure 8. Chi-squared distance between the masses of the concatenated table and the K tables, Datak10Contaminated.

Figure 9. Distribution of the categories of variables V03, V01, and V06 in the concatenated table and table j in the T2Qv application.

Figure 10. T2Qv graph applied to the CMSG database.

Figure 11. MCA plot of the concatenated table, CMSG.

Figure 12. MCA plot of the 2021 table, CMSG.

Figure 13. $χ^{2}$ distance between the masses of the concatenated table and the 2021 table.

Figure 14. Distribution of the categories of the variable 'Job Hierarchical Level' in the Concatenated Table and the Point Table in the T2Qv application.

Figure 15. Distribution of the categories of the variable 'Relationship between the world of work and the Professional Profile' in the Concatenated Table and the 2021 table in the T2Qv application.

Figure 16. Sensitivity, contour, and response surface plots obtained by measuring the behavior of the T2Qv chart.

Table 1. Algebraic elements.

Elements	Representation	Example
Scalars	Lowercase letters	$v, λ$
Vectors	Lowercase bold letters	$\mathbf{v}, \mathbf{u}$
Matrices	Uppercase bold letters	$\mathbf{V}, \mathbf{X}$
Three-way matrices (Data cubes)	Uppercase letters with double stroke	$\mathbb{C}, \mathbb{X}$

Table 2. Notation.

Letter	Meaning	Specification
p	Number of dimensions
K	Total number of tables	Specifies the depth of the data cube
k	Table index	k = 1, 2, …, K
T	Transpose matrix index	$X^{T}$
n	Sample size of the k tables

Table 3. Functions of the T2Qv package.

Function	Description
T2_qualitative	Hotelling T2 multivariate control chart for qualitative variables.
MCAconcatenated	Multiple correspondence analysis applied to the concatenated table.
MCApoint	Multiple correspondence analysis applied to a specific table.
ChiSq_variable	Returns the chi-square distance between the column masses of the table specified in PointTable and those of the concatenated table. It helps identify which modality (category) is responsible for the anomaly in that table.
Full_Panel	A complete Shiny panel with the multivariate control chart for qualitative variables, the two MCA charts, and the modality distance table. Within the dashboard, arguments such as the type I error and the dimensionality can be modified.

Table 4. Section of the Datak10Contaminated database.

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	GroupLetter
Low	Medium	Medium	High	High	High	Low	Medium	Medium	Medium	a
Low	Low	High	Low	Medium	High	High	High	Low	High	a
High	Medium	High	Low	High	Medium	Medium	High	Medium	Low	a
Medium	Medium	Low	High	Low	Medium	High	Low	Low	High	a
Low	Low	Low	High	Low	High	High	High	Medium	Medium	a
High	High	Medium	Low	High	Low	Medium	Medium	High	Low	a
High	High	Low	Low	Low	Medium	High	Medium	Medium	High	a
Medium	Medium	High	Medium	Medium	High	Medium	High	High	High	a
Low	Low	Low	Medium	High	Medium	Low	Medium	Low	Low	a
Medium	Medium	Medium	High	Low	Medium	High	Low	High	Medium	a

Table 5. Comparison of the distribution of the categories in the Datak10Contaminated table with the theoretical uniform distribution.

Categories	Uniform Theoretical	Mean of the Distributions of the Variables in Tables a, b, …, i	Mean of the Distribution of the Variables in Table j
High	0.333	0.340	0.724
Medium	0.333	0.336	0.092
Low	0.333	0.324	0.184
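
As a rough worked check of Table 5, the sketch below computes a chi-squared distance between each mean category distribution and the theoretical uniform one. The form used, the sum of (observed − expected)² / expected, is the usual chi-squared distance between discrete distributions and is an assumption here; it is not necessarily the exact computation carried out inside the T2Qv package. The function name `chi2_distance` is hypothetical.

```python
# Values copied from Table 5, in the order High, Medium, Low.
uniform = [1 / 3, 1 / 3, 1 / 3]
tables_a_to_i = [0.340, 0.336, 0.324]
table_j = [0.724, 0.092, 0.184]


def chi2_distance(observed, expected):
    """Chi-squared distance between two discrete distributions (assumed form)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


d_in_control = chi2_distance(tables_a_to_i, uniform)
d_table_j = chi2_distance(table_j, uniform)

print(f"tables a..i vs uniform: {d_in_control:.4f}")
print(f"table j vs uniform:     {d_table_j:.4f}")
```

Under this assumed formula, the distance for Table j is several orders of magnitude larger than the one for Tables a to i, which matches the visual impression that only Table j departs from the uniform distribution.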

Table 6. Chi-square distance between the column masses of table k and the concatenated one, Datak10Contaminated.

Variables	ChiSq
V1	0.06968
V2	0.05010
V3	0.07601
V4	0.04982
V5	0.05205
V6	0.05603
V7	0.03713
V8	0.03702
V9	0.04395
V10	0.06179
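
To read Table 6 at a glance, the values can be sorted so that the variables with the largest chi-square distance come first. The following is only an illustrative sketch that uses the figures from Table 6; the names `chisq` and `ranked` are hypothetical, and the ranking is a reading aid rather than a function of the T2Qv package.

```python
# Chi-square distances copied from Table 6.
chisq = {
    "V1": 0.06968, "V2": 0.05010, "V3": 0.07601, "V4": 0.04982,
    "V5": 0.05205, "V6": 0.05603, "V7": 0.03713, "V8": 0.03702,
    "V9": 0.04395, "V10": 0.06179,
}

# Rank the variables from the largest to the smallest distance.
ranked = sorted(chisq.items(), key=lambda item: item[1], reverse=True)

for name, value in ranked:
    print(f"{name:>3}  {value:.5f}")
```

With these values, V3 has the largest distance and V8 the smallest.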

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
