1. Introduction
Multiple regression models in which the response variables are generated by distributions that belong to a family of parametric probability distributions have been widely used. The goal is to explain the behavior of a response variable, denoted here as The increasing ease with which data can be collected and stored allows for quick construction of databases of ever-increasing sizes. With large volumes of data available for study, patterns of greater complexity are being observed, forcing the search for more flexible probabilistic (regression) models to deal with such patterns. Examples of such complexities include asymmetry around central values, the presence of a high excess kurtosis, multimodality, and other aspects of separate subgroups with little variability. These anomalies, common in a large database, can be caused by the effect of latent variables or latent categories, which are variables or categories that are not observed or collected.
For example, consider the weight distribution of a certain animal species for which the sexes of the sampled units are not recorded. A possible difference between the average weights of males and females may lead to a bimodal distribution of frequencies. Other aspects not noted in the database may also be responsible for additional anomalies. Another common issue in establishing species profiles occurs when the variabilities of the two subgroups determine the corresponding profiles. If latent characteristics are not considered, marginal distributions of the response variables can produce anomalies such as those illustrated in
Figure 1—(a) bimodality with little variability between subgroups; (b) high asymmetry on the right with little difference between location parameters. This high asymmetry can be explained by the variability differences among subgroups.
Several methods of handling these kinds of behavior are described in the literature. It is common for analysts not to have access to features that may be causing these distortions in the target variable. Usually, it is assumed that the response variables have bimodal and/or skewed distributions. New overly complex probabilistic distributions are being developed to deal with these problems, such as in [
1,
2,
3]. When one uses these new distributions to explain such distortions, one is only considering the marginal distribution of the target variable, and not the possibility that they might be due to one or more additional features.
Mixture distributions, such as in [
4], are a promising alternative approach. The simplest way to use this kind of model in problems such as those in
Figure 1 is to consider the observations of the response variable to have been generated by two distinct distributions that although they belong to the same family of distributions, have different values between their respective parameters. Mathematically, this can be written as
where
p represents the proportion of observations generated by distribution
,
is from
, and
and
are parameter vectors. For instance, if both generators are considered to follow Gaussian distributions, then we have
and
. Some other well-known options for dealing with such behavior available in the literature are the class of mixture-of-expert models (MoE) [
5] and cluster-weighted models (cwm) [
6].
The present work provides another alternative approach in which clustering techniques are used early in the modeling process. The use of these tools has the following objectives: (i) identify different clusters in the data set from a possible latent explanatory variable (e.g., in a bimodal data set, two different clusters would be created, producing a new dummy variable, while a trimodal data set would lead to three clusters, generating a factor with three levels, and so on); and (ii) include this new covariate in a regression model to explain the specific behavior observed in the response, as in [
7]. In our approach, we use the GAMLSS framework, short for “generalized additive models for location, scale and shape” [
8] with simple distributions. As stated in [
9] the use of simple distributions with highly sophisticated regression-type models may return reliable and easily interpreted results.
The paper is organized as follows. In
Section 2, we briefly describe some clustering methods that could be used in the proposed methodology. In
Section 3, we incorporate clusters as latent variables into the GAMLSS framework. In
Section 4, we perform some simulation studies to validate the proposed modeling process. In
Section 5, four different applications are discussed. In the first two, no extra covariates are available, and we compare the proposed methodology to both mixture models and sophisticated distributions. In the others, authentic explanatory variables are also considered, and we compare our approach with cluster-weighted regression models and the mixture-of-experts model. Finally,
Section 6 contains concluding remarks.
3. Incorporating Clustering into GAMLSS
As highlighted in
Section 1, many different sophisticated statistical distributions have been proposed to deal with some specific patterns presented in the marginal distributions of the response
Y. In this paper, we use simpler distributions, modeling not only the mean (location) parameter
in terms of the latent variable (clusters) and any other known features (explanatory variables) through the GAMLSS (generalized additive models for location, scale, and shape) framework [
8], a generalization of both well-established generalized linear models (GLM) [
18] and generalized additive models (GAM) [
19].
Generically, let
, i.e.,
Y follows any distribution (that does not necessarily belong to the exponential family) with parameter vector
For an extensive list of distributions used in GAMLSS, see [
20]. Clustering GAMLSS (c-GAMLSS), which considers different clusters obtained through one of the methods described in
Section 2, can then be described as follows:
where
denotes an appropriate link function for parameter
usually determined by the range of the parameter,
denotes the number of explanatory variables related to the
rth parameter,
is a known
model matrix,
is a parameter vector of length
,
are smooth functions of
(e.g., p-splines [
21]),
denotes latent variables (clusters) and
is a parameter vector of length
. Note here that any and all of the parameters of the response distribution may be modeled as functions of a given set of explanatory and latent variables. As in the classical GAMLSS framework, it is possible to consider interactions between covariates and varying coefficient terms (as introduced in [
22]), where the varying coefficient function fits separate smooth curves against
X for each cluster [
23]. Finally, model (
2) can be seen as a flexible expert-network mixture of experts [
24]. However, in this approach, the mixing proportion does not depend on the covariates.
Due to the high flexibility of GAMLSS, there are multiple techniques that may be used to select and fit a suitable model for the response variable considering the available covariates. They are extensively described in [
23]. In the c-GAMLSS framework, we can use the following steps in the model-selection process:
Step 1: Select a clustering method (e.g., among the ones presented in
Section 2) to create different groups (estimating the number of clusters can be achieved using one of multiple available strategies. See, for instance, [
25]).
Step 2: Select subsets of authentic and latent variables for each of the parameters of the response-variable distribution, using any of the applied strategies for GAMLSS models [
23]. The most common procedure is a stepwise procedure called “Strategy A” [
23,
26,
27]).
Strategy A uses a goodness-of-fit measure to select the most suitable model to describe the data being studied. One may say that the Bayesian information criterion (BIC) [
28] would perform better than the Akaike information criterion (AIC) [
29] in the c-GAMLSS framework [
30,
31]. However, since we may consider smoothing functions within model (
2), BIC can lead to underfitting (i.e., oversmoothing) [
32]. A richer discussion regarding this topic can be seen in [
27]. Furthermore, as in MoE, an important issue to be addressed in the future is whether a given covariate may be (or not) considered in either or both of the above-mentioned steps.
If only latent variables (clusters) are used to explain the behavior of a given response, then model (
2) reduces to
which can be seen as a mixture model [
24] with no covariates affecting the mixing proportion. It is worth mentioning that the new coefficient
in (
3) is the intercept associated with each of the regression structures related to the first cluster obtained from any of the methods described in
Section 2. In model (
2), the intercept is the first element of each vector of parameters
As a straightforward example, letting
Y have a Gaussian distribution, with vector of parameters
in (
3), results in the following:
Note that we choose appropriate link functions for both parameters (identity and logarithm functions for
and
, respectively). See [
23] for more information.
Regarding the inference processes, all model parameters can be estimated by the penalized maximum-likelihood method through the Rigby and Stasinopoulos (RS) and/or Cole and Green (CG) algorithms described in [
8,
33]. As stated in [
23], both algorithms are stable and fast using simple starting values, such as constants.
Implementation of fitting a c-GAMLSS model (including or not extra covariates and considering different response-variable distributions) is easily achieved using a new generic function called
gamlss.cluster(), based on the
gamlss package [
33] for the
R software environment [
34] and available at
https://git.io/JtOZW (accessed on 1 July 2021).
4. Simulation
In this section, we conduct some Monte Carlo simulation studies to understand the behavior of the c-GAMLSS framework based on the Gaussian distribution and compare it to the usual Gaussian mixture model approach, considering four different scenarios (marginal response-variable shapes) where each is composed of two different clusters. All observations were generated in the
R software environment via the
rnorm() function. The averages (
) and standard deviations (
) for each cluster and scenario are reported in
Table 1 and the resulting empirical densities (considering both scenarios together) are displayed in
Figure 2.
For each scenario, we evaluate the maximum-likelihood estimates (MLEs) of the parameters for both approaches and then, after all replications, we compute the biases and mean squared errors (MSEs) based on the average estimates. In order to obtain the MLEs for the mixture models, we are using the
normalmixEM() function available in the
mixtools package for the
R software environment, which returns the best fitted model after five attempts. All results here are obtained from 1000 Monte Carlo replications. Here we are considering two different sample sizes for each cluster, 50 and 300, totaling
and
observations for each data/replication. Results are displayed in
Table 2.
Please note that we have five different parameters in
Table 2:
,
,
,
and
p. Regarding the mixture model, all parameters were already introduced in Equation (
1) and refer to the mean and standard deviation of the first cluster, mean and standard deviation of the second cluster and the proportion of observations that will be modeled by the first Gaussian distribution, respectively. However, we note here that we temporarily reparameterize the c-GAMLSS model in this subsection to compare the performance of the two approaches, since they will have the same interpretation. This reparameterization, based on model (
4), is given as follows:
,
,
,
and
p is the proportion of observations in the first cluster obtained through the best clustering method available in
Table 1 for each of the scenarios (Scenario A: k-means; Scenario B: Fuzzy; Scenario C: model-based; and Scenario D: model-based).
In Scenario A we may note that the c-GAMLSS method clearly performs better than the Gaussian mixture models for both sample sizes considered (nevertheless, as expected, biases and MSEs decrease for both methods in the greater sample). For Scenario B ( in each cluster), we may note that the absolute biases for the standard deviation ( and ) estimates are greater for the c-GAMLSS method; however, the MSEs are drastically smaller, indicating a better accuracy. Considering in each cluster, all results are clearly favorable to the new proposed alternative. In Scenario C, the MSEs obtained for by the mixture model are 6.67 and 21.99 times greater than the one returned by the c-GAMLSS method for and in each cluster, respectively. Finally, for Scenario D, the MSE returned by the Gaussian mixture when in each cluster for the standard deviation was almost 130 times greater than the one returned by c-GAMLSS. We can conclude that our proposed methodology clearly outperformed the already well-known Gaussian mixture model in all simulated scenarios.
5. Applications
In this section, we provide two applications comparing the adequacy of the c-GAMLSS framework, Gaussian mixture models, and some sophisticated bimodal distributions for marginally modeling the response variable (i.e., not considering any further explanatory variables), and another two applications where extra authentic covariates exist, comparing c-GAMLSS to the mixture-of-experts models (MoE; Gaussian parsimonious clustering models with covariates) and cluster-weighted models (cwm). In order to compare these approaches, both AIC and BIC statistics are calculated.
The flexible recently proposed four-parameter distributions considered in the first two applications, where only the target variable is available, are: the odd log-logistic exponentiated Gumbel (OLLEGu) [
1], extended generalized odd half-Cauchy log-logistic (EGOHC-LL) [
2], and exponentiated log-sinh Cauchy (ELSC) [
3] distributions. All parameter roles and their respective ranges from each of these distributions are described in
Table 3.
5.1. Actuarial Sciences Data
In the first application, we present a data set already studied in [
1], where the authors compare their new model with some of its sub-models and other alternative distributions. The data consist of the lifespans (in years) of 280 retired women covered by the Mexican public health-insurance system who had temporary disabilities and died during 2004. The data are more fully reported in [
35].
Table 4 contains the MLEs, their respective SEs, and the AIC and BIC statistics for all fitted models. The c-GAMLSS framework based on the Gaussian distribution clearly performs better than all other approaches, yielding the lowest AIC and BIC values (1788.35 and 1802.90, respectively). Please note that in this particular case, the latent variable (clusters) was considered only for the parameter
, i.e., both clusters present the same dispersion (
), and hence no estimates for
are displayed. This was achieved through the stepwise method (Strategy A), which also selected the k-means clustering method as the most appropriate to divide these data.
The Gaussian mixture model did not perform well when compared to the other approaches (
Table 4), returning the highest AIC and BIC values among all models (2116.10 and 2134.30, respectively), presenting a roughly unimodal curve in
Figure 3a. Moreover, the c-GAMLSS method was able to precisely identify two different clusters (red and blue curves).
Although the OLLEGu distribution was elected as the best one to describe these data according to [
1] when compared to some other models, we can see that the EGOHC-LL distribution, proposed two years earlier by [
2], returned better AIC and BIC statistics (2091.69 and 2106.23, respectively). Their fitted densities are displayed in
Figure 3b.
5.2. Ozone Data
The second application corresponds to daily ozone-level measurements (in ppb = ppm × 1000) collected in New York during 1973 and is available in [
36]. As with the previous application, we provide in
Table 5 the MLEs, their corresponding SEs and AIC and BIC statistics. Once again, the proposed methodology is the best alternative to explain the behavior of the data since it returned the lowest AIC and BIC values (936.18 and 947.78, respectively) and the Gaussian mixture approach performed poorly. In this application, the c-GAMLSS framework based on the Gaussian distribution uses the model-based clustering selected via Strategy A, where the generated latent variable was included in both regression structures (
and
).
These data were also analyzed in [
2], where the EGOHC distribution was the model selected when compared to some of its sub-models and other distributions. Nevertheless, we can see that the OLLEGu model returns better AIC and BIC values (1054.41 and 1065.28, respectively). Finally, all fitted densities are displayed in
Figure 4.
5.3. Two-Moon
The two-moon is a well-known synthetic data set introduced by [
37] which can be obtained in the
clusterSim package in the
R software environment, and is mainly used in studies regarding clustering methods. The generated data consist of two variables: the response
Y and an authentic covariate
X, both defined on the real numbers. The relationship between
Y and
X, as well as the two clusters (in red and black colors) are shown in
Figure 5. We may note a clear nonlinear relationship between response and the explanatory variable, and hence clustering algorithms that consider linear separations may not be suitable for this example. Here we are considering the cluster classification already available in the data set [
37]. Furthermore, this nonlinear relationship indicates that the use of smoothing functions in the regression structure would be appropriate, and the shape of such relationship may depend on each cluster, so considering an interaction between each cluster and the explanatory variable
X may also be necessary.
Now we compare c-GAMLSS based on the Gaussian distribution, considering the presence of an authentic explanatory variable, against three different models: (i) fitting a regression structure only for the location parameter, with a constant variance (reducing to a c-GLM); (ii) incorporating smoothing functions in this regression structure (reducing to a c-GAM); and (iii) considering a mixture-of-experts model (Gaussian parsimonious clustering models with covariates, obtained using the
MoEClust package for the
R software environment). We shall highlight here that due to the behavior observed in
Figure 5, in the c-GLM structure, we are considering a quadratic term and interactions between covariate
X and the clusters, whereas in both c-GAM and c-GAMLSS, we fit a varying coefficient term (pvc).
Table 6 displays all regression structures for each fitted model. c-GAMLSS presented the lowest AIC value (−279.4), whereas the c-GLM model returned the lowest BIC value (−202.7). We emphasize here that the BIC criterion highly penalizes models that consider smoothing functions to explain a given response-variable parameter. However, as can be seen in
Figure 6, such functions are quite important to explain the nonlinear effect of
X on each of the response-variable parameters (mean
and standard deviation
).
5.4. Eruption Data
In this last application we revisit the well-known Old Faithful Geyser data from Yellowstone National Park in Wyoming, USA [
38]. This data set contains 299 pairs of measurements regarding the times between the beginnings of successive eruptions, continuously collected over the first 15 days of August of 1985, and can be obtained in the
MASS package for the
R software environment. The explanatory variable
X, which represents eruption duration, can also be used here to explain the response variable.
Here, we apply the full model displayed in Equation (
2) to these data, disregarding the smoothing functions, i.e., the eruption duration
X is treated as a linear predictor of the c-GAMLSS model. Furthermore, in order to show the great flexibility of this approach, we now consider a more complex distribution (apart from the celebrated Gaussian) to explain the target variable. In the classical GAMLSS context, when
, one of the most-used distributions is the Box-Cox Cole and Green (BCCG), a three parameter distribution where
is the median,
is the approximate coefficient of variation and
is the skewness parameter. For further details regarding the BCCG, see [
20].
We compare the results obtained by the c-GAMLSS framework based on both the Gaussian and BCCG distributions with a cluster-weighted model (cwm) [
6], the estimates of which were obtained using the
cwm function in the
flexCWM package. All model structures and their respective AIC and BIC statistics are presented in
Table 7, where we see that c-GAMLSS based on the BCCG distribution provides a better fit than the other two alternatives since its AIC and BIC values are the smallest (1840.6 and 1870.2, respectively). A visual comparison of all models is provided in
Figure 7 (the graphical representation of c-GAMLSS based on both Gaussian and BCCG distributions overlap). Finally, we also present in
Figure 8 the term plot for each effect fitted, i.e., the effects of each explanatory variable on the model parameters.