Next Article in Journal
fsdaSAS: A Package for Robust Regression for Very Large Datasets Including the Batch Forward Search
Previous Article in Journal
Optimal Sampling Regimes for Estimating Population Dynamics
Previous Article in Special Issue
Bayesian Bandwidths in Semiparametric Modelling for Nonnegative Orthant Data with Diagnostics
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

A Flexible Multivariate Distribution for Correlated Count Data

1
Department of Mathematics and Statistics, Georgetown University, Washington, DC 20057, USA
2
Center for Statistical Research and Methodology, U. S. Census Bureau, Washington, DC 20233, USA
3
Department of Mathematics and Statistics, McMaster University, Hamilton, ON L8S 4K1, Canada
*
Author to whom correspondence should be addressed.
Submission received: 17 March 2021 / Revised: 10 April 2021 / Accepted: 11 April 2021 / Published: 15 April 2021
(This article belongs to the Special Issue Directions in Statistical Modelling)

Abstract

:
Multivariate count data are often modeled via a multivariate Poisson distribution, but it contains an underlying, constraining assumption of data equi-dispersion (where its variance equals its mean). Real data are oftentimes over-dispersed and, as such, consider various advancements of a negative binomial structure. While data over-dispersion is more prevalent than under-dispersion in real data, however, examples containing under-dispersed data are surfacing with greater frequency. Thus, there is a demonstrated need for a flexible model that can accommodate both data types. We develop a multivariate Conway–Maxwell–Poisson (MCMP) distribution to serve as a flexible alternative for correlated count data that contain data dispersion. This structure contains the multivariate Poisson, multivariate geometric, and the multivariate Bernoulli distributions as special cases, and serves as a bridge distribution across these three classical models to address other levels of over- or under-dispersion. In this work, we not only derive the distributional form and statistical properties of this model, but we further address parameter estimation, establish informative hypothesis tests to detect statistically significant data dispersion and aid in model parsimony, and illustrate the distribution’s flexibility through several simulated and real-world data examples. These examples demonstrate that the MCMP distribution performs on par with the multivariate negative binomial distribution for over-dispersed data, and proves particularly beneficial in effectively representing under-dispersed data. Thus, the MCMP distribution offers an effective, unifying framework for modeling over- or under-dispersed multivariate correlated count data that do not necessarily adhere to Poisson assumptions.

1. Introduction

There exists a rich history of research regarding multivariate discrete distributions [1]. Krishnamoorthy [2] introduced a multivariate binomial (MB) distribution for the d- dimensional vector B = ( B 1 , B 2 , , B d ) from a 2 d table with a factorial moment generating function (fmgf)
G B ( α ) = 1 + i = 1 d p i * α i + 1 i < j d p i j * α i α j + + p 12 d * α 1 α 2 α d n ,
where p i * is the probability of B i , i = 1 , , d ; p i j * denotes the probability of B i B j ; and so on. Utilizing this form, Krishnamoorthy [2] further introduced the multivariate Poisson (MP) distribution as the limiting distribution of the multivariate binomial distribution wherein all of the probabilities appearing in Equation (1) have order O ( 1 / n ) and n p * λ as n , where ∘ denotes the corresponding probability subscripts in Equation (1). Accordingly, the fmgf of the MP distribution for a random vector X = ( X 1 , X 2 , , X d ) is
G X ( α ) = exp i = 1 d λ i α i + 1 i < j d λ i j α i α j + + λ 12 d α 1 α 2 α d .
Mahamunulu [3] noted that the MP distribution can likewise be derived by defining X = ( X 1 , X 2 , . . . , X d ) as the sum of independent Poisson( A * ) random variables Y * , where ∗ denotes all subscripts involving = i { 1 , 2 , , d } with Y * distributed as Poisson( A * ), and X * as Poisson( μ * ) where μ * denotes the sum of the associated A * parameters with subsets j 1 , j 2 , , j s for s { 1 , 2 , , d } such that j 1 < j 2 < < j s . The corresponding joint probability generating function (pgf) has the form
Π X ( s ) = exp i = 1 d A i s i + 1 i < j d A i j s i s j + + A 12 d s 1 s 2 s d A ,
where A = i = 1 d A i + 1 i < j d A i j + + A 12 d [3,4]. From (3), it is evident that the variables X 1 , X 2 , ..., X d have marginal Poisson distributions, and it can be further shown that all pairs of variables X i ’s are positively correlated.
While the MP distribution is a popular model for describing correlated discrete random variables, it is well known that Poisson models are constrained by their underlying assumption of equi-dispersion; analogous negative binomial (NB) models serve as a popular alternative due to their ability to address data over-dispersion [5]. Doss [6] discussed a multivariate negative binomial (MNB) distribution with joint pgf
Π ( s ) = a 0 + i = 1 d a i s i + 1 i < j d a i j s i s j + + a 12 d s 1 s 2 s d k
for k > 0 . From (4), it is evident that the variables X 1 , X 2 , ..., X d have marginal NB distributions which are known to be over-dispersed. For this reason, the MNB distribution can only accommodate data over-dispersion; accordingly, correlated under-dispersed data structures are only at best fitted by a MP model where the associated model parameters will still be biased. Therefore, in this work, we introduce the reader to the Conway–Maxwell–Poisson (CMP) distribution and develop a multivariate CMP (MCMP) distribution as a flexible alternative distribution for modeling correlated discrete count data. Section 2 introduces the reader to the CMP distribution and its bivariate analog as motivation. Section 3 develops the MCMP distribution and discusses its associated properties, and also introduces approaches for parameter estimation and hypothesis testing. Section 4 demonstrates the model flexibility by means of simulated and real data examples. Finally, Section 5 concludes the manuscript with discussion, while the appendices contain more detailed derivations and the datasets referenced in this work.

2. Conway–Maxwell–Poisson Distribution

The CMP ( λ , ν ) distribution [7] has the probability mass function (pmf)
P ( Y = y λ , ν ) = λ y ( y ! ) ν Z ( λ , ν ) , y = 0 , 1 , 2 ,
for a random variable Y, where ν 0 is the dispersion parameter, λ = E ( Y ν ) generalizes the Poisson rate parameter, and Z ( λ , ν ) = j = 0 λ j ( j ! ) ν denotes the normalizing constant. Equi-dispersion relative to the Poisson distribution is represented when ν = 1 while data over-dispersion (under-dispersion) occurs when ν < ( > ) 1 . The CMP( λ , ν ) distribution contains three well-known distributions as special cases: Poisson with rate parameter λ when ν = 1 ; geometric with success probability 1 λ when ν = 0 and λ < 1 ; and Bernoulli with success probability λ 1 + λ when ν [8].
The distribution’s moments can be represented recursively as
E ( Y r + 1 ) = λ [ E ( Y + 1 ) ] 1 ν , r = 0 λ λ E ( Y r ) + E ( Y ) E ( Y r ) , r > 0 ;
in particular, its expected value and variance are
E ( Y ) = log Z ( λ , ν ) log λ λ 1 / ν ν 1 2 ν , and
Var ( Y ) = E ( Y ) log λ 1 ν λ 1 / ν ,
where the approximations provided in Equations (5) and (6) hold for ν 1 or λ > 10 ν [9,10]. Further, the CMP has the moment generating function (mgf) M Y ( t ) = E ( e Y t ) = Z ( λ e t , ν ) Z ( λ , ν ) and pgf E ( t Y ) = Z ( λ t , ν ) Z ( λ , ν ) .
Sellers et al. [11] construct a bivariate CMP model by means of the compounding method, wherein the joint conditional distribution of { ( X 1 , X 2 ) n } has a bivariate binomial distribution and the number of trials n is CMP( λ , ν ) distributed. The pmf of ( X 1 , X 2 ) is
P ( X 1 = x 1 , X 2 = x 2 ) = 1 Z ( λ , ν ) n = 0 λ n ( n ! ) ν × a = n x 1 x 2 n n a , n a x 2 , n a x 1 , x 1 + x 2 + a n p 00 a p 10 n a x 2 p 01 n a x 1 p 11 x 1 + x 2 + a n ,
where n a , n a x 2 , n a x 1 , x 1 + x 2 + a n is the multinomial coefficient, and it has the joint pgf
Π ( t 1 * , t 2 * ) = Z λ [ 1 + p 1 + ( t 1 * 1 ) + p + 1 ( t 2 * 1 ) + p 11 ( t 1 * 1 ) ( t 2 * 1 ) ] , ν Z ( λ , ν )
for some parameters, λ , ν , and probabilities p 00 , p 10 , p 01 , p 11 such that p 00 + p 10 + p 01 + p 11 = 1 , p i + = p i 0 + p i 1 for i = 0 , 1 , and p + j = p 0 j + p 1 j for j = 0 , 1 . This bivariate CMP distribution yields the three special bivariate cases that are achieved in their univariate analogs: for ν = 1 , the bivariate CMP distribution reduces to the bivariate Poisson [12,13]; when ν , we obtain the bivariate Bernoulli distribution [14]; and, for ν = 0 , λ < 1 , and λ { p 1 + ( t 1 * 1 ) + p + 1 ( t 2 * 1 ) + p 11 ( t 1 * 1 ) ( t 2 * 1 ) } < 1 , the bivariate CMP distribution reduces to a bivariate geometric model [11].

3. Multivariate Conway–Maxwell–Poisson Distribution

Generalizing the compounding approach in [11], we develop a convenient form for the MCMP distribution. Consider d random variables X = ( X 1 , X 2 , , X d ) that, given some number of trials n, jointly have a conditional MB distribution with pgf
Π ( t 1 * , , t d * | n ) = x 1 = 0 1 x 2 = 0 1 x d = 0 1 p x 1 x 2 x d i = 1 d ( t i * ) x i n
(Equation (37.71) of [1]), where n is a CMP ( λ , ν ) random variable. The compounding technique that is formulated as a CMP-stopped MB (i.e., where the MB index parameter is CMP distributed) can then be applied, resulting in the corresponding MCMP distribution’s pgf as   
Π ( t 1 * , , t d * ) = n = 0 λ n ( n ! ) ν Z ( λ , ν ) Π ( t 1 * , , t d * n ) = Z λ x 1 = 0 1 x d = 0 1 p x 1 x 2 x d i = 1 d ( t i * ) x i , ν Z ( λ , ν ) ,
where Z ( ψ , ν ) = s = 0 ψ s ( s ! ) ν for some ψ > 0 . Equation (8) contains 2 d + 2 parameters, but its degrees of freedom equals 2 d + 1 due to the restriction, x 1 = 0 1 x d = 0 1 p x 1 x 2 x d = 1 ; this adds difficulty in the determination of model parameter maximum likelihood estimates (MLEs). We circumvent this issue through the reparametrization, λ x 1 x 2 x d = λ p x 1 x 2 x d . Each variable is independent under this parameterization, where
λ = λ x 1 = 0 1 x d = 0 1 p x 1 x 2 x d = x 1 = 0 1 x d = 0 1 λ p x 1 x 2 x d = x 1 = 0 1 x d = 0 1 λ x 1 x 2 x d
and p x 1 x 2 x d = λ x 1 x 2 x d λ . For simplicity, we use λ to denote x 1 = 0 1 x d = 0 1 λ x 1 x 2 x d but recognize that λ is no longer an independent parameter in the ensuing discussion. The pgf of the MCMP distribution can now be parameterized as
Π ( t 1 * , , t d * ) = Z x 1 = 0 1 x d = 0 1 λ x 1 x 2 x d i = 1 d ( t i * ) x i , ν Z ( λ , ν ) .
As is the case of the univariate and bivariate CMP, this MCMP includes the MP, multivariate geometric, and multivariate Bernoulli distributions all as special cases, where ν maintains representation as the dispersion parameter. When ν = 1 , this MCMP pgf reduces to the form of the MP joint pgf (see Equation (3)). When ν = 0 and λ < 1 , its pgf becomes
Π ( t 1 * , , t d * ; ν = 0 , λ < 1 ) = a 0 + i = 1 d a i t i * + 1 i < j d a i j t i * t j * + + a 12 . . . d t 1 * t 2 * . . . t d * 1 ,
where a 0 = 1 λ 00 0 1 λ ; a i = λ 0 01 i 0 0 1 λ where 1 i denotes a 1 in the ith position, i = 1 , 2 , , d ; a i j = λ 0 01 i 0 01 j 0 0 1 λ where 1 i , 1 j denote 1s in the 1 i j d locations; ; a 12 d = λ 11 1 1 λ . This is the pgf of a multivariate geometric distribution (i.e., the MNB distribution pgf in Equation (4) with k = 1 ). Finally, when ν , this MCMP becomes a multivariate Bernoulli (i.e., the MB in [2] with n = 1 ) with p 00 0 * = 1 + λ 00 0 1 + λ and all remaining probabilities are p x 1 x 2 x d * = λ x 1 x 2 x d 1 + λ where at least one of x i equals 1, i = 1 , 2 , , d . More broadly, ν = 1 denotes the equi-dispersion case while ν < ( > ) 1 reflects data over-dispersion (under-dispersion), both for the joint distribution and the respective marginal distributions.
Given the joint pgf in Equation (8), this MCMP model has the joint mgf
M ( t 1 , t 2 , , t d ) = Π ( e t 1 , e t 2 , , e t d ) = Z x 1 = 0 1 x d = 0 1 λ x 1 x 2 x d e i = 1 d t i x i , ν Z ( λ , ν ) ,
and the joint fmgf as
G ( t 1 , t 2 , . . . , t d ) = Π ( t 1 + 1 , t 2 + 1 , . . . , t d + 1 ) = Z x 1 = 0 1 x d = 0 1 λ x 1 x 2 x d i = 1 d ( t i + 1 ) x i , ν Z ( λ , ν ) .
We derive the MCMP pmf by taking partial derivatives of the pgf, i.e.,
p ( x 1 , x 2 , , x d ) = 1 x 1 ! x 2 ! x d ! x 1 + x 2 + + x d t 1 x 1 t 2 x 2 t d x d Π ( t 1 , t 2 , , t d ) t 1 = t 2 = = t d = 0 ;
see Appendix A for pertinent details. Moving forward, we shall illustrate the MCMP results using the trivariate case as motivation, where the joint fmgf reduces to G ( t 1 , t 2 , t 3 ) from which we can obtain the moments and product moments, respectively; see Appendix B for all relevant details. These results confirm that the dispersion parameter ν denotes the type of data-dispersion for the joint and marginal distributions, and the correlations between any two random variables is non-negative with 0 ρ X i X j 1 for any variables i , j = 1 , 2 , 3 .

3.1. Parameter Estimation

We perform parameter estimation by the method of maximum likelihood (ML). Considering the trivariate case of Equations (9) and (12), there are nine parameters required to specify a trivariate CMP distribution, namely, λ x 1 x 2 x 3 for x i = 0 , 1 ; i = 1 , 2 , 3 ; and ν . Accordingly, the log-likelihood has the form
ln L ( λ 000 , , λ 111 , ν ; ( x 1 , x 2 , x 3 ) ) = ln j = 1 n p ( x 1 j , x 2 j , x 3 j ) = j = 1 n ln p ( x 1 j , x 2 j , x 3 j ) ,
where x i j denotes the jth observation in the ith data dimension, x i denotes the vector of the entire data set in the ith dimension; the precise form of p ( x 1 j , x 2 j , x 3 j ) is provided in Equation (A4) in Appendix A. The resulting score equations, however, do not have a closed form solution. For this reason, we carry out the statistical computations by using optimizing routines in R [15].
To perform the parameter estimation, we use the optim function where the negated form of the log-likelihood (Equation (13)) serves as the function to be optimized, and the L-BFGS-B method and its default convergence criteria are applied. Additionally, we approximate the standard errors of the estimated parameters by calculating the square root of the diagonal of the inverse Hessian matrix based on the approximate form obtained from optim. The complexity of the MCMP distribution, however, brings with it some computational difficulties when applying optim. The resulting MLE can vary considerably depending on the choice of starting values. To avert this, we consider several starting points including an exhaustive search in order to potentially improve the estimation result. Meanwhile, the resulting Hessian matrix provided from optim sometimes produces an inverse matrix containing negative diagonal elements; this violates the presumed positive semidefinite form of the Fisher information matrix. For these reasons, we recommend utilizing a parametric bootstrap method as an alternative approach for quantifying variability in the parameter estimates.

3.2. Hypothesis Testing

To check if a multivariate count data set suffers from any statistically significant data dispersion such that the MP distribution is unsuitable (favoring the MCMP distribution), we conduct the hypothesis test, H 0 : ν = 1 versus H 1 : ν 1 . We do not concern ourselves with the direction of the data dispersion because the MCMP distribution can accommodate both over- and under-dispersion. Nonetheless, the resulting statistical inference, along with the estimate for ν , offers guidance regarding the type of dispersion present in the data. We use the likelihood ratio test (LRT) statistic, Λ ν = 1 = sup ν = 1 ln L sup ln L , where sup ν = 1 ln L and sup ln L , respectively, denote the maximum log-likelihoods associated with the MP and MCMP models. Theoretically, 2 log Λ ν = 1 follows a χ 1 2 distribution and thus can be used to assess whether the data are reasonably distributed as a MP distribution, or if statistically significant dispersion exists such that it warrants using the MCMP model. In a similar vein, one can consider hypothesis tests, H 0 : ν = 0 or H 0 : ν (versus H 1 : otherwise) to determine whether the multivariate data satisfy a multivariate geometric or multivariate Bernoulli distribution, respectively; their associated LRTs have adjusted distributional forms based on a mixture involving χ 1 2 to account for being at the respective boundaries for ν [16].

4. Examples

This section considers various simulated and real data examples to illustrate the flexibility of the MCMP model. For the real data sets, we compare model performance via the respective log-likelihood and Akaike Information Criterion (AIC) values. We particularly consider Δ i = AIC i AIC min as introduced in Burnham and Anderson [17], where AIC i denotes the AIC associated with Model i, and AIC min is the minimum AIC among the considered models. [17] provides model support levels based on recommended Δ i ranges; see Table 1 for details.

4.1. Simulated Data

Here, we provide simulated data examples to illustrate the MCMP model’s ability to correctly distinguish the MP distribution. Without loss of generality, we proceed with the use of the trivariate case. To evaluate the robustness of the simulation process, we consider data simulations of size {100, 250, 500, 1000} and simulate data 500 times at each size level.
We first consider a simulated trivariate Poisson distribution where the joint pgf (Equation (3)) is defined with A 1 = 1.5 , A 2 = 2 , A 3 = 1.3 , A 12 = 1.2 , A 13 = 0.6 , A 23 = 0.7 and A 123 = 0.3 , and obtain the MLEs for the trivariate CMP under two conditions: the unconstrained case, and the restricted case where ν = 1 . The latter case serves to reflect the trivariate Poisson model with λ 100 = A 1 , λ 010 = A 2 , λ 001 = A 3 , λ 110 = A 12 , λ 101 = A 13 , λ 011 = A 23 , λ 111 = A 123 ; thus, the value of λ 000 no longer affects the model. Table 2 displays the proportion of 2 log Λ statistics that fall within the respective 95% or 99% confidence bounds across the simulations. As expected, the proportion of 2 log Λ values that are within the respective bounds, χ 1 2 ( 0.95 ) = 3.841 and χ 1 2 ( 0.99 ) = 6.635 , is quite close to their respective nominal levels, regardless of size level.
To assess the power of the test, H 0 : ν = 1 versus H 1 : ν 1 , we further generate data from the trivariate geometric ( a 1 = 0.3 , a 2 = 0.5 , a 3 = 0.7 , a 12 = 0.2 , a 13 = 1 , a 23 = 1.3 , a 123 = 0.1 ), the trivariate Bernoulli ( p 000 = 0.1 , p 001 = 0.07 , p 010 = 0.23 , p 100 = 0.15 , p 011 = 0.08 , p 101 = 0.11 , p 110 = 0.17 , p 111 = 0.09 ), and the trivariate CMP expressing over-dispersion ( λ 000 = 0.07 , λ 100 = 0.13 , λ 010 = 0.22 , λ 001 = 0.15 , λ 110 = 0.08 , λ 101 = 0.06 , λ 011 = 0.12 , λ 111 = 0.11 , ν = 0.5 ) and under-dispersion ( λ 000 = 2 , λ 100 = 0.8 , λ 010 = 1.4 , λ 001 = 1.7 , λ 110 = 2.2 , λ 101 = 1.3 , λ 011 = 0.6 , λ 111 = 0.9 , ν = 2 ), respectively. All simulation results obtained are presented in Table 3. As the generating distribution has a measure of dispersion that moves away from 1 (i.e., the data deviate from the Poisson), we see the power increase in both directions. Meanwhile, the power likewise increases with the sample size in association with all of the respective distributions.

4.2. Real Data: Corporación Favorita Grocery Sales

The Corporación Favorita grocery sales data [18] include information regarding the number of unit items sold daily to more than 4000 items in 35 different stores over a five-year period. To illustrate the MCMP distribution’s flexibility for describing real count data, we consider the unit sales regarding a particular item (Item ID:103665) over 100 days in each of three stores (Stores 1, 2 and 3, respectively); the data are provided in Table A2 in Appendix D. This dataset is over-dispersed due to the weekly and monthly periodic fluctuation; the number of sales often tend to be high at the beginning of each month as well as on weekends. Table 4 summarizes the results that stem from considering various trivariate models to describe the data, namely, the trivariate Poisson, trivariate NB [6], trivariate geometric, and trivariate CMP. For each of the assumed models, this table provides the respective MLEs, resulting log-likelihood, and AIC values.
Although the trivariate Poisson distribution has the least number of parameters (i.e., 7) among the models considered, it has the largest AIC (1748.5), suggesting its unsuitability for these data [17]. Meanwhile, the MLEs of the trivariate CMP model include ν ^ 0.25 , implying that the data are over-dispersed. The trivariate CMP produces a considerably smaller AIC (1627.9) relative to the trivariate Poisson; this further demonstrates the apparent data over-dispersion that should be addressed, but with Δ = 6.2 relative to the AIC from the trivariate NB, the trivariate CMP (while second best among the four considered models) still has model support that is “considerably less” than that of the trivariate NB ( AIC = 1621.7 ); this result is still substantially better than the difference between the trivariate NB and Poisson models ( Δ = 126.8 ), clearly inferring no support for the trivariate Poisson. Further, applying the trivariate CMP model introduces consideration of the trivariate geometric and NB models, respectively, as possible parsimonious models. The respective LRT statistics, 2 log Λ ν = 1 = 124.6 for the test H 0 : ν = 1 and 2 log Λ ν = 0 = 79.4 for H 0 : ν = 0 , both have p-values smaller than 0.005 which indicate that neither the trivariate Poisson nor the trivariate geometric fits the data well. Even still, ν ^ 0.25 serves as an indication of data over-dispersion, hence consideration of the general MNB distribution as a possible model.
Table 4 further shows that λ ^ 110 , λ ^ 101 , λ ^ 011 and λ ^ 111 for the CMP model are all 0; a similar situation appears on the estimation of the geometric and NB models, where a ^ 12 , a ^ 13 , a ^ 23 and a ^ 123 are also all 0. This indicates that there is no significant correlation within the data; this is true because the correlation coefficients between Stores 1 and 2, Stores 1 and 3, and Stores 2 and 3 are 0.15, 0.02, 0.21, respectively.
Figure 1 compares the marginal pmfs associated with each of the four models with the marginal relative frequencies associated with the number of unit sales for each of the three stores (Stores 1, 2, 3). These images show that the trivariate CMP and NB models produce very similar estimated marginal distributions with modes that are close to the observed mode, and have sufficiently wide tails to reflect the observed marginal frequencies, particularly for Store 2. Goodness-of-fit tests are likewise performed for comparing the aforementioned models to assess how well their marginal pmfs fit the marginal data frequencies. Following [19], we modify our observed frequencies by grouping observations greater than 8 on Store 1, greater than 9 on Store 2, and observations greater than 21 on Store 3. This allows the respective tail bins associated with each store to have a sufficiently large observed frequency to allow for the goodness-of-fit test to be conducted and the associated asymptotic chi-square distribution to be used. As a result, resulting statistics for the goodness-of-fit tests are expected to follow the chi-square distribution with 10, 11 and 23 degrees of freedom, respectively, for Stores 1, 2 and 3.
Table 5 summarizes the goodness-of-fit test statistics for each of the stores and models. While the trivariate geometric model best fits the Store 1 marginal distribution, the goodness-of-fit scores for the trivariate CMP and NB models are considerably better and outperform their peers for Stores 2 and 3. Table 5 confirms these assertions with χ 10 2 ( 0.95 ) = 18.3 , χ 11 2 ( 0.95 ) = 19.7 and χ 23 2 ( 0.95 ) = 35.2 , respectively, for Stores 1, 2 and 3; we again see that the geometric model fits the data better for Store 1, and the trivariate CMP and NB models produce closer fits for Stores 2 and 3.

4.3. Real Data: NBA All-Star

To demonstrate that the trivariate CMP can also be suitable for under-dispersed data, we consider data from the National Basketball Association (NBA) All-Star game rosters from 2000 to 2016 and seek to model the distribution of the number of players selected for the All-Star game each year in various positions [20]. For simplicity, we focus on the number of players that can play as Center (C), Forward (F), or Forward-center (FC); the data are provided in Table A3 of Appendix D. We again consider the trivariate CMP, the trivariate Poisson, and the trivariate NB distributions as possible models to describe this dataset. Table 6 contains a summary of the results including the respective MLEs, the resulting maximized log-likelihood, the number of free parameters, and the associated AIC for each of the three considered models.
The trivariate CMP model performs the best among the considered models, attaining a maximum likelihood equaling −68.2 and AIC min = 154.4 . The trivariate Poisson and NB models meanwhile produce respective AICs equaling 180.6 and 182.7 such that both respective difference values as defined in [17] (Table 1) are greater than 26, indicating no empirical support in favor of either model. The difference between the respective AIC values for the trivariate Poisson and NB models stems from the difference in the number of free parameters while they attain the same maximized log-likelihood value (−83.3). Neither of these models can accommodate data under-dispersion, and consequently the optimal trivariate NB distribution is that model which converges to the trivariate Poisson as k . Accordingly, the trivariate NB MLEs that best address data under-dispersion are those under the constraint of data equi-dispersion.
The trivariate CMP model successfully detects the data under-dispersion ( ν ^ = 38.4 1 ). In fact, such a large ν ^ suggests that we should consider modeling the data via a trivariate Bernoulli model. This would normally be true because the resulting CMP denominator includes ( m ! ) ν which becomes considerably large for m > 1 given large ν ^ . This makes p ( x , y , z ) vanish for any ( x , y , z ) such that at least one of the random variables exceeds 1. This is not the case here, however, because this data example likewise produces extremely large λ ^ estimates. Reviewing the raw data likewise suggests clearly that the trivariate Bernoulli distribution is not appropriate because there exist count data that are larger than 1, thus violating the multivariate Bernoulli structure. Therefore, the use of the trivariate CMP to analyze these data is duly justified.

5. Discussion

In this paper, we present a MCMP model that is developed via the compounding method. The distribution is established as a CMP-stopped multivariate binomial distribution, i.e., a multivariate binomial distribution where the associated index parameter is CMP distributed. Along with an introduction to this resulting distribution, we discussed its statistical properties which aid in better model interpretation. The CMP model can flexibly accommodate both over- and under-dispersed count data, and it includes the Poisson, Bernoulli, and geometric distributions all as special cases. Accordingly, the MCMP model serves as a reliable tool for model determination because it can successfully recognize these three multivariate special cases, and serve as an overarching distributional structure connecting them. One can determine if significant data dispersion exists by calculating the LRT statistic Λ discussed in Section 3.2, and analogous tests can be considered to determine whether the data effectively approximate either of the other two special case distributions (i.e., H 0 : ν = 0 or H 0 : ν , respectively). The MCMP distribution is particularly useful for modeling under-dispersed count data, as demonstrated through the simulated and real data examples.
A limitation of the MCMP model is that the correlation between any two of the d random variables comprising the MCMP is constrained to be non-negative, and so it may not be appropriate to consider this model to analyze multivariate count data containing negative correlations. This is true, however, of several multivariate discrete distributions, e.g., the [3] multivariate Poisson distribution. Meanwhile, this MCMP construct involves only one parameter ( ν ) to describe data dispersion. Hence, this MCMP model is suitable only for data with similar levels of data dispersion in each dimension, however the model can be broadened to allow for dynamic dispersion. Future work will seek to define a broader generalization of the MP distribution (or modification of this MCMP model) that allows for a broader range of correlation and possesses greater flexibility with regard to data dispersion. One proposed approach, for example, is to consider using copulas to develop a multivariate CMP distribution, as described in [21]. Though this is a standard method for multivariate continuous variables, its use for modeling multivariate count data has its own limitations, most importantly that copulas for discrete outcomes are not identifiable, especially when those discrete outcomes follow count distributions [21,22,23].
Table 4 demonstrates another limitation of the MCMP model, namely that it cannot accommodate as much data over-dispersion as the MNB; the MCMP distribution at best contains the multivariate geometric distribution (which is a special case of MNB). The MNB model, however, can be viewed as the convolution of independent and identically distributed (iid) multivariate geometric distributions. This convolution structure will then be able to capture greater over-dispersion. More broadly, the same idea can be used to consider a multivariate version of the sum of CMPs (MSCMP) model [24] as a generalization to accommodate broader dispersion, and use its trivariate form to revisit the Corporación Favorita grocery sales dataset. Unfortunately, due to computational issues, we were only able to perform parameter estimations for the trivariate SCMP model under the restriction m = 1 , 2 , 3 . Future work will further study the MSCMP model, for example, to determine how to optimally and directly compute the MSCMP pmf, and more efficiently determine the MLEs of model parameters. See Appendix C for more information about the MSCMP.

Author Contributions

Conceptualization (K.F.S.); Methodology (K.F.S., T.L., Y.W., N.B.); Formal analysis (K.F.S., T.L., Y.W., N.B.); Investigation (K.F.S., T.L., Y.W., N.B.); Software (K.F.S., T.L., Y.W.); Supervision (K.F.S.); Writing—original draft preparation (K.F.S., T.L., Y.W., N.B.); Writing—review and editing (K.F.S., T.L., Y.W., N.B.). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the appendix.

Acknowledgments

The authors thank Richard Sellers for insightful discussions that aided in better comprehension and understanding of the NBA data set. This paper is released to inform interested parties of research and to encourage discussion. The views expressed are those of the authors and not necessarily those of the U.S. Census Bureau.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MBmultivariate binomial
fmgffactorial moment generating function
MPmultivariate Poisson
pgfprobability generating function
NBnegative binomial
MNBmultivariate negative binomial
CMPConway–Maxwell–Poisson
MCMPmultivariate Conway–Maxwell–Poisson
mgfmoment generating function
MLEsmaximum likelihood estimates
pmfprobability mass function
MLmaximum likelihood
LRTlikelihood ratio test
AICAkaike Information Criterion
MLEmaximum likelihood estimate
NBANational Basketball Association
CCenter
FForward
FCForward-center
sCMPsum of CMPs
MSCMPmultivariate version of the sum of CMPs

Appendix A. Deriving the Probability Mass Function

In order to derive the general form of the pmf, we first introduce some notation. Assuming a MCMP distribution with d dimensions, there exist 2 d distinct probabilities p x 1 x 2 . . . x d in the pgf, where x i = { 0 , 1 } for all i = 1 , 2 , , d . The first derivation relies on the identity
i = 1 d a i n = s * n l 1 , l 2 , , l d a 1 l 1 a 2 l 2 a d l d ,
where s * = { ( l 1 , , l d ) : 0 l i n for i = 1 , , d , and i = 1 d l i = n } and n l 1 , l 2 , , l d = n ! l 1 ! l 2 ! l d ! is the multinomial coefficient. We can express
G n ( t 1 , t 2 , , t d ) = x 1 = 0 1 x d = 0 1 p x 1 x d i = 1 d t i x i n = p 00 0 + i 1 = 1 d p 0 010 0 t i 1 + 1 i 1 < i 2 d p 0 010 01 0 t i 1 t i 2 + + p 11 1 t 1 t 2 t d n { s 0 + s 1 + s 2 + + s d } n = s * n l 0 , l 1 , , l d s 0 l 0 s 1 l 1 s d l d ,
where s 0 = p 00 0 , s 1 = i 1 = 1 d p 0 010 0 t i 1 , s 2 = 1 i 1 < i 2 d p 0 010 010 0 t i 1 t i 2 , ⋯, s d = p 11 1 t 1 t d , and l i denotes the number of times out of n trials where we obtain exactly i of the required elements, i = 1 , , d . Observe that the number of respective s i elements is d 0 = # s 0 = 1 , d 1 = # s 1 = d 1 , d 2 = # s 2 = d 2 , d 3 = # s 3 = d 3 , ⋯, d d = # s d = d d = 1 . Then,
s 0 l 0 = p 00 0 l 0 ; s 1 l 1 = i 1 = 1 d p 0 010 0 t i 1 l 1 = i 1 = 1 d 1 p ( i 1 ) t i 1 l 1 = s 1 * l 1 j 11 , j 12 , , j 1 d 1 p ( 1 ) j 11 p ( 2 ) j 12 p ( d 1 ) j 1 d 1 t 1 j 11 t 2 j 12 t d 1 j 1 d 1 ,
where we use the simplified notation, p ( i 1 ) p 0 010 0 with 1 being in the i 1 th position, and s 1 * = { ( j 11 , j 12 , , j 1 d 1 ) : 0 j 1 i 1 l 1 for i 1 = 1 , , d 1 , and i 1 = 1 d 1 j 1 i 1 = l 1 } . Similarly,
s 2 l 2 = 1 i 2 < i 2 d p 0 10 010 0 t i 1 t i 2 l 2 = 1 l 2 < i 2 d p ( i 1 , i 2 ) t i 1 t i 2 l 2 = s 2 * l 2 j 21 , j 22 , , j 2 , d 2 p ( 1 , 2 ) j 21 p ( 1 , 3 ) j 22 p ( d 1 , d ) j 2 , d 2 ( t 1 t 2 ) j 21 ( t 1 t 3 ) j 22 ( t d 1 t d ) j 2 d 2 ,
where s 2 * = { ( j 21 , j 22 , , j 2 , d 2 ) : 0 j 2 i 2 l 2 for i 2 = 1 , , d 2 , and i 2 = 1 d 2 j 2 , i 2 = l 2 } , etc.; finally,
s d l d = ( p 11 1 t 1 t 2 t d ) l d = p 1 1 l d t 1 l d t 2 l d t d l d .
To illustrate this, consider the case when d = 3 . In this case,
s 0 l 0 = p 000 l 0 s 1 l 1 = s 1 * l 1 j 11 , j 12 , j 13 p ( 1 ) j 11 p ( 2 ) j 12 p ( 3 ) j 13 t 1 j 11 t 2 j 12 t 3 j 13 = s 1 * l 1 j 11 , j 12 , j 13 p 100 j 11 p 010 j 12 p 001 j 13 t 1 j 11 t 2 j 12 t 3 j 13 s 2 l 2 = s 2 * l 2 j 21 , j 22 , j 23 p ( 12 ) j 21 p ( 13 ) j 22 p ( 23 ) j 23 ( t 1 t 2 ) j 21 ( t 1 t 3 ) j 22 ( t 2 t 3 ) j 23 = s 2 * l 2 j 21 , j 22 , j 23 p 110 j 21 p 101 j 22 p 011 j 23 t 1 j 21 + j 22 t 2 j 22 + j 23 t 3 j 22 + j 23 s 3 l 3 = ( p 111 t 1 t 2 t 3 ) l 3 = p 111 l 3 t 1 l 3 t 2 l 3 t 3 l 3 ,
where
s 1 * = ( j 11 , j 12 , j 13 ) : 0 j 1 i 1 l 1 for i 1 = 1 , 2 , 3 , and i 1 = 1 3 j 1 i 1 = l 1 , s 2 * = ( j 21 , j 22 , j 23 ) : 0 j 2 i 2 l 2 for i 2 = 1 , 2 , 3 , and i 2 = 1 3 j 2 i 2 = l 2
and i = 0 3 l i = n . We then find that
G n ( t 1 , t 2 , t 3 ) = s * n l 0 , l 1 , l 2 , l 3 s 0 l 0 s 1 l 1 s 2 l 2 s 3 l 3 = s * n l 0 , l 1 , l 2 , l 3 s 1 * l 1 j 11 , j 12 , j 13 s * l 2 j 21 , j 22 , j 23 p 000 l 0 p 100 j 11 p 010 j 12 p 001 j 13 p 110 j 21 p 101 j 22 p 011 j 23 p 111 l 3 t 1 j 11 + j 21 + j 22 + l 3 t 2 j 12 + j 21 + j 23 + l 3 t 3 j 13 + j 22 + j 23 + l 3 ,
where
s * = ( l 0 , l 1 , l 2 , l 3 ) : 0 l i n for i = 0 , 1 , 2 , 3 , and i = 0 3 l i = n , s 1 * = ( j 11 , j 12 , j 13 ) : 0 j 1 i l 1 for i = 1 , 2 , 3 , and i = 1 3 j 1 i = l 1 , s 2 * = ( j 21 , j 22 , j 23 ) : 0 j 2 i l 2 for i = 1 , 2 , 3 , and i = 1 3 j 2 i = l 2 .
The second derivation utilizes the direct approach of differentiating the pgf to obtain the pmf. To simplify the notation, let λ = x 1 = 0 1 x d = 0 1 λ x 1 x 2 x d as before and let q 1 = λ 000 . . . 01 λ , q 2 = λ 000 . . . 10 λ , q 3 = λ 000 . . . 11 λ , , q 2 d 1 = λ 111 . . . 1 λ . As a result, the general joint pmf is given by
p ( x ) = 1 Z ( λ , ν ) a = 0 x i max ( x ) a Z ( λ 000 . . . 0 , ν ) ( λ 000 . . . 0 ) a k 1 , k 2 , , k ( 2 d 1 ) N j b i j k j = x i j ( i b i j 1 ) k j = a q 1 k 1 q 2 k 2 q ( 2 d 1 ) k ( 2 d 1 ) k 1 ! k 2 ! k ( 2 d 1 ) ! .
where x = ( x 1 , x 2 , , x d ) and max ( x ) = max ( x 1 , x 2 , , x d ) . In the trivariate case, the model simplifies to
p ( x ) = 1 Z ( λ , ν ) a = 0 x i max ( x ) a Z ( λ 000 , ν ) ( λ 000 ) a k 1 , k 2 , , k 7 N j b i j k j = x i j ( i b i j 1 ) k j = a q 1 k 1 q 2 k 2 q 7 k 7 k 1 ! k 2 ! k 7 !
where x = ( x 1 , x 2 , x 3 ) and max ( x ) = max ( x 1 , x 2 , x 3 ) .

Appendix B. Derivations of Moments

Let λ i j + = λ i j 1 + λ i j 0 for i = 0 , 1 ; and j = 0 , 1 and λ i + + = λ i 00 + λ i 01 + λ i 10 + λ i 11 for i = 0 , 1 ; λ i + j , λ + i j , λ + i + , λ + + i are similarly defined. By differentiating the fmgf with respect to t 1 , t 2 , t 3 , and then setting t 1 = t 2 = t 3 = 0 , we obtain the joint factorial moments of the trivariate form, ( X 1 , X 2 , X 3 ) . Accordingly, letting Z ( k ) ( · ) = k Z λ k , the initial marginal and product moments are obtained as
μ X 1 = E ( X 1 ) = l n Z ( λ , ν ) λ λ 1 + + , μ X 2 = E ( X 2 ) = l n Z ( λ , ν ) λ λ + 1 + , μ X 3 = E ( X 3 ) = l n Z ( λ , ν ) λ λ + + 1 , μ X 1 X 2 = E ( X 1 X 2 ) = Z ( λ , ν ) Z ( λ , ν ) λ 1 + + λ + 1 + + Z ( λ , ν ) Z ( λ , ν ) λ 11 + , μ X 1 X 3 = E ( X 1 X 3 ) = Z ( λ , ν ) Z ( λ , ν ) λ 1 + + λ + + 1 + Z ( λ , ν ) Z ( λ , ν ) λ 1 + 1 , μ X 2 X 3 = E ( X 2 X 3 ) = Z ( λ , ν ) Z ( λ , ν ) λ + 1 + λ + + 1 + Z ( λ , ν ) Z ( λ , ν ) λ + 11 , σ X 1 X 2 = C o v ( X 1 , X 2 ) = Z ( λ , ν ) Z ( λ , ν ) { Z ( λ , ν ) } 2 { Z ( λ , ν ) } 2 λ 1 + + λ + 1 + + Z ( λ , ν ) Z ( λ , ν ) λ 11 + , σ X 1 X 3 = C o v ( X 1 , X 3 ) = Z ( λ , ν ) Z ( λ , ν ) { Z ( λ , ν ) } 2 { Z ( λ , ν ) } 2 λ 1 + + λ + + 1 + Z ( λ , ν ) Z ( λ , ν ) λ 1 + 1 , σ X 2 X 3 = C o v ( X 2 , X 3 ) = Z ( λ , ν ) Z ( λ , ν ) { Z ( λ , ν ) } 2 { Z ( λ , ν ) } 2 λ + 1 + λ + + 1 + Z ( λ , ν ) Z ( λ , ν ) λ + 11 , σ X 1 2 = V a r ( X 1 ) = Z ( λ , ν ) Z ( λ , ν ) { Z ( λ , ν ) } 2 { Z ( λ , ν ) } 2 λ 1 + + 2 + Z ( λ , ν ) Z ( λ , ν ) λ 1 + + , σ X 2 2 = V a r ( X 2 ) = Z ( λ , ν ) Z ( λ , ν ) { Z ( λ , ν ) } 2 { Z ( λ , ν ) } 2 λ + 1 + 2 + Z ( λ , ν ) Z ( λ , ν ) λ + 1 + , σ X 3 2 = V a r ( X 3 ) = Z ( λ , ν ) Z ( λ , ν ) { Z ( λ , ν ) } 2 { Z ( λ , ν ) } 2 λ + + 1 2 + Z ( λ , ν ) Z ( λ , ν ) λ + + 1 .
The above moments demonstrate that dispersion definitions are maintained via ν for the marginal distributions, where ν = 1 denotes equi-dispersion and ν < ( > ) 1 capture over- (under-) dispersion. For example, when ν = 1 , Z ( λ , ν = 1 ) = Z ( λ , ν = 1 ) = Z ( λ , ν = 1 ) = exp ( λ ) and ln ( Z ) = λ , thus one can see that μ X i = σ X i 2 for i = 1 , 2 , 3 (i.e., marginal equi-dispersion holds), while μ X i < ( > ) σ X i 2 for i = 1 , 2 , 3 when ν < ( > ) 1 .
Using the notation introduced in Section 3, we derive the expression below. Let
h ( t 1 , t 2 , t 3 ) = [ λ 000 + λ 100 ( t 1 + 1 ) + λ 010 ( t 2 + 1 ) + λ 001 ( t 3 + 1 ) + λ 110 ( t 1 + 1 ) ( t 2 + 1 ) + λ 101 ( t 1 + 1 ) ( t 3 + 1 ) + λ 011 ( t 2 + 1 ) ( t 3 + 1 ) + λ 111 ( t 1 + 1 ) ( t 2 + 1 ) ( t 3 + 1 ) ] .
We then obtain the derivatives,
G x 1 , x 2 , x 3 ( t 1 , t 2 , t 3 ) t 1 = Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) h ( t 1 , t 2 , t 3 ) t 1 = Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) [ λ 100 + λ 110 ( t 2 + 1 ) + λ 101 ( t 3 + 1 ) + λ 111 ( t 2 + 1 ) ( t 3 + 1 ) ] , G x 1 , x 2 , x 3 ( t 1 , t 2 , t 3 ) t 2 = Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) h ( t 1 , t 2 , t 3 ) t 2 = Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) [ λ 010 + λ 110 ( t 1 + 1 ) + λ 011 ( t 3 + 1 ) + λ 111 ( t 1 + 1 ) ( t 3 + 1 ) ] , G x 1 , x 2 , x 3 ( t 1 , t 2 , t 3 ) t 3 = Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) h ( t 1 , t 2 , t 3 ) t 3 = Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) [ λ 001 + λ 101 ( t 1 + 1 ) + λ 011 ( t 2 + 1 ) + λ 111 ( t 1 + 1 ) ( t 2 + 1 ) ] , 2 G x 1 , x 2 , x 3 ( t 1 , t 2 , t 3 ) t 1 t 2 = Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) [ λ 100 + λ 110 ( t 2 + 1 ) + λ 101 ( t 3 + 1 ) + λ 111 ( t 2 + 1 ) ( t 3 + 1 ) ] · [ λ 010 + λ 110 ( t 1 + 1 ) + λ 011 ( t 3 + 1 ) + λ 111 ( t 1 + 1 ) ( t 3 + 1 ) ] + Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) [ λ 110 + λ 111 ( t 3 + 1 ) ] , 2 G x 1 , x 2 , x 3 ( t 1 , t 2 , t 3 ) t 1 t 3 = Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) [ λ 100 + λ 110 ( t 2 + 1 ) + λ 101 ( t 3 + 1 ) + λ 111 ( t 2 + 1 ) ( t 3 + 1 ) ] · [ λ 001 + λ 101 ( t 1 + 1 ) + λ 011 ( t 2 + 1 ) + λ 111 ( t 1 + 1 ) ( t 2 + 1 ) ] + Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) [ λ 101 + λ 111 ( t 2 + 1 ) ] , 2 G x 1 , x 2 , x 3 ( t 1 , t 2 , t 3 ) t 2 t 3 = Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) [ ( λ 010 + λ 110 ( t 1 + 1 ) + λ 011 ( t 3 + 1 ) + λ 111 ( t 1 + 1 ) ( t 3 + 1 ) ] · [ λ 001 + λ 101 ( t 1 + 1 ) + λ 011 ( t 2 + 1 ) + λ 111 ( t 1 + 1 ) ( t 2 + 1 ) ] + Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) [ λ 011 + λ 111 ( t 1 + 1 ) ] , 2 G x 1 , x 2 , x 3 ( t 1 , t 2 , t 3 ) t 1 2 = Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) [ λ 100 + λ 110 ( t 2 + 1 ) + λ 101 ( t 3 + 1 ) + λ 111 ( t 2 + 1 ) ( t 3 + 1 ) ] 2 , 2 G x 1 , x 2 , x 3 ( t 1 , t 2 , t 3 ) t 2 2 = Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) [ λ 010 + λ 110 ( t 1 + 1 ) + λ 011 ( t 3 + 1 ) + λ 111 ( t 1 + 1 ) ( t 3 + 1 ) ] 2 , 2 G x 1 , x 2 , x 3 ( t 1 , t 2 , t 3 ) t 3 2 = Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) [ λ 001 + λ 101 ( t 1 + 1 ) + λ 011 ( t 2 + 1 ) + λ 111 ( t 1 + 1 ) ( t 2 + 1 ) ] 2 .
Then, the expected value of X 1 is
μ X 1 = Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) [ λ 100 + λ 110 ( t 2 + 1 ) + λ 101 ( t 3 + 1 ) + λ 111 ( t 2 + 1 ) ( t 3 + 1 ) ] | t 1 = t 2 = t 3 = 0 = Z ( λ , ν ) Z ( λ , ν ) λ 1 + + = ln Z ( λ , ν ) λ λ 1 + + ;
similarly, μ X 2 = ln Z ( λ , ν ) λ λ + 1 + , and μ X 3 = ln Z ( λ , ν ) λ λ + + 1 . Meanwhile,
μ X 1 X 2 = E ( X 1 X 2 ) = 2 G x 1 , x 2 , x 3 ( t 1 , t 2 , t 3 ) t 1 t 2 | t 1 = t 2 = t 3 = 0 = Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) [ λ 100 + λ 110 ( t 2 + 1 ) + λ 101 ( t 3 + 1 ) + λ 111 ( t 2 + 1 ) ( t 3 + 1 ) ] · [ λ 010 + λ 110 ( t 1 + 1 ) + λ 011 ( t 3 + 1 ) + λ 111 ( t 1 + 1 ) ( t 3 + 1 ) ] + Z { h ( t 1 , t 2 , t 3 ) , ν } Z ( λ , ν ) [ λ 110 + λ 111 ( t 3 + 1 ) ] | t 1 = t 2 = t 3 = 0 = Z ( λ , ν ) Z ( λ , ν ) λ 1 + + λ + 1 + + Z ( λ , ν ) Z ( λ , ν ) λ 11 + ;
similarly, μ X 1 X 3 = Z ( λ , ν ) Z ( λ , ν ) λ 1 + + λ + + 1 + Z ( λ , ν ) Z ( λ , ν ) λ 1 + 1 and μ X 2 X 3 = Z ( λ , ν ) Z ( λ , ν ) λ + 1 + λ + + 1 + Z ( λ , ν ) Z ( λ , ν ) λ + 11 . Consequently, we obtain the covariances to be
σ X 1 X 2 = C o v ( X 1 , X 2 ) = E ( X 1 X 2 ) E ( X 1 ) E ( X 2 ) = Z ( λ , ν ) Z ( λ , ν ) λ 1 + + λ + 1 + + Z ( λ , ν ) Z ( λ , ν ) λ 11 + Z ( λ , ν ) Z ( λ , ν ) λ 1 + + Z ( λ , ν ) Z ( λ , ν ) λ + 1 + = Z ( λ , ν ) Z ( λ , ν ) λ 1 + + λ + 1 + + Z ( λ , ν ) Z ( λ , ν ) λ 11 + Z ( λ , ν ) Z ( λ , ν ) 2 λ 1 + + λ + 1 + = Z ( λ , ν ) Z ( λ , ν ) { Z ( λ , ν ) } 2 { Z ( λ , ν ) } 2 λ 1 + + λ + 1 + + Z ( λ , ν ) Z ( λ , ν ) λ 11 + ;
similarly,
σ X 1 X 3 = Z ( λ , ν ) Z ( λ , ν ) { Z ( λ , ν ) } 2 { Z ( λ , ν ) } 2 λ 1 + + λ + + 1 + Z ( λ , ν ) Z ( λ , ν ) λ 1 + 1 , and σ X 2 X 3 = Z ( λ , ν ) Z ( λ , ν ) { Z ( λ , ν ) } 2 { Z ( λ , ν ) } 2 λ + 1 + λ + + 1 + Z ( λ , ν ) Z ( λ , ν ) λ + 11 .
Finally, noting that E { X 1 ( X 1 1 ) } = 2 G x 1 , x 2 , x 3 ( t 1 , t 2 , t 3 ) t 1 2 | t 1 = t 2 = t 3 = 0 , we obtain the variances as
σ X 1 2 = V a r ( X 1 ) = E { X 1 ( X 1 1 ) } + E ( X 1 ) { E ( X 1 ) } 2 = Z ( λ , ν ) Z ( λ , ν ) λ 1 + + 2 + Z ( λ , ν ) Z ( λ , ν ) λ 1 + + Z ( λ , ν ) Z ( λ , ν ) 2 λ 1 + + 2 = Z ( λ , ν ) Z ( λ , ν ) { Z ( λ , ν ) } 2 { Z ( λ , ν ) } 2 λ 1 + + 2 + Z ( λ , ν ) Z ( λ , ν ) λ 1 + + ;
similarly,
σ X 2 2 = Z ( λ , ν ) Z ( λ , ν ) { Z ( λ , ν ) } 2 { Z ( λ , ν ) } 2 λ + 1 + 2 + Z ( λ , ν ) Z ( λ , ν ) λ + 1 + , and σ X 3 2 = Z ( λ , ν ) Z ( λ , ν ) { Z ( λ , ν ) } 2 { Z ( λ , ν ) } 2 λ + + 1 2 + Z ( λ , ν ) Z ( λ , ν ) λ + + 1 .

Appendix C. Introduction to the Multivariate sCMP Model

A multivariate form of the sum of CMPs (MSCMP) distribution is an extension of the sCMP model in [24]. This section applies various MSCMP forms as alternative models to analyze the Corporación Favorita grocery sales data. The MSCMP distribution is defined as follows: given iid random variables W 1 , W 2 , , W m that are MCMP ( λ 00 . . . 0 , , λ 11 . . . 1 , ν ) distributed, Y = i = 1 m W i has a MSCMP ( λ 00 . . . 0 , , λ 11 . . . 1 , ν , m ) distribution with pgf
Π ( t 1 * , , t d * ) = Z x 1 = 0 1 x d = 0 1 λ x 1 x 2 x d i = 1 d ( t i * ) x i , ν Z ( λ , ν ) m .
Though the sCMP is defined as the sum of multiple CMPs, m need not be integer-valued since the pgf is valid for all m 0 . This MSCMP distribution includes the MNB (when ν = 0 ), the MP ( ν = 1 ), and the MB ( ν ) distributions all as special cases, and serves as an over-arching distribution that connects the three special cases.
Given the difficulties associated with calculating the pmf for the SCMP model, we pursue an alternative approach to compute its pmf for a positive integer, m. For simplicity, we illustrate this approach in the trivariate case and consider m = 2 : let W 1 = ( W 11 , W 12 , W 13 ) W 2 = ( W 21 , W 22 , W 23 ) be iid trivariate CMP ( λ 000 , , λ 111 , ν ) , and let Y = ( Y 1 , Y 2 , Y 3 ) = W 1 + W 2 ; then Y has a trivariate SCMP ( λ 000 , , λ 111 , ν , m = 2 ) distribution, and
p ( Y 1 = y 1 , Y 2 = y 2 , Y 3 = y 3 ) = a , b , c p ( W 11 = a , W 12 = b , W 13 = c ) · p ( W 21 = y 1 a , W 22 = y 2 b , W 23 = y 3 c ) ,
where a , b , c are non-negative integers such that y 1 a 0 , y 2 b 0 , and y 3 c 0 ; similarly, we can determine the pmf of the trivariate SCMP ( λ 000 , , λ 111 , ν , m = 3 ) distribution.
We fitted the Corporación Favorita grocery sales dataset with the trivariate SCMP model so as to demonstrate its capability of dealing with trivariate count data. Table A1 provides the resulting trivariate SCMP estimates with m = 2 and m = 3 where, for comparison, we also include the results of the trivariate NB and CMP models. Accordingly, we find that the trivariate SCMP models fit the data better than the trivariate CMP, with improvement growing with m. More precisely, we note that, as m increases, the log-likelihood increases while the AIC decreases. In particular, ν ^ likewise decreases toward 0 (which results in the trivariate NB model) as m increases. Unfortunately, current computational issues prevent us from providing SCMP results for m > 3 , but these results do illustrate that the SCMP model will produce a log-likelihood no worse than that from the trivariate NB as it is a special case of bivariate SCMP model.
Table A1. Estimation results associated with the Corporación Favorita grocery sales data based on various assumed trivariate models: CMP, sCMP ( m = 2 ), sCMP ( m = 3 ), and NB. Respective log-likelihood and Akaike Information Criterion (AIC) values are also provided.
Table A1. Estimation results associated with the Corporación Favorita grocery sales data based on various assumed trivariate models: CMP, sCMP ( m = 2 ), sCMP ( m = 3 ), and NB. Respective log-likelihood and Akaike Information Criterion (AIC) values are also provided.
ModelEstimated ParametersLog
Likelihood
No. of
Parameters
AIC
λ ^ 000 = 0.00 λ ^ 111 = 0.00 λ ^ 100 = 0.32
CMP λ ^ 010 = 0.47 λ ^ 001 = 1.17 λ ^ 110 = 0.00 −804.991627.9
λ ^ 101 = 0.00 λ ^ 011 = 0.00 ν ^ = 0.25
λ ^ 000 = 0.05 λ ^ 111 = 0.00 λ ^ 100 = 0.22
sCMP ( m = 2 ) λ ^ 010 = 0.33 λ ^ 001 = 0.82 λ ^ 110 = 0.00 −804.091626.0
λ ^ 101 = 0.00 λ ^ 011 = 0.00 ν ^ = 0.19
λ ^ 000 = 0.12 λ ^ 111 = 0.00 λ ^ 100 = 0.16
sCMP ( m = 3 ) λ ^ 010 = 0.24 λ ^ 001 = 0.61 λ ^ 110 = 0.00 −803.591625.0
λ ^ 101 = 0.00 λ ^ 011 = 0.00 ν ^ = 0.13
a ^ 1 = 0.44 a ^ 2 = 0.65 a ^ 3 = 1.61
NB a ^ 12 = 0.00 a ^ 13 = 0.00 a ^ 23 = 0.00 −802.881621.7
a ^ 123 = 0.00 a ^ 0 = 3.69 k ^ = 6.01

Appendix D. Real Datasets

Table A2. Corporación Favorita sales data: Unit sales of an item in each of three stores over 100 days.
Table A2. Corporación Favorita sales data: Unit sales of an item in each of three stores over 100 days.
DayStore 1Store 2Store 3DayStore 1Store 2Store 3DayStore 1Store 2Store 3
12563523368135
238233642669123
3282137321670044
44583871212711410
52713396013721014
6141540224730114
70011411013740219
81624236475046
966743203767212
1038134472477147
1138164557378107
1201746811979318
130511471617804317
146874812581477
1512449365823313
168445052983109
1731020511010841111
18661252441185028
19016531525866511
2037105435487009
210255517388048
2234156214891215
233435715290164
247171358037910211
25241159641292025
26059608111938211
27312613412943321
2823162125953921
292946350396088
3042764213974527
31641265037985315
32161866122699124
3301115670117100436
342712
Table A3. NBA All-Star data: Number of players selected for the NBA all-star game each year in each position (C = Center; F = Forward; FC = Forward-center).
Table A3. NBA All-Star data: Number of players selected for the NBA all-star game each year in each position (C = Center; F = Forward; FC = Forward-center).
YearCFFCYearCFFCYearCFFC
200061420063342012324
200134320072332013224
200243420083232014225
200342320092352015244
200434420102252016342
20052252011422

References

  1. Johnson, N.; Kotz, S.; Balakrishnan, N. Discrete Multivariate Distributions; John Wiley & Sons: New York, NY, USA, 1997. [Google Scholar]
  2. Krishnamoorthy, A.S. Multivariate binomial and Poisson distributions. Sankhyā Indian J. Stat. 1951, 11, 117–124. [Google Scholar]
  3. Mahamunulu, D.M. A note on regression in the multivariate Poisson distribution. J. Am. Stat. Assoc. 1967, 62, 251–258. [Google Scholar] [CrossRef]
  4. Teicher, H. On the Multivariate Poisson distribution. Skand. Aktuarietidskr. 1954, 37, 1–9. [Google Scholar] [CrossRef]
  5. Hilbe, J.M. Modeling Count Data; Cambridge University Press: New York, NY, USA, 2014. [Google Scholar] [CrossRef]
  6. Doss, D.C. Definition and characterization of multivariate negative binomial distribution. J. Multivar. Anal. 1979, 9, 460–464. [Google Scholar] [CrossRef] [Green Version]
  7. Conway, R.W.; Maxwell, W.L. A queuing model with state dependent service rates. J. Ind. Eng. 1962, 12, 132–136. [Google Scholar]
  8. Sellers, K.F.; Shmueli, G.; Borle, S. The COM-Poisson model for count data: A survey of methods and applications. Appl. Stoch. Model. Bus. Ind. 2011, 28, 104–116. [Google Scholar] [CrossRef]
  9. Shmueli, G.; Minka, T.P.; Kadane, J.B.; Borle, S.; Boatwright, P. A useful distribution for fitting discrete data: Revival of the Conway-Maxwell-Poisson distribution. Appl. Stat. 2005, 54, 127–142. [Google Scholar] [CrossRef]
  10. Guikema, S.D.; Coffelt, J.P. A Flexible Count Data Regression Model for Risk Analysis. Risk Anal. 2008, 28, 213–223. [Google Scholar] [CrossRef] [PubMed]
  11. Sellers, K.F.; Morris, D.S.; Balakrishnan, N. Bivariate Conway-Maxwell-Poisson distribution: Formulation, properties, and inference. J. Multivar. Anal. 2016, 150, 152–168. [Google Scholar] [CrossRef]
  12. Kocherlakota, S.; Kocherlakota, K. Bivariate Discrete Distributions; Marcel Dekker: New York, NY, USA, 1992. [Google Scholar]
  13. Lai, C.D. Constructions of discrete bivariate distributions. In Advances in Distribution Theory, Order Statistics and Inference, Part I; Balakrishnan, N., Sarabia, J.M., Castillo, E., Eds.; Birkhauser: Boston, MA, USA, 2006; pp. 29–58. [Google Scholar]
  14. Marshall, A.W.; Olkin, I. A family of bivariate distributions generated by the bivariate Bernoulli distribution. J. Am. Stat. Assoc. 1985, 80, 332–338. [Google Scholar] [CrossRef]
  15. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2017. [Google Scholar]
  16. Balakrishnan, N.; Pal, S. Lognormal lifetimes and likelihood-based inference for flexible cure rate models based on COM-Poisson family. Comput. Stat. Data Anal. 2013, 67, 41–67. [Google Scholar] [CrossRef]
  17. Burnham, K.P.; Anderson, D.R. Model Selection and Multimodel Inference; Springer: New York, NY, USA, 2002. [Google Scholar]
  18. Corporación Favorita. Grocery Sales Data. 2018. Available online: https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data (accessed on 26 April 2020).
  19. Voinov, V.; Nikulin, M.; Balakrishnan, N. Chi-Squared Goodness of Fit Tests with Applications; Academic Press: Boston, MA, USA, 2013. [Google Scholar]
  20. NBA. NBA All-Star Game, 2000–2016. Available online: https://www.kaggle.com/fmejia21/nba-all-star-game-20002016? (accessed on 22 April 2020).
  21. Inouye, D.I.; Yang, E.; Allen, G.I.; Ravikumar, P. A review of multivariate distributions for count data derived from the Poisson distribution. WIREs Comput. Stat. 2017, 9, e1398. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  22. Genest, C.; Nešlehová, J. A Primer on Copulas for Count Data. ASTIN Bull. 2007, 37, 475–515. [Google Scholar] [CrossRef] [Green Version]
  23. Trivedi, P.; Zimmer, D. A Note on Identification of Bivariate Copulas for Discrete Count Data. Econometrics 2017, 5, 10. [Google Scholar] [CrossRef] [Green Version]
  24. Sellers, K.F.; Swift, A.W.; Weems, K.S. A flexible distribution class for count data. J. Stat. Distrib. Appl. 2017, 4, 1–21. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Estimated marginal distributions associated with the trivariate CMP (blue/square), the trivariate geometric (purple/circle), the trivariate negative binomial (red/triangle), and the trivariate Poisson (green/diamond) compared with the original data relative frequencies (histogram) regarding the number of unit sales for: (a) Store 1, (b) Store 2, (c) Store 3.
Figure 1. Estimated marginal distributions associated with the trivariate CMP (blue/square), the trivariate geometric (purple/circle), the trivariate negative binomial (red/triangle), and the trivariate Poisson (green/diamond) compared with the original data relative frequencies (histogram) regarding the number of unit sales for: (a) Store 1, (b) Store 2, (c) Store 3.
Stats 04 00021 g001
Table 1. Model support levels based on AIC difference values, Δ i = AIC i AIC min , for Model i [17].
Table 1. Model support levels based on AIC difference values, Δ i = AIC i AIC min , for Model i [17].
Δ i Empirical Support Level for Model i
[ 0 , 2 ] Substantial
[ 4 , 7 ] Considerably less
( 10 , ) Essentially none
Table 2. Proportion of 2 log Λ values that lie within the χ 1 2 ( 0.95 ) and χ 1 2 ( 0.99 ) bounds, respectively, given various sample sizes, {100, 250, 500, 1000}.
Table 2. Proportion of 2 log Λ values that lie within the χ 1 2 ( 0.95 ) and χ 1 2 ( 0.99 ) bounds, respectively, given various sample sizes, {100, 250, 500, 1000}.
Sample SizeWithin χ 1 2 ( 0.95 ) Within χ 1 2 ( 0.99 )
100 93.0%99.2%
250 93.4%99.4%
500 94.2%98.8%
1000 94.8%98.6%
Table 3. Power of the likelihood ratio test (at 5% level) when data are generated from the trivariate geometric, the trivariate CMP with ν = 0.5 , the trivariate CMP with ν = 2 and the trivariate Bernoulli, respectively, for various sample sizes, {100, 250, 500, 1000}.
Table 3. Power of the likelihood ratio test (at 5% level) when data are generated from the trivariate geometric, the trivariate CMP with ν = 0.5 , the trivariate CMP with ν = 2 and the trivariate Bernoulli, respectively, for various sample sizes, {100, 250, 500, 1000}.
Sample SizeGeometricCMP ( ν = 0.5 )CMP ( ν = 2 )Bernoulli
100 100%37.0%82.8%  100%
250 100%62.6%99.4%  100%
500 100%75.4%100.0%  100%
1000 100%99.2%100.0%  100%
Table 4. Estimation results associated with the Corporación Favorita grocery sales data based on various assumed trivariate models: Conway–Maxwell–Poisson (CMP), trivariate Poisson, trivariate geometric and trivariate negative binomial (NB). Respective log-likelihood and Akaike Information Criterion (AIC) values are also provided, along with the number of free parameters for AIC determination.
Table 4. Estimation results associated with the Corporación Favorita grocery sales data based on various assumed trivariate models: Conway–Maxwell–Poisson (CMP), trivariate Poisson, trivariate geometric and trivariate negative binomial (NB). Respective log-likelihood and Akaike Information Criterion (AIC) values are also provided, along with the number of free parameters for AIC determination.
ModelEstimated ParametersLog
Likelihood
No. of Free
Parameters
AIC
λ ^ 000 = 0.00 λ ^ 111 = 0.00 λ ^ 100 = 0.32
CMP λ ^ 010 = 0.47 λ ^ 001 = 1.17 λ ^ 110 = 0.00 −804.991627.9
λ ^ 101 = 0.00 λ ^ 011 = 0.00 ν ^ = 0.25
A ^ 1 = 2.32 A ^ 2 = 2.97 A ^ 3 = 8.94
Poisson A ^ 12 = 0.19 A ^ 13 = 0.00 A ^ 23 = 0.62 −867.371748.5
A ^ 123 = 0.11
a ^ 1 = 2.62 a ^ 2 = 3.89 a ^ 3 = 9.67
geometric a ^ 12 = 0.00 a ^ 13 = 0.00 a ^ 23 = 0.00 −844.671703.2
a ^ 123 = 0.00 a ^ 0 = 17.18
a ^ 1 = 0.44 a ^ 2 = 0.65 a ^ 3 = 1.61
NB a ^ 12 = 0.00 a ^ 13 = 0.00 a ^ 23 = 0.00 −802.881621.7
a ^ 123 = 0.00 a ^ 0 = 3.69 k ^ = 6.01
Table 5. The goodness-of-fit test measures to compare the considered trivariate model (i.e., trivariate CMP, trivariate negative binomial (NB), trivariate Poisson, and trivariate geometric) marginal pmfs to the marginal data regarding the number of unit sales for Stores 1, 2 and 3, respectively.
Table 5. The goodness-of-fit test measures to compare the considered trivariate model (i.e., trivariate CMP, trivariate negative binomial (NB), trivariate Poisson, and trivariate geometric) marginal pmfs to the marginal data regarding the number of unit sales for Stores 1, 2 and 3, respectively.
Store 1Store 2Store 3
CMP26.08.930.6
NB26.49.730.2
Poisson82.757.7857.0
Geometric16.122.751.6
Table 6. Estimation results associated with the NBA All-Star data based on various assumed trivariate models: Conway–Maxwell–Poisson (CMP), Poisson, and negative binomial (NB). Respective log-likelihood and Akaike Information Criterion (AIC) values are also provided.
Table 6. Estimation results associated with the NBA All-Star data based on various assumed trivariate models: Conway–Maxwell–Poisson (CMP), Poisson, and negative binomial (NB). Respective log-likelihood and Akaike Information Criterion (AIC) values are also provided.
ModelEstimated ParametersLog
Likelihood
No. of Free
Params
AIC
λ ^ 000 = 0.00 × 10 0 λ ^ 111 = 0.36 × 10 28 λ ^ 100 = 1.53 × 10 28
CMP λ ^ 010 = 0.00 × 10 0 λ ^ 001 = 1.38 × 10 28 λ ^ 110 = 2.17 × 10 28 −68.29154.4
λ ^ 101 = 3.77 × 10 28 λ ^ 011 = 4.52 × 10 28 ν ^ = 3.84 × 10 1
A ^ 1 = 1.15 A ^ 2 = 0.94 A ^ 3 = 1.94
Poisson A ^ 12 = 0.00 A ^ 13 = 0.12 A ^ 23 = 0.04 −83.37180.6
A ^ 123 = 1.66
a ^ 1 = 3.3 × 10 5 a ^ 2 = 2.5 × 10 5 a ^ 3 = 5.5 × 10 5
NB a ^ 12 = 2.6 × 10 6 a ^ 13 = 4.3 × 10 6 a ^ 23 = 4.1 × 10 6 −83.38182.7
a ^ 123 = 4.5 × 10 5 a ^ 0 = 1.00017 k ^ = 34856
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Sellers, K.F.; Li, T.; Wu, Y.; Balakrishnan, N. A Flexible Multivariate Distribution for Correlated Count Data. Stats 2021, 4, 308-326. https://0-doi-org.brum.beds.ac.uk/10.3390/stats4020021

AMA Style

Sellers KF, Li T, Wu Y, Balakrishnan N. A Flexible Multivariate Distribution for Correlated Count Data. Stats. 2021; 4(2):308-326. https://0-doi-org.brum.beds.ac.uk/10.3390/stats4020021

Chicago/Turabian Style

Sellers, Kimberly F., Tong Li, Yixuan Wu, and Narayanaswamy Balakrishnan. 2021. "A Flexible Multivariate Distribution for Correlated Count Data" Stats 4, no. 2: 308-326. https://0-doi-org.brum.beds.ac.uk/10.3390/stats4020021

Article Metrics

Back to TopTop