Article

Fast Approximations of the Jeffreys Divergence between Univariate Gaussian Mixtures via Mixture Conversions to Exponential-Polynomial Distributions

Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Submission received: 7 September 2021 / Revised: 20 October 2021 / Accepted: 26 October 2021 / Published: 28 October 2021
(This article belongs to the Special Issue Distance in Information and Statistical Physics III)

Abstract:
The Jeffreys divergence is a renowned arithmetic symmetrization of the oriented Kullback–Leibler divergence broadly used in information sciences. Since the Jeffreys divergence between Gaussian mixture models is not available in closed form, various techniques with advantages and disadvantages have been proposed in the literature to either estimate, approximate, or lower and upper bound this divergence. In this paper, we propose a simple yet fast heuristic to approximate the Jeffreys divergence between two univariate Gaussian mixtures with an arbitrary number of components. Our heuristic relies on converting the mixtures into pairs of dually parameterized probability densities belonging to an exponential-polynomial family. To measure with a closed-form formula the goodness of fit between a Gaussian mixture and an exponential-polynomial density approximating it, we generalize the Hyvärinen divergence to $\alpha$-Hyvärinen divergences. In particular, the 2-Hyvärinen divergence allows us to perform model selection by choosing the order of the exponential-polynomial densities used to approximate the mixtures. We experimentally demonstrate that our heuristic runs several orders of magnitude faster than stochastic Monte Carlo estimation while approximating the Jeffreys divergence reasonably well, especially when the mixtures have a very small number of modes.


1. Introduction

1.1. Statistical Mixtures and Statistical Divergences

We consider the problem of approximating the Jeffreys divergence [1] between two finite univariate continuous mixture models [2] $m(x)=\sum_{i=1}^{k} w_i p_i(x)$ and $m'(x)=\sum_{i=1}^{k'} w_i' p_i'(x)$ with continuous component distributions $p_i$'s and $p_i'$'s defined on a coinciding support $\mathcal{X}\subseteq\mathbb{R}$. The mixtures $m(x)$ and $m'(x)$ may have a different number of components (i.e., $k\neq k'$). Historically, Pearson [3] first considered a univariate Gaussian mixture of two components for modeling the distribution of the ratio of forehead breadth to body length of a thousand crabs in 1894 (Pearson obtained a unimodal mixture).
Although our work applies to any continuous mixtures of an exponential family (e.g., Rayleigh mixtures [4] with restricted support $\mathcal{X}=\mathbb{R}_+$), we explain our method for the most prominent family of mixtures encountered in practice: the Gaussian mixture models, or GMMs for short. In the remainder, a univariate GMM $m(x)=\sum_{i=1}^{k} w_i p_{\mu_i,\sigma_i}(x)$ with $k$ Gaussian components
$p_i(x) = p_{\mu_i,\sigma_i}(x) := \frac{1}{\sigma_i\sqrt{2\pi}}\exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right),$
is called a k-GMM.
The Kullback–Leibler divergence (KLD) [5,6] $D_{\mathrm{KL}}[m:m']$ between two mixtures $m$ and $m'$ is:
$D_{\mathrm{KL}}[m:m'] := \int_{\mathcal{X}} m(x)\log\frac{m(x)}{m'(x)}\,\mathrm{d}x.$
The KLD is an oriented divergence since $D_{\mathrm{KL}}[m:m'] \neq D_{\mathrm{KL}}[m':m]$ in general.
The Jeffreys divergence (JD) [1] $D_J[m,m']$ is the arithmetic symmetrization of the forward and reverse KLDs:
$D_J[m,m'] := D_{\mathrm{KL}}[m:m'] + D_{\mathrm{KL}}[m':m] = \int_{\mathcal{X}} (m(x)-m'(x))\log\frac{m(x)}{m'(x)}\,\mathrm{d}x.$
The JD is a symmetric divergence: $D_J[m,m'] = D_J[m',m]$. In the literature, the Jeffreys divergence [7] has also been called the J-divergence [8,9], the symmetric Kullback–Leibler divergence [10], and sometimes the symmetrical Kullback–Leibler divergence [11,12]. In general, it is provably hard to calculate the definite integral of the KLD between two continuous mixtures in closed form: For example, the KLD between two GMMs has been shown to be non-analytic [13]. Thus, in practice, when calculating the JD between two GMMs, one can either approximate [14,15], estimate [16], or bound [17,18] the KLD between mixtures. Another approach to bypass the computational intractability of calculating the KLD between mixtures consists of designing new types of divergences that admit closed-form expressions for mixtures. See, for example, the Cauchy–Schwarz divergence [19] or the total square divergence [20] (a total Bregman divergence), which admit closed-form formulas when handling GMMs. The total square divergence [20] is invariant to rigid transformations and provably robust to outliers in clustering applications.
In practice, to estimate the KLD between mixtures, one uses the following Monte Carlo (MC) estimator:
$\hat{D}_{\mathrm{KL}}^{S_s}[m:m'] := \frac{1}{s}\sum_{i=1}^{s}\left(\log\frac{m(x_i)}{m'(x_i)} + \frac{m'(x_i)}{m(x_i)} - 1\right) \geq 0,$
where $S_s = \{x_1,\ldots,x_s\}$ is a set of $s$ independent and identically distributed (i.i.d.) samples from $m(x)$. This MC estimator is by construction always non-negative, and it is consistent under mild conditions [21]: $\lim_{s\rightarrow\infty} \hat{D}_{\mathrm{KL}}^{S_s}[m:m'] = D_{\mathrm{KL}}[m:m']$.
Similarly, we estimate the Jeffreys divergence via MC sampling as follows:
$\hat{D}_J^{S_s}[m,m'] := \frac{1}{s}\sum_{i=1}^{s}\frac{2\,(m(x_i)-m'(x_i))}{m(x_i)+m'(x_i)}\log\frac{m(x_i)}{m'(x_i)} \geq 0,$
where $S_s = \{x_1,\ldots,x_s\}$ are $s$ i.i.d. samples from the "middle mixture" $m_{12}(x) := \frac{1}{2}(m(x)+m'(x))$. By choosing the middle mixture $m_{12}(x)$ for sampling, we ensure that we keep the symmetric property of the JD (i.e., $\hat{D}_J^{S_s}[m,m'] = \hat{D}_J^{S_s}[m',m]$), and we also have consistency under mild conditions [21]: $\lim_{s\rightarrow\infty}\hat{D}_J^{S_s}[m,m'] = D_J[m,m']$. The time complexity to stochastically estimate the JD is $\tilde{O}((k+k')s)$, with $s$ typically ranging from $10^4$ to $10^6$ in applications. Notice that the number of components of a mixture can be very large (e.g., $k=O(n)$ for $n$ input data when using Kernel Density Estimators [2]). KDEs may thus have a large number of components and may potentially exhibit many spurious modes visualized as small bumps when plotting the densities.
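To make the baseline concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper; the helper names gmm_pdf, gmm_sample, and jeffreys_mc are hypothetical) of the Monte Carlo estimator above, sampling from the middle mixture $m_{12}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def gmm_pdf(x, w, mu, sigma):
    """Density of a univariate GMM evaluated at the points x."""
    x = np.atleast_1d(x)[:, None]
    comp = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return comp @ w

def gmm_sample(n, w, mu, sigma):
    """Draw n i.i.d. variates from a univariate GMM."""
    idx = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[idx], sigma[idx])

def jeffreys_mc(s, gmm1, gmm2):
    """MC estimate of D_J[m, m'] with s samples drawn from m_12 = (m + m')/2."""
    n1 = rng.binomial(s, 0.5)                 # how many samples come from m (vs. m')
    x = np.concatenate([gmm_sample(n1, *gmm1), gmm_sample(s - n1, *gmm2)])
    p, q = gmm_pdf(x, *gmm1), gmm_pdf(x, *gmm2)
    return np.mean(2.0 * (p - q) / (p + q) * np.log(p / q))

# Example with two 2-GMMs (weights, means, standard deviations):
m1 = (np.array([0.5, 0.5]), np.array([-1.0, 2.0]), np.array([1.0, 0.5]))
m2 = (np.array([0.3, 0.7]), np.array([0.0, 2.5]), np.array([1.0, 0.8]))
print(jeffreys_mc(100_000, m1, m2))
```

Each run returns a slightly different value; this stochastic fluctuation is what the deterministic heuristic introduced below avoids.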

1.2. Jeffreys Divergence between Densities of an Exponential Family

We consider approximating the JD by converting continuous mixtures into densities of exponential families [22]. A continuous exponential family (EF) $\mathcal{E}_t$ of order $D$ is defined as a family of probability density functions with support $\mathcal{X}$ of the form:
$\mathcal{E}_t := \left\{ p_\theta(x) := \exp\left(\sum_{i=1}^{D}\theta_i t_i(x) - F(\theta)\right) \ :\ \theta\in\Theta \right\},$
where $F(\theta)$ is called the log-normalizer, which ensures the normalization of $p_\theta(x)$ (i.e., $\int_{\mathcal{X}} p_\theta(x)\,\mathrm{d}x = 1$):
$F(\theta) = \log\int_{\mathcal{X}}\exp\left(\sum_{i=1}^{D}\theta_i t_i(x)\right)\mathrm{d}x.$
Parameter $\theta\in\Theta\subseteq\mathbb{R}^D$ is called the natural parameter, and the functions $t_1(x),\ldots,t_D(x)$ are called the sufficient statistics [22]. Let $\Theta$ denote the natural parameter space: $\Theta := \{\theta\ :\ F(\theta) < \infty\}$, an open convex domain for regular exponential families [22]. The exponential family is said to be minimal when the functions $1, t_1(x),\ldots,t_D(x)$ are linearly independent.
It is well-known that one can bypass the definite integral calculation of the KLD when the probability density functions $p_{\theta_1}$ and $p_{\theta_2}$ belong to the same exponential family [23,24]:
$D_{\mathrm{KL}}[p_{\theta_1}:p_{\theta_2}] = B_F(\theta_2:\theta_1),$
where $B_F(\theta_2:\theta_1)$ is the Bregman divergence induced by the log-normalizer, a strictly convex real-analytic function [22]. The Bregman divergence [25] between two parameters $\theta_1$ and $\theta_2$ for a strictly convex and smooth generator $F$ is defined by:
$B_F(\theta_1:\theta_2) := F(\theta_1) - F(\theta_2) - (\theta_1-\theta_2)^\top\nabla F(\theta_2).$
Thus, the Jeffreys divergence between two pdfs $p_\theta$ and $p_{\theta'}$ belonging to the same exponential family is a symmetrized Bregman divergence [26]:
$D_J[p_\theta:p_{\theta'}] = B_F(\theta:\theta') + B_F(\theta':\theta) = (\theta-\theta')^\top(\nabla F(\theta)-\nabla F(\theta')).$
Let $F^*(\eta)$ denote the Legendre–Fenchel convex conjugate of $F(\theta)$:
$F^*(\eta) := \sup_{\theta\in\Theta}\{\theta^\top\eta - F(\theta)\}.$
The Legendre transform ensures that $\eta=\nabla F(\theta)$ and $\theta=\nabla F^*(\eta)$, and the Jeffreys divergence between two pdfs $p_\theta$ and $p_{\theta'}$ belonging to the same exponential family is:
$D_J[p_\theta:p_{\theta'}] = (\theta-\theta')^\top(\eta-\eta').$
Notice that the log-normalizer F ( θ ) does not appear explicitly in the above formula.
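As a concrete illustration of this identity, the following sketch (ours; the helper names are hypothetical) evaluates the Jeffreys divergence between two univariate normals from their natural and moment parameters and cross-checks it against the textbook closed form for Gaussians:

```python
import numpy as np

def gauss_natural(mu, sigma):
    """Natural parameter of N(mu, sigma^2) for t(x) = (x, x^2)."""
    return np.array([mu / sigma**2, -1.0 / (2.0 * sigma**2)])

def gauss_moment(mu, sigma):
    """Moment parameter: (E[x], E[x^2])."""
    return np.array([mu, mu**2 + sigma**2])

def jeffreys_expfam(theta1, eta1, theta2, eta2):
    """D_J[p_theta1, p_theta2] = (theta1 - theta2) . (eta1 - eta2)."""
    return float(np.dot(theta1 - theta2, eta1 - eta2))

mu1, s1, mu2, s2 = 0.0, 1.0, 1.5, 2.0
jd_dual = jeffreys_expfam(gauss_natural(mu1, s1), gauss_moment(mu1, s1),
                          gauss_natural(mu2, s2), gauss_moment(mu2, s2))
# Closed-form Jeffreys divergence between two normals, for comparison:
jd_closed = (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) \
          + (s2**2 + (mu1 - mu2)**2) / (2 * s1**2) - 1.0
print(jd_dual, jd_closed)   # both equal ~2.53125
```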

1.3. A Simple Approximation Heuristic

Densities $p_\theta$ of an exponential family admit a dual parameterization [22]: $\eta = \eta(\theta) := E_{p_\theta}[t(x)] = \nabla F(\theta)$, called the moment parameterization (or mean parameterization). Let $H$ denote the moment parameter space. Let us use the subscript and superscript notations to emphasize the coordinate system used to index a density: In our notation, we thus write $p_\theta(x) = p^\eta(x)$.
In view of Equation (7), our method to approximate the Jeffreys divergence between mixtures $m$ and $m'$ consists of first converting those mixtures $m$ and $m'$ into pairs of polynomial exponential densities (PEDs) in Section 2. To convert a mixture $m(x)$ into a pair $(p_{\bar\theta_1}, p^{\bar\eta_2})$ dually parameterized (but not dual because $\bar\eta_2\neq\nabla F(\bar\theta_1)$), we shall consider "integral extensions" (or information projections) of the Maximum Likelihood Estimator [22] (MLE estimates in the moment parameter space $H = \{\nabla F(\theta)\ :\ \theta\in\Theta\}$) and of the Score Matching Estimator [27] (SME estimates in the natural parameter space $\Theta = \{\nabla F^*(\eta)\ :\ \eta\in H\}$).
We shall consider polynomial exponential families [28] (PEFs), also called exponential-polynomial families (EPFs) [29]. PEFs $\mathcal{E}_D$ are regular minimal exponential families with polynomial sufficient statistics $t_i(x)=x^i$ for $i\in\{1,\ldots,D\}$. For example, the exponential distributions $\{p_\lambda(x)=\lambda\exp(-\lambda x)\}$ form a PEF with $D=1$, $t(x)=x$ and $\mathcal{X}=\mathbb{R}_+$, and the normal distributions form an EPF with $D=2$, $t(x)=[x\ \ x^2]^\top$ and $\mathcal{X}=\mathbb{R}$, etc. Although the log-normalizer $F(\theta)$ can be obtained in closed form for lower order PEFs (e.g., $D=1$ or $D=2$) or very special subfamilies (e.g., when $D=1$ and $t_1(x)=x^k$, the exponential-monomial families [30]), no closed-form formula is available for $F(\theta)$ of EPFs in general as soon as $D\geq 4$ [31,32], and the cumulant function $F(\theta)$ is said to be computationally intractable. Notice that when $\mathcal{X}=\mathbb{R}$, the leading coefficient $\theta_D$ is negative for even integer order $D$. EPFs are attractive because these families can universally model any smooth multimodal distribution [28] and require fewer parameters in comparison to GMMs: Indeed, a univariate $k$-GMM $m(x)$ (with at most $k$ modes and $k-1$ antimodes) requires $3k-1$ parameters to specify $m(x)$ (or $k+1$ for a KDE with constant kernel width $\sigma$, or $2k-1$ for a KDE with varying kernel widths, but then $k=n$ observations). A density of an EPF of order $D$ is called an exponential-polynomial density (EPD) and requires $D$ parameters to specify $\theta$, with at most $D/2$ modes (and $D/2-1$ antimodes). The case of the quartic (polynomial) exponential densities $\mathcal{E}_4$ ($D=4$) has been extensively investigated in [31,33,34,35,36,37]. Armstrong and Brigo [38] discussed order-6 PEDs, and Efron and Hastie reported an order-7 PEF in their textbook (see Figure 5.7 of [39]). Figure 1 displays two examples of converting a GMM into a pair of dually parameterized exponential-polynomial densities.
Then by converting both mixture $m$ and mixture $m'$ into pairs of dually natural/moment parameterized unnormalized PEDs, i.e., $m\rightarrow(q_{\bar\theta_{\mathrm{SME}}}, q^{\bar\eta_{\mathrm{MLE}}})$ and $m'\rightarrow(q_{\bar\theta'_{\mathrm{SME}}}, q^{\bar\eta'_{\mathrm{MLE}}})$, we approximate the JD between mixtures $m$ and $m'$ by using the four parameters of the PEDs
$D_J[m,m'] \approx (\bar\theta_{\mathrm{SME}}-\bar\theta'_{\mathrm{SME}})^\top(\bar\eta_{\mathrm{MLE}}-\bar\eta'_{\mathrm{MLE}}).$
Let Δ J denote the approximation formula obtained from the two pairs of PEDs:
$\Delta_J[p_{\theta_{\mathrm{SME}}}, p^{\eta_{\mathrm{MLE}}}; p_{\theta'_{\mathrm{SME}}}, p^{\eta'_{\mathrm{MLE}}}] := (\theta_{\mathrm{SME}}-\theta'_{\mathrm{SME}})^\top(\eta_{\mathrm{MLE}}-\eta'_{\mathrm{MLE}}).$
Let $\Delta_J(\theta_{\mathrm{SME}},\eta_{\mathrm{MLE}};\theta'_{\mathrm{SME}},\eta'_{\mathrm{MLE}}) := \Delta_J[p_{\theta_{\mathrm{SME}}}, p^{\eta_{\mathrm{MLE}}}; p_{\theta'_{\mathrm{SME}}}, p^{\eta'_{\mathrm{MLE}}}]$. Then we have
$D_J[m,m'] \approx \tilde{D}_J[m,m'] := \Delta_J(\theta_{\mathrm{SME}},\eta_{\mathrm{MLE}};\theta'_{\mathrm{SME}},\eta'_{\mathrm{MLE}}).$
Note that $\Delta_J$ is not a proper divergence as it may be negative since, in general, $\bar\eta_{\mathrm{MLE}}\neq\nabla F(\bar\theta_{\mathrm{SME}})$. That is, $\Delta_J$ may not satisfy the identity of indiscernibles. The approximation $\Delta_J$ is exact when $k_1=k_2=1$, with both $m$ and $m'$ belonging to an exponential family.
We experimentally show in Section 4 that the $\tilde{D}_J$ heuristic yields approximations of the JD that are faster than the baseline MC estimations by several orders of magnitude while approximating the JD reasonably well when the mixtures have a small number of modes.
For example, Figure 2 displays the unnormalized PEDs obtained by converting two Gaussian mixture models ($k_1=10$ components and $k_2=11$ components) into PEDs of a PEF of order $D=8$. The MC estimation of the JD with $s=10^6$ samples yields $0.2633$, while the PED approximation of Equation (8) on corresponding PEFs yields $0.2618$ (the relative error is $0.00585$, or about $0.585\%$). It took about $2642.581$ milliseconds (with $s=10^6$ on a Dell Inspiron 7472 laptop) to MC estimate the JD, while it took about $0.827$ milliseconds with the PEF approximation. Thus, we obtained a speed-up factor of about $3190$ (three orders of magnitude) for this particular example. Notice that when viewing Figure 2, we tend to visually evaluate the dissimilarity using the total variation distance (a metric distance):
$D_{\mathrm{TV}}[m,m'] := \frac{1}{2}\int |m(x)-m'(x)|\,\mathrm{d}x,$
rather than by a dissimilarity relating to the KLD. Using Pinsker's inequality [40,41], we have $D_J[m,m'] \geq D_{\mathrm{TV}}[m,m']^2$ and $D_{\mathrm{TV}}[m,m'] \in [0,1]$. Thus, mixtures with a large TV distance (e.g., $D_{\mathrm{TV}}[m,m']=0.1$) may still have a small JD, since Pinsker's inequality only guarantees $D_J[m,m'] \geq 0.01$.
Let us point out that our approximation heuristic is deterministic, while the MC estimations are stochastic: That is, each MC run (Equation (4)) returns a different result, and a single MC run may yield a very bad approximation of the true Jeffreys divergence.
We compare our fast heuristic $\tilde{D}_J[m,m'] = (\theta_{\mathrm{SME}}-\theta'_{\mathrm{SME}})^\top(\eta_{\mathrm{MLE}}-\eta'_{\mathrm{MLE}})$ with two more costly methods relying on numerical procedures to convert natural ↔ moment parameters:
  • Simplify GMMs $m_i$ into $p^{\bar\eta_i^{\mathrm{MLE}}}$, and approximately convert the $\bar\eta_i^{\mathrm{MLE}}$'s into $\tilde\theta_i^{\mathrm{MLE}}$'s. Then approximate the Jeffreys divergence as
    $D_J[m_1,m_2] \approx \tilde\Delta_J^{\mathrm{MLE}}[m_1,m_2] := (\tilde\theta_2^{\mathrm{MLE}}-\tilde\theta_1^{\mathrm{MLE}})^\top(\bar\eta_2^{\mathrm{MLE}}-\bar\eta_1^{\mathrm{MLE}}).$
  • Simplify GMMs $m_i$ into $p_{\bar\theta_i^{\mathrm{SME}}}$, and approximately convert the $\bar\theta_i^{\mathrm{SME}}$'s into $\tilde\eta_i^{\mathrm{SME}}$'s. Then approximate the Jeffreys divergence as
    $D_J[m_1,m_2] \approx \tilde\Delta_J^{\mathrm{SME}}[m_1,m_2] := (\bar\theta_2^{\mathrm{SME}}-\bar\theta_1^{\mathrm{SME}})^\top(\tilde\eta_2^{\mathrm{SME}}-\tilde\eta_1^{\mathrm{SME}}).$

1.4. Contributions and Paper Outline

Our contributions are summarized as follows:
  • We explain how to convert any continuous density r ( x ) (including GMMs) into a polynomial exponential density in Section 2 using integral-based extensions of the Maximum Likelihood Estimator [22] (MLE estimates in the moment parameter space H, Theorem 1 and Corollary 1) and the Score Matching Estimator [27] (SME estimates in the natural parameter space Θ , Theorem 3). We show a connection between SME and the Moment Linear System Estimator [28] (MLSE).
  • We report a closed-form formula to evaluate the goodness-of-fit of a polynomial family density to a GMM in Section 3 using an extension of the Hyvärinen divergence [42] (Theorem 4) and discuss the problem of model selection for choosing the order D of the polynomial exponential family.
  • We show how to approximate the Jeffreys divergence between GMMs using a pair of natural/moment parameter PED conversions and present experimental results that display a gain of several orders of magnitude in performance when compared to the vanilla Monte Carlo estimator in Section 4. We observe that the quality of the approximations depends on the number of modes of the GMMs [43]. However, calculating or counting the modes of a GMM is a difficult problem in its own right [43].
The paper is organized as follows: In Section 2, we show how to convert arbitrary probability density functions into polynomial exponential densities using the integral-based Maximum Likelihood Estimator (MLE) and Score Matching Estimator (SME). We describe a Maximum Entropy method to iteratively convert moment parameters into natural parameters in Section 2.3.1. It is followed by Section 3, which shows how to calculate in closed form the order-2 Hyvärinen divergence between a GMM and a polynomial exponential density. We use this criterion to perform model selection. Section 4 presents our computational experiments that demonstrate a gain of several orders of magnitude for GMMs with a small number of modes. Finally, we conclude in Section 5.

2. Converting Finite Mixtures to Exponential Family Densities

We report two generic methods to convert a mixture $m(x)$ into a density $p_\theta(x)$ of an exponential family: The first method, extending the MLE in Section 2.1, proceeds using the mean parameterization $\eta$, while the second method, extending the SME in Section 2.2, uses the natural parameterization of the exponential family. We then describe how to convert the moment parameters into natural parameters (and vice versa) for polynomial exponential families in Section 2.3. We show how to instantiate these generic conversion methods for GMMs: It requires calculating non-central moments of GMMs in closed form. The efficient computation of raw moments of GMMs is detailed in Section 2.4.

2.1. Conversion Using the Moment Parameterization (MLE)

Let us recall that in order to estimate the moment or mean parameter $\hat\eta_{\mathrm{MLE}}$ of a density belonging to an exponential family
$\mathcal{E}_t := \left\{p_\theta(x) = \exp\left(t(x)^\top\theta - F(\theta)\right)\right\}$
with a sufficient statistic vector $t(x)=[t_1(x)\ \cdots\ t_D(x)]^\top$ from an i.i.d. sample set $x_1,\ldots,x_n$, the Maximum Likelihood Estimator (MLE) [22,44] yields
$\max_\theta \prod_{i=1}^{n} p_\theta(x_i) \equiv \max_\theta \sum_{i=1}^{n}\log p_\theta(x_i) = \max_\theta\ E(\theta) := \sum_{i=1}^{n} t(x_i)^\top\theta - nF(\theta),$
$\hat\eta_{\mathrm{MLE}} = \nabla F(\hat\theta_{\mathrm{MLE}}) = \frac{1}{n}\sum_{i=1}^{n} t(x_i).$
In statistics, Equation (12) is called the estimating equation. The MLE exists under mild conditions [22] and is unique since the Hessian $\nabla^2 E(\theta) = -n\nabla^2 F(\theta)$ of the estimating function is negative-definite (log-normalizers $F(\theta)$ are always strictly convex and real analytic [22]). The MLE is consistent and asymptotically normally distributed [22]. Furthermore, since the MLE satisfies the equivariance property [22], we have $\hat\theta_{\mathrm{MLE}} = \nabla F^*(\hat\eta_{\mathrm{MLE}})$, where $\nabla F^*$ denotes the gradient of the conjugate function $F^*(\eta)$ of the cumulant function $F(\theta)$ of the exponential family. In general, $\nabla F^*$ is intractable for PEDs with $D\geq 4$.
By considering the empirical distribution
$p_e(x) := \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}(x),$
where δ x i ( · ) denotes the Dirac distribution at location x i , we can formulate the MLE problem as a minimum KLD problem between the empirical distribution and a density of the exponential family:
$\min_\theta D_{\mathrm{KL}}[p_e:p_\theta] = \min_\theta\left(-H[p_e] - E_{p_e}[\log p_\theta(x)]\right) \equiv \max_\theta \frac{1}{n}\sum_{i=1}^{n}\log p_\theta(x_i),$
since the entropy term H [ p e ] is independent of θ .
Thus, to convert an arbitrary smooth density $r(x)$ into a density $p_\theta$ of an exponential family $\mathcal{E}_t$, we have to solve the following minimization problem:
$\min_{\theta\in\Theta} D_{\mathrm{KL}}[r:p_\theta].$
Rewriting the minimization problem as:
$\min_\theta D_{\mathrm{KL}}[r:p_\theta] = \min_\theta\left(-\int r(x)\log p_\theta(x)\,\mathrm{d}x + \int r(x)\log r(x)\,\mathrm{d}x\right) \equiv \min_\theta \int r(x)\left(F(\theta)-\theta^\top t(x)\right)\mathrm{d}x = \min_\theta\ \bar{E}(\theta) := F(\theta) - \theta^\top E_r[t(x)],$
we obtain
$\bar\eta_{\mathrm{MLE}}(r) := E_r[t(x)] = \int_{\mathcal{X}} r(x)\,t(x)\,\mathrm{d}x.$
The minimum is unique since $\nabla^2\bar{E}(\theta) = \nabla^2 F(\theta) \succ 0$ (positive-definite matrix). This conversion procedure $r(x)\rightarrow p^{\bar\eta_{\mathrm{MLE}}(r)}(x)$ can be interpreted as an integral extension of the MLE, hence the bar notation in $\bar\eta_{\mathrm{MLE}}$. Notice that the ordinary MLE is $\hat\eta_{\mathrm{MLE}} = \bar\eta_{\mathrm{MLE}}(p_e)$, obtained for the empirical distribution $r=p_e$: $\bar\eta_{\mathrm{MLE}}(p_e) = \frac{1}{n}\sum_{i=1}^{n} t(x_i)$.
Theorem 1.
The best density $p^{\bar\eta}(x)$ of an exponential family $\mathcal{E}_t = \{p_\theta\ :\ \theta\in\Theta\}$ minimizing the Kullback–Leibler divergence $D_{\mathrm{KL}}[r:p_\theta]$ between a density $r$ and a density $p_\theta$ of $\mathcal{E}_t$ is given by the moment parameter $\bar\eta = E_r[t(x)] = \int_{\mathcal{X}} r(x)\,t(x)\,\mathrm{d}x$.
Notice that when $r=p_\theta$, we obtain $\bar\eta = E_{p_\theta}[t(x)] = \eta$, so that the method $\bar\eta_{\mathrm{MLE}}(\cdot)$ is consistent (by analogy to the finite i.i.d. MLE case): $\bar\eta_{\mathrm{MLE}}(p_\theta) = \eta = \nabla F(\theta)$.
The KLD right-sided minimization problem can be interpreted as an information projection of r onto E t . As a corollary of Theorem 1, we obtain:
Corollary 1
(Best right-sided KLD simplification of a mixture). The best right-sided KLD simplification of a homogeneous mixture of exponential families [2] $m(x)=\sum_{i=1}^{k} w_i p_{\theta_i}(x)$ with $p_{\theta_i}\in\mathcal{E}_t$, i.e., $\min_{\theta\in\Theta} D_{\mathrm{KL}}[m:p_\theta]$, into a single component $p^\eta(x)$ is given by $\eta = \bar\eta_{\mathrm{MLE}}(m) = E_m[t(x)] = \sum_{i=1}^{k} w_i\eta_i = \bar\eta$.
Equation (16) allows us to greatly simplify the proofs reported in [45] for mixture simplifications that involved the explicit use of the Pythagoras’ theorem in the dually flat spaces of exponential families [42]. Figure 3 displays the geometric interpretation of the best KLD simplification of a GMM with ambient space the probability space ( R , B ( R ) , μ L ) , where μ L denotes the Lebesgue measure and B ( R ) the Borel σ -algebra of R .
Let us notice that Theorem 1 yields an algebraic system for polynomial exponential densities, i.e., $E_m[x^i] = \bar\eta_i$ for $i\in\{1,\ldots,D\}$, to compute $\bar\eta_{\mathrm{MLE}}(m)$ for a given GMM $m(x)$ (since the raw moments $E_m[x^i]$ are algebraic). In contrast with this result, the MLE of i.i.d. observations is in general not an algebraic function [46] but a transcendental function.
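For a PEF of order $D$, Theorem 1 therefore amounts to reading off the first $D$ raw moments of the GMM. A short sketch (ours, not the paper's code), relying on SciPy's norm.moment for the component moments:

```python
import numpy as np
from scipy.stats import norm

def gmm_raw_moments(w, mu, sigma, orders):
    """Raw moments E_m[x^l] of a univariate GMM for each order l in `orders`."""
    return np.array([sum(wi * norm(mi, si).moment(l) for wi, mi, si in zip(w, mu, sigma))
                     for l in orders])

def eta_mle(w, mu, sigma, D):
    """Moment parameter of the right-sided KLD projection of the GMM onto a PEF of order D."""
    return gmm_raw_moments(w, mu, sigma, range(1, D + 1))

# Example: a 2-GMM projected onto a quartic PEF (D = 4)
w, mu, sigma = np.array([0.4, 0.6]), np.array([-2.0, 1.0]), np.array([1.0, 0.5])
print(eta_mle(w, mu, sigma, D=4))
```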

2.2. Converting to a PEF Using the Natural Parameterization (SME)

Integral-Based Score Matching Estimator (SME)

To convert the density r ( x ) into an exponential density with sufficient statistics t ( x ) , we can also use the Score Matching Estimator [27,47] (SME). The Score Matching Estimator minimizes the Hyvärinen divergence D H (Equation (4) of [47]):
$D_H[p:p_\theta] := \frac{1}{2}\int \left(\partial_x\log p(x) - \partial_x\log p_\theta(x)\right)^2 p(x)\,\mathrm{d}x.$
The Hyvärinen divergence is also known as half of the relative Fisher information in the optimal transport community (Equation (8) of [48] or Equation (2.2) in [49]), where it is defined for two measures μ and ν as follows:
$I[\mu:\nu] := \int_{\mathcal{X}}\left\|\nabla\log\frac{\mathrm{d}\mu}{\mathrm{d}\nu}\right\|^2\mathrm{d}\mu = 4\int_{\mathcal{X}}\left\|\nabla\sqrt{\frac{\mathrm{d}\mu}{\mathrm{d}\nu}}\right\|^2\mathrm{d}\nu.$
Moreover, the relative Fisher information can be defined on complete Riemannian manifolds [48].
That is, we convert a density $r(x)$ into an exponential family density $p_\theta(x)$ using the following minimization problem:
$\theta_{\mathrm{SME}}(r) = \arg\min_{\theta\in\Theta} D_H[r:p_\theta].$
Beware that in statistics, the score $s_\theta(x)$ is defined by $\nabla_\theta\log p_\theta(x)$, but in score matching, we refer to the "data score" defined by $\partial_x\log p_\theta(x)$. Hyvärinen [47] gave an explanation of the naming "score" using a spurious location parameter.
  • Generic solution: It can be shown that for exponential families [47], we obtain the following solution:
    $\theta_{\mathrm{SME}}(r) = -E_r[A(x)]^{-1}\times E_r[b(x)],$
    where
    $A(x) := [t_i'(x)\,t_j'(x)]_{ij}$
    is a $D\times D$ symmetric matrix, and
    $b(x) = [t_1''(x)\ \cdots\ t_D''(x)]^\top$
    is a $D$-dimensional column vector.
Theorem 2.
The best conversion of a density $r(x)$ into a density $p_\theta(x)$ of an exponential family minimizing the right-sided Hyvärinen divergence is
$\theta_{\mathrm{SME}}(r) = -E_r\left[[t_i'(x)\,t_j'(x)]_{ij}\right]^{-1}\times E_r\left[[t_1''(x)\ \cdots\ t_D''(x)]^\top\right].$
  • Solution instantiated for polynomial exponential families:
    For polynomial exponential families of order $D$, we have $t_i'(x) = i\,x^{i-1}$ and $t_i''(x) = i(i-1)\,x^{i-2}$, and therefore, we have
    $A_D = E_r[A(x)] = \left[i\,j\,\mu_{i+j-2}(r)\right]_{ij},$
    and
    $b_D = E_r[b(x)] = \left[j(j-1)\,\mu_{j-2}(r)\right]_j,$
    where $\mu_l(r) := E_r[X^l]$ denotes the $l$-th raw moment of the distribution $X\sim r(x)$ (with the convention that $\mu_{-1}(r)=0$). For a probability density function $r(x)$, we have $\mu_0(r)=1$.
    Thus, the integral-based SME of a density $r$ is:
    $\theta_{\mathrm{SME}}(r) = -\left[i\,j\,\mu_{i+j-2}(r)\right]_{ij}^{-1}\times\left[j(j-1)\,\mu_{j-2}(r)\right]_j.$
    For example, matrix $A_4$ is
    $A_4 = \begin{pmatrix} \mu_0 & 2\mu_1 & 3\mu_2 & 4\mu_3 \\ 2\mu_1 & 4\mu_2 & 6\mu_3 & 8\mu_4 \\ 3\mu_2 & 6\mu_3 & 9\mu_4 & 12\mu_5 \\ 4\mu_3 & 8\mu_4 & 12\mu_5 & 16\mu_6 \end{pmatrix}.$
  • Faster PEF solutions using Hankel matrices:
    The method of Cobb et al. [28] (1983) anticipated the Score Matching method of Hyvärinen (2005). It can be derived from Stein’s lemma for exponential families [50]. The integral-based Score Matching method is consistent, i.e., if r = p θ , then θ ¯ SME = θ : The probabilistic proof for r ( x ) = p e ( x ) is reported as Theorem 2 of [28]. The integral-based proof is based on the property that arbitrary order partial mixed derivatives can be obtained from higher-order partial derivatives with respect to θ 1 [29]:
    $\partial_1^{i_1}\cdots\partial_D^{i_D} F(\theta) = \partial_1^{\sum_{j=1}^{D} j\,i_j} F(\theta),$
    where $\partial_i := \frac{\partial}{\partial\theta_i}$.
    The complexity of the direct SME method is O ( D 3 ) as it requires the inverse of the D × D -dimensional matrix A D .
    We show how to lower this complexity by reporting an equivalent method (originally presented in [28]) that relies on recurrence relationships between the moments of p θ ( x ) for PEDs. Recall that μ l ( r ) denotes the l-th raw moment E r [ x l ] .
    Let A = [ a i + j 2 ] i j denote the D × D symmetric matrix with a i + j 2 ( r ) = μ i + j 2 ( r ) (with a 0 ( r ) = μ 0 ( r ) = 1 ), and b = [ b i ] i the D-dimensional vector with b i ( r ) = ( i + 1 ) μ i ( r ) . We solve the system A β = b to obtain β = A 1 b . We then obtain the natural parameter θ ¯ SME from the vector β as
    $\bar\theta_{\mathrm{SME}} = \left[\frac{\beta_1}{2}\ \cdots\ \frac{\beta_i}{i+1}\ \cdots\ \frac{\beta_D}{D+1}\right]^\top.$
    Now, if we inspect the matrix $A_D = [\mu_{i+j-2}(r)]_{ij}$, we find that $A_D$ is a Hankel matrix: A Hankel matrix has constant anti-diagonals and can be inverted in quadratic time [51,52] instead of cubic time for a general $D\times D$ matrix. (The inverse of a Hankel matrix is a Bezoutian matrix [53].) Moreover, a Hankel matrix can be stored using linear memory (store $2D-1$ coefficients) instead of the quadratic memory of regular matrices.
    For example, matrix $A_4$ is:
    $A_4 = \begin{pmatrix} \mu_0 & \mu_1 & \mu_2 & \mu_3 \\ \mu_1 & \mu_2 & \mu_3 & \mu_4 \\ \mu_2 & \mu_3 & \mu_4 & \mu_5 \\ \mu_3 & \mu_4 & \mu_5 & \mu_6 \end{pmatrix},$
    and requires only $6 = 2\times 4 - 2$ coefficients to be stored (since $\mu_0=1$) instead of $4\times 4 = 16$. The order-$d$ moment matrix
    $A_d := [\mu_{i+j-2}]_{ij} = \begin{pmatrix} \mu_0 & \mu_1 & \cdots & \mu_d \\ \mu_1 & \mu_2 & & \vdots \\ \vdots & & \ddots & \\ \mu_d & \cdots & & \mu_{2d} \end{pmatrix}$
    is a Hankel matrix stored using $2d+1$ coefficients:
    $A_d =: \mathrm{Hankel}(\mu_0,\mu_1,\ldots,\mu_{2d}).$
    In statistics, those matrices $A_d$ are called moment matrices and are well studied [54,55,56]. The variance $\mathrm{Var}[X]$ of a random variable $X$ can be expressed as the determinant of the order-2 moment matrix:
    $\mathrm{Var}[X] = E[(X-\mu)^2] = E[X^2] - E[X]^2 = \mu_2 - \mu_1^2 = \det\begin{pmatrix} 1 & \mu_1 \\ \mu_1 & \mu_2 \end{pmatrix} \geq 0.$
    This observation yields a generalization of the notion of variance to $d+1$ random variables $X_1,\ldots,X_{d+1}\sim_{\mathrm{iid}} F_X$: $E\left[\prod_{j>i}(X_i-X_j)^2\right] = (d+1)!\,\det(A_d) \geq 0$. The variance can be expressed as $E[\frac{1}{2}(X_1-X_2)^2]$ for $X_1,X_2\sim_{\mathrm{iid}} F_X$. See [57] (Chapter 5) for a detailed description related to U-statistics.
    For GMMs r, the raw moments μ l ( r ) to build matrix A D can be calculated in closed-form, as explained in Section 2.4.
Theorem 3 (Score matching GMM conversion).
The Score Matching conversion of a GMM $m(x)$ into a polynomial exponential density $p_{\theta_{\mathrm{SME}}(m)}(x)$ of order $D$ is obtained as
$\theta_{\mathrm{SME}}(m) = -\left[i\,j\,m_{i+j-2}\right]_{ij}^{-1}\times\left[j(j-1)\,m_{j-2}\right]_j,$
where $m_i = E_m[x^i]$ denotes the $i$-th non-central moment of the GMM $m(x)$.
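The sketch below (ours, not the paper's reference implementation) instantiates Theorem 3 with a direct $O(D^3)$ linear solve and the sign convention $\theta_{\mathrm{SME}} = -A_D^{-1}b_D$; as a sanity check, a single Gaussian with $D=2$ recovers its exact natural parameter $(\mu/\sigma^2, -1/(2\sigma^2))$:

```python
import numpy as np
from scipy.stats import norm

def gmm_raw_moment(w, mu, sigma, l):
    """l-th raw moment E_m[x^l] of a univariate GMM."""
    return sum(wi * norm(mi, si).moment(l) for wi, mi, si in zip(w, mu, sigma))

def theta_sme(w, mu, sigma, D):
    """Natural parameter of the order-D PED fitted to the GMM by integral score matching."""
    mom = [gmm_raw_moment(w, mu, sigma, l) for l in range(2 * D - 1)]   # mu_0 .. mu_{2D-2}
    A = np.array([[i * j * mom[i + j - 2] for j in range(1, D + 1)] for i in range(1, D + 1)])
    b = np.array([i * (i - 1) * (mom[i - 2] if i >= 2 else 0.0) for i in range(1, D + 1)])
    return -np.linalg.solve(A, b)      # A is a rescaled Hankel moment matrix

# Sanity check: a single Gaussian N(1, 2^2) with D = 2
print(theta_sme([1.0], [1.0], [2.0], D=2))   # ~ [0.25, -0.125] = (mu/sigma^2, -1/(2 sigma^2))
```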

2.3. Converting Numerically Moment Parameters from/to Natural Parameters

Recall that our fast heuristic approximates the Jeffreys divergence by
$\tilde{D}_J[m,m'] := (\bar\theta_{\mathrm{SME}}(m)-\bar\theta_{\mathrm{SME}}(m'))^\top(\bar\eta_{\mathrm{MLE}}(m)-\bar\eta_{\mathrm{MLE}}(m')).$
Because $F$ and $F^*$ are not available in closed form (except for the case $D=2$ of the normal family), we cannot obtain $\theta$ from a given $\eta$ (using $\theta=\nabla F^*(\eta)$) nor $\eta$ from a given $\theta$ (using $\eta=\nabla F(\theta)$).
However, provided that we can approximate numerically $\tilde\eta\approx\nabla F(\theta)$ and $\tilde\theta\approx\nabla F^*(\eta)$, we also consider these two approximations for the Jeffreys divergence:
$\tilde\Delta_J^{\mathrm{MLE}}[m_1,m_2] := (\tilde\theta_2^{\mathrm{MLE}}-\tilde\theta_1^{\mathrm{MLE}})^\top(\bar\eta_2^{\mathrm{MLE}}-\bar\eta_1^{\mathrm{MLE}}),$
and
$\tilde\Delta_J^{\mathrm{SME}}[m_1,m_2] := (\bar\theta_2^{\mathrm{SME}}-\bar\theta_1^{\mathrm{SME}})^\top(\tilde\eta_2^{\mathrm{SME}}-\tilde\eta_1^{\mathrm{SME}}).$
We show how to numerically estimate $\tilde\theta_{\mathrm{MLE}}\approx\nabla F^*(\bar\eta_{\mathrm{MLE}})$ from $\bar\eta_{\mathrm{MLE}}$ in Section 2.3.1. Next, in Section 2.3.2, we show how to stochastically estimate $\tilde\eta_{\mathrm{SME}}\approx\nabla F(\bar\theta_{\mathrm{SME}})$.

2.3.1. Converting Moment Parameters to Natural Parameters Using Maximum Entropy

Let us report the iterative approximation technique of [58] (which extended the method described in [35]) based on solving a maximum entropy problem (MaxEnt problem). This method will be useful when comparing our fast heuristic D ˜ J [ m , m ] with the approximations Δ ˜ J MLE [ m , m ] and Δ ˜ J SME [ m , m ] .
The density $p_\theta$ of any exponential family can be characterized as a maximum entropy distribution given the $D$ moment constraints $E_{p_\theta}[t_i(x)]=\eta_i$: Namely, $\max_p h(p)$ subject to the $D+1$ moment constraints $\int t_i(x)p(x)\,\mathrm{d}x=\eta_i$ for $i\in\{0,\ldots,D\}$, where we added by convention $\eta_0=1$ and $t_0(x)=1$ (so that $\int p(x)\,\mathrm{d}x=1$). The solution of this MaxEnt problem [58] is $p(x)=p_\lambda$, where $\lambda$ collects the $D+1$ Lagrangian parameters. Here, we adopt the following canonical parameterization of the densities of an exponential family:
$p_\lambda(x) := \exp\left(\sum_{i=0}^{D}\lambda_i t_i(x)\right).$
That is, $\lambda_0 = -F(\theta)$ and $\lambda_i = \theta_i$ for $i\in\{1,\ldots,D\}$. Parameter $\lambda$ is a kind of augmented natural parameter that includes the (negated) log-normalizer in its first coefficient.
Let $K_i(\lambda) := E_{p_\lambda}[t_i(x)] = \eta_i$ for $i\in\{0,\ldots,D\}$ denote the set of $D+1$ non-linear equations. The Iterative Linear System Method [58] (ILSM) converts $p^\eta$ to $p_\theta$ iteratively. We initialize $\lambda^{(0)}$ to $\bar\theta_{\mathrm{SME}}$ (and calculate numerically $\lambda_0^{(0)} = -F(\bar\theta_{\mathrm{SME}})$).
At iteration $t$ with current estimate $\lambda^{(t)}$, we use the following first-order Taylor approximation:
$K_i(\lambda) \approx K_i(\lambda^{(t)}) + (\lambda-\lambda^{(t)})^\top\nabla K_i(\lambda^{(t)}).$
Let $H(\lambda)$ denote the $(D+1)\times(D+1)$ matrix:
$H(\lambda) := \left[\frac{\partial K_i(\lambda)}{\partial\lambda_j}\right]_{ij}.$
We have
$H_{ij}(\lambda) = H_{ji}(\lambda) = E_{p_\lambda}[t_i(x)\,t_j(x)].$
We update as follows:
$\lambda^{(t+1)} = \lambda^{(t)} + H^{-1}(\lambda^{(t)})\begin{pmatrix} \eta_0 - K_0(\lambda^{(t)}) \\ \vdots \\ \eta_D - K_D(\lambda^{(t)}) \end{pmatrix}.$
For a PEF of order $D$, we have
$H_{ij}(\lambda) = E_{p_\lambda}[x^{i+j-2}] = \mu_{i+j-2}(p_\lambda).$
This yields a moment matrix H λ (Hankel matrix), which can be inverted in quadratic time [52]. In our setting, the moment matrix is invertible because | H | > 0 , see [59].
Let $\tilde\theta_T(\eta)$ denote the natural parameter of the PED obtained after $T$ iterations (retrieved from $\lambda^{(T)}$). We have the following approximation of the JD:
$D_J[m,m'] \approx (\tilde\theta_T(\eta)-\tilde\theta_T(\eta'))^\top(\eta-\eta').$
The method is costly because we need to numerically calculate the $\mu_{i+j-2}(p_\lambda)$'s and the $K_i$'s (e.g., using a univariate Simpson integrator). Another potential method consists of estimating these expectations using acceptance-rejection sampling [60,61]. We may also consider the holonomic gradient descent [29]. Thus, the $\eta\rightarrow\theta$ conversion method is costly. Our heuristic $\tilde\Delta_J$ bypasses this costly moment-to-natural parameter conversion by converting each mixture $m$ to a pair $(p_{\theta_{\mathrm{SME}}}, p^{\eta_{\mathrm{MLE}}})$ of PEDs parameterized in the natural and moment parameters, respectively (i.e., loosely speaking, we untangle these dual parameterizations).
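For illustration, here is a rough Newton-type sketch of the ILSM conversion (our own reading of this section, with hypothetical helper names): the $D+1$ constraints $\int x^i\exp(\sum_j\lambda_j x^j)\,\mathrm{d}x=\eta_i$ are solved iteratively, with the higher integrals playing the role of the Hankel Jacobian. The integrals are computed by quadrature on a truncated interval, so this is only meant for small orders $D$ and a reasonable initial guess such as $\bar\theta_{\mathrm{SME}}$.

```python
import numpy as np
from scipy.integrate import quad

def ilsm(eta, lam0, n_iter=15, lim=15.0):
    """Newton iterations for lambda = (lambda_0, ..., lambda_D) matching the target moments eta."""
    lam = np.array(lam0, dtype=float)
    D = len(lam) - 1
    for _ in range(n_iter):
        q = lambda x: np.exp(np.polyval(lam[::-1], x))        # unnormalized exp-polynomial
        K = np.array([quad(lambda x, i=i: x**i * q(x), -lim, lim)[0]
                      for i in range(2 * D + 1)])             # integrals of x^i q(x)
        H = np.array([[K[i + j] for j in range(D + 1)] for i in range(D + 1)])   # Hankel Jacobian
        lam = lam + np.linalg.solve(H, np.asarray(eta) - K[:D + 1])
    return lam

# Recover the standard normal, lambda ~ (-log sqrt(2 pi), 0, -1/2), from eta = (1, 0, 1):
print(ilsm(eta=[1.0, 0.0, 1.0], lam0=[-1.0, 0.0, -0.4]))
```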

2.3.2. Converting Natural Parameters to Moment Parameters

Given a PED $p_\theta(x)$, we have to find its corresponding moment parameter $\eta$ (i.e., such that $p_\theta=p^\eta$). Since $\eta=E_{p_\theta}[t(x)]$, we sample $s$ i.i.d. variates $x_1,\ldots,x_s$ from $p_\theta$ using acceptance-rejection sampling [60,61] or any other Markov chain Monte Carlo technique [62] and estimate $\hat\eta$ as:
$\hat\eta = \frac{1}{s}\sum_{i=1}^{s} t(x_i).$
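A minimal sketch (ours) of this natural-to-moment conversion, here with a random-walk Metropolis sampler targeting the unnormalized PED; an acceptance-rejection sampler [60,61] could be substituted:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_q(x, theta):
    """log of the unnormalized PED q_theta(x) = exp(sum_i theta_i x^i), theta = (theta_1,...,theta_D)."""
    return np.polyval(np.concatenate([theta[::-1], [0.0]]), x)

def eta_from_theta(theta, n_samples=50_000, step=1.0, burn=1_000):
    """Estimate eta = E_{p_theta}[t(x)] by averaging t(x) over Metropolis samples."""
    D = len(theta)
    x, samples = 0.0, []
    for it in range(n_samples + burn):
        prop = x + step * rng.normal()
        if np.log(rng.random()) < log_q(prop, theta) - log_q(x, theta):
            x = prop                                  # accept the proposed move
        if it >= burn:
            samples.append(x)
    xs = np.array(samples)
    return np.array([np.mean(xs ** i) for i in range(1, D + 1)])

# Sanity check on the standard normal written as a PED: theta = (0, -1/2)  ->  eta ~ (0, 1)
print(eta_from_theta(np.array([0.0, -0.5])))
```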

2.4. Raw Non-Central Moments of Normal Distributions and GMMs

In order to implement the MLE or SME Gaussian mixture conversion procedures, we need to calculate the raw moments of a Gaussian mixture model. The $l$-th raw moment $E[Z^l]$ of a standard normal distribution $Z\sim N(0,1)$ is $0$ when $l$ is odd (since the standard normal density is an even function) and $(l-1)!! = \frac{l!}{2^{l/2}(l/2)!}$ when $l$ is even, where $n!! = \sqrt{\frac{2^{n+1}}{\pi}}\,\Gamma\!\left(\frac{n}{2}+1\right) = \prod_{k=0}^{\lceil n/2\rceil-1}(n-2k)$ is the double factorial (with $(-1)!!=1$ by convention). Using the binomial theorem, we deduce that a normal distribution $X=\mu+\sigma Z$ has finite moments:
$\mu_l(p_{\mu,\sigma}) = E_{p_{\mu,\sigma}}[X^l] = E[(\mu+\sigma Z)^l] = \sum_{i=0}^{l}\binom{l}{i}\mu^{l-i}\sigma^i E[Z^i].$
That is, we have
$\mu_l(p_{\mu,\sigma}) = \sum_{i=0}^{\lfloor l/2\rfloor}\binom{l}{2i}(2i-1)!!\,\mu^{l-2i}\sigma^{2i},$
where $n!!$ denotes the double factorial:
$n!! = \prod_{k=0}^{\lceil n/2\rceil-1}(n-2k) = \begin{cases}\prod_{k=1}^{n/2}(2k) & n\ \text{is even},\\ \prod_{k=1}^{(n+1)/2}(2k-1) & n\ \text{is odd}.\end{cases}$
By the linearity of the expectation $E[\cdot]$, we deduce the $l$-th raw moment of a GMM $m(x)=\sum_{i=1}^{k} w_i p_{\mu_i,\sigma_i}(x)$:
$\mu_l(m) = \sum_{i=1}^{k} w_i\,\mu_l(p_{\mu_i,\sigma_i}).$
Notice that by using [63], we can extend this formula to truncated normals and truncated GMMs. Thus, computing the first $O(D)$ raw moments of a GMM with $k$ components can be done in $O(kD^2)$ time using the Pascal triangle method for computing the binomial coefficients. See also [64].
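The sketch below (ours) implements the double-factorial formula for the raw moments of a normal component and the linearity rule for a GMM, cross-checking the values against SciPy:

```python
import numpy as np
from math import comb
from scipy.stats import norm

def double_factorial(n):
    """n!! with the convention (-1)!! = 1."""
    return 1 if n <= 0 else n * double_factorial(n - 2)

def normal_raw_moment(mu, sigma, l):
    """E[(mu + sigma Z)^l] = sum_i C(l, 2i) (2i-1)!! mu^(l-2i) sigma^(2i)."""
    return sum(comb(l, 2 * i) * double_factorial(2 * i - 1) * mu ** (l - 2 * i) * sigma ** (2 * i)
               for i in range(l // 2 + 1))

def gmm_raw_moment(w, mu, sigma, l):
    """l-th raw moment of a GMM, by linearity of the expectation."""
    return sum(wi * normal_raw_moment(mi, si, l) for wi, mi, si in zip(w, mu, sigma))

w, mu, sigma = [0.3, 0.7], [-1.0, 2.0], [0.5, 1.5]
for l in range(1, 7):
    ref = sum(wi * norm(mi, si).moment(l) for wi, mi, si in zip(w, mu, sigma))
    print(l, gmm_raw_moment(w, mu, sigma, l), ref)   # the two columns agree
```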

3. Goodness-of-Fit between GMMs and PEDs: Higher Order Hyvärinen Divergences

Once we have converted a GMM $m(x)$ into an unnormalized PED $q_{\theta_m}(x) = \tilde{p}_{\theta_m}(x)$, we would like to evaluate the quality of the conversion, i.e., $D[m(x):q_{\theta_m}(x)]$, using a statistical divergence $D[\cdot:\cdot]$. This divergence shall allow us to perform model selection by choosing the order $D$ of the PEF so that $D[m(x):p_\theta(x)]\leq\epsilon$ for $\theta\in\mathbb{R}^D$, where $\epsilon>0$ is a prescribed threshold. Since PEDs have computationally intractable normalization constants, we consider a right-sided projective divergence [42] $D[p:q]$ that satisfies $D[p:\lambda q] = D[p:q] = D[p:\tilde{q}]$ for any $\lambda>0$. For example, we may consider the $\gamma$-divergence [65], which is a two-sided projective divergence: $D_\gamma[\lambda p:\lambda' q] = D_\gamma[p:q] = D_\gamma[\tilde{p}:\tilde{q}]$ for any $\lambda,\lambda'>0$, and which converges to the KLD when $\gamma\rightarrow 0$. However, the $\gamma$-divergence between a mixture model and an unnormalized PED does not yield a closed-form formula. Moreover, the $\gamma$-divergence between two unnormalized PEDs is expressed using the log-normalizer function $F(\cdot)$, which is computationally intractable [66].
In order to get a closed-form formula for a divergence between a mixture model and an unnormalized PED, we consider the order-$\alpha$ (for $\alpha>0$) Hyvärinen divergence [42] defined as follows:
$D_{H,\alpha}[p:q] := \frac{1}{2}\int p(x)^\alpha\left(\partial_x\log p(x) - \partial_x\log q(x)\right)^2\mathrm{d}x, \quad\alpha>0.$
The Hyvärinen divergence [42] (order-1 Hyvärinen divergence) has also been called the Fisher divergence [27,67,68,69] or relative Fisher information [48]. Notice that when α = 1 , D H , 1 [ p : q ] = D H [ p : q ] , the ordinary Hyvärinen divergence [27].
The Hyvärinen divergence $D_{H,\alpha}$ is a right-sided projective divergence, meaning that it satisfies $D_{H,\alpha}[p:q] = D_{H,\alpha}[p:\lambda q]$ for any $\lambda>0$. That is, we have $D_{H,\alpha}[p:q] = D_{H,\alpha}[p:\tilde{q}]$. Thus, we have $D_{H,\alpha}[m:p_\theta] = D_{H,\alpha}[m:q_\theta]$ for an unnormalized PED $q_\theta = \tilde{p}_\theta$. For statistical estimation, it is enough to have a one-sided projective divergence since we need to evaluate the goodness of fit between the (normalized) empirical distribution $p_e$ and the (unnormalized) parametric density.
For univariate distributions, $\partial_x\log p(x) = \frac{p'(x)}{p(x)}$, and $\frac{p'(x)}{p(x)} = \frac{\tilde{p}'(x)}{\tilde{p}(x)}$, where $\tilde{p}(x)$ is the unnormalized model.
Let $P_\theta(x) := \sum_{i=1}^{D}\theta_i x^i$ be the polynomial (with no constant term) defining the shape of the EPF:
$p_\theta(x) = \exp\left(P_\theta(x) - F(\theta)\right).$
For PEDs with polynomial $P_\theta(x)$, we have $\frac{p_\theta'(x)}{p_\theta(x)} = P_\theta'(x) = \sum_{i=1}^{D} i\,\theta_i x^{i-1}$.
Theorem 4.
The Hyvärinen divergence D H , 2 [ m : q θ ] of order 2 between a Gaussian mixture m ( x ) and a polynomial exponential family density q θ ( x ) is available in closed form.
Proof. 
We have $D_{H,2}[m:q] = \frac{1}{2}\int m(x)^2\left(\frac{m'(x)}{m(x)} - \sum_{i=1}^{D} i\theta_i x^{i-1}\right)^2\mathrm{d}x$ with
$m'(x) = -\sum_{i=1}^{k} w_i\,\frac{x-\mu_i}{\sigma_i^2}\,p(x;\mu_i,\sigma_i)$
denoting the derivative of the Gaussian mixture density $m(x)$. It follows that:
$D_{H,2}[m:q] = \frac{1}{2}\left(\int m'(x)^2\,\mathrm{d}x - 2\sum_{i=1}^{D} i\theta_i\int x^{i-1} m'(x)\,m(x)\,\mathrm{d}x + \sum_{i,j=1}^{D} i\,j\,\theta_i\theta_j\int x^{i+j-2} m(x)^2\,\mathrm{d}x\right),$
where
$\int x^i m'(x)\,m(x)\,\mathrm{d}x = -\sum_{a=1}^{k}\sum_{b=1}^{k} w_a w_b\int\frac{x-\mu_a}{\sigma_a^2}\,x^i\,p(x;\mu_a,\sigma_a)\,p(x;\mu_b,\sigma_b)\,\mathrm{d}x.$
Since $p_a(x)\,p_b(x) = \kappa_{a,b}\,p(x;\mu_{ab},\sigma_{ab})$, with
$\mu_{ab} = \frac{\sigma_b^2\mu_a + \sigma_a^2\mu_b}{\sigma_a^2+\sigma_b^2},\quad \sigma_{ab} = \frac{\sigma_a\sigma_b}{\sqrt{\sigma_a^2+\sigma_b^2}},\quad \kappa_{a,b} = \exp\left(F(\mu_{ab},\sigma_{ab}) - F(\mu_a,\sigma_a) - F(\mu_b,\sigma_b)\right),$
and
$F(\mu,\sigma) = \frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)$
the log-normalizer of the Gaussian exponential family [42], we obtain
$\int p_a(x)\,p_b(x)\,x^l\,\mathrm{d}x = \kappa_{a,b}\,\mu_l(p_{\mu_{ab},\sigma_{ab}}).$
Thus, the Hyvärinen divergence D H , 2 of order 2 between a GMM and a PED is available in closed-form. □
For example, when $k=1$ (i.e., mixture $m$ is a single Gaussian $p_{\mu_1,\sigma_1}$) and $q_\theta$ is a normal distribution (i.e., a PED with $D=2$, $q_\theta = p_{\mu_2,\sigma_2}$), we obtain the following formula for the order-2 Hyvärinen divergence:
$D_{H,2}[p_{\mu_1,\sigma_1}:p_{\mu_2,\sigma_2}] = \frac{(\sigma_1^2-\sigma_2^2)^2 + 2(\mu_2-\mu_1)^2\sigma_1^2}{8\sqrt{\pi}\,\sigma_1^3\,\sigma_2^4}.$
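As a quick numerical check of this closed form (our own verification code, assuming the order-$\alpha$ Hyvärinen divergence keeps the $\frac{1}{2}$ factor of $D_H$), we compare it against a direct quadrature of the defining integral:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def dh2_closed(mu1, s1, mu2, s2):
    """Closed-form order-2 Hyvarinen divergence between two univariate normals."""
    return ((s1**2 - s2**2)**2 + 2 * (mu2 - mu1)**2 * s1**2) / (8 * np.sqrt(np.pi) * s1**3 * s2**4)

def dh2_numeric(mu1, s1, mu2, s2):
    """Direct quadrature of (1/2) * int p^2 (d/dx log p - d/dx log q)^2 dx."""
    p = norm(mu1, s1).pdf
    dlog1 = lambda x: -(x - mu1) / s1**2
    dlog2 = lambda x: -(x - mu2) / s2**2
    f = lambda x: 0.5 * p(x)**2 * (dlog1(x) - dlog2(x))**2
    return quad(f, -np.inf, np.inf)[0]

print(dh2_closed(0.0, 1.0, 1.0, 2.0), dh2_numeric(0.0, 1.0, 1.0, 2.0))   # the two values agree
```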

4. Experiments: Jeffreys Divergence between Mixtures

In this section, we evaluate our heuristic to approximate the Jeffreys divergence between two mixtures $m$ and $m'$:
$\tilde{D}_J[m,m'] := (\bar\theta_{\mathrm{SME}}(m)-\bar\theta_{\mathrm{SME}}(m'))^\top(\bar\eta_{\mathrm{MLE}}(m)-\bar\eta_{\mathrm{MLE}}(m')).$
Recall that stochastically estimating the JD between $k$-GMMs with Monte Carlo sampling using $s$ samples (i.e., $\hat{D}_{J,s}[m:m']$) requires $\tilde{O}(ks)$ time and is not deterministic. That is, different MC runs yield fluctuating values that may be fairly different. In comparison, approximating $D_J$ by $\tilde{D}_J$ using $\Delta_J$ by converting the mixtures to order-$D$ PEDs requires $O(kD^2)$ time to compute the raw moments and $O(D^2)$ time to invert a Hankel moment matrix. Thus, by choosing $D=2k$, we obtain a deterministic $O(k^3)$ algorithm that is faster than the MC sampling when $k^2\ll s$. Since there are, at most, $k$ modes for a $k$-GMM, we choose order $D=2k$ for the PEDs.
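Putting the pieces together, here is an end-to-end sketch (ours, with hypothetical helper names) of the $\tilde{D}_J$ heuristic with $D=2k$, printed next to the Monte Carlo baseline; it illustrates the procedure rather than reproducing the paper's implementation or timings:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def raw_moments(gmm, lmax):
    """Raw moments mu_0, ..., mu_lmax of a GMM given as (weights, means, standard deviations)."""
    w, mu, sigma = gmm
    return np.array([sum(wi * norm(mi, si).moment(l) for wi, mi, si in zip(w, mu, sigma))
                     for l in range(lmax + 1)])

def eta_mle(gmm, D):
    return raw_moments(gmm, D)[1:]                    # (mu_1, ..., mu_D)

def theta_sme(gmm, D):
    mom = raw_moments(gmm, 2 * D - 2)
    A = np.array([[i * j * mom[i + j - 2] for j in range(1, D + 1)] for i in range(1, D + 1)])
    b = np.array([i * (i - 1) * (mom[i - 2] if i >= 2 else 0.0) for i in range(1, D + 1)])
    return -np.linalg.solve(A, b)

def jeffreys_tilde(gmm1, gmm2, D):
    """Deterministic heuristic: (theta_SME difference) . (eta_MLE difference)."""
    return float(np.dot(theta_sme(gmm1, D) - theta_sme(gmm2, D),
                        eta_mle(gmm1, D) - eta_mle(gmm2, D)))

def jeffreys_mc(gmm1, gmm2, s=100_000):
    """Monte Carlo baseline sampling from the middle mixture."""
    def pdf(x, g):
        w, mu, sigma = g
        return np.exp(-0.5 * ((x[:, None] - mu) / sigma)**2) / (sigma * np.sqrt(2 * np.pi)) @ w
    def draw(n, g):
        w, mu, sigma = g
        idx = rng.choice(len(w), size=n, p=w)
        return rng.normal(mu[idx], sigma[idx])
    n1 = rng.binomial(s, 0.5)
    x = np.concatenate([draw(n1, gmm1), draw(s - n1, gmm2)])
    p, q = pdf(x, gmm1), pdf(x, gmm2)
    return float(np.mean(2 * (p - q) / (p + q) * np.log(p / q)))

# Two 2-GMMs, approximated with PEDs of order D = 2k = 4:
m1 = (np.array([0.5, 0.5]), np.array([-1.0, 2.0]), np.array([1.0, 0.5]))
m2 = (np.array([0.3, 0.7]), np.array([0.0, 2.5]), np.array([1.0, 0.8]))
print(jeffreys_tilde(m1, m2, D=4), jeffreys_mc(m1, m2))   # deterministic heuristic vs. MC estimate
```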
To obtain quantitative results on the performance of our heuristic $\tilde{D}_J$, we build random GMMs with $k$ components as follows: $m(x)=\sum_{i=1}^{k} w_i p_{\mu_i,\sigma_i}(x)$, where $w_i\sim U_i$, $\mu_i\sim 10+10\,U_1$ and $\sigma_i\sim 1+U_2$, where the $U_i$'s, $U_1$ and $U_2$ are independent uniform distributions on $[0,1)$. The mixture weights are then normalized to sum up to one. For each value of $k$, we make 1000 trial experiments to gather statistics and use $s=10^5$ samples for evaluating the Jeffreys divergence $\hat{D}_J$ by Monte Carlo sampling. We denote by $\mathrm{error} := \frac{|\hat{D}_J-\Delta_J|}{\hat{D}_J}$ the relative error of an experiment. Table 1 presents the results of the experiments for $D=2k$: The table displays the average error, the maximum error (the minimum error is very close to zero, of order $10^{-5}$), and the speed-up obtained by our heuristic $\Delta_J$. Those experiments were carried out on a Dell Inspiron 7472 laptop (equipped with an Intel(R) Core(TM) i5-8250U CPU at 1.60 GHz).
Notice that the quality of the approximations of $\tilde{D}_J$ depends on the number of modes of the GMMs. However, calculating the number of modes is difficult [43,70], even for simple cases [71,72].
Figure 4 displays several experiments of converting mixtures to pairs of PEDs to obtain approximations of the Jeffreys divergence.
Figure 5 illustrates the use of the order-2 Hyvärinen divergence D H , 2 to perform model selection for choosing the order of a PED.
Finally, Figure 6 displays some limitations of the GMM-to-PED conversion when the GMMs have many modes. In that case, converting $\bar\eta_{\mathrm{MLE}}$ to obtain $\tilde\theta_T(\bar\eta_{\mathrm{MLE}})$ and estimating the Jeffreys divergence by
$\tilde\Delta_J^{\mathrm{MLE}}[m_1,m_2] = (\tilde\theta_2^{\mathrm{MLE}}-\tilde\theta_1^{\mathrm{MLE}})^\top(\bar\eta_2^{\mathrm{MLE}}-\bar\eta_1^{\mathrm{MLE}})$
improves the results but requires more computation.
Next, we consider learning a PED by converting a GMM derived from a Kernel Density Estimator (KDE). We use the duration of the eruptions of the Old Faithful geyser in Yellowstone National Park (Wyoming, USA): The dataset consists of 272 observations (https://www.stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat) (access date: 25 October 2021) and is included in the R language package 'stats'. Figure 7 displays the GMMs obtained from the KDEs of the Old Faithful geyser dataset when choosing for each component $\sigma=0.05$ (left) and $\sigma=0.1$ (right). Observe that the data are bimodal once the spurious modes (i.e., small bumps) are removed, as studied in [32]. Barron and Sheu [32] modeled that dataset using a bimodal PED of order $D=4$, i.e., a quartic distribution. We model it with a PED of order $D=10$ using the integral-based Score Matching method. Figure 8 displays the unnormalized bimodal density $q_1$ (i.e., $\tilde{p}_1$) that we obtained using the integral-based Score Matching method (with $\mathcal{X}=(0,1)$).

5. Conclusions and Perspectives

Many applications [7,73,74,75] require computing the Jeffreys divergence (the arithmetic symmetrization of the Kullback–Leibler divergence) between Gaussian mixture models. Since the Jeffreys divergence between GMMs is provably not available in closed form [13], one often ends up implementing a costly Monte Carlo stochastic approximation of the Jeffreys divergence. In this paper, we first noticed the simple expression of the Jeffreys divergence between densities $p_\theta$ and $p_{\theta'}$ of an exponential family using their dual natural and moment parameterizations [22] $p_\theta = p^\eta$ and $p_{\theta'} = p^{\eta'}$:
$D_J[p_\theta,p_{\theta'}] = (\theta-\theta')^\top(\eta-\eta'),$
where $\eta=\nabla F(\theta)$ and $\eta'=\nabla F(\theta')$ for the cumulant function $F(\theta)$ of the exponential family. This led us to propose a simple and fast heuristic to approximate the Jeffreys divergence between Gaussian mixture models: First, convert a mixture $m$ to a pair $(p_{\bar\theta_{\mathrm{SME}}}, p^{\bar\eta_{\mathrm{MLE}}})$ of dually parameterized polynomial exponential densities using extensions of the Maximum Likelihood and Score Matching Estimators (Theorems 1 and 3), and then approximate the JD deterministically by
$D_J[m_1,m_2] \approx \tilde{D}_J[m_1,m_2] = (\bar\theta_2^{\mathrm{SME}}-\bar\theta_1^{\mathrm{SME}})^\top(\bar\eta_2^{\mathrm{MLE}}-\bar\eta_1^{\mathrm{MLE}}).$
The order of the polynomial exponential family may be either prescribed or selected using the order-2 Hyvärinen divergence, which evaluates in closed form the dissimilarity between a GMM and a density of an exponential-polynomial family (Theorem 4). We experimentally demonstrated that the Jeffreys divergence between GMMs can be reasonably well approximated by $\tilde{D}_J$ for mixtures with a small number of modes, and we obtained an overall speed-up of several orders of magnitude compared to the Monte Carlo sampling method. We also proposed another deterministic heuristic to estimate $D_J$ as
$\tilde{D}_J^{\mathrm{MLE}}[m_1:m_2] = (\tilde\theta_2^{\mathrm{MLE}}-\tilde\theta_1^{\mathrm{MLE}})^\top(\bar\eta_2^{\mathrm{MLE}}-\bar\eta_1^{\mathrm{MLE}}),$
where $\tilde\theta_{\mathrm{MLE}}\approx\nabla F^*(\bar\eta_{\mathrm{MLE}})$ is numerically calculated using an iterative conversion procedure based on maximum entropy [58] (Section 2.3.1). Our technique extends to other univariate mixtures of exponential families (e.g., mixtures of Rayleigh distributions, mixtures of Gamma distributions, or mixtures of Beta distributions, etc.). One limitation of our method is that the PED modeling of a GMM may not guarantee obtaining the same number of modes as the GMM, even when we increase the order $D$ of the exponential-polynomial densities. This case is illustrated in Figure 9 (right).
Although PEDs are well suited for calculating the Jeffreys divergence compared to GMMs, we point out that GMMs are better suited for sampling, while PEDs require Monte Carlo methods (e.g., adaptive rejection sampling or MCMC methods [62]). Furthermore, we can estimate the Kullback–Leibler divergence between two PEDs using rejection sampling (or other MCMC methods [62]) or by using the $\gamma$-divergence [76] with $\gamma$ close to zero [66] (e.g., $\gamma=0.001$). The web page of the project is https://franknielsen.github.io/JeffreysDivergenceGMMPEF/index.html (accessed on 25 October 2021).
This work opens up several perspectives for future research: For example, we may consider bivariate polynomial-exponential densities for modeling bivariate Gaussian mixture models [29], or we may consider truncating the GMMs in order to avoid tail phenomena when converting GMMs to PEDs [77,78].

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. Ser. A Math. Phys. Sci. 1946, 186, 453–461. [Google Scholar]
  2. McLachlan, G.J.; Basford, K.E. Mixture Models: Inference and Applications to Clustering; M. Dekker: New York, NY, USA, 1988; Volume 38. [Google Scholar]
  3. Pearson, K. Contributions to the mathematical theory of evolution. Philos. Trans. R. Soc. Lond. A 1894, 185, 71–110. [Google Scholar]
  4. Seabra, J.C.; Ciompi, F.; Pujol, O.; Mauri, J.; Radeva, P.; Sanches, J. Rayleigh mixture model for plaque characterization in intravascular ultrasound. IEEE Trans. Biomed. Eng. 2011, 58, 1314–1324. [Google Scholar] [CrossRef] [PubMed]
  5. Kullback, S. Information Theory and Statistics; Courier Corporation: North Chelmsford, MA, USA, 1997. [Google Scholar]
  6. Cover, T.M. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 1999. [Google Scholar]
  7. Vitoratou, S.; Ntzoufras, I. Thermodynamic Bayesian model comparison. Stat. Comput. 2017, 27, 1165–1180. [Google Scholar] [CrossRef] [Green Version]
  8. Kannappan, P.; Rathie, P. An axiomatic characterization of J-divergence. In Transactions of the Tenth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes; Springer: Dordrecht, The Netherlands, 1988; pp. 29–36. [Google Scholar]
  9. Burbea, J. J-Divergences and related concepts. Encycl. Stat. Sci. 2004. [Google Scholar] [CrossRef]
  10. Tabibian, S.; Akbari, A.; Nasersharif, B. Speech enhancement using a wavelet thresholding method based on symmetric Kullback–Leibler divergence. Signal Process. 2015, 106, 184–197. [Google Scholar] [CrossRef]
  11. Veldhuis, R. The centroid of the symmetrical Kullback-Leibler distance. IEEE Signal Process. Lett. 2002, 9, 96–99. [Google Scholar] [CrossRef] [Green Version]
  12. Nielsen, F. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Process. Lett. 2013, 20, 657–660. [Google Scholar] [CrossRef] [Green Version]
  13. Watanabe, S.; Yamazaki, K.; Aoyagi, M. Kullback information of normal mixture is not an analytic function. IEICE Tech. Rep. Neurocomput. 2004, 104, 41–46. [Google Scholar]
  14. Cui, S.; Datcu, M. Comparison of Kullback-Leibler divergence approximation methods between Gaussian mixture models for satellite image retrieval. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 3719–3722. [Google Scholar]
  15. Cui, S. Comparison of approximation methods to Kullback–Leibler divergence between Gaussian mixture models for satellite image retrieval. Remote Sens. Lett. 2016, 7, 651–660. [Google Scholar] [CrossRef] [Green Version]
  16. Sreekumar, S.; Zhang, Z.; Goldfeld, Z. Non-asymptotic Performance Guarantees for Neural Estimation of f-Divergences. In Proceedings of the International Conference on Artificial Intelligence and Statistics (PMLR 2021), San Diego, CA, USA, 18–24 July 2021; pp. 3322–3330. [Google Scholar]
  17. Durrieu, J.L.; Thiran, J.P.; Kelly, F. Lower and upper bounds for approximation of the Kullback-Leibler divergence between Gaussian mixture models. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 4833–4836. [Google Scholar]
  18. Nielsen, F.; Sun, K. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy 2016, 18, 442. [Google Scholar] [CrossRef] [Green Version]
  19. Jenssen, R.; Principe, J.C.; Erdogmus, D.; Eltoft, T. The Cauchy–Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. J. Frankl. Inst. 2006, 343, 614–629. [Google Scholar] [CrossRef]
  20. Liu, M.; Vemuri, B.C.; Amari, S.i.; Nielsen, F. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2407–2419. [Google Scholar]
  21. Robert, C.; Casella, G. Monte Carlo Statistical Methods; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
  22. Barndorff-Nielsen, O. Information and Exponential Families: In Statistical Theory; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
  23. Azoury, K.S.; Warmuth, M.K. Relative loss bounds for on-line density estimation with the exponential family of distributions. Mach. Learn. 2001, 43, 211–246. [Google Scholar] [CrossRef] [Green Version]
  24. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
  25. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
  26. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904. [Google Scholar] [CrossRef] [Green Version]
  27. Hyvärinen, A. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 2005, 6, 695–709. [Google Scholar]
  28. Cobb, L.; Koppstein, P.; Chen, N.H. Estimation and moment recursion relations for multimodal distributions of the exponential family. J. Am. Stat. Assoc. 1983, 78, 124–130. [Google Scholar] [CrossRef]
  29. Hayakawa, J.; Takemura, A. Estimation of exponential-polynomial distribution by holonomic gradient descent. Commun. Stat.-Theory Methods 2016, 45, 6860–6882. [Google Scholar] [CrossRef] [Green Version]
  30. Nielsen, F.; Nock, R. MaxEnt upper bounds for the differential entropy of univariate continuous distributions. IEEE Signal Process. Lett. 2017, 24, 402–406. [Google Scholar] [CrossRef]
  31. Matz, A.W. Maximum likelihood parameter estimation for the quartic exponential distribution. Technometrics 1978, 20, 475–484. [Google Scholar] [CrossRef]
  32. Barron, A.R.; Sheu, C.H. Approximation of density functions by sequences of exponential families. Ann. Stat. 1991, 19, 1347–1369, Correction in 1991, 19, 2284–2284. [Google Scholar]
  33. O’toole, A. A method of determining the constants in the bimodal fourth degree exponential function. Ann. Math. Stat. 1933, 4, 79–93. [Google Scholar] [CrossRef]
  34. Aroian, L.A. The fourth degree exponential distribution function. Ann. Math. Stat. 1948, 19, 589–592. [Google Scholar] [CrossRef]
  35. Zellner, A.; Highfield, R.A. Calculation of maximum entropy distributions and approximation of marginal posterior distributions. J. Econom. 1988, 37, 195–209. [Google Scholar] [CrossRef]
  36. McCullagh, P. Exponential mixtures and quadratic exponential families. Biometrika 1994, 81, 721–729. [Google Scholar] [CrossRef]
  37. Mead, L.R.; Papanicolaou, N. Maximum entropy in the problem of moments. J. Math. Phys. 1984, 25, 2404–2417. [Google Scholar] [CrossRef] [Green Version]
  38. Armstrong, J.; Brigo, D. Stochastic filtering via L2 projection on mixture manifolds with computer algorithms and numerical examples. arXiv 2013, arXiv:1303.6236. [Google Scholar]
  39. Efron, B.; Hastie, T. Computer Age Statistical Inference; Cambridge University Press: Cambridge, UK, 2016; Volume 5. [Google Scholar]
  40. Pinsker, M. Information and Information Stability of Random Variables and Processes (Translated and Annotated by Amiel Feinstein); Holden-Day Inc.: San Francisco, CA, USA, 1964. [Google Scholar]
  41. Fedotov, A.A.; Harremoës, P.; Topsoe, F. Refinements of Pinsker’s inequality. IEEE Trans. Inf. Theory 2003, 49, 1491–1498. [Google Scholar] [CrossRef]
  42. Amari, S. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  43. Carreira-Perpinan, M.A. Mode-finding for mixtures of Gaussian distributions. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1318–1323. [Google Scholar] [CrossRef] [Green Version]
  44. Brown, L.D. Fundamentals of statistical exponential families with applications in statistical decision theory. Lect. Notes-Monogr. Ser. 1986, 9, 1–279. [Google Scholar]
45. Pelletier, B. Informative barycentres in statistics. Ann. Inst. Stat. Math. 2005, 57, 767–780.
46. Améndola, C.; Drton, M.; Sturmfels, B. Maximum likelihood estimates for Gaussian mixtures are transcendental. In Proceedings of the International Conference on Mathematical Aspects of Computer and Information Sciences, Berlin, Germany, 11–13 November 2015; pp. 579–590.
47. Hyvärinen, A. Some extensions of score matching. Comput. Stat. Data Anal. 2007, 51, 2499–2512.
48. Otto, F.; Villani, C. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. J. Funct. Anal. 2000, 173, 361–400.
49. Toscani, G. Entropy production and the rate of convergence to equilibrium for the Fokker-Planck equation. Q. Appl. Math. 1999, 57, 521–541.
50. Hudson, H.M. A natural identity for exponential families with applications in multiparameter estimation. Ann. Stat. 1978, 6, 473–484.
51. Trench, W.F. An algorithm for the inversion of finite Hankel matrices. J. Soc. Ind. Appl. Math. 1965, 13, 1102–1107.
52. Heinig, G.; Rost, K. Fast algorithms for Toeplitz and Hankel matrices. Linear Algebra Its Appl. 2011, 435, 1–59.
53. Fuhrmann, P.A. Remarks on the inversion of Hankel matrices. Linear Algebra Its Appl. 1986, 81, 89–104.
54. Lindsay, B.G. On the determinants of moment matrices. Ann. Stat. 1989, 17, 711–721.
55. Lindsay, B.G. Moment matrices: Applications in mixtures. Ann. Stat. 1989, 17, 722–740.
56. Provost, S.B.; Ha, H.T. On the inversion of certain moment matrices. Linear Algebra Its Appl. 2009, 430, 2650–2658.
57. Serfling, R.J. Approximation Theorems of Mathematical Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 162.
58. Mohammad-Djafari, A. A Matlab program to calculate the maximum entropy distributions. In Maximum Entropy and Bayesian Methods; Springer: Berlin/Heidelberg, Germany, 1992; pp. 221–233.
59. Karlin, S. Total Positivity; Stanford University Press: Redwood City, CA, USA, 1968; Volume 1.
60. von Neumann, J. Various Techniques Used in Connection with Random Digits. In Monte Carlo Method; National Bureau of Standards Applied Mathematics Series; Householder, A.S., Forsythe, G.E., Germond, H.H., Eds.; US Government Printing Office: Washington, DC, USA, 1951; Volume 12, Chapter 13; pp. 36–38.
61. Flury, B.D. Acceptance-rejection sampling made easy. SIAM Rev. 1990, 32, 474–476.
62. Rohde, D.; Corcoran, J. MCMC methods for univariate exponential family models with intractable normalization constants. In Proceedings of the 2014 IEEE Workshop on Statistical Signal Processing (SSP), Gold Coast, Australia, 29 June–2 July 2014; pp. 356–359.
63. Barr, D.R.; Sherrill, E.T. Mean and variance of truncated normal distributions. Am. Stat. 1999, 53, 357–361.
64. Améndola, C.; Faugère, J.-C.; Sturmfels, B. Moment Varieties of Gaussian Mixtures. J. Algebr. Stat. 2016, 7, 14–28.
65. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081.
66. Nielsen, F.; Nock, R. Patch matching with polynomial exponential families and projective divergences. In Proceedings of the International Conference on Similarity Search and Applications, Tokyo, Japan, 24–26 October 2016; pp. 109–116.
67. Yang, Y.; Martin, R.; Bondell, H. Variational approximations using Fisher divergence. arXiv 2019, arXiv:1905.05284.
68. Kostrikov, I.; Fergus, R.; Tompson, J.; Nachum, O. Offline reinforcement learning with Fisher divergence critic regularization. In Proceedings of the International Conference on Machine Learning (PMLR 2021), online, 7–8 June 2021; pp. 5774–5783.
69. Elkhalil, K.; Hasan, A.; Ding, J.; Farsiu, S.; Tarokh, V. Fisher Auto-Encoders. In Proceedings of the International Conference on Artificial Intelligence and Statistics (PMLR 2021), San Diego, CA, USA, 13–15 April 2021; pp. 352–360.
70. Améndola, C.; Engström, A.; Haase, C. Maximum number of modes of Gaussian mixtures. Inf. Inference J. IMA 2020, 9, 587–600.
71. Aprausheva, N.; Mollaverdi, N.; Sorokin, S. Bounds for the number of modes of the simplest Gaussian mixture. Pattern Recognit. Image Anal. 2006, 16, 677–681.
72. Aprausheva, N.; Sorokin, S. Exact equation of the boundary of unimodal and bimodal domains of a two-component Gaussian mixture. Pattern Recognit. Image Anal. 2013, 23, 341–347.
73. Xiao, Y.; Shah, M.; Francis, S.; Arnold, D.L.; Arbel, T.; Collins, D.L. Optimal Gaussian mixture models of tissue intensities in brain MRI of patients with multiple-sclerosis. In Proceedings of the International Workshop on Machine Learning in Medical Imaging, Beijing, China, 20 September 2010; pp. 165–173.
74. Bilik, I.; Khomchuk, P. Minimum divergence approaches for robust classification of ground moving targets. IEEE Trans. Aerosp. Electron. Syst. 2012, 48, 581–603.
75. Alippi, C.; Boracchi, G.; Carrera, D.; Roveri, M. Change Detection in Multivariate Datastreams: Likelihood and Detectability Loss. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9–15 July 2016.
76. Eguchi, S.; Komori, O.; Kato, S. Projective power entropy and maximum Tsallis entropy distributions. Entropy 2011, 13, 1746–1764.
77. Orjebin, E. A Recursive Formula for the Moments of a Truncated Univariate Normal Distribution. Unpublished note, 2014.
78. Del Castillo, J. The singly truncated normal distribution: A non-steep exponential family. Ann. Inst. Stat. Math. 1994, 46, 57–66.
Figure 1. Two examples illustrating the conversion of a GMM $m$ (black) with $k=2$ components (dashed black) into a pair of polynomial exponential densities of order $D=4$, $(p_{\bar{\theta}_{\mathrm{SME}}}, p_{\bar{\eta}_{\mathrm{MLE}}})$. The PED $p_{\bar{\theta}_{\mathrm{SME}}}$ is displayed in green and the PED $p_{\bar{\eta}_{\mathrm{MLE}}}$ in blue. To display $p_{\bar{\eta}_{\mathrm{MLE}}}$, we first converted $\bar{\eta}_{\mathrm{MLE}}$ to $\tilde{\bar{\theta}}_{\mathrm{MLE}}$ using an iterative linear system descent method (ILSDM), and we numerically estimated the normalizing factors $Z(\bar{\theta}_{\mathrm{SME}})$ and $Z(\bar{\eta}_{\mathrm{MLE}})$ to display the normalized PEDs.
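The normalization step mentioned in the caption of Figure 1 only requires a one-dimensional quadrature. The following minimal Python sketch (NumPy/SciPy; function names and the truncated integration support are our own illustrative choices, not taken from the paper's implementation) evaluates an unnormalized PED $q_\theta(x)=\exp\!\big(\sum_{j=1}^D \theta_j x^j\big)$ and estimates $Z(\theta)$ numerically so that the normalized density can be plotted.

```python
import numpy as np
from scipy.integrate import quad

def ped_unnormalized(x, theta):
    """Unnormalized polynomial exponential density q_theta(x) = exp(sum_j theta_j x^j)."""
    # theta = (theta_1, ..., theta_D); polyval wants highest degree first, constant term last (set to 0).
    return np.exp(np.polyval(np.append(theta[::-1], 0.0), x))

def ped_log_partition(theta, support=(-10.0, 10.0)):
    """Numerically estimate log Z(theta) = log int q_theta(x) dx on a wide-enough interval."""
    Z, _ = quad(lambda x: ped_unnormalized(x, theta), *support)
    return np.log(Z)

def ped_pdf(x, theta, support=(-10.0, 10.0)):
    """Normalized PED p_theta(x) = q_theta(x) / Z(theta)."""
    return ped_unnormalized(x, theta) * np.exp(-ped_log_partition(theta, support))

# Illustrative order-4 coefficients (not from the paper): q(x) = exp(x^2 - 0.25 x^4) is bimodal.
theta = np.array([0.0, 1.0, 0.0, -0.25])
xs = np.linspace(-4.0, 4.0, 9)
print(ped_pdf(xs, theta, support=(-6.0, 6.0)))
```

The quadrature interval must cover the effective support of the density; for PEDs with a dominant negative even-degree term, the integrand decays fast enough that a moderately wide interval suffices.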
Figure 2. Two mixtures $m_1$ (black) and $m_2$ (red) with $k_1=10$ and $k_2=11$ components, respectively (left), and the unnormalized PEFs $q_{\bar{\theta}_1}=\tilde{p}_{\bar{\theta}_1}$ (middle) and $q_{\bar{\theta}_2}=\tilde{p}_{\bar{\theta}_2}$ (right) of order $D=8$. The Jeffreys divergence (about 0.2634) is approximated using PEDs within 0.6% of the Monte Carlo estimate, with a speed-up factor of about 3190. Notice that displaying $p_{\bar{\theta}_1}$ and $p_{\bar{\theta}_2}$ on the same density plot as the mixtures would require computing the partition functions $Z(\bar{\theta}_1)$ and $Z(\bar{\theta}_2)$ (which we do not do for this figure). The PEDs $q_{\bar{\eta}_1}$ and $q_{\bar{\eta}_2}$ of the pairs $(\bar{\theta}_1,\bar{\eta}_1)$ and $(\bar{\theta}_2,\bar{\eta}_2)$ parameterized in the moment space are not shown here.
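The Monte Carlo baseline $\hat{D}_J$ against which the PED-based approximation of Figure 2 (and Table 1) is compared can be sketched as follows: each oriented KLD is estimated from i.i.d. samples of its first argument and the two estimates are summed. This is a minimal NumPy sketch with illustrative function names and mixture parameters of our own, not the paper's code.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def gmm_pdf(x, w, mu, sigma):
    """Density of a univariate GMM evaluated at the points x."""
    x = np.atleast_1d(x)[:, None]
    return np.sum(w * norm.pdf(x, loc=mu, scale=sigma), axis=1)

def gmm_sample(n, w, mu, sigma):
    """Draw n samples: pick a component index, then a normal variate from that component."""
    idx = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[idx], sigma[idx])

def kld_mc(n, w1, mu1, s1, w2, mu2, s2):
    """Monte Carlo estimate of D_KL[m1 : m2] using n samples drawn from m1."""
    x = gmm_sample(n, w1, mu1, s1)
    return np.mean(np.log(gmm_pdf(x, w1, mu1, s1) / gmm_pdf(x, w2, mu2, s2)))

def jeffreys_mc(n, gmm1, gmm2):
    """Symmetrized estimate: D_J = D_KL[m1 : m2] + D_KL[m2 : m1]."""
    return kld_mc(n, *gmm1, *gmm2) + kld_mc(n, *gmm2, *gmm1)

# Illustrative 2-GMMs given as (weights, means, standard deviations).
m1 = (np.array([0.6, 0.4]), np.array([-1.0, 2.0]), np.array([0.5, 1.0]))
m2 = (np.array([0.5, 0.5]), np.array([-0.5, 2.5]), np.array([0.7, 0.8]))
print(jeffreys_mc(100_000, m1, m2))
```

The cost of this estimator grows linearly with the number of samples and mixture components, which is precisely the overhead the closed-form PED-based heuristic avoids.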
Figure 3. The best simplification of a GMM $m(x)$ into a single normal component $p_{\theta^*}$ ($\min_{\theta\in\Theta} D_{\mathrm{KL}}[m:p_\theta]=\min_{\eta\in H} D_{\mathrm{KL}}[m:p_\eta]$) is geometrically interpreted as the unique m-projection of $m(x)$ onto the Gaussian family (an e-flat manifold): we have $\eta^*=\bar{\eta}=\sum_{i=1}^{k} w_i\,\eta_i$.
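In the univariate Gaussian case, the m-projection of Figure 3 amounts to moment matching: with sufficient statistics $t(x)=(x,x^2)$, the mixture moments $\eta^*=\sum_i w_i\eta_i$ give the mean $\mu^*=\sum_i w_i\mu_i$ and variance $\sigma^{*2}=\sum_i w_i(\sigma_i^2+\mu_i^2)-\mu^{*2}$. A minimal sketch of this closed-form simplification (function name ours):

```python
import numpy as np

def gmm_to_gaussian(w, mu, sigma):
    """KL-optimal single-Gaussian simplification of a GMM (m-projection = moment matching).

    Matches E[x] and E[x^2]: mu* = sum_i w_i mu_i, var* = sum_i w_i (sigma_i^2 + mu_i^2) - mu*^2.
    """
    mu_star = np.dot(w, mu)
    var_star = np.dot(w, sigma**2 + mu**2) - mu_star**2
    return mu_star, np.sqrt(var_star)

w = np.array([0.6, 0.4])
mu = np.array([-1.0, 2.0])
sigma = np.array([0.5, 1.0])
print(gmm_to_gaussian(w, mu, sigma))   # moment-matched (mean, standard deviation)
```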
Figure 4. Experiments approximating the Jeffreys divergence between two mixtures using pairs of PEDs. Notice that only the PEDs estimated by score matching in the natural parameter space are displayed.
Figure 5. Selecting the PED order $D$ by evaluating the order-2 Hyvärinen divergence values (for $D\in\{4,8,10,12,14,16\}$) and keeping the best one. Here, the order $D=10$ (boxed) yields the lowest order-2 Hyvärinen divergence: the GMM is closest to that PED.
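The order-selection loop of Figure 5 can be mimicked numerically. The sketch below uses the classical Hyvärinen divergence $D_H[m:p_\theta]=\tfrac12\int m(x)\,(\tfrac{d}{dx}\log m(x)-\tfrac{d}{dx}\log p_\theta(x))^2\,dx$ as a stand-in for the paper's order-2 variant (which would plug into the same place); being projective, it does not require $Z(\theta)$. The candidate coefficient vectors are illustrative dummies, not outputs of the score matching estimator, and all names are ours.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def gmm_pdf(x, w, mu, sigma):
    return np.sum(w * norm.pdf(x, mu, sigma))

def gmm_score(x, w, mu, sigma):
    """d/dx log m(x) for a univariate GMM."""
    comps = w * norm.pdf(x, mu, sigma)
    dm = np.sum(comps * (-(x - mu) / sigma**2))
    return dm / np.sum(comps)

def ped_score(x, theta):
    """d/dx log q_theta(x) with q_theta(x) = exp(sum_{j>=1} theta_j x^j)."""
    j = np.arange(1, len(theta) + 1)
    return np.sum(j * theta * x**(j - 1))

def hyvarinen_divergence(gmm, theta, support=(-10.0, 10.0)):
    """Classical Hyvarinen divergence D_H[m : p_theta]; projective, so Z(theta) is not needed."""
    w, mu, sigma = gmm
    integrand = lambda x: gmm_pdf(x, w, mu, sigma) * (gmm_score(x, w, mu, sigma) - ped_score(x, theta))**2
    val, _ = quad(integrand, *support, limit=200)
    return 0.5 * val

# Illustrative model selection: keep the order D whose (dummy) PED coefficients are closest to the GMM.
gmm = (np.array([0.5, 0.5]), np.array([-1.5, 1.5]), np.array([0.6, 0.6]))
candidates = {4: np.array([0.0, 0.8, 0.0, -0.2]),
              6: np.array([0.0, 0.9, 0.0, -0.15, 0.0, -0.01])}
best_D = min(candidates, key=lambda D: hyvarinen_divergence(gmm, candidates[D]))
print(best_D)
```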
Figure 6. Examples illustrating limitations of the conversion of GMMs (black) to PEDs (grey) using the integral-based score matching estimator: the case of GMMs with many modes.
Figure 7. Modeling the Old Faithful geyser data by a KDE (a GMM with $k=272$ components and uniform weights $w_i=\tfrac{1}{272}$): histogram (#bins = 25) (left), KDE with $\sigma=0.05$ (middle), and KDE with $\sigma=0.1$, which exhibits fewer spurious bumps (right).
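The KDE of Figure 7 is itself a GMM with one Gaussian kernel per observation, uniform weights, and a common bandwidth $\sigma$, so it feeds directly into the mixture-conversion pipeline. A minimal sketch (the synthetic sample below is only a stand-in for the 272 Old Faithful values; function names are ours):

```python
import numpy as np
from scipy.stats import norm

def gaussian_kde_as_gmm(samples, sigma):
    """Return (weights, means, scales) of the Gaussian KDE viewed as a uniform-weight GMM."""
    n = len(samples)
    w = np.full(n, 1.0 / n)            # uniform weights w_i = 1/n
    mu = np.asarray(samples, float)    # one kernel centered at each observation
    s = np.full(n, sigma)              # common bandwidth sigma
    return w, mu, s

def kde_pdf(x, w, mu, s):
    x = np.atleast_1d(x)[:, None]
    return np.sum(w * norm.pdf(x, mu, s), axis=1)

# Illustrative stand-in for the 272 eruption durations (replace with the real observations).
samples = np.concatenate([np.random.default_rng(1).normal(2.0, 0.3, 100),
                          np.random.default_rng(2).normal(4.3, 0.4, 172)])
w, mu, s = gaussian_kde_as_gmm(samples, sigma=0.1)
print(kde_pdf(np.array([2.0, 3.0, 4.3]), w, mu, s))
```

A smaller bandwidth keeps more spurious bumps (middle panel of Figure 7), while a larger one smooths them out (right panel).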
Figure 8. Modeling the Old Faithful geyser data by an exponential-polynomial distribution of order $D=10$.
Figure 9. GMM modes versus PED modes: (left) same number and locations of modes for the GMM and the PED; (right) 4 modes for the GMM but only 2 modes for the PED.
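The mode comparison of Figure 9 can be reproduced numerically by scanning a density on a fine grid and counting sign changes of its slope. A minimal sketch (grid-based, so very close or very flat modes may be missed; names and the example mixture are ours):

```python
import numpy as np
from scipy.stats import norm

def count_modes(pdf, lo, hi, num=20001):
    """Count local maxima of a univariate density via sign changes of a finite-difference slope."""
    x = np.linspace(lo, hi, num)
    y = pdf(x)
    slope = np.diff(y)
    # A mode is a + -> - sign change of the slope.
    return int(np.sum((slope[:-1] > 0) & (slope[1:] < 0)))

# Illustrative 2-GMM: well-separated components yield 2 modes.
w, mu, s = np.array([0.5, 0.5]), np.array([-2.0, 2.0]), np.array([0.7, 0.7])
gmm = lambda x: np.sum(w * norm.pdf(x[:, None], mu, s), axis=1)
print(count_modes(gmm, -6.0, 6.0))   # -> 2
```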
Table 1. Comparison of $\tilde{\Delta}_J(m_1,m_2)$ with $\hat{D}_J(m_1,m_2)$ for random GMMs.

k   D    Average error          Maximum error          Speed-up
2   4    0.1180799978221536     0.9491425404132259     2008.2323536011806
3   6    0.12533811294546526    1.9420608151988419     1010.4917042114389
4   8    0.10198448868508087    5.290871019594698      474.5135294829539
5   10   0.06336388579897352    3.8096955246161848     246.38780782640987
6   12   0.07145257192133717    1.0125283726458822     141.39097909641052
7   14   0.10538875853178625    0.8661463142793943     88.62985036546912
8   16   0.4150905507007969     0.4150905507007969     58.72277575395611