Article

Fast Approximations of the Jeffreys Divergence between Univariate Gaussian Mixtures via Mixture Conversions to Exponential-Polynomial Distributions

Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Submission received: 7 September 2021 / Revised: 20 October 2021 / Accepted: 26 October 2021 / Published: 28 October 2021
(This article belongs to the Special Issue Distance in Information and Statistical Physics III)

Abstract:
The Jeffreys divergence is a renowned arithmetic symmetrization of the oriented Kullback–Leibler divergence broadly used in information sciences. Since the Jeffreys divergence between Gaussian mixture models is not available in closed form, various techniques with advantages and disadvantages have been proposed in the literature to either estimate, approximate, or lower and upper bound this divergence. In this paper, we propose a simple yet fast heuristic to approximate the Jeffreys divergence between two univariate Gaussian mixtures with an arbitrary number of components. Our heuristic relies on converting the mixtures into pairs of dually parameterized probability densities belonging to an exponential-polynomial family. To measure with a closed-form formula the goodness of fit between a Gaussian mixture and an exponential-polynomial density approximating it, we generalize the Hyvärinen divergence to $\alpha$-Hyvärinen divergences. In particular, the 2-Hyvärinen divergence allows us to perform model selection by choosing the order of the exponential-polynomial densities used to approximate the mixtures. We experimentally demonstrate that our heuristic runs several orders of magnitude faster than stochastic Monte Carlo estimation while approximating the Jeffreys divergence reasonably well, especially when the mixtures have a very small number of modes.


1. Introduction

1.1. Statistical Mixtures and Statistical Divergences

We consider the problem of approximating the Jeffreys divergence [1] between two finite univariate continuous mixture models [2] $m(x)=\sum_{i=1}^{k} w_i p_i(x)$ and $m'(x)=\sum_{i=1}^{k'} w_i' p_i'(x)$ with continuous component distributions $p_i$'s and $p_i'$'s defined on a coinciding support $\mathcal{X}\subseteq\mathbb{R}$. The mixtures $m(x)$ and $m'(x)$ may have a different number of components (i.e., $k\neq k'$). Historically, Pearson [3] first considered a univariate Gaussian mixture of two components for modeling the distribution of the ratio of forehead breadth to body length of a thousand crabs in 1894 (Pearson obtained a unimodal mixture).
Although our work applies to any continuous mixtures of an exponential family (e.g., Rayleigh mixtures [4] with restricted support $\mathcal{X}=\mathbb{R}_+$), we explain our method for the most prominent family of mixtures encountered in practice: the Gaussian mixture models, or GMMs for short. In the remainder, a univariate GMM $m(x)=\sum_{i=1}^{k} w_i p_{\mu_i,\sigma_i}(x)$ with $k$ Gaussian components
$p_i(x) = p_{\mu_i,\sigma_i}(x) := \frac{1}{\sigma_i\sqrt{2\pi}}\exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right),$
is called a k-GMM.
The Kullback–Leibler divergence (KLD) [5,6] $D_{\mathrm{KL}}[m:m']$ between two mixtures $m$ and $m'$ is:
$D_{\mathrm{KL}}[m:m'] := \int_{\mathcal{X}} m(x)\log\frac{m(x)}{m'(x)}\,\mathrm{d}x.$
The KLD is an oriented divergence since $D_{\mathrm{KL}}[m:m'] \neq D_{\mathrm{KL}}[m':m]$ in general.
The Jeffreys divergence (JD) [1] $D_J[m,m']$ is the arithmetic symmetrization of the forward and reverse KLDs:
$D_J[m,m'] := D_{\mathrm{KL}}[m:m'] + D_{\mathrm{KL}}[m':m] = \int_{\mathcal{X}} (m(x)-m'(x))\log\frac{m(x)}{m'(x)}\,\mathrm{d}x.$
The JD is a symmetric divergence: $D_J[m,m'] = D_J[m',m]$. In the literature, the Jeffreys divergence [7] has also been called the J-divergence [8,9], the symmetric Kullback–Leibler divergence [10], and sometimes the symmetrical Kullback–Leibler divergence [11,12]. In general, it is provably hard to calculate the definite integral of the KLD between two continuous mixtures in closed form: For example, the KLD between two GMMs has been shown to be non-analytic [13]. Thus, in practice, when calculating the JD between two GMMs, one can either approximate [14,15], estimate [16], or bound [17,18] the KLD between mixtures. Another approach to bypass the computational intractability of calculating the KLD between mixtures consists of designing new types of divergences that admit closed-form expressions for mixtures. See, for example, the Cauchy–Schwarz divergence [19] or the total square divergence [20] (a total Bregman divergence), which admit closed-form formulas when handling GMMs. The total square divergence [20] is invariant to rigid transformations and provably robust to outliers in clustering applications.
In practice, to estimate the KLD between mixtures, one uses the following Monte Carlo (MC) estimator:
$\hat{D}_{\mathrm{KL}}^{S_s}[m:m'] := \frac{1}{s}\sum_{i=1}^{s}\left(\log\frac{m(x_i)}{m'(x_i)} + \frac{m'(x_i)}{m(x_i)} - 1\right) \geq 0,$
where $S_s = \{x_1,\ldots,x_s\}$ is a set of $s$ independent and identically distributed (i.i.d.) samples from $m(x)$. This MC estimator is by construction always non-negative, and it is consistent under mild conditions [21]: $\lim_{s\rightarrow\infty} \hat{D}_{\mathrm{KL}}^{S_s}[m:m'] = D_{\mathrm{KL}}[m:m']$.
Similarly, we estimate the Jeffreys divergence via MC sampling as follows:
$\hat{D}_J^{S_s}[m,m'] := \frac{1}{s}\sum_{i=1}^{s}\frac{2\,(m(x_i)-m'(x_i))}{m(x_i)+m'(x_i)}\log\frac{m(x_i)}{m'(x_i)} \geq 0,$
where $S_s = \{x_1,\ldots,x_s\}$ are $s$ i.i.d. samples from the "middle mixture" $m_{12}(x) := \frac{1}{2}(m(x)+m'(x))$. By choosing the middle mixture $m_{12}(x)$ for sampling, we ensure that we keep the symmetric property of the JD (i.e., $\hat{D}_J^{S_s}[m,m'] = \hat{D}_J^{S_s}[m',m]$), and we also have consistency under mild conditions [21]: $\lim_{s\rightarrow\infty}\hat{D}_J^{S_s}[m,m'] = D_J[m,m']$. The time complexity to stochastically estimate the JD is $\tilde{O}((k+k')s)$, with $s$ typically ranging from $10^4$ to $10^6$ in applications. Notice that the number of components of a mixture can be very large (e.g., $k=O(n)$ for $n$ input data when using Kernel Density Estimators [2]). KDEs may thus have a large number of components and may potentially exhibit many spurious modes visualized as small bumps when plotting the densities.
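To make the baseline concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper; the helper names gmm_pdf, gmm_sample, and jeffreys_mc are hypothetical) of the Monte Carlo estimator above, sampling from the middle mixture $m_{12}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def gmm_pdf(x, w, mu, sigma):
    """Density of a univariate GMM evaluated at the points x."""
    x = np.atleast_1d(x)[:, None]
    comp = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return comp @ w

def gmm_sample(n, w, mu, sigma):
    """Draw n i.i.d. variates from a univariate GMM."""
    idx = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[idx], sigma[idx])

def jeffreys_mc(s, gmm1, gmm2):
    """MC estimate of D_J[m, m'] with s samples drawn from m_12 = (m + m')/2."""
    n1 = rng.binomial(s, 0.5)                 # how many samples come from m (vs. m')
    x = np.concatenate([gmm_sample(n1, *gmm1), gmm_sample(s - n1, *gmm2)])
    p, q = gmm_pdf(x, *gmm1), gmm_pdf(x, *gmm2)
    return np.mean(2.0 * (p - q) / (p + q) * np.log(p / q))

# Example with two 2-GMMs (weights, means, standard deviations):
m1 = (np.array([0.5, 0.5]), np.array([-1.0, 2.0]), np.array([1.0, 0.5]))
m2 = (np.array([0.3, 0.7]), np.array([0.0, 2.5]), np.array([1.0, 0.8]))
print(jeffreys_mc(100_000, m1, m2))
```

Each run returns a slightly different value; this stochastic fluctuation is what the deterministic heuristic introduced below avoids.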

1.2. Jeffreys Divergence between Densities of an Exponential Family

We consider approximating the JD by converting continuous mixtures into densities of exponential families [22]. A continuous exponential family (EF) $\mathcal{E}_t$ of order $D$ is defined as a family of probability density functions with support $\mathcal{X}$ of the form:
$\mathcal{E}_t := \left\{ p_\theta(x) := \exp\left(\sum_{i=1}^{D}\theta_i t_i(x) - F(\theta)\right) \ :\ \theta\in\Theta \right\},$
where $F(\theta)$ is called the log-normalizer, which ensures the normalization of $p_\theta(x)$ (i.e., $\int_{\mathcal{X}} p_\theta(x)\,\mathrm{d}x = 1$):
$F(\theta) = \log\int_{\mathcal{X}}\exp\left(\sum_{i=1}^{D}\theta_i t_i(x)\right)\mathrm{d}x.$
Parameter $\theta\in\Theta\subseteq\mathbb{R}^D$ is called the natural parameter, and the functions $t_1(x),\ldots,t_D(x)$ are called the sufficient statistics [22]. Let $\Theta$ denote the natural parameter space: $\Theta := \{\theta\ :\ F(\theta) < \infty\}$, an open convex domain for regular exponential families [22]. The exponential family is said to be minimal when the functions $1, t_1(x),\ldots,t_D(x)$ are linearly independent.
It is well-known that one can bypass the definite integral calculation of the KLD when the probability density functions $p_{\theta_1}$ and $p_{\theta_2}$ belong to the same exponential family [23,24]:
$D_{\mathrm{KL}}[p_{\theta_1}:p_{\theta_2}] = B_F(\theta_2:\theta_1),$
where $B_F(\theta_2:\theta_1)$ is the Bregman divergence induced by the log-normalizer, a strictly convex real-analytic function [22]. The Bregman divergence [25] between two parameters $\theta_1$ and $\theta_2$ for a strictly convex and smooth generator $F$ is defined by:
$B_F(\theta_1:\theta_2) := F(\theta_1) - F(\theta_2) - (\theta_1-\theta_2)^\top\nabla F(\theta_2).$
Thus, the Jeffreys divergence between two pdfs $p_\theta$ and $p_{\theta'}$ belonging to the same exponential family is a symmetrized Bregman divergence [26]:
$D_J[p_\theta:p_{\theta'}] = B_F(\theta:\theta') + B_F(\theta':\theta) = (\theta-\theta')^\top(\nabla F(\theta)-\nabla F(\theta')).$
Let $F^*(\eta)$ denote the Legendre–Fenchel convex conjugate of $F(\theta)$:
$F^*(\eta) := \sup_{\theta\in\Theta}\{\theta^\top\eta - F(\theta)\}.$
The Legendre transform ensures that $\eta=\nabla F(\theta)$ and $\theta=\nabla F^*(\eta)$, and the Jeffreys divergence between two pdfs $p_\theta$ and $p_{\theta'}$ belonging to the same exponential family is:
$D_J[p_\theta:p_{\theta'}] = (\theta-\theta')^\top(\eta-\eta').$
Notice that the log-normalizer F ( θ ) does not appear explicitly in the above formula.
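As a concrete illustration of this identity, the following sketch (ours; the helper names are hypothetical) evaluates the Jeffreys divergence between two univariate normals from their natural and moment parameters and cross-checks it against the textbook closed form for Gaussians:

```python
import numpy as np

def gauss_natural(mu, sigma):
    """Natural parameter of N(mu, sigma^2) for t(x) = (x, x^2)."""
    return np.array([mu / sigma**2, -1.0 / (2.0 * sigma**2)])

def gauss_moment(mu, sigma):
    """Moment parameter: (E[x], E[x^2])."""
    return np.array([mu, mu**2 + sigma**2])

def jeffreys_expfam(theta1, eta1, theta2, eta2):
    """D_J[p_theta1, p_theta2] = (theta1 - theta2) . (eta1 - eta2)."""
    return float(np.dot(theta1 - theta2, eta1 - eta2))

mu1, s1, mu2, s2 = 0.0, 1.0, 1.5, 2.0
jd_dual = jeffreys_expfam(gauss_natural(mu1, s1), gauss_moment(mu1, s1),
                          gauss_natural(mu2, s2), gauss_moment(mu2, s2))
# Closed-form Jeffreys divergence between two normals, for comparison:
jd_closed = (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) \
          + (s2**2 + (mu1 - mu2)**2) / (2 * s1**2) - 1.0
print(jd_dual, jd_closed)   # both equal ~2.53125
```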

1.3. A Simple Approximation Heuristic

Densities $p_\theta$ of an exponential family admit a dual parameterization [22]: $\eta = \eta(\theta) := E_{p_\theta}[t(x)] = \nabla F(\theta)$, called the moment parameterization (or mean parameterization). Let $H$ denote the moment parameter space. Let us use the subscript and superscript notations to emphasize the coordinate system used to index a density: In our notation, we thus write $p_\theta(x) = p^\eta(x)$.
In view of Equation (7), our method to approximate the Jeffreys divergence between mixtures $m$ and $m'$ consists of first converting those mixtures $m$ and $m'$ into pairs of polynomial exponential densities (PEDs) in Section 2. To convert a mixture $m(x)$ into a pair $(p_{\bar\theta_1}, p^{\bar\eta_2})$ dually parameterized (but not dual because $\bar\eta_2\neq\nabla F(\bar\theta_1)$), we shall consider "integral extensions" (or information projections) of the Maximum Likelihood Estimator [22] (MLE estimates in the moment parameter space $H = \{\nabla F(\theta)\ :\ \theta\in\Theta\}$) and of the Score Matching Estimator [27] (SME estimates in the natural parameter space $\Theta = \{\nabla F^*(\eta)\ :\ \eta\in H\}$).
We shall consider polynomial exponential families [28] (PEFs), also called exponential-polynomial families (EPFs) [29]. PEFs $\mathcal{E}_D$ are regular minimal exponential families with polynomial sufficient statistics $t_i(x)=x^i$ for $i\in\{1,\ldots,D\}$. For example, the exponential distributions $\{p_\lambda(x)=\lambda\exp(-\lambda x)\}$ form a PEF with $D=1$, $t(x)=x$ and $\mathcal{X}=\mathbb{R}_+$, and the normal distributions form an EPF with $D=2$, $t(x)=[x\ \ x^2]^\top$ and $\mathcal{X}=\mathbb{R}$, etc. Although the log-normalizer $F(\theta)$ can be obtained in closed form for lower order PEFs (e.g., $D=1$ or $D=2$) or very special subfamilies (e.g., when $D=1$ and $t_1(x)=x^k$, the exponential-monomial families [30]), no closed-form formula is available for $F(\theta)$ of EPFs in general as soon as $D\geq 4$ [31,32], and the cumulant function $F(\theta)$ is said to be computationally intractable. Notice that when $\mathcal{X}=\mathbb{R}$, the leading coefficient $\theta_D$ is negative for even integer order $D$. EPFs are attractive because these families can universally model any smooth multimodal distribution [28] and require fewer parameters in comparison to GMMs: Indeed, a univariate $k$-GMM $m(x)$ (with at most $k$ modes and $k-1$ antimodes) requires $3k-1$ parameters to specify $m(x)$ (or $k+1$ for a KDE with constant kernel width $\sigma$, or $2k-1$ for a KDE with varying kernel widths, but then $k=n$ observations). A density of an EPF of order $D$ is called an exponential-polynomial density (EPD) and requires $D$ parameters to specify $\theta$, with at most $D/2$ modes (and $D/2-1$ antimodes). The case of the quartic (polynomial) exponential densities $\mathcal{E}_4$ ($D=4$) has been extensively investigated in [31,33,34,35,36,37]. Armstrong and Brigo [38] discussed order-6 PEDs, and Efron and Hastie reported an order-7 PEF in their textbook (see Figure 5.7 of [39]). Figure 1 displays two examples of converting a GMM into a pair of dually parameterized exponential-polynomial densities.
Then by converting both mixture $m$ and mixture $m'$ into pairs of dually natural/moment parameterized unnormalized PEDs, i.e., $m\rightarrow(q_{\bar\theta_{\mathrm{SME}}}, q^{\bar\eta_{\mathrm{MLE}}})$ and $m'\rightarrow(q_{\bar\theta'_{\mathrm{SME}}}, q^{\bar\eta'_{\mathrm{MLE}}})$, we approximate the JD between mixtures $m$ and $m'$ by using the four parameters of the PEDs
$D_J[m,m'] \approx (\bar\theta_{\mathrm{SME}}-\bar\theta'_{\mathrm{SME}})^\top(\bar\eta_{\mathrm{MLE}}-\bar\eta'_{\mathrm{MLE}}).$
Let Δ J denote the approximation formula obtained from the two pairs of PEDs:
$\Delta_J[p_{\theta_{\mathrm{SME}}}, p^{\eta_{\mathrm{MLE}}}; p_{\theta'_{\mathrm{SME}}}, p^{\eta'_{\mathrm{MLE}}}] := (\theta_{\mathrm{SME}}-\theta'_{\mathrm{SME}})^\top(\eta_{\mathrm{MLE}}-\eta'_{\mathrm{MLE}}).$
Let $\Delta_J(\theta_{\mathrm{SME}},\eta_{\mathrm{MLE}};\theta'_{\mathrm{SME}},\eta'_{\mathrm{MLE}}) := \Delta_J[p_{\theta_{\mathrm{SME}}}, p^{\eta_{\mathrm{MLE}}}; p_{\theta'_{\mathrm{SME}}}, p^{\eta'_{\mathrm{MLE}}}]$. Then we have
$D_J[m,m'] \approx \tilde{D}_J[m,m'] := \Delta_J(\theta_{\mathrm{SME}},\eta_{\mathrm{MLE}};\theta'_{\mathrm{SME}},\eta'_{\mathrm{MLE}}).$
Note that $\Delta_J$ is not a proper divergence as it may be negative since, in general, $\bar\eta_{\mathrm{MLE}}\neq\nabla F(\bar\theta_{\mathrm{SME}})$. That is, $\Delta_J$ may not satisfy the identity of indiscernibles. The approximation $\Delta_J$ is exact when $k_1=k_2=1$, with both $m$ and $m'$ belonging to an exponential family.
We experimentally show in Section 4 that the $\tilde{D}_J$ heuristic yields approximations of the JD that are faster than the baseline MC estimations by several orders of magnitude while approximating the JD reasonably well when the mixtures have a small number of modes.
For example, Figure 2 displays the unnormalized PEDs obtained by converting two Gaussian mixture models ($k_1=10$ components and $k_2=11$ components) into PEDs of a PEF of order $D=8$. The MC estimation of the JD with $s=10^6$ samples yields $0.2633$, while the PED approximation of Equation (8) on corresponding PEFs yields $0.2618$ (the relative error is $0.00585$, or about $0.585\%$). It took about $2642.581$ milliseconds (with $s=10^6$ on a Dell Inspiron 7472 laptop) to MC estimate the JD, while it took about $0.827$ milliseconds with the PEF approximation. Thus, we obtained a speed-up factor of about $3190$ (three orders of magnitude) for this particular example. Notice that when viewing Figure 2, we tend to visually evaluate the dissimilarity using the total variation distance (a metric distance):
$D_{\mathrm{TV}}[m,m'] := \frac{1}{2}\int |m(x)-m'(x)|\,\mathrm{d}x,$
rather than by a dissimilarity relating to the KLD. Using Pinsker's inequality [40,41], we have $D_J[m,m'] \geq D_{\mathrm{TV}}[m,m']^2$ and $D_{\mathrm{TV}}[m,m'] \in [0,1]$. Thus, mixtures with a large TV distance (e.g., $D_{\mathrm{TV}}[m,m']=0.1$) may still have a small JD, since Pinsker's inequality only guarantees $D_J[m,m'] \geq 0.01$.
Let us point out that our approximation heuristic is deterministic, while the MC estimations are stochastic: That is, each MC run (Equation (4)) returns a different result, and a single MC run may yield a very bad approximation of the true Jeffreys divergence.
We compare our fast heuristic $\tilde{D}_J[m,m'] = (\theta_{\mathrm{SME}}-\theta'_{\mathrm{SME}})^\top(\eta_{\mathrm{MLE}}-\eta'_{\mathrm{MLE}})$ with two more costly methods relying on numerical procedures to convert natural ↔ moment parameters:
  • Simplify GMMs $m_i$ into $p^{\bar\eta_i^{\mathrm{MLE}}}$, and approximately convert the $\bar\eta_i^{\mathrm{MLE}}$'s into $\tilde\theta_i^{\mathrm{MLE}}$'s. Then approximate the Jeffreys divergence as
    $D_J[m_1,m_2] \approx \tilde\Delta_J^{\mathrm{MLE}}[m_1,m_2] := (\tilde\theta_2^{\mathrm{MLE}}-\tilde\theta_1^{\mathrm{MLE}})^\top(\bar\eta_2^{\mathrm{MLE}}-\bar\eta_1^{\mathrm{MLE}}).$
  • Simplify GMMs $m_i$ into $p_{\bar\theta_i^{\mathrm{SME}}}$, and approximately convert the $\bar\theta_i^{\mathrm{SME}}$'s into $\tilde\eta_i^{\mathrm{SME}}$'s. Then approximate the Jeffreys divergence as
    $D_J[m_1,m_2] \approx \tilde\Delta_J^{\mathrm{SME}}[m_1,m_2] := (\bar\theta_2^{\mathrm{SME}}-\bar\theta_1^{\mathrm{SME}})^\top(\tilde\eta_2^{\mathrm{SME}}-\tilde\eta_1^{\mathrm{SME}}).$

1.4. Contributions and Paper Outline

Our contributions are summarized as follows:
  • We explain how to convert any continuous density r ( x ) (including GMMs) into a polynomial exponential density in Section 2 using integral-based extensions of the Maximum Likelihood Estimator [22] (MLE estimates in the moment parameter space H, Theorem 1 and Corollary 1) and the Score Matching Estimator [27] (SME estimates in the natural parameter space Θ , Theorem 3). We show a connection between SME and the Moment Linear System Estimator [28] (MLSE).
  • We report a closed-form formula to evaluate the goodness-of-fit of a polynomial family density to a GMM in Section 3 using an extension of the Hyvärinen divergence [42] (Theorem 4) and discuss the problem of model selection for choosing the order D of the polynomial exponential family.
  • We show how to approximate the Jeffreys divergence between GMMs using a pair of natural/moment parameter PED conversions and present experimental results that display a gain of several orders of magnitude in performance when compared to the vanilla Monte Carlo estimator in Section 4. We observe that the quality of the approximations depends on the number of modes of the GMMs [43]. However, calculating or counting the modes of a GMM is a difficult problem in its own right [43].
The paper is organized as follows: In Section 2, we show how to convert arbitrary probability density functions into polynomial exponential densities using the integral-based Maximum Likelihood Estimator (MLE) and Score Matching Estimator (SME). We describe a Maximum Entropy method to iteratively convert moment parameters into natural parameters in Section 2.3.1. It is followed by Section 3, which shows how to calculate in closed form the order-2 Hyvärinen divergence between a GMM and a polynomial exponential density. We use this criterion to perform model selection. Section 4 presents our computational experiments that demonstrate a gain of several orders of magnitude for GMMs with a small number of modes. Finally, we conclude in Section 5.

2. Converting Finite Mixtures to Exponential Family Densities

We report two generic methods to convert a mixture $m(x)$ into a density $p_\theta(x)$ of an exponential family: The first method, extending the MLE in Section 2.1, proceeds using the mean parameterization $\eta$, while the second method, extending the SME in Section 2.2, uses the natural parameterization of the exponential family. We then describe how to convert the moment parameters into natural parameters (and vice versa) for polynomial exponential families in Section 2.3. We show how to instantiate these generic conversion methods for GMMs: It requires calculating non-central moments of GMMs in closed form. The efficient computation of raw moments of GMMs is detailed in Section 2.4.

2.1. Conversion Using the Moment Parameterization (MLE)

Let us recall that in order to estimate the moment or mean parameter $\hat\eta_{\mathrm{MLE}}$ of a density belonging to an exponential family
$\mathcal{E}_t := \left\{p_\theta(x) = \exp\left(t(x)^\top\theta - F(\theta)\right)\right\}$
with a sufficient statistic vector $t(x)=[t_1(x)\ \cdots\ t_D(x)]^\top$ from an i.i.d. sample set $x_1,\ldots,x_n$, the Maximum Likelihood Estimator (MLE) [22,44] yields
$\max_\theta \prod_{i=1}^{n} p_\theta(x_i) \equiv \max_\theta \sum_{i=1}^{n}\log p_\theta(x_i) = \max_\theta\ E(\theta) := \sum_{i=1}^{n} t(x_i)^\top\theta - nF(\theta),$
$\hat\eta_{\mathrm{MLE}} = \nabla F(\hat\theta_{\mathrm{MLE}}) = \frac{1}{n}\sum_{i=1}^{n} t(x_i).$
In statistics, Equation (12) is called the estimating equation. The MLE exists under mild conditions [22] and is unique since the Hessian $\nabla^2 E(\theta) = -n\nabla^2 F(\theta)$ of the estimating function is negative-definite (log-normalizers $F(\theta)$ are always strictly convex and real analytic [22]). The MLE is consistent and asymptotically normally distributed [22]. Furthermore, since the MLE satisfies the equivariance property [22], we have $\hat\theta_{\mathrm{MLE}} = \nabla F^*(\hat\eta_{\mathrm{MLE}})$, where $\nabla F^*$ denotes the gradient of the conjugate function $F^*(\eta)$ of the cumulant function $F(\theta)$ of the exponential family. In general, $\nabla F^*$ is intractable for PEDs with $D\geq 4$.
By considering the empirical distribution
$p_e(x) := \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}(x),$
where δ x i ( · ) denotes the Dirac distribution at location x i , we can formulate the MLE problem as a minimum KLD problem between the empirical distribution and a density of the exponential family:
$\min_\theta D_{\mathrm{KL}}[p_e:p_\theta] = \min_\theta\left(-H[p_e] - E_{p_e}[\log p_\theta(x)]\right) \equiv \max_\theta \frac{1}{n}\sum_{i=1}^{n}\log p_\theta(x_i),$
since the entropy term H [ p e ] is independent of θ .
Thus, to convert an arbitrary smooth density $r(x)$ into a density $p_\theta$ of an exponential family $\mathcal{E}_t$, we have to solve the following minimization problem:
$\min_{\theta\in\Theta} D_{\mathrm{KL}}[r:p_\theta].$
Rewriting the minimization problem as:
$\min_\theta D_{\mathrm{KL}}[r:p_\theta] = \min_\theta\left(-\int r(x)\log p_\theta(x)\,\mathrm{d}x + \int r(x)\log r(x)\,\mathrm{d}x\right) \equiv \min_\theta \int r(x)\left(F(\theta)-\theta^\top t(x)\right)\mathrm{d}x = \min_\theta\ \bar{E}(\theta) := F(\theta) - \theta^\top E_r[t(x)],$
we obtain
$\bar\eta_{\mathrm{MLE}}(r) := E_r[t(x)] = \int_{\mathcal{X}} r(x)\,t(x)\,\mathrm{d}x.$
The minimum is unique since $\nabla^2\bar{E}(\theta) = \nabla^2 F(\theta) \succ 0$ (positive-definite matrix). This conversion procedure $r(x)\rightarrow p^{\bar\eta_{\mathrm{MLE}}(r)}(x)$ can be interpreted as an integral extension of the MLE, hence the bar notation in $\bar\eta_{\mathrm{MLE}}$. Notice that the ordinary MLE is $\hat\eta_{\mathrm{MLE}} = \bar\eta_{\mathrm{MLE}}(p_e)$, obtained for the empirical distribution $r=p_e$: $\bar\eta_{\mathrm{MLE}}(p_e) = \frac{1}{n}\sum_{i=1}^{n} t(x_i)$.
Theorem 1.
The best density $p^{\bar\eta}(x)$ of an exponential family $\mathcal{E}_t = \{p_\theta\ :\ \theta\in\Theta\}$ minimizing the Kullback–Leibler divergence $D_{\mathrm{KL}}[r:p_\theta]$ between a density $r$ and a density $p_\theta$ of $\mathcal{E}_t$ is given by the moment parameter $\bar\eta = E_r[t(x)] = \int_{\mathcal{X}} r(x)\,t(x)\,\mathrm{d}x$.
Notice that when $r=p_\theta$, we obtain $\bar\eta = E_{p_\theta}[t(x)] = \eta$, so that the method $\bar\eta_{\mathrm{MLE}}(\cdot)$ is consistent (by analogy to the finite i.i.d. MLE case): $\bar\eta_{\mathrm{MLE}}(p_\theta) = \eta = \nabla F(\theta)$.
The KLD right-sided minimization problem can be interpreted as an information projection of r onto E t . As a corollary of Theorem 1, we obtain:
Corollary 1
(Best right-sided KLD simplification of a mixture). The best right-sided KLD simplification of a homogeneous mixture of exponential families [2] $m(x)=\sum_{i=1}^{k} w_i p_{\theta_i}(x)$ with $p_{\theta_i}\in\mathcal{E}_t$, i.e., $\min_{\theta\in\Theta} D_{\mathrm{KL}}[m:p_\theta]$, into a single component $p^\eta(x)$ is given by $\eta = \bar\eta_{\mathrm{MLE}}(m) = E_m[t(x)] = \sum_{i=1}^{k} w_i\eta_i = \bar\eta$.
Equation (16) allows us to greatly simplify the proofs reported in [45] for mixture simplifications that involved the explicit use of the Pythagoras’ theorem in the dually flat spaces of exponential families [42]. Figure 3 displays the geometric interpretation of the best KLD simplification of a GMM with ambient space the probability space ( R , B ( R ) , μ L ) , where μ L denotes the Lebesgue measure and B ( R ) the Borel σ -algebra of R .
Let us notice that Theorem 1 yields an algebraic system for polynomial exponential densities, i.e., $E_m[x^i] = \bar\eta_i$ for $i\in\{1,\ldots,D\}$, to compute $\bar\eta_{\mathrm{MLE}}(m)$ for a given GMM $m(x)$ (since the raw moments $E_m[x^i]$ are algebraic). In contrast with this result, the MLE of i.i.d. observations is in general not an algebraic function [46] but a transcendental function.
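For a PEF of order $D$, Theorem 1 therefore amounts to reading off the first $D$ raw moments of the GMM. A short sketch (ours, not the paper's code), relying on SciPy's norm.moment for the component moments:

```python
import numpy as np
from scipy.stats import norm

def gmm_raw_moments(w, mu, sigma, orders):
    """Raw moments E_m[x^l] of a univariate GMM for each order l in `orders`."""
    return np.array([sum(wi * norm(mi, si).moment(l) for wi, mi, si in zip(w, mu, sigma))
                     for l in orders])

def eta_mle(w, mu, sigma, D):
    """Moment parameter of the right-sided KLD projection of the GMM onto a PEF of order D."""
    return gmm_raw_moments(w, mu, sigma, range(1, D + 1))

# Example: a 2-GMM projected onto a quartic PEF (D = 4)
w, mu, sigma = np.array([0.4, 0.6]), np.array([-2.0, 1.0]), np.array([1.0, 0.5])
print(eta_mle(w, mu, sigma, D=4))
```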

2.2. Converting to a PEF Using the Natural Parameterization (SME)

Integral-Based Score Matching Estimator (SME)

To convert the density r ( x ) into an exponential density with sufficient statistics t ( x ) , we can also use the Score Matching Estimator [27,47] (SME). The Score Matching Estimator minimizes the Hyvärinen divergence D H (Equation (4) of [47]):
$D_H[p:p_\theta] := \frac{1}{2}\int \left(\partial_x\log p(x) - \partial_x\log p_\theta(x)\right)^2 p(x)\,\mathrm{d}x.$
The Hyvärinen divergence is also known as half of the relative Fisher information in the optimal transport community (Equation (8) of [48] or Equation (2.2) in [49]), where it is defined for two measures μ and ν as follows:
$I[\mu:\nu] := \int_{\mathcal{X}}\left\|\nabla\log\frac{\mathrm{d}\mu}{\mathrm{d}\nu}\right\|^2\mathrm{d}\mu = 4\int_{\mathcal{X}}\left\|\nabla\sqrt{\frac{\mathrm{d}\mu}{\mathrm{d}\nu}}\right\|^2\mathrm{d}\nu.$
Moreover, the relative Fisher information can be defined on complete Riemannian manifolds [48].
That is, we convert a density $r(x)$ into an exponential family density $p_\theta(x)$ using the following minimization problem:
$\theta_{\mathrm{SME}}(r) = \arg\min_{\theta\in\Theta} D_H[r:p_\theta].$
Beware that in statistics, the score $s_\theta(x)$ is defined by $\nabla_\theta\log p_\theta(x)$, but in score matching, we refer to the "data score" defined by $\partial_x\log p_\theta(x)$. Hyvärinen [47] gave an explanation of the naming "score" using a spurious location parameter.
  • Generic solution: It can be shown that for exponential families [47], we obtain the following solution:
    $\theta_{\mathrm{SME}}(r) = -E_r[A(x)]^{-1}\times E_r[b(x)],$
    where
    $A(x) := [t_i'(x)\,t_j'(x)]_{ij}$
    is a $D\times D$ symmetric matrix, and
    $b(x) = [t_1''(x)\ \cdots\ t_D''(x)]^\top$
    is a $D$-dimensional column vector.
Theorem 2.
The best conversion of a density $r(x)$ into a density $p_\theta(x)$ of an exponential family minimizing the right-sided Hyvärinen divergence is
$\theta_{\mathrm{SME}}(r) = -E_r\left[[t_i'(x)\,t_j'(x)]_{ij}\right]^{-1}\times E_r\left[[t_1''(x)\ \cdots\ t_D''(x)]^\top\right].$
  • Solution instantiated for polynomial exponential families:
    For polynomial exponential families of order $D$, we have $t_i'(x) = i\,x^{i-1}$ and $t_i''(x) = i(i-1)\,x^{i-2}$, and therefore, we have
    $A_D = E_r[A(x)] = \left[i\,j\,\mu_{i+j-2}(r)\right]_{ij},$
    and
    $b_D = E_r[b(x)] = \left[j(j-1)\,\mu_{j-2}(r)\right]_j,$
    where $\mu_l(r) := E_r[X^l]$ denotes the $l$-th raw moment of the distribution $X\sim r(x)$ (with the convention that $\mu_{-1}(r)=0$). For a probability density function $r(x)$, we have $\mu_0(r)=1$.
    Thus, the integral-based SME of a density $r$ is:
    $\theta_{\mathrm{SME}}(r) = -\left[i\,j\,\mu_{i+j-2}(r)\right]_{ij}^{-1}\times\left[j(j-1)\,\mu_{j-2}(r)\right]_j.$
    For example, matrix $A_4$ is
    $A_4 = \begin{pmatrix} \mu_0 & 2\mu_1 & 3\mu_2 & 4\mu_3 \\ 2\mu_1 & 4\mu_2 & 6\mu_3 & 8\mu_4 \\ 3\mu_2 & 6\mu_3 & 9\mu_4 & 12\mu_5 \\ 4\mu_3 & 8\mu_4 & 12\mu_5 & 16\mu_6 \end{pmatrix}.$
  • Faster PEF solutions using Hankel matrices:
    The method of Cobb et al. [28] (1983) anticipated the Score Matching method of Hyvärinen (2005). It can be derived from Stein’s lemma for exponential families [50]. The integral-based Score Matching method is consistent, i.e., if r = p θ , then θ ¯ SME = θ : The probabilistic proof for r ( x ) = p e ( x ) is reported as Theorem 2 of [28]. The integral-based proof is based on the property that arbitrary order partial mixed derivatives can be obtained from higher-order partial derivatives with respect to θ 1 [29]:
    $\partial_1^{i_1}\cdots\partial_D^{i_D} F(\theta) = \partial_1^{\sum_{j=1}^{D} j\,i_j} F(\theta),$
    where $\partial_i := \frac{\partial}{\partial\theta_i}$.
    The complexity of the direct SME method is O ( D 3 ) as it requires the inverse of the D × D -dimensional matrix A D .
    We show how to lower this complexity by reporting an equivalent method (originally presented in [28]) that relies on recurrence relationships between the moments of p θ ( x ) for PEDs. Recall that μ l ( r ) denotes the l-th raw moment E r [ x l ] .
    Let A = [ a i + j 2 ] i j denote the D × D symmetric matrix with a i + j 2 ( r ) = μ i + j 2 ( r ) (with a 0 ( r ) = μ 0 ( r ) = 1 ), and b = [ b i ] i the D-dimensional vector with b i ( r ) = ( i + 1 ) μ i ( r ) . We solve the system A β = b to obtain β = A 1 b . We then obtain the natural parameter θ ¯ SME from the vector β as
    $\bar\theta_{\mathrm{SME}} = \left[\frac{\beta_1}{2}\ \cdots\ \frac{\beta_i}{i+1}\ \cdots\ \frac{\beta_D}{D+1}\right]^\top.$
    Now, if we inspect the matrix $A_D = [\mu_{i+j-2}(r)]_{ij}$, we find that $A_D$ is a Hankel matrix: A Hankel matrix has constant anti-diagonals and can be inverted in quadratic time [51,52] instead of cubic time for a general $D\times D$ matrix. (The inverse of a Hankel matrix is a Bezoutian matrix [53].) Moreover, a Hankel matrix can be stored using linear memory (store $2D-1$ coefficients) instead of the quadratic memory of regular matrices.
    For example, matrix $A_4$ is:
    $A_4 = \begin{pmatrix} \mu_0 & \mu_1 & \mu_2 & \mu_3 \\ \mu_1 & \mu_2 & \mu_3 & \mu_4 \\ \mu_2 & \mu_3 & \mu_4 & \mu_5 \\ \mu_3 & \mu_4 & \mu_5 & \mu_6 \end{pmatrix},$
    and requires only $6 = 2\times 4 - 2$ coefficients to be stored (since $\mu_0=1$) instead of $4\times 4 = 16$. The order-$d$ moment matrix
    $A_d := [\mu_{i+j-2}]_{ij} = \begin{pmatrix} \mu_0 & \mu_1 & \cdots & \mu_d \\ \mu_1 & \mu_2 & & \vdots \\ \vdots & & \ddots & \\ \mu_d & \cdots & & \mu_{2d} \end{pmatrix}$
    is a Hankel matrix stored using $2d+1$ coefficients:
    $A_d =: \mathrm{Hankel}(\mu_0,\mu_1,\ldots,\mu_{2d}).$
    In statistics, those matrices $A_d$ are called moment matrices and are well studied [54,55,56]. The variance $\mathrm{Var}[X]$ of a random variable $X$ can be expressed as the determinant of the order-2 moment matrix:
    $\mathrm{Var}[X] = E[(X-\mu)^2] = E[X^2] - E[X]^2 = \mu_2 - \mu_1^2 = \det\begin{pmatrix} 1 & \mu_1 \\ \mu_1 & \mu_2 \end{pmatrix} \geq 0.$
    This observation yields a generalization of the notion of variance to $d+1$ random variables $X_1,\ldots,X_{d+1}\sim_{\mathrm{iid}} F_X$: $E\left[\prod_{j>i}(X_i-X_j)^2\right] = (d+1)!\,\det(A_d) \geq 0$. The variance can be expressed as $E[\frac{1}{2}(X_1-X_2)^2]$ for $X_1,X_2\sim_{\mathrm{iid}} F_X$. See [57] (Chapter 5) for a detailed description related to U-statistics.
    For GMMs r, the raw moments μ l ( r ) to build matrix A D can be calculated in closed-form, as explained in Section 2.4.
Theorem 3 (Score matching GMM conversion).
The Score Matching conversion of a GMM $m(x)$ into a polynomial exponential density $p_{\theta_{\mathrm{SME}}(m)}(x)$ of order $D$ is obtained as
$\theta_{\mathrm{SME}}(m) = -\left[i\,j\,m_{i+j-2}\right]_{ij}^{-1}\times\left[j(j-1)\,m_{j-2}\right]_j,$
where $m_i = E_m[x^i]$ denotes the $i$-th non-central moment of the GMM $m(x)$.
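The sketch below (ours, not the paper's reference implementation) instantiates Theorem 3 with a direct $O(D^3)$ linear solve and the sign convention $\theta_{\mathrm{SME}} = -A_D^{-1}b_D$; as a sanity check, a single Gaussian with $D=2$ recovers its exact natural parameter $(\mu/\sigma^2, -1/(2\sigma^2))$:

```python
import numpy as np
from scipy.stats import norm

def gmm_raw_moment(w, mu, sigma, l):
    """l-th raw moment E_m[x^l] of a univariate GMM."""
    return sum(wi * norm(mi, si).moment(l) for wi, mi, si in zip(w, mu, sigma))

def theta_sme(w, mu, sigma, D):
    """Natural parameter of the order-D PED fitted to the GMM by integral score matching."""
    mom = [gmm_raw_moment(w, mu, sigma, l) for l in range(2 * D - 1)]   # mu_0 .. mu_{2D-2}
    A = np.array([[i * j * mom[i + j - 2] for j in range(1, D + 1)] for i in range(1, D + 1)])
    b = np.array([i * (i - 1) * (mom[i - 2] if i >= 2 else 0.0) for i in range(1, D + 1)])
    return -np.linalg.solve(A, b)      # A is a rescaled Hankel moment matrix

# Sanity check: a single Gaussian N(1, 2^2) with D = 2
print(theta_sme([1.0], [1.0], [2.0], D=2))   # ~ [0.25, -0.125] = (mu/sigma^2, -1/(2 sigma^2))
```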

2.3. Converting Numerically Moment Parameters from/to Natural Parameters

Recall that our fast heuristic approximates the Jeffreys divergence by
$\tilde{D}_J[m,m'] := (\bar\theta_{\mathrm{SME}}(m)-\bar\theta_{\mathrm{SME}}(m'))^\top(\bar\eta_{\mathrm{MLE}}(m)-\bar\eta_{\mathrm{MLE}}(m')).$
Because $F$ and $F^*$ are not available in closed form (except for the case $D=2$ of the normal family), we cannot obtain $\theta$ from a given $\eta$ (using $\theta=\nabla F^*(\eta)$) nor $\eta$ from a given $\theta$ (using $\eta=\nabla F(\theta)$).
However, provided that we can approximate numerically $\tilde\eta\approx\nabla F(\theta)$ and $\tilde\theta\approx\nabla F^*(\eta)$, we also consider these two approximations for the Jeffreys divergence:
$\tilde\Delta_J^{\mathrm{MLE}}[m_1,m_2] := (\tilde\theta_2^{\mathrm{MLE}}-\tilde\theta_1^{\mathrm{MLE}})^\top(\bar\eta_2^{\mathrm{MLE}}-\bar\eta_1^{\mathrm{MLE}}),$
and
$\tilde\Delta_J^{\mathrm{SME}}[m_1,m_2] := (\bar\theta_2^{\mathrm{SME}}-\bar\theta_1^{\mathrm{SME}})^\top(\tilde\eta_2^{\mathrm{SME}}-\tilde\eta_1^{\mathrm{SME}}).$
We show how to numerically estimate $\tilde\theta_{\mathrm{MLE}}\approx\nabla F^*(\bar\eta_{\mathrm{MLE}})$ from $\bar\eta_{\mathrm{MLE}}$ in Section 2.3.1. Next, in Section 2.3.2, we show how to stochastically estimate $\tilde\eta_{\mathrm{SME}}\approx\nabla F(\bar\theta_{\mathrm{SME}})$.

2.3.1. Converting Moment Parameters to Natural Parameters Using Maximum Entropy

Let us report the iterative approximation technique of [58] (which extended the method described in [35]) based on solving a maximum entropy problem (MaxEnt problem). This method will be useful when comparing our fast heuristic D ˜ J [ m , m ] with the approximations Δ ˜ J MLE [ m , m ] and Δ ˜ J SME [ m , m ] .
The density $p_\theta$ of any exponential family can be characterized as a maximum entropy distribution given the $D$ moment constraints $E_{p_\theta}[t_i(x)]=\eta_i$: Namely, $\max_p h(p)$ subject to the $D+1$ moment constraints $\int t_i(x)p(x)\,\mathrm{d}x=\eta_i$ for $i\in\{0,\ldots,D\}$, where we added by convention $\eta_0=1$ and $t_0(x)=1$ (so that $\int p(x)\,\mathrm{d}x=1$). The solution of this MaxEnt problem [58] is $p(x)=p_\lambda$, where $\lambda$ collects the $D+1$ Lagrangian parameters. Here, we adopt the following canonical parameterization of the densities of an exponential family:
$p_\lambda(x) := \exp\left(\sum_{i=0}^{D}\lambda_i t_i(x)\right).$
That is, $\lambda_0 = -F(\theta)$ and $\lambda_i = \theta_i$ for $i\in\{1,\ldots,D\}$. Parameter $\lambda$ is a kind of augmented natural parameter that includes the (negated) log-normalizer in its first coefficient.
Let $K_i(\lambda) := E_{p_\lambda}[t_i(x)] = \eta_i$ for $i\in\{0,\ldots,D\}$ denote the set of $D+1$ non-linear equations. The Iterative Linear System Method [58] (ILSM) converts $p^\eta$ to $p_\theta$ iteratively. We initialize $\lambda^{(0)}$ to $\bar\theta_{\mathrm{SME}}$ (and calculate numerically $\lambda_0^{(0)} = -F(\bar\theta_{\mathrm{SME}})$).
At iteration $t$ with current estimate $\lambda^{(t)}$, we use the following first-order Taylor approximation:
$K_i(\lambda) \approx K_i(\lambda^{(t)}) + (\lambda-\lambda^{(t)})^\top\nabla K_i(\lambda^{(t)}).$
Let $H(\lambda)$ denote the $(D+1)\times(D+1)$ matrix:
$H(\lambda) := \left[\frac{\partial K_i(\lambda)}{\partial\lambda_j}\right]_{ij}.$
We have
$H_{ij}(\lambda) = H_{ji}(\lambda) = E_{p_\lambda}[t_i(x)\,t_j(x)].$
We update as follows:
$\lambda^{(t+1)} = \lambda^{(t)} + H^{-1}(\lambda^{(t)})\begin{pmatrix} \eta_0 - K_0(\lambda^{(t)}) \\ \vdots \\ \eta_D - K_D(\lambda^{(t)}) \end{pmatrix}.$
For a PEF of order $D$, we have
$H_{ij}(\lambda) = E_{p_\lambda}[x^{i+j-2}] = \mu_{i+j-2}(p_\lambda).$
This yields a moment matrix H λ (Hankel matrix), which can be inverted in quadratic time [52]. In our setting, the moment matrix is invertible because | H | > 0 , see [59].
Let $\tilde\theta_T(\eta)$ denote the natural parameter of the PED obtained after $T$ iterations (retrieved from $\lambda^{(T)}$). We have the following approximation of the JD:
$D_J[m,m'] \approx (\tilde\theta_T(\eta)-\tilde\theta_T(\eta'))^\top(\eta-\eta').$
The method is costly because we need to numerically calculate the $\mu_{i+j-2}(p_\lambda)$'s and the $K_i$'s (e.g., using a univariate Simpson integrator). Another potential method consists of estimating these expectations using acceptance-rejection sampling [60,61]. We may also consider the holonomic gradient descent [29]. Thus, the $\eta\rightarrow\theta$ conversion method is costly. Our heuristic $\tilde\Delta_J$ bypasses this costly moment-to-natural parameter conversion by converting each mixture $m$ to a pair $(p_{\theta_{\mathrm{SME}}}, p^{\eta_{\mathrm{MLE}}})$ of PEDs parameterized in the natural and moment parameters, respectively (i.e., loosely speaking, we untangle these dual parameterizations).
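For illustration, here is a rough Newton-type sketch of the ILSM conversion (our own reading of this section, with hypothetical helper names): the $D+1$ constraints $\int x^i\exp(\sum_j\lambda_j x^j)\,\mathrm{d}x=\eta_i$ are solved iteratively, with the higher integrals playing the role of the Hankel Jacobian. The integrals are computed by quadrature on a truncated interval, so this is only meant for small orders $D$ and a reasonable initial guess such as $\bar\theta_{\mathrm{SME}}$.

```python
import numpy as np
from scipy.integrate import quad

def ilsm(eta, lam0, n_iter=15, lim=15.0):
    """Newton iterations for lambda = (lambda_0, ..., lambda_D) matching the target moments eta."""
    lam = np.array(lam0, dtype=float)
    D = len(lam) - 1
    for _ in range(n_iter):
        q = lambda x: np.exp(np.polyval(lam[::-1], x))        # unnormalized exp-polynomial
        K = np.array([quad(lambda x, i=i: x**i * q(x), -lim, lim)[0]
                      for i in range(2 * D + 1)])             # integrals of x^i q(x)
        H = np.array([[K[i + j] for j in range(D + 1)] for i in range(D + 1)])   # Hankel Jacobian
        lam = lam + np.linalg.solve(H, np.asarray(eta) - K[:D + 1])
    return lam

# Recover the standard normal, lambda ~ (-log sqrt(2 pi), 0, -1/2), from eta = (1, 0, 1):
print(ilsm(eta=[1.0, 0.0, 1.0], lam0=[-1.0, 0.0, -0.4]))
```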

2.3.2. Converting Natural Parameters to Moment Parameters

Given a PED $p_\theta(x)$, we have to find its corresponding moment parameter $\eta$ (i.e., such that $p_\theta=p^\eta$). Since $\eta=E_{p_\theta}[t(x)]$, we sample $s$ i.i.d. variates $x_1,\ldots,x_s$ from $p_\theta$ using acceptance-rejection sampling [60,61] or any other Markov chain Monte Carlo technique [62] and estimate $\hat\eta$ as:
$\hat\eta = \frac{1}{s}\sum_{i=1}^{s} t(x_i).$
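A minimal sketch (ours) of this natural-to-moment conversion, here with a random-walk Metropolis sampler targeting the unnormalized PED; an acceptance-rejection sampler [60,61] could be substituted:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_q(x, theta):
    """log of the unnormalized PED q_theta(x) = exp(sum_i theta_i x^i), theta = (theta_1,...,theta_D)."""
    return np.polyval(np.concatenate([theta[::-1], [0.0]]), x)

def eta_from_theta(theta, n_samples=50_000, step=1.0, burn=1_000):
    """Estimate eta = E_{p_theta}[t(x)] by averaging t(x) over Metropolis samples."""
    D = len(theta)
    x, samples = 0.0, []
    for it in range(n_samples + burn):
        prop = x + step * rng.normal()
        if np.log(rng.random()) < log_q(prop, theta) - log_q(x, theta):
            x = prop                                  # accept the proposed move
        if it >= burn:
            samples.append(x)
    xs = np.array(samples)
    return np.array([np.mean(xs ** i) for i in range(1, D + 1)])

# Sanity check on the standard normal written as a PED: theta = (0, -1/2)  ->  eta ~ (0, 1)
print(eta_from_theta(np.array([0.0, -0.5])))
```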

2.4. Raw Non-Central Moments of Normal Distributions and GMMs

In order to implement the MLE or SME Gaussian mixture conversion procedures, we need to calculate the raw moments of a Gaussian mixture model. The $l$-th raw moment $E[Z^l]$ of a standard normal distribution $Z\sim N(0,1)$ is $0$ when $l$ is odd (since the standard normal density is an even function) and $(l-1)!! = \frac{l!}{2^{l/2}(l/2)!}$ when $l$ is even, where $n!! = \sqrt{\frac{2^{n+1}}{\pi}}\,\Gamma\!\left(\frac{n}{2}+1\right) = \prod_{k=0}^{\lceil n/2\rceil-1}(n-2k)$ is the double factorial (with $(-1)!!=1$ by convention). Using the binomial theorem, we deduce that a normal distribution $X=\mu+\sigma Z$ has finite moments:
$\mu_l(p_{\mu,\sigma}) = E_{p_{\mu,\sigma}}[X^l] = E[(\mu+\sigma Z)^l] = \sum_{i=0}^{l}\binom{l}{i}\mu^{l-i}\sigma^i E[Z^i].$
That is, we have
$\mu_l(p_{\mu,\sigma}) = \sum_{i=0}^{\lfloor l/2\rfloor}\binom{l}{2i}(2i-1)!!\,\mu^{l-2i}\sigma^{2i},$
where $n!!$ denotes the double factorial:
$n!! = \prod_{k=0}^{\lceil n/2\rceil-1}(n-2k) = \begin{cases}\prod_{k=1}^{n/2}(2k) & n\ \text{is even},\\ \prod_{k=1}^{(n+1)/2}(2k-1) & n\ \text{is odd}.\end{cases}$
By the linearity of the expectation $E[\cdot]$, we deduce the $l$-th raw moment of a GMM $m(x)=\sum_{i=1}^{k} w_i p_{\mu_i,\sigma_i}(x)$:
$\mu_l(m) = \sum_{i=1}^{k} w_i\,\mu_l(p_{\mu_i,\sigma_i}).$
Notice that by using [63], we can extend this formula to truncated normals and truncated GMMs. Thus, computing the first $O(D)$ raw moments of a GMM with $k$ components can be done in $O(kD^2)$ time using the Pascal triangle method for computing the binomial coefficients. See also [64].
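The sketch below (ours) implements the double-factorial formula for the raw moments of a normal component and the linearity rule for a GMM, cross-checking the values against SciPy:

```python
import numpy as np
from math import comb
from scipy.stats import norm

def double_factorial(n):
    """n!! with the convention (-1)!! = 1."""
    return 1 if n <= 0 else n * double_factorial(n - 2)

def normal_raw_moment(mu, sigma, l):
    """E[(mu + sigma Z)^l] = sum_i C(l, 2i) (2i-1)!! mu^(l-2i) sigma^(2i)."""
    return sum(comb(l, 2 * i) * double_factorial(2 * i - 1) * mu ** (l - 2 * i) * sigma ** (2 * i)
               for i in range(l // 2 + 1))

def gmm_raw_moment(w, mu, sigma, l):
    """l-th raw moment of a GMM, by linearity of the expectation."""
    return sum(wi * normal_raw_moment(mi, si, l) for wi, mi, si in zip(w, mu, sigma))

w, mu, sigma = [0.3, 0.7], [-1.0, 2.0], [0.5, 1.5]
for l in range(1, 7):
    ref = sum(wi * norm(mi, si).moment(l) for wi, mi, si in zip(w, mu, sigma))
    print(l, gmm_raw_moment(w, mu, sigma, l), ref)   # the two columns agree
```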

3. Goodness-of-Fit between GMMs and PEDs: Higher Order Hyvärinen Divergences

Once we have converted a GMM $m(x)$ into an unnormalized PED $q_{\theta_m}(x) = \tilde{p}_{\theta_m}(x)$, we would like to evaluate the quality of the conversion, i.e., $D[m(x):q_{\theta_m}(x)]$, using a statistical divergence $D[\cdot:\cdot]$. This divergence shall allow us to perform model selection by choosing the order $D$ of the PEF so that $D[m(x):p_\theta(x)]\leq\epsilon$ for $\theta\in\mathbb{R}^D$, where $\epsilon>0$ is a prescribed threshold. Since PEDs have computationally intractable normalization constants, we consider a right-sided projective divergence [42] $D[p:q]$ that satisfies $D[p:\lambda q] = D[p:q] = D[p:\tilde{q}]$ for any $\lambda>0$. For example, we may consider the $\gamma$-divergence [65], which is a two-sided projective divergence: $D_\gamma[\lambda p:\lambda' q] = D_\gamma[p:q] = D_\gamma[\tilde{p}:\tilde{q}]$ for any $\lambda,\lambda'>0$, and which converges to the KLD when $\gamma\rightarrow 0$. However, the $\gamma$-divergence between a mixture model and an unnormalized PED does not yield a closed-form formula. Moreover, the $\gamma$-divergence between two unnormalized PEDs is expressed using the log-normalizer function $F(\cdot)$, which is computationally intractable [66].
In order to get a closed-form formula for a divergence between a mixture model and an unnormalized PED, we consider the order-$\alpha$ (for $\alpha>0$) Hyvärinen divergence [42] defined as follows:
$D_{H,\alpha}[p:q] := \frac{1}{2}\int p(x)^\alpha\left(\partial_x\log p(x) - \partial_x\log q(x)\right)^2\mathrm{d}x, \quad\alpha>0.$
The Hyvärinen divergence [42] (order-1 Hyvärinen divergence) has also been called the Fisher divergence [27,67,68,69] or relative Fisher information [48]. Notice that when α = 1 , D H , 1 [ p : q ] = D H [ p : q ] , the ordinary Hyvärinen divergence [27].
The Hyvärinen divergence $D_{H,\alpha}$ is a right-sided projective divergence, meaning that it satisfies $D_{H,\alpha}[p:q] = D_{H,\alpha}[p:\lambda q]$ for any $\lambda>0$. That is, we have $D_{H,\alpha}[p:q] = D_{H,\alpha}[p:\tilde{q}]$. Thus, we have $D_{H,\alpha}[m:p_\theta] = D_{H,\alpha}[m:q_\theta]$ for an unnormalized PED $q_\theta = \tilde{p}_\theta$. For statistical estimation, it is enough to have a one-sided projective divergence since we need to evaluate the goodness of fit between the (normalized) empirical distribution $p_e$ and the (unnormalized) parametric density.
For univariate distributions, $\partial_x\log p(x) = \frac{p'(x)}{p(x)}$, and $\frac{p'(x)}{p(x)} = \frac{\tilde{p}'(x)}{\tilde{p}(x)}$, where $\tilde{p}(x)$ is the unnormalized model.
Let $P_\theta(x) := \sum_{i=1}^{D}\theta_i x^i$ be the polynomial (with no constant term) defining the shape of the EPF:
$p_\theta(x) = \exp\left(P_\theta(x) - F(\theta)\right).$
For PEDs with polynomial $P_\theta(x)$, we have $\frac{p_\theta'(x)}{p_\theta(x)} = P_\theta'(x) = \sum_{i=1}^{D} i\,\theta_i x^{i-1}$.
Theorem 4.
The Hyvärinen divergence D H , 2 [ m : q θ ] of order 2 between a Gaussian mixture m ( x ) and a polynomial exponential family density q θ ( x ) is available in closed form.
Proof. 
We have $D_{H,2}[m:q] = \frac{1}{2}\int m(x)^2\left(\frac{m'(x)}{m(x)} - \sum_{i=1}^{D} i\theta_i x^{i-1}\right)^2\mathrm{d}x$ with
$m'(x) = -\sum_{i=1}^{k} w_i\,\frac{x-\mu_i}{\sigma_i^2}\,p(x;\mu_i,\sigma_i)$
denoting the derivative of the Gaussian mixture density $m(x)$. It follows that:
$D_{H,2}[m:q] = \frac{1}{2}\left(\int m'(x)^2\,\mathrm{d}x - 2\sum_{i=1}^{D} i\theta_i\int x^{i-1} m'(x)\,m(x)\,\mathrm{d}x + \sum_{i,j=1}^{D} i\,j\,\theta_i\theta_j\int x^{i+j-2} m(x)^2\,\mathrm{d}x\right),$
where
$\int x^i m'(x)\,m(x)\,\mathrm{d}x = -\sum_{a=1}^{k}\sum_{b=1}^{k} w_a w_b\int\frac{x-\mu_a}{\sigma_a^2}\,x^i\,p(x;\mu_a,\sigma_a)\,p(x;\mu_b,\sigma_b)\,\mathrm{d}x.$
Since $p_a(x)\,p_b(x) = \kappa_{a,b}\,p(x;\mu_{ab},\sigma_{ab})$, with
$\mu_{ab} = \frac{\sigma_b^2\mu_a + \sigma_a^2\mu_b}{\sigma_a^2+\sigma_b^2},\quad \sigma_{ab} = \frac{\sigma_a\sigma_b}{\sqrt{\sigma_a^2+\sigma_b^2}},\quad \kappa_{a,b} = \exp\left(F(\mu_{ab},\sigma_{ab}) - F(\mu_a,\sigma_a) - F(\mu_b,\sigma_b)\right),$
and
$F(\mu,\sigma) = \frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)$
the log-normalizer of the Gaussian exponential family [42], we obtain
$\int p_a(x)\,p_b(x)\,x^l\,\mathrm{d}x = \kappa_{a,b}\,\mu_l(p_{\mu_{ab},\sigma_{ab}}).$
Thus, the Hyvärinen divergence D H , 2 of order 2 between a GMM and a PED is available in closed-form. □
For example, when $k=1$ (i.e., mixture $m$ is a single Gaussian $p_{\mu_1,\sigma_1}$) and $q_\theta$ is a normal distribution (i.e., a PED with $D=2$, $q_\theta = p_{\mu_2,\sigma_2}$), we obtain the following formula for the order-2 Hyvärinen divergence:
$D_{H,2}[p_{\mu_1,\sigma_1}:p_{\mu_2,\sigma_2}] = \frac{(\sigma_1^2-\sigma_2^2)^2 + 2(\mu_2-\mu_1)^2\sigma_1^2}{8\sqrt{\pi}\,\sigma_1^3\,\sigma_2^4}.$
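As a quick numerical check of this closed form (our own verification code, assuming the order-$\alpha$ Hyvärinen divergence keeps the $\frac{1}{2}$ factor of $D_H$), we compare it against a direct quadrature of the defining integral:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def dh2_closed(mu1, s1, mu2, s2):
    """Closed-form order-2 Hyvarinen divergence between two univariate normals."""
    return ((s1**2 - s2**2)**2 + 2 * (mu2 - mu1)**2 * s1**2) / (8 * np.sqrt(np.pi) * s1**3 * s2**4)

def dh2_numeric(mu1, s1, mu2, s2):
    """Direct quadrature of (1/2) * int p^2 (d/dx log p - d/dx log q)^2 dx."""
    p = norm(mu1, s1).pdf
    dlog1 = lambda x: -(x - mu1) / s1**2
    dlog2 = lambda x: -(x - mu2) / s2**2
    f = lambda x: 0.5 * p(x)**2 * (dlog1(x) - dlog2(x))**2
    return quad(f, -np.inf, np.inf)[0]

print(dh2_closed(0.0, 1.0, 1.0, 2.0), dh2_numeric(0.0, 1.0, 1.0, 2.0))   # the two values agree
```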

4. Experiments: Jeffreys Divergence between Mixtures

In this section, we evaluate our heuristic to approximate the Jeffreys divergence between two mixtures $m$ and $m'$:
$\tilde{D}_J[m,m'] := (\bar\theta_{\mathrm{SME}}(m)-\bar\theta_{\mathrm{SME}}(m'))^\top(\bar\eta_{\mathrm{MLE}}(m)-\bar\eta_{\mathrm{MLE}}(m')).$
Recall that stochastically estimating the JD between $k$-GMMs with Monte Carlo sampling using $s$ samples (i.e., $\hat{D}_{J,s}[m:m']$) requires $\tilde{O}(ks)$ time and is not deterministic. That is, different MC runs yield fluctuating values that may be fairly different. In comparison, approximating $D_J$ by $\tilde{D}_J$ using $\Delta_J$ by converting the mixtures to order-$D$ PEDs requires $O(kD^2)$ time to compute the raw moments and $O(D^2)$ time to invert a Hankel moment matrix. Thus, by choosing $D=2k$, we obtain a deterministic $O(k^3)$ algorithm that is faster than the MC sampling when $k^2\ll s$. Since there are, at most, $k$ modes for a $k$-GMM, we choose order $D=2k$ for the PEDs.
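Putting the pieces together, here is an end-to-end sketch (ours, with hypothetical helper names) of the $\tilde{D}_J$ heuristic with $D=2k$, printed next to the Monte Carlo baseline; it illustrates the procedure rather than reproducing the paper's implementation or timings:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def raw_moments(gmm, lmax):
    """Raw moments mu_0, ..., mu_lmax of a GMM given as (weights, means, standard deviations)."""
    w, mu, sigma = gmm
    return np.array([sum(wi * norm(mi, si).moment(l) for wi, mi, si in zip(w, mu, sigma))
                     for l in range(lmax + 1)])

def eta_mle(gmm, D):
    return raw_moments(gmm, D)[1:]                    # (mu_1, ..., mu_D)

def theta_sme(gmm, D):
    mom = raw_moments(gmm, 2 * D - 2)
    A = np.array([[i * j * mom[i + j - 2] for j in range(1, D + 1)] for i in range(1, D + 1)])
    b = np.array([i * (i - 1) * (mom[i - 2] if i >= 2 else 0.0) for i in range(1, D + 1)])
    return -np.linalg.solve(A, b)

def jeffreys_tilde(gmm1, gmm2, D):
    """Deterministic heuristic: (theta_SME difference) . (eta_MLE difference)."""
    return float(np.dot(theta_sme(gmm1, D) - theta_sme(gmm2, D),
                        eta_mle(gmm1, D) - eta_mle(gmm2, D)))

def jeffreys_mc(gmm1, gmm2, s=100_000):
    """Monte Carlo baseline sampling from the middle mixture."""
    def pdf(x, g):
        w, mu, sigma = g
        return np.exp(-0.5 * ((x[:, None] - mu) / sigma)**2) / (sigma * np.sqrt(2 * np.pi)) @ w
    def draw(n, g):
        w, mu, sigma = g
        idx = rng.choice(len(w), size=n, p=w)
        return rng.normal(mu[idx], sigma[idx])
    n1 = rng.binomial(s, 0.5)
    x = np.concatenate([draw(n1, gmm1), draw(s - n1, gmm2)])
    p, q = pdf(x, gmm1), pdf(x, gmm2)
    return float(np.mean(2 * (p - q) / (p + q) * np.log(p / q)))

# Two 2-GMMs, approximated with PEDs of order D = 2k = 4:
m1 = (np.array([0.5, 0.5]), np.array([-1.0, 2.0]), np.array([1.0, 0.5]))
m2 = (np.array([0.3, 0.7]), np.array([0.0, 2.5]), np.array([1.0, 0.8]))
print(jeffreys_tilde(m1, m2, D=4), jeffreys_mc(m1, m2))   # deterministic heuristic vs. MC estimate
```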
To obtain quantitative results on the performance of our heuristic $\tilde{D}_J$, we build random GMMs with $k$ components as follows: $m(x)=\sum_{i=1}^{k} w_i p_{\mu_i,\sigma_i}(x)$, where $w_i\sim U_i$, $\mu_i\sim 10+10\,U_1$ and $\sigma_i\sim 1+U_2$, where the $U_i$'s, $U_1$ and $U_2$ are independent uniform distributions on $[0,1)$. The mixture weights are then normalized to sum up to one. For each value of $k$, we make 1000 trial experiments to gather statistics and use $s=10^5$ samples for evaluating the Jeffreys divergence $\hat{D}_J$ by Monte Carlo sampling. We denote by $\mathrm{error} := \frac{|\hat{D}_J-\Delta_J|}{\hat{D}_J}$ the relative error of an experiment. Table 1 presents the results of the experiments for $D=2k$: The table displays the average error, the maximum error (the minimum error is very close to zero, of order $10^{-5}$), and the speed-up obtained by our heuristic $\Delta_J$. Those experiments were carried out on a Dell Inspiron 7472 laptop (equipped with an Intel(R) Core(TM) i5-8250U CPU at 1.60 GHz).
Notice that the quality of the approximations of $\tilde{D}_J$ depends on the number of modes of the GMMs. However, calculating the number of modes is difficult [43,70], even for simple cases [71,72].
Figure 4 displays several experiments of converting mixtures to pairs of PEDs to obtain approximations of the Jeffreys divergence.
Figure 5 illustrates the use of the order-2 Hyvärinen divergence D H , 2 to perform model selection for choosing the order of a PED.
Finally, Figure 6 displays some limitations of the GMM-to-PED conversion when the GMMs have many modes. In that case, converting $\bar\eta_{\mathrm{MLE}}$ to obtain $\tilde\theta_T(\bar\eta_{\mathrm{MLE}})$ and estimating the Jeffreys divergence by
$\tilde\Delta_J^{\mathrm{MLE}}[m_1,m_2] = (\tilde\theta_2^{\mathrm{MLE}}-\tilde\theta_1^{\mathrm{MLE}})^\top(\bar\eta_2^{\mathrm{MLE}}-\bar\eta_1^{\mathrm{MLE}})$
improves the results but requires more computation.
Next, we consider learning a PED by converting a GMM derived from a Kernel Density Estimator (KDE). We use the duration of the eruptions of the Old Faithful geyser in Yellowstone National Park (Wyoming, USA): The dataset consists of 272 observations (https://www.stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat) (access date: 25 October 2021) and is included in the R language package 'stats'. Figure 7 displays the GMMs obtained from the KDEs of the Old Faithful geyser dataset when choosing for each component $\sigma=0.05$ (left) and $\sigma=0.1$ (right). Observe that the data are bimodal once the spurious modes (i.e., small bumps) are removed, as studied in [32]. Barron and Sheu [32] modeled that dataset using a bimodal PED of order $D=4$, i.e., a quartic distribution. We model it with a PED of order $D=10$ using the integral-based Score Matching method. Figure 8 displays the unnormalized bimodal density $q_1$ (i.e., $\tilde{p}_1$) that we obtained using the integral-based Score Matching method (with $\mathcal{X}=(0,1)$).

5. Conclusions and Perspectives

Many applications [7,73,74,75] require computing the Jeffreys divergence (the arithmetic symmetrization of the Kullback–Leibler divergence) between Gaussian mixture models. Since the Jeffreys divergence between GMMs is provably not available in closed form [13], one often ends up implementing a costly Monte Carlo stochastic approximation of the Jeffreys divergence. In this paper, we first noticed the simple expression of the Jeffreys divergence between densities $p_\theta$ and $p_{\theta'}$ of an exponential family using their dual natural and moment parameterizations [22] $p_\theta = p^\eta$ and $p_{\theta'} = p^{\eta'}$:
$D_J[p_\theta,p_{\theta'}] = (\theta-\theta')^\top(\eta-\eta'),$
where $\eta=\nabla F(\theta)$ and $\eta'=\nabla F(\theta')$ for the cumulant function $F(\theta)$ of the exponential family. This led us to propose a simple and fast heuristic to approximate the Jeffreys divergence between Gaussian mixture models: First, convert a mixture $m$ to a pair $(p_{\bar\theta_{\mathrm{SME}}}, p^{\bar\eta_{\mathrm{MLE}}})$ of dually parameterized polynomial exponential densities using extensions of the Maximum Likelihood and Score Matching Estimators (Theorems 1 and 3), and then approximate the JD deterministically by
$D_J[m_1,m_2] \approx \tilde{D}_J[m_1,m_2] = (\bar\theta_2^{\mathrm{SME}}-\bar\theta_1^{\mathrm{SME}})^\top(\bar\eta_2^{\mathrm{MLE}}-\bar\eta_1^{\mathrm{MLE}}).$
The order of the polynomial exponential family may be either prescribed or selected using the order-2 Hyvärinen divergence, which evaluates in closed form the dissimilarity between a GMM and a density of an exponential-polynomial family (Theorem 4). We experimentally demonstrated that the Jeffreys divergence between GMMs can be reasonably well approximated by $\tilde{D}_J$ for mixtures with a small number of modes, and we obtained an overall speed-up of several orders of magnitude compared to the Monte Carlo sampling method. We also proposed another deterministic heuristic to estimate $D_J$ as
$\tilde{D}_J^{\mathrm{MLE}}[m_1:m_2] = (\tilde\theta_2^{\mathrm{MLE}}-\tilde\theta_1^{\mathrm{MLE}})^\top(\bar\eta_2^{\mathrm{MLE}}-\bar\eta_1^{\mathrm{MLE}}),$
where $\tilde\theta_{\mathrm{MLE}}\approx\nabla F^*(\bar\eta_{\mathrm{MLE}})$ is numerically calculated using an iterative conversion procedure based on maximum entropy [58] (Section 2.3.1). Our technique extends to other univariate mixtures of exponential families (e.g., mixtures of Rayleigh distributions, mixtures of Gamma distributions, or mixtures of Beta distributions, etc.). One limitation of our method is that the PED modeling of a GMM may not guarantee obtaining the same number of modes as the GMM, even when we increase the order $D$ of the exponential-polynomial densities. This case is illustrated in Figure 9 (right).
Although PEDs are well suited for calculating the Jeffreys divergence compared to GMMs, we point out that GMMs are better suited for sampling, while PEDs require Monte Carlo methods (e.g., adaptive rejection sampling or MCMC methods [62]). Furthermore, we can estimate the Kullback–Leibler divergence between two PEDs using rejection sampling (or other MCMC methods [62]) or by using the $\gamma$-divergence [76] with $\gamma$ close to zero [66] (e.g., $\gamma=0.001$). The web page of the project is https://franknielsen.github.io/JeffreysDivergenceGMMPEF/index.html (accessed on 25 October 2021).
This work opens up several perspectives for future research: For example, we may consider bivariate polynomial-exponential densities for modeling bivariate Gaussian mixture models [29], or we may consider truncating the GMMs in order to avoid tail phenomena when converting GMMs to PEDs [77,78].

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. Ser. A Math. Phys. Sci. 1946, 186, 453–461. [Google Scholar]
  2. McLachlan, G.J.; Basford, K.E. Mixture Models: Inference and Applications to Clustering; M. Dekker: New York, NY, USA, 1988; Volume 38. [Google Scholar]
  3. Pearson, K. Contributions to the mathematical theory of evolution. Philos. Trans. R. Soc. Lond. A 1894, 185, 71–110. [Google Scholar]
  4. Seabra, J.C.; Ciompi, F.; Pujol, O.; Mauri, J.; Radeva, P.; Sanches, J. Rayleigh mixture model for plaque characterization in intravascular ultrasound. IEEE Trans. Biomed. Eng. 2011, 58, 1314–1324. [Google Scholar] [CrossRef] [PubMed]
  5. Kullback, S. Information Theory and Statistics; Courier Corporation: North Chelmsford, MA, USA, 1997. [Google Scholar]
  6. Cover, T.M. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 1999. [Google Scholar]
  7. Vitoratou, S.; Ntzoufras, I. Thermodynamic Bayesian model comparison. Stat. Comput. 2017, 27, 1165–1180. [Google Scholar] [CrossRef] [Green Version]
  8. Kannappan, P.; Rathie, P. An axiomatic characterization of J-divergence. In Transactions of the Tenth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes; Springer: Dordrecht, The Netherlands, 1988; pp. 29–36. [Google Scholar]
  9. Burbea, J. J-Divergences and related concepts. Encycl. Stat. Sci. 2004. [Google Scholar] [CrossRef]
  10. Tabibian, S.; Akbari, A.; Nasersharif, B. Speech enhancement using a wavelet thresholding method based on symmetric Kullback–Leibler divergence. Signal Process. 2015, 106, 184–197. [Google Scholar] [CrossRef]
  11. Veldhuis, R. The centroid of the symmetrical Kullback-Leibler distance. IEEE Signal Process. Lett. 2002, 9, 96–99. [Google Scholar] [CrossRef] [Green Version]
  12. Nielsen, F. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Process. Lett. 2013, 20, 657–660. [Google Scholar] [CrossRef] [Green Version]
  13. Watanabe, S.; Yamazaki, K.; Aoyagi, M. Kullback information of normal mixture is not an analytic function. IEICE Tech. Rep. Neurocomput. 2004, 104, 41–46. [Google Scholar]
  14. Cui, S.; Datcu, M. Comparison of Kullback-Leibler divergence approximation methods between Gaussian mixture models for satellite image retrieval. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 3719–3722. [Google Scholar]
  15. Cui, S. Comparison of approximation methods to Kullback–Leibler divergence between Gaussian mixture models for satellite image retrieval. Remote Sens. Lett. 2016, 7, 651–660. [Google Scholar] [CrossRef] [Green Version]
  16. Sreekumar, S.; Zhang, Z.; Goldfeld, Z. Non-asymptotic Performance Guarantees for Neural Estimation of f-Divergences. In Proceedings of the International Conference on Artificial Intelligence and Statistics (PMLR 2021), San Diego, CA, USA, 18–24 July 2021; pp. 3322–3330. [Google Scholar]
  17. Durrieu, J.L.; Thiran, J.P.; Kelly, F. Lower and upper bounds for approximation of the Kullback-Leibler divergence between Gaussian mixture models. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 4833–4836. [Google Scholar]
  18. Nielsen, F.; Sun, K. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy 2016, 18, 442. [Google Scholar] [CrossRef] [Green Version]
  19. Jenssen, R.; Principe, J.C.; Erdogmus, D.; Eltoft, T. The Cauchy–Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. J. Frankl. Inst. 2006, 343, 614–629. [Google Scholar] [CrossRef]
  20. Liu, M.; Vemuri, B.C.; Amari, S.i.; Nielsen, F. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2407–2419. [Google Scholar]
  21. Robert, C.; Casella, G. Monte Carlo Statistical Methods; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
  22. Barndorff-Nielsen, O. Information and Exponential Families: In Statistical Theory; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
  23. Azoury, K.S.; Warmuth, M.K. Relative loss bounds for on-line density estimation with the exponential family of distributions. Mach. Learn. 2001, 43, 211–246. [Google Scholar] [CrossRef] [Green Version]
  24. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
  25. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
  26. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904. [Google Scholar] [CrossRef] [Green Version]
  27. Hyvärinen, A. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 2005, 6, 695–709. [Google Scholar]
  28. Cobb, L.; Koppstein, P.; Chen, N.H. Estimation and moment recursion relations for multimodal distributions of the exponential family. J. Am. Stat. Assoc. 1983, 78, 124–130. [Google Scholar] [CrossRef]
  29. Hayakawa, J.; Takemura, A. Estimation of exponential-polynomial distribution by holonomic gradient descent. Commun. Stat.-Theory Methods 2016, 45, 6860–6882. [Google Scholar] [CrossRef] [Green Version]
  30. Nielsen, F.; Nock, R. MaxEnt upper bounds for the differential entropy of univariate continuous distributions. IEEE Signal Process. Lett. 2017, 24, 402–406. [Google Scholar] [CrossRef]
  31. Matz, A.W. Maximum likelihood parameter estimation for the quartic exponential distribution. Technometrics 1978, 20, 475–484. [Google Scholar] [CrossRef]
  32. Barron, A.R.; Sheu, C.H. Approximation of density functions by sequences of exponential families. Ann. Stat. 1991, 19, 1347–1369, Correction in 1991, 19, 2284–2284. [Google Scholar]
  33. O’toole, A. A method of determining the constants in the bimodal fourth degree exponential function. Ann. Math. Stat. 1933, 4, 79–93. [Google Scholar] [CrossRef]
  34. Aroian, L.A. The fourth degree exponential distribution function. Ann. Math. Stat. 1948, 19, 589–592. [Google Scholar] [CrossRef]
  35. Zellner, A.; Highfield, R.A. Calculation of maximum entropy distributions and approximation of marginal posterior distributions. J. Econom. 1988, 37, 195–209. [Google Scholar] [CrossRef]
  36. McCullagh, P. Exponential mixtures and quadratic exponential families. Biometrika 1994, 81, 721–729. [Google Scholar] [CrossRef]
  37. Mead, L.R.; Papanicolaou, N. Maximum entropy in the problem of moments. J. Math. Phys. 1984, 25, 2404–2417. [Google Scholar] [CrossRef] [Green Version]
  38. Armstrong, J.; Brigo, D. Stochastic filtering via L2 projection on mixture manifolds with computer algorithms and numerical examples. arXiv 2013, arXiv:1303.6236. [Google Scholar]
  39. Efron, B.; Hastie, T. Computer Age Statistical Inference; Cambridge University Press: Cambridge, UK, 2016; Volume 5. [Google Scholar]
  40. Pinsker, M. Information and Information Stability of Random Variables and Processes (Translated and Annotated by Amiel Feinstein); Holden-Day Inc.: San Francisco, CA, USA, 1964. [Google Scholar]
  41. Fedotov, A.A.; Harremoës, P.; Topsoe, F. Refinements of Pinsker’s inequality. IEEE Trans. Inf. Theory 2003, 49, 1491–1498. [Google Scholar] [CrossRef]
  42. Amari, S. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  43. Carreira-Perpinan, M.A. Mode-finding for mixtures of Gaussian distributions. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1318–1323. [Google Scholar] [CrossRef] [Green Version]
  44. Brown, L.D. Fundamentals of statistical exponential families with applications in statistical decision theory. Lect. Notes-Monogr. Ser. 1986, 9, 1–279. [Google Scholar]
45. Pelletier, B. Informative barycentres in statistics. Ann. Inst. Stat. Math. 2005, 57, 767–780.
46. Améndola, C.; Drton, M.; Sturmfels, B. Maximum likelihood estimates for Gaussian mixtures are transcendental. In Proceedings of the International Conference on Mathematical Aspects of Computer and Information Sciences, Berlin, Germany, 11–13 November 2015; pp. 579–590.
47. Hyvärinen, A. Some extensions of score matching. Comput. Stat. Data Anal. 2007, 51, 2499–2512.
48. Otto, F.; Villani, C. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. J. Funct. Anal. 2000, 173, 361–400.
49. Toscani, G. Entropy production and the rate of convergence to equilibrium for the Fokker-Planck equation. Q. Appl. Math. 1999, 57, 521–541.
50. Hudson, H.M. A natural identity for exponential families with applications in multiparameter estimation. Ann. Stat. 1978, 6, 473–484.
51. Trench, W.F. An algorithm for the inversion of finite Hankel matrices. J. Soc. Ind. Appl. Math. 1965, 13, 1102–1107.
52. Heinig, G.; Rost, K. Fast algorithms for Toeplitz and Hankel matrices. Linear Algebra Its Appl. 2011, 435, 1–59.
53. Fuhrmann, P.A. Remarks on the inversion of Hankel matrices. Linear Algebra Its Appl. 1986, 81, 89–104.
54. Lindsay, B.G. On the determinants of moment matrices. Ann. Stat. 1989, 17, 711–721.
55. Lindsay, B.G. Moment matrices: Applications in mixtures. Ann. Stat. 1989, 17, 722–740.
56. Provost, S.B.; Ha, H.T. On the inversion of certain moment matrices. Linear Algebra Its Appl. 2009, 430, 2650–2658.
57. Serfling, R.J. Approximation Theorems of Mathematical Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 162.
58. Mohammad-Djafari, A. A Matlab program to calculate the maximum entropy distributions. In Maximum Entropy and Bayesian Methods; Springer: Berlin/Heidelberg, Germany, 1992; pp. 221–233.
59. Karlin, S. Total Positivity; Stanford University Press: Redwood City, CA, USA, 1968; Volume 1.
60. von Neumann, J. Various Techniques Used in Connection with Random Digits. In Monte Carlo Method; National Bureau of Standards Applied Mathematics Series; Householder, A.S., Forsythe, G.E., Germond, H.H., Eds.; US Government Printing Office: Washington, DC, USA, 1951; Volume 12, Chapter 13; pp. 36–38.
61. Flury, B.D. Acceptance-rejection sampling made easy. SIAM Rev. 1990, 32, 474–476.
62. Rohde, D.; Corcoran, J. MCMC methods for univariate exponential family models with intractable normalization constants. In Proceedings of the 2014 IEEE Workshop on Statistical Signal Processing (SSP), Gold Coast, Australia, 29 June–2 July 2014; pp. 356–359.
63. Barr, D.R.; Sherrill, E.T. Mean and variance of truncated normal distributions. Am. Stat. 1999, 53, 357–361.
64. Améndola, C.; Faugère, J.-C.; Sturmfels, B. Moment Varieties of Gaussian Mixtures. J. Algebr. Stat. 2016, 7, 14–28.
65. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081.
66. Nielsen, F.; Nock, R. Patch matching with polynomial exponential families and projective divergences. In Proceedings of the International Conference on Similarity Search and Applications, Tokyo, Japan, 24–26 October 2016; pp. 109–116.
67. Yang, Y.; Martin, R.; Bondell, H. Variational approximations using Fisher divergence. arXiv 2019, arXiv:1905.05284.
68. Kostrikov, I.; Fergus, R.; Tompson, J.; Nachum, O. Offline reinforcement learning with Fisher divergence critic regularization. In Proceedings of the International Conference on Machine Learning (PMLR 2021), online, 7–8 June 2021; pp. 5774–5783.
69. Elkhalil, K.; Hasan, A.; Ding, J.; Farsiu, S.; Tarokh, V. Fisher Auto-Encoders. In Proceedings of the International Conference on Artificial Intelligence and Statistics (PMLR 2021), San Diego, CA, USA, 13–15 April 2021; pp. 352–360.
70. Améndola, C.; Engström, A.; Haase, C. Maximum number of modes of Gaussian mixtures. Inf. Inference J. IMA 2020, 9, 587–600.
71. Aprausheva, N.; Mollaverdi, N.; Sorokin, S. Bounds for the number of modes of the simplest Gaussian mixture. Pattern Recognit. Image Anal. 2006, 16, 677–681.
72. Aprausheva, N.; Sorokin, S. Exact equation of the boundary of unimodal and bimodal domains of a two-component Gaussian mixture. Pattern Recognit. Image Anal. 2013, 23, 341–347.
73. Xiao, Y.; Shah, M.; Francis, S.; Arnold, D.L.; Arbel, T.; Collins, D.L. Optimal Gaussian mixture models of tissue intensities in brain MRI of patients with multiple-sclerosis. In Proceedings of the International Workshop on Machine Learning in Medical Imaging, Beijing, China, 20 September 2010; pp. 165–173.
74. Bilik, I.; Khomchuk, P. Minimum divergence approaches for robust classification of ground moving targets. IEEE Trans. Aerosp. Electron. Syst. 2012, 48, 581–603.
75. Alippi, C.; Boracchi, G.; Carrera, D.; Roveri, M. Change Detection in Multivariate Datastreams: Likelihood and Detectability Loss. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9–15 July 2016.
76. Eguchi, S.; Komori, O.; Kato, S. Projective power entropy and maximum Tsallis entropy distributions. Entropy 2011, 13, 1746–1764.
77. Orjebin, E. A Recursive Formula for the Moments of a Truncated Univariate Normal Distribution. Unpublished note, 2014.
78. Del Castillo, J. The singly truncated normal distribution: A non-steep exponential family. Ann. Inst. Stat. Math. 1994, 46, 57–66.
Figure 1. Two examples illustrating the conversion of a GMM $m$ (black) with $k=2$ components (dashed black) into a pair of polynomial exponential densities of order $D=4$, $(p_{\bar{\theta}_{\mathrm{SME}}}, p_{\bar{\eta}_{\mathrm{MLE}}})$. The PED $p_{\bar{\theta}_{\mathrm{SME}}}$ is displayed in green and the PED $p_{\bar{\eta}_{\mathrm{MLE}}}$ in blue. To display $p_{\bar{\eta}_{\mathrm{MLE}}}$, we first converted $\bar{\eta}_{\mathrm{MLE}}$ to $\tilde{\bar{\theta}}_{\mathrm{MLE}}$ using an iterative linear system descent method (ILSDM), and we numerically estimated the normalizing factors $Z(\bar{\theta}_{\mathrm{SME}})$ and $Z(\bar{\eta}_{\mathrm{MLE}})$ to display the normalized PEDs.
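The normalization step mentioned in the caption of Figure 1 only requires a one-dimensional quadrature. The following minimal Python sketch (NumPy/SciPy; function names and the truncated integration support are our own illustrative choices, not taken from the paper's implementation) evaluates an unnormalized PED $q_\theta(x)=\exp\!\big(\sum_{j=1}^D \theta_j x^j\big)$ and estimates $Z(\theta)$ numerically so that the normalized density can be plotted.

```python
import numpy as np
from scipy.integrate import quad

def ped_unnormalized(x, theta):
    """Unnormalized polynomial exponential density q_theta(x) = exp(sum_j theta_j x^j)."""
    # theta = (theta_1, ..., theta_D); polyval wants highest degree first, constant term last (set to 0).
    return np.exp(np.polyval(np.append(theta[::-1], 0.0), x))

def ped_log_partition(theta, support=(-10.0, 10.0)):
    """Numerically estimate log Z(theta) = log int q_theta(x) dx on a wide-enough interval."""
    Z, _ = quad(lambda x: ped_unnormalized(x, theta), *support)
    return np.log(Z)

def ped_pdf(x, theta, support=(-10.0, 10.0)):
    """Normalized PED p_theta(x) = q_theta(x) / Z(theta)."""
    return ped_unnormalized(x, theta) * np.exp(-ped_log_partition(theta, support))

# Illustrative order-4 coefficients (not from the paper): q(x) = exp(x^2 - 0.25 x^4) is bimodal.
theta = np.array([0.0, 1.0, 0.0, -0.25])
xs = np.linspace(-4.0, 4.0, 9)
print(ped_pdf(xs, theta, support=(-6.0, 6.0)))
```

The quadrature interval must cover the effective support of the density; for PEDs with a dominant negative even-degree term, the integrand decays fast enough that a moderately wide interval suffices.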
Figure 2. Two mixtures $m_1$ (black) and $m_2$ (red) with $k_1=10$ and $k_2=11$ components, respectively (left), and the unnormalized PEFs $q_{\bar{\theta}_1}=\tilde{p}_{\bar{\theta}_1}$ (middle) and $q_{\bar{\theta}_2}=\tilde{p}_{\bar{\theta}_2}$ (right) of order $D=8$. The Jeffreys divergence (about 0.2634) is approximated using PEDs within 0.6% of the Monte Carlo estimate, with a speed-up factor of about 3190. Notice that displaying $p_{\bar{\theta}_1}$ and $p_{\bar{\theta}_2}$ on the same density plot as the mixtures would require computing the partition functions $Z(\bar{\theta}_1)$ and $Z(\bar{\theta}_2)$ (which we do not do for this figure). The PEDs $q_{\bar{\eta}_1}$ and $q_{\bar{\eta}_2}$ of the pairs $(\bar{\theta}_1,\bar{\eta}_1)$ and $(\bar{\theta}_2,\bar{\eta}_2)$ parameterized in the moment space are not shown here.
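The Monte Carlo baseline $\hat{D}_J$ against which the PED-based approximation of Figure 2 (and Table 1) is compared can be sketched as follows: each oriented KLD is estimated from i.i.d. samples of its first argument and the two estimates are summed. This is a minimal NumPy sketch with illustrative function names and mixture parameters of our own, not the paper's code.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def gmm_pdf(x, w, mu, sigma):
    """Density of a univariate GMM evaluated at the points x."""
    x = np.atleast_1d(x)[:, None]
    return np.sum(w * norm.pdf(x, loc=mu, scale=sigma), axis=1)

def gmm_sample(n, w, mu, sigma):
    """Draw n samples: pick a component index, then a normal variate from that component."""
    idx = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[idx], sigma[idx])

def kld_mc(n, w1, mu1, s1, w2, mu2, s2):
    """Monte Carlo estimate of D_KL[m1 : m2] using n samples drawn from m1."""
    x = gmm_sample(n, w1, mu1, s1)
    return np.mean(np.log(gmm_pdf(x, w1, mu1, s1) / gmm_pdf(x, w2, mu2, s2)))

def jeffreys_mc(n, gmm1, gmm2):
    """Symmetrized estimate: D_J = D_KL[m1 : m2] + D_KL[m2 : m1]."""
    return kld_mc(n, *gmm1, *gmm2) + kld_mc(n, *gmm2, *gmm1)

# Illustrative 2-GMMs given as (weights, means, standard deviations).
m1 = (np.array([0.6, 0.4]), np.array([-1.0, 2.0]), np.array([0.5, 1.0]))
m2 = (np.array([0.5, 0.5]), np.array([-0.5, 2.5]), np.array([0.7, 0.8]))
print(jeffreys_mc(100_000, m1, m2))
```

The cost of this estimator grows linearly with the number of samples and mixture components, which is precisely the overhead the closed-form PED-based heuristic avoids.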
Figure 3. The best simplification of a GMM $m(x)$ into a single normal component $p_{\theta^*}$ ($\min_{\theta\in\Theta} D_{\mathrm{KL}}[m:p_\theta]=\min_{\eta\in H} D_{\mathrm{KL}}[m:p_\eta]$) is geometrically interpreted as the unique m-projection of $m(x)$ onto the Gaussian family (an e-flat manifold): we have $\eta^*=\bar{\eta}=\sum_{i=1}^{k} w_i\,\eta_i$.
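In the univariate Gaussian case, the m-projection of Figure 3 amounts to moment matching: with sufficient statistics $t(x)=(x,x^2)$, the mixture moments $\eta^*=\sum_i w_i\eta_i$ give the mean $\mu^*=\sum_i w_i\mu_i$ and variance $\sigma^{*2}=\sum_i w_i(\sigma_i^2+\mu_i^2)-\mu^{*2}$. A minimal sketch of this closed-form simplification (function name ours):

```python
import numpy as np

def gmm_to_gaussian(w, mu, sigma):
    """KL-optimal single-Gaussian simplification of a GMM (m-projection = moment matching).

    Matches E[x] and E[x^2]: mu* = sum_i w_i mu_i, var* = sum_i w_i (sigma_i^2 + mu_i^2) - mu*^2.
    """
    mu_star = np.dot(w, mu)
    var_star = np.dot(w, sigma**2 + mu**2) - mu_star**2
    return mu_star, np.sqrt(var_star)

w = np.array([0.6, 0.4])
mu = np.array([-1.0, 2.0])
sigma = np.array([0.5, 1.0])
print(gmm_to_gaussian(w, mu, sigma))   # moment-matched (mean, standard deviation)
```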
Figure 4. Experiments approximating the Jeffreys divergence between two mixtures using pairs of PEDs. Notice that only the PEDs estimated by score matching in the natural parameter space are displayed.
Figure 5. Selecting the PED order $D$ by evaluating the order-2 Hyvärinen divergence values (for $D\in\{4,8,10,12,14,16\}$) and keeping the best one. Here, the order $D=10$ (boxed) yields the lowest order-2 Hyvärinen divergence: the GMM is closest to that PED.
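The order-selection loop of Figure 5 can be mimicked numerically. The sketch below uses the classical Hyvärinen divergence $D_H[m:p_\theta]=\tfrac12\int m(x)\,(\tfrac{d}{dx}\log m(x)-\tfrac{d}{dx}\log p_\theta(x))^2\,dx$ as a stand-in for the paper's order-2 variant (which would plug into the same place); being projective, it does not require $Z(\theta)$. The candidate coefficient vectors are illustrative dummies, not outputs of the score matching estimator, and all names are ours.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def gmm_pdf(x, w, mu, sigma):
    return np.sum(w * norm.pdf(x, mu, sigma))

def gmm_score(x, w, mu, sigma):
    """d/dx log m(x) for a univariate GMM."""
    comps = w * norm.pdf(x, mu, sigma)
    dm = np.sum(comps * (-(x - mu) / sigma**2))
    return dm / np.sum(comps)

def ped_score(x, theta):
    """d/dx log q_theta(x) with q_theta(x) = exp(sum_{j>=1} theta_j x^j)."""
    j = np.arange(1, len(theta) + 1)
    return np.sum(j * theta * x**(j - 1))

def hyvarinen_divergence(gmm, theta, support=(-10.0, 10.0)):
    """Classical Hyvarinen divergence D_H[m : p_theta]; projective, so Z(theta) is not needed."""
    w, mu, sigma = gmm
    integrand = lambda x: gmm_pdf(x, w, mu, sigma) * (gmm_score(x, w, mu, sigma) - ped_score(x, theta))**2
    val, _ = quad(integrand, *support, limit=200)
    return 0.5 * val

# Illustrative model selection: keep the order D whose (dummy) PED coefficients are closest to the GMM.
gmm = (np.array([0.5, 0.5]), np.array([-1.5, 1.5]), np.array([0.6, 0.6]))
candidates = {4: np.array([0.0, 0.8, 0.0, -0.2]),
              6: np.array([0.0, 0.9, 0.0, -0.15, 0.0, -0.01])}
best_D = min(candidates, key=lambda D: hyvarinen_divergence(gmm, candidates[D]))
print(best_D)
```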
Figure 6. Examples illustrating limitations of the conversion of GMMs (black) to PEDs (grey) using the integral-based score matching estimator: the case of GMMs with many modes.
Figure 7. Modeling the Old Faithful geyser data by a KDE (a GMM with $k=272$ components and uniform weights $w_i=\tfrac{1}{272}$): histogram (#bins = 25) (left), KDE with $\sigma=0.05$ (middle), and KDE with $\sigma=0.1$, which exhibits fewer spurious bumps (right).
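The KDE of Figure 7 is itself a GMM with one Gaussian kernel per observation, uniform weights, and a common bandwidth $\sigma$, so it feeds directly into the mixture-conversion pipeline. A minimal sketch (the synthetic sample below is only a stand-in for the 272 Old Faithful values; function names are ours):

```python
import numpy as np
from scipy.stats import norm

def gaussian_kde_as_gmm(samples, sigma):
    """Return (weights, means, scales) of the Gaussian KDE viewed as a uniform-weight GMM."""
    n = len(samples)
    w = np.full(n, 1.0 / n)            # uniform weights w_i = 1/n
    mu = np.asarray(samples, float)    # one kernel centered at each observation
    s = np.full(n, sigma)              # common bandwidth sigma
    return w, mu, s

def kde_pdf(x, w, mu, s):
    x = np.atleast_1d(x)[:, None]
    return np.sum(w * norm.pdf(x, mu, s), axis=1)

# Illustrative stand-in for the 272 eruption durations (replace with the real observations).
samples = np.concatenate([np.random.default_rng(1).normal(2.0, 0.3, 100),
                          np.random.default_rng(2).normal(4.3, 0.4, 172)])
w, mu, s = gaussian_kde_as_gmm(samples, sigma=0.1)
print(kde_pdf(np.array([2.0, 3.0, 4.3]), w, mu, s))
```

A smaller bandwidth keeps more spurious bumps (middle panel of Figure 7), while a larger one smooths them out (right panel).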
Figure 8. Modeling the Old Faithful geyser data by an exponential-polynomial distribution of order $D=10$.
Figure 9. GMM modes versus PED modes: (left) same number and locations of modes for the GMM and the PED; (right) 4 modes for the GMM but only 2 modes for the PED.
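The mode comparison of Figure 9 can be reproduced numerically by scanning a density on a fine grid and counting sign changes of its slope. A minimal sketch (grid-based, so very close or very flat modes may be missed; names and the example mixture are ours):

```python
import numpy as np
from scipy.stats import norm

def count_modes(pdf, lo, hi, num=20001):
    """Count local maxima of a univariate density via sign changes of a finite-difference slope."""
    x = np.linspace(lo, hi, num)
    y = pdf(x)
    slope = np.diff(y)
    # A mode is a + -> - sign change of the slope.
    return int(np.sum((slope[:-1] > 0) & (slope[1:] < 0)))

# Illustrative 2-GMM: well-separated components yield 2 modes.
w, mu, s = np.array([0.5, 0.5]), np.array([-2.0, 2.0]), np.array([0.7, 0.7])
gmm = lambda x: np.sum(w * norm.pdf(x[:, None], mu, s), axis=1)
print(count_modes(gmm, -6.0, 6.0))   # -> 2
```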
Table 1. Comparison of $\tilde{\Delta}_J(m_1,m_2)$ with $\hat{D}_J(m_1,m_2)$ for random GMMs.

k   D    Average error          Maximum error          Speed-up
2   4    0.1180799978221536     0.9491425404132259     2008.2323536011806
3   6    0.12533811294546526    1.9420608151988419     1010.4917042114389
4   8    0.10198448868508087    5.290871019594698      474.5135294829539
5   10   0.06336388579897352    3.8096955246161848     246.38780782640987
6   12   0.07145257192133717    1.0125283726458822     141.39097909641052
7   14   0.10538875853178625    0.8661463142793943     88.62985036546912
8   16   0.4150905507007969     0.4150905507007969     58.72277575395611