Review

Approximation of Densities on Riemannian Manifolds

by
Alice le Brigant
and
Stéphane Puechmorel
*,†
Ecole Nationale de l’Aviation Civile, Université de Toulouse, 31055 Toulouse, France
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Submission received: 11 December 2018 / Revised: 30 December 2018 / Accepted: 3 January 2019 / Published: 9 January 2019
(This article belongs to the Special Issue 20th Anniversary of Entropy—Review Papers Collection)

Abstract: Finding an approximate probability distribution best representing a sample on a measure space is one of the most basic operations in statistics. Many procedures have been designed for that purpose when the underlying space is a finite-dimensional Euclidean space. In applications, however, such a simple setting may not be adequate, and one has to consider data living on a Riemannian manifold. The lack of unique generalizations of the classical distributions, along with theoretical and numerical obstructions, requires several options to be considered. The present work surveys some possible extensions of well-known families of densities to the Riemannian setting, both for parametric and non-parametric estimation.

1. Introduction

In probability and statistics, random variables whose law admits a probability density (with respect to, e.g., the Lebesgue measure) are more tractable than general ones, from both the theoretical and the algorithmic point of view. When dealing with experimental data, the density is generally unknown and must be estimated.
In many cases, it is a function belonging to a given family, defined on the image of the random variable. When the family depends on a finite number of parameters, the estimation problem boils down to finding a point in the parameter space minimizing a goodness-of-fit criterion. Methods pertaining to this class are referred to as parametric. On the other hand, when prior information on the true density is lacking, or when the parametric approach is too complicated or expensive from a computational point of view, it may be more pertinent to use another class of methods, the non-parametric ones, that do not rely on fitting parameters. In this case, the true density is computed from the samples themselves, often by summing copies of a model density known as a kernel function.
On the other hand, some applications require an approximation rather than a proper estimation of the density of a given dataset. When dealing with large datasets, it can be interesting to search for a summary of the empirical distribution in the form of a discrete probability measure with a given number of supporting points. This is known as the quantization problem and has received much attention from the signal processing community. It is worth noticing that finding an optimal quantization is closely related to clustering: each supporting point in the discrete distribution can be thought of as a cluster center, with a membership function that associates with each sample the closest supporting point.
In all the above cases, density approximation is central. When the considered random variables are defined on a finite-dimensional real vector space, the problem has been extensively studied [1,2]. However, in many applications, the data are best modeled as elements of a Riemannian manifold, and approximation procedures have to be adapted. Classical examples include the sphere $S^d$, the d-dimensional torus and some matrix spaces. In medical imaging, for example, diffusion tensor images have pixels taking their values in the cone of symmetric positive definite matrices. The same type of data arises when assigning a complexity level to a traffic situation in the air transportation system context. The geometry of correlation matrices is related to the hyperbolic space and is very informative in signal processing applications. An important issue is the lack of a unique extension of commonly used distributions, like the Gaussian one. The problem is even more acute if one wants to use these model distributions as elementary bricks for approximating more complex ones. This paper will survey the different options offered to estimate and approximate probability densities on Riemannian manifolds. After a brief summary of the main notions of Riemannian geometry that will be needed in the sequel, we present parametric estimation in Section 3. Section 4 and Section 5 are devoted to non-parametric estimation, and Section 6 to optimal quantization.

2. Some Notions from Riemannian Geometry

2.1. Differentiable Manifolds

A topological manifold of dimension d is a topological space M that can be locally approximated at any point $p \in M$ by a subset of $\mathbb{R}^d$ through a so-called local chart, that is, a homeomorphism $\varphi : U \to \mathbb{R}^d$ (a continuous, bijective map with continuous inverse) defined on a neighborhood U of p. A collection of local charts $(U_i, \varphi_i)_{i \in I}$ such that the union of the $U_i$'s covers all of M is called an atlas, and M is said to be differentiable if one can go from one local chart to another in a differentiable way. More precisely, M is of class $C^k$ if the transition maps $\varphi_i \circ \varphi_j^{-1}$, defined by composing chart maps with intersecting domains $U_i \cap U_j$, are $C^k$ maps from $\varphi_j(U_i \cap U_j) \subset \mathbb{R}^d$ to $\varphi_i(U_i \cap U_j) \subset \mathbb{R}^d$. Moreover, if the Jacobian determinants of the transition maps are positive, the manifold is said to be orientable. This property is mandatory to define a volume form, an object that is used repeatedly in the sequel to define densities. However, in the case of non-orientable manifolds, it is still possible to use the weaker notion of Riemannian density, which is an odd-type differential form in the sense of [3]. In the rest of the paper, for the sake of simplicity, we will always consider smooth ($C^\infty$), orientable manifolds.
The local charts allow us to transpose (local) computations to the familiar Euclidean framework and to export definitions from that setting. Given another differentiable manifold N of dimension n, we can naturally define a mapping $F : M \to N$ to be differentiable at a point $p \in M$ if a given (or equivalently, any) representation $\psi \circ F \circ \varphi^{-1}$ of F using local charts $\varphi : U \to \mathbb{R}^d$ of M and $\psi : V \to \mathbb{R}^n$ of N is differentiable at $\varphi(p)$ as a map from $\varphi(U) \subset \mathbb{R}^d$ to $\mathbb{R}^n$. In what follows, we may place ourselves in a local chart $(U, \varphi)$ and use the corresponding local coordinates $(x_1(p), \ldots, x_d(p)) := \varphi(p)$.

2.2. Tangent and Cotangent Vectors

A tangent vector at a point p can be seen as the intrinsic (i.e., compatible with chart transition functions) velocity of a curve in M passing through p, as well as a derivation acting on the algebra $\mathcal{F}_p$ of germs at p of smooth real-valued functions $f : M \to \mathbb{R}$ [4] (Chapter 1.7).
More precisely, let $\alpha : (-\epsilon, \epsilon) \to M$ be a smooth curve on M, and let $p = \alpha(0)$. For any smooth function $f : M \to \mathbb{R}$ with germ $[f] \in \mathcal{F}_p$, the derivative
$\frac{d}{dt}\Big|_{t=0} f \circ \alpha(t)$
depends only on $[f]$ and will be denoted by $X([f])$. The operator $X : \mathcal{F}_p \to \mathbb{R}$ is obviously linear and satisfies the Leibniz rule:
$X([fg]) = X([f])\,[g](p) + [f](p)\,X([g]).$
X is called the tangent vector to α at p. Using Hadamard's lemma in a chart, one can show that X can in fact be represented by a vector in $\mathbb{R}^d$, hence the name. Please note that X as a derivation is coordinate-free, while its representation in $\mathbb{R}^d$ depends on the local coordinates.
A tangent vector at p is a tangent vector to a certain curve passing through p at time zero. The set of all tangent vectors at p defines the tangent space $T_pM$, a vector space of the same dimension as M, and the collection of all tangent spaces defines the tangent bundle $TM = \cup_{p\in M}\, T_pM$. Given a coordinate chart $\varphi = (x_1, \ldots, x_d)$, the tangent vectors defining partial derivation with respect to the coordinates $x_i$ are denoted by $\frac{\partial}{\partial x_1}(p), \ldots, \frac{\partial}{\partial x_d}(p)$ and define a basis of $T_pM$. As any vector space, the tangent space at p admits a dual space $T_p^*M$ called the cotangent space, composed of linear forms $z_p : T_pM \to \mathbb{R}$, also called cotangent vectors. The basis of $T_p^*M$ in local coordinates is denoted by $dx_1(p), \ldots, dx_d(p)$.

2.3. Pullback and Pushforward

Associated with the dual notions of tangent and cotangent vectors are the dual notions of pushforward and pullback. Given a smooth map $F : M \to N$ between two smooth manifolds, the pushforward of F is a linear map $F_* : TM \to TN$ that maps tangent vectors $X_p$ at a point $p \in M$ to tangent vectors $F_*X_p$ at the image point $F(p) \in N$. The vector $F_*X_p$ can be defined as acting on real-valued functions $f : N \to \mathbb{R}$ or, equivalently, as the velocity vector of a curve $\alpha : (-\epsilon, \epsilon) \to M$ passing through p at time zero with speed $\dot\alpha(0) = X_p$,
$(F_*X_p)(f) = X_p(f \circ F), \qquad F_*X_p = \frac{d}{dt}\Big|_{t=0} F \circ \alpha(t).$
The pushforward is also called the differential of F and can also be denoted $d_pF := (F_*)_p$. Symmetrically, the pullback maps cotangent vectors $z_{F(p)}$ at $F(p) \in N$ to cotangent vectors at $p \in M$, acting on tangent vectors $X_p \in T_pM$ as
$(F^*z_{F(p)})(X_p) = z_{F(p)}(F_*X_p).$

2.4. Vector Fields and Covariant Derivatives

A vector field is a mapping $X : M \to TM$ that associates with each point p a tangent vector $X_p \in T_pM$. Just as tangent vectors, it acts on differentiable functions $f : M \to \mathbb{R}$ in a way that can be written, in local coordinates, as
$X_p(f) = \sum_{i=1}^d a_i(p)\,\frac{\partial f}{\partial x_i}(p).$
It is possible to take the derivative of a vector field with respect to another using an affine connection, that is, a functional ∇ that acts on pairs of vector fields $(X, Y) \mapsto \nabla_X Y$ according to the following rules
$\nabla_{fX+Y}Z = f\,\nabla_X Z + \nabla_Y Z, \qquad \nabla_X(fY) = X(f)\,Y + f\,\nabla_X Y, \qquad \nabla_X(Y+Z) = \nabla_X Y + \nabla_X Z.$
This action is referred to as the covariant derivative. Vector fields $V : (-\epsilon, \epsilon) \to TM$ can also be defined along a curve $\alpha(t)$, that is, $V(t)$ is an element of $T_{\alpha(t)}M$ for all t. The covariant derivative along the curve is then denoted by $\frac{DV}{dt} := \nabla_{\dot\alpha(t)} V$.

2.5. Riemannian Metric and Geodesics

The possibility to compute angles and lengths in a differentiable manifold is given by a Riemannian metric, i.e., a smoothly varying inner product $g_p : T_pM \times T_pM \to \mathbb{R}$ defined on each tangent space $T_pM$ at $p \in M$ (recall that $T_pM$ is a vector space). The subscript p in the metric will often be omitted and the associated norm will be denoted by $\|\cdot\| := \sqrt{g(\cdot,\cdot)}$. There is only one affine connection that is symmetric, meaning $\nabla_X Y - \nabla_Y X = [X, Y]$, and compatible with the Riemannian metric, in the sense that
$\frac{d}{dt}\, g(U, V) = g\left(\frac{DU}{dt}, V\right) + g\left(U, \frac{DV}{dt}\right),$
for any vector fields $U, V$ along a curve α. It is called the Levi–Civita connection associated with g and will be denoted by ∇ from now on. Just as the Euclidean distance can be measured as the length of a straight line, distances in a Riemannian manifold are computed through the length of minimizing geodesics. The geodesics of M are the curves γ satisfying the relation $\nabla_{\dot\gamma}\dot\gamma = 0$, which implies that their speed has constant norm $\|\dot\gamma(t)\| = \mathrm{cst}$. They are also the local minimizers of the arc length functional l:
$l : \gamma \mapsto \int_0^1 \|\dot\gamma(t)\|\,dt$
if curves are assumed, without loss of generality, to be defined over the interval $[0,1]$. When it exists, the length of the shortest geodesic linking two points defines their geodesic distance. The cut locus of $p \in M$ is the set of points where the geodesics starting at p stop being minimizing, and the injectivity radius at p is its distance to the cut locus. The global injectivity radius of the manifold is the infimum of the injectivity radii over all points of M.

2.6. Exponential and Logarithm Maps

From the geodesics of M, we can now define the exponential map at a point p, a diffeomorphism (i.e., a differentiable, bijective map with differentiable inverse) denoted by $\exp_p$, which maps a tangent vector v of an open ball $B(0,r) \subset T_pM$ centered at 0 to the endpoint $\gamma(1) =: \exp_p(v)$ of the geodesic $\gamma : [0,1] \to M$ verifying $\gamma(0) = p$, $\dot\gamma(0) = v$. Intuitively, the exponential map moves the point p along the geodesic starting from p at speed v and stops after covering the length $\|v\|$. Conversely, the inverse of the exponential map $\log_p(q) := \exp_p^{-1}(q)$ gives the vector that maps p to q. The image by the exponential map of the open ball $B(0,r) \subset T_pM$, with r less than the injectivity radius at p, is called the geodesic ball of radius $r > 0$ centered at p.
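On the unit sphere $S^{d-1}$ embedded in $\mathbb{R}^d$, both maps admit simple closed forms, which makes the sphere a convenient test bed for the estimation algorithms discussed later. The following is a minimal NumPy sketch (the function names and numerical tolerances are our own choices):

```python
import numpy as np

def sphere_exp(p, v):
    """Exponential map on the unit sphere S^{d-1} embedded in R^d.

    Maps a tangent vector v at p (with <p, v> = 0) to the endpoint of
    the geodesic starting at p with initial speed v.
    """
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:          # exp_p(0) = p
        return p
    return np.cos(norm_v) * p + np.sin(norm_v) * (v / norm_v)

def sphere_log(p, q):
    """Inverse of the exponential map (defined for q != -p).

    Returns the tangent vector at p pointing towards q, whose norm is
    the geodesic distance arccos(<p, q>).
    """
    theta = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    if theta < 1e-12:           # log_p(p) = 0
        return np.zeros_like(p)
    w = q - np.dot(p, q) * p    # project q onto the tangent space at p
    return theta * w / np.linalg.norm(w)

# Sanity check: exp_p(log_p(q)) should recover q.
p = np.array([0.0, 0.0, 1.0])
q = np.array([1.0, 0.0, 0.0])
assert np.allclose(sphere_exp(p, sphere_log(p, q)), q)
```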

2.7. Curvature and Jacobi Fields

The curvature tensor of M associates with any pair of vector fields $(X, Y)$ on M a linear mapping $R(X,Y)$ on the space of vector fields, defined for all vector fields Z by
$R(X,Y)Z := \nabla_X\nabla_Y Z - \nabla_Y\nabla_X Z - \nabla_{[X,Y]}Z,$
where $[X,Y] := XY - YX$ denotes the Lie bracket. Another way to characterize the curvature of M is through the sectional curvature, which is defined for any two-dimensional subspace $\sigma \subset T_pM$ of the tangent space at p by
$K(\sigma) = \frac{g_p(R(u,v)v,\, u)}{g_p(u,u)\,g_p(v,v) - g_p(u,v)^2},$
if $u, v$ are two linearly independent vectors that span σ. Due to the curvature of a manifold, geodesics spreading out from a given point p can either diverge (negative curvature) or converge (positive curvature). The way these geodesics spread out is described by the Jacobi fields. If $\{t \mapsto \exp_p(t\,v(s)),\ s \in (-\epsilon, \epsilon)\}$ is a sheaf of geodesics starting from the same point p at speeds $v(s) \in T_pM$, and $\gamma(t) := \exp_p(t\,v(0))$ denotes the one at the center of the sheaf, the vector field along γ defined for all t by
$J(t) = \frac{d}{ds}\Big|_{s=0} \exp_p(t\,v(s))$
is a Jacobi field along γ. It is the only vector field with initial conditions $J(0) = 0$ and $\frac{DJ}{dt}(0) = \dot v(0)$ (where we identify the two vector spaces $T_pM \simeq T_{v(0)}T_pM$) verifying the Jacobi equation $\frac{D^2}{dt^2}J + R(J, \dot\gamma)\dot\gamma = 0$.

2.8. Measures and Integration over a Riemannian Manifold

A differentiable k-form ω on an orientable d-dimensional manifold M associates with each $p \in M$ an alternating multilinear function $\omega_p : (T_pM)^k \to \mathbb{R}$ (i.e., $\omega_p$ associates zero to any k-tuple with a repetition). If ω is a differentiable k-form on a manifold N, then any smooth map $F : M \to N$ induces by pullback a k-form on M acting on k-tuples of tangent vectors at $p \in M$ as
$(F^*\omega)_p(u_1, \ldots, u_k) = \omega_{F(p)}(F_*u_1, \ldots, F_*u_k).$
The volume forms of M are the differential forms of maximal degree d (the dimension of the manifold), and are the only ones that can be integrated over M. If $(U, \varphi)$ is a local chart such that $\mathrm{supp}\,\omega \subset U$, then $(\varphi^{-1})^*\omega$ is a d-form on $\mathbb{R}^d$, and so it admits a density $f : \varphi(U) \to \mathbb{R}$ with respect to the volume element defined in local coordinates by the exterior product $dx_1 \wedge \cdots \wedge dx_d$. The integral of the volume form ω is then defined by
$\int_U \omega := \int_{\varphi(U)} (\varphi^{-1})^*\omega = \int_{\varphi(U)} f(x)\,dx.$
Every volume form defines a measure on M, which is written by extension in local coordinates $d\mu = f\,dx_1 \wedge \cdots \wedge dx_d$, or $d\mu = f\,dx$ for short. The Riemannian measure is the volume form whose density is given by the square root of the determinant of the Riemannian metric, i.e.,
$d\mathrm{vol}(x) = \sqrt{\det G(x)}\,dx,$
where $G(x)$ is the $d \times d$ matrix with entries
$G_{ij}(x) = g\left(\frac{\partial}{\partial x_i}(x),\, \frac{\partial}{\partial x_j}(x)\right).$
The Riemannian measure will play the role of the Lebesgue measure for integrals defined on M.

2.9. The Laplace–Beltrami Operator

Finally, in order to make this work self-contained, we introduce the generalization of the Laplacian to manifolds, namely, the Laplace–Beltrami operator. Let X be a vector field on M and $\phi_X(t,x)$, $(t,x) \in (-\epsilon,\epsilon) \times U$, its local flow in a neighborhood U of $p \in M$, i.e., such that for all x, $t \mapsto \phi_X(t,x)$ is the unique curve verifying $\partial_t \phi_X(t,x) = X_{\phi_X(t,x)}$ and $\phi_X(0,x) = x$. Then, the Lie derivative of the volume form along the vector field X is given by the derivative of its pullback by the flow of X
$\mathcal{L}_X\mathrm{vol} = \frac{d}{dt}\Big|_{t=0}\, \phi_X(t,\cdot)^*\mathrm{vol}.$
Intuitively, it measures the way infinitesimal volume is transported by X. Since $\mathcal{L}_X\mathrm{vol}$ is a d-form, it admits a density with respect to the Riemannian volume form, which is defined to be the divergence of X, i.e., $\mathcal{L}_X\mathrm{vol} = (\mathrm{div}\,X)\,\mathrm{vol}$. Then, the Laplace–Beltrami operator of a function $f : M \to \mathbb{R}$ is, just as in the Euclidean case, defined as the divergence of its gradient
$\Delta f = \mathrm{div}(\mathrm{grad}\,f),$
where the gradient is linked to the differential (or pushforward) by $g_p(\mathrm{grad}_p f, X_p) = d_pf(X_p)$ for any tangent vector $X_p \in T_pM$.
The Laplace–Beltrami operator can alternatively be defined using the Levi–Civita connection. Let $X, Y$ be vector fields and $f : M \to \mathbb{R}$ as above. The Hessian of f is the symmetric 2-tensor:
$H(f; X, Y) = X(Yf) - (\nabla_X Y)f.$
The Laplacian of f is then defined as the trace of the Hessian with respect to the metric:
$\Delta f = g^{ij}\,H(f; \partial_i, \partial_j),$
where $g^{ij}$ stands for the $(i,j)$ element of the inverse metric matrix and $\partial_i$ is the i-th coordinate vector field.

3. Parametric Estimation

Let $(E, \mathcal{F}, \mu)$ be a measure space. In the sequel, all distributions are assumed to be absolutely continuous with respect to μ and all densities will thus implicitly refer to it. In parametric density estimation, one wants to approximate an unknown distribution with density f by a member of a given parameterized family $\{f_\theta : E \to \mathbb{R}_+,\ \theta \in \Theta\}$ of densities. Most of the time, the optimal $\theta^*$ is found using a maximum likelihood procedure: if $(X_i)_{i=1\ldots N}$ is an iid sample with common distribution f, then
$\theta^* = \operatorname{argmax}_{\theta\in\Theta}\, \prod_{i=1}^N f(X_i \mid \theta).$
The only requirement on the domain E of the family $\{f_\theta,\ \theta\in\Theta\}$ is to be a measure space, which of course encompasses the Riemannian manifold case, with $\mu = \mathrm{vol}$, the Riemannian volume.
Obtaining a meaningful parameterized family of distributions on a general manifold is not an easy task. Some clues will be given at the end of this section. In special cases, some well-known distributions have been introduced; some of them are presented now.

3.1. Directional Statistics

Directional statistics [5] deals with inference on samples of unit vectors and introduces ad hoc distributions. Since unit vectors can be seen as points on the unit sphere of the underlying vector space, it yields a basic, yet extremely useful, example of parameterized families of distributions on a Riemannian manifold.
Since the unit sphere $S^{d-1} \subset \mathbb{R}^d$ has rotational invariance, it is expected that the parameterized families $(f_\theta)_{\theta\in\Theta}$ exhibit the same behavior, i.e.,
$\forall A \in SO(d),\ \forall\theta\in\Theta,\ \forall x\in S^{d-1},\qquad f_{A\theta}(Ax) = f_\theta(x).$
Please note that, to be able to write such a covariance property, it is required to have an action of the group $SO(d)$ on Θ. The case $\Theta = S^{d-1}$ is, to our knowledge, the only one that has been considered by the directional statistics community.
A common choice is the von Mises–Fisher (vMF) distribution on $S^{d-1}$, denoted $M(\mu, \kappa)$, given by the density [6]
$f(x; \mu, \kappa) = c_d(\kappa)\,e^{\kappa\langle\mu,\, x\rangle},\qquad \kappa > 0,\ x \in S^{d-1},$
where
$c_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\,I_{d/2-1}(\kappa)}$
is a normalization constant, with $I_r$ denoting the modified Bessel function of the first kind of order r. The vMF density is unimodal, parameterized by the mean μ and the concentration parameter $\kappa > 0$ that controls the dispersion of the distribution around the mean. The limiting, degenerate case $\kappa = 0$ yields the uniform distribution on the sphere.
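As an illustration, the vMF density is straightforward to evaluate numerically. Here is a small sketch using SciPy's modified Bessel function (the helper name is ours, and numerical care would be needed for large κ, where both $\kappa^{d/2-1}$ and $I_{d/2-1}(\kappa)$ overflow):

```python
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind

def vmf_density(x, mu, kappa):
    """Von Mises-Fisher density on S^{d-1}: c_d(kappa) * exp(kappa <mu, x>)."""
    d = mu.shape[0]
    c_d = kappa ** (d / 2 - 1) / ((2 * np.pi) ** (d / 2) * iv(d / 2 - 1, kappa))
    return c_d * np.exp(kappa * np.dot(mu, x))

# For d = 3, c_3(kappa) reduces to kappa / (4 pi sinh(kappa)), which gives
# a quick sanity check of the implementation.
mu = np.array([0.0, 0.0, 1.0])
kappa = 5.0
assert np.isclose(vmf_density(mu, mu, kappa),
                  kappa / (4 * np.pi * np.sinh(kappa)) * np.exp(kappa))
```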
When the expectations of the projections on a fixed orthonormal basis $(e_1, \ldots, e_d)$ are given, it is a maximum entropy distribution. This fact can easily be seen by writing the associated variational problem with linear constraints:
$\operatorname{argmax}_f\, -\int_{S^{d-1}} f(x)\log f(x)\,d\sigma(x),\qquad \int_{S^{d-1}} f(x)\langle e_i, x\rangle\,d\sigma(x) = a_i,\ i = 1\ldots d,\qquad \int_{S^{d-1}} f(x)\,d\sigma(x) = 1,$
with σ the solid angle measure on the sphere. Using Lagrange multipliers $\lambda_1, \ldots, \lambda_d$ for the first d constraints and c for the last one yields a general form for the solution:
$f(x; c, \lambda) = \exp\left(c + \langle x, \lambda\rangle\right)$
with $\lambda = (\lambda_1, \ldots, \lambda_d)$. The constant c ensures normalization. The remaining multipliers can be interpreted as a mean parameter by normalizing, provided $\lambda \ne 0$:
$\langle x, \lambda\rangle = \|\lambda\|\,\langle x,\, \lambda/\|\lambda\|\rangle.$
The vMF density extends readily to the Stiefel manifold $O(d,p)$ of p-dimensional orthonormal families in $\mathbb{R}^d$ using the same maximum entropy approach, but with projections on the elementary matrices of dimension $d \times p$. The general form of the distribution is then:
$f(X; M) = c(M)\exp\left(\operatorname{tr} M^tX\right),\qquad M \in \mathcal{M}_{d,p},\ X \in O(d,p).$
As in the spherical vMF case, M cannot be interpreted directly as a mean on $O(d,p)$ and some kind of normalization is needed. In order to better understand the behavior of M, it is useful to use its singular value decomposition (SVD) [7]. Let $M = U\Sigma V^t$ with $U \in \mathcal{M}_{d,d}$, $V \in \mathcal{M}_{p,p}$, $\Sigma \in \mathcal{M}_{d,p}$ and $U, V$ orthogonal matrices. Then:
$\operatorname{tr} M^tX = \operatorname{tr} V\Sigma^tU^tX = \operatorname{tr} \Sigma^tU^tXV.$
Let $C_j$, $j = 1\ldots p$, be the columns of the $d \times p$ matrix $XV$. It follows that:
$\operatorname{tr} M^tX = \sum_{j=1}^p \sigma_j\,\langle U_j, C_j\rangle,$
where $\sigma_j$ is the j-th diagonal element of Σ and $U_j$ is the j-th column of U. The net result is thus a product of vMF densities, after a change of basis given by the matrix V. A rank deficiency in the matrix M indicates a uniform distribution on a subspace, much like in the standard vMF case with $\kappa = 0$. In the matrix case, the limiting case can occur on full subspaces.
A generalization of maximum entropy distributions with moment constraints to a Riemannian manifold M can be found in [9], where an analogue of the normal law is obtained. The constraints chosen are, in normal coordinates around the mean value:
$\int f(x)\,d\mathrm{vol}(x) = 1,\qquad \int x\,f(x)\,d\mathrm{vol}(x) = 0,\qquad \int x\,x^t\,f(x)\,d\mathrm{vol}(x) = \Sigma,$
where Σ is a fixed symmetric, positive definite matrix. The resulting density is parameterized by a mean μ and a concentration matrix Γ, and is expressed as:
$f_{\mu,\Gamma}(p) \propto \exp\left(-\frac{\log_\mu(p)^t\,\Gamma\,\log_\mu(p)}{2}\right),\qquad p \in M.$
As mentioned in [9], the distribution may not be differentiable, or even continuous, on the cut locus.
Finally, a different construction [10,11] yields a directional distribution on the hyperbolic d-dimensional space. Only the approach of [11] will be detailed here, as it introduces another way to obtain distributions on manifolds, using exit points of a Brownian motion with drift. The general idea underlying this approach is to use a submanifold of a well-known model manifold on which the Brownian motion with drift can be constructed. Starting at a fixed origin, a Brownian motion path will intersect the submanifold for the first time at a point, called the exit point. The distribution of the exit points yields a generalized directional distribution. The original motivation of this construction comes from the exit distribution on the unit circle of a Brownian motion with drift starting at the origin in $\mathbb{R}^2$: the resulting density turns out to be exactly the vMF.
In the hyperbolic space, the Brownian motion is a diffusion with infinitesimal generator:
$\frac{x_d^2}{2}\sum_{i=1}^d \frac{\partial^2}{\partial x_i^2} - \frac{(d-2)\,x_d}{2}\,\frac{\partial}{\partial x_d},$
where all the coordinates are given in the half-space model of $H^d$:
$H^d = \left\{(x_1, \ldots, x_d) : x_i \in \mathbb{R},\ i = 1,\ldots,d-1,\ x_d \in \mathbb{R}_+^*\right\}.$
It is convenient to represent the half-space model of $H^2$ in $\mathbb{C}$, with $z = x + iy$, $y > 0$. The two-dimensional hyperboloid embedded in $\mathbb{R}^3$ associated with $H^2$ is given by:
$\{(x_1, x_2, x_3) : x_1^2 + x_2^2 - x_3^2 = -1\}.$
It admits hyperbolic coordinates:
$x_1 = \sinh(r)\cos(\theta),\qquad x_2 = \sinh(r)\sin(\theta),\qquad x_3 = \cosh(r),$
which transform to the unit disk model as:
$u = \frac{\sinh(r)\cos(\theta)}{1 + \cosh(r)},\qquad v = \frac{\sinh(r)\sin(\theta)}{1 + \cosh(r)},$
where θ and r are the angular and radial coordinates. Finally, using the complex representation $z = u + iv$ and the Möbius mapping $z \mapsto i(1-z)/(1+z)$, the half-plane coordinates follow:
$x = \frac{\sinh(r)\sin(\theta)}{\cosh(r) + \sinh(r)\cos(\theta)},\qquad y = \frac{1}{\cosh(r) + \sinh(r)\cos(\theta)}.$
The hyperbolic von Mises distribution is then defined, for a given $r > 0$, as the density of the first exit point on the circle of center i and radius r of the hyperbolic Brownian motion starting at i. Its expression is given in [11] (Section 2.2, Proposition 2) as:
$f_\nu(r, \theta) = \frac{1}{2\pi\, P_\nu^0(\cosh(r))}\left(\cosh(r) + \sinh(r)\cos(\theta)\right)^{\nu},$
where $P_\nu^0$ is the Legendre function of the first kind with parameters $0, \nu$, which acts as a normalizing constant to get a true probability density. The parameter ν is similar to the concentration used in the classical von Mises distribution.

3.2. Gaussian-Like Distributions

The maximum entropy distributions introduced above are not the only possible choice for probabilities on manifolds. Another approach may be to mimic the multivariate normal density using the geodesic distance on the manifold. For the space of symmetric positive definite matrices, one can refer to [12], and to [13] for the general case of symmetric spaces.
Let M be a symmetric space [14]. The Gaussian-like density on M with mean μ and variance $\sigma^2 > 0$ is given by:
$f_{\mu,\sigma}(p) = \frac{1}{Z(\sigma)}\exp\left(-\frac{d^2(\mu, p)}{2\sigma^2}\right),$
where the normalizing constant Z is independent of μ. This last fact is one of the motivations to use the above definition: evaluating the density does not require anything more than the geodesic distance. The basic facts about Gaussian-like distributions on symmetric spaces are given below.
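Before stating these facts, note that on the symmetric space of symmetric positive definite matrices with the affine-invariant metric, the geodesic distance has the closed form $d(A, B) = \|\log(A^{-1/2}BA^{-1/2})\|_F$, so the density above is easy to evaluate up to $Z(\sigma)$. A minimal sketch (helper names are ours; $Z(\sigma)$ is omitted, since it does not depend on μ):

```python
import numpy as np
from scipy.linalg import logm, fractional_matrix_power

def spd_distance(a, b):
    """Affine-invariant distance between SPD matrices:
    d(A, B) = ||log(A^{-1/2} B A^{-1/2})||_F."""
    a_inv_sqrt = fractional_matrix_power(a, -0.5)
    middle = logm(a_inv_sqrt @ b @ a_inv_sqrt)
    return np.linalg.norm(middle.real)

def gaussian_like_unnormalized(p, mu, sigma):
    """exp(-d^2(mu, p) / (2 sigma^2)); the constant Z(sigma) is left out."""
    return np.exp(-spd_distance(mu, p) ** 2 / (2 * sigma ** 2))
```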
Definition 1.
A Riemannian symmetric space M is a complete Riemannian manifold on which geodesic symmetries exist everywhere and are isometric.
The above geometric definition fits within a Lie theoretic construction. For details on such objects, the reader can refer to [15], or to [16] for a more complete reference.
Definition 2.
A Riemannian symmetric space M is diffeomorphic to a quotient $G/H$ where G is a connected Lie group and H is a compact Lie subgroup.
The Lie group view enables the use of integral formulas [17] (pp. 203–209).
Proposition 1.
Let M be a symmetric space of non-compact type, diffeomorphic to $G/H$, and let $\mathfrak{g} = \mathfrak{h} + \mathfrak{a} + \mathfrak{n}$ be the Iwasawa decomposition of the Lie algebra $\mathfrak{g}$. For any $f \in L^1(M)$:
$\int_M f(p)\,d\mathrm{vol}(p) = C\int_H\int_{\mathfrak{a}} f\left(\exp(\mathrm{Ad}(h)\,a)\cdot o\right)D(a)\,da\,dh,$
with $da$ the Lebesgue measure on $\mathfrak{a}$, $dh$ the normalized Haar measure on H, and:
$D(a) = \prod_{\alpha\in\Sigma^+}\sinh^{m_\alpha}|\alpha(a)|,$
where the product is taken over the set of positive roots $\Sigma^+$, $m_\alpha$ is the dimension of the root space at α, and o is a reference point.
Applying the previous formula to the mapping:
$\tilde f : p \mapsto \exp\left(-\frac{d^2(\mu, p)}{2\sigma^2}\right)$
with origin at μ yields:
$\int_M \tilde f(p)\,d\mathrm{vol}(p) = C\int_H\int_{\mathfrak{a}}\exp\left(-\frac{B(a,a)}{2\sigma^2}\right)D(a)\,da\,dh,$
with B the Killing form. Since the Haar measure $dh$ is normalized and the integrand does not depend on h, we obtain:
$\int_M \tilde f(p)\,d\mathrm{vol}(p) = Z(\sigma) = C\int_{\mathfrak{a}}\exp\left(-\frac{B(a,a)}{2\sigma^2}\right)D(a)\,da,$
thus proving that the normalizing constant of the Gaussian-like distribution is independent of μ. Ref. [13] also proposes a maximum-likelihood estimator (MLE) suitable for the estimation of the parameters $\mu, \sigma$.
Proposition 2.
Let $X_1, \ldots, X_N$ be an iid sample drawn from the density $f_{\mu,\sigma}$. The MLE estimator $\hat\mu$ (resp. $\hat\eta$) of μ (resp. of $\eta = -1/2\sigma^2$) is the Riemannian barycentre of the sample (resp. $\operatorname{argmax}_\eta\, \eta\hat\rho - \log Z(\sigma(\eta))$). In the expression of the $\hat\eta$ estimator, $\hat\rho$ is given by:
$\hat\rho = \frac{1}{N}\sum_{i=1}^N d^2(\hat\mu, X_i).$
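The Riemannian barycentre in Proposition 2 has no closed form in general, but it can be computed by gradient descent: average the log maps of the samples at the current estimate, then move along the corresponding geodesic. A sketch on the sphere, under the assumption that the data are unit vectors (function names and defaults are ours; the exp/log maps repeat the sketch of Section 2.6):

```python
import numpy as np

def sph_exp(p, v):
    """Sphere exponential map (see the sketch in Section 2.6)."""
    n = np.linalg.norm(v)
    return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * v / n

def sph_log(p, q):
    """Sphere logarithm map (see the sketch in Section 2.6)."""
    t = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    w = q - np.dot(p, q) * p
    nw = np.linalg.norm(w)
    return np.zeros_like(p) if nw < 1e-12 else t * w / nw

def frechet_mean_sphere(samples, step=1.0, n_iter=100, tol=1e-10):
    """Riemannian barycentre on S^{d-1}: gradient descent on
    x -> (1/N) sum_i d^2(x, X_i), whose Riemannian gradient is
    -(2/N) sum_i log_x(X_i). `samples` is an (N, d) array of unit vectors."""
    mu = samples[0]
    for _ in range(n_iter):
        grad = np.mean([sph_log(mu, x) for x in samples], axis=0)
        if np.linalg.norm(grad) < tol:
            break
        mu = sph_exp(mu, step * grad)
    return mu
```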

3.3. Wrapped Distributions

Among their numerous properties, the Gaussian densities in $\mathbb{R}^d$ are known to be fundamental solutions of the heat equation. When the manifold of interest M is obtained as a quotient of a model space H by a discrete group, which is the case for example with Riemann surfaces, the heat kernel on M can be obtained by wrapping the one on H along the orbits of the group action.
The most basic distribution arising that way is the so-called wrapped Gaussian density on the unit circle in $\mathbb{R}^2$. It is defined as:
$f_{wg}(\theta; \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\sum_{k\in\mathbb{Z}}\exp\left(-\frac{(\theta + 2k\pi)^2}{2\sigma^2}\right)$
and clearly exhibits a period $2\pi$. The parameter σ controls the concentration. It is worth noticing that $f_{wg}(\theta; \sigma)$ is in fact the heat kernel on the circle. Evaluating the density involves summing a convergent series, which may be costly when the computation is done numerically. In the case of the wrapped Gaussian density, the very fast decay at infinity of the usual normal density limits the number of terms to be taken into account.
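A truncated-series evaluation takes a few lines; thanks to the Gaussian decay, a handful of terms around $k = 0$ already gives machine-precision accuracy for moderate σ (the truncation level below is our choice):

```python
import numpy as np

def wrapped_gaussian(theta, sigma, n_terms=10):
    """Wrapped Gaussian density on the circle, truncated to 2*n_terms + 1
    terms of the series; theta may be any real angle."""
    k = np.arange(-n_terms, n_terms + 1)
    terms = np.exp(-(theta + 2 * np.pi * k) ** 2 / (2 * sigma ** 2))
    return terms.sum() / (np.sqrt(2 * np.pi) * sigma)

# Quick check that the density integrates to ~1 over one period.
grid = np.linspace(-np.pi, np.pi, 4001)
vals = np.array([wrapped_gaussian(t, sigma=0.7) for t in grid])
print(vals.mean() * 2 * np.pi)  # ~ 1.0
```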
In [18], the heat kernel of simply connected Riemann surfaces is given by wrapping that of one of the only three possible model spaces: the Euclidean plane, the hyperbolic plane and the sphere. The respective heat kernels are:
$K_{\mathbb{R}^2}(x, y, t) = \frac{1}{4\pi t}\exp\left(-\frac{\|x - y\|^2}{4t}\right),$
$K_{H^2}(x, y, t) = \frac{\sqrt{2}\,e^{-t/4}}{(4\pi t)^{3/2}}\int_{d(x,y)}^{+\infty}\frac{s\,e^{-s^2/4t}}{\sqrt{\cosh s - \cosh d(x,y)}}\,ds,$
$K_{S^2}(x, y, t) = \frac{1}{4\pi}\sum_{n\in\mathbb{N}}(2n+1)\exp\left(-n(n+1)t\right)P_n\left(\cos d_{S^2}(x,y)\right),$
where the expression for the hyperbolic plane comes from [19] (p. 360). The distances $d(\cdot,\cdot)$ are the geodesic distances.
Theorem 1.
Let M be a Riemann surface, U its universal cover and G its covering group. Let $K_U$ be the heat kernel on U. Then, the heat kernel on M is obtained by wrapping $K_U$ along the orbits of the covering group action:
$K_M(x, y, t) = \sum_{g\in G} K_U(\tilde x,\, g\cdot\tilde y,\, t),$
where $\tilde x, \tilde y$ are fixed pre-images of x, y, respectively, under the covering map.
The proof can be found in [18] (pp. 7–8). In principle, Theorem 1 yields a density similar to a directional one, but on a more general class of manifolds. Unfortunately, while a closed-form solution for the kernel $K_U$ is known, given by one of the Equations (19)–(21), the wrapped kernel is generally only computable numerically, after truncation of the sum to a finite number of terms.
In the case of surfaces with covering space $H^2$, a more convenient description is possible. In this case, the genus g of the surface is strictly larger than 1, and the fundamental region in $H^2$ is a hyperbolic polygon with $4g$ sides. For any $g \in G$, its length is defined to be $l(g) = \inf_x d(x, gx)$ or, using the conjugacy class of g, $l(g) = \inf_{x,k} d(x, kgk^{-1}x)$, where k runs over G. Elements of G with non-zero length are conjugate to hyperbolic elements in $SL(2,\mathbb{R})$ and are thus conjugate to a scaling $x \mapsto \lambda^2x$. Furthermore, a conjugacy class represents a free homotopy class of closed curves, which contains a unique minimal geodesic whose length is $l(g)$, with g a representative element. For p a primitive element, let $G_p$ denote its centralizer in G. The conjugacy classes in G are all of the form $gp^ng^{-1}$, $g \in G/G_p$, with p primitive and $n \in \mathbb{Z}$. The wrapped kernel can then be rewritten as:
$K_M(x, y, t) = \sum_p\sum_{g\in G/G_p}\sum_{n\in\mathbb{Z}} K_{H^2}(gx,\, p^ngy,\, t),$
where p runs through the primitive elements of G. It indicates that the kernel $K_M$ can be understood as a sum of elementary wrapped kernels associated with primitive elements, namely those $\tilde k_p$ defined by:
$\tilde k_p(x, y, t) = \sum_{n\in\mathbb{Z}} K_{H^2}(x,\, p^ny,\, t),$
with p primitive. Finally, p being hyperbolic, it is conjugate to a scaling, so it is enough to consider kernels of the form:
$\tilde k_p(x, y, t) = \sum_{n\in\mathbb{Z}} K_{H^2}(x,\, \lambda^{2n}y,\, t),$
with $\lambda > 1$ a real number. To each primitive element p is associated a simple closed minimal geodesic loop, which projects onto the axis of the hyperbolic transformation p. In the Poincaré half-plane model, such a loop unwraps onto the segment of the imaginary axis that lies between i and $i\lambda^2$. It is easily seen that the action of the elements $p^n$, $n \in \mathbb{Z}$, gives rise to a tiling of the positive imaginary axis with segments of the form $[i\lambda^{2n}, i\lambda^{2(n+1)})$. This representation allows a simple interpretation of the elementary wrapped kernels $\tilde k_p$, where the wrapping is understood as a winding.
Once again, the computational cost involved in the summation may be high. An approximation of the true wrapped kernel is given in [20]. It is similar to the vMF that approximates the wrapped Gaussian density in the circular case.

3.4. Exponential Families Arising from Group Actions

In many physical systems, some quantities must be invariant under the action of a group. This is a consequence of the celebrated Noether theorem [21]. Looking a little deeper at this theorem reveals the importance of a mapping, called the momentum map, that turns out to be constant under the evolution of the system. Turning back to densities but keeping the physical framework in mind, it seems natural to seek families with a maximum entropy property, with constraints based on the momentum map. This approach and its relationship with information geometry has been thoroughly studied in [22] and will be presented later.
Another view of the same problem is to start from natural exponential families and try to impose group invariance [23]. Let E be a finite-dimensional vector space and let $E^*$ be its dual.
Definition 3.
The set $\mathcal{E}$ is the subset of positive Radon measures μ on E such that:
  • μ is not concentrated on a proper affine subspace of E.
  • The set of the $\theta \in E^*$ such that:
    $\int_E \exp\langle\theta, x\rangle\,\mu(dx) < +\infty$
    has non-empty interior, hereafter denoted $\Theta(\mu)$.
Any measure in $\mathcal{E}$ gives rise to a natural exponential family.
Definition 4.
Let $\mu \in \mathcal{E}$. The natural exponential family with base measure μ is the parameterized family $P_\theta$, $\theta \in \Theta(\mu)$, defined by:
$P_\theta(dx) = \exp\left(\langle\theta, x\rangle - C_\mu(\theta)\right)\mu(dx).$
Given a group G acting on E, a natural exponential family $P_\theta$, $\theta \in \Theta(\mu)$, with base μ is said to be invariant if for any $g \in G$ and any $P_\theta$, $g\cdot P_\theta = P_{\theta'}$ for some $\theta'$ in $\Theta(\mu)$. In the original work, groups of affinities were considered, namely: $g\cdot x = A_gx + v_g$, $x \in E$, where $A_g$ is the linear part of the affinity and $v_g$ the translation part. The main theorem characterizing the group action invariance for a given natural exponential family is [23]:
Theorem 2.
Let $P_\theta$, $\theta \in \Theta(\mu)$, be a natural exponential family and G a group of affinities. The family $P_\theta$ is invariant under the action of G iff there exist mappings $a : G \to E^*$ and $b : G \to \mathbb{R}$ such that:
$\forall(g, g') \in G\times G,\qquad a(gg') = {}^t\!A_g^{-1}\,a(g') + a(g),$
$\forall(g, g') \in G\times G,\qquad b(gg') = b(g) + b(g') - \langle a(g'),\, A_g^{-1}v_g\rangle,$
$\forall g \in G,\qquad g\cdot\mu(dx) = \exp\left(\langle a(g), x\rangle + b(g)\right)\mu(dx).$
When G is a Lie group, $a, b$ are differentiable mappings. From a group theoretic viewpoint, a is a cocycle for the action $g : x \in E^* \mapsto g\cdot x = {}^tg^{-1}\cdot x$. Theorem 2 was given in [24] and improved in [23] (p. 4). As an example of use, natural exponential families on $\mathbb{R}^d$ are characterized by the next theorem.
Theorem 3.
A natural exponential family on $\mathbb{R}^d$ is invariant under the action of $SO(d)$ iff it admits as base measure $\mu = c\,\delta_0 + \phi_*(\nu\otimes\sigma)$, where $c \ge 0$, $\delta_0$ is the delta measure at the origin, σ is the surface measure on $S^{d-1}$ and:
  • ν is a measure on $(0, +\infty)$, with $\nu((0,1]) < +\infty$, such that there exists $k > 0$ with:
    $\int_{[1,+\infty)} x^{-(d-1)/2}\exp(kx)\,\nu(dx) < +\infty,$
  • $\phi : (0, +\infty)\times S^{d-1} \to \mathbb{R}^d\setminus\{0\}$ is the polar coordinates mapping: $(r, u) \mapsto ru$.
Going back to the mechanical formalism and the momentum map, invariant exponential families can be put into a wider framework. The underlying object is a symplectic manifold $(M, \omega)$, where ω is a closed, non-degenerate two-form on M. Let G be a connected Lie group acting on M.
Definition 5.
The action of G on M is said to be symplectic if for all $g \in G$, $g\cdot\omega = \omega$.
The group action defines canonical vector fields on M.
Definition 6.
Let $\xi \in \mathfrak{g}$. The vector field $X_\xi$ is defined by:
$X_\xi : x \in M \mapsto \frac{d}{dt}\Big|_{t=0}\exp_e(t\xi)\cdot x,$
where e denotes the identity of G.
The vector field $X_\xi$ can be interpreted as the infinitesimal action of $\mathfrak{g}$ on the points of M.
Definition 7.
A mapping $U : M \to \mathfrak{g}^*$ is said to be a momentum map for the G-action if for all $\xi \in \mathfrak{g}$:
$d\alpha_\xi = \iota_{X_\xi}\omega,$
where ι denotes the interior product and $\alpha_\xi$ is the 0-form defined as:
$\forall x \in M,\qquad \alpha_\xi(x) = \langle U(x), \xi\rangle.$
As noticed by Souriau [25], the momentum map allows a definition of exponential densities.
Definition 8.
Let $(M, \omega)$ be a symplectic manifold of dimension 2d. The 2d-form $\omega^{\wedge d}/d!$ is a volume form on M, called the Liouville form.
For a symplectic manifold, the Liouville form is the canonical one and will be denoted vol as in the Riemannian case. It is invariant by symplectomorphisms, which is a key ingredient in the definition of exponential families on M.
Definition 9.
Let $(M, \omega)$ be a symplectic manifold. Let G be a connected Lie group acting on M with momentum map U. If the set of $\xi \in \mathfrak{g}$ such that:
$\int_M \exp\langle U(x), \xi\rangle\,d\mathrm{vol}(x) < +\infty$
has non-empty interior, hereafter denoted by Ξ, the exponential family associated with the group action is defined to be:
$\forall\xi\in\Xi,\qquad P_\xi(d\mathrm{vol}(x)) = \exp\left(\langle U(x), \xi\rangle + C(\xi)\right)d\mathrm{vol}(x).$
The momentum map may be used to define the analog of the usual moments and will in turn allow the definition of constraints in a maximum entropy approach.
Definition 10.
With the hypothesis of Definition 9, the n-th moment of a probability density f on M is defined as:
$E_n(f) = \int_M U^{\otimes n}(x)\,f(x)\,d\mathrm{vol}(x).$
Following [25], the exponential distributions are maximum entropy ones. Assuming that the first and second moments are defined for the exponential family, the next proposition holds.
Proposition 3.
Under the assumptions of Definition 9, the exponential distributions are the ones with the largest entropy under the constraint $E_1 = K$, with K a fixed vector in $\mathfrak{g}^*$.
The Souriau approach to invariant Gibbs measures has the obvious advantage of being intrinsic and adapted to a given group action. The parameters of the exponential families are elements of the Lie algebra and must be understood as a general way of fixing a generalized location (please note that this encompasses 'scale' parameters as well). It requires a symplectic base manifold, which is very natural in mechanics, but may be a little tricky to obtain in a more general setting.

4. Non-Parametric Density Estimation by Projection

4.1. The Euclidean Case

The intuitive idea behind the projection approach to density estimation on a measure space $(E, \mathcal{F}, \mu)$ is to use an orthonormal Hilbert basis $(\phi_n)_{n\in\mathbb{N}}$ of the space $L^2(E, \mu)$ to construct an approximation of the unknown density $f \in L^2(E, \mu)$ from its projections. Namely,
$\alpha_n = \mathbb{E}_f[\phi_n(X)] = \int_E \phi_n(x)f(x)\,d\mu(x),$
where $\mathbb{E}_f$ denotes the expectation taken with respect to the density f. The reconstruction formula in $L^2(E, \mu)$ then reads:
$f : x \mapsto \sum_{n\in\mathbb{N}}\alpha_n\phi_n(x).$
To turn the expansion into a density estimator, it is necessary to have an estimator of the coefficients $\alpha_n$. Furthermore, in applications, the series (29) has to be truncated to a finite number of terms. It is thus advisable to have a fast decay of the expansion coefficients as n goes to infinity.
For the first point, an empirical estimator of the expectation is generally used. Assuming an iid sample $(X_i)_{i=1\ldots N}$, the n-th projection is estimated as:
$\hat\alpha_n^N = \frac{1}{N}\sum_{i=1}^N \phi_n(X_i),$
and the density estimator is
$\hat f^N(x) = \sum_n \hat\alpha_n^N\,\phi_n(x).$
Taking the expectation shows that the estimator is unbiased. It is worth noticing that the projection method does not use more than the measure space structure and is thus easy to use on many spaces, including manifolds. It is also very fast to evaluate, provided the expansion functions are known and can be computed easily. It nevertheless suffers from two flaws:
  • The estimated density $\hat f^N$ is not necessarily non-negative, as the expansion functions generally are not.
  • Depending on the underlying measure space, a countable Hilbert basis may not exist and, even if it does, the expansion functions may not be expressible in closed form.
Most of the usual Hilbert spaces are separable and thus admit countable Hilbert bases. However, the Besicovitch space $B^2$ of almost periodic functions [26] is a classical example admitting an uncountable Hilbert basis.
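On the circle with the Lebesgue measure, the real Fourier system is an explicit orthonormal basis, so the projection estimator can be written in a few lines. A minimal sketch (the truncation level and names are our choices), which also illustrates the first flaw above, since the output may dip below zero:

```python
import numpy as np

def fourier_projection_estimator(samples, n_freq, grid):
    """Projection density estimator on [0, 2 pi) in the orthonormal basis
    1/sqrt(2 pi), cos(m t)/sqrt(pi), sin(m t)/sqrt(pi); the coefficients
    are the empirical means of the basis functions over the sample."""
    est = np.full_like(grid, 1.0 / (2 * np.pi))  # alpha_0 * phi_0
    for m in range(1, n_freq + 1):
        est += np.mean(np.cos(m * samples)) * np.cos(m * grid) / np.pi
        est += np.mean(np.sin(m * samples)) * np.sin(m * grid) / np.pi
    return est  # not guaranteed to be non-negative

rng = np.random.default_rng(0)
samples = np.mod(rng.normal(np.pi, 0.5, size=2000), 2 * np.pi)
grid = np.linspace(0.0, 2 * np.pi, 400)
density = fourier_projection_estimator(samples, n_freq=8, grid=grid)
```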

4.2. The Riemannian Case

When the underlying measure space is a compact Riemannian manifold $(M, g)$ of dimension d, equipped with its volume form, the Laplace–Beltrami operator Δ naturally gives rise to a suitable Hilbert basis of $L^2(M, \mathrm{vol})$, which we will denote by $L^2(M)$ for short. Indeed, there exists a sequence $(\lambda_n, \phi_n)_{n\in\mathbb{N}}$ such that for all $n \in \mathbb{N}$,
$\Delta\phi_n = \lambda_n\phi_n,$
and the $\phi_n$'s form a Hilbert basis of $L^2(M)$. The seminal work [27] details the theory of non-parametric projection-based estimation in this case. Unfortunately, only in a few cases are the eigenfunctions of Δ known, limiting the applicability of the method. In the sequel, the estimator $\hat f^N$ with expansion truncated to the Q lowest terms will be denoted by $\hat f_Q^N$. Two main theorems describe the behavior of the estimator.
Theorem 4.
Let f be a density of class $C^s(M)$, with derivatives belonging to $L^2(M)$. For any integer $I > 0$, there exist constants $A_f, B_f$ such that, for all $Q \ge I$,
$\mathbb{E}_f\left[\left\|f - \hat f_Q^N\right\|_{L^2(M)}^2\right] \le A_f\,\frac{Q^{d/2}}{N} + B_f\,Q^{-s}.$
The second theorem gives an $L^\infty$ rate:
Theorem 5.
Let f be a density of class $C^s(M)$, $s > d/2$, with derivatives belonging to $L^2(M)$. For any integer $I > 0$, there exist constants $A_f, B_f$ such that, for all $Q \ge I$,
$\left(\mathbb{E}_f\left[\left\|f - \hat f_Q^N\right\|_{L^\infty(M)}^2\right]\right)^{1/2} \le A_f\,\frac{Q^{d/2}}{N^{1/2}} + B_f\,Q^{(d/2-s)/2}.$
The proofs are quite technical and the interested reader may refer to [27] for the details.
When the eigenfunctions of the Laplacian are known, the method is quite effective, since the evaluation of the estimated density at a point does not depend on the sample size. In most cases, however, a closed form for the eigenfunctions is not available, thus limiting the practical usability of this estimator.
A possible workaround is to estimate the true eigenfunctions and eigenvalues from an approximate discrete problem. In the approach of [28], a weighted graph is constructed from a net of points on the manifold M, with weights given by a function of the geodesic distance between vertices. For the graph, extracting the eigenfunctions and eigenvalues boils down to a standard linear algebra problem and can thus be solved efficiently. The result of the procedure is a finite set of eigenvectors, which represent discrete measures on the manifold. The projection estimator in such a case yields a quantization (i.e., a discrete approximation) of the estimated measure. Going back to a density can be done using a smoothing procedure, or simply by turning the discrete approximation to a piecewise constant one. In both cases, evaluation at point requires the computation of the geodesic distance to all the samples, thus making the overall procedure far less efficient.
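A minimal sketch of this discrete surrogate follows (the weight function, the unnormalized graph Laplacian and the scale ε are our parameter choices; all of them influence the quality of the approximation):

```python
import numpy as np

def laplacian_eigenbasis(pairwise_dist, eps, n_eigs):
    """Approximate Laplace-Beltrami eigenpairs from a net of points.
    `pairwise_dist` is the (n, n) matrix of geodesic distances between
    the points; Gaussian weights turn it into a weighted graph whose
    unnormalized Laplacian L = D - W is diagonalized. The smallest
    eigenpairs mimic the low-frequency eigenfunctions."""
    w = np.exp(-pairwise_dist ** 2 / eps)
    np.fill_diagonal(w, 0.0)
    lap = np.diag(w.sum(axis=1)) - w
    vals, vecs = np.linalg.eigh(lap)   # eigenvalues in ascending order
    return vals[:n_eigs], vecs[:, :n_eigs]
```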
Finally, it is worth mentioning that Laplacian eigenfunctions and group representations are closely related when the underlying space is a Lie group. The reader may refer to [29] for the special case of $SO(d)$.

5. Non-Parametric Kernel Estimation

Non-parametric estimation of densities is most of the time performed using a sum of elementary bell-shaped functions known as kernels. It was introduced in the 1960s by Parzen in his seminal work [30], and by Rosenblatt [31].

5.1. The Euclidean Case

Assuming that the unknown probability density f is univariate, defined on $\mathbb{R}$, it can be estimated using an iid sample $X_i$, $i = 1\ldots N$, as:
$\hat f : x \in \mathbb{R} \mapsto \frac{1}{Nh}\sum_{i=1}^N K\left(\frac{x - X_i}{h}\right),$
where $K : \mathbb{R} \to \mathbb{R}_+$ is a symmetric kernel, i.e., a measurable mapping verifying $K(-x) = K(x)$ and integrating to 1,
$\int_{\mathbb{R}} K(x)\,dx = 1.$
The parameter h is a strictly positive real number called the bandwidth of the estimator. It controls the degree of smoothing, and has to be tuned to get the best compromise between smoothness and accuracy. Kernel density estimators are biased. Their bias can be controlled in the following way.
Proposition 4.
$\left|f(x) - \mathbb{E}[\hat f(x)]\right| \le \int_{\mathbb{R}} K(u)\left|f(x) - f(x - hu)\right|du.$
Proof. 
Let X be a random variable with the common law of the sample. Taking the expectation of $\hat f$ gives:
$\mathbb{E}[\hat f(x)] = \sum_{i=1}^N \frac{1}{Nh}\,\mathbb{E}\left[K\left(\frac{x - X_i}{h}\right)\right] = \frac{1}{h}\,\mathbb{E}\left[K\left(\frac{x - X}{h}\right)\right] = \frac{1}{h}\int_{\mathbb{R}} K\left(\frac{x - y}{h}\right)f(y)\,dy.$
Letting $u = (x - y)/h$,
$\frac{1}{h}\int_{\mathbb{R}} K\left(\frac{x - y}{h}\right)f(y)\,dy = \int_{\mathbb{R}} K(u)\,f(x - hu)\,du$
follows. Then, since the kernel integrates to 1,
$\left|f(x) - \mathbb{E}[\hat f(x)]\right| \le \int_{\mathbb{R}} K(u)\left|f(x) - f(x - uh)\right|du,$
hence proving the claim.  □
When the density is Lipschitz and the kernel satisfies
$\int_{\mathbb{R}} K(u)\,|u|\,du < +\infty,$
the bound can be improved to:
$\left|f(x) - \mathbb{E}[\hat f(x)]\right| \le C\,h\int_{\mathbb{R}} K(u)\,|u|\,du,$
where C is the Lipschitz constant. As expected, the bias vanishes as h goes to 0 but is non-zero when $h > 0$. The variance of the estimator can be computed in pretty much the same way. Since the $X_i$'s are independent,
$\mathrm{var}\left(\frac{1}{Nh}\sum_{i=1}^N K\left(\frac{x - X_i}{h}\right)\right) = \frac{1}{Nh^2}\,\mathrm{var}\left(K\left(\frac{x - X}{h}\right)\right)$
follows. Using $f(x - hu) = f(x - hu) - f(x) + f(x)$, we have:
$\frac{1}{Nh^2}\,\mathbb{E}\left[K\left(\frac{x - X}{h}\right)^2\right] = \frac{1}{Nh}\int_{\mathbb{R}} K^2(u)\,f(x - hu)\,du = \frac{1}{Nh}\int_{\mathbb{R}} K^2(u)\left(f(x - hu) - f(x)\right)du + \frac{f(x)}{Nh}\int_{\mathbb{R}} K^2(u)\,du.$
Thus, with the above Lipschitz condition,
$\frac{1}{Nh^2}\,\mathbb{E}\left[K\left(\frac{x - X}{h}\right)^2\right] \le \frac{C}{N}\int_{\mathbb{R}} K^2(u)\,|u|\,du + \frac{f(x)}{Nh}\int_{\mathbb{R}} K^2(u)\,du.$
It then appears that the upper bound of the variance of $\hat f$ goes to infinity as h goes to 0, due to the term
$\frac{f(x)}{Nh}\int_{\mathbb{R}} K^2(u)\,du.$
This fact is an expression of the bias-variance dilemma.
When the density f is of class $C^2$, the bias can be expressed more conveniently as:
$b(x) = f(x) - \mathbb{E}[\hat f(x)] = -\frac{h^2}{2}\,f^{(2)}(x)\int_{\mathbb{R}} K(u)\,u^2\,du + o(h^2),$
provided
$V_K = \int_{\mathbb{R}} K(u)\,u^2\,du < +\infty.$
Please note that $V_K$ is the variance of the kernel, considered as a probability distribution. It is further assumed in the sequel that the kernel is square integrable. The following holds:
$\mathrm{var}\left(K\left(\frac{x - X}{h}\right)\right) = h\int_{\mathbb{R}} K^2(u)\,f(x - uh)\,du - h^2\left(\int_{\mathbb{R}} K(u)\,f(x - hu)\,du\right)^2,$
so that the variance of the kernel estimator is
$\mathrm{var}\left(\hat f(x)\right) = \frac{1}{Nh}\int_{\mathbb{R}} K^2(u)\,f(x - uh)\,du - \frac{1}{N}\left(\int_{\mathbb{R}} K(u)\,f(x - hu)\,du\right)^2.$
The mean square error (MSE) of the estimator $\hat f$ is the sum of the variance and the squared bias:
$\mathbb{E}\left[(\hat f(x) - f(x))^2\right] = \frac{1}{Nh}\int_{\mathbb{R}} K^2(u)\,f(x - uh)\,du - \frac{1}{N}\left(\int_{\mathbb{R}} K(u)\,f(x - hu)\,du\right)^2 + \frac{h^4}{4}\left(f^{(2)}(x)\right)^2V_K^2 + o(h^4).$
The asymptotic MSE is thus:
$\mathbb{E}\left[(\hat f(x) - f(x))^2\right] \underset{N\to+\infty,\ h\to 0}{\sim} \frac{f(x)}{Nh}\int_{\mathbb{R}} K^2(u)\,du + \frac{h^4}{4}\left(f^{(2)}(x)\right)^2V_K^2.$
There exists an optimal value of h, minimizing the previous expression,
$h_{opt} = \left(\frac{f(x)\,\|K\|_2^2}{N\,A}\right)^{1/5},$
where
$A = \left(f^{(2)}(x)\,V_K\right)^2.$
This relation is very classical in density estimation and yields a pointwise convergence rate in $O(N^{-4/5})$, slower than the usual $O(N^{-1})$ of the parametric case.
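The following sketch implements the univariate estimator with the Epanechnikov kernel and a bandwidth of order $N^{-1/5}$, as suggested by the rate above (the constant in front is a rule-of-thumb choice of ours, not the pointwise optimum $h_{opt}$, which depends on the unknown f):

```python
import numpy as np

def kde(x, samples, h, kernel=lambda u: 0.75 * np.maximum(1 - u ** 2, 0.0)):
    """Univariate kernel density estimator with the Epanechnikov kernel
    (symmetric, integrates to 1, compactly supported on [-1, 1])."""
    return np.mean(kernel((x - samples) / h)) / h

rng = np.random.default_rng(1)
samples = rng.normal(0.0, 1.0, size=1000)
# Bandwidth of order N^{-1/5}, matching the optimal rate derived above.
h = samples.std() * len(samples) ** (-1 / 5)
print(kde(0.0, samples, h))  # ~ 1/sqrt(2 pi) = 0.399 for standard normal data
```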
In the multivariate case, two approaches are in common use. In the first one, the multivariate kernel in $\mathbb{R}^d$ is just a d-fold tensor product of univariate kernels:
$K(x_1, \ldots, x_d) = \prod_{i=1}^d K(x_i).$
The tensor product kernel integrates to 1 by Fubini's theorem, and the density estimator writes as:
$\hat f(x) = \frac{1}{Nh^d}\sum_{i=1}^N K\left(\frac{x - X_i}{h}\right),$
where $x \in \mathbb{R}^d$. Apart from the slower convergence rates, things are similar to the univariate case.
Another option is to let the kernel depend on the norm of the difference between x and the $X_i$'s. In such a case, starting again from a univariate kernel K, the density estimator is:
$\hat f(x) = \frac{C}{Nh^d}\sum_{i=1}^N K\left(\frac{\|x - X_i\|}{h}\right),$
where C is a normalizing constant such that:
$C^{-1} = \int_{\mathbb{R}^d} K(\|u\|)\,du.$
In this framework, the multivariate kernel pertains to the class of radial basis functions, which has been thoroughly studied in approximation theory [32].

5.2. The Riemannian Case

Extending the previous derivations to manifolds is not straightforward, since several problems must be addressed. In the case of Riemannian manifolds, the geodesic distance can be used in place of the Euclidean norm, thus making the radial basis kernels more natural than the tensor product ones. A density estimator based on this idea is due to Pelletier [33]. His approach is briefly described in the sequel.
Let $(M, g)$ be an orientable Riemannian manifold of dimension d, equipped with its volume form vol, and let $K : \mathbb{R}_+ \to \mathbb{R}_+$ be a mapping such that:
$\int_{\mathbb{R}^d} K(\|u\|)\,du = 1,$
$\int_{\mathbb{R}^d} u_j\,K(\|u\|)\,du = 0,\quad j = 1\ldots d,$
$\int_{\mathbb{R}^d} \|u\|^2\,K(\|u\|)\,du < +\infty,$
$K(t) = 0 \ \text{ for } t > 1,$
$\sup_{t\in\mathbb{R}_+} K(t) = K(0).$
The main problem is to define what a radial basis function is in the context of manifolds. A similar question was addressed in [34] for the definition of the Laplacian in radial coordinates. The idea is to use the exponential map $\exp_p$ at a fixed point $p \in M$, on a ball centered at the origin of $T_pM$ with radius less than the injectivity radius $r > 0$ at p, and to compose K with the distance function to p to get the equivalent of a radial basis function in $\mathbb{R}^d$. However, the volume form is not invariant under the action of the exponential map unless the manifold is flat. In order to keep the integral of the kernel equal to 1, a multiplicative correction term has to be added.
Definition 11.
Let $p \in M$. For any v in $T_pM$, let $\gamma_v : t \mapsto \exp_p(t\,v/\|v\|)$ and let $w_i$, $i = 1\ldots d$, be Jacobi fields along $\gamma_v$ such that $w_i(0) = 0$ for all $i = 1\ldots d$, $\frac{Dw_1}{dt}(0) = v/\|v\|$ and the $\frac{Dw_i}{dt}(0)$, $i = 1\ldots d$, form an orthonormal basis of $T_pM$. The volume density function $\theta_p : T_pM \to \mathbb{R}$ is defined as:
$\theta_p(v) = \frac{\left|\det\left(w_1(\|v\|), \ldots, w_d(\|v\|)\right)\right|}{\|v\|^d}.$
By abuse of notation, we will also write $\theta_p(x) = \theta_p(\log_p x)$.
The above definition with varying p extends readily to a mapping $\theta : TM \to \mathbb{R}$. The exponential map $\exp_p : T_pM \to M$ induces by pullback a volume form $\exp_p^*\mathrm{vol}$ on $T_pM$, and it is worth noticing that $\theta_p$ is its density with respect to the Lebesgue measure of the Euclidean structure on $T_pM$. That is, in normal coordinates,
$\exp_p^*\,d\mathrm{vol}(x) = \theta_p(x)\,dx.$
Definition 12.
Let K be a kernel function in the sense of Equation (41), and let $R > 0$ be the injectivity radius of M. The radial kernel at $p \in M$ with bandwidth $0 < r < R$ is defined as the mapping $K_{p,r}$:
$K_{p,r} : x \in M \mapsto \frac{1}{r^d}\,\frac{1}{\theta_p(x)}\,K\left(\frac{d(p, x)}{r}\right).$
Since K has support in $[0,1]$ by assumption, the above expression vanishes for $x \notin B(p, r)$, where $B(p, r)$ is the geodesic ball of center p and radius r. We will thus never have to worry about a possible vanishing of θ.
Proposition 5.
The radial kernel satisfies:
$\int_{B(p,r)} K_{p,r}(x)\,d\mathrm{vol}(x) = 1.$
Proof. 
In normal coordinates at p and from (42), with $B(p,r) = \exp_p(B(0,r))$, the following comes:
$\int_{B(p,r)} K_{p,r}(x)\,d\mathrm{vol}(x) = \int_{B(0,r)} \frac{1}{r^d\,\theta_p(x)}\,K\left(\frac{\|x\|}{r}\right)\exp_p^*\,d\mathrm{vol}(x) = \int_{B(0,r)} \frac{1}{r^d}\,K\left(\frac{\|x\|}{r}\right)dx_1\cdots dx_d = 1. \qquad \square$
The next proposition in [33] shows that the kernel $K_{p,r}$ is centered at p in a probabilistic sense.
Proposition 6.
Let $p \in M$ and let δ be the supremum of the sectional curvatures of M. Let μ be a probability measure, absolutely continuous with respect to the measure vol, and admitting $K_{p,r}$ as density. If $r < \mathrm{inj}_p(M)/2$ and, when $\delta > 0$, also $r < \frac{\pi}{4\sqrt{\delta}}$, then p is the unique minimizer of the function
$E : x \in M \mapsto \int_M d^2(x, y)\,d\mu(y).$
The proof can be found in [33] (Proposition 2.2, p. 5) and relies on a computation of the gradient of E and the convexity of a geodesic ball of radius r satisfying the above conditions. Based on the above results, it is natural to choose the following definition for the Riemannian kernel estimator.
Definition 13.
Let $X_i$, $i = 1\ldots N$, be an iid sample on the manifold M with common density function f. Let $0 < r < \mathrm{inj}(M)$. The kernel density estimator of f on N points with kernel K and bandwidth r is:
$\hat f_{N,K,r}(x) = \frac{1}{N}\sum_{i=1}^N K_{x,r}(X_i).$
Using Proposition 5, it is easy to see that $\hat f_{N,K,r}$ is a probability density on M. The estimator $\hat f_{N,K,r}$ is consistent and behaves roughly as in the Euclidean case.
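On the sphere $S^2$, everything the estimator needs is explicit: $d(p, q) = \arccos\langle p, q\rangle$ and $\theta_p(q) = \sin d(p,q)/d(p,q)$ (dimension $d = 2$). A sketch with a compactly supported Epanechnikov-type profile, normalized so that $K(\|u\|)$ integrates to 1 over the unit disk of $\mathbb{R}^2$ (names and profile are our choices):

```python
import numpy as np

def pelletier_kde_s2(x, samples, r):
    """Pelletier kernel density estimator on the unit sphere S^2 (d = 2).
    Profile K(t) = (2/pi)(1 - t^2) on [0, 1]: radially symmetric,
    supported in the unit ball, and integrating to 1 over R^2."""
    dists = np.arccos(np.clip(samples @ x, -1.0, 1.0))
    d = dists[dists < r]
    # volume density theta_p = sin(d)/d, with the limit 1 at d = 0
    theta = np.where(d > 1e-12, np.sin(d) / np.maximum(d, 1e-12), 1.0)
    k = (2.0 / np.pi) * (1.0 - (d / r) ** 2)
    return np.sum(k / (r ** 2 * theta)) / len(samples)
```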
Theorem 6.
Let f be a $C^2$ probability density on M with bounded first and second derivatives. If $0 < r < \mathrm{inj}(M)$, then there exists a constant $C_f$ such that:
$\mathbb{E}_f\left[\left\|\hat f_{N,K,r} - f\right\|_{L^2(M)}^2\right] \le C_f\left(\frac{1}{Nr^d} + r^4\right).$
The proof given here differs slightly from the one in [33], and relies on the next lemma.
Lemma 1.
Let $\gamma : [0,1] \to M$ be a smooth curve of M and let $f : M \to \mathbb{R}$ be of class $C^{k+1}$. Letting $p = \gamma(0)$, $u = \gamma'(0)$, the Taylor expansion at order k of f along γ with integral remainder at p is given by:
$f\circ\gamma(t) = f(p) + t\,\nabla_u f + \cdots + \frac{t^k}{k!}\,\nabla_u^kf + \int_0^t \nabla_{\gamma'(y)}^{k+1}f\,\frac{(t - y)^k}{k!}\,dy,$
where ∇ is an affine connection and $\nabla_uf = u(f) = df_p\cdot u$.
Proof. 
The mapping $\alpha = f\circ\gamma$ is defined from $[0,1]$ to $\mathbb{R}$ and is of class $C^{k+1}$. It thus admits a Taylor expansion with integral remainder
$\alpha(t) = f(p) + \sum_{i=1}^k \frac{d^i\alpha}{dt^i}(0)\,\frac{t^i}{i!} + \int_0^t \frac{d^{k+1}\alpha}{dt^{k+1}}(y)\,\frac{(t - y)^k}{k!}\,dy.$
For a smooth function $\phi : M \to \mathbb{R}$, $\frac{d}{dt}\,\phi\circ\gamma(t) = \nabla_{\gamma'(t)}\phi = d\phi_{\gamma(t)}\cdot\gamma'(t)$, and the result follows by induction.  □
Lemma 2.
Let $\gamma : [0,1] \to M$ be a geodesic of M and $f : M \to \mathbb{R}$ be of class $C^{k+1}$. Then, the remainder in Lemma 1 is upper bounded by
$C_f\,\frac{\ell(\gamma)^{k+1}}{(k+1)!},$
where $C_f$ is a constant depending on f and $\ell(\gamma)$ is the length of γ.
Proof. 
Let $\omega \in T^*M$. Then, $\nabla_{\gamma'}(\omega\cdot\gamma') = (\nabla_{\gamma'}\omega)\cdot\gamma' + \omega\cdot\nabla_{\gamma'}\gamma' = (\nabla_{\gamma'}\omega)\cdot\gamma'$. Proceeding by induction,
$\nabla_{\gamma'}^{k+1}f = \nabla_{\gamma'(y)}^k(df\cdot\gamma') = \left(\nabla_{\gamma'(y)}^k\,df\right)\cdot\gamma'.$
The term $\nabla_{\gamma'(y)}^k\,df$ is k-linear in $\gamma'$ and involves derivatives of f up to order $k+1$ along γ. Since f is of class $C^{k+1}$ by assumption, it follows that
$\left|\nabla_{\gamma'}^{k+1}f\right| \le C_f\,\|\gamma'\|^{k+1}.$
Since γ is a geodesic, $\|\gamma'\| = \ell(\gamma)$, the length of γ. The claim follows by integration.  □
Proof of Theorem 6.
The proof is essentially an adaptation of the Euclidean case. The bias at $x \in M$ is given by
$b(x) = \int_{B(x,r)} K_{x,r}(y)\,f(y)\,d\mathrm{vol}(y) - f(x),$
and since the kernel integrates to 1,
$b(x) = \int_{B(x,r)} K_{x,r}(y)\left(f(y) - f(x)\right)d\mathrm{vol}(y).$
Using normal coordinates at x, it comes
$b(x) = \int_{B(0,r)} \frac{1}{r^d}\,K\left(\frac{\|u\|}{r}\right)\left(f(\exp_x(u)) - f(x)\right)du.$
Using Lemma 1, $f(\exp_x(u)) - f(x) = \nabla_uf + R_f(u)$. In coordinates, $u = u^i\partial_i$ and, by linearity of the connection ∇ and the symmetry assumption on the kernel,
$\int_{B(0,r)} \frac{1}{r^d}\,K\left(\frac{\|u\|}{r}\right)\nabla_uf\,du = \sum_{i=1}^d \nabla_{\partial_i}f\int_{B(0,r)} \frac{1}{r^d}\,K\left(\frac{\|u\|}{r}\right)u^i\,du = 0.$
Since the first and second derivatives of f are assumed to be bounded on M, there exists, by Lemma 2, a constant $A_f$ such that $|R_f(u)| \le A_f\|u\|^2$, yielding
$|b(x)| \le A_f\int_{B(0,r)} \frac{1}{r^d}\,K\left(\frac{\|u\|}{r}\right)\|u\|^2\,du = A_f\,r^2\int_{B(0,1)} K(\|u\|)\,\|u\|^2\,du.$
When the manifold is of finite volume, the integral of the squared bias over M is thus bounded by:
$A_f^2\,r^4\left(\int_{B(0,1)} K(\|u\|)\,\|u\|^2\,du\right)^2\mathrm{vol}(M).$
The case of unbounded volume is still tractable provided the first and second derivatives of f are square-integrable over M. The computation of the variance is very close to what has been done in the Euclidean case. Since the kernel has vanishing first moment, the only term remaining in the variance of a single kernel is the integral of the squared kernel, which equals
$\mathrm{var}(p) = \int_{B(p,r)} K_{p,r}^2(x)\,f(x)\,d\mathrm{vol}(x) = \frac{1}{r^{2d}}\int_{B(p,r)} \frac{1}{\theta_p^2(x)}\,K^2\left(\frac{d(p,x)}{r}\right)f(x)\,d\mathrm{vol}(x).$
In normal coordinates around p,
$\int_{B(p,r)} K_{p,r}^2(x)\,f(x)\,d\mathrm{vol}(x) = \frac{1}{r^d}\int_{B(0,1)} \frac{1}{\theta_p(\exp_p(ru))}\,K^2(\|u\|)\,f(\exp_p(ru))\,du.$
In contrast with the bias computation, the $\theta_p$ term does not cancel. If $(p, x) \mapsto \theta_p(x)$ is bounded below by a constant $L > 0$, then, using the fact that K is bounded above by $K(0)$, the above quantity is bounded by
$\frac{K^2(0)}{L\,r^d}\int_{B(0,1)} f(\exp_p(ru))\,du.$
Integrating over M the variance of the estimator, which carries an extra $1/N$ factor by independence of the samples, yields
$\int_M \mathrm{var}\left(\hat f_{N,K,r}(p)\right)d\mathrm{vol}(p) \le \frac{K^2(0)}{N\,L\,r^d}\,\mathrm{vol}(B(0,1)).$
Summing the variance and the squared bias completes the proof.  □
It is worth noticing that, while the final result is essentially the same as in the Euclidean case, the assumptions are somewhat stronger: the kernel used must have compact support, and the θ function has to be bounded below. Furthermore, the bandwidth has to be small enough to ensure that the exponential map is one to one.

5.3. Computing the Kernel in the Riemannian Case

A important feature of the non-parametric kernel estimation in the Euclidean case is the ease of computation. Estimating the density at a given point is done easily using inner products and function evaluations (polynomials in most cases). Furthermore, when the kernel is compactly supported, it is quite simple to avoid summing vanishing terms: a k-d tree [35] structure can be used to allow a quick distance evaluation, thus excluding points that will not contribute to the estimator.
In the Riemannian case, evaluation the kernel K r , p at a point x requires much more computation. First of all, the geodesic distance from x to p must be found, along with the θ p function. If a shooting algorithm is used for approximating d ( x , p ) , the derivative of the exponential mapping, needed for θ p , is generally also computed: the overall cost is thus not really different from the geodesic distance evaluation alone. Nevertheless, unless a closed form for the geodesics is known, the computational cost associated with the process is much higher than with Euclidean data.
In some special cases, however, $\theta_p$ can be obtained in closed form. This is the case for symmetric spaces when the root system of the underlying Lie group is known: the integral formulas in [17] then yield $\theta_p$ directly.
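For instance, on the hyperbolic space $H^d$ of constant curvature $-1$, the volume density in normal coordinates depends only on the geodesic distance $r = d(p,x)$ and equals $(\sinh r / r)^{d-1}$. A small sketch evaluating it (the helper name is ours):

```python
import numpy as np

def theta_hyperbolic(r, d):
    """Volume density theta_p in normal coordinates on hyperbolic space H^d
    (constant curvature -1): theta_p(x) = (sinh(r)/r)**(d-1), r = d(p, x)."""
    r = np.atleast_1d(np.asarray(r, dtype=float))
    ratio = np.ones_like(r)
    mask = r > 1e-12                       # avoid 0/0 at the center point
    ratio[mask] = np.sinh(r[mask]) / r[mask]
    return ratio ** (d - 1)
```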

6. Discrete Density Estimation through Quantization

Arguably the simplest estimate one can choose for an unknown probability measure $\mu$ is a discrete density $\hat{\mu}$ with finite support, i.e., a linear combination of Dirac distributions
$$\hat{\mu} = \sum_{i=1}^n w_i\, \delta_{a_i},$$
with weights $w_i$, $i = 1, \dots, n$, that sum to 1. Finding such an approximation is the purpose of quantization. The theory was originally developed in the middle of the 20th century for signal compression, in order to find appropriate ways to discretize a signal. Quantization was first formulated for probability distributions on Euclidean spaces, but its generalization to Riemannian manifolds presents no particular difficulty (under the necessary assumptions), and so we present it directly in that more general setting. For further details, we refer the reader to [2] or the survey paper [36].

6.1. Optimal Quantization

Let $\mu$ be a probability measure with compact support on a Riemannian manifold $(M, g)$. We assume that $M$ is geodesically complete, i.e., that the exponential map at any $x$ is defined on the whole tangent space $T_x M$. Then, by the Hopf–Rinow theorem, any two points $x, y \in M$ can be joined by a geodesic of shortest length, and the geodesic distance is everywhere well defined. Optimal quantization addresses the problem of approximating a random variable $X$ with distribution $\mu$ by a quantized version $q(X)$, where $q$ has an image $\Gamma$ of cardinality at most $n$. More precisely, defining
$$Q_n = \{\, q : M \to \Gamma \subset M \text{ measurable},\ |\Gamma| \le n \,\},$$
the optimal quantization problem consists in finding $q \in Q_n$ minimizing the $L^p$ error between $X$ and $q(X)$, i.e.,
$$q^* = \operatorname{argmin}_{q \in Q_n} \mathbb{E}_\mu\, d(X, q(X))^p.$$
A solution to the above minimization problem is called an optimal n-quantizer, and the minimum error is denoted by
$$e_{n,p}(\mu) = \inf_{q \in Q_n} \mathbb{E}_\mu\, d(X, q(X))^p.$$
Proposition 7.
The search for optimal n-quantizers can be limited to nearest-neighbor projections, i.e.,
$$e_{n,p}(\mu) = \inf_{\Gamma \subset M,\ |\Gamma| = n} \mathbb{E}_\mu\, d(X, q_\Gamma(X))^p,$$
where $q_\Gamma : M \to \Gamma = \{a_1, \dots, a_n\}$ is given by
$$q_\Gamma(x) = \sum_{i=1}^n a_i\, \mathbb{1}_{C_i(\Gamma)}(x), \qquad x \in M,$$
and $C_i(\Gamma)$ denotes the $i$-th Voronoi cell associated with $\Gamma$,
$$C_i(\Gamma) = \big\{\, x \in M \ :\ d(x, a_i) \le d(x, a_j)\ \ \forall\, j \ne i \,\big\}.$$
Proof. 
Any $n$-quantizer $q$ with image $\Gamma \subset M$ verifies $d(x, q(x)) \ge \inf_{a \in \Gamma} d(x, a)$ for all $x \in M$, with equality if and only if $q(x) = \operatorname{argmin}_{a \in \Gamma} d(x, a)$. Therefore, the optimal quantizer is the projection to the nearest neighbor in $\Gamma$. Moreover, if $|\Gamma| < n$ and $|\operatorname{supp} \mu| \ge n$, one easily checks that $q$ can always be improved, in the sense of criterion (46), by adding an element to its image. This means that an optimal $n$-quantizer has an image of exactly $|\Gamma| = n$ points, and is of the given form.  □
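Concretely, evaluating such a quantizer only requires a geodesic distance; a generic sketch, where the distance function `dist` and the tuple of centers are supplied by the user:

```python
import numpy as np

def q_gamma(dist, centers, x):
    """Nearest-neighbor projection q_Gamma: send x to the closest center,
    i.e., to the atom of the Voronoi cell containing x."""
    i = int(np.argmin([dist(x, a) for a in centers]))
    return i, centers[i]
```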
The optimal approximation of $X$ is then given by the image $\hat{X} = q_\Gamma(X)$, while the optimal approximation of its distribution $\mu$ is given by the pushforward
$$\hat{\mu} = (q_\Gamma)_* \mu = \sum_{i=1}^n \mu\big[ C_i(\Gamma) \big]\, \delta_{a_i},$$
where the atoms $a_1, \dots, a_n$ are chosen to minimize the distortion function, obtained simply by evaluating the cost function of (46) at $q = q_\Gamma$,
$$F_{n,p}(a_1, \dots, a_n) = \mathbb{E}_\mu \min_{1 \le i \le n} d(X, a_i)^p = \int_M \min_{1 \le i \le n} d(x, a_i)^p\, d\mu(x).$$
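In practice, $\mu$ is usually only known through samples, and the distortion is estimated by its empirical counterpart; a sketch under that assumption (the helper name is ours):

```python
import numpy as np

def empirical_distortion(dist, centers, samples, p=2):
    """Monte Carlo estimate of F_{n,p}: the empirical mean, over the samples,
    of the p-th power of the geodesic distance to the nearest center."""
    return float(np.mean([min(dist(x, a) for a in centers) ** p
                          for x in samples]))
```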
Notice that if we seek to approximate $\mu$ by a single point $a \in M$ (i.e., $n = 1$) with respect to an $L^2$ criterion ($p = 2$), we retrieve the definition of the Riemannian center of mass, also called the Fréchet mean [37]:
$$\bar{x} = \mathbb{E}_\mu(X) = \operatorname{argmin}_{a \in M} \int_M d(x, a)^2\, d\mu(x).$$
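On a manifold where the exponential and logarithm maps are known in closed form, this minimizer can be computed by Riemannian gradient descent, moving at each step along the average of the logarithm maps. A minimal sketch on the unit sphere; the step size and iteration count are illustrative choices, not prescriptions from the text:

```python
import numpy as np

def sphere_log(p, q):
    """Riemannian logarithm on the unit sphere: tangent vector at p pointing
    to q, with norm equal to the geodesic distance d(p, q)."""
    w = q - np.dot(p, q) * p
    nw = np.linalg.norm(w)
    theta = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    return np.zeros_like(p) if nw < 1e-12 else theta * w / nw

def sphere_exp(p, v):
    """Riemannian exponential on the unit sphere."""
    nv = np.linalg.norm(v)
    return p if nv < 1e-12 else np.cos(nv) * p + np.sin(nv) * v / nv

def frechet_mean(points, steps=100, lr=0.5):
    """Gradient descent for the Riemannian center of mass: minus the gradient
    of the cost is the mean of the logarithm maps at the current iterate."""
    x = np.array(points[0], dtype=float)
    for _ in range(steps):
        grad = np.mean([sphere_log(x, q) for q in points], axis=0)
        x = sphere_exp(x, lr * grad)
    return x
```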
It is worth noting that the optimal quantization problem coincides with the optimal transport problem of approximating $\mu$ by the closest discrete measure with at most $n$ supporting points, with respect to the $L^p$-Wasserstein distance.
Proposition 8.
Let $P_n$ denote the set of all measures $\nu$ on $M$ with $|\operatorname{supp} \nu| \le n$. Then,
$$e_{n,p}(\mu) = \inf_{\nu \in P_n} W_p(\mu, \nu)^p,$$
where $W_p$ denotes the Wasserstein distance of order $p$, i.e.,
$$W_p(\mu, \nu) = \left( \inf_P \int_{M \times M} d(y, z)^p\, dP(y, z) \right)^{1/p}.$$
Here, the infimum is taken over all measures $P$ on $M \times M$ with marginals $\mu$ and $\nu$.
The proof in the Euclidean case can be found in [2], and it applies verbatim to measures on manifolds. We reproduce it here for the sake of completeness.
Proof. 
Let $q \in Q_n$, and define $f : M \to M \times M$, $f(x) = (x, q(x))$, and $g : M \times M \to \mathbb{R}_+$, $g(y, z) = d(y, z)^p$. Then, by definition of the image measure $f_* \mu$, we have
$$\mathbb{E}_\mu\, d(X, q(X))^p = \mathbb{E}_\mu\, g \circ f(X) = \mathbb{E}_{f_* \mu}\, g(Y, Z) = \int_{M \times M} d(y, z)^p\, d(f_* \mu)(y, z).$$
Noticing that $\int_{y \in M} (f_* \mu)(dy, dz) = \mu(q^{-1}(dz))$ and $\int_{z \in M} (f_* \mu)(dy, dz) = \mu(dy)$, i.e., that $f_* \mu$ has marginals $\mu$ and $q_* \mu$, we get
$$e_{n,p}(\mu) \ge \inf_{q \in Q_n} W_p(\mu, q_* \mu)^p \ge \inf_{\nu \in P_n} W_p(\mu, \nu)^p.$$
On the other hand, if $\nu \in P_n$ has support $\Gamma = \{a_1, \dots, a_n\}$ and $P$ has marginals $\mu$ and $\nu$, then
$$\int_{M \times M} d(y, z)^p\, P(dy, dz) = \int_{M \times \Gamma} d(y, z)^p\, P(dy, dz) \ge \int_{M \times \Gamma} \min_{1 \le i \le n} d(y, a_i)^p\, P(dy, dz) = \int_M \min_{1 \le i \le n} d(y, a_i)^p\, \mu(dy),$$
which gives $W_p(\mu, \nu)^p \ge F_{n,p}(a_1, \dots, a_n)$ and finally
$$\inf_{\nu \in P_n} W_p(\mu, \nu)^p \ge \inf_{(a_1, \dots, a_n) \in M^n} F_{n,p}(a_1, \dots, a_n) = e_{n,p}(\mu). \qquad \square$$
An important question that arises now is the existence of a minimizer of (48). The proof of the following claim can be found in [38].
Proposition 9.
Let $M$ be a complete Riemannian manifold and $\mu$ a probability distribution on $M$ with a density and compact support. Then, the distortion function $F_{n,p}$ is continuous and admits a minimizer.
The minimizer $\alpha = (a_1, \dots, a_n)$, referred to as the optimal $n$-centers, is in general not unique: any symmetry of $\mu$, if one exists, transforms a minimizer into another minimizer of $F_{n,p}$. For example, any rotation of the optimal $n$-centers of the uniform distribution on the sphere preserves optimality.
The second question that naturally arises is: how does the error $e_{n,p}(\mu)$ made by approximating $\mu$ by (47) evolve as the number $n$ of points grows? In the vector case, Zador's theorem [2] (Theorem 6.2) tells us that it decreases to zero as $n^{-p/d}$, and that the limit of $n^{p/d}\, e_{n,p}(\mu)$ is proportional to the $p$-th quantization coefficient, i.e., the limit (which is also an infimum) when $\mu$ is the uniform distribution on the unit cube of $\mathbb{R}^d$:
$$C_p\big([0,1]^d\big) = \lim_{n \to \infty} n^{p/d}\, e_{n,p}\big(\mathcal{U}([0,1]^d)\big).$$
Moreover, when $\mu$ is absolutely continuous with density $h$, the asymptotic empirical distribution of the optimal $n$-centers is proportional to $h^{d/(d+p)}$.
In the case of a Riemannian manifold $M$, the moment condition of the flat case generalizes to a condition involving the curvature of $M$. The following quantity measures the maximal variation of the exponential map at $x \in M$ when restricted to a $(d-1)$-dimensional sphere $S_\rho \subset T_x M$ of radius $\rho$:
$$A_x(\rho) = \sup\big\{\, \| d_v \exp_x(w) \| \ :\ v \in S_\rho,\ w \in T_v S_\rho,\ \|w\| = \rho \,\big\}.$$
The following generalization of Zador’s theorem to Riemannian quantization was proposed by Iacobelli [39] (Theorem 1.4 and Corollary 1.5).
Theorem 7.
Let $M$ be a complete Riemannian manifold without boundary, and let $\mu = h\, d\mathrm{vol} + \mu_s$ be a probability measure on $M$, where $d\mathrm{vol}$ denotes the Riemannian volume form and $\mu_s$ the singular part of $\mu$. Assume there exist $x_0 \in M$ and $\delta > 0$ such that
$$\int_M d(x, x_0)^{p+\delta}\, d\mu(x) + \int_M A_{x_0}\big( d(x, x_0)^p \big)\, d\mu(x) < \infty.$$
Then,
$$\lim_{n \to \infty} n^{p/d}\, e_{n,p}(\mu) = C_p\big([0,1]^d\big)\, \| h \|_{d/(d+p)},$$
where $\|\cdot\|_r$ denotes the $L^r$-norm. In addition, if $\mu_s = 0$ and $(a_1, \dots, a_n)$ are optimal $n$-centers, then
$$\frac{1}{n} \sum_{i=1}^n \delta_{a_i} \xrightarrow{\ \mathcal{D}\ } \lambda\, h^{d/(d+p)}\, d\mathrm{vol} \quad \text{as } n \to \infty,$$
where $\xrightarrow{\mathcal{D}}$ denotes convergence in distribution and $\lambda$ is the appropriate normalizing constant.

6.2. A Numerical Scheme

In practice, to compute the optimal $n$-centers $\alpha = (a_1, \dots, a_n)$ from potentially large, manifold-valued datasets, one can search for the critical points of the distortion function. Assume that the only knowledge we have of the probability measure $\mu$ that we want to approximate is through an online sequence of i.i.d. observations $X_1, X_2, \dots$ sampled from $\mu$. A classical algorithm used for quadratic ($p = 2$) vector quantization, and easily generalized to the Riemannian setting, is the Competitive Learning Vector Quantization algorithm, a stochastic gradient descent method based on the differentiability of the distortion function $F_{n,2}$.
Proposition 10.
Let $\alpha = (a_1, \dots, a_n) \in M^n$ be an $n$-tuple of pairwise distinct components and $p > 1$. Then, $F_{n,p}$ is differentiable and its gradient at $\alpha$ is
$$\operatorname{grad}_\alpha F_{n,p} = \left( -p \int_{\mathring{C}_i(\alpha)} \big\| \overrightarrow{a_i x} \big\|^{p-1}\, \frac{\overrightarrow{a_i x}}{\big\| \overrightarrow{a_i x} \big\|}\, \mu(dx) \right)_{1 \le i \le n} \in T_\alpha M^n,$$
where $\mathring{C}_i(\alpha)$ is the interior of the $i$-th Voronoi cell of $\alpha$ and $\overrightarrow{xy} := \exp_x^{-1}(y)$ denotes the vector that sends $x$ to $y$ through the exponential map. In particular, the gradient of the quadratic distortion function is given by
$$\operatorname{grad}_\alpha F_{n,2} = \left( -2 \int_{\mathring{C}_i(\alpha)} \overrightarrow{a_i x}\, \mu(dx) \right)_{1 \le i \le n} = \left( -2\, \mathbb{E}_\mu\Big[ \mathbb{1}_{\{X \in \mathring{C}_i\}}\, \overrightarrow{a_i X} \Big] \right)_{1 \le i \le n}.$$
The proof can be found in [38].
Notice that the optimal $n$-centers are Riemannian centers of mass of their Voronoi cells, in the sense of (49). More generally, for any value of $p$, each $a_i$, $i = 1, \dots, n$, is the $p$-mean of its Voronoi cell, i.e., the minimizer of
$$a \mapsto \int_{\mathring{C}_i(\alpha)} d(x, a)^p\, \mu(dx).$$
Therefore, the optimal $n$-centers are always contained in the compact support of $\mu$. Notice also that the direction opposite to the gradient is, on average, given by the vectors inside the expectation. Competitive learning quantization consists in following this direction at each step $k$, that is, updating only the center $a_i$ corresponding to the Voronoi cell of the new observation $X_k$, and moving it towards that observation. In the Riemannian setting, instead of moving along straight lines, we simply follow geodesics using the exponential map. This gives a convergent algorithm, as shown in [38], which is particularly adapted to large datasets as it is online, i.e., it processes one data point at a time. Moreover, unlike kernel-based methods, it requires few distance computations: of order $n \times N$ instead of $N^2$, if $N$ is the size of the dataset and $n \ll N$ the size of the summary. A sketch of the update rule is given below.
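The sketch is parameterized by user-supplied exponential and logarithm maps (for instance, `sphere_log` and `sphere_exp` from the Fréchet mean sketch above); the step-size schedule $c_0/(k+1)$ is a standard Robbins-Monro choice, not the only possible one:

```python
import numpy as np

def clvq(samples, init_centers, log, exp, c0=1.0):
    """Riemannian competitive learning quantization (p = 2): for each new
    observation, only the center of the winning Voronoi cell is moved,
    along the geodesic towards the observation."""
    centers = [np.array(a, dtype=float) for a in init_centers]
    for k, x in enumerate(samples):
        # since ||log(a, x)|| = d(a, x), the logs also identify the winning cell
        vs = [log(a, x) for a in centers]
        i = int(np.argmin([np.linalg.norm(v) for v in vs]))
        centers[i] = exp(centers[i], (c0 / (k + 1.0)) * vs[i])
    return centers
```

Each observation triggers $n$ distance evaluations and a single exponential map, which is what keeps the overall cost of order $n \times N$.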
Finding the right number of centers may be a difficult question when there is no a priori knowledge of the distribution to be approximated. In practice, it is mainly a trial-and-error procedure, unless the problem itself suggests an initial guess. An alternative approach is to use the fact that any quantization defines a clustering by Voronoi cells: the quality of the latter can be used to assess the performance of the former. Many standard indicators exist for that purpose [40]. To mention a simple one, the Silhouette [41] is easy to compute and does not require knowledge of ground-truth membership labels.
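Since the Silhouette only needs pairwise dissimilarities, it can be fed geodesic distances directly; a sketch using scikit-learn's precomputed mode (note that the full distance matrix costs $N^2$ geodesic computations, so in practice it would be evaluated on a subsample):

```python
from sklearn.metrics import silhouette_score

def quantization_silhouette(D, labels):
    """Silhouette of the clustering induced by a quantization.
    D: pairwise geodesic distances between (a subsample of) the samples;
    labels[i]: index of the Voronoi cell containing sample i."""
    return silhouette_score(D, labels, metric="precomputed")
```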

7. Some Open Problems

Density estimation on manifolds is still an open area of research and, as such, has several questions that have not yet been answered. Some of them are given below and may serve as a starting point for further research.

7.1. Parametric Estimation and Symplectic Structure

A non-trivial aspect of parametric estimation on manifolds is that the parameter space is generally different from the base manifold. In fact, this is already the case for location-scale models in $\mathbb{R}^d$, since the scale parameter is an element of $\mathbb{R}_+$, while the underlying abelian group structure makes the location part an element of $\mathbb{R}^d$, which is also the base space. If one tries to extend the concept of location parameter to manifolds, two approaches can be used:
  • Use local coordinates as parameters, and mimic the vector case as in [9].
  • Replace the abelian group underlying R d by a Lie group acting on the base manifold.
In view of the general results mentioned in Section 3.4 for exponential families defined on symplectic manifolds, the second setting is the most natural one, but it is not as general as the first. However, given a Riemannian manifold $M$, its cotangent bundle $T^*M$ has a canonical symplectic structure, obtained from the so-called tautological one-form [42] (pp. 9–14).
Definition 14.
Let $\pi : T^*M \to M$ be the canonical projection and $d\pi : T(T^*M) \to TM$ its derivative. The tautological one-form $\alpha : T(T^*M) \to \mathbb{R}$ is defined pointwise as:
$$\alpha_\omega(v) = \omega(d\pi\, v),$$
where $v \in T(T^*M)$ and $\tilde{\pi} v = \omega$, with $\tilde{\pi} : T(T^*M) \to T^*M$ the canonical projection.
The tautological one-form gives rise to a symplectic form:
Proposition 11.
The two-form $\omega = -d\alpha$ provides $T^*M$ with a symplectic structure. In local coordinates $(x^1, \dots, x^d, \xi_1, \dots, \xi_d)$, it reads:
$$\omega = \sum_{i=1}^d dx^i \wedge d\xi_i.$$
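In these coordinates, the tautological one-form takes the standard expression $\alpha = \sum_i \xi_i\, dx^i$, and the proposition follows from a one-line computation (the sign convention $\omega = -d\alpha$ is the one of [42]):

```latex
\alpha = \sum_{i=1}^{d} \xi_i \, dx^i
\quad\Longrightarrow\quad
\omega = -\,d\alpha = -\sum_{i=1}^{d} d\xi_i \wedge dx^i
       = \sum_{i=1}^{d} dx^i \wedge d\xi_i .
```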
Given the symplectic structure on $T^*M$, one can derive exponential families provided a group action by symplectomorphisms exists. This construction may be a starting point for a quite general definition of parameterized families on manifolds, moving from the base manifold to its cotangent bundle. It is worth noticing that location models may also be constructed using generating functions [42].

7.2. Manifolds with Boundaries

All the density estimators presented above were defined on manifolds without boundary. However, in several settings, boundaries arise naturally as limiting cases: as an example, the space of symmetric positive semi-definite matrices of dimension $d \times d$ has stratified boundaries, namely the manifolds of matrices of rank $0 \le p < d$. Densities localized on the boundaries are degenerate versions of those defined in the interior, but must be taken into account as they carry a non-vanishing mass. Apart from the projection estimator, which fits in the manifolds-with-boundary framework, all the other methods must be adapted. In particular, distributions defined on matrix spaces must be parameterized in such a way that rank deficiency is allowed in the parameter space. For manifolds with corners [43], this is easily obtained from the particular structure of the local charts, which map open subsets of the manifold to open subsets of $[0, +\infty[^k \times \mathbb{R}^{d-k}$, $0 \le k \le d$; it turns out that all examples of matrix manifolds fall within this frame.
Extending parametric density estimators or kernel estimators to manifolds with corners would have many practical applications, especially when dealing with matrix statistics. The same applies to optimal quantization, where Dirac masses located on the boundaries must be added to the initial model.

7.3. Constrained Quantization

In the Riemannian quantization problem, one seeks an approximate distribution of the form $\sum_{i=1}^n \mu_i\, \delta_{a_i}$, with $a_1, \dots, a_n$ the optimal centers located on the base manifold $M$ and $\mu_1, \dots, \mu_n$ positive real numbers summing to 1. While such a representation is very natural both from an optimal approximation and a clustering point of view, some specific applications require putting additional constraints on the weights $\mu_i$, $i = 1, \dots, n$ [44]. A common one is to impose that they all be equal, which in a clustering application means that all classes will have an equal expected number of members. The stochastic gradient algorithm presented above is no longer valid for this problem and has to be adapted. Finding a suitable procedure is still an open question.

8. Conclusions

Approximation and estimation of probability densities on Riemannian manifolds are topics receiving increasing interest in the statistics community. In many applications, data live on non-Euclidean spaces, and adapted procedures must be designed.
In Euclidean spaces, parametric and non-parametric estimation procedures have been intensively studied and are well understood. In the Riemannian manifold setting, several non-equivalent extensions are possible, making the task of finding the right one quite tricky. The most obvious way of dealing with manifold-valued data is to use a local linearization, such as normal coordinates. When applied to densities, this may be used to derive maximum entropy distributions with fixed moments, but some care must be taken with the cut locus, which may be charged by the resulting distribution. Other approaches rely on a direct use of the geodesic distance, both in the parametric and non-parametric estimation frameworks. The computational cost associated with this operation may be high, as a differential problem with boundary conditions has to be solved. When dealing with kernel estimation, Jacobi fields must also be evaluated. While this involves only a classical ordinary differential equation (ODE) integration, it increases the overall complexity of the procedure.
Parametric estimation using exponential families based on group action invariance is theoretically appealing and makes use of the underlying structure of the data. This is especially important when the data are highly structured. A closed-form momentum map is nevertheless a prerequisite for an efficient implementation.
Finally, projection-based estimation is in principle very efficient, provided one can obtain the eigenfunctions of the Laplace–Beltrami operator in closed form, or at least approximate them efficiently. In low dimension, numerical schemes may be used for that purpose, but they do not scale well.
Classical directional densities and their generalizations are numerically appealing and offer a sound framework for some manifolds. They are designed on an ad hoc basis, however, and may not be adapted to all cases. Most of them are maximum-entropy based and often exhibit a group invariance. An interesting question is whether approximate directional densities can be found for a wrapped distribution arising from a model heat kernel.
As a general conclusion, extending the usual distributions to general manifolds is by no means an elementary procedure. Furthermore, whereas in the Euclidean case some distributions satisfy many equivalent defining properties, this will not be the case on manifolds. Maximum entropy is generally a good criterion, provided the fixed moments are defined in a simple and natural way. Finally, an often overlooked issue with densities defined on Riemannian manifolds is the associated computational cost when closed forms are unknown. Since all distance computations require solving a boundary value problem, the complexity of manifold algorithms may be some orders of magnitude higher than that of their Euclidean counterparts. This also limits the practical dimension of the problems that can be addressed.

Author Contributions

A.l.B. contributed the Riemannian geometry and quantization sections. S.P. contributed the survey on parametric and non-parametric estimation.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. DasGupta, A. Asymptotic Theory of Statistics and Probability; Springer Texts in Statistics; Springer: New York, NY, USA, 2008.
  2. Graf, S.; Luschgy, H. Foundations of Quantization for Probability Distributions; Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 2007.
  3. Chern, S.; Smith, F.; de Rham, G. Differentiable Manifolds: Forms, Currents, Harmonic Forms; Grundlehren der Mathematischen Wissenschaften; Springer: Berlin/Heidelberg, Germany, 2012.
  4. Willmore, T. Riemannian Geometry; Oxford Science Publications; Clarendon Press: Oxfordshire, UK, 1996.
  5. Mardia, K.V. Statistics of Directional Data. J. R. Stat. Soc. Ser. B (Methodol.) 1975, 37, 349–393.
  6. Mardia, K.; Jupp, P. Directional Statistics; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2009.
  7. Golub, G.; Van Loan, C. Matrix Computations; Johns Hopkins Studies in the Mathematical Sciences; Johns Hopkins University Press: Baltimore, MD, USA, 1996.
  8. Chikuse, Y. Statistics on Special Manifolds; Lecture Notes in Statistics; Springer: Berlin/Heidelberg, Germany, 2003.
  9. Pennec, X. Intrinsic Statistics on Riemannian Manifolds: Basic Tools for Geometric Measurements. J. Math. Imaging Vis. 2006, 25, 127–154.
  10. Barndorff-Nielsen, O. Hyperbolic Distributions and Distributions on Hyperbolae. Scand. J. Stat. 1978, 5, 151–157.
  11. Gruet, J.C. A Note on Hyperbolic von Mises Distributions. Bernoulli 2000, 6, 1007–1020.
  12. Said, S.; Bombrun, L.; Berthoumieu, Y.; Manton, J.H. Riemannian Gaussian Distributions on the Space of Symmetric Positive Definite Matrices. IEEE Trans. Inf. Theory 2017, 63, 2153–2170.
  13. Said, S.; Hajri, H.; Bombrun, L.; Vemuri, B.C. Gaussian Distributions on Riemannian Symmetric Spaces: Statistical Learning with Structured Covariance Matrices. IEEE Trans. Inf. Theory 2018, 64, 752–772.
  14. Terras, A. Harmonic Analysis on Symmetric Spaces and Applications I; Springer: New York, NY, USA, 2012.
  15. Duistermaat, J.; Kolk, J. Lie Groups; Universitext; Springer: Berlin/Heidelberg, Germany, 1999.
  16. Knapp, A.W. Lie Groups Beyond an Introduction; Progress in Mathematics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
  17. Helgason, S. Groups and Geometric Analysis: Integral Geometry, Invariant Differential Operators, and Spherical Functions; Mathematical Surveys and Monographs; American Mathematical Society: Providence, RI, USA, 2000.
  18. Jones, T.H.; Kucerovsky, D. Heat Kernel for Simply-Connected Riemann Surfaces. arXiv 2010, arXiv:1007.5467.
  19. McKean, H.P. An upper bound to the spectrum of Δ on a manifold of negative curvature. J. Differ. Geom. 1970, 4, 359–366.
  20. Nicol, F.; Puechmorel, S. Von Mises-Like Probability Density Functions on Surfaces. In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 701–708.
  21. Noether, E. Invariant variation problems. Transp. Theory Stat. Phys. 1971, 1, 186–207.
  22. Barbaresco, F. Koszul Information Geometry and Souriau Geometric Temperature/Capacity of Lie Group Thermodynamics. Entropy 2014, 16, 4521–4565.
  23. Casalis, M. Familles Exponentielles Naturelles sur Rd Invariantes par un Groupe. Int. Stat. Rev. 1991, 59, 241–262.
  24. Barndorff-Nielsen, O.; Blæsild, P.; Jensen, J.L.; Jørgensen, B. Exponential Transformation Models. Proc. R. Soc. Lond. Ser. A Math. Phys. Sci. 1982, 379, 41–65.
  25. Souriau, J.; Cushman, R.; Vries, C.; Tuynman, G. Structure of Dynamical Systems: A Symplectic View of Physics; Progress in Mathematics; Springer Science + Business Media: Berlin/Heidelberg, Germany, 1997.
  26. Besicovitch, A. Almost Periodic Functions; Dover Edition; Dover Publications: Dover, DE, USA, 1954.
  27. Hendriks, H. Nonparametric Estimation of a Probability Density on a Riemannian Manifold Using Fourier Expansions. Ann. Stat. 1990, 18, 832–849.
  28. Burago, D.; Ivanov, S.; Kurylev, Y. A graph discretization of the Laplace–Beltrami operator. J. Spectr. Theory 2014, 4, 675–714.
  29. Kim, P.T. Deconvolution density estimation on SO(N). Ann. Stat. 1998, 26, 1083–1102.
  30. Parzen, E. On Estimation of a Probability Density Function and Mode. Ann. Math. Stat. 1962, 33, 1065–1076.
  31. Rosenblatt, M. Remarks on Some Nonparametric Estimates of a Density Function. Ann. Math. Stat. 1956, 27, 832–837.
  32. Buhmann, M. Radial Basis Functions: Theory and Implementations; Cambridge Monographs on Applied and Computational Mathematics; Cambridge University Press: Cambridge, UK, 2003.
  33. Pelletier, B. Kernel density estimation on Riemannian manifolds. Stat. Probab. Lett. 2005, 73, 297–304.
  34. Berger, M.; Gauduchon, P.; Mazet, E. Le Spectre d'une Variété Riemannienne; Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 1971.
  35. Bentley, J.L. Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM 1975, 18, 509–517.
  36. Pagès, G. Introduction to vector quantization and its applications for numerics. ESAIM Proc. Surv. 2015, 48, 29–79.
  37. Fréchet, M. Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. Inst. Henri Poincaré 1948, 10, 215–310. (In French)
  38. Le Brigant, A.; Puechmorel, S. Optimal Riemannian quantization with an application to air traffic analysis. arXiv 2018, arXiv:1806.07605.
  39. Iacobelli, M. Asymptotic quantization for probability measures on Riemannian manifolds. ESAIM Control Optim. Calc. Var. 2016, 22, 770–785.
  40. Arbelaitz, O.; Gurrutxaga, I.; Muguerza, J.; Pérez, J.M.; Perona, I. An extensive comparative study of cluster validity indices. Pattern Recognit. 2013, 46, 243–256.
  41. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
  42. Da Silva, A. Lectures on Symplectic Geometry; Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 2004.
  43. Joyce, D. On Manifolds with Corners. In Advances in Geometric Analysis; International Press: Boston, MA, USA, 2012; Volume 21, pp. 225–258.
  44. Kämpke, T. Constrained quantization. Signal Process. 2003, 83, 1839–1858.
