Article

Examples of the Application of Nonparametric Information Geometry to Statistical Physics

De Castro Statistics Initiative, Collegio Carlo Alberto, Via Real Collegio 30, Moncalieri 10024, Italy
Entropy 2013, 15(10), 4042-4065; https://0-doi-org.brum.beds.ac.uk/10.3390/e15104042
Submission received: 15 August 2013 / Revised: 13 September 2013 / Accepted: 16 September 2013 / Published: 25 September 2013
(This article belongs to the Collection Advances in Applied Statistical Mechanics)

Abstract:
We review a nonparametric version of Amari's information geometry in which the set of positive probability densities on a given sample space is endowed with an atlas of charts to form a differentiable manifold modeled on Orlicz Banach spaces. This nonparametric setting is used to discuss typical problems in machine learning and statistical physics, such as black-box optimization, the Kullback-Leibler divergence, the Boltzmann-Gibbs entropy and the Boltzmann equation.

1. Introduction

Information geometry was developed in the seminal monograph by Amari and Nagaoka [1], where previous, essentially metric, descriptions of probability and statistics concepts are extended in the direction of differential geometry, including the fundamental treatment of differential connections. The differential geometry involved in their construction is finite dimensional, and the formalism is based on coordinate systems. Following a suggestion by Phil Dawid in [2,3,4], a particular nonparametric version of the Amari-Nagaoka theory was developed in a series of papers [5,6,7,8,9,10,11,12,13,14], where the set $\mathcal{P}_>$ of all strictly positive probability densities of a measure space is shown to be a Banach manifold (as defined in [15,16,17]) modeled on an Orlicz Banach space; see [18] (Ch II).
Specifically, Gibbs densities, $q = e^{u - K_p(u)} \cdot p$ with $\mathbb{E}_p[u] = 0$, are represented by the chart $s_p \colon q \mapsto u$. Because of the exponential form, the random variable $u$ is required to belong to an exponential Orlicz space, which is similar to the ordinary Lebesgue spaces, but lacks some of their important features, such as reflexivity and separability. On the other hand, the nonparametric setting nicely emphasizes the fact that statistical manifolds are actually affine manifolds with a Hessian structure; cf. [19].
Such a formalism has been frequently criticized as too involved to be of use in practical applications and, also, as lacking genuinely new results with respect to the Amari-Nagaoka theory. However, it should be observed that most applications in statistical physics, such as the theory of the Boltzmann equation [20], are intrinsically nonparametric. I like to quote here a line by Serge Lang in [17] (p. vi): “One major function of finding proofs valid in the infinite dimensional case is to provide proofs which are especially natural and simple in the finite dimensional case.” A good example is the use here of a different Banach space for each chart in the atlas defining our manifold, with a notable advantage in the interpretation of the results.
If $p(t)$, $t \in I$, is a curve in the manifold of positive densities, the Fisher score, $\delta p(t) = \frac{d}{dt} \ln p(t)$, is a parameterized family of random variables, such that $\mathbb{E}_{p(t)}[\delta p(t)] = 0$, $t \in I$. The Fisher score provides the correct notion of the velocity of a statistical curve, while the set of all Fisher scores, i.e., the vector space of all random variables $u$ such that $\mathbb{E}_p[u] = 0$, suggests the form of the tangent space at $p$. We attach to each density $p \in \mathcal{P}_>$ a vector space of random variables whose expected value with respect to $p$ is zero, to form the linear fiber of a vector bundle on the statistical manifold. This vector space can be either an Orlicz space, denoted here by $B_p$ or ${}^*B_p$, or a Hilbert space, $H_p = L_0^2(p)$. The purpose of this paper is to show that this mathematical formalism can be rigorously defined in such a way as to allow for the treatment of the Boltzmann equation as an evolution equation on the statistical manifold.
This paper is organized as follows. Section 2 and Section 3 are a review of the basic material on statistical exponential manifolds with some emphasis on the functional analytic setting and on second order structures. Section 4 contains a discussion of examples of the application to the differential geometry of expected values, Kullback-Leibler divergence, Boltzmann-Gibbs entropy and the Boltzmann equation. Section 5 presents some topics that would require further study, together with references to some lines of current research.

2. Model Spaces

Given a σ-finite measure space, $(\Omega, \mathcal{F}, \mu)$, we denote by $\mathcal{P}_>$ the set of all densities that are positive μ-a.s., by $\mathcal{P}$ the set of all densities, and by $\mathcal{P}_1$ the set of measurable functions $f$ with $\int f \, d\mu = 1$. In the finite state space case, $\mathcal{P}_1$ is an affine subspace, $\mathcal{P}$ is the simplex and $\mathcal{P}_>$ its topological interior. We summarize below the basic notations and results. Missing proofs are to be found, e.g., in [10] and in [18] (Ch II).
If both $\phi$ and $\phi_*$ are monotone, continuous functions of $\mathbb{R}_{\ge 0}$ onto itself, such that $\phi^{-1} = \phi_*$, we call the pair:

$$\Phi(x) = \int_0^x \phi(u)\,du, \qquad \Phi_*(y) = \int_0^y \phi_*(v)\,dv$$

a Young pair. Each Young pair satisfies the Young inequality:

$$x y \le \Phi(x) + \Phi_*(y)$$

with equality, if and only if $y = \phi(x)$. The relation in a Young pair is symmetric, and either element is called a Young function. We will use the following Young pairs:

(a) $\phi_*(u) = \ln(1+u)$, $\phi(v) = e^v - 1$, $\Phi_*(x) = (1+x)\ln(1+x) - x$, $\Phi(y) = e^y - 1 - y$;
(b) $\phi_*(u) = \sinh^{-1} u$, $\phi(v) = \sinh v$, $\Phi_*(x) = x \sinh^{-1} x - \sqrt{1+x^2} + 1$, $\Phi(y) = \cosh y - 1$.
Let us derive a few elementary, but crucial, inequalities. If $x \ge 0$:

$$\Phi_*^{(a)}(x) = \int_0^x \frac{x-u}{1+u}\,du, \qquad \Phi_*^{(b)}(x) = \int_0^x \frac{x-u}{\sqrt{1+u^2}}\,du$$

hence, as $\sqrt{1+u^2} \le 1+u \le \sqrt{2}\,\sqrt{1+u^2}$ if $u \ge 0$, for all real $x$, we have:

$$\Phi_*^{(a)}(x) \le \Phi_*^{(b)}(x) \le \sqrt{2}\,\Phi_*^{(a)}(x)$$

From Equation (4), we have, for $a > 1$:

$$\Phi_*^{(a)}(ax) = a^2 \int_0^x \frac{x-v}{1+av}\,dv \le a^2\,\Phi_*^{(a)}(x), \qquad \Phi_*^{(b)}(ax) = a^2 \int_0^x \frac{x-v}{\sqrt{1+a^2v^2}}\,dv \le a^2\,\Phi_*^{(b)}(x)$$

In a similar way, from:

$$\Phi^{(a)}(y) = \int_0^y (y-v)\,e^v\,dv, \qquad \Phi^{(b)}(y) = \int_0^y (y-v)\cosh v\,dv$$

and $\cosh v \le e^v \le 2\cosh v$ if $v \ge 0$, we have a relation similar to Equation (5), that is, for all $y$:

$$\Phi^{(b)}(y) \le \Phi^{(a)}(y) \le 2\,\Phi^{(b)}(y)$$

Property Equation (6) does not hold in this case, i.e., $\Phi(ax)/\Phi(x)$, $a > 1$, is unbounded as $x \to \infty$. Such a type of inequality is called a $\Delta_2$-condition and has a crucial role in the theory of Orlicz spaces; see [18] (Th 8.14).
If $\Phi$ is any Young function, a real random variable, $u$, belongs to the Orlicz space, $L^\Phi(p)$, if $\mathbb{E}_p[\Phi(\alpha u)] < +\infty$ for some $\alpha > 0$. A norm is obtained by defining the set, $\{v : \mathbb{E}_p[\Phi(v)] \le 1\}$, to be the closed unit ball. It follows that the open unit ball consists of those $u$'s, such that $\alpha u$ is in the closed unit ball for some $\alpha > 1$. The corresponding norm, $\|\cdot\|_{\Phi,p}$, is called the Luxemburg norm and defines a Banach space; see [18] (Th 7.7). From Equations (8) and (5), it follows that cases (a) and (b) in Equation (3) define equal vector spaces with equivalent norms; see [10] (Lemma 1). Therefore, we need not distinguish between them.
The Young function, $\cosh - 1$, has been chosen here, because the condition $\mathbb{E}_p[\Phi(\alpha u)] < +\infty$ is clearly equivalent to $\mathbb{E}_p[e^{tu}] < +\infty$ for $t \in [-\alpha, \alpha]$, that is, to the random variable, $u$, having a Laplace transform around zero. The case of a moment-generating function defined on all of the real line is special and defines a notable subspace of the Orlicz space. The use of such a space has been proposed by [21].
There are technical issues in working with Orlicz spaces, such as $L^{\cosh - 1}(p)$, in particular, the regularity of its unit sphere $S_{\cosh - 1} = \{u : \|u\|_{(\cosh-1),p} = 1\}$. In fact, while $\mathbb{E}_p[\cosh u - 1] = 1$ implies $u \in S_{\cosh - 1}$, the latter implies only $\mathbb{E}_p[\cosh u - 1] \le 1$. Subspaces of $L^\Phi(p)$ where $\|u\|_{(\cosh-1),p} = 1$ and $\mathbb{E}_p[\cosh u - 1] < 1$ cannot happen at the same time are of special interest. In general, the sphere, $S_{\cosh - 1}$, is not smooth; see an example in [14] (Example 3).
Because the functions, $\Phi$ and $\Phi_*$, are a Young pair, for each $u \in L^\Phi(p)$ and $v \in L^{\Phi_*}(p)$, such that $\|u\|_{\Phi,p}, \|v\|_{\Phi_*,p} \le 1$, we have from the Young inequality Equation (2) that $\mathbb{E}_p[|uv|] \le 2$; hence:

$$L^{\Phi_*}(p) \times L^{\Phi}(p) \ni (v, u) \mapsto \langle u, v \rangle_p = \mathbb{E}_p[uv]$$

is a duality pairing, $|\langle u, v \rangle_p| \le 2\,\|u\|_{\Phi,p}\,\|v\|_{\Phi_*,p}$. It is a classical result that, in our case Equation (3), the space, $L^{\Phi_*}(p)$, is separable and its dual space is $L^\Phi(p)$, the duality pairing being $(u, v) \mapsto \langle u, v \rangle_p$. This duality extends to a continuous chain of spaces:

$$L^\Phi(p) \to L^a(p) \to L^b(p) \to L^{\Phi_*}(p), \qquad 1 < b \le 2 \le a, \quad \frac{1}{a} + \frac{1}{b} = 1$$

where $\to$ denotes continuous injection.
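The equivalence inequalities between the two Young pairs can be checked numerically. The following sketch (our own illustration; the function names are not from the paper) evaluates both pairs on a grid and verifies the Young inequality together with Equations (5) and (8):

```python
import math

# Young pair (a): Phi_*(x) = (1+x)ln(1+x) - x, Phi(y) = e^y - 1 - y
def Phi_a(y): return math.exp(y) - 1 - y
def Phi_star_a(x): return (1 + x) * math.log(1 + x) - x

# Young pair (b): Phi_*(x) = x asinh(x) - sqrt(1+x^2) + 1, Phi(y) = cosh(y) - 1
def Phi_b(y): return math.cosh(y) - 1
def Phi_star_b(x): return x * math.asinh(x) - math.sqrt(1 + x * x) + 1

for t in [0.1 * k for k in range(1, 60)]:
    # Young inequality: x*y <= Phi(x) + Phi_*(y)
    assert t * t <= Phi_a(t) + Phi_star_a(t) + 1e-12
    assert t * t <= Phi_b(t) + Phi_star_b(t) + 1e-12
    # Phi^(b) <= Phi^(a) <= 2 Phi^(b)  (from cosh v <= e^v <= 2 cosh v, v >= 0)
    assert Phi_b(t) <= Phi_a(t) <= 2 * Phi_b(t)
    # Phi_*^(a) <= Phi_*^(b) <= sqrt(2) Phi_*^(a)
    assert Phi_star_a(t) <= Phi_star_b(t) + 1e-12 <= math.sqrt(2) * Phi_star_a(t) + 1e-12
```

The check on the pair $(\Phi_*^{(a)}, \Phi_*^{(b)})$ is exactly the inequality that makes the two Luxemburg norms equivalent.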

2.1. Cumulant Generating Functional

Let $p \in \mathcal{P}_>$ be given. The following theorem has been proven in [9] (Ch 2); see also [10].
Proposition 1
  • For $a \ge 1$, $n = 0, 1, \ldots$, and $u \in L^\Phi(p)$:
    $$\lambda_{a,n}(u) \colon (w_1, \ldots, w_n) \mapsto \frac{w_1}{a} \cdots \frac{w_n}{a}\, e^{u/a}$$
    is a continuous, symmetric, $n$-multi-linear map from $L^\Phi(p)$ to $L^a(p)$.
  • $v \mapsto \sum_{n=0}^\infty \frac{1}{n!} \left(\frac{v}{a}\right)^n$ is a power series from $L^\Phi(p)$ to $L^a(p)$, with radius of convergence $\ge 1$.
  • The superposition mapping, $v \mapsto e^{v/a}$, is an analytic function from the open unit ball of $L^\Phi(p)$ to $L^a(p)$.
The previous theorem provides an improvement upon the original construction of [5].
Definition 1 Let $\Phi = \cosh - 1$ and $B_p = L_0^\Phi(p) = \{u \in L^\Phi(p) : \mathbb{E}_p[u] = 0\}$, $p \in \mathcal{P}_>$. The moment generating functional is $M_p \colon L^\Phi(p) \ni u \mapsto \mathbb{E}_p[e^u]$. The cumulant generating functional is $K_p \colon B_p \ni u \mapsto \log M_p(u)$.
The moment-generating functional is the partition functional (normalizing factor) of the Gibbs model, $(e^u / M_p(u)) \cdot p \in \mathcal{P}_>$, if $u \in L^\Phi(p)$ and $M_p(u) < +\infty$. The same model is written $e^{u - K_p(u)} \cdot p$ if, moreover, $\mathbb{E}_p[u] = 0$.
Proposition 2
  • $K_p(0) = 0$; otherwise, for each $u \ne 0$, $K_p(u) > 0$.
  • $K_p$ is convex and lower semi-continuous, and its proper domain is a convex set that contains the open unit ball of $B_p$; in particular, the interior of the proper domain is a non-empty open convex set, denoted by $\mathcal{S}_p$.
  • $K_p$ is infinitely Gâteaux-differentiable in the interior of its proper domain.
  • $K_p$ is bounded, infinitely Fréchet-differentiable and analytic on the open unit ball of $B_p$.
Other properties of the functional K p are described below, as they relate directly to the exponential manifold.
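On a finite state space, $M_p$ and $K_p$ reduce to finite sums, and the first two items of Proposition 2 can be checked directly. The following sketch is our own toy illustration (the numbers are hypothetical):

```python
import math, random

random.seed(0)
n = 6
p = [random.random() for _ in range(n)]
s = sum(p); p = [x / s for x in p]               # a strictly positive reference density p

def E(dens, x): return sum(d * xi for d, xi in zip(dens, x))

def center(x):                                    # project onto B_p = {u : E_p[u] = 0}
    return [xi - E(p, x) for xi in x]

def K(u):                                         # cumulant generating functional K_p(u) = log E_p[e^u]
    return math.log(E(p, [math.exp(ui) for ui in u]))

u = center([random.gauss(0, 1) for _ in range(n)])
v = center([random.gauss(0, 1) for _ in range(n)])

assert abs(K([0.0] * n)) < 1e-12                  # K_p(0) = 0
assert K(u) > 0                                   # K_p(u) > 0 for u != 0 (strict Jensen)
for lam in (0.25, 0.5, 0.75):                     # convexity along the segment [u, v]
    mix = [lam * a + (1 - lam) * b for a, b in zip(u, v)]
    assert K(mix) <= lam * K(u) + (1 - lam) * K(v) + 1e-12
```

The positivity for $u \ne 0$ is strict Jensen: $\mathbb{E}_p[e^u] > e^{\mathbb{E}_p[u]} = 1$ for non-constant centered $u$.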

3. Exponential Manifold

The set of positive densities, $\mathcal{P}_>$, around a given $p \in \mathcal{P}_>$ is modeled by the subspace of centered random variables in the Orlicz space, $L^\Phi(p)$. Hence, it is crucial to discuss the isomorphism of the model spaces for different $p$'s.
Definition 2 (Maximal exponential model: [10] (Def 20)). For $p \in \mathcal{P}_>$, let $\mathcal{S}_p$ be the topological interior of the proper domain of the cumulant generating functional, $K_p \colon B_p \to \mathbb{R}_{\ge 0} \cup \{+\infty\}$. The maximal exponential model at $p$ is:

$$\mathcal{E}(p) = \left\{ e^{u - K_p(u)} \cdot p : u \in \mathcal{S}_p \right\}$$

It is important to observe that $q \in \mathcal{E}(p)$ is equivalent to $p \in \mathcal{E}(q)$, as is proven in Proposition 3 below.
Definition 3 (Connected densities). Densities $p, q \in \mathcal{P}_>$ are connected by an open exponential arc, $p \smile q$, if there exists an open exponential family containing both, i.e., if, for an open interval $I \supset [0, 1]$:

$$\int p^{1-t} q^t \, d\mu = \mathbb{E}_p\!\left[\left(\frac{q}{p}\right)^{t}\right] = \mathbb{E}_q\!\left[\left(\frac{p}{q}\right)^{1-t}\right] < +\infty, \qquad t \in I$$

The following example is of interest for the applications in Section 4. Let $f_0$ be the standard normal density on $\mathbb{R}^N$ and $f$ a density $f(x) \propto (1 + |x|^a)\, f_0(x)$, $a > 0$. Then, $\int (1 + |x|^a)^t f_0(x)\,dx < +\infty$ for all real $t$; hence, $f_0 \smile f$.
Proposition 3 (Characterization of a maximal exponential model: [10] (Th 19 and 21)). The following statements are equivalent:
  • $p, q \in \mathcal{P}_>$ are connected by an open exponential arc, $p \smile q$;
  • $q \in \mathcal{E}(p)$;
  • $\mathcal{E}(p) = \mathcal{E}(q)$;
  • $\log \frac{q}{p}$ belongs to both $L^\Phi(p)$ and $L^\Phi(q)$.
If any of the above holds, then $L^\Phi(p)$ and $L^\Phi(q)$ are equal as vector spaces and their norms are equivalent.
We can now define the exponential manifold as follows.
Definition 4 (Exponential manifold: [5,7,9,10]). For each $p \in \mathcal{P}_>$, define the charts:

$$s_p \colon \mathcal{E}(p) \ni q \mapsto \ln \frac{q}{p} - \mathbb{E}_p\!\left[\ln \frac{q}{p}\right] \in \mathcal{S}_p \subset B_p,$$

with inverse:

$$s_p^{-1} = e_p \colon \mathcal{S}_p \ni u \mapsto e^{u - K_p(u)} \cdot p \in \mathcal{E}(p) \subset \mathcal{P}_>$$

The atlas, $\{s_p \colon \mathcal{E}(p) \to \mathcal{S}_p : p \in \mathcal{P}_>\}$, is affine and defines the exponential (statistical) manifold $\mathcal{P}_>$.
The affine manifold we have defined has a simple and natural structure, because of Proposition 3. The domains, $\mathcal{E}(p)$ and $\mathcal{E}(q)$, of the charts, $s_p$ and $s_q$, are either disjoint or equal; when $p \smile q$, the transition map $s_q \circ s_p^{-1} \colon \mathcal{S}_p \to \mathcal{S}_q$ is affine, its linear part $d(s_q \circ s_p^{-1})$ maps $B_p$ onto $B_q$, and it extends to the identification of $L^\Phi(p)$ with $L^\Phi(q)$.
For ease of reference, various results from [5,7,9,10] are collected in the following proposition. We assume $q = e^{u - K_p(u)} \cdot p \in \mathcal{E}(p)$. It should be noted that $K_p(u) = \mathbb{E}_p[\ln(p/q)]$ is the expression in the chart centered at $p$ of the Kullback-Leibler divergence, $D(p \| q)$.
Proposition 4.
  1. The first three derivatives of $K_p$ on $\mathcal{S}_p$ are:
    $$d K_p(u)\,v = \mathbb{E}_q[v]$$
    $$d^2 K_p(u)\,(v_1, v_2) = \mathrm{Cov}_q(v_1, v_2)$$
    $$d^3 K_p(u)\,(v_1, v_2, v_3) = \mathrm{Cov}_q(v_1, v_2, v_3)$$
  2. The random variable, $\frac{q}{p} - 1$, belongs to ${}^*B_p$ and:
    $$d K_p(u)\,v = \mathbb{E}_p\!\left[\left(\frac{q}{p} - 1\right) v\right]$$
    In other words, the gradient of $K_p$ at $u$ is identified with an element of the predual space of $B_p$, viz. ${}^*B_p = L_0^{\Phi_*}(p)$, denoted by $\nabla K_p(u) = e^{u - K_p(u)} - 1 = \frac{q}{p} - 1$.
  3. The mapping, $\mathcal{S}_p \ni u \mapsto \nabla K_p(u) \in {}^*B_p$, is monotonic:
    $$\left\langle \nabla K_p(u) - \nabla K_p(v), u - v \right\rangle_p > 0, \qquad u \ne v$$
    in particular, it is one-to-one.
  4. The weak derivative of the map, $\mathcal{S}_p \ni u \mapsto \nabla K_p(u) \in {}^*B_p$, at $u$, applied to $w \in B_p$, is given by:
    $$d\left(\nabla K_p(u)\right) w = \frac{q}{p}\left(w - \mathbb{E}_q[w]\right)$$
    and it is one-to-one at each point.
  5. The mapping, ${}^m U_p^q \colon v \mapsto \frac{p}{q}\, v$, is an isomorphism of ${}^*B_p$ onto ${}^*B_q$. It is called the mixture transport or m-transport.
  6. $q/p \in L^{\Phi_*}(p)$.
  7. $D(q \| p) = d K_p(u)\,u - K_p(u)$ with $q = e^{u - K_p(u)} \cdot p$; in particular, $D(q \| p) < +\infty$.
  8. $B_q$ is defined by an orthogonality property:
    $$B_q = L_0^\Phi(q) = \left\{ u \in L^\Phi(p) : \mathbb{E}_p\!\left[u\, \frac{q}{p}\right] = 0 \right\}$$
  9. The mapping, ${}^e U_p^q \colon u \mapsto u - \mathbb{E}_q[u]$, is an isomorphism of $B_p$ onto $B_q$. It is called the exponential transport or e-transport.
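The identities $dK_p(u)\,v = \mathbb{E}_q[v]$ and $\nabla K_p(u) = q/p - 1$ of Proposition 4 can be verified by finite differences on a finite state space. This is a sketch under our own toy discretization, not code from the paper:

```python
import math, random

random.seed(1)
n = 5
p = [random.random() for _ in range(n)]
s = sum(p); p = [x / s for x in p]

def E(d, x): return sum(di * xi for di, xi in zip(d, x))
def center(x): return [xi - E(p, x) for xi in x]
def K(u): return math.log(E(p, [math.exp(ui) for ui in u]))

u = center([random.gauss(0, 0.5) for _ in range(n)])
v = center([random.gauss(0, 0.5) for _ in range(n)])

# q = e^{u - K_p(u)} . p is the density charted by u
q = [math.exp(ui - K(u)) * pi for ui, pi in zip(u, p)]
assert abs(sum(q) - 1) < 1e-12

# dK_p(u) v = E_q[v], via a central finite difference of K_p
h = 1e-6
num = (K([ui + h * vi for ui, vi in zip(u, v)]) -
       K([ui - h * vi for ui, vi in zip(u, v)])) / (2 * h)
assert abs(num - E(q, v)) < 1e-6

# gradient identity: E_p[(q/p - 1) v] = E_q[v] for centered v
grad = [qi / pi - 1 for qi, pi in zip(q, p)]
assert abs(E(p, [g * vi for g, vi in zip(grad, v)]) - E(q, v)) < 1e-12
```

The last assertion is exact: $\mathbb{E}_p[(q/p - 1)v] = \mathbb{E}_q[v] - \mathbb{E}_p[v]$, and $\mathbb{E}_p[v] = 0$ on $B_p$.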

3.1. Tangent Bundle

Our discussion of the tangent bundle of the exponential manifold is based on the concept of the velocity of a curve, as in [16] (§3.3), and it is mainly intended to underline its statistical interpretation, which is obtained by identifying curves with one-parameter statistical models. For a statistical model $p(t)$, $t \in I$, the random variable, $\dot p(t)/p(t)$, the Fisher score, has zero expectation with respect to $p(t)$, and its meaning in the exponential manifold is velocity. If $p(t) = e^{tv - \psi(t)} \cdot p$, $v \in L^\Phi(p)$, is an exponential family, then $\dot p(t)/p(t) = v - \mathbb{E}_{p(t)}[v] \in B_{p(t)}$; see [22] on exponential families.
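For a concrete illustration of the Fisher score as velocity, take a two-state exponential family $p(t) = e^{tv - \psi(t)} \cdot p$ and check numerically that $\dot p(t)/p(t) = v - \mathbb{E}_{p(t)}[v]$ and that it has zero mean under $p(t)$. This is our own numerical sketch:

```python
import math

p = [0.3, 0.7]                    # reference density on a two-point space
v = [1.0, -1.0]                   # direction of the exponential family

def density(t):
    w = [pi * math.exp(t * vi) for pi, vi in zip(p, v)]
    z = sum(w)                    # e^{psi(t)} = E_p[e^{tv}]
    return [wi / z for wi in w]

t, h = 0.4, 1e-6
pt, pp, pm = density(t), density(t + h), density(t - h)

# Fisher score d/dt ln p(t), computed by central finite differences
score = [(math.log(a) - math.log(b)) / (2 * h) for a, b in zip(pp, pm)]

Ev = sum(pi * vi for pi, vi in zip(pt, v))            # E_{p(t)}[v]
for sc, vi in zip(score, v):
    assert abs(sc - (vi - Ev)) < 1e-6                 # score = v - E_{p(t)}[v]
assert abs(sum(pi * sc for pi, sc in zip(pt, score))) < 1e-6  # zero mean under p(t)
```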
Let $p(\cdot) \colon I \to \mathcal{E}(p)$, $I$ an open real interval containing zero. In the chart centered at $p$, the curve is $u(\cdot) \colon I \to B_p$, where $p(t) = e^{u(t) - K_p(u(t))} \cdot p$. The transition maps of the exponential manifold are:

$$s_q \circ e_p \colon \mathcal{S}_p \ni u \mapsto s_q\!\left(e^{u - K_p(u)} \cdot p\right) = u - \mathbb{E}_q[u] + \ln \frac{p}{q} - \mathbb{E}_q\!\left[\ln \frac{p}{q}\right] \in \mathcal{S}_q$$

with derivative:

$$d\,(s_q \circ s_p^{-1})(u)\,v = v - \mathbb{E}_q[v] = {}^e U_p^q\, v, \qquad v \in B_p$$
Definition 5 (Velocity field of a curve).
  1. Assume $t \mapsto u(t) = s_p(p(t))$ is differentiable with derivative $\dot u(t)$. Define:
    $$\delta p(t) = {}^e U_p^{p(t)}\, \dot u(t) = \dot u(t) - \mathbb{E}_{p(t)}[\dot u(t)] = \frac{d}{dt}\left(u(t) - K_p(u(t))\right) = \frac{d}{dt} \ln \frac{p(t)}{p} = \frac{\dot p(t)}{p(t)}$$
    Note that $\delta p$ does not depend on the chart $s_p$ and that the derivative of $t \mapsto p(t)$ in the last term of the equation is computed in $L^{\Phi_*}(p)$. The curve $t \mapsto (p(t), \delta p(t))$ is the velocity field of the curve.
  2. On the set $\{(p, v) : p \in \mathcal{P}_>, v \in B_p\}$, the charts:
    $$s_p \colon \{(q, w) : q \in \mathcal{E}(p), w \in B_q\} \ni (q, w) \mapsto \left(s_p(q), {}^e U_q^p\, w\right) \in \mathcal{S}_p \times B_p \subset B_p \times B_p$$
    define the tangent bundle, $T\mathcal{P}_>$. The isomorphism $w \mapsto {}^e U_p^q\, w = w - \mathbb{E}_q[w] = d(s_q \circ s_p^{-1})(u)\,w$ of Proposition 4(9) is the (exponential) parallel transport.
Let $E \colon \mathcal{E}(p) \to \mathbb{R}$ be a $C^1$ function. Then, $E_p = E \circ e_p \colon \mathcal{S}_p \to \mathbb{R}$ is differentiable and:

$$\frac{d}{dt} E(p(t)) = \frac{d}{dt} E_p(u(t)) = d E_p(u(t))\,\dot u(t) = d E_p(u(t))\, {}^e U_{p(t)}^p\, \delta p(t)$$
Proposition 5 (Covariant derivative of a real function).
  1. As $v \mapsto d E_p(u)\,v$ is a linear operator on $B_p$, $w \mapsto d E_p(u)\, {}^e U_{e_p(u)}^p\, w$ is a linear operator on $B_{e_p(u)}$, which does not depend on $p$.
  2. If $G$ is a vector field of $T\mathcal{P}_>$, the covariant derivative $D_G E$ is:
    $$D_G E(q) = d E_p(s_p(q))\, {}^e U_q^p\, w = d E_q(0)\,w, \qquad w = G(q)$$
  3. Assume, moreover, that $d E_p(u) \in B_p^*$ can be identified with an element, $\nabla E_p(u) \in {}^*B_p$, by:
    $$d E_p(u)\,w = \mathbb{E}_p\!\left[\nabla E_p(u)\,w\right], \qquad w \in B_p$$
    Then, for $u = s_p(q)$:
    $$D_G E(q) = d E_p(u)\, {}^e U_q^p\, G(q) = \mathbb{E}_q\!\left[{}^m U_p^q\, \nabla E_p(u)\; G(q)\right]$$
    We define the covariant gradient, $\nabla E(q)$, by $D_G E(q) = \mathbb{E}_q\!\left[\nabla E(q)\, G(q)\right]$.
Proof.
  1. Assume $u_1 = s_{p_1}(q) = s_{p_1} \circ e_{p_2}(u_2)$, so that $E(q) = E_{p_1}(u_1) = E_{p_2}(u_2) = E_{p_1}(s_{p_1} \circ e_{p_2}(u_2))$ and:
    $$d E_{p_2}(u_2)\, {}^e U_q^{p_2}\, w = d E_{p_1}(u_1)\, {}^e U_{p_2}^{p_1}\, {}^e U_q^{p_2}\, w = d E_{p_1}(u_1)\, {}^e U_q^{p_1}\, w$$
  2. Compute the derivative of $t \mapsto E(p(t))$ when $\delta p(t) = G(p(t))$.
  3. It is a computation based on:
    $$\mathbb{E}_p\!\left[\nabla E_p(u)\left(G(q) - \mathbb{E}_p[G(q)]\right)\right] = \mathbb{E}_q\!\left[\frac{p}{q}\,\nabla E_p(u)\left(G(q) - \mathbb{E}_p[G(q)]\right)\right] = \mathbb{E}_q\!\left[\frac{p}{q}\,\nabla E_p(u)\,G(q)\right]$$
Definition 6. Let $F, G$ be vector fields of $T\mathcal{P}_>$ defined on $\mathcal{E}(p)$. In the chart at $p$, $F_p(u) = {}^e U_{e_p(u)}^p\, F \circ e_p(u)$, $u \in \mathcal{S}_p$, has differential $B_p \ni v \mapsto d F_p(u)\,v \in B_p$. The e-covariant derivative is the vector field defined by $D_G F(q) = {}^e U_p^q\, d F_p(s_p(q))\, {}^e U_q^p\, w$, $w = G(q)$, and this definition does not depend on $p$.

3.2. Pretangent Bundle

Because of the lack of reflexivity of the exponential Orlicz space, we are forced to distinguish between the dual tangent bundle, $(T\mathcal{P}_>)^* = \{(p, v) : p \in \mathcal{P}_>, v \in (B_p)^*\}$, and a pretangent bundle.
Definition 7. The set, $\{(q, v) : q \in \mathcal{P}_>, v \in {}^*B_q\}$, together with the charts:

$${}^*s_p \colon \{(q, v) : q \in \mathcal{E}(p), v \in {}^*B_q\} \ni (q, v) \mapsto \left(s_p(q), {}^m U_q^p\, v\right)$$

is the pretangent bundle, ${}^*T\mathcal{P}_>$.
The pretangent bundle is actually the tangent bundle of the mixture manifold (cf. [1]) on $\mathcal{P}_1 = \{f \in L^1(\mu) : \int f\,d\mu = 1\}$, whose charts are of the form $\eta_p(q) = q/p - 1 \in {}^*B_p$. For each $p \in \mathcal{P}_>$, consider the set:

$${}^*\mathcal{U}_p = \left\{ q \in \mathcal{P}_1 : q/p \in L^{\Phi_*}(p) \right\}$$

and the mapping:

$$\eta_p \colon {}^*\mathcal{U}_p \ni q \mapsto \eta_p(q) = q/p - 1 \in {}^*B_p.$$

We can characterize ${}^*\mathcal{U}_p$ as the set of $q$'s of finite Kullback-Leibler divergence from $p$.
Proposition 6 ([9] (Proposition 30)). Let $p \in \mathcal{P}_>$ and $q \in \mathcal{P}_1$. Define $\tilde q = q / \int q\,d\mu$. Then, $D(\tilde q \| p) < +\infty$, if and only if $q/p \in L^{\Phi_*}(p)$.
Proof. The second derivative of $\Phi_*(x) = (1+x)\ln(1+x) - x$, $x > 0$, is $1/(1+x)$, while the second derivative of $x \ln x$ is $1/x$. The function $x \ln x$ is more convex than $\Phi_*(x)$, as $0 < 1/(1+x) < 1/x$. The two functions have parallel tangents at $x > 0$ if $\ln(1+x) = \ln x + 1$, that is, at $\bar x = 1/(e-1)$. At this point, the difference of the values is:

$$\Phi_*(\bar x) - \bar x \ln \bar x = 1 - \ln(e-1)$$

In conclusion, we have the inequalities:

$$\Phi_*(x) \le x \ln x + 1 - \ln(e-1) < x \ln x + 1, \qquad x \ge 0$$

If $D(\tilde q \| p) < +\infty$, then:

$$+\infty > \int \ln\!\left(\frac{\tilde q}{p}\right) \tilde q \, d\mu = \mathbb{E}_p\!\left[\frac{\tilde q}{p} \ln \frac{\tilde q}{p}\right] \ge \mathbb{E}_p\!\left[\Phi_*\!\left(\frac{\tilde q}{p}\right)\right] - \left(1 - \ln(e-1)\right) = \mathbb{E}_p\!\left[\Phi_*\!\left(\left(\int q\,d\mu\right)^{-1} \frac{q}{p}\right)\right] - \left(1 - \ln(e-1)\right)$$

so that $q/p \in L^{\Phi_*}(p)$.
Assume now $q/p \in L^{\Phi_*}(p)$ or, equivalently, $\tilde q/p \in L^{\Phi_*}(p)$. As $x \ln^+(x) \le (1+x)\ln(1+x)$ for $x \ge 0$, we have:

$$+\infty > \mathbb{E}_p\!\left[\Phi_*\!\left(\frac{\tilde q}{p}\right)\right] = \mathbb{E}_p\!\left[\left(1 + \frac{\tilde q}{p}\right)\ln\!\left(1 + \frac{\tilde q}{p}\right)\right] - 1 \ge \mathbb{E}_p\!\left[\frac{\tilde q}{p}\,\ln^+\frac{\tilde q}{p}\right] - 1$$

which, in turn, implies that:

$$D(\tilde q \| p) \le \mathbb{E}_p\!\left[\frac{\tilde q}{p}\,\ln^+\frac{\tilde q}{p}\right]$$

is finite. ☐
The covariant gradient defined in Proposition 5(3) is a vector field of the pretangent bundle. Note that the injection, $\mathcal{P}_> \to \mathcal{P}_1$, is represented in the charts centered at $p$ by $u \mapsto e^{u - K_p(u)} - 1$. We do not further discuss the mixture manifold here and refer to [10] (Section 5) for further information on this topic.
Let $F$ be a vector field of the pretangent bundle, ${}^*T\mathcal{P}_>$. In the chart centered at $p$, the field $\mathcal{E}(p) \ni q \mapsto F(q)$ is represented by:

$$F_p(u) = {}^m U_{e_p(u)}^p\, F \circ e_p(u) \in {}^*B_p, \qquad u \in \mathcal{S}_p$$

If $F_p$ is of class $C^1$ with derivative $d F_p(u) \in L(B_p, {}^*B_p)$, then, for each differentiable curve $t \mapsto p(t) = e^{u(t) - K_p(u(t))} \cdot p$:

$$\frac{d}{dt} F_p(u(t)) = d F_p(u(t))\,\dot u(t) = d F_p(u(t))\, {}^e U_{p(t)}^p\, \delta p(t) \in {}^*B_p$$

For each $q = e_p(u) \in \mathcal{E}(p)$ and $w \in B_q$, the value ${}^m U_p^q\, d F_p(u)\, {}^e U_q^p\, w \in {}^*B_q$ does not depend on $p$.
Definition 8 (Covariant derivative in ${}^*T\mathcal{P}_>$). Let $F$ be a vector field of the pretangent bundle, ${}^*T\mathcal{P}_>$, and $G$ a vector field of the tangent bundle, $T\mathcal{P}_>$, both of class $C^1$ on $\mathcal{E}(p)$. The covariant derivative is:

$$D_G F(q) = d F_q(0)\,w, \qquad w = G(q)$$

where $F_q(v) = {}^m U_{e_q(v)}^q\, F \circ e_q(v)$ is the representation of $F$ in the chart centered at $q$.
The tangent and pretangent bundles can be coupled to produce the new frame bundle:

$$({}^*T \times T)\,\mathcal{P}_> = \left\{(p, v, w) : p \in \mathcal{P}_>, v \in {}^*B_p, w \in B_p\right\}$$

with the duality coupling:

$$({}^*T \times T)\,\mathcal{P}_> \ni (p, v, w) \mapsto \langle v, w \rangle_p = \mathbb{E}_p[vw] = \mathbb{E}_q\!\left[{}^m U_p^q\, v\ {}^e U_p^q\, w\right], \qquad p \smile q$$

Proposition 7 (Covariant derivative of the duality coupling). Let $F$ be a vector field of ${}^*T\mathcal{P}_>$, and $G, H$ vector fields of $T\mathcal{P}_>$, all of class $C^1$ on a maximal exponential model $\mathcal{E}$. Then:

$$D_H \langle F, G \rangle = \langle D_H F, G \rangle + \langle F, D_H G \rangle$$

Proof. Consider the real function $\mathcal{E} \ni q \mapsto \langle F, G \rangle(q) = \mathbb{E}_q[F(q)\,G(q)]$ in the chart centered at any $p \in \mathcal{E}$:

$$\mathcal{S}_p \ni u \mapsto \mathbb{E}_q[F(q)\,G(q)] = \mathbb{E}_p\!\left[{}^m U_q^p\, F \circ e_p(u)\ {}^e U_q^p\, G \circ e_p(u)\right] = \mathbb{E}_p[F_p(u)\,G_p(u)]$$

and compute its derivative. ☐

3.3. The Hilbert Bundle

The duality on $({}^*T \times T)\,\mathcal{P}_>$ is reminiscent of a Riemannian metric, but it is not one, because we do not have a Riemannian manifold unless the state space is finite. However, we can push the analogy further by constructing a Hilbert bundle. As $L^\Phi(p) \subset L^2(p) \subset L^{\Phi_*}(p)$, $p \in \mathcal{P}_>$, we have $B_p \subset H_p \subset {}^*B_p$, with $H_p = L_0^2(p)$ being the fiber at $p$. The Hilbert bundle:

$$H\mathcal{P}_> = \left\{(p, v) : p \in \mathcal{P}_>, v \in H_p\right\}$$

is provided with an atlas of charts by using the isometries, $U_p^q \colon H_p \to H_q$, which result from the pull-back of the metric connection on the sphere $S_\mu = \{f \in L^2(\mu) : \int f^2\,d\mu = 1\}$; see [6,8,23] and [14] (Section 4).
Proposition 8 (Isometric transport: [14] (Proposition 13)).
  1. For all $p, q \in \mathcal{P}_>$, the mapping:
    $$U_p^q \colon v \mapsto \sqrt{\frac{p}{q}}\, v - \left(1 + \mathbb{E}_q\!\left[\sqrt{\frac{p}{q}}\right]\right)^{-1} \left(1 + \sqrt{\frac{p}{q}}\right) \mathbb{E}_q\!\left[\sqrt{\frac{p}{q}}\, v\right]$$
    is an isometry of $H_p$ onto $H_q$.
  2. $U_q^p\, U_p^q\, u = u$, $u \in H_p$, and $(U_p^q)^t = U_q^p$.
Note that, in general, $U_q^r\, U_p^q \ne U_p^r$.
Definition 9 (Hilbert bundle). The charts:

$${}^2s_p \colon \{(q, v) : q \in \mathcal{E}(p), v \in H_q\} \ni (q, v) \mapsto \left(s_p(q), U_q^p\, v\right) \in \mathcal{S}_p \times H_p \subset B_p \times H_p$$

form an atlas on $H\mathcal{P}_>$.
Let $t \mapsto p(t)$ be a $C^1$ curve in $\mathcal{E}(p)$, $p = p(0)$, $u(t) = s_p(p(t))$, and let $F \colon \mathcal{E}(p) \to H\mathcal{P}_>$ be a $C^1$ vector field in $H\mathcal{P}_>$. In the chart centered at $p$, we have $F_p(u(t)) = U_{p(t)}^p\,(F \circ e_p)(u(t))$. A computation shows that:

$$\left.\frac{d}{dt} F_p(t)\right|_{t=0} = \left.\frac{d}{dt} U_{p(t)}^p\,(F \circ e_p)(u(t))\right|_{t=0} = d F_p(0)\,\delta p(0) + \frac{1}{2} F_p(0)\,\delta p(0) - \mathbb{E}_p\!\left[d F_p(0)\,\delta p(0) + \frac{1}{2} F_p(0)\,\delta p(0)\right]$$

which could be used as a nonparametric definition of the metric connection; see [14] (Section 4.4) and [23].
Our results on parallel transports and connections are a development, not yet complete, of previous work on statistical bundles in [6,8,14,23].

3.4. The Second Tangent Bundle

We briefly discuss here the second order structure, i.e., the tangent bundle of the tangent bundle, $T\mathcal{P}_>$. Let $F \colon I \ni t \mapsto (p(t), V(t))$ be a $C^1$ curve in the tangent bundle, $T\mathcal{P}_>$. In the chart centered at $p$, we have:

$$F_p(t) = \left(s_p(p(t)),\ {}^e U_{p(t)}^p\, V(t)\right) = \left(u(t), V_p(t)\right)$$

where $p(t) = e^{u(t) - K_p(u(t))} \cdot p$ and $V(t) = V_p(t) - \mathbb{E}_{p(t)}[V_p(t)] = V_p(t) - d K_p(u(t))\,(V_p(t))$. It follows that $t \mapsto V(t)$ is differentiable in $L^\Phi(p)$, with derivative:

$$\dot V(t) = \dot V_p(t) - d K_p(u(t))\,(\dot V_p(t)) - d^2 K_p(u(t))\,(V_p(t), \dot u(t)) = {}^e U_p^{p(t)}\, \dot V_p(t) - \mathrm{Cov}_{p(t)}\!\left(V_p(t), \dot u(t)\right)$$

hence:

$${}^e U_p^{p(t)}\, \dot V_p(t) = \dot V(t) + \mathbb{E}_{p(t)}\!\left[V(t)\,\delta p(t)\right]$$

It follows, in particular, that $\mathbb{E}_{p(t)}[\dot V(t)] = -\mathbb{E}_{p(t)}[V(t)\,\delta p(t)]$ and ${}^e U_p^{p(t)}\, \dot V_p(t) = \dot V(t) - \mathbb{E}_{p(t)}[\dot V(t)]$. Note that the left-hand side is not a transport, but an extension of the transport, precisely the projection, $\Pi_{p(t)} \colon L^\Phi(p) \to B_{p(t)}$. It follows from $\dot F_p(t) = \left(\dot u(t), \dot V_p(t)\right)$ that the velocity vector is:

$$\delta(p, V)(t) = \left(\delta p(t),\ {}^e U_p^{p(t)}\, \dot V_p(t)\right) = \left(\delta p(t),\ \Pi_{p(t)}\, \dot V(t)\right)$$

Equation (55), in the case $V(t) = \delta p(t)$, gives:

$$\Pi_{p(t)}\,\frac{d}{dt}\delta p(t) = \frac{d}{dt}\delta p(t) + \mathbb{E}_{p(t)}\!\left[\delta p(t)^2\right] = \frac{d}{dt}\delta p(t) + I(p(t))$$

where we have denoted by $I(p(t)) = \mathbb{E}_{p(t)}\!\left[\left(\frac{d}{dt} \ln p(t)\right)^2\right]$ the Fisher information. In this case, we can write:

$$\delta(p, \delta p)(t) = \left(\delta p(t),\ \frac{d}{dt}\delta p(t) + I(p(t))\right)$$
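For an exponential family $p(t) = e^{tv - \psi(t)} \cdot p$, the score is $v - \mathbb{E}_{p(t)}[v]$, so the Fisher information $I(p(t))$ equals $\mathrm{Var}_{p(t)}(v) = \psi''(t)$. The following sketch (our own toy discretization) checks this identity numerically:

```python
import math

p = [0.2, 0.5, 0.3]
v = [1.0, 0.0, -2.0]

def density(t):
    w = [pi * math.exp(t * vi) for pi, vi in zip(p, v)]
    z = sum(w)
    return [wi / z for wi in w]

def psi(t):  # psi(t) = log E_p[e^{tv}]
    return math.log(sum(pi * math.exp(t * vi) for pi, vi in zip(p, v)))

t, h = 0.3, 1e-4
pt = density(t)
Ev = sum(pi * vi for pi, vi in zip(pt, v))
fisher = sum(pi * (vi - Ev) ** 2 for pi, vi in zip(pt, v))  # I(p(t)) = Var_{p(t)}(v)

psi2 = (psi(t + h) - 2 * psi(t) + psi(t - h)) / h ** 2       # second difference of psi
assert abs(fisher - psi2) < 1e-6
```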

4. Applications

In this section, we consider a typical set of examples where the nonparametric framework is applicable.

4.1. Expected Value

Let $f \in L^\Phi(p)$, $f_0 = f - \mathbb{E}_p[f] \in B_p$, and consider its expected value as a function of the density, $q \in \mathcal{E}(p)$:

$$E \colon \mathcal{E}(p) \ni q \mapsto \mathbb{E}_q[f] = \mathbb{E}_q[f_0] + \mathbb{E}_p[f]$$

which is sometimes called the relaxed version of $f$ in optimization theory, where it is convenient to regularize the problem of finding $\max_x f(x)$ by extending it to the problem of finding $\max_q E(q)$. As $\mathbb{E}_q[f] < \max f$, $q \in \mathcal{P}_>$, unless $f$ is μ-a.s. constant, relaxed optimization can produce a maximizing sequence only.
The information geometric study of the relaxed mapping can be based on the notion of natural gradient, as defined in a seminal paper by Amari [24], and it is currently used for optimization; see, e.g., [25,26,27,28,29,30,31]. The covariant derivative of a real function is the nonparametric counterpart of Amari's natural gradient.
From the properties of $K_p$ in Equations (17) and (18) of Proposition 4, we obtain the representation $E_p(u) = E \circ s_p^{-1}(u)$ of the function in Equation (59) in the chart centered at $p$, $E_p(u) = d K_p(u)\,(f_0) + \mathbb{E}_p[f]$, whose differential in the direction $v$ is $d E_p(u)\,v = d^2 K_p(u)\,(f_0, v) = \mathrm{Cov}_q(f, v)$. The covariant derivative at $(q, w)$, $w \in B_q$, is computed from Proposition 5(2) as:

$$d E_p(u)\, {}^e U_q^p\, w = \mathrm{Cov}_q\!\left(f, {}^e U_q^p\, w\right) = \mathbb{E}_q\!\left[(f - \mathbb{E}_q[f])(w - \mathbb{E}_p[w])\right] = \mathbb{E}_q\!\left[(f - \mathbb{E}_q[f])\,w\right]$$

hence, $D_G E(p) = \mathbb{E}_p[(f - \mathbb{E}_p[f])\,G(p)]$, with gradient $\nabla E(q) = f - \mathbb{E}_q[f]$ in the duality on ${}^*B_q \times B_q$. Note that the gradient is never zero, unless $f$ is constant, and that the covariant derivative is zero for each vector field $G$ which is uncorrelated with $f$.
We now compute the covariant derivative of the gradient, in order to obtain the Hessian of the function, $E$. Consider the gradient vector field $F(q) = f - \mathbb{E}_q[f] \in {}^*T\mathcal{P}_>$. The gradient flow is:

$$\delta p(t) = \frac{d}{dt} \ln p(t) = f - \mathbb{E}_{p(t)}[f]$$

whose unique solution is the exponential family, $p(t) \propto e^{tf} \cdot p(0)$. In fact, the gradient is actually the e-transport of $f_0$, $F(e_p(u)) = {}^e U_p^{e_p(u)}\, f_0$, and the exponential family is the exponential curve of the e-transport.
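The closed-form solution $p(t) \propto e^{tf}\,p(0)$ makes the relaxed-optimization picture easy to watch on a finite state space: $\mathbb{E}_{p(t)}[f]$ increases monotonically toward $\max f$ without ever attaining it. A sketch with hypothetical numbers:

```python
import math

f = [0.3, 1.0, -0.5, 0.7]
p0 = [0.25, 0.25, 0.25, 0.25]

def flow(t):  # p(t) proportional to e^{tf} p(0), solution of d/dt ln p = f - E_p[f]
    w = [pi * math.exp(t * fi) for pi, fi in zip(p0, f)]
    z = sum(w)
    return [wi / z for wi in w]

def Ef(d): return sum(di * fi for di, fi in zip(d, f))

values = [Ef(flow(t)) for t in (0.0, 1.0, 2.0, 5.0, 10.0, 50.0)]
assert all(a < b for a, b in zip(values, values[1:]))  # E_{p(t)}[f] is increasing
assert all(val < max(f) for val in values)             # never attains max f

# check the flow equation d/dt ln p(t) = f - E_{p(t)}[f] by finite differences
t, h = 1.5, 1e-6
pt, pp, pm = flow(t), flow(t + h), flow(t - h)
for a, b, fi in zip(pp, pm, f):
    assert abs((math.log(a) - math.log(b)) / (2 * h) - (fi - Ef(pt))) < 1e-5
```

The monotonicity is the statement $\frac{d}{dt}\mathbb{E}_{p(t)}[f] = \mathrm{Var}_{p(t)}(f) > 0$ for non-constant $f$.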
Let us discuss the differentiability of the gradient. In the chart centered at $p$, the gradient is represented as:

$$F_p(u) = {}^m U_{e_p(u)}^p\left[f_0 - d K_p(u)(f_0)\right] = {}^m U_{e_p(u)}^p\ {}^e U_p^{e_p(u)}\, f_0$$

Let us first compute the differential of $u \mapsto \langle F_p(u), w \rangle_p$, $w \in B_p$, in the direction $v \in B_p$, i.e., the weak differential:

$$d_v \langle F_p(u), w \rangle_p = d_v \left\langle {}^m U_{e_p(u)}^p\ {}^e U_p^{e_p(u)}\, f_0, w \right\rangle_p = d_v\, d^2 K_p(u)\,(f_0, w) = d^3 K_p(u)\,(f_0, w, v) = \mathrm{Cov}_{e_p(u)}(f_0, w, v)$$

where we have used Proposition 4. At $u = 0$:

$$d_v \langle F_p(0), w \rangle_p = \mathbb{E}_p[f_0\, w\, v] = \mathbb{E}_p\!\left[\left(f_0 v - \mathbb{E}_p[f_0 v]\right) w\right] = \left\langle f_0 v - \mathbb{E}_p[f_0 v],\ w \right\rangle_p$$

The product $f_0\, G(p)$ belongs to ${}^*B_p$. In fact:

$$\mathbb{E}_p\!\left[\Phi_*\!\left(f_0\, G(p)\right)\right] = \mathbb{E}_p\!\left[f_0^2 \int_0^{G(p)} \frac{G(p) - u}{1 + f_0 u}\,du\right] \le \frac{1}{2}\, \mathbb{E}_p\!\left[f_0^2\, G(p)^2\right] < +\infty$$
If $D_G \nabla E$ exists in ${}^*T\mathcal{P}_>$ as a Fréchet derivative, then:

$$D_G \nabla E(p) = f_0\, G(p) - \mathbb{E}_p\!\left[f_0\, G(p)\right]$$
The differentiability in Orlicz spaces of superposition operators is discussed in detail in [32].

4.2. Kullback-Leibler Divergence

If $\mathcal{E}$ is a maximal exponential model, the mapping:

$$\mathcal{E} \times \mathcal{E} \ni (q_1, q_2) \mapsto D(q_1 \| q_2) = \mathbb{E}_{q_1}\!\left[\ln \frac{q_1}{q_2}\right]$$

is represented in the charts centered at $p$ by:

$$E_p \colon \mathcal{S}_p \times \mathcal{S}_p \ni (u_1, u_2) \mapsto d K_p(u_1)\,(u_1 - u_2) - \left(K_p(u_1) - K_p(u_2)\right)$$

Hence, from Proposition 2(4), it is $C^\infty$ jointly in both variables and, moreover, analytic:

$$E_p(u_1, u_2) = \sum_{n \ge 2} \frac{1}{n!}\, d^n K_p(u_1)\,(u_1 - u_2)^{\otimes n}, \qquad \|u_1 - u_2\|_{\Phi,p} < 1$$

This regularity result is to be compared with what is available when the restriction, $q_1 \smile q_2$, is removed, i.e., the semi-continuity [33].
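The chart representation can be verified directly on a finite state space: the divergence computed from its definition agrees exactly with $dK_p(u_1)(u_1-u_2) - (K_p(u_1) - K_p(u_2))$, using $dK_p(u_1)v = \mathbb{E}_{q_1}[v]$. A sketch with our own numbers:

```python
import math, random

random.seed(2)
n = 5
p = [random.random() for _ in range(n)]
s = sum(p); p = [x / s for x in p]

def E(d, x): return sum(di * xi for di, xi in zip(d, x))
def center(x): return [xi - E(p, x) for xi in x]
def K(u): return math.log(E(p, [math.exp(ui) for ui in u]))
def e_p(u): return [math.exp(ui - K(u)) * pi for ui, pi in zip(u, p)]

u1 = center([random.gauss(0, 0.5) for _ in range(n)])
u2 = center([random.gauss(0, 0.5) for _ in range(n)])
q1, q2 = e_p(u1), e_p(u2)

kl_direct = sum(a * math.log(a / b) for a, b in zip(q1, q2))
# chart representation: dK_p(u1)(u1 - u2) - (K_p(u1) - K_p(u2)), with dK_p(u1)v = E_{q1}[v]
kl_chart = E(q1, [a - b for a, b in zip(u1, u2)]) - (K(u1) - K(u2))
assert abs(kl_direct - kl_chart) < 1e-9
```

The agreement is exact because $\ln(q_1/q_2) = (u_1 - u_2) - (K_p(u_1) - K_p(u_2))$ pointwise.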
The (partial) derivative of $u_2 \mapsto E_p(u_1, u_2)$ in the direction $v_2 \in B_p$ is:

$$d_2 E_p(u_1, u_2)\,v_2 = -d K_p(u_1)\,v_2 + d K_p(u_2)\,v_2 = \mathbb{E}_{q_2}[v_2] - \mathbb{E}_{q_1}[v_2]$$

If $v_2 = {}^e U_q^p\, w$, we have $\mathbb{E}_{q_2}[v_2] - \mathbb{E}_{q_1}[v_2] = \mathbb{E}_q[w] - \mathbb{E}_{q_1}[w]$, and the covariant derivative of the partial functional $q \mapsto D(q_1 \| q)$ is:

$$D_{2,w}\, D(q_1 \| q) = \mathbb{E}_q[w] - \mathbb{E}_{q_1}[w] = \mathbb{E}_q\!\left[\left(1 - \frac{q_1}{q}\right) w\right], \qquad \nabla_2\, D(q_1 \| q) = 1 - \frac{q_1}{q}$$

The second mixed derivative of $E_p$ is:

$$d_1\, d_2\, E_p(u_1, u_2)\,(v_1, v_2) = -d^2 K_p(u_1)\,(v_1, v_2) = -\mathrm{Cov}_{q_1}(v_1, v_2)$$

Equivalently, we consider the mapping, $q_1 \mapsto D_{2,w}\, D(q_1 \| q)$, in the chart $u_1 \mapsto \mathbb{E}_q[w] - \mathbb{E}_{q_1}[w]$, to obtain:

$$\left. D_{1,w_1}\, D_{2,w_2}\, D(q_1 \| q_2) \right|_{q_1 = q_2 = q} = -\mathbb{E}_q[w_1\, w_2]$$

4.3. Boltzmann-Gibbs Entropy

While our discussion of the Kullback-Leibler divergence in the previous Section 4.2 does not require any special assumption other than the restriction of its domain to a maximal exponential model, the present discussion of the Boltzmann-Gibbs entropy requires a further restriction. If $p, q$ belong to the same maximal exponential model, $p \smile q$, then, from $q = e^{u - K_p(u)} \cdot p$ with $u \in B_p$, we obtain $\ln q - \ln p \in L^\Phi(p)$, so that $\ln q \in L^\Phi(p)$, if and only if $\ln p \in L^\Phi(p)$.
We study the Boltzmann-Gibbs entropy $E(q) = \mathbb{E}_q[\ln q]$ on a maximal exponential model $q \in \mathcal{E}$, such that, for at least one, and, hence, for all, $p \in \mathcal{E}$, it holds that $\ln p \in L^\Phi(p)$, i.e., $\int \left(p^{1+\alpha} + p^{1-\alpha}\right) d\mu < +\infty$ for some $\alpha > 0$. This is, for example, the case when the reference measure is finite and $p$ is constant. Another notable example is the Gaussian case, i.e., the sample space is $\mathbb{R}^n$ endowed with the Lebesgue measure and $p(x) \propto \exp\left(-|x|^2/2\right)$. In fact, $\int \cosh\left(\alpha |x|^2\right) \exp\left(-|x|^2/2\right) dx < +\infty$ for $0 < \alpha < 1/2$.
Under our assumption, the Boltzmann-Gibbs entropy is a smooth function. As:

$$\ln q = u - K_p(u) + \ln p = u - K_p(u) + (\ln p - E(p)) + E(p) \in L^\Phi(p)$$

the representation in the chart centered at $p$ is:

$$E_p(u) = \mathbb{E}_{e_p(u)}\!\left[u - K_p(u) + \ln p\right] = d K_p(u)\left(u + (\ln p - E(p))\right) - K_p(u) + E(p)$$

Hence, it is a $C^\infty$ real function. The derivative in the direction $v$ equals:

$$d E_p(u)\,v = d^2 K_p(u)\left(u + (\ln p - E(p)),\ v\right) = \mathrm{Cov}_q\!\left(u + \ln p,\ v\right)$$

in particular:

$$d E_p(0)\,v = \mathbb{E}_p\!\left[(\ln p - E(p))\,v\right] = \left\langle \ln p - E(p),\ v \right\rangle_p$$
The value of the covariant derivative D G E at q and G ( q ) = w is:
d E p ( u ) e U q p w = Cov q u + ln p , w = error q ( ( ln q + K p ( u ) ) w = error q ( ln q - E ( q ) ) w
The gradient \nabla E(q) \in (B_q)^*, defined by D_G E(q) = \langle \nabla E(q), G(q) \rangle_q, is identified with a random variable in {}^*B_q, and:
\[
\begin{aligned}
F(q) &= \ln q - E(q) \\
&= u - K_p(u) + \ln p - \mathbb{E}_q\left[u - K_p(u) + \ln p\right] \\
&= (u + \ln p - E(p)) - d K_p(u)(u + \ln p - E(p)) \\
&= {}^e\mathrm{U}_p^q (u + \ln p - E(p)) \in B_q
\end{aligned}
\]
is a vector field in the tangent bundle T\mathcal{E}, hence a vector field in the Hilbert bundle H\mathcal{E} and in the pretangent bundle {}^*T\mathcal{E}.
The equation \nabla E(q) = 0 implies \ln q = E(q), hence that q is constant. The Boltzmann-Gibbs entropy is increasing along the vector field G \in T\mathcal{E} if \mathbb{E}_q[(\ln q - E(q)) G(q)] = \operatorname{Cov}_q(\ln q, G(q)) > 0. The exponential family tangent at p to \nabla E(p) is p(t) \propto e^{t \ln p} \cdot p = p^{1+t}. The gradient flow equation is \delta q(t) = \nabla E(q(t)), that is:
\[
\frac{d}{dt} \ln q(t) = \ln q(t) - E(q(t))
\]
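On a finite sample space the gradient flow equation can be integrated numerically. The sketch below (an illustrative toy setup, not part of the original treatment) takes Euler steps in ln q and checks that E(q_t) = E_{q_t}[ln q_t] is nondecreasing along the flow, consistent with dE/dt = Var_q(ln q) >= 0.

```python
import numpy as np

rng = np.random.default_rng(0)

def E(q):
    # Boltzmann-Gibbs functional E(q) = E_q[ln q] on a finite state space
    return float(np.sum(q * np.log(q)))

def flow_step(q, dt=1e-3):
    # Euler step for d/dt ln q = ln q - E(q); renormalize to absorb
    # the discretization error (the continuous flow preserves the total mass)
    lq = np.log(q) + dt * (np.log(q) - E(q))
    q = np.exp(lq)
    return q / q.sum()

q = rng.random(10) + 0.1
q /= q.sum()

values = [E(q)]
for _ in range(2000):
    q = flow_step(q)
    values.append(E(q))

assert values[-1] > values[0]   # E increases along the flow
assert all(b >= a - 1e-9 for a, b in zip(values, values[1:]))
```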
In the pretangent bundle, the action of the dual exponential transport ({}^e\mathrm{U}_q^p)^* is identified with {}^m\mathrm{U}_q^p. It follows that the representation of the gradient in the chart centered at p is:
\[
\begin{aligned}
F_p(u) &= e^{u - K_p(u)} \left[ (u + \ln p - E(p)) - d K_p(u)(u + \ln p - E(p)) \right] \\
&= {}^m\mathrm{U}_{e_p(u)}^{\,p}\, {}^e\mathrm{U}_p^{\,e_p(u)} (u + \ln p - E(p))
\end{aligned}
\]
Let us assume that u \mapsto F_p(u) is (strongly) differentiable, and let us compute the derivative by the product rule. Locally, u \mapsto F_p(u) is the product of the mapping u \mapsto e^{u - K_p(u)}, which is analytic with values in L^a(p), a > 1, because of Proposition 1, and of a second factor, which is an analytic function with values in L^{\Phi}(p) \subset \bigcap_{a > 1} L^a(p). We can therefore compute the differential in the direction v \in B_p as the product of two functions in the Fréchet space \bigcap_{a > 1} L^a(p):
\[
\begin{aligned}
d F_p(u) v &= e^{u - K_p(u)} \Big[ (v - d K_p(u) v)\big((u + \ln p - E(p)) - d K_p(u)(u + \ln p - E(p))\big) \\
&\qquad\qquad + v - d^2 K_p(u)(u + \ln p - E(p), v) - d K_p(u) v \Big] \\
&= \frac{q}{p} \left[ (v - \mathbb{E}_q[v])(\ln q - E(q)) + v - \mathbb{E}_q[v] - \operatorname{Cov}_q(\ln q, v) \right]
\end{aligned}
\]
in particular, for u = 0:
\[
d F_p(0) v = (\ln p - E(p) + 1)\, v - \mathbb{E}_p\left[(\ln p)\, v\right] = (\nabla E(p) + 1)\, v - \mathbb{E}_p\left[\nabla E(p)\, v\right]
\]
The covariant derivative of the gradient \nabla E of the Boltzmann-Gibbs entropy in the pretangent bundle {}^*T\mathcal{E} is:
\[
D_G(\nabla E)(p) = (\ln p - E(p) + 1)\, G(p) - \mathbb{E}_p\left[(\ln p)\, G(p)\right] = (\nabla E(p) + 1)\, G(p) - \mathbb{E}_p\left[\nabla E(p)\, G(p)\right], \qquad p \in \mathcal{E}
\]
The existence of the covariant derivative implies (\ln p)\, G(p) \in L^{\Phi_*}(p), p \in \mathcal{E}. We do not discuss the existence problem here.
The computation of the covariant derivative of the same gradient in the tangent bundle T\mathcal{E} would give:
\[
\bar F_p(u) = {}^e\mathrm{U}_q^p (\ln q - E(q)) = \ln q - \mathbb{E}_p[\ln q] = u + \ln p - E(p), \qquad d \bar F_p(u) v = v
\]
but we cannot suggest any use for this computation.

4.4. Boltzmann Equation

Orlicz spaces as a setting for the Boltzmann equation have recently been discussed in [34], while the use of exponential manifolds was suggested in [14] (Example 11). Here, we further work out this framework for a space-homogeneous Boltzmann operator with angular collision kernel B(z, x) = |x^\top z|; see the presentation in [20]. In order to avoid a clash with the notation used in other parts of this paper, we use v and w to denote velocities in \mathbb{R}^3, in place of the more common couple v and v_*, and the velocities after collision are denoted by v_x and w_x, instead of v' and v'_*, x \in S^2 being a unit vector.
Let v, w \in \mathbb{R}^3 be the velocities of two particles, and \bar v, \bar w be the velocities after an elastic collision, i.e.,
\[
v + w = \bar v + \bar w, \qquad |v|^2 + |w|^2 = |\bar v|^2 + |\bar w|^2
\]
Using Equation (86), we derive from the expansion of |v + w|^2 = |\bar v + \bar w|^2 that v \cdot w = \bar v \cdot \bar w. The four vectors v, w, \bar v, \bar w all lie on a circle with center z = (v + w)/2 = (\bar v + \bar w)/2. In fact, the four vectors and z lie in the same plane, because v - z = -(w - z) and \bar v - z = -(\bar w - z); moreover, |v - z|^2 = |\bar v - z|^2. As v, w, \bar v, \bar w form a rectangle, we can denote by x the common unit vector of the parallel sides \bar w - w and v - \bar v, and write |\bar w - w| = |v - \bar v| as the length of the orthogonal projection of v - w on x. Given the unit vector x \in S^2 = \{x \in \mathbb{R}^3 : x^\top x = 1\}, the collision transformation (v, w) \mapsto (\bar v, \bar w) = (v_x, w_x) is linear and represented by a matrix in \mathbb{R}^{(3+3) \times (3+3)}:
\[
A_x = \begin{bmatrix} I - \Pi_x & \Pi_x \\ \Pi_x & I - \Pi_x \end{bmatrix}, \qquad
\begin{aligned}
v_x &= v - x x^\top (v - w) = (I - x x^\top) v + x x^\top w \\
w_x &= w + x x^\top (v - w) = x x^\top v + (I - x x^\top) w
\end{aligned}
\]
where \Pi_x = x x^\top and {}^\top denotes transposition.
Given any x \in S^2, we have A_x = A_{-x}. If v, w, v_x, w_x are as in Equation (87), then the elastic collision invariants of Equation (86) hold: v + w = v_x + w_x and |v|^2 + |w|^2 = |v_x|^2 + |w_x|^2. The components in the direction x are exchanged, x x^\top v_x = x x^\top w and x x^\top w_x = x x^\top v, while the orthogonal components are conserved.
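The properties of the collision transformation can be verified directly on random data. The following sketch (an illustrative numerical check, not part of the original text) tests conservation of momentum and energy, the exchange of the x-components, and the fact that the 6x6 matrix A_x is orthogonal, so that |det A_x| = 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random pre-collision velocities and a random unit vector x in S^2
v, w = rng.normal(size=3), rng.normal(size=3)
x = rng.normal(size=3)
x /= np.linalg.norm(x)

# Collision map: exchange the components along x, keep the orthogonal parts
d = x * np.dot(x, v - w)
vx, wx = v - d, w + d

assert np.allclose(v + w, vx + wx)                              # momentum
assert np.isclose(v @ v + w @ w, vx @ vx + wx @ wx)             # energy
assert np.isclose(x @ vx, x @ w) and np.isclose(x @ wx, x @ v)  # exchange

# The block matrix A_x with Pi_x = x x^T represents the collision map
P = np.outer(x, x)
A = np.block([[np.eye(3) - P, P], [P, np.eye(3) - P]])
assert np.allclose(A @ np.concatenate([v, w]), np.concatenate([vx, wx]))
assert np.allclose(A @ A.T, np.eye(6))                # A_x A_x^T = I_6
assert np.isclose(abs(np.linalg.det(A)), 1.0)         # |det A_x| = 1
```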
Let \sigma be the uniform probability on S^2. For each positive function g : \mathbb{R}^3 \times \mathbb{R}^3 \to \mathbb{R}, the integral \int_{S^2} g(v_x, w_x)\, \sigma(dx) depends on the collision invariants only. In fact:
\[
v_x = \frac{v + w}{2} + \frac{|v - w|}{2}\, y, \qquad
w_x = \frac{v + w}{2} - \frac{|v - w|}{2}\, y
\]
where the unit vector y = \widehat{v_x - w_x} = \widehat{(I - 2 x x^\top)(v - w)}, and all the other terms depend on the collision invariants; in particular, |v - w|^2 = 2(|v|^2 + |w|^2) - |v + w|^2.
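The parametrization above can be checked numerically. The sketch below (an illustrative check on random data) verifies that v_x and w_x decompose as (v+w)/2 plus or minus (|v-w|/2) y, with y the unit vector of (I - 2 x x^T)(v - w), and the identity |v - w|^2 = 2(|v|^2 + |w|^2) - |v + w|^2.

```python
import numpy as np

rng = np.random.default_rng(0)
v, w = rng.normal(size=3), rng.normal(size=3)
x = rng.normal(size=3)
x /= np.linalg.norm(x)

# Post-collision velocities
d = x * np.dot(x, v - w)
vx, wx = v - d, w + d

# Unit vector y in the direction of v_x - w_x = (I - 2 x x^T)(v - w)
y = (v - w) - 2 * x * np.dot(x, v - w)
y /= np.linalg.norm(y)

m, r = v + w, np.linalg.norm(v - w)
assert np.allclose(vx, m / 2 + (r / 2) * y)
assert np.allclose(wx, m / 2 - (r / 2) * y)
# |v - w|^2 expressed through the collision invariants
assert np.isclose(r**2, 2 * (v @ v + w @ w) - np.dot(m, m))
```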
On the sample space (\mathbb{R}^3, dv), let f_0 be the standard normal density, viz. the Maxwell distribution of velocities. As A_x A_x^\top = I_6, the identity matrix on \mathbb{R}^6, in particular |\det A_x| = 1, we have:
\[
A_x(V, W) = (V_x, W_x) \sim (V, W)
\]
if (V, W) \sim \mathrm{N}(0_6, I_6). We can give the previous remarks a more probabilistic form, as follows.
Proposition 9. Let f_0 be the density of the standard normal \mathrm{N}(0_3, I_3).
  • If (V, W) \sim f_0 \otimes f_0, then \int_{S^2} g(V_x, W_x)\, \sigma(dx) is the conditional expectation of g(V, W), given V + W and |V|^2 + |W|^2.
  • Assume (V, W) \sim f, f \in \mathcal{E}(f_0 \otimes f_0); then \int_{S^2} f \circ A_x\, \sigma(dx) \in \mathcal{E}(f_0 \otimes f_0) and:
\[
\mathbb{E}\left[g(V, W) \,\middle|\, V + W,\ |V|^2 + |W|^2\right] = \frac{\int_{S^2} g(V_x, W_x)\, f(V_x, W_x)\, \sigma(dx)}{\int_{S^2} f(V_x, W_x)\, \sigma(dx)}
\]
Proof. 1. The random variable \int_{S^2} g(V_x, W_x)\, \sigma(dx) = \int_{S^2} g \circ A_x(V, W)\, \sigma(dx) is a function \tilde g(m_1(V, W), m_2(V, W)) with m_1(V, W) = V + W and m_2(V, W) = |V|^2 + |W|^2. For all h_1 : \mathbb{R}^3 \to \mathbb{R} and h_2 : \mathbb{R} \to \mathbb{R}:
\[
\mathbb{E}\left[\int_{S^2} g \circ A_x(V, W)\, \sigma(dx)\; h_1(m_1(V, W))\, h_2(m_2(V, W))\right] = \mathbb{E}\left[g(V, W)\, h_1(m_1(V, W))\, h_2(m_2(V, W))\right]
\]
because A_x(V, W) \sim (V, W) and m_1 \circ A_x = m_1, m_2 \circ A_x = m_2.
2. We use Proposition 3. If f \in \mathcal{E}(f_0 \otimes f_0), then:
\[
f = e^{u - K_0(u)} \cdot f_0 \otimes f_0, \qquad u \in S_{f_0 \otimes f_0}
\]
and there exists a neighborhood I of [0, 1] where the one-dimensional exponential family:
\[
f_t = e^{t u - K_0(t u)} \cdot f_0 \otimes f_0, \qquad t \in I
\]
exists. To show \mathbb{E}_{f_0 \otimes f_0}\left[(f / f_0 \otimes f_0)^t\right] < +\infty for t \in I, it is enough to consider the convex cases, t < 0 and t > 1. We have:
\[
\int_{S^2} f \circ A_x\, \sigma(dx) = \int_{S^2} e^{u \circ A_x - K_0(u)}\, \sigma(dx) \cdot f_0 \otimes f_0
\]
and, in the convex cases:
\[
\mathbb{E}_{f_0 \otimes f_0}\left[\left(\frac{\int_{S^2} f \circ A_x\, \sigma(dx)}{f_0 \otimes f_0}\right)^t\right] = \mathbb{E}_{f_0 \otimes f_0}\left[\left(\int_{S^2} e^{u \circ A_x - K_0(u)}\, \sigma(dx)\right)^t\right] \le \mathbb{E}_{f_0 \otimes f_0}\left[\int_{S^2} e^{t u \circ A_x - t K_0(u)}\, \sigma(dx)\right] = \mathbb{E}_{f_0 \otimes f_0}\left[e^{t u - t K_0(u)}\right] = e^{K_0(t u) - t K_0(u)}
\]
The last equation is Bayes' formula for conditional expectation. ☐
Definition 10. For each element of the maximal exponential model containing f_0, f \in \mathcal{E}(f_0), the Boltzmann operator is:
\[
Q(f)(v) = \int_{\mathbb{R}^3} \int_{S^2} \left( f(v - x x^\top (v - w))\, f(w + x x^\top (v - w)) - f(v)\, f(w) \right) |x^\top (v - w)|\, \sigma(dx)\, dw
\]
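As a quick sanity check on the definition (a standard fact of kinetic theory, not a statement from the text): at the Maxwellian f_0 the gain and loss terms of Q cancel pointwise, because f_0(v) f_0(w) depends on (v, w) only through the conserved quantity |v|^2 + |w|^2, so that Q(f_0) = 0. A minimal numerical sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard normal (Maxwell) density on R^3
f0 = lambda v: (2 * np.pi) ** (-1.5) * np.exp(-np.dot(v, v) / 2)

v, w = rng.normal(size=3), rng.normal(size=3)
x = rng.normal(size=3)
x /= np.linalg.norm(x)

# Post-collision velocities v_x, w_x
d = x * np.dot(x, v - w)

# Gain and loss terms of the integrand of Q coincide at the Maxwellian,
# since |v_x|^2 + |w_x|^2 = |v|^2 + |w|^2
assert np.isclose(f0(v - d) * f0(w + d), f0(v) * f0(w))
```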
In our definition, we have restricted the domain of the Boltzmann operator to a maximal exponential model containing the standard normal density, in order to fit into our framework and to be able to prove the smoothness of the operator. The maximal exponential model \mathcal{E}(f_0) contains all normal densities f \sim \mathrm{N}(\mu, \Sigma). It has other peculiar properties.
As f \in \mathcal{E}(f_0), f = e^{u - K_0(u)} \cdot f_0, where u belongs to the interior of the proper domain of K_0, u \in S_{f_0} \subset B_{f_0}. It follows from Proposition 3 that we have the equality and isomorphism of the Banach spaces L^{\Phi}(f) and L^{\Phi}(f_0). For the random variable V_a : v \mapsto |v|^a, it holds V_a \in L^{\Phi}(f_0) = L^{\Phi}(f) for all a \in [1, 2]. In fact:
\[
\mathbb{E}_{f_0}\left[\cosh(\alpha V_a)\right] = (2\pi)^{-3/2} \int_{\mathbb{R}^3} \cosh(\alpha |v|^a) \exp(-|v|^2/2)\, dv
\]
is finite for all \alpha if a \in [0, 2[ and for \alpha < 1/2 if a = 2. In particular, it follows that V_1(v) = |v| has finite moments with respect to f, \int |v|^n f(v)\, dv < +\infty, n = 1, 2, \dots
As x^\top (v - w) = -x^\top (v_x - w_x), the measure |x^\top (v - w)|\, dv\, dw is invariant under the transformation A_x, and the measure f(v_x) f(w_x)\, |x^\top (v - w)|\, dv\, dw is the image of f(v) f(w)\, |x^\top (v - w)|\, dv\, dw under A_x. Other properties are obtained in the proof of the following proposition.
Proposition 10. Let f_0(v) = (2\pi)^{-3/2} \exp(-|v|^2/2) and f \in \mathcal{E}(f_0); then Q(f)/f \in {}^*B_f. Hence, f \mapsto Q(f)/f is a vector field in the pretangent bundle {}^*T\mathcal{E}(f_0), called the Boltzmann field.
Proof. Let us consider first the second part of the Boltzmann operator:
\[
Q^-(f)(v) = \int_{\mathbb{R}^3} \int_{S^2} f(v)\, f(w)\, |x^\top (v - w)|\, \sigma(dx)\, dw = f(v) \int_{\mathbb{R}^3} f(w) \int_{S^2} |x^\top (v - w)|\, \sigma(dx)\, dw
\]
Note that, from inequality Equation (6), with b = \int_{S^2} |x_1|\, \sigma(dx):
\[
\Phi_*\left(\int_{S^2} |x^\top (v - w)|\, \sigma(dx)\right) = \Phi_*\left(b\, |v - w|\right) \le b^2\, \Phi_*(|v - w|)
\]
We prove Q^-(f)/f \in L^{\Phi_*}(f):
\[
\begin{aligned}
\mathbb{E}_f\left[\Phi_*\left(\frac{Q^-(f)}{f}\right)\right] &= \int_{\mathbb{R}^3} dv\, f(v)\, \Phi_*\left(\int_{\mathbb{R}^3} f(w) \int_{S^2} |x^\top (v - w)|\, \sigma(dx)\, dw\right) \\
&\le \int_{\mathbb{R}^3} dv\, f(v) \int_{\mathbb{R}^3} dw\, f(w)\, \Phi_*\left(\int_{S^2} |x^\top (v - w)|\, \sigma(dx)\right) \\
&= \int_{\mathbb{R}^3} dv\, f(v) \int_{\mathbb{R}^3} dw\, f(w)\, \Phi_*\left(b\, |v - w|\right) \\
&\le \frac{b^2}{2} \int_{\mathbb{R}^3} \int_{\mathbb{R}^3} dv\, dw\, f(v)\, f(w)\, |v - w|^2
\end{aligned}
\]
which is finite, as |v - w|^2 \le 2(|v|^2 + |w|^2).
We consider now the first part of the Boltzmann operator:
\[
Q^+(f)(v) = \int_{\mathbb{R}^3} \int_{S^2} f(v - x x^\top (v - w))\, f(w + x x^\top (v - w))\, |x^\top (v - w)|\, \sigma(dx)\, dw = \int_{\mathbb{R}^3} \int_{S^2} f(v_x)\, f(w_x)\, |x^\top (v - w)|\, \sigma(dx)\, dw
\]
We want to prove that Q^+(f)/f \in L^{\Phi_*}(f) or, equivalently, Q^+(f)/f_0 \in L^{\Phi_*}(f_0). As f \in \mathcal{E}(f_0), we can write f = e^{u - K_0(u)} \cdot f_0, where u \in B_{f_0}, so that:
\[
Q^+(f)(w) = f_0(w) \int_{S^2} \sigma(dx) \int_{\mathbb{R}^3} dv\, f_0(v)\, e^{u(v_x) + u(w_x) - 2 K_0(u)}\, |x^\top (v - w)|
\]
and:
\[
\begin{aligned}
\Phi_*\left(\frac{Q^+(f)(w)}{f_0(w)}\right) &\le \int_{S^2} \sigma(dx) \int_{\mathbb{R}^3} dv\, f_0(v)\, \Phi_*\left(e^{u(v_x) + u(w_x) - 2 K_0(u)}\, |x^\top (v - w)|\right) \\
&\le \int_{S^2} \sigma(dx) \int_{\mathbb{R}^3} dv\, f_0(v)\, L(|x^\top (v - w)|)\, \Phi_*\left(e^{u(v_x) + u(w_x) - 2 K_0(u)}\right)
\end{aligned}
\]
where L(a) = a \vee a^2. It follows:
\[
\Phi_*\left(\frac{Q^+(f)(w)}{f_0(w)}\right) \le \int_{S^2} \sigma(dx) \int_{\mathbb{R}^3} dv\, f_0(v)\, L(|x^\top (v - w)|) \left( (u(v_x) + u(w_x) - 2 K_0(u))\, e^{u(v_x) + u(w_x) - 2 K_0(u)} + 1 \right)
\]
and:
\[
\begin{aligned}
\mathbb{E}_{f_0}\left[\Phi_*\left(\frac{Q^+(f)}{f_0}\right)\right] &\le \int_{S^2} \sigma(dx) \int_{\mathbb{R}^3} \int_{\mathbb{R}^3} dv\, dw\, f_0(v)\, f_0(w)\, L(|x^\top (v - w)|) \left( (u(v) + u(w) - 2 K_0(u))\, e^{u(v) + u(w) - 2 K_0(u)} + 1 \right) \\
&\le \int_{S^2} \sigma(dx) \int_{\mathbb{R}^3} \int_{\mathbb{R}^3} dv\, dw\, f(v)\, f(w)\, L(|x^\top (v - w)|)\, (u(v) + u(w) - 2 K_0(u)) + \int_{S^2} \sigma(dx) \int_{\mathbb{R}^3} \int_{\mathbb{R}^3} dv\, dw\, f_0(v)\, f_0(w)\, L(|x^\top (v - w)|)
\end{aligned}
\]
where both terms are finite.
Finally, the integral of the Boltzmann operator is zero:
\[
\begin{aligned}
\int_{\mathbb{R}^3} Q(f)(v)\, dv &= \int_{S^2} \int_{\mathbb{R}^3} \int_{\mathbb{R}^3} \left( f(v_x)\, f(w_x) - f(v)\, f(w) \right) |x^\top (v - w)|\, dw\, dv\, \sigma(dx) \\
&= \int_{S^2} \int_{\mathbb{R}^3} \int_{\mathbb{R}^3} f(v_x)\, f(w_x)\, |x^\top (v_x - w_x)|\, dw_x\, dv_x\, \sigma(dx) - \int_{S^2} \int_{\mathbb{R}^3} \int_{\mathbb{R}^3} f(v)\, f(w)\, |x^\top (v - w)|\, dw\, dv\, \sigma(dx) = 0 \qquad ☐
\end{aligned}
\]
The smoothness of the Boltzmann field can be studied by carefully analyzing the structure of the operator as a superposition of:
(1) Product: \mathcal{E}(f_0) \ni f \mapsto f \otimes f \in \mathcal{E}(f_0 \otimes f_0);
(2) Interaction: \mathcal{E}(f_0 \otimes f_0) \ni f \otimes f \mapsto g = B \cdot (f \otimes f) \in \mathcal{E}(f_0 \otimes f_0);
(3) Conditioning: \mathcal{E}(f_0 \otimes f_0) \ni g \mapsto \int_{S^2} g \circ A_x\, \sigma(dx) \in \mathcal{E}(f_0 \otimes f_0);
(4) Marginalization.
The single operations of the chain are discussed in [7]. We do not carry out this analysis here, and we conclude the section by rephrasing, in our language, Maxwell's weak form [20] (I.2.3) of the Boltzmann operator.
Proposition 11. Let f \in \mathcal{E}(f_0) and g \in L^{\Phi}(f). Then, A_g, defined by:
\[
A_g(v, w) = \int_{S^2} \frac{1}{2} \left( g(v_x) + g(w_x) \right) \sigma(dx) - \frac{1}{2} \left( g(v) + g(w) \right)
\]
belongs to L^{\Phi}(f \otimes f), and:
\[
\langle g, Q(f)/f \rangle_f = \mathbb{E}_{f \otimes f}\left[A_g\right]
\]
In particular, if f = e^{u - K_0(u)} \cdot f_0:
\[
\langle u, Q(f)/f \rangle_f = \mathbb{E}_{f \otimes f}\left[A_{\ln(f/f_0)}\right]
\]

5. Conclusions and Discussion

We have shown that a careful consideration of the relevant functional analysis allows us to discuss some basic features of statistical models of interest in statistical physics in the framework of the nonparametric information geometry based on Orlicz spaces. In particular, we have defined the exponential statistical manifold and its vector bundles, namely, the tangent bundle, the pretangent bundle and the Hilbert bundle. Partial results are obtained on connections, a topic that many authors consider to be at the very core of the theory of statistical manifolds.
For example, the Boltzmann equation takes the form of an evolution equation for the Boltzmann field:
\[
\delta f_t = \frac{Q(f_t)}{f_t}, \qquad \delta f \in T\mathcal{E}(f_0), \qquad \frac{Q(f)}{f} \in {}^*T\mathcal{E}(f_0)
\]
and we can compute the covariant derivative of the Boltzmann-Gibbs entropy along the Boltzmann field, D_{Q(f)/f} E(f) = \langle Q(f)/f, \ln f - E(f) \rangle_f, with Proposition 11; cf. [20] (Ch. 3). Our treatment of the Boltzmann-Gibbs entropy and of the Boltzmann equation does not add any new results, but our aim is to transform a generic geometric intuition about the geometry of probability densities into a formal geometrical methodology.
A number of issues remain open: in particular, the proper topological setting of the second-order structures and the proper definition of a sub-manifold, an important topic that is not otherwise treated in this paper.
In the case of the pretangent bundle, we have been able to show that it is actually the tangent bundle of an extension of the exponential manifold, the mixture manifold: {}^*T\mathcal{P}_> \subset T\mathcal{P}_1. The construction of an extended manifold whose tangent space would extend the Hilbert bundle H\mathcal{P}_> has been the object of much research. In some sense, the answer is known, because the embedding p \mapsto \sqrt{p} maps the positive densities \mathcal{P}_> into the unit sphere S_\mu of L^2(\mu), but a proper definition of the charts is difficult in this setting.
It has been suggested to use functions called deformed exponentials to mimic the theory of exponential families; see the monograph [35] and also [12,14] (Section 5). An example of a deformed exponential is:
\[
\exp_d(u) = \left( \frac{1}{2} u + \sqrt{1 + \frac{1}{4} u^2} \right)^2
\]
which is a special case of the class introduced in [36,37]. See [38] for an example of application.
The function \exp_d maps \mathbb{R} onto \mathbb{R}_>, is increasing, convex, and:
\[
\Phi_d(u) = \frac{1}{2} \left( \exp_d(u) + \exp_d(-u) \right) - 1 = \frac{1}{2} u^2
\]
The Young conjugate is \Phi_{d,*} = \Phi_d, and the Orlicz space is L^{\Phi_d}(p) = L^2(p). A nonparametric exponential family around the positive density p was defined in [39] to be:
\[
q = \exp_d\left( u - K_p(u) + \ln_d p \right)
\]
where:
\[
\ln_d(v) = \exp_d^{-1}(v) = v^{1/2} - v^{-1/2}
\]
If we assume \mathbb{E}_{\bar p}[u] = 0, where \bar p is a suitable density associated with p, then:
\[
K_p(u) = \mathbb{E}_{\bar p}\left[\ln_d p - \ln_d q\right]
\]
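The stated properties of the deformed exponential exp_d(u) = (u/2 + sqrt(1 + u^2/4))^2, of its inverse ln_d(v) = v^(1/2) - v^(-1/2), and of Phi_d can be verified numerically; a minimal sketch:

```python
import numpy as np

def exp_d(u):
    # Deformed exponential: (u/2 + sqrt(1 + u^2/4))^2
    return (u / 2 + np.sqrt(1 + u**2 / 4)) ** 2

def ln_d(v):
    # Deformed logarithm: v^(1/2) - v^(-1/2)
    return np.sqrt(v) - 1 / np.sqrt(v)

u = np.linspace(-5, 5, 101)

assert np.allclose(ln_d(exp_d(u)), u)          # ln_d inverts exp_d
assert np.all(exp_d(u) > 0)                    # exp_d maps R into R_>
assert np.all(np.diff(exp_d(u)) > 0)           # increasing
assert np.all(np.diff(exp_d(u), 2) > -1e-12)   # convex (second differences)
# Phi_d(u) = (exp_d(u) + exp_d(-u))/2 - 1 = u^2/2
assert np.allclose((exp_d(u) + exp_d(-u)) / 2 - 1, u**2 / 2)
```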
An account of this research in progress will be published elsewhere.
There are other approaches to nonparametric information geometry that are not based on the notion of the exponential family. We refer in particular to [40].

Acknowledgments

This research was supported by the de Castro Statistics Initiative, Collegio Carlo Alberto, Moncalieri. I wish to thank the guest editor, Antonio Scarfone, for inviting me to present this contribution. My warmest thanks to Bertrand Lods and Lamberto Rondoni for helpful conversations on the Boltzmann-Gibbs entropy and the Boltzmann equation.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Amari, S.; Nagaoka, H. Methods of Information Geometry; Translated from the 1993 Japanese original by Daishi Harada; American Mathematical Society: Providence, RI, USA, 2000; p. x+206. [Google Scholar]
  2. Dawid, A.P. Discussion of a paper by Bradley Efron. Ann. Stat. 1975, 3, 1231–1234. [Google Scholar]
  3. Dawid, A.P. Some comments on a paper by Bradley Efron. Ann. Stat. 1975, 3, 1189–1242. [Google Scholar]
  4. Dawid, A.P. Further comments on “Some comments on a paper by Bradley Efron". Ann. Stat. 1977, 5. No. 6. [Google Scholar] [CrossRef]
  5. Pistone, G.; Sempi, C. An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Stat. 1995, 23, 1543–1561. [Google Scholar] [CrossRef]
  6. Gibilisco, P.; Pistone, G. Connections on non-parametric statistical manifolds by Orlicz space geometry. Infin. Dimens. Anal. Quantum Probab. Relat. Top. 1998, 1, 325–347. [Google Scholar] [CrossRef]
  7. Pistone, G.; Rogantin, M. The exponential statistical manifold: Mean parameters, orthogonality and space transformations. Bernoulli 1999, 5, 721–760. [Google Scholar] [CrossRef]
  8. Gibilisco, P.; Isola, T. Connections on statistical manifolds of density operators by geometry of noncommutative Lp-spaces. Infin. Dimens. Anal. Quantum Probab. Relat. Top. 1999, 2, 169–178. [Google Scholar] [CrossRef]
  9. Cena, A. Geometric Structures on the Non-Parametric Statistical Manifold. Ph.D. Thesis, Università di Milano, Milano, Italy, 2002. [Google Scholar]
  10. Cena, A.; Pistone, G. Exponential statistical manifold. Ann. Inst. Stat. Math. 2007, 59, 27–56. [Google Scholar] [CrossRef]
  11. Imparato, D. Exponential Models and Fisher Information–Geometry and Applications. Ph.D. Thesis, Politecnico di Torino, Torino, Italy, 2008. [Google Scholar]
  12. Pistone, G. κ-exponential models from the geometrical viewpoint. Eur. Phys. J. B Condens. Matter Phys. 2009, 71, 29–37. [Google Scholar] [CrossRef]
  13. Pistone, G. Algebraic Varieties vs. Differentiable Manifolds in Statistical Models. In Algebraic and Geometric Methods in Statistics; Gibilisco, P., Riccomagno, E., Rogantin, M., Wynn, H.P., Eds.; Cambridge University Press: London, UK, 2009; Chapter 21; pp. 339–363. [Google Scholar]
  14. Pistone, G. Nonparametric Information Geometry. In Proceedings of the First International Conference on Geometric Science of Information, GSI 2013, Paris, France, 28–30 August 2013; Nielsen, F., Barbaresco, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 5–36. [Google Scholar]
  15. Bourbaki, N. Variétés Differentielles et Analytiques. Fascicule de Résultats / Paragraphes 1 à 7 (in French); Number XXXIII in Éléments de mathématiques; Hermann: Paris, France, 1971. [Google Scholar]
  16. Abraham, R.; Marsden, J.E.; Ratiu, T. Manifolds, Tensor Analysis, and Applications, 2nd ed.; Applied Mathematical Sciences; Springer: New York, NY, USA, 1988; Volume 75, p. x+654. [Google Scholar]
  17. Lang, S. Differential and Riemannian Manifolds, 3rd ed.; Graduate Texts in Mathematics; Springer: New York, NY, USA, 1995; Volume 160, p. xiv+364. [Google Scholar]
  18. Musielak, J. Orlicz Spaces and Modular Spaces; Lecture Notes in Mathematics; Springer: Berlin, Germany, 1983; Volume 1034, p. iii+222. [Google Scholar]
  19. Shima, H. The Geometry of Hessian Structures; World Scientific Publishing Co. Pte. Ltd.: Hackensack, NJ, USA, 2007; p. xiv+246. [Google Scholar]
  20. Villani, C. A Review of Mathematical Topics in Collisional Kinetic Theory. In Handbook of Mathematical Fluid Dynamics; North-Holland: Amsterdam, The Netherlands, 2002; Volume I, pp. 71–305. [Google Scholar]
  21. Grasselli, M.R. Dual Connections in Nonparametric Classical Information Geometry. 2001. arXiv:math-ph/0104031v1. [Google Scholar] [CrossRef]
  22. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; Number 9 in IMS Lecture Notes; Monograph Series; Institute of Mathematical Statistics: Hayward, CA, USA, 1986; p. x+283. [Google Scholar]
  23. Grasselli, M.R. Dual connections in nonparametric classical information geometry. Ann. Inst. Stat. Math. 2010, 62, 873–896. [Google Scholar] [CrossRef]
  24. Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276. [Google Scholar] [CrossRef]
  25. Malagò, L.; Matteucci, M.; Dal Seno, B. An Information Geometry Perspective On Estimation of Distribution Algorithms: Boundary Analysis. In Proceedings of the 2008 GECCO Conference Companion on GENETIC and Evolutionary Computation, GECCO ’08, Atlanta, GA, USA, 12–16 July 2008; ACM: New York, NY, USA, 2008; pp. 2081–2088. [Google Scholar]
  26. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Relaxation as a Unifying Approach in 0/1 Programming. In NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra (DISCML), Whistler, BC, Canada, 2009. [Google Scholar]
  27. Malagò, L.; Matteucci, M.; Pistone, G. Towards the Geometry of Estimation of Distribution Algorithms Based on the Exponential Family. In Proceedings of the 11th Workshop on Foundations of Genetic Algorithms, FOGA ’11, Schwarzenberg, Austria, 5–8 January 2011; ACM: New York, NY, USA, 2011; pp. 230–242. [Google Scholar]
  28. Wierstra, D.; Schaul, T.; Glasmachers, T.; Sun, Y.; Schmidhuber, J. Natural evolution strategies. 2011. arXiv:1106.4487. [Google Scholar]
  29. Arnold, L.; Auger, A.; Hansen, N.; Ollivier, Y. Information-geometric optimization algorithms: A unifying picture via invariance principles. 2011. arXiv:1106.3708. [Google Scholar]
  30. Malagò, L. On the Geometry of Optimization Based on the Exponential Family Relaxation. Ph.D. Thesis, Politecnico di Milano, Milano, Italy, 2012. [Google Scholar]
  31. Malagò, L.; Matteucci, M.; Pistone, G. Natural Gradient, Fitness Modelling and Model Selection: A Unifying Perspective. In Proceedings of the 2013 IEEE Congress on Evolutionary Computation, IEEE CEC 2013, Cancún, México, 20–23 June 2013; Paper #1747.
  32. Appell, J.; Zabrejko, P.P. Nonlinear Superposition Operators; Cambridge Tracts in Mathematics; Cambridge University Press: Cambridge, UK, 1990; Volume 95, p. viii+311. [Google Scholar]
  33. Ambrosio, L.; Gigli, N.; Savaré, G. Gradient Flows in Metric Spaces and in the Space of Probability Measures, 2nd ed.; Lectures in Mathematics ETH Zürich; Birkhäuser Verlag: Basel, Switzerland, 2008; p. x+334. [Google Scholar]
  34. Majewski, W.A.; Labuschagne, L.E. On applications of Orlicz spaces to statistical physics. Ann. Henri Poincaré 2013. [Google Scholar] [CrossRef]
  35. Naudts, J. Generalised Thermostatistics; Springer: London, UK, 2011; p. x+201. [Google Scholar]
  36. Kaniadakis, G. Statistical mechanics in the context of special relativity. Phys. Rev. E 2002, 66, 056125. [Google Scholar] [CrossRef] [PubMed]
  37. Kaniadakis, G. Statistical mechanics in the context of special relativity. II. Phys. Rev. E 2005, 72, 036108. [Google Scholar] [CrossRef] [PubMed]
  38. Trivellato, B. Deformed exponentials and applications to finance. Entropy 2013, 15, 3471–3489. [Google Scholar] [CrossRef]
  39. Vigelis, R.F.; Cavalcante, C.C. On ϕ-families of probability distributions. J. Theor. Probab. 2013, 26, 870–884. [Google Scholar] [CrossRef]
  40. Ay, N.; Jost, J.; Lê, H.V.; Schwachhöfer, L. Information geometry and sufficient statistics. 2013. arXiv:1207.6736. [Google Scholar] [CrossRef]
