Article

Black-Box Optimization Using Geodesics in Statistical Manifolds †

by
Jérémy Bensadon
Laboratoire de Recherche en Informatique, Université Paris-Sud, 91400 Orsay, France
This paper is an extended version of our paper published in 34th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, 21–26 September 2014, Château Clos Lucé, Parc Leonardo Da Vinci, Amboise, France.
Submission received: 8 October 2014 / Accepted: 7 January 2015 / Published: 13 January 2015
(This article belongs to the Special Issue Information, Entropy and Their Geometric Structures)

Abstract:
Information geometric optimization (IGO) is a general framework for stochastic optimization problems aiming at limiting the influence of arbitrary parametrization choices: the initial problem is transformed into the optimization of a smooth function on a Riemannian manifold, defining a parametrization-invariant first order differential equation and, thus, yielding an approximately parametrization-invariant algorithm (up to second order in the step size). We define the geodesic IGO update, a fully parametrization-invariant algorithm using the Riemannian structure, and we compute it for the manifold of Gaussians, thanks to Noether’s theorem. However, in similar algorithms, such as CMA-ES (Covariance Matrix Adaptation - Evolution Strategy) and xNES (exponential Natural Evolution Strategy), the time steps for the mean and the covariance are decoupled. We suggest two ways of doing so: twisted geodesic IGO (GIGO) and blockwise GIGO. Finally, we show that while the xNES algorithm is not GIGO, it is an instance of blockwise GIGO applied to the mean and covariance matrix separately. Therefore, xNES has an almost parametrization-invariant description.

1. Introduction

Consider an objective function f: X → ℝ to be minimized. We suppose we have absolutely no knowledge about f: the only thing we can do is ask for its value at any point x ∈ X (black-box optimization), and the evaluation of f is a costly operation. We are going to study algorithms that can be described in the IGO framework (see [1]).
We consider the following optimization procedure:
We choose a family of probability distributions (Pθ)θ∈Θ on X (which will be given a Riemannian manifold structure, following [2]) and an initial probability distribution $P_{\theta^0}$. Now, we replace f by F: Θ → ℝ (for example, $F(\theta) = \mathbb{E}_{x\sim P_\theta}[f(x)]$), and we optimize F by gradient descent, corresponding to the gradient flow:
$$\frac{d\theta^t}{dt} = -\nabla_\theta\, \mathbb{E}_{x\sim P_\theta}[f(x)].$$
However, because of the gradient, this equation depends entirely on the parametrization we chose for Θ, which is disturbing: we do not want to have two different updates, because we chose different parameters to represent the objects with which we are working. Moreover, in the case of a function with several local minima, changing the parametrization can change the attained optimum (see [3], for example). That is why invariance is a design principle behind IGO. More precisely, we want invariance with respect to monotone transformations of f and invariance under reparametrization of Θ.
The IGO framework uses the geometry of the family Θ, which is given by the Fisher metric, to provide a differential equation on θ with the desired properties, but because of the discretization of time needed to obtain an explicit algorithm, we lose invariance under reparametrization of θ: two IGO algorithms applied to the same function to be optimized, but with different parametrizations, coincide only at first order in the step size. A possible solution to this problem is geodesic IGO (GIGO), introduced here (see also IGO-Maximum Likelihood in [1], for example): the initial direction of the update at each step of the algorithm remains the same as in IGO, but instead of moving straight for the chosen parametrization, we use the Riemannian manifold structure of our family of probability distributions (see [2]) by following its geodesics.
Finding the geodesics of a Riemannian manifold is not always easy, but Noether’s theorem will allow us to obtain quantities that are preserved along the geodesics, thus allowing, in the case of Gaussian distributions, one to obtain a first order differential equation satisfied by the geodesics, which makes their computation easier.
Although the geodesic IGO algorithm is not, strictly speaking, parametrization-invariant when no closed form for the geodesics is known, it is possible to compute them at arbitrary precision without increasing the number of objective function calls.
The first two sections are preliminaries: in Section 2, we recall the IGO algorithm, introduced in [1], and in Section 3, after a reminder about Riemannian geometry, we state Noether’s theorem, which will be our main tool to compute the GIGO update for Gaussian distributions.
In Section 4, we consider Gaussian distributions with a covariance matrix proportional to the identity matrix: this space is isometric to the hyperbolic space, and the geodesics of the latter are known.
In Section 5.1, we consider the general Gaussian case, and we use Noether’s theorem to obtain two different sets of equations to compute the GIGO update. The equations are known (see [4–6]), but the connection with Noether’s theorem has not been mentioned. We then give the explicit solution for these equations, from [5].
In Section 6, we briefly recall the xNES and CMA-ES updates, and we introduce a slight modification of the IGO algorithm to incorporate the direction-dependent learning rates used in CMA-ES and xNES. We then compare these different algorithms and prove that xNES is not GIGO in general, and we finally introduce a new family of algorithms extending GIGO and recovering xNES from abstract principles.
Finally, Section 7 presents numerical experiments, which suggest that when using GIGO with Gaussian distributions, the step size must be chosen carefully.

2. Definitions: IGO, GIGO

In this section, we recall what the IGO framework is and we define the geodesic IGO update. Consider again Equation (1):
$$\frac{d\theta^t}{dt} = -\nabla_\theta\, \mathbb{E}_{x\sim P_\theta}[f(x)].$$
As we saw in the Introduction:
  • The gradient depends on the parametrization of our space of probability distributions (see Section 2.3 for an example).
  • The equation is not invariant under monotone transformations of f. For example, the optimization for 10f moves ten times faster than the optimization for f.
In this section, we recall how IGO deals with this (see [1] for a better presentation).

2.1. Invariance under Reparametrization of θ: Fisher Metric

In order to achieve invariance under reparametrization of θ, it is possible to turn our family of probability distributions into a Riemannian manifold (this is the main topic of information geometry; see [2]), which allows us to use a canonical, parametrization-invariant gradient (called the natural gradient).
Definition 1. Let P, Q be two probability distributions on X. The Kullback–Leibler divergence of Q from P is defined by:
$$\mathrm{KL}(Q\,\|\,P) = \int_X \ln\!\left(\frac{Q(x)}{P(x)}\right)dQ(x).$$
By definition, it does not depend on the parametrization. It is not symmetrical, but if, for all x, the map θ ↦ Pθ(x) is C², then a second-order expansion yields:
$$\mathrm{KL}(P_{\theta+d\theta}\,\|\,P_\theta) = \frac{1}{2}\sum_{i,j} I_{ij}(\theta)\,d\theta_i\,d\theta_j + o(d\theta^2),$$
where:
$$I_{ij}(\theta) = \int_X \frac{\partial \ln P_\theta(x)}{\partial\theta_i}\,\frac{\partial \ln P_\theta(x)}{\partial\theta_j}\,dP_\theta(x) = -\int_X \frac{\partial^2 \ln P_\theta(x)}{\partial\theta_i\,\partial\theta_j}\,dP_\theta(x).$$
This is enough to endow the family (Pθ)θ∈Θ with a Riemannian manifold structure: a Riemannian manifold M is a differentiable manifold, which can be seen as pieces of ℝn glued together, with a metric. The metric at x is a symmetric positive-definite quadratic form on the tangent space of M at x: it indicates how expensive it is to move in a given direction on the manifold. We will think of the updates of the algorithms that we will be studying as paths on M.
The matrix I(θ) is called the “Fisher information matrix”, and the metric it defines is called the “Fisher metric”.
Given a metric, it is possible to define a gradient attached to this metric; the key property of the gradient is that for any smooth function f:
$$f(x+h) = f(x) + \sum_i h_i\,\frac{\partial f}{\partial x_i} + o(h) = f(x) + \langle h, \nabla f(x)\rangle + o(h),$$
where $\langle x, y\rangle = x^T I y$ is the dot product in the metric I. Therefore, in order to keep the property of Equation (5), we must have $\nabla f = I^{-1}\frac{\partial f}{\partial x}$.
We have therefore the following gradient (called the “natural gradient”; see [2]):
$$\tilde\nabla_\theta = I^{-1}(\theta)\,\frac{\partial}{\partial\theta},$$
and since the Kullback–Leibler divergence does not depend on the parametrization, neither does the natural gradient.
Later in this paper, we will study families of Gaussian distributions. The following proposition gives the Fisher metric for these families.
Proposition 1. Let (Pθ)θ∈Θ be a family of normal probability distributions: P θ = N ( μ ( θ ) , Σ ( θ ) ) . If μ and Σ are C1, the Fisher metric is given by:
$$I_{i,j}(\theta) = \frac{\partial\mu^T}{\partial\theta_i}\,\Sigma^{-1}\,\frac{\partial\mu}{\partial\theta_j} + \frac{1}{2}\,\mathrm{tr}\!\left(\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_i}\,\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_j}\right).$$
Proof. This is a non-trivial calculation. See [7] or [8] for more details.
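To make the link between Equations (4) and (7) concrete, here is a small numerical sanity check in dimension one (our own sketch; the function and variable names are ours): the second-order behavior of the Kullback–Leibler divergence recovers the diagonal Fisher matrix diag(1/σ², 2/σ²) given by Proposition 1 for the parametrization (μ, σ).

```python
import numpy as np

def kl_gauss_1d(m1, s1, m0, s0):
    """KL( N(m1, s1^2) || N(m0, s0^2) ) in closed form."""
    return np.log(s0 / s1) + (s1 ** 2 + (m1 - m0) ** 2) / (2 * s0 ** 2) - 0.5

mu, sigma, eps = 0.3, 1.7, 1e-4
# KL(P_{theta+dtheta} || P_theta) ~ (1/2) I_ii dtheta_i^2,
# so 2 KL / eps^2 approximates the diagonal Fisher entries.
I_mumu = 2 * kl_gauss_1d(mu + eps, sigma, mu, sigma) / eps ** 2    # ~ 1 / sigma^2
I_sigsig = 2 * kl_gauss_1d(mu, sigma + eps, mu, sigma) / eps ** 2  # ~ 2 / sigma^2
print(I_mumu, 1 / sigma ** 2)
print(I_sigsig, 2 / sigma ** 2)
```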
As we will often be working with Gaussian distributions, we introduce the following notation:
Notation 1. G_d is the manifold of Gaussian distributions in dimension d, equipped with the Fisher metric. G̃_d is the manifold of Gaussian distributions in dimension d with covariance matrix proportional to the identity in the canonical basis of ℝ^d, equipped with the Fisher metric.

2.2. IGO Flow, IGO Algorithm

In IGO [1], invariance with respect to monotone transformations is achieved by replacing f by the following transform; we set:
$$q(x) = \Pr_{x'\sim P_\theta}\!\left(f(x') \leq f(x)\right),$$
a non-increasing function w: [0, 1] → ℝ is chosen (the selection scheme), and finally, $W_\theta^f(x) = w(q(x))$ (this definition has to be slightly changed if the probability of a tie is not zero; see [1] for more details). By performing a gradient descent on $\mathbb{E}_{x\sim P_\theta}[W_{\theta^t}^f(x)]$, we obtain the “IGO flow”:
$$\frac{d\theta^t}{dt} = \tilde\nabla_\theta \int_X W_{\theta^t}^f(x)\,P_\theta(dx) = \int_X W_{\theta^t}^f(x)\,\tilde\nabla_\theta \ln P_\theta(x)\,P_{\theta^t}(dx).$$
Notice that the function we are optimizing is $\mathbb{E}_{x\sim P_\theta}[W_{\theta^t}^f(x)]$ and not $\mathbb{E}_{x\sim P_\theta}[W_{\theta}^f(x)]$ (the second function is constant and always equal to $\int_0^1 w$). In particular, the function for which we are performing the gradient descent changes at each step, although its optimum (a Dirac at the minimum of f) does not: the IGO flow is not a gradient flow; it is simply a vector flow given by the gradient of interrelated functions.
For practical implementation, the integral in (9) has to be approximated. For the integral itself, the Monte-Carlo method is used; N values (x1, …, xN) are sampled from the distribution $P_{\theta^t}$, and the integral becomes:
$$\frac{1}{N}\sum_{i=1}^N W_{\theta^t}^f(x_i)\,\tilde\nabla_\theta \ln P_\theta(x_i),$$
and we approximate $\frac{1}{N}W_{\theta}^f(x_i) = \frac{1}{N}w(q(x_i))$ by $\hat w_i = \frac{1}{N}\,w\!\left(\frac{\mathrm{rk}(x_i)+1/2}{N}\right)$, where rk(xi) = |{j, f(xj) < f(xi)}|: it can be proven (see [1]) that $\lim_{N\to\infty} N\hat w_i = W_{\theta^t}^f(x_i)$ (here again, we are assuming that there are no ties).
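In practice, the ŵᵢ above are computed from the ranks of the sampled points. A minimal sketch (our own code and names, assuming no ties and a selection scheme w given as a function on [0, 1]):

```python
import numpy as np

def igo_weights(fitnesses, w):
    """Compute the weights hat{w}_i = (1/N) w((rk(x_i) + 1/2) / N) from the fitness values."""
    N = len(fitnesses)
    rk = np.argsort(np.argsort(fitnesses))   # rk(x_i) = |{j, f(x_j) < f(x_i)}|, assuming no ties
    return np.array([w((rk[i] + 0.5) / N) / N for i in range(N)])

# Example with truncation selection w(q) = 1_{q <= 1/5}:
print(igo_weights([3.0, 1.0, 2.0, 5.0, 4.0], lambda q: float(q <= 0.2)))
```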
We now have an algorithm that can be used in practice if the Fisher information matrix is known.
Definition 1. The IGO update associated with parametrization θ, sample size N, step size δt and selection scheme w is given by the following update rule:
$$\theta^{t+\delta t} = \theta^t + \delta t\; I^{-1}(\theta^t)\sum_{i=1}^N \hat w_i\,\frac{\partial \ln P_\theta(x_i)}{\partial\theta}.$$
We call IGO speed the vector $I^{-1}(\theta^t)\sum_{i=1}^N \hat w_i\,\frac{\partial \ln P_\theta(x_i)}{\partial\theta}$.
Notice that one could start directly with the w ^ i rather than w, as we will do later.
Replacing f by its expected value under a probability distribution Pθ and optimizing over θ using the natural gradient have already been discussed. For example, in the case of a function f defined on {0, 1}^n, IGO with the Bernoulli distributions yields the algorithm PBIL [9]. Another similar approach (stochastic relaxation) is given in [10]. For a continuous function, as we will see later, the IGO framework recovers several known rank-based natural gradient algorithms, such as pure rank-μ CMA-ES [11], xNES or SNES (Separable Natural Evolution Strategies) [12]. See [13] or [14] for other, not necessarily gradient-based, optimization algorithms on manifolds.

2.3. Geodesic IGO

Although the IGO flow associated with a family of probability distributions is intrinsic (it only depends on the family itself, not on the parametrization we choose for it), the IGO update is not. However, the difference between two steps of IGO that differ only by the parametrization is only O(δt²), whereas the difference between two vanilla gradient descents with different parametrizations is O(δt).
Intuitively, the reason for this difference is that two IGO algorithms start at the same point and follow “straight lines” with the same initial speed, but the definition of “straight lines” changes with the parametrization.
For instance, in the case of Gaussian distributions, let us consider two different IGO updates with Gaussian distributions in dimension one, the first one with parametrization (μ, σ) and the second one with parametrization (μ, c := σ²). We suppose that the IGO speed for the first algorithm is $(\dot\mu, \dot\sigma)$. The corresponding IGO speed in the second parametrization is given by the identity $\dot c = 2\sigma\dot\sigma$. Therefore, the first algorithm gives the standard deviation $\sigma_{\mathrm{new},1} = \sigma_{\mathrm{old}} + \delta t\,\dot\sigma$ and the variance $c_{\mathrm{new},1} = (\sigma_{\mathrm{new},1})^2 = c_{\mathrm{old}} + 2\delta t\,\sigma_{\mathrm{old}}\dot\sigma + \delta t^2\dot\sigma^2 = c_{\mathrm{new},2} + \delta t^2\dot\sigma^2$, whereas the second algorithm gives $c_{\mathrm{new},2} = c_{\mathrm{old}} + \delta t\,\dot c = c_{\mathrm{old}} + 2\delta t\,\sigma_{\mathrm{old}}\dot\sigma$.
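As a quick numerical illustration of this O(δt²) discrepancy (a toy computation with arbitrary values, not taken from the text):

```python
sigma_old, sigma_dot, dt = 2.0, 1.0, 0.1
c_new_1 = (sigma_old + dt * sigma_dot) ** 2                 # update in the (mu, sigma) parametrization
c_new_2 = sigma_old ** 2 + dt * 2 * sigma_old * sigma_dot   # update in the (mu, c = sigma^2) parametrization
print(c_new_1 - c_new_2)                                    # dt^2 * sigma_dot^2 = 0.01
```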
The geodesics of a Riemannian manifold are the generalization of the notion of a straight line: they are curves that locally minimize length. In particular, given two points a and b on the Riemannian manifold M, the shortest path from a to b is always a geodesic (the converse is not true, though). The notion will be explained precisely in Section 3, but let us define the geodesic IGO algorithm, which follows the geodesics of the manifold instead of following the straight lines for an arbitrary parametrization.
Definition 2 (GIGO). The geodesic IGO update (GIGO) associated with sample size N, step size δt and selection scheme w is given by the following update rule:
$$\theta^{t+\delta t} = \exp_{\theta^t}(Y\,\delta t)$$
where:
$$Y = I^{-1}(\theta^t)\sum_{i=1}^N \hat w_i\,\frac{\partial \ln P_\theta(x_i)}{\partial\theta},$$
is the IGO speed and exp θ t is the exponential of the Riemannian manifold Θ. Namely, exp θ t ( Y δ t ) is the endpoint of the geodesic of Θ starting at θt, with initial speed Y, after a time δt. By definition, this update does not depend on the parametrization θ.
Notice that while the GIGO update is compatible with the IGO flow (in the sense that when δt → 0 and N → ∞, a parameter θt updated according to the GIGO algorithm is a solution of Equation (9), the equation defining the IGO flow), it is not necessarily an IGO update. More precisely, the GIGO update is an IGO update if and only if the geodesics of Θ are straight lines for some parametrization (by Beltrami’s theorem, this is equivalent to Θ having constant curvature).
This is a particular case of a retraction [14]: a map from the tangent bundle of a manifold to the manifold itself satisfying a rigidity condition. Arguably, the Riemannian exponential is the most natural retraction, since it depends only on the Riemannian manifold itself and not on any decomposition. However, in general, the geodesics are difficult to compute.
In the next section, we will state Noether’s theorem, which will be our main tool to compute the GIGO update for Gaussian distributions.

3. Riemannian Geometry, Noether’s Theorem

3.1. Riemannian Geometry

The goal of this section is to state Noether’s theorem. See [15] for the proofs and [16] or [17] for a more detailed presentation. Noether’s theorem states that if a system has symmetries, then there are invariants attached to these symmetries. Firstly, we need some definitions.
Definition 3 (Motion in a Lagrangian system). Let M be a differentiable manifold, TM the set of tangent vectors on M (a tangent vector is identified by the point at which it is tangent and a vector in the tangent space) and ℒ: TM → ℝ, (q, v) ↦ ℒ(q, v) a differentiable function called the Lagrangian function (in general, it could depend on t). A “motion in the Lagrangian system (M, ℒ) from x to y” is a map γ: [t0, t1] → M, such that:
  • γ(t0) = x
  • γ(t1) = y
  • γ is a local extremum of the functional:
    $$\Phi(\gamma) = \int_{t_0}^{t_1} \mathcal{L}(\gamma(t), \dot\gamma(t))\,dt,$$
    among all curves c: [t0, t1] → M, such that c(t0) = x, and c(t1) = y.
For example, when (M, g) is a Riemannian manifold, the length of a curve γ between γ(t0) and γ(t1) is:
$$\int_{t_0}^{t_1} \sqrt{g_{\gamma(t)}(\dot\gamma(t), \dot\gamma(t))}\,dt.$$
The curves that follow the shortest path between two points x, y ∈ M are therefore the minima γ of the functional (15), such that γ(t0) = x and γ(t1) = y, and the corresponding Lagrangian function is $(q, v)\mapsto \sqrt{g_q(v, v)}$. However, any curve following the shortest trajectory will have minimum length. For example, if γ1: [a, b] → M is a curve of the shortest path, so is γ2: t ↦ γ1(t²): these two curves define the same trajectory in M, but they do not travel along this trajectory at the same speed. This leads us to the following definition:
Definition 4 (Geodesics). Let I be an interval of ℝ and (M, g) be a Riemannian manifold. A curve γ: I → M is called a geodesic if for all t0, t1 ∈ I, γ|[t0, t1] is a motion in the Lagrangian system (M, ℒ) from γ(t0) to γ(t1), where:
$$\mathcal{L}(\gamma) = \int_{t_0}^{t_1} g_{\gamma(t)}(\dot\gamma(t), \dot\gamma(t))\,dt.$$
It can be shown (see [16]) that geodesics are curves that locally minimize length, with constant velocity, in the sense that $\frac{d}{dt}\,g_{\gamma(t)}(\dot\gamma(t), \dot\gamma(t)) = 0$. In particular, given a starting point and a starting speed, the geodesic is unique. This motivates the definition of the exponential of a Riemannian manifold.
Definition 5. Let (M, g) be a Riemannian manifold. We call the exponential of M the application:
$$\exp:\ TM \to M,\qquad (x, v)\mapsto \exp_x(v),$$
such that for any xM, if γ is the geodesic of M satisfying γ(0) = x and γ′(0) = v, then expx(v) = γ(1).
In order to find an extremal of a functional, the most commonly-used result is called the “Euler–Lagrange equations” (see [15], for example); a motion γ in the Lagrangian system (M, ℒ) must satisfy:
$$\frac{\partial\mathcal{L}}{\partial x}(\gamma(t)) - \frac{d}{dt}\left(\frac{\partial\mathcal{L}}{\partial\dot x}(\dot\gamma(t))\right) = 0.$$
By applying this equation with the Lagrangian given by (16), it is possible to show that the geodesics of a Riemannian manifold follow the “geodesic equations”:
$$\ddot x_k + \Gamma^k_{ij}\,\dot x_i\,\dot x_j = 0,$$
where the
$$\Gamma^k_{ij} = \frac{1}{2}\,g^{lk}\left(\frac{\partial g_{jl}}{\partial q_i} + \frac{\partial g_{li}}{\partial q_j} - \frac{\partial g_{ij}}{\partial q_l}\right)$$
are called “Christoffel symbols” of the metric g. However, these coefficients are tedious (and sometimes difficult) to compute, and (18) is a second order differential equation. Noether’s theorem will give us a first order equation to compute the geodesics.

3.2. Noether’s Theorem

Definition 6. Let h: M → M be a diffeomorphism. We say that the Lagrangian system (M, ℒ) admits the symmetry h if for any (q, v) ∈ TM,
$$\mathcal{L}(h(q), dh(v)) = \mathcal{L}(q, v),$$
where dh is the differential of h.
If M is clear from the context, we will sometimes say that ℒ is invariant under h.
An example will be given in the proof of Theorem 3.
We can now state Noether’s theorem (see, for example, [15]).
Theorem 1 (Noether’s Theorem). If the Lagrangian system (M, ℒ) admits the one-parameter group of symmetries hˢ: M → M, s ∈ ℝ, then the following quantity remains constant during motions in the system (M, ℒ). Namely,
$$I(\gamma(t), \dot\gamma(t)) = \frac{\partial\mathcal{L}}{\partial v}\left(\left.\frac{d h^s(\gamma(t))}{ds}\right|_{s=0}\right)$$
does not depend on t if γ is a motion in (M, ℒ).
Now, we are going to apply this theorem to our problem: computing the geodesics of Riemannian manifolds of Gaussian distributions.

4. GIGO in G̃_d

If we force the covariance matrix to be either diagonal or proportional to the identity matrix, the geodesics have a simple expression that we give below. In the former case, the manifold we are considering is ( G 1 ) d, and in the latter case, it is G ˜ d.
The geodesics of ( G 1 ) d are given by:
Proposition 2. Let M be a Riemannian manifold; let d ∈ ℕ; let Φ be the Riemannian exponential of Md; and let φ be the Riemannian exponential of M. We have:
$$\Phi_{(x_1,\ldots,x_d)}\big((v_1,\ldots,v_d)\big) = \big(\phi_{x_1}(v_1),\ldots,\phi_{x_d}(v_d)\big)$$
In particular, knowing the geodesics of G 1 is enough to compute the geodesics of ( G 1 ) d.
This is true because a block of the product metric does not depend on variables of the other blocks. Consequently, a GIGO update with a diagonal covariance matrix with the sample (xi) is equivalent to d separate one-dimensional GIGO updates using the same samples. Moreover, G_1 = G̃_1, the geodesics of which are given below.
We will show that G ˜ d and the “hyperbolic space”, of which the geodesics are known, are isometric.

4.1. Preliminaries: Poincaré Half-Plane, Hyperbolic Space

In dimension two, the hyperbolic space is called the “hyperbolic plane” or the Poincaré half-plane. We recall its definition:
Definition 7 (Poincaré half-plane). We call the “Poincaré half-plane” the Riemannian manifold:
$$\mathbb{H} = \{(x, y)\in\mathbb{R}^2,\ y > 0\},$$
with the metric $ds^2 = \frac{dx^2 + dy^2}{y^2}$.
We also recall the expression of its geodesics (see, for example, [18]):
Proposition 3 (Geodesics of the Poincaré half-plane). The geodesics of the Poincaré half-plane are exactly the:
$$t \mapsto \big(\mathrm{Re}(z(t)),\ \mathrm{Im}(z(t))\big),$$
where:
$$z(t) = \frac{a\,i e^{vt} + b}{c\,i e^{vt} + d},$$
with ad − bc = 1 and v > 0.
The geodesics are half-circles perpendicular to the line y = 0 and vertical lines, as shown in Figure 1 below.
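For illustration, the geodesics of Proposition 3 are easy to evaluate numerically (a small sketch with our own function name; the coefficients a, b, c, d, v are assumed to satisfy ad − bc = 1 and v > 0):

```python
import numpy as np

def poincare_geodesic(a, b, c, d, v, t):
    """Point (x(t), y(t)) of the geodesic z(t) = (a i e^{vt} + b) / (c i e^{vt} + d)."""
    z = (a * 1j * np.exp(v * t) + b) / (c * 1j * np.exp(v * t) + d)
    return z.real, z.imag       # the imaginary part stays positive

# a = d = 1, b = c = 0 gives the vertical line x = 0, y = e^{vt}.
print(poincare_geodesic(1.0, 0.0, 0.0, 1.0, 0.5, 2.0))
```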
The generalization to the higher dimension is the following:
Definition 8 (Hyperbolic space). We call the “hyperbolic space of dimension n” the Riemannian manifold:
$$\mathbb{H}^n = \{(x_1,\ldots,x_{n-1}, y)\in\mathbb{R}^n,\ y > 0\},$$
with the metric $ds^2 = \frac{dx_1^2 + \cdots + dx_{n-1}^2 + dy^2}{y^2}$ (or equivalently, the metric given by the matrix $\mathrm{Diag}(\frac{1}{y^2})$).
The Lagrangian for the geodesics is invariant under all translations along the xi, so by Noether’s theorem, its geodesics stay in a plane containing the direction y and the initial speed. The induced metric on this plane is the metric of the Poincaré half-plane. The geodesics are therefore given by the following proposition:
Proposition 4 (Geodesics of the hyperbolic space). If γ: t ↦ (x1(t), …, xn−1(t), y(t)) = (x(t), y(t)) is a geodesic of ℍ^n, then there exist a, b, c, d ∈ ℝ, such that ad − bc = 1, and v > 0, such that $x(t) = x(0) + \frac{\dot x_0}{\|\dot x_0\|}\,\tilde x(t)$, $y(t) = \mathrm{Im}(\gamma_{\mathbb{C}}(t))$, with $\tilde x(t) = \mathrm{Re}(\gamma_{\mathbb{C}}(t))$ and:
$$\gamma_{\mathbb{C}}(t) := \frac{a\,i e^{vt} + b}{c\,i e^{vt} + d}.$$

4.2. Computing the GIGO Update in G ˜ d

If we want to implement the GIGO algorithm in G ˜ d, we need to compute the natural gradient in G ˜ d and to be able to compute the Riemannian exponential of G ˜ d.
Using Proposition 1, we can compute the metric of G̃_d in the parametrization (μ, σ) ↦ N(μ, σ²I). We find the diagonal matrix:
$$\begin{pmatrix} \frac{1}{\sigma^2} & & & \\ & \ddots & & \\ & & \frac{1}{\sigma^2} & \\ & & & \frac{2d}{\sigma^2} \end{pmatrix}.$$
Since this matrix is diagonal, it is easy to invert, and we immediately have the natural gradient and, consequently, the IGO speed.
Proposition 5. In G ˜ d, the IGO speed Y is given by:
$$Y_\mu = \sum_i \hat w_i\,(x_i - \mu),$$
$$Y_\sigma = \sum_i \hat w_i\left(\frac{(x_i-\mu)^T(x_i-\mu)}{2d\sigma} - \frac{\sigma}{2}\right).$$
Proof. We recall that the IGO speed is defined by $Y = I^{-1}(\theta^t)\sum_{i=1}^N \hat w_i\,\frac{\partial \ln P_\theta(x_i)}{\partial\theta}$. Since $P_{\mu,\sigma}(x) = (2\pi\sigma^2)^{-d/2}\exp\!\left(-\frac{(x-\mu)^T(x-\mu)}{2\sigma^2}\right)$, we have:
$$\frac{\partial \ln P_{\mu,\sigma}(x)}{\partial\mu} = \frac{x-\mu}{\sigma^2},\qquad \frac{\partial \ln P_{\mu,\sigma}(x)}{\partial\sigma} = -\frac{d}{\sigma} + \frac{(x-\mu)^T(x-\mu)}{\sigma^3}.$$
The result follows. □
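A direct transcription of Proposition 5 (a sketch with our own names; x is an N × d array of samples and w_hat contains the weights ŵᵢ):

```python
import numpy as np

def igo_speed_isotropic(x, mu, sigma, w_hat):
    """IGO speed (Y_mu, Y_sigma) of Proposition 5 for N(mu, sigma^2 I)."""
    d = x.shape[1]
    diff = x - mu
    Y_mu = (w_hat[:, None] * diff).sum(axis=0)
    Y_sigma = np.sum(w_hat * ((diff ** 2).sum(axis=1) / (2 * d * sigma) - sigma / 2))
    return Y_mu, Y_sigma
```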
The metric defined by Equation (25) is not exactly the metric of the hyperbolic space, but with the substitution $\mu \mapsto \frac{\mu}{\sqrt{2d}}$, the metric becomes $\frac{2d}{\sigma^2}\, I$, which is proportional to the metric of the hyperbolic space and, therefore, defines the same geodesics.
Theorem 2 (Geodesics of G̃_d). If $\gamma: t \mapsto N(\mu(t), \sigma(t)^2 I)$ is a geodesic of G̃_d, then there exist a, b, c, d ∈ ℝ, such that ad − bc = 1, and v > 0, such that $\mu(t) = \mu(0) + \sqrt{2d}\,\frac{\dot\mu_0}{\|\dot\mu_0\|}\,\tilde r(t)$, $\sigma(t) = \mathrm{Im}(\gamma_{\mathbb{C}}(t))$, with $\tilde r(t) = \mathrm{Re}(\gamma_{\mathbb{C}}(t))$ and:
$$\gamma_{\mathbb{C}}(t) := \frac{a\,i e^{vt} + b}{c\,i e^{vt} + d}.$$
Now, in order to implement the corresponding GIGO algorithm, we only need to be able to find the coefficients a, b, c, d, v corresponding to an initial position (μ0, σ0) and an initial speed ( μ ˙ 0 , σ ˙ 0 ). This is a tedious but easy computation, the result of which is given in Proposition 17.
The pseudocode of GIGO in G ˜ d is also given in the Appendix: it is obtained by concatenating Algorithms 1 and 7 (Proposition 17 and the pseudocode in the Appendix allow the metric to be slightly modified; see Section 6.2).

5. GIGO in G_d

5.1. Obtaining a First Order Differential Equation for the Geodesics of G d

In the case where both the covariance matrix and the mean can vary freely, the equations of the geodesics have been computed in [4] and [5]. However, these articles start with the equations of the geodesics obtained with the Christoffel symbols, then partially integrate them. These equations are in fact a consequence of Noether’s theorem and can be found directly.
Theorem 3. Let $\gamma: t \mapsto N(\mu_t, \Sigma_t)$ be a geodesic of G_d. Then, the following quantities do not depend on t:
$$J_\mu = \Sigma_t^{-1}\dot\mu_t,$$
$$J_\Sigma = \Sigma_t^{-1}\left(\dot\mu_t\mu_t^T + \dot\Sigma_t\right).$$
Proof. This is a direct application of Noether’s theorem, with suitable groups of diffeomorphisms. By Proposition 1, the Lagrangian associated with the geodesics of G d is:
$$\mathcal{L}(\mu, \Sigma, \dot\mu, \dot\Sigma) = \dot\mu^T\Sigma^{-1}\dot\mu + \frac{1}{2}\,\mathrm{tr}\!\left(\dot\Sigma\,\Sigma^{-1}\dot\Sigma\,\Sigma^{-1}\right).$$
Its derivative is:
$$\frac{\partial\mathcal{L}}{\partial\dot\theta} = \left[(h, H)\mapsto 2\dot\mu^T\Sigma^{-1}h + \mathrm{tr}\!\left(H\,\Sigma^{-1}\dot\Sigma\,\Sigma^{-1}\right)\right].$$
Let us show that this Lagrangian is invariant under affine changes of basis (thus illustrating Definition 6).
The general form of an affine change of basis is $\phi_{\mu_0,A}: (\mu, \Sigma) \mapsto (A\mu + \mu_0,\ A\Sigma A^T)$, with μ0 ∈ ℝ^d and A ∈ GLd(ℝ).
We have:
$$\mathcal{L}\!\left(\phi_{\mu_0,A}(\mu,\Sigma),\, d\phi_{\mu_0,A}(\dot\mu,\dot\Sigma)\right) = \dot{\overline{A\mu}}^{\,T}(A\Sigma A^T)^{-1}\dot{\overline{A\mu}} + \frac{1}{2}\,\mathrm{tr}\!\left(\dot{\overline{A\Sigma A^T}}\,(A\Sigma A^T)^{-1}\,\dot{\overline{A\Sigma A^T}}\,(A\Sigma A^T)^{-1}\right),$$
and since $\dot{\overline{A\mu}} = A\dot\mu$ and $\dot{\overline{A\Sigma A^T}} = A\dot\Sigma A^T$, we easily find that:
$$\mathcal{L}\!\left(\phi_{\mu_0,A}(\mu,\Sigma),\, d\phi_{\mu_0,A}(\dot\mu,\dot\Sigma)\right) = \mathcal{L}(\mu, \Sigma, \dot\mu, \dot\Sigma),$$
or in other words: ℒ is invariant under φμ0,A for any μ0 ∈ ℝ^d, A ∈ GLd(ℝ).
In order to use Noether’s theorem, we also need one-parameter groups of transformations. We choose the following:
  • Translations of the mean vector. For any i ∈ [1, d], let $h_i^s: (\mu, \Sigma) \mapsto (\mu + s e_i, \Sigma)$, where ei is the i-th basis vector. We have $\left.\frac{d h_i^s}{ds}\right|_{s=0} = (e_i, 0)$, so by Noether’s theorem,
    $$\frac{\partial\mathcal{L}}{\partial\dot\theta}(e_i, 0) = 2\dot\mu^T\Sigma^{-1}e_i = 2 e_i^T\Sigma^{-1}\dot\mu$$
    remains constant for all i. The fact that Jμ is an invariant immediately follows.
  • Linear base changes. For any i, j ∈ [1, d], let $h_{i,j}^s: (\mu, \Sigma) \mapsto \left(\exp(sE_{ij})\,\mu,\ \exp(sE_{ij})\,\Sigma\,\exp(sE_{ij})^T\right)$, where Eij is the matrix with a one at position (i, j) and zeros elsewhere. We have:
    $$\left.\frac{d h_{i,j}^s}{ds}\right|_{s=0} = \left(E_{ij}\,\mu,\ E_{ij}\Sigma + \Sigma E_{ji}\right).$$
Therefore, by Noether’s theorem, we then obtain the following invariants:
$$J_{ij} := \frac{\partial\mathcal{L}}{\partial\dot\theta}\left(E_{ij}\mu,\ E_{ij}\Sigma + \Sigma E_{ji}\right)$$
$$= 2\dot\mu^T\Sigma^{-1}E_{ij}\mu + \mathrm{tr}\!\left((E_{ij}\Sigma + \Sigma E_{ji})\,\Sigma^{-1}\dot\Sigma\,\Sigma^{-1}\right)$$
$$= 2\left(\Sigma^{-1}\dot\mu\right)^T E_{ij}\,\mu + \mathrm{tr}\!\left(E_{ij}\dot\Sigma\,\Sigma^{-1}\right) + \mathrm{tr}\!\left(E_{ji}\Sigma^{-1}\dot\Sigma\right)$$
$$= 2\left(J_\mu\mu^T\right)_{ij} + 2\left(\Sigma^{-1}\dot\Sigma\right)_{ij},$$
and the coefficients of JΣ in (30) are the (Jij/2).
This leads us to the first order equations satisfied by the geodesics, mentioned in [4–6].
Theorem 4 (GIGO-Σ). $t \mapsto N(\mu_t, \Sigma_t)$ is a geodesic of G_d if and only if μ: t ↦ μt and Σ: t ↦ Σt satisfy the equations:
$$\dot\mu_t = \Sigma_t J_\mu$$
$$\dot\Sigma_t = \Sigma_t\left(J_\Sigma - J_\mu\mu_t^T\right) = \Sigma_t J_\Sigma - \dot\mu_t\mu_t^T,$$
where:
$$J_\mu = \Sigma_0^{-1}\dot\mu_0,$$
and:
$$J_\Sigma = \Sigma_0^{-1}\left(\dot\mu_0\mu_0^T + \dot\Sigma_0\right).$$
Proof. This is an immediate consequence of Theorem 3. □
These equations can be solved analytically (see [5] and Section 5.2); in general, however, no closed form is available, and the geodesic equations have to be solved numerically, for example with the Euler method (the corresponding algorithm, which we call GIGO-Σ, is described in the Appendix). The goal of the remainder of the subsection is to show that having to use the Euler method is fine.
To avoid confusion, we will call the step size of the GIGO algorithm (δt in Proposition 2) “GIGO step size” and the step size of the Euler method (inside a step of the GIGO algorithm) “Euler step size”.
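The following sketch (our own code; the function name and the final symmetrization step are our choices, and it does not include the step-size reduction discussed below) integrates the equations of Theorem 4 with the Euler method; the invariants Jμ and JΣ of Theorem 3 can also serve as a sanity check on the numerical solution.

```python
import numpy as np

def gigo_sigma_geodesic_euler(mu0, Sigma0, mu_dot0, Sigma_dot0, t, n_steps):
    """Follow the geodesic of Theorem 4 from (mu0, Sigma0) with initial speed
    (mu_dot0, Sigma_dot0) up to time t, using n_steps Euler steps."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    J_mu = Sigma0_inv @ mu_dot0
    J_Sigma = Sigma0_inv @ (np.outer(mu_dot0, mu0) + Sigma_dot0)
    mu, Sigma = mu0.astype(float).copy(), Sigma0.astype(float).copy()
    h = t / n_steps                       # Euler step size
    for _ in range(n_steps):
        mu_dot = Sigma @ J_mu
        Sigma_dot = Sigma @ (J_Sigma - np.outer(J_mu, mu))
        mu = mu + h * mu_dot
        Sigma = Sigma + h * Sigma_dot
        Sigma = (Sigma + Sigma.T) / 2     # keep Sigma symmetric against numerical drift (our choice)
    return mu, Sigma
```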
Having to solve our equations numerically brings two problems:
The first one is a theoretical problem: the main reason to study GIGO is its invariance under reparametrization of θ, and we lose this invariance property when we use the Euler method. However, GIGO can get arbitrarily close to invariance by decreasing the Euler step size. In other words, the difference between two different IGO algorithms is O(δt2), and the difference between two different implementations of the GIGO algorithm is O(h2), where h is the Euler step size; it is easier to reduce the latter. Still, without a closed form for the geodesics of G d, the GIGO update is rather expensive to compute, but it can be argued that most of the computation time will still be the computation of the objective function f.
The second problem is purely numerical: we cannot guarantee that the covariance matrix remains positive-definite along the Euler method. Here, apart from finding a closed form for the geodesics, we have two solutions.
We can enforce this a posteriori: if the covariance matrix we find is not positive-definite after a GIGO step, we repeat the failed GIGO step with a reduced Euler step size (in our implementation, we divided it by four; see Algorithm 2 in the Appendix.).
The other solution is to obtain differential equations on a square root of the covariance matrix (any matrix A, such that Σ = AAT ).
Theorem 5 (GIGO-A). If μ: t ↦ μt and A: t ↦ At satisfy the equations:
$$\dot\mu_t = A_t A_t^T J_\mu,$$
$$\dot A_t = \frac{1}{2}\left(J_\Sigma - J_\mu\mu_t^T\right)^T A_t,$$
where:
$$J_\mu = (A_0^{-1})^T A_0^{-1}\,\dot\mu_0$$
and:
$$J_\Sigma = (A_0^{-1})^T A_0^{-1}\left(\dot\mu_0\mu_0^T + \dot A_0 A_0^T + A_0\dot A_0^T\right),$$
then $t \mapsto N(\mu_t, A_t A_t^T)$ is a geodesic of G_d.
Proof. This is a simple rewriting of Theorem 4: if we write Σ := AA^T, we find that Jμ and JΣ are the same as in Theorem 4, and we have:
$$\dot\mu = \Sigma J_\mu,$$
and:
$$\dot\Sigma = \dot A A^T + A\dot A^T = \frac{1}{2}\left(J_\Sigma - J_\mu\mu^T\right)^T A A^T + \frac{1}{2}\,A A^T\left(J_\Sigma - J_\mu\mu^T\right) = \frac{1}{2}\left[\Sigma\left(J_\Sigma - J_\mu\mu^T\right)\right]^T + \frac{1}{2}\,\Sigma\left(J_\Sigma - J_\mu\mu^T\right).$$
By Theorem 4, Σ(JΣ − Jμμ^T) is symmetric (since Σ̇ has to be symmetric). Therefore, we have $\dot\Sigma = \Sigma(J_\Sigma - J_\mu\mu^T)$, and the result follows. □
Notice that Theorem 5 gives an equivalence, whereas Theorem 4 does not. The reason is that the square root of a symmetric positive-definite matrix is not unique. Still, it is canonical; see the discussion in Section 6.1.2.
As for Theorem 4, we can solve Equations (41) and (42) numerically, and we obtain another algorithm (Algorithm 3 in the Appendix; we will call it GIGO-A), with a behavior similar to the previous one (with Equations (39) and (40)). For both of them, numerical problems can arise when the covariance matrix is almost singular.
We have not managed to find any example where one of these two algorithms converged to the minimum of the objective function, whereas the other did not, and their behavior is almost the same.
More interestingly, the performances of these two algorithms are also the same as the performances of the exact GIGO algorithm, using the equations of Section 5.2.
Notice that even though GIGO-A directly maintains a square root of the covariance matrix, which makes sampling new points easier (to sample a point from N ( μ , Σ ), a square root of Σ is needed), both GIGO-Σ and GIGO-A still have to invert the covariance matrix (or its square root) at each step, which is as costly as the decomposition, so one of these algorithms is roughly as expensive to compute as the other.

5.2. Explicit Form of the Geodesics of G d (from [5])

We now give the exact geodesics of G d: the following results are a rewriting of Theorem 3.1 and its first corollary in [5].
Theorem 6. Let $(\dot\mu_0, \dot\Sigma_0)\in T_{N(0, I)}\,\mathbb{G}_d$. The geodesic of G_d starting from N(0, I) with initial speed $(\dot\mu_0, \dot\Sigma_0)$ is given by:
$$\exp_{N(0,I)}\!\left(s\dot\mu_0, s\dot\Sigma_0\right) = N\!\left(2 R(s)\,\mathrm{sh}\!\left(\tfrac{sG}{2}\right) G^{-}\dot\mu_0,\; R(s)R(s)^T\right),$$
where exp is the Riemannian exponential of G_d, G is any matrix satisfying:
$$G^2 = \dot\Sigma_0^2 + 2\dot\mu_0\dot\mu_0^T,$$
$$R(s) = \left(\left(\mathrm{ch}\!\left(\tfrac{sG}{2}\right) - \dot\Sigma_0\, G^{-}\,\mathrm{sh}\!\left(\tfrac{sG}{2}\right)\right)^{-1}\right)^T,$$
and G⁻ is a pseudo-inverse of G.
In [5], the existence of G (as a square root of $\dot\Sigma_0^2 + 2\dot\mu_0\dot\mu_0^T$) is proven. Notice that, anyway, in the expansions of (43) and (45), only even powers of G appear.
Additionally, since, for all A ∈ GLd(ℝ) and for all μ0 ∈ ℝ^d, the application:
$$\phi:\ \mathbb{G}_d \to \mathbb{G}_d,\qquad N(\mu, \Sigma) \mapsto N(A\mu + \mu_0,\ A\Sigma A^T)$$
preserves the geodesics, we find the general expression for the geodesics of G d.
Corollary 1. Let μ0 ∈ ℝ^d, A0 ∈ GLd(ℝ) and $(\dot\mu_0, \dot\Sigma_0)\in T_{N(\mu_0, A_0A_0^T)}\,\mathbb{G}_d$. The geodesic of G_d starting from $N(\mu_0, A_0A_0^T)$ with initial speed $(\dot\mu_0, \dot\Sigma_0)$ is given by:
$$\exp_{N(\mu_0, A_0A_0^T)}\!\left(s\dot\mu_0, s\dot\Sigma_0\right) = N\!\left(\mu_1,\ A_1A_1^T\right),$$
$$\mu_1 = 2 A_0 R(s)\,\mathrm{sh}\!\left(\tfrac{sG}{2}\right) G^{-} A_0^{-1}\dot\mu_0 + \mu_0,$$
$$A_1 = A_0 R(s),$$
where exp is the Riemannian exponential of G_d, G is any matrix satisfying:
$$G^2 = A_0^{-1}\left(\dot\Sigma_0\,\Sigma_0^{-1}\dot\Sigma_0 + 2\dot\mu_0\dot\mu_0^T\right)(A_0^{-1})^T,$$
$$R(s) = \left(\left(\mathrm{ch}\!\left(\tfrac{sG}{2}\right) - A_0^{-1}\dot\Sigma_0 (A_0^{-1})^T\, G^{-}\,\mathrm{sh}\!\left(\tfrac{sG}{2}\right)\right)^{-1}\right)^T,$$
and G⁻ is a pseudo-inverse of G.
It should be noted that the final values for the mean and covariance do not depend on the choice of G as a square root of:
$$A_0^{-1}\left(\dot\Sigma_0\,\Sigma_0^{-1}\dot\Sigma_0 + 2\dot\mu_0\dot\mu_0^T\right)(A_0^{-1})^T.$$
The reason for this is that ch(G) is a Taylor series in G², and so are sh(G)G⁻ and G⁻ sh(G).
For our practical implementation, we actually used these Taylor series instead of the expressions in the corollary.
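Below is a sketch of how Corollary 1 can be implemented along these lines (our own code; the function name and the truncation order are assumptions). Only G² is ever formed: ch(sG/2) and sh(sG/2)G⁻ are evaluated as truncated series in G², so no square root or pseudo-inverse of G is needed.

```python
import numpy as np

def gaussian_exponential(mu0, A0, mu_dot0, Sigma_dot0, s=1.0, n_terms=30):
    """Riemannian exponential of Corollary 1: returns (mu1, A1) with Sigma1 = A1 A1^T."""
    d = len(mu0)
    A0_inv = np.linalg.inv(A0)
    Sigma0_inv = A0_inv.T @ A0_inv
    # G^2 from Corollary 1
    G2 = A0_inv @ (Sigma_dot0 @ Sigma0_inv @ Sigma_dot0
                   + 2.0 * np.outer(mu_dot0, mu_dot0)) @ A0_inv.T
    # C = ch(sG/2) and S = sh(sG/2) G^-, both as truncated Taylor series in G^2
    C = np.eye(d); S = (s / 2.0) * np.eye(d)
    term_C = np.eye(d); term_S = (s / 2.0) * np.eye(d)
    for k in range(1, n_terms):
        term_C = term_C @ G2 * (s / 2.0) ** 2 / ((2 * k - 1) * (2 * k))
        term_S = term_S @ G2 * (s / 2.0) ** 2 / ((2 * k) * (2 * k + 1))
        C += term_C
        S += term_S
    B = A0_inv @ Sigma_dot0 @ A0_inv.T
    R = np.linalg.inv(C - B @ S).T
    mu1 = 2.0 * A0 @ R @ S @ A0_inv @ mu_dot0 + mu0
    A1 = A0 @ R
    return mu1, A1
```

As a quick consistency check, with μ̇0 = 0, A0 = I and Σ̇0 symmetric, this reduces to Σ1 = exp(sΣ̇0), the fixed-mean geodesic.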

6. Comparing GIGO, xNES and Pure Rank-μ CMA-ES

6.1. Definitions

In this section, we recall the xNES and pure rank-μ CMA-ES, and we describe them in the IGO framework, thus allowing a reasonable comparison with the GIGO algorithms.

6.1.1. xNES

We recall a restriction of the xNES algorithm, introduced in [19] (this restriction is sufficient to describe the numerical experiments in [19]).
Definition 9 (xNES algorithm). The xNES algorithm with sample size N, weights wi and learning rates ημ and ηΣ updates the parameters μ ∈ ℝ^d, A ∈ Md(ℝ) with the following rule: At each step, N points x1, …, xN are sampled from the distribution N(μ, AA^T). Without loss of generality, we assume f(x1) < … < f(xN). The parameters are updated according to:
$$\mu \leftarrow \mu + \eta_\mu\, A\, G_\mu,\qquad A \leftarrow A\exp\!\left(\eta_\Sigma\, G_M/2\right),$$
where, setting zi = A⁻¹(xi − μ):
$$G_\mu = \sum_{i=1}^N w_i z_i,\qquad G_M = \sum_{i=1}^N w_i\left(z_i z_i^T - I\right).$$
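A compact sketch of one such update (our own code and names; scipy’s matrix exponential is used, and the weights array is assumed to be ordered from best to worst sample):

```python
import numpy as np
from scipy.linalg import expm

def xnes_step(f, mu, A, weights, eta_mu, eta_sigma, rng):
    """One step of the xNES update of Definition 9."""
    N, d = len(weights), len(mu)
    z = rng.standard_normal((N, d))                 # z_i ~ N(0, I)
    x = mu + z @ A.T                                # x_i = mu + A z_i ~ N(mu, A A^T)
    order = np.argsort([f(xi) for xi in x])         # best (smallest f) first
    w = np.empty(N); w[order] = weights             # attach w_i to the i-th best sample
    G_mu = (w[:, None] * z).sum(axis=0)
    G_M = sum(w[i] * (np.outer(z[i], z[i]) - np.eye(d)) for i in range(N))
    return mu + eta_mu * A @ G_mu, A @ expm(eta_sigma * G_M / 2)
```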
The more general version decomposes the matrix A as σB, where det B = 1, and uses two different learning rates for σ and for B. We gave the version where these two learning rates are equal (in particular, for the default parameters in [19], these two learning rates are equal). This restriction of the xNES algorithm can be described in the IGO framework, provided all of the learning rates are equal (most of the elements of the proof can be found in [19] (the proposition below essentially states that xNES is a natural gradient update) or in [1]):
Proposition 6 (xNES as IGO). The xNES algorithm with sample size N, weights wi and learning rates ημ = ηΣ = δt coincides with the IGO algorithm with sample size N, weights wi, step size δt and in which, given the current position (μt, At), the set of Gaussians is parametrized by:
$$\phi_{\mu^t, A^t}:\ (\delta, M) \mapsto N\!\left(\mu^t + A^t\delta,\ \left(A^t\exp\!\left(\tfrac{1}{2}M\right)\right)\left(A^t\exp\!\left(\tfrac{1}{2}M\right)\right)^T\right),$$
with δ ∈ ℝ^m and M ∈ Sym(ℝ^m).
The parameters maintained by the algorithm are (μ, A), and the xi are sampled from N ( μ , A A T ).
Proof. Let us compute the IGO update in the parametrization $\phi_{\mu^t, A^t}$: we have δ^t = 0 and M^t = 0, and by using Proposition 1, we can see that for this parametrization, the Fisher information matrix at (0, 0) is the identity matrix. The IGO update is therefore,
$$(\delta, M)^{t+\delta t} = (\delta, M)^t + \delta t\, Y_\delta(\delta, M) + \delta t\, Y_M(\delta, M) = \delta t\, Y_\delta(\delta, M) + \delta t\, Y_M(\delta, M),$$
where:
$$Y_\delta(\delta, M) = \sum_{i=1}^N w_i\,\nabla_\delta \ln\big(p(x_i \mid (\delta, M))\big)$$
and:
$$Y_M(\delta, M) = \sum_{i=1}^N w_i\,\nabla_M \ln\big(p(x_i \mid (\delta, M))\big).$$
Since tr(M) = log(det(exp(M))), we have:
$$\ln P_{\delta,M}(x) = -\frac{d}{2}\ln(2\pi) - \ln(\det A) - \frac{1}{2}\,\mathrm{tr}\,M - \frac{1}{2}\left\|\exp\!\left(-\tfrac{1}{2}M\right)A^{-1}(x - \mu - A\delta)\right\|^2,$$
and a straightforward computation yields:
$$Y_\delta(\delta, M) = \sum_{i=1}^N w_i z_i = G_\mu,$$
and:
$$Y_M(\delta, M) = \sum_{i=1}^N w_i\left(z_i z_i^T - I\right) = G_M.$$
Therefore, the IGO update is:
$$\delta(t + \delta t) = \delta(t) + \delta t\, G_\mu,\qquad M(t + \delta t) = M(t) + \delta t\, G_M,$$
or, in terms of mean and covariance matrix:
$$\mu(t + \delta t) = \mu(t) + \delta t\, A(t)\, G_\mu,\qquad A(t + \delta t) = A(t)\exp\!\left(\delta t\, G_M/2\right),$$
or:
$$\Sigma(t + \delta t) = A(t)\exp\!\left(\delta t\, G_M\right)A(t)^T.$$
This is the xNES update. □

6.1.2. Using a Square Root of the Covariance Matrix

Firstly, we recall that the IGO framework (on G_d, for example) emphasizes the Riemannian manifold structure on G_d. All of the algorithms studied here (including GIGO, which is not strictly speaking an IGO algorithm) define a trajectory in G_d (a new point for each step), and to go from a point θ to the next one (θ′), we follow some curve $\gamma: [0, \delta t] \to \mathbb{G}_d$, with γ(0) = θ, γ(δt) = θ′ and $\dot\gamma(0)$ given by the natural gradient ($\dot\gamma(0) = \sum_{i=1}^N \hat w_i\,\tilde\nabla_\theta \ln P_\theta(x_i) \in T_\theta\mathbb{G}_d$).
To be compatible with this point of view, an algorithm giving an update rule for a square root (any matrix A such that Σ = AAT: since we do not force A to be symmetric, the decomposition is not unique) of the covariance matrix A has to satisfy the following condition: for a given initial speed, the covariance matrix Σt+δt after one step must depend only on Σt and not on the square root At chosen for Σt.
The xNES algorithm does satisfy this condition: consider two xNES algorithms with the same learning rates, respectively at $(\mu^t, A_1^t)$ and $(\mu^t, A_2^t)$, with $A_1^t (A_1^t)^T = A_2^t (A_2^t)^T$ (i.e., they define the same Σt), using the same samples xi to compute the natural gradient update; then we will have $\Sigma_1^{t+\delta t} = \Sigma_2^{t+\delta t}$. Using the definitions of Section 6.3, we have just shown that what we will call the “xNES trajectory” is well defined.
It is also important to notice that, in order to be well defined, a natural gradient algorithm updating a square root of the covariance matrix has to specify more conditions than simply following the natural gradient.
The reason for this is that the natural gradient is a vector tangent to G_d: it lives in a space of dimension d(d + 3)/2 (the dimension of G_d), whereas the vector (μ, A) lives in a space of dimension d(d + 1) (the dimension of ℝ^d × GLd(ℝ)), which is too large: there exist infinitely many applications t ↦ At, such that a given curve $\gamma: t \mapsto N(\mu_t, \Sigma_t)$ can be written $\gamma(t) = N(\mu_t, A_tA_t^T)$. This is why Theorem 5 is simply an implication, whereas Theorem 4 is an equivalence.
More precisely, let us consider A in GLd(ℝ) and vA, v′A two infinitesimal updates of A. Since Σ = AA^T, the infinitesimal update of Σ corresponding to vA (resp. v′A) is $v_\Sigma = A v_A^T + v_A A^T$ (resp. $v'_\Sigma = A v'^{\,T}_A + v'_A A^T$).
It is now easy to see that vA and v′A define the same direction for Σ (i.e., vΣ = v′Σ) if and only if AM^T + MA^T = 0, where M = vA − v′A. This is equivalent to A⁻¹M being antisymmetric.
For any A ∈ Md(ℝ), let us denote by TA the space of matrices M such that A⁻¹M is antisymmetric or, in other words, TA := {u ∈ Md(ℝ), Au^T + uA^T = 0}. Having a subspace SA in direct sum with TA for all A is sufficient (but not necessary) to have a well-defined update rule. Namely, consider the (linear) application:
$$\phi_A:\ M_d(\mathbb{R}) \to S_d(\mathbb{R}),\qquad v_A \mapsto A v_A^T + v_A A^T,$$
sending an infinitesimal update of A to the corresponding update of Σ. It is not bijective, but as we have seen before, Ker ϕA = TA, and therefore, if we have, for some UA,
$$M_d(\mathbb{R}) = U_A \oplus T_A,$$
then φA|UA is an isomorphism. Let vΣ be an infinitesimal update of Σ. We choose the following update of A corresponding to vΣ:
$$v_A := \left(\phi_A|_{U_A}\right)^{-1}(v_\Sigma).$$
Any UA, such that UA ⊕ TA = Md(ℝ), is a reasonable choice to pick vA for a given vΣ. The choice SA = {u ∈ Md(ℝ), AuT − uAT = 0} has an interesting additional property; it is the orthogonal of TA for the norm:
$$\|v_A\|^2 := \mathrm{Tr}\!\left(v_A^T\,\Sigma^{-1}\,v_A\right) = \mathrm{Tr}\!\left((A^{-1}v_A)^T A^{-1}v_A\right),$$
and consequently, it can be defined without referring to the parametrization, which makes it a canonical choice. To prove this, remark that TA = {M ∈ Md(ℝ), A−1M antisymmetric} and SA = {M ∈ Md(ℝ), A−1M symmetric} and that if M is symmetric and N is antisymmetric, then
$$\mathrm{Tr}(M^T N) = \sum_{i,j=1}^d m_{ij}n_{ij} = \sum_{i=1}^d m_{ii}n_{ii} + \sum_{1\le i<j\le d} m_{ij}\left(n_{ij} + n_{ji}\right) = 0.$$
Let us now show that this is the choice made by xNES and GIGO-A (which are well-defined algorithms updating a square root of the covariance matrix).
Proposition 7. Let A ∈ GLd(ℝ). The vA given by the xNES and GIGO-A algorithms lies in SA = {u ∈ Md(ℝ), Au^T − uA^T = 0}.
Proof. For xNES, let us write $\dot\gamma(0) = (v_\mu, v_\Sigma)$ and $v_A := \frac{1}{2}A\, G_M$. We have $A^{-1}v_A = \frac{1}{2}G_M$, and therefore, forcing M (and GM) to be symmetric in xNES is equivalent to $A^{-1}v_A = (A^{-1}v_A)^T$, which can be rewritten as $A v_A^T = v_A A^T$. For GIGO-A, Equation (40) shows that $\Sigma_t(J_\Sigma - J_\mu\mu_t^T)$ is symmetric, and with this fact in mind, Equation (42) shows that we have $A v_A^T = v_A A^T$ (vA is $\dot A_t$). □

6.1.3. Pure Rank-μ CMA-ES

We now recall the pure rank-μ CMA-ES algorithm. The general CMA-ES algorithm is described in [21].
Definition 10 (Pure rank-μ CMA-ES algorithm). The pure rank-μ CMA-ES algorithm with sample size N, weights wi and learning rates ημ and ηΣ is defined by the following update rule: At each step, N points x1, …, xN are sampled from the distribution N(μ, Σ). Without loss of generality, we assume f(x1) < … < f(xN). The parameters are updated according to:
$$\mu \leftarrow \mu + \eta_\mu\sum_{i=1}^N w_i\,(x_i - \mu),$$
$$\Sigma \leftarrow \Sigma + \eta_\Sigma\sum_{i=1}^N w_i\left((x_i - \mu)(x_i - \mu)^T - \Sigma\right).$$
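For comparison with the sketch given for xNES above, the corresponding pure rank-μ CMA-ES step reads (again our own code and names, with the weights ordered from best to worst):

```python
import numpy as np

def pure_rank_mu_cmaes_step(f, mu, Sigma, weights, eta_mu, eta_sigma, rng):
    """One step of the pure rank-mu CMA-ES update of Definition 10."""
    N = len(weights)
    x = rng.multivariate_normal(mu, Sigma, size=N)
    order = np.argsort([f(xi) for xi in x])
    w = np.empty(N); w[order] = weights
    y = x - mu
    mu_new = mu + eta_mu * (w[:, None] * y).sum(axis=0)
    Sigma_new = Sigma + eta_sigma * sum(w[i] * (np.outer(y[i], y[i]) - Sigma) for i in range(N))
    return mu_new, Sigma_new
```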
The pure rank-μ CMA-ES can also be described in the IGO framework; see, for example, [20].
Proposition 8 (Pure rank-μ CMA-ES as IGO). The pure rank-μ CMA-ES algorithm with sample size N, weights wi and learning rates ημ = ηΣ = δt coincides with the IGO algorithm with sample size N, weights wi, step size δt and the parametrization (μ, Σ).

6.2. Twisting the Metric

As we can see, the IGO framework does not allow one to recover the learning rates for xNES and pure rank-μ CMA-ES, which is a problem, since usually, the covariance learning rate is set much smaller than the mean learning rate (see either [19] or [21]).
A way to recover these learning rates is to incorporate them directly into the metric (see also blockwise GIGO, in Section 6.4). More precisely:
Definition 11 (Twisted Fisher metric). Let ημ, ηΣ ∈ ℝ, and let (Pθ)θ∈Θ be a family of normal probability distributions: Pθ = N (μ(θ), Σ(θ)), with μ and Σ C1. We call the “(ημ, ηΣ)-twisted Fisher metric” the metric defined by:
$$I^{(\eta_\mu, \eta_\Sigma)}_{i,j}(\theta) = \frac{1}{\eta_\mu}\,\frac{\partial\mu^T}{\partial\theta_i}\,\Sigma^{-1}\,\frac{\partial\mu}{\partial\theta_j} + \frac{1}{\eta_\Sigma}\,\frac{1}{2}\,\mathrm{tr}\!\left(\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_i}\,\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta_j}\right).$$
All of the remainder of this section is simply a rewriting of the work in Section 2 with the twisted Fisher metric instead of the regular Fisher metric. We will use the term “twisted geodesic” instead of “geodesic for the twisted metric”.
This approach seems to be somewhat arbitrary: arguably, the mean and the covariance play a “different role” in the definition of a Gaussian (only the covariance can affect diversity, for example), but we lack a reasonable intrinsic characterization that would make this choice of twisting more natural. This construction can be slightly generalized (see the Appendix).
The IGO flow and the IGO algorithms can be modified to take into account the twisting of the metric; the (ημ, ηΣ)-twisted IGO flow reads:
$$\frac{d\theta^t}{dt} = I_{(\eta_\mu,\eta_\Sigma)}^{-1}(\theta)\int_X W_{\theta^t}^f(x)\,\frac{\partial \ln P_\theta(x)}{\partial\theta}\,P_{\theta^t}(dx).$$
The only difference with (9) is that $I^{-1}(\theta)$ has been replaced by $I_{(\eta_\mu,\eta_\Sigma)}^{-1}(\theta)$.
This leads us to the twisted IGO algorithms.
Definition 12. The (ημ, ηΣ)-twisted IGO algorithm associated with parametrization θ, sample size N, step size δt and selection scheme w is given by the following update rule:
$$\theta^{t+\delta t} = \theta^t + \delta t\; I_{(\eta_\mu,\eta_\Sigma)}^{-1}(\theta^t)\sum_{i=1}^N \hat w_i\,\frac{\partial \ln P_\theta(x_i)}{\partial\theta}.$$
Definition 13. The (ημ, ηΣ)-twisted geodesic IGO algorithm associated with sample size N, step size δt and selection scheme w is given by the following update rule:
$$\theta^{t+\delta t} = \exp_{\theta^t}(Y\,\delta t)$$
where:
$$Y = I_{(\eta_\mu,\eta_\Sigma)}^{-1}(\theta^t)\sum_{i=1}^N \hat w_i\,\frac{\partial \ln P_\theta(x_i)}{\partial\theta}.$$
By definition, the twisted geodesic IGO algorithm does not depend on the parametrization (but it does depend on ημ and ηΣ).
There is some redundancy between δt, ημ and ηΣ: the only values actually appearing in the equations are δtημ and δtηΣ. More formally:
Proposition 9. Let k, d, N ∈ ℕ; ημ, ηΣ, δt, λ1, λ2 ∈ ℝ; and w: [0, 1] → ℝ.
The (ημ, ηΣ)-twisted IGO algorithm with sample size N, step size δt and selection scheme w coincides with the (λ1ημ, λ1ηΣ)-twisted IGO algorithm with sample size N, step size λ2δt and selection scheme $\frac{1}{\lambda_1\lambda_2}w$. The same is true for geodesic IGO.
In order to obtain the twisted algorithms, the Fisher metric in IGO has to be replaced by the metric from Definition 11. In practice, the equations found by twisting the metric are exactly the equations without twisting, except that we have “forced” the learning rates ημ, ηΣ to appear by multiplying the increments of μ and Σ by ημ and ηΣ.
We can now describe pure rank-μ CMA-ES and xNES with separate learning rates as twisted IGO algorithms:
Proposition 10 (xNES as IGO). The xNES algorithm with sample size N, weights wi and learning rates ημ, ησ = ηB = ηΣ coincides with the $(\frac{\eta_\mu}{\delta t}, \frac{\eta_\Sigma}{\delta t})$-twisted IGO algorithm with sample size N, weights wi, step size δt and in which, given the current position (μt, At), the set of Gaussians is parametrized by:
$$(\delta, M) \mapsto N\!\left(\mu^t + A^t\delta,\ \left(A^t\exp\!\left(\tfrac{1}{2}M\right)\right)\left(A^t\exp\!\left(\tfrac{1}{2}M\right)\right)^T\right),$$
with δ ∈ ℝ^m and M ∈ Sym(ℝ^m).
The parameters maintained by the algorithm are (μ, A), and the xi are sampled from N (μ, AAT).
Proposition 11 (Pure rank-μ CMA-ES as IGO). The pure rank-μ CMA-ES algorithm with sample size N, weights wi and learning rates ημ and ηΣ coincides with the $(\frac{\eta_\mu}{\delta t}, \frac{\eta_\Sigma}{\delta t})$-twisted IGO algorithm with sample size N, weights wi, step size δt and the parametrization (μ, Σ).
The proofs of these two statements are an easy rewriting of their non-twisted counterparts: one can return to the non-twisted metric (up to a ηΣ factor) by changing μ into $\sqrt{\eta_\Sigma/\eta_\mu}\,\mu$.
We give the equations of the twisted geodesics of G d in the Appendix.

6.3. Trajectories of Different IGO Steps

As we have seen, two different IGO algorithms (or an IGO algorithm and the GIGO algorithm) coincide at first order in δt when δt → 0. In this section, we study the differences between pure rank-μ CMA-ES, xNES and GIGO by looking at the second order in δt, and in particular, we show that xNES and GIGO do not coincide in the general case.
We view the updates done by one step of the algorithms as paths on the manifold G d, from (μ(t), Σ(t)) to (μ(t + δt), Σ(t + δt)), where δt is the time step of our algorithms, seen as IGO algorithms. More formally:
Definition 14. (1) We call the GIGO update trajectory the application:
$$T_{\mathrm{GIGO}}:\ (\mu, \Sigma, v_\mu, v_\Sigma) \mapsto \left(\delta t \mapsto \exp_{N(\mu, \Sigma)}\!\left(\delta t\,\eta_\mu v_\mu,\ \delta t\,\eta_\Sigma v_\Sigma\right)\right).$$
(exp is the exponential of the Riemannian manifold $\mathbb{G}_d^{(\eta_\mu, \eta_\Sigma)}$)
(2) We call the xNES update trajectory the application:
$$T_{\mathrm{xNES}}:\ (\mu, \Sigma, v_\mu, v_\Sigma) \mapsto \left(\delta t \mapsto N\!\left(\mu + \delta t\,\eta_\mu v_\mu,\ A\exp\!\left[\eta_\Sigma\,\delta t\, A^{-1}v_\Sigma (A^{-1})^T\right]A^T\right)\right),$$
with AAT = Σ. The application above does not depend on the choice of a square root A.
(3) We call the CMA-ES update trajectory the application:
$$T_{\mathrm{CMA}}:\ (\mu, \Sigma, v_\mu, v_\Sigma) \mapsto \left(\delta t \mapsto N\!\left(\mu + \delta t\,\eta_\mu v_\mu,\ \Sigma + \delta t\,\eta_\Sigma v_\Sigma\right)\right).$$
These applications map the set of tangent vectors to G_d (i.e., T G_d) to curves in $\mathbb{G}_d^{(\eta_\mu, \eta_\Sigma)}$.
We will also use the following notation: μGIGO := ϕμ ∘ TGIGO, μxNES := ϕμ ∘ TxNES, μCMA := ϕμ ∘ TCMA, ΣGIGO := ϕΣ ∘ TGIGO, ΣxNES := ϕΣ ∘ TxNES and ΣCMA := ϕΣ ∘ TCMA, where ϕμ (resp. ϕΣ) extracts the μ-component (resp. the Σ-component) of a curve.
In particular, Im(ϕμ) ⊂ ℝ^d and Im(ϕΣ) ⊂ Pd, where Pd (the set of real symmetric positive-definite matrices of dimension d) is seen as a subset of ℝ^{d²}.
For instance, TGIGO(μ, Σ, vμ, vΣ)(δt) gives the position (mean and covariance matrix) of the GIGO algorithm after a step of size δt, while μGIGO and ΣGIGO give, respectively, the mean component and the covariance component of this position.
This formulation ensures that the trajectories we are comparing had the same initial position and the same initial speed, which is the case provided the sampled points (the values directly sampled from N ( μ , Σ ), not from N ( 0 , I ) and transformed) are the same.
Different IGO algorithms coincide at first order in δt. The following proposition gives the second order expansion of the trajectories of the algorithms.
Proposition 12 (Second derivatives of the trajectories). We have:
$$\mu''_{\mathrm{GIGO}}(\mu,\Sigma,v_\mu,v_\Sigma)(0) = \eta_\mu\eta_\Sigma\, v_\Sigma\,\Sigma^{-1} v_\mu,$$
$$\mu''_{\mathrm{xNES}}(\mu,\Sigma,v_\mu,v_\Sigma)(0) = \mu''_{\mathrm{CMA}}(\mu,\Sigma,v_\mu,v_\Sigma)(0) = 0,$$
$$\Sigma''_{\mathrm{GIGO}}(\mu,\Sigma,v_\mu,v_\Sigma)(0) = \eta_\Sigma^2\, v_\Sigma\,\Sigma^{-1} v_\Sigma - \eta_\mu\eta_\Sigma\, v_\mu v_\mu^T,$$
$$\Sigma''_{\mathrm{xNES}}(\mu,\Sigma,v_\mu,v_\Sigma)(0) = \eta_\Sigma^2\, v_\Sigma\,\Sigma^{-1} v_\Sigma,$$
$$\Sigma''_{\mathrm{CMA}}(\mu,\Sigma,v_\mu,v_\Sigma)(0) = 0.$$
Proof. We can immediately see that the second derivatives of μxNES, μCMA and ΣCMA are zero. Next, we have:
$$\Sigma_{\mathrm{xNES}}(\mu,\Sigma,v_\mu,v_\Sigma)(t) = A\exp\!\left[t\,\eta_\Sigma A^{-1}v_\Sigma (A^{-1})^T\right]A^T = AA^T + t\,\eta_\Sigma v_\Sigma + \frac{t^2}{2}\,\eta_\Sigma^2\, v_\Sigma (A^{-1})^T A^{-1} v_\Sigma + o(t^2) = \Sigma + t\,\eta_\Sigma v_\Sigma + \frac{t^2}{2}\,\eta_\Sigma^2\, v_\Sigma\,\Sigma^{-1} v_\Sigma + o(t^2).$$
The expression of $\Sigma''_{\mathrm{xNES}}(\mu,\Sigma,v_\mu,v_\Sigma)(0)$ follows.
Now, for GIGO, let us consider the geodesic starting at (μ0, Σ0) with initial speed (ημvμ, ηΣvΣ). By writing Jμ(0) = Jμ(t), we find $\dot\mu(t) = \Sigma(t)\,\Sigma_0^{-1}\dot\mu_0$. We then easily have $\ddot\mu(0) = \dot\Sigma_0\,\Sigma_0^{-1}\dot\mu_0$. In other words:
$$\mu''_{\mathrm{GIGO}}(\mu,\Sigma,v_\mu,v_\Sigma)(0) = \eta_\mu\eta_\Sigma\, v_\Sigma\,\Sigma_0^{-1} v_\mu.$$
Finally, by using Theorem 4 and differentiating, we find:
$$\ddot\Sigma = \eta_\Sigma\,\dot\Sigma\left(J_\Sigma - J_\mu\mu^T\right) - \eta_\Sigma\,\Sigma J_\mu\dot\mu^T,\qquad \ddot\Sigma_0 = \eta_\Sigma\,\dot\Sigma_0\,\frac{1}{\eta_\Sigma}\,\Sigma_0^{-1}\dot\Sigma_0 - \frac{\eta_\Sigma}{\eta_\mu}\,\dot\mu_0\dot\mu_0^T = \eta_\Sigma^2\, v_\Sigma\,\Sigma_0^{-1} v_\Sigma - \eta_\Sigma\eta_\mu\, v_\mu v_\mu^T.\ \square$$
In order to interpret these results, we will look at what happens in dimension one. In higher dimensions, we can suppose that the algorithms exhibit a similar behavior, but an exact interpretation is more difficult for GIGO in G d.
  • In [19], it has been noted that xNES converges to quadratic minima more slowly than CMA-ES and that it is less subject to premature convergence. That fact can be explained by observing that the mean update is exactly the same for CMA-ES and xNES, whereas xNES tends to have a higher variance (Proposition 12 shows this at order two, and it is easy to see that in dimension one, for any μ, Σ, vμ, vΣ, we have ΣxNES(μ, Σ, vμ, vΣ) > ΣCMA(μ, Σ, vμ, vΣ)).
  • At order two, GIGO moves the mean faster than xNES and CMA-ES if the standard deviation is increasing and more slowly if it is decreasing. This seems to be a reasonable behavior (if the covariance is decreasing, then the algorithm is presumably close to a minimum, and it should not leave the area too quickly). This remark holds only for isolated steps, because we do not take into account the evolution of the variance.
  • The geodesics of G 1 are half-circles (see Figure 2 below; we recall that G 1 is the Poincaré half-plane). Consequently, if the mean is supposed to move (which always happens), then σ → 0 when δt → ∞. For example, a step whose initial speed has no component on the standard deviation will always decrease it. See also Proposition 15, about the optimization of a linear function.
  • For the same reason, for a given initial speed, the update of μ always stays bounded as a function of δt: it is not possible to make one step of the GIGO algorithm go further than a fixed point by increasing δt. Still, the geodesic followed by GIGO changes at each step, so the mean of the overall algorithm is not bounded.
We now show that xNES follows the geodesics of G d if the mean is fixed, but that xNES and GIGO do not coincide otherwise.
Proposition 13 (xNES is not GIGO in the general case). Let μ, vμ ∈ ℝd, A ∈ GLd, vΣ ∈ Md.
Then, the GIGO and xNES updates starting at N ( μ , Σ ) with initial speeds vμ and vΣ follow the same trajectory if and only if the mean remains constant. In other words:
$$T_{\mathrm{GIGO}}(\mu, \Sigma, v_\mu, v_\Sigma) = T_{\mathrm{xNES}}(\mu, \Sigma, v_\mu, v_\Sigma)\quad\textit{if and only if}\quad v_\mu = 0.$$
Proof. If vμ = 0, then we can compute the GIGO update by using Theorem 4: since Jμ = 0, $\dot\mu = 0$, and μ remains constant. Now, we have $J_\Sigma = \Sigma^{-1}\dot\Sigma$; this is enough information to compute the update. Since this quantity is also preserved by the xNES algorithm (see, for example, the proof of Proposition 14), the two updates coincide.
Figure 2. One step of the geodesic IGO (GIGO) update.
If vμ ≠ 0, then $\Sigma''_{\mathrm{xNES}}(\mu,\Sigma,v_\mu,v_\Sigma)(0) - \Sigma''_{\mathrm{GIGO}}(\mu,\Sigma,v_\mu,v_\Sigma)(0) = \eta_\mu\eta_\Sigma\, v_\mu v_\mu^T \neq 0$ and, in particular, TGIGO(μ, Σ, vμ, vΣ) ≠ TxNES(μ, Σ, vμ, vΣ). □

6.4. Blockwise GIGO

Although xNES is not GIGO, it is possible to define a family of algorithms extending GIGO and including xNES, by decomposing our family of probability distributions as a product and by following the restricted geodesics simultaneously.
Definition 15 (Splitting). Let Θ be a Riemannian manifold. A splitting of Θ is n manifolds Θ1, …, Θn and a diffeomorphism Θ ≅ Θ1 × … × Θn. If for all x ∈ Θ, for all 1 ≤ i < j ≤ n, we also have $T_{i,x}\Theta \perp T_{j,x}\Theta$ as subspaces of $T_x\Theta$ (see Notation 2), then the splitting is said to be compatible with the Riemannian structure. If the Riemannian manifold is not ambiguous, we will simply write a “compatible splitting”.
We now give some notation, and we define the blockwise GIGO update:
Notation 2. Let Θ be a Riemannian manifold, Θ1, …, Θn a splitting of Θ, θ = (θ1, …, θn) ∈ Θ, YTθΘ and 1 ≤ in.
  • We denote by Θθ,i the Riemannian manifold
    $$\{\theta_1\}\times\cdots\times\{\theta_{i-1}\}\times\Theta_i\times\{\theta_{i+1}\}\times\cdots\times\{\theta_n\},$$
    with the metric induced from Θ. There is a canonical isomorphism of vector spaces $T_\theta\Theta = \bigoplus_{i=1}^n T_\theta\Theta_{\theta,i}$. Moreover, if the splitting is compatible, it is an isomorphism of Euclidean spaces.
  • We denote by Φθ,i the exponential at θ of the manifold Θθ,i.
Definition 16 (Blockwise GIGO update). Let Θ1, …, Θn be a compatible splitting. The blockwise GIGO algorithm in Θ with splitting Θ1, …, Θn associated with sample size N, step sizes δt1, …, δtn and selection scheme w is given by the following update rule:
$$\theta \leftarrow \left(\theta_1^{t+\delta t_1},\ \ldots,\ \theta_n^{t+\delta t_n}\right)$$
where:
$$Y = I^{-1}(\theta^t)\sum_{i=1}^N \hat w_i\,\frac{\partial \ln P_\theta(x_i)}{\partial\theta},$$
$$\theta_k^{t+\delta t_k} = \Phi_{\theta^t, k}\!\left(\delta t_k\, Y_k\right),$$
with Yk the TΘθ,k-component of Y. This update only depends on the splitting (and not on the parametrization inside each Θk).
The compatibility condition ensures that the natural gradient of W θ t f (defined in Section 2.2) in the whole manifold Θ really is the sum of the gradients of this same function in the submanifolds Θk. A practical consequence is that the Yk in Equation (62) can be computed simply by taking the natural gradient in Θk:
$$Y_k = I_k^{-1}(\theta^t)\sum_{i=1}^N \hat w_i\,\frac{\partial \ln P_\theta(x_i)}{\partial\theta_k},$$
where Ik is the metric of Θk.
Since blockwise GIGO only depends on the splitting (and the tunable parameters: sample size, step sizes and selection scheme), it can be thought of as almost parametrization-invariant.
Notice that blockwise GIGO updates and twisted GIGO updates are two different things: firstly, blockwise GIGO can be defined on any manifold with a compatible splitting, whereas twisted GIGO (and twisted IGO) are only defined for Gaussians. However, even in $\mathbb{G}_d^{(\eta_\mu,\eta_\Sigma)}$, with the splitting (μ, Σ), these two algorithms are different: for instance, if ημ = ηΣ and δt = 1, then the twisted GIGO is the regular GIGO algorithm, whereas blockwise GIGO is not (actually, we will prove that it is the xNES algorithm). The only thing blockwise GIGO and twisted GIGO have in common is that they are compatible with the (ημ, ηΣ)-twisted IGO flow Equation (57): a parameter θt following these updates with δt → 0 and N → ∞ is a solution of Equation (57).
We now have a new description of the xNES algorithm:
Proposition 14 (xNES is a Blockwise GIGO algorithm). The Blockwise GIGO algorithm in G_d with splitting Φ: N(μ, Σ) ↦ (μ, Σ), sample size N, step sizes δtμ, δtΣ and selection scheme w coincides with the xNES algorithm with sample size N, weights wi and learning rates ημ = δtμ, ησ = ηB = δtΣ.
Proof. Firstly, notice that the splitting (μ, Σ) is compatible, by Proposition 1.
Now, let us compute the blockwise GIGO update: we have $\mathbb{G}_d \cong \mathbb{R}^d \times P_d$, where $P_d$ is the space of real positive-definite matrices of dimension $d$. We have $\Theta_{\theta^t,1} = \left(\mathbb{R}^d \times \{\Sigma^t\}\right) \subset \mathbb{G}_d$ and $\Theta_{\theta^t,2} = \left(\{\mu^t\} \times P_d\right) \subset \mathbb{G}_d$. The induced metric on $\Theta_{\theta^t,1}$ is the Euclidean metric, so we have:
$\mu \leftarrow \mu^t + \delta t_1\,Y_\mu.$
Since we have already shown (using the notation in Definition 9) that Yμ = AGμ (in the proof of Proposition 6), we find:
$\mu \leftarrow \mu^t + \delta t_1\,A\,G_\mu.$
On $\Theta_{\theta^t,2}$, we have the following Lagrangian for the geodesics:
$\mathcal{L}(\Sigma, \dot\Sigma) = \frac{1}{2}\operatorname{tr}\!\left(\dot\Sigma\,\Sigma^{-1}\,\dot\Sigma\,\Sigma^{-1}\right).$
By applying Noether’s theorem, we find that
$J_\Sigma = \Sigma^{-1}\dot\Sigma$
is invariant along the geodesics of $\Theta_{\theta^t,2}$, so they are defined by the equation $\dot\Sigma = \Sigma J_\Sigma = \Sigma\,\Sigma_0^{-1}\dot\Sigma_0$ (and therefore, any update preserving the invariant $J_\Sigma$ will satisfy this first-order differential equation and follow the geodesics of $\Theta_{\theta^t,2}$). The xNES update for the covariance matrix is given by $A(t) = A_0\exp(tG_M/2)$. Therefore, we have $\Sigma(t) = A_0\exp(tG_M)A_0^T$, $\Sigma^{-1}(t) = (A_0^{-1})^T\exp(-tG_M)A_0^{-1}$, $\dot\Sigma(t) = A_0\exp(tG_M)\,G_M\,A_0^T$ and, finally, $\Sigma^{-1}(t)\dot\Sigma(t) = (A_0^{-1})^T G_M A_0^T = \Sigma_0^{-1}\dot\Sigma_0$. Therefore, xNES preserves $J_\Sigma$, and thus xNES follows the geodesics of $\Theta_{\theta^t,2}$ (notice that we had already proven this in Proposition 13, since we are looking at the geodesics of $\mathbb{G}_d$ with a fixed mean).
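This invariance is easy to check numerically. The following small Python sketch computes $\Sigma^{-1}(t)\dot\Sigma(t)$ by finite differences along the xNES curve $A(t) = A_0\exp(tG_M/2)$ and verifies that it does not depend on $t$; the matrices $A_0$ and $G_M$ are arbitrary test data, not taken from the paper.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d = 3
A0 = rng.standard_normal((d, d))                       # arbitrary invertible square root
GM = rng.standard_normal((d, d)); GM = (GM + GM.T) / 2  # arbitrary symmetric matrix

def Sigma(t):
    A = A0 @ expm(t * GM / 2)                          # xNES covariance curve
    return A @ A.T

def J_Sigma(t, eps=1e-6):
    Sdot = (Sigma(t + eps) - Sigma(t - eps)) / (2 * eps)  # finite-difference derivative
    return np.linalg.solve(Sigma(t), Sdot)             # Sigma^{-1} dSigma/dt

print(np.allclose(J_Sigma(0.0), J_Sigma(0.7), atol=1e-4))  # True: the invariant is preserved
```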
Although blockwise GIGO is somewhat “less natural” than GIGO, it can be easier to compute for some splittings (as we have just seen), and in the case of the Gaussian distributions, the mean-covariance splitting seems reasonable.

7. Numerical Experiments

We conclude this article with some numerical experiments to compare the behavior of GIGO, xNES and pure rank-μ CMA-ES (we give the pseudocodes for these algorithms in the Appendix). We made two series of tests. The first one is a performance test, using classical benchmark functions and the settings from [19]. The goal of the second series of tests is to illustrate the computations in Section 6.3 by plotting the trajectories (standard deviation versus mean) of these three algorithms in dimension one.
The source code is available at [22].

7.1. Benchmarking

For the first series of experiments, presented in Figure 3, we used the following parameters, taken from [19] (we recall that xNES and pure rank-μ CMA-ES are seen as IGO algorithms):
  • Varying dimension.
  • Sample size: $N = 4 + \lfloor 3\log(d)\rfloor$.
  • Weights: $w_i = \dfrac{\max\left(0,\ \log\left(\frac{N}{2}+1\right)-\log(i)\right)}{\sum_{j=1}^{N}\max\left(0,\ \log\left(\frac{N}{2}+1\right)-\log(j)\right)} - \dfrac{1}{N}$.
  • IGO step size and learning rates: $\delta t = 1$, $\eta_\mu = 1$, $\eta_\Sigma = \frac{3}{5}\,\frac{3+\log(d)}{d\sqrt{d}}$.
  • Initial position: $\theta^0 = \mathcal{N}(x_0, I)$, where $x_0$ is a random point of the circle with center zero and radius 10.
  • Euler method for GIGO: Number of steps: 100. We used the GIGO-A variant of the algorithm. No significant difference was noticed with GIGO-Σ or with the exact GIGO algorithm. The only advantage of having an explicit solution of the geodesic equations is that the update is quicker to compute.
  • We chose not to use the exact expression of the geodesics for this benchmarking, to show that having to use the Euler method is fine. However, we did run the tests with the exact update, and the results are essentially the same as with GIGO-A.
We plot the median number of function calls to reach the target fitness of $10^{-8}$. Each algorithm has been tested in dimension 2, 4, 8, 16, 32 and 64: a missing point means that all runs converged prematurely.
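For reference, the following sketch computes these default settings (it assumes the usual floor in the sample size formula, $4 + \lfloor 3\log d\rfloor$, and natural logarithms); it is only meant to make the parameter choices concrete, not to reproduce the benchmark code of [22].

```python
import numpy as np

def default_sample_size(d):
    return 4 + int(np.floor(3 * np.log(d)))       # 4 + floor(3 log d), floor assumed

def selection_weights(N):
    i = np.arange(1, N + 1)
    raw = np.maximum(0.0, np.log(N / 2 + 1) - np.log(i))
    return raw / raw.sum() - 1.0 / N              # rank-based weights, summing to zero

d = 8
N = default_sample_size(d)
w = selection_weights(N)
eta_sigma = 3 / 5 * (3 + np.log(d)) / (d * np.sqrt(d))
print(N, round(w.sum(), 12), round(eta_sigma, 3))  # 10, ~0, 0.135
```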

7.1.1. Failed Runs

In Figure 3, a point is plotted even if only one run was successful. Below is the list of the settings for which at least one run converged prematurely.
  • Only one run reached the optimum for the cigar-tablet function with CMA-ES in dimension eight.
  • Seven runs (out of 24) reached the optimum for the Rosenbrock function with CMA-ES in dimension 16.
  • About half of the runs reached the optimum for the sphere function with CMA-ES in dimension four.
For the following settings, all runs converged prematurely.
  • GIGO did not find the optimum of the Rosenbrock function in any dimension.
  • CMA-ES did not find the optimum of the Rosenbrock function in dimension 2, 4, 32 and 64.
  • All of the runs converged prematurely for the cigar-tablet function in dimension two with CMA-ES, for the sphere function in dimension two for all algorithms and for the Rosenbrock function in dimension two and four for all algorithms.

7.1.2. Discussion

As the last item in Section 7.1.1 shows, all of the algorithms converge prematurely in a low dimension, probably because the covariance learning rate has been set too high (or because the sample size is too small). This is different from the results in [19].
This remark aside, as noted in [19], the xNES algorithm shows more robustness than CMA-ES and GIGO: it is the only algorithm able to find the minimum of the Rosenbrock function in high dimensions. However, its convergence is consistently slower.
In terms of performance, when both of them work, pure rank-μ CMA-ES (or equivalently, IGO in the parametrization (μ, Σ)) and GIGO are extremely close (GIGO is usually a bit better). An advantage of GIGO is that it is theoretically defined for any δt, ηΣ, whereas the covariance matrix maintained by CMA-ES (not only pure rank-μ CMA-ES) can stop being positive definite if ηΣδt > 1. However, in that case, the GIGO algorithm is prone to premature convergence (remember Figure 2 and see Proposition 15 below), and in practice, the learning rates are much smaller.
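To illustrate the remark about positive definiteness, here is a sketch of the covariance part of the pure rank-μ CMA-ES update as we use it (IGO in the parametrization (μ, Σ), cf. [1]); when $\eta_\Sigma\delta t > 1$ and the selected samples happen to lie close to the mean, the updated matrix can acquire negative eigenvalues. The numbers below are arbitrary test data.

```python
import numpy as np

def rank_mu_cov_update(Sigma, mu, xs, w_hat, eta_dt):
    """Covariance part of the pure rank-mu CMA-ES update (IGO in (mu, Sigma)), sketch."""
    G = sum(wi * (np.outer(x - mu, x - mu) - Sigma) for wi, x in zip(w_hat, xs))
    return Sigma + eta_dt * G

rng = np.random.default_rng(1)
d, mu, Sigma = 2, np.zeros(2), np.eye(2)
xs = [mu + 0.1 * rng.standard_normal(d) for _ in range(4)]   # samples close to the mean
w_hat = [0.25] * 4                                           # weights summing to one

for eta_dt in (0.5, 1.5):
    S = rank_mu_cov_update(Sigma, mu, xs, w_hat, eta_dt)
    print(eta_dt, np.linalg.eigvalsh(S).min())   # smallest eigenvalue < 0 when eta_Sigma * dt > 1
```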

7.2. Plotting Trajectories in $\mathbb{G}_1$

We want the second series of experiments to illustrate the remarks about the trajectories of the algorithms in Section 6.3, so we decided to take a large sample size to limit randomness, and we chose a fixed starting point for the same reason. We use the weights below because of the property of quantile improvement proven in [23]: the 1/4-quantile will improve at each step. The parameters we used were the following:
  • Sample size: λ = 5000.
  • Dimension one only.
  • Weights: $w = 4\cdot\mathbb{1}_{q\leq 1/4}$ (i.e., $w_i = 4\cdot\mathbb{1}_{i\leq 1250}$).
  • IGO step size and learning rates: $\eta_\mu = 1$, $\eta_\Sigma = \frac{3}{5}\,\frac{3+\log(d)}{d\sqrt{d}} = 1.8$, varying $\delta t$.
  • Initial position: $\theta^0 = \mathcal{N}(10, 1)$.
  • Dots are placed at t = 0, 1, 2 … (except for the graph δt = 1.5, for which there is a dot for each step).
Figures 4–8 show the optimization of $x \mapsto x^2$, and Figures 9–11 show the optimization of $x \mapsto -x$.
Figures 7, 8 and 11 show that when $\delta t \geq 1$, GIGO reduces the covariance, even at the first step. More generally, when using the GIGO algorithm in $\tilde{\mathbb{G}}_d$ for the optimization of a linear function, there exists a critical step size $\delta t_{\mathrm{cr}}$ (depending on the learning rates $\eta_\mu, \eta_\sigma$ and on the weights $w_i$) above which GIGO converges prematurely, and we can compute its value when the weights are of the form $\mathbb{1}_{q\leq q_0}$ (for $q_0 \geq 0.5$, the discussion is not relevant, because in that case even the IGO flow converges prematurely; compare with the critical $\delta t$ of the smoothed cross-entropy method and IGO-ML in [1]).
Proposition 15. Let $d \in \mathbb{N}$, $k, \eta_\mu, \eta_\sigma \in \mathbb{R}_+^*$; let $w = k\cdot\mathbb{1}_{q\leq q_0}$ and let
$g:\ \mathbb{R}^d \to \mathbb{R},\qquad x \mapsto x_1.$
Let $\mu_n$ be the first coordinate of the mean and let $\sigma_n^2$ be the variance (at step $n$) maintained by the $(\eta_\mu, \eta_\sigma)$-twisted geodesic IGO algorithm in $\tilde{\mathbb{G}}_d$ associated with selection scheme $w$, sample size $\infty$ and step size $\delta t$, when optimizing $g$ (“sample size $\infty$” meaning the limit of the update when the sample size tends to infinity, which is deterministic [1]).
There exists δtcr, such that:
  • if δt > δtcr, (σn) converges to zero with exponential speed and (μn) converges.
  • if δt = δtcr, (σn) remains constant and (μn) tends to ∞ with linear speed.
  • if 0 < δt < δtcr, both (σn) and μn tend to ∞ with exponential speed.
The proof and the expression of δtcr can be found in the Appendix.
In the case corresponding to k = 4, n = 1, q0 = 1/4, ημ = 1, ησ = 1.8, we find:
$\delta t_{\mathrm{cr}} \approx 0.84.$

8. Conclusions

We introduced the geodesic IGO algorithm, and we showed that in the case of Gaussian distributions, Noether’s theorem directly gives a first order equation satisfied by the geodesics. In terms of performance, the GIGO algorithm is similar to pure rank-μ CMA-ES, which is rather encouraging: it would be interesting to test GIGO on real problems. Moreover, GIGO is a reasonable and totally parametrization-invariant algorithm (provided we can compute the solution of the equations of the geodesics), and as such, it should be studied for other families of probability distributions, like Bernoulli distributions (although in that case, the Riemannian exponential is not defined if the step size is too large, because the length of the geodesics is finite). Noether’s theorem could be a crucial tool for this.
We also showed that xNES and GIGO are not the same algorithm, and we defined blockwise GIGO, a simple extension of the GIGO algorithm, showing that xNES has a special status, as it admits a definition that is “almost” parametrization-invariant.

Conflicts of Interest

The author declares no conflict of interest.

Proof of Proposition 15

Let us first consider the case k = 1.
When optimizing a linear function, the non-twisted IGO flow in $\tilde{\mathbb{G}}_d$ with the selection function $q \mapsto \mathbb{1}_{q\leq q_0}$ is known [1], and in particular, we have:
$\mu_t = \mu_0 + \frac{\beta(q_0)}{\alpha(q_0)}\,\sigma_t,$
$\sigma_t = \sigma_0\exp\!\left(\alpha(q_0)\,t\right),$
where, if we denote by $\mathcal{N}$ a random vector following a standard normal distribution and by $F$ the cumulative distribution function of a standard normal distribution,
$\alpha(q_0, d) = \frac{1}{2d}\left(\int_0^{q_0} F^{-1}(u)^2\,\mathrm{d}u - q_0\right),$
and:
$\beta(q_0) = \mathbb{E}\!\left(\mathcal{N}_1\,\mathbb{1}_{\mathcal{N}_1 \leq F^{-1}(q_0)}\right).$
In particular, $\alpha := \alpha\!\left(\tfrac14, 1\right) \approx 0.107$ and $\beta := \beta\!\left(\tfrac14\right) \approx 0.319$.
With a minor modification of the proof in [1], we find that the (ημ, ησ)-twisted IGO flow is given by:
$\mu_t = \mu_0 + \frac{\beta(q_0)}{\alpha(q_0)}\,\sigma_0\exp\!\left(\eta_\mu\,\alpha(q_0)\,t\right),$
$\sigma_t = \sigma_0\exp\!\left(\eta_\sigma\,\alpha(q_0)\,t\right).$
Notice that Equation (69) shows that the assertions about the convergence of (σn) immediately imply the assertions about the convergence of (μn).
Let us now consider a step of the GIGO algorithm: The twisted IGO speed is Y = (ημβσ0, ησασ0), with ασ0 > 0 (i.e., the variance should be increased: this is where we need q0 < 0.5).
Proposition 17 shows that the covariance at the end of the step is (using the same notation):
$\sigma(\delta t) = \sigma(0)\,\operatorname{Im}\!\left(\frac{d\,ie^{v\delta t} - c}{c\,ie^{v\delta t} + d}\right) = \sigma(0)\,\frac{e^{v\delta t}\left(d^2 + c^2\right)}{c^2 e^{2v\delta t} + d^2} =: \sigma(0)\,f(\delta t),$
and it is easy to see that f only depends on δt (and on q0). In other words, f(δt) will be the same at each step of the algorithm. The existence of δtcr easily follows (furthermore, recall Figure 1 in Section 4.1), and δtcr is the positive solution of f(x) = 1.
After a quick computation, we find:
$\exp\!\left(v\,\delta t_{\mathrm{cr}}\right) = \frac{\sqrt{1+u^2}+1}{\sqrt{1+u^2}-1},$
where:
$u := \sqrt{\frac{\eta_\mu}{2n\,\eta_\sigma}}\;\frac{\beta}{\alpha},$
and:
$v := \sqrt{\eta_\sigma^2\,\alpha^2 + \frac{\eta_\mu\,\eta_\sigma}{2n}\,\beta^2}.$
Finally, for $w = k\cdot\mathbb{1}_{q\leq q_0}$, Proposition 9 shows that:
$\delta t_{\mathrm{cr}} = \frac{1}{k}\,\frac{1}{v}\,\ln\!\left(\frac{\sqrt{1+u^2}+1}{\sqrt{1+u^2}-1}\right).$
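As a sanity check, the value $\delta t_{\mathrm{cr}} \approx 0.84$ quoted above can be recovered numerically from these expressions, using the standard Gaussian identities $\int_{-\infty}^{a}x^2\varphi(x)\,\mathrm{d}x = F(a) - a\varphi(a)$ and $\left|\int_{-\infty}^{a}x\varphi(x)\,\mathrm{d}x\right| = \varphi(a)$ to evaluate $\alpha$ and the magnitude of $\beta$; the sketch below assumes the reconstruction of $u$ and $v$ given above.

```python
import numpy as np
from scipy.stats import norm

q0, n, k = 0.25, 1, 4                  # selection quantile, dimension, weight scale
eta_mu, eta_sigma = 1.0, 1.8

a = norm.ppf(q0)                       # F^{-1}(q0)
alpha = (norm.cdf(a) - a * norm.pdf(a) - q0) / (2 * n)   # (int_0^{q0} F^{-1}(u)^2 du - q0) / (2n)
beta = norm.pdf(a)                     # magnitude of E(N 1_{N <= F^{-1}(q0)})

u = np.sqrt(eta_mu / (2 * n * eta_sigma)) * beta / alpha
v = np.sqrt(eta_sigma**2 * alpha**2 + eta_mu * eta_sigma * beta**2 / (2 * n))
dt_cr = np.log((np.sqrt(1 + u**2) + 1) / (np.sqrt(1 + u**2) - 1)) / (k * v)
print(round(alpha, 3), round(beta, 3), round(dt_cr, 2))   # 0.107 0.318 0.84
```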

A1. Generalization of the Twisted Fisher Metric

The following definition is a more general way to introduce the twisted Fisher metric.
Definition 17. Let $(\Theta, g)$ be a Riemannian manifold and $(\Theta_1, g|_{\Theta_1}), \ldots, (\Theta_n, g|_{\Theta_n})$ a splitting of $\Theta$ (as defined in Section 6.4) compatible with the metric $g$.
We call $(\eta_1, \ldots, \eta_n)$-twisted metric on $(\Theta, g)$ for the splitting $\Theta_1, \ldots, \Theta_n$ the metric $g'$ on $\Theta$ defined by $g'|_{\Theta_i} = \frac{1}{\eta_i}\,g|_{\Theta_i}$ for $1 \leq i \leq n$, and $\Theta_i \perp \Theta_j$ for $i \neq j$.
Proposition 16. The $(\eta_\mu, \eta_\Sigma)$-twisted metric on $\mathbb{G}_d$ with the Fisher metric, for the splitting $\mathcal{N}(\mu, \Sigma) \mapsto (\mu, \Sigma)$, coincides with the $(\eta_\mu, \eta_\Sigma)$-twisted Fisher metric from Definition 11.
Proof. It is easy to see that the (ημ, ηΣ)-twisted Fisher metric satisfies the condition in Definition 17.

A2. Twisted Geodesics

The following theorem can be used to compute the twisted geodesics from the non-twisted geodesics. Its proof is a simple calculation.
Theorem 7. Let $\eta_\mu, \eta_\Sigma \in \mathbb{R}$, $\mu_0 \in \mathbb{R}^d$, $A_0 \in GL_d(\mathbb{R})$, and $(\dot\mu_0, \dot\Sigma_0) \in T_{\mathcal{N}(\mu_0, A_0A_0^T)}\mathbb{G}_d$. Let
$h:\ \mathbb{G}_d \to \mathbb{G}_d,\qquad \mathcal{N}(\mu, \Sigma) \mapsto \mathcal{N}\!\left(\sqrt{\tfrac{\eta_\mu}{\eta_\Sigma}}\,\mu,\ \Sigma\right).$
We denote by $\phi$ (resp. $\psi$) the Riemannian exponential of $\mathbb{G}_d$ (resp. of $\mathbb{G}_d$ with the $(\eta_\mu, \eta_\Sigma)$-twisted Fisher metric) at $\mathcal{N}\!\left(\sqrt{\tfrac{\eta_\Sigma}{\eta_\mu}}\,\mu_0,\ A_0A_0^T\right)$ (resp. at $\mathcal{N}(\mu_0, A_0A_0^T)$). We have:
$\psi(\dot\mu_0, \dot\Sigma_0) = h \circ \phi\!\left(\sqrt{\tfrac{\eta_\Sigma}{\eta_\mu}}\,\dot\mu_0,\ \dot\Sigma_0\right).$
Proof. Let us denote by $\begin{pmatrix} I_\mu & 0 \\ 0 & I_\Sigma \end{pmatrix}$ the Fisher metric in the parametrization $(\mu, \Sigma)$, and consider the following parametrization of $\mathbb{G}_d$: $(\tilde\mu, \Sigma) \mapsto \mathcal{N}\!\left(\sqrt{\tfrac{\eta_\Sigma}{\eta_\mu}}\,\tilde\mu,\ \Sigma\right)$.
The Riemannian exponential at $\mathcal{N}(\mu_0, A_0A_0^T)$ in this parametrization is:
$h \circ \phi \circ \left(\mathrm{d}h_{\mathcal{N}(\mu_0, A_0A_0^T)}\right)^{-1}.$
However, in this parametrization, the Fisher metric reads:
$\begin{pmatrix} \frac{\eta_\Sigma}{\eta_\mu}\,I_\mu & 0 \\ 0 & I_\Sigma \end{pmatrix},$
which is proportional to the $(\eta_\mu, \eta_\Sigma)$-twisted Fisher metric (up to a factor $\frac{1}{\eta_\Sigma}$). Consequently, the Christoffel symbols are the same as the Christoffel symbols of the $(\eta_\mu, \eta_\Sigma)$-twisted Fisher metric, and so are the geodesics. Therefore, we have:
$\psi = h \circ \phi \circ \left(\mathrm{d}h_{\mathcal{N}(\mu_0, A_0A_0^T)}\right)^{-1},$
which is what we wanted.
For the remainder of this section, we fix $\eta_\mu$ and $\eta_\Sigma$; $\mathbb{G}_d$ is endowed with the $(\eta_\mu, \eta_\Sigma)$-twisted Fisher metric, and $\tilde{\mathbb{G}}_d$ is endowed with the induced metric. The proofs of the propositions below are a simple rewriting of their non-twisted counterparts, which can be found in Sections 4 and 5.1, and they can be seen as corollaries of Theorem 7.
Theorem 8. If $\gamma: t \mapsto \mathcal{N}(\mu(t), \sigma(t)^2 I)$ is a twisted geodesic of $\tilde{\mathbb{G}}_d$, then there exist $a, b, c, d \in \mathbb{R}$ such that $ad - bc = 1$, and $v > 0$, such that $\mu(t) = \mu(0) + \sqrt{\frac{2d}{\eta_\mu\eta_\sigma}}\,\frac{\dot\mu_0}{\|\dot\mu_0\|}\,\tilde r(t)$ and $\sigma(t) = \operatorname{Im}(\gamma(t))$, with $\tilde r(t) = \operatorname{Re}(\gamma(t))$ and:
$\gamma(t) := \frac{a\,ie^{vt} + b}{c\,ie^{vt} + d}.$
Proposition 17. Let $n \in \mathbb{N}$, $v_\mu \in \mathbb{R}^n$, $v_\sigma, \eta_\mu, \eta_\sigma, \sigma_0 \in \mathbb{R}$, with $\sigma_0 > 0$.
Let $v_r := \|v_\mu\|$, $\lambda := \sqrt{\frac{2n}{\eta_\mu\eta_\sigma}}$, $v := \sqrt{\frac{\frac{1}{\lambda^2}v_r^2 + v_\sigma^2}{\sigma_0^2}}$, $M_0 := \frac{1}{\lambda}\,\frac{v_r}{v\,\sigma_0^2}$ and $S_0 := \frac{v_\sigma}{v\,\sigma_0^2}$.
Let $c := \left(\frac{\sqrt{M_0^2 + S_0^2} - S_0}{2}\right)^{1/2}$ and $d := \left(\frac{\sqrt{M_0^2 + S_0^2} + S_0}{2}\right)^{1/2}$.
Let $\gamma(t) := \sigma_0\,\frac{d\,ie^{vt} - c}{c\,ie^{vt} + d}$.
Then
$\gamma: t \mapsto \mathcal{N}\!\left(\mu_0 + \lambda\,\frac{v_\mu}{\|v_\mu\|}\,\operatorname{Re}(\gamma(t)),\ \operatorname{Im}(\gamma(t))\right)$
is the twisted geodesic of $\tilde{\mathbb{G}}_n$ satisfying $\gamma(0) = (\mu_0, \sigma_0)$ and $\dot\gamma(0) = (v_\mu, v_\sigma)$. The regular geodesics of $\tilde{\mathbb{G}}_n$ are obtained with $\eta_\mu = \eta_\sigma = 1$.
Theorem 9. Let $\gamma: t \mapsto \mathcal{N}(\mu_t, \Sigma_t)$ be a twisted geodesic of $\mathbb{G}_d$. Then, the following quantities are invariant:
$J_\mu = \frac{1}{\eta_\mu}\,\Sigma_t^{-1}\dot\mu_t,$
$J_\Sigma = \Sigma_t^{-1}\left(\frac{1}{\eta_\mu}\,\dot\mu_t\mu_t^T + \frac{1}{\eta_\Sigma}\,\dot\Sigma_t\right).$
Theorem 10. If $\mu: t \mapsto \mu_t$ and $\Sigma: t \mapsto \Sigma_t$ satisfy the equations:
$\dot\mu_t = \eta_\mu\,\Sigma_t J_\mu,$
$\dot\Sigma_t = \eta_\Sigma\,\Sigma_t\left(J_\Sigma - J_\mu\mu_t^T\right) = \eta_\Sigma\,\Sigma_t J_\Sigma - \frac{\eta_\Sigma}{\eta_\mu}\,\dot\mu_t\mu_t^T,$
where:
$J_\mu = \frac{1}{\eta_\mu}\,\Sigma_0^{-1}\dot\mu_0$
and:
$J_\Sigma = \Sigma_0^{-1}\left(\frac{1}{\eta_\mu}\,\dot\mu_0\mu_0^T + \frac{1}{\eta_\Sigma}\,\dot\Sigma_0\right),$
then $t \mapsto \mathcal{N}(\mu_t, \Sigma_t)$ is a twisted geodesic of $\mathbb{G}_d$.
Theorem 11. If $\mu: t \mapsto \mu_t$ and $A: t \mapsto A_t$ satisfy the equations:
$\dot\mu_t = \eta_\mu\,A_tA_t^T J_\mu,$
$\dot A_t = \frac{\eta_\Sigma}{2}\left(J_\Sigma - J_\mu\mu_t^T\right)^T A_t,$
where:
$J_\mu = \frac{1}{\eta_\mu}\,(A_0^{-1})^T A_0^{-1}\,\dot\mu_0$
and:
$J_\Sigma = (A_0^{-1})^T A_0^{-1}\left(\frac{1}{\eta_\mu}\,\dot\mu_0\mu_0^T + \frac{1}{\eta_\Sigma}\,\dot A_0 A_0^T + \frac{1}{\eta_\Sigma}\,A_0\dot A_0^T\right),$
then $t \mapsto \mathcal{N}(\mu_t, A_tA_t^T)$ is a twisted geodesic of $\mathbb{G}_d$.
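These first-order equations are what the Euler method in the GIGO-A variant integrates. The following sketch is a minimal illustration of such an integration; it is not the author's implementation [22], and in particular the step-size reduction and the choice of the initial velocity from the IGO speed are omitted.

```python
import numpy as np

def twisted_geodesic_euler(mu0, A0, mu_dot0, A_dot0, eta_mu, eta_sigma, T, steps):
    """Euler integration of the equations of Theorem 11 (sketch)."""
    inv_Sigma0 = np.linalg.inv(A0 @ A0.T)
    J_mu = inv_Sigma0 @ mu_dot0 / eta_mu
    J_Sigma = inv_Sigma0 @ (np.outer(mu_dot0, mu0) / eta_mu
                            + (A_dot0 @ A0.T + A0 @ A_dot0.T) / eta_sigma)
    mu, A, h = mu0.copy(), A0.copy(), T / steps
    for _ in range(steps):
        mu_dot = eta_mu * (A @ A.T) @ J_mu
        A_dot = eta_sigma / 2.0 * (J_Sigma - np.outer(J_mu, mu)).T @ A
        mu, A = mu + h * mu_dot, A + h * A_dot
    return mu, A        # N(mu, A A^T) approximates the twisted geodesic at time T
```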

A3. Pseudocodes

A3.1. For All Algorithms

All studied algorithms have a common part, given here:
Variables: $\mu$, $\Sigma$ (or $A$ such that $\Sigma = AA^T$).
List of parameters: $f: \mathbb{R}^d \to \mathbb{R}$, step size $\delta t$, learning rates $\eta_\mu, \eta_\Sigma$, sample size $\lambda$, weights $(w_i)_{1\leq i\leq\lambda}$, number of steps $N$ for the Euler method, Euler step size reduction factor $r$ (for GIGO-Σ only).
Algorithm 1. For all algorithms.
Notice that we always need a square root $A$ of $\Sigma$ to sample the $x_i$, but the decomposition $\Sigma = AA^T$ is not unique. Two different decompositions will give two algorithms such that one is a modification of the other as a stochastic process: same law (the $x_i$ are abstractly sampled from $\mathcal{N}(\mu, \Sigma)$), but different trajectories (for given $z_i$, different choices of the square root will give different $x_i$). For GIGO-Σ, since we have to invert the covariance matrix, we used the Cholesky decomposition ($A$ lower triangular; the other implementation directly maintains a square root of $\Sigma$). Usually, in CMA-ES, the symmetric square root of $\Sigma$ ($\Sigma = AA^T$, $A$ symmetric) is used.
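As a textual counterpart to Algorithm 1, here is a sketch of the common sampling and ranking step described above; the helper name `sample_and_rank` and the example data are purely illustrative, not taken from [22].

```python
import numpy as np

def sample_and_rank(mu, A, f, weights, rng):
    """Common part of the updates (sketch of Algorithm 1): sample z_i ~ N(0, I),
    set x_i = mu + A z_i, evaluate f, and order samples by rank (best first)."""
    N, d = len(weights), mu.size
    z = rng.standard_normal((N, d))
    x = mu + z @ A.T                        # x_i ~ N(mu, A A^T)
    idx = np.argsort([f(xi) for xi in x])
    return x[idx], z[idx], np.asarray(weights)

# Example with a Cholesky square root of Sigma (A lower triangular, as for GIGO-Sigma):
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.linalg.cholesky(Sigma)
x, z, w = sample_and_rank(np.zeros(2), A, lambda v: v @ v,
                          [0.5, 0.3, 0.2, 0.0], np.random.default_rng(0))
```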

A3.2. Updates

When describing the different updates, μ, Σ, A, the $x_i$ and the $z_i$ are those defined in Algorithm 1. For Algorithm 2 (GIGO-Σ), when the covariance matrix after one step is not positive-definite, we compute the update again, with a step size divided by $r$ for the Euler method (we have no reason to recommend any particular value of $r$; the only constraint is $r > 1$).
Algorithm 2. GIGO update, one step, updating the covariance matrix.
Algorithm 3. GIGO update, one step, updating a square root of the covariance matrix.
Algorithm 4. Exact GIGO, one step. Not exactly our implementation; see the discussion after Corollary 1.
Algorithm 5. xNES update, one step.
Algorithm 6. Pure rank-μ CMA-ES update, one step.
Algorithm 7. GIGO in $\tilde{\mathbb{G}}_d$, one step.

Acknowledgments

I would like to thank Yann Ollivier for his numerous remarks about this article and Frédéric Barbaresco for finding [5].

References

  1. Ollivier, Y.; Arnold, L.; Auger, A.; Hansen, N. Information-geometric optimization algorithms: A unifying picture via invariance principles. arXiv 2011, arXiv:1106.3708.
  2. Amari, S.-I.; Nagaoka, H. Methods of Information Geometry (Translations of Mathematical Monographs); American Mathematical Society: Providence, RI, USA, 2007.
  3. Malagò, L.; Pistone, G. Combinatorial optimization with information geometry: The Newton method. Entropy 2014, 16, 4260–4289.
  4. Eriksen, P. Geodesics Connected with the Fisher Metric on the Multivariate Normal Manifold; Technical Report 86-13; Institute of Electronic Systems, Aalborg University: Aalborg, Denmark, 1986.
  5. Calvo, M.; Oller, J.M. An Explicit Solution of Information Geodesic Equations for the Multivariate Normal Model. Stat. Decis. 1991, 9, 119–138.
  6. Imai, T.; Takaesu, A.; Wakayama, M. Remarks on geodesics for multivariate normal models. J. Math-for-Industry 2011, 3, 125–130.
  7. Skovgaard, L.T. A Riemannian geometry of the multivariate normal model. Scand. J. Stat. 1981, 11, 211–223.
  8. Porat, B.; Friedlander, B. Computation of the Exact Information Matrix of Gaussian Time Series with Stationary Random Components. IEEE Trans. Acoust. Speech Signal Process. 1986, 34, 118–130.
  9. Baluja, S.; Caruana, R. Removing the Genetics from the Standard Genetic Algorithm; Technical Report CMU-CS-95-141; Morgan Kaufmann Publishers: Burlington, MA, USA, 1995; pp. 38–46.
  10. Malagò, L.; Matteucci, M.; Pistone, G. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Proceedings of the 11th Workshop on Foundations of Genetic Algorithms, Schwarzenberg, Austria, 5–9 January 2011; pp. 230–242.
  11. Kern, S.; Müller, S.D.; Hansen, N.; Büche, D.; Ocenasek, J.; Koumoutsakos, P. Learning probability distributions in continuous evolutionary algorithms—A comparative review. Nat. Comput. 2003, 3, 77–112.
  12. Wierstra, D.; Schaul, T.; Glasmachers, T.; Sun, Y.; Peters, J.; Schmidhuber, J. Natural evolution strategies. J. Mach. Learn. Res. 2014, 15, 949–980.
  13. Huang, W. Optimization Algorithms on Riemannian Manifolds with Applications. Ph.D. Thesis, Florida State University, Tallahassee, FL, USA, 2013.
  14. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2008.
  15. Arnold, V.; Vogtmann, K.; Weinstein, A. Mathematical Methods of Classical Mechanics (Graduate Texts in Mathematics); Springer: New York, NY, USA, 1989.
  16. Bourguignon, J. Calcul variationnel; Ecole Polytechnique: Palaiseau, France, 2007. (In French)
  17. Jost, J.; Li-Jost, X. Calculus of Variations (Cambridge Studies in Advanced Mathematics); Cambridge University Press: Cambridge, UK, 1998.
  18. Gallot, S.; Hulin, D.; LaFontaine, J. Riemannian Geometry (Universitext), 3rd ed.; Springer: Berlin/Heidelberg, Germany, 2004.
  19. Glasmachers, T.; Schaul, T.; Yi, S.; Wierstra, D.; Schmidhuber, J. Exponential natural evolution strategies. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, Portland, OR, USA, 7–11 July 2010.
  20. Akimoto, Y.; Nagata, Y.; Ono, I.; Kobayashi, S. Bidirectional relation between CMA evolution strategies and natural evolution strategies. In Parallel Problem Solving from Nature, PPSN XI; Schaefer, R., Cotta, C., Kołodziej, J., Rudolph, G., Eds.; Springer: New York, NY, USA, 2010.
  21. Hansen, N. The CMA evolution strategy: A tutorial. Available online: https://www.lri.fr/~hansen/cmatutorial.pdf (accessed on 1 January 2015).
  22. Bensadon, J. Source Code. Available online: https://www.lri.fr/~bensadon/ (accessed on 13 January 2015).
  23. Akimoto, Y.; Ollivier, Y. Objective improvement in information-geometric optimization. In Proceedings of the Twelfth Workshop on Foundations of Genetic Algorithms XII, Adelaide, Australia, 16–20 January 2013.
Figure 1. Geodesics of the Poincaré half-plane.
Figure 2. One step of the geodesic IGO (GIGO) update.
Figure 3. Median number of function calls to reach $10^{-8}$ fitness on 24 runs for: sphere function, cigar-tablet function and Rosenbrock function. Initial position $\theta^0 = \mathcal{N}(x_0, I)$, with $x_0$ uniformly distributed on the circle of center zero and radius 10. We recall that the “CMA-ES” algorithm here uses the so-called pure rank-μ CMA-ES update.
Figure 4. Trajectories of GIGO, CMA and xNES optimizing $x \mapsto x^2$ in dimension one with δt = 0.01, sample size 5000, weights $w_i = 4\cdot\mathbb{1}_{i\leq 1250}$ and learning rates $\eta_\mu = 1$, $\eta_\Sigma = 1.8$. One dot every 100 steps. All algorithms exhibit a similar behavior.
Figure 5. Trajectories of GIGO, CMA and xNES optimizing $x \mapsto x^2$ in dimension one with δt = 0.5, sample size 5000, weights $w_i = 4\cdot\mathbb{1}_{i\leq 1250}$ and learning rates $\eta_\mu = 1$, $\eta_\Sigma = 1.8$. One dot every two steps. Stronger differences. Notice that after one step, the lowest mean is still GIGO (≈ 8.5, whereas xNES is around 8.75), but from the second step, GIGO has the highest mean, because of the lower variance.
Figure 6. Trajectories of GIGO, CMA and xNES optimizing $x \mapsto x^2$ in dimension one with δt = 0.1, sample size 5000, weights $w_i = 4\cdot\mathbb{1}_{i\leq 1250}$ and learning rates $\eta_\mu = 1$, $\eta_\Sigma = 1.8$. One dot every 10 steps. All algorithms exhibit a similar behavior, and differences start to appear. It cannot be seen on the graph, but the algorithm closest to zero after 400 steps is CMA ($\sim 1\cdot10^{-16}$), followed by xNES ($\sim 6\cdot10^{-16}$) and GIGO ($\sim 2\cdot10^{-15}$).
Figure 7. Trajectories of GIGO, CMA and xNES optimizing $x \mapsto x^2$ in dimension one with δt = 1, sample size 5000, weights $w_i = 4\cdot\mathbb{1}_{i\leq 1250}$ and learning rates $\eta_\mu = 1$, $\eta_\Sigma = 1.8$. One dot per step. The CMA-ES algorithm fails here, because at the fourth step, the covariance matrix is not positive definite anymore (it is easy to see that the CMA-ES update is always defined if $\delta t\,\eta_\Sigma < 1$, but this is not the case here). Furthermore, notice (see also Proposition 15) that at the first step, GIGO decreases the variance, whereas the σ-component of the IGO speed is positive.
Figure 8. Trajectories of GIGO, CMA and xNES optimizing $x \mapsto x^2$ in dimension one with δt = 1.5, sample size 5000, weights $w_i = 4\cdot\mathbb{1}_{i\leq 1250}$ and learning rates $\eta_\mu = 1$, $\eta_\Sigma = 1.8$. One dot per step. Same as δt = 1 for CMA. GIGO converges prematurely.
Figure 9. Trajectories of GIGO, CMA and xNES optimizing $x \mapsto -x$ in dimension one with δt = 0.01, sample size 5000, weights $w_i = 4\cdot\mathbb{1}_{i\leq 1250}$ and learning rates $\eta_\mu = 1$, $\eta_\Sigma = 1.8$. One dot every 100 steps. Almost the same for all algorithms.
Figure 10. Trajectories of GIGO, CMA and xNES optimizing $x \mapsto -x$ in dimension one with δt = 0.1, sample size 5000, weights $w_i = 4\cdot\mathbb{1}_{i\leq 1250}$ and learning rates $\eta_\mu = 1$, $\eta_\Sigma = 1.8$. One dot every 10 steps. It is not obvious on the graph, but xNES is faster than CMA, which is faster than GIGO.
Figure 11. Trajectories of GIGO, CMA and xNES optimizing $x \mapsto -x$ in dimension one with δt = 1, sample size 5000, weights $w_i = 4\cdot\mathbb{1}_{i\leq 1250}$ and learning rates $\eta_\mu = 1$, $\eta_\Sigma = 1.8$. One dot per step. GIGO converges, for the reasons discussed earlier.
