Article

On a 2-Relative Entropy

James Fullwood
School of Mathematical Sciences, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
Submission received: 4 December 2021 / Revised: 27 December 2021 / Accepted: 27 December 2021 / Published: 31 December 2021
(This article belongs to the Special Issue Measures of Information II)

Abstract

We construct a 2-categorical extension of the relative entropy functor of Baez and Fritz, and show that our construction is functorial with respect to vertical morphisms. Moreover, we show that such a ‘2-relative entropy’ satisfies natural 2-categorical analogues of convex linearity, vanishing under optimal hypotheses, and lower semicontinuity. While relative entropy is a relative measure of information between probability distributions, we view our construction as a relative measure of information between channels.

1. Introduction

Let $X$ and $Y$ be finite sets which are the input and output alphabets of a discrete memoryless channel $X \xrightarrow{f} Y$ with probability transition matrix $f_{yx}$, representing the probability of the output $y$ given the input $x$. Every input $x$ then determines a probability distribution on $Y$, which we denote by $f_x$, so that $f_x(y) = f_{yx}$ for all $x \in X$ and $y \in Y$. The channel $X \xrightarrow{f} Y$ together with the choice of a prior distribution $p$ on $X$ will be denoted $(f|p)$, and such data then determine a distribution $\vartheta(f|p)$ on $X \times Y$ given by $\vartheta(f|p)(x,y) = p_x f_{yx}$. Given a second channel $X \xrightarrow{g} Y$ with prior distribution $q$ on $X$, the chain rule for relative entropy says that the relative entropy $D\big(\vartheta(f|p), \vartheta(g|q)\big)$ is given by

$$D\big(\vartheta(f|p), \vartheta(g|q)\big) = D(p,q) + \sum_{x \in X} p_x \, D(f_x, g_x). \tag{1}$$

As the RHS of (1) involves precisely the datum of the channels $f$ and $g$ together with the prior distributions $p$ and $q$, we view the quantity $D\big(\vartheta(f|p), \vartheta(g|q)\big)$ as a relative measure of information between the channels $(f|p)$ and $(g|q)$. In particular, since from a Bayesian perspective $D(p,q)$ may be thought of as the amount of information gained upon discovering that the assumed prior distribution $p$ is actually $q$, it seems only natural to think of $D\big(\vartheta(f|p), \vartheta(g|q)\big)$ as the amount of information gained upon learning that the assumed channel $(f|p)$ is actually the channel $(g|q)$.
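As a concrete sanity check on (1), the following Python sketch verifies the chain rule numerically on randomly generated data. It is an illustration of ours rather than anything from [1] or [3], and it assumes the convention that a channel is stored as a column-stochastic matrix whose entry f[y, x] equals $f_{yx}$, so that column x is the distribution $f_x$.

```python
import numpy as np

def D(p, q):
    # Relative entropy D(p, q) in bits, with the convention 0 log 0 = 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

rng = np.random.default_rng(0)

def random_dist(n):
    v = rng.random(n)
    return v / v.sum()

def random_channel(nx, ny):
    # Column-stochastic matrix: column x is the distribution f_x on Y.
    return np.column_stack([random_dist(ny) for _ in range(nx)])

nx, ny = 3, 4
p, q = random_dist(nx), random_dist(nx)
f, g = random_channel(nx, ny), random_channel(nx, ny)

# Joint distributions theta(f|p)(x, y) = p_x f_{yx}, flattened over X x Y.
joint_fp = (f * p).T.ravel()
joint_gq = (g * q).T.ravel()

lhs = D(joint_fp, joint_gq)
rhs = D(p, q) + sum(p[x] * D(f[:, x], g[:, x]) for x in range(nx))
assert np.isclose(lhs, rhs)  # the chain rule (1)
```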
To make such a Bayesian interpretation more precise, we build upon the work of Baez and Fritz [1], who formulated a type of Bayesian inference as a process $X \to Y$ (including a set of conditional hypotheses on the outcomes of the process) which, given a prior distribution $p$ on $X$, yields distributions $r$ on $Y$ and $q$ on $X$ in such a way that the relative entropy $D(p,q)$ has an operational meaning as a quantity associated with a Bayesian updating with respect to the process $X \to Y$. (Here, $X$ may be thought of more generally as the set of possible states of some system to be measured, while $Y$ may be thought of as the set of possible outcomes of the measurement.) Baez and Fritz then proved that, up to a constant multiple, the map on such processes given by

$$(X \to Y) \;\longmapsto\; \mathrm{RE}(X \to Y) := D(p,q) \tag{2}$$

is the unique map satisfying the following axioms.
  • Functoriality: Given a composition of processes $X \to Y \to Z$,
    $$\mathrm{RE}(X \to Y \to Z) = \mathrm{RE}(X \to Y) + \mathrm{RE}(Y \to Z).$$
  • Convex Linearity: Given a collection of processes $U_x \to V_x$ indexed by the elements $x \in X$ of a finite probability space $(X, p)$,
    $$\mathrm{RE}\Big(\sum_{x \in X} p_x (U_x \to V_x)\Big) = \sum_{x \in X} p_x \, \mathrm{RE}(U_x \to V_x).$$
  • Vanishing Under Optimal Hypotheses: If the conditional hypotheses associated with a process $X \to Y$ are optimal, then
    $$\mathrm{RE}(X \to Y) = 0.$$
  • Continuity: The map $(X \to Y) \mapsto \mathrm{RE}(X \to Y)$ is lower semicontinuous.
While Baez and Fritz facilitated their exposition using the language of category theory [2], knowing that a category consists of a class of objects together with a class of composable arrows (i.e., morphisms) between objects is all that is needed for an appreciation of their construction. From such a perspective, the aforementioned processes $X \to Y$ are morphisms in a category FinStat, and the relative entropy assignment given by (2) is then a map from morphisms in FinStat to $[0, \infty]$.
In what follows, we elevate the construction of Baez and Fritz to the level of 2-categories (or, more precisely, double categories), whose 2-morphisms may be viewed as certain processes between processes, or rather, processes which connect one channel to another. In particular, in Section 2 we formally introduce the category FinStat of Baez and Fritz, and in Section 3 we review their functorial characterization of relative entropy using FinStat. In Section 4, we construct a category FinStat₂ which is a 2-level extension of FinStat, and in Section 5, we define a convex structure on FinStat₂. In Section 6, we define a relative measure of information between channels which we refer to as conditional relative entropy, and show that it is convex linear and functorial with respect to vertical morphisms in FinStat₂. The conditional relative entropy is then used in Section 7 to define a relative entropy assignment RE₂ on 2-morphisms via the chain rule (1) (for more on the chain rule for relative entropy, one may consult Chapter 2 of [3]). Moreover, we show that such a ‘2-relative entropy’ satisfies the natural 2-level analogues of axioms 1–4 satisfied by the relative entropy map RE.
As abstract as a relative entropy of processes between processes may seem, Shannon's Noisy Channel Coding Theorem, a cornerstone of information theory, is essentially a statement about transforming a noisy channel into a noiseless one via a sequence of encodings and decodings. From such a viewpoint, information theory is fundamentally about processes (i.e., sequences of encodings and decodings) between processes (i.e., channels), and it is precisely this viewpoint with which we will proceed. Furthermore, there has been growing recent interest in axiomatic and categorical approaches to information theory [4,5,6,7,8,9,10,11,12,13], and the present work is a direct outgrowth of such activity.

2. The Category FinStat

In this section, we introduce the first-level structure of interest, which is the category FinStat introduced by Baez and Fritz [1]. Though we use the language of categories, knowing that a category consists of a class of composable arrows between a class of objects is sufficient for the comprehension of all categorical notions in this work.
Definition 1.
Let $X$ and $Y$ be finite sets. A discrete memoryless channel (or simply channel for short) $X \xrightarrow{f} Y$ associates every $x \in X$ with a probability distribution $f_x$ on $Y$. In such a case, the sets $X$ and $Y$ are referred to as the set of inputs and the set of outputs of the channel $f$, respectively, and $f_x(y)$ is the probability of receiving the output $y$ given the input $x$, which will be denoted by $f_{yx}$.
Definition 2.
If $X \xrightarrow{f} Y$ and $Y \xrightarrow{g} Z$ are channels, then the composition $X \xrightarrow{g \circ f} Z$ is given by

$$(g \circ f)_{zx} = \sum_{y \in Y} g_{zy} f_{yx}$$

for all $x \in X$ and $z \in Z$.
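Under the column-stochastic convention of the Introduction, Definition 2 says that composition of channels is ordinary matrix multiplication. A minimal sketch, with hypothetical toy matrices of our own:

```python
import numpy as np

f = np.array([[0.9, 0.2],      # f : X -> Y with |X| = |Y| = 2;
              [0.1, 0.8]])     #   column x holds the distribution f_x
g = np.array([[0.7, 0.0],      # g : Y -> Z with |Z| = 3
              [0.2, 0.5],
              [0.1, 0.5]])

gf = g @ f                     # (g o f)_{zx} = sum_y g_{zy} f_{yx}
assert np.allclose(gf.sum(axis=0), 1.0)   # columns are still distributions
```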
Remark 1.
If $X \xrightarrow{f} Y$ is a channel such that for every $x \in X$ there exists a $y \in Y$ with $f_{yx} = 1$, then such a $y$ is necessarily unique given $x$, and as such, $f$ may be identified with a function $f : X \to Y$. In such a case, we say that the channel $f$ is pure (or deterministic), and from here on we will not distinguish between a pure channel and the associated function from its set of inputs to its set of outputs.
Definition 3.
If $\star$ denotes a set with a single element, then a channel $\star \xrightarrow{p} X$ is simply a probability distribution on $X$, and in such a case, we will use $p_x$ to denote the probability of $x$ as given by $p$ for all $x \in X$. The pair $(X, p)$ is then referred to as a finite probability space.
Notation 1.
The datum of a channel $X \xrightarrow{f} Y$ together with a prior distribution $\star \xrightarrow{p} X$ on its set of inputs will be denoted $(f|p)$.
Definition 4.
Let FinStat denote the category whose objects are finite probability spaces, and whose morphisms $(X, p) \to (Y, q)$ consist of the following data:
  • A function $f : X \to Y$ such that $f \circ p = q$;
  • A channel $Y \xrightarrow{s} X$ such that $f \circ s = \mathrm{id}_Y$. In other words, $Y \xrightarrow{s} X$ is a stochastic section of $f : X \to Y$.
A morphism in FinStat is then summarized by a diagram of the form

[Diagram (3): the function $f : X \to Y$ and the stochastic section $Y \xrightarrow{s} X$, together with the priors $\star \xrightarrow{p} X$ and $\star \xrightarrow{q} Y$]

and a composition of morphisms in FinStat is obtained via function composition and composition of stochastic sections (it is straightforward to show that a composite of stochastic sections is again a stochastic section). The morphism corresponding to diagram (3) will often be denoted $(f, p, s)$.
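For readers who like to compute, here is a minimal numerical sketch of the two conditions defining a morphism $(f, p, s)$, with the deterministic $f$ encoded as a 0/1 column-stochastic matrix. The particular numbers are our own, and the final assertion anticipates the notion of optimal hypothesis defined below (Definition 5).

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])            # prior on X = {0, 1, 2}
f = np.array([[1., 1., 0.],              # deterministic f : X -> Y with
              [0., 0., 1.]])             #   f(0) = f(1) = 0 and f(2) = 1
q = f @ p                                # q = f o p, the pushforward of p

s = np.array([[0.5/0.8, 0.0],            # s : Y -> X, supported on the
              [0.3/0.8, 0.0],            #   fibres of f
              [0.0,     1.0]])

assert np.allclose(f @ s, np.eye(2))     # f o s = id_Y: s is a section
assert np.allclose(s @ q, p)             # this s is even optimal (Def. 5)
```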
Remark 2.
Note that in diagram (3), a straight arrow is used for $X \xrightarrow{f} Y$, as $f$ is a function, as opposed to a noisy channel.
Remark 3.
The operational interpretation of diagram (3) is as follows. The set $X$ is thought of as the set of possible states of the system, $f : X \to Y$ as a measurement process, and $Y$ as the set of possible states of some measuring apparatus. The stochastic section $Y \xrightarrow{s} X$ is then thought of as a set of hypotheses about the state of the system given a state of the measuring apparatus. In particular, $s_{xy}$ is the probability that the system was in state $x$ given the state $y$ of the measuring apparatus.
Definition 5.
If the stochastic section $Y \xrightarrow{s} X$ in diagram (3) is such that $s \circ q = p$, then $s$ will be referred to as an optimal hypothesis for $(f|p)$.
Definition 6.
Let $(X, p)$ be a finite probability space, and let $U_x \xrightarrow{\mu_x} V_x$ be a collection of channels with prior distributions $\star \xrightarrow{q_x} U_x$ indexed by $X$. The convex combination of $(\mu_x | q_x)$ with respect to $(X, p)$ is the channel

$$\coprod_{x \in X} U_x \xrightarrow{\;\coprod_{x \in X} \mu_x\;} \coprod_{x \in X} V_x$$

with prior distribution $\sum_{x \in X} p_x q_x$, where $\coprod_{x \in X} \mu_x$ is the channel given by

$$\Big(\coprod_{x \in X} \mu_x\Big)_{vu} = \begin{cases} (\mu_x)_{vu} & \text{if } (v, u) \in V_x \times U_x \text{ for some } x \in X \\ 0 & \text{otherwise}, \end{cases}$$

and the prior distribution $\star \xrightarrow{\sum_{x \in X} p_x q_x} \coprod_{x \in X} U_x$ is given by

$$\Big(\sum_{x \in X} p_x q_x\Big)_u = p_{x_u} (q_{x_u})_u,$$

where $x_u$ is such that $u \in U_{x_u}$. Such a convex combination will be denoted $\sum_{x \in X} p_x (\mu_x | q_x)$.
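In matrix terms, the convex combination of Definition 6 is a block-diagonal channel on the disjoint union of the $U_x$, paired with the $p$-weighted concatenation of the priors $q_x$. A sketch for two summands, assuming the same conventions as before (the toy data are hypothetical, and SciPy's block_diag stands in for the disjoint union):

```python
import numpy as np
from scipy.linalg import block_diag

p = np.array([0.6, 0.4])                        # prior on X = {0, 1}
mu0 = np.array([[0.9, 0.3], [0.1, 0.7]])        # mu_0 : U_0 -> V_0
mu1 = np.array([[0.5], [0.5]])                  # mu_1 : U_1 -> V_1
q0, q1 = np.array([0.2, 0.8]), np.array([1.0])  # priors on U_0 and U_1

mu = block_diag(mu0, mu1)                       # the channel on U_0 ⊔ U_1
prior = np.concatenate([p[0] * q0, p[1] * q1])  # (sum_x p_x q_x)_u = p_{x_u} (q_{x_u})_u

assert np.allclose(mu.sum(axis=0), 1.0)         # columns are distributions
assert np.isclose(prior.sum(), 1.0)             # the weighted prior is a distribution
```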

3. The Baez and Fritz Characterization of Relative Entropy

We now recall the Baez and Fritz characterization of relative entropy in FinStat .
Definition 7.
Let $(X, p)$ be a finite probability space, and let

$$U_x \xrightarrow{\mu_x} V_x, \qquad \star \xrightarrow{q_x} U_x, \qquad V_x \xrightarrow{s_x} U_x$$

be a collection of morphisms in FinStat indexed by $X$. The convex combination of $(\mu_x, q_x, s_x)$ with respect to $(X, p)$ is the morphism $\sum_{x \in X} p_x (\mu_x, q_x, s_x)$ in FinStat corresponding to the diagram

[Diagram: the convex combination $\coprod_{x \in X} U_x \to \coprod_{x \in X} V_x$ together with its weighted priors and stochastic sections]

where $r_x = \mu_x \circ q_x$ for all $x \in X$.
Definition 8.
Let $(f, p, s)$ be a morphism in FinStat, and let $r = s \circ f \circ p$. The relative entropy of $(f, p, s)$ is the non-negative extended real number $\mathrm{RE}(f, p, s) \in [0, \infty]$ given by

$$\mathrm{RE}(f, p, s) = D(p, r),$$

where $D(p, r) = \sum_x p_x \log(p_x / r_x)$ is the relative entropy between the distributions $p$ and $r$ on $X$.
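As a worked numerical instance of Definition 8, the following self-contained sketch computes $\mathrm{RE}(f, p, s) = D(p, s \circ f \circ p)$ for a deliberately non-optimal hypothesis $s$; the matrices are our own toy data.

```python
import numpy as np

def D(p, q):
    # Relative entropy with the convention 0 log 0 = 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])            # prior on X
f = np.array([[1., 1., 0.],              # deterministic f : X -> Y
              [0., 0., 1.]])
s = np.array([[0.7, 0.0],                # a non-optimal hypothesis s,
              [0.3, 0.0],                #   still a section of f
              [0.0, 1.0]])

assert np.allclose(f @ s, np.eye(2))     # f o s = id_Y
r = s @ (f @ p)                          # r = s o f o p
print(D(p, r))                           # RE(f, p, s) > 0 here; an optimal
                                         #   hypothesis would give r = p, RE = 0
```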
Definition 9.
Let $F : \mathrm{FinStat} \to [0, \infty]$ be a map from the morphisms in FinStat to the extended non-negative reals $[0, \infty]$.
  • $F$ is said to be functorial if and only if for every composition $(g \circ f, p, s \circ t)$ of morphisms in FinStat we have
    $$F(g \circ f, p, s \circ t) = F(f, p, s) + F(g, f \circ p, t).$$
  • $F$ is said to be convex linear if and only if for every convex combination $\sum_{x \in X} p_x (\mu_x, q_x, s_x)$ of morphisms in FinStat we have
    $$F\Big(\sum_{x \in X} p_x (\mu_x, q_x, s_x)\Big) = \sum_{x \in X} p_x F(\mu_x, q_x, s_x).$$
  • $F$ is said to be vanishing under optimal hypotheses if and only if for every morphism $(f, p, s)$ in FinStat with $s$ an optimal hypothesis we have
    $$F(f, p, s) = 0.$$
  • $F$ is said to be lower semicontinuous if and only if for every sequence of morphisms $(f, p_n, s_n)$ in FinStat converging to a morphism $(f, p, s)$ we have
    $$F(f, p, s) \le \liminf_{n \to \infty} F(f, p_n, s_n).$$
Theorem 1
(The Baez and Fritz Characterization of Relative Entropy). Let $\mathcal{S}$ be the collection of maps from the morphisms in FinStat to $[0, \infty]$ which are functorial, convex linear, vanishing under optimal hypotheses, and lower semicontinuous. Then, the following statements hold.
1. The relative entropy RE is an element of $\mathcal{S}$;
2. If $F \in \mathcal{S}$, then $F = c \, \mathrm{RE}$ for some non-negative constant $c \in \mathbb{R}$.

4. The Category FinStat₂

In this section, we introduce the second-level structure of interest, namely, the double category FinStat₂, which is a 2-level extension of FinStat.
Definition 10.
Let FinStat₂ denote the 2-category whose objects and 1-morphisms coincide with those of FinStat, and whose 2-morphisms are constructed as follows. Given 1-morphisms

[Diagram (4): 1-morphisms $(\mu, p, s) : (X, p) \to (X', p')$ and $(\nu, q, t) : (Y, q) \to (Y', q')$]

a 2-morphism ♠ $: (\mu, p, s) \Rightarrow (\nu, q, t)$ consists of channels $X \xrightarrow{f} Y$ and $X' \xrightarrow{f'} Y'$ such that
  • $f \circ p = q$;
  • $f' \circ p' = q'$;
  • $\nu \circ f = f' \circ \mu$.
The 2-morphism ♠ $: (\mu, p, s) \Rightarrow (\nu, q, t)$ may then be summarized by the following diagram.

[Diagram (5): the square formed by $\mu$, $\nu$, $f$, $f'$, together with the priors $p$, $p'$, $q$, $q'$ and the outer ‘wings’ $s$ and $t$]
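The three bulleted conditions are straightforward to test numerically. Below is a sketch of a toy 2-morphism, assuming the column-stochastic conventions used earlier, with the pure channels $\mu$ and $\nu$ encoded as 0/1 matrices; all matrices are hypothetical.

```python
import numpy as np

mu = np.array([[1., 1., 0.],             # pure mu : X -> X', collapsing
               [0., 0., 1.]])            #   x = 0, 1 to a single point
nu = np.eye(2)                           # pure nu : Y -> Y' (identity here)
f  = np.array([[0.9, 0.9, 0.2],          # f : X -> Y
               [0.1, 0.1, 0.8]])
fp = np.array([[0.9, 0.2],               # f' : X' -> Y'
               [0.1, 0.8]])

p  = np.array([0.5, 0.3, 0.2])           # prior on X
pp = mu @ p                              # p' = mu o p
q, qp = f @ p, fp @ pp                   # q = f o p and q' = f' o p'

assert np.allclose(nu @ f, fp @ mu)      # the square nu o f = f' o mu commutes
```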
Remark 4.
A crucial point is that in the above diagram, all arrows necessarily commute except for any compositions involving the outer ‘wings’ $s$ and $t$. For example, the compositions $s \circ \mu \circ p$ and $t \circ f' \circ \mu$ need not be equal to $p$ and $f$, respectively.
Remark 5.
Diagram (5) should be thought of as a flattened out pyramid, whose base is the inner square and whose vertex is obtained by the identification of the upper and lower stars in the diagram.
Remark 6.
For an operational interpretation of a 2-morphism in FinStat₂ as given by diagram (5), one may consider $X$ and $Y$ as sample spaces associated with all possible outcomes of experiments $E_X$ and $E_Y$. As the sets $X$ and $Y$ are endowed with prior distributions $p$ and $q$, the maps $\mu : X \to X'$ and $\nu : Y \to Y'$ are then random variables with values in $X'$ and $Y'$, and the stochastic sections $s$ and $t$ then represent conditional hypotheses about the outcomes of the measurements corresponding to $\mu$ and $\nu$. The channels $f : X \to Y$ and $f' : X' \to Y'$ then represent stochastic processes such that taking the measurement $\mu$ followed by $f'$ results in the same process as first letting $X$ evolve according to $f$ and then taking the measurement $\nu$.
Example 1.
For a real-life scenario which realizes a 2-morphism in FinStat₂, suppose two experimenters Alice and Bob are collaborating on a project to verify the predictions of a theory. As such, Alice and her data analyst partner Alicia travel to a mountain in Brazil during a solar eclipse to perform experiments, while Bob and his data analyst partner Bernie travel to a mountain in Montenegro at the same time for the same purpose. Alice and Bob will then perform experiments in their separate locations and hand their results over to Alicia and Bernie, who will then analyze the data to produce numerical results. At the end of each day, Alice will report her results to Bob over a noisy channel, while Alicia will report her results to Bernie over a noisy channel, so that Bob and Bernie may compare their results with Alice and Alicia's. We then summarize such a scenario with the following 2-morphism in FinStat₂:

[Diagram (6): the 2-morphism summarizing the Alice/Alicia and Bob/Bernie scenario]

In diagram (6), $p$ and $q$ are assumed prior distributions on Alice and Bob's measurements, while $s$ and $t$ are empirical conditional distributions on Alice and Bob's measurements given the data outcomes of Alicia and Bernie's analysis. Moreover, if the communication channel $f$ is less reliable than $f'$, then the composition $t \circ f' \circ \mu$ provides a Bayesian updating for the channel $f$.
For vertical composition of 2-morphisms, suppose ♣ $: (\mu', p', s') \Rightarrow (\nu', q', t')$ is the 2-morphism summarized by the following diagram.

[Diagram: the 2-morphism ♣ with channels $X' \xrightarrow{f'} Y'$ and $X'' \xrightarrow{f''} Y''$]

The vertical composition ♣ ∘ ♠ is then summarized by the following diagram.

[Diagram: the vertical composition ♣ ∘ ♠ $: (\mu' \circ \mu, p, s \circ s') \Rightarrow (\nu' \circ \nu, q, t \circ t')$]

For horizontal composition, let ♦ $: (\nu, q, t) \Rightarrow (\xi, r, u)$ be a 2-morphism summarized by the following diagram.

[Diagram: the 2-morphism ♦ $: (\nu, q, t) \Rightarrow (\xi, r, u)$]

The horizontal composition of ♠ and ♦ is then summarized by the following diagram.

[Diagram: the horizontal composition of ♠ and ♦]

5. Convexity in FinStat₂

We now generalize the convex structure on morphisms in FinStat to 2-morphisms in FinStat₂. For this, let $(X, p)$ be a finite probability space, and let $\spadesuit_x$ be a collection of 2-morphisms in FinStat₂ indexed by $X$, where $\spadesuit_x$ is summarized by the following diagram.

[Diagram: the 2-morphism $\spadesuit_x$ with source 1-morphism $(\mu_x, q_x, s_x)$ and channels $f_x$, $f'_x$]

Definition 11.
The convex sum $\sum_{x \in X} p_x \spadesuit_x$ is the 2-morphism in FinStat₂ summarized by the following diagram.

[Diagram: the convex sum of the $\spadesuit_x$, built from the disjoint unions of the underlying channels together with the weighted priors]

6. Conditional Relative Entropy in FinStat₂

We now introduce a measure of information associated with 2-morphisms in FinStat₂ which we refer to as ‘conditional relative entropy’. The results proved in this section are essentially all lemmas for the results of the next section, where we introduce a 2-level extension of the relative entropy map RE and show that it satisfies the 2-level analogues of the characterizing axioms of relative entropy.
Definition 12.
With every 2-morphism

[Diagram: ♠ $: (\mu, p, s) \Rightarrow (\nu, q, t)$ with channels $X \xrightarrow{f} Y$ and $X' \xrightarrow{f'} Y'$]

we associate the non-negative extended real number $\mathrm{CE}(\spadesuit) \in [0, \infty]$ given by

$$\mathrm{CE}(\spadesuit) = \sum_{x \in X} p_x \, D\big(f_x, (t \circ f' \circ \mu)_x\big), \tag{7}$$

where $D(\cdot, \cdot)$ is the standard relative entropy. We refer to $\mathrm{CE}(\spadesuit)$ as the conditional relative entropy of ♠.
Remark 7.
We refer to $\mathrm{CE}(\spadesuit)$ as conditional relative entropy, as its defining formula (7) is structurally similar to the defining formula for conditional entropy. In particular, if $X \xrightarrow{f} Y$ is a channel with prior distribution $\star \xrightarrow{p} X$, then the conditional entropy $H(f|p)$ is given by

$$H(f|p) = \sum_{x \in X} p_x H(f_x),$$

where $H(f_x)$ is the Shannon entropy of the distribution $f_x$ on $Y$.
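As a numerical illustration of (7), the following self-contained sketch computes $\mathrm{CE}(\spadesuit)$ for a toy 2-morphism in which $X'$ and $Y'$ are one-point sets, so that $t \circ f' \circ \mu$ is the constant channel determined by $t$; the numbers are our own.

```python
import numpy as np

def D(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p  = np.array([0.6, 0.4])                # prior on X = {0, 1}
mu = np.array([[1., 1.]])                # pure mu : X -> X' = {*}
nu = np.array([[1., 1.]])                # pure nu : Y -> Y' = {*}
f  = np.array([[0.8, 0.3],               # f : X -> Y
               [0.2, 0.7]])
fp = np.array([[1.]])                    # f' : X' -> Y' (trivial)
t  = np.array([[0.6],                    # t : Y' -> Y, a stochastic
               [0.4]])                   #   section of nu

assert np.allclose(nu @ t, np.eye(1))    # nu o t = id_{Y'}
g = t @ fp @ mu                          # the channel t o f' o mu : X -> Y
ce = sum(p[x] * D(f[:, x], g[:, x]) for x in range(2))
print(ce)                                # CE of the 2-morphism, Equation (7)
```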
Proposition 1.
Conditional relative entropy in FinStat₂ is convex linear, i.e., if $(X, p)$ is a finite probability space and $\spadesuit_x$ is a collection of 2-morphisms in FinStat₂ indexed by $X$, then

$$\mathrm{CE}\Big(\sum_{x \in X} p_x \spadesuit_x\Big) = \sum_{x \in X} p_x \, \mathrm{CE}(\spadesuit_x).$$
Proof. 
Suppose $\sum_{x \in X} p_x \spadesuit_x$ is summarized by the following diagram

[Diagram: the convex sum $\sum_{x \in X} p_x \spadesuit_x$]

and let $U = \coprod_{x \in X} U_x$, $\mu = \coprod_{x \in X} \mu_x$, $f = \coprod_{x \in X} f_x$, $f' = \coprod_{x \in X} f'_x$ and $t = \coprod_{x \in X} t_x$. We then have

$$\begin{aligned} \mathrm{CE}\Big(\sum_{x \in X} p_x \spadesuit_x\Big) &= \sum_{u \in U} q_u \, D\big(f_u, (t \circ f' \circ \mu)_u\big) \\ &= \sum_{x \in X} \sum_{u_x \in U_x} p_x (q_x)_{u_x} D\big(f_{u_x}, (t \circ f' \circ \mu)_{u_x}\big) \\ &= \sum_{x \in X} p_x \sum_{u_x \in U_x} (q_x)_{u_x} D\big((f_x)_{u_x}, (t_x \circ f'_x \circ \mu_x)_{u_x}\big) \\ &= \sum_{x \in X} p_x \, \mathrm{CE}(\spadesuit_x), \end{aligned}$$
as desired. □
Theorem 2.
Conditional relative entropy in FinStat₂ is functorial with respect to vertical composition, i.e., if ♣ ∘ ♠ is a vertical composition in FinStat₂, then $\mathrm{CE}(\clubsuit \circ \spadesuit) = \mathrm{CE}(\spadesuit) + \mathrm{CE}(\clubsuit)$.
Lemma 1.
Let $X \xrightarrow{f} Y \xrightarrow{g} Z$ be a composition of channels.
1. If $f$ is a pure channel, then $(g \circ f)_{zx} = g_{zf(x)}$;
2. If $g$ is a stochastic section of a pure channel $Z \xrightarrow{h} Y$, then $(g \circ f)_{zx} = g_{zh(z)} f_{h(z)x}$.
Proof. 
Statements 1 and 2 follow immediately from the definitions of pure channel and stochastic section. □
Lemma 2.
Let ♠ be a 2-morphism in FinStat₂ as summarized by the diagram

[Diagram: ♠ $: (\mu, p, s) \Rightarrow (\nu, q, t)$ with channels $X \xrightarrow{f} Y$ and $X' \xrightarrow{f'} Y'$]

Then,
1. $\mathrm{CE}(\spadesuit) = \displaystyle\sum_{x \in X} \sum_{y \in Y} p_x f_{yx} \log \frac{f_{yx}}{t_{y\nu(y)} f'_{\nu(y)\mu(x)}}$;
2. $p'_{x'} f'_{y'x'} = \displaystyle\sum_{x \in \mu^{-1}(x')} \sum_{y \in \nu^{-1}(y')} p_x f_{yx}$ for all $(x', y') \in X' \times Y'$.
Proof. 
To prove item (1), let $x \in X$ and $y \in Y$. Then,

$$(t \circ f' \circ \mu)_{yx} = \big(t \circ (f' \circ \mu)\big)_{yx} = \sum_{y' \in Y'} t_{yy'} (f' \circ \mu)_{y'x} = \sum_{y' \in Y'} t_{yy'} f'_{y'\mu(x)} = t_{y\nu(y)} f'_{\nu(y)\mu(x)},$$

where the third equality follows from Lemma 1 since $\mu$ is a pure channel, and the fourth equality also follows from Lemma 1 since $t$ is a stochastic section of a pure channel. We then have

$$\mathrm{CE}(\spadesuit) = \sum_{x \in X} p_x D\big(f_x, (t \circ f' \circ \mu)_x\big) = \sum_{x \in X} \sum_{y \in Y} p_x f_{yx} \log \frac{f_{yx}}{(t \circ f' \circ \mu)_{yx}} = \sum_{x \in X} \sum_{y \in Y} p_x f_{yx} \log \frac{f_{yx}}{t_{y\nu(y)} f'_{\nu(y)\mu(x)}},$$

as desired.
To prove item (2), note that the condition $\nu \circ f = f' \circ \mu$ is equivalent to the equation $(\nu \circ f)_{y'x} = (f' \circ \mu)_{y'x}$ for all $y' \in Y'$ and $x \in X$. In addition, since

$$(f' \circ \mu)_{y'x} = f'_{y'\mu(x)}$$

and

$$(\nu \circ f)_{y'x} = \sum_{y \in Y} \nu_{y'y} f_{yx} = \sum_{y \in \nu^{-1}(y')} \nu_{y'y} f_{yx} = \sum_{y \in \nu^{-1}(y')} f_{yx},$$

it follows that $f'_{y'\mu(x)} = \sum_{y \in \nu^{-1}(y')} f_{yx}$; thus, for all $x' \in X'$ and $y' \in Y'$, it follows that $f'_{y'x'} = \sum_{y \in \nu^{-1}(y')} f_{yx}$ for all $x \in \mu^{-1}(x')$. As such, we have

$$p'_{x'} f'_{y'x'} = \sum_{x \in \mu^{-1}(x')} p_x \sum_{y \in \nu^{-1}(y')} f_{yx} = \sum_{x \in \mu^{-1}(x')} \sum_{y \in \nu^{-1}(y')} p_x f_{yx},$$
as desired. □
Proof of Theorem 2.
Suppose ♠ and ♣ are such that the vertical composition ♣ ∘ ♠ is summarized by the following diagram.

[Diagram: the vertical composition ♣ ∘ ♠, with channels $X \xrightarrow{f} Y$ and $X'' \xrightarrow{f''} Y''$ and composite ‘wings’ $s \circ s'$ and $t \circ t'$]

By item (1) of Lemma 2, we then have

$$\mathrm{CE}(\clubsuit \circ \spadesuit) = \sum_{x \in X} \sum_{y \in Y} p_x f_{yx} \log \frac{f_{yx}}{(t \circ t')_{y(\nu' \circ \nu)(y)} \, f''_{(\nu' \circ \nu)(y)(\mu' \circ \mu)(x)}},$$

and since $t$ is a section of the pure channel $\nu$, by item (2) of Lemma 1, it follows that for every $y \in Y$ we have

$$(t \circ t')_{y(\nu' \circ \nu)(y)} = t_{y\nu(y)} \, t'_{\nu(y)(\nu' \circ \nu)(y)}.$$

As such, we have

$$\begin{aligned} \mathrm{CE}(\clubsuit \circ \spadesuit) &= \sum_{x \in X} \sum_{y \in Y} p_x f_{yx} \log \frac{f_{yx}}{t_{y\nu(y)} \, t'_{\nu(y)(\nu'\circ\nu)(y)} \, f''_{(\nu'\circ\nu)(y)(\mu'\circ\mu)(x)}} \\ &= \sum_{x \in X} \sum_{y \in Y} p_x f_{yx} \log \left( \frac{f_{yx}}{t_{y\nu(y)} f'_{\nu(y)\mu(x)}} \cdot \frac{f'_{\nu(y)\mu(x)}}{t'_{\nu(y)(\nu'\circ\nu)(y)} \, f''_{(\nu'\circ\nu)(y)(\mu'\circ\mu)(x)}} \right) \\ &= \sum_{x \in X} \sum_{y \in Y} p_x f_{yx} \log \frac{f_{yx}}{t_{y\nu(y)} f'_{\nu(y)\mu(x)}} + \sum_{x \in X} \sum_{y \in Y} p_x f_{yx} \log \frac{f'_{\nu(y)\mu(x)}}{t'_{\nu(y)(\nu'\circ\nu)(y)} \, f''_{(\nu'\circ\nu)(y)(\mu'\circ\mu)(x)}} \\ &= \sum_{x \in X} \sum_{y \in Y} p_x f_{yx} \log \frac{f_{yx}}{t_{y\nu(y)} f'_{\nu(y)\mu(x)}} + \sum_{x' \in X'} \sum_{y' \in Y'} p'_{x'} f'_{y'x'} \log \frac{f'_{y'x'}}{t'_{y'\nu'(y')} \, f''_{\nu'(y')\mu'(x')}} \quad (\text{by item (2) of Lemma 2}) \\ &= \sum_{x \in X} p_x D\big(f_x, (t \circ f' \circ \mu)_x\big) + \sum_{x' \in X'} p'_{x'} D\big(f'_{x'}, (t' \circ f'' \circ \mu')_{x'}\big) \\ &= \mathrm{CE}(\spadesuit) + \mathrm{CE}(\clubsuit), \end{aligned}$$

as desired (the second-to-last equality follows from item (1) of Lemma 2). □

7. Relative Entropy in FinStat₂

In this section, we introduce a 2-level extension of the relative entropy map RE of Baez and Fritz, and show that it satisfies the natural 2-level analogues of functoriality, convex linearity, vanishing under optimal hypotheses, and lower semicontinuity.
Definition 13.
With every 2-morphism

[Diagram: ♠ $: (\mu, p, s) \Rightarrow (\nu, q, t)$]

we associate the non-negative extended real number $\mathrm{RE}_2(\spadesuit) \in [0, \infty]$ given by

$$\mathrm{RE}_2(\spadesuit) = \mathrm{RE}(\mu, p, s) + \mathrm{CE}(\spadesuit), \tag{8}$$

which we refer to as the 2-relative entropy of ♠. We note that the quantity $\mathrm{RE}(\mu, p, s)$ appearing on the RHS of (8) is the relative entropy associated with the morphism $(\mu, p, s)$ in FinStat, so that

$$\mathrm{RE}(\mu, p, s) = D(p, s \circ \mu \circ p),$$

where $D(\cdot, \cdot)$ is the standard relative entropy.
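Combining the two summands of (8), the following sketch computes $\mathrm{RE}_2(\spadesuit)$ for the same toy 2-morphism used in Section 6, now equipped with a hypothesis $s$ for $\mu$; as before, the data are hypothetical.

```python
import numpy as np

def D(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p  = np.array([0.6, 0.4])                # prior on X
mu = np.array([[1., 1.]])                # pure mu : X -> X' = {*}
s  = np.array([[0.5], [0.5]])            # s : X' -> X, a hypothesis for mu
f  = np.array([[0.8, 0.3], [0.2, 0.7]])  # f : X -> Y
fp = np.array([[1.]])                    # f' : X' -> Y' = {*}
t  = np.array([[0.6], [0.4]])            # t : Y' -> Y, a section of nu

re1 = D(p, s @ (mu @ p))                 # RE(mu, p, s) = D(p, s o mu o p)
g = t @ fp @ mu                          # the channel t o f' o mu : X -> Y
ce = sum(p[x] * D(f[:, x], g[:, x]) for x in range(2))
print(re1 + ce)                          # RE_2 of the 2-morphism, Equation (8)
```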
Proposition 2.
2-relative entropy is convex linear, i.e., if $(X, p)$ is a finite probability space and $\spadesuit_x$ is a collection of 2-morphisms in FinStat₂ indexed by $X$, then

$$\mathrm{RE}_2\Big(\sum_{x \in X} p_x \spadesuit_x\Big) = \sum_{x \in X} p_x \, \mathrm{RE}_2(\spadesuit_x).$$
Proof. 
Suppose $\sum_{x \in X} p_x \spadesuit_x$ is summarized by the following diagram.

[Diagram: the convex sum $\sum_{x \in X} p_x \spadesuit_x$]

By Theorem 1, we know that the relative entropy RE is convex linear over 1-morphisms in FinStat₂, and by Proposition 1, we know that conditional relative entropy is convex linear over 2-morphisms in FinStat₂; thus

$$\mathrm{RE}\Big(\sum_{x \in X} p_x (\mu_x, q_x, s_x)\Big) = \sum_{x \in X} p_x \, \mathrm{RE}(\mu_x, q_x, s_x) \quad \text{and} \quad \mathrm{CE}\Big(\sum_{x \in X} p_x \spadesuit_x\Big) = \sum_{x \in X} p_x \, \mathrm{CE}(\spadesuit_x). \tag{9}$$

We then have

$$\begin{aligned} \mathrm{RE}_2\Big(\sum_{x \in X} p_x \spadesuit_x\Big) &= \mathrm{RE}\Big(\sum_{x \in X} p_x (\mu_x, q_x, s_x)\Big) + \mathrm{CE}\Big(\sum_{x \in X} p_x \spadesuit_x\Big) \\ &\overset{(9)}{=} \sum_{x \in X} p_x \, \mathrm{RE}(\mu_x, q_x, s_x) + \sum_{x \in X} p_x \, \mathrm{CE}(\spadesuit_x) \\ &= \sum_{x \in X} p_x \Big( \mathrm{RE}(\mu_x, q_x, s_x) + \mathrm{CE}(\spadesuit_x) \Big) \\ &= \sum_{x \in X} p_x \, \mathrm{RE}_2(\spadesuit_x), \end{aligned}$$
as desired. □
Theorem 3.
The 2-relative entropy is functorial with respect to vertical composition, i.e., if ♣ ∘ ♠ is a vertical composition in FinStat₂, then $\mathrm{RE}_2(\clubsuit \circ \spadesuit) = \mathrm{RE}_2(\spadesuit) + \mathrm{RE}_2(\clubsuit)$.
Proof. 
Suppose ♠ and ♣ are such that the vertical composition ♣ ∘ ♠ is summarized by the following diagram.

[Diagram: the vertical composition ♣ ∘ ♠]

Then,

$$\begin{aligned} \mathrm{RE}_2(\clubsuit \circ \spadesuit) &= \mathrm{RE}(\mu' \circ \mu, p, s \circ s') + \mathrm{CE}(\clubsuit \circ \spadesuit) \\ &= \mathrm{RE}(\mu, p, s) + \mathrm{RE}(\mu', p', s') + \mathrm{CE}(\spadesuit) + \mathrm{CE}(\clubsuit) \\ &= \mathrm{RE}_2(\spadesuit) + \mathrm{RE}_2(\clubsuit), \end{aligned}$$
where the second equality follows from Theorems 1 and 2. □
Proposition 3.
Let ♠ $: (\mu, p, s) \Rightarrow (\nu, q, t)$ be a 2-morphism in FinStat₂, and suppose $s$ and $t$ are optimal hypotheses for $(\mu|p)$ and $(\nu|q)$ (as defined in Definition 5). Then, $\mathrm{RE}_2(\spadesuit) = 0$.
Proof. 
Since $s$ and $t$ are optimal hypotheses, it follows that $\mathrm{RE}(\mu, p, s) = \mathrm{CE}(\spadesuit) = 0$, from which the proposition follows. □
Proposition 4.
The 2-relative entropy $\mathrm{RE}_2$ is lower semicontinuous.
Proof. 
Since the 2-relative entropy $\mathrm{RE}_2$ is a linear combination of 1-level relative entropies, and 1-level relative entropies are lower semicontinuous by Theorem 1, it follows that $\mathrm{RE}_2$ is lower semicontinuous. □

8. Conclusions, Limitations and Future Research

In this work, we have constructed a 2-categorical extension $\mathrm{RE}_2$ of the relative entropy functor RE of Baez and Fritz [1], yielding a new measure of information which we view as a relative measure of information between noisy channels. Moreover, we have shown that our construction satisfies natural 2-level analogues of functoriality, convex linearity, vanishing under optimal hypotheses, and lower semicontinuity. As the relative entropy functor of Baez and Fritz is uniquely characterized by such properties, it is only natural to ask whether our 2-level extension $\mathrm{RE}_2$ of RE is also uniquely characterized by the 2-level analogues of such properties. It would also be interesting to investigate alternative versions of 2-morphisms in FinStat₂ where the 2-morphisms are less restrictive, such as where the base square of the associated pyramid of a 2-morphism is not assumed to commute. While the 2-relative entropy associated with such morphisms would not be functorial, such less restrictive morphisms would provide more flexibility for potential applications. Finally, as there are many other categories of interest with respect to information theory [8,14,15], it would be interesting to investigate 2-level extensions of such categories as well.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Baez, J.C.; Fritz, T. A Bayesian characterization of relative entropy. Theory Appl. Categ. 2014, 29, 422–457.
  2. Mac Lane, S. Categories for the Working Mathematician, 2nd ed.; Springer: New York, NY, USA, 1998.
  3. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley-Interscience: Hoboken, NJ, USA, 2006.
  4. Baez, J.C.; Fritz, T.; Leinster, T. A characterization of entropy in terms of information loss. Entropy 2011, 13, 1945–1957.
  5. Coecke, B.; Fritz, T.; Spekkens, R. A mathematical theory of resources. Inform. Comput. 2016, 250, 59–86.
  6. Faddeev, D.K. On the concept of entropy of a finite probabilistic scheme. Uspekhi Mat. Nauk 1956, 11, 227–231.
  7. Fong, B. Causal Theories: A Categorical Perspective on Bayesian Networks. Master's Thesis, University of Oxford, Oxford, UK, 2012.
  8. Fritz, T. A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics. Adv. Math. 2020, 370, 107239.
  9. Fullwood, J. An axiomatic characterization of mutual information. arXiv 2021, arXiv:2108.12647.
  10. Fullwood, J.; Parzygnat, A.P. The Information Loss of a Stochastic Map. Entropy 2021, 23, 1021.
  11. Leinster, T. A short characterization of relative entropy. J. Math. Phys. 2019, 60, 023302.
  12. Leinster, T. Entropy and Diversity: The Axiomatic Approach; Cambridge University Press: Cambridge, UK, 2021.
  13. Parzygnat, A.P. A functorial characterization of von Neumann entropy. arXiv 2020, arXiv:2009.07125.
  14. Cho, K.; Jacobs, B. Disintegration and Bayesian inversion via string diagrams. Math. Struct. Comput. Sci. 2019, 29, 938–971.
  15. Golubtsov, P.V. Information transformers: Category-theoretical structure, informativeness, decision-making problems. Hadron. J. Suppl. 2004, 19, 375–424.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
