Dirichlet Polynomials and Entropy

Spivak, David I.; Hosgood, Timothy

doi:10.3390/e23081085

Open AccessArticle

Dirichlet Polynomials and Entropy

by

David I. Spivak

and

Timothy Hosgood

^*

Topos Institute, Berkeley, CA 94704, USA

^*

Author to whom correspondence should be addressed.

Entropy 2021, 23(8), 1085; https://0-doi-org.brum.beds.ac.uk/10.3390/e23081085

Submission received: 14 July 2021 / Revised: 18 August 2021 / Accepted: 19 August 2021 / Published: 21 August 2021

(This article belongs to the Section Information Theory, Probability and Statistics)

Download

Browse Figure

Review Reports Versions Notes

Abstract

:

A Dirichlet polynomial d in one variable

y

is a function of the form

d (y) = a_{n} n^{y} + \dots + a_{2} 2^{y} + a_{1} 1^{y} + a_{0} 0^{y}

for some

n, a_{0}, \dots, a_{n} \in N

. We will show how to think of a Dirichlet polynomial as a set-theoretic bundle, and thus as an empirical distribution. We can then consider the Shannon entropy

H (d)

of the corresponding probability distribution, and we define its length (or, classically, its perplexity) by

L (d) = 2^{H (d)}

. On the other hand, we will define a rig homomorphism

h : Dir \to Rect

from the rig of Dirichlet polynomials to the so-called rectangle rig, whose underlying set is

R_{⩾ 0} \times R_{⩾ 0}

and whose additive structure involves the weighted geometric mean; we write

h (d) = (A (d), W (d))

, and call the two components area and width (respectively). The main result of this paper is the following: the rectangle-area formula

A (d) = L (d) W (d)

holds for any Dirichlet polynomial d. In other words, the entropy of an empirical distribution can be calculated entirely in terms of the homomorphism h applied to its corresponding Dirichlet polynomial. We also show that similar results hold for the cross entropy.

Keywords:

bundle; weighted geometric mean; category theory; Dirichlet polynomial

1. Introduction

The purpose of this paper is simply to provide another categorical treatment of entropy of probability distributions, which turns out to be computed in terms of a rig homomorphism; our treatment also generalises to cross entropy (and thus to Kullback–Leibler divergence). What is particularly interesting about the treatment outlined here is that we can somewhat “visualise” the notion of entropy in terms of sizes of coding schemes (cf. Section 6). Not only that, but classical entropy is only homomorphic in the product of distributions, whereas the notion that we describe here is homomorphic in both the product and the sum.

A brief outline of this paper is as follows:

Section 2: We recall the definitions of Dirichlet polynomials and set-theoretic bundles, along with their rig structures, from [1] (one important thing to note is the following: Dirichlet polynomials are well-studied objects in the setting of complex analysis, but we cannot apply tools from this area to our setting, because we have only natural number coefficients and non-negative exponents); we then study the equivalence between these two notions.
Section 3: We explain how empirical probability distributions correspond to set-theoretic bundles (and thus to Dirichlet polynomials).
Section 4: We define the rig homomorphism $h : Dir \to Rect$ that we wish to study, whose codomain is a rig encoding the weighted geometric mean; we prove some useful computational results and give some explicit examples.
Section 5: We define the entropy $H (d)$ of a Dirichlet polynomial using the classical notion of Shannon entropy; we give some explicit examples; we prove the main result of this paper (Theorem 1), relating entropy to the rig homomorphism defined in the previous section.
Section 6: We try to provide some intuition for the image $h (d)$ of a Dirichlet polynomial under the rig homomorphism, in terms of coding schemes.
Section 7: We generalise Theorem 1 to the case of cross-entropy, or Kullback–Leibler divergence.

Prerequisites

We assume that the reader is familiar with the fundamentals of category theory, such as functors, natural transformations, and (co)products, as covered, for example, in [2] (Chapter 1).

2. Dirichlet Polynomials and Bundles

This section is simply a brief summary of content from [1], repeated here for the convenience of the reader.

Definition 1.

A Dirichlet polynomial d in one variable

y

is a function of the form

d (y) = a_{n} n^{y} + \dots + a_{2} 2^{y} + a_{1} 1^{y} + a_{0} 0^{y}

for some

n, a_{0}, \dots, a_{n} \in N

.

The set of Dirichlet polynomials is clearly closed under addition, and further under multiplication (using the distributive law along with the fact that

m^{y} \cdot n^{y} = {(m \cdot n)}^{y}

). In fact, it has the structure of a rig: a “ring without negatives” (or, to be pedantic, a monoid object in commutative monoids). We denote this rig by

Dir

, where the additive unit is 0, and the multiplicative unit is

1^{y}

.

Note that we can embed

N

as a sub-rig of

Dir

, by

a \mapsto a \cdot 1^{y}

; we often use this fact and simply write

a \in Dir

.

Following [1], we can think of Dirichlet polynomials as functors

{FSet}^{o p} \to FSet

, where

FSet

is the category of finite sets. Indeed, given a natural number

n \in N

, the exponential

n^{y}

can be thought of as the Yoneda embedding of the set with n elements, i.e.,

n^{y} = FSet (-, \underset{̲}{n})

where

\underset{̲}{n} = {1, \dots, n}

. (For typographical convenience, we sometimes use the notation n and

\underset{̲}{n}

interchangeably. In particular, we write e.g.,

d (0)

instead of

d (\underset{̲}{0})

). Then the addition of exponentials corresponds to the coproduct of the corresponding representable functors (and so multiplication by a natural number

a_{n}

corresponds to the

a_{n}

-fold coproduct of the representable functor with itself). This means that evaluating a Dirichlet polynomial at some natural number n corresponds to evaluating the corresponding functor on the finite set

\underset{̲}{n}

.

Note that

0^{y}

is not the initial object 0, since

0^{\underset{̲}{n}} = \{\begin{matrix} 1 & if n = 0; \\ 0 & if n ⩾ 1 \end{matrix}

i.e.,

0^{y} \neq 0

.

Example 1.

The Dirichlet polynomial

d (y) = 4^{y} + 4 \cdot 1^{y}

evaluated at 0 gives

\begin{matrix} d (0) & = FSet (\underset{̲}{0}, \underset{̲}{4}) ⊔ (⊔_{i = 1}^{4} FSet (\underset{̲}{0}, \underset{̲}{1})) \\ ≅ \underset{̲}{1} ⊔ \underset{̲}{4} \\ ≅ \underset{̲}{5}, \end{matrix}

and, similarly,

\begin{matrix} d (1) & = FSet (\underset{̲}{1}, \underset{̲}{4}) ⊔ (⊔_{i = 1}^{4} FSet (\underset{̲}{1}, \underset{̲}{1})) \\ ≅ \underset{̲}{4} ⊔ \underset{̲}{4} \\ ≅ \underset{̲}{8} . \end{matrix}

Note that, since

1^{y} = 1

, we can write

d (y) = 4^{y} + 4

.

Definition 2.

A morphism

φ : d \to e

of Dirichlet polynomials is a natural transformation of (contravariant) functors. Denote by

Dir

the category of Dirichlet polynomials (thought of as functors

{FSet}^{o p} \to FSet

), and by

Dir (d, e)

the set of all morphisms

d \to e

.

When we think of Dirichlet polynomials as functors

{FSet}^{o p} \to FSet

, addition is given by the coproduct (disjoint union of sets), and multiplication by the product (Cartesian product of sets). This means that working with Dirichlet polynomials in

Dir

really is like working with polynomials, in the sense that addition and multiplication are exactly “as expected”.

Example 2.

The only slightly confusing aspect of multiplication in

Dir

is how

0^{y}

behaves (since

0^{y} \neq 0

): if

d (y)

is a Dirichlet polynomial, then

d (y) \cdot 0^{y} = d (0) \cdot 0^{y},

as follows from the aforementioned fact that

0^{\underset{̲}{n}}

is zero for

n \neq 0

, and 1 for

n = 0

.

We can use this general fact for specific computations. For example, let

\begin{matrix} d (y) & : = 3 \cdot 2^{y} + 1^{y} \\ e (y) & : = 4^{y} + 2^{y} + 3 \cdot 0^{y} \end{matrix}

Then

\begin{matrix} (d \cdot e) (y) & = (3 \cdot 8^{y} + 3 \cdot 4^{y} + 9 \cdot 0^{y}) + (4^{y} + 2^{y} + 3 \cdot 0^{y}) \\ = 3 \cdot 8^{y} + 4 \cdot 4^{y} + 2^{y} + 12 \cdot 0^{y} . \end{matrix}

(and

d + e = 4^{y} + 4 \cdot 2^{y} + 1^{y} + 3 \cdot 0^{y}

).

There is a more geometric interpretation of objects of

Dir

as set-theoretic bundles, i.e., (isomorphism classes of) functions

E \to B

, where

E, B \in FSet

, given as follows: to the Dirichlet polynomial

d : {FSet}^{o p} \to FSet

, we associate the function

π_{d} = d (!) : d (1) \to d (0)

induced by the unique function

! : \underset{̲}{0} \to \underset{̲}{1}

. (This vague statement can be upgraded to an equivalence of categories: ([1], [Theorem 4.6]) for more details).

For example, to the polynomial

d (y) = 4^{y} + 4 \cdot 1^{y}

, we associate the bundle

Note that bundles also form a rig, where the sum is given by disjoint union of sets, and the product is given by the Cartesian product of sets, both on the base and the total space. Further, the equivalence between Dirichlet polynomials and bundles respects the rig structures (cf. [1], Theorem 4.6). Because of this, we often switch freely between thinking of Dirichlet polynomials as functors

d : {FSet}^{o p} \to FSet

, as bundles

π_{d} : d (1) \to d (0)

, and simply as functions of the form

\sum_{j = 0}^{n} a_{j} \cdot j^{y}

.

Example 3.

We can draw the bundle corresponding to

(2^{y} + 1) \cdot (2^{y} + 1)

as follows:

Lemma 1.

Let

d (y) \sum_{j = 0}^{n} a_{j} \cdot j^{y}

be a Dirichlet polynomial. Then

\begin{matrix} | d (0) | & = \sum_{j = 0}^{n} a_{j} \\ | d (1) | & = \sum_{j = 0}^{n} a_{j} j . \end{matrix}

Proof.

This follows from the fact that

n^{\underset{̲}{0}} = 1

and

n^{\underset{̲}{1}} = n

, for all

n \in N

. □

Definition 3.

Let

d \in Dir

. For

i \in d (0)

, we define

d [i] π_{d}^{- 1} (i)

where

π_{d} : d (1) \to d (0)

is the bundle corresponding to d.

Using the fact that the sum of bundles is given by the disjoint union of sets, we can use this above definition to write any Dirichlet polynomial d as

d (y) ≅ \sum_{i \in d (0)} d {[i]}^{y}

(where ∑ is the coproduct in

FSet

).

Corollary 1.

Let

d (y) : = \sum_{i \in d (0)} d {[i]}^{y}

be a Dirichlet polynomial. Then

| d (1) | = \sum_{i \in d (0)} | d [i] | .

Proof.

This is, again, simply the fact that

n^{\underset{̲}{1}} = n

for all

n \in N

. □

Lemma 2.

A morphism

φ : d \to e

of Dirichlet polynomials is exactly a morphism of the corresponding bundles, i.e., functions

φ_{0} : d (0) \to e (0)

and

φ_{1} : d (1) \to e (1)

such that

commutes.

Proof.

This statement forms a specific part of ([1], Theorem 4.6), but the proof is simple enough that we give a direct version here. Writing

d (y) \sum_{i \in d (0)} d {[i]}^{y}

and

e (y) \sum_{i \in e (0)} e {[i]}^{y}

, we see that

\begin{matrix} {Hom}_{[{FSet}^{o p}, FSet]} (d, e) & ≅ \prod_{i \in d (0)} {Hom}_{[{FSet}^{o p}, FSet]} (d {[i]}^{y}, \sum_{j \in e (0)} e {[j]}^{y}) \\ ≅ \prod_{i \in d (0)} \sum_{j \in e (0)} e {[j]}^{d [i]} \\ = \prod_{i \in d (0)} \sum_{j \in e (0)} {Hom}_{FSet} (d [i], e [j]) \end{matrix}

(the first isomorphism is by the universal property of the coproduct; the second isomorphisms is the Yoneda lemma). However, an element of this set is exactly a bundle morphism: we have, for all

i \in d (0)

, some

j \in e (0)

along with a function

d [i] \to e [j]

;

φ_{1}

is given by the disjoint union of all these

d [i] \to e [j]

, and

φ_{0}

is given by the the choice of j for each i. □

Definition 4.

Given Dirichlet polynomials

d, e \in Dir

such that

d (0) = e (0)

, we denote by

{Dir}_{/ d (0)} (d, e)

the set of morphisms

(φ_{0}, φ_{1}) : d \to e

such that

φ_{0} = i d

.

Given the correspondence between Dirichlet polynomials and bundles, we might rightly ask why we should prefer to work with the former over the latter. For one possible answer to this, see Section 6.

3. Bundles as Empirical Distributions

The interpretation of Dirichlet polynomials as bundles helps us to understand how they relate to probability theory. Imagine flipping a coin eight times and observing five heads and three tails; we refer to “heads” and “tails” as outcomes, and each of the eight flips as draws; every draw has an associated outcome.

Consider some bundle

π_{d} : d (1) \to d (0)

. We can think of

d (0)

as the set of outcomes, and

d (1)

as the set of draws; the fibre

π_{d}^{- 1} (x)

over an outcome

x \in d (0)

corresponds to all the draws that lead to the outcome x, and so we obtain a probability distribution on

d (0)

by setting

P (X = x) = \frac{| π_{d}^{- 1} (x) |}{| d (0) |}

. Conversely, any rational distribution (i.e., a distribution such that all probabilities are rational numbers ) on a finite set arises in this way: take the finite set as the set of outcomes; take the least common multiple of the denominators of all the probabilities as the cardinality of

d (1)

; and then take

P (X = x) \cdot | d (1) |

many elements of

d (1)

to be in the fibre of

x \in d (0)

.

Example 4.

Consider the set

S = {x_{1}, x_{2}, x_{3}, x_{4}}

, endowed with the probability distribution such that

\begin{matrix} P (x_{1}) & = \frac{1}{5} P (x_{2}) = \frac{1}{6} \\ P (x_{3}) & = \frac{1}{2} P (x_{4}) = \frac{2}{15} \end{matrix}

Define the sets

d (0) = \underset{̲}{4}

and

d (1) = \underset{̲}{30}

, and define the function

π_{d} : d (1) \to d (0)

by

π (n) = \{\begin{matrix} 1 & i f 0 ⩽ n < 6; \\ 2 & i f 6 ⩽ n < 11; \\ 3 & i f 11 ⩽ n < 26; \\ 4 & i f 26 ⩽ n < 30 . \end{matrix}

Then the empirical probability distribution on the bundle

π : d (1) \to d (0)

agrees exactly with the given distribution on S. As a Dirichlet polynomial, this bundle is given (up to relabelling the outcomes) by

d (y) : = 15^{y} + 6^{y} + 5^{y} + 4^{y}

.

Note that any multiple

m^{y} \cdot d (y)

of d (for

m ⩾ 1

) will correspond to the same probability distribution as d itself, but to a different empirical distribution, since it will have m times as many draws.

Under this interpretation of Dirichlet polynomials as empirical distributions, multiplication

d \cdot e

corresponds to taking the product distribution.

Remark 1.

For any

d \in Dir

, and any

n \in N

, we can give

| d (n) |

a combinatorial interpretation: it is the number of ways of choosing n indistinguishable (in the sense that they have the same outcome) draws, i.e., the number of length-n lists of elements of

d [i]

for some

i \in d (0)

.

To see this, note that

d (n) = Dir (n^{y}, d)

(by Yoneda), and so

d (n)

is in bijection with the set of bundle morphisms

(φ_{0}, φ_{1}) : (n \to 1) \to (d (1) \to d (0))

, which are given exactly by choosing n (possibly repeated) elements of

d (1)

that all lie in the same fibre (namely the fibre above the point specified by

φ_{0} (1)

).

Remark 2.

Although we deal only with finite sets and rational probability distributions here, it seems likely that one could follow the methods of [3] and consider colimits of these to obtain analogous results for arbitrary probability distributions on discrete measurable spaces.

4. Area and Width

Definition 5.

Define the rig

Rect

as follows. The underlying set is

R_{⩾ 0} \times R_{⩾ 0}

. The multiplicative structure has unit

(1, 1)

, and is given by component-wise multiplication:

(A_{1}, W_{1}) \cdot (A_{2}, W_{2}) : = (A_{1} A_{2}, W_{1} W_{2}) .

The additive structure has unit

(0, 0)

, and is given by real-number addition in the first component, and by weighted geometric mean in the second component:

(A_{1}, W_{1}) + (A_{2}, W_{2}) : = (A_{1} + A_{2}, {(W_{1}^{A_{1}} W_{2}^{A_{2}})}^{\frac{1}{A_{1} + A_{2}}}) .

Given an element

(A, W)

in

Rect

, we call A its area and W its width.

The fact that

Rect

is indeed a rig follows from the fact that its multiplication distributes over its addition:

\begin{matrix} (A_{1}, W_{1}) \cdot ((A_{2}, W_{2}) + (A_{3}, W_{3})) & = (A_{1}, W_{1}) \cdot (A_{2} + A_{3}, {(W_{2}^{A_{2}} W_{3}^{A_{3}})}^{\frac{1}{A_{2} + A_{3}}}) \\ = (A_{1} (A_{2} + A_{3}), W_{1} {(W_{2}^{A_{2}} W_{3}^{A_{3}})}^{\frac{1}{A_{2} + A_{3}}}) \\ = (A_{1} A_{2} + A_{1} A_{3}, {(W_{1}^{A_{2} + A_{3}} W_{2}^{A_{2}} W_{3}^{A_{3}})}^{\frac{1}{A_{2} + A_{3}}}) \\ = (A_{1} A_{2} + A_{1} A_{3}, {({(W_{1} W_{2})}^{A_{2}} {(W_{1} W_{3})}^{A_{3}})}^{\frac{1}{A_{2} + A_{3}}}) \\ = (A_{1} A_{2} + A_{1} A_{3}, {({(W_{1} W_{2})}^{A_{1} A_{2}} {(W_{1} W_{3})}^{A_{1} A_{3}})}^{\frac{1}{A_{1} A_{2} + A_{1} A_{3}}}) \\ = (A_{1} A_{2}, W_{1} W_{2}) + (A_{1} A_{3}, W_{1} W_{3}) \\ = (A_{1}, W_{1}) \cdot (A_{2}, W_{2}) + (A_{1}, W_{1}) \cdot (A_{3}, W_{3}) . \end{matrix}

Proposition 1.

There exists a unique rig morphism

h : Dir \to Rect

for which

h : n^{y} \mapsto (n, n) .

Proof.

Since every Dirichlet polynomial is just a sum of exponentials, a rig homomorphism is fully determined by its action on exponentials, since it must respect addition. So we just need to show that h does indeed extend to a rig homomorphism, but this follows from the fact that

m^{y} \cdot n^{y} = {(m \cdot n)}^{y}

. □

Definition 6.

Given a Dirichlet polynomial d, we define its area

A (d)

and its width

W (d)

to be given by the components of

h (d) = (A (d), W (d))

.

With this definition, along with Proposition 1, we see that

A (n^{y}) = W (n^{y}) = n .

Lemma 3.

Let

d \in Dir

and

a \in N

. Then

1.: $h (a) = (a, 1)$ ;
2.: $A (a \cdot d) = a A (d)$ ;
3.: $W (a \cdot d) = W (d)$ .

Proof.

Recall that addition (and thus scalar multiplication) in

Rect

involves the weighted geometric mean in the second component. Then

\begin{matrix} h (a) & : = h (a \cdot 1^{y}) \\ = a \cdot h (1^{y}) \\ = a \cdot (1, 1) \\ = (a, 1) \end{matrix}

which proves (i). For (ii) and (iii), since h is a rig homomorphism (and thus respects addition), it suffices to consider the case where d is an exponential, say

d (y) = n^{y}

. However, then

\begin{matrix} h (a \cdot d) & = a \cdot h (d) \\ = a \cdot (n, n) \\ = (a n, {(\underset{a t i m e s}{\underset{⏟}{n^{n} n^{n} \dots n^{n}}})}^{1 / a n}) \\ = (a n, {(n^{a n})}^{1 / a n}) \\ = (a n, n) \end{matrix}

i.e.,

A (a \cdot d) = a A (d)

and

W (a \cdot d) = W (d)

, as claimed. □

Corollary 2.

Let

d \in Dir

. Then

\begin{matrix} A (d) & = | d (1) | \\ W {(d)}^{A (d)} & = |{Dir}_{/ d (0)} (d, d)| . \end{matrix}

Proof.

Write

d (y) = a_{n} \cdot n^{y} + \dots + a_{1} \cdot 1^{y} + a_{0} \cdot 0^{y}

. Using Lemma 3, along with Definition 5, we see that

\begin{matrix} h (d) & = (a_{n} n, n) + (a_{n - 1} (n - 1), n - 1) + \dots + (a_{1}, 1) \\ = (\sum_{i = 0}^{n} a_{i} i, {(\prod_{i = 0}^{n} i^{a_{i} i})}^{1 / \sum_{i = 0}^{n} a_{i} i}) . \end{matrix}

By Lemma 1, the first component (i.e.,

A (d)

) is equal to

| d (1) |

; by the same lemma, we can also rewrite the second component (i.e.,

W (d)

) as

W (d) = {(\prod_{i = 0}^{n} i^{a_{i} i})}^{\frac{1}{A (d)}}

so it simply remains to justify why this is equal to

{|{Dir}_{/ d (0)} (d, d)|}^{\frac{1}{A (d)}}

. But a morphism in

{Dir}_{/ d (0)} (d, d)

is exactly the data of an endomorphism of each fibre of

π_{d} : d (1) \to d (0)

; since there are

a_{i}

fibres of size i, endomorphisms of these fibres are in bijection with the

a_{i}

-fold product of

i^{i}

, which is equal to

i^{a_{i} i}

, whence the claim. □

Corollary 3.

Let

d \in Dir

. Then the width

W (d)

is an algebraic number, i.e., the image of

h : Dir \to Rect

lies in the sub-rig whose underlying set is

N \times {\bar{Q}}_{⩾ 0}

, where

\bar{Q}

is the algebraic closure of

Q

.

Proof.

By Corollary 2, both

W {(d)}^{A (d)}

and

A (d)

are equal to the cardinality of some sets, and thus integer. □

Example 5.

Reassuringly, if we start with a “rectangle”, then the area and width are exactly what we might expect. More concretely: consider

d (y) = a \cdot n^{y}

for some

a, n \in N

; then, by Corollary 2,

A (d) = d (1) = a n,

and, by direct calculation,

W (d) = n .

Comparing this to the picture of

a \cdot n^{y}

, we can explain why we chose the terminology “width” and “area”:

Indeed, the area is exactly the number of dots in the (upper) rectangle, and the width is its width.

However, this picture now leads us to consider the question of whether or not there is a good meaning we can give to the “length” of this rectangle (which, here, should be equal to a). Indeed, this has been our motivation all along; we will return to this question in Example 9.

Example 6.

The fact that

d (1)

is “rectangular” in Example 5 makes the terminology look like a numerical coincidence, but we can try to hone our intuition of what this really “means” by considering another example.

Let’s consider

d (y) = 4^{y} + 4

, which has area

A (d) = d (1) = 8

. We can calculate its width by using the fact that

\begin{matrix} h (4^{y}) & = (4, 4) \\ h (4) & = h (1^{y}) + h (1^{y}) + h (1^{y}) + h (1^{y}) \\ = (4, 1) \end{matrix}

whence

\begin{matrix} h (d) & = (4, 4) + (4, 1) \\ = (8, {(4^{4} 1^{4})}^{\frac{1}{8}}) \\ = (8, 2) \end{matrix}

and so

W (d) = 2

.

How, then, does the rectangle with area 8 and width 2 relate to our Dirichlet polynomial

d (y) = 4^{y} + 4

? That is, what is the process that takes us from d to

4 \cdot 2^{y}

? Looking at the pictures of the bundles, we see that the width tells us how our bundle would look if we had the same set (

d (1)

) of draws, but with different outcomes, now all equally likely:

Note that, in order to have equally sized fibres, we needed to have 4 outcomes, not 5 (since

8 / 2 = 4

). We make this idea more precise (as well as explain why the rectangle is of size

4 \times 2

instead of

2 \times 4

) in Section 6.

Example 7.

We have just seen that

d (y) = 4^{y} + 4

has

W (d) = 2

and

A (d) = 8

, but now let’s look at an example where the numbers don’t divide so neatly.

Let

d (y) = 4^{y} + 3

. Then

A (d) = d (1) = 7

, and, as in Example 6, we use the fact that

\begin{matrix} h (d) & = (4, 4) + (3, 1) \\ = (7, {(4^{4} 1^{3})}^{\frac{1}{7}}) \\ = (7, 2 \sqrt[7]{2}) \\ \approx (7, 2.21) \end{matrix}

Of course, now we can’t draw a nice rectangle representing the evenly distributed bundle as we did in Example 6 for

4^{y} + 4

, since we would have to have an outcome set of size

7 / 2.21 \approx 3.17

elements, with fibres all of size

2.21

, but this should come as no surprise, since 7 is prime. One might be tempted to solve this problem using groupoid cardinality (cf. [4]), but there are some technical issues here.

5. Length

Remark 3.

We write log to mean

{log}_{2}

.

Definition 7.

Given a Dirichlet polynomial

d (y) : = \sum_{i \in d (0)} d {[i]}^{y}

, we define its entropy

H (d)

by

H (d) : = - \sum_{i \in d (0)} \frac{| d [i] |}{| d (1) |} log (\frac{| d [i] |}{| d (1) |}) .

We then define its length (also called the perplexity)

L (d)

by

L (d) : = 2^{H (d)} .

Readers might recognise

H (d)

as being the Shannon entropy of the corresponding probability distribution (cf. [5]). The convention for naming the sides of a rectangle is from [6] (Figure 1).

Example 8.

Consider

d (y) = n^{y}

for some

n \in N

.

Then

d (0) = 1

and

d (1) = n

, and so

\begin{matrix} H (d) & = - \sum_{i \in \underset{̲}{1}} \frac{n}{n} log (\frac{n}{n}) \\ = - log 1 \\ = 0 \end{matrix}

whence

L (d) = 2^{0} = 1

.

In terms of distributions, this corresponds to the fact that the unique probability distribution on a single outcome has entropy equal to 0 (and so the same is true for any empirical distribution on a single outcome).

Example 9.

Continuing on from Example 5, we can calculate the entropy of a uniform distribution on a many outcomes

d (y) = a \cdot n^{y}

as

\begin{matrix} H (a \cdot n^{y}) & = - \sum_{i \in \underset{̲}{a}} \frac{n}{a n} log (\frac{n}{a n}) \\ = - log (\frac{1}{a}) \\ = log a \end{matrix}

whence

L (a \cdot n^{y}) = 2^{log a} = a

, exactly as desired.

Example 10.

Continuing on from Example 7, recall that

d (y) = 4^{y} + 4

has area

A (d) = 8

and width

W (d) = 2

. We can further calculate that

\begin{matrix} H (d) & = - \sum_{i \in \underset{̲}{5}} \frac{| d [i] |}{8} log (\frac{| d [i] |}{8}) \\ = - \frac{4}{8} log (\frac{4}{8}) - 4 \cdot \frac{1}{8} log (\frac{1}{8}) \\ = - \frac{1}{2} log (\frac{1}{16}) \\ = 2 \end{matrix}

whence

L (d) = 2^{2} = 4

.

As for

d (y) = 4^{y} + 3

, recall that its area is

A (d) = 7

and its width is

2 \sqrt[7]{2}

. Now, its entropy is

\begin{matrix} H (d) & = - \sum_{i \in \underset{̲}{4}} \frac{| d [i] |}{7} log (\frac{| d [i] |}{7}) \\ = - \frac{4}{7} log (\frac{4}{7}) - 3 \cdot \frac{1}{7} log (\frac{1}{7}) \\ = \frac{log 7}{log 2} - \frac{8}{7} \end{matrix}

and so its length is

L (d) = 2^{\frac{log 7}{log 2} - \frac{8}{7}} = \frac{7}{2 \sqrt[7]{2}} .

Note that, in the above example, even though both

L (d)

and

W (d)

have non-integer values, the formula implied by our choice of nomenclature still holds: the area

A (d)

is equal to the length

L (d)

times the width

W (d)

. This leads us to our main theorem. It says that the Shannon entropy, which is only homomorphic in products of distributions, can be computed in terms of the width and area, which together are homomorphic in both sums and products of distributions. We will explain this in more detail in Section 6.

Theorem 1.

For all

d \in Dir

, we have the rectangle-area formula

A (d) = L (d) W (d) .

Proof.

In the following, we omit absolute value signs, writing e.g.,

d (1)

instead of

| d (1)

.

First, write

d (y) : = \sum_{j = 0}^{n} a_{j} j^{y} ≅ \sum_{i \in d (0)} d {[i]}^{y} .

Now we can rewrite the length as

\begin{matrix} L (d) & = 2^{H (d)} \\ = 2^{- \sum_{i \in d (0)} \frac{d [i]}{d (1)} log (\frac{d [i]}{d (1)})} \\ = \prod_{i \in d (0)} 2^{- \frac{d [i]}{d (1)} log (\frac{d [i]}{d (1)})} \\ = \prod_{i \in d (0)} {(2^{- log (\frac{d [i]}{d (1)})})}^{\frac{d [i]}{d (1)}} \\ = \prod_{i \in d (0)} {\frac{d (1)}{d [i]}}^{\frac{d [i]}{d (1)}} \\ = \frac{\prod_{i \in d (0)} d {(1)}^{\frac{d [i]}{d (1)}}}{\prod_{i \in d (0)} d {[i]}^{\frac{d [i]}{d (1)}}} \end{matrix}

The numerator is then

\begin{matrix} \prod_{i \in d (0)} d {(1)}^{\frac{d [i]}{d (1)}} & = d {(1)}^{\sum_{i \in d (0)} \frac{d [i]}{d (1)}} \\ = d (1) \\ = A (d) \end{matrix}

since

\sum_{i \in d (0)} | d [i] | = | d (1) |

, by Corollary 1, and we can then apply Corollary 2. The denominator is exactly

{(\prod_{i \in d (0)} d {[i]}^{d [i]})}^{\frac{1}{d (1)}}

and so, by Corollary 2, we only need to justify why

\prod_{i \in d (0)} d {[i]}^{d [i]}

is equal to

| {Dir}_{/ d (0)} (d, d) |

. However, this follows from the definition of an element of the latter set: a choice of map

d [i] \to d [i]

for all

i \in d (0)

. □

6. Interpreting Area, Length, and Width

We have mentioned many times that Dirichlet polynomials are equivalent to set-theoretic bundles, so the natural question to ask is “why, then, should we work with the former instead of the latter?”. One answer to this is question is the fact that entropy does not respect bundle morphisms: we cannot functorially assign a morphism between entropies to morphisms, since we are working with arbitrary morphisms of bundles. (If, however, we restrict to only morphisms given by pushforward, then [7] tells us (via Faddeev’s theorem) that the only possible functorial definition of entropy is given by the relative entropy, i.e., the difference of the entropies of the source and the target). This makes it seem rather bad to work with a category (such as that of bundles) instead of simply a rig (such as that of Dirichlet polynomials). Of course, this isn’t an entirely satisfactory answer, since we do care about the notion of morphisms for Dirichlet polynomials (for example, Corollary 2 tells us that the width can be expressed in terms of the number of certain morphisms). In light of Theorem 1, however, we might consider the following possibility: both area and length can be expressed in terms of

d (0)

,

d (1)

, and

d [i]

(for

i \in d (0)

), and we could define the width by

W (d) A (d) / L (d)

.

A better answer to this question might be the following: the rig homomorphism

h : Dir \to Rect

is incredibly simple, since it just maps

n^{y}

to

(n, n)

; from this computationally simple homomorphism, however, we can recover entropy (as

log (A (d) / W (d))

), without making any reference to the classical equation that defines it (“negative the sum of probabilities of the log of the probabilities”), but instead relying on the fact that

Rect

encodes the weighted geometric mean. That is,

H (d)

is only homomorphic in the product of distributions, whereas the pair

(A (d), W (d))

is homomorphic in both the product and the sum.

Now, the entropy

H (d) = log L (d)

can be understood (via Huffman coding, cf. [8]) as the average number of bits needed to code a single outcome (over a long enough message). What is also true, however, is that the width (which is obtained purely “algebraically”, i.e., from the rig homomorphism

h : Dir \to Rect

) gives similar information: by Theorem 1, combined with the previous sentence,

log W (d)

is the average number of bits needed to code the draw, given an outcome (in the same Huffman coding as before). This answers the question of “what is special about the bundle defined by the width and length” with “it describes the optimal encoding of draws, given outcomes”.

As for the picture in Example 6, we can now understand the hand-wavy explanation a bit better (but still just as hand-wavy-ly): we take our original “half-filled” rectangle

d (1)

and pour its contents into a new rectangle, of length

L (d)

, and then “slosh the contents around” until they lie flat, and then put a lid on it; the rectangle will be perfectly filled up, and the placement of the lid will be given by

W (d)

.

We also mentioned, in Corollary 3, that the width

W (d)

of any Dirichlet polynomial d is an algebraic number, but the actual result is slightly more interesting that this: Corollary 2 tells us that

W {(d)}^{A (d)}

is equal to the cardinality of the set

{Dir}_{/ d (0)} (d, d)

. We already know how to understand endomorphisms of d that fix

d (0)

as endomorphisms of

d (1)

that fix the outcome; we can understand

W {(d)}^{A (d)}

as maps from

A (d)

to

W (d)

; roughly speaking, such a map

f : A (d) \to W (d)

determines the remaining ambiguity in determining a draw, given its outcome.

7. Cross Entropy

Everything above can be viewed as a specific example of the analogous cross notions. That is, given two Dirichlet polynomials, we can define their cross area, cross width, etc. as follows.

Definition 8.

Let

d, e \in Dir

be Dirichlet polynomials such that

d (0) = e (0)

. Then we define thecross entropy

H (d, e)

by

H (d, e) = - \sum_{i \in d (0)} \frac{| d [i] |}{| d (1) |} log (\frac{| e [i] |}{| e (1) |})

and the cross area, cross width, and cross length by

\begin{matrix} A (d, e) : = & | e (1) | \\ W (d, e) : = & {|{Dir}_{/ d (0)} (d, e)|}^{\frac{1}{| d (1) |}} \\ L (d, e) : = & 2^{H (d, e)} \end{matrix}

(respectively).

By definition,

X (d, d) = X (d)

for

X \in {A, W, H, L}

. That is, just as cross entropy is a generalisation of entropy, the notions of cross width, etc., generalise the notions of width, etc.

Remark 4.

Note that we can recover the notion of relative entropy (also known as Kullback–Leibler divergence)

D_{KL} (p ∥ q)

, as studied in [9], from cross entropy:

H (d, e) = H (d) + D_{KL} (p ∥ q)

(which can also be seen to justify the fact that

H (d, d) = H (d)

).

Remark 5.

Although we have some idea of how to understand these cross notions (e.g., cross area can be understood as the number of “actual” draws, when we think of d as being a potentially inaccurate model for e), the choice of definitions in Definition 8 was chosen simply so that

1.: we recover the “uncrossed” notions when we take $d = e$ , and
2.: Theorem 2 holds.

Theorem 2.

For all

d, e \in Dir

, we have the cross rectangle-area formula

A (d, e) = L (d, e) W (d, e) .

Proof.

This proof follows exactly the same argument as the proof of Theorem 1. □

Author Contributions

Writing—original draft, D.I.S. and T.H. Both authors have read and agreed to the published version of the manuscript.

Funding

The first author acknowledges support from AFOSR grant no. FA9550-20-1-0348.

Conflicts of Interest

The authors declare no conflict of interest.

References

Spivak, D.I.; Myers, D.J. Dirichlet Polynomials form a Topos. arXiv 2020, arXiv:2003.04827. [Google Scholar]
Leinster, T. Basic Category Theory; Cambridge University Press: Cambridge, UK, 2014. [Google Scholar]
Fritz, T.; Perrone, P. A probability monad as the colimit of spaces of finite samples. Theory Appl. Categ. 2019, 34, 170–220. [Google Scholar]
Baez, J.C.; Hoffnung, A.E.; Walker, C.D. Higher-Dimensional Algebra VII: Groupoidification. Theory Appl. Categ. 2010, 24, 489–553. [Google Scholar]
Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef] [Green Version]
KidzSearch Wiki. Rectangle Facts for Kids. Available online: https://wiki.kidzsearch.com/wiki/Rectangle (accessed on 18 August 2021).
Baez, J.C.; Fritz, T.; Leinster, T. A Characterization of Entropy in Terms of Information Loss. Entropy 2011, 13, 1945–1957. [Google Scholar] [CrossRef]
Huffman, D.A. A method for the construction of minimum-redundancy codes. Proc. IRE 1952, 40, 1098–1101. [Google Scholar] [CrossRef]
Baez, J.C.; Fritz, T. A Bayesian characterization of relative entropy. Theory Appl. Categ. 2014, 29, 421–456. [Google Scholar]

Figure 1. Our convention for naming the sides of a rectangle, from [6].

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Spivak, D.I.; Hosgood, T. Dirichlet Polynomials and Entropy. Entropy 2021, 23, 1085. https://0-doi-org.brum.beds.ac.uk/10.3390/e23081085

AMA Style

Spivak DI, Hosgood T. Dirichlet Polynomials and Entropy. Entropy. 2021; 23(8):1085. https://0-doi-org.brum.beds.ac.uk/10.3390/e23081085

Chicago/Turabian Style

Spivak, David I., and Timothy Hosgood. 2021. "Dirichlet Polynomials and Entropy" Entropy 23, no. 8: 1085. https://0-doi-org.brum.beds.ac.uk/10.3390/e23081085

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Dirichlet Polynomials and Entropy

Abstract

1. Introduction

Prerequisites

2. Dirichlet Polynomials and Bundles

3. Bundles as Empirical Distributions

4. Area and Width

5. Length

6. Interpreting Area, Length, and Width

7. Cross Entropy

Author Contributions

Funding

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI