Shannon Entropy Estimation in ∞-Alphabets from Convergence Results: Studying Plug-In Estimators

Silva, Jorge F.

doi:10.3390/e20060397

Open AccessArticle

Shannon Entropy Estimation in ∞-Alphabets from Convergence Results: Studying Plug-In Estimators

by

Jorge F. Silva

Information and Decision System Group, Department of Electrical Engineering, Universidad de Chile, Av. Tupper 2007, Santiago 7591538, Chile

Entropy 2018, 20(6), 397; https://0-doi-org.brum.beds.ac.uk/10.3390/e20060397

Submission received: 12 April 2018 / Revised: 14 May 2018 / Accepted: 18 May 2018 / Published: 23 May 2018

(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

Download Versions Notes

Abstract

:

This work addresses the problem of Shannon entropy estimation in countably infinite alphabets studying and adopting some recent convergence results of the entropy functional, which is known to be a discontinuous function in the space of probabilities in ∞-alphabets. Sufficient conditions for the convergence of the entropy are used in conjunction with some deviation inequalities (including scenarios with both finitely and infinitely supported assumptions on the target distribution). From this perspective, four plug-in histogram-based estimators are studied showing that convergence results are instrumental to derive new strong consistent estimators for the entropy. The main application of this methodology is a new data-driven partition (plug-in) estimator. This scheme uses the data to restrict the support where the distribution is estimated by finding an optimal balance between estimation and approximation errors. The proposed scheme offers a consistent (distribution-free) estimator of the entropy in ∞-alphabets and optimal rates of convergence under certain regularity conditions on the problem (finite and unknown supported assumptions and tail bounded conditions on the target distribution).

Keywords:

Shannon entropy estimation; countably infinite alphabets; entropy convergence results; statistical learning; histogram-based estimators; data-driven partitions; strong consistency; rates of convergence

1. Introduction

Shannon entropy estimation has a long history in information theory, statistics, and computer science [1]. Entropy and related information measures (conditional entropy and mutual information) have a fundamental role in information theory and statistics [2,3] and, as a consequence, it has found numerous applications in learning and decision making tasks [4,5,6,7,8,9,10,11,12,13,14,15]. In many of these contexts, distributions are not available and the entropy needs to be estimated from empirical data. This problem belongs to the category of scalar functional estimation that has been thoroughly studied in non-parametric statistics.

Starting with the finite alphabet scenario, the classical plug-in estimator (i.e., the empirical distribution evaluated on the functional) is well known to be consistent, minimax optimal, and asymptotically efficient [16] (Section 8.7–8.9). More recent research has focused on looking at the so-called large alphabet (or large dimensional) regime, meaning a non-asymptotic under-sampling regime where the number of samples n is on the order of, or even smaller than, the size of the alphabet denoted by k. In this context, it has been shown that the classical plug-in estimator is sub-optimal as it suffers from severe bias [17,18]. For characterizing optimality in this high dimensional context, a non-asymptotic minimax mean square error analysis (under a finite n and k) has been explored by several authors [17,18,19,20,21] considering the minimax risk

R^{*} (k, n) = inf_{\hat{H} (\cdot)} sup_{μ \in P (k)} E_{X_{1}, \dots X_{n} \sim μ^{n}} \{{(\hat{H} (X_{1}, \dots, X_{n}) - H (μ))}^{2}\}

where

P (k)

denotes the collection of probabilities on

[k] \equiv \{1, \dots, k\}

and

H (μ)

is the entropy of

μ

(details in Section 2). Paninski [19] first showed that it was possible to construct an entropy estimator that uses a sub-linear sampling size to achieve minimax consistency when k goes to infinity, in the sense that there is a sequence

(n_{k}) = o (k)

where

R^{*} (k, n_{k}) ⟶ 0

as k goes to infinity. A set of results by Valiant et al. [20,21] shows that the optimal scaling of the sampling size with respect to k is O(k/

\log (k

)), to achieve the aforementioned asymptotic consistency for entropy estimation. A refined set of results for the complete characterization of

R^{*} (k, n)

, the specific scaling of the sampling complexity, and the achievability of the obtained minimax

L_{2}

risk for the family

\{P (k) : k \geq 1\}

with practical estimators have been presented in [17,18]. On the other hand, it is well-known that the problem of estimating the distribution (consistently in total variation) in finite alphabets requires a sampling complexity that scales as

O (k)

[22]. Consequently, in finite alphabets the task of entropy estimation is simpler than estimating the distribution in terms of sampling complexity. These findings are consistent with the observation that the entropy is a continuous functional of the space of distributions (in the total variational distance sense) for the finite alphabet case [2,23,24,25].

1.1. The Challenging Infinite Alphabet Learning Scenario

In this work, we are interested in the countably infinite alphabet scenario, i.e., on the estimation of the entropy when the alphabet is countably infinite and we have a finite number of samples. This problem can be seen as an infinite dimensional regime as the size of the alphabet goes unbounded and n is kept finite for the analysis, which differs from the large dimensional regime mentioned above. As argued in [26] (Section IV), this is a challenging non-parametric learning problem because some of the finite alphabet properties of the entropy do not extend to this infinite dimensional context. Notably, it has been shown that the Shannon entropy is not a continuous functional with respect to the total variational distance in infinite alphabets [24,26,27]. In particular, Ho et al. [24] (Theorem 2) showed concrete examples where convergence in

χ^{2}

-divergence and in direct information divergence (I-divergence) of a set of distributions to a limit, both stronger than total variational convergence [23,28], do not imply the convergence of the entropy. In addition, Harremoës [27] showed the discontinuity of the entropy with respect to the reverse I-divergence [29], and consequently, with respect to the total variational distance (the distinction between reverse and direct I-divergence was pointed out in the work of Barron et al. [29]). In entropy estimation, the discontinuity of the entropy implies that the minimax mean square error goes unbounded, i.e.,

R_{n}^{*} = inf_{\hat{H} (\cdot)} sup_{μ \in H (X)} E_{X_{1}, \dots X_{n} \sim μ^{n}} \{{(\hat{H} (X_{1}, \dots, X_{n}) - H (μ))}^{2}\} = \infty,

where

H (X)

denotes the family of finite entropy distribution over the countable alphabet set

X

(the proof of this result follows from [26] (Theorem 1) and the argument is presented in Appendix A). Consequently, there is no universal minimax consistent estimator (in the mean square error sense) of the entropy over the family of finite entropy distributions.

Considering a sample-wise (or point-wise) convergence to zero of the estimation error (instead of the minimax expected error analysis mentioned above), Antos et al. [30] (Theorem 2 and Corollary 1) show the remarkable result that the classical plug-in estimate is strongly consistent and consistent in the mean square error sense for any finite entropy distribution (point-wise). Then, the classical plug-in entropy estimator is universal, meaning that the convergence to the right limiting value

H (μ)

is achieved almost surely despite the discontinuity of the entropy. Moving on the analysis of the (point-wise) rate of convergence of the estimation error, Antos et al. [30] (Theorem 3) present a finite length lower bound for the error of any arbitrary estimation scheme, showing as a corollary that no universal rate of convergence (to zero) can be achieved for entropy estimation in infinite alphabets [30] (Theorem 4). Finally, constraining the problem to a family of distributions with specific power tail bounded conditions, Antos et al. [30] (Theorem 7) present a finite length expression for the rate of convergence of the estimation error of the classical plug-in estimate.

1.2. From Convergence Results to Entropy Estimation

In view of the discontinuity of the entropy in ∞-alphabets [24] and the results that guarantee entropy convergence [25,26,27,31], this work revisits the problem of point-wise almost-sure entropy estimation in ∞-alphabets from the perspective of studying and applying entropy convergence results and their derived bounds [25,26,31]. Importantly, entropy convergence results have established concrete conditions on both the limiting distribution

μ

and the way a sequence of distributions

\{μ_{n} : n \geq 0\}

converges to

μ

such that

{lim}_{n \to \infty} H (μ_{n}) = H (μ)

is satisfied. The natural observation that motivates this work is that consistency is basically a convergence to the true entropy value that happens with probability one. Then our main conjecture is that putting these conditions in the context of a learning task, i.e., where

\{μ_{n} : n \geq 0\}

is a random sequence of distributions driven by the classical empirical process, will offer the possibility to study a broad family of plug-in estimators with the objective to derive new strong consistency and rates of convergence results. On the practical side, this work proposes and analizes a data-driven histogram-based estimator as a key learning scheme, since this approach offers the flexibility to adapt to learning task when appropriate bounds for the estimation and approximation errors are derived.

1.3. Contributions

We begin revisiting the classical plug-in entropy estimator considering the relevant scenario where

μ

(the unknown distribution that produces the i.i.d. samples) has a finite but arbitrary large and unknown support. This is declared to be a challenging problem by Ho and Yeung [26] (Theorem 13) because of the discontinuity of the entropy. Finite-length (non-asymptotic) deviation inequalities and intervals of confidence are derived extending the results presented in [26] (Section IV). From this, it is shown that the classical plug-in estimate achieves optimal rates of convergence. Relaxing the finite support restriction on

μ

, two concrete histogram-based plug-in estimators are presented, one built upon the celebrated Barron-Györfi-van der Meulen histogram-based approach [29,32,33]; and the other on a data-driven partition of the space [34,35,36]. For the Barron plug-in scheme, almost-sure consistency is shown for entropy estimation and distribution estimation in direct I-divergence under some mild support conditions on

μ

. For the data-driven partition scheme, the main context of application of this work, it is shown that this estimator is strongly consistent distribution-free, matching the universal result obtained for the classical plug-in approach in [30]. Furthermore, new almost-sure rates of convergence results (in the estimation error) are obtained for distributions with finite but unknown support and for families of distributions with power and exponential tail dominating conditions. In this context, our results show that this adaptive scheme has a concrete design solution that offers very good convergence rate of the overall estimation error, as it approaches the rate

O (1 / \sqrt{n})

that is considered optimal for the finite alphabet case [16]. Importantly, the parameter selection of this scheme relies on, first, obtaining expressions to bound the estimation and approximation errors and, second, finding the optimal balance between these two learning errors.

1.4. Organization

The rest of the paper is organized as follows. Section 2 introduces some basic concepts, notation, and summarizes the main entropy convergence results used in this work. Section 3, Section 4 and Section 5 state and elaborate the main results of this work. Discussion of the results and final remarks are given in Section 6. The technical derivation of the main results are presented in Section 7. Finally, proofs of auxiliary results are relegated to the Appendix Section.

2. Preliminaries

Let

X

be a countably infinite set and let

P (X)

denote the collection of probability measures in

X

. For

μ

and v in

P (X)

, and

μ

absolutely continuous with respect to v (i.e.,

μ ≪ v

),

\frac{d μ}{d v} (x)

denotes the Radon-Nikodym (RN) derivative of

μ

with respect to v. Every

μ \in P (X)

is equipped with its probability mass function (pmf) that we denote by

f_{μ} (x) \equiv μ (\{x\})

,

\forall x \in X

. Finally, for any

μ \in P (X)

,

A_{μ} \equiv \{x \in X : f_{μ} (x) > 0\}

denotes its support and

F (X) \equiv \{μ \in P (X) : |A_{μ}| < \infty\}

(1)

denotes the collection of probabilities with finite support.

Let

μ

and v be in

P (X)

, then the total variation distance of

μ

and v is given by [28]

V (μ, v) \equiv sup_{A \in 2^{X}} |v (A) - μ (A)|,

(2)

where

2^{X}

denotes the subsets of

X

. The Kullback–Leibler divergence or I-divergence of

μ

with respect to v is given by

D (μ | | v) \equiv \sum_{x \in A_{μ}} f_{μ} (x) \log \frac{f_{μ} (x)}{f_{v} (x)} \geq 0,

(3)

when

μ ≪ v

, while

D (μ | | v)

is set to infinite, otherwise [37].

The Shannon entropy of

μ \in P (X)

is given by [1,2,38]:

H (μ) \equiv - \sum_{x \in A_{μ}} f_{μ} (x) \log f_{μ} (x) \geq 0 .

(4)

In this context, let

H (X) \subset P (X)

be the collection of probabilities where (4) is finite, let

AC (X | v) \subset P (X)

denote the collection of probabilities absolutely continuous with respect to

v \in P (X)

, and let

H (X | v) \subset AC (X | v)

denote the collection of probabilities where (3) is finite for

v \in P (X)

.

Concerning convergence, a sequence

\{μ_{n} : n \in N\} \subset P (X)

is said to converge in total variation to

μ \in P (X)

if

lim_{n \to \infty} V (μ_{n}, μ) = 0 .

(5)

For countable alphabets, ref. [31] (Lemma 3) shows that the convergence in total variation is equivalent to the weak convergence, which is denoted here by

μ_{n} \Rightarrow μ

, and the point-wise convergence of the pmf’s. Furthermore, from (2), the convergence in total variation implies the uniform convergence of the pmf’s, i.e,

{lim}_{n \to \infty} {sup}_{x \in X} |μ_{n} (\{x\}) - μ (\{x\})| = 0

. Therefore, in this countable case, all the four previously mentioned notions of convergence are equivalent: total variation, weak convergence, point-wise convergence of the pmf’s, and uniform convergence of the pmf’s.

We conclude with the convergence in I-divergence introduced by Barron et al. [29]. It is said that

\{μ_{n} : n \in N\}

converges to

μ

in direct and in reverse I-divergence if

{lim}_{n \to \infty} D (μ | | μ_{n}) = 0

and

{lim}_{n \to \infty} D (μ_{n} | | μ) = 0

, respectively. From Pinsker’s inequality [39,40,41], the convergence in I-divergence implies the weak convergence in (5), where it is known that the converse result is not true [27].

2.1. Convergence Results for the Shannon Entropy

The discontinuity of the entropy in ∞-alphabets raises the problem of finding conditions under which convergence of the entropy can be obtained. On this topic, Ho et al. [26] have studied the interplay between entropy and the total variation distance, specifying conditions for convergence by assuming a finite support on the involved distributions. On the other hand, Harremoës [27] (Theorem 21) obtained convergence of the entropy by imposing a power dominating condition [27] (Definition 17) on the limiting probability measure

μ

for all the sequences

\{μ_{n} : n \geq 0\}

converging in reverse I-divergence to

μ

[29]. More recently, Silva et al. [25] have addressed the entropy convergence studying a number of new settings that involve conditions on the limiting measure

μ

, as well as the way the sequence

\{μ_{n} : n \geq 0\}

converges to

μ

in the space of distributions. These results offer sufficient conditions where the entropy evaluated in a sequence of distributions converges to the entropy of its limiting distribution and, consequently, the possibility of applying these when analyzing plug-in entropy estimators. The results used in this work are summarized in the rest of this section.

Let us begin with the case when

μ \in F (X)

, i.e., when the support of the limiting measure is finite and unknown.

Proposition 1.

Let us assume that

μ \in F (X)

and

\{μ_{n} : n \in N\} \subset AC (X | μ)

. If

μ_{n} \Rightarrow μ

, then

{lim}_{n \to \infty} D (μ_{n} | | μ) = 0

and

{lim}_{n \to \infty} H (μ_{n}) = H (μ)

.

This result is well-known because when

A_{μ_{n}} \subset A_{μ}

for all n, the scenario reduces to the finite alphabet case, where the entropy is known to be continuous [2,23]. Since we obtain two inequalities that are used in the following sections, a simple proof is provided here.

Proof.

μ

and

μ_{n}

belong to

H (X)

from the finite-supported assumption. The same argument can be used to show that

D (μ_{n} | | μ) < \infty

, since

μ_{n} ≪ μ

for all n. Let us consider the following identity:

H (μ) - H (μ_{n}) = \sum_{x \in A_{μ}} (f_{μ_{n}} (x) - f_{μ} (x)) \log f_{μ} (x) + D (μ_{n} | | μ) .

(6)

The first term on the right hand side (RHS) of (6) is upper bounded by

M_{μ} \cdot V (μ_{n}, μ)

where

M_{μ} = \log \frac{1}{m_{μ}} \equiv sup_{x \in A_{μ}} |\log μ (\{x\})| < \infty .

(7)

For the second term, we have that

\begin{matrix} D (μ_{n} | | μ) & \leq \log e \cdot \sum_{x \in A_{μ_{n}}} f_{μ_{n}} (x) |\frac{f_{μ_{n}} (x)}{f_{μ} (x)} - 1| \\ \leq \frac{\log e}{m_{μ}} \cdot sup_{x \in A_{μ}} |f_{μ_{n}} (x) - f_{μ} (x)| \leq \frac{\log e}{m_{μ}} \cdot V (μ_{n}, μ) . \end{matrix}

(8)

and, consequently,

\begin{matrix} |H (μ) - H (μ_{n})| \leq [M_{μ} + \frac{\log e}{m_{μ}}] \cdot V (μ_{n}, μ) . \end{matrix}

(9)

☐

Under the assumptions of Proposition 1, we note that the reverse I-divergence and the entropy difference are bounded by the total variation by (8) and (9), respectively. Note, however, that these bounds are a distribution-dependent function of

m_{μ}

(

M_{μ}

) in (7) (it is direct to show that

m_{μ} (M_{μ}) < \infty

if, and only if,

μ \in F (X)

). The next result relaxes the assumption that

μ_{n} ≪ μ

and offers a necessary and sufficient condition for the convergence of the entropy.

Lemma 1.

Ref. [25] (Theorem 1) Let

μ \in F (X)

and

\{μ_{n} : n \in N\} \subset F (X)

. If

μ_{n} \Rightarrow μ

, then there exists

N > 0

such that

μ ≪ μ_{n}

\forall n \geq N

, and

lim_{n \to \infty} D (μ | | μ_{n}) = 0 .

Furthermore,

{lim}_{n \to \infty} H (μ_{n}) = H (μ)

, if and only if,

lim_{n \to \infty} μ_{n} (A_{μ_{n}} \ A_{μ}) \cdot H (μ_{n} (\cdot | A_{μ_{n}} \ A_{μ})) = 0 \Leftrightarrow lim_{n \to \infty} \sum_{x \in A_{μ_{n}} \ A_{μ}} f_{μ_{n}} (x) \log \frac{1}{f_{μ_{n}} (x)} = 0,

(10)

where

μ (\cdot | B)

denotes the conditional probability of μ given the event

B \subset X

.,

Lemma 1 tells us that to achieve entropy convergence (on top of the weak convergence), it is necessary and sufficient to ask for a vanishing expression (with n) of the entropy of

μ_{n}

restricted to the elements of the set

A_{μ_{n}} \ A_{μ}

. Two remarks about this result are: (1) The convergence in direct I-divergence does not imply the convergence of the entropy (concrete examples are presented in [24] (Section III) and [25]); (2) Under the assumption that

μ \in F (X)

,

μ

is eventually absolutely continuous with respect to

μ_{n}

, and the convergence in total variations is equivalent to the convergence in direct I-divergence.

This section concludes with the case when the support of

μ

is infinite and unknown, i.e.,

|A_{μ}| = \infty

. In this context, two results are highlighted:

Lemma 2.

Ref. [31] (Theorem 4) Let us consider that

μ \in H (X)

and

\{μ_{n} : n \geq 0\} \subset AC (X | μ)

. If

μ_{n} \Rightarrow μ

and

M \equiv sup_{n \geq 1} sup_{x \in A_{μ}} \frac{f_{μ_{n}} (x)}{f_{μ} (x)} < \infty,

(11)

then

μ_{n} \in H (X) \cap H (X | μ)

for all n and it follows that

lim_{n \to \infty} D (μ_{n} | | μ) = 0 and lim_{n \to \infty} H (μ_{n}) = H (μ) .

Interpreting Lemma 2 we have that, to obtain the convergence of the entropy functional (without imposing a finite support assumption on

μ

), a uniform bounding condition (UBC)

μ

-almost everywhere was added in (11). By adding this UBC, the convergence on reverse I-divergence is also obtained as a byproduct. Finally, when

μ ≪ μ_{n}

for all n, the following result is considered:

Lemma 3.

Ref. [25] (Theorem 3) Let

μ \in H (X)

and a sequence of measures

\{μ_{n} : n \geq 1\} \subset H (X)

such that

μ ≪ μ_{n}

for all

n \geq 1

. If

μ_{n} \Rightarrow μ

and

sup_{n \geq 1} sup_{x \in A_{μ}} |\log \frac{f_{μ_{n}} (x)}{f_{μ} (x)}| < \infty

(12)

then,

μ \in H (X | μ_{n})

for all

n \geq 1

, and

lim_{n \to \infty} D (μ | | μ_{n}) = 0 .

Furthermore,

{lim}_{n \to \infty} H (μ_{n}) = H (μ)

, if and only if,

lim_{n \to \infty} \sum_{x \in A_{μ_{n}} \ A_{μ}} f_{μ_{n}} (x) \log \frac{1}{f_{μ_{n}} (x)} = 0 .

(13)

This result shows the non-sufficiency of the convergence in direct I-divergence to achieve entropy convergence in the regime when

μ ≪ μ_{n}

. In fact, Lemma 3 may be interpreted as an extension of Lemma 1 when the finite support assumption over

μ

is relaxed.

3. Shannon Entropy Estimation

Let

μ

be a probability in

H (X)

, and let denote by

X_{1}, X_{2}, X_{3}, \dots

the empirical process induced from i.i.d. realizations of a random variable driven by

μ

, i.e.,

X_{i} \sim μ

, for all

i \geq 0

. Let

P_{μ}

denote the distribution of the empirical process in

(X^{\infty}, B (X^{\infty}))

and

P_{μ}^{n}

denote the finite block distribution of

X^{n} \equiv (X_{1}, \dots, X_{n})

in the product space

(X^{n}, B (X^{n}))

. Given a realization of

X_{1}, X_{2}, X_{3}, \dots, X_{n}

, we can construct an histogram-based estimator like classical empirical probability given by:

{\hat{μ}}_{n} (A) \equiv \frac{1}{n} \sum_{k = 1}^{n} 1_{A} (X_{k}), \forall A \subset X,

(14)

with pmf given by

f_{{\hat{μ}}_{n}} (x) = {\hat{μ}}_{n} (\{x\})

for all

x \in X

. A natural estimator of the entropy is the plug-in estimate of

{\hat{μ}}_{n}

given by

H ({\hat{μ}}_{n}) = - \sum_{x \in X} f_{{\hat{μ}}_{n}} (x) \log f_{{\hat{μ}}_{n}} (x),

(15)

which is a measurable function of

X_{1}, \dots, X_{n}

(this dependency on the data will be implicit for the rest of the paper).

For the rest of this Section and Section 4 and Section 5, the convergence results in Section 2.1 are used to derive strong consistency results for plug-in histogram-based estimators, like

H ({\hat{μ}}_{n})

in (15), as well as finite length concentration inequalities to obtain almost-sure rates of convergence for the overall estimation error

|H ({\hat{μ}}_{n}) - H (μ)|

.

3.1. Revisiting the Classical Plug-In Estimator for Finite and Unknown Supported Distributions

We start by analyzing the case when

μ

has a finite but unknown support. A consequence of the strong law of large numbers [42,43] is that

\forall x \in X

,

{lim}_{n \to \infty} {\hat{μ}}_{n} (\{x\}) = μ (\{x\})

,

P_{μ}

-almost surely (a.s.), hence

{lim}_{n \to \infty} V ({\hat{μ}}_{n}, μ) = 0

,

P_{μ}

-a.s. On the other hand, it is clear that

A_{{\hat{μ}}_{n}} \subset A_{μ}

holds with probability one. Then Proposition 1 implies that

\begin{matrix} lim_{n \to \infty} D ({\hat{μ}}_{n} | | μ) = 0 and lim_{n \to \infty} H ({\hat{μ}}_{n}) = H (μ), P_{μ} - a . s ., \end{matrix}

(16)

i.e.,

{\hat{μ}}_{n}

is a strongly consistent estimator of

μ

in reverse I-divergence and

H ({\hat{μ}}_{n})

is a strongly consistent estimate of

H (μ)

distribution-free in

F (X)

. Furthermore, the following can be stated:

Theorem 1.

Let

μ \in F (X)

and let us consider

{\hat{μ}}_{n}

in (14). Then

{\hat{μ}}_{n} \in H (X) \cap H (X | μ)

,

P_{μ}

-a.s and

\forall n \geq 1

,

\forall ϵ > 0

,

\begin{matrix} P_{μ}^{n} (D ({\hat{μ}}_{n} | | μ) > ϵ) \leq 2^{|A_{μ}| + 1} \cdot e^{- \frac{2 m_{μ}^{2} \cdot n ϵ^{2}}{\log e^{2}}}, \end{matrix}

(17)

\begin{matrix} P_{μ}^{n} (|H ({\hat{μ}}_{n}) - H (μ)| > ϵ) \leq 2^{|A_{μ}| + 1} \cdot e^{- \frac{2 n ϵ^{2}}{{(M_{μ} + \frac{\log e}{m_{μ}})}^{2}}} . \end{matrix}

(18)

Moreover,

D (μ | | {\hat{μ}}_{n})

is eventually finite with probability one and

\forall ϵ > 0

, and for any

n \geq 1

,

\begin{matrix} P_{μ}^{n} (D (μ | | {\hat{μ}}_{n}) > ϵ) \leq 2^{|A_{μ}| + 1} \cdot [e^{- \frac{2 n ϵ^{2}}{\log e^{2} \cdot {(1 / m_{u} + 1)}^{2}}} + e^{- n m_{μ}^{2}}] . \end{matrix}

(19)

This result implies that for any

τ \in (0, 1 / 2)

and

μ \in F (X)

,

|H ({\hat{μ}}_{n}) - H (μ)|

,

D ({\hat{μ}}_{n} | | μ)

, and

D (μ | | {\hat{μ}}_{n})

goes to zero as

o (n^{- τ})

P_{μ}

-a.s. Furthermore,

E_{P_{μ}^{n}} (|H ({\hat{μ}}_{n}) - H (μ)|)

and

E_{P_{μ}^{n}} (D ({\hat{μ}}_{n} | | μ))

behave like

O (1 / \sqrt{n})

for all

μ \in F (X)

from (30) in Section 7, which is the optimal rate of convergence of the finite alphabet scenario. As a corollary of (18), it is possible to derive intervals of confidence for the estimation error

|H (({\hat{μ}}_{n}) - H (μ_{n})|

: for all

δ > 0

and

n \geq 1

,

\begin{matrix} P_{μ} (|H (({\hat{μ}}_{n}) - H (μ_{n})| \leq (M_{μ} + \log e / m_{μ}) \sqrt{\frac{1}{2 n} \ln \frac{2^{|A_{μ}| + 1}}{δ}}) \geq 1 - δ . \end{matrix}

(20)

This confidence interval behaves like

O (1 / \sqrt{n})

as a function of n, and like

O (\sqrt{\ln 1 / δ})

as a function of

δ

, which are the same optimal asymptotic trends that can be obtained for

V (μ, {\hat{μ}}_{n})

in (30).

Finally, we observe that

A_{{\hat{μ}}_{n}} \subset A_{μ}

P_{μ}^{n}

-a.s. where for any

n \geq 1

,

P_{μ}^{n} (A_{{\hat{μ}}_{n}} \neq A_{μ}) > 0

implying that

E_{P_{μ}^{n}} (D (μ | | {\hat{μ}}_{n})) = \infty

for all finite n. Then even in the finite and unknown supported scenario,

{\hat{μ}}_{n}

is not consistent in expected direct I-divergence, which is congruent with the result in [29,44]. Besides this negative result, strong consistency in direct I-divergence can be obtained from (19), in the sense that

{lim}_{n \to \infty} D (μ | | {\hat{μ}}_{n}) = 0

,

P_{μ}

-a.s.

3.2. A Simplified Version of the Barron Estimator for Finite Supported Probabilities

It is well-understood that consistency in expected direct I-divergence is of critical importance for the construction of a lossless universal source coding scheme [2,23,29,44,45,46,47,48]. Here, we explore an estimator that achieves this learning objective, in addition to entropy estimation. For that, let

μ \in F (X)

and let assume

v \in F (X)

such that

μ ≪ v

. Barron et al. [29] proposed a modified version of the empirical measure in (14) to estimate

μ

from i.i.d. realizations, adopting a mixture estimate of the form

\begin{matrix} {\tilde{μ}}_{n} (B) = (1 - a_{n}) \cdot {\hat{μ}}_{n} (B) + a_{n} \cdot v (B), \end{matrix}

(21)

for all

B \subset X

, and with

{(a_{n})}_{n \in N}

a sequence of real numbers in

(0, 1)

. Note that

A_{{\tilde{μ}}_{n}} = A_{v}

then

μ ≪ {\tilde{μ}}_{n}

for all n and from the finite support assumption

H ({\tilde{μ}}_{n}) < \infty

and

D (μ | | {\tilde{μ}}_{n}) < \infty

,

P_{μ}

-a.s.. The following result derives from the convergence result in Lemma 1.

Theorem 2.

Let

v \in F (X)

,

μ ≪ v

and let us consider

{\tilde{μ}}_{n}

in (21) induced from i.i.d. realizations of μ.

(i): If $(a_{n})$ is $o (1)$ , then ${lim}_{n \to \infty} H ({\tilde{μ}}_{n}) = H (μ)$ , ${lim}_{n \to \infty} D (μ | | {\tilde{μ}}_{n}) = 0$ , $P_{μ}$ -a.s., and ${lim}_{n \to \infty} E_{P_{μ}} (D (μ | | {\tilde{μ}}_{n})) = 0$ .
(ii): Furthermore, if $(a_{n})$ is $O (n^{- p})$ with $p > 2$ , then for all $τ \in (0, 1 / 2)$ , $|H ({\tilde{μ}}_{n}) - H (μ)|$ and $D (μ | | {\tilde{μ}}_{n})$ are $o (n^{- τ})$ $P_{μ}$ -a.s, and $E_{P_{μ}} (|H ({\tilde{μ}}_{n}) - H (μ)|)$ and $E_{P_{μ}} (D (μ | | {\tilde{μ}}_{n}))$ are $O (1 / \sqrt{n})$ .

Using this approach, we achieve estimation of the true distribution in expected information divergence as well as strong consistency for entropy estimation as intended. In addition, optimal rates of convergence are obtained under the finite support assumption on

μ

.

4. The Barron-Györfi-van der Meulen Estimator

The celebrated Barron estimator was proposed by Barron, Györfi and van der Meulen [29] in the context of an abstract and continuous measurable space. It is designed as a variation of the classical histogram-based scheme to achieve a consistent estimation of the distribution in direct I-divergence [29] (Theorem 2). Here, the Barron estimator is revisited in the countable alphabet scenario, with the objective of estimating the Shannon entropy consistently, which, to the best of our knowledge, has not been previously addressed in literature. For that purpose, the convergence result in Lemma 3 will be used as a key result.

Let

v \in P (X)

be of infinite support (i.e.,

m_{v} = {inf}_{x \in A_{v}} v (\{x\}) = 0

). We want to construct a strongly consistent estimate of the entropy restricted to the collection of probabilities in

H (X | v)

. For that, let us consider a sequence

{(h_{n})}_{n \geq 0}

with values in

(0, 1)

and let us denote by

π_{n} = \{A_{n, 1}, A_{n, 2}, \dots, A_{n, m_{n}}\}

the finite partition of

X

with maximal cardinality satisfying that

v (A_{n, i}) \geq h_{n}, \forall i \in \{1, \dots, m_{n}\} .

(22)

Note that

m_{n} = |π_{n}| \leq 1 / h_{n}

for all

n \geq 1

, and because of the fact that

{inf}_{x \in A_{v}} v (\{x\}) = 0

it is simple to verify that if

(h_{n})

is

o (1)

then

{lim}_{n \to \infty} m_{n} = \infty

.

π_{n}

offers an approximated statistically equivalent partition of

X

with respect to the reference measure v. In this context, given

X_{1}, \dots, X_{n}

, i.i.d. realizations of

μ \in H (X | v)

, the idea proposed by Barron et al. [29] is to estimate the RN derivative

\frac{d μ}{d v} (x)

by the following histogram-based construction:

\frac{d μ_{n}^{*}}{d v} (x) = (1 - a_{n}) \cdot \frac{{\hat{μ}}_{n} (A_{n} (x))}{v (A_{n} (x))} + a_{n}, \forall x \in A_{v},

(23)

where

a_{n}

is a real number in

(0, 1)

,

A_{n} (x)

denotes the cell in

π_{n}

that contains the point x, and

{\hat{μ}}_{n}

is the empirical measure in (14). Note that

f_{μ_{n}^{*}} (x) = \frac{d μ_{n}^{*}}{d λ} (x) = f_{v} (x) \cdot [(1 - a_{n}) \cdot \frac{{\hat{μ}}_{n} (A_{n} (x))}{v (A_{n} (x))} + a_{n}],

\forall x \in X

, and, consequently,

\forall B \subset X

μ_{n}^{*} (B) = (1 - a_{n}) \sum_{i = 1}^{m_{n}} {\hat{μ}}_{n} (A_{n, i}) \cdot \frac{v (B \cap A_{n, i})}{v (A_{n, i})} + a_{n} v (B) .

(24)

By construction

A_{μ} \subset A_{v} \subset A_{μ_{n}^{*}}

and, consequently,

μ ≪ μ_{n}^{*}

for all

n \geq 1

. The next result shows sufficient conditions on the sequences

(a_{n})

and

(h_{n})

to guarantee a strongly consistent estimation of the entropy

H (μ)

and of

μ

in direct I-divergence, distribution free in

H (X | v)

. The proof is based on verifying that the sufficient conditions of Lemma 3 are satisfied

P_{μ}

-a.s.

Theorem 3.

Let v be in

P (X) \cap H (X)

with infinite support, and let us consider μ in

H (X | v)

. If we have that:

(i): $(a_{n})$ is $o (1)$ and $(h_{n})$ is $o (1)$ ,
(ii): $\exists τ \in (0, 1 / 2)$ , such that the sequence $(\frac{1}{a_{n} \cdot h_{n}})$ is $o (n^{τ})$ ,

then

μ \in H (X) \cap H (X | μ_{n}^{*})

for all

n \geq 1

and

lim_{n \to \infty} H (μ_{n}^{*}) = H (μ) and lim_{n \to \infty} D (μ | | μ_{n}^{*}) = 0, P_{μ} - a . s . .

(25)

This result shows an admisible regime of design parameters and its scaling with the number of samples that guarantees that the Barron plug-in entropy estimator is strongly consistent in

H (X | v)

. As a byproduct, we obtain that the distribution

μ

is estimated consistently in direct information divergence.

The Barron estimator [29] was originally proposed in the context of distributions defined in an abstract measurable space. Then if we restrict [29] (Theorem 2) to the countable alphabet case, the following result is obtained:

Corollary 1.

Ref. [29] (Theorem 2) Let us consider

v \in P (X)

and

μ \in H (X | v)

. If

(a_{n})

is

o (1)

,

(h_{n})

is

o (1)

and

lim {sup}_{n \to \infty} \frac{1}{n a_{n} h_{n}} \leq 1

then

lim_{n \to \infty} D (μ | | μ_{n}^{*}) = 0, P_{μ} - a . s .

When the only objective is the estimation of distributions consistently in direct I-divergence, Corollary 1 should be considered to be a better result than Theorem 3 (Corollary 1 offers weaker conditions than Theorem 3 in particular condition (ii)). The proof of Theorem 3 is based on verifying the sufficient conditions of Lemma 3, where the objective is to achieve the convergence of the entropy, and as a consequence, the convergence in direct I-divergence. Therefore, we can say that the stronger conditions of Theorem 3 are needed when the objective is entropy estimation. This is justified from the observation that convergence in direct I-divergence does not imply entropy convergence in ∞-alphabets, as is discussed in Section 2.1 (see, Lemmas 1 and 3).

5. A Data-Driven Histogram-Based Estimator

Data-driven partitions offer a better approximation to the data distribution in the sample space than conventional non-adaptive histogram-based approaches [34,49]. They have the capacity to improve the approximation quality of histogram-based learning schemes, which translates in better performance in different non-parametric learning settings [34,35,36,50,51]. One of the basic design principles of this approach is to partition or select a sub-set of elements of

X

in a data-dependent way to preserve a critical number of samples per cell. In our problem, this last condition proves to be crucial to derive bounds for the estimation and approximation errors. Finally, these expressions will be used to propose design solutions that offer an optimal balance between estimation and approximation errors (Theorems 5 and 6).

Given

X_{1}, \dots, X_{n}

i.i.d. realizations driven by

μ \in H (X)

and

ϵ > 0

, let us define the data-driven set

Γ_{ϵ} \equiv \{x \in X : {\hat{μ}}_{n} (\{x\}) \geq ϵ\},

(26)

and

ϕ_{ϵ} \equiv Γ_{ϵ}^{c}

. Let

Π_{ϵ} \equiv \{\{x\} : x \in Γ_{ϵ}\} \cup \{ϕ_{ϵ}\} \subset 2^{X}

be a data-driven partition with maximal resolution in

Γ_{ϵ}

, and

σ_{ϵ} \equiv σ (Π_{ϵ})

be the smallest sigma field that contains

Π_{ϵ}

(as

Π_{ϵ}

is a finite partition,

σ_{ϵ}

is the collection of sets that are union of elements of

Π_{ϵ}

). We propose the conditional empirical probability restricted to

Γ_{ϵ}

by:

{\hat{μ}}_{n, ϵ} \equiv {\hat{μ}}_{n} (\cdot | Γ_{ϵ}) .

(27)

By construction, it follows that

A_{{\hat{μ}}_{n, ϵ}} = Γ_{ϵ} \subset A_{μ}

,

P_{μ}

-a.s. and this implies that

{\hat{μ}}_{n, ϵ} ≪ μ

for all

n \geq 1

. Furthermore,

|Γ_{ϵ}| \leq \frac{1}{ϵ}

and, importantly in the context of the entropy functional, it follows that

m_{{\hat{μ}}_{n}}^{ϵ} \equiv inf_{x \in Γ_{ϵ}} {\hat{μ}}_{n} (\{x\}) \geq ϵ .

(28)

The next result establishes a mild sufficient condition on

(ϵ_{n})

for which

H ({\hat{μ}}_{n, ϵ_{n}})

is strongly consistent distribution-free in

H (X)

. Considering that we are in the regime where

{\hat{μ}}_{n, ϵ_{n}} ≪ μ

,

P_{μ}

-a.s., the proof of this result uses the convergence result in Lemma 2 as a central result.

Theorem 4.

If

(ϵ_{n})

is

O (n^{- τ})

with

τ \in (0, 1)

, then for all

μ \in H (X)

lim_{n \to \infty} H ({\hat{μ}}_{n, ϵ_{n}}) = H (μ), P_{μ} - a . s .

Complementing Theorem 4, the next result offers almost-sure rates of converge for a family of distributions with a power tail bounded condition (TBC). In particular, the family of distributions studied in [30] (Theorem 7) are considered.

Theorem 5.

Let us assume that for some

p > 1

there are two constants

0 < k_{0} \leq k_{1}

and

N > 0

such that

k_{0} \cdot x^{- p} \leq μ (\{x\}) \leq k_{1} x^{- p}

for all

x \geq N

. If we consider that

(ϵ_{n}) \approx (n^{- τ^{*}})

for

τ^{*} = \frac{1}{2 + 1 / p}

, then

|H (μ) - H ({\hat{μ}}_{n, ϵ_{n}})| is O (n^{- \frac{1 - 1 / p}{2 + 1 / p}}), P_{μ} - a . s .

This result shows that under the mentioned p-power TBC on

f_{μ} (\cdot)

, the plug-in estimator

H ({\hat{μ}}_{n, ϵ_{n}})

can achieve a rate of convergence to the true limit that is

O (n^{- \frac{1 - 1 / p}{2 + 1 / p}})

with probability one. For the derivation of this result, the approximation sequence

(ϵ_{n})

is defined as a function of p (adapted to the problem) by finding an optimal tradeoff between estimation and approximation errors while performing a finite length (non-asymptotic) analysis of the expression

|H (μ) - H ({\hat{μ}}_{n, ϵ_{n}})|

(the details of this analysis are presented in Section 7).

It is insightful to look at two extreme regimes of this result: p approaching 1, in which the rate is arbitrarily slow (approaching a non-decaying behavior); and

p \to \infty

, where

|H (μ) - H ({\hat{μ}}_{n, ϵ_{n}})|

is

O (n^{- q})

for all

q \in (0, 1 / 2)

P_{μ}

-a.s.. This last power decaying range

q \in (0, 1 / 2)

matches what is achieved for the finite alphabet scenario (for instance in Theorem 1, Equation (18)), which is known to be the optimal rate for finite alphabets.

Extending Theorem 5, the following result addresses the more constrained case of distributions with an exponential TBC.

Theorem 6.

Let us consider

α > 0

and let us assume that there are

k_{0}, k_{1}

with

0 < k_{0} \leq k_{1}

and

N > 0

such that

k_{0} \cdot e^{- α x} \leq μ (\{x\}) \leq k_{1} \cdot e^{- α x}

for all

x \geq N

. If we consider

(ϵ_{n}) \approx (n^{- τ})

with

τ \in (0, 1 / 2)

, then

|H (μ) - H ({\hat{μ}}_{n, ϵ_{n}})| is O (n^{- τ} \log n), P_{μ} - a . s .

Under this stringent TBC on

f_{μ} (\cdot)

, it is observed that

|H (μ) - H ({\hat{μ}}_{n, ϵ_{n}})| is o (n^{- q})

P_{μ} - a . s .

, for any arbitrary

q \in (0, 1 / 2)

, by selecting

(ϵ_{n}) \approx (n^{- τ})

with

q < τ < 1 / 2

. This last condition on

τ

is universal over

α > 0

. Remarkably, for any distribution with this exponential TBC, we can approximate (arbitrarily closely) the optimal almost-sure rate of convergence achieved for the finite alphabet problem.

Finally, the finite and unknown supported scenario is revisited, where it is shown that the data-driven estimator exhibits the optimal almost sure convergence rate of the classical plug-in entropy estimator presented in Section 3.1.

Theorem 7.

Let us assume that

μ \in F (X)

and

(ϵ_{n})

being

o (1)

. Then for all

ϵ > 0

there is

N > 0

such that

\forall n \geq N

P_{μ}^{n} (|H ({\hat{μ}}_{n, ϵ_{n}}) - H (μ)| > ϵ) \leq 2^{|A_{μ}| + 1} \cdot [e^{- \frac{2 n ϵ^{2}}{{(M_{μ} + \frac{\log e}{m_{μ}})}^{2}}} + e^{- \frac{n m_{μ}^{2}}{4}}] .

(29)

The proof of this result reduces to verify that

{\hat{μ}}_{n, ϵ_{n}}

detects

A_{μ}

almost-surely when n goes to infinity and from this, it follows that

H ({\hat{μ}}_{n, ϵ_{n}})

eventually matches the optimal almost sure performance of

H ({\hat{μ}}_{n})

under the key assumption that

μ \in F (X)

. Finally, the concentration bound in (29) implies that

|H ({\hat{μ}}_{n, ϵ_{n}}) - H (μ)|

is

o (n^{- q})

almost surely for all

q \in (0, 1 / 2)

as long as

ϵ_{n} \to 0

with n.

6. Discussion of the Results and Final Remarks

This work shows that entropy convergence results are instrumental to derive new (strongly consistent) estimation results for the Shannon entropy in ∞-alphabets and, as a byproduct, distribution estimators that are strongly consistent in direct and reverse I-divergence. Adopting a set of sufficient conditions for entropy convergence in the context of four plug-in histogram-based schemes, this work shows concrete design conditions where strong consistency for entropy estimation in ∞-alphabets can be obtained (Theorems 2–4). In addition, the relevant case where the target distribution has a finite but unknown support is explored, deriving almost sure rates of convergence results for the overall estimation error (Theorems 1 and 7) that match the optimal asymptotic rate that can be obtained in the finite alphabet version of the problem (i.e., the finite and known supported case).

As the main context of application, this work focuses on the case of a data-driven plug-in estimator that restricts the support where the distribution is estimated. The idea is to have design parameters that control estimation-error effects and to find an adequate balance between these two learning errors. Adopting the entropy convergence result in Lemma 2, it is shown that this data-driven scheme offers the same universal estimation attributes than the classical plug-in estimate under some mild conditions on its threshold design parameter (Theorem 4). In addition, by addressing the technical task of deriving concrete closed-form expressions for the estimation and approximation errors in this learning context a solution is presented where almost-sure rates of convergence of the overall estimation error are obtained over a family of distributions with some concrete tail bounded conditions (Theorems 5 and 6). These results show the capacity that data-driven frameworks offer for adapting aspects of their learning scheme to the complexity of the entropy estimation task in ∞-alphabets.

Concerning the classical plug-in estimator presented in Section 3.1, it is important to mention that the work of Antos et al. [30] shows that

{lim}_{n \to \infty} H ({\hat{μ}}_{n}) = H (μ)

happens almost surely and distribution-free and, furthermore, it provides rates of convergence for families with specific tail-bounded conditions [30] (Theorem 7). Theorem 1 focuses on the case when

μ \in F (X)

, where new finite-length deviation inequalities and confidence intervals are derived. From that perspective, Theorem 1 complements the results presented in [30] in the non-explored scenario when

μ \in F (X)

. It is also important to mention two results by Ho and Yeung [26] (Theorems 11 and 12) for the plug-in estimator in (15). They derived bounds for

P_{μ}^{n} (|H ({\hat{μ}}_{n}) - H (μ)| \geq ϵ)

and determined confidence intervals under a finite and known support restriction on

μ

. In contrast, Theorem 1 resolves the case for a finite and unknown supported distribution, which is declared to be a challenging problem from the arguments presented in [26] (Theorem 13) concerning the discontinuity of the entropy.

7. Proof of the Main Results

Proof of Theorem 1.

Let

μ

be in

F (X)

, then

|A_{μ}| \leq k

for some

k > 1

. From Hoeffding’s inequality [28],

\forall n \geq 1

, and for any

ϵ > 0

\begin{matrix} P_{μ}^{n} (V ({\hat{μ}}_{n}, μ) > ϵ) \leq 2^{k + 1} \cdot e^{- 2 n ϵ^{2}} and E_{P_{μ}^{n}} (V ({\hat{μ}}_{n}, μ)) \leq 2 \sqrt{\frac{(k + 1) \log 2}{n}} . \end{matrix}

(30)

Considering that

{\hat{μ}}_{n} ≪ μ

P_{μ}

-a.s, we can use Proposition 1 to obtain that

\begin{matrix} D ({\hat{μ}}_{n} | | μ) \leq \frac{\log e}{m_{μ}} \cdot V ({\hat{μ}}_{n}, μ), and |H (({\hat{μ}}_{n}) - H (μ_{n})| \leq [M_{μ} + \frac{\log e}{m_{μ}}] \cdot V ({\hat{μ}}_{n}, μ) . \end{matrix}

(31)

Hence, (17) and (18) derive from (30).

For the direct I-divergence, let us consider a sequence

{(x_{i})}_{i \geq 1}

and the following function (a stopping time):

\begin{matrix} T_{o} (x_{1}, x_{2}, \dots) \equiv inf \{n \geq 1 : A_{{\hat{μ}}_{n} (x^{n})} = A_{μ}\} . \end{matrix}

(32)

T_{o} (x_{1}, x_{2}, \dots)

is the point where the support of

{\hat{μ}}_{n} (x^{n})

is equal to

A_{μ}

and, consequently, the direct I-divergence is finite (since

μ \in F (X)

). In fact, by the uniform convergence of

{\hat{μ}}_{n}

to

μ_{n}

(

P_{μ}

-a.s.) and the finite support assumption of

μ

, it is simple to verify that

P_{μ} (T_{o} (X_{1}, X_{2}, \dots) < \infty) = 1

. Let us define the event:

\begin{matrix} B_{n} \equiv \{x_{1}, x_{2}, \dots : T_{o} (x_{1}, x_{2}, \dots) \leq n\} \subset X^{N}, \end{matrix}

(33)

i.e., the collection of sequences in

X^{N}

where at time n,

A_{{\hat{μ}}_{n}} = A_{μ}

and, consequently,

D (μ | | {\hat{μ}}_{n}) < \infty

. Restricted to this set

\begin{matrix} D (μ | | {\hat{μ}}_{n}) & \leq \sum_{x \in A_{{\hat{μ}}_{n} | | μ}} f_{{\hat{μ}}_{n}} (x) \log \frac{f_{{\hat{μ}}_{n}} (x)}{f_{μ} (x)} + \sum_{x \in A_{μ} \ A_{{\hat{μ}}_{n} | | μ}} f_{{\hat{μ}}_{n}} (x) \log \frac{f_{μ} (x)}{f_{{\hat{μ}}_{n}} (x)} \\ \leq \log e \cdot \sum_{x \in A_{{\hat{μ}}_{n} | | μ}} f_{{\hat{μ}}_{n}} (x) \cdot (\frac{f_{{\hat{μ}}_{n}} (x)}{f_{μ} (x)} - 1) \end{matrix}

(34)

\begin{matrix} + \log e \cdot [μ (A_{μ} \ A_{{\hat{μ}}_{n} | | μ}) - {\hat{μ}}_{n} ((A_{μ} \ A_{{\hat{μ}}_{n} | | μ}))] \end{matrix}

(35)

\begin{matrix} \leq \log e \cdot (1 / m_{u} + 1) V (μ, {\hat{μ}}_{n}), \end{matrix}

(36)

where in the first inequality

A_{{\hat{μ}}_{n} | | μ} \equiv \{x \in A_{{\hat{μ}}_{n}} : f_{{\hat{μ}}_{n}} (x) > f_{μ} (x)\}

, and the last is obtained by the definition of the total variational distance. In addition, let us define the

ϵ

-deviation set

A_{n}^{ϵ} \equiv \{x_{1}, x_{2}, \dots : D (μ | | {\hat{μ}}_{n} (x^{n})) > ϵ\} \subset X^{N}

. Then by additivity and monotonicity of

P_{μ}

, we have that

\begin{matrix} P_{μ} (A_{n}^{ϵ}) \leq P_{μ} (A_{n}^{ϵ} \cap B_{n}) + P_{μ} (B_{n}^{c}) . \end{matrix}

(37)

By definition of

B_{n}

, (36) and (30) it follows that

\begin{matrix} P_{μ} (A_{n}^{ϵ} \cap B_{n}) & \leq P_{μ} (V (μ | | {\hat{μ}}_{n}) \log e \cdot (1 / m_{u} + 1) > ϵ) \\ \leq 2^{|A_{μ}| + 1} \cdot e^{- \frac{2 n ϵ^{2}}{\log e^{2} \cdot {(1 / m_{u} + 1)}^{2}}} . \end{matrix}

(38)

On the other hand,

\forall ϵ_{o} \in (0, m_{μ})

if

V (μ, {\hat{μ}}_{n}) \leq ϵ_{o}

then

T_{o} \leq n

. Consequently

B_{n}^{c} \subset \{x_{1}, x_{2}, \dots : V (μ, {\hat{μ}}_{n} (x^{n})) > ϵ_{o}\}

, and again from (30)

\begin{matrix} P_{μ} (B_{n}^{c}) & \leq 2^{|A_{μ}| + 1} \cdot e^{- 2 n ϵ_{o}^{2}}, \end{matrix}

(39)

for all

n \geq 1

and

\forall ϵ_{o} \in (0, m_{μ})

. Integrating the results in (38) and (39) and considering

ϵ_{0} = m_{μ} / \sqrt{2}

suffice to show the bound in (19). ☐

Proof of Theorem 2.

As

(a_{n})

is

o (1)

, it is simple to verify that

{lim}_{n \to \infty} V ({\tilde{μ}}_{n}, μ) = 0

,

P_{μ}

-a.s. Also note that the support disagreement between

{\tilde{μ}}_{n}

and

μ

is bounded by the hypothesis, then

\begin{matrix} lim_{n \to \infty} {\tilde{μ}}_{n} (A_{μ_{n}} \ A_{μ}) \cdot \log |A_{{\tilde{μ}}_{n}} \ A_{μ}| \leq lim_{n \to \infty} {\tilde{μ}}_{n} (A_{μ_{n}} \ A_{μ}) \cdot \log |A_{v}| = 0, P_{μ} - a . s . \end{matrix}

(40)

Therefore from Lemma 1, we have the strong consistency of

H ({\tilde{μ}}_{n})

and the almost sure convergence of

D (μ | | {\tilde{μ}}_{n})

to zero. Note that

D (μ | | {\tilde{μ}}_{n})

is uniformly upper bounded by

\log e \cdot (1 / m_{μ} + 1) V (μ, {\tilde{μ}}_{n})

(see (36) in the proof of Theorem 1). Then the convergence in probability of

D (μ | | {\tilde{μ}}_{n})

implies the convergence of its mean [42], which concludes the proof of the first part.

Concerning rates of convergence, we use the following:

\begin{matrix} H (μ) - H ({\tilde{μ}}_{n}) & = \sum_{x \in A_{μ} \cap A_{{\tilde{μ}}_{n}}} [f_{{\tilde{μ}}_{n}} (x) - f_{μ} (x)] \log f_{μ} (x) + \sum_{x \in A_{μ} \cap A_{{\tilde{μ}}_{n}}} f_{{\tilde{μ}}_{n}} (x) \log \frac{f_{{\tilde{μ}}_{n}} (x)}{f_{μ} (x)} \\ - \sum_{x \in A_{{\tilde{μ}}_{n}} \ A_{μ}} f_{{\tilde{μ}}_{n}} (x) \log \frac{1}{f_{{\tilde{μ}}_{n}} (x)} . \end{matrix}

(41)

The absolute value of the first term in the right hand side (RHS) of (41) is bounded by

M_{μ} \cdot V ({\tilde{μ}}_{n}, μ)

and the second term is bounded by

\log e / m_{μ} \cdot V ({\tilde{μ}}_{n}, μ)

, from the assumption that

μ \in F (X)

. For the last term, note that

f_{{\tilde{μ}}_{n}} (x) = a_{n} \cdot v (\{x\})

for all

x \in A_{{\tilde{μ}}_{n}} \ A_{μ}

and that

A_{{\tilde{μ}}_{n}} = A_{v}

, then

0 \leq \sum_{x \in A_{{\tilde{μ}}_{n}} \ A_{μ}} f_{{\tilde{μ}}_{n}} (x) \log \frac{1}{f_{{\tilde{μ}}_{n}} (x)} \leq a_{n} \cdot (H (v) + \log \frac{1}{a_{n}} \cdot v (A_{v} \ A_{μ})) .

On the other hand,

\begin{matrix} V ({\tilde{μ}}_{n}, μ) & = \frac{1}{2} \sum_{x \in A_{μ}} |(1 - a_{n}) {\hat{μ}}_{n} (\{x\}) + a_{n} v (\{x\}) - μ (\{x\})| + \sum_{x \in A_{v} \ A_{μ}} a_{n} v (\{x\}) . \\ \leq (1 - a_{n}) \cdot V ({\hat{μ}}_{n}, μ) + a_{n} . \end{matrix}

Integrating these bounds in (41),

\begin{matrix} |H (μ) - H ({\tilde{μ}}_{n})| & \leq (M_{μ} + \log e / m_{μ}) \cdot ((1 - a_{n}) \cdot V ({\hat{μ}}_{n}, μ) + a_{n}) + a_{n} \cdot H (v) + a_{n} \cdot \log \frac{1}{a_{n}} \\ = K_{1} \cdot V ({\hat{μ}}_{n}, μ) + K_{2} \cdot a_{n} + a_{n} \cdot \log \frac{1}{a_{n}}, \end{matrix}

(42)

for constants

K_{1} > 0

and

K_{2} > 0

function of

μ

and v.

Under the assumption that

μ \in F (X)

, the Hoeffding’s inequality [28,52] tells us that

P_{μ} (V ({\hat{μ}}_{n}, μ) > ϵ) \leq C_{1} \cdot e^{- C_{2} n ϵ^{2}}

(for some distribution free constants

C_{1} > 0

and

C_{2} > 0

). From this inequality,

V ({\hat{μ}}_{n}, μ)

goes to zero as

o (n^{- τ})

P_{μ}

-a.s.

\forall τ \in (0, 1 / 2)

and

E_{P_{μ}} (V ({\hat{μ}}_{n}, μ))

is

O (1 / \sqrt{n})

. On the other hand, under the assumption in ii)

(K_{2} \cdot a_{n} + a_{n} \cdot \log \frac{1}{a_{n}})

is

O (1 / \sqrt{n})

, which from (42) proves the rate of convergence results for

|H (μ) - H ({\tilde{μ}}_{n})|

.

Considering the direct I-divergence,

D (μ | | {\tilde{μ}}_{n}) \leq \log e \cdot \sum_{x \in A_{μ}} f_{μ} (x) |\frac{f_{μ} (x)}{f_{{\tilde{μ}}_{n}} (x)} - 1| \leq \frac{\log e}{m_{{\tilde{μ}}_{n}}} \cdot V ({\tilde{μ}}_{n}, μ)

. Then the uniform convergence of

{\tilde{μ}}_{n} (\{x\})

to

μ (\{x\})

P_{μ}

-a.s. in

A_{μ}

and the fact that

|A_{μ}| < \infty

imply that for an arbitrary small

ϵ > 0

(in particular smaller than

m_{μ}

)

\begin{matrix} lim_{n ⟶ \infty} D (μ | | {\tilde{μ}}_{n}) & \leq \frac{\log e}{m_{μ} - ϵ} \cdot lim_{n ⟶ \infty} V ({\tilde{μ}}_{n}, μ), P_{μ} - a . s . . \end{matrix}

(43)

(43) suffices to obtain the convergence result for the I-divergence. ☐

Proof of Theorem 3.

Let us define the oracle Barron measure

{\tilde{μ}}_{n}

by:

f_{{\tilde{μ}}_{n}} (x) = \frac{d {\tilde{μ}}_{n}}{d λ} (x) = f_{v} (x) [(1 - a_{n}) \cdot \frac{μ (A_{n} (x))}{v (A_{n} (x))} + a_{n}],

(44)

where we consider the true probability instead of its empirical version in (23). Then, the following convergence result can be obtained (see Proposition A2 in Appendix B),

\begin{matrix} lim_{n \to \infty} sup_{x \in A_{{\tilde{μ}}_{n}}} |\frac{d {\tilde{μ}}_{n}}{d μ_{n}^{*}} (x) - 1| = 0, P_{μ} - a . s . . \end{matrix}

(45)

Let

A

denote the collection of sequences

x_{1}, x_{2}, \dots

where the convergence in (45) is holding (this set is typical meaning that

P_{μ} (A) = 1

). The rest of the proof reduces to show that for any arbitrary

{(x_{n})}_{n \geq 1} \in A

, its respective sequence of induced measures

\{μ_{n}^{*} : n \geq 1\}

(the dependency of

μ_{n}^{*}

on the sequence

{(x_{n})}_{n \geq 1}

will be considered implicit for the rest of the proof) satisfies the sufficient conditions of Lemma 3.

Let us fix an arbitrary

{(x_{n})}_{n \geq 1} \in A

:

Weak convergence $μ_{n}^{*} \Rightarrow μ$ : Without loss of generality we consider that

A_{{\tilde{μ}}_{n}} = A_{v}

for all

n \geq 1

. Since

a_{n} \to 0

and

h_{n} \to 0

,

f_{{\tilde{μ}}_{n}} (x) \to μ (\{x\})

\forall x \in A_{v}

, we got the weak convergence of

{\tilde{μ}}_{n}

to

μ

. On the other hand by definition of

A

,

{lim}_{n \to \infty} {sup}_{x \in A_{{\tilde{μ}}_{n}}} |\frac{f_{{\tilde{μ}}_{n}} (x)}{f_{μ_{n}^{*}} (x)} - 1| = 0

that implies that

{lim}_{n \to \infty} |f_{μ_{n}^{*}} (x) - f_{{\tilde{μ}}_{n}} (x)| = 0

for all

x \in A_{v}

and, consequently,

μ_{n}^{*} \Rightarrow μ

.

The condition in (12): By construction

μ ≪ μ_{n}^{*}

,

μ ≪ {\tilde{μ}}_{n}

and

{\tilde{μ}}_{n} \approx μ_{n}^{*}

for all n, then we will use the following equality:

\log \frac{d μ}{d μ_{n}^{*}} (x) = \log \frac{d μ}{d {\tilde{μ}}_{n}} (x) + \log \frac{d {\tilde{μ}}_{n}}{d μ_{n}^{*}} (x),

(46)

for all

x \in A_{μ}

. Concerning the approximation error term of (46), i.e.,

\log \frac{d μ}{d {\tilde{μ}}_{n}} (x)

,

\forall x \in A_{μ}

\frac{d {\tilde{μ}}_{n}}{d μ} (x) = (1 - a_{n}) [\frac{μ (A_{n} (x))}{μ (\{x\})} \frac{v (\{x\})}{v (A_{n} (x))}] + a_{n} \frac{v (\{x\})}{μ (\{x\})} .

(47)

Given that

μ \in H (X | v)

, this is equivalent to state that

\log (\frac{d μ}{d v} (x))

is bounded

μ

-almost everywhere, which is equivalent to say that

m \equiv {inf}_{x \in A_{μ}} \frac{d μ}{d v} (x) > 0

and

M \equiv {sup}_{x \in A_{μ}} \frac{d μ}{d v} (x) < \infty

. From this,

\forall A \subset A_{μ}

,

m v (A) \leq μ (A) \leq M v (A) .

(48)

Then we have that,

\forall x \in A_{μ}

\frac{m}{M} \leq [\frac{μ (A_{n} (x))}{μ (\{x\})} \frac{v (\{x\})}{v (A_{n} (x))}] \leq \frac{M}{m}

. Therefore for n sufficient large,

0 < \frac{1}{2} \frac{m}{M} \leq \frac{d {\tilde{μ}}_{n}}{d μ} (x) \leq \frac{M}{m} + M < \infty

for all x in

A_{μ}

. Hence, there exists

N_{o} > 0

such that

{sup}_{n \geq N_{o}} {sup}_{x \in A_{μ}} |\log \frac{d {\tilde{μ}}_{n}}{d μ} (x)| < \infty

.

For the estimation error term of (46), i.e.,

\log \frac{d {\tilde{μ}}_{n}}{d μ_{n}^{*}} (x)

, note that from the fact that

(x_{n}) \in A

, and the convergence in (45), there exists

N_{1} > 0

such that for all

n \geq N_{1}

{sup}_{x \in A_{μ}} |\log \frac{d {\tilde{μ}}_{n}}{d μ_{n}^{*}} (x)| < \infty

, given that

A_{μ} \subset A_{{\tilde{μ}}_{n}} = A_{v}

. Then using (46), for all

n \geq max \{N_{0}, N_{1}\}

{sup}_{x \in A_{μ}} |\log \frac{d μ_{n}^{*}}{d μ} (x)| < \infty

, which verifies (12).

The condition in (13): Defining the function

ϕ_{n}^{*} (x) \equiv 1_{A_{v} \ A_{μ}} (x) \cdot f_{μ_{n}^{*}} (x) \log (1 / f_{μ_{n}^{*}} (x))

, we want to verify that

{lim}_{n \to \infty} \int_{X} ϕ_{n}^{*} (x) d λ (x) = 0

. Considering that

(x_{n}) \in A

for all

ϵ > 0

, there exists

N (ϵ) > 0

such that

{sup}_{x \in A_{{\tilde{μ}}_{n}}} |\frac{f_{{\tilde{μ}}_{n}} (x)}{f_{μ_{n}^{*}} (x)} - 1| < ϵ

and then

(1 - ϵ) f_{{\tilde{μ}}_{n}} (x) < f_{μ_{n}^{*}} (x) < (1 + ϵ) f_{{\tilde{μ}}_{n}} (x), for all x \in A_{v} .

(49)

From (49),

0 \leq ϕ_{n}^{*} (x) \leq (1 + ϵ) f_{{\tilde{μ}}_{n}} (x) \log (1 / (1 - ϵ) f_{{\tilde{μ}}_{n}} (x))

for all

n \geq N (ϵ)

. Analyzing

f_{{\tilde{μ}}_{n}} (x)

in (44), there are two scenarios:

A_{n} (x) \cap A_{μ} = \emptyset

where

f_{{\tilde{μ}}_{n}} (x) = a_{n} f_{v} (x)

and, otherwise,

f_{{\tilde{μ}}_{n}} (x) = f_{v} (x) (a_{n} + (1 - a_{n}) μ (A_{n} (x) \cap A_{μ}) / v (A_{n} (x)))

. Let us define:

\begin{matrix} B_{n} \equiv \{x \in A_{v} \ A_{μ} : A_{n} (x) \cap A_{μ} = \emptyset\} and C_{n} \equiv \{x \in A_{v} \ A_{μ} : A_{n} (x) \cap A_{μ} \neq \emptyset\} . \end{matrix}

(50)

Then for all

n \geq N (ϵ)

,

\begin{matrix} \sum_{x \in X} ϕ_{n}^{*} (x) & \leq \sum_{x \in A_{v} \ A_{μ}} (1 + ϵ) f_{{\tilde{μ}}_{n}} (x) \log 1 / ((1 - ϵ) f_{{\tilde{μ}}_{n}} (x)) \\ = \sum_{x \in B_{n}} (1 + ϵ) a_{n} f_{v} (x) \log \frac{1}{(1 - ϵ) a_{n} f_{v} (x)} + \sum_{x \in X} {\tilde{ϕ}}_{n} (x), \end{matrix}

(51)

with

{\tilde{ϕ}}_{n} (x) \equiv 1_{C_{n}} (x) \cdot (1 + ϵ) f_{{\tilde{μ}}_{n}} (x) \log \frac{1}{(1 - ϵ) f_{{\tilde{μ}}_{n}} (x)}

. The left term in (51) is upper bounded by

a_{n} (1 + ϵ) (H (v) + \log (1 / a_{n}))

, which goes to zero with n from

(a_{n})

being

o (1)

and the fact that

v \in H (X)

. For the right term in (51),

(h_{n})

being

o (1)

implies that x belongs to

B_{n}

eventually (in n)

\forall x \in A_{v} \ A_{μ}

, then

{\tilde{ϕ}}_{n} (x)

tends to zero point-wise as n goes to infinity. On the other hand, for all

x \in C_{n}

(see (50)), we have that

\begin{matrix} \frac{1}{1 / m + 1} \leq \frac{μ (A_{n} (x) \cap A_{μ})}{v (A_{n} (x) \cap A_{μ}) + v (A_{v} \ A_{μ})} \leq \frac{μ (A_{n} (x))}{v (A_{n} (x))} \leq \frac{μ (A_{n} (x) \cap A_{μ})}{v (A_{n} (x) \cap A_{μ})} \leq M . \end{matrix}

(52)

These inequalities derive from (48). Consequently for all

x \in X

, if n sufficiently large such that

a_{n} < 0.5

, then

\begin{matrix} 0 \leq {\tilde{ϕ}}_{n} (x) & \leq (1 + ϵ) (a_{n} + (1 - a_{n}) M) f_{v} (x) \log \frac{1}{(1 - ϵ) (a_{n} + (1 - a_{n}) m / (m + 1))} \\ \leq (1 + ϵ) (1 + M) f_{v} (x) [\log \frac{2 (m + 1)}{(1 - ϵ)} + \log \frac{1}{f_{v} (x)}] . \end{matrix}

(53)

Hence from (50),

{\tilde{ϕ}}_{n} (x)

is bounded by a fix function that is

ℓ_{1} (X)

by the assumption that

v \in H (X)

. Then by the dominated convergence theorems [43] and (51),

lim_{n \to \infty} \sum_{X} ϕ_{n}^{*} (x) \leq lim_{n \to \infty} \sum_{X} {\tilde{ϕ}}_{n} (x) .

In summary, we have shown that for any arbitrary

(x_{n}) \in A

the sufficient conditions of Lemma 3 are satisfied, which proves the result in (25) reminding that

P_{μ} (A) = 1

from (45). ☐

Proof of Theorem 4.

Let us first introduce the oracle probability

μ_{ϵ_{n}} \equiv μ (\cdot | Γ_{ϵ_{n}}) \in P (X) .

(54)

Note that

μ_{ϵ_{n}}

is a random probability measure (function of the i.i.d sequence

X_{1}, \dots, X_{n}

) as

Γ_{ϵ_{n}}

is a data-driven set, see (26). We will first show that:

lim_{n \to \infty} H (μ_{ϵ_{n}}) = H (μ) and lim_{n \to \infty} D (μ_{ϵ_{n}} | | μ) = 0, P_{μ} - a . s .

(55)

Under the assumption on

(ϵ_{n})

of Theorem 4,

{lim}_{n \to \infty} |μ (Γ_{ϵ_{n}}) - {\hat{μ}}_{n} (Γ_{ϵ_{n}})| = 0

,

P_{μ}

-a.s. (this result derives from the fact that

{lim}_{n \to \infty} V (μ / σ_{ϵ_{n}}, {\hat{μ}}_{n} / σ_{ϵ_{n}}) = 0

,

P_{μ}

-a.s. , from (63)) In addition, since

(ϵ_{n})

is

o (1)

then

{lim}_{n \to \infty} {\hat{μ}}_{n} (Γ_{ϵ_{n}}) = 1

, which implies that

{lim}_{n \to \infty} μ (Γ_{ϵ_{n}}) = 1

P_{μ}

-a.s. From this

μ_{ϵ_{n}} \Rightarrow μ

,

P_{μ}

-a.s. Let us consider a sequences

(x_{n})

where

{lim}_{n \to \infty} μ (Γ_{ϵ_{n}}) = 1

. Constrained to that

lim sup_{n \to \infty} sup_{x \in A_{μ}} \frac{f_{μ_{ϵ_{n}}} (x)}{f_{μ} (x)} = lim sup_{n \to \infty} \frac{1}{μ (Γ_{ϵ_{n}})} < \infty .

(56)

Then there is

N > 0

such that

{sup}_{n > N} {sup}_{x \in A_{μ}} \frac{f_{μ_{ϵ_{n}}} (x)}{f_{μ} (x)} < \infty

. Hence from Lemma 2,

{lim}_{n \to \infty} D (μ_{ϵ_{n}} | | μ) =

0 and

{lim}_{n \to \infty} |H (μ_{ϵ_{n}}) - H (μ)| = 0

. Finally, the set of sequences

(x_{n})

where

{lim}_{n \to \infty} μ (Γ_{ϵ_{n}}) = 1

has probability one (with respect to

P_{μ}

), which proves (55).

For the rest of the proof, we concentrate on the analysis of

|H ({\hat{μ}}_{n, ϵ_{n}}) - H (μ_{ϵ_{n}})|

that can be attributed to the estimation error aspect of the problem. It is worth noting that by construction

A_{{\hat{μ}}_{n, ϵ_{n}}} = A_{μ_{ϵ_{n}}} = Γ_{ϵ_{n}}

,

P_{μ}

-a.s.. Consequently, we can use

H ({\hat{μ}}_{n, ϵ_{n}}) - H (μ_{ϵ_{n}}) = \sum_{x \in Γ_{ϵ_{n}}} [μ_{ϵ_{n}} (\{x\}) - {\hat{μ}}_{n, ϵ_{n}} (\{x\})] \log {\hat{μ}}_{n, ϵ_{n}} (\{x\}) + D (μ_{ϵ_{n}} | | {\hat{μ}}_{n, ϵ_{n}}) .

(57)

The first term on the RHS of (57) is upper bounded by

\log 1 / m_{{\hat{μ}}_{n}}^{ϵ_{n}} \cdot V (μ_{ϵ_{n}}, {\hat{μ}}_{n, ϵ_{n}})

\leq \log 1 / ϵ_{n} \cdot V (μ_{ϵ_{n}}, {\hat{μ}}_{n, ϵ_{n}})

. Concerning the second term on the RHS of (57), it is possible to show (details presented in Appendix C) that

D (μ_{ϵ_{n}} | | {\hat{μ}}_{n, ϵ_{n}}) \leq \frac{2 \log \frac{e}{ϵ_{n}}}{μ (Γ_{ϵ_{n}})} \cdot V (μ / σ_{ϵ_{n}}, {\hat{μ}}_{n} / σ_{ϵ_{n}}),

(58)

where

V (μ / σ_{ϵ_{n}}, {\hat{μ}}_{n} / σ_{ϵ_{n}}) \equiv sup_{A \in σ_{ϵ_{n}}} |μ (A) - {\hat{μ}}_{n} (A)| .

(59)

In addition, it can be verified (details presented in Appendix D) that

V (μ_{ϵ_{n}}, {\hat{μ}}_{n, ϵ_{n}}) \leq K \cdot V (μ / σ_{ϵ_{n}}, {\hat{μ}}_{n} / σ_{ϵ_{n}}),

(60)

for some universal constant

K > 0

. Therefore from (57), (58) and (60), there is

C > 0

such that

|H ({\hat{μ}}_{n, ϵ_{n}}) - H (μ_{ϵ_{n}})| \leq \frac{C}{μ (Γ_{ϵ_{n}})} \log \frac{1}{ϵ_{n}} \cdot V (μ / σ_{ϵ_{n}}, {\hat{μ}}_{n} / σ_{ϵ_{n}}) .

(61)

As mentioned before,

μ (Γ_{ϵ_{n}})

goes to 1 almost surely, then we need to concentrate on the analysis of the asymptotic behavior of

\log 1 / ϵ_{n} \cdot V (μ / σ_{ϵ_{n}}, {\hat{μ}}_{n} / σ_{ϵ_{n}})

. From Hoeffding’s inequality [28], we have that

\forall δ > 0

P_{μ}^{n} (\log 1 / ϵ_{n} \cdot V (μ / σ_{ϵ_{n}}, {\hat{μ}}_{n} / σ_{ϵ_{n}}) > δ) \leq 2^{|Γ_{ϵ_{n}}| + 1} \cdot e^{- \frac{2 n δ^{2}}{{(\log 1 / ϵ_{n})}^{2}}},

(62)

considering that by construction

|σ_{ϵ_{n}}| \leq 2^{|Γ_{ϵ_{n}}| + 1} \leq 2^{1 / ϵ_{n} + 1}

. Assuming that

(ϵ_{n})

is

O (n^{- τ})

,

\ln P_{μ}^{n} (\log 1 / ϵ_{n} \cdot V (μ / σ_{ϵ_{n}}, {\hat{μ}}_{n} / σ_{ϵ_{n}}) > δ) \leq (n^{τ} + 1) \ln 2 - \frac{2 n δ^{2}}{τ \log n} .

Therefore for all

τ \in (0, 1)

,

δ > 0

and any arbitrary

l \in (τ, 1)

lim sup_{n \to \infty} \frac{1}{n^{l}} \cdot \ln P_{μ}^{n} (\log 1 / ϵ_{n} \cdot V (μ / σ_{ϵ_{n}}, {\hat{μ}}_{n} / σ_{ϵ_{n}}) > δ) < 0 .

(63)

This last result is sufficient to show that

\sum_{n \geq 1} P_{μ}^{n} (\log 1 / ϵ_{n} \cdot V (μ / σ_{ϵ_{n}}, {\hat{μ}}_{n} / σ_{ϵ_{n}}) > δ) < \infty

that concludes the argument from the Borel-Cantelli Lemma. ☐

Proof of Theorem 5.

We consider the expression

|H (μ) - H ({\hat{μ}}_{n, ϵ_{n}})| \leq |H (μ) - H (μ_{ϵ_{n}})| + |H (μ_{ϵ_{n}}) - H ({\hat{μ}}_{n, ϵ_{n}})|

(64)

to analize the approximation error and the estimation error terms separately.

• Approximation Error Analysis

Note that

|H (μ) - H (μ_{ϵ_{n}})|

is a random object as

μ_{ϵ_{n}}

in (54) is a function of the data-dependent partition and, consequently, a function of

X_{1}, \dots, X_{n}

. In the following, we consider the oracle set

{\tilde{Γ}}_{ϵ_{n}} \equiv \{x \in X : μ (\{x\}) \geq ϵ_{n}\},

(65)

and the oracle conditional probability

{\tilde{μ}}_{ϵ_{n}} \equiv μ (\cdot | {\tilde{Γ}}_{ϵ_{n}}) \in P . (X) .

(66)

Note that

{\tilde{Γ}}_{ϵ_{n}}

is a deterministic function of

(ϵ_{n})

and so is the measure

{\tilde{μ}}_{ϵ_{n}}

in (66). From definitions and triangular inequality:

\begin{matrix} |H (μ) - H ({\tilde{μ}}_{ϵ_{n}})| & \leq \sum_{x \in {\tilde{Γ}}_{ϵ_{n}}^{c}} μ (\{x\}) \log \frac{1}{μ (\{x\})} + \log \frac{1}{μ ({\tilde{Γ}}_{ϵ_{n}})} \\ + (\frac{1}{μ ({\tilde{Γ}}_{ϵ_{n}})} - 1) \cdot \sum_{x \in {\tilde{Γ}}_{ϵ_{n}}} μ (\{x\}) \log \frac{1}{μ (\{x\})}, \end{matrix}

(67)

and, similarly, the approximation error is bounded by

\begin{matrix} |H (μ) - H (μ_{ϵ_{n}})| & \leq \sum_{x \in Γ_{ϵ_{n}}^{c}} μ (\{x\}) \log \frac{1}{μ (\{x\})} \\ + \log \frac{1}{μ (Γ_{ϵ_{n}})} + (\frac{1}{μ (Γ_{ϵ_{n}})} - 1) \cdot \sum_{x \in Γ_{ϵ_{n}}} μ (\{x\}) \log \frac{1}{μ (\{x\})} . \end{matrix}

(68)

We denote the RHS of (67) and (68) by

a_{ϵ_{n}}

and

b_{ϵ_{n}} (X_{1}, \dots, X_{n})

, respectively.

We can show that if

(ϵ_{n})

is

O (n^{- τ})

and

τ \in (0, 1 / 2)

, then

lim sup_{n \to \infty} b_{ϵ_{n}} (X_{1}, \dots, X_{n}) - a_{2 ϵ_{n}} \leq 0, P_{μ} - a . s .,

(69)

which from (68) implies that

|H (μ) - H (μ_{ϵ_{n}})|

is

O (a_{2 ϵ_{n}})

,

P_{μ}

-a.s. The proof of (69) is presented in Appendix E.

Then, we need to analyze the rate of convergence of the deterministic sequence

(a_{2 ϵ_{n}})

. Analyzing the RHS of (67), we recognize two independent terms: the partial entropy sum

\sum_{x \in {\tilde{Γ}}_{ϵ_{n}}^{c}} μ (\{x\}) \log \frac{1}{μ (\{x\})}

and the rest that is bounded asymptotically by

μ ({\tilde{Γ}}_{ϵ_{n}}^{c}) (1 + H (μ))

, using the fact that

\ln x \leq x - 1

for

x \geq 1

. Here is where the tail condition on

μ

plays a role. From the tail condition, we have that

\begin{matrix} μ ({\tilde{Γ}}_{ϵ_{n}}^{c}) & \leq μ (\{{(k_{o} / ϵ_{n})}^{1 / p} + 1, {(k_{o} / ϵ_{n})}^{1 / p} + 2, {(k_{o} / ϵ_{n})}^{1 / p} + 3, \dots\}) = \sum_{x \geq {(\frac{k_{o}}{ϵ_{n}})}^{1 / p} + 1} μ (\{x\}) \\ \leq k_{1} \cdot S_{{(\frac{k_{o}}{ϵ_{n}})}^{1 / p} + 1}, \end{matrix}

(70)

where

S_{x_{o}} \equiv \sum_{x \geq x_{o}} x^{- p}

. Similarly as

\{0, 1, \dots, {(k_{o} / ϵ_{n})}^{1 / p}\} \subset {\tilde{Γ}}_{ϵ_{n}}

, then

\begin{matrix} \sum_{x \in {\tilde{Γ}}_{ϵ_{n}}^{c}} μ (\{x\}) \log \frac{1}{μ (\{x\})} & \leq \sum_{x \geq {(\frac{k_{o}}{ϵ_{n}})}^{1 / p} + 1} μ (\{x\}) \log \frac{1}{μ (\{x\})} \leq \sum_{x \geq {(\frac{k_{o}}{ϵ_{n}})}^{1 / p} + 1} k_{1} x^{- p} \cdot \log \frac{1}{k_{0} x^{- p}} \\ \leq k_{1} \log p \cdot R_{{(\frac{k_{o}}{ϵ_{n}})}^{1 / p} + 1} + k_{1} \log 1 / k_{0} \cdot S_{{(\frac{k_{o}}{ϵ_{n}})}^{1 / p} + 1}, \end{matrix}

(71)

where

R_{x_{o}} \equiv \sum_{x \geq x_{o}} x^{- p} \log x

.

In Appendix F, it is shown that

S_{x_{o}} \leq C_{0} \cdot x_{o}^{1 - p}

and

R_{x_{o}} \leq C_{1} \cdot x_{o}^{1 - p}

for constants

C_{1} > 0

and

C_{0} > 0

. Integrating these results in the RHS of (70) and (71) and considering that

(ϵ_{n})

is

O (n^{- τ})

, we have that both

μ ({\tilde{Γ}}_{ϵ_{n}}^{c})

and

\sum_{x \in {\tilde{Γ}}_{ϵ_{n}}^{c}} μ (\{x\}) \log \frac{1}{μ (\{x\})}

are

O (n^{- \frac{τ (p - 1)}{p}})

. This implies that our oracle sequence

(a_{ϵ_{n}})

is

O (n^{- \frac{τ (p - 1)}{p}})

.

In conclusion, if

ϵ_{n}

is

O (n^{- τ})

for

τ \in (0, 1 / 2)

, it follows that

|H (μ) - H (μ_{ϵ_{n}})| is O (n^{- \frac{τ (p - 1)}{p}}), P_{μ} - a . s .

(72)

• Estimation Error Analysis

Let us consider

|H (μ_{ϵ_{n}}) - H ({\hat{μ}}_{n, ϵ_{n}})|

. From the bound in (61) and the fact that for any

τ \in (0, 1)

,

{lim}_{n \to \infty} μ (Γ_{ϵ_{n}}) = 1

P_{μ}

-a.s. from (63), the problem reduces to analyze the rate of convergence of the following random object:

ρ_{n} (X_{1}, \dots, X_{n}) \equiv \log \frac{1}{ϵ_{n}} \cdot V (μ / σ (Γ_{ϵ_{n}}), {\hat{μ}}_{n} / σ (Γ_{ϵ_{n}})) .

(73)

We will analize, instead, the oracle version of

ρ_{n} (X_{1}, \dots, X_{n})

given by:

ξ_{n} (X_{1}, \dots, X_{n}) \equiv \log \frac{1}{ϵ_{n}} \cdot V (μ / σ ({\tilde{Γ}}_{ϵ_{n} / 2}), {\hat{μ}}_{n} / σ ({\tilde{Γ}}_{ϵ_{n} / 2})),

(74)

where

{\tilde{Γ}}_{ϵ} \equiv \{x \in X : μ (\{x\}) \geq ϵ\}

is the oracle counterpart of

Γ_{ϵ}

in (26). To do so, we can show that if

ϵ_{n}

is

O (n^{- τ})

with

τ \in (0, 1 / 2)

, then

lim inf_{n \to \infty} ξ_{n} (X_{1}, \dots, X_{n}) - ρ_{n} (X_{1}, \dots, X_{n}) \geq 0, P_{μ} - a . s . .

(75)

The proof of (75) is presented in Appendix G.

Moving to the almost sure rate of convergence of

ξ_{n} (X_{1}, \dots, X_{n})

, it is simple to show for our p-power dominating distribution that if

(ϵ_{n})

is

O (n^{- τ})

and

τ \in (0, p)

then

lim_{n \to \infty} ξ_{n} (X_{1}, \dots, X_{n}) = 0 P_{μ} - a . s .,

and, more specifically,

ξ_{n} (X_{1}, \dots, X_{n}) is o (n^{- q}) for all q \in (0, (1 - τ / p) / 2), P_{μ} - a . s . .

(76)

The argument is presented in Appendix H.

In conclusion, if

ϵ_{n}

is

O (n^{- τ})

for

τ \in (0, 1 / 2)

, it follows that

|H (μ_{ϵ_{n}}) - H ({\hat{μ}}_{n, ϵ_{n}})| is O (n^{- q}), P_{μ} - a . s .,

(77)

for all

q \in (0, (1 - τ / p) / 2)

.

• Estimation vs. Approximation Errors

Coming back to (64) and using (72) and (77), the analysis reduces to finding the solution

τ^{*}

in

(0, 1 / 2)

that offers the best trade-off between the estimation and approximation error rate:

τ^{*} \equiv \arg max_{τ (0, 1 / 2)} min \{\frac{(1 - τ / p)}{2}, \frac{τ (p - 1)}{p}\} .

(78)

It is simple to verify that

τ^{*} = 1 / 2

. Then by considering

τ

arbitrary close to the admissible limit

1 / 2

, we can achieve a rate of convergence for

|H (μ) - H ({\hat{μ}}_{n, ϵ_{n}})|

that is arbitrary close to

O (n^{- \frac{1}{2} (1 - 1 / p)})

,

P

-a.s.

More formally, for any

l \in (0, \frac{1}{2} (1 - 1 / p))

we can take

τ \in (\frac{l}{(1 - 1 / p)}, \frac{1}{2})

where

|H (μ) - H ({\hat{μ}}_{n, ϵ_{n}})|

is

o (n^{- l})

,

P_{μ}

-a.s., from (72) and (77).

Finally, a simple corollary of this analysis is to consider

τ (p) = \frac{1}{2 + 1 / p} < 1 / 2

where:

|H (μ) - H ({\hat{μ}}_{n, ϵ_{n}})| is O (n^{- \frac{1 - 1 / p}{2 + 1 / p}}), P_{μ} - a . s .,

(79)

which concludes the argument. ☐

Proof of Theorem 6.

The argument follows the proof of Theorem 5. In particular, we use the estimation-approximation error bound:

|H (μ) - H ({\hat{μ}}_{n, ϵ_{n}})| \leq |H (μ) - H (μ_{ϵ_{n}})| + |H (μ_{ϵ_{n}}) - H ({\hat{μ}}_{n, ϵ_{n}})|,

(80)

and the following two results derived in the proof of Theorem 5: If

(ϵ_{n})

is

O (n^{- τ})

with

τ \in (0, 1 / 2)

then (for the approximation error)

|H (μ) - H (μ_{ϵ_{n}})| is O (a_{2 ϵ_{n}}) P_{μ} - a . s .,

(81)

with

a_{ϵ_{n}} = \sum_{x \in {\tilde{Γ}}_{ϵ_{n}}^{c}} μ (\{x\}) \log \frac{1}{μ (\{x\})} + μ ({\tilde{Γ}}_{ϵ_{n}}^{c}) (1 + H (μ)),

while (for the estimation error)

|H (μ_{ϵ_{n}}) - H ({\hat{μ}}_{n, ϵ_{n}})| is O (ξ_{n} (X_{1}, \dots, X_{n})) P_{μ} - a . s .,

(82)

with

ξ_{n} (X_{1}, \dots, X_{n}) = \log \frac{1}{ϵ_{n}} \cdot V (μ / σ ({\tilde{Γ}}_{ϵ_{n} / 2}), {\hat{μ}}_{n} / σ ({\tilde{Γ}}_{ϵ_{n} / 2})) .

For the estimation error, we need to bound the rate of convergence of

ξ_{n} (X_{1}, \dots, X_{n})

to zero almost surely. We first note that

\{1, \dots, x_{o} (ϵ_{n})\} = {\tilde{Γ}}_{ϵ_{n}}

with

x_{o} (ϵ_{n}) = ⌊ 1 / α \ln (k_{0} / ϵ_{n}) ⌋

. Then from Hoeffding’s inequality we have that

\begin{matrix} P_{μ}^{n} (\{ξ_{n} (X_{1}, \dots, X_{n}) > δ\}) & \leq 2^{|({\tilde{Γ}}_{ϵ_{n} / 2})|} \cdot e^{- 2 n \frac{δ^{2}}{\log {(1 / ϵ_{n})}^{2}}} \\ \leq 2^{1 / α \ln (2 k_{0} / ϵ_{n}) + 1} \cdot e^{- 2 n \frac{δ^{2}}{\log {(1 / ϵ_{n})}^{2}}} . \end{matrix}

(83)

Considering

ϵ_{n} = O (n^{- τ})

, an arbitrary sequence

(δ_{n})

being

o (1)

and

l > 0

, it follows from (83) that

\begin{matrix} \frac{1}{n^{l}} \cdot \ln P_{μ}^{n} (\{ξ_{n} (X_{1}, \dots, X_{n}) > δ_{n}\}) & \leq \frac{1}{n^{l}} \ln (2) [1 / α \ln (2 k_{0} / ϵ_{n}) + 1] - n^{1 - l} \frac{δ_{n}^{2}}{\log {(1 / ϵ_{n})}^{2}} . \end{matrix}

(84)

We note that the first term in the RHS of (84) is

O (\frac{1}{n^{l}} \log n)

and goes to zero for all

l > 0

, while the second term is

O (n^{1 - l} \frac{δ_{n}^{2}}{\log n^{2}})

. If we consider

δ_{n} = O (n^{- q})

, this second term is

O (n^{1 - 2 q - l} \cdot \frac{1}{\log n^{2}})

. Therefore, for any

q \in (0, 1 / 2)

we can take an arbitrary

l \in (0, 1 - 2 q]

such that

P_{μ}^{n} (\{ξ_{n} (X_{1}, \dots, X_{n}) > δ_{n}\})

is

O (e^{- n^{l}})

from (84). This result implies, from the Borel-Cantelli Lemma, that

ξ_{n} (X_{1}, \dots, X_{n})

is

o (δ_{n})

,

P_{μ}

-a.s, which in summary shows that

|H (μ_{ϵ_{n}}) - H ({\hat{μ}}_{n, ϵ_{n}})|

is

O (n^{- q})

for all

q \in (0, 1 / 2)

.

For the approximation error, it is simple to verify that:

μ ({\tilde{Γ}}_{ϵ_{n}}^{c}) \leq k_{1} \cdot \sum_{x \geq x_{o} (ϵ_{n}) + 1} e^{- α x} = k_{1} \cdot {\tilde{S}}_{x_{o} (ϵ_{n}) + 1}

(85)

and

\begin{matrix} \sum_{x \in {\tilde{Γ}}_{ϵ_{n}}^{c}} μ (\{x\}) \log \frac{1}{μ (\{x\})} & \leq \sum_{x \geq x_{o} (ϵ_{n}) + 1} k_{1} e^{- α x} \log \frac{1}{k_{0} e^{- α x}} = k_{1} \log \frac{1}{k_{0}} \cdot {\tilde{S}}_{x_{o} (ϵ_{n}) + 1} \\ + α \log e \cdot k_{1} \cdot {\tilde{R}}_{x_{o} (ϵ_{n}) + 1}, \end{matrix}

(86)

where

{\tilde{S}}_{x_{o}} \equiv \sum_{x \geq x_{o}} e^{- α x}

and

{\tilde{R}}_{x_{o}} \equiv \sum_{x \geq x_{o}} x \cdot e^{- α x}

. At this point, it is not difficult to show that

{\tilde{S}}_{x_{o}} \leq M_{1} e^{- α x_{o}}

and

{\tilde{R}}_{x_{o}} \leq M_{2} e^{- α x_{o}} \cdot x_{o}

for some constants

M_{1} > 0

and

M_{2} > 0

. Integrating these partial steps, we have that

\begin{matrix} a_{ϵ_{n}} \leq k_{1} (1 + H (μ) + \log \frac{1}{k_{0}}) \cdot {\tilde{S}}_{x_{o} (ϵ_{n}) + 1} + α \log e \cdot k_{1} \cdot {\tilde{R}}_{x_{o} (ϵ_{n}) + 1} \leq O_{1} \cdot ϵ_{n} + O_{2} \cdot ϵ_{n} \log \frac{1}{ϵ_{n}} \end{matrix}

(87)

for some constant

O_{1} > 0

and

O_{2} > 0

. The last step is from the evaluation of

x_{o} (ϵ_{n}) = ⌊ 1 / α \ln (k_{0} / ϵ_{n}) ⌋

. Therefore from (81) and (87), it follows that

|H (μ) - H (μ_{ϵ_{n}})|

is

O (n^{- τ} \log n)

P_{μ}

-a.s. for all

τ \in (0, 1 / 2)

.

The argument concludes by integrating in (80) the almost sure convergence results obtained for the estimation and approximation errors. ☐

Proof of Theorem 7.

Let us define the event

B_{n}^{ϵ} = \{x^{n} \in X^{n} : Γ_{ϵ} (x^{n}) = A_{μ}\},

(88)

that represents the detection of the support of

μ

from the data for a given

ϵ > 0

in (26). Note that the dependency on the data for

Γ_{ϵ}

is made explicit in this notation. In addition, let us consider the deviation event

A_{n}^{ϵ} (μ) = \{x^{n} \in X^{n} : V (μ, {\hat{μ}}_{n}) > ϵ\} .

(89)

By the hypothesis that

|A_{μ}| < \infty

, then

m_{μ} = {min}_{x \in A_{μ}} f_{μ} (x) > 0

. Therefore if

x^{n} \in {(A_{n}^{m_{μ} / 2} (μ))}^{c}

then

{\hat{μ}}_{n} (\{x\}) \geq m_{μ} / 2

for all

x \in A_{μ}

, which implies that

{(B_{n}^{ϵ})}^{c} \subset A_{n}^{m_{μ} / 2} (μ)

as long as

0 < ϵ \leq m_{μ} / 2

. Using the hypothesis that

ϵ_{n} \to 0

, there is

N > 0

such that for all

n \geq N

{(B_{n}^{ϵ_{n}})}^{c} \subset A_{n}^{m_{μ} / 2} (μ)

and, consequently,

P_{μ}^{n} ({(B_{n}^{ϵ_{n}})}^{c}) \leq P_{μ}^{n} (A_{n}^{m_{μ} / 2} (μ)) \leq 2^{k + 1} \cdot e^{- \frac{n m_{μ}^{2}}{4}},

(90)

the last from Hoeffding’s inequality considering

k = |A_{μ}| < \infty

.

If we consider the events:

\begin{matrix} C_{n}^{ϵ} (μ) & = \{x^{n} \in X^{n} : |H ({\hat{μ}}_{n, ϵ_{n}}) - H (μ)| > ϵ\} and \end{matrix}

(91)

\begin{matrix} D_{n}^{ϵ} (μ) & = \{x^{n} \in X^{n} : |H ({\hat{μ}}_{n}) - H (μ)| > ϵ\} \end{matrix}

(92)

and we use the fact that by definition

{\hat{μ}}_{n, ϵ_{n}} = {\hat{μ}}_{n}

conditioning on

B_{n}^{ϵ_{n}}

, it follows that

C_{n}^{ϵ} (μ) \cap B_{n}^{ϵ_{n}} \subset D_{n}^{ϵ} (μ)

. Then, for all

ϵ > 0

and

n \geq N

\begin{matrix} P_{μ}^{n} (C_{n}^{ϵ} (μ)) & \leq P_{μ}^{n} (C_{n}^{ϵ} (μ) \cap B_{n}^{ϵ_{n}}) + P_{μ}^{n} ({(B_{n}^{ϵ_{n}})}^{c}) \\ \leq P_{μ}^{n} (D_{n}^{ϵ} (μ)) + P_{μ}^{n} ({(B_{n}^{ϵ_{n}})}^{c}) \\ \leq 2^{k + 1} [e^{- \frac{2 n ϵ^{2}}{{(M_{μ} + \frac{\log e}{m_{μ}})}^{2}}} + e^{- \frac{n m_{μ}^{2}}{4}}], \end{matrix}

(93)

the last inequality from Theorem 1 and (90). ☐

Funding

The work is supported by funding from FONDECYT Grant 1170854, CONICYT-Chile and the Advanced Center for Electrical and Electronic Engineering (AC3E), Basal Project FB0008.

Acknowledgments

The author is grateful to Patricio Parada for his insights and stimulating discussion in the initial stage of this work. The author thanks the anonymous reviewers for their valuable comments and suggestions, and his colleagues Claudio Estevez, Rene Mendez and Ruben Claveria for proofreading this material.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Minimax risk for Finite Entropy Distributions in ∞-Alphabets

Proposition A1.

R_{n}^{*} = \infty .

For the proof, we use the following lemma that follows from [26] (Theorem 1).

Lemma A1.

Let us fix two arbitrary real numbers

δ > 0

and

ϵ > 0

. Then there are P, Q two finite supported distributions on

H (X)

that satisfy that

D (P | | Q) < ϵ

while

H (Q) - H (P) > δ

.

The proof of Lemma A1 derives from the same construction presented in the proof of [26] (Theorem 1), i.e.,

P = (p_{1}, \dots, p_{L})

and a modification of it

Q_{M} = (p_{1} \cdot (1 - 1 / \sqrt{M}), p_{2} + p_{1} / M \sqrt{M}, \dots, p_{L} + p_{1} / M \sqrt{M}, p_{1} / M \sqrt{M}, \dots, p_{1} / M \sqrt{M})

both distribution of finite support and consequently in

H (X)

. It is simple to verify that as M goes to infinity

D (P | | Q_{M}) ⟶ 0

while

H (Q_{M}) - H (P) ⟶ \infty

.

Proof.

For any pair of distribution P, Q in

H (X)

, Le Cam’s two point method [53] shows that:

R_{n}^{*} \geq \frac{1}{4} {(H (Q) - H (P))}^{2} \exp^{- n D (P | | Q)} .

(A1)

Adopting Lemma A1 and Equation (A1), for any n and any arbitrary

ϵ > 0

and

δ > 0

, we have that

R_{n}^{*} > δ^{2} \exp^{- n ϵ} / 4

. Then exploiting the discontinuity of the entropy in infinite alphabets, we can fix

ϵ

and make

δ

arbitrar large. ☐

Appendix B. Proposition A2

Proposition A2.

Under the assumptions of Theorem 3:

\begin{matrix} lim_{n \to \infty} sup_{x \in A_{{\tilde{μ}}_{n}}} |\frac{d {\tilde{μ}}_{n}}{d μ_{n}^{*}} (x) - 1| = 0, P_{μ} - a . s . \end{matrix}

(A2)

Proof.

First note that

A_{{\tilde{μ}}_{n}} = A_{μ_{n}^{*}}

, then

\frac{d {\tilde{μ}}_{n}}{d μ_{n}^{*}} (x)

is finite and

\forall x \in A_{{\tilde{μ}}_{n}}

\frac{d {\tilde{μ}}_{n}}{d μ_{n}^{*}} (x) = \frac{(1 - a_{n}) \cdot μ (A_{n} (x)) + a_{n} v (A_{n} (x))}{(1 - a_{n}) \cdot {\hat{μ}}_{n} (A_{n} (x)) + a_{n} v (A_{n} (x))} .

(A3)

Then by construction,

\begin{matrix} sup_{x \in A_{{\tilde{μ}}_{n}}} |\frac{d {\tilde{μ}}_{n}}{d μ_{n}^{*}} (x) - 1| \leq sup_{A \in π_{n}} \frac{|{\hat{μ}}_{n} (A) - μ (A)|}{a_{n} \cdot h_{n}} . \end{matrix}

(A4)

From Hoeffding’s inequality, we have that

\forall ϵ > 0

\begin{matrix} P_{μ}^{n} (sup_{A \in π_{n}} |{\hat{μ}}_{n} (A) - μ (A)| > ϵ) \leq 2 \cdot |π_{n}| \cdot \exp^{- 2 n ϵ^{2}} . \end{matrix}

(A5)

By condition ii), given that

(1 / a_{n} h_{n})

is

o (n^{τ})

for some

τ \in (0, 1 / 2)

, then there exists

τ_{o} \in (0, 1)

such that

\begin{matrix} lim_{n \to \infty} \frac{1}{n^{τ_{o}}} \ln P_{μ}^{n} (sup_{x \in A_{{\tilde{μ}}_{n}}} |\frac{d {\tilde{μ}}_{n}}{d μ_{n}^{*}} (x) - 1| > ϵ) \leq lim_{n \to \infty} \frac{1}{n^{τ_{o}}} \ln (2 |π_{n}|) - 2 \cdot {(n^{\frac{1 - τ_{o}}{2}} a_{n} h_{n} ϵ)}^{2} = - \infty . \end{matrix}

This implies that

P_{μ}^{n} ({sup}_{x \in A_{{\tilde{μ}}_{n}}} |\frac{d {\tilde{μ}}_{n}}{d μ_{n}^{*}} (x) - 1| > ϵ)

is eventually dominated by a constant time

{(e^{- n^{τ_{o}}})}_{n \geq 1}

, which from the Borel-Cantelli Lemma [43] implies that

\begin{matrix} lim_{n \to \infty} sup_{x \in A_{{\tilde{μ}}_{n}}} |\frac{d {\tilde{μ}}_{n}}{d μ_{n}^{*}} (x) - 1| = 0, P_{μ} - a . s . . \end{matrix}

(A6)

☐

Appendix C. Proposition A3

Proposition A3.

\begin{matrix} D (μ_{ϵ_{n}} | | {\hat{μ}}_{n, ϵ_{n}}) \leq \frac{2 \log \frac{e}{ϵ_{n}}}{μ (Γ_{ϵ_{n}})} \cdot V (μ / σ_{ϵ_{n}}, {\hat{μ}}_{n} / σ_{ϵ_{n}}) \end{matrix}

Proof.

From definition,

\begin{matrix} D (μ_{ϵ_{n}} | | {\hat{μ}}_{n, ϵ_{n}}) = \frac{1}{μ (Γ_{ϵ_{n}})} \sum_{x \in Γ_{ϵ_{n}}} f_{μ} (x) \log \frac{f_{μ} (x)}{f_{{\hat{μ}}_{n}} (x)} + \log \frac{{\hat{μ}}_{n} (Γ_{ϵ_{n}})}{μ (Γ_{ϵ_{n}})} . \end{matrix}

(A7)

For the right term in the RHS of (A7):

\begin{matrix} \log \frac{{\hat{μ}}_{n} (Γ_{ϵ_{n}})}{μ (Γ_{ϵ_{n}})} \leq \frac{\log (e)}{μ (Γ_{ϵ_{n}})} |{\hat{μ}}_{n} (Γ_{ϵ_{n}}) - μ (Γ_{ϵ_{n}})| . \end{matrix}

(A8)

For the left term in the RHS of (A7):

\begin{matrix} |\sum_{x \in Γ_{ϵ_{n}}} f_{μ} (x) \log \frac{f_{μ} (x)}{f_{{\hat{μ}}_{n}} (x)}| & = |\sum_{\begin{matrix} x \in Γ_{ϵ_{n}} \\ f_{μ} (x) \leq f_{{\hat{μ}}_{n}} (x) \end{matrix}} f_{μ} (x) \log \frac{f_{μ} (x)}{f_{{\hat{μ}}_{n}} (x)} + \sum_{\begin{matrix} x \in Γ_{ϵ_{n}} \\ f_{μ} (x) > f_{{\hat{μ}}_{n}} (x) \geq ϵ_{n} \end{matrix}} f_{μ} (x) \log \frac{f_{μ} (x)}{f_{{\hat{μ}}_{n}} (x)}| \\ \leq \sum_{\begin{matrix} x \in Γ_{ϵ_{n}} \\ f_{μ} (x) \leq f_{{\hat{μ}}_{n}} (x) \end{matrix}} f_{μ} (x) \log \frac{f_{{\hat{μ}}_{n}} (x)}{f_{μ} (x)} + \sum_{\begin{matrix} x \in Γ_{ϵ_{n}} \\ f_{μ} (x) > f_{{\hat{μ}}_{n}} (x) \geq ϵ_{n} \end{matrix}} f_{{\hat{μ}}_{n}} (x) \log \frac{f_{μ} (x)}{f_{{\hat{μ}}_{n}} (x)} \\ + \sum_{\begin{matrix} x \in Γ_{ϵ_{n}} \\ f_{μ} (x) > f_{{\hat{μ}}_{n}} (x) \geq ϵ_{n} \end{matrix}} (f_{μ} (x) - f_{{\hat{μ}}_{n}} (x)) \cdot \log \frac{f_{μ} (x)}{f_{{\hat{μ}}_{n}} (x)} \\ \leq \log e [\sum_{\begin{matrix} x \in Γ_{ϵ_{n}} \\ f_{μ} (x) \leq f_{{\hat{μ}}_{n}} (x) \end{matrix}} (f_{{\hat{μ}}_{n}} (x) - f_{μ} (x)) + \sum_{\begin{matrix} x \in Γ_{ϵ_{n}} \\ f_{μ} (x) > f_{{\hat{μ}}_{n}} (x) \end{matrix}} (f_{μ} (x) - f_{{\hat{μ}}_{n}} (x))] \end{matrix}

(A9)

\begin{matrix} + log \frac{1}{ϵ_{n}} \cdot \sum_{\begin{matrix} x \in Γ_{ϵ_{n}} \\ f_{μ} (x) > f_{{\hat{μ}}_{n}} (x) \end{matrix}} (f_{μ} (x) - f_{{\hat{μ}}_{n}} (x)) \end{matrix}

(A10)

\begin{matrix} \leq (log e + log \frac{1}{ϵ_{n}}) \cdot \sum_{x \in Γ_{ϵ_{n}}} |f_{μ} (x) - f_{{\hat{μ}}_{n}} (x)| . \end{matrix}

(A11)

The first inequality in (A9) is by triangular inequality, the second in (A10) is from the fact that

\ln x \leq x - 1

for

x > 0

. Finally, from definition of the total variational distance over

σ_{ϵ_{n}}

in (59) we have that

\begin{matrix} 2 \cdot V (μ / σ_{ϵ_{n}}, {\hat{μ}}_{n} / σ_{ϵ_{n}}) = \sum_{x \in Γ_{ϵ_{n}}} |f_{μ} (x) - f_{{\hat{μ}}_{n}} (x)| + |{\hat{μ}}_{n} (Γ_{ϵ_{n}}) - μ (Γ_{ϵ_{n}})|, \end{matrix}

(A12)

which concludes the argument from (A7)–(A9). ☐

Appendix D. Proposition A4

Proposition A4.

Considering that

(k_{n}) \to \infty

, there exists

K > 0

and

N > 0

such that

\forall n \geq N

,

V ({\tilde{μ}}_{k_{n}}, {\hat{μ}}_{k_{n}, n}^{*}) \leq K \cdot V (μ / σ_{k_{n}}, {\hat{μ}}_{n} / σ_{k_{n}}) .

(A13)

Proof.

\begin{matrix} V ({\tilde{μ}}_{k_{n}}, {\hat{μ}}_{k_{n}, n}^{*}) & = \frac{1}{2} \sum_{x \in A_{μ} \cap Γ_{k_{n}}} |\frac{μ \{x\}}{μ (Γ_{k_{n}})} - \frac{{\hat{μ}}_{n} \{x\}}{{\hat{μ}}_{n} (Γ_{k_{n}})}| \\ \leq \frac{1}{2 μ (Γ_{k_{n}})} [\sum_{x \in A_{μ} \cap Γ_{k_{n}}} |{\hat{μ}}_{n} \{x\} - μ \{x\}| + \sum_{x \in A_{μ} \cap Γ_{k_{n}}} {\hat{μ}}_{n} \{x\} |\frac{μ (Γ_{k_{n}})}{{\hat{μ}}_{n} (Γ_{k_{n}})} - 1|] \\ = \frac{1}{2 μ (Γ_{k_{n}})} [2 \cdot V (μ / σ_{k_{n}}, {\hat{μ}}_{n} / σ_{k_{n}}) + |μ (Γ_{k_{n}}) - {\hat{μ}}_{n} (Γ_{k_{n}})|] \\ \leq \frac{3 \cdot V (μ / σ_{k_{n}}, {\hat{μ}}_{n} / σ_{k_{n}})}{2 μ (Γ_{k_{n}})} . \end{matrix}

(A14)

By the hypothesis

μ (Γ_{k_{n}}) \to 1

, which concludes the proof. ☐

Appendix E. Proposition A5

Proposition A5.

If

ϵ_{n}

is

O (n^{- τ})

with

τ \in (0, 1 / 2)

, then

lim sup_{n \to \infty} b_{ϵ_{n}} (X_{1}, \dots, X_{n}) - a_{2 ϵ_{n}} \leq 0, P_{μ} - a . s . .

Proof.

Let us define the set

B_{n} = \{(x_{1}, \dots, x_{n}) : {\tilde{Γ}}_{2 ϵ_{n}} \subset Γ_{ϵ_{n}}\} \subset X^{n} .

From definition every sequence

(x_{1}, \dots, x_{n}) \in B_{n}

is such that

b_{ϵ_{n}} (x_{1}, \dots, x_{n}) \leq a_{2 ϵ_{n}}

and, consequently, we just need to prove that

P_{μ} (lim {inf}_{n \to \infty} B_{n}) = P_{μ} (\cup_{n \geq 1} \cap_{k \geq n} B_{k}) = 1

[42]. Furthermore, if

{sup}_{x \in {\tilde{Γ}}_{2 ϵ_{n}}} |{\hat{μ}}_{n} (\{x\}) - μ (\{x\})| \leq ϵ_{n}

, then by definition of

{\tilde{Γ}}_{2 ϵ_{n}}

in (65), we have that

{\hat{μ}}_{n} (\{x\}) \geq ϵ_{n}

for all

x \in Γ_{2 ϵ_{n}}

(i.e.,

{\tilde{Γ}}_{2 ϵ_{n}} \subset Γ_{ϵ_{n}}

). From this

\begin{matrix} P_{μ}^{n} (B_{n}^{c}) & \leq P_{μ}^{n} (sup_{x \in {\tilde{Γ}}_{2 ϵ_{n}}} |{\hat{μ}}_{n} (\{x\}) - μ (\{x\})| > ϵ_{n}) \leq |{\tilde{Γ}}_{2 ϵ_{n}}| \cdot e^{- 2 n ϵ_{n}^{2}} \leq \frac{1}{2 ϵ_{n}} \cdot e^{- 2 n ϵ_{n}^{2}}, \end{matrix}

(A15)

from the Hoeffding’s inequality [28,52], the union bound and the fact that by construction

|{\tilde{Γ}}_{2 ϵ_{n}}| \leq \frac{1}{2 ϵ_{n}}

. If we consider

ϵ_{n} = O (n^{- τ})

and

l > 0

, we have that:

\begin{matrix} \frac{1}{n^{l}} \cdot \ln P_{μ}^{n} (B_{n}^{c}) \leq \frac{1}{n^{l}} \ln (1 / 2 \cdot n^{τ}) - 2 n^{1 - 2 τ - l} . \end{matrix}

(A16)

From (A16) for any

τ \in (0, 1 / 2)

there is

l \in (0, 1 - 2 τ]

such that

P_{μ}^{n} (B_{n}^{c})

is bounded by a term

O (e^{- n^{l}})

. This implies that

\sum_{n \geq 1} P_{μ}^{n} (B_{n}^{c}) < \infty

, that suffices to show that

P_{μ} (\cup_{n \geq 1} \cap_{k \geq n} B_{k}) = 1

. ☐

Appendix F. Auxiliary Results for Theorem 5

Let us first consider the series

\begin{matrix} S_{x_{o}} = \sum_{x \geq x_{o}} x^{- p} & = x_{o}^{- p} \cdot (1 + {(\frac{x_{o}}{x_{o} + 1})}^{p} + {(\frac{x_{o}}{x_{o} + 2})}^{p} + \dots) \\ = x_{o}^{- p} \cdot ({\tilde{S}}_{x_{o}, 0} + {\tilde{S}}_{x_{o}, 1} + \dots + {\tilde{S}}_{x_{o}, x_{o} - 1}), \end{matrix}

(A17)

where

{\tilde{S}}_{x_{o}, j} \equiv \sum_{k = 0}^{\infty} {(\frac{k \cdot x_{o} + j}{x_{o}})}^{- p}

for all

j \in \{0, \dots, x_{o} - 1\}

. It is simple to verify that for all

j \in \{0, \dots, x_{o} - 1\}

,

{\tilde{S}}_{x_{o}, j} \leq {\tilde{S}}_{x_{o}, 0} = \sum_{k \geq 0} k^{- p} < \infty

given that by hypothesis

p > 1

. Consequently,

S_{x_{o}} \leq x_{o}^{1 - p} \cdot \sum_{k \geq 0} k^{- p}

.

Similarly, for the second series we have that:

\begin{matrix} R_{x_{o}} = \sum_{x \geq x_{o}} x^{- p} \log x & = x_{o}^{- p} \cdot (\log (x_{o}) + (\frac{x_{o}}{x_{o} + 1}) \log (x_{o} + 1) + (\frac{x_{o}}{x_{o} + 2}) \log (x_{o} + 2) + \dots) \\ = x_{o}^{- p} \cdot ({\tilde{R}}_{x_{o}, 0} + {\tilde{R}}_{x_{o}, 2} + \dots + {\tilde{R}}_{x_{o}, x_{o} - 1}), \end{matrix}

(A18)

where

{\tilde{R}}_{x_{o}, j} \equiv \sum_{k = 1}^{\infty} {(\frac{k \cdot x_{o} + j}{x_{o}})}^{- p} \cdot \log (k x_{o} + j)

for all

j \in \{0, \dots, x_{o} - 1\}

. Note again that

{\tilde{R}}_{x_{o}, j} \leq {\tilde{R}}_{x_{o}, 0} < \infty

for all

j \in \{0, \dots, x_{o} - 1\}

, and, consequently,

R_{x_{o}} \leq x_{o}^{1 - p} \cdot \sum_{k \geq 1} k^{- p} \log k

from (A18).

Appendix G. Proposition A6

Proposition A6.

If

ϵ_{n}

is

O (n^{- τ})

with

τ \in (0, 1 / 2)

, then

lim inf_{n \to \infty} ξ_{n} (X_{1}, \dots, X_{n}) - ρ_{n} (X_{1}, \dots, X_{n}) \geq 0, P_{μ} - a . s . .

Proof.

By definition if

σ (Γ_{ϵ_{n}}) \subset σ ({\tilde{Γ}}_{ϵ_{n} / 2})

then

ξ_{n} (X_{1}, \dots, X_{n}) \geq ρ_{n} (X_{1}, \dots, X_{n})

. Consequently, if we define the set:

B_{n} = \{(x_{1}, \dots, x_{n}) : σ (Γ_{ϵ_{n}}) \subset σ ({\tilde{Γ}}_{ϵ_{n} / 2})\},

(A19)

then the proof reduced to verify that

P_{μ} (lim {inf}_{n \to \infty} B_{n})

= P_{μ} (\cup_{n \geq 1} \cap_{k \geq n} B_{k}) = 1

.

On the other hand, if

{sup}_{x \in Γ_{ϵ_{n}}} |{\hat{μ}}_{n} (\{x\}) - μ (\{x\})| \leq ϵ_{n} / 2

then by definition of

Γ_{ϵ}

, for all

x \in Γ_{ϵ_{n}}

μ (\{x\}) \geq ϵ_{n} / 2

, i.e.,

Γ_{ϵ_{n}} \subset {\tilde{Γ}}_{ϵ_{n} / 2}

. In other words,

C_{n} = \{(x_{1}, \dots, x_{n}) : sup_{x \in Γ_{ϵ_{n}}} |{\hat{μ}}_{n} (\{x\}) - μ (\{x\})| \leq ϵ_{n} / 2\} \subset B_{n} .

(A20)

Finally,

\begin{matrix} P_{μ}^{n} (C_{n}^{c}) & = P_{μ}^{n} (sup_{x \in Γ_{ϵ_{n}}} |{\hat{μ}}_{n} (\{x\}) - μ (\{x\})| > ϵ_{n} / 2) \leq |Γ_{ϵ_{n}}| \cdot e^{- n ϵ^{2} / 2} \leq \frac{1}{ϵ_{n}} \cdot e^{- n ϵ^{2} / 2} . \end{matrix}

(A21)

In this context, if we consider

ϵ_{n} = O (n^{- τ})

and

l > 0

, then we have that:

\begin{matrix} \frac{1}{n^{l}} \cdot \ln P_{μ}^{n} (C_{n}^{c}) \leq τ \cdot \frac{\ln n}{n^{l}} - \frac{n^{1 - 2 τ - l}}{2} . \end{matrix}

(A22)

Therefore, we have that for any

τ \in (0, 1 / 2)

we can take

l \in (0, 1 - 2 τ]

such that

P_{μ}^{n} (C_{n}^{c})

is bounded by a term

O (e^{- n^{l}})

. Then, the Borel Cantelli Lemma tells us that

P_{μ} (\cup_{n \geq 1} \cap_{k \geq n} C_{k}) = 1

, which concludes the proof from (A20). ☐

Appendix H. Proposition A7

Proposition A7.

For the p-power tail dominating distribution stated in Theorem 5, if

(ϵ_{n})

is

O (n^{- τ})

with

τ \in (0, p)

then

ξ_{n} (X_{1}, \dots, X_{n}) is o (n^{- q}) for all q \in (0, (1 - τ / p) / 2)

,

P_{μ}

-a.s.

Proof.

From the Hoeffding’s inequality we have that

\begin{matrix} P_{μ}^{n} (\{x_{1}, \dots, x_{n} : ξ_{n} (x_{1}, \dots, x_{n}) > δ\}) & \leq |σ ({\tilde{Γ}}_{ϵ_{n} / 2})| \cdot e^{- 2 n \frac{δ^{2}}{\log {(1 / ϵ_{n})}^{2}}} \\ \leq 2^{{(\frac{2 k_{o}}{ϵ_{n}})}^{1 / p} + 1} \cdot e^{- 2 n \frac{δ^{2}}{\log {(1 / ϵ_{n})}^{2}}}, \end{matrix}

(A23)

the second inequality using that

{\tilde{Γ}}_{ϵ} \leq {(\frac{k_{0}}{ϵ})}^{1 / p} + 1

from the definition of

{\tilde{Γ}}_{ϵ}

in (65) and the tail bounded assumption on

μ

. If we consider

ϵ_{n} = O (n^{- τ})

and

l > 0

, then we have that:

\begin{matrix} \frac{1}{n^{l}} \cdot \ln P_{μ}^{n} (\{x_{1}, \dots, x_{n} : ξ_{n} (x_{1}, \dots, x_{n}) > δ\}) & \leq \ln 2 \cdot (C n^{τ / p - l} + n^{- l}) - \frac{2 δ^{2}}{τ^{2}} \cdot \frac{n^{1 - l}}{\log n^{2}} \end{matrix}

(A24)

for some constant

C > 0

. Then in order to obtain that

ξ_{n} (X_{1}, \dots, X_{n})

converges almost surely to zero from (A24), it is sufficient that

l > 0

,

l < 1

, and

l > τ / p

. This implies that if

τ < p

, there is

l \in (τ / p, 1)

such that such that

P_{μ}^{n} (ξ_{n} (x_{1}, \dots, x_{n}) > δ)

is bounded by a term

O (e^{- n^{l}})

and, consequently,

{lim}_{n \to \infty} ξ_{n} (X_{1}, \dots, X_{n}) = 0

,

P_{μ}

-a.s. (this by using the same steps used in Appendix G).

Moving to the rate of convergence of

ξ_{n} (X_{1}, \dots, X_{n})

(assuming that

τ < p

), let us consider

δ_{n} = n^{- q}

for some

q \geq 0

. From (A24):

\begin{matrix} \frac{1}{n^{l}} \cdot \ln P_{μ}^{n} (\{x_{1}, \dots, x_{n} : ξ_{n} (x_{1}, \dots, x_{n}) > δ_{n}\}) & \leq \ln 2 \cdot (C n^{τ / p - l} + n^{- l}) - \frac{2 δ^{2}}{τ^{2}} \cdot \frac{n^{1 - 2 q - l}}{\log n^{2}} . \end{matrix}

(A25)

To make

ξ_{n} (X_{1}, \dots, X_{n})

being

o (n^{- q})

P

-a.s., a sufficient condition is that

l > 0

,

l > τ / p

, and

l < 1 - 2 q

. Therefore (considering that

τ < p

), the admissibility condition on the existence of a exponential rate of convergence

O (e^{- n^{l}})

for

l > 0

for the deviation event

\{x_{1}, \dots, x_{n} : ξ_{n} (x_{1}, \dots, x_{n}) > δ_{n}\}

is that

τ / p < 1 - 2 q

, which is equivalent to

0 < q < \frac{1 - τ / p}{2}

. ☐

References

Beirlant, J.; Dudewicz, E.; Györfi, L.; van der Meulen, E.C. Nonparametric entropy estimation: An Overview. Int. Math. Stat. Sci. 1997, 6, 17–39. [Google Scholar]
Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley: New York, NY, USA, 2006. [Google Scholar]
Kullback, S. Information Theory and Statistics; Wiley: New York, NY, USA, 1959. [Google Scholar]
Principe, J. Information Theoretic Learning: Renyi Entropy and Kernel Perspective; Springer: New York, NY, USA, 2010. [Google Scholar]
Fisher, J.W., III; Wainwright, M.; Sudderth, E.; Willsky, A.S. Statistical and information-theoretic methods for self-organization and fusion of multimodal, networked sensors. Int. J. High Perform. Comput. Appl. 2002, 16, 337–353. [Google Scholar] [CrossRef]
Liu, J.; Moulin, P. Information-theoretic analysis of interscale and intrascale dependencies between image wavelet coefficients. IEEE Trans. Image Process. 2001, 10, 1647–1658. [Google Scholar] [PubMed]
Thévenaz, P.; Unser, M. Optimization of mutual information for multiresolution image registration. IEEE Trans. Image Process. 2000, 9, 2083–2099. [Google Scholar] [PubMed]
Butz, T.; Thiran, J.P. From error probability to information theoretic (multi-modal) signal processing. Elsevier Signal Process. 2005, 85, 875–902. [Google Scholar] [CrossRef]
Kim, J.; Fisher, J.W., III; Yezzi, A.; Cetin, M.; Willsky, A.S. A nonparametric statistical method for image segmentation using information theory and curve evolution. IEEE Trans. Image Process. 2005, 14, 1486–1502. [Google Scholar] [PubMed]
Padmanabhan, M.; Dharanipragada, S. Maximizing information content in feature extraction. IEEE Trans. Speech Audio Process. 2005, 13, 512–519. [Google Scholar] [CrossRef]
Silva, J.; Narayanan, S. Minimum probability of error signal representation. Presented at IEEE Workshop Machine Learning for Signal Processing, Thessaloniki, Greece, 27–29 August 2007; pp. 348–353. [Google Scholar]
Silva, J.; Narayanan, S. Discriminative wavelet packet filter bank selection for pattern recognition. IEEE Trans. Signal Process. 2009, 57, 1796–1810. [Google Scholar] [CrossRef]
Gokcay, E.; Principe, J.C. Information theoretic clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 158–171. [Google Scholar] [CrossRef]
Arellano-Valle, R.B.; Contreras-Reyes, J.E.; Stehlik, M. Generalized skew-normal negentropy and its applications to fish condition time factor time series. Entropy 2017, 19, 528. [Google Scholar] [CrossRef]
Lake, D.E. Nonparametric entropy estimation using kernel densities. Methods Enzymol. 2009, 467, 531–546. [Google Scholar] [PubMed]
Van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: Cambridge, UK, 2000; Volume 3. [Google Scholar]
Wu, Y.; Yang, P. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Trans. Inf. Theory 2016, 62, 3702–3720. [Google Scholar] [CrossRef]
Jiao, J.; Venkat, K.; Han, Y.; Weissman, T. Minimax estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory 2015, 61, 2835–2885. [Google Scholar] [CrossRef] [PubMed]
Paninski, L. Estimating entropy on m bins given fewer than m samples. IEEE Trans. Inf. Theory 2004, 50, 2200–2203. [Google Scholar] [CrossRef]
Valiant, G.; Valiant, P. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown opitmal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, San Jose, AL, USA, 6–8 June 2011; pp. 685–694. [Google Scholar]
Valiant, G.; Valiant, P. A CLT and Tight Lower Bounds for Estimating Entropy; Technical Report TR 10-179; Electronic Colloquium on Computational Complexity: Potsdam, Germany, 2011; Volume 17, p. 9. [Google Scholar]
Braess, D.; Forster, J.; Sauer, T.; Simon, H.U. How to achieve minimax expected Kullback-Leibler distance from an unknown finite distribution. In Proceedings of the International Conference on Algorithmic Learning Theory, Lübeck, Germany, 24–26 November 2004; Springer: Berlin/Heidelberg, Germany, 2002; pp. 380–394. [Google Scholar]
Csiszár, I.; Shields, P.C. Information theory and statistics: A tutorial. In Foundations and Trends® in Communications and Information Theory; Now Publishers Inc.: Breda, The Netherlands, 2004; pp. 417–528. [Google Scholar]
Ho, S.W.; Yeung, R.W. On the discontinuity of the Shannon information measures. IEEE Trans. Inf. Theory 2009, 55, 5362–5374. [Google Scholar]
Silva, J.; Parada, P. Shannon entropy convergence results in the countable infinite case. In Proceedings of the International Symposium on Information Theory, Cambridge, MA, USA, 1–6 July 2012; pp. 155–159. [Google Scholar]
Ho, S.W.; Yeung, R.W. The interplay between entropy and variational distance. IEEE Trans. Inf. Theory 2010, 56, 5906–5929. [Google Scholar] [CrossRef]
Harremoës, P. Information topologies with applications. In Entropy, Search, Complexity; Csiszár, I., Katona, G.O.H., Tardos, G., Eds.; Springer: New York, NY, USA, 2007; Volume 16, pp. 113–150. [Google Scholar]
Devroye, L.; Lugosi, G. Combinatorial Methods in Density Estimation; Springer: New York, NY, USA, 2001. [Google Scholar]
Barron, A.; Györfi, L.; van der Meulen, E.C. Distribution estimation consistent in total variation and in two types of information divergence. IEEE Trans. Inf. Theory 1992, 38, 1437–1454. [Google Scholar] [CrossRef]
Antos, A.; Kontoyiannis, I. Convergence properties of fucntionals estimates for discrete distributions. Random Struct. Algorithms 2001, 19, 163–193. [Google Scholar] [CrossRef]
Piera, F.; Parada, P. On convergence properties of Shannon entropy. Probl. Inf. Transm. 2009, 45, 75–94. [Google Scholar] [CrossRef]
Berlinet, A.; Vajda, I.; van der Meulen, E.C. About the asymptotic accuracy of Barron density estimates. IEEE Trans. Inf. Theory 1998, 44, 999–1009. [Google Scholar] [CrossRef]
Vajda, I.; van der Meulen, E.C. Optimization of Barron density estimates. IEEE Trans. Inf. Theory 2001, 47, 1867–1883. [Google Scholar] [CrossRef]
Lugosi, G.; Nobel, A.B. Consistency of data-driven istogram methods for density estimation and classification. Ann. Stat. 1996, 24, 687–706. [Google Scholar]
Silva, J.; Narayanan, S. Information divergence estimation based on data-dependent partitions. J. Stat. Plan. Inference 2010, 140, 3180–3198. [Google Scholar] [CrossRef]
Silva, J.; Narayanan, S.N. Nonproduct data-dependent partitions for mutual information estimation: Strong consistency and applications. IEEE Trans. Signal Process. 2010, 58, 3497–3511. [Google Scholar] [CrossRef]
Kullback, S.; Leibler, R. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
Gray, R.M. Entropy and Information Theory; Springer: New York, NY, USA, 1990. [Google Scholar]
Kullback, S. A lower bound for discrimination information in terms of variation. IEEE Trans. Inf. Theory 1967, 13, 126–127. [Google Scholar] [CrossRef]
Csiszár, I. Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 1967, 2, 299–318. [Google Scholar]
Kemperman, J. On the optimum rate of transmitting information. Ann. Math. Stat. 1969, 40, 2156–2177. [Google Scholar] [CrossRef]
Breiman, L. Probability; Addison-Wesley: Boston, MA, USA, 1968. [Google Scholar]
Varadhan, S. Probability Theory; American Mathematical Society: Providence, RI, USA, 2001. [Google Scholar]
Györfi, L.; Páli, I.; van der Meulen, E.C. There is no universal source code for an infinite source alphabet. IEEE Trans. Inf. Theory 1994, 40, 267–271. [Google Scholar] [CrossRef]
Rissanen, J. Information and Complexity in Statistical Modeling; Springer: New York, NY, USA, 2007. [Google Scholar]
Boucheron, S.; Garivier, A.; Gassiat, E. Coding on countably infinite alphabets. IEEE Trans. Inf. Theory 2009, 55, 358–373. [Google Scholar] [CrossRef]
Silva, J.F.; Piantanida, P. The redundancy gains of almost lossless universal source coding over envelope families. In Proceedings of the IEEE International Symposium on Information Theory, Aachen, Germany, 25–30 June 2017; pp. 2003–2007. [Google Scholar]
Silva, J.F.; Piantanida, P. Almost Lossless Variable-Length Source Coding on Countably Infinite Alphabets. In Proceedings of the IEEE International Symposium on Information Theory, Barcelona, Spain, 10–15 July 2016; pp. 1–5. [Google Scholar]
Nobel, A.B. Histogram regression estimation using data-dependent partitions. Ann. Stat. 1996, 24, 1084–1105. [Google Scholar] [CrossRef]
Silva, J.; Narayanan, S. Complexity-regularized tree-structured partition for mutual information estimation. IEEE Trans. Inf. Theory 2012, 58, 940–1952. [Google Scholar] [CrossRef]
Darbellay, G.A.; Vajda, I. Estimation of the information by an adaptive partition of the observation space. IEEE Trans. Inf. Theory 1999, 45, 1315–1321. [Google Scholar] [CrossRef]
Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer: New York, NY, USA, 1996. [Google Scholar]
Tsybakov, A.B. Introduction to Nonparametric Estimation; Springer: New York, NY, USA, 2009. [Google Scholar]

© 2018 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Silva, J.F. Shannon Entropy Estimation in ∞-Alphabets from Convergence Results: Studying Plug-In Estimators. Entropy 2018, 20, 397. https://0-doi-org.brum.beds.ac.uk/10.3390/e20060397

AMA Style

Silva JF. Shannon Entropy Estimation in ∞-Alphabets from Convergence Results: Studying Plug-In Estimators. Entropy. 2018; 20(6):397. https://0-doi-org.brum.beds.ac.uk/10.3390/e20060397

Chicago/Turabian Style

Silva, Jorge F. 2018. "Shannon Entropy Estimation in ∞-Alphabets from Convergence Results: Studying Plug-In Estimators" Entropy 20, no. 6: 397. https://0-doi-org.brum.beds.ac.uk/10.3390/e20060397

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Shannon Entropy Estimation in ∞-Alphabets from Convergence Results: Studying Plug-In Estimators

Abstract

1. Introduction

1.1. The Challenging Infinite Alphabet Learning Scenario

1.2. From Convergence Results to Entropy Estimation

1.3. Contributions

1.4. Organization

2. Preliminaries

2.1. Convergence Results for the Shannon Entropy

3. Shannon Entropy Estimation

3.1. Revisiting the Classical Plug-In Estimator for Finite and Unknown Supported Distributions

3.2. A Simplified Version of the Barron Estimator for Finite Supported Probabilities

4. The Barron-Györfi-van der Meulen Estimator

5. A Data-Driven Histogram-Based Estimator

6. Discussion of the Results and Final Remarks

7. Proof of the Main Results

Funding

Acknowledgments

Conflicts of Interest

Appendix A. Minimax risk for Finite Entropy Distributions in ∞-Alphabets

Appendix B. Proposition A2

Appendix C. Proposition A3

Appendix D. Proposition A4

Appendix E. Proposition A5

Appendix F. Auxiliary Results for Theorem 5

Appendix G. Proposition A6

Appendix H. Proposition A7

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI