Article

Shannon Entropy Estimation in ∞-Alphabets from Convergence Results: Studying Plug-In Estimators

Information and Decision System Group, Department of Electrical Engineering, Universidad de Chile, Av. Tupper 2007, Santiago 7591538, Chile
Submission received: 12 April 2018 / Revised: 14 May 2018 / Accepted: 18 May 2018 / Published: 23 May 2018
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

Abstract

This work addresses the problem of Shannon entropy estimation in countably infinite alphabets by studying and adopting some recent convergence results for the entropy functional, which is known to be a discontinuous function in the space of probabilities on ∞-alphabets. Sufficient conditions for the convergence of the entropy are used in conjunction with some deviation inequalities (covering scenarios with both finitely and infinitely supported assumptions on the target distribution). From this perspective, four plug-in histogram-based estimators are studied, showing that convergence results are instrumental to derive new strongly consistent estimators for the entropy. The main application of this methodology is a new data-driven partition (plug-in) estimator. This scheme uses the data to restrict the support where the distribution is estimated by finding an optimal balance between estimation and approximation errors. The proposed scheme offers a consistent (distribution-free) estimator of the entropy in ∞-alphabets and optimal rates of convergence under certain regularity conditions on the problem (finite but unknown support, and tail-bounded conditions on the target distribution).

1. Introduction

Shannon entropy estimation has a long history in information theory, statistics, and computer science [1]. Entropy and related information measures (conditional entropy and mutual information) have a fundamental role in information theory and statistics [2,3] and, as a consequence, they have found numerous applications in learning and decision-making tasks [4,5,6,7,8,9,10,11,12,13,14,15]. In many of these contexts, distributions are not available and the entropy needs to be estimated from empirical data. This problem belongs to the category of scalar functional estimation, which has been thoroughly studied in non-parametric statistics.
Starting with the finite alphabet scenario, the classical plug-in estimator (i.e., the empirical distribution evaluated on the functional) is well known to be consistent, minimax optimal, and asymptotically efficient [16] (Section 8.7–8.9). More recent research has focused on looking at the so-called large alphabet (or large dimensional) regime, meaning a non-asymptotic under-sampling regime where the number of samples n is on the order of, or even smaller than, the size of the alphabet denoted by k. In this context, it has been shown that the classical plug-in estimator is sub-optimal as it suffers from severe bias [17,18]. For characterizing optimality in this high dimensional context, a non-asymptotic minimax mean square error analysis (under a finite n and k) has been explored by several authors [17,18,19,20,21] considering the minimax risk
R^*(k,n) = \inf_{\hat{H}(\cdot)} \sup_{\mu \in \mathcal{P}(k)} \mathbb{E}_{X_1,\ldots,X_n \sim \mu^n} \left[ \left( \hat{H}(X_1,\ldots,X_n) - H(\mu) \right)^2 \right],
where P(k) denotes the collection of probabilities on [k] ≡ {1, …, k} and H(μ) is the entropy of μ (details in Section 2). Paninski [19] first showed that it was possible to construct an entropy estimator that uses a sub-linear sampling size to achieve minimax consistency when k goes to infinity, in the sense that there is a sequence (n_k) = o(k) where R*(k, n_k) → 0 as k goes to infinity. A set of results by Valiant et al. [20,21] shows that the optimal scaling of the sampling size with respect to k is O(k/log(k)) to achieve the aforementioned asymptotic consistency for entropy estimation. A refined set of results for the complete characterization of R*(k, n), the specific scaling of the sampling complexity, and the achievability of the obtained minimax L² risk for the family {P(k) : k ≥ 1} with practical estimators have been presented in [17,18]. On the other hand, it is well-known that the problem of estimating the distribution (consistently in total variation) in finite alphabets requires a sampling complexity that scales as O(k) [22]. Consequently, in finite alphabets the task of entropy estimation is simpler than estimating the distribution in terms of sampling complexity. These findings are consistent with the observation that the entropy is a continuous functional on the space of distributions (in the total variational distance sense) for the finite alphabet case [2,23,24,25].

1.1. The Challenging Infinite Alphabet Learning Scenario

In this work, we are interested in the countably infinite alphabet scenario, i.e., on the estimation of the entropy when the alphabet is countably infinite and we have a finite number of samples. This problem can be seen as an infinite dimensional regime as the size of the alphabet goes unbounded and n is kept finite for the analysis, which differs from the large dimensional regime mentioned above. As argued in [26] (Section IV), this is a challenging non-parametric learning problem because some of the finite alphabet properties of the entropy do not extend to this infinite dimensional context. Notably, it has been shown that the Shannon entropy is not a continuous functional with respect to the total variational distance in infinite alphabets [24,26,27]. In particular, Ho et al. [24] (Theorem 2) showed concrete examples where convergence in χ 2 -divergence and in direct information divergence (I-divergence) of a set of distributions to a limit, both stronger than total variational convergence [23,28], do not imply the convergence of the entropy. In addition, Harremoës [27] showed the discontinuity of the entropy with respect to the reverse I-divergence [29], and consequently, with respect to the total variational distance (the distinction between reverse and direct I-divergence was pointed out in the work of Barron et al. [29]). In entropy estimation, the discontinuity of the entropy implies that the minimax mean square error goes unbounded, i.e.,
R_n^* = \inf_{\hat{H}(\cdot)} \sup_{\mu \in \mathcal{H}(\mathbb{X})} \mathbb{E}_{X_1,\ldots,X_n \sim \mu^n} \left[ \left( \hat{H}(X_1,\ldots,X_n) - H(\mu) \right)^2 \right] = \infty,
where H(X) denotes the family of finite-entropy distributions over the countable alphabet X (the proof of this result follows from [26] (Theorem 1) and the argument is presented in Appendix A). Consequently, there is no universal minimax consistent estimator (in the mean square error sense) of the entropy over the family of finite-entropy distributions.
Considering a sample-wise (or point-wise) convergence to zero of the estimation error (instead of the minimax expected error analysis mentioned above), Antos et al. [30] (Theorem 2 and Corollary 1) show the remarkable result that the classical plug-in estimate is strongly consistent and consistent in the mean square error sense for any finite entropy distribution (point-wise). Thus, the classical plug-in entropy estimator is universal, meaning that the convergence to the right limiting value H(μ) is achieved almost surely despite the discontinuity of the entropy. Moving on to the analysis of the (point-wise) rate of convergence of the estimation error, Antos et al. [30] (Theorem 3) present a finite-length lower bound for the error of any arbitrary estimation scheme, showing as a corollary that no universal rate of convergence (to zero) can be achieved for entropy estimation in infinite alphabets [30] (Theorem 4). Finally, constraining the problem to a family of distributions with specific power tail bounded conditions, Antos et al. [30] (Theorem 7) present a finite-length expression for the rate of convergence of the estimation error of the classical plug-in estimate.

1.2. From Convergence Results to Entropy Estimation

In view of the discontinuity of the entropy in ∞-alphabets [24] and the results that guarantee entropy convergence [25,26,27,31], this work revisits the problem of point-wise almost-sure entropy estimation in ∞-alphabets from the perspective of studying and applying entropy convergence results and their derived bounds [25,26,31]. Importantly, entropy convergence results have established concrete conditions on both the limiting distribution μ and the way a sequence of distributions {μ_n : n ≥ 0} converges to μ such that lim_{n→∞} H(μ_n) = H(μ) is satisfied. The natural observation that motivates this work is that consistency is basically a convergence to the true entropy value that happens with probability one. Then our main conjecture is that putting these conditions in the context of a learning task, i.e., where {μ_n : n ≥ 0} is a random sequence of distributions driven by the classical empirical process, will offer the possibility of studying a broad family of plug-in estimators with the objective of deriving new strong consistency and rates of convergence results. On the practical side, this work proposes and analyzes a data-driven histogram-based estimator as a key learning scheme, since this approach offers the flexibility to adapt to the learning task when appropriate bounds for the estimation and approximation errors are derived.

1.3. Contributions

We begin by revisiting the classical plug-in entropy estimator, considering the relevant scenario where μ (the unknown distribution that produces the i.i.d. samples) has a finite but arbitrarily large and unknown support. This is declared to be a challenging problem by Ho and Yeung [26] (Theorem 13) because of the discontinuity of the entropy. Finite-length (non-asymptotic) deviation inequalities and intervals of confidence are derived, extending the results presented in [26] (Section IV). From this, it is shown that the classical plug-in estimate achieves optimal rates of convergence. Relaxing the finite support restriction on μ, two concrete histogram-based plug-in estimators are presented: one built upon the celebrated Barron-Györfi-van der Meulen histogram-based approach [29,32,33], and the other on a data-driven partition of the space [34,35,36]. For the Barron plug-in scheme, almost-sure consistency is shown for entropy estimation and distribution estimation in direct I-divergence under some mild support conditions on μ. For the data-driven partition scheme, the main context of application of this work, it is shown that this estimator is strongly consistent distribution-free, matching the universal result obtained for the classical plug-in approach in [30]. Furthermore, new almost-sure rates of convergence results (in the estimation error) are obtained for distributions with finite but unknown support and for families of distributions with power and exponential tail dominating conditions. In this context, our results show that this adaptive scheme has a concrete design solution that offers a very good convergence rate of the overall estimation error, as it approaches the rate O(1/√n) that is considered optimal for the finite alphabet case [16]. Importantly, the parameter selection of this scheme relies on, first, obtaining expressions to bound the estimation and approximation errors and, second, finding the optimal balance between these two learning errors.

1.4. Organization

The rest of the paper is organized as follows. Section 2 introduces basic concepts and notation, and summarizes the main entropy convergence results used in this work. Section 3, Section 4 and Section 5 state and elaborate the main results of this work. Discussion of the results and final remarks are given in Section 6. The technical derivations of the main results are presented in Section 7. Finally, proofs of auxiliary results are relegated to the Appendix.

2. Preliminaries

Let X be a countably infinite set and let P(X) denote the collection of probability measures on X. For μ and v in P(X) with μ absolutely continuous with respect to v (i.e., μ ≪ v), dμ/dv(x) denotes the Radon-Nikodym (RN) derivative of μ with respect to v. Every μ ∈ P(X) is equipped with its probability mass function (pmf), denoted by f_μ(x) ≡ μ({x}), ∀x ∈ X. Finally, for any μ ∈ P(X), A_μ ≡ {x ∈ X : f_μ(x) > 0} denotes its support and
\mathcal{F}(\mathbb{X}) \equiv \left\{ \mu \in \mathcal{P}(\mathbb{X}) : |A_\mu| < \infty \right\}
denotes the collection of probabilities with finite support.
Let μ and v be in P ( X ) , then the total variation distance of μ and v is given by [28]
V(\mu, v) \equiv \sup_{A \in 2^{\mathbb{X}}} |v(A) - \mu(A)|,
where 2 X denotes the subsets of X . The Kullback–Leibler divergence or I-divergence of μ with respect to v is given by
D(\mu||v) \equiv \sum_{x \in A_\mu} f_\mu(x) \log \frac{f_\mu(x)}{f_v(x)} \geq 0,
when μ ≪ v, while D(μ||v) is set to infinity otherwise [37].
The Shannon entropy of μ P ( X ) is given by [1,2,38]:
H(\mu) \equiv -\sum_{x \in A_\mu} f_\mu(x) \log f_\mu(x) \geq 0.
In this context, let H(X) ⊂ P(X) be the collection of probabilities where (4) is finite, let AC(X|v) ⊂ P(X) denote the collection of probabilities absolutely continuous with respect to v ∈ P(X), and let H(X|v) ⊂ AC(X|v) denote the collection of probabilities where (3) is finite for v ∈ P(X).
Concerning convergence, a sequence {μ_n : n ∈ ℕ} ⊂ P(X) is said to converge in total variation to μ ∈ P(X) if
\lim_{n \to \infty} V(\mu_n, \mu) = 0.
For countable alphabets, ref. [31] (Lemma 3) shows that the convergence in total variation is equivalent to the weak convergence, which is denoted here by μ_n ⇒ μ, and to the point-wise convergence of the pmf's. Furthermore, from (2), the convergence in total variation implies the uniform convergence of the pmf's, i.e., lim_{n→∞} sup_{x∈X} |μ_n({x}) − μ({x})| = 0. Therefore, in this countable case, all four previously mentioned notions of convergence are equivalent: total variation, weak convergence, point-wise convergence of the pmf's, and uniform convergence of the pmf's.
We conclude with the convergence in I-divergence introduced by Barron et al. [29]. It is said that {μ_n : n ∈ ℕ} converges to μ in direct and in reverse I-divergence if lim_{n→∞} D(μ||μ_n) = 0 and lim_{n→∞} D(μ_n||μ) = 0, respectively. From Pinsker's inequality [39,40,41], the convergence in I-divergence implies the weak convergence in (5), while it is known that the converse is not true [27].

2.1. Convergence Results for the Shannon Entropy

The discontinuity of the entropy in ∞-alphabets raises the problem of finding conditions under which convergence of the entropy can be obtained. On this topic, Ho et al. [26] have studied the interplay between the entropy and the total variation distance, specifying conditions for convergence by assuming a finite support on the involved distributions. On the other hand, Harremoës [27] (Theorem 21) obtained convergence of the entropy by imposing a power dominating condition [27] (Definition 17) on the limiting probability measure μ for all the sequences {μ_n : n ≥ 0} converging in reverse I-divergence to μ [29]. More recently, Silva et al. [25] have addressed entropy convergence by studying a number of new settings that involve conditions on the limiting measure μ, as well as on the way the sequence {μ_n : n ≥ 0} converges to μ in the space of distributions. These results offer sufficient conditions under which the entropy evaluated in a sequence of distributions converges to the entropy of its limiting distribution and, consequently, the possibility of applying them when analyzing plug-in entropy estimators. The results used in this work are summarized in the rest of this section.
Let us begin with the case when μ F ( X ) , i.e., when the support of the limiting measure is finite and unknown.
Proposition 1.
Let us assume that μ ∈ F(X) and {μ_n : n ∈ ℕ} ⊂ AC(X|μ). If μ_n ⇒ μ, then lim_{n→∞} D(μ_n||μ) = 0 and lim_{n→∞} H(μ_n) = H(μ).
This result is well-known because when A_{μ_n} ⊆ A_μ for all n, the scenario reduces to the finite alphabet case, where the entropy is known to be continuous [2,23]. Since the argument yields two inequalities that are used in the following sections, a simple proof is provided here.
Proof. 
μ and μ_n belong to H(X) from the finite-support assumption. The same argument can be used to show that D(μ_n||μ) < ∞, since μ_n ≪ μ for all n. Let us consider the following identity:
H(\mu) - H(\mu_n) = \sum_{x \in A_\mu} (f_{\mu_n}(x) - f_\mu(x)) \log f_\mu(x) + D(\mu_n||\mu).
The first term on the right hand side (RHS) of (6) is upper bounded by M_μ · V(μ_n, μ), where
M_\mu = \log \frac{1}{m_\mu} \equiv \sup_{x \in A_\mu} |\log \mu(\{x\})| < \infty.
For the second term, we have that
D(\mu_n||\mu) \leq \log e \cdot \sum_{x \in A_{\mu_n}} f_{\mu_n}(x) \left| \frac{f_{\mu_n}(x)}{f_\mu(x)} - 1 \right| \leq \frac{\log e}{m_\mu} \cdot \sup_{x \in A_\mu} |f_{\mu_n}(x) - f_\mu(x)| \leq \frac{\log e}{m_\mu} \cdot V(\mu_n, \mu),
and, consequently,
|H(\mu) - H(\mu_n)| \leq \left( M_\mu + \frac{\log e}{m_\mu} \right) \cdot V(\mu_n, \mu).
Under the assumptions of Proposition 1, we note that the reverse I-divergence and the entropy difference are bounded by the total variation in (8) and (9), respectively. Note, however, that these bounds are a distribution-dependent function of m_μ (equivalently M_μ) in (7) (it is direct to show that M_μ < ∞ if, and only if, μ ∈ F(X)). The next result relaxes the assumption that μ_n ≪ μ and offers a necessary and sufficient condition for the convergence of the entropy.
Lemma 1.
Ref. [25] (Theorem 1) Let μ ∈ F(X) and {μ_n : n ∈ ℕ} ⊂ F(X). If μ_n ⇒ μ, then there exists N > 0 such that μ ≪ μ_n for all n ≥ N, and
lim n D ( μ | | μ n ) = 0 .
Furthermore, lim_{n→∞} H(μ_n) = H(μ), if and only if,
\lim_{n \to \infty} \mu_n(A_{\mu_n} \setminus A_\mu) \cdot H(\mu_n(\cdot \,|\, A_{\mu_n} \setminus A_\mu)) = 0 \;\Longleftrightarrow\; \lim_{n \to \infty} \sum_{x \in A_{\mu_n} \setminus A_\mu} f_{\mu_n}(x) \log \frac{1}{f_{\mu_n}(x)} = 0,
where μ(·|B) denotes the conditional probability of μ given the event B ⊂ X.
Lemma 1 tells us that to achieve entropy convergence (on top of the weak convergence), it is necessary and sufficient that the entropy of μ_n restricted to the elements of the set A_{μ_n} \ A_μ vanishes with n. Two remarks about this result are: (1) The convergence in direct I-divergence does not imply the convergence of the entropy (concrete examples are presented in [24] (Section III) and [25]); (2) Under the assumption that μ ∈ F(X), μ is eventually absolutely continuous with respect to μ_n, and the convergence in total variation is equivalent to the convergence in direct I-divergence.
This section concludes with the case when the support of μ is infinite and unknown, i.e., |A_μ| = ∞. In this context, two results are highlighted:
Lemma 2.
Ref. [31] (Theorem 4) Let us consider μ ∈ H(X) and {μ_n : n ≥ 0} ⊂ AC(X|μ). If μ_n ⇒ μ and
M \equiv \sup_{n \geq 1} \sup_{x \in A_\mu} \frac{f_{\mu_n}(x)}{f_\mu(x)} < \infty,
then μ_n ∈ H(X) ∩ H(X|μ) for all n, and it follows that
\lim_{n \to \infty} D(\mu_n||\mu) = 0 \quad \text{and} \quad \lim_{n \to \infty} H(\mu_n) = H(\mu).
Interpreting Lemma 2, we have that, to obtain the convergence of the entropy functional (without imposing a finite support assumption on μ), a uniform bounding condition (UBC) μ-almost everywhere was added in (11). By adding this UBC, the convergence in reverse I-divergence is also obtained as a byproduct. Finally, when μ ≪ μ_n for all n, the following result is considered:
Lemma 3.
Ref. [25] (Theorem 3) Let μ ∈ H(X) and a sequence of measures {μ_n : n ≥ 1} ⊂ H(X) such that μ ≪ μ_n for all n ≥ 1. If μ_n ⇒ μ and
\sup_{n \geq 1} \sup_{x \in A_\mu} \left| \log \frac{f_{\mu_n}(x)}{f_\mu(x)} \right| < \infty,
then μ ∈ H(X|μ_n) for all n ≥ 1, and
\lim_{n \to \infty} D(\mu||\mu_n) = 0.
Furthermore, lim_{n→∞} H(μ_n) = H(μ), if and only if,
\lim_{n \to \infty} \sum_{x \in A_{\mu_n} \setminus A_\mu} f_{\mu_n}(x) \log \frac{1}{f_{\mu_n}(x)} = 0.
This result shows the non-sufficiency of the convergence in direct I-divergence to achieve entropy convergence in the regime when μ ≪ μ_n. In fact, Lemma 3 may be interpreted as an extension of Lemma 1 when the finite support assumption over μ is relaxed.

3. Shannon Entropy Estimation

Let μ be a probability in H(X), and let us denote by X_1, X_2, X_3, … the empirical process induced from i.i.d. realizations of a random variable driven by μ, i.e., X_i ∼ μ for all i. Let P_μ denote the distribution of the empirical process on (X^ℕ, B(X^ℕ)) and P_μ^n denote the finite-block distribution of X^n ≡ (X_1, …, X_n) on the product space (X^n, B(X^n)). Given a realization of X_1, X_2, X_3, …, X_n, we can construct a histogram-based estimator such as the classical empirical probability given by:
\hat{\mu}_n(A) \equiv \frac{1}{n} \sum_{k=1}^{n} \mathbf{1}_A(X_k), \quad \forall A \subseteq \mathbb{X},
with pmf given by f μ ^ n ( x ) = μ ^ n ( x ) for all x X . A natural estimator of the entropy is the plug-in estimate of μ ^ n given by
H(\hat{\mu}_n) = -\sum_{x \in \mathbb{X}} f_{\hat{\mu}_n}(x) \log f_{\hat{\mu}_n}(x),
which is a measurable function of X 1 , , X n (this dependency on the data will be implicit for the rest of the paper).
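To fix ideas, the following is a minimal computational sketch (not part of the original text) of the plug-in estimator in (14) and (15); the function names and the choice of Python are illustrative only.

import math
from collections import Counter

def plug_in_entropy(samples, base=2.0):
    """Classical plug-in estimator H(mu_hat_n) in (15): the entropy of the
    empirical pmf (14) induced by the i.i.d. sample."""
    n = len(samples)
    counts = Counter(samples)  # empirical measure mu_hat_n on the observed symbols
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

# Illustrative usage with a small sample over a countable alphabet:
# samples = [0, 1, 0, 2, 0, 1, 0, 0, 3, 1]
# print(plug_in_entropy(samples))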
For the rest of this section, as well as Section 4 and Section 5, the convergence results in Section 2.1 are used to derive strong consistency results for plug-in histogram-based estimators, like H(μ̂_n) in (15), as well as finite-length concentration inequalities that yield almost-sure rates of convergence for the overall estimation error |H(μ̂_n) − H(μ)|.

3.1. Revisiting the Classical Plug-In Estimator for Finite and Unknown Supported Distributions

We start by analyzing the case when μ has a finite but unknown support. A consequence of the strong law of large numbers [42,43] is that ∀x ∈ X, lim_{n→∞} μ̂_n({x}) = μ({x}), P_μ-almost surely (a.s.), hence lim_{n→∞} V(μ̂_n, μ) = 0, P_μ-a.s. On the other hand, it is clear that A_{μ̂_n} ⊆ A_μ holds with probability one. Then Proposition 1 implies that
lim n D ( μ ^ n | | μ ) = 0 and lim n H ( μ ^ n ) = H ( μ ) , P μ - a . s . ,
i.e., μ ^ n is a strongly consistent estimator of μ in reverse I-divergence and H ( μ ^ n ) is a strongly consistent estimate of H ( μ ) distribution-free in F ( X ) . Furthermore, the following can be stated:
Theorem 1.
Let μ ∈ F(X) and let us consider μ̂_n in (14). Then μ̂_n ∈ H(X) ∩ H(X|μ), P_μ-a.s., and ∀n ≥ 1, ∀ϵ > 0,
P_\mu^n\left( D(\hat{\mu}_n||\mu) > \epsilon \right) \leq 2^{|A_\mu|+1} \cdot e^{-\frac{2 m_\mu^2 n \epsilon^2}{(\log e)^2}},
P_\mu^n\left( |H(\hat{\mu}_n) - H(\mu)| > \epsilon \right) \leq 2^{|A_\mu|+1} \cdot e^{-\frac{2 n \epsilon^2}{(M_\mu + \frac{\log e}{m_\mu})^2}}.
Moreover, D(μ||μ̂_n) is eventually finite with probability one, and ∀ϵ > 0 and for any n ≥ 1,
P_\mu^n\left( D(\mu||\hat{\mu}_n) > \epsilon \right) \leq 2^{|A_\mu|+1} \cdot \left( e^{-\frac{2 n \epsilon^2}{(\log e)^2 \cdot (1/m_\mu + 1)^2}} + e^{-\frac{n m_\mu^2}{2}} \right).
This result implies that for any τ ∈ (0, 1/2) and μ ∈ F(X), |H(μ̂_n) − H(μ)|, D(μ̂_n||μ), and D(μ||μ̂_n) go to zero as o(n^{−τ}) P_μ-a.s. Furthermore, E_{P_μ^n}|H(μ̂_n) − H(μ)| and E_{P_μ^n}(D(μ̂_n||μ)) behave like O(1/√n) for all μ ∈ F(X) from (30) in Section 7, which is the optimal rate of convergence of the finite alphabet scenario. As a corollary of (18), it is possible to derive confidence intervals for the estimation error |H(μ̂_n) − H(μ)|: for all δ > 0 and n ≥ 1,
P_\mu\left( |H(\hat{\mu}_n) - H(\mu)| \leq \left( M_\mu + \frac{\log e}{m_\mu} \right) \sqrt{\frac{1}{2n} \ln \frac{2^{|A_\mu|+1}}{\delta}} \right) \geq 1 - \delta.
This confidence interval behaves like O(1/√n) as a function of n, and like O(√(ln(1/δ))) as a function of δ, which are the same optimal asymptotic trends that can be obtained for V(μ, μ̂_n) in (30).
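As a numerical illustration of (20) (not part of the original analysis), the deviation level can be evaluated directly; the values of |A_μ| and m_μ used below are hypothetical.

import math

def entropy_ci_length(n, delta, support_size, m_mu, base=2.0):
    """Evaluates (M_mu + log e / m_mu) * sqrt( ln(2^(|A_mu|+1)/delta) / (2n) ),
    the deviation level appearing in (20)."""
    M_mu = math.log(1.0 / m_mu, base)                                  # M_mu = log(1/m_mu)
    slack = math.sqrt(math.log(2.0 ** (support_size + 1) / delta) / (2.0 * n))
    return (M_mu + math.log(math.e, base) / m_mu) * slack

# Hypothetical example: |A_mu| = 10, m_mu = 0.01, delta = 0.05, n = 10000.
# print(entropy_ci_length(10000, 0.05, 10, 0.01))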
Finally, we observe that A_{μ̂_n} ⊆ A_μ, P_μ^n-a.s., while for any n ≥ 1, P_μ^n(A_{μ̂_n} ≠ A_μ) > 0, implying that E_{P_μ^n}(D(μ||μ̂_n)) = ∞ for all finite n. Then, even in the finite and unknown supported scenario, μ̂_n is not consistent in expected direct I-divergence, which is congruent with the results in [29,44]. Despite this negative result, strong consistency in direct I-divergence can be obtained from (19), in the sense that lim_{n→∞} D(μ||μ̂_n) = 0, P_μ-a.s.

3.2. A Simplified Version of the Barron Estimator for Finite Supported Probabilities

It is well-understood that consistency in expected direct I-divergence is of critical importance for the construction of a lossless universal source coding scheme [2,23,29,44,45,46,47,48]. Here, we explore an estimator that achieves this learning objective, in addition to entropy estimation. For that, let μ ∈ F(X) and let us assume v ∈ F(X) such that μ ≪ v. Barron et al. [29] proposed a modified version of the empirical measure in (14) to estimate μ from i.i.d. realizations, adopting a mixture estimate of the form
\tilde{\mu}_n(B) = (1 - a_n) \cdot \hat{\mu}_n(B) + a_n \cdot v(B),
for all B ⊆ X, and with (a_n)_{n∈ℕ} a sequence of real numbers in (0, 1). Note that A_{μ̃_n} = A_v; then μ ≪ μ̃_n for all n and, from the finite support assumption, H(μ̃_n) < ∞ and D(μ||μ̃_n) < ∞, P_μ-a.s. The following result derives from the convergence result in Lemma 1.
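A minimal sketch of the mixture construction in (21), assuming μ ≪ v and a finitely supported reference pmf v represented as a dictionary; the names and representation are illustrative and not taken from the paper.

import math
from collections import Counter

def barron_mixture_pmf(samples, v_pmf, a_n):
    """Mixture estimate (21): (1 - a_n) * mu_hat_n + a_n * v, as a pmf on A_v.
    Assumes every sample lies in the support of v (i.e., mu << v)."""
    n = len(samples)
    emp = Counter(samples)
    return {x: (1.0 - a_n) * emp.get(x, 0) / n + a_n * v_x
            for x, v_x in v_pmf.items()}

def entropy_of_pmf(pmf, base=2.0):
    return -sum(p * math.log(p, base) for p in pmf.values() if p > 0)

# Illustrative usage with a_n = 1/n, which is o(1) as required by Theorem 2(i):
# v_pmf = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
# samples = [0, 1, 0, 2, 0, 1]
# print(entropy_of_pmf(barron_mixture_pmf(samples, v_pmf, a_n=1.0 / len(samples))))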
Theorem 2.
Let v ∈ F(X), μ ≪ v, and let us consider μ̃_n in (21) induced from i.i.d. realizations of μ.
(i) 
If ( a n ) is o ( 1 ) , then lim n H ( μ ˜ n ) = H ( μ ) , lim n D ( μ | | μ ˜ n ) = 0 , P μ -a.s., and lim n E P μ ( D ( μ | | μ ˜ n ) ) = 0 .
(ii) 
Furthermore, if (a_n) is O(n^{−p}) with p > 2, then for all τ ∈ (0, 1/2), |H(μ̃_n) − H(μ)| and D(μ||μ̃_n) are o(n^{−τ}) P_μ-a.s., and E_{P_μ}(|H(μ̃_n) − H(μ)|) and E_{P_μ}(D(μ||μ̃_n)) are O(1/√n).
Using this approach, we achieve estimation of the true distribution in expected information divergence as well as strong consistency for entropy estimation as intended. In addition, optimal rates of convergence are obtained under the finite support assumption on μ .

4. The Barron-Györfi-van der Meulen Estimator

The celebrated Barron estimator was proposed by Barron, Györfi and van der Meulen [29] in the context of an abstract and continuous measurable space. It is designed as a variation of the classical histogram-based scheme to achieve a consistent estimation of the distribution in direct I-divergence [29] (Theorem 2). Here, the Barron estimator is revisited in the countable alphabet scenario, with the objective of estimating the Shannon entropy consistently, which, to the best of our knowledge, has not been previously addressed in the literature. For that purpose, the convergence result in Lemma 3 will be used as a key tool.
Let v ∈ P(X) be of infinite support (i.e., m_v ≡ inf_{x∈A_v} v({x}) = 0). We want to construct a strongly consistent estimate of the entropy restricted to the collection of probabilities in H(X|v). For that, let us consider a sequence (h_n)_{n≥0} with values in (0, 1) and let us denote by π_n = {A_{n,1}, A_{n,2}, …, A_{n,m_n}} the finite partition of X with maximal cardinality satisfying that
v(A_{n,i}) \geq h_n, \quad \forall i \in \{1, \ldots, m_n\}.
Note that m_n = |π_n| ≤ 1/h_n for all n ≥ 1 and, because inf_{x∈A_v} v({x}) = 0, it is simple to verify that if (h_n) is o(1) then lim_{n→∞} m_n = ∞. π_n offers an approximately statistically equivalent partition of X with respect to the reference measure v. In this context, given X_1, …, X_n, i.i.d. realizations of μ ∈ H(X|v), the idea proposed by Barron et al. [29] is to estimate the RN derivative dμ/dv(x) by the following histogram-based construction:
\frac{d\mu_n^*}{dv}(x) = (1 - a_n) \cdot \frac{\hat{\mu}_n(A_n(x))}{v(A_n(x))} + a_n, \quad \forall x \in A_v,
where a n is a real number in ( 0 , 1 ) , A n ( x ) denotes the cell in π n that contains the point x, and μ ^ n is the empirical measure in (14). Note that
f_{\mu_n^*}(x) = \frac{d\mu_n^*}{d\lambda}(x) = f_v(x) \cdot \left[ (1 - a_n) \cdot \frac{\hat{\mu}_n(A_n(x))}{v(A_n(x))} + a_n \right],
∀x ∈ X, and, consequently, ∀B ⊆ X,
\mu_n^*(B) = (1 - a_n) \sum_{i=1}^{m_n} \hat{\mu}_n(A_{n,i}) \cdot \frac{v(B \cap A_{n,i})}{v(A_{n,i})} + a_n \, v(B).
By construction, A_μ ⊆ A_v ≡ A_{μ_n^*} and, consequently, μ ≪ μ_n^* for all n ≥ 1. The next result shows sufficient conditions on the sequences (a_n) and (h_n) to guarantee a strongly consistent estimation of the entropy H(μ) and of μ in direct I-divergence, distribution-free in H(X|v). The proof is based on verifying that the sufficient conditions of Lemma 3 are satisfied P_μ-a.s.
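Before stating the result, the following is a rough computational sketch of the construction in (22)-(24), assuming the alphabet is indexed by the non-negative integers and truncating the partition at a finite x_max for illustration; the greedy cell construction, function names, and parameters are assumptions made for this sketch and not prescriptions of [29].

def bgm_partition(v_pmf_fn, h_n, x_max):
    """Greedy partition of {0, ..., x_max} into consecutive cells A_{n,i}
    with v(A_{n,i}) >= h_n; any remainder is merged into the last cell.
    (The cell covering the tail {x > x_max} is ignored in this sketch.)"""
    cells, cell, mass = [], [], 0.0
    for x in range(x_max + 1):
        cell.append(x)
        mass += v_pmf_fn(x)
        if mass >= h_n:
            cells.append(cell)
            cell, mass = [], 0.0
    if cell:
        if cells:
            cells[-1].extend(cell)
        else:
            cells.append(cell)
    return cells

def bgm_pmf_value(x, cells, samples, v_pmf_fn, a_n):
    """Evaluates f_{mu_n^*}(x) as in (23)-(24) for a point x covered by the partition."""
    n = len(samples)
    cell = next(c for c in cells if x in c)                  # A_n(x)
    v_cell = sum(v_pmf_fn(y) for y in cell)                  # v(A_n(x))
    emp_cell = sum(1 for s in samples if s in cell) / n      # mu_hat_n(A_n(x))
    return v_pmf_fn(x) * ((1.0 - a_n) * emp_cell / v_cell + a_n)

# Illustrative usage with a geometric reference measure v(x) = 2^-(x+1):
# cells = bgm_partition(lambda x: 2.0 ** -(x + 1), h_n=0.2, x_max=50)
# print(bgm_pmf_value(0, cells, [0, 1, 0, 2], lambda x: 2.0 ** -(x + 1), a_n=0.1))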
Theorem 3.
Let v be in P(X) ∩ H(X) with infinite support, and let us consider μ in H(X|v). If we have that:
(i) 
( a n ) is o ( 1 ) and ( h n ) is o ( 1 ) ,
(ii) 
∃τ ∈ (0, 1/2) such that the sequence (1/(a_n · h_n)) is o(n^τ),
then μ ∈ H(X) ∩ H(X|μ_n^*) for all n ≥ 1 and
\lim_{n \to \infty} H(\mu_n^*) = H(\mu) \quad \text{and} \quad \lim_{n \to \infty} D(\mu||\mu_n^*) = 0, \quad P_\mu\text{-a.s.}
This result shows an admissible regime of design parameters and their scaling with the number of samples that guarantees that the Barron plug-in entropy estimator is strongly consistent in H(X|v). As a byproduct, we obtain that the distribution μ is estimated consistently in direct information divergence.
The Barron estimator [29] was originally proposed in the context of distributions defined in an abstract measurable space. Then if we restrict [29] (Theorem 2) to the countable alphabet case, the following result is obtained:
Corollary 1.
Ref. [29] (Theorem 2) Let us consider v ∈ P(X) and μ ∈ H(X|v). If (a_n) is o(1), (h_n) is o(1) and lim sup_{n→∞} 1/(n · a_n · h_n) ≤ 1, then
\lim_{n \to \infty} D(\mu||\mu_n^*) = 0, \quad P_\mu\text{-a.s.}
When the only objective is the estimation of distributions consistently in direct I-divergence, Corollary 1 should be considered a better result than Theorem 3 (Corollary 1 offers weaker conditions than Theorem 3, in particular condition (ii)). The proof of Theorem 3 is based on verifying the sufficient conditions of Lemma 3, where the objective is to achieve the convergence of the entropy and, as a consequence, the convergence in direct I-divergence. Therefore, we can say that the stronger conditions of Theorem 3 are needed when the objective is entropy estimation. This is justified by the observation that convergence in direct I-divergence does not imply entropy convergence in ∞-alphabets, as discussed in Section 2.1 (see Lemmas 1 and 3).

5. A Data-Driven Histogram-Based Estimator

Data-driven partitions offer a better approximation to the data distribution in the sample space than conventional non-adaptive histogram-based approaches [34,49]. They have the capacity to improve the approximation quality of histogram-based learning schemes, which translates into better performance in different non-parametric learning settings [34,35,36,50,51]. One of the basic design principles of this approach is to partition or select a subset of elements of X in a data-dependent way so as to preserve a critical number of samples per cell. In our problem, this last condition proves to be crucial for deriving bounds on the estimation and approximation errors. Finally, these expressions will be used to propose design solutions that offer an optimal balance between estimation and approximation errors (Theorems 5 and 6).
Given X 1 , , X n i.i.d. realizations driven by μ H ( X ) and ϵ > 0 , let us define the data-driven set
\Gamma_\epsilon \equiv \left\{ x \in \mathbb{X} : \hat{\mu}_n(\{x\}) \geq \epsilon \right\},
and ϕ_ϵ ≡ Γ_ϵ^c. Let Π_ϵ ≡ {{x} : x ∈ Γ_ϵ} ∪ {ϕ_ϵ} ⊂ 2^X be a data-driven partition with maximal resolution in Γ_ϵ, and σ_ϵ ≡ σ(Π_ϵ) be the smallest sigma field that contains Π_ϵ (as Π_ϵ is a finite partition, σ_ϵ is the collection of sets that are unions of elements of Π_ϵ). We propose the conditional empirical probability restricted to Γ_ϵ by:
\hat{\mu}_{n,\epsilon} \equiv \hat{\mu}_n(\cdot \,|\, \Gamma_\epsilon).
By construction, it follows that A_{μ̂_{n,ϵ}} = Γ_ϵ ⊆ A_μ, P_μ-a.s., and this implies that μ̂_{n,ϵ} ≪ μ for all n ≥ 1. Furthermore, |Γ_ϵ| ≤ 1/ϵ and, importantly in the context of the entropy functional, it follows that
m_{\hat{\mu}_n}^{\epsilon} \equiv \inf_{x \in \Gamma_\epsilon} \hat{\mu}_n(\{x\}) \geq \epsilon.
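A minimal sketch (not from the paper) of the estimator in (26)-(27): keep the symbols whose empirical mass reaches the threshold, renormalize, and evaluate the entropy; function names are illustrative.

import math
from collections import Counter

def data_driven_plug_in_entropy(samples, eps_n, base=2.0):
    """H(mu_hat_{n, eps_n}): entropy of the empirical measure conditioned on
    the data-driven set Gamma_{eps_n} = {x : mu_hat_n(x) >= eps_n}.
    Assumes Gamma_{eps_n} is non-empty for the given sample and threshold."""
    n = len(samples)
    emp = {x: c / n for x, c in Counter(samples).items()}
    gamma = {x: p for x, p in emp.items() if p >= eps_n}   # Gamma_{eps_n}
    z = sum(gamma.values())                                # mu_hat_n(Gamma_{eps_n})
    return -sum((p / z) * math.log(p / z, base) for p in gamma.values())

# A threshold choice consistent with Theorem 4: eps_n = n^{-tau} with tau in (0, 1).
# samples = [0, 1, 0, 2, 0, 1, 0, 0, 3, 1]
# print(data_driven_plug_in_entropy(samples, eps_n=len(samples) ** -0.9))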
The next result establishes a mild sufficient condition on (ϵ_n) under which H(μ̂_{n,ϵ_n}) is strongly consistent distribution-free in H(X). Considering that we are in the regime where μ̂_{n,ϵ_n} ≪ μ, P_μ-a.s., the proof of this result uses the convergence result in Lemma 2 as a central tool.
Theorem 4.
If (ϵ_n) is O(n^{−τ}) with τ ∈ (0, 1), then for all μ ∈ H(X)
\lim_{n \to \infty} H(\hat{\mu}_{n,\epsilon_n}) = H(\mu), \quad P_\mu\text{-a.s.}
Complementing Theorem 4, the next result offers almost-sure rates of convergence for a family of distributions with a power tail bounded condition (TBC). In particular, the family of distributions studied in [30] (Theorem 7) is considered.
Theorem 5.
Let us assume that for some p > 1 there are two constants 0 < k_0 ≤ k_1 and N > 0 such that k_0 · x^{−p} ≤ μ({x}) ≤ k_1 · x^{−p} for all x ≥ N. If we consider (ϵ_n) ≈ (n^{−τ*}) for τ* = 1/(2 + 1/p), then
|H(\mu) - H(\hat{\mu}_{n,\epsilon_n})| \text{ is } O\!\left(n^{-\frac{1-1/p}{2+1/p}}\right), \quad P_\mu\text{-a.s.}
This result shows that under the mentioned p-power TBC on f_μ(·), the plug-in estimator H(μ̂_{n,ϵ_n}) can achieve a rate of convergence to the true limit that is O(n^{−(1−1/p)/(2+1/p)}) with probability one. For the derivation of this result, the approximation sequence (ϵ_n) is defined as a function of p (adapted to the problem) by finding an optimal tradeoff between estimation and approximation errors while performing a finite-length (non-asymptotic) analysis of the expression |H(μ) − H(μ̂_{n,ϵ_n})| (the details of this analysis are presented in Section 7).
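As a concrete numerical instance of this statement (a direct evaluation of the theorem, not an additional result), for p = 2 the prescription is
\tau^* = \frac{1}{2 + 1/p} = \frac{2}{5}, \qquad \frac{1 - 1/p}{2 + 1/p} = \frac{1/2}{5/2} = \frac{1}{5},
i.e., the threshold is chosen as ϵ_n ≈ n^{−2/5} and the resulting almost-sure rate for |H(μ) − H(μ̂_{n,ϵ_n})| is O(n^{−1/5}).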
It is insightful to look at two extreme regimes of this result: p approaching 1, in which the rate is arbitrarily slow (approaching a non-decaying behavior); and p → ∞, where |H(μ) − H(μ̂_{n,ϵ_n})| is O(n^{−q}) for all q ∈ (0, 1/2) P_μ-a.s. This last power decaying range q ∈ (0, 1/2) matches what is achieved for the finite alphabet scenario (for instance in Theorem 1, Equation (18)), which is known to be the optimal rate for finite alphabets.
Extending Theorem 5, the following result addresses the more constrained case of distributions with an exponential TBC.
Theorem 6.
Let us consider α > 0 and let us assume that there are k_0, k_1 with 0 < k_0 ≤ k_1 and N > 0 such that k_0 · e^{−αx} ≤ μ({x}) ≤ k_1 · e^{−αx} for all x ≥ N. If we consider (ϵ_n) ≈ (n^{−τ}) with τ ∈ (0, 1/2), then
|H(\mu) - H(\hat{\mu}_{n,\epsilon_n})| \text{ is } O(n^{-\tau} \log n), \quad P_\mu\text{-a.s.}
Under this stringent TBC on f_μ(·), it is observed that |H(μ) − H(μ̂_{n,ϵ_n})| is o(n^{−q}) P_μ-a.s. for any arbitrary q ∈ (0, 1/2), by selecting (ϵ_n) ≈ (n^{−τ}) with q < τ < 1/2. This last condition on τ is universal over α > 0. Remarkably, for any distribution with this exponential TBC, we can approximate (arbitrarily closely) the optimal almost-sure rate of convergence achieved for the finite alphabet problem.
Finally, the finite and unknown supported scenario is revisited, where it is shown that the data-driven estimator exhibits the optimal almost sure convergence rate of the classical plug-in entropy estimator presented in Section 3.1.
Theorem 7.
Let us assume that μ ∈ F(X) and that (ϵ_n) is o(1). Then for all ϵ > 0 there is N > 0 such that ∀n ≥ N
P_\mu^n\left( |H(\hat{\mu}_{n,\epsilon_n}) - H(\mu)| > \epsilon \right) \leq 2^{|A_\mu|+1} \cdot \left( e^{-\frac{2 n \epsilon^2}{(M_\mu + \frac{\log e}{m_\mu})^2}} + e^{-\frac{n m_\mu^2}{4}} \right).
The proof of this result reduces to verifying that μ̂_{n,ϵ_n} detects A_μ almost surely as n goes to infinity; from this, it follows that H(μ̂_{n,ϵ_n}) eventually matches the optimal almost-sure performance of H(μ̂_n) under the key assumption that μ ∈ F(X). Finally, the concentration bound in (29) implies that |H(μ̂_{n,ϵ_n}) − H(μ)| is o(n^{−q}) almost surely for all q ∈ (0, 1/2), as long as ϵ_n → 0 with n.

6. Discussion of the Results and Final Remarks

This work shows that entropy convergence results are instrumental to derive new (strongly consistent) estimation results for the Shannon entropy in ∞-alphabets and, as a byproduct, distribution estimators that are strongly consistent in direct and reverse I-divergence. Adopting a set of sufficient conditions for entropy convergence in the context of four plug-in histogram-based schemes, this work shows concrete design conditions where strong consistency for entropy estimation in ∞-alphabets can be obtained (Theorems 2–4). In addition, the relevant case where the target distribution has a finite but unknown support is explored, deriving almost sure rates of convergence results for the overall estimation error (Theorems 1 and 7) that match the optimal asymptotic rate that can be obtained in the finite alphabet version of the problem (i.e., the finite and known supported case).
As the main context of application, this work focuses on the case of a data-driven plug-in estimator that restricts the support where the distribution is estimated. The idea is to have design parameters that control the estimation and approximation error effects, and to find an adequate balance between these two learning errors. Adopting the entropy convergence result in Lemma 2, it is shown that this data-driven scheme offers the same universal estimation attributes as the classical plug-in estimate under some mild conditions on its threshold design parameter (Theorem 4). In addition, by addressing the technical task of deriving concrete closed-form expressions for the estimation and approximation errors in this learning context, a solution is presented where almost-sure rates of convergence of the overall estimation error are obtained over a family of distributions with some concrete tail bounded conditions (Theorems 5 and 6). These results show the capacity that data-driven frameworks offer for adapting aspects of their learning scheme to the complexity of the entropy estimation task in ∞-alphabets.
Concerning the classical plug-in estimator presented in Section 3.1, it is important to mention that the work of Antos et al. [30] shows that lim_{n→∞} H(μ̂_n) = H(μ) happens almost surely and distribution-free and, furthermore, it provides rates of convergence for families with specific tail-bounded conditions [30] (Theorem 7). Theorem 1 focuses on the case when μ ∈ F(X), where new finite-length deviation inequalities and confidence intervals are derived. From that perspective, Theorem 1 complements the results presented in [30] in the non-explored scenario when μ ∈ F(X). It is also important to mention two results by Ho and Yeung [26] (Theorems 11 and 12) for the plug-in estimator in (15). They derived bounds for P_μ^n(|H(μ̂_n) − H(μ)| ≥ ϵ) and determined confidence intervals under a finite and known support restriction on μ. In contrast, Theorem 1 resolves the case of a finite and unknown supported distribution, which is declared to be a challenging problem from the arguments presented in [26] (Theorem 13) concerning the discontinuity of the entropy.

7. Proof of the Main Results

Proof of Theorem 1.
Let μ be in F(X); then |A_μ| = k for some k > 1. From Hoeffding's inequality [28], ∀n ≥ 1 and for any ϵ > 0,
P_\mu^n\left( V(\hat{\mu}_n, \mu) > \epsilon \right) \leq 2^{k+1} \cdot e^{-2n\epsilon^2} \quad \text{and} \quad \mathbb{E}_{P_\mu^n}(V(\hat{\mu}_n, \mu)) \leq \sqrt{\frac{2(k+1)\log 2}{n}}.
Considering that μ̂_n ≪ μ, P_μ-a.s., we can use Proposition 1 to obtain that
D(\hat{\mu}_n||\mu) \leq \frac{\log e}{m_\mu} \cdot V(\hat{\mu}_n, \mu), \quad \text{and} \quad |H(\hat{\mu}_n) - H(\mu)| \leq \left( M_\mu + \frac{\log e}{m_\mu} \right) \cdot V(\hat{\mu}_n, \mu).
Hence, (17) and (18) derive from (30).
For the direct I-divergence, let us consider a sequence (x_i)_{i≥1} and the following function (a stopping time):
T_o(x_1, x_2, \ldots) \equiv \inf \left\{ n \geq 1 : A_{\hat{\mu}_n(x^n)} = A_\mu \right\}.
T_o(x_1, x_2, …) is the point where the support of μ̂_n(x^n) is equal to A_μ and, consequently, the direct I-divergence is finite (since μ ∈ F(X)). In fact, by the uniform convergence of μ̂_n to μ (P_μ-a.s.) and the finite support assumption on μ, it is simple to verify that P_μ(T_o(X_1, X_2, …) < ∞) = 1. Let us define the event:
B_n \equiv \left\{ (x_1, x_2, \ldots) : T_o(x_1, x_2, \ldots) \leq n \right\} \subset \mathbb{X}^{\mathbb{N}},
i.e., the collection of sequences in X^ℕ where, at time n, A_{μ̂_n} = A_μ and, consequently, D(μ||μ̂_n) < ∞. Restricted to this set,
D(\mu||\hat{\mu}_n) \leq \sum_{x \in A_{\hat{\mu}_n||\mu}} f_{\hat{\mu}_n}(x) \log \frac{f_{\hat{\mu}_n}(x)}{f_\mu(x)} + \sum_{x \in A_\mu \setminus A_{\hat{\mu}_n||\mu}} f_{\hat{\mu}_n}(x) \log \frac{f_\mu(x)}{f_{\hat{\mu}_n}(x)} \leq \log e \cdot \sum_{x \in A_{\hat{\mu}_n||\mu}} f_{\hat{\mu}_n}(x) \cdot \left( \frac{f_{\hat{\mu}_n}(x)}{f_\mu(x)} - 1 \right)
+ \log e \cdot \left( \mu(A_\mu \setminus A_{\hat{\mu}_n||\mu}) - \hat{\mu}_n(A_\mu \setminus A_{\hat{\mu}_n||\mu}) \right)
\leq \log e \cdot \left( 1/m_\mu + 1 \right) \cdot V(\mu, \hat{\mu}_n),
where in the first inequality A_{μ̂_n||μ} ≡ {x ∈ A_{μ̂_n} : f_{μ̂_n}(x) > f_μ(x)}, and the last is obtained by the definition of the total variational distance. In addition, let us define the ϵ-deviation set A_n^ϵ ≡ {(x_1, x_2, …) : D(μ||μ̂_n(x^n)) > ϵ} ⊂ X^ℕ. Then, by additivity and monotonicity of P_μ, we have that
P_\mu(A_n^\epsilon) \leq P_\mu(A_n^\epsilon \cap B_n) + P_\mu(B_n^c).
By definition of B n , (36) and (30) it follows that
P_\mu(A_n^\epsilon \cap B_n) \leq P_\mu\left( V(\mu, \hat{\mu}_n) \cdot \log e \cdot (1/m_\mu + 1) > \epsilon \right) \leq 2^{|A_\mu|+1} \cdot e^{-\frac{2n\epsilon^2}{(\log e)^2 \cdot (1/m_\mu + 1)^2}}.
On the other hand, ∀ϵ_o ∈ (0, m_μ), if V(μ, μ̂_n) ≤ ϵ_o then T_o ≤ n. Consequently, B_n^c ⊆ {(x_1, x_2, …) : V(μ, μ̂_n(x^n)) > ϵ_o}, and again from (30),
P_\mu(B_n^c) \leq 2^{|A_\mu|+1} \cdot e^{-2n\epsilon_o^2},
for all n ≥ 1 and ϵ_o ∈ (0, m_μ). Integrating the results in (38) and (39) and considering ϵ_o = m_μ/2 suffices to show the bound in (19). ☐
Proof of Theorem 2.
As ( a n ) is o ( 1 ) , it is simple to verify that lim n V ( μ ˜ n , μ ) = 0 , P μ -a.s. Also note that the support disagreement between μ ˜ n and μ is bounded by the hypothesis, then
\lim_{n \to \infty} \tilde{\mu}_n(A_{\tilde{\mu}_n} \setminus A_\mu) \cdot \log |A_{\tilde{\mu}_n} \setminus A_\mu| \leq \lim_{n \to \infty} \tilde{\mu}_n(A_{\tilde{\mu}_n} \setminus A_\mu) \cdot \log |A_v| = 0, \quad P_\mu\text{-a.s.}
Therefore from Lemma 1, we have the strong consistency of H ( μ ˜ n ) and the almost sure convergence of D ( μ | | μ ˜ n ) to zero. Note that D ( μ | | μ ˜ n ) is uniformly upper bounded by log e · ( 1 / m μ + 1 ) V ( μ , μ ˜ n ) (see (36) in the proof of Theorem 1). Then the convergence in probability of D ( μ | | μ ˜ n ) implies the convergence of its mean [42], which concludes the proof of the first part.
Concerning rates of convergence, we use the following:
H(\mu) - H(\tilde{\mu}_n) = \sum_{x \in A_\mu \cap A_{\tilde{\mu}_n}} (f_{\tilde{\mu}_n}(x) - f_\mu(x)) \log f_\mu(x) + \sum_{x \in A_\mu \cap A_{\tilde{\mu}_n}} f_{\tilde{\mu}_n}(x) \log \frac{f_{\tilde{\mu}_n}(x)}{f_\mu(x)} - \sum_{x \in A_{\tilde{\mu}_n} \setminus A_\mu} f_{\tilde{\mu}_n}(x) \log \frac{1}{f_{\tilde{\mu}_n}(x)}.
The absolute value of the first term in the right hand side (RHS) of (41) is bounded by M μ · V ( μ ˜ n , μ ) and the second term is bounded by log e / m μ · V ( μ ˜ n , μ ) , from the assumption that μ F ( X ) . For the last term, note that f μ ˜ n ( x ) = a n · v ( x ) for all x A μ ˜ n \ A μ and that A μ ˜ n = A v , then
0 \leq \sum_{x \in A_{\tilde{\mu}_n} \setminus A_\mu} f_{\tilde{\mu}_n}(x) \log \frac{1}{f_{\tilde{\mu}_n}(x)} \leq a_n \cdot \left( H(v) + \log \frac{1}{a_n} \cdot v(A_v \setminus A_\mu) \right).
On the other hand,
V(\tilde{\mu}_n, \mu) = \frac{1}{2} \left[ \sum_{x \in A_\mu} \left| (1 - a_n)\hat{\mu}_n(\{x\}) + a_n v(\{x\}) - \mu(\{x\}) \right| + \sum_{x \in A_v \setminus A_\mu} a_n v(\{x\}) \right] \leq (1 - a_n) \cdot V(\hat{\mu}_n, \mu) + a_n.
Integrating these bounds in (41),
|H(\mu) - H(\tilde{\mu}_n)| \leq \left( M_\mu + \frac{\log e}{m_\mu} \right) \cdot \left( (1 - a_n) \cdot V(\hat{\mu}_n, \mu) + a_n \right) + a_n \cdot H(v) + a_n \cdot \log \frac{1}{a_n} \leq K_1 \cdot V(\hat{\mu}_n, \mu) + K_2 \cdot a_n + a_n \cdot \log \frac{1}{a_n},
for constants K_1 > 0 and K_2 > 0 that are functions of μ and v.
Under the assumption that μ ∈ F(X), Hoeffding's inequality [28,52] tells us that P_μ(V(μ̂_n, μ) > ϵ) ≤ C_1 · e^{−C_2 n ϵ²} (for some distribution-free constants C_1 > 0 and C_2 > 0). From this inequality, V(μ̂_n, μ) goes to zero as o(n^{−τ}) P_μ-a.s. ∀τ ∈ (0, 1/2), and E_{P_μ}(V(μ̂_n, μ)) is O(1/√n). On the other hand, under the assumption in (ii), (K_2 · a_n + a_n · log(1/a_n)) is O(1/√n), which from (42) proves the rate of convergence results for |H(μ) − H(μ̃_n)|.
Considering the direct I-divergence, D(μ||μ̃_n) ≤ log e · ∑_{x∈A_μ} f_μ(x) |f_μ(x)/f_{μ̃_n}(x) − 1| ≤ (log e/m_{μ̃_n}) · V(μ̃_n, μ). Then the uniform convergence of μ̃_n({x}) to μ({x}) P_μ-a.s. in A_μ and the fact that |A_μ| < ∞ imply that, for an arbitrarily small ϵ > 0 (in particular smaller than m_μ),
\lim_{n \to \infty} D(\mu||\tilde{\mu}_n) \leq \frac{\log e}{m_\mu - \epsilon} \cdot \lim_{n \to \infty} V(\tilde{\mu}_n, \mu), \quad P_\mu\text{-a.s.}
(43) suffices to obtain the convergence result for the I-divergence. ☐
Proof of Theorem 3.
Let us define the oracle Barron measure μ ˜ n by:
f_{\tilde{\mu}_n}(x) = \frac{d\tilde{\mu}_n}{d\lambda}(x) = f_v(x) \left[ (1 - a_n) \cdot \frac{\mu(A_n(x))}{v(A_n(x))} + a_n \right],
where we consider the true probability instead of its empirical version in (23). Then, the following convergence result can be obtained (see Proposition A2 in Appendix B),
\lim_{n \to \infty} \sup_{x \in A_{\tilde{\mu}_n}} \left| \frac{d\tilde{\mu}_n}{d\mu_n^*}(x) - 1 \right| = 0, \quad P_\mu\text{-a.s.}
Let A denote the collection of sequences (x_1, x_2, …) where the convergence in (45) holds (this set is typical, meaning that P_μ(A) = 1). The rest of the proof reduces to showing that, for any arbitrary (x_n)_{n≥1} ∈ A, its respective sequence of induced measures {μ_n^* : n ≥ 1} (the dependency of μ_n^* on the sequence (x_n)_{n≥1} will be considered implicit for the rest of the proof) satisfies the sufficient conditions of Lemma 3.
Let us fix an arbitrary (x_n)_{n≥1} ∈ A:
Weak convergence μ_n^* ⇒ μ: Without loss of generality, we consider that A_{μ̃_n} = A_v for all n ≥ 1. Since a_n → 0 and h_n → 0, f_{μ̃_n}(x) → μ({x}) ∀x ∈ A_v, so we obtain the weak convergence of μ̃_n to μ. On the other hand, by definition of A, lim_{n→∞} sup_{x∈A_{μ̃_n}} |f_{μ̃_n}(x)/f_{μ_n^*}(x) − 1| = 0, which implies that lim_{n→∞} |f_{μ_n^*}(x) − f_{μ̃_n}(x)| = 0 for all x ∈ A_v and, consequently, μ_n^* ⇒ μ.
The condition in (12): By construction μ ≪ μ_n^*, μ ≪ μ̃_n and μ̃_n ≪ μ_n^* for all n; then we will use the following equality:
\log \frac{d\mu}{d\mu_n^*}(x) = \log \frac{d\mu}{d\tilde{\mu}_n}(x) + \log \frac{d\tilde{\mu}_n}{d\mu_n^*}(x),
for all x ∈ A_μ. Concerning the approximation error term of (46), i.e., log(dμ/dμ̃_n)(x), ∀x ∈ A_μ:
\frac{d\tilde{\mu}_n}{d\mu}(x) = (1 - a_n) \frac{\mu(A_n(x))}{\mu(\{x\})} \frac{v(\{x\})}{v(A_n(x))} + a_n \frac{v(\{x\})}{\mu(\{x\})}.
Given that μ ∈ H(X|v), this is equivalent to stating that log(dμ/dv(x)) is bounded μ-almost everywhere, which is equivalent to saying that m ≡ inf_{x∈A_μ} dμ/dv(x) > 0 and M ≡ sup_{x∈A_μ} dμ/dv(x) < ∞. From this, ∀A ⊆ A_μ,
m \cdot v(A) \leq \mu(A) \leq M \cdot v(A).
Then we have that, ∀x ∈ A_μ, m/M ≤ (μ(A_n(x))/μ({x}))·(v({x})/v(A_n(x))) ≤ M/m. Therefore, for n sufficiently large, 0 < (1/2)·(m/M) ≤ dμ̃_n/dμ(x) ≤ M/m + M < ∞ for all x in A_μ. Hence, there exists N_o > 0 such that sup_{n≥N_o} sup_{x∈A_μ} |log(dμ̃_n/dμ)(x)| < ∞.
For the estimation error term of (46), i.e., log(dμ̃_n/dμ_n^*)(x), note that, from the fact that (x_n) ∈ A and the convergence in (45), there exists N_1 > 0 such that for all n ≥ N_1, sup_{x∈A_μ} |log(dμ̃_n/dμ_n^*)(x)| < ∞, given that A_μ ⊆ A_{μ̃_n} = A_v. Then, using (46), for all n ≥ max{N_0, N_1}, sup_{x∈A_μ} |log(dμ_n^*/dμ)(x)| < ∞, which verifies (12).
The condition in (13): Defining the function ϕ_n^*(x) ≡ 1_{A_v∖A_μ}(x) · f_{μ_n^*}(x) log(1/f_{μ_n^*}(x)), we want to verify that lim_{n→∞} ∫_X ϕ_n^*(x) dλ(x) = 0. Considering that (x_n) ∈ A, for all ϵ > 0 there exists N(ϵ) > 0 such that sup_{x∈A_{μ̃_n}} |f_{μ̃_n}(x)/f_{μ_n^*}(x) − 1| < ϵ, and then
(1 - \epsilon) f_{\tilde{\mu}_n}(x) < f_{\mu_n^*}(x) < (1 + \epsilon) f_{\tilde{\mu}_n}(x), \quad \text{for all } x \in A_v.
From (49), 0 ≤ ϕ_n^*(x) ≤ (1 + ϵ) f_{μ̃_n}(x) log(1/((1 − ϵ) f_{μ̃_n}(x))) for all n ≥ N(ϵ). Analyzing f_{μ̃_n}(x) in (44), there are two scenarios: A_n(x) ∩ A_μ = ∅, where f_{μ̃_n}(x) = a_n f_v(x); and, otherwise, f_{μ̃_n}(x) = f_v(x)·(a_n + (1 − a_n) μ(A_n(x) ∩ A_μ)/v(A_n(x))). Let us define:
B_n \equiv \left\{ x \in A_v \setminus A_\mu : A_n(x) \cap A_\mu = \emptyset \right\} \quad \text{and} \quad C_n \equiv \left\{ x \in A_v \setminus A_\mu : A_n(x) \cap A_\mu \neq \emptyset \right\}.
Then, for all n ≥ N(ϵ),
\sum_{x \in \mathbb{X}} \phi_n^*(x) \leq \sum_{x \in A_v \setminus A_\mu} (1 + \epsilon) f_{\tilde{\mu}_n}(x) \log \frac{1}{(1 - \epsilon) f_{\tilde{\mu}_n}(x)} = \sum_{x \in B_n} (1 + \epsilon) a_n f_v(x) \log \frac{1}{(1 - \epsilon) a_n f_v(x)} + \sum_{x \in \mathbb{X}} \tilde{\phi}_n(x),
with ϕ̃_n(x) ≡ 1_{C_n}(x)·(1 + ϵ) f_{μ̃_n}(x) log(1/((1 − ϵ) f_{μ̃_n}(x))). The left term in (51) is upper bounded by a_n(1 + ϵ)(H(v) + log(1/a_n)), which goes to zero with n from (a_n) being o(1) and the fact that v ∈ H(X). For the right term in (51), (h_n) being o(1) implies that x belongs to B_n eventually (in n) ∀x ∈ A_v∖A_μ; then ϕ̃_n(x) tends to zero point-wise as n goes to infinity. On the other hand, for all x ∈ C_n (see (50)), we have that
\frac{1}{1/m + 1} \leq \frac{\mu(A_n(x) \cap A_\mu)}{v(A_n(x) \cap A_\mu) + v(A_v \setminus A_\mu)} \leq \frac{\mu(A_n(x))}{v(A_n(x))} \leq \frac{\mu(A_n(x) \cap A_\mu)}{v(A_n(x) \cap A_\mu)} \leq M.
These inequalities derive from (48). Consequently, for all x ∈ X, if n is sufficiently large such that a_n < 0.5, then
0 \leq \tilde{\phi}_n(x) \leq (1 + \epsilon)(a_n + (1 - a_n)M) f_v(x) \log \frac{1}{(1 - \epsilon)(a_n + (1 - a_n)m/(m+1)) f_v(x)} \leq (1 + \epsilon)(1 + M) f_v(x) \left( \log \frac{2(m+1)}{1 - \epsilon} + \log \frac{1}{f_v(x)} \right).
Hence, from (50), ϕ̃_n(x) is bounded by a fixed function that is ℓ¹(X) by the assumption that v ∈ H(X). Then, by the dominated convergence theorem [43] and (51),
\lim_{n \to \infty} \sum_{x \in \mathbb{X}} \phi_n^*(x) \leq \lim_{n \to \infty} \sum_{x \in \mathbb{X}} \tilde{\phi}_n(x) = 0.
In summary, we have shown that for any arbitrary (x_n) ∈ A the sufficient conditions of Lemma 3 are satisfied, which proves the result in (25), recalling that P_μ(A) = 1 from (45). ☐
Proof of Theorem 4.
Let us first introduce the oracle probability
\mu_{\epsilon_n} \equiv \mu(\cdot \,|\, \Gamma_{\epsilon_n}) \in \mathcal{P}(\mathbb{X}).
Note that μ_{ϵ_n} is a random probability measure (a function of the i.i.d. sequence X_1, …, X_n), as Γ_{ϵ_n} is a data-driven set; see (26). We will first show that:
\lim_{n \to \infty} H(\mu_{\epsilon_n}) = H(\mu) \quad \text{and} \quad \lim_{n \to \infty} D(\mu_{\epsilon_n}||\mu) = 0, \quad P_\mu\text{-a.s.}
Under the assumption on (ϵ_n) of Theorem 4, lim_{n→∞} |μ(Γ_{ϵ_n}) − μ̂_n(Γ_{ϵ_n})| = 0, P_μ-a.s. (this result derives from the fact that lim_{n→∞} V(μ/σ_{ϵ_n}, μ̂_n/σ_{ϵ_n}) = 0, P_μ-a.s., from (63)). In addition, since (ϵ_n) is o(1), then lim_{n→∞} μ̂_n(Γ_{ϵ_n}) = 1, which implies that lim_{n→∞} μ(Γ_{ϵ_n}) = 1, P_μ-a.s. From this, μ_{ϵ_n} ⇒ μ, P_μ-a.s. Let us consider a sequence (x_n) where lim_{n→∞} μ(Γ_{ϵ_n}) = 1. Constrained to that,
\limsup_{n \to \infty} \sup_{x \in A_\mu} \frac{f_{\mu_{\epsilon_n}}(x)}{f_\mu(x)} = \limsup_{n \to \infty} \frac{1}{\mu(\Gamma_{\epsilon_n})} < \infty.
Then there is N > 0 such that sup_{n>N} sup_{x∈A_μ} f_{μ_{ϵ_n}}(x)/f_μ(x) < ∞. Hence, from Lemma 2, lim_{n→∞} D(μ_{ϵ_n}||μ) = 0 and lim_{n→∞} |H(μ_{ϵ_n}) − H(μ)| = 0. Finally, the set of sequences (x_n) where lim_{n→∞} μ(Γ_{ϵ_n}) = 1 has probability one (with respect to P_μ), which proves (55).
For the rest of the proof, we concentrate on the analysis of |H(μ̂_{n,ϵ_n}) − H(μ_{ϵ_n})|, which can be attributed to the estimation error aspect of the problem. It is worth noting that, by construction, A_{μ̂_{n,ϵ_n}} = A_{μ_{ϵ_n}} = Γ_{ϵ_n}, P_μ-a.s. Consequently, we can use
H(\hat{\mu}_{n,\epsilon_n}) - H(\mu_{\epsilon_n}) = \sum_{x \in \Gamma_{\epsilon_n}} (\mu_{\epsilon_n}(\{x\}) - \hat{\mu}_{n,\epsilon_n}(\{x\})) \log \hat{\mu}_{n,\epsilon_n}(\{x\}) + D(\mu_{\epsilon_n}||\hat{\mu}_{n,\epsilon_n}).
The first term on the RHS of (57) is upper bounded by log(1/m_{μ̂_n}^{ϵ_n}) · V(μ_{ϵ_n}, μ̂_{n,ϵ_n}) ≤ log(1/ϵ_n) · V(μ_{ϵ_n}, μ̂_{n,ϵ_n}). Concerning the second term on the RHS of (57), it is possible to show (details presented in Appendix C) that
D(\mu_{\epsilon_n}||\hat{\mu}_{n,\epsilon_n}) \leq \frac{2 \log e}{\epsilon_n \, \mu(\Gamma_{\epsilon_n})} \cdot V(\mu/\sigma_{\epsilon_n}, \hat{\mu}_n/\sigma_{\epsilon_n}),
where
V(\mu/\sigma_{\epsilon_n}, \hat{\mu}_n/\sigma_{\epsilon_n}) \equiv \sup_{A \in \sigma_{\epsilon_n}} |\mu(A) - \hat{\mu}_n(A)|.
In addition, it can be verified (details presented in Appendix D) that
V(\mu_{\epsilon_n}, \hat{\mu}_{n,\epsilon_n}) \leq K \cdot V(\mu/\sigma_{\epsilon_n}, \hat{\mu}_n/\sigma_{\epsilon_n}),
for some universal constant K > 0 . Therefore from (57), (58) and (60), there is C > 0 such that
|H(\hat{\mu}_{n,\epsilon_n}) - H(\mu_{\epsilon_n})| \leq \frac{C}{\mu(\Gamma_{\epsilon_n})} \log \frac{1}{\epsilon_n} \cdot V(\mu/\sigma_{\epsilon_n}, \hat{\mu}_n/\sigma_{\epsilon_n}).
As mentioned before, μ(Γ_{ϵ_n}) goes to 1 almost surely, so we need to concentrate on the analysis of the asymptotic behavior of log(1/ϵ_n) · V(μ/σ_{ϵ_n}, μ̂_n/σ_{ϵ_n}). From Hoeffding's inequality [28], we have that, ∀δ > 0,
P_\mu^n\left( \log(1/\epsilon_n) \cdot V(\mu/\sigma_{\epsilon_n}, \hat{\mu}_n/\sigma_{\epsilon_n}) > \delta \right) \leq 2^{|\Gamma_{\epsilon_n}|+1} \cdot e^{-\frac{2n\delta^2}{(\log 1/\epsilon_n)^2}},
considering that, by construction, |σ_{ϵ_n}| ≤ 2^{|Γ_{ϵ_n}|+1} ≤ 2^{1/ϵ_n + 1}. Assuming that (ϵ_n) is O(n^{−τ}),
\ln P_\mu^n\left( \log(1/\epsilon_n) \cdot V(\mu/\sigma_{\epsilon_n}, \hat{\mu}_n/\sigma_{\epsilon_n}) > \delta \right) \leq (n^\tau + 1) \ln 2 - \frac{2n\delta^2}{(\tau \log n)^2}.
Therefore, for all τ ∈ (0, 1), δ > 0 and any arbitrary l ∈ (τ, 1),
\limsup_{n \to \infty} \frac{1}{n^l} \cdot \ln P_\mu^n\left( \log(1/\epsilon_n) \cdot V(\mu/\sigma_{\epsilon_n}, \hat{\mu}_n/\sigma_{\epsilon_n}) > \delta \right) < 0.
This last result is sufficient to show that ∑_{n≥1} P_μ^n(log(1/ϵ_n) · V(μ/σ_{ϵ_n}, μ̂_n/σ_{ϵ_n}) > δ) < ∞, which concludes the argument from the Borel-Cantelli Lemma. ☐
Proof of Theorem 5.
We consider the expression
|H(\mu) - H(\hat{\mu}_{n,\epsilon_n})| \leq |H(\mu) - H(\mu_{\epsilon_n})| + |H(\mu_{\epsilon_n}) - H(\hat{\mu}_{n,\epsilon_n})|
to analyze the approximation error and the estimation error terms separately.
• Approximation Error Analysis
Note that |H(μ) − H(μ_{ϵ_n})| is a random object, as μ_{ϵ_n} in (54) is a function of the data-dependent partition and, consequently, a function of X_1, …, X_n. In the following, we consider the oracle set
\tilde{\Gamma}_{\epsilon_n} \equiv \left\{ x \in \mathbb{X} : \mu(\{x\}) \geq \epsilon_n \right\},
and the oracle conditional probability
\tilde{\mu}_{\epsilon_n} \equiv \mu(\cdot \,|\, \tilde{\Gamma}_{\epsilon_n}) \in \mathcal{P}(\mathbb{X}).
Note that Γ̃_{ϵ_n} is a deterministic function of (ϵ_n), and so is the measure μ̃_{ϵ_n} in (66). From the definitions and the triangle inequality:
|H(\mu) - H(\tilde{\mu}_{\epsilon_n})| \leq \sum_{x \in \tilde{\Gamma}_{\epsilon_n}^c} \mu(\{x\}) \log \frac{1}{\mu(\{x\})} + \log \frac{1}{\mu(\tilde{\Gamma}_{\epsilon_n})} + \left( \frac{1}{\mu(\tilde{\Gamma}_{\epsilon_n})} - 1 \right) \cdot \sum_{x \in \tilde{\Gamma}_{\epsilon_n}} \mu(\{x\}) \log \frac{1}{\mu(\{x\})},
and, similarly, the approximation error is bounded by
|H(\mu) - H(\mu_{\epsilon_n})| \leq \sum_{x \in \Gamma_{\epsilon_n}^c} \mu(\{x\}) \log \frac{1}{\mu(\{x\})} + \log \frac{1}{\mu(\Gamma_{\epsilon_n})} + \left( \frac{1}{\mu(\Gamma_{\epsilon_n})} - 1 \right) \cdot \sum_{x \in \Gamma_{\epsilon_n}} \mu(\{x\}) \log \frac{1}{\mu(\{x\})}.
We denote the RHS of (67) and (68) by a ϵ n and b ϵ n ( X 1 , , X n ) , respectively.
We can show that if (ϵ_n) is O(n^{−τ}) and τ ∈ (0, 1/2), then
\limsup_{n \to \infty} \left( b_{\epsilon_n}(X_1, \ldots, X_n) - a_{2\epsilon_n} \right) \leq 0, \quad P_\mu\text{-a.s.},
which from (68) implies that |H(μ) − H(μ_{ϵ_n})| is O(a_{2ϵ_n}), P_μ-a.s. The proof of (69) is presented in Appendix E.
Then, we need to analyze the rate of convergence of the deterministic sequence (a_{2ϵ_n}). Analyzing the RHS of (67), we recognize two independent terms: the partial entropy sum ∑_{x∈Γ̃_{ϵ_n}^c} μ({x}) log(1/μ({x})) and the rest, which is bounded asymptotically by μ(Γ̃_{ϵ_n}^c)(1 + H(μ)), using the fact that ln x ≤ x − 1 for x ≥ 1. Here is where the tail condition on μ plays a role. From the tail condition, we have that
\mu(\tilde{\Gamma}_{\epsilon_n}^c) \leq \mu\left( \left\{ (k_0/\epsilon_n)^{1/p} + 1, (k_0/\epsilon_n)^{1/p} + 2, (k_0/\epsilon_n)^{1/p} + 3, \ldots \right\} \right) = \sum_{x \geq (k_0/\epsilon_n)^{1/p} + 1} \mu(\{x\}) \leq k_1 \cdot S_{(k_0/\epsilon_n)^{1/p} + 1},
where S_{x_o} ≡ ∑_{x ≥ x_o} x^{−p}. Similarly, as {0, 1, …, (k_0/ϵ_n)^{1/p}} ⊆ Γ̃_{ϵ_n}, then
\sum_{x \in \tilde{\Gamma}_{\epsilon_n}^c} \mu(\{x\}) \log \frac{1}{\mu(\{x\})} \leq \sum_{x \geq (k_0/\epsilon_n)^{1/p} + 1} \mu(\{x\}) \log \frac{1}{\mu(\{x\})} \leq \sum_{x \geq (k_0/\epsilon_n)^{1/p} + 1} k_1 x^{-p} \cdot \log \frac{1}{k_0 x^{-p}} \leq k_1 p \cdot R_{(k_0/\epsilon_n)^{1/p} + 1} + k_1 \log(1/k_0) \cdot S_{(k_0/\epsilon_n)^{1/p} + 1},
where R_{x_o} ≡ ∑_{x ≥ x_o} x^{−p} log x.
In Appendix F, it is shown that S_{x_o} ≤ C_0 · x_o^{1−p} and R_{x_o} ≤ C_1 · x_o^{1−p} for constants C_1 > 0 and C_0 > 0. Integrating these results in the RHS of (70) and (71), and considering that (ϵ_n) is O(n^{−τ}), we have that both μ(Γ̃_{ϵ_n}^c) and ∑_{x∈Γ̃_{ϵ_n}^c} μ({x}) log(1/μ({x})) are O(n^{−τ(p−1)/p}). This implies that our oracle sequence (a_{ϵ_n}) is O(n^{−τ(p−1)/p}).
In conclusion, if ϵ_n is O(n^{−τ}) for τ ∈ (0, 1/2), it follows that
|H(\mu) - H(\mu_{\epsilon_n})| \text{ is } O\!\left(n^{-\frac{\tau(p-1)}{p}}\right), \quad P_\mu\text{-a.s.}
• Estimation Error Analysis
Let us consider |H(μ_{ϵ_n}) − H(μ̂_{n,ϵ_n})|. From the bound in (61) and the fact that, for any τ ∈ (0, 1), lim_{n→∞} μ(Γ_{ϵ_n}) = 1, P_μ-a.s. from (63), the problem reduces to analyzing the rate of convergence of the following random object:
\rho_n(X_1, \ldots, X_n) \equiv \log \frac{1}{\epsilon_n} \cdot V(\mu/\sigma(\Gamma_{\epsilon_n}), \hat{\mu}_n/\sigma(\Gamma_{\epsilon_n})).
We will analyze, instead, the oracle version of ρ_n(X_1, …, X_n) given by:
\xi_n(X_1, \ldots, X_n) \equiv \log \frac{1}{\epsilon_n} \cdot V(\mu/\sigma(\tilde{\Gamma}_{\epsilon_n/2}), \hat{\mu}_n/\sigma(\tilde{\Gamma}_{\epsilon_n/2})),
where Γ̃_ϵ ≡ {x ∈ X : μ({x}) ≥ ϵ} is the oracle counterpart of Γ_ϵ in (26). To do so, we can show that if ϵ_n is O(n^{−τ}) with τ ∈ (0, 1/2), then
\liminf_{n \to \infty} \left( \xi_n(X_1, \ldots, X_n) - \rho_n(X_1, \ldots, X_n) \right) \geq 0, \quad P_\mu\text{-a.s.}
The proof of (75) is presented in Appendix G.
Moving to the almost-sure rate of convergence of ξ_n(X_1, …, X_n), it is simple to show for our p-power dominating distribution that if (ϵ_n) is O(n^{−τ}) and τ ∈ (0, p), then
\lim_{n \to \infty} \xi_n(X_1, \ldots, X_n) = 0, \quad P_\mu\text{-a.s.},
and, more specifically,
\xi_n(X_1, \ldots, X_n) \text{ is } o(n^{-q}) \text{ for all } q \in (0, (1 - \tau/p)/2), \quad P_\mu\text{-a.s.}
The argument is presented in Appendix H.
In conclusion, if ϵ_n is O(n^{−τ}) for τ ∈ (0, 1/2), it follows that
|H(\mu_{\epsilon_n}) - H(\hat{\mu}_{n,\epsilon_n})| \text{ is } O(n^{-q}), \quad P_\mu\text{-a.s.},
for all q ∈ (0, (1 − τ/p)/2).
• Estimation vs. Approximation Errors
Coming back to (64) and using (72) and (77), the analysis reduces to finding the solution τ* in (0, 1/2) that offers the best trade-off between the estimation and approximation error rates:
\tau^* \equiv \arg\max_{\tau \in (0, 1/2)} \min \left\{ \frac{1 - \tau/p}{2}, \; \frac{\tau(p-1)}{p} \right\}.
It is simple to verify that τ* = 1/2. Then, by considering τ arbitrarily close to the admissible limit 1/2, we can achieve a rate of convergence for |H(μ) − H(μ̂_{n,ϵ_n})| that is arbitrarily close to O(n^{−(1/2)(1−1/p)}), P_μ-a.s.
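For completeness, a one-line verification of this claim (added here for the reader; it is not part of the original derivation): the two exponents in (79) are equal where
\frac{1 - \tau/p}{2} = \frac{\tau(p-1)}{p} \iff \tau = \frac{p}{2p-1} > \frac{1}{2},
so on (0, 1/2) the minimum in (79) is the increasing term τ(p−1)/p, whose supremum is approached as τ ↑ 1/2 with value (p−1)/(2p) = (1/2)(1−1/p).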
More formally, for any l ∈ (0, (1/2)(1−1/p)) we can take τ ∈ (l/(1−1/p), 1/2), where |H(μ) − H(μ̂_{n,ϵ_n})| is o(n^{−l}), P_μ-a.s., from (72) and (77).
Finally, a simple corollary of this analysis is to consider τ(p) = 1/(2 + 1/p) < 1/2, where:
|H(\mu) - H(\hat{\mu}_{n,\epsilon_n})| \text{ is } O\!\left(n^{-\frac{1-1/p}{2+1/p}}\right), \quad P_\mu\text{-a.s.},
which concludes the argument. ☐
Proof of Theorem 6.
The argument follows the proof of Theorem 5. In particular, we use the estimation-approximation error bound:
|H(\mu) - H(\hat{\mu}_{n,\epsilon_n})| \leq |H(\mu) - H(\mu_{\epsilon_n})| + |H(\mu_{\epsilon_n}) - H(\hat{\mu}_{n,\epsilon_n})|,
and the following two results derived in the proof of Theorem 5: If (ϵ_n) is O(n^{−τ}) with τ ∈ (0, 1/2), then (for the approximation error)
|H(\mu) - H(\mu_{\epsilon_n})| \text{ is } O(a_{2\epsilon_n}), \quad P_\mu\text{-a.s.},
with a_{ϵ_n} = ∑_{x∈Γ̃_{ϵ_n}^c} μ({x}) log(1/μ({x})) + μ(Γ̃_{ϵ_n}^c)(1 + H(μ)), while (for the estimation error)
H ( μ ϵ n ) H ( μ ^ n , ϵ n ) is O ( ξ n ( X 1 , , X n ) ) P μ - a . s . ,
with ξ n ( X 1 , , X n ) = log 1 ϵ n · V ( μ / σ ( Γ ˜ ϵ n / 2 ) , μ ^ n / σ ( Γ ˜ ϵ n / 2 ) ) .
For the estimation error, we need to bound the rate of convergence of $\xi_n(X_1,\ldots,X_n)$ to zero almost surely. We first note that $\{1,\ldots,x_o(\epsilon_n)\} = \tilde{\Gamma}_{\epsilon_n}$ with $x_o(\epsilon_n) = \frac{1}{\alpha}\ln(k_0/\epsilon_n)$. Then, from Hoeffding's inequality, we have that
$$\mathbb{P}_{\mu^n}\big(\xi_n(X_1,\ldots,X_n) > \delta\big) \leq 2^{|\tilde{\Gamma}_{\epsilon_n/2}|}\cdot e^{-\frac{2n\delta^2}{(\log(1/\epsilon_n))^2}} \leq 2^{\frac{1}{\alpha}\ln(2k_0/\epsilon_n)+1}\cdot e^{-\frac{2n\delta^2}{(\log(1/\epsilon_n))^2}}.$$
Considering $\epsilon_n = O(n^{-\tau})$, an arbitrary sequence $(\delta_n)$ being $o(1)$, and $l > 0$, it follows from (83) that
$$\frac{1}{n^l}\cdot\ln\mathbb{P}_{\mu^n}\big(\xi_n(X_1,\ldots,X_n) > \delta_n\big) \leq \frac{\ln(2)\left(\frac{1}{\alpha}\ln(2k_0/\epsilon_n)+1\right)}{n^l} - \frac{2\, n^{1-l}\,\delta_n^2}{(\log(1/\epsilon_n))^2}.$$
We note that the first term on the RHS of (84) is $O\big(\frac{\log n}{n^l}\big)$ and goes to zero for all $l > 0$, while the second term is $O\big(\frac{n^{1-l}\delta_n^2}{(\log n)^2}\big)$. If we consider $\delta_n = O(n^{-q})$, this second term is $O\big(n^{1-2q-l}\cdot\frac{1}{(\log n)^2}\big)$. Therefore, for any $q \in (0,1/2)$ we can take an arbitrary $l \in (0, 1-2q]$ such that $\mathbb{P}_{\mu^n}\big(\xi_n(X_1,\ldots,X_n) > \delta_n\big)$ is $O(e^{-n^l})$ from (84). This result implies, from the Borel-Cantelli Lemma, that $\xi_n(X_1,\ldots,X_n)$ is $o(\delta_n)$, $\mathbb{P}_\mu$-a.s., which in summary shows that $|H(\mu_{\epsilon_n}) - H(\hat{\mu}_{n,\epsilon_n})|$ is $O(n^{-q})$ for all $q \in (0,1/2)$.
For the approximation error, it is simple to verify that:
$$\mu(\tilde{\Gamma}_{\epsilon_n}^c) \leq k_1\cdot\sum_{x \geq x_o(\epsilon_n)+1} e^{-\alpha x} = k_1\cdot\tilde{S}_{x_o(\epsilon_n)+1}$$
and
$$\sum_{x \in \tilde{\Gamma}_{\epsilon_n}^c}\mu(x)\log\frac{1}{\mu(x)} \leq \sum_{x \geq x_o(\epsilon_n)+1} k_1 e^{-\alpha x}\log\frac{1}{k_0 e^{-\alpha x}} = k_1\log\frac{1}{k_0}\cdot\tilde{S}_{x_o(\epsilon_n)+1} + \alpha\log e\cdot k_1\cdot\tilde{R}_{x_o(\epsilon_n)+1},$$
where $\tilde{S}_{x_o} \equiv \sum_{x \geq x_o} e^{-\alpha x}$ and $\tilde{R}_{x_o} \equiv \sum_{x \geq x_o} x\cdot e^{-\alpha x}$. At this point, it is not difficult to show that $\tilde{S}_{x_o} \leq M_1 e^{-\alpha x_o}$ and $\tilde{R}_{x_o} \leq M_2 e^{-\alpha x_o}\cdot x_o$ for some constants $M_1 > 0$ and $M_2 > 0$. Integrating these partial steps, we have that
$$a_{\epsilon_n} \leq k_1\left(1 + H(\mu) + \log\frac{1}{k_0}\right)\cdot\tilde{S}_{x_o(\epsilon_n)+1} + \alpha\log e\cdot k_1\cdot\tilde{R}_{x_o(\epsilon_n)+1} \leq O_1\cdot\epsilon_n + O_2\cdot\epsilon_n\log\frac{1}{\epsilon_n}$$
for some constants $O_1 > 0$ and $O_2 > 0$. The last step follows from the evaluation of $x_o(\epsilon_n) = \frac{1}{\alpha}\ln(k_0/\epsilon_n)$. Therefore, from (81) and (87), it follows that $|H(\mu) - H(\mu_{\epsilon_n})|$ is $O(n^{-\tau}\log n)$, $\mathbb{P}_\mu$-a.s., for all $\tau \in (0,1/2)$.
The argument concludes by integrating in (80) the almost sure convergence results obtained for the estimation and approximation errors. ☐
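As an illustration of the approximation bound just used (a sketch only; the value of $\alpha$, the truncation point $N$, and the normalized geometric-type law below are arbitrary choices, not the exact model of Theorem 6), one can check numerically that the oracle tail mass and tail entropy scale like $\epsilon$ and $\epsilon\log(1/\epsilon)$:

```python
import numpy as np

# Sketch (assumed model): mu(x) proportional to exp(-alpha*x) on {1,...,N},
# with N large enough that the truncation is negligible. The tail mass outside
# Gamma~_eps should be O(eps) and the tail entropy O(eps*log(1/eps)).
alpha, N = 0.7, 500
x = np.arange(1, N + 1, dtype=float)
w = np.exp(-alpha * x)
mu = w / w.sum()

for eps in [1e-2, 1e-3, 1e-4, 1e-5]:
    tail = mu < eps                       # complement of the oracle set Gamma~_eps
    tail_mass = mu[tail].sum()
    tail_ent = -(mu[tail] * np.log2(mu[tail])).sum()
    print(f"eps={eps:.0e}: tail_mass/eps = {tail_mass / eps:.3f}, "
          f"tail_ent/(eps*log2(1/eps)) = {tail_ent / (eps * np.log2(1 / eps)):.3f}")
```

Both ratios remain bounded as $\epsilon$ decreases, which is the behavior captured by the $O_1\cdot\epsilon_n + O_2\cdot\epsilon_n\log\frac{1}{\epsilon_n}$ bound above.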
Proof of Theorem 7.
Let us define the event
$$B_n^\epsilon = \left\{x^n \in \mathbb{X}^n : \Gamma_\epsilon(x^n) = A_\mu\right\},$$
which represents the detection of the support of $\mu$ from the data for a given $\epsilon > 0$ in (26). Note that the dependency of $\Gamma_\epsilon$ on the data is made explicit in this notation. In addition, let us consider the deviation event
$$A_n^\epsilon(\mu) = \left\{x^n \in \mathbb{X}^n : V(\mu, \hat{\mu}_n) > \epsilon\right\}.$$
By the hypothesis that $|A_\mu| < \infty$, we have that $m_\mu = \min_{x \in A_\mu} f_\mu(x) > 0$. Therefore, if $x^n \in (A_n^{m_\mu/2}(\mu))^c$ then $\hat{\mu}_n(x) \geq m_\mu/2$ for all $x \in A_\mu$, which implies that $(B_n^\epsilon)^c \subseteq A_n^{m_\mu/2}(\mu)$ as long as $0 < \epsilon \leq m_\mu/2$. Using the hypothesis that $\epsilon_n \to 0$, there is $N > 0$ such that for all $n \geq N$, $(B_n^{\epsilon_n})^c \subseteq A_n^{m_\mu/2}(\mu)$ and, consequently,
$$\mathbb{P}_{\mu^n}\big((B_n^{\epsilon_n})^c\big) \leq \mathbb{P}_{\mu^n}\big(A_n^{m_\mu/2}(\mu)\big) \leq 2^{k+1}\cdot e^{-\frac{n m_\mu^2}{4}},$$
the last step from Hoeffding's inequality, considering $k = |A_\mu| < \infty$.
If we consider the events:
$$C_n^\epsilon(\mu) = \left\{x^n \in \mathbb{X}^n : |H(\hat{\mu}_{n,\epsilon_n}) - H(\mu)| > \epsilon\right\}$$
and
$$D_n^\epsilon(\mu) = \left\{x^n \in \mathbb{X}^n : |H(\hat{\mu}_n) - H(\mu)| > \epsilon\right\},$$
and we use the fact that, by definition, $\hat{\mu}_{n,\epsilon_n} = \hat{\mu}_n$ conditioning on $B_n^{\epsilon_n}$, it follows that $C_n^\epsilon(\mu)\cap B_n^{\epsilon_n} \subseteq D_n^\epsilon(\mu)$. Then, for all $\epsilon > 0$ and $n \geq N$,
$$\mathbb{P}_{\mu^n}\big(C_n^\epsilon(\mu)\big) \leq \mathbb{P}_{\mu^n}\big(C_n^\epsilon(\mu)\cap B_n^{\epsilon_n}\big) + \mathbb{P}_{\mu^n}\big((B_n^{\epsilon_n})^c\big) \leq \mathbb{P}_{\mu^n}\big(D_n^\epsilon(\mu)\big) + \mathbb{P}_{\mu^n}\big((B_n^{\epsilon_n})^c\big) \leq 2^{k+1}\, e^{-\frac{2n\epsilon^2}{(M_\mu + \log\frac{e}{m_\mu})^2}} + e^{-\frac{n m_\mu^2}{4}},$$
the last inequality from Theorem 1 and (90). ☐
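To see the mechanism of this proof in action, the following simulation sketch uses a hypothetical finite-support $\mu$; it assumes a specific reading of the data-driven construction in (26), namely $\Gamma_\epsilon = \{x : \hat{\mu}_n(x) \geq \epsilon\}$ with $H(\hat{\mu}_{n,\epsilon_n})$ the entropy of $\hat{\mu}_n$ restricted and re-normalized on $\Gamma_{\epsilon_n}$; the target distribution, the choice $\epsilon_n = n^{-1/2}$, and the helper names are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(q):
    q = q[q > 0]
    return -(q * np.log2(q)).sum()

def plug_in_restricted(counts, n, eps):
    """Entropy of the empirical measure conditioned on {x : mu_hat_n(x) >= eps}."""
    mu_hat = counts / n
    gamma = mu_hat >= eps                  # data-driven support estimate
    q = mu_hat[gamma] / mu_hat[gamma].sum()
    return entropy(q)

mu = np.array([0.4, 0.3, 0.15, 0.1, 0.05])   # hypothetical finite-support target
H_true = entropy(mu)

for n in [100, 1_000, 10_000, 100_000]:
    eps_n = n ** (-0.5)                       # any eps_n -> 0 satisfies the hypothesis
    counts = rng.multinomial(n, mu)
    H_hat = plug_in_restricted(counts, n, eps_n)
    print(f"n={n:>6}: |H_hat - H(mu)| = {abs(H_hat - H_true):.4f}  (eps_n={eps_n:.4f})")
```

Once $\epsilon_n$ drops below $m_\mu/2$, the data-driven support matches $A_\mu$ with overwhelming probability, and the deviation probability decays exponentially in $n$, as in the bound above.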

Funding

The work is supported by funding from FONDECYT Grant 1170854, CONICYT-Chile and the Advanced Center for Electrical and Electronic Engineering (AC3E), Basal Project FB0008.

Acknowledgments

The author is grateful to Patricio Parada for his insights and stimulating discussion in the initial stage of this work. The author thanks the anonymous reviewers for their valuable comments and suggestions, and his colleagues Claudio Estevez, Rene Mendez and Ruben Claveria for proofreading this material.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Minimax Risk for Finite Entropy Distributions in ∞-Alphabets

Proposition A1.
$$R_n^* = \infty.$$
For the proof, we use the following lemma that follows from [26] (Theorem 1).
Lemma A1.
Let us fix two arbitrary real numbers $\delta > 0$ and $\epsilon > 0$. Then there are two finitely supported distributions $P, Q \in \mathcal{H}(\mathbb{X})$ that satisfy $D(P||Q) < \epsilon$ while $H(Q) - H(P) > \delta$.
The proof of Lemma A1 derives from the same construction presented in the proof of [26] (Theorem 1), i.e., $P = (p_1, \ldots, p_L)$ and a modification of it, $Q_M = \big(p_1\cdot(1-1/M),\, p_2 + \tfrac{p_1}{M^M},\, \ldots,\, p_L + \tfrac{p_1}{M^M},\, \tfrac{p_1}{M^M},\, \ldots,\, \tfrac{p_1}{M^M}\big)$, both distributions of finite support and consequently in $\mathcal{H}(\mathbb{X})$. It is simple to verify that, as $M$ goes to infinity, $D(P||Q_M) \to 0$ while $H(Q_M) - H(P) \to \infty$.
Proof. 
For any pair of distributions $P, Q$ in $\mathcal{H}(\mathbb{X})$, Le Cam's two-point method [53] shows that:
$$R_n^* \geq \frac{1}{4}\,\big(H(Q) - H(P)\big)^2\,\exp\big(-n\, D(P||Q)\big).$$
Adopting Lemma A1 and Equation (A1), for any $n$ and any arbitrary $\epsilon > 0$ and $\delta > 0$, we have that $R_n^* > \frac{\delta^2\exp(-n\epsilon)}{4}$. Then, exploiting the discontinuity of the entropy in infinite alphabets, we can fix $\epsilon$ and make $\delta$ arbitrarily large. ☐
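A numeric illustration of this argument (with an arbitrary $P$ and a possible reading of the modification $Q_M$ described above; the exact construction is the one in [26]) shows the two quantities moving in opposite directions as $M$ grows:

```python
import numpy as np

# Illustrative sketch of the construction behind Lemma A1: remove mass p1/M from
# the first atom of P and spread it in pieces of size p1/M^M (over the remaining
# atoms of P plus new atoms). As M grows, D(P||Q_M) -> 0 while H(Q_M) - H(P)
# increases without bound, so the Le Cam bound (A1) can be made arbitrarily large.
P = np.array([0.5, 0.3, 0.2])              # arbitrary finitely supported P
L = len(P)
H_P = -(P * np.log2(P)).sum()

for M in [5, 10, 20, 50, 100]:
    piece = P[0] / M ** M                  # mass of each small piece
    q_head = np.concatenate(([P[0] * (1 - 1 / M)], P[1:] + piece))
    n_extra = M ** (M - 1) - (L - 1)       # number of brand-new atoms
    extra_mass = n_extra * piece           # total mass moved to new atoms (~ p1/M)
    H_Q = -(q_head * np.log2(q_head)).sum() + extra_mass * np.log2(1 / piece)
    D_PQ = (P * np.log2(P / q_head)).sum() # P's support is contained in Q_M's support
    print(f"M={M:>3}: D(P||Q_M) = {D_PQ:.5f} bits,  H(Q_M) - H(P) = {H_Q - H_P:.3f} bits")
```

In this sketch the entropy gap grows roughly like $p_1\log_2 M$ while the divergence vanishes, which is the discontinuity exploited in the proof.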

Appendix B. Proposition A2

Proposition A2.
Under the assumptions of Theorem 3:
$$\lim_{n\to\infty}\sup_{x \in A_{\tilde{\mu}_n}}\left|\frac{d\tilde{\mu}_n}{d\mu_n^*}(x) - 1\right| = 0, \quad \mathbb{P}_\mu\text{-a.s.}$$
Proof. 
First note that $A_{\tilde{\mu}_n} = A_{\mu_n^*}$; then $\frac{d\tilde{\mu}_n}{d\mu_n^*}(x)$ is finite and, for all $x \in A_{\tilde{\mu}_n}$,
$$\frac{d\tilde{\mu}_n}{d\mu_n^*}(x) = \frac{(1-a_n)\cdot\mu(A_n(x)) + a_n v(A_n(x))}{(1-a_n)\cdot\hat{\mu}_n(A_n(x)) + a_n v(A_n(x))}.$$
Then by construction,
$$\sup_{x \in A_{\tilde{\mu}_n}}\left|\frac{d\tilde{\mu}_n}{d\mu_n^*}(x) - 1\right| \leq \frac{\sup_{A \in \pi_n}\left|\hat{\mu}_n(A) - \mu(A)\right|}{a_n\cdot h_n}.$$
From Hoeffding's inequality, we have that, for all $\epsilon > 0$,
$$\mathbb{P}_{\mu^n}\left(\sup_{A \in \pi_n}\left|\hat{\mu}_n(A) - \mu(A)\right| > \epsilon\right) \leq 2\cdot|\pi_n|\cdot\exp\left(-2n\epsilon^2\right).$$
By condition ii), given that $(1/(a_n h_n))$ is $o(n^{\tau})$ for some $\tau \in (0,1/2)$, there exists $\tau_o \in (0,1)$ such that
$$\lim_{n\to\infty}\frac{1}{n^{\tau_o}}\ln\mathbb{P}_{\mu^n}\left(\sup_{x \in A_{\tilde{\mu}_n}}\left|\frac{d\tilde{\mu}_n}{d\mu_n^*}(x) - 1\right| > \epsilon\right) \leq \lim_{n\to\infty}\left[\frac{\ln(2|\pi_n|)}{n^{\tau_o}} - 2\cdot\left(n^{\frac{1-\tau_o}{2}}\, a_n h_n\, \epsilon\right)^2\right] = -\infty.$$
This implies that $\mathbb{P}_{\mu^n}\left(\sup_{x \in A_{\tilde{\mu}_n}}\left|\frac{d\tilde{\mu}_n}{d\mu_n^*}(x) - 1\right| > \epsilon\right)$ is eventually dominated by a constant times the sequence $(e^{-n^{\tau_o}})_{n\geq 1}$, which, from the Borel-Cantelli Lemma [43], implies that
$$\lim_{n\to\infty}\sup_{x \in A_{\tilde{\mu}_n}}\left|\frac{d\tilde{\mu}_n}{d\mu_n^*}(x) - 1\right| = 0, \quad \mathbb{P}_\mu\text{-a.s.}$$

Appendix C. Proposition A3

Proposition A3.
$$D(\mu_{\epsilon_n}||\hat{\mu}_{n,\epsilon_n}) \leq \frac{2\log\frac{e}{\epsilon_n}}{\mu(\Gamma_{\epsilon_n})}\cdot V\big(\mu/\sigma_{\epsilon_n},\, \hat{\mu}_n/\sigma_{\epsilon_n}\big).$$
Proof. 
By definition,
$$D(\mu_{\epsilon_n}||\hat{\mu}_{n,\epsilon_n}) = \frac{1}{\mu(\Gamma_{\epsilon_n})}\sum_{x \in \Gamma_{\epsilon_n}} f_\mu(x)\log\frac{f_\mu(x)}{f_{\hat{\mu}_n}(x)} + \log\frac{\hat{\mu}_n(\Gamma_{\epsilon_n})}{\mu(\Gamma_{\epsilon_n})}.$$
For the second term on the RHS of (A7):
$$\log\frac{\hat{\mu}_n(\Gamma_{\epsilon_n})}{\mu(\Gamma_{\epsilon_n})} \leq \frac{\log(e)}{\mu(\Gamma_{\epsilon_n})}\left|\hat{\mu}_n(\Gamma_{\epsilon_n}) - \mu(\Gamma_{\epsilon_n})\right|.$$
For the first term on the RHS of (A7):
$$\begin{aligned}
\sum_{x \in \Gamma_{\epsilon_n}} f_\mu(x)\log\frac{f_\mu(x)}{f_{\hat{\mu}_n}(x)} &= \sum_{\substack{x \in \Gamma_{\epsilon_n}:\\ f_\mu(x) \leq f_{\hat{\mu}_n}(x)}} f_\mu(x)\log\frac{f_\mu(x)}{f_{\hat{\mu}_n}(x)} + \sum_{\substack{x \in \Gamma_{\epsilon_n}:\\ f_\mu(x) > f_{\hat{\mu}_n}(x) \geq \epsilon_n}} f_\mu(x)\log\frac{f_\mu(x)}{f_{\hat{\mu}_n}(x)}\\
&\leq \sum_{\substack{x \in \Gamma_{\epsilon_n}:\\ f_\mu(x) \leq f_{\hat{\mu}_n}(x)}} f_\mu(x)\log\frac{f_{\hat{\mu}_n}(x)}{f_\mu(x)} + \sum_{\substack{x \in \Gamma_{\epsilon_n}:\\ f_\mu(x) > f_{\hat{\mu}_n}(x) \geq \epsilon_n}} f_{\hat{\mu}_n}(x)\log\frac{f_\mu(x)}{f_{\hat{\mu}_n}(x)} + \sum_{\substack{x \in \Gamma_{\epsilon_n}:\\ f_\mu(x) > f_{\hat{\mu}_n}(x) \geq \epsilon_n}} \big(f_\mu(x) - f_{\hat{\mu}_n}(x)\big)\cdot\log\frac{f_\mu(x)}{f_{\hat{\mu}_n}(x)}\\
&\leq \log e\left[\sum_{\substack{x \in \Gamma_{\epsilon_n}:\\ f_\mu(x) \leq f_{\hat{\mu}_n}(x)}} \big(f_{\hat{\mu}_n}(x) - f_\mu(x)\big) + \sum_{\substack{x \in \Gamma_{\epsilon_n}:\\ f_\mu(x) > f_{\hat{\mu}_n}(x)}} \big(f_\mu(x) - f_{\hat{\mu}_n}(x)\big)\right] + \log\frac{1}{\epsilon_n}\cdot\sum_{\substack{x \in \Gamma_{\epsilon_n}:\\ f_\mu(x) > f_{\hat{\mu}_n}(x)}} \big(f_\mu(x) - f_{\hat{\mu}_n}(x)\big)\\
&\leq \left(\log e + \log\frac{1}{\epsilon_n}\right)\cdot\sum_{x \in \Gamma_{\epsilon_n}}\left|f_\mu(x) - f_{\hat{\mu}_n}(x)\right|.
\end{aligned}$$
The first inequality in (A9) follows from the triangle inequality, and the second in (A10) from the fact that $\ln x \leq x - 1$ for $x > 0$. Finally, from the definition of the total variational distance over $\sigma_{\epsilon_n}$ in (59), we have that
$$2\cdot V\big(\mu/\sigma_{\epsilon_n},\, \hat{\mu}_n/\sigma_{\epsilon_n}\big) = \sum_{x \in \Gamma_{\epsilon_n}}\left|f_\mu(x) - f_{\hat{\mu}_n}(x)\right| + \left|\hat{\mu}_n(\Gamma_{\epsilon_n}) - \mu(\Gamma_{\epsilon_n})\right|,$$
which concludes the argument from (A7)–(A9). ☐
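A small numeric check of this bound (illustrative only; it assumes $\Gamma_\epsilon = \{x : \hat{\mu}_n(x) \geq \epsilon\}$ as the reading of (26), with $\mu_\epsilon$ and $\hat{\mu}_{n,\epsilon}$ the corresponding conditional distributions, and uses random finite-alphabet instances in place of the actual estimation setting):

```python
import numpy as np

rng = np.random.default_rng(1)

def check(eps, k=50, n=200):
    """Compare D(mu_eps || mu_hat_eps) with 2*log2(e/eps)/mu(Gamma_eps) * V(restricted)."""
    mu = rng.dirichlet(np.ones(k))                  # random target distribution
    mu_hat = rng.multinomial(n, mu) / n             # empirical distribution
    gamma = mu_hat >= eps                           # Gamma_eps (data-driven set)
    mu_c = mu[gamma] / mu[gamma].sum()              # mu_eps
    hat_c = mu_hat[gamma] / mu_hat[gamma].sum()     # mu_hat_{n,eps}
    D = (mu_c * np.log2(mu_c / hat_c)).sum()        # KL divergence in bits
    # total variation over sigma(Gamma_eps): atoms of Gamma_eps plus its complement
    V = 0.5 * (np.abs(mu[gamma] - mu_hat[gamma]).sum()
               + abs(mu[~gamma].sum() - mu_hat[~gamma].sum()))
    bound = 2 * np.log2(np.e / eps) / mu[gamma].sum() * V
    print(f"eps={eps:.2f}: D={D:.4f} <= bound={bound:.4f}  ({bool(D <= bound)})")

for eps in [0.01, 0.02, 0.05]:
    check(eps)
```

All three instances should report True.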

Appendix D. Proposition A4

Proposition A4.
Considering that $(k_n) \to \infty$, there exist $K > 0$ and $N > 0$ such that for all $n \geq N$,
$$V\big(\tilde{\mu}_{k_n}, \hat{\mu}_{k_n,n}^*\big) \leq K\cdot V\big(\mu/\sigma_{k_n},\, \hat{\mu}_n/\sigma_{k_n}\big).$$
Proof. 
$$\begin{aligned}
V\big(\tilde{\mu}_{k_n}, \hat{\mu}_{k_n,n}^*\big) &= \frac{1}{2}\sum_{x \in A_\mu\cap\Gamma_{k_n}}\left|\frac{\mu(\{x\})}{\mu(\Gamma_{k_n})} - \frac{\hat{\mu}_n(\{x\})}{\hat{\mu}_n(\Gamma_{k_n})}\right|\\
&\leq \frac{1}{2\mu(\Gamma_{k_n})}\left[\sum_{x \in A_\mu\cap\Gamma_{k_n}}\left|\hat{\mu}_n(\{x\}) - \mu(\{x\})\right| + \sum_{x \in A_\mu\cap\Gamma_{k_n}}\hat{\mu}_n(\{x\})\left|\frac{\mu(\Gamma_{k_n})}{\hat{\mu}_n(\Gamma_{k_n})} - 1\right|\right]\\
&\leq \frac{1}{2\mu(\Gamma_{k_n})}\left[2\cdot V\big(\mu/\sigma_{k_n},\, \hat{\mu}_n/\sigma_{k_n}\big) + \left|\mu(\Gamma_{k_n}) - \hat{\mu}_n(\Gamma_{k_n})\right|\right]\\
&\leq \frac{3\cdot V\big(\mu/\sigma_{k_n},\, \hat{\mu}_n/\sigma_{k_n}\big)}{2\,\mu(\Gamma_{k_n})}.
\end{aligned}$$
By the hypothesis, $\mu(\Gamma_{k_n}) \to 1$, which concludes the proof. ☐

Appendix E. Proposition A5

Proposition A5.
If $\epsilon_n$ is $O(n^{-\tau})$ with $\tau \in (0,1/2)$, then
$$\limsup_{n\to\infty}\big(b_{\epsilon_n}(X_1,\ldots,X_n) - a_{2\epsilon_n}\big) \leq 0, \quad \mathbb{P}_\mu\text{-a.s.}$$
Proof. 
Let us define the set
$$B_n = \left\{(x_1,\ldots,x_n) \in \mathbb{X}^n : \tilde{\Gamma}_{2\epsilon_n} \subseteq \Gamma_{\epsilon_n}\right\}.$$
By definition, every sequence $(x_1,\ldots,x_n) \in B_n$ is such that $b_{\epsilon_n}(x_1,\ldots,x_n) \leq a_{2\epsilon_n}$ and, consequently, we just need to prove that $\mathbb{P}_\mu(\liminf_{n\to\infty} B_n) = \mathbb{P}_\mu\big(\bigcup_{n\geq 1}\bigcap_{k\geq n} B_k\big) = 1$ [42]. Furthermore, if $\sup_{x \in \tilde{\Gamma}_{2\epsilon_n}}|\hat{\mu}_n(x) - \mu(x)| \leq \epsilon_n$, then, by definition of $\tilde{\Gamma}_{2\epsilon_n}$ in (65), we have that $\hat{\mu}_n(x) \geq \epsilon_n$ for all $x \in \tilde{\Gamma}_{2\epsilon_n}$ (i.e., $\tilde{\Gamma}_{2\epsilon_n} \subseteq \Gamma_{\epsilon_n}$). From this,
$$\mathbb{P}_{\mu^n}(B_n^c) \leq \mathbb{P}_{\mu^n}\left(\sup_{x \in \tilde{\Gamma}_{2\epsilon_n}}\left|\hat{\mu}_n(x) - \mu(x)\right| > \epsilon_n\right) \leq |\tilde{\Gamma}_{2\epsilon_n}|\cdot e^{-2n\epsilon_n^2} \leq \frac{1}{2\epsilon_n}\cdot e^{-2n\epsilon_n^2},$$
from Hoeffding's inequality [28,52], the union bound, and the fact that by construction $|\tilde{\Gamma}_{2\epsilon_n}| \leq \frac{1}{2\epsilon_n}$. If we consider $\epsilon_n = O(n^{-\tau})$ and $l > 0$, we have that:
$$\frac{1}{n^l}\cdot\ln\mathbb{P}_{\mu^n}(B_n^c) \leq \frac{1}{n^l}\ln\!\left(\frac{n^{\tau}}{2}\right) - 2\, n^{1-2\tau-l}.$$
From (A16), for any $\tau \in (0,1/2)$ there is $l \in (0, 1-2\tau]$ such that $\mathbb{P}_{\mu^n}(B_n^c)$ is bounded by a term $O(e^{-n^l})$. This implies that $\sum_{n\geq 1}\mathbb{P}_{\mu^n}(B_n^c) < \infty$, which suffices to show that $\mathbb{P}_\mu\big(\bigcup_{n\geq 1}\bigcap_{k\geq n} B_k\big) = 1$. ☐

Appendix F. Auxiliary Results for Theorem 5

Let us first consider the series
$$S_{x_o} = \sum_{x \geq x_o} x^{-p} = x_o^{-p}\cdot\left[1 + \left(\frac{x_o}{x_o+1}\right)^{p} + \left(\frac{x_o}{x_o+2}\right)^{p} + \cdots\right] = x_o^{-p}\cdot\left[\tilde{S}_{x_o,0} + \tilde{S}_{x_o,1} + \cdots + \tilde{S}_{x_o,x_o-1}\right],$$
where $\tilde{S}_{x_o,j} \equiv \sum_{k \geq 1}\left(\frac{k\cdot x_o + j}{x_o}\right)^{-p}$ for all $j \in \{0,\ldots,x_o-1\}$. It is simple to verify that, for all $j \in \{0,\ldots,x_o-1\}$, $\tilde{S}_{x_o,j} \leq \tilde{S}_{x_o,0} = \sum_{k\geq 1} k^{-p} < \infty$, given that by hypothesis $p > 1$. Consequently, $S_{x_o} \leq x_o^{1-p}\cdot\sum_{k\geq 1} k^{-p}$.
Similarly, for the second series we have that:
$$R_{x_o} = \sum_{x \geq x_o} x^{-p}\log x = x_o^{-p}\cdot\left[\log(x_o) + \left(\frac{x_o}{x_o+1}\right)^{p}\log(x_o+1) + \left(\frac{x_o}{x_o+2}\right)^{p}\log(x_o+2) + \cdots\right] = x_o^{-p}\cdot\left[\tilde{R}_{x_o,0} + \tilde{R}_{x_o,1} + \cdots + \tilde{R}_{x_o,x_o-1}\right],$$
where $\tilde{R}_{x_o,j} \equiv \sum_{k \geq 1}\left(\frac{k\cdot x_o + j}{x_o}\right)^{-p}\cdot\log(k\, x_o + j)$ for all $j \in \{0,\ldots,x_o-1\}$. Note again that $\tilde{R}_{x_o,j} \leq \tilde{R}_{x_o,0} < \infty$ for all $j \in \{0,\ldots,x_o-1\}$ and, consequently, $R_{x_o} \leq x_o^{1-p}\cdot\sum_{k\geq 1} k^{-p}\log k$ from (A18).
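A quick numeric check of the bound for $S_{x_o}$ (with illustrative values of $p$ and $x_o$, and a large truncation point standing in for the infinite sums):

```python
import numpy as np

# Check S_{x_o} = sum_{x >= x_o} x^{-p}  against  x_o^{1-p} * sum_{k >= 1} k^{-p}
# (the constant C_0 used in the proof of Theorem 5). Sums are truncated at N as
# a proxy for the infinite series; p, x_o, and N are illustrative choices.
p, N = 1.5, 10**6
k = np.arange(1, N + 1, dtype=float)
zeta_p = (k ** -p).sum()                    # ~ sum_{k>=1} k^{-p}

for x_o in [10, 100, 1000]:
    S = (np.arange(x_o, N + 1, dtype=float) ** -p).sum()
    bound = x_o ** (1 - p) * zeta_p
    print(f"x_o={x_o:>5}: S_xo={S:.5f} <= x_o^(1-p)*C_0={bound:.5f}  ({S <= bound})")
```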

Appendix G. Proposition A6

Proposition A6.
If $\epsilon_n$ is $O(n^{-\tau})$ with $\tau \in (0,1/2)$, then
$$\liminf_{n\to\infty}\big(\xi_n(X_1,\ldots,X_n) - \rho_n(X_1,\ldots,X_n)\big) \geq 0, \quad \mathbb{P}_\mu\text{-a.s.}$$
Proof. 
By definition, if $\sigma(\Gamma_{\epsilon_n}) \subseteq \sigma(\tilde{\Gamma}_{\epsilon_n/2})$ then $\xi_n(X_1,\ldots,X_n) \geq \rho_n(X_1,\ldots,X_n)$. Consequently, if we define the set:
$$B_n = \left\{(x_1,\ldots,x_n) : \sigma(\Gamma_{\epsilon_n}) \subseteq \sigma(\tilde{\Gamma}_{\epsilon_n/2})\right\},$$
then the proof reduces to verifying that $\mathbb{P}_\mu(\liminf_{n\to\infty} B_n) = \mathbb{P}_\mu\big(\bigcup_{n\geq 1}\bigcap_{k\geq n} B_k\big) = 1$.
On the other hand, if $\sup_{x \in \Gamma_{\epsilon_n}}|\hat{\mu}_n(x) - \mu(x)| \leq \epsilon_n/2$, then, by definition of $\Gamma_{\epsilon}$, we have $\mu(x) \geq \epsilon_n/2$ for all $x \in \Gamma_{\epsilon_n}$, i.e., $\Gamma_{\epsilon_n} \subseteq \tilde{\Gamma}_{\epsilon_n/2}$. In other words,
$$C_n = \left\{(x_1,\ldots,x_n) : \sup_{x \in \Gamma_{\epsilon_n}}\left|\hat{\mu}_n(x) - \mu(x)\right| \leq \epsilon_n/2\right\} \subseteq B_n.$$
Finally,
$$\mathbb{P}_{\mu^n}(C_n^c) = \mathbb{P}_{\mu^n}\left(\sup_{x \in \Gamma_{\epsilon_n}}\left|\hat{\mu}_n(x) - \mu(x)\right| > \epsilon_n/2\right) \leq |\Gamma_{\epsilon_n}|\cdot e^{-n\epsilon_n^2/2} \leq \frac{1}{\epsilon_n}\cdot e^{-n\epsilon_n^2/2}.$$
In this context, if we consider $\epsilon_n = O(n^{-\tau})$ and $l > 0$, then we have that:
$$\frac{1}{n^l}\cdot\ln\mathbb{P}_{\mu^n}(C_n^c) \leq \frac{\tau\cdot\ln n}{n^l} - \frac{n^{1-2\tau-l}}{2}.$$
Therefore, for any $\tau \in (0,1/2)$ we can take $l \in (0, 1-2\tau]$ such that $\mathbb{P}_{\mu^n}(C_n^c)$ is bounded by a term $O(e^{-n^l})$. Then, the Borel-Cantelli Lemma tells us that $\mathbb{P}_\mu\big(\bigcup_{n\geq 1}\bigcap_{k\geq n} C_k\big) = 1$, which concludes the proof from (A20). ☐

Appendix H. Proposition A7

Proposition A7.
For the $p$-power tail dominating distribution stated in Theorem 5, if $(\epsilon_n)$ is $O(n^{-\tau})$ with $\tau \in (0,p)$, then $\xi_n(X_1,\ldots,X_n)$ is $o(n^{-q})$ for all $q \in \big(0, (1-\tau/p)/2\big)$, $\mathbb{P}_\mu$-a.s.
Proof. 
From Hoeffding's inequality, we have that
$$\mathbb{P}_{\mu^n}\big(\big\{x_1,\ldots,x_n : \xi_n(x_1,\ldots,x_n) > \delta\big\}\big) \leq |\sigma(\tilde{\Gamma}_{\epsilon_n/2})|\cdot e^{-\frac{2n\delta^2}{(\log(1/\epsilon_n))^2}} \leq 2^{(2k_0/\epsilon_n)^{1/p}+1}\cdot e^{-\frac{2n\delta^2}{(\log(1/\epsilon_n))^2}},$$
the second inequality using that $|\tilde{\Gamma}_{\epsilon}| \leq (k_0/\epsilon)^{1/p}+1$, from the definition of $\tilde{\Gamma}_{\epsilon}$ in (65) and the tail bounded assumption on $\mu$. If we consider $\epsilon_n = O(n^{-\tau})$ and $l > 0$, then we have that:
$$\frac{1}{n^l}\cdot\ln\mathbb{P}_{\mu^n}\big(\big\{x_1,\ldots,x_n : \xi_n(x_1,\ldots,x_n) > \delta\big\}\big) \leq \ln 2\cdot\big(C\, n^{\tau/p - l} + n^{-l}\big) - \frac{2\delta^2}{\tau^2}\cdot\frac{n^{1-l}}{(\log n)^2}$$
for some constant $C > 0$. Then, in order to obtain that $\xi_n(X_1,\ldots,X_n)$ converges almost surely to zero from (A24), it is sufficient that $l > 0$, $l < 1$, and $l > \tau/p$. This implies that if $\tau < p$, there is $l \in (\tau/p, 1)$ such that $\mathbb{P}_{\mu^n}\big(\xi_n(x_1,\ldots,x_n) > \delta\big)$ is bounded by a term $O(e^{-n^l})$ and, consequently, $\lim_{n\to\infty}\xi_n(X_1,\ldots,X_n) = 0$, $\mathbb{P}_\mu$-a.s. (by the same steps used in Appendix G).
Moving to the rate of convergence of $\xi_n(X_1,\ldots,X_n)$ (assuming that $\tau < p$), let us consider $\delta_n = n^{-q}$ for some $q \geq 0$. From (A24):
$$\frac{1}{n^l}\cdot\ln\mathbb{P}_{\mu^n}\big(\big\{x_1,\ldots,x_n : \xi_n(x_1,\ldots,x_n) > \delta_n\big\}\big) \leq \ln 2\cdot\big(C\, n^{\tau/p - l} + n^{-l}\big) - \frac{2}{\tau^2}\cdot\frac{n^{1-2q-l}}{(\log n)^2}.$$
To make $\xi_n(X_1,\ldots,X_n)$ be $o(n^{-q})$ $\mathbb{P}_\mu$-a.s., a sufficient condition is that $l > 0$, $l > \tau/p$, and $l < 1-2q$. Therefore (considering that $\tau < p$), the admissibility condition for the existence of an exponential rate of convergence $O(e^{-n^l})$, with $l > 0$, for the deviation event $\{x_1,\ldots,x_n : \xi_n(x_1,\ldots,x_n) > \delta_n\}$ is that $\tau/p < 1-2q$, which is equivalent to $0 < q < \frac{1-\tau/p}{2}$. ☐

References

  1. Beirlant, J.; Dudewicz, E.; Györfi, L.; van der Meulen, E.C. Nonparametric entropy estimation: An Overview. Int. Math. Stat. Sci. 1997, 6, 17–39. [Google Scholar]
  2. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley: New York, NY, USA, 2006. [Google Scholar]
  3. Kullback, S. Information Theory and Statistics; Wiley: New York, NY, USA, 1959. [Google Scholar]
  4. Principe, J. Information Theoretic Learning: Renyi Entropy and Kernel Perspective; Springer: New York, NY, USA, 2010. [Google Scholar]
  5. Fisher, J.W., III; Wainwright, M.; Sudderth, E.; Willsky, A.S. Statistical and information-theoretic methods for self-organization and fusion of multimodal, networked sensors. Int. J. High Perform. Comput. Appl. 2002, 16, 337–353. [Google Scholar] [CrossRef]
  6. Liu, J.; Moulin, P. Information-theoretic analysis of interscale and intrascale dependencies between image wavelet coefficients. IEEE Trans. Image Process. 2001, 10, 1647–1658. [Google Scholar] [PubMed]
  7. Thévenaz, P.; Unser, M. Optimization of mutual information for multiresolution image registration. IEEE Trans. Image Process. 2000, 9, 2083–2099. [Google Scholar] [PubMed]
  8. Butz, T.; Thiran, J.P. From error probability to information theoretic (multi-modal) signal processing. Elsevier Signal Process. 2005, 85, 875–902. [Google Scholar] [CrossRef]
  9. Kim, J.; Fisher, J.W., III; Yezzi, A.; Cetin, M.; Willsky, A.S. A nonparametric statistical method for image segmentation using information theory and curve evolution. IEEE Trans. Image Process. 2005, 14, 1486–1502. [Google Scholar] [PubMed]
  10. Padmanabhan, M.; Dharanipragada, S. Maximizing information content in feature extraction. IEEE Trans. Speech Audio Process. 2005, 13, 512–519. [Google Scholar] [CrossRef]
  11. Silva, J.; Narayanan, S. Minimum probability of error signal representation. Presented at IEEE Workshop Machine Learning for Signal Processing, Thessaloniki, Greece, 27–29 August 2007; pp. 348–353. [Google Scholar]
  12. Silva, J.; Narayanan, S. Discriminative wavelet packet filter bank selection for pattern recognition. IEEE Trans. Signal Process. 2009, 57, 1796–1810. [Google Scholar] [CrossRef]
  13. Gokcay, E.; Principe, J.C. Information theoretic clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 158–171. [Google Scholar] [CrossRef]
  14. Arellano-Valle, R.B.; Contreras-Reyes, J.E.; Stehlik, M. Generalized skew-normal negentropy and its application to fish condition factor time series. Entropy 2017, 19, 528. [Google Scholar] [CrossRef]
  15. Lake, D.E. Nonparametric entropy estimation using kernel densities. Methods Enzymol. 2009, 467, 531–546. [Google Scholar] [PubMed]
  16. Van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: Cambridge, UK, 2000; Volume 3. [Google Scholar]
  17. Wu, Y.; Yang, P. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Trans. Inf. Theory 2016, 62, 3702–3720. [Google Scholar] [CrossRef]
  18. Jiao, J.; Venkat, K.; Han, Y.; Weissman, T. Minimax estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory 2015, 61, 2835–2885. [Google Scholar] [CrossRef] [PubMed]
  19. Paninski, L. Estimating entropy on m bins given fewer than m samples. IEEE Trans. Inf. Theory 2004, 50, 2200–2203. [Google Scholar] [CrossRef]
  20. Valiant, G.; Valiant, P. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, San Jose, CA, USA, 6–8 June 2011; pp. 685–694. [Google Scholar]
  21. Valiant, G.; Valiant, P. A CLT and Tight Lower Bounds for Estimating Entropy; Technical Report TR 10-179; Electronic Colloquium on Computational Complexity: Potsdam, Germany, 2011; Volume 17, p. 9. [Google Scholar]
  22. Braess, D.; Forster, J.; Sauer, T.; Simon, H.U. How to achieve minimax expected Kullback-Leibler distance from an unknown finite distribution. In Proceedings of the International Conference on Algorithmic Learning Theory, Lübeck, Germany, 24–26 November 2004; Springer: Berlin/Heidelberg, Germany, 2002; pp. 380–394. [Google Scholar]
  23. Csiszár, I.; Shields, P.C. Information theory and statistics: A tutorial. In Foundations and Trends® in Communications and Information Theory; Now Publishers Inc.: Breda, The Netherlands, 2004; pp. 417–528. [Google Scholar]
  24. Ho, S.W.; Yeung, R.W. On the discontinuity of the Shannon information measures. IEEE Trans. Inf. Theory 2009, 55, 5362–5374. [Google Scholar]
  25. Silva, J.; Parada, P. Shannon entropy convergence results in the countable infinite case. In Proceedings of the International Symposium on Information Theory, Cambridge, MA, USA, 1–6 July 2012; pp. 155–159. [Google Scholar]
  26. Ho, S.W.; Yeung, R.W. The interplay between entropy and variational distance. IEEE Trans. Inf. Theory 2010, 56, 5906–5929. [Google Scholar] [CrossRef]
  27. Harremoës, P. Information topologies with applications. In Entropy, Search, Complexity; Csiszár, I., Katona, G.O.H., Tardos, G., Eds.; Springer: New York, NY, USA, 2007; Volume 16, pp. 113–150. [Google Scholar]
  28. Devroye, L.; Lugosi, G. Combinatorial Methods in Density Estimation; Springer: New York, NY, USA, 2001. [Google Scholar]
  29. Barron, A.; Györfi, L.; van der Meulen, E.C. Distribution estimation consistent in total variation and in two types of information divergence. IEEE Trans. Inf. Theory 1992, 38, 1437–1454. [Google Scholar] [CrossRef]
  30. Antos, A.; Kontoyiannis, I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algorithms 2001, 19, 163–193. [Google Scholar] [CrossRef]
  31. Piera, F.; Parada, P. On convergence properties of Shannon entropy. Probl. Inf. Transm. 2009, 45, 75–94. [Google Scholar] [CrossRef]
  32. Berlinet, A.; Vajda, I.; van der Meulen, E.C. About the asymptotic accuracy of Barron density estimates. IEEE Trans. Inf. Theory 1998, 44, 999–1009. [Google Scholar] [CrossRef]
  33. Vajda, I.; van der Meulen, E.C. Optimization of Barron density estimates. IEEE Trans. Inf. Theory 2001, 47, 1867–1883. [Google Scholar] [CrossRef]
  34. Lugosi, G.; Nobel, A.B. Consistency of data-driven histogram methods for density estimation and classification. Ann. Stat. 1996, 24, 687–706. [Google Scholar]
  35. Silva, J.; Narayanan, S. Information divergence estimation based on data-dependent partitions. J. Stat. Plan. Inference 2010, 140, 3180–3198. [Google Scholar] [CrossRef]
  36. Silva, J.; Narayanan, S.N. Nonproduct data-dependent partitions for mutual information estimation: Strong consistency and applications. IEEE Trans. Signal Process. 2010, 58, 3497–3511. [Google Scholar] [CrossRef]
  37. Kullback, S.; Leibler, R. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  38. Gray, R.M. Entropy and Information Theory; Springer: New York, NY, USA, 1990. [Google Scholar]
  39. Kullback, S. A lower bound for discrimination information in terms of variation. IEEE Trans. Inf. Theory 1967, 13, 126–127. [Google Scholar] [CrossRef]
  40. Csiszár, I. Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 1967, 2, 299–318. [Google Scholar]
  41. Kemperman, J. On the optimum rate of transmitting information. Ann. Math. Stat. 1969, 40, 2156–2177. [Google Scholar] [CrossRef]
  42. Breiman, L. Probability; Addison-Wesley: Boston, MA, USA, 1968. [Google Scholar]
  43. Varadhan, S. Probability Theory; American Mathematical Society: Providence, RI, USA, 2001. [Google Scholar]
  44. Györfi, L.; Páli, I.; van der Meulen, E.C. There is no universal source code for an infinite source alphabet. IEEE Trans. Inf. Theory 1994, 40, 267–271. [Google Scholar] [CrossRef]
  45. Rissanen, J. Information and Complexity in Statistical Modeling; Springer: New York, NY, USA, 2007. [Google Scholar]
  46. Boucheron, S.; Garivier, A.; Gassiat, E. Coding on countably infinite alphabets. IEEE Trans. Inf. Theory 2009, 55, 358–373. [Google Scholar] [CrossRef]
  47. Silva, J.F.; Piantanida, P. The redundancy gains of almost lossless universal source coding over envelope families. In Proceedings of the IEEE International Symposium on Information Theory, Aachen, Germany, 25–30 June 2017; pp. 2003–2007. [Google Scholar]
  48. Silva, J.F.; Piantanida, P. Almost Lossless Variable-Length Source Coding on Countably Infinite Alphabets. In Proceedings of the IEEE International Symposium on Information Theory, Barcelona, Spain, 10–15 July 2016; pp. 1–5. [Google Scholar]
  49. Nobel, A.B. Histogram regression estimation using data-dependent partitions. Ann. Stat. 1996, 24, 1084–1105. [Google Scholar] [CrossRef]
  50. Silva, J.; Narayanan, S. Complexity-regularized tree-structured partition for mutual information estimation. IEEE Trans. Inf. Theory 2012, 58, 940–1952. [Google Scholar] [CrossRef]
  51. Darbellay, G.A.; Vajda, I. Estimation of the information by an adaptive partition of the observation space. IEEE Trans. Inf. Theory 1999, 45, 1315–1321. [Google Scholar] [CrossRef]
  52. Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer: New York, NY, USA, 1996. [Google Scholar]
  53. Tsybakov, A.B. Introduction to Nonparametric Estimation; Springer: New York, NY, USA, 2009. [Google Scholar]
