Article

Distribution-Dependent Weighted Union Bound †

1 Department of Computer Science, Bioengineering, Robotics and Systems Engineering, University of Genoa, Via Opera Pia 11a, 16145 Genova, Italy
2 Department of Biophysical and Electronic Engineering, University of Genoa, Via Opera Pia 11a, 16145 Genova, Italy
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Brugge, Belgium, 2–4 October 2020.
Submission received: 11 November 2020 / Revised: 9 January 2021 / Accepted: 10 January 2021 / Published: 12 January 2021

Abstract:
In this paper, we deal with the classical Statistical Learning Theory problem of bounding, with high probability, the true risk $R(h)$ of a hypothesis $h$ chosen from a set $\mathcal{H}$ of $m$ hypotheses. The Union Bound (UB) allows one to state that $\mathbb{P}\{L(\hat{R}(h), \delta q_h) \leq R(h) \leq U(\hat{R}(h), \delta p_h)\} \geq 1 - \delta$, where $\hat{R}(h)$ is the empirical error, if it is possible to prove that $\mathbb{P}\{R(h) \geq L(\hat{R}(h), \delta)\} \geq 1 - \delta$ and $\mathbb{P}\{R(h) \leq U(\hat{R}(h), \delta)\} \geq 1 - \delta$, when $h$, $q_h$, and $p_h$ are chosen before seeing the data such that $q_h, p_h \in [0, 1]$ and $\sum_{h \in \mathcal{H}}(q_h + p_h) = 1$. If no a priori information is available, $q_h$ and $p_h$ are set to $\frac{1}{2m}$, namely equally distributed. This approach gives poor results since, as a matter of fact, a learning procedure targets just particular hypotheses, namely the ones with small empirical error, disregarding the others. In this work we set $q_h$ and $p_h$ in a distribution-dependent way, increasing the probability of being chosen for functions with small true risk. We will call this proposal Distribution-Dependent Weighted UB (DDWUB) and we will retrieve the sufficient conditions on the choice of $q_h$ and $p_h$ that state that DDWUB outperforms or, in the worst case, degenerates into the UB. Furthermore, theoretical and numerical results will show the applicability, the validity, and the potentiality of DDWUB.

1. Introduction

Statistical learning theory [1,2,3,4] deals with the problem of understanding and estimating the performance of a statistical learning procedure. The goal is to better understand the factors that influence its behavior and to suggest ways to improve it. Although asymptotic analysis is a crucial first step in this direction, finite sample error bounds are of more value as they allow the design of model selection procedures [5,6,7]. These error bounds typically have the following form: with high probability, the generalization error of the selected hypothesis, chosen in a space of possible ones, is bounded by an empirical estimate of the generalization error plus a penalty term which depends on the size of the hypothesis space and the number of samples available. The latter term basically accounts for the fact that the learning procedure selects a hypothesis in a set of possible ones based on the available data. Every data-dependent choice implies a risk, and the penalty term is exactly the measure of this risk. When the hypothesis space is composed of an arbitrary finite number of hypotheses, and no additional information is provided, the evaluation of the total risk is usually made with the Union Bound (UB) [2,7,8]. The UB is a ubiquitous building block in statistical learning theory and is exploited in many contexts and in many different ways to derive the final result: in the Vapnik–Chervonenkis theory [2], in the Rademacher Complexity theory [9,10], in the Algorithmic Stability theory [11], in the Compression Bound [12], in the PAC-Bayes theory [13], and more recently in the Differential Privacy theory [14].
Let us consider the classical binary classification framework (the extension to general supervised learning characterized by bounded loss functions will be discussed later during the presentation). Let $\mathcal{X}$ be the input space and $\mathcal{Y} = \{-1, +1\}$ be the set of binary output labels. Let $\mathcal{D}_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$, where $X_i \in \mathcal{X}$ and $Y_i \in \mathcal{Y}$, $i \in \{1, \dots, n\}$, be a sequence of $n \in \mathbb{N}^*$ samples drawn independently from an unknown probability distribution $\mu$ over $\mathcal{X} \times \mathcal{Y}$. Let us consider a hypothesis $h: \mathcal{X} \to \mathcal{Y}$ chosen from a finite set $\mathcal{H}$ of possible hypotheses of cardinality $m \in \mathbb{N}^*$ such that $\mathcal{H} = \{h_i : i \in I\}$ where $I = \{1, \dots, m\}$. The error of $h$ in approximating $\mathbb{P}\{Y|X\}$ is measured by a prescribed loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. Since we are dealing with binary classification problems, the most natural choice is the loss function which counts the number of errors $\ell(h(X), Y) = \mathbb{1}\{Y \neq h(X)\} \in \{0, 1\}$. The generalization error of $h$ is defined as
$R(h) = \mathbb{E}\{\ell(h(X), Y)\} \in [0, 1].$
Since the probability measure μ is usually unknown, the generalization error cannot be computed; however, we can compute the empirical error
$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(X_i), Y_i) \in [0, 1].$
If the choice of $h \in \mathcal{H}$ does not depend on $\mathcal{D}_n$, namely if we want to bound the generalization error of a single hypothesis in the hypothesis space chosen before seeing the data, it is possible to prove that (please note that, for simplicity, we will refer to $R(h)$ and $\hat{R}(h)$ as $R$ and $\hat{R}$ respectively, when it is clear from the context)
$\mathbb{P}\{R \geq L(\hat{R}, \delta)\} \geq 1 - \delta, \qquad \mathbb{P}\{R \leq U(\hat{R}, \delta)\} \geq 1 - \delta,$
where $\delta \in (0, 1)$, while $L$ and $U$ are, respectively, lower and upper bounds on the generalization error (see, for example, [15,16,17]).
Since the generalization error cannot be smaller than zero or larger than one, we consequently have that $L(\hat{r}, \delta) \in [0, 1)$ and $U(\hat{r}, \delta) \in (0, 1]$ $\forall \hat{r} \in [0, 1]$ and $\delta \in (0, 1)$ [15]. When, instead of [15], results like [16] or similar ones are exploited, it is necessary to truncate them.
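For concreteness, such truncated bounds can be sketched as follows (a minimal illustration, assuming Hoeffding-style bounds of the form $\hat{r} \pm \sqrt{\log(1/\delta)/(2n)}$ as an instance of [16]; the function names are ours):

```python
import math

def hoeffding_lower(r_hat, delta, n):
    """Truncated lower bound L(r_hat, delta): R >= L with probability >= 1 - delta."""
    return max(0.0, r_hat - math.sqrt(math.log(1.0 / delta) / (2.0 * n)))

def hoeffding_upper(r_hat, delta, n):
    """Truncated upper bound U(r_hat, delta): R <= U with probability >= 1 - delta."""
    return min(1.0, r_hat + math.sqrt(math.log(1.0 / delta) / (2.0 * n)))
```

The truncation implements exactly the clipping to $[0, 1)$ and $(0, 1]$ discussed above.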
In general, the choice of $h \in \mathcal{H}$ does depend on $\mathcal{D}_n$: in this case, we must estimate the risk due to this data-dependent choice.
As an example, a common practice for choosing $h \in \mathcal{H}$ based on $\mathcal{D}_n$ is to select the hypothesis with minimum empirical error
$\arg\min_{h \in \mathcal{H}} \hat{R}(h),$
and this approach is called Empirical Risk Minimization [2,18], but other possibilities exist, such as Structural Risk Minimization [2,19,20] or penalized (regularized) Empirical Risk Minimization [21,22,23].
To guarantee a prescribed confidence level, or risk, of the chosen hypothesis, the UB can be applied. The UB can be expressed in two forms (please note that, for simplicity, we will refer to $R(h_i)$ and $\hat{R}(h_i)$ as $R_i$ and $\hat{R}_i$ respectively, when it is clear from the context) [8]: a simplified version (Theorem 1) and a generalized version (Theorem 2).
Theorem 1
(Simple UB). The following bounds hold
$\mathbb{P}\left\{L\left(\hat{R}_i, \tfrac{\delta}{2m}\right) \leq R_i \leq U\left(\hat{R}_i, \tfrac{\delta}{2m}\right) \ \forall i \in I\right\} \geq 1 - \delta.$
Theorem 2
(Generalized UB). Let $q(h_i) \in (0, 1)$ and $p(h_i) \in (0, 1)$ be some weights associated with $h_i$, $i \in I$, chosen before seeing the data (please note that, for simplicity, we will refer to $q(h_i)$ and $p(h_i)$ as $q_i$ and $p_i$ respectively, when it is clear from the context) and such that $\sum_{i \in I}(q_i + p_i) = 1$; then the following bounds hold
$\mathbb{P}\left\{L\left(\hat{R}_i, \delta q_i\right) \leq R_i \leq U\left(\hat{R}_i, \delta p_i\right) \ \forall i \in I\right\} \geq 1 - \delta.$
Theorem 1 is a special case of Theorem 2 when $q_i = p_i = \frac{1}{2m}$ $\forall i \in \{1, \dots, m\}$.
Theorem 2 introduces a weight for each risk associated with each choice. Assigning more weight to the risks associated with useful choices leads to tighter bounds on the generalization error of the hypotheses that will actually be selected by the algorithm (hypotheses characterized by small empirical error) and looser estimates over the others (hypotheses characterized by high empirical error). Unfortunately, this approach is mainly theoretical, since the weights must be chosen before seeing the data and consequently we cannot set them without a priori knowledge about the problem. Finally, Theorem 2 does not propose any solution for the choice of these weights.
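Theorem 2 can be read as a budget of confidence: hypothesis $i$ spends $\delta q_i$ on its lower bound and $\delta p_i$ on its upper bound, and since the weights sum to one, the total failure probability is at most $\delta$. A minimal sketch (using, as an assumption, truncated Hoeffding-style bounds in place of the generic $L$ and $U$; the names are ours):

```python
import math

def weighted_union_bound(r_hat, q, p, delta, n):
    """Simultaneous intervals of Theorem 2: hypothesis i fails with probability
    at most delta*(q_i + p_i), so sum_i (q_i + p_i) = 1 caps the total at delta."""
    assert abs(sum(q) + sum(p) - 1.0) < 1e-9
    intervals = []
    for rh, qi, pi in zip(r_hat, q, p):
        lo = max(0.0, rh - math.sqrt(math.log(1.0 / (delta * qi)) / (2.0 * n)))
        hi = min(1.0, rh + math.sqrt(math.log(1.0 / (delta * pi)) / (2.0 * n)))
        intervals.append((lo, hi))
    return intervals

# Uniform weights q_i = p_i = 1/(2m) recover the simple UB of Theorem 1
m = 4
uniform = weighted_union_bound([0.05, 0.2, 0.4, 0.6],
                               [1.0 / (2 * m)] * m, [1.0 / (2 * m)] * m,
                               delta=0.05, n=200)
# Shifting upper-bound weight toward the small-error hypothesis tightens its interval
skewed = weighted_union_bound([0.05, 0.2, 0.4, 0.6],
                              [1.0 / (2 * m)] * m, [0.2, 0.1, 0.1, 0.1],
                              delta=0.05, n=200)
```

The second call illustrates the trade-off discussed above: the first hypothesis gets a tighter upper bound at the price of looser bounds on the others.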
For this reason, in this work, we propose a Distribution-Dependent Weighted UB (DDWUB) where the weights depend on some parameters of the distribution which generated the data, extending our preliminary work [24]. In particular, we define a set of functions $f_i^q: \mathbb{R}^m \to \mathbb{R}$ and $f_i^p: \mathbb{R}^m \to \mathbb{R}$ with $i \in I$ such that
$q_i = f_i^q(R_1, \dots, R_m) \in (0, 1), \quad p_i = f_i^p(R_1, \dots, R_m) \in (0, 1), \quad i \in I, \qquad \sum_{i \in I}\left[f_i^q(R_1, \dots, R_m) + f_i^p(R_1, \dots, R_m)\right] = 1.$
Please note that $f_i^q$, $f_i^p$ with $i \in I$ are quite general and are data independent (note that, in the framework of this paper (binary classification with a loss function which counts the number of misclassified samples), the generalization error is the only parameter of the distribution of the empirical error, which is a Binomial). It is surely possible to consider even more general data independent functions for defining the weights, but we think that our definition is general enough to contemplate a wide variety of cases.
At this point the proposed DDWUB for bounding the generalization error of a hypothesis chosen from a finite set of possible ones can be stated.
Theorem 3.
If $\forall r_1, \dots, r_m \in [0, 1]$
$f_i^q(r_1, \dots, r_m), f_i^p(r_1, \dots, r_m) \in (0, 1), \quad i \in I, \qquad \sum_{i \in I}\left[f_i^q(r_1, \dots, r_m) + f_i^p(r_1, \dots, r_m)\right] = 1,$
then the following bound holds
$\mathbb{P}\left\{L\left(\hat{R}_i, \delta f_i^q(R_1, \dots, R_m)\right) \leq R_i \leq U\left(\hat{R}_i, \delta f_i^p(R_1, \dots, R_m)\right) \ \forall i \in I\right\} \geq 1 - \delta.$
The proof is a direct consequence of Theorem 2.
DDWUB allows bounding the generalization error of each hypothesis in the space of hypotheses but, to prove that DDWUB outperforms the UB, we will require some sufficient conditions that $L$, $U$, $f_i^q$, and $f_i^p$ with $i \in I$ must satisfy. These sufficient conditions define a class of functions, and an open problem is to find special and simpler classes of functions which satisfy them.
Nevertheless, we will show that it is possible to find simple classes of functions which satisfy these conditions. For example, if one is interested in a tighter upper bound on the generalization error of the empirical minimizer, DDWUB suggests combining classical $L$ and $U$, such as [15] or [16], and setting
$f_i^q(R_1, \dots, R_m) = \frac{1}{2m}, \qquad f_i^p(R_1, \dots, R_m) = \frac{1}{2} \frac{e^{-\gamma \max[\theta, R_i]}}{\sum_{j \in I} e^{-\gamma \max[\theta, R_j]}}, \quad i \in I,$
with particular values of $\gamma \in [0, \infty)$ and $\theta \in [0, 1]$.
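A sketch of how such weights behave (the negative exponent is our reading of the garbled formula, consistent with assigning larger weight to smaller true risk; `ddwub_weights` is an illustrative name and the printed weights omit the global $\frac{1}{2}$ factor):

```python
import math

def ddwub_weights(true_risks, gamma, theta):
    """Normalized weights w_i proportional to exp(-gamma * max(theta, R_i)):
    hypotheses with smaller (clipped) true risk receive larger weight."""
    z = [math.exp(-gamma * max(theta, r)) for r in true_risks]
    s = sum(z)
    return [zi / s for zi in z]

# gamma = 0 gives the uniform weights of the simple UB ...
w_flat = ddwub_weights([0.1, 0.2, 0.4, 0.9], gamma=0.0, theta=0.05)
# ... while a large gamma concentrates the weight on the small-risk hypotheses
w_peak = ddwub_weights([0.1, 0.2, 0.4, 0.9], gamma=50.0, theta=0.05)
```

The two calls illustrate the two regimes of $\gamma$ discussed in Section 2.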
As a last remark we would like to note that all our results easily extend to multiclass classification problems and regression problems if the loss function is bounded.
DDWUB is a distribution-dependent form of the UB, analogously to the Computable Shell Decomposition Bounds (CSDB) [20]. The CSDB split the hypothesis space into shells based on the generalization error of each hypothesis and, instead of taking into account the risk of each hypothesis in the space, show that it is possible to take into account just the risk of choosing one shell and the risk associated with each hypothesis in the shell. This allows, for example, not considering hypotheses with high generalization error. The CSDB also show how to estimate the size of these shells based on the histogram of the empirical errors.
DDWUB takes inspiration from several works in the field. The first idea, which is also a driver of the CSDB, is that during any learning procedure the hypotheses with high error will never be taken into account and consequently we should not pay the risk for those hypotheses [10]. The second idea is that, since we do not know the true error of the hypotheses but just their empirical one, we should discard those hypotheses for which the estimated confidence intervals do not overlap [25] with the ones of the hypothesis of minimal training error. The third idea is that, since there is no supporting theory for discarding the hypotheses with non-overlapping confidence intervals, we should weight differently the risk associated with each hypothesis based on their true error, analogously to what is done in the field of multiple hypothesis testing [26]. The fourth idea is that other researchers have shown that a distribution-dependent weighting strategy can be performed without actual knowledge of the distribution [27]. DDWUB combines all these ideas and improves both on the UB and the CSDB.
DDWUB applies to finite hypothesis spaces, and surely more sophisticated techniques, such as the Local Vapnik–Chervonenkis theory [28] or the Local Rademacher Complexity [10], can be employed and can sometimes result in tighter bounds. However, insight into finite classes remains quite useful [20,29]. Finite class analysis can be exploited as a pedagogical tool. It can suggest new directions in which to look for the development and evolution of more sophisticated bounds. It can be useful for model selection purposes (e.g., selecting the most suitable hypothesis space, or set of hyperparameters, or algorithm). Finally, it can be useful when the models are represented with a limited number of bits, because of the constants involved in the bounds.
The rest of the paper is organized as follows. Section 2 presents the DDWUB in a simplified setting. In Section 3 we present the DDWUB in a generalized setting, we derive the sufficient conditions which state when DDWUB improves over the UB, we show that it is possible to find simple classes of functions which satisfy these conditions, and we make the connection between our results and the ones of [25]. Section 4 reports a comparison between DDWUB and the UB by means of closed form results. Section 5 reports a comparison between DDWUB and the UB by means of an extensive set of numerical results. Section 6 compares DDWUB with CSDB by means of an extensive set of numerical results. Section 7 shows the applicability and the potentiality of DDWUB. Section 8 concludes the paper. In the Appendices, known results, proofs, and technicalities are reported for completeness (see Appendix A, Appendix B and Appendix C).

2. Distribution-Dependent Weighted Union Bound: Simplified Setting

Let us consider Theorem 3 and the bound proposed by [16] recalled by Theorem A1 in Appendix A. Let us also suppose, for simplicity, that we are interested in upper bounding the generalization error of the empirical risk minimizer (Extensions will be discussed at the end of this section). In this setting it is possible to state our DDWUB.
Corollary 1.
If
$f_i(r_1, \dots, r_m) = \frac{e^{-\gamma \max[\theta, r_i]}}{\sum_{j \in I} e^{-\gamma \max[\theta, r_j]}}, \quad i \in I,$
with $\gamma \in [0, \infty)$ and $\theta \in [0, 1]$, then the following bound holds
$\mathbb{P}\left\{\max\left[0, \hat{R}_i - \sqrt{\frac{\log\frac{2m}{\delta}}{2n}}\right] \leq R_i \leq \min\left[1, \hat{R}_i + \sqrt{\frac{\log\frac{2}{\delta f_i(R_1, \dots, R_m)}}{2n}}\right] \ \forall i \in I\right\} \geq 1 - \delta.$
Corollary 1 is a direct consequence of Theorems 3 and A1.
The choice of the weights takes inspiration from the work of [27] which proposed, in the context of the PAC-Bayes theory, a distribution-dependent method for assigning an a priori distribution over a set of hypotheses to give a higher probability to the hypothesis with small generalization error. This method has been shown to possess interesting theoretical properties [30,31] and to be also quite effective in practical applications [32].
Since we are interested in choosing and bounding the generalization error of the empirical minimizer, let us define
$i^* = \arg\min_{i \in I} \hat{R}_i.$
This approach is analogous to Page's criterion [33], which was designed as a process inspection scheme to detect deviations of the average in only one direction (one-sided) in a stochastic process.
In Corollary 1, $\gamma$ acts as a weighting factor. The larger $\gamma$ is, the larger the weights of the risks associated with hypotheses with small empirical error, and the smaller the weights of the risks associated with hypotheses with large empirical error. For $\gamma \to \infty$ we have that (for simplicity, we assume in this statement that the empirical minimizer is unique) $p_{i^*} \to 1$ and $p_i \to 0$ $\forall i \in I \setminus \{i^*\}$. The smaller $\gamma$ is, the smaller the difference between the weights of the risks. For $\gamma \to 0$ we have that $p_i = \frac{1}{m}$ $\forall i \in I$.
In Corollary 1, $\theta$, instead, acts as a protection against the fact that the empirical error is measured over a finite number of samples: if the sample size is small, hypotheses with a small difference in empirical error are indistinguishable. In other words, since the weights depend on unknown parameters of the data generating distribution, we will have to estimate them, and since the number of samples is finite these estimates will not allow us to distinguish hypotheses which show similar empirical errors. For this reason, $\theta$ gives the same weight to the risks associated with hypotheses with small empirical error.
The values of γ and θ must be set in a particular way to be sure that DDWUB improves over the UB. In particular
  • Lemma 1 shows that to upper bound the generalization error of the empirical risk minimizer based on DDWUB of Corollary 1 we must solve an optimization problem;
  • Lemmas 2 and 3 show that for particular values of γ the solution is unique and can be found by simply searching for the fixed point of a simple function;
  • Theorem 4 and Lemma 4 show that for particular values of θ it is possible to prove that DDWUB is tighter than, or in the worst case as tight as, the UB.
Thanks to Corollary 1 we can state the following lemma.
Lemma 1.
Under the same conditions of Corollary 1 if
$i^* = \arg\min_{i \in I} \hat{R}_i,$
then the following bound holds
$R_{i^*} \leq \max_{r_1, \dots, r_m} r_{i^*} \quad \text{s.t.} \quad \max\left[0, \hat{R}_i - \sqrt{\frac{\log\frac{2m}{\delta}}{2n}}\right] \leq r_i \leq \min\left[1, \hat{R}_i + \sqrt{\frac{\log\frac{2 \sum_{j \in I} e^{-\gamma \max[\theta, r_j]}}{\delta\, e^{-\gamma \max[\theta, r_i]}}}{2n}}\right], \quad i \in I.$
Lemma 1 can be further simplified as follows.
Lemma 2.
Under the same conditions of Lemma 1 the following bound holds
$R_{i^*} \leq \max_{r_{i^*}} r_{i^*} \quad \text{s.t.} \quad r_{i^*} \leq \min\left[1, \hat{R}_{i^*} + \sqrt{\frac{\log\frac{2 \sum_{j \in I} e^{-\gamma \max[\theta, r_j]}}{\delta\, e^{-\gamma \max[\theta, r_{i^*}]}}}{2n}}\right],$
where $r_i = \max\left[0, \hat{R}_i - \sqrt{\frac{\log\frac{2m}{\delta}}{2n}}\right]$ $\forall i \in I \setminus \{i^*\}$.
The proof can be found in Appendix B.
Please note that the optimization problem of Lemma 2 can be further simplified noting that for particular values of γ , the solution of the optimization problem is unique.
Lemma 3.
Under the same conditions of Lemma 2 if
$\gamma \leq \sqrt{2n},$
the solution of the optimization problem of Lemma 2 exists, it is unique, and it is the fixed point $r^*_{i^*}$ of the following function of $r_{i^*}$
$r_{i^*} = \min\left[1, \hat{R}_{i^*} + \sqrt{\frac{\log\frac{2 \sum_{j \in I} e^{-\gamma \max[\theta, r_j]}}{\delta\, e^{-\gamma \max[\theta, r_{i^*}]}}}{2n}}\right],$
where $r_i = \max\left[0, \hat{R}_i - \sqrt{\frac{\log\frac{2m}{\delta}}{2n}}\right]$ $\forall i \in I \setminus \{i^*\}$.
The proof can be found in Appendix B.
Please note that to find the fixed point defined in Lemma 3 a simple bisection method can be applied.
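This bisection can be sketched as follows (assuming the Hoeffding-style form of Corollary 1 and the negative-exponent weights; `fixed_point_bisection` and its argument names are ours):

```python
import math

def fixed_point_bisection(r_hat_star, others, n, delta, gamma, theta,
                          tol=1e-10, iters=100):
    """Bisection for the fixed point of Lemma 3: r = g(r) with
    g(r) = min(1, r_hat_star + sqrt(log(2*S(r) / (delta * w(r))) / (2n))),
    where w(r) = exp(-gamma * max(theta, r)) and S(r) adds the (fixed) weights
    of the other hypotheses. For gamma <= sqrt(2n) the slope of g is below
    one, so g(r) - r changes sign exactly once on [0, 1]."""
    def g(r):
        w = math.exp(-gamma * max(theta, r))
        s = w + sum(math.exp(-gamma * max(theta, rj)) for rj in others)
        return min(1.0, r_hat_star
                   + math.sqrt(math.log(2.0 * s / (delta * w)) / (2.0 * n)))
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > mid:
            lo = mid       # g(mid) - mid > 0: the fixed point is above mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Here `others` holds the lower bounds $r_i$, $i \in I \setminus \{i^*\}$, which stay fixed during the search.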
For particular values of θ , it is possible to state that DDWUB is tighter than, or in the worst case as tight as, the UB.
Theorem 4.
Under the same conditions of Lemma 3 if
$\theta \geq \min\left[1, \hat{R}_{i^*} + \sqrt{\frac{\log\frac{2m}{\delta}}{2n}}\right],$
then
$r^*_{i^*} \leq \min\left[1, \hat{R}_{i^*} + \sqrt{\frac{\log\frac{2m}{\delta}}{2n}}\right].$
The proof can be found in Appendix B.
The problem with the θ defined in Theorem 4 is that it is data-dependent, since we do not know $i^*$ before seeing the data. For this reason, the following lemma suggests a data-independent threshold on θ which satisfies the conditions of Theorem 4.
Lemma 4.
Under the same conditions of Theorem 4
$\min\left[1, \hat{R}_{i^*} + \sqrt{\frac{\log\frac{2m}{\delta}}{2n}}\right] \leq \min\left[1, \min[R_1, \dots, R_m] + 2\sqrt{\frac{\log\frac{2m}{\delta}}{2n}}\right].$
The proof can be found in Appendix B.
Lemma 4 provides us with a method for finding a $\theta \geq U\left(\hat{R}_{i^*}, \frac{\delta}{2m}\right)$ in a data independent way by setting
$\theta = \min\left[1, \min[R_1, \dots, R_m] + 2\sqrt{\frac{\log\frac{2m}{\delta}}{2n}}\right].$
By finding the $r^*_{i^*}$ for all possible values of θ and then by selecting the largest one which satisfies Equation (1), we obtain the results of our DDWUB.
Please note that the above-mentioned result easily extends to the whole supervised learning framework, as long as a bounded loss function is employed, since the inequality proposed by [16] covers this case.
Following the same argument described in this section it is possible to derive the DDWUB for lower bounding the generalization error of the empirical risk minimizer by simply setting in Theorem 3
$f_i^q(R_1, \dots, R_m) = \frac{1}{2} \frac{e^{\gamma \min[\theta, 1 - R_i]}}{\sum_{j \in I} e^{\gamma \min[\theta, 1 - R_j]}}, \qquad f_i^p(R_1, \dots, R_m) = \frac{1}{2m}, \quad i \in I.$
Finally, it is possible to derive the DDWUB for upper and lower bounding the generalization error of the empirical risk minimizer by simply setting in Theorem 3
$f_i^q(R_1, \dots, R_m) = \frac{1}{2} \frac{e^{\gamma \min[\theta, 1 - R_i]}}{\sum_{j \in I} e^{\gamma \min[\theta, 1 - R_j]}}, \qquad f_i^p(R_1, \dots, R_m) = \frac{1}{2} \frac{e^{-\gamma \max[\theta, R_i]}}{\sum_{j \in I} e^{-\gamma \max[\theta, R_j]}}, \quad i \in I.$
Example 1.
Before presenting DDWUB in the general setting we would like to show an application of DDWUB in the simplified setting. Let us consider the case when (More general examples can be derived, and we will do it later with both closed form and numerical results, but here we want to keep the presentation as simple as possible.)
$\hat{R}_1 = \hat{R}_2 = 0, \qquad \hat{R}_3 = \hat{R}_4 = \dots = \hat{R}_m = \nu, \qquad \nu \in \left\{\frac{1}{n}, \frac{2}{n}, \dots, 1\right\}.$
Let us set $\gamma = \sqrt{2n}$ (see Lemma 3) and note that, to upper bound the generalization error of the function with the smallest empirical error (i.e., the one corresponding to $\hat{R}_1$), DDWUB states that
$r_1 = \sqrt{\frac{\ln\frac{2 \sum_{i=1}^m e^{-\sqrt{2n} \max[\theta, r_i]}}{\delta\, e^{-\sqrt{2n} \max[\theta, r_1]}}}{2n}}, \qquad r_2 = 0, \qquad r_3 = r_4 = \dots = r_m = \nu - \sqrt{\frac{\ln\frac{2m}{\delta}}{2n}}, \qquad \theta = \min\left[1, \min[r_1, \dots, r_m] + 2\sqrt{\frac{\ln\frac{2m}{\delta}}{2n}}\right].$
Please note that, for a finite but large enough value of n,
$\min[r_1, \dots, r_m] = 0 \qquad \text{and} \qquad \theta = 2\sqrt{\frac{\ln\frac{2m}{\delta}}{2n}}.$
Thanks to the theory of DDWUB (see Lemma 4) we can state that
$r_1^* \leq \theta.$
Let us note that if
$m < \frac{\delta\, e^{\frac{\sqrt{2n}\, \nu^2}{2}}}{2},$
then
$r_3 = \dots = r_m > \theta.$
Then we can easily state that
$\lim_{n \to \infty} \frac{2 \sum_{i=1}^m e^{-\sqrt{2n} \max[\theta, r_i]}}{\delta\, e^{-\sqrt{2n} \max[\theta, r_1]}} = \frac{4}{\delta},$
which means that all the hypotheses in the space with $\hat{R} \neq 0$, if $m < \frac{\delta e^{\frac{\sqrt{2n} \nu^2}{2}}}{2}$, are not taken into account, asymptotically, in estimating the upper bound of the hypothesis with the smallest error with DDWUB.
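The limit can be checked numerically. A sketch with hypothetical values $m = 5$, $\nu = 0.3$, $\delta = 0.1$ (so that $4/\delta = 40$), under our reading of the example with $\gamma = \sqrt{2n}$ and negative-exponent weights:

```python
import math

def example_ratio(n, m=5, nu=0.3, delta=0.1):
    """Ratio of Example 1; it should approach 4 / delta as n grows."""
    gamma = math.sqrt(2.0 * n)
    gap = math.sqrt(math.log(2.0 * m / delta) / (2.0 * n))
    r = [0.0, 0.0] + [max(0.0, nu - gap)] * (m - 2)   # r_1, r_2, then r_3..r_m
    theta = min(1.0, min(r) + 2.0 * gap)
    num = 2.0 * sum(math.exp(-gamma * max(theta, ri)) for ri in r)
    den = delta * math.exp(-gamma * max(theta, r[0]))
    return num / den
```

As $n$ grows, the hypotheses with $r_i > \theta$ contribute exponentially little, and the ratio decreases toward $4/\delta = 40$.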

3. Distribution-Dependent Weighted Union Bound: General Setting

In this section, we will derive the sufficient conditions for stating that DDWUB is tighter than, or in the worst case as tight as, the UB.
In particular, as we have done in Section 2, we will start by supposing that we are just interested in upper bounding the generalization error of the empirical risk minimizer. Nevertheless, as pointed out in Section 2, DDWUB can be easily generalized to the lower bounds, or to both lower and upper bounds, and to the general supervised setting with bounded loss functions; in this work, we do not report all these extensions in order not to make the notation and the presentation over-complicated.
As noted in the introduction, the weights should not depend on the data, but they can depend on some parameters of the data generating distribution. For this reason, we define a set of functions $f_i: \mathbb{R}^m \to \mathbb{R}$ with $i \in I$ such that
$f_i(R_1, \dots, R_m) \in (0, 1) \ \forall i \in I, \qquad \sum_{i \in I} f_i(R_1, \dots, R_m) = 1.$
In this setting DDWUB can be formulated as follows.
Corollary 2.
If $\forall r_1, \dots, r_m \in [0, 1]$
$f_i(r_1, \dots, r_m) \in (0, 1) \ \forall i \in I, \qquad \sum_{i \in I} f_i(r_1, \dots, r_m) = 1,$
then the following bound holds
$\mathbb{P}\left\{L\left(\hat{R}_i, \frac{\delta}{2m}\right) \leq R_i \leq U\left(\hat{R}_i, \frac{\delta f_i(R_1, \dots, R_m)}{2}\right) \ \forall i \in I\right\} \geq 1 - \delta.$
Corollary 2 is a direct consequence of Theorem 2.
To prove that DDWUB outperforms the UB we will require some sufficient conditions that $L$, $U$, and $f_i$ with $i \in I$ must satisfy. Please note that, from Corollary 2, it is possible to derive all the lower bounds on the generalization error of the hypotheses in the class, since $L\left(\hat{R}_i, \frac{\delta}{2m}\right)$ depends just on known quantities. For what concerns the upper bounds, instead, the answer is not as easy.
In the rest of this section, we will show how to find the upper bound of the generalization error of a hypothesis chosen in H based on DDWUB (Corollary 2) and under which conditions these upper bounds are tighter than the one of UB (Theorem 1). For this purpose
  • Lemma 5 will show that under certain conditions, the bound of Corollary 2 can be exploited to compute the upper bound of the generalization error of a hypothesis chosen in a class of possible ones based on the observation of a set of data, by solving a complex optimization problem;
  • Lemma 6 will show the conditions under which the optimization problem of Lemma 5 can be simplified;
  • Lemma 7 will show the conditions under which the solution of the optimization problem of Lemma 6 is unique;
  • Theorem 5 will show the conditions under which the upper bound of the generalization error of Lemma 5 found with Lemma 7 is never looser than the one computed with the UB of Theorem 1. These conditions require the knowledge of a data-dependent threshold;
  • Lemma 8 shows that it is possible to estimate this threshold of Theorem 5 in a data independent fashion.
Thanks to Corollary 2 we can state the following lemma.
Lemma 5.
Under the same conditions of Corollary 2, if $\forall \hat{r} \in [0, 1]$ and $\forall \delta \in (0, 1)$
$L(\hat{r}, \delta) \in [0, 1), \qquad U(\hat{r}, \delta) \in (0, 1],$
then the following bound holds with probability at least $(1 - \delta)$ and $\forall i \in I$
$R_i \leq \max_{r_1, \dots, r_m} r_i \quad \text{s.t.} \quad L\left(\hat{R}_j, \frac{\delta}{2m}\right) \leq r_j \leq U\left(\hat{R}_j, \frac{\delta f_j(r_1, \dots, r_m)}{2}\right), \quad j \in I.$
The solution of the optimization problem of Lemma 5 is not trivial to find, and its properties are not easy to grasp.
The following lemma helps us in simplifying the optimization problem of Lemma 5 under a quite natural condition: the upper bound of the generalization error of a hypothesis should decrease if the generalization error of one of the other hypotheses in the class increases.
Lemma 6.
Under the same conditions of Lemma 5, if $\forall \hat{r}_i \in \left\{0, \frac{1}{n}, \dots, 1\right\}$, $\forall r_1, \dots, r_m \in [0, 1]$, $\forall j \in I \setminus \{i\}$, and $\forall r'_j, r''_j \in [0, 1]$ such that $r'_j < r''_j$
$U\left(\hat{r}_i, \frac{\delta f_i(r_1, \dots, r_{j-1}, r''_j, r_{j+1}, \dots, r_m)}{2}\right) - U\left(\hat{r}_i, \frac{\delta f_i(r_1, \dots, r_{j-1}, r'_j, r_{j+1}, \dots, r_m)}{2}\right) \leq 0,$
where i I , then the optimization problem of Lemma 5 is equivalent to the following one
$R_i \leq \max_{r_i} r_i \quad \text{s.t.} \quad r_i \leq U\left(\hat{R}_i, \frac{\delta f_i(r_1, \dots, r_m)}{2}\right),$
where $r_j = L\left(\hat{R}_j, \frac{\delta}{2m}\right)$, with $j \in I \setminus \{i\}$.
In fact, the hypotheses of the lemma imply that, to reach the maximum of $r_i$, one must reach the lower bounds of $r_j$ with $j \in I \setminus \{i\}$.
Even if the optimization problem of Lemma 6 is much simpler than the one of Lemma 5, the next result further simplifies it under another sufficient condition which ensures the existence and uniqueness of the solution: the upper bound of the generalization error of a hypothesis should not increase too fast if its generalization error decreases.
Lemma 7.
Under the same conditions of Lemma 6, if $\forall \hat{r}_i \in \left\{0, \frac{1}{n}, \dots, 1\right\}$, $\forall r_1, \dots, r_m \in [0, 1]$, and $\forall r'_i, r''_i \in [0, 1]$ such that $r'_i < r''_i$
$\frac{U\left(\hat{r}_i, \frac{\delta f_i(r_1, \dots, r_{i-1}, r''_i, r_{i+1}, \dots, r_m)}{2}\right) - U\left(\hat{r}_i, \frac{\delta f_i(r_1, \dots, r_{i-1}, r'_i, r_{i+1}, \dots, r_m)}{2}\right)}{r''_i - r'_i} < 1,$
then the solution $r_i^*$ of the optimization problem of Lemma 6 exists, it is unique, and it is the fixed point of the following function of $r_i$
$r_i = U\left(\hat{R}_i, \frac{\delta f_i(r_1, \dots, r_m)}{2}\right),$
where $r_j = L\left(\hat{R}_j, \frac{\delta}{2m}\right)$ with $j \in I \setminus \{i\}$.
In fact, the hypothesis of the lemma guarantees the uniqueness of the solution.
Algorithm 1 reports a simple pseudo-code for finding the fixed point of Lemma 7 based on the bisection method.
The next result introduces a parameter, more specifically a threshold θ, and states the condition on θ under which DDWUB improves over the UB.
Theorem 5.
Under the same conditions of Lemma 7, let us consider $\theta \in [0, 1]$ and suppose that $\forall r_1, \dots, r_m \in [0, 1]$ and $\forall r'_j, r''_j \in [0, \theta]$
$f_j(r_1, \dots, r_{j-1}, r''_j, r_{j+1}, \dots, r_m) - f_j(r_1, \dots, r_{j-1}, r'_j, r_{j+1}, \dots, r_m) = 0,$
with $j \in I$. Let us also suppose that if $r_1, \dots, r_m \in [0, \theta]$ then $\forall j \in I$
$f_j(r_1, \dots, r_m) = \frac{1}{m}.$
If
$\theta \geq U\left(\hat{R}_i, \frac{\delta}{2m}\right),$
and if $\forall r_1, \dots, r_m \in [0, 1]$, $\forall j \in I \setminus \{i\}$, and $\forall r'_j, r''_j \in (\theta, 1]$ such that $r'_j < r''_j$
$f_i(r_1, \dots, r_{j-1}, r'_j, r_{j+1}, \dots, r_m) - f_i(r_1, \dots, r_{j-1}, r''_j, r_{j+1}, \dots, r_m) < 0,$
then the following bound holds
$r_i^* \leq U\left(\hat{R}_i, \frac{\delta}{2m}\right).$
The proof of Theorem 5 can be found in Appendix B.
Theorem 5 basically states that under particular conditions, the solution of the problem of Corollary 2 is never looser than the one of Theorem 1.
Unfortunately, we cannot set $\theta = U\left(\hat{R}_i, \frac{\delta}{2m}\right)$, since this would be a data-dependent choice which would result in a data-dependent weighting strategy. The next lemma addresses this problem.
Algorithm 1: Algorithm for finding the fixed point of Lemma 7 based on the bisection method.
[Pseudo-code reported as an image in the original article.]
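Since the pseudo-code is rendered as an image in the original article, a plausible sketch of the bisection it describes, for a generic map $g(r) = U\left(\hat{R}_i, \frac{\delta f_i(r_1, \dots, r, \dots, r_m)}{2}\right)$ whose slope is below one (hypothesis of Lemma 7), is:

```python
def bisect_fixed_point(g, lo=0.0, hi=1.0, tol=1e-9, iters=200):
    """Bisection for the unique fixed point of g on [lo, hi]: since the slope
    of g is below one, g(r) - r is positive below the root and negative above."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > mid:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

For instance, `bisect_fixed_point(lambda r: 0.3 + 0.5 * r)` converges to the fixed point $r = 0.6$.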
Lemma 8.
Under the same conditions of Theorem 5, if $\forall \hat{r}, \hat{r}', \hat{r}'' \in \left\{0, \frac{1}{n}, \dots, 1\right\}$ such that $\hat{r}' < \hat{r}''$ we have that
$i^* = \arg\min_{i \in I} \hat{R}_i, \qquad L(\hat{r}'', \delta) - L(\hat{r}', \delta) \geq 0, \qquad \exists L^{-1}: L^{-1}(L(\hat{r}, \delta), \delta) \leq \hat{r},$
then
$U\left(\hat{R}_{i^*}, \frac{\delta}{2m}\right) \leq U\left(L^{-1}\left(\min[R_1, \dots, R_m], \frac{\delta}{2m}\right), \frac{\delta}{2m}\right).$
The proof of Lemma 8 can be found in Appendix B.
Lemma 8 provides us with a method for finding a $\theta \geq U\left(\hat{R}_{i^*}, \frac{\delta}{2m}\right)$ in a data independent way by setting
$\theta = U\left(L^{-1}\left(\min[R_1, \dots, R_m], \frac{\delta}{2m}\right), \frac{\delta}{2m}\right).$
Thanks to all these results, we can provide a method for finding the fixed point of Lemma 7, but with a data independent weighting strategy which satisfies the hypothesis of Theorem 5 and a data independent θ defined in Lemma 8.
Lemma 9.
Under the same conditions of Lemma 8, Algorithm 2 finds the fixed point of Lemma 7 but with a data independent weighting strategy which satisfies the hypothesis of Theorem 5 and a data independent θ defined in Lemma 8.
The proof of Lemma 9 can be found in Appendix B.
Algorithm 2: Algorithm for finding the fixed point of Lemma 7 but with data independent weighting strategy which satisfies the hypothesis of Theorem 5 and a data independent θ defined in Lemma 8.
[Pseudo-code reported as an image in the original article.]
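Algorithm 2 is likewise rendered as an image; its key data-independent ingredient, the θ of Lemma 8 built from $L^{-1}$, can be sketched as follows (illustrated, as an assumption, with the truncated Hoeffding-style bounds rather than the exact bounds of [15]; all names are ours):

```python
import math

def l_inverse(r, n, delta, L):
    """L^{-1}(r, n, delta) = min { r_hat in {0, 1/n, ..., 1} : r <= L(r_hat, delta) };
    returns 1.0 if no grid point qualifies."""
    for k in range(n + 1):
        if r <= L(k / n, delta):
            return k / n
    return 1.0

def L_hoeff(r_hat, delta, n=100):
    # illustrative truncated lower bound (stand-in for the bound of [15])
    return max(0.0, r_hat - math.sqrt(math.log(1.0 / delta) / (2.0 * n)))

def U_hoeff(r_hat, delta, n=100):
    # illustrative truncated upper bound (stand-in for the bound of [15])
    return min(1.0, r_hat + math.sqrt(math.log(1.0 / delta) / (2.0 * n)))

# theta = U(L^{-1}(min_i R_i, delta/(2m)), delta/(2m))  -- Lemma 8
R, m, delta, n = [0.15, 0.3, 0.5], 3, 0.05, 100
theta = U_hoeff(l_inverse(min(R), n, delta / (2 * m), L_hoeff), delta / (2 * m))
```

Since θ is computed from the true risks $R_1, \dots, R_m$ only, it does not depend on the observed sample, as required by Theorem 5.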
Please note that if we apply this general theory to Corollary 1 we obtain the same results of Section 2.
In the next section, instead, we apply the general theory to the more complex case in which [15] is employed together with the same weights exploited in Corollary 1.

3.1. From Theory to Practice

In this section, we will exploit the same solution of Corollary 1 for the weights needed in DDWUB and we will show that it satisfies the hypotheses of the theorems, corollaries, and lemmas presented in the previous section. In particular, we will set
$f_i(R_1, \dots, R_m) = \frac{e^{-\gamma \max[\theta, R_i]}}{\sum_{j \in I} e^{-\gamma \max[\theta, R_j]}}, \quad i \in I,$
where $\gamma \in [0, +\infty)$ and $\theta \in [0, 1]$ are finite constants which regulate the shape of the distribution of the weights.
The following lemma shows that these weights satisfy the sufficient conditions which state that DDWUB outperforms, or in the worst case performs as, the UB.
Lemma 10.
If γ [ 0 , + ) is a finite constant and
$f_i(R_1, \dots, R_m) = \frac{e^{-\gamma \max[\theta, R_i]}}{\sum_{j \in I} e^{-\gamma \max[\theta, R_j]}}, \quad i \in I,$
then the hypotheses of Corollary 2 and Theorem 5 are satisfied.
The proof of Lemma 10 can be found in Appendix B.
Please note that, for what concerns the hypotheses of Lemmas 6 and 7, we cannot prove that the conditions hold without knowing the shape of the lower and upper bounds of the generalization error. For this reason, we exploit the generalization bounds of [15], which are the tightest ones in the settings of this paper [6]. This will allow us, in Section 5, to compare the UB with the DDWUB through a set of numerical experiments.
To reach our goal, let us define the Regularized Incomplete Beta Function $p = F(r; a, b) = I_r(a, b)$ and its inverse $r = F^{-1}(p; a, b)$, with parameters $a \in \mathbb{N}^*$ and $b \in \mathbb{N}^*$, for the corresponding values of $r$ and probabilities $p$
$F(r; a, b) = \frac{1}{B(a, b)} \int_0^r t^{a-1} (1 - t)^{b-1} \, dt, \qquad F^{-1}(p; a, b) = \left\{r : F(r; a, b) = p\right\},$
where
$B(a, b) = B(b, a) = \frac{(a - 1)!\,(b - 1)!}{(a + b - 1)!}$
is the Complete Beta Function.
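For integer parameters, $I_r(a, b)$ coincides with a binomial tail, $I_r(a, b) = \sum_{j=a}^{a+b-1} \binom{a+b-1}{j} r^j (1-r)^{a+b-1-j}$, which yields a standard-library sketch of $F$ and its inverse (function names are ours):

```python
import math

def reg_inc_beta(r, a, b):
    """F(r; a, b) = I_r(a, b) for integer a, b >= 1, via the binomial identity
    I_r(a, b) = sum_{j=a}^{a+b-1} C(a+b-1, j) r^j (1-r)^(a+b-1-j)."""
    n = a + b - 1
    return sum(math.comb(n, j) * r**j * (1 - r)**(n - j) for j in range(a, n + 1))

def reg_inc_beta_inv(p, a, b, tol=1e-12):
    """F^{-1}(p; a, b): bisection, since F(.; a, b) is increasing on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if reg_inc_beta(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, $F(r; 1, 1) = r$ (the uniform distribution) and $F(r; 2, 1) = r^2$.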
The following lemma states the conditions under which all the hypotheses of Corollary 2, Lemmas 6 and 7, Theorem 5, and Lemma 8 are satisfied if, in Corollary 2, the lower and upper generalization bounds proposed by [15], recalled by Theorem A2 in the Appendix A, and the weights defined in Lemma 10 are exploited.
Lemma 11.
Let us exploit the lower and upper generalization bounds proposed by [15] in Corollary 2
$L(\hat{R}, \delta) = \begin{cases} F^{-1}\left(\delta;\, n\hat{R},\, n - n\hat{R} + 1\right) & \hat{R} \in \left\{\frac{1}{n}, \frac{2}{n}, \dots, 1\right\} \\ 0 & \hat{R} = 0 \end{cases}, \qquad U(\hat{R}, \delta) = \begin{cases} F^{-1}\left(1 - \delta;\, n\hat{R} + 1,\, n - n\hat{R}\right) & \hat{R} \in \left\{0, \frac{1}{n}, \dots, \frac{n-1}{n}\right\} \\ 1 & \hat{R} = 1 \end{cases},$
together with the weights defined in Lemma 10. Then if
$i^* = \arg\min_{i \in I} \hat{R}_i, \qquad \hat{R}_{i^*} \neq 1, \qquad \gamma < \min_{\hat{r} \in \left\{0, \frac{1}{n}, \dots, \frac{n-1}{n}\right\},\, p \in (0, 1)} \frac{\left[U\left(\hat{r}, \frac{\delta p}{2}\right)\right]^{n\hat{r}} \left[1 - U\left(\hat{r}, \frac{\delta p}{2}\right)\right]^{n - n\hat{r} - 1}}{B\left(n\hat{r} + 1,\, n - n\hat{r}\right) \frac{\delta}{2}\, p\, (1 - p)},$
the hypotheses of Corollary 2, Lemmas 6 and 7, Theorem 5, and Lemma 8 are satisfied. Moreover, note that
L 1 ( r , n , δ ) = min r ^ : r ^ 0 , 1 n , , 1 , r L ( r ^ , δ ) , min r ^ 0 , 1 n , , n 1 n , p ( 0 , 1 ) U r ^ , δ p 2 n r ^ 1 U r ^ , δ p 2 n n r ^ 1 B ( n r ^ + 1 , n n r ^ ) δ 2 p ( 1 p ) 2 n 2 π e 7 6 .
The proof of Lemma 11 can be found in Appendix B.
The case where R ^ i * = 1 is trivial since, in this case, we can safely state that R i * 1 .
Please note that the limit on the value of γ is O ( n ) ; this result is connected to, and in agreement with, the consistency results derived in the PAC-Bayes [30] and Algorithmic Stability [31] theories, where the distribution-dependent prior of [27] is exploited.
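The bounds L and U of Lemma 11 are the exact Binomial (Clopper–Pearson style) confidence limits and can be evaluated through the inverse Beta CDF; a minimal sketch in Python (the helper names and the example values are ours; scipy.stats.beta.ppf implements F^{-1}):

```python
from scipy.stats import beta

def lower_bound(r_hat, delta, n):
    """L(R_hat, delta) of Lemma 11: F^{-1}(delta; n R_hat, n - n R_hat + 1)."""
    k = round(n * r_hat)
    return 0.0 if k == 0 else beta.ppf(delta, k, n - k + 1)

def upper_bound(r_hat, delta, n):
    """U(R_hat, delta) of Lemma 11: F^{-1}(1 - delta; n R_hat + 1, n - n R_hat)."""
    k = round(n * r_hat)
    return 1.0 if k == n else beta.ppf(1 - delta, k + 1, n - k)

# Example: n = 100 samples, empirical error 0.05, confidence 1 - delta.
lo = lower_bound(0.05, 0.05, 100)
hi = upper_bound(0.05, 0.05, 100)
```

For instance, for R̂ = 0 the upper bound reduces to the closed form U(0, δ) = 1 − δ^{1/n}.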

3.2. Observation

Let us consider the case where the empirical errors of the hypotheses in the space have been sorted as follows
R ^ 1 R ^ 2 R ^ m ,
and let us exploit the results of Section 3.1.
Let us set
γ = n 4 < 2 n 2 π e 7 6 ,
and suppose that
R ^ 1 1 .
Then by using the UB (see Theorem 1) we can state that with probability at least ( 1 δ )
R 1 0 , U R ^ 1 , δ 2 m ,
while, by using DDWUB (see Lemma 11), we can state that with probability at least ( 1 δ )
0 R 1 r 1 * = max r 1 r 1 s . t . r 1 = U R ^ 1 , δ f 1 ( r 1 , , r m ) 2 r i = L R ^ i , δ 2 m , i I \ 1 f i ( r 1 , , r m ) = e γ max [ θ , r i ] j = 1 m e γ max [ θ , r j ] , i I γ = n 4 θ = U L 1 min r 1 , , r m , δ 2 m , δ 2 m .
By looking at this last problem and by setting r 1 = r 1 * for simplicity, we can observe some properties. The first one is that
f i ( r 1 , , r m ) = e γ θ | J | e γ θ + j = I \ J e γ r j i J e γ r i | J | e γ θ + j = I \ J e γ r j i I \ J , J = { j : j I , r j θ } .
The second one is that i I \ J
e γ r i | J | e γ θ + j = I \ J e γ r j e γ θ | J | e γ θ + j = I \ J e γ r j .
These properties state that DDWUB, with respect to the UB, can discard, or more properly reduce, the risk of the hypotheses h i with i I \ J when estimating the generalization error of the hypothesis h 1 . As a first rough approximation, we can state that we must pay a risk only for the hypotheses h i with i J , obtaining
R 1 0 , U R ^ 1 , δ 2 | J | ,
where | J | m . A quite important aspect is then to understand some properties of θ . Let us recall that
θ = U L 1 min r 1 , , r m , δ 2 m , δ 2 m ,
but we can easily state that
min r 1 , , r m = min U R ^ 1 , δ f 1 ( r 1 , , r m ) 2 , L R ^ 2 , δ 2 m .
Thanks to Theorem 5, we can state that f 1 ( r 1 , , r m ) 1 m and then the case where
min U R ^ 1 , δ f 1 ( r 1 , , r m ) 2 , L R ^ 2 , δ 2 m = U R ^ 1 , δ f 1 ( r 1 , , r m ) 2 ,
is quite unusual since it means that R ^ 2 R ^ 1 is large, or, in other words, it means that our set of hypotheses contains just one hypothesis with small error and many hypotheses with high error. Instead, if
min U R ^ 1 , δ f 1 ( r 1 , , r m ) 2 , L R ^ 2 , δ 2 m = L R ^ 2 , δ 2 m ,
then
θ = L R ^ 2 , δ 2 m ,
which means that the threshold θ is obtained from the upper bound of the second-best hypothesis in the space, namely the hypothesis with the second smallest empirical error. To the best of our knowledge, this result is new: in the past, many researchers have proposed, with similar approaches but without supporting theory, to clean the hypothesis space of the hypotheses with high empirical error. The basic idea of these methods is that it is reasonable to discard all the hypotheses such that the lower bound of their generalization error is greater than U R ^ 1 , δ 2 m , namely the upper bound of the generalization error of the hypothesis with the smallest empirical error; see for example [25]. Our theory states, instead, that we can smooth the risk due to the hypotheses such that the lower bound of their generalization error is greater than U R ^ 2 , δ 2 m , namely the upper bound of the generalization error of the hypothesis with the second smallest empirical error.
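The fixed-point problem above can be solved by simple iteration; the following minimal sketch illustrates the idea (we take θ and γ as given inputs, whereas in the paper θ is itself obtained through a further fixed point, see Lemma 9; the helper names are ours and scipy.stats.beta.ppf implements F^{-1}):

```python
import numpy as np
from scipy.stats import beta

def upper_bound(r_hat, delta, n):
    k = round(n * r_hat)
    return 1.0 if k == n else beta.ppf(1 - delta, k + 1, n - k)

def lower_bound(r_hat, delta, n):
    k = round(n * r_hat)
    return 0.0 if k == 0 else beta.ppf(delta, k, n - k + 1)

def ddwub_r1(emp_errors, n, delta, gamma, theta, iters=200):
    """Fixed-point iteration for the bound on R_1 in the problem above.

    emp_errors must be sorted so that emp_errors[0] is the smallest
    empirical error; the r_i with i >= 2 stay fixed at their lower bounds.
    """
    m = len(emp_errors)
    r = np.array([upper_bound(emp_errors[0], delta / (2 * m), n)] +
                 [lower_bound(e, delta / (2 * m), n) for e in emp_errors[1:]])
    for _ in range(iters):
        scores = -gamma * np.maximum(theta, r)
        scores -= scores.max()  # numerical stability; cancels in the ratio
        f = np.exp(scores) / np.exp(scores).sum()
        r[0] = upper_bound(emp_errors[0], delta * f[0] / 2, n)
    return r[0]
```

In the scenario of Section 4 (two hypotheses with zero empirical error and many with empirical error one), the iteration converges in a few steps to a value close to U(R̂_1, δ/4), which is much tighter than the plain UB value U(R̂_1, δ/(2m)).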

4. Closed Form Results

In this section, we will report some closed form results regarding Lemma 11. These examples are useful to provide an idea of the effect of the DDWUB; an extensive comparison based on numerical results is provided in Section 5.
Let us consider the case where
R ^ 1 = R ^ 2 = 0 , R ^ 3 = = R ^ m = 1 .
Let us set
γ = n 4 < 2 n 2 π e 7 6 ,
and note that
arg min i I R ^ i = 1 , R ^ 1 1 .
If we use the UB (see Theorem 1) with the bound of [15] we obtain that with probability at least ( 1 δ ) the following bound holds
R 1 0 , 1 δ 2 m n .
If, instead, we use DDWUB (see Lemma 9 and Algorithm 2) we obtain that with probability at least ( 1 δ ) the following bound holds
0 R 1 r 1 * = max r r s . t . r = 1 δ f 1 ( r , r 2 , , r m ) 2 n f 1 ( r , r 2 , , r m ) = e γ max [ θ , r ] e γ max [ θ , r ] + j = 2 m e γ max [ θ , r j ] r 2 = 0 r 3 = = r m = δ 2 m n θ = U L 1 min r , r 2 , , r m , δ 2 m , δ 2 m γ = n 4 .
Since we can surely state that
min r 1 * , , r m = 0 ,
then
θ = 1 δ 2 m n .
Moreover, thanks to Theorem 5, we can state that
r 1 * θ .
  • Asymptotic Case
Please note that
f 1 ( r 1 * , r 2 , , r m ) = 1 2 + ( m 2 ) e n 4 1 2 δ 2 m n ,
moreover
lim n 1 2 + ( m 2 ) e n 4 1 2 δ 2 m n = 1 2 .
Consequently, at least asymptotically, DDWUB can discard entirely the risk due to the hypotheses with high empirical error.
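This limit can also be checked numerically; a small sketch (δ = 0.05 and m = 1000 are illustrative choices):

```python
import math

def f1(n, m=1000, delta=0.05):
    """Closed-form weight of the best hypothesis in the scenario above."""
    x = (delta / (2 * m)) ** (1.0 / n)  # (delta / (2 m))^(1/n)
    return 1.0 / (2.0 + (m - 2) * math.exp((n / 4.0) * (1.0 - 2.0 * x)))

# f1(n) approaches 1/2 as n grows: the two good hypotheses share all
# the confidence, and the useless ones are asymptotically discarded.
```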
  • Finite Sample Case
DDWUB gives an advantage, with respect to the UB, as long as
r 3 = = r m > θ .
This means that we have an advantage in terms of the tightness of the estimated upper bound for R 1 as long as
δ 2 m n > 1 δ 2 m n ,
and consequently as long as
m < δ 2 n 1 .
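The equivalence between the condition above and this closed-form threshold on m can be verified directly; a quick sketch (δ = 0.05 as in the experiments of Section 5):

```python
def has_advantage(m, n, delta=0.05):
    """True when r_3 = ... = r_m > theta, i.e., when
    (delta / (2 m))^(1/n) > 1 - (delta / (2 m))^(1/n)."""
    x = (delta / (2 * m)) ** (1.0 / n)
    return x > 1.0 - x

# The condition is equivalent to the closed form m < delta * 2**(n - 1).
```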

5. Numerical Results

This section is devoted to the numerical comparison between the UB (see Theorem 1) and the DDWUB (see Lemma 11).
In particular we will focus on upper bounding the generalization error of the hypothesis, in the set of possible ones, characterized by the smallest empirical error. The comparison between the UB and the DDWUB will be made in different scenarios to better understand the advantages of the DDWUB.
  • Scenario A.
Scenario A is an optimistic scenario, the same as in Section 4: the set of hypotheses contains few useful hypotheses (small empirical error) and a lot of useless hypotheses (high empirical error) such that
R ^ 1 = R ^ 2 = 0 , R ^ 3 = = R ^ m = 1 .
We set δ = 0.05 and set the numerical precision of the algorithms to ϵ = 0.0001 .
Figure 1 reports the estimated generalization error upper bound of the hypothesis with the smallest empirical error estimated with the UB and the DDWUB, together with the percentage of improvement in three different sub-scenarios:
  • Sub-scenario A.1 (Figure 1a): we set m = 1000 and we vary n { 10 , 20 , 40 , 100 , 200 , 400 , 1000 } ;
  • Sub-scenario A.2 (Figure 1b): we vary m { 10 , 100 , 1000 , 10000 , 100000 } and we set n = 40 ;
  • Sub-scenario A.3 (Figure 1c): we vary m { 10 , 100 , 1000 , 10000 , 100000 } and we set n = 100 .
Based on the results reported in Figure 1 we can observe that
  • DDWUB is always tighter than, or in the worst case equivalent to, the UB;
  • increasing the number of samples always increases the advantage of the DDWUB over the UB, until all the risk of the hypotheses with the largest empirical error is disregarded;
  • increasing m increases the advantage of the DDWUB over the UB up to a limit value of m: if too many useless hypotheses are present, DDWUB is no longer able to disregard their risk. Nevertheless, the larger n is, the larger this limit value of m becomes.
In a slightly less optimistic scenario when
R ^ 1 = R ^ 2 = 0 , R ^ 3 = = R ^ m = 1 2 ,
which is the case of a lot of charlatans and just a few good candidates in a hiring process [34], the results, reported in Figure 2a–c, do not change much, apart from the limit value of m beyond which the DDWUB stops improving over the UB, which is obviously smaller.
  • Scenario B.
The second scenario is a more classical one when
R ^ 1 = 0 , R ^ 2 = 1 n , , R ^ n + 1 = 1 .
This case is exploited in many applications (e.g., the CSDB of [20], or the Structural Risk Minimization of [2], or the Structural Risk Minimization over data-dependent hierarchies of [19]).
Also, in this scenario we set δ = 0.05 and we set the numerical precision to ϵ = 0.0001 . Then we vary n { 10 , 20 , 40 , 100 , 200 , 400 , 1000 } .
Figure 3 reports the estimated generalization error upper bound of the hypothesis with the smallest empirical error estimated with the UB and the DDWUB, together with the percentage of improvement.
Figure 3 clearly shows the advantage of the DDWUB over the UB and the growth of this advantage as n increases.
  • Scenario C.
The last scenario involves the unlucky case in which our set of m hypotheses is taken from all 2 n possible binary hypotheses over n data. Please note that the number of hypotheses with i errors among the 2 n possible binary hypotheses over n data is n i . Then we force our m hypotheses to have at least one hypothesis for each possible value of the empirical error, and consequently m n + 1 . The remaining m n 1 are taken from the 2 n possible binary hypotheses so as to approximate the distribution of the 2 n possible binary hypotheses. The result of this approach is that our set of hypotheses will be composed of m = j = 0 n n j 2 n z hypotheses, with z N * , as follows
R ^ 1 = = R ^ n 0 2 n z = 0 , R ^ n 0 2 n z + 1 = = R ^ n 0 2 n z + n 1 2 n z = 1 n , R ^ j = 0 n 1 n j 2 n z + 1 = = R ^ j = 0 n n j 2 n z = 1 .
Please note that, for example, when z = 1 we obtain Scenario B.
Also, in this scenario we set δ = 0.05 and we set the numerical precision to ϵ = 0.0001 .
Figure 4a–c reports the estimated generalization error upper bound of the hypothesis with the smallest empirical error estimated with the UB and the DDWUB, together with the percentage of improvement in the same sub-scenarios of Scenario A.
Figure 4 shows that even in this unlucky case, the DDWUB can remarkably outperform the UB.

The Importance of γ and θ

In this section, we would like to discuss the importance of γ and θ also by means of some numerical experiments supporting our claims.
Let us start with θ . On the one hand, setting θ = 1 would make the DDWUB degenerate into the UB, eliminating all the benefits of using the DDWUB. On the other hand, setting θ = 0 , namely removing θ , is not possible because of the constraint on θ of Theorem 5, which does not allow us, in this case, to guarantee that the solution of the DDWUB always outperforms, or in the worst case degenerates into, the UB. Hence, θ ( 0 , 1 ) deals with the fact that we do not know the generalization error of our hypotheses, we must estimate it, and consequently hypotheses with small and close empirical errors cannot be distinguished. Unfortunately, the constraint of Theorem 5 on θ is data-dependent, and consequently we must resort to a suboptimal, but data-independent, limitation on θ , as reported in Lemma 8. The question that arises here is the practical difference between all these choices. For this reason, we consider the Scenario A previously defined with R ^ 1 = 0.1 , R ^ 2 = ν and R ^ 3 = = R ^ m = 1 , where we set δ = 0.05 , n = 100 , and m = 1000 . Then we vary ν { 0.1 , 0.2 , , 1 } , and we report in Figure 5 the comparison between the UB and the DDWUB with θ = 0 , θ = θ ^ , i.e., equal to the lower limit of the constraint of Theorem 5, θ = θ ^ ^ , i.e., equal to the lower limit of the constraint of Lemma 8, and θ = 1 .
From Figure 5 it is possible to derive some observations. Setting θ = 0 in the DDWUB can result in worse estimates than the ones of the UB since, when R ^ 2 is close to R ^ 1 (small ν ), we cannot distinguish between the first two hypotheses, which are the ones with the lowest generalization error. As soon as ν grows, the difference between R ^ 1 and R ^ 2 becomes statistically relevant, and so setting θ = 0 in the DDWUB results in better estimates than the UB, or even better than the ones of the DDWUB, since θ is useless in this particular scenario for large ν . Setting θ = θ ^ gives the best results and always outperforms the UB, while setting θ = θ ^ ^ results in worse estimates which are still better than the ones of the UB. Please note that for small and large ν , θ ^ and θ ^ ^ are equivalent, while there is a middle range of ν for which θ ^ performs better than θ ^ ^ . The explanation of this phenomenon can be derived using the observations of Section 3.2. The hypothesis that regulates θ ^ is the one corresponding to R ^ 1 (the one with the smallest empirical error). The hypothesis that regulates θ ^ ^ , instead, is the one corresponding to R ^ 2 (the second-best hypothesis) if the distance between R ^ 1 and R ^ 2 is small, i.e., for small ν , while it is the one corresponding to R ^ 1 if the distance between R ^ 1 and R ^ 2 is large. Consequently, when R ^ 1 and R ^ 2 are close, it is indifferent to choose one or the other, and the estimates of the DDWUB are almost equivalent. When, instead, the difference between R ^ 1 and R ^ 2 increases, R ^ 2 , instead of R ^ 1 , regulates θ , and the DDWUB with θ ^ performs better than the DDWUB with θ ^ ^ . Finally, when the distance between R ^ 1 and R ^ 2 is large, θ is regulated by R ^ 1 both for θ ^ and θ ^ ^ , and consequently the corresponding estimates of the DDWUB are equivalent. Finally, setting θ = 1 results in the UB.
Let us now consider γ . On the one hand, setting γ = 0 would make the DDWUB degenerate into the UB, eliminating all the benefits of using the DDWUB. On the other hand, setting γ = would result in splitting the confidence equally just over the hypotheses with estimated error smaller than θ , disregarding all the other hypotheses, which is our goal; unfortunately, this is not possible because of the constraints of Lemma 7, which result in the limitation on γ of Lemma 11. This is the reason we set, in the experiments, γ to the limit of what Lemma 11 allows. Beyond that limit we do not know how the solution of the DDWUB behaves, since we do not even know how to retrieve it. Setting a γ smaller than the maximum value allowed by Lemma 11 would diminish the performance of the DDWUB, until the DDWUB degenerates into the UB. To support this statement, we consider Scenario A with R ^ 1 = R ^ 2 = 0 and R ^ 3 = = R ^ m = 1 , we set δ = 0.05 , n = 100 , and m = 1000 , we vary γ { 10 5 n , 10 4 n , , 10 1 n , γ ^ } , where γ ^ is the limit defined in Lemma 11, and we report the comparison between the DDWUB and the UB in Figure 6.
From Figure 6 it is possible to clearly observe that the maximum improvement is achieved when γ is maximum, namely when γ = γ ^ . The same result can be observed in all the other scenarios.

6. What About the Computable Shell Decomposition Bounds?

In this section, we will show that the DDWUB also improves over the CSDB. Before starting the comparison, we must recall this milestone result.
Theorem 6
([20]). The following bounds hold
P R i max r : r [ 0 , 1 ] , kl ( R ^ i | | r ) S ^ ( r , n , δ ) + ln 4 n δ n i I 1 δ ,
where r = max [ 1 , r n ] n { 1 n , , n n } and if r = k n then r [ k 1 n , k n ] , kl ( q | | p ) = q ln q p + ( 1 q ) ln 1 q 1 p is the Kullback–Leibler divergence, and
S ^ k n , n , δ = ln max 1 , 2 h i : i I , R ^ i k n 1 n + ln 16 n 2 δ 2 n 1 .
Let us consider the same scenario of Section 4.
The DDWUB is able, at least asymptotically, to discard all the risk associated with the hypotheses with high empirical error, so that, for n large enough, the rate of convergence of the bound on R 1 is O ( 1 / n ) .
Instead, if we use the CSDB of Theorem 6, we get that, for n large enough, the rate of convergence of the bound on R 1 is O ( ln ( n ) / n ) . Moreover, note that O ( ln ( n ) / n ) is also the fastest possible rate of convergence for Theorem 6.
If instead of checking the asymptotic behavior of the DDWUB and the CSDB we check their finite sample behavior by means of numerical experiments as in Section 5, we can derive other interesting observations. Figure 7, Figure 8, Figure 9 and Figure 10 (and the associated sub-figures) report the same comparison of Figure 1, Figure 2, Figure 3 and Figure 4 in Section 5 but, instead of comparing the UB with the DDWUB, here we compare the CSDB with the DDWUB.
As can be clearly seen from the results, the DDWUB performs consistently better than the CSDB.

7. Improving the Computable Shell Decomposition Bounds

The purpose of this section is to demonstrate that the DDWUB can be exploited to improve known results in Statistical Learning Theory. In particular, in this section we will show that the DDWUB can be exploited to improve the CSDB.
The proof of the CSDB of Theorem 6 relies on two main results, reported in the next theorem, combined with the UB.
Theorem 7
([20]). The following bounds hold
P kl ( R ^ i | | R i ) ln ( | H | ) + ln 2 δ n i I 1 δ , P S k n , n S ^ k n , n , 2 n δ 1 δ ,
where
S k n , n = ln max 1 , 2 h i : i I , R i k 1 n , k n
and S ^ k n , n , δ is defined as in Theorem 6.
By splitting the hypotheses into shells based on their generalization error, namely H r = h i : i I , R i k 1 n , k n with r { 1 n , 2 n , , 1 } , by combining the two probabilistic bounds of Theorem 7, by using the UB, and by considering the worst case scenario, the result of Theorem 6 is derived. Consequently, [20] considers 2 n probabilistic bounds, two for each one of the shells, and spreads the confidence (risk) equally over them.
We propose, instead, to use the same approach as the DDWUB in the CSDB. Instead of spreading the confidence equally over the 2 n probabilistic bounds, we spread the confidence over them based on the maximum generalization error of the functions in each of the n shells to which the bounds refer. The result is reported in the following lemma.
Lemma 12.
The following bound holds
P R i max r : r [ 0 , 1 ] , kl ( R ^ i | | r ) S ^ ( r , n , n p ( r ) δ ) + ln 4 δ p ( r ) n i I 1 δ ,
where
p ( r ) = e n r ln ( n ) i = 1 n e i ln ( n ) .
The proof is a simple application of the concepts behind the DDWUB. θ is not needed since, for each one of the shells, we know exactly, by definition, the maximum generalization error of the functions inside it. γ is set to n ln ( n ) for the following reason. The rate of convergence of the CSDB is O ( √ ( ln ( n ) / n ) ) in the general case and O ( ln ( n ) / n ) when R ^ i = 0 . Let us study instead the rate of convergence of the bound of Lemma 12. Thanks to the Geometric Series, we can state that
p ( r ) = e n r ln ( n ) i = 1 n e i ln ( n ) = e n r ln ( n ) 1 e 1 ln ( n ) 1 e n ln ( n )
For n large enough we can state that
ln 1 p ( r ) n r ln ( n ) + ln ( ln ( n ) )
Consequently, the rate of convergence of the bound of Lemma 12 is O ( √ ( ln ( ln ( n ) ) / n ) ) in the general case and O ( ln ( ln ( n ) ) / n ) when R ^ i = 0 . This means that, for γ = n ln ( n ) , the rate of convergence of the bound of Lemma 12 is better than the one of the CSDB.
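As a sanity check, the weights p(r) of Lemma 12 form a probability distribution over the n shells and decay geometrically with the shell index; a minimal sketch (we read the exponents as negative, consistently with the geometric-series normalizer above):

```python
import math

def shell_weights(n):
    """p(r) of Lemma 12 for r in {1/n, ..., 1}: p(r) proportional to
    exp(-n r ln(n)); for r = k/n this is n^(-k), so the normalizer is the
    geometric sum of n^(-i) for i = 1, ..., n."""
    z = sum(n ** (-i) for i in range(1, n + 1))
    return [math.exp(-n * (k / n) * math.log(n)) / z for k in range(1, n + 1)]
```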
It could be possible to improve the bound with different values of γ and θ , or to prove that for particular values of γ and θ Lemma 12 is always better than the CSDB, but this is beyond the scope of this paper.
If, instead, we compare the finite sample behavior of the CSDB and of Lemma 12 (which we will call CSDB+DDWUB) by means of numerical experiments as in Section 5, we can observe the possible benefit of using the DDWUB in the CSDB, a well-known result of Statistical Learning Theory. Figure 11, Figure 12, Figure 13 and Figure 14 (and the associated sub-figures) report the same comparison of Figure 1, Figure 2, Figure 3 and Figure 4 in Section 5 but, instead of comparing the UB with the DDWUB, here we compare the CSDB with the CSDB+DDWUB.
As can be clearly seen from the results, the CSDB+DDWUB performs consistently better than the CSDB.

8. Conclusions and Discussion

In this work we derived, for an arbitrary finite hypothesis space, a new fully empirical upper bound on the generalization error of the hypothesis with minimal training error. As noted in the paper, although we presented just the upper bound, our result can be easily generalized to lower bounds or to simultaneous upper and lower bounds.
In particular, we depicted a quite general framework under which it is possible to improve the Union Bound with a distribution-dependent weighting strategy associated with the risk of each choice. Then we stated the conditions under which the proposed Distribution-Dependent Weighted Union Bound is always tighter than the bound based on the Union Bound. We showed that these conditions are quite easy to satisfy. By means of both closed form and numerical results, we demonstrated that the Distribution-Dependent Weighted Union Bound is consistently tighter than the Union Bound in different scenarios. Finally, we showed that the Distribution-Dependent Weighted Union Bound is also able to improve over the Computable Shell Decomposition Bound, another quite powerful distribution-dependent Union Bound.
The results of this work are quite promising and pave the way toward many different future improvements. One is surely to derive a class of weighting strategies which satisfies the conditions behind the Distribution-Dependent Weighted Union Bound. Another one is to understand how to exploit the results of this work to improve all the results in Statistical Learning Theory where the Union Bound is employed. A final one, and probably the most important, is how to extend and apply our results to infinite-dimensional hypothesis spaces. This extension, which is obviously not trivial, admits multiple alternatives. The first, and naive, approach would be to plug our approach into the compression bound, which already deals with the infinite-dimensional case by exploiting the concept of compression and which exploits the union bound naively. The second approach, less intuitive, would be to split the hypothesis space into a finite number of shells, estimate the size of each shell with classical bounds based on the Vapnik–Chervonenkis or Rademacher Complexity theories, and then exploit our Distribution-Dependent Weighted Union Bound to pay the price of the choice of one of the shells. The last, and more challenging, approach would be to plug our weighting strategy directly into the derivation of the Vapnik–Chervonenkis or Rademacher Complexity based bounds, thereby shrinking the measures of complexity as an alternative to the localization approaches.

Author Contributions

Conceptualization, L.O. and S.R.; Formal analysis, L.O.; Investigation, L.O. and S.R.; Methodology, L.O.; Software, L.O.; Writing—original draft, L.O.; Writing—review & editing, L.O. and S.R. Authors have equally contributed to this work. All authors have read and agreed to the published version of the manuscript.

Funding

No funding to declare.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Known Results

Known bounds and results exploited in the paper are included in this section.
Theorem A1
([16]). The following bounds hold
P R R ^ ln 1 δ 2 n 1 δ , P R R ^ + ln 1 δ 2 n 1 δ .
Theorem A2
([15]). The following bounds hold
P R F 1 ( δ ; n R ^ , n n R ^ + 1 ) R ^ 1 n , 2 n , , 1 0 R ^ = 0 1 δ , P R F 1 1 δ ; n R ^ + 1 , n n R ^ R ^ 0 , 1 n , , n 1 n 1 R ^ = 1 1 δ .
Theorem A3
([35]). If n N * , the following bounds hold
2 π n n n e n n ! 2 π n n n e n e 1 12 n .
Theorem A4
([6]). If n N * , p [ 0 , 1 ] , q { 0 , 1 n , 2 n , , 1 } , and R ^ < p , the following bound holds
F p ; n q , n n q + 1 e n kl ( q | | p )
where
F p ; n q , n n q + 1 = i = 0 n q n i p i ( 1 p ) n i ,
is the beta cumulative density function, and
kl ( q | | p ) = q ln q p + ( 1 q ) ln 1 q 1 p
is the Kullback–Leibler divergence.

Appendix B. Proofs

Proofs of our results are included in this section.
Proof. (Lemma 2)
Please note that under the assumptions of the lemma, r 1 , , r m [ 0 , 1 ] , and i I \ i * and r k , r k [ 0 , 1 ] such that r k < r k
log 2 j I \ k e γ max [ θ , r j ] + e γ max [ θ , r k ] δ e γ max [ θ , r i ] 2 n log 2 j I \ k e γ max [ θ , r j ] + e γ max [ θ , r k ] δ e γ max [ θ , r i ] 2 n 0 ,
then the statement of the lemma is proved. □
Proof. (Lemma 3)
Please note that under the assumptions of the lemma
0 < min 1 , R ^ i + log 2 j I e γ max [ θ , r j ] δ e γ max [ θ , r i * ] 2 n 1 .
Note also that if
γ 2 n p i * = e γ r i * j I e γ r j
then
log 2 j I e γ r j δ e γ r i * 2 n r i * = γ p i * ( 1 p i * ) 4 n p i * ln 2 δ p i * 2 n < ( 1 p i * ) 2 ln 2 δ p i * 2 < 1 ,
then the statement of the lemma is proved. □
Proof. (Theorem 4)
Let us define
ϑ = min 1 , R ^ i * + log 2 m δ 2 n .
Let us suppose that
r i ϑ , i I \ i * .
If we set
r i * * = ϑ ,
we have, thanks to the hypothesis of the theorem that
e γ max [ θ , r i ] j I e γ max [ θ , r j ] = 1 m , i I ,
then r i * * is a fixed point and since it is unique by Lemma 3 we have that r i * * ϑ .
Let us suppose now that
r i ϑ , i I \ { i * , k } , r k > ϑ , k I \ i * .
Thanks to the hypothesis of the theorem, we can state that
e γ max [ θ , r i * ] j I \ { i * , k } e γ max [ θ , r j ] + e γ max [ θ , r i * ] + e γ max [ θ , r k ] > e γ max [ θ , r i * ] j I \ i * e γ θ + e γ max [ θ , r i * ] ,
and, consequently, we can also state that
r i * = min 1 , R ^ i * + log 2 j I \ { i * , k } e γ max [ θ , r j ] + e γ max [ θ , r i * ] + e γ max [ θ , r k ] δ e γ max [ θ , r i * ] 2 n min 1 , R ^ i * + log 2 j I \ i * e γ θ + e γ max [ θ , r i * ] δ e γ max [ θ , r i * ] 2 n .
Consequently, by exploiting the same reasoning exploited before, r i * * ϑ .
By induction, the statement of the theorem is proved. □
Proof. (Lemma 4)
Please note that
min R 1 , , R m max 0 , min R ^ 1 , , R ^ m log 2 m δ 2 n .
Then we can state that
R ^ i * min R 1 , , R m + log 2 m δ 2 n ,
and consequently, the statement of the theorem is proved. □
Proof. (Theorem 5)
Let us define
ϑ = U R ^ i , δ 2 m .
Let us suppose that
r j ϑ , j I \ i .
If we set
r i * = ϑ ,
we have, thanks to the hypothesis of the theorem that
f 1 ( r 1 , , r i 1 , r i * , r i + 1 , , r m ) = = f m ( r 1 , , r i 1 , r i * , r i + 1 , , r m ) = 1 m ,
then r i * is a fixed point and since it is unique by Lemma 7 we have that r i * ϑ .
Let us suppose now that
r j ϑ , j I \ { i , k } , r k > ϑ , k I \ i .
Thanks to the hypothesis of the theorem, we can state that
f i ( ϑ , , ϑ , r i , ϑ , , ϑ , r k , ϑ , , ϑ ) > f i ( ϑ , , ϑ , r i , ϑ , , ϑ ) ,
and, consequently, we can also state that
r i = U R ^ i , δ f i ( ϑ , , ϑ , r i , ϑ , , ϑ , r k , ϑ , , ϑ ) 2 U R ^ i , δ f i ( ϑ , , ϑ , r i , ϑ , , ϑ ) 2 .
Consequently, by exploiting the same reasoning exploited before, r i * ϑ .
By induction, the statement of the theorem is proved. □
Proof. (Lemma 8)
Please note that
min R 1 , , R m L min R ^ 1 , , R ^ m , δ 2 m .
Then we can state that
R ^ i * L 1 min R 1 , , R m , δ 2 m ,
and consequently, the statement of the theorem is proved. □
Proof. (Lemma 9)
To prove the theorem we first just have to note that for a fixed θ , the problem can be solved with Algorithm 1 and that its solution is r i * * ( θ ) and r j ( θ ) = L R ^ j , δ 2 m with j I \ i * . Then, searching for the largest fixed point, varying θ , such that
θ = U L 1 min r 1 ( θ ) , , r i * 1 ( θ ) , r i * * ( θ ) , r i * + 1 ( θ ) , , r m ( θ ) , δ 2 m , δ 2 m
gives the proposed algorithm. □
Proof. (Lemma 10)
For what concerns Corollary 2, we must prove that r 1 , , r m [ 0 , 1 ]
f i ( r 1 , , r m ) ( 0 , 1 ) i I , i I f i ( r 1 , , r m ) = 1 .
These two properties are trivially provable by definition.
For what concerns Theorem 5, instead, we must prove that for θ [ 0 , 1 ] , r 1 , , r m [ 0 , 1 ] , and r j , r j [ 0 , θ ]
f j ( r 1 , , r j 1 , r j , r j + 1 , , r m ) f j ( r 1 , , r j 1 , r j , r j + 1 , , r m ) = 0 ,
with j I . Moreover, we must prove that r 1 , , r m [ 0 , θ ] and j I
f j ( r 1 , , r m ) = 1 m .
These properties are trivially provable by definition. Then we must prove that r 1 , , r m [ 0 , 1 ] , j I \ i , r j , r j ( θ , 1 ] such that r j < r j
f i ( r 1 , , r j 1 , r j , r j + 1 , , r m ) f i ( r 1 , , r j 1 , r j , r j + 1 , , r m ) < 0 .
Then, let us note that
e γ max [ θ , r i ] e γ max [ θ , r j ] + k I \ j e γ max [ θ , r k ] e γ max [ θ , r i ] e γ max [ θ , r j ] + k I \ j e γ max [ θ , r k ] > e γ max [ θ , r i ] e γ max [ θ , r j ] + k I \ j e γ max [ θ , r k ] e γ max [ θ , r i ] e γ max [ θ , r j ] + k I \ j e γ max [ θ , r k ] = 0 ,
consequently, the property is proven. □
Proof. (Lemma 11)
For what concerns Corollary 2 and Theorem 5, the proof is a trivial consequence of Lemma 10.
For what concerns Lemma 6, we have to prove that r ^ i 0 , 1 m , , 1 , r 1 , , r m [ 0 , 1 ] , and j I \ i and r j , r j [ 0 , 1 ] such that r j < r j
U r ^ i , δ f i ( r 1 , , r j 1 , r j , r j + 1 , , r m ) 2 U r ^ i , δ f i ( r 1 , , r j 1 , r j , r j + 1 , , r m ) 2 0 ,
where i I . Let us define
g i ( r 1 , , r m ) = e γ r i j I e γ r j , i I ,
and note that
g i ( r 1 , , r m ) r j = γ g i ( r 1 , , r m ) g j ( r 1 , , r m ) , i I , j I \ i .
Combining this property with Proposition A1 we obtain the result.
For what concerns Lemma 7, we must prove that r i , r i [ 0 , 1 ] such that r i < r i
U R ^ i , δ f i ( r 1 , , r i , , r m ) 2 U R ^ i , δ f i ( r 1 , , r i , , r m ) 2 r i r i < 1 , U R ^ i , δ f i ( r 1 , , r i 1 , 0 , r i + 1 , , r m ) 2 > 0 , U R ^ i , δ f i ( r 1 , , r i 1 , 1 , r i + 1 , , r m ) 2 < 1 .
The second and the third properties are trivially true if R ^ i 1 . For the first, instead, note that
g i ( r 1 , , r m ) r i = γ g i ( r 1 , , r m ) ( 1 g i ( r 1 , , r m ) ) , i I .
Then we can state, thanks also to Proposition A1 that
U R ^ i , δ f i ( r 1 , , r i , , r m ) 2 U R ^ i , δ f i ( r 1 , , r i , , r m ) 2 r i r i max r ^ 0 , 1 n , , n 1 n , p ( 0 , 1 ) B ( n r ^ + 1 , n n r ^ ) U r ^ , δ p 2 n r ^ 1 U r ^ , δ p 2 n n r ^ 1 δ 2 γ p ( 1 p )
Since this quantity must be strictly smaller than one we have that
γ < min r ^ 0 , 1 n , , n 1 n , p ( 0 , 1 ) U r ^ , δ p 2 n r ^ 1 U r ^ , δ p 2 n n r ^ 1 B ( n r ^ + 1 , n n r ^ ) δ 2 p ( 1 p ) ,
Note also that by denoting
s ^ = n r ^ , U = U r ^ , δ p 2 ,
then, thanks to Theorems A3 and A4 we can state that
B ( s ^ + 1 , n s ^ ) 1 U s ^ 1 U n s ^ 1 p = s ^ ! ( n s ^ 1 ) ! n ! 1 U s ^ 1 U n s ^ 1 p 2 π s ^ s ^ s ^ e s ^ e 1 12 s ^ 2 π ( n s ^ 1 ) ( n s ^ 1 ) n s ^ 1 e ( n s ^ 1 ) e 1 12 ( n s ^ 1 ) 2 π n n n e n 1 U s ^ 1 U n s ^ 1 p = 2 π e 1 + 1 12 s ^ + 1 12 ( n s ^ 1 ) s ^ s ^ s ^ ( n s ^ 1 ) ( n s ^ 1 ) n s ^ 1 n n n 1 U s ^ 1 U n s ^ 1 p 2 π e 7 6 s ^ ( n s ^ ) s ^ s ^ ( n s ^ ) n s ^ n n n 1 U s ^ 1 U n s ^ 1 U n s ^ p = 2 π e 7 6 s ^ ( n s ^ ) s ^ s ^ ( n s ^ ) n s ^ n n n 1 U s ^ 1 U n s ^ 1 U n 1 s ^ n p 2 π e 7 6 s ^ ( n s ^ ) s ^ s ^ ( n s ^ ) n s ^ n n n 1 U s ^ 1 U n s ^ 1 n p = 2 π e 7 6 s ^ ( n s ^ ) n 1 e n s ^ n ln s ^ n U + n s ^ n ln 1 s ^ n 1 U 1 n p 2 π e 7 6 s ^ ( n s ^ ) n 1 i = 0 s ^ n i U i ( 1 U ) n i 1 n p = 2 π e 7 6 s ^ n 1 s ^ n n 2 δ 2 π e 7 6 1 n 1 δ ,
and, consequently
min r ^ 0 , 1 n , , n 1 n , p ( 0 , 1 ) U r ^ , δ p 2 n r ^ 1 U r ^ , δ p 2 n n r ^ 1 B ( n r ^ + 1 , n n r ^ ) δ 2 p ( 1 p ) 2 n 2 π e 7 6 .
Finally, for what concerns Lemma 8, we must prove that r ^ , r ^ , r ^ { 0 , 1 m , , 1 } such that r ^ < r ^ , we have that
L ( r ^ , δ ) L ( r ^ , δ ) 0 , L 1 : L 1 ( L ( r ^ , n , δ ) , δ ) r ^ .
Exploiting Proposition A1 the first property can be easily derived. The second property is true by definition of L 1 ( r , δ ) . □

Appendix C. Technicalities

Technicalities of the paper are included in this section.
Proposition A1
(for Theorem A2). Let us define
L ( R ^ , n , δ ) = F 1 ( δ ; n R ^ , n n R ^ + 1 ) R ^ 1 n , 2 n , , 1 0 R ^ = 0 , U ( R ^ , n , δ ) = F 1 1 δ ; n R ^ + 1 , n n R ^ R ^ 0 , 1 n , , n 1 n 1 R ^ = 1 ,
then
L ( R ^ , n , δ ) [ 0 , 1 ] , U ( R ^ , n , δ ) [ 0 , 1 ] ,
and
L ( R ^ , δ ) δ = B ( n R ^ , n n R ^ + 1 ) ( L ( R ^ , δ ) ) n R ^ 1 ( 1 L ( R ^ , δ ) ) n n R ^ R ^ 1 n , 2 n , , 1 0 R ^ = 0 [ 0 , ) , U ( R ^ , δ ) δ = B ( n R ^ + 1 , n n R ^ ) ( U ( R ^ , δ ) ) n R ^ ( 1 U ( R ^ , δ ) ) n n R ^ 1 R ^ 0 , 1 n , , n 1 n 0 R ^ = 1 ( , 0 ] .
The derivation of Proposition A1 is straightforward: the ranges follow since $F^{-1}$ is a quantile function, and the derivatives follow by applying the inverse function theorem to the Beta cumulative distribution function.
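Since $F^{-1}(\cdot;\, a, b)$ is the Beta quantile, $L$ and $U$ in Proposition A1 are exactly the Clopper–Pearson confidence limits: $U(\hat{R}, n, \delta)$ is the value of $U$ at which the Binomial left tail $\sum_{i=0}^{n\hat{R}} \binom{n}{i} U^{i} (1-U)^{n-i}$ equals $\delta$, and symmetrically for $L$. The paper's code is in Matlab (Appendix D); the following pure-Python sketch, with helper names of our own choosing, recovers both limits by bisection on the Binomial tails.

```python
import math

def binom_left_tail(n: int, s: int, q: float) -> float:
    # P[Bin(n, q) <= s]
    return sum(math.comb(n, i) * q**i * (1 - q)**(n - i) for i in range(s + 1))

def upper_bound(r_hat: float, n: int, delta: float) -> float:
    # U(R^, n, delta): root of P[Bin(n, U) <= n*r_hat] = delta (Clopper-Pearson upper limit)
    s = round(n * r_hat)
    if s == n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(100):  # the left tail is decreasing in q, so bisection applies
        mid = (lo + hi) / 2
        if binom_left_tail(n, s, mid) > delta:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def lower_bound(r_hat: float, n: int, delta: float) -> float:
    # L(R^, n, delta): root of P[Bin(n, L) >= n*r_hat] = delta (Clopper-Pearson lower limit)
    s = round(n * r_hat)
    if s == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(100):  # the right tail is increasing in q
        mid = (lo + hi) / 2
        if 1 - binom_left_tail(n, s - 1, mid) < delta:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For instance, at $\hat{R} = 0$ the upper limit solves $(1-U)^{n} = \delta$, i.e., $U = 1 - \delta^{1/n}$, and monotonicity of $L$ in $\hat{R}$ is the first property required by Lemma 8. These values should coincide with the corresponding Beta quantiles of Proposition A1.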

Appendix D. Code

All the Matlab code used in this paper is available upon request from the corresponding author.

References

  1. Kearns, M.J.; Vazirani, U.V. An Introduction to Computational Learning Theory; MIT Press: Cambridge, MA, USA, 1994.
  2. Vapnik, V.N. Statistical Learning Theory; Wiley: New York, NY, USA, 1998.
  3. Friedman, J.; Hastie, T.; Tibshirani, R. The Elements of Statistical Learning; Springer: Berlin/Heidelberg, Germany, 2001.
  4. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014.
  5. Bartlett, P.L.; Boucheron, S.; Lugosi, G. Model selection and error estimation. Mach. Learn. 2002, 48, 85–113.
  6. Langford, J. Tutorial on practical prediction theory for classification. J. Mach. Learn. Res. 2005, 6, 273–306.
  7. Oneto, L. Model Selection and Error Estimation Without the Agonizing Pain. WIREs Data Min. Knowl. Discov. 2018, 8, e1252.
  8. Langford, J. Quantitatively Tight Sample Complexity Bounds. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2002.
  9. Bartlett, P.L.; Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 2002, 3, 463–482.
  10. Bartlett, P.L.; Bousquet, O.; Mendelson, S. Local Rademacher complexities. Ann. Stat. 2005, 33, 1497–1537.
  11. Bousquet, O.; Elisseeff, A. Stability and generalization. J. Mach. Learn. Res. 2002, 2, 499–526.
  12. Floyd, S.; Warmuth, M. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Mach. Learn. 1995, 21, 269–304.
  13. McAllester, D.A. Some PAC-Bayesian theorems. Mach. Learn. 1999, 37, 355–363.
  14. Dwork, C.; Feldman, V.; Hardt, M.; Pitassi, T.; Reingold, O.; Roth, A.L. Preserving Statistical Validity in Adaptive Data Analysis. In Proceedings of the ACM Symposium on Theory of Computing, Portland, OR, USA, 14–17 June 2015.
  15. Clopper, C.J.; Pearson, E.S. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 1934, 26, 404–413.
  16. Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 1963, 58, 13–30.
  17. Hotz, T.; Kelma, F.; Wieditz, J. Non-asymptotic confidence sets for circular means. Entropy 2016, 18, 375.
  18. Mukherjee, S.; Niyogi, P.; Poggio, T.; Rifkin, R. Learning theory: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv. Comput. Math. 2006, 25, 161–193.
  19. Shawe-Taylor, J.; Bartlett, P.L.; Williamson, R.C.; Anthony, M. Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory 1998, 44, 1926–1940.
  20. Langford, J.; McAllester, D. Computable shell decomposition bounds. J. Mach. Learn. Res. 2004, 5, 529–547.
  21. Tikhonov, A.N.; Arsenin, V.I.A.; John, F. Solutions of Ill-Posed Problems; Winston: Washington, DC, USA, 1977.
  22. Schölkopf, B.; Smola, A.J.; Bach, F. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, USA, 2002.
  23. Schuster, T.; Kaltenbacher, B.; Hofmann, B.; Kazimierski, K.S. Regularization Methods in Banach Spaces; Walter de Gruyter: Berlin, Germany, 2012.
  24. Oneto, L.; Ridella, S.; Anguita, D. Improving the Union Bound: A Distribution Dependent Approach. In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Brugge, Belgium, 2–4 October 2020.
  25. Maron, O.; Moore, A.W. The racing algorithm: Model selection for lazy learners. Artif. Intell. Rev. 1997, 11, 193–225.
  26. Roeder, K.; Wasserman, L. Genome-wide significance levels and weighted hypothesis testing. Stat. Sci. 2009, 24, 398.
  27. Catoni, O. PAC-Bayesian Supervised Classification; Institute of Mathematical Statistics: Beachwood, OH, USA, 2007.
  28. Oneto, L.; Anguita, D.; Ridella, S. A local Vapnik-Chervonenkis complexity. Neural Netw. 2016, 82, 62–75.
  29. Polato, M.; Lauriola, I.; Aiolli, F. A novel boolean kernels family for categorical data. Entropy 2018, 20, 444.
  30. Lever, G.; Laviolette, F.; Shawe-Taylor, J. Tighter PAC-Bayes bounds through distribution-dependent priors. Theor. Comput. Sci. 2013, 473, 4–28.
  31. Oneto, L.; Anguita, D.; Ridella, S. PAC-Bayesian analysis of distribution dependent priors: Tighter risk bounds and stability analysis. Pattern Recognit. Lett. 2016, 80, 200–207.
  32. Oneto, L.; Cipollini, F.; Ridella, S.; Anguita, D. Randomized Learning: Generalization Performance of Old and New Theoretically Grounded Algorithms. Neurocomputing 2018, 298, 21–33.
  33. Page, E.S. Continuous inspection schemes. Biometrika 1954, 41, 100–115.
  34. Jensen, D.D.; Cohen, P.R. Multiple comparisons in induction algorithms. Mach. Learn. 2000, 38, 309–338.
  35. Robbins, H. A remark on Stirling's formula. Am. Math. Mon. 1955, 62, 26–29.
Figure 1. Scenario A ($\hat{R}_1 = \hat{R}_2 = 0$ and $\hat{R}_3 = \cdots = \hat{R}_m = 1$): upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the UB and the DDWUB together with the percentage of improvement.
Figure 2. Scenario A ($\hat{R}_1 = \hat{R}_2 = 0$ and $\hat{R}_3 = \cdots = \hat{R}_m = \frac{1}{2}$): upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the UB and the DDWUB together with the percentage of improvement.
Figure 3. Scenario B: upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the UB and the DDWUB together with the percentage of improvement.
Figure 4. Scenario C: upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the UB and the DDWUB together with the percentage of improvement.
Figure 5. Scenario A ($\hat{R}_1 = 0.1$, $\hat{R}_2 = \nu$, and $\hat{R}_3 = \cdots = \hat{R}_m = 1$): upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the UB and the DDWUB with $\theta \in \{0, \hat{\theta}, \hat{\hat{\theta}}, 1\}$ together with the percentage of improvement when we set $\delta = 0.05$, $n = 100$, and $m = 1000$ and we vary $\nu \in \{0.1, 0.2, \ldots, 1\}$. The two upper panels depict the whole range, while the two lower panels zoom in on the most interesting parts.
Figure 6. Scenario A ($\hat{R}_1 = \hat{R}_2 = 0$ and $\hat{R}_3 = \cdots = \hat{R}_m = 1$): upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the UB and the DDWUB together with the percentage of improvement when we set $n = 100$ and $m = 1000$ and we vary $\gamma \in \{10^{-5} n, 10^{-4} n, \ldots, 10^{-1} n, \hat{\gamma}\}$, where $\hat{\gamma}$ is the limit defined in Lemma 11.
Figure 7. Scenario A ($\hat{R}_1 = \hat{R}_2 = 0$ and $\hat{R}_3 = \cdots = \hat{R}_m = 1$): upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the CSDB and the DDWUB together with the percentage of improvement.
Figure 8. Scenario A ($\hat{R}_1 = \hat{R}_2 = 0$ and $\hat{R}_3 = \cdots = \hat{R}_m = \frac{1}{2}$): upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the CSDB and the DDWUB together with the percentage of improvement.
Figure 9. Scenario B: upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the CSDB and the DDWUB together with the percentage of improvement.
Figure 10. Scenario C: upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the CSDB and the DDWUB together with the percentage of improvement.
Figure 11. Scenario A ($\hat{R}_1 = \hat{R}_2 = 0$ and $\hat{R}_3 = \cdots = \hat{R}_m = 1$): upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the CSDB and the CSDB+DDWUB together with the percentage of improvement.
Figure 12. Scenario A ($\hat{R}_1 = \hat{R}_2 = 0$ and $\hat{R}_3 = \cdots = \hat{R}_m = \frac{1}{2}$): upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the CSDB and the CSDB+DDWUB together with the percentage of improvement.
Figure 13. Scenario B: upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the CSDB and the CSDB+DDWUB together with the percentage of improvement.
Figure 14. Scenario C: upper bound of the generalization error of the hypothesis with the smallest empirical error computed with the CSDB and the CSDB+DDWUB together with the percentage of improvement.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Oneto, L.; Ridella, S. Distribution-Dependent Weighted Union Bound. Entropy 2021, 23, 101. https://doi.org/10.3390/e23010101

AMA Style

Oneto L, Ridella S. Distribution-Dependent Weighted Union Bound. Entropy. 2021; 23(1):101. https://doi.org/10.3390/e23010101

Chicago/Turabian Style

Oneto, Luca, and Sandro Ridella. 2021. "Distribution-Dependent Weighted Union Bound" Entropy 23, no. 1: 101. https://doi.org/10.3390/e23010101

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
