Article

Information-Theoretic Generalization Bounds for Meta-Learning and Applications

by
Sharu Theresa Jose
* and
Osvaldo Simeone
Department of Engineering, King’s College London, London WC2R 2LS, UK
*
Author to whom correspondence should be addressed.
Submission received: 17 November 2020 / Revised: 13 January 2021 / Accepted: 14 January 2021 / Published: 19 January 2021

Abstract:
Meta-learning, or “learning to learn”, refers to techniques that infer an inductive bias from data corresponding to multiple related tasks with the goal of improving the sample efficiency for new, previously unobserved, tasks. A key performance measure for meta-learning is the meta-generalization gap, that is, the difference between the average loss measured on the meta-training data and on a new, randomly selected task. This paper presents novel information-theoretic upper bounds on the meta-generalization gap. Two broad classes of meta-learning algorithms are considered that use either separate within-task training and test sets, like model-agnostic meta-learning (MAML), or joint within-task training and test sets, like Reptile. Extending the existing work for conventional learning, an upper bound on the meta-generalization gap is derived for the former class that depends on the mutual information (MI) between the output of the meta-learning algorithm and its input meta-training data. For the latter, the derived bound includes an additional MI between the output of the per-task learning procedure and the corresponding data set, which captures within-task uncertainty. Tighter bounds are then developed for the two classes via novel individual task MI (ITMI) bounds. Applications of the derived bounds are finally discussed, including a broad class of noisy iterative algorithms for meta-learning.

1. Introduction

As formalized by the “no free lunch theorem”, any effective learning procedure must be based on prior assumptions on the task of interest [1]. These include the selection of a model class and of the hyperparameters of a learning algorithm, such as weight initialization and learning rate. In conventional single-task learning, these assumptions, collectively known as inductive bias, are fixed a priori relying on domain knowledge or validation [1,2,3]. Fixing a suitable inductive bias can significantly reduce the sample complexity of the learning process, and is thus crucial to any learning procedure. The goal of meta-learning is to automatically infer the inductive bias, thereby learning to learn from past experiences via the observation of a number of related tasks, so as to speed up learning a new and unseen task [4,5,6,7,8].
In this work, we consider the meta-learning problem of inferring the hyperparameters of a learning algorithm. The learning algorithm (henceforth called base-learning algorithm or base-learner) is defined as a stochastic mapping $P_{W|Z^m, u}$ from the input training set $Z^m = (Z_1, \dots, Z_m)$ of $m$ samples to a model parameter $W \in \mathcal{W}$ for a fixed hyperparameter vector $u$. The meta-learning algorithm (or meta-learner) infers the hyperparameter vector $u$, which defines the inductive bias, by observing a finite number of related tasks.
For example, consider the well-studied algorithm of biased regularization for supervised learning [9,10]. Let us denote each data point $Z = (X, Y)$ as a tuple of input features $X \in \mathbb{R}^d$ and label $Y \in \mathbb{R}$. The loss function $l : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ is given as the quadratic measure $l(w, z) = (\langle w, x \rangle - y)^2$, which quantifies the loss accrued by the inferred model parameter $w$ on a data sample $z$. Corresponding to each per-task data set $Z^m$, the biased regularization algorithm $P_{W|Z^m, u}$ is a Kronecker delta function centered at the minimizer of the following optimization problem
$$\frac{1}{m}\sum_{j=1}^{m} l(w, Z_j) + \frac{\lambda}{2}\,\|w - u\|^2, \tag{1}$$
which corresponds to an empirical risk minimization problem with a biased regularizer. Here, $\lambda > 0$ is a regularization constant that weighs the deviation of the model parameter $w$ from a bias vector $u$. The bias vector $u$ can then be thought of as a common “mean” among related tasks. In the context of meta-learning, the objective is then to infer the bias vector $u$ by observing data sets from a number of similar, related tasks. Different meta-learning algorithms have been developed for this problem [11,12].
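For the quadratic loss above, the minimizer of (1) can be obtained in closed form by setting the gradient to zero, which gives the linear system $\big(\frac{2}{m} X^T X + \lambda I_d\big) w = \frac{2}{m} X^T y + \lambda u$. The following Python snippet is a minimal sketch of this base-learner (the function name and interface are our own illustration, not from the paper):

```python
import numpy as np

def biased_ridge_base_learner(X, y, u, lam):
    """Minimize (1/m) * sum_j (<w, x_j> - y_j)^2 + (lam/2) * ||w - u||^2.

    Setting the gradient (2/m) X^T (X w - y) + lam (w - u) to zero
    yields the linear system solved below.
    """
    m, d = X.shape
    A = (2.0 / m) * X.T @ X + lam * np.eye(d)
    b = (2.0 / m) * X.T @ y + lam * u
    return np.linalg.solve(A, b)

# Toy usage: with scarce data, the solution is pulled towards the bias
# vector u, which plays the role of a common "mean" among related tasks.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + 0.1 * rng.standard_normal(10)
w_hat = biased_ridge_base_learner(X, y, u=np.zeros(3), lam=0.5)
```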
In the meta-learning problem under study, we follow the standard setting of Baxter [13] and assume that the learning tasks belong to a task environment, which is defined by a probability distribution $P_T$ on the space of learning tasks $\mathcal{T}$, and per-task data distributions $\{P_{Z|T=\tau}\}_{\tau \in \mathcal{T}}$. The data set $Z^m$ for a task $\tau$ is then generated i.i.d. according to the distribution $P_{Z|T=\tau}$. The meta-learner observes the performance of the base-learner on the meta-training data from a finite number of meta-training tasks, which are sampled independently from the task environment, and infers the hyperparameter $U$ such that it can learn a new task, drawn from the same task environment, from fewer data samples.
The quality of the inferred hyperparameter $U$ is measured by the meta-generalization loss, $L_g(U)$, which is the average loss incurred on the data set $Z^m \sim P_{Z^m|T}$ of a new, previously unseen task $T$ sampled from the task distribution $P_T$. The notation will be formally introduced in Section 2.2. While the goal of meta-learning is to infer a hyperparameter $U$ that minimizes the meta-generalization loss $L_g(U)$, this is not computable, since the underlying task and data distributions are unknown. Instead, the meta-learner can evaluate an empirical estimate of the loss, $L_t(U|Z_{1:N}^m)$, using the meta-training set $Z_{1:N}^m$ of data from $N$ tasks, which is referred to as the meta-training loss. The difference between the meta-generalization loss and the meta-training loss is the meta-generalization gap,
$$\Delta L(U|Z_{1:N}^m) = L_g(U) - L_t(U|Z_{1:N}^m), \tag{2}$$
and measures how well the inferred hyperparameter U generalizes to a new, previously unseen task. In particular, if the meta-generalization gap is small, on average or with high probability, then the performance of the meta-learner on the meta-training set can be taken as a reliable estimate of the meta-generalization loss.
In this paper, we study information-theoretic upper bounds on the average meta-generalization gap $\mathbb{E}_{P_{Z_{1:N}^m} P_{U|Z_{1:N}^m}}[\Delta L(U|Z_{1:N}^m)]$, where the average is with respect to the meta-training set $Z_{1:N}^m$ and the meta-learner defined by the stochastic kernel $P_{U|Z_{1:N}^m}$. Specifically, we extend the recent line of work initiated by Russo and Zou [14], and Xu and Raginsky [15], which obtains mutual information (MI)-based bounds on the average generalization gap for conventional learning, to meta-learning. To the best of our knowledge, this is the first work that studies information-theoretic bounds for meta-learning.
The bounds on the average meta-generalization gap studied in this work are distinct from other well-known bounds on the meta-generalization gap in the literature. Broadly speaking, existing bounds can be grouped into two classes: high-probability, probably approximately correct (PAC) bounds, and high-probability PAC-Bayesian bounds. These upper bounds take the general form $\mathbb{E}_{P_{U|Z_{1:N}^m}}[\Delta L(U|Z_{1:N}^m)] \leq \epsilon$, and hold with probability at least $1 - \delta$, for $\delta \in (0, 1)$, over the meta-training set $Z_{1:N}^m$. In contrast, our work focuses on bounding $\mathbb{E}_{P_{Z_{1:N}^m}} \mathbb{E}_{P_{U|Z_{1:N}^m}}[\Delta L(U|Z_{1:N}^m)]$, i.e., on average also over the meta-training set. Notable PAC bounds on the meta-generalization gap include the bound of Baxter [13], obtained using the framework of Vapnik–Chervonenkis (VC) dimensions, and that of Maurer [16], which employs algorithmic stability [17,18] properties. The PAC-Bayesian bounds instead also incorporate prior beliefs on the base-learner and meta-learner posteriors via an auxiliary data-independent prior distribution $Q_{W|U}$ and a hyper-prior distribution $Q_U$, respectively. Most notably, PAC-Bayesian bounds include that of Pentina and Lampert [19], the tighter bound of Amit and Meir [20], and, most recently, the bounds of Rothfuss et al. [21]. While the high-probability bounds are agnostic to task and data distributions, our information-theoretic bounds depend explicitly on the task and per-task data distributions, on the loss function, and on the meta-training algorithm, in accordance with prior work on information-theoretic generalization bounds.
Another general property inherited from the information-theoretic approach adopted in this paper is that the bounds on the average meta-generalization gap under study are designed to hold for arbitrary base-learners and meta-learners. As such, they generally do not result in tighter bounds than non-information-theoretic generalization guarantees obtained for specific meta-learning problems, such as the ridge regression problem with a meta-learned bias vector mentioned above [22]. Rather, the general purpose of the bounds in this paper is to provide insights into the number of tasks and the number of samples per task required to ensure that the training-based metrics are a good approximation of their population counterparts.

1.1. Main Contributions

The derivation of bounds on the average meta-generalization gap differs from conventional learning owing to two levels of uncertainty: environment-level uncertainty and within-task uncertainty. While within-task uncertainty results from observing a finite number $m$ of data samples per task, as in conventional learning, environment-level uncertainty results from observing a finite number $N$ of tasks from the task environment. The relative importance of these two forms of uncertainty depends on the use the meta-learner makes of the meta-training data. In fact, depending on how the meta-training data are used by the meta-learner, we identify two main classes of meta-training algorithms: those with separate within-task training and test sets, and those with joint within-task training and test sets. The former class includes state-of-the-art meta-learning algorithms, such as model-agnostic meta-learning (MAML) [23], that split the training data corresponding to each task into training and test sets, with the latter reserved for within-task validation. In contrast, the second class of algorithms, such as Reptile [24], use the entire per-task data set both for training and testing. Our main contributions are as follows.
  • In Theorem 1, we show that, for the case with separate within-task training and test sets, the average meta-generalization gap contains only the contribution of environment-level uncertainty. This is captured by a ratio of the mutual information (MI) between the output $U$ of the meta-learner and the meta-training set $Z_{1:N}^m$, and the number of tasks $N$, as
$$\mathbb{E}_{P_{Z_{1:N}^m} P_{U|Z_{1:N}^m}}\big[\Delta L^{\text{sep}}(U|Z_{1:N}^m)\big] \leq \sqrt{\frac{2\sigma^2}{N}\, I(U; Z_{1:N}^m)}, \tag{3}$$
    where $\sigma^2$ is the sub-Gaussianity variance factor of the meta-loss function. This is a direct parallel of the MI-based bounds for single-task learning [25].
  • In Theorem 3, we then show that, for the case with joint within-task training and test sets, the bound on the average meta-generalization gap also contains a contribution due to the within-task uncertainty via the ratio of the MI between the output of the base-learner and the within-task training data, and the per-task data sample size $m$. Specifically, we have the following bound
$$\mathbb{E}_{P_{Z_{1:N}^m} P_{U|Z_{1:N}^m}}\big[\Delta L^{\text{joint}}(U|Z_{1:N}^m)\big] \leq \sqrt{\frac{2\sigma^2}{N}\, I(U; Z_{1:N}^m)} + \mathbb{E}_{P_T}\left[\sqrt{\frac{2\delta_T^2}{m}\, I(W; Z^m | T = \tau)}\right], \tag{4}$$
    where $\delta_T^2$ is the sub-Gaussianity variance factor of the loss function $l(w, z)$ for task $T$.
  • In Theorems 2 and 4, we extend the individual sample MI (ISMI) bound of [26] to obtain novel individual task MI (ITMI)-based bounds on the meta-generalization gap for both separate and joint within-task training and test sets as
$$\mathbb{E}_{P_{Z_{1:N}^m} P_{U|Z_{1:N}^m}}\big[\Delta L^{\text{sep}}(U|Z_{1:N}^m)\big] \leq \frac{1}{N}\sum_{i=1}^{N} \sqrt{2\sigma^2\, I(U; Z_i^m)}, \tag{5}$$
    and
$$\mathbb{E}_{P_{Z_{1:N}^m} P_{U|Z_{1:N}^m}}\big[\Delta L^{\text{joint}}(U|Z_{1:N}^m)\big] \leq \frac{1}{N}\sum_{i=1}^{N} \sqrt{2\sigma^2\, I(U; Z_i^m)} + \mathbb{E}_{P_T}\left[\frac{1}{m}\sum_{j=1}^{m} \sqrt{2\delta_T^2\, I(W; Z_j | T = \tau)}\right]. \tag{6}$$
These bounds can be seen to be tighter than the MI-based bounds in (3) and (4), respectively.
  • Finally, we study applications of the derived bounds to two meta-learning problems. The first is a parameter estimation setup that involves one-shot meta-learning and base-learning procedures, for which a closed-form expression for the meta-generalization gap can be derived. The second application covers a broad range of noisy iterative meta-learning algorithms and is inspired by the work of Pensia et al. [27] for conventional learning.

1.2. Related Work

For conventional learning, there exists a rich literature on diverse frameworks for deriving upper bounds on the generalization gap, i.e., on the difference between generalization and training losses. Classical bounds from statistical learning theory quantify the generalization gap in terms of measures of complexity of the model class, most notably the VC dimension [28] and Rademacher complexity [29]. This approach obtains high-probability, probably approximately correct (PAC) bounds on the generalization gap with respect to the training set. An alternative line of high-probability bounding techniques relies on the notion of algorithmic stability, which measures the sensitivity of the output of a learning algorithm to the replacement of individual samples in the training data set. The pioneering work [30] has been extended to include various notions of algorithmic stability [31,32,33]. As a notable example, a distributional notion of stability in terms of differential privacy, which quantifies the sensitivity of the distribution of the algorithm’s output to the data set, has been studied in [34,35]. The high-probability PAC-Bayesian bounds rely on change-of-measure arguments and use the Kullback–Leibler (KL) divergence between the algorithm and a data-independent prior to quantify the algorithmic sensitivity [36,37,38].
Following the initial work of Russo and Zou [14], information-theoretic bounds on the average generalization gap for conventional learning have been widely investigated in recent years. Xu and Raginsky [25] showed that the MI between the output of the learning algorithm and its training data set yields an upper bound in expectation on the generalization gap. The bound has been shown to offer computable generalization guarantees for noisy iterative algorithms, including stochastic gradient Langevin dynamics (SGLD), in [27]. Various refinements of the MI-based bound have since been analyzed to obtain tighter bounds. In particular, the bounds in [39] employ chaining mutual information techniques to tighten the bounds in [25], while the bound in [26] depends on the MI between the output of the algorithm and an individual data sample. The MI between the output of the algorithm and a random subset of the data set appears in the bounds introduced in [40]. The total variation information between the joint distribution of the training data and algorithmic output and the product of its marginals was shown in [41] to yield a bound on the generalization gap for any bounded loss function. Subsequent works in [42,43,44] consider other information-theoretic measures, such as maximal leakage and lautum information. Most recently, a conditional mutual information (CMI)-based approach has been proposed in [45] to develop generalization bounds.

1.3. Notation

Throughout this paper, upper-case letters, e.g., $X$, denote random variables and lower-case letters, e.g., $x$, their realizations. We use $\mathcal{P}(\cdot)$ to denote the set of all probability distributions on the argument set or vector space. For a discrete or continuous random variable $X$ taking values in a set or vector space $\mathcal{X}$, $P_X \in \mathcal{P}(\mathcal{X})$ denotes its probability distribution, with $P_X(x)$ being the probability mass or density value at $x \in \mathcal{X}$. We denote as $P_X^n$ the $n$-fold product distribution induced by $P_X$. The conditional distribution of a random variable $X$ given random variable $Y$ is similarly defined as $P_{X|Y}$, with $P_{X|Y}(x|y)$ representing the probability mass or density at $X = x$ conditioned on the event $Y = y$. We use $\|\cdot\|_2$ to denote the Euclidean norm of the argument vector, and $I_d$ to denote the $d$-dimensional identity matrix. We define the Kronecker delta as $\delta(x - x_0) = 1$ if $x = x_0$ and $\delta(x - x_0) = 0$ otherwise.

2. Problem Definition

In this section, we define the problem of interest by introducing the key definitions of generalization gap for conventional, or single-task, learning and for meta-learning.

2.1. Generalization Gap for Single-Task Learning

Consider first the conventional problem of learning a task $\tau \in \mathcal{T}$.
As illustrated in Figure 1, each task $\tau \in \mathcal{T}$ is associated with an underlying unknown data distribution, $P_{Z|T=\tau} \in \mathcal{P}(\mathcal{Z})$, defined on a subset or vector space $\mathcal{Z}$. Henceforth, we use $P_{Z|\tau}$ to denote $P_{Z|T=\tau}$ for notational convenience.
The training procedure, which is referred to as the base-learner, has access to a training data set $Z^m = (Z_1, Z_2, \dots, Z_m) \sim P_{Z^m|\tau}$ of $m$ independent and identically distributed (i.i.d.) samples drawn from distribution $P_{Z|\tau}$. The base-learner uses this data set to choose a model, or hypothesis, $W$ from the model class $\mathcal{W}$ by using a randomized training procedure defined by a conditional distribution $P_{W|Z^m, u}$ as
$$W \sim P_{W|Z^m, u}. \tag{7}$$
The conditional distribution $P_{W|Z^m, u}$ defines a stochastic mapping from the training data set $Z^m$ to the model class $\mathcal{W}$. The training procedure (7) is parameterized by a vector $u \in \mathcal{U}$ of hyperparameters, which defines the inductive bias. As an example, the base-learner $P_{W|Z^m, u}$ may follow stochastic gradient descent (SGD) updates with hyperparameters $u$, including the learning rate and the initialization point.
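To make the abstraction concrete, the following Python sketch (our own illustration, not code from the paper) implements such a stochastic base-learner for the quadratic loss, where the hyperparameter vector $u$ collects the initialization and the learning rate, and the randomness of the mapping $P_{W|Z^m, u}$ stems from mini-batch shuffling:

```python
import numpy as np

def sgd_base_learner(X, y, u, epochs=20, batch_size=5, seed=None):
    """Stochastic mapping from a training set Z^m = (X, y) to a model
    parameter W, for the quadratic loss l(w, (x, y)) = (<w, x> - y)^2.
    The hyperparameter u = (w_init, lr) fixes the inductive bias."""
    w_init, lr = u
    rng = np.random.default_rng(seed)
    w = np.array(w_init, dtype=float)
    m = X.shape[0]
    for _ in range(epochs):
        # Random mini-batch order makes the output W a random variable.
        for idx in np.array_split(rng.permutation(m), max(m // batch_size, 1)):
            grad = (2.0 / len(idx)) * X[idx].T @ (X[idx] @ w - y[idx])
            w -= lr * grad
    return w
```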
The performance of a parameter vector $w \in \mathcal{W}$ on a data sample $z \in \mathcal{Z}$ is measured by a loss function $l : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}^+$. The generalization loss for a model parameter vector $w \in \mathcal{W}$ is the average
$$L_g(w|\tau) = \mathbb{E}_{P_{Z|\tau}}[l(w, Z)], \tag{8}$$
over a test example $Z$ independently drawn from the data distribution $P_{Z|\tau}$. The subscript $g$ is used to distinguish the generalization loss from the training loss defined below. The generalization loss cannot be computed by the learner, given that the data distribution $P_{Z|\tau}$ is unknown. Instead, the learner can evaluate the training loss on the data set $Z^m$, which is defined as the empirical average
$$L_t(w|Z^m) = \frac{1}{m}\sum_{i=1}^{m} l(w, Z_i). \tag{9}$$
The subscript t specifies that the loss is the empirical training loss.
The difference between the generalization loss (8) and the training loss (9) is known as the generalization gap,
$$\Delta L(w|Z^m, \tau) = L_g(w|\tau) - L_t(w|Z^m), \tag{10}$$
and is a key metric that quantifies the level of uncertainty (this type of uncertainty is known as epistemic) at the learner regarding the data distribution $P_{Z|\tau}$. The average generalization gap for the data distribution $P_{Z|\tau}$ and base-learner $P_{W|Z^m, u}$ is defined as
$$\mathbb{E}_{P_{Z^m, W|\tau, u}}\big[\Delta L(W|Z^m, \tau)\big], \tag{11}$$
where the expectation is taken with respect to the joint distribution $P_{Z^m, W|\tau, u} = P_{Z^m|\tau} P_{W|Z^m, u}$. A summary of the variables involved in the definition of the generalization gap (11) can be found in Figure 1.
Intuitively, if the generalization gap is small, on average or with high probability, then the base-learner can take the performance (9) on the training set Z m as a reliable measure of the generalization loss (8) of the trained model W. Furthermore, data-dependent bounds on the generalization gap can be used as regularization terms to avoid overfitting, yielding generalized Bayesian inference problems [46,47].
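As an illustration of definition (11), the average generalization gap can be estimated by Monte Carlo sampling whenever the task distribution can be simulated. The sketch below (our own; `sample_data`, `base_learner`, and `loss` are placeholder callables) approximates the population loss (8) with a large held-out sample:

```python
import numpy as np

def avg_generalization_gap(sample_data, base_learner, loss, m,
                           n_trials=200, n_test=10_000, seed=0):
    """Monte Carlo estimate of E[L_g(W|tau) - L_t(W|Z^m)] in (11).

    sample_data(n, rng): draws n i.i.d. samples from P_{Z|tau}.
    base_learner(Z, rng): a draw W ~ P_{W|Z^m,u}.
    loss(W, z): the loss l(w, z).
    """
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(n_trials):
        Z_train = sample_data(m, rng)                 # Z^m ~ P_{Z|tau}
        W = base_learner(Z_train, rng)                # W ~ P_{W|Z^m,u}
        Z_test = sample_data(n_test, rng)             # fresh samples for (8)
        L_t = np.mean([loss(W, z) for z in Z_train])  # training loss (9)
        L_g = np.mean([loss(W, z) for z in Z_test])   # estimate of (8)
        gaps.append(L_g - L_t)
    return float(np.mean(gaps))
```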

2.2. Generalization Gap for Meta-Learning

As discussed, in single-task learning, the inductive bias $u$, defining the hyperparameters of the training procedure, must be selected a priori, i.e., without having access to task-specific data. The inductive bias determines the training data set size $m$ needed to ensure a small generalization loss (8), since, generally speaking, richer models require more data to be trained [1]. The sample complexity can generally be reduced if one selects a suitable inductive bias based on prior information. Such prior information is typically obtained from domain knowledge on the problem under study. In contrast, meta-learning aims at automatically inferring an effective inductive bias based on data from related tasks.
To elaborate, we follow the setting of [13], in which a meta-learner observes data from a number of tasks, known as meta-training tasks, from the same task environment. A task environment is defined by a task distribution $P_T \in \mathcal{P}(\mathcal{T})$, supported on the space $\mathcal{T}$ of tasks, and by a per-task data distribution $P_{Z|\tau}$ for each task $\tau \in \mathcal{T}$. Using the meta-training data drawn from a randomly selected subset of tasks, the meta-learner infers a hyperparameter vector $u \in \mathcal{U}$ defining the inductive bias. This is done with the goal of ensuring that, using hyperparameter $u$, the base-learner $P_{W|Z^m, u}$ can efficiently learn on a new task, referred to as the meta-test task, drawn independently from the same task distribution $P_T$.
The meta-training data consist of $N$ data sets $Z_{1:N}^m = (Z_1^m, \dots, Z_N^m)$. Each $i$th data set is generated independently by first drawing a task $T_i \sim P_T$ from the task environment and then a task-specific training data set $Z_i^m \sim P_{Z^m|T_i}$. The meta-learner uses the meta-training data set $Z_{1:N}^m$ to infer a hyperparameter vector $u$. To this end, we consider a randomized meta-learner
$$U \sim P_{U|Z_{1:N}^m}, \tag{12}$$
where $P_{U|Z_{1:N}^m}$ is a stochastic mapping from the meta-training set $Z_{1:N}^m$ to the space $\mathcal{U}$ of hyperparameters. We distinguish two different formulations of meta-learning that are often considered in the literature. In the first, the per-task data set $Z^m$ is split into a training (or support) subset and a test (or query) subset [23,48]; in the second, the entire data set $Z^m$ is used for both within-task training and testing [13,19,20].

2.2.1. Separate Within-Task Training and Test Sets

As seen in Figure 2, in this first approach to meta-learning, each meta-training sub-data set $Z_i^m$ is split into a training set and a test set as $Z_i^m = (Z_i^{m_{tr}}, Z_i^{m_{te}})$, where $Z_i^{m_{tr}}$ contains $m_{tr}$ i.i.d. training examples and $Z_i^{m_{te}}$ contains $m_{te}$ i.i.d. test examples, with $m = m_{tr} + m_{te}$. The within-task base-learner $P_{W|Z_i^{m_{tr}}, u} \in \mathcal{P}(\mathcal{W})$ maps the per-task training subset $Z_i^{m_{tr}}$ to a random model parameter $W_i \sim P_{W|Z_i^{m_{tr}}, u}$ for a given hyperparameter $U = u$. The test subset is used to evaluate the empirical training loss of a model $w$ for task $T_i$ as
$$L_t(w|Z_i^{m_{te}}) = \frac{1}{m_{te}}\sum_{j=1}^{m_{te}} l(w, Z_{i,j}^{m_{te}}), \tag{13}$$
where $Z_{i,j}^{m_{te}}$ denotes the $j$th example of the test subset $Z_i^{m_{te}}$. Furthermore, the overall empirical meta-training loss for a hyperparameter $u$ is computed by averaging over all meta-training tasks as
$$L_t^{\text{sep}}(u|Z_{1:N}^m) = \frac{1}{N}\sum_{i=1}^{N} L_t^{\text{sep}}(u|Z_i^m), \tag{14}$$
where
$$L_t^{\text{sep}}(u|Z^m) = \mathbb{E}_{P_{W|Z^{m_{tr}}, u}}\big[L_t(W|Z^{m_{te}})\big] \tag{15}$$
is the average per-task training loss over the base-learner.
We emphasize that the meta-training loss (14) can be computed by the meta-learner and used as a criterion to select the meta-learning procedure (12), since it is obtained from the meta-training data $Z_{1:N}^m$. We also note that the rationale for splitting training and test sets is that the average training loss $L_t^{\text{sep}}(u|Z_i^m)$ is an unbiased estimate of the corresponding average generalization loss $\mathbb{E}_{P_{W|Z_i^{m_{tr}}, u}}[L_g(W|T_i)]$.
The true goal of the meta-learner is to minimize the meta-generalization loss,
$$L_g^{\text{sep}}(u) = \mathbb{E}_{P_{T, Z^{m_{tr}}}} \mathbb{E}_{P_{W|Z^{m_{tr}}, u}}\big[L_g(W|T)\big], \tag{16}$$
where $P_{T, Z^{m_{tr}}} = P_T P_{Z^{m_{tr}}|T}$, and $L_g(W|T)$ is as defined in (8). Unlike the meta-training loss (14), the meta-generalization loss is evaluated on a new, meta-test task $T$ and on the corresponding training data $Z^{m_{tr}}$. We distinguish the meta-generalization loss and the meta-training loss by the subscripts $g$ and $t$ in (16) and (14), respectively. The difference between the meta-generalization loss (16) and the meta-training loss (14), known as the meta-generalization gap, is defined as
$$\Delta L^{\text{sep}}(u|Z_{1:N}^m) = L_g^{\text{sep}}(u) - L_t^{\text{sep}}(u|Z_{1:N}^m). \tag{17}$$
The quantity of interest to us is the average meta-generalization gap, defined as
$$\mathbb{E}_{P_{Z_{1:N}^m, U}}\big[\Delta L^{\text{sep}}(U|Z_{1:N}^m)\big], \tag{18}$$
where the expectation is with respect to the joint distribution $P_{Z_{1:N}^m, U} = P_{Z_{1:N}^m} P_{U|Z_{1:N}^m}$ of the meta-training set $Z_{1:N}^m$ and of the hyperparameter $U$. Note that $P_{Z_{1:N}^m}$ is the marginal of the joint distribution $\prod_{i=1}^{N} P_{T_i} P_{Z_i^m|T_i}$.
Intuitively, if the meta-generalization gap is small, on average or with high probability, the meta-learner can take the performance (14) on the meta-training data as a reliable measure of the accuracy of the inferred hyperparameter vector in terms of the meta-generalization loss (16). Furthermore, data-dependent bounds on the meta-generalization gap can be used as regularization terms to avoid meta-overfitting. Meta-overfitting occurs when the meta-trained hyperparameter yields a small meta-training loss but a large meta-test loss, due to an excessive dependence on the meta-training set [13].
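The following Python sketch (our own illustration; the placeholder callables mirror the notation of this subsection) computes the empirical meta-training loss (14) for a fixed hyperparameter $u$, averaging the query loss (13) over draws of $W \sim P_{W|Z^{m_{tr}}, u}$ as in (15):

```python
import numpy as np

def meta_training_loss_sep(meta_train, base_learner, loss, m_tr,
                           n_draws=10, seed=0):
    """Empirical meta-training loss (14) with separate support/query sets.

    meta_train: list of N per-task data sets Z_i^m (sequences of samples).
    base_learner(Z_tr, rng): a draw W ~ P_{W|Z^{m_tr},u}; u is baked in.
    loss(W, z): the loss l(w, z).
    """
    rng = np.random.default_rng(seed)
    per_task = []
    for Z in meta_train:
        Z_tr, Z_te = Z[:m_tr], Z[m_tr:]        # split Z_i^m = (support, query)
        draws = []
        for _ in range(n_draws):
            W = base_learner(Z_tr, rng)
            draws.append(np.mean([loss(W, z) for z in Z_te]))  # query loss (13)
        per_task.append(np.mean(draws))        # estimate of (15)
    return float(np.mean(per_task))            # average over tasks, (14)
```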

2.2.2. Joint Within-Task Training and Test Sets

In the second formulation of meta-learning, as illustrated in Figure 3, the entire data set $Z_i^m$ is used for within-task training and testing. Accordingly, the meta-learner computes the meta-training loss
$$L_t^{\text{joint}}(u|Z_{1:N}^m) = \frac{1}{N}\sum_{i=1}^{N} L_t^{\text{joint}}(u|Z_i^m), \tag{19}$$
where
$$L_t^{\text{joint}}(u|Z^m) = \mathbb{E}_{P_{W|Z^m, u}}\big[L_t(W|Z^m)\big] \tag{20}$$
is the average per-task training loss. Note that, in evaluating the meta-training loss in (19), the data set $Z_i^m$ is used both to infer the model parameters $W$ and to evaluate the per-task training loss. The expectation in (20) is taken over the output of the base-learner $W$ given the hyperparameter vector $u$. As discussed, the meta-generalization loss for a hyperparameter $u \in \mathcal{U}$ is computed by randomly selecting a novel task $T \sim P_T$ as
$$L_g^{\text{joint}}(u) = \mathbb{E}_{P_{T, Z^m}} \mathbb{E}_{P_{W|Z^m, u}}\big[L_g(W|T)\big], \tag{21}$$
where $P_{T, Z^m} = P_T P_{Z^m|T}$ and $L_g(W|T)$ is as defined in (8). In a manner similar to (17), the meta-generalization gap for a task distribution $P_T$, data distribution $P_{Z^m|T}$, meta-learning algorithm $P_{U|Z_{1:N}^m}$, and base-learner $P_{W|Z^m, U}$ is defined as
$$\Delta L^{\text{joint}}(u|Z_{1:N}^m) = L_g^{\text{joint}}(u) - L_t^{\text{joint}}(u|Z_{1:N}^m). \tag{22}$$
The average meta-generalization gap is then given as $\mathbb{E}_{P_{Z_{1:N}^m, U}}[\Delta L^{\text{joint}}(U|Z_{1:N}^m)]$, where the expectation is taken over all meta-training sets and over the output of the meta-learner.

3. Information-Theoretic Generalization Bounds for Single-Task Learning

In this section, we review two information-theoretic bounds on the generalization gap (11) for conventional learning derived in [25,26]. The material covered in this section provides the necessary background for the analysis of the meta-generalization gap in the rest of the paper. Throughout this section, we fix a task $\tau \in \mathcal{T}$. Since the generalization and meta-generalization gaps measure the deviation of empirical-mean random variables, representing training and meta-training losses, from reference values, we will make use of tools and definitions from large-deviation theory (see, e.g., [49]). We discuss the essential definitions below.

3.1. Preliminaries

To start, the cumulant generating function (CGF) of a random variable $X \sim P_X \in \mathcal{P}(\mathcal{X})$ is defined as $\Lambda_X(\lambda) = \log \mathbb{E}_{P_X}\big[e^{\lambda(X - \mathbb{E}_{P_X}[X])}\big]$. When it is well-defined, the CGF $\Lambda_X(\lambda)$ is convex and satisfies the equalities $\Lambda_X(0) = \Lambda'_X(0) = 0$. A random variable $X$ with finite mean, i.e., with $\mathbb{E}_{P_X}[X] < \infty$, is said to be $\sigma^2$-sub-Gaussian if its CGF is bounded as
$$\Lambda_X(\lambda) \leq \frac{\lambda^2 \sigma^2}{2}, \quad \text{for all } \lambda \in \mathbb{R}. \tag{23}$$
As a special case, if $X$ is bounded in the interval $[a, b]$, i.e., if the inequality $-\infty < a \leq X \leq b < \infty$ holds for some constants $a$ and $b$, then $X$ is $(b-a)^2/4$-sub-Gaussian.
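This bounded case follows from Hoeffding's lemma, which we recall here for completeness: for any $X \in [a, b]$ and $\lambda \in \mathbb{R}$,

$$\Lambda_X(\lambda) = \log \mathbb{E}_{P_X}\big[e^{\lambda(X - \mathbb{E}_{P_X}[X])}\big] \leq \frac{\lambda^2 (b-a)^2}{8} = \frac{\lambda^2}{2} \cdot \frac{(b-a)^2}{4},$$

so that (23) holds with $\sigma^2 = (b-a)^2/4$.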

3.2. Mutual Information (MI) Bound

We first present the mutual information (MI)-based upper bound obtained in [25]. Key to this result is the following assumption.
Assumption 1.
The loss function $l(w, Z)$ is $\delta_\tau^2$-sub-Gaussian under $Z \sim P_{Z|\tau}$ for all model parameters $w \in \mathcal{W}$.
In particular, if the loss function is bounded, i.e., if the inequalities $-\infty < a \leq l(w, z) \leq b < \infty$ hold for all $w \in \mathcal{W}$ and $z \in \mathcal{Z}$, then Assumption 1 is satisfied with $\delta_\tau^2 = (b-a)^2/4$. The main result is as follows.
Lemma 1
([25]). Under Assumption 1, the following bound on the generalization gap holds for any base-learner $W \sim P_{W|Z^m, u}$:
$$\big|\mathbb{E}_{P_{Z^m, W|\tau, u}}[\Delta L(W|Z^m, \tau)]\big| \leq \sqrt{\frac{2\delta_\tau^2}{m}\, I(W; Z^m)}. \tag{24}$$
The proof of Lemma 1 is based on a decoupling estimate lemma, which is reported for completeness as Lemma A1. We also note that the result in Lemma 1 can be extended to account for loss functions $l(w, Z)$ with bounded CGF [14].
The bound (24) on the generalization gap is in terms of the mutual information $I(W; Z^m)$, which quantifies the overall dependence between the base-learner output $W$ and the input training data set $Z^m$. The mutual information in (24) is hence a measure of the sensitivity of the base-learner output to the data set. Using the terminology in [25], if $I(W; Z^m) \leq \epsilon$, the base-learner $P_{W|Z^m, u}$ is said to be $(\epsilon, P_{Z|\tau})$-MI stable, in which case the bound in (24) evaluates to $\sqrt{2\delta_\tau^2 \epsilon / m}$. The relationship between generalization and stability of a training algorithm is well established [1], and the result (24) amounts to a formulation of this link in information-theoretic terms.
The traditional notion of algorithmic stability measures how much the base-learner output changes with the replacement of an individual training sample [30,50]. In the next section, we review the bound in [26] that translates this per-sample stability concept within an information-theoretic framework.

3.3. Individual Sample MI (ISMI) Bound

The MI-based bound in Lemma 1 has the disadvantage of being vacuous, i.e., $I(W; Z^m) = \infty$, for deterministic base-learning algorithms $P_{W|Z^m, u}$ defined on a continuous parameter space $\mathcal{W}$. An individual sample MI (ISMI)-based bound that addresses this shortcoming was introduced in [26]. The ISMI bound borrows the standard algorithmic stability notion of sensitivity of the base-learner output to the replacement of any individual training sample [17,18]. Accordingly, the resulting bound is in terms of the MI between the trained parameter $W$ and each data point $Z_i$ of the training data set $Z^m$. The bound, summarized in Lemma 2, applies under the following assumption.
Assumption 2.
The loss function $l(w, z)$ satisfies either of the following two conditions:
(a) 
Assumption 1, or
(b) 
$l(W, Z)$ is a $\delta_\tau^2$-sub-Gaussian random variable when $(W, Z) \sim P_{W|u, \tau} P_{Z|\tau}$, where $P_{W|u, \tau} \in \mathcal{P}(\mathcal{W})$ is the marginal of the joint distribution $P_{W|Z^m, u} P_{Z^m|\tau}$.
We note that, in general, Assumption 1 does not imply Assumption 2(b) (see ([40], Appendix C)), and vice versa (see [26]). There are, however, loss functions $l(w, z)$ and relevant distributions for which both assumptions hold, including the case of loss functions $l(\cdot, \cdot)$ that take values in a bounded interval $[a, b]$.
Lemma 2
([26]). Under Assumption 2, the following bound on the average generalization gap holds for any base-learner $P_{W|Z^m, u}$:
$$\big|\mathbb{E}_{P_{Z^m, W|\tau, u}}[\Delta L(W|Z^m, \tau)]\big| \leq \frac{1}{m}\sum_{i=1}^{m} \sqrt{2\delta_\tau^2\, I(W; Z_i)}. \tag{25}$$
For a loss function satisfying Assumption 1, the ISMI bound (25) is tighter than (24), i.e.,
$$\frac{1}{m}\sum_{i=1}^{m} \sqrt{2\delta_\tau^2\, I(W; Z_i)} \leq \sqrt{\frac{2\delta_\tau^2}{m}\, I(W; Z^m)}. \tag{26}$$
The inequality in (26) follows from the chain rule of mutual information and Jensen’s inequality [26].

4. Information-Theoretic Generalization Bounds for Meta-Learning

In this section, we first derive novel MI-based bounds on the meta-generalization gap with separate within-task training and test sets, as introduced in Section 4.1, and then we consider joint within-task training and test sets, as described in Section 4.2.

4.1. Bounds on Meta-Generalization Gap with Separate Within-Task Training and Test Sets

In this section, we present two novel information-theoretic bounds on the meta-generalization gap (18) for the setup with separate within-task training and test sets. The first is an MI-based bound, which is akin to Lemma 1, and the second is an individual task MI (ITMI) bound, which resembles Lemma 2 for conventional learning.

4.1.1. MI-Based Bound

In order to derive the MI-based bound, we make the following assumption on $L_t^{\text{sep}}(u|Z^m)$ in (15). Throughout, we use $P_{Z^m}$ to denote the marginal of the joint distribution $P_{T, Z^m} = P_T P_{Z^m|T}$.
Assumption 3.
For all $u \in \mathcal{U}$, the average per-task training loss $L_t^{\text{sep}}(u|Z^m)$ is $\sigma^2$-sub-Gaussian under $Z^m \sim P_{Z^m}$.
Distinct from the assumptions in Section 3 on the loss function $l(w, z)$, Assumption 3 is on the average per-task training loss $L_t^{\text{sep}}(u|Z^m)$. This is because a loss function $l(w, z)$ satisfying Assumption 1 does not in general guarantee the sub-Gaussianity of $L_t^{\text{sep}}(u|Z^m)$ with respect to $Z^m \sim P_{Z^m}$. However, if the loss function is bounded, Assumption 3 can be easily verified to hold, as given in the following lemma.
Lemma 3.
If the loss function $l(\cdot, \cdot)$ is bounded in $[a, b]$, then $L_t^{\text{sep}}(\cdot|Z^m)$ is also bounded in $[a, b]$ for all $Z^m \in \mathcal{Z}^m$. Consequently, $L_t^{\text{sep}}(u|Z^m)$ is $(b-a)^2/4$-sub-Gaussian under $Z^m \sim P_{Z^m}$ for all $u \in \mathcal{U}$.
Under Assumption 3, the following theorem presents an upper bound on the meta-generalization gap (18).
Theorem 1.
Let Assumption 3 hold for the base-learner $P_{W|Z^{m_{tr}}, u}$. Then, for any meta-learner $P_{U|Z_{1:N}^m}$ such that the inequality $I(U; Z_{1:N}^m) < \infty$ holds, we have the following bound on the average meta-generalization gap:
$$\big|\mathbb{E}_{P_{Z_{1:N}^m, U}}\big[\Delta L^{\text{sep}}(U|Z_{1:N}^m)\big]\big| \leq \sqrt{\frac{2\sigma^2}{N}\, I(U; Z_{1:N}^m)}. \tag{27}$$
Proof. 
See Appendix B. □
The technical lemmas required for the proof of Theorem 1 and the theorems that follow are included in Appendix A.
In order to prove Theorem 1, one needs to overcome an additional challenge as compared to the derivation of bounds for conventional learning reviewed in Section 3. In fact, the meta-generalization gap is caused by two distinct sources of uncertainty: (a) environment-level uncertainty due to the finite number $N$ of observed tasks, and (b) within-task uncertainty resulting from the finite number $m$ of per-task data samples. Our proof approach involves applying the single-task MI-based bound in Lemma 1 to bound the effect of both sources of uncertainty.
Towards this, we start by introducing the average training loss for the randomly selected meta-test task as
$$L_{g,t}^{\text{sep}}(u) = \mathbb{E}_{P_{T, Z^m}}\big[L_t^{\text{sep}}(u|Z^m)\big]. \tag{28}$$
The subscript $g,t$ indicates that the loss is a generalization ($g$) loss at the environment level, with an expectation over $P_{T, Z^m}$, and a training ($t$) loss at the task level through $L_t^{\text{sep}}(u|Z^m)$. Note that this differs from the meta-test loss $L_g^{\text{sep}}(u)$ in (16) in that the per-task loss in (28) is evaluated on the training set. With this definition, the meta-generalization gap can be decomposed as
$$\mathbb{E}_{P_{Z_{1:N}^m, U}}\big[\Delta L^{\text{sep}}(U|Z_{1:N}^m)\big] = \mathbb{E}_{P_{Z_{1:N}^m, U}}\Big[\big(L_g^{\text{sep}}(U) - L_{g,t}^{\text{sep}}(U)\big) + \big(L_{g,t}^{\text{sep}}(U) - L_t^{\text{sep}}(U|Z_{1:N}^m)\big)\Big]. \tag{29}$$
In (29), the second difference $L_{g,t}^{\text{sep}}(U) - L_t^{\text{sep}}(U|Z_{1:N}^m)$ corresponds to the environment-level uncertainty and arises from the observation of a finite number $N$ of tasks. In fact, as $N$ increases, the meta-training loss $L_t^{\text{sep}}(u|Z_{1:N}^m)$ almost surely tends to $L_{g,t}^{\text{sep}}(u)$ by the law of large numbers. However, the average $\mathbb{E}_{P_{Z_{1:N}^m, U}}\big[L_{g,t}^{\text{sep}}(U) - L_t^{\text{sep}}(U|Z_{1:N}^m)\big]$ is not equal to zero in general for finite values of $N$. The within-task generalization gap is instead measured by the first difference $L_g^{\text{sep}}(U) - L_{g,t}^{\text{sep}}(U)$. In the setup under study with separate within-task training and test sets, this term equals zero, since, as we discussed, the average empirical loss $L_t^{\text{sep}}(u|Z_i^m)$ is an unbiased estimate of the corresponding average test loss $\mathbb{E}_{P_{W|Z_i^{m_{tr}}, u}}[L_g(W|T_i)]$ (cf. (28)). This is no longer true for joint within-task training and test sets, as we discuss in Section 4.2.
The decomposition approach adopted here follows the main steps of the bounding technique introduced in ([16], Equation (6)). In contrast, the PAC-Bayesian bounds in [20,21] rely on a different decomposition of the meta-generalization gap: the environment-level and within-task generalization gaps are separately bounded in high probability, and are then combined via a union bound to obtain the desired PAC-Bayesian bounds.
The bound (27) relates the meta-generalization gap to the information-theoretic stability of the meta-training procedure. As first introduced here, this stability is measured by the MI $I(U; Z_{1:N}^m)$ between the hyperparameter $U$ and the meta-training data set $Z_{1:N}^m$, in a manner similar to the MI-based bound in Lemma 1 for conventional learning. Importantly, as we will discuss in Section 4.2, this direct parallel between learning and meta-learning no longer applies with joint within-task training and test data sets.

4.1.2. ITMI Bound

We now present the ITMI bound, which holds under the following assumption.
Assumption 4.
Either of the following assumptions on the average per-task training loss $L_t^{\text{sep}}(u|Z^m)$ holds:
(a) 
$L_t^{\text{sep}}(u|Z^m)$ satisfies Assumption 3, or
(b) 
$L_t^{\text{sep}}(U|Z^m)$ is $\sigma^2$-sub-Gaussian under $(U, Z^m) \sim P_U P_{Z^m}$, where $P_U$ is the marginal of the joint distribution $P_{Z_{1:N}^m, U}$ and $P_{Z^m}$ is the marginal of the joint distribution $P_{T, Z^m}$.
Assumption 4 can be seen to be implied by the sufficient conditions in Lemma 3.
Theorem 2.
Let Assumption 4 hold for the base-learner $P_{W|Z^{m_{tr}}, U}$. Then, for any meta-learner $P_{U|Z_{1:N}^m}$, the following bound on the meta-generalization gap (18) holds:
$$\mathbb{E}_{P_{Z_{1:N}^m, U}}\big[\Delta L^{\text{sep}}(U|Z_{1:N}^m)\big] \leq \frac{1}{N}\sum_{i=1}^{N} \sqrt{2\sigma^2\, I(U; Z_i^m)}, \tag{30}$$
where the MI $I(U; Z_i^m)$ is computed with respect to the joint distribution $P_{Z_i^m, U}$ obtained by marginalizing the probability distribution $P_{Z_{1:N}^m, U}$.
Proof. 
See Appendix B. □
As can be seen from (30), the ITMI bound on the meta-generalization gap is in terms of the MI $I(U; Z_i^m)$ between the output $U$ of the meta-learner and each per-task data set $Z_i^m$. This, in turn, quantifies the sensitivity of the meta-learner output to the replacement of a single per-task data set. Moreover, under Assumption 3, the ITMI bound (30) yields a tighter bound than the MI-based bound (27). This can be seen from the following sequence of relations
$$\frac{1}{N}\, I(U; Z_{1:N}^m) = \frac{1}{N}\sum_{i=1}^{N} I(U; Z_i^m | Z^{(i-1)m}) \tag{31a}$$
$$\overset{(a)}{\geq} \frac{1}{N}\sum_{i=1}^{N} I(U; Z_i^m) \tag{31b}$$
$$\overset{(b)}{\geq} \left(\frac{1}{N}\sum_{i=1}^{N} \sqrt{I(U; Z_i^m)}\right)^2, \tag{31c}$$
where $Z^{(i-1)m} = (Z_1^m, \dots, Z_{i-1}^m)$; $(a)$ follows since $Z_i^m$ is independent of $Z^{(i-1)m}$; and $(b)$ follows from Jensen's inequality.

4.2. Bounds on Meta-Generalization Gap with Joint Within-Task Training and Test Sets

We now derive MI- and ITMI-based bounds on the meta-generalization gap in (22) for the case with joint within-task training and test sets. As we will see, the key difference with respect to the case with separate within-task training and test sets is that the uncertainty due to the finite number of per-task samples, measured by the first term in the decomposition (29), contributes in a non-negligible way to the meta-generalization gap. Since there is no split into separate within-task training and test sets, the average training loss with respect to the learning algorithm is given by $L_t^{\text{joint}}(u|Z^m)$ in (20).

4.2.1. MI-Based Bound

In order to derive the MI-based bound, we make the following assumptions.
Assumption 5.
We consider the following assumptions.
(a) 
For each task $\tau \in \mathcal{T}$, the loss function $l(w, Z)$ satisfies Assumption 1, and
(b) 
The average per-task training loss $L_t^{\text{joint}}(u|Z^m)$ in (20) is $\sigma^2$-sub-Gaussian for all $u \in \mathcal{U}$ when $Z^m \sim P_{Z^m}$.
An easily verifiable sufficient condition for the above assumptions to hold is the boundedness of the loss function $l(w, z)$, which follows in a manner similar to Lemma 3.
Theorem 3.
Let Assumption 5 hold for a base-learner $W \sim P_{W|Z^m, U}$. Then, for any meta-learner $P_{U|Z_{1:N}^m}$, we have the following bound on the meta-generalization gap (22):
$$\mathbb{E}_{P_{Z_{1:N}^m, U}}\big[\Delta L^{\text{joint}}(U|Z_{1:N}^m)\big] \leq \sqrt{\frac{2\sigma^2}{N}\, I(U; Z_{1:N}^m)} + \mathbb{E}_{P_T}\left[\sqrt{\frac{2\delta_T^2}{m}\, I(W; Z^m | T = \tau)}\right], \tag{32}$$
where the MI $I(W; Z^m | T = \tau)$ is evaluated with respect to the distribution $P_{Z^m, W|T=\tau}$ obtained by marginalizing the joint distribution $P_{W|Z^m, U} P_{Z_{1:N}^m, U} P_{Z^m|T=\tau}$.
Proof. 
See Appendix C. □
With joint within-task training and test sets, the bound (32) on the meta-generalization gap contains the contributions of two mutual information terms. The first, $I(U; Z_{1:N}^m)$, quantifies the sensitivity of the meta-learner output $U$ to the meta-training data set $Z_{1:N}^m$. This term also appeared in the bound (27) with separate within-task training and test sets; decomposing the meta-generalization gap in a manner analogous to (29), it corresponds to a bound on the average of the second difference. The second contribution, $I(W; Z^m | T = \tau)$, quantifies the sensitivity of the output of the base-learner $P_{W|Z^m, U}$ to the data set $Z^m$ of the meta-test task $T$ when the hyperparameter is randomly selected by the meta-learner $P_{U|Z_{1:N}^m}$ using the meta-training set $Z_{1:N}^m$. This second term is in line with the single-task generalization bound (24), and it bounds the corresponding first difference in the decomposition (29).
We finally note that the dependence of the bound in (32) on the number of tasks $N$ and the number of per-task samples $m$ is of the order $1/\sqrt{N} + 1/\sqrt{m}$. Meta-generalization bounds with a similar dependence have been derived in [20] using PAC-Bayesian arguments. The bounds on the excess risk for representation learning also follow a similar order of dependence on $N$ and $m$ (cf. ([51], Theorem 2)).

4.2.2. ITMI Bound on (22)

For deriving the ITMI bound on the meta-generalization gap (22), we assume the following.
Assumption 6.
Either of the following assumptions hold:
(a) 
Assumption 5 holds, or
(b) 
For each task $\tau \in \mathcal{T}$, the loss function $l(W, Z)$ is $\delta_\tau^2$-sub-Gaussian when $(W, Z) \sim P_{W|\tau} P_{Z|\tau}$, where $P_{W|\tau}$ is the marginal of the joint distribution $P_{W|Z^m, U} P_{Z_{1:N}^m, U} P_{Z^m|\tau}$. The average per-task training loss $L_t^{\text{joint}}(U|Z^m)$ is $\sigma^2$-sub-Gaussian when $(U, Z^m) \sim P_U P_{Z^m}$.
As in Section 4.1.2, Assumption 6 can be seen to be implied by the sufficient conditions in Lemma 3.
Theorem 4.
Under Assumption 6, for any meta-learner $P_{U|Z_{1:N}^m}$, the following bound holds on the average meta-generalization gap:
$$\mathbb{E}_{P_{Z_{1:N}^m, U}}\big[\Delta L^{\text{joint}}(U|Z_{1:N}^m)\big] \leq \frac{1}{N}\sum_{i=1}^{N} \sqrt{2\sigma^2\, I(U; Z_i^m)} + \mathbb{E}_{P_T}\left[\frac{1}{m}\sum_{j=1}^{m} \sqrt{2\delta_T^2\, I(W; Z_j | T = \tau)}\right], \tag{33}$$
where the MI $I(U; Z_i^m)$ is evaluated with respect to $P_{Z_i^m, U}$ obtained by marginalizing $P_{Z_{1:N}^m, U}$, and the MI $I(W; Z_j | T = \tau)$ is evaluated with respect to $P_{Z_j, W|T=\tau}$ obtained by marginalizing $P_{Z^m, W|T=\tau}$.
Proof. 
See Appendix C. □
Similar to the bound in (32), the bound on the meta-generalization gap in (33) is in terms of two types of mutual information, the first describing the sensitivity of the meta-learner and the second the sensitivity of the base-learner. Specifically, the MI $I(U; Z_i^m)$ quantifies the sensitivity of the output of the meta-learner to the per-task data set $Z_i^m$, and the MI $I(W; Z_j | T = \tau)$ measures the sensitivity of the output of the base-learner $P_{W|Z^m, U}$ to each data sample $Z_j$ within the training set $Z^m$ of the meta-test task $T$. Moreover, it can be shown, in a manner similar to (31c), that, under Assumption 5, the ITMI bound in (33) is tighter than the MI bound in (32).

4.3. Discussion on Bounds

The bounds on the average meta-generalization gap obtained in this section generalize the bounds for conventional single-task learning in Section 3. To see this, consider the task distribution $P_T = \delta(T - \tau)$ concentrated at some task $\tau \in \mathcal{T}$. Recall that, in conventional learning, the hyperparameter $u$ is fixed a priori. As such, the mutual information terms $I(U; Z_{1:N}^m)$ (for the MI-based bounds) and $I(U; Z_i^m)$ (for the ITMI-based bounds) vanish. For separate within-task training and test sets, this implies that the average meta-generalization gap is zero, which follows since the per-task test loss $L_t(W|Z_i^{m_{te}})$ is an unbiased estimate of the per-task generalization loss $L_g(W|T_i)$. The MI- and ITMI-based bounds for the joint within-task training and test sets then reduce to
$$\mathbb{E}_{P_{Z^m, W|\tau, u}}\big[\Delta L(W|Z^m, \tau)\big] \leq \sqrt{\frac{2\delta_\tau^2}{m}\, I(W; Z^m)}, \tag{34}$$
and
$$\mathbb{E}_{P_{Z^m, W|\tau, u}}\big[\Delta L(W|Z^m, \tau)\big] \leq \frac{1}{m}\sum_{j=1}^{m} \sqrt{2\delta_\tau^2\, I(W; Z_j)}, \tag{35}$$
respectively, where $I(W; Z^m)$ is evaluated with respect to the joint distribution $P_{W, Z^m|\tau, u}$ and $I(W; Z_j)$ with respect to $P_{W, Z_j|\tau, u}$.
The MI- and ITMI-based bounds derived in this section indicate that a weaker dependence between the hyperparameters and the meta-training set, and thus a smaller mutual information $I(U; Z_{1:N}^m)$, improves the meta-generalization gap, although this may seem deleterious to performance. To clarify this apparent contradiction, we emphasize that these bounds quantify the difference between the meta-generalization loss and the empirical training loss, which in turn depends on the sensitivity of the meta-learner and the base-learner to their input meta-training set and per-task training set, respectively. The mutual information terms in our bounds capture these sensitivities. Consequently, our bounds suggest that a meta-learner that depends too strongly on the input meta-training set (i.e., for which $I(U; Z_{1:N}^m)$ is large) does not generalize well (i.e., yields a large meta-generalization gap). This property aligns with previous information-theoretic analyses of generalization in conventional learning [25].
To the best of our knowledge, the MI- and ITMI-based bounds studied here are the first bounds on the average meta-generalization gap. As discussed in the Introduction, these bounds are distinct from the high-probability PAC and PAC-Bayesian bounds previously studied for meta-learning. Consequently, the bounds studied in this work are not directly comparable with the existing high-probability bounds.
Finally, we note that similarity between tasks is crucial to meta-learning. If the per-task data distributions $P_{Z|T=\tau}$ in the task environment are ‘closer’ to each other, a meta-learner can more efficiently learn the shared characteristics of the tasks and can generalize well to new tasks from the task environment. In our setting, the statistical properties of the task environment $(P_T, \{P_{Z|T=\tau}\}_{\tau \in \mathcal{T}})$ dictate this similarity. Although our MI- and ITMI-based bounds do not explicitly capture this, the properties of the task environment are implicitly accounted for by the mutual information terms $I(U; Z_{1:N}^m)$ and $I(U; Z_i^m)$, where the meta-training data set $Z_{1:N}^m$ is generated from the task environment, and also by the sub-Gaussianity requirements in Assumptions 3–6. From preliminary studies, we believe that information-theoretic bounds that explicitly capture the impact of task similarity require a different performance metric than the average meta-generalization gap considered here; this direction is left for future work.

5. Applications

In this section, we consider two applications of the information-theoretic bounds proposed in Section 4. The first, simpler, example concerns a parameter estimation problem for which an optimized meta-learner can be obtained in closed form. In contrast, the second application covers a broad class of iterative meta-training schemes.

5.1. Parameter Estimation

To illustrate the bounds on the meta-generalization gap derived in Section 4, we first consider the problem of prediction for a Bernoulli process with a ‘soft’ predictor that uses only a few samples from the process, as well as meta-training data. Towards this, we consider an arbitrary discrete finite set of tasks $\mathcal{T} = \{\tau_1, \dots, \tau_M\}$. The data distribution $P_{Z|T=\tau_k}$ for each task $\tau_k \in \mathcal{T}$, $k \in \{1, \dots, M\}$, is given as $\text{Bernoulli}(\mu_{\tau_k})$ with mean parameter $\mu_{\tau_k}$. The task distribution $P_T$ is then defined over the finite set of mean parameters $\{\mu_{\tau_1}, \dots, \mu_{\tau_M}\}$. The base-learner uses training data, distributed i.i.d. according to $\text{Bernoulli}(\mu_{\tau_k})$, to determine the parameter $W_k$, which is used as a predictor of a new observation $Z \sim \text{Bernoulli}(\mu_{\tau_k})$ at test time. The loss function is defined as $l(w, z) = (w - z)^2$, measuring the quadratic error between the prediction and the realized test input $z$. Note that the optimal (Bayes) predictor, computable in the ideal case of a known distribution $P_{Z|T=\tau_k}$, is given as $W = \mu_{\tau_k}$. We now distinguish the two cases with separate and joint within-task training and test sets.

5.1.1. Separate Within-Task Training and Test Sets

The base-learner $P_{W|Z_k^{m_{tr}}, u}$ for task $\tau_k \in \mathcal{T}$ deterministically selects the prediction
$$W_k = \alpha D_k^{m_{tr}} + (1 - \alpha) u, \tag{36}$$
where $D_k^{m_{tr}} = \frac{1}{m_{tr}} \sum_{j=1}^{m_{tr}} Z_{k,j}^{m_{tr}}$ is an empirical average over the training set $Z_k^{m_{tr}}$, $u$ is a hyperparameter defining a bias that can be meta-trained, and $\alpha \in [0, 1)$ is a fixed scalar. Here, $Z_{k,j}^{m_{tr}}$ denotes the $j$th data sample in the training set $Z_k^{m_{tr}}$ of task $\tau_k$. The bias term in (36) may help approximate the ideal Bayes predictor in the presence of limited data $Z_k^{m_{tr}}$.
The objective of the meta-learner is to infer the hyperparameter $u$. For a given meta-training data set $Z_{1:N}^m$, comprising data sets from $N$ tasks sampled according to $P_T$, the meta-learner can compute the empirical meta-training loss as
$$L_t^{\text{sep}}(u|Z_{1:N}^m) = \frac{1}{N}\sum_{k=1}^{N} \frac{1}{m_{te}} \sum_{j=1}^{m_{te}} \big(W_k - Z_{k,j}^{m_{te}}\big)^2, \tag{37}$$
where $Z_{k,j}^{m_{te}}$ denotes the $j$th example in the test set of $Z_k^m$, the $k$th sub-data set of $Z_{1:N}^m$. The meta-learner $P_{U|Z_{1:N}^m}$ then deterministically selects the hyperparameter $u$ that minimizes the empirical meta-training loss in (37). This optimization yields
$$U = (1 - \alpha)^{-1} \frac{1}{N}\sum_{k=1}^{N} \left(D_k^{m_{te}} - \alpha D_k^{m_{tr}}\right), \tag{38}$$
where $D_k^{m_{te}} = \frac{1}{m_{te}}\sum_{j=1}^{m_{te}} Z_{k,j}^{m_{te}}$. Note that $D_k^{m_{te}}$ and $D_k^{m_{tr}}$ are (scaled) binomial random variables and, by (38), $U$ takes finitely many discrete values and is bounded as $-\alpha(1-\alpha)^{-1} \leq U \leq (1-\alpha)^{-1}$. The meta-test loss can be explicitly computed as
$$L_g^{\text{sep}}(u) = (1 - \alpha)^2 \big(u^2 - 2u\, \mathbb{E}_{P_T}[\mu_T]\big) + \mathbb{E}_{P_T}\left[\alpha^2 \left(\mu_T^2 + \frac{\mu_T \bar{\mu}_T}{m_{tr}}\right) + \mu_T - 2\alpha \mu_T^2\right], \tag{39}$$
where $\bar{\mu}_T = 1 - \mu_T$, and the average meta-generalization gap evaluates to
$$\mathbb{E}_{P_{Z_{1:N}^m, U}}\big[\Delta L^{\text{sep}}(U|Z_{1:N}^m)\big] = \frac{2(1 + \alpha^2)}{N}\, \mathrm{Var}_T + \frac{2\, \mathbb{E}_{P_T}[\mu_T \bar{\mu}_T]}{N} \left(\frac{1}{m_{te}} + \frac{\alpha^2}{m_{tr}}\right), \tag{40}$$
where $\mathrm{Var}_T = \mathbb{E}_{P_T}[\mu_T^2] - (\mathbb{E}_{P_T}[\mu_T])^2$ is the variance of $\mu_T$.
To compute the MI- and ITMI-based bounds on the meta-generalization gap (40), it is easy to verify that the average training loss $L_t^{\text{sep}}(\cdot|Z^m)$ is bounded, i.e., $0 \leq L_t^{\text{sep}}(u|Z^m) \leq (1+\alpha)^2$ for all $u \in \mathcal{U}$ and $Z^m \in \mathcal{Z}^m$. Thus, Assumption 3 for the MI bound, and also Assumption 4 for the ITMI bound, hold with $\sigma^2 = (1+\alpha)^4/4$. For the MI bound, we note that, since the meta-learner is deterministic, we have $I(U; Z_{1:N}^m) = H(U)$. The ITMI bound (30) is given as
$$\big|\mathbb{E}_{P_{Z_{1:N}^m, U}}\big[\Delta L^{\text{sep}}(U|Z_{1:N}^m)\big]\big| \leq \frac{1}{N}\sum_{i=1}^{N} \sqrt{\frac{(1+\alpha)^4}{2}\, I(U; Z_i^m)}. \tag{41}$$
The information-theoretic measures in (41) can be evaluated numerically as discussed in Appendix D.
For a numerical illustration, Figure 4 plots the average meta-generalization loss (39) and the average meta-training loss (A16), along with the ITMI bound in (41) and the MI bound in (27). It can be seen that the ITMI bound is tighter than the MI bound and correctly predicts the decrease in the meta-generalization gap as the number $N$ of tasks increases.
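The closed-form expressions above also make this example easy to simulate. The following Python sketch (our own; not the code used to produce Figure 4) estimates the average meta-generalization gap by drawing meta-training sets, applying the meta-learner (38) and the predictors (36), and comparing the empirical loss (37) with the closed-form meta-test loss (39):

```python
import numpy as np

def simulate_gap_sep(mu_tasks, p_T, N, m_tr, m_te, alpha,
                     n_trials=5000, seed=0):
    """Monte Carlo estimate of the average meta-generalization gap for the
    Bernoulli example with separate within-task training and test sets."""
    rng = np.random.default_rng(seed)
    mu_tasks = np.asarray(mu_tasks, dtype=float)
    p_T = np.asarray(p_T, dtype=float)
    E_mu = p_T @ mu_tasks
    # Constant (u-independent) part of the meta-test loss (39).
    const = p_T @ (alpha**2 * (mu_tasks**2
                               + mu_tasks * (1 - mu_tasks) / m_tr)
                   + mu_tasks - 2 * alpha * mu_tasks**2)
    gaps = []
    for _ in range(n_trials):
        mus = rng.choice(mu_tasks, size=N, p=p_T)         # T_i ~ P_T
        D_tr = rng.binomial(m_tr, mus) / m_tr             # training averages
        D_te = rng.binomial(m_te, mus) / m_te             # test averages
        U = np.mean(D_te - alpha * D_tr) / (1 - alpha)    # meta-learner (38)
        W = alpha * D_tr + (1 - alpha) * U                # predictors (36)
        # Meta-training loss (37); uses Z^2 = Z for Bernoulli samples.
        L_t = np.mean(W**2 - 2 * W * D_te + D_te)
        L_g = (1 - alpha)**2 * (U**2 - 2 * U * E_mu) + const   # (39)
        gaps.append(L_g - L_t)
    return float(np.mean(gaps))
```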

5.1.2. Joint Within-Task Training and Test Sets

We now consider the case with joint within-task training and test sets. The base-learner $P_{W|Z_k^m, U}$ for task $\tau_k \in \mathcal{T}$ still uses the predictor (36), but the empirical average is now computed over the entire data set as $D_k = \frac{1}{m}\sum_{j=1}^{m} Z_{k,j}^m$. As before, the meta-learner $P_{U|Z_{1:N}^m}$ deterministically selects the hyperparameter $u$ that minimizes the empirical meta-training loss, $L_{Z_{1:N}^m}(u) = \frac{1}{N}\sum_{k=1}^{N} \frac{1}{m}\sum_{j=1}^{m} (W_k - Z_{k,j}^m)^2$, yielding $U = \frac{1}{N}\sum_{k=1}^{N} D_k$. As discussed in Appendix D, the meta-generalization loss for this example can also be explicitly computed, and the meta-generalization gap bounds in (32) and (33) can be evaluated numerically. Figure 5 plots the average meta-generalization loss and the average meta-training loss, along with the MI bound in (32) and the ITMI bound in (A18), as a function of the number $m$ of per-task data samples. The ITMI bound is seen to better reflect the decrease of the meta-training loss as a function of $m$.

5.2. Noisy Iterative Meta-Learning Algorithms

Most meta-learning algorithms are built around a nested-loop structure, with the inner loop applying the base-learner on the meta-training set and the outer loop updating the hyperparameters $U$. In this section, we focus on a broad class of such meta-learning algorithms in which the inner loop applies training procedures that depend on the current iterate of the hyperparameter, while the outer loop updates the hyperparameter using a stochastic rule. This class includes stochastic variants of state-of-the-art algorithms such as MAML [23] and Reptile [24]. We apply the derived information-theoretic bounds to study the meta-generalization performance of this class of iterative stochastic meta-training rules, focusing on the case of separate within-task training and test sets, as assumed, e.g., by MAML. The analysis for the setup with joint within-task training and test sets can be carried out analogously at the cost of a more cumbersome notation.
To start, let $U_j \in \mathbb{R}^d$ denote the hyperparameter vector at outer iteration $j$, with $U_0 \in \mathbb{R}^d$ being an arbitrary initialization. For example, in MAML, the hyperparameter $U$ defines the initial iterate used by each base-learner in the inner loop to update the model parameter $W_\tau$ corresponding to task $\tau$. At each iteration $j \geq 1$, we randomly select a mini-batch of task indices $\mathcal{K}_j \subseteq \{1, \dots, N\}$ from the meta-training data $Z_{1:N}^m$, obtaining the corresponding data set $Z_{\mathcal{K}_j}^m = (Z_{\mathcal{K}_j}^{m_{tr}}, Z_{\mathcal{K}_j}^{m_{te}}) \subseteq Z_{1:N}^m$, where $Z_{\mathcal{K}_j}^{m_{tr}} = \{Z_k^{m_{tr}}\}_{k \in \mathcal{K}_j}$ and $Z_{\mathcal{K}_j}^{m_{te}} = \{Z_k^{m_{te}}\}_{k \in \mathcal{K}_j}$ are the separate training and test sets for the selected tasks. For each index $k \in \mathcal{K}_j$, in the inner loop, the base-learner selects the model parameter $W_k^j$ as a possibly stochastic function
$$W_k^j = g(U_{j-1}, Z_k^{m_{tr}}). \tag{42}$$
For instance, in MAML, the function $g(U_{j-1}, Z_k^{m_{tr}}) \in \mathbb{R}^d$ in (42) represents the output of an SGD procedure that starts from the initialization $U_{j-1}$ and uses the task training data $Z_k^{m_{tr}}$ to iteratively update the model parameters, producing the final iterate $W_k^j$. We denote as $W_{\mathcal{K}_j} = \{W_k^j\}_{k \in \mathcal{K}_j}$ the collection of the base-learners' outputs for all task indices $k \in \mathcal{K}_j$ at outer iteration $j$.
In the outer loop, the meta-learner uses the task-specific adapted parameters $W_{\mathcal{K}_j}$ from the inner loop and the meta-test set $Z_{\mathcal{K}_j}^{m_{te}}$ to update the past iterate $U_{j-1}$ according to the general update rule
$$U_j = F(U_{j-1}) + \beta_j\, G(U_{j-1}, W_{\mathcal{K}_j}, Z_{\mathcal{K}_j}^{m_{te}}) + \xi_j, \tag{43}$$
where $F(\cdot)$ and $G(\cdot, \cdot, \cdot)$ are arbitrary deterministic functions; $\beta_j$ is the step size; and $\xi_j \sim \mathcal{N}(0, \gamma_j^2 I_d)$ is an isotropic Gaussian noise, independently drawn for $j = 1, 2, \dots$. As an example, in MAML, the function $F(\cdot)$ is the identity function and the function $G(\cdot, \cdot, \cdot)$ equals the gradient of the empirical loss $\frac{1}{|\mathcal{K}_j|}\sum_{k \in \mathcal{K}_j} L_t^{\text{sep}}(W_k^j|Z_k^{m_{te}})$ in (14) with respect to $U_{j-1}$. Note, however, that MAML does not add noise, i.e., $\gamma_j^2 = 0$ for all $j$.
The final output of the meta-learning algorithm is then defined as an arbitrary function $U = f(U_1, \dots, U_J)$ of all iterates. Examples of the function $f$ include the last update $f(U_1, \dots, U_J) = U_J$ and the average of the updates $f(U_1, \dots, U_J) = \frac{1}{J}\sum_{j=1}^{J} U_j$. A graphical model representation of the variables involved is shown in Figure 6.
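In Python pseudocode, the nested-loop structure of (42) and (43) can be sketched as follows (signatures and names are our own; `g`, `F`, `G`, and `f` mirror the functions defined in the text):

```python
import numpy as np

def noisy_iterative_meta_learner(meta_train, g, F, G, f, U0,
                                 betas, gammas, batch_size, seed=0):
    """Skeleton of the noisy iterative meta-learning class (42)-(43).

    meta_train: list of N per-task data sets, each split as (Z_tr, Z_te).
    g(U, Z_tr, rng): possibly stochastic base-learner, Equation (42).
    F(U), G(U, W_batch, Z_te_batch): deterministic update maps in (43).
    f(iterates): final output rule, e.g. last iterate or running average.
    betas[j], gammas[j]: step size and noise standard deviation at step j.
    """
    rng = np.random.default_rng(seed)
    N, d = len(meta_train), U0.shape[0]
    U, iterates = U0.copy(), []
    for beta, gamma in zip(betas, gammas):
        K = rng.choice(N, size=batch_size, replace=False)  # task mini-batch K_j
        Z_tr = [meta_train[k][0] for k in K]               # support sets
        Z_te = [meta_train[k][1] for k in K]               # query sets
        W = [g(U, z, rng) for z in Z_tr]                   # inner loop (42)
        xi = gamma * rng.standard_normal(d)                # xi_j ~ N(0, gamma^2 I_d)
        U = F(U) + beta * G(U, W, Z_te) + xi               # outer update (43)
        iterates.append(U.copy())
    return f(iterates)                                     # final output U
```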
We now derive an upper bound on the meta-generalization gap for the general class of iterative meta-learning algorithms satisfying (42) and (43) under the following assumptions.
Assumption 7.
(1) 
For the base-learner given in (42), the average training loss $L_t^{sep}(u|Z^m)$ in (15) is $\sigma^2$-sub-Gaussian for all $u$ when $Z^m \sim P_{Z^m}$;
(2) 
The meta-training data set $Z_{\mathcal{K}_j}^m$ sampled at each iteration $j$ is conditionally independent of the history of model-parameter vectors $\{W_{\mathcal{K}_i}\}_{i=1}^{j-1}$ and hyperparameters $U^{(j-1)} = (U_1, U_2, \dots, U_{j-1})$, i.e.,
$P_{Z_{\mathcal{K}_j}^m | \{Z_{\mathcal{K}_i}^m\}_{i=1}^{j-1}, Z_{1:N}^m, U^{(j-1)}, \{W_{\mathcal{K}_i}\}_{i=1}^{j-1}} = P_{Z_{\mathcal{K}_j}^m | \{Z_{\mathcal{K}_i}^m\}_{i=1}^{j-1}, Z_{1:N}^m};$   (44)
(3) 
The meta-parameter update function $G(\cdot,\cdot,\cdot)$ is uniformly bounded, i.e., $\|G(\cdot,\cdot,\cdot)\|_2 \le L$ for some $L > 0$.
Lemma 4.
Under Assumption 7, the following upper bound on the meta-generalization gap (18) holds for the class of noisy iterative meta-training algorithms (42) and (43):
$\mathbb{E}_{P_{Z_{1:N}^m, U}}[\Delta L^{sep}(U|Z_{1:N}^m)] \le \sqrt{\dfrac{2\sigma^2}{N} \displaystyle\sum_{j=1}^J \dfrac{d}{2} \log\left(1 + \dfrac{\beta_j^2 L^2}{d \gamma_j^2}\right)}.$   (45)
Proof. 
See Appendix E. □
The bound in (45) has the same form as the generalization gap bound derived in [27] for conventional learning. From (45), the meta-generalization gap can be reduced by increasing the variance $\gamma_j^2$ of the injected Gaussian noise. In particular, the bound depends on the ratios $\beta_j^2/\gamma_j^2$ between the squared step size $\beta_j^2$ and the noise variance $\gamma_j^2$. For example, SGLD sets $\gamma_j = \beta_j$, with a step size $\beta_j$ decaying over time according to the standard Robbins–Monro conditions, in order to ensure convergence of the output samples to the generalized posterior distribution of the hyperparameters [52].
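For concreteness, the right-hand side of (45) can be evaluated numerically for any given schedules $\{\beta_j\}$ and $\{\gamma_j\}$. The short sketch below does so; the constants $\sigma^2$, $L$, $d$, and $N$, as well as the SGLD-style schedule in the usage example, are illustrative assumptions.

```python
# A sketch that evaluates the right-hand side of the bound (45).
import numpy as np

def meta_gen_gap_bound(beta, gamma, L, d, sigma2, N):
    beta, gamma = np.asarray(beta), np.asarray(gamma)
    # sum over outer iterations of (d/2) * log(1 + beta_j^2 L^2 / (d gamma_j^2))
    mi_sum = np.sum(0.5 * d * np.log1p(beta ** 2 * L ** 2 / (d * gamma ** 2)))
    return np.sqrt(2.0 * sigma2 * mi_sum / N)

# Illustrative usage: SGLD-style coupling gamma_j = beta_j with decaying steps.
J = 200
beta = 0.25 / np.sqrt(np.arange(1, J + 1))
print(meta_gen_gap_bound(beta, gamma=beta, L=1.0, d=3, sigma2=0.25, N=4))
```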
Example: To illustrate bound (45), we now consider a simple logistic regression problem that generalizes the example studied in Section 5.1. Each data point $Z$ corresponds to a labelled pair $Z = (X, Y)$, where $X \in \{0,1\}^d$ represents the input vector and $Y \in \{0,1\}$ represents the corresponding binary label. The data distribution $P_{Z|\tau_k} = P_{X|\tau_k} P_{Y|X,\tau_k}$ for each task $\tau_k \in \mathcal{T} = \{\tau_1, \dots, \tau_M\}$ is such that $X \sim P_{X|\tau_k}$ is a $d$-dimensional Bernoulli vector obtained via $d$ independent draws from Bernoulli$(\nu)$, and $Y$ is distributed as $Y \sim$ Bernoulli$(\phi(\mu_{\tau_k}^T X))$, where $\phi(a) = 1/(1 + \exp(-a))$ is the sigmoid function and $\mu_{\tau_k} \in \mathbb{R}^d$, with $\|\mu_{\tau_k}\|_2 \le 1$. The task distribution $P_T$ then defines a distribution over the parameter vectors $\{\mu_{\tau_1}, \dots, \mu_{\tau_M}\}$. The base-learner uses training data generated i.i.d. from $P_{Z|\tau_k}$ to obtain a prediction $w$ of the parameter vector $\mu_{\tau_k}$ for task $\tau_k \in \mathcal{T}$. The loss function is taken as the quadratic error $l(w, z) = (\phi(w^T x) - y)^2$.
At each iteration $j$, starting from the initialization point $U_{j-1}$, the base-learner in (42) performs one step of projected gradient descent on the training data set $Z_k^{m_{tr}}$ to obtain the prediction $W_k^j$ as
$W_k^j = \mathrm{proj}_{\mathcal{W}}\left(U_{j-1} - \alpha \nabla_w L_t^{sep}(w|Z_k^{m_{tr}})\big|_{w = U_{j-1}}\right),$   (46)
where $\alpha > 0$ is the step size, $\mathcal{W} = \{w \in \mathbb{R}^d : \|w\|_2 \le 1\}$ is the set of feasible model parameters, and $\mathrm{proj}_{\mathcal{A}}(b) = \arg\min_{a \in \mathcal{A}} \frac{1}{2}\|a - b\|_2^2$ is the projection operator. The meta-learner (43) updates the initialization vector according to the noisy gradient descent rule
$U_j = U_{j-1} - \beta_j \dfrac{1}{|\mathcal{K}_j|} \displaystyle\sum_{k \in \mathcal{K}_j} \nabla_w L_t^{sep}(w|Z_k^{m_{te}})\big|_{w = W_k^j} + \xi_j,$   (47)
where $\beta_j$ is the step size and $\xi_j \sim \mathcal{N}(0, \gamma_j^2 I_d)$ is isotropic Gaussian noise. This update rule corresponds to performing first-order MAML (FOMAML) [23] with the addition of noise.
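The following sketch implements the noisy FOMAML updates (46) and (47) for this example. The representation of tasks as tuples of training/test arrays and the helper names are our illustrative choices; the gradient expression follows by differentiating the empirical quadratic loss built from $l(w, z) = (\phi(w^T x) - y)^2$.

```python
# A minimal sketch of noisy FOMAML, i.e., the updates (46) and (47), for the
# logistic regression example; each task is a tuple (Xtr, Ytr, Xte, Yte), and
# here the mini-batch K_j contains all tasks, as in the experiment reported below.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_loss(w, X, Y):
    # gradient of (1/m) * sum_j (sigmoid(w^T x_j) - y_j)^2 with respect to w
    p = sigmoid(X @ w)
    return (2.0 / len(Y)) * (X.T @ ((p - Y) * p * (1.0 - p)))

def project(w):
    # projection onto the feasible set W = {w : ||w||_2 <= 1}
    n = np.linalg.norm(w)
    return w if n <= 1.0 else w / n

def noisy_fomaml(tasks, d, J, alpha, beta, gamma, seed=0):
    rng = np.random.default_rng(seed)
    U = np.zeros(d)
    for _ in range(J):
        grads = []
        for (Xtr, Ytr, Xte, Yte) in tasks:
            W = project(U - alpha * grad_loss(U, Xtr, Ytr))  # inner step (46)
            grads.append(grad_loss(W, Xte, Yte))
        U = U - beta * np.mean(grads, axis=0) \
              + rng.normal(0.0, gamma, size=d)               # outer step (47)
    return U
```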
For this problem, it is easy to verify that Assumption 7 is satisfied. The loss function $l(\cdot,\cdot)$ is bounded in the interval $[0,1]$, whereby $L_t^{sep}(u|Z^m)$ is also $[0,1]$-bounded and hence $\sigma^2$-sub-Gaussian with $\sigma^2 = 1/4$. We also have the inequality
$\left\| \dfrac{1}{|\mathcal{K}_j|} \displaystyle\sum_{k \in \mathcal{K}_j} \nabla_w L_t^{sep}(w|Z_k^{m_{te}})\big|_{w = W_k^j} \right\|_2 \le \dfrac{2\sqrt{d}}{e} \triangleq L.$   (48)
The MI bound in (45) then evaluates to
$\mathbb{E}_{P_{Z_{1:N}^m, U}}[\Delta L^{sep}(U|Z_{1:N}^m)] \le \sqrt{\dfrac{1}{2N} \displaystyle\sum_{j=1}^J \dfrac{d}{2} \log\left(1 + \dfrac{4\beta_j^2}{e^2 \gamma_j^2}\right)}.$   (49)
We now evaluate the meta-training and meta-test losses, along with the bound (49), as a function of the ratio $\gamma_j^2/\beta_j^2$ in Figure 7. For the experiment, we considered a task environment of $M = 20$ tasks with $\nu = 0.4$, $d = 3$, and $N = 4$ meta-training tasks with $m_{tr} = 10$ training samples and $m_{te} = 5$ test samples per task. For the inner loop (46), we fixed the step size $\alpha = 10^{-4}$; for the outer loop (47), we set $|\mathcal{K}_j| = N$, $\beta_j = 0.25$, and $J = 200$ iterations.
As suggested by Lemma 4, the meta-generalization gap decreases with the addition of noise. While the MI bound (45) is generally loose, it correctly quantifies the dependence of the meta-generalization loss on the ratio $\gamma_j^2/\beta_j^2$, and it can hence serve as a useful meta-training criterion [20,48].

6. Conclusions

This work has presented novel information-theoretic upper bounds on the average generalization gap of meta-learning algorithms, thereby extending the well-studied information-theoretic approaches for conventional learning to meta-learning. The proposed bounds capture two sources of uncertainty, namely environment-level uncertainty and within-task uncertainty, and control them via separate mutual information terms. Applications were also discussed, with the aim of elucidating the use of the bounds to quantify meta-overfitting and to guide the choice of the meta-inductive bias, i.e., of the class of inductive biases. The derived bounds are amenable to further refinements, such as those along the lines of [39,40,45]. It would also be interesting to study meta-generalization bounds for noisy iterative meta-learning algorithms using tighter information-theoretic bounds such as [26,40].

Author Contributions

Conceptualization, formal analysis, writing—original draft, S.T.J.; conceptualization, funding acquisition and supervision, O.S. Both authors have read and agreed to the published version of the manuscript.

Funding

The authors have received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Programme (Grant Agreement No. 725731).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Decoupling Estimate Lemmas

The proofs of the main results rely on the following decoupling estimate lemmas, which bound the difference in expectations of a function under a change of measure from the joint distribution $P_{X,Y}$ to the product of the marginals $P_X P_Y$. In order to state the most general form of the decoupling estimate lemmas, we first define a generalized sub-Gaussian random variable.
Definition A1.
A random variable $X$ is said to be $(\Psi_+, \Psi_-, b_+, b_-)$-generalized sub-Gaussian if there exist convex functions $\Psi_+ : \mathbb{R}_+ \to \mathbb{R}$ and $\Psi_- : \mathbb{R}_+ \to \mathbb{R}$ that satisfy the equalities $\Psi_+(0) = \Psi_-(0) = \Psi_+'(0) = \Psi_-'(0) = 0$ and bound the CGF of $X$ as
$\Lambda_X(\lambda) \le \Psi_+(\lambda), \quad \text{for } \lambda \in [0, b_+),$   (A1)
$\Lambda_X(\lambda) \le \Psi_-(-\lambda), \quad \text{for } \lambda \in (b_-, 0],$   (A2)
for some constants $b_+ > 0$ and $b_- < 0$.
For a $(\Psi_+, \Psi_-, b_+, b_-)$-generalized sub-Gaussian random variable, we also introduce the following standard definitions. First, the Legendre dual of the function $\Psi_+(\lambda)$ is defined as
$\Psi_+^*(x) = \displaystyle\sup_{\lambda \in [0, b_+)} \left(\lambda x - \Psi_+(\lambda)\right).$   (A3)
It can be easily seen that $\Psi_+^*(\cdot)$ is a non-negative, convex, and non-decreasing function on $[0, \infty)$ with $\Psi_+^*(0) = 0$. Second, the inverse Legendre dual of the function $\Psi_+(\lambda)$ is defined as $\Psi_+^{*-1}(y) = \inf\{x \ge 0 : \Psi_+^*(x) \ge y\}$. This function is concave, and it can be equivalently written as [26]
$\Psi_+^{*-1}(y) = \displaystyle\inf_{\lambda \in [0, b_+)} \dfrac{y + \Psi_+(\lambda)}{\lambda}.$
Similar definitions and results apply for $\Psi_-(\cdot)$.
A $\sigma^2$-sub-Gaussian random variable $X$ is a generalized sub-Gaussian variable with $\Psi_+(\lambda) = \Psi_-(\lambda) = \lambda^2\sigma^2/2$, $b_+ = \infty$ and $b_- = -\infty$. Furthermore, the Legendre dual functions are given as $\Psi_+^*(x) = \Psi_-^*(x) = x^2/(2\sigma^2)$, and the inverse Legendre dual functions evaluate to
$\Psi_+^{*-1}(y) = \Psi_-^{*-1}(y) = \sqrt{2\sigma^2 y}.$   (A4)
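As a worked check of (A4), assuming $\Psi_+(\lambda) = \lambda^2\sigma^2/2$, the dual and inverse dual follow by elementary calculus:

```latex
% Legendre dual: the supremum is attained at \lambda = x/\sigma^2
\Psi_+^*(x) = \sup_{\lambda \ge 0}\Big(\lambda x - \frac{\lambda^2 \sigma^2}{2}\Big)
            = \frac{x^2}{2\sigma^2};
% inverse dual: the smallest x \ge 0 with x^2/(2\sigma^2) \ge y
\Psi_+^{*-1}(y) = \inf\Big\{x \ge 0 : \frac{x^2}{2\sigma^2} \ge y\Big\}
               = \sqrt{2\sigma^2 y}.
```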
We are now ready to state the decoupling estimate lemmas.
Lemma A1
(Decoupling Estimate [14]). Let $X$ and $Y$ be two random variables with joint distribution $P_{X,Y}$, and let $f(X, Y)$ be a real-valued function such that $f(x, Y)$ is $(\Psi_+, \Psi_-, \infty, -\infty)$-generalized sub-Gaussian for all $x$ when $Y \sim P_Y$. Then, we have the following inequalities:
$\pm\left(\mathbb{E}_{P_X P_Y}[f(\tilde{X}, \tilde{Y})] - \mathbb{E}_{P_{X,Y}}[f(X, Y)]\right) \le \Psi_\mp^{*-1}(I(X; Y)),$   (A5)
where $(\tilde{X}, \tilde{Y}) \sim P_X P_Y$.
Lemma A2
(General Decoupling Estimate [26]). Let $X$ and $Y$ be two random variables with joint distribution $P_{X,Y}$, and let $f(X, Y)$ be a real-valued function such that $f(\tilde{X}, \tilde{Y})$ is $(\Psi_+, \Psi_-, b_+, b_-)$-generalized sub-Gaussian when $(\tilde{X}, \tilde{Y}) \sim P_X P_Y$. Then, the inequality (A5) holds.
Note that in Lemma A2, the random variables $X, Y$ are jointly distributed according to $P_{X,Y}$. Assuming that $f(\tilde{X}, \tilde{Y})$ is generalized sub-Gaussian under $\tilde{X} \sim P_X$ and $\tilde{Y} \sim P_Y$, with $P_X$ and $P_Y$ being the marginals of $P_{X,Y}$, the lemma upper bounds the difference between the average of $f$ under the joint distribution $P_{X,Y}$ and its average under the product of the marginals. The resulting bound thus estimates the effect, with respect to the function $f$, of decoupling the joint distribution into its marginals.

Appendix B. Proofs of Theorems 1 and 2

For the proof of Theorem 1, we use the decomposition (29) of the meta-generalization gap into average environment-level and within-task generalization gaps as
$\mathbb{E}_{P_{Z_{1:N}^m, U}}[\Delta L^{sep}(U|Z_{1:N}^m)] = \mathbb{E}_{P_{Z_{1:N}^m, U}}\big[(L_g^{sep}(U) - L_{g,t}^{sep}(U)) + (L_{g,t}^{sep}(U) - L_t^{sep}(U|Z_{1:N}^m))\big] = \mathbb{E}_{P_T} \mathbb{E}_{P_{Z^m|T} P_{W,U,Z_{1:N}^m|Z^{m_{tr}}}}[\Delta L(W|Z^{m_{te}}, T)] + \mathbb{E}_{P_{Z_{1:N}^m, U}}\big[L_{g,t}^{sep}(U) - L_t^{sep}(U|Z_{1:N}^m)\big],$   (A6)
where (A6) follows since the average within-task generalization gap for a random meta-test task, $\mathbb{E}_{P_{Z_{1:N}^m, U}}[L_g^{sep}(U) - L_{g,t}^{sep}(U)]$, can be equivalently written as $\mathbb{E}_{P_T} \mathbb{E}_{P_{Z^m|T} P_{W,U,Z_{1:N}^m|Z^{m_{tr}}}}[\Delta L(W|Z^{m_{te}}, T)]$, with $\Delta L(W|Z^{m_{te}}, T) = L_{P_{Z|T}}(W) - L_t(W|Z^{m_{te}})$ denoting the generalization gap of the meta-test task $T$, and the joint distribution $P_{W,U,Z_{1:N}^m|Z^{m_{tr}}}$ factorizes as $P_{W|Z^{m_{tr}},U} P_{Z_{1:N}^m, U}$. To obtain an upper bound on the average meta-generalization gap $\mathbb{E}_{P_{Z_{1:N}^m, U}}[\Delta L^{sep}(U|Z_{1:N}^m)]$, we bound each of the two differences in (A6) separately.
We first bound the second difference in (A6), $\mathbb{E}_{P_{Z_{1:N}^m, U}}[L_{g,t}^{sep}(U) - L_t^{sep}(U|Z_{1:N}^m)]$, which represents the expected environment-level uncertainty measured using the average per-task training loss $L_t^{sep}(u|Z^m)$ defined in (15). To this end, we extend the single-task learning generalization bound of Lemma 1 by resorting to the decoupling estimate in Lemma A1 with $X = U$, $Y = Z_{1:N}^m$ and $f(X, Y) = L_t^{sep}(U|Z_{1:N}^m)$, so that $\mathbb{E}_{P_{X,Y}}[f(X, Y)] = \mathbb{E}_{P_{Z_{1:N}^m, U}}[L_t^{sep}(U|Z_{1:N}^m)]$ and $\mathbb{E}_{P_X P_Y}[f(\tilde{X}, \tilde{Y})] = \mathbb{E}_{P_U P_{Z_{1:N}^m}}[L_t^{sep}(U|Z_{1:N}^m)] = \mathbb{E}_{P_U}[L_{g,t}^{sep}(U)]$.
Recall from Assumption 3 that, for a given $u \in \mathcal{U}$, $L_t^{sep}(u|Z_{1:N}^m) = \frac{1}{N}\sum_{i=1}^N L_t^{sep}(u|Z_i^m)$ is the average of the i.i.d. $\sigma^2$-sub-Gaussian random variables $L_t^{sep}(u|Z_i^m)$, with $Z_i^m \sim P_{Z^m}$ for all $i \in \{1, \dots, N\}$. It then follows that $L_t^{sep}(u|Z_{1:N}^m)$ is $\sigma^2/N$-sub-Gaussian under $Z_{1:N}^m \sim P_{Z_{1:N}^m}$ for all $u \in \mathcal{U}$ [53]. Applying Lemma A1 with $\Psi^{*-1}$ specialized to the $\sigma^2/N$-sub-Gaussian loss as in (A4) gives
$\mathbb{E}_{P_{Z_{1:N}^m, U}}\big[L_{g,t}^{sep}(U) - L_t^{sep}(U|Z_{1:N}^m)\big] \le \sqrt{\dfrac{2\sigma^2\, I(U; Z_{1:N}^m)}{N}}.$   (A7)
We now evaluate the first difference in (A6). It can be seen that, for a fixed task $\tau \in \mathcal{T}$, the average within-task uncertainty evaluates to
$\mathbb{E}_{P_{Z^m|T=\tau} P_{W,U,Z_{1:N}^m|Z^{m_{tr}}}}[\Delta L(W|Z^{m_{te}}, T=\tau)] = \mathbb{E}_{P_{Z^m|T=\tau} P_{W|Z^{m_{tr}}}}[\Delta L(W|Z^{m_{te}}, T=\tau)] = \mathbb{E}_{P_{Z^{m_{tr}}|T=\tau}} \mathbb{E}_{P_{W|Z^{m_{tr}}}}\big[L_g(W|T=\tau) - \mathbb{E}_{P_{Z^{m_{te}}|T=\tau}}[L_t(W|Z^{m_{te}})]\big] \stackrel{(a)}{=} 0,$   (A8)
where (a) follows since $W$ and $Z^{m_{te}}$ are conditionally independent given $Z^{m_{tr}}$, whereby $\mathbb{E}_{P_{Z^{m_{tr}}|T=\tau}} \mathbb{E}_{P_{W|Z^{m_{tr}}}}\big[\mathbb{E}_{P_{Z^{m_{te}}|T=\tau}}[L_t(W|Z^{m_{te}})]\big] = \mathbb{E}_{P_{Z^{m_{tr}}|T=\tau}} \mathbb{E}_{P_{W|Z^{m_{tr}}}}[L_g(W|T=\tau)]$. Substituting (A7) and (A8) into (A6) then concludes the proof.
For Theorem 2, the proof follows along the same lines, bounding the average environment-level uncertainty $\mathbb{E}_{P_{Z_{1:N}^m, U}}[L_{g,t}^{sep}(U) - L_t^{sep}(U|Z_{1:N}^m)]$. Towards this, we note that the environment-level uncertainty can be equivalently written as
$\mathbb{E}_{P_{Z_{1:N}^m, U}}\big[L_{g,t}^{sep}(U) - L_t^{sep}(U|Z_{1:N}^m)\big] = \dfrac{1}{N}\displaystyle\sum_{i=1}^N \Big(\mathbb{E}_{P_{Z_i^m} P_U}[L_t^{sep}(U|Z_i^m)] - \mathbb{E}_{P_{Z_i^m, U}}[L_t^{sep}(U|Z_i^m)]\Big),$   (A9)
where $Z_i^m$ and $U$ in the first term are independent random variables distributed as $(Z_i^m, U) \sim P_{Z^m} P_U$, while, in the second term, they are jointly distributed according to $P_{Z_i^m, U}$, which is obtained by marginalizing the joint distribution $P_{Z_{1:N}^m, U}$. Under Assumption 4(a), we can bound the difference $\mathbb{E}_{P_{Z_i^m} P_U}[L_t^{sep}(U|Z_i^m)] - \mathbb{E}_{P_{Z_i^m, U}}[L_t^{sep}(U|Z_i^m)]$ by resorting to the decoupling estimate in Lemma A1 with $X = U$, $Y = Z_i^m$, and $f(X, Y) = L_t^{sep}(U|Z_i^m)$, such that $\mathbb{E}_{P_{X,Y}}[f(X, Y)] = \mathbb{E}_{P_{U, Z_i^m}}[L_t^{sep}(U|Z_i^m)]$ and $\mathbb{E}_{P_X P_Y}[f(\tilde{X}, \tilde{Y})] = \mathbb{E}_{P_U P_{Z_i^m}}[L_t^{sep}(U|Z_i^m)]$. Since $L_t^{sep}(u|Z_i^m)$ is $\sigma^2$-sub-Gaussian under Assumption 4(a) for all $u \in \mathcal{U}$, Lemma A1 yields the following bound:
$\mathbb{E}_{P_{Z_i^m} P_U}[L_t^{sep}(U|Z_i^m)] - \mathbb{E}_{P_{Z_i^m, U}}[L_t^{sep}(U|Z_i^m)] \le \sqrt{2\sigma^2 I(U; Z_i^m)}.$   (A10)
The bound in (A10) can also be obtained under Assumption 4(b) by resorting to the general decoupling estimate in Lemma A2 with the same choices of $X$, $Y$, and $f(X, Y)$. Substituting the bound (A10) into (A9) then yields the required bound in (27).

Appendix C. Proofs of Theorems 3 and 4

For Theorem 3, we start from the following decomposition of the average meta-generalization gap, analogous to (A6):
$\mathbb{E}_{P_{Z_{1:N}^m, U}}[\Delta L^{joint}(U|Z_{1:N}^m)] = \mathbb{E}_{P_T} \mathbb{E}_{P_{Z^m|T} P_{W|Z^m}}[\Delta L(W|Z^m, T)] + \mathbb{E}_{P_{Z_{1:N}^m, U}}\big[L_{g,t}^{joint}(U) - L_t^{joint}(U|Z_{1:N}^m)\big],$   (A11)
where $L_{g,t}^{joint}(u) = \mathbb{E}_{P_{T, Z^m}}[L_t^{joint}(u|Z^m)]$ is the average training loss for the randomly selected meta-test task as a function of the hyperparameter $u$, and $\Delta L(W|Z^m, T) = L_{P_{Z|T}}(W) - L_t(W|Z^m)$ is the generalization gap for the meta-test task $T$. The MI bound on the expected environment-level uncertainty, $\mathbb{E}_{P_{Z_{1:N}^m, U}}[L_{g,t}^{joint}(U) - L_t^{joint}(U|Z_{1:N}^m)]$, can be obtained by using Lemma A1 and Assumption 5(b), as in (A7).
The main difference between the scenarios with separate and joint within-task training and test sets is that, while the average within-task uncertainty vanishes in the former, this is not the case with joint within-task training and test sets. Consequently, we now bound the average within-task generalization gap, given by the first difference in (A11). For a given task $\tau \in \mathcal{T}$, to bound the within-task generalization gap $\mathbb{E}_{P_{Z^m|T=\tau} P_{W|Z^m}}[\Delta L(W|Z^m, T=\tau)]$, we resort to Lemma A1 with $X = W$, $Y = Z^m$ and $f(X, Y) = L_t(W|Z^m)$, so that $\mathbb{E}_{P_{X,Y}}[f(X, Y)] = \mathbb{E}_{P_{W, Z^m|T=\tau}}[L_t(W|Z^m)]$. It can then be verified that $\mathbb{E}_{P_X P_Y}[f(\tilde{X}, \tilde{Y})] = \mathbb{E}_{P_{W|T=\tau}} \mathbb{E}_{P_{Z^m|T=\tau}}[L_t(W|Z^m)] = \mathbb{E}_{P_{W|T=\tau}}[L_g(W|T=\tau)] = \mathbb{E}_{P_{W, Z^m|T=\tau}}[L_g(W|T=\tau)]$, where $P_{W, Z^m|T=\tau} = P_{W|Z^m} P_{Z^m|T=\tau}$. Since $L_t(w|Z^m)$ is the average of the i.i.d. $\delta_\tau^2$-sub-Gaussian random variables $l(w, Z_i)$ (from Assumption 5(a)), we have that $L_t(w|Z^m)$ is $\delta_\tau^2/m$-sub-Gaussian under $Z^m \sim P_{Z^m|T=\tau}$ for all $w \in \mathcal{W}$ [53]. Consequently, Lemma A1 yields the following bound:
$\mathbb{E}_{P_{Z^m|T=\tau} P_{W|Z^m}}[\Delta L(W|Z^m, T=\tau)] \le \sqrt{\dfrac{2\delta_\tau^2}{m}\, I(W; Z^m|T=\tau)}.$   (A12)
Averaging both sides of (A12) with respect to $P_T$, and combining with the bound on the average environment-level uncertainty, yields the required bound in (32) via Jensen's inequality.
For Theorem 4, the proof follows along the same lines. The ITMI bound on the expected environment-level uncertainty can be obtained along the lines of (A10), using the assumption on $L_t^{joint}(u|Z^m)$ in either Assumption 6(a) or Assumption 6(b). We now show that the within-task uncertainty can be similarly bounded using the assumption on the loss function $l(w, z)$ in either Assumption 6(a) or Assumption 6(b). Towards this, for a fixed task $\tau \in \mathcal{T}$, we write the average within-task uncertainty equivalently as
$\mathbb{E}_{P_{Z^m|T=\tau} P_{W|Z^m}}[\Delta L(W|Z^m, T=\tau)] = \dfrac{1}{m}\displaystyle\sum_{j=1}^m \Big(\mathbb{E}_{P_{W|T=\tau} P_{Z_j|T=\tau}}[l(W, Z_j)] - \mathbb{E}_{P_{W, Z_j|T=\tau}}[l(W, Z_j)]\Big),$   (A13)
where $W$ and $Z_j$ in the second term are jointly distributed according to $P_{W, Z_j|T=\tau}$, the marginal of the joint distribution $P_{W, Z^m|T=\tau}$. In contrast, $W$ and $Z_j$ in the first term are independent random variables distributed as $(W, Z_j) \sim P_{W|T=\tau} P_{Z_j|T=\tau}$, where $P_{W|T=\tau}$ is the marginal distribution of $P_{W, Z_j|T=\tau}$. Now, fixing $X = W$, $Y = Z_j$ and $f(X, Y) = l(W, Z_j)$, so that $\mathbb{E}_{P_X P_Y}[f(\tilde{X}, \tilde{Y})] = \mathbb{E}_{P_{W|T=\tau} P_{Z_j|T=\tau}}[l(W, Z_j)]$ and $\mathbb{E}_{P_{X,Y}}[f(X, Y)] = \mathbb{E}_{P_{W, Z_j|T=\tau}}[l(W, Z_j)]$, and applying Lemma A1 under the assumption on $l(w, z)$ in Assumption 6(a), or Lemma A2 under the assumption on $l(w, z)$ in Assumption 6(b), yields the following bound:
$\mathbb{E}_{P_{Z^m|T=\tau} P_{W|Z^m}}[\Delta L(W|Z^m, T=\tau)] \le \dfrac{1}{m}\displaystyle\sum_{j=1}^m \sqrt{2\delta_\tau^2\, I(W; Z_j|T=\tau)}.$   (A14)
Averaging both sides of (A14) with respect to $P_T$, and combining with the bound on the average environment-level uncertainty, yields the required bound in (33).

Appendix D. Details of Example

We first give the details of the derivation of the meta-generalization gap for the case with separate within-task training and test sets. The average meta-generalization loss can be computed as
$\mathbb{E}_{P_{Z_{1:N}^m, U}}[L_g^{sep}(U)] = \mathbb{E}_{P_{Z_{1:N}^m, U}}\Big[(1-\alpha)^2 U^2 + \mathbb{E}_{P_{T, Z^{m_{tr}}}}\big[\alpha^2 (D_T^{m_{tr}})^2 + \mathbb{E}_{P_{Z|T}}[Z^2] - 2\alpha D_T^{m_{tr}} \mu_T + 2(1-\alpha) U (\alpha D_T^{m_{tr}} - \mu_T)\big]\Big] \stackrel{(a)}{=} \mathbb{E}_{P_{Z_{1:N}^m, U}}\big[(1-\alpha)^2 \big(U^2 - 2U\,\mathbb{E}_{P_T}[\mu_T]\big)\big] + \mathbb{E}_{P_T}\Big[\alpha^2\Big(\mu_T^2 + \dfrac{\mu_T \bar{\mu}_T}{m_{tr}}\Big) + \mu_T - 2\alpha \mu_T^2\Big],$   (A15)
where the equality (a) follows since $\mathbb{E}_{P_{Z|T}}[Z^2] = \mu_T$, $\mathbb{E}_{P_{Z^{m_{tr}}|T}}[D_T^{m_{tr}}] = \mu_T$ and $\mathbb{E}_{P_{Z^{m_{tr}}|T}}[(D_T^{m_{tr}})^2] = \mu_T^2 + \mu_T \bar{\mu}_T / m_{tr}$, with $\bar{\mu}_T = 1 - \mu_T$. In a similar manner, the average meta-training loss can be computed as
$\mathbb{E}_{P_{Z_{1:N}^m, U}}[L_t^{sep}(U|Z_{1:N}^m)] = \mathbb{E}_{P_{Z_{1:N}^m}}\Big[(1-\alpha)^2 U^2 + \dfrac{1}{N}\displaystyle\sum_{i=1}^N \alpha^2 (D_i^{m_{tr}})^2 + \dfrac{1}{N}\displaystyle\sum_{i=1}^N \dfrac{1}{m_{te}}\displaystyle\sum_{j=1}^{m_{te}} (Z_{i,j}^{m_{te}})^2 - 2\alpha \dfrac{1}{N}\displaystyle\sum_{i=1}^N D_i^{m_{tr}} D_i^{m_{te}}\Big],$   (A16)
with $U$ defined as in (38). The meta-generalization gap in (40) then results by taking the difference of (A15) and (A16), and using that $\mathbb{E}_{P_{Z_{1:N}^m}}\big[(1-\alpha)^2 U^2\big] = \mathbb{E}_{P_T}[\mu_T \bar{\mu}_T]\Big(\dfrac{1}{N m_{te}} + \dfrac{\alpha^2}{N m_{tr}}\Big) + \dfrac{1}{N}(1+\alpha^2)\,\mathrm{Var}_T + (1-\alpha)^2 \big(\mathbb{E}_{P_T}[\mu_T]\big)^2$ and $\mathbb{E}_{P_{Z_{1:N}^m}}[U] = \mathbb{E}_{P_T}[\mu_T]$, with $\mathrm{Var}_T = \mathbb{E}_{P_T}[\mu_T^2] - (\mathbb{E}_{P_T}[\mu_T])^2$.
We now evaluate the mutual information terms $I(U; Z_{1:N}^m)$ and $I(U; Z_i^m)$. For the first MI, note that, since the meta-learner is deterministic (see (38)), $H(U|Z_{1:N}^m) = 0$ and thus $I(U; Z_{1:N}^m) = H(U)$. For the second MI, we can write $I(U; Z_i^m) = H(U) - \mathbb{E}_{Z_i^m}[H(U|Z_i^m = z^m)]$. The random variables $U$ and $U|Z_i^m = z^m$ are mixtures of probability distributions, whose entropies can be evaluated following standard methods [54].
For the case with joint within-task training and test sets, the meta-generalization gap can be obtained in a similar way as
$\mathbb{E}_{P_{Z_{1:N}^m, U}}[\Delta L^{joint}(U|Z_{1:N}^m)] = \dfrac{2}{N}\Big(\mathbb{E}_{P_T}\Big[\dfrac{\mu_T \bar{\mu}_T}{m}\Big] + \mathrm{Var}_T\Big) + \dfrac{2\alpha}{m}\mathbb{E}_{P_T}[\mu_T \bar{\mu}_T].$   (A17)
For the MI- and ITMI-based bounds, note that, with $\mathcal{W} = [0,1]$, the loss function $l(\cdot,\cdot)$ is $[0,1]$-bounded; and, for the deterministic base-learner in (36) with $\mathcal{U} = [0,1]$, the average training loss $L_t^{joint}(u|Z^m)$ is also $[0,1]$-bounded for all $Z^m \in \mathcal{Z}^m$. Thus, Assumptions 5 and 6 hold with $\sigma^2 = \delta_\tau^2 = 1/4$. For the MI bound in (32), we have $I(U; Z_{1:N}^m) = H(U)$ and $I(W; Z^m|T=\tau) = H(W|T=\tau) - \mathbb{E}_{Z^m}[H(W|Z^m, T=\tau)]$. For the ITMI bound (33), we have
$\big|\mathbb{E}_{P_{Z_{1:N}^m, U}}[\Delta L^{joint}(U|Z_{1:N}^m)]\big| \le \dfrac{1}{N}\displaystyle\sum_{i=1}^N \sqrt{\dfrac{1}{2} I(U; Z_i^m)} + \mathbb{E}_{P_T}\Big[\dfrac{1}{m}\displaystyle\sum_{j=1}^m \sqrt{\dfrac{1}{2} I(W; Z_j|T=\tau)}\Big].$   (A18)
All information measures can be easily evaluated numerically [54].
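As an illustration, the following sketch computes $H(U)$ by exhaustive enumeration for the joint-sets example; since the meta-learner (38) is deterministic, this equals $I(U; Z_{1:N}^m)$. The uniform task distribution, the Bernoulli means, and the small parameter values are illustrative assumptions, and entropies are measured in nats.

```python
# A sketch of the exhaustive evaluation of H(U) = I(U; Z_{1:N}^m) for the
# joint-sets example: U = (1/N) sum_k D_k, with D_k the empirical mean of m
# Bernoulli(mu_{tau_k}) samples, so U = (total count)/(N*m) is discrete.
import numpy as np
from itertools import product
from scipy.stats import binom

def entropy_nats(probs):
    p = np.asarray([q for q in probs if q > 0.0])
    return float(-np.sum(p * np.log(p)))

def H_of_U(mu_tasks, N, m):
    dist = {}
    # enumerate the N task draws (assumed uniform over mu_tasks)
    for taus in product(range(len(mu_tasks)), repeat=N):
        p_tasks = (1.0 / len(mu_tasks)) ** N
        # pmf of the total success count: convolution of N Binomial(m, mu) pmfs
        pmf = np.array([1.0])
        for t in taus:
            pmf = np.convolve(pmf, binom.pmf(np.arange(m + 1), m, mu_tasks[t]))
        for count, p in enumerate(pmf):
            u = count / (N * m)
            dist[u] = dist.get(u, 0.0) + p_tasks * p
    return entropy_nats(dist.values())

print(H_of_U(mu_tasks=[0.2, 0.5, 0.8], N=3, m=4))
```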

Appendix E. Proof of Lemma 4

From the update rule of the meta-learner in (43), we get the Markov dependency
$P_{U_j | U^{(j-1)}, \{W_{\mathcal{K}_i}\}_{i=1}^j, \{Z_{\mathcal{K}_i}^m\}_{i=1}^j, Z_{1:N}^m} = P_{U_j | U_{j-1}, W_{\mathcal{K}_j}, Z_{\mathcal{K}_j}^{m_{te}}},$   (A19)
where $U^{(j-1)} = \{U_1, \dots, U_{j-1}\}$ is the history of hyperparameter iterates. The sampling strategy in (44), together with (A19), then implies the following relation:
$P_{U_j | U^{(j-1)}, \{W_{\mathcal{K}_i}\}_{i=1}^j, \{Z_{\mathcal{K}_i}^m\}_{i=1}^J, Z_{1:N}^m} = P_{U_j | U_{j-1}, W_{\mathcal{K}_j}, Z_{\mathcal{K}_j}^{m_{te}}}.$   (A20)
Using $U^{(J)} = \{U_1, \dots, U_J\}$ to denote the set of all updates, we have the following relations:
$I(U; Z_{1:N}^m) \stackrel{(a)}{\le} I(U^{(J)}; Z_{1:N}^m) \stackrel{(b)}{\le} I(U^{(J)}; \{Z_{\mathcal{K}_i}^m\}_{i=1}^J) = \displaystyle\sum_{j=1}^J I(U_j; \{Z_{\mathcal{K}_i}^m\}_{i=1}^J \,|\, U^{(j-1)})$   (A21)
$\le \displaystyle\sum_{j=1}^J I(U_j; \{Z_{\mathcal{K}_i}^m\}_{i=1}^J, \{W_{\mathcal{K}_i}\}_{i=1}^j \,|\, U^{(j-1)})$   (A22)
$= \displaystyle\sum_{j=1}^J \Big[h(U_j | U^{(j-1)}) - h\big(U_j \,|\, U^{(j-1)}, \{Z_{\mathcal{K}_i}^m\}_{i=1}^J, \{W_{\mathcal{K}_i}\}_{i=1}^j\big)\Big]$   (A23)
$\stackrel{(c)}{=} \displaystyle\sum_{j=1}^J \Big[h(U_j | U_{j-1}) - h(U_j \,|\, U_{j-1}, W_{\mathcal{K}_j}, Z_{\mathcal{K}_j}^{m_{te}})\Big],$   (A24)
where the inequality (a) follows from the data processing inequality applied to the Markov chain $Z_{1:N}^m - U^{(J)} - U$; (b) follows from the Markov chain $Z_{1:N}^m - \{Z_{\mathcal{K}_i}^m\}_{i=1}^J - U^{(J)}$; and the equality (c) follows from the Markov chain $U^{(j-2)} - U_{j-1} - U_j$ and (A20). Finally, the bound in (A24) is computed similarly to Lemma 5 in [27].
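For completeness, that final step can be sketched as follows, using the maximum-entropy argument of ([27], Lemma 5): conditioned on $(U_{j-1}, W_{\mathcal{K}_j}, Z_{\mathcal{K}_j}^{m_{te}})$, the iterate $U_j$ in (43) is Gaussian with covariance $\gamma_j^2 I_d$; given $U_{j-1}$ alone, Assumption 7(3) implies that the trace of its covariance exceeds $d\gamma_j^2$ by at most $\beta_j^2 L^2$, and the isotropic Gaussian maximizes differential entropy under this constraint:

```latex
h\big(U_j \,\big|\, U_{j-1}, W_{\mathcal{K}_j}, Z_{\mathcal{K}_j}^{m_{te}}\big)
   = \frac{d}{2}\log\big(2\pi e\, \gamma_j^2\big), \qquad
h\big(U_j \,\big|\, U_{j-1}\big)
   \le \frac{d}{2}\log\Big(2\pi e\Big(\gamma_j^2 + \frac{\beta_j^2 L^2}{d}\Big)\Big).
```

Each summand in (A24) is thus at most $\frac{d}{2}\log\big(1 + \beta_j^2 L^2/(d\gamma_j^2)\big)$; substituting the resulting bound on $I(U; Z_{1:N}^m)$ into the environment-level bound (A7) of Theorem 1 yields (45).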

References

  1. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014.
  2. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
  3. Simeone, O. A Brief Introduction to Machine Learning for Engineers. Found. Trends Signal Process. 2018, 12, 200–431.
  4. Schmidhuber, J. Evolutionary Principles in Self-Referential Learning, or On Learning How to Learn: The Meta-meta-... Hook. Ph.D. Thesis, Technische Universität München, München, Germany, 1987.
  5. Thrun, S.; Pratt, L. Learning to Learn: Introduction and Overview. In Learning to Learn; Springer: Berlin/Heidelberg, Germany, 1998; pp. 3–17.
  6. Thrun, S. Is Learning the N-th Thing Any Easier Than Learning the First? In Proceedings of the Advances in Neural Information Processing Systems (NIPS); MIT Press: Cambridge, MA, USA, 1996; pp. 640–646.
  7. Vilalta, R.; Drissi, Y. A Perspective View and Survey of Meta-Learning. Artif. Intell. Rev. 2002, 18, 77–95.
  8. Simeone, O.; Park, S.; Kang, J. From Learning to Meta-Learning: Reduced Training Overhead and Complexity for Communication Systems. arXiv 2020, arXiv:2001.01227.
  9. Kuzborskij, I.; Orabona, F. Fast rates by transferring from auxiliary hypotheses. Mach. Learn. 2017, 106, 171–195.
  10. Kienzle, W.; Chellapilla, K. Personalized handwriting recognition via biased regularization. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 457–464.
  11. Denevi, G.; Ciliberto, C.; Grazzi, R.; Pontil, M. Learning-to-learn stochastic gradient descent with biased regularization. arXiv 2019, arXiv:1903.10399.
  12. Denevi, G.; Pontil, M.; Ciliberto, C. The advantage of conditional meta-learning for biased regularization and fine-tuning. arXiv 2020, arXiv:2008.10857.
  13. Baxter, J. A Model of Inductive Bias Learning. J. Artif. Intell. Res. 2000, 12, 149–198.
  14. Russo, D.; Zou, J. Controlling Bias in Adaptive Data Analysis Using Information Theory. In Proceedings of the Artificial Intelligence and Statistics (AISTATS), Cadiz, Spain, 9–11 May 2016; pp. 1232–1240.
  15. Xu, H.; Farajtabar, M.; Zha, H. Learning Granger causality for Hawkes processes. In Proceedings of the International Conference on Machine Learning, New York City, NY, USA, 20–22 June 2016; pp. 1717–1726.
  16. Maurer, A. Algorithmic Stability and Meta-Learning. J. Mach. Learn. Res. 2005, 6, 967–994.
  17. Devroye, L.; Wagner, T. Distribution-Free Performance Bounds for Potential Function Rules. IEEE Trans. Inf. Theory 1979, 25, 601–604.
  18. Rogers, W.H.; Wagner, T.J. A Finite Sample Distribution-Free Performance Bound for Local Discrimination Rules. Ann. Stat. 1978, 6, 506–514.
  19. Pentina, A.; Lampert, C. A PAC-Bayesian Bound for Lifelong Learning. In Proceedings of the International Conference on Machine Learning (ICML), Beijing, China, 21–26 June 2014; pp. 991–999.
  20. Amit, R.; Meir, R. Meta-Learning by Adjusting Priors Based on Extended PAC-Bayes Theory. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 205–214.
  21. Rothfuss, J.; Fortuin, V.; Krause, A. PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees. arXiv 2020, arXiv:2002.05551.
  22. Denevi, G.; Ciliberto, C.; Stamos, D.; Pontil, M. Incremental learning-to-learn with statistical guarantees. arXiv 2018, arXiv:1803.08089.
  23. Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the International Conference on Machine Learning-Volume 70, Sydney, NSW, Australia, 6–11 August 2017; pp. 1126–1135.
  24. Nichol, A.; Achiam, J.; Schulman, J. On First-Order Meta-Learning Algorithms. arXiv 2018, arXiv:1803.02999.
  25. Xu, A.; Raginsky, M. Information-Theoretic Analysis of Generalization Capability of Learning Algorithms. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 2524–2533.
  26. Bu, Y.; Zou, S.; Veeravalli, V.V. Tightening Mutual Information Based Bounds on Generalization Error. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 587–591.
  27. Pensia, A.; Jog, V.; Loh, P.L. Generalization Error Bounds for Noisy, Iterative Algorithms. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 546–550.
  28. Vapnik, V.N.; Chervonenkis, A.Y. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory Probab. Appl. 1971, 16, 264–280.
  29. Koltchinskii, V.; Panchenko, D. Rademacher Processes and Bounding the Risk of Function Learning. In High Dimensional Probability II; Springer: Berlin/Heidelberg, Germany, 2000; Volume 47, pp. 443–457.
  30. Bousquet, O.; Elisseeff, A. Stability and Generalization. J. Mach. Learn. Res. 2002, 2, 499–526.
  31. Kearns, M.; Ron, D. Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation. Neural Comput. 1999, 11, 1427–1453.
  32. Poggio, T.; Rifkin, R.; Mukherjee, S.; Niyogi, P. General Conditions for Predictivity in Learning Theory. Nature 2004, 428, 419–422.
  33. Kutin, S.; Niyogi, P. Almost-Everywhere Algorithmic Stability and Generalization Error. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI’02); Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2002; pp. 275–282.
  34. Dwork, C.; Feldman, V.; Hardt, M.; Pitassi, T.; Reingold, O.; Roth, A.L. Preserving Statistical Validity in Adaptive Data Analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (STOC), Portland, OR, USA, 15–17 June 2015; pp. 117–126.
  35. Bassily, R.; Nissim, K.; Smith, A.; Steinke, T.; Stemmer, U.; Ullman, J. Algorithmic Stability for Adaptive Data Analysis. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing (STOC), Cambridge, MA, USA, 19–21 June 2016; pp. 1046–1059.
  36. McAllester, D.A. PAC-Bayesian Model Averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT), Santa Cruz, CA, USA, 7–9 July 1999; pp. 164–170.
  37. Seeger, M. PAC-Bayesian Generalization Error Bounds for Gaussian Process Classification. J. Mach. Learn. Res. 2002, 3, 233–269.
  38. Alquier, P.; Ridgway, J.; Chopin, N. On the Properties of Variational Approximations of Gibbs Posteriors. J. Mach. Learn. Res. 2016, 17, 8374–8414.
  39. Asadi, A.; Abbe, E.; Verdú, S. Chaining Mutual Information and Tightening Generalization Bounds. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 2–8 December 2018; pp. 7234–7243.
  40. Negrea, J.; Haghifam, M.; Dziugaite, G.K.; Khisti, A.; Roy, D.M. Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 11013–11023.
  41. Alabdulmohsin, I. Towards a Unified Theory of Learning and Information. Entropy 2020, 22, 438.
  42. Jiao, J.; Han, Y.; Weissman, T. Dependence Measures Bounding the Exploration Bias for General Measurements. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 1475–1479.
  43. Issa, I.; Gastpar, M. Computable Bounds on the Exploration Bias. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 576–580.
  44. Issa, I.; Esposito, A.R.; Gastpar, M. Strengthened Information-Theoretic Bounds on the Generalization Error. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 582–586.
  45. Steinke, T.; Zakynthinou, L. Reasoning About Generalization via Conditional Mutual Information. arXiv 2020, arXiv:2001.09122.
  46. Knoblauch, J.; Jewson, J.; Damoulas, T. Generalized Variational Inference. arXiv 2019, arXiv:1904.02063.
  47. Bissiri, P.G.; Holmes, C.C.; Walker, S.G. A General Framework for Updating Belief Distributions. J. R. Stat. Soc. Ser. B Stat. Methodol. 2016, 78, 1103–1130.
  48. Yin, M.; Tucker, G.; Zhou, M.; Levine, S.; Finn, C. Meta-Learning Without Memorization. arXiv 2019, arXiv:1912.03820.
  49. Wainwright, M.J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint; Cambridge University Press: Cambridge, UK, 2019; Volume 48.
  50. Shalev-Shwartz, S.; Shamir, O.; Srebro, N.; Sridharan, K. Learnability, Stability and Uniform Convergence. J. Mach. Learn. Res. 2010, 11, 2635–2670.
  51. Maurer, A.; Pontil, M.; Romera-Paredes, B. The Benefit of Multitask Representation Learning. J. Mach. Learn. Res. 2016, 17, 2853–2884.
  52. Welling, M.; Teh, Y.W. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, USA, 28 June–2 July 2011; pp. 681–688.
  53. Boucheron, S.; Lugosi, G.; Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence; Oxford University Press: Oxford, UK, 2013.
  54. Michalowicz, J.V.; Nichols, J.M.; Bucholtz, F. Handbook of Differential Entropy; CRC Press: New York, NY, USA, 2013.
Figure 1. Directed graph representing the variables involved in the definition of the generalization gap (11) for single-task learning.
Figure 2. Directed graph representing the variables involved in the definition of the meta-generalization gap (18) for separate within-task training and testing sets.
Figure 3. Directed graph representing the variables involved in the definition of the meta-generalization gap (22) for joint within-task training and testing sets.
Figure 4. Comparison of the MI bound in (27) and the ITMI bound in (41) with the meta-generalization gap for meta-learning with separate within-task training and test sets. The task environment is defined by $M = 12$ tasks. Other parameters are set as $\alpha = 0.15$, $m_{tr} = 15$, $m_{te} = 5$.
Figure 5. Comparison of the MI- and ITMI-based bound in (A18) with the meta-generalization gap for meta-learning with joint within-task training and test sets, as a function of the number of per-task data samples $m$, for $N = 5$ and $\alpha = 0.55$. The task environment is defined by $M = 9$ tasks.
Figure 6. A graphical model representation of the variables involved in the definition of noisy iterative meta-learning algorithms.
Figure 7. Comparison of the meta-generalization gap with the MI-based bound in (49) as a function of the ratio $\gamma_j^2/\beta_j^2$.