Article

An Information Theoretic Interpretation to Deep Neural Networks †

1 Data Science and Information Technology Research Center, Tsinghua–Berkeley Shenzhen Institute, Shenzhen 518055, China
2 Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
* Author to whom correspondence should be addressed.
† This work was presented in part at the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019.
Submission received: 7 December 2021 / Revised: 10 January 2022 / Accepted: 12 January 2022 / Published: 17 January 2022
(This article belongs to the Special Issue Information Theory and Machine Learning)

Abstract: With the unprecedented performance achieved by deep learning, it is commonly believed that deep neural networks (DNNs) attempt to extract informative features for learning tasks. To formalize this intuition, we apply the local information geometric analysis and establish an information-theoretic framework for feature selection, which demonstrates the information-theoretic optimality of DNN features. Moreover, we conduct a quantitative analysis to characterize the impact of network structure on the feature extraction process of DNNs. Our investigation naturally leads to a performance metric for evaluating the effectiveness of extracted features, called the H-score, which illustrates the connection between the practical training process of DNNs and the information-theoretic framework. Finally, we validate our theoretical results by experimental designs on synthesized data and the ImageNet dataset.

1. Introduction

Due to the striking performance of deep learning in various application fields, deep neural networks (DNNs) have gained great attention in modern computer science. While it is commonly understood that the features extracted from the hidden layers of a DNN are “informative” for learning tasks, the mathematical meaning of informative features in DNNs is generally unclear. From the practical perspective, DNN models have obtained unprecedented performance in a variety of tasks, such as image recognition [1], language processing [2,3], and games [4,5]. However, the understanding of the feature extraction behind these models is comparatively lacking, which poses challenges for their application in security-sensitive tasks, such as autonomous vehicles.
To address this problem, there have been numerous research efforts, including both experimental and theoretical studies [6]. The experimental studies usually focus on empirical properties of the features extracted by DNNs, by visualizing the features [7] or testing their performance in specific training settings [8] or learning tasks [9]. Though such empirical methods have provided some intuitive interpretations, their performance can depend heavily on the data and network architecture used. For example, while feature visualization works well on convolutional neural networks, its application to other networks is typically less effective [10].
In contrast, theoretical studies focus on the analytical properties of the extracted features or the learning process in DNNs. Due to the complicated structure of DNNs, existing studies are often restricted to networks with specific structures, e.g., networks with infinite width [11] or two-layer networks [12,13], to characterize the theoretical behaviors. However, the interpretation of the optimal feature remains unclear, which limits their further applications. To obtain better interpretability, tools and measures from information theory [14] have recently been applied to connect DNNs with general information processing problems [15]. For instance, the information bottleneck [16,17] employs mutual information as the metric to quantify the informativeness of features in DNNs, and other information metrics, such as the Kullback–Leibler (KL) divergence [18] and the Wasserstein distance [19], are also used in different problems. However, there is still a disconnect between these information metrics and the performance objectives of the inference tasks that DNNs aim to solve [20]. Therefore, it is, in general, difficult to match DNN learning with the optimization of a particular information metric.
This paper aims to provide an information-theoretic interpretation of the feature extraction process in DNNs, bridging the gap between practical deep learning implementations and information-theoretic characterizations. To this end, we first propose an information-theoretic feature selection framework, which establishes an information metric measuring the performance of each given feature in inference tasks. In addition, we demonstrate that the optimal features extracted by DNNs coincide with the solutions of the information-theoretic feature selection problem, sharing the same performance metric. Therefore, our results give an explicit interpretation of the learning goal of the back-propagation (BackProp) and stochastic gradient descent (SGD) operations in deep learning [21], which also leads to a performance metric for evaluating the effectiveness of the extracted features. Finally, we validate our theoretical characterizations using numerical experiments on both synthesized data and the ImageNet [22] dataset for image classification.

2. Preliminaries and Methods

2.1. Methodological Background

The main method used in our development is local information geometry [23,24], which characterizes the local geometric properties of the probability distribution space. The local information geometric method is closely related to the conventional Hirschfeld–Gebelein–Rényi (HGR) maximal correlation [25,26,27] problem, which has attracted increasing interest in the information theory community [28,29,30,31,32,33], and has also been applied in data analysis [34] and privacy studies [35].
Specifically, we use the local information geometric method to construct and investigate an information-theoretic feature selection problem in Section 3.1, which leads to an information metric of features and also demonstrates an SVD (singular value decomposition) structure of the feature selection process. Following the same analysis framework, we characterize the optimal feature extracted by DNNs in Section 3.2, and demonstrate that the same SVD structure is shared by DNNs. Based on the established connection, we then propose an effectiveness measure for DNNs, with details presented in Section 3.3.

2.2. Notations

Throughout this paper, we use $X$, $\mathcal{X}$, $P_X$, and $x$ to represent a discrete random variable, its range, its probability distribution, and a value of $X$, respectively. In addition, for any function $s(X) \in \mathbb{R}^k$ of $X$, we use $\mu_s$ to denote the mean of $s(X)$, and "$\,\tilde{}\,$" to denote the centered variable with the mean subtracted, e.g., $\tilde{s}(X) \triangleq s(X) - \mu_s$. Moreover, we use $\|\cdot\|$ and $\|\cdot\|_F$ to denote the $\ell_2$-norm and the Frobenius norm, respectively. All logarithms in our analyses are base $e$, i.e., natural.

2.3. Local Information Geometry

The following concepts from local information geometry will be useful in our development.
Definition 1
($\epsilon$-Neighborhood). Let $\mathcal{P}^{\mathcal{X}}$ denote the space of distributions on some finite alphabet $\mathcal{X}$, and let $\operatorname{relint}(\mathcal{P}^{\mathcal{X}})$ denote the subset of strictly positive distributions. For a given $\epsilon > 0$, the $\epsilon$-neighborhood of a distribution $P_X \in \operatorname{relint}(\mathcal{P}^{\mathcal{X}})$ is defined via the $\chi^2$-divergence as
$$\mathcal{N}_\epsilon^{\mathcal{X}}(P_X) \triangleq \bigg\{ P \in \mathcal{P}^{\mathcal{X}} : \sum_{x \in \mathcal{X}} \frac{\big(P(x) - P_X(x)\big)^2}{P_X(x)} \leq \epsilon^2 \bigg\}.$$
Definition 2
($\epsilon$-Dependence). The random variables $X, Y$ are called $\epsilon$-dependent if $P_{XY} \in \mathcal{N}_\epsilon^{\mathcal{X} \times \mathcal{Y}}(P_X P_Y)$.
Definition 3
($\epsilon$-Attribute). A random variable $U$ is called an $\epsilon$-attribute of $X$ if $P_{X|U}(\cdot|u) \in \mathcal{N}_\epsilon^{\mathcal{X}}(P_X)$, for all $u \in \mathcal{U}$.
We will focus on the small-$\epsilon$ regime, which we refer to as the local analysis regime. In addition, for any $P \in \mathcal{P}^{\mathcal{X}}$, we define the information vector $\boldsymbol{\phi}$ and feature function $L(x)$ corresponding to $P$, with respect to a reference distribution $P_X \in \operatorname{relint}(\mathcal{P}^{\mathcal{X}})$, as
$$\phi(x) \triangleq \frac{P(x) - P_X(x)}{\sqrt{P_X(x)}}, \qquad L(x) \triangleq \frac{\phi(x)}{\sqrt{P_X(x)}}.$$
This gives a three-way correspondence $P \leftrightarrow \boldsymbol{\phi} \leftrightarrow L$ for all distributions in $\mathcal{N}_\epsilon^{\mathcal{X}}(P_X)$, which will be useful in our derivations.
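For concreteness, the following sketch (our own illustration in numpy, not part of the paper or its released code) constructs a distribution inside an $\epsilon$-neighborhood and computes the corresponding information vector and feature function:

```python
import numpy as np

np.random.seed(0)
P_X = np.array([0.4, 0.3, 0.2, 0.1])    # reference distribution P_X
eps = 0.05

# Build P inside the eps-neighborhood: P = P_X + sqrt(P_X) * phi, where phi is
# orthogonal to sqrt(P_X) so that P still sums to one.
sqrt_P = np.sqrt(P_X)
phi = np.random.randn(4)
phi -= (phi @ sqrt_P) * sqrt_P          # project out the sqrt(P_X) direction
phi *= 0.8 * eps / np.linalg.norm(phi)  # ensure ||phi|| < eps
P = P_X + sqrt_P * phi

# Three-way correspondence P <-> phi <-> L.
phi_vec = (P - P_X) / sqrt_P            # information vector
L = phi_vec / sqrt_P                    # feature function
chi2 = np.sum((P - P_X) ** 2 / P_X)     # chi^2-divergence to P_X
assert chi2 <= eps ** 2                 # P lies in the eps-neighborhood of P_X
```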

2.4. Modal Decomposition

Given a pair of discrete random variables $X, Y$ with joint distribution $P_{XY}(x,y)$, the $|\mathcal{Y}| \times |\mathcal{X}|$ matrix $\tilde{\mathbf{B}}$ is defined as
$$\tilde{B}(y,x) \triangleq \frac{P_{XY}(x,y) - P_X(x) P_Y(y)}{\sqrt{P_X(x)}\,\sqrt{P_Y(y)}},$$
where $\tilde{B}(y,x)$ is the $(y,x)$th entry of $\tilde{\mathbf{B}}$. The matrix $\tilde{\mathbf{B}}$ is referred to as the canonical dependence matrix (CDM) [24]. The SVD of $\tilde{\mathbf{B}}$ is referred to as the modal decomposition [24] of the joint distribution $P_{XY}$, which has the following property [18].
Lemma 1.
The SVD of $\tilde{\mathbf{B}}$ can be written as $\tilde{\mathbf{B}} = \sum_{i=1}^{K} \sigma_i \boldsymbol{\psi}_i^Y (\boldsymbol{\psi}_i^X)^T$, where $K \triangleq \min\{|\mathcal{X}|, |\mathcal{Y}|\}$, $\sigma_i$ denotes the $i$th singular value, with the ordering $1 \geq \sigma_1 \geq \cdots \geq \sigma_K = 0$, and $\boldsymbol{\psi}_i^Y$ and $\boldsymbol{\psi}_i^X$ are the corresponding left and right singular vectors, with $\psi_K^X(x) = \sqrt{P_X(x)}$ and $\psi_K^Y(y) = \sqrt{P_Y(y)}$.
This SVD decomposes the feature spaces of $X, Y$ into maximally correlated features. To see this, consider the generalized canonical correlation analysis (CCA) problem:
$$\max_{\substack{\mathbb{E}[f_i(X)] = \mathbb{E}[g_i(Y)] = 0 \\ \mathbb{E}[f_i(X)f_j(X)] = \mathbb{E}[g_i(Y)g_j(Y)] = \delta_{ij}}} \; \sum_{i=1}^{k} \mathbb{E}\big[ f_i(X)\, g_i(Y) \big],$$
where $\delta_{ij}$ denotes the Kronecker delta function. It can be shown that for any $1 \leq k \leq K - 1$, the optimal features are $f_i(x) = \psi_i^X(x)/\sqrt{P_X(x)}$ and $g_i(y) = \psi_i^Y(y)/\sqrt{P_Y(y)}$, for $i = 1, \ldots, k$, where $\psi_i^X(x)$ and $\psi_i^Y(y)$ are the $x$th and $y$th entries of $\boldsymbol{\psi}_i^X$ and $\boldsymbol{\psi}_i^Y$, respectively [18]. The special case $k = 1$ corresponds to the HGR maximal correlation [25,26,27], and the optimal features can be computed with the ACE (Alternating Conditional Expectation) algorithm [36].
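To make the modal decomposition concrete, here is a small numpy sketch (our illustration, using a randomly generated joint distribution) that forms the CDM and checks the properties stated in Lemma 1:

```python
import numpy as np

rng = np.random.default_rng(0)
P_XY = rng.random((8, 6))                    # |X| = 8, |Y| = 6
P_XY /= P_XY.sum()                           # joint distribution P_XY(x, y)
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)

# CDM: B(y, x) = (P_XY(x, y) - P_X(x) P_Y(y)) / sqrt(P_X(x) P_Y(y)),  |Y| x |X|
B = (P_XY.T - np.outer(P_Y, P_X)) / np.sqrt(np.outer(P_Y, P_X))

psi_Y, sigma, psi_X_T = np.linalg.svd(B)     # modal decomposition of P_XY
assert np.all(sigma <= 1 + 1e-10)            # 1 >= sigma_1 >= ... >= sigma_K
assert np.isclose(sigma[-1], 0)              # sigma_K = 0 (cf. Lemma 1)

# Maximally correlated (HGR) feature pair, from the top singular vectors:
f1 = psi_X_T[0] / np.sqrt(P_X)               # f_1(x) = psi_1^X(x) / sqrt(P_X(x))
g1 = psi_Y[:, 0] / np.sqrt(P_Y)              # g_1(y) = psi_1^Y(y) / sqrt(P_Y(y))
```

Alternating regressions between $f$ and $g$, as in the ACE algorithm, recover the same pair without forming $\tilde{\mathbf{B}}$ explicitly.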

2.5. Deep Neural Networks

The architecture of deep neural networks (under log-loss) can be depicted as in Figure 1, where $X$ is the input data, e.g., images, audio, or natural language. Moreover, $Y$ is the objective to predict, which can represent a discrete label in classification tasks, or target natural language in machine translation [37]. Specifically, for given data $X$, the network produces a (trainable) feature mapping to generate the $k$-dimensional feature $s(x) = (s_1(x), \ldots, s_k(x))^T$. In practice, the feature mapping block (depicted as the gray block in Figure 1) is typically composed of hundreds or thousands of functional components (e.g., residual blocks [1]) with different types of layers, and may contain recurrent structures, e.g., LSTM (Long Short-Term Memory) [38]. In general, the internal structure of the feature mapping can have various designs, depending on the learning task.
After obtaining the feature $s(X)$, the variable $Y$ is then predicted via the probability distribution $\tilde{P}_{Y|X}^{(s,v,b)}$ of the form
$$\tilde{P}_{Y|X}^{(s,v,b)}(y|x) \triangleq \frac{e^{v^T(y)\,s(x) + b(y)}}{\sum_{y' \in \mathcal{Y}} e^{v^T(y')\,s(x) + b(y')}},$$
which is obtained by applying the softmax function [39] to $v^T(y)s(x) + b(y)$, where $v(\cdot)$ and $b(\cdot)$ are the weights and biases in the last layer, respectively. (This is equivalent to the common practice of denoting the weights and biases by the matrix $[v(1), \ldots, v(|\mathcal{Y}|)]^T$ and the vector $[b(1), \ldots, b(|\mathcal{Y}|)]^T$, respectively. However, as we will show later, expressing the weights $v$ and biases $b$ as mappings of $y$ better illustrates their roles in feature selection.) We will use $\tilde{P}_{Y|X}$ to refer to $\tilde{P}_{Y|X}^{(s,v,b)}$ when there is no ambiguity.
Then, for a given training set of labeled samples $(x_i, y_i)$, $i = 1, \ldots, N$, all the parameters in the network, including $v$, $b$, as well as those in the feature mapping block, are chosen to maximize the log-likelihood function (or, equivalently, minimize the log-loss)
$$\frac{1}{N} \sum_{i=1}^{N} \log \tilde{P}_{Y|X}(y_i \,|\, x_i).$$
The procedure of choosing such parameters is called the training of the network, which can be performed by stochastic gradient descent (SGD) or its variants [21]. With a trained network, the label $\hat{y}$ for a new data sample $x$ can be predicted by the maximum a posteriori (MAP) estimation, i.e., $\hat{y} = \arg\max_{y \in \mathcal{Y}} \tilde{P}_{Y|X}(y|x)$. Specifically, when we make predictions for samples in a test dataset, the proportion of samples with correct predictions (i.e., $\hat{y} = y$) over all samples is called the test accuracy.
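As a minimal illustration of this output layer (our sketch; the arrays and names are hypothetical, not from the paper's code), the posterior (4), the objective (5), and the MAP rule can be written as:

```python
import numpy as np

def posterior(s, v, b):
    """Softmax posterior (4): P(y|x) propto exp(v^T(y) s(x) + b(y)).
    s: |X| x k feature table, v: |Y| x k weights, b: length-|Y| biases."""
    logits = s @ v.T + b                          # |X| x |Y|
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    expl = np.exp(logits)
    return expl / expl.sum(axis=1, keepdims=True)

def avg_log_likelihood(P_cond, xs, ys):
    """Empirical objective (5), averaged over labeled samples (x_i, y_i);
    training maximizes this, i.e., minimizes the log-loss."""
    return np.mean(np.log(P_cond[xs, ys]))

def map_predict(P_cond, xs):
    """MAP rule: y_hat = argmax_y P(y|x)."""
    return P_cond.argmax(axis=1)[xs]
```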

3. Results

3.1. Information-Theoretic Feature Selection

Suppose that, given random variables $X, Y$ with joint distribution $P_{XY}$, we want to infer an attribute $V$ of $Y$ from observed i.i.d. samples $x_1, \ldots, x_n$ of $X$. When the statistical model $P_{X|V}$ is known, the optimal decision rule is the log-likelihood ratio test, where the log-likelihood function can be viewed as the optimal feature for inference. However, in many practical situations [18], it is hard to identify the model of the targeted attribute, and it is necessary to select low-dimensional informative features of $X$ for inference tasks before knowing the model. An information-theoretic formulation of such a feature selection problem is the universal feature selection problem [24], which we formalize as follows.
To begin, for an attribute $V$, we refer to $\mathcal{C}_Y = \big( \mathcal{V}, \{P_V(v), v \in \mathcal{V}\}, \{\boldsymbol{\phi}_v^{Y|V}, v \in \mathcal{V}\} \big)$ as the configuration of $V$, where $\boldsymbol{\phi}_v^{Y|V} \leftrightarrow P_{Y|V}(\cdot|v)$ is the information vector specifying the corresponding conditional distribution $P_{Y|V}(\cdot|v)$. The configuration of $V$ models the statistical correlation between $V$ and $Y$. In the sequel, we focus on the local analysis regime, for which we assume that all the attributes $V$ of interest are $\epsilon$-attributes of $Y$. As a result, the corresponding configuration satisfies $\|\boldsymbol{\phi}_v^{Y|V}\| \leq \epsilon$, for all $v \in \mathcal{V}$. We refer to such configurations as $\epsilon$-configurations. The configuration of $V$ is unknown in advance but assumed to be generated from a rotational invariant ensemble (RIE).
Definition 4
(RIE). Two configurations $\mathcal{C}_Y$ and $\tilde{\mathcal{C}}_Y$, defined as
$$\mathcal{C}_Y \triangleq \big( \mathcal{V}, \{P_V(v), v \in \mathcal{V}\}, \{\boldsymbol{\phi}_v^{Y|V}, v \in \mathcal{V}\} \big),$$
$$\tilde{\mathcal{C}}_Y \triangleq \big( \mathcal{V}, \{P_V(v), v \in \mathcal{V}\}, \{\tilde{\boldsymbol{\phi}}_v^{Y|V}, v \in \mathcal{V}\} \big),$$
are called rotationally equivalent if there exists a unitary matrix $\mathbf{Q}$ such that $\tilde{\boldsymbol{\phi}}_v^{Y|V} = \mathbf{Q} \boldsymbol{\phi}_v^{Y|V}$, for all $v \in \mathcal{V}$. Moreover, a probability measure defined on a set of configurations is called an RIE if all rotationally equivalent configurations have the same measure.
The RIE can be interpreted as assigning a uniform measure to attributes with the same level of distinguishability. To infer the attribute $V$, we construct a $k$-dimensional feature vector $h^k = (h_1, \ldots, h_k)$, for some $1 \leq k \leq K - 1$, of the form
$$h_i = \frac{1}{n} \sum_{l=1}^{n} f_i(x_l), \quad i = 1, \ldots, k,$$
for some choice of feature functions $f_i$. Our goal is to determine the $f_i$ such that the optimal decision rule based on $h^k$ achieves the smallest possible error probability, where the performance is averaged over the possible $\mathcal{C}_Y$ generated from an RIE. In turn, we denote by $\boldsymbol{\xi}_i^X \leftrightarrow f_i$ the corresponding information vector, and define the matrix $\boldsymbol{\Xi}_X \triangleq [\boldsymbol{\xi}_1^X \, \cdots \, \boldsymbol{\xi}_k^X]$.
Theorem 1
(Universal Feature Selection). For $v, v' \in \mathcal{V}$, let $E_{h^k}(v, v')$ be the error exponent associated with the pairwise error probability of distinguishing $v$ and $v'$ based on $h^k$. Then, the expected error exponent over a given RIE defined on the set of $\epsilon$-configurations is given by
$$\mathbb{E}\big[ E_{h^k}(v, v') \big] = \frac{C_0}{2} \cdot \Big\| \tilde{\mathbf{B}}\, \boldsymbol{\Xi}_X \big( \boldsymbol{\Xi}_X^T \boldsymbol{\Xi}_X \big)^{-\frac{1}{2}} \Big\|_F^2 + o(\epsilon^2),$$
where $C_0 \triangleq \frac{1}{4|\mathcal{Y}|} \cdot \mathbb{E}\big[ \| \boldsymbol{\phi}_v^{Y|V} - \boldsymbol{\phi}_{v'}^{Y|V} \|^2 \big]$ is independent of the choices of the $f_i$'s, and the expectations $\mathbb{E}[\cdot]$ are taken over this RIE.
Proof. 
See Appendix A. □
As a result of (7), designing the $\boldsymbol{\xi}_i^X$ as the singular vectors $\boldsymbol{\psi}_i^X$ of $\tilde{\mathbf{B}}$, for $i = 1, \ldots, k$, optimizes (7) for all RIEs, pairs $(v, v')$, and $\epsilon$-configurations. Thus, the feature functions corresponding to the $\boldsymbol{\psi}_i^X$ are universally optimal for inferring the unknown attribute $V$. Moreover, (7) naturally leads to an information metric $\big\| \tilde{\mathbf{B}}\, \boldsymbol{\Xi}_X ( \boldsymbol{\Xi}_X^T \boldsymbol{\Xi}_X )^{-\frac{1}{2}} \big\|_F^2$ for any feature $\boldsymbol{\Xi}_X$ of $X$, measured by passing the normalized $\boldsymbol{\Xi}_X$ through the linear projection $\tilde{\mathbf{B}}$. This information metric quantifies how informative a feature of $X$ is when solving inference problems with respect to $Y$, and it is optimized when the features are designed as singular vectors of $\tilde{\mathbf{B}}$. Thus, we can interpret universal feature selection as solving for the most informative features for data inference via the SVD of $\tilde{\mathbf{B}}$, which also coincides with the maximally correlated features in (3). Later, we will show that feature selection in DNNs shares the same information metric as universal feature selection in the local analysis regime.
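As an illustration of this metric (our numpy sketch, reusing the CDM construction above; not from the paper's code), one can verify numerically that features built from the top singular vectors of $\tilde{\mathbf{B}}$ dominate a random feature choice:

```python
import numpy as np

rng = np.random.default_rng(0)
P_XY = rng.random((8, 6)); P_XY /= P_XY.sum()
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)
B = (P_XY.T - np.outer(P_Y, P_X)) / np.sqrt(np.outer(P_Y, P_X))

def information_metric(B, f, P_X):
    """|| B Xi (Xi^T Xi)^{-1/2} ||_F^2 for a feature table f of shape |X| x k,
    zero-mean under P_X, with Xi(x, i) = sqrt(P_X(x)) f_i(x)."""
    Xi = np.sqrt(P_X)[:, None] * f
    w, U = np.linalg.eigh(Xi.T @ Xi)
    inv_sqrt = U @ np.diag(w ** -0.5) @ U.T    # (Xi^T Xi)^{-1/2}
    return np.linalg.norm(B @ Xi @ inv_sqrt) ** 2

k = 2
f_rand = rng.standard_normal((8, k))
f_rand -= P_X @ f_rand                         # center under P_X
f_opt = np.linalg.svd(B)[2][:k].T / np.sqrt(P_X)[:, None]
assert information_metric(B, f_opt, P_X) >= information_metric(B, f_rand, P_X)
```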

3.2. Feature Extraction in Deep Neural Networks

3.2.1. Network with Ideal Expressive Power

For convenience of analysis, we first consider the ideal case where the neural network can express any feature mapping $s(\cdot)$ as desired. While this assumption can be rather strong, the existence of such ideal networks is guaranteed by the universal approximation theorem [40]. In addition, one goal of practical network design is to approximate the ideal networks and obtain sufficient expressive power. For such networks, we will show that when $X, Y$ are $\epsilon$-dependent, the extracted feature $s(x)$ and weights $v(y)$ coincide with the solutions of universal feature selection.
To begin, we use $P_{XY}$ to denote the joint empirical distribution of the labeled samples $(x_i, y_i)$, $i = 1, \ldots, N$, and $P_X$, $P_Y$ to denote the corresponding marginal distributions. Then, the objective function of (5) is the empirical average of the log-likelihood function
$$\frac{1}{N} \sum_{i=1}^{N} \log \tilde{P}_{Y|X}(y_i \,|\, x_i) = \mathbb{E}_{P_{XY}}\big[ \log \tilde{P}_{Y|X}(Y|X) \big].$$
Therefore, maximizing this empirical average is equivalent to minimizing the KL divergence:
$$(s^*, v^*, b^*) = \arg\min_{(s,v,b)} D\big( P_{XY} \,\big\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \big).$$
This can be interpreted as finding the best fit to the empirical joint distribution $P_{XY}$ among distributions of the form $P_X \tilde{P}_{Y|X}^{(s,v,b)}$. In our development, it is more convenient to denote the bias by $d(y) \triangleq b(y) - \log P_Y(y)$, for $y \in \mathcal{Y}$. Then, the following lemma gives the explicit constraint on the problem (8) in the local analysis regime.
Lemma 2.
If $X, Y$ are $\epsilon$-dependent, then the optimal $v, d$ for (8) satisfy
$$\big| \tilde{v}^T(y)\, s(x) + \tilde{d}(y) \big| = O(\epsilon), \quad \text{for all } x \in \mathcal{X},\, y \in \mathcal{Y}.$$
Proof. 
See Appendix B. □
In turn, we take (9) as the constraint for solving the problem (8) in the local analysis regime. Moreover, we define the information vectors for the zero-mean vectors $\tilde{s}, \tilde{v}$ as $\boldsymbol{\xi}_X(x) = \sqrt{P_X(x)}\, \tilde{s}(x)$ and $\boldsymbol{\xi}_Y(y) = \sqrt{P_Y(y)}\, \tilde{v}(y)$, and define the matrices
$$\boldsymbol{\Xi}_Y \triangleq \big[ \boldsymbol{\xi}_Y(1) \, \cdots \, \boldsymbol{\xi}_Y(|\mathcal{Y}|) \big]^T, \qquad \boldsymbol{\Xi}_X \triangleq \big[ \boldsymbol{\xi}_X(1) \, \cdots \, \boldsymbol{\xi}_X(|\mathcal{X}|) \big]^T.$$
Lemma 3.
The KL divergence (8) in the local analysis regime (9) can be expressed as
$$D\big( P_{XY} \,\big\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \big) = \frac{1}{2} \big\| \tilde{\mathbf{B}} - \boldsymbol{\Xi}_Y \boldsymbol{\Xi}_X^T \big\|_F^2 + \frac{1}{2}\, \eta^{(v,b)}(s) + o(\epsilon^2),$$
where $\eta^{(v,b)}(s) \triangleq \mathbb{E}_{P_Y}\big[ \big( \mu_s^T \tilde{v}(Y) + \tilde{d}(Y) \big)^2 \big]$.
Proof. 
See Appendix C. □
Lemma 3 reveals key insights into feature selection in neural networks. To see this, we consider the following two learning problems: learning the optimal weights $v$ for given $s$, and learning the optimal feature $s$ for given $v$.
For the case where $s$ is fixed, we can optimize (10) with $\boldsymbol{\Xi}_X$ fixed and obtain the following optimal weights:
Theorem 2.
For fixed $\boldsymbol{\Xi}_X$ and $\mu_s$, the optimal $\boldsymbol{\Xi}_Y^*$ minimizing (10) is given by
$$\boldsymbol{\Xi}_Y^* = \tilde{\mathbf{B}}\, \boldsymbol{\Xi}_X \big( \boldsymbol{\Xi}_X^T \boldsymbol{\Xi}_X \big)^{-1},$$
and the optimal weights $\tilde{v}^*$ and bias $\tilde{d}^*$ are
$$\tilde{v}^*(y) = \mathbb{E}_{P_{X|Y}}\big[ \Lambda_{\tilde{s}(X)}^{-1}\, \tilde{s}(X) \,\big|\, Y = y \big], \qquad \tilde{d}^*(y) = -\mu_s^T\, \tilde{v}^*(y),$$
where $\Lambda_{\tilde{s}(X)}$ denotes the covariance matrix of $\tilde{s}(X)$.
Proof. 
See Appendix D. □
Specifically, when $s(x) = x$, Theorem 2 gives the optimal weights for softmax regression. Note that Equation (11) can be viewed as a projection of the input feature $\tilde{s}(x)$ onto a feature $v(y)$ computable from the value of $y$, which is the feature most correlated with $\tilde{s}(x)$. The solution is given by the operation of left-multiplying by the matrix $\tilde{\mathbf{B}}$, which we refer to as the forward feature projection.
Remark 1.
While we assume that the continuous input $s(x)$ is a function of a discrete variable $X$, we only need the labeled samples between $s$ and $Y$ to compute the weights and bias from the conditional expectation (12); the correlation between $X$ and $s$ is irrelevant. Thus, our analysis of the weights and bias can be applied to networks with continuous inputs by simply ignoring $X$ and taking $s$ as the real input to the network.
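The forward projection can be estimated directly from data. The sketch below (ours; `s_samples` and `y_samples` are hypothetical arrays of features and labels) implements (12) by whitening the centered feature and averaging it conditioned on each label:

```python
import numpy as np

def forward_projection(s_samples, y_samples, num_y):
    """Estimate v*(y) = E[ Lambda^{-1} s_tilde(X) | Y = y ] and the optimal bias
    d*(y) = -mu_s^T v*(y) from N labeled samples; s_samples has shape N x k."""
    mu_s = s_samples.mean(axis=0)
    s_tilde = s_samples - mu_s                        # centered feature
    Lam = np.cov(s_tilde, rowvar=False, bias=True)    # covariance of s_tilde
    whitened = s_tilde @ np.linalg.inv(Lam)           # Lambda^{-1} s_tilde (Lam symmetric)
    v = np.stack([whitened[y_samples == y].mean(axis=0) for y in range(num_y)])
    d = -v @ mu_s
    return v, d
```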
We then consider the “backward feature projection” problem, which attempts to find an informative feature $s^*(X)$ minimizing the loss (10) for given weights and bias. In particular, we can show that the solution of this backward feature projection is precisely symmetric to the forward one.
Theorem 3.
For fixed $\boldsymbol{\Xi}_Y$ and $\tilde{d}$, the optimal $\boldsymbol{\Xi}_X^*$ minimizing (10) is given by
$$\boldsymbol{\Xi}_X^* = \tilde{\mathbf{B}}^T \boldsymbol{\Xi}_Y \big( \boldsymbol{\Xi}_Y^T \boldsymbol{\Xi}_Y \big)^{-1},$$
and the optimal feature function $s^*$, decomposed into $\tilde{s}^*$ and $\mu_s^*$, is given by
$$\tilde{s}^*(x) = \mathbb{E}_{P_{Y|X}}\big[ \Lambda_{\tilde{v}(Y)}^{-1}\, \tilde{v}(Y) \,\big|\, X = x \big], \qquad \mu_s^* = -\Lambda_{\tilde{v}(Y)}^{-1}\, \mathbb{E}_{P_Y}\big[ \tilde{v}(Y)\, \tilde{d}(Y) \big],$$
where $\Lambda_{\tilde{v}(Y)}$ denotes the covariance matrix of $\tilde{v}(Y)$.
Proof. 
See Appendix D. □
Finally, when both $s$ and $(v, b)$ (and hence $\boldsymbol{\Xi}_X$, $\boldsymbol{\Xi}_Y$, $d$) can be designed, the optimal $(\boldsymbol{\Xi}_Y, \boldsymbol{\Xi}_X)$ corresponds to the low-rank factorization of $\tilde{\mathbf{B}}$, and the solutions coincide with universal feature selection.
Theorem 4.
The optimal solutions for the weights and bias minimizing (10) are given by $\tilde{d}(y) = -\mu_s^T \tilde{v}(y)$, with $(\boldsymbol{\Xi}_Y, \boldsymbol{\Xi}_X)^*$ chosen as the left and right singular vectors of $\tilde{\mathbf{B}}$ corresponding to its largest $k$ singular values.
Proof. 
See Appendix E. □
Therefore, we conclude that the learning of neural networks, when both $s$ and $(v, b)$ are designable, extracts the most correlated aspects of the input data $X$ and the label $Y$, which are exactly the informative features for data inference given by universal feature selection.
In the practical learning process of a DNN, BackProp updates the weights of the softmax layer and those of the previous layer(s) in an iterative manner. As illustrated by Lemma 3, such iterative updates converge to the same solution as alternating between the forward feature projection (11) and the backward feature projection (13), which is precisely the power method for solving the SVD of $\tilde{\mathbf{B}}$ [41], also known as the Alternating Conditional Expectation (ACE) algorithm [36].
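This alternating-projection view can be simulated directly (our sketch, numpy only): iterating the two projections on $\tilde{\mathbf{B}}$ is a power-method iteration whose product $\boldsymbol{\Xi}_Y \boldsymbol{\Xi}_X^T$ converges to the rank-$k$ truncated SVD of $\tilde{\mathbf{B}}$, matching Theorem 4:

```python
import numpy as np

def alternating_projections(B, k, iters=500, seed=0):
    """Alternate the forward projection (11) and backward projection (13):
       Xi_Y <- B   Xi_X (Xi_X^T Xi_X)^{-1}
       Xi_X <- B^T Xi_Y (Xi_Y^T Xi_Y)^{-1}
    The column spaces converge to the top-k singular subspaces of B."""
    rng = np.random.default_rng(seed)
    Xi_X = rng.standard_normal((B.shape[1], k))
    for _ in range(iters):
        Xi_Y = B @ Xi_X @ np.linalg.inv(Xi_X.T @ Xi_X)
        Xi_X = B.T @ Xi_Y @ np.linalg.inv(Xi_Y.T @ Xi_Y)
    return Xi_Y, Xi_X

# At convergence, Xi_Y @ Xi_X.T equals the rank-k truncated SVD of B,
# i.e., the solution characterized by Theorem 4.
```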
Remark 2.
From Theorem 4, for a neural network with sufficient expressive power, the trained feature depends only on the distribution of the input data rather than on the training process. It is worth mentioning that this result does not contradict the practical observation that the trained weights in hidden layers can differ across training runs. In fact, due to the over-parameterized nature of practical network designs, there exist multiple choices of weights in the hidden layers that express the same optimal feature $s(x)$.

3.2.2. Network with Restricted Expressive Power

The analysis in the previous section considered neural networks with ideal expressive power, where the feature $s(X)$ can be selected as any desired function. In general, however, the form of the feature functions that can be generated is often limited by the network structure. In the following, we consider networks with restricted expressive power to characterize the impact of the network structure on the extracted feature.
For illustration, we consider a neural network with a hidden layer of $k$ nodes and a zero-mean continuous input $t = [t_1 \, \cdots \, t_m]^T \in \mathbb{R}^m$ to this hidden layer, where $t$ is assumed to be a function $t(x)$ of some discrete variable $X$. Our goal is to analyze the weights and bias in this layer given labeled samples $(t(x_i), y_i)$. Assume the activation function of the hidden layer is a smooth function $\sigma(\cdot)$; then the output $s_z(X)$ of the $z$th hidden node is
$$s_z(x) = \sigma\big( w^T(z)\, t(x) + c(z) \big), \quad \text{for } z = 1, \ldots, k, \; x \in \mathcal{X},$$
where $w(z) \in \mathbb{R}^m$ and $c(z) \in \mathbb{R}$ are the weights and bias from the input layer to the hidden layer, as shown in Figure 2. We denote by $s = [s_1 \, \cdots \, s_k]^T$ the input vector to the output classification layer.
To interpret the feature selection in hidden layers, we fix $(v(y), b(y))$ at the output layer and consider the problem of designing $(w(z), c(z))$ to minimize the loss function (8) at the output layer. Ideally, we would pick $w(z)$ and $c(z)$ to generate an $s(x)$ matching $s^*(x)$ from (14), which minimizes the loss. However, here we have the constraint that $s(x)$ must take the form (15), and, intuitively, the network should select $w(z)$, $c(z)$ so that $s(x)$ is close to $s^*(x)$. Our goal is to quantify this notion of closeness.
To develop insights into feature selection in hidden layers, we again focus on the local analysis regime, where the weights and bias are assumed to satisfy the local constraints
$$\tilde{v}^T(y)\, s(x) + \tilde{d}(y) = O(\epsilon), \qquad w^T(z)\, \tilde{t}(x) = O(\epsilon), \quad \text{for all } x, y, z.$$
Then, since $t$ is zero-mean, we can express (15) as
$$s_z(x) = \sigma\big( w^T(z)\, t(x) + c(z) \big) = w^T(z)\, \tilde{t}(x) \cdot \sigma'\big( c(z) \big) + \sigma\big( c(z) \big) + o(\epsilon).$$
Moreover, we define a matrix $\tilde{\mathbf{B}}_1$ with the $(z,x)$th entry $\tilde{B}_1(z,x) = \sqrt{P_X(x)}\, \tilde{s}_z^*(x) / \sigma'(c(z))$, which can be interpreted as a generalized CDM for the hidden layer. Furthermore, we denote by $\boldsymbol{\xi}_1^X(x) = \sqrt{P_X(x)}\, \tilde{t}(x)$ the information vector of $\tilde{t}(x)$, with the matrix $\boldsymbol{\Xi}_1^X$ defined as $\boldsymbol{\Xi}_1^X \triangleq \big[ \boldsymbol{\xi}_1^X(1) \, \cdots \, \boldsymbol{\xi}_1^X(|\mathcal{X}|) \big]^T$, and we also define
$$\mathbf{W} \triangleq \big[ w(1) \, \cdots \, w(k) \big]^T,$$
$$\mathbf{J} \triangleq \operatorname{diag}\big\{ \sigma'(c(1)), \sigma'(c(2)), \ldots, \sigma'(c(k)) \big\}.$$
The following theorem characterizes the loss (8).
Theorem 5.
Given the weights and bias $(v, b)$ at the output layer, and for any input feature $s$, denote by $L(s)$ the loss (8) evaluated with respect to $(v, b)$ and $s$. Then, under the constraints (16),
$$L(s) - L(s^*) = \frac{1}{2} \big\| \boldsymbol{\Theta} \tilde{\mathbf{B}}_1 - \boldsymbol{\Theta} \mathbf{W} (\boldsymbol{\Xi}_1^X)^T \big\|_F^2 + \frac{1}{2}\, \kappa^{(v,b)}(s, s^*) + o(\epsilon^2),$$
where $\boldsymbol{\Theta} \triangleq \big( \boldsymbol{\Xi}_Y^T \boldsymbol{\Xi}_Y \big)^{1/2} \mathbf{J}$ and $\kappa^{(v,b)}(s, s^*) = \big( \mu_s - \mu_{s^*} \big)^T \Lambda_{\tilde{v}(Y)} \big( \mu_s - \mu_{s^*} \big)$.
Proof. 
See Appendix F. □
Equation (20) quantifies the closeness between $s$ and $s^*$ in terms of the loss (8). Then, our goal is to minimize (20), which can be separated into two optimization problems:
$$\mathbf{W}^* = \arg\min_{\mathbf{W}} \big\| \boldsymbol{\Theta} \tilde{\mathbf{B}}_1 - \boldsymbol{\Theta} \mathbf{W} (\boldsymbol{\Xi}_1^X)^T \big\|_F^2,$$
$$\mu_s^* = \arg\min_{\mu_s} \kappa^{(v,b)}(s, s^*).$$
Note that the optimization problem (21) is similar to the one appearing in Lemma 3, and the optimal solution is given by $\mathbf{W}^* = \tilde{\mathbf{B}}_1 \boldsymbol{\Xi}_1^X \big( (\boldsymbol{\Xi}_1^X)^T \boldsymbol{\Xi}_1^X \big)^{-1}$. Therefore, solving for the optimal weights in the hidden layer can be interpreted as projecting $\tilde{s}^*(x)$ onto the subspace of feature functions spanned by $t(x)$ to find the closest expressible function. In addition, the problem (22) chooses $\mu_s$ (and hence the bias $c(z)$) to minimize a quadratic term similar to $\eta^{(v,b)}(s)$ in (10). Similar to the analyses of the parameters in the last layer, we can obtain analytical solutions for the hidden-layer parameters, e.g., $\mu_s^*$ and $w^*$, with detailed discussions provided in Appendix G.
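A minimal sketch of this hidden-layer projection (our illustration; the argument names are ours) computes $\mathbf{W}^*$ as a $P_X$-weighted least-squares projection of the target feature $\tilde{s}^*$ onto the span of the layer input $\tilde{t}$:

```python
import numpy as np

def hidden_layer_weights(s_star_tilde, t_tilde, P_X, sigma_prime_c):
    """W* = B_1 Xi_1 (Xi_1^T Xi_1)^{-1}: project s_tilde^* (|X| x k) onto the
    span of t_tilde (|X| x m), weighted by P_X; sigma_prime_c holds sigma'(c(z))."""
    sq = np.sqrt(P_X)[:, None]
    B1 = (sq * s_star_tilde).T / sigma_prime_c[:, None]   # B_1(z, x), k x |X|
    Xi1 = sq * t_tilde                                    # information vectors of t_tilde
    return B1 @ Xi1 @ np.linalg.inv(Xi1.T @ Xi1)          # k x m weight matrix W*
```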
Overall, we observe the correspondence between (11), (14) and (21), (22), and interpret both operations as feature projections. Our argument can be generalized to any intermediate layer in a multi-layer network, with all previous layers viewed as fixed pre-processing that specifies $t(x)$, and all subsequent layers determining $s^*$. Then, the iterative procedure of back-propagation can be viewed as an alternating projection seeking the fixed-point solution over the entire network. This final fixed-point solution, even under the local assumption, might not be the SVD solution of Theorem 4, because the limited expressive power of the network often makes it impossible to generate the desired feature function. In such cases, the concept of feature projection can be used to quantify this gap, and thus to measure the quality of the selected features.

3.3. Scoring Neural Networks

Given a learning problem, it is useful to tell whether or not some extracted features are informative [42]. Our previous development naturally gives rise to a performance metric.
Definition 5.
Given a feature $s(x) \in \mathbb{R}^k$ and weights $v(y) \in \mathbb{R}^k$ with the corresponding information matrices $\boldsymbol{\Xi}_X$ and $\boldsymbol{\Xi}_Y$, the H-score $H(s, v)$ is defined as
$$H(s, v) \triangleq \frac{1}{2} \big\| \tilde{\mathbf{B}} \big\|_F^2 - \frac{1}{2} \big\| \tilde{\mathbf{B}} - \boldsymbol{\Xi}_Y \boldsymbol{\Xi}_X^T \big\|_F^2 = \mathbb{E}_{P_{XY}}\big[ \tilde{s}^T(X)\, \tilde{v}(Y) \big] - \frac{1}{2} \operatorname{tr}\big( \Lambda_{\tilde{s}(X)} \Lambda_{\tilde{v}(Y)} \big).$$
In addition, for a given $s(x)$, we define the single-sided H-score $H(s)$ as
$$H(s) \triangleq \max_v H(s, v) = \frac{1}{2} \big\| \tilde{\mathbf{B}} \big\|_F^2 - \frac{1}{2} \Big\| \tilde{\mathbf{B}} - \tilde{\mathbf{B}} \boldsymbol{\Xi}_X \big( \boldsymbol{\Xi}_X^T \boldsymbol{\Xi}_X \big)^{-1} \boldsymbol{\Xi}_X^T \Big\|_F^2 = \frac{1}{2} \Big\| \tilde{\mathbf{B}} \boldsymbol{\Xi}_X \big( \boldsymbol{\Xi}_X^T \boldsymbol{\Xi}_X \big)^{-\frac{1}{2}} \Big\|_F^2 = \frac{1}{2}\, \mathbb{E}_{P_Y}\Big[ \big\| \mathbb{E}_{P_{X|Y}}\big[ \Lambda_{\tilde{s}(X)}^{-1/2}\, \tilde{s}(X) \,\big|\, Y \big] \big\|^2 \Big].$$
The H-score can be used to measure the quality of the features generated at any intermediate layer of the network. It is related to (20) when choosing the optimal bias and $\boldsymbol{\Theta}$ as the identity matrix. This can be understood as taking the output $s(x)$ of this layer and feeding it directly to a softmax output layer with $v(y)$ used as the weights; $H(s, v)$ then measures the resulting performance. Note that $v(y)$ here can be an arbitrary function of $Y$, not necessarily the weights of the next layer computed by the network. When the optimal $v^*(y)$ defined in (12) is used, the resulting performance becomes the one-sided H-score $H(s)$, which measures the quality of $s(x)$. In addition, by comparing (26) with (7), the performance measure $H(s)$ also coincides with the information metric (7), up to a scale factor.
Specifically, for a given dataset and a feature extractor that generates $s(\cdot)$, the H-score $H(s)$ can be efficiently computed from the second equation of (26). In addition, when we use the H-score to compare the performance of different feature extractors (models), the model complexity has to be taken into account to reduce overfitting. To this end, we adopt the Akaike information criterion (AIC) and define the AIC-corrected H-score
$$H_{\mathrm{AIC}}(s) \triangleq H(s) - \frac{n_p}{n_s}$$
for comparing different models, where $n_p$ and $n_s$ represent the number of parameters in the model and the training sample size, respectively.
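For concreteness, a sample-based numpy sketch of these quantities (our implementation of Definition 5 and (27), not from the paper's released code; the array names are ours) estimates $H(s)$ from the conditional-expectation form in (26):

```python
import numpy as np

def h_score(s_samples, y_samples):
    """H(s) = (1/2) E_Y || E[ Lambda^{-1/2} s_tilde(X) | Y ] ||^2,
    estimated from an N x k feature array and N labels."""
    s_tilde = s_samples - s_samples.mean(axis=0)
    Lam = np.cov(s_tilde, rowvar=False, bias=True)
    w, U = np.linalg.eigh(Lam)
    white = s_tilde @ (U @ np.diag(w ** -0.5) @ U.T)      # Lambda^{-1/2} s_tilde
    labels, counts = np.unique(y_samples, return_counts=True)
    cond = np.stack([white[y_samples == y].mean(axis=0) for y in labels])
    P_Y = counts / len(y_samples)
    return 0.5 * float(P_Y @ np.sum(cond ** 2, axis=1))

def h_score_aic(s_samples, y_samples, n_params, n_train):
    """AIC-corrected H-score (27): penalize model size by training sample size."""
    return h_score(s_samples, y_samples) - n_params / n_train
```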
In current practice, the cross-entropy $-\mathbb{E}_{P_{XY}}\big[ \log \tilde{P}_{Y|X}^{(v,b)}(Y|X) \big]$ is often used as the performance metric. One can, in principle, also use the log-loss to measure the effectiveness of the selected feature at the output of an intermediate layer [42]. However, one problem with this metric is that, for a given problem, it is not clear what value of log-loss one should expect, as the log-loss is generally unbounded. In contrast, the H-score can be directly computed from the data samples and has a clear upper bound. Indeed, it follows from Lemma 1 that, for a $k$-dimensional feature $s$ and weights $v$, we have the sequence of inequalities
$$H(s, v) \leq H(s) \leq \frac{1}{2} \sum_{i=1}^{k} \sigma_i^2 \leq \frac{k}{2},$$
where $\sigma_i$ denotes the $i$th singular value of $\tilde{\mathbf{B}}$.
In particular, the first “$\leq$” follows from the definition (24), and the gap between $H(s, v)$ and $H(s)$ measures the optimality of the weights $v$; the second “$\leq$” follows from the first equality of (26), and the gap between the two sides characterizes the difference between the chosen feature and the optimal solution, which is a useful measure of how restrictive (lacking in expressive power) the network structure is; the last “$\leq$” follows from the fact that $\sigma_i \leq 1$ (cf. Lemma 1), and it measures the dependency between the data variable and the label for the given dataset. In Section 3.4.3, we validate this metric on real data.

3.4. Experiments

This section presents experiments for validating our theoretical characterizations, with corresponding code available at https://github.com/XiangxiangXu/dnn (accessed on 7 December 2021). Specifically, all DNN models used in Section 3.4.3 are available at https://keras.io/applications/ (accessed on 7 December 2021).

3.4.1. Experimental Validation of Theorem 4

We first validate Theorem 4, i.e., the optimal feature extracted by a network with ideal expressive power. Here, we consider discrete data with alphabet sizes $|\mathcal{X}| = 8$ and $|\mathcal{Y}| = 6$, and construct the network shown in Figure 3. Specifically, the network input is the one-hot encoding of $X$, i.e., $\big[ \mathbb{1}_{\{X=1\}}, \ldots, \mathbb{1}_{\{X=|\mathcal{X}|\}} \big]^T$, where $\mathbb{1}_{\{X=x\}}$ takes the value one if and only if $X = x$, and zero otherwise. Then, the feature $s(X)$ is generated by a linear layer, with the sigmoid function used as the activation function. For ease of comparison and presentation, we set the feature dimension to $k = 1$, since otherwise the optimal feature (cf. Theorem 4) lies in a subspace and is non-unique. It can be verified that this network has ideal expressive power, i.e., with proper weights in the first layer, $s(X)$ can express any desired function up to scaling and shifting.
To compare the result trained by the neural network with that of Theorem 4, we first randomly generate a distribution $P_{XY}$ and then draw $n = 100{,}000$ independent pairs of $(X, Y)$ samples. We then train the network using batch gradient descent, where we apply Nesterov momentum [43] with the momentum hyperparameter set to 0.9. In addition, we set the learning rate to 4 with a decay factor of 0.01 and clip gradients whose norm exceeds 0.5. After training, the learned values of $s(x)$, $v(y)$, and $b(y)$ are shown in Figure 4 and compared with the theoretical results. From the figure, we observe that the training results match our theoretical analyses.

3.4.2. Experimental Validation of Theorem 5

In addition, we validate Theorem 5 using the neural network depicted in Figure 5, with the same settings for $X, Y$. Specifically, the numbers of neurons in the hidden layers are set to $m = 4$ and $k = 3$, where $t(X)$ is randomly generated from $X$, and we choose the sigmoid function as the activation function $\sigma(\cdot)$ to generate $s(x)$. We then fix the weights and bias at the output layer and train the weights $w(1), w(2), w(3)$ and bias $c$ in the hidden layer to optimize the log-loss. Specifically, we use batch gradient descent with the Nesterov momentum hyperparameter set to 0.9. In addition, we set the learning rate to 4 with a decay factor of $10^{-6}$ and clip gradients whose norm exceeds 0.1. After training, Figure 6 shows the match between the learned results and the corresponding theoretical values.

3.4.3. Experimental Validation of H-Score

To validate the H-score as a performance measure for extracted features, we compare the H-score and classification accuracy of DNNs on image classification tasks. Specifically, we use the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) [22] dataset and extract features using several deep neural networks with representative architecture designs [44,45,46,47,48,49]. After training the feature extractors on the ILSVRC2012 training set, we compute the H-score of the feature in the last hidden layer, as well as the classification accuracies on the ILSVRC2012 validation set (here, we use the ILSVRC2012 validation set for testing, as the labels of the ILSVRC2012 testing set have not been publicly released). The results are summarized in Table 1, where $H_{\mathrm{AIC}}(s)$ is the AIC-corrected H-score defined in (27), with $n_p$ being the number of model parameters and $n_s = 1{,}300{,}000$ the number of training samples in ImageNet. The AIC-corrected H-score is consistent with the classification accuracy, which validates the effectiveness of the H-score as a measure for neural networks.

4. Discussion

Our characterization gives an information-theoretic interpretation of the feature extraction process in DNNs, which also provides a practical performance measure for scoring neural networks. Different from empirical studies focusing on specific datasets [7], our development is based on the probability distribution space, which is more general and can also provide theoretical insights. Moreover, the information-theoretic framework allows us to obtain direct operational meaning and better interpretations for the solutions, compared with optimization-based theoretical characterizations, e.g., [11,13].
As a first step in establishing a rigorous framework for DNN analysis, the present work can be extended in both theoretical and practical directions. From the theoretical perspective, one extension is to investigate the analytical properties of general DNNs, using the theoretical insights obtained from the local analysis regime. For example, it was shown in [50] that the symmetry between features and weights in DNNs established in the local analysis regime (cf. Section 3.2.1) also holds for general probability distributions. Another extension is to apply the framework to investigate the optimal feature for structured data or networks, e.g., data with a sparsity structure [51].
From the practical perspective, in addition to the demonstrated example of evaluating existing DNN models (cf. Section 3.4.3), the H-score can also be used as an objective function in designing learning algorithms. In particular, such usages have been illustrated in multi-modal learning [52] and transfer learning [53] tasks.

5. Conclusions

In this paper, we apply the local information geometric analysis and provide an information-theoretic interpretation to the feature extraction scheme in DNNs. We first establish an information metric for features in inference tasks by formalizing the information-theoretic feature selection problem. In addition, we demonstrate that the features extracted by DNNs coincide with the information-theoretically optimal feature, with the same metric measuring the performance of features, called H-score. Furthermore, we discuss the usage of the H-score for measuring the effectiveness of DNNs. Our framework demonstrates a connection between the practical deep learning implementations and information-theoretic characterizations, which can provide theoretical insights for DNN analysis and learning algorithm designs.

Author Contributions

X.X., S.-L.H., L.Z. and G.W.W. contributed to the conceptualization, methodology, and writing of this paper. All authors have read and agreed to the published version of the manuscript.

Funding

The work of S.-L. Huang was supported in part by the National Natural Science Foundation of China under Grant 61807021 and the Shenzhen Science and Technology Program under Grant KQTD20170810150821146. The work of L. Zheng was supported in part by the National Science Foundation (NSF) under Award CNS-2002908 and the Office of Naval Research (ONR) under Grant N00014-19-1-2621.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. Proof of Theorem 1

We commence with the characterization of the error exponent.
Lemma A1.
Given a reference distribution $P_X \in \operatorname{relint}(\mathcal{P}^{\mathcal{X}})$, a constant $\epsilon > 0$, and integers $n$ and $k$, let $x_1, \ldots, x_n$ denote i.i.d. samples from one of $P_1$ or $P_2$, where $P_1, P_2 \in \mathcal{N}_\epsilon^{\mathcal{X}}(P_X)$. To decide whether $P_1$ or $P_2$ is the generating distribution, a sequence of $k$-dimensional statistics $h^k = (h_1, \ldots, h_k)$ is constructed as
$$h_i = \frac{1}{n} \sum_{l=1}^{n} f_i(x_l), \quad i = 1, \ldots, k,$$
where $(f_1(X), \ldots, f_k(X))$ are zero-mean, unit-variance, and uncorrelated with respect to $P_X$, i.e.,
$$\mathbb{E}_{P_X}[f_i(X)] = 0, \quad i \in \{1, \ldots, k\},$$
$$\mathbb{E}_{P_X}[f_i(X) f_j(X)] = \delta_{ij}, \quad i, j \in \{1, \ldots, k\}.$$
Then, the error probability $p_e$ of the decision based on $h^k$ decays exponentially in $n$ as $n \to \infty$, with (Chernoff) exponent
$$\lim_{n \to \infty} \bigg( -\frac{\log p_e}{n} \bigg) \triangleq E_{h^k} = \sum_{i=1}^{k} E_{h_i},$$
where
$$E_{h_i} = \frac{1}{8} \big\langle \boldsymbol{\phi}_1 - \boldsymbol{\phi}_2, \boldsymbol{\xi}_i \big\rangle^2 + o(\epsilon^2),$$
and $\boldsymbol{\phi}_1 \leftrightarrow P_1$, $\boldsymbol{\phi}_2 \leftrightarrow P_2$, and $\boldsymbol{\xi}_i \leftrightarrow f_i(X)$, $i \in \{1, \ldots, k\}$, are the corresponding information vectors.
Proof of Lemma A1.
Since the rule is to decide by comparing the projection
$$\sum_{i=1}^{k} h_i \big( \mathbb{E}_{P_1}[f_i(X)] - \mathbb{E}_{P_2}[f_i(X)] \big)$$
to a threshold, via Cramér's theorem [54], the error exponent under $P_j$ ($j = 1, 2$) is
$$E_j(\lambda) = \min_{P \in \mathcal{S}(\lambda)} D(P \| P_j),$$
where
$$\mathcal{S}(\lambda) \triangleq \Big\{ P \in \mathcal{P}^{\mathcal{X}} : \mathbb{E}_P[f^k(X)] = \lambda\, \mathbb{E}_{P_1}[f^k(X)] + (1 - \lambda)\, \mathbb{E}_{P_2}[f^k(X)] \Big\}.$$
Now, since (A2) holds, we obtain
$$\mathbb{E}_{P_j}[f_i(X)] = \sum_{x \in \mathcal{X}} P_j(x) f_i(x) = \sum_{x \in \mathcal{X}} P_X(x) f_i(x) + \sum_{x \in \mathcal{X}} \big( P_j(x) - P_X(x) \big) f_i(x) = \mathbb{E}_{P_X}[f_i(X)] + \sum_{x \in \mathcal{X}} \sqrt{P_X(x)}\, \phi_j(x) \cdot \frac{\xi_i(x)}{\sqrt{P_X(x)}} = \sum_{x \in \mathcal{X}} \phi_j(x)\, \xi_i(x) = \langle \boldsymbol{\phi}_j, \boldsymbol{\xi}_i \rangle, \quad j = 1, 2 \text{ and } i = 1, \ldots, k,$$
which we express compactly as
$$\mathbb{E}_{P_j}[f^k(X)] = \langle \boldsymbol{\phi}_j, \boldsymbol{\xi}^k \rangle, \quad j = 1, 2,$$
with $\boldsymbol{\xi}^k \triangleq (\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_k)$.
Hence, the constraint (A7) is expressed in terms of information vectors as
$$\langle \boldsymbol{\phi}, \boldsymbol{\xi}_i \rangle = \big\langle \lambda \boldsymbol{\phi}_1 + (1 - \lambda) \boldsymbol{\phi}_2, \boldsymbol{\xi}_i \big\rangle, \quad i = 1, \ldots, k,$$
i.e.,
$$\langle \boldsymbol{\phi}, \boldsymbol{\xi}^k \rangle = \big\langle \lambda \boldsymbol{\phi}_1 + (1 - \lambda) \boldsymbol{\phi}_2, \boldsymbol{\xi}^k \big\rangle.$$
In turn, the optimal $P$ in (A6), which we denote by $P^*$, lies in the exponential family through $P_j$ with natural statistic $f^k(x)$, i.e., the $k$-dimensional family whose members are of the form
$$\log \tilde{P}_{\theta^k}(x) = \sum_{i=1}^{k} \theta_i f_i(x) + \log P_j(x) - \alpha(\theta^k),$$
for which the associated information vector is
$$\tilde{\phi}_{\theta^k}(x) = \sum_{i=1}^{k} \theta_i \xi_i(x) + \phi_j(x) - \alpha(\theta^k)\, \sqrt{P_X(x)} + o(\epsilon),$$
where we have used the fact that
$$\log Q_X(x) = \log P_X(x) + \log \frac{Q_X(x)}{P_X(x)} = \log P_X(x) + \log\bigg( 1 + \frac{\phi(x)}{\sqrt{P_X(x)}} \bigg) = \log P_X(x) + \frac{\phi(x)}{\sqrt{P_X(x)}} + o(\epsilon)$$
for all $Q_X \in \mathcal{N}_\epsilon^{\mathcal{X}}(P_X)$ with information vector $\boldsymbol{\phi} \leftrightarrow Q_X$. As a result,
$$\big\langle \tilde{\boldsymbol{\phi}}_{\theta^k}, \boldsymbol{\xi}_i \big\rangle = \theta_i + \langle \boldsymbol{\phi}_j, \boldsymbol{\xi}_i \rangle + o(\epsilon),$$
where we have used (A3). Hence, via (A9), we obtain that the intersection with the linear family (A7) is at $P^* = P_{\theta^{k*}}$ with
$$\theta_i^* = \big\langle \lambda \boldsymbol{\phi}_1 + (1 - \lambda) \boldsymbol{\phi}_2 - \boldsymbol{\phi}_j, \boldsymbol{\xi}_i \big\rangle + o(\epsilon),$$
and thus
$$E_j(\lambda) = D(P^* \| P_j) = \frac{1}{2} \big\| \tilde{\boldsymbol{\phi}}_{\theta^{k*}} - \boldsymbol{\phi}_j \big\|^2 + o(\epsilon^2) = \frac{1}{2} \bigg\| \sum_{i=1}^{k} \theta_i^* \boldsymbol{\xi}_i \bigg\|^2 + \frac{1}{2} \big( \alpha(\theta^{k*}) \big)^2 + o(\epsilon^2) = \frac{1}{2} \sum_{i=1}^{k} (\theta_i^*)^2 + \frac{1}{2} \big( \alpha(\theta^{k*}) \big)^2 + o(\epsilon^2) = \frac{1}{2} \sum_{i=1}^{k} \big\langle \lambda \boldsymbol{\phi}_1 + (1 - \lambda) \boldsymbol{\phi}_2 - \boldsymbol{\phi}_j, \boldsymbol{\xi}_i \big\rangle^2 + o(\epsilon^2),$$
where to obtain (A11) we have exploited the local approximation of the KL divergence [18], to obtain (A12) we have exploited (A10), to obtain (A13) we have again exploited (A3), and to obtain (A14) we have used that
$$\alpha(\theta^{k*}) = o(\epsilon^2),$$
since $\theta^{k*} = O(\epsilon)$ and
$$\alpha(0) = 0, \quad \text{and} \quad \nabla\alpha(0) = \mathbb{E}_{P_j}[f^k(X)] = \langle \boldsymbol{\phi}_j, \boldsymbol{\xi}^k \rangle = O(\epsilon).$$
Finally, $E_1(\lambda) = E_2(\lambda)$ when $\lambda = 1/2$, so the overall error probability has exponent (A5). □
Then, the following lemma demonstrates a property of information vectors in a Markov chain.
Lemma A2.
Given the Markov relation $X \leftrightarrow Y \leftrightarrow V$ and any $v \in \mathcal{V}$, let $\boldsymbol{\phi}_v^{X|V}$ and $\boldsymbol{\phi}_v^{Y|V}$ denote the information vectors associated with $P_{X|V}(\cdot|v)$ and $P_{Y|V}(\cdot|v)$; then we have
$$\boldsymbol{\phi}_v^{X|V} = \tilde{\mathbf{B}}^T \boldsymbol{\phi}_v^{Y|V}.$$
Proof of Lemma A2.
From the Markov relation, we have
$$P_X(x) = \sum_{y \in \mathcal{Y}} P_{X|Y}(x|y)\, P_Y(y)$$
and
$$P_{X|V}(x|v) = \sum_{y \in \mathcal{Y}} P_{X|Y,V}(x|y,v)\, P_{Y|V}(y|v) = \sum_{y \in \mathcal{Y}} P_{X|Y}(x|y)\, P_{Y|V}(y|v).$$
As a result,
$$P_{X|V}(x|v) - P_X(x) = \sum_{y \in \mathcal{Y}} P_{X|Y}(x|y) \big[ P_{Y|V}(y|v) - P_Y(y) \big],$$
from which we obtain the corresponding information vector
$$\phi_v^{X|V}(x) = \frac{1}{\sqrt{P_X(x)}} \sum_{y \in \mathcal{Y}} P_{X|Y}(x|y)\, \sqrt{P_Y(y)}\, \phi_v^{Y|V}(y) = \sum_{y \in \mathcal{Y}} \Big[ \tilde{B}(y,x) + \sqrt{P_X(x)}\sqrt{P_Y(y)} \Big] \phi_v^{Y|V}(y) = \sum_{y \in \mathcal{Y}} \tilde{B}(y,x)\, \phi_v^{Y|V}(y),$$
where the last equality follows from the fact that
$$\sum_{y \in \mathcal{Y}} \sqrt{P_Y(y)}\, \phi_v^{Y|V}(y) = \sum_{y \in \mathcal{Y}} \big[ P_{Y|V}(y|v) - P_Y(y) \big] = 0.$$
Finally, rewriting (A16) in matrix form, we obtain (A15). □
In addition, the following lemma is useful for dealing with the expectation over an RIE.
Lemma A3.
Let $\mathbf{z}$ be a spherically symmetric random vector of dimension $M$, i.e., for any orthogonal $\mathbf{Q}$ we have $\mathbf{z} \overset{d}{=} \mathbf{Q}\mathbf{z}$. If $\mathbf{A}$ is a fixed matrix of compatible dimensions, then
$$\mathbb{E}\big[ \| \mathbf{z}^T \mathbf{A} \|^2 \big] = \frac{1}{M}\, \mathbb{E}\big[ \|\mathbf{z}\|^2 \big]\, \|\mathbf{A}\|_F^2.$$
Proof of Lemma A3.
By definition, we have $\Lambda_{\mathbf{z}} = \mathbf{Q} \Lambda_{\mathbf{z}} \mathbf{Q}^T$ for any orthogonal $\mathbf{Q}$; hence, $\Lambda_{\mathbf{z}}$ is a scaled identity matrix, i.e., $\Lambda_{\mathbf{z}} = \lambda \mathbf{I}$ for some $\lambda \geq 0$. Then, from
$$\operatorname{tr}(\Lambda_{\mathbf{z}}) = \mathbb{E}\big[ \|\mathbf{z}\|^2 \big] = \lambda M$$
we obtain
$$\lambda = \frac{1}{M} \operatorname{tr}(\Lambda_{\mathbf{z}}).$$
As a result, we have
$$\mathbb{E}\big[ \| \mathbf{z}^T \mathbf{A} \|^2 \big] = \operatorname{tr}\big( \mathbf{A}^T \Lambda_{\mathbf{z}} \mathbf{A} \big) = \lambda\, \operatorname{tr}\big( \mathbf{A}^T \mathbf{A} \big) = \frac{1}{M}\, \mathbb{E}\big[ \|\mathbf{z}\|^2 \big]\, \|\mathbf{A}\|_F^2. \quad \square$$
Proceeding to our proof of Theorem 1: by the definition of the feature functions, we have $\mathbb{E}_{P_X}[f_i(X)] = 0$, $i = 1, \ldots, k$. Suppose $\mathbf{f}$ is the vector representation of $f^k$, and denote by $\tilde{\mathbf{f}} \triangleq \Lambda_{\mathbf{f}}^{-1/2} \mathbf{f}$ the normalized $\mathbf{f}$, with $\Lambda_{\mathbf{f}}^{1/2}$ denoting any square-root matrix of $\Lambda_{\mathbf{f}}$. Then, the corresponding statistics $\tilde{f}^k = (\tilde{f}_1, \ldots, \tilde{f}_k)$ satisfy the constraints (A2) and (A3). In addition, we construct the statistic $\tilde{h}^k = (\tilde{h}_1, \ldots, \tilde{h}_k)$ as [cf. (A1)]
$$\tilde{h}_i = \frac{1}{n} \sum_{l=1}^{n} \tilde{f}_i(x_l), \quad i = 1, \ldots, k.$$
Then, from Lemma A1, the error exponent of distinguishing $v$ and $v'$ based on $\tilde{h}^k$ is
$$E_{\tilde{h}^k}(v, v') = \frac{1}{8} \sum_{i=1}^{k} \Big[ \big( \boldsymbol{\phi}_v^{X|V} - \boldsymbol{\phi}_{v'}^{X|V} \big)^T \tilde{\boldsymbol{\xi}}_i^X \Big]^2 + o(\epsilon^2) = \frac{1}{8} \Big\| \big( \boldsymbol{\phi}_v^{X|V} - \boldsymbol{\phi}_{v'}^{X|V} \big)^T \tilde{\boldsymbol{\Xi}}_X \Big\|^2 + o(\epsilon^2),$$
where $\boldsymbol{\phi}_v^{X|V}$ denotes the information vector associated with $P_{X|V}(\cdot|v)$, $\tilde{\boldsymbol{\xi}}_i^X$ denotes the information vector of $\tilde{f}_i$, and $\tilde{\boldsymbol{\Xi}}_X \triangleq [\tilde{\boldsymbol{\xi}}_1^X, \ldots, \tilde{\boldsymbol{\xi}}_k^X]$. Since the optimal decision rule is linear, the error exponent is invariant under invertible linear transformations of the statistics, i.e.,
$$E_{h^k}(v, v') = E_{\tilde{h}^k}(v, v') = \frac{1}{8} \Big\| \big( \boldsymbol{\phi}_v^{X|V} - \boldsymbol{\phi}_{v'}^{X|V} \big)^T \tilde{\boldsymbol{\Xi}}_X \Big\|^2 + o(\epsilon^2) = \frac{1}{8} \Big\| \big( \boldsymbol{\phi}_v^{Y|V} - \boldsymbol{\phi}_{v'}^{Y|V} \big)^T \tilde{\mathbf{B}}\, \tilde{\boldsymbol{\Xi}}_X \Big\|^2 + o(\epsilon^2),$$
where the last equality follows from Lemma A2.
As a result, taking the expectation of (A19) over a given RIE yields
$$\mathbb{E}\big[ E_{h^k}(v, v') \big] = \frac{1}{8}\, \mathbb{E}\Big[ \Big\| \big( \boldsymbol{\phi}_v^{Y|V} - \boldsymbol{\phi}_{v'}^{Y|V} \big)^T \tilde{\mathbf{B}}\, \tilde{\boldsymbol{\Xi}}_X \Big\|^2 \Big] + o(\epsilon^2) = \frac{\mathbb{E}\big[ \| \boldsymbol{\phi}_v^{Y|V} - \boldsymbol{\phi}_{v'}^{Y|V} \|^2 \big]}{8 |\mathcal{Y}|}\, \big\| \tilde{\mathbf{B}}\, \tilde{\boldsymbol{\Xi}}_X \big\|_F^2 + o(\epsilon^2),$$
where we have exploited Lemma A3. Finally, the error exponent (7) is obtained by noting from the definition of $\tilde{f}^k$ that
$$\tilde{\boldsymbol{\Xi}}_X = \boldsymbol{\Xi}_X \big( \boldsymbol{\Xi}_X^T \boldsymbol{\Xi}_X \big)^{-\frac{1}{2}}. \quad \square$$

Appendix B. Proof of Lemma 2

We first prove two useful lemmas.
Lemma A4.
For distributions $P \in \operatorname{relint}(\mathcal{P}^{\mathcal{X}})$, $Q, R \in \mathcal{P}^{\mathcal{X}}$, and sufficiently small $\epsilon$, if $D(P\|Q) \leq \epsilon^2$ and $D(P\|R) \leq \epsilon^2$, then there exists a constant $C > 0$, independent of $\epsilon$, such that $D(Q\|R) \leq C\epsilon^2$.
Proof of Lemma A4.
Denote by $\|\cdot\|_1$ the $\ell_1$-distance between distributions, i.e., $\|P - Q\|_1 \triangleq \sum_{x \in \mathcal{X}} |P(x) - Q(x)|$; then from Pinsker's inequality [14], we have
$$\|P - Q\|_1 \leq \sqrt{2\, D(P\|Q)} \leq \sqrt{2}\,\epsilon,$$
$$\|P - R\|_1 \leq \sqrt{2\, D(P\|R)} \leq \sqrt{2}\,\epsilon,$$
which implies
$$\|Q - R\|_1 \leq \|P - Q\|_1 + \|P - R\|_1 \leq 2\sqrt{2}\,\epsilon.$$
In addition, with $p_{\min} \triangleq \min_{x \in \mathcal{X}} P(x)$, for all $x \in \mathcal{X}$ we have
$$R(x) \geq P(x) - |P(x) - R(x)| \geq \min_{x \in \mathcal{X}} P(x) - \sqrt{2}\,\epsilon = p_{\min} - \sqrt{2}\,\epsilon,$$
where to obtain (A24) we have used (A21). Note that since $P \in \operatorname{relint}(\mathcal{P}^{\mathcal{X}})$ we have $p_{\min} > 0$, and thus $R(x) > p_{\min}/2$ for sufficiently small $\epsilon$. As a result,
$$D(Q\|R) \leq \sum_{x \in \mathcal{X}} \frac{\big( Q(x) - R(x) \big)^2}{R(x)} \leq \frac{2}{p_{\min}} \sum_{x \in \mathcal{X}} \big[ Q(x) - R(x) \big]^2 \leq \frac{2\, \|Q - R\|_1^2}{p_{\min}} \leq \frac{16}{p_{\min}}\,\epsilon^2,$$
where to obtain (A26) we have used the fact that the KL divergence is upper bounded by the corresponding $\chi^2$-divergence [55], and to obtain (A29) we have used (A22). □
Lemma A5.
For all $(x,y) \in \mathcal{X} \times \mathcal{Y}$, we have
$$D\big( P_X P_Y \,\big\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \big) \geq P_X(x) \log\Big( P_Y(y)\, e^{\tau(x,y)} + \big( 1 - P_Y(y) \big)\, e^{-\frac{P_Y(y)}{1 - P_Y(y)}\, \tau(x,y)} \Big),$$
where $\tilde{P}_{Y|X}^{(s,v,b)}$ is as defined in (4), and where we have defined $\tau(x,y) \triangleq \tilde{v}^T(y)\, s(x) + \tilde{d}(y)$.
Proof of Lemma A5.
First, we can rewrite the conditional distribution $\tilde{P}_{Y|X}^{(s,v,b)}(y|x)$ as
$$\tilde{P}_{Y|X}^{(s,v,b)}(y|x) = \frac{e^{v^T(y)s(x) + b(y)}}{\sum_{y' \in \mathcal{Y}} e^{v^T(y')s(x) + b(y')}} = \frac{P_Y(y)\, e^{v^T(y)s(x) + d(y)}}{\sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{v^T(y')s(x) + d(y')}} = \frac{P_Y(y)\, e^{\tilde{v}^T(y)s(x) + \tilde{d}(y)}}{\sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tilde{v}^T(y')s(x) + \tilde{d}(y')}} = \frac{P_Y(y)\, e^{\tau(x,y)}}{\sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tau(x,y')}}.$$
Then, the KL divergence $D\big( P_X P_Y \,\big\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \big)$ can be expressed as
$$D\big( P_X P_Y \,\big\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \big) = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_X(x) P_Y(y) \log \frac{\sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tau(x,y')}}{e^{\tau(x,y)}} = \sum_{x \in \mathcal{X}} P_X(x) \log\bigg( \sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tau(x,y')} \bigg) - \mathbb{E}_{P_X P_Y}\big[ \tau(X,Y) \big] = \sum_{x \in \mathcal{X}} P_X(x) \log\bigg( \sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tau(x,y')} \bigg),$$
where to obtain the last equality we have used the fact that $\mathbb{E}_{P_X P_Y}[\tau(X,Y)] = 0$. As a result, we have
$$D\big( P_X P_Y \,\big\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \big) \geq P_X(x) \log\bigg( \sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tau(x,y')} \bigg) \geq P_X(x) \log\Big( P_Y(y)\, e^{\tau(x,y)} + \big( 1 - P_Y(y) \big)\, e^{-\frac{P_Y(y)}{1 - P_Y(y)}\, \tau(x,y)} \Big),$$
where the last inequality follows from Jensen's inequality:
$$\sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tau(x,y')} = P_Y(y)\, e^{\tau(x,y)} + \big( 1 - P_Y(y) \big) \sum_{y' \neq y} \frac{P_Y(y')}{1 - P_Y(y)}\, e^{\tau(x,y')} \geq P_Y(y)\, e^{\tau(x,y)} + \big( 1 - P_Y(y) \big) \exp\bigg( \frac{1}{1 - P_Y(y)} \sum_{y' \neq y} P_Y(y')\, \tau(x,y') \bigg) = P_Y(y)\, e^{\tau(x,y)} + \big( 1 - P_Y(y) \big)\, e^{-\frac{P_Y(y)}{1 - P_Y(y)}\, \tau(x,y)}. \quad \square$$
Proceeding to our proof of Lemma 2, first note that when $v = d = 0$, we have $\tilde{P}_{Y|X}^{(s,v,b)} = P_Y$. As a result, the optimal $v, d$ for (8) satisfy
$$D\big( P_{XY} \,\big\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \big) \leq D(P_{XY} \,\|\, P_X P_Y) \leq \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \frac{\big( P_{XY}(x,y) - P_X(x) P_Y(y) \big)^2}{P_X(x) P_Y(y)} \leq \epsilon^2,$$
where to obtain the second inequality we have again exploited the $\chi^2$-divergence as an upper bound on the KL divergence [55], and to obtain the last inequality we have used the definition of $\epsilon$-dependence.
As $P_{XY} \in \operatorname{relint}(\mathcal{P}^{\mathcal{X} \times \mathcal{Y}})$, from Lemma A4, there exist $C > 0$ and $\epsilon_1 > 0$ such that $D\big( P_X P_Y \,\big\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \big) < C\epsilon^2$ for all $\epsilon < \epsilon_1$. Furthermore, from Lemma A5, for all $(x,y) \in \mathcal{X} \times \mathcal{Y}$ and $\epsilon \in (0, \epsilon_1)$, we have
$$C\epsilon^2 \geq P_X(x) \log\Big( P_Y(y)\, e^{\tau(x,y)} + \big( 1 - P_Y(y) \big)\, e^{-\frac{P_Y(y)}{1 - P_Y(y)}\, \tau(x,y)} \Big).$$
Note that the right-hand side of (A35) satisfies
$$\log\Big( P_Y(y)\, e^{\tau(x,y)} + \big( 1 - P_Y(y) \big)\, e^{-\frac{P_Y(y)}{1 - P_Y(y)}\, \tau(x,y)} \Big) = \frac{P_Y(y)}{2\big( 1 - P_Y(y) \big)}\, \tau^2(x,y) + o\big( \tau^2(x,y) \big).$$
Therefore, there exists $\delta > 0$, independent of $\epsilon_1$, such that for all $|\tau(x,y)| \leq \delta$, we have
$$\log\Big( P_Y(y)\, e^{\tau(x,y)} + \big( 1 - P_Y(y) \big)\, e^{-\frac{P_Y(y)}{1 - P_Y(y)}\, \tau(x,y)} \Big) \geq \frac{P_Y(y)}{2}\, \tau^2(x,y).$$
In addition, if $|\tau(x,y)| > \delta$, we have
$$\log\Big( P_Y(y)\, e^{\tau(x,y)} + \big( 1 - P_Y(y) \big)\, e^{-\frac{P_Y(y)}{1 - P_Y(y)}\, \tau(x,y)} \Big) \geq \min\bigg\{ \log\Big( P_Y(y)\, e^{\delta} + \big( 1 - P_Y(y) \big)\, e^{-\frac{P_Y(y)}{1 - P_Y(y)}\, \delta} \Big),\; \log\Big( P_Y(y)\, e^{-\delta} + \big( 1 - P_Y(y) \big)\, e^{\frac{P_Y(y)}{1 - P_Y(y)}\, \delta} \Big) \bigg\} \geq \frac{P_Y(y)}{2}\, \delta^2,$$
where the first inequality exploits the monotonicity of the function $t \mapsto P_Y(y)\, e^{t} + \big( 1 - P_Y(y) \big)\, e^{-\frac{P_Y(y)}{1 - P_Y(y)}\, t}$ on each side of $t = 0$, and the second inequality exploits (A36).
As a result, we have
$$\log\Big( P_Y(y)\, e^{\tau(x,y)} + \big( 1 - P_Y(y) \big)\, e^{-\frac{P_Y(y)}{1 - P_Y(y)}\, \tau(x,y)} \Big) \geq \frac{P_Y(y)}{2} \cdot \min\big\{ \delta^2,\; \tau^2(x,y) \big\}.$$
Hence, (A35) becomes
$$C\epsilon^2 \geq \frac{P_X(x) P_Y(y)}{2} \cdot \min\big\{ \delta^2,\; \tau^2(x,y) \big\},$$
from which we can obtain $\tau(x,y) = O(\epsilon)$. To see this, let
$$\epsilon_2 \triangleq \delta\, \sqrt{\frac{1}{2C} \min_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_X(x) P_Y(y)}, \qquad \epsilon_0 \triangleq \min\{\epsilon_1, \epsilon_2\}.$$
Then, for all $\epsilon < \epsilon_0$, we have
$$C\epsilon^2 < \frac{P_X(x) P_Y(y)}{2} \cdot \delta^2,$$
and (A38) implies $|\tau(x,y)| < C'\epsilon$ with $C' = \sqrt{\frac{2C}{P_X(x) P_Y(y)}}$. □

Appendix C. Proof of Lemma 3

Proof. 
From Lemma 2, there exists $C > 0$ such that for all $(x,y) \in \mathcal{X} \times \mathcal{Y}$, we have
$$\big| \tilde{v}^T(y)\, s(x) + \tilde{d}(y) \big| < C\epsilon,$$
which implies
$$\big| \mu_s^T \tilde{v}(y) + \tilde{d}(y) \big| < C'\epsilon,$$
$$\big| \tilde{v}^T(y)\, \tilde{s}(x) \big| < 2C'\epsilon,$$
with $C' = \max\{C, 1\}$.
From (A30), we can assume $\mathbb{E}_{P_Y}[v(Y)] = \mathbb{E}_{P_Y}[d(Y)] = 0$ without loss of generality. Then, (4) can be rewritten as
$$\tilde{P}_{Y|X}^{(s,v,b)}(y|x) = \frac{P_Y(y)\, e^{\tilde{v}^T(y)s(x) + \tilde{d}(y)}}{\sum_{y' \in \mathcal{Y}} P_Y(y')\, e^{\tilde{v}^T(y')s(x) + \tilde{d}(y')}},$$
and the numerator can be written as
$$P_Y(y)\, e^{\tilde{v}^T(y)s(x) + \tilde{d}(y)} = P_Y(y) \Big( 1 + \tilde{v}^T(y)s(x) + \tilde{d}(y) + o(\epsilon) \Big),$$
where we have used (A39).
where we have used (A39). Similarly, from
y Y P Y ( y ) e v ˜ T ( y ) s ( x ) + d ˜ ( y ) = y Y P Y ( y ) 1 + v ˜ T ( y ) s ( x ) + d ˜ ( y ) + o ( ϵ ) = 1 + E P Y v ˜ T ( Y ) s ( x ) + E P Y d ˜ ( y ) + o ( ϵ ) = 1 + o ( ϵ )
we obtain
1 y Y P Y ( y ) e v ˜ T ( y ) s ( x ) + d ˜ ( y ) = 1 1 + o ( ϵ ) = 1 + o ( ϵ ) .
As a result, (A42) can be written as
$$\tilde{P}_{Y|X}^{(s,v,b)}(y|x) = P_Y(y) \Big( 1 + \tilde{v}^T(y)s(x) + \tilde{d}(y) + o(\epsilon) \Big)\big( 1 + o(\epsilon) \big) = P_Y(y) \Big( 1 + \tilde{v}^T(y)s(x) + \tilde{d}(y) + o(\epsilon) \Big),$$
which implies $P_X \tilde{P}_{Y|X}^{(v,b)} \in \mathcal{N}_{C\epsilon}^{\mathcal{X} \times \mathcal{Y}}(P_X P_Y)$ for sufficiently small $\epsilon$. In addition, the local assumption on the distributions implies that $P_{XY} \in \mathcal{N}_\epsilon^{\mathcal{X} \times \mathcal{Y}}(P_X P_Y) \subset \mathcal{N}_{C\epsilon}^{\mathcal{X} \times \mathcal{Y}}(P_X P_Y)$. Again, from the local approximation of the KL divergence [18],
$$D(P_1 \,\|\, P_2) = \frac{1}{2} \| \boldsymbol{\phi}_1 - \boldsymbol{\phi}_2 \|^2 + o(\epsilon^2),$$
we have
$$\begin{aligned} D\big( P_{XY} \,\big\|\, P_X \tilde{P}_{Y|X}^{(s,v,b)} \big) &= \frac{1}{2} \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \frac{\Big( P_{XY}(x,y) - \tilde{P}_{Y|X}^{(s,v,b)}(y|x)\, P_X(x) \Big)^2}{P_Y(y) P_X(x)} + o(\epsilon^2) \\ &= \frac{1}{2} \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \bigg( \frac{P_{XY}(x,y) - P_Y(y)P_X(x)}{\sqrt{P_Y(y)P_X(x)}} - \sqrt{P_Y(y)P_X(x)} \Big( \tilde{v}^T(y)s(x) + \tilde{d}(y) + o(\epsilon) \Big) \bigg)^2 + o(\epsilon^2) \\ &= \frac{1}{2} \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \Big( \tilde{B}(y,x) - \sqrt{P_Y(y)P_X(x)}\, \tilde{v}^T(y)\tilde{s}(x) - \sqrt{P_Y(y)P_X(x)} \big( \tilde{d}(y) + \mu_s^T \tilde{v}(y) \big) - \sqrt{P_Y(y)P_X(x)}\, o(\epsilon) \Big)^2 + o(\epsilon^2) \\ &\overset{(*)}{=} \frac{1}{2} \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \Big( \tilde{B}(y,x) - \sqrt{P_Y(y)P_X(x)}\, \tilde{v}^T(y)\tilde{s}(x) \Big)^2 + \frac{1}{2} \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} P_Y(y)P_X(x) \big( \tilde{d}(y) + \mu_s^T \tilde{v}(y) \big)^2 + o(\epsilon^2) \\ &= \frac{1}{2} \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \Big( \tilde{B}(y,x) - \boldsymbol{\xi}_Y(y)^T \boldsymbol{\xi}_X(x) \Big)^2 + \frac{1}{2}\, \mathbb{E}_{P_Y}\Big[ \big( \tilde{d}(Y) + \mu_s^T \tilde{v}(Y) \big)^2 \Big] + o(\epsilon^2) \\ &= \frac{1}{2} \big\| \tilde{\mathbf{B}} - \boldsymbol{\Xi}_Y \boldsymbol{\Xi}_X^T \big\|_F^2 + \frac{1}{2}\, \eta^{(v,b)}(s) + o(\epsilon^2), \end{aligned}$$
where to obtain $(*)$ we have used (A40) and (A41), together with the fact that $|\tilde{B}(y,x)| < \epsilon$, and that
$$\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} \tilde{B}(y,x)\, \sqrt{P_Y(y)P_X(x)} \big( \tilde{d}(y) + \mu_s^T \tilde{v}(y) \big) = 0, \qquad \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} P_Y(y)P_X(x)\, \tilde{v}^T(y)\tilde{s}(x) \big( \tilde{d}(y) + \mu_s^T \tilde{v}(y) \big) = 0,$$
since $\mathbb{E}[\tilde{d}(Y)] = 0$ and $\mathbb{E}[\tilde{s}(X)] = \mathbb{E}[\tilde{v}(Y)] = 0$. □

Appendix D. Proofs of Theorems 2 and 3

Theorems 2 and 3 can be proved based on Lemma 3.
Proofs of Theorems 2 and 3.
Note that the value of $d(\cdot)$ only affects the second term of the KL divergence; hence, we can always choose $d(\cdot)$ such that $\tilde{d}(y) + \mu_s^T \tilde{v}(y) = 0$. Then, the $(\boldsymbol{\Xi}_Y, \boldsymbol{\Xi}_X)$ pair should be chosen as
$$(\boldsymbol{\Xi}_Y, \boldsymbol{\Xi}_X)^* = \arg\min_{(\boldsymbol{\Xi}_Y, \boldsymbol{\Xi}_X)} \big\| \tilde{\mathbf{B}} - \boldsymbol{\Xi}_Y \boldsymbol{\Xi}_X^T \big\|_F^2.$$
Setting the derivative (we use the denominator-layout notation of matrix calculus, where the scalar-by-matrix derivative has the same dimensions as the matrix)
$$\frac{\partial}{\partial \boldsymbol{\Xi}_Y} \big\| \tilde{\mathbf{B}} - \boldsymbol{\Xi}_Y \boldsymbol{\Xi}_X^T \big\|_F^2 = 2 \big( \boldsymbol{\Xi}_Y \boldsymbol{\Xi}_X^T \boldsymbol{\Xi}_X - \tilde{\mathbf{B}} \boldsymbol{\Xi}_X \big)$$
to zero, the optimal $\boldsymbol{\Xi}_Y$ for fixed $\boldsymbol{\Xi}_X$ is (here, we assume the matrix $\boldsymbol{\Xi}_X^T \boldsymbol{\Xi}_X = \Lambda_{\tilde{s}(X)}$ is invertible; for the case where $\boldsymbol{\Xi}_X^T \boldsymbol{\Xi}_X$ is singular, we can obtain a similar result with the ordinary matrix inverse replaced by the Moore–Penrose pseudoinverse)
$$\boldsymbol{\Xi}_Y^* = \tilde{\mathbf{B}}\, \boldsymbol{\Xi}_X \big( \boldsymbol{\Xi}_X^T \boldsymbol{\Xi}_X \big)^{-1}.$$
As $\big[ \sqrt{P_Y} \big]^T \tilde{\mathbf{B}} = \mathbf{0}$, where $\sqrt{P_Y}$ denotes the vector with entries $\sqrt{P_Y(y)}$, we have $\big[ \sqrt{P_Y} \big]^T \boldsymbol{\Xi}_Y^* = \mathbf{0}$, which demonstrates that $\boldsymbol{\Xi}_Y^*$ is a valid matrix for a zero-mean feature vector.
To express $\boldsymbol{\Xi}_Y^*$ of (A47) in terms of $s$ and $v$, we can use the correspondence between feature functions and information vectors. We can show that, for a zero-mean feature function $f(X)$ with corresponding information vector $\boldsymbol{\phi}$, we have the correspondence $\mathbb{E}_{P_{X|Y}}[f(X)|Y] \leftrightarrow \tilde{\mathbf{B}}\boldsymbol{\phi}$. To see this, note that the $y$th element of the information vector $\tilde{\mathbf{B}}\boldsymbol{\phi}$ is given by
$$\sum_{x \in \mathcal{X}} \tilde{B}(y,x)\, \phi(x) = \sum_{x \in \mathcal{X}} \frac{P_{XY}(x,y) - P_X(x) P_Y(y)}{\sqrt{P_X(x)}\sqrt{P_Y(y)}} \cdot \sqrt{P_X(x)}\, f(x) = \frac{1}{\sqrt{P_Y(y)}} \sum_{x \in \mathcal{X}} P_{XY}(x,y)\, f(x) = \sqrt{P_Y(y)}\; \mathbb{E}_{P_{X|Y}}\big[ f(X) \,\big|\, Y = y \big].$$
Using similar methods, we can verify that $\Lambda_{\tilde{s}(X)} = \boldsymbol{\Xi}_X^T \boldsymbol{\Xi}_X$. As a result, (A47) is equivalent to
$$\tilde{v}^*(y) = \mathbb{E}_{P_{X|Y}}\big[ \Lambda_{\tilde{s}(X)}^{-1}\, \tilde{s}(X) \,\big|\, Y = y \big].$$
By a symmetry argument, we can also obtain the first two equations of Theorem 3. To obtain the third equations of these two theorems, we need to minimize $\eta^{(v,b)}(s) = \mathbb{E}_{P_Y}\big[ \big( \mu_s^T \tilde{v}(Y) + \tilde{d}(Y) \big)^2 \big]$. For given $\tilde{v}$ and $\mu_s$, the optimal $\tilde{d}$ is
$$\tilde{d}^*(y) = -\mu_s^T\, \tilde{v}(y),$$
and the corresponding $\eta^{(v,b)}(s) = 0$.
In addition, for given $\tilde{d}$ and $\tilde{v}$, we have
$$\eta^{(v,b)}(s) = \mathbb{E}_{P_Y}\big[ \big( \mu_s^T \tilde{v}(Y) + \tilde{d}(Y) \big)^2 \big] = \mu_s^T \Lambda_{\tilde{v}(Y)} \mu_s + 2 \mu_s^T\, \mathbb{E}_{P_Y}\big[ \tilde{v}(Y)\, \tilde{d}(Y) \big] + \operatorname{var}\big( \tilde{d}(Y) \big).$$
Setting $\frac{\partial}{\partial \mu_s}\, \eta^{(v,b)}(s) = 0$, we obtain
$$\mu_s^* = -\Lambda_{\tilde{v}(Y)}^{-1}\, \mathbb{E}_{P_Y}\big[ \tilde{v}(Y)\, \tilde{d}(Y) \big]. \quad \square$$

Appendix E. Proof of Theorem 4

Proof. 
From Lemma 3, choosing the optimal $(\boldsymbol{\Xi}_Y, \boldsymbol{\Xi}_X)$ is equivalent to solving the low-rank matrix factorization problem for $\tilde{\mathbf{B}}$. Since both $\boldsymbol{\Xi}_Y$ and $\boldsymbol{\Xi}_X$ have rank no greater than $k$, from the Eckart–Young–Mirsky theorem [56], the optimal choice of $\boldsymbol{\Xi}_Y \boldsymbol{\Xi}_X^T$ is the truncated singular value decomposition of $\tilde{\mathbf{B}}$ with the top $k$ singular values. As a result, $(\boldsymbol{\Xi}_Y, \boldsymbol{\Xi}_X)^*$ are the left and right singular vectors of $\tilde{\mathbf{B}}$ corresponding to the largest $k$ singular values.
The optimality of the bias $\tilde{d}(y) = -\mu_s^T \tilde{v}(y)$ has already been shown in Appendix D. □

Appendix F. Proof of Theorem 5

The following lemma is useful to prove Theorem 5.
Lemma A6
(Pythagorean theorem). Let $\boldsymbol{\Xi}_X^*$ be the optimal matrix for given $\boldsymbol{\Xi}_Y$, as defined in (13). Then,
$$\big\| \tilde{\mathbf{B}} - \boldsymbol{\Xi}_Y \boldsymbol{\Xi}_X^T \big\|_F^2 - \big\| \tilde{\mathbf{B}} - \boldsymbol{\Xi}_Y (\boldsymbol{\Xi}_X^*)^T \big\|_F^2 = \big\| \boldsymbol{\Xi}_Y (\boldsymbol{\Xi}_X^*)^T - \boldsymbol{\Xi}_Y \boldsymbol{\Xi}_X^T \big\|_F^2.$$
Proof of Lemma A6.
Denote by U , V the Frobenius inner product of matrices U and V , i.e., U , V tr ( U T V ) , and we have
B ˜ Ξ Y Ξ X * T , Ξ Y Ξ X T = tr B ˜ Ξ X Ξ Y T tr Ξ X * Ξ Y T Ξ Y Ξ X T = tr B ˜ Ξ X Ξ Y T tr B ˜ T Ξ Y Ξ X T = 0 .
As a result, we obtain
B ˜ Ξ Y Ξ X T F 2 = B ˜ Ξ Y Ξ X * T + Ξ Y Ξ X * T Ξ Y Ξ X T F 2 = B ˜ Ξ Y Ξ X * T F + Ξ Y Ξ X * T Ξ Y Ξ X T F 2 + 2 B ˜ Ξ Y Ξ X * T , Ξ Y ( Ξ X * T Ξ X T ) = B ˜ Ξ Y Ξ X * T F + Ξ Y Ξ X * T Ξ Y Ξ X T F 2 ,
which finishes the proof. □
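Lemma A6 is easy to check numerically; the following sketch (with arbitrary stand-in matrices of our choosing) verifies the identity on a random instance.

```python
import numpy as np

rng = np.random.default_rng(2)
B    = rng.standard_normal((6, 8))       # stand-in for B~
Xi_Y = rng.standard_normal((6, 3))
Xi_X = rng.standard_normal((8, 3))

# Optimal Xi_X for the given Xi_Y, cf. (A56).
Xi_X_star = B.T @ Xi_Y @ np.linalg.inv(Xi_Y.T @ Xi_Y)

lhs = (np.linalg.norm(B - Xi_Y @ Xi_X.T, 'fro') ** 2
       - np.linalg.norm(B - Xi_Y @ Xi_X_star.T, 'fro') ** 2)
rhs = np.linalg.norm(Xi_Y @ (Xi_X_star - Xi_X).T, 'fro') ** 2
assert np.isclose(lhs, rhs)              # Pythagorean identity
```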
Proceeding to the proof of Theorem 5, from Lemma A6 we have
$$\begin{aligned} L(s) - L(s^*) &= \frac{1}{2} \Bigl( \bigl\| \tilde{B} - \Xi_Y \Xi_X^T \bigr\|_F^2 - \bigl\| \tilde{B} - \Xi_Y \Xi_X^{*T} \bigr\|_F^2 \Bigr) + \frac{1}{2} \bigl( \eta_{(v,b)}(s) - \eta_{(v,b)}(s^*) \bigr) + o(\epsilon^2) \\ &= \frac{1}{2} \bigl\| \Xi_Y \Xi_X^{*T} - \Xi_Y \Xi_X^T \bigr\|_F^2 + \frac{1}{2} \kappa_{(v,b)}(s, s^*) + o(\epsilon^2), \end{aligned}$$
where $\kappa_{(v,b)}(s, s^*) \triangleq \eta_{(v,b)}(s) - \eta_{(v,b)}(s^*)$. We then optimize $\bigl\| \Xi_Y \Xi_X^{*T} - \Xi_Y \Xi_X^T \bigr\|_F^2$ and $\kappa_{(v,b)}(s, s^*)$ separately.
For the first term, we need to express $\Xi_X$ in terms of $W$ and $\Xi_{1X}$. From (17), we obtain
$$\mathbb{E}\bigl[ s_z(X) \bigr] = \sigma\bigl( c(z) \bigr) + o(\epsilon), \tag{A53}$$
$$\tilde{s}_z(x) = \bigl( w^T(z)\, \tilde{t}(x) \bigr) \cdot \sigma'\bigl( c(z) \bigr) + o(\epsilon), \tag{A54}$$
which can be expressed in terms of information vectors as
$$\Xi_X = \Xi_{1X} W^T J + o(\epsilon). \tag{A55}$$
From Theorem 3, we have
$$\Xi_X^* = \tilde{B}^T\, \Xi_Y \bigl( \Xi_Y^T \Xi_Y \bigr)^{-1}. \tag{A56}$$
As a result, we have
$$\begin{aligned} \bigl\| \Xi_Y \Xi_X^{*T} - \Xi_Y \Xi_X^T \bigr\|_F^2 &= \Bigl\| \bigl( \Xi_Y^T \Xi_Y \bigr)^{1/2} \bigl( \Xi_X^{*T} - \Xi_X^T \bigr) \Bigr\|_F^2 \\ &= \Bigl\| \bigl( \Xi_Y^T \Xi_Y \bigr)^{1/2} \cdot \bigl( \Xi_X^{*T} - J W \Xi_{1X}^T - o(\epsilon) \bigr) \Bigr\|_F^2 \\ &= \Bigl\| \bigl( \Xi_Y^T \Xi_Y \bigr)^{1/2} \cdot \bigl( \Xi_X^{*T} - J W \Xi_{1X}^T \bigr) \Bigr\|_F^2 + o(\epsilon^2) \\ &= \Bigl\| \bigl( \Xi_Y^T \Xi_Y \bigr)^{1/2} J \cdot \bigl( J^{-1} \Xi_X^{*T} - W \Xi_{1X}^T \bigr) \Bigr\|_F^2 + o(\epsilon^2) \\ &= \bigl\| \Theta \tilde{B}_1 - \Theta W \Xi_{1X}^T \bigr\|_F^2 + o(\epsilon^2), \end{aligned} \tag{A57}$$
where the third equality follows from the fact that [cf. (A41)] $\tilde{s}(x) = O(\epsilon)$ and $\tilde{v}(y) = O(1)$, and the last equality follows from the definitions $\tilde{B}_1 \triangleq J^{-1} \Xi_X^{*T}$ and $\Theta \triangleq \bigl( \Xi_Y^T \Xi_Y \bigr)^{1/2} J$.
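Dropping the $o(\epsilon)$ terms, the chain above is an exact matrix identity, which can be confirmed numerically. The sketch below uses arbitrary stand-in matrices (our own toy instantiation) with $\Xi_X = \Xi_{1X} W^T J$ taken exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
B     = rng.standard_normal((6, 8))      # stand-in for B~
Xi_Y  = rng.standard_normal((6, 3))
Xi_1X = rng.standard_normal((8, 4))      # information matrix of t~
W     = rng.standard_normal((3, 4))      # hidden-layer weights
J     = np.diag(rng.random(3) + 0.5)     # stand-in for diag(sigma'(c(z)))

Xi_X      = Xi_1X @ W.T @ J              # (A55) without the o(eps) term
Xi_X_star = B.T @ Xi_Y @ np.linalg.inv(Xi_Y.T @ Xi_Y)   # (A56)

# Theta = (Xi_Y^T Xi_Y)^{1/2} J, square root via eigendecomposition.
lam, Q = np.linalg.eigh(Xi_Y.T @ Xi_Y)
Theta = Q @ np.diag(np.sqrt(lam)) @ Q.T @ J
B1 = np.linalg.inv(J) @ Xi_X_star.T      # B~_1 = J^{-1} Xi_X*^T

lhs = np.linalg.norm(Xi_Y @ (Xi_X_star - Xi_X).T, 'fro')
rhs = np.linalg.norm(Theta @ B1 - Theta @ W @ Xi_1X.T, 'fro')
assert np.isclose(lhs, rhs)
```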
For the second term, from (A50) and (A51), we have
$$\begin{aligned} \kappa_{(v,b)}(s, s^*) &= \bigl[ (\mu_s - \mu_s^*) + \mu_s^* \bigr]^T \Lambda_{\tilde{v}(Y)} \bigl[ (\mu_s - \mu_s^*) + \mu_s^* \bigr] - {\mu_s^*}^T \Lambda_{\tilde{v}(Y)} \mu_s^* + 2 (\mu_s - \mu_s^*)^T\, \mathbb{E}_{P_Y}\bigl[ \tilde{v}(Y)\, \tilde{d}(Y) \bigr] \\ &= (\mu_s - \mu_s^*)^T \Lambda_{\tilde{v}(Y)} (\mu_s - \mu_s^*) + 2 (\mu_s - \mu_s^*)^T \Bigl( \Lambda_{\tilde{v}(Y)} \mu_s^* + \mathbb{E}_{P_Y}\bigl[ \tilde{v}(Y)\, \tilde{d}(Y) \bigr] \Bigr) \\ &= (\mu_s - \mu_s^*)^T \Lambda_{\tilde{v}(Y)} (\mu_s - \mu_s^*), \end{aligned} \tag{A58}$$
where the last equality holds since $\Lambda_{\tilde{v}(Y)} \mu_s^* + \mathbb{E}_{P_Y}[\tilde{v}(Y)\, \tilde{d}(Y)] = 0$ by (A51).
Combining (A57) and (A58) finishes the proof. □

Appendix G. Analyses of Hidden Layer Parameters

First, from (A53), the bias $c(z)$ of the hidden layer is (when $\mu_t \neq 0$, the formula should be modified as $c(z) = \sigma^{-1}\bigl( \mu_{s^*}(z) \bigr) - \mu_t^T w + o(\epsilon)$)
$$c(z) = \sigma^{-1}\bigl( \mu_{s^*}(z) \bigr) + o(\epsilon).$$
To obtain $\mu_s^*$, let us define $\sigma_{\min} \triangleq \inf_x \sigma(x)$ and $\sigma_{\max} \triangleq \sup_x \sigma(x)$. Then, the optimal $\mu_s$ is the solution of
$$\begin{aligned} & \underset{\mu_s}{\text{minimize}} \quad (\mu_s - \mu_{s^*})^T \Lambda_{\tilde{v}(Y)} (\mu_s - \mu_{s^*}) \\ & \text{subject to} \quad \sigma_{\min} \leq \mu_s \leq \sigma_{\max} \ \text{(elementwise)}. \end{aligned} \tag{A59}$$
If $\mu_{s^*}$ satisfies the constraint of (A59), then it is the optimal solution. Otherwise, some elements of the optimal $\mu_s$ saturate at either $\sigma_{\min}$ or $\sigma_{\max}$, which is known as the saturation phenomenon [21].
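In the special case where $\Lambda_{\tilde{v}(Y)}$ is diagonal, the objective of (A59) separates across coordinates and the constrained optimum is obtained by elementwise clipping, which makes the saturation effect explicit; for a general $\Lambda_{\tilde{v}(Y)}$ one would instead solve the box-constrained quadratic program with a QP solver. A minimal sketch (our illustration):

```python
import numpy as np

def saturated_mean(mu_star, sigma_min, sigma_max):
    # Exact solution of (A59) when Lambda_{v~(Y)} is diagonal: each
    # coordinate is optimized independently, so out-of-range entries
    # saturate at the boundary of the activation's range.
    return np.clip(mu_star, sigma_min, sigma_max)

# Example with a sigmoid-like range (0, 1): entries outside saturate.
print(saturated_mean(np.array([-0.2, 0.4, 1.3]), 0.0, 1.0))  # [0.  0.4 1. ]
```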
To obtain $W^*$, let
$$\tilde{B}_1' \triangleq \Theta \tilde{B}_1 = \bigl( \Xi_Y^T \Xi_Y \bigr)^{-1/2}\, \Xi_Y^T \tilde{B}, \qquad W' \triangleq \Theta W = \bigl( \Xi_Y^T \Xi_Y \bigr)^{1/2} J\, W.
$$
Then, the optimal $W'$ is given by
$$W'^* = \operatorname*{arg\,min}_{W'} \bigl\| \tilde{B}_1' - W' \Xi_{1X}^T \bigr\|_F^2 = \tilde{B}_1'\, \Xi_{1X} \bigl( \Xi_{1X}^T \Xi_{1X} \bigr)^{-1}.$$
Hence, $W^*$ is given by
$$W^* = \Theta^{-1} W'^* = \Theta^{-1} \tilde{B}_1'\, \Xi_{1X} \bigl( \Xi_{1X}^T \Xi_{1X} \bigr)^{-1} = \tilde{B}_1\, \Xi_{1X} \bigl( \Xi_{1X}^T \Xi_{1X} \bigr)^{-1} = J^{-1} \cdot \bigl[ \Xi_Y \bigl( \Xi_Y^T \Xi_Y \bigr)^{-1} \bigr]^T\, \tilde{B}\, \Xi_{1X} \bigl( \Xi_{1X}^T \Xi_{1X} \bigr)^{-1},$$
where the term $\tilde{B}\, \Xi_{1X} \bigl( \Xi_{1X}^T \Xi_{1X} \bigr)^{-1}$ corresponds to a feature projection of $\tilde{t}(X)$:
$$\tilde{B}\, \Xi_{1X} \bigl( \Xi_{1X}^T \Xi_{1X} \bigr)^{-1} \;\longleftrightarrow\; \mathbb{E}_{P_{X|Y}}\bigl[ \Lambda_{\tilde{t}(X)}^{-1}\, \tilde{t}(X) \mid Y \bigr].$$
As a consequence, this multi-layer neural network conducts a generalized feature projection between the features extracted at different layers. Note that the projected feature $\mathbb{E}_{P_{\tilde{t}|Y}}\bigl[ \Lambda_{\tilde{t}}^{-1}\, \tilde{t} \mid Y \bigr]$ depends only on the conditional distribution $P_{\tilde{t}|Y}$ and not on $P_{X|Y}$. Therefore, the above computation can be carried out without access to the underlying random variable $X$, and thus applies in more general settings.
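As a rough illustration of this projection step, the sketch below computes $\mathbb{E}\bigl[ \Lambda_{\tilde{t}}^{-1}\, \tilde{t} \mid Y = y \bigr]$ for a discrete toy model; the function name and setup are our own, and in practice the exact expectations would be replaced by empirical averages over samples of $(t, Y)$.

```python
import numpy as np

def project_feature(t, P_xy):
    """Generalized feature projection E[Lambda_{t~}^{-1} t~(X) | Y = y]
    for a joint pmf P_xy (|X| x |Y|) and a feature matrix t (|X| x m).
    Returns one projected feature vector per value of y."""
    Px, Py = P_xy.sum(axis=1), P_xy.sum(axis=0)
    t_c = t - Px @ t                          # centered feature t~
    Lam = t_c.T @ (Px[:, None] * t_c)         # Lambda_{t~(X)}
    P_x_given_y = P_xy / Py                   # column y: P_{X|Y}(.|y)
    return (P_x_given_y.T @ t_c) @ np.linalg.inv(Lam)

# Toy usage: project a random 4-dimensional feature of |X| = 6 symbols.
rng = np.random.default_rng(4)
P_xy = rng.random((6, 3)); P_xy /= P_xy.sum()
t = rng.standard_normal((6, 4))
print(project_feature(t, P_xy).shape)         # (3, 4): one row per y
```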

References

1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
2. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 3–5 June 2019; Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186.
3. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901.
4. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489.
5. Arulkumaran, K.; Cully, A.; Togelius, J. AlphaStar: An evolutionary computation perspective. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, Prague, Czech Republic, 13–17 July 2019; pp. 314–315.
6. MacKay, D.J.C. Information Theory, Inference, and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003; ISBN 9780521642989.
7. Zintgraf, L.M.; Cohen, T.S.; Adel, T.; Welling, M. Visualizing Deep Neural Network Decisions: Prediction Difference Analysis. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017.
8. Papyan, V.; Han, X.; Donoho, D.L. Prevalence of neural collapse during the terminal phase of deep learning training. Proc. Natl. Acad. Sci. USA 2020, 117, 24652–24663.
9. Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; Kim, B. Sanity Checks for Saliency Maps. In Advances in Neural Information Processing Systems; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31.
10. Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A survey of methods for explaining black box models. ACM Comput. Surv. 2018, 51, 1–42.
11. Jacot, A.; Gabriel, F.; Hongler, C. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in Neural Information Processing Systems; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31.
12. Mei, S.; Montanari, A.; Nguyen, P.M. A mean field view of the landscape of two-layer neural networks. Proc. Natl. Acad. Sci. USA 2018, 115, E7665–E7671.
13. Arora, S.; Du, S.; Hu, W.; Li, Z.; Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 322–332.
14. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
15. Huang, S.L.; Xu, X.; Zheng, L.; Wornell, G.W. An information theoretic interpretation to deep neural networks. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 1984–1988.
16. Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015; pp. 1–5.
17. Goldfeld, Z.; Polyanskiy, Y. The information bottleneck problem and its applications in machine learning. IEEE J. Sel. Areas Inf. Theory 2020, 1, 19–38.
18. Huang, S.L.; Makur, A.; Zheng, L.; Wornell, G.W. An information-theoretic approach to universal feature selection in high-dimensional inference. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 1336–1340.
19. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 214–223.
20. Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the information bottleneck theory of deep learning. J. Stat. Mech. Theory Exp. 2019, 2019, 124020.
21. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2017.
22. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
23. Huang, S.L.; Zheng, L. Linear information coupling problems. In Proceedings of the 2012 IEEE International Symposium on Information Theory Proceedings, Cambridge, MA, USA, 1–6 July 2012; pp. 1029–1033.
24. Huang, S.L.; Makur, A.; Wornell, G.W.; Zheng, L. On universal features for high-dimensional learning and inference. arXiv 2019, arXiv:1911.09105.
25. Hirschfeld, H.O. A connection between correlation and contingency. Proc. Camb. Phil. Soc. 1935, 31, 520–524.
26. Gebelein, H. Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichungsrechnung. Z. Angew. Math. Mech. 1941, 21, 364–379.
27. Rényi, A. On Measures of Dependence. Acta Math. Acad. Sci. Hung. 1959, 10, 441–451.
28. du Pin Calmon, F.; Makhdoumi, A.; Médard, M.; Varia, M.; Christiansen, M.; Duffy, K.R. Principal inertia components and applications. IEEE Trans. Inf. Theory 2017, 63, 5011–5038.
29. Hsu, H.; Asoodeh, S.; Salamatian, S.; Calmon, F.P. Generalizing bottleneck problems. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 531–535.
30. Hsu, H.; Salamatian, S.; Calmon, F.P. Correspondence analysis using neural networks. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, PMLR, Okinawa, Japan, 16–18 April 2019; pp. 2671–2680.
31. Anantharam, V.; Gohari, A.; Kamath, S.; Nair, C. On hypercontractivity and a data processing inequality. In Proceedings of the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014; pp. 3022–3026.
32. Raginsky, M. Strong data processing inequalities and Φ-Sobolev inequalities for discrete channels. IEEE Trans. Inf. Theory 2016, 62, 3355–3389.
33. Polyanskiy, Y.; Wu, Y. Strong data-processing inequalities for channels and Bayesian networks. In Convexity and Concentration; Springer: Berlin/Heidelberg, Germany, 2017; pp. 211–249.
34. Greenacre, M.J. Theory and Applications of Correspondence Analysis; Academic Press: London, UK, 1984.
35. Wang, H.; Vo, L.; Calmon, F.P.; Médard, M.; Duffy, K.R.; Varia, M. Privacy with estimation guarantees. IEEE Trans. Inf. Theory 2019, 65, 8025–8042.
36. Breiman, L.; Friedman, J.H. Estimating Optimal Transformations for Multiple Regression and Correlation. J. Am. Stat. Assoc. 1985, 80, 614–619.
37. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 3104–3112.
38. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
39. Hastie, T.; Tibshirani, R.; Friedman, J. Neural Networks. In The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009; pp. 389–416.
40. Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 1989, 2, 303–314.
41. Stoer, J.; Bulirsch, R. Introduction to Numerical Analysis; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 12.
42. Alain, G.; Bengio, Y. Understanding intermediate layers using linear classifier probes. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017.
43. Sutskever, I.; Martens, J.; Dahl, G.; Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning, PMLR, Atlanta, GA, USA, 17–19 June 2013; pp. 1139–1147.
44. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015; Bengio, Y., LeCun, Y., Eds.; Conference Track Proceedings.
45. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
46. Huang, G.; Liu, Z.; Weinberger, K.Q.; van der Maaten, L. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–29 July 2017; Volume 1, p. 3.
47. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–29 July 2017; pp. 1251–1258.
48. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2818–2826.
49. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the AAAI, San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284.
50. Xu, X.; Huang, S.L.; Zheng, L.; Zhang, L. The geometric structure of generalized softmax learning. In Proceedings of the 2018 IEEE Information Theory Workshop (ITW), Guangzhou, China, 25–29 November 2018; pp. 1–5.
51. Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; Li, H. Learning structured sparsity in deep neural networks. Adv. Neural Inf. Process. Syst. 2016, 29, 2074–2082.
52. Wang, L.; Wu, J.; Huang, S.L.; Zheng, L.; Xu, X.; Zhang, L.; Huang, J. An efficient approach to informative feature extraction from multimodal data. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 5281–5288.
53. Lee, J.; Sattigeri, P.; Wornell, G. Learning new tricks from old dogs: Multi-source transfer learning from pre-trained networks. Adv. Neural Inf. Process. Syst. 2019, 32, 4370–4380.
54. Dembo, A.; Zeitouni, O. Large Deviations Techniques and Applications, Corrected Reprint of the Second (1998) Edition; Stochastic Modelling and Applied Probability; Springer: Berlin/Heidelberg, Germany, 2010; Volume 38.
55. Sason, I.; Verdú, S. f-Divergence Inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006.
56. Eckart, C.; Young, G. The approximation of one matrix by another of lower rank. Psychometrika 1936, 1, 211–218.
Figure 1. A deep neural network that uses data X to predict Y. The hidden layers together map the input data X to the k-dimensional feature $s(x) = (s_1, \ldots, s_k)^T$. Then, the probabilistic prediction $\tilde{P}_{Y|X}$ of Y is computed from $s(x)$, $v(y)$, and $b(y)$, where $v$ and $b$ are the weights and bias of the last layer.
Figure 2. A multi-layer neural network, where the expressive power of the feature mapping $s(\cdot)$ is restricted by the hidden representation t. All hidden layers preceding t are fixed and represented by the “pre-processing” module.
Figure 3. A simple neural network with ideal expressive power, which can generate any one-dimensional (k = 1) feature s of X by tuning the weights in the first layer.
Figure 4. The trained feature s, weights v, and bias b of the network in Figure 3, compared with the corresponding theoretical results to show their agreement.
Figure 5. The network designed for validating the impact of network structure on feature extraction, with m = 4 and k = 3 neurons in its two hidden layers. Our goal is to compare the learned weights $w(1)$, $w(2)$, $w(3)$ and bias $c$ of the hidden layer with the theoretical characterizations in Section 3.2.2.
Figure 6. The trained weights w and bias c of the network in Figure 5, compared with the corresponding theoretical results to show their agreement.
Table 1. Classification accuracy and H-score of different DNN models on the ImageNet dataset, where “Paras” indicates the number of parameters (in millions) in the model and $H_{\mathrm{AIC}}$ denotes the AIC-corrected H-score.

DNN Model              | Paras [×10⁶] | H(s)  | H_AIC(s) | Accuracy [%]
-----------------------|--------------|-------|----------|-------------
VGG16 [44]             | 138.4        | 148.3 | 41.9     | 64.2
VGG19 [44]             | 143.7        | 152.7 | 42.2     | 64.7
MobileNet [45]         | 4.3          | 45.9  | 42.6     | 68.4
DenseNet121 [46]       | 8.1          | 59.5  | 53.3     | 71.4
DenseNet169 [46]       | 14.3         | 81.2  | 70.2     | 73.6
DenseNet201 [46]       | 20.2         | 89.1  | 73.5     | 74.4
Xception [47]          | 22.9         | 179.8 | 162.2    | 77.5
InceptionV3 [48]       | 23.9         | 181.2 | 162.9    | 76.3
InceptionResNetV2 [49] | 55.9         | 241.1 | 198.1    | 79.1
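For orientation only, the following sketch estimates an H-score of the form $\operatorname{tr}\bigl( \widehat{\operatorname{cov}}(s)^{-1}\, \widehat{\operatorname{cov}}(\mathbb{E}[s \mid Y]) \bigr)$ from labeled features; the precise normalization and the AIC correction used for the table follow the definitions in the main text, which are not restated here, so treat the constant factors and the function name as our assumptions.

```python
import numpy as np

def empirical_h_score(s, y):
    """Estimate tr(cov(s)^{-1} cov(E[s|Y])) from features s (n x k)
    and integer labels y (n,).  Constant factors and the AIC
    correction from the main text are omitted (assumption)."""
    s = s - s.mean(axis=0)                    # center the features
    cov_s = np.cov(s, rowvar=False)
    classes = np.unique(y)
    means = np.stack([s[y == c].mean(axis=0) for c in classes])
    weights = np.array([(y == c).mean() for c in classes])
    cov_cond = (weights[:, None] * means).T @ means   # cov of class means
    return np.trace(np.linalg.solve(cov_s, cov_cond))
```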