Article

Population Risk Improvement with Model Compression: An Information-Theoretic Approach †

1 Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA
2 Bytedance Inc., Bellevue, WA 98004, USA
3 Department of Electrical Engineering, University at Buffalo, The State University of New York, Buffalo, NY 14221, USA
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020.
Current address: Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02142, USA.
Submission received: 13 August 2021 / Revised: 23 September 2021 / Accepted: 23 September 2021 / Published: 27 September 2021
(This article belongs to the Special Issue Information Theory and Machine Learning)

Abstract

It has been reported in many recent works on deep model compression that the population risk of a compressed model can be even better than that of the original model. In this paper, an information-theoretic explanation for this population risk improvement phenomenon is provided by jointly studying the decrease in the generalization error and the increase in the empirical risk that results from model compression. It is first shown that model compression reduces an information-theoretic bound on the generalization error, which suggests that model compression can be interpreted as a regularization technique to avoid overfitting. The increase in empirical risk caused by model compression is then characterized using rate distortion theory. These results imply that the overall population risk could be improved by model compression if the decrease in generalization error exceeds the increase in empirical risk. A linear regression example is presented to demonstrate that such a decrease in population risk due to model compression is indeed possible. Our theoretical results further suggest a way to improve a widely used model compression algorithm, i.e., Hessian-weighted K-means clustering, by regularizing the distance between the clustering centers. Experiments with neural networks are provided to validate our theoretical assertions.

1. Introduction

Although deep neural networks have achieved remarkable success in various domains [1], e.g., computer vision [2], playing games such as Go [3], and autonomous driving [4], this improved performance often comes from deeper and more complex network structures with a large number of parameters. For example, in image classification, it takes over 200 MB to store the parameters of AlexNet [2] and more than 500 MB for the VGG-16 network [5]. Hence, it is difficult to deploy such large models on resource-limited platforms such as mobile devices and embedded systems, due to their limited storage, bandwidth, energy, and computational resources.
For this reason, there has been a flurry of work on compressing deep neural networks (see [6,7,8] for recent surveys). Existing studies mainly focus on designing compression algorithms that reduce the memory and computational cost while keeping the same level of population risk. In some recent papers [9,10,11,12], aggressive model compression algorithms have been proposed, which require 10% or less of the storage needed by the original model. Surprisingly, it has been observed empirically in these works that the population risk of the compressed model can often be even better than that of the original model. This phenomenon is counter-intuitive at first glance, since more compression generally leads to more information loss.
Indeed, a compressed model would usually have a larger empirical risk than the original one, since machine learning methods are usually trained by minimizing the empirical risk. On the other hand, model compression could possibly decrease the generalization error, since it can be interpreted as a regularization technique to avoid overfitting. As the population risk is the sum of the empirical risk and the generalization error, it is possible for the population risk to be reduced by model compression.

1.1. Contributions

In this paper, we provide an information-theoretic explanation for the population risk improvement with model compression by jointly characterizing the decrease in generalization error and the increase in empirical risk. Specifically, we focus on the case where the model is compressed based on a pre-trained model.
We first prove that model compression leads to a tightening of the information-theoretic generalization error bound in [13], and it can therefore be interpreted as a regularization method to reduce overfitting. Furthermore, by defining a distortion metric based on the difference in the empirical risk between the original model obtained by empirical risk minimization (ERM) and compressed models, we use rate distortion theory to characterize the increase in empirical risk as a function of the number of bits R used to describe the model. If the decrease in generalization error exceeds the increase in empirical risk, the population risk can be improved. An empirical illustration of this result for the MNIST dataset is provided in Figure 1, where model compression can lead to population risk improvement (details are given in Section 7). To better demonstrate our theoretical results, we investigate the example of linear regression comprehensively, where we develop explicit bounds on the generalization error and the increase in empirical risk.
Our results also suggest a way to improve a compression method based on Hessian-weighted K-means clustering [11], in both the scalar and vector cases, by regularizing the distance between the cluster centers. Our experiments with neural networks validate our theoretical assertions and demonstrate the effectiveness of the proposed regularizer.

1.2. Related Works

There have been many studies on model compression for deep neural networks. The compression can be achieved by varying the training process, e.g., network structure optimization [14], low precision neural networks [15], and neural networks with binary weights [16,17]. Here we mainly discuss compression approaches that are applied to a pre-trained model.
Pruning, quantization, and matrix factorization are the most popular approaches to compressing pre-trained deep neural networks. The study of pruning algorithms, which remove redundant parameters from neural networks, dates back to the 1980s and 1990s [18,19,20]. More recently, an iterative pruning and retraining algorithm to further reduce the size of deep models was proposed in [9,21]. The method of network quantization or weight sharing, i.e., employing a clustering algorithm to group the weights in a neural network, and its variants, including vector quantization [22], soft quantization [23,24], fixed point quantization [25], transform quantization [26], and Hessian-weighted quantization [11], have been extensively investigated. Matrix factorization, where a low-rank approximation of the weight matrix is used instead of the original weights, has also been widely studied [27,28,29].
All of the aforementioned works demonstrate the effectiveness of their compression methods via comprehensive numerical experiments. Little research has been done to develop a theoretical understanding of how model compression affects performance. In [30], an information-theoretic view of model compression via rate distortion theory is provided, with the focus on characterizing the tradeoff between model compression and only the empirical risk of the compressed model. In [31,32,33], using a PAC-Bayesian framework, non-vacuous generalization error bounds for compressed models are derived based on their smaller model complexity.
In contrast to these works, instead of focusing on minimizing only the empirical risk as in [30], or minimizing only the generalization error as in [33], we use the mutual information based generalization error bound developed in [13,34] jointly with rate distortion theory to connect the analyses of generalization error and empirical risk. This way, we are able to characterize the tradeoff between the decrease in generalization error and the increase in empirical risk that results from model compression, and thus provide an understanding as to why model compression can improve the population risk. More importantly, our theoretical studies offer insights into the design of practical model compression algorithms.
The rest of the paper is organized as follows. In Section 2, we provide relevant definitions and review relevant results from rate distortion theory. In Section 3, we prove that model compression results in the tightening of an information-theoretic generalization error upper bound. In Section 4, we use rate distortion theory to characterize the tradeoff between the increase in empirical risk and the decrease in generalization error that results from model compression. In Section 5, we quantify this tradeoff for a linear regression model. In Section 6, we discuss how the Hessian-weighted K-means clustering compression approach can be improved by using a regularizer motivated by our theoretical results. In Section 7, we provide some experiments with neural network models to validate our theoretical results and demonstrate the effectiveness of the proposed regularizer.
Notation 1.
For a random variable X generated from a distribution μ, we use $\mathbb{E}_{X \sim \mu}$ to denote the expectation taken over X with distribution μ. We use $I_d$ to denote the d-dimensional identity matrix and $\|A\|$ to denote the spectral norm of a matrix A. The cumulant generating function (CGF) of a random variable X is defined as $\Lambda_X(\lambda) \triangleq \ln \mathbb{E}\big[e^{\lambda(X - \mathbb{E}X)}\big]$. All logarithms are the natural ones.

2. Preliminaries

2.1. Review of Rate Distortion Theory

Rate distortion theory, introduced by Shannon [35], is a major branch of information theory that studies the fundamental limits of lossy data compression. It characterizes the minimal number of bits per symbol, as measured by the rate R, required to transmit a random variable W such that the receiver can reconstruct W without exceeding a given distortion D.
Specifically, let $W^m = \{W_1, W_2, \ldots, W_m\}$ denote a sequence of m i.i.d. random variables $W_i \in \mathcal{W}$ generated from a source distribution $P_W$. An encoder $f_m: \mathcal{W}^m \to \{1, 2, \ldots, M\}$ maps the message $W^m$ into a codeword, and a decoder $g_m: \{1, 2, \ldots, M\} \to \hat{\mathcal{W}}^m$ reconstructs the message by an estimate $\hat{W}^m$ from the codeword, where $\hat{\mathcal{W}}$ denotes the range of $\hat{W}$. A distortion metric $d: \mathcal{W} \times \hat{\mathcal{W}} \to \mathbb{R}^+$ quantifies the difference between the original and reconstructed messages. The distortion between sequences $w^m$ and $\hat{w}^m$ is defined to be
$$ d(w^m, \hat{w}^m) \triangleq \frac{1}{m} \sum_{i=1}^m d(w_i, \hat{w}_i). $$
A commonly used distortion metric is the squared distortion $d(w, \hat{w}) = (w - \hat{w})^2$.
Definition 1.
An $(m, M, D)$-triple is achievable if there exists a (probabilistic) encoder-decoder pair $(f_m, g_m)$ such that the codeword alphabet has size M and the expected distortion satisfies $\mathbb{E}\big[d\big(W^m, g_m(f_m(W^m))\big)\big] \le D$.
Now we define the rate-distortion and distortion-rate functions for lossy data compression.
Definition 2.
The rate-distortion function and the distortion-rate function are defined as
$$ R(D) \triangleq \lim_{m \to \infty} \frac{1}{m} \log_2 M^*(m, D), $$
$$ D(R) \triangleq \lim_{m \to \infty} D^*(m, R), $$
where $M^*(m, D) \triangleq \min\{M : (m, M, D) \text{ is achievable}\}$ and $D^*(m, R) \triangleq \min\{D : (m, 2^{mR}, D) \text{ is achievable}\}$.
The main theorem of rate distortion theory is as follows.
Lemma 1
([36]). For an i.i.d. source W with distribution $P_W$ and distortion function $d(w, \hat{w})$:
$$ R(D) = \min_{P_{\hat{W}|W}:\, \mathbb{E}[d(W, \hat{W})] \le D} I(W; \hat{W}), $$
$$ D(R) = \min_{P_{\hat{W}|W}:\, I(W; \hat{W}) \le R} \mathbb{E}\big[d(W, \hat{W})\big], $$
where $I(W; \hat{W}) \triangleq \mathbb{E}_{W, \hat{W}}\big[\ln \frac{P_{W,\hat{W}}}{P_W P_{\hat{W}}}\big]$ denotes the mutual information between W and $\hat{W}$.
The rate-distortion function quantifies the smallest number of bits required to compress the data given the distortion, and the distortion-rate function quantifies the minimal distortion that can be achieved under the rate constraint.
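As a concrete illustration (a standard textbook example rather than a result of this paper), consider a scalar Gaussian source $W \sim \mathcal{N}(0, \sigma_W^2)$ with squared distortion. With the rate measured in nats (consistent with the natural logarithms used throughout), the two functions have the closed forms
$$ R(D) = \left[\frac{1}{2} \ln \frac{\sigma_W^2}{D}\right]^+, \qquad D(R) = \sigma_W^2 e^{-2R}, $$
so the achievable distortion decays exponentially in the rate. The distortion-rate bound for linear regression in Proposition 2 below has exactly this exponential-in-rate form.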

2.2. Generalization Error

Consider an instance space $\mathcal{Z}$, a hypothesis space $\mathcal{W}$, and a non-negative loss function $\ell: \mathcal{W} \times \mathcal{Z} \to \mathbb{R}^+$. A training dataset $S = \{Z_1, \ldots, Z_n\}$ consists of n i.i.d. samples $Z_i \in \mathcal{Z}$ drawn from an unknown distribution μ. The goal of a supervised learning algorithm is to find an output hypothesis $w \in \mathcal{W}$ that minimizes the population risk:
$$ L_\mu(w) \triangleq \mathbb{E}_{Z \sim \mu}\big[\ell(w, Z)\big]. $$
In practice, μ is unknown, and therefore $L_\mu(w)$ cannot be computed directly. Instead, the empirical risk of w on the training dataset S is studied, which is defined as
$$ L_S(w) \triangleq \frac{1}{n} \sum_{i=1}^n \ell(w, Z_i). $$
A learning algorithm can be characterized by a randomized mapping from the training dataset S to a hypothesis W according to a conditional distribution $P_{W|S}$. The (expected) generalization error of a supervised learning algorithm is the expected difference between the population risk of the output hypothesis and its empirical risk on the training dataset:
$$ \mathrm{gen}(\mu, P_{W|S}) \triangleq \mathbb{E}_{W,S}\big[L_\mu(W) - L_S(W)\big], $$
where the expectation is taken over the joint distribution $P_{S,W} = P_S P_{W|S}$. The generalization error is used to measure the extent to which the learning algorithm overfits the training data.
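For intuition, these quantities are estimated numerically throughout the paper: the empirical risk is an average over the training set, while the population risk is approximated by the loss on a large held-out set, which in turn gives a plug-in estimate of the generalization error (this is essentially how the curves in Section 7 are obtained). A minimal NumPy sketch, with the loss function and data arrays as illustrative placeholders:

```python
import numpy as np

def empirical_risk(loss_fn, w, X, Y):
    """L_S(w): average loss of hypothesis w over the dataset (X, Y)."""
    return float(np.mean([loss_fn(w, x, y) for x, y in zip(X, Y)]))

def generalization_gap(loss_fn, w, X_train, Y_train, X_test, Y_test):
    """Plug-in estimate of gen = L_mu(w) - L_S(w): a large held-out set
    stands in for the unknown population distribution mu."""
    return (empirical_risk(loss_fn, w, X_test, Y_test)
            - empirical_risk(loss_fn, w, X_train, Y_train))
```

This held-out estimate only approximates the expectation over $P_{S,W}$ in the definition above; averaging over independently trained models, as done in Section 7, reduces this gap.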

3. Compression Can Improve Generalization

In this section, we show that lossy compression can lead to a tighter mutual information based generalization error upper bound, which potentially reduces the generalization error of a supervised learning algorithm.
We start with the following lemma, which provides an upper bound on the generalization error in terms of the mutual information $I(S; W)$ between the training dataset S and the output W of the learning algorithm.
Lemma 2
([13]). Suppose $\ell(w, Z)$ is σ-sub-Gaussian (a random variable X is σ-sub-Gaussian if $\Lambda_X(\lambda) \le \frac{\sigma^2 \lambda^2}{2}, \forall \lambda \in \mathbb{R}$) under $Z \sim \mu$ for all $w \in \mathcal{W}$; then
$$ \big|\mathrm{gen}(\mu, P_{W|S})\big| \le \sqrt{\frac{2\sigma^2}{n} I(S; W)}. $$
Compression can be viewed as a post-processing of the output of a learning algorithm. The output model W generated by a learning algorithm can be quantized, pruned, factorized, or even perturbed by noise, which results in a compressed model $\hat{W}$. Assume that the compression algorithm is based only on W and can be described by a conditional distribution $P_{\hat{W}|W}$. Then the following Markov chain holds: $S \to W \to \hat{W}$. By the data processing inequality,
$$ I(S; \hat{W}) \le \min\big\{I(W; \hat{W}),\, I(S; W)\big\}. $$
Thus, we have the following theorem characterizing the generalization error of the compressed model.
Theorem 1.
Consider a learning algorithm $P_{W|S}$ and a compression algorithm $P_{\hat{W}|W}$, and suppose $\ell(\hat{w}, Z)$ is σ-sub-Gaussian under $Z \sim \mu$ for all $\hat{w} \in \hat{\mathcal{W}}$. Then
$$ \big|\mathrm{gen}(\mu, P_{\hat{W}|S})\big| \le \sqrt{\frac{2\sigma^2}{n} \min\big\{I(W; \hat{W}),\, I(S; W)\big\}}. $$
Note that the generalization error upper bound in Theorem 1 for the compressed model is always no greater than the one in Lemma 2. This allows for the interpretation of compression as a regularization technique to reduce the generalization error.

4. Generalization Error and Model Distortion

In this section, we define a distortion metric in model compression that allows us to relate the distortion (the increase in empirical risk) due to compression with the reduction in the generalization error bound discussed in Section 3.

4.1. Distortion Metric in Model Compression

The expected population risk of a model W can be written as
$$ \mathbb{E}_W\big[L_\mu(W)\big] = \mathbb{E}\big[L_S(W)\big] + \mathrm{gen}(\mu, P_{W|S}), $$
where the first term, which is the expected empirical risk, reflects how well the model W fits the training data, while the second term demonstrates how well the model generalizes. In the empirical risk minimization framework, we control both terms by (1) minimizing the empirical risk of W directly or using other stochastic optimization algorithms, and (2) using regularization methods to control the generalization error, e.g., early stopping and dropout [1].
Now, consider the expected population risk of the compressed model W ^ :
$$ \mathbb{E}_{\hat{W}}\big[L_\mu(\hat{W})\big] = \mathbb{E}\big[L_\mu(\hat{W}) - L_S(\hat{W}) + L_S(\hat{W}) - L_S(W) + L_S(W)\big] = \mathbb{E}\big[L_S(W)\big] + \mathrm{gen}(\mu, P_{\hat{W}|S}) + \mathbb{E}\big[L_S(\hat{W}) - L_S(W)\big]. $$
Compared with (11), we note that the first term (the expected empirical risk) is independent of the compression algorithm, the second term (the generalization error) can be upper bounded by Theorem 1, and the third term $\mathbb{E}[L_S(\hat{W}) - L_S(W)]$ quantifies the increase in the empirical risk when the compressed model $\hat{W}$ is used instead of the original model W. We then define the following distortion metric for model compression:
$$ d_S(\hat{w}, w) \triangleq L_S(\hat{w}) - L_S(w), $$
which is the difference in the empirical risk between the compressed model and the original model. In general, $d_S$ is not guaranteed to be non-negative. However, for the ERM solution W, which is obtained by minimizing the empirical risk $L_S(w)$, we have $d_S(\hat{w}, W) = L_S(\hat{w}) - L_S(W) \ge 0$ for every $\hat{w}$, which ensures that $d_S$ is a valid distortion metric. By Theorem 1, it follows that
$$ \mathbb{E}_{S,W,\hat{W}}\big[L_\mu(\hat{W}) - L_S(W)\big] \le \sqrt{\frac{2\sigma^2}{n} I(W; \hat{W})} + \mathbb{E}_{S,W,\hat{W}}\big[d_S(\hat{W}, W)\big] \triangleq L_{S,W}(P_{\hat{W}|W}), $$
where $L_{S,W}(P_{\hat{W}|W})$ is an upper bound on the expected difference between the population risk of $\hat{W}$ and the empirical risk of the original model W on the training dataset S. Note that $L_S(W)$ is independent of the compression algorithm. Therefore, the bound in (14) can be viewed as an upper bound on the population risk of the compressed model $\hat{W}$, up to this compression-independent constant.
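To make the role of $L_{S,W}$ concrete, the following sketch evaluates the two terms of the bound for a candidate quantizer of a trained weight vector. It is an illustration only: the sub-Gaussian parameter sigma is treated as a known constant, and the natural logarithm of the number of clusters is used as a crude stand-in for $I(W; \hat{W})$ (in the spirit of the bound used in Section 6, not an exact identity).

```python
import numpy as np

def surrogate_bound(train_loss_fn, w, w_hat, n, sigma, num_clusters):
    """Evaluate the two terms of L_{S,W}: a mutual-information proxy term
    plus the empirical-risk increase d_S(w_hat, w) = L_S(w_hat) - L_S(w)."""
    rate_nats = np.log(num_clusters)          # proxy for I(W; W_hat), in nats
    gen_term = np.sqrt(2.0 * sigma**2 * rate_nats / n)
    distortion = train_loss_fn(w_hat) - train_loss_fn(w)
    return gen_term + distortion
```

Among candidate compressed models (e.g., different numbers of clusters or different cluster centers), one would prefer the candidate with the smallest surrogate value; this is exactly the design principle behind the regularized clustering algorithm in Section 6.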

4.2. Population Risk Improvement

By Lemma 1, the smallest distortion that can be achieved at rate R is $D(R) = \min_{P_{\hat{W}|W}:\, I(W;\hat{W}) \le R} \mathbb{E}_{S,W,\hat{W}}\big[d_S(\hat{W}, W)\big]$. Thus, the tightest bound in (14) that can be achieved at rate R is given in the following theorem.
Theorem 2.
Suppose the assumptions in Theorem 1 hold, $P_{W|S}$ minimizes the empirical risk $L_S(W)$, and $I(W; \hat{W}) = R$; then
$$ \min_{P_{\hat{W}|W}:\, I(W;\hat{W}) = R} \mathbb{E}_{S,W,\hat{W}}\big[L_\mu(\hat{W}) - L_S(W)\big] \le \sqrt{\frac{2\sigma^2}{n} R} + D(R). $$
From the properties of the distortion-rate function [36], we know that D(R) is a decreasing function of R. Thus, as R decreases, the first term in (15), which corresponds to the generalization error, decreases, while the second term, which corresponds to the empirical risk, increases. Due to this tradeoff, it may be possible for the bound in (15) to become smaller under compression, i.e., when using a smaller rate R. This indicates that the population risk could improve with a compression algorithm that minimizes the upper bound $L_{S,W}(P_{\hat{W}|W})$.
Remark 1.
In order to conclude definitively that the population risk can be improved with compression, we need to find a lower bound (as a function of R) to match (at least in the order sense) the upper bound in Theorem 2. This appears to be difficult to construct in general. One approach might be to use the same decomposition as in (12) and develop lower bounds for $\min_{I(W;\hat{W})=R} \mathrm{gen}(\mu, P_{\hat{W}|S})$ and $\min_{I(W;\hat{W})=R} \mathbb{E}_{S,W,\hat{W}}[d_S(\hat{W}, W)]$ independently. However, such an approach runs into the following issues: (1) such a lower bound would be loose, since the compression algorithm $P_{\hat{W}|W}$ that minimizes the generalization error, the one that minimizes the distortion, and the one that minimizes the sum of the two can be quite different; and (2) a lower bound on the generalization error needs to be developed, which appears to be difficult, with the existing literature mainly focusing on lower bounding the excess risk, e.g., [37].
As will be shown in Section 7, the population risk can indeed be improved with a well-designed compression algorithm in practical applications.

5. Example: Linear Regression

In this section, we explore the example of linear regression in detail to get a better understanding of the results in Section 4. To this end, we develop explicit upper bounds on the generalization error and on the distortion-rate function D(R). All proofs of the lemmas and propositions in this section are provided in Appendix A, Appendix B, Appendix C and Appendix D.
Suppose that the dataset $S = \{Z_1, \ldots, Z_n\} = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ is generated from the following linear model with weight vector $w^* = (w^*(1), \ldots, w^*(d)) \in \mathbb{R}^d$:
$$ Y_i = X_i^\top w^* + \varepsilon_i, \quad i = 1, \ldots, n, $$
where the $X_i$'s are i.i.d. d-dimensional random vectors with distribution $\mathcal{N}(0, \Sigma_X)$, and the $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ are i.i.d. Gaussian noise variables. We adopt the mean squared error as the loss function, and the corresponding empirical risk on S is
$$ L_S(w) = \frac{1}{n} \sum_{i=1}^n \big(Y_i - X_i^\top w\big)^2 = \frac{1}{n} \big\|Y - X^\top w\big\|_2^2, $$
for $w \in \mathcal{W} = \mathbb{R}^d$, where $X \in \mathbb{R}^{d \times n}$ denotes the matrix of all input samples, and $Y \in \mathbb{R}^n$ denotes the vector of responses. If $n > d$, the ERM solution is
$$ W = (X X^\top)^{-1} X Y, $$
which is deterministic given S. Its generalization error can be computed exactly as in the following lemma (see Appendix A for detailed proof).
Lemma 3.
If n > d + 1 , then
$$ \mathrm{gen}(\mu, P_{W|S}) = \frac{\sigma^2 d}{n}\left(2 + \frac{d+1}{n-d-1}\right). $$
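As a quick sanity check (not part of the original derivation), the closed form in Lemma 3 can be verified by simulation: draw many datasets, compute the ERM solution on each, and compare the average population-minus-empirical risk against the formula. A minimal NumPy sketch, using $\Sigma_X = I_d$ so that the population risk has the closed form $L_\mu(w) = \|w - w^*\|_2^2 + \sigma^2$:

```python
import numpy as np

def mc_generalization_error(d=10, n=40, sigma=1.0, trials=2000, seed=0):
    """Monte Carlo estimate of E[L_mu(W) - L_S(W)] for the ERM solution."""
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=d)
    gaps = []
    for _ in range(trials):
        X = rng.normal(size=(n, d))                  # rows are samples (n x d),
        Y = X @ w_star + sigma * rng.normal(size=n)  # transpose of the paper's X
        W = np.linalg.lstsq(X, Y, rcond=None)[0]     # ERM solution
        train_risk = np.mean((Y - X @ W) ** 2)
        pop_risk = np.sum((W - w_star) ** 2) + sigma**2  # valid for Sigma_X = I_d
        gaps.append(pop_risk - train_risk)
    return float(np.mean(gaps))

print(mc_generalization_error())            # Monte Carlo estimate
print(1.0 * 10 / 40 * (2 + 11 / 29))        # Lemma 3 with d=10, n=40, sigma=1
```

The two printed values should agree to within Monte Carlo error.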

5.1. Information-Theoretic Generalization Bounds for Compressed Linear Model

We note that the mutual information based bound in Lemma 2 is not applicable to this linear regression model, since W is a deterministic function of S and $I(S; W) = \infty$. However, this issue can be resolved if we post-process the ERM solution W with a compression algorithm and upper bound the generalization error using $I(\hat{W}; W)$, as shown in Theorem 1.
Consider a compression algorithm that maps the original weights $W \in \mathbb{R}^d$ to the compressed model $\hat{W} \in \hat{\mathcal{W}} \subseteq \mathbb{R}^d$. For a fixed and compact $\hat{\mathcal{W}}$, we define
$$ C(w^*) \triangleq \sup_{\hat{w} \in \hat{\mathcal{W}}} \big\|\hat{w} - w^*\big\|_2^2, $$
which measures the largest squared distance between a reconstruction $\hat{w} \in \hat{\mathcal{W}}$ and the optimal weights $w^*$. The following proposition provides an upper bound on the generalization error of the compressed model $\hat{W}$; the detailed proof is provided in Appendix B.
Proposition 1.
Consider the ERM solution $W = (XX^\top)^{-1} X Y$, and suppose $\hat{\mathcal{W}}$ is compact. Then
$$ \mathrm{gen}(\mu, P_{\hat{W}|S}) \le 2\sigma_*^2 \sqrt{\frac{I(W; \hat{W})}{n}}, $$
where $\sigma_*^2 \triangleq C(w^*) \|\Sigma_X\| + \sigma^2$.

5.2. Distortion-Rate Function for Linear Model

We now provide an upper bound on the distortion-rate function D(R) for the linear regression model. Note that $\nabla L_S(W) = 0$, since W minimizes the empirical risk. Moreover, the empirical risk is quadratic in w, with Hessian
$$ H_S(W) = \frac{2}{n} X X^\top, $$
which does not depend on W, so that $d_S(\hat{W}, W) = \frac{1}{2}(\hat{W} - W)^\top H_S(W) (\hat{W} - W)$ exactly. The distortion function can then be written as
$$ \mathbb{E}_{S,W,\hat{W}}\big[d_S(\hat{W}, W)\big] = \mathbb{E}_{S,W,\hat{W}}\big[L_S(\hat{W}) - L_S(W)\big] = \mathbb{E}_{S,W,\hat{W}}\Big[(\hat{W} - W)^\top \tfrac{1}{n} X X^\top (\hat{W} - W)\Big]. $$
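For completeness, here is the one-line calculation behind this identity (a routine expansion, stated for the reader's convenience): since $W = (XX^\top)^{-1} X Y$, the training residual $Y - X^\top W$ is orthogonal to the columns of $X^\top$, so for any $\hat{w}$,
$$ L_S(\hat{w}) - L_S(W) = \frac{1}{n}\big\|(Y - X^\top W) + X^\top (W - \hat{w})\big\|_2^2 - \frac{1}{n}\big\|Y - X^\top W\big\|_2^2 = (\hat{w} - W)^\top \frac{1}{n} X X^\top (\hat{w} - W), $$
with the cross term vanishing by orthogonality. In particular, the distortion is non-negative, consistent with the definition of $d_S$ in Section 4.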
The following proposition characterizes upper bounds on R(D) and D(R) for linear regression.
Proposition 2.
For the ERM solution $W = (XX^\top)^{-1} X Y$, we have
$$ R(D) \le \frac{d}{2}\left[\ln \frac{d\sigma^2}{(n-d-1)D}\right]^+, \quad D \ge 0, $$
$$ D(R) \le \frac{d\sigma^2}{n-d-1}\, e^{-\frac{2R}{d}}, \quad R \ge 0, $$
where $(x)^+ = \max\{0, x\}$.
Proof sketch.
The proof of the upper bound on R(D) is based on considering a Gaussian random vector with the same mean and covariance matrix as W. In addition, the upper bound is achieved when $\hat{W}$ is generated from W via a channel that does not depend on the dataset S, with the following conditional distribution:
$$ P_{\hat{W}|W} = \mathcal{N}\Big((1-\alpha) W + \alpha w^*,\; (1-\alpha)\frac{D}{d} \Sigma_X^{-1}\Big), $$
where $\alpha \triangleq \frac{nD}{d\sigma^2} \le 1$. Note that this "compression algorithm" requires knowledge of the optimal weights $w^*$, which are unknown in practice.
The details can be found in Appendix C.    □
Remark 2.
As shown in [38], if $n > d/\epsilon^2$, then $\big\|\frac{1}{n} X X^\top - \Sigma_X\big\| \le \epsilon$ holds with high probability. Then, the following lower bound on R(D) holds if we approximate $\frac{1}{n} X X^\top$ in (23) by $\Sigma_X$:
$$ R(D) \ge \frac{d}{2}\left[\ln \frac{d\sigma^2}{(n-d-1)D}\right]^+ - D(P_W \| P_{W_G}), $$
where $D(P_W \| P_{W_G})$ denotes the Kullback-Leibler divergence and $W_G$ denotes a Gaussian random vector with the same mean and covariance as W. The details can be found in Appendix D.
Combining Propositions 1 and 2, we have the following result.
Corollary 1.
Under the same assumptions as in Proposition 1, we have
$$ \min_{P_{\hat{W}|W}:\, I(W;\hat{W}) = R} \mathbb{E}_{S,W,\hat{W}}\big[L_\mu(\hat{W}) - L_S(W)\big] \le 2\sigma_*^2 \sqrt{\frac{R}{n}} + \frac{d\sigma^2}{n-d-1}\, e^{-\frac{2R}{d}}, \quad R \ge 0. $$
In (28) the first term corresponds to the generalization error, which decreases with compression, and the second term corresponds to the empirical risk, which increases with compression.
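The right-hand side of (28) is a simple one-dimensional function of R and can be tabulated directly. The sketch below (our own illustration, with $\sigma_*^2$ treated as a known constant rather than estimated from data) prints the two terms over a range of rates, making the tradeoff described above easy to inspect:

```python
import numpy as np

def corollary1_terms(R, d, n, sigma2, sigma_star2):
    """The two terms on the right-hand side of (28)."""
    gen_term = 2.0 * sigma_star2 * np.sqrt(R / n)                # grows with R
    dist_term = d * sigma2 / (n - d - 1) * np.exp(-2.0 * R / d)  # decays with R
    return gen_term, dist_term

d, n, sigma2, sigma_star2 = 50, 80, 1.0, 1.5   # illustrative values only
for R in (1.0, 5.0, 25.0, 125.0):              # rate in nats
    g, dist = corollary1_terms(R, d, n, sigma2, sigma_star2)
    print(f"R={R:7.1f}  gen term={g:.3f}  distortion term={dist:.3f}  sum={g + dist:.3f}")
```

As the rate R is reduced (more aggressive compression), the first term shrinks while the second grows, which is precisely the tradeoff that determines whether compression improves the population risk bound.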

5.3. Evaluation and Visualization

In the following plots, we generate the training dataset S using the linear model in (16) with $d = 50$, $n = 80$, $\Sigma_X = I_d$, and $\sigma^2 = 1$. We consider the following two compression algorithms. The first is the conditional distribution $P_{\hat{W}|W}$ used in the achievability proof (26), which requires knowledge of $w^*$ and is denoted "Oracle". The second is the well-known K-means clustering algorithm, where the weights in W are grouped into K clusters and represented by the cluster centers in the reconstruction $\hat{W}$. By changing the number of clusters K, we can control the rate R, i.e., $I(W; \hat{W})$. We average the performance and estimate $I(W; \hat{W})$ for these algorithms over 10,000 Monte Carlo trials.
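A compact NumPy sketch of this setup is given below (an illustrative reimplementation, not the code used to produce Figure 2; it uses scikit-learn's KMeans, chooses $w^*$ with entries $\pm 1$ as an assumption, and omits the mutual-information estimation):

```python
import numpy as np
from sklearn.cluster import KMeans

def population_risk(w, w_star, sigma2=1.0):
    # For Sigma_X = I_d and squared loss: L_mu(w) = ||w - w*||^2 + sigma^2.
    return float(np.sum((w - w_star) ** 2) + sigma2)

d, n, sigma = 50, 80, 1.0
rng = np.random.default_rng(1)
w_star = np.repeat([-1.0, 1.0], d // 2)          # w* well described by 2 centers
X = rng.normal(size=(n, d))                      # rows are samples (n x d)
Y = X @ w_star + sigma * rng.normal(size=n)
W = np.linalg.lstsq(X, Y, rcond=None)[0]         # ERM solution

for K in (2, 4, 8, 16):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(W.reshape(-1, 1))
    W_hat = km.cluster_centers_[km.labels_].ravel()   # quantized weights
    print(K, population_risk(W, w_star), population_risk(W_hat, w_star))
```

In this setting the quantized model typically attains a noticeably smaller population risk than W itself, mirroring the behavior of the K-means curve in Figure 2c.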
We note that $I(W; \hat{W})$ equals the number of bits used in compression only in the asymptotic regime of a large number of samples. In practice, we may have only one sample of the weights W, and therefore $I(W; \hat{W})$ simply measures the extent of compression performed by the compression algorithm.
In Figure 2a, we plot the generalization error bound in Proposition 1 as a function of the rate R and compare the generalization errors of the Oracle and K-means algorithms. It can be seen that Proposition 1 provides a valid upper bound for the generalization error, but this bound is tight only when R is small. Moreover, both compression algorithms can achieve smaller generalization errors compared to that of the ERM solution W, which validates the result in Theorem 1.
Figure 2b plots the upper bound on the distortion-rate function in Proposition 2 and the distortions achieved by the Oracle and K-means algorithms. The distortion of the Oracle decreases as we increase the rate R and matches the D(R) function well. However, there is a large gap between the distortion achieved by the K-means algorithm and D(R). One possible explanation is that, since $w^*$ is unknown, it is impossible for the K-means algorithm to learn the optimal cluster centers with only one sample of W. Even if we view $W^{(j)}$, $j = 1, \ldots, d$, as i.i.d. samples from the same distribution, there is still a gap between the distortion achieved by the K-means algorithm and the optimal quantization, as studied in [39].
We plot the population risks of the ERM solution W, the Oracle, and the K-means algorithm in Figure 2c. It is not surprising that the Oracle algorithm achieves a small population risk, since $\hat{W}$ is a function of $w^*$ and $\hat{W} = w^*$ when $R = 0$. However, it can be seen that the K-means algorithm also achieves a smaller population risk than the original model W when we use fewer clusters, i.e., a smaller rate R, since the decrease in generalization error exceeds the increase in empirical risk. We note that the minimal population risk is achieved at $K = 2$, since we initialize $w^*$ so that the entries $w^*(i)$, $1 \le i \le d$, can be well approximated by two cluster centers.

6. Clustering Algorithm Minimizing $L_{S,W}$

In this section, we propose an improvement of the Hessian-weighted (HW) K-means clustering algorithm [11] for model compression that regularizes the distance between the cluster centers so as to minimize the upper bound $L_{S,W}(P_{\hat{W}|W})$, as suggested by our theoretical results in Section 4.

6.1. Hessian-Weighted K-Means Clustering

The goal of HW K-means is to minimize the distortion in the empirical risk $d_S(\hat{W}, W)$, which has the following Taylor series approximation:
$$ d_S(\hat{W}, W) \approx (\hat{W} - W)^\top \nabla L_S(W) + \frac{1}{2} (\hat{W} - W)^\top H_S(W) (\hat{W} - W), $$
where $H_S(W)$ is the Hessian matrix. Assuming that W is a local minimum of $L_S(W)$ (the ERM solution) so that $\nabla L_S(W) \approx 0$, the first term can be ignored. Furthermore, the Hessian matrix $H_S(W)$ can be approximated by a diagonal matrix, which further simplifies the objective (up to a constant factor) to $d_S(\hat{W}, W) \approx \sum_{j=1}^d h^{(j)} \big(W^{(j)} - \hat{W}^{(j)}\big)^2$, where $h^{(j)}$ is the j-th diagonal element of the Hessian matrix.
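The diagonal entries $h^{(j)}$ can be obtained in several ways; [11] uses second-order information from the loss. As an illustration only, the following PyTorch sketch computes a common diagonal proxy, the average squared mini-batch gradient (an empirical-Fisher-style approximation, which is an assumption on our part and not necessarily the estimator used in [11]):

```python
import torch

def diag_hessian_proxy(model, loss_fn, data_loader, device="cpu"):
    """Average squared gradients as a diagonal-Hessian proxy for h^(j)."""
    model.to(device)
    proxy = [torch.zeros_like(p) for p in model.parameters()]
    num_batches = 0
    for x, y in data_loader:
        x, y = x.to(device), y.to(device)
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for acc, p in zip(proxy, model.parameters()):
            if p.grad is not None:
                acc.add_(p.grad.detach() ** 2)
        num_batches += 1
    return [acc / max(num_batches, 1) for acc in proxy]
```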
Given network parameters $w = \{w^{(1)}, \ldots, w^{(d)}\}$, the HW K-means clustering algorithm [11] partitions them into K disjoint clusters, described by a set of cluster centers $c = \{c^{(1)}, \ldots, c^{(K)}\}$ and a cluster assignment $\mathcal{C} = \{\mathcal{C}^{(1)}, \ldots, \mathcal{C}^{(K)}\}$, by solving the following optimization problem:
$$ \min_{c, \mathcal{C}} \sum_{k=1}^K \sum_{w^{(j)} \in \mathcal{C}^{(k)}} h^{(j)} \big|w^{(j)} - c^{(k)}\big|^2. $$

6.2. Diameter Regularization

In contrast to HW K-means, which only accounts for the empirical risk, our goal is to obtain as small a population risk as possible by minimizing the upper bound
$$ L_{S,W}(P_{\hat{W}|W}) = \sqrt{\frac{2\sigma^2}{n} I(W; \hat{W})} + \mathbb{E}\big[d_S(\hat{W}, W)\big]. $$
Here, we let the number of clusters K be an input argument of the algorithm, so that $I(W; \hat{W}) \le \log_2 K$, and we seek to minimize $L_{S,W}(P_{\hat{W}|W})$ for a given K by carefully designing the reconstructed weights, i.e., by choosing the cluster centers $\{c^{(1)}, \ldots, c^{(K)}\}$. Minimizing the sub-Gaussian parameter σ is then one way to control the generalization error of the compression algorithm. Recall that in Proposition 1, we have
$$ \mathrm{gen}(\mu, P_{\hat{W}|S}) \le 2\big(C(w^*) \|\Sigma_X\| + \sigma^2\big) \sqrt{\frac{I(W; \hat{W})}{n}}, $$
where the sub-Gaussian parameter is related to $C(w^*) = \sup_{\hat{w} \in \hat{\mathcal{W}}} \|\hat{w} - w^*\|_2^2$ in linear regression. Note that this quantity is closely related to the diameter of the set $\hat{\mathcal{W}}$. Since the ground truth $w^*$ is unknown in practice, we propose the following diameter regularization, which approximates $C(w^*)$ in (32) by
$$ \beta \max_{k_1, k_2} \big|c^{(k_1)} - c^{(k_2)}\big|^2, \quad \beta \ge 0, $$
where β is a parameter that controls the penalty term and can be selected by cross-validation in practice. Our diameter-regularized Hessian-weighted (DRHW) K-means algorithm solves the following optimization problem:
$$ \min_{c, \mathcal{C}} \sum_{k=1}^K \sum_{w^{(j)} \in \mathcal{C}^{(k)}} h^{(j)} \big|w^{(j)} - c^{(k)}\big|^2 + \beta \max_{k_1, k_2} \big|c^{(k_1)} - c^{(k_2)}\big|^2. $$
Such an optimization problem can easily be extended to the vector case, which leads to a vector quantization algorithm. Suppose that we group the d-dimensional weight vector $w = \{w^{(1)}, \ldots, w^{(d)}\}$ into $d' = d/m$ vectors of length m, i.e., $\{w^{(1)}, \ldots, w^{(d')}\}$ with $w^{(j)} \in \mathbb{R}^m$; then our goal is to find cluster centers $c^{(k)} \in \mathbb{R}^m$ and assignments minimizing the following cost function:
$$ \min_{c, \mathcal{C}} \sum_{k=1}^K \sum_{w^{(j)} \in \mathcal{C}^{(k)}} \big(w^{(j)} - c^{(k)}\big)^\top H^{(j)} \big(w^{(j)} - c^{(k)}\big) + \beta \max_{k_1, k_2} \big\|c^{(k_1)} - c^{(k_2)}\big\|_2^2, $$
where $H^{(j)}$ is the diagonal Hessian matrix corresponding to the vector $w^{(j)}$. An iterative algorithm to solve the above optimization problem for vector quantization is provided in Algorithm 1.
The algorithm alternates between minimizing the objective function over the cluster centers and the assignments. In the Assignment step, we first fix centers and assign each w ( j ) to its nearest neighbor. We then fix assignments and update the centers by the weighted mean of each cluster in the Update step. For the farthest pair of centers, the diameter regularizer pushes them toward each other, so that the output centers have potentially smaller diameters than those of regular K-means. We note that the time complexity of the proposed diameter-regularized Hessian weighted K-means algorithm is the same as that of the original K-means algorithm.
Algorithm 1 Diameter-regularized Hessian-weighted K-means (vector case)
  • Input: weight vectors $\{w^{(1)}, \ldots, w^{(d')}\}$, diagonal Hessian matrices $\{H^{(1)}, \ldots, H^{(d')}\}$, diameter regularizer $\beta > 0$, number of clusters K, number of iterations T
  • Initialize the K cluster centers $\{c_0^{(1)}, \ldots, c_0^{(K)}\}$ randomly
  • for t = 1 to T do
  •   Assignment step:
  •   Initialize $\mathcal{C}_t^{(k)} = \emptyset$ for all $k \in [K]$.
  •   for j = 1 to d' do
  •     Assign $w^{(j)}$ to the nearest cluster center, i.e., find $k_t(j) = \arg\min_{k \in [K]} \|w^{(j)} - c_{t-1}^{(k)}\|_2^2$ and let $\mathcal{C}_t^{(k_t(j))} \leftarrow \mathcal{C}_t^{(k_t(j))} \cup \{w^{(j)}\}$
  •   end for
  •   Update step:
  •   Find the current farthest pair of centers $(k_1, k_2) = \arg\max_{k_1, k_2} \|c_{t-1}^{(k_1)} - c_{t-1}^{(k_2)}\|_2^2$.
  •   Update $c_t^{(k_1)}$ and $c_t^{(k_2)}$ by
      $c_t^{(k_1)} = \Big(\sum_{w^{(j)} \in \mathcal{C}_t^{(k_1)}} H^{(j)} + \beta I_m\Big)^{-1} \Big(\sum_{w^{(j)} \in \mathcal{C}_t^{(k_1)}} H^{(j)} w^{(j)} + \beta c_t^{(k_2)}\Big)$
      $c_t^{(k_2)} = \Big(\sum_{w^{(j)} \in \mathcal{C}_t^{(k_2)}} H^{(j)} + \beta I_m\Big)^{-1} \Big(\sum_{w^{(j)} \in \mathcal{C}_t^{(k_2)}} H^{(j)} w^{(j)} + \beta c_t^{(k_1)}\Big)$
  •   for k = 1 to K, $k \notin \{k_1, k_2\}$ do
  •     Update the cluster centers by $c_t^{(k)} = \Big(\sum_{w^{(j)} \in \mathcal{C}_t^{(k)}} H^{(j)}\Big)^{-1} \sum_{w^{(j)} \in \mathcal{C}_t^{(k)}} H^{(j)} w^{(j)}$
  •   end for
  • end for
  • Output: centers $\{c_T^{(1)}, \ldots, c_T^{(K)}\}$ and assignments $\{\mathcal{C}_T^{(1)}, \ldots, \mathcal{C}_T^{(K)}\}$.
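For readers who prefer code, below is a minimal NumPy sketch of Algorithm 1 (our own illustrative reimplementation, not the released experiment code; the coupled update of the farthest pair is handled by simply plugging in the centers from the previous iteration, which is one possible reading of the update step):

```python
import numpy as np

def drhw_kmeans(W, H, K, beta, T=50, seed=0):
    """Diameter-regularized Hessian-weighted K-means (vector case sketch).

    W : (d', m) weight vectors; H : (d', m) diagonal Hessian entries;
    K : number of clusters; beta : diameter regularizer; T : iterations.
    Returns cluster centers (K, m) and assignments (d',).
    """
    rng = np.random.default_rng(seed)
    d_prime, m = W.shape
    centers = W[rng.choice(d_prime, size=K, replace=False)].copy()

    for _ in range(T):
        # Assignment step: each vector goes to its nearest center.
        dists = ((W[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)

        # Farthest pair of current centers (targets of the diameter penalty).
        gaps = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        k1, k2 = np.unravel_index(gaps.argmax(), gaps.shape)

        new_centers = centers.copy()
        for k in range(K):
            members = labels == k
            Hk, Wk = H[members], W[members]
            hw_sum = (Hk * Wk).sum(axis=0)        # sum_j H^(j) w^(j)
            if k in (k1, k2):
                other = centers[k2] if k == k1 else centers[k1]
                new_centers[k] = (hw_sum + beta * other) / (Hk.sum(axis=0) + beta)
            elif members.any():
                new_centers[k] = hw_sum / np.maximum(Hk.sum(axis=0), 1e-12)
        centers = new_centers

    # Final assignment with the updated centers.
    dists = ((W[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dists.argmin(axis=1)
```

The scalar variant in (34) corresponds to m = 1, i.e., each weight is treated as its own length-one vector.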

7. Experiments

In this section, we provide real-world experiments to validate our theoretical assertions and the DRHW K-means algorithm. (The code for our experiments is available at https://github.com/wgao9/weight-quant (accessed on 13 August 2021).) Our experiments include compression of: (i) a three-layer fully connected network on the MNIST dataset [40]; and (ii) a convolutional neural network with five convolutional layers and three linear layers on the CIFAR10 dataset [41]. (We downloaded the pre-trained model in PyTorch from https://github.com/aaron-xichen/pytorch-playground (accessed on 13 August 2021).)
Theorem 1 provides an upper bound on the expected generalization error; therefore, we independently train 50 different models (with the same structure but different parameter initializations) using different subsets of the training samples, and average the results. We use 10% of the training data to train the model for MNIST and 20% of the training data for CIFAR10. For each experiment, we use the same number of clusters for every convolutional and fully connected layer.
In the following experiments, we plot the cross-entropy loss as a function of the compression ratio. Note that the compression ratio can be controlled by changing the number of clusters K in the quantization algorithm. To see this, suppose that the neural network has a total of d parameters that need to be compressed, and each parameter takes b bits. Let $\mathcal{C}^{(k)}$ be the set of weights in cluster k, and let $b_k$ be the number of bits of the codeword assigned to the network parameters in cluster k, for $1 \le k \le K$. For the lookup table used to decode the quantized values, we need Kb bits to store the reconstructed weights, i.e., the cluster centers $c = \{c^{(1)}, \ldots, c^{(K)}\}$. The compression ratio is then given by
$$ \text{Compression Ratio} = \frac{\sum_{k=1}^K |\mathcal{C}^{(k)}|\, b_k + K b}{d\, b}, $$
where | · | denotes the number of elements in the set. In our experiments, we use a variable-length code such as the Huffman code to compute the compression ratio under different numbers of clusters K.
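The codeword lengths $b_k$ follow from the cluster occupancy counts. A small self-contained sketch (illustrative; the released experiment code may compute this differently) builds a Huffman code over the clusters with Python's heapq and evaluates the ratio in (36):

```python
import heapq
from itertools import count

def huffman_code_lengths(counts):
    """Codeword length b_k for each cluster under a Huffman code
    built from the cluster occupancy counts."""
    if len(counts) == 1:
        return [1]                      # a single symbol still needs one bit
    tiebreak = count()
    # Each heap item: (total_count, unique_id, list_of_member_clusters).
    heap = [(c, next(tiebreak), [k]) for k, c in enumerate(counts)]
    heapq.heapify(heap)
    lengths = [0] * len(counts)
    while len(heap) > 1:
        c1, _, m1 = heapq.heappop(heap)
        c2, _, m2 = heapq.heappop(heap)
        for k in m1 + m2:               # every merge adds one bit to its members
            lengths[k] += 1
        heapq.heappush(heap, (c1 + c2, next(tiebreak), m1 + m2))
    return lengths

def compression_ratio(counts, bits_per_weight=32):
    """Evaluate (36): (sum_k |C^(k)| b_k + K*b) / (d*b)."""
    b_k = huffman_code_lengths(counts)
    d, K, b = sum(counts), len(counts), bits_per_weight
    return (sum(c * l for c, l in zip(counts, b_k)) + K * b) / (d * b)

print(compression_ratio([500, 300, 150, 50]))   # example cluster occupancies
```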
In Figure 3 and Figure 4, we compare the scalar DRHW K-means algorithm with the scalar HW K-means algorithm for different compression ratios on the MNIST and CIFAR10 datasets. Both figures demonstrate that compression increases the empirical risk but decreases the generalization error, and the net effect is that both compressed models have smaller population risks than those of the original models. More importantly, the DRHW K-means algorithm produces a compressed model with a better population risk than that of the HW K-means algorithm.
In Figure 5, we compare the population risk of the scalar DRHW K-means algorithm with that of the vector DRHW K-means algorithm with block length $m = 2$ for different compression ratios on the MNIST dataset. It can be seen from the figure that the improvement from vector quantization ($m = 2$) is quite modest, which suggests that the dependence between the weights $W^{(j)}$ is weak. However, we can still observe the improvement from adding the diameter regularizer in the vector DRHW K-means algorithm by comparing the curves with $\beta = 50$ and $\beta = 0$.
In Figure 6, we demonstrate how β affects the performance of our diameter-regularized Hessian-weighted K-means algorithm in the scalar case. It can be seen that as β increases, the generalization error decreases and the distortion in empirical risk increases, which validates the idea that the proposed diameter regularizer can be used to reduce the generalization error. The value of β that results in the best population risk can therefore be chosen via cross-validation in practice.

8. Conclusions

In this paper, we have provided an information-theoretic understanding of how model compression affects the population risk of a compressed model. In particular, our results indicate that model compression may increase the empirical risk but decrease the generalization error. Therefore, it might be possible to achieve a smaller population risk via model compression. Our experiments validate these theoretical findings. Furthermore, we showed how our information-theoretic bound on the population risk can be used to optimize practical compression algorithms.
We note that our results could be applied to improve other compression algorithms, such as pruning and matrix factorization. Moreover, we believe that the information-theoretic analysis adopted here could be generalized to characterize a similar tradeoff between the generalization error and empirical risk in other applications beyond compressing pre-trained models, e.g., distributed optimization [42] and low precision training [15].

Author Contributions

Y.B.: theoretical analysis, methodology, conceptualization, writing—original draft. W.G.: software, methodology and visualization. S.Z.: writing—review and editing. V.V.V.: supervision, funding acquisition, and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196, through the University of Illinois at Urbana-Champaign.

Data Availability Statement

Data and code can be found in Section 7.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Lemma 3

Let $\tilde{Z} = (\tilde{X}, \tilde{Y})$, with $\tilde{X} \in \mathbb{R}^d$ and $\tilde{Y} \in \mathbb{R}$, denote an independent copy of a training sample $Z_i$. Then, it can be shown that
$$ \mathrm{gen}(\mu, P_{W|S}) = \mathbb{E}_{W,S}\big[L_\mu(W) - L_S(W)\big] = \mathbb{E}_{W,S}\Big[\mathbb{E}_{\tilde{Z}}\big[(\tilde{Y} - \tilde{X}^\top W)^2\big] - \tfrac{1}{n}\|Y - X^\top W\|_2^2\Big] = \mathbb{E}_{S}\Big[\mathbb{E}_{\tilde{Z}}\big[(\tilde{Y} - \tilde{X}^\top (XX^\top)^{-1} X Y)^2\big] - \tfrac{1}{n}\|Y - X^\top (XX^\top)^{-1} X Y\|_2^2\Big], $$
where $\tilde{Y} = \tilde{X}^\top w^* + \tilde{\varepsilon}$ and $Y = X^\top w^* + \varepsilon$. Then, we have
$$ \mathrm{gen}(\mu, P_{W|S}) = \mathbb{E}_{\varepsilon, \tilde{\varepsilon}, X, \tilde{X}}\big[(\tilde{\varepsilon} - \tilde{X}^\top (XX^\top)^{-1} X \varepsilon)^2\big] - \tfrac{1}{n} \mathbb{E}_{\varepsilon, X}\big[\|\varepsilon - X^\top (XX^\top)^{-1} X \varepsilon\|_2^2\big] = \mathbb{E}_{\varepsilon, X, \tilde{X}}\big[\varepsilon^\top X^\top (XX^\top)^{-1} \tilde{X} \tilde{X}^\top (XX^\top)^{-1} X \varepsilon + \tfrac{1}{n} \varepsilon^\top X^\top (XX^\top)^{-1} X \varepsilon\big] = \mathbb{E}_{\varepsilon, X}\big[\mathrm{Tr}\big(X^\top (XX^\top)^{-1} \Sigma_X (XX^\top)^{-1} X \varepsilon \varepsilon^\top\big)\big] + \frac{\sigma^2 d}{n} = \sigma^2 \mathbb{E}_{X}\big[\mathrm{Tr}\big((XX^\top)^{-1} \Sigma_X\big)\big] + \frac{\sigma^2 d}{n}. $$
Note that the $X_i$'s are i.i.d. samples from $\mathcal{N}(0, \Sigma_X)$, so $(XX^\top)^{-1}$ follows the inverse Wishart distribution $\mathrm{Wishart}^{-1}(\Sigma_X^{-1}, n)$ with n degrees of freedom, and $\mathbb{E}[(XX^\top)^{-1}] = \frac{\Sigma_X^{-1}}{n-d-1}$. It then follows that
$$ \mathrm{gen}(\mu, P_{W|S}) = \frac{\sigma^2}{n-d-1} \mathrm{Tr}\big(\Sigma_X^{-1} \Sigma_X\big) + \frac{\sigma^2 d}{n} = \frac{\sigma^2 d}{n}\left(2 + \frac{d+1}{n-d-1}\right). $$

Appendix B. Proof of Proposition 1

For all $\hat{w} \in \hat{\mathcal{W}}$, it can be shown that
$$ \ell(\hat{w}, \tilde{Z}) = (\tilde{Y} - \tilde{X}^\top \hat{w})^2 = \big(\tilde{X}^\top (w^* - \hat{w}) + \tilde{\varepsilon}\big)^2. $$
Since $\tilde{X} \sim \mathcal{N}(0, \Sigma_X)$ and $\tilde{\varepsilon} \sim \mathcal{N}(0, \sigma^2)$, it follows that $\ell(\hat{w}, \tilde{Z}) \sim \sigma_{\hat{w}}^2 \chi_1^2$, where
$$ \sigma_{\hat{w}}^2 \triangleq (\hat{w} - w^*)^\top \Sigma_X (\hat{w} - w^*) + \sigma^2, $$
and $\chi_1^2$ denotes the chi-squared distribution with one degree of freedom. Then, the CGF of $\ell(\hat{w}, \tilde{Z})$ is
$$ \Lambda_{\ell(\hat{w}, \tilde{Z})}(\lambda) = -\sigma_{\hat{w}}^2 \lambda - \frac{1}{2} \ln\big(1 - 2\sigma_{\hat{w}}^2 \lambda\big), \quad \lambda \in \Big(-\infty, \frac{1}{2\sigma_{\hat{w}}^2}\Big). $$
Thus, $\ell(\hat{w}, \tilde{Z})$ is not sub-Gaussian over all $\lambda \in \mathbb{R}$. However, it can be shown that
$$ \Lambda_{\ell(\hat{w}, \tilde{Z})}(\lambda) \le \sigma_{\hat{w}}^4 \lambda^2, \quad \lambda < 0. $$
We need the following lemma, taken from Theorem 1 of [43], to proceed with our analysis.
Lemma A1
([43]). Assume that for all $\hat{w} \in \hat{\mathcal{W}}$, $\Lambda_{\ell(\hat{w}, \tilde{Z})}(\lambda) \le \frac{\sigma^2 \lambda^2}{2}$ for $\lambda \le 0$. Then,
$$ \mathrm{gen}(\mu, P_{\hat{W}|S}) \le \sqrt{\frac{2\sigma^2}{n} I(\hat{W}; S)}. $$
Recall that $C(w^*) = \sup_{\hat{w} \in \hat{\mathcal{W}}} \|\hat{w} - w^*\|_2^2$. We then have the following bound on the CGF of $\ell(\hat{w}, \tilde{Z})$:
$$ \Lambda_{\ell(\hat{w}, \tilde{Z})}(\lambda) \le \lambda^2 \max_{\hat{w} \in \hat{\mathcal{W}}} \sigma_{\hat{w}}^4 \le \lambda^2 \big(C(w^*) \|\Sigma_X\| + \sigma^2\big)^2, \quad \lambda < 0. $$
Applying Lemma A1 (with variance proxy $2\big(C(w^*)\|\Sigma_X\| + \sigma^2\big)^2$) together with the data processing inequality, we have
$$ \mathrm{gen}(\mu, P_{\hat{W}|S}) \le 2\big(C(w^*) \|\Sigma_X\| + \sigma^2\big) \sqrt{\frac{I(\hat{W}; W)}{n}}. $$

Appendix C. Proof of Proposition 2

The constraint on the distortion can be written as follows:
$$ D \ge \mathbb{E}_{S,W,\hat{W}}\big[d_S(\hat{W}, W)\big] = \frac{1}{n} \mathbb{E}_{S,W,\hat{W}}\big[(\hat{W} - W)^\top X X^\top (\hat{W} - W)\big]. $$
It follows from Lemma 1 that
$$ R(D) = \min_{P_{\hat{W}|W}} I(\hat{W}; W), \quad \text{s.t. } \mathbb{E}_{S,W,\hat{W}}\Big[(\hat{W} - W)^\top \tfrac{1}{n} X X^\top (\hat{W} - W)\Big] \le D. $$
Note that $\mathbb{E}[W] = w^*$ and $\mathrm{Cov}[W] = \frac{\sigma^2}{n-d-1} \Sigma_X^{-1}$, since W is the ERM solution. In the following proof, we consider a Gaussian random vector $W_G \sim \mathcal{N}\big(w^*, \frac{\sigma^2}{n-d-1} \Sigma_X^{-1}\big)$ with the same mean and covariance matrix as W.
For the upper bound on R(D), consider the channel $P^*_{\hat{W}|W} = \mathcal{N}\big((1-\alpha) W + \alpha w^*, (1-\alpha)\frac{D}{d} \Sigma_X^{-1}\big)$, where $\alpha = \frac{nD}{d\sigma^2} \le 1$. It can be verified that this channel satisfies the constraint on the distortion:
$$ \mathbb{E}_{S,W,\hat{W}}\big[d_S(\hat{W}, W)\big] = \alpha^2 \mathbb{E}\Big[(W - w^*)^\top \tfrac{1}{n} X X^\top (W - w^*)\Big] + (1-\alpha) \frac{D}{d} \mathrm{Tr}\Big(\mathbb{E}\big[\tfrac{1}{n} X X^\top\big] \Sigma_X^{-1}\Big) = \alpha^2 \mathbb{E}\big[\varepsilon^\top X^\top (XX^\top)^{-1} \tfrac{1}{n} X X^\top (XX^\top)^{-1} X \varepsilon\big] + (1-\alpha) D = \frac{\alpha^2}{n} \mathbb{E}\big[\varepsilon^\top X^\top (XX^\top)^{-1} X \varepsilon\big] + (1-\alpha) D = \alpha D + (1-\alpha) D = D. $$
If we let $\xi \sim \mathcal{N}\big(0, (1-\alpha)\frac{D}{d} \Sigma_X^{-1}\big)$, it follows that
$$ R(D) \le I\big(W; (1-\alpha)W + \alpha w^* + \xi\big) \overset{(a)}{\le} I\big(W_G; (1-\alpha)W_G + \xi\big) = \frac{d}{2} \ln\Big(\frac{d\sigma^2}{(n-d-1)D} - \frac{n}{n-d-1} + 1\Big) \le \frac{d}{2}\left[\ln \frac{d\sigma^2}{(n-d-1)D}\right]^+, $$
where (a) is due to the fact that the Gaussian distribution maximizes the mutual information in an additive Gaussian noise channel.
The upper bound on D ( R ) follows immediately from the upper bound on R ( D ) .

Appendix D. Discussion of Remark 2

Suppose that $\frac{1}{n} X X^\top$ can be approximated by $\Sigma_X$ for large n in (A10). It then follows that
$$ R(D) = \min_{P_{\hat{W}|W}} I(\hat{W}; W), \quad \text{s.t. } \mathbb{E}_{S,W,\hat{W}}\big[(\hat{W} - W)^\top \Sigma_X (\hat{W} - W)\big] \le D. $$
It can easily be verified that the channel $P^*_{W|\hat{W}} = \mathcal{N}\big(\hat{W}, \frac{D}{d} \Sigma_X^{-1}\big)$ satisfies the distortion constraint. For any $P_{\hat{W}|W}$ such that $\mathbb{E}_{S,W,\hat{W}}[d_S(\hat{W}, W)] \le D$, it follows that
$$ I(W; \hat{W}) = \mathbb{E}_{W,\hat{W}}\Big[\ln \frac{P_{W|\hat{W}}}{P_W}\Big] = \mathbb{E}_{W,\hat{W}}\Big[\ln \frac{P_{W|\hat{W}}}{P^*_{W|\hat{W}}}\Big] + \mathbb{E}_{W,\hat{W}}\Big[\ln \frac{P^*_{W|\hat{W}}}{P_{W_G}}\Big] - \mathrm{KL}(P_W \| P_{W_G}) \ge \mathbb{E}_{W,\hat{W}}\Big[\ln \frac{P^*_{W|\hat{W}}}{P_{W_G}}\Big] - \mathrm{KL}(P_W \| P_{W_G}), $$
where $\mathrm{KL}(P_W \| P_{W_G})$ is the Kullback-Leibler divergence between the two distributions, and the last step follows from the fact that $\mathrm{KL}(P_{W,\hat{W}} \| P^*_{W,\hat{W}}) \ge 0$. Note that
$$ \mathbb{E}_{W,\hat{W}}\Big[\ln \frac{P^*_{W|\hat{W}}}{P_{W_G}}\Big] = \mathbb{E}_{W,\hat{W}}\Big[\frac{(n-d-1)(W - w^*)^\top \Sigma_X (W - w^*)}{2\sigma^2} - \frac{d (\hat{W} - W)^\top \Sigma_X (\hat{W} - W)}{2D}\Big] + \frac{d}{2} \ln \frac{d\sigma^2}{(n-d-1)D} \overset{(a)}{=} \frac{d}{2} \ln \frac{d\sigma^2}{(n-d-1)D} + \mathbb{E}_{W,\hat{W}}\Big[\frac{d}{2} - \frac{d (\hat{W} - W)^\top \Sigma_X (\hat{W} - W)}{2D}\Big] \overset{(b)}{\ge} \frac{d}{2} \ln \frac{d\sigma^2}{(n-d-1)D}, $$
where (a) follows from the fact that $\mathbb{E}[W] = w^*$ and $\mathrm{Cov}[W] = \frac{\sigma^2}{n-d-1} \Sigma_X^{-1}$, and (b) is due to the fact that $P_{\hat{W}|W}$ satisfies the distortion constraint. Thus,
$$ R(D) \ge \frac{d}{2} \ln \frac{d\sigma^2}{(n-d-1)D} - \mathrm{KL}(P_W \| P_{W_G}). $$

References

  1. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Volume 1.
  2. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
  3. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of Go without human knowledge. Nature 2017, 550, 354.
  4. Huval, B.; Wang, T.; Tandon, S.; Kiske, J.; Song, W.; Pazhayampallil, J.; Andriluka, M.; Rajpurkar, P.; Migimatsu, T.; Cheng-Yue, R. An empirical evaluation of deep learning on highway driving. arXiv 2015, arXiv:1504.01716.
  5. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  6. Cheng, Y.; Wang, D.; Zhou, P.; Zhang, T. A survey of model compression and acceleration for deep neural networks. arXiv 2017, arXiv:1710.09282.
  7. Krishnamoorthi, R. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv 2018, arXiv:1806.08342.
  8. Guo, Y. A survey on methods and theories of quantized neural networks. arXiv 2018, arXiv:1808.04752.
  9. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv 2015, arXiv:1510.00149.
  10. Zhu, C.; Han, S.; Mao, H.; Dally, W.J. Trained ternary quantization. arXiv 2016, arXiv:1612.01064.
  11. Choi, Y.; El-Khamy, M.; Lee, J. Towards the limit of network quantization. arXiv 2016, arXiv:1612.01543.
  12. Lin, Y.; Han, S.; Mao, H.; Wang, Y.; Dally, W.J. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv 2017, arXiv:1712.01887.
  13. Xu, A.; Raginsky, M. Information-theoretic analysis of generalization capability of learning algorithms. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 2524–2533.
  14. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
  15. Gupta, S.; Agrawal, A.; Gopalakrishnan, K.; Narayanan, P. Deep learning with limited numerical precision. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 1737–1746.
  16. Courbariaux, M.; Bengio, Y.; David, J.P. Binaryconnect: Training deep neural networks with binary weights during propagations. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7–12 December 2015; pp. 3123–3131.
  17. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 525–542.
  18. Mozer, M.C.; Smolensky, P. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Denver, CO, USA, 27–30 November 1989; pp. 107–115.
  19. LeCun, Y.; Denker, J.S.; Solla, S.A. Optimal brain damage. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Denver, CO, USA, 26–29 November 1990; pp. 598–605.
  20. Hassibi, B.; Stork, D.G. Second order derivatives for network pruning: Optimal brain surgeon. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), San Francisco, CA, USA, 30 November–3 December 1992; pp. 164–171.
  21. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7–12 December 2015; pp. 1135–1143.
  22. Gong, Y.; Liu, L.; Yang, M.; Bourdev, L. Compressing deep convolutional networks using vector quantization. arXiv 2014, arXiv:1412.6115.
  23. Ullrich, K.; Meeds, E.; Welling, M. Soft weight-sharing for neural network compression. arXiv 2017, arXiv:1702.04008.
  24. Louizos, C.; Ullrich, K.; Welling, M. Bayesian compression for deep learning. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 3288–3298.
  25. Lin, D.; Talathi, S.; Annapureddy, S. Fixed point quantization of deep convolutional networks. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 2849–2858.
  26. Young, S.; Wang, Z.; Taubman, D.; Girod, B. Transform Quantization for CNN Compression. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 1.
  27. Denton, E.L.; Zaremba, W.; Bruna, J.; LeCun, Y.; Fergus, R. Exploiting linear structure within convolutional networks for efficient evaluation. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014; pp. 1269–1277.
  28. Tai, C.; Xiao, T.; Zhang, Y.; Wang, X. Convolutional neural networks with low-rank regularization. arXiv 2017, arXiv:1511.06067.
  29. Novikov, A.; Podoprikhin, D.; Osokin, A.; Vetrov, D.P. Tensorizing neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7–12 December 2015; pp. 442–450.
  30. Gao, W.; Liu, Y.H.; Wang, C.; Oh, S. Rate distortion for model compression: From theory to practice. In Proceedings of the International Conference on Machine Learning (ICML), PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 2102–2111.
  31. Dziugaite, G.K.; Roy, D.M. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv 2017, arXiv:1703.11008.
  32. Neyshabur, B.; Bhojanapalli, S.; McAllester, D.; Srebro, N. Exploring generalization in deep learning. arXiv 2017, arXiv:1706.08947.
  33. Zhou, W.; Veitch, V.; Austern, M.; Adams, R.P.; Orbanz, P. Non-vacuous generalization bounds at the imagenet scale: A PAC-bayesian compression approach. arXiv 2018, arXiv:1804.05862.
  34. Russo, D.; Zou, J. Controlling bias in adaptive data analysis using information theory. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Cadiz, Spain, 9–11 May 2016; pp. 1232–1240.
  35. Shannon, C.E. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec. 1959, 4, 142–163.
  36. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
  37. Grønlund, A.; Kamma, L.; Larsen, K.G.; Mathiasen, A.; Nelson, J. Margin-Based Generalization Lower Bounds for Boosted Classifiers. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 11940–11949.
  38. Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv 2010, arXiv:1011.3027.
  39. Linder, T.; Lugosi, G.; Zeger, K. Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding. IEEE Trans. Inf. Theory 1994, 40, 1728–1740.
  40. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
  41. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
  42. Basu, D.; Data, D.; Karakus, C.; Diggavi, S. Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification and Local Computations. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 14668–14679.
  43. Bu, Y.; Zou, S.; Veeravalli, V.V. Tightening Mutual Information-Based Bounds on Generalization Error. IEEE J. Sel. Areas Inf. Theory 2020, 1, 121–130.
Figure 1. Population risk of the compressed model $\hat{W}$ and the original model W vs. compression ratio (ratio of the number of bits used for the compressed model to the number of bits used for the original model). The generalization error of $\hat{W}$ decreases and the empirical risk of $\hat{W}$ increases with more compression (smaller compression ratio). The population risk of $\hat{W}$ is less than that of W for compression ratios larger than 6% in this figure. As the compression ratio goes to 100% (no compression), the population risk of $\hat{W}$ converges to that of the original model W.
Figure 2. Comparison of three different quantities for linear regression as a function of the rate R in bits. (a) Generalization error. (b) Distortion. (c) Population risk.
Figure 3. Comparison between DRHW K-means ($\beta = 50$) and HW K-means ($\beta = 0$) on MNIST. Top: empirical risks. Bottom: population risks and generalization errors.
Figure 4. Comparison between DRHW K-means ($\beta = 25$) and HW K-means ($\beta = 0$) on CIFAR10. Top: empirical risks. Bottom: population risks and generalization errors.
Figure 5. Comparison between scalar DRHW K-means ($m = 1$) and vector DRHW K-means ($m = 2$) on the MNIST dataset.
Figure 6. DRHW K-means with different $\beta$ on the MNIST dataset with $K = 7$.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
