Article

Universality of Logarithmic Loss in Fixed-Length Lossy Compression †

Department of Electronic and Electrical Engineering, Hongik University, Seoul 04066, Korea
This paper is an extended version of our paper published in the 2015 IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015.
Submission received: 8 April 2019 / Revised: 5 June 2019 / Accepted: 8 June 2019 / Published: 10 June 2019
(This article belongs to the Special Issue Bayesian Inference and Information Theory)

Abstract: We establish a universality of logarithmic loss over a finite alphabet as a distortion criterion in fixed-length lossy compression. For any fixed-length lossy-compression problem under an arbitrary distortion criterion, we show that there is an equivalent lossy-compression problem under logarithmic loss. The equivalence is strong: finding good schemes for the corresponding problem under logarithmic loss is essentially equivalent to finding good schemes for the original problem. This equivalence relation also provides an algebraic structure on the reconstruction alphabet, which allows us to use known techniques from the clustering literature. Furthermore, our result naturally suggests a new algorithm for the categorical data-clustering problem.

1. Introduction

Logarithmic loss is a unique distortion measure in the sense that it allows a “soft” estimation (or reconstruction) of the source. Although logarithmic loss plays a crucial role in learning theory, until recently not much work had been published on it in the context of lossy compression. A few exceptions are a line of work on multiterminal source coding [1,2,3], the single-shot approach to lossy source coding under logarithmic loss [4], and several universal properties of logarithmic loss in information theory [5,6,7]. In [4], Shkel and Verdú focused on the lossy-compression problem when the distortion measure is given by logarithmic loss. On the other hand, Jiao et al. justified logarithmic loss by showing that it is the only loss function that satisfies a natural data-processing requirement [5]. Painsky and Wornell provided a universal property of logarithmic loss in the context of classification [6]. In [7], No focused on the universal property of logarithmic loss in the successive refinement problem. We would also like to point out that the information bottleneck method [8,9,10,11] is related to lossy compression under logarithmic loss; indeed, it is equivalent to the noisy lossy-compression problem under logarithmic loss [12].
In this paper, we present a new universal property of logarithmic loss in fixed-length lossy-compression problems. Consider an arbitrary fixed-length lossy-compression problem, where the source and reconstruction alphabets 𝒳 and 𝒳̂ are discrete, and suppose an arbitrary distortion measure d : 𝒳 × 𝒳̂ → [0, ∞) is given. Then, we show that there exists a corresponding fixed-length lossy-compression problem in which the source alphabet remains the same, but the reconstruction alphabet is a set of distributions on 𝒳 and the distortion measure is logarithmic loss. This implies that there is a correspondence between any fixed-length lossy-compression problem under an arbitrary distortion measure and one under logarithmic loss. The correspondence is in the following strong sense:
  • optimal schemes for the two problems are the same; and
  • a good scheme for one problem is also a good scheme for the other.
We make the notions of “optimal” and “good” schemes precise in later sections. This finding essentially implies that it is enough to consider the lossy-compression problem under logarithmic loss.
The above correspondence provides new insights into the fixed-length lossy-compression problem. In general, the reconstruction alphabet in the lossy-compression problem does not have any well-defined operations. However, in the corresponding lossy compression under logarithmic loss, reconstruction symbols are probability distributions that have their own algebraic structure. Thus, under the corresponding setting, we can apply various techniques, such as the information geometric approach, clustering with Bregman divergence, and relaxation of the optimization problem. Furthermore, the equivalence relation suggests a new algorithm in the categorical data-clustering problem, where data are not in the continuous space.
The remainder of the paper is organized as follows. In Section 2, we revisit some of the known results of logarithmic loss and fixed-length lossy compression. Section 3 is dedicated to the equivalence between lossy compression under arbitrary distortion measures and that under logarithmic loss. In Section 4, we present the geometric interpretation of our result. We provide the log-convex relaxation of lossy compression and connection to the clustering problems in Section 5. Finally, we conclude in Section 6.
Notation: Uppercase X denotes a random variable, and 𝒳 denotes its alphabet. Lowercase x denotes a specific realization of the random variable X, i.e., x ∈ 𝒳. Similarly, X^n denotes an n-dimensional random vector (X_1, X_2, …, X_n), while lowercase x^n denotes a realization of X^n. For a function f : 𝒳 → 𝒴, |f| denotes the size of the image of f, i.e., |{f(x) : x ∈ 𝒳}|. When it is clear from the context, we write ∑_x instead of ∑_{x∈𝒳}. We use natural logarithms, so all quantities are measured in nats instead of bits.

2. Preliminaries

2.1. Logarithmic Loss

Suppose 𝒳 is a finite set of discrete symbols, and M(𝒳) is the set of probability measures on 𝒳. For x ∈ 𝒳 and q ∈ M(𝒳), the logarithmic loss ℓ : 𝒳 × M(𝒳) → [0, ∞] is defined by
$$\ell(x, q) = \log \frac{1}{q(x)}.$$
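For concreteness, here is a minimal Python sketch (ours, not from the paper; names are illustrative) that evaluates this loss in nats:

import numpy as np

def log_loss(x, q):
    # Logarithmic loss ell(x, q) = log(1 / q(x)); equals +infinity when q(x) = 0.
    return -np.log(q[x])

# A uniform "soft" reconstruction over four symbols costs log 4 nats for any outcome.
print(log_loss(2, np.array([0.25, 0.25, 0.25, 0.25])))  # ~1.386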

2.2. Fixed-Length Lossy Compression

In this section, we briefly introduce the basic setting of the fixed-length lossy-compression problem [13]. In this setting, we have a source X with finite alphabet 𝒳 = {1, …, r} and source distribution p_X. An encoder f : 𝒳 → {1, …, M} maps the source symbol to one of M messages. On the other side, a decoder g : {1, …, M} → 𝒳̂ maps the message to an actual reconstruction X̂, where the reconstruction alphabet is also finite, 𝒳̂ = {1, …, s}. Let d : 𝒳 × 𝒳̂ → [0, ∞) be a distortion measure between the source and the reconstruction.
First, we define a code whose expected distortion does not exceed a given distortion level.
Definition 1
(Average distortion criterion). An (M, D) code is a pair of an encoder f with |f| ≤ M and a decoder g such that
$$\mathbb{E}[d(X, g(f(X)))] \le D.$$
The minimum number of codewords required to achieve average distortion not exceeding D is defined by
$$M(D) = \min\{M : \exists\, (M, D)\ \mathrm{code}\}.$$
Similarly, we can define the minimum achievable average distortion for a given number of codewords M:
$$D(M) = \min\{D : \exists\, (M, D)\ \mathrm{code}\}.$$
One may consider a stronger criterion that restricts the probability of exceeding a given distortion level.
Definition 2
(Excess distortion criterion). An (M, D, ϵ) code is a pair of an encoder f with |f| ≤ M and a decoder g such that
$$\Pr[d(X, g(f(X))) > D] \le \epsilon.$$
The minimum number of codewords required to achieve excess-distortion probability ϵ at distortion level D is defined by
$$M(D, \epsilon) = \min\{M : \exists\, (M, D, \epsilon)\ \mathrm{code}\}.$$
Similarly, we can define the minimum achievable excess-distortion probability for a given target distortion D and number of codewords M:
$$\epsilon(M, D) = \min\{\epsilon : \exists\, (M, D, \epsilon)\ \mathrm{code}\}.$$
Given target distortion D and p X , the information rate-distortion function is defined by
$$R(D) = \inf_{p_{\hat{X}|X}:\ \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X}). \tag{1}$$
We make the following benign assumptions:
  • The infimum in Equation (1) is achieved by a unique conditional distribution p_{X̂|X}.
  • We assume that p_{X̂}(x̂) > 0 for all x̂ ∈ 𝒳̂, since we can always discard a reconstruction symbol with zero probability.
  • If d(x, x̂_1) = d(x, x̂_2) for all x ∈ 𝒳, then x̂_1 = x̂_2. (If d(x, x̂_1) = d(x, x̂_2) for all x, then there is no difference between x̂_1 and x̂_2 in terms of loss; thus, we can always discard x̂_2 without loss of generality.)
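The assumptions above refer to the rate-distortion-achieving conditional distribution p_{X̂|X}. For small alphabets, this distribution can be approximated numerically with the classical Blahut–Arimoto iteration. The Python sketch below is not from the paper; it parameterizes the problem by a positive slope lam (which plays the role of λ = −R′(D) at the distortion level the iteration converges to), and the function and variable names are ours.

import numpy as np

def blahut_arimoto(p_x, d, lam, n_iter=1000):
    # p_x: source pmf (length r); d: distortion matrix d[x, x_hat] of shape (r, s);
    # lam: positive slope parameter. Returns (R, D, W) with W approximating p_{X_hat|X}.
    r, s = d.shape
    q = np.full(s, 1.0 / s)                      # current marginal of X_hat
    for _ in range(n_iter):
        W = q[None, :] * np.exp(-lam * d)        # unnormalized test channel
        W /= W.sum(axis=1, keepdims=True)        # normalize each row p(.|x)
        q = p_x @ W                              # update the reconstruction marginal
    D = float(np.sum(p_x[:, None] * W * d))      # achieved average distortion
    R = float(np.sum(p_x[:, None] * W * np.log(W / q[None, :])))  # I(X; X_hat) in nats
    return R, D, W

Sweeping lam traces out the R(D) curve, and the induced reverse conditional p_{X|X̂} then follows from Bayes' rule.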

2.3. D-Tilted Information

Define the information density of joint distribution p X , X ^ by
$$\imath_{X;\hat{X}}(x; \hat{x}) = \log \frac{p_{X,\hat{X}}(x, \hat{x})}{p_X(x)\, p_{\hat{X}}(\hat{x})}.$$
Then, we are ready to define D-tilted information that plays a key role in fixed-length lossy compression.
Definition 3
([13] (Definition 6)). The D-tilted information in x ∈ 𝒳 is defined as
$$\jmath_X(x, D) = \log \frac{1}{\mathbb{E}\left[\exp\big(\lambda D - \lambda\, d(x, \hat{X})\big)\right]},$$
where the expectation is with respect to the marginal distribution of X̂ and λ = −R′(D).
Note that X̂ here is a random variable whose joint distribution with X is p_X × p_{X̂|X} (so the expectation is over its marginal), and R′(D) is the first derivative of the rate-distortion function R(D).
Theorem 1
([14] (Lemma 1.4)). For all x̂ ∈ 𝒳̂,
$$\jmath_X(x, D) = \imath_{X;\hat{X}}(x; \hat{x}) + \lambda\, d(x, \hat{x}) - \lambda D; \tag{2}$$
therefore, we have
$$R(D) = \mathbb{E}[\jmath_X(X, D)].$$
Let p X | X ^ be the induced conditional probability from p X ^ | X . Then, (2) can equivalently be expressed as
$$\log \frac{1}{p_{X|\hat{X}}(x|\hat{x})} = \log \frac{1}{p_X(x)} - \jmath_X(x, D) + \lambda\, d(x, \hat{x}) - \lambda D. \tag{3}$$
The following lemma shows that p X | X ^ ( · | x ^ ) are all distinct.
Lemma 1
([7] (Lemma 2)). For all x̂_1 ≠ x̂_2, there exists x ∈ 𝒳 such that p_{X|X̂}(x|x̂_1) ≠ p_{X|X̂}(x|x̂_2).

3. One-to-One Correspondence Between General Distortion and Logarithmic Loss

3.1. Main Results

Consider fixed-length lossy compression under an arbitrary distortion measure d(·,·), as described in Section 2.2. We have a source X with finite alphabet 𝒳 = {1, …, r}, source distribution p_X, and finite reconstruction alphabet 𝒳̂ = {1, …, s}. For a fixed number of messages M, let f* and g* be an encoder and decoder that achieve the optimal average distortion D(M), i.e.,
$$\mathbb{E}[d(X, g^*(f^*(X)))] = D(M).$$
Let p_{X̂|X} denote the conditional distribution that achieves the rate-distortion function at distortion D = D(M). In other words, p_X × p_{X̂|X} achieves the infimum in
$$R(D(M)) = \inf_{p_{\hat{X}|X}:\ \mathbb{E}[d(X, \hat{X})] \le D(M)} I(X; \hat{X}). \tag{4}$$
Note that R ( D ( M ) ) may be strictly smaller than log M in general since R ( · ) is an information rate-distortion function that does not characterize the best achievable performance for the “one-shot” setting in which D ( M ) is defined.
Now, we define the corresponding fixed-length lossy-compression problem under logarithmic loss. In the corresponding problem, the source alphabet 𝒳 = {1, …, r}, the source distribution p_X, and the number of messages M remain the same. However, we have a different reconstruction alphabet 𝒴 = {p_{X|X̂}(·|x̂) : x̂ ∈ 𝒳̂} ⊂ M(𝒳), where p_{X|X̂} is the conditional distribution induced by the achiever of the infimum in Equation (4) for the original distortion measure. Recall that M(𝒳) is the set of all probability measures on 𝒳. Let the distortion measure of the corresponding problem be logarithmic loss.
We now further connect encoding and decoding schemes between the two problems. Suppose f : 𝒳 → {1, …, M} and g : {1, …, M} → 𝒳̂ are an encoder and decoder pair for the original problem. Given f and g, we define the corresponding encoder and decoder for the corresponding problem as follows. We let the encoder be the same, f′ = f, and define the decoder g′ : {1, …, M} → 𝒴 by
$$g'(m) = p_{X|\hat{X}}(\cdot\,|\,g(m)).$$
Then, f′ and g′ are a valid encoder and decoder pair for the corresponding fixed-length lossy-compression problem under logarithmic loss. Conversely, given f′ and g′, we can recover the corresponding f and g because Lemma 1 guarantees that the distributions p_{X|X̂}(·|x̂) are all distinct.
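In code, the construction of the corresponding scheme is a one-liner; the sketch below (ours, with illustrative names) assumes p_x_given_xhat is an array whose row x̂ stores p_{X|X̂}(·|x̂):

def corresponding_scheme(f, g, p_x_given_xhat):
    # f: original encoder (symbol -> message), g: original decoder (message -> x_hat).
    f_prime = f                                  # the encoder is unchanged, f' = f
    g_prime = lambda m: p_x_given_xhat[g(m)]     # g'(m) = p_{X|X_hat}(. | g(m))
    return f_prime, g_prime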
The following result shows the relation between the corresponding schemes.
Theorem 2.
For any encoder–decoder pair (f′, g′) for the corresponding fixed-length lossy-compression problem under logarithmic loss, we have
$$\mathbb{E}[\ell(X, g'(f'(X)))] = H(X|\hat{X}) + \lambda\left(\mathbb{E}[d(X, g(f(X)))] - D(M)\right) \ge H(X|\hat{X}),$$
where (f, g) is the corresponding encoder–decoder pair for the original lossy-compression problem. Note that H(X|X̂) and the expectations are with respect to the distribution p_X × p_{X̂|X}. Moreover, equality holds if and only if f′ = f* and g′(m) = p_{X|X̂}(·|g*(m)).
Proof. 
We have
$$\mathbb{E}[\ell(X, g'(f'(X)))] = \mathbb{E}\left[\ell\big(X,\, p_{X|\hat{X}}(\cdot\,|\,g(f(X)))\big)\right] = \mathbb{E}\left[\log \frac{1}{p_{X|\hat{X}}(X\,|\,g(f(X)))}\right].$$
Then, Equation (3) implies that
$$\begin{aligned} \mathbb{E}[\ell(X, g'(f'(X)))] &= \mathbb{E}\left[\log \frac{1}{p_X(X)} - \jmath_X(X, D(M))\right] + \mathbb{E}\big[\lambda\, d(X, g(f(X))) - \lambda D(M)\big] \\ &= H(X|\hat{X}) + \lambda\, \mathbb{E}\big[d(X, g(f(X))) - D(M)\big] \qquad\qquad (5) \\ &\ge H(X|\hat{X}), \qquad\qquad (6) \end{aligned}$$
where Equation (5) holds because E[ȷ_X(X, D(M))] = R(D(M)) = I(X; X̂) with respect to the distribution p_X × p_{X̂|X}, and Inequality (6) holds because D(M) is the minimum achievable average distortion with M codewords. Equality holds if and only if E[d(X, g(f(X)))] = D(M), which can be achieved by the optimal scheme for the original lossy-compression problem. In other words, equality holds if
$$f' = f^*, \qquad g'(m) = p_{X|\hat{X}}(\cdot\,|\,g^*(m)).$$
 □
In the above theorem, distortion D(M), which is the minimum achievable distortion in the one-shot setting, plays a critical role. We also use p_{X|X̂}, the rate-distortion-achieving conditional distribution, in the corresponding problem. This might be confusing, since the rate-distortion function characterizes the optimal rate in the asymptotic setting. However, recall that the minimal mutual information between X and X̂ in Equation (1) is the “information” rate-distortion function, which equals the optimal asymptotic rate when the source is independent and identically distributed.
Here, however, we view the “information” rate-distortion function differently. We consider the one-shot setting where the source X and the reconstruction X̂ are single variables. Given the number of messages M, the minimum achievable distortion is D(M). Under this setting, we focus on the minimal mutual information between X and X̂ when the distortion between X and X̂ is restricted to at most D(M). Our theorem implies that the minimizing conditional distribution p_{X|X̂} provides the corresponding one-shot lossy-compression problem under logarithmic loss.
Remark 1.
In the corresponding fixed-length lossy-compression problem under logarithmic loss, the minimal achievable average distortion for a given number of codewords M is H(X|X̂), where the conditional entropy is with respect to the distribution p_X × p_{X̂|X}.
Remark 2.
From now on, we denote the original lossy-compression problem under given distortion measure d ( · , · ) with reconstruction alphabet X ^ by “original problem”. On the other hand, we denote the corresponding lossy-compression problem under logarithmic loss with reconstruction alphabet Y by “corresponding problem”.

3.2. Example: Memoryless Bernoulli Source with Hamming Distortion Measure

In this section, we consider the memoryless Bernoulli source under the Hamming distortion measure as an example of the above equivalence. Let X = U^n be a memoryless Bernoulli(α) source, where 𝒳 = 𝒰^n = {0, 1}^n, and let the reconstruction X̂ = V^n also be an n-dimensional binary vector, where 𝒳̂ = 𝒱^n = {0, 1}^n. Note that the block length n is fixed, so the problem is in the one-shot setting. The distortion measure d is the separable Hamming distortion, i.e.,
$$d(X, \hat{X}) = d_H(U^n, V^n) = \frac{1}{n} \sum_{i=1}^{n} d_H(U_i, V_i),$$
where d_H(u, v) = 1 if u ≠ v and d_H(u, v) = 0 if u = v. Let M be the number of messages. We are then interested in optimal encoding and decoding schemes that achieve distortion D = D(M).
In this scenario, the information rate-distortion function is not hard to compute [15]:
$$R(D) = \inf_{p_{\hat{X}|X}:\ \mathbb{E}[d(X,\hat{X})] \le D} I(X; \hat{X}) = \inf_{p_{U^n|V^n}:\ \mathbb{E}[d_H(U^n, V^n)] \le D} I(U^n; V^n) = n \inf_{p_{U|V}:\ \mathbb{E}[d_H(U, V)] \le D} I(U; V) \tag{7}$$
$$= n\big(h_2(\alpha) - h_2(D)\big), \tag{8}$$
where h_2(·) is the binary entropy function. Let p_{U|V} be the distribution that achieves the infimum in Equation (7). We then have an analytic formula for the rate-distortion-achieving conditional distribution p_{X|X̂}: for x = u^n and x̂ = v^n,
$$p_{X|\hat{X}}(x\,|\,\hat{x}) = \prod_{i=1}^{n} p_{U|V}(u_i\,|\,v_i) = \prod_{i=1}^{n} D^{d_H(u_i, v_i)} (1-D)^{1 - d_H(u_i, v_i)} = (1-D)^n \left(\frac{D}{1-D}\right)^{n\, d_H(u^n, v^n)} = (1-D)^n \left(\frac{D}{1-D}\right)^{n\, d(x, \hat{x})}.$$
Then, the corresponding problem is the rate-distortion problem under logarithmic loss where the set of reconstruction symbols is
$$\mathcal{Y} = \left\{ p_{X|\hat{X}}(\cdot\,|\,\hat{x}) : \hat{x} \in \{0,1\}^n \right\}.$$
Remark 3.
We can rewrite Equation (3) in this case.
$$\ell\big(x, p_{X|\hat{X}}(\cdot\,|\,\hat{x})\big) = \log \frac{1}{p_{X|\hat{X}}(x\,|\,\hat{x})} = n \log \frac{1}{1-D} + n\, d(x, \hat{x}) \log \frac{1-D}{D}.$$
The above equation explicitly shows the correspondence between logarithmic loss and the original distortion measure.
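A short numerical check of this correspondence (our sketch, not from the paper) compares the log loss computed directly from p_{X|X̂} with the affine function of the Hamming distortion above:

import numpy as np

def check_remark3(u, v, D):
    # u, v: binary numpy arrays of length n; D: target distortion, 0 < D < 1/2.
    n = len(u)
    hamming = int(np.sum(u != v))                              # n * d(x, x_hat)
    p = (1 - D) ** n * (D / (1 - D)) ** hamming                # p_{X|X_hat}(u | v)
    direct = -np.log(p)                                        # logarithmic loss
    affine = n * np.log(1 / (1 - D)) + hamming * np.log((1 - D) / D)
    return direct, affine

print(check_remark3(np.array([0, 1, 1, 0, 1]), np.array([0, 1, 0, 0, 0]), D=0.1))
# both entries agree (about 4.92 nats)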

3.3. Discussion

3.3.1. One-to-One Correspondence

Theorem 2 implies that, for any fixed-length lossy-compression problem, we can find an equivalent problem under logarithmic loss where the optimal encoding schemes are the same. Thus, without loss of generality, we can restrict our attention to the problem under logarithmic loss with reconstruction alphabet 𝒴 = {q^{(1)}, …, q^{(s)}} for some q^{(1)}, …, q^{(s)} ∈ M(𝒳).

3.3.2. Scheme Suboptimality

Suppose f and g are a suboptimal encoder and decoder for the original fixed-length lossy-compression problem, and let f′ and g′ be the corresponding encoder and decoder under logarithmic loss. Then, Theorem 2 implies
$$\mathbb{E}[\ell(X, g'(f'(X)))] - H(X|\hat{X}) = \lambda\left(\mathbb{E}[d(X, g(f(X)))] - D(M)\right). \tag{9}$$
The left-hand side of Equation (9) is the cost of suboptimality in the corresponding lossy-compression problem, while the right-hand side is proportional to the cost of suboptimality in the original problem. In Section 3.3.1, we noted that the optimal schemes of the two problems coincide; Equation (9) shows a stronger equivalence in which the costs of suboptimality are linearly related. This implies that a good code for one problem is also good for the other.

3.3.3. Operations on the Reconstruction Alphabet

In general, the reconstruction alphabet 𝒳̂ does not have an algebraic structure. However, in the corresponding rate-distortion problem, the reconstruction alphabet is a set of probability measures, for which we have natural operations such as convex combinations of elements or projection onto a convex hull. We discuss such operations more closely in Section 5.

3.4. Exact Performance of Optimal Scheme

In the previous section, we showed that there is a corresponding lossy-compression problem under logarithmic loss that shares the same optimal coding scheme. In this section, we investigate the exact performance of the optimal scheme for the fixed-length lossy-compression problem under logarithmic loss when the reconstruction alphabet is the set of all probability measures on 𝒳, i.e., M(𝒳). (Recently, Shkel and Verdú [4] independently obtained similar results; the result was also presented in the conference version of this paper [16].) We also characterize the minimal average distortion D(M) for a fixed number of messages M. Note that this is a single-letter version of ([2] (Lemma 1)). Although the optimal scheme associated with M(𝒳) may differ from the optimal scheme with the restricted reconstruction alphabet 𝒴, it provides insight, as we show in Section 4. In this section, we restrict our attention to deterministic schemes; however, it is not hard to show that the same result holds even if we allow a stochastic encoder and decoder.
Let the encoder and decoder be f : 𝒳 → {1, …, M} and g : {1, …, M} → M(𝒳), where g(m) = q^{(m)} ∈ M(𝒳). Then, we have
$$\begin{aligned} \mathbb{E}[\ell(X, g(f(X)))] &= \sum_{x \in \mathcal{X}} p_X(x) \log \frac{1}{q^{(f(x))}(x)} \\ &= H(X) + \sum_{m=1}^{M} \sum_{x \in f^{-1}(m)} p_X(x) \log \frac{p_X(x)}{q^{(m)}(x)} \\ &= H(X) + \sum_{m=1}^{M} u_m \log u_m + \sum_{m=1}^{M} u_m \sum_{x \in f^{-1}(m)} \frac{p_X(x)}{u_m} \log \frac{p_X(x)/u_m}{q^{(m)}(x)}, \end{aligned}$$
where f^{-1}(m) = {x ∈ 𝒳 : f(x) = m} and u_m = ∑_{x ∈ f^{-1}(m)} p_X(x). Since p_{X|f(X)}(x|m) = p_X(x)/u_m for all x ∈ f^{-1}(m), we have
$$\mathbb{E}[\ell(X, g(f(X)))] = H(X) - H(f(X)) + \sum_{m=1}^{M} u_m\, D\!\left(p_{X|f(X)}(\cdot\,|\,m)\,\middle\|\, q^{(m)}\right) \ge H(X) - H(f(X)).$$
Equality can be achieved by choosing q ( m ) = p X | f ( X ) ( · | m ) , which can be done no matter what f is. Thus, we have
$$D(M) = H(X) - \max_{f :\, |f| \le M} H(f(X)).$$
This implies that the optimal encoder is a function f that maximizes H(f(X)), and the optimal decoder is then given by g(m) = p_{X|f(X)}(·|m). The above result also provides a trivial lower bound:
$$D(M) \ge H(X) - \log M.$$
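For a toy source, D(M) can be checked by exhaustive search over encoders; the Python sketch below (ours) simply maximizes H(f(X)) over all maps f with at most M messages and is exponential in |𝒳|:

import itertools
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def d_log_loss(p_x, M):
    # Brute-force D(M) = H(X) - max_{|f| <= M} H(f(X)) for a small alphabet.
    p_x = np.asarray(p_x, dtype=float)
    best = 0.0
    for f in itertools.product(range(M), repeat=len(p_x)):
        u = np.zeros(M)
        for x, m in enumerate(f):
            u[m] += p_x[x]                       # induced distribution of f(X)
        best = max(best, entropy(u))
    return entropy(p_x) - best

print(d_log_loss([0.4, 0.3, 0.2, 0.1], M=2))     # about 0.587 nats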
The optimal scheme under an excess distortion criterion is given in Appendix A.

4. Geometrical Interpretation

In this section, we present another, geometrical interpretation of the decoder in lossy-compression problems. Consider the original lossy-compression problem with discrete reconstruction alphabet 𝒳̂ and distortion measure d(·,·). Suppose an encoding function f with |f| = M is given, which may or may not be optimal. Let A_m = f^{-1}(m) = {x ∈ 𝒳 : f(x) = m}, the set of source symbols that are mapped to message m. Then, the optimal reconstruction g(m) is given by
$$g(m) = \operatorname*{argmin}_{\hat{x} \in \hat{\mathcal{X}}} \mathbb{E}[d(X, \hat{x}) \mid X \in A_m]. \tag{10}$$
Now, consider the corresponding lossy-compression problem under logarithmic loss. Recall that the reconstruction alphabet is given by
$$\mathcal{Y} = \left\{ p_{X|\hat{X}}(\cdot\,|\,\hat{x}) : \hat{x} \in \hat{\mathcal{X}} \right\} \subset \mathcal{M}(\mathcal{X}).$$
As we have seen in Section 3.4, the optimal reconstruction is g_E(m) = p_{X|f(X)}(·|m) if the reconstruction alphabet is extended to all of M(𝒳). Thus, it is natural to look for the probability distribution in 𝒴 that is nearest to g_E(m). We use Kullback–Leibler divergence to measure the distance between probability distributions. In other words, we want to find g̃(m) ∈ 𝒴 such that
$$\tilde{g}(m) = \operatorname*{argmin}_{q \in \mathcal{Y}} D\!\left(g_E(m)\,\middle\|\,q\right). \tag{11}$$
This can be viewed as projecting the optimal solution from the extended set M(𝒳) back onto the original feasible set 𝒴. Since q ∈ 𝒴, there exists x̂ ∈ 𝒳̂ such that q(·) = p_{X|X̂}(·|x̂). Then, the above Kullback–Leibler divergence is given by
$$\begin{aligned} D\!\left(p_{X|f(X)}(\cdot\,|\,m)\,\middle\|\,p_{X|\hat{X}}(\cdot\,|\,\hat{x})\right) &= \sum_{x \in A_m} p_{X|f(X)}(x|m) \log \frac{p_{X|f(X)}(x|m)}{p_{X|\hat{X}}(x|\hat{x})} \\ &= \sum_{x \in A_m} \frac{p_X(x)}{\Pr[X \in A_m]} \log \frac{p_X(x) / \Pr[X \in A_m]}{p_{X|\hat{X}}(x|\hat{x})} \\ &= \log \frac{1}{\Pr[X \in A_m]} + \sum_{x \in A_m} \frac{p_X(x)}{\Pr[X \in A_m]} \log \frac{p_X(x)}{p_{X|\hat{X}}(x|\hat{x})} \\ &= \log \frac{1}{\Pr[X \in A_m]} + \sum_{x \in A_m} \frac{p_X(x)}{\Pr[X \in A_m]} \big( -\jmath_X(x, D) + \lambda\, d(x, \hat{x}) - \lambda D \big), \end{aligned}$$
where the last equality is from Equation (2). Note that d ( x , x ^ ) is the only term that is a function of x ^ , and λ is positive. Thus, if q ( · ) = p X | X ^ ( · | x ^ ) achieves the minimum in Equation (11), then x ^ minimizes the following:
x A m p X ( x ) Pr [ X A m ] d ( x , x ^ ) = E [ d ( X , x ^ ) | X A m ] .
Since Equation (12) coincides with Equation (10), we have
$$\tilde{g}(m) = p_{X|\hat{X}}(\cdot\,|\,g(m)). \tag{13}$$
Remark 4.
In Section 3, we directly defined g′(m) = p_{X|X̂}(·|g(m)). However, here we obtained g̃(m) via the following two-step procedure:
  • extend the reconstruction set from 𝒴 to M(𝒳), and characterize the optimal decoding functions g_E(m) ∈ M(𝒳); and
  • find the measure g̃(m) ∈ 𝒴 that is closest to g_E(m).
The above result (13) implies that g̃(m) = g′(m).
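The two-step procedure is straightforward to mimic numerically; the sketch below (ours, with illustrative names) extends to M(𝒳) and then projects back onto 𝒴 by minimizing the Kullback–Leibler divergence, assuming the rows of p_x_given_xhat are the elements of 𝒴:

import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def projected_decoder(p_x, f, p_x_given_xhat, M):
    # p_x: source pmf; f: array with f[x] in {0, ..., M-1}; each message is assumed used.
    # p_x_given_xhat: matrix whose row x_hat is p_{X|X_hat}(. | x_hat).
    g_tilde = []
    for m in range(M):
        cell = p_x * (f == m)
        g_E = cell / cell.sum()                    # step 1: optimal decoder in M(X)
        dists = [kl(g_E, row) for row in p_x_given_xhat]
        g_tilde.append(int(np.argmin(dists)))      # step 2: index x_hat of the nearest element of Y
    return g_tilde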

5. Log-Convex Relaxation

In the previous section, we obtained the optimal reconstruction symbol from the extended reconstruction alphabet and projected it back to the feasible set. In this section, instead of direct projection onto 𝒴, we propose another slight extension of 𝒴, namely, the log-convex hull. As we show in the following sections, the log-convex hull has interesting properties.

5.1. rI-Projection

Before defining the log-convex hull, we need to define the log-convex combination of probability distributions. Let p and q be probability distributions in M ( X ) . For 0 < t < 1 , the log-convex combination of p and q is given by
$$\overline{p^{t} q^{1-t}}(x) = \frac{p(x)^{t}\, q(x)^{1-t}}{\sum_{\tilde{x}} p(\tilde{x})^{t}\, q(\tilde{x})^{1-t}}.$$
It is clear that the logarithm of \overline{p^{t} q^{1-t}}(x) is a convex combination of log p(x) and log q(x), up to a normalizing constant. We can now define the log-convex hull logconv(𝒴), which is the set of log-convex combinations of the probability measures in 𝒴. More precisely,
$$\operatorname{logconv}(\mathcal{Y}) = \left\{ q^{(r)} \in \mathcal{M}(\mathcal{X}) : q^{(r)}(x) = \frac{1}{c(r)} \exp\left( \sum_{\hat{x}} r(\hat{x}) \log p_{X|\hat{X}}(x\,|\,\hat{x}) \right) \right\},$$
where r is a weight vector (i.e., r ∈ M(𝒳̂)) and c(r) is a normalizing constant. By definition, logconv(𝒴) is log-convex since it contains all log-convex combinations of probability distributions in 𝒴.
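Computationally, a log-convex combination is a normalized weighted geometric mean, conveniently evaluated in the log domain; the Python sketch below (ours) assumes log_p is a matrix whose row x̂ stores log p_{X|X̂}(·|x̂):

import numpy as np

def log_convex_combination(r, log_p):
    # q^(r)(x) proportional to exp( sum_xhat r(xhat) * log p_{X|X_hat}(x | xhat) ).
    log_q = r @ log_p                 # weighted geometric mean, unnormalized, in logs
    log_q -= log_q.max()              # subtract the max for numerical stability
    q = np.exp(log_q)
    return q / q.sum()                # divide by the normalizing constant c(r)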
Instead of projecting p_{X|f(X)}(·|m) onto 𝒴, we consider the projection onto logconv(𝒴). Since logconv(𝒴) is log-convex, ([17], [Theorem 1]) implies that there exists a unique probability distribution q_m^* ∈ logconv(𝒴) that achieves the following minimum:
$$\min_{q \in \operatorname{logconv}(\mathcal{Y})} D\!\left(p_{X|f(X)}(\cdot\,|\,m)\,\middle\|\,q\right).$$
The minimizer q_m^* is called the rI-projection of p_{X|f(X)}(·|m) onto logconv(𝒴). Let r_m^* be the corresponding weight vector, i.e.,
$$q_m^* = q^{(r_m^*)}.$$
Csiszár and Matúš ([17], [Theorem 1]) showed that the rI-projection satisfies the following inequality for all x̂ ∈ 𝒳̂:
$$D\!\left(p_{X|f(X)}(\cdot\,|\,m)\,\middle\|\,p_{X|\hat{X}}(\cdot\,|\,\hat{x})\right) \ge D\!\left(p_{X|f(X)}(\cdot\,|\,m)\,\middle\|\,q_m^*\right) + D\!\left(q_m^*\,\middle\|\,p_{X|\hat{X}}(\cdot\,|\,\hat{x})\right). \tag{15}$$
On the other hand, the log-convex combination of probability measures q^{(r)} is also called the geometric mean of the probability measures [18]. In [18], a geometric compensation identity is also provided:
$$\sum_{\hat{x}} r(\hat{x})\, D\!\left(p_{X|f(X)}(\cdot\,|\,m)\,\middle\|\,p_{X|\hat{X}}(\cdot\,|\,\hat{x})\right) = D\!\left(p_{X|f(X)}(\cdot\,|\,m)\,\middle\|\,q^{(r)}\right) + \sum_{\hat{x}} r(\hat{x})\, D\!\left(q^{(r)}\,\middle\|\,p_{X|\hat{X}}(\cdot\,|\,\hat{x})\right). \tag{16}$$
The above identity holds for any r ∈ M(𝒳̂); therefore, Equation (16) also holds when q^{(r)} = q_m^*. Together with Inequality (15), we get the following result: for all x̂ ∈ 𝒳̂, if r_m^*(x̂) ≠ 0, then
$$D\!\left(p_{X|f(X)}(\cdot\,|\,m)\,\middle\|\,p_{X|\hat{X}}(\cdot\,|\,\hat{x})\right) = D\!\left(p_{X|f(X)}(\cdot\,|\,m)\,\middle\|\,q_m^*\right) + D\!\left(q_m^*\,\middle\|\,p_{X|\hat{X}}(\cdot\,|\,\hat{x})\right).$$
Remark 5.
The above result is similar to the projection onto a polytope in Euclidean space. Suppose vectors v_1, v_2, …, v_n generate a polytope, and consider the projection from a vector w onto this polytope. Let h be the projection. Then, h is a convex combination of the v_i's; thus, there exist coefficients {a_i}_{1 ≤ i ≤ n} such that
$$h = \sum_{i=1}^{n} a_i v_i,$$
where ∑_i a_i = 1 and a_i ≥ 0 for all i. Let E = {1 ≤ i ≤ n : a_i ≠ 0} be the set of indices of nonzero coefficients. Then, the projection h lies on the plane generated by {v_i}_{i ∈ E}, and the two vectors w − h and h − v_i are orthogonal for all i ∈ E. The Pythagorean theorem then implies that, for all i, we have either a_i = 0 or
$$\|w - v_i\|^2 = \|w - h\|^2 + \|h - v_i\|^2.$$

5.2. Optimization

As we saw in the previous section, we want to find q ∈ logconv(𝒴) that minimizes D(p_{X|f(X)}(·|m) ‖ q). Note that
$$D\!\left(p_{X|f(X)}(\cdot\,|\,m)\,\middle\|\,q^{(r)}\right) = \sum_{x \in \mathcal{X}} p_{X|f(X)}(x|m) \log \frac{p_{X|f(X)}(x|m)}{q^{(r)}(x)} = \sum_{x \in \mathcal{X}} p_{X|f(X)}(x|m) \log p_{X|f(X)}(x|m) + \sum_{x \in \mathcal{X}} p_{X|f(X)}(x|m) \log \frac{1}{q^{(r)}(x)}.$$
Since the first term is not a function of q^{(r)}, it is enough to consider the second term. By the definition of q^{(r)}, we have
$$\begin{aligned} \sum_{x \in \mathcal{X}} p_{X|f(X)}(x|m) \log \frac{1}{q^{(r)}(x)} &= -\sum_{x \in \mathcal{X}} p_{X|f(X)}(x|m) \sum_{\hat{x}} r(\hat{x}) \log p_{X|\hat{X}}(x|\hat{x}) + \log c(r) \\ &= -\sum_{x \in \mathcal{X}} p_{X|f(X)}(x|m) \sum_{\hat{x}} r(\hat{x}) \log p_{X|\hat{X}}(x|\hat{x}) + \log \sum_{x} \exp\left( \sum_{\hat{x}} r(\hat{x}) \log p_{X|\hat{X}}(x|\hat{x}) \right). \end{aligned}$$
Thus, minimizing D ( p X | f ( X ) ( · | m ) q ) is equivalent to solving the following optimization problem.
$$\begin{aligned} \min_{r}\quad & -\sum_{\hat{x}} r(\hat{x}) \sum_{x} p_{X|f(X)}(x|m) \log p_{X|\hat{X}}(x|\hat{x}) + \log \sum_{x} \exp\left( \sum_{\hat{x}} r(\hat{x}) \log p_{X|\hat{X}}(x|\hat{x}) \right) \\ \text{s.t.}\quad & r(\hat{x}) \ge 0 \ \ \forall \hat{x} \in \hat{\mathcal{X}}, \qquad \sum_{\hat{x}} r(\hat{x}) = 1. \end{aligned}$$
Since the objective function is a convex function of r ( x ^ ) , the above problem is a convex optimization problem that can be efficiently solved.
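This problem can be handed to any generic convex solver. The sketch below (ours, not from the paper) uses SciPy's SLSQP with the simplex constraints made explicit, and assumes p_{X|X̂}(x|x̂) > 0 so that the logarithms are finite:

import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def ri_projection_weights(p_m, log_p):
    # p_m: the distribution p_{X|f(X)}(. | m); log_p: matrix of log p_{X|X_hat}(x | x_hat),
    # one row per reconstruction symbol x_hat. Returns the optimal weight vector r.
    s = log_p.shape[0]
    def objective(r):
        linear = -r @ (log_p @ p_m)               # first term of the objective
        return linear + logsumexp(r @ log_p)      # + log sum_x exp(...)
    constraints = ({'type': 'eq', 'fun': lambda r: np.sum(r) - 1.0},)
    res = minimize(objective, np.full(s, 1.0 / s), method='SLSQP',
                   bounds=[(0.0, 1.0)] * s, constraints=constraints)
    return res.x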

5.3. Relaxation in Clustering

In the corresponding lossy-compression problem under logarithmic loss, reconstruction symbols are probability measures that have a natural algebraic structure, as we discussed in Section 3.3.3. In this section, we present the benefits of such a property when we apply some known techniques from the clustering literature.
Lossy compression is closely related to the clustering problem [19,20,21]. Many works have focused on the application of k-means clustering to the lossy-compression problem [22,23,24], which is an extension of the Lloyd–Max algorithm [25,26]. However, k-means clustering is only available when there is a well-defined operation on 𝒳̂ (e.g., 𝒳̂ = ℝ^n), because k-means clustering requires computing the mean of the data points, which is the center of each cluster. In general lossy-compression problems, the reconstruction alphabet 𝒳̂ may not have such an operation. In such cases, we may have to apply k-medoidlike clustering [27], where the center of each cluster has to be a data point. The k-medoidlike algorithm in the context of lossy compression is shown in Algorithm 1.
Algorithm 1 k-medoidlike clustering in lossy compression.
 Randomly initialize x̂_1, …, x̂_M ∈ 𝒳̂
repeat
  Set A_m ← ∅ for all 1 ≤ m ≤ M
  for x ∈ 𝒳 do
    A_{m*} ← A_{m*} ∪ {x}, where m* = argmin_m d(x, x̂_m)
  end for
  for m = 1 to M do
    x̂_m ← argmin_{x̂ ∈ 𝒳̂} ∑_{x ∈ A_m} p_X(x) d(x, x̂)
  end for
until convergence
On the other hand, in the corresponding problem, the reconstruction alphabet is the set of probability distributions where operations such as log-convex combinations are well-defined. This allows us to propose a k-meanslike clustering algorithm, as shown in Algorithm 2.
Algorithm 2 k-meanslike clustering in lossy compression.
 Randomly initialize r_1, …, r_M ∈ M(𝒳̂)
repeat
  Set A_m ← ∅ for all 1 ≤ m ≤ M
  for x ∈ 𝒳 do
    A_{m*} ← A_{m*} ∪ {x}, where m* = argmin_m log(1/q^{(r_m)}(x))
    Set f(x) = m*, so that x ∈ A_{m*}
  end for
  for m = 1 to M do
    r_m ← argmin_{r ∈ M(𝒳̂)} D(p_{X|f(X)}(·|m) ‖ q^{(r)})
  end for
until convergence
The main idea of the above algorithm is that the log-convex combination q^{(r_m)} behaves like the center of cluster A_m. In the clustering literature, there are many known variations of k-means clustering [28,29]. The above result shows that we can borrow those techniques and apply them to the lossy-compression problem even without any algebraic structure on the reconstruction alphabet.
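A compact Python sketch of Algorithm 2 (ours; it condenses the helper routines from the earlier sketches, and the names are illustrative) is given below. The inputs are the source pmf and a matrix log_p whose row x̂ stores log p_{X|X̂}(·|x̂):

import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def log_center(w, log_p):
    # log q^(w)(x): normalized weighted geometric mean of the rows of exp(log_p).
    log_q = w @ log_p
    return log_q - logsumexp(log_q)

def ri_project(p_m, log_p):
    # Weight vector of the rI-projection of p_m onto logconv(Y) (the Section 5.2 program).
    s = log_p.shape[0]
    obj = lambda w: -w @ (log_p @ p_m) + logsumexp(w @ log_p)
    cons = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},)
    return minimize(obj, np.full(s, 1.0 / s), method='SLSQP',
                    bounds=[(0.0, 1.0)] * s, constraints=cons).x

def k_means_like(p_x, log_p, M, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.dirichlet(np.ones(log_p.shape[0]), size=M)   # r_1, ..., r_M
    f = np.zeros(len(p_x), dtype=int)
    for _ in range(n_iter):
        centers = np.stack([log_center(w, log_p) for w in weights])
        f = np.argmax(centers, axis=0)             # assignment: smallest log loss
        for m in range(M):                         # update: re-fit each center
            cell = p_x * (f == m)
            if cell.sum() > 0:
                weights[m] = ri_project(cell / cell.sum(), log_p)
    return f, weights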

5.4. Application to General Clustering Problems

The idea of the previous section can be applied to an actual clustering problem. We mainly focus on clustering categorical data, where data points do not lie in a continuous space [30,31,32,33,34]. Since operations such as the mean are not well-defined in this case, it is hard to apply known data-clustering algorithms designed for continuous spaces. The key idea is that the equivalence relation with logarithmic loss endows an arbitrary reconstruction set with an algebraic structure. More precisely, we can transform any clustering problem into a clustering problem in continuous space and then apply known techniques such as variations of k-means.
A more rigorous definition of the problem is given below. Assume that we have a finite set of data points 𝒳, and each data point x has weight p_X(x). We normalize the weights so that ∑_x p_X(x) = 1; the weights may or may not be uniform. The distance between two points is given by a measure d : 𝒳 × 𝒳 → [0, ∞). Suppose we want to partition the data points into M clusters.
If we let X ^ = 𝒳, then the clustering problem turns out to be a lossy-compression problem under distortion measure d ( · , · ) , where the number of messages is M. Let D = D ( M ) be the optimal achievable distortion, and p X ^ | X be the distribution that achieves rate-distortion function R ( D ) as defined in Equation (4). Then, we can find the corresponding lossy-compression problem under logarithmic loss. Finally, we can apply clustering algorithms in continuous space such as k-means to the corresponding problem. For example, Algorithm 2 can be applied to the corresponding problem.
Remark 6.
Note that it is hard to obtain an exact analytic formula for D(M) or p_{X̂|X}. However, as we mentioned in Section 3.3.2, we do not have to find an optimal scheme under the exact problem formulation. If we can provide a good scheme for the corresponding problem at a distortion level close to D(M), it should also be a good scheme for the original problem.

6. Conclusions

To conclude our discussion, we summarize our main contributions. We showed that, for any fixed-length lossy-compression problem under an arbitrary distortion measure, there exists a corresponding lossy-compression problem under logarithmic loss whose optimal schemes coincide with those of the original. We also proved that a good scheme for one of the two problems is also good for the other. This equivalence provides an algebraic structure on any reconstruction alphabet, which allows various optimization techniques, such as log-convex relaxation, to be used in lossy-compression problems. Furthermore, our results naturally suggest a k-meanslike clustering algorithm for categorical data-clustering problems.

Funding

This work was supported by the National Research Foundation of Korea, funded by the Korean Government (MSIT) under Grant NRF-2017R1C1B5018298.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Optimal Scheme Under Excess Distortion Criterion

In this section, we characterize the minimum number of codewords M(D, ϵ) that can achieve distortion D and excess-distortion probability ϵ. Let the encoder and decoder be f : 𝒳 → {1, …, M} and g : {1, …, M} → M(𝒳), where g(m) = q^{(m)} ∈ M(𝒳). Since ℓ(x, q) ≤ D is equivalent to q(x) ≥ e^{−D}, the excess-distortion probability p_e satisfies
$$1 - p_e = \sum_{x \in \mathcal{X}} p_X(x)\, \mathbf{1}\!\left\{ q^{(f(x))}(x) \ge e^{-D} \right\} = \sum_{m=1}^{M} \sum_{x \in f^{-1}(m)} p_X(x)\, \mathbf{1}\!\left\{ q^{(m)}(x) \ge e^{-D} \right\}.$$
However, at most ⌊e^D⌋ of the values q^{(m)}(x) can be at least e^{−D}, where ⌊x⌋ denotes the largest integer that is smaller than or equal to x. Thus, we can cover at most M⌊e^D⌋ of the source symbols with M codewords. Suppose p_X(1) ≥ p_X(2) ≥ ⋯ ≥ p_X(r); then, the optimal scheme is
$$f(x) = \left\lceil \frac{x}{\lfloor e^{D} \rfloor} \right\rceil, \qquad q^{(m)}(x) = \begin{cases} 1/\lfloor e^{D} \rfloor & \text{if } f(x) = m, \\ 0 & \text{otherwise}, \end{cases}$$
where q^{(m)} = g(m) and ⌈x⌉ denotes the smallest integer that is larger than or equal to x. The idea is that each reconstruction symbol q^{(m)} covers ⌊e^D⌋ source symbols by assigning probability mass 1/⌊e^D⌋ to each of them.
The above optimal scheme satisfies
$$1 - p_e = \sum_{x=1}^{M \lfloor e^{D} \rfloor} p_X(x) = F_X\!\left(M \lfloor e^{D} \rfloor\right),$$
where F X ( · ) is the cumulative distribution function of X. This implies that the minimal error probability is
$$\epsilon(M, D) = 1 - F_X\!\left(M \lfloor e^{D} \rfloor\right).$$
On the other hand, if we fix target error probability ϵ , the minimal number of codewords is
$$M(D, \epsilon) = \left\lceil \frac{F_X^{-1}(1 - \epsilon)}{\lfloor e^{D} \rfloor} \right\rceil,$$
where F_X^{-1}(y) = min{1 ≤ x ≤ r : F_X(x) ≥ y}. Note that if we allow variable-length coding without a prefix condition, the optimal coding scheme is similar to the optimal nonasymptotic lossless coding introduced in [35].
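For completeness, the closed form above is easy to evaluate; the Python sketch below (ours) assumes the source symbols are already relabeled so that p_X(1) ≥ … ≥ p_X(r):

import numpy as np

def excess_distortion_epsilon(p_sorted, M, D):
    # Minimal excess-distortion probability epsilon(M, D) under logarithmic loss:
    # M codewords, each spreading mass 1/floor(e^D) over floor(e^D) source symbols.
    k = int(np.floor(np.exp(D)))                  # symbols covered per codeword
    covered = min(M * k, len(p_sorted))
    return 1.0 - float(np.sum(p_sorted[:covered]))

p_sorted = np.array([0.4, 0.25, 0.15, 0.1, 0.06, 0.04])
print(excess_distortion_epsilon(p_sorted, M=2, D=np.log(2)))   # 0.1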

References

  1. Courtade, T.A.; Wesel, R.D. Multiterminal source coding with an entropy-based distortion measure. In Proceedings of the 2011 IEEE International Symposium on Information Theory, 2011; pp. 2040–2044.
  2. Courtade, T.; Weissman, T. Multiterminal Source Coding Under Logarithmic Loss. IEEE Trans. Inf. Theory 2014, 60, 740–761.
  3. Ugur, Y.; Aguerri, I.E.; Zaidi, A. Vector Gaussian CEO problem under logarithmic loss. In Proceedings of the 2018 IEEE Information Theory Workshop, Guangzhou, China, 25–29 November 2018; pp. 1–5.
  4. Shkel, Y.Y.; Verdú, S. A single-shot approach to lossy source coding under logarithmic loss. IEEE Trans. Inf. Theory 2018, 64, 129–147.
  5. Jiao, J.; Courtade, T.A.; Venkat, K.; Weissman, T. Justification of logarithmic loss via the benefit of side information. IEEE Trans. Inf. Theory 2015, 61, 5357–5365.
  6. Painsky, A.; Wornell, G.W. Bregman divergence bounds and the universality of the logarithmic loss. arXiv 2018, arXiv:1810.07014.
  7. No, A. Universality of Logarithmic Loss in Successive Refinement. Entropy 2019, 21, 158.
  8. Tishby, N.; Pereira, F.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 22–24 September 1999; pp. 368–377.
  9. Harremoës, P.; Tishby, N. The information bottleneck revisited or how to choose a good distortion measure. In Proceedings of the 2007 IEEE International Symposium on Information Theory, Nice, France, 24–29 June 2007; pp. 566–570.
  10. Gilad-Bachrach, R.; Navot, A.; Tishby, N. An information theoretic tradeoff between complexity and accuracy. In Learning Theory and Kernel Machines; Springer: Berlin/Heidelberg, Germany, 2003; pp. 595–609.
  11. Aguerri, I.E.; Zaidi, A. Distributed Information Bottleneck Method for Discrete and Gaussian Sources. In Proceedings of the International Zurich Seminar on Information and Communication, Zurich, Switzerland, 21–23 February 2018.
  12. Kostina, V.; Verdú, S. Nonasymptotic noisy lossy source coding. IEEE Trans. Inf. Theory 2016, 62, 6111–6123.
  13. Kostina, V.; Verdú, S. Fixed-length lossy compression in the finite blocklength regime. IEEE Trans. Inf. Theory 2012, 58, 3309–3338.
  14. Csiszár, I. On an extremum problem of information theory. Studia Scientiarum Mathematicarum Hungarica 1974, 9, 57–71.
  15. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2012.
  16. No, A.; Weissman, T. Universality of logarithmic loss in lossy compression. In Proceedings of the 2015 IEEE International Symposium on Information Theory, Hong Kong, China, 14–19 June 2015; pp. 2166–2170.
  17. Csiszár, I.; Matúš, F. Information projections revisited. IEEE Trans. Inf. Theory 2003, 49, 1474–1490.
  18. No, A. Information Geometric Approach on Most Informative Boolean Function Conjecture. Entropy 2018, 20, 688.
  19. Chaffee, D.L. Applications of Rate Distortion Theory to the Bandwidth Compression of Speech Signals. Ph.D. Thesis, University of California, Los Angeles, CA, USA, 1975.
  20. Chen, D. On two or more dimensional optimum quantizers. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Hartford, CT, USA, 9–11 May 1977; Volume 2, pp. 640–643.
  21. Gray, R.; Buzo, A.; Matsuyoma, Y.; Gray, A., Jr.; Markel, J. Source coding and speech compression. In International Telemetering Conference Proceedings; International Foundation for Telemetering: San Diego, CA, USA, 1978; Volume 14.
  22. Linde, Y.; Buzo, A.; Gray, R. An algorithm for vector quantizer design. IEEE Trans. Commun. 1980, 28, 84–95.
  23. Gray, R.M.; Neuhoff, D.L. Quantization. IEEE Trans. Inf. Theory 1998, 44, 2325–2383.
  24. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
  25. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
  26. Max, J. Quantizing for minimum distortion. IRE Trans. Inf. Theory 1960, 6, 7–12.
  27. Kaufman, L.; Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 344.
  28. Phillips, S.J. Acceleration of k-means and related clustering algorithms. In Proceedings of the Workshop on Algorithm Engineering and Experimentation, San Francisco, CA, USA, 4–5 January 2002; pp. 166–177.
  29. Pelleg, D.; Moore, A.W. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, CA, USA, 29 June–2 July 2000; Volume 1, pp. 727–734.
  30. Watve, A.; Pramanik, S.; Jung, S.; Jo, B.; Kumar, S.; Sural, S. Clustering Non-Ordered Discrete Data. J. Inf. Sci. Eng. 2014, 30, 1–23.
  31. Bai, L.; Liang, J.; Dang, C.; Cao, F. A novel attribute weighting algorithm for clustering high-dimensional categorical data. Pattern Recognit. 2011, 44, 2843–2861.
  32. Ng, R.T.; Han, J. CLARANS: A method for clustering objects for spatial data mining. IEEE Trans. Knowl. Data Eng. 2002, 14, 1003–1016.
  33. Ganti, V.; Gehrke, J.; Ramakrishnan, R. CACTUS—Clustering categorical data using summaries. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 15–18 August 1999; pp. 73–83.
  34. Kumar, S.; Sural, S.; Watve, A.; Pramanik, S. CNODE: Clustering of set-valued non-ordered discrete data. Int. J. Data Min. Model. Manag. 2009, 1, 310–334.
  35. Kontoyiannis, I.; Verdú, S. Optimal Lossless Data Compression: Non-Asymptotics and Asymptotics. IEEE Trans. Inf. Theory 2014, 60, 777–795.
