Article

Information Loss Due to the Data Reduction of Sample Data from Discrete Distributions

by
Maryam Moghimi
*,†,‡ and
Herbert W. Corley
*,‡
Center on Stochastic Modeling, Optimization, and Statistics (COSMOS), the University of Texas at Arlington, Arlington, TX 76013, USA
*
Authors to whom correspondence should be addressed.
This paper was part of the first author's doctoral dissertation of May 2020.
The two authors contributed equally to this paper.
Submission received: 17 August 2020 / Revised: 6 September 2020 / Accepted: 10 September 2020 / Published: 13 September 2020

Abstract

In this paper, we study the information lost when a real-valued statistic is used to reduce or summarize sample data from a discrete random variable with a one-dimensional parameter. We compare the probability that a random sample gives a particular data set to the probability of the statistic's value for this data set. We focus on sufficient statistics for the parameter of interest and develop a general formula, independent of the parameter, for the Shannon information lost when a data sample is reduced to such a summary statistic. We also develop a measure of entropy for this lost information that depends only on the real-valued statistic and on neither the parameter nor the data. Our approach would also work for non-sufficient statistics, but the lost information and associated entropy would involve the parameter. The method is applied to three well-known discrete distributions to illustrate its implementation.

1. Introduction

We consider the data sample x = (x_1, …, x_n) from a random sample X = (X_1, …, X_n) for a discrete random variable X with sample space S and one-dimensional parameter θ. A statistic T(X) is a function of the random sample X for any fixed but arbitrary value of a parameter θ associated with the underlying X. Thus a statistic T(X) is a random variable itself. Here we consider only real-valued statistics that reduce the data sample x to a number T(x) that might be used to summarize X, to characterize X, or perhaps to estimate θ. However, data reduction is an irreversible process [1] and always involves some information loss. For instance, if T(X) is the sample mean X̄, the original measurements x cannot be reconstructed from x̄, and some information about x is lost. Nonetheless, such data reduction is frequently used to make inferences.
More explicitly, our motivation for considering such situations is that T ( x ) is usually communicated in practice as a summary for the data x but without the actual data. The question then naturally arises: how much information is lost to someone about a data sample x when only the value of T is available for x , but not the data itself? To answer this question, we develop a theoretical framework for determining how much information is lost about a given data set x by knowing only the value of T ( x ) but neither x itself nor the parameter θ . Our information-theoretic approach to data reduction generalizes the observation in [2] that a binomial random variable loses all the information about the order of successes in the associated sequence of Bernoulli trials. In other words, for a series of n Bernoulli trials one cannot recreate the order of successes by only knowing the number that occurred.
For any real-valued statistic T and the given sample data x, we decompose the total information about X available in x into the sum of (a) the information available in the reduced data T(x) and (b) the information lost in the process of data reduction. When T is a sufficient statistic for θ, this lost information is independent of θ. Moreover, by taking the expected value of this lost information over all possible data sets, we define an associated entropy measure that depends on T but on neither x nor θ. Our approach also works for non-sufficient statistics, but the lost information and associated entropy would then involve θ. Thus θ must be estimated before computing these quantities.
The paper is organized as follows. In Section 2, we present the necessary definitions, notation, and preliminary results. In Section 3, we decompose the total information available about X in x and give various expressions for the Shannon information lost by reducing x to T ( x ) . In Section 4, we develop an entropy measure associated with this lost information. In Section 5, we present examples of our results for some standard discrete distributions and several sufficient statistics for θ . Conclusions are offered in Section 6.

2. Preliminaries

Standard definitions, notation, and results to be used in our development are now presented for completeness and accessibility. In addition, some new definitions and results are established for subsequent use. Definition 1, Result 1, Definition 2, and Definition 3 can be found in [3,4,5] and elsewhere. The notion of a sufficient statistic is first defined.
Definition 1 (Sufficient Statistic [3]).
A statistic T ( X ) is a sufficient statistic (SS) for the parameter θ if the probability
P [ X = x | T ( X ) = T ( x ) ]
is independent of θ .
Note that P instead of P θ is used in (1) since this probability is independent of θ . In addition, observe that (1) is not a joint conditional probability distribution for X since its condition changes with x . This observation is significant in Section 4. The fact that (1) does not involve θ can be used to prove the Fisher Factorization Theorem (FFT) below, which is the usual method for determining if a statistic is an SS for θ . In the FFT, we use the notation f ( x | θ ) to denote the joint probability mass function (pmf) of X evaluated at the variable x for a fixed value of θ .
Result 1 (FFT [3]).
The real-valued statistic T(X) is sufficient for θ if and only if there exist functions g: R^1 → R^1 and h: S^n → R^1 such that for any sample data x and for all values of θ the joint pmf f(x|θ) of X can be factored as
f(x|θ) = g[T(x)|θ] × h(x)
for real-valued, nonnegative functions g on R^1 and h on S^n. The function h does not depend on θ, while g does depend on x but only through T(x).
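As a quick numerical illustration of Definition 1 and Result 1 (a sketch of our own, not part of the original development; the binomial model, the sample x = (2, 0, 1), and the θ values are arbitrary choices), the following code checks that the conditional probability in (1) does not change with θ when T(X) = Σ X_i:

```python
from math import comb
from itertools import product

def binom_pmf(k, m, theta):
    """pmf of a binomial(m, theta) random variable at k."""
    return comb(m, k) * theta**k * (1 - theta)**(m - k)

def joint_pmf(x, m, theta):
    """Joint pmf of an i.i.d. binomial(m, theta) random sample at x."""
    p = 1.0
    for xi in x:
        p *= binom_pmf(xi, m, theta)
    return p

def conditional_prob(x, m, theta):
    """P_theta[X = x | T(X) = T(x)] for T(x) = sum(x), computed from the ratio (8)."""
    n, t = len(x), sum(x)
    # P_theta[T(X) = t] by summing the joint pmf over the partition set A_t
    p_t = sum(joint_pmf(y, m, theta)
              for y in product(range(m + 1), repeat=n) if sum(y) == t)
    return joint_pmf(x, m, theta) / p_t

x = (2, 0, 1)          # hypothetical sample of size n = 3 with m = 2 trials each
for theta in (0.2, 0.5, 0.7):
    print(theta, conditional_prob(x, 2, theta))
# The printed conditional probability is identical for every theta,
# consistent with T(X) = sum of the X_i being sufficient for theta.
```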
We focus on a sufficient statistic T for θ in Section 3, where we need the notion of a partition [5] defined next.
Definition 2 (Partition [3]).
Let S be the denumerable sample space of the discrete random variable X so that S^n is the denumerable sample space of the random sample X. For any statistic T: S^n → R^1, let τ_T be the denumerable set τ_T = { t | ∃ x ∈ S^n for which t = T(x) }, which is the range of T. Then T partitions the sample space S^n into the mutually exclusive and collectively exhaustive partition sets A_t = { x ∈ S^n | T(x) = t }, t ∈ τ_T.
Figure 1 illustrates Definition 2.
We also use the well-known likelihood function.
Definition 3 (Likelihood Function [3,4,5]).
Let x be sample data from a random sample X from a discrete random variable X with sample space S and real-valued parameter θ ,   and let f ( x | θ ) denote the joint pmf of the random sample X . For any sample data   x , the likelihood function of θ is defined as
L ( θ | x ) = f ( x | θ ) .
The likelihood function L ( θ | x ) in (3) is a function of the variable θ for given data x . However, the joint pmf f ( x | θ ) as a function of x for fixed θ is frequently called the likelihood function as well. In this case, we also write the joint pmf as L ( x | θ ) . We distinguish the two cases since L ( θ | x ) is not a statistic but L ( x | θ ) is one that incorporates all available information about X . Moreover, L ( x | θ ) is an SS for θ [4] and uniquely determines an associated SS called the likelihood kernel to be used in subsequent examples.
We next define a new concept called the likelihood kernel. As a function of x for fixed θ , it is shown below to be a sufficient statistic for θ and is used in Section 5 to facilitate the computation of lost information associated with other sufficient statistics T. As a function of θ for fixed x , a possibility not considered here, the likelihood kernel may be useful in applying the likelihood principle [3,4] to make inferences about θ without resorting to the notion of equivalence classes. It would be the “simplest” factor of L ( θ | x ) that can be used in a likelihood ratio comparing two values of θ .
Definition 4 (Likelihood kernel).
Let S be the sample space of X . For fixed but arbitrary θ , suppose that L ( x | θ ) can be factored as
L(x|θ) = K(x|θ) × R(x), x ∈ S^n,
where K: S^n → R^1 and R: S^n → R^1 have the following properties.
 (a) 
Every nonnumerical factor of K ( x | θ ) contains θ .
 (b) 
R ( x ) does not contain θ .
 (c) 
For x ∈ S^n, both K(x|θ) ≥ 0 and R(x) ≥ 0.
 (d) 
  K ( x | θ ) is not divisible by any positive number except 1.
Then K ( x | θ ) is defined as the likelihood kernel of L ( x | θ ) and R ( x ) as the residue of L ( x | θ ) .
Theorem 1.
The likelihood kernel K ( x | θ ) has the following properties.
 (i) 
K ( x | θ ) exists uniquely.
 (ii) 
K ( x | θ ) is an SS for θ .
 (iii) 
For any θ_1 and θ_2, the likelihood ratio L(x|θ_1) / L(x|θ_2) equals K(x|θ_1) / K(x|θ_2).
 Proof. 
To prove (i), for fixed θ we first show that the likelihood kernel K(x|θ) of Definition 4 exists by construction. Since the formula for L(θ|x) = f(x|θ) must explicitly contain θ, the parameter θ cannot appear only in the range of x. Hence L(x|θ) as a function of x can be factored into K(x|θ) × R(x), satisfying (a) and (b) of Definition 4, where K(x|θ) ≥ 0, ∀ x ∈ S^n, and the numerical factor of K(x|θ) is either +1 or −1. Then R(x) ≥ 0, ∀ x ∈ S^n, since K(x|θ) ≥ 0, ∀ x ∈ S^n, and K(x|θ) × R(x) = f(x|θ) ≥ 0. Thus (c) is satisfied. Finally, the only positive integer that evenly divides +1 or −1 is 1, so (d) holds. It follows that the likelihood kernel K(x|θ) and its associated R(x) in Definition 4 are well defined and exist.
We next show that K(x|θ) as constructed above is unique. Let K_1(x|θ) with residue R_1(x) and K_2(x|θ) with residue R_2(x) both satisfy Definition 4. Thus for j = 1, 2, R_j(x) does not contain θ, while every nonnumerical factor of K_j(x|θ) does contain θ. It follows that K_1(x|θ) ≠ 0 and K_2(x|θ) ≠ 0 must be identical or else be a positive multiple of one another. Assume that K_2(x|θ) = λ K_1(x|θ) for some λ > 0. If λ ≠ 1, K_2(x|θ) is divisible by a positive number other than 1 to contradict (d). Thus K(x|θ) is unique.
To prove (ii), we show that this unique K(x|θ) is an SS for θ. For L(θ|x) = f(x|θ), let g[z] = z and h(x) = R(x) in (2). Then, L(θ|x) = f(x|θ) = g[K(x|θ)] × h(x) = K(x|θ) × R(x). Thus K(x|θ) is an SS by the FFT of Result 1. Finally, (iii) follows from Definition 4 and the fact that the joint pmf L(x|θ_2) ≠ 0 for x ∈ S^n. □
We next discuss the notion of information to be used. Actually, probability itself is a measure of information in the sense that it captures the surprise level of an event. An observer obtains more information, i.e., surprise, if an unlikely event occurs than if a likely one does. Instead of probability, however, we use the additive measure known as Shannon information [6,7], defined as follows.
Definition 5 (Shannon Information [6,7]).
Let x be sample data for the random sample X from the discrete random variable X with a one-dimensional parameter θ , and let f ( x | θ ) be the joint pmf of X at x . The Shannon information obtained from the sample data x is defined as
I(x|θ) = −log f(x|θ),
where the units of I(x|θ) are bits if the base of the logarithm is 2, which is the base used here.
Other definitions for information have been proposed. For example, Vigo [8,9] has defined a measure of representational information. Further details on different types of information can be found in [10,11,12,13,14,15,16]. For Shannon information, we use its expected value over x S n .
Definition 6 (Entropy [17,18,19,20]).
Under the conditions of Definition 5, the Shannon entropy H ( X | θ ) is defined as the expected value of I ( X | θ ) ; i.e.,
H(X|θ) = Σ_{x ∈ S^n} f(x|θ) I(x|θ).
The general properties of Shannon entropy are given in [17,18,19,20], for example. Since entropy is the expected information over all possible random samples, it can be argued that entropy is a better measure of the available information about X than the Shannon information for a single data set x would be, since a single data set might not be typical [18]. We next give a method to obtain the information loss about X that occurs when a data set x is reduced to T(x). In our approach, we focus on a sufficient statistic T so that there will be no θ in (5) for the lost information below.
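As a small worked illustration of Definitions 5 and 6 (a sketch of our own; the Bernoulli model, the sample size n = 4, and θ = 0.3 are hypothetical choices), the code below computes the Shannon information of one data sample in bits and the entropy of the random sample by summing over S^n:

```python
from itertools import product
from math import log2

theta = 0.3          # hypothetical Bernoulli success probability
n = 4                # sample size

def f(x, theta):
    """Joint pmf of n i.i.d. Bernoulli(theta) observations."""
    s = sum(x)
    return theta**s * (1 - theta)**(len(x) - s)

def info(x, theta):
    """Shannon information I(x|theta) = -log2 f(x|theta) in bits (Definition 5)."""
    return -log2(f(x, theta))

x = (1, 0, 0, 1)
print("I(x|theta) =", info(x, theta), "bits")

# Entropy H(X|theta) = sum over all samples of f(x|theta) * I(x|theta) (Definition 6)
H = sum(f(y, theta) * info(y, theta) for y in product((0, 1), repeat=n))
print("H(X|theta) =", H, "bits")   # equals n times the entropy of one Bernoulli trial
```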

3. Information Decomposition under Data Reduction by a Real-Valued Statistic

We now develop a procedure to determine how much information about X contained in a data set x is lost when the data is reduced to T ( x ) by the sufficient statistic T . Consider the joint conditional probability
P θ [ X = x   |   T ( X ) = T ( x ) ] ,
which is identified with the probabilistic information lost about the event X = x by the data reduction of x to T ( x ) . The notation P θ refers to the fact that the discrete probability (7), in general, involves the parameter θ . We next express (7) using the definition of conditional probability to obtain the basis of our development. Result 2 is given in ([3], p. 273) and proven below to illustrate the reasoning.
Result 2 [3].
Let x be sample data for a random sample X from a discrete random variable X with sample space S and real-valued parameter θ , and let T ( X ) be any real-valued statistic. Then
P_θ[X = x | T(X) = T(x)] = P_θ[X = x] / P_θ[T(X) = T(x)].
 Proof. 
Using the definition of conditional probability, rewrite (7) as
P_θ[X = x ; T(X) = T(x)] / P_θ[T(X) = T(x)].
However, T ( X ) = T ( x ) whenever X = x , so (8) follows. □
Observe that if T is an SS for θ , the left side of (8) is independent of θ by the FFT and hence so is the right. Taking the negative logarithm of (8) and rearranging terms gives
−log P_θ[X = x] = −log P_θ[T(X) = T(x)] − log P_θ[X = x | T(X) = T(x)].
From (8), note that P_θ[X = x | T(X) = T(x)] ≥ P_θ[X = x] since P_θ[T(X) = T(x)] ≤ 1, so −log P_θ[X = x | T(X) = T(x)] ≤ −log P_θ[X = x]. Similarly, −log P_θ[T(X) = T(x)] ≤ −log P_θ[X = x]. These facts suggest that the left side of (10) is the total Shannon information in bits about X contained in the sample data x. On the right side of (10), the term −log P_θ[T(X) = T(x)] is considered the information about X contained in the reduced data summary T(x), and the term −log P_θ[X = x | T(X) = T(x)] is identified as the information about X that has been lost as the result of the data reduction by T(x).
In particular, this lost information represents a combinatorial loss in the sense that multiple x's may give the same value T(x) = t, as depicted in Figure 1 above. In other words, the lost information −log P_θ[X = x | T(X) = T(x)] is a measure of the knowledge unavailable about the data sample x when only the reduced data summary T(x) is known but not x itself. For a sufficient statistic T(X) for θ, this lost information is independent of θ. It is a characteristic of T(X) for the given data sample x.
In terms of Figure 1, (10) may be described as follows. On the left of the figure is the sample space S^n ⊆ R^n over which probabilities on X are computed. On the right is the range τ_T ⊆ R^1 of T over which the probabilities of T(X) are computed. T reduces the data sample x into T(x), where multiple x's may give the same T(x) = t. In Figure 1, the distinct data samples x_1, x_2, and x_3 are all reduced into the same value t_1. However, knowing that T(x) = t_1 for some data sample x does not provide sufficient information to know unequivocally, for example, that x = x_1. Information is lost in the reduction. One can also say that the total information −log P_θ[X = x] obtained from the left side of Figure 1 is reduced to −log P_θ[T(X) = T(x)] obtained from the right. The reduction in information from the left to the right side is precisely the lost information −log P_θ[X = x | T(X) = T(x)] of (10). For fixed t, it is lost due to the ambiguity as to which data sample on the left actually gave t. There is no such ambiguity when T is one-to-one.
The general decomposition of information in (10) is next summarized in Definition 7, where T does not need to be sufficient for θ .
Definition 7 ( I t o t a l ,   I   r e d u c e d ,   I l o s t ).
Let x be sample data for a random sample X from a discrete random variable X with sample space S and real-valued parameter θ . For any real-valued statistic T ( X ) ,   the Shannon information about X obtained from the sample data x can be decomposed as
I total ( x | θ ) = I reduced ( x | θ , T ) + I lost ( x | θ , T ) ,
where
I_total(x|θ) = −log P_θ[X = x]
I_reduced(x|θ, T) = −log P_θ[T(X) = T(x)]
and
I_lost(x|θ, T) = −log P_θ[X = x | T(X) = T(x)].
Definition 7 generalizes the information decomposition of [2] for a data sample x of size n from a Bernoulli random variable X. In terms of this paper, the parameter θ in [2] is the probability 0.5 of success on a single Bernoulli trial, and T(x) = Σ_{i=1}^n x_i. It should be noted that the notation I_comp in [2], which refers to compressed information, corresponds to I_reduced in Equation (13). We use the term "data reduction" as described in [3] as opposed to "data compression" to prevent misinterpretation. In computer science, data compression refers to encoding information using fewer bits than the original representation and is often lossless.
Both Result 2 and Definition 7 are valid for any real-valued statistic for X . The notation I total ( x | θ ) indicates that I total is a function of the sample data x for a fixed but arbitrary parameter value θ . Similarly, both I reduced ( x | θ , T ) and I lost ( x | θ , T ) are functions of x for fixed θ and T . However, in this paper we focus on sufficient statistics, which provide a simpler expression for I lost ( x | θ , T ) that does not involve θ . For a sufficient statistic T for θ , we use the notation I lost ( x | T ) for the lost information, though I total ( x | θ ) and I reduced ( x | θ , T ) still require θ . The next result is an application of the FFT of Result 1.
Theorem 2 (Lost Information for an SS).
Let x be sample data for a random sample X from a discrete random variable X with sample space S and real-valued parameter θ. Let T be an SS for θ, let f(x|θ) be the joint pmf of X, and write f(x|θ) = g[T(x)|θ] × h(x) as in Result 1. Then for all x ∈ S^n
I_lost(x|T) = −log [ h(x) / Σ_{y ∈ A_{T(x)}} h(y) ],
where A T ( x ) is defined in Definition 2 for t = T ( x ) .
 Proof. 
Let x ∈ S^n. Then f(x|θ) > 0 since x is a realization of X. Because T is an SS, we write (7) without θ. It now suffices to establish that
P[X = x | T(X) = T(x)] = h(x) / Σ_{y ∈ A_{T(x)}} h(y),
from which (15) immediately follows. Rewrite (8) as
P[X = x | T(X) = T(x)] = P_θ[X = x] / P_θ[T(X) = T(x)] = f(x|θ) / Σ_{y ∈ A_{T(x)}} f(y|θ),
so from (17) and (2), then
P[X = x | T(X) = T(x)] = ( g[T(x)|θ] × h(x) ) / Σ_{y ∈ A_{T(x)}} g[T(y)|θ] × h(y).
However, T(y) = T(x), ∀ y ∈ A_{T(x)} in (18), so
P[X = x | T(X) = T(x)] = ( g[T(x)|θ] × h(x) ) / ( g[T(x)|θ] × Σ_{y ∈ A_{T(x)}} h(y) ), ∀ x ∈ S^n.
Since f(x|θ) > 0 and hence g[T(x)|θ] ≠ 0, this term can be canceled on the right side of (19) to yield (16). Taking the negative logarithm of (16) completes the proof. □
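When S^n is small enough to enumerate, Theorem 2 can be applied directly. The sketch below (our own illustration; the binomial model with m = 2, n = 3 and θ = 0.3 is a hypothetical choice anticipating Example 2) evaluates I_lost(x|T) from (16) for the sufficient statistic T(x) = Σ x_i, for which h(x) = Π (m choose x_i), and checks the decomposition (11) numerically.

```python
from itertools import product
from math import comb, log2, isclose

m, n, theta = 2, 3, 0.3           # hypothetical binomial(m, theta) model, sample size n
S_n = list(product(range(m + 1), repeat=n))

def h(x):                          # the theta-free factor in the FFT factorization (2)
    c = 1
    for xi in x:
        c *= comb(m, xi)
    return c

def f(x):                          # joint pmf f(x|theta)
    s = sum(x)
    return h(x) * theta**s * (1 - theta)**(m * n - s)

T = sum                            # sufficient statistic T(x) = sum of the x_i

def I_lost(x):
    """Lost information from (16): -log2 of h(x) / sum of h over the partition set."""
    A = [y for y in S_n if T(y) == T(x)]
    return -log2(h(x) / sum(h(y) for y in A))

def I_total(x):
    return -log2(f(x))

def I_reduced(x):
    p_t = sum(f(y) for y in S_n if T(y) == T(x))
    return -log2(p_t)

x = (2, 0, 1)                      # hypothetical data sample
print("I_total   =", I_total(x))
print("I_reduced =", I_reduced(x))
print("I_lost    =", I_lost(x))    # equals log2(10) here since P[X=x|T=3] = 2/20
assert isclose(I_total(x), I_reduced(x) + I_lost(x))
```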
Now consider Theorem 2 when each A_t is a singleton in (16), i.e., when T is a one-to-one function. In this extreme case, P[X = x | T(X) = T(x)] = 1 since Σ_{y ∈ A_{T(x)}} h(y) = h(x) in the denominator of the right side of (16). Thus I_lost(x|T) = 0, from which I_reduced(x|θ, T) = I_total(x|θ) for all x in S^n. Thus, the special case of a one-to-one T justifies the identification of the lost information as I_lost(x|θ, T) = −log P_θ[X = x | T(X) = T(x)]. In other words, for all data samples x, y ∈ S^n, if x = y whenever T(x) = T(y), then P[X = x | T(X) = T(x)] is not diminished by the reduction of the singleton A_{T(x)} to the number T(x).
More generally, it is also true that I_lost(x|θ, T) = 0 when T is one-to-one but not sufficient for θ. In this case, write P_θ[X = x | T(X) = T(x)] = P_θ[X = x] / P_θ[T(X) = T(x)] = f(x|θ) / Σ_{y ∈ A_{T(x)}} f(y|θ). However, since T is one-to-one, then Σ_{y ∈ A_{T(x)}} f(y|θ) = f(x|θ), P_θ[X = x | T(X) = T(x)] = 1, and again I_lost(x|θ, T) = 0.
Now consider the other extreme case where T(x) = c is constant on S^n. Thus P_θ[X = x | T(X) = c] = P_θ[X = x] / P_θ[T(X) = c]. However, P_θ[T(X) = c] = 1, so P_θ[X = x | T(X) = c] = P_θ[X = x] and I_lost(x|θ, T) = I_total(x|θ) on S^n. In this case, I_reduced(x|θ, T) = 0 because the event T(x) = c gives no information about x.
We also note that I_lost(x|T) could be used as a metric to compare sufficient statistics for a given data sample x. For example, T_1 could be regarded as better than T_2 for x if I_lost(x|T_1) < I_lost(x|T_2). However, this comparison would be limited to the given x. In Section 4, we propose but do not explore a metric based on entropy that is independent of a particular data sample. We next show that (16) can be simplified when T is the likelihood function.
Corollary 1 (Information Loss for Likelihood Function).
Under the assumptions of Theorem 2, if   T ( x ) = L ( x | θ ) , then
I_lost(x|L) = −log ( 1 / |A_{L(x|θ)}| ),
where | A L ( x | θ ) | is the cardinality of the partition set A t for t = L ( x | θ ) .
 Proof. 
For T(x) = L(x|θ) = f(x|θ) in (2), let g be the identity function and h(x) = 1. Then substituting h(x) = 1 into (16) gives the denominator Σ_{y ∈ A_{L(x|θ)}} 1 = |A_{L(x|θ)}| to yield (20). □
We next state a reproductive property of a statistic T′ that is a one-to-one function of a sufficient statistic T for θ.
Theorem 3.
If there is a one-to-one function between a sufficient statistic T for θ and an arbitrary real-valued statistic T′ on S^n, then the following hold.
 (i) 
T′ is also an SS.
 (ii) 
T′ and T partition the sample space S^n into the same partition sets.
 (iii) 
I_lost(x|T′) = I_lost(x|T), ∀ x ∈ S^n.
 Proof. 
To prove (i), let u be a real-valued one-to-one function of T′ such that
T(x) = u[T′(x)].
Since T is an SS, by Equation (2) there are real-valued functions g on R^1 and h on S^n for which
f(x|θ) = g[T(x)|θ] × h(x).
By substituting T(x) from (21) into (22), we get
f(x|θ) = g( u[T′(x)] | θ ) × h(x),
which can be rewritten as
f(x|θ) = (g∘u)[T′(x)|θ] × h(x).
Since T′ in (24) satisfies the condition of Result 1 for g′ = g∘u, T′ is an SS.
To prove (ii), we use Definition 2. Let T partition the sample space S^n into the mutually exclusive and collectively exhaustive sets A_t = { x | T(x) = t }, t ∈ τ_T. By Equation (21) we can also write A_t as
A_t = { x | u[T′(x)] = t }, t ∈ τ_T.
Since u is a one-to-one function, it has an inverse u^{−1}. Letting t′ = u^{−1}(t), we apply u^{−1} inside the right side of (25) and get
A_t = { x | T′(x) = t′ }, t′ ∈ u^{−1}(τ_T).
However, u^{−1}(τ_T) = τ_{T′} and the cardinalities satisfy |τ_T| = |τ_{T′}|, so the right side of (26) is the partition set A′_{t′} of T′ and
A_t = A′_{t′}.
Finally, to get (iii) we use Theorem 2 to calculate the information lost for the two statistics T and T′. Since h(x) is the same in (22) and (24) and since Equation (27) holds, we sum h(x) over the same sets in the denominator of Equation (16) for both T and T′ to give
I_lost(x|T′) = I_lost(x|T)
and complete the proof. □
We next compare the information loss of the sufficient statistic L ( x | θ ) to other sufficient statistics. For the sufficient statistic K ( x | θ ) , a lemma is needed.
Lemma 1.
Let x be any data sample for a random sample X from the discrete random variable X with real-valued parameter θ. Then K(x|θ) is a function of L(x|θ), and |τ_L| ≥ |τ_K|.
 Proof. 
From ([3], p. 280), K(x|θ) is a function of L(x|θ) if and only if K(x|θ) = K(y|θ) whenever L(x|θ) = L(y|θ). For all data samples x and y, we prove that if L(x|θ) = L(y|θ), then K(x|θ) = K(y|θ). Thus suppose that L(x|θ) = L(y|θ). By Definition 4, we can decompose L(x|θ) and L(y|θ) into K(x|θ)R(x) and K(y|θ)R(y), respectively. Note that K(y|θ) ≠ 0. Otherwise, L(y|θ) = 0 in contradiction to y being sample data with a nonzero probability of occurring. Now write
K(x|θ) / K(y|θ) = R(y) / R(x).
Suppose that K(x|θ) ≠ K(y|θ) so that K(x|θ)/K(y|θ) = R(y)/R(x) ≠ 1 in (29). From Definition 4, every nonnumerical factor of K(x|θ) and K(y|θ) contains θ. Moreover, neither K(x|θ) nor K(y|θ) is divisible by any positive number except the number 1. Hence, since R(y)/R(x) does not contain θ, the nonnumerical factors of K(x|θ) and K(y|θ) must cancel in (29) and the remaining numerical factors could not be identical. Thus at least one of these factors would be divisible by a positive number other than 1, in contradiction to Definition 4. It now follows that K(x|θ) = K(y|θ), so K(x|θ) is some function u of L(x|θ). Finally, |τ_L| ≥ |τ_K| since this function u is surjective from τ_L onto its image u(τ_L) = τ_K. □
Lemma 2.
Under the conditions of Lemma 1, the sufficient statistics L and K satisfy
I_reduced(x|θ, L) ≥ I_reduced(x|θ, K), ∀ x ∈ S^n.
 Proof. 
Let x ∈ S^n and suppose that y ∈ A_{L(x)}. Then L(y|θ) = L(x|θ), so it follows from Lemma 1 that K(y|θ) = K(x|θ) and thus y ∈ A_{K(x)}. Hence A_{L(x)} ⊆ A_{K(x)}, and so
P_θ[L(X|θ) = L(x|θ)] = Σ_{y ∈ A_{L(x)}} f(y|θ) ≤ Σ_{y ∈ A_{K(x)}} f(y|θ) = P_θ[K(X|θ) = K(x|θ)], ∀ x ∈ S^n.
Taking the negative log of both sides of the inequality in (31) and using (13) gives (30). □
Theorem 4.
Let x be sample data for a random sample X from a discrete random variable X with the real-valued parameter θ. Then for all x ∈ S^n,
I_lost(x|L) ≤ I_lost(x|K).
 Proof. 
Let x ∈ S^n. Note that I_total(x|θ) in (12) does not depend on the arbitrary sufficient statistic T of (11). Hence
I total ( x | θ ) = I reduced ( x | θ , L ) + I lost ( x | L ) = I reduced ( x | θ , K ) + I lost ( x | K ) .
Then (32) follows immediately from (30) and (33). □
As a consequence of Theorem 3, Theorem 4 has an immediate corollary.
Corollary 2.
Under the conditions of Theorem 4, let T be a sufficient statistic for θ for which there is a one-to-one function between T and K. Then for all x ∈ S^n,
I_lost(x|L) ≤ I_lost(x|T).
The question remains open as to whether (34) holds for all sufficient statistics T for θ. Regardless, the proofs of Lemma 2 and Theorem 4 illustrate the fact that the relation between the lost information for two statistics T and T′ is determined by the relation between their partition sets A_t = { x | T(x) = t } and B_{t′} = { x | T′(x) = t′ }. For example, if for every A_t there exists a B_{t′} for which A_t ⊆ B_{t′}, then the partition of S^n by the B_{t′} of T′ is said to be coarser than the partition by the A_t of T. In that case, I_lost(x|θ, T) ≤ I_lost(x|θ, T′) because each x ∈ S^n has at least as many y ∈ S^n with T′(y) = T′(x) as there are with T(y) = T(x). In other words, T′(y) = t′ is at least as ambiguous as T(y) = t in determining the data sample giving the value of the respective statistics.
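The partition comparison above can be made concrete by brute force. In the sketch below (our own illustration, using the binomial toy model of Example 2 with m = 2, n = 3; the fixed value θ = 0.3 is arbitrary), the partition set of the likelihood function is contained in that of T(x) = Σ x_i, so the likelihood loses less information, as in Theorem 4.

```python
from itertools import product
from math import comb, log2

m, n, theta = 2, 3, 0.3
S_n = list(product(range(m + 1), repeat=n))

def f(x):                                  # joint binomial pmf, also L(x|theta)
    s = sum(x)
    c = 1
    for xi in x:
        c *= comb(m, xi)
    return c * theta**s * (1 - theta)**(m * n - s)

def partition_set(x, stat):
    return [y for y in S_n if abs(stat(y) - stat(x)) < 1e-12]

x = (2, 0, 0)
A_T = partition_set(x, sum)                # partition set of T(x) = sum x_i
A_L = partition_set(x, f)                  # partition set of the likelihood L(x|theta)

print("A_T:", A_T)                         # the 6 samples with sum 2
print("A_L:", A_L)                         # only the 3 permutations of (2,0,0)
print(set(A_L) <= set(A_T))                # True: the likelihood partition is finer

# Lost information for each statistic: within a partition set the conditional
# probability of x is f(x) divided by the total probability of the set.
for name, A in (("T", A_T), ("L", A_L)):
    p = f(x) / sum(f(y) for y in A)
    print("I_lost(x|%s) = %.4f bits" % (name, -log2(p)))
```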

4. Entropic Loss for an SS

For a sufficient statistic T for θ , we now propose an entropy measure to characterize T by the expected lost information incurred by the reduction of X to T ( X ) . This expectation is taken over all possible data sets x . This nonstandard entropy measure is called entropic loss, and it depends on neither a particular data set x nor the value of θ . Before defining this measure, we need to determine the appropriate pmf to use in taking an expectation. The following results are used.
Result 3.
Under the assumptions of Theorem 2, for any data sample let t = T ( x ) and consider the partition set A t . Then
Σ_{x ∈ A_t} P[X = x | T(X) = t] = 1.
 Proof. 
Summing (16) over x ∈ A_t yields
Σ_{x ∈ A_t} P[X = x | T(X) = t] = Σ_{x ∈ A_t} h(x) / Σ_{y ∈ A_t} h(y) = 1
to give (35). □
Result 4.
Under the assumptions of Theorem 2, the sum
Σ_{x ∈ S^n} P[X = x | T(X) = T(x)] = |τ_T|.
 Proof. 
We perform the sum on the left of (37) by first summing over x ∈ A_t for fixed t and then summing over each t ∈ τ_T to give
Σ_{x ∈ S^n} P[X = x | T(X) = T(x)] = Σ_{t ∈ τ_T} Σ_{x ∈ A_t} P[X = x | T(X) = t].
The inner series on the right side of (38) sums to one by Result 3. Hence, the outer sum yields |τ_T| for τ_T = { t | ∃ x ∈ S^n for which t = T(x) }. □
From (37), it follows that the left side of (37) is not a probability distribution on S n unless | τ T | = 1 . Moreover, P [ X = x   |   T ( X ) = T ( x ) ] is not a conditional probability distribution even if | τ T | = 1 since the condition T ( X ) = T ( x ) varies with x . However, we use Result 4 to normalize P [ X = x   |   T ( X ) = T ( x ) ] and obtain the appropriate pmf for calculating the expectation of I lost ( X | T ) .
Definition 8 (Entropic Loss).
Under the assumptions of Theorem 2, the entropic loss resulting from the data reduction by T is defined as
H_lost(X, T) = −(1/|τ_T|) Σ_{x ∈ S^n} P[X = x | T(X) = T(x)] log P[X = x | T(X) = T(x)],
which from (15) and (16) can be rewritten as
H_lost(X, T) = −(1/|τ_T|) Σ_{x ∈ S^n} [ h(x) / Σ_{y ∈ A_{T(x)}} h(y) ] log [ h(x) / Σ_{y ∈ A_{T(x)}} h(y) ].
Observe that (39) and (40) are independent of both x and θ. Indeed, for a given underlying random variable X and sample size n, H_lost(X, T) is a function only of T. Thus, H_lost(X, T) could be used as a metric to compare sufficient statistics independent of the data sample. In particular, T_1 could be regarded as better than T_2 if H_lost(X, T_1) < H_lost(X, T_2), i.e., if the expected information loss associated with T_1 is less than that for T_2. Moreover, for a given underlying random variable X and sample size n, Definition 8 could be extended to non-sufficient statistics. In that case, the entropic loss H_lost(X, T, θ) would be a function of both T and θ. For a fixed θ, a non-sufficient statistic T_1 could again be considered better than a non-sufficient T_2 if H_lost(X, T_1, θ) < H_lost(X, T_2, θ). Furthermore, for a given statistic T, the numerical value θ_1 could be considered a better numerical point estimate for θ than the value θ_2 if H_lost(X, T, θ_1) < H_lost(X, T, θ_2). Similarly, H_lost(X, T, θ) could be minimized over θ to give a best numerical point estimate for θ based on the entropic loss criterion. However, we do not pursue these possibilities here. We next compute H_lost(X, T) for the sufficient statistic T(X) = L(X|θ).
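When S^n can be enumerated, the entropic loss (39) is straightforward to compute. A minimal sketch of our own (it uses the binomial toy model of Example 2 below with m = 2, n = 3; the fixed θ = 0.3 only determines the partitions, and any θ other than 1/2 gives the same result):

```python
from itertools import product
from math import comb, log2

m, n, theta = 2, 3, 0.3
S_n = list(product(range(m + 1), repeat=n))

def f(x):                                   # joint binomial pmf, also L(x|theta)
    s = sum(x)
    c = 1
    for xi in x:
        c *= comb(m, xi)
    return c * theta**s * (1 - theta)**(m * n - s)

def entropic_loss(stat):
    """H_lost(X, T) of (39): average over the range of T of the entropy of
    P[X = x | T(X) = t] within each partition set A_t."""
    values = sorted({round(stat(x), 12) for x in S_n})   # the range tau_T
    total = 0.0
    for t in values:
        A = [x for x in S_n if round(stat(x), 12) == t]
        p_A = sum(f(x) for x in A)
        total += -sum((f(x) / p_A) * log2(f(x) / p_A) for x in A)
    return total / len(values)

print("H_lost(X, T1) =", entropic_loss(sum))   # T1 = sum of the x_i
print("H_lost(X, L ) =", entropic_loss(f))     # T3 = likelihood function
```

The two printed values match the H_lost(X, T_1) and H_lost(X, T_3) entries reported in Table 3 below.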
Theorem 5 (Entropic Loss for Likelihood Function).
Under the assumptions of Theorem 2, the entropic loss resulting from the data reduction by T ( x ) = L ( x | θ ) is
H_lost(X, L) = −(1/|τ_L|) Σ_{t ∈ τ_L} log ( 1/|A_t| ).
 Proof. 
From (20), write
H_lost(X, L) = −(1/|τ_L|) Σ_{x ∈ S^n} ( 1/|A_{L(x)}| ) log ( 1/|A_{L(x)}| ).
We decompose the sum over x ∈ S^n in (42) into consecutive sums over x ∈ A_t and then t ∈ τ_L to get
H_lost(X, L) = −(1/|τ_L|) Σ_{t ∈ τ_L} Σ_{x ∈ A_t} ( 1/|A_t| ) log ( 1/|A_t| ) = −(1/|τ_L|) Σ_{t ∈ τ_L} ( |A_t| / |A_t| ) log ( 1/|A_t| ).
Equation (41) now follows from (43). □
Result 5.
Suppose there is a one-to-one function between two sufficient statistics T and T′ for θ. Then
H_lost(X, T′) = H_lost(X, T).
 Proof. 
For all x ∈ S^n, I_lost(x|T′) = I_lost(x|T) from Theorem 3, so
−log [ h(x) / Σ_{y ∈ A′_{T′(x)}} h(y) ] = −log [ h(x) / Σ_{y ∈ A_{T(x)}} h(y) ],
from which
h(x) / Σ_{y ∈ A′_{T′(x)}} h(y) = h(x) / Σ_{y ∈ A_{T(x)}} h(y).
Thus from (45) and (46),
[ h(x) / Σ_{y ∈ A′_{T′(x)}} h(y) ] log [ h(x) / Σ_{y ∈ A′_{T′(x)}} h(y) ] = [ h(x) / Σ_{y ∈ A_{T(x)}} h(y) ] log [ h(x) / Σ_{y ∈ A_{T(x)}} h(y) ].
Now summing (47) over x ∈ S^n yields
Σ_{x ∈ S^n} [ h(x) / Σ_{y ∈ A′_{T′(x)}} h(y) ] log [ h(x) / Σ_{y ∈ A′_{T′(x)}} h(y) ] = Σ_{x ∈ S^n} [ h(x) / Σ_{y ∈ A_{T(x)}} h(y) ] log [ h(x) / Σ_{y ∈ A_{T(x)}} h(y) ].
However, from Theorem 3, |τ_{T′}| = |τ_T|. Thus dividing the left side of (48) by |τ_{T′}| and the right side by |τ_T| yields (44). □

5. Examples and Computational Issues

In this section, we present examples involving the discrete Poisson, binomial, and geometric distributions [21]. For each distribution, three sufficient statistics for a parameter θ are analyzed. For each such T , the   quantity   I lost ( x | T ) does not involve θ . However, calculating I lost ( x | T ) can still present computational issues, some of which are discussed below. Our examples are simple in order to focus on the definitions and results of Section 3 and Section 4.
Example 1 (Poisson Distribution).
Consider the random sample X = (X_1, …, X_n) with the data sample x = (x_1, …, x_n) from a Poisson random variable X. We consider three sufficient statistics for the parameter θ > 0. These sufficient statistics are T_1(X) = Σ_{i=1}^n X_i, the likelihood kernel T_2(X) = K(X|θ) for fixed but arbitrary θ, and the likelihood function T_3(X) = L(X|θ) for fixed but arbitrary θ. We use T_1(X) as a surrogate for the sample mean T′_1(X) = Σ_{i=1}^n X_i / n. Neither T_1(X) nor T′_1(X) involves θ, and either can thus be used to characterize X or to estimate θ. Moreover, there is an obvious one-to-one function relating Σ_{i=1}^n X_i / n and Σ_{i=1}^n X_i, so Theorem 3 and Result 5 establish that I_lost(x|T′_1) = I_lost(x|T_1) and H_lost(X, T′_1) = H_lost(X, T_1), respectively. We analyze T_1(X) because it also has a Poisson distribution, whereas T′_1(X) does not, since Σ_{i=1}^n X_i / n is not necessarily a nonnegative integer. In contrast to T_1(X), both T_2(X) and T_3(X) contain θ and can only be used to characterize X. For each of these three sufficient statistics, we develop an expression for I_lost(x|T) and describe how to obtain a numerical value. We then illustrate previous results with a realistic Poisson data sample. We present further computational results in Table 1.
Case 1: Let T_1(X) = Σ_{i=1}^n X_i. Observe that T_1(X) is a sufficient statistic for θ from Result 1 since f(x|θ) = P_θ[X = x] = θ^{Σ_{i=1}^n x_i} e^{−nθ} / Π_{i=1}^n x_i! can be factored in (2) into the functions g[T_1(x)|θ] = θ^{Σ_{i=1}^n x_i} e^{−nθ} and h(x) = 1 / Π_{i=1}^n x_i!. Next recall that the statistic Σ_{i=1}^n X_i has a Poisson distribution with parameter nθ [21]. Thus, P_θ[Σ_{i=1}^n X_i = Σ_{i=1}^n x_i] = (nθ)^{Σ_{i=1}^n x_i} e^{−nθ} / (Σ_{i=1}^n x_i)!, and so (8) becomes
P[X = x | Σ_{i=1}^n X_i = Σ_{i=1}^n x_i] = ( 1 / n^{Σ_{i=1}^n x_i} ) ( Σ_{i=1}^n x_i choose x_1, …, x_n ),
where the multinomial coefficient ( Σ_{i=1}^n x_i choose x_1, …, x_n ) = (Σ_{i=1}^n x_i)! / Π_{i=1}^n x_i!. It follows from (49) and (10) that
I_lost(x|T_1) = −log ( Σ_{i=1}^n x_i choose x_1, …, x_n ) + (log n) Σ_{i=1}^n x_i,
which is also I_lost(x|T′_1), as noted above.
For a data sample (x_1, …, x_n), the evaluation of I_lost(x|T_1) in (50) involves computing factorials [22]. For realistic data, the principal limitation to calculating them by direct multiplication is their magnitude. See [23] for a discussion. However, (50) can be approximated using either the well-known Stirling formula or the more accurate Ramanujan approximation [24]. The online multinomial coefficient calculator [25] can evaluate multinomial coefficients when all the x_i as well as n are less than approximately 50, provided that any x_i = 0 is removed from ( Σ_{i=1}^n x_i choose x_1, …, x_n ). Such deletions do not affect the calculation since 0! = 1.
As a numerical example, consider a data sample x of size n = 34 from a Poisson random variable X with θ = 3 , where
x = ( 4 ,   7 ,   1 ,   3 ,   4 ,   2 ,   5 ,   0 ,   1 ,   2 ,   3 ,   6 ,   8 ,   0 ,   1 ,   2 ,   4 ,   9 ,   0 ,   2 ,   3 ,   1 ,   4 ,   2 ,   0 ,   1 ,   5 ,   6 ,   2 ,   7 ,   0 ,   1 ,   4 ,   2 ) .
Then, T_1(x) = Σ_{i=1}^n x_i = 102 from (51), and the calculator at [25] gives ( Σ_{i=1}^n x_i choose x_1, …, x_n ) ≈ 1.574 × 10^123 in (49) and (50). Moreover, (log n) Σ_{i=1}^n x_i ≈ 518.915. Hence, from (50), I_lost(x|T_1) = I_lost(x|T′_1) ≈ 109.667 bits. This value corresponds to 13.708 bytes at 8 bits per byte, or to 0.013 kilobytes (KB) at 1024 bytes per KB. It follows from the previous discussion in this example that the Shannon information lost by using the sample mean T′_1 as a surrogate for x itself is I_lost(x|T′_1) = I_lost(x|T_1) ≈ 0.013 KB, which seems surprisingly small. Perhaps the small loss results partially from the fact that T′_1(x) = x̄ = θ = 3 exactly.
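The value above can also be reproduced without forming any large factorial by working with log-factorials. A brief sketch of our own (it assumes only the data sample (51) and formula (50); math.lgamma supplies log k! = lgamma(k + 1), so no approximation such as Stirling's formula is needed):

```python
from math import lgamma, log, log2

x = (4, 7, 1, 3, 4, 2, 5, 0, 1, 2, 3, 6, 8, 0, 1, 2, 4, 9, 0, 2, 3, 1, 4, 2,
     0, 1, 5, 6, 2, 7, 0, 1, 4, 2)        # the data sample (51)
n, s = len(x), sum(x)                      # n = 34, total count s = 102

# Natural log of the multinomial coefficient (s choose x_1,...,x_n), using
# log k! = lgamma(k + 1) so that no large factorial is ever formed.
ln_multinomial = lgamma(s + 1) - sum(lgamma(xi + 1) for xi in x)

# Formula (50): I_lost(x|T_1) = -log2(multinomial) + (log2 n) * s, in bits
I_lost = -ln_multinomial / log(2) + s * log2(n)
print(round(I_lost, 3), "bits")            # approximately 109.7, consistent with the value above
```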
Case 2: Let T_2(X) = K(X|θ) for fixed but arbitrary θ > 0. For a data sample (x_1, …, x_n), write
L(x|θ) = f(x|θ) = θ^{Σ_{i=1}^n x_i} e^{−nθ} / Π_{i=1}^n x_i!,
from which
K(x|θ) = θ^{Σ_{i=1}^n x_i} e^{−nθ}
and R(x) = 1 / Π_{i=1}^n x_i! in (4). Note that for all fixed θ > 0 except θ = 1, there is an obvious one-to-one function between T_1(x) = Σ_{i=1}^n x_i and (53). Hence, from Case 1, I_lost(x|K(x|θ)) = I_lost(x|T_1) ≈ 0.013 KB from Theorem 3 for all θ > 0 except θ = 1. For θ = 1, K(x|θ) = e^{−n} from (53) and is constant with respect to any data sample x. Thus, I_reduced(x|1, K) = 0 and I_lost(x|K(x|1)) = I_total(x|1). It follows that K(x|1) provides no information about X.
Case 3: Let T_3(X) = L(X|θ) for fixed but arbitrary θ > 0. We attempt to obtain I_lost(x|L(x|θ)) for a data sample x = (x_1, …, x_n) by determining |A_{L(x|θ)}| and using (20). From (52), note that for all fixed θ > 0 except θ = 1, y ∈ A_{L(x|θ)} if and only if
θ^{Σ_{i=1}^n y_i} / Π_{i=1}^n y_i! = θ^{Σ_{i=1}^n x_i} / Π_{i=1}^n x_i!.
Thus, for any fixed θ satisfying θ > 0 and θ ≠ 1, y ∈ A_{L(x|θ)} if both Σ_{i=1}^n y_i = Σ_{i=1}^n x_i and Π_{i=1}^n y_i! = Π_{i=1}^n x_i!. However, for some θ > 0 with θ ≠ 1, it is possible that y ∈ A_{L(x|θ)} when neither Σ_{i=1}^n y_i = Σ_{i=1}^n x_i nor Π_{i=1}^n y_i! = Π_{i=1}^n x_i!. For example, let θ = 2, x = (4, 1, 1, 0), and y = (3, 2, 0, 0). Then, Σ_{i=1}^n x_i = 6, Σ_{i=1}^n y_i = 5, Π_{i=1}^n x_i! = 24, and Π_{i=1}^n y_i! = 12. However, (54) is satisfied.
This complication suggests that an efficient implicit enumeration of the y satisfying (54) would be required to obtain |A_{L(x|θ)}| for calculating I_lost(x|L(x|θ)) from (20). Using such an algorithm, a conventional computer could possibly compute I_lost(x|L(x|θ)) for the numerical data and value of θ in Case 1, since there is now a 250 petabyte, 200 petaflop conventional computer [26]. Substantially larger problems, if not already tractable, will likely be so in the foreseeable future on quantum computers. Recently, the milestone of quantum supremacy was achieved, where the various possible combinations of a certain randomly generated output were obtained in 110 s, whereas this task would have taken the above conventional supercomputer 10,000 years [27]. Regardless, for the data of Case 1, we have the upper bound I_lost(x|L(x|θ)) ≤ 0.013 KB from (32).
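For small samples, a brute-force rather than efficient implicit enumeration already suffices. The sketch below (our own illustration) searches all nonnegative integer 4-tuples y with entries up to a cutoff, keeps those satisfying (54) for the case θ = 2 and x = (4, 1, 1, 0) discussed above, and confirms that y = (3, 2, 0, 0) belongs to A_{L(x|θ)}; the cutoff is a hypothetical bound assumed large enough for this small case.

```python
from itertools import product
from math import factorial
from fractions import Fraction

theta, x = 2, (4, 1, 1, 0)                 # the counterexample discussed above
n = len(x)

def kernel_value(y, theta):
    """theta**sum(y) / prod(y_i!), the quantity on both sides of (54), as an exact fraction."""
    denom = 1
    for yi in y:
        denom *= factorial(yi)
    return Fraction(theta**sum(y), denom)

target = kernel_value(x, theta)            # 2**6 / 24 = 8/3

# Brute-force search over nonnegative integer 4-tuples with entries up to CUTOFF.
# The cutoff is a hypothetical bound assumed large enough for this small case.
CUTOFF = 12
A_L = [y for y in product(range(CUTOFF + 1), repeat=n) if kernel_value(y, theta) == target]

print(len(A_L))                            # size of A_L(x|theta) found within the search range
print((3, 2, 0, 0) in A_L)                 # True: the y noted above is in the set
```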
Finally, we present some simple computational results to illustrate the relationships among T_1, T_2, and T_3 for the Poisson distribution. Table 1 below summarizes the results for sample data (x_1, x_2, x_3) with Σ_{i=1}^3 x_i ≤ 2. In particular, a complete enumeration of A_{L(x|θ)} in (20) gives I_lost(x|L(x|θ)).
Example 2 (Binomial Distribution).
Consider a random sample X = (X_1, …, X_n) from a binomial random variable X with parameters m and θ, where θ is the probability of success on any of the m Bernoulli trials associated with the X_i, i = 1, …, n. Let m be fixed, so the only parameter is θ. Moreover, the sample space of the underlying random variable X is now finite.
Case 1: T_1(X) = Σ_{i=1}^n X_i. Again, Σ_{i=1}^n X_i is an SS for θ. From [21], Σ_{i=1}^n X_i has a binomial distribution with parameters mn and θ for fixed m and n. Hence,
P_θ[Σ_{i=1}^n X_i = Σ_{i=1}^n x_i] = θ^{Σ_{i=1}^n x_i} (1 − θ)^{mn − Σ_{i=1}^n x_i} ( mn choose Σ_{i=1}^n x_i )
and
P_θ[X = x] = θ^{Σ_{i=1}^n x_i} (1 − θ)^{mn − Σ_{i=1}^n x_i} Π_{i=1}^n ( m choose x_i ).
From (1), dividing (56) by (55) gives
P[X = x | Σ_{i=1}^n X_i = t] = Π_{i=1}^n ( m choose x_i ) / ( mn choose t ).
By taking the negative log of (57), the lost information is given as
I_lost(x|T_1) = −log [ Π_{i=1}^n ( m choose x_i ) / ( mn choose t ) ] = −Σ_{i=1}^n log ( m choose x_i ) + log ( mn choose t ).
Case 2: T 2 ( X ) = K ( X | θ ) . In this case, we use (16) as in Example 1. Write
L(x|θ) = f(x|θ) = θ^{Σ_{i=1}^n x_i} (1 − θ)^{mn − Σ_{i=1}^n x_i} Π_{i=1}^n ( m choose x_i ),
from which K(x|θ) = θ^{Σ_{i=1}^n x_i} (1 − θ)^{mn − Σ_{i=1}^n x_i} and R(x) = Π_{i=1}^n ( m choose x_i ) in (4). To factor the right side of (60) as in (2), let g be the identity function and h(x) = Π_{i=1}^n ( m choose x_i ). Hence,
I_lost(x|T_2) = −log [ Π_{i=1}^n ( m choose x_i ) / Σ_{y ∈ A_{K(x|θ)}} Π_{i=1}^n ( m choose y_i ) ],
and (60) gives
I_lost(x|T_2) = −Σ_{i=1}^n log ( m choose x_i ) + log Σ_{y ∈ A_{K(x|θ)}} Π_{i=1}^n ( m choose y_i ),
where
A_{K(x|θ)} = { y ∈ S^n | θ^{Σ_{i=1}^n y_i} (1 − θ)^{mn − Σ_{i=1}^n y_i} = θ^{Σ_{i=1}^n x_i} (1 − θ)^{mn − Σ_{i=1}^n x_i} }.
From (62), for any fixed θ satisfying 0 < θ < 1 and θ ≠ 1/2, it can easily be shown that y ∈ A_{K(x|θ)} if and only if Σ_{i=1}^n y_i = Σ_{i=1}^n x_i. Thus, in general, for a given x and fixed θ, determining A_{K(x|θ)} in Case 2 would require an enumeration of the y satisfying (62) to compute (61). We perform such an enumeration below for a simple example.
Case 3: T_3(X) = L(X|θ). For a data sample x = (x_1, …, x_n), we now have
L(x|θ) = ( θ / (1 − θ) )^{Σ_{i=1}^n x_i} (1 − θ)^{mn} Π_{i=1}^n ( m choose x_i )
with g being the identity function and h(x) = 1 in (2). For fixed θ satisfying 0 < θ < 1 and θ ≠ 1/2, from (63) we obtain that y ∈ A_{L(x|θ)} if and only if
( θ / (1 − θ) )^{Σ_{i=1}^n y_i} Π_{i=1}^n ( m choose y_i ) = ( θ / (1 − θ) )^{Σ_{i=1}^n x_i} Π_{i=1}^n ( m choose x_i ).
As in Case 3 of Example 1, developing an algorithm to use (64) and determine | A L ( x | θ ) | for calculating I lost ( x | L ( x | θ ) ) from (20) is beyond the scope of this paper.
As a simple example, consider the experiment of flipping a possibly biased coin twice ( m = 2 ). The total number of heads follows a binomial distribution with the parameter θ , which is the probability of getting a head on any flip. By doing this experiment three times, we generate the random variables X 1 , X 2 , X 3 with possible values 0, 1, 2. Table 2 shows all the possibilities and the lost information for the statistics. The small size of this example allows the computation of I lost in Cases 2 and 3 via total enumeration.
Now, using (40), we give in Table 3 the entropic losses of Example 2 for T 1 , T 2 , T 3 . Note that H lost ( X , T ) is the same for the sum T 1 and the likelihood kernel T 2 , which are related by a one-to-one function. Hence, Result 5 is corroborated. In addition, observe that H lost ( X , T ) is smallest for the likelihood function T 3 .
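For the likelihood function, Theorem 5 needs only the sizes of the partition sets. The following check of our own (again with m = 2, n = 3, and an arbitrary fixed θ ≠ 1/2) groups the 27 samples by their likelihood value and evaluates (41); the result agrees with the H_lost(X, T_3) entry of Table 3.

```python
from itertools import product
from math import comb, log2
from collections import Counter

m, n, theta = 2, 3, 0.3                      # toy binomial model of Example 2

def L(x):                                     # likelihood function L(x|theta)
    s = sum(x)
    c = 1
    for xi in x:
        c *= comb(m, xi)
    return c * theta**s * (1 - theta)**(m * n - s)

# Group the 27 samples by their likelihood value; the group sizes are the |A_t|.
sizes = Counter(round(L(x), 12) for x in product(range(m + 1), repeat=n))

# Theorem 5: H_lost(X, L) = (1/|tau_L|) * sum over t of log |A_t|
H_lost_L = sum(log2(k) for k in sizes.values()) / len(sizes)
print(len(sizes), "partition sets;", "H_lost(X, L) =", round(H_lost_L, 4))
```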
Example 3 (Geometric Distribution).
Consider a random sample X = ( X 1 ,   ,   X n ) with sample data x = ( x 1 ,   ,   x n ) from a geometric random variable X , where the parameter θ is the probability of success on any of the series of independent Bernoulli trials for which X is the trial number on which the first success is obtained. It readily follows from [5] that
P_θ[X = x] = θ^n (1 − θ)^{Σ_{i=1}^n x_i − n}.
Case 1: T_1(X) = Σ_{i=1}^n X_i. For fixed n, Σ_{i=1}^n X_i has a negative binomial distribution with parameter θ [21]. Hence,
P_θ[Σ_{i=1}^n X_i = Σ_{i=1}^n x_i] = ( Σ_{i=1}^n x_i − 1 choose n − 1 ) θ^n (1 − θ)^{Σ_{i=1}^n x_i − n}.
Thus T_1(X) = Σ_{i=1}^n X_i is an SS for θ since it satisfies (2) with g[T_1(x)|θ] = θ^n (1 − θ)^{T_1(x) − n} and h(x_1, …, x_n) = ( Σ_{i=1}^n x_i − 1 choose n − 1 ). Moreover, substitution of (65) and (66) into (8) gives
P[X = x | Σ_{i=1}^n X_i = Σ_{i=1}^n x_i] = 1 / ( Σ_{i=1}^n x_i − 1 choose n − 1 ).
Then from (14) and (67) we obtain that
I_lost(x|T_1) = log ( Σ_{i=1}^n x_i − 1 choose n − 1 ).
Case 2: T_2(X) = K(X|θ). From (66), for all x ∈ S^n, R(x) = 1 and
K(x|θ) = L(x|θ) = ( θ / (1 − θ) )^n (1 − θ)^{Σ_{i=1}^n x_i}.
Thus, for 0 < θ < 1, there is an obvious one-to-one function between T_1(x) = Σ_{i=1}^n x_i and T_2(x) = K(x|θ) in (69). Thus, from Theorem 3, I_lost(x|T_2) = I_lost(x|T_1) as given in (68).
Case 3: T 3 ( X ) = L ( X | θ ) . Since K ( X | θ ) = L ( X | θ ) from (69), then
I_lost(x|T_3) = log ( Σ_{i=1}^n x_i − 1 choose n − 1 )
from (68). However, there is an alternate derivation of (70). For 0 < θ < 1, it follows from (69) that y ∈ A_{L(x|θ)} if and only if
Σ_{i=1}^n y_i = Σ_{i=1}^n x_i.
But for fixed positive integers x_1, …, x_n, we have from [28] that the number of solutions |A_{L(x|θ)}| to (71) in positive integers y_1, …, y_n is
( Σ_{i=1}^n x_i − 1 choose n − 1 ).
Thus, (70) follows for L ( X | θ ) from (72) and (20), so I lost ( x | T 1 ) = I lost ( x | T 2 ) = I lost ( x | T 3 ) from Theorem 3.
As a numerical illustration, let the random variable X denote the number of flips of a possibly biased coin until a head is obtained. Then, X has a geometric distribution, where the parameter θ is now the probability of getting a head on any flip. Suppose this experiment is performed three times yielding the sample data x = ( x 1 , x 2 , x 3 ) shown in Table 4. I lost ( x | T ) is then calculated for each of the sufficient statistics for θ of Example 3. Observe that the individual statistics depend on θ while the lost information does not. Moreover, I lost ( x | T 1 ) = I lost ( x | T 2 ) = I lost ( x | T 3 ) for all data samples, as established above.
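The geometric case reduces to a single binomial coefficient, which makes a direct cross-check easy. A short sketch of our own (the samples are those of Table 4; the enumeration verifies (72) by counting the positive integer solutions of (71) directly):

```python
from itertools import product
from math import comb, log2

def I_lost_geometric(x):
    """Formula (68)/(70): log2 of C(sum(x) - 1, n - 1) bits."""
    return log2(comb(sum(x) - 1, len(x) - 1))

def I_lost_by_enumeration(x):
    """Count the positive-integer solutions y of sum(y) = sum(x) directly."""
    n, s = len(x), sum(x)
    count = sum(1 for y in product(range(1, s + 1), repeat=n) if sum(y) == s)
    return log2(count)

for x in [(1, 1, 1), (2, 1, 1), (2, 2, 1), (2, 2, 2)]:   # samples of Table 4
    print(x, round(I_lost_geometric(x), 4), round(I_lost_by_enumeration(x), 4))
# Both columns agree: log2(1), log2(3), log2(6), and log2(10) bits respectively.
```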

6. Conclusions

In this paper, the Shannon information obtained for a random sample X taken from a discrete random variable X with a single parameter θ was decomposed into two components: (i) the reduced information associated with the value of a real-valued statistic T ( X ) evaluated at the data sample x , and (ii) the information lost by using this value as a surrogate for x . Information is lost because multiple data sets can give the same value of the statistic. In data analysis, the data uniquely determines the value of a statistic, but typically the value of the statistic does not uniquely determine the data yielding it. The lost information thus measures the knowledge unavailable about the data sample x when only the reduced data summary T ( x ) is known, but not x itself. To eliminate the effect of θ , we focused on sufficient statistics for θ such as the sample mean. We then answered the question: how much Shannon information is lost to someone about a data sample x when only the value of T ( x ) is available but not the data x itself? Our answer is independent of the parameter θ and does not require that θ be known. Our method generalizes the approach of [2] for analyzing the information contained in a sequence of Bernoulli trials.
More generally, we developed a metric associated with the value T ( x ) used to summarize, represent, or characterize a given data set. Our approach and results are significant because such statistics are often communicated without the original data. One could argue that I lost ( x | T ) should be communicated along with T ( x ) in a manner similar to providing the margin of error associated with the results of a poll. A small I lost ( x | T ) would signify that T ( x ) is more informative than if   I lost ( x | T ) were large.
In addition, we defined the entropic loss associated with a sufficient statistic T under consideration as the expected lost information over all possible samples, to give a value dependent only on T . We noted but did not explore the possibility that entropic loss could be used as a metric to compare different sufficient statistics. Moreover, if sufficient statistics were not required, entropic loss could provide metrics on either θ or T if the other of these variables is fixed. Finally, numerical examples of our results were presented and some computational issues noted.

Author Contributions

M.M. suggested the topic after reading [2]. She shared the development of the theory and examples of this paper, as well as wrote early drafts. H.C. formulated the general decomposition here. He shared the development of the theory and examples of this paper, as well as edited the final draft. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Landauer, R. Irreversibility and heat generation in the computing process. IBM J. Res. Dev. 1961, 5, 183–191.
2. Hodge, S.E.; Vieland, V.J. Information loss in binomial data due to data compression. Entropy 2017, 19, 75.
3. Casella, G.; Berger, R.L. Statistical Inference, 2nd ed.; Cengage Learning: Delhi, India, 2002.
4. Pawitan, Y. All Likelihood: Statistical Modeling and Inference Using Likelihood, 1st ed.; The Clarendon Press: Oxford, UK, 2013.
5. Rohatgi, V.K.; Saleh, A.K.E. An Introduction to Probability and Statistics, 2nd ed.; John Wiley & Sons, Inc.: New York, NY, USA, 2001.
6. Shannon, C.E.; Weaver, W. The Mathematical Theory of Communication, 1st ed.; The University of Illinois Press: Urbana, IL, USA, 1964.
7. Shannon, C. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
8. Vigo, R. Representational information: A new general notion and measure of information. Inf. Sci. 2011, 181, 4847–4859.
9. Vigo, R. Complexity over uncertainty in generalized representational information theory (GRIT): A structure-sensitive general theory of information. Information 2013, 4, 1–30.
10. Klir, G.J. Uncertainty and Information: Foundations of Generalized Information Theory, 1st ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2006.
11. Devlin, K. Logic and Information, 1st ed.; Cambridge University Press: Cambridge, UK, 1991.
12. Luce, R.D. Whatever happened to information theory in psychology? Rev. Gen. Psychol. 2003, 7, 183–188.
13. Floridi, L. The Philosophy of Information, 1st ed.; Oxford University Press: Oxford, UK, 2011.
14. Garner, W.R. The Processing of Information and Structure, 1st ed.; Wiley: New York, NY, USA, 1974.
15. Spellerberg, I.F.; Fedor, P.J. A tribute to Claude Shannon (1916–2001) and a plea for more rigorous use of species richness, species diversity and the "Shannon-Wiener" index. Glob. Ecol. Biogeogr. 2003, 12, 177–179.
16. Shamir, O.; Sabato, S.; Tishby, N. Learning and generalization with the information bottleneck. Theor. Comput. Sci. 2010, 411, 2696–2711.
17. Csiszár, I. Axiomatic characterizations of information measures. Entropy 2008, 10, 261–273.
18. Kapur, J.N.; Kesavan, H.K. Entropy Optimization Principles and Their Applications, 1st ed.; Water Science and Technology Library, Springer: Dordrecht, The Netherlands, 1992; Volume 9.
19. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2006.
20. Ibekwe-SanJuan, F.; Dousa, T. Theories of Information, Communication and Knowledge: A Multidisciplinary Approach, 1st ed.; Springer: Dordrecht, The Netherlands, 2014.
21. Johnson, J.L. Probability and Statistics for Computer Science, 1st ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2008.
22. Beeler, R.A. How to Count: An Introduction to Combinatorics and Its Applications, 1st ed.; Springer: Cham, Switzerland, 2015.
23. Sridharan, S.; Balakrishnan, R. Foundations of Discrete Mathematics with Algorithms and Programming, 1st ed.; Chapman and Hall/CRC: New York, NY, USA, 2018.
24. Mortici, C. Ramanujan formula for the generalized Stirling approximation. Appl. Math. Comput. 2010, 19, 2579–2585.
25. Multinomial Coefficient Calculator. Available online: https://mathcracker.com/multinomial-coefficient-calculator.php (accessed on 27 July 2019).
26. Wan, L.; Mehta, K.V.; Klasky, S.A.; Wolf, M.; Wang, H.Y.; Wang, W.H.; Li, J.C.; Lin, Z. Data management challenges of exascale scientific simulations: A case study with the Gyrokinetic Toroidal Code and ADIOS. In Proceedings of the 10th International Conference on Computational Methods, ICCM'19, Singapore, 9–13 July 2019.
27. Arute, F.; Arya, K.; Martinis, J.M. Quantum supremacy using a programmable superconducting processor. Nature 2019, 574, 505–510.
28. Mahmoudvand, R.; Hassani, H.; Farzaneh, A.; Howell, G. The exact number of nonnegative integer solutions for a linear Diophantine inequality. IAENG Int. J. Appl. Math. 2010, 40, 5.
Figure 1. Partition Sets.
Table 1. Poisson Example.

x = (x_1, x_2, x_3) | T_1(x) | I_lost(x|T_1) | T_2(x) | I_lost(x|T_2) | T_3(x) | I_lost(x|T_3)
(0,0,0) | 0 | 0 | e^{−3θ} | 0 | e^{−3θ} | 0
(0,0,1), (0,1,0), (1,0,0) | 1 | log 3 | θ e^{−3θ} | log 3 | θ e^{−3θ} | log 3
(1,1,0), (1,0,1), (0,1,1) | 2 | log(9/2) | θ² e^{−3θ} | log(9/2) | θ² e^{−3θ} | log 3
(2,0,0), (0,2,0), (0,0,2) | 2 | log 9 | θ² e^{−3θ} | log 9 | θ² e^{−3θ}/2 | log 3
Table 2. Binomial Example.

x = (x_1, x_2, x_3) | T_1(x) | I_lost(x|T_1) | T_2(x) | I_lost(x|T_2) | T_3(x) | I_lost(x|T_3)
(0,0,0) | 0 | 0 | (1 − θ)^6 | 0 | (1 − θ)^6 | 0
(0,0,1), (0,1,0), (1,0,0) | 1 | log 3 | (1 − θ)^5 θ | log 3 | 2(1 − θ)^5 θ | log 3
(1,1,0), (1,0,1), (0,1,1) | 2 | log(15/4) | (1 − θ)^4 θ² | log(15/4) | 4(1 − θ)^4 θ² | log 3
(2,0,0), (0,2,0), (0,0,2) | 2 | log 15 | (1 − θ)^4 θ² | log 15 | (1 − θ)^4 θ² | log 3
(1,1,1) | 3 | log(5/2) | (1 − θ)³ θ³ | log(5/2) | 8(1 − θ)³ θ³ | 0
(2,1,0), (2,0,1), (1,0,2), (1,2,0), (0,1,2), (0,2,1) | 3 | log 10 | (1 − θ)³ θ³ | log 10 | 2(1 − θ)³ θ³ | log 6
(2,1,1), (1,2,1), (1,1,2) | 4 | log(15/4) | (1 − θ)² θ⁴ | log(15/4) | 4(1 − θ)² θ⁴ | log 3
(2,2,0), (2,0,2), (0,2,2) | 4 | log 15 | (1 − θ)² θ⁴ | log 15 | (1 − θ)² θ⁴ | log 3
(2,2,1), (2,1,2), (1,2,2) | 5 | log 3 | (1 − θ) θ⁵ | log 3 | 2(1 − θ) θ⁵ | log 3
(2,2,2) | 6 | 0 | θ⁶ | 0 | θ⁶ | 0
Table 3. Entropic loss over different statistics for a binomial distribution.

H_lost(X, T_1) | H_lost(X, T_2) | H_lost(X, T_3)
1.4722 | 1.4722 | 1.2095
Table 4. Geometric Example.

x = (x_1, x_2, x_3) | T_1(x) | I_lost(x|T_1) | T_2(x) | I_lost(x|T_2) | T_3(x) | I_lost(x|T_3)
(1,1,1) | 3 | 0 | θ³ | 0 | θ³ | 0
(2,1,1), (1,2,1), (1,1,2) | 4 | log 3 | θ³(1 − θ) | log 3 | θ³(1 − θ) | log 3
(2,2,1), (2,1,2), (1,2,2) | 5 | log 6 | θ³(1 − θ)² | log 6 | θ³(1 − θ)² | log 6
(2,2,2) | 6 | log 10 | θ³(1 − θ)³ | log 10 | θ³(1 − θ)³ | log 10
