Article

An Objective Prior from a Scoring Rule

by Stephen G. Walker 1 and Cristiano Villa 2,*
1 Department of Mathematics, University of Texas at Austin, 2515 Speedway, Austin, TX 78712, USA
2 School of Mathematics, Statistics & Physics, University of Newcastle, Newcastle upon Tyne NE1 7RU, UK
* Author to whom correspondence should be addressed.
Submission received: 7 May 2021 / Revised: 16 June 2021 / Accepted: 24 June 2021 / Published: 29 June 2021
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract:
In this paper, we introduce a novel objective prior distribution by leveraging the connections between information, divergence and scoring rules. In particular, we do so from the starting point of convex functions representing information in density functions. This provides a natural route to proper local scoring rules using Bregman divergence. Specifically, we determine the prior obtained by setting the score function equal to a constant. Although this in itself provides motivation for an objective prior, the prior also minimizes a corresponding information criterion.

1. Introduction

A major drawback of objective priors, such as Jeffreys prior [1] and the reference prior [2], is that in many cases they are improper. The case for objective priors has been made, see for example [3]; yet, while for a parameter defined over a bounded interval, such as $(0,1)$, it is generally possible to derive objective prior distributions that are proper, this is not the case for parameters on $(0,\infty)$ or $(-\infty,\infty)$. The literature provides many examples where improper prior distributions cannot be suitably employed, such as Bayes factors, mixture models and hierarchical models, to name but a few. Methods have been proposed to overcome these obstacles, for example, Intrinsic Bayes Factors [4] and Fractional Bayes Factors [5], or reparametrizing mixture models [6]. However, these types of results are generally valid only under a limited number of specific conditions. Additionally, improper prior distributions are not well suited to problems involving large numbers of parameters, as it becomes difficult to establish properness of the full posterior distribution.
The idea of this paper is to present a novel objective prior distribution for continuous parameter spaces by considering the connection between information, divergence and scoring rules. In particular, the proposed prior can be defined over $(0,\infty)$ and $(-\infty,\infty)$, the latter by extending the former, and it has the appealing property of being proper.
Recently, Ref. [7] introduced a new class of objective prior which solves a differential equation of the form $S(q, q', q'') = 0$, where S is a score function and the solution q acts as the prior distribution, and where $q'$ and $q''$ are, respectively, the first and second derivative of q. The solution is also shown to minimize an information criterion.
There are two well-known relations that connect information, proper local scoring rules and divergences. The most famous of these links Shannon information, Kullback–Leibler divergence and the log-score, and is given by
$$\int p \log\frac{p}{q} = \int p\log p + \int p\,(-\log q), \qquad (1)$$
where p and q are two densities, and integrals are generally defined with respect to the Lebesgue measure. The term on the left-hand side of (1) is the Kullback–Leibler divergence [8] between p and q, the first term on the right-hand side is the Shannon information associated with the density p, and the second term is the expectation of the log-score function.
Another way to connect information, divergence and proper local scoring rules involves the Fisher divergence, the Fisher information, and the Hyvärinen score function [9]:
$$\int p\left(\frac{p'}{p} - \frac{q'}{q}\right)^2 = \int \frac{(p')^2}{p} + \int p\left[2\,\frac{q''}{q} - \left(\frac{q'}{q}\right)^2\right], \qquad (2)$$
where the final term has been obtained using an integration by parts, requiring $p\,q'/q$ to vanish at the boundary values. In general, these relationships can be expressed as
$$D(p,q) = I(p) + \int p\,S(q),$$
where D denotes the divergence, I the measure of information and S the score, and clearly $I(p) = -\int p\,S(p)$.
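These identities are easy to verify numerically. The following is an illustrative sketch of ours (not code from the paper) that checks (1) and (2) by quadrature for two arbitrary normal densities, using the analytic derivatives of the normal density.

```python
# Numerical check of identities (1) and (2) for two normal densities.
# Illustrative sketch only; densities and grid are arbitrary choices.
import numpy as np
from scipy.stats import norm
from scipy.integrate import simpson

x = np.linspace(-20.0, 20.0, 40001)
mu_p, s_p, mu_q, s_q = 0.0, 1.0, 1.0, 1.5
p = norm.pdf(x, mu_p, s_p)
q = norm.pdf(x, mu_q, s_q)

# analytic derivatives of the normal densities
dp = -(x - mu_p) / s_p**2 * p
dq = -(x - mu_q) / s_q**2 * q
d2q = ((x - mu_q)**2 / s_q**4 - 1.0 / s_q**2) * q

# identity (1): Kullback-Leibler = Shannon information + expected log-score
lhs1 = simpson(p * np.log(p / q), x=x)
rhs1 = simpson(p * np.log(p), x=x) + simpson(p * (-np.log(q)), x=x)

# identity (2): Fisher divergence = Fisher information + expected Hyvarinen score
lhs2 = simpson(p * (dp / p - dq / q)**2, x=x)
rhs2 = simpson(dp**2 / p, x=x) + simpson(p * (2 * d2q / q - (dq / q)**2), x=x)

print(np.isclose(lhs1, rhs1), np.isclose(lhs2, rhs2))   # True True
```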
Recently, in [10], a new class of score function was introduced, where the starting point is the property of the score function, which is that
$$p = \arg\min_q \int p\,S(q),$$
for all densities p. In other words, a score is said to be proper if the above is minimized by the choice $q = p$. Consider the well-known log-score, $S(q) = -\log q(x)$. It satisfies the above property, since for any density p we have $\int p\log(p/q) \geq 0$, with equality only when $q \equiv p$. As such, the log-score is a proper score. Furthermore, a score is said to be local if it depends on q only through the density value $q(x)$; see [10,11]. It must be noted that the log-score is the only proper score that is local.
Consider now the Hyvärinen score function [9], which is given by
$$S(q,q',q'') = 2\,\frac{q''}{q} - \left(\frac{q'}{q}\right)^2,$$
which depends on $(q, q', q'')$, i.e., on q and its first two derivatives; as such, it is not local in the above sense. However, the locality condition can be weakened [10] by allowing the score to depend on a finite number m of derivatives. The Hyvärinen score is therefore an order-2 proper local scoring rule.
More generally, if a proper score depends on m derivatives, then it is said to be an order-m local scoring rule. The theory in support of this is based on the fact that the minimizer of $\int p\,S(q)$ is p, and this can be investigated using variational analysis. The relevant Euler–Lagrange equation of order two is
$$S + q\,\frac{\partial S}{\partial q} - \frac{d}{dx}\left(q\,\frac{\partial S}{\partial q'}\right) + \frac{d^2}{dx^2}\left(q\,\frac{\partial S}{\partial q''}\right) = 0. \qquad (3)$$
The corresponding general case of (3) is given as Equation (18) in [10]. Throughout this paper we focus on the case m = 2, since this is where we draw our prior from. Appendix A provides the expression for a general m.
In [10], the solution to Equation (3) is proposed using properties of differential operators and 1-homogeneous functions. Recall that a 1-homogeneous function f is such that $f(x, \lambda q, \lambda q') = \lambda\, f(x, q, q')$ for any $\lambda > 0$. In particular, the Hyvärinen score arises with $f(x, q, q') = (q')^2/q$ and
$$S(q) = -\frac{\partial f}{\partial q} + \frac{d}{dx}\frac{\partial f}{\partial q'}.$$
Furthermore, Refs. [10,11] characterize all local and proper scoring rules of order m = 2. In this respect, as an additional result of interest, in Appendix A we present the characterization using measures of information and the Bregman divergence [12]. The benefits of the proposed approach are that complicated mathematical analysis is avoided and the derivation of the local rule is made explicit.
Following [10,11] and the novel derivation of their results using Bregman divergence, which is the focus of Appendix A, information, divergence and scores can be obtained as follows. For some convex function $\alpha: \mathbb{R} \to \mathbb{R}$:
  • Divergence: Given the result (A10), we obtain
    $$D(p,q) = \int p\,\alpha(p'/p) - \int\left[p\,\frac{\partial\phi}{\partial q} + p'\,\frac{\partial\phi}{\partial q'}\right],$$
    where
    $$\frac{\partial\phi}{\partial q} = \alpha(q'/q) - (q'/q)\,\alpha'(q'/q) \qquad\text{and}\qquad \frac{\partial\phi}{\partial q'} = \alpha'(q'/q).$$
    Using integration by parts on the right-most integral, and assuming that $[\,p\cdot\partial\phi/\partial q'\,]$ vanishes at the extremes of the integral,
    $$D(p,q) = \int p\,\alpha(p'/p) + \int p\left[\frac{d}{dx}\,\alpha'(q'/q) - \alpha(q'/q) + (q'/q)\,\alpha'(q'/q)\right].$$
  • Information: This follows from the divergence, and from (A4), and is given by
    $$I(p) = \int p\,\alpha(p'/p).$$
  • Score: Again, from the form of the divergence and (A8), this is given by
    $$S(q,q',q'') = \frac{d}{dx}\,\alpha'(q'/q) - \alpha(q'/q) + (q'/q)\,\alpha'(q'/q). \qquad (5)$$
The score $S(q,q',q'')$ in (5) generalizes the Hyvärinen score, which arises when $\alpha(u) = u^2$.
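To make the construction concrete, the following symbolic sketch (our illustration; the helper name score_from_alpha is ours) builds the score in (5) for a given convex α and confirms that α(u) = u² returns the Hyvärinen score.

```python
# Symbolic check that the score (5) with alpha(u) = u**2 reduces to the
# Hyvarinen score 2 q''/q - (q'/q)**2.  Illustrative sketch only.
import sympy as sp

x = sp.symbols('x')
q = sp.Function('q')(x)

def score_from_alpha(alpha):
    """Score (5): S = d/dx alpha'(q'/q) - alpha(q'/q) + (q'/q) alpha'(q'/q)."""
    u = sp.Symbol('u')
    a = alpha(u)
    da = sp.diff(a, u)
    r = sp.diff(q, x) / q                      # r = q'/q
    return sp.diff(da.subs(u, r), x) - a.subs(u, r) + r * da.subs(u, r)

S_hyv = score_from_alpha(lambda u: u**2)
target = 2 * sp.diff(q, x, 2) / q - (sp.diff(q, x) / q)**2
print(sp.simplify(S_hyv - target))             # -> 0
```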
The paper is organized as follows. Section 2 introduces the proposed objective prior. Section 3 includes a thorough simulation study, and an application to mixture models that involves both simulated and real data. In Section 4 we discuss another critical scenario where improper priors may result in improper posteriors, namely assigning an objective prior to the variance parameter in a hierarchical model. The supporting theory is presented in Appendix A. In Appendix A.1 we use Bregman divergence to obtain general forms for score functions and associated divergences; following on from this, in Appendix A.2 we detail how we use Bregman divergences to obtain a divergence between probability density functions using their first derivatives, and show how to obtain score functions from these divergences. Appendix A.3 provides the general case using m derivatives. Finally, in Appendix A.4 we make the connection between our derivation of scores and that of [10].

2. New Objective Prior

Ref. [7] proposed constructing objective prior distributions on parameter spaces by solving equations of the kind $S(q) = 0$. Specifically, they used a weighted mixture of the log-score and the Hyvärinen score functions. Please note that the sole use of the log-score function would result in the uniform prior, which is not appropriate in many cases and may yield improper posterior distributions. On the other hand, a weighted combination of the two score functions yields a differential equation given by
$$w\,\log q(x) + \frac{q''(x)}{q(x)} - \frac{1}{2}\left(\frac{q'(x)}{q(x)}\right)^2 = 0, \qquad (6)$$
where q denotes the prior density and w > 0 the weight balancing the two score functions.
Solutions to the differential Equation (6) can be found for different spaces, and constraints on the shape of q can be imposed so as to obtain a prior density with desirable behavior, such as monotonicity, convexity or log-concavity.
We have already seen that the Hyvärinen score arises with $\alpha(u) = u^2$; see (5). An important property an objective prior distribution may be required to have is a heavy tail, and it is this type of prior we seek to obtain in a formal way through an appropriate choice of α. We will consider such a prior on $(0,\infty)$. Mirroring the Hyvärinen score, we adopt $\alpha(u) = u^{-2}$ with $u = q'/q$, and q a decreasing density on $(0,\infty)$. In this case, Equation (5) becomes $6u'/u^4 - 3/u^2$, which, set equal to 0, becomes $u' = \frac{1}{2}u^2$. The solution is easily seen to be $u(x) = -2/(a+x)$, for some constant a. In this case, the prior on the parameter space $(0,\infty)$ is
$$q(x) = \frac{a}{(a+x)^2}. \qquad (7)$$
To obtain a density function on $(0,\infty)$, i.e., so that q is non-negative, we choose $a > 0$, as we are permitted to do through the constant of integration from the differential equation.
Interestingly, the prior in (7) is a Lomax distribution [13] with scale parameter a and shape parameter equal to 1. Recalling that the Lomax distribution can be directly connected to the Pareto Type I and Pareto Type II distributions, its heavy-tailed nature is immediately obvious.
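As a check on the derivation above (an illustrative sketch of ours), sympy confirms that u(x) = −2/(a + x) solves u′ = u²/2, that q′/q = u for q(x) = a/(a + x)², and that this q integrates to one on (0, ∞).

```python
# Sketch (not from the paper): verify that u(x) = -2/(a + x) solves
# u' = u**2/2, and that q(x) = a/(a + x)**2 satisfies q'/q = u and has unit mass.
import sympy as sp

x, a = sp.symbols('x a', positive=True)
u = -2 / (a + x)
q = a / (a + x)**2

print(sp.simplify(sp.diff(u, x) - u**2 / 2))   # -> 0
print(sp.simplify(sp.diff(q, x) / q - u))      # -> 0
print(sp.integrate(q, (x, 0, sp.oo)))          # -> 1
```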
Figure 1 shows the prior with a = 1 .
Making the connection more directly with the theory set out in the paper, with $\alpha(u) = u^{-2}$ we have $\phi(u,v) = u^3/v^2$, which is easily shown to satisfy $\phi = u\,\partial\phi/\partial u + v\,\partial\phi/\partial v$. Then using (A8) we obtain
$$S(q,q',q'') = 3\left(\frac{q}{q'}\right)^2\left[\frac{2\,q\,q''}{(q')^2} - 3\right]. \qquad (8)$$
Setting this to zero, i.e., $2\,q\,q'' = 3\,(q')^2$, this can be solved and the solution is precisely of the form $a/(a+x)^2$. We now write this all out in a theorem.
Theorem 1.
Let $\phi(p,p') = p^3/(p')^2$ be the convex function appearing in (A3); i.e., $\phi(u,v) = u\,\alpha(v/u)$ with $\alpha(\xi) = \xi^{-2}$. Then ϕ is convex for either $\xi < 0$ or $\xi > 0$. The Euler equation associated with this ϕ, i.e., $\frac{d}{dx}\,\frac{\partial\phi}{\partial p'} = \frac{\partial\phi}{\partial p}$, yields
$$\frac{6\,p^3\,p''}{(p')^4} - \frac{6\,p^2}{(p')^2} = \frac{3\,p^2}{(p')^2},$$
the solution to which can be written as $S(p,p',p'') = 0$, where S is the corresponding score function (8).
The corresponding prior on $(-\infty,\infty)$ is obtained through symmetry about 0:
$$q(x) = \begin{cases} \dfrac{1}{2}\,\dfrac{a}{(x-a)^2}, & \text{for } x < 0, \\[2mm] \dfrac{1}{2}\,\dfrac{a}{(x+a)^2}, & \text{for } x \geq 0. \end{cases} \qquad (9)$$
Here we motivate the natural objective choice of the constant a as 1. We note that a is a scale parameter in the above, and as such the most fundamental transformation to be considered here is $\phi = 1/x$. This, for example, would take a variance to a precision (and vice versa). For the prior in (7) to be invariant, i.e., $p_a(\phi) = a/(a+\phi)^2$, we need to have $a = 1$, since
$$p_a(\phi) = \frac{a}{(a + 1/\phi)^2}\,\frac{1}{\phi^2} = \frac{a}{(a\phi + 1)^2},$$
which equals $a/(a+\phi)^2$ if and only if $a = 1$. All the illustrations that follow have been made taking this choice for a. Although other transformations are clearly available, the dominance of the reciprocal transform is sufficient motivation to select $a = 1$.
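The change-of-variables argument can be verified symbolically; the following sketch (ours, not from the paper) transforms q(x) = a/(a + x)² under φ = 1/x and checks that the transformed density matches the original form when a = 1.

```python
# Sketch: change of variables phi = 1/x applied to q(x) = a/(a + x)**2.
import sympy as sp

x, a, phi = sp.symbols('x a phi', positive=True)
q = a / (a + x)**2

# density of phi = 1/x: q(1/phi) * |d(1/phi)/dphi|
q_phi = (q.subs(x, 1 / phi) * sp.Abs(sp.diff(1 / phi, phi))).simplify()
print(q_phi)                                                  # a/(a*phi + 1)**2
print(sp.simplify(q_phi.subs(a, 1) - q.subs({x: phi, a: 1}))) # -> 0, i.e. invariant at a = 1
```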

First Examples

The first simulation study concerns inference on a scale parameter; specifically, the standard deviation of a normal density with mean $\mu = 0$ and standard deviation $\sigma \in (0,\infty)$. We compare prior (7) with Jeffreys prior, i.e., $\pi(\sigma) \propto 1/\sigma$. We took 250 samples of size n = 30 and n = 100, obtained the posterior distributions using standard MCMC methods (Metropolis–Hastings with normal random walk proposal, 6000 iterations, with a burn-in of 1000 and a thinning of 10) and computed the following two indexes. The first is the root Mean Squared Error (MSE) of the posterior mean, divided by the true parameter value,
$$\sqrt{\mathrm{MSE}} = \frac{\sqrt{E\left[(\hat{\sigma} - \sigma)^2\right]}}{\sigma},$$
where $\hat{\sigma}$ is the posterior mean; the second is the coverage of the 95% posterior credible interval for σ. Table 1 shows the results for the MSE for $\sigma \in \{0.25, 0.50, 1, 2, 5, 10, 20\}$, where we see little difference between the performance of the two priors. As one would expect, the MSE is lower for the larger sample size. The important point, however, is that the score prior is proper.
The coverage of the 95% posterior credible interval is shown in Table 2, where we can also see very similar behavior between the two priors.
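For concreteness, the following is a minimal sketch (our own, not the authors' code; the step size and the synthetic data are arbitrary choices) of the random-walk Metropolis sampler described above, targeting the posterior of σ under the score prior (7) with a = 1.

```python
# Minimal sketch of random-walk Metropolis for the standard deviation sigma of
# N(0, sigma^2) data under the score prior q(sigma) = 1/(1 + sigma)^2.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 2.0, size=30)          # synthetic data, true sigma = 2

def log_post(sigma):
    if sigma <= 0:
        return -np.inf
    loglik = -len(y) * np.log(sigma) - np.sum(y**2) / (2 * sigma**2)
    logprior = -2 * np.log(1 + sigma)      # log of 1/(1 + sigma)^2
    return loglik + logprior

def rw_metropolis(n_iter=6000, burn=1000, thin=10, step=0.3, init=1.0):
    sigma, lp = init, log_post(init)
    draws = []
    for it in range(n_iter):
        prop = sigma + step * rng.normal()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            sigma, lp = prop, lp_prop
        if it >= burn and (it - burn) % thin == 0:
            draws.append(sigma)
    return np.array(draws)

draws = rw_metropolis()
print(draws.mean(), np.quantile(draws, [0.025, 0.975]))
```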
To illustrate the frequentist properties of the prior in (9), we have compared it with a flat prior, $\pi(\mu) \propto 1$, in making inference for the location parameter of a log-normal density with unknown μ and known scale parameter $\sigma = 1$. As in the previous case, we have drawn 250 samples of size n = 30 and n = 100 and computed the MSE and the coverage of the 95% posterior credible interval. The values of μ considered were from the set $\{0, 1, 5, 10, 50, 100\}$. Table 3 shows the MSE for the two priors, where we see that, apart from a small difference for $\mu = 0$, the two priors appear to perform in a very similar fashion.
The coverage of the 95% posterior credible interval for μ is shown in Table 4, where we can see a very similar behavior for the two priors, with an exception for μ = 0 , although the two coverage levels are perfectly acceptable.
The general conclusion from the two experiments above is that the prior obtained via $\alpha(u) = 1/u^2$ exhibits tails which are sufficiently heavy to generate optimal frequentist performance even for large parameter values. These properties are comparable to those obtained by Jeffreys prior, which is well known for being the objective prior yielding good frequentist properties of the posterior. The advantage of our prior is that it is always proper.

3. Mixture Models

A major area of challenge for objective priors is finite mixture models, where observations are assumed to be generated by the following model;
$$f(y) = \sum_{l=1}^{k}\omega_l\, f_l(y\mid\theta_l), \qquad \sum_{l=1}^{k}\omega_l = 1, \qquad (10)$$
for densities $f_l(\cdot\mid\theta_l)$, where $\theta_l$ is the vector of parameters characterizing the l-th density. Given the ill-defined nature of the model in (10) [6], the use of improper priors for the parameters is not acceptable. In particular, if we consider the densities $f_l$ to be location-scale distributions, Ref. [6] shows that under certain circumstances Jeffreys priors cannot be used, due to their improperness. For example, if all parameters are unknown (i.e., weights, location and scale parameters), then Jeffreys prior yields improper posteriors. Even in more restrictive circumstances the use of improper priors is troublesome; if we consider only the location parameters unknown, then Jeffreys prior yields improper posteriors for $k > 2$. The above represents a severe limitation in Bayesian analysis. The objective priors proposed in this paper therefore represent a clear solution to this problem, avoiding reparameterization or the addition of extra parameters, as proposed, for example, in [6], the latter resulting in increased uncertainty.

3.1. Single Sample

In this first simulation study, we illustrate the performance of the proposed prior on the following three-component mixture model, from which we have drawn a sample of size n = 200 ,
$$0.25\,N(-10,\ 1.2^2) + 0.65\,N(0,\ 1) + 0.10\,N(7,\ 0.8^2). \qquad (11)$$
In terms of prior distributions, we have assumed a symmetric Dirichlet prior with concentration parameters equal to one for the weights, and, for the means and standard deviations of the Gaussian components, the proposed priors on $(-\infty,\infty)$ and on $(0,\infty)$, respectively. We have also assumed independent information, so the prior for the parameters of the components has the following form,
$$\pi(\boldsymbol{\mu}, \boldsymbol{\sigma}) = \prod_{l=1}^{3}\pi(\mu_l)\times\prod_{l=1}^{3}\pi(\sigma_l),$$
where $\boldsymbol{\mu} = (\mu_1, \mu_2, \mu_3)$ and $\boldsymbol{\sigma} = (\sigma_1, \sigma_2, \sigma_3)$. The histogram of the sample, together with the true density, is shown in Figure 2.
The analysis uses a Metropolis-within-Gibbs algorithm with normal random walk proposal, with a total of 60,000 iterations, a burn-in of 10,000 and a thinning of 100. In Table 5 we report the posterior means and the 95% posterior credible intervals for the parameters of the mixture model, where we note that the true values are within the posterior credible intervals.
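The following is a compact sketch of a Metropolis-within-Gibbs sampler of the kind described above (our own illustration, not the authors' code; step sizes, initial values and the synthetic data are arbitrary choices, and no identifiability constraint is imposed on the component labels).

```python
# Sketch of Metropolis-within-Gibbs for a k-component normal mixture with a
# symmetric Dirichlet(1,...,1) prior on the weights, prior (9) (a = 1) on each
# mean and prior (7) (a = 1) on each standard deviation.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
k = 3
y = np.concatenate([rng.normal(-10, 1.2, 50),
                    rng.normal(0, 1.0, 130),
                    rng.normal(7, 0.8, 20)])        # synthetic data

def log_prior_mu(m):                                # prior (9), a = 1
    return np.log(0.5) - 2 * np.log(1 + np.abs(m))

def log_prior_sigma(s):                             # prior (7), a = 1
    return -np.inf if s <= 0 else -2 * np.log(1 + s)

mu = np.array([-5.0, 0.0, 5.0])
sigma = np.ones(k)
w = np.full(k, 1.0 / k)

for it in range(2000):
    # 1. Gibbs step for the latent allocations
    logp = np.log(w) + norm.logpdf(y[:, None], mu, sigma)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(k, p=pi) for pi in p])

    # 2. Gibbs step for the weights (Dirichlet posterior)
    counts = np.bincount(z, minlength=k)
    w = rng.dirichlet(1.0 + counts)

    # 3. Random-walk Metropolis steps for each mean and standard deviation
    for l in range(k):
        yl = y[z == l]
        def loglik(m, s):
            return np.sum(norm.logpdf(yl, m, s)) if s > 0 else -np.inf
        prop = mu[l] + 0.2 * rng.normal()
        if np.log(rng.uniform()) < (loglik(prop, sigma[l]) + log_prior_mu(prop)
                                    - loglik(mu[l], sigma[l]) - log_prior_mu(mu[l])):
            mu[l] = prop
        prop = sigma[l] + 0.1 * rng.normal()
        if np.log(rng.uniform()) < (loglik(mu[l], prop) + log_prior_sigma(prop)
                                    - loglik(mu[l], sigma[l]) - log_prior_sigma(sigma[l])):
            sigma[l] = prop

# A real analysis would store draws after burn-in; here we just print the state.
print(mu, sigma, w)
```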

3.2. Repeated Sampling

To gain a more thorough understanding of the proposed prior's behavior, we have performed some experiments under repeated sampling, taking into consideration different scenarios, which include different sample sizes and model structures. We have limited the analysis to mixtures of normal densities, but due to the properness of the prior its implementation can obviously be extended to any family of densities for the mixture components. We computed the posterior distribution for M = 20 replications of samples of size n = (50, 100, 200) from mixture models with number of components k = (3, 4, 5). For these illustrations, we report the results for the means and the standard deviations of each component, as estimated using the proposed prior. The models used are as follows:
$$\tfrac{1}{3}\,N(-10,\ 1) + \tfrac{1}{3}\,N(0,\ 0.8^2) + \tfrac{1}{3}\,N(7,\ 1.2^2)$$
$$\tfrac{1}{4}\,N(-10,\ 1) + \tfrac{1}{4}\,N(-3,\ 0.9) + \tfrac{1}{4}\,N(0,\ 0.8) + \tfrac{1}{4}\,N(7,\ 1.2)$$
$$\tfrac{1}{5}\,N(-10,\ 1) + \tfrac{1}{5}\,N(-3,\ 0.9) + \tfrac{1}{5}\,N(0,\ 0.8) + \tfrac{1}{5}\,N(3,\ 1) + \tfrac{1}{5}\,N(7,\ 1.2).$$
Please note that we have not varied the weights, as these are not assigned the proposed prior.
Figure 3 shows the boxplots of the posterior means for the μ of the mixture components, while Figure 4 shows the boxplots of the posterior means for the standard deviations σ . As one would expect, the larger the sample size the less variability in the repeated estimates, for the same number of components. Keeping the sample size fixed, we notice an increase in the variability of the estimates as the number of components grows, which is also an expected result.

3.3. Real Data Analysis

In this section, we analyze the well-known galaxy data set, which contains the velocities of 82 galaxies in the Corona Borealis region. To support a particular theory about the formation of galaxies, the analysis aims to estimate the number of stellar populations. This is a benchmark data set, well investigated in the literature, for example in [14,15,16], among others. We consider the galaxy velocities as random variables distributed according to a mixture of k normal densities. The estimation of the number of components has proved to be delicate, with estimates ranging from 3 to 7, depending on factors such as the priors for the parameters and the Bayesian method used. The histogram of the data set with a superimposed density is presented in Figure 5.
To select the number of components in the mixture of normal densities, we have fitted models with k = (2, 3, 4, 5, 6, 7, 8) components, computing the Deviance Information Criterion (DIC) under each model. The results are reported in Table 6. According to the computed index, we identify as best model the mixture with k = 4 components, which is in line with [16], and slightly more conservative than [15,16], where the number of components with non-zero weight is 5. Table 7 shows the posterior means for the parameters of the 4 estimated components.
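For reference, the DIC used above can be computed from posterior draws with the standard formula DIC = D(θ̄) + 2 p_D, where p_D is the mean deviance minus the deviance at the posterior mean; the sketch below is ours (not the authors' implementation) and ignores the label-switching issue that a careful implementation would need to address before averaging component-specific parameters.

```python
# Generic sketch of the Deviance Information Criterion from posterior draws
# of mixture parameters (w, mu, sigma); not the authors' implementation.
import numpy as np
from scipy.stats import norm

def mixture_loglik(y, w, mu, sigma):
    """Log-likelihood of a k-component normal mixture."""
    dens = w * norm.pdf(y[:, None], mu, sigma)       # shape (n, k)
    return np.sum(np.log(dens.sum(axis=1)))

def dic(y, draws):
    """draws: list of (w, mu, sigma) tuples sampled from the posterior."""
    deviances = np.array([-2 * mixture_loglik(y, w, m, s) for w, m, s in draws])
    w_bar = np.mean([w for w, _, _ in draws], axis=0)
    mu_bar = np.mean([m for _, m, _ in draws], axis=0)
    sigma_bar = np.mean([s for _, _, s in draws], axis=0)
    d_at_mean = -2 * mixture_loglik(y, w_bar, mu_bar, sigma_bar)
    p_d = deviances.mean() - d_at_mean               # effective number of parameters
    return d_at_mean + 2 * p_d                       # equivalently 2*mean(D) - D(mean)
```

Fitting the mixture for each k and comparing the returned values reproduces the kind of comparison reported in Table 6.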

4. Variance Parameters in Hierarchical Models

In this section, we discuss a well-known implementation of a hierarchical model that is proposed, for example, in [17]. The basic two-level hierarchical model is as follows:
$$y_{ij} \sim N(\mu + \alpha_j,\ \sigma_y^2), \qquad i = 1,\ldots,n,\ \ j = 1,\ldots,J,$$
$$\alpha_j \sim N(0,\ \sigma_\alpha^2), \qquad j = 1,\ldots,J.$$
This model has three parameters, namely μ, $\sigma_y$ and $\sigma_\alpha$. However, our interest in this paper is in $\sigma_\alpha$ only, noting that "regular" objective priors can be used for the remaining parameters, such as $\pi(\mu, \sigma_y) \propto 1$, for example. Although improper, this prior yields a proper posterior on these parameters.
The actual concern is the variance (scale) parameter $\sigma_\alpha$: if we were to put an improper prior on it, then the corresponding posterior would most likely be improper as well. To compare with the proposed prior, we assign an inverse-gamma prior to the variance, with parameters set so as to define a very vague probability distribution. This is recommended, for example, in [18], where the prior is $\pi(\sigma_\alpha^2) \sim IG(\varepsilon, \varepsilon)$, with $\varepsilon > 0$ sufficiently small. We do not discuss in detail the appropriateness of the above choice, or other alternatives; the reader can refer to [17], for example, for a thorough discussion.
The method used to obtain the posterior samples is a Metropolis-within-Gibbs in both cases, with 40,000 iterations, a burn-in of 20,000 and a thinning of 10.
The data consist of J = 8 educational testing experiments, where the parameters $\alpha_1, \ldots, \alpha_8$ represent the relative effects of Scholastic Aptitude Test coaching programs in different schools. In this example, the parameter $\sigma_\alpha$ represents the between-school variability (standard deviation) of the effects. Table 8 shows the data.
We have compared the marginal posteriors $\pi(\sigma_\alpha^2 \mid y)$ obtained using the inverse-gamma prior with $\varepsilon = 1$ and the proposed prior in (7) with a = 1. The histograms of the marginal posteriors are shown in Figure 6, where we note similar results. The statistics of the posteriors are reported in Table 9, where we note a less informative distribution when the proposed prior is employed. This is expected, as the inverse-gamma distribution is considered a relatively informative one [17].
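A minimal sketch of the joint log-posterior for this kind of comparison is given below. It is ours, not the authors' code, and it adopts the common treatment of the eight-schools data in which each y_j has a known standard error σ_j (as in Table 8), a flat prior on μ, and the score prior (7) with a = 1 on σ_α²; whether the authors used exactly this likelihood is not stated above, so treat that as an assumption.

```python
# Sketch: log-posterior for the eight-schools data, taking y_j ~ N(mu + alpha_j, s_j^2)
# with the standard errors s_j known, a flat prior on mu, and the score prior
# (7) with a = 1 on sigma_alpha^2.
import numpy as np

y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])      # effects, Table 8
s = np.array([15., 10., 16., 11., 9., 11., 10., 18.])    # standard errors, Table 8

def log_post(theta):
    """theta = (mu, alpha_1, ..., alpha_8, sigma_alpha^2)."""
    mu, alpha, sa2 = theta[0], theta[1:9], theta[9]
    if sa2 <= 0:
        return -np.inf
    loglik = -0.5 * np.sum(((y - mu - alpha) / s)**2)
    log_alpha = -0.5 * len(alpha) * np.log(sa2) - np.sum(alpha**2) / (2 * sa2)
    log_prior = -2 * np.log(1 + sa2)                     # score prior on sigma_alpha^2
    return loglik + log_alpha + log_prior

print(log_post(np.concatenate([[8.0], np.zeros(8), [10.0]])))
```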

5. Discussion

In this paper, we have derived a class of objective prior distributions that have the appealing properties of being proper and heavy-tailed. These have been obtained by exploiting a straightforward approach to the construction of score functions (proposed here). In detail, using a convex function $\alpha(\cdot)$ we can find the score function involving the first two derivatives using (5). The Hyvärinen score arises with $\alpha(u) = u^2$; we have instead used $\alpha(u) = u^{-2}$ and employed it to construct objective prior distributions using the methodology introduced in [7].
The class of priors is heavy-tailed, behaving as $1/x^2$ for large $|x|$; this is immediately obvious since the prior on $(0,\infty)$ is a Lomax distribution with shape 1. In this respect, it behaves similarly to standard objective priors, but comes without the problem of being improper. The benefit of using a proper prior is that the posterior is automatically proper and so does not need to be checked.
We have shown that, when compared with Jeffreys prior on simulated data, the frequentist performance of the prior distributions derived from score functions is nearly equivalent. In addition, we have shown on both simulated and real data that the proposed prior is suitable for a key scenario where improper priors (e.g., Jeffreys and reference priors) are not suitable (or are yet to be found). We have also illustrated the prior on a common problem for hierarchical models, i.e., assigning an objective prior to the variance parameter.
As a final point, we briefly discuss the case where a prior is needed on a multidimensional parameter space. Say we have a model with k parameters, i.e., $\theta = (\theta_1,\ldots,\theta_k)$, where $\theta_j \in \Theta_j$, for $j = 1,\ldots,k$. We also assume that the one-dimensional space for each parameter is either $(0,\infty)$ or $(-\infty,\infty)$. Assuming k is relatively large, and setting aside some specific statistical models such as regression models or graphical models, a common practice for assigning objective priors on θ is as follows:
$$\pi(\theta) = \pi_1(\theta_1)\times\cdots\times\pi_k(\theta_k).$$
In other words, the parameters are assumed to be independent a priori, so the joint prior distribution is the product of the marginal priors on each parameter. We can then set $\pi_j(\theta_j)$ to be either (7) or (9), for $j = 1,\ldots,k$, depending on $\Theta_j$.
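As a sketch of this construction (ours), the joint log-prior can be assembled from the marginals (7) and (9) with a = 1, with a flag indicating whether each component lives on (0, ∞) or (−∞, ∞).

```python
# Sketch: joint log-prior built as a product of the marginals (7) and (9), a = 1.
import numpy as np

def log_prior_positive(t):        # prior (7) on (0, inf)
    return -np.inf if t <= 0 else -2 * np.log(1 + t)

def log_prior_real(t):            # prior (9) on (-inf, inf)
    return np.log(0.5) - 2 * np.log(1 + abs(t))

def joint_log_prior(theta, positive):
    """theta: parameter vector; positive: booleans marking the (0, inf) components."""
    return sum(log_prior_positive(t) if pos else log_prior_real(t)
               for t, pos in zip(theta, positive))

print(joint_log_prior([2.0, -1.5, 0.7], positive=[True, False, True]))
```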

Author Contributions

The authors contributed equally. Both authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The paper has benefited from the comments of anonymous referees.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Information and Bregman Divergence

Information for a density function is assumed to be convex, based on the obvious notion that averaging reduces information [19]. Hence, if we write $I(p) = \int\phi(p)$, then ϕ is assumed convex. For example, $\phi(p) = p\log p$ is convex, as is $\phi(p,p') = (p')^2/p$, being convex in $(p,p')$.
With a convex function we can set up a Bregman divergence. First, recall that a Bregman divergence (or Bregman distance) is a measure of the distance between two points, and it is defined in terms of a convex function. In particular, when the two points are probability distributions, then we have a statistical distance. Probably the most elementary Bregman distance is the (squared) Euclidean distance.
Let us start by considering the case m = 2, so that ϕ is a function of $(p, p')$. Let u and v be vectors in $\mathbb{R}^d$. The Bregman divergence with convex function $\phi: \mathbb{R}^d \to \mathbb{R}$ is given by
$$B_\phi(u, v) = \phi(u) - \phi(v) - \langle\nabla\phi(v),\, u - v\rangle,$$
where the right most term is
$$\langle\nabla\phi(v),\, u - v\rangle = \sum_{j=1}^{d}\frac{\partial\phi(v)}{\partial v_j}\,(u_j - v_j).$$
So $B_\phi(u,v) \geq 0$, and equality holds if and only if $u = v$.
In this work, we will focus on such divergences where the arguments are probability density functions [20]. Hence, in the one-dimensional case, with $\phi: \mathbb{R}^+ \to \mathbb{R}$ a convex function and p(x) a density function, define
$$B_\phi(p(x), q(x)) = \phi(p(x)) - \phi(q(x)) - \phi'(q(x))\,\big(p(x) - q(x)\big),$$
and the divergence between p and q is defined as
$$D(p,q) = \int B_\phi(p(x), q(x))\,dx.$$
For example, if ϕ ( p ) = p log p , which is convex in p, then
$$D(p,q) = \int\left[p\log p - q\log q - (1+\log q)(p - q)\right] = \int p\log(p/q),$$
which is the well-known Kullback–Leibler divergence. The $L_2$ divergence arises when $\phi(p) = p^2$, with $B_\phi(p,q) = (p-q)^2$.
Please note that we can write
$$D(p,q) = \int\phi(p) - \int p\,\frac{\partial\phi}{\partial q} + \int\left[q\,\frac{\partial\phi}{\partial q} - \phi(q)\right],$$
and so we see that
$$S(q) = -\frac{\partial\phi}{\partial q} + \int\left[q\,\frac{\partial\phi}{\partial q} - \phi(q)\right]$$
is a score function, the Bregman score, and it is local if ϕ satisfies $q\,\partial\phi/\partial q = \phi(q)$. In this case, the only solution is the log-score. As mentioned above, the log-score is the sole local and proper score function that depends on q only. The Bregman divergence and the Bregman score therefore confirm this, together with the results in [2]. We will see this is also the case with order m = 2.
The extension of the Bregman divergence and the Bregman score that we discuss in this paper follows from a two-dimensional Bregman divergence with arguments $(p, p')$, where $p'(x) = dp/dx$, for some two-dimensional convex function $\phi(u,v)$. It is defined by
$$B_\phi(p(x), q(x)) = \phi(p(x), p'(x)) - \phi(q(x), q'(x)) - \frac{\partial\phi}{\partial q}(x)\,\big(p(x) - q(x)\big) - \frac{\partial\phi}{\partial q'}(x)\,\big(p'(x) - q'(x)\big), \qquad (A3)$$
For example, if $\phi(u,v) = v^2/u$, which is easily shown to be convex, we obtain
$$D(p,q) = \int p\left(\frac{p'}{p} - \frac{q'}{q}\right)^2,$$
known as the Fisher information divergence; see, for example, [21].
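This last identification is easy to check symbolically; the sketch below (ours, not from the paper) expands the two-dimensional Bregman integrand for φ(u, v) = v²/u and confirms that it equals p (p′/p − q′/q)².

```python
# Symbolic check: the Bregman integrand with phi(u, v) = v**2/u equals
# p*(p'/p - q'/q)**2, i.e. the Fisher information divergence integrand.
import sympy as sp

x = sp.symbols('x')
p = sp.Function('p')(x)
q = sp.Function('q')(x)
u, v = sp.symbols('u v')
phi = v**2 / u

def at(expr, f):
    """Evaluate an expression in (u, v) at (f, f')."""
    return expr.subs({u: f, v: sp.diff(f, x)})

bregman = (at(phi, p) - at(phi, q)
           - at(sp.diff(phi, u), q) * (p - q)
           - at(sp.diff(phi, v), q) * (sp.diff(p, x) - sp.diff(q, x)))
target = p * (sp.diff(p, x) / p - sp.diff(q, x) / q)**2
print(sp.simplify(bregman - target))   # -> 0
```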

Appendix A.2. Divergences and Scores

The two most well-known measures of information are the differential Shannon and Fisher informations. See, for example, [22]. The associated divergences are the Kullback–Leibler and Fisher divergences, respectively, with corresponding score functions the logarithmic and Hyvärinen scores. The connection between divergence, information and score functions can be understood from the following:
$$D(p,q) = I(p) + \int p\,S(q). \qquad (A4)$$
Here D denotes divergence, I information and S score. From this it is clear that $\int p\,S(q)$ is minimized at $q = p$. For (A4) based on the Kullback–Leibler divergence, we have
$$D(p,q) = \int p\log(p/q), \qquad I(p) = \int p\log p \qquad\text{and}\qquad S(q) = -\log q.$$
For (A4) based on Fisher information, we have
$$D(p,q) = \int p\left(\frac{p'}{p} - \frac{q'}{q}\right)^2, \qquad I(p) = \int\frac{(p')^2}{p} \qquad\text{and}\qquad S(q) = 2\,\frac{q''}{q} - \left(\frac{q'}{q}\right)^2.$$
From (A1) we obtain
$$D(p,q) = \int\phi(p) - \int p\,\frac{\partial\phi}{\partial q} + \int\left[q\,\frac{\partial\phi}{\partial q} - \phi(q)\right],$$
so
$$S(q) = -\frac{\partial\phi}{\partial q} + \int\left[q\,\frac{\partial\phi}{\partial q} - \phi(q)\right].$$
More generally, using the two-dimensional Bregman divergence in (A3) and putting it into (A2), we obtain
$$D(p,q) = \int\phi(p, p') + \int p\,S(q, q', q''),$$
where the score is
$$S(q,q',q'') = -\frac{\partial\phi}{\partial q} + \frac{d}{dx}\frac{\partial\phi}{\partial q'} - \int\left[\phi(q,q') - q\,\frac{\partial\phi}{\partial q} - q'\,\frac{\partial\phi}{\partial q'}\right].$$
The score $S(q)$ has been obtained by integration by parts, assuming that $[\,p\cdot\partial\phi/\partial q'\,]$ vanishes at the boundary points. To ensure the score is local, we use the following condition on ϕ:
$$\phi(u,v) = u\,\frac{\partial\phi}{\partial u} + v\,\frac{\partial\phi}{\partial v}. \qquad (A7)$$
Hence, the proposed class of score functions, which effectively is the class introduced in [10], is given by
$$S(q,q',q'') = -\frac{\partial\phi}{\partial q} + \frac{d}{dx}\frac{\partial\phi}{\partial q'}. \qquad (A8)$$
The derivation just presented is arguably much simpler than the one used in [10], as we have avoided any variational analysis and the use of differential operators.
For a specific example, consider the class of convex functions, satisfying (A7), given by
$$\phi(u,v) = u\,\alpha(v/u), \qquad (A9)$$
for some convex function $\alpha: \mathbb{R} \to \mathbb{R}$. The convexity of α implies the convexity of the function ϕ in (A9), as the following Lemma A1 shows.
Lemma A1.
The function $\phi(u,v) = u\,\alpha(v/u)$ is convex when α is convex.
Proof. 
The Hessian matrix corresponding to ϕ is seen to be
$$\frac{\alpha''(v/u)}{u}\begin{pmatrix}(v/u)^2 & -v/u\\ -v/u & 1\end{pmatrix},$$
which is non-negative definite when $\alpha'' \geq 0$. Given that α is assumed to be convex, the condition $\alpha'' \geq 0$ holds. □
Given the above form for α , we are now able to find the divergence, the information and the score. However, before proceeding, we first note that
$$-\phi(q,q') + q\,\frac{\partial\phi(q,q')}{\partial q} + q'\,\frac{\partial\phi(q,q')}{\partial q'} = 0. \qquad (A10)$$
That is, ϕ satisfies the condition in (A7). In fact, we have
$$-u\,\alpha(v/u) + u\left[\alpha(v/u) - \frac{v}{u}\,\alpha'(v/u)\right] + v\,\alpha'(v/u) = 0$$
for all ( u , v ) .

Appendix A.3. Higher Order Score Functions

In this section, we look at score functions using an arbitrary number of derivatives; that is, we allow m > 2 . So let ϕ now be a convex function on ( m + 1 ) dimensions:
$$\phi(u_0,\ldots,u_m) \geq \phi(v_0,\ldots,v_m) + \sum_{j=0}^{m}\frac{\partial\phi}{\partial v_j}\,(u_j - v_j).$$
The Bregman divergence can be written as
$$B_\phi(q, p) = \phi(q_0, q_1,\ldots,q_m) - \phi(p_0, p_1,\ldots,p_m) - \sum_{j=0}^{m}\frac{\partial\phi}{\partial p_j}\,(q_j - p_j),$$
where the subscript j indicates the order of differentiation, with, for example, $p_0 = p$ and $p_j = d^j p/dx^j$. Additionally, we have $D(p,q) = \int B_\phi(p, q)$. In this section, to keep the notation readable, we set $p = (p_0, p_1, \ldots, p_m)$ and $q = (q_0, q_1, \ldots, q_m)$. If we have
$$\phi(p) = \sum_{j=0}^{m}\frac{\partial\phi}{\partial p_j}\,p_j,$$
and the derivatives disappear at boundary values, using multiple integration by parts we obtain
$$D(q,p) = \int\phi(p) - \int\sum_{j=0}^{m} p_j\,\frac{\partial\phi}{\partial q_j} = \int\phi(p) - \int p\sum_{j=0}^{m}(-1)^j\,\frac{d^j}{dx^j}\frac{\partial\phi}{\partial q_j}.$$
Hence, the score function is given by
$$S(q) = \sum_{j=0}^{m}(-1)^{j+1}\,\frac{d^j}{dx^j}\frac{\partial\phi}{\partial q_j}.$$
Please note that with m = 0 and $\phi(u) = u\log u - u$, we recover the log-score function, i.e., $S(q) = -\log q$. With m = 1 and $\phi(u,v) = v^2/u$, we recover the Hyvärinen score function.
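Both recoveries can be confirmed symbolically; the sketch below (ours, not from the paper) implements the general score formula above and evaluates it for m = 0 with φ(u) = u log u − u and for m = 1 with φ(u, v) = v²/u.

```python
# Symbolic check of the general score formula
#   S(q) = sum_j (-1)**(j+1) d^j/dx^j  dphi/dq_j
# for m = 0 (log-score) and m = 1 (Hyvarinen score).  Illustrative sketch only.
import sympy as sp

x = sp.symbols('x')
q = sp.Function('q')(x)

def score(phi, syms):
    """phi: expression in syms = (q, q', ..., q^(m)); returns S as a function of x."""
    derivs = [sp.diff(q, x, j) for j in range(len(syms))]
    total = 0
    for j, s in enumerate(syms):
        partial = sp.diff(phi, s).subs(dict(zip(syms, derivs)))
        total += (-1)**(j + 1) * sp.diff(partial, x, j)
    return total

u0, u1 = sp.symbols('u0 u1')

S_log = score(u0 * sp.log(u0) - u0, (u0,))
print(sp.simplify(S_log + sp.log(q)))          # -> 0, i.e. S = -log q

S_hyv = score(u1**2 / u0, (u0, u1))
target = 2 * sp.diff(q, x, 2) / q - (sp.diff(q, x) / q)**2
print(sp.simplify(S_hyv - target))             # -> 0
```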
A general form of convex function satisfying the requirement (A11) is
$$\phi(u_0,\ldots,u_m) = \sum_{j=0}^{m-1} u_j\,\alpha_j(u_{j+1}/u_j),$$
where $(\alpha_0, \alpha_1, \ldots, \alpha_{m-1})$ is a set of convex functions.

Appendix A.4. Connection with Parry, M. et al., 2012

In Appendix A.2, we have shown that it is possible to derive the class of score functions using convex functions and the Bregman divergence only. Here we show that the properties we have used imply those used by [10].
First we show that if a function ϕ satisfies (A7), then ϕ is 1-homogeneous.
Lemma A2.
Let ϕ be such that $\phi(u,v) = u\,\partial\phi/\partial u + v\,\partial\phi/\partial v$. Then, for any $\lambda > 0$, it holds that $\phi(\lambda u, \lambda v) = \lambda\,\phi(u,v)$.
Proof. 
Let $T(\lambda) = \phi(\lambda u, \lambda v)$. Then $\partial T/\partial\lambda = T(\lambda)/\lambda$, which implies $T(\lambda) \propto \lambda$. □
Next we show that the class of score functions (A8) and the condition (A7) imply (3).
Lemma A3.
If a function ϕ satisfies the condition in (A7), and a score S is given by (A8), then the score S satisfies Equation (3).
Proof. 
The proof is a simple matter of algebra and calculus. It requires the key observation that
$$\frac{\partial}{\partial q}\,\frac{d}{dx} = \frac{d}{dx}\,\frac{\partial}{\partial q} \qquad\text{and}\qquad \frac{\partial}{\partial q'}\,\frac{d}{dx} = \frac{d}{dx}\,\frac{\partial}{\partial q'} + \frac{\partial}{\partial q}.$$
Given that $\phi = q\,\partial\phi/\partial q + q'\,\partial\phi/\partial q'$, we obtain
$$q\,\frac{\partial^2\phi}{\partial q^2} + q'\,\frac{\partial^2\phi}{\partial q\,\partial q'} = 0 \qquad\text{and}\qquad q\,\frac{\partial^2\phi}{\partial q\,\partial q'} + q'\,\frac{\partial^2\phi}{\partial q'^2} = 0.$$
Since $S = -\partial\phi/\partial q + q'\,\partial^2\phi/\partial q\,\partial q' + q''\,\partial^2\phi/\partial q'^2$, we have $\partial S/\partial q'' = \partial^2\phi/\partial q'^2$. Furthermore,
$$\frac{\partial S}{\partial q'} = -\frac{\partial^2\phi}{\partial q\,\partial q'} + \frac{d}{dx}\frac{\partial^2\phi}{\partial q'^2} + \frac{\partial^2\phi}{\partial q\,\partial q'} \qquad\text{and}\qquad \frac{\partial S}{\partial q} = -\frac{\partial^2\phi}{\partial q^2} + \frac{d}{dx}\frac{\partial^2\phi}{\partial q\,\partial q'}.$$
Combining all the above expressions yields (3). □
Finally, in this Appendix, we show that if ϕ is 1-homogeneous, then ϕ is convex. The result is quite straightforward, and is achieved by showing that $\frac{\partial^2\phi}{\partial u^2}\cdot\frac{\partial^2\phi}{\partial v^2} \geq \left(\frac{\partial^2\phi}{\partial u\,\partial v}\right)^2$.
Here we make the connection with Equation (39) in [10] and the score function in (A12). The main result is to show that if
$$\phi(u) = \sum_{j=0}^{m} u_j\,\frac{\partial\phi}{\partial u_j} \qquad\text{and}\qquad S(u) = \sum_{j=0}^{m}(-1)^{j+1}\,\frac{d^j}{dx^j}\frac{\partial\phi}{\partial u_j},$$
then the relevant Euler–Lagrange equation is
$$\sum_{j=0}^{2m}(-1)^{j}\,\frac{d^j}{dx^j}\left(u_0\,\frac{\partial S}{\partial u_j}\right) = 0.$$
To this end, define the operators, as in [10]
$$E = \sum_j u_j\,\frac{\partial}{\partial u_j}, \qquad \Lambda = \sum_j(-1)^{j+1}\,\frac{d^j}{dx^j}\frac{\partial}{\partial u_j} \qquad\text{and}\qquad L = \sum_j(-1)^{j}\,\frac{d^j}{dx^j}\,u_0\,\frac{\partial}{\partial u_j}.$$
Theorem A1.
If $S = \Lambda\phi$ and $E\phi = \phi$, then $LS = 0$.
Proof. 
To prove the theorem, we consider the properties of the differential operators discussed in Section 5 of [10]; in particular, the properties $\Lambda E = E\Lambda + \Lambda$ and $EL = LE$.
Hence, $L\Lambda\phi = L\Lambda E\phi = L(E\Lambda + \Lambda)\phi$, which implies that $LE\Lambda\phi = 0$. This in turn implies $EL\Lambda\phi = 0$; i.e., $E\,LS = 0$. Finally, it is easy to see that if $E\psi = 0$ then $\psi = 0$. □

References

  1. Jeffreys, H. Theory of Probability, 3rd ed.; Oxford University Press: New York, NY, USA, 1961. [Google Scholar]
  2. Bernardo, J.M. Expected information as expected utility (with discussion). Ann. Stat. 1979, 7, 686–690. [Google Scholar] [CrossRef]
  3. Berger, J.O. The case for objective Bayesian analysis. Bayesian Anal. 2006, 1, 385–402. [Google Scholar] [CrossRef]
  4. Berger, J.O.; Pericchi, L.R. The intrinsic Bayes Factor for model selection and prediction. J. Am. Stat. Assoc. 1996, 91, 109–122. [Google Scholar] [CrossRef]
  5. O’Hagan, A. Fractional Bayes Factors for model comparison. J. R. Stat. Soc. B 1995, 57, 99–138. [Google Scholar] [CrossRef]
  6. Grazian, C.; Robert, C.P. Jeffreys priors for mixture estimation: Properties and alternatives. Comput. Stat. Data Anal. 2018, 121, 149–163. [Google Scholar] [CrossRef] [Green Version]
  7. Leisen, F.; Villa, C.; Walker, S.G. On a class of objective priors from scoring rules (with discussion). Bayesian Anal. 2020, 15, 1345–1423. [Google Scholar] [CrossRef]
  8. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  9. Hyvärinen, A. Estimation of non–normalized statistical models by score matching. J. Mach. Learn. Res. 2005, 6, 695–709. [Google Scholar]
  10. Parry, M.; Dawid, A.P.; Lauritzen, S. Proper local scoring rules. Ann. Stat. 2012, 40, 561–592. [Google Scholar] [CrossRef]
  11. Ehm, W.; Gneiting, T. Local proper scoring rules of order 2. Ann. Stat. 2012, 40, 609–637. [Google Scholar] [CrossRef] [Green Version]
  12. Bregman, L.M. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
  13. Lomax, K.S. Business Failures; Another example of the analysis of failure data. J. Am. Stat. Assoc. 1954, 49, 847–852. [Google Scholar] [CrossRef]
  14. Escobar, M.; West, M. Bayesian prediction and density estimation. J. Am. Stat. Assoc. 1995, 90, 577–588. [Google Scholar] [CrossRef]
  15. Richardson, S.; Green, P. On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. R. Stat. Soc. Ser. B 1997, 59, 731–792. [Google Scholar] [CrossRef]
  16. Grazian, C.; Villa, C.; Liseo, B. On a loss-based prior for the number of components in mixture models. Stat. Probab. Lett. 2019, 158, 1–7. [Google Scholar] [CrossRef] [Green Version]
  17. Gelman, A. Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 2006, 1, 515–533. [Google Scholar] [CrossRef]
  18. Spiegelhalter, D.J.; Thomas, A.; Best, N.G.; Gilks, W.R.; Lunn, D. BUGS: Bayesian Inference Using Gibbs Sampling Manual; MRC Biostatistics Unit: Cambridge, UK, 1996; Available online: https://www2.isye.gatech.edu/isyebayes/bank/manual05.pdf (accessed on 24 June 2021).
  19. Topsøe, F. Basic concepts, identities and inequalities—The basic toolkit of information theory. Entropy 2001, 3, 162–190. [Google Scholar] [CrossRef]
  20. Stummer, W.; Vajda, I. On Bregman distances and divergences of probability measures. IEEE Trans. Inf. Theory 2012, 58, 1277–1288. [Google Scholar] [CrossRef] [Green Version]
  21. Villani, C. Optimal Transport, Old and New; Springer: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
  22. MacKay, D.J.C. Information Theory, Inference, and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
Figure 1. The prior $q(x) = 1/(1+x)^2$.
Figure 2. Histogram of the sample of size n = 200 from model (11), and density (red line) of the true model.
Figure 3. Boxplots of posterior means of the means μ, using the proposed prior, for samples of size n = (50, 100, 200), plotted by row, and number of components k = (3, 4, 5), plotted by column.
Figure 4. Boxplots of posterior means of the standard deviations σ, using the proposed prior, for samples of size n = (50, 100, 200), plotted by row, and number of components k = (3, 4, 5), plotted by column.
Figure 5. Histogram of the Galaxy data set with a smoothed density (red curve) superimposed.
Figure 6. Histogram of the marginal posterior for σ α 2 for the School problem when using an inverse-gamma prior (left) and the proposed prior based on scores (right).
Table 1. Posterior MSE for σ, comparing Jeffreys prior against the prior based on scores for data generated by a normal density with mean μ = 0 and unknown variance. Results are based on 250 samples of size n = 30 and n = 100, with standard deviation σ = {0.25, 0.50, 1, 2, 5, 10, 20}.

σ       Jeffreys prior (n = 30)   Score prior (n = 30)   Jeffreys prior (n = 100)   Score prior (n = 100)
0.25    0.1368                    0.1371                 0.0723                     0.0720
0.5     0.1367                    0.1333                 0.0721                     0.0721
1       0.1365                    0.1329                 0.0721                     0.0716
2       0.1349                    0.1326                 0.0719                     0.0716
5       0.1347                    0.1333                 0.0722                     0.0720
10      0.1349                    0.1331                 0.0723                     0.0716
20      0.1353                    0.1337                 0.0722                     0.0718
Table 2. Posterior coverage of the 95% credible interval for σ, comparing Jeffreys prior against the prior based on scores for data generated by a normal density with mean μ = 0 and unknown variance. Results are based on 250 samples of size n = 30 and n = 100, with standard deviation σ = {0.25, 0.50, 1, 2, 5, 10, 20}.

σ       Jeffreys prior (n = 30)   Score prior (n = 30)   Jeffreys prior (n = 100)   Score prior (n = 100)
0.25    0.96                      0.95                   0.91                       0.91
0.5     0.95                      0.94                   0.92                       0.91
1       0.95                      0.94                   0.91                       0.91
2       0.95                      0.95                   0.93                       0.91
5       0.95                      0.94                   0.93                       0.91
10      0.94                      0.94                   0.90                       0.92
20      0.95                      0.95                   0.90                       0.93
Table 3. Posterior MSE for μ, comparing Jeffreys (flat) prior against the prior based on scores for data generated from a log-normal density with unknown location parameter μ and known scale parameter σ = 1. Results are based on 250 samples of size n = 30 and n = 100, with location parameter μ = {0, 1, 5, 10, 50, 100}.

μ       Jeffreys prior (n = 30)   Score prior (n = 30)   Jeffreys prior (n = 100)   Score prior (n = 100)
0       0.0290                    0.0208                 0.0114                     0.0007
1       0.0325                    0.0481                 0.0086                     0.0091
5       0.0325                    0.0324                 0.0086                     0.0086
10      0.0325                    0.0323                 0.0086                     0.0086
50      0.0325                    0.0323                 0.0086                     0.0087
100     0.0325                    0.0324                 0.0086                     0.0087
Table 4. Posterior coverage of the 95% credible interval for μ, comparing Jeffreys (flat) prior against the prior based on scores for data generated from a log-normal density with unknown location parameter μ and known scale parameter σ = 1. Results are based on 250 samples of size n = 30 and n = 100, with location parameter μ = {0, 1, 5, 10, 50, 100}.

μ       Jeffreys prior (n = 30)   Score prior (n = 30)   Jeffreys prior (n = 100)   Score prior (n = 100)
0       0.96                      0.98                   0.92                       1.00
1       0.94                      0.92                   0.97                       0.98
5       0.94                      0.94                   0.97                       0.98
10      0.94                      0.94                   0.97                       0.98
50      0.94                      0.94                   0.97                       0.98
100     0.94                      0.94                   0.97                       0.98
Table 5. Posterior means and 95% credible intervals (in brackets) for a sample of size n = 200 from model (11).

Component   ω                   μ                       σ
1           0.26 (0.21, 0.33)   −10.2 (−10.5, −9.9)     1.1 (0.9, 1.4)
2           0.66 (0.58, 0.72)   0.0 (−0.1, 0.2)         1.3 (1.1, 1.5)
3           0.08 (0.04, 0.12)   6.7 (6.2, 7.2)          0.9 (0.6, 1.4)
Table 6. Deviance Information Criterion for mixtures with number of components k = (2, 3, 4, 5, 6, 7, 8).

k       2        3        4        5        6        7        8
DIC     476.72   425.20   371.48   413.68   446.45   446.54   458.81
Table 7. Posterior means of the parameters for the k = 4 components. The results are reported in descending order of the weights.

Component   ω      μ       σ
1           0.32   19.16   0.71
2           0.31   23.59   3.46
3           0.29   22.01   4.59
4           0.10   9.31    0.58
Table 8. Observed effects (y_j) of special preparation on SAT scores in eight randomized experiments, where σ_j are the standard errors of the effect estimates.

School   y_j   σ_j
A        28    15
B        8     10
C        −3    16
D        7     11
E        −1    9
F        1     11
G        18    10
H        12    18
Table 9. Posterior statistics for the marginal distribution of σ_α² for the schools problem.

Prior           Mean   95% C.I.
Inverse-Gamma   3.8    (0.3, 23.9)
Score-based     2.3    (0.2, 57.8)