Article

Probability Distributions Approximation via Fractional Moments and Maximum Entropy: Theoretical and Computational Aspects

by Pier Luigi Novi Inverardi *,† and Aldo Tagliani
Department of Economics & Management, University of Trento, I-38122 Trento, Italy
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Submission received: 16 November 2023 / Revised: 20 December 2023 / Accepted: 23 December 2023 / Published: 30 December 2023
(This article belongs to the Special Issue Statistical Methods and Applications)

Abstract

In the literature, the use of fractional moments to express the available information in the maximum entropy (MaxEnt) approximation of a distribution F with finite or unbounded positive support has essentially been regarded as a computational tool for improving the performance of the analogous procedure based on integer moments. No attention has been paid to two formal aspects concerning fractional moments: the conditions for the existence of the MaxEnt approximation based on them, and the convergence in entropy of this approximation to F. This paper aims to fill this gap by providing proofs of these two fundamental results. Convergence in entropy can, in turn, be used to select the orders of the fractional moments optimally so as to accelerate the convergence of the MaxEnt approximation to F, to clarify how this type of convergence relates to other types of convergence useful in statistical applications, and to preserve some important prior features of the underlying distribution F.

1. Introduction

In statistical estimation, one often wants to guess an unknown probability distribution F, given certain observations based on it. There are generally infinitely many distributions consistent with the available data, and the question of which of these to select is an important one in many fields. The notion of entropy has been proposed as a remarkable tool for performing this choice. More precisely, the principle of maximum entropy was established by [1,2] as a tool for inference under uncertainty and consists of finding the most suitable probability distribution under the available information. As Jaynes [1] expressed it, the resulting MaxEnt distribution “… is the least biased estimate possible on the given information”. In summary, the MaxEnt method dictates what the most “reasonable and objective” distribution is, subject to given constraints expressing the available information about the data-generating mechanism. The analytical form of those constraints is chosen with an eye to the features of the distribution that we want to preserve in the MaxEnt approximation, so as to guarantee its capability of modeling specific features of F.
It is a common choice to express the constraints in terms of expectations of some functions $g_j$ of X, i.e.,
$$E[g_j(X)] = \int_U g_j(x) f(x)\,dx = c_j, \quad j = 1, 2, \dots, n \tag{1}$$
and the resulting maximum entropy distribution (better, its density) emerges by maximizing the Shannon (differential) entropy
$$h_f = -\int_U f(x) \ln f(x)\,dx \tag{2}$$
under a set of constraints (1) using calculus of variations and Lagrange’s multipliers method. The general solution, assuming an arbitrary set of $n + 1$ constraints $g_j$, is given by [1]
$$f_n(x) = \exp\Big(-\lambda_0 - \sum_{j=1}^{n} \lambda_j g_j(x)\Big) \tag{3}$$
where $\lambda_1, \dots, \lambda_n$ are the Lagrange multipliers linked to the set of adopted constraints (1), while the multiplier $\lambda_0$ guarantees the legitimacy of the distribution and $E(g_j(X))$, $j = 1, 2, \dots, n$, are the characterizing moments of the distribution. It becomes clear that the resulting maximum entropy distribution having density (3) is uniquely driven by the choice of the imposed constraints. This implies that this choice is the most important and determinative part of the MaxEnt method. In the end, the form of the MaxEnt approximation (3) of f is problem-dependent; that is, its analytical form depends on the choice of the constraints $g_j$ describing the features of the distribution F that must be preserved in the approximation process.
As we said above, in constructing a density with the MaxEnt methodology, only partial information can be used for practical purposes. However, this does not remove the need for some physical knowledge of the underlying problem, which amounts to the following:
  • The solution to the problem is unique.
  • The entire moment curve from which to pick up a finite number of arbitrary fractional moments is known.
This paper focuses on the opportunity offered by expressing the system of constraints (1) through fractional moments. This choice provides great flexibility for modeling a wide class of problems where the traditional integer moment constraints may be inefficient in recovering the available information. The entropy convergence theorem stated for the fractional moment setup, together with the related modes of convergence, offers a formal basis for the optimal choice of the number and orders of the fractional moments involved in MaxEnt approximation procedures. Throughout the paper we stress that theoretical and numerical aspects are inextricably linked, which is why both must be treated together. More precisely, Section 2 discusses some properties of fractional moments that motivate their use in the MaxEnt reconstruction of a distribution; Section 3 recalls some basics about Tchebycheff systems (T-systems, for brevity); using this tool, the existence and the convergence in entropy of the MaxEnt distribution constrained by fractional moments are proven in Section 4.1 and Section 4.2, respectively. Finally, Section 5 presents two crucial results concerning the optimal choice of the orders $\alpha_j$ and of the number n of fractional moments, both based on the convergence in entropy of the MaxEnt distribution.

2. The Role of Fractional Moments in MaxEnt Setup

Constraints (1) expressed by integer moments play an important role in the classical inverse Hausdorff and Stieltjes moment problems, which consist in determining an unknown probability mass or density function f corresponding to the distribution F from the knowledge of the sequence of its integer moments $m_j = E(X^j) = \int_U x^j f(x)\,dx$, $j \ge 1$, $m_0 = 1$, where $U = [0, 1]$ in the Hausdorff case or $U = [0, \infty)$ in the Stieltjes case.
For practical purposes, if only n prefixed moments are taken into account to express the available information about the distribution, then many different (even an infinity!) probability distributions could be compatible with that information and non-uniqueness of the distribution recovered from them follows immediately. Hence, the question: What probability distribution is the best and with respect to what criterion? The answer follows naturally from Jaynes’ principle [1]: from the set of all probability distributions compatible with the n prefixed moments, choose the one that maximizes Shannon’s entropy that is, the so-called MaxEnt distribution.
The MaxEnt approximation of f constrained by the first $n + 1$ integer moments, that is, $E[g_j(X)] = E(X^j) = m_j$, $j = 1, \dots, n$, where n is arbitrarily large and $m_0 = 1$ since any density is a normalized function, follows immediately from (3):
$$f_n(x) = \exp\Big(-\lambda_0 - \sum_{j=1}^{n} \lambda_j x^j\Big) \tag{4}$$
where $\lambda_j$, $j = 1, \dots, n$, are the Lagrange multipliers linked to the set of adopted constraints, while the multiplier $\lambda_0$ guarantees the legitimacy of the distribution. Widely known references are the books [3,4] and, more recently, [5]. These sources contain comprehensive details about a series of remarkable results paving the progress in moment problems for more than a century. Theoretical and computational aspects are inextricably linked to each other.
It is a well-known fact that in a determinate moment problem, the sequence of integer moments $\{m_j\}_{j=0}^{\infty}$ carries all the information concerning the distribution F; hence it may happen that the moments of high order also contain a considerable amount of it, as in the case of asymmetric or heavy-tailed distributions. But it is also well known that the moment problem becomes ill-conditioned when the number (hence, the order) n of moments increases, and to avoid numerical instability due to the ill-conditioning of Hankel matrices only the moments of small order n are involved. Neglecting higher-order moments implies losing the information carried by them, with consequences for the quality of $f_n$ as an approximation of f. Furthermore, if the first few moments are not informative with respect to the distribution, the situation is even worse, with $f_n$ being a definitely bad approximation of f.
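The ill-conditioning of Hankel matrices mentioned above can be illustrated with a small numerical sketch. The example below is not taken from the paper: it assumes, purely for illustration, the uniform density on [0, 1], whose integer moments are $m_k = 1/(k+1)$, so that the Hankel matrix of moments coincides with the Hilbert matrix. The function names are hypothetical.

```python
import numpy as np


def hankel_moment_matrix(n):
    """(n+1) x (n+1) Hankel matrix H[i, j] = m_{i+j}.

    Assumption for illustration: uniform density on [0, 1],
    whose integer moments are m_k = 1 / (k + 1).
    """
    idx = np.arange(n + 1)
    return 1.0 / (idx[:, None] + idx[None, :] + 1.0)


def hankel_condition_number(n):
    """2-norm condition number of the Hankel moment matrix of order n."""
    return np.linalg.cond(hankel_moment_matrix(n))


# The condition number grows very quickly with the number of moments.
conds = {n: hankel_condition_number(n) for n in (2, 4, 6, 8)}
for n, c in conds.items():
    print(f"n = {n}: cond(H) = {c:.3e}")
```

Under these assumptions the condition number increases by several orders of magnitude each time two more moments are added, which is the kind of behavior that motivates restricting integer-moment procedures to small n.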
As a practical example, Ref. [6] discusses the role played by the choice of constraints in modeling probability distributions of complex geophysical processes via the MaxEnt method. It concludes that the usual choice of integer moments, motivated by their physical meaning, is unable, for both theoretical and empirical (i.e., data-driven) reasons, to describe the relevant geophysical features that characterize the distribution and therefore have to be preserved (for more details, see pp. 52–53 of that paper).
At this point, a fundamental question is how to reformulate the MaxEnt solution of the moment problem in a suitable way to permit a reliable and efficient approximation of the target density f through $f_n$. Or, equivalently, how should the (optimal) analytical form of the set of MaxEnt constraints be chosen?
To try to answer this crucial question, we combine (2) and (3), so that the entropy of $f_n$ is given by
$$h_{f_n} = \sum_{j=0}^{n} \lambda_j m_j \ge h_f, \tag{5}$$
and we point out that the quantity
$$h_{f_n} - h_f \ge 0 \tag{6}$$
is a measure of the residual uncertainty about the distribution of the random variable X associated with the MaxEnt approximation $f_n$ of f. It corresponds to the maximum residual entropy (or, equivalently, the minimum information gain) associated with the knowledge of the first n moments of F that is not captured by (4).
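For readers who want the intermediate step, the entropy expression and the sign of the residual uncertainty can be sketched as follows. This is a short worked derivation added for clarity; it assumes only that $h_f$ is finite, that $f_n$ has the form (4), and that f and $f_n$ share the moments $m_0, \dots, m_n$ with $m_0 = 1$.

$$h_{f_n} = -\int_U f_n(x) \ln f_n(x)\,dx = \int_U f_n(x)\Big(\lambda_0 + \sum_{j=1}^{n} \lambda_j x^j\Big)dx = \lambda_0 m_0 + \sum_{j=1}^{n} \lambda_j m_j = \sum_{j=0}^{n} \lambda_j m_j .$$

Since f has the same moments, $-\int_U f(x) \ln f_n(x)\,dx = \sum_{j=0}^{n} \lambda_j m_j = h_{f_n}$ as well, and therefore

$$h_{f_n} - h_f = \int_U f(x) \ln f(x)\,dx - \int_U f(x) \ln f_n(x)\,dx = \int_U f(x) \ln \frac{f(x)}{f_n(x)}\,dx \ge 0 ,$$

where the last integral is the Kullback-Leibler divergence of $f_n$ from f, which is nonnegative and vanishes only when $f_n = f$ almost everywhere.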
When integer moments are used as constraints in a MaxEnt procedure, their mechanical choice makes the rate of reduction of the residual entropy (6) very slow, and the reduction becomes more and more negligible as the number n of moments increases. Consequently, it is necessary to look for a class of alternative approximants with faster convergence of $h_{f_n}$ to $h_f$: the class of MaxEnt distributions constrained by fractional moments
$$m_{\alpha_j} = E(X^{\alpha_j}) = \int_U x^{\alpha_j} f(x)\,dx, \quad \alpha_j \in \mathbb{R}^+ \tag{7}$$
represents a natural alternative to (4), where a proper choice of the number n and of the exponents $\alpha_j$, built on the convergence in entropy theorem of Section 4.2, allows us to control the reduction of the residual uncertainty and, in the end, to accelerate the rate of convergence of $f_n$ to f, mitigating the effects of ill-conditioning due to a large value of n. Indeed, since any fractional moment $m_\alpha$ can be obtained as a function of many (as many as computationally feasible) integer moments $m_j$ (see [7,8] for details), the information available in the sequence of integer moments can be squeezed into a few fractional moments. For example, in the case of heavy-tailed distributions, where integer moments of high order are required, the use of a few fractional moments makes it possible to avoid (or mitigate) ill-conditioning without losing the tail information, which is crucial for modeling, for example, the risk of extreme events or their predictability in a computationally tractable environment.
Refs. [9,10] give an application of the fractional moment method to structural reliability analysis, which is typically based on a model that describes the response, such as maximum deformation or stress, as a function of several random variables. They base the derivation of that model on the MaxEnt principle with constraints specified in terms of fractional moments, in place of the commonly used integer moments, to avoid the well-known ill-conditioning problem, and they study the numerical accuracy and efficiency of the proposed method. From the life-cycle perspective, Ref. [11] observes that probabilistic lifetime modeling of an engineering system provides important information for the risk assessment of the system by evaluating the mean time to failure, the survival probability, and the dynamic hazard rate, among others. This can again be carried out by using fractional moments within the MaxEnt technique to approximate the distribution of interest.
Further, by virtue of the flexibility due to the continuous nature of their order $\alpha$, fractional moments represent a valid alternative when the physical principles of momentum are invalid and, consequently, the use of integer moments to express the MaxEnt constraints turns out to be improper, as happens in many geophysical processes such as daily rainfall distribution [6] or in tree diameter distribution modeling [12].
Looking for more practical reasons motivating the use of fractional moments in addition to computational feasibility issues, it is interesting to note that sometimes the available information could be better exploited if the search for its optimal summaries took place on the entire moment curve rather than on a predetermined sequence of equispaced points (i.e., integer moments). For example, a large number of common families of probability distributions largely used in reliability and risk theory, such as Gamma, Pareto, Rayleigh, and Lognormal to name a few, belong to the exponential family having logarithmic characterizing moments, as shown in Table 1 (recalling that $x^{\alpha} = e^{\alpha \ln(x)}$).
It is possible to show that these families of distributions can be considered MaxEnt distributions with fractional moments as characterizing moments, in the sense that the analytic form of the latter is appropriate to capture the relevant (that is, characterizing) information and features of the corresponding distribution. For example, in the Lognormal case, the characterizing moments are $E[\ln(X)]$ and $E[\ln^2(X)]$. Then, by the known relationships
$$\lim_{\alpha \to 0} \frac{x^{\alpha} - 1}{\alpha} = \ln(x) \quad \text{and} \quad \lim_{\alpha \to 0} \Big(\frac{x^{\alpha} - 1}{\alpha}\Big)^2 = \lim_{\alpha \to 0} \frac{x^{2\alpha} - 2x^{\alpha} + 1}{\alpha^2} = \ln^2(x)$$
it follows that the Lognormal density can be reconsidered as a MaxEnt one having $\{E(X^{\alpha}), E(X^{2\alpha})\}$, $\alpha \to 0$, as characterizing fractional moments.
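The two limits used above are easy to check numerically. The following sketch (function names and test points are illustrative choices, not part of the paper) evaluates both difference quotients for decreasing values of $\alpha$ and compares them with $\ln(x)$ and $\ln^2(x)$.

```python
import numpy as np


def log_quotient(x, alpha):
    """(x**alpha - 1) / alpha, which tends to ln(x) as alpha -> 0."""
    return (x ** alpha - 1.0) / alpha


def log_sq_quotient(x, alpha):
    """(x**(2 alpha) - 2 x**alpha + 1) / alpha**2, which tends to ln(x)**2."""
    return (x ** (2.0 * alpha) - 2.0 * x ** alpha + 1.0) / alpha ** 2


def max_errors(x, alpha):
    """Largest absolute deviation from ln(x) and ln(x)**2 over the points x."""
    err_log = np.max(np.abs(log_quotient(x, alpha) - np.log(x)))
    err_sq = np.max(np.abs(log_sq_quotient(x, alpha) - np.log(x) ** 2))
    return err_log, err_sq


x = np.array([0.2, 1.5, 7.0])  # arbitrary positive test points
for alpha in (1e-1, 1e-2, 1e-3):
    e1, e2 = max_errors(x, alpha)
    print(f"alpha = {alpha:g}: error on ln = {e1:.2e}, error on ln^2 = {e2:.2e}")
```

Both errors shrink roughly linearly in $\alpha$, consistent with the first-order expansion $x^{\alpha} = 1 + \alpha \ln(x) + O(\alpha^2)$.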
The same line of reasoning and related results hold true if we consider a random sample $(X_1, X_2, \dots, X_N)$ and the associated sample fractional moments
$$\hat{m}_{\alpha_j} = \frac{1}{N} \sum_{i=1}^{N} X_i^{\alpha_j}, \quad \alpha_j \in \mathbb{R}^+ \tag{8}$$
to summarize the sample information needed for the MaxEnt estimation of the density f [7]. In this setup, the MaxEnt estimate $f_n$ represents a genuine non-parametric estimate of f, where the constraints of appropriate number and order expressed by (8) represent the features of the distribution of X that must be preserved.
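As a minimal sketch of how the sample fractional moments can be computed in practice, the snippet below averages $X_i^{\alpha_j}$ over a sample. The helper name, the simulated Exp(1) sample, and the chosen orders are assumptions made only for illustration; for the Exp(1) distribution the exact fractional moments are $\Gamma(1 + \alpha)$, which gives a reference value.

```python
from math import gamma

import numpy as np


def sample_fractional_moments(sample, alphas):
    """Sample fractional moments (1/N) * sum_i X_i**alpha_j for each alpha_j."""
    sample = np.asarray(sample, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    if np.any(sample < 0):
        raise ValueError("fractional moments require a nonnegative sample")
    # Rows index observations, columns index the orders alpha_j.
    return np.mean(sample[:, None] ** alphas[None, :], axis=0)


# Illustrative data: a simulated Exp(1) sample (an assumption, not paper data).
rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=200_000)
alphas = [0.25, 0.5, 1.0]

m_hat = sample_fractional_moments(sample, alphas)
m_exact = np.array([gamma(1.0 + a) for a in alphas])  # Exp(1): Gamma(1 + alpha)
for a, est, ref in zip(alphas, m_hat, m_exact):
    print(f"alpha = {a}: sample moment = {est:.4f}, exact = {ref:.4f}")
```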
To assess the feasibility of the moment problem solution based on fractional moments, three aspects must now be considered. The first concerns the existence of the MaxEnt approximation based on fractional moments as a tool to express the constraint set. The second consists in finding a formal proof of the convergence in entropy to f of the sequence of MaxEnt "fractional" approximations (Equation (10) below). The third concerns the choice of the number n and of the set of orders $\alpha_j$, $j = 1, 2, \dots, n$, of the fractional moments $m_{\alpha_j}$. Convergence in entropy of the MaxEnt approximation (10) $f_n$ to f guarantees that the residual uncertainty $h_{f_n} - h_f$ about the distribution of the random variable X associated with the approximation $f_n$ of f is minimal. Moreover, for fixed n, it is natural to base the choice of the fractional orders $\alpha_j$, $j = 1, 2, \dots, n$, on the values of $\alpha$ that minimize the residual uncertainty (6), in the framework established by two important results due to [13].
More precisely, Lin’s Theorems 1 and 2, which rest on the fact that an analytic function on the right half of the complex plane is completely determined by its values on a sequence of points having an accumulation point there, guarantee that the fractional moments still characterize the underlying distribution when the exponents are restricted so as to capture the aspects of the process that must be preserved. Specifically, these theorems are:
Theorem 1 (Lin (1992), Thm. 1).
A positive r.v. X is uniquely characterized by an infinite sequence of positive fractional moments $\{m_{\alpha_j}\}_{j=1}^{\infty}$ with distinct exponents $\alpha_j \in (0, \alpha^*)$, $m_{\alpha^*} < \infty$, for some $\alpha^* > 0$.
and
Theorem 2 (Lin (1992), Thm. 2).
If X is a r.v. assuming values in a bounded interval $[0, 1]$ and $\{\alpha_j\}_{j=1}^{\infty}$ is an infinite sequence of positive and distinct numbers satisfying
$$\lim_{j \to \infty} \alpha_j = 0 \quad \text{and} \quad \sum_{j=1}^{\infty} \alpha_j = +\infty$$
then the sequence of moments $\{m_{\alpha_j}\}_{j=1}^{\infty}$ characterizes X.
The following sections provide formal proofs and results related to each of the three aspects mentioned. But before proceeding, let us briefly recall an important technical result, which plays a pivotal role in performing the proofs.

3. A Reminder about T-Systems

T-systems represent a technical tool that plays a crucial role in proving both the existence and the convergence in entropy of the MaxEnt approximation $f_n$ of f. For this reason, we will briefly revisit their main aspects, considering the two cases $X \in [0, \infty)$ and $X \in [0, 1]$ separately. In the sequel, notations and results are borrowed from [14,15], where T-systems are extensively investigated, including general functions $\{u_j(t)\}_{j=0}^{n}$ on an abstract set E.
1.
$X \in U = [0, \infty)$.
  • The starting point is to consider that the set of continuous linearly independent real-valued functions $\{u_j(t)\}_{j=0}^{n}$, defined on the interval $U = [0, \infty)$, constitutes a T-system of order n if any polynomial
    $$P(t) = \sum_{j=0}^{n} a_j u_j(t), \quad \text{with } \sum_{j=0}^{n} a_j^2 > 0$$
    has no more than n zeros on $[0, \infty)$. Equivalently, it is readily seen that $\{u_j(t)\}_{j=0}^{n}$ is a T-system if and only if the determinants of order $(n + 1)$
    $$\det\big[u_0(t), u_1(t), \dots, u_n(t)\big]_0^n := \begin{vmatrix} u_0(t_0) & u_0(t_1) & \cdots & u_0(t_n) \\ u_1(t_0) & u_1(t_1) & \cdots & u_1(t_n) \\ \vdots & \vdots & & \vdots \\ u_n(t_0) & u_n(t_1) & \cdots & u_n(t_n) \end{vmatrix}$$
    are strictly positive for any choice of distinct points $0 \le t_0 < t_1 < \dots < t_n$ in $[0, \infty)$. According to the above definition, the special set $\{u_j(t) = t^{\alpha_j}\}_{j=0}^{n}$ we are interested in, with distinct $0 = \alpha_0 < \alpha_1 < \dots < \alpha_n$, is a T-system having the properties
    (a)
    $u_j(t) = t^{\alpha_j} > 0$ for each $0 \le j \le n$
    (b)
    $\lim_{t \to \infty} \frac{t^{\alpha_j}}{t^{\alpha_n}} = 0$ for each $j = 0, \dots, n - 1$
    (c)
    if the set $\{u_j(t) = t^{\alpha_j}\}_{j=0}^{n}$ is a T-system, then $\{u_j(t) = t^{\alpha_j}\}_{j=0}^{n+1}$ is one too.
  • The space $\mathcal{M}_{n+1}$ of moments, given by the convex hull generated by the points $\{t^{\alpha_j}\}_{j=0}^{n}$, has a nonempty interior. This set is convex but has a complex geometry. A good deal of the geometry of the classical moment spaces induced by the special T-system $\{1, t, t^2, \dots, t^n\}$ can be generalized to the case of the investigated T-system $\{u_j(t) = t^{\alpha_j}\}_{j=0}^{n}$. If the sequence of prescribed moments $\{m_{\alpha_j}\}_{j=0}^{n}$ is an inner point of $\mathcal{M}_{n+1}$, then there are uncountably many probability measures $d\sigma(t)$ having such prescribed moments, one of them being $d\sigma(t) = f_n(t)\,dt$. Otherwise, if the sequence of prescribed moments $\{m_{\alpha_j}\}_{j=0}^{n}$ belongs to $\partial\mathcal{M}_{n+1}$, the boundary of $\mathcal{M}_{n+1}$, a unique measure supported on a finite set of points exists (the so-called lower principal representation) and the determinant of the Gram matrix $G_n$ defined below becomes zero. For an arbitrary n, let $0 = \alpha_0 < \alpha_1 < \dots < \alpha_n$. For notational convenience we set
    $$m_{\alpha_i, \alpha_j} = E\big[X^{\alpha_i} X^{\alpha_j}\big] = \int_U t^{\alpha_i} t^{\alpha_j} f(t)\,dt = \int_U t^{\alpha_i + \alpha_j} f(t)\,dt.$$
  • Let us now consider the probability measure $d\sigma(t) = f_n(t)\,dt$. Then $t^{\alpha_j} \in L^2_{d\sigma}(U)$, where, as usual,
    $$L^2_{d\sigma}(U) = \Big\{ t^{\alpha_j} : \int_U t^{2\alpha_j}\,d\sigma(t) = \int_U t^{2\alpha_j} f_n(t)\,dt < +\infty \Big\}$$
  • Thus the matrix $G_n = [m_{\alpha_i, \alpha_j}]_{i,j=0}^{n}$ is a positive definite Gram matrix.
  • The following Markov-Krein theorem ([14] Thm 5.1, p. 157; [15] Thm 1.1, p. 177) is fundamental to prove the convergence in entropy of the MaxEnt distribution: here we adapt it to fractional moments.
    Theorem 3 (Markov–Krein theorem).
    Given values of the first fractional moments $\{m_{\alpha_j}\}_{j=0}^{n} \in \mathrm{Int}(\mathcal{M}_{n+1})$, so that the Gram matrix $G_n$ is positive definite, the integral $\int_U u_{n+1}(t)\,d\sigma(t)$ over all the distributions $\sigma(t)$ having the assigned moments $\{m_{\alpha_j}\}_{j=0}^{n}$ has a minimum value $m^{-}_{\alpha_{n+1}}$, where
    $$m^{-}_{\alpha_{n+1}} = \int_U u_{n+1}(t)\,d\underline{\sigma}(t) \tag{9}$$
    The corresponding measure $\underline{\sigma}$, in the form of a weighted sum of Dirac delta functions, is uniquely determined and is the so-called lower principal representation. Furthermore, the point $\{m_{\alpha_0}, \dots, m_{\alpha_n}, m^{-}_{\alpha_{n+1}}\}$ belongs to $\partial\mathcal{M}_{n+2}$, the boundary of $\mathcal{M}_{n+2}$.
2.
$X \in U = [0, 1]$.
  • In this case the procedure proposed for $X \in [0, \infty)$ runs more or less similarly, since the functions $\{u_j(t) = t^{\alpha_j}\}_{j=0}^{n}$, $t \in [0, 1]$, form a T-system too and an analogous Markov-Krein theorem ([14] Thm 1.1, p. 80; [15] Thm. 1.1, p. 109) is available. It should only be recalled that, in analogy with Theorem 3, given $\{m_{\alpha_j}\}_{j=0}^{n} \in \mathrm{Int}(\mathcal{M}_{n+1})$, the moment $m_{\alpha_{n+1}}$ admits a minimum and a maximum value, $m^{-}_{\alpha_{n+1}}$ and $m^{+}_{\alpha_{n+1}}$, respectively, where
    $$m^{+}_{\alpha_{n+1}} - m^{-}_{\alpha_{n+1}} = \int_U u_{n+1}(t)\,d\big(\overline{\sigma} - \underline{\sigma}\big)$$
    Here, the corresponding measures $\underline{\sigma}$ and $\overline{\sigma}$, in the form of weighted sums of Dirac delta functions, are uniquely determined and are the so-called lower and upper principal representations, respectively, and the points $\{m_{\alpha_0}, \dots, m_{\alpha_n}, m^{\pm}_{\alpha_{n+1}}\} \in \partial(\mathcal{M}_{n+2})$.
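The two structural facts recalled in this section, namely the strict positivity of the determinants associated with the system $\{t^{\alpha_j}\}$ and the positive definiteness of the Gram matrix, can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the exponents are an arbitrary choice, the nodes are drawn at random in (0.05, 5), and the Gram matrix is built for the Exp(1) density, whose fractional moments are $m_{\alpha} = \Gamma(1 + \alpha)$, rather than for a MaxEnt density $f_n$. All function names are hypothetical.

```python
from math import gamma

import numpy as np


def power_system_matrix(ts, alphas):
    """Matrix with entries u_j(t_i) = t_i**alpha_j (row j, column i)."""
    ts = np.asarray(ts, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    return ts[None, :] ** alphas[:, None]


def gram_matrix_exp1(alphas):
    """Gram matrix G[i, j] = m_{alpha_i + alpha_j} for the Exp(1) density.

    Assumption for illustration: m_alpha = Gamma(1 + alpha).
    """
    alphas = np.asarray(alphas, dtype=float)
    sums = alphas[:, None] + alphas[None, :]
    return np.vectorize(gamma)(1.0 + sums)


alphas = [0.0, 0.3, 0.7, 1.2]  # 0 = alpha_0 < alpha_1 < ... < alpha_n

# T-system check: determinants for increasing nodes t_0 < t_1 < ... < t_n.
rng = np.random.default_rng(1)
dets = []
for _ in range(100):
    ts = np.sort(rng.uniform(0.05, 5.0, size=len(alphas)))
    dets.append(np.linalg.det(power_system_matrix(ts, alphas)))
min_det = min(dets)
print(f"smallest determinant over 100 random node sets: {min_det:.3e}")

# Gram matrix check: all eigenvalues should be strictly positive.
eigs = np.linalg.eigvalsh(gram_matrix_exp1(alphas))
print("Gram matrix eigenvalues:", eigs)
```

A positive smallest determinant and strictly positive eigenvalues are what the T-system property and the Gram construction predict; the check is numerical evidence on one example, not a proof.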

4. MaxEnt Solution of the Fractional Moment Problem

Once, in both cases $X \in [0, 1]$ and $X \in [0, \infty)$, the moment curve $m_X(\alpha) = \int_U t^{\alpha} f(t)\,dt$ has been obtained, the probability distribution constrained by fractional moments can be estimated (approximated) through the MaxEnt technique, which is essentially an extension of the commonly used integer moment-based MaxEnt procedure.
Note that, given a finite collection of (population or sample) fractional moments $\{m_{\alpha_j}\}_{j=0}^{n}$, with $\alpha_0 = 0$, the corresponding MaxEnt solution for f is
$$f_n(x) = \exp\Big(-\sum_{j=0}^{n} \lambda_j x^{\alpha_j}\Big) \tag{10}$$
where the $\lambda_j$ are such that the (fractional) constraints
$$\int_U x^{\alpha_j} f_n(x)\,dx = m_{\alpha_j}, \quad j = 0, \dots, n \tag{11}$$
are satisfied, and $f_n$ depends on the $m_{\alpha_j}$ (thus on the $\alpha_j$) through the $\lambda_j$. Note that in the $[0, \infty)$ case $\lambda_n$ must take nonnegative values, $\lambda_n \in [0, \infty)$, to guarantee the integrability of $f_n$. The MaxEnt approximation $f_n$ of f has entropy
$$h_{f_n} = \sum_{j=0}^{n} \lambda_j m_{\alpha_j} \tag{12}$$
Here $(\lambda_0, \dots, \lambda_n)$ is the vector of Lagrange multipliers: if it is possible to determine the Lagrange multipliers from the constraints $\{m_{\alpha_j}\}_{j=0}^{n}$, then the moment problem admits a solution and $f_n$ is the MaxEnt approximation of f, which is unique in U due to the strict concavity of (12). In this setup, two fundamental theoretical questions must now be addressed: the existence of the MaxEnt distribution $F_n$ and its convergence in entropy to F. Both are crucial in real-world applications, where the MaxEnt technique aims to recover the distribution F from the available information on X, here summarized by a proper set of constraints expressed in terms of fractional moments, in line with what is discussed in Section 2.
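To make the construction concrete, the following sketch computes the Lagrange multipliers numerically on $U = [0, 1]$. It is not the authors' algorithm: it uses the standard convex dual of the MaxEnt problem, minimizing $\ln Z(\lambda) + \sum_{j \ge 1} \lambda_j m_{\alpha_j}$ with $Z(\lambda) = \int_0^1 \exp(-\sum_{j \ge 1} \lambda_j x^{\alpha_j})\,dx$ and $\lambda_0 = \ln Z$, whose stationarity conditions are exactly the moment constraints. The target density $f(x) = 2x$, the orders (0.5, 1.0), and all function names are assumptions chosen for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize


def fit_maxent_unit_interval(alphas, moments):
    """Fit f_n(x) = exp(-lam0 - sum_j lam_j x**alpha_j) on [0, 1].

    Minimizes the dual potential ln Z(lam) + lam . moments (a standard
    reformulation, assumed here). Returns (lam0, lam), where lam holds the
    multipliers of the orders in `alphas` (alpha_0 = 0 is handled by lam0).
    """
    alphas = np.asarray(alphas, dtype=float)
    moments = np.asarray(moments, dtype=float)

    def unnormalized(x, lam):
        return np.exp(-np.dot(lam, x ** alphas))

    def potential_and_grad(lam):
        z = quad(unnormalized, 0.0, 1.0, args=(lam,))[0]
        model = np.array([
            quad(lambda x, a=a: x ** a * unnormalized(x, lam), 0.0, 1.0)[0] / z
            for a in alphas
        ])
        # Gradient: prescribed moments minus moments of the current density.
        return np.log(z) + np.dot(lam, moments), moments - model

    res = minimize(potential_and_grad, np.zeros(alphas.size), jac=True,
                   method="BFGS")
    lam = res.x
    lam0 = potential_and_grad(lam)[0] - np.dot(lam, moments)  # lam0 = ln Z
    return lam0, lam


# Illustrative target (an assumption): f(x) = 2x on [0, 1], m_alpha = 2 / (alpha + 2).
alphas = [0.5, 1.0]
target_moments = np.array([2.0 / (a + 2.0) for a in alphas])
lam0, lam = fit_maxent_unit_interval(alphas, target_moments)


def density(x):
    return np.exp(-lam0 - np.dot(lam, x ** np.asarray(alphas)))


fitted_moments = np.array([quad(lambda x, a=a: x ** a * density(x), 0.0, 1.0)[0]
                           for a in alphas])
entropy = lam0 + np.dot(lam, target_moments)  # entropy formula (12), m_0 = 1
print("multipliers:", lam0, lam)
print("fitted moments:", fitted_moments, "target:", target_moments)
print("entropy of the MaxEnt fit:", entropy)
```

For this target the exact entropy is $h_f = 1/2 - \ln 2$, so the printed entropy of the fit should lie between that value and 0, the entropy of the uniform density on [0, 1].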

4.1. Existence of MaxEnt Distribution

For the existence of the MaxEnt distribution, there is a close and evident analogy between the integer and fractional moment cases, since both sets of functions $\{u_j(t) = t^{\alpha_j}\}_{j=0}^{n}$ and $\{u_j(t) = t^{j}\}_{j=0}^{n}$ are T-systems. The proof simply replaces Hankel matrices with the Gram matrices defined above.
1.
Suppose that X has unbounded support, $U = \mathbb{R}^+$, and that the first $n + 1$ moments $\{m_{\alpha_j}\}_{j=0}^{n} \in \mathrm{Int}(\mathcal{M}_{n+1})$ have been assigned; then $\lambda_n \ge 0$ is required to guarantee the integrability of $f_n$. In analogy with the case of integer moments, both integer and fractional powers being T-systems, this nonnegativity condition on $\lambda_n$ is crucial and renders the moment problem solvable only under certain restrictive assumptions on the prescribed moment vector $\{m_{\alpha_j}\}_{j=0}^{n}$. Consider (11) with n replaced by $n + 1$, the first $n + 1$ moments $\{m_{\alpha_j}\}_{j=0}^{n}$ being held constant whilst $m_{\alpha_{n+1}}$ varies continuously, so that the Lagrange multipliers $\lambda_j = \lambda_j(m_{\alpha_{n+1}})$, $j = 0, \dots, n + 1$, depend on $m_{\alpha_{n+1}}$. Differentiating both sides with respect to $m_{\alpha_{n+1}}$, and using $\partial f_{n+1} / \partial \lambda_j = -x^{\alpha_j} f_{n+1}$, one has
    $$G_{n+1} \cdot \Big[\frac{d\lambda_0}{dm_{\alpha_{n+1}}}, \dots, \frac{d\lambda_{n+1}}{dm_{\alpha_{n+1}}}\Big]' = -[0, \dots, 0, 1]' \tag{13}$$
    where ′ denotes the transpose. From $G_{n+1}$ symmetric and positive definite it follows that
    $$0 < \Big[\frac{d\lambda_0}{dm_{\alpha_{n+1}}}, \dots, \frac{d\lambda_{n+1}}{dm_{\alpha_{n+1}}}\Big] \cdot G_{n+1} \cdot \Big[\frac{d\lambda_0}{dm_{\alpha_{n+1}}}, \dots, \frac{d\lambda_{n+1}}{dm_{\alpha_{n+1}}}\Big]' = -\Big[\frac{d\lambda_0}{dm_{\alpha_{n+1}}}, \dots, \frac{d\lambda_{n+1}}{dm_{\alpha_{n+1}}}\Big] \cdot [0, \dots, 0, 1]' = -\frac{d\lambda_{n+1}}{dm_{\alpha_{n+1}}} \tag{14}$$
    Then $\frac{d\lambda_{n+1}}{dm_{\alpha_{n+1}}} < 0$ and $\lambda_{n+1}$ is a monotonically decreasing function of $m_{\alpha_{n+1}}$. The MaxEnt machinery leads us to consider a further quantity
    $$m^{+}_{\alpha_{n+1}} = \int_U t^{\alpha_{n+1}} f_n(t)\,dt \tag{15}$$
    with, in general, $m^{+}_{\alpha_{n+1}} \ne m_{\alpha_{n+1}}$.
  • From now on, for the sake of brevity, among the arguments of $f_{n+1}$ and $h_{f_{n+1}}$ we will mention only those that take continuously varying values.
    (i)
    Assume $f_n$ exists. Once $\{m_{\alpha_j}\}_{j=0}^{n+1}$ are assigned and $m_{\alpha_{n+1}}$ varies continuously, combine the following facts: $\lambda_{n+1}(m_{\alpha_{n+1}})$ is a monotonically decreasing function, $f_{n+1}(m^{+}_{\alpha_{n+1}}) = f_n$, and take into account (9) and (15). One concludes that, if $f_n$ exists, the necessary and sufficient condition for the existence of $f_{n+1}$ is $m^{-}_{\alpha_{n+1}} < m_{\alpha_{n+1}} \le m^{+}_{\alpha_{n+1}}$, in analogy with the previously investigated case of integer moments ([16], Appendix A).
    (ii)
    Assume $f_n$ does not exist. In such a case $\lambda_{n+1} > 0$. Indeed, if $\lambda_{n+1} = 0$ then we would have $m_{\alpha_{n+1}} = m^{+}_{\alpha_{n+1}}$ and hence $f_{n+1} = f_n$, contradicting the fact that $f_n$ does not exist. Consequently, $f_{n+1}$ exists for every set $\{m_{\alpha_j}\}_{j=0}^{n+1} \in \mathrm{Int}(\mathcal{M}_{n+2})$. For practical purposes, if $f_n$ does not exist, both $f_{n-1}$ and $f_{n+1}$ exist. We can state that the problem of the non-existence of the MaxEnt density can be easily bypassed.
  • Collecting together items (i) and (ii), we conclude that the existence of $f_n$ is iteratively and numerically determined, starting from $f_1$, which exists.
  • In proving the conditions of existence of the MaxEnt distribution, we remarked on the close analogy between the cases of fractional moments and integer moments. It is reasonable to expect similar analogies to arise when an entropy value has to be attributed to a density that does not exist, so that the sequence of entropies $\{h_{f_n}\}_{n=1}^{\infty}$ is defined for every n. The issue was addressed in ([16], Thm. 1) and, given the laboriousness of the proof, we limit ourselves to illustrating the tools involved and the results obtained.
  • Some relevant facts need to be collected together. Since the MaxEnt density $f_n$ does not exist, both $f_{n-1}$ and $f_{n+1}$ exist, with entropies $h_{f_{n-1}}$ and $h_{f_{n+1}}$, respectively. Introduce now the following class of densities all having the same first moments $\{m_{\alpha_j}\}_{j=0}^{n}$
    $$C_n =: \Big\{ f \ge 0 \;\Big|\; \int_U x^{\alpha_j} f(x)\,dx = m_{\alpha_j}, \; j = 0, \dots, n \Big\} \tag{16}$$
  • In particular, we direct our attention to the density $f_{n+1} = f_{n+1}(m_{\alpha_{n+1}}) \in C_n$, which thanks to Theorem 4 exists for any value $m_{\alpha_{n+1}} > m^{-}_{\alpha_{n+1}}$. As in the integer moment case, $f_n$ may not exist, so that $h_{f_n}$ is meaningless. Ref. [16] (Thm. 1) proved the relationship $\lim_{m_{n+1} \to \infty} h_{f_{n+1}}(m_{n+1}) = h_{f_{n-1}}$, from which $\sup_{f \in C_n} h_f = h_{f_{n-1}}$, although the current use of MaxEnt fails (here the last recalled $C_n$ is the analog of (16) with $m_{\alpha_j}$ replaced by $m_j$). Since the entropy is non-increasing as n increases, the latter equality enables us to set $h_{f_n} = h_{f_{n-1}}$, filling the gap left by the nonexistence of the density $f_n$. We reformulate this result in terms of fractional moments as $\lim_{m_{\alpha_{n+1}} \to \infty} h_{f_{n+1}}(m_{\alpha_{n+1}}) = h_{f_{n-1}}$, from which $\sup_{f \in C_n} h_f = h_{f_{n-1}}$. This leads us to conclude that, whenever $f_n$ does not exist, the missing entropy $h_{f_n}$ is replaced with $h_{f_{n-1}}$, so that the sequence of entropies $\{h_{f_n}\}_{n=1}^{\infty}$ is defined for every n.
  • We can thus reformulate the conditions of existence, in analogy with the integer moment case, by means of the following theorem.
    Theorem 4.
    Once the moment set $\{m_{\alpha_j}\}_{j=0}^{n-1} \in \mathrm{Int}(\mathcal{M}_{n})$ is prescribed, suppose $f_{n-1}$ exists with its n-th moment $m^{+}_{\alpha_n} = \int_U t^{\alpha_n} f_{n-1}(t)\,dt$.
    (i)
    If $m_{\alpha_n} \le m^{+}_{\alpha_n}$, then $f_n$ exists; conversely, if $m_{\alpha_n} > m^{+}_{\alpha_n}$, $f_n$ does not exist. Thus the existence of $f_n$ is iteratively (and numerically only) determined from $f_{n-1}$, starting from $f_1$, which exists.
    (ii)
    If $f_n$ does not exist, both $f_{n-1}$ and $f_{n+1}$ exist for every $m_{\alpha_{n-1}} > m^{-}_{\alpha_{n-1}}$ and $m_{\alpha_{n+1}} > m^{-}_{\alpha_{n+1}}$, respectively. In addition, $h_{f_n} = h_{f_{n-1}}$ can be set.
2.
Suppose now that $X \in [0, 1]$: the procedure employed in the unbounded support case 1 runs similarly, since the functions $\{u_j(t) = t^{\alpha_j}\}_{j=0}^{n}$, $t \in [0, 1]$, form a T-system too and an analogous Markov-Krein theorem ([15], Thm. 1.1, p. 109) is available. It should only be recalled that, in analogy with Theorem 4 above, given $\{m_{\alpha_j}\}_{j=0}^{n-1} \in \mathrm{Int}(\mathcal{M}_{n})$, the moment $m_{\alpha_n}$ admits a minimum and a maximum value, $m^{-}_{\alpha_n}$ and $m^{+}_{\alpha_n}$, respectively. The corresponding measures $\sigma = \underline{\sigma}$ and $\sigma = \overline{\sigma}$, in the form of weighted sums of Dirac delta functions, are uniquely determined and are the so-called lower and upper principal representations, respectively, and the points $\{m_{\alpha_0}, \dots, m_{\alpha_{n-1}}, m^{\pm}_{\alpha_n}\} \in \partial(\mathcal{M}_{n+1})$. Thanks to the MaxEnt formalism, Equation (13) continues to hold. Once the first moments $\{m_{\alpha_j}\}_{j=0}^{n} \in \mathrm{Int}(\mathcal{M}_{n+1})$ have been assigned, the bounded support does not imply any restriction on the Lagrange multipliers; in particular, $\lambda_n$ can take any real value. From (13), as $m_{\alpha_n}$ varies within the bounded range of its admissible values $(m^{-}_{\alpha_n}, m^{+}_{\alpha_n})$, $\det(G_{n-1}) > 0$ is bounded. As $m_{\alpha_n} \to m^{\pm}_{\alpha_n}$, $f_n$ coincides with the measures $\overline{\sigma}$ and $\underline{\sigma}$, respectively. As a consequence $\det(G_n) \to 0$, from which $\frac{d\lambda_n}{dm_{\alpha_n}} = -\frac{\det(G_{n-1})}{\det(G_n)} \to -\infty$, and then $\lambda_n \to \mp\infty$ follows.
  • Analogous conclusions hold for the remaining Lagrange multipliers, by pre- and post-multiplying, in $\frac{d\lambda_j}{dm_{\alpha_n}}$ with $j < n$, the matrix at the numerator by a suitable permutation matrix.
  • In conclusion, given $\{m_{\alpha_j}\}_{j=0}^{n-1} \in \mathrm{Int}(\mathcal{M}_{n})$ and assuming $f_{n-1}$ exists, $f_n$ exists if $\{m_{\alpha_j}\}_{j=0}^{n} \in \mathrm{Int}(\mathcal{M}_{n+1})$. Equivalently, the existence of $f_n$ is iteratively determined, starting from $f_0$ (the uniform distribution), which exists. On the other hand, thanks again to the MaxEnt formalism, the previous proof of existence continues to hold. The restrictive assumptions on the prescribed moment vector that were needed for solvability in the unbounded case are no longer required, and consequently the following theorem holds:
    Theorem 5.
    If $X \in [0, 1]$, a necessary and sufficient condition for the existence of the MaxEnt distribution $f_n$ is that the vector of moments is internal to the space of moments, that is, $\{m_{\alpha_j}\}_{j=0}^{n} \in \mathrm{Int}(\mathcal{M}_{n+1})$.
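The iterative existence test of Theorem 4 (i) can be sketched in closed form for the first step on $[0, \infty)$. With a single fractional constraint, $f_1(t) \propto \exp(-\lambda_1 t^{\alpha_1})$, a change of variable $s = t^{\alpha_1}$ gives $E[t^{\alpha_1}] = 1/(\alpha_1 \lambda_1)$ and $E[t^{\alpha_2}] = \Gamma((1 + \alpha_2)/\alpha_1) / \Gamma(1/\alpha_1) \cdot \lambda_1^{-\alpha_2/\alpha_1}$, so the bound $m^{+}_{\alpha_2}$ is explicit. The function names and the two test distributions (Exp(1), with $m_\alpha = \Gamma(1+\alpha)$, and a Lognormal with $\mu = 0$, $\sigma = 1.5$, with $m_\alpha = e^{\alpha^2 \sigma^2 / 2}$) are illustrative assumptions, and the moments are assumed to come from a genuine distribution, so that they are interior to the moment space.

```python
from math import exp, gamma


def lambda1_from_moment(alpha1, m1):
    """Multiplier of f_1(t) ~ exp(-lam1 * t**alpha1) matching E[t**alpha1] = m1."""
    return 1.0 / (alpha1 * m1)


def upper_moment(alpha1, m1, alpha2):
    """m^+_{alpha2}: the moment of order alpha2 of the one-constraint density f_1."""
    lam1 = lambda1_from_moment(alpha1, m1)
    return (gamma((1.0 + alpha2) / alpha1) / gamma(1.0 / alpha1)
            * lam1 ** (-alpha2 / alpha1))


def f2_exists(alpha1, m1, alpha2, m2):
    """Existence test of Theorem 4 (i) for n = 2: f_2 exists iff m2 <= m^+_{alpha2}."""
    return m2 <= upper_moment(alpha1, m1, alpha2)


alpha1, alpha2 = 0.5, 1.0

# Exp(1): fractional moments Gamma(1 + alpha).
exp_case = f2_exists(alpha1, gamma(1.0 + alpha1), alpha2, gamma(1.0 + alpha2))

# Lognormal(mu = 0, sigma = 1.5): fractional moments exp(alpha**2 * sigma**2 / 2).
sigma = 1.5
logn_case = f2_exists(alpha1, exp(alpha1 ** 2 * sigma ** 2 / 2.0),
                      alpha2, exp(alpha2 ** 2 * sigma ** 2 / 2.0))

print("Exp(1): f_2 exists?", exp_case)
print("Lognormal(0, 1.5): f_2 exists?", logn_case)
```

In this sketch the heavy-tailed Lognormal case fails the test because its mean exceeds the value the one-constraint density can reach, which would force a negative $\lambda_2$; in that situation the text above sets $h_{f_2} = h_{f_1}$ and moves on to $f_3$.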

4.2. Entropy Convergence of MaxEnt Distribution

Convergence in entropy of $f_n$ to f, in the case where the entropy (12) is finite or $-\infty$, and its implications play a fundamental role in many applied problems where the focus is often on the behavior of the tails of the distribution F, which are crucial for studying the behavior of extreme events and evaluating the probability of their occurrence. In this direction, Ref. [17] stresses the fact that “…at the tails, the MaxEnt distribution oscillates because of the nonmonotonic nature of the polynomial embedded in the $f_n$. Thus, only the lower-order moments are typically considered, but in such cases, $f_n$ hardly models tails fatter than the Gaussian. Therefore, the tails of many distributions cannot be well fitted by the MaxEnt distribution with $n \le 4$”, thus questioning the utility of the MaxEnt approach and of the consequent solution in this case.
We will prove the almost everywhere nature of the convergence in entropy to f of the MaxEnt approximation $f_n$ based on an optimal set of fractional moments. This will permit us to disprove the above claim that the “…MaxEnt distribution oscillates because of the nonmonotonic nature of the polynomial embedded in the $f_n$” and to state that $f_n$ represents a reliable reconstruction of f and of the main features of the corresponding distribution F, including the tail behavior. Further, by exploiting the convergence in entropy of $f_n$ to f, it is possible to formulate a criterion for choosing the optimal number n and the values $\{\alpha_j\}_{j=1}^{n}$ of the fractional exponents and hence the best set of fractional moments $\{m_{\alpha_j}\}_{j=1}^{n}$ (see Equation (29) below).
Finally, even if the tails of the distribution oscillate as stated by [17], when the focus is on evaluating (or estimating) appropriate numerical summaries of the distribution, usually expressed in terms of expected values or quantiles, convergence in entropy ensures that the approximation error (Equations (25) and (26) below) can be controlled by a proper choice of the number and the orders of the fractional moments. Consequently, the goodness and reliability of such summaries are ensured regardless of the oscillating nature of the tails of the MaxEnt distribution.
We are now in a position to prove the main result of this paper, which we enunciate below.
Theorem 6 (Main result).
If $X$ is a positive random variable, having the moments sequence $\{m_{\alpha_j}\}_{j=0}^{\infty}$ characterizing a unique distribution, MaxEnt approximations converge in entropy to the underlying distribution, that is
$$\lim_{n \to \infty} h_{f_n} = h_f$$
with $h_f$ either finite or $-\infty$.
Proof. 
We begin by giving the proof of Theorem 6 for $X \in [0, \infty)$. Then we just adjust the proof for $X \in [0,1]$.
1.
Suppose $X \in [0, \infty)$.
As $m_{\alpha_{n+1}} > m^{-}_{\alpha_{n+1}}$ varies, both $f_{n+1} = f_{n+1}(m_{\alpha_{n+1}})$ (equivalently $\lambda_j = \lambda_j(m_{\alpha_{n+1}})$, $j = 0, \ldots, n+1$) and then $h_{f_{n+1}} = h_{f_{n+1}}(m_{\alpha_{n+1}})$ hold.
Considering $h_{f_{n+1}}(m_{\alpha_{n+1}})$ and combining (12) with the first equation of (13), we have
$$\frac{d h_{f_{n+1}}(m_{\alpha_{n+1}})}{d m_{\alpha_{n+1}}} = \sum_{j=0}^{n+1} m_{\alpha_j} \frac{d \lambda_j(m_{\alpha_{n+1}})}{d m_{\alpha_{n+1}}} + \lambda_{n+1}(m_{\alpha_{n+1}}) = \lambda_{n+1}(m_{\alpha_{n+1}})$$
from which, taking into account (14), $\frac{d^2 h_{f_{n+1}}(m_{\alpha_{n+1}})}{d m_{\alpha_{n+1}}^2} = \frac{d \lambda_{n+1}(m_{\alpha_{n+1}})}{d m_{\alpha_{n+1}}} < 0$. Thus $h_{f_{n+1}}(m_{\alpha_{n+1}})$ is a differentiable concave function.
Enter Markov–Krein’s Theorem. From Theorem 4 and its consequences, as $m_{\alpha_{n+1}} \to m^{-}_{\alpha_{n+1}}$, $f_{n+1}(m_{\alpha_{n+1}})$ can be assimilated to a set of Dirac deltas, equivalently to a discrete distribution, the so-called lower principal representation $\underline{\sigma}$.
We recall that, for consistency between the differential entropy of a continuous random variable and the entropy of its discretization, the differential entropy of any discrete measure (being compared to the Dirac delta function) is assumed to be $-\infty$ ([18], pp. 247–249). As a consequence, $h_{f_{n+1}}(m^{-}_{\alpha_{n+1}}) = -\infty$ can be set. On the other hand, as $m_{\alpha_{n+1}}$ takes its own prescribed value, $h_{f_{n+1}} \ge h_f$ holds. Then, with $h_{f_{n+1}}(m_{\alpha_{n+1}})$ being a continuous function, there exists a value, say $\tilde{m}_{\alpha_{n+1}} \in (m^{-}_{\alpha_{n+1}}; m_{\alpha_{n+1}}]$, such that $h_{f_{n+1}}(\tilde{m}_{\alpha_{n+1}}) = h_f$. Summarizing, we have seen that:
(i)
If $\{\alpha_j\}_{0}^{n+1}$ are assigned and $f_{n+1}$ is the corresponding MaxEnt density with entropy $h_{f_{n+1}}$, the sequence $\{h_{f_{n+1}}\}$ is monotonically decreasing and then convergent, with $\lim_{n \to \infty} h_{f_{n+1}} \ge h_f$;
(ii)
for each $n$, $h_{f_{n+1}}(m_{\alpha_{n+1}})$ is a concave function in $(m^{-}_{\alpha_{n+1}}; m_{\alpha_{n+1}}]$; as $m_{\alpha_{n+1}} \to m^{-}_{\alpha_{n+1}}$, $h_{f_{n+1}}(m^{-}_{\alpha_{n+1}}) = -\infty$;
(iii)
there exists $\tilde{m}_{\alpha_{n+1}} \in (m^{-}_{\alpha_{n+1}}; m_{\alpha_{n+1}}]$ such that $h_{f_{n+1}}(\tilde{m}_{\alpha_{n+1}}) = h_f$.
Enter Lin’s Theorem. Consider Theorem 1; without loss of generality, assume that the sequence $\{m_{\alpha_j}\}_{0}^{\infty}$ is asymptotically monotonically increasing. From Theorem 1, the sequence $\{m_{\alpha_j}\}_{0}^{\infty}$ is convergent. As $n \to \infty$, from both relationships $m_{\alpha_n} < m^{-}_{\alpha_{n+1}} < \tilde{m}_{\alpha_{n+1}} < m_{\alpha_{n+1}}$ and $(m_{\alpha_{n+1}} - m_{\alpha_n}) \to 0$, it follows
(iv)
both $m^{-}_{\alpha_{n+1}} \to m_{\alpha_{n+1}}$ and $\tilde{m}_{\alpha_{n+1}} \to m_{\alpha_{n+1}}$.
Combining the above items (i)–(iv), drawn from Theorem 1 and Theorem 4, respectively, it follows
$$\lim_{n \to \infty} h_{f_{n+1}} = \lim_{n \to \infty} h_{f_{n+1}}(\tilde{m}_{\alpha_{n+1}}) = h_f. \tag{18}$$
The methodology employed for the proof clearly suggests that the convergence in entropy holds true in both cases, $h_f$ finite and $h_f = -\infty$. Indeed, assuming $h_f = -\infty$, as $n \to \infty$, $\tilde{m}_{\alpha_{n+1}}$ tends to $m_{\alpha_{n+1}}$, so that Equation (18) leads to $\lim_{n \to \infty} h_{f_{n+1}} = -\infty$ too. Previously we proved that whenever $f_n$ does not exist the missing entropy $h_{f_n}$ is replaced with $h_{f_{n-1}}$, so that the sequence of entropies $\{h_{f_n}\}_{1}^{\infty}$ is defined for every $n$. That fact gives full significance to (18).
2.
Suppose $X \in [0,1]$.
The procedure previously employed in the $X \in [0, \infty)$ case is likewise extended to $X \in [0,1]$ since the involved functions $\{u_j(t) = t^{\alpha_j}\}_{0}^{n+1}$, $t \in [0,1]$, are T-systems too and both an analogous Markov–Krein theorem ([15], Thm. 1.1, p. 109) and Lin’s Theorem 6 are available (in the latter case, although the sequence $\{m_{\alpha_j}\}_{j=0}^{\infty}$ has to be monotonically decreasing, the proof is similar). Thanks to the MaxEnt formalism, both Equation (13) and the methodology used to prove Theorem 6 hold true.
In conclusion, if $h_f$ is finite or $-\infty$, for $X \in [0,1]$ or $X \in [0, \infty)$, the MaxEnt formalism enables us to prove the entropy convergence (Theorem 6) by means of a unified procedure. □
We recall that the entropy convergence had been proved in [19], Theorem 3.1, for the case with $X \in [0,1]$ and $h_f$ finite, by transforming the problem of Laplace transform inversion into a fractional moment one on $[0,1]$. There, the author mentions Lin’s theorem, although the statements of that theorem are not actually used in the proof at all. Theorem 6 allows us to select $(\alpha_1, \ldots, \alpha_n)$ in both cases $X \in [0,1]$ and $X \in [0, \infty)$. This choice is driven by the minimization of the residual $h_{f_n} - h_f$. For this purpose, limited to the case in which $f$ has finite entropy $h_f$, a valid guide is given by the different modes of convergence that stem from the entropy convergence just proved and are briefly recalled below.
Thanks to the MaxEnt formalism, the procedure described below holds true in both cases $X \in [0,1]$ and $X \in [0, \infty)$, so that by $U$ we mean, without distinction, the support of $X \in [0,1]$ or $X \in [0, \infty)$.

4.3. Further Convergence Modes for Finite $h_f$

In the case in which $h_f$ is finite, and then $\inf_n h_{f_n}$ is finite too, the following additional results may be drawn. These results form a chain of implications starting from the (almost everywhere) convergence in entropy and ending with the convergence in distribution. The aim is to justify the MaxEnt reconstruction of the particular features of the distribution in which we are interested, and to govern their reliability by controlling their approximation error in terms of the residual entropy $h_{f_n} - h_f$.
Let $m > n$ and let $f_m$ and $f_n$ be the maxentropic solutions of the truncated fractional moment problem with $m$ and $n$ moments, respectively. Combining the following two facts:
(a)
the monotonically non-increasing sequence $\{h_{f_n}\}$ converges to $h_f$ and then it is a Cauchy sequence
(b)
the Kullback–Leibler distance between $f_m$ and $f_n$, which share the same first $n$ fractional moments, given by
$$D(f_m, f_n) = \int_U f_m \ln \frac{f_m}{f_n} \, dx \tag{19}$$
implies
$$D(f_m, f_n) = h_{f_n} - h_{f_m}. \tag{20}$$
Hence, taking into account Pinsker’s inequality ([20], p. 390), it follows
$$\frac{1}{2} \| f_m - f_n \|_1^2 \le D(f_m, f_n) = h_{f_n} - h_{f_m} \tag{21}$$
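The identity $D(f_m, f_n) = h_{f_n} - h_{f_m}$ used above is stated without its intermediate step. A short sketch of why it holds, assuming the exponential form of the MaxEnt density from (10) written as $f_n = \exp(-\sum_{j=0}^{n} \lambda_j^{(n)} x^{\alpha_j})$ with $\alpha_0 = 0$ (the superscript $(n)$ on the multipliers is notation introduced here only for illustration) and the entropy expression (12):

```latex
\begin{aligned}
D(f_m, f_n) &= \int_U f_m \ln f_m \, dx - \int_U f_m \ln f_n \, dx \\
            &= -h_{f_m} + \sum_{j=0}^{n} \lambda_j^{(n)} \int_U x^{\alpha_j} f_m \, dx \\
            &= -h_{f_m} + \sum_{j=0}^{n} \lambda_j^{(n)} m_{\alpha_j}
             = h_{f_n} - h_{f_m}.
\end{aligned}
```

The third equality is where the shared moments enter: $f_m$ reproduces the same first $n$ fractional moments $m_{\alpha_j}$ as $f_n$, so the sum is exactly the expression (12) for $h_{f_n}$.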
  • By replacing $f_m$ with $f$, letting $n \to \infty$, recalling Theorem 6 and the completeness of the $L^1$ space, it holds
    $$\frac{1}{2} \| f_n - f \|_1^2 \le D(f, f_n) = h_{f_n} - h_f \to 0 \tag{22}$$
    and hence $\{f_n\}_{n=1}^{\infty}$ has limit $f$. Then $\{f_n\}_{n=1}^{\infty}$ has a subsequence pointwise convergent a.e. to $f$ and the whole sequence $\{f_n\}_{n=1}^{\infty}$ is also convergent a.e. to the same limit, that is
    $$\lim_{n \to \infty} f_n = f \quad \text{a.e.} \tag{23}$$
    which explains the goodness of the approximation (or estimation, if in a sample setup) of $f$ through the MaxEnt $f_n$ based on fractional moments.
Since $\{f_n\}_{n=1}^{\infty}$ converges in $L^1$-norm to $f$, it also converges to $f$ (in probability and) in distribution, so that
$$\lim_{n \to \infty} F_n(x) = F(x)$$
for all $x$ at which $F(x)$ is continuous, where $F_n$ and $F$ denote the cumulative distribution functions corresponding to $f_n$ and $f$, respectively. Then the approximation $f_n$ is particularly suitable for an accurate calculation of expected values, since as $n \to \infty$ convergence in distribution is equivalent to
$$\lim_{n \to \infty} \int_U g(x) f_n(x) \, dx = \int_U g(x) f(x) \, dx \tag{24}$$
for each bounded function $g$. Then from (21) and (24) it follows
$$\left| E_{f_n}(g) - E_f(g) \right| \le \| g \|_{\infty} \sqrt{2 (h_{f_n} - h_f)} \tag{25}$$
The argument runs similarly when quantiles have to be calculated. They may be expressed as expected values of suitable bounded functions: indeed, for fixed $x$, $F(x) = E[g(t)]$ with $g(t) = 1$ if $t \in [0, x]$ and $g(t) = 0$ if $t \in U \setminus [0, x]$. Then we have
$$\left| F_n(x) - F(x) \right| \le \int_0^x \left| f_n(t) - f(t) \right| dt \le \int_U \left| f_n(t) - f(t) \right| dt \le \sqrt{2 (h_{f_n} - h_f)}. \tag{26}$$
The above convergence results suggest that a rapid convergence in entropy allows an accurate approximation of the desired density or of its features that have to be preserved. So choosing an optimal set of $\alpha_j$’s, in terms of both number and values, becomes the priority.
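The two error bounds for expected values and for the distribution function can be turned into a simple tolerance rule. The following is a minimal sketch; the function names are hypothetical and introduced here only for illustration, and the numerical value of $h_f$ in the usage lines is just a placeholder.

```python
import math

def expectation_error_bound(g_sup, h_fn, h_f):
    """Bound ||g||_inf * sqrt(2 (h_fn - h_f)) on |E_fn(g) - E_f(g)|."""
    residual = h_fn - h_f
    if residual < 0:
        raise ValueError("the MaxEnt entropy h_fn cannot be smaller than h_f")
    return g_sup * math.sqrt(2.0 * residual)

def residual_entropy_for_tolerance(g_sup, tol):
    """Largest residual entropy h_fn - h_f that still guarantees an error <= tol."""
    return tol**2 / (2.0 * g_sup**2)

# For a distribution function value, g is an indicator, so ||g||_inf = 1.
needed = residual_entropy_for_tolerance(g_sup=1.0, tol=1e-2)
h_f = -0.1447                      # placeholder value of the true entropy
bound = expectation_error_bound(1.0, h_f + needed, h_f)
```

Read this way, a target accuracy of $10^{-2}$ on $F(x)$ is guaranteed once the residual entropy drops below $5 \times 10^{-5}$; the bound is a worst case, so the observed error may be considerably smaller.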
Remark 1.
For $X \in [0,1]$ or $X \in [0, \infty)$ with $h_f$ finite, combining Theorem 6 with (21), since $\{h_{f_n}\}_{n=1}^{\infty}$ is a Cauchy sequence, the sequence of continuous functions $\{f_n\}_{n=1}^{\infty}$ is a Cauchy sequence too and then uniformly convergent to $f$. Hence, $f$ is a continuous function. Consequently, an accurate reconstruction of the distribution requires that the underlying density be continuous as well. From an engineering point of view, this request may seem obvious. This explains why, in some numerical tests appearing in the literature concerning the reconstruction of discontinuous densities with entropic techniques using integer or fractional moments, the reconstruction obtained proved to be somewhat inaccurate.

5. Optimal Choice and Optimal Number of $\alpha$’s

As recalled in Section 1, in real-world problems the proposal of a stochastic model or a probabilistic law $F$ for a phenomenon $X$ must necessarily take into account the aspects of $X$ that must be preserved and, in some sense, the proposal process is guided by them. As a consequence, although considering the same phenomenon, it could be necessary to compute several distributions, each referring to a specific aspect that has to be preserved, and then identify the proposal according to it. Because of their flexible choice, fractional moments can be considered a valuable tool in this regard and, for operational reasons, two main questions now need to be addressed: the choice of the number $n$ and of the orders $\alpha$’s of the fractional moments involved in the MaxEnt approximation of $F$. Both questions are strongly related to the notion of convergence in entropy of $F_n$ to $F$, which plays a strategic role in finding appropriate answers to them.

5.1. The Choice of $(\alpha_1, \ldots, \alpha_n)$

Once the theoretical problems of existence and of convergence in entropy from an arbitrary fractional moment sequence according to Lin’s theorems are solved, the approximation of the distribution becomes essentially a computational issue. For this reason, it is a matter of choosing a suitable set of exponents $(\alpha_1, \ldots, \alpha_n)$.
Note that Lin’s theorems provide a theoretical guarantee for the reconstruction process. However, due to the underlying computational issues, there are infinitely many possibilities in the choice of a few fractional moments. Such a choice must rest on the characteristics of the quantity to be calculated that are intended to be retained in the approximation process. Equivalently, unlike with integer moments, in the approximation of the distribution with fractional moments it is possible to incorporate further information that comes from the underlying physical problem. The idea that we follow here was originally proposed by [7] and further explored by [21]. It goes as follows.
  • We shall denote by $f_n$ the density found in (10), to make explicit its dependence on $n$ and implicitly on $(\alpha_1, \ldots, \alpha_n)$. These will be chosen so as to minimize the Kullback–Leibler divergence (19) between the “true” but unknown density $f$ and the maxentropic solution $f_n$. From (12), $h_{f_n} = \sum_{j=0}^{n} \lambda_j m_{\alpha_j}$, and this quantity equals $-\int_U f_n(x) \ln f_n(x) \, dx = -\int_U f(x) \ln f_n(x) \, dx$ because $f$ and $f_n$ satisfy the same moment constraints. Therefore, minimizing (19) amounts to
    $$\arg\min \left\{ \int_U f(x) \ln \frac{f(x)}{f_n(x)} \, dx \;\Big|\; \alpha_1, \ldots, \alpha_n \right\} = \arg\min \left\{ h_{f_n} \;\big|\; \alpha_1, \ldots, \alpha_n \right\}. \tag{27}$$
    In other words, $f_n$ is obtained through two consecutive minimization procedures with respect to $(\alpha_1, \ldots, \alpha_n, \lambda_1, \ldots, \lambda_n) = (\alpha, \lambda)$, namely
    $$\min_{\alpha} \min_{\lambda} h_{f_n}(\lambda, \alpha) = \min_{\alpha} \min_{\lambda} \left[ \ln \left( \int_U \exp\Big( -\sum_{j=1}^{n} \lambda_j x^{\alpha_j} \Big) dx \right) + \sum_{j=1}^{n} \lambda_j m_{\alpha_j} \right] \tag{28}$$
    for $n = 1, 2, \ldots$. This method is an implementation of nested minimization: for each fixed $\alpha$, first minimize $\lambda \mapsto h_{f_n}(\lambda, \alpha)$ and then carry out the outer minimization with respect to $\alpha$. It is worth mentioning that the choice criterion (27) stems from the entropy-convergence Theorem 6. The inner minimization is easy because we are dealing with a convex function. But even though the function $\alpha \mapsto m_{\alpha} = E[X^{\alpha}]$ is log-convex, the linear combination $\sum_{j=0}^{n} \lambda_j m_{\alpha_j}$ need not be so. However, the existence conditions of a unique solution for (28) remain a theoretical open issue.
    Since (28) is a multivariable, highly nonlinear, unconstrained and nonconvex optimization problem, the uniqueness of the MaxEnt solution may not be guaranteed, so the results greatly rely on the initial condition, i.e., different initial conditions may give different MaxEnt solutions. And even if the algorithm converges, there is no assurance that it will have converged to a global, rather than a local, optimum since conventional algorithms cannot distinguish between the two.
    For problems where finding an approximate global optimum is more important than finding a precise local optimum in a fixed amount of time, the Simulated Annealing Method may be preferable to exact algorithms. It explores the function’s entire surface and tries to optimize the function while moving both uphill and downhill. Thus, it is largely independent of the starting values, often a critical input in conventional algorithms. Further, it can escape from local optima and go on to find the global optimum.
    In conclusion, the crucial issue consists of solving the nested minimization, which ranges over two distinct sets of variables $\{\alpha_j\}_{1}^{n}$ and $\{\lambda_j\}_{1}^{n}$. While each $\alpha_j$ takes its values in the interval $(0, \alpha_{max}]$, where $\alpha_{max}$ is dictated by physical or numerical reasons, each $\lambda_j$ may assume any real value.
  • Alternatively, taking into account that, for each fixed set $(\alpha_1, \ldots, \alpha_n)$, the inner $\min_{\lambda_1, \ldots, \lambda_n}$ admits a unique solution, $h_{f_n} = \sum_{j=0}^{n} \lambda_j m_{\alpha_j}$ being a convex function, the outer one could be calculated by a Monte Carlo technique, replacing $\min_{\alpha_1, \ldots, \alpha_n}$ with $\inf_{\alpha_1, \ldots, \alpha_n}$, that is
    $$\inf_{\alpha} \min_{\lambda} h_{f_n}(\lambda, \alpha) = \inf_{\alpha} \min_{\lambda} \left[ \ln \left( \int_U \exp\Big( -\sum_{j=1}^{n} \lambda_j x^{\alpha_j} \Big) dx \right) + \sum_{j=1}^{n} \lambda_j m_{\alpha_j} \right] \tag{29}$$
    with $n = 1, 2, \ldots$. Indeed, Equation (29) is just a computational trick, and replacing $\min_{\alpha_1, \ldots, \alpha_n}$ with $\inf_{\alpha_1, \ldots, \alpha_n}$ arises from the request for an estimator that guarantees faster convergence in entropy. This replacement does not conflict with the spirit of MaxEnt since, regardless of the estimation criterion (29) of the $\alpha$’s, the resulting $f_n$ continues to be a MaxEnt distribution. Then, according to Theorem 6 and Equations (25) and (26), expected values or quantiles can be accurately calculated, up to a predetermined tolerance, by means of (29). Note also that the estimation method (29) lends itself easily to taking into account the existence conditions of $f_n$ stated in Theorem 4: if $f_n(\alpha_1, \ldots, \alpha_{n-1}, \alpha_n)$ does not exist, $f_{n-1}(\alpha_1, \ldots, \alpha_{n-1})$, with the same $(\alpha_1, \ldots, \alpha_{n-1})$ as $f_n$, does exist. Consequently, $h_{f_{n-1}}$ is recalculated and the value $h_{f_{n-1}}$ is assumed.
    Having illustrated the $(\alpha_1, \ldots, \alpha_n)$ selection criteria in the distribution calculation procedure, we can return to the previously introduced problem of the choice of constraints in the construction of the MaxEnt distribution. As a constraint, we can also include the choice of the range $(0, \alpha_{max}]$ in which to place $(\alpha_1, \ldots, \alpha_n)$ in the minimization procedure (29). As an example, if for physical reasons we know that the underlying $f$ has hazard rate function $h(x) = \frac{f(x)}{1 - \int_0^x f(u) \, du}$ with prescribed properties (for instance, asymptotically decreasing to zero), the approximation $f_n$ should preserve such a property. As a consequence we choose (29) once more but, as is easy to verify taking (10) into account, with exponents $(\alpha_1, \ldots, \alpha_n) \in [0, 1)$ (therefore not optimal for the purposes of rapid convergence in entropy). Conversely, with the hazard rate asymptotically increasing to $+\infty$, exponents $(\alpha_1, \ldots, \alpha_n)$ with $\alpha_n > 1$ accomplish that request. Consequently, in both cases, it is important that entropy convergence is ensured.
    In conclusion, the criterion (29) for the calculation of f n is elastic and lends itself to correctly describing multiple scenarios.
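The nested scheme in (28) and its Monte Carlo variant (29) can be sketched in a few lines. What follows is only an illustration under stated assumptions, not the authors' implementation: the support is taken as $U = [0, 1]$, integrals are replaced by a Gauss-Legendre rule, the inner convex problem is solved by a damped Newton iteration, the outer search is plain uniform random sampling of the exponents, and the target density $(\pi/2)\sin(\pi x)$ is chosen only so that the moments can be computed. All function names are hypothetical.

```python
import numpy as np

# Gauss-Legendre nodes and weights mapped from [-1, 1] to U = [0, 1].
_t, _w = np.polynomial.legendre.leggauss(100)
X_NODES, W_NODES = 0.5 * (_t + 1.0), 0.5 * _w

def target_density(x):
    # Illustrative target (an assumption of this sketch): f(x) = (pi/2) sin(pi x).
    return 0.5 * np.pi * np.sin(np.pi * x)

def dual(lam, powers, moments):
    """Value, gradient, Hessian of ln(int exp(-sum lam_j x^a_j) dx) + sum lam_j m_j."""
    s = -powers @ lam
    s_max = s.max()                       # log-sum-exp shift against overflow
    p = W_NODES * np.exp(s - s_max)
    z = p.sum()
    p = p / z                             # discrete weights of the candidate density
    value = s_max + np.log(z) + lam @ moments
    mean = p @ powers
    centered = powers - mean
    hess = (centered * p[:, None]).T @ centered   # covariance matrix, hence convexity
    return value, moments - mean, hess

def inner_min(alphas, tol=1e-10, max_iter=200):
    """Inner convex minimization over lambda for fixed exponents (damped Newton)."""
    alphas = np.asarray(alphas, dtype=float)
    powers = X_NODES[:, None] ** alphas[None, :]
    moments = (W_NODES * target_density(X_NODES)) @ powers
    lam = np.zeros(alphas.size)
    value, grad, hess = dual(lam, powers, moments)
    for _ in range(max_iter):
        if np.linalg.norm(grad) < tol:
            break
        step = -np.linalg.lstsq(hess, grad, rcond=None)[0]
        if grad @ step >= 0.0:            # not a descent direction: use steepest descent
            step = -grad
        t = 1.0
        while t > 1e-10:                  # backtracking (Armijo) line search
            trial = dual(lam + t * step, powers, moments)
            if trial[0] <= value + 1e-4 * t * (grad @ step):
                break
            t *= 0.5
        else:
            break                         # line search failed: keep the current iterate
        lam = lam + t * step
        value, grad, hess = trial
    return value, lam

def monte_carlo_alphas(n, alpha_max, draws, seed=0):
    """Outer search: sample exponent sets at random and keep the lowest entropy."""
    rng = np.random.default_rng(seed)
    best_value, best_alphas, best_lam = np.inf, None, None
    for _ in range(draws):
        alphas = np.sort(rng.uniform(1e-2, alpha_max, size=n))
        value, lam = inner_min(alphas)
        if value < best_value:
            best_value, best_alphas, best_lam = value, alphas, lam
    return best_value, best_alphas, best_lam
```

A call such as `monte_carlo_alphas(n=2, alpha_max=4.0, draws=10)` returns the smallest entropy found together with the corresponding exponents and multipliers. Because every value of the bracketed function in (28) is a cross-entropy of $f$ against a candidate density, it can never fall below $h_f$, which gives a convenient sanity check on any implementation.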

5.2. A Single-Loop Strategy for Approximating $f_n$ with $X \in [0,1]$

In the previous section, the difficulties related to the nested minimization (28) have been circumvented with the Monte Carlo procedure (29), which allows the computation of $\lambda$ uniquely through the minimization of a convex function. In the case $X \in [0,1]$, exploiting the different modes of convergence previously proved, it is possible to further simplify the computation of $\lambda$ in (29) by replacing the inner $\min_{\lambda}$ with the solution of a suitable linear system of equations. Indeed, from (3), integrating by parts (see [21] for details), we have
$$\exp\Big( -\sum_{k=0}^{n} \lambda_k \Big) + \sum_{k=1}^{n} \alpha_k \lambda_k E_{f_n}(X^{\alpha_j + \alpha_k}) = (1 + \alpha_j) E_{f_n}(X^{\alpha_j}), \quad j = 0, \ldots, n \tag{30}$$
Subtracting each equation of index $j = 1, \ldots, n$ from the one of index $j - 1$, the following system of equations in the unknowns $\lambda_1, \ldots, \lambda_n$ is obtained
$$\sum_{k=1}^{n} \alpha_k \lambda_k \left[ E_{f_n}(X^{\alpha_k + \alpha_{j-1}}) - E_{f_n}(X^{\alpha_k + \alpha_j}) \right] = (1 + \alpha_{j-1}) E_{f_n}(X^{\alpha_{j-1}}) - (1 + \alpha_j) E_{f_n}(X^{\alpha_j}) \tag{31}$$
for $j = 1, \ldots, n$, where $E_{f_n}(X^{\alpha_j}) = E_f(X^{\alpha_j})$, $j = 0, \ldots, n$, are known, whilst $E_{f_n}(X^{\alpha_k + \alpha_{j-1}})$ and $E_{f_n}(X^{\alpha_k + \alpha_j})$ are generally unknown. Now observe that, taking (25) into account, the moment curves $E_{f_n}(X^{\alpha})$ and $E_f(X^{\alpha})$ corresponding to $f_n$ and $f$, respectively, differ as follows
$$\left| E_{f_n}(X^{\alpha}) - E_f(X^{\alpha}) \right| \le \sqrt{2 (h_{f_n} - h_f)} \tag{32}$$
Note as well that, with $\alpha$ the solution of (29), the two moment curves $E_f(X^{\alpha})$ and $E_{f_n}(X^{\alpha})$ interpolate in the Birkhoff–Hermite sense at the nodes (see [22]); that is, they are both interpolating and tangent at the nodes $\alpha$. This implies that
$$E_f(X^{\alpha_j}) = E_{f_n}(X^{\alpha_j}), \quad j = 0, 1, 2, \ldots, n; \qquad E_f(X^{\alpha_j} \ln X) = E_{f_n}(X^{\alpha_j} \ln X), \quad j = 1, 2, \ldots, n. \tag{33}$$
By adopting a guessed choice of $\alpha$ (and the Monte Carlo method may achieve this goal), from Theorem 6, $h_{f_n} = h_f$ follows, so that relying upon (32) and (33), both $E_{f_n}(X^{\alpha_k + \alpha_{j-1}}) = E_f(X^{\alpha_k + \alpha_{j-1}})$ and $E_{f_n}(X^{\alpha_k + \alpha_j}) = E_f(X^{\alpha_k + \alpha_j})$ can be set. Thus, with a guessed choice of $\lambda$, it is legitimate to assimilate (31) to a linear system with unknown $\lambda$ which admits a unique solution, being an identity relating $\lambda$ with a set of values optimally picked up from the moment curve $E_{f_n}(X^{\alpha}) \simeq E_f(X^{\alpha})$.
We shall suppose that the solution to (31) coincides with that obtained by solving (29). This brings this section close to experimental mathematics. The analysis needed to compute the error in this approximation is hard, and the verification comes a posteriori, as the numerical results based on it make good sense. Indeed, the numerical evidence suggests that, with $n \simeq 6$ optimally chosen $\alpha$ according to (29), $h_{f_n} \simeq h_f$ is usually observed. From this, combining (20) with (23) (after replacing $f_m$ with $f$), the relationship $f = f_n$ a.e. holds. Consequently, from (25) the last two densities have their respective moment curves coincident as well. Then in (31) $E_{f_n}(X^{\alpha_k + \alpha_j}) = E_f(X^{\alpha_k + \alpha_j})$ can be set, which in turn enables us to state that the solution of (31) coincides with the one obtained from (29). This makes the ansatz plausible. The ill-conditioning of (31) remains to be investigated. In this regard, unlike with integer moments, only a limited number $n \simeq 6$ of fractional moments is sufficient for an accurate estimate of $f_n$, so that ill-conditioning issues are avoided. Once $n$ is fixed, a final consideration concerns the choice of $\alpha_{max} = \max_{1 \le j \le n} \{\alpha_j\}$. Since the simplified procedure just described essentially concerns the accurate calculation of expected values, the answer follows from (24) and (25): compatibly with numerical issues, $\alpha_{max}$ should be taken as large as possible so as not to raise further constraints on a rapid convergence in entropy. As a consequence, the suggested approximate procedure for replacing (29) is as follows:
  • Once $\alpha$ is fixed, and $E_{f_n}(X^{\alpha_k + \alpha_{j-1}}) = E_f(X^{\alpha_k + \alpha_{j-1}})$, $E_{f_n}(X^{\alpha_k + \alpha_j}) = E_f(X^{\alpha_k + \alpha_j})$ are set, $\lambda$ is obtained by solving the linear system (31);
  • as $f_n$ integrates to one, $\lambda_0$ is given by
    $$\lambda_0 = \ln \int_U \exp\Big( -\sum_{j=1}^{n} \lambda_j x^{\alpha_j} \Big) dx \tag{34}$$
  • combining (31) with (34), $h_{f_n} = \lambda_0 + \sum_{j=1}^{n} \lambda_j m_{\alpha_j}$ is calculated and finally
    $$f_n^{(app)}: \quad h_{f_n^{(app)}} = \inf_{\alpha_1, \ldots, \alpha_n} \Big\{ h_{f_n} = \lambda_0 + \sum_{j=1}^{n} \lambda_j m_{\alpha_j} \Big\} \tag{35}$$
In conclusion, the quick simplified approximate procedure avoids the direct solution of (29) by solving the low-order linear system (31) with unknown $\lambda$, doing the numerical integration (34) and performing the one-loop procedure (35), which runs on $\alpha$ by means of a Monte Carlo technique with a reduced number of unknowns. This procedure is computationally feasible and convenient and gives back an accurate approximation of $f_n$ (hence of $f$), as we will see in Example 1.
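For one fixed set of exponents, the three steps of the simplified procedure (linear system, normalization, entropy) can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the support is $[0, 1]$, $\alpha_0 = 0$, all moments are evaluated by a Gauss-Legendre rule from a density known at the quadrature nodes, and the moment curve of $f$ is used in place of that of $f_n$, as the text proposes. The names `moment_curve` and `single_loop_lambdas` are hypothetical.

```python
import numpy as np

_t, _w = np.polynomial.legendre.leggauss(100)
X_NODES, W_NODES = 0.5 * (_t + 1.0), 0.5 * _w     # Gauss-Legendre rule on [0, 1]

def moment_curve(density_values, a):
    """E_f(X^a) by quadrature, for a density sampled at X_NODES."""
    return np.sum(W_NODES * density_values * X_NODES**a)

def single_loop_lambdas(alphas, density_values):
    """Solve the difference system for lambda_1..lambda_n, then lambda_0 and the entropy."""
    alphas = np.asarray(alphas, dtype=float)
    ext = np.concatenate(([0.0], alphas))           # ext[0] is alpha_0 = 0
    n = alphas.size
    M = lambda a: moment_curve(density_values, a)
    A = np.empty((n, n))
    b = np.empty(n)
    for j in range(1, n + 1):
        for k in range(1, n + 1):
            # coefficient of lambda_k in the equation obtained from indices j-1 and j
            A[j - 1, k - 1] = ext[k] * (M(ext[k] + ext[j - 1]) - M(ext[k] + ext[j]))
        b[j - 1] = (1.0 + ext[j - 1]) * M(ext[j - 1]) - (1.0 + ext[j]) * M(ext[j])
    lam = np.linalg.solve(A, b)
    # normalization constant lambda_0 by numerical integration
    lam0 = np.log(np.sum(W_NODES * np.exp(-(X_NODES[:, None] ** alphas) @ lam)))
    moments = np.array([M(a) for a in alphas])
    return lam0, lam, lam0 + lam @ moments

# Check on a density that already has MaxEnt form, where the system is exact.
alphas = np.array([1.0, 2.0])
lam_true = np.array([-4.0, 4.0])                    # arbitrary illustrative multipliers
unnormalized = np.exp(-(X_NODES[:, None] ** alphas) @ lam_true)
lam0_true = np.log(np.sum(W_NODES * unnormalized))
density = unnormalized / np.exp(lam0_true)
lam0, lam, h_app = single_loop_lambdas(alphas, density)
```

When the target density is itself of the form $\exp(-\lambda_0 - \sum_j \lambda_j x^{\alpha_j})$, the integration-by-parts identities hold exactly and the recovered multipliers should match the ones used to build it; for a general $f$ the result is only the approximation discussed above. The outer loop over $\alpha$ would simply repeat this call for randomly drawn exponent sets and keep the smallest entropy.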
Remark 2.
In principle, thanks to the MaxEnt formalism, the procedure outlined above for $X \in [0,1]$ might be likewise extended to $X \in [0, \infty)$. Indeed, in that case the recursive relationship relating the Lagrange multipliers $\{\lambda_j\}_{j=1}^{n}$ with higher order moments is quite similar to (31). With a procedure similar to the case of a random variable $X \in [0,1]$, integrating (11) by parts, the following linear system follows
$$\sum_{k=1}^{n} \alpha_k \lambda_k E_{f_n}(X^{\alpha_k + \alpha_j}) = (1 + \alpha_j) E_{f_n}(X^{\alpha_j}), \quad k, j = 1, \ldots, n \tag{36}$$
with $E_{f_n}(X^{\alpha_j}) = E_f(X^{\alpha_j})$, $j = 1, \ldots, n$, known, whilst $E_{f_n}(X^{\alpha_k + \alpha_j})$ are generally unknown and $\lambda_0$ is given by (34). In analogy with $X \in [0,1]$, the MaxEnt density $f_n$ converges in entropy (and then in distribution) to $f$ according to Theorem 6. Then, starting from moderate values of $n$, $E_{f_n}(X^{\alpha_k + \alpha_j}) = E_f(X^{\alpha_k + \alpha_j})$, for each $j, k$, can be set, where $E_f(X^{\alpha_k + \alpha_j})$ are known quantities. As a consequence, (36) may be considered a linear system admitting a unique solution $(\lambda_1, \ldots, \lambda_n)$, since the matrix involved is a nonsingular Gram matrix for distinct $\{\alpha_j\}_{j=1}^{n}$. However, the method lacks a theoretical ground, since two main issues remain open, that is,
  • how to guarantee $\lambda_n > 0$ in (36) to ensure integrability in (34);
  • although the convergence in distribution is guaranteed, Equation (24) is not applicable, the function $g(x) = x^{\alpha}$ being unbounded.
Nevertheless, the above theoretical drawbacks do not preclude the possibility that, with a special given moment set $\{m_{\alpha_j}\}_{j=1}^{n}$, the method can guarantee accurate results.
Remark 3.
Consider a generic random sample $(X_1, X_2, \ldots, X_N)$ from a distribution having support $U$ not necessarily $[0,1]$: for example, $U = \mathbb{R}^{+}$ or $U = \mathbb{R}$. The simplified procedure for the MaxEnt estimation of the density $f$ proposed in Section 5.2 for the case $X \in [0,1]$ can be easily applied to the case $X \in \mathbb{R}^{+}$ or the case $X \in \mathbb{R}$. Below we briefly sketch some details.
  • Case $X \in \mathbb{R}$: it is enough to transform the original sample data, for instance through $Y = g(X) = \frac{1}{2} + \frac{1}{\pi} \arctan(X)$, to obtain a transformed sample $Y_1, Y_2, \ldots, Y_N$ in $[0,1]$ and apply the simplified procedure of Section 5.2.
  • Case $X \in \mathbb{R}^{+}$: in a similar way, the transformation $Y = g(X) = e^{-X}$ can be applied to the original data $(X_1, X_2, \ldots, X_N)$, obtaining a transformed sample $Y_1, Y_2, \ldots, Y_N$ once again in the $[0,1]$ interval, and the simplified procedure of Section 5.2 is immediately applicable. Note that the empirical fractional moments of $Y$ coincide with the empirical Laplace transform of $X$, that is $\frac{1}{N} \sum_{j=1}^{N} Y_j^{\alpha} = \frac{1}{N} \sum_{j=1}^{N} e^{-\alpha X_j}$, which turns out to be the empirical version of the relationship $\int_0^1 y^{\alpha} \, dF_Y(y) = \int_0^{\infty} e^{-\alpha x} \, dF_X(x)$. This last relation leads us to conclude that the numerical inversion of the Laplace transform can also be reduced to a fractional moment problem in $[0,1]$.
  • More recently, [23] investigated the features of an estimator relying upon the fractional moments for random variables supported on $\mathbb{R}$ by allowing the fractional powers to take complex values. Unlike other authors, they deal with the case in which the negative values of a random variable are not negligible at all.
Once the MaxEnt estimate $f_n^{(app)}$ of $f$ in $[0,1]$ has been obtained, it is possible to come back to the original spaces $\mathbb{R}$ or $\mathbb{R}^{+}$ by the associated inverse transformations $f_n^{(app)}(x) = |g'(x)| \, f_n^{(app)}(g(x))$.
The case $X \in \mathbb{R}$ outlined in items 1. and 3. above partially disproves the criticism arising from [17], where it is asserted that the fractional moments technique is applicable in the case of a positive r.v. $X$ only.
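The two sample transformations of Remark 3 and the link with the empirical Laplace transform are easy to check numerically. The sketch below uses only the Python standard library; the exponential sample and all function names are assumptions made for illustration.

```python
import math
import random

def to_unit_interval_from_real(x):
    """Y = 1/2 + arctan(X)/pi maps the real line into (0, 1)."""
    return 0.5 + math.atan(x) / math.pi

def to_unit_interval_from_positive(x):
    """Y = exp(-X) maps [0, inf) into (0, 1]."""
    return math.exp(-x)

def empirical_fractional_moment(sample, alpha):
    """(1/N) * sum of y^alpha over the sample."""
    return sum(y**alpha for y in sample) / len(sample)

def empirical_laplace_transform(sample, s):
    """(1/N) * sum of exp(-s x) over the sample."""
    return sum(math.exp(-s * x) for x in sample) / len(sample)

def back_transform_density_positive(f_y, x):
    """Density of X recovered from a density f_y of Y = exp(-X): |g'(x)| f_y(g(x))."""
    return math.exp(-x) * f_y(math.exp(-x))

random.seed(0)
xs = [random.expovariate(1.0) for _ in range(1000)]    # illustrative positive sample
ys = [to_unit_interval_from_positive(x) for x in xs]

moment_y = empirical_fractional_moment(ys, 0.3)
laplace_x = empirical_laplace_transform(xs, 0.3)       # agrees with moment_y up to rounding
```

For a unit-rate exponential $X$, the transformed variable $Y = e^{-X}$ is uniform on $(0, 1)$, so back-transforming the constant density $f_Y \equiv 1$ returns $e^{-x}$, which gives a quick consistency check of the inverse transformation.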

5.3. The Choice of n

From the above considerations, when $h_f$ is finite, the question of an optimal choice criterion for $n$ arises in both (28) and (29), as well as in the single-loop strategy. Recall that, when employing integer moments, adding a further moment could result in a negligible or even zero entropy decrease, followed by a non-negligible entropy decrease with the subsequent moments. Consequently, a stopping criterion based solely on the difference in entropy when adding an additional integer moment could be misleading in choosing the optimal number of moments to use. On the contrary, with the choice of fractional moments according to (29), it is easy to deduce that the sequence $\{h_{f_n}\}_{n \in \mathbb{N}}$ is strictly monotonically decreasing. Indeed, consider (29), fix $n$ and calculate $(\alpha_1, \ldots, \alpha_n)$, from which $h_{f_n}$ follows. Next, put $n + 1$ in (29). As a first step, take the special set $(\alpha_1, \ldots, \alpha_{n+1})$ where the first entries $(\alpha_1, \ldots, \alpha_n)$ coincide with those just found and $\alpha_{n+1} > \alpha_n$ is left free (that is, a constrained minimization running on $\alpha_{n+1}$ only, whilst $(\alpha_1, \ldots, \alpha_n)$ is held fixed). Calculate $h_{f_{n+1}}$ and call it $h^{*}_{f_{n+1}}$, with $h^{*}_{f_{n+1}} < h_{f_n}$. As a second step, take $n + 1$ in (29), where the minimum runs on $(\alpha_1, \ldots, \alpha_{n+1})$, from which $h_{f_{n+1}}$ follows (that is, an unconstrained minimization). It follows that $h_{f_{n+1}} < h^{*}_{f_{n+1}} < h_{f_n}$. The sequence $\{h_{f_n}\}_{n \in \mathbb{N}}$ is strictly monotonically decreasing and converges to $h_f$. In the special case where the sequence $\{h_{f_n}\}_{n \in \mathbb{N}}$ is bounded below, it has a finite limit, so it is a Cauchy sequence. We conclude from (29) that the rate of entropy decrease becomes smaller and smaller as $n$ increases; once the difference between successive entropies becomes reasonably small (which is up to the modeler to decide), one stops and accepts the density determined by the larger number of moments as the ‘true’ density.
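The stopping rule described above amounts to a single pass over the computed entropies. A minimal sketch follows; the function name and the entropy values in the usage line are hypothetical, chosen only to illustrate the rule.

```python
def choose_number_of_moments(entropies, tol):
    """Return the number of moments n at which the entropy decrease falls below tol.

    entropies[k] is the MaxEnt entropy obtained with k + 1 fractional moments and
    is assumed to be non-increasing in k. When the decrease from n - 1 to n moments
    is smaller than tol, the density with the larger number of moments (n) is kept.
    """
    for k in range(1, len(entropies)):
        if entropies[k - 1] - entropies[k] < tol:
            return k + 1
    return len(entropies)   # tolerance never reached: keep the largest n available

# Hypothetical entropies for n = 1, ..., 5 (illustrative numbers, not from the paper).
n_chosen = choose_number_of_moments([-0.10, -0.135, -0.1440, -0.1446, -0.1447], tol=1e-3)
```

With these made-up values the decrease first drops below the tolerance between three and four moments, so `n_chosen` is 4. The tolerance plays the role of the "reasonably small" threshold that the text leaves to the modeler.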
We conclude the paper with a simple example just to see the fractional moment MaxEnt technique in action. It involves all the crucial theoretical results (fractional moments, entropy reduction, convergence in entropy, optimal choice of number and exponents) introduced in the previous sections of the paper.
Example 1.
Here our goal is to compare the performances of integer and fractional moments MaxEnt approximations of a given density function $f$ in $[0,1]$. For the sake of comparison, we will also consider the approximation $f^{(app)}$ of $f$ obtained by using fractional moments in the simplified MaxEnt procedure given in Section 5.2. Double-precision arithmetic is used.
Consider
$$f(x) = \frac{\pi}{2} \sin(\pi x) \, I_{[0,1]}(x)$$
with $h_f = 1 - \ln \pi \approx -0.1447$. The integer moments (im) have the following recursive relationship
$$m_j = E(X^j) = \frac{1}{2} - \frac{j (j - 1)}{\pi^2} m_{j-2}, \quad j \ge 2, \quad m_0 = 1, \quad m_1 = \frac{1}{2}$$
whilst the fractional moments (fm) $m_{\alpha_j} = E(X^{\alpha_j})$ are obtained by numerical integration. The associated MaxEnt densities $f_n^{(im)}$ and $f_n^{(fm)}$ are given by (4) and (10), respectively.
Optimal fractional moments produce a fast entropy decrease $h_{f_n^{(fm)}} \to h_f$, and 4 or 5 of them capture $f$. A definitely higher number of integer moments is instead required to obtain a comparable reconstruction of $f$, incurring drastic numerical instability due to ill-conditioning for $n > 12$: the first column of Table 2 gives evidence of it and, in fact, the sequence $\{h_{f_n^{(im)}}\}_{n \in \mathbb{N}}$ ceases to be monotonically decreasing. The third column of Table 2 contains the residual entropy concerning the MaxEnt approximation $f_n^{(app)}$ of $f$ given by the simplified MaxEnt procedure described in Section 5.2. A quick comparison shows the closeness of the solution based on fractional moments and that based again on fractional moments but using the simplified procedure of Section 5.2. As a consequence of the chain of convergence implications descending from the convergence in entropy of $f_n^{(fm)}$ to $f$, the features of interest of the density $f$ can be well approximated (or estimated, if in a sample setup) by the corresponding features of $f_n^{(fm)}$ or $f_n^{(app)}$, with the approximation error governed in terms of $n$.
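The ingredients of Example 1 can be reproduced with a few lines of standard-library Python. The sketch below evaluates the integer moments by the recursion, the fractional moments and the entropy by a composite Simpson rule (one possible choice of numerical integration, assumed here), and compares the entropy with the closed form $1 - \ln \pi$. Function names are hypothetical.

```python
import math

def density(x):
    """f(x) = (pi/2) sin(pi x) on [0, 1]."""
    return 0.5 * math.pi * math.sin(math.pi * x)

def integer_moments(n_max):
    """m_0..m_{n_max} from m_j = 1/2 - j (j - 1) / pi^2 * m_{j-2}."""
    m = [1.0, 0.5]
    for j in range(2, n_max + 1):
        m.append(0.5 - j * (j - 1) / math.pi**2 * m[j - 2])
    return m[: n_max + 1]

def simpson(func, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0

def moment(alpha):
    """E(X^alpha) by numerical integration; alpha may be fractional."""
    return simpson(lambda x: x**alpha * density(x), 0.0, 1.0)

def entropy():
    """h_f = -int f ln f dx, with the convention 0 * ln 0 = 0 at the endpoints."""
    def integrand(x):
        fx = density(x)
        return -fx * math.log(fx) if fx > 0.0 else 0.0
    return simpson(integrand, 0.0, 1.0)

m = integer_moments(8)
h_f = entropy()
```

The recursion and the quadrature should agree for integer orders, which is a convenient check before moving to fractional orders such as `moment(0.5)`. Note that the forward recursion subtracts nearly equal numbers for large $j$, so rounding errors grow with the order; this sketch therefore stops at low orders.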

6. Conclusions

The approximation of probability distributions with finite or unbounded positive support using fractional moments to express available information has been reconsidered. Compared to the classical MaxEnt, a novel feature of the proposed method is that the fractional exponent of the MaxEnt distribution is determined through the entropy maximization process, instead of being assigned a priori by an analyst. Theorems of the existence of the maximum entropy distribution and its convergence in entropy are provided. The latter allows us to reconsider the selection criteria of the fractional exponents in order to speed up the convergence in entropy and other related modes of convergence, as well as preserve some important prior features of the underlying distribution.
Since the release two decades ago of the first paper on the subject by the authors, numerous criticisms have been raised by several researchers who have used this methodology. The main criticism focused on the method of calculating the distribution, which consists of a nested minimization. There is no need to dwell on the existence or non-existence of the outer minimum in (28). With $\min_{\alpha_1, \ldots, \alpha_n}$ we simply intend to bring the entropy down as fast as possible. Indeed, from the different modes of convergence listed above, it can be deduced that the error committed in the calculation of expected values or quantiles is controlled precisely by the residual entropy.
Relying on a solid proof of the entropy convergence theorem, from which different convergence modes are derived as well, we have tried to overcome that shortcoming. For this reason, a Monte Carlo method has been suggested which rests on solid foundations, since the only minimization involved is performed on a function that is known to be convex.
Concerning the computational effort of the suggested techniques, the algorithm associated with (29) requires a numerical integration subroutine, while the simplified procedure of Section 5.2 needs a numerical routine for the solution of a linear system of equations. Both tools are available in any mathematical or statistical numerical package, and the effort seems definitely reasonable.
It is also worth noting that a finite number of fractional moments, properly tailored to the problem of interest, can still be used to represent the available information when only a random sample from an unknown distribution is available. The existence theorem and the convergence in entropy theorem give a solid basis for inferential procedures about the unknown $f$ based on $f_n$.
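In the sample-based setting, the natural plug-in estimate of a fractional moment is the sample average of $x_i^{\alpha}$. The sketch below illustrates this under stated assumptions: the helper name `sample_fractional_moments`, the exponents, and the simulated data (draws of $\sqrt{U}$ with $U$ uniform, whose density is $2x$ on $[0, 1]$) are all chosen here for illustration and do not come from the paper.

```python
import math
import random

def sample_fractional_moments(sample, alphas):
    """Plug-in estimates (1/N) * sum(x_i ** alpha), one per exponent."""
    if not sample:
        raise ValueError("empty sample")
    if any(x < 0 for x in sample):
        raise ValueError("fractional moments need a non-negative sample")
    n = len(sample)
    return [sum(x ** a for x in sample) / n for a in alphas]

rng = random.Random(1)
# X = sqrt(U) has density 2x on [0, 1], so E[X^a] = 2 / (a + 2) exactly.
sample = [math.sqrt(rng.random()) for _ in range(20000)]
alphas = [0.25, 0.5, 1.5]
estimates = sample_fractional_moments(sample, alphas)
exact = [2.0 / (a + 2.0) for a in alphas]
for a, est, ex in zip(alphas, estimates, exact):
    print(f"alpha={a}: estimate={est:.4f}, exact={ex:.4f}")
```

The estimated moments would then replace the exact ones as constraints when computing $f_n$.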
Recalling that fractional moments belong to the mathematical family of T-systems has helped to provide:
  • the conditions for the existence of the density $f_n$;
  • the theorem of convergence in entropy, from which other modes of convergence follow;
  • an optimal choice and an optimal number of the fractional exponents $\alpha$;
  • assuming $X \in [0, 1]$, a single-loop algorithm for approximating $f_n$.

Author Contributions

The authors have contributed equally to both the conception and drafting as well as the revision of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jaynes, E.T. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620. [Google Scholar] [CrossRef]
  2. Jaynes, E.T. Information theory and statistical mechanics II. Phys. Rev. 1957, 108, 171. [Google Scholar] [CrossRef]
  3. Akhiezer, N.I. The Classical Moment Problem and Some Related Questions in Analysis; Oliver and Boyd: Edinburgh, UK, 1965. [Google Scholar]
  4. Shohat, J.A.; Tamarkin, J.D. The Problem of Moments; Mathematical Surveys and Monographs-Volume I; American Mathematical Society: Providence, RI, USA, 1943. [Google Scholar]
  5. Olteanu, O. Symmetry and asymmetry in moment, functional equations and optimization problems. Symmetry 2023, 15, 1471. [Google Scholar] [CrossRef]
  6. Papalexiou, S.M.; Koutsoyiannis, D. Entropy based derivation of probability distributions: A case study to daily rainfall. Adv. Water Resour. 2012, 45, 51–57. [Google Scholar] [CrossRef]
  7. Novi Inverardi, P.L.; Tagliani, A. Maximum Entropy Density Estimation from Fractional Moments. Commun. Stat. Theory Methods 2003, 32, 327–345. [Google Scholar] [CrossRef]
  8. Novi Inverardi, P.L.; Petri, A.; Pontuale, G.; Tagliani, A. Stieltjes moment problem via fractional moments. Appl. Math. Comput. 2005, 166, 664–677. [Google Scholar] [CrossRef]
  9. Xu, J.; Zhu, S. An efficient approach for high-dimensional structural reliability analysis. Mech. Syst. Signal Process. 2019, 122, 152–170. [Google Scholar] [CrossRef]
  10. Zhang, X.; Pandey, M.D. Structural reliability analysis based on the concepts of entropy, fractional moment and dimensional reduction method. Struct. Saf. 2017, 43, 28–40. [Google Scholar] [CrossRef]
  11. Zhang, X.; He, W.; Zhang, Y.; Pandey, M.D. An effective approach for probabilistic lifetime modelling based on the principle of maximum entropy with fractional moments. Appl. Math. Model. 2017, 51, 626–642. [Google Scholar] [CrossRef]
  12. Ferreira de Lima, A.R.; Ferreira Batista, J.L.; Prado, P.I. Modelling Tree Diameter Distributions in Natural Forests: An Evaluation of 10 Statistical Models. Forest Sci. 2015, 61, 320–327. [Google Scholar] [CrossRef]
  13. Lin, G.D. Characterizations of Distributions via moments. Sankhya Indian J. Stat. 1992, 54, 128–132. [Google Scholar]
  14. Karlin, S.; Studden, W.J. Tchebycheff Systems: With Applications in Analysis and Statistics; Wiley Interscience: New York, NY, USA, 1966. [Google Scholar]
  15. Krein, M.G.; Nudelman, A.A. The Markov Moment Problem and Extremal Problems; American Mathematical Society: Providence, RI, USA, 1977. [Google Scholar]
  16. Novi Inverardi, P.L.; Tagliani, A. Stieltjes and Hamburger Reduced Moment Problem When MaxEnt Solution Does Not Exist. Mathematics 2021, 9, 309. [Google Scholar] [CrossRef]
  17. Alibrandi, U.; Mosalam, K.M. Kernel density maximum entropy method with generalized moments for evaluating probability distributions, including tails, from a small sample of data. Int. J. Numer. Methods Eng. 2017, 113, 1904–1928. [Google Scholar] [CrossRef]
  18. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2006. [Google Scholar]
  19. Gzyl, H. Super resolution in the maximum entropy approach to invert Laplace transforms. Inverse Probl. Sci. Eng. 2017, 25, 1536–1545. [Google Scholar] [CrossRef]
  20. Kullback, S. Information Theory and Statistics; Dover: New York, NY, USA, 1967. [Google Scholar]
  21. Tagliani, A. Hausdorff moment problem and fractional moments: S simplified procedure. Appl. Math. Comput. 2011, 218, 4423–4432. [Google Scholar] [CrossRef]
  22. Gzyl, H.; Novi Inverardi, P.L.; Tagliani, A. Fractional moments and maximum entropy: Geometric meaning. Commun. Stat. Theory Methods 2014, 43, 3596–3601. [Google Scholar] [CrossRef]
  23. Akaoka, Y.; Okamura, K.; Otobe, Y. Properties of complex-valued power means of random variables and their applications. Acta Math. Acad. Sci. Hung. 2023, 171, 124–175. [Google Scholar] [CrossRef]
Table 1. Some families of distributions and their characterizing moments.

| Family of distributions | Density | Characterizing moments |
|---|---|---|
| Gamma$(\gamma, \beta)$ | $\dfrac{x^{\gamma-1}\exp\{-x/\beta\}}{\Gamma(\gamma)\,\beta^{\gamma}}\,\mathbf{1}_{\mathbb{R}^+}(x)$ | $E[X]$, $E[\ln(X)]$ |
| Pareto$(\gamma, k)$ | $\gamma\,k^{\gamma}\,x^{(-\gamma-1)}\,\mathbf{1}_{[k,+\infty)}(x)$ | $E[\ln(X)]$ |
| Lognormal$(\mu, \sigma)$ | $\dfrac{\exp\{-[\ln(x)-\mu]^2/2\sigma^2\}}{\sqrt{2\pi}\,\sigma\,x}\,\mathbf{1}_{\mathbb{R}^+}(x)$ | $E[\ln(X)]$, $E[\ln^2(X)]$ |
| Rayleigh$(\sigma^2)$ | $\dfrac{x\exp\{-(x^2/2\sigma^2)\}}{\sigma^2}\,\mathbf{1}_{\mathbb{R}^+}(x)$ | $E[X^2]$, $E[\ln(X)]$ |
Table 2. Residual entropy with integer (left), fractional (middle) moments and approximated method (right) for an increasing number n of moments.

| $n$ | $h_{f_n^{(im)}} - h_f$ | $n$ | $h_{f_n^{(fm)}} - h_f$ | $n$ | $h_{f_n^{(app)}} - h_f$ |
|---|---|---|---|---|---|
| 2 | $0.2577 \times 10^{-1}$ | 1 | $0.8774 \times 10^{-1}$ | 1 | $0.8893 \times 10^{-1}$ |
| 4 | $0.5051 \times 10^{-2}$ | 2 | $0.8077 \times 10^{-2}$ | 2 | $0.2802 \times 10^{-2}$ |
| 6 | $0.1626 \times 10^{-2}$ | 3 | $0.4488 \times 10^{-3}$ | 3 | $0.4851 \times 10^{-3}$ |
| 8 | $0.1443 \times 10^{-2}$ | 4 | $0.6043 \times 10^{-5}$ | 4 | $0.4693 \times 10^{-4}$ |
| 10 | $0.6698 \times 10^{-3}$ | 5 | $0.4000 \times 10^{-6}$ | 5 | $0.1196 \times 10^{-4}$ |
| 12 | $0.5923 \times 10^{-3}$ | | | 6 | $0.14152 \times 10^{-5}$ |

Novi Inverardi, P.L.; Tagliani, A. Probability Distributions Approximation via Fractional Moments and Maximum Entropy: Theoretical and Computational Aspects. Axioms 2024, 13, 28. https://0-doi-org.brum.beds.ac.uk/10.3390/axioms13010028