Review

How to Read Probability Distributions as Statements about Process

Department of Ecology & Evolutionary Biology, University of California, Irvine, CA 92697, USA
Entropy 2014, 16(11), 6059-6098; https://0-doi-org.brum.beds.ac.uk/10.3390/e16116059
Submission received: 22 October 2014 / Revised: 13 November 2014 / Accepted: 14 November 2014 / Published: 18 November 2014
(This article belongs to the Section Statistical Physics)

Abstract

Probability distributions can be read as simple expressions of information. Each continuous probability distribution describes how information changes with magnitude. Once one learns to read a probability distribution as a measurement scale of information, opportunities arise to understand the processes that generate the commonly observed patterns. Probability expressions may be parsed into four components: the dissipation of all information, except the preservation of average values, taken over the measurement scale that relates changes in observed values to changes in information, and the transformation from the underlying scale on which information dissipates to alternative scales on which probability pattern may be expressed. Information invariances set the commonly observed measurement scales and the relations between them. In particular, a measurement scale for information is defined by its invariance to specific transformations of underlying values into measurable outputs. Essentially all common distributions can be understood within this simple framework of information invariance and measurement scale.

1. Introduction

Patterns of nature often follow probability distributions. Physical processes lead to an exponential distribution of energy levels among a collection of particles. Random fluctuations about mean values generate a Gaussian distribution. In biology, the age of cancer onset tends toward a gamma distribution. Economic patterns of income typically match variants of the Pareto distributions with power law tails.
Theories in those different disciplines attempt to fit observed patterns to an underlying generative process. If a generative model predicts the observed pattern, then the fit promotes the plausibility of the model. For example, the gamma distribution for the ages of cancer onset arises from a multistage process [1]. If cancer requires k different rate-limiting events to occur, then, by classical probability theory, the simplest model for the waiting time for the kth event to occur is a gamma distribution.
Many other aspects of cancer biology tell us that the process indeed depends on multiple events. However, how much do we really learn by this inverse problem, in which we start with an observed distribution of outcomes and then try to infer underlying process? How much does an observed distribution by itself constrain the range of underlying generative processes that could have led to that observed pattern?
The main difficulty of the inverse problem has to do with the key properties of commonly observed patterns. The common patterns are almost always those that arise by a wide array of different underlying processes [2,3]. We may say that a common pattern has a wide basin of attraction, in the sense that many different initial starting conditions and processes lead to that same common outcome. For example, the central limit theorem is, in essence, the statement that adding up all sorts of different independent processes often leads to a Gaussian distribution of fluctuations about the mean value.
In general, the commonly observed patterns are common because they are consistent with so many different underlying processes and initial conditions. The common patterns are therefore particularly difficult with regard to the inverse problem of going from observed distributions to inferences about underlying generative processes. However, an observed pattern does provide some information about the underlying generative process, because only certain generative processes lead to the observed outcome. How can we learn to read a mathematical expression of a probability pattern as a statement about the family of underlying processes that may generate it?

2. Overview

In this article, I will explain how to read continuous probability distributions as simple statements about underlying process. I presented the technical background in an earlier article [4], with additional details in other publications [3,5,6]. Here, I focus on developing the intuition that allows one to read probability distributions as simple sentences. I also emphasize key unsolved puzzles in the understanding of commonly observed probability patterns.
Section 3 introduces the four components of probability patterns: the dissipation of all information, except the preservation of average values, taken over the measurement scale that relates changes in observed values to changes in information, and the underlying scale on which information dissipates relative to alternative scales on which probability pattern may be expressed.
Section 4 develops an information theory perspective. A distribution can be read as a simple statement about the scaling of information with respect to the magnitude of the observations. Because measurement has a natural interpretation in terms of information, we can understand probability distributions as pure expressions of measurement scales.
Section 5 illustrates the scaling of information by the commonly observed log-linear pattern. Information in observations may change logarithmically at small magnitudes and linearly at large magnitudes. The classic gamma distribution is the pure expression of the log-linear scaling of information.
Section 6 presents the inverse linear-log scale. The Lomax and generalized Student’s distributions follow that scale. Those distributions include the classic exponential and Gaussian forms in their small-magnitude linear domain, but add power law tails in their large-magnitude logarithmic domain.
Section 7 shows that the commonly observed log-linear and linear-log scales form a dual pair through the Laplace transform. That transform changes addition of random variables into multiplication, and multiplication into addition. Those arithmetic changes explain the transformation between multiplicative log scaling and additive linear scaling. In general, integral transforms describe dualities between pairs of measurement scales, clarifying the relations between commonly observed probability patterns.
Section 8 considers cases in which information dissipates on one scale, but we observe probability pattern on a different scale. The log-normal distribution is a simple example, in which observations arise as products of perturbations. In that case, information dissipates on the additive log scale, leading to a Gaussian pattern on that log scale.
Section 8 continues with the more interesting case of extreme values, in which one analyzes the largest or smallest value of a sample. For extreme values, dissipation of information happens on the scale of cumulative probabilities, but we express probability pattern on the typical scale for the relative probability at each magnitude. Once one recognizes the change in scale for extreme value distributions, those distributions can easily be read in terms of my four basic components.
Section 9 returns to dual scales connected by integral transforms. In superstatistics, one evaluates a parameter of a distribution as a random variable rather than a fixed value. Averaging over the distribution of the parameter creates a special kind of integral transform that changes the measurement scale of a distribution, altering that original distribution into another form with a different scaling relation.
Section 10 considers alternative perspectives on generative process. We may observe pattern on one scale, but the processes that generated that pattern may have arisen on a dual scale. For example, we may observe the classic gamma probability pattern of log-linear scaling, in which we measure the time per event. However, the underlying generative process may have a more natural interpretation on the inverse linear-log scaling of the Lomax distribution. That inverse scale has dimensions of events per unit time, or frequency.
Section 11 reiterates how to read probability distributions. I then introduce the Lévy stable distributions, in which dual scales relate to each other by the Fourier integral transform. The Lévy case connects log scaling in the tails of distributions to constraints in the dual domain on the average of power law expressions. The average of power law expressions describes fractional moments, which associate with the common stretched exponential probability pattern.
Section 12 explains the relations between different probability patterns. Because a probability pattern is a pure expression of a measurement scale, the genesis of probability patterns and the relations between them reduce to understanding the origins of measurement scales. The key is that the dissipation of information and maximization of entropy set a particular invariance structure on measurement scales. That invariance strongly influences the commonly observed scales and thus the commonly observed patterns of nature.
Section 12 continues by showing that particular aspects of invariance lead to particular patterns. For example, shift invariance with respect to the information in underlying values and transformed measured values leads to exponential scaling of information. By contrast, affine invariance leads to linear scaling. The distinctions between broad families of probability distributions turn on this difference between shift and affine invariance for the information in observations.
Section 13 presents a broad classification of measurement scales and associated probability patterns. Essentially all commonly observed distributions arise within a simple hierarchically generated sequence of measurement scales. That hierarchy shows one way to consider the genesis of the common distributions and the relations between them. I present a table that illustrates how the commonly observed distributions fit within this scheme.
Section 14 considers the most interesting unsolved puzzle: Why do linear and logarithmic scaling dominate the base scales of the commonly observed patterns? One possibility is that linear and log scaling express absolute and relative incremental information, the two most common ways in which information may scale. Linear and log scaling also have a natural association with addition and multiplication, suggesting a connection between common arithmetic operations and common scaling relations.
Section 15 suggests one potential solution to the puzzle of why commonly observed measurement scales are simple. Underlying values may often be transformed by multiple processes before measurement. Each transformation may be complex, but the aggregate transformation may smooth into a simple relation between initial inputs and final measured outputs. The scaling that defines the associated probability pattern must provide invariant information with respect to underlying values or final measured outputs. If the ultimate transformation of underlying values to final measured outputs is simple, then the required invariance may often define a simple information scaling and associated probability pattern.
The Discussion summarizes key points and emphasizes the major unsolved problems.

3. The Four Components of Probability Patterns

To parse probability patterns, one must distinguish four properties. In this section, I begin by briefly describing each property. I then match the properties to the mathematical forms of different probability patterns, allowing one to read probability distributions in terms of the four basic components. Later sections develop the concepts and applications.
First, dissipation of information occurs because most observable phenomena arise by aggregation over many smaller scale processes. The multiple random, small scale fluctuations often erase the information in any particular lower level process, causing the aggregate observable probability pattern to be maximally random subject to constraints that preserve information [2,7,8].
Second, average values tend to be the only preserved information after aggregation has dissipated all else. Jaynes [2,7,8] developed dissipation of information and constraint by average values as the key principles of maximum entropy, a widely used approach to understanding probability patterns. I extended Jaynesian maximum entropy with the following two additional components [4–6].
Third, average values may arise on different measurement scales. For example, in large scale fluctuations, one might only be able to obtain information about the logarithm of the underlying values. The constrained average would be the mean of the logarithmic values, or the geometric mean. The information in measurements may change with magnitude. In some cases, the scale may be linear for small fluctuations but logarithmic for large fluctuations, leading to an observed linear-log scale of observations.
Fourth, the measurement scale on which information dissipates may differ from the scale on which one observes pattern. For example, a multiplicative process causes information to dissipate on the additive logarithmic scale, but we may choose to analyze the observed multiplicative pattern. Alternatively, information may dissipate by the multiplication of the cumulative probabilities that individual fluctuations fall below some threshold, but we may choose to analyze the extreme values of aggregates on a transformed linear scale.
The measurement scaling defines the various commonly observed probability distributions. By learning to parse the scaling relations of measurement implicit in the mathematical expressions of probability patterns, one can read those expressions as simple statements about underlying process. The previously hidden familial relations between different kinds of probability distributions become apparent through their related forms of measurement scaling.

3.1. Dissipation of Information

Most observations occur on a macroscopic scale that arises by aggregation of many small scale phenomena [9]. Each small scale process often has a random component. The greater the number of small scale fluctuations that combine to form an aggregate, the greater the total randomness in the macroscopic system. We may think of randomness as entropy or as the loss of information. Thus, aggregation dissipates information and increases entropy [7,8].
A typical measure of entropy or randomness is
$$\varepsilon = -\int p_y \log(p_y)\, dy, \qquad (1)$$
in which py describes the probability distribution for a variable y.
Information is the negative of the entropy, and so the dissipation of information is also given by the entropy [10]. I use a continuous form of entropy throughout this article, and focus only on the continuous probability distributions. Discrete distributions follow a similar logic, but require different expressions and details of presentation.
We can find the probability distribution consistent with maximum entropy by maximizing the expression in Equation (1), which requires solving ∂ε /∂py = 0. The solution is py = c, where c is a constant. This uniform distribution describes the pattern in which the probability of observing any value is the same for all values of y. The maximum entropy uniform distribution has the least information, because all outcomes are equally likely.

3.2. Constraint by Average Values

Suppose that we are studying the distribution of energy levels in a population of particles. We want to know the probability that any particle has a certain level of energy. The probability distribution over the population describes the probability of different levels of energy per particle.
Typically, there is a certain total amount of energy to be distributed among the particles in the population. The fixed total amount of energy constrains the average energy per particle.
To find the distribution of energy, we could reasonably assume that many different processes operate at a small scale, influencing each particle in multiple ways. Each small scale process often has a random component. In the aggregate of the entire population, those many small scale random fluctuations tend to increase the total entropy in the population, subject to the constraint that the mean is set extrinsically.
For any pattern influenced by small-scale random fluctuations, the only constraint on randomness may be a given value for the mean. If so, then pattern follows maximum entropy subject to a constraint on the mean [7,8].

3.2.1. Constraint on the Mean

When we maximize the entropy in Equation (1) to find the probability distribution consistent with the inevitable dissipation of information and increase in entropy, we must also account for the constraint on the average value of observable events. The technical approach to maximizing a quantity, such as entropy, subject to a constraint is the method of Lagrange multipliers. In particular, we must maximize the quantity
$$\Lambda = \varepsilon - \kappa C_0 - \lambda C, \qquad (2)$$
in which the constraint on the average value is written as $C = \int p_y\, y\, dy - \mu$. The integral term of the constraint is the average value of $y$ over the distribution $p_y$, and the term, $\mu$, is the actual average value set by constraint. The method guarantees that we find a distribution, $p_y$, that satisfies the constraint, in particular that the average of the distribution that we find is indeed equal to the given constraint on the average, $\int p_y\, y\, dy = \mu$. We must also set the total probability to be one, expressed by the constraint $C_0 = \int p_y\, dy - 1$.
We find the maximum of Equation (2) by solving $\partial\Lambda/\partial p_y = 0$ for the constants $\kappa$ and $\lambda$ that satisfy the constraint on total probability and the constraint on average value, yielding
$$p_y \propto e^{-\lambda y}, \qquad (3)$$
in which λ = 1/μ, and ∝ means “is proportional to.” The total probability over a distribution must be one. If we use that constraint on total probability, we can find a constant, ψ, such that $p_y = \psi e^{-\lambda y}$ is an equality rather than a proportionality. That is easy to do, but adds additional steps and a lot of notational complexity without adding any further insight. I therefore present distributions without the adjusting constants, and write the distributions as “py ∝” to express the absence of the constants and the proportionality of the expression.
The expression in Equation (3) is known as the exponential distribution, or sometimes the Gibbs or Boltzmann distribution. We can read the distribution as a simple statement. The exponential distribution is the probability pattern for a positive variable that is most random, or has least information, subject to a constraint on the mean. Put another way, the distribution contains information only about the mean, and nothing else.
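For completeness, the variational step behind this result can be sketched as follows, using the same symbols as Equation (2) and absorbing the normalization constant as described above:
$$\frac{\partial \Lambda}{\partial p_y} = -\log(p_y) - 1 - \kappa - \lambda y = 0
\quad\Longrightarrow\quad
p_y = e^{-(1+\kappa)}\, e^{-\lambda y} \propto e^{-\lambda y}.$$
For a positive variable, the two constraints $\int_0^\infty p_y\, dy = 1$ and $\int_0^\infty y\, p_y\, dy = \mu$ then fix the constants, giving $e^{-(1+\kappa)} = \lambda$ and $\lambda = 1/\mu$.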

3.2.2. Constraint on the Average Fluctuations from the Mean

Sometimes we are interested in fluctuations about a mean value or central location. For example, what is the distribution of errors in measurements? How do average values in samples vary around the true mean value? In these cases, we may describe the intrinsic variability by the variance. If we constrain the variance, we are constraining the average squared distance of fluctuations about the mean.
We can find the distribution that is most random subject to a constraint on the variance by using the variance as the constraint in Equation (2). In particular, let $C = \int p_y (y - \mu)^2\, dy - \sigma^2$, in which $\sigma^2$ is the variance and $\mu$ is the mean. This expression constrains the squared distance of fluctuations, $(y - \mu)^2$, averaged over the probability distribution of fluctuations, $p_y$, to be the given constraint, $\sigma^2$.
Without loss of generality, we can set μ = 0 and interpret y as a deviation from the mean, which simplifies the constraint to be $C = \int p_y y^2\, dy - \sigma^2$. We can then write the constraint on the mean or the constraint on the variance as a single general expression
$$C = \int p_y f_y\, dy - \bar{f}_y, \qquad (4)$$
in which $f_y$ is $y$ or $y^2$ for constraints on the mean or variance, respectively, and $\bar{f}_y$ is the extrinsically set constraint on the mean or variance, respectively. Then the maximization of entropy subject to constraint takes the general form
$$p_y \propto e^{-\lambda f_y}. \qquad (5)$$
If we constrain the mean, then $f_y = y$ and $\lambda = 1/\mu$, yielding the exponential form in Equation (3). If we constrain the variance, then $f_y = y^2$ and $\lambda = 1/(2\sigma^2)$, yielding the Gaussian distribution.
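As a quick numerical illustration of this maximum entropy property (a sketch, not part of the formal argument), one can compare the differential entropy of a few zero-mean distributions all scaled to the same variance; the Gaussian should come out on top. The parameter values and the use of scipy here are only for illustration.

```python
import numpy as np
from scipy import stats

# Compare differential entropy of zero-mean distributions with variance 1.
# Maximum entropy subject to a variance constraint predicts the Gaussian wins.
sigma2 = 1.0
candidates = {
    "gaussian": stats.norm(scale=np.sqrt(sigma2)),
    "laplace":  stats.laplace(scale=np.sqrt(sigma2 / 2)),                 # var = 2*scale^2
    "uniform":  stats.uniform(loc=-np.sqrt(3 * sigma2), scale=2 * np.sqrt(3 * sigma2)),
}
for name, dist in candidates.items():
    print(f"{name:8s}  variance = {dist.var():.3f}  entropy = {dist.entropy():.4f}")
```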

3.3. The Measurement Scale for Average Values

The constraint on randomness may be transformed by the measurement scale [4,6]. We may write the transformation of the observable values, fy, as T(fy) = Tf. Here, fy is y or y2 depending on whether we are interested in the average value or in the average distance from a central location, and T is the measurement scale. Thus, the constraint in Equation (4) can be written as
$$C = \int p_y T_f\, dy - \bar{T}_f, \qquad (6)$$
which generalizes the solution in Equation (5) to
$$p_y \propto e^{-\lambda T_f}. \qquad (7)$$
This form provides a simple way to express many different probability distributions, by simply choosing Tf to be a constraint that matches the form of a distribution. For example, the power law distribution, $p_y \propto y^{-\lambda}$, corresponds to the measurement scale $T_f = \log(y)$. In general, finding the measurement scale and the associated constraint that lead to a particular form for a distribution is useful, because the constraint concisely expresses the information in a probability pattern [4,6].
Simply matching probability patterns to their associated measurement scales and constraints leaves open the problem of why particular scalings and constraints arise. What sort of underlying generative processes lead to a particular scaling relation, Tf, and therefore attract to the same probability pattern? I address that crucial question in later sections. For now, it is sufficient to note that we have a simple way to connect the dissipation of information and constraint to probability patterns.

3.4. The Scale on which Information Dissipates

In some cases, information dissipates on one scale, but we wish to express the probability pattern on another scale. Suppose that information dissipates on the scale given by x, leading to the distribution px. After obtaining the distribution on the scale x by applying the theory for the dissipation of information and constraint, we may wish to transform the distribution to a different scale, y. Here, I briefly mention two distinct types of transformation. Later sections illustrate the crucial role of scale transformations in understanding several important probability patterns. The Methods provides technical details.

3.4.1. Change of Variable

The relation between x and y is given by the transformation x = g(y), where g is some function of y. For example, we may have x = log(y). In general, we can use any transformation that has meaning for a particular problem. Several important probability distributions arise by dissipation of information on scales other than the one on which we typically express probability patterns. To understand those distributions, one must recognize the scale on which information dissipates and the transformed scale used to express the probability distribution.
Define my = |g′(y)|, where g′ is the derivative of g with respect to y. The notation my emphasizes the term as the measurement scale correction when observing pattern on the scale y. Because information dissipates on the scale x, we can often find the distribution px easily from Equation (7), in which Tf is a function of fx. Applying the change in measure, my, we obtain
$$p_y \propto m_y\, e^{-\lambda T_f}, \qquad (8)$$
in which we replace $f_x$ by the transformed expression $f_{g(y)}$ in the scaling relation $T_f$.
The key point is that we have simply made a change of variable from x to y. The term my adjusts the scaling of the probability pattern for that change of variable.
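In the standard change-of-variable notation for densities, this adjustment reads
$$p_y(y) = p_x\!\big(g(y)\big)\,\lvert g'(y)\rvert = m_y\, p_x\!\big(g(y)\big),$$
so $m_y$ is simply the Jacobian factor that keeps total probability equal to one on the new scale.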

3.4.2. Integral Transform

If we take the average of e−xy over the distribution of x, we obtain a new function for each value of y, as
$$h^*(y) = \int e^{-xy}\, p_x\, dx,$$
which may be interpreted as a Laplace or Fourier transform of the original distribution, px. Under some conditions, we can think of the transformed function h* (y) as a distribution that has a paired relation with the original distribution px. The transformation creates a pair of related measurement scales that determines the associated pair of probability distributions. We may use other transformation functions besides e−xy to create various pairs of measurement scales and probability distributions.

4. Reading Probability Expressions in Terms of Measurement and Information

In this section, I show that probability distributions can be read as simple statements about the change in information with the magnitude of the observations. The essential scaling relation Tf expresses exactly how information changes with magnitude. Because measurement has a natural interpretation in terms of information, we can also think of Tf as an expression of the measurement scale associated with a particular probability distribution.

4.1. Information and Surprise

The key step arises from interpreting
$$S_y = -\log(p_y)$$
in Equation (1) as the translation between probability, py, and information, Sy. This expression is sometimes called self-information, which describes the information in an event y in terms of the probability of that event, py. I use the symbol Sy because Tribus [11] interpreted this quantity as the surprise associated with the magnitude of the observation, y.
The interpretation of Sy as surprise arises from the idea that relatively rare events are more surprising. For any initial value of py, the surprise, − log(py) = log(1/py), increases by log(2) as py decreases by half. Thus, the surprise increases linearly with relative rarity.
Surprise connects to information. If we are surprised by an observation, we learn a lot; if we are not surprised, we had already predicted the outcome to be relatively likely, and we gain little information.
Note that entropy in Equation (1) is equivalent to $\int p_y S_y\, dy$, which is simply the average amount of surprise over a particular probability distribution. A uniform distribution, in which all values of y are equally likely, has a maximum amount of entropy and a minimum amount of information or surprise. The low surprise occurs because, with any value of y equally likely, we can never be relatively more surprised by observing one particular value of y rather than another.

4.2. Scaling Relations Express the Change in Information

The expression Sy relates information to the magnitude of observations, y. We can use that relation to develop an understanding of how information changes with magnitude. The change in information with magnitude captures the essential aspect of measurement scale. This notion of information in relation to scale turns out to be the key to understanding probability patterns for continuous variables.
I begin with the general expression for probability patterns in Equation (7), altered slightly here as
$$p_y = \psi\, e^{-\lambda T_f}, \qquad (10)$$
in which ψ is a constant that sets the total probability of the distribution to one. In this section, we can ignore the scale transformations and the term my that led to Equation (8). Those transformations change the original probability pattern from one scale to another. That change of scale does not alter the relation between information and magnitude on the original scale that determined the form of the probability distribution.
If we take the logarithm of both sides of Equation (10), we obtain a general expression for probability patterns in terms of information as
$$S_y = -\log(\psi) + \lambda T_f. \qquad (11)$$
Thus, the change in information, dSy, compares with the change in the scaling relation for measurement, dTf, as
$$|dS_y| = |\lambda\, dT_f|, \qquad (12)$$
in which absolute values quantify the magnitude of change. Intuitively, we may think of this expression as the increment of information gained for measuring a change in magnitude on the scale Tf. The parameter λ is the relative rate of change of information compared with measured values.
Note that we can also write
$$dS_y \propto dT_f, \qquad (13)$$
which means that an increment on the measurement scale is proportional to an increment of information.

4.3. How to Read the Exponential and Gaussian Distributions

The exponential distribution in Equation (3) has Tf = y and dSy = λ. The parameter λ = 1/μ is the inverse of the distribution’s mean value. The exponential distribution describes a constant increase in information with magnitude, associated with a constant decline in relative probability with magnitude. The rate of increase in information with magnitude is the inverse of the mean.
For the Gaussian distribution with a mean of zero, Tf = y2 and dSy = 2λy. The parameter λ = 1/2σ2 is the inverse of twice the distribution’s average squared deviation, leading to dSy = y/σ2. The Gaussian distribution describes a linearly increasing gain (constant acceleration) in information with magnitude, associated with a linearly increasing decline (constant deceleration) in relative probability with magnitude. The rate of the linearly increasing gain in information with magnitude is the inverse of the variance.
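These two readings can be checked numerically. The sketch below (illustrative parameter values; any standard numpy and scipy install) computes $S_y = -\log p_y$ on a grid and confirms that its slope is constant for the exponential and linear in $y$ for the Gaussian.

```python
import numpy as np
from scipy import stats

# Read the exponential and Gaussian distributions through S_y = -log(p_y):
# the slope dS_y/dy should be the constant 1/mu for the exponential and the
# line y/sigma^2 for the Gaussian.
y = np.linspace(0.1, 5.0, 500)

mu = 2.0                                             # exponential mean
S_exp = -stats.expon(scale=mu).logpdf(y)
print(np.allclose(np.gradient(S_exp, y, edge_order=2), 1.0 / mu))        # True

sigma = 1.5                                          # Gaussian standard deviation
S_gauss = -stats.norm(scale=sigma).logpdf(y)
print(np.allclose(np.gradient(S_gauss, y, edge_order=2), y / sigma**2))  # True
```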
The following sections present the way in which to read a wide variety of common distributions in terms of the scaling relations of information and measurement. Later sections consider the underlying structure and familial relations between commonly observed distributions. That underlying structure arises from the information symmetries that relate different measurement scales to each other.

5. The Log-linear Scale

Cancer incidence illustrates how probability patterns may express simple scaling relations [1]. For many cancers, the probability py that an individual develops disease near the age y, among all those born at age zero, is approximately
$$p_y \propto y^{k-1} e^{-\alpha y}, \qquad (14)$$
which is the gamma probability pattern. A simple generative model that leads to a gamma pattern is the waiting time for the kth event to occur. For example, if cancer developed only after k independent rate-limiting barriers or stages have been passed, then the process of cancer progression would lead to a gamma probability pattern.
That match between a generative multistage model of process and the observed gamma pattern led many people to conclude that cancer develops by a multistage process of progression. By fitting the particular incidence data to a gamma pattern and estimating the parameter k, one could potentially estimate the number of rate-limiting stages required for cancer to develop. Although this simple model does not capture the full complexity of cancer, it does provide the basis for many attempts to connect observed patterns for the age of onset to the underlying generative processes that cause cancer [1].
Let us now read the gamma pattern as an expression about the scaling of probability in relation to magnitude. We can then compare the general scaling relation that defines the gamma pattern to the different kinds of processes that may generate a pattern matched to the gamma distribution.
The probability expression in Equation (14) can be divided into two terms. The first term is
$$y^{k-1} = e^{(k-1)\log(y)}, \qquad (15)$$
which matches our general expression for probability patterns in Equation (7) with Tf = log(y). This equivalence associates the power law component of the gamma distribution with a logarithmic measurement scale.
For the second term, e−αy, in Equation (14), we have Tf = y, which expresses linear scaling in y. Thus, the two terms in Equation (14) correspond to logarithmic and linear scaling
$$p_y \propto \underbrace{y^{k-1}}_{\log} \times \underbrace{e^{-\alpha y}}_{\text{linear}}, \qquad (16)$$
which leads to an overall measurement function that has the general log-linear form Tf = log(y) − by. For the parameters in this example, b = α/(k − 1).
When y is small, Tf ≈ log(y), and the logarithmic term dominates changes in the information of the probability pattern, dSy, and the measurement scale, dTf. By contrast, when y is large, Tf ≈ − by, and the linear term dominates. Thus, the gamma probability pattern is simply the expression of logarithmic scaling at small magnitudes and linear scaling at large magnitudes. The value of b determines the magnitudes at which the different scales dominate.
Generative processes that create log-linear scaling typically correspond to a gamma probability pattern. Consider the classic generative process for the gamma, the waiting time for the kth independent event to occur. When the process begins, none of the events has occurred. For all k events to occur in the next time interval, all must happen essentially simultaneously.
The probability that multiple independent events occur essentially simultaneously is the product of the probabilities of the individual events. Multiplication leads to power law expressions and logarithmic scaling. Thus, at small magnitudes, the change in information scales with the change in the logarithm of time.
By contrast, at large magnitudes, after much time has passed, either the kth event has already happened, and the waiting is already over, or k − 1 events have happened, and we are waiting only for the last event. Because we are waiting for a single event that occurs with equal probability in any time interval, the scaling of information with magnitude is linear. Thus, the classic waiting time problem is a generative model that has log-linear scaling.
The gamma pattern itself is a pure expression of log-linear scaling. That probability pattern matches any underlying generative process that converges to logarithmic scaling at small magnitudes and linear scaling at large magnitudes. Many processes may be essentially multiplicative at small scales and approximately linear at large scales. All such generative processes will also converge to the gamma probability distribution. In the general case, k is a continuous parameter that influences the magnitudes at which logarithmic or linear scaling dominate.
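The convergence of the classic waiting-time process to the gamma pattern is easy to check by simulation. The following sketch (illustrative values of $k$ and the event rate; scipy's parameterization of the gamma) sums $k$ independent exponential waiting times and compares the resulting quantiles with the gamma form $y^{k-1}e^{-\alpha y}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Waiting time for the k-th of k independent rate-limiting events, each with
# an exponential waiting time: the total should follow the gamma pattern.
k, rate = 5, 0.2
waits = rng.exponential(scale=1 / rate, size=(100_000, k)).sum(axis=1)

gamma = stats.gamma(a=k, scale=1 / rate)      # density ~ y^(k-1) e^(-rate*y)
qs = np.linspace(0.05, 0.95, 7)
print(np.round(np.quantile(waits, qs), 2))    # simulated quantiles
print(np.round(gamma.ppf(qs), 2))             # gamma quantiles, nearly identical
```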
Later, I will return to this important link between generative process and measurement scale. For now, let us continue to follow the consequences of various scaling relations.
The log-linear scale contains the purely linear and the purely logarithmic as special cases. In Equation (14), as k → 1, the probability pattern becomes the exponential distribution, the pure expression of linear scaling. Alternatively, as α → 0, the probability pattern approaches the power law form, the pure expression of logarithmic scaling.

6. The Linear-log Scale

Another commonly observed pattern follows a Lomax or Pareto Type II form
$$p_y \propto \left(1 + \frac{y}{\alpha}\right)^{-k}, \qquad (17)$$
which is associated with the measurement scale Tf = log(1 + y/α). This distribution describes linear-log scaling. For small values of y relative to α, we have Tf → y/α, and the distribution becomes
$$p_y \propto e^{-(k/\alpha)\, y}, \qquad (18)$$
which is the pure expression of linear scaling. For large values of y relative to α, we have $T_f \to \log(y/\alpha)$, and the distribution becomes
$$p_y \propto y^{-k}, \qquad (19)$$
which is the pure expression of logarithmic scaling.
In these examples, I have used $f_y = y$ in the scaling relation $T_f = \log(1 + f_y/\alpha)$. We can add to the forms of the linear-log scale by using $f_y = (y - \mu)^2$, describing squared deviations from the mean. To simplify the notation, let μ = 0. Then Equation (17) becomes
$$p_y \propto \left(1 + \frac{y^2}{\alpha}\right)^{-k}, \qquad (20)$$
which is called the generalized Student's or q-Gaussian distribution [12]. When the deviations from the mean are relatively small compared with α, linear scaling dominates, and the distribution is Gaussian, $p_y \propto e^{-(k/\alpha)\, y^2}$. When deviations from the mean are relatively large compared with α, logarithmic scaling dominates, causing power law tails, $p_y \propto y^{-2k}$.

7. Relation between Linear-log and Log-linear Scales

The specific way in which these two scales relate to each other provides much insight into pattern and process.

7.1. Common Scales and Common Patterns

The log-linear and linear-log scales include most of the commonly observed probability patterns. The purely linear exponential and Gaussian distributions arise as special cases. Pure linearity is perhaps rare, because very large or very small values often scale logarithmically. For example, we measure distances in our immediate surroundings on a linear scale, but typically measure very large cosmological distances on a logarithmic scale, leading to a linear-log scaling of distance.
On the linear-log scale, positive variables often follow the Lomax distribution Equation (17). The Lomax expresses an exponential distribution with a power law tail. Over a sufficiently wide range of magnitudes, many seemingly exponential distributions may in fact grade into a power law tail, because of the natural tendency for the information at extreme magnitudes to scale logarithmically. Alternatively, many distributions that appear to be power laws may in fact grade into an exponential shape at small magnitudes.
When studying deviations from the mean, the linear-log scale leads to the generalized Student’s form. That distribution has a primarily Gaussian shape but with power law tails. The tendency for the tails to grade into a power law may again be the rule when studying pattern over a sufficiently wide range of magnitudes [12].
In some cases, the logarithmic scaling regime occurs at small magnitudes rather than large magnitudes. Those cases of log-linear scaling typically lead to a gamma probability pattern. Many natural observations approximately follow the gamma pattern, which includes the chi-square pattern as a special case.

7.2. Relations between the Scales

The linear-log and log-linear scales seem to be natural inverses of each other. However, what does an inverse scaling mean? We obtain some clues by noting that the mathematical relation between the scales arises from
$$\underbrace{\left(1 + \frac{f_y}{\alpha}\right)^{-k}}_{\text{linear-log}} \;\propto\; \int e^{-x f_y}\, \underbrace{x^{k-1} e^{-\alpha x}}_{\text{log-linear}}\; dx. \qquad (21)$$
The right side is the Laplace transform of the log-linear gamma pattern in the variable x, here interpreted for real-valued fy. That transform inverts the scale to the linear-log form, which is the Lomax distribution for fy = y or the generalized Student’s distribution for fy = y2.
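This transform pair can be verified directly by numerical integration. The sketch below (arbitrary illustrative values of $k$, $\alpha$, and $f_y$) evaluates the right side with scipy and compares it against $(1+f_y/\alpha)^{-k}$ up to the constant $\Gamma(k)\,\alpha^{-k}$.

```python
import numpy as np
from scipy import integrate, special

# Laplace transform of the log-linear (gamma) form x^(k-1) e^(-a x) should be
# proportional to the linear-log (Lomax) form (1 + f/a)^(-k).
k, a = 3.0, 2.0
for f in (0.5, 1.0, 4.0):
    lhs, _ = integrate.quad(lambda x: np.exp(-x * f) * x**(k - 1) * np.exp(-a * x), 0, np.inf)
    rhs = special.gamma(k) * a**(-k) * (1 + f / a)**(-k)
    print(f, round(lhs, 6), round(rhs, 6))     # the two columns agree
```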
This relation between scales is easily understood with regard to mathematical operations [4,6]. The Laplace transform changes the addition of random variables into the multiplication of those variables, and it changes the multiplication of random variables into the addition of those variables [13]. Logarithmic scaling can be thought of as the expression of multiplicative processes, and linear scaling can be thought of as the expression of additive processes.
The Laplace transform, by changing multiplication into addition, transforms log scaling into linear scaling, and by changing addition into multiplication, transforms linear scaling into log scaling. Thus, log-linear scaling changes to linear-log scaling. The inverse Laplace transform works in the opposite direction, changing linear-log scaling into log-linear scaling.
The fact that the Laplace transform connects two of the most important scaling relations is interesting. However, what does it mean in terms of reading and understanding common probability patterns? The following sections suggest one possibility.

8. Dissipation of Information on Alternative Scales

It may be that information dissipates on one scale, but we observe pattern on a different scale. For example, information may dissipate on the frequency scale of events per unit time, but we may observe pattern on the inverse scale of time per event. Before developing that interpretation of the Laplace pair of inverse scales, it is useful to consider more generally the problem of analyzing pattern on one scale when information dissipates on a different scale.

8.1. Scale Change for Data Analysis

Information may dissipate on the scale, x, but we may wish to observe or to analyze the data on the transformed scale, y. For example, the observations, y, may arise by the product of positive random values. Then x = log(y) would be the sum of the logarithms of those random values. The dissipation of information by the addition of random variables often leads to a Gaussian distribution. By application of Equation (7), we have the distribution of x = log(y) as
$$p_x \propto e^{-\lambda (x - \mu)^2},$$
where μ is the mean of x, and 1/λ is twice the variance of x. On the x scale, the Gaussian distribution has Tf = (x − μ)2.
Suppose we want the distribution on the scale of the observations, y, rather than on the logarithmic scale x = log(y) on which information dissipates. Then we must apply Equation (8) to transform from the scale, x, to the scale of interest, y, by using g(y) = log(y), and thus $m_y = |g'(y)| = y^{-1}$. Then, from Equation (8), we have the log-normal distribution
$$p_y \propto y^{-1} e^{-\lambda (\log(y) - \mu)^2},$$
which we match to Equation (8) by noting that $m_y = y^{-1}$ and $T_f = (\log(y) - \mu)^2$.
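A short simulation makes this concrete (the factor distribution and sample sizes below are arbitrary illustrations): multiplying many positive random perturbations gives observations whose logarithms look Gaussian, while the observations themselves are strongly skewed, as the log-normal form requires.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Observations built as products of many positive random perturbations.
# Information dissipates on the additive log scale, so log(y) is near Gaussian.
factors = rng.uniform(0.8, 1.25, size=(50_000, 200))
y = factors.prod(axis=1)
log_y = np.log(y)

print("skew of log(y):", round(stats.skew(log_y), 3))                  # close to 0
print("excess kurtosis of log(y):", round(stats.kurtosis(log_y), 3))   # close to 0
print("skew of y:", round(stats.skew(y), 3))                           # strongly positive
```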
Consider another example, in which information dissipates on the log-linear scale, x. By Equation (7), log-linear scaling leads to a gamma distribution
$$p_x \propto x^{k-1} e^{-\alpha x},$$
in which the log-linear scale has the form −λT(x) = (k − 1) log(x) − αx.
Suppose that we wish to analyze the data on a logarithmic scale, or that we only have access to the logarithms of the observations [5]. Then we must analyze the distribution of y = log(x), which means that the original scale for the dissipation of information was x = g(y) = ey. Therefore
$$-\lambda T(g(y)) = (k-1)\log(e^y) - \alpha e^y.$$
Because $m_y = |g'(y)| = e^y$, by Equation (8), we have
$$p_y \propto m_y\, e^{-\lambda T(g(y))} = e^y\, e^{(k-1)\log(e^y) - \alpha e^y},$$
which simplifies to
$$p_y \propto e^{ky - \alpha e^y}. \qquad (22)$$
We read this as the dissipation of information on the log-linear scale, x, and a change of variable x = ey, in order to analyze the log transformation of the underlying distribution as y = log(x). Data, such as the distribution of biological species abundances in samples, often have an underlying log-linear structure associated with the gamma distribution. Typically, such data are log-transformed before analysis, leading to the distribution in Equation (22), which I call the exponential-gamma distribution [5].
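The same statement can be checked by simulation (again with arbitrary illustrative parameter values): draw gamma-distributed values, log-transform them, and compare the histogram of $y=\log(x)$ against the unnormalized density $e^{ky-\alpha e^y}$.

```python
import numpy as np
from scipy import integrate

rng = np.random.default_rng(2)

# x follows a gamma pattern (log-linear scale); y = log(x) should follow the
# exponential-gamma form e^(k*y - a*e^y).
k, a = 3.0, 0.5
y = np.log(rng.gamma(shape=k, scale=1 / a, size=200_000))

grid = np.linspace(y.min(), y.max(), 400)
dens = np.exp(k * grid - a * np.exp(grid))
dens /= integrate.trapezoid(dens, grid)               # normalize numerically

hist, edges = np.histogram(y, bins=60, density=True)
mids = 0.5 * (edges[:-1] + edges[1:])
gap = np.max(np.abs(hist - np.interp(mids, grid, dens)))
print("largest density mismatch:", round(gap, 3))     # small; sampling noise only
```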
Equation (22) has the same form as the commonly observed Gumbel distribution that arises in extreme value theory. That theory turns out to be another way in which information dissipates on one scale, but we analyze pattern on a different scale.

8.2. Extreme Values: Dissipation on the Cumulative Scale

Many problems depend only on the largest or smallest value of a sample. Extreme values determine much of the financial risk of disasters, the probability of structural failure, and the expectation of unacceptable traffic congestion. In biology, the most advantageous beneficial mutations may set the pace and extent of adaptation.
At first glance, it may seem that the most extreme values associated with rare events would be hard to predict. Although it is true that the extreme value in any particular case cannot be guessed with certainty, it turns out that the probability distribution of extreme values often follows a very regular pattern. That regularity of extreme values arises from the same sort of strong convergence by which the central limit theorem leads to the regularity of the Gaussian probability distribution.
I describe how the extreme value distributions can be understood by the dissipation of information and scale transformation. I focus on the largest value in a sample. The same logic applies to the smallest value. I emphasize an intuitive way in which to read the extreme value distributions as expressions about process.
Many sources provide background on the extreme value distributions [14–17]. In my own work, I described the technical details for a maximum entropy interpretation of extreme values [3,18], and the scale transformations that connect extreme value forms to general measurement interpretations of probability patterns [4].

8.2.1. Dissipation of Information

In a sufficiently large sample, the probability of an extreme value depends only on the chance that an observation falls in the upper tail of the underlying distribution from which the observations are drawn. All other information about the underlying process dissipates. The average tail probability of the underlying distribution sets the constraint on retained information, expressed as follows.
Let x be the upper tail probability of a distribution, pz, defined as
$$x = \int_y^{\infty} p_z\, dz, \qquad (23)$$
in which y is a threshold value, and x is the cumulative probability in the upper tail of the distribution pz above the value y. Thus, x is the probability of observing a value that is greater than y. The cumulative probability, x, tells us how likely it is to observe a value greater than y, and thus how likely it is that y would be near the extreme value in a sample of observations.
On the scale, x, the dissipation of information in repeated samples causes the distribution of upper tail probabilities to take on the general form of Equation (7), in particular
$$p_x \propto e^{-\lambda x}.$$
The average value of x, which is the average upper tail probability, sets the only constraint that shapes the pattern of the distribution. We can, without loss of generality, rescale x so that λ = 1, and thus px is proportional to e−x.

8.2.2. Scale Transformation

Scale transformation describes how to go from tail probabilities, x, to the extreme value in a sample, y. Suppose tail probabilities, which are on the scale x, are related to extreme values, which are on the scale y. The relation between x and y is given by Equation (23). We can express that relation as x = T(y) = Tf, in which Tf is the right-hand side of Equation (23).
We can now use our general approach to scale transformation in Equation (8), repeated here
$$p_y \propto m_y\, e^{-\lambda T_f}.$$
In this case, my = |T′f|, which is the absolute value of the derivative of x with respect to y, yielding
$$p_y \propto |T'_f|\, e^{-\lambda T_f}. \qquad (24)$$
This expression provides the general form of probability distributions when Tf describes the measurement scale for y in terms of the cumulative distribution, or tail probabilities, for some underlying distribution.
The form of Tf arises, as always, from the information constrained by an average value. For example, if in Equation (23) the tail probability decays exponentially such that pz → e−z, then
$$x = \int_y^{\infty} p_z\, dz \to e^{-y}.$$
The average tail probability is the average of e−y, and x = Tf = e−y. From Equation (24), we have
$$p_y \propto e^{-y - \lambda e^{-y}},$$
which is the Gumbel form of the extreme value distributions. Alternatively, if the average tail probability is the average of y−γ, from the tail of an underlying distribution that decays as a power law in y, then Tf = y−γ, and
$$p_y \propto y^{-(\gamma+1)}\, e^{-\lambda y^{-\gamma}},$$
which is the Fréchet form of the extreme value distributions. In summary, the extreme value distributions follow the simple maximum entropy form. The constraint is the average tail probability of an underlying distribution. We transform from the scale, x, of the cumulative distribution of tail probabilities, to the scale, y, of the extreme value in a sample [3,4].
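The regularity of this convergence shows up clearly in simulation. The sketch below (illustrative sample sizes) takes the maximum of many exponential samples, whose tail decays as $e^{-y}$, and confirms that the maxima follow the Gumbel form with location near $\log n$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Maxima of samples from a distribution with an exponential tail should
# follow the Gumbel form of the extreme value distributions.
n, reps = 1000, 5_000
maxima = rng.exponential(scale=1.0, size=(reps, n)).max(axis=1)

loc, scale = stats.gumbel_r.fit(maxima)
qs = np.linspace(0.05, 0.95, 7)
print(np.round(np.quantile(maxima, qs), 2))              # simulated quantiles
print(np.round(stats.gumbel_r(loc, scale).ppf(qs), 2))   # fitted Gumbel quantiles
print("fitted location vs log(n):", round(loc, 2), round(np.log(n), 2))
```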

9. Pairs of Alternative Scales by Integral Transform

The prior section discussed paired scales, in which information dissipates on one scale, but we observe pattern on a transformed scale. In those particular cases, the dual relation between scales was obvious. For example, we may explicitly choose to study pattern by a log or exponential transformation of the observations. Or information may dissipate on the cumulative scale of tail probabilities, but we transform to the scale of observed extreme values to express probability patterns.
I now return to the linear-log and log-linear scales, which lead to the most commonly observed probability patterns. How can we understand the duality between these inverted scales? Is there a general way in which to understand the pairing between inverted measurement scales?

9.1. Overview

The distributions based on linear-log and log-linear scales form naturally inverted pairs connected by the Laplace transform. Equation (21) showed that connection, repeated here
$$\underbrace{\left(1 + \frac{f_y}{\alpha}\right)^{-k}}_{\text{linear-log}} \;\propto\; \int e^{-x f_y}\, \underbrace{x^{k-1} e^{-\alpha x}}_{\text{log-linear}}\; dx. \qquad (26)$$
In this section, I summarize two ways in which to understand this mathematical expression. First, the pair may arise from superstatistics [19], in which a parameter of a distribution is considered to vary rather than to be fixed. Second, the pair provides an example of a more general way in which dual measurement scales connect to each other through integral transformation, which changes one measurement scale into another. Fourier, Laplace, and superstatistics transformations can be understood as special cases of the more general integral transforms. Those general transforms include as special cases the classic characteristic functions and moment generating functions of probability theory.
The following section considers cases in which information dissipates on one of the scales, but we observe pattern on the inverted scale. This duality provides an essential way in which to connect the scaling and constraints of process on one scale to the patterns of nature that we observe on the dual scale. Reading probability patterns in terms of underlying process may often depend on recognizing this essential duality.

9.2. Superstatistics

The transformation between scales in Equation (26) can be interpreted as averaging over a varying parameter. Assume that we begin with a distribution in the variable fy, given by ϕ(fy|x). Here, x is the parameter of the distribution. Typically, we think of a parameter x as a fixed constant. Suppose, instead, that x varies according to a distribution, h(x). For example, we may think of a composite population in which fy varies according to ϕ(fy|x) in different locations, with the mean of the distribution, 1/x, varying across locations.
If we measure the composite population, we study the distribution ϕ(fy|x) when averaged over the different values of x, which vary according to h(x). The composite population then follows the distribution given by
$$h^*(f_y) = \int \phi(f_y|x)\, h(x)\, dx. \qquad (27)$$
Averaging a distribution, such as ϕ, over a variable parameter is sometimes called superstatistics [19]. When the initial distribution, ϕ(fy|x), is exponential, $e^{-x f_y}$, then superstatistical averaging over the variable parameter x in Equation (27) is equivalent to the Laplace transform, of which Equation (26) is an example.
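This averaging is easy to reproduce by sampling (parameter values below are illustrative). Drawing the exponential rate from a gamma distribution and then drawing the observation with that rate yields a marginal Lomax pattern; note that sampling uses the normalized exponential density $x e^{-x f_y}$, so the Lomax shape parameter comes out as $k$, with tail exponent $k+1$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Superstatistics: an exponential rate x drawn from a gamma distribution,
# then y drawn exponentially with that rate.  Marginally, y is Lomax.
k, a = 2.5, 1.5
rates = rng.gamma(shape=k, scale=1 / a, size=200_000)   # x ~ x^(k-1) e^(-a x)
y = rng.exponential(scale=1 / rates)                    # y | x ~ x e^(-x y)

lomax = stats.lomax(c=k, scale=a)                       # tail ~ (1 + y/a)^(-(k+1))
qs = np.linspace(0.1, 0.9, 5)
print(np.round(np.quantile(y, qs), 3))                  # simulated quantiles
print(np.round(lomax.ppf(qs), 3))                       # Lomax quantiles agree
```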

9.3. Integral Transforms

We may read Equation (27) as an integral transform, which provides a general relation between a pair of measurement scales. Thus, we may think of Equation (27) as a general way in which to express the duality between paired measurement scales, rather than a specific superstatistics process of averaging over a variable parameter.
In this general integral transform interpretation, we start with some distribution h(x), which has a scaling relation, T(x). Integrating over the transformation kernel ϕ(fy|x) creates the distribution h*(fy), with scaling relation T*(fy). Thus, averaging over the transformation kernel ϕ changes the variable from x to fy, and changes the measurement scale from T(x) to T*(fy).
The interpretation of such scale transformations depends on the particular transformation kernel, which creates the particular properties of the dual relation. The Laplace transform, with the exponential transformation kernel $e^{-x f_y}$, has many special properties that connect paired measurement scales in interesting ways.

9.4. Scale Inversion by the Laplace Transform

Suppose the log-linear scaling pattern occurs for the variable x, as in Equation (26). That equation shows that the Laplace transformation kernel, $e^{-x f_y}$, transforms the log-linear scaling relation of x into the linear-log scaling relation of fy, for real values of fy.
The Laplace change of variable from x to fy often inverts the dimensional units. The exponent of the transformation kernel $e^{-x f_y}$ is usually dimensionless, which means that the dimensions of x and fy must cancel. Thus, the units of fy are typically the inverse of the units of x. For example, if x has units of time per event, then fy has units of events or repetitions per time, which is a kind of frequency. The units may also be changed inversely from frequency to time.
The Laplace transform changes the way in which independent observations combine to produce aggregate pattern. On one scale, the distribution of the sum (convolution) of independent observations from an underlying distribution transforms to multiplication of the distributions on the other scale. Inversely, the distribution of multiplied observations on one scale transforms to addition of variables on the other scale. This duality between addition and multiplication on inverted scales corresponds to the duality between linear and logarithmic measurement on the paired scales.
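In symbols, this is the convolution property of the (one-sided) Laplace transform, stated here as a reminder:
$$\mathcal{L}\{f * g\}(s) = \mathcal{L}\{f\}(s)\;\mathcal{L}\{g\}(s), \qquad (f*g)(x) = \int_0^x f(u)\, g(x-u)\, du,$$
so adding independent positive random variables (convolving their densities) corresponds to multiplying their transforms, and vice versa.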

10. Alternative Descriptions of Generative Process

We often wish to associate an observed probability pattern with the underlying generative process. The generative process may dissipate information directly on the measurement scale associated with the observed probability pattern. Or, the generative process may dissipate information on a different scale, but we observe the pattern on a transformed scale.
Consider, as an example, the Laplace duality between the linear-log and log-linear scales in Equation (26). Suppose that we observe the gamma pattern of log-linear scaling. We wish to associate that observed gamma pattern to the underlying generative process.
The generative process may directly create a log-linear scaling pattern. The classic example concerns waiting time for the kth independent event. For small times, the k events must happen nearly simultaneously. As noted earlier, the probability that multiple independent events occur essentially simultaneously is the product of the probabilities of the individual events. Multiplication leads to power law expressions and logarithmic scaling. Thus, at small magnitudes, the change in information scales with the change in the logarithm of time.
By contrast, at large magnitudes, after much time has passed, either the kth event has already happened, and the waiting is already over, or k − 1 events have happened, and we are waiting only for the last event. Because we are waiting for a single event that occurs with equal probability in any time interval, the scaling of information with magnitude is linear. Thus, the classic waiting time problem expresses a generative model that has log-linear scaling.
Any process that scales log-linearly tends to the gamma pattern by the dissipation of all other information. The only requirement is that, in the aggregate, small magnitude events associate with underlying multiplicative combinations of probabilities, and large event magnitudes associate with additive combinations.
In this case, we move from underlying process to observed pattern: a process tends to scale log-linearly, and dissipation of information on that scale shapes pattern into the gamma distribution form. However, often we are concerned with the inverse problem. We observe the log-linear gamma pattern, and we want to know what process caused that pattern.
The duality of the log-linear and linear-log scales in Equation (26) means that a generative process could occur on the linear-log scale, but we may observe the resulting pattern on the log-linear scale. For example, the number of events per unit time (frequency) may combine in a linear, additive way at small frequencies and in a multiplicative, logarithmic way at large frequencies. That linear-log process would often converge to a Lomax distribution of frequency pattern, or to a Student’s distribution if we measure squared deviations, fy = y2. If we observe the outcome of that process in terms of the inverted units of time per event, those inverted dimensions lead to log-linear scaling and a gamma pattern, or to a gamma pattern with a Gaussian tail if we measure squared deviations.
Is it meaningful to say that the generative process and dissipation of information arise on a linear-log scale of events per unit time, but we observe the pattern on the log-linear scale of time per event? That remains an open question.
On the one hand, the scaling relations and dissipation of information contain exactly the same information whether on the linear-log or log-linear scales. That equivalence suggests a single underlying generative process that may be thought of in alternative ways. In this case, we may consider constraints on average frequency or, equivalently, constraints on average time. More generally, constraints on either of a dual pair of scales with inverted dimensions would be equivalent.
On the other hand, the meaning of constraint by average value may make sense only on one of the scales. For example, it may be meaningful to consider only the average waiting time for an event to occur. That distinction suggests that we consider the underlying generative process strictly in terms of the log-linear scale. However, if our observations of pattern are confined to the inverse frequency scale, then the observed linear-log scaling would only be a reflection of the true underlying process on the dual log-linear scale.
Any pair of scales related by integral transformation poses the same issues of duality and interpretation with regard to the connection between generative process and observed pattern.

11. Reading Probability Distributions

In this section, I recap the four components of probability patterns. A clear sense of those four components allows one to read the mathematical expressions of probability distributions as sentences about underlying process.
The four components are: the dissipation of all information; except the preservation of average values; taken over the measurement scale that relates changes in observed values to changes in information; and the transformation from the underlying scale on which information dissipates to alternative scales on which probability pattern may be expressed.
Common probability patterns arise from those four components, described in Equation (8) by
py ∝ my e^{−λTf}.
I show how to read probability distributions in terms of the four components and this general expression. To illustrate the approach, I parse several commonly observed probability patterns. This section mostly repeats earlier results, but does so in an alternative way to emphasize the simplicity of form in common probability expressions.

11.1. Linear Scale

The exponential and Gaussian are perhaps the most common of all distributions. They have the form
py ∝ e^{−λfy}.
The exponential case, fy = y, corresponds to the preservation of the average value, y. The Gaussian case, fy = (y − μ)2, preserves the average squared distance from the mean, which is the variance. For convenience, I often set μ = 0 and write fy = y2 for the squared distance. The exponential and Gaussian express the dissipation of information and preservation of average values on a linear scale. We use either the average value itself or the average squared distance from the mean.
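The maximum entropy reading of these forms can be illustrated numerically. In the sketch below, a fixed mean is taken as the assumed constraint (the particular mean and the comparison shapes are arbitrary illustrative choices): among positive distributions with that mean, the exponential has the largest entropy, and gamma alternatives with the same mean fall below it.

```python
import numpy as np
from scipy import stats

mean = 3.0   # the constrained average value (illustrative choice)

# Differential entropy of the exponential distribution with this mean.
print(stats.expon(scale=mean).entropy())

# Gamma distributions with the same mean but shape a != 1 have lower entropy;
# the case a = 1 recovers the exponential exactly.
for a in [0.5, 1.0, 2.0, 5.0]:
    print(a, stats.gamma(a, scale=mean / a).entropy())
```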

11.2. Combinations of Linear and Log Scales

Purely linear scaling is likely to be rare over a sufficiently wide range of magnitudes. For example, one naturally plots geographic distances on a linear scale, but very large cosmological distances on a logarithmic scale.
On a geographic scale, an increment of an additional meter in distance can be measured directly anywhere on earth. The equivalent measurement information obtained at any geographic distance leads to a linear scale.
By contrast, the information that we can obtain about meter-scale increments tends to decrease with cosmological distance. The declining measurement information obtained at increasing cosmological distance leads to a logarithmic scale.
The measurement scaling of distances and other quantities may often grade from linear at small magnitudes to logarithmic at large magnitudes. The linear-log scale is given by Tf = log(1 + fy/α). Using that measurement scale in Equation (28), with my = 1 and λ = k, we obtain
py ∝ (1 + fy/α)^{−k}.
When fy is small relative to α, we get the standard exponential form of linear scaling in Equation (29), which corresponds to the exponential or Gaussian pattern. The tail of the distribution, with fy greater than α, is a power law in proportion to fy^{−k}. An exponential pattern with a power law tail is the Lomax or Pareto type II distribution. A Gaussian with a power law tail is the generalized Student’s distribution.
If one measures observations over a sufficiently wide range of magnitudes, many apparently exponential or Gaussian distributions will likely turn out to have the power law tails of the Lomax or generalized Student’s forms. Similarly, observed power law patterns may often turn out to be exponential or Gaussian at small magnitudes, also leading to the Lomax or generalized Student’s forms.
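The two regimes of the linear-log form can be made concrete with a short numerical sketch. Assuming illustrative values of α and k (not taken from the text), (1 + y/α)^{−k} tracks the exponential form e^{−ky/α} when y is small relative to α and the power law (y/α)^{−k} when y is large:

```python
import numpy as np

alpha, k = 10.0, 3.0   # illustrative scale and exponent (assumed values)

def linlog_form(y):
    return (1.0 + y / alpha) ** (-k)

# Small magnitudes: ratio to the exponential form exp(-k*y/alpha) stays near one.
y_small = np.array([0.01, 0.1, 1.0])
print(linlog_form(y_small) / np.exp(-k * y_small / alpha))

# Large magnitudes: ratio to the pure power law (y/alpha)**(-k) approaches one.
y_large = np.array([1e2, 1e3, 1e4])
print(linlog_form(y_large) / (y_large / alpha) ** (-k))
```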
Other processes lead to the inverse log-linear scale, which changes logarithmically at small magnitudes and linearly at large magnitudes. The log-linear scale is given by Tf = log(fy) − bfy, in which b determines the transition between log scaling at small magnitudes and linear scaling at large magnitudes. Using that measurement scale in Equation (28) with my = 1 and fy = y, and adjusting the parameters to match earlier notation, we obtain the gamma distribution
py ∝ y^{k−1} e^{−ay},
which is a power law with logarithmic scaling for small magnitudes and an exponential with linear scaling for large magnitudes. The gamma distribution includes as a special case the widely used chi-square distribution. Thus, the chi-square pattern is a particular instance of log-linear scaling.
If we use the log-linear scale for squared deviations from zero, fy = y2, then we obtain
py ∝ y^{k−1} e^{−ay^2},
which is a gamma pattern with a Gaussian tail, expressing log-linear scaling with respect to squared deviations. For k = 2, this is the well-known Rayleigh distribution.
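One familiar generative route to the k = 2 case is the radial distance of an isotropic two-dimensional Gaussian fluctuation. The sketch below (sample size and seed are arbitrary illustrative choices) checks that such radii match the Rayleigh pattern:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Radial distance of isotropic two-dimensional Gaussian fluctuations.
xy = rng.normal(size=(100_000, 2))
r = np.sqrt((xy ** 2).sum(axis=1))

# The simulated quantiles should match the Rayleigh distribution (unit scale).
qs = [0.25, 0.5, 0.75, 0.95]
print(np.quantile(r, qs))
print(stats.rayleigh.ppf(qs))
```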
In some cases, information scales logarithmically at both small and large magnitudes, with linearity dominating at intermediate magnitudes [20]. In a log-linear-log scale, precision at the extremes may depend more strongly on magnitude, or there may be a saturating tendency of process at extremes that causes relative scaling of information with magnitude. Relative scaling corresponds to logarithmic measures.
Commonly observed log-linear-log patterns often lead to the beta family of distributions [4]. For example, we can modify the basic linear-log scale, Tf = log(1 + y/α), by adding a logarithmic component at small magnitudes, yielding the scale Tf = log(1 + y/α) − b log(y), for b = γ/k, which leads to a variant of the beta-prime distribution
py ∝ y^{γ}(1 + y/α)^{−k}.
This distribution can be read as a linear-log Lomax distribution, (1 + y/α)^{−k}, with an additional log scale power law component, y^{γ}, that dominates at small magnitudes. Other forms of log-linear-log scaling often lead to variants from the beta family.

11.3. Direct Change of Scale

In many cases, process dissipates information and preserves average values on one scale, but we observe or analyze data on a different scale. When the scale change arises by simple substitution of one variable for another, the form of the probability distribution is easy to read if one directly recognizes the scale of change. Here, I repeat my earlier discussion for the way in which one reads the commonly observed log-normal distribution. Other direct scale changes follow this same approach.
If process causes information to dissipate on a scale x, preserving only the average squared distance from the mean (the variance), then x tends to follow the Gaussian pattern
px ∝ e^{−λ(x−μ)^2},
in which the mean of x is μ, and the variance is 1/2λ. If the scale, x, on which information dissipates is logarithmic, but we observe or analyze data on a linear scale, y, then x = log(y). The value of my in Equation (8) is the change in x with respect to y, yielding d log(y)/dy = y−1. Thus, the distribution on the y scale is
py ∝ y^{−1} e^{−λ(log(y)−μ)^2},
which is simply the Gaussian pattern for log(y), corrected by my = y−1 to account for the fact that dissipation of information and constraint of average value are happening on the logarithmic scale, log(y), but we are analyzing pattern on the linear scale of y. Other direct changes of scale can be read in this way.
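This change of scale is easy to verify by simulation. In the minimal sketch below, μ and σ are arbitrary illustrative values: information dissipates on the logarithmic scale x = log(y), and the exponentiated values match the log-normal pattern that carries the my = y^{−1} correction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu, sigma = 0.5, 0.3   # illustrative parameters (assumed values)

# Information dissipates on the log scale: x = log(y) follows the Gaussian pattern.
x = rng.normal(mu, sigma, size=200_000)
y = np.exp(x)

# On the y scale, the resulting pattern is the log-normal distribution.
qs = [0.1, 0.5, 0.9]
print(np.quantile(y, qs))
print(stats.lognorm.ppf(qs, s=sigma, scale=np.exp(mu)))
```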

11.4. Extreme Values and Exponential Scaling

Extreme values arise from the probability of observing a magnitude beyond some threshold. Probabilities beyond a threshold depend on the cumulative probability of all values beyond the cutoff. For an initially linear scale with fx = x, cumulative tail probabilities typically follow the generic form e−λx or, simplifying by using λ = 1, the exponential form e−x. The cumulative tail probabilities above a threshold, y, define the scaling relation between x and y, as
x = ∫_y^∞ e^{−z} dz = e^{−y}.
Thus, extreme values that depend on tail probabilities tend to define an exponential scaling, x = e−y = Tf. Because we have changed the scale from the cumulative probabilities, x, to the probability of some threshold, y, that determines the extreme value observed, we must account for that change of scale by my = |T′f| = e−y, where the prime is the derivative with respect to y. Using Equation (8) for the generic method of direct change in scale, and using the form of my here for the change from the cumulative scale of tail probabilities to the direct scaling of threshold values, we obtain the general form of the extreme value distributions as
py ∝ |T′f| e^{−λTf}.
In this simple case, Tf = e−y, thus
py ∝ e^{−y − λe^{−y}},
a form of the Gumbel extreme value distribution. Note that this form is just a direct change from linear to exponential scaling, x = e−y.
Alternatively, we can obtain the same Gumbel form by any process that leads to exponential-linear scaling of the form λT(y) = y + λe−y, in which the exponential term dominates for small values and the linear term dominates for large values. That scaling leads directly to the distribution
py ∝ e^{−λe^{−y}} e^{−y}.
The probability of a small value being the largest extreme value decreases exponentially in y, leading to the double exponential term e^{−λe^{−y}} dominating the probability. By contrast, the probability of observing large extreme values decreases linearly in y, leading to the exponential term e^{−y} dominating the probability.
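The Gumbel form with λ = 1 can be checked by simulating extremes directly. In the sketch below (block size, number of trials, and seed are arbitrary illustrative choices), the maximum of many unit-rate exponential samples, centered by log(n), settles onto the standard Gumbel pattern py ∝ e^{−y−e^{−y}}:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, trials = 1000, 10_000   # block size and number of maxima (illustrative choices)

# Maximum of n unit-rate exponential samples, centered by log(n).
maxima = rng.exponential(size=(trials, n)).max(axis=1) - np.log(n)

# The centered maxima should approach the standard Gumbel distribution.
qs = [0.1, 0.5, 0.9]
print(np.quantile(maxima, qs))
print(stats.gumbel_r.ppf(qs))
```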

11.5. Integral Transform and Change of Scale

Equation (21) showed the connection between linear-log and log-linear scales through the Laplace integral transform. The Laplace transform can often be thought of as inverting the dimensional units. For example, we may change from the time per event for a gamma distribution with log-linear scaling to the number of events per unit time (frequency) according to a Lomax distribution with linear-log scaling. Or we may start with a gamma distribution of frequencies and transform to a Lomax distribution of time per event. The units do not have to be in terms of time and frequency. Any pair of inverted dimensions relates to each other in the same way.
That connection between different scales helps to read probability distributions in relation to underlying process. For example, an observation of frequencies distributed according to the linear-log Lomax pattern may suggest dissipation of information and constraint of average values in the dual log-linear measurement domain.
Scale inversion by the Laplace transform also has the interesting property of switching between addition and multiplication in the two domains. For example, multiplicative aggregation of processes and a logarithmic pattern at small magnitudes on the scale of time per event transform to additive aggregation and a linear pattern at small magnitudes on the frequency scale of events per unit time.
This arithmetic duality of measurement scales clarifies the meaning of probability distributions with respect to underlying generative mechanisms. It would be interesting to study pairs of scales connected by the general integral transform Equation (27) with respect to the interpretation of aggregation and pattern in dual domains.
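The gamma–Lomax inversion can be checked by evaluating the Laplace integral numerically. In the sketch below, the shape value k is an arbitrary illustrative choice; the transform of the gamma form x^{k−1}e^{−x} reproduces the Lomax form (1 + y)^{−k}, up to the constant Γ(k):

```python
import numpy as np
from scipy import integrate, special

k = 3.0   # shape parameter (illustrative choice)

def laplace_of_gamma(y):
    # Laplace transform (kernel e^{-xy}) of the gamma form x^{k-1} e^{-x}.
    val, _ = integrate.quad(
        lambda x: np.exp(-x * y) * x ** (k - 1) * np.exp(-x), 0, np.inf)
    return val

for y in [0.0, 0.5, 2.0, 10.0]:
    # Should match Gamma(k) * (1 + y)^{-k}, the Lomax (Pareto type II) form.
    print(laplace_of_gamma(y), special.gamma(k) * (1 + y) ** (-k))
```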

11.6. Lévy Stable Distributions

Another important family of common distributions arises by a similar scaling duality
(1 + y^2/φ^2)^{−1} ∝ ∫_{−∞}^{∞} e^{−xiy} e^{−φ|x|} dx.
Consider each part in relation to the Laplace pair in Equation (21). The left side is the Cauchy distribution, a special case of the linear-log generalized Student’s distribution with k = 1 and α = φ^2. On the right, e^{−φ|x|} is a symmetric exponential distribution, because e^{−φx} is the classic exponential distribution for x > 0, and e^{φx} for x < 0 is the same distribution reflected about the x = 0 axis. The two distributions together form a new distribution over all positive and negative values of x.
Each positive and negative part of the symmetric exponential, by itself, expresses linearity in x. However, the sharp switch in direction and the break in smoothness at x = 0 induces a quasi-logarithmic scaling at small magnitudes, which corresponds to the linearity at small magnitudes in the transformed domain of the Cauchy distribution.
In this case, the integral transform is Fourier rather than Laplace, using the transformation kernel e−xiy over all positive and negative values of x. For our purposes, we can consider the consequences of the Laplace and Fourier transforms as similar with regard to inverting the dimensions and scaling relations between a pair of measurement scales.
The Cauchy distribution is a particularly important probability pattern. In one simple generative model, the Cauchy arises by the same sort of summing up of random perturbations and dissipation of information that leads to the Gaussian distribution by the central limit theorem. The Cauchy differs from the Gaussian because the underlying random perturbations follow logarithmic scaling at large magnitudes.
Log scaling at large magnitudes causes power law tails, in which the distributions of the underlying random perturbations tend to have the form 1/|x|^{1+γ} at large magnitudes of x. When the tail of a distribution has that form, then the total probability in the tail above magnitudes of |x| is approximately 1/|x|^{γ}. The Cauchy is the particular distribution with γ = 1. Thus, one way to generate a Cauchy is to sum up random perturbations and constrain the average total probability in the tail to be 1/|x|.
Note that the constraint on the average tail probability of 1/|x| for the Cauchy distribution on the left side of Equation (30) corresponds, in the dual domain on the right side of that equation, to e−φ|x|, in which the measurement scale is Tf = |x|. The average of the scaling Tf corresponds to the preserved average constraint after the dissipation of information. In this case, the dual domain preserves only the average of |x|. Thus the dual scaling domains preserve the average of |x| in the symmetric exponential domain and the average total tail probability of 1/|x| in the dual Cauchy domain.
We can express a more general duality that includes the Cauchy as a special case by
py ∝ ∫_{−∞}^{∞} e^{−xiy} e^{−φ|x|^γ} dx.
The only difference from Equation (30) is that in the symmetric exponential, I have written |x|γ. The parameter γ creates a power law scaling Tf = |x|γ, which corresponds to a distribution that is sometimes called a stretched exponential.
The distribution in the dual domain, py, is a form of the Lévy stable distribution. In general, that distribution has no explicit closed-form expression. The Lévy stable distribution, py, can be generated by dissipating all information by summation of random perturbations while constraining the average of the total tail probability to be 1/|x|^γ for γ < 2. For γ = 1, we obtain the Cauchy distribution. When γ = 2, the distributions in both domains become Gaussian, which is the only case in which domains paired by Laplace or Fourier transform inversion have the same distribution.
Note that the paired scales in Equation (31) match a constraint on the average of |x|γ with an inverse constraint on the average tail probability, 1/|x|γ. Here, γ is not necessarily an integer, so the average of |x|γ can be thought of as a fractional moment in the stretched exponential domain that pairs with the power law tail in the inverse Lévy domain [3].
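The Cauchy case of this duality is also easy to verify numerically. In the sketch below, φ is an arbitrary illustrative value; the Fourier integral of the symmetric exponential e^{−φ|x|} reproduces the Cauchy form 2φ/(φ^2 + y^2), proportional to (1 + y^2/φ^2)^{−1}:

```python
import numpy as np
from scipy import integrate

phi = 1.5   # illustrative scale (assumed value)

def fourier_of_symmetric_exponential(y):
    # By symmetry, the two-sided Fourier integral of exp(-phi*|x|) reduces to
    # twice the cosine integral over positive x; the imaginary part vanishes.
    val, _ = integrate.quad(
        lambda x: 2.0 * np.cos(x * y) * np.exp(-phi * x), 0, np.inf)
    return val

for y in [0.0, 0.5, 1.0, 2.0]:
    # Should match the Cauchy form 2*phi / (phi**2 + y**2).
    print(fourier_of_symmetric_exponential(y), 2 * phi / (phi ** 2 + y ** 2))
```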

12. Relations between Probability Patterns

I have shown how to read probability distributions as statements about the dissipation of information, the constraint on average values, and the scaling relations of information and measurement. Essentially all common distributions have the form given in Equation (8) as
py ∝ my e^{−λTf}.
Dissipation of information and constraint on average values set the e−λfy form. Scaling measures transform the observables, fy, to Tf = T(fy). The term my accounts for changes between dissipation of information on one scale and measurement of final pattern on a different scale.
The scaling measures, Tf, determine the differences between probability patterns. In this section, I discuss the scaling measures in more detail. What defines a scaling relation? Why are certain common scaling measures widely observed? How are the different scaling measures connected to each other to form families of related probability distributions?

12.1. Invariance and Common Scales

The form of the maximum entropy distributions influences the commonly observed scales and associated probability distributions [4,6]. In particular, we obtain the same distribution in Equation (32) for either the measurement function Tf or the affine transformed measurement function Tf ↦ a + bTf. An affine transformation shifts the variable by the constant a and multiplies it by the constant b.
The shift by a changes the constant of proportionality
e^{−λ(a + Tf)} = ξ e^{−λTf},
in which ξ = e^{−λa}. In maximum entropy, the final proportionality constant always adjusts to satisfy the constraint that the total probability is one, as given in Equation (2). Thus, the final adjustment of total probability erases any prior multiplication of the distribution by a constant. A shift transformation of Tf does not change the associated probability pattern.
Multiplication by b also has no effect on probability pattern, because
e^{−λbTf} = e^{−λ̂Tf}
for λ̂ = bλ. In maximum entropy, the final value of the constant multiplier for Tf always adjusts so that the average value of Tf satisfies an extrinsic constraint, as given in Equation (6).
Thus, maximum entropy distributions are invariant to affine transformations of the measurement scale. That affine invariance shapes the form of the common measurement scales. In particular, consider transformations of the observables, G(fy), such that
T[G(fy)] = a + bT(fy).
Any scale, T, that satisfies this relation causes the transformed scale T [G(fy)] to yield the same maximum entropy probability distribution as the original scale Tf = T(fy).
For example, suppose our only information about a probability distribution is that its form is invariant to a transformation of the observable values fy by a process that changes fy to G(fy). Then it must be that the scaling relation of the measurement function Tf satisfies the invariance in Equation (33). By evaluating how that invariance sets a constraint on Tf, we can find the form of the probability distribution.
The classic example concerns the invariance of logarithmic scaling to power law transformation [21]. Let T(y) = log(y) and G(y) = cyγ. Then by Equation (33), we have
log(cy^γ) = log(c) + γ log(y),
which demonstrates that logarithmic scaling is affine invariant to power law transformations of the form cyγ, in which affine invariance means that the scaling relation T and the associated transformation G satisfy Equation (33).

12.2. Affine Invariance of Measurement Scaling

Put another way, a scaling relation, T, is defined by the transformations, G, that leave unchanged the information in the observables with respect to probability patterns. In maximum entropy distributions, unchanged means affine invariance. This affine invariance of measurement scaling in probability distributions is so important that I like to write the key expression in Equation (33) in a more compact and memorable form
T ~ T ○ G.
Here, the circle means composition of functions, such that T ○ G = T[G(fy)], and the symbol “~” for similarity means equivalent with respect to affine transformation. Thus, the right side of Equation (33) is similar to T with respect to affine transformation, and the left side of Equation (33) is equivalent to T ○ G. Reversing sides of Equation (33) and using “~” for affine similarity leads to Equation (35).
Note, from Equation (11) and Equation (33), that Sy = T ○ G, showing that the information in a probability distribution, Sy, is invariant to affine transformation of T. Thus, we can also write
T ~ T ○ G ~ Sy,
which emphasizes the fundamental role of invariant information in defining the measurement scaling, T, and the associated form of probability patterns.

12.3. Base Scales and Notation

Earlier, I defined fy = f(y) as an arbitrary function of the variable of interest, y. I have used either y or y2 or (y − μ)2 for fy to match the classical maximum entropy interpretation of average values constraining either the mean or the variance.
To express other changes in the underlying variable, y, I introduced the measurement functions or scaling relations, Tf = T(fy). In this section, I use an expanded notation to reveal the structure of the invariances that set the forms of scaling relations and probability distributions [4]. In particular, let
w = w(fy)
be a function of fy. Then, for example, we can write an exponential scaling relation as T(fy) = eβw. We may choose a base scale, w, such as a linear base scale, w(fy) = fy, or a logarithmic base scale, w(fy) = log(fy), or a linear-log base scale, w(fy) = log(1 + fy/α), or any other base scale. Typically, simple combinations of linear and log scaling suffice. Why such simple combinations suffice is an essential unanswered question, which I discuss later.
Previously, I have referred to fy as the observable, in which we are interested in the distribution of y but only collect statistics on the function fy. Now, we will consider w = w(fy) as the observable. We may, for example, be limited to collecting data on w = log(fy) or on measurement functions T(fy) that can be expressed as functions of the base scale w. We can always revert to the simpler case in which w = fy or w = y.
In the following sections, the expanded notation reveals how affine invariance sets the structure of scaling relations and probability patterns.

12.4. Two Distinct Affine Relations

All maximum entropy distributions satisfy the affine relation in Equation (33), expressed compactly in Equation (35). In that general affine relation, any measurement function, T, could arise, associated with its dual transformation, G, to which T is affine invariant. That general affine relation does not set any constraint on which measurement functions, T, may occur, although the general affine relation may favor certain scaling relations to be relatively common.
By contrast with the general affine form T ~ T ○ G, for any T and its associated G, we may consider how specific forms of G determine the scaling, T. Put another way, if we require that a probability pattern be invariant to transformations of the observables by a particular G, what does that tell us about the form of the associated scaling relation, T, and the consequent probability pattern?
Here we must be careful about potential confusion. It turns out that an affine form of G is itself important, in which, for example, G(w) = δ + θw. That specific affine choice for G is distinct from the general affine form of Equation (35). With that in mind, the following sections explore the consequences of an affine transformation, G, or a shift transformation, which is a special case of an affine transformation.

12.5. Shift Invariance and Generalized Exponential Measurement Scales

Suppose we know only that the information in probability patterns does not change when the observables undergo shift transformation, such that G(w) = δ + w. In other words, the form of the measurement scale, T, must be affine invariant to adding a constant to the base values, w. A shift transformation is a special case of an affine transformation G(w) = δ + θw, in which the affine transform becomes strictly a shift transformation for the restricted case of θ = 1.
The exponential scale
Tf = e^{βw}
maintains the affine invariance in Equation (33) to a shift transformation, G. If we apply a shift transformation to the observables, w ↦ δ + w, then the exponential scale becomes e^{β(δ+w)}, which is equivalent to b e^{βw} for b = e^{βδ}. We can ignore the constant multiplier, b; thus, the exponential scale is shift invariant with respect to Equation (33).
Using the shift invariant exponential form for Tf, the maximum entropy distributions in Equation (32) become
py ∝ my e^{−λe^{βw}}.
This exponential scaling has a simple interpretation. Consider the example in which w is a linear measure of time, y, and β is a rate of exponential growth (or decay). Then the measurement scale, Tf, transforms each underlying time value, y, into a final observable value after exponential growth, eβy. The random time values, y, become random values of final magnitudes, such as random population sizes after exponential growth for a random time period. In general, exponential growth or decay is shift invariant, because it expresses a constant rate of change independently of the starting point.
If the only information we have about a scaling relation is that the associated probability pattern is shift invariant to transformation of observables, then exponential scaling provides a likely measurement function, and the probability distribution may often take the form of Equation (37).
The Gumbel extreme value distribution in Equation (25) follows exponential scaling. In that case, the underlying observations, y, are transformed into cumulative exponential tail probabilities that, in aggregate, determine the probability that an observation is the extreme value of a sample. The exponential tail probabilities are shift invariant, in the sense that a shifted observation, δ + y, also yields an exponential tail probability. The magnitude of the cumulative tail probability changes with a shift, but the exponential form does not change.
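A small numeric check, with arbitrary values of β and δ, makes this shift invariance concrete: shifting the base observable only rescales the exponential measurement function by a constant, an affine change that the maximum entropy form absorbs into its adjustable constants.

```python
import numpy as np

beta, delta = 0.7, 2.0           # illustrative constants (assumed values)
w = np.linspace(0.0, 5.0, 6)     # a few base-scale values

T = np.exp(beta * w)                     # exponential scale T_f = exp(beta * w)
T_shifted = np.exp(beta * (w + delta))   # the same scale applied to shifted observables

# Shifting w only multiplies T by the constant b = exp(beta * delta),
# an affine change that leaves the associated probability pattern unchanged.
print(T_shifted / T)
print(np.exp(beta * delta))
```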

12.6. Affine Duality and Linear Scaling

Suppose probability patterns do not change when observables undergo an affine transformation G(w) = δ + θw. Affine transformation of observables allows a broader range of changes than does shift transformation. The broader the range of allowable transformations of observables, G, the fewer the measurement functions, T, that will satisfy the affine invariance in Equation (33). Thus affine transformation of observables leads to a narrower range of compatible measurement functions than does shift transformation.
When G is affine with θ ≠ 1, then the associated measurement function Tf must itself be affine. Because Tf is invariant to shift and multiplication, we can say that invariance to affine G means that Tf = w, and thus the maximum entropy probability distribution in Equation (32) becomes linear in the base measurement scale, w, as
py ∝ my e^{−λw}.
This form follows when the probability pattern is invariant to affine transformation of the observables, w. By contrast, invariance to a shift transformation of the observables leads to the broader class of distributions in Equation (37), of which Equation (38) is a special case for the more restrictive condition of invariance to affine transformation of observables.
To understand the relation between affine and shift transformations of observables, G, it is useful to write the expression for the measurement function in Equation (36) more generally as
Tf = (1/β)(e^{βw} − 1),
noting that we can make any affine transformation of a measurement function, Tf ↦ a + bTf, without changing the associated probability distribution. With this new measurement function for shift invariance, as β → 0, then Tf → w, and we recover the measurement function associated with affine G.
Suppose, for example, that we interpret β as a rate of exponential change in the underlying observable, w, before the final measurement. Then, as β → 0, the underlying observable and the final measurement become equivalent, Tf → w, because
Tf = lim_{β→0} (1/β)(e^{βw} − 1) = w.

12.7. Exponential and Gaussian Distributions Arise from Affine Invariance

Suppose we know only that the information in probability patterns does not change when the observables undergo affine transformation, w ↦ δ + θw. The invariance of probability pattern to affine transformation of observables leads to distributions of the form in Equation (38). Thus, if the observable is the underlying value, w = y, then the probability distribution is exponential
py ∝ e^{−λy},
and if the observable is y2, the squared distance of the underlying value from its mean, then the probability distribution is Gaussian
py ∝ e^{−λy^2}.
By contrast, if the probability pattern is invariant to a shift of the observables, but not to an affine transformation of the observables, then the distribution falls into the broader class based on exponential measurement functions in Equation (37).

13. Hierarchical Families of Measurement Scales and Distributions

The general form for probability distributions in Equation (37), repeated here
py ∝ my e^{−λe^{βw}}
arises from a base measurement scale, w, and shift invariance of the probability pattern to changes w ↦ δ + w. Each base scale, w, defines a family of related probability distributions, including the linear form
py ∝ my e^{−λw}
as a special case when the probability pattern is invariant to affine changes w ↦ δ + θw, which corresponds to β → 0 in Equation (39).
We may consider a variety of base scales, w, creating a variety of distinct measurement scales and families of distributions. Ultimately, we must consider how the base scales arise. However, it is useful first to study the commonly observed base scales. The relations between these common base scales form a hierarchical pattern of measurement scales and probability distributions [4].

13.1. A Recursive Hierarchy for the Base Scale

The base scales associated with common distributions typically arise as combinations of linear and logarithmic scaling. For example, the linear-log scale can be defined by log(c + x). This scale changes linearly in x when x is much smaller than c and logarithmically in x when x is much larger than c. As c → 0, the scale becomes almost purely logarithmic, and for large c, the scale becomes almost purely linear.
We can generate a recursive hierarchy of linear-log scale deformations by
w(i) = log(ci + w(i−1)).
The hierarchy begins with w(0) = fy, in which fy denotes our underlying observable. Recursive expansion of the hierarchy yields: a linear scale, w(0) = fy; a linear-log deformation, w(1) = log(c1 + fy); a linear-log deformation of the linear-log scale, w(2) = log(c2 + log(c1 + fy)); and so on. A log deformation of a log scale arises as a special case, leading to a double log scale.
Other scales, such as the log-linear scale, can be expanded in a similarly recursive manner. We may also consider log-linear-log scales and linear-log-linear scales. We can abbreviate a scale, w, by its recursive deformation and by its level in a recursive hierarchy. For example,
LinLog(2) = log(c2 + log(c1 + fy))
is the second recursive expansion of a linear-log deformation. The initial value for any recursive hierarchy with a superscript of i = 0 associates with the base observable w(0) = fy, which I will also write as “Linear,” because the base observable is always a linear expression of the underlying observable, fy.
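As a small illustration of this recursion (a sketch only, with arbitrary deformation constants), the base scale can be built by repeatedly applying the linear-log deformation:

```python
import numpy as np

def lin_log_scale(fy, cs):
    """Recursive linear-log deformation: w(i) = log(c_i + w(i-1)), starting from w(0) = fy."""
    w = np.asarray(fy, dtype=float)
    for c in cs:        # cs holds the deformation constants c_1, c_2, ... (illustrative)
        w = np.log(c + w)
    return w

fy = np.array([0.1, 1.0, 10.0, 1000.0])
print(lin_log_scale(fy, cs=[]))          # Linear: w = fy
print(lin_log_scale(fy, cs=[5.0]))       # LinLog(1): log(c1 + fy)
print(lin_log_scale(fy, cs=[5.0, 2.0]))  # LinLog(2): log(c2 + log(c1 + fy))
```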

13.2. Examples of Common Probability Distributions

Table 1 shows that commonly observed probability distributions arise from combinations of linear and logarithmic scaling. For example, the simple linear-log scale expresses linear scaling at small magnitudes and logarithmic scaling at large magnitudes. The distributions that associate with linear-log scaling include very common patterns.
For direct observables, fy = y, the linear-log scale includes the purely linear exponential distribution as a limiting case, the purely logarithmic power law (Pareto type I) distribution as a limiting case, and the Lomax (Pareto type II) distribution that is exponential at small magnitudes and has a power law tail at large magnitudes.
For observables that measure the squared distance of fluctuations from a central location, fy = (y − μ)^2, or y^2 for simplicity, the linear-log scale includes the purely linear Gaussian (normal) distribution as a limiting case, and the generalized Student’s distribution that is a Gaussian linear pattern for small deviations from the central location and grades into a logarithmic power law pattern in the tails at large deviations.
Most of the commonly observed distributions arise from other simple combinations of linear and logarithmic scaling. To mention just two further examples among the many described in Table 1, the log-linear scale leads to the gamma distribution, and the log-linear-log scale leads to the commonly observed beta distribution.

14. Why do Linear and Logarithmic Scales Dominate?

Processes in the natural world often cause highly nonlinear transformations of inputs into outputs. Why do those complex nonlinear transformations typically lead in the aggregate to simple combinations of linear and logarithmic base scales? Several possibilities exist [20]. I mention a few in this section. However, I do not know of any general answer to this essential question. A clear answer would greatly enhance our understanding of the commonly observed patterns in nature.

14.1. Absolute versus Relative Incremental Information

The scaling of information often changes between linear and logarithmic as magnitude changes. At some magnitudes, a fixed measurement increment provides about the same (linear) information over a varying range, whereas at other magnitudes, a fixed measurement increment provides relatively less (logarithmic) information as values increase.
Consider the example of measuring distance [6,20]. Start with a ruler that is about the length of your hand. With that ruler, you can measure the size of all the visible objects in your office. That scaling of objects in your office with the length of the ruler means that those objects have a natural linear scaling in relation to your ruler.
Now consider the distances from your office to various galaxies. If the distance is sufficiently great, your ruler is of no use, because you cannot distinguish whether a particular galaxy moves farther away by one ruler unit. Instead, for two distant galaxies, you can measure the ratio of distances from your office to each galaxy. You might, for example, find that one galaxy is twice as far as another, or, in general, that a galaxy is some percentage farther away than another. Percentage changes define a ratio scale of measure, which has natural units in logarithmic measure [21]. For example, a doubling of distance always adds log(2) to the logarithm of the distance, no matter what the initial distance.
Measurement naturally grades from linear at local magnitudes to logarithmic at distant magnitudes when compared to some local reference scale. The transition between linear and logarithmic varies between problems, depending partly on measurement technology. Measures from some phenomena remain primarily in the linear domain, such as measures of height and weight in humans. Measures for other phenomena remain primarily in the logarithmic domain, such as large cosmological distances. Other phenomena scale between the linear and logarithmic domains, such as fluctuations in the price of financial assets [25] or the distribution of income and wealth [26].
Consider the opposite direction of scaling, from local magnitude to very small magnitude. Your hand-length ruler is of no value for small magnitudes, because it cannot distinguish between a distance that is a fraction 10−4 of the ruler and a distance that is 2 × 10−4 of the ruler. At small distances, one needs a standard unit of measure that is the same order of magnitude as the distinctions to be made. A ruler of length 10−4 distinguishes between 10−4 and 2 × 10−4, but does not distinguish between 10−8 and 2 × 10−8. At small magnitudes, ratios can potentially be distinguished, causing the unit of informative measure to change with scale. Thus, small magnitudes naturally have a logarithmic scaling.
As we change from very small to intermediate to very large, the measurement scaling naturally grades from logarithmic to linear and then again to logarithmic, a log-linear-log scaling [20]. The locus of linearity and the meaning of very small and very large differ between problems, but the overall pattern of the scaling relations remains the same.

14.2. Common Arithmetic Operations Lead to Common Scaling Relations

Perhaps linear and logarithmic scaling reflect aggregation by addition or multiplication of fluctuations. Adding fluctuations often tends in the limit to a smooth linear scaling relation. Multiplying fluctuations often tends in the limit to a smooth logarithmic scaling relation.
Consider the basic log-linear scale that leads to the gamma distribution. A simple generative model for the gamma distribution arises from the waiting time for the kth event to occur. At time zero, no events have occurred.
At small magnitudes of time, the occurrence of all k events requires essentially simultaneous occurrence of all of those events. Nearly simultaneous occurrence happens roughly in proportion to the product of the probability of any single event occurring in a small time interval. Multiplication associates with logarithmic scaling.
At large magnitudes of time, either all k events have occurred, or in most cases k − 1 events have occurred and we wait only for the last event. The waiting time for a single event follows the exponential distribution associated with linear scaling. Thus, the waiting time for k events naturally follows a log-linear pattern.
Any process that requires simultaneity at extreme magnitudes leads to logarithmic scaling at those limits. Thus, a log-linear-log scale may be a very common underlying pattern. Special cases include log-linear, linear-log, purely log, and purely linear. For those variant patterns, the actual extreme tails may be logarithmic, although difficulty observing the extreme tail pattern may lead to many cases in which a linear tail is a good approximation over the range of observable magnitudes.
Other aspects of aggregation and limiting processes may also lead to the simple and commonly observed scaling relations. For example, fractal theory provides much insight into logarithmic scaling relations [27,28]. However, I do not know of any single approach that matches the simplicity of the commonly observed combinations of linear and logarithmic scaling patterns to a single, simple underlying theory.
The invariances associated with simple scaling patterns may provide some clues. As noted earlier, shift invariance associates with exponential scaling, and affine invariance associates with linear scaling. It is easy to show that power law invariance associates with logarithmic scaling. For example, in the measurement scale invariance expression given in Equation (33), the invariance holds for a log scale, T(y) = log(y), in relation to power law transformations of the observables, G(y) = cyγ, as shown in Equation (34).
We may equivalently say that a scaling relation satisfies power law invariance or that a scaling relation is logarithmic. Noting the invariance does not explain why the scaling relation and the associated invariance are common, but it does provide an alternative and potentially useful way in which to study the problem of commonness.

15. Asymptotic Invariance

The measurement functions, T, that define maximum entropy distributions satisfy the affine invariance given in Equation (35), repeated here
T ~ T ○ G.
One can think of G as an input-output function that transforms observations in a way that does not change information with respect to probability pattern.
Most of the commonly observed probability patterns have a simple form, associated with a simple measurement function composed of linear, logarithmic, and exponential components. I have emphasized the open problem of why the measurement functions, T, tend to be confined to those simple forms. That simplicity of measurement implies an associated simplicity for the form of G under which information remains invariant. If we can figure out why G tends to be simple, then perhaps we may understand the simplicity of T.

15.1. Multiple Transformations of Observations

At the microscopic scale, observations may tend to get transformed or filtered through a variety of complex processes represented by variable and complex forms of G. Then, for a simple measurement function, T, the fundamental affine invariance would not hold
T ≁ T ○ G.
However, the great lesson of statistical mechanics and maximum entropy is that, for complex underlying processes, aggregation often smooths ultimate pattern into a simple form. Perhaps multiple filtering of observations through input-output functions G would, in the aggregate, lead to a simple overall form for the transformation of initial observations into the actual values observed [20].
We can study how multiple applications of input-output transformations may influence the measurement function, T. Note that in the basic invariance of Equation (42), application of G does not change the information in observations. Thus, we can apply G multiple times and still maintain invariant information. If we write Gn or Gs for n or s applications of input-output processing for n, s = 0, 1, 2, …, then we can write the more general expression for the fundamental measurement and information invariance as
T ○ G^n ~ T ○ G^s.

15.2. Invariance in the Limit

Suppose that, for a simple measurement function, T, and a complex input-output process, G, the basic invariance does not hold, as in Equation (43). However, it may be that multiple rounds of processing by G ultimately lead to a relatively simple transformation of the initial inputs to the final outputs. In other words, G may be complex, but for sufficiently large n, the form of Gn may be simple [20]. This aggregate simplicity may lead in the limit to asymptotic invariance
T ○ G^n ~ T ○ G
as n becomes sufficiently large. It is not necessary for every G to be identical. Instead, each G may be a sample from a pool of alternative transformations. Each individual transformation may be complicated. However, in the aggregate, the overall relation between the initial inputs and final outputs may smooth asymptotically into a simple form, such as a power law. If so, then the associated measurement scale smooths asymptotically into a simple logarithmic relation.
Other aggregates of input-output processing may smooth into affine or shift transformations, which associate with linear or exponential scales. When different invariances hold at different magnitudes of the initial inputs, then the measurement scale will change with magnitude. For example, a log-linear scale may reflect asymptotic power law and affine invariances at small and large magnitudes.

16. Discussion

Aggregation smooths underlying complexity into simple patterns. The common probability patterns arise by the dissipation of information in aggregates. Each additional random perturbation increases entropy until the distribution of observations takes on the maximum entropy form. That form has lost all information except the constraints on simple average values.
For each particular probability distribution, the constraint on average value arises on a characteristic measurement scale. That scaling relation, T, defines the form of the maximum entropy probability distributions
py ∝ my e^{−λTf}
as initially presented in Equation (8), for which T = Tf. Here, my accounts for cases in which information dissipates on one scale, but we measure probability pattern on a different scale.
The common probability distributions tend to have simple forms for T that follow linear, logarithmic, or exponential scaling at different magnitudes. The way in which those three fundamental scalings grade into each other as magnitude changes sets the overall scaling relation.
A scaling relation defines the associated maximum entropy distribution. Thus, reading a probability distribution as a statement about process reduces to reading the embedded scaling relation, and trying to understand the processes that cause such scaling. Similarly, understanding the familial relations between probability patterns reduces to understanding the familial relations between different measurement scales.
The greatest open puzzle concerns why a small number of simple measurement scales dominate the commonly observed patterns of nature. I suggested that the solution may follow from the basic invariance that defines a measurement scale. Equation (35) presented that invariance as
T ~ T ○ G.
The measurement scale, T, is affine invariant to transformation of the observations by G. In other words, the information in measurements with regard to probability pattern does not change if we use the directly measured observations or we measure the observations after transformation by G, when analyzed on the scale T.
In many cases, the small scale processes, G, that transform underlying values may have complex forms. If so, then the associated scaling relation T, might also be complex, leaving open the puzzle of why observable forms of T tend to be simple. I suggested that underlying values may often be transformed by multiple processes before ultimate measurement. Those aggregate transformations may smooth into a simple form with regard to the relation between initial inputs and final measurable outputs. If we express a sequence of n transformations as Gn, then the asymptotic invariance of the aggregate processing may be simple in the sense that
T ○ G^n ~ T ○ G
as given by Equation (45). Here, the measurement scaling T, and the aggregate input-output processing Gn are relatively simple and consistent with commonly observed patterns.
The puzzle concerns how aggregate input-output processing smooths into simple forms [20]. In particular, how does a combination of transformations lead in the aggregate to a simple asymptotic invariance?
The scaling pattern for any aggregate input-output relation may have simple asymptotic properties. The application to probability patterns arises when we embed a simple asymptotic scaling relation into the maximum entropy process of dissipating information. The dissipation of information in maximum entropy occurs as measurements are made on the aggregation of individual outputs.
Two particularly simple forms of invariance by T to input-output processing by Gn may be important. If Gn is a shift transformation w ↦ δ + w for some base scaling w, then the associated measurement scale has the form Tf = e^{βw}. This exponential scaling corresponds to the fact that exponential growth or decay is shift invariant. With exponential scaling, the general maximum entropy form is
py ∝ my e^{−λe^{βw}}.
The extreme value distributions and other common distributions derive from that double exponential form. The particular distribution depends on the base scaling, w, as illustrated in Table 1.
Shift transformation is a special case of the broader class of affine transformations, w ↦ δ + θw. If Gn causes affine changes, then the broader class of input-output relations leads to a narrower range of potential measurement scales that preserve invariance. In particular, an affine measurement scale is the only scale that preserves information about probability pattern in relation to affine transformations. For maximum entropy probability distributions, we may write Tf = w for the measurement scale that preserves invariance to affine Gn, leading to the simpler form for probability distributions
py ∝ my e^{−λw},
which includes most of the very common probability distributions. Thus, the distinction between asymptotic shift and affine changes of initial base scales before potential measurement may influence the general form of probability patterns.
In summary, the common patterns of nature follow a few generic forms. Those forms arise by the dissipation of information and the scaling relations of measurement. The measurement scales arise from the particular way in which the information in a probability pattern is invariant to transformation. Information invariance apparently limits the common measurement scales to simple combinations of linear, logarithmic, and exponential components. Common probability distributions express how those component scales grade into one another as magnitude changes.

Acknowledgments

National Science Foundation grant DEB–1251035 supports my research.

Appendix

A. Scale Transformation

In some cases, information dissipates on one scale, but we wish to express the probability pattern on another scale. For example, a process may lead to a final measured value that is the product of a series of underlying processes. The product of multiple values is equal to the sum of the logarithms of those values. So we may consider how information dissipates as the logarithm of each individual component is added to the total. The theory for the dissipation of information has a particularly simple interpretation as the sum of independent random processes.
The sum of random processes often converges to a Gaussian distribution, preserving information only about the average squared distance of fluctuations around the mean. Thus, we obtain a simple expression for the dissipation of information when we transform the final measured values, which arise by multiplication, to the additive logarithmic scale. After finding the shape of the distribution on the log transformed scale, it makes sense to transform the distribution of values back to the original scale of the measurements. In the case of a Gaussian distribution on the altered scale, the transformation back to the original scale leads to the pattern known as the log-normal distribution.
The transformations associated with the log-normal distribution are well known. Because the Gaussian distribution is a standard component of simple maximum entropy approaches, the log-normal also falls within that scope. However, the Gaussian and log-normal transformation pair are sometimes considered to be a special case. Here, I emphasize that one must understand the structure of the transformation argument more generally. Information often dissipates on one scale, but we may wish to express probability patterns on another scale.
Once one recognizes the more general structure for the dissipation of information, many previously puzzling patterns fall naturally within the scope of a simple theory of probability patterns. In the main text, I discuss important examples, particularly the extreme value distributions that play a central role in many applications of risk analysis. In this appendix, I give the general form by which one can express the different scales for the dissipation of information and for measurement. That general form provides the key to reading the mathematical expressions of probability patterns as simple statements about process.
For continuous variables, probability expressions describe the chance that an observation is close to a value y. The chance of observing a value exactly equal to y must be close to zero, because there are essentially an infinite number of possible values that y can take on. So we describe probability in terms of the chance that an observation falls into a small interval between y and y + dy, where dy is a small increment. We write the probability of falling into a small increment near y as py|dy|.
We are interested in understanding the distribution on the scale y. However, suppose that information dissipates on a different scale given by x, leading to the distribution px|dx|. After obtaining the distribution on the scale x by applying the theory for the dissipation of information and constraint, we often wish to transform the distribution to the original scale y. The relation between x and y is given by the transformation x = g(y), where g is some function of y. For example, we may have x = log(y). In general, we can use any transformation that has meaning for a particular problem.
By standard calculus, we can write dx = g′(y)dy, where g′ is the derivative of g with respect to y. Define my = |g′(y)|, which gives a notation my that emphasizes the term as the translation between the measurement scales for x and y. Thus
|dx| = my |dy|.
Because x = g(y), we can also write px = pg(y), and so
px |dx| = my pg(y) |dy| = py |dy|,
or
py = my px = my pg(y).
Because information dissipates on the scale x, we can often find the distribution px relatively easily. From Equation (5), the form of that distribution is
px ∝ e^{−λfx}.
Applying the change in measure in Equation (47), we obtain
py ∝ my e^{−λfg(y)}.
To illustrate, consider the log-normal example, in which x = g(y) = log(y) and my = y−1. On the logarithmic scale, x, the distribution is Gaussian
px ∝ e^{−λ(x−μ)^2},
in which λ = 1/2σ2. From Equation (47), we obtain the distribution on the original scale, y, as
py ∝ y^{−1} e^{−λ(log(y)−μ)^2},
which is the log-normal distribution. The relation between the Gaussian and the log-normal is widely known. However, the general principle of studying the dissipation of information on one scale and then transforming to another scale is more general. That relation is an essential step in reading probability expressions in terms of process and in unifying the commonly observed distributions into a single general framework.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Frank, S.A. Dynamics of Cancer: Incidence, Inheritance, and Evolution; Princeton University Press: Princeton, NJ, USA, 2007.
2. Jaynes, E.T. Probability Theory: The Logic of Science; Cambridge University Press: New York, NY, USA, 2003.
3. Frank, S.A. The common patterns of nature. J. Evol. Biol. 2009, 22, 1563–1585.
4. Frank, S.A.; Smith, E. A simple derivation and classification of common probability distributions based on information symmetry and measurement scale. J. Evol. Biol. 2011, 24, 469–484.
5. Frank, S.A. Measurement scale in maximum entropy models of species abundance. J. Evol. Biol. 2011, 24, 485–496.
6. Frank, S.A.; Smith, E. Measurement invariance, entropy, and probability. Entropy 2010, 12, 289–303.
7. Jaynes, E.T. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620–630.
8. Jaynes, E.T. Information theory and statistical mechanics. II. Phys. Rev. 1957, 108, 171–190.
9. Gibbs, J.W. Elementary Principles in Statistical Mechanics; Scribner: New York, NY, USA, 1902.
10. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: New York, NY, USA, 1991.
11. Tribus, M. Thermostatics and Thermodynamics: An Introduction to Energy, Information and States of Matter, with Engineering Applications; Van Nostrand: New York, NY, USA, 1961.
12. Tsallis, C. Introduction to Nonextensive Statistical Mechanics; Springer: New York, NY, USA, 2009.
13. Bracewell, R.N. The Fourier Transform and its Applications, 3rd ed.; McGraw Hill: Boston, MA, USA, 2000.
14. Embrechts, P.; Kluppelberg, C.; Mikosch, T. Modeling Extremal Events: For Insurance and Finance; Springer: Heidelberg, Germany, 1997.
15. Kotz, S.; Nadarajah, S. Extreme Value Distributions: Theory and Applications; World Scientific: Singapore, Singapore, 2000.
16. Coles, S. An Introduction to Statistical Modeling of Extreme Values; Springer: New York, NY, USA, 2001.
17. Gumbel, E.J. Statistics of Extremes; Dover Publications: New York, NY, USA, 2004.
18. Frank, S.A. Generative models versus underlying symmetries to explain biological pattern. J. Evol. Biol. 2014, 27, 1172–1178.
19. Beck, C.; Cohen, E. Superstatistics. Physica A 2003, 322, 267–275.
20. Frank, S.A. Input-output relations in biological systems: Measurement, information and the Hill equation. Biol. Direct 2013, 8, 31.
21. Hand, D.J. Measurement Theory and Practice; Arnold: London, UK, 2004.
22. Johnson, N.L.; Kotz, S.; Balakrishnan, N. Continuous Univariate Distributions, 2nd ed.; Wiley: New York, NY, USA, 1994.
23. Johnson, N.L.; Kotz, S.; Balakrishnan, N. Continuous Univariate Distributions, 2nd ed.; Volume 2, Wiley: New York, NY, USA, 1995.
24. Kleiber, C.; Kotz, S. Statistical Size Distributions in Economics and Actuarial Sciences; Wiley: New York, NY, USA, 2003.
25. Aparicio, F.; Estrada, J. Empirical distributions of stock returns: European securities markets, 1990–95. Eur. J. Finance 2001, 7, 1–21.
26. Dragulescu, A.A.; Yakovenko, V.M. Exponential and power-law probability distributions of wealth and income in the United Kingdom and the United States. Physica A 2001, 299, 213–221.
27. Mandelbrot, B.B. The Fractal Geometry of Nature; W. H. Freeman: London, UK, 1983.
28. Sornette, D. Critical Phenomena in Natural Sciences: Chaos, Fractals, Self-organization, and Disorder: Concepts and Tools; Springer: New York, NY, USA, 2006.
Table 1. Some Common Probability Distributions*.

| Distribution | $p_y$ | $w$ | Notes and alternative names |
| --- | --- | --- | --- |
| Gumbel | $e^{\beta y - \lambda e^{\beta y}}$ | Linear | $m_y = T'$ |
| Gibbs/Exponential | $e^{-\lambda y}$ | Linear | $\beta \to 0$ |
| Gauss/Normal | $e^{-\lambda y^2}$ | Linear | $\beta \to 0$; $f_y = y^2$ |
| Rayleigh | $y\,e^{-\lambda y^2}$ | Linear | $\beta \to 0$; $f_y = y^2$; $m_y = T'$ |
| Log-Normal | $y^{-1} e^{-\lambda (\log y)^2}$ | Linear | $\beta \to 0$; $f_y = y^2$; $y \to \log y$; $m_y = y^{-1}$ |
| Stretched exponential | $e^{-\lambda y^{\beta}}$ | Log^(1) | Gauss with $\beta = 2$ |
| Fréchet/Weibull | $y^{\beta - 1} e^{-\lambda y^{\beta}}$ | Log^(1) | $m_y = T'$; Rayleigh with $\beta = 2$ |
| Symmetric Lévy | $e^{-\lambda \lvert y \rvert^{\beta}}$ (Fourier domain) | Log^(1) | $f_y = \lvert y \rvert$; $\beta \le 2$; Gauss ($\beta = 2$), Cauchy ($\beta = 1$); Equation (31) |
| Pareto type I | $y^{-\lambda}$ | Log^(1) | $\beta \to 0$; $m_y = 1$ or $m_y = T'$ |
| Log-Fréchet | $y^{-1} (\log y)^{\beta - 1} e^{-\lambda (\log y)^{\beta}}$ | Log^(2) | $m_y = T'$; also from Fréchet: $y \to \log y$, $m_y = y^{-1} T'(y)$ |
| ?? | $e^{-\lambda (\log y)^{\beta}}$ | Log^(2) | Also from stretched exponential with $f_y = \log y$ |
| Log-Pareto type I | $y^{-1} (\log y)^{-\lambda}$ | Log^(2) | $\beta \to 0$; $m_y = T'$; also from Pareto I: $y \to \log y$, $m_y = y^{-1}$ |
| ?? | $(\log y)^{-\lambda}$ | Log^(2) | $\beta \to 0$; also from Pareto I with $f_y = \log y$ |
| Pareto type II/Lomax | $(c_1 + y)^{-\lambda}$ | LinLog^(1) | $\beta \to 0$ |
| Generalized Student's | $(c_1 + y^2)^{-\lambda}$ | LinLog^(1) | $\beta \to 0$; $f_y = y^2$; Pearson VII, Kappa; Cauchy for $\lambda = 1$ |
| ?? | $(\log(c_1 + y))^{-\lambda}$ | LinLog^(2) | $\beta \to 0$; $c_2 = 0$ |
| Gamma | $y^{-\lambda} e^{-c_1 \lambda y}$ | LogLin^(1) | $\beta \to 0$; Pearson type III, includes chi-square |
| Gamma-Gauss | $y^{-\lambda} e^{-c_1 \lambda y^2}$ | LogLin^(1) | $\beta \to 0$; $f_y = y^2$; $m_y = 1$ or $m_y = T'$; Rayleigh $\lambda = -1$ |
| Generalized gamma | $y^{-\gamma(\lambda - 1) - 1} e^{-c_1 \lambda y^{\gamma}}$ | LogLin^(1) | $\beta \to 0$; $y \to y^{\gamma}$; $m_y = y^{\gamma - 1}$; Chi for $\gamma = 2$ and $c_1\lambda = 1/2$ |
| Beta | $(c_2 - y)^{-\lambda} (y - c_1)^{-b\lambda}$ | LogLinLog^(1) | $\beta \to 0$; Pearson type I; $c_1 \le y \le c_2$ |
| Beta prime/F | $y^{-b\lambda} (1 + y)^{(b + 1)\lambda - 2}$ | LogLinLog^(1) | $\beta \to 0$; $y \to y/(1 + y)$; $m_y = (1 + y)^{-2}$; $y > 0$; Pearson VI |
| Gamma variant | $(c_1 + y)^{-b\lambda} e^{-c_2 \lambda y}$ | LinLogLin^(1) | $\beta \to 0$; $y > 0$ |

*Assumptions: the base form for $p_y$ is always $m_y e^{-\lambda e^{\beta w}}$, in which $T_f = e^{\beta w}$, as given in Equation (37). The $w$ column describes the base scale, expressed as combinations of Lin (Linear) and Log scaling, with the superscript denoting the number of recursions as in Equation (40). For example, Log^(1) implies that $w(f_y) = \log(f_y)$, and LinLog^(1) implies $w(f_y) = \log(c_1 + f_y)$. Purely linear scaling is shown as "Linear," which implies $w = f_y$. Recursive expansion of a linear scale remains linear, so no superscript is given for linear scales. Unless otherwise noted, $f_y = y$, shift invariance only is assumed for T with respect to G with $\beta \neq 0$, and $m_y = 1$. When $\beta \to 0$ is shown, affine invariance holds for T with respect to G. For extreme value distributions, $m_y = T'$ abbreviates the proper change of scale, $m_y = \lvert T'_f \rvert$, in which information dissipates on the cumulative distribution scale. Change of variable is shown as $y \to g(y)$, which often leads to a change of scale, $m_y = g'(y)$. Direct values $y$, possibly corrected by displacement from a central location, $y - \mu$, are shown here as $y$ without correction. Squared deviations $(y - \mu)^2$ from a central location are shown here as $y^2$. Listings of distributions can be found in various texts [22–24]. Many additional forms can be generated by varying the measurement function. In the first column, the question marks denote a distribution for which I did not find a commonly used name. Modified from Table 5 of Frank and Smith [4]. See that article for additional details.
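To make the base form of the footnote concrete, here is a small numerical sketch (my own illustration in Python with NumPy, not code from the article; the particular values of $\lambda$, $\beta$ and the grid are arbitrary choices). It checks that $m_y e^{-\lambda e^{\beta w}}$ with a linear scale and small $\beta$ approaches an exponential distribution, and that a logarithmic scale yields the stretched exponential.

```python
# A minimal numerical sketch (illustration only, not code from the article):
# the base form p_y ∝ m_y exp(-λ e^{βw}) reduces to familiar distributions
# for particular choices of the measurement scale w and the limit β → 0.
import numpy as np

y = np.linspace(1e-6, 20.0, 200_000)
dy = y[1] - y[0]

def density_from_log(log_p):
    """Exponentiate a log-density and normalize it numerically on the grid y."""
    p = np.exp(log_p - log_p.max())   # subtract the max for numerical stability
    return p / (p.sum() * dy)

# Linear scale (w = y), small β: exp(βy) ≈ 1 + βy, so the base form tends to
# an exponential distribution with rate λβ (here λβ = 0.5).
lam, beta = 500.0, 0.001
p_general = density_from_log(-lam * np.exp(beta * y))
p_exponential = density_from_log(-lam * beta * y)
print(np.abs(p_general - p_exponential).max())   # small, and shrinks as β → 0

# Logarithmic scale (w = log y): exp(βw) = y**β, giving the stretched
# exponential e^{-λ y^β} directly.
lam, beta = 1.0, 0.5
p_general = density_from_log(-lam * np.exp(beta * np.log(y)))
p_stretched = density_from_log(-lam * y**beta)
print(np.abs(p_general - p_stretched).max())     # equal up to rounding error
```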
