Next Article in Journal
Continuous-Discrete Path Integral Filtering
Previous Article in Journal
Thompson, W.A. et al. Decimative Multiplication of Entropy Arrays, with Application to Influenza. Entropy, 2009, 11, 351-359
Article

Properties of the Statistical Complexity Functional and Partially Deterministic HMMs

Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany
Received: 31 March 2009 / Accepted: 5 August 2009 / Published: 11 August 2009

Abstract

Statistical complexity is a measure of complexity of discrete-time stationary stochastic processes, which has many applications. We investigate its more abstract properties as a non-linear function of the space of processes and show its close relation to the Knight’s prediction process. We prove lower semi-continuity, concavity, and a formula for the ergodic decomposition of statistical complexity. On the way, we show that the discrete version of the prediction process has a continuous Markov transition. We also prove that, given the past output of a partially deterministic hidden Markov model (HMM), the uncertainty of the internal state is constant over time and knowledge of the internal state gives no additional information on the future output. Using this fact, we show that the causal state distribution is the unique stationary representation on prediction space that may have finite entropy.
Keywords: statistical complexity; lower semi-continuity; ergodic decomposition; concavity; prediction process; partially deterministic hidden Markov models (HMMs) statistical complexity; lower semi-continuity; ergodic decomposition; concavity; prediction process; partially deterministic hidden Markov models (HMMs)

1. Introduction

An important task of complex systems sciences is to define “complexity”. Measures that quantify complexity are of both theoretical (e.g., [1]) and practical interest. In applications, they are widely used to identify “interesting” parts of simulations and real-world data (e.g., [2]). There exist various measures of different kinds of complexity. In particular, statistical complexity constitutes a complexity measure for stationary stochastic processes in doubly infinite discrete time and discrete state space. It was introduced by Jim Crutchfield and co-workers within a theory called computational mechanics [3,4,5]. Note that here “computational mechanics” is unrelated to computer simulations of mechanical systems. Statistical complexity is applied to a variety of real-world data, e.g., in [6]. An important, closely related concept of computational mechanics is the so-called ε -machine. It is a particular partially deterministic HMM that encodes the mechanisms of prediction. Partially deterministic HMMs are often called deterministic stochastic automata to emphasize their close connection to a key concept of theoretical computer science, namely deterministic finite state automata [7].
In this paper, we look at more abstract features of statistical complexity as well as partially deterministic HMMs. We consider statistical complexity to be a non-linear functional from the space of Δ-valued stationary processes (Δ countable) to the set R ¯ + = R + { } of non-negative extended real numbers. Here, we identify stationary processes with their law, i.e., with shift-invariant probability measures on the sequence space Δ Z , and equip the space of measures with the usual weak-* topology (often called “weak topology”). Because Δ is discrete, this topology is equal to the topology of finite-dimensional convergence. In ergodic theory, Kolmogorov-Sinai entropy is studied as a function of the (invariant) measure, and the questions of continuity properties, affinity, and behaviour under ergodic decomposition arise naturally (e.g., [8]). We believe that these questions are worth considering also for complexity measures. A formula for the ergodic decomposition of excess entropy, which is another complexity measure for stochastic processes, was obtained in [9,10]. Our results presented here include the corresponding formula for statistical complexity, and this formula directly implies concavity. The most important result is lower semi-continuity of statistical complexity. We consider this a desirable property for a complexity measure, as it means that a process cannot be complex if it can be approximated by non-complex ones.
In Section 2., we define statistical complexity and show its relations to a discrete version of Frank Knight’s prediction process [11,12]. The prediction process is the measure-valued process of conditional probabilities of the future given the past. It takes values in the space P ( Δ N ) of probability measures on ΔN, called prediction space. In our formulation, statistical complexity is the marginal entropy of the prediction process. This is equivalent to the classical definition as entropy of a certain partition of the past. We only replace equivalence classes with the respective induced probabilities on the future. In this section, we also show that the discrete (and thus technically vastly simplified) version of the prediction process has a continuous Markov transition kernel (Proposition 5).
In Section 3., we investigate properties of partially deterministic HMMs. Here, we use a general notion of HMM (sometimes called edge-emitting HMM), where new internal state and output symbol are jointly determined and may have dependencies conditioned on the last internal state. Partial determinism means that this dependence is extreme in the sense that the last internal state and the output altogether uniquely determine the following internal state. We show that, if one knows the past output trajectory, the remaining uncertainty (measured by entropy) of the internal state is constant over time, although it may depend on the ergodic component (Proposition 18). Furthermore, the distribution of future output is the same for any internal state that is compatible with the past output (Corollary 20). In Section 3.3., we construct a canonical Markov kernel, such that taking any measure ν on prediction space P ( Δ N ) (i.e., ν is a measure on measures) as initial distribution, we obtain a partially deterministic HMM of a process P P ( Δ N ) . This process P coincides with the measure r ( ν ) represented by ν in the sense of integral representation theory, and if ν is appropriately chosen, we obtain the ε -machine of computational mechanics (or something isomorphic) as special case. Using the properties of partially deterministic HMMs, we obtain that there is no invariant representation on prediction space with finite entropy other than, possibly, the causal state distribution, which may have finite or infinite entropy (Proposition 23).
Section 4. contains our results about statistical complexity. We show that the complexity of a process is the average complexity of its ergodic components plus the entropy of the mixture (Proposition 26). As a direct consequence, statistical complexity is concave (Corollary 27) and non-continuous (even w.r.t. variational topology). But it does have a continuity property. Namely, using the results of the previous sections, we show in Theorem 32 that it is weak-* lower semi-continuous.

2. Prediction Dynamic and Statistical Complexity

For the whole article, fix a countable set Δ with at least two elements and discrete topology. We identify Δ-valued stochastic processes X Z : = ( X k ) k Z , defined on some probability space ( Ω , A , P ) , with their respective laws P : = P X Z 1 P ( Δ Z ) . Here, P denotes the set of probability measures. If X Z is stationary, P is in the set P inv ( Δ Z ) of shift-invariant probability measures. Let ξ k : Δ Z Δ be the canonical projections. Then ζ Z is a process on ( Δ Z , B ( Δ Z ) , P ) with the same distribution as X Z . Here, B denotes the Borel σ-algebra. We often decompose the time set Z into the “future” N and the “past” Z \ N = N 0 , where N 0 = N { 0 } . For simplicity of notation, we denote the canonical projections on Δ N with the same symbols, ζk, as the projections on Δ Z . If not stated otherwise, product spaces are equipped with product and spaces of probability measures are equipped with weak- -topology. We use the arrow * to denote weak- convergence.

2.1. Discrete Version of Knight’s Prediction Process

For every measurable stochastic process with time set R + on some Lusin space, Frank Knight defines the corresponding prediction process as a process of conditional probabilities of the future given the past. This theory originated in [11] and was further developed in [12,13,14]. The most important properties of the prediction process are that its paths are right continuous with left limits (cadlag), it has the strong Markov property and determines the original process. The continuity of the time set and the generality of the state space lead to a lot of technical difficulties. In our simpler, discrete setting, these difficulties mostly disappear, and useful properties of the prediction process, such as having cadlag paths, become meaningless. A new aspect, however, is added by considering infinite pasts of stationary processes via the time-set Z . The marginal distribution (unique because of stationarity) of the prediction process is an important characteristic, which is used to define statistical complexity. For this subsection, fix a stationary process X Z with distribution P P inv ( Δ Z ) .
We use the following notation concerning Markov kernels and conditional probabilities. If K is a kernel from Ω to a measurable space M, we consider K as measurable function from Ω to P ( M ) and write K ( ω ; A ) : = K ( ω ) ( A ) for the probability of a measurable set A w.r.t. the measure K ( ω ) . Given random variables X , Y on Ω, we write K = P ( X Y ) if K is the conditional probability kernel of X given Y, i.e., K ( ω ; A ) = P ( { X A } | Y ) ( ω ) .
Definition 1. 
Let Z Z = Z Z P be the P ( Δ N ) -valued stochastic process of conditional probabilities defined by Z k : = P ( ξ [ k + 1 , [ ξ ] , k ] ) for k Z . Then Z Z is called prediction process of X Z . P ( Δ N ) is called prediction space.
It is evident that the Markov property of the prediction process in continuous time also holds in discrete time. Nevertheless, we give a proof, because it is elementary in our discrete setting. The corresponding transition kernel works as follows. Assume the prediction process is in state z P ( Δ N ) . The transition kernel maps z to a measure on measures, namely P ( Z 1 Z 0 = z ) P ( P ( Δ N ) ) . Note that z is a state of the prediction process but at the same time a probability measure. Thus it makes sense to consider the conditional probability given ξ 1 = d w.r.t. the measure z. It is intuitively plausible that the next state will be one of those conditional probabilities with d distributed according to the marginal of z. The resulting measure has to be shifted by one as time proceeds. With ς : Δ N Δ N , we denote the left shift.
Proposition 2. 
For z P ( Δ N ) , let ϕ z : Δ N P ( Δ N ) , ϕ z ( ω ) : = z ( ς 1 ( · ) ξ 1 ) ( ω ) . The prediction process Z Z is a stationary Markov process. The kernel S : P ( Δ N ) P ( P ( Δ N ) ) with S ( z ) = z ϕ z 1 , i.e.
S ( z ) ( B ) : = S ( z ; B ) : = z ( { ϕ z B } ) , z P ( Δ N ) , B B ( P ( Δ N ) ) ,
satisfies P ( Z k Z k 1 ) = S Z k 1 a.s. In other words, S is the transition kernel of the prediction process.
Proof. 
Stationarity is obvious from stationarity of X Z . We obtain a.s.
S ( Z 0 ; B ) = Z 0 ( Z 0 ( ς 1 ( · ) ξ 1 ) B ) = P P ( ξ [ 2 , [ ξ ] , 1 ] ) B | ξ N 0 = P ( { Z 1 B } | ξ N 0 ) .
In particular, P ( { Z 1 B } | ξ N 0 ) is σ ( Z 0 ) -measurable (modulo P) and together with σ ( Z 0 ) σ ( ξ N 0 ) we obtain
P ( { Z 1 B } | Z 0 ) = P ( { Z 1 B } | ξ N 0 ) = S ( Z 0 ; B ) ,
as claimed. We still have to verify the Markov property. But because the σ -algebra induced by Z N 0 is nested between those induced by Z 0 and ξ N 0 , i.e. σ ( Z 0 ) σ ( Z N 0 ) σ ( ξ N 0 ) , we obtain the Markov property from the first equality in (1). □
Definition 3. 
We call the Markov transition S of the prediction process prediction dynamic.
Note that although the prediction process Z Z obviously depends on P, the prediction space P ( Δ N ) and the prediction dynamic S do not. In the case of general Lusin state space, it is non-trivial to prove the existence of the regular versions of conditional probability such that ϕ z ( ω ) is jointly measurable in ( z , ω ) (see [9]). For countable Δ, however, we can obtain essential continuity in an elementary way. This enables us to prove continuity of the prediction danamic.
Lemma 4. 
Let z , z n P ( ʔ N ) and z n * z . There is a clopen (i.e. closed and open) set Ω z ʔ N with z ( Ω z ) = 1 such that ϕ z n * ϕ z , uniformly on compact subsets of Ω z .
Proof. 
Let A ω : = ξ 1 1 ( ξ 1 ( ω ) ) and Ω z : = ω Δ N | z ( A ω ) > 0 . Because Δ is discrete and countable, Ωz id clopen with zz) = 1. Uniform convergence on compacta is equivalent to ϕ z n ( ω n ) * ϕ z ( ω ) whenever ωnω in Ωz. For sufficiently large n, ζ1 (ωn) = ζ1 (ω) and because ς−1 maps cylinder sets to cylinder sets, ϕ z n ( ω n ) = z n ( A ω ς 1 ( ° ) ) z n ( A ω ) * ϕ z ( ω ) .
Proposition 5. 
The prediction dynamic S is continuous.
Proof. 
Let z n , z P ( Δ N ) with z n * z and Ω z as in Lemma 4. We have to show
g d S ( z n ) = g ϕ z n d z n n g ϕ z d z = g d S ( z )
for any bounded continuous g. According to Prokhorov’s theorem, the sequence ( z n ) n N is uniformly tight and we can restrict the integrations to compact subsets. Because lim n z n ( Ω z ) = z ( Ω z ) = 1 , we can restrict to compact subsets of Ω z . There, the convergence of ϕ z n is uniform, thus (2) holds. □

2.2. Statistical Complexity

In integral representation theory, a measure ν P ( P ( Δ N ) ) represents the measure z P ( Δ N ) if
z = r ( ν ) : = P ( Δ N ) id P ( Δ N ) d ν ,
where r : P ( P ( Δ N ) ) P ( Δ N ) is called resolvent or barycentre map (see [15]) and id is the identity map. Here, measure valued integrals are Gel’fand integrals. That is, μ = K d ν for some kernel K means f d μ = f d K ( · ) d ν for all continuous, real-valued f or, equivalently, μ ( B ) = K ( · ; B ) d ν for all measurable sets B. z = r ( ν ) means that z is a mixture (convex combination) of other processes, and the mixture is described by ν . A trivial representation for z is given by δ z , the Dirac measure in z. The measure ν is called S-invariant if ν S = ν , where ν S : = S d ν . In other words, it is S-invariant if the iteration with the prediction dynamic S does not change it. We see in the following lemma that general iteration with S shifts the represented measure, i.e., ν S represents z ς 1 .
Lemma 6. 
r ( ν S ) = r ( ν ) ς 1 . In particular, S-invariant ν represent stationary processes.
Proof. 
Because r ( ν S ) = id P ( Δ N ) d S d ν , it is sufficient to consider Dirac measures δ z , z P ( Δ N ) (the general claim follows by integration over ν ). For Dirac measures we have
r ( δ z S ) = id P ( Δ N ) d S ( z ) = ϕ z d z = z ( ς 1 ( · ) | ξ 1 ) d z = z ς 1
If ν is S-invariant, we also say that ν represents the stationary extension of r ( ν ) to Δ Z . The marginal of the prediction process is an important such representation, which we call causal state distribution because of its close relation to the causal states of computational mechanics.
Definition 7. 
For P P inv ( Δ Z ) , the causal state distribution μ C ( P ) is the marginal distribution of the prediction process, i.e., μ C ( P ) : = P Z 0 1 P ( P ( Δ N ) ) .
The causal state distribution of P is an S-invariant representation of P.
Lemma 8. 
Let P P inv ( Δ Z ) . Then μ C ( P ) is S-invariant and represents P.
Proof. 
From Proposition 2 we know that P ( Z 1 Z 0 ) = S Z 0 and Z Z is stationary. Thus
S d μ C ( P ) = S Z 0 d P = P ( Z 1 Z 0 ) d P = P Z 1 1 = μ C ( P )
Furthermore, μ C ( P ) represents P because we have
r ( μ C ( P ) ) = Z 0 d P = P ( ξ N ξ N 0 ) d P = P ξ N 1
Remark. 
The definitions in computational mechanics are slightly different. There, one works with equivalence classes of past trajectories (called causal states) instead of probability distributions on future trajectories. Because past trajectories x , y Δ N 0 are identified if P ( ξ N ξ N 0 = x ) = P ( ξ N ξ N 0 = y ) , the two approaches are equivalent. The advantage of working on prediction space P ( Δ N ) is that it has a natural topology and the prediction processes of all Δ-valued stochastic processes are described in a unified manner on the same space with the same transition kernel.
Example 9. 
μ C is not continuous. Let P be a non-deterministic i.i.d. (independent, identically distributed) process. Obviously, the causal state distribution of an i.i.d. process is the Dirac measure δ P N in its restriction P N : = P ξ N 1 to positive time. According to [16], periodic measures are dense in the stationary measures and we find an approximating sequence P n * P of periodic measures P n . But the past of a periodic process determines its future. Thus its causal state distribution is supported by the set of Dirac measures on Δ N . Because the set of Dirac measures is closed in P ( P ( Δ N ) ) , the topological supports supp μ C ( P n ) are disjoint from the support supp μ C ( P ) = { P N } . Consequently, μ C ( P n ) cannot converge to μ C ( P ) .
With statistical complexity, we measure complexity of a process P by the “diversity” of its expected futures, given observed pasts (i.e., of μ C ( P ) ). The Shannon entropy H ( μ ) is used as the measure of “diversity” of a probability measure μ . With φ ( x ) : = x log ( x ) , it is defined as
H ( μ ) : = sup { i = 1 n φ ( μ ( B i ) ) | n N , B i disjoint, measurable }
Definition 10. 
For P P inv ( Δ Z ) , the quantity C C ( P ) : = H ( μ C ( P ) ) R ¯ + is called statistical complexity of P.
Note that if the probability space is sufficiently regular (e.g., separable, metrisable), H ( μ ) can only be finite if μ is supported by a countable set A. In this case
H ( μ ) = a A φ ( μ ( { a } ) )
Probably, lower semi-continuity of this entropy functional is well-known. We give a proof in the appendix.
Lemma 11. 
Let M be a separable, metrisable space. Then the entropy H : P ( M ) R ¯ + is weak-* lower semi-continuous.

3. Partially Deterministic HMMs

The probability measures on prediction space induce hidden Markov models (HMMs) with an additional partial determinism property, and it turns out to be helpful to investigate such HMMs. In Section 3.1., we define HMMs and introduce the notation we need for the further discussion. In Section 3.2., we define the partial determinism property and obtain our results about the HMMs satisfying this property. In Section 3.3., we show how measures on prediction space induce partially deterministic HMMs and apply the results from Section 3.2. to prove that the causal state distribution is the only invariant representation on prediction space that can have finite entropy.

3.1. HMMs

We use the term HMM in a wide sense, meaning a pair ( T , μ ) , where μ is an initial probability measure on some Polish space M of internal states and T is a Markov kernel from M to Δ × M. The HMM generates on ( Ω , A , P ) a Δ-valued output process X N and a (coupled) M-valued internal process W N 0 , such that W0 is μ-distributed and the joint process is Markovian with
P ( { X k D , W k B } | X k 1 , W k 1 ) = T ( W k 1 ; D × B ) a.s.
We call (T,μ) an HMM of z P ( Δ N ) if z = P X N 1 if μ ( B ) = T ( · ; Δ × B ) d μ , we say that the HMM is invariant and extend the generated processes to stationary processes X Z and W Z . We need some further notation.
Definition 12. 
Let ( T , μ ) be an HMM, m M , d Δ , and ν P ( M ) .
a)
The output kernel K: M P ( Δ ) is defined by K ( m ) : = K m : = T ( m ; · × M ) P ( Δ ) . We also use the notations K d ¯ ( m ) : = K m ( d ) : = K m ( { d } ) and K ν : = K d ν .
b)
The internal operators L d : P ( M ) P ( M ) { 0 } are defined as follows. L d ( ν ) = 0 and
L d ( ν ) ( B ) : = T ( · ; { d } × B ) d ν K ν ( d )    otherwise .
Remark. 
a)
K m is the distribution of the next output symbol when the internal state is m, i.e. K m = P ( X 1 W 0 = m ) a.s. Further, K μ is the law of X 1 .
b)
The internal operator L d describes the update of knowledge of the internal state when the symbol d Δ is observed. For Dirac measures, we obtain
L d ( δ m ) = P ( W 1 W 0 = m , X 1 = d ) a.s.
Be warned that Ld is not induced by a kernel in the following sense. There is no kernel l d : M P ( M ) such that L d ( ν ) = l d d ν . To see this, note that L d ( ν ) L d ι d ν for ι ( m ) = ζ m , because Ld(ν) is normalised outside the integral as opposed to an individual normalisation of the Ld(ζm) inside the integral on the right-hand side.
It directly follows from the definition of ( X N , W N 0 ) by a Markov kernel that the conditional probability, given that the internal state is m, is obtained by starting the HMM in m. In other words, it is generated by the HMM ( T , δ m ) . Similarly, the conditional probability given an observed symbol X 1 = d is obtained by starting the HMM in the updated initial distribution L d ( μ ) . We formulate these observations in the following lemma and give a formal proof in the appendix.
Lemma 13. 
Let ( T , μ ) be an HMM with internal and output processes W N 0 , X N as above. Then a.s. ( T , δ W 0 ( ω ) ) is an HMM of P ( X N W 0 ) ( ω ) , and ( T , L X 1 ( ω ) ( μ ) ) is an HMM of P ( X [ 2 , [ X 1 ) ( ω ) .
Definition 14. 
(processes Y Z and H Z ) Given an invariant HMM, let Y Z be the P ( M ) -valued stochastic process of expectations over internal states, given by Y k : = P ( W k X ] , k ] ) . Let H Z be the process of entropies of the random measures Y k , i.e., H k ( ω ) : = H ( Y k ( ω ) ) , where entropy H is defined by (4).
Remark. 
Y k describes the current knowledge of the internal state, given the past. H k is the entropy of the value of Y k and measures “how uncertain” the knowledge of the internal state is. It is important to bear in mind that this is different from the entropy of the random variable Y k . To avoid confusion, we always write H P ( X ) when referring to the entropy of a random variable X defined on a probability space with measure P.
The following lemma justifies the idea of the internal operator L d being an update of knowledge of the internal state. Furthermore, it enables us to condition on Y 0 instead of X N 0 . The conditional probability of the internal state given the past, Y 0 , contains as much information about X 1 (and in fact X N , but we do not need that here) as the past X N 0 does.
Lemma 15. 
a)
Y 1 ( ω ) = L X 1 ( ω ) ( Y 0 ( ω ) ) a.s.
b)
P ( { X 1 = d } | Y 0 ) ( ω ) = P ( { X 1 = d } | X N 0 ) ( ω ) = K Y 0 ( ω ) ( d ) a.s.
Proof. 
Conditional independence of ( X 1 , W 1 ) and X N 0 given W 0 implies that a.s. P ( X 1 , W 1 W 0 ) = P ( X 1 , W 1 W 0 , X N 0 ) and thus
T d Y 0 = P ( X 1 , W 1 W 0 ) d P ( · X N 0 ) = P ( X 1 , W 1 X N 0 )
a)
Let d = X 1 ( ω ) and for B B ( M ) set F B : = { X 1 = d , W 1 B } . We obtain a.s.
L d ( Y 0 ) ( B ) = ( 5 ) P ( F B X N 0 ) P ( F M X N 0 ) = ( d = X 1 ( ω ) ) P ( { W 1 B } | X N 0 , X 1 ) = Y 1 ( · ) ( B )
b)
The second equality follows directly from (5). The first follows because, due to the second equality, P ( { X 1 = d } | X N 0 ) is σ ( Y 0 ) -measurable modulo P. □
The previous lemma enables us to prove that Y Z is Markovian and compute its transition kernel. We already know that L d ( ν ) is the updated expectation of the internal state when it was previously ν and is now observed d. Thus it is not surprising that the conditional probability of Y k given Y k 1 = ν is a convex combination of Dirac measures in L d ( ν ) for different d (note that Y k is a measure-valued random variable, thus its conditional probability distribution is indeed a distribution on distributions). The mixture is given by the output kernel K, more precisely by K ν .
Lemma 16. 
For an invariant HMM, Y Z and H Z are stationary. Y Z is a Markov process with transition kernel
P ( Y k + 1 Y k = ν ) = d Δ K ν ( d ) · δ L d ( ν ) P ( P ( M ) ) ν P ( M ) .
Proof. 
Stationarity is obvious. For ν 0 , ν k P ( M ) and ν : ν k we obtain
P ( Y k + 1 Y [ 0 , k ] = ν [ 0 , k ] ) = ( lem .   15 a ) P ( L X k + 1 ( · ) ( ν ) | Y [ 0 , k ] = ν [ 0 , k ] ) = d Δ P ( { X k + 1 = d } | Y [ 0 , k ] = ν [ 0 , k ] ) · δ L d ( ν ) .
Because σ ( Y [ 0 , k ] ) is nested between σ ( Y k ) and σ ( X ] , k ] ) , Lemma 15 b) implies that P ( { X k + 1 = d } | Y [ 0 , k ] = ν [ 0 , k ] ) = K ν k = K ν and hence the claim.

Partial Determinism

If the transition T of an HMM is deterministic, i.e., if the internal state determines the next state and output (and thus the whole future) uniquely, the HMM is called (completely) deterministic. In a deterministic HMM, all randomness is due to the initial distribution. This is a very strong property, and a weaker partial determinism property is useful. In a partially deterministic HMM, the output symbol is determined randomly, but the new internal state is a function f ( m , d ) of the last internal state m and the new output symbol d. If the internal space M is finite, such HMMs are stochastic versions of deterministic finite state automata (DFAs), an important concept of theoretical computer science (see [7]). The function f directly corresponds to the transition function of the DFA, but the start state is replaced by the initial distribution and the HMM assigns probabilities to the outputs via the output kernel K. A difference in interpretation is that the symbols from Δ are considered input of the DFA and output of HMMs. To emphasise their close connection to DFAs, partially deterministic HMMs are often called deterministic stochastic automata, although they are not completely deterministic.
Definition 17. 
An HMM ( T , μ ) is called partially deterministic if there is a measurable function f : M × Δ M , called transition function, such that T ( m ) = K m δ f m ( · ) for all m M , i.e.,
T ( m ; D × B ) = K m ( D f m 1 ( B ) ) m M , D Δ B B ( M )
where f m ( d ) : = f ^ d ( m ) : = f ( m , d ) and B ( M ) is the Borel σ -algebra on M.
Remark. 
For partially deterministic HMMs we obtain
L d ( ν ) ( B ) = 1 K ν ( d ) f ^ d 1 ( B ) K ^ d d ν and L d ( δ m ) = δ f m ( d )
The second equation implies that W k = f W k 1 ( X k ) a.s., justifying the name transition function for f.
The following proposition is crucial for understanding partially deterministic representations. It states that, given the past output, the uncertainty H k = H ( Y k ) about the internal state is constant over time and the next output symbol is independent of the internal state. The proof is along the following lines. If we know the internal state at one point in time, we can maintain knowledge of the internal state due to partial determinism. More generally, the uncertainty H k of the internal state cannot decrease on average and thus is a supermartingale. But because it is also stationary, the trajectories have to be constant. If two possible internal states would lead to different probabilities for the next output symbol, we could increase our knowledge of the internal state by observing the next output. But because of partial determinism, this would also decrease the uncertainty of the following internal state, in contradiction to the constant trajectories of H Z .
Proposition 18. 
Let ( T , μ ) be a partially deterministic, invariant HMM with H ( μ ) < . Then H Z has a.s. constant trajectories, i.e., H k = H 0 a.s. Furthermore, the restriction K supp ( Y 0 ) of the output kernel K to the support supp ( Y 0 ) M of the random measure Y 0 is a.s. a constant kernel, i.e.,
K m = K m ^ m , m ^ supp ( Y 0 ( ω ) ) a.s.
Proof. 
We show that H Z is a supermartingale to use the following well-known property.
Lemma. 
Every stationary supermartingale has a.s. constant trajectories.
Because H ( μ ) < , we may assume w.l.o.g. that M is countable. Note that φ ( x ) = x log ( x ) satisfies φ ( x i ) φ ( x i ) . We obtain
H ( L d ( ν ) ) = ( 6 ) m ^ M φ m f ^ d 1 ( m ^ ) ν ( m ) K m ( d ) K ν ( d ) m f ^ d 1 ( M ) = M φ ν ( m ) K m ( d ) K ν ( d ) .
We use the filtration F k : = σ ( Y ] , k ] ) . Markovianity of Y Z yields E ( H k + 1 F k ) = E ( H k + 1 Y k ) .
E ( H k + 1 | Y k = ν ) = ( lem . 16 ) d Δ K ν ( d ) · H ( L d ( ν ) ) d , m ν ( m ) K m ( d ) · log ν ( m ) K m ( d ) K ν ( d ) = H P ( W k X k + 1 , Y k = ν ) H P ( W k Y k = ν ) = H ( ν )
where the secong equality holds because P ( { W k = m , X k + 1 = d } | Y k = ν ) = ν ( m ) K m ( d ) and P ( { X k + 1 = d } | Y k = ν ) = K ν ( d ) , This H Z is a supermartingale w.r.t. ( F k ) k Z and has a.s. constant trajectories. In particular, inequality (8) is actually an equality. Because H ( μ ) < and μ = Y k d P , the entropy of Y k ( ω ) is a.s. finite. Thus, H P ( W k X k + 1 , Y k = ν ) = H P ( W k Y k = ν ) implies that Wk and XK+1 are independent given Y k = ν , i.e. K supp ( ν ) is constant.  □
Note that the finite-entropy assumption is indeed necessary for the second statement of Proposition 18. For example, the shift defines a deterministic HMM that does not (in general) satisfy (7).
Example 19. (shift HMM) 
The shift HMM is defined as follows. The internal state consists of the whole trajectory, M : = Δ Z . T = T ς outputs the symbol at position one and shifts the sequence to the left. More formally with m = ( m k ) k Z M and ς ( m ) = ( m k + 1 ) k Z we have
T ς ( m ) = δ m 1 δ ς ( m ) = δ ( m 1 , ς ( m ) )
If P P inv ( Δ Z ) , it is obvious that ( T ς , P ) is an invariant, deterministic (in particular partially deterministic) HMM of P. Here, P is the law of both X Z and W 0 ; in fact even X Z = W 0 . We claim that, generically, ( T ς , P ) does not satisfy (7) (and of course the internal state entropy H ( P ) is infinite). Indeed, K m = δ m 1 and thus K m = K m ^ implies m 1 = m ^ 1 . Because Y 0 ( ω ) = P ( X Z X N 0 ) ( ω ) , (7) implies that X N 0 determines X 1 uniquely, which is generically not true. The analogously defined one-sided shift on M = Δ N also does not satisfy (7). Note that, because future trajectories are equivalent to internal states, the associated process Y Z is essentially the prediction process in the sense that Y k = Z k X Z .
Proposition 18 tells us that the next output symbol of a partially deterministic HMM is conditionally independent of the internal state, given the past output. But even more is true. The whole future output is conditionally independent of the internal state. Thus, if we know the past, the internal state provides no additional information useful for the prediction of the future output.
Corollary 20. 
Let ( T , μ ) be partially deterministic, invariant, and H ( μ ) < . Then
P ( X N W 0 = m ) = P ( X N W 0 = m ^ ) m , m ^ supp ( Y 0 ) a.s.
Proof. 
According to Proposition 18, P ( X 1 W 0 = · ) = K is constant on supp ( Y 0 ) . To obtain the statement for X [ 1 , n ] , we consider the n-tuple HMM defined as follows. The output space is Δn, the internal space is M, whereas the output and internal processes X ^ Z and W ^ Z are given by X ^ k = X [ ( k 1 ) n + 1 , k n ] and W ^ k = W n k . This is achieved by the HMM ( T ^ , μ ) with T ^ : M P ( Δ n × M ) , T ^ ( m ) = P ( X [ 1 , n ] , W n W 0 = m ) . The HMM is obviously partially deterministic with transition function f d n f d 1 and invariant. Thus Proposition 18 implies that P ( X [ 1 , n ] W 0 = · ) = P ( X ^ 1 W ^ 0 = · ) is constant on supp ( Y ^ 0 ) . Because we can couple the processes such that Y ^ 0 = Y 0 , the claim follows.  □

3.3. Representations on Prediction Space

We can interpret any probability measure μ on prediction space P ( Δ N ) as initial distribution of an HMM. The “internal state update” of the corresponding transition T C follows the same rule as the prediction dynamic S, described by the conditional probability given the last observation. The difference is that now we include output symbols from Δ. We want to construct the HMM in such a way that if it is started in the internal state z P ( Δ N , its output process is distributed according to z (which is also a measure on the future). Thus, the distribution of the next output d has to be equal to the marginal of z. The next internal state has to be the conditional z-probability of the future given ζ 1 = d . Recall that ϕ z ( ω ) = ( ς 1 ( · ) | ζ 1 ) ( ω ) .
Definition 21. 
We define the Markov kernel T C from P ( Δ N ) to Δ × P ( Δ N ) by
T C ( z ; D × B ) : = z ( { ξ 1 D , ϕ z B } ) , z P ( Δ N ) , D Δ B B ( P ( Δ N ) ) .
Note that T C ( z ; Δ × B ) = S ( z ; B ) , i.e., marginalising T C ( z ) to the internal component yields the prediction dynamic. Thus, if μ = μ C ( P ) is the causal state distribution (Definition 7) of some process P P inv ( Δ Z ) , then the internal state process of the induced HMM ( T C , μ ) coincides with the prediction process Z Z of P. From the following lemma we conclude that the output process X Z is, as expected, distributed according to P. More generally, if μ P ( P ( Δ N ) ) represents a process z P ( Δ N ) in the sense of integral representation theory as a mixture of other processes, it also induces an HMM of z, namely ( T C , μ ) . Recall that r is the resolvent, defined in (3), and associates the represented process to μ .
Lemma 22. 
Let μ P ( P ( Δ N ) ) . Then ( T C , μ ) is a partially deterministic HMM of r ( μ ) . In particular, ( T C , μ C ( P ) ) is an invariant HMM of P P inv ( Δ Z ) .
Proof. 
Partial determinism follows directly from the definition of T C . We have K z = z ξ 1 1 and the transition function f is given by f z ξ 1 : = ϕ z . It is well defined due to the σ ( ξ 1 ) -measurability of ϕ z and obviously T C ( z ; D × B ) = K z ( D f z 1 ( B ) ) . We assume w.l.o.g. that μ is a Dirac measure (the general claim follows by integration over μ ). Thus let μ = δ z with z = r ( μ ) . Recall that, according to Lemma 13, ( T C , T d C ( δ z ) ) is an HMM of the conditional probability of ξ [ 2 , [ given that ξ 1 = d (w.r.t. the output process of ( T C , δ z ) ). Using T C ( z ; { d } × P ( Δ N ) ) = z ( { ξ 1 = d } ) and
r ( T d C ( δ z ) ) = ( 6 ) r ( δ f z ( d ) ) = f z ( d ) = z ( ς 1 ( · ) ξ 1 = d ) ,
the claim follows by induction.  □
Remark. ( ε -machine). 
( T C , μ C ( P ) ) corresponds to the so-called ε -machine of computational mechanics. It is in some sense a minimal predictive model but in general not the minimal HMM of P (see [17]).
Given a process P P inv ( Δ Z ) , there are (usually) many invariant representations on prediction space (i.e., S-invariant ν P ( P ( Δ N ) ) with r ( ν ) = P N ). The next proposition shows that the causal state distribution of P is distinguished among them as the only one that can have finite entropy.
Proposition 23. 
Let ν P ( P ( Δ N ) ) be S-invariant, and P P inv ( Δ Z ) the measure it represents. If ν μ C ( P ) , then H ( ν ) = .
Proof. 
Let H ( ν ) < . According to Lemma 22, ( T C , ν ) is an invariant HMM of P and satisfies the conditions of Corollary 20. Let W Z be the corresponding M = P ( Δ N ) -valued internal process. For a.e. fixed ω , Lemma 13 tells us that ( T C , δ W 0 ( ω ) ) is an HMM of P ( X N W 0 ) ( ω ) , but it is also an HMM of r ( δ W 0 ( ω ) ) = W 0 ( ω ) due to Lemma 22. Thus, P ( X N W 0 ) = W 0 and
z = P ( X N W 0 = z ) = ( cor . 20 ) P ( X N W 0 = z ^ ) = z ^ z , z ^ supp ( Y 0 ( ω ) )
This means supp ( Y 0 ) = 1 , i.e. Y 0 ( ω ) is a Dirac measure. Thus Y 0 = P ( W 0 X N 0 ) = δ W 0 a.s. and
Z 0 X Z = P ( X N X N 0 ) = P ( X N W 0 = · ) d Y 0 = P ( X N W 0 ) = W 0 a.s.
Because W 0 is ν -distributed and μ C ( P ) is the law of Z 0 , we obtain ν = μ C ( P ) .
We conclude this section with two examples of representations on prediction space. They are extreme cases. The first one, ν 1 , is maximally concentrated, namely ν 1 is the Dirac measure in (the future part of) the process we want to represent. Thus it has no uncertainty in itself, but the (unique) process in its support can be arbitrary. The second example, ν 2 , is supported by maximally concentrated processes, i.e. by Dirac measures on ( Δ N ) , but the mixture ν 2 is as diverse as the original process. The HMM corresponding to ν 2 is equivalent to the one-sided shift (Example 19).
Example 24. 
Let P P inv ( Δ Z ) , P N = P X N 1 and ν = δ P N . Then ν is a representation of P N with H ( ν ) = 0 . This is no contradiction to Proposition 23 because ν is not S-invariant (if P is not i.i.d.)
Example 25. (lifted shift). 
Let P P inv ( Δ Z ) and ν = P N ι 1 , where ι : Δ N P ( Δ N ) , ι ( x ) = δ x is the embedding as Dirac measures. ν is an S-invariant representations of P and ( T C , ν ) is equivalent to the one-sided shift. The only difference is that trajectories x Δ N are replaced by the corresponding Dirac measures δ x P ( Δ N ) . In other words, ι is an isomorphism. This is no contradiction to Proposition 23 because H ( ν ) = (if P is not concentrated on countably many trajectories).

4. Properties of the Statistical Complexity Functional

Recall that the statistical complexity C C ( P ) (Definition 10) of a process P P inv ( Δ Z ) is defined as the entropy H ( μ C ( P ) ) of its causal state distribution. In this section, we investigate C C as a functional on the space of processes. First, we consider the problem of ergodic decomposition. With ergodic decomposition of P, we denote a probability measure ν on the ergodic measures P e ( Δ Z ) P inv ( Δ Z ) that satisfies
P = r ( ν ) = P e ( Δ Z ) id P e ( Z ) d ν
Such a measure ν always exists and is uniquely determined by P. In [9,10], Łukasz Dębowski investigated another complexity measure, excess entropy, and gave a formula for its ergodic decomposition. Here, we obtain the corresponding result for statistical complexity. It is the average complexity of the ergodic components plus the entropy of the mixture.
Proposition 26. 
(ergodic decomposition). Let ν P ( P e ( Δ Z ) ) be the ergodic decomposition of P P inv ( Δ Z ) . Then
C C ( P ) = C C d ν + H ( ν )
Proof. 
First note that μ C ( P 1 ) and μ C ( P 2 ) are singular for distinct ergodic P 1 , P 2 P e ( Δ Z ) . Indeed, there exist disjoint A 1 , A 2 σ ( ξ N 0 ) and disjoint B 1 , B 2 σ ( ξ N ) s.t. P k ( A k ) = 1 and P k ( B k ξ N 0 ) A k 1 . Consequently, if ν is not supported by a countable set, μ C ( P ) cannot be supported by a countable set and C C ( P ) = H ( ν ) = . Thus assume ν = k N ν k δ P k for some ν k 0 and distinct P k P e ( Δ Z ) . Then there are disjoint A k σ ( ξ N 0 ) s.t. P k ( A k ) = 1 . We claim
P ( · ξ N 0 ) = k N 1 A k P k ( · ξ N 0 ) P -a.s.
Indeed, the σ ( ξ N 0 ) -measurability is clear, and for A σ ( ξ N 0 ) , F B ( Δ Z ) we have
A k N 1 A k P k ( F ξ N 0 ) d P = j N ν j A k N 1 A k P k ( F ξ N 0 ) d P j = ( P j ( A j ) = 1 ) j ν j A A j P j ( F ξ N 0 ) d P j = j ν j P j ( F A A j ) = P ( F A )
As P ( A k ) = ν k , it follows that μ C ( P ) = k ν k μ C ( P k ) . Mutual singularity of the μ C ( P k ) implies
C C ( P ) = H k ν k μ C ( P k ) = k ν k H ( μ C ( P k ) ) + H ( ν )
Several corollaries follow directly from this proposition. The set P C : = C C 1 ( R ) of stationary processes with finite statistical complexity is convex, C C is concave but not continuous, and the set P : = P inv ( Δ Z ) P C of processes with infinite statistical complexity is dense.
Corollary 27 
(concavity) P C is a convex set and C C is concave. Moreover, for all ν P ( Δ N ) , ν k : = ν ( k ) and P k P inv ( Z )
k N ν k C C ( P k ) C C k N ν k P k k N ν k C C ( P k ) + H ( ν )
Proof. 
Use ergodic decomposition of the P k and Proposition 26.  □
Corollary 28. 
(non-continuity). C C P C is not continuous in any P P C w.r.t. variational topology, let alone w.r.t. weak-* topology.
Proof. 
Let Q n P C with lim n 1 n C C ( Q n ) and P n : = n 1 n P + 1 n Q n . Then P n P in variational topology, but C C ( P n ) 1 n C C ( Q n ) by Corollary 27.
Corollary 29. 
P is dense in P inv ( Δ Z ) w.r.t. variational- and a fortiori w.r.t. weak-*-topology.
Proof. 
P , Q P inv ( Δ Z ) with C C ( Q ) = . Then P n 1 n P + 1 n Q P  □
We give a simple example of a situation where statistical complexity is not continuous.
Example 30. 
(non-continuity). Let Q p be the Bernoulli process on Δ = { 0 , 1 } with parameter 0 < p < 1 , i.e. Q p ( ξ 1 = 1 ) = p . Consider the process of throwing a coin which is either slightly biased to 0 or 1, each with probability 1 2 , i.e. P ε = 1 2 Q 1 2 + ε + 1 2 Q 1 2 ε with 0 < ε < 1 2 . Then P ε * P 0 = Q 1 2 for ε 0 , but C C ( P ε ) = log ( 2 ) for ε > 0 and C C ( P 0 ) = 0 .
The proof of our most important result about statistical complexity, namely its lower semi-continuity, makes use of the propositions given in Section 2.1. and Section 3.. It also uses a compactness argument. To this end we need, in the case of infinite Δ, a lemma guaranteeing that μ C preserves relative compactness.
Lemma 31. 
Let M P inv ( Z ) be relatively compact. Then μ C ( M ) : = μ C ( P ) | P M is relatively compact in P ( P ( Δ N ) ) .
Proof. 
Using Prokhorov’s theorem, we have to show that μ C ( M ) is tight provided that M is tight. Let ε > 0 and K n Δ Z compact with P ( K n ) 1 ε 2 n n for all P M . We define K n : = ξ N ( K n ) , K ˜ : = z P ( Δ N ) | z ( K n ) 1 1 n n N and f n : = P ( { ξ N K n } | ξ N 0 ) . For P M
f n d P P ( K n ξ N 0 ) d P = P ( K n ) 1 ε 2 n n .
We obtain P ( n { f n < 1 1 n } ) n n ( 1 f n d P ) ε 2 n = ε and consequently μ C ( P ) ( K ˜ ) = P ( { Z 0 K ˜ } ) = P ( n { f n 1 1 n } ) 1 ε for all P M . We still have to show compactness of K ˜ . It is closed because z k * z implies z ( K n ) lim sup k z k ( K n ) due to closedness of K n . It is tight by definition because the K n are compact. Therefore, K ˜ is compact.  □
Theorem 32. 
(lower semi-continuity). The statistical complexity functional, C C : P inv ( Δ Z ) R ¯ + , is weak-* lower semi-continuous.
Proof. 
Let P n , P P inv ( Δ Z ) with P n * P . Every subsequence of ( μ C ( P n ) ) n N has an accumulation point (a.p.), according to Lemma 31. Consequently,
lim inf n C C ( P n ) = lim inf n H ( μ C ( P n ) ) ( H lsc) inf H ( ν ) | ν a.p. of ( μ C ( P n ) ) n N .
Every μ C ( P n ) is S-invariant. According to Proposition 5, S is continuous and thus every a.p. ν of ( μ C ( P n ) ) n N is also S-invariant. The resolvent r : P ( P ( Δ N ) ) P ( Δ N ) is continuous (see [15]), and thus ν represents P. Therefore, according to Proposition 23, H ( ν ) C C ( P ) . In total we obtain
lim inf n C C ( P n ) C C ( P ) .
We argue that, from a theoretical point of view, every complexity measure should be lower semi-continuous. While it is not counter-intuitive that it is possible to approximate a simple system by unnecessarily complex ones (and hence the complexity is not continuous), it would be strange to consider a process complex if there is an approximating sequence with (uniformly) simple processes. Therefore, an axiomatic characterisation of complexity measures (although, of course, we are far from having such a characterisation) should include lower semi-continuity. There are also slightly more practical reasons why semi-continuity is a nice property.
In a model selection task, for instance, it might be desirable to impose some upper bound a R + on the complexity of considered processes (e.g. to avoid overfitting). An important consequence of lower semi-continuity is that the set C C 1 ( [ 0 , a ] ) = P P inv ( Δ Z ) | C C ( P ) a of processes with complexity bounded by a is closed. This makes the complexity constraint technically easier. Consider any complete metric on P inv ( Δ Z ) compatible with weak-* (or any stronger) topology (e.g. Prokhorov, Kantorovich-Rubinshtein or variational metric). Then due to the closedness, for every P P inv ( Δ Z ) with arbitrary complexity, there is a (not necessarily unique) closest “sufficiently simple” process P a with complexity not exceeding a. Another consequence is that the set of processes with infinite complexity is generic in the following sense.
Corollary 33. 
P contains a dense G δ -set.
Proof. 
Because all C C 1 ( [ 0 , n ] ) are closed, P is a G δ -set. It is dense according to Corollary 29.
Example 34. 
Consider the experiment of first choosing a random coin with success probability p uniformly in [ 0 , 1 ] and then generating an i.i.d. sequence with this coin. More precisely, let Q p be the Bernoulli process with parameter p on Δ = { 0 , 1 } and P = Q p d p . Then P has infinite statistical complexity according to Proposition 26. We might approximate P by P n * P (e.g. with ergodic P n ). Then Theorem 32 implies that the complexity of P n necessarily tends to infinity.
Example 35. 
Let Δ be finite, then P inv ( Δ Z ) is compact. Assume we made observations of a Δ-valued process and want to fit some P P inv ( Δ Z ) . From the observations, we might derive a set of closed constraints, e.g., P ( { ξ 1 = ξ 2 } ) [ a , b ] , P ( { ξ 1 = d } ) ε , and P ( { ξ 2 = d } | ξ 1 = d ) [ a , b ] (the third is closed only in presence of the second). Further closed constraints may be given by modelling assumptions. Because the resulting set of admissible processes is compact, lower semi-continuity implies that there is at least one process of minimal complexity satisfying all constraints.

Appendix

Proof of Lemma 11 (lower semi-continuity of the entropy). Recall that φ ( x ) : = x log ( x ) and denote the boundary of a set B by B . Define
H ^ ( μ ) : = sup { i = 1 n φ ( μ ( B i ) ) | n N , B i disjoint , μ ( B i ) = 0 } .
Obviously, H ^ H . Recall that μ n * μ implies μ n ( A ) μ ( A ) for all A with μ ( A ) = 0 (e.g., [18]). Thus H ^ is clearly lower semi-continuous and it is sufficient to show
H ( μ ) H ^ ( μ ) .
If μ is not supported by any countable set, H ^ ( μ ) = due to separability of M. Let μ = i = 1 a i δ x i ( a i [ 0 , 1 ] , x i M ) , and d a compatible metric on M. For fixed n N , we can choose a radius r n > 0 , such that B i n : = { x M d ( x i , x ) < r n } , i = 1 , , n , are disjoint and μ ( B i n ) = 0 . We get
i = 1 n φ ( a i ) ( φ 1 ) i = 1 n φ ( μ ( B i n ) ) + i = 1 n ( μ ( B i n ) a i ) H ^ ( μ ) + i = n + 1 a i .
Therefore, H ( μ ) = lim n i = 1 n φ ( a i ) H ^ ( μ ) .
Proof of Lemma 13. We first prove that ( T , δ W 0 ) is an HMM of P ( X N W 0 ) . Let G T ( m ) P ( N ) be the distribution of the output process of ( T , δ m ) . Because G T is measurable, G T W 0 is σ ( W 0 ) -measurable. From the definition of ( W N 0 , X N ) it follows for measurable B M , A Δ N that
P ( { W 0 B } { X N A } ) = B G T ( · ; A ) d μ = W 0 1 ( B ) G T ( W 0 ( · ) ; A ) d P ,
where the second equality holds because W 0 is distributed according to μ . Thus G T W 0 is the claimed conditional probability. To see that ( T , L X 1 ( μ ) ) is an HMM of P ( X [ 2 , [ X 1 ) , let d Δ that
G T ( · ; A ) d L d ( μ ) = 1 K μ ( d ) { d } × M G T ( · ; A ) d T d μ = P ( { X 1 = d , X [ 2 , [ A } ) P ( { X 1 = d } ) .

Acknowledgements

I am thankful to Nihat Ay for introducing me to computational mechanics, discussions, and all kinds of scientific support. I also thank the anonymous reviewers for their many helpful comments.

References

  1. Olbrich, E.; Bertschinger, N.; Ay, N.; Jost, J. How Should Complexity Scale with System Size? Eur. Phys. J. B 2008, 63, 407–415. [Google Scholar]
  2. Jänicke, H.; Wiebel, A.; Scheuermann, G.; Kollmann, W. Multifield Visualization Using Local Statistical Complexity. IEEE Trans. Visual. Comput. Gr. 2007, 13, 1384–1391. [Google Scholar]
  3. Crutchfield, J.P.; Young, K. Inferring Statistical Complexity. Phys. Rev. Let. 1989, 63, 105–108. [Google Scholar]
  4. Shalizi, C.R.; Crutchfield, J.P. Computational Mechanics: Pattern and Prediction, Structure and Simplicity. J. Statist. Phys. 2001, 104, 817–879. [Google Scholar]
  5. Ay, N.; Crutchfield, J.P. Reductions of Hidden Information Sources. J. Statist. Phys. 2005, 120, 659–684. [Google Scholar]
  6. Clarke, R.W.; Freeman, M.P.; Watkins, N.W. Application of Computational Mechanics to the Analysis of Natural Data: An Example in Geomagnetism. Phys. Rev. E 2003, 67, 016203.1–016203.15. [Google Scholar]
  7. Hopcroft, J.; Ullman, J. Introduction to Automata Theory, Language, and Computation; Addison-Wesely: Reading, Massachusetts, USA, 1979. [Google Scholar]
  8. Keller, G. Equilibrium States in Ergodic Theory; London Mathematical Society: New York, USA, 1998. [Google Scholar]
  9. Dębowski, Ł. Ergodic Decomposition of Excess Entropy and Conditional Mutual Information. IPI PAN Reports, nr 993. 2006. [Google Scholar]
  10. Dębowski, Ł. A General Definition of Conditional Information and Its Application to Ergodic Decomposition. Stat. Probab. Lett. 2009, 79, 1260–1268. [Google Scholar]
  11. Knight, F. A Predictive View of Continuous Time Processes. The Annals of Probability 1975, 573–596. [Google Scholar]
  12. Knight, F. Foundations of the Prediction Process; Oxford Science Publications: New York, USA, 1992. [Google Scholar]
  13. Meyer, P. La théorie de la prédiction de F. Knight. Seminaire de Probabilités 1976, X, 86–103. [Google Scholar]
  14. Knight, F. Essays on the Prediction Process; Vol. 1, Lecture Notes Series; Institute of Mathematical Statistics: Hayward, CA, USA, 1981. [Google Scholar]
  15. Choquet, G. Lectures on Analysis, Volume II (Representation Theory); W. A. Benjamin, Inc.: New York, USA, 1969. [Google Scholar]
  16. Parthasarathy. On the Category of Ergodic Measures. Illinois J. Math. 1961, 5, 648–656. [Google Scholar]
  17. Löhr, W.; Ay, N. On the Generative Nature of Prediction. Adv. Complex. Syst. 2009, 12, 169–194. [Google Scholar]
  18. Billingsley, P. Convergence of Probability Measures, 2nd Ed. ed; Wiley: New York, USA, 1968. [Google Scholar]
Back to TopTop