Statistical complexity is a measure of complexity of discrete-time stationary stochastic processes, which has many applications. We investigate its more abstract properties as a non-linear function on the space of processes and show its close relation to Knight’s prediction process. We prove lower semi-continuity, concavity, and a formula for the ergodic decomposition of statistical complexity. Along the way, we show that the discrete version of the prediction process has a continuous Markov transition. We also prove that, given the past output of a partially deterministic hidden Markov model (HMM), the uncertainty of the internal state is constant over time and knowledge of the internal state gives no additional information on the future output. Using this fact, we show that the causal state distribution is the unique stationary representation on prediction space that may have finite entropy.
An important task of complex systems sciences is to define “complexity”. Measures that quantify complexity are of both theoretical (e.g., ) and practical interest. In applications, they are widely used to identify “interesting” parts of simulations and real-world data (e.g., ). There exist various measures of different kinds of complexity. In particular, statistical complexity constitutes a complexity measure for stationary stochastic processes in doubly infinite discrete time and discrete state space. It was introduced by Jim Crutchfield and co-workers within a theory called computational mechanics [3,4,5]. Note that here “computational mechanics” is unrelated to computer simulations of mechanical systems. Statistical complexity has been applied to a variety of real-world data, e.g., in . An important, closely related concept of computational mechanics is the so-called ε-machine. It is a particular partially deterministic HMM that encodes the mechanisms of prediction. Partially deterministic HMMs are often called deterministic stochastic automata to emphasise their close connection to a key concept of theoretical computer science, namely deterministic finite state automata .
In this paper, we look at more abstract features of statistical complexity as well as partially deterministic HMMs. We consider statistical complexity to be a non-linear functional from the space of Δ-valued stationary processes (Δ countable) to the set of non-negative extended real numbers. Here, we identify stationary processes with their law, i.e., with shift-invariant probability measures on the sequence space , and equip the space of measures with the usual weak-* topology (often called “weak topology”). Because Δ is discrete, this topology is equal to the topology of finite-dimensional convergence. In ergodic theory, Kolmogorov-Sinai entropy is studied as a function of the (invariant) measure, and the questions of continuity properties, affinity, and behaviour under ergodic decomposition arise naturally (e.g., ). We believe that these questions are worth considering also for complexity measures. A formula for the ergodic decomposition of excess entropy, which is another complexity measure for stochastic processes, was obtained in [9,10]. Our results presented here include the corresponding formula for statistical complexity, and this formula directly implies concavity. The most important result is lower semi-continuity of statistical complexity. We consider this a desirable property for a complexity measure, as it means that a process cannot be complex if it can be approximated by non-complex ones.
In Section 2., we define statistical complexity and show its relations to a discrete version of Frank Knight’s prediction process [11,12]. The prediction process is the measure-valued process of conditional probabilities of the future given the past. It takes values in the space of probability measures on Δ^ℕ, called prediction space. In our formulation, statistical complexity is the marginal entropy of the prediction process. This is equivalent to the classical definition as entropy of a certain partition of the past. We only replace equivalence classes with the respective induced probabilities on the future. In this section, we also show that the discrete (and thus technically vastly simplified) version of the prediction process has a continuous Markov transition kernel (Proposition 5).
In Section 3., we investigate properties of partially deterministic HMMs. Here, we use a general notion of HMM (sometimes called edge-emitting HMM), where new internal state and output symbol are jointly determined and may have dependencies conditioned on the last internal state. Partial determinism means that this dependence is extreme in the sense that the last internal state and the output together uniquely determine the following internal state. We show that, if one knows the past output trajectory, the remaining uncertainty (measured by entropy) of the internal state is constant over time, although it may depend on the ergodic component (Proposition 18). Furthermore, the distribution of future output is the same for any internal state that is compatible with the past output (Corollary 20). In Section 3.3., we construct a canonical Markov kernel, such that taking any measure on prediction space (i.e., is a measure on measures) as initial distribution, we obtain a partially deterministic HMM of a process . This process P coincides with the measure represented by in the sense of integral representation theory, and if is appropriately chosen, we obtain the ε-machine of computational mechanics (or something isomorphic) as a special case. Using the properties of partially deterministic HMMs, we obtain that there is no invariant representation on prediction space with finite entropy other than, possibly, the causal state distribution, which may have finite or infinite entropy (Proposition 23).
Section 4. contains our results about statistical complexity. We show that the complexity of a process is the average complexity of its ergodic components plus the entropy of the mixture (Proposition 26). As a direct consequence, statistical complexity is concave (Corollary 27) and non-continuous (even w.r.t. variational topology). But it does have a continuity property. Namely, using the results of the previous sections, we show in Theorem 32 that it is weak-* lower semi-continuous.
2. Prediction Dynamic and Statistical Complexity
For the whole article, fix a countable set Δ with at least two elements and discrete topology. We identify Δ-valued stochastic processes , defined on some probability space , with their respective laws . Here, denotes the set of probability measures. If is stationary, P is in the set of shift-invariant probability measures. Let be the canonical projections. Then is a process on () with the same distribution as . Here, denotes the Borel σ-algebra. We often decompose the time set into the “future” and the “past” , where . For simplicity of notation, we denote the canonical projections on with the same symbols, ζk, as the projections on . If not stated otherwise, product spaces are equipped with product and spaces of probability measures are equipped with weak-*-topology. We use the arrow to denote weak-* convergence.
2.1. Discrete Version of Knight’s Prediction Process
For every measurable stochastic process with time set on some Lusin space, Frank Knight defines the corresponding prediction process as a process of conditional probabilities of the future given the past. This theory originated in  and was further developed in [12,13,14]. The most important properties of the prediction process are that its paths are right continuous with left limits (cadlag), it has the strong Markov property, and it determines the original process. The continuity of the time set and the generality of the state space lead to many technical difficulties. In our simpler, discrete setting, these difficulties mostly disappear, and useful properties of the prediction process, such as having cadlag paths, become meaningless. A new aspect, however, is added by considering infinite pasts of stationary processes via the time set . The marginal distribution (unique because of stationarity) of the prediction process is an important characteristic, which is used to define statistical complexity. For this subsection, fix a stationary process with distribution .
We use the following notation concerning Markov kernels and conditional probabilities. If K is a kernel from Ω to a measurable space M, we consider K as measurable function from Ω to and write for the probability of a measurable set A w.r.t. the measure . Given random variables on Ω, we write if K is the conditional probability kernel of X given Y, i.e., .
Let be the -valued stochastic process of conditional probabilities defined by for . Then is called prediction process of . is called prediction space.
It is evident that the Markov property of the prediction process in continuous time also holds in discrete time. Nevertheless, we give a proof, because it is elementary in our discrete setting. The corresponding transition kernel works as follows. Assume the prediction process is in state . The transition kernel maps z to a measure on measures, namely . Note that z is a state of the prediction process but at the same time a probability measure. Thus it makes sense to consider the conditional probability given w.r.t. the measure z. It is intuitively plausible that the next state will be one of those conditional probabilities with d distributed according to the marginal of z. The resulting measure has to be shifted by one as time proceeds. With , we denote the left shift.
For , let , . The prediction process is a stationary Markov process. The kernel with , i.e.
satisfies a.s. In other words, S is the transition kernel of the prediction process.
Stationarity is obvious from stationarity of . We obtain a.s.
In particular, is -measurable (modulo P) and together with we obtain
as claimed. We still have to verify the Markov property. But because the -algebra induced by is nested between those induced by and , i.e. , we obtain the Markov property from the first equality in (1). □
We call the Markov transition S of the prediction process prediction dynamic.
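For intuition (an illustration of ours, not taken from the paper), consider a first-order Markov chain: the conditional law of the future given the past depends only on the last symbol, so the prediction process effectively moves between finitely many states, one per symbol, with the chain’s own transition probabilities. A minimal sketch with an illustrative transition matrix Q:

```python
# Illustrative binary Markov chain; Q[i][d] = P(next symbol = d | last symbol = i).
Q = [[0.9, 0.1],
     [0.2, 0.8]]

def prediction_dynamic(i):
    """From the prediction state 'law of the future given last symbol i',
    the prediction dynamic jumps to the state 'given last symbol d'
    with probability Q[i][d]."""
    return {d: Q[i][d] for d in range(len(Q))}

# From the state associated with last symbol 0, the next prediction state
# is the one for symbol 0 with probability 0.9 and for symbol 1 with 0.1.
assert prediction_dynamic(0) == {0: 0.9, 1: 0.1}
```

In general, of course, prediction space is infinite-dimensional; this finite reduction is special to Markov chains.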
Note that although the prediction process obviously depends on P, the prediction space and the prediction dynamic S do not. In the case of general Lusin state space, it is non-trivial to prove the existence of regular versions of conditional probability such that is jointly measurable in (see ). For countable Δ, however, we can obtain essential continuity in an elementary way. This enables us to prove continuity of the prediction dynamic.
Let and . There is a clopen (i.e., closed and open) set with such that , uniformly on compact subsets of .
Let and . Because Δ is discrete and countable, Ωz is clopen with z(Ωz) = 1. Uniform convergence on compacta is equivalent to whenever ωn → ω in Ωz. For sufficiently large n, ζ1 (ωn) = ζ1 (ω) and because ς−1 maps cylinder sets to cylinder sets, .
The prediction dynamic S is continuous.
Let with and as in Lemma 4. We have to show
for any bounded continuous g. According to Prokhorov’s theorem, the sequence is uniformly tight and we can restrict the integrations to compact subsets. Because , we can restrict to compact subsets of . There, the convergence of is uniform, thus (2) holds. □
2.2. Statistical Complexity
In integral representation theory, a measure represents the measure if
where is called resolvent or barycentre map (see ) and is the identity map. Here, measure valued integrals are Gel’fand integrals. That is, for some kernel K means for all continuous, real-valued f or, equivalently, for all measurable sets B. means that z is a mixture (convex combination) of other processes, and the mixture is described by . A trivial representation for z is given by , the Dirac measure in z. The measure is called S-invariant if , where . In other words, it is S-invariant if the iteration with the prediction dynamic S does not change it. We see in the following lemma that general iteration with S shifts the represented measure, i.e., represents .
In particular, S-invariant represent stationary processes.
Because , it is sufficient to consider Dirac measures , (the general claim follows by integration over ). For Dirac measures we have
If is S-invariant, we also say that represents the stationary extension of to . The marginal of the prediction process is an important such representation, which we call causal state distribution because of its close relation to the causal states of computational mechanics.
For , the causal state distribution is the marginal distribution of the prediction process, i.e., .
The causal state distribution of P is an S-invariant representation of P.
Let . Then is S-invariant and represents P.
From Proposition 2 we know that and is stationary. Thus
Furthermore, represents P because we have
The definitions in computational mechanics are slightly different. There, one works with equivalence classes of past trajectories (called causal states) instead of probability distributions on future trajectories. Because past trajectories are identified if , the two approaches are equivalent. The advantage of working on prediction space is that it has a natural topology and the prediction processes of all Δ-valued stochastic processes are described in a unified manner on the same space with the same transition kernel.
is not continuous. Let P be a non-deterministic i.i.d. (independent, identically distributed) process. Obviously, the causal state distribution of an i.i.d. process is the Dirac measure in its restriction to positive time. According to , periodic measures are dense in the stationary measures and we find an approximating sequence of periodic measures . But the past of a periodic process determines its future. Thus its causal state distribution is supported by the set of Dirac measures on . Because the set of Dirac measures is closed in , the topological supports are disjoint from the support . Consequently, cannot converge to .
With statistical complexity, we measure complexity of a process P by the “diversity” of its expected futures, given observed pasts (i.e., of ). The Shannon entropy is used as the measure of “diversity” of a probability measure . With , it is defined as
For , the quantity is called statistical complexity of P.
Note that if the probability space is sufficiently regular (e.g., separable, metrisable), can only be finite if is supported by a countable set A. In this case
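For a measure supported on a countable set A, the entropy in (4) reduces to the familiar sum, which can be computed directly; a minimal sketch:

```python
from math import log2

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a distribution supported on a countable
    set, given as an iterable of probabilities; zero-probability terms
    contribute nothing to the sum."""
    return -sum(p * log2(p) for p in probs if p > 0)

assert shannon_entropy([1.0]) == 0.0           # Dirac measure: zero entropy
assert shannon_entropy([0.5, 0.5]) == 1.0      # fair coin: one bit
assert abs(shannon_entropy([0.25] * 4) - 2.0) < 1e-12
```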
Lower semi-continuity of this entropy functional is probably well known. We give a proof in the appendix.
Let M be a separable, metrisable space. Then the entropy is weak-* lower semi-continuous.
3. Partially Deterministic HMMs
The probability measures on prediction space induce hidden Markov models (HMMs) with an additional partial determinism property, and it turns out to be helpful to investigate such HMMs. In Section 3.1., we define HMMs and introduce the notation we need for the further discussion. In Section 3.2., we define the partial determinism property and obtain our results about the HMMs satisfying this property. In Section 3.3., we show how measures on prediction space induce partially deterministic HMMs and apply the results from Section 3.2. to prove that the causal state distribution is the only invariant representation on prediction space that can have finite entropy.
We use the term HMM in a wide sense, meaning a pair , where is an initial probability measure on some Polish space M of internal states and T is a Markov kernel from M to Δ × M. The HMM generates on a Δ-valued output process and a (coupled) M-valued internal process , such that W0 is μ-distributed and the joint process is Markovian with
We call (T, μ) an HMM of a process if its output process is distributed according to that process. If , we say that the HMM is invariant and extend the generated processes to stationary processes and . We need some further notation.
Let be an HMM, , , and .
The output kernelK: is defined by . We also use the notations and .
The internal operators are defined as follows. and
is the distribution of the next output symbol when the internal state is m, i.e. a.s. Further, is the law of .
The internal operator describes the update of knowledge of the internal state when the symbol is observed. For Dirac measures, we obtain
Be warned that Ld is not induced by a kernel in the following sense. There is no kernel such that . To see this, note that for , because Ld(ν) is normalised outside the integral as opposed to an individual normalisation of the Ld(δm) inside the integral on the right-hand side.
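The warning above can be made concrete numerically (states, symbols, and probabilities here are illustrative, not from the paper): updating a mixture, with a single normalisation outside, differs from mixing the individually normalised updates of the Dirac components.

```python
def update(belief, out_prob, d):
    """Bayes update of a belief over internal states after observing
    symbol d; note the single normalisation of the whole posterior."""
    post = {m: belief[m] * out_prob[m][d] for m in belief}
    z = sum(post.values())
    return {m: p / z for m, p in post.items()}

# Two internal states with different output laws for symbols 'a', 'b'.
out_prob = {0: {'a': 0.9, 'b': 0.1}, 1: {'a': 0.5, 'b': 0.5}}
nu = {0: 0.5, 1: 0.5}

mixed = update(nu, out_prob, 'a')   # normalise once, outside the mixture

# Updating the Dirac components individually just returns them unchanged,
# so the naive mixture of the updated components is nu itself:
naive = {m: 0.5 * update({0: 1.0}, out_prob, 'a').get(m, 0)
            + 0.5 * update({1: 1.0}, out_prob, 'a').get(m, 0)
         for m in (0, 1)}

assert abs(mixed[0] - 0.45 / 0.7) < 1e-12   # about 0.643, not 0.5
assert naive == {0: 0.5, 1: 0.5}            # the update is not linear
```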
It directly follows from the definition of by a Markov kernel that the conditional probability, given that the internal state is m, is obtained by starting the HMM in m. In other words, it is generated by the HMM . Similarly, the conditional probability given an observed symbol is obtained by starting the HMM in the updated initial distribution . We formulate these observations in the following lemma and give a formal proof in the appendix.
Let be an HMM with internal and output processes , as above. Then a.s. is an HMM of , and is an HMM of .
(processes and ) Given an invariant HMM, let be the -valued stochastic process of expectations over internal states, given by . Let be the process of entropies of the random measures , i.e., , where entropy H is defined by (4).
describes the current knowledge of the internal state, given the past. is the entropy of the value of and measures “how uncertain” the knowledge of the internal state is. It is important to bear in mind that this is different from the entropy of the random variable . To avoid confusion, we always write when referring to the entropy of a random variable X defined on a probability space with measure P.
The following lemma justifies the idea of the internal operator being an update of knowledge of the internal state. Furthermore, it enables us to condition on instead of . The conditional probability of the internal state given the past, , contains as much information about (and in fact , but we do not need that here) as the past does.
Conditional independence of and given implies that a.s. and thus
Let and for set . We obtain a.s.
The second equality follows directly from (5). The first follows because, due to the second equality, is -measurable modulo P. □
The previous lemma enables us to prove that is Markovian and compute its transition kernel. We already know that is the updated expectation of the internal state when it was previously and d is now observed. Thus it is not surprising that the conditional probability of given is a convex combination of Dirac measures in for different d (note that is a measure-valued random variable, thus its conditional probability distribution is indeed a distribution on distributions). The mixture is given by the output kernel K, more precisely by .
For an invariant HMM, and are stationary. is a Markov process with transition kernel
Stationarity is obvious. For and we obtain
Because is nested between and , Lemma 15 b) implies that and hence the claim.
If the transition T of an HMM is deterministic, i.e., if the internal state determines the next state and output (and thus the whole future) uniquely, the HMM is called (completely) deterministic. In a deterministic HMM, all randomness is due to the initial distribution. This is a very strong property, and a weaker partial determinism property is useful. In a partially deterministic HMM, the output symbol is determined randomly, but the new internal state is a function of the last internal state m and the new output symbol d. If the internal space M is finite, such HMMs are stochastic versions of deterministic finite state automata (DFAs), an important concept of theoretical computer science (see ). The function f directly corresponds to the transition function of the DFA, but the start state is replaced by the initial distribution and the HMM assigns probabilities to the outputs via the output kernel K. A difference in interpretation is that the symbols from Δ are considered input of the DFA and output of HMMs. To emphasise their close connection to DFAs, partially deterministic HMMs are often called deterministic stochastic automata, although they are not completely deterministic.
An HMM is called partially deterministic if there is a measurable function , called transition function, such that for all , i.e.,
where and is the Borel -algebra on M.
For partially deterministic HMMs we obtain
The second equation implies that a.s., justifying the name transition function for f.
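A toy sketch of a partially deterministic HMM (states, outputs, and probabilities are illustrative, not from the paper): the output is drawn at random given the current internal state, but the next state is the deterministic function f of state and output, so an observer who knows the state at some time can track it exactly from the outputs alone.

```python
import random

# Transition function f: (state, output) -> next state.
f = {('A', 0): 'A', ('A', 1): 'B', ('B', 0): 'A', ('B', 1): 'B'}
# Output kernel: probability of each output symbol given the state.
out_prob = {'A': {0: 0.7, 1: 0.3}, 'B': {0: 0.4, 1: 0.6}}

def step(state, rng):
    """Draw an output symbol at random, then move deterministically."""
    d = rng.choices(list(out_prob[state]), weights=out_prob[state].values())[0]
    return d, f[(state, d)]

rng = random.Random(0)
state, outputs = 'A', []
for _ in range(10):
    d, state = step(state, rng)
    outputs.append(d)

# Key property: knowing the initial state, the internal state can be
# reconstructed from the outputs alone via f -- no uncertainty is added.
tracked = 'A'
for d in outputs:
    tracked = f[(tracked, d)]
assert tracked == state
```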
The following proposition is crucial for understanding partially deterministic representations. It states that, given the past output, the uncertainty about the internal state is constant over time and the next output symbol is independent of the internal state. The proof is along the following lines. If we know the internal state at one point in time, we can maintain knowledge of the internal state due to partial determinism. More generally, the uncertainty of the internal state cannot decrease on average and thus is a supermartingale. But because it is also stationary, the trajectories have to be constant. If two possible internal states led to different probabilities for the next output symbol, we could increase our knowledge of the internal state by observing the next output. But because of partial determinism, this would also decrease the uncertainty of the following internal state, in contradiction to the constant trajectories of .
Let be a partially deterministic, invariant HMM with . Then has a.s. constant trajectories, i.e., a.s. Furthermore, the restriction of the output kernel K to the support of the random measure is a.s. a constant kernel, i.e.,
We show that is a supermartingale to use the following well-known property.
Every stationary supermartingale has a.s. constant trajectories.
Because , we may assume w.l.o.g. that M is countable. Note that satisfies . We obtain
We use the filtration . Markovianity of yields .
where the second equality holds because and . This is a supermartingale w.r.t. and has a.s. constant trajectories. In particular, inequality (8) is actually an equality. Because and , the entropy of is a.s. finite. Thus, implies that Wk and Xk+1 are independent given , i.e. is constant. □
Note that the finite-entropy assumption is indeed necessary for the second statement of Proposition 18. For example, the shift defines a deterministic HMM that does not (in general) satisfy (7).
Example 19. (shift HMM)
The shift HMM is defined as follows. The internal state consists of the whole trajectory, . The transition outputs the symbol at position one and shifts the sequence to the left. More formally, with and we have
If , it is obvious that is an invariant, deterministic (in particular partially deterministic) HMM of P. Here, P is the law of both and ; in fact even . We claim that, generically, does not satisfy (7) (and of course the internal state entropy is infinite). Indeed, and thus implies . Because , (7) implies that determines uniquely, which is generically not true. The analogously defined one-sided shift on also does not satisfy (7). Note that, because future trajectories are equivalent to internal states, the associated process is essentially the prediction process in the sense that .
Proposition 18 tells us that the next output symbol of a partially deterministic HMM is conditionally independent of the internal state, given the past output. But even more is true. The whole future output is conditionally independent of the internal state. Thus, if we know the past, the internal state provides no additional information useful for the prediction of the future output.
Let be partially deterministic, invariant, and . Then
According to Proposition 18, is constant on . To obtain the statement for , we consider the n-tuple HMM defined as follows. The output space is Δn, the internal space is M, whereas the output and internal processes and are given by and . This is achieved by the HMM with , . The HMM is obviously partially deterministic with transition function and invariant. Thus Proposition 18 implies that is constant on . Because we can couple the processes such that , the claim follows. □
3.3. Representations on Prediction Space
We can interpret any probability measure on prediction space as initial distribution of an HMM. The “internal state update” of the corresponding transition follows the same rule as the prediction dynamic S, described by the conditional probability given the last observation. The difference is that now we include output symbols from Δ. We want to construct the HMM in such a way that if it is started in the internal state , its output process is distributed according to z (which is also a measure on the future). Thus, the distribution of the next output d has to be equal to the marginal of z. The next internal state has to be the conditional z-probability of the future given . Recall that .
We define the Markov kernel from to by
Note that , i.e., marginalising to the internal component yields the prediction dynamic. Thus, if is the causal state distribution (Definition 7) of some process , then the internal state process of the induced HMM coincides with the prediction process of P. From the following lemma we conclude that the output process is, as expected, distributed according to P. More generally, if represents a process in the sense of integral representation theory as a mixture of other processes, it also induces an HMM of z, namely . Recall that r is the resolvent, defined in (3), and associates the represented process to .
Let . Then is a partially deterministic HMM of . In particular, is an invariant HMM of .
Partial determinism follows directly from the definition of . We have and the transition function f is given by . It is well defined due to the -measurability of and obviously . We assume w.l.o.g. that is a Dirac measure (the general claim follows by integration over ). Thus let with . Recall that, according to Lemma 13, is an HMM of the conditional probability of given that (w.r.t. the output process of ). Using and
the claim follows by induction. □
corresponds to the so-called ε-machine of computational mechanics. It is in some sense a minimal predictive model but in general not the minimal HMM of P (see ).
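As a standard illustration (ours, not taken from the paper), consider the Golden Mean process: a binary process that never emits two consecutive 1s and otherwise flips a fair coin. Its ε-machine has two causal states, and its statistical complexity is the entropy of their stationary distribution.

```python
from math import log2

# Causal states of the Golden Mean process:
#   s0: last symbol was 0 (emit 0 or 1 with probability 1/2 each),
#   s1: last symbol was 1 (forced to emit 0, returning to s0).
# The causal-state chain moves s0 -> s0 or s1 with prob. 1/2, s1 -> s0
# with prob. 1; its stationary distribution solves pi0 = pi0/2 + pi1,
# giving pi = (2/3, 1/3).
pi = (2/3, 1/3)
C = -sum(p * log2(p) for p in pi)   # statistical complexity in bits

assert abs(C - (log2(3) - 2/3)) < 1e-12   # about 0.9183 bits
```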
Given a process , there are (usually) many invariant representations on prediction space (i.e., S-invariant with ). The next proposition shows that the causal state distribution of P is distinguished among them as the only one that can have finite entropy.
Let be S-invariant, and the measure it represents. If , then .
Let . According to Lemma 22, is an invariant HMM of P and satisfies the conditions of Corollary 20. Let be the corresponding -valued internal process. For a.e. fixed , Lemma 13 tells us that is an HMM of , but it is also an HMM of due to Lemma 22. Thus, and
This means , i.e. is a Dirac measure. Thus a.s. and
Because is -distributed and is the law of , we obtain .
We conclude this section with two examples of representations on prediction space. They are extreme cases. The first one, , is maximally concentrated, namely is the Dirac measure in (the future part of) the process we want to represent. Thus it has no uncertainty in itself, but the (unique) process in its support can be arbitrary. The second example, , is supported by maximally concentrated processes, i.e. by Dirac measures on , but the mixture is as diverse as the original process. The HMM corresponding to is equivalent to the one-sided shift (Example 19).
Let , and . Then is a representation of with . This is no contradiction to Proposition 23 because is not S-invariant (if P is not i.i.d.).
Example 25. (lifted shift).
Let and , where , is the embedding as Dirac measures. is an S-invariant representation of P and is equivalent to the one-sided shift. The only difference is that trajectories are replaced by the corresponding Dirac measures . In other words, is an isomorphism. This is no contradiction to Proposition 23 because (if P is not concentrated on countably many trajectories).
4. Properties of the Statistical Complexity Functional
Recall that the statistical complexity (Definition 10) of a process is defined as the entropy of its causal state distribution. In this section, we investigate as a functional on the space of processes. First, we consider the problem of ergodic decomposition. By the ergodic decomposition of P, we mean a probability measure on the ergodic measures that satisfies
Such a measure always exists and is uniquely determined by P. In [9,10], Łukasz Dębowski investigated another complexity measure, excess entropy, and gave a formula for its ergodic decomposition. Here, we obtain the corresponding result for statistical complexity. It is the average complexity of the ergodic components plus the entropy of the mixture.
(ergodic decomposition). Let be the ergodic decomposition of . Then
First note that and are singular for distinct ergodic . Indeed, there exist disjoint and disjoint s.t. and . Consequently, if is not supported by a countable set, cannot be supported by a countable set and . Thus assume for some and distinct . Then there are disjoint s.t. . We claim
Indeed, the -measurability is clear, and for , we have
As , it follows that . Mutual singularity of the implies
Several corollaries follow directly from this proposition. The set of stationary processes with finite statistical complexity is convex, is concave but not continuous, and the set of processes with infinite statistical complexity is dense.
(concavity). is a convex set and is concave. Moreover, for all , and
Use ergodic decomposition of the and Proposition 26. □
(non-continuity). is not continuous in any w.r.t. variational topology, let alone w.r.t. weak-* topology.
Let with and . Then in variational topology, but by Corollary 27.
is dense in w.r.t. variational- and, a fortiori, w.r.t. weak-*-topology.
with . Then □
We give a simple example of a situation where statistical complexity is not continuous.
(non-continuity). Let be the Bernoulli process on with parameter , i.e. . Consider the process of throwing a coin that is slightly biased either towards 0 or towards 1, each with probability , i.e. with . Then for , but for and .
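Numerically (a sketch under the example’s assumptions), each biased coin is i.i.d. and ergodic, hence has a single causal state and zero statistical complexity; the ergodic decomposition formula of Proposition 26 then gives the complexity of the mixture as the entropy of the mixing weights alone.

```python
from math import log2

def H(probs):
    """Shannon entropy (in bits), skipping zero-probability terms."""
    return -sum(p * log2(p) for p in probs if p > 0)

weights = [0.5, 0.5]        # equal mixture of the two biased coins
C_components = [0.0, 0.0]   # each i.i.d. component: one causal state

# Proposition 26: C(P) = sum_i w_i * C(P_i) + H(w).
C_mixture = sum(w * c for w, c in zip(weights, C_components)) + H(weights)
assert C_mixture == 1.0     # one full bit, however small the bias is
```

As the bias tends to zero, the mixture converges to the fair coin, whose complexity is 0, so the jump to 1 bit exhibits the non-continuity.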
The proof of our most important result about statistical complexity, namely its lower semi-continuity, makes use of the propositions given in Section 2.1. and Section 3. It also uses a compactness argument. To this end, we need, in the case of infinite Δ, a lemma guaranteeing that preserves relative compactness.
Let be relatively compact. Then is relatively compact in .
Using Prokhorov’s theorem, we have to show that is tight provided that is tight. Let and compact with for all . We define , and . For
We obtain and consequently for all . We still have to show compactness of . It is closed because implies due to closedness of . It is tight by definition because the are compact. Therefore, is compact. □
(lower semi-continuity). The statistical complexity functional is weak-* lower semi-continuous.
Let with . Every subsequence of has an accumulation point (a.p.), according to Lemma 31. Consequently,
Every is S-invariant. According to Proposition 5, S is continuous and thus every a.p. of is also S-invariant. The resolvent is continuous (see ), and thus represents P. Therefore, according to Proposition 23, . In total we obtain
We argue that, from a theoretical point of view, every complexity measure should be lower semi-continuous. While it is not counter-intuitive that a simple system can be approximated by unnecessarily complex ones (hence complexity need not be continuous), it would be strange to consider a process complex if it can be approximated by a sequence of (uniformly) simple processes. Therefore, an axiomatic characterisation of complexity measures (although, of course, we are far from having such a characterisation) should include lower semi-continuity. There are also slightly more practical reasons why semi-continuity is a desirable property.
In a model selection task, for instance, it might be desirable to impose an upper bound on the complexity of the considered processes (e.g. to avoid overfitting). An important consequence of lower semi-continuity is that the set of processes with complexity bounded by a given value a is closed. This makes the complexity constraint technically easier to handle. Consider any complete metric compatible with the weak-* (or any stronger) topology (e.g. the Prokhorov, Kantorovich-Rubinshtein, or variational metric). Then, due to closedness, for every process with arbitrary complexity, there is a (not necessarily unique) closest “sufficiently simple” process with complexity not exceeding a. Another consequence is that the set of processes with infinite complexity is generic in the following sense.
The set of processes with infinite statistical complexity contains a dense $G_\delta$-set.
Because all sets of processes with complexity bounded by a are closed, the set of processes with infinite complexity is a $G_\delta$-set. It is dense according to Corollary 29. □
Consider the experiment of first choosing a random coin with success probability p uniformly in $[0,1]$ and then generating an i.i.d. sequence with this coin. More precisely, let $B_p$ be the Bernoulli process with parameter p on $\{0,1\}$ and $P = \int_0^1 B_p \, \mathrm{d}p$. Then P has infinite statistical complexity according to Proposition 26. We might approximate P by finite mixtures $P_n$ (e.g. with ergodic components). Then Theorem 32 implies that the complexity of $P_n$ necessarily tends to infinity.
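This divergence can be made concrete with a small sketch (not from the original; it assumes we discretize the uniform mixture into n equally weighted, pairwise distinct Bernoulli components, and reuses the ergodic decomposition formula of Proposition 26). Each i.i.d. component has complexity zero, so the complexity of the approximation is the entropy of the uniform mixing distribution, $\log_2 n$.

```python
import math

def mixture_complexity(component_complexities, weights):
    # Ergodic decomposition formula: average component complexity
    # plus the entropy (in bits) of the mixing distribution.
    avg = sum(w * c for w, c in zip(weights, component_complexities))
    h = -sum(w * math.log2(w) for w in weights if w > 0)
    return avg + h

# P_n: uniform mixture of n distinct i.i.d. Bernoulli components,
# each of complexity 0. Its complexity is exactly log2(n).
for n in [2, 16, 1024]:
    print(n, mixture_complexity([0.0] * n, [1.0 / n] * n))
```

If the n parameters are spread evenly over $[0,1]$, the mixtures converge weak-* to P while their complexity $\log_2 n$ grows without bound, matching the conclusion drawn from Theorem 32.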
Let Δ be finite; then the space of stationary processes is compact. Assume we made observations of a Δ-valued process and want to fit a stationary process model. From the observations, we might derive a set of closed constraints, e.g., , , and (the third is closed only in the presence of the second). Further closed constraints may be given by modelling assumptions. Because the resulting set of admissible processes is compact, lower semi-continuity implies that there is at least one process of minimal complexity satisfying all constraints.
Proof of Lemma 11 (lower semi-continuity of the entropy). Recall that and denote the boundary of a set B by $\partial B$. Define
Obviously, . Recall that weak convergence implies $\mu_n(A) \to \mu(A)$ for all A with $\mu(\partial A) = 0$ (portmanteau theorem). Thus is clearly lower semi-continuous and it is sufficient to show
If is not supported by any countable set, due to the separability of M. Let , and let d be a compatible metric on M. For fixed , we can choose a radius such that , , are disjoint and . We get
Proof of Lemma 13. We first prove that is an HMM of . Let be the distribution of the output process of . Because is measurable, is -measurable. From the definition of it follows for measurable that
where the second equality holds because is distributed according to . Thus is the claimed conditional probability. To see that is an HMM of , note that
I am thankful to Nihat Ay for introducing me to computational mechanics, for discussions, and for all kinds of scientific support. I also thank the anonymous reviewers for their many helpful comments.
Olbrich, E.; Bertschinger, N.; Ay, N.; Jost, J. How Should Complexity Scale with System Size? Eur. Phys. J. B 2008, 63, 407–415.
Jänicke, H.; Wiebel, A.; Scheuermann, G.; Kollmann, W. Multifield Visualization Using Local Statistical Complexity. IEEE Trans. Visual. Comput. Gr. 2007, 13, 1384–1391.
Shalizi, C.R.; Crutchfield, J.P. Computational Mechanics: Pattern and Prediction, Structure and Simplicity. J. Statist. Phys. 2001, 104, 817–879.
Ay, N.; Crutchfield, J.P. Reductions of Hidden Information Sources. J. Statist. Phys. 2005, 120, 659–684.
Clarke, R.W.; Freeman, M.P.; Watkins, N.W. Application of Computational Mechanics to the Analysis of Natural Data: An Example in Geomagnetism. Phys. Rev. E 2003, 67, 016203.1–016203.15.
Hopcroft, J.; Ullman, J. Introduction to Automata Theory, Languages, and Computation; Addison-Wesley: Reading, MA, USA, 1979.
Keller, G. Equilibrium States in Ergodic Theory; London Mathematical Society: New York, USA, 1998.
Dębowski, Ł. Ergodic Decomposition of Excess Entropy and Conditional Mutual Information. IPI PAN Reports, No. 993; 2006.
Dębowski, Ł. A General Definition of Conditional Information and Its Application to Ergodic Decomposition. Stat. Probab. Lett. 2009, 79, 1260–1268.
Knight, F. A Predictive View of Continuous Time Processes. Ann. Probab. 1975, 573–596.
Knight, F. Foundations of the Prediction Process; Oxford Science Publications: New York, USA, 1992.
Meyer, P. La théorie de la prédiction de F. Knight. Séminaire de Probabilités 1976, X, 86–103.
Knight, F. Essays on the Prediction Process; Lecture Notes Series, Vol. 1; Institute of Mathematical Statistics: Hayward, CA, USA, 1981.
Choquet, G. Lectures on Analysis, Volume II (Representation Theory); W. A. Benjamin: New York, USA, 1969.
Parthasarathy, K.R. On the Category of Ergodic Measures. Illinois J. Math. 1961, 5, 648–656.
Löhr, W.; Ay, N. On the Generative Nature of Prediction. Adv. Complex Syst. 2009, 12, 169–194.
Billingsley, P. Convergence of Probability Measures, 2nd ed.; Wiley: New York, USA, 1999.
The statements, opinions and data contained in the journal Entropy are solely
those of the individual authors and contributors and not of the publisher and the editor(s).
MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.