Article

Complexity as Causal Information Integration

1 Max Planck Institute for Mathematics in the Sciences, 04103 Leipzig, Germany
2 Faculty of Mathematics and Computer Science, University of Leipzig, PF 100920, 04009 Leipzig, Germany
3 Santa Fe Institute, Santa Fe, NM 87501, USA
* Author to whom correspondence should be addressed.
Submission received: 21 August 2020 / Revised: 25 September 2020 / Accepted: 27 September 2020 / Published: 30 September 2020
(This article belongs to the Special Issue Entropy: The Scientific Tool of the 21st Century)

Abstract

Complexity measures in the context of the Integrated Information Theory of consciousness try to quantify the strength of the causal connections between different neurons. This is done by minimizing the KL-divergence between a full system and one without causal cross-connections. Various measures have been proposed and compared in this setting. We will discuss a class of information geometric measures that aim at assessing the intrinsic causal cross-influences in a system. One promising candidate of these measures, denoted by Φ C I S , is based on conditional independence statements and does satisfy all of the properties that have been postulated as desirable. Unfortunately it does not have a graphical representation, which makes it less intuitive and difficult to analyze. We propose an alternative approach using a latent variable, which models a common exterior influence. This leads to a measure Φ C I I , Causal Information Integration, that satisfies all of the required conditions. Our measure can be calculated using an iterative information geometric algorithm, the em-algorithm. Therefore we are able to compare its behavior to existing integrated information measures.

1. Introduction

The theory of Integrated Information aims at quantifying the amount and quality of consciousness of a neural network. It was originally proposed by Tononi and went through various phases of evolution, starting with one of the first papers "Consciousness and Complexity" [1] in 1999, to "Consciousness as Integrated Information—a Provisional Manifesto" [2] in 2008, and Integrated Information Theory (IIT) 3.0 [3] in 2014, to ongoing research. Although important parts of the methodology of this theory have changed or been extended, the two key concepts determining consciousness have remained essentially fixed: “Information” and “Integration”. Information refers to the number of different states a system can be in, and Integration describes the extent to which this information is integrated among different parts of the system. Tononi summarizes this idea in Reference [2] with the following sentence:
In short, integrated information captures the information generated by causal interactions in the whole, over and above the information generated by the parts.
Therefore Integrated Information can be seen as a measure of the system's complexity. In this context it belongs to the class of theories that define complexity as the extent to which the whole is more than the sum of its parts.
There are various ways to define a split system and the difference between the full and the split system. Therefore, there exist different branches of complexity measures in the context of Integrated Information. The most recent theory, IIT 3.0 [3], goes far beyond the original measures and includes additional layers of definitions corresponding to the quality of the measured consciousness, including the maximally irreducible conceptual structure (MICS) and the integrated conceptual information. In order to focus on the information geometric aspects of IIT, we follow the strategy of Oizumi et al. [4] and Amari et al. [5], restricting attention to measuring the integrated information in discrete n-dimensional stationary Markov processes from an information geometric point of view.
In detail, we will measure the distance between the full and the split system using the KL-divergence, as proposed in Reference [6] and published in Reference [7]. This framework was further discussed in Reference [8]. Oizumi et al. [4] and Amari et al. [5] summarize these ideas and add a Markov condition and an upper bound to clarify what a complexity measure should satisfy. The Markov condition is intended to model the removal of certain cross-time connections, which we call causal cross-connections. These connections are the ones that integrate information among the different nodes across different points in time. The upper bound was originally proposed in Reference [9] and is given by the mutual information, which aims at quantifying the total information flow from one timestep to the next. These conditions are necessary but do not specify a measure uniquely. We will discuss the conditions in the next section.
Additionally Oizumi et al. [4] and Amari et al. [5] introduce one measure that satisfies all of these requirements. This measure is described by conditional independence statements and will be denoted here by Φ C I S . We will introduce Φ C I S along with two other existing measures, namely Stochastic Interaction Φ S I [7] and Geometric Integrated Information Φ G [10]. The measure Φ S I is not bounded from above by the mutual information and Φ G does not satisfy the postulated Markov condition.
Although Φ C I S fits perfectly in the proposed framework, this measure does not correspond to a graphical representation, and it is therefore difficult to analyze the causal nature of the measured information flow. We focus on the notion of causality defined by Pearl in Reference [11], in which the correspondence between conditional independence statements and graphs, for instance DAGs or more generally chain graphs, is a key concept. Moreover, we demonstrate that it is not possible to express the conditional independence statements corresponding to Φ C I S using a chain graph, even after adding latent variables. Following the reasoning of Pearl's causality theory, however, this would be a desirable property.
The main purpose of this paper is to propose a more intuitive approach that ensures the consistency between graphical representation and conditional independence statements. This is achieved by using a latent variable that models a common exterior influence. Doing so leads to a new measure, which we call Causal Information Integration Φ C I I . This measure is specifically created to measure only the intrinsic causal cross-influences in a setting with an unknown exterior influence, and it satisfies all the required conditions postulated by Oizumi et al. Assuming the existence of an unknown exterior influence is not unreasonable; in fact, one point of criticism concerning Φ S I is that this measure does not account for exterior influences and therefore erroneously measures them as internal, see Section 6.9 in Reference [10]. In a setting with known external influences, these can be integrated in the model as visible variables. This leads to a measure discussed in Section 2.1.1 that we call Φ T , which is an upper bound for Φ C I I .
We discuss the relationships between the introduced measures in Section 2.1.2 and present a way of calculating Φ C I I by using an iterative information geometric algorithm, the em-algorithm described in Section 2.1.3. This algorithm is guaranteed to converge to a minimum, but this might be a local minimum. Therefore we have to run the algorithm multiple times to find a global minimum. Utilizing this algorithm we are able to compare the behavior of Φ C I I to existing integrated information measures.

Integrated Information Measures

Measures corresponding to Integrated Information investigate the information flow in a system from a time t to t + 1. This flow is represented by the connections from the nodes X i in t to the nodes Y i in t + 1, i ∈ { 1 , … , n }, as displayed in Figure 1.
The systems are modeled as discrete, stationary, n-dimensional Markov processes (Z_t)_{t ∈ ℕ},
X = (X_1, \dots, X_n) = (X_{1,t}, \dots, X_{n,t}), \quad Y = (Y_1, \dots, Y_n) = (X_{1,t+1}, \dots, X_{n,t+1}), \quad Z = (X, Y),
on a finite set \mathcal{Z}, which is the Cartesian product of the sample spaces of the X_i, i ∈ {1, …, n}, denoted by \mathcal{X}_i:
\mathcal{Z} = \mathcal{X} \times \mathcal{Y} = \mathcal{X}_1 \times \dots \times \mathcal{X}_n \times \mathcal{Y}_1 \times \dots \times \mathcal{Y}_n.
It is possible to apply the following methods to non-stationary distributions, but this assumption in addition to the process being Markovian allows us to restrict the discussion to one time step.
Let \mathcal{M}_P(\mathcal{Z}) denote the set of distributions that belong to these Markov processes.
Denote the complement of X_i in X by X_{I∖{i}} = (X_1, …, X_{i−1}, X_{i+1}, …, X_n) with I = {1, …, n}. Corresponding to this notation, x_{I∖{i}} ∈ \mathcal{X}_{I∖{i}} describes the elementary events of X_{I∖{i}}. We will use the analogous notation in the case of Y, and we will write z ∈ \mathcal{Z} instead of (x, y) ∈ \mathcal{X} × \mathcal{Y}. The set of probability distributions on \mathcal{Z} will be denoted by P(\mathcal{Z}). Throughout this article we will restrict attention to strictly positive distributions.
The core idea of measuring Integrated Information is to determine how much the initial system differs from one in which no information integration takes place. The former will be called a “full” system, because we allow all possible connections between the nodes, and the latter will be called a “split” system. Graphical representations of the full systems for n = 2 , 3 and their connections are depicted in Figure 1. In this article we are using graphs that describe the conditional independence structure of the corresponding sets of distributions. An introduction to those is given in Appendix A.
Graphs are not only a tool to conveniently represent conditional independence statements; the connection between conditional independence and graphs is a core concept of Pearl's causality theory. The interplay between graphs and conditional independence statements provides a consistent foundation of causality. In Section 1.3 of Reference [11], Pearl emphasizes the importance of a graphical representation with the following statement:
It seems that if conditional independence judgments are by-products of stored causal relationships, then tapping and representing those relationships directly would be a more natural and more reliable way of expressing what we know or believe about the world. This is indeed the philosophy behind causal Bayesian networks.
Therefore, measures of the strength of causal cross-connections should be based on split models that have a graphical representation.
Following the concept introduced in References [6,7], the difference between the measures corresponding to the full and split systems will be calculated by using the KL-divergence.
Definition 1 (Complexity).
Let M be a set of probability distributions on Z corresponding to a split system. Then we minimize the KL-divergence between M and the distribution of the fully connected system P ˜ to calculate the complexity
\Phi_{\mathcal{M}} = \inf_{Q \in \mathcal{M}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q) = \inf_{Q \in \mathcal{M}} \sum_{z \in \mathcal{Z}} \tilde{P}(z) \log \frac{\tilde{P}(z)}{Q(z)}.
Minimizing the KL-divergence with respect to the second argument is called m-projection or rI-projection. Hence we will call P with
P = \arg\inf_{Q \in \mathcal{M}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q)
the projection of P ˜ to M .
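As a concrete illustration (a minimal sketch, not the authors' code), the KL-divergence of Definition 1 can be evaluated as follows when both distributions are stored as strictly positive NumPy arrays over Z; the minimization over the split model M is a separate step.

import numpy as np

def kl_divergence(p_full, q_split):
    """D_Z(P || Q) = sum_z P(z) log(P(z) / Q(z)) for strictly positive arrays."""
    p = np.asarray(p_full, dtype=float)
    q = np.asarray(q_split, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Phi_M is the infimum of kl_divergence(p_full, q) over q in the split model M;
# the minimizer is the m-projection of p_full onto M.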
The question remains how to define the split model M . We want to measure the information that gets integrated between different nodes in different points in time. In Figure 1 these are the dashed connections, also called cross-influences in Reference [4]. We will refer to the dashed connections as causal cross-connections.
In order to ensure that these connections are removed in the split system, the authors of Reference [4] and Reference [5] argue that Y j should be independent of X i given X I \ { i } , i j , leading to the following property.
Property 1.
A valid split system should satisfy the Markov condition
Q(X_i, Y_j \mid X_{I \setminus \{i\}}) = Q(X_i \mid X_{I \setminus \{i\}}) \, Q(Y_j \mid X_{I \setminus \{i\}}), \quad i \neq j,
with Q ∈ P(\mathcal{Z}). This can also be written in the following form
Y_j \perp X_i \mid X_{I \setminus \{i\}}.
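This condition can be verified numerically. The following sketch is our own illustration (assuming binary nodes and a joint distribution stored as a NumPy array with one axis per variable, ordered X_1, …, X_n, Y_1, …, Y_n); it checks Property 1 for all pairs i ≠ j.

import itertools
import numpy as np

def satisfies_property_1(Q, n, tol=1e-10):
    for i, j in itertools.permutations(range(n), 2):   # all ordered pairs i != j
        # Marginalize out all Y-axes except Y_j (all X-axes are kept).
        drop = tuple(n + k for k in range(n) if k != j)
        joint = Q.sum(axis=drop)                        # axes: X_1, ..., X_n, Y_j
        joint = np.moveaxis(joint, i, 0)                # axes: X_i, X_{I\{i}}..., Y_j
        m = joint.reshape(2, -1, 2)                     # (X_i, conditioning block, Y_j)
        for s in range(m.shape[1]):
            block = m[:, s, :] / m[:, s, :].sum()       # joint of (X_i, Y_j) given x_{I\{i}}
            prod = np.outer(block.sum(axis=1), block.sum(axis=0))
            if not np.allclose(block, prod, atol=tol):  # independence fails
                return False
    return True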
Now we take a closer look at the remaining connections. The dotted lines connect nodes belonging to the same point in time. These connections between the Y i s might result from common internal influences, meaning a correlation between the X i s passed on to the next point in time via the dashed or solid connections. Additionally Amari points out in Section 6.9 in Reference [10] that there might exist a common exterior influence on the Y i s. Although the measured integrated information should be internal and independent of external influences, the system itself is in general not completely independent of its environment.
Since we want to measure the amount of integrated information between t and t + 1 , the distribution in t, and therefore the connection between the X i s, should stay unchanged in the split system. The dotted connections between the Y i s play an important role in Property 2. For this property, we will consider the split system in which the solid and dashed connections are removed.
The solid arrows represent the influence of a node in t on itself in t + 1 and removing these arrows, in addition to the causal cross-connections, leads to a system with completely disconnected points in time as shown on the right in Figure 2. The distributions corresponding to this split system are
\mathcal{M}_I = \{ Q \in P(\mathcal{Z}) \mid Q(z) = Q(x) \, Q(y), \; z = (x, y) \in \mathcal{Z} \}
and the measure Φ I is given by the mutual information I ( X ; Y ) , which is defined in the following way
\Phi_I = I(X; Y) = \sum_{z \in \mathcal{Z}} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}.
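For illustration, Φ I can be computed directly from the joint stationary distribution. The following sketch assumes the joint is given as a two-dimensional array with one row per state of X and one column per state of Y; it is not the authors' implementation.

import numpy as np

def phi_I(P_xy):
    """Mutual information I(X;Y) = D_Z(P || P_X P_Y), cf. the definition above."""
    P_x = P_xy.sum(axis=1, keepdims=True)   # marginal on X
    P_y = P_xy.sum(axis=0, keepdims=True)   # marginal on Y
    return float(np.sum(P_xy * np.log(P_xy / (P_x * P_y))))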
Since there is no information flow between the time steps Oizumi et al. argue in Reference [4] that an integrated information measure should be bounded from above by the mutual information.
Property 2.
The mutual information should be an upper bound for an Integrated Information measure
\Phi_{\mathcal{M}} = \inf_{Q \in \mathcal{M}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q) \leq I(X; Y).
Oizumi et al. [4,9] and Amari et al. [5] state that this property is natural, because an Integrated Information measure should be bounded by the total amount of information flow between the different points in time. The postulation of this property led to a discussion in Reference [12]. The point of disagreement concerns the edge between the Y i s. On the one hand this connection takes into account that the Y i s might have a common exterior influence that affects all the Y i s, as pointed out by Amari in Reference [10]. This is symbolized by the additional node W in Figure 2 and this should not contribute to the value of Integrated Information between the different points in time.
On the other hand, we know that if the X i s are correlated, then the correlation is passed to the Y i s via the solid and dashed arrows. The edges created by calculating the marginal distribution on Y also contain these correlations. The question now is, how much of these correlations integrate information in the system and should therefore be measured. Kanwal et al. discuss this problem in Reference [12]. They distinguish between intrinsic and extrinsic influences that cause the connections between the Y i s in the way displayed in Figure 2. By calculating the split system for Φ I the edge between the Y i s might compensate for the solid arrows and common exterior influences, but also for the dashed, causal cross-connections, as shown in Figure 2 on the right. Kanwal et al. analyze an example of a full system without a common exterior influence with the result that there are cases in which a measure that only removes the causal cross-connections has a larger value than Φ I . This is only possible if the undirected edge between the Y i s compensates a part of the causal cross-connections. Hence Φ I does not measure all the intrinsic causal cross-influences. Therefore Kanwal et al. question the use of the mutual information as an upper bound.
Then again, we would like to contribute a different perspective. Accepting Property 2 does not necessarily mean that the connections between the Y i s are fixed. It may merely mean that M I is a subset of the set of split distributions. We will see that the measures Φ C I S and Φ C I I satisfy Property 2 in this way. Although the argument that Φ I measures all the intrinsic influences is no longer valid, satisfying Property 2 is still desirable in general. Consider an initial system with the distribution P ˜ ( z ) = P ˜ ( x ) P ˜ ( y ) , z ∈ Z . This system has a common exterior influence on the Y i s and no connection between the different points in time. Since there is no information flow between the points in time, a measure for Integrated Information Φ M should be zero for all distributions of this form. This is the case exactly when M I ⊆ M , hence when Φ I is an upper bound for Φ M . In order to emphasize this point we propose a modified version of Property 2.
Property 3.
The set M I should be a subset of the split model M corresponding to the Integrated Information measure Φ M . Then the inequality
\Phi_{\mathcal{M}} = \inf_{Q \in \mathcal{M}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q) \leq I(X; Y)
holds.
Note that the new formulation is stronger, hence Property 2 is a consequence of Property 3. Every measure discussed here that satisfies Property 2 also fulfills Property 3. Therefore we will keep referring to Property 2 in the following sections.
Figure 3 displays an overview over the different measures and whether they satisfy Properties 1 and 2.
The first complexity measure that we are discussing does not fulfill Property 2. It is called Stochastic Interaction and was introduced by Ay in Reference [6] in 2001, later published in Reference [7]. Barrett and Seth discuss it in Reference [13] in the context of Integrated Information. In Reference [5] the corresponding model is called “fully split model”.
The core idea is to allow only the connections among the random variables in t and additionally the connections between X i and Y i , that is, between the same random variable at different points in time. The latter correspond to the solid arrows in Figure 1. A graphical representation for n = 2 can be found in the first column of Figure 3.
Definition 2 (Stochastic Interaction).
The set of distributions belonging to the split model in the sense of Stochastic Interaction can be defined as
\mathcal{M}_{SI} = \left\{ Q \in P(\mathcal{Z}) \;\middle|\; Q(Y \mid X) = \prod_{i=1}^{n} Q(Y_i \mid X_i) \right\}
and the complexity measure can be calculated as follows
\Phi_{SI} = \inf_{Q \in \mathcal{M}_{SI}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q) = \sum_{i=1}^{n} H(Y_i \mid X_i) - H(Y \mid X),
as shown in Reference [7]. In the definition above, H denotes the conditional entropy
H(Y_i \mid X_i) = - \sum_{x_i \in \mathcal{X}_i} \sum_{y_i \in \mathcal{Y}_i} \tilde{P}(x_i, y_i) \log \tilde{P}(y_i \mid x_i).
This does not satisfy Property 2 and therefore the corresponding graph is displayed only in the first column of Figure 3. Amari points out in Reference [10] that this measure is not applicable in the case of an exterior influence on the Y i s. Such an influence can cause the Y i s to be correlated even in the case of independent X i s and no causal cross-connections.
In a setting without exterior influences, Φ S I quantifies the strength of the causal cross-connections alone and is therefore a reasonable choice for an Integrated Information measure. Accounting for an exterior influence that does not exist leads to a split system which compensates a part of the removal of the causal cross-connections, so that the resulting measure does not quantify all of the interior causal cross-influences.
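The closed form of Definition 2 is straightforward to evaluate numerically. The sketch below is our own illustration under the binary array convention used earlier (axes X_1, …, X_n, Y_1, …, Y_n), not the C++ implementation mentioned in Section 4.

import numpy as np

def conditional_entropy(joint, cond_axes):
    """H(A | B) from a joint array over (A, B), where cond_axes index the B-axes."""
    marg = joint.sum(axis=tuple(ax for ax in range(joint.ndim) if ax not in cond_axes),
                     keepdims=True)
    return float(-np.sum(joint * np.log(joint / marg)))

def phi_SI(P, n):
    total = -conditional_entropy(P, cond_axes=tuple(range(n)))   # -H(Y | X)
    for i in range(n):
        # Pairwise marginal of (X_i, Y_i), then H(Y_i | X_i).
        drop = tuple(ax for ax in range(2 * n) if ax not in (i, n + i))
        P_xiyi = P.sum(axis=drop)
        total += conditional_entropy(P_xiyi, cond_axes=(0,))
    return total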
To force the model to satisfy Property 2, one can add the interaction between Y i and Y j , which results in the measure Geometric Integrated Information [10].
Definition 3 (Geometric Integrated Information).
The graphical model corresponding to the graph in the second row and first column of Figure 3 is the set
\mathcal{M}_G = \left\{ P \in P(\mathcal{Z}) \;\middle|\; \exists\, f_1, \dots, f_{n+2} \in \mathbb{R}_{+}^{\mathcal{Z}} \text{ s.t. } P(z) = f_{n+1}(x) \, f_{n+2}(y) \prod_{i=1}^{n} f_i(x_i, y_i) \right\}
and the measure is defined as
\Phi_G = \inf_{Q \in \mathcal{M}_G} D_{\mathcal{Z}}(P \,\|\, Q).
M G is called the diagonally split model in Reference [5]. It is not causally split in the sense that the corresponding distributions in general do not satisfy Property 1. This can be seen by analyzing the conditional independence structure of the graph as described in Appendix A. By introducing the edges between the Y i s as fixed, Φ G might force these connections to be stronger than they originally are. A result of this might be that an effect of the causal cross-connections gets compensated by the new edge. We discussed this above in the context of Property 2.
This measure has no closed form solution, but we are able to calculate the corresponding split system with the help of the iterative scaling algorithm (see, for example, Section 5.1 in Reference [14]).
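A sketch of such an iterative scaling (iterative proportional fitting) procedure for the m-projection onto M G is shown below. It cycles through the generating margins of the model and is only an illustration under the same binary array convention as above; it is not the authors' implementation.

import numpy as np

def project_to_MG(P, n, iterations=200):
    Q = np.full_like(P, 1.0 / P.size)            # start from the uniform distribution
    x_axes, y_axes = tuple(range(n)), tuple(range(n, 2 * n))
    # Generating margins of M_G: the X-marginal, the Y-marginal and each (X_i, Y_i).
    margins = [x_axes, y_axes] + [(i, n + i) for i in range(n)]
    for _ in range(iterations):
        for keep in margins:
            drop = tuple(ax for ax in range(2 * n) if ax not in keep)
            ratio = P.sum(axis=drop, keepdims=True) / Q.sum(axis=drop, keepdims=True)
            Q = Q * ratio                          # match the marginal of P on `keep`
    return Q

# Phi_G is then the KL-divergence D_Z(P || project_to_MG(P, n)).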
The first measure that satisfies both properties is called “Integrated Information” [4]; its model is referred to as the “causally split model” in Reference [5] and it is derived from the first property. Since we are able to define it using conditional independence statements, we will denote it by Φ C I S . It requires Y i to be independent of X I \ { i } given X i .
Definition 4 (Integrated Information).
The set of distributions, that belongs to the split system corresponding to integrated information, is defined as
\mathcal{M}_{CIS} = \left\{ Q \in P(\mathcal{Z}) \;\middle|\; Q(Y_i \mid X) = Q(Y_i \mid X_i) \text{ for all } i \in \{1, \dots, n\} \right\}
and this leads to the measure
\Phi_{CIS} = \inf_{Q \in \mathcal{M}_{CIS}} D_{\mathcal{Z}}(P \,\|\, Q).
We can write the requirements on the distributions in (3) as conditional independence statements
Y_i \perp X_{I \setminus \{i\}} \mid X_i.
A detailed analysis of probabilistic independence statements can be found in Reference [15]. Unfortunately, these conditional independence statements cannot be encoded in terms of a chain graph in general. The definition of this measure arises naturally from Property 1 by applying the relation (1)
Q(X_i, Y_j \mid X_{I \setminus \{i\}}) = Q(X_i \mid X_{I \setminus \{i\}}) \, Q(Y_j \mid X_{I \setminus \{i\}}), \quad i \neq j,
to all pairs i, j ∈ {1, …, n}. This leads to
Q(Y_j \mid X) = Q(Y_j \mid X_j),
as shown in Appendix B.
Note that this implies that every model satisfying Property 1 is a submodel of M C I S . In order to show that Φ C I S satisfies Property 1, we are going to rewrite the condition in Property 1 as
Q(Y_j \mid X) = Q(Y_j \mid X_{I \setminus \{i\}}).
The definition of M C I S allows us to write
Q(Y_j \mid X) = Q(Y_j \mid X_j) = Q(Y_j \mid X_{I \setminus \{i\}}),
for Q ∈ M C I S . Therefore Φ C I S satisfies Property 1, and since M I meets the conditional independence statements of Property 1, the relation M I ⊆ M C I S holds and Φ C I S fulfills Property 2.
In Reference [4] Oizumi et al. derive an analytical solution for Gaussian variables, but there does not exist a closed form solution for discrete variables in general. Therefore they use Newton’s method in the case of discrete variables.
Due to the lack of a graphical representation, it is difficult to interpret the causal nature of the elements of M C I S . In Example 1 we will see a type of model that is part of M C I S , but which has a graphical representation. This model does not lie in the set of Markovian processes \mathcal{M}_P(\mathcal{Z}) discussed in this article. Hence not all the split distributions in M C I S arise from removing connections from a full distribution, as depicted in Figure 1.

2. Causal Information Integration

Inspired by the discussion about extrinsic and intrinsic influences in the context of Property 2, we now utilize the notion of a common exterior influence to define the measure Φ C I I , which we call Causal Information Integration. This measure should be used in case of an unknown exterior influence.

2.1. Definition

Explicitly including a common exterior influence allows us to avoid the problems of a fixed edge between the Y i s discussed earlier. This leads to the graphs in Figure 4.
The factorization of the distributions belonging to these graphical models is the following one
P(z, w) = P(x) \prod_{i=1}^{n} P(y_i \mid x_i, w) \, P(w).
By marginalizing over the elements of W we get a distribution on Z defining our new model.
Definition 5 (Causal Information Integration).
The set of distributions belonging to the marginalized model for | W m | = m is
\mathcal{M}_{CII}^{m} = \left\{ P \in P(\mathcal{Z}) \;\middle|\; \exists\, Q \in P(\mathcal{Z} \times \mathcal{W}_m) : P(z) = \sum_{j=1}^{m} Q(x) \, Q(w_j) \prod_{i=1}^{n} Q(y_i \mid x_i, w_j) \right\}.
We will define the split model for Causal Information Integration as the closure (denoted by a bar) of the union of the sets M C I I m
\mathcal{M}_{CII} = \overline{\bigcup_{m \in \mathbb{N}} \mathcal{M}_{CII}^{m}}.
This leads to the measure
\Phi_{CII} = \inf_{Q \in \mathcal{M}_{CII}} D_{\mathcal{Z}}(P \,\|\, Q).
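To make the marginalization explicit, the following sketch (our own illustration, assuming binary nodes and m hidden states; the variable names are hypothetical) constructs an element of M C I I m from conditional distributions Q(x), Q(w) and Q(y_i | x_i, w).

import itertools
import numpy as np

def marginalized_cii_distribution(Q_x, Q_w, Q_y_given_xw, n, m):
    """Assumed shapes: Q_x of length 2**n, Q_w of length m,
    Q_y_given_xw[i] of shape (2, 2, m) indexed by (y_i, x_i, w)."""
    P = np.zeros((2,) * (2 * n))
    for idx in itertools.product((0, 1), repeat=2 * n):
        x, y = idx[:n], idx[n:]
        x_state = int("".join(map(str, x)), 2)     # index of the state x in Q_x
        total = 0.0
        for w in range(m):
            prod = np.prod([Q_y_given_xw[i][y[i], x[i], w] for i in range(n)])
            total += Q_w[w] * prod                 # sum over the hidden variable W
        P[idx] = Q_x[x_state] * total
    return P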
Since the split system M C I I was defined by utilizing graphs, we are able to use the graphical representation to get a more precise notion of the cases in which Φ C I I ( P ˜ ) = 0 holds. In those cases the initial distribution can be completely explained as a limit of marginalized distributions without causal cross-influences and with exterior influences.
Proposition 1.
The measure Φ C I I ( P ˜ ) is 0 if and only if there exists a sequence of distributions Q m P ( Z ) with the following properties.
1. 
\tilde{P} = \lim_{m \to \infty} Q_m.
2. 
For every m ∈ ℕ there exists a distribution Q ^ m ∈ P ( Z × W m ) that has Z -marginals equal to Q m
Q_m(z) = \hat{Q}_m(z), \quad z \in \mathcal{Z}.
Additionally Q ^ m factors according to the graph corresponding to the split system
\hat{Q}_m(z, w) = \hat{Q}_m(x) \prod_{i=1}^{n} \hat{Q}_m(y_i \mid x_i, w) \, \hat{Q}_m(w), \quad (z, w) \in \mathcal{Z} \times \mathcal{W}_m.
In order to show that Φ C I I satisfies the conditional independence statements in Property 1, we will calculate the conditional distributions P ( y i | x i ) and P ( y i | x ) of
P(z) = \sum_{w} P(x) \prod_{j=1}^{n} P(y_j \mid x_j, w) \, P(w).
This results in
P(y_i \mid x_i) = \frac{\sum_{y_{I \setminus \{i\}}} \sum_{x_{I \setminus \{i\}}} \sum_{w} P(x) \prod_{j=1}^{n} P(y_j \mid x_j, w) \, P(w)}{P(x_i)} = \frac{\sum_{x_{I \setminus \{i\}}} \sum_{w} P(x) \, P(y_i \mid x_i, w) \, P(w)}{P(x_i)} = \sum_{w} P(y_i \mid x_i, w) \, P(w),
P(y_i \mid x) = \frac{\sum_{y_{I \setminus \{i\}}} \sum_{w} P(x) \prod_{j=1}^{n} P(y_j \mid x_j, w) \, P(w)}{P(x)} = \sum_{w} P(y_i \mid x_i, w) \, P(w)
for all z ∈ Z . Hence P ( y i | x i ) = P ( y i | x ) for every P ∈ M C I I m , m ∈ ℕ. Since every element P ^ ∈ M C I I is a limit point of distributions that satisfy the conditional independence statements, P ^ also fulfills them. A proof can be found in Reference [16], Proposition 3.12. Therefore Φ C I I satisfies Property 1 and the set of all such distributions is a subset of M C I S
\mathcal{M}_{CII} \subseteq \mathcal{M}_{CIS}.
We are able to represent the marginalized model by using the methods from Reference [17]. Up to this point we have been using chain graphs. These are graphs consisting of directed and undirected edges such that there are no semi-directed cycles, as described in Appendix A. In order to gain a graph that represents the conditional independence structure of the marginalized model, we need the concept of chain mixed graphs (CMGs). In addition to the directed and undirected edges belonging to chain graphs, chain mixed graphs also have arcs ↔. Two nodes connected by an arc are called spouses. The connection between spouses appears when we marginalize over a common influence, hence spouses do not have a directed information flow from one node to the other but are affected by the same mechanisms. Algorithm A3 from Reference [17] allows us to transform a chain graph with latent variables into a chain mixed graph that represents the conditional independence structure of the marginalized chain graph. Applying it to the graphs in Figure 4 leads to the CMGs in Figure 5. Unfortunately, no factorization corresponding to the CMGs is known to the authors.
In order to prove that Φ C I I satisfies Property 2, we will show that M I is a subset of M C I I . At first we will consider the following subset of M C I I
\mathcal{M}_{CI}^{m} = \left\{ P \in P(\mathcal{Z}) \;\middle|\; \exists\, Q \in P(\mathcal{Z} \times \mathcal{W}_m) : P(z) = \sum_{j=1}^{m} Q(x) \, Q(w_j) \prod_{i=1}^{n} Q(y_i \mid w_j) \right\}, \quad \mathcal{M}_{CI} = \overline{\bigcup_{m \in \mathbb{N}} \mathcal{M}_{CI}^{m}},
where we remove the connections between the different stages, as shown in Figure 6.
Now X and Y are independent of each other
Q(z) = Q(x) \cdot Q(y)
with
Q(y) = \sum_{w} Q(w) \prod_{i=1}^{n} Q(y_i \mid w)
for Q ∈ M C I m , and since independence structures of discrete distributions are preserved in the limit, we have M C I ⊆ M I . In order to obtain equality, it remains to show that Q ( Y ) can approximate every distribution on Y if the state space of W is sufficiently large. These distributions are mixtures of discrete product distributions, where
\prod_{i=1}^{n} Q(y_i \mid w)
are the mixture components and Q ( w ) are the mixture weights. Hence we are able to use the following result.
Theorem 1
(Theorem 1.3.1 from Reference [18]). Let q be a prime power. The smallest m for which any probability distribution on \{1, \dots, q\}^{n} can be approximated arbitrarily well as a mixture of m product distributions is q^{n-1}.
Universal approximation results like the theorem above may suggest that the models M C I I and M C I S are equal. However we will present numerically calculated examples of elements belonging to M C I S , but not to M C I I , even with an extremely large state space. We will discuss this matter further in Section 2.1.2.
In conclusion, Φ C I I satisfies Property 1 and 2.
Note that using Φ C I I in cases without an exterior influence might not capture all the internal cross-influences, since the additional latent variable can compensate some of the difference between the initial distribution and the split model. This can only be avoided when the exterior influence is known and can therefore be included in the model. We will discuss that case in the next section.

2.1.1. Ground Truth

The concept of an exterior influence suggests that there exists a ground truth in a larger model in which W is a visible variable. This is shown in Figure 7 on the right.
Assuming that we know the distribution of the whole model, we are able to apply the concepts discussed above to define an Integrated Information measure Φ T on the larger space. This allows us to really only remove the causal cross-connections as shown in Figure 7 on the left. Thus we can interpret Φ T as the ultimate measure of Integrated Information, if the ground truth is available. Note that using the measure Φ S I in the setting with no external influences is a special case of Φ T .
The set of distributions belonging to the larger, fully connected model will be called E f and the set corresponding to the graph on the left of Figure 7 depicts the split system which will be denoted by E . Since W is now known, we are able to fix the state space W to its actual size m.
\mathcal{E} = \left\{ P \in P(\mathcal{Z} \times \mathcal{W}_m) \;\middle|\; P(z, w) = P(x) \prod_{i=1}^{n} P(y_i \mid x_i, w) \, P(w), \; (z, w) \in \mathcal{Z} \times \mathcal{W}_m \right\}, \quad |\mathcal{W}| = m,
\mathcal{E}_f = \left\{ P \in P(\mathcal{Z} \times \mathcal{W}_m) \;\middle|\; P(z, w) = P(x) \prod_{i=1}^{n} P(y_i \mid x, w) \, P(w), \; (z, w) \in \mathcal{Z} \times \mathcal{W}_m \right\}, \quad |\mathcal{W}| = m.
Note that E is the set of all the distributions that result in an element of M C I I after marginalization over W m
\mathcal{M}_{CII}^{m} = \left\{ P \in P(\mathcal{Z}) \;\middle|\; \exists\, Q \in \mathcal{E}_m : P(z) = \sum_{j=1}^{m} Q(x) \, Q(w_j) \prod_{i=1}^{n} Q(y_i \mid x_i, w_j) \right\}.
Calculating the KL-divergence between P ∈ E f and E results in the new measure.
Proposition 2.
Let P ∈ E f . Minimizing the KL-divergence between P and E leads to
\Phi_T = \inf_{Q \in \mathcal{E}} D_{\mathcal{Z} \times \mathcal{W}_m}(P \,\|\, Q) = \sum_{z, w} P(z, w) \log \frac{\prod_i P(y_i \mid x, w)}{\prod_i P(y_i \mid x_i, w)} = \sum_i I(Y_i; X_{I \setminus \{i\}} \mid X_i, W).
In the definition above I ( Y i ; X I \ { i } | X i , W ) is the conditional mutual information defined by
I(Y_i; X_{I \setminus \{i\}} \mid X_i, W) = \sum_{y_i, x, w} P(y_i, x, w) \log \frac{P(y_i, x_{I \setminus \{i\}} \mid x_i, w)}{P(y_i \mid x_i, w) \, P(x_{I \setminus \{i\}} \mid x_i, w)}.
It characterizes the reduction of uncertainty in Y i due to X I \ { i } when W and X i are given. Therefore this measure decomposes into a sum in which each addend characterizes the information flow towards one Y i . Writing this as conditional independence statements, Φ T is 0 if and only if
Y_i \perp X_{I \setminus \{i\}} \mid \{X_i, W\}.
Ignoring W would lead exactly to the conditional independence statements in Equation (3). For a more detailed description of the conditional mutual information and its properties, see Reference [19].
Furthermore, Φ T = 0 if and only if the initial distribution P factors according to the graph that belongs to E . This follows from Proposition 2 and the fact that the KL-divergence is 0 if and only if both distributions are equal. Hence this measure truly removes the causal cross-connections.
Additionally, by using that W ⊥ X , we are able to split up the conditional mutual information into a part corresponding to the conditional independence statements of Property 1 and another conditional mutual information.
I(Y_i; X_{I \setminus \{i\}} \mid X_i, W) = \sum_{y_i, x, w} P(y_i, x, w) \log \left[ \frac{P(y_i, x_{I \setminus \{i\}} \mid x_i)}{P(y_i \mid x_i) \, P(x_{I \setminus \{i\}} \mid x_i)} \cdot \frac{P(y_i, x_i) \, P(x) \, P(y_i, x, w) \, P(x_i, w)}{P(y_i, x) \, P(x_i) \, P(y_i, x_i, w) \, P(x, w)} \right]
= I(Y_i; X_{I \setminus \{i\}} \mid X_i) + \sum_{y_i, x, w} P(y_i, x, w) \log \frac{P(y_i, x_i) \, P(x) \, P(y_i, x, w) \, P(x_i, w)}{P(y_i, x) \, P(x_i) \, P(y_i, x_i, w) \, P(x, w)}
= I(Y_i; X_{I \setminus \{i\}} \mid X_i) + \sum_{y_i, x, w} P(y_i, x, w) \log \frac{P(w, x_{I \setminus \{i\}} \mid y_i, x_i)}{P(w \mid y_i, x_i) \, P(x_{I \setminus \{i\}} \mid y_i, x_i)}
= I(Y_i; X_{I \setminus \{i\}} \mid X_i) + I(W; X_{I \setminus \{i\}} \mid Y_i, X_i).
Since the conditional mutual information is non-negative, Φ T is 0 if and only if the conditional independence statements of Equation (3) hold and additionally the reduction of uncertainty in W due to X I \ { i } given Y i , X i is 0.
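The decomposition above can be evaluated directly from a joint distribution over (X, Y, W). The following sketch is our own illustration, assuming binary visible nodes and a hidden variable stored as the last array axis with m states.

import numpy as np

def conditional_mutual_information(joint, a_axes, b_axes, c_axes):
    """I(A; B | C) computed from a joint NumPy array over all variables."""
    def marg(keep):
        drop = tuple(ax for ax in range(joint.ndim) if ax not in keep)
        return joint.sum(axis=drop, keepdims=True)
    abc = marg(a_axes + b_axes + c_axes)
    ac = marg(a_axes + c_axes)
    bc = marg(b_axes + c_axes)
    c = marg(c_axes)
    return float(np.sum(abc * np.log(abc * c / (ac * bc))))

def phi_T(P, n):
    """Phi_T = sum_i I(Y_i; X_{I\\{i}} | X_i, W); P has axes (X_1..X_n, Y_1..Y_n, W)."""
    total = 0.0
    for i in range(n):
        a = (n + i,)                                 # Y_i
        b = tuple(k for k in range(n) if k != i)     # X_{I\{i}}
        c = (i, 2 * n)                               # X_i and W
        total += conditional_mutual_information(P, a, b, c)
    return total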
In general, we do not know what the ground truth of our system is and therefore we have to assume that W is a hidden variable. This leads us back to Φ C I I . Minimizing over all possible W might compensate a part of the causal information flow. One example, in which accounting for an exterior influence that does not exist leads to a value smaller than the true integrated information, was discussed earlier in the context of Property 2. There we refer to an example in Reference [12] where Φ S I exceeds Φ I in a setting without an exterior influence. Similarly, Φ C I I is smaller or equal to the true value Φ T .
Proposition 3.
The new measure Φ T is an upper bound for Φ C I I
\Phi_{CII} \leq \Phi_T.
Hence, by assuming that there exists a common exterior influence, we are able to show that Φ C I I is bounded from above by the true value Φ T , which measures all the intrinsic cross-influences. We are able to observe this behavior in Section 2.2.2.

2.1.2. Relationships between the Different Measures

Now we are going to analyze the relationship between the different measures Φ S I , Φ G , Φ C I S and Φ C I I . We will start with Φ G and Φ C I I . Previously we already showed that Φ C I I satisfies Property 1 and since Φ G does not satisfy Property 1, we have
\mathcal{M}_G \not\subseteq \mathcal{M}_{CII}.
To evaluate the other inclusion, we will consider the more refined parametrizations of elements P ∈ M C I I m and Q ∈ M G as defined in Definition A1. These are
P(z) = P(x) \, f_2(x_1, y_1) \, g_2(x_2, y_2) \sum_{w} P(w) \, f_1(w, y_1) \, f_3(x_1, y_1, w) \, g_1(w, y_2) \, g_3(x_2, y_2, w) = P(x) \, f_2(x_1, y_1) \, g_2(x_2, y_2) \, \phi(x_1, x_2, y_1, y_2),
Q(z) = h_{n+1}(x) \, h_{n+2}(y) \prod_{i=1}^{n} h_i(y_i, x_i),
where f_1, f_2, f_3, g_1, g_2, g_3, h_1, h_2, h_3, h_4 are non-negative functions such that P, Q ∈ P(\mathcal{Z}) and
\phi(x_1, x_2, y_1, y_2) = \sum_{w} P(w) \, f_1(w, y_1) \, f_3(x_1, y_1, w) \, g_1(w, y_2) \, g_3(x_2, y_2, w).
Since ϕ depends on more than Y 1 and Y 2 , P ( z ) does not factorize according to M G in general. Hence M C I I ⊄ M G holds.
Furthermore, looking at the parametrizations allows us to identify a subset of distributions that lies in the intersection of M G and M C I I . Allowing P to only have pairwise interactions would lead to
P(z) = P(x) \, \tilde{f}_2(x_1, y_1) \, \tilde{g}_2(x_2, y_2) \sum_{w} P(w) \, \tilde{f}_1(w, y_1) \, \tilde{g}_1(w, y_2) = P(x) \, \tilde{f}_2(x_1, y_1) \, \tilde{g}_2(x_2, y_2) \, \tilde{\phi}(y_1, y_2),
with the non-negative functions \tilde{f}_1, \tilde{f}_2, \tilde{g}_1, \tilde{g}_2 such that P ∈ P(\mathcal{Z}) and
\tilde{\phi}(y_1, y_2) = \sum_{w} P(w) \, \tilde{f}_1(w, y_1) \, \tilde{g}_1(w, y_2).
This P is an element of M G ∩ M C I I .
In the next part we will discuss the relationship between M C I I and M C I S . The elements in M C I I satisfy the conditional independence statements of Property 1, therefore
\mathcal{M}_{CII} \subseteq \mathcal{M}_{CIS}.
Previously we have seen that, by making the state space of W large enough, we can approximate any distribution on the Y i s, see Theorem 1. This gives the impression that M C I I and M C I S coincide. However, based on numerically calculated examples, we have the following conjecture.
Conjecture 1.
It is not possible to approximate every distribution Q ∈ M C I S with arbitrary accuracy by an element P ∈ M C I I . Therefore, we have that
\mathcal{M}_{CII} \subsetneq \mathcal{M}_{CIS}.
The following example strongly suggests this conjecture to be true.
Example 1.
Consider the set of distributions that factor according to the graph in Figure 8
\mathcal{N}_{CIS} = \{ P \in P(\mathcal{Z}) \mid P(z) = P(x_1) \, P(x_2) \, P(y_1 \mid x_1, y_2) \, P(y_2) \}.
This model satisfies the conditional independence statements of Property 1 and is therefore a subset of the model M C I S . In this case X 1 and X 2 are independent of each other, hence from a causal perspective the influence of Y 2 on Y 1 should be purely external. Therefore we try to model this with a subset of M C I I
\mathcal{N}_{CII} = \overline{\bigcup_{m \in \mathbb{N}} \mathcal{N}_{CII}^{m}}, \quad \mathcal{N}_{CII}^{m} = \left\{ P \in P(\mathcal{Z}) \;\middle|\; \exists\, Q \in P(\mathcal{Z} \times \mathcal{W}_m) : P(z) = Q(x_1) \, Q(x_2) \sum_{j=1}^{m} Q(y_1 \mid x_1, w_j) \, Q(y_2 \mid w_j) \, Q(w_j) \right\}
and this corresponds to Figure 9.
Using the em-algorithm described in Section 2.1.3, we took 500 random elements of N C I S and calculated the closest element of N C I I by taking the minimum KL-divergence over 50 different random input distributions in each run. The results are displayed in Table 1.
These are examples of elements lying in M C I S that cannot be approximated by elements of M C I I .
Now we are going to look at this example from the causal perspective. Proposition 1 states that Φ C I I ( P ˜ ) is 0 if and only if P ˜ is the limit of a sequence of distributions in M C I I corresponding to distributions on the extended space that factor according to the split model. Hence a distribution resulting in Φ C I I > 0 cannot be explained by a split model with an exterior influence. Taking into account that M C I S does not correspond to a graph, we do not have a similar result describing the distributions for which Φ C I S = 0 . Nonetheless, by looking at the graphical model N C I S , we are able to discuss the causal structure of a submodel of M C I S , a class of distributions for which Φ C I S = 0 holds.
If we trust the results in Table 1, this would imply that the influence from Y 2 to Y 1 is not purely external, but that an internal influence suddenly develops in timestep t + 1 that did not exist in timestep t. Therefore the distributions in N C I S do not, in general, belong to the stationary Markovian processes \mathcal{M}_P(\mathcal{Z}) depicted in Figure 1. For these Markovian processes the connections between the Y i s arise from correlated X i s or external influences, as pointed out by Amari in Section 6.9 of Reference [10]. So from a causal perspective N C I S does not fit into our framework. Hence the initial distribution P ˜ , which corresponds to a full model, will in general not be an element of N C I S . However, the projection of P ˜ to M C I S might lie in N C I S , as illustrated in Figure 10.
When this is the case, then P ˜ is closer to an element with a causal structure that does not fit into the discussed setting than to a split model in which only the causal cross-connections are removed. Hence a part of the internal cross-connections is compensated by this type of model, and therefore it does not measure all the intrinsic integrated information.
Further examples, which hint towards M C I I ⊊ M C I S , can be found in Section 2.2.2.
Adding the hidden variable W seems not to be sufficient to approximate elements of M C I S . Now the question naturally arises whether there are other exterior influences that need to be included in order to be able to approximate M C I S . We will explore this thought by starting with the graph corresponding to the split model M S I , depicted in Figure 11 on the left. In the next step we add hidden vertices and edges to the graph in a way such that the whole graph is still a chain graph. An example for a valid hidden structure is given in Figure 11 in the middle. Since we are going to marginalize over the hidden structure, it is only important how the visible nodes are connected via the hidden nodes. In the case of the example in Figure 11 we have a directed path from X 1 to X 2 going through the hidden nodes. Therefore we are able to reduce the structure to a gray box shown on the right in Figure 11.
Then we use the Algorithm A3 mentioned earlier, which converts a chain graph with hidden variables to a chain mixed graph reflecting the conditional independence structure of the marginalized model. This leads to a directed edge from X 1 to X 2 by marginalizing over the nodes in the hidden structures. Seeing that this directed edge already existed, the resulting model now is a subset of M S I and therefore does not approximate M C I S .
Following this procedure we are able to show that adding further hidden nodes and subgraphs of hidden nodes does not lead to a chain mixed graph belonging to a model that satisfies the conditional independence statements of Property 1 and strictly contains M C I I .
Theorem 2.
It is not possible to create a chain mixed graph corresponding to a model M , such that its distributions satisfy Property 1 and M C I I ⊊ M , by introducing a more complicated hidden structure to the graph of M S I .
In conclusion, assuming that Conjecture 1 holds, we have the following relations among the different presented models.
\mathcal{M}_I \subsetneq \mathcal{M}_G, \qquad \mathcal{M}_I \subseteq \mathcal{M}_{CII} \subsetneq \mathcal{M}_{CIS}, \qquad \mathcal{M}_{SI} \subseteq \mathcal{M}_{CII} \subsetneq \mathcal{M}_{CIS}
A sketch of the inclusion properties among the models is displayed in Figure 12.
Every set that lies inside M C I S satisfies Property 1 and every set that completely contains M I fulfills Property 2.

2.1.3. em-Algorithm

The calculation of the measure Φ C I I m with
\Phi_{CII}^{m} = \inf_{Q \in \mathcal{M}_{CII}^{m}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q)
can be done by the em-algorithm, a well-known information geometric algorithm. It was proposed by Csiszár and Tusnády in 1984 in Reference [20], and its usage in the context of neural networks with hidden variables was described, for example, by Amari et al. in Reference [21]. The expectation-maximization (EM) algorithm [22] used in statistics is equivalent to the em-algorithm in many cases, including this one, as we will see below. A detailed discussion of the relationship between these algorithms can be found in Reference [23].
In order to calculate the distance between the distribution P ˜ and the set M C I I m on Z we will make use of the extended space of distributions on Z × W m , P ( Z × W m ) . Let M W | Z be the set of all distributions on Z × W m that have Z -marginals equal to the distribution of the whole system P ˜
\mathcal{M}_{W \mid Z} = \{ P \in P(\mathcal{Z} \times \mathcal{W}_m) \mid P(z) = \tilde{P}(z), \; z \in \mathcal{Z} \} = \{ P \in P(\mathcal{Z} \times \mathcal{W}_m) \mid P(z, w) = \tilde{P}(z) \, P(w \mid z), \; (z, w) \in \mathcal{Z} \times \mathcal{W}_m \}.
This is an m-flat submanifold since it is linear w.r.t P ( w | z ) . Therefore there exists a unique e-projection to M W | Z .
The second set that we are going to use is the set E m of distributions that factor according to the split model including the common exterior influence. We have seen this set before in Section 2.1.1.
\mathcal{E}_m = \left\{ P \in P(\mathcal{Z} \times \mathcal{W}_m) \;\middle|\; P(z, w) = P(x) \prod_{i=1}^{n} P(y_i \mid x_i, w) \, P(w), \; (z, w) \in \mathcal{Z} \times \mathcal{W}_m \right\}.
This set is in general not e-flat, but we will show that there is a unique m-projection to it. We are able to use these sets instead of P ˜ and M C I I m because of the following result.
Theorem 3
(Theorem 7 from Reference [21]). The minimum divergence between M W | Z and E m is equal to the minimum divergence between P ˜ and M C I I m in the visible manifold
\inf_{P \in \mathcal{M}_{W \mid Z}, \, Q \in \mathcal{E}_m} D_{\mathcal{Z} \times \mathcal{W}_m}(P \,\|\, Q) = \inf_{\tilde{Q} \in \mathcal{M}_{CII}^{m}} D_{\mathcal{Z}}(\tilde{P} \,\|\, \tilde{Q}).
Proof of Theorem 3.
Let P, Q ∈ P ( Z × W m ). Using the chain rule for the KL-divergence leads to
D_{\mathcal{Z} \times \mathcal{W}_m}(P \,\|\, Q) = D_{\mathcal{Z}}(P \,\|\, Q) + D_{W \mid Z}(P \,\|\, Q),
with
D_{W \mid Z}(P \,\|\, Q) = \sum_{(z, w) \in \mathcal{Z} \times \mathcal{W}_m} P(z, w) \log \frac{P(w \mid z)}{Q(w \mid z)}.
This results in
\inf_{P \in \mathcal{M}_{W \mid Z}, \, Q \in \mathcal{E}_m} D_{\mathcal{Z} \times \mathcal{W}_m}(P \,\|\, Q) = \inf_{P \in \mathcal{M}_{W \mid Z}, \, Q \in \mathcal{E}_m} \left[ D_{\mathcal{Z}}(P \,\|\, Q) + D_{W \mid Z}(P \,\|\, Q) \right] = \inf_{P \in \mathcal{M}_{W \mid Z}, \, Q \in \mathcal{E}_m} \left[ D_{\mathcal{Z}}(\tilde{P} \,\|\, Q) + D_{W \mid Z}(P \,\|\, Q) \right] = \inf_{Q \in \mathcal{M}_{CII}^{m}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q).
 □
The em-algorithm is an iterative algorithm that first performs an e-projection to M W | Z and then an m-projection to E m repeatedly. Let Q 0 E m be an arbitrary starting point and define P 1 as the e-projection of Q 0 to M W | Z
P_1 = \arg\inf_{P \in \mathcal{M}_{W \mid Z}} D_{\mathcal{Z} \times \mathcal{W}_m}(P \,\|\, Q_0).
Now we define Q 1 as the m-projection of P 1 to E m
Q_1 = \arg\inf_{Q \in \mathcal{E}_m} D_{\mathcal{Z} \times \mathcal{W}_m}(P_1 \,\|\, Q).
Repeating this leads to
P_{i+1} = \arg\inf_{P \in \mathcal{M}_{W \mid Z}} D_{\mathcal{Z} \times \mathcal{W}_m}(P \,\|\, Q_i), \qquad Q_{i+1} = \arg\inf_{Q \in \mathcal{E}_m} D_{\mathcal{Z} \times \mathcal{W}_m}(P_{i+1} \,\|\, Q).
The correspondence between these projections in the extended space P ( Z × W m ) and one m-projection in P ( Z ) is illustrated in Figure 13.
The algorithm iterates between the extended spaces M W | Z and E m on the left of Figure 13. Using Theorem 3 we see that this minimization is equivalent to the minimization between P ˜ and M C I I m . The convergence of this algorithm is given by the following result.
Proposition 4
(Theorem 8 from Reference [21]). The monotonic relations
D_{\mathcal{Z} \times \mathcal{W}_m}(P_i \,\|\, Q_i) \geq D_{\mathcal{Z} \times \mathcal{W}_m}(P_{i+1} \,\|\, Q_i) \geq D_{\mathcal{Z} \times \mathcal{W}_m}(P_{i+1} \,\|\, Q_{i+1})
hold, where equality holds only for the fixed points ( P ^ , Q ^ ) M W | Z × E m of the projections
\hat{P} = \arg\inf_{P \in \mathcal{M}_{W \mid Z}} D_{\mathcal{Z} \times \mathcal{W}_m}(P \,\|\, \hat{Q}), \qquad \hat{Q} = \arg\inf_{Q \in \mathcal{E}_m} D_{\mathcal{Z} \times \mathcal{W}_m}(\hat{P} \,\|\, Q).
Proof of Proposition 4.
This is immediate, because of the definitions of the e- and m-projections. □
Hence this algorithm is guaranteed to converge towards a minimum, but this minimum might be local. We will see examples of that in Section 2.2.2.
In order to use this algorithm to calculate Φ C I I we first need to determine how to perform an e- and m-projection in this case. The e-projection from Q E m to M W | Z is given by
P(z, w) = \tilde{P}(z) \, Q(w \mid z),
for all ( z , w ) Z × W m . This is the projection because of the following equality
D_{\mathcal{Z} \times \mathcal{W}_m}(P \,\|\, Q) = \sum_{(z, w) \in \mathcal{Z} \times \mathcal{W}_m} P(z, w) \log \frac{P(z, w)}{Q(z, w)} = \sum_{z \in \mathcal{Z}} \tilde{P}(z) \log \frac{\tilde{P}(z)}{Q(z)} + \sum_{(z, w) \in \mathcal{Z} \times \mathcal{W}_m} P(z, w) \log \frac{P(w \mid z)}{Q(w \mid z)}.
The first addend is a constant for a fixed distribution P ˜ and the second addend is equal to 0 if and only if P ( w | z ) = Q ( w | z ) . Note that this means that the conditional expectation of W remains fixed during the e-projection. This is an important point, because this guarantees the equivalence to the EM algorithm and therefore the convergence towards the MLE. For a proof and examples see Theorem 8.1 in Reference [10] and Section 6 in Reference [23].
After discussing the e-projection, we now consider the m-projection.
Proposition 5.
The m-projection from P ∈ M W | Z to E m is given by
Q(z, w) = P(x) \prod_{i=1}^{n} P(y_i \mid x_i, w) \, P(w)
for all ( z , w ) ∈ Z × W m .
The last remaining decision to be made before calculating Φ C I I is the choice of the initial distribution. Since it depends on the initial distribution whether the algorithm converges towards a local or global minimum, it is important to take the minimal outcome of multiple runs. One class of starting points that immediately lead to an equilibrium, which is in general not minimal, are the ones in which Z and W are independent P 0 ( z , w ) = P 0 ( z ) P 0 ( w ) . It is easy to check that the algorithm converges here to the fixed point P ^
\hat{P}(z, w) = \tilde{P}(x) \, \frac{1}{|\mathcal{W}_m|} \prod_{i=1}^{n} \tilde{P}(y_i \mid x_i), \qquad \hat{P}(z) = \tilde{P}(x) \prod_{i=1}^{n} \tilde{P}(y_i \mid x_i).
Note that this is the result of the m-projection of P ˜ to M S I , the manifold belonging to Φ S I .
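The complete iteration can be summarized in a short sketch. The following is our own illustration of the em-algorithm for Φ C I I m under the binary array convention used above (axes X_1, …, X_n, Y_1, …, Y_n plus a final axis of size m for W); the function names are hypothetical and strictly positive distributions are assumed.

import numpy as np

def e_projection(P_tilde, Q):
    """P(z, w) = P_tilde(z) * Q(w | z): keep the Z-marginal fixed at P_tilde."""
    Q_z = Q.sum(axis=-1, keepdims=True)
    return P_tilde[..., np.newaxis] * Q / Q_z

def m_projection(P, n):
    """Q(z, w) = P(x) * prod_i P(y_i | x_i, w) * P(w): project onto the split set E_m."""
    axes = tuple(range(P.ndim))
    x_axes, y_axes, w_axis = axes[:n], axes[n:2 * n], axes[-1]
    P_x = P.sum(axis=y_axes + (w_axis,), keepdims=True)
    P_w = P.sum(axis=x_axes + y_axes, keepdims=True)
    Q = P_x * P_w
    for i in range(n):
        keep = (x_axes[i], y_axes[i], w_axis)
        drop = tuple(ax for ax in axes if ax not in keep)
        P_xiyiw = P.sum(axis=drop, keepdims=True)
        P_xiw = P_xiyiw.sum(axis=y_axes[i], keepdims=True)
        Q = Q * (P_xiyiw / P_xiw)            # multiply in the factor P(y_i | x_i, w)
    return Q

def phi_CII_m(P_tilde, n, m, iterations=500, seed=0):
    rng = np.random.default_rng(seed)
    Q = rng.random(P_tilde.shape + (m,))
    Q /= Q.sum()
    Q = m_projection(Q, n)                   # random starting point Q_0 in E_m
    for _ in range(iterations):
        P = e_projection(P_tilde, Q)         # e-projection to M_{W|Z}
        Q = m_projection(P, n)               # m-projection to E_m
    Q_z = Q.sum(axis=-1)                     # marginal on Z lies in M_CII^m
    return float(np.sum(P_tilde * np.log(P_tilde / Q_z)))

# In practice the loop is restarted from several random initial distributions and the
# minimal value is kept, since the algorithm may converge to a local minimum.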

2.2. Comparison

In order to compare the different measures, we need a setting in which we generate the probability distributions of full systems. We chose to use weighted Ising models as described in the next section.

2.2.1. Ising Model

The distributions used to compare the different measures in the next section are generated by weighted Ising models, also known as binary auto-logistic models, as described in Reference [24], Example 3.2.3. Let us consider n binary variables X = ( X 1 , … , X n ) , 𝒳 = { − 1 , 1 } n . The matrix V ∈ ℝ n × n contains the weights v i j of the connection from X i to Y j as displayed in Figure 14. Note that this figure is not a graphical model corresponding to the stationary distribution, but merely displays the connections of the conditional distribution of Y j = y j given X = x with the respective weights
P(y_j \mid x) = \frac{1}{1 + e^{-2 \beta \sum_{i=1}^{n} v_{ij} x_i y_j}}.
The inverse temperature β > 0 regulates the coupling strength between the nodes. For β close to zero the different nodes are almost independent and as β grows the connections become stronger.
We are calculating the stationary distribution P ^ by starting with a random initial distribution P 0 and then multiplying by (8) in the following way
P_{t+1}(y) = \sum_{x \in \mathcal{X}} P_t(x) \cdot \prod_{j=1}^{n} P(y_j \mid x),
this leads to
\hat{P} = \lim_{t \to \infty} P_t.
There always exists a unique stationary distribution, see for instance Reference [24], Theorem 5.1.2.
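A sketch of this construction is given below (our own illustration, assuming states in {−1, 1} and the weight convention v i j from X i to Y j ); the matrix T collects the conditional distributions of (8), and iterating it yields the stationary distribution.

import itertools
import numpy as np

def transition_probability(y, x, V, beta):
    """P(Y = y | X = x) = prod_j 1 / (1 + exp(-2 * beta * y_j * sum_i V[i, j] * x_i))."""
    field = beta * (np.asarray(x) @ V)                 # local field at each node j
    return float(np.prod(1.0 / (1.0 + np.exp(-2.0 * field * np.asarray(y)))))

def stationary_distribution(V, beta, iterations=2000):
    n = V.shape[0]
    states = list(itertools.product((-1, 1), repeat=n))
    T = np.array([[transition_probability(y, x, V, beta) for y in states] for x in states])
    p = np.full(len(states), 1.0 / len(states))        # arbitrary initial distribution P_0
    for _ in range(iterations):
        p = p @ T                                      # P_{t+1}(y) = sum_x P_t(x) T(x, y)
    return states, p

# The joint distribution of the full system is then P(x, y) = p(x) * T(x, y).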

2.2.2. Results

In this section we are going to compare the different measures experimentally. Note that we do not have an exterior influence in these examples, so that Φ T = Φ S I holds.
To distinguish between the Causal Information Integration Φ C I I calculated with different sized state spaces of W, we will denote
\Phi_{CII}^{m} = \inf_{Q \in \mathcal{M}_{CII}^{m}} D_{\mathcal{Z}}(\tilde{P} \,\|\, Q).
We start with the smallest example possible, with n = 2 , and the weight matrix
V = \begin{pmatrix} 0.0084181 & 0.2401545 \\ 0.39270161 & 0.37198751 \end{pmatrix}
shown in Figure 15. In this example every measure is bounded by Φ I , and the measures Φ I , Φ G and Φ S I display a limit behavior different from Φ C I S and the Φ C I I m . The state spaces of W have the sizes 2, 3, 4, 36 and 92, and the respective measures are displayed in shades of blue that get darker as the state space gets larger. In every case the em-algorithm has been initiated 100 times with a random input distribution in order to find a global minimum. Minimizing over the outcome of 100 different runs turns out to be sufficient, at least empirically, to reveal the behavior of the global minima. On the right side of this figure, we are able to see the difference between Φ C I S and Φ C I I . Considering the precision of the algorithms, we assume that a difference smaller than 5e-07 is approximately zero. We can see that in a region from β = 15 to β = 25 the measures differ even in the case of 92 hidden states. So this small case already hints towards M C I I ⊊ M C I S .
Increasing n from 2 to 3 makes the difference even more visible, as we can see in Figure 16 produced with the weight matrix
V = \begin{pmatrix} 0.43478388 & 0.47448218 & 0.36808313 \\ 0.52117467 & 0.00672578 & 0.7387737 \\ 0.56114795 & 0.96941243 & 0.76408711 \end{pmatrix}.
Here we are able to observe a difference in the behavior of Φ G compared to the other measures, since we see that Φ I , Φ S I , Φ C I I and Φ C I S are still increasing around β ≈ 1.1 , while Φ G starts to decrease.
Now, we are going to focus on an example with 5 nodes. Since it is very time consuming to calculate Φ C I S for more than 3 nodes, we are going to restrict attention to Φ I , Φ G , Φ S I and Φ C I I . The weight matrix
V = \begin{pmatrix} 0.35615839 & 0.09775903 & 0.89743801 & 0.00604247 & 0.03897772 \\ 0.2260056 & 0.47769717 & 0.4302256 & 0.18692707 & 0.25140741 \\ 0.86081159 & 0.18348132 & 0.71528754 & 0.08100602 & 0.64364176 \\ 0.13967234 & 0.03233011 & 0.81057654 & 0.33327558 & 0.57447322 \\ 0.18920264 & 0.99054716 & 0.32088358 & 0.69100397 & 0.69206604 \end{pmatrix}
produces Figure 17. This example shows that Φ S I is not bounded by Φ I and therefore does not satisfy Property 2. Since the focus in this example lies on the relationship between Φ S I and Φ I , the em-algorithm was run with ten different input distributions for each step.
Using this example, we are going to take a closer look at the local minima the em-algorithm converges to. Considering only Φ C I I and varying the size of the state space leads to the upper part in Figure 18. This figure displays ten different runs of the em-algorithm with each size of state space in different shades of the respective color, namely blue for Φ C I I 2 , violet for Φ C I I 4 , red for Φ C I I 8 and orange for Φ C I I 16 . Note that we display the outcomes of every run in this case and not only the minimal one, since we are interested in the local minima. We are able to observe how increasing the state space leads to a smaller value of Φ C I I . Additionally, the differences between the minimal values corresponding to each state space grow smaller and converge as the state spaces increase.
The bottom half of Figure 18 highlights an observation that we made. Each of the four illustrations is a copy of the one above, where the differences between the minima are shaded in the respective color. By increasing the size of the state space, the difference in value between the various local minima decreases visibly. We think this is consistent with the general observation made in the context of high dimensional optimization, for example in Reference [25], in which the authors conjecture that the probability of finding a high-valued local minimum decreases when the network size grows.
Letting the algorithm run only once with | W | = 2 on the same data leads to a curve on the left in Figure 19.
The sets E defined in (7) and M C I I (5) do not change for different values of β, and therefore we have a fixed set of local minima for a fixed state space of W. What does change with different β is which of the local minima are global minima. The vertical dotted lines represent the steps from P β t to P β t + 1 in which the KL-divergence between the projections to M C I I is greater than 0.2,
D_{\mathcal{Z}}(P_{\beta_t} \,\|\, P_{\beta_{t+1}}) > 0.2,
meaning that inside the different sections of the curve, the projections to M C I I are close. As β increases, a different region of local minima becomes global. A sketch of this is shown in Figure 20.
The curve is colored according to the distribution of W as shown on the right side of Figure 19. We see that a different distribution on W results in a different minimum, except for the region between 7.5 and 8. The colors light blue and yellow refer to distributions on W that are different, but symmetric in the following way. Consider two different distributions Q , Q ^ on Z × W such that
Q ( z , w 1 ) = Q ^ ( z , w 2 ) and Q ( z , w 2 ) = Q ^ ( z , w 1 )
for all z Z . Then the corresponding marginalized distributions in M C I I 2 are equal
\sum_{w} Q(z, w) = \sum_{w} \hat{Q}(z, w).
This symmetry is the reason for the different colors in the region between 7.5 and 8.
Using this geometric algorithm we therefore gain a notion of the local minima on E .

3. Discussion

This article discusses a selection of existing complexity measures in the context of Integrated Information Theory that follow the framework introduced in Reference [7], namely Φ S I , Φ G and Φ C I S . The main contribution is the proposal of a new measure, Causal Information Integration Φ C I I .
In Reference [4] and Reference [5] the authors postulate a Markov condition, ensuring the removal of the causal cross-connections, and an upper bound, given by the mutual information Φ I , for valid Integrated Information measures. Although Φ S I is not bounded by Φ I , as we see in Figure 17, it does measure the intrinsic causal cross-connections in a setting in which there exists no common exterior influences. Therefore the authors of Reference [12] criticize this bound. Since wrongly assuming the existence of a common exterior influence might lead to a value that does not measure all the intrinsic causal influences, the question which measure to use strongly depends on how much we know about the system and its environment. We argue that using Φ I as an upper bound in the cases in which we have an unknown common exterior influence is reasonable. The measure Φ G attempts to extend Φ S I to a setting with exterior influences, but it does not satisfy the Markov condition postulated in Reference [4].
One measure that fulfills all the requirements of this framework is Φ C I S , but it has no graphical representation. Hence the causal nature of the measured information flow is difficult to analyze. We present in Example 1 a submodel of M C I S with a causal structure that does not lie inside the set of Markovian processes \mathcal{M}_P(\mathcal{Z}) that we discuss in this article. Therefore, by projecting to M C I S we might project to a distribution that still holds some of the integrated information of the original system, although it does not have any causal cross-connections. Additionally, we demonstrate that M C I S does not correspond to a graphical representation, even after adding any number of latent variables to the model of M S I . This conflicts with the strong connection between conditional independence statements and graphs in Pearl's causality theory. For discrete variables, Φ C I S does not have a closed form solution and has to be calculated numerically.
We propose a new measure Φ C I I that also satisfies all the conditions and additionally has a graphical and intuitive interpretation. Numerically calculated examples indicate that Φ C I I ≥ Φ C I S . The definition of Φ C I I explicitly includes an exterior influence as a latent variable and therefore aims at measuring only intrinsic causal influences. This measure should be used in the setting in which there exists an unknown common exterior influence. By assuming the existence of a ground truth, we are able to prove that our new measure is bounded from above by the ultimate value of Integrated Information Φ T of this system. Although Φ C I I also has no analytical solution, we are able to use the information geometric em-algorithm to calculate it. The em-algorithm is guaranteed to converge towards a minimum, but this minimum might only be a local one. Even after letting our smallest example, depicted in Figure 15, run with 100 random input distributions, we still obtain local minima. On the other hand, in our experience the em-algorithm seems to be more reliable, and for larger networks faster, than the numerical methods we used to calculate Φ C I S . Additionally, by letting the algorithm run multiple times we are able to gain a notion of how the local minima in E are related to each other, as demonstrated in Figure 19.

4. Materials and Methods

The distributions used in Section 2.2.2 were generated by a Python program and the measures Φ I , Φ C I I , Φ S I and Φ G are implemented in C++. The SciPy function scipy.optimize.minimize has been used to calculate Φ C I S . The code is available at Reference [26].
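To illustrate the structure of the computation of Φ C I I , the following is a minimal sketch of the em-algorithm for n = 2 binary nodes. It is our own simplified reconstruction, not the published implementation from Reference [26]; all function names, the choice | W | = 2 and the iteration count are arbitrary. The e-step completes the observed distribution with the conditional distribution of W under the current model, and the m-step uses the closed form projection onto E from Proposition 5.

```python
# Sketch of the information geometric em-algorithm for Phi_CII, n = 2 binary nodes.
# Observed variables Z = (X1, X2, Y1, Y2); W is the latent common exterior influence.
import numpy as np

rng = np.random.default_rng(0)

def random_distribution(shape):
    p = rng.random(shape)
    return p / p.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def phi_cii(P_z, n_w=2, iterations=200):
    """P_z has shape (2, 2, 2, 2), indexed by (x1, x2, y1, y2), and is strictly positive."""
    # random (strictly positive) initial model Q on Z x W
    Q = random_distribution(P_z.shape + (n_w,))
    for _ in range(iterations):
        # e-projection: complete the data with the conditional of W, P(z, w) = P_z(z) Q(w | z)
        P_full = P_z[..., None] * Q / Q.sum(axis=-1, keepdims=True)
        # m-projection onto E (Proposition 5): Q(x) prod_i Q(y_i | x_i, w) Q(w)
        P_x = P_full.sum(axis=(2, 3, 4))                        # P(x1, x2)
        P_w = P_full.sum(axis=(0, 1, 2, 3))                     # P(w)
        P_x1y1w = P_full.sum(axis=(1, 3))                       # P(x1, y1, w)
        P_x2y2w = P_full.sum(axis=(0, 2))                       # P(x2, y2, w)
        Q_y1 = P_x1y1w / P_x1y1w.sum(axis=1, keepdims=True)     # P(y1 | x1, w)
        Q_y2 = P_x2y2w / P_x2y2w.sum(axis=1, keepdims=True)     # P(y2 | x2, w)
        Q = (P_x[:, :, None, None, None]
             * Q_y1[:, None, :, None, :]
             * Q_y2[None, :, None, :, :]
             * P_w[None, None, None, None, :])
    # Phi_CII is the divergence between the data and the marginalized model
    return kl(P_z, Q.sum(axis=-1))

if __name__ == "__main__":
    P = random_distribution((2, 2, 2, 2))
    # repeated runs with different random initializations guard against local minima
    print(min(phi_cii(P, n_w=2) for _ in range(10)))
```

Repeated runs with different random initializations, as in the last line, are one simple way to deal with the local minima discussed in Section 3.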

Author Contributions

Conceptualization, N.A. and C.L.; methodology, N.A. and C.L.; software, C.L.; investigation, C.L.; writing, C.L.; supervision, N.A.; project administration, N.A.; funding acquisition, N.A. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge funding by Deutsche Forschungsgemeinschaft Priority Programme “The Active Self” (SPP 2134).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Graphical Models

Graphical models are a useful tool to visualize conditional independence structures. In this method a graph is used to describe the set of distributions that factor according to it. In our case we are considering chain graphs. These are graphs, with vertex set V and edge set E ⊆ V × V , consisting of directed and undirected edges such that we are able to partition the vertex set into subsets V = V 1 ∪ ⋯ ∪ V m , called chain components, with the properties that all edges between different subsets are directed, all edges between vertices of the same chain component are undirected and there are no directed cycles between chain components. For a vertex set τ we will denote by p a ( τ ) the set of parents of elements in τ , that is, the vertices α with a directed edge from α to an element of τ . Vertices connected by an undirected edge are called neighbours. A more detailed description can be found in Reference [16].
Definition A1.
Let T be the set of chain components. A distribution factorizes with respect to a chain graph G if it can be written as
$$P(x) = \prod_{\tau \in T} P\big(x_\tau \mid x_{pa(\tau)}\big),$$
where the structure of P ( x τ | x p a ( τ ) ) can be described in more detail. Let A ( τ ) , τ ∈ T , be the set of all subsets of τ ∪ p a ( τ ) that are complete in the undirected graph with vertex set τ ∪ p a ( τ ) whose edges are those edges between elements of τ ∪ p a ( τ ) that exist in G , together with all edges between elements of p a ( τ ) . An undirected graph is complete if every pair of distinct vertices is connected by an edge. Then there are non-negative functions ϕ a such that
$$P\big(x_\tau \mid x_{pa(\tau)}\big) = \prod_{a \in A(\tau)} \phi_a(x).$$
If τ is a singleton, then τ is already complete. There are different kinds of independence statements a chain graph can encode, but we only need the global chain graph Markov property. In order to define this property we need the concepts of an ancestral set and a moral graph.
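As a small illustration (our example, not part of the original text), consider the chain graph with chain component { X 1 , X 2 } , joined by an undirected edge, and the singletons { Y 1 } , { Y 2 } , with directed edges X i → Y i . Definition A1 yields the factorization
$$P(x_1, x_2, y_1, y_2) = P(x_1, x_2)\, P(y_1 \mid x_1)\, P(y_2 \mid x_2),$$
which is the same factorization that appears in the proof of Theorem 2 for the left graph in Figure A1.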
The boundary b d ( A ) of a set A ⊆ V is the set of vertices in V \ A that are parents or neighbours of vertices in A. If b d ( α ) ⊆ A for all α ∈ A , we call A an ancestral set. For any A ⊆ V there exists a smallest ancestral set containing A, because the intersection of ancestral sets is again an ancestral set. This smallest ancestral set of A is denoted by A n ( A ) .
Let G be a chain graph. The moral graph of G is an undirected graph, denoted by G m , that has the same vertex set as G and in which two vertices α , β are connected if and only if either they were already connected by an edge in G or if there are vertices γ , δ belonging to the same chain component such that α → γ and β → δ .
Definition A2.
(Global Chain Graph Markov Property). Let P be a distribution on Z and G a chain graph. P satisfies the global chain Markov property with respect to G if, for any triple ( Z A , Z B , Z S ) of disjoint subsets of Z such that Z S separates Z A from Z B in ( G A n ( Z A ∪ Z B ∪ Z S ) ) m , the moral graph of the smallest ancestral set containing Z A ∪ Z B ∪ Z S , the conditional independence
$$Z_A \perp\!\!\!\perp Z_B \mid Z_S$$
holds.
Since we are only considering positive discrete distributions, we have the following result.
Lemma A1.
The global chain Markov property and the factorization property are equivalent for positive discrete distributions.
Proof of Lemma A1.
Theorem 4.1 from Reference [27] combined with the Hammersley–Clifford theorem, for example, Theorem 2.9 in Reference [28], proves this statement. □
In order to understand the conditional independence structure of a chain graph after marginalization, we need the following algorithm from Reference [17]. This algorithm converts a chain graph with latent variables into a chain mixed graph with the conditional independence structure of the marginalized chain graph. A chain mixed graph has, in addition to directed and undirected edges, also bidirected edges, called arcs. The condition that there are no semi-directed cycles also applies to chain mixed graphs.
Definition A3.
Let M be the set of vertices over which we want to marginalize. The following algorithm produces a chain mixed graph (CMG) with the conditional independence structure of the marginalized chain graph.
1. 
Generate an ij edge as in Table A1, steps 8 and 9, between i and j on a collider trislide with an endpoint j and an endpoint in M if the edge of the same type does not already exist.
2. 
Generate an appropriate edge as in Table A1, steps 1 to 7, between the endpoints of every tripath with inner node in M if the edge of the same type does not already exist. Apply this step until no other edge can be generated.
3. 
Remove all nodes in M.
Table A1. Types of edge induced by tripaths with inner node m ∈ M and trislides with endpoint m ∈ M.
1. i ← m ← j   generates   i ← j
2. i ← m – j   generates   i ← j
3. i ↔ m – j   generates   i ↔ j
4. i ← m → j   generates   i ↔ j
5. i ← m ↔ j   generates   i ↔ j
6. i – m ← j   generates   i ← j
7. i – m – j   generates   i – j
8. m → i – ⋯ – j   generates   i ← j
9. m ↔ i – ⋯ – j   generates   i ↔ j
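As a small worked example (ours, not from the original text) of how Table A1 is used: marginalizing over a single latent node W in the tripath Y 1 ← W → Y 2 falls under row 4 of the table and generates the arc Y 1 ↔ Y 2 . This is precisely the step used in the proof of Theorem 2, where a common exterior influence on X i and Y j induces an arc X i ↔ Y j and thereby destroys the required conditional independence structure.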
Conditional independence in CMGs is defined using the concept of c-separation, see for example Section 4 in Reference [17]. For this definition we need the concepts of a walk and of a collider section. A walk is a list of vertices α 0 , … , α k , k ∈ ℕ , such that there is an edge or an arrow between α i and α i + 1 for every i ∈ { 0 , … , k − 1 } . A set of vertices connected by undirected edges is called a section. If a walk contains a section such that an arrow points at the first and at the last vertex of the section, then this section is called a collider section.
Definition A4 (c-separation).
Let A , B and C be disjoint sets of vertices of a graph. A walk π is called a c-connecting walk given C if every collider section of π has a node in C and all non-collider sections are disjoint from C. The node sets A and B are called c-separated given C if there are no c-connecting walks between them given C, and we write A ⊥ c B | C .
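As a small illustration (ours, not part of the original text): in the chain mixed graph X 1 → Y 1 ↔ Y 2 ← X 2 the only walk between X 1 and X 2 passes through the collider sections { Y 1 } and { Y 2 } . Given C = ∅ neither collider section has a node in C, so the walk is not c-connecting and X 1 and X 2 are c-separated. Given C = { Y 1 , Y 2 } the walk is c-connecting, so X 1 and X 2 are not c-separated given both Y-nodes.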

Appendix B. Proofs

Proof of the Relationship (4).
For n = 2 this is immediate. Let now n ≥ 3 and i , j , k ∈ { 1 , … , n } with i ≠ j ≠ k ≠ i . Applying (1) two times leads to
$$Q(y_j, x) = \frac{Q(y_j, x_{I\setminus\{i\}})\, Q(x)}{Q(x_{I\setminus\{i\}})}, \qquad Q(y_j, x) = \frac{Q(y_j, x_{I\setminus\{k\}})\, Q(x)}{Q(x_{I\setminus\{k\}})} \;\;\Longrightarrow\;\; Q(y_j, x_{I\setminus\{i\}})\, Q(x_{I\setminus\{k\}}) = Q(y_j, x_{I\setminus\{k\}})\, Q(x_{I\setminus\{i\}})$$
for all ( x , y j ) ∈ X × Y j . Marginalizing over the elements of X k yields
$$Q(y_j, x_{I\setminus\{i,k\}})\, Q(x_{I\setminus\{k\}}) = Q(y_j, x_{I\setminus\{k\}})\, Q(x_{I\setminus\{i,k\}}) \;\;\Longleftrightarrow\;\; Q(y_j \mid x_{I\setminus\{i,k\}}) = Q(y_j \mid x_{I\setminus\{k\}}).$$
Using inductively the remaining relations results in (4). □
Proof of Proposition 1.
If Φ C I I ( P ˜ ) = 0 holds, then
$$\inf_{Q \in M_{CII}} D_Z(\tilde{P} \,\|\, Q) = 0 .$$
Since M C I I is compact, the infimum is attained at an element of M C I I , so there exists Q ∈ M C I I such that D Z ( P ˜ ‖ Q ) = 0 . Therefore P ˜ ∈ M C I I and the existence of a sequence Q m follows from the definition of M C I I .
Assume that there exists a sequence Q m that satisfies 1. and 2. Then every element Q m lies in M C I I m by definition and the limit satisfies
$$\tilde{P} \in \overline{\bigcup_{m \in \mathbb{N}} M_{CII}^{m}} = M_{CII} .$$
Hence
$$\Phi_{CII}(\tilde{P}) = \inf_{Q \in M_{CII}} D_Z(\tilde{P} \,\|\, Q) = D_Z(\tilde{P} \,\|\, \tilde{P}) = 0 .$$
 □
Proof of Proposition 2.
Let P ∈ E f and Q ∈ E . Then the KL-divergence between the two elements is
$$
\begin{aligned}
D_{Z \times W_m}(P \,\|\, Q)
&= \sum_{z,w} P(z,w)\,\log\frac{P(x)\prod_i P(y_i \mid x, w)\, P(w)}{Q(x)\prod_i Q(y_i \mid x_i, w)\, Q(w)}\\
&= \sum_{x} P(x)\log\frac{P(x)}{Q(x)}
 + \sum_{z,w} P(z,w)\,\log\frac{\prod_i P(y_i \mid x, w)}{\prod_i Q(y_i \mid x_i, w)}
 + \sum_{w} P(w)\log\frac{P(w)}{Q(w)}\\
&\ge \sum_{x} P(x)\log\frac{P(x)}{P(x)}
 + \sum_{z,w} P(z,w)\,\log\frac{\prod_i P(y_i \mid x, w)}{\prod_i P(y_i \mid x_i, w)}
 + \sum_{w} P(w)\log\frac{P(w)}{P(w)}\\
&= \sum_{z,w} P(z,w)\,\log\frac{\prod_i P(y_i \mid x, w)}{\prod_i P(y_i \mid x_i, w)} .
\end{aligned}
$$
The inequality holds because in the first and third addend we can use that the cross entropy is greater than or equal to the entropy, and in the second addend we use the log-sum inequality in the following way:
$$
\begin{aligned}
&\sum_{z,w} P(z,w)\,\log\frac{\prod_i P(y_i \mid x, w)}{\prod_i Q(y_i \mid x_i, w)}
 - \sum_{z,w} P(z,w)\,\log\frac{\prod_i P(y_i \mid x, w)}{\prod_i P(y_i \mid x_i, w)}\\
&\qquad = \sum_{x,w} P(x)P(w) \sum_{y} \prod_i P(y_i \mid x, w)\,\log\frac{\prod_i P(y_i \mid x_i, w)}{\prod_i Q(y_i \mid x_i, w)}\\
&\qquad \ge \sum_{x,w} P(x)P(w) \sum_{y} \prod_i P(y_i \mid x, w)\,\log\frac{\sum_{y}\prod_i P(y_i \mid x_i, w)}{\sum_{y}\prod_i Q(y_i \mid x_i, w)} = 0 .
\end{aligned}
$$
Therefore the new integrated information measure results in
$$\inf_{Q \in E} D_{Z \times W_m}(P \,\|\, Q) = \sum_{z,w} P(z,w)\,\log\frac{\prod_i P(y_i \mid x, w)}{\prod_i P(y_i \mid x_i, w)} .$$
This can be rewritten to
$$
\begin{aligned}
\sum_{z,w} P(z,w)\,\log\frac{\prod_i P(y_i \mid x, w)}{\prod_i P(y_i \mid x_i, w)}
&= \sum_{z,w} P(z,w)\,\log\frac{\prod_i P(y_i, x, w)\, P(x_i, w)}{\prod_i P(y_i, x_i, w)\, P(x, w)}\\
&= \sum_{z,w} P(z,w)\,\log\frac{\prod_i P(y_i, x_{I\setminus\{i\}} \mid x_i, w)\, P(x_i, w)}{\prod_i P(y_i \mid x_i, w)\, P(x, w)}\\
&= \sum_{z,w} P(z,w)\,\log\frac{\prod_i P(y_i, x_{I\setminus\{i\}} \mid x_i, w)}{\prod_i P(y_i \mid x_i, w)\, P(x_{I\setminus\{i\}} \mid x_i, w)}\\
&= \sum_i I\big(Y_i; X_{I\setminus\{i\}} \mid X_i, W\big) .
\end{aligned}
$$
 □
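The identity in Proposition 2 can also be checked numerically. The following is a small sanity-check sketch (our own construction, not part of the original material) for n = 2 binary nodes and an arbitrarily chosen | W | = 3: it draws a random element of E f, projects it onto E by replacing P ( y i | x , w ) with P ( y i | x i , w ) , and compares the resulting divergence with the sum of conditional mutual informations.

```python
# Numerical sanity check of Proposition 2 for n = 2 binary nodes (a sketch, our construction).
import numpy as np

rng = np.random.default_rng(1)
n_w = 3

def normalize(p, axis=None):
    return p / p.sum(axis=axis, keepdims=axis is not None)

# random full model P(x) prod_i P(y_i | x, w) P(w) on the axes (x1, x2, y1, y2, w)
P_x = normalize(rng.random((2, 2)))                     # P(x1, x2)
P_w = normalize(rng.random(n_w))                        # P(w)
P_y1 = normalize(rng.random((2, 2, 2, n_w)), axis=2)    # P(y1 | x1, x2, w)
P_y2 = normalize(rng.random((2, 2, 2, n_w)), axis=2)    # P(y2 | x1, x2, w)
P = (P_x[:, :, None, None, None] * P_w[None, None, None, None, :]
     * P_y1[:, :, :, None, :] * P_y2[:, :, None, :, :])

# projection onto E: Q(x) = P(x), Q(y_i | x_i, w), Q(w) = P(w)
P_x1y1w = P.sum(axis=(1, 3))                            # P(x1, y1, w)
P_x2y2w = P.sum(axis=(0, 2))                            # P(x2, y2, w)
Q_y1 = P_x1y1w / P_x1y1w.sum(axis=1, keepdims=True)     # P(y1 | x1, w)
Q_y2 = P_x2y2w / P_x2y2w.sum(axis=1, keepdims=True)     # P(y2 | x2, w)
Q = (P_x[:, :, None, None, None] * Q_y1[:, None, :, None, :]
     * Q_y2[None, :, None, :, :] * P_w[None, None, None, None, :])
divergence = np.sum(P * np.log(P / Q))

# I(Y1; X2 | X1, W) from the joint P(x1, x2, y1, w)
J1 = P.sum(axis=3)
I1 = np.sum(J1 * np.log(J1 * J1.sum(axis=(1, 2))[:, None, None, :]
                        / (J1.sum(axis=2)[:, :, None, :] * J1.sum(axis=1)[:, None, :, :])))
# I(Y2; X1 | X2, W) from the joint P(x1, x2, y2, w)
J2 = P.sum(axis=2)
I2 = np.sum(J2 * np.log(J2 * J2.sum(axis=(0, 2))[None, :, None, :]
                        / (J2.sum(axis=2)[:, :, None, :] * J2.sum(axis=0)[None, :, :, :])))

print(divergence, I1 + I2)   # the two values agree up to floating point error
```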
Proof of Proposition 3.
By using the log-sum inequality we get
$$
\begin{aligned}
\Phi_{CII}^{m}
&= \inf_{Q \in M_{CII}^{m}} \sum_{z} P(z)\,\log\frac{\sum_w P(x)\prod_i P(y_i \mid x, w)\, P(w)}{\sum_w Q(x)\prod_i Q(y_i \mid x_i, w)\, Q(w)}\\
&\le \inf_{Q \in M_{CII}^{m}} \sum_{w}\sum_{z} P(z,w)\,\log\frac{P(x)\prod_i P(y_i \mid x, w)\, P(w)}{Q(x)\prod_i Q(y_i \mid x_i, w)\, Q(w)}
 = \inf_{Q \in E} D_{Z \times W_m}(P \,\|\, Q) .
\end{aligned}
$$
The fact that every element Q ∈ E corresponds via marginalization to an element in M C I I m , and that every element in M C I I m has at least one corresponding element in E , leads to the equality in the last row. Since taking the infimum over a larger space can only decrease the value further, the relation
$$\Phi_{CII} \le \Phi_{T}$$
holds. □
Proof of Proposition 5.
$$
\begin{aligned}
D_{Z \times W_m}(P \,\|\, Q)
&= \sum_{(z,w) \in Z \times W_m} P(z,w)\,\log\frac{P(z,w)}{Q(x)\prod_{i=1}^{n} Q(y_i \mid x_i, w)\, Q(w)}\\
&= \sum_{(z,w)} P(z,w)\log P(z,w)
 + \sum_{(z,w)} P(z,w)\log\frac{1}{Q(x)}
 + \sum_{(z,w)}\sum_{i=1}^{n} P(z,w)\log\frac{1}{Q(y_i \mid x_i, w)}
 + \sum_{(z,w)} P(z,w)\log\frac{1}{Q(w)} .
\end{aligned}
$$
The first addend depends only on P and the others are cross-entropies, which are greater than or equal to the corresponding entropies, so that
$$
\begin{aligned}
D_{Z \times W_m}(P \,\|\, Q)
&\ge \sum_{(z,w)} P(z,w)\log P(z,w)
 + \sum_{(z,w)} P(z,w)\log\frac{1}{P(x)}
 + \sum_{(z,w)}\sum_{i=1}^{n} P(z,w)\log\frac{1}{P(y_i \mid x_i, w)}
 + \sum_{(z,w)} P(z,w)\log\frac{1}{P(w)}\\
&= \sum_{(z,w)} P(z,w)\,\log\frac{P(z,w)}{P(x)\prod_{i=1}^{n} P(y_i \mid x_i, w)\, P(w)} .
\end{aligned}
$$
Therefore this projection is unique. □
Proof of Theorem 2.
We need a way to understand the connections in a graph after marginalization. In Reference [17] Sadeghi presents an algorithm that converts a chain graph into a chain mixed graph representing the Markov properties of the original graph after marginalizing, see Definition A3.
Although the actual set of distributions after marginalizing might be more complicated, it is a subset of the distributions factorizing according to the new graph, if the new graph is still a chain graph. This is due to the equivalence of the global chain Markov property and the factorization property in Lemma A1.
At first we will consider the case of two nodes per time step, n = 2, and take a close look at the possible ways a hidden structure could be connected to the left graph in Figure A1. We start with the possible connections between two nodes, depicted on the right in Figure A1. The boxes stand for any kind of subgraph of hidden nodes such that the whole graph is still a chain graph, and the two-headed dotted arrows stand for a line or an arrow in either direction. Consider two nodes A and B; then the connections including a box between the nodes can take one of the following five forms:
(1) they form an undirected path between A and B,
(2) they form a directed path from A to B,
(3) they form a directed path from B to A,
(4) there exists a collider,
(5) A and B have a common exterior influence.
A collider is a node, or a set of nodes connected by undirected edges, that has an arrow pointing at the set at both ends.
Figure A1. Starting graph and possible two way interactions.
We will start with the gridded hidden structure connected to X 1 and X 2 . Since there already is an undirected edge between the X i s an undirected path would make no difference in the marginalized model. The cases (2) and (3) would form a directed cycle which violates the requirements of a chain mixed graph. A collider would also make no difference, since it disappears in the marginalized model. A common exterior influence leads to
$$P(\hat{w})\, P(x \mid \hat{w})\, P(y_1 \mid x_1)\, P(y_2 \mid x_2) = P(x, \hat{w})\, P(y_1 \mid x_1)\, P(y_2 \mid x_2), \qquad \sum_{\hat{w}} P(x, \hat{w})\, P(y_1 \mid x_1)\, P(y_2 \mid x_2) = P(x)\, P(y_1 \mid x_1)\, P(y_2 \mid x_2) ,$$
so it has no effect on the marginalized model either.
Now let us discuss these possibilities in the case of a gray hidden structure between X i and Y j , i , j ∈ { 1 , 2 } , i ≠ j . An undirected path (1) or a directed path (3) would create a directed cycle. A directed path (2) from X i to Y j would lead to a chain graph in which X i and Y j are not conditionally independent given X j . If there exists a collider (4) in the hidden structure, then nothing else in the graph depends on this part of the structure and it reduces to a factor of one when we marginalize over the hidden variables. Therefore the path between X i and Y j gets interrupted, leaving a potential external influence or effect; those do not have an additional impact on the marginalized model. A common exterior influence (5) leads to a chain mixed graph which does not satisfy the necessary conditional independence structure, because applying the algorithm of Definition A3 leads to an arc between X i and Y j , hence they are c-connected in the sense of Definition A4.
The next possibility is a dotted hidden structure between X i and Y i , i ∈ { 1 , 2 } . An undirected path (1) and a directed path (3) would lead to a directed cycle. A directed path (2) would add no new structure to the model, since there already is a directed edge between X i and Y i . A collider (4) does not have an effect on the marginalized model. Adding a common exterior influence W 1 on X 1 , Y 1 results in a new model which is not symmetric in i ∈ { 1 , 2 } and does not include M I , therefore it does not fully contain M C I I . Adding additional common exterior influences W 2 on X 2 , Y 2 or on Y 1 , Y 2 , in order to include M I in the new model, violates the conditional independence statements, since nodes in W 1 and W 2 are connected in the moralized graph.
The last hidden structure between two nodes is the striped one between the Y i s. An undirected path (1) or any directed path (2), (3) leads to a graph that does not satisfy the conditional independence statements. A collider (4) has no impact on the model and a common exterior influence (5) leads to the definition of Causal Information Integration.
Connecting Y 1 , Y 2 and X i , i ∈ { 1 , 2 } , leads either to a violation of the conditional independence statements or contains a collider, in which case the marginalized model reduces to one of the cases above.
All the possible ways a hidden structure could be connected to three nodes X 1 , X 2 , Y 1 by directed edges are shown in Figure A2. Replacing any of these edges by an undirected edge would either make no difference or lead to a model that does not satisfy the conditional independence statements. In this case the black boxes represent sections. More complicated hidden structures reduce to this case, since these structures either contain a collider and correspond to one of the cases above or contain longer directed paths in the direction of the edges connecting the structure to the visible nodes, which does not change the marginalized model.
Figure A2. The eight possible hidden structures between three nodes.
The models in (c), (d), (e), (f) and (g) either contain a collider, and therefore reduce to one of the cases discussed above, or induce a directed cycle. We see that (a) and (h) display structures that do not satisfy the conditional independence statements. The hidden structure in (b) has no impact on the model.
A hidden structure connected to all four nodes contains one of the structures above and therefore does not induce a new valid model.
Let us now consider a model with n > 2. Any hidden structure on this model either connects only up to four nodes, and therefore reduces to one of the cases above, contains one of the connections discussed in Figure A2, or only connects nodes within one point in time. The only structures possible to add would be a common exterior influence on the X i s, a common exterior influence on the Y i s or a collider section on any nodes. None of these structures changes the marginalized model. Therefore it is not possible to create a chain graph with hidden nodes in order to get a model strictly larger than M C I I . □

References

  1. Tononi, G.; Edelman, G.M. Consciousness and Complexity. Science 1999, 282, 1846–1851.
  2. Tononi, G. Consciousness as Integrated Information: A Provisional Manifesto. Biol. Bull. 2008, 215, 216–242.
  3. Oizumi, M.; Albantakis, L.; Tononi, G. From the Phenomenology to the Mechanisms of Consciousness: Integrated Information Theory 3.0. PLoS Comput. Biol. 2014, 10, 1–25.
  4. Oizumi, M.; Tsuchiya, N.; Amari, S. Unified framework for information integration based on information geometry. Proc. Natl. Acad. Sci. USA 2016, 113, 14817–14822.
  5. Amari, S.; Tsuchiya, N.; Oizumi, M. Geometry of Information Integration. In Information Geometry and Its Applications; Ay, N., Gibilisco, P., Matúš, F., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 3–17.
  6. Ay, N. Information Geometry on Complexity and Stochastic Interaction. MPI MIS PREPRINT 95. 2001. Available online: https://www.mis.mpg.de/preprints/2001/preprint2001_95.pdf (accessed on 28 September 2020).
  7. Ay, N. Information Geometry on Complexity and Stochastic Interaction. Entropy 2015, 17, 2432–2458.
  8. Ay, N.; Olbrich, E.; Bertschinger, N. A Geometric Approach to Complexity. Chaos 2011, 21.
  9. Oizumi, M.; Amari, S.; Yanagawa, T.; Fujii, N.; Tsuchiya, N. Measuring Integrated Information from the Decoding Perspective. PLoS Comput. Biol. 2016, 12.
  10. Amari, S. Information Geometry and Its Applications; Springer Japan: Tokyo, Japan, 2016.
  11. Pearl, J. Causality; Cambridge University Press: Cambridge, UK, 2009.
  12. Kanwal, M.S.; Grochow, J.A.; Ay, N. Comparing Information-Theoretic Measures of Complexity in Boltzmann Machines. Entropy 2017, 19, 310.
  13. Barrett, A.B.; Seth, A.K. Practical Measures of Integrated Information for Time-Series Data. PLoS Comput. Biol. 2011, 7.
  14. Csiszár, I.; Shields, P. Information Theory and Statistics: A Tutorial. In Foundations and Trends in Communications and Information Theory; Now Publishers Inc.: Delft, The Netherlands, 2004; pp. 417–528.
  15. Studený, M. Probabilistic Conditional Independence Structures; Springer: London, UK, 2005.
  16. Lauritzen, S.L. Graphical Models; Clarendon Press: Oxford, UK, 1996.
  17. Sadeghi, K. Marginalization and conditioning for LWF chain graphs. Ann. Stat. 2016, 44, 1792–1816.
  18. Montúfar, G. On the expressive power of discrete mixture models, restricted Boltzmann machines, and deep belief networks—A unified mathematical treatment. Ph.D. Thesis, Universität Leipzig, Leipzig, Germany, 2012.
  19. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2006.
  20. Csiszár, I.; Tusnády, G. Information geometry and alternating minimization procedures. Stat. Decis. 1984, Supplemental Issue Number 1, 205–237.
  21. Amari, S.; Kurata, K.; Nagaoka, H. Information geometry of Boltzmann machines. IEEE Trans. Neural Netw. 1992, 3, 260–271.
  22. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. 1977, 39, 2–38.
  23. Amari, S. Information Geometry of the EM and em Algorithms for Neural Networks. Neural Netw. 1995, 9, 1379–1408.
  24. Winkler, G. Image Analysis, Random Fields and Markov Chain Monte Carlo Methods; Springer: Berlin/Heidelberg, Germany, 2003.
  25. Choromanska, A.; Henaff, M.; Mathieu, M.; Arous, G.B.; LeCun, Y. The Loss Surfaces of Multilayer Networks. PMLR 2015, 38, 192–204.
  26. Langer, C. Integrated-Information-Measures GitHub Repository. Available online: https://github.com/CarlottaLanger/Integrated-Information-Measures (accessed on 18 August 2020).
  27. Frydenberg, M. The Chain Graph Markov Property. Scand. J. Stat. 1990, 17, 333–353.
  28. Ay, N.; Jost, J.; Lê, H.V.; Schwachhöfer, L. Information Geometry; Springer International Publishing: Cham, Switzerland, 2017.
Table 1. The results of the em-algorithm between N C I S and N C I I .
|W|    Minimum                 Maximum                Arithmetic Mean
2      0.011969035529826939    0.5028091152589176     0.15263592877594967
3      0.021348311360946       0.5499395859771526     0.1538653506807848
4      0.014762084688030863    0.3984635189946462     0.15139198568055212
8      0.017334311629729246    0.4383731978333986     0.15481967618112732
16     0.024306996171092318    0.4238222051787452     0.1490336847067273
300    0.016524177216064712    0.47733473380366764    0.15493896625208842
