1. Introduction
Path analysis [
1] is often applied to causal systems of continuous variables through the linear structural equations model (LISREL) [
2,
3]. In the LISREL approach, causal relationships among variables are described by a path diagram and translated into linear equations of the variables. Causal effects can then be calculated by regression and correlation coefficients obtained for the linear equations. In contrast, path analysis of categorical variables is more complex than that of continuous variables because the causal system under consideration cannot be described by linear regression equations. Goodman [
4–
7] considered path analysis of binary variables by using logit models and discussed the effects of explanatory variables, though without discussing direct and indirect effects. Hagenaars [
8] discussed path analysis of categorical variables by using a log-linear model, but here as well without discussion of direct and indirect effects. Eshima
et al. [
9] proposed a path analysis method for categorical variables in logit models. Kuha and Goldthorpe [
10] gave a two-stage path analysis method for generalized linear models (GLMs) that uses log odds ratios. In their approach, first the total, direct and indirect effects are defined for mean differences of response variables, and then the method is applied to measuring the effects on the basis of log odds ratios. However, additive decomposition of the total effect into the direct and indirect effects only approximately reflects reality, and assessing effects in categorical (polytomous) variable systems become more complicated as the numbers of variable categories are increased [
10]. Albert and Nelson [
11] proposed a path analysis method to calculate pathway effects for causal systems on the basis of GLMs, but not all pathway effects are identifiable. As in the two-stage cases, when factors, intermediate variables, and response variables are categorical, pathway effects become very complicated because the variable effects are defined for mean differences of response variables.
There are many examples of response variables in practical data that are not normally distributed in various fields of study. There is need for a method of path analysis with responses that are not normally distributed, especially categorical responses, and it is useful to discuss a path analysis approach for causal systems of GLMs [
12,
13]. When describing causal systems of the variables by GLMs, regression parameters or coefficients are related to log odds ratios [
14–
16], and so it is natural to consider the effects of factors (explanatory variables) according to odds or log odds ratios. However results become more complicated as the number of categories of the variables increases. In such cases, it is suitable to summarize the effects of factors in GLMs. For this purpose, we use the entropy coefficient of determination (ECD), one of the entropy-based measures of predictive power for GLMs [
15,
16].
The remainder of this paper is organized as follows: Section 2 presents a practical example of causal systems—British mobility data [
10]—and re-analyzes them by a new method of path analysis. Section 3 considers the relation between the log odds ratio and entropy, and ECD is briefly reviewed. Section 4 introduces a path analysis method for causal systems described by GLMs, and in Section 5 a method for testing effects of variables is given. The British mobility data are re-analyzed by the proposed approach in Section 6. Finally, Section 7 provides some discussion and conclusions for the present approach.
2. Practical Example
British mobility data describe the effects of education on social class mobility [
10]. There are three variables, which are causally ordered as shown in
Figure 1: parents’ social class,
X; individual social class,
Y; and education,
Z, which intermediates between
X and
Y. The three variables are discrete. Social classes
X and
Y have three categories, “salariat and employers”, “middle class”, and “working class”; education
Z has seven levels. While the effects of
X and
Z on
Y can be discussed through log odds ratios, the results are complicated because the number of variable categories is large. It is important to summarize causal effects measured with log odds ratios, especially in such practical examples, to assess the intermediate effect of education on social class mobility.
4. Measuring the Total, Direct, and Indirect Effects in Recursive GLM Systems
For simplicity, in the recursive case with
, where
precedes
, we discuss the effects of
and
on
. Let
be the expectation of
. Then, for a GLM with the conditional density or probability function of
when
given by
(1), the total effect of
on
can be defined by using the following log odds ratio:
Taking the expectation of the above effect with respect to
,
and
, we have:
The above KL information is the (summary) total effect of explanatory variables
on response variable
. Let
and
be the conditional expectations for
and
, respectively, given that
. The log odds ratio with respect to
and
for a given
is:
The total effect of
on
when
is defined by the above log odds ratio because the effect expresses the total effect of
on
when the effect of the preceding variable
is excluded. From this, the total effect of
on
is defined by
By taking the expectation of the above information with respect to
,
and
, the (summary) total effect of
on
is given by
where
is given by
The second term implies the effect of
by itself, that is, the effect of
on
when the effect of
is excluded, and is defined as the (summary) total effect of
on
. The direct effect of
on
can be understood according to the following odds ratio:
The above effect is derived by excluding the effect of
, so it is defined as the direct effect of
on
. Taking the expectation of the above effect, we have the (summary) direct effect of
on
, expressed as follows:
where
is defined as in
(2). The above quantity is the amount of entropy of
explained by
alone, that is, excluding the effect of
. By subtracting the direct effect of
on
from the total effect, we have the indirect effect of
on
:
Taking the expectation of the above effect, the (summary) indirect effect is given by
As in the previous section, to standardize the above effects by ECD, we define the standardized total, direct, and indirect effects of
and
on
as follows:
The total effect of
and
on
is:
The total effect of
on
:
The direct effect of
on
:
The indirect effect of
on
:
The total (direct) effect of
on
:
A general approach based on the above discussion is given below. Let
be variables such that the parents of
are
, that is,
precedes
. Let
be the conditional density or probability of
given
such that:
Explaining response variable
in a GLM framework by explanatory variables
, the effects of the explanatory variables on the response variable can be treated in terms of entropy as discussed above. From this the standardized (summary) total effect of
on
is defined by:
Second, the total effect of
is defined as:
Then, we can find the total effects of
by induction, which yields:
where
and
can be defined as in
(2). In the above formulae, we have:
and:
Remark 3. The total effect of on is given by:
where and be the conditional expectations of given and,
respectively.
Let
be parent variables of
excluding
. The direct effect of
on
is defined by:
From this, we have the indirect effect of
:
Remark 4. The direct effect of on is given bywhere is the conditional expectation of given.
For canonical links:
we have:
and:
The direct effect of
on
is given by:
and the indirect effect is calculated by
(8) minus
(9).
The present approach is different from the usual approach for linear equation models and from the approach in [
10], because it is based on the log odds ratio and entropy by using all the variables concerned.
Remark 5. The total effects of variables by Kuha and Goldthorpe [
10]
are defined with the marginal distributions of response variables and explanatory variables. Meanwhile the present approach defines the total effects of explanatory variables based on a recursive structure of all the variables concerned and we have (6).
Remark 6. Indirect effects are defined by the total effects minus the direct effects as (3), (4) and (7); however the interpretation can be done in terms of entropy. On the other hand, direct and indirect effects are defined in an approach by [
10],
though the sum of the effects does not equal to the total effect. Remark 7. Assessing the model identification and testing the goodness-of-fit of the model are based on the discussion of GLMs.
5. Statistical Test for Effects
Let
and
be the ML estimators of
and
, respectively. A similar result presented in Eshima & Tabata [
16] can be used to show that:
is asymptotically distributed according to a chi-squared distribution with the degrees of freedom equal to the number of parameters in the conditional independent model with
minus that with
.By using statistic
(10), the total effects can be tested. Similarly, the statistic:
is asymptotically distributed according to a chi-squared distribution with degrees of freedom equal to the number of regression coefficients (parameters) related to variable
.
The following statistic is asymptotically distributed according to a non-central chi-squared distribution with degree of non-centrality:
and an appropriate degrees of freedom ν, found as the number of parameters in the conditional independent model with
minus that with
. Let:
and let
and
. The statistic
is asymptotically distributed according to the chi-squared distribution with
degrees of freedom. As
becomes large, the chi-squared distribution tends to a normal distribution with mean
and variance
. From this, for sufficiently large sample sizes
, the statistic:
is asymptotically normally distributed with mean
and variance
[
17]. For sufficiently large
, we have that:
From this, the asymptotic standard error (ASE) of
is
. Similarly, the asymptotic standard error of:
is
. Moreover:
is asymptotically equal to a normal distribution with mean:
and variance
. By using the above results, ASEs of the estimates of the summary total and direct effects can be calculated.
6. Path Analysis of the British Morbility Data
The British mobility data described in Section 2 were analyzed in detail by using log odds ratios [
10]. Here, the proposed path analysis method is applied to summarize the effects of parental class
and education
on destination class
Y, measured by log odds ratios as in the previous section, and to give a simple interpretation from the summary effects of
and
on
. The three variables are random, and the GLM system can be composed of logit models. In this example, the employed logistic model can be expressed as follows. Let
be a categorical factor;
a score that take levels {1,2,3} and {1,2,…,7}, respectively, and let
be a categorical response variable with levels {1,2,3}. Let:
Then, dummy variable vectors
and
are identified with categorical variables
and response
, respectively. From this, the systematic component of the above model can be expressed as follows:
where:
Then, the logit model is described as:
where
implies the summation over all
u. Then, from
Table 4 in [
10], the estimated regression parameters for men are calculated as follows:
Similarly, we have the estimated parameters for women as follows:
From
Tables 1 and
5 in [
10], the joint distributions of parental class
and education
for men and women are calculated, respectively, in
Table 1.
On the basis of the estimated parameters shown above and the estimated joint distribution of
and
Z in
Table 1, the joint distributions of
X,
Y, and
Z by sex can be estimated. The effects of
X and
Z on
Y for men are shown in
Tables 2–
4, for example, the effects of
and
on
illustrated in
Table 2 are as follows:
the total effect of
and
on
is calculated as follows: 0.51;
the total effect of
is 0.04;
the total effect of
is 0.47 when
;
the direct effect of
is 0.16 when
;
the indirect effect of
is 0.31 when
.
Similarly, the effects of X and Z on Y for women can be calculated. The results are omitted to avoid redundancy of the discussion.
The standardized summary effects are shown in
Table 5. For men, the total effect of
X and
Z on
Y is 0.276, and so 27.6% of the variation of
Y’s entropy is explained by
X and
Z. The indirect effect of
X is about twice the direct effect, and the total (direct) effect of
Z on
Y is about 1.5-fold that of
X. Therefore, the effect of education
Z on the destination class
Y is large. For women, the total effect of
X and
Z on
Y is 0.289, meaning that 28.9% of the variation of
Y’s entropy is explained by
X and
Z. The indirect effect of
X on
Y is about 6-fold that of the direct effect, and the direct effect is small. The total effect of
Z on
Y is about 2.7-fold that of
X. The effect of
Z on
Y is more pronounced for women than for men.
In a comparison of men and women, the effect of Z on Y for women is about 1.3-fold the effect for men, and, contrarily, the effect of X on Y for men is about 1.4-fold the effect for women. For both men and women, the direct effects of X on Y are mostly very small, and this decomposition of effects shows that education plays an important role in determining social class as an adult.