Chapter 2 Causal Models

We provide a lay-language primer on the logic of causal models.

Causal claims are everywhere. Causal knowledge is often not just the goal of empirical social science, it is also an input into causal inference.11 Causal assumptions are hidden in seemingly descriptive statements: claims that someone is guilty, or exploited, or powerful, or weak involve beliefs about how things would be if conditions were different. Even when scholars carefully try to avoid causal claim-making, causal verbs—depends, drives, produces, influences—are hard to avoid.

But while causal claims are commonplace, it is not always clear what exactly is meant by a causal relation and how causal knowledge about one thing can be marshaled to justify causal claims about another. For our purposes, the counterfactual view of causality addresses the first question. Causal models address the second. In this chapter, we discuss each in turn. The present chapter is largely conceptual, with ideas worked through with a couple of “toy” running examples. In Chapter 3, we then apply and illustrate many of the key concepts from this chapter by translating a few prominent arguments from the field of political science into structural causal models.

2.1 The counterfactual model

We begin with what we might think of as a meta-model, the counterfactual model of causation. At its core, a counterfactual understanding of causation captures a simple notion of causation as “difference-making.”12 In the counterfactual view, to say that \(X\) caused \(Y\) is to say: had \(X\) been different, \(Y\) would have been different.

The causal effect, in this view, is the difference between two things (two values of \(Y\)) that might have happened. This means that, by definition, causal effects are not measurable quantities. They are not differences between two observable outcomes in the world but, at best, differences between one observable outcome and a second, counterfactual outcome. For this reason, causal effects need to be inferred, not measured.

Moreover, in this view, the antecedent, “had \(X\) been different,” imagines a controlled change in \(X\)—an intervention that altered \(X\)’s value—rather than a naturally arising difference in \(X\). The usual counterfactual claim, then, is not that \(Y\) is different from how it might have been had circumstances been such that \(X\) were different; it is, rather, that if one could somehow have made \(X\) different in a case, then \(Y\) would have been different in that case.13

Consider a simple example. Teacher A is extraordinary. Students with teacher A do not study and would perform well whether or not they studied. Students with teacher B perform well if and only if they study. Moreover, students with teacher B do in fact study. And all perform well.

When we say that one of teacher B’s students did well because they studied, we are comparing the outcome that they experienced to the outcome that they would have experienced if (1) they had had teacher B, as they did, but (2) counterfactually, they had not studied.

Notably, when we define the effect of studying, we are not comparing the realized outcome of the studiers to the outcome of the students who in fact did not study. That is because the students who in fact did not study had teacher A, not B. Moreover, we are not comparing the realized outcome of a student of teacher B to what that same student would have achieved if they had had teacher A (and, for that reason, had not studied). The reason, again, is that this comparison includes the effect of having teacher A rather than the effect of studying given that they had teacher B.

Here is a second example, drawn from a substantive domain that we will return to many times in this book. In his seminal book on democracy and distribution, Boix (2003) argues that low economic inequality is a cause of democratization. At high levels of inequality, Boix argues, the elite would rather repress the poor than submit to democracy and its redistributive consequences; at low levels of inequality, in contrast, redistribution under democracy will be less costly for the elite than would continued repression. Now, in light of this theory, consider the claim that Switzerland democratized (\(D=1\)) because it had a relatively low level of economic inequality (\(I=0\)). In the counterfactual view, this claim is equivalent to saying that, if Switzerland had had a high level of inequality, the country would not have democratized. Low economic inequality made a difference. The comparison for the causal statement is with the outcome Switzerland would have experienced under an intervention that boosted its historical level of economic inequality (but made no other change)—not with how Switzerland would have performed if it had been like one of the countries that in fact had higher levels of inequality, cases that likely differ from Switzerland in other causally relevant ways.

2.1.1 Potential outcomes

Researchers often employ what’s called the “potential outcomes” framework when they need a precise formal language for describing counterfactual quantities (Rubin 1974). In this framework, we characterize how a given unit responds to a causal variable by positing the outcomes that the unit would take on at different values of the causal variable. Most commonly, \(Y_i(0)\) and \(Y_i(1)\) are used to denote the values that \(Y\) would take for unit \(i\) if \(X\) were 0 and 1 respectively.14

One setting in which it is quite easy to think about potential outcomes is medical treatment. Imagine that some individuals in a diseased population have received a drug (\(X=1\)) while others have not received the drug (\(X=0\)). Assume that, subsequently, a researcher observes which individuals become healthy (\(Y=1\)) and which do not (\(Y=0\)). Given the assignments of all other individuals,15 we can treat each individual as belonging to one of four unobserved response “types,” defined by the outcomes that the individual would have if they received or did not receive treatment:16

  • adverse: Those individuals who would get better if and only if they do not receive the treatment
  • beneficial: Those who would get better if and only if they do receive the treatment
  • chronic: Those who will remain sick whether or not they receive treatment
  • destined: Those who will get better whether or not they receive treatment

Table 2.1 maps the four types (\(a, b, c, d\)) onto their respective potential outcomes. In each column, we have simply written down the outcome that a patient of a given type would experience if they are not treated, and the outcome they would experience if they are treated. We are here always imagining controlled changes in treatment: the responses if treatments are changed without changes to other background (or pre-treatment) conditions in the case.

Table 2.1: Potential outcomes: what would happen to each of four possible types of case if they were or were not treated.

                          Type a     Type b       Type c     Type d
                          adverse    beneficial   chronic    destined
  Outcome if not treated  Healthy    Sick         Sick       Healthy
  Outcome if treated      Sick       Healthy      Sick       Healthy
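The logic of Table 2.1 can be sketched in a few lines of code. This is a minimal illustration only (the names are ours, not part of any established library): each type is a lookup table from treatment status to outcome, and the case-level causal effect is just the difference between its two entries, a quantity that can never be observed directly for a single unit.

```python
# Potential outcomes for the four response types, with 1 = healthy, 0 = sick.
# Keys follow the chapter's a/b/c/d labels; this is an illustrative sketch.
potential_outcomes = {
    "a (adverse)":    {0: 1, 1: 0},  # healthy if and only if NOT treated
    "b (beneficial)": {0: 0, 1: 1},  # healthy if and only if treated
    "c (chronic)":    {0: 0, 1: 0},  # sick whether treated or not
    "d (destined)":   {0: 1, 1: 1},  # healthy whether treated or not
}

def outcome(unit_type: str, treated: int) -> int:
    """Return the deterministic outcome for a unit of a given type."""
    return potential_outcomes[unit_type][treated]

def causal_effect(unit_type: str) -> int:
    """Case-level effect: outcome if treated minus outcome if untreated."""
    return outcome(unit_type, 1) - outcome(unit_type, 0)
```

For a `b` type, `causal_effect` returns 1 (treatment makes the difference); for an `a` type it returns -1; for `c` and `d` types it returns 0.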

We highlight that, in this framework, case-level causal relations are treated as deterministic. A given case has a set of potential outcomes. Any uncertainty about outcomes enters as incomplete knowledge of a case’s “type,” not from underlying randomness in causal relations. This understanding of causality—as ontologically deterministic, but empirically imperfectly understood—is compatible with views of causation commonly employed by qualitative researchers (see, e.g., Mahoney (2008)), and with understandings of causal determinism going back at least to Laplace (1901).

As we will also see, we can readily express this kind of incompleteness of knowledge within a causal model framework: indeed, the way in which causal models manage uncertainty is central to how they allow us to pose questions of interest and to learn from evidence. There are certainly situations we could imagine in which one might want to conceptualize potential outcomes themselves as random (for instance, if individuals in different conditions play different lotteries). But for the vast majority of the settings we consider, not much of importance is lost if we treat potential outcomes as deterministic but possibly unknown: every case is of a particular type; we just do not know which type that is.

2.1.2 A generalization

Throughout the book, we generalize from this simple setup. Whenever we have one causal variable and one outcome, and both variables are binary (i.e., each can take on two possible values, 0 or 1), there are only four sets of possible potential outcomes, or “types.” More generally, for variable \(Y\), we will use \(\theta^Y\) to capture the unit’s “type”: the way that \(Y\) responds to its potential causes.17 We, further, add subscripts to denote particular types. Where there are four possible types, for instance, we use the notation \(\theta^Y_{ij}\), where the first subscript, \(i\), represents the case’s potential outcome when \(X=0\); and the second subscript, \(j\), is the case’s potential outcome when \(X=1\).

Adopting this notation, for a causal structure with one binary causal variable and a binary outcome, the four types can be represented as \(\{\theta^Y_{10}, \theta^Y_{01}, \theta^Y_{00}, \theta^Y_{11}\}\), as shown in Table 2.2:

Table 2.2: Potential outcomes representation of quantities in Table 2.1. The table gives, for each type (\(\theta\)), the values that \(Y\) would take on if \(X\) were set at \(0\) and if \(X\) were set at \(1\).

                 a: \(\theta^Y=\theta^Y_{10}\)   b: \(\theta^Y=\theta^Y_{01}\)   c: \(\theta^Y=\theta^Y_{00}\)   d: \(\theta^Y=\theta^Y_{11}\)
  Set \(X=0\)    \(Y(0)=1\)                      \(Y(0)=0\)                      \(Y(0)=0\)                      \(Y(0)=1\)
  Set \(X=1\)    \(Y(1)=0\)                      \(Y(1)=1\)                      \(Y(1)=0\)                      \(Y(1)=1\)

Returning to the matter of inequality and democratization to illustrate, let \(I=1\) represent a high level of economic inequality and \(I=0\) its absence; let \(D=1\) represent democratization and \(D=0\) its absence. A \(\theta^D_{10}\) (or \(a\)) type is a case in which a high level of inequality, if it occurs, prevents democratization in a country that would otherwise have democratized. So the causal effect of high inequality in a case, \(i\), of \(\theta^D_{10}\) type is \(\tau_i= -1\). A \(\theta^D_{01}\) type (or \(b\) type) is a case in which high inequality, if it occurs, generates democratization in a country that would otherwise have remained non-democratic (effect of \(\tau_i= 1\)). A \(\theta^D_{00}\) type (\(c\) type) is a case that will not democratize regardless of the level of inequality (effect of \(\tau_i = 0\)); and a \(\theta^D_{11}\) type (\(d\) type) is one that will democratize regardless of the level of inequality (again, effect of \(\tau_i= 0\)).

In this setting, a causal explanation of a given case outcome amounts to a statement about its type. The claim that Switzerland’s low level of inequality was a cause of its democratization is equivalent to saying that Switzerland democratized and is a \(\theta^D_{10}\) type. To claim that Benin democratized because of high inequality is equivalent to saying that Benin democratized and is a \(\theta^D_{01}\) type. To claim, on the other hand, that Malawi democratized for reasons having nothing to do with its level of economic inequality is to characterize Malawi as a \(\theta^D_{11}\) type (which implies that Malawi would have been democratic no matter what its level of inequality).

Now, let us consider more complex causal relations. Suppose that there are two binary causal variables, \(X_1\) and \(X_2\). We can specify any given case’s potential outcomes for each of the different possible combinations of these causal conditions. There are now four such conditions, since each causal variable may take on \(0\) or \(1\) while the other is at \(0\) or \(1\).

As for notation, we now need to expand \(\theta\)’s subscript since we need to represent the value that \(Y\) takes on under each of the four possible combinations of \(X_1\) and \(X_2\) values. This requires four, rather than two, subscript digits. We map the subscripting for \(\theta_{hijk}\) to potential outcome notation as shown in Equation (2.1).

\[\begin{eqnarray} \tag{2.1} \theta^Y_{hijk} \left\{\begin{array}{ccc} Y(0,0) &=& h \\ Y(1,0) &=& i \\ Y(0,1) &=& j \\ Y(1,1) &=& k \end{array} \right. \end{eqnarray}\]

where the first argument of \(Y(.,.)\) is the value to which \(X_1\) is set and the second is the value to which \(X_2\) is set.

Thus, for instance, \(\theta^Y_{0100}\) means that \(Y\) is 1 if \(X_1\) is set to 1 and \(X_2\) to 0 and is 0 otherwise; \(\theta^Y_{0011}\) is a type in which \(Y=1\) if and only if \(X_2=1\); \(\theta^Y_{0001}\) is a type for which \(Y=0\) unless both \(X_1\) and \(X_2\) are 1.

We now have sixteen causal types: sixteen different patterns that \(Y\) might display in response to changes in \(X_1\) and \(X_2\). The full set is represented in Table 2.3, which also illustrates how we read types off of four-digit subscripts. For instance, the table shows us that for nodal type \(\theta^Y_{0101}\), \(X_1\) has a positive causal effect on \(Y\) but \(X_2\) has no effect. On the other hand, for type \(\theta^Y_{0011}\), \(X_2\) has a positive effect while \(X_1\) has none.

The 16 types also capture interactions. For instance, in a \(\theta^Y_{0001}\) type, \(X_2\) has a positive causal effect if and only if \(X_1\) is 1. In that type, \(X_1\) and \(X_2\) serve as “complements.” For \(\theta^Y_{0111}\), \(X_2\) has a positive causal effect if and only if \(X_1\) is 0. In that setup, \(X_1\) and \(X_2\) are “substitutes.”
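The subscript convention of Equation (2.1) can also be sketched programmatically. The decoder below is a hypothetical illustration (names ours): it maps a four-digit subscript to the potential outcome for any \((X_1, X_2)\) combination, which makes it easy to verify, for example, the “complements” reading of \(\theta^Y_{0001}\).

```python
# Decode a four-digit type subscript "hijk" into potential outcomes,
# following the chapter's ordering: Y(0,0)=h, Y(1,0)=i, Y(0,1)=j, Y(1,1)=k.
def Y(subscript: str, x1: int, x2: int) -> int:
    h, i, j, k = (int(c) for c in subscript)
    return {(0, 0): h, (1, 0): i, (0, 1): j, (1, 1): k}[(x1, x2)]

# Effect of X2 at each level of X1 for the "complements" type theta^Y_0001:
# X2 makes a difference only when X1 = 1.
effect_of_x2 = {x1: Y("0001", x1, 1) - Y("0001", x1, 0) for x1 in (0, 1)}
```

Running this gives `effect_of_x2 == {0: 0, 1: 1}`: with \(X_1=0\), changing \(X_2\) does nothing; with \(X_1=1\), it raises \(Y\) from 0 to 1.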

Table 2.3: With two binary causal variables, there are 16 causal types: 16 ways in which \(Y\) might respond to changes in the other two variables.

  \(\theta^Y\)           if \(X_1=0, X_2=0\)   if \(X_1=1, X_2=0\)   if \(X_1=0, X_2=1\)   if \(X_1=1, X_2=1\)
  \(\theta^Y_{0000}\)    0    0    0    0
  \(\theta^Y_{1000}\)    1    0    0    0
  \(\theta^Y_{0100}\)    0    1    0    0
  \(\theta^Y_{1100}\)    1    1    0    0
  \(\theta^Y_{0010}\)    0    0    1    0
  \(\theta^Y_{1010}\)    1    0    1    0
  \(\theta^Y_{0110}\)    0    1    1    0
  \(\theta^Y_{1110}\)    1    1    1    0
  \(\theta^Y_{0001}\)    0    0    0    1
  \(\theta^Y_{1001}\)    1    0    0    1
  \(\theta^Y_{0101}\)    0    1    0    1
  \(\theta^Y_{1101}\)    1    1    0    1
  \(\theta^Y_{0011}\)    0    0    1    1
  \(\theta^Y_{1011}\)    1    0    1    1
  \(\theta^Y_{0111}\)    0    1    1    1
  \(\theta^Y_{1111}\)    1    1    1    1

This is a rich framework in that it allows for all possible ways in which a set of multiple causes can interact with each other. Often, when seeking to explain the outcome in a case, researchers proceed as though causes are necessarily rival, where \(X_1\) being a cause of \(Y\) implies that \(X_2\) was not. Did Malawi democratize because it was a relatively economically equal society or because of international pressure to do so? In the counterfactual model, however, causal relations can be non-rival. If two out of three people vote for an outcome under majority rule, for example, then both of the two supporters caused the outcome: the outcome would not have occurred if either supporter’s vote were different. A typological, potential-outcomes conceptualization provides a straightforward way of representing this kind of complex causation.

Because of this complexity, when we say that \(X\) caused \(Y\) in a given case, we will generally mean that \(X\) was a cause, not the (only) cause. Malawi might not have democratized if either a relatively high level of economic equality or international pressure had been absent. For most social phenomena that we study, there will be multiple, and sometimes a great many, difference-makers for any given case outcome.

We will mostly use \(\theta^Y_{ij}\)-style notation in this book to refer to types. We will, however, occasionally revert to the simpler \(a, b, c, d\) designations when that eases exposition. As types play a central role in the causal-model framework, we recommend getting comfortable with both forms of notation before going further.

Using the same framework, we can generalize to structures in which a unit has any number of causes and also to cases in which causes and outcomes are non-binary. As one might imagine, the number of types increases rapidly (very rapidly) as the number of causal variables considered increases; it also increases rapidly if we allow \(X\) or \(Y\) to take on more than two possible values. For example, if there are \(n\) binary causes of an outcome, then there are \(2^{\left(2^n\right)}\) types of this form: that is, there are \(k=2^n\) combinations of values of the causes to consider, and \(2^k\) distinct response patterns across these possible combinations. If causes and outcomes are ternary instead of binary, we have \(3^{\left(3^n\right)}\) causal types of this form.
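This counting argument can be checked with a one-line sketch (the function name is ours):

```python
# Number of causal types for an outcome with n discrete causes:
# levels**n input combinations, and levels**(levels**n) response patterns.
def n_types(n_causes: int, levels: int = 2) -> int:
    return levels ** (levels ** n_causes)
```

With one binary cause this recovers the four types of Table 2.2; with two, the sixteen of Table 2.3; with three binary causes there are already 256 types, and a single ternary cause of a ternary outcome yields 27.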

Nevertheless, the basic principle of representing possible causal relations as patterns of potential outcomes remains unchanged, at least as long as variables are discrete.

2.1.3 Summaries of potential outcomes

So far, we have focused on causal relations at the level of an individual case. Causal relations at the level of a population are, however, simply a summary of causal relations for cases, and the same basic ideas can be used. We could, for instance, summarize our beliefs about the relationship between economic inequality and democratization by saying that we think the world is composed of a mixture of \(a\), \(b\), \(c\), and \(d\) types, as defined above. We could get more specific and express a belief about the proportions of cases in the world that are of each of the four types. For instance, we might believe that \(a\) types and \(d\) types are quite rare while \(b\) and \(c\) types are quite common. Moreover, our beliefs about the proportions of \(b\) (positive effect) and \(a\) (negative effect) cases imply a belief about inequality’s average effect on democratization since, in a binary setup, this quantity is simply the proportion of \(b\) types minus the proportion of \(a\) types. Such summaries allow us to move from discussion of the cause of a single outcome to discussion of average effects, a distinction that we take up again in Chapter 4.
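As a toy illustration of this kind of summary (the type shares below are invented for the example, not estimates of anything):

```python
# Hypothetical population shares of the four types: a/d rare, b/c common.
shares = {"a": 0.05, "b": 0.40, "c": 0.45, "d": 0.10}

# With a binary cause and outcome, the average effect is simply the share
# of positive-effect (b) types minus the share of negative-effect (a) types.
ate = shares["b"] - shares["a"]
```

Here the average effect is the difference between the `b` and `a` shares, 0.35, even though no individual case has an effect of 0.35: every case's effect is \(-1\), \(0\), or \(1\).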

2.2 Causal Models and Directed Acyclic Graphs

So far we have discussed how a single outcome is affected by one or more possible causes. However, these same ideas can be used to describe more complex relations between collections of variables—for example, with one variable affecting another directly as well as indirectly via its impact on a third variable. For instance, \(X\) might affect \(Y\) directly. But \(X\) might also affect \(Y\) by affecting \(M\), which in turn affects \(Y\). In the latter scenario, \(M\) is a mediator of \(X\)’s effect on \(Y\).

Potential outcomes tables can be used to describe such relations. However, as causal structures become more complex—especially, as the number of variables in a domain increases—a causal model can be a powerful organizing tool. In this section, we show how causal models and their visual counterparts, directed acyclic graphs, can represent substantive beliefs about counterfactual causal relationships in the world. The key ideas in this section can be found in many texts (see, e.g., Halpern and Pearl (2005) and Galles and Pearl (1998)), and we introduce here a set of basic principles that readers will need to keep in mind in order to follow the argumentation in this book.

As we shift to talking about networks of causal relations between variables we will also shift our language. When talking about causal networks, or causal graphs, we will generally refer to variables as “nodes.” And we will sometimes use familial terms to describe relations between nodes. For instance, if \(A\) is a cause of \(B\), we will refer to \(A\) as a “parent” of \(B\), and \(B\) as a “child” of \(A\). Graphically we have an arrow pointing from the parent to the child. If two variables have a child in common (both directly affecting the same variable) we refer to them as “spouses.” We can also say that a variable is an “ancestor” of another variable (its “descendant”) if there is a chain of parent-child relations from the “ancestor” to the “descendant.”

Returning to our running democratization example, suppose now that we have more fully specified beliefs about how the level of economic inequality can have an effect on whether a country democratizes. We might believe that inequality (\(I\)) affects the likelihood of democratization (\(D\)) by generating demands for redistribution (\(R\)), which in turn can cause the mobilization (\(M\)) of lower-income citizens, which in turn can cause democratization (\(D\)). We might also believe that mobilization itself is not just a function of redistributive preferences but also of the degree of ethnic homogeneity (\(E\)), which shapes the capacities of lower-income citizens for collective action. We visualize this model as a directed acyclic graph (DAG) in Figure 2.1. In this model, \(R\) is a parent of \(M\). \(I\) is an ancestor of \(M\) but not its parent. \(R\) and \(E\) are spouses, and \(M\) is their child (that is: mobilization depends on both redistributive preferences and ethnic demography).


Figure 2.1: A simple causal model in which high inequality (\(I\)) affects democratization (\(D\)) via redistributive demands and mass mobilization (\(M\)), which is also a function of ethnic homogeneity (\(E\)). The arrows show relations of causal dependence between variables. The graph does not capture the ranges of the variables and the functional relations between them.
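The familial vocabulary introduced above can be illustrated with a small sketch of the graph in Figure 2.1 (the representation as a parent map is ours, chosen for simplicity):

```python
# Figure 2.1 as a parent map: arrows point from parent to child.
# I -> R -> M -> D, with E -> M.
parents = {"I": [], "E": [], "R": ["I"], "M": ["R", "E"], "D": ["M"]}

def ancestors(node: str) -> set:
    """All nodes from which a chain of parent-child links leads into `node`."""
    found = set()
    frontier = list(parents[node])
    while frontier:
        p = frontier.pop()
        if p not in found:
            found.add(p)
            frontier.extend(parents[p])
    return found
```

For instance, `ancestors("M")` returns `{"I", "R", "E"}`, while `parents["M"]` is `["R", "E"]`: \(I\) is an ancestor of \(M\) but not its parent, and \(R\) and \(E\) are spouses because they share the child \(M\).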

Fundamentally, we treat causal models in this book as formal representations of beliefs about how the world works—or, more specifically, about causal relations within a given domain. We might use a causal model to capture our own beliefs, a working simplification of our beliefs, or a set of potential beliefs that one might hold. The formalization of prior beliefs in the form of a causal model is the starting point for research design and inference in this book’s analytic framework. Using the democratization example, we will now walk through the three components of a causal model in which our beliefs get embedded: nodes, functions, and distributions.

We now provide a formal definition of a causal model as used in this book, and then unpack the definition. The following definition corresponds to Pearl’s definition of a probabilistic causal model (Pearl (2009), Defn 7.1.6):


A “causal model” is:

1.1: an ordered list of \(n\) endogenous nodes, \(\mathcal{V}= (V^1, V^2,\dots V^n)\), with a specification of a range for each of them

1.2: a list of \(n\) exogenous nodes, \(\Theta = (\theta^1, \theta^2,\dots \theta^n)\)

2: a list of \(n\) functions \(\mathcal{F}= (f^1, f^2,\dots f^n)\), one for each element of \(\mathcal{V}\) such that each \(f^i\) takes as arguments \(\theta^i\) as well as elements of \(\mathcal{V}\) that are prior to \(V^i\) in the ordering


3: a probability distribution over \(\Theta\)

The three components of a causal model are (1) the nodes: the set of variables we are focused on and how they are defined; (2) the functional relations: which nodes are caused by which other nodes, and how; and (3) probability distributions over the unexplained elements of the model (in our framework, the \(\theta\) nodes). We discuss each in turn.
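As a rough sketch only, and not any established library's API, the three components of the definition might be collected in a single structure. Everything here is illustrative, including the toy \(M \rightarrow D\) model used to instantiate it:

```python
# A minimal rendering of the definition: ordered endogenous nodes with
# ranges, causal functions (one per node, taking that node's theta), and
# a probability distribution over the exogenous thetas.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class CausalModel:
    nodes: List[str]                      # ordered endogenous nodes, V
    ranges: Dict[str, list]               # range of each endogenous node
    functions: Dict[str, Callable]        # f^i(parents of V^i, theta^i)
    theta_distribution: Dict[str, dict]   # probability distribution over Theta

# Toy instance: M is a root node set by theta^M; D responds to M according
# to its type theta^D, written as the pair (Y(0), Y(1)).
model = CausalModel(
    nodes=["M", "D"],
    ranges={"M": [0, 1], "D": [0, 1]},
    functions={"M": lambda theta: theta,
               "D": lambda m, theta: theta[m]},
    theta_distribution={"M": {0: 0.5, 1: 0.5},
                        "D": {(0, 1): 0.4, (0, 0): 0.6}},
)
```

The ordering of `nodes` matters in exactly the way the definition requires: each function may consult only its own \(\theta\) and nodes earlier in the list.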

2.2.1 The nodes

The first component of a causal model is the set of variables (nodes) across which the model characterizes causal relations.

We have two sorts of variables: the named, endogenous, nodes, \(\mathcal{V}\), and the unnamed “exogenous” nodes, \(\Theta\).18

All the endogenous nodes have an arrow pointing into them, indicating that the node at the end of the arrow is (possibly) caused in part by the node at the beginning of the arrow.

On the graph (DAG) in Figure 2.1, the five endogenous nodes are lettered. \(R\), \(M\), and \(D\) are all obviously endogenous because they are endogenous to other named variables. \(I\) and \(E\) are not endogenous to other nodes in \(\mathcal{V}\), but we still call them endogenous because they depend on other nodes in the model, specifically on nodes in \(\Theta\). We will use the term “root nodes” to indicate nodes like this that are in \(\mathcal{V}\) but are not endogenous to other nodes in \(\mathcal{V}\).

Our definition specified that the endogenous nodes should be ordered. We can in fact specify different orderings of nodes in this example. For instance, we could have the ordering \(<E, I, R, M, D>\), the ordering \(<I, R, E, M, D>\), or even \(<I, E, R, M, D>\). The ordering matters only to the extent that it constrains \(\mathcal{F}\): \(f^j\) can take as arguments only \(\theta^j\) and elements of \(\mathcal{V}\) that come before \(j\) in the ordering. In practice, this prevents us from having a variable that is both a cause and a consequence of another variable.

In specifying these nodes, we also need to specify the ranges over which they can vary. We might specify, for instance, that all the endogenous nodes in the model are binary, taking on the values 0 or 1. We could, alternatively, define a set of categories across which a node ranges or allow a node to take on any real number value or any value between a set of bounds.19

The exogenous nodes, \(\Theta\), require a little more explanation since they do not describe substantive variables. Five exogenous nodes are also shown on the graph, one for each endogenous node. (Note, though, that we will very frequently not include the exogenous nodes explicitly when we draw a graph; you should still imagine them there, pointing into each endogenous node.) In our discussion above, we introduced \(\theta\) notation for representing types. Here we simply build these types into a causal model. We imagine a \(\theta\) term pointing into every node (whether explicitly represented on the graph or not). A node’s \(\theta\) term characterizes the value that that node will take on given the values of its parents. Ontologically, we can think of \(\theta\) terms as unobservable and unspecified inputs into a causal system. These might include random processes (noise) or contextual features that we are unable to identify or do not understand, but that both affect outcomes and condition the effects of other, specified variables on outcomes.20

As we will show, consistent with our discussion of potential outcomes and types, in discrete settings we can think of \(\theta\) nodes as capturing the functional relations between variables and as such as being quantities of direct interest for causal inquiry. We more fully develop this point—returning to the notion of \(\theta\) terms as receptacles for causal effects—below.

2.2.2 The functions

Next, we need to specify our beliefs about the causal relations among the nodes in our model. How is the value of one node affected by, and how does it affect, the values of others? For each endogenous node—each node influenced by others in the model—we need to express beliefs about how its value is affected by its parents, its immediate causes.

We can think in both qualitative and quantitative ways about how one variable is affected by others. Qualitatively, if a variable \(k\) depends on the value of another variable \(j\) (given other variables prior to \(k\) in the ordering), then \(j\) enters as an argument in \(f^k\). In that case, \(j\) is a parent of \(k\). Graphically, we can represent all such relations between variables and their parents with arrows; when we represent the relations in a causal model in this way, we get a directed acyclic graph (DAG), where the acyclicality and directedness are guaranteed by the ordering requirements we impose when we define a causal model.

Thus the DAG already represents a critical part of our model: the arrows, or directed edges, tell us which nodes we believe may be direct causal inputs into other nodes. So, for instance, we believe that democratization (\(D\)) is determined jointly by mobilization (\(M\)) and some exogenous, unspecified factor (or set of factors), \(\theta^D\). As we have said, we can think of \(\theta^D\) as all of the other influences on democratization, besides mobilization, that we either do not know of or have decided not to explicitly include in the model. We believe, likewise, that \(M\) is determined by \(R\) and \(E\) together with an unspecified exogenous factor (or set of factors), \(\theta^M\). And we are conceptualizing inequality (\(I\)) and ethnic homogeneity (\(E\)) as shaped solely by factors exogenous to the model, captured by \(\theta^I\) and \(\theta^E\), respectively.

Beyond the qualitative beliefs captured by the arrows in a DAG, we can express more specific quantitative beliefs about causal relations in the form of a causal function. A function specifies how the value that one node takes is determined by the values that other nodes—its parents—take on. Specifying a function means writing down whatever general or theoretical knowledge we have about the direct causal relations between nodes.

We can specify this relationship in a vast variety of ways. It is useful, however, to distinguish broadly between parametric and non-parametric approaches. We take a non-parametric approach in this book—this is where our types come back in—but it is helpful to juxtapose that approach with a parametric approach to causal functions.

Parametric approaches

A parametric approach specifies a functional form that relates parents to children. For instance, we might model one node as a linear function of another and write \(D=\alpha + \beta M\). Here \(\beta\) is a parameter that we may not know the value of at the outset of a study but about which we wish to learn. If we believe \(D\) to be linearly affected by \(M\) but also subject to forces that we do not yet understand and have not yet specified in our theory, then we might write: \(D=\alpha + \beta M+\theta^D\). (This functional form will be familiar to most readers as that captured in a standard linear regression equation.) In this function, \(\alpha\) and \(\beta\) might be the parameters of interest—features of the world that we seek to learn about—with \(\theta^D\) treated as merely a random disturbance around the linear relationship.

We can also write down functions in which the functional relations between nodes are left unspecified, governed by parameters with unknown values. Consider, for instance, the function \(D=\beta M^{\theta^D}\). Here, \(D\) and \(M\) are linearly related if \(\theta^D=1\). (If \(\theta^D=1\), then the function just reduces to the linear form, \(D=\beta M\).) However, if \(\theta^D\) is not equal to \(1\), then \(M\)’s effect on \(D\) can be non-linear. For instance, if \(\theta^D\) lies between \(0\) and \(1\), then \(M\) will have a diminishing marginal effect on \(D\). If \(\theta^D\) is greater than \(1\), on the other hand, then \(M\)’s effect on \(D\) will be “exponential,” increasing as \(M\) itself increases. Here, \(\theta^D\) itself would likely be a quantity of interest to the researcher since it conditions the causal relationship between the other two nodes.
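The two parametric forms just discussed can be sketched directly (all parameter values below are invented purely for illustration):

```python
# Sketch of the linear form D = alpha + beta*M + theta^D, where theta^D
# is treated as a disturbance around the linear relationship.
def d_linear(m: float, alpha: float = 0.1, beta: float = 0.5,
             theta_d: float = 0.0) -> float:
    return alpha + beta * m + theta_d

# Sketch of the form D = beta * M**theta^D, where theta^D conditions the
# functional form itself: theta^D = 1 recovers the linear case,
# 0 < theta^D < 1 gives diminishing marginal effects of M,
# and theta^D > 1 gives effects that grow as M grows.
def d_power(m: float, beta: float = 0.5, theta_d: float = 1.0) -> float:
    return beta * m ** theta_d
```

Evaluating `d_power` at a few values of `theta_d` makes the contrast concrete: with `theta_d=0.5`, doubling \(M\) less than doubles \(D\); with `theta_d=2`, it quadruples it.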

The larger point is that functions can be written to be quite specific or extremely general, depending on the state of prior knowledge about the phenomenon under investigation. The use of a structural model does not require precise knowledge of specific causal relations, or even of the functional forms through which two nodes are related.

The non-parametric approach

With discrete (non-continuous) data, causal functions can take fully non-parametric form. That is, non-parametric functions can allow for any possible relation between parents and children, not just those that can be expressed in an equation.

We use a non-parametric framework for most of this book and thus spend some time developing the approach here.

We begin by returning to the concept of types. Drawing on our original four types and the democratization example from earlier in this chapter, we know that we can fully specify causal relations between a binary \(M\) and a binary \(D\) using the concept of a type, represented by \(\theta^D\). We think of \(\theta^D\) as akin to a variable that can take on different values in different cases, corresponding to the different possible types. Specifically, we allow \(\theta^D\) to range across the four possible values (or types) \(\{\theta^D_{10}, \theta^D_{01}, \theta^D_{00}, \theta^D_{11}\}\). For instance, \(\theta^D_{10}\) represents a negative causal effect of \(M\) on \(D\) while \(\theta^D_{00}\) represents \(D\) remaining at 0 regardless of \(M\).

So the value that \(\theta^D\) takes on in a case governs the causal relationship between \(M\) and \(D\). Put differently, \(\theta^D\) represents the non-parametric function that relates \(M\) to \(D\). We can formally specify \(D\)’s behavior as a function of \(M\) and \(\theta^D\) in the following way:

\[D(M, \theta^D_{ij}) = \left\{\begin{array}{ccc} i & \text{if} & M=0 \\ j & \text{if} & M=1 \end{array}\right.\]

Here we are saying that \(D\)’s value in a case depends on two things: the value of \(M\) and the case’s type, defining how \(D\) responds to \(M\). We are then saying, more specifically, how \(D\)’s value is given by the subscripts on \(\theta\) once we know \(M\)’s value: if \(M=0\), then \(D\) is equal to the subscript \(i\); if \(M=1\), then \(D\) is equal to \(j\). Note that \(\theta^D\)’s possible values range over all possible functional forms between these two binary variables.
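This non-parametric function is easy to express in code. The sketch below (our own illustration) encodes each nodal type \(\theta^D_{ij}\) as the pair \((i, j)\) and reads off \(D\)'s value from \(M\):

```python
# A minimal sketch of the non-parametric causal function D(M, theta^D).
# Each nodal type theta^D_{ij} is encoded as the pair (i, j): D's value
# when M=0 and D's value when M=1.

NODAL_TYPES_D = {
    "10": (1, 0),  # negative effect: D = 1 if M = 0, D = 0 if M = 1
    "01": (0, 1),  # positive effect
    "00": (0, 0),  # D = 0 regardless of M
    "11": (1, 1),  # D = 1 regardless of M
}

def D(m, theta_d):
    """Return D's value given M and the nodal type theta^D (e.g., '01')."""
    i, j = NODAL_TYPES_D[theta_d]
    return i if m == 0 else j

# In a theta^D_{01} case, mobilization makes the difference:
print(D(0, "01"), D(1, "01"))  # 0 1
```

Note that the four dictionary entries exhaust every possible mapping from a binary \(M\) to a binary \(D\), which is what makes the representation fully non-parametric.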

How should we think about what kind of thing \(\theta^D\) is, in a more substantive sense? It is helpful to think of \(\theta^D\) as an unknown and possibly random factor that conditions the effect of mobilization on democratization, determining whether \(M\) has a negative effect, a positive effect, no effect with democratization never occurring, or no effect with democratization bound to occur regardless of mobilization. A little more generally, it can be thought of as describing a “stratum”—a grouping together of units that may differ in innumerable ways but that, nevertheless, respond in the same way at the node in question given values of other nodes in the model (Frangakis and Rubin 2002). Importantly, while we might think of \(\theta^D\) as an unknown or random quantity, in this framework \(\theta^D\) should not be thought of as a nuisance—as “noise” that we would like to get rid of. Rather, under this non-parametric approach, \(\theta\) terms are the very quantities that we want to learn about: we want to know whether \(M\) likely had a positive, negative, or no effect on \(D\). We elaborate on this point in Chapter 4.

We can similarly use \(\theta\) terms to capture causal relations involving any number of parent nodes. Every substantively defined node, \(J\), in a graph can be thought of as having a \(\theta^J\) term pointing into it, and the (unobservable) value of \(\theta^J\) represents the mapping from \(J\)’s parents (if it has any) to the value of \(J\).

We can think of every \(\theta^J\) term as having a different range of possible values depending on the number of parents \(J\) has. Applied to the binary nodes in Figure 2.1, \(\theta^J\) ranges as follows:

  • Nodes with no parents: For a parentless node, like \(I\) or \(E\), \(\theta^J\) represents an external “assignment” process that can take on one of two values. If \(\theta^J=\theta^J_{0}\), we simply mean that \(J\) has been “assigned” to \(0\), while a value of \(\theta^J_{1}\) means that \(J\) has been assigned to 1. For instance, \(\theta^I_{0}\) describes a case in which exogenous forces have generated low inequality.
  • Binary nodes with 1 binary parent: For endogenous node \(R\), with only one parent (\(I\)), \(\theta^R\) takes on one of four values of the form \(\theta^R_{ij}\) (our four original types, \(\theta^R_{10}\), \(\theta^R_{01}\), etc.).
  • Binary nodes with 2 binary parents: \(M\) has two parent nodes. Thus, \(\theta^M\) will take on one of 16 possible values of the form \(\theta^M_{hijk}\) (e.g., \(\theta^M_{0000}\), \(\theta^M_{0001}\), etc.), using the syntax detailed earlier in this chapter and unpacked in Table 2.3.
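The counts above follow a simple rule: a binary node with \(k\) binary parents has \(2^{2^k}\) nodal types, since a type assigns an output to each of the \(2^k\) parent-value profiles. A short sketch (ours) makes the arithmetic explicit:

```python
# Why a binary node with k binary parents has 2**(2**k) nodal types:
# a nodal type assigns an output (0 or 1) to each of the 2**k possible
# combinations of parent values.
from itertools import product

def nodal_types(k):
    """Enumerate all nodal types for a binary node with k binary parents.
    Each type is a tuple giving the node's value under each parent profile."""
    n_profiles = 2 ** k
    return list(product([0, 1], repeat=n_profiles))

print(len(nodal_types(0)))  # 2   (parentless nodes like I or E)
print(len(nodal_types(1)))  # 4   (nodes like R or D)
print(len(nodal_types(2)))  # 16  (a node like M)
```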

Nodal types and causal types. So far, we have been talking about types operating at specific nodes. For instance, we can think of the unit of Malawi as having a \(\theta^D\) value—the type governing how \(D\) responds to \(M\) in this case. Let’s call this Malawi’s nodal causal type, or simply nodal type, for \(D\). But we can also conceptualize the full collection of Malawi’s nodal types: the nodal types governing causal effects in Malawi for all nodes in the model. This collection would include Malawi’s nodal type values for \(\theta^I\), \(\theta^E\), \(\theta^R\), \(\theta^M\), and \(\theta^D\). We refer to the collection of nodal types across all nodes for a given unit (i.e., a case) as the case’s unit causal type, or simply causal type. We denote a causal type by the vector \(\theta\), the elements of which are all of the nodal types in a given model (\(\theta^I\), \(\theta^E\), etc.). For analytic applications later in the book, this distinction between nodal types and causal types will become important.

We will sometimes refer to a unit’s causal type—the values of \(\theta\)—as a unit’s context. This is because \(\theta\) captures all exogenous forces acting on a unit. This includes the assignment process driving the model’s exogenous nodes (in our example, \(\theta^I\) and \(\theta^E\)) as well as all contextual factors that shape causal relations between nodes (\(\theta^R\), \(\theta^M\), and \(\theta^D\)). Put differently, \(\theta\) captures both how a unit reacts to situations and which situations it is reacting to.

Thus, if we hypothetically knew a unit’s causal type—all nodal types operating in the unit, for all nodes—then we would know everything there is to know about that unit. We would know the value of all exogenous nodes as well as how those values cascade through the model to determine the values of all endogenous nodes. So a unit’s causal type fully specifies all nodal values. More than that, because the causal type contains all causal information about a unit, it also tells us what values every endogenous node would take on under counterfactual values of other nodes. Of course, causal types, like nodal types, are fundamentally unobservable quantities. But (as we discuss later in the book) they are quantities that we will seek to draw inferences about from observable data, so it is conceptually useful to keep in mind what is at stake in learning about them.
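The way a causal type fixes every nodal value can be sketched in code. In this illustration (the function and encodings are our own, not from the text), each nodal type is a lookup from parent values to the node's value, and values cascade through the \(I \rightarrow R \rightarrow M \leftarrow E\), \(M \rightarrow D\) model:

```python
# A sketch of how a full causal type theta determines every node's value.
# Types for R and D are (value-if-parent-is-0, value-if-parent-is-1) pairs;
# the type for M maps each (E, R) profile to a value.

def realize(theta):
    """Given a causal type theta (a dict of nodal types), compute all nodes."""
    I = theta["I"]                      # exogenous: assigned 0 or 1
    E = theta["E"]
    R = theta["R"][I]                   # R responds to I
    M = theta["M"][(E, R)]              # M responds to (E, R)
    D = theta["D"][M]                   # D responds to M
    return {"I": I, "E": E, "R": R, "M": M, "D": D}

theta = {
    "I": 1,
    "E": 0,
    "R": (0, 1),                        # I has a positive effect on R
    "M": {(0, 0): 0, (0, 1): 1,         # R has a positive effect on M ...
          (1, 0): 0, (1, 1): 1},        # ... regardless of E
    "D": (0, 1),                        # M has a positive effect on D
}
print(realize(theta))  # {'I': 1, 'E': 0, 'R': 1, 'M': 1, 'D': 1}
```

The same `theta` also answers counterfactual questions: rerunning `realize` with `"I": 0` traces out what every endogenous node would have been had inequality been low.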

Table 2.4: Nodal types and causal types.
term | symbol | meaning
nodal type | \(\theta^J\) | The way that node \(J\) responds to the values of its parents. Example: \(\theta^Y_{10}\): \(Y\) takes the value 1 if \(X=0\) and 0 if \(X=1\).
causal type | \(\theta\) | A concatenation of nodal types, one for each node. Example: \((\theta^X_0, \theta^Y_{00})\) is a causal type that has \(X=0\) and has \(Y=0\) no matter what the value of \(X\).

A few important aspects of causal functions are worth highlighting.

  1. These functions express causal beliefs. When we write \(D=\beta M\) as a function, we do not just mean that we believe the values of \(M\) and \(D\) in the world to be linearly related. We mean that we believe that the value of \(M\) determines the value of \(D\) through this linear function. Functions are, in this sense, directional statements, with causes on the righthand side and an outcome on the left.

  2. The collection of simple functions that map from the values of parents of a given node to the values of that node are sufficient to represent potentially complex webs of causal relations. For each node, we do not need to think through entire sequences of causation that might precede it. We need only specify how we believe it to be affected by its parents—that is to say, those nodes pointing directly into it. Our outcome of interest, \(D\), may be shaped by multiple, long chains of causality. To theorize how \(D\) is generated, however, we write down how we believe \(D\) is shaped by its parent—its direct cause, \(M\). We then, separately, express a belief about how \(M\) is shaped by its parents, \(R\) and \(E\). A node’s function must include as inputs all, and only, those nodes that point directly into that node.21

  3. As in the general potential-outcomes framework, all relations in a causal model are conceptualized as deterministic at the case level. There is less at stake here than one might at first think: we simply mean that a node’s value is determined by the values of its parents together with any stochastic or unknown components. Uncertainty about causal relations is then expressed through the unknown parameters, the nodal types.

2.2.3 The distributions

Putting causal structure and causal functions together gives us what we call a structural causal model. A structural causal model expresses our beliefs about the skeletal structure of causal relations in a domain: it tells us which nodes are exogenous (entirely caused by things outside the model), which nodes are endogenous (partly caused by things inside the model), and which nodes can have effects on which other nodes.

But this only takes us so far in inscribing our causal beliefs about the world. In particular, we have not yet said anything about either the kinds of exogenous conditions that we believe are more or less prevalent in the world or about the kinds of causal effects that we expect most commonly to operate between linked nodes on the graph.

To put this a bit more formally, all nodes are functions of a case’s causal type: the collection of all of its nodal types. What we have not yet inscribed into the model, however, is beliefs about how likely or common different nodal types are in the world. Or, if we want to think of a collection of nodal types (a causal type) as a unit’s context, a structural causal model alone is silent on the prevalence or likelihood of different kinds of contexts. When we add this information we get a causal model (or, more precisely, a probabilistic causal model).

Thus, for instance, a structural causal model consistent with Figure 2.1 stipulates that democratization may be affected by mobilization, that mobilization may be affected by ethnic homogeneity and redistributive demands, and that redistributive demands may be affected by the level of inequality. But it says nothing about the values that we think the exogenous nodes tend to take on in the world.22 We have not said anything, that is, about how common high inequality is across the relevant domain of cases or how common ethnic homogeneity is.

Put differently, we have said nothing about the distribution of \(\theta^I\) or of \(\theta^E\). Similarly, we have said nothing yet about the nature of the causal effects in the model: for instance, about how commonly mobilization has positive, negative, or null effects on democratization; about how commonly redistributive demands (\(R\)) and ethnic homogeneity (\(E\)) have different possible joint causal effects on \(M\); or about how commonly inequality (\(I\)) has different possible effects on redistributive demands (\(R\)). That is, we have said nothing about the distribution of \(\theta^D\), \(\theta^M\), or \(\theta^R\) values in the world.

We make progress by specifying probability distributions over the model’s nodal types—its \(\theta^J\) terms—which in turn implies a probability distribution over the values of the endogenous nodes.

At the case level we can think of this probability distribution as a statement about our beliefs about the unit’s type. If we think in terms of populations we might think of this as a statement about the proportion of units in the population of interest that have different values for \(\theta^J\).

For instance, our structural causal model might tell us that \(E\) and \(R\) can jointly affect \(M\). We might, then, add to this a belief about what kinds of effects among these variables are most common. For instance, we might believe that redistribution rarely has a positive effect on mobilization when ethnic homogeneity is low. There are four specific nodal types in which \(R\) has a positive effect on \(M\) when \(E=0\): \(\theta^M_{0010}, \theta^M_{0110}, \theta^M_{0111}\), and \(\theta^M_{0011}\). (Look back at Table 2.3 to confirm this for yourself.) Thus, we can express our belief as a probability distribution over the possible nodal types for \(M\), \(\theta^M\), in which we place a relatively low probability on \(\theta^M_{0010}, \theta^M_{0110}, \theta^M_{0111}\), and \(\theta^M_{0011}\), as compared to \(\theta^M\)’s other possible values. This is akin to saying that we think that these four nodal types occur in a relatively small share of units in the population of interest.
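The four types listed can be recovered by enumeration. The sketch below (ours) assumes the subscript ordering of Table 2.3 runs over the parent profiles \((E=0,R=0), (E=1,R=0), (E=0,R=1), (E=1,R=1)\):

```python
# Enumerate the 16 nodal types theta^M_{hijk} and keep those in which
# R has a positive effect on M when E=0. The digit ordering over parent
# profiles is our assumption about Table 2.3's convention.
from itertools import product

types = ["".join(map(str, t)) for t in product([0, 1], repeat=4)]

def m_value(t, e, r):
    """Read M's value for profile (E=e, R=r) off the type string hijk."""
    index = {(0, 0): 0, (1, 0): 1, (0, 1): 2, (1, 1): 3}
    return int(t[index[(e, r)]])

# Positive R effect at E=0 means M(E=0,R=0)=0 and M(E=0,R=1)=1:
positive_at_e0 = [t for t in types
                  if m_value(t, 0, 0) == 0 and m_value(t, 0, 1) == 1]
print(positive_at_e0)  # ['0010', '0011', '0110', '0111']
```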

Of course, when we are thinking about populations we will usually be uncertain about these kinds of beliefs. We can then build uncertainty into our beliefs about the shares of different nodal types in the population. We do this by specifying a distribution over possible “share” allocations. For instance, rather than stipulating that the share of cases \(\theta^E_1\) is exactly \(0.1\), we can specify a distribution over the possible shares, centered on a low value but with our degree of uncertainty captured by that distribution’s variance. Similarly, we can specify a distribution over the shares of \(\theta^M\) types. We say more about these distributions when we turn to a discussion of Bayesianism in Chapter 5.
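As a minimal sketch of this idea (the parameter values are our own and purely illustrative), a Beta distribution can encode both the central share and our uncertainty about it:

```python
# Encoding uncertainty about a share: instead of fixing the share of
# theta^E_1 cases at exactly 0.1, place a Beta distribution over it.

def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

# Both distributions are centered on 0.1, but the second expresses
# much more uncertainty about the true share:
print(beta_mean_var(20, 180))  # mean 0.1, small variance
print(beta_mean_var(2, 18))    # mean 0.1, larger variance
```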

In the default setup, we assume that each \(\theta\) term (\(\theta^I, \theta^E, \theta^R\), etc.) is generated independently of the others. So, for instance, the probability that \(I\) has a positive effect on \(R\) in a case bears no relationship to the probability that \(M\) has a positive effect on \(D\). Or, put differently, those cases with a positive \(I \rightarrow R\) effect are no more or less likely to have a positive \(M \rightarrow D\) effect than are those cases without a positive \(I \rightarrow R\) effect. This independence feature is especially useful for being able to read off relations of “conditional independence” from a graph. See the box on the “Markovian condition” that relates the structure of the graph to the types of independence statements implied by the graph (Spirtes et al. 2000).

If this assumption cannot be maintained, then the model might have to be enriched to ensure independence between exogenous nodes. Otherwise, non-independence has to be taken into account when doing analysis.23 Graphically, we represent such failures of independence by using curved two-headed arrows. More on this in Section 2.3.1.

We need not say much more, for the moment, about the probabilistic components of causal models. But to foreshadow the argument to come, our prior beliefs about the relative prevalence of different causal types play a central role in the framework that we develop here. We will see how the encoding of contextual knowledge—beliefs that some kinds of conditions and causal effects are more common than others—forms a key foundation for causal inference. At the same time, our expressions of uncertainty about context represent scope for learning. At the case level we want to learn about \(\theta\)—the way things operate in a given case. At a population level however we might want to learn about the distribution of \(\theta\). Thus if we let \(\lambda_x\) represent the share of cases in a population that have \(\theta=\theta_x\) (see e.g. Chickering and Pearl (1996)) we can specify beliefs not just at the case level about \(\theta\) but at the population level about \(\lambda_x\).

2.3 Graphing models and using graphs

While we have been referring to causal graphs throughout this chapter, we want to take some time to unpack their core features and uses. A key benefit of causal models is that they lend themselves to graphical representations. In turn, graphs constructed according to particular rules can aid causal analysis. In the next subsection we discuss a set of rules for representing a model in graphical form. The following subsection then demonstrates how access to a graph facilitates causal inference.

2.3.1 Rules for graphing causal models

The diagram in Figure 2.1 is a causal DAG (Hernán and Robins 2006). We endow it with the interpretation that an arrow from a parent to a child means that a change in the parent can, under some circumstances, induce a change in the child. Though we have already been making use of this causal graph to help us visualize elements of a causal model, we now explicitly point out a number of general features of causal graphs as we will be using them throughout this book. Causal graphs have their own distinctive “grammar,” a set of rules that give them important analytic features.

Directed, acyclic. A causal graph represents elements of a causal model as a set of nodes (or vertices), representing variables, connected by a collection of single-headed arrows (or directed edges). We draw an arrow from node \(A\) to node \(B\) if and only if we believe that \(A\) can have a direct effect on \(B\). Thus, in Figure 2.1, the arrow from \(I\) to \(R\) means that inequality can directly affect redistributive demands.

The resulting diagram is a directed acyclic graph (DAG) if there are no paths along directed edges that lead from any node back to itself—i.e., if the graph contains no causal cycles. The absence of cycles (or “feedback loops”) is less constraining than it might appear at first. In particular, if one thinks that \(A\) today causes \(B\) tomorrow, which in turn causes \(A\) the next day, we can represent this as \(A_1 \rightarrow B \rightarrow A_2\) rather than \(A \leftrightarrow B\). That is, we timestamp the nodes, turning what might informally appear as feedback into a non-cyclical chain.

Meaning of missing arrows. The absence of an arrow between \(A\) and \(B\) means that \(A\) is not a direct cause of \(B\).24 Here lies an important asymmetry: drawing an \(A \rightarrow B\) arrow does not mean that we know that \(A\) does directly cause \(B\); but omitting such an arrow implies that we know that \(A\) does not directly cause \(B\). We say more with the arrows that we omit than with the arrows that we include.

Returning to Figure 2.1, we have here expressed the belief that redistributive preferences exert no direct effect on democratization; we have done so by not drawing an arrow directly from \(R\) to \(D\). In the context of this model, saying that redistributive preferences have no direct effect on democratization is to say that any effect of redistributive preferences on democratization must run through mobilization; there is no other pathway through which such an effect can operate. As social scientists, we often have beliefs that take this form. For instance, the omission of an arrow from \(R\) to \(D\) might be a way of encoding the prior knowledge that mass preferences for redistribution cannot induce autocratic elites to liberalize the regime absent collective action in pursuit of those preferences.

The same goes for the effects of \(I\) on \(M\), \(I\) on \(D\), and \(E\) on \(D\): the graph in Figure 2.1 implies that we believe that these effects also do not operate directly, but only along the indicated, mediated paths.

Sometimes-causes. The existence of an arrow from \(A\) to \(B\) does not imply that \(A\) always has a direct effect on \(B\). Consider, for instance, the arrows running from \(E\) and from \(R\) to \(M\). Since \(M\) has two parents, assuming all variables are binary, we define a range of 16 nodal types for \(\theta^M\), capturing all possible joint effects of \(E\) and \(R\). However, for some of these nodal types, \(E\) or \(R\) or both will have no effect on \(M\). For instance, in the nodal type \(\theta^M_{0011}\),25 \(E\) has no effect on \(M\) while \(R\) has a positive effect. Thus, in a case with this nodal type for \(M\), \(E\) is not a cause of \(M\); whereas in a case with, say, \(\theta^M_{0101}\), \(E\) has an effect on \(M\), while \(R\) has none. In this sense, the existence of the arrows pointing into \(M\) reflects the fact that \(E\) and \(R\) are “sometimes-causes” of \(M\).26

No excluded common causes. Any cause common to multiple nodes on a graph must itself be represented on the graph. If \(A\) and \(B\) on a graph are both affected by some third node, \(C\), then we must represent this common cause. Thus, for instance, the graph in Figure 2.1 implies that \(I\) and \(E\) have no common cause. If in fact we believed that a country’s level of inequality and its ethnic composition were both shaped by, say, its colonial heritage, then this DAG would not be an accurate representation of our beliefs about the world. To make it accurate, we would need to add to the graph a node capturing that colonial heritage and include arrows running from colonial heritage to both \(I\) and \(E\).

This rule of “no excluded common causes” ensures that the graph captures all potential correlations among nodes that are implied by our beliefs. If \(I\) and \(E\) are in fact driven by some common cause, then this means not just that these two nodes will be correlated but also that each will be correlated with any consequences of the other. For instance, a common cause of \(I\) and \(E\) would also imply a correlation between \(R\) and \(E\). \(R\) and \(E\) are implied to be independent in the current graph but would be implied to be correlated if a common node pointed into both \(I\) and \(E\).

Of particular interest is the implied independence of \(\theta\) terms from one another, noted earlier. In Figure 2.1, imagine, for instance, that the distribution of \(\theta^D\) and \(\theta^I\) were correlated: i.e., if the distribution of \(\theta^D\) were different when \(I=0\) than when \(I=1\). This could be because some other factor, perhaps a feature of a country’s economy, affects both its level of inequality and the response of its elites to mobilization from below. Such a situation would represent a classic form of confounding: the assignment of cases to values on an explanatory node (\(I\)) would be correlated with the case’s potential outcomes on \(D\). The omission of any such common cause is precisely equivalent to expressing the belief that \(I\) is exogenous, i.e., (as if) randomly assigned. If we believe such a common cause to be operating, however, then we must include it as a node on the graph, pointing into both \(I\) and \(D\).

Representing unobserved confounding. It may be, however, that there are common causes for nodes that we simply do not understand. We might believe, for instance, that some unknown factor (partially) determines both \(I\) and \(D\). We refer to this situation as one of unobserved confounding. Even when we do not know what factor is generating the confounding, we still have a violation of the assumption of independence and need to be sure we are capturing this correlation in the graph. We can do so in a couple of ways. If we are representing all \(\theta\) terms on a graph, then we can capture the correlation of \(\theta^I\) and \(\theta^D\) by including a single, joint term \((\theta^I, \theta^D)\) that points into both \(I\) and \(D\). Where the \(\theta\) terms are not explicitly included in a graph (as is often the case), we can represent unobserved confounding by adding a two-headed arrow, or a dotted line, connecting nodes whose unknown causes are not independent. Either way, we are building in the possibility of a joint distribution over the nodal types \(\theta^I\) and \(\theta^D\). Figure 2.2 illustrates for the \(I\) and \(D\) relationship.

Figure 2.2: A DAG with unobserved confounding

We address unobserved confounding in more detail later in the book and show how we can seek to learn about the joint distribution of nodal types—that is, how we can learn even about confounders that we cannot observe—in such situations.

License to exclude nodes. The flip side of the “no excluded common causes” rule is that a causal graph does not need to include everything that we know about a substantive domain of interest. We may know quite a lot about the causes of economic inequality, for example. But we can safely omit any other factor from the graph as long as it does not affect multiple nodes in the model. Indeed, \(\theta^I\) in Figure 2.1 already implicitly captures all factors that affect \(I\). Similarly, \(\theta^D\) captures all factors other than mobilization that affect democratization. We may be aware of a vast range of forces shaping whether countries democratize, but we can choose to bracket them for the purposes of an examination of the role of economic inequality. This bracketing is permissible as long as none of these unspecified factors also acts on other nodes that are included in the model. For instance, we have chosen to exclude from the model the existence of international pressure on a state to democratize, even though this is another potential cause of democratization. This exclusion is permissible as long as we believe that international pressure does not have an effect on the level of inequality, a state’s ethnic makeup, redistributive demands, or mobilization.

Similarly, we do not need to include all mediating steps that we might believe to operate between two causally linked variables. In Figure 2.1, we could choose to exclude \(R\), for instance, and draw an arrow directly from \(I\) to \(M\). We could also exclude \(M\), if we wished to. (Since \(E\) points into \(M\), removing \(M\) would mean that we would have \(E\) point directly into \(D\)—a point that we return to below.) And, of course, the model that we have drawn leaves out numerous other mediating steps that we might imagine—such as the role of elites’ perceptions of the costs of repression as a mediator between mobilization and democratization. In other words, we generally have discretion about the degree of granularity to represent in our chains of causation. For reasons that we take up in Chapters 6 and 7, we will sometimes want to spell out more, rather than fewer, mediating steps in our models for reasons of research design—because of the empirical leverage that such mediating variables might provide. However, there is nothing about the rules of DAG-making that requires a particular level of granularity.

We can’t read causal functions from a graph. As should be clear, a DAG does not represent all features of a causal model. What it does record is which nodes enter into the causal function for every other node: what can directly cause what. But the DAG contains no other information about the form of those causal relations. Thus, for instance, the DAG in Figure 2.1 tells us that \(M\) is a function of both \(R\) and \(E\), but it does not tell us whether that joint effect is additive (\(R\) and \(E\) separately increase mobilization) or interactive (the effect of each depends on the value of the other). Nor does it tell us whether either effect is linear, concave, or something else.

This lack of information about functional forms often puzzles those encountering causal graphs for the first time: surely it would be convenient to visually differentiate, say, additive from interactive effects. As one thinks about the variety of possible causal functions, however, it quickly becomes clear that there would be no simple visual way of capturing all possible functional relations. Moreover, causal graphs do not require functional statements to perform their main analytic purpose—a purpose to which we now turn.

2.3.2 Conditional independence from DAGs

If we encode our prior knowledge using the grammar of a causal graph, we can put that knowledge to work for us in powerful ways. In particular, the rules of DAG-construction allow for an easy reading of the expected correlations among the variables in the model. More formally, we say that we can use a DAG to identify the conditional independencies that are implied by our causal beliefs. (For a more extended treatment of the ideas in this section, see Rohrer (2018).)


Nodes \(A\) and \(B\) are “conditionally independent” given \(C\) if \(P(a|b,c) = P(a|c)\) for all values of \(a, b\) and \(c\).

To begin thinking about conditional independence, it is helpful to start by saying what it means for there to be dependence between two nodes. Let us first consider a simple relationship of dependence. Returning to Figure 2.1, the arrow running from \(I\) to \(R\), implying a direct causal effect, means that \(R\) is causally dependent on \(I\). This dependence, moreover, implies that we expect \(I\)’s and \(R\)’s values to be correlated with each other.

We can think of dependencies as generating flows of information. Put simply, observing the value of one node gives us information about the likely value of a node on which it is dependent, and vice-versa. The graph in Figure 2.1 implies that, if we measured redistributive preferences, we would also be in a better position to infer the level of inequality, and vice versa. Likewise, consider \(I\) and \(M\), which are linked in a relationship of dependence, one that is mediated by \(R\). Since inequality can affect mobilization (through \(R\)), knowing the level of inequality would allow us to improve our estimate of the level of mobilization—and vice versa.

In contrast, consider \(I\) and \(E\), which are in this graph indicated as being independent of one another. In drawing the graph such that these two nodes have no common ancestor, we are saying that the forces that set a case’s level of inequality are independent of the forces that determine its level of ethnic homogeneity. (Formally, recall our assumption that \(\theta^I\) and \(\theta^E\) are independent of each other unless \(I\) and \(E\) are connected by a double-headed arrow indicating confounding.) So learning the level of inequality in a case, according to this graph, would give us no information whatsoever about the case’s degree of ethnic homogeneity, and vice-versa.

So dependencies between nodes can arise from those nodes lying along a causal chain. Yet they can also arise from nodes having common causes (or ancestors). Consider Figure 2.3, where we are indicating that war (\(W\)) can cause both excess deaths (\(D\)) and price inflation (\(P\)). Casualties and inflation will then be correlated with one another because of their shared cause. If we learn that there have been military casualties, this information will lead us to think it more likely that there is also war and, in turn, that there is price inflation (and vice versa). When two outcomes have a common (proximate or distant) cause, observing one outcome should lead us to believe it more likely that the other outcome has also occurred.
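A small simulation (ours, with illustrative probabilities) shows how a common cause generates this correlation: learning \(D\) shifts our beliefs about \(P\) even though neither causes the other.

```python
# Simulating the logic of Figure 2.3: war W causes both excess deaths D
# and price inflation P, so D and P are correlated despite neither
# causing the other. All probabilities are illustrative.
import random

random.seed(1)
n = 10_000
data = []
for _ in range(n):
    w = random.random() < 0.5          # war occurs in half of cases
    d = w and random.random() < 0.9    # deaths likely given war
    p = w and random.random() < 0.9    # inflation likely given war
    data.append((d, p))

pr_p_given_d = sum(p for d, p in data if d) / sum(d for d, p in data)
pr_p = sum(p for _, p in data) / n
# Learning that D occurred raises the probability we assign to P:
print(round(pr_p_given_d, 2), ">", round(pr_p, 2))
```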

Yet, sometimes what we learn from an observation depends on what we already know. An everyday example can help us wrap our minds around this intuition. Suppose that, on a winter’s day, I want to know whether the boiler in my basement, which provides steam to the heating system, is working properly. I usually figure out if the boiler is working by reading the temperature on the thermometer on my living room wall: this is because I believe that the boiler’s operation causes the room temperature to rise (implying \(B \rightarrow T\)). Under this causal dependency, the temperature in the living room is generally informative about the boiler’s operation. If the room is warm, this makes me believe that the boiler is probably operating; if the room is cold, then I come to think it’s less likely that the boiler is running. (Similarly, if I go down to the basement and can see whether the boiler is fired up, this will shape my expectations about how warm the living room is.)

However, I also believe that the boiler affects the room’s temperature through a change in the temperature of the radiator (\(B \rightarrow R \rightarrow T\)), and that this is the only way in which the boiler can affect the room temperature. So suppose that, before reading the thermometer on the wall, I touch the radiator and feel that it is hot. The radiator’s temperature has, of course, given me information about the boiler’s operation—since I believe that the boiler’s operation has an effect on the radiator’s temperature (\(B \rightarrow R\)). If the radiator is hot, I judge that the boiler is probably running. But now, having already observed the radiator’s temperature, can I learn anything further about whether the boiler is operating by taking a reading from the thermometer on the wall? No, I cannot. Everything I could possibly learn about the boiler’s status from gauging the room’s temperature I have already learned from touching the radiator—since the boiler’s effect on the room’s temperature runs entirely through the radiator. One way to think about this is that, by observing the radiator’s temperature, we have fully intercepted, or “blocked,” the flow of information from the boiler to the wall thermometer.

In sum, the room’s temperature can be informative about the boiler; but whether it is informative hinges on whether we already know if the radiator is hot. If we know \(R\), then \(B\) and \(T\) are uninformative about one another. Formally, we say that \(B\) and \(T\) are conditionally independent given \(R\).

Turning back to Figure 2.1, imagine that we already knew the level of redistributive preferences. Would we then be in a position to learn about the level of inequality by observing the level of mobilization? According to this DAG we would not. This is because \(R\), which we already know, blocks the flow of information between \(I\) and \(M\). Since the causal link—and, hence, flow of information—between \(I\) and \(M\) runs through \(R\), and we already know \(R\), there is nothing left to be learned about \(I\) by also observing \(M\). Anything we could have learned about inequality by observing mobilization is already captured by the level of redistributive preferences, which we have already seen. We can express this idea by saying that \(I\) and \(M\) are conditionally independent given \(R\). That is, observing \(R\) makes \(I\) and \(M\) independent of one another.

Technical Note on the Markov Property

The assumptions that no node is its own descendant and that the \(\theta\) terms are generated independently make the model Markovian; the parents of a given node are then its Markovian parents.

Knowing the set of Markovian parents allows you to write relatively simple factorizations of a joint probability distribution, exploiting the fact (“the Markov condition”) that all nodes are conditionally independent of their nondescendants, conditional on their parents.

Consider how this Markovian property allows for simple factorization of \(P\) for an \(X \rightarrow M \rightarrow Y\) DAG. Note that \(P(X, M, Y)\) can always be written as: \[P(X, M, Y) = P(X)P(M|X)P(Y|M, X)\] If we believe, as implied by this DAG, that \(Y\) is independent of \(X\) given \(M\), then we have the simpler factorization: \[P(X, M, Y) = P(X)P(M|X)P(Y|M)\]

More generally, using \(pa_i\) to denote the parents of \(i\), we have:

\[\begin{equation} P(v_1,v_2,\dots, v_n) = \prod_i P(v_i|pa_i) \tag{2.2} \end{equation}\]
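To see this factorization at work, here is a minimal numeric sketch (in Python, for illustration; the probability tables are invented) that builds the joint distribution for an \(X \rightarrow M \rightarrow Y\) chain from \(P(X)\), \(P(M|X)\), and \(P(Y|M)\), and then confirms the implied conditional independence of \(Y\) and \(X\) given \(M\):

```python
import numpy as np

# Invented probability tables for a binary X -> M -> Y chain
pX = np.array([0.6, 0.4])            # P(X)
pM_X = np.array([[0.8, 0.2],         # P(M | X=0)
                 [0.3, 0.7]])        # P(M | X=1)
pY_M = np.array([[0.9, 0.1],         # P(Y | M=0)
                 [0.25, 0.75]])      # P(Y | M=1)

# Markov factorization: P(x, m, y) = P(x) P(m|x) P(y|m)
joint = np.einsum('x,xm,my->xmy', pX, pM_X, pY_M)
assert np.isclose(joint.sum(), 1.0)

# Check the implied conditional independence: P(y | m, x) = P(y | m)
pXM = joint.sum(axis=2)                # P(x, m)
pY_given_XM = joint / pXM[:, :, None]  # P(y | x, m)
for m in (0, 1):
    assert np.allclose(pY_given_XM[0, m], pY_given_XM[1, m])
print("Y is independent of X given M under this factorization")
```

Because the joint is constructed from the simpler factorization, \(P(Y|M,X)\) necessarily coincides with \(P(Y|M)\); the check simply makes the logic concrete.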

Knowing whether two nodes are or are not conditionally independent of each other thus tells us whether we can learn about one from the values of the other.

To return to the war example in Figure 2.3, we would say that excess deaths and price inflation are conditionally independent given war. If we already know that there is war, then we can learn nothing further about the level of excess deaths by observing price inflation (and, similarly, nothing further about inflation by observing deaths). We can think of war, when observed, as blocking the flow of information between its two consequences; everything we would learn about inflation from excess deaths is already contained in the observation that there is war. Put differently, if we were just to look at cases where war is present (i.e., if we hold war constant), we should find no correlation between excess deaths and price inflation; likewise for cases in which war is absent.
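This implication lends itself to a quick simulation check. The sketch below (in Python; the structural equations and parameter values are made up for illustration) draws war as a root cause of both outcomes and compares the unconditional correlation of deaths and inflation with their correlation within levels of \(W\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Made-up structural equations: W is a root cause of both D and P
W = rng.binomial(1, 0.5, n)
D = np.where(W == 1, rng.binomial(1, 0.8, n), rng.binomial(1, 0.1, n))
P = np.where(W == 1, rng.binomial(1, 0.7, n), rng.binomial(1, 0.2, n))

corr = lambda a, b: np.corrcoef(a, b)[0, 1]

print(corr(D, P))                  # clearly positive: shared cause
print(corr(D[W == 1], P[W == 1]))  # approximately zero given W = 1
print(corr(D[W == 0], P[W == 0]))  # approximately zero given W = 0
```

The overall correlation is substantial, but within each stratum of \(W\) it vanishes: holding war constant, deaths and inflation are uncorrelated, just as the DAG implies.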

Importantly, being able to read this relationship of conditional independence from the graph hinges on our having followed the rules in drawing the DAG. In particular, we must be sure we have not excluded any common causes. If we believe, for instance, that pandemics are also causes of both excess deaths and inflation, then we have a second source of correlation between these two nodes. In that scenario, observing \(W\) would not fully block the flow of information between \(D\) and \(P\). Rather, \(D\) and \(P\) would be conditionally independent only given both war and pandemics (i.e., only if we have observed both). We will only be able to correctly assess these relationships, however, if we follow the rules for DAG construction and include pandemics as a node on the graph and a common cause of excess deaths and price inflation.


Figure 2.3: This graph represents a simple causal model in which war (\(W\)) affects both excess deaths (\(D\)) and price inflation (\(P\)).

Relations of conditional independence are central to the strategy of statistical control, or covariate adjustment, in correlation-based forms of causal inference, such as regression. In a regression framework, identifying the causal effect of an explanatory node, \(X\), on a dependent node, \(Y\), requires the assumption that \(X\)’s value is conditionally independent of \(Y\)’s potential outcomes (over values of \(X\)) given the model’s covariates. To draw a causal inference from a regression coefficient, in other words, we have to believe that including the covariates in the model “breaks” any biasing correlation between the value of the causal node and its unit-level effect.

As we will explore, however, relations of conditional independence are also of more general interest in that they tell us, given a model, when information about one feature of the world may be informative about another feature of the world, given what we already know. By identifying the possibilities for learning, relations of conditional independence can thus guide research design. We discuss these research-design implications in Chapter 7, but focus here on showing how relations of conditional independence operate on a DAG.

To see more systematically how a DAG can reveal conditional independencies, it is useful to spell out three elemental structures through which information can flow across a causal graph.


Figure 2.4: Three elemental relations of conditional independence.

For each of the three structures in Figure 2.4 we can read off whether nodes are independent both in situations when other nodes are not already observed and in situations in which they are. We discuss each of these structures in turn. For each, we first specify the unconditional relations among nodes in the structure and then the relations conditional on having already observed another node. When we talk about “unconditional” relations, we are asking: what does observing one node in the structure tell us about the other nodes? When we talk about “conditional” relations, we are asking: if we have already observed a node (so, conditional on that node), what does observing a second node tell us about a third node?

(1) A path of arrows in the same direction

Unconditional relations. Information can flow unconditionally along a path of arrows pointing in the same direction. In Panel 1 of Figure 2.4, information flows across all three nodes. If we have observed nothing yet, learning about any one node can tell us something about the other two.

Conditional relations. Learning the value of a node along a path of arrows pointing in the same direction blocks the flow of information passing through that node. Knowing the value of \(B\) in Panel 1 renders \(A\) no longer informative about \(C\), and vice versa. This is because anything that \(A\) might tell us about \(C\) is already captured by the information contained in \(B\).

(2) A forked path

Unconditional relations. Information can flow unconditionally across the branches of any forked path. In Panel 2, if we have observed nothing already, learning about any one node can provide information about the other two nodes. For instance, observing only \(A\) can provide information about \(C\) and vice-versa.

Conditional relations. Learning the value of the node at the forking point blocks flows of information across the branches of a forked path. In Panel 2, learning \(A\) provides no information about \(C\) if we already know the value of \(B\).27

(3) An inverted fork (collision)

Unconditional relations. When two or more arrowheads collide, generating an inverted fork, there is no unconditional flow of information between the incoming sequences of arrows. In Panel 3, learning only \(A\) provides no information about \(C\), and vice-versa, since each is independently determined.

Conditional relations. Collisions can be sites of conditional flows of information. In the jargon of causal graphs, \(B\) in Panel 3 is a “collider” for \(A\) and \(C\).28 Although information does not flow unconditionally across colliding sequences, it does flow across them conditional on knowing the value of the collider node or any of its downstream consequences. In Panel 3, learning \(A\) does provide new information about \(C\), and vice-versa, if we also know the value of \(B\) (or, in principle, the value of anything that \(B\) causes).

The last point is somewhat counter-intuitive and warrants further discussion. It is easy enough to see that, for two nodes that are correlated unconditionally, that correlation can be “broken” by controlling for a third node. In the case of collision, two nodes that are not correlated when taken by themselves become correlated when we condition on (i.e., learn the value of) a third node into which they both point, the collider. The reason is in fact quite straightforward once one sees it: if an outcome is a joint function of two inputs, then if we know the outcome, information about one of the inputs can provide information about the other input. For example, if I know that you are short, then learning that your mother is tall makes me more confident that your father is short. Crucially, it is knowing the outcome—that you are short—that makes the information about your mother’s height informative about your father’s.
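A small simulation makes the collider logic vivid. In this sketch (in Python; the functional form \(B = A \lor C\) is an arbitrary choice for illustration), \(A\) and \(C\) are drawn independently, so they are uncorrelated overall; but among cases where \(B = 1\), learning that \(A = 0\) implies that \(C = 1\), inducing a negative correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# A and C are independent causes; B is their collider
# (here B = A OR C, an arbitrary illustrative functional form)
A = rng.binomial(1, 0.5, n)
C = rng.binomial(1, 0.5, n)
B = A | C

corr = lambda a, b: np.corrcoef(a, b)[0, 1]

print(corr(A, C))                  # approximately zero: independent causes
print(corr(A[B == 1], C[B == 1]))  # strongly negative given the collider
```

Conditioning on the collider manufactures a correlation where none existed: among \(B = 1\) cases, the absence of one cause makes the presence of the other more likely.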

Looking back at our democratization DAG in Figure 2.1, \(M\) is a collider for \(R\) and \(E\), its two inputs. Suppose that \(M\) is determined by the parametric causal function \(M=RE\). Knowing about redistributive preferences alone provides no information whatsoever about ethnic homogeneity since the two are determined independently of one another. On the other hand, imagine that we already know that there was no mobilization (\(M=0\)). Now, if we observe that there were redistributive preferences (\(R=1\)), we can figure out the level of ethnic homogeneity: it must be 0. (And likewise in going from observing homogeneity to inferring redistributive preferences.)

Using these basic principles, conditional independencies can be read off of any DAG. To determine whether two nodes are conditionally independent of one another, we check every path connecting them and ask whether, along those paths, the flow of information is open or blocked, given any other nodes whose values are already observed. Conditional independence is established when all paths are blocked given what we have already observed; otherwise, conditional independence is absent.

Following Pearl (2000), we will sometimes refer to relations of conditional independence using the term \(d\)-separation. We say that variable set \(\mathcal C\) \(d\)-separates variable set \(\mathcal A\) from variable set \(\mathcal B\) if \(\mathcal A\) and \(\mathcal B\) are conditionally independent given \(\mathcal C\). For instance, in Panel 2 of Figure 2.4, \(B\) \(d\)-separates \(A\) and \(C\). We say that \(\mathcal A\) and \(\mathcal B\) are \(d\)-connected given \(\mathcal C\) if \(\mathcal A\) and \(\mathcal B\) are not conditionally independent given \(\mathcal C\). For instance, in Panel 3, \(A\) and \(C\) are \(d\)-connected given \(B\).
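Relations of \(d\)-separation can also be checked mechanically. The sketch below (in Python; the function and graph encodings are our own, not from this book’s companion package) implements the standard moralized-ancestral-graph test: restrict attention to ancestors of the nodes in question, “marry” the parents of each common child, drop edge directions, delete the conditioning set, and ask whether the two sets remain connected.

```python
from collections import deque

def d_separated(parents, xs, ys, zs):
    """Test whether node sets xs and ys are d-separated given zs.
    `parents` maps each node to the set of its parents in the DAG."""
    # 1. Keep only ancestors of the nodes in question (including themselves)
    relevant, queue = set(), deque(set(xs) | set(ys) | set(zs))
    while queue:
        v = queue.popleft()
        if v not in relevant:
            relevant.add(v)
            queue.extend(parents.get(v, ()))
    # 2. Moralize: connect co-parents, then drop edge directions
    und = {v: set() for v in relevant}
    for v in relevant:
        ps = [p for p in parents.get(v, ()) if p in relevant]
        for p in ps:
            und[v].add(p); und[p].add(v)
        for i, p in enumerate(ps):          # "marry" the parents of v
            for q in ps[i + 1:]:
                und[p].add(q); und[q].add(p)
    # 3. Delete the conditioning set, then check connectivity
    blocked = set(zs)
    seen, queue = set(), deque(set(xs) - blocked)
    while queue:
        v = queue.popleft()
        if v in seen:
            continue
        seen.add(v)
        queue.extend(und[v] - blocked)
    return not (seen & set(ys))

# Panel 2 (fork A <- B -> C): B d-separates A and C
fork = {"A": {"B"}, "B": set(), "C": {"B"}}
print(d_separated(fork, {"A"}, {"C"}, {"B"}))      # True
# Panel 3 (collider A -> B <- C): conditioning on B d-connects A and C
collider = {"A": set(), "B": {"A", "C"}, "C": set()}
print(d_separated(collider, {"A"}, {"C"}, set()))  # True
print(d_separated(collider, {"A"}, {"C"}, {"B"}))  # False
```

Applied to the panels of Figure 2.4, the function recovers exactly the relations discussed above.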

Readers are invited to practice reading relations of conditional independence off of a DAG using the exercises in the chapter appendix, section 2.5.4. Analyzing a causal graph for relations of independence represents one payoff to formally encoding our beliefs about the world in a causal model. We are, in essence, drawing out implications of those beliefs: given what we believe about a set of direct causal relations (the arrows on the graph), what must this logically imply about other dependencies and independencies on the graph, conditional on having observed some particular set of nodes? We show in a later chapter how these implications can be deployed to guide research design, by indicating which parts of a causal system are potentially informative about other parts that may be of interest.

2.3.3 Simplifying models

It is very easy to write down a model that is too complex to use effectively. In such cases we often seek simpler models that are consistent with models we have in mind but contain fewer nodes or more limited variation. As we have already suggested, we often have considerable discretion about how detailed to make our models. However, whenever we seek to simplify a more complex model, we must take care to ensure that the simplified model is logically consistent with the original model.

Fortunately, the mapping between graphs and relations of conditional independence gives guidance for determining when and how it is possible to simplify models. We spell out the rules for simplifying models here, focusing the discussion on simplifications that involve eliminating nodes and conditioning on nodes.

Eliminating nodes

If we want to eliminate a node, the key rule is that the new model (and graph) must take into account:

  1. all dependencies among remaining nodes and
  2. all variation generated by the eliminated node.

We can work out what this means, separately, for eliminating endogenous nodes and for eliminating exogenous nodes.

Eliminating endogenous nodes

Eliminating an endogenous node means removing a node with parents (direct causes) represented on the graph. If the node also has one or more children, then the node captures a dependency: it links its parents to its children. When we eliminate this node, preserving these dependencies requires that all of the eliminated node’s parents adopt—become parents of—all of the eliminated node’s children. Thus, for instance, in Figure 2.1, if we were to eliminate \(R\), \(R\)’s parent (\(I\)) would need to adopt \(R\)’s child, \(M\): we would then draw a direct \(I \rightarrow M\) arrow.

More intuitively, when we simplify away a mediator, we need to make sure that we preserve the causal relationships being mediated—both those among substantive variables and any random shocks at the mediating causal steps.29

Eliminating root nodes

What about eliminating root nodes—nodes with no parents in \(\mathcal{V}\)? For the most part, root nodes cannot be eliminated, but must either be replaced by or incorporated into \(\theta\) terms. The reason is that we need to preserve any dependencies or variation generated by these nodes. Figure 2.5 walks through four different situations in which we might want to remove a root node \(X\). The left column shows the original graph and the right column shows the simplification. Note that, although the simplified DAGs don’t always look simpler, they all involve a smaller number of named, substantive nodes.


Figure 2.5: Basic principles for eliminating root nodes.

  • Multiple children. In (a1), we start with a model in which \(X\) has two children, thus generating a dependency between \(W\) and \(Y\). If we eliminate \(X\), then we must preserve this dependency. We can do so, as pictured in (a2), by replacing \(X\) with a \(\theta\) term that points into both \(W\) and \(Y\). This \(\theta^{WY}\) term represents an unspecified common cause of \(W\) and \(Y\), allowing for a correlation between the two. By convention, we could, alternatively, convey the same information with a double-headed arrow (or dashed, undirected line) between \(W\) and \(Y\). Though we are no longer specifying what it is that connects \(W\) and \(Y\), the correlation itself is retained.
  • Substantive spouse. In (b1), \(X\) has a spouse that is substantively specified, \(W\). If we eliminate \(X\), we have to preserve the fact that \(Y\) is not fully determined by \(W\); something else also generates variation in \(Y\). We thus need to replace \(X\) with a \(\theta\) term, \(\theta^Y\), to capture the variation in \(Y\) that is not accounted for by \(W\).
  • \(\theta\)-term spouse. In (c1), \(X\) has a spouse that is not substantively specified, \(\theta^{Y}\). Eliminating \(X\) requires, again, capturing the variance that it generates as a random input. As we already have a \(\theta\) term pointing only into \(Y\), we can substitute in \(\theta^{Y}_\text{mod}\), which represents both \(\theta^{Y}\) and the variance generated by \(X\).30
  • One child, no spouse. In (d1), \(X\) has only one child and no spouse. Here we can safely eliminate \(X\) with no loss of information. It is always understood that every root node has some cause, and there is no loss of information in simply eliminating a node’s causes if those causes are exogenous and do not affect other endogenous nodes in the model. In (d2) we are simply not specifying \(Y\)’s cause, but we have not lost any dependencies or sources of variance that had been expressed in (d1).

One interesting effect of eliminating a substantive root node can be to render seemingly deterministic relations effectively probabilistic. In moving from (b1) to (b2), for instance, we have taken a component of \(Y\) that was determined by \(X\) and converted it into a random disturbance. So the simplified model says less about the world; it is less specific. But it is nonetheless a logical implication of the more complex theory, and thus consistent with it.


Figure 2.6: A model from which multiple simpler models can be derived.


Figure 2.7: Simplifications of the model of Figure 2.6. Nodes that are eliminated are marked in grey; circles denote root nodes that are replaced in subgraphs by unidentified variables. (A circled node pointing into two other nodes could equivalently be indicated as an undirected edge connecting the two.)

We can apply these principles to a model of any complexity. We illustrate a wider range of simplifications by starting with Figure 2.6. (Note that we have here altered our original inequality and democracy model by adding an arrow from \(E\) to \(R\), allowing ethnic homogeneity to affect redistributive demands.) In Figure 2.7, we show all permissible reductions of the more elaborate model. We can think of these reductions as the full set of simpler claims (involving at least two nodes) that can be derived from the original model. In each subgraph,

  • we mark eliminated nodes in grey;
  • those nodes that are circled must be replaced with \(\theta\) terms; and
  • arrows represent the causal dependencies that must be preserved.

Note, for instance, that neither \(I\) (because it has a spouse) nor \(E\) (because it has multiple children) can be simply eliminated; each must be replaced with a \(\theta\) term. Also, a simplified graph with nodes missing can contain arrows that do not appear at all in the original graph: eliminating \(R\), for instance, forces an arrow running from \(E\) to \(M\) (though that one is there already) and another running from \(E\) to \(D\), as \(E\) must adopt \(R\)’s children. The simplest elimination is of \(D\) itself since \(D\)—sitting at the “end” of the DAG—has no children that we have to worry about when it is gone.

Conditioning on nodes

Another way to simplify a model is to condition on the value of a node. When we condition on a node, we are restricting the model in scope to situations in which that node’s value is held constant. Doing so allows us to eliminate the node as well as all arrows pointing into it or out of it. Consider three different situations in which we might condition on a node:

  • Root node with multiple children. When removing \(X\) from graph (a1) in Figure 2.5, we needed to be sure to retain the dependence that \(X\) generates between \(W\) and \(Y\). Not so, however, if we are fixing \(X\)’s value. Recalling the rules of conditional independence on a graph (discussed earlier in this chapter), we know that \(W\) and \(Y\) are independent conditional on \(X\). Put differently, if we restrict the analysis to contexts in which \(X\) takes on a constant value, the model implies that \(Y\) and \(W\) will be uncorrelated across cases. As fixing \(X\)’s value breaks the dependence between \(Y\) and \(W\), we can drop \(X\) (and the arrows pointing out of it) without having to represent that dependence.
  • Root node with spouse. In removing \(X\) from graphs (b1) or (c1) in Figure 2.5, we needed to account for the variation generated by \(X\). If we fix \(X\)’s value, however, then we eliminate this variation by assumption and do not need to continue to represent it (or the arrow pointing out of it) on the graph.
  • Endogenous node. When we condition on an endogenous node, we can eliminate the node as well as the arrows pointing into and out of it. We, again, leverage relations of conditional independence here. If we start with the model \(X \rightarrow M \rightarrow Y\) and we condition on the mediator, \(M\), we sever the link between \(X\) and \(Y\), rendering these two nodes conditionally independent of one another. We can thus remove \(M\), the arrow from \(X\) to \(M\), and the arrow from \(M\) to \(Y\). In the new model, with \(M\) fixed, \(Y\) will be entirely determined by the random disturbance \(\theta^{Y}\).

2.3.4 Retaining probabilistic relations

We have highlighted the graphical implications of eliminating or conditioning on nodes, but, importantly, the distribution over \(\theta\) must also be preserved faithfully in the move to a simpler model.

In sum, we can work with models that are simpler than our causal beliefs: we may believe a model to be true, but we can derive from it a sparser set of claims. There may be intervening causal steps or features of context that we believe matter, but that are not of interest for a particular line of inquiry. While these can be left out of our model, we nonetheless have to make sure that their implications for the relations remaining in the model are not lost. Understanding the rules of reduction allows us to undertake an important task: checking which simpler claims are and are not consistent with our full belief set.

2.4 Conclusion

In this chapter, we have shown how we can inscribe causal beliefs, rooted in the potential outcomes framework, into a causal model. In doing so, we have now set out the foundations of the book’s analytic framework. Causal models are both the starting point for analysis in this framework and the object about which we seek to learn. Before moving on to build on this foundation, we aim in the next chapter to offer further guidance by example on the construction of causal models, by illustrating how a set of substantive social scientific arguments from the literature can be represented in causal model form.

2.5 Chapter Appendix

2.5.1 Steps for constructing causal models

  1. Identify a set of variables in a domain of interest. These become the nodes of the model.
  • Specify the range of each node: is it continuous or discrete?
  • Each node should have an associated \(\theta\) term pointing into it, representing unspecified other influences (not necessarily graphed).
  2. Draw a causal graph (DAG) representing beliefs about causal dependencies among these nodes.
  • Include arrows for direct effects only.
  • Arrows indicate possible causal effects.
  • The absence of an arrow between two nodes indicates a belief of no direct causal relationship between them.
  • Ensure that the graph captures all correlations among nodes. This means that either (a) any common cause of two or more nodes is included on the graph (with implications for Step 1) or (b) correlated nodes are connected with a double-headed arrow or dashed, undirected edge.
  3. Write down one causal function for each endogenous node.
  • Each node’s function must include all nodes directly pointing into it on the graph as well as the \(\theta\) terms.
  • Functions may express arbitrary amounts of uncertainty about causal relations.
  • In this book’s non-parametric framework, the causal functions are captured by the \(\theta\) terms.
  4. State probabilistic beliefs about the distributions of the \(\theta\)s.
  • How common or likely do we think different nodal types are for each node?
  • Are the nodal types independently distributed? If in Step 2 we drew an undirected edge between nodes, then we believe that the connected nodes’ types are not independently distributed.

2.5.2 Model construction in code

Our CausalQueries package provides a set of functions to implement all of these steps concisely for binary models—models in which all nodes are dichotomous.

# Steps 1 and 2
# We define a model with three binary nodes and
# specified edges between them:
model <- make_model("X -> M -> Y")

# Step 3
# Functional forms are unrestricted. Restrictions can
# be added. Here we impose monotonicity at each step
# by removing one type for M and one for Y
model <- set_restrictions(model, labels = list(M = "10", Y = "10"))

# Step 4
# Set priors over the distribution of (remaining) causal types.
# Here we set "Jeffreys priors"
model <- set_priors(model, distribution = "jeffreys")

# We now have a model defined as an R object.
# Later we will update and query this model

These steps are enough to fully describe a binary causal model. Later in this book, we will see how to pose questions to a model like this and how to use data to update it.

2.5.3 Rules for moving between levels

Moving down levels:

All (conditional) independencies represented in a higher-level model must be preserved in the lower-level model.

When we disaggregate or add nodes to a model, new conditional independencies can be generated. But any variables that are independent or conditionally independent (given a third variable) in the higher-level model must also be independent or conditionally independent in the lower-level model.

Moving up levels:

We can move up levels by eliminating a node or by conditioning on a node. When we eliminate a node from a model, we must preserve any variation and dependencies that it generates:

  1. When eliminating a node that has parents, that node’s parents adopt (become direct causes of) that node’s children.
  2. When eliminating a root node (a node with no parents in \(\mathcal{V}\)), we must usually replace it with a \(\theta\) term. If the node has more than one child, it must be replaced with a \(\theta\) term pointing into all of its children (or an undirected edge connecting them) to preserve the dependency among those children. If the node has a spouse, the eliminated node’s variation must also be preserved using a \(\theta\) term. Where the spouse is (already) a \(\theta\) term with no other children, the \(\theta\) terms can be combined.
  3. Since conditioning on a node “blocks” the path through which it connects its children, we can simply eliminate the node and the arrows between it and its children.
  4. A root node with no spouse and only one child can simply be eliminated.

2.5.4 Exercise: Reading conditional independence from a graph

We illustrate how to identify the relations of conditional independence between \(A\) and \(D\) in Figure 2.8.


Figure 2.8: An exercise: \(A\) and \(D\) are conditionally independent, given which other node(s)?

Are \(A\) and \(D\) independent:

  • unconditionally?

Yes. \(B\) is a collider, and information does not flow across a collider if the value of the collider node or its consequences is not known. Since no information can flow between \(A\) and \(C\), no information can flow between \(A\) and \(D\) simply because any such flow would have to run through \(C\).

  • if you condition on \(B\)?

No. Conditioning on a collider opens the flow of information across the incoming paths. Now, information flows between \(A\) and \(C\). And since information flows between \(C\) and \(D\), \(A\) and \(D\) are now also connected by an unbroken path. While \(A\) and \(D\) were independent when we conditioned on nothing, they cease to be independent when we condition on \(B\).

  • if you condition on \(C\)?

Yes. Conditioning on \(C\), in fact, has no effect on the situation. Doing so cuts off \(B\) from \(D\), but this is irrelevant to the \(A\)-\(D\) relationship since the flow between \(A\) and \(D\) was already blocked at \(B\), an unobserved collider.

  • if you condition on \(B\) and \(C\)?

Yes. Now we are doing two countervailing things at once. While conditioning on \(B\) opens the path connecting \(A\) and \(D\), conditioning on \(C\) closes it again, leaving \(A\) and \(D\) conditionally independent.
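Assuming Figure 2.8 has the structure \(A \rightarrow B \leftarrow C \rightarrow D\), which is consistent with the answers above, these relations can be sanity-checked by simulation (in Python; the functional forms are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

# Assumed structure (consistent with the answers above): A -> B <- C -> D,
# with arbitrary illustrative functional forms
A = rng.binomial(1, 0.5, n)
C = rng.binomial(1, 0.5, n)
B = A | C                        # B is a collider for A and C
D = C ^ rng.binomial(1, 0.2, n)  # D is a noisy function of C

corr = lambda a, b: np.corrcoef(a, b)[0, 1]

print(corr(A, D))                  # approximately zero: unconditionally independent
print(corr(A[B == 1], D[B == 1]))  # negative: conditioning on B opens the path
sel = (B == 1) & (C == 1)
print(corr(A[sel], D[sel]))        # approximately zero again: C re-blocks the path
```

Conditioning on the collider \(B\) induces an \(A\)–\(D\) correlation; conditioning on \(C\) as well removes it, exactly as the exercise answers state.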


Boix, Carles. 2003. Democracy and Redistribution. New York: Cambridge University Press.
Cartwright, Nancy et al. 1994. “Nature’s Capacities.” OUP Catalogue.
Chickering, David M, and Judea Pearl. 1996. “A Clinician’s Tool for Analyzing Non-Compliance.” In Proceedings of the National Conference on Artificial Intelligence, 1269–76.
Copas, J. B. 1973. “Randomization Models for the Matched and Unmatched \(2 \times 2\) Tables.” Biometrika 60 (3): 467–76.
Frangakis, Constantine E, and Donald B Rubin. 2002. “Principal Stratification in Causal Inference.” Biometrics 58 (1): 21–29.
Galles, David, and Judea Pearl. 1998. “An Axiomatic Characterization of Causal Counterfactuals.” Foundations of Science 3 (1): 151–82.
Halpern, Joseph Y, and Judea Pearl. 2005. “Causes and Explanations: A Structural-Model Approach. Part i: Causes.” The British Journal for the Philosophy of Science 56 (4): 843–87.
Hernán, Miguel A, and James M Robins. 2006. “Instruments for Causal Inference: An Epidemiologist’s Dream?” Epidemiology 17 (4): 360–72.
Holland, Paul W. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association 81 (396): 945–60.
Hume, David, and Tom L Beauchamp. 2000. An Enquiry Concerning Human Understanding: A Critical Edition. Vol. 3. Oxford University Press.
Koller, Daphne, and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Laplace, Pierre-Simon. 1901. A Philosophical Essay on Probabilities. Translated by F.W. Truscott and F.L. Emory. Vol. 166. New York: Cosimo.
Lewis, David. 1973. “Counterfactuals and Comparative Possibility.” In Ifs, 57–85. Springer.
———. 1986. “Causation.” Philosophical Papers 2: 159–213.
Mahoney, James. 2008. “Toward a Unified Theory of Causality.” Comparative Political Studies 41 (4-5): 412–36.
Pearl, Judea. 2000. Causality: Models, Reasoning and Inference. Vol. 29. Cambridge University Press.
———. 2009. Causality. Cambridge University Press.
Rohrer, Julia M. 2018. “Thinking Clearly about Correlations and Causation: Graphical Causal Models for Observational Data.” Advances in Methods and Practices in Psychological Science 1 (1): 27–42.
Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66: 688–701.
Spirtes, Peter, Clark N Glymour, Richard Scheines, and David Heckerman. 2000. Causation, Prediction, and Search. MIT Press.
Splawa-Neyman, Jerzy, DM Dabrowska, TP Speed, et al. 1990. “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9.” Statistical Science 5 (4): 465–72.

  1. As nicely put by Cartwright (1994): no causes in, no causes out. We return to the point more formally later.↩︎

  2. The approach is sometimes attributed to David Hume, whose writing contains ideas both about causality as regularity and causality as counterfactual. On the latter, Hume’s key formulation is, “if the first object had not been, the second never had existed” (Hume and Beauchamp 2000, vol. 3, sec. VIII). More recently, the counterfactual view has been set forth by Splawa-Neyman et al. (1990) and Lewis (1973). See also Lewis (1986).↩︎

  3. In the terminology of Pearl (2000), we represent this quantity using a “do” operator: \(Y(do(X=x))\) is the value of \(Y\) when the variable \(X\) is set to the value \(x\).↩︎

  4. To avoid ambiguity we prefer \(Y_i(X=0)\) and \(Y_i(X=1)\). Alternative notation, used in Holland (1986) for instance, places the treatment condition in the subscript: \(Y_t(u), Y_c(u)\), while \(u\) captures individual level features. Sometimes the pairs are written \(Y_{u0}, Y_{u1}\).↩︎

  5. We note that we are conditioning on the assignments of others. If we wanted to describe outcomes as a function of the profile of treatments received by others, we would have a more complex type space. For instance, in an \(X \rightarrow Y\) model with two individuals we would report how \((Y_1, Y_2)\) respond to \((X_1, X_2)\); each vector can take on four values, producing a type space with \(4^4\) types rather than \(2^2\). The complex type space could be reduced back down to four types again, however, if we invoked the assumption that the treatment or non-treatment of one patient has no effect on the outcomes of other patients—an assumption known as the stable unit treatment value assumption (SUTVA).↩︎
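The counting in this footnote can be verified by enumeration. A minimal sketch (binary treatment and outcome, two units, with names chosen for illustration):

```python
from itertools import product

# Without interference (SUTVA), each unit's potential outcomes are a map
# from its own treatment {0,1} to {0,1}: 2^2 = 4 types per unit.
no_interference_types = list(product([0, 1], repeat=2))   # (Y(X=0), Y(X=1))
print(len(no_interference_types))  # 4

# With interference, each unit's outcome is a map from the full treatment
# profile (X1, X2) in {0,1}^2 -- four profiles -- to {0,1}.
profiles = list(product([0, 1], repeat=2))                # the 4 (X1, X2) profiles
per_unit_types = list(product([0, 1], repeat=len(profiles)))
print(len(per_unit_types))         # 2^4 = 16 types per unit

# Jointly for the pair (Y1, Y2): 16^2 = 4^4 = 256 types.
pair_types = len(per_unit_types) ** 2
print(pair_types)
```
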

  6. See Copas (1973) for an early classification of this form. The literature on probabilistic models also refers to such strata as “canonical partitions” or “equivalence classes.”↩︎

  7. Later, we will refer to these as “nodal types.”↩︎

  8. In many treatments \(\mathcal{U}\) is used for the exogenous nodes.↩︎

  9. If we let \(\mathcal{R}\) denote a set of ranges for all nodes in the model, we can indicate \(X\)’s range, for instance, by writing \(\mathcal{R}(X)=\{0,1\}\). The nodes in a causal model together with their ranges—the triple \((\mathcal{U}, \mathcal{V}, \mathcal{R})\)—are sometimes called a “signature,” \(\mathcal{S}\).↩︎

  10. In many treatments, these components are labelled as \(U\) terms, in the set \(\mathcal{U}\), to be distinguished from endogenous nodes in \(\mathcal{V}\). However, we will generally use \(\theta\) to denote these unobserved, unspecified influences in order to emphasize their particular role, as direct objects of interest in causal inquiry.↩︎

  11. The set of a node’s parents is required to be minimal in the sense that a node is not included among the parents if, given the other parents, the child does not depend on it in any state that arises with positive probability.↩︎

  12. Thus \(P(d|i,e, u_D)\) would be defined by this structural model (as a degenerate distribution), but \(P(i)\), \(P(e)\), \(P(u_D)\), and \(P(i,e, u_D)\) would not be.↩︎

  13. In the CausalQueries software package, we can specify nodal types as having joint distributions.↩︎

  14. By “direct” we mean that \(A\) is a parent of \(B\): i.e., the effect of \(A\) on \(B\) is not fully mediated by one or more other nodes in the model.↩︎

  15. A reminder that, in applying the subscript scheme in Table 2.3, we put our nodes in alphabetical order and then map the first node into \(X_1\) and the second into \(X_2\).↩︎

  16. Put in more general terms, a node’s causal function must include all nodes pointing directly into it. We can imagine this same idea in a parametric setting. Imagine that \(M\)’s causal function were specified as: \(M = RE\). This function would allow for the possibility that \(R\) affects \(M\), as it will whenever \(E=1\). However, it would also allow that \(R\) will have no effect, as it will when \(E=0\).↩︎
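The footnote’s parametric example can be checked directly. A minimal sketch of the causal function \(M = RE\):

```python
# Illustrative causal function from the footnote: M = R * E.
def m(r, e):
    return r * e

# When E = 1, R affects M: the effect of moving R from 0 to 1 is 1.
print(m(1, 1) - m(0, 1))  # 1

# When E = 0, R has no effect on M.
print(m(1, 0) - m(0, 0))  # 0
```

The function thus includes \(R\) among \(M\)’s arguments while allowing \(R\)’s effect to vanish for some values of \(E\).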

  17. Readers may recognize this statement as the logic of adjusting for a confound that is a cause of both an explanatory node and a dependent node in order to achieve conditional independence.↩︎

  18. In the familial language of causal models, a collider is a child of two or more parents.↩︎

  19. Eliminating endogenous nodes may also operate via “encapsulated conditional probability distributions” (Koller and Friedman 2009), wherein a system of nodes, \(\{Z_i\}\), is represented by a single node, \(Z\), that takes as its parents the parents of \(\{Z_i\}\) that are not themselves in \(\{Z_i\}\), and issues as its children the children of \(\{Z_i\}\) that are not in \(\{Z_i\}\). However, this is not a fundamental alteration of the graph.↩︎

  20. This aggregation cannot occur if \(\theta^{Y}\) also has another child, \(W\), that is not a child of \(X\) since then we would be representing \(Y\)’s and \(W\)’s random components as identical, which they are not in the original graph.↩︎