Chapter 6 Theories as causal models

We describe an approach in which theoretical claims are thought of as justifications for models within a hierarchy of causal models. The approach has implications for the consistency of inferences and for assessing when and how theory is useful.


In Chapter 3, we described a set of theories and represented them as causal models. But so far we haven’t been very explicit about what we mean by a theory or how theory maps onto a causal-model framework.

In this book, we will think of theory as a type of explanation: a theory provides an account of how or under what conditions a set of causal relationships operate. We generally express both a theory and the claims being theorized as causal models. A theory is then a model that implies another model—possibly with the help of some data.

To fix ideas: a simple claim might be that “A caused B in case \(j\)”. This claim is itself a model, albeit a very simple one. The theory that supports this model might be of the form “A always causes B”, “A always causes B whenever C (and C holds in case j)”, or “A invariably causes K, and K invariably causes B”. These all have in common that they are arguments that could be provided to support the simple claim; in each case, if you believe the theory, you believe the implication.

The rest of this short chapter builds out this idea and uses it to provide a way of characterizing when a theory is useful or not. In the first section, we consider multiple senses in which one model might imply, and thus serve as a theory of, another model. To begin, we consider how one causal structure can imply (theorize) another causal structure by including new nodes and nodal types that explain how or when causal effects in the original model will unfold. Next, we consider how the causal-type ranges of models can relate to one another: one model can imply another model when the former’s causal types constitute a subset of the latter’s. In this situation, the theory represents a more specific, stronger claim about the kinds of causal effects that are operating.

We then turn to logical relations between probabilistic models. We show how the distributions over nodal types in a simpler model structure can be underwritten by distributions over nodal types in a more detailed model structure. Here, a claim about the prevalence (or probability) of causal effects in a causal network is justified by claims about the prevalence or probability of causal effects in a more granular rendering of that network. Finally, we show how a probabilistic model plus data can provide a theoretical underpinning for a new, stronger model.

Second, we consider how theories-as-models can be useful. In embedding theorization within the world of causal models, we ultimately have an empirical objective in mind. Theorizing a causal relationship of interest, in our framework, means elaborating our causal beliefs about the world in greater detail. As we show in later chapters, theorizing in the form of a causal model allows us to generate research designs: to identify sources of inferential leverage and to explicitly and systematically link observations of components of a causal system to the causal questions we seek to answer. In the second section of this chapter, however, we provide a high-level conceptualization of the empirical gains from theory.

In the chapter’s third and final section, we show how our formalization of theory maps onto formal theory as usually understood, showing how we can generate a causal model from a game-theoretic model.

6.1 Models as theories of

Let us say that a causal model, \(M^\prime\), is a theory of \(M\) if \(M\) is implied by \(M^\prime\). It is a theory because it has implications; otherwise it is a conclusion, an inference, or a claim. A theory, \(M^\prime\), might itself sit atop—be supported by—another theory, \(M^{\prime\prime}\), that implies \(M^\prime\). To help fix the idea of theory as “supporting” or “underlying” the model(s) it theorizes, we refer to the theory, \(M^\prime\), as a lower-level model relative to \(M\) and refer to \(M\) as a higher-level model relative to its theorization, \(M^\prime\).49

Both structural models and probabilistic models—possibly in combination with data—imply other models. We discuss each in turn.

6.1.1 Implications of structural causal models

A structural model can imply multiple other simpler structural models. Similarly, a structural model can be implied by multiple more complex models.

We imagine two forms of lower-level model: those that involve “type splintering” and those that involve “type reduction.”

6.1.1.1 Type-splintering theorization

Theorization often involves a refinement of causal types, implemented through the addition of nodes. Take the very simple model, \(M\), represented in Figure 6.1(a). The model simply states that \(X\) has (or can have) a causal effect on \(Y\).

What theories might justify \(M\)? This question can be rephrased as “what models imply model \(M\)?” The figure points to two possibilities. Both models \(M^\prime\) and \(M^{\prime\prime}\) imply model \(M\). They can be thought of as theories, or lower-level models, of \(M\).

Model \(M^\prime\) differs by the addition of a node, \(K\), in the causal chain between \(X\) and \(Y\). We can say that \(M^\prime\) is a theory of \(M\) for two reasons. First, it provides a justification—if you believe \(M^\prime\) you should believe \(M\): if \(X\) affects \(Y\) through \(K\), then \(X\) affects \(Y\). But as well as a justification, it also provides an explanation of \(M\). Suppose we already know that \(X\) affects \(Y\) but want to know why. If we ask, “why does \(X\) affect \(Y\)?”, \(M^\prime\) provides an answer: \(X\) affects \(Y\) because \(X\) affects \(K\), and \(K\) affects \(Y\).

Model \(M^{\prime\prime}\) differs by the addition of a node, \(C\), that moderates the effect of \(X\) on \(Y\). \(M^{\prime\prime}\) justifies \(M\) in the sense that if you believe \(M^{\prime\prime}\) you should believe \(M\). It provides an explanation of a kind also: if you believe model \(M^{\prime\prime}\) you likely believe that the relation between \(X\) and \(Y\) is what it is because of \(C\). Had \(C\) been different, the causal relation between \(X\) and \(Y\) might have been different as well.


A key idea is that both \(M'\) and \(M''\) involve a redefinition of \(\theta^Y\). That is, we see a change in the endogenous nodes, but these changes in turn imply a change in the interpretation of the exogenous nodes that point into existing endogenous nodes (such as \(Y\) in this example). We can think of part of \(\theta^Y\) as being splintered off and captured by \(\theta^K\) or \(C\).

Return to models \(M\) and \(M'\) in panels (a) and (b) of Figure 6.1. Importantly, in moving from the higher- to the lower-level model, we have effectively split the nodal-type term \(\theta^Y\) into two parts: \(\theta^{Y_\text{lower}}\) and \(\theta^K\). Intuitively, in the higher-level model, (a), \(Y\) is a function of \(X\) and \(\theta^Y\), the latter representing all things other than \(X\) that can affect \(Y\). Or, in the language of our nodal-type setup, \(\theta^Y\) represents all of the (unspecified) sources of variation in \(X\)’s effect on \(Y\). When we insert \(K\) into the model, however, \(X\) now does not directly affect \(Y\) but only does so via \(K\). Further, we model \(X\) as acting on \(K\) in a manner conditioned by \(\theta^K\), which represents all of the (unspecified) factors determining \(X\)’s effect on \(K\). The key thing to notice here is that \(\theta^K\) now represents a portion of the variance that \(\theta^Y\) represented in the higher-level graph: some of the variation in \(X\)’s effect on \(Y\) now arises from variation in \(X\)’s effect on \(K\), which is captured by \(\theta^K\). So, for instance, \(X\) might have no effect on \(Y\) because \(\theta^K\) takes on the value \(\theta^K_{00}\), so that \(X\) has no effect on \(K\). Put differently, any effect of \(X\) on \(Y\) must arise from an effect of \(X\) on \(K\); so \(\theta^K\)’s value must be either \(\theta^K_{01}\) or \(\theta^K_{10}\) for \(X\) to affect \(Y\).50 What \(\theta^K\) represents, then, is that part of the original \(\theta^Y\) that arose from some force other than \(X\) operating at the first step of the causal chain from \(X\) to \(Y\). So now, \(\theta^Y\) in the lower-level graph is not quite the same entity as it was in the higher-level graph. In the original graph, \(\theta^Y\) represented all sources of variation in \(X\)’s effect on \(Y\). In the lower-level model, with \(K\) as mediator, \(\theta^Y\) represents only the variation in \(K\)’s effect on \(Y\). Put differently, \(\theta^Y\) has been expunged of any factors shaping the first stage of the causal process, which now reside in \(\theta^K\). We highlight this change in \(\theta^Y\)’s meaning by referring in the second model to \(\theta^{Y_\text{lower}}\).
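
To see the splintering concretely, the following short sketch (ours, in Python; the type labels follow the “value if parent is 0, value if parent is 1” convention used above) enumerates how each pair of lower-level types \((\theta^K, \theta^{Y_\text{lower}})\) implies a higher-level type \(\theta^Y\):

```python
from itertools import product

# Each nodal type "ab" gives the node's value when its parent is 0 and 1.
types = {"00": (0, 0), "10": (1, 0), "01": (0, 1), "11": (1, 1)}

implied = {}
for tk, ty in product(types, repeat=2):
    # Y's response to X: apply the K-type to X, then the Y-type to K.
    k0, k1 = types[tk]
    y0, y1 = types[ty][k0], types[ty][k1]
    implied.setdefault(f"{y0}{y1}", []).append((tk, ty))

for higher, lower in sorted(implied.items()):
    print(higher, lower)
# The higher-level positive-effect type "01" arises only from ("01", "01")
# and ("10", "10"): both stages positive, or both stages negative.
```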

Consider next model \(M^{\prime\prime}\), shown in panel (c) of Figure 6.1, which also supports (implies) the higher-level model in panel \((a)\). The logical relationship between models \((a)\) and \((c)\), however, is somewhat different. Here the lower-level model specifies one of the conditions that comprised \(\theta^Y\) in the higher-level model. In specifying a moderator, \(C\), we have extracted \(C\) from \(\theta^Y\), leaving \(\theta^{Y_\text{lower}}\) to represent all factors other than \(C\) that condition \(Y\)’s response to its parents. More precisely, \(\theta^{Y_\text{lower}}\) now represents the set of nodal types defining how \(Y\) responds jointly to \(X\) and \(C\). Again, the relabeling as \(\theta^{Y_\text{lower}}\) reflects this change in the term’s meaning. Whereas in Model \(M^{\prime}\) we have extracted \(\theta^K\) from \(\theta^Y\), in Model \(M^{\prime\prime}\), it is \(C\) itself that we have extracted from \(\theta^Y\), substantively specifying what had been just a random disturbance.


Figure 6.1: Here we represent the simple claim that one variable causes another, and two theories—lower-level models—that could explain this claim. Both model (b) and model (c) involve theorization via disaggregation of nodes.

6.1.1.2 Type-reducing theorization

There is a second way in which we might imagine a model being implied by another model that does not involve a change in nodes. Let \(\Theta(\mathcal M_1)\) denote the set of causal types in model \(\mathcal M_1\). Then we can say that \(\mathcal M_0\) implies \(\mathcal M_1\) if \(\Theta(\mathcal M_0)\subseteq \Theta(\mathcal M_1)\). We can think of \(\mathcal M_0\), then, as a restriction of the ranges of \(\Theta(\mathcal M_1)\)—and thus as a more specific claim about causal relations than \(\mathcal M_1\) makes. Put differently, this means that if \(\mathcal M_0\) is true, then \(\mathcal M_1\) is true: any relation admitted by theory \(\mathcal M_0\) is representable in model \(\mathcal M_1\) (though the converse may not be true). In that sense, \(\mathcal M_0\) serves as a theory that can justify \(\mathcal M_1\).

We illustrate the idea in Figure 6.2. In panel (a) of the figure, we have a model, \(\mathcal M_1\), in which \(Z\) can have both a direct and an indirect effect (via \(X\)) on \(Y\). Suppose that we believed that \(\mathcal M_1\) was technically true but overly permissive, in the sense that it allowed for causal relations that we do not in fact believe are operating. We might believe, for instance, that \(Z\) has no direct effect on \(Y\) and that \(Z\) has no negative effects on \(X\)—the beliefs we would need to hold to treat \(Z\) as an instrument for \(X\). We could thus write down a lower-level model, \(\mathcal M_0\), in which we have reduced the type space accordingly. Specifically, in \(\mathcal M_0\), we would restrict the nodal types at \(Y\) to \(\theta^Y_{0000}\), \(\theta^Y_{1100}\), \(\theta^Y_{0011}\), and \(\theta^Y_{1111}\); and we would reduce the nodal types at \(X\) to \(\theta^X_{00}\), \(\theta^X_{01}\), and \(\theta^X_{11}\). In panel (b), we (somewhat loosely) represent \(\mathcal M_0\). We have now eliminated the arrow from \(Z\) to \(Y\) to represent the dropping of all nodal types involving a direct effect of \(Z\) on \(Y\); not pictured is the monotonicity assumption at \(X\). However, we have relabeled the nodal-type nodes for both \(X\) and \(Y\) to represent the fact that these are different objects from the nodal-type nodes in the higher-level model.51
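
The exclusion restriction at \(Y\) can be checked by enumeration. Below is a small sketch (ours; the type labels here follow an ordering of parent combinations that we choose for illustration, which may differ from the text’s convention) showing that exactly four of the sixteen nodal types for \(Y\) survive:

```python
from itertools import product

# Enumerate the 16 nodal types for Y with parents Z and X, keeping those
# consistent with the exclusion restriction (no direct effect of Z on Y).
# Labels order parent combinations as (z, x) = (0,0), (0,1), (1,0), (1,1).
kept = []
for outputs in product([0, 1], repeat=4):
    f = dict(zip(product([0, 1], repeat=2), outputs))  # (z, x) -> y
    if all(f[(0, x)] == f[(1, x)] for x in [0, 1]):    # Y ignores Z
        kept.append("".join(map(str, outputs)))
print(kept)  # ['0000', '0101', '1010', '1111']: 4 of 16 types survive
```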

Thus, while we can theorize by adding substantive nodes to a model and thus splitting types, we can also theorize by maintaining existing nodes but constraining relations among them. In both forms of theorization, we start with a model that allows for a broad, and possibly unknown, range of possibilities: for instance, a broad range of paths through which or conditions under which \(X\) might affect \(Y\) or a broad range of causal effects operating at each node. Theorization of both forms then involves making a stronger claim: for instance, a claim about how or when \(X\) affects \(Y\) (via type-splintering) or a claim about the particular causal effects operating at a given node (via type-reduction). In both forms of theory, believing the stronger claim in the lower-level model implies believing the weaker claim in the higher-level model. Further, both modes of theorization also map nicely onto common ways in which we think about theory-development in the social sciences: we theorize mechanisms, sources of heterogeneity, and directions of effects (starting with a belief that \(X\) affects \(Y\), for instance, and moving to a more constrained belief about whether that effect is positive or negative).

Finally, as we discuss below, theorization of both forms can generate gains for causal inference, by allowing us to use data in ways that we are unable to use it in the higher-level model.

6.1.1.3 Preserving (conditional) independencies

Not all potential mappings from higher to lower levels are permitted. In particular, when theorizing, we may add but may not remove independencies implied by the original model. If two variables are independent—or conditionally independent given a third variable—in one model, then this same relation of independence (or conditional independence) must be captured in any theory of that model. For instance, if we start with a model of the form \(X \rightarrow Y \leftarrow W\), where \(W\) and \(X\) are independent, we could not theorize this model by adding an arrow from \(X\) to \(W\). A theory can have additional conditional independencies not present in the higher-level model, as in the example in Figure 6.2. But we may not theorize away (conditional) independencies insisted on by our higher-level claim.

This is a key part of what it means for the lower-level model to justify the higher-level model. A model makes claims about what is (conditionally) independent of what. The claims about conditional independence implied by the higher-level model must therefore be warranted by (conditional) independencies operating in the lower-level model. If we introduce new dependencies via theorization, then our higher-level model (which excludes these dependencies) would no longer be justified by the lower-level model.
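
As a concrete check on the collider example above, a short simulation (a sketch under one arbitrary choice of functional equation, \(Y = X \text{ OR } W\)) displays both the marginal independence that any theory of the model must preserve and the conditional dependence that conditioning on \(Y\) induces:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# X -> Y <- W with one (arbitrary) functional equation: Y = X OR W.
X = rng.binomial(1, 0.5, n)
W = rng.binomial(1, 0.5, n)
Y = X | W

print(np.corrcoef(X, W)[0, 1])            # ~0: marginal independence holds
sub = Y == 1
print(np.corrcoef(X[sub], W[sub])[0, 1])  # ~ -0.5: dependence given collider
```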


Figure 6.2: Here we represent theorization via type-reduction. Though we show the removal of an arrow to help convey the idea, we would in fact reduce types by imposing restrictions on the nodal types at Y within the same DAG.

6.1.2 Probabilistic causal models

At the structural level, then, there are two types of theory, or two types of relations between levels of model: those defined by type-splintering and those defined by type-reduction. In general, we will want to be working with probabilistic causal models—i.e., those that include distributions over nodal types. We can describe straightforwardly how distributions in a higher-level model relate to—and must change with—distributions at the lower level. Indeed, it is these relations that unlock the opportunity for reaping empirical gains from theory. We can think of Bayesian updating in a causal-model framework as using data to learn about the distribution of causal effects in a lower-level model, which, in turn, allows us to learn about causation in a higher-level model that captures the causal questions of interest.

6.1.2.1 Theoretical implications of probabilistic models

Suppose we start with the mediation model in panel (b) of Figure 6.1. We then add to it a distribution over \(\theta^K\) and \(\theta^Y_{lower}\), giving us a probabilistic causal model that we will denote \(\mathcal M^p_{lower}\). \(\mathcal M^p_{lower}\), in turn, implies a higher-level probabilistic model, \(\mathcal M^p_{higher}\), formed from the structure of Model (a) in Figure 6.1, and a particular distribution over \(\theta^Y\): specifically, \(\theta^Y\) will have the distribution that preserves the causal relations implied by the beliefs in \(\mathcal M^p_{lower}\). Thus, for instance, the probability that \(X\) has a positive effect on \(Y\) in \(\mathcal M^p_{higher}\) is \(\Pr(\theta^{Y} = \theta^{Y}_{01}) = \lambda^{Y}_{01}\); the probability that \(X\) has a positive effect on \(Y\) in \(\mathcal M^p_{lower}\) is \(\lambda^{K_{lower}}_{01}\lambda^{Y_{lower}}_{01} + \lambda^{K_{lower}}_{10}\lambda^{Y_{lower}}_{10}\). Consistency then requires that \(\lambda^{K_{lower}}_{01}\lambda^{Y_{lower}}_{01} + \lambda^{K_{lower}}_{10}\lambda^{Y_{lower}}_{10} = \lambda^{Y_{higher}}_{01}\). So the value of \(\lambda^{Y_{higher}}_{01}\) is implied by \(\lambda^{K_{lower}}_{01},\lambda^{Y_{lower}}_{01}, \lambda^{K_{lower}}_{10},\lambda^{Y_{lower}}_{10}\), but not vice versa.
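
This consistency condition is easy to verify numerically. The sketch below (with invented values for the lower-level shares \(\lambda\)) computes the implied \(\lambda^{Y_{higher}}_{01}\) analytically and recovers it by simulating the lower-level model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented distributions over nodal types in X -> K -> Y, ordered
# 00, 10, 01, 11 ("value if parent is 0, value if parent is 1").
lam_K = np.array([0.2, 0.1, 0.6, 0.1])
lam_Y = np.array([0.1, 0.1, 0.7, 0.1])

# Analytic implication for the higher-level model X -> Y:
lam_higher_01 = lam_K[2] * lam_Y[2] + lam_K[1] * lam_Y[1]  # 0.43

# Simulation from the lower-level model:
tables = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])  # type -> (value | parent)
tk = rng.choice(4, size=1_000_000, p=lam_K)
ty = rng.choice(4, size=1_000_000, p=lam_Y)
y_if_x0 = tables[ty, tables[tk, 0]]
y_if_x1 = tables[ty, tables[tk, 1]]
print(lam_higher_01, np.mean((y_if_x0 == 0) & (y_if_x1 == 1)))  # both ~0.43
```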

6.1.2.2 Deducing models from theory and data

Now we can see what happens if we bring data to the lower-level model. A probabilistic causal model coupled with data implies another probabilistic causal model via Bayes’ rule. For this reason, we can fruitfully think of an initial model plus data as constituting a theory of an updated model. Thought of in this way, we have clarity over what is meant when we turn to theory to support a claim, but also what is meant when we seek to justify a theory. We might imagine a scholar arguing: “\(\mathcal M_1\): \(X\) caused \(Y\) in country \(j\).” When pushed for a justification for the claim, they provide the lower-level model: “\(\mathcal M_0:\) the average effect of \(X\) on \(Y\) in countries with feature \(C=1\) is 0.95, making it likely that \(X\) caused \(Y\) in this case.” Here \(\mathcal M_1\) is implied by \(\mathcal M_0\) plus data \(C=1\). If pushed further as to why that theory is itself credible, the scholar might point to a lower-level model consisting of structural model \(X\rightarrow Y \leftarrow C\) plus flat priors—coupled with data on \(X\), \(Y\), and \(C\), where the data justify the belief about \(C\)’s relationship to \(X\)’s average effect on \(Y\). At each stage, as more justification is provided, the researcher formally provides lower-level models.

Moving up, as more data are provided, more “specific” higher-level models emerge, justified by lower-level models plus data. These models are more specific in the sense that they are implied by the lower-level models, plus data, but not vice versa. But they are also (generally) more specific in a second sense: that they make stronger statements about how causal processes operate.52 They place greater weight on a smaller set of causal claims.53

As the simplest illustration, we might imagine beginning with an \(X\rightarrow Y\) model, \(\mathcal M_1\), in which \(X\) and \(Y\) are binary and we believe that \(Y\) possibly responds to \(X\). If we have “flat” priors over causal types, in the sense described in Chapter 5, then our prior uncertainty over the proposition that \(X\) causes \(Y\), under this model, is large, as is our uncertainty that \(Y=1\) is due to \(X=1\) in a given case. In other words, given our theory, we are uncertain about the proposition \(\mathcal M_3\): \(X\) caused \(Y\). However, if we then receive a lot of data, \(\mathcal D\), showing strong relations between \(X\) and \(Y\), then our updated model \(\mathcal M_2\), formed by combining \(\mathcal D\) and \(\mathcal M_1\), allows us to infer that \(X\) caused \(Y\) in this case with greater certainty.
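
One way to operationalize this updating (a sketch, with invented data and a flat Dirichlet prior over \(Y\)’s nodal types standing in for the “flat priors” above) is via importance sampling:

```python
import numpy as np

rng = np.random.default_rng(4)

# Nodal types for Y are 00, 10, 01, 11 ("value if X=0, value if X=1");
# lambda is the population share of each type.
draws = rng.dirichlet(np.ones(4), size=200_000)  # flat prior over lambda
l00, l10, l01, l11 = draws.T

# Hypothetical (X, Y) cell counts from a study with X randomized:
n00, n01, n10, n11 = 40, 10, 10, 40  # n01 = count of X=0, Y=1 cases, etc.

p_y1_x0, p_y1_x1 = l10 + l11, l01 + l11
like = ((1 - p_y1_x0) ** n00 * p_y1_x0 ** n01
        * (1 - p_y1_x1) ** n10 * p_y1_x1 ** n11)
w = like / like.sum()  # importance weights: posterior over lambda draws

# Query: probability that X=1 caused Y=1 in a case with X=1, Y=1.
poc = l01 / (l01 + l11)
print(float((w * poc).sum()))  # well above the prior mean of 0.5
```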

Thus our new theory \(\mathcal M_2\) is (a) formally similar to \(\mathcal M_1\), (b) formed as a product of past theory plus evidence, here justified by \(\mathcal M_1\) given data \(\mathcal D\), and (c) capable of providing sharper implications than past theory.54

In this way, Bayesian updating provides a simple and coherent way of thinking about the integration of theory and data.

6.2 Gains from theory

We now turn to consider how to think about whether a theory is useful. We are comfortable with the idea that theories, or models more generally, are wrong. Models are not full and faithful reflections of reality; they are maps designed for a particular purpose. We make use of them because we think that they help in some way.

But how do they actually help, and can we quantify the gains we get from using them?

We think we can. Using the notion of hierarchies of models, imagine we begin with model \(\mathcal M_1\), which, together with data \(\mathcal D\), implies claim \(\mathcal M_2\). We then posit theory \(\mathcal M_0\) of \(\mathcal M_1\), so \(\mathcal M_0\) implies \(\mathcal M_1\). But when we bring \(\mathcal D\) to \(\mathcal M_0\), we get a new model, \(\mathcal M_2'\), that is different from—and, hopefully, better than—\(\mathcal M_2\).

Our gain from theory \(\mathcal M_0\) should be some summary of how much better \(\mathcal M_2'\) is than \(\mathcal M_2\).

6.2.1 Illustration: A front door theory

Here is an illustration using a theory that allows use of the “front-door criterion.”

Imagine that we have data on three variables, \(X\), \(Y\), and \(K\). We begin, however, with a model \(\mathcal M_1\) with confounding: \(C \rightarrow X \rightarrow Y \leftarrow C\). \(\mathcal M_1\) includes nodes for two of the three variables we have data on, \(X\) and \(Y\), but not \(K\). Assume, further, that we do not have data on node \(C\), the confounder.

Suppose that we observe a strong correlation between \(X\) and \(Y\) and infer \(\mathcal M_2\): that \(X\) is a likely cause of \(Y\). Our inference under \(\mathcal M_2\) is, however, quite uncertain because we are aware that the correlation may be due to the confounder \(C\).

Suppose now that we posit the lower-level model \(\mathcal M_0\): \(C \rightarrow X \rightarrow K \rightarrow Y \leftarrow C\). \(\mathcal M_0\) now lets us make better use of the data on \(K\). If we observe, for instance, that \(X\) and \(K\) are uncorrelated, then we infer with confidence that, in fact, \(X\) did not cause \(Y\), despite the correlation.

Thus, in return for specifying a theory of \(\mathcal M_1\), we have been able to make better use of data and form a more confident conclusion. In this case, stating the theory, \(\mathcal M_0\), does not alter our priors over our query. Our prior over the effect of \(X\) on \(Y\) may be identical under \(\mathcal M_1\) and \(\mathcal M_0\)—but our conclusions, given data, differ because the theory lets us make use of the data on \(K\), which we could not do under \(\mathcal M_1\) (which did not include \(K\)).
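
To make this tangible, here is a small simulation (ours; one arbitrary parameterization of a world consistent with \(\mathcal M_0\) in which \(X\) in fact has no effect on \(Y\)):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# X's types never move K and K never moves Y, while the unobserved
# confounder C drives both X and Y. All parameter values are invented.
C = rng.binomial(1, 0.5, n)
X = rng.binomial(1, 0.2 + 0.6 * C)   # C -> X
K = rng.binomial(1, 0.5, n)          # X -> K present in the model, inert here
Y = rng.binomial(1, 0.2 + 0.6 * C)   # K -> Y inert here; C -> Y operative

print(np.corrcoef(X, Y)[0, 1])  # ~0.36: X and Y strongly correlated
print(np.corrcoef(X, K)[0, 1])  # ~0: under M0, this points against X -> Y
```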

In other situations, we might imagine invoking a theory that does not necessarily involve new data but that allows us to make different, perhaps tighter inferences using the same data. An example might be the invocation of a type-reducing theory that involves a monotonicity restriction or exclusion restriction that allows for identification of a quantity that would not be identifiable without the theory.

Thus, one reason to theorize our models—develop lower-level models that make stronger claims—is to be able to reap greater inferential leverage from the more elaborate theory.

6.2.2 Quantifying gains

Can we quantify how much better off we are?

In all cases, we ask how much better we do as a result of making use of a theory, according to some evaluation criterion. Evaluation criteria might be based on:

  • error: an error-based evaluation asks whether the theory helped reduce the (absolute) difference between an estimate and a target; similarly, you might focus on squared error, which essentially places more weight on bigger errors
  • uncertainty: you might instead assess gains in terms of reduced uncertainty. You might measure uncertainty using the variance of your beliefs, or you might use relative entropy to assess reductions in uncertainty
  • explanation: you might ask whether the data we see are explained by the theory in the sense that they are more likely—less surprising—given the theory; one approach might be to compare the likelihood of the data with and without the theory

Other loss functions might take account of the costs of collecting additional data or of the risks associated with false conclusions. For instance, in Heckerman, Horvitz, and Nathwani (1991), an objective function is generated using the expected utility gains from diagnoses based on new information over diagnoses based on what is believed already.

These criteria are all closely connected with each other, as we will highlight below.

Whatever criterion is adopted might be applied from an “objective” or “subjective” position, and from an ex ante or ex post perspective.

We describe these possibilities next and then illustrate.

Objective, ex post: If we are willing to posit an external ground truth, then we can define “better” in objective terms. For instance, we might calculate the size of the error in our conclusions, relative to the ground truth, from an inference that uses a theory compared to an inference that does not.

Objective, ex ante: An objective ex ante approach might be to ask what the expected error is from conclusions one draws given a theory. For instance: how wrong are we likely to be if we base our best guess on our posterior mean? “How wrong” might be operationalized in terms of expected squared error—the square of the distance between the truth and the posterior mean.

Subjective, ex post: A more subjective approach would be to ask about the reduction in posterior variance. Ex post, you can define “better” as the reduction in posterior variance from conclusions that make use of a theory compared to conclusions that do not.

Subjective, ex ante: You might also think about the expected posterior variance: how certain do you expect you will be after you make use of this new information?

6.2.3 Connections

We illustrate connections between criteria in the context of a theory that lets us make use of additional data \(K\) to make inferences.

There is an unknown parameter \(q\). We have beliefs about the distribution of \(K\), given \(q\). Let \(p(q,k)\) denote the joint prior distribution over \(q\) and \(k\) with marginal distributions \(p(k)\) and \(p(q)\). For any \(k\) there is a posterior estimate \(q_k\).

The squared error, given \(k\), is just \((q - q_k)^2\).

The expected squared error is:

\[ESE := \int_q\int_k \left({q}_k-q\right)^2p(k, q)dkdq \]

This takes the error one might get with respect to any true value of the parameter (\(q\)), given the data one might see given \(q\) and the inferences one might draw.

For any \(k\) we might write the posterior variance as \(v_k\).

The expected posterior variance can be written:

\[EV := \int_k v_k p(k)dk\]

This takes the posterior variance, given some data, over all the possible data one might see given marginal distribution \(p(k)\).

Interestingly, if we assess expectations using the same priors as we use for forming posteriors, the expected posterior variance and the expected squared error are equivalent (Scharf 1991). To see this, we take advantage of the fact that \(p(q,k) = p(k)p(q|k) = p(q)p(k|q)\) and that \(p(q|k)\) gives the posterior distribution of \(q\) given \(k\). We then have:

\[\begin{eqnarray} ESE &=& \int_q\int_k \left({q}_k-q\right)^2p(q,k)dkdq \\ &=& \int_k\int_q \left({q}_k-q\right)^2p(k)p(q|k)dq dk \\ &=& \int_k\left[\int_q \left({q}_k-q\right)^2p(q|k)dq\right]p(k)dk \\ &=& \int_k v_k p(k)dk = EV \end{eqnarray}\]

Note that the key move is in recognizing that \(p(q|k)\) is in fact the posterior distribution of \(q\) given \(k\). In using this, we assume that the same distribution is used for assessing error and for conducting analysis—that is, we take the researcher’s prior to be the relevant one for assessing error.
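
A quick Monte Carlo check illustrates the equivalence (a sketch using a conjugate beta-binomial setup of our choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_obs = 200_000, 5

# Conjugate sketch: q ~ Beta(1, 1); K = number of successes in 5 trials.
q = rng.beta(1, 1, n_sims)
k = rng.binomial(n_obs, q)

# Posterior after seeing k successes is Beta(1 + k, 1 + n_obs - k).
a, b = 1 + k, 1 + n_obs - k
post_mean = a / (a + b)
post_var = a * b / ((a + b) ** 2 * (a + b + 1))

print(np.mean((post_mean - q) ** 2))  # expected squared error: ~1/42
print(np.mean(post_var))              # expected posterior variance: ~1/42
```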

Moreover, it is easy to see that whenever inferences are sensitive to \(K\), the expected variance of the posterior will be lower than the variance of the prior. This can be seen from the law of total variance, written here to highlight the gains from observation of \(K\), given what is already known from observation of \(W\).55
\[Var(Q|W) = E_{K|W}(Var(Q|K,W)) +Var_{K|W}(E(Q|K,W))\]

However, although expected posterior variance goes down, it is still always possible that posterior variance rises.

The increase in uncertainty does not, however, mean you haven’t been learning. Rather, you have learned that things aren’t as simple as you thought.

One approach that addresses this issue asks: how much better are our guesses having observed \(K\) compared to what we would have guessed before, given what we know having observed \(K\)? This captures the idea that, although we might be more uncertain, we think we are better off now than we were because we are less naive. We will call this “Wisdom” to reflect the idea that it values appreciation of justifiable uncertainty:

\[Wisdom = \frac{\int\left((q_0 - q)^2 - (q_k - q)^2 \right)p(q | k)dq}{\int(q_0 - q)^2 p(q)dq}\]

The numerator in this expression captures how much better off we are with the guess we have made given current data (\(q_k\)) compared to the guess we would have made if we had a theory that did not let us make use of it (\(q_0\)), all assessed knowing what we now know having observed \(K\) (\(p(q|k)\) is our posterior given \(k\)). The denominator is simply the prior variance and is included here for scaling.

Note that with a little manipulation this can also be written as:

\[Wisdom = \frac{(q_0 - q_k)^2}{\int(q_0 - q)^2 p(q)dq}\]

From this we see that the measure does not depend on either prior or posterior variance (except through the denominator).

An advantage of this conceptualization is that we can still record gains in learning even if the learning operates such that the posterior variance is larger than the prior variance. Even so, the implications for strategy are the same since wisdom is maximized by a strategy that reduces expected squared error.

Thus, expected wisdom is:

\[\begin{eqnarray} \text{Expected Wisdom} &=& \frac{\int_q(q_0 - q)^2p(q)dq - \int_k\int_q(q_k - q)^2 p(q, k)dqdk}{\int(q_0 - q)^2 p(q)dq}\\ &=& 1 - \frac{\text{Expected Posterior Variance}}{\text{Prior variance}} \end{eqnarray}\]

Expected Wisdom lies between 0 and 1; wisdom itself, however, though non-negative, can exceed 1 in situations in which there is a radical re-evaluation of a prior theory, even if uncertainty rises. As an illustration, if our prior on some share is given by a \(\text{Beta}(2, 18)\) distribution, then our prior mean is 0.1 and our prior variance is very small, at 0.0043. If we observe another four positive cases, then our posterior mean becomes 0.25 and our posterior variance increases to 0.0075. We have shifted our beliefs upwards and at the same time become more uncertain. But we are also wiser, since we are confident that our prior best guess of 0.1 was surely an underestimate. Our wisdom is 5.25—a dramatic gain.
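
These numbers are easy to verify (a minimal sketch using scipy; the Beta parameters are those in the example above):

```python
from scipy.stats import beta

prior, posterior = beta(2, 18), beta(6, 18)  # four new positive cases
q0, v0 = prior.mean(), prior.var()
qk = posterior.mean()
print(round(v0, 4), round(posterior.var(), 4))  # 0.0043 0.0075
print(round((q0 - qk) ** 2 / v0, 2))            # wisdom: 5.25
```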

A second approach is to measure information gain by assessing the relative entropy between probability distributions. Relative entropy is measured here using the Kullback–Leibler divergence:

\[D_{KL} = \int \log\left(\frac{p(q)}{p(q|K)}\right)p(q)dq\]

For instance, if we started with a Beta(1,2) prior, we would have a 2/3 expectation of observing a 0. If we observed a 1, we would learn a lot and have relative entropy of 0.17; if instead we observed a 0, we would be less surprised and have relative entropy of 0.07. Our expected relative entropy would be 0.1.

We close with a reminder. Although the expected reduction in variance and expected wisdom are both positive, both are fundamentally subjective notions that presuppose the theory is correct. In contrast, the expected error measure can be assessed under rival theoretical propositions and so allows for the real possibility that the gains from invoking a theory are negative.

Illustration of Wisdom for single-case inference

Let \(p\) denote the prior that \(Q=1\), let \(q_q\) denote the probability of observing \(K=1\) given \(Q=q\). Let \(p'_k\) denote the posterior given \(K=k\). Let \(v := p(1-p)\) denote prior variance and \(v'\) posterior variance.

Wisdom, given evidence \(K=k\), is:

\[W_k = (p'_k - p)^2/v \geq 0\]

Letting \(p_k\) denote the probability that \(K=1\), expected wisdom is:

\[(p_k(p'_1 - p)^2 + (1-p_k) (p'_0 - p)^2)/v\]

Substituting and rearranging, this can be written as:

\[W^e = 1 - E(v')/v \in [0,1]\] Written in terms of primitives we have:

\[{W}^e=1 - \frac{q_0q_1}{pq_1 + (1-p)q_0} - \frac{(1-q_1)(1-q_0)}{p(1-q_1) + (1-p)(1-q_0)}\] It’s easy to check that \({W}^e=0\) if \(q_0=q_1\) (no probative value) and \({W}^e=1\) if \(|q_1-q_0|=1\) (strong probative value).
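
A numerical check of this identity and of the two limiting cases (our sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

def we_direct(p, q1, q0):
    """Expected wisdom as 1 - E(posterior variance) / prior variance."""
    pk = p * q1 + (1 - p) * q0            # P(K = 1)
    p1 = p * q1 / pk                      # posterior P(Q=1) given K = 1
    p0 = p * (1 - q1) / (1 - pk)          # posterior P(Q=1) given K = 0
    ev = pk * p1 * (1 - p1) + (1 - pk) * p0 * (1 - p0)
    return 1 - ev / (p * (1 - p))

def we_primitives(p, q1, q0):
    """The closed form given in the text."""
    return (1 - q0 * q1 / (p * q1 + (1 - p) * q0)
              - (1 - q1) * (1 - q0) / (p * (1 - q1) + (1 - p) * (1 - q0)))

for _ in range(5):  # the two expressions agree at random parameter values
    p, q1, q0 = rng.uniform(0.05, 0.95, size=3)
    assert np.isclose(we_direct(p, q1, q0), we_primitives(p, q1, q0))

print(we_primitives(0.5, 0.9, 0.9))  # 0.0: no probative value
print(we_primitives(0.5, 1.0, 0.0))  # 1.0: maximally probative clue
```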

6.3 Formal theories and causal models

It is relatively easy to see how the ideas above play out for what might be called empirical models. But in the social sciences, “theory” is a term sometimes reserved for what might be called analytic models. In this last section, we work through how to use this framework when seeking to bring analytic models to data.

Let’s start with analytic models. As an example, we might consider the existence of “Nash equilibria.” Nash considered a class of settings (“normal form games”) in which each player \(i\) can choose an action \(\sigma_i\) from set \(\Sigma_i\) and receives a payoff \(u_i\) that depends on the actions of all players. A particular game, \(\Gamma\), is the collection of players, action sets, and payoffs.

Nash’s theorem relates to the existence of a collection of strategies with the property that each player’s strategy produces the greatest utility for that player, given the strategies of the other players. Such a collection of strategies is called a Nash equilibrium.

The claim that such a collection of strategies exists in these settings is an analytic claim. Unless there are errors in the derivation of the result, the claim is true in the sense that the conclusions follow from the assumptions. There is no evidence that we could go looking for in the world to assess the claim. The same can be said of the theoretical claims of many formal models in social sciences; they are theoretical conclusions of the if-then variety (Clarke and Primo 2012).

We will refer to theories of this form as “analytic theories.”

When researchers refer to a theory of populism or a theory of democratization, however, they generally do not have such pure theories in mind. Rather, they have in mind what might be called “applied theories” (or perhaps more simply “scientific theories” or “empirical theories”): general claims about the relations between objects in the world. The distinction here corresponds to the distinction in Peressini (1999) between “pure mathematical theories” and “mathematized scientific theories.”56

Applied theory, in this sense, is a collection of claims with empirical content: an applied theory refers to a set of propositions about causal relations in the world that might or might not hold and that is susceptible to assessment using data. These theories might look formally a lot like analytic theories, but it is better to think of them as translations at most. The relations between the nodes of an applied theory are a matter of conjecture, not a matter of necessity.57

Though it is not standard practice, formal models produced by game theorists can often be translated and then represented using the notation of structural causal models in this way. Moreover, doing so may be fruitful. Using the approach described above we can then assess the utility of the applied theory, if not the pure theory itself.

For two players, for instance, we might imagine a representation of a game as shown in Figure 6.3.


Figure 6.3: Formal structure of a normal form game.

Here the only functional equations are the utility functions. The utilities, given actions, are the implications of the theory, and so this is just a theory of how outcomes depend on social actions. It is not—yet—a behavioral theory.

In contrast to Nash’s theorem regarding the existence of equilibria, a behavioral theory might claim that in problems that can be represented as normal form games, players indeed play Nash equilibrium. This is a theory about how people act in the world. We might call it Nash’s theory.

How might this theory be represented as a causal model? Figure 6.4 provides one representation.


Figure 6.4: Formal structure of a normal form game.

Here, beliefs about the game form (\(\Gamma\)) result in strategy choices by actors. If players play according to Nash’s theory, the functional equations for the strategy choices are given by the Nash equilibrium solution itself, with a refinement in case of multiplicity.

This model represents what we expect to happen in a game under Nash’s theory and we can indeed see if the relations between nodes in the world look like what we expect under the theory. But it does not provide much of an explanation for behavior.

A lower-level causal model might help. In Figure 6.5, the game form \(\Gamma\) determines beliefs about what actions the other player will take (thus \(\sigma_2^e\) is 1’s belief about 2’s actions). The functional equations for \(\sigma_2^e\) and \(\sigma_1^e\) might, for instance, be the Nash equilibrium solution itself: that is, players expect other players to play according to the Nash equilibrium (or, in the case of multiplicity, a particular equilibrium selected using some refinement). The beliefs, in turn, together with the game form (which contains \(u_1, u_2\)), are what cause the players to select a particular action. The functional equation for \(\sigma_1\) might thus be \(\sigma_1 = \arg \max_\sigma u_1(\sigma, \sigma_2^e)\).


Figure 6.5: Formal structure of a normal form game.
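
As a minimal concrete instance of these functional equations (a sketch with an invented payoff matrix, not drawn from the text):

```python
import numpy as np

# Hypothetical 2x2 game (Prisoner's Dilemma payoffs; action 1 = "defect").
# U1[a1, a2] is player 1's payoff; the game is symmetric, so U2 = U1.T.
U1 = np.array([[3, 0],
               [5, 1]])
U2 = U1.T

def br1(a2):
    """Player 1's best response to a point belief about player 2's action."""
    return int(np.argmax(U1[:, a2]))

def br2(a1):
    """Player 2's best response to a point belief about player 1's action."""
    return int(np.argmax(U2[a1, :]))

# A profile is a pure-strategy Nash equilibrium when each action is a best
# response to the other, mirroring sigma_1 = argmax_sigma u_1(sigma, sigma_2_e).
equilibria = [(a1, a2) for a1 in (0, 1) for a2 in (0, 1)
              if br1(a2) == a1 and br2(a1) == a2]
print(equilibria)  # [(1, 1)]: mutual defection
```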

This representation implies a set of relations that can be compared against empirical patterns. Do players indeed hold these beliefs when playing a given game? Are actions indeed consistent with beliefs in the ways specified by the theory? It provides a theory of beliefs and a theory of individual behavior, as well as an explanation for social outcomes.

The model in Figure 6.5 provides a foundation of sorts for Nash’s theory. It suggests that players play Nash equilibria because they expect others to and because they are utility maximizers. But this is not the only explanation that can be provided; alternatively, behavior might line up with the theory without passing through beliefs at all, as suggested by some accounts from evolutionary game theory that show how selection processes might favor behavior corresponding to Nash equilibrium even if agents are unaware of the game they are playing.

One might step still further back and ask why actors would form these beliefs, or take these actions, and answer in terms of assumptions about actor rationality. Figure 6.6, for instance, shows a model in which actor rationality might vary and might influence beliefs about the actions of others as well as reactions to those beliefs. Fully specified functional equations might specify not only how actors act when rational but also how they react when they are not. In this sense, the model in Figure 6.6 both nests Nash’s theory and provides an explanation for why actors conform to the predictions of the theory.


Figure 6.6: Formal structure of a normal form game.

In a final elaboration, we can represent a kind of underspecification of Nash’s theory that makes it difficult to take the theory to data. In the above, we assumed that players choose actions based on the expectation that the other player will play the Nash equilibrium—or that the theory would specify which equilibrium in the case of multiplicity. But it is well known that Nash’s theory often does not provide a unique solution. This indeterminacy can be captured in the causal model as shown in Figure 6.7, where a common shock—labeled \(\nu\) and interpreted as norms—interacts with the game form to determine the expectations of other players.

The functional equation for expectations can then allow for the possibility that (i) there is a unique equilibrium that is invariably chosen and played by both players; (ii) players are guaranteed to be playing one or another equilibrium together, but there is uncertainty over which one is played; or (iii) players may in fact be out of sync, with each playing optimal strategies given their beliefs but nevertheless not playing the same equilibria.

Nash’s theory likely corresponds to position (ii). It can be captured by functional equations on beliefs given \(\nu\), but the theory does not specify \(\nu\), in the same way that it does not specify \(\Gamma\).


Figure 6.7: A normal form game with a representation of equilibrium selection norms.

We highlight three points from this discussion.

First, the discussion highlights that thinking of theory as causal models does not force a sharp move away from abstract analytic theories; close analogues of these can often be incorporated in the same framework. This is true even for equilibrium analysis, which at first blush seems to involve a kind of simultaneity.

Second, the discussion highlights how the causal modelling framework can make demands for specificity from formal theories. For instance, specifying a functional relation from game form to actions requires a specification of a selection criterion in the event of multiple equilibria. Including agent rationality as a justification for the theory invites a specification of what would happen absent rationality.

Third, the example shows a way of building a bridge from pure theory to empirical claims. One can think of Nash’s theory as an entirely data-free set of claims. When translated into an applied theory—a set of propositions about the ways actual players might behave—and represented as a causal model, we are on a path to being able to use data to refine the theory. Thus we might begin with a formal specification like that in Figure 6.7 but with initial uncertainty about player rationality, optimizing behavior, and equilibrium selection. This theory nests Nash but does not presume the theory to be a valid description of processes in the world. Combined with data, however, we shift to a more refined theory that selects Nash from the lower-level model.

Finally, we can then apply the ideas of Section 6.2 to applied formal theories and ask: is the theory useful? For instance, do data on player rationality help us better understand the relationship between game structure and welfare?

References

Clarke, Kevin A, and David M Primo. 2012. A Model Discipline: Political Science and the Logic of Representations. New York: Oxford University Press.
Geweke, John, and Gianni Amisano. 2014. “Analysis of Variance for Bayesian Inference.” Econometric Reviews 33 (1-4): 270–88.
Heckerman, David E, Eric J Horvitz, and Bharat N Nathwani. 1991. “Toward Normative Expert Systems: The Pathfinder Project.” Methods of Information in Medicine 31: 90–105.
Pearl, Judea. 2009. Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge: Cambridge University Press.
Peressini, Anthony. 1999. “Applying Pure Mathematics.” Philosophy of Science 66: S1–13.
Raiffa, Howard, and Robert Schlaifer. 1961. Applied Statistical Decision Theory. Boston: Division of Research, Graduate School of Business Administration, Harvard University.
Scharf, Louis L. 1991. Statistical Signal Processing. Reading, MA: Addison-Wesley.

  49. We note that our definition of theory differs somewhat from that given in Pearl (2009, p. 207): there, a theory is a (functional) causal model and a restriction over \(\times_j \mathcal{R}(U_j)\), that is, over the collection of contexts envisionable. Our definition also considers probabilistic models as theories, allowing statements such as “the average effect of \(X\) on \(Y\) is 0.5.”

  50. As we emphasize further below, it is in fact only the random, unknown component of the \(X\rightarrow K\) link that makes the addition of \(K\) potentially informative as a matter of research design: if \(K\) were a deterministic function of \(X\) only, then knowledge of \(X\) would provide full knowledge of \(K\), and nothing could be learned from observing \(K\).

  51. We drop the arrow in Figure 6.2, however, in order to help visually convey the difference between the two models. In fact, we would construct \(\mathcal M_0\) by placing restrictions at nodes in \(\mathcal M_1\), rather than by changing the model’s structure, so that the allowed types in \(\mathcal M_0\) form a subset of those in \(\mathcal M_1\).

  52. This is not universally true, a point we return to below.

  53. In frequentist frameworks, we often think of analysis as implementing up-or-down empirical tests against data to parse between theories that should be maintained and theories that should be rejected. In a Bayesian framework, we think more continuously of shifting our beliefs across causal possibilities within a multi-dimensional theoretical space.

  54. As a general matter, an updated theory may not provide sharper claims for all queries. That is, in practice, posterior variance over queries can increase with more data. As a simple illustration: say we start out thinking that the probability that an outcome is due to conditions A, B, or C is .9, .05, and .05, respectively. If I find evidence that convinces me that A is not the cause, then I shift to (a) greater certainty about whether A was the cause but (b) greater uncertainty about whether B was the cause.

  55. See Raiffa and Schlaifer (1961). A similar expression can be given for the expected posterior variance from learning \(K\) in addition to \(W\) when \(W\) is not yet known. See, for example, Proposition 3 in Geweke and Amisano (2014).

  56. Or see the distinction, for instance, in Keynes between pure theory and applied theory.

  57. Peressini (1999) distinguishes between “applied mathematical theories” and “mathematized scientific theories” on the grounds that not all mathematized theories are an application of a pure theory.