Chapter 13 Case selection

With a causal model in hand we can assess in advance what conclusions we will draw from different observations. We apply the logic to the problem of case-selection: given a set of cases on which we already have \(X,Y\) data, which cases will it be most advantageous to choose for more in-depth investigation? We show the optimal case-selection strategy depends jointly on the model we start with and the causal question we seek to answer.

Very often researchers have access to \(X\), \(Y\) data on many cases and want to select a subset of cases (case studies) to examine more carefully in order to draw stronger conclusions either about general processes or about likely effects in given cases. A key design decision is to determine which cases are most likely to be informative. This is the question we address in this chapter.

13.1 Common case selection strategies

A host of different strategies have been proposed for selecting cases for in-depth study based on the observed values of \(X\), \(Y\) data. Perhaps the most common strategy is to select cases in which \(X=1\) and \(Y=1\) and look to see whether in fact \(X\) caused \(Y\) in the case in question (using some more or less formal strategy for inferring causality from within-case evidence). But many other strategies have been proposed, including strategies to select cases “on the regression line” or, for some purposes, cases “off the regression line” (e.g., Evan S. Lieberman (2005)). Some scholars suggest ensuring variation in \(X\) (most prominently, King, Keohane, and Verba (1994)), while others have proposed various kinds of matching strategies. Some have pointed to the advantages of random sampling of cases, either stratified or unstratified by values on \(X\) or \(Y\) (Fearon and Laitin (2008), Herron and Quinn (2016)).

Which cases we should choose will likely depend on the purposes to which we want to put them.

A matching strategy for instance—selecting cases that are comparable on many features but that differ on \(X\)—can replicate at a small scale the kind of inference done by matching estimators with large-\(n\) data. The strategy can draw leverage from \(X,Y\) variation even if researchers have matched on other within-case characteristics.

Other treatments seek to use qualitative information to check assumptions made in \(X, Y\) analysis: for example, is the measurement of \(X\) and \(Y\) reliable in critical cases? For such questions with limited resources, it might make sense to focus on cases for which validation plausibly makes a difference to the \(X,Y\) inferences: for example influential cases that have unusually extreme values on \(X\) and \(Y\). Similar arguments are made for checking assumptions on selection processes, though we consider this a more complex desideratum since this requires making case level causal inferences and not simply measurement claims (Dunning 2012).

A third purpose is to use a case to generate alternative or richer theories of causal processes, as in Lieberman’s “model-building” mode of “nested analysis” (Evan S. Lieberman (2005)). Here it may be cases off the regression line that are of interest.

Weller and Barnes (2014) focus on (a) X/Y relations and (b) whether the cases are useful for hypothesis generation.

In what follows, we focus on a simpler goal: given existing \(X, Y\) data for a set of cases and a given clue (or set of clues) that we can go looking for in the intensive analysis of some subset of these cases, for which cases would process tracing yield the greatest learning about different causal queries?

The basic insight of this chapter is simple enough: the optimal strategy for case selection for a model-based analysis is a function of the model we start with and the query we seek to address, just as we saw for the optimal clue-selection strategy in Chapter 12. Using this strategy yields guidance that is consistent with some common advice but at odds with other advice. The most general message of this chapter concerns the overall approach: have clear goals (whether you want to learn about case-level queries, population-level queries, or both), think through in advance what you might find in cases and how those findings would address your goals, and choose accordingly. More specifically, we can use a causal model to tell us what kinds of cases are likely to yield the greatest learning, given the model and a strategy of inference. We provide a tool for researchers to undertake this analysis.

The broad injunction to select cases to maximize learning is in line with the recommendations of Fairfield and Charman (forthcoming), though the strategy to do so differs. Most closely related to our analysis in this chapter is the contribution of Herron and Quinn (2016), who build on Seawright and Gerring (2008). While Seawright and Gerring provide a taxonomy of approaches to case selection, they do not provide a strategy for assessing the relative merits of these different approaches. As we do, Herron and Quinn (2016) focus on a situation with binary \(X,Y\) data and assess the gains from learning about causal type in a set of cases (interestingly in their treatment causal type, \(Z_i\) is called a confounder rather than being an estimand of direct interest; in our setup, confounding as normally understood arises because of different probabilities of different causal types of being assigned to “treatment”, or an \(X=1\) value). Herron and Quinn (2016) assume that in any given case selected for analysis a qualitative researcher is able to infer the causal type perfectly.

Our setup differs from that in Herron and Quinn (2016) in a few ways. Herron and Quinn (2016) parameterize differently, though this difference is not important.78 Perhaps the most important difference between our analysis and that in Herron and Quinn (2016) is that we connect the inference strategy to process-tracing approaches. Whereas Herron and Quinn (2016) assume that causal types can be read directly, we assume that these are inferred imperfectly from evidence and we endogenize the informativeness of the evidence to features of the inquiries. Moreover, not only can we have uncertainty about the probative value of clues, but researchers can learn about the probative value of clues by examining cases.

Here we assume that the case selection decision is made after observing the \(XY\) distribution and we explore a range of different possible contingency tables. In Herron and Quinn (2016) the distribution from which the contingency tables are drawn is fixed, though set to exhibit an expected observed difference in means (though not necessarily a true treatment effect) of 0.2. They assume large \(XY\) data sets (with 10,000 units) and case selection strategies ranging from 1 to 20 cases.

Another important difference is that in many of their analyses, Herron and Quinn (2016) take the perspective of an outside analyst who knows the true treatment effect; they then assess the expected bias generated by a research strategy over the possible data realizations. We, instead, take the perspective of a researcher who has beliefs about the true treatment effect that correspond to their priors, and for whom there is therefore no expected bias. This has consequences also for the assessment of expected posterior variance, as in our analyses the expectation of the variance is taken with respect to the researcher’s beliefs about the world, rather than being made conditional on some specific world (ATE). We think that this setup is addressed to the question that a researcher must answer when deciding on a strategy: given what they know now, what will produce the greatest reduction in uncertainty (the lowest expected posterior variance)?

Finally, we proceed somewhat differently in our identification of strategies from Herron and Quinn: rather than pre-specifying particular sets of strategies (operationalizations of those identified by Seawright and Gerring (2008)) and evaluating them, we define a strategy as the particular distribution over \(XY\) cells to be examined and proceed to examine every possible strategy given a choice of a certain number of cases in which to conduct process tracing. We thus let the clusters of strategies—those strategies that perform similarly—emerge from the analysis rather than being privileged by past conceptualizations of case-selection strategies.

Despite these various differences, our results will agree in key ways with those in Herron and Quinn (2016).

13.2 No general rules

Case selection is about choosing in which cases you will seek further information. We want to look for evidence in cases where that evidence is likely to be most informative. And the informativeness of a case depends, in turn, on our model and our query.

In this section we illustrate how simple rules, like choosing cases where \(X=1\) and \(Y=1\) or choosing the cases you most care about, may sometimes lead you astray. Rather, we will argue, there is a general procedure for determining how to select cases, and this procedure requires a specification of the learning you expect to achieve given different data patterns you might find.

13.2.1 Any cell might do

Although it might be tempting to seek general case-selection rules of the form “examine cases in which \(X=1\) and \(Y=1\)” or “ignore cases in which \(X=0\) and \(Y=1\),” it is easily demonstrated that which cases will be (in expectation) more informative depends on models and queries.

Suppose we have observed \(X\) and \(Y\) values for a random draw of cases from a population. Suppose further that we know, for this population, that the processes can be represented by the model \(X \rightarrow Y \leftarrow K\), and, moreover:

  • \(\Pr(Y=1|X=0, K = 0) = 1\)
  • \(\Pr(Y=1|X=1, K = 0) = .5\)
  • \(\Pr(Y=1|X=0, K = 1) = 0\)
  • \(\Pr(Y=1|X=1, K = 1) = .9\)

One way to read this set of statements is that \(X\)’s causal effect on \(Y\) varies with \(K\). We do not know, however, how common \(K\) is. Thus we do not know either the average effect of \(X\) or the probability that \(X\) caused \(Y\) in a case with particular \(X, Y\) values.

What do the above statements tell us about \(K\)’s informativeness? The beliefs above imply that if \(X=Y=1\), then \(K\) is a “doubly decisive” clue for assessing whether, in a given case, \(X\) causes \(Y\). In particular, we see that for an \(X=Y=1\) case, observing \(K=1\) implies that \(X\) caused \(Y\): this is because, if \(X\) were 0, \(Y\) would have been 0. We also see that \(K=0\) in an \(X=1, Y=1\) case implies that \(X\) did not cause \(Y\), since \(Y\) would still have been 1 even if \(X\) were 0. So an \(X=Y=1\) case would be a highly informative place to go looking for \(K\).

However, if we had a case in which \(X=Y=0\), then learning \(K\) would be entirely uninformative for the case. In particular, we already know that \(K=1\) in this case as the statements above exclude the possibility of a case in which \(X=Y=0\) and \(K=0\). So there would be nothing gained by “looking” to see what \(K\)’s value is in the case.

For the same reason, we can learn nothing from \(K\) in an \(X=0, Y=1\) case since we know that \(K=0\) in such a case. On the other hand, if we chose an \(X=1, Y=0\) case, then \(K\) would again be doubly decisive, with \(K=0\) implying that \(X=1\) caused \(Y=0\) (because the counterfactual of \(X = 0\) would have resulted in \(Y=1\) when \(K\) is 0), and \(K=1\) implying that \(X=1\) did not cause \(Y=0\) (because the counterfactual of \(X = 0\) would still result in \(Y=0\) since there is zero likelihood that \(Y = 1\) when \(X\) is 0 and \(K\) is 1).

We have chosen extreme values for this illustration (our beliefs could, of course, allow for gradations of informativeness rather than all-or-nothing identification), but the larger point is that beliefs about the way the world works can have a powerful effect on the kind of case in which learning is possible. Note that in this example there is nothing special about where a case lies relative to a (notional) regression line: informativeness in this setup happens to depend entirely on \(X\)’s value. Though, again, this is a feature of this particular set of beliefs about the world.

Suppose, now, that we were interested in a population query: the average effect of \(X\) on \(Y\). We can see that this is equal to \(\Pr(K=1)\times.9 + (1-\Pr(K=1))\times(-.5) = 1.4\times \Pr(K=1)-.5\). For this query, we need only determine the prevalence of \(K=1\) in the population. It might seem that this means that it is irrelevant what type of case we choose: why not use pure random sampling to determine \(K\)’s prevalence? As noted above, however, we have more information about the likely value of \(K\) in some kinds of cases than in others. Thus, for this population-level estimand as well, selecting an \(X=1\) case will be informative while selecting an \(X=0\) case will not be informative.
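As a check on this expression, the average effect can be written out by conditioning on \(K\). The derivation uses only the four conditional probabilities listed above and the fact that, in the model \(X \rightarrow Y \leftarrow K\), \(K\) is independent of \(X\):

\[\begin{aligned} ATE &= \Pr(K=1)\,\big[\Pr(Y=1|X=1,K=1)-\Pr(Y=1|X=0,K=1)\big] \\ &\quad + \Pr(K=0)\,\big[\Pr(Y=1|X=1,K=0)-\Pr(Y=1|X=0,K=0)\big] \\ &= \Pr(K=1)\times(.9-0) + (1-\Pr(K=1))\times(.5-1) \\ &= 1.4\times\Pr(K=1) - .5 \end{aligned}\]

For example, if \(\Pr(K=1)=.5\), the average effect is \(.2\); if \(K\) never occurs, it is \(-.5\).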

At the same time, not all \(X=1\) cases are equally informative. Should we choose an \(X=1\) case in which \(Y=1\) or one in which \(Y=0\)? In both types of case, \(K\) is doubly decisive for the probability of causation in the case. However, the two kinds of cases are differentially informative about \(K\)’s prevalence in the population. For an \(X=1, Y=1\) case, we think it moderately likely that \(K=1\) (specifically, assuming a prior of \(\Pr(K=1)=.5\) we think \(\Pr(K=1 | X=1, Y=1) = \frac{0.9 }{0.9 + 0.5}=.64\)). For an \(X=1, Y=0\) case, we think \(K=1\) is quite unlikely (\(\Pr(K=1 | X=1, Y=0) = \frac{0.1}{0.5+0.1}=0.17\)). In other words, we are much more uncertain about the value of \(K\) in the \(X=Y=1\) case than in the \(X=1, Y=0\) case. In this setup, we would thus expect to learn more about the average treatment effect by choosing to observe \(K\) in an on-the-diagonal case than in an off-the-diagonal case.
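The conditional beliefs about \(K\) used in this passage follow from Bayes' rule. The following is a minimal sketch, assuming (as in the text) a prior of \(\Pr(K=1)=.5\); the function and variable names are ours, introduced only for illustration.

```python
# Pr(Y=1 | X=x, K=k) for the model described above, keyed by (x, k).
P_Y1 = {(0, 0): 1.0, (1, 0): 0.5, (0, 1): 0.0, (1, 1): 0.9}


def pr_k1_given_xy(x, y, prior_k1=0.5, p_y1=P_Y1):
    """Posterior probability that K=1 in a case with observed X=x, Y=y."""

    def likelihood(k):
        p = p_y1[(x, k)]
        return p if y == 1 else 1 - p

    numerator = prior_k1 * likelihood(1)
    denominator = numerator + (1 - prior_k1) * likelihood(0)
    return numerator / denominator


for x, y in [(0, 0), (0, 1), (1, 1), (1, 0)]:
    print(f"X={x}, Y={y}: Pr(K=1 | X, Y) = {pr_k1_given_xy(x, y):.2f}")
```

In the two \(X=0\) cells the posterior is degenerate (1 and 0 respectively), which is why looking for \(K\) there teaches us nothing; in the two \(X=1\) cells it is roughly .64 and .17.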

Specifically, let \(\kappa\) denote \(\Pr(K=1)\). Suppose that we begin thinking it equally likely that \(\kappa=\kappa^H = .5\) and \(\kappa=\kappa^L=0\). Suppose, further, that the distribution of \(X\) is such that, for any randomly drawn case, \(\Pr(X=1) = .5\). We then observe one case with \(X=1, Y=1\) and another with \(X=1, Y=0\). From that information alone, we can update over \(\kappa\). Specifically, conditioning on \(X=1\), when we observe the data pattern, \(D\), in which \(Y=0\) in one case and \(Y=1\) in the other, the posterior on \(\kappa\) is:

\[\begin{eqnarray*} p(\kappa = \kappa^L|D) &=& \frac{p(D|\kappa^L)}{p(D|\kappa^L)+p(D|\kappa^H)}\\ &=&\frac{.5}{.5 + .25\times(2\times.5\times.5 + 2\times.9\times.1 + 2\times(.9\times.5 +.1\times.5))}\\ &=&\frac{.5}{.5 + .42} \approx .54 \end{eqnarray*}\]
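The terms in this posterior can be checked numerically. The sketch below sums over the four possible combinations of \(K\) values in the two \(X=1\) cases; the function name `pr_pattern` is ours, and the equal prior weights on \(\kappa^H\) and \(\kappa^L\) are those stipulated in the text.

```python
from itertools import product

# Pr(Y=1 | X=1, K=k) from the model above.
P_Y1_X1 = {0: 0.5, 1: 0.9}


def pr_pattern(kappa):
    """Pr(one Y=1 and one Y=0 among two X=1 cases), given Pr(K=1)=kappa."""
    total = 0.0
    for k1, k2 in product([0, 1], repeat=2):
        pr_ks = (kappa if k1 else 1 - kappa) * (kappa if k2 else 1 - kappa)
        # Either the first case has Y=1 and the second Y=0, or the reverse.
        pr_ys = (P_Y1_X1[k1] * (1 - P_Y1_X1[k2])
                 + (1 - P_Y1_X1[k1]) * P_Y1_X1[k2])
        total += pr_ks * pr_ys
    return total


lik_high = pr_pattern(0.5)  # p(D | kappa^H)
lik_low = pr_pattern(0.0)   # p(D | kappa^L)

# With equal prior weights, the priors cancel in Bayes' rule.
posterior_low = lik_low / (lik_low + lik_high)
print(round(lik_high, 2), round(lik_low, 2), round(posterior_low, 2))
```

The \(X,Y\) data alone thus shift beliefs only slightly toward \(\kappa^L\), from .5 to about .54.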

In summary, under the stipulated beliefs about the world, we can learn most about the population \(ATE\) by selecting an \(X=Y=1\) case for study. We can also learn about the case-level effect in such a case as well as in an \(X=1, Y=0\) case. If we are interested in the case-level estimand for any \(X=0\) case, then there are no gains from any case-selection strategy since we know \(K\)’s value based on \(X\) and \(Y\)’s value.

There is nothing preferable in general about an \(X=1\) case. Under a different set of beliefs about the world, we would expect to learn more in an \(X=Y=0\) than in an \(X=Y=1\) case. Suppose, for instance, that we have a model in which:

  • \(X \rightarrow Y \leftarrow K\)
  • \(\Pr(Y=1|X=0, K = 0) = .5\)
  • \(\Pr(Y=1|X=1, K = 0) = 0\)
  • \(\Pr(Y=1|X=0, K = 1) = .5\)
  • \(\Pr(Y=1|X=1, K = 1) = 1\)

In this world, we learn nothing from observing a case in which \(X=Y=1\)—since we already know that \(K=1\). In contrast, if \(X=Y=0\), then if we learn that \(K=1\), we know that, were \(X=1\), \(Y\) would have been 1; and if instead we observe \(K=0\), we know that \(Y\) would have (still) been 0 if \(X\) were 1. Now, \(K\) is doubly decisive for an \(X=Y=0\) case but unhelpful for an \(X=Y=1\) case.

The two off-diagonal cases may also be different in the opportunities for learning that they present. Suppose that you knew that:

  • \(X \rightarrow Y \leftarrow K\)
  • \(\Pr(Y=1|X=0, K = 0) = 0\)
  • \(\Pr(Y=1|X=1, K = 0) = .5\)
  • \(\Pr(Y=1|X=0, K = 1) = 1\)
  • \(\Pr(Y=1|X=1, K = 1) = .5\)

For an \(X=1, Y=0\) case, if you observe \(K=1\) you know that if \(X\) were 0, \(Y\) would have been 1; but if \(K=0\), you know that if \(X\) were 0, \(Y\) would still have been 0. In that case, \(K\) would be doubly decisive for \(X=1\) causing \(Y=0\). But in an \(X=0, Y=1\) case, we already know \(K=1\) before we go looking, since \(\Pr(Y=1|X=0, K=0) = 0\).

In summary: beware simple rules for case selection. Depending on the model, the query, and priors, any type of case may be optimal.
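To make the comparison across the three sets of beliefs concrete, the following sketch checks, for each model in this section, in which \(X,Y\) cells the value of \(K\) is not already pinned down by \(X\) and \(Y\). The labels "first", "second" and "third" simply follow the order in which the models appear above, and the helper `possible_k` is ours. The check only flags cells in which looking for \(K\) cannot teach us anything; it does not measure how decisive \(K\) is in the remaining cells.

```python
# Pr(Y=1 | X=x, K=k) for each model in this section, keyed by (x, k).
MODELS = {
    "first":  {(0, 0): 1.0, (1, 0): 0.5, (0, 1): 0.0, (1, 1): 0.9},
    "second": {(0, 0): 0.5, (1, 0): 0.0, (0, 1): 0.5, (1, 1): 1.0},
    "third":  {(0, 0): 0.0, (1, 0): 0.5, (0, 1): 1.0, (1, 1): 0.5},
}


def possible_k(p_y1, x, y):
    """K values that have positive likelihood in a case with X=x, Y=y."""
    values = []
    for k in (0, 1):
        p = p_y1[(x, k)]
        likelihood = p if y == 1 else 1 - p
        if likelihood > 0:
            values.append(k)
    return values


open_cells = {}
for name, p_y1 in MODELS.items():
    # Cells (x, y) in which both K=0 and K=1 remain possible.
    open_cells[name] = [(x, y) for x in (0, 1) for y in (0, 1)
                        if len(possible_k(p_y1, x, y)) == 2]
    print(f"{name} model: K still uncertain in (X, Y) cells {open_cells[name]}")
```

Under the first and third models only \(X=1\) cases leave \(K\) open; under the second, only \(X=0\) cases do.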

13.2.2 Interest in a case might not justify selecting that case

It seems obvious that if your query of interest is defined at the case level—not at the population level—then the choice of cases is determined trivially by the query.

This is not correct, however.

Sometimes you might be interested in effects in case A but still be better off gathering more information about case B instead of digging deeper into case A. We illustrate this phenomenon for a case where the learning from cases operates via updating on a general model (and subsequent application of that model to the case of interest) rather than via direct application of an informative general model to the case of interest.

We imagine a world in which we have the causal model \(X \rightarrow M \rightarrow Y \leftarrow K\), flat priors on all nodal types, and access to data as in Table 13.1:

Table 13.1: Available data. If we are interested in whether X caused Y in case A, are we better off gathering data on M for case A or on K in case B?

| Case | X | Y | K | M |
|------|---|---|---|---|
| A    | 1 | 1 | 1 |   |
| B    | 0 | 0 |   | 1 |

We are interested specifically in whether \(X\) mattered in case A. Are we better off gathering data on \(M\) for case \(A\) or on \(K\) in case \(B\)?

Given this model we can imagine what we might find under each strategy and what we might then infer. Details are shown in Table 13.2.

Table 13.2: Expected data and projected inferences on effects for case A given data

| Quantity                 | Best guess | Uncertainty |
|--------------------------|------------|-------------|
| Current beliefs          | 0.25       | 0.19        |
| Probability K = 1 for B  | 0.66       |             |
| If you find K = 0 for B: | 0.25       | 0.19        |
| If you find K = 1 for B: | 0.26       | 0.19        |
| Probability M = 1 for A  | 0.48       |             |
| If you find M = 0 for A: | 0.32       | 0.22        |
| If you find M = 1 for A: | 0.19       | 0.16        |

This gives us enough to calculate the expected uncertainty under each strategy.

  • The baseline uncertainty is 0.1897.
  • Under a strategy in which you gather data on \(K\) for case B the expected uncertainty is 0.1885.
  • The expected uncertainty from gathering data on \(M\) for case A is 0.1878.

These numbers are all very similar, highlighting the difficulty of drawing inferences without a strong prior model based on just two cases. This is probably the most important lesson of this exercise.

Nevertheless they do diverge. Intuitively, when we investigate case \(B\) we benefit from a Millian logic: finding that the cases are similar on \(K\) makes us think it marginally more likely that variation in \(X\) is explaining outcomes. When we investigate case \(A\) however we are more likely to be convinced that \(X\) mattered in case \(A\) when we find that differences in \(M\) are in line with differences in \(X\) and \(Y\).

Say though we were interested in inferences on Case B. Which strategy would be better?

Details are given in Table 13.3.

Table 13.3: Expected data and projected inferences on effects for case B given data

| Quantity                 | Best guess | Uncertainty |
|--------------------------|------------|-------------|
| Current beliefs          | 0.26       | 0.19        |
| Probability K = 1 for B  | 0.66       |             |
| If you find K = 0 for B: | 0.25       | 0.19        |
| If you find K = 1 for B: | 0.26       | 0.19        |
| Probability M = 1 for A  | 0.48       |             |
| If you find M = 0 for A: | 0.32       | 0.22        |
| If you find M = 1 for A: | 0.19       | 0.16        |

We see here again that updating on case \(B\) is also best achieved by observation of \(M\) (in case \(A\)) rather than \(K\) (in case \(B\)). In other words tightening inferences on \(B\) is best done by investigating \(A\) further. In particular:

  • The baseline uncertainty for case B is 0.1926.
  • Under a strategy in which you gather data on \(K\) for case B the expected uncertainty is 0.1925.
  • The expected uncertainty from gathering data on \(M\) for case A is 0.1875.

Note that, apart from the first row (current beliefs), Tables 13.2 and 13.3 are in fact identical, even though the query differs. The reason is that in both cases the updating operates with respect to clues (\(K\) or \(M\)) for which there is data for both units; the fact that there may be additional data on a clue for only one unit (\(K\) for unit A in Table 13.2 and \(M\) for unit B in Table 13.3) does not change inferences because, starting from an uninformed model, the additional “clue” in one case has no probative value.

This example highlights learning about cases that works through updating on the general causal model rather than learning about cases from applying a prior model to the case. It confirms the possibility of this learning, highlights the possibly limited scope for learning from few cases, and points to counterintuitive implications in cases in which this is the goal of case-level analysis.

13.3 General strategy

We introduce a flexible approach to comparing the prospective learning from alternative case-selection strategies. To help explore the intuition behind this strategy, we start by walking through a simplified setup and then implement the approach for a range of strategies and causal queries.

13.3.1 Walk through of the general strategy

Consider a situation in which our model is \(X \rightarrow M \rightarrow Y\). Suppose, further, that we restrict the nodal types such that \(X\) cannot have a negative effect on \(M\), and \(M\) cannot have a negative effect on \(Y\), with flat priors over all remaining nodal types. Imagine then that we begin by collecting only \(X,Y\) data on six cases and obtain the following data pattern:

Table 13.4: Observed data

| event | count |
|-------|-------|
| X0Y0  | 2     |
| X1Y0  | 1     |
| X0Y1  | 1     |
| X1Y1  | 2     |

These \(X,Y\) data already give us some information about the causal effect of \(X\) on \(Y\). Yet, we want to learn more by examining some subset of these cases more deeply—and, specifically, by collecting data on \(M\) for two of these cases. Which cases should we select? We consider three strategies, each conditional on \(X\) and \(Y\) values:

  • Strategy \(A\) chooses two cases on the regression line, one randomly drawn from the \(X=Y=0\) cell and one randomly drawn from the \(X=Y=1\) cell
  • Strategy \(B\) chooses off the regression line, one randomly drawn from the \(X=1, Y=0\) cell and one randomly drawn from the \(X=0, Y=1\) cell
  • Strategy \(C\) chooses two cases, both from the \(X=1, Y=1\) cell

How can we evaluate these strategies prospectively?

We start by recognizing that different strategies yield different possible data patterns. For instance, Strategy \(A\) (on the line) could possibly give us a data pattern that includes the observation \(X=0, M=0, Y=0\). Yet Strategy \(A\) cannot possibly yield a data pattern that includes the observation \(X=1, M=0, Y=0\)—because it does not involve the inspection of \(M\) in an \(X=1, Y=0\) case—whereas Strategy \(B\) (off the line) can yield a pattern that includes this observation. And neither strategy can possibly yield a pattern that includes both \(X=1, M=0, Y=0\) and \(X=0, M=1, Y=0\).

In Table 13.5, we represent the full set of possible data patterns that can arise from each strategy, with the possible data patterns for Strategy \(A\) or \(B\) labeled \(A1, A2\), etc. or \(B1, B2\), etc., respectively. As we can see, there are four possible data patterns from each of strategies \(A\) and \(B\), representing the 4 different combinations of \(M\) values we might find across the two cases selected for deeper investigation. There are three possible outcomes from strategy \(C\). In the comparison presented here, none of the possible data patterns overlap across strategies.
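The counts of possible data patterns can be reproduced by brute-force enumeration. The sketch below is illustrative only: the dictionary `STRATEGIES` and the function `possible_patterns` are our own names, and each pattern is recorded as the sorted pair of \(X, M, Y\) observations obtained from the two selected cases.

```python
from itertools import product

# Cells from which each strategy draws its two cases.
STRATEGIES = {
    "A": [("X0", "Y0"), ("X1", "Y1")],  # on the regression line
    "B": [("X1", "Y0"), ("X0", "Y1")],  # off the regression line
    "C": [("X1", "Y1"), ("X1", "Y1")],  # two positive cases
}


def possible_patterns(cells):
    """Distinct sets of X-M-Y observations from observing M in one case per cell."""
    patterns = set()
    for m_values in product([0, 1], repeat=len(cells)):
        observed = sorted(f"{x}M{m}{y}" for (x, y), m in zip(cells, m_values))
        patterns.add(tuple(observed))
    return sorted(patterns)


patterns = {name: possible_patterns(cells) for name, cells in STRATEGIES.items()}
for name, found in patterns.items():
    print(name, len(found), found)
```

Strategy \(C\) yields only three patterns because its two cases come from the same cell, so the order in which \(M=0\) and \(M=1\) are found does not matter.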

The next step is to grapple with the fact that not all of the possible data realizations for a given strategy are equally likely to emerge. We represent the data probabilities near the bottom of the table.79 How likely a data pattern is to emerge will depend on the model, any restrictions or priors we have built into the model, and any updating of beliefs that arises from the pure \(X,Y\) data. Note, for instance, that data pattern \(A3\) is much more likely to emerge than the other data patterns possible under Strategy \(A\). This is for two reasons. One is that \(A3\) involves \(M\) co-varying with \(X\) and \(Y\), a pattern consistent with \(X\) having an effect on \(Y\), since, in this model, \(X\) can only affect \(Y\) if it affects \(M\) and if \(M\) affects \(Y\). Data patterns \(A1\) and \(A4\) have \(M\) constant between the two cases, even as \(X\) and \(Y\) vary; this is a pattern inconsistent with \(X\) having an effect on \(Y\). \(A3\), then, is more likely than \(A1\) or \(A4\) because the restrictions on the model plus the evidence from the \(X,Y\) data make us believe that \(X\) does have an average effect on \(Y\). Second, we believe \(A3\) is more probable than \(A2\) because of the model’s restrictions: the model allows positive effects of \(X\) on \(M\) and of \(M\) on \(Y\) (a way of generating \(A3\)), but rules out negative intermediate effects (a way of generating \(A2\)).

Finally, each possible data realization will (if realized) generate (possible) updating of our beliefs about the query of interest. In the second-to-last row of the Table 13.5, we can see the mean of the posterior distribution (for the \(ATE\) of \(X\) on \(Y\)) under each data pattern.

How do we now evaluate the different strategies? This is the same as asking what our loss function (or utility function, or objective function) is. In what follows we focus on posterior variance and, in particular, expected posterior variance, though we emphasize that the same procedure can be used with other loss functions (a natural candidate from the study of experimental design is the expected information gain (Lindley 1956)).80

The posterior variance on the \(ATE\), for each data pattern, is represented in the table’s final row. We can see that our level of posterior uncertainty varies across possible data realizations. We operationalize the expected learning under each case-selection strategy as the expected reduction in posterior variance.

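Table 13.5: Each column shows a possible distribution of data that can be generated from a given strategy. We calculate the probability of each data possibility, given the data seen so far, and the posterior variance associated with each one.

| event              | A1    | A2    | A3    | A4    | B1    | B2    | B3    | B4    | C1    | C2    | C3    |
|--------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| X0M0Y0             | 1     | 0     | 1     | 0     | 0     | 0     | 0     | 0     | 0     | 0     | 0     |
| X0M0Y1             | 0     | 0     | 0     | 0     | 1     | 0     | 1     | 0     | 0     | 0     | 0     |
| X0M1Y0             | 0     | 1     | 0     | 1     | 0     | 0     | 0     | 0     | 0     | 0     | 0     |
| X0M1Y1             | 0     | 0     | 0     | 0     | 0     | 1     | 0     | 1     | 0     | 0     | 0     |
| X0Y0               | 1     | 1     | 1     | 1     | 2     | 2     | 2     | 2     | 2     | 2     | 2     |
| X0Y1               | 1     | 1     | 1     | 1     | 0     | 0     | 0     | 0     | 1     | 1     | 1     |
| X1M0Y0             | 0     | 0     | 0     | 0     | 1     | 1     | 0     | 0     | 0     | 0     | 0     |
| X1M0Y1             | 1     | 1     | 0     | 0     | 0     | 0     | 0     | 0     | 2     | 1     | 0     |
| X1M1Y0             | 0     | 0     | 0     | 0     | 0     | 0     | 1     | 1     | 0     | 0     | 0     |
| X1M1Y1             | 0     | 0     | 1     | 1     | 0     | 0     | 0     | 0     | 0     | 1     | 2     |
| X1Y0               | 1     | 1     | 1     | 1     | 0     | 0     | 0     | 0     | 1     | 1     | 1     |
| X1Y1               | 1     | 1     | 1     | 1     | 2     | 2     | 2     | 2     | 0     | 0     | 0     |
| Probability        | 0.165 | 0.029 | 0.632 | 0.174 | 0.262 | 0.234 | 0.234 | 0.271 | 0.076 | 0.235 | 0.688 |
| Posterior mean     | 0.078 | 0.041 | 0.171 | 0.079 | 0.129 | 0.142 | 0.142 | 0.129 | 0.047 | 0.09  | 0.163 |
| Posterior variance | 0.008 | 0.003 | 0.02  | 0.007 | 0.017 | 0.018 | 0.018 | 0.016 | 0.003 | 0.009 | 0.02  |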

From the probability of each data type (given the model and the \(X,Y\) data seen so far) and the posterior variance given each data realization, the implied expected variance is easily calculated as a weighted average. The expected posterior variances for our three strategies are summarized in Table 13.6:

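Table 13.6: Expected posterior variances

| Strategy                         | Variance |
|----------------------------------|----------|
| Off the line (Strategy B)        | 0.0172   |
| On the line (Strategy A)         | 0.0154   |
| Two X=1, Y=1 cases (Strategy C)  | 0.0160   |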
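The weighted average behind Table 13.6 can be reproduced, approximately, from the last rows of Table 13.5. The sketch below uses the rounded probabilities and variances printed there, so it agrees with Table 13.6 only to about the third decimal place; the names `TABLE_13_5` and `expected_variance` are ours.

```python
# Probabilities and posterior variances of each data pattern, as printed
# (rounded) in Table 13.5.
TABLE_13_5 = {
    "A": {"prob": [0.165, 0.029, 0.632, 0.174],
          "var": [0.008, 0.003, 0.020, 0.007]},
    "B": {"prob": [0.262, 0.234, 0.234, 0.271],
          "var": [0.017, 0.018, 0.018, 0.016]},
    "C": {"prob": [0.076, 0.235, 0.688],
          "var": [0.003, 0.009, 0.020]},
}


def expected_variance(prob, var):
    """Probability-weighted average of posterior variances.

    The rounded probabilities do not sum exactly to 1, so we renormalize.
    """
    return sum(p * v for p, v in zip(prob, var)) / sum(prob)


expected = {name: expected_variance(row["prob"], row["var"])
            for name, row in TABLE_13_5.items()}
for name, value in sorted(expected.items(), key=lambda item: item[1]):
    print(f"Strategy {name}: expected posterior variance = {value:.4f}")
```

Even with rounded inputs, the ordering of strategies matches the table: on the line (\(A\)) does best, two positive cases (\(C\)) next, and off the line (\(B\)) worst.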

In this example, we see that we would expect to be better off—in the sense of having less posterior uncertainty—by focusing our process-tracing efforts on the regression line than off the regression line. We only do marginally better by spreading on the line than by concentrating on positive cases. We save an account of the intuition underlying this result for the discussion of our more extensive set of simulations below.

The key takeaways here are the core elements of our model-based approach to assessing case-selection strategies:

  1. Derive from the model the full set of possible data patterns under each case-selection strategy being assessed
  2. Calculate the probability of each data pattern given the model (with any priors or restrictions), the prior \(X,Y\) data, and the strategy
  3. Generate a posterior distribution on the query of interest for each data pattern
  4. Use the probability of and posterior variance under each data pattern to calculate the expected posterior variance on the query of interest for each strategy

13.3.2 Simulation results

In this section, we generalize the model-based approach by applying it to a wide range of models, queries, and case-selection strategies.81

In all scenarios examined here, we imagine a situation in which we have already observed some data (the values of some nodes from the causal model in some set of cases) and must now decide in which cases we should gather additional data. We will assume throughout that we are considering gathering additional observations in cases for which we already have some data. In other words, we are deciding which subset of the cases—among those we have already gathered some data on—we should investigate more deeply. (This is distinct from the question of “wide vs. deep”, where we might decide to observe cases we have not yet seen at all.)

The general intuition of the case-selection approach that we develop here is that we can use our causal model and any previously observed data to estimate what observations we are more or less likely to make under a given case-selection strategy, and then figure out how far off from the (under the model) true estimand we can expect to be under the strategy, given whatever causal question we seek to answer.

We proceed as follows:

DAG. We start, as always, with a DAG representing our beliefs about which variables we believe to be direct causes of other variables. For the current illustrations, we consider four different DAGs: a simple mediation (or “chain”) model, \(X \rightarrow M \rightarrow Y\); a model with confounding, \(X \rightarrow Y \leftarrow M \rightarrow X\); a model with a moderator, \(X \rightarrow Y \leftarrow M\); and a two-path model, \(X \rightarrow M \rightarrow Y \leftarrow X\).
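For readers who want to manipulate these structures directly, the four DAGs can be written down as parent sets. This is only a sketch in plain Python (standard library, Python 3.9 or later), not the representation used by the software behind the simulations; the short model labels are ours.

```python
from graphlib import TopologicalSorter

# Each model maps a node to the set of its parents (direct causes).
DAGS = {
    "chain":      {"X": set(), "M": {"X"}, "Y": {"M"}},
    "confounded": {"M": set(), "X": {"M"}, "Y": {"X", "M"}},
    "moderator":  {"X": set(), "M": set(), "Y": {"X", "M"}},
    "two-path":   {"X": set(), "M": {"X"}, "Y": {"X", "M"}},
}

orders = {}
for name, parents in DAGS.items():
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    orders[name] = list(TopologicalSorter(parents).static_order())
    print(name, "causal ordering:", orders[name])
```

In every model \(Y\) comes last in the causal ordering; the models differ in where \(M\) sits relative to \(X\).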

Restrictions or priors. As when conducting mixed-method inference, we can set qualitative restrictions and/or differential quantitative weights on the (possibly conditional) nodal types in the model. For each model we examine a version with no restrictions and a version with monotonicity restrictions.
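To illustrate what a monotonicity restriction does, consider a binary node with a single binary parent. The sketch below enumerates its nodal types and drops the one in which the parent has a negative effect; the function `nodal_types` and the tuple encoding are ours, chosen for illustration.

```python
from itertools import product


def nodal_types(monotonic=False):
    """Nodal types for a binary node with one binary parent.

    Each type is a pair (value when parent = 0, value when parent = 1).
    """
    types = list(product([0, 1], repeat=2))
    if monotonic:
        # Drop the type for which the parent has a negative effect: (1, 0).
        types = [t for t in types if t != (1, 0)]
    return types


unrestricted = nodal_types()
restricted = nodal_types(monotonic=True)
print(len(unrestricted), unrestricted)
print(len(restricted), restricted)
```

The restriction leaves three types per node: never 1, positive effect, and always 1.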

Given data. If we have already made observations of any of the model’s nodes in some set of cases, we can use this information to condition our strategy for searching for further information. For instance, if we have observed \(X\)’s and \(Y\)’s value in a set of cases, we might select cases for process tracing based on their values of \(X\) and/or \(Y\). And, importantly, what we have already observed in the cases will affect the inferences we will draw when we observe additional data, including how informative a particular new observation is likely to be. For the simulations, we assume that we have already observed \(X\) and \(Y\) in a set of cases and found a positive correlation.

Query. We define our query. This could, for instance, be \(X\)’s average effect on \(Y\) or it might be the probability that \(X\) has a negative effect on \(Y\) in an \(X=1, Y=0\) case. We can use the general procedure to identify case-selection strategies for any causal query that can be defined on a DAG. And, importantly, the optimal case-selection strategy may depend on the query. The best case-selection strategy for answering one query may not be the best case-selection strategy for another query. In the simulations we examine four common queries.

Define one or more strategies. A strategy is defined, generically, as the search for data on a given set of nodes, in a given number of cases that are randomly selected conditional on some information we already have about potential cases. In the simulations below, our strategy will always involve uncovering \(M\)’s value in 1 or 2 cases. What we are wondering is how to choose these one or two cases for deeper analysis.

Possible data. For each strategy, there are multiple possible sets of data that we could end up observing. In particular, the data we could end up with will be the \(X,Y\) patterns we have already observed plus some pattern of \(M\) observations.

Probability of the data. We then calculate a probability of each possible data realization, given the model (with any restrictions or priors) and any data that we have already observed.82 Starting with the model together with our priors, we update our beliefs about \(\lambda\) based on the previously observed data. This posterior now represents our prior for the purposes of the process tracing. In the analyses below, we use the already-observed \(X,Y\) correlation to update our beliefs about causal-type share allocations in the population. We then use this posterior to draw a series of \(\lambda\) values.

Given that the ambiguity matrix gives us the mapping from causal types to data realizations, we can calculate for each \(\lambda\) draw the probability of each data possibility given that particular \(\lambda\) and the strategy. We then average across repeated \(\lambda\) draws.
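
The averaging step can be sketched in a few lines of Python (not the packages used for the chapter's results). The sketch assumes a chain model, uses flat Dirichlet draws as a stand-in for the posterior on \(\lambda\) (the actual posterior would reflect the observed \(X,Y\) data), and computes the probability of finding \(M=1\) in an \(X=1, Y=1\) case. All function names are hypothetical.

```python
import random

# Nodal types (labels assumed here): a = negative effect, b = positive effect,
# c = always 0, d = always 1.
TYPES = ("a", "b", "c", "d")

def draw_dirichlet(alpha, rng):
    """Draw one vector of nodal-type shares from Dirichlet(alpha)."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return dict(zip(TYPES, (g / total for g in gammas)))

def prob_m1_given_x1_y1(lam_m, lam_y):
    """P(M=1 | X=1, Y=1, lambda) in a chain model X -> M -> Y."""
    # M=1 when X=1 requires M-types b or d; Y=1 when M=1 requires Y-types b or d.
    m1 = (lam_m["b"] + lam_m["d"]) * (lam_y["b"] + lam_y["d"])
    # M=0 when X=1 requires M-types a or c; Y=1 when M=0 requires Y-types a or d.
    m0 = (lam_m["a"] + lam_m["c"]) * (lam_y["a"] + lam_y["d"])
    return m1 / (m1 + m0)

def average_data_probability(n_draws=5000, seed=1):
    """Average the data probability over repeated lambda draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        # Assumption: flat Dirichlet draws stand in for posterior draws.
        lam_m = draw_dirichlet([1, 1, 1, 1], rng)
        lam_y = draw_dirichlet([1, 1, 1, 1], rng)
        total += prob_m1_given_x1_y1(lam_m, lam_y)
    return total / n_draws

p_m1 = average_data_probability()
print(round(p_m1, 2))
```

With symmetric draws the average is close to one half; with draws from a posterior that has learned from data, the two realizations would generally not be equally likely.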

Posterior on estimate given the data. For each data possibility, we can then use CQtools to ask what inference we would get from each data possibility, given whatever query we seek to answer, as well as the variance of that posterior. Examining the inferences from possible data-realizations, as we do below, can help us understand how the learning unfolds for different strategies.

Expected posterior variance under each strategy. The quantity of ultimate interest is the posterior variance that we expect to end up with under each strategy. The expected posterior variance is simply an average of the posterior variances under each data possibility, weighted by the probability of each data possibility. We operationalize the expected learning under a strategy as the expected reduction in posterior variance arising from that strategy.
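
The weighted average described above can be written out directly. The following is a minimal Python sketch with made-up numbers; the function names and the example values are hypothetical and are not taken from the simulations.

```python
def expected_posterior_variance(data_probs, posterior_vars):
    """Average the posterior variances, weighting by each data possibility's probability."""
    if len(data_probs) != len(posterior_vars):
        raise ValueError("need one posterior variance per data possibility")
    if abs(sum(data_probs) - 1.0) > 1e-9:
        raise ValueError("data probabilities must sum to 1")
    return sum(p * v for p, v in zip(data_probs, posterior_vars))

def expected_learning(prior_var, data_probs, posterior_vars):
    """Expected reduction in variance relative to the variance before process tracing."""
    return prior_var - expected_posterior_variance(data_probs, posterior_vars)

# Hypothetical strategy with two possible data realizations.
probs = [0.25, 0.75]       # probability of each realization
post_vars = [0.04, 0.08]   # posterior variance of the query under each realization
prior_var = 0.10           # variance given the X,Y data alone

epv = expected_posterior_variance(probs, post_vars)
gain = expected_learning(prior_var, probs, post_vars)
print(round(epv, 3), round(gain, 3))
```

Whether the reduction is reported in absolute terms, as here, or as a share of the prior variance is a presentational choice; the ranking of strategies for a given model and query is the same either way.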

13.3.3 Models, queries, and strategies

We vary the features of the simulations with respect to models, queries and strategies.


We illustrate the approach using a set of relatively simple models with core structural features that we believe will be fairly common in applied settings. The four structures that we examine are:

  • Chain model: an \(X \rightarrow M \rightarrow Y\) model, where \(M\) is a mediator
  • Confounded model: a model with \(X \rightarrow Y\) and with \(M\) as a confounder, pointing into both \(X\) and \(Y\)
  • Moderator model: an \(X \rightarrow Y \leftarrow M\), where \(M\) is a moderator
  • Two-path model: a model in which \(X \rightarrow M \rightarrow Y \leftarrow X\), meaning that \(X\) can affect \(Y\) both through a direct path and indirectly via \(M\)

For each of these causal structures, we consider both an unconstrained version (all nodal types permitted) and a monotonic version in which we use restrictions to exclude negative effects throughout the model.


We also examine a range of queries under each model:

  • \(ATE\): what is the average effect of \(X\) on \(Y\) for the population?
  • Probability of positive causation: what is the probability that \(Y=1\) because \(X=1\) for a case randomly drawn from the population of \(X=1, Y=1\) cases?
  • Probability of negative causation: what is the probability that \(Y=1\) is due to \(X=0\) for a case randomly drawn from the population of \(X=0, Y=1\) cases?
  • Probability of an indirect effect: defined only for the two-path models, we estimate the probability that the effect of \(X\) on \(Y\) operates through the indirect path. More precisely, we ask, for an \(X=1, Y=1\) case in which \(X=1\) caused \(Y=1\), what the probability is that that effect would not have occurred if \(M\) were held fixed at the value it takes on when \(X=1\).83
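
To fix ideas, the first two queries can be written out for the unconstrained chain model. The notation is an assumption made for this sketch: \(\lambda^M_a, \lambda^M_b, \lambda^M_c, \lambda^M_d\) denote the shares of \(M\)'s nodal types with a negative, positive, always-0 and always-1 response to \(X\), \(\lambda^Y_a, \ldots, \lambda^Y_d\) denote the corresponding shares for \(Y\)'s response to \(M\), and nodal types are taken to be independent across nodes. A positive effect of \(X\) on \(Y\) requires two positive or two negative links, and a negative effect requires one of each, so

\[
ATE = \left(\lambda^M_b\lambda^Y_b + \lambda^M_a\lambda^Y_a\right) - \left(\lambda^M_a\lambda^Y_b + \lambda^M_b\lambda^Y_a\right) = \left(\lambda^M_b - \lambda^M_a\right)\left(\lambda^Y_b - \lambda^Y_a\right)
\]

\[
\Pr(X=1 \text{ caused } Y=1 \mid X=1, Y=1) = \frac{\lambda^M_b\lambda^Y_b + \lambda^M_a\lambda^Y_a}{\left(\lambda^M_b + \lambda^M_d\right)\left(\lambda^Y_b + \lambda^Y_d\right) + \left(\lambda^M_a + \lambda^M_c\right)\left(\lambda^Y_a + \lambda^Y_d\right)}
\]

Because \(\lambda\) is uncertain, each query has a posterior distribution, and it is the variance of that distribution that the strategies below aim to reduce.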


Finally, for each model-query combination, we assess the contributions of 7 strategies for selecting cases for process tracing, with inferences from the \(X,Y\) data alone serving as our baseline. In the figures below, the strategies run along the horizontal axis of each graph and can be interpreted as follows:

  • Prior: beliefs are based on \(X,Y\) data only.
  • 1 off: data on \(M\) is sought in one case in the \(X=1, Y=0\) cell
  • 1 on: data on \(M\) is sought in one case in the \(X=1, Y=1\) cell
  • 2 off: data on \(M\) is sought in one \(X=0, Y=1\) case and one \(X=1, Y=0\) case
  • 2 pos: data on \(M\) is sought for two cases in the \(X=1, Y=1\) cell
  • 2 on: data on \(M\) is sought in one \(X=1, Y=1\) case and one \(X=0, Y=0\) case
  • fix \(X\): a strategy in which we seek \(M\) in two cases in which a causal condition was present, with \(X\) fixed at 1, one with \(Y=0\) and one with \(Y=1\)
  • fix \(Y\): a strategy in which we seek \(M\) in two cases in which a positive outcome was observed, with \(Y\) fixed at 1, one with \(X=0\) and one with \(X=1\)

These are all “pure” strategies in the sense that the number of units for which data on \(M\) is sought in each cell is fixed. One could also imagine random strategies in which a researcher chooses at random in which cells to look. For example, if we choose one point at random, we are randomly choosing between a case on the regression line and a case off the line. The performance of a random strategy will be a weighted average of the pure strategies over which the random strategy is randomizing.

For all simulations, we assume prior \(X,Y\) data of \(N=6\), with a weak positive relationship (2 \(X=1, Y=1\) cases, 2 \(X=0, Y=0\) cases, and 1 case in each off-diagonal cell). And it is from these original 6 cases that we are selecting our cases for process tracing. In the experiments below, we do not examine how the prior data itself might affect the choice of case-selection strategies (as we do, for instance, with clue-selection in Chapter 12), but we invite the reader to explore these relationships by adjusting the code we provide.
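
The prior data and the seven strategies can be encoded compactly, which makes it easy to check that each strategy is feasible given the 6 cases and to list the \(M\) patterns it could produce. This is an illustrative Python sketch; the dictionary layout and function names are hypothetical, and cases within a cell are assumed to be exchangeable, so a pattern records only how many cases in each cell show \(M=1\).

```python
from itertools import product

# Prior X,Y data described in the text: number of cases in each (X, Y) cell.
PRIOR_CELLS = {(0, 0): 2, (1, 1): 2, (0, 1): 1, (1, 0): 1}

# Each strategy: how many cases to process trace in each (X, Y) cell.
STRATEGIES = {
    "1 off": {(1, 0): 1},
    "1 on": {(1, 1): 1},
    "2 off": {(0, 1): 1, (1, 0): 1},
    "2 pos": {(1, 1): 2},
    "2 on": {(1, 1): 1, (0, 0): 1},
    "fix X": {(1, 0): 1, (1, 1): 1},
    "fix Y": {(0, 1): 1, (1, 1): 1},
}

def is_feasible(strategy, cells=PRIOR_CELLS):
    """A strategy is feasible if every cell holds enough cases."""
    return all(cells.get(cell, 0) >= k for cell, k in strategy.items())

def possible_m_patterns(strategy):
    """List possible data: the number of M=1 observations in each selected cell."""
    cells = sorted(strategy)
    ranges = [range(strategy[cell] + 1) for cell in cells]
    return [dict(zip(cells, counts)) for counts in product(*ranges)]

for name, strategy in STRATEGIES.items():
    print(name, is_feasible(strategy), len(possible_m_patterns(strategy)))
```

Changing `PRIOR_CELLS` is one simple way to start exploring how a different prior data pattern alters which strategies are available.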

13.3.4 Results

The main results are shown in Figure 13.1 and Figure 13.2. For each model-strategy-query combination, we figure out (a) all of the possible data realizations, (b) what inferences would be made on the query from each data realization, (c) how likely each data-realization is to arise given the strategy, the model, and the prior data, and (d) the resulting expected reduction in variance from each strategy. Result (d) is our measure of expected learning.

The two figures take two different approaches to representing the value of alternative strategies. In Figure 13.1, we examine the informativeness of strategies by showing how much our inferences depend on what we observe within the cases. Generally, a larger spread across points (for a given model-query-strategy combination) represents a greater opportunity for learning from the data. However, as expected learning is also a function of how likely each data realization is, we represent the probability of each potential inference via shading of the points. In Figure 13.2 we directly plot expected learning, operationalized as the expected reduction in posterior variance.

In the remainder of this section, we walk through the results and suggest, often tentatively, interpretations of some of the more striking patterns. We caution that reasoning one’s way through expected learning for different model-query-strategy combinations, given a particular pattern in the prior data, can be tricky—hence, our recommendation that researchers simulate their way to research-design guidance, rather than relying on intuition.

Figure 13.1: Inferences given observations

Figure 13.2: Reduction in variance on ATE and PC given strategies

\(N=1\) strategies, unconstrained models

Suppose that we can only conduct process tracing (observe \(M\)) for a single case drawn from our sample of 6 \(X,Y\) cases. Should we choose a case from on or off the regression line implied by the \(X,Y\) pattern? In Figure 13.1, we can see that for all unconstrained models, our inferences are completely unaffected by the observation of \(M\) in a single case, regardless of which case-selection strategy we choose and regardless of the query of interest. We see only 1 point plotted for the two \(N=1\) strategies for all unconstrained models and all queries because the inference is the same regardless of the realized value of \(M\). In Figure 13.2, we see, in the same vein, that we expect 0 reduction in expected posterior variance from these \(N=1\) strategies: they cannot make us any less uncertain about our estimates because the observations we glean cannot affect our beliefs.

To see why, let’s first consider the on-the-line strategy. Not having observed \(M\) previously, we still have flat priors over the nodal types governing \(X\)’s effect on \(M\) and \(M\)’s effect on \(Y\). That is to say, we still have no idea whether \(X\)’s positive effect on \(Y\) (where present) more commonly operates through a chain of positive effects or a chain of negative effects. Thus, the observation of, say, \(M=1\) in an \(X=1, Y=1\) case is equally consistent with a positive \(X \rightarrow Y\) effect (to the extent that effect operates via linked positive effects) and with no \(X \rightarrow Y\) effect (to the extent positive effects operate through linked negative effects). Observing \(M=1\) in an \(X=1, Y=1\) case therefore tells us nothing about the causal effect in that case and, thus, nothing about the average effect either.

Similarly, we have no idea whether \(X\)’s negative effect on \(Y\) (where present) operates through a positive-negative chain or a negative-positive chain, making \(M=1\) or \(M=0\) in an \(X=1, Y=0\) case both equally consistent with a negative or null \(X \rightarrow Y\) effect, yielding no information about causation in the case. By a similar logic, observing \(M=1\) in the \(X=1, Y=1\) case is uninformative about negative effects in an \(X=0, Y=1\) case, and observing \(M=1\) in an \(X=1, Y=0\) case tells us nothing about positive effects in an \(X=1, Y=1\) case.

The same logic applies to drawing inferences from \(M\) as a moderator or to learning from \(M\) about indirect effects. In the absence of prior information about effects, one case is not enough.

\(N=1\) strategies, monotonic models

The situation changes, however, when we operate with models with monotonicity restrictions. Now we can see that our inferences on the queries do generally depend on \(M\)’s realization in a single case and that we expect learning. For many model-query combinations, the two \(N=1\) strategies perform comparably, but there are situations in which we see substantial differences.
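
The contrast between unconstrained and monotonic models can be checked by brute-force enumeration for a single on-the-line case in a chain model. The Python sketch below is a simplification: it holds nodal-type shares fixed and equal across the permitted types rather than updating a distribution over shares, and its type labels and function name are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Nodal types for a binary node with one binary parent, written as
# (value when parent = 0, value when parent = 1).
NODAL = {"a": (1, 0), "b": (0, 1), "c": (0, 0), "d": (1, 1)}

def prob_positive_effect(m_obs, allowed="abcd"):
    """P(X=1 caused Y=1 | X=1, M=m_obs, Y=1) in a chain X -> M -> Y,
    assuming equal shares on every permitted nodal type."""
    consistent = 0
    with_effect = 0
    for type_m, type_y in product(allowed, repeat=2):
        m_if_x0, m_if_x1 = NODAL[type_m]
        y_if_x0 = NODAL[type_y][m_if_x0]
        y_if_x1 = NODAL[type_y][m_if_x1]
        # Keep type combinations that reproduce the observed X=1, M=m_obs, Y=1.
        if m_if_x1 == m_obs and y_if_x1 == 1:
            consistent += 1
            if y_if_x0 == 0:  # Y would have been 0 had X been 0
                with_effect += 1
    return Fraction(with_effect, consistent)

# Unconstrained model: the clue does not move the inference.
print(prob_positive_effect(1), prob_positive_effect(0))
# Monotonic model (negative-effect type "a" excluded): the clue matters.
print(prob_positive_effect(1, "bcd"), prob_positive_effect(0, "bcd"))
```

In the unconstrained case both realizations of \(M\) leave the probability at one quarter, while under monotonicity observing \(M=0\) rules out a positive effect in the case.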

Most notably, in a chain model with negative effects ruled out by assumption, we learn almost nothing from choosing an off-the-line case: this is because we already know from the model itself that there can be no \(X \rightarrow Y\) effect in such a case since such an effect would require a negative effect at one stage. The only learning that can occur in such a case is about the prevalence of positive effects (relative to null effects) at individual stages (\(X \rightarrow M\) and \(M \rightarrow Y\)), which in turn has implications for the prevalence of positive effects (relative to null effects) of \(X\) on \(Y\). Likely for similar reasons, in the monotonic two-path model, an on-the-line case is much more informative than an off-the-line case about the \(ATE\) and about the probability that the effect runs via the indirect path.

Interestingly, however, the on-the-line strategy is not uniformly superior for an \(N=1\) process-tracing design. We appear to learn significantly more from an off-the-line case than an on-the-line case when estimating the share of positive effects in the population of \(X=1, Y=1\) cases and operating with a monotonic confounded or two-path model. At first, these results seem surprising: why would we not want to choose an \(X=1, Y=1\) case for learning about the population of \(X=1, Y=1\) cases? One possible reason is that, in the on-the-line case, one data realization is much more likely than the other, while we are more uncertain about what we will find in the off-the-line case. For instance, in the confounding model with monotonicity, in an \(X=1, Y=1\) case we would learn about the prevalence of confounding from seeing \(M=0\) (where confounding cannot be operating since negative effects are excluded) as opposed to \(M=1\); but we do not expect to see \(M=0\) when both of its children (\(X\) and \(Y\)) take a value of 1 while negative effects are excluded. In an \(X=1, Y=0\) case, however, \(M=0\) and \(M=1\) are about equally likely to be observed, and we can learn about confounding from each realization. We can see these differences in relative data probabilities from the shadings in the graphs, where we have more even shading for the possible inferences from the one-off strategy than for the one-on strategy.
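
The role of data probabilities can be made precise for a single binary clue using the law of total variance. The symbols here are introduced for this sketch only: \(q\) is the query, \(w\) is the probability of observing \(M=1\) in the selected case, and \(\mu_1\) and \(\mu_0\) are the posterior means of \(q\) after observing \(M=1\) and \(M=0\) respectively.

\[
\mathrm{Var}(q) - E_M\left[\mathrm{Var}(q \mid M)\right] = \mathrm{Var}_M\left(E[q \mid M]\right) = w(1-w)\left(\mu_1 - \mu_0\right)^2
\]

The left-hand side is the expected reduction in variance. For a given gap between the two possible inferences, it is largest when \(w = 1/2\) and shrinks toward zero as one realization becomes nearly certain, however far apart the two inferences are.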

The general point here is that we expect to learn more from seeking a clue the more uncertain we are about what we will find, and some case-selection strategies will give us better opportunities to resolve uncertainty than do others.

\(N=2\) strategies, unconstrained models

Next, we consider the process tracing of two of our 6 cases. Now, because we are observing \(M\) in two cases, we can learn from the variation in \(M\) across these cases or, more specifically, from its covariation with \(X\) and with \(Y\). This should matter especially for unconstrained models, where we start out with no information about intermediate causal effects (e.g., whether they are more often positive or more often negative). Thus, when we only process trace one case, we cannot learn about causal effects in the cases we process-trace since we don’t know how to interpret the clue. In contrast, if we observe \(M\) in two or more cases, we do learn about causal effects for those cases because of the leverage provided by observing covariation between the process-tracing clue and other variables.

We assess the expected gains from 5 \(N=2\) strategies: examine two off-the-line cases, one \(X=1, Y=0\) case and one \(X=0, Y=1\) case; examine two on-the-line cases, an \(X=Y=0\) case and an \(X=Y=1\) case; examine two treated, positive outcome (\(X=Y=1\)) cases; select on \(X\) by examining two \(X=1\) cases with different \(Y\) values; and select on \(Y\) by examining two \(Y=1\) cases with different \(X\) values.

A key message of these results is that, with 2 cases, the performance of each strategy depends quite heavily on the model we start with and what we want to learn. For instance, when estimating the \(ATE\), the on-the-line strategy in which we disperse the cases across cells (two-on) clearly outperforms both the dispersed off-the-line strategy (two-off) and an on-the-line strategy in which we concentrate on one cell (two-pos) if we are working with an unconstrained chain model, and the off-the-line strategy is clearly the worst-performing of the three. The differences in learning about the \(ATE\) are more muted, however, for an unconstrained confounded model, and the two-pos strategy does better than the other two for a two-path model.

If we seek to learn about the probability of positive causation in an \(X=1, Y=1\) case, then there is little difference between two-off and two-pos, with two-on performing best. We also see that two-pos has lost its edge in an unconstrained two-path model, with no strategy offering leverage. When estimating the probability of a negative effect for an \(X=0, Y=1\) case, we see that the two-off strategy performs best for the chain model, but that the two-pos strategy offers the greatest leverage in a two-path model. Finally, when estimating the probability of an indirect positive effect in an unconstrained two-path model, we get the most from a two-on strategy, though the two-off strategy does moderately well.

In general, selecting conditional on a fixed value of \(X\) or \(Y\) (while dispersing on the other variable) does not do particularly well in unconstrained models, and it does not usually matter much which variable we fix on. There are exceptions, however. Perhaps most strikingly, in a two-path unconstrained model, we do relatively well in estimating the probability of an indirect positive effect when we fix \(Y\) but stand to learn nothing if we fix \(X\). Interestingly, fixing \(Y\) fairly well dominates fixing \(X\) across all model-query combinations shown, given the prior data pattern we are working with.

This pattern is particularly interesting in light of canonical advice in the qualitative methods literature. King, Keohane, and Verba (1994) advise selecting for variation on the explanatory variable and, as a second-best approach, on the dependent variable. And they warn sternly against selection for variation on both at the same time. But note what happens if we follow their advice. Suppose we start with an unconstrained chain model, hoping to learn about the \(ATE\) or probability of positive causation, and decide to select for variation on \(X\), ignoring \(Y\). We might get lucky and end up with a pair of highly informative on-the-line cases. But, depending on the joint \(X,Y\) distribution in the population, we might just as easily end up with a fairly uninformative off-the-line case or \(X=0, Y=1\), \(X=1, Y=1\) pair. We do better if we intentionally select on both \(X\) and \(Y\) in this setup. This is equally true if we want to learn about the probability of negative effects in this model, in which case we want to choose an off-the-line case, or if we want to learn about positive indirect effects in a two-path model, where we want both \(X\) and \(Y\) to be 1. King, Keohane, and Verba’s advice makes sense if all we are interested in is examining covariation between \(X\) and \(Y\): then we can learn from forcing \(X\) to vary and letting \(Y\)’s values fall where they may. However, seeking leverage from observation of a third variable is a different affair. As our simulations indicate, the strategy from which we stand to learn the most will depend on how we think the world works and what we want to learn.

\(N=2\) strategies, monotonic models

Generally speaking, we get more leverage across strategies, models, and queries if we are willing to rule out negative effects by assumption. The most dramatic illustration of this is in a comparison of the unconstrained to monotonic moderator and two-path models, where we face bleak prospects of learning about the \(ATE\) and probability of positive effects in an unconstrained model, regardless of strategy. Imposing monotonicity assumptions on these two models makes for relatively similar \(ATE-\)learning opportunities across \(N=2\) strategies while boosting the relative performance of two-on (best) and two-off (second-best) strategies for learning about the probability of positive causation.

Relative performance also flips in some places. For instance, whereas two-pos gives us the most leverage for estimating the \(ATE\) in an unconstrained two-path model, the two-on strategy is optimal once we impose monotonicity. And two-pos leapfrogs two-off for estimating positive indirect effects when we go from an unconstrained to a monotonic two-path model. The opposite seems to be true for estimating the \(ATE\) or the probability of positive causation in a confounded model, where two-off does relatively better when we introduce monotonicity restrictions.


Dunning, T. 2012. Natural Experiments in the Social Sciences: A Design-Based Approach. Strategies for Social Inquiry. Cambridge University Press.
Fairfield, Tasha, and Andrew E. Charman. forthcoming. Social Inquiry and Bayesian Inference: Rethinking Qualitative Research. Cambridge University Press.
Fearon, James, and David Laitin. 2008. “Integrating Qualitative and Quantitative Methods.” In Oxford Handbook of Political Methodology, edited by Janet M. Box-Steffensmeier, David Collier, and Henry E Brady, 756–76. Oxford, UK: Oxford University Press.
Herron, Michael C, and Kevin M Quinn. 2016. “A Careful Look at Modern Case Selection Methods.” Sociological Methods & Research 45 (3): 458–92.
King, Gary, Robert Keohane, and Sidney Verba. 1994. Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton University Press.
Lieberman, Evan S. 2005. “Nested Analysis as a Mixed-Method Strategy for Comparative Research.” American Political Science Review 99 (July): 435–52.
Lindley, Dennis V. 1956. “On a Measure of the Information Provided by an Experiment.” The Annals of Mathematical Statistics, 986–1005.
Seawright, Jason, and John Gerring. 2008. “Case Selection Techniques in Case Study Research: A Menu of Qualitative and Quantitative Options.” Political Research Quarterly 61 (2): 294–308.
Weller, Nicholas, and Jeb Barnes. 2014. Finding Pathways: Mixed-Method Research for Studying Causal Mechanisms. New York: Cambridge University Press.

  1. Herron and Quinn (2016) have a parameter \(\theta\) that governs the distribution of data over \(X\) and \(Y\) and then, conditional on \(X,Y\) values, a set of parameters \(\psi_{xy}\) that describe the probability of a case’s being of a given causal type. We take both \(\theta\) and \(\psi_{xy}\) to derive from the fundamental distribution of causal types and assignment probabilities. Thus, for example, \(\psi_{00}\) from Herron and Quinn (2016) corresponds to \(\frac{(1-\pi_b)\lambda_b}{(1-\pi_b)\lambda_b + (1-\pi_c)\lambda_c}\) in our notation. The difference in parameterization does have implications for interpretations of the priors. For example, flat priors over \(\theta\) and \(\psi\) imply a tighter distribution than a uniform prior over the causal types. In fact Herron and Quinn (2016) use priors with greater variance than uniform in any event.↩︎

  2. These are calculated using the CQtools package.↩︎

  3. Lindley (1956) (eq 7) defines the average gain from an experiment as \(E_x[I_1(x) - I_0]\), where \(x\) is the data that might be observed, given the design, \(I_1(x) = \int p(\theta|x)\log(p(\theta|x))d\theta\), and \(I_0 = \int p(\theta)\log(p(\theta))d\theta\).↩︎

  4. We generate the results displayed here using CausalQueries together with CQtools.↩︎

  5. In practice we do this using the CQtools package via simulation.↩︎

  6. In code, this somewhat complicated query is expressed as "Y[X=0, M=M[X=1]]==Y[X=1, M=M[X=1]]", given "(X == 1 & Y == 1) & (Y[X=1]>Y[X=0])".↩︎