Chapter 14 Going wide, going deep

We turn to the problem of choosing between going “wide” and going “deep.” We illustrate gains from going deep for problems with large \(N\) but self-selection into treatment. Simulations suggest that going deep is especially valuable for observational research, situations with homogeneous treatment effects, and, of course, when clues have strong probative value.


We continue exploring how we can leverage causal models in making research-design choices by thinking about the tradeoff between intensive (deep) and extensive (wide) analysis.

Suppose that we have identified those clues that will be most informative and those cases in which it would be most valuable to conduct process tracing, given our beliefs about the world. A further question that we face is the quintessential dilemma of mixing methods: what mixture of quantitative and qualitative evidence is optimal? We argued in Chapter 9 that the distinction between quantitative and qualitative inference is, in a causal-model framework, without much of a difference. But here we frame a more precise question: given finite resources, how should we trade off between studying a larger number of cases at a given level of intensiveness, on the one hand, and, on the other, drilling down to learn more intensively about some subset of the cases in our sample? How should we decide between going “wide” and going “deep”?

Just as with the selection of clues and cases, examined in Chapters 12 and 13, how much we should expect to learn from going wide versus going deep will depend on how we think the world works, as expressed in the causal model with which we start and as shaped by the data that we have seen at the point of making the wide-versus-deep decision.

We examine here queries commonly associated with large-\(N\), quantitative strategies of analysis (such as average treatment effects) as well as queries commonly associated with more case-oriented, qualitative approaches (queries about causal pathways and about causal effects at the case level). The analysis in this chapter makes clear the opportunities for integration across these lines of inquiry. We show that investing in deep process tracing will sometimes make sense even when one aims to learn about average effects in a population. Likewise, collecting \(X, Y\) data can sometimes help us draw inferences that will aid in case-level explanation. Particular kinds of case-level information can teach us about populations, and understanding population-level patterns can help us get individual cases right.

14.1 Walk-through of a simple comparison

To build up our intuitions about how the optimal mix of strategies might depend on how the world works, let us explore a simple comparison of wide and deep strategies.

Imagine a world in which we have a large amount of data on \(X\) and \(Y\) (2000 observations), and we see that \(X\) and \(Y\) are perfectly correlated. We might be tempted to infer that \(X\) causes \(Y\). And, if \(X\) were randomly assigned, we might be able to justify that inference. Suppose, however, that our data is observational and, in particular, that we are aware of an observable confounder, \(M\), that might determine both \(X\) and \(Y\). In that situation, the effect of \(X\) on \(Y\) is not identified. As shown by Manski (1995), this data pattern could be produced even if \(X\) had no effect at all: it would arise if all those cases that were destined to have \(Y=1\) regardless were assigned to \(X=1\), while all those that would have had \(Y=0\) regardless were assigned to \(X=0\). Indeed, different priors could support beliefs about the effect lying anywhere between 0 and 1.
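To see the range, write \(p\) for the share of cases with \(X=1\) and \(Y(1), Y(0)\) for the potential outcomes under treatment and control (notation we introduce only for this calculation). The perfect correlation fixes \(\Pr(Y=1 \mid X=1)=1\) and \(\Pr(Y=1 \mid X=0)=0\) but places no restrictions on the unobserved counterfactuals:

\[
\Pr(Y(1)=1) = 1 \cdot p + \Pr(Y(1)=1 \mid X=0)(1-p) \in [p, 1],
\]
\[
\Pr(Y(0)=1) = \Pr(Y(0)=1 \mid X=1) \cdot p + 0 \cdot (1-p) \in [0, p],
\]

so, absent further assumptions, the average effect \(\Pr(Y(1)=1) - \Pr(Y(0)=1)\) can lie anywhere between \(p - p = 0\) and \(1 - 0 = 1\).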

From Pearl’s (2009) backdoor criterion, however, we also know that if the right causal model is \(X \rightarrow Y \leftarrow M \rightarrow X\), then data on \(M\) would allow the effect of \(X\) on \(Y\) to be identified. We could estimate the effect of \(X\) on \(Y\) separately for \(M=0\) and for \(M=1\) and take the average, weighting by the share of cases with each value of \(M\). Let’s imagine that we think this structural model is plausible—substantively, we think we can gather data on how units are selected into treatment.
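Concretely, with \(M\) blocking the backdoor path, identification works through a weighted average of the within-stratum differences:

\[
ATE = \sum_{m \in \{0,1\}} \Pr(M=m)\left[\Pr(Y=1 \mid X=1, M=m) - \Pr(Y=1 \mid X=0, M=m)\right].
\]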

Suppose now that we aim to collect additional data, but that data on \(M\) for a single unit is far more costly than data on \(X\) and \(Y\) for a single unit. We thus face a choice between gathering a lot more data on \(X\) and \(Y\) (say, for 2000 more cases) or gathering a little data on \(M\) for a subset of cases—just 20 in this illustration. Which should we do? Are 20 cases enough to probe the causal model and see whether the correlation between \(X\) and \(Y\) is spurious?

We get an intuition for the answer by imagining the inferences we might draw in three extreme cases and comparing these to the base case. Figure 14.1 illustrates. The figures are generated by forming a model with \(X\rightarrow Y \leftarrow M \rightarrow X\), a strong prior that \(\Pr(M=1)=0.5\), and flat priors over all other nodal types. In other words, in our priors we think that \(M\) is equally likely to be a 0 or a 1 but we do not make assumptions about how it is related to \(X\) and \(Y\). We first update the model with the \(X,Y\) data—and then choose between going wider and going deeper.
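For readers who want to see how such an analysis can be set up, a minimal sketch using the CausalQueries R package that accompanies this book follows. It is not the code used to generate Figure 14.1: the data frame below is an illustrative stand-in for the data patterns described in the text, we leave priors at their defaults rather than tightening the prior on \(\Pr(M=1)\), and argument names (for example, queries) may vary slightly across package versions.

```r
library(CausalQueries)

# Confounded model: M is a common cause of both X and Y
model <- make_model("X -> Y <- M -> X")

# Base data: 2000 cases with X and Y perfectly correlated; M not (yet) observed
base_data <- data.frame(
  X = rep(0:1, each = 1000),
  Y = rep(0:1, each = 1000),
  M = NA_real_
)

# "Going deep": observe M in a random subset of 20 of the original cases
# (filled in here with an arbitrary placeholder realization of M)
deep_data <- base_data
traced <- sample(nrow(deep_data), 20)
deep_data$M[traced] <- rbinom(20, 1, 0.5)

# Update on a given data pattern and query the average treatment effect
query_model(
  update_model(model, deep_data),
  queries = "Y[X=1] - Y[X=0]",
  using   = "posteriors"
)
```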


Figure 14.1: Posteriors on the ATE given different wide or deep data patterns.

Panel 1 in Figure 14.1 shows our posterior distribution over the average causal effect from observation of the base data: 2000 cases with \(X\) and \(Y\) perfectly correlated. The distribution is quite wide, despite the strong correlation, because the posterior includes our uncertainty over the nature of confounding. Our estimate for the \(ATE\) is 0.86 but with a posterior standard deviation of 0.1519. There is positive weight on all positive values of the \(ATE\).

How can we improve on this estimate?

One possibility would be to go wide and collect \(X,Y\) data on an additional 2000 cases. Panel 2 displays our posterior on the average causal effect with the addition of these 2000 cases. We assume that the new data also display a perfect \(X,Y\) correlation, like the first set of data. Again, we could not imagine correlational data that more strongly confirms a positive relation, and now we have twice as much of it. What we see, however, is that investing in gathering data on 2000 additional cases does not help us very much. The mean of our posterior on the \(ATE\) is now 0.88, with a standard deviation of 0.1497. So the updating is very slight.

Suppose that, for the cost of gathering \(X,Y\) data on an additional 2000 cases, we could drill down on a random subset of 20 of the original 2000 cases and observe \(M\) in those cases. What might we learn?

Because we start out with a flat prior on how \(M\) will relate to \(X\) and \(Y\), we display inferences for two possible realizations of that pattern. In Panel 3, we show the updating if \(M\) turns out to be uncorrelated with both \(X\) and \(Y\). The mean of our posterior on the \(ATE\) now rises to 0.98, and the posterior standard deviation shrinks dramatically, to 0.004. Greater depth in a relatively small number of cases is enough to convince us that the \(X,Y\) relationship is not spurious.

Panel 4 shows inferences from the same “going deep” strategy but where \(M\) turns out to be perfectly correlated with \(X\) and \(Y\). Now our estimate for the \(ATE\) shifts downward to 0.79, with a posterior standard deviation of 0.1632.

In other words, in this setup, what we observe from our “going deep” strategy can have a big impact on our inferences. One reason we stand to learn so much from process-tracing so few cases is that the process-tracing speaks to relationships about which we start out knowing so little: \(M\)’s effect on \(X\) and \(M\)’s effect on \(Y\), effects on which the \(X,Y\) data themselves shed no light.

It is also interesting to note that we cannot learn as much by updating using only the information from the 20 cases for which we have full \(X\), \(M\), \(Y\) data. Were we to use only the subset with this complete data—ignoring the other 1980 cases—and observe \(M\) to be uncorrelated with \(X\) and \(Y\), the mean of our posterior on the \(ATE\) would be 0.26 with a posterior standard deviation of 0.1218 (not graphed). The breadth provided by those 1980 \(X,Y\)-only cases thus adds a great deal. While observing an uncorrelated \(M\) in 20 cases allows us to largely rule out \(M\) as a cause of any \(X,Y\) correlation, observing a strong \(X,Y\) correlation over a large number of cases provides evidence that \(X\) in fact affects \(Y\).

We use this example to illustrate a simple but stark point: there will be situations in which the expected gains from collecting more data on the same cases and from collecting the same data on more cases will be different, sometimes very different. The model and the prior data shape the tradeoff. In this particular setup, it is the confounding, together with the large number of prior \(X,Y\) observations, that makes depth the better strategy. Once we have learned from 2000 \(X,Y\) observations, data of the same form from additional cases does little to shift our beliefs. Yet going deep—even if only in a few cases—provides information on parameters we previously knew nothing about, helping us interpret the \(X,Y\) correlation in causal terms.

14.2 Simulation results

While the results in the last section are striking, they depend upon particular realizations of the data under each strategy. When selecting strategies we, of course, do not know how the data will turn out. Our problem becomes, as in the case-selection analyses, one of figuring out the expected posterior variance from different strategies.

14.2.1 Approach

The more general, simulation-based approach that we introduce here is parallel to the approach for case-selection. The steps of this procedure are as follows (a schematic implementation sketch follows the list):

  1. Model. We posit a causal model, along with any priors or restrictions.
  2. Prior data. We specify the data that we already have in hand. For the simulations below, we assume no prior data.
  3. Strategies. We then specify a set of mixing strategies to assess. A strategy, in this context, is defined as any combination of collecting data on the same nodes for a given number of additional cases (randomly drawn from the population) and collecting data on new nodes for a random sample of the original set of cases.
  4. Data possibilities. For each strategy, we define the set of possible data realizations. Whereas for case-selection the structure of the possible data realizations is the same for all strategies of a given \(N\), possible data patterns in wide-versus-deep analyses involve much greater complexity and vary in structure across strategies. This is because the number of cases itself varies across strategies. Also, whereas we fix the \(X,Y\) pattern for the purposes of case-selection, here we allow the \(X,Y\) patterns we discover to vary across each simulation draw.
  5. Data probabilities. As for case-selection, we use the model and prior data to calculate the probability of each data possibility under each strategy.
  6. Inference. Again, as for case-selection, we update the model using each possible data pattern to derive a posterior distribution.
  7. Expected posterior variance. We then average the posterior variances across the possible data patterns under a given strategy, weighting each by the probability of that data pattern.
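To make steps 3 through 7 concrete, here is a schematic sketch of the expected-posterior-variance calculation for a single wide-deep strategy, again using CausalQueries functions. The wrapper function, its name, and choices such as n_sims = 100 and param_type = "prior_draw" are our own illustrative assumptions rather than the procedure used to produce the figures below, and it presumes a model whose clue node is labeled \(M\).

```r
library(CausalQueries)

# Expected posterior variance for one strategy: observe X and Y in n_wide
# cases and additionally observe the clue M in n_deep of them.
expected_posterior_variance <- function(model, query,
                                        n_wide, n_deep, n_sims = 100) {
  variances <- replicate(n_sims, {
    # Step 4: draw one full data realization from the model (prior predictive)
    full_data <- make_data(model, n = n_wide, param_type = "prior_draw")

    # Step 3: apply the strategy -- M is observed only in a random subset
    observed <- full_data
    observed$M[sample(n_wide, n_wide - n_deep)] <- NA

    # Step 6: update on the observed data and record the posterior variance
    results <- query_model(update_model(model, observed),
                           queries = query, using = "posteriors")
    results$sd^2
  })

  # Steps 5 and 7: each simulated data set arises with its model-implied
  # probability, so a simple average over draws estimates the
  # probability-weighted expected posterior variance
  mean(variances)
}

# Example call (chain model, ATE query, 100 wide cases, 50 process traced):
# expected_posterior_variance(make_model("X -> M -> Y"),
#                             "Y[X=1] - Y[X=0]", n_wide = 100, n_deep = 50)
```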

14.2.2 Simulation results

We now explore alternative mixes of going wide and going deep for a range of models and queries, the same set that we examined for case-selection in Chapter 13. We present the results in compact form in Figure 14.2. The structure of the panels is the same as in the Chapter 13 figure, with models crossed with queries, though here we plot the expected posterior variance rather than the reduction in variance. Within each panel, each line represents going wide to a differing degree: collecting \(X,Y\) data for \(N=100\), for \(N=400\), and for \(N=1600\). As we move rightward along each line, we add depth: we show results for a strategy with no process tracing, for process-tracing 50 of the \(X,Y\) cases, and for process-tracing 100 of the \(X,Y\) cases. On the \(y\)-axis of each graph, we plot the expected posterior variance from each wide-deep combination (see note 1).

Looking at the figure as a whole, one pattern that leaps out is that there are, for almost all model-query combinations, gains from going wider. We can also see, unsurprisingly, that the marginal gains from going wide are diminishing. There are, however, some cases where going wider appears to add little or nothing. One of these is where we want to estimate the probability that a positive effect runs through the indirect path. For this query we learn from going deep in more cases when we use a two-path model, but we gain little from observing \(X\) and \(Y\) for a larger set of cases (see the two rightmost boxes in the bottom row).

Focusing just on the gains from depth, we see that these are much more concentrated in specific model-query combinations. Going deep to learn about the \(ATE\) or the probability of positive or negative causation appears at best marginally advantageous—at least up to \(N=100\)—for the unconstrained confounded and moderator models and for both the unconstrained and monotonic two-path models. Notably, we do learn from going deep in both the unconstrained and monotonic chain models, as well as in the monotonic confounded model (with the exception, of course, of the query about the probability that \(X=0\) caused \(Y=1\)). And we learn from going deep for the query about whether the effect of \(X\) runs through \(M\) in the two-path model. As we saw with going wide, the marginal gains from going deep are diminishing.

Part of what is likely going on here is that unconstrained models are ones in which we start out with no information about \(M\)’s probative value. We can learn about \(M\)’s effects from observing \(M\) (as discussed in Chapters 9 and 10); this is why we can learn from process tracing in even the unconstrained version of the chain model. However, that learning gets much harder in unconstrained models with confounding or in models with two paths. In a confounded model, for instance, suppose we see \(M\) positively correlated with \(X\) and negatively correlated with \(Y\), with \(X\) and \(Y\) themselves negatively correlated. This pattern is consistent with \(M\) having a positive effect on \(X\) and a negative effect on \(Y\), and thus a high degree of confounding. But it is equally consistent with \(M\) having only a positive effect on \(X\) and \(X\) having a negative effect on \(Y\), with no \(M \rightarrow Y\) effect and thus no confounding. A monotonicity assumption makes \(M\) a priori informative: for instance, when we see \(X=1, Y=1\), confounding is only a possibility if we also see \(M=1\); confounding is ruled out if we observe \(M=0\).

In a two-path model, observing \(M\) correlated with \(X\) and with \(Y\) across a set of cases should allow us to learn about \(M\)’s informativeness about the operation of an indirect effect, just as we can learn about \(M\)’s probative value in a chain model. The problem is that knowing about the indirect effect in a two-path model contributes only marginally to the first three queries in the figure, since these are about total effects. Thus, even adding monotonicity restrictions, which make \(M\) a priori informative, does not significantly improve learning from \(M\) about total effects.

The important exception, when it comes to learning about the two-path model, is when it is the pathway itself that we seek to learn about. As we can see in the figure, we can learn a great deal about whether effects operate via \(M\) by observing \(M\), even in an unconstrained two-path model. Interestingly, the gains from depth for causal-effect queries in an unconstrained chain model closely resemble the gains from depth for the indirect-effect query in the unconstrained two-path model. This similarity suggests that both kinds of model-query combinations allow for learning about \(M\) that, in turn, permits learning from \(M\).

We also see that the context in which depth delivers the steepest gains of all is when we seek to learn about the probability of an indirect effect in a monotonic two-path model. Part of the reason is likely that \(M\) is a priori informative about the operation of the mediated pathway (as it is about the operation of effects in the monotonic chain model). Additionally, however, it appears that we start out with relatively high uncertainty about the pathway query because the model itself is quite uninformative about it. Thus, for instance, we learn much more from depth here than we do for a total-effect query in a monotonic chain model: the monotonicity assumptions themselves already tell us a great deal about total effects, whereas they imply nothing about the path through which effects unfold. There is simply more to be learned from \(M\) about pathways than about total effects.

Most interesting, perhaps, is using the graphs to examine different wide-versus-deep tradeoffs we might face. Suppose, for instance, that we wish to learn about the probability of negative causation in an unconstrained chain model for \(X=0, Y=1\) cases. We start out with \(X,Y\) data for 100 cases and have additional resources with which to collect more data. Let us further assume that the cost of collecting \(X,Y\) data for an additional 300 cases is equal to the cost of collecting \(M\) on 50 of the original 100 cases. Where should we invest?

We can read a fairly clear answer off the relevant panel of the graph: we can expect to do better by adding depth than by adding breadth. In fact, even expanding our \(X,Y\) sample to 1600 cases gets us only about as much leverage as we get from process-tracing 50 cases.

We can also see how the optimal choice depends on what data collection we have already committed to. Consider the probability of positive causation query for a monotonic confounded model. If we start with \(X,Y\) data for 100 cases, we expect to gain more from going deep in 50 of these cases than from going wide to an additional 300. However, once we have decided to process-trace 50 cases, if we have additional resources, we then expect to do much better by investing in an additional 300 \(X,Y\) cases than by process-tracing the remaining 50 \(X,Y\) cases.

A further question we can ask is: when is mixing methods advantageous, and when are maximally wide or maximally deep strategies best? We can examine this question by comparing a strategy with maximal breadth and no process tracing; a strategy with maximal process tracing and minimal breadth; and a strategy in which we invest in a mixture of new data, examining \(X,Y\) in 400 cases while process-tracing 50 of them.

We see some places where we are best off going as wide as possible, at least for the ranges we explore in these simulations. For instance, if we wish to estimate the \(ATE\) in a chain model (unconstrained or monotonic), a pure “going wide” strategy is optimal. At the other extreme, when we seek to learn about the probability of an indirect effect from an unconstrained two-path model, we are best off process-tracing our original 100 cases and gain nothing by expanding our sample.

In many contexts, however, it is the mixed strategy that performs best. The advantage of mixing appears starkest for the confounded monotonic model: examining \(X\) and \(Y\) for 400 cases and process-tracing 50 of these delivers much greater gains, for estimating the \(ATE\) and the probability of positive causation, than either expanding the sample to 1600 \(X,Y\) cases or going deeper into the original 100 cases.


Figure 14.2: Expected posterior variance over multiple models with multiple data strategies.

14.3 Factoring in the cost of data

We can also use these results to think through optimal allocations of resources with varying prices of breadth and depth. To illustrate, consider the unconstrained chain model, where we see similar expected posterior variance for the probability of positive causation query from the following three combinations of wide and deep (first column, third row in Figure 14.2):

  1. Maximally wide: 1600 wide + 0 deep
  2. Maximally deep: 100 wide + 100 deep
  3. Mixed: 400 wide + 50 deep

However, which strategy is optimal will depend on the relative cost of collecting \(X,Y\) data for a new case (which we normalize to a cost of 1) and collecting \(M\) for an existing case (at cost \(d\) per observation).

Then, for this model-query combination, the widest strategy is better than the deepest strategy if and only if \(1600 < 100 + 100d\), that is, when \(d > 15\). The mixed strategy is better than the maximally deep strategy if and only if \(400 + 50d < 100 + 100d\), that is, when \(d > 6\). And the maximally wide strategy is better than the mixed strategy if and only if \(1600 < 400 + 50d\), or \(d > 24\). Thus, roughly speaking: if \(d < 6\), our ordering is deepest > mixed > widest; if \(6 < d < 15\), our ordering is mixed > deepest > widest; if \(15 < d < 24\), our ordering is mixed > widest > deepest; and if \(d > 24\), our ordering is widest > mixed > deepest. We can thus see that the mixed strategy is optimal across a broad range of \(d\), though for sufficiently cheap or sufficiently expensive within-case data gathering it may be optimal to go purely wide or purely deep.
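Because the three strategies deliver roughly the same expected posterior variance for this query, the comparison reduces to cost alone, and the orderings above can be checked with a few lines of code. The short script below is just an illustration of the arithmetic; the function name strategy_costs is ours.

```r
# Cost of each strategy: an X,Y observation costs 1; observing M costs d
strategy_costs <- function(d) {
  c(widest  = 1600,            # 1600 wide + 0 deep
    deepest = 100 + 100 * d,   # 100 wide + 100 deep
    mixed   = 400 + 50 * d)    # 400 wide + 50 deep
}

# Preference ordering (cheapest first) in each region of d
for (d in c(3, 10, 20, 30)) {
  cat("d =", d, ":", names(sort(strategy_costs(d))), "\n")
}
#> d = 3 : deepest mixed widest
#> d = 10 : mixed deepest widest
#> d = 20 : mixed widest deepest
#> d = 30 : widest mixed deepest
```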

References

Manski, Charles F. 1995. Identification Problems in the Social Sciences. Harvard University Press.
Pearl, Judea. 2009. Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press.

  1. Note that the expected posterior variance is always \(0\) for queries that are already answered with certainty by the model itself, such as the probability of a negative effect in a model with negative effects excluded.