The Use of Clues: On Moderators and Mediators

Macartan Humphreys

Plan

Roadmap

  • Fixing ideas: the possibility of empirically grounded process tracing
  • Some bad news: limited gains from details (\(\downarrow\) drawing on published work with Philip Dawid, Monica Musio)
  • Some good news: moderators can provide foundations for qualitative tests
  • Some bad news: nothing from nothing (\(\downarrow\) drawing on forthcoming work with Alan Jacobs)
  • Some good news: clues in many places
  • Some OK news: applications

Roadmap

  • Focus here on causes of effects (“explanation”)
  • Mechanisms as clues for explanation

Same structures hold for mechanistic queries and non-mechanistic clues

Fixing ideas

  • Say you want to learn about the cause of an effect from a chain

The estimand

The estimand is:

  • Is \(Y\) due to \(X\)? or
  • Would \(Y\) have been different if \(X\) had been different in this case?

We often seek the probability of causation (PC). Using potential outcomes notation: \(\Pr(Y(0)=0 | Y(1)=1)\).

Arguably this is an answer not an estimand. Still it’s our focus: we seek the answer that is defensible given the data and discuss the identifiability of this quantity.
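
Equivalently, for a randomly assigned case with \(X=1\) and \(Y=1\), the quantity is a ratio of an unidentified joint probability to an identified marginal (the standard potential outcomes formulation):

\[
\Pr(Y(0)=0 | Y(1)=1) = \frac{\Pr(Y(1)=1, Y(0)=0)}{\Pr(Y(1)=1)},
\]

where the numerator depends on the joint distribution of \((Y(0), Y(1))\), which experimental data alone do not pin down.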

Intuition 1

Say that we have lots of data from a randomized experiment and we know that the average effect of \(X\) on \(Y\) is 1/3. In particular we have infinite data supporting the following conditional distribution of \(Y\) given (randomly assigned) \(X\):

Y = 0 Y = 1
\(X=0\) 2/3 1/3
\(X=1\) 1/3 2/3

What is the probability that \(X\) caused \(Y\) for a case from this population? (an “exchangeable” case)

Intuition 1

Y = 0 Y = 1
\(X=0\) 2/3 1/3
\(X=1\) 1/3 2/3

From these data alone, either of the following distributions of potential outcomes is possible:

  • Positive effects for 2/3 of units; negative for 1/3
  • 0 effect for 2/3 of units, positive for 1/3

PC is 1 in the former case and 0.5 in the latter, so the bounds are [0.5, 1]
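
These bounds are an instance of the general large-sample formula for the probability of causation under random assignment (stated here for reference):

\[
\max\left\{0, \frac{\Pr(Y=1|X=1)-\Pr(Y=1|X=0)}{\Pr(Y=1|X=1)}\right\}
\;\le\; \text{PC} \;\le\;
\min\left\{1, \frac{\Pr(Y=0|X=0)}{\Pr(Y=1|X=1)}\right\}
\]

Here the lower bound is \(\frac{2/3 - 1/3}{2/3} = 0.5\) and the upper bound is \(\frac{2/3}{2/3} = 1\).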

Intuition 1

Sometimes distributions allow for tighter inferences:

Y = 0 Y = 1
\(X=0\) 1 0
\(X=1\) 0.5 0.5

Here \(X=1\) is necessary for \(Y=1\). From this we know that if \(X=1, Y=1\) then \(X=1\) caused \(Y=1\)

But for an \(X=0, Y=0\) case we do not know whether \(X=0\) caused \(Y=0\)

Intuition 1

Sometimes distributions allow for tighter inferences:

Y = 0 Y = 1
\(X=0\) 0.5 0.5
\(X=1\) 0 1

Here \(X=1\) is sufficient for \(Y=1\). From this we know that if \(X=0, Y=0\) then \(X=0\) caused \(Y=0\)

But for an \(X=1, Y=1\) case we do not know whether \(X=1\) caused \(Y=1\). In fact PC \(= 0.5\).
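
As a check on these three examples, here is a minimal Python sketch (assuming random assignment and infinite data; all function and variable names are illustrative) that computes the bounds for the treated, positive-outcome query and for the untreated, negative-outcome query:

```python
# Bounds on the probability of causation from a conditional distribution of Y
# given X, assuming X is randomly assigned and the distribution is known exactly.

def pc_treated(p1, p0):
    """Bounds on Pr(Y(0)=0 | X=1, Y=1): did X=1 cause Y=1?
    p1 = Pr(Y=1 | X=1), p0 = Pr(Y=1 | X=0)."""
    return max(0.0, (p1 - p0) / p1), min(1.0, (1 - p0) / p1)

def pc_untreated(p1, p0):
    """Bounds on Pr(Y(1)=1 | X=0, Y=0): did X=0 cause Y=0?"""
    q0 = 1 - p0  # Pr(Y=0 | X=0)
    return max(0.0, (q0 - (1 - p1)) / q0), min(1.0, p1 / q0)

tables = {
    "Intuition 1 (average effect 1/3)": (2 / 3, 1 / 3),
    "X necessary for Y":                (0.5, 0.0),
    "X sufficient for Y":               (1.0, 0.5),
}
for name, (p1, p0) in tables.items():
    print(name, "treated:", pc_treated(p1, p0), "untreated:", pc_untreated(p1, p0))
# -> [0.5, 1] and [0.5, 1]; [1, 1] and [0.5, 0.5]; [0.5, 0.5] and [1, 1]
```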

Learning from chains

Intuition 2

Say now that we could decompose the \(X\rightarrow Y\) process into a 2 step process. \(X\rightarrow M \rightarrow Y\).

For an \(X=1, Y=1\) case, is learning about \(M\) informative for the probability that \(X=1\) caused \(Y=1\)?

Intuition 2

Imagine:

  1. \(X\) is sufficient for \(M\) and \(M\) is necessary for \(Y\)
  2. \(X\) is necessary for \(M\) and \(M\) is sufficient for \(Y\)

We learn nothing in the first case: since \(X\) is sufficient for \(M\) and \(M\) is necessary for \(Y\), any \(X=1, Y=1\) case has \(M=1\), so observing \(M\) cannot discriminate. We might learn a lot in the second case.

Intuition 2

Take the second case. Imagine we could decompose this:

\(Y = 0\) \(Y = 1\)
\(X=0\) 0.5 0.5
\(X=1\) 0.25 0.75

with PC bounds of \(\left[\frac13, \frac23\right]\), into:

\(M = 0\) \(M = 1\)
\(X=0\) 1 0
\(X=1\) 0.5 0.5

\(Y = 0\) \(Y = 1\)
\(M=0\) 0.5 0.5
\(M=1\) 0 1

  • PC: \(1 \times 0.5 = 0.5\) if \(M=1\); and 0 if \(M=0\).
  • Expected value is \(\frac23 \times 0.5 = \frac13\) (since \(\Pr(M=1|X=1, Y=1) = \frac23\)). Remarkably, knowledge of the process alone gives identification even absent observation of \(M\)! (See the sketch below.)
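
The same numbers can be reproduced with a short sketch (assuming, as above, that the two steps operate independently; names are illustrative):

```python
# Chain calculation for the X -> M -> Y decomposition above, assuming the two
# steps operate independently (illustrative sketch, not the paper's code).

def pc_bounds(p1, p0):
    # Bounds on Pr(cause | treated case with outcome 1) under random assignment:
    # p1 = Pr(child = 1 | parent = 1), p0 = Pr(child = 1 | parent = 0)
    return max(0.0, (p1 - p0) / p1), min(1.0, (1 - p0) / p1)

# Step X -> M: Pr(M=1|X=1) = 0.5, Pr(M=1|X=0) = 0   (X necessary for M)
lo_xm, hi_xm = pc_bounds(0.5, 0.0)   # (1.0, 1.0): identified at 1
# Step M -> Y: Pr(Y=1|M=1) = 1,   Pr(Y=1|M=0) = 0.5 (M sufficient for Y)
lo_my, hi_my = pc_bounds(1.0, 0.5)   # (0.5, 0.5): identified at 0.5

pc_if_m1 = lo_xm * lo_my   # 0.5: X caused M, and M caused Y
pc_if_m0 = 0.0             # with M = 0, X cannot have acted through M

# Pr(M=1 | X=1, Y=1) = Pr(M=1, Y=1 | X=1) / Pr(Y=1 | X=1) = (0.5 * 1) / 0.75
pr_m1 = (0.5 * 1.0) / 0.75
expected_pc = pr_m1 * pc_if_m1 + (1 - pr_m1) * pc_if_m0
print(pc_if_m1, pc_if_m0, round(expected_pc, 3))   # 0.5 0.0 0.333
```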

Intuition 2

Say now that we could decompose the \(X\rightarrow Y\) process into a 10 step process, with an effect of 0.9 at every step (\(0.9^{10} \approx 1/3\)).

\(M_{j+1} = 0\) \(M_{j+1} = 1\)
\(M_j=0\) 0.95 0.05
\(M_j=1\) 0.05 0.95

Then the upper bound at each step remains at 1. The lower bound at each step is \(\frac{0.9}{0.95}\), which is about 0.95.

However, compounding over ten steps gives \(\left(\frac{0.9}{0.95}\right)^{10} \approx 0.58\), so the bounds are now roughly [0.58, 1]: better than the [0.5, 1] interval we had before, but not much better.
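
A quick numeric illustration of this leakage (illustrative sketch; the per-step bounds are as just derived):

```python
# Lower bound on PC for a homogeneous chain with n observed mediators (all equal
# to 1), each step with Pr(next = 1 | prev = 1) = 0.95, Pr(next = 1 | prev = 0) = 0.05.
# Step-level bounds are compounded across steps.

step_lower = (0.95 - 0.05) / 0.95   # per-step lower bound, about 0.947
step_upper = min(1.0, 0.95 / 0.95)  # per-step upper bound, exactly 1

for n in (1, 2, 5, 10):
    print(n, round(step_lower ** n, 3), step_upper ** n)
# n = 10 gives roughly 0.58: tighter than [0.5, 1], but not by much
```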

Intuition 2

Knowing a lot about many steps means that you have greater certainty at each step, but there are more sites for leakage and so the accumulation of confidence is not large.

Homogeneous transitions

Suppose all matrices are equal:

PC bounds (red for observed mediators, blue for unobserved mediators) tighten only modestly as the length of the homogeneous chain increases

Non-homogeneous transitions

We find the largest and smallest upper and lower bounds from any complete mediation process, for different types of evidence.

From Dawid, Humphreys, and Musio

  • All are achievable by decompositions of length one or two

Comparison with other bounds

  • We compare these results with the bounds available under monotonicity assumptions and under knowledge of covariates

Take homes

  • PC in general not identified
  • Knowledge of mediation processes can help; information on mediators can help further
  • Positive evidence on mediators is of quite limited value for inferring causes of effects
  • The best possible gains are achievable from short chains: long chains leak
  • Negative evidence can however be powerful

Some good news: clues in many places

Clues in many places

The general approach, case-level process tracing after data-based model training, can be generalized to:

  • arbitrary DAGs
  • arbitrary queries
  • finite and incomplete data

See Humphreys and Jacobs Integrated Inferences

Strategy

  • Embed prior causal beliefs explicitly in a causal model
    • What might directly cause what?
    • Where might there be confounding?
    • How likely are different causal effects?
  • Possible observations are nodes in the model
  • Probative value of observations emerges from properties of the model
  • Then gather data and update the model
  • Yields inferences from data that are consistent with explicit prior beliefs about how the world works

Generality of the approach

  • The basic structures described by Pearl (2009) provide the conceptual basis and language for articulating possibly complex estimands
  • A flexible Stan structure is used for estimation

Clues in many places

  • Mediators
  • Moderators
  • Confounds
  • Colliders
  • Surrogates
  • Symptoms

Clues in many places

Illustration: Microfounding a hoop test

In qualitative inference a “hoop” test is a search for a clue that, if absent, greatly reduces confidence in a theory.

Define a model with \(X\) causing \(Y\) through \(M\) but with confounding.

Simulate performance

We imagine a real world in which there are in fact monotonic effects and no confounding, though this is not known. (The data suggest a process in which \(X\) is necessary for \(M\) and \(M\) is sufficient for \(Y\).)

Define the model, then update, and query.
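
As a hedged illustration of that workflow only, here is a toy sketch in Python for a binary \(X \rightarrow M \rightarrow Y\) model, assuming no confounding, flat Dirichlet priors over nodal types, and a hand-rolled Gibbs sampler. The "true" type shares and all names are mine, chosen for illustration; the actual analysis uses a richer model with confounding, estimated in Stan, so the numbers in the table below will not match, though the qualitative hoop pattern (PC near zero when \(M=0\)) appears here too.

```python
# Toy "define, update, query" sketch for a binary X -> M -> Y model with NO
# confounding: flat Dirichlet priors over nodal types, updated by Gibbs sampling.

import itertools
import numpy as np

rng = np.random.default_rng(1)

# A nodal type maps the parent's value to the child's value: (output if 0, output if 1)
TYPES = list(itertools.product([0, 1], repeat=2))  # (0,0), (0,1), (1,0), (1,1)

# Assumed "true" world (unknown to the model): X necessary for M, M sufficient for Y
true_lam_M = np.array([0.5, 0.5, 0.0, 0.0])  # M always 0, or M = X
true_lam_Y = np.array([0.0, 0.5, 0.0, 0.5])  # Y = M, or Y always 1

def simulate(n, lam_M, lam_Y):
    X = rng.integers(0, 2, n)
    tM = rng.choice(4, n, p=lam_M)
    tY = rng.choice(4, n, p=lam_Y)
    M = np.array([TYPES[t][x] for t, x in zip(tM, X)])
    Y = np.array([TYPES[t][m] for t, m in zip(tY, M)])
    return X, M, Y

X, M, Y = simulate(200, true_lam_M, true_lam_Y)

def sample_types(parent, child, lam):
    """Draw, for each case, a nodal type consistent with its (parent, child) data."""
    out = np.empty(len(parent), dtype=int)
    for i, (p, c) in enumerate(zip(parent, child)):
        w = np.array([lam[t] if TYPES[t][p] == c else 0.0 for t in range(4)])
        out[i] = rng.choice(4, p=w / w.sum())
    return out

def pc_given(lam_M, lam_Y, m_obs=None):
    """Pr(X=1 caused Y=1) for a case with X=1, Y=1 (and M=m_obs if observed)."""
    num = den = 0.0
    for a, tM in enumerate(TYPES):
        for b, tY in enumerate(TYPES):
            if m_obs is not None and tM[1] != m_obs:
                continue  # inconsistent with the observed mediator
            if tY[tM[1]] != 1:
                continue  # inconsistent with Y = 1 under X = 1
            w = lam_M[a] * lam_Y[b]
            den += w
            if tY[tM[0]] == 0:  # Y would have been 0 had X been 0
                num += w
    return num / den if den > 0 else float("nan")

# Gibbs sampler: alternate between case-level types and population type shares
lam_M = np.full(4, 0.25)
lam_Y = np.full(4, 0.25)
draws = {None: [], 0: [], 1: []}
for it in range(1000):
    tM = sample_types(X, M, lam_M)
    tY = sample_types(M, Y, lam_Y)
    lam_M = rng.dirichlet(1 + np.bincount(tM, minlength=4))
    lam_Y = rng.dirichlet(1 + np.bincount(tY, minlength=4))
    if it >= 200:  # discard burn-in
        for m_obs in (None, 1, 0):
            draws[m_obs].append(pc_given(lam_M, lam_Y, m_obs))

for m_obs in (None, 1, 0):
    label = "X==1 & Y==1" + ("" if m_obs is None else f" & M=={m_obs}")
    print(f"{label:20s} posterior mean: {np.mean(draws[m_obs]):.3f}")
```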

Learning from a hoop clue

Given                 truth   prior   post.mean   sd
X==1 & Y==1           0.62    0.268   0.313       0.183
X==1 & Y==1 & M==1    0.70    0.250   0.354       0.206
X==1 & Y==1 & M==0    0.00    0.250   0.005       0.006

Hoop illustration implications

We find that \(M\) is informative about whether \(X\) caused \(Y\) in a case specifically when we observe \(M=0\).

This is striking because:

  • We have a non-experimental design in which the effect of \(X\) on \(Y\) is not identified
  • The probability that \(Y\) is caused by \(X\) is also not identified
  • We placed no restrictions on functional forms except those implied by the DAG (notably, \(X\) works via \(M\))
  • We did not build “hoopiness” into the model; we learned it from the model

We can do similarly for other Van Evera tests, specifically using moderators to generate “doubly decisive” tests.

Nothing from nothing

Nothing from nothing

  • Say you had access to large amounts of observational data on \(X\), \(M\), and \(Y\)
  • You know only the temporal order of \(X\), \(M\), and \(Y\)
  • Can you figure out from these data whether \(M\) is informative about whether \(X\) caused \(Y\)?

Assume a world, like the one above, in which in fact \(X \rightarrow M \rightarrow Y\), with strong effects at each step (80%, 80%).

Updating an agnostic model gives:

Conditional inferences from an updated agnostic model given a true model in which X causes M and M causes Y

Query   Given   Using        mean   sd
Q 1     -       posteriors   0.4    0.09
Q 1     M==0    posteriors   0.4    0.12
Q 1     M==1    posteriors   0.4    0.13

This negative result holds even if we can exclude a direct \(X \rightarrow Y\) effect: without assumptions that rule out confounding (for instance between \(M\) and \(Y\)), the data cannot establish that \(M\) sits on an operative causal path, and so observing \(M\) shifts nothing.

This example illustrates Cartwright's dictum: no causes in, no causes out.

Some OK news: applications

Institutions and growth

Institutions and Growth Model

Updated model

Case-level inferences given possible observations of distance and mortality.

Implications

  • For a case with weak institutions and low growth (first column), the former likely caused the latter. Similarly for cases with strong institutions and growth (last column).
  • For cases with weak institutions and high growth (and vice versa), the relationship is unlikely to be causal.
  • In a strong-institutions / high-growth case, proximity to the equator increases confidence that the strong institutions helped, despite the fact that distance and institutions are complements for the average treatment effect.
  • Mortality is informative about the effect of institutions on growth even if we already know the strength of institutions.
  • Learned patterns of confounding are consistent with a world in which settlers responded to low mortality by building strong institutions specifically in those places where they rationally expected strong institutions to help.

Confounding

Correlated posteriors

Take aways

  • The probative value of clues can be derived rather than imposed
  • Causal attribution queries are generally not identified: we cannot nail them down even with infinite data
  • Knowledge of mediators (and mediation processes) can narrow bounds, but not by all that much
  • Some structure is generally needed (“nothing from nothing”), e.g. as justified by experimentation
  • Given a causal structure, however, the integration of qualitative and quantitative inferences is quite simple
  • But inferences from formal integration are likely to be less dramatic than researchers' interpretations

References