Well posed questions
Say that units are randomly assigned to treatment within strata (possibly just one), with a fixed, though possibly different, share assigned in each stratum. Then the key estimands and estimators are:
Estimand | Estimator |
---|---|
\(\tau_{ATE} \equiv \mathbb{E}[\tau_i]\) | \(\widehat{\tau}_{ATE} = \sum\nolimits_{x} \frac{w_x}{\sum\nolimits_{j}w_{j}}\widehat{\tau}_x\) |
\(\tau_{ATT} \equiv \mathbb{E}[\tau_i | Z_i = 1]\) | \(\widehat{\tau}_{ATT} = \sum\nolimits_{x} \frac{p_xw_x}{\sum\nolimits_{j}p_jw_j}\widehat{\tau}_x\) |
\(\tau_{ATC} \equiv \mathbb{E}[\tau_i | Z_i = 0]\) | \(\widehat{\tau}_{ATC} = \sum\nolimits_{x} \frac{(1-p_x)w_x}{\sum\nolimits_{j}(1-p_j)w_j}\widehat{\tau}_x\) |
where \(x\) indexes strata, \(p_x\) is the share of units in stratum \(x\) that is treated, \(w_x\) is the size of stratum \(x\), and \(\widehat{\tau}_x\) is the estimated average effect within stratum \(x\).
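The three estimators differ only in how they weight the stratum-level estimates. A minimal sketch in base R, where the two strata, their sizes, treated shares, and stratum estimates are all made-up numbers for illustration:

```r
# Hypothetical strata (illustrative numbers, not from any real study)
w       <- c(100, 300)   # w_x: stratum sizes
p       <- c(0.5, 0.2)   # p_x: share treated in each stratum
tau_hat <- c(1, 2)       # tau_hat_x: estimated effect within each stratum

# Weighted average of stratum estimates, with weights normalized to sum to 1
weighted_effect <- function(weights, estimates) {
  sum(weights / sum(weights) * estimates)
}

ate_hat <- weighted_effect(w, tau_hat)            # weights: stratum size
att_hat <- weighted_effect(p * w, tau_hat)        # weights: number treated
atc_hat <- weighted_effect((1 - p) * w, tau_hat)  # weights: number in control

round(c(ATE = ate_hat, ATT = att_hat, ATC = atc_hat), 3)
```

Here the ATT estimate (about 1.545) is pulled toward the first stratum, which treats a larger share of its units, while the ATC estimate (about 1.828) is pulled toward the second.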
In addition, each of these can be targets of interest:
And for different subgroups,
The CATEs are conditional average treatment effects, for example the effect for men or for women. These are straightforward.
However, we might also imagine conditioning on unobservable or counterfactual features.
\[LATE = \frac{1}{|C|}\sum_{j\in C}(Y_j(X=1) - Y_j(X=0))\] \[C:=\{j:X_j(Z=1) > X_j(Z=0) \}\]
We will return to these in the study of instrumental variables.
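To make the definition concrete, here is a small sketch in base R with six hypothetical units whose potential outcomes are all written down (something we could never observe in practice). The vectors `X_z0`, `X_z1`, `Y_x0`, and `Y_x1` are illustrative names, not notation from the text.

```r
# Hypothetical potential outcomes for six units
X_z0 <- c(0, 0, 0, 1, 1, 0)   # X_j(Z = 0)
X_z1 <- c(1, 1, 0, 1, 1, 0)   # X_j(Z = 1)
Y_x0 <- c(0, 1, 0, 1, 0, 1)   # Y_j(X = 0)
Y_x1 <- c(1, 1, 1, 1, 1, 0)   # Y_j(X = 1)

# The set C: units whose X responds positively to Z
complier <- X_z1 > X_z0

late <- mean((Y_x1 - Y_x0)[complier])  # average effect among units in C
ate  <- mean(Y_x1 - Y_x0)              # average effect among all units

c(LATE = late, ATE = ate)
```

In this made-up population the first two units form \(C\), so the LATE is 0.5 while the ATE is 1/3: the two estimands average over different groups.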
Other ways to condition on potential outcomes:
Many inquiries are averages of individual effects, even if the groups are not known, but they do not have to be:
Inquiries might relate to distributional quantities such as:
You might even be interested in \(\min(Y_i(1) - Y_i(0))\).
There are lots of interesting “spillover” estimands.
Imagine there are three individuals and each person’s outcome depends on the assignments of all the others. For instance \(Y_1(Z_1, Z_2, Z_3)\), or more generally, \(Y_i(Z_i, Z_{i+1 (\text{mod }3)}, Z_{i+2 (\text{mod }3)})\).
Then three estimands might be:
Interpret these. What others might be of interest?
A difference in CATEs is a well defined estimand that might involve interventions on one node only:
It captures differences in effects.
An interaction is an effect on an effect:
Note in the latter the expectation is taken over the whole population.
Say \(X\) can affect \(Y\) directly, or indirectly through \(M\). Then we can write potential outcomes as:
We can then imagine inquiries of the form:
Interpret these. What others might be of interest?
Again we might imagine that these are defined with respect to some group:
Here, among those for whom \(X\) has a positive effect on \(Y\), for what share would there be a positive effect if \(M\) were fixed at 1?
In qualitative research a particularly common inquiry is “did \(X=1\) cause \(Y=1\)?”
This is often given as a probability, the “probability of causation” (though at the case level we might better think of this probability as an estimate rather than an estimand):
\[\Pr(Y_i(0) = 0 | Y_i(1) = 1, X = 1)\]
Intuition: What’s the probability \(X=1\) caused \(Y=1\) in an \(X=1, Y=1\) case drawn from a large population with the following experimental distribution:
Condition | Y=0 | Y=1 | All |
---|---|---|---|
X=0 | 1 | 0 | 1 |
X=1 | 0.25 | 0.75 | 1 |
Intuition: What’s the probability \(X=1\) caused \(Y=1\) in an \(X=1, Y=1\) case if the experimental distribution is instead:
Condition | Y=0 | Y=1 | All |
---|---|---|---|
X=0 | 0.75 | 0.25 | 1 |
X=1 | 0.25 | 0.75 | 1 |
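One way to check intuitions here is to compute the range of values of the probability of causation that is consistent with an experimental distribution. The sketch below, in base R, assumes only that \(X\) is randomized, so that \(\Pr(Y(1)=1)\) and \(\Pr(Y(0)=1)\) can be read off the rows of the table; the function name `pc_bounds` is ours. The share of units with \(Y(0)=0, Y(1)=1\) must lie between \(\max(0, p_1 - p_0)\) and \(\min(p_1, 1 - p_0)\), and dividing by \(p_1\) gives bounds on the probability of causation.

```r
# Bounds on Pr(Y(0) = 0 | Y(1) = 1) from experimental margins alone
# p1 = Pr(Y(1) = 1), p0 = Pr(Y(0) = 1); assumes X is randomized
pc_bounds <- function(p1, p0) {
  c(lower = max(0, (p1 - p0) / p1),
    upper = min(1, (1 - p0) / p1))
}

first_case  <- pc_bounds(p1 = 0.75, p0 = 0)     # Y = 1 never occurs when X = 0
second_case <- pc_bounds(p1 = 0.75, p0 = 0.25)  # Y = 1 sometimes occurs when X = 0

rbind(first_case, second_case)
```

With the first distribution the bounds collapse to 1: \(Y=1\) never arises without \(X=1\). With the second, the data are consistent with any value between 2/3 and 1.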
Other inquiries focus on distinguishing between causes.
For the Billy Suzy problem (Hall 2004), Halpern (2016) focuses on “actual causation” as a way to distinguish between Suzy and Billy:
Imagine Suzy and Billy, simultaneously throwing stones at a bottle. Both are excellent shots and hit whatever they aim at. Suzy’s stone hits first, knocks over the bottle, and the bottle breaks. However, Billy’s stone would have hit had Suzy’s not hit, and again the bottle would have broken. Did Suzy’s throw cause the bottle to break? Did Billy’s?
Actual Causation:
An inquiry: for what share in a population is a possible cause an actual cause?
Pearl (e.g. Pearl and Mackenzie (2018)) describes three types of inquiry:
Level | Activity | Inquiry |
---|---|---|
Association | “Seeing” | If I see \(X=1\) should I expect \(Y=1\)? |
Intervention | “Doing” | If I set \(X\) to \(1\) should I expect \(Y=1\)? |
Counterfactual | “Imagining” | If \(X\) were \(0\) instead of 1, would \(Y\) then be \(0\) instead of \(1\)? |
We can understand these as asking different types of questions about a causal model:
Level | Activity | Inquiry |
---|---|---|
Association | “Seeing” | \(\Pr(Y=1|X=1)\) |
Intervention | “Doing” | \(\mathbb{E}[\mathbb{I}(Y(1)=1)]\) |
Counterfactual | “Imagining” | \(\Pr(Y(1)=1 \& Y(0)=0)\) |
The third is qualitatively different because it requires information about two mutually incompatible conditions for units. This is not (generally) recoverable directly from knowledge of \(\Pr(Y(1)=1)\) and \(\Pr(Y(0)=0)\).
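A small numerical illustration of this point, in base R. The two "models" below are hypothetical distributions over the four possible response types, labelled by their values of \((Y(0), Y(1))\); the names `model_A`, `model_B`, and `summarize_model` are ours.

```r
# Shares of the four causal types; "y01" means Y(0) = 0 and Y(1) = 1
model_A <- c(y00 = 0.25, y01 = 0.50, y10 = 0.00, y11 = 0.25)
model_B <- c(y00 = 0.00, y01 = 0.75, y10 = 0.25, y11 = 0.00)

summarize_model <- function(types) {
  c(pr_y1_is_1 = unname(types["y01"] + types["y11"]),  # Pr(Y(1) = 1)
    pr_y0_is_0 = unname(types["y00"] + types["y01"]),  # Pr(Y(0) = 0)
    pr_both    = unname(types["y01"]))                 # Pr(Y(1) = 1 & Y(0) = 0)
}

rbind(A = summarize_model(model_A), B = summarize_model(model_B))
```

Both models imply \(\Pr(Y(1)=1) = 0.75\) and \(\Pr(Y(0)=0) = 0.75\), yet the joint quantity is 0.5 in one and 0.75 in the other, so the two margins alone cannot pin it down.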
Given a causal model over nodes with discrete ranges, inquiries can generally be described as summaries of the distributions of exogenous nodes.
We already saw two instances of this:
What it is. When you have it. What it’s worth.
Informally a quantity is “identified” if it can be “recovered” once you have enough data.
Say, for example, that the average wage is \(x\) in some very large population. If I gather lots and lots of data on the wages of individuals and take the average, then my estimate will ultimately let me figure out \(x\).
Identifiability: Let \(Q(M)\) be a query defined over a class of models \(\mathcal M\), and let \(P(M)\) be the distribution over observables implied by \(M\). Then \(Q\) is identifiable if, for any pair of models \(M_1, M_2 \in \mathcal M\), \(P(M_1) = P(M_2) \rightarrow Q(M_1) = Q(M_2)\).
Identifiability with constrained data: Let \(Q(M)\) be a query defined over a class of models \(\mathcal M\). Then \(Q\) is identifiable from features \(F(M)\) if, for any pair of models \(M_1, M_2 \in \mathcal M\), \(F(M_1) = F(M_2) \rightarrow Q(M_1) = Q(M_2)\).
Based on Defn 3.2.3 in Pearl.
Our goal in causal inference is to estimate quantities such as:
\[\Pr(Y|\hat{x})\]
where \(\hat{x}\) is interpreted as \(X\) set to \(x\) by “external” control. Equivalently: \(do(X=x)\) or sometimes \(X \leftarrow x\).
If this quantity is identifiable then we can recover it with infinite data.
If it is not identifiable, then, even in the best case, we are not guaranteed to get the right answer.
Are there general rules for determining whether this quantity can be identified? Yes.
Note first, identifying
\[\Pr(Y|x)\]
is easy.
But we are not always interested in identifying the distribution of \(Y\) given observed values of \(x\), but rather, the distribution of \(Y\) if \(X\) is set to \(x\).
If we can identify the controlled distribution we can calculate other causal quantities of interest.
For example for a binary \(X, Y\) the causal effect of \(X\) on the probability that \(Y=1\) is:
\[\Pr(Y=1|\hat{x}=1) - \Pr(Y=1|\hat{x}=0)\]
Again, this is not the same as:
\[\Pr(Y=1|x=1) - \Pr(Y=1|x=0)\]
It’s the difference between seeing and doing.
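The gap between the two differences can be computed exactly in a toy model. The sketch below, in base R, assumes a hypothetical binary confounder \(U\) that raises both the chance of \(X=1\) and the chance of \(Y=1\); all probabilities are made up for illustration.

```r
# Hypothetical model: binary U affects both X and Y
u             <- c(0, 1)
pr_u          <- c(0.5, 0.5)   # Pr(U = 0), Pr(U = 1)
pr_x1_given_u <- c(0.2, 0.8)   # Pr(X = 1 | U = u)
pr_y1 <- function(x, u) 0.1 + 0.2 * x + 0.5 * u   # Pr(Y = 1 | X = x, U = u)

# Doing: X is set for everyone, so U keeps its marginal distribution
do_y1 <- function(x) sum(pr_y1(x, u) * pr_u)

# Seeing: conditioning on X = x shifts the distribution of U
see_y1 <- function(x) {
  pr_x_given_u <- if (x == 1) pr_x1_given_u else 1 - pr_x1_given_u
  pr_u_given_x <- pr_x_given_u * pr_u / sum(pr_x_given_u * pr_u)
  sum(pr_y1(x, u) * pr_u_given_x)
}

c(doing = do_y1(1) - do_y1(0), seeing = see_y1(1) - see_y1(0))
```

In this model the effect of setting \(X\) is 0.2, but the observed difference is 0.5, because units with \(X=1\) are also more likely to have \(U=1\).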
The key idea is that you want to find a set of variables such that when you condition on these you get what you would get if you used a do operation.
Intuition:
The backdoor criterion is satisfied by \(Z\) (relative to \(X\), \(Y\)) if:
1. No node in \(Z\) is a descendant of \(X\).
2. \(Z\) blocks every path between \(X\) and \(Y\) that contains an arrow into \(X\).
In that case you can identify the effect of \(X\) on \(Y\) by conditioning on \(Z\):
\[P(Y=y | \hat{x}) = \sum_z P(Y=y| X = x, Z=z)P(z)\]
(This is eqn 3.19 in Pearl (2000).)
Effects can then be calculated as differences in controlled distributions:
\[P(Y=y | \hat{x}) - P(Y=y | \hat{x}')\]
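The adjustment formula can be applied mechanically to an observational distribution. The sketch below, in base R, assumes a hypothetical model with \(Z \rightarrow X\), \(Z \rightarrow Y\), and \(X \rightarrow Y\) (so \(Z\) satisfies the backdoor criterion), builds the implied observational joint, and then uses only that joint to recover the effect. All numbers and object names are illustrative.

```r
# Hypothetical model: Z -> X, Z -> Y, X -> Y, all binary
pr_z           <- c(0.5, 0.5)   # Pr(Z = 0), Pr(Z = 1)
pr_x1_given_z  <- c(0.2, 0.8)   # Pr(X = 1 | Z = z)
pr_y1_given_xz <- function(x, z) 0.1 + 0.2 * x + 0.5 * z

# Observational quantities Pr(Z = z, X = x) and Pr(Z = z, X = x, Y = 1)
grid <- expand.grid(z = 0:1, x = 0:1)
grid$pr_zx <- pr_z[grid$z + 1] *
  ifelse(grid$x == 1, pr_x1_given_z[grid$z + 1], 1 - pr_x1_given_z[grid$z + 1])
grid$pr_zxy1 <- grid$pr_zx * pr_y1_given_xz(grid$x, grid$z)

# Backdoor adjustment: sum_z Pr(Y = 1 | X = x, Z = z) Pr(Z = z)
adjusted <- function(x) {
  rows <- grid[grid$x == x, ]
  pr_y1_given_x_z <- rows$pr_zxy1 / rows$pr_zx
  pr_z_marginal   <- tapply(grid$pr_zx, grid$z, sum)
  sum(pr_y1_given_x_z * pr_z_marginal[as.character(rows$z)])
}

# Unadjusted conditional probability Pr(Y = 1 | X = x), for comparison
naive <- function(x) sum(grid$pr_zxy1[grid$x == x]) / sum(grid$pr_zx[grid$x == x])

c(adjusted = adjusted(1) - adjusted(0), naive = naive(1) - naive(0))
```

The adjusted difference equals the 0.2 built into the model, while the unadjusted difference is 0.5.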
Following Pearl (2009), Chapter 11. Let \(T\) denote the set of parents of \(X\): \(T := pa(X)\), with (possibly vector valued) realizations \(t\). These might not all be observed.
If the backdoor criterion is satisfied, we have:
1. \(Y\) is independent of \(T\) given \(X\) and \(Z\), so \(p(y|x, t, z) = p(y|x, z)\).
2. \(Z\) is independent of \(X\) given \(T\), so \(p(z|x, t) = p(z|t)\).
We bring \(Z\) into the picture by writing: \[p(y|\hat{x}) = \sum_{t\in T} p(t) \sum_z p(y|x, t, z)p(z|x, t)\]
Then using the two conditions above:
This gives: \[p(y|\hat x) = \sum_{t \in T} p(t) \sum_z p(y|x, z)p(z|t) \]
So, cleaning up, we can get rid of \(T\):
\[p(y|\hat{x}) = \sum_z p(y|x, z)\sum_{t\in T} p(z|t)p(t) = \sum_z p(y| x, z)p(z)\]
For intuition:
We would be happy if we could condition on the parent \(T\), but \(T\) is not observed. However we can use \(Z\) instead making use of the fact that:
See Shpitser, VanderWeele, and Robins (2012)
The adjustment criterion is satisfied by \(Z\) (relative to \(X\), \(Y\)) if:
Note:
Here \(Z\) satisfies the adjustment criterion but not the backdoor criterion:
\(Z\) is a descendant of \(X\) but it is not a descendant of a node on a path from \(X\) to \(Y\). No harm adjusting for \(Z\) here, but not necessary either.
Consider this DAG:
Why?
If:
1. \(M\) intercepts all directed paths from \(X\) to \(Y\);
2. there is no unblocked backdoor path from \(X\) to \(M\); and
3. all backdoor paths from \(M\) to \(Y\) are blocked by \(X\).
Then \(\Pr(y| \hat x)\) is identifiable and given by:
\[\Pr(y| \hat x) = \sum_m\Pr(m|x)\sum_{x'}\left(\Pr(y|m,x')\Pr(x')\right)\]
We want to get \(\Pr(y | \hat x)\).
From the graph the joint distribution of variables is:
\[\Pr(x,m,y,u) = \Pr(u)\Pr(x|u)\Pr(m|x)\Pr(y|m,u)\]
If we intervened on \(X\) we would have (\(\Pr(X = x |u)=1\)):
\[\Pr(m,y,u | \hat x) = \Pr(u)\Pr(m|x)\Pr(y|m,u)\]
Summing over \(u\), and then over \(m\), we get:
\[\Pr(m,y| \hat x) = \Pr(m|x)\sum_u\left(\Pr(y|m,u)\Pr(u)\right)\]
\[\Pr(y| \hat x) = \sum_m\Pr(m|x)\sum_u\left(\Pr(y|m,u)\Pr(u)\right)\]
The first part is fine; the second part however involves \(u\) which is unobserved. So we need to get the \(u\) out of \(\sum_u\left(\Pr(y|m,u)\Pr(u)\right)\).
Now, from the graph:
1. \(M\) is d-separated from \(U\) by \(X\):
\[\Pr(u|m, x) = \Pr(u|x)\]
2. \(X\) is d-separated from \(Y\) by \(M\), \(U\):
\[\Pr(y|x, m, u) = \Pr(y|m,u)\]
That’s enough to get \(u\) out of \(\sum_u\left(\Pr(y|m,u)\Pr(u)\right)\):
\[\sum_u\left(\Pr(y|m,u)\Pr(u)\right) = \sum_x\sum_u\left(\Pr(y|m,u)\Pr(u|x)\Pr(x)\right)\]
Using the 2 equalities we got from the graph:
\[\sum_u\left(\Pr(y|m,u)\Pr(u)\right) = \sum_x\sum_u\left(\Pr(y|x,m,u)\Pr(u|x,m)\Pr(x)\right)\]
So:
\[\sum_u\left(\Pr(y|m,u)\Pr(u)\right) = \sum_x\left(\Pr(y|m,x)\Pr(x)\right)\]
Intuitively: \(X\) blocks the back door between \(M\) and \(Y\) just as well as \(U\) does.
Substituting we are left with:
\[\Pr(y| \hat x) = \sum_m\Pr(m|x)\sum_{x'}\left(\Pr(y|m,x')\Pr(x')\right)\]
(The \('\) is to distinguish the \(x\) in the summation from the value of \(x\) of interest.)
It’s interesting that \(x\) remains on the right hand side in the calculation of the \(m \rightarrow y\) effect, but this is because \(x\) blocks a backdoor from \(m\) to \(y\).
Bringing all this together into a claim we have:
If:
1. \(M\) intercepts all directed paths from \(X\) to \(Y\);
2. there is no unblocked backdoor path from \(X\) to \(M\); and
3. all backdoor paths from \(M\) to \(Y\) are blocked by \(X\).
Then \(\Pr(y| \hat x)\) is identifiable and given by:
\[\Pr(y| \hat x) = \sum_m\Pr(m|x)\sum_{x'}\left(\Pr(y|m,x')\Pr(x')\right)\]
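The claim can be checked numerically. The sketch below, in base R, assumes a hypothetical binary model with \(U \rightarrow X\), \(U \rightarrow Y\), \(X \rightarrow M\), and \(M \rightarrow Y\). It computes \(\Pr(Y=1|\hat x)\) twice: once using the unobserved \(U\) (the truth), and once with the formula above using only the joint distribution of \((X, M, Y)\). All probabilities and object names are illustrative.

```r
# Hypothetical model: U -> X, U -> Y, X -> M, M -> Y (U unobserved), all binary
pr_u  <- c(0.5, 0.5)                             # Pr(U = 0), Pr(U = 1)
pr_x1 <- function(u) 0.2 + 0.6 * u               # Pr(X = 1 | u)
pr_m1 <- function(x) 0.1 + 0.7 * x               # Pr(M = 1 | x)
pr_y1 <- function(m, u) 0.1 + 0.4 * m + 0.4 * u  # Pr(Y = 1 | m, u)

# Joint Pr(u, x, m), with Pr(Y = 1 | m, u) attached to each cell
g <- expand.grid(u = 0:1, x = 0:1, m = 0:1)
g$p <- pr_u[g$u + 1] *
  ifelse(g$x == 1, pr_x1(g$u), 1 - pr_x1(g$u)) *
  ifelse(g$m == 1, pr_m1(g$x), 1 - pr_m1(g$x))
g$py1 <- pr_y1(g$m, g$u)

# Truth: sum_m Pr(m | x) sum_u Pr(y | m, u) Pr(u), which uses U
truth <- function(x) {
  sum(sapply(0:1, function(m) {
    pm <- if (m == 1) pr_m1(x) else 1 - pr_m1(x)
    pm * sum(pr_y1(m, 0:1) * pr_u)
  }))
}

# Front-door formula: sum_m Pr(m | x) sum_x' Pr(y | m, x') Pr(x'), which does not
front_door <- function(x) {
  sum(sapply(0:1, function(m) {
    pm_given_x <- sum(g$p[g$x == x & g$m == m]) / sum(g$p[g$x == x])
    inner <- sum(sapply(0:1, function(xp) {
      cell <- g$x == xp & g$m == m
      py1_given_m_xp <- sum(g$p[cell] * g$py1[cell]) / sum(g$p[cell])
      py1_given_m_xp * sum(g$p[g$x == xp])
    }))
    pm_given_x * inner
  }))
}

rbind(truth = c(truth(0), truth(1)), front_door = c(front_door(0), front_door(1)))
```

The two rows agree (0.34 and 0.62 in this model), even though the second never touches \(U\).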
There is a package, dagitty (Textor et al. 2016), for figuring out what to condition on.
Define a dag using dagitty syntax:
There is then a simple command to check whether two sets are d-separated by a third set:
And a simple command to identify the adjustments needed to identify the effect of one variable on another:
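As a sketch of this workflow, the example below uses a hypothetical DAG (not one from these slides) in which \(Z\) confounds \(X\) and \(Y\) and \(M\) mediates the effect of \(X\). It assumes the dagitty package is installed and uses its `dagitty()`, `dseparated()`, and `adjustmentSets()` functions.

```r
library(dagitty)

# Hypothetical DAG: Z confounds X and Y; M mediates the effect of X on Y
g <- dagitty("dag{ Z -> X ; Z -> Y ; X -> M ; M -> Y }")

# Is X d-separated from Y given the set {M, Z}?
sep <- dseparated(g, "X", "Y", c("M", "Z"))
sep

# Which adjustment sets identify the total effect of X on Y?
sets <- adjustmentSets(g, exposure = "X", outcome = "Y")
sets
```

In this graph every path from \(X\) to \(Y\) runs through \(M\) or \(Z\), so conditioning on both separates them, and adjusting for \(Z\) alone suffices for the total effect.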
Example where \(Z\) is correlated with \(X\) and \(Y\) and is a confounder
Example where \(Z\) is correlated with \(X\) and \(Y\) but it is not a confounder
But controlling can also cause problems. In fact conditioning on a temporally pre-treatment variable could cause problems. Who’d have thunk? Here is an example from Pearl (2005):
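```r
library(estimatr)  # lm_robust() and its tidy() method
library(knitr)     # kable()

U1 <- rnorm(10000); U2 <- rnorm(10000)  # unobserved causes
Z <- U1 + U2                            # pre-treatment collider of U1 and U2
X <- U2 + rnorm(10000)/2                # treatment
Y <- U1*2 + X                           # true effect of X on Y is 1
lm_robust(Y ~ X) |> tidy() |> kable(digits = 2)
```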
term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
---|---|---|---|---|---|---|---|---|
(Intercept) | -0.02 | 0.02 | -1.21 | 0.23 | -0.06 | 0.01 | 9998 | Y |
X | 1.02 | 0.02 | 56.52 | 0.00 | 0.98 | 1.05 | 9998 | Y |
The same regression, now also controlling for \(Z\) (`Y ~ X + Z`):
term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
---|---|---|---|---|---|---|---|---|
(Intercept) | -0.01 | 0.01 | -1.13 | 0.26 | -0.03 | 0.01 | 9997 | Y |
X | -0.34 | 0.01 | -34.98 | 0.00 | -0.36 | -0.32 | 9997 | Y |
Z | 1.67 | 0.01 | 220.37 | 0.00 | 1.65 | 1.68 | 9997 | Y |
```r
library(dagitty)

g <- dagitty("dag{U1 -> Z ; U1 -> y ; U2 -> Z ; U2 -> x -> y}")
adjustmentSets(g, exposure = "x", outcome = "y")
```
{}
[1] FALSE
[1] TRUE
Which means, no need to condition on anything.
A bind: from Pearl 1995.
For a solution for a class of related problems see Robins, Hernan, and Brumback (2000).
```r
g <- dagitty("dag{U1 -> Z ; U1 -> y ;
             U2 -> Z ; U2 -> x -> y;
             Z -> x}")
adjustmentSets(g, exposure = "x", outcome = "y")
```
{ U1 }
{ U2, Z }
Which means you have to adjust for an unobservable. Here we double check whether including or not including “Z” is enough:
So we cannot identify the effect here. But can we still learn about it?