Causation as difference making
The intervention-based motivation for understanding causal effects:
The problem in 2 is that you need to know what would have happened if things were different. You need information on a counterfactual.
Idea: A causal claim is (in part) a claim about something that did not happen. This makes it metaphysical.
Now that we have a concept of causal effects available, let’s answer two questions:
TRANSITIVITY: If for a given unit \(A\) causes \(B\) and \(B\) causes \(C\), does that mean that \(A\) causes \(C\)?
A boulder is flying down a mountain. You duck. This saves your life.
So the boulder caused the ducking and the ducking caused you to survive.
So: did the boulder cause you to survive?
CONNECTEDNESS Say \(A\) causes \(B\) — does that mean that there is a spatiotemporally continuous sequence of causal intermediates?
The counterfactual model is about contribution and attribution in a very specific sense.
Consider an outcome \(Y\) that might depend on two causes \(X_1\) and \(X_2\):
\[Y(0,0) = 0\] \[Y(1,0) = 0\] \[Y(0,1) = 0\] \[Y(1,1) = 1\]
What caused \(Y\)? Which cause was most important?
The counterfactual model is about attribution in a very conditional sense.
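A minimal Python sketch of the two-cause example above. The names `Y`, `effect_of_x1`, and `effect_of_x2` are illustrative and not from the source; the sketch tabulates the effect of each cause at each value of the other.

```python
# Potential outcomes from the example: Y(x1, x2) = 1 only when both causes are present.
Y = {(0, 0): 0, (1, 0): 0, (0, 1): 0, (1, 1): 1}

# Effect of X1 on Y, holding X2 fixed at each value (hypothetical helper names).
effect_of_x1 = {x2: Y[(1, x2)] - Y[(0, x2)] for x2 in (0, 1)}
# Effect of X2 on Y, holding X1 fixed at each value.
effect_of_x2 = {x1: Y[(x1, 1)] - Y[(x1, 0)] for x1 in (0, 1)}

print("Effect of X1 given X2:", effect_of_x1)
print("Effect of X2 given X1:", effect_of_x2)
```

Each cause has an effect only when the other is present, so the counterfactual model gives no basis for ranking one above the other.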
This is a problem for research programs that define “explanation” in terms of figuring out the things that cause \(Y\).
Real difficulties conceptualizing what it means to say one cause is more important than another cause. What does that mean?
Erdogan’s increasing authoritarianism was the most important reason for the attempted coup
More uncomfortably:
What does it mean to say that the tides are caused by the moon? What exactly do we have to imagine…
Jack exploited Jill
It’s Jill’s fault that the bucket fell
Jack is the most obstructionist member of Congress
Melania Trump stole from Michelle Obama’s speech
Activists need causal claims
This is sometimes called a “switching equation”
In DeclareDesign
\(Y\) is realised from potential outcomes and assignment in this way using reveal_outcomes
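For readers outside R, the same idea can be sketched in Python. This is not DeclareDesign code: the function name `reveal_outcome` and the example values are hypothetical, and the sketch assumes the standard form of the switching equation, \(Y = Z \cdot Y(1) + (1 - Z) \cdot Y(0)\).

```python
def reveal_outcome(z, y0, y1):
    """Switching equation: return Y(1) if treated (z == 1), else Y(0)."""
    return z * y1 + (1 - z) * y0

# Hypothetical potential outcomes and assignments for three units.
y0 = [2, 1, 4]
y1 = [3, 3, 3]
z = [1, 0, 1]

# Only one potential outcome per unit is revealed.
y_obs = [reveal_outcome(zi, a, b) for zi, a, b in zip(z, y0, y1)]
print(y_obs)
```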
Say \(Z\) is a random variable, then this is a sort of data generating process. BUT the key thing to note is
Now for some magic. We really want to estimate: \[ \tau_i = Y_i(1) - Y_i(0)\]
BUT: We never can observe both \(Y_i(1)\) and \(Y_i(0)\)
Say we lower our sights and try to estimate an average treatment effect: \[ \tau = \mathbb{E} [Y(1)-Y(0)]\]
Now make use of the fact that \[\mathbb E[Y(1)-Y(0)] = \mathbb E[Y(1)]- \mathbb E [Y(0)] \]
In words: The average of differences is equal to the difference of averages; here, the average treatment effect is equal to the difference in average outcomes in treatment and control units.
The magic is that while we can’t hope to measure the differences; we are good at measuring averages.
This provides a positive argument for causal inference from randomization, rather than simply saying with randomization “everything else is controlled for”
Let’s discuss:
Idea: random assignment is random sampling from potential worlds: to understand anything you find, you need to know the sampling weights
Idea: We now have a positive argument for claiming unbiased estimation of the average treatment effect following random assignment
But is the average treatment effect a quantity of social scientific interest?
The average of the differences \(\approx\) difference of averages
Question: \(\approx\) or \(=\)?
Consider the following potential outcomes table:
Unit | Y(0) | Y(1) | \(\tau_i\) |
---|---|---|---|
1 | 4 | 3 | |
2 | 2 | 3 | |
3 | 1 | 3 | |
4 | 1 | 3 | |
5 | 2 | 3 | |
Questions for us: What are the unit level treatment effects? What is the average treatment effect?
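A worked check in Python using the table above. It computes the unit-level effects and the average treatment effect, then enumerates every way of assigning two of the five units to treatment and averages the difference-in-means estimates. The choice of two treated units is an assumption made for illustration, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Potential outcomes from the table above, units 1 to 5.
y0 = [4, 2, 1, 1, 2]
y1 = [3, 3, 3, 3, 3]

tau_i = [b - a for a, b in zip(y0, y1)]      # unit-level effects
ate = Fraction(sum(tau_i), len(tau_i))       # average treatment effect

def diff_in_means(treated):
    """Difference in observed mean outcomes for one assignment."""
    control = [i for i in range(len(y0)) if i not in treated]
    mean_t = Fraction(sum(y1[i] for i in treated), len(treated))
    mean_c = Fraction(sum(y0[i] for i in control), len(control))
    return mean_t - mean_c

# Every assignment with exactly two treated units (assumed design).
estimates = [diff_in_means(t) for t in combinations(range(5), 2)]
expected_estimate = sum(estimates) / len(estimates)

print("unit effects:", tau_i)
print("ATE:", ate)
print("distinct estimates:", sorted(set(estimates)))
print("average estimate:", expected_estimate)
```

Any single estimate can miss the average treatment effect; the equality holds in expectation over the possible assignments.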
Consider the following potential outcomes table:
In treatment? | Y(0) | Y(1) |
---|---|---|
Yes | | 2 |
No | 3 | |
No | 1 | |
Yes | | 3 |
Yes | | 3 |
No | 2 | |
Questions for us: Fill in the blanks.
What is the actual treatment effect?
Take a short break!
Experiments often give rise to endogenous subgroups. The potential outcomes framework can make it clear why this can cause problems.
Problems arise in analyses of subgroups when the categories themselves are affected by treatment
Example from our work:
Type | V(0) | V(1) | R(0,1) | R(1,1) | R(0,0) | R(1,0) |
---|---|---|---|---|---|---|
Type 1 (reporter) | 1 | 1 | 1 | 1 | 0 | 0 |
Type 2 (non reporter) | 1 | 0 | 0 | 0 | 0 | 0 |
Expected reporting given violence in control = Pr(Type 1)
Expected reporting given violence in treatment = 100%
Question: What is the actual effect of treatment on the propensity to report violence?
It is possible that in truth no one’s reporting behavior has changed; what has changed is who experiences violence, with treatment shifting violence toward the people who are more likely to report:
Condition | Reporter | No Violence | Violence | % Report |
---|---|---|---|---|
Control | Yes | 25 | 25 | \(\frac{25}{25+25}=50\%\) |
Control | No | 25 | 25 | |
Treatment | Yes | 25 | 25 | \(\frac{25}{25+0}=100\%\) |
Treatment | No | 50 | 0 | |
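The arithmetic in the table above can be reproduced with a short Python sketch. The dictionary layout and function names are illustrative; the counts are those in the table.

```python
# counts[arm][reporter_type] = (no violence, violence), as in the table above.
counts = {
    "control":   {"reporter": (25, 25), "non_reporter": (25, 25)},
    "treatment": {"reporter": (25, 25), "non_reporter": (50, 0)},
}

def report_rate_given_violence(arm):
    """Share of people experiencing violence who are reporter types."""
    rep = counts[arm]["reporter"][1]
    non = counts[arm]["non_reporter"][1]
    return rep / (rep + non)

def share_of_reporters(arm):
    """Share of all people in the arm who are reporter types."""
    rep = sum(counts[arm]["reporter"])
    total = rep + sum(counts[arm]["non_reporter"])
    return rep / total

for arm in counts:
    print(arm, report_rate_given_violence(arm), share_of_reporters(arm))
```

The reporting rate among those experiencing violence doubles, yet the share of reporter types is the same in both arms: the composition of the subgroup changed, not anyone's behavior.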
This problem can arise as easily in seemingly simple field experiments. Example:
What’s the problem?
Question for us:
Which problems face an endogenous subgroup issue?:
In such cases you can:
Pair | I | I | II | II | |
---|---|---|---|---|---|
Unit | 1 | 2 | 3 | 4 | Average |
Y(0) | 0 | 0 | 0 | 0 | |
Y(1) | -3 | 1 | 1 | 1 | |
\(\tau\) | -3 | 1 | 1 | 1 | 0 |
Pair | I | I | II | II | |
---|---|---|---|---|---|
Unit | 1 | 2 | 3 | 4 | Average |
Y(0) | 0 | | 0 | | 0 |
Y(1) | | 1 | | 1 | 1 |
\(\hat{\tau}\) | | | | | 1 |
Pair | I | I | II | II | |
---|---|---|---|---|---|
Unit | 1 | 2 | 3 | 4 | Average |
Y(0) | | [0] | 0 | | 0 |
Y(1) | [-3] | | | 1 | 1 |
\(\hat{\tau}\) | | | | | 1 |
Note: The right way to think about this is that bias is a property of the strategy over possible realizations of data and not normally a property of the estimator conditional on the data.
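One way to see this, sketched in Python under stated assumptions: exactly one unit per pair is treated, and the analyst drops pair I whenever unit 1 lands in treatment. That rule is our reading of the bracketed entries above; the source does not spell it out. The function and variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# True potential outcomes from the first table above (units 1 to 4).
y0 = {1: 0, 2: 0, 3: 0, 4: 0}
y1 = {1: -3, 2: 1, 3: 1, 4: 1}
pairs = {"I": (1, 2), "II": (3, 4)}

def estimate(treated, keep):
    """Difference in means using only the pairs listed in `keep`."""
    units = [u for p in keep for u in pairs[p]]
    t = [y1[u] for u in units if u in treated]
    c = [y0[u] for u in units if u not in treated]
    return Fraction(sum(t), len(t)) - Fraction(sum(c), len(c))

# All four assignments with exactly one treated unit per pair.
assignments = [set(a) for a in product(pairs["I"], pairs["II"])]

keep_all = [estimate(a, ["I", "II"]) for a in assignments]
# Assumed strategy: drop pair I whenever unit 1 is treated.
drop_rule = [estimate(a, ["II"] if 1 in a else ["I", "II"]) for a in assignments]

print("keep all pairs, average estimate:", sum(keep_all) / len(keep_all))
print("drop-pair rule, average estimate:", sum(drop_rule) / len(drop_rule))
```

When unit 1 is in control, both strategies return the same estimate from the same data. They differ only in what they would have done under other assignments, and that is where the bias of the dropping rule arises.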
Multistage games can also present an endogenous group problem since collections of late stage players facing a given choice have been created by early stage players.
Question: Does visibility alter the extent to which subjects follow norms to punish antisocial behavior (and reward prosocial behavior)? Consider a trust game in which we are interested in how information on receivers affects their actions
Visibility Treatment | % invested (average) | Average % returned when 10% invested | Average % returned when 50% invested |
---|---|---|---|
Control: Masked information on respondents | 30% | 20% | 40% |
Treatment: Full information on respondents | 30% | 0% | 60% |
What do we think? Does visibility make people react more to investments?
Imagine you could see all the potential outcomes, and they looked like this:
Responder’s return decision (given type):
Offered behavior | Nice 1 | Nice 2 | Nice 3 | Mean 1 | Mean 2 | Mean 3 | Avg. |
---|---|---|---|---|---|---|---|
Invest 10% | 60% | 60% | 60% | 0% | 0% | 0% | 30% |
Invest 50% | 60% | 60% | 60% | 0% | 0% | 0% | 30% |
Conclusion: Both the offer and the information condition are completely irrelevant for all subjects.
Unfortunately you only see a sample of the potential outcomes, and that looks like this:
Responder’s return decision (given type):
Offered behavior | Nice 1 | Nice 2 | Nice 3 | Mean 1 | Mean 2 | Mean 3 | Avg. |
---|---|---|---|---|---|---|---|
Invest 10% | | | | 0% | 0% | 0% | 0% |
Invest 50% | 60% | 60% | 60% | | | | 60% |
False Conclusion: When not protected, responders condition behavior strongly on offers (because offerers can select on type accurately)
In fact: The nice types invest more because they are nice. The responders return more to the nice types because they are nice.
Unfortunately you only see a (noisier!) sample of the potential outcomes, and that looks like this:
Responder’s return decision (given type):
Offered behavior | Nice 1 | Nice 2 | Nice 3 | Mean 1 | Mean 2 | Mean 3 | Avg. |
---|---|---|---|---|---|---|---|
Invest 10% | 60% | | | 0% | 0% | | 20% |
Invest 50% | | 60% | 60% | | | 0% | 40% |
False Conclusion: When protected, responders condition behavior less strongly on offers (because offerers can select on type less accurately)
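The selection logic behind both false conclusions can be sketched in Python. Return decisions are fixed by type, as in the full potential outcomes table; which types face which offer is an illustrative assumption chosen to match the row averages shown above. All names are hypothetical.

```python
# Each type's return decision (%) is the same whatever the offer (full table above).
returns = {"Nice 1": 60, "Nice 2": 60, "Nice 3": 60,
           "Mean 1": 0, "Mean 2": 0, "Mean 3": 0}

def average_return_by_offer(offer_faced):
    """Average observed return (%) at each offer level."""
    out = {}
    for offer in sorted(set(offer_faced.values())):
        group = [returns[t] for t, o in offer_faced.items() if o == offer]
        out[offer] = sum(group) / len(group)
    return out

# Full information: offers sort perfectly on type (assumed sorting).
visible = {"Nice 1": 50, "Nice 2": 50, "Nice 3": 50,
           "Mean 1": 10, "Mean 2": 10, "Mean 3": 10}
# Masked information: sorting is noisier (assumed: one mismatch in each direction).
masked = {"Nice 1": 10, "Nice 2": 50, "Nice 3": 50,
          "Mean 1": 10, "Mean 2": 10, "Mean 3": 50}

print("visible:", average_return_by_offer(visible))
print("masked: ", average_return_by_offer(masked))
```

The gap in average returns between offers differs across conditions even though no one's return decision depends on the offer or on the information condition.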
What to do?
Solutions?
Take away: Proceed with extreme caution when estimating effects beyond the first stage.
Take a short break!
Directed Acyclic Graphs
The most powerful results from the study of DAGs give procedures for figuring out when conditioning aids or hinders causal identification.
Pearl’s book Causality (Pearl 2009) is the key reference, though see also older work such as Pearl and Paz (1985).
There is a lot of excellent material on Pearl’s page http://bayes.cs.ucla.edu/WHY/
See also excellent material on Felix Elwert’s page http://www.ssc.wisc.edu/~felwert/causality/?page_id=66
Say you don’t like graphs. Fine.
Consider this causal structure:
Say \(Z\) is temporally prior to \(X\); it is correlated with \(Y\) (because of \(U_1\)) and with \(X\) (because of \(U_2\)).
Question: Would it be useful to “control” for \(Z\) when trying to estimate the effect of \(X\) on \(Y\)?
Answer: Hopefully by the end of today you should see that the answer is obviously (or at least, plausibly) “no.”
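A simulation sketch in Python of the structure described above. It assumes linear relations, standard normal noise, and a true effect of \(X\) on \(Y\) of 1; these are all assumptions for illustration, and the names are hypothetical. It compares the coefficient on \(X\) with and without \(Z\) as a control.

```python
import random

random.seed(1)
n = 20000
tau = 1.0  # true effect of X on Y (assumed value)

xs, zs, ys = [], [], []
for _ in range(n):
    u1, u2 = random.gauss(0, 1), random.gauss(0, 1)
    zs.append(u1 + u2 + random.gauss(0, 1))   # Z depends on U1 and U2
    xi = u2 + random.gauss(0, 1)              # X depends on U2 only
    xs.append(xi)
    ys.append(tau * xi + u1 + random.gauss(0, 1))  # Y depends on X and U1

def cov(a, b):
    """Sample covariance (dividing by n)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

# Slope from regressing Y on X alone.
b_without_z = cov(xs, ys) / cov(xs, xs)
# Coefficient on X from regressing Y on X and Z (two-regressor OLS formula).
b_with_z = (cov(zs, zs) * cov(xs, ys) - cov(xs, zs) * cov(zs, ys)) / (
    cov(xs, xs) * cov(zs, zs) - cov(xs, zs) ** 2)

print("without Z:", round(b_without_z, 2))
print("with Z:   ", round(b_with_z, 2))
```

Under these assumptions, conditioning on \(Z\) opens the path \(X \leftarrow U_2 \rightarrow Z \leftarrow U_1 \rightarrow Y\), and the coefficient on \(X\) moves away from the true effect (in this parameterization, from \(\tau\) toward \(\tau - 0.2\)).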
Variable sets \(A\) and \(B\) are conditionally independent, given \(C\) if for all \(a\), \(b\), \(c\):
\[\Pr(A = a | C = c) = \Pr(A = a | B = b, C = c)\]
Informally; given \(C\), knowing \(B\) tells you nothing more about \(A\).
Now we have what we need to simplify: if the Markov condition is satisfied, then instead of writing the full probability as \(P(x) = P(x_1)P(x_2|x_1)P(x_3|x_1, x_2)\) we can write \(P(x) = \prod_i P(x_i |pa_i)\).
If \(P(a,b,c)\) is Markov relative to this graph then: \(C\) is independent of \(A\) given \(B\)
And instead of
\[\Pr(a,b,c) = \Pr(a)\Pr(b|a)\Pr(c|a, b)\]
we could now write:
\[\Pr(a,b,c) = \Pr(a)\Pr(b|a)\Pr(c|b)\]
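A numerical check in Python for a chain \(A \rightarrow B \rightarrow C\) with binary variables. The chain is our assumption about the graph referred to above, and the probabilities are made up for illustration.

```python
from itertools import product

# Hypothetical conditional probabilities for a chain A -> B -> C.
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}   # p_c_given_b[b][c]

# Joint built from the Markov factorization Pr(a) Pr(b|a) Pr(c|b).
joint = {(a, b, c): p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
         for a, b, c in product((0, 1), repeat=3)}

def pr_c_given_ab(c, a, b):
    """Pr(C = c | A = a, B = b) computed from the joint distribution."""
    return joint[(a, b, c)] / sum(joint[(a, b, k)] for k in (0, 1))

# Conditional on B, the value of A makes no difference to C.
for a, b in product((0, 1), repeat=2):
    print(a, b, round(pr_c_given_ab(1, a, b), 3), p_c_given_b[b][1])
```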
We want the graphs to be able to represent the effects of interventions.
Pearl uses do notation to capture this idea.
\[\Pr(X_1, X_2,\dots | do(X_j = x_j))\] or
\[\Pr(X_1, X_2,\dots | \hat{x}_j)\]
denotes the distribution of \(X\) when a particular node (or set of nodes) is intervened upon and forced to a particular level, \(x_j\).
do operations
Note, in general: \[\Pr(X_1, X_2,\dots | do(X_j = x_j')) \neq \Pr(X_1, X_2,\dots | X_j = x_j')\] As an example we might imagine a situation where:
In that case \(\Pr(Y=1 | X = 1) = 1\) but \(\Pr(Y=1 | do(X = 1)) = .5\)
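A minimal Python sketch of one situation that produces these numbers. The setup is an assumption chosen to match them, not necessarily the example intended here: \(Y\) is a fair coin flip and \(X\) simply copies \(Y\), so \(Y\) causes \(X\) and not the reverse.

```python
# Assumed model: Y ~ Bernoulli(0.5), and X is set equal to Y (Y -> X).
p_y = {0: 0.5, 1: 0.5}

# Observational joint distribution over (x, y): mass only where x == y.
observational = {(y, y): p for y, p in p_y.items()}

# Conditioning: Pr(Y = 1 | X = 1).
pr_x1 = sum(p for (x, y), p in observational.items() if x == 1)
pr_y1_given_x1 = observational.get((1, 1), 0.0) / pr_x1

# Intervening: do(X = 1) overrides X's own equation and leaves Y's alone.
interventional = {(1, y): p for y, p in p_y.items()}
pr_y1_do_x1 = interventional[(1, 1)]

print("Pr(Y=1 | X=1)     =", pr_y1_given_x1)
print("Pr(Y=1 | do(X=1)) =", pr_y1_do_x1)
```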
A DAG is a “causal Bayesian network” or “causal DAG” if (and only if) the probability distribution resulting from setting some set \(X_i\) to \(x'_i\) (i.e. \(do(X_i = x_i')\)) is:
\[P_{\hat{x}_i}: P(x_1,x_2,\dots, x_n|\hat{x}_i) = \mathbb{I}(x_i = x_i')\prod_{j \neq i}P(x_j|pa_j)\]
This means that there is only probability mass on vectors in which \(x_i = x_i'\) (reflecting the success of control) and all other variables are determined by their parents, given the values that have been set for \(x_i\).
Illustration: say binary \(X\) causes binary \(M\), which causes binary \(Y\), and say we intervene and set \(M=1\). Then what is the distribution of \((x,m,y)\)?
It is:
\[\Pr(x,m,y \mid do(M = 1)) = \Pr(x)\,\mathbb I(m = 1)\Pr(y|m)\]
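This truncated factorization can be computed directly. A Python sketch for \(X \rightarrow M \rightarrow Y\), with made-up conditional probabilities (all numbers and names are hypothetical):

```python
from itertools import product

# Hypothetical model for X -> M -> Y, all binary.
p_x = {0: 0.6, 1: 0.4}
# Pr(m | x) governs M only without intervention; do(M = 1) discards it.
p_m_given_x = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # p_m_given_x[x][m]
p_y_given_m = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_y_given_m[m][y]

def pr_do_m(x, m, y, m_set=1):
    """Pr(x, m, y | do(M = m_set)): drop Pr(m|x), keep the other factors."""
    indicator = 1.0 if m == m_set else 0.0
    return p_x[x] * indicator * p_y_given_m[m][y]

post = {(x, m, y): pr_do_m(x, m, y) for x, m, y in product((0, 1), repeat=3)}

# X keeps its original distribution; Y follows Pr(y | m = 1).
pr_x1 = sum(p for (x, m, y), p in post.items() if x == 1)
pr_y1 = sum(p for (x, m, y), p in post.items() if y == 1)
print("Pr(X=1 | do(M=1)) =", round(pr_x1, 3))
print("Pr(Y=1 | do(M=1)) =", round(pr_y1, 3))
```

The intervention leaves the distribution of \(X\) untouched, since \(X\) is upstream of \(M\), while \(Y\) responds only through the value set for \(M\).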
We now have a well defined sense in which the arrows on a graph represent a causal structure and capture the conditional independence relations implied by the causal structure.
Of course any graph might represent many different probability distributions \(P\)
We can now start reading off from a graph when there is or is not conditional independence between sets of variables
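\(A\) and \(B\) are conditionally independent, given \(C\), if on every path between \(A\) and \(B\):
there is some chain (\(\bullet \rightarrow \bullet \rightarrow \bullet\)) or fork (\(\bullet \leftarrow \bullet \rightarrow \bullet\)) with the central element in \(C\),
or
there is some inverted fork (\(\bullet \rightarrow \bullet \leftarrow \bullet\)) with the central element, and all of its descendants, not in \(C\).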
Notes:
Are \(A\) and \(D\) unconditionally independent?
Now: say we removed the arrow from \(X\) to \(Y\).
Would you expect to see a correlation between \(X\) and \(Y\) if you did not control for \(Z\)?
Would you expect to see a correlation between \(X\) and \(Y\) if you did control for \(Z\)?
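A “causal model” is:
1. Variables: an ordered list of \(n\) endogenous nodes, \(\mathcal{V}= (V^1, V^2,\dots, V^n)\), and a list of \(n\) exogenous nodes, \(\Theta = (\theta^1, \theta^2,\dots, \theta^n)\)
2. A list of \(n\) functions \(\mathcal{F}= (f^1, f^2,\dots, f^n)\), one for each element of \(\mathcal{V}\), such that each \(f^i\) takes as arguments \(\theta^i\) as well as elements of \(\mathcal{V}\) that are prior to \(V^i\) in the ordering
3. A probability distribution over \(\Theta\)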
Learning about effects given a model means learning about \(\mathcal{F}\) and also the distribution of shocks (\(\Theta\)).
For discrete data this can be reduced to a question about learning about the distribution of \(\Theta\) only.
For instance the simplest model consistent with \(X \rightarrow Y\):
Endogenous Nodes = \(\{X, Y\}\), both with range \(\{0,1\}\)
Exogenous Nodes = \(\{\theta^X, \theta^Y\}\), with ranges \(\{\theta^X_0, \theta^X_1\}\) and \(\{\theta^Y_{00}, \theta^Y_{01}, \theta^Y_{10}, \theta^Y_{11}\}\)
Functional equations:
Distributions on \(\Theta\): \(\Pr(\theta^i = \theta^i_k) = \lambda^i_k\)
What is the probability that \(X\) has a positive causal effect on \(Y\)?
This is equivalent to: \(\Pr(\theta^Y =\theta^Y_{01}) = \lambda^Y_{01}\)
So we want to learn about the distributions of the exogenous nodes
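A Python sketch of this simplest model. It assumes the usual reading of the subscripts, which the text above implies but does not state: \(\theta^Y_{ab}\) means \(Y = a\) when \(X = 0\) and \(Y = b\) when \(X = 1\), and \(\theta^X_x\) means \(X = x\). The \(\lambda\) values and function names are hypothetical.

```python
# Hypothetical distributions over the exogenous nodes (the lambdas).
lambda_x = {"0": 0.5, "1": 0.5}
lambda_y = {"00": 0.2, "01": 0.4, "10": 0.1, "11": 0.3}

def f_x(theta_x):
    """Assumed functional equation for X: theta^X_x sets X = x."""
    return int(theta_x)

def f_y(x, theta_y):
    """Assumed functional equation for Y: theta^Y_ab gives a if x == 0, b if x == 1."""
    return int(theta_y[x])

# Probability that X has a positive effect on Y: Y(0) = 0 and Y(1) = 1.
pr_positive = sum(p for t, p in lambda_y.items()
                  if f_y(1, t) - f_y(0, t) == 1)

# Probability that Y = 1, integrating over both exogenous nodes.
pr_y1 = sum(px * py * f_y(f_x(tx), ty)
            for tx, px in lambda_x.items()
            for ty, py in lambda_y.items())

print("Pr(positive effect) =", pr_positive)
print("Pr(Y = 1)           =", round(pr_y1, 3))
```

With the functional equations fixed, every causal quantity here is a function of the \(\lambda\) values alone, which is why learning about the model reduces to learning about the distribution of \(\Theta\).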
http://egap.org/resources/guides/causality/