Causation as difference making
The intervention-based motivation for understanding causal effects:
The problem in 2 is that you need to know what would have happened if things were different. You need information on a counterfactual.
Idea: A causal claim is (in part) a claim about something that did not happen. This makes it metaphysical.
Now that we have a concept of causal effects available, let’s answer two questions:
TRANSITIVITY: If for a given unit \(A\) causes \(B\) and \(B\) causes \(C\), does that mean that \(A\) causes \(C\)?
A boulder is flying down a mountain. You duck. This saves your life.
So the boulder caused the ducking and the ducking caused you to survive.
So: did the boulder cause you to survive?
CONNECTEDNESS Say \(A\) causes \(B\) — does that mean that there is a spatiotemporally continuous sequence of causal intermediates?
The counterfactual model is about contribution and attribution in a very specific sense.
Consider an outcome \(Y\) that might depend on two causes \(X_1\) and \(X_2\):
\[Y(0,0) = 0, \quad Y(1,0) = 0, \quad Y(0,1) = 0, \quad Y(1,1) = 1\]
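To make the symmetry concrete, here is a minimal sketch of this potential outcomes function (the function name is just for illustration):

```r
# Y is 1 only when both causes are present
Y <- function(x1, x2) as.numeric(x1 == 1 & x2 == 1)

Y(1, 1) - Y(0, 1)   # difference X1 makes when X2 = 1: 1
Y(1, 1) - Y(1, 0)   # difference X2 makes when X1 = 1: 1
Y(1, 0) - Y(0, 0)   # difference X1 makes when X2 = 0: 0
```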
What caused \(Y\)? Which cause was most important?
The counterfactual model is about attribution in a very conditional sense.
This is a problem for research programs that define “explanation” in terms of figuring out the things that cause \(Y\)
There are real difficulties in conceptualizing what it means to say that one cause is more important than another. What would that even mean?
Erdogan’s increasing authoritarianism was the most important reason for the attempted coup
More uncomfortably:
What does it mean to say that the tides are caused by the moon? What exactly do we have to imagine…
Jack exploited Jill
It’s Jill’s fault that bucket fell
Jack is the most obstructionist member of Congress
Melania Trump stole from Michelle Obama’s speech
Activists need causal claims
This is sometimes called a “switching equation”: \[Y_i = Z_iY_i(1) + (1-Z_i)Y_i(0)\]
In DeclareDesign, \(Y\) is realised from potential outcomes and assignment in this way using reveal_outcomes
Say \(Z\) is a random variable; then this is a sort of data generating process. BUT the key thing to note is that the potential outcomes \(Y_i(0)\) and \(Y_i(1)\) are fixed attributes of units: randomness enters only through the assignment \(Z\)
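A minimal sketch of such a design (the sample size, noise, and unit effect of 1 are illustrative assumptions):

```r
library(DeclareDesign)

design <-
  declare_model(
    N = 100,
    U = rnorm(N),
    potential_outcomes(Y ~ Z + U)    # fixed Y(0), Y(1) for each unit
  ) +
  declare_assignment(Z = complete_ra(N, prob = 0.5)) +  # only source of randomness
  declare_measurement(Y = reveal_outcomes(Y ~ Z))       # the switching equation

head(draw_data(design))
```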
Now for some magic. We really want to estimate: \[ \tau_i = Y_i(1) - Y_i(0)\]
BUT: We never can observe both \(Y_i(1)\) and \(Y_i(0)\)
Say we lower our sights and try to estimate an average treatment effect: \[ \tau = \mathbb{E} [Y(1)-Y(0)]\]
Now make use of the fact that \[\mathbb E[Y(1)-Y(0)] = \mathbb E[Y(1)]- \mathbb E [Y(0)] \]
In words: The average of differences is equal to the difference of averages.
The magic is that while we cannot hope to measure the individual differences, we are good at measuring averages.
This provides a positive argument for causal inference from randomization, rather than simply saying that with randomization “everything else is controlled for”
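A quick simulation of this logic (the sample size and constant effect of 2 are illustrative):

```r
set.seed(42)
n  <- 1000
Y0 <- rnorm(n)                 # potential outcomes under control
Y1 <- Y0 + 2                   # every unit's effect is 2
Z  <- sample(rep(0:1, n / 2))  # complete random assignment
Y  <- Z * Y1 + (1 - Z) * Y0    # the switching equation

mean(Y[Z == 1]) - mean(Y[Z == 0])  # close to the true ATE of 2
```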
Let’s discuss:
Idea: random assignment is random sampling from potential worlds: to understand anything you find, you need to know the sampling weights
Idea: We now have a positive argument for claiming unbiased estimation of the average treatment effect following random assignment
But is the average treatment effect a quantity of social scientific interest?
The average of the differences \(\approx\) difference of averages
Question: \(\approx\) or \(=\)?
Consider the following potential outcomes table:
Unit | Y(0) | Y(1) | \(\tau_i\) |
---|---|---|---|
1 | 4 | 3 | |
2 | 2 | 3 | |
3 | 1 | 3 | |
4 | 1 | 3 | |
5 | 2 | 3 | |
Questions for us: What are the unit level treatment effects? What is the average treatment effect?
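For reference, the arithmetic as a quick R check:

```r
Y0  <- c(4, 2, 1, 1, 2)
Y1  <- c(3, 3, 3, 3, 3)
tau <- Y1 - Y0    # unit-level effects: -1  1  2  2  1
mean(tau)         # average treatment effect: 1
```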
Consider the following potential outcomes table:
In treatment? | Y(0) | Y(1) |
---|---|---|
Yes | | 2 |
No | 3 | |
No | 1 | |
Yes | | 3 |
Yes | | 3 |
No | 2 | |
Questions for us: Fill in the blanks.
What is the actual treatment effect?
Take a short break!
Experiments often give rise to endogenous subgroups. The potential outcomes framework can make it clear why this can cause problems.
Problems arise in analyses of subgroups when the categories themselves are affected by treatment
Example from our work:
Type | V(0) | V(1) | R(0,1) | R(1,1) | R(0,0) | R(1,0) |
---|---|---|---|---|---|---|
Type 1 (reporter) | 1 | 1 | 1 | 1 | 0 | 0 |
Type 2 (non-reporter) | 1 | 0 | 0 | 0 | 0 | 0 |
Here \(V(z)\) is whether violence is experienced under treatment status \(z\), and \(R(z, v)\) is whether violence is reported under treatment status \(z\) when violence is \(v\).
Expected reporting given violence in control = Pr(Type 1) (explanation: both types see violence but only Type 1 reports)
Expected reporting given violence in treatment = 100% (explanation: only Type 1 sees violence and this type also reports)
So you might infer a large effect on violence reporting.
Question: What is the actual effect of treatment on the propensity to report violence?
It is possible that in truth no one’s reporting behavior has changed; what has changed is the propensity of people with different propensities to report to experience violence:
Condition | Reporter | No Violence | Violence | % Report |
---|---|---|---|---|
Control | Yes | 25 | 25 | \(\frac{25}{25+25}=50\%\) |
Control | No | 25 | 25 | |
Treatment | Yes | 25 | 25 | \(\frac{25}{25+0}=100\%\) |
Treatment | No | 50 | 0 | |
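A quick check of the reporting rates implied by these counts:

```r
# Counts of those experiencing violence, by reporter type
violence_control   <- c(reporter = 25, non_reporter = 25)
violence_treatment <- c(reporter = 25, non_reporter = 0)

violence_control["reporter"] / sum(violence_control)      # 50% report
violence_treatment["reporter"] / sum(violence_treatment)  # 100% report
```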
This problem can arise as easily in seemingly simple field experiments. Example:
What’s the problem?
Question for us:
Which problems face an endogenous subgroup issue?
In such cases you can:
Pair | I | I | II | II | |
---|---|---|---|---|---|
Unit | 1 | 2 | 3 | 4 | Average |
Y(0) | 0 | 0 | 0 | 0 | |
Y(1) | -3 | 1 | 1 | 1 | |
\(\tau\) | -3 | 1 | 1 | 1 | 0 |
Pair | I | I | II | II | |
---|---|---|---|---|---|
Unit | 1 | 2 | 3 | 4 | Average |
Y(0) | | 0 | 0 | 0 | |
Y(1) | | 1 | 1 | 1 | |
\(\hat{\tau}\) | | | | | 1 |
Pair | I | I | II | II | |
---|---|---|---|---|---|
Unit | 1 | 2 | 3 | 4 | Average |
Y(0) | [0] | 0 | 0 | 0 | |
Y(1) | [-3] | 1 | 1 | 1 | |
\(\hat{\tau}\) | | | | | 1 |
Note: The right way to think about this is that bias is a property of the strategy over possible realizations of data and not normally a property of the estimator conditional on the data.
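A minimal sketch for the four units above: averaging the estimator over the four equally likely assignments (one treated unit per pair) recovers the true ATE of 0, even though no single realization does:

```r
Y0 <- c(0, 0, 0, 0)
Y1 <- c(-3, 1, 1, 1)    # true ATE = 0

assignments <- list(c(1, 3), c(1, 4), c(2, 3), c(2, 4))
estimates <- sapply(assignments, function(treated) {
  control <- setdiff(1:4, treated)
  mean(Y1[treated]) - mean(Y0[control])
})
estimates         # -1 -1  1  1: each realization is off
mean(estimates)   #  0: the strategy is unbiased over realizations
```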
Multistage games can also present an endogenous group problem since collections of late stage players facing a given choice have been created by early stage players.
Question: Does visibility alter the extent to which subjects follow norms to punish antisocial behavior (and reward prosocial behavior)? Consider a trust game in which we are interested in how information on receivers affects their actions
Average % returned

Visibility Treatment | % returned (average) | ...when 10% invested | ...when 50% invested |
---|---|---|---|
Control: Masked information on respondents | 30% | 20% | 40% |
Treatment: Full information on respondents | 30% | 0% | 60% |
What do we think? Does visibility make people react more to investments?
Imagine you could see all the potential outcomes, and they looked like this:
Responder’s return decision (given type)

Offered behavior | Nice 1 | Nice 2 | Nice 3 | Mean 1 | Mean 2 | Mean 3 | Avg. |
---|---|---|---|---|---|---|---|
Invest 10% | 60% | 60% | 60% | 0% | 0% | 0% | 30% |
Invest 50% | 60% | 60% | 60% | 0% | 0% | 0% | 30% |
Conclusion: Both the offer and the information condition are completely irrelevant for all subjects.
Unfortunately you only see a sample of the potential outcomes, and that looks like this:
Responder’s return decision (given type)

Offered behavior | Nice 1 | Nice 2 | Nice 3 | Mean 1 | Mean 2 | Mean 3 | Avg. |
---|---|---|---|---|---|---|---|
Invest 10% | | | | 0% | 0% | 0% | 0% |
Invest 50% | 60% | 60% | 60% | | | | 60% |
False Conclusion: When not protected, responders condition behavior strongly on offers (because offerers can select on type accurately)
In fact: The nice types invest more because they are nice. The responders return more to the nice types because they are nice.
Unfortunately you only see a (noisier!) sample of the potential outcomes, and that looks like this:
Responder’s return decision (given type)

Offered behavior | Nice 1 | Nice 2 | Nice 3 | Mean 1 | Mean 2 | Mean 3 | Avg. |
---|---|---|---|---|---|---|---|
Invest 10% | 60% | | | 0% | 0% | | 20% |
Invest 50% | | 60% | 60% | | | 0% | 40% |
False Conclusion: When protected, responders condition behavior less strongly on offers (because offerers can select on type less accurately)
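A hypothetical sketch of how selection alone produces both patterns: returns depend only on responder type, and the conditions differ only in how accurately offerers target nice types:

```r
# Returns depend only on type: nice types return 60%, mean types 0%
nice       <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
return_pct <- ifelse(nice, 60, 0)

# Full information: offerers invest 50% with nice types, 10% with mean types
invest_full <- ifelse(nice, 50, 10)
tapply(return_pct, invest_full, mean)    # 10% -> 0%,  50% -> 60%

# Masked information: targeting is only partially accurate
invest_masked <- c(50, 50, 10, 10, 10, 50)
tapply(return_pct, invest_masked, mean)  # 10% -> 20%, 50% -> 40%
```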
What to do?
Solutions?
Take away: Proceed with extreme caution when estimating effects beyond the first stage.
Take a short break!
Directed Acyclic Graphs
The most powerful results from the study of DAGs give procedures for figuring out when conditioning aids or hinders causal identification.
Pearl’s book Causality (Pearl 2009) is the key reference (though see also older work such as Pearl and Paz (1985))
There is a lot of excellent material on Pearl’s page http://bayes.cs.ucla.edu/WHY/
See also excellent material on Felix Elwert’s page http://www.ssc.wisc.edu/~felwert/causality/?page_id=66
Say you don’t like graphs. Fine.
Consider this causal structure:
Say \(Z\) is temporally prior to \(X\); it is correlated with \(Y\) (because of \(U_1\)) and with \(X\) (because of \(U_2\)).
Question: Would it be useful to “control” for \(Z\) when trying to estimate the effect of \(X\) on \(Y\)?
Answer: Hopefully by the end of today you should see that the answer is obviously (or at least, plausibly) “no.”
Variable sets \(A\) and \(B\) are conditionally independent, given \(C\) if for all \(a\), \(b\), \(c\):
\[\Pr(A = a | C = c) = \Pr(A = a | B = b, C = c)\]
Informally: given \(C\), knowing \(B\) tells you nothing more about \(A\).
Three elemental relations of conditional independence: the chain \(A \rightarrow C \rightarrow B\), the fork \(A \leftarrow C \rightarrow B\), and the collider \(A \rightarrow C \leftarrow B\).
\(A\) and \(B\) are conditionally independent, given \(C\), if on every path between \(A\) and \(B\) either:
- some non-collider on the path is in \(C\), or
- some collider on the path is not in \(C\) and has no descendant in \(C\)
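A quick simulated illustration of the collider case (the data generating process is purely illustrative):

```r
set.seed(7)
n <- 10000
A <- rnorm(n)
B <- rnorm(n)            # A and B are unconditionally independent
C <- A + B + rnorm(n)    # collider: A -> C <- B

cor(A, B)                # approx 0
# "Conditioning" on C (partialling it out) induces dependence:
cor(resid(lm(A ~ C)), resid(lm(B ~ C)))   # clearly negative
```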
Notes:
Are \(A\) and \(D\) unconditionally independent?
Now: say we removed the arrow from \(X\) to \(Y\)
do operations
A DAG is a “causal Bayesian network” or “causal DAG” if (and only if) the probability distribution resulting from setting some set \(X_i\) to \(x_i'\) (i.e. \(do(X_i = x_i')\)) is:
\[P_{\hat{x}_i}: \quad P(x_1, x_2, \dots, x_n \mid \hat{x}_i) = \mathbb{I}(x_i = x_i')\prod_{j \neq i} P(x_j \mid pa_j)\]
This means that there is only probability mass on vectors in which \(x_i = x_i'\) (reflecting the success of control) and all other variables are determined by their parents, given the values that have been set for \(x_i\).
do operations
Illustration: say we have binary \(X\), which causes binary \(M\), which causes binary \(Y\); say we intervene and set \(M = 1\). Then what is the distribution of \((x, m, y)\)?
It is:
\[\Pr(x, m, y) = \Pr(x)\,\mathbb{I}(m = 1)\,\Pr(y \mid m)\]
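As a sketch, this truncated factorization can be computed by direct enumeration; the distributions for \(X\) and \(Y\) below are illustrative assumptions:

```r
p_x <- 0.5                         # assumed Pr(X = 1)
p_y <- c("0" = 0.2, "1" = 0.8)     # assumed Pr(Y = 1 | M = m)

grid <- expand.grid(x = 0:1, m = 0:1, y = 0:1)
grid$prob <- with(grid,
  ifelse(x == 1, p_x, 1 - p_x) *   # Pr(x): X keeps its own distribution
  (m == 1) *                       # I(m = 1): the do() operation
  ifelse(y == 1, p_y[as.character(m)], 1 - p_y[as.character(m)])
)
grid   # all probability mass sits on rows with m = 1
```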
We will use these ideas to motivate a general procedure for learning about, updating over, and querying, causal models.
A “causal model” is:
1. Variables: an ordered list of \(n\) endogenous variables \(\mathcal{V} = (V^1, V^2, \dots, V^n)\), together with exogenous variables (shocks) \(\Theta = (\theta^1, \theta^2, \dots, \theta^n)\)
2. A list of \(n\) functions \(\mathcal{F} = (f^1, f^2, \dots, f^n)\), one for each element of \(\mathcal{V}\), such that each \(f^i\) takes as arguments \(\theta^i\) as well as elements of \(\mathcal{V}\) that are prior to \(V^i\) in the ordering
3. A probability distribution over \(\Theta\)
A simple causal model in which high inequality (\(I\)) affects democratization (\(D\)) via redistributive demands (\(R\)) and mass mobilization (\(M\)), which is also a function of ethnic homogeneity (\(E\)). Arrows show relations of causal dependence between variables.
Learning about effects given a model means learning about \(\mathcal{F}\) and also the distribution of shocks (\(\Theta\)).
For discrete data this can be reduced to a question about learning about the distribution of \(\Theta\) only.
For instance the simplest model consistent with \(X \rightarrow Y\):
Endogenous Nodes = \(\{X, Y\}\), both with range \(\{0,1\}\)
Exogenous Nodes = \(\{\theta^X, \theta^Y\}\), with ranges \(\{\theta^X_0, \theta^X_1\}\) and \(\{\theta^Y_{00}, \theta^Y_{01}, \theta^Y_{10}, \theta^Y_{11}\}\)
Functional equations: \(X = \theta^X\) and \(Y = f^Y(X, \theta^Y)\), where type \(\theta^Y_{ij}\) sets \(Y = i\) if \(X = 0\) and \(Y = j\) if \(X = 1\)
Distributions on \(\Theta\): \(\Pr(\theta^i = \theta^i_k) = \lambda^i_k\)
What is the probability that \(X\) has a positive causal effect on \(Y\)?
This is equivalent to: \(\Pr(\theta^Y =\theta^Y_{01}) = \lambda^Y_{01}\)
So we want to learn about the distributions of the exogenous nodes
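A minimal sketch of this model in R, with purely illustrative values for the \(\lambda\)s:

```r
# Types: theta^Y = "ij" sets Y = i if X = 0 and Y = j if X = 1
lambda_X <- c("0" = 0.5, "1" = 0.5)
lambda_Y <- c("00" = 0.2, "01" = 0.4, "10" = 0.1, "11" = 0.3)

n <- 10000
theta_X <- sample(names(lambda_X), n, replace = TRUE, prob = lambda_X)
theta_Y <- sample(names(lambda_Y), n, replace = TRUE, prob = lambda_Y)

X <- as.numeric(theta_X)                        # f^X: X = theta^X
Y <- as.numeric(substr(theta_Y, X + 1, X + 1))  # f^Y: read off the relevant digit

mean(theta_Y == "01")   # Pr(positive effect) = lambda^Y_01 (here 0.4)
```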
This general principle extends to a vast class of causal models