Causation as difference making
The intervention-based motivation for understanding causal effects:
The problem in 2 is that you need to know what would have happened if things were different. You need information on a counterfactual.
Idea: A causal claim is (in part) a claim about something that did not happen. This makes it metaphysical.
Now that we have a concept of causal effects available, let’s answer two questions:
TRANSITIVITY: If for a given unit \(A\) causes \(B\) and \(B\) causes \(C\), does that mean that \(A\) causes \(C\)?
A boulder is flying down a mountain. You duck. This saves your life.
So the boulder caused the ducking and the ducking caused you to survive.
So: did the boulder cause you to survive?
CONNECTEDNESS Say \(A\) causes \(B\) — does that mean that there is a spatiotemporally continuous sequence of causal intermediates?
The counterfactual model is about contribution and attribution in a very specific sense.
Consider an outcome \(Y\) that might depend on two causes \(X_1\) and \(X_2\):
\[Y(0,0) = 0, \quad Y(1,0) = 0, \quad Y(0,1) = 0, \quad Y(1,1) = 1\]
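To make the symmetry concrete, here is a minimal sketch of this potential outcomes function (the function name is just for illustration):

```r
# Y is 1 only when both causes are present
Y <- function(x1, x2) as.numeric(x1 == 1 & x2 == 1)

Y(1, 1) - Y(0, 1)   # difference X1 makes when X2 = 1: 1
Y(1, 1) - Y(1, 0)   # difference X2 makes when X1 = 1: 1
Y(1, 0) - Y(0, 0)   # difference X1 makes when X2 = 0: 0
```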
What caused \(Y\)? Which cause was most important?
The counterfactual model is about attribution in a very conditional sense.
This is a problem for research programs that define “explanation” in terms of figuring out the things that cause \(Y\)
There are real difficulties in conceptualizing what it means to say that one cause is more important than another. What would that even mean?
Erdogan’s increasing authoritarianism was the most important reason for the attempted coup
More uncomfortably:
What does it mean to say that the tides are caused by the moon? What exactly do we have to imagine…
Jack exploited Jill
It’s Jill’s fault that bucket fell
Jack is the most obstructionist member of Congress
Melania Trump stole from Michelle Obama’s speech
Activists need causal claims
This is sometimes called a “switching equation”: \[Y_i = Z_iY_i(1) + (1-Z_i)Y_i(0)\]
In DeclareDesign, \(Y\) is realised from potential outcomes and assignment in this way using reveal_outcomes
Say \(Z\) is a random variable; then this is a sort of data generating process. BUT the key thing to note is that the potential outcomes \(Y_i(0)\) and \(Y_i(1)\) are fixed attributes of units: randomness enters only through the assignment \(Z\)
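A minimal sketch of such a design (the sample size, noise, and unit effect of 1 are illustrative assumptions):

```r
library(DeclareDesign)

design <-
  declare_model(
    N = 100,
    U = rnorm(N),
    potential_outcomes(Y ~ Z + U)    # fixed Y(0), Y(1) for each unit
  ) +
  declare_assignment(Z = complete_ra(N, prob = 0.5)) +  # only source of randomness
  declare_measurement(Y = reveal_outcomes(Y ~ Z))       # the switching equation

head(draw_data(design))
```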
Now for some magic. We really want to estimate: \[ \tau_i = Y_i(1) - Y_i(0)\]
BUT: We never can observe both \(Y_i(1)\) and \(Y_i(0)\)
Say we lower our sights and try to estimate an average treatment effect: \[ \tau = \mathbb{E} [Y(1)-Y(0)]\]
Now make use of the fact that \[\mathbb E[Y(1)-Y(0)] = \mathbb E[Y(1)]- \mathbb E [Y(0)] \]
In words: The average of differences is equal to the difference of averages.
The magic is that while we cannot hope to measure the individual differences, we are good at measuring averages.
This provides a positive argument for causal inference from randomization, rather than simply saying that with randomization “everything else is controlled for”
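A quick simulation of this logic (the sample size and constant effect of 2 are illustrative):

```r
set.seed(42)
n  <- 1000
Y0 <- rnorm(n)                 # potential outcomes under control
Y1 <- Y0 + 2                   # every unit's effect is 2
Z  <- sample(rep(0:1, n / 2))  # complete random assignment
Y  <- Z * Y1 + (1 - Z) * Y0    # the switching equation

mean(Y[Z == 1]) - mean(Y[Z == 0])  # close to the true ATE of 2
```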
Let’s discuss:
Idea: random assignment is random sampling from potential worlds: to understand anything you find, you need to know the sampling weights
Idea: We now have a positive argument for claiming unbiased estimation of the average treatment effect following random assignment
But is the average treatment effect a quantity of social scientific interest?
The average of the differences \(\approx\) difference of averages
Question: \(\approx\) or \(=\)?
Consider the following potential outcomes table:
Unit | Y(0) | Y(1) | \(\tau_i\) |
---|---|---|---|
1 | 4 | 3 | |
2 | 2 | 3 | |
3 | 1 | 3 | |
4 | 1 | 3 | |
5 | 2 | 3 | |
Questions for us: What are the unit level treatment effects? What is the average treatment effect?
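For reference, the arithmetic as a quick R check:

```r
Y0  <- c(4, 2, 1, 1, 2)
Y1  <- c(3, 3, 3, 3, 3)
tau <- Y1 - Y0    # unit-level effects: -1  1  2  2  1
mean(tau)         # average treatment effect: 1
```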
Consider the following potential outcomes table:
In treatment? | Y(0) | Y(1) |
---|---|---|
Yes | | 2 |
No | 3 | |
No | 1 | |
Yes | | 3 |
Yes | | 3 |
No | 2 | |
Questions for us: Fill in the blanks.
What is the actual treatment effect?
Take a short break!
Experiments often give rise to endogenous subgroups. The potential outcomes framework can make it clear why this can cause problems.
Problems arise in analyses of subgroups when the categories themselves are affected by treatment
Example from our work:
Type | V(0) | V(1) | R(0,1) | R(1,1) | R(0,0) | R(1,0) |
---|---|---|---|---|---|---|
Type 1 (reporter) | 1 | 1 | 1 | 1 | 0 | 0 |
Type 2 (non-reporter) | 1 | 0 | 0 | 0 | 0 | 0 |
Here \(V(z)\) is whether violence is experienced under treatment status \(z\), and \(R(z, v)\) is whether violence is reported under treatment status \(z\) when violence is \(v\).
Expected reporting given violence in control = Pr(Type 1) (explanation: both types see violence but only Type 1 reports)
Expected reporting given violence in treatment = 100% (explanation: only Type 1 sees violence and this type also reports)
So you might infer a large effect on violence reporting.
Question: What is the actual effect of treatment on the propensity to report violence?
It is possible that in truth no one’s reporting behavior has changed; what has changed is the propensity of people with different propensities to report to experience violence:
Condition | Reporter | No Violence | Violence | % Report |
---|---|---|---|---|
Control | Yes | 25 | 25 | \(\frac{25}{25+25}=50\%\) |
Control | No | 25 | 25 | |
Treatment | Yes | 25 | 25 | \(\frac{25}{25+0}=100\%\) |
Treatment | No | 50 | 0 | |
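A quick check of the reporting rates implied by these counts:

```r
# Counts of those experiencing violence, by reporter type
violence_control   <- c(reporter = 25, non_reporter = 25)
violence_treatment <- c(reporter = 25, non_reporter = 0)

violence_control["reporter"] / sum(violence_control)      # 50% report
violence_treatment["reporter"] / sum(violence_treatment)  # 100% report
```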
This problem can arise as easily in seemingly simple field experiments. Example:
What’s the problem?
Question for us:
Which problems face an endogenous subgroup issue?
In such cases you can:
Pair | I | I | II | II | |
---|---|---|---|---|---|
Unit | 1 | 2 | 3 | 4 | Average |
Y(0) | 0 | 0 | 0 | 0 | |
Y(1) | -3 | 1 | 1 | 1 | |
\(\tau\) | -3 | 1 | 1 | 1 | 0 |
Pair | I | I | II | II | |
---|---|---|---|---|---|
Unit | 1 | 2 | 3 | 4 | Average |
Y(0) | | 0 | 0 | 0 | |
Y(1) | | 1 | 1 | 1 | |
\(\hat{\tau}\) | | | | | 1 |
Pair | I | I | II | II | |
---|---|---|---|---|---|
Unit | 1 | 2 | 3 | 4 | Average |
Y(0) | [0] | 0 | 0 | 0 | |
Y(1) | [-3] | 1 | 1 | 1 | |
\(\hat{\tau}\) | | | | | 1 |
Note: The right way to think about this is that bias is a property of the strategy over possible realizations of data and not normally a property of the estimator conditional on the data.
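A minimal sketch for the four units above: averaging the estimator over the four equally likely assignments (one treated unit per pair) recovers the true ATE of 0, even though no single realization does:

```r
Y0 <- c(0, 0, 0, 0)
Y1 <- c(-3, 1, 1, 1)    # true ATE = 0

assignments <- list(c(1, 3), c(1, 4), c(2, 3), c(2, 4))
estimates <- sapply(assignments, function(treated) {
  control <- setdiff(1:4, treated)
  mean(Y1[treated]) - mean(Y0[control])
})
estimates         # -1 -1  1  1: each realization is off
mean(estimates)   #  0: the strategy is unbiased over realizations
```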
Multistage games can also present an endogenous group problem since collections of late stage players facing a given choice have been created by early stage players.
Question: Does visibility alter the extent to which subjects follow norms to punish antisocial behavior (and reward prosocial behavior)? Consider a trust game in which we are interested in how information on receivers affects their actions
Average % returned

Visibility Treatment | % returned (average) | ...when 10% invested | ...when 50% invested |
---|---|---|---|
Control: Masked information on respondents | 30% | 20% | 40% |
Treatment: Full information on respondents | 30% | 0% | 60% |
What do we think? Does visibility make people react more to investments?
Imagine you could see all the potential outcomes, and they looked like this:
Responder’s return decision (given type)

Offered behavior | Nice 1 | Nice 2 | Nice 3 | Mean 1 | Mean 2 | Mean 3 | Avg. |
---|---|---|---|---|---|---|---|
Invest 10% | 60% | 60% | 60% | 0% | 0% | 0% | 30% |
Invest 50% | 60% | 60% | 60% | 0% | 0% | 0% | 30% |
Conclusion: Both the offer and the information condition are completely irrelevant for all subjects.
Unfortunately you only see a sample of the potential outcomes, and that looks like this:
Responder’s return decision (given type)

Offered behavior | Nice 1 | Nice 2 | Nice 3 | Mean 1 | Mean 2 | Mean 3 | Avg. |
---|---|---|---|---|---|---|---|
Invest 10% | | | | 0% | 0% | 0% | 0% |
Invest 50% | 60% | 60% | 60% | | | | 60% |
False Conclusion: When not protected, responders condition behavior strongly on offers (because offerers can select on type accurately)
In fact: The nice types invest more because they are nice. The responders return more to the nice types because they are nice.
Unfortunately you only see a (noisier!) sample of the potential outcomes, and that looks like this:
Responder’s return decision (given type)

Offered behavior | Nice 1 | Nice 2 | Nice 3 | Mean 1 | Mean 2 | Mean 3 | Avg. |
---|---|---|---|---|---|---|---|
Invest 10% | 60% | | | 0% | 0% | | 20% |
Invest 50% | | 60% | 60% | | | 0% | 40% |
False Conclusion: When protected, responders condition behavior less strongly on offers (because offerers can select on type less accurately)
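A hypothetical sketch of how selection alone produces both patterns: returns depend only on responder type, and the conditions differ only in how accurately offerers target nice types:

```r
# Returns depend only on type: nice types return 60%, mean types 0%
nice       <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
return_pct <- ifelse(nice, 60, 0)

# Full information: offerers invest 50% with nice types, 10% with mean types
invest_full <- ifelse(nice, 50, 10)
tapply(return_pct, invest_full, mean)    # 10% -> 0%,  50% -> 60%

# Masked information: targeting is only partially accurate
invest_masked <- c(50, 50, 10, 10, 10, 50)
tapply(return_pct, invest_masked, mean)  # 10% -> 20%, 50% -> 40%
```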
What to do?
Solutions?
Take away: Proceed with extreme caution when estimating effects beyond the first stage.
Take a short break!
Directed Acyclic Graphs
The most powerful results from the study of DAGs give procedures for figuring out when conditioning aids or hinders causal identification.
Pearl’s book Causality (Pearl 2009) is the key reference (though see also older work such as Pearl and Paz (1985))
There is a lot of excellent material on Pearl’s page http://bayes.cs.ucla.edu/WHY/
See also excellent material on Felix Elwert’s page http://www.ssc.wisc.edu/~felwert/causality/?page_id=66
Say you don’t like graphs. Fine.
Consider this causal structure:
Say \(Z\) is temporally prior to \(X\); it is correlated with \(Y\) (because of \(U_1\)) and with \(X\) (because of \(U_2\)).
Question: Would it be useful to “control” for \(Z\) when trying to estimate the effect of \(X\) on \(Y\)?
Answer: Hopefully by the end of today you should see that the answer is obviously (or at least, plausibly) “no.”
Variable sets \(A\) and \(B\) are conditionally independent, given \(C\) if for all \(a\), \(b\), \(c\):
\[\Pr(A = a | C = c) = \Pr(A = a | B = b, C = c)\]
Informally: given \(C\), knowing \(B\) tells you nothing more about \(A\).
Three elemental relations of conditional independence: the chain \(A \rightarrow C \rightarrow B\), the fork \(A \leftarrow C \rightarrow B\), and the collider \(A \rightarrow C \leftarrow B\).
\(A\) and \(B\) are conditionally independent, given \(C\), if on every path between \(A\) and \(B\) either:
- some non-collider on the path is in \(C\), or
- some collider on the path is not in \(C\) and has no descendant in \(C\)
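A quick simulated illustration of the collider case (the data generating process is purely illustrative):

```r
set.seed(7)
n <- 10000
A <- rnorm(n)
B <- rnorm(n)            # A and B are unconditionally independent
C <- A + B + rnorm(n)    # collider: A -> C <- B

cor(A, B)                # approx 0
# "Conditioning" on C (partialling it out) induces dependence:
cor(resid(lm(A ~ C)), resid(lm(B ~ C)))   # clearly negative
```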
Notes:
Are \(A\) and \(D\) unconditionally independent?
Now: say we removed the arrow from \(X\) to \(Y\)
do operations
A DAG is a “causal Bayesian network” or “causal DAG” if (and only if) the probability distribution resulting from setting some set \(X_i\) to \(x_i'\) (i.e. \(do(X_i = x_i')\)) is:
\[P_{\hat{x}_i}: \quad P(x_1, x_2, \dots, x_n \mid \hat{x}_i) = \mathbb{I}(x_i = x_i')\prod_{j \neq i} P(x_j \mid pa_j)\]
This means that there is only probability mass on vectors in which \(x_i = x_i'\) (reflecting the success of control) and all other variables are determined by their parents, given the values that have been set for \(x_i\).
do operations
Illustration: say we have binary \(X\), which causes binary \(M\), which causes binary \(Y\); say we intervene and set \(M = 1\). Then what is the distribution of \((x, m, y)\)?
It is:
\[\Pr(x, m, y) = \Pr(x)\,\mathbb{I}(m = 1)\,\Pr(y \mid m)\]
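As a sketch, this truncated factorization can be computed by direct enumeration; the distributions for \(X\) and \(Y\) below are illustrative assumptions:

```r
p_x <- 0.5                         # assumed Pr(X = 1)
p_y <- c("0" = 0.2, "1" = 0.8)     # assumed Pr(Y = 1 | M = m)

grid <- expand.grid(x = 0:1, m = 0:1, y = 0:1)
grid$prob <- with(grid,
  ifelse(x == 1, p_x, 1 - p_x) *   # Pr(x): X keeps its own distribution
  (m == 1) *                       # I(m = 1): the do() operation
  ifelse(y == 1, p_y[as.character(m)], 1 - p_y[as.character(m)])
)
grid   # all probability mass sits on rows with m = 1
```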
We will use these ideas to motivate a general procedure for learning about, updating over, and querying, causal models.
A “causal model” is:
1. Variables: an ordered list of \(n\) endogenous variables \(\mathcal{V} = (V^1, V^2, \dots, V^n)\), together with exogenous variables (shocks) \(\Theta = (\theta^1, \theta^2, \dots, \theta^n)\)
2. A list of \(n\) functions \(\mathcal{F} = (f^1, f^2, \dots, f^n)\), one for each element of \(\mathcal{V}\), such that each \(f^i\) takes as arguments \(\theta^i\) as well as elements of \(\mathcal{V}\) that are prior to \(V^i\) in the ordering
3. A probability distribution over \(\Theta\)
A simple causal model in which high inequality (\(I\)) affects democratization (\(D\)) via redistributive demands (\(R\)) and mass mobilization (\(M\)), which is also a function of ethnic homogeneity (\(E\)). Arrows show relations of causal dependence between variables.
Learning about effects given a model means learning about \(\mathcal{F}\) and also the distribution of shocks (\(\Theta\)).
For discrete data this can be reduced to a question about learning about the distribution of \(\Theta\) only.
For instance the simplest model consistent with \(X \rightarrow Y\):
Endogenous Nodes = \(\{X, Y\}\), both with range \(\{0,1\}\)
Exogenous Nodes = \(\{\theta^X, \theta^Y\}\), with ranges \(\{\theta^X_0, \theta^X_1\}\) and \(\{\theta^Y_{00}, \theta^Y_{01}, \theta^Y_{10}, \theta^Y_{11}\}\)
Functional equations: \(X = \theta^X\) and \(Y = f^Y(X, \theta^Y)\), where type \(\theta^Y_{ij}\) sets \(Y = i\) if \(X = 0\) and \(Y = j\) if \(X = 1\)
Distributions on \(\Theta\): \(\Pr(\theta^i = \theta^i_k) = \lambda^i_k\)
What is the probability that \(X\) has a positive causal effect on \(Y\)?
This is equivalent to: \(\Pr(\theta^Y =\theta^Y_{01}) = \lambda^Y_{01}\)
So we want to learn about the distributions of the exogenous nodes
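A minimal sketch of this model in R, with purely illustrative values for the \(\lambda\)s:

```r
# Types: theta^Y = "ij" sets Y = i if X = 0 and Y = j if X = 1
lambda_X <- c("0" = 0.5, "1" = 0.5)
lambda_Y <- c("00" = 0.2, "01" = 0.4, "10" = 0.1, "11" = 0.3)

n <- 10000
theta_X <- sample(names(lambda_X), n, replace = TRUE, prob = lambda_X)
theta_Y <- sample(names(lambda_Y), n, replace = TRUE, prob = lambda_Y)

X <- as.numeric(theta_X)                        # f^X: X = theta^X
Y <- as.numeric(substr(theta_Y, X + 1, X + 1))  # f^Y: read off the relevant digit

mean(theta_Y == "01")   # Pr(positive effect) = lambda^Y_01 (here 0.4)
```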
This general principle extends to a vast class of causal models