Causal Inference

Topics

Macartan Humphreys

1 Topics

Introduction to observational strategies and more advanced topics

1.1 Outline

LATE
Diff in Diff
RDD
Spillovers
Mediation
Survey experiments

1.2 Noncompliance and the LATE estimand

1.2.1 Local Average Treatment Effects

Sometimes you give a medicine but only a nonrandom sample of people actually try to use it. Can you still estimate the medicine’s effect?

	X=0	X=1
Z=0	\(\overline{y}_{00}\) (\(n_{00}\))	\(\overline{y}_{01}\) (\(n_{01}\))
Z=1	\(\overline{y}_{10}\) (\(n_{10}\))	\(\overline{y}_{11}\) (\(n_{11}\))

Say that people are one of 3 types:

\(n_a\) “always takers” have \(X=1\) no matter what and have average outcome \(\overline{y}_a\)
\(n_n\) “never takers” have \(X=0\) no matter what with outcome \(\overline{y}_n\)
\(n_c\) “compliers” have \(X=Z\) and average outcomes \(\overline{y}^1_c\) if treated and \(\overline{y}^0_c\) if not.

1.2.2 Local Average Treatment Effects

We can figure out something about types:

	\(X=0\)	\(X=1\)
\(Z=0\)	\(\frac{\frac{1}{2}n_c}{\frac{1}{2}n_c + \frac{1}{2}n_n} \overline{y}^0_{c}+\frac{\frac{1}{2}n_n}{\frac{1}{2}n_c + \frac{1}{2}n_n} \overline{y}_{n}\)	\(\overline{y}_{a}\)
\(Z=1\)	\(\overline{y}_{n}\)	\(\frac{\frac{1}{2}n_c}{\frac{1}{2}n_c + \frac{1}{2}n_a} \overline{y}^1_{c}+\frac{\frac{1}{2}n_a}{\frac{1}{2}n_c + \frac{1}{2}n_a} \overline{y}_{a}\)

1.2.3 Local Average Treatment Effects

You give a medicine to 50% but only a non random sample of people actually try to use it. Can you still estimate the medicine’s effect?

	\(X=0\)	\(X=1\)
\(Z=0\)	\(\frac{n_c}{n_c + n_n} \overline{y}^0_{c}+\frac{n_n}{n_c + n_n} \overline{y}_n\)	\(\overline{y}_{a}\)
(n)	(\(\frac{1}{2}(n_c + n_n)\))	(\(\frac{1}{2}n_a\))
\(Z=1\)	\(\overline{y}_{n}\)	\(\frac{n_c}{n_c + n_a} \overline{y}^1_{c}+\frac{n_a}{n_c + n_a} \overline{y}_{a}\)
(n)	(\(\frac{1}{2}n_n\))	(\(\frac{1}{2}(n_a+n_c)\))

Key insight: the contributions of the \(a\)s and \(n\)s are the same in the \(Z=0\) and \(Z=1\) groups so if you difference you are left with the changes in the contributions of the \(c\)s.

1.2.4 Local Average Treatment Effects

Average in \(Z=0\) group: \(\frac{{n_c} \overline{y}^0_{c}+ \left(n_{n}\overline{y}_{n} +{n_a} \overline{y}_a\right)}{n_a+n_c+n_n}\)

Average in \(Z=1\) group: \(\frac{{n_c} \overline{y}^1_{c} + \left(n_{n}\overline{y}_{n} +{n_a} \overline{y}_a \right)}{n_a+n_c+n_n}\)

So, the difference is the ITT: \(({\overline{y}^1_c-\overline{y}^0_c})\frac{n_c}{n}\), where \(n = n_a + n_c + n_n\)

Last step:

\[ITT = ({\overline{y}^1_c-\overline{y}^0_c})\frac{n_c}{n}\]

\[\leftrightarrow\]

\[LATE = \frac{ITT}{\frac{n_c}{n}}= \frac{\text{Intent to treat effect}}{\text{First stage effect}}\]
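
As a rough check on this algebra, the following base R sketch simulates the three types and computes the ratio of the intent-to-treat effect to the first stage. The type shares, outcome levels, and effect sizes are illustrative assumptions, not values from these notes.

```r
# Illustrative simulation; all shares and outcome values are assumptions
set.seed(1)
n    <- 100000
type <- sample(c("a", "n", "c"), n, replace = TRUE, prob = c(0.2, 0.3, 0.5))
Z    <- rbinom(n, 1, 0.5)                                  # random encouragement
X    <- ifelse(type == "a", 1, ifelse(type == "n", 0, Z))  # take-up depends on type

# Potential outcomes: assumed effects are 4 (always takers), -1 (never takers), 2 (compliers)
Y0 <- unname(c(a = 3, n = 1, c = 2)[type]) + rnorm(n)
Y1 <- Y0 + unname(c(a = 4, n = -1, c = 2)[type])
Y  <- ifelse(X == 1, Y1, Y0)   # Z affects Y only through X (exclusion restriction)

itt         <- mean(Y[Z == 1]) - mean(Y[Z == 0])
first_stage <- mean(X[Z == 1]) - mean(X[Z == 0])   # estimates n_c / n
late        <- itt / first_stage

round(c(itt = itt, first_stage = first_stage, late = late, ate = mean(Y1 - Y0)), 2)
```

Under these assumed values the ratio should land near the complier effect (2) rather than the population average effect (1.5), which is the sense in which the estimand is local.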

1.2.5 The good and the bad of LATE

You get a well-defined estimate even when there is non-random take-up
May sometimes be used to assess mediation or knock-on effects
But:
- You need assumptions (monotonicity and the exclusion restriction – where were these used above?)
- Your estimate is only for a subpopulation
- The subpopulation is not chosen by you and is unknown
- Different encouragements may yield different estimates since they may encourage different subgroups

1.2.6 Pearl and Chickering again

With and without an imposition of monotonicity

library(CausalQueries)  # make_model(), set_restrictions(), update_model(), query_model()
data("lipids_data")

models <- 
  list(unrestricted =  make_model("Z -> X -> Y; X <-> Y"),
       restricted =  make_model("Z -> X -> Y; X <-> Y") |>
         set_restrictions("X[Z=1] < X[Z=0]")) |> 
  lapply(update_model,  data = lipids_data, refresh = 0) 

models |>
  query_model(query = list(CATE = "Y[X=1] - Y[X=0]", 
                           Nonmonotonic = "X[Z=1] < X[Z=0]"),
              given = list("X[Z=1] > X[Z=0]", TRUE),
              using = "posteriors")

1.2.7 Pearl and Chickering again

With and without an imposition of monotonicity:

model	query	mean	sd
unrestricted	CATE	0.70	0.05
restricted	CATE	0.71	0.05
unrestricted	Nonmonotonic	0.01	0.01
restricted	Nonmonotonic	0.00	0.00

In one case we assume monotonicity; in the other we update on it (easy in this case because one-sided noncompliance is empirically verifiable).

1.3 Diff in diff

Key idea: the evolution of units in the control group allows you to impute what the evolution of units in the treatment group would have been had they not been treated

1.3.1 Logic

We have group \(A\) that enters treatment at some point and group \(B\) that never does

The estimate:

\[\hat\tau = (\mathbb{E}[Y^A | post] - \mathbb{E}[Y^A | pre]) -(\mathbb{E}[Y^B | post] - \mathbb{E}[Y^B | pre])\] (how different is the change in \(A\) compared to the change in \(B\)?)

can be written:

\[\hat\tau = (\mathbb{E}[Y_1^A | post] - \mathbb{E}[Y_0^A | pre]) -(\mathbb{E}[Y_0^B | post] - \mathbb{E}[Y_0^B | pre])\]

1.3.2 Logic

Cleaning up

\[\hat\tau = (\mathbb{E}[Y_1^A | post] - \mathbb{E}[Y_0^A | pre]) -(\mathbb{E}[Y_0^B | post] - \mathbb{E}[Y_0^B | pre])\]

\[\hat\tau = (\mathbb{E}[Y_1^A | post] - \mathbb{E}[Y_0^A | post]) + ((\mathbb{E}[Y_0^A | post] - \mathbb{E}[Y_0^A | pre]) -(\mathbb{E}[Y_0^B | post] - \mathbb{E}[Y_0^B | pre]))\]

\[\hat\tau_{ATT} = \tau_{ATT} + \text{Difference in trends}\]
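
A small numeric sketch of this decomposition in base R. The group means below are made-up numbers chosen so that parallel trends fails; the point is only that the diff-in-diff equals the ATT plus the gap in untreated trends.

```r
# Hypothetical untreated potential outcome means by group and period (assumed numbers)
y0 <- matrix(c(1, 3,     # group A: pre, post
               2, 3),    # group B: pre, post
             nrow = 2, byrow = TRUE,
             dimnames = list(c("A", "B"), c("pre", "post")))
tau_att <- 2                                     # assumed effect on A in the post period

y_obs <- y0
y_obs["A", "post"] <- y0["A", "post"] + tau_att  # only A is treated, and only post

did <- (y_obs["A", "post"] - y_obs["A", "pre"]) -
       (y_obs["B", "post"] - y_obs["B", "pre"])
trend_gap <- (y0["A", "post"] - y0["A", "pre"]) -
             (y0["B", "post"] - y0["B", "pre"])

c(did = did, tau_att = tau_att, trend_gap = trend_gap)
```

Setting the two untreated trends equal (for example, changing group B to go from 2 to 4) makes `trend_gap` zero, and the diff-in-diff then recovers `tau_att`.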

1.3.3 Simplest case

library(DeclareDesign)  # declare_*(), diagnose_design(), redesign()

n_units <- 2
design <- 
  declare_model(
    unit = add_level(N = n_units, I = 1:N),
    time = add_level(N = 6, T = 1:N, nest = FALSE),
    obs = cross_levels(by = join_using(unit, time))) +
  declare_model(potential_outcomes(Y ~ I + T^.5 + Z*T)) +
  declare_assignment(Z = 1*(I>(n_units/2))*(T>3)) +
  declare_measurement(Y = reveal_outcomes(Y~Z)) + 
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0),
                  ATT = mean(Y_Z_1[Z==1] - Y_Z_0[Z==1])) +
  declare_estimator(Y ~ Z, label = "naive") + 
  declare_estimator(Y ~ Z + I, label = "FE1") + 
  declare_estimator(Y ~ Z + as.factor(T), label = "FE2") + 
  declare_estimator(Y ~ Z + I + as.factor(T), label = "FE3")

1.3.4 Diagnosis

Here only the two-way fixed effects estimator (FE3) is unbiased, and only for the ATT.

The ATT here is averaging over effects for treated units (later periods only). We know nothing about the size of effects in earlier periods when all units are in control!

design |> diagnose_design()

Inquiry	Estimator	Bias
ATE	FE1	2.25
ATE	FE2	6.50
ATE	FE3	1.50
ATE	naive	5.40
ATT	FE1	0.75
ATT	FE2	5.00
ATT	FE3	0.00
ATT	naive	3.90

1.3.5 The classic graph

design |> 
  draw_data() |>
  ggplot(aes(T, Y, color = unit)) + geom_line() +
       geom_point(aes(T, Y_Z_0)) + theme_bw()

1.3.6 Extends to multiple units easily

design |> redesign(n_units = 10) |> diagnose_design()

Inquiry	Estimator	Bias
ATE	FE1	2.25
ATE	FE2	6.50
ATE	FE3	1.50
ATE	naive	5.40
ATT	FE1	0.75
ATT	FE2	5.00
ATT	FE3	0.00
ATT	naive	3.90

1.3.7 Extends to multiple units easily

design |> 
  redesign(n_units = 10) |>  
  draw_data() |> 
  ggplot(aes(T, Y, color = unit)) + geom_line() +
       geom_point(aes(T, Y_Z_0)) + theme_bw()

1.3.8 In practice

Need to defend parallel trends
Most typically using an event study
Sometimes: report balance between treatment and control groups in covariates
Placebo leads and lags

1.3.9 Heterogeneity

Things get much more complicated when there is (a) heterogeneous timing in treatment take up and (b) heterogeneous effects
It’s only recently been appreciated how tricky things can get
But we already have an intuition from our analysis of trials with heterogeneous assignment and heterogeneous effects:
in such cases fixed effects analysis weights stratum level treatment effects by the variance in assignment to treatment
something similar here

1.3.10 Staggered assignments

Just two units assigned at different times:

trend = 0

design <- 
  declare_model(
    unit = add_level(N = 2, ui = rnorm(N), I = 1:N),
    time = add_level(N = 6, ut = rnorm(N), T = 1:N, nest = FALSE),
    obs = cross_levels(by = join_using(unit, time))) +
  declare_model(
    potential_outcomes(Y ~ trend*T + (1+Z)*(I == 2))) +
  declare_assignment(Z = 1*((I == 1) * (T>3) + (I == 2) * (T>5))) +
  declare_measurement(Y = reveal_outcomes(Y~Z), 
                      I_c = I - mean(I)) +
  declare_inquiry(mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, label = "1. naive") + 
  declare_estimator(Y ~ Z + I, label = "2. FE1") + 
  declare_estimator(Y ~ Z + as.factor(T), label = "3. FE2") + 
  declare_estimator(Y ~ Z + I + as.factor(T), label = "4. FE3") + 
  declare_estimator(Y ~ Z*I_c + as.factor(T), label = "5. Sat")

1.3.11 Staggered assignments diagnosis

Estimator	Mean Estimand	Mean Estimate
1. naive	0.50	-0.12
	(0.00)	(0.00)
2. FE1	0.50	0.36
	(0.00)	(0.00)
3. FE2	0.50	-1.00
	(0.00)	(0.00)
4. FE3	0.50	0.25
	(0.00)	(0.00)
5. Sat	0.50	0.50
	(0.00)	(0.00)

1.3.12 Where do these numbers come from?

The estimand is .5 – this comes from weighting the effect for unit 1 (0) and the effect for unit 2 (1) equally
The naive estimate is wildly off because it does not take into account that units with different treatment shares have different average levels in outcomes

1.3.13 Where do these numbers come from?

The estimate when we control for unit is 0.36: this comes from weighting the unit-stratum level effects according to the variance of assignment to each stratum:

design |> draw_data() |> group_by(unit) |> summarize(var = mean(Z)*(1-mean(Z))) |>
  mutate(weight = var/sum(var)) |> kable(digits = 2)

unit	var	weight
1	0.25	0.64
2	0.14	0.36
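
The 0.36 can be reproduced by hand. The sketch below rebuilds the two-unit, six-period data in base R (with `trend = 0`, as in the design above) and compares the variance-weighted average of the unit-level effects with the coefficient from the regression that controls for unit. The data frame construction is a simplified stand-in for the declared design, not the design object itself.

```r
# Two units, six periods; unit 1 treated from T = 4, unit 2 from T = 6
dat   <- expand.grid(T = 1:6, I = 1:2)
dat$Z <- with(dat, as.numeric((I == 1 & T > 3) | (I == 2 & T > 5)))
dat$Y <- with(dat, (1 + Z) * (I == 2))        # effect is 0 for unit 1 and 1 for unit 2

unit_effect <- c(0, 1)
v <- tapply(dat$Z, dat$I, function(z) mean(z) * (1 - mean(z)))  # assignment variance by unit
w <- v / sum(v)                                                 # implied weights

weighted <- sum(w * unit_effect)
fe1      <- coef(lm(Y ~ Z + I, data = dat))[["Z"]]

round(c(weight_unit1 = w[["1"]], weight_unit2 = w[["2"]],
        weighted = weighted, fe1 = fe1), 2)
```

Both quantities equal 5/14, which is the 0.36 reported for FE1.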

1.3.14 Where do these numbers come from?

The estimate when we control for time is -1: this comes from weighting the time-stratum level effects according to the variance of assignment to each stratum
it weights periods 4 and 5 only and equally, yielding the difference in outcomes between unit 1 in treatment (0) and unit 2 in control (1)

design |> draw_data()  |> group_by(time) |> summarize(var = mean(Z)*(1-mean(Z))) |>
  mutate(weight = var/sum(var)) |> kable(digits = 2)

time	var	weight
1	0.00	0.0
2	0.00	0.0
3	0.00	0.0
4	0.25	0.5
5	0.25	0.5
6	0.00	0.0

1.3.15 Where do these numbers come from?

The estimate when we control for time and unit is 0.25
This is actually a lot harder to interpret:
We can figure out what it is from the “Goodman-Bacon decomposition” in Goodman-Bacon (2021)

1.3.16 Where do these numbers come from?

1.3.17 Where do these numbers come from?

In this case we can think of our data having the following structure:

	y1	y2
pre	0	1
mid	0	1
post	0	2

the mid to pre diff in diff is (0-0) - (1-1) = 0 (group 2 serves as control)
the post to mid diff in diff is (2-1) - (0-0) = 1 (group 1 serves as control though already in treatment!)

TWFE gives a weighted average of these two, putting a 3/4 weight on the first and a 1/4 weight on the second

1.3.18 Two way fixed effects

Specifically:

\[\hat\beta^{DD} = \mu_{12}\hat\beta^{2 \times 2, 1}_{12} + (1-\mu_{12})\hat\beta^{2 \times 2, 2}_{12}\]

where \(\mu_{12} = \frac{1-\overline{Z_1}}{1-(\overline{Z_1}-\overline{Z_2})} = \frac{1-\frac36}{1-\left(\frac36-\frac16\right)} = \frac34\)

\[\frac34\hat\beta^{2 \times 2, 1}_{12} + \frac14 \hat\beta^{2 \times 2, 2}_{12}\] (weights formula from WP version)

1.3.19 Two way fixed effects

And:

\(\hat\beta^{2 \times 2, 1}_{12}\) is \((\overline{y}_1^{MID(1,2)} - \overline{y}_1^{PRE(1)}) - (\overline{y}_2^{MID(1,2)} - \overline{y}_2^{PRE(1)})\)

which in the simple example without time trends is \((0 - 0) - (1-1) = 0\)

\(\hat\beta^{2 \times 2, 2}_{12}\) is \((\overline{y}_2^{POST(2)} - \overline{y}_2^{MID(1,2)}) - (\overline{y}_1^{POST(2)} - \overline{y}_1^{MID(1,2)})\)

which is \((2 - 1) - (0 - 0) = 1\)
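
The weighted average can be checked against the regression directly. This base R sketch rebuilds the same two-unit data (no time trend), computes the two 2x2 comparisons and the weight \(\mu_{12}\), and compares the result with the two-way fixed effects coefficient. The helper `m()` and the period labels are names introduced here for illustration.

```r
# Two units, six periods; unit 1 treated from T = 4, unit 2 from T = 6
dat   <- expand.grid(T = 1:6, I = 1:2)
dat$Z <- with(dat, as.numeric((I == 1 & T > 3) | (I == 2 & T > 5)))
dat$Y <- with(dat, (1 + Z) * (I == 2))

twfe <- coef(lm(Y ~ Z + factor(I) + factor(T), data = dat))[["Z"]]

# Mean outcome for unit i over a set of periods (illustrative helper)
m <- function(i, periods) mean(dat$Y[dat$I == i & dat$T %in% periods])
pre <- 1:3; mid <- 4:5; post <- 6

b1 <- (m(1, mid)  - m(1, pre)) - (m(2, mid)  - m(2, pre))   # unit 2 serves as control
b2 <- (m(2, post) - m(2, mid)) - (m(1, post) - m(1, mid))   # already-treated unit 1 as control

Zbar <- tapply(dat$Z, dat$I, mean)                          # treated shares: 3/6 and 1/6
mu   <- (1 - Zbar[["1"]]) / (1 - (Zbar[["1"]] - Zbar[["2"]]))

c(b1 = b1, b2 = b2, mu = mu, decomposition = mu * b1 + (1 - mu) * b2, twfe = twfe)
```

In this example both the decomposition and the regression coefficient equal 0.25.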

1.3.20 A problem

So quite complex weighting of different comparisons

Units effectively get counted as both treatment and control units for different comparisons
Treated units counted as control units
Effectively negative weights on some quantities
Possible to have very poorly performing estimates

1.3.21 Solutions

Involve better specification of estimands
Use of comparisons directly relevant for the estimands
Imputation of control outcomes in treated units using data from appropriate control units only and then targeting estimands directly (Borusyak, Jaravel, and Spiess 2021)
Particular solution: focus on the effect of treatment at the time of first treatment / or time of switching: this usually involves a no carryover assumption (De Chaisemartin and d’Haultfoeuille 2020) also Imai and Kim (2021)

1.3.22 Solutions

See excellent review by Roth et al. (2023)

knitr::include_graphics("assets/didtools.jpg")

1.3.23 Chaisemartin and d’Haultfoeuille (2020)

library(DIDmultiplegt)
library(rdss)
design <- 
  declare_model(
    unit = add_level(N = 4, I = 1:N),
    time = add_level(N = 8, T = 1:N, nest = FALSE),
    obs = cross_levels(by = join_using(unit, time),
                       potential_outcomes(Y ~ T + (1 + Z)*I))) +
  declare_assignment(Z = 1*(T > (I + 4))) +
  declare_measurement(
    Y = reveal_outcomes(Y~Z),
    Z_lag = lag_by_group(Z, groups = unit, n = 1, order_by = T)) +
  declare_inquiry(
    ATT_switchers = mean(Y_Z_1 - Y_Z_0), 
    subset = Z == 1 & Z_lag == 0 & !is.na(Z_lag)) +
  declare_estimator(
    Y = "Y",  G = "unit",  T = "T",  D = "Z",
    handler = label_estimator(rdss::did_multiplegt_tidy),
    inquiry = c("ATT_switchers"),
    label = "chaisemartin"
  )

1.3.24 Chaisemartin and d’Haultfoeuille (2020)

Note the inquiry

run_design(design) |> kable(digits = 2)

inquiry	estimand	estimator	estimate
ATT_switchers	2	chaisemartin	2

1.3.25 Triple differencing

A response to concerns that double differencing is not enough is to triple difference
When you think that there may be a violation of parallel trends but you have other outcomes that would pick up the same difference in trends
See Olden and Møen (2022)

1.3.26 Triple differencing

You are interested in the effects of influx of refugees on right wing voting

You have (say more conservative) states with no refugees at any period
You have (say more liberal) states with refugees post 2016 only

You want to do differences in differences comparing these states before and after

However you worry that things change differentially in the conservative and liberal states: no parallel trends
but you can identify areas within states that are more or less likely to be exposed and compare differences in differences in the exposed and unexposed groups.

1.3.27 Triple differencing

So:

Two types of states: \(L \in \{0,1\}\), only \(L=1\) types get refugee influx
Two time periods: \(Post \in \{0,1\}\), refugee influx occurs in period \(Post = 1\)
Two groups: \(B \in \{0,1\}\), \(B=1\) types affected by refugee influx

\[Y = \beta_0 + \beta_1 L + \beta_2 B + \beta_3 Post + \beta_4 LB + \beta_5 L Post + \beta_6 B Post + \beta_7L B Post + \epsilon\]

1.3.28 Triple differencing

\[\frac{\partial ^3Y}{\partial L \partial B \partial Post} = \beta_7\]

1.3.29 Can we not just condition on the \(B=1\) types?

The level among the \(B=1\) types is:

\[Y = \beta_0 + \beta_1 L + \beta_2 + \beta_3 Post + \beta_4 L + \beta_5 L Post + \beta_6 Post + \beta_7L Post + \epsilon\]

If you did simple before / after differences among the \(B=1\) types you would get:

\[\Delta Y| L = 1, B = 1 = \beta_3 + \beta_5 + \beta_6 + \beta_7\]

\[\Delta Y| L = 0, B = 1 = \beta_3 + \beta_6\]

1.3.30 Can we not just condition on the \(B=1\) types?

And so if you differenced again you would get:

\[\Delta^2 Y| B = 1 = \beta_5 + \beta_7\]

So the problem is that you have \(\beta_5\) in here, which corresponds exactly to how \(L\) states change over time.

1.3.31 Triple

But we can figure out \(\beta_5\) by doing a diff in diff among the \(B=0\) types.

\[Y|B = 0 = \beta_0 + \beta_1 L + \beta_3 Post + \beta_5 L Post\]

\[\Delta^2 Y| B = 0 = \beta_5\]
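
A base R sketch of the same logic on simulated data. The coefficient values and cell sizes are arbitrary assumptions; the check is that the diff in diff among \(B=1\) types minus the diff in diff among \(B=0\) types coincides with the triple interaction coefficient from the saturated regression.

```r
set.seed(2)
cells <- expand.grid(L = 0:1, B = 0:1, Post = 0:1)
dat   <- cells[rep(seq_len(nrow(cells)), each = 2000), ]

b <- c(1, 0.5, 0.3, 0.2, 0.1, 0.4, 0.25, 1)   # assumed beta_0, ..., beta_7
dat$Y <- with(dat, b[1] + b[2]*L + b[3]*B + b[4]*Post + b[5]*L*B +
                   b[6]*L*Post + b[7]*B*Post + b[8]*L*B*Post) +
         rnorm(nrow(dat), sd = 0.5)

# Diff in diff within each B group, then the triple difference
dd_B1 <- coef(lm(Y ~ L * Post, data = dat, subset = B == 1))[["L:Post"]]  # ~ beta_5 + beta_7
dd_B0 <- coef(lm(Y ~ L * Post, data = dat, subset = B == 0))[["L:Post"]]  # ~ beta_5
ddd   <- coef(lm(Y ~ L * B * Post, data = dat))[["L:B:Post"]]             # ~ beta_7

round(c(dd_B1 = dd_B1, dd_B0 = dd_B0, difference = dd_B1 - dd_B0, ddd = ddd), 2)
```

Because the three-way model is saturated, the difference of the two diff-in-diffs and the triple interaction coefficient agree exactly in the sample, not just in expectation.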

1.3.32 Easier to swallow?

The identifying assumption is that absent treatment the differences in trends between \(L=0\) and \(L=1\) would be the same for units with \(B=0\) and \(B=1\).
See equation 5.4 in Olden and Møen (2022)

\[\begin{aligned}
&\left(E[Y_0|L=1, B=1, {\textit {Post}}=1] - E[Y_0|L=1, B=1, {\textit {Post}}=0]\right) \\
&\quad - \left(E[Y_0|L=1, B=0, {\textit {Post}}=1] - E[Y_0|L=1, B=0, {\textit {Post}}=0]\right) \\
&= \\
&\left(E[Y_0|L=0, B=1, {\textit {Post}}=1] - E[Y_0|L=0, B=1, {\textit {Post}}=0]\right) \\
&\quad - \left(E[Y_0|L=0, B=0, {\textit {Post}}=1] - E[Y_0|L=0, B=0, {\textit {Post}}=0]\right)
\end{aligned}\]

1.3.33 Easier to swallow?

In a sense this is one parallel trends assumption, not two
But there are four counterfactual quantities in this expression.

Puzzle: Is it possible to have an effect identified by a difference in difference but incorrectly by a triple difference design?

1.4 Regression discontinuity

Errors and diagnostics

1.4.1 Intuition

The core idea in an RDD design is that if a decision rule assigns units that are almost identical to each other to treatment and control conditions then we can infer effects for those cases by looking at those cases.

See excellent introduction: Lee and Lemieux (2010)

1.4.2 Intuition

Kids born on 31 August start school a year younger than kids born on 1 September: does starting younger help or hurt?
Kids born on 12 September 1983 are more likely to register Republican than kids born on 10 September 1983: can this identify the effects of registration on long term voting?
Districts in which Republicans got 50.1% of the vote get a Republican representative while districts in which Republicans got 49.9% of the vote do not: does having a Republican representative make a difference for these districts?

1.4.3 Argument for identification

Setting:

Typically the decision is based on a value on a “running variable”, \(X\). e.g. Treatment if \(X > 0\)
The estimand is \(\mathbb{E}[Y(1) - Y(0)|X=0]\)

Two arguments:

Continuity: \(\mathbb{E}[Y(1)|X=x]\) and \(\mathbb{E}[Y(0)|X=x]\) are continuous (at \(x=0\)) in \(x\): so \(\lim_{\hat x \rightarrow 0}\mathbb{E}[Y(0)|X=\hat x] = \mathbb{E}[Y(0)|X=0]\)
Local randomization: tiny things that determine exact values of \(x\) are as if random and so we can think of a local experiment around \(X=0\).

1.4.4 Argument for identification

Note:

continuity argument requires continuous \(x\): granularity
also builds off a conditional expectation function defined at \(X=0\)

Exclusion restriction is implicit in continuity: If something else happens at the threshold then the conditional expectation functions jump at the thresholds

Implicit: \(X\) is exogenous in the sense that units cannot adjust \(X\) in order to be on one or the other side of the threshold
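
A minimal base R sketch of the continuity argument in a sharp design: fit separate lines on each side of the threshold within a narrow window and read off the jump at \(X=0\). The data generating process, the size of the jump (0.25), and the bandwidth are all assumptions made for illustration.

```r
set.seed(3)
n   <- 5000
dat <- data.frame(X = runif(n, -1, 1))           # running variable, threshold at 0
dat$D <- as.numeric(dat$X > 0)                   # sharp rule: treated if X > 0
# Smooth conditional expectation functions plus an assumed jump of 0.25 at the threshold
dat$Y <- 0.5 * dat$X + 0.3 * dat$X^2 + 0.25 * dat$D + rnorm(n, sd = 0.1)

h   <- 0.2                                       # bandwidth (an arbitrary choice here)
fit <- lm(Y ~ D * X, data = dat, subset = abs(X) < h)
rd  <- coef(fit)[["D"]]                          # estimated discontinuity at X = 0

round(rd, 2)
```

The interaction `D * X` lets the slope differ on the two sides, so the coefficient on `D` is the gap between the two fitted lines at the threshold itself.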

1.4.5 Evidence

Typically researchers show:

“First stage” results: assignment to treatment does indeed jump at the threshold
“ITT”: outcomes jump at the threshold
LATE (if fuzzy / imperfect compliance) using IV

1.4.6 Evidence

In addition, researchers typically provide:

Arguments for no other treatments at the threshold
Arguments for no “sorting” at the threshold
Evidence for no “heaping” at the threshold (McCrary density test)

Sometimes:

argue for why estimates extend beyond the threshold
exclude points at the threshold (!)

1.4.7 Design

library(rdss) # for helper functions
library(rdrobust)

cutoff <- 0.5
bandwidth <- 0.5

control <- function(X) {
  as.vector(poly(X, 4, raw = TRUE) %*% c(.7, -.8, .5, 1))}
treatment <- function(X) {
  as.vector(poly(X, 4, raw = TRUE) %*% c(0, -1.5, .5, .8)) + .25}

rdd_design <-
  declare_model(
    N = 1000,
    U = rnorm(N, 0, 0.1),
    X = runif(N, 0, 1) + U - cutoff,
    D = 1 * (X > 0),
    Y_D_0 = control(X) + U,
    Y_D_1 = treatment(X) + U
  ) +
  declare_inquiry(LATE = treatment(0) - control(0)) +
  declare_measurement(Y = reveal_outcomes(Y ~ D)) + 
  declare_sampling(S = X > -bandwidth & X < bandwidth) +
  declare_estimator(Y ~ D*X, term = "D", label = "lm") + 
  declare_estimator(
    Y, X, 
    term = "Bias-Corrected",
    .method = rdrobust_helper,
    label = "optimal"
  )

1.4.8 RDD Data plotted

Note rdrobust implements:

local polynomial Regression Discontinuity (RD) point estimators
robust bias-corrected confidence intervals

See Calonico, Cattaneo, and Titiunik (2014) and related papers; see also `?rdrobust::rdrobust`

1.4.9 RDD Data plotted

rdd_design  |> draw_data() |> ggplot(aes(X, Y, color = factor(D))) + 
  geom_point(alpha = .3) + theme_bw() +
  geom_smooth(aes(X, Y_D_0)) + geom_smooth(aes(X, Y_D_1)) + theme(legend.position = "none")

1.4.10 RDD diagnosis

rdd_design |> diagnose_design()

Estimator	Mean Estimate	Bias	SD Estimate	Coverage
lm	0.23	-0.02	0.01	0.64
	(0.00)	(0.00)	(0.00)	(0.02)
optimal	0.25	0.00	0.03	0.89
	(0.00)	(0.00)	(0.00)	(0.01)

1.4.11 Bandwidth tradeoff

rdd_design |> 
  redesign(bandwidth = seq(from = 0.05, to = 0.5, by = 0.05)) |> 
  diagnose_designs()

As we increase the bandwidth, the lm bias gets worse, but slowly, while the error falls.
The best bandwidth is relatively wide.
This is more true for the optimal estimator.

1.4.12 Geographic RDs

Are popular in political science:

Put a lot of pressure on assumption of no alternative treatment—including “random” country level shocks!
Put a lot of pressure on no sorting assumptions (why was the border put where it was; why did units settle here or there?)
Put a lot of pressure on SUTVA: people on one side are literally proximate to people on another

See Keele and Titiunik (2015)

1.5 Spillovers

1.5.1 SUTVA violations (Spillovers)

Spillovers can result in the estimation of weaker effects when effects are actually stronger.

The key problem is that \(Y(1)\) and \(Y(0)\) are not sufficient to describe potential outcomes

1.5.2 SUTVA violations

Unit	Location	\(D_1\)	\(y(D_1)\)	\(D_2\)	\(y(D_2)\)	\(D_3\)	\(y(D_3)\)	\(D_4\)	\(y(D_4)\)
A	1	1	3	0	1	0	0	0	0
B	2	0	3	1	3	0	3	0	0
C	3	0	0	0	3	1	3	0	3
D	4	0	0	0	0	0	1	1	3

Table: Potential outcomes for four units for different treatment profiles. \(D_i\) is an allocation and \(y_j(D_i)\) is the potential outcome for (row) unit \(j\) given (column) \(D_i\).

The key is to think through the structure of spillovers.
Here immediate neighbors are exposed
In this case we can define a direct treatment (being exposed) and an indirect treatment (having a neighbor exposed) and we can work out the propensity for each unit of receiving each type of treatment
These may be non uniform (here central types are more likely to have treated neighbors); but we can still use the randomization to assess effects

1.5.3 SUTVA violations

			0		1		2		3		4
Unit	Location	\(D_\emptyset\)	\(y(D_\emptyset)\)	\(D_1\)	\(y(D_1)\)	\(D_2\)	\(y(D_2)\)	\(D_3\)	\(y(D_3)\)	\(D_4\)	\(y(D_4)\)
A	1	0	0	1	3	0	1	0	0	0	0
B	2	0	0	0	3	1	3	0	3	0	0
C	3	0	0	0	0	0	3	1	3	0	3
D	4	0	0	0	0	0	0	0	1	1	3
\(\bar{y}_\text{treated}\)			-		3		3		3		3
\(\bar{y}_\text{untreated}\)			0		1		4/3		4/3		1
\(\bar{y}_\text{neighbors}\)			-		3		2		2		3
\(\bar{y}_\text{pure control}\)			0		0		0		0		0
ATT-direct			-		3		3		3		3
ATT-indirect			-		3		2		2		3
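
The summary rows can be recomputed from the potential outcomes in the table. In the base R sketch below, the function name and the definition of the indirect effect (neighbors' outcomes compared with their own outcomes under \(D_\emptyset\)) are assumptions introduced here; since every outcome under \(D_\emptyset\) is 0, comparing with the pure control mean gives the same numbers.

```r
# Potential outcomes: rows are units A-D (locations 1-4), columns are allocations
y <- matrix(c(0, 3, 1, 0, 0,
              0, 3, 3, 3, 0,
              0, 0, 3, 3, 3,
              0, 0, 0, 1, 3),
            nrow = 4, byrow = TRUE,
            dimnames = list(LETTERS[1:4], c("D0", "D1", "D2", "D3", "D4")))

# Summaries for allocation D_k, in which the unit at location k is treated
summarize_allocation <- function(k) {
  neighbors <- intersect(c(k - 1, k + 1), 1:4)   # immediate neighbors are exposed
  pure      <- setdiff(1:4, c(k, neighbors))     # neither treated nor exposed
  col       <- y[, k + 1]
  c(treated      = col[[k]],
    untreated    = mean(col[-k]),
    neighbors    = mean(col[neighbors]),
    pure_control = mean(col[pure]),
    att_direct   = col[[k]] - y[k, "D0"],
    att_indirect = mean(col[neighbors] - y[neighbors, "D0"]))
}

res <- sapply(1:4, summarize_allocation)
colnames(res) <- paste0("D", 1:4)
round(res, 2)
```

Note how the naive treated-versus-untreated gap (3 minus 1 or 4/3) understates the direct effect of 3, because the untreated group includes exposed neighbors.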

1.5.4 Design

dgp <- function(i, Z, G) Z[i]/3 + sum(Z[G == G[i]])^2/5 + rnorm(1)

spillover_design <- 

  declare_model(G = add_level(N = 80), 
                     j = add_level(N = 3, zeros = 0, ones = 1)) +
  
  declare_inquiry(direct = mean(sapply(1:240,  # just i treated v no one treated 
    function(i) { Z_i <- (1:240) == i
                  dgp(i, Z_i, G) - dgp(i, zeros, G)}))) +
  
  declare_inquiry(indirect = mean(sapply(1:240, 
    function(i) { Z_i <- (1:240) == i           # all but i treated v no one treated   
                  dgp(i, ones - Z_i, G) - dgp(i, zeros, G)}))) +
  
  declare_assignment(Z = complete_ra(N)) + 
  
  declare_measurement(
    neighbors_treated = sapply(1:N, function(i) sum(Z[-i][G[-i] == G[i]])),
    one_neighbor  = as.numeric(neighbors_treated == 1),
    two_neighbors = as.numeric(neighbors_treated == 2),
    Y = sapply(1:N, function(i) dgp(i, Z, G))
  ) +
  
  declare_estimator(Y ~ Z, 
                    inquiry = "direct", 
                    model = lm_robust, 
                    label = "naive") +
  
  declare_estimator(Y ~ Z * one_neighbor + Z * two_neighbors,
                    term = c("Z", "two_neighbors"),
                    inquiry = c("direct", "indirect"), 
                    label = "saturated", 
                    model = lm_robust)

1.5.5 Spillovers: direct and indirect treatments

1.5.6 Spillovers: Simulated estimates

1.5.7 Spillovers: Opportunities and Warnings

You can in principle:

debias estimates
learn about interesting processes
optimize design parameters

But to estimate effects you still need some SUTVA like assumption.

1.5.8 Spillovers: Opportunities and Warnings

In this example if one compared the outcome between treated units and all control units that are at least \(n\) positions away from a treated unit you will get the wrong answer unless \(n \geq 7\).

1.6 Mediation

1.6.1 The problem of unidentified mediators

Consider a causal system like the below.
The effect of X on M1 and M2 can be measured in the usual way.
But unfortunately, if there are multiple mediators, the effect of M1 (or M2) on Y is not identified.
The ‘exclusion restriction’ is obviously violated when there are multiple mediators (unless you can account for them all).

1.6.2 The problem of unidentified mediators

Which effects are identified by the random assignment of \(X\)?

1.6.3 The problem of unidentified mediators

An obvious approach is to first examine the (average) effect of X on M1 and then use another manipulation to examine the (average) effect of M1 on Y.

But both of these average effects may be positive (for example) even if there is no effect of X on Y through M1.

1.6.4 The problem of unidentified mediators

An obvious approach is to first examine the (average) effect of X on M1 and then use another manipulation to examine the (average) effect of M1 on Y.

Similarly both of these average effects may be zero even if X affects Y through M1 for every unit!

1.6.5 The problem of unidentified mediators

Both instances of unobserved confounding between \(M\) and \(Y\):

1.6.6 The problem of unidentified mediators

Both instances of unobserved confounding between \(M\) and \(Y\):

1.6.7 The problem of unidentified mediators

Another somewhat obvious approach is to see how the effect of \(X\) on \(Y\) in a regression is reduced when you control for \(M\).
If the effect of \(X\) on \(Y\) passes through \(M\) then surely there should be no effect of \(X\) on \(Y\) after you control for \(M\).
This common strategy associated with Baron and Kenny (1986) is also not guaranteed to produce reliable results. See for instance Green, Ha, and Bullock (2010)

1.6.8 Baron Kenny issues

df <- fabricate(N = 1000, 
                U = rbinom(N, 1, .5),     X = rbinom(N, 1, .5),
                M = ifelse(U==1, X, 1-X), Y = ifelse(U==1, M, 1-M)) 
            
list(lm(Y ~ X, data = df), 
     lm(Y ~ X + M, data = df)) |> texreg::htmlreg()

Statistical models
	Model 1	Model 2
(Intercept)	-0.00***	-0.00***
	(0.00)	(0.00)
X	1.00***	1.00***
	(0.00)	(0.00)
M		-0.00
		(0.00)
R²	1.00	1.00
Adj. R²	1.00	1.00
Num. obs.	1000	1000
***p < 0.001; **p < 0.01; *p < 0.05

1.6.9 The problem of unidentified mediators

See Imai on better ways to think about this problem and designs to address it.

1.6.10 The problem of unidentified mediators: Quantities

Using potential outcomes we can describe a mediation effect as (see Imai et al): \[\delta_i(t) = Y_i(t, M_i(1)) - Y_i(t, M_i(0)) \text{ for } t = 0,1\]
The direct effect is: \[\psi_i(t) = Y_i(1, M_i(t)) - Y_i(0, M_i(t)) \text{ for } t = 0,1\]
This is a decomposition, since: \[Y_i(1, M_i(1)) - Y_i(0, M_i(0)) = \frac{1}{2}(\delta_i(1) + \delta_i(0) + \psi_i(1) + \psi_i(0)) \]
If there are no interaction effects, i.e. \(\delta_i(1) = \delta_i(0)\) and \(\psi_i(1) = \psi_i(0)\), then \[Y_i(1, M_i(1)) - Y_i(0, M_i(0)) = \delta_i + \psi_i\]
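
To see why the decomposition holds, the total effect can be split in two ways, adding and subtracting a different intermediate term each time. This is a worked step using only the definitions of \(\delta_i(t)\) and \(\psi_i(t)\):

\[Y_i(1, M_i(1)) - Y_i(0, M_i(0)) = \underbrace{Y_i(1, M_i(1)) - Y_i(1, M_i(0))}_{\delta_i(1)} + \underbrace{Y_i(1, M_i(0)) - Y_i(0, M_i(0))}_{\psi_i(0)}\]

\[Y_i(1, M_i(1)) - Y_i(0, M_i(0)) = \underbrace{Y_i(1, M_i(1)) - Y_i(0, M_i(1))}_{\psi_i(1)} + \underbrace{Y_i(0, M_i(1)) - Y_i(0, M_i(0))}_{\delta_i(0)}\]

Averaging the two expressions gives the total effect as \(\frac{1}{2}(\delta_i(1) + \delta_i(0) + \psi_i(1) + \psi_i(0))\).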

1.6.11 The problem of unidentified mediators: Solutions?

The bad news is that although a single experiment might identify the total effect, it cannot identify these elements of the direct effect.

So:

Check formal requirement for identification under single experiment design (“sequential ignorability”—that, conditional on actual treatment, it is as if the value of the mediation variable is randomly assigned relative to potential outcomes). But this is strong (and in fact unverifiable) and if it does not hold, bounds on effects always include zero (Imai et al)
Consider sensitivity analyses

1.6.12 Implicit mediation

You can use interactions with covariates if you are willing to make assumptions on no heterogeneity of direct treatment effects over covariates.

eg you think that money makes people get to work faster because they can buy better cars; you look at the marginal effect of more money on time to work for people with and without cars and find it higher for the latter.

This might imply mediation through transport but only if there is no direct effect heterogeneity (eg people with cars are less motivated by money).

1.6.13 The problem of unidentified mediators: Solutions?

Weaker assumptions justify parallel design

Group A: \(T\) is randomly assigned, \(M\) left free.
Group B: divided into four groups \(T\times M\) (requires two more assumptions (1) that the manipulation of the mediator only affects outcomes through the mediator (2) no interaction, for each unit, \(Y(1,m)-Y(0,m) = Y(1,m')-Y(0,m')\).)

Takeaway: Understanding mechanisms is harder than you think. Figure out what assumptions fly.

1.6.14 In `CausalQueries`

Let’s imagine that sequential ignorability does not hold. What are our posteriors on mediation quantities when in fact all effects are mediated, effects are strong, and we have lots of data?

model <- make_model("X -> M ->Y <- X; M <-> Y")

plot(model)

1.6.15 In `CausalQueries`

We imagine a true model and consider estimands:

truth <- make_model("X -> M ->Y") |> 
  set_parameters(c(.5, .5, .1, 0, .8, .1, .1, 0, .8, .1))

queries  <- 
  list(
      indirect = "Y[X = 1, M = M[X=1]] - Y[X = 1, M = M[X=0]]",
      direct = "Y[X = 1, M = M[X=0]] - Y[X = 0, M = M[X=0]]"
      )

truth |> query_model(queries) |> kable()

query	given	using	case_level	mean	sd	cred.low	cred.high
indirect	-	parameters	FALSE	0.64	NA	0.64	0.64
direct	-	parameters	FALSE	0.00	NA	0.00	0.00

1.6.16 In `CausalQueries`

model |> update_model(data = truth |> make_data(n = 1000)) |>
  query_distribution(queries = queries, using = "posteriors")

Why such poor behavior? Why isn’t weight going onto indirect effects?

Turns out the data is consistent with direct effects only: specifically that whenever \(M\) is responsive to \(X\), \(Y\) is responsive to \(X\).

1.6.17 In `CausalQueries`

1.7 Survey experiments

Survey experiments are used to measure things: nothing (except answers) should be changed!
If the experiment in the survey is changing things then it is a field experiment in a survey, not a survey experiment

1.7.1 The list experiment: Motivation

Multiple survey experimental designs have been generated to make it easier for subjects to answer sensitive questions
The key idea is to use inference rather than measurement.
Subjects are placed in different conditions and the conditions affect the answers that are given in such a way that you can infer some underlying quantity of interest

1.7.2 The list experiment: Motivation

This is an obvious DAG but the main point is to be clear that the Value is the quantity of interest and the value is not affected by the treatment, Z.

1.7.3 The list experiment: Motivation

The list experiment supposes that:

Subjects do not want to give a direct answer to a question
They nevertheless are willing to truthfully answer an indirect question

In other words: sensitivities notwithstanding, they are happy for the researcher to make correct inferences about them or their group

1.7.4 The list experiment: Strategy

Respondents are given a short list and a long list.
The long list differs from the short list in having one extra item—the sensitive item
We ask how many items in each list a respondent agrees with:
- \(Y_i(0)\) is the number of elements on a short list that a respondent agrees with
- \(Y_i(1)\) is the number of elements on a long list that a respondent agrees with
- \(Y_i(1) - Y_i(0)\) is an indicator for whether an individual agrees with the sensitive item
- \(\mathbb{E}[Y_i(1) - Y_i(0)]\) is the share of people agreeing with sensitive item
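The logic can be checked with a minimal base R sketch. Everything here is an illustrative assumption rather than part of the design declared later: the sample size, the prevalence of 0.3, and the three control items.

```r
# Minimal base-R sketch of a list experiment (all values are illustrative assumptions)
set.seed(1)
n <- 20000
prevalence <- 0.3                                    # assumed true share agreeing with the sensitive item
control_count <- rbinom(n, size = 3, prob = 0.5)     # agreement with three control items
sensitive <- rbinom(n, size = 1, prob = prevalence)  # latent agreement with the sensitive item

y0 <- control_count              # Y_i(0): count on the short list
y1 <- control_count + sensitive  # Y_i(1): count on the long list

# Randomly assign half of the respondents to the long list
z <- sample(rep(c(0, 1), each = n / 2))
y <- ifelse(z == 1, y1, y0)      # only one potential outcome is observed per respondent

# The difference in means estimates E[Y_i(1) - Y_i(0)], the prevalence
estimate <- mean(y[z == 1]) - mean(y[z == 0])
estimate
```

No respondent reveals their own answer to the sensitive item, yet the difference in mean counts should land close to the assumed prevalence.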

1.7.5 The list experiment: Simplified example

How many of these do you agree with:

	Short list	Long list	“Effect”
	“2 + 2 = 4”	“2 + 2 = 4”
	“2 * 3 = 6”	“2 * 3 = 6”
	“3 + 6 = 8”	“Climate change is real”
		“3 + 6 = 8”
Answer	Y(0) = 2	Y(1) = 3	Y(1) - Y(0) = 1

[Note: this is obviously not a good list. Why not?]

1.7.6 The list experiment: Design

library(DeclareDesign)

declaration_17.3 <-
  declare_model(
    N = 500,
    # Number of control items (out of 3) that a respondent agrees with
    control_count = rbinom(N, size = 3, prob = 0.5),
    # Latent agreement with the sensitive item (true prevalence 0.3)
    Y_star = rbinom(N, size = 1, prob = 0.3),
    # The long list (Z = 1) adds the sensitive item to the count
    potential_outcomes(Y_list ~ Y_star * Z + control_count)
  ) +
  declare_inquiry(prevalence_rate = mean(Y_star)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y_list = reveal_outcomes(Y_list ~ Z)) +
  declare_estimator(Y_list ~ Z, .method = difference_in_means,
                    inquiry = "prevalence_rate")

diagnosands <- declare_diagnosands(
  bias = mean(estimate - estimand),
  mean_CI_width = mean(conf.high - conf.low)
)

1.7.7 Diagnosis

diagnose_design(declaration_17.3, diagnosands = diagnosands)

Design	Inquiry	Bias	Mean CI Width
declaration_17.3	prevalence_rate	0.00	0.32
		(0.00)	(0.00)
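
A back-of-the-envelope check on the reported CI width, assuming a normal approximation, 250 respondents per arm, and independence of the control count and the sensitive item (as in the declared model):

\[\mathrm{Var}(Y \mid Z=0) = 3 \times 0.5 \times 0.5 = 0.75, \qquad \mathrm{Var}(Y \mid Z=1) = 0.75 + 0.3 \times 0.7 = 0.96\]

\[\mathrm{SE} \approx \sqrt{\frac{0.96}{250} + \frac{0.75}{250}} \approx 0.083, \qquad \text{CI width} \approx 2 \times 1.96 \times 0.083 \approx 0.32\]

Most of the variance comes from the control items, not from the sensitive item itself.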

1.7.8 Tradeoffs: is the question really sensitive?

library(dplyr)  # for case_when()

# Placeholder values (assumed here); both are varied by redesign() in the diagnosis
N <- 500
proportion_hiding <- 0.2

declaration_17.4 <- 
  declare_model(
    N = N,
    U = rnorm(N),
    control_count = rbinom(N, size = 3, prob = 0.5),
    Y_star = rbinom(N, size = 1, prob = 0.3),
    # W = 1 for "hiders": people with Y_star = 1 who deny it when asked directly
    W = case_when(Y_star == 0 ~ 0L,
                  Y_star == 1 ~ rbinom(N, size = 1, prob = proportion_hiding)),
    potential_outcomes(Y_list ~ Y_star * Z + control_count)
  ) +
  declare_inquiry(prevalence_rate = mean(Y_star)) +
  declare_assignment(Z = complete_ra(N)) + 
  declare_measurement(Y_list = reveal_outcomes(Y_list ~ Z),
                      Y_direct = Y_star - W) +
  declare_estimator(Y_list ~ Z, inquiry = "prevalence_rate", label = "list") + 
  declare_estimator(Y_direct ~ 1, inquiry = "prevalence_rate", label = "direct")

1.7.9 Diagnosis

declaration_17.4 |> 
  redesign(proportion_hiding = seq(from = 0, to = 0.3, by = 0.1), 
           N = seq(from = 500, to = 2500, by = 500)) |> 
  diagnose_design()
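
The tradeoff can also be approximated analytically, without simulation. The sketch below is an approximation under stated assumptions, not output from the diagnosis: it takes the design parameters above (prevalence 0.3, three control items, equal-sized arms), treats the estimand as fixed at 0.3, and uses hypothetical column names such as `rmse_direct`.

```r
# Approximate analytic RMSE of the direct and list estimators (a sketch, not the diagnosis itself)
prevalence <- 0.3
grid <- expand.grid(
  proportion_hiding = seq(0, 0.3, by = 0.1),
  N = seq(500, 2500, by = 500)
)

# Direct question: hiders report 0, so the expected answer lies below the truth
q <- prevalence * (1 - grid$proportion_hiding)
direct_bias <- q - prevalence
direct_var <- q * (1 - q) / grid$N
grid$rmse_direct <- sqrt(direct_bias^2 + direct_var)

# List experiment: unbiased, but noisy because of the control items
var_control <- 3 * 0.5 * 0.5
var_treated <- var_control + prevalence * (1 - prevalence)
grid$rmse_list <- sqrt((var_treated + var_control) / (grid$N / 2))

grid$list_wins <- grid$rmse_list < grid$rmse_direct
grid
```

Under these assumptions the direct question has the lower RMSE whenever nobody hides, while with `proportion_hiding = 0.3` and `N = 2500` the list experiment does better: bias does not shrink with sample size, but variance does.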

1.7.10 Negatively correlated items

How would estimates be affected if the items selected for the list were negatively correlated?
How would subject protection be affected?

1.7.11 Negatively correlated items

rho <- -.8 

correlated_lists <- 
  declare_model(
    N = 500,
    U = rnorm(N),
    control_1 = rbinom(N, size = 1, prob = 0.5),
    control_2 = correlate(given = control_1, rho = rho, draw_binary, prob = 0.5),
    control_count = control_1 + control_2,
    Y_star = rbinom(N, size = 1, prob = 0.3),
    potential_outcomes(Y_list ~ Y_star * Z + control_count)
  ) +
  declare_inquiry(prevalence_rate = mean(Y_star)) +
  declare_assignment(Z = complete_ra(N)) + 
  declare_measurement(Y_list = reveal_outcomes(Y_list ~ Z)) +
  declare_estimator(Y_list ~ Z, inquiry = "prevalence_rate")

1.7.12 Negatively correlated items

draw_data(correlated_lists) |> ggplot(aes(control_count)) + 
  geom_histogram() + theme_bw()

1.7.13 Negatively correlated items

correlated_lists |> redesign(rho = c(-.8, 0, .8)) |> diagnose_design()

Accuracy and protection trade off against each other: the more precisely the list responses pin down the sensitive item, the less protection respondents have.
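
Both sides of the tradeoff can be computed exactly for two binary control items. This is a sketch under assumptions: each control item has agreement probability 0.5, the sensitive item has prevalence 0.3 and is independent of the controls, and `rho` is taken to be the Pearson correlation between the two control items, which need not coincide exactly with the `rho` argument of `correlate()`. The function name `list_tradeoff` is hypothetical.

```r
# Exact accuracy/protection calculation for two correlated binary control items (a sketch)
list_tradeoff <- function(rho, prevalence = 0.3) {
  p_both <- (1 + rho) / 4                       # P(agree with both) = P(agree with neither)
  p_count <- c(p_both, 1 - 2 * p_both, p_both)  # P(control count = 0, 1, 2)
  var_count <- sum(p_count * (0:2)^2) - sum(p_count * 0:2)^2

  # Long-list answers 0..3: agreeing with the sensitive item shifts the count up by one
  joint_yes <- c(0, p_count) * prevalence
  joint_no <- c(p_count, 0) * (1 - prevalence)
  posterior_yes <- joint_yes / (joint_yes + joint_no)

  list(var_control_count = var_count, posterior_yes = posterior_yes)
}

for (rho in c(-0.8, 0, 0.8)) {
  out <- list_tradeoff(rho)
  cat("rho =", rho, "| var(control count) =", out$var_control_count,
      "| P(yes | answer 0..3) =", round(out$posterior_yes, 3), "\n")
}
```

With `rho = -0.8` the control count has variance 0.1, which helps precision, but a long-list answer of 2 then implies agreement with the sensitive item with probability of about 0.89.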

1.7.14 Individual or group effects?

The list experiment is typically used to estimate average levels.
However, you can use it in the obvious way to get average levels for groups: this is equivalent to calculating group-level heterogeneous effects.
Extending the idea, you can even get individual-level estimates: for instance, you might use causal forests.
You can also use this to estimate the effect of an experimental treatment on an item that’s measured using a list, without requiring individual-level estimates:

\[Y_i = \beta_0 + \beta_1Z_i + \beta_2Long_i + \beta_3Z_iLong_i\]

Here \(Long_i\) indicates assignment to the long list, and \(\beta_3\) is the effect of \(Z\) on the share agreeing with the sensitive item.
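
A minimal base R sketch of this regression, with illustrative assumptions throughout: the list assignment is randomized independently of \(Z\), the treatment raises the prevalence of the sensitive item from 0.3 to 0.5, and it does not affect the control items.

```r
# Sketch: effect of a treatment Z on a sensitive item measured with a list
# (all parameter values are illustrative assumptions)
set.seed(2)
n <- 40000
z <- rbinom(n, 1, 0.5)      # experimental treatment
long <- rbinom(n, 1, 0.5)   # independently randomized long-list indicator
control_count <- rbinom(n, size = 3, prob = 0.5)

# Assumed prevalence of the sensitive item: 0.3 under control, 0.5 under treatment
sensitive <- rbinom(n, 1, ifelse(z == 1, 0.5, 0.3))
y <- control_count + long * sensitive

fit <- lm(y ~ z * long)
coef(fit)
```

The coefficient on `long` should be close to the control-group prevalence (0.3) and the coefficient on `z:long` close to the assumed treatment effect (0.2).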

1.7.15 Hiders and liars

Note that here we looked at “hiders”: people who do not answer the direct question truthfully.
See Li (2019) on bounds when the “no liars” assumption is threatened; that assumption concerns whether people respond truthfully to the list experiment question.

Baron, Reuben M, and David A Kenny. 1986. “The Moderator–Mediator Variable Distinction in Social Psychological Research: Conceptual, Strategic, and Statistical Considerations.” Journal of Personality and Social Psychology 51 (6): 1173.

Borusyak, Kirill, Xavier Jaravel, and Jann Spiess. 2021. “Revisiting Event Study Designs: Robust and Efficient Estimation.” arXiv Preprint arXiv:2108.12419.

Calonico, Sebastian, Matias D Cattaneo, and Rocio Titiunik. 2014. “Robust Nonparametric Confidence Intervals for Regression-Discontinuity Designs.” Econometrica 82 (6): 2295–2326.

De Chaisemartin, Clément, and Xavier d’Haultfoeuille. 2020. “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects.” American Economic Review 110 (9): 2964–96.

Goodman-Bacon, Andrew. 2021. “Difference-in-Differences with Variation in Treatment Timing.” Journal of Econometrics 225 (2): 254–77.

Green, Donald P, Shang E Ha, and John G Bullock. 2010. “Enough Already about ‘Black Box’ Experiments: Studying Mediation Is More Difficult Than Most Scholars Suppose.” The Annals of the American Academy of Political and Social Science 628 (1): 200–208.

Imai, Kosuke, and In Song Kim. 2021. “On the Use of Two-Way Fixed Effects Regression Models for Causal Inference with Panel Data.” Political Analysis 29 (3): 405–15.

Keele, Luke J, and Rocio Titiunik. 2015. “Geographic Boundaries as Regression Discontinuities.” Political Analysis 23 (1): 127–55.

Lee, David S, and Thomas Lemieux. 2010. “Regression Discontinuity Designs in Economics.” Journal of Economic Literature 48 (2): 281–355.

Li, Yimeng. 2019. “Relaxing the No Liars Assumption in List Experiment Analyses.” Political Analysis 27 (4): 540–55.

Olden, Andreas, and Jarle Møen. 2022. “The Triple Difference Estimator.” The Econometrics Journal 25 (3): 531–53.

Roth, Jonathan, Pedro HC Sant’Anna, Alyssa Bilinski, and John Poe. 2023. “What’s Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature.” Journal of Econometrics.