Causal Inference

Topics I

Macartan Humphreys

1 Topics 1

1.1 Covariate Adjustment

1.1.1 When to condition? What to condition on?

The key idea from Pearl is that you want to find a set of variables such that when you condition on these you get what you would get if you had random assignment, that is, if you used a do operation.

1.1.2 When to condition? What to condition on?

Intuition:

  • You could imagine creating a “mutilated” graph by removing all the arrows leading out of X
  • Then select a set of variables, \(Z\), such that \(X\) and \(Y\) are d-separated by \(Z\) on the mutilated graph
  • When you condition on these you are making sure that any covariation between \(X\) and \(Y\) is covariation that is due to the effects of \(X\)

1.1.3 Illustration

1.1.4 Illustration: Remove paths out

1.1.5 Illustration: Block backdoor path

1.1.6 Illustration: Why not like this?

1.1.7 Backdoor Criterion: (Pearl 1995)

The backdoor criterion is satisfied by \(Z\) (relative to \(X\), \(Y\)) if:

  1. No node in \(Z\) is a descendant of \(X\)
  2. \(Z\) blocks every backdoor path from \(X\) to \(Y\) (i.e. every path that contains an arrow into \(X\))

In that case you can identify the effect of \(X\) on \(Y\) by conditioning on \(Z\):

\[P(Y=y | \hat{x}) = \sum_z P(Y=y| X = x, Z=z)P(z)\] (This is eqn 3.19 in Pearl (2000))

1.1.8 Backdoor Criterion: (Pearl 1995)

\[P(Y=y | \hat{x}) = \sum_z P(Y=y| X = x, Z=z)P(z)\]

  • There is no notion of a linear control or anything like that; the idea really is like blocking: think of lots of discrete data with no missing patterns
  • Note this is a formula for a (possibly counterfactual) level; a counterfactual difference would be given in the obvious way by:

\[P(Y=y | \hat{x}) - P(Y=y | \hat{x}')\]
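For intuition, here is a minimal sketch (simulated data, not from Pearl) computing the adjustment formula by direct summation, with a binary confounder \(Z\), treatment \(X\), and outcome \(Y\):

library(dplyr)

set.seed(1)
dd <- tibble(
  Z = rbinom(10000, 1, .4),              # confounder
  X = rbinom(10000, 1, .2 + .5*Z),       # treatment depends on Z
  Y = rbinom(10000, 1, .2 + .3*X + .3*Z))

# P(Y = 1 | do(X = x)) = sum_z P(Y = 1 | X = x, Z = z) P(Z = z)
p_do <- function(x)
  dd |>
    group_by(Z) |>
    summarize(p_y = mean(Y[X == x]), p_z = n()/nrow(dd)) |>
    with(sum(p_y * p_z))

p_do(1) - p_do(0)   # close to the true effect of .3

The naive difference, mean(dd$Y[dd$X == 1]) - mean(dd$Y[dd$X == 0]), overstates the effect because \(Z\) raises both \(X\) and \(Y\).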

1.1.9 Backdoor Proof

Following Pearl (2009), Chapter 11. Let \(T\) denote the set of parents of \(X\): \(T := pa(X)\), with (possibly vector valued) realizations \(t\). These might not all be observed.

If the backdoor criterion is satisfied, we have:

  1. \(Y\) is independent of \(T\), given \(X\) and observed data \(Z\) (since \(Z\) blocks backdoor paths)
  2. \(X\) is independent of \(Z\) given \(T\) (since \(Z\) includes only nondescendants of \(X\))
  • Key idea: The intervention level relates to the observational level as follows: \[p(y|\hat{x}) = \sum_{t\in T} p(t)p(y|x, t)\]

  • Think of this as fully accounting for the (possibly unobserved) causes of \(X\), \(T\)

1.1.10 Backdoor Proof

We want to get to:

\[p(y|\hat{x}) = \sum_{t\in T} p(t)p(y|x, t)\]

  • But of course we do not observe \(T\), rather we observe \(Z\). So we now need to write everything in terms of \(Z\) rather than \(T\).

We bring \(Z\) into the picture by writing:

\[p(y|\hat{x}) = \sum_{t\in T} p(t) \sum_z p(y|x, t, z)p(z|x, t)\]

Now we want to get rid of \(T\).

1.1.11 Backdoor Proof

Now we want to get rid of \(T\).

  • Using the two conditions from the backdoor definition above:

    1. replace \(p(y|x, t, z)\) with \(p(y | x, z)\)
    2. replace \(p(z|x, t)\) with \(p(z|t)\)

This gives: \[p(y|\hat x) = \sum_{t \in T} p(t) \sum_z p(y|x, z)p(z|t)\]

Cleaning up, we can get rid of \(T\):

\[p(y|\hat{x}) = \sum_z p(y|x, z)\sum_{t\in T} p(z|t)p(t) = \sum_z p(y| x, z)p(z)\]

1.1.12 Backdoor proof figure

For intuition:

We would be happy if we could condition on the parent \(T\), but \(T\) is not observed. However, we can use \(Z\) instead, making use of the fact that:

  1. \(p(y|x, t, z) = p(y | x, z)\) (since \(Z\) blocks)
  2. \(p(z|x, t) = p(z|t)\) (since \(Z\) is upstream and blocked by parents, \(T\))

1.1.13 Guidelines

1.1.14 Example

Consider for example this data.

  • You randomly pair offerers and receivers in a dictator game (in which offerers decide how much of $1 to give to receivers).
  • Your population comes from two groups (80% Baganda and 20% Banyankole) so in randomly assigning partners you are randomly determining whether a partner is a coethnic or not.
  • You find that in non-coethnic pairings 35% is offered, in coethnic pairings 48% is offered.

Should you believe it?

1.1.15 Covariate Adjustment

  • Population: randomly matched Baganda (80% of pop) and Banyankole (20% of pop)
  • You find: in non-coethnic pairings 35% is offered, in coethnic pairings 48% is offered.
  • But a closer look at the data reveals…
                       To: Baganda   To: Banyankole
Offers by Baganda          64%             16%
Offers by Banyankole       16%              4%

Figure 1: Number of Games

                       To: Baganda   To: Banyankole
Offers by Baganda          50              50
Offers by Banyankole       20              20

Figure 2: Average Offers

So that’s a problem
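A minimal sketch reconstructing data like this (assuming offers depend only on the offerer's group):

library(dplyr)

set.seed(1)
games <- tibble(
  offerer  = sample(c("Baganda", "Banyankole"), 5000, replace = TRUE, prob = c(.8, .2)),
  receiver = sample(c("Baganda", "Banyankole"), 5000, replace = TRUE, prob = c(.8, .2)),
  coethnic = offerer == receiver,
  offer    = if_else(offerer == "Baganda", 50, 20))

# An apparent coethnicity effect...
games |> group_by(coethnic) |> summarize(mean(offer))            # roughly 35 vs 48

# ...that vanishes within offerer groups
games |> group_by(offerer, coethnic) |> summarize(mean(offer))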

1.1.16 Covariate Adjustment

Control?

  • With such data you might be tempted to ‘control’ for the covariate (here: ethnic group), using regression.
  • But, perhaps surprisingly, it turns out that regression with covariates does not estimate average treatment effects.
  • It does estimate an average of treatment effects, but one with weights chosen to minimize variance, which need not correspond to your estimand.

1.1.17 Covariate Adjustment

Compare:

  • \(\hat{\tau}_{ATE} =\sum_{x} \frac{w_x}{\sum_{j}w_{j}}\hat{\tau}_x\)
  • \(\hat{\tau}_{OLS} =\sum_{x} \frac{w_xp_x(1-p_x)}{\sum_{j}w_j{p_j(1-p_j)}}\hat{\tau}_x\)

Instead you can use the formula above for \(\hat{\tau}_{ATE}\) to estimate the ATE
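Here is a quick sketch with simulated data showing that OLS with an additive control recovers the variance-weighted average, not the ATE:

library(dplyr)
library(estimatr)

set.seed(1)
df1 <- tibble(
  X = rep(0:1, each = 500),
  Z = rbinom(1000, 1, .2 + .3*X),   # assignment propensity differs by stratum
  Y = rnorm(1000) + Z*X)            # stratum effects are 0 and 1, so ATE = .5

strata <- df1 |>
  group_by(X) |>
  summarize(tau = mean(Y[Z == 1]) - mean(Y[Z == 0]), w = n(), p = mean(Z))

weighted.mean(strata$tau, strata$w)                              # ATE weighting: near .5
weighted.mean(strata$tau, strata$w * strata$p * (1 - strata$p))  # OLS weighting
lm_robust(Y ~ Z + X, data = df1)$coefficients[["Z"]]             # matches the OLS weighting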

alternatively…

1.1.18 Covariate adjustment via saturated regression

  • Alternatively you can use propensity weights.
  • Alternatively you can use a regression that includes both the treatment and the treatment interacted with the covariates.
    • In practice this is best done by demeaning the covariates; doing this lets you read off the average effect from the main term. Key resource: Lin (2012)

You should have noticed that the logic for controlling for a covariate here is equivalent to the logic we saw for heterogeneous assignment propensities. These are really the same thing.

1.1.19 Covariate adjustment via saturated regression

Returning to prior example:

library(DeclareDesign) # also attaches fabricatr, randomizr, and estimatr
library(tidyverse)
library(knitr)

df <- fabricatr::fabricate(
  N = 500, 
  X = rep(0:1, N/2), 
  Z = rbinom(N, 1, .2 + .3*X),
  Y = rnorm(N) + Z*X)

lm_robust(Y ~ Z*X_c, data = df |> mutate(X_c = X - mean(X))) |>
  tidy() |> kable(digits = 2)
term estimate std.error statistic p.value conf.low conf.high df outcome
(Intercept) -0.05 0.06 -0.89 0.37 -0.17 0.06 496 Y
Z 0.47 0.11 4.32 0.00 0.26 0.68 496 Y
X_c 0.08 0.12 0.65 0.51 -0.15 0.30 496 Y
Z:X_c 0.85 0.22 3.91 0.00 0.42 1.28 496 Y

1.1.20 Covariate adjustment via saturated regression

Returning to prior example:

lm_lin(Y ~ Z, ~ X, data = df) |>
  tidy() |> kable(digits = 2)
term estimate std.error statistic p.value conf.low conf.high df outcome
(Intercept) -0.05 0.06 -0.89 0.37 -0.17 0.06 496 Y
Z 0.47 0.11 4.32 0.00 0.26 0.68 496 Y
X_c 0.08 0.12 0.65 0.51 -0.15 0.30 496 Y
Z:X_c 0.85 0.22 3.91 0.00 0.42 1.28 496 Y

1.1.21 Demeaning and saturating

Demeaning interactions

  • Say you have a factorial design with treatments X1 and X2 (or observational data with two covariates)
  • You analyse with a model that has main terms and interaction terms
  • Interpreting coefficients can be confusing, but sometimes demeaning can help. What does demeaning do?

1.1.22 Demeaning and saturating

Let’s:

  • Declare a factorial design in which Y is generated according to

f_Y <- function(X1, X2, u) .1 + .2*X1 + .3*X2 + u*X1*X2

where u is distributed \(U[0,1]\).

  • Specify estimands carefully
  • Run analyses in which we do and do not demean the treatments; compare and explain results

1.1.23 Demeaning interactions

f_Y <- function(X1, X2, u) .1 + .2*X1 + .3*X2 + u*X1*X2

design <-
  declare_model(N = 1000, u = runif(N),
                X1 = complete_ra(N), X2 = block_ra(blocks = X1),
                X1_demeaned = X1 - mean(X1),
                X2_demeaned = X2 - mean(X2),
                Y = f_Y(X1, X2, u)) +
  declare_inquiry(
    base = mean(f_Y(0, 0, u)),
    average = mean(f_Y(0, 0, u) + f_Y(0, 1, u)  + f_Y(1, 0, u)  + f_Y(1, 1, u))/4,
    CATE_X1_given_0 = mean(f_Y(1, 0, u) - f_Y(0, 0, u)),
    CATE_X2_given_0 = mean(f_Y(0, 1, u) - f_Y(0, 0, u)),
    ATE_X1 = mean(f_Y(1, X2, u) - f_Y(0, X2, u)),
    ATE_X2 = mean(f_Y(X1, 1, u) - f_Y(X1, 0, u)),
    I_X1_X2 = mean((f_Y(1, 1, u) - f_Y(0, 1, u)) - (f_Y(1, 0, u) - f_Y(0, 0, u)))
  ) +
  declare_estimator(Y ~ X1*X2, 
                    inquiry = c("base", "CATE_X1_given_0", "CATE_X2_given_0", "I_X1_X2"), 
                    term = c("(Intercept)", "X1", "X2", "X1:X2"),
                    label = "natural") +
  declare_estimator(Y ~ X1_demeaned*X2_demeaned, 
                    inquiry = c("average", "ATE_X1", "ATE_X2", "I_X1_X2"), 
                    term = c("(Intercept)", "X1_demeaned", "X2_demeaned", "X1_demeaned:X2_demeaned"),
                    label = "demeaned")

1.1.24 Demeaning interactions: Solution

Estimator Inquiry Term Mean Estimand Mean Estimate
demeaned ATE_X1 X1_demeaned 0.45 0.45
demeaned ATE_X2 X2_demeaned 0.55 0.55
demeaned I_X1_X2 X1_demeaned:X2_demeaned 0.50 0.50
demeaned average (Intercept) 0.48 0.48
natural CATE_X1_given_0 X1 0.20 0.20
natural CATE_X2_given_0 X2 0.30 0.30
natural I_X1_X2 X1:X2 0.50 0.50
natural base (Intercept) 0.10 0.10

It’s all good. But you need to match the estimator to the inquiry: demean for average marginal effects; do not demean for conditional marginal effects.

1.1.25 Recap

If you have different groups with different assignment propensities you can do any or all of these (sketched below):

  1. Blocked differences in means
  2. Inverse propensity weighting
  3. Saturated regression (Lin)
  4. More… (coming)

You cannot (reliably):

  1. Ignore the groups
  2. Include them in a regression (without interactions)
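As a minimal sketch, here are the first three strategies applied to the df simulated above (the propensities, .2 + .3X, are known by design):

library(dplyr)
library(estimatr)

# 1. Blocked difference in means
df |>
  group_by(X) |>
  summarize(tau = mean(Y[Z == 1]) - mean(Y[Z == 0]), n = n()) |>
  summarize(ATE = weighted.mean(tau, n))

# 2. Inverse propensity weighting, using the known propensities
df_w <- df |> mutate(p = .2 + .3*X, ipw = Z/p + (1 - Z)/(1 - p))
lm_robust(Y ~ Z, weights = ipw, data = df_w)

# 3. Saturated (Lin) regression
lm_lin(Y ~ Z, ~ X, data = df)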

1.2 To control or not control

Even “good” controls might not pay their way in practice: when (and how) does controlling for covariates improve things, and when does it make things worse?

1.2.1 Considerations

  • Even though randomization ensures no bias, you may sometimes want to “control” for covariates in order to improve efficiency (see the discussion of blocking above).
  • Or you may have to take account of the fact that the assignment to treatment is correlated with a covariate (as above).
  • In observational work you might also figure out that you have to control for a covariate to justify inferences (refer to our discussion of the backdoor criterion)

1.2.2 Observational work

  • Observational motivation: Controls can provide grounds for identification
  • But recall – they can also destroy identification

For a great walk through of what you can draw from graphical models for the decision to control see:

A Crash Course in Good and Bad Controls by Cinelli, Forney, and Pearl (2022)

Aside: these implications generally refer to using controls as covariates, e.g. by implementing blocked differences in means or similar. For a Bayesian model of the form used in CausalQueries the information from “bad controls” is used wisely.

1.2.3 Experimental work

  • Conditional Bias and Precision Gains from Controls

  • Experimental motivation: Controls can reduce noise and improve precision. This is an argument for using variables that are correlated with the outcome (not with the treatment).

1.2.4 Precision Gains from Controls

However: Introducing controls can create complications

  • As argued by Freedman (summary from Lin (2012)), we can get: “worsened asymptotic precision, invalid measures of precision, and small-sample bias”\(^*\)

  • These adverse effects are essentially removed with an interacted model

  • See discussions in Imbens and Rubin (2015) (7.6, 7.7) and especially Theorem 7.2 for the asymptotic variance of the estimator

\(^*\) though note that the precision concern does not hold when treatment and control groups are equally sized

1.2.5 To control or not

We will illustrate by comparing:

  • DIM
  • OLS (linear controls)
  • Lin (saturated)

With:

  • Varying fraction assigned to treatment
  • Varying relation between \(Y(0)\) and \(Y(1)\)

1.2.6 Declaration (from RDSS)

# https://book.declaredesign.org/library/experimental-causal.html

prob <- 0.5
control_slope <- -1

declaration_18.3 <-
  declare_model(N = 100, X = runif(N, 0, 1),
                U = rnorm(N, sd = 0.1),
                Y_Z_1 = 1*X + U, Y_Z_0 = control_slope*X + U
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N = N, prob = prob)) + 
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE", label = "DIM") +
  declare_estimator(Y ~ Z + X, .method = lm_robust, inquiry = "ATE", label = "OLS") +
  declare_estimator(Y ~ Z, covariates = ~X, .method = lm_lin, 
                    inquiry = "ATE", label = "Lin")

1.2.7 Implied potential outcomes

The variances and covariance of potential outcomes depend on the slope parameter

1.2.8 To control or not

Diagnosis:

declaration_18.3 |> 
  redesign(
    control_slope = seq(-1, 1, 0.5), 
    prob = seq(0.1, 0.9, 0.1)) |> 
  diagnose_designs()

1.2.9 To control or not

  • Lin always does well
  • OLS does fine with equal probability assignments
  • Otherwise the ranking of DIM and OLS is design dependent

1.2.10 Conditional Bias and Precision Gains from Controls

  • Treatment correlated with covariates can induce “conditional bias.”
  • Including controls that are correlated with treatment can introduce inefficiencies
  • Including controls can change your estimates so be sure not to fish!

1.2.11 Declaration

a <- 0
b <- 0

design <- 
  declare_model(N = 100,
                X = rnorm(N),
                Z = complete_ra(N),
                Y_Z_0 = a*X + rnorm(N),
                Y_Z_1 = a*X + correlate(given = X, rho = b, rnorm) + 1,
                Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, covariates = ~X, .method = lm_lin, label = "Lin") +
  declare_estimator(Y ~ Z, label = "No controls") +
  declare_estimator(Z ~ X, label = "Condition")

The design implements estimation controlling and not controlling for \(X\) and also keeps track of the results of a test for the relation between \(Z\) and \(X\).

1.2.12 Simulations

We run many simulations over a range of designs:

simulations <- 
  list(design |> redesign(a = 0, b = 0),  design |> redesign(a = 1, b = 0),  design |> redesign(a = 0, b = 1)) |>
  simulate_design(sims = 20000) 

1.2.13 Standard errors

We see that standard errors are larger when you control for a covariate that is not predictive of the outcome but is correlated with the treatment. Otherwise they can be smaller.

See Mutz et al

1.2.14 Errors

We also see “conditional bias” when we do not control: where the distribution of errors depends on the correlation with the covariate.


1.2.15 Estimands

Puzzle: Does the sample average treatment effect on the treated depend on the covariate balance?

1.3 Doubly robust estimation

1.3.1 Doubly robust estimation

Doubly robust estimation combines:

  1. A model for how the covariates predict the potential outcomes
  2. A model for how the covariates predict assignment propensities

Using both together, estimating potential outcomes with propensity weighting, lets you do well even if one (but not both) of the models is wrong.

Each part can be done using nonparametric methods, resulting in an overall semiparametric procedure.

  • \(\pi(X) = \Pr(Z=1|X)\): Estimate \(\hat\pi\)
  • \(Y_z = \mathbb{E}[Y|Z=z, X]\): Estimate \(\hat{Y}_z\)
  • Estimate of causal effect: \(\frac{1}{n}\sum_{i=1}^n\left(\frac{Z_i}{\hat{\pi}_i}\left(Y_i - \hat{Y}_{i1}\right) - \frac{1-Z_i}{1-\hat{\pi}_i}\left(Y_i - \hat{Y}_{i0}\right) + \left(\hat{Y}_{i1} - \hat{Y}_{i0}\right) \right)\)

1.3.2 Doubly robust estimation

  • Estimate of causal effect: \(\frac{1}{n}\sum_{i=1}^n\left(\frac{Z_i}{\hat{\pi}_i}\left(Y_i - \hat{Y}_{i1}\right) - \frac{1-Z_i}{1-\hat{\pi}_i}\left(Y_i - \hat{Y}_{i0}\right) + \left(\hat{Y}_{i1} - \hat{Y}_{i0}\right) \right)\)

  • Note that if the \(\hat{Y}_{iz}\) are correct then, in expectation, the first parts drop out and we get the right answer.

  • So if you can impute the potential outcomes, you are good (though this is hardly surprising)

1.3.3 Doubly robust estimation

  • More subtly, say the \(\hat{\pi}\)s are correct but your imputations are wrong; then we again have an unbiased estimator.

To see this, imagine that with probability \(\pi\) we assign unit 1 to treatment and unit 2 to control (otherwise unit 1 to control and unit 2 to treatment).

Then our expected estimate is:

\[\frac12\left[\pi\left(\frac{1}{\pi}\left(Y_{11} - \hat{Y}_{11}\right) - \frac{1}{\pi}\left(Y_{20} - \hat{Y}_{20}\right)\right) + (1-\pi)\left(\frac{1}{1-\pi}\left(Y_{21} - \hat{Y}_{21}\right) - \frac{1}{1-\pi}\left(Y_{10} - \hat{Y}_{10}\right)\right) + \left(\hat{Y}_{11} - \hat{Y}_{10}\right) + \left(\hat{Y}_{21} - \hat{Y}_{20}\right)\right]\]

The \(\pi\)s cancel:

\[\frac12\left[\left(Y_{11} - \hat{Y}_{11}\right) - \left(Y_{20} - \hat{Y}_{20}\right) + \left(Y_{21} - \hat{Y}_{21}\right) - \left(Y_{10} - \hat{Y}_{10}\right) + \left(\hat{Y}_{11} - \hat{Y}_{10}\right) + \left(\hat{Y}_{21} - \hat{Y}_{20}\right)\right]\]

and the imputed values then cancel in turn, leaving:

\[\frac12\left(Y_{11} - Y_{10} + Y_{21}- Y_{20}\right)\]

Robins, Rotnitzky, and Zhao (1994)

1.3.4 Doubly robust estimation illustration

Consider this data (with confounding):

# df with true treatment effect of 1 
# (0.5 if race = 0; 1.5 if race = 1)

df <- fabricatr::fabricate(
  N = 5000,
  class = sample(1:3, N, replace = TRUE),
  race = rbinom(N, 1, .5),
  Z = rbinom(N, 1, .2 + .3*race),
  Y = .5*Z + race*Z + class + rnorm(N),
  qsmk = factor(Z),
  class = factor(class),
  race = factor(race)
)

1.3.5 Simple approaches

Naive regression produces biased estimates, even with controls. Lin regression, however, recovers the right answer.

# Naive
lm_robust(Y ~ Z, data = df)$coefficients[["Z"]]
[1] 1.229234
# OLS with controls
lm_robust(Y ~ Z + class + race, data = df)$coefficients[["Z"]]
[1] 1.139115
# Lin
lm_lin(Y ~ Z,  ~ class + race, data = df)$coefficients[["Z"]]
[1] 1.017271
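As a sketch, the doubly robust formula from above can also be computed by hand on the same data, with a logit model for the propensities and linear models for the outcomes:

# Propensity model and outcome models
pi_hat <- glm(Z ~ race + class, family = binomial, data = df)$fitted.values
Y1_hat <- predict(lm(Y ~ race + class, data = df, subset = Z == 1), newdata = df)
Y0_hat <- predict(lm(Y ~ race + class, data = df, subset = Z == 0), newdata = df)

# AIPW estimate
with(df, mean(Z/pi_hat * (Y - Y1_hat) -
              (1 - Z)/(1 - pi_hat) * (Y - Y0_hat) +
              (Y1_hat - Y0_hat)))   # close to the true effect of 1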

1.3.6 Doubly robust estimation

drtmle is an R package that uses doubly robust estimation to compute “marginal means of an outcome under fixed levels of a treatment.”

library(SuperLearner)
library(drtmle)
drtmle_fit <- drtmle(
  W = df |> select(race, class), 
  A = df$Z, 
  Y = df$Y, 
  SL_Q = c("SL.glm", "SL.mean", "SL.glm.interaction"),
  SL_g = c("SL.glm", "SL.mean", "SL.glm.interaction"),
  SL_Qr = "SL.glm",
  SL_gr = "SL.glm", 
  maxIter = 1
)

1.3.7 Doubly robust estimation

# "Marginal means"
drtmle_fit$drtmle$est
[1] 2.990577 1.972552
# Effects
ci(drtmle_fit, contrast = c(-1,1))
$drtmle
                   est    cil    ciu
E[Y(0)]-E[Y(1)] -1.018 -1.082 -0.954
wald_test(drtmle_fit, contrast = c(-1,1))
$drtmle
                       zstat pval
H0:E[Y(0)]-E[Y(1)]=0 -31.203    0

Resource: https://muse.jhu.edu/article/883477

1.3.8 Assessing performance

Challenge: Use DeclareDesign to compare performance of drtmle and lm_lin

1.4 Matching

1.4.1 Core idea

The matching idea can be seen as an application of the backdoor criterion:

Seek features \(X\) that block backdoor paths from \(Z\) to \(Y\) and estimate:

\[\sum_x p(x) \left(\mathbb E[Y | Z = 1, X = x] - \mathbb E[Y | Z = 0, X = x]\right)\]

This, fundamentally, is a blocked difference in means. The intuitions from the discussion of Lin estimation apply here.

1.4.2 ATT Wrinkle

In practice matching differs from simply controlling for covariates (with interactions) insofar as matches are (often) sought specifically for treated units.

Thus for example in data like this:

Z X Y
0 1 1
0 2 2
1 1 3
1 1 4

we would not make use of the second observation. Indeed it is not even part of our estimand.

1.4.3 ATT Wrinkle

More formally, if \(M_i\) is the set of matches for treated unit \(i\) (of \(n_1\) treated units), our estimand is:

\[\tau_{ATT} = \frac{1}{n_1} \sum_{i: Z_i = 1}\left(Y_i(1) - Y_i(0)\right) \]

and we estimate:

\[\hat\tau_{ATT} =\frac{1}{n_1} \sum_{i: Z_i = 1}\left(Y_i - \frac{1}{|M_i|}\sum_{j \in M_i}Y_j\right) \]
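A minimal sketch of this estimator, by hand, for the four-row table above (exact matches on \(X\)):

d <- data.frame(Z = c(0, 0, 1, 1), X = c(1, 2, 1, 1), Y = c(1, 2, 3, 4))

tau_i <- sapply(which(d$Z == 1), function(i) {
  M_i <- d$Z == 0 & d$X == d$X[i]   # exact matches on X for treated unit i
  d$Y[i] - mean(d$Y[M_i])
})
mean(tau_i)   # ((3 - 1) + (4 - 1))/2 = 2.5; the control row with X = 2 is never used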

1.4.4 ATE?

In principle however one could maintain a focus on ATE and find matches both for treated and for control units.

The commitment to the ATT then is not cooked in. See Stuart and Green (2008)

Example below.

1.4.5 Variations

  • Exact matching. Great if possible: use as matches units that exactly agree on all covariates
  • Coarsened exact matching. Transparent and easy: use as matches units that exactly agree on all coarsened covariates (i.e. nearly agree on covariates)
  • Distance matching e.g. Mahalanobis, nearest neighbor: use as matches units that are close in the multidimensional covariate space
  • Propensity matching. Clever but more model dependent: use as matches units that, based on covariates, have similar estimated probabilities of being in treatment

Exact matching illustration

exact_matching <- 
  function(data) { 
    matched <- MatchIt::matchit(Z ~ X, method = "exact", data = data) 
    MatchIt::match.data(matched) 
  }

design <-
  declare_model(
    N = 20, 
    U = rnorm(N), 
    X = sample(c(.2, .4, .6, .8), N, replace = TRUE),
    Z = rbinom(N, 1, prob = X),
    Y_Z_0 = 0.2 * X + U,
    Y_Z_1 = Y_Z_0 + 0.5,
    Y = reveal_outcomes(Y~Z)
  ) + 
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_step(handler = exact_matching) +
  declare_estimator(Y ~ Z, weights = weights, label = "Matched")

1.4.6 Investigate

set.seed(1000)
data <- draw_data(design)

dim(data)
[1] 15  9
data |> group_by(X, Z) |> summarize(Y = mean(Y), n = n()) |> kable()
X Z Y n
0.2 0 -0.1203946 2
0.2 1 -0.7960410 1
0.4 0 -0.4513312 5
0.4 1 1.2997507 1
0.6 0 -0.1495231 4
0.6 1 -0.0433644 2

1.4.7 Investigate

get_estimates(design,  data) |> kable()
estimator term estimate std.error statistic p.value conf.low conf.high df outcome
Matched Z 0.3219382 0.5401371 0.5960305 0.5613898 -0.8449571 1.488833 13 Y

By hand

sets <- 
  data |> group_by(X, Z) |> summarize(Y = mean(Y), n = n()) |>
  group_by(X) |> summarize(diff = Y[2] - Y[1], n = n[2])

sets |> kable()
X diff n
0.2 -0.6756465 1
0.4 1.7510819 1
0.6 0.1061586 2
weighted.mean(sets$diff, sets$n)
[1] 0.3219382

Estimand considerations

Here there is no actual confounding and there are efficiency losses from matching. Why?

design <-
  declare_model(
    N = 20, 
    U = rnorm(N), 
    X = sample(c(.2, .4, .6, .8), N, replace = TRUE),
    Z = rbinom(N, 1, prob = .5),
    Y_Z_0 = 0.2 * X + U,
    Y_Z_1 = Y_Z_0 + 0.5,
    Y = reveal_outcomes(Y~Z)
  ) + 
  declare_inquiry(
    ATE = mean(Y_Z_1 - Y_Z_0),
    ATT = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, label = "Unmatched", inquiry = "ATE") +
  declare_step(handler = exact_matching) +
  declare_estimator(Y ~ Z, weights = weights, label = "Matched", inquiry = "ATT")

diagnose_design(design) |> reshape_diagnosis() |> select(Estimator, RMSE) |> kable()
Estimator RMSE
Unmatched 0.46
(0.01)
Matched 0.56
(0.02)

1.4.8 Puzzle

Say that in fact \(X\) does not affect \(Z\); then is there anything at stake in going from the ATE to the ATT?

1.4.9 For real

The MatchIt package (Ho et al. 2011) provides multiple methods and diagnostic tools to implement different strategies.

The example of full propensity matching below is from the package vignettes

Steps:

  • Choose covariates (close the backdoor)
  • Choose matching method
  • Check matches
  • Maybe revise method
  • Implement

1.4.10 For real

The “Lalonde” data has been used by many researchers to assess the merits of different strategies; here the question is the effect of job training on earnings. We have before and after earnings data and various background characteristics.

The basic analysis would yield:

library(MatchIt) # provides the lalonde data and matching functions

lm_robust(re78 ~ treat, data = lalonde) |> tidy() |> kable()
term estimate std.error statistic p.value conf.low conf.high df outcome
(Intercept) 6984.1697 352.1654 19.8320697 0.0000000 6292.570 7675.7691 612 re78
treat -635.0262 677.1954 -0.9377297 0.3487532 -1964.935 694.8824 612 re78

1.4.11 For real

But there are clear concerns regarding confounding:

lm_robust(treat ~ age + educ + race + married + nodegree + re74 +  re75, data = lalonde) |> 
  tidy() |> kable(digits = 2)
term estimate std.error statistic p.value conf.low conf.high df outcome
(Intercept) 0.34 0.13 2.67 0.01 0.09 0.59 605 treat
age 0.00 0.00 1.34 0.18 0.00 0.01 605 treat
educ 0.02 0.01 2.62 0.01 0.01 0.04 605 treat
racehispan -0.44 0.05 -8.10 0.00 -0.55 -0.34 605 treat
racewhite -0.53 0.04 -13.95 0.00 -0.60 -0.45 605 treat
married -0.09 0.04 -2.60 0.01 -0.16 -0.02 605 treat
nodegree 0.10 0.04 2.20 0.03 0.01 0.19 605 treat
re74 0.00 0.00 -2.26 0.02 0.00 0.00 605 treat
re75 0.00 0.00 0.75 0.46 0.00 0.00 605 treat

1.4.12 For real: Match

Here we match using a probit model to predict assignment probabilities:

pm <- 
  matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
          data = lalonde,
          method = "full",
          distance = "glm",
          link = "probit")
pm_data <- MatchIt::match.data(pm) 

pm_data |> arrange(subclass) |> head() |> kable(digits = 2)
treat age educ race married nodegree re74 re75 re78 distance weights subclass
NSW1 1 37 11 black 1 1 0.00 0.00 9930.05 0.64 1.00 1
PSID69 0 30 17 black 0 0 17827.37 5546.42 14421.13 0.63 2.32 1
NSW10 1 33 12 white 1 0 0.00 0.00 12418.07 0.04 1.00 2
PSID13 0 34 8 white 1 1 8038.87 11404.35 5486.80 0.04 0.06 2
PSID16 0 42 0 hispan 1 1 2797.83 10929.92 9922.93 0.05 0.06 2
PSID26 0 27 7 white 1 1 3064.29 8461.07 11149.45 0.04 0.06 2

“matching as pre-processing”

1.4.13 For real: Inspect

plot(summary(pm))

Huge gains in balance

1.4.14 For real: Estimate

lm(re78 ~ treat * (age + educ + race + married + nodegree + re74 + re75),
          data = pm_data, weights = weights) |>
  
  marginaleffects::avg_comparisons(
    variables = "treat",
    vcov = ~subclass, 
    newdata = subset(treat == 1))

 Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
     1977        704 2.81  0.00501 7.6   596   3357

Term: treat
Type:  response 
Comparison: 1 - 0
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 

1.4.15 Intuition: What are these weights?

pm_data <- pm_data |> 
     mutate(
       propensity = glm(
         treat ~ age + educ + race + married + nodegree + re74 + re75,
         data = pm_data, family = binomial(link = "probit"))$fitted.values)

pm_data |> 
  ggplot(aes(propensity, weights, color = subclass, shape = factor(treat))) +  geom_point() +
  scale_y_continuous(trans = "log") + theme(legend.position = "none")

Weights on treated units equal 1. If the propensity is low, treated units are weighted more than control units.

1.4.16 Intuition: Comparison with IPW (by hand)

lalonde <- lalonde |> 
  mutate(
    propensity = glm(
      treat ~ age + educ + race + married + nodegree + re74 + re75,
      data = lalonde, family = binomial(link = "probit"))$fitted.values,
    ipw = 1/(treat*propensity + (1-treat)*(1-propensity)))

lm_robust(re78 ~ treat * (age + educ + race + married + nodegree + re74 + re75 + ipw),
   data = lalonde |> filter(ipw <= 20), 
   weights = ipw) |>
  
  marginaleffects::avg_comparisons(
    variables = "treat",
    newdata = subset(treat == 1))

1.5 Diff in diff

Key idea: the evolution of units in the control group allows you to impute what the evolution of units in the treatment group would have been had they not been treated

1.5.1 Logic

We have group \(A\) that enters treatment at some point and group \(B\) that never does

The estimate:

\[\hat\tau = (\mathbb{E}[Y^A | post] - \mathbb{E}[Y^A | pre]) -(\mathbb{E}[Y^B | post] - \mathbb{E}[Y^B | pre])\] (how different is the change in \(A\) compared to the change in \(B\)?)

can be written:

\[\hat\tau = (\mathbb{E}[Y_1^A | post] - \mathbb{E}[Y_0^A | pre]) -(\mathbb{E}[Y_0^B | post] - \mathbb{E}[Y_0^B | pre])\]

1.5.2 Logic

\[\hat\tau = (\mathbb{E}[Y_1^A | post] - \mathbb{E}[Y_0^A | pre]) -(\mathbb{E}[Y_0^B | post] - \mathbb{E}[Y_0^B | pre])\]

Adding and subtracting \(\mathbb{E}[Y_0^A | post]\):

\[\hat\tau = (\mathbb{E}[Y_1^A | post] - \mathbb{E}[Y_0^A | post]) + ((\mathbb{E}[Y_0^A | post] - \mathbb{E}[Y_0^A | pre]) -(\mathbb{E}[Y_0^B | post] - \mathbb{E}[Y_0^B | pre]))\]

Cleaning up, in expectation:

\[\mathbb{E}[\hat\tau] = \tau_{ATT} + \text{Difference in trends}\]

1.5.3 Simplest case

n_units <- 2
design <- 
  declare_model(
    unit = add_level(N = n_units, I = 1:N),
    time = add_level(N = 6, T = 1:N, nest = FALSE),
    obs = cross_levels(by = join_using(unit, time))) +
  declare_model(potential_outcomes(Y ~ I + T^.5 + Z*T)) +
  declare_assignment(Z = 1*(I>(n_units/2))*(T>3)) +
  declare_measurement(Y = reveal_outcomes(Y~Z)) + 
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0),
                  ATT = mean(Y_Z_1[Z==1] - Y_Z_0[Z==1])) +
  declare_estimator(Y ~ Z, label = "naive") + 
  declare_estimator(Y ~ Z + I, label = "FE1") + 
  declare_estimator(Y ~ Z + as.factor(T), label = "FE2") + 
  declare_estimator(Y ~ Z + I + as.factor(T), label = "FE3")  

1.5.4 Diagnosis

Here only the two-way fixed effects estimator is unbiased, and only for the ATT.

The ATT here is averaging over effects for treated units (later periods only). We know nothing about the size of effects in earlier periods when all units are in control!

design |> diagnose_design() 
Inquiry Estimator Bias
ATE FE1 2.25
ATE FE2 6.50
ATE FE3 1.50
ATE naive 5.40
ATT FE1 0.75
ATT FE2 5.00
ATT FE3 0.00
ATT naive 3.90

1.5.5 The classic graph

design |> 
  draw_data() |>
  ggplot(aes(T, Y, color = unit)) + geom_line() +
       geom_point(aes(T, Y_Z_0)) + theme_bw()

1.5.6 Extends to multiple units easily

design |> redesign(n_units = 10) |> diagnose_design() 
Inquiry Estimator Bias
ATE FE1 2.25
ATE FE2 6.50
ATE FE3 1.50
ATE naive 5.40
ATT FE1 0.75
ATT FE2 5.00
ATT FE3 0.00
ATT naive 3.90

1.5.7 Extends to multiple units easily

design |> 
  redesign(n_units = 10) |>  
  draw_data() |> 
  ggplot(aes(T, Y, color = unit)) + geom_line() +
       geom_point(aes(T, Y_Z_0)) + theme_bw()

1.5.8 In practice

  • Need to defend parallel trends
  • Most typically using an event study: predicting trends in a unit based on its history
  • Sometimes: report balance between treatment and control groups in covariates
  • Placebo leads and lags

1.5.9 Heterogeneity

  • Things get much more complicated when there is

    1. heterogeneous timing in treatment take up and
    2. heterogeneous effects
  • It’s only recently been appreciated how tricky things can get

  • But we already have an intuition from our analysis of trials with heterogeneous assignment and heterogeneous effects:

  • In such cases fixed effects analysis weights stratum-level treatment effects by the variance of assignment to treatment

  • Something similar happens here

1.5.10 Staggered assignments

Just two units assigned at different times:

trend <- 0

design <- 
  declare_model(
    unit = add_level(N = 2, ui = rnorm(N), I = 1:N),
    time = add_level(N = 6, ut = rnorm(N), T = 1:N, nest = FALSE),
    obs = cross_levels(by = join_using(unit, time))) +
  declare_model(
    potential_outcomes(Y ~ trend*T + (1+Z)*(I == 2))) +
  declare_assignment(Z = 1*((I == 1) * (T>3) + (I == 2) * (T>5))) +
  declare_measurement(Y = reveal_outcomes(Y~Z), 
                      I_c = I - mean(I)) +
  declare_inquiry(mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, label = "1. naive") + 
  declare_estimator(Y ~ Z + I, label = "2. FE1") + 
  declare_estimator(Y ~ Z + as.factor(T), label = "3. FE2") + 
  declare_estimator(Y ~ Z + I + as.factor(T), label = "4. TWFE") + 
  declare_estimator(Y ~ Z*I_c + as.factor(T), label = "5. Sat")  

1.5.11 Staggered assignments diagnosis

draw_data(design) |> mutate(I = factor(I)) |>
  ggplot(aes(T, Y, color = I, label = Z)) + geom_text()

1.5.12 Staggered assignments diagnosis

Estimator Mean Estimand Mean Estimate
1. naive 0.50 -0.12
(0.00) (0.00)
2. FE1 0.50 0.36
(0.00) (0.00)
3. FE2 0.50 -1.00
(0.00) (0.00)
4. TWFE 0.50 0.25
(0.00) (0.00)
5. Sat 0.50 0.50
(0.00) (0.00)

1.5.13 Where do these numbers come from?

  • The estimand is .5 – this comes from weighting the effect for unit 1 (0) and the effect for unit 2 (1) equally

  • The naive estimate is wildly off because it does not take into account that units with different treatment shares have different average levels in outcomes

1.5.14 Where do these numbers come from?

  • The estimate when we control for unit is 0.36: this comes from weighting the unit-stratum level effects according to the variance of assignment to each stratum:
design |> draw_data() |> group_by(unit) |> summarize(var = mean(Z)*(1-mean(Z))) |>
  mutate(weight = var/sum(var)) |> kable(digits = 2)
unit var weight
1 0.25 0.64
2 0.14 0.36

1.5.15 Where do these numbers come from?

  • The estimate when we control for time is -1: this comes from weighting the time-stratum level effects according to the variance of assignment within each stratum
  • Only periods 4 and 5 get (equal) weight, yielding the difference in outcomes between unit 1 in treatment (0) and unit 2 in control (1)
design |> draw_data()  |> group_by(time) |> summarize(var = mean(Z)*(1-mean(Z))) |>
  mutate(weight = var/sum(var)) |> kable(digits = 2)
time var weight
1 0.00 0.0
2 0.00 0.0
3 0.00 0.0
4 0.25 0.5
5 0.25 0.5
6 0.00 0.0

1.5.16 Where do these numbers come from?

  • The estimate when we control for time and unit is 0.25

  • This is actually a lot harder to interpret:

  • We can figure out what it is from the “Goodman-Bacon decomposition” in Goodman-Bacon (2021)

1.5.17 Where do these numbers come from?

1.5.18 Where do these numbers come from?

In this case we can think of our data having the following structure:

y1 y2
pre 0 1
mid 0 1
post 0 2
  • the mid to pre diff in diff is (0-0) - (1-1) = 0 (group 2 serves as control)
  • the post to mid diff in diff is (2-1) - (0-0) = 1 (group 1 serves as control though already in treatment!)

TWFE gives a weighted average of these two, putting a 3/4 weight on the first and a 1/4 weight on the second

1.5.19 Two way fixed effects

Specifically:

\[\hat\beta^{DD} = \mu_{12}\hat\beta^{2 \times 2, 1}_{12} + (1-\mu_{12})\hat\beta^{2 \times 2, 2}_{12}\]

where \(\mu_{12} = \frac{1-\overline{Z_1}}{1-(\overline{Z_1}-\overline{Z_2})} = \frac{1-\frac36}{1-\left(\frac36-\frac16\right)} = \frac34\)

\[\frac34\hat\beta^{2 \times 2, 1}_{12} + \frac14 \hat\beta^{2 \times 2, 2}_{12}\] (weights formula from WP version)

1.5.20 Two way fixed effects

And:

  • \(\hat\beta^{2 \times 2, 1}_{12}\) is \((\overline{y}_1^{MID(1,2)} - \overline{y}_1^{PRE(1)}) - (\overline{y}_2^{MID(1,2)} - \overline{y}_2^{PRE(1)})\)

which in the simple example without time trends is \((0 - 0) - (1-1) = 0\)

  • \(\hat\beta^{2 \times 2, 2}_{12}\) is \((\overline{y}_2^{POST(2)} - \overline{y}_2^{MID(1,2)}) - (\overline{y}_1^{POST(2)} - \overline{y}_1^{MID(1,2)})\)

which is \((2 - 1) - (0 - 0) = 1\)
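Putting the pieces together, a quick check of the arithmetic:

Zbar_1 <- 3/6                                    # unit 1 treated in periods 4-6
Zbar_2 <- 1/6                                    # unit 2 treated in period 6 only
mu_12  <- (1 - Zbar_1)/(1 - (Zbar_1 - Zbar_2))   # = 3/4
mu_12 * 0 + (1 - mu_12) * 1                      # = 0.25, the TWFE estimate above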

1.5.21 A problem

So quite complex weighting of different comparisons

  • Units effectively get counted as both treatment and control units for different comparisons
  • Treated units counted as control units
  • Effectively negative weights on some quantities
  • Possible to have very poorly performing estimates

1.5.22 Solutions

  • Involve better specification of estimands
  • Use of comparisons directly relevant for the estimands
  • Imputation of control outcomes in treated units using data from appropriate control units only and then targeting estimands directly (Borusyak, Jaravel, and Spiess 2021)
  • Particular solution: focus on the effect of treatment at the time of first treatment or time of switching; this usually involves a no-carryover assumption (De Chaisemartin and d’Haultfoeuille 2020); see also Imai and Kim (2021)

1.5.23 Solutions

See excellent review by Roth et al. (2023)

knitr::include_graphics("assets/didtools.jpg")

1.5.24 De Chaisemartin and d’Haultfoeuille (2020)

library(DIDmultiplegt)
library(rdss)
design <- 
  declare_model(
    unit = add_level(N = 4, I = 1:N),
    time = add_level(N = 8, T = 1:N, nest = FALSE),
    obs = cross_levels(by = join_using(unit, time),
                       potential_outcomes(Y ~ T + (1 + Z)*I))) +
  declare_assignment(Z = 1*(T > (I + 4))) +
  declare_measurement(
    Y = reveal_outcomes(Y~Z),
    Z_lag = lag_by_group(Z, groups = unit, n = 1, order_by = T)) +
  declare_inquiry(
    ATT_switchers = mean(Y_Z_1 - Y_Z_0), 
    subset = Z == 1 & Z_lag == 0 & !is.na(Z_lag)) +
  declare_estimator(
    Y = "Y",  G = "unit",  T = "T",  D = "Z", mode = "old", 
    handler = label_estimator(rdss::did_multiplegt_tidy),
    inquiry = c("ATT_switchers"),
    label = "chaisemartin"
  ) 

1.5.25 De Chaisemartin and d’Haultfoeuille (2020)

Note the inquiry

run_design(design) |> kable(digits = 2)
inquiry estimand estimator estimate
ATT_switchers 2 chaisemartin 2

1.5.26 Triple differencing

  • A response to concerns that double differencing is not enough is to triple difference

  • This is useful when you think that there may be a violation of parallel trends but you have other outcomes that would pick up the same difference in trends

  • See Olden and Møen (2022)

1.5.27 Triple differencing

You are interested in the effects of influx of refugees on right wing voting

  • You have (say more conservative) states with no refugees at any period
  • You have (say more liberal) states with refugees post 2016 only

You want to do differences in differences comparing these states before and after

  • However you worry that things change differentially in the conservative and liberal states: no parallel trends

  • But you can identify areas within states that are more or less likely to be exposed and compare differences in differences across the exposed and unexposed groups.

1.5.28 Triple differencing

So:

  • Two types of states: \(L \in \{0,1\}\), only \(L=1\) types get refugee influx
  • Two time periods: \(Post \in \{0,1\}\), refugee influx occurs in period \(Post = 1\)
  • Two groups: \(B \in \{0,1\}\), \(B=1\) types affected by refugee influx

\[Y = \beta_0 + \beta_1 L + \beta_2 B + \beta_3 Post + \beta_4 LB + \beta_5 L Post + \beta_6 B Post + \beta_7L B Post + \epsilon\]

1.5.29 Triple differencing

\[\frac{\partial ^3Y}{\partial L \partial B \partial Post} = \beta_7\]
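A minimal sketch (with an assumed data-generating process) in which a differential trend in the \(L=1\) states breaks the simple diff in diff but the three-way interaction recovers the effect:

library(estimatr)

set.seed(1)
ddd <- expand.grid(L = 0:1, B = 0:1, Post = 0:1, i = 1:250)
ddd$Y <- with(ddd,
  .5*L + .2*B + .3*Post +
  .4*L*Post +        # differential trend in L = 1 states: violates parallel trends
  1*L*B*Post +       # true treatment effect of 1
  rnorm(nrow(ddd)))

lm_robust(Y ~ L*B*Post, data = ddd)$coefficients[["L:B:Post"]]   # approximately 1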

1.5.30 Can we not just condition on the \(B=1\) types?

The level among the \(B=1\) types is:

\[Y = \beta_0 + \beta_1 L + \beta_2 + \beta_3 Post + \beta_4 L + \beta_5 L Post + \beta_6 Post + \beta_7L Post + \epsilon\]

If you did simple before/after differences among the \(B=1\) types you would get

\[\Delta Y| L = 1, B = 1 = \beta_3 + \beta_5 + \beta_6 + \beta_7\] \[\Delta Y| L = 0, B = 1 = \beta_3 + \beta_6\]

1.5.31 Can we not just condition on the \(B=1\) types?

And so if you differenced again you would get:

\[\Delta^2 Y| B = 1 = \beta_5 + \beta_7\]

So the problem is that you have \(\beta_5\) in here, which corresponds exactly to how \(L\) states change over time.

1.5.32 Triple

But we can figure out \(\beta_5\) by doing a diff in diff among the \(B=0\)s.

\[Y|B = 0 = \beta_0 + \beta_1 L + \beta_3 Post + \beta_5 L Post\]

\[\Delta^2 Y| B = 0 = \beta_5\]

1.5.33 Easier to swallow?

  • The identifying assumption is that absent treatment the differences in trends between \(L=0\) and \(L=1\) would be the same for units with \(B=0\) and \(B=1\).

  • See equation 5.4 in Olden and Møen (2022)

\[ \left(E[Y_0|L=1, B=1, {\textit {Post}}=1] - E[Y_0|L=1, B=1, {\textit {Post}}=0]\right) \\ \quad - \ \left(E[Y_0|L=1, B=0, {\textit {Post}}=1] - E[Y_0|L=1, B=0, {\textit {Post}}=0]\right) \\ = \nonumber \\ \left(E[Y_0|L=0, B=1, {\textit {Post}}=1] - E[Y_0|L=0, B=1, {\textit {Post}}=0]\right) \\ \quad - \ \left(E[Y_0|L=0, B=0, {\textit {Post}}=1] - E[Y_0|L=0, B=0, {\textit {Post}}=0]\right)\]

1.5.34 Easier to swallow?

  • In a sense this is one parallel trends assumption, not two

  • But there are four counterfactual quantities in this expression.

Puzzle: Is it possible to have an effect identified by a difference in difference but incorrectly by a triple difference design?

1.6 Noncompliance and the LATE estimand

1.6.1 Local Average Treatment Effects

Sometimes you give a medicine but only a nonrandom sample of people actually try to use it. Can you still estimate the medicine’s effect?

X=0 X=1
Z=0 \(\overline{y}_{00}\) (\(n_{00}\)) \(\overline{y}_{01}\) (\(n_{01}\))
Z=1 \(\overline{y}_{10}\) (\(n_{10}\)) \(\overline{y}_{11}\) (\(n_{11}\))

Say that people are one of 3 types:

  1. \(n_a\) “always takers” have \(X=1\) no matter what and have average outcome \(\overline{y}_a\)
  2. \(n_n\) “never takers” have \(X=0\) no matter what with outcome \(\overline{y}_n\)
  3. \(n_c\) “compliers” have \(X=Z\) and average outcomes \(\overline{y}^1_c\) if treated and \(\overline{y}^0_c\) if not.

1.6.2 Local Average Treatment Effects

Sometimes you give a medicine but only a nonrandom sample of people actually try to use it. Can you still estimate the medicine’s effect?

X=0 X=1
Z=0 \(\overline{y}_{00}\) (\(n_{00}\)) \(\overline{y}_{01}\) (\(n_{01}\))
Z=1 \(\overline{y}_{10}\) (\(n_{10}\)) \(\overline{y}_{11}\) (\(n_{11}\))

We can figure out something about the types:

\(X=0\) \(X=1\)
\(Z=0\) \(\frac{\frac{1}{2}n_c}{\frac{1}{2}n_c + \frac{1}{2}n_n} \overline{y}^0_{c}+\frac{\frac{1}{2}n_n}{\frac{1}{2}n_c + \frac{1}{2}n_n} \overline{y}_{n}\) \(\overline{y}_{a}\)
\(Z=1\) \(\overline{y}_{n}\) \(\frac{\frac{1}{2}n_c}{\frac{1}{2}n_c + \frac{1}{2}n_a} \overline{y}^1_{c}+\frac{\frac{1}{2}n_a}{\frac{1}{2}n_c + \frac{1}{2}n_a} \overline{y}_{a}\)

1.6.3 Local Average Treatment Effects

You give a medicine to 50% but only a nonrandom sample of people actually try to use it. Can you still estimate the medicine’s effect?

\(X=0\) \(X=1\)
\(Z=0\) \(\frac{n_c}{n_c + n_n} \overline{y}^0_{c}+\frac{n_n}{n_c + n_n} \overline{y}_n\) \(\overline{y}_{a}\)
(n) (\(\frac{1}{2}(n_c + n_n)\)) (\(\frac{1}{2}n_a\))
\(Z=1\) \(\overline{y}_{n}\) \(\frac{n_c}{n_c + n_a} \overline{y}^1_{c}+\frac{n_a}{n_c + n_a} \overline{y}_{a}\)
(n) (\(\frac{1}{2}n_n\)) (\(\frac{1}{2}(n_a+n_c)\))

Key insight: the contributions of the \(a\)s and \(n\)s are the same in the \(Z=0\) and \(Z=1\) groups so if you difference you are left with the changes in the contributions of the \(c\)s.

1.6.4 Local Average Treatment Effects

Average in \(Z=0\) group: \(\frac{{n_c} \overline{y}^0_{c}+ \left(n_{n}\overline{y}_{n} +{n_a} \overline{y}_a\right)}{n_a+n_c+n_n}\)

Average in \(Z=1\) group: \(\frac{{n_c} \overline{y}^1_{c} + \left(n_{n}\overline{y}_{n} +{n_a} \overline{y}_a \right)}{n_a+n_c+n_n}\)

So, the difference is the ITT: \(({\overline{y}^1_c-\overline{y}^0_c})\frac{n_c}{n}\)

Last step:

\[ITT = ({\overline{y}^1_c-\overline{y}^0_c})\frac{n_c}{n}\]

\[\leftrightarrow\]

\[LATE = \frac{ITT}{\frac{n_c}{n}}= \frac{\text{Intent to treat effect}}{\text{First stage effect}}\]
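A minimal sketch (with assumed types and effects) showing the ratio at work, and its two-stage least squares equivalent:

library(estimatr)

set.seed(1)
N    <- 10000
type <- sample(c("complier", "always", "never"), N, replace = TRUE, prob = c(.5, .2, .3))
dat  <- data.frame(
  Z = rbinom(N, 1, .5))                                # random encouragement
dat$X <- ifelse(type == "always", 1, ifelse(type == "never", 0, dat$Z))
dat$Y <- rnorm(N) + dat$X * (type == "complier")       # effect of 1 for compliers only

ITT         <- with(dat, mean(Y[Z == 1]) - mean(Y[Z == 0]))
first_stage <- with(dat, mean(X[Z == 1]) - mean(X[Z == 0]))
ITT/first_stage                                        # approximately 1

iv_robust(Y ~ X | Z, data = dat)                       # same estimate via 2SLS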

1.6.5 The good and the bad of LATE

  • You get a well-defined estimate even when there is non-random take-up
  • May sometimes be used to assess mediation or knock-on effects
  • But:
    • You need assumptions (monotonicity and the exclusion restriction – where were these used above?)
    • Your estimate is only for a subpopulation
    • The subpopulation is not chosen by you and is unknown
    • Different encouragements may yield different estimates since they may encourage different subgroups

1.6.6 Pearl and Chickering again

With and without an imposition of monotonicity

data("lipids_data")

models <- 
  list(unrestricted =  make_model("Z -> X -> Y; X <-> Y"),
       restricted =  make_model("Z -> X -> Y; X <-> Y") |>
         set_restrictions("X[Z=1] < X[Z=0]")) |> 
  lapply(update_model,  data = lipids_data, refresh = 0) 

models |>
  query_model(query = list(CATE = "Y[X=1] - Y[X=0]", 
                           Nonmonotonic = "X[Z=1] < X[Z=0]"),
              given = list("X[Z=1] > X[Z=0]", TRUE),
              using = "posteriors") 

1.6.7 Pearl and Chickering again

With and without an imposition of monotonicity:

model query mean sd
unrestricted CATE 0.70 0.05
restricted CATE 0.71 0.05
unrestricted Nonmonotonic 0.01 0.01
restricted Nonmonotonic 0.00 0.00

In one case we assume monotonicity; in the other we update on it (easy in this case because of the empirically verifiable nature of one-sided noncompliance)

1.7 Regression discontinuity

Errors and diagnostics

1.7.1 Intuition

  • The core idea of an RDD design is that if a decision rule assigns almost identical units to treatment and control conditions, then we can infer effects for those units by comparing them.

See excellent introduction: Lee and Lemieux (2010)

1.7.2 Intuition

  • Kids born on 31 August start school a year younger than kids born on 1 September: does starting younger help or hurt?

  • Kids born on 12 September 1983 are more likely to register Republican than kids born on 10 September 1983: can this identify the effects of registration on long term voting?

  • A district in which Republicans got 50.1% of the vote gets a Republican representative while districts in which Republicans got 49.9% of the vote do not: does having a Republican representative make a difference for these districts?

1.7.3 Argument for identification

Setting:

  • Typically the decision is based on the value of a “running variable”, \(X\): e.g. treatment if \(X > 0\)
  • The estimand is \(\mathbb{E}[Y(1) - Y(0)|X=0]\)

Two arguments:

  1. Continuity: \(\mathbb{E}[Y(1)|X=x]\) and \(\mathbb{E}[Y(0)|X=x]\) are continuous (at \(x=0\)) in \(x\): so \(\lim_{\hat x \rightarrow 0}\mathbb{E}[Y(0)|X=\hat x] = \mathbb{E}[Y(0)|X=0]\)

  2. Local randomization: tiny things that determine exact values of \(x\) are as if random and so we can think of a local experiment around \(X=0\).

1.7.4 Argument for identification

Note:

  • the continuity argument requires a continuous \(x\): granularity matters
  • it also builds off a conditional expectation function defined at \(X=0\)

An exclusion restriction is implicit in continuity: if something else happens at the threshold then the conditional expectation functions jump at the threshold

Implicit: \(X\) is exogenous in the sense that units cannot adjust \(X\) in order to be on one or the other side of the threshold

1.7.5 Evidence

Typically researchers show:

  1. “First stage” results: assignment to treatment does indeed jump at the threshold
  2. “ITT”: outcomes jump at the threshold
  3. LATE (if fuzzy / imperfect compliance) using IV

1.7.6 Evidence

Typically researchers show:

In addition:

  1. Arguments for no other treatments at the threshold
  2. Arguments for no “sorting” at the threshold
  3. Evidence for no “heaping” at the threshold (McCrary density test)

Sometimes:

  • argue for why estimates extend beyond the threshold
  • exclude points at the threshold (!)

1.7.7 Design

library(rdss) # for helper functions
library(rdrobust)

cutoff <- 0.5
bandwidth <- 0.5

control <- function(X) {
  as.vector(poly(X, 4, raw = TRUE) %*% c(.7, -.8, .5, 1))}
treatment <- function(X) {
  as.vector(poly(X, 4, raw = TRUE) %*% c(0, -1.5, .5, .8)) + .25}

rdd_design <-
  declare_model(
    N = 1000,
    U = rnorm(N, 0, 0.1),
    X = runif(N, 0, 1) + U - cutoff,
    D = 1 * (X > 0),
    Y_D_0 = control(X) + U,
    Y_D_1 = treatment(X) + U
  ) +
  declare_inquiry(LATE = treatment(0) - control(0)) +
  declare_measurement(Y = reveal_outcomes(Y ~ D)) + 
  declare_sampling(S = X > -bandwidth & X < bandwidth) +
  declare_estimator(Y ~ D*X, term = "D", label = "lm") + 
  declare_estimator(
    Y, X, 
    term = "Bias-Corrected",
    .method = rdrobust_helper,
    label = "optimal"
  )

1.7.8 RDD Data plotted

Note rdrobust implements:

  • local polynomial Regression Discontinuity (RD) point estimators
  • robust bias-corrected confidence intervals

See Calonico, Cattaneo, and Titiunik (2014) and related papers, and ?rdrobust::rdrobust for details

1.7.9 RDD Data plotted

rdd_design  |> draw_data() |> ggplot(aes(X, Y, color = factor(D))) + 
  geom_point(alpha = .3) + theme_bw() +
  geom_smooth(aes(X, Y_D_0)) + geom_smooth(aes(X, Y_D_1)) + theme(legend.position = "none")

1.7.10 RDD diagnosis

rdd_design |> diagnose_design()
Estimator Mean Estimate Bias SD Estimate Coverage
lm 0.23 -0.02 0.01 0.64
(0.00) (0.00) (0.00) (0.02)
optimal 0.25 0.00 0.03 0.89
(0.00) (0.00) (0.00) (0.01)

1.7.11 Bandwidth tradeoff

rdd_design |> 
  redesign(bandwidth = seq(from = 0.05, to = 0.5, by = 0.05)) |> 
  diagnose_designs()
  • As we increase the bandwidth, the lm bias gets worse, but slowly, while the error falls.
  • The best bandwidth is relatively wide.
  • This is more true for the optimal estimator.

1.7.12 Geographic RDs

Are popular in political science:

  • Put a lot of pressure on the assumption of no alternative treatment, including “random” country-level shocks!
  • Put a lot of pressure on no sorting assumptions (why was the border put where it was; why did units settle here or there?)
  • Put a lot of pressure on SUTVA: people on one side are literally proximate to people on another

See Keele and Titiunik (2015)

Borusyak, Kirill, Xavier Jaravel, and Jann Spiess. 2021. “Revisiting Event Study Designs: Robust and Efficient Estimation.” arXiv Preprint arXiv:2108.12419.
Calonico, Sebastian, Matias D Cattaneo, and Rocio Titiunik. 2014. “Robust Nonparametric Confidence Intervals for Regression-Discontinuity Designs.” Econometrica 82 (6): 2295–2326.
Cinelli, Carlos, Andrew Forney, and Judea Pearl. 2022. “A Crash Course in Good and Bad Controls.” Sociological Methods & Research, 00491241221099552.
De Chaisemartin, Clément, and Xavier d’Haultfoeuille. 2020. “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects.” American Economic Review 110 (9): 2964–96.
Goodman-Bacon, Andrew. 2021. “Difference-in-Differences with Variation in Treatment Timing.” Journal of Econometrics 225 (2): 254–77.
Ho, Daniel, Kosuke Imai, Gary King, and Elizabeth A Stuart. 2011. “MatchIt: Nonparametric Preprocessing for Parametric Causal Inference.” Journal of Statistical Software 42: 1–28.
Imai, Kosuke, and In Song Kim. 2021. “On the Use of Two-Way Fixed Effects Regression Models for Causal Inference with Panel Data.” Political Analysis 29 (3): 405–15.
Imbens, Guido W, and Donald B Rubin. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.
Keele, Luke J, and Rocio Titiunik. 2015. “Geographic Boundaries as Regression Discontinuities.” Political Analysis 23 (1): 127–55.
Lee, David S, and Thomas Lemieux. 2010. “Regression Discontinuity Designs in Economics.” Journal of Economic Literature 48 (2): 281–355.
Lin, Winston. 2012. “Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman’s Critique.” arXiv Preprint arXiv:1208.2301.
Olden, Andreas, and Jarle Møen. 2022. “The Triple Difference Estimator.” The Econometrics Journal 25 (3): 531–53.
Pearl, Judea. 1995. “Causal Diagrams for Empirical Research.” Biometrika 82 (4): 669–88.
Pearl, Judea. 2000. Causality: Models, Reasoning, and Inference. Cambridge University Press.
Pearl, Judea. 2009. Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press.
Robins, James M, Andrea Rotnitzky, and Lue Ping Zhao. 1994. “Estimation of Regression Coefficients When Some Regressors Are Not Always Observed.” Journal of the American Statistical Association 89 (427): 846–66.
Roth, Jonathan, Pedro HC Sant’Anna, Alyssa Bilinski, and John Poe. 2023. “What’s Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature.” Journal of Econometrics.
Stuart, Elizabeth A, and Kerry M Green. 2008. “Using Full Matching to Estimate Causal Effects in Nonexperimental Studies: Examining the Relationship Between Adolescent Marijuana Use and Adult Outcomes.” Developmental Psychology 44 (2): 395.