Design-based estimation

Macartan Humphreys

1 Frequentist Analysis

Estimation and testing

1.1 Outline

  • Simple estimates from experimental data
  • Weighting, blocking
  • Design-based variance estimates
  • Design-based \(p\) values
  • Controls and doubly robust estimation
  • Reporting

1.1.1 ATE: DIM

The (sample) average treatment effect can be estimated without bias (whether or not there is imbalance on covariates) using:

\[ \widehat{ATE} = \frac{1}{n_T}\sum_TY_i - \frac{1}{n_C}\sum_CY_i \]

1.1.2 ATE: DIM in practice

df <- fabricatr::fabricate(N = 100, Z = rep(0:1, N/2), Y = rnorm(N) + Z)

# by hand
df |>
  summarize(Y1 = mean(Y[Z==1]), 
            Y0 = mean(Y[Z==0]), 
            diff = Y1 - Y0) |> kable(digits = 2)
Y1 Y0 diff
1.07 -0.28 1.35
# with estimatr
estimatr::difference_in_means(Y ~ Z, data = df) |>
  tidy() |> kable(digits = 2)
term estimate std.error statistic p.value conf.low conf.high df outcome
Z 1.35 0.17 7.94 0 1.01 1.68 97.98 Y

1.1.3 ATE: DIM in practice

We can also do this with regression:

estimatr::lm_robust(Y ~ Z, data = df) |>
  tidy() |> kable(digits = 2)
term estimate std.error statistic p.value conf.low conf.high df outcome
(Intercept) -0.28 0.12 -2.33 0.02 -0.51 -0.04 98 Y
Z 1.35 0.17 7.94 0.00 1.01 1.68 98 Y

See Freedman (2008) on why regression (without covariate adjustment) is fine here

1.1.4 ATE: Blocks

Say now different strata or blocks \(\mathcal{S}\) had different assignment probabilities. Then you could estimate:

\[ \widehat{ATE} = \sum_{S\in \mathcal{S}}\frac{n_{S}}{n} \left(\frac{1}{n_{S1}}\sum_{S\cap T}y_i - \frac{1}{n_{S0}}\sum_{S\cap C}y_i \right) \]

Note: you cannot just ignore the blocks because assignment is no longer independent of potential outcomes: you might be sampling units with different potential outcomes with different probabilities.

However, the formula above works fine because assignment is random conditional on blocks.

1.1.5 ATE: Blocks

As a DAG this is just classic confounding:

make_model("Block -> Z ->Y <- Block") |> 
  plot(x_coord = c(2,1,3), y_coord = c(2, 1, 1))

1.1.6 ATE: Blocks in practice

Data with heterogeneous assignments:

df <- fabricatr::fabricate(
  N = 500, X = rep(0:1, N/2), 
  prob = .2 + .3*X,
  Z = rbinom(N, 1, prob),
  ip = 1/(Z*prob + (1-Z)*(1-prob)), # discuss: inverse probability of the realized condition
  Y = rnorm(N) + Z*X)

True effect is 0.5, but:

estimatr::difference_in_means(Y ~ Z, data = df) |>
  tidy() |> kable(digits = 2)
term estimate std.error statistic p.value conf.low conf.high df outcome
Z 0.9 0.1 9.32 0 0.71 1.09 377.93 Y

1.1.7 ATE: Blocks in practice

Averaging over effects in blocks

# by hand
estimates <- 
  df |>
  group_by(X) |>
  summarize(Y1 = mean(Y[Z==1]), 
            Y0 = mean(Y[Z==0]), 
            diff = Y1 - Y0,
            W = n())

estimates$diff |> weighted.mean(estimates$W)
[1] 0.7236939
# with estimatr
estimatr::difference_in_means(Y ~ Z, blocks = X, data = df) |>
  tidy() |> kable(digits = 2)
term estimate std.error statistic p.value conf.low conf.high df outcome
Z 0.72 0.11 6.66 0 0.51 0.94 496 Y

1.1.8 ATE with IPW

This also corresponds to the difference in the weighted average of treatment outcomes (with weights given by the inverse of the probability that each unit is assigned to treatment) and control outcomes (with weights given by the inverse of the probability that each unit is assigned to control).

  • The blocked difference-in-means estimator is the same as what you would get if you weighted units inversely by the shares assigned to each condition within their block.

1.1.9 ATE with IPW in practice

# by hand
df |>
  summarize(Y1 = weighted.mean(Y[Z==1], ip[Z==1]), 
            Y0 = weighted.mean(Y[Z==0],  ip[Z==0]), # note: ip also weights the control units
            diff = Y1 - Y0)|> 
  kable(digits = 2)
Y1 Y0 diff
0.59 -0.15 0.74
# with estimatr
estimatr::difference_in_means(Y ~ Z, weights = ip, data = df) |>
  tidy() |> kable(digits = 2)
term estimate std.error statistic p.value conf.low conf.high df outcome
Z 0.74 0.11 6.65 0 0.52 0.96 498 Y

1.1.10 ATE with IPW

  • But inverse propensity weighting is a more general principle, which can be used even if you do not have blocks.

  • The intuition for it comes straight from sampling weights — you weight up in order to recover an unbiased estimate of the potential outcomes for all units, whether or not they are assigned to treatment.

  • With sampling weights, however, you can include units even if their sampling probability was 1. Why can you not include units assigned with probability 1 when doing inverse propensity weighting?

1.1.11 Illustration: Estimating treatment effects with terrible treatment assignments: Fixer

Say you made a mess and used a randomization that was correlated with some variable, \(U\). For example:

  • The randomization is done in a way that introduces a correlation between Treatment Assignment and Potential Outcomes
  • Then possibly, even though there is no true causal effect, we naively estimate a large one — enormous bias
  • However since we know the assignment procedure we can fully correct for the bias

1.1.12 Illustration: Estimating treatment effects with terrible treatment assignments: Fixer

  • In the next example, we do this using “inverse propensity score weighting.”
  • This is exactly analogous to standard survey weighting — since we selected different units for treatment with different probabilities, we weight them differently to recover the average outcome among treated units (same for control).

1.1.13 Basic randomization: Fixer

Bad assignment: some randomization process you don’t fully understand (but can replicate) that results in unequal assignment probabilities.

N <- 400
U <- runif(N, .1, .9)

 design <- 
   declare_model(N = N,
                 Y_Z_0 = U + rnorm(N, 0, .1),
                 Y_Z_1 = U + rnorm(N, 0, .1),
                 Z = rbinom(N, 1, U)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, label = "naive") 

1.1.14 Basic randomization: Fixer

The result is a sampling distribution not centered on the true effect (0)

diagnosis <- diagnose_design(design)
diagnosis$simulations_df |>
  ggplot(aes(estimate)) + geom_histogram() + facet_grid(~estimator) +
  geom_vline(xintercept = 0, color = "red")

1.1.15 A fix

To fix you can estimate the assignment probabilities by replicating the assignment many times:

probs <- replicate(1000, design |> draw_data() |> pull(Z)) |> apply(1, mean)

and then use these assignment probabilities in your estimator

design_2 <-
  design +
  declare_measurement(weights = Z/probs + (1-Z)/(1-probs)) +
  declare_estimator(Y ~ Z, weights = weights, label = "smart") 

1.1.16 Basic randomization: Fixer

Implied weights

draw_data(design_2) |> 
  ggplot(aes(probs, weights, color = factor(Z))) + 
  geom_point()

1.1.17 Basic randomization: Fixer

Improved results

diagnosis <- diagnose_design(design_2)
diagnosis$simulations_df |>
  ggplot(aes(estimate)) + geom_histogram() + facet_grid(estimator~.)+
  geom_vline(xintercept = 0, color = "red")

1.1.18 IPW with one unit!

This example is surprising, but it helps you see the logic of why inverse weighting yields unbiased estimates (and why unbiasedness might not guarantee a reasonable answer).

Imagine there is one unit with potential outcomes \(Y(1) = 2, Y(0) = 1\). So the unit level treatment effect is 1.

You toss a coin.

  • If you assign to treatment you estimate: \(\hat\tau = \frac{2}{0.5} = 4\)
  • If you assign to control you estimate: \(\hat\tau = -\frac{1}{0.5} = -2\)

So your expected estimate is: \[0.5 \times 4 + 0.5 \times (-2) = 1\]

Great on average but always lousy
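
A quick simulation confirms this logic: the estimator is unbiased over repeated assignments even though every individual estimate is far from the truth.

# one-unit IPW example: unbiased on average, never close in any single draw
Z <- rbinom(1e5, 1, 0.5)                  # coin-flip assignments
tau_hat <- ifelse(Z == 1, 2/0.5, -1/0.5)  # estimate is 4 if treated, -2 if control
mean(tau_hat)                             # approximately 1, the true effect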

1.1.19 Generalization: why IPW works

  • Say a given unit is assigned to treatment with probability \(\pi_i\)
  • We estimate the average \(Y(1)\) using

\[\hat{\overline{Y_1}} = \frac{1}n\left(\sum_i \frac{Z_iY_i(1)}{\pi_i}\right)\] With independent assignment the expected value of \(\hat{\overline{Y_1}}\) is just:

\[\mathbb{E}[\hat{\overline{Y_1}}] =\frac1n\left( \left(\pi_1 \frac{1\times Y_1(1)}{\pi_1} + (1-\pi_1) \frac{0\times Y_1(1)}{\pi_1}\right) + \left(\pi_2 \frac{1\times Y_2(1)}{\pi_2} + (1-\pi_2) \frac{0\times Y_2(1)}{\pi_2}\right) + \dots\right)\]

\[\mathbb{E}[\hat{\overline{Y_1}}] =\frac1n\left( Y_1(1) + Y_2(1) + \dots\right) = \overline{Y_1}\]

and similarly for \(\mathbb{E}[\hat{\overline{Y_0}}]\) and so using linearity of expectations:

\[\mathbb{E}[\widehat{\overline{Y_1 - Y_0}}] = \overline{Y_1 - Y_0}\]

1.1.20 Generalization: why IPW works

  • Note we needed \(\pi_i >0\) and also \(\pi_i <1\) everywhere. Why?
  • We used independence here; sampling theory is used to show similar results for e.g. complete randomization
  • For blocked randomization this is easy to see

1.2 Design-based Estimation of Variance

Let’s talk about “inference”

1.2.1 Var(ATE)

  • Recall that the treatment effect estimate is obtained by taking a sample of outcomes under treatment and comparing them to a sample of outcomes under control
  • Say that there is no “error”
  • Why would this procedure produce uncertainty?

1.2.2 Var(ATE)

  • Why would this procedure produce uncertainty?
  • The uncertainty comes from being uncertain about the average outcome under control from observations of the control units, and from being uncertain about the average outcome under treatment from observations of the treated units
  • In other words, it comes from the variance in the treatment outcomes and variance in the control outcomes (and not, for example, from variance in the treatment effect)

1.2.3 Var(ATE)

  • In classical statistics we characterize our uncertainty over an estimate using an estimate of variance of the sampling distribution of the estimator.

  • The key idea is that we want to be able to say how likely we would be to get such an estimate if the distribution of estimates associated with our design looked a given way.

  • More specifically we want to estimate “standard error” or the “standard deviation of the sampling distribution”

(See Wooldridge (2023) where the standard error is understood as the “estimate of the standard deviation of the sampling distribution”)

1.2.4 Variance and standard errors

Given:

  • \(\hat\tau\) is an estimate for \(\tau\)
  • \(\overline{x}\) denotes the average value of \(x\)

The variance of the estimator over \(n\) repeated ‘runs’ of a design is: \(Var(\hat{\tau}) = \frac{1}{n}\sum_i(\hat\tau_i - \overline{\hat\tau})^2\)

And the standard error is:

\(se(\hat{\tau}) = \sqrt{\frac{1}{n}\sum_i(\hat\tau_i - \overline{\hat\tau})^2}\)

1.2.5 Variance and standard errors

If we have a good measure for the shape of the sampling distribution we can start to make statements of the form:

  • What are the chances that an estimate would be this large or larger?

If the sampling distribution is roughly normal, as it may be with large samples, then we can use procedures such as: “there is a 5% probability that an estimate would be more than 1.96 standard errors away from the mean of the sampling distribution”
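
The 1.96 itself comes straight from the normal quantile function:

qnorm(0.975)  # critical value for a two-sided 5% test: 1.959964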

1.2.6 Var(ATE)

  • Key idea: you can estimate the variance straight from the data, given knowledge of the assignment process and assuming well-defined potential outcomes.

  • Recall that in general \(Var(x) = \frac{1}{n}\sum_i(x_i - \overline{x})^2\). Here the \(x_i\)s are the treatment effect estimates we might get under different random assignments, \(n\) is the number of different assignments (assumed here all equally likely, though otherwise we can weight), and \(\overline{x}\) is the truth.

  • For intuition imagine we have just two units \(A\), \(B\), with potential outcomes \(A_1\), \(A_0\), \(B_1\), \(B_0\).

  • When there are two units with outcomes \(x_1, x_2\), the variance simplifies like this:

\[Var(x) = \frac{1}2\left(x_1 - \frac{x_1 + x_2}{2}\right)^2 + \frac{1}2\left(x_2 - \frac{x_1 + x_2}{2}\right)^2 = \left(\frac{x_1 - x_2}{2}\right)^2\]

1.2.7 Var(ATE)

In the two unit case the two possible treatment estimates are: \(\hat{\tau}_1=A_1 - B_0\) and \(\hat{\tau}_2=B_1 - A_0\), depending on what gets put into treatment. So the variance is:

\[Var(\hat{\tau}) = \left(\frac{\hat{\tau}_1 - \hat{\tau}_2}{2}\right)^2 = \left(\frac{(A_1 - B_0) - (B_1 - A_0)}{2}\right)^2 =\left(\frac{(A_1 - B_1) + (A_0 - B_0)}{2}\right)^2 \] which we can re-write as:

\[Var(\hat{\tau}) = \left(\frac{A_1 - B_1}{2}\right)^2 + \left(\frac{A_0 - B_0}{2}\right)^2+ 2\frac{(A_1 - B_1)(A_0-B_0)}{4}\] The first two terms correspond to the variance of \(Y(1)\) and the variance of \(Y(0)\). The last term is a bit pesky though: it corresponds to twice the covariance of \(Y(1)\) and \(Y(0)\).

1.2.8 Var(ATE)

How can we go about estimating this?

\[Var(\hat{\tau}) = \left(\frac{A_1 - B_1}{2}\right)^2 + \left(\frac{A_0 - B_0}{2}\right)^2+ 2\frac{(A_1 - B_1)(A_0-B_0)}{4}\]

In the two unit case it is quite challenging because we do not have an estimate for any of the three terms: we do not have an estimate for the variance in the treatment group or in the control group because we have only one observation in each case; and we do not have an estimate for the covariance because we don’t observe both potential outcomes for any case.

Things do look a bit better however with more units…

1.2.9 Var(ATE): Generalizing

From Freedman Prop 1 / Example 1 (using combinatorics!) we have:

\(V(\widehat{ATE}) = \frac{1}{n-1}\left[\frac{n_C}{n_T}V_1 + \frac{n_T}{n_C}V_0 + 2C_{01}\right]\)

… where \(V_0, V_1\) denote variances and \(C_{01}\) covariance

This is usefully rewritten as:

\[ \begin{split} V(\widehat{ATE}) & = \frac{1}{n-1}\left[\frac{n - n_T}{n_T}V_1 + \frac{n - n_C}{n_C}V_0 + 2C_{01}\right] \\ & = \frac{n}{n-1}\left[\frac{V_1}{n_T} + \frac{V_0}{n_C}\right] - \frac{1}{n-1}\left[V_1 + V_0 - 2C_{01}\right] \end{split} \]

where the final, subtracted term is nonnegative (\(V_1 + V_0 - 2C_{01}\) is the variance of the unit-level treatment effects)
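
As a check, here is a small enumeration with made-up potential outcomes: we compute the exact sampling variance of the difference in means over all complete-randomization assignments and compare it with the formula (reading \(V_1\), \(V_0\), \(C_{01}\) as finite-population, divide-by-\(n\) quantities). The two agree exactly.

# exact sampling variance by enumeration vs Freedman's formula
Y1 <- c(1, 3, 5, 7)   # hypothetical potential outcomes under treatment
Y0 <- c(0, 1, 2, 3)   # hypothetical potential outcomes under control
n <- 4; n_T <- 2; n_C <- n - n_T

assignments <- combn(n, n_T)   # all ways to choose the treated units
dims <- apply(assignments, 2, function(tr) mean(Y1[tr]) - mean(Y0[-tr]))
mean((dims - mean(dims))^2)    # exact variance: 3.75

V1  <- mean((Y1 - mean(Y1))^2)
V0  <- mean((Y0 - mean(Y0))^2)
C01 <- mean((Y1 - mean(Y1)) * (Y0 - mean(Y0)))
(1/(n - 1)) * ((n_C/n_T)*V1 + (n_T/n_C)*V0 + 2*C01)  # also 3.75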

1.2.10 Var(ATE)

Note:

  • With more than two units we can use the sample estimates \(s^2(\{Y_i\}_{i \in C})\) and \(s^2(\{Y_i\}_{i \in T})\) for the first part.
  • But \(C_{01}\) still cannot be estimated from data.
  • The Neyman estimator ignores the second part (and so is conservative).
  • Tip: for Stata users, use , robust (see Samii and Aronow (2012))

1.2.11 ATE and Var(ATE)

For the case with blocking, the conservative estimator is:

\(V(\widehat{ATE}) = {\sum_{S\in \mathcal{S}}{\left(\frac{n_{S}}{n}\right)^2} \left({\frac{s^2_{S1}}{n_{S1}}} + {\frac{s^2_{S0}}{n_{S0}}} \right)}\)
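
A sketch of computing this by hand, using the blocked df from the earlier example; the resulting standard error should be close to what difference_in_means reports when given the blocks argument.

# conservative blocked variance estimate by hand
df |>
  group_by(X) |>
  summarize(w = n()/nrow(df),               # block share n_S/n
            v = var(Y[Z==1])/sum(Z==1) +    # s^2_S1/n_S1
                var(Y[Z==0])/sum(Z==0)) |>  # s^2_S0/n_S0
  summarize(se = sqrt(sum(w^2 * v)))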

1.2.12 Illustration of Neyman Conservative Estimator

An illustration of how conservative the conservative estimator of variance really is (numbers in the plot are correlations between \(Y(1)\) and \(Y(0)\)).

We confirm that:

  1. the estimator is conservative
  2. the estimator is more conservative for negative correlations between \(Y(0)\) and \(Y(1)\) — e.g. if those cases that do particularly badly in control are the ones that do particularly well in treatment, and
  3. with \(\tau\) and \(V(Y(0))\) fixed, high positive correlations are associated with highest variance.

1.2.13 Illustration of Neyman Conservative Estimator

\(\tau\) \(\rho\) \(\sigma^2_{Y(1)}\) \(\Delta\) \(\sigma^2_{\tau}\) \(\widehat{\sigma}^2_{\tau}\) \(\widehat{\sigma}^2_{\tau(\text{Neyman})}\)
1.00 -1.00 1.00 -0.04 0.00 -0.00 0.04
1.00 -0.67 1.00 -0.03 0.01 0.01 0.04
1.00 -0.33 1.00 -0.03 0.01 0.01 0.04
1.00 0.00 1.00 -0.02 0.02 0.02 0.04
1.00 0.33 1.00 -0.01 0.03 0.03 0.04
1.00 0.67 1.00 -0.01 0.03 0.03 0.04
1.00 1.00 1.00 0.00 0.04 0.04 0.04

Here \(\rho\) is the unobserved correlation between \(Y(1)\) and \(Y(0)\); and \(\Delta\) is the final term in the sample variance equation that we cannot estimate.

1.2.14 Illustration of Neyman Conservative Estimator

1.2.15 Tighter Bounds On Variance Estimate

The conservative variance comes from the fact that you do not know the covariance between \(Y(1)\) and \(Y(0)\).

  • But as Aronow, Green, and Lee (2014) point out, you do know something.
  • Intuitively, if you know that the variance of \(Y(1)\) is 0, then the covariance also has to be zero.
  • This basic insight opens a way of calculating bounds on the variance of the sample average treatment effect.

1.2.16 Tighter Bounds On Variance Estimate

Example:

  • Take a million-observation dataset, with treatment randomly assigned
  • Assume \(Y(0)=0\) for everyone and \(Y(1)\) distributed normally with mean 0 and standard deviation of 1000.
  • Note here the covariance of \(Y(1)\) and \(Y(0)\) is 0.
  • Note the true variance of the estimated sample average treatment effect should be (approx) \(\frac{Var(Y(1))}{{1000000}} + \frac{Var(Y(0))}{{1000000}} = 1+0=1\), for an se of \(1\).
  • But using the Neyman estimator (or OLS!) we estimate (approx) \(\frac{Var(Y(1))}{({1000000/2})} + \frac{Var(Y(0))}{({1000000/2})} = 2\), for an se of \(\sqrt{2}\).
  • But we can recover the truth knowing the covariance between \(Y(1)\) and \(Y(0)\) is 0.

1.2.17 Tighter Bounds On Variance Estimate: Code

sharp_var <- function(yt, yc, N = length(c(yt, yc)), upper = TRUE){
  
  m <- length(yt)
  n <- m + length(yc)
  
  # finite-population variance of x, scaled for a population of size N
  V <- function(x, N) (N - 1)/(N * (length(x) - 1)) * sum((x - mean(x))^2)
  
  # sort so the quantiles of Y(1) and Y(0) are coupled comonotonically
  # (upper bound on the covariance term) or antimonotonically (lower bound)
  yt <- sort(yt)
  if (upper) {
    yc <- sort(yc)
  } else {
    yc <- sort(yc, decreasing = TRUE)
  }
  
  # combined grid of quantile breakpoints from both groups
  p_i <- unique(sort(c(seq(0, n - m, 1)/(n - m), seq(0, m, 1)/m))) - 
    .Machine$double.eps^.5
  p_i[1] <- .Machine$double.eps^.5
  
  # empirical quantile functions evaluated on the grid
  yti <- yt[ceiling(p_i * m)]
  yci <- yc[ceiling(p_i * (n - m))]
  
  p_i_minus <- c(NA, p_i[1:(length(p_i) - 1)])
  
  # variance terms plus the bounded covariance term
  ((N - m)/m * V(yt, N) + (N - (n - m))/(n - m) * V(yc, N) + 
      2 * sum(((p_i - p_i_minus) * yti * yci)[2:length(p_i)]) - 
      2 * mean(yt) * mean(yc))/(N - 1)
}

1.2.18 Illustration

n   <- 1000000
Y   <- c(rep(0,n/2), 1000*rnorm(n/2))
X   <- c(rep(0,n/2), rep(1, n/2))

lm_robust(Y~X) |> tidy() |> kable(digits = 2)
term estimate std.error statistic p.value conf.low conf.high df outcome
(Intercept) 0.00 0.00 0.63 0.53 0.00 0.00 999998 Y
X 1.21 1.41 0.86 0.39 -1.56 3.98 999998 Y
c(sharp_var(Y[X==1], Y[X==0], upper = FALSE),
  sharp_var(Y[X==1], Y[X==0], upper = TRUE)) |> 
  round(2)
[1] 1 1

The sharp bounds on the variance are \([1,1]\) (an se of 1), while the conservative estimate of the se is \(\sqrt{2}\).

1.2.19 Asymptotics

  • It is a remarkable thing that you can estimate the standard error straight from the data
  • However, once you want to use the standard error to do hypothesis testing you generally end up looking up distributions (\(t\)-distributions or normal distributions)
  • That’s a little disappointing and has been one of the criticisms made by Deaton and Cartwright (2018)

However you can do hypothesis testing even without an estimate of the standard error.

Up next

1.3 Randomization Inference

A procedure for using the randomization distribution to calculate \(p\) values

1.3.1 Calculate a \(p\) value in your head

  • Illustrating \(p\) values via “randomization inference”

  • Say you randomized assignment to treatment and your data looked like this.

Unit 1 2 3 4 5 6 7 8 9 10
Treatment 0 0 0 0 0 0 0 1 0 0
Health score 4 2 3 1 2 3 4 8 7 6

Then:

  • Does the treatment improve your health?
  • What’s the \(p\) value for the null that treatment had no effect on anybody? (A check by enumeration follows.)
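
One way to check your answer: with a single treated unit there are only 10 equally likely assignments, so the randomization distribution can be enumerated directly.

# exact p value by enumeration for the table above
y <- c(4, 2, 3, 1, 2, 3, 4, 8, 7, 6)
z <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0)
observed <- mean(y[z==1]) - mean(y[z==0])
ts <- sapply(1:10, function(i) mean(y[i]) - mean(y[-i]))
mean(ts >= observed)  # 0.1: the treated unit has the highest score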

1.3.2 Calculate a \(p\) value in your head

  • Illustrating \(p\) values via “randomization inference”
  • Say you randomized assignment to treatment and your data looked like this.
Unit 1 2 3 4 5 6 7 8 9 10
Treatment 0 0 0 0 0 0 0 0 1 0
Health score 4 2 3 1 2 3 4 8 7 6

Then:

  • Does the treatment improve your health?
  • What’s the \(p\) value for the null that treatment had no effect on anybody?

1.3.3 Randomization Inference: Some code

  • In principle it is very easy.
  • These few lines generate data, compute the difference-in-means estimate, and then an ri estimate of \(p\):
# data
set.seed(1)
df <- fabricate(N = 1000, Z = rep(c(0,1), N/2), Y=  .1*Z + rnorm(N))

# test stat
test.stat <- function(df) with(df, mean(Y[Z==1])- mean(Y[Z==0]))

# test stat distribution
ts <- replicate(4000, df |> mutate(Z = sample(Z)) |> test.stat())

# test
mean(ts >= test.stat(df))   # One sided p value
[1] 0.025

1.3.4 Randomization Inference: Some code

The \(p\) value is the mass to the right of the vertical line

hist(ts); abline(v = test.stat(df), col = "red") 

1.3.5 Using ri2

You can do the same using Alex Coppock’s ri2 package

library(ri2)

# Declare the assignment
assignment <- declare_ra(N = 1000, m = 500)

# Implement
ri2_out <- conduct_ri(
  formula = Y ~ Z,
  declaration = assignment,
  sharp_hypothesis = 0,
  data = df, 
  p = "upper",
  sims = 4000
)

1.3.6 Using ri2

term estimate upper_p_value
Z 0.1321367 0.02225

You’ll notice a slightly different answer. This is because although the procedure is “exact” it is subject to simulation error.

1.3.7 Randomization Inference

  • Randomization inference can get more complicated when you want to test a null other than the sharp null of no effect.
  • Say you wanted to test the null that the effect is 2 for all units. How do you do it?
  • Say you wanted to test the null that an interaction effect is zero. How do you do it?
  • In both cases by filling in a potential outcomes schedule given the hypothesis in question and then generating a test statistic
        Observed           Null: effect is 0     Null: effect is 2
        Y(0)     Y(1)      Y(0)     Y(1)         Y(0)     Y(1)
        1        NA        1        1            1        3
        2        NA        2        2            2        4
        NA       4         4        4            2        4
        NA       3         3        3            1        3

1.3.8 ri and CIs

It is possible to use this procedure to generate confidence intervals with a natural interpretation.

  • The key idea is that we can use the same procedure to assess the probability of the data given a sharp null of no effect, but also a sharp null of any other *constant* effect.
  • We can then see what set of effects we reject and what set we accept
  • We are left with a set of values that we cannot reject at the 0.05 level.

1.3.9 ri and CIs in practice

candidates <- seq(-.05, .3, length = 50)

get_p <- function(j)
  (conduct_ri(
      formula = Y ~ Z,
      declaration = assignment,
      sharp_hypothesis = j,
      data = df,
      sims = 5000
    ) |> summary()
  )$two_tailed_p_value

# Implement
ps <- candidates |> sapply(get_p)
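
With the \(p\) values in hand, the interval is simply the range of candidate effects that we cannot reject:

# approximate 95% confidence interval
range(candidates[ps >= 0.05])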

1.3.10 ri and CIs in practice

Warning: calculating confidence intervals this way can be computationally intensive

1.3.11 ri with DeclareDesign

  • DeclareDesign can do randomization inference natively
  • The trick is to ensure that the only stochastic component when calculating the \(p\) values is the assignment to treatment

1.3.12 ri with DeclareDesign (advanced)

Here we get minimal detectable effects by using a design with two-stage simulations, so we can estimate the sampling distribution of summaries of the sampling distribution generated from reassignments.

test_stat <- function(data)
  with(data, data.frame(estimate = mean(Y[Z==1]) - mean(Y[Z==0])))

b <- 0

design <- 
  declare_model(N = 100,  Z = complete_ra(N), Y = b*Z + rnorm(N)) +
  declare_estimator(handler = label_estimator(test_stat), label = "actual")+
  declare_measurement(Z = sample(Z))  + # this is the permutation step
  declare_estimator(handler = label_estimator(test_stat), label = "null")

1.3.13 ri with DeclareDesign (advanced)

The simulations data frame comes from a two-step simulation. Note the computational intensity: the number of runs is the product of the sims vector. I speed things up by using a simple estimation function and parallelization.

library(future)
options(parallelly.fork.enable = TRUE) 
plan(multicore) 

simulations <- 
  design |> redesign(b = c(0, .25, .5, .75, 1)) |> 
  simulate_design(sims = c(200, 1, 1000, 1)) 

1.3.14 ri with DeclareDesign (advanced)

A snapshot of the simulations data frame: we have multiple step-3 draws for each design and step-1 draw.

simulations |> head() |> kable(digits = 2)
design b sim_ID estimator estimate step_1_draw step_3_draw
design_1 0 1 actual -0.02 1 1
design_1 0 1 null 0.06 1 1
design_1 0 2 actual -0.02 1 2
design_1 0 2 null 0.03 1 2
design_1 0 3 actual -0.02 1 3
design_1 0 3 null -0.05 1 3

1.3.15 ri with DeclareDesign (advanced)

Power for each value of b.

simulations |> group_by(b, step_1_draw) |> 
  summarize(p = mean(abs(estimate[estimator == "null"]) >= abs(estimate[estimator == "actual"]))) |>
  group_by(b) |> summarize(power = mean(p <= .05)) |>
  ggplot(aes(b, power)) + geom_line() + geom_hline(yintercept = .8, color = "red")

If you want to figure out more precisely what b gives 80% or 90% power you can narrow down the b range.

1.3.16 ri interactions

Let’s now imagine a world with two treatments in which we are interested in using ri to assess the interaction. (Code from Coppock, ri2)

set.seed(1)

N <- 100
declaration <- randomizr::declare_ra(N = N, m = 50)

data <- fabricate(
  N = N,
  Z = conduct_ra(declaration),
  X = rnorm(N),
  Y = .9 * X + .2 * Z + .1 * X * Z + rnorm(N))

1.3.17 ri interactions

The approach is to declare a null model that is nested in the full model. The \(F\) test statistic from the model comparison is then taken as the test statistic, and its distribution is built up under re-randomizations.

conduct_ri(
    model_1 = Y ~ Z + X,
    model_2 = Y ~ Z + X + Z * X,
    declaration = declaration,
    assignment = "Z",
    sharp_hypothesis = coef(lm(Y ~ Z, data = data))[2],
    data = data, 
    sims = 1000
  )  |> summary() |> kable()
term estimate two_tailed_p_value
F-statistic 1.954396 0.171

1.3.18 ri interactions with DeclareDesign

Let’s imagine a true model with interactions. We take an estimate. We then ask how likely such an estimate would be under a null model with constant effects.

Note: this is quite a sharp hypothesis

df <- fabricate(
  N = 1000, 
  Z1 = rep(c(0,1), N/2), 
  Z2 = sample(Z1), 
  Y  = Z1 + Z2 - .15*Z1*Z2 + rnorm(N))

my_estimate <- (lm(Y ~ Z1*Z2, data = df) |> coef())[4]

null_model <-  function(df) {
  M0 <- lm(Y ~ Z1 + Z2, data = df) 
  d1 <- coef(M0)[2]
  d2 <- coef(M0)[3]
  df |> mutate(
    Y_Z1_0_Z2_0 = Y - Z1*d1 - Z2*d2,
    Y_Z1_1_Z2_0 = Y + (1-Z1)*d1 - Z2*d2,
    Y_Z1_0_Z2_1 = Y - Z1*d1 + (1-Z2)*d2,
    Y_Z1_1_Z2_1 = Y + (1-Z1)*d1 + (1-Z2)*d2)
  }

1.3.19 ri interactions with DeclareDesign

Let’s imagine a true model with interactions. We take an estimate. We then ask how likely such an estimate would be under a null model with constant effects.

Imputed potential outcomes look like this:

df <- df |> null_model()

df |>  head() |> kable(digits = 2, align = "c")
ID Z1 Z2 Y Y_Z1_0_Z2_0 Y_Z1_1_Z2_0 Y_Z1_0_Z2_1 Y_Z1_1_Z2_1
0001 0 0 -0.18 -0.18 0.76 0.68 1.61
0002 1 0 0.20 -0.73 0.20 0.12 1.06
0003 0 0 2.56 2.56 3.50 3.42 4.36
0004 1 0 -0.27 -1.21 -0.27 -0.35 0.59
0005 0 1 -2.13 -2.99 -2.05 -2.13 -1.19
0006 1 1 3.52 1.72 2.66 2.58 3.52

1.3.20 ri interactions with DeclareDesign

design <- 
  declare_model(data = df) +
  declare_measurement(Z1 = sample(Z1), Z2 = sample(Z2),
                      Y = reveal_outcomes(Y ~ Z1 + Z2)) +
  declare_estimator(Y ~ Z1*Z2, term = "Z1:Z2")
diagnose_design(design, sims = 1000, diagnosands = ri_ps(my_estimate))
Design Estimator Outcome Term N Sims One Sided Pos One Sided Neg Two Sided
design estimator Y Z1:Z2 1000 0.95 0.05 0.10
(0.01) (0.01) (0.01)

1.3.21 ri in practice

  • In practice (unless you have a design declaration), it is a good idea to create a \(P\) matrix when you do your randomization.
  • This records the set of possible randomizations you might have had, or a sample of these.
  • So, again: assignments have to be replicable

1.3.22 ri Applications

  • Recall the silly randomization procedure from the earlier Fixer slides.
  • Say you forgot to take account of the wacky assignment in your estimates and you estimate 0.15.
  • Does the treatment improve your health?: \(p=?\)

1.3.23 ri Applications

  • Randomization procedures are sometimes funky in lab experiments
  • Using randomization inference would force a focus on the true assignment of individuals to treatments
  • Fake (but believable) example follows

1.3.24 ri Applications

Session     Capacity   T1   T2   T3
Thursday    40         10   30    0
Friday      40         10    0   30
Saturday    10         10    0    0

Optimal assignment to treatment given constraints due to facilities

Subject type   N    Available
A              30   Thurs, Fri
B              30   Thurs, Sat
C              30   Fri, Sat

Constraints due to subjects

1.3.25 ri Applications

If you think hard about assignment you might come up with an allocation like this.

Allocations

Subject type   N    Available    Thurs   Fri   Sat
A              30   Thurs, Fri      15    15    NA
B              30   Thurs, Sat      25    NA     5
C              30   Fri, Sat        NA    25     5

Assignment of people to days

1.3.26 ri Applications

That allocation balances as much as possible. Given the allocation you might randomly assign individuals to different days as well as randomly assigning them to treatments within days. If you then figure out assignment propensities, this is what you would get:

Assignment Probabilities

Subject type   N    Available      T1      T2      T3
A              30   Thurs, Fri    0.250   0.375   0.375
B              30   Thurs, Sat    0.375   0.625   0.000
C              30   Fri, Sat      0.375   NA      0.625

1.3.27 ri Applications

Even under the assumption that the day of measurement does not matter, these assignment probabilities have big implications for analysis.

Assignment Probabilities

Subject type   N    Available      T1      T2      T3
A              30   Thurs, Fri    0.250   0.375   0.375
B              30   Thurs, Sat    0.375   0.625   0.000
C              30   Fri, Sat      0.375   NA      0.625
  • Only the type \(A\) subjects could have received any of the three treatments.

  • There are no two treatments for which it is possible to compare outcomes for subpopulations \(B\) and \(C\)

  • A comparison of \(T1\) versus \(T2\) can only be made for population \(A \cup B\)

  • However, conditional on receiving \(T1\) or \(T2\), subpopulation \(A\) is assigned to \(T1\) with probability \(0.25/0.625 = 2/5\), while population \(B\) is assigned to \(T1\) with probability \(3/8\)

1.3.28 ri Applications

  • Implications for design: need to uncluster treatment delivery

  • Implications for analysis: need to take account of propensities

Idea: Wacky assignments happen but if you know the propensities you can do the analysis.

1.3.29 ri Applications: Indirect assignments

A particularly interesting application is where a random assignment combines with existing features to determine an assignment to an “indirect” treatment.

  • For instance: \(n\) of \(N\) are assigned to a treatment.
  • You are interested in whether “having a friend assigned to treatment” makes a difference to a subject. Or maybe “a friend of a friend”
  • That means the subject has a complex clustered assignment that depends on how many friends they have
  • A bit mind-boggling, but (a code sketch follows this list):
    • Rerun your assignment many times and each time figure out whether a subject is assigned to an indirect treatment or not
    • Calculate the implied quantity of interest for each assignment
    • Assess the place of the actual quantity in the sampling distribution
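
A minimal sketch of this procedure, assuming a hypothetical \(N \times N\) friendship matrix friends (randomly generated here) and defining the indirect treatment as “has at least one treated friend”:

# indirect exposure propensities from repeated re-assignment
library(randomizr)

N <- 100
friends <- matrix(rbinom(N^2, 1, .05), N, N)  # hypothetical friendship network
diag(friends) <- 0

exposure_probs <- 
  replicate(2000, {
    Z <- complete_ra(N, m = 20)       # re-run the actual assignment
    as.numeric(friends %*% Z > 0)     # indirectly exposed under this draw?
  }) |> 
  apply(1, mean)                      # per-subject exposure propensity

These propensities can then be used for weighting, just as in the Fixer example above.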

1.4 Covariate Adjustment

1.4.1 Example

Consider for example this data.

  • You randomly pair offerers and receivers in a dictator game (in which offerers decide how much of $1 to give to receivers).
  • Your population comes from two groups (80% Baganda and 20% Banyankole) so in randomly assigning partners you are randomly determining whether a partner is a coethnic or not.
  • You find that in non-coethnic pairings 35% is offered, in coethnic pairings 48% is offered.

Should you believe it?

1.4.2 Covariate Adjustment

  • Population: randomly matched Baganda (80% of pop) and Banyankole (20% of pop)
  • You find: in non-coethnic pairings 35% is offered, in coethnic pairings 48% is offered.
  • But a closer look at the data reveals…
                       To: Baganda   To: Banyankole
Offers by Baganda          64%             16%
Offers by Banyankole       16%              4%

Figure 1: Share of games

                       To: Baganda   To: Banyankole
Offers by Baganda          50              50
Offers by Banyankole       20              20

Figure 2: Average offers (percent of the $1 offered)

So that’s a problem

1.4.3 Covariate Adjustment

Control?

  • With such data you might be tempted to ‘control’ for the covariate (here: ethnic group), using regression.
  • But, perhaps surprisingly, it turns out that regression with covariates does not estimate the average treatment effect.
  • It does estimate an average of treatment effects, but with minimum-variance weights, which is not necessarily your estimand.

1.4.4 Covariate Adjustment

Compare:

  • \(\hat{\tau}_{ATE} =\sum_{x} \frac{w_x}{\sum_{j}w_{j}}\hat{\tau}_x\)
  • \(\hat{\tau}_{OLS} =\sum_{x} \frac{w_xp_x(1-p_x)}{\sum_{j}w_j{p_j(1-p_j)}}\hat{\tau}_x\)

Instead you can use the formula above for \(\hat{\tau}_{ATE}\) to estimate the ATE

alternatively…

1.4.5 Covariate adjustment via saturated regression

  • Alternatively you can use propensity weights.
  • Or you can use a regression that includes both the treatment and the treatment interacted with the covariates.
    • In practice this is best done by demeaning the covariates; doing this lets you read off the average effect from the main term. Key resource: Lin (2012)

You should have noticed that the logic for controlling for a covariate here is equivalent to the logic we saw for heterogeneous assignment propensities. These are really the same thing.

1.4.6 Covariate adjustment via saturated regression

Returning to prior example:

df <- fabricatr::fabricate(
  N = 500, 
  X = rep(0:1, N/2), 
  Z = rbinom(N, 1, .2 + .3*X),
  Y = rnorm(N) + Z*X)

lm_robust(Y ~ Z*X_c, data = df |> mutate(X_c = X - mean(X))) |>
  tidy() |> kable(digits = 2)
term estimate std.error statistic p.value conf.low conf.high df outcome
(Intercept) -0.02 0.06 -0.42 0.68 -0.14 0.09 496 Y
Z 0.59 0.10 6.05 0.00 0.40 0.78 496 Y
X_c 0.16 0.11 1.42 0.15 -0.06 0.39 496 Y
Z:X_c 0.64 0.20 3.28 0.00 0.26 1.02 496 Y

1.4.7 Covariate adjustment via saturated regression

Returning to prior example:

lm_lin(Y ~ Z, ~ X, data = df) |>
  tidy() |> kable(digits = 2)
term estimate std.error statistic p.value conf.low conf.high df outcome
(Intercept) -0.02 0.06 -0.42 0.68 -0.14 0.09 496 Y
Z 0.59 0.10 6.05 0.00 0.40 0.78 496 Y
X_c 0.16 0.11 1.42 0.15 -0.06 0.39 496 Y
Z:X_c 0.64 0.20 3.28 0.00 0.26 1.02 496 Y

1.4.8 Demeaning and saturating

Demeaning interactions

  • Say you have a factorial design with treatments X1 and X2 (or observational data with two covariates)
  • You analyse with a model that has main terms and interaction terms
  • Interpreting coefficients can be confusing, but sometimes demeaning can help. What does demeaning do?

1.4.9 Demeaning and saturating

Let’s:

  • Declare a factorial design in which Y is generated according to

f_Y <- function(X1, X2, u) .1 + .2*X1 + .3*X2 + u*X1*X2

where u is distributed \(U[0,1]\).

  • Specify estimands carefully
  • Run analyses in which we do and do not demean the treatments; compare and explain results

1.4.10 Demeaning interactions

f_Y <- function(X1, X2, u) .1 + .2*X1 + .3*X2 + u*X1*X2

design <-
  declare_model(N = 1000, u = runif(N),
                X1 = complete_ra(N), X2 = block_ra(blocks = X1),
                X1_demeaned = X1 - mean(X1),
                X2_demeaned = X2 - mean(X2),
                Y = f_Y(X1, X2, u)) +
  declare_inquiry(
    base = mean(f_Y(0, 0, u)),
    average = mean(f_Y(0, 0, u) + f_Y(0, 1, u)  + f_Y(1, 0, u)  + f_Y(1, 1, u))/4,
    CATE_X1_given_0 = mean(f_Y(1, 0, u) - f_Y(0, 0, u)),
    CATE_X2_given_0 = mean(f_Y(0, 1, u) - f_Y(0, 0, u)),
    ATE_X1 = mean(f_Y(1, X2, u) - f_Y(0, X2, u)),
    ATE_X2 = mean(f_Y(X1, 1, u) - f_Y(X1, 0, u)),
    I_X1_X2 = mean((f_Y(1, 1, u) - f_Y(0, 1, u)) - (f_Y(1, 0, u) - f_Y(0, 0, u)))
  ) +
  declare_estimator(Y ~ X1*X2, 
                    inquiry = c("base", "CATE_X1_given_0", "CATE_X2_given_0", "I_X1_X2"), 
                    term = c("(Intercept)", "X1", "X2", "X1:X2"),
                    label = "natural") +
  declare_estimator(Y ~ X1_demeaned*X2_demeaned, 
                    inquiry = c("average", "ATE_X1", "ATE_X2", "I_X1_X2"), 
                    term = c("(Intercept)", "X1_demeaned", "X2_demeaned", "X1_demeaned:X2_demeaned"),
                    label = "demeaned")

1.4.11 Demeaning interactions: Solution
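
The table below can be produced with a diagnosis along these lines (a sketch using the default diagnosands):

diagnose_design(design)$diagnosands_df |>
  select(estimator, inquiry, term, mean_estimand, mean_estimate) |>
  kable(digits = 2)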

Estimator Inquiry Term Mean Estimand Mean Estimate
demeaned ATE_X1 X1_demeaned 0.45 0.45
demeaned ATE_X2 X2_demeaned 0.55 0.55
demeaned I_X1_X2 X1_demeaned:X2_demeaned 0.50 0.50
demeaned average (Intercept) 0.48 0.48
natural CATE_X1_given_0 X1 0.20 0.20
natural CATE_X2_given_0 X2 0.30 0.30
natural I_X1_X2 X1:X2 0.50 0.50
natural base (Intercept) 0.10 0.10

It’s all good. But you need to match the estimator to the inquiry: demean for average marginal effects; do not demean for conditional marginal effects.

1.4.12 Recap

If you have different groups with different assignment propensities you can do any or all of these:

  1. Blocked differences in means
  2. Inverse propensity weighting
  3. Saturated regression (Lin)
  4. More… (coming)

You cannot (reliably):

  1. Ignore the groups
  2. Include them in a regression (without interactions)

1.5 To control or not control

When does controlling for covariates improve things and when does it make them worse?

1.5.1 Considerations

  • Even though randomization ensures no bias, you may sometimes want to “control” for covariates in order to improve efficiency (see the discussion of blocking above).
  • Or you may have to take account of the fact that the assignment to treatment is correlated with a covariate (as above).
  • In observational work you might also figure out that you have to control for a covariate to justify inferences (refer to our discussion of the backdoor criterion)

1.5.2 Observational work

  • Observational motivation: Controls can provide grounds for identification
  • But recall – they can also destroy identification

For a great walk through of what you can draw from graphical models for the decision to control see:

A Crash Course in Good and Bad Controls by Cinelli, Forney, and Pearl (2022)

Aside: these implications generally refer to using controls as covariates – e.g. by implementing blocked differences in means or similar. For a Bayesian model of the form used in CausalQueries the information from “bad controls” is used wisely.

1.5.3 Experimental work

  • Conditional Bias and Precision Gains from Controls

  • Experimental motivation: Controls can reduce noise and improve precision. This is an argument for using variables that are correlated with the outcome (not with the treatment).

1.5.4 Precision Gains from Controls

However: Introducing controls can create complications

  • As argued by Freedman (summary from Lin (2012)), we can get: “worsened asymptotic precision, invalid measures of precision, and small-sample bias”\(^*\)

  • These adverse effects are essentially removed with an interacted model

  • See discussions in Imbens and Rubin (2015) (7.6, 7.7) and especially Theorem 7.2 for the asymptotic variance of the estimator

\(^*\) though note that the precision concern does not hold when treatment and control groups are equally sized

1.5.5 To control or not

We will illustrate by comparing:

  • DIM
  • OLS (linear controls)
  • Lin (saturated)

With:

  • Varying fraction assigned to treatment
  • Varying relation between \(Y(0)\) and \(Y(1)\)

1.5.6 Declaration (from RDSS)

# https://book.declaredesign.org/library/experimental-causal.html

prob <- 0.5
control_slope <- -1

declaration_18.3 <-
  declare_model(N = 100, X = runif(N, 0, 1),
                U = rnorm(N, sd = 0.1),
                Y_Z_1 = 1*X + U, Y_Z_0 = control_slope*X + U
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N = N, prob = prob)) + 
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE", label = "DIM") +
  declare_estimator(Y ~ Z + X, .method = lm_robust, inquiry = "ATE", label = "OLS") +
  declare_estimator(Y ~ Z, covariates = ~X, .method = lm_lin, 
                    inquiry = "ATE", label = "Lin")

1.5.7 Implied potential outcomes

The variances and covariance of potential outcomes depend on the slope parameter

1.5.8 To control or not

Diagnosis:

declaration_18.3 |> 
  redesign(
    control_slope = seq(-1, 1, 0.5), 
    prob = seq(0.1, 0.9, 0.1)) |> 
  diagnose_designs()

1.5.9 To control or not

  • Lin always does well
  • OLS does fine with equal probability assignments
  • Otherwise the ranking of DIM and OLS is design dependent

1.5.10 Conditional Bias and Precision Gains from Controls

  • Treatment correlated with covariates can induce “conditional bias.”
  • Including controls that are correlated with treatment can introduce inefficiencies
  • Including controls can change your estimates so be sure not to fish!

1.5.11 Declaration

a <- 0
b <- 0

design <- 
  declare_model(N = 100,
                        X = rnorm(N),
                        Z = complete_ra(N),
                        Y_Z_0 = a*X + rnorm(N),
                        Y_Z_1 = a*X + correlate(given = X, rho = b, rnorm) + 1,
                        Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, covariates = ~X, .method = lm_lin, label = "Lin") +
  declare_estimator(Y ~ Z,  label = "No controls") +
  declare_estimator(Z ~ X,  label = "Condition")

The design implements estimation controlling and not controlling for \(X\) and also keeps track of the results of a test for the relation between \(Z\) and \(X\).

1.5.12 Simulations

We simulate with many simulations over a range of designs

simulations <- 
  list(design |> redesign(a = 0, b = 0),  design |> redesign(a = 1, b = 0),  design |> redesign(a = 0, b = 1)) |>
  simulate_design(sims = 20000) 

1.5.13 Standard errors

We see that the standard errors are larger when you control in cases in which the control is not predictive of the outcome and is correlated with the treatment. Otherwise they can be smaller.

See Mutz et al

1.5.14 Errors

We also see “conditional bias” when we do not control: where the distribution of errors depends on the correlation with the covariate.

Errors

1.5.15 Estimands

Puzzle: Does the sample average treatment effect on the treated depend on the covariate balance?

1.6 Doubly robust estimation

1.6.1 Doubly robust estimation

Doubly robust estimation combines:

  1. A model for how the covariates predict the potential outcomes
  2. A model for how the covariates predict assignment propensities

Using both together to estimate potential outcomes with propensity weighting lets you do well even if one of the two models is wrong.

Each part can be done using nonparametric methods, resulting in an overall semi-parametric procedure.

  • \(\pi(Z) = \Pr(Z=1|X)\): Estimate \(\hat\pi\)
  • \(Y_z = \mathbb{E}[Y|Z=z, X]\): Estimate \(\hat{Y}_z\)
  • Estimate of causal effect: \(\frac{1}{n}\sum_{i=1}^n\left(\frac{Z_i}{\hat{\pi}_i}\left(Y_i - \hat{Y}_{i1}\right) - \frac{1-Z_i}{1-\hat{\pi}_i}\left(Y_i - \hat{Y}_{i0}\right) + \left(\hat{Y}_{i1} - \hat{Y}_{i0}\right) \right)\)

1.6.2 Doubly robust estimation

  • Estimate of causal effect: \(\frac{1}{n}\sum_{i=1}^n\left(\frac{Z_i}{\hat{\pi}_i}\left(Y_i - \hat{Y}_{i1}\right) - \frac{1-Z_i}{1-\hat{\pi}_i}\left(Y_i - \hat{Y}_{i0}\right) + \left(\hat{Y}_{i1} - \hat{Y}_{i0}\right) \right)\)

  • Note that if the \(\hat{Y}_{iz}\) are correct then the first parts drop out in expectation and we get the right answer.

  • So if you can impute the potential outcomes, you are good (though hardly surprising)

1.6.3 Doubly robust estimation

  • More subtly, say the \(\hat{\pi}\)s are correct but your imputations are wrong; then we again have an unbiased estimator.

To see this, imagine that with probability \(\pi\) we assign unit 1 to treatment and unit 2 to control (otherwise unit 1 to control and unit 2 to treatment).

Then our expected estimate is:

\(\frac12\left(\pi\left(\frac{1}{\pi}\left(Y_{11} - \hat{Y}_{11}\right) - \frac{1}{\pi}\left(Y_{20} - \hat{Y}_{20}\right)\right) + (1-\pi)\left(\frac{1}{1-\pi}\left(Y_{21} - \hat{Y}_{21}\right) - \frac{1}{1-\pi}\left(Y_{10} - \hat{Y}_{10}\right)\right) + \left(\hat{Y}_{11} - \hat{Y}_{10}\right) + \left(\hat{Y}_{21} - \hat{Y}_{20}\right)\right)\)

The \(\pi\)s cancel, and the imputed values then net out against the weighted correction terms, leaving:

\(\frac12\left(Y_{11} - Y_{10} + Y_{21} - Y_{20}\right)\)

Robins, Rotnitzky, and Zhao (1994)

1.6.4 Doubly robust estimation illustration

Consider this data (with confounding):

# df with true treatment effect of 1 
# (0.5 if race = 0; 1.5 if race = 1)

df <- fabricatr::fabricate(
  N = 5000,
  class = sample(1:3, N, replace = TRUE),
  race = rbinom(N, 1, .5),
  Z = rbinom(N, 1, .2 + .3*race),
  Y = .5*Z + race*Z + class + rnorm(N),
  class = factor(class),
  race = factor(race)
)

1.6.5 Simple approaches

Naive regression produces biased estimates, even with controls. Lin regression, however, recovers the right result.

# Naive
lm_robust(Y ~ Z, data = df)$coefficients[["Z"]]
[1] 1.239593
# OLS with controls
lm_robust(Y ~ Z + class + race, data = df)$coefficients[["Z"]]
[1] 1.126977
# Lin
lm_lin(Y ~ Z,  ~ class + race, data = df)$coefficients[["Z"]]
[1] 1.013944
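
For intuition, here is a by-hand version of the doubly robust formula above (a sketch: a logit for the propensities and OLS for the outcome imputations, without the targeted refinement that drtmle adds):

# doubly robust estimate by hand
pi_hat <- glm(Z ~ race + class, family = binomial, data = df) |>
  predict(type = "response")
Y1_hat <- lm(Y ~ race + class, data = df |> filter(Z == 1)) |> predict(newdata = df)
Y0_hat <- lm(Y ~ race + class, data = df |> filter(Z == 0)) |> predict(newdata = df)

with(df, mean(Z/pi_hat * (Y - Y1_hat) -
              (1 - Z)/(1 - pi_hat) * (Y - Y0_hat) +
              (Y1_hat - Y0_hat)))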

1.6.6 Doubly robust estimation

drtmle is an R package that uses doubly robust estimation to compute “marginal means of an outcome under fixed levels of a treatment.”

library(SuperLearner)
library(drtmle)
drtmle_fit <- drtmle(
  W = df |> select(race, class), 
  A = df$Z, 
  Y = df$Y, 
  SL_Q = c("SL.glm", "SL.mean", "SL.glm.interaction"),
  SL_g = c("SL.glm", "SL.mean", "SL.glm.interaction"),
  SL_Qr = "SL.glm",
  SL_gr = "SL.glm", 
  maxIter = 1
)

1.6.7 Doubly robust estimation

# "Marginal means"
drtmle_fit$drtmle$est
[1] 2.997512 1.983561
# Effects
ci(drtmle_fit, contrast = c(-1,1))
$drtmle
                   est    cil    ciu
E[Y(0)]-E[Y(1)] -1.014 -1.079 -0.949
wald_test(drtmle_fit, contrast = c(-1,1))
$drtmle
                       zstat pval
H0:E[Y(0)]-E[Y(1)]=0 -30.688    0

Resource: https://muse.jhu.edu/article/883477

1.6.8 Assessing performance

Challenge: Use DeclareDesign to compare performance of drtmle and lm_lin

1.7 Principle: Keep the reporting close to the design

1.7.1 Design-based analysis

Report the analysis that is implied by the design.

                 T2: N                         T2: Y                         All                           Diff
T1: N            \(\overline{y}_{00}\) (sd)    \(\overline{y}_{01}\) (sd)    \(\overline{y}_{0x}\) (sd)    \(d_2|T1=0\) (sd)
T1: Y            \(\overline{y}_{10}\) (sd)    \(\overline{y}_{11}\) (sd)    \(\overline{y}_{1x}\) (sd)    \(d_2|T1=1\) (sd)
All              \(\overline{y}_{x0}\) (sd)    \(\overline{y}_{x1}\) (sd)    \(\overline{y}\) (sd)         \(d_2\) (sd)
Diff             \(d_1|T2=0\) (sd)             \(d_1|T2=1\) (sd)             \(d_1\) (sd)                  \(d_1d_2\) (sd)

This is instantly recognizable from the design and returns all the benefits of the factorial design including all main effects, conditional causal effects, interactions and summary outcomes. It is much clearer and more informative than a regression table.

Aronow, Peter M., Donald P. Green, and Donald K. K. Lee. 2014. “Sharp Bounds on the Variance in Randomized Experiments.” Annals of Statistics 42 (3): 850–71.
Cinelli, Carlos, Andrew Forney, and Judea Pearl. 2022. “A Crash Course in Good and Bad Controls.” Sociological Methods & Research, 00491241221099552.
Deaton, Angus, and Nancy Cartwright. 2018. “Understanding and Misunderstanding Randomized Controlled Trials.” Social Science & Medicine 210: 2–21.
Freedman, David A. 2008. “On Regression Adjustments to Experimental Data.” Advances in Applied Mathematics 40 (2): 180–93.
Imbens, Guido W, and Donald B Rubin. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.
Lin, Winston. 2012. “Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman’s Critique.” arXiv Preprint arXiv:1208.2301.
Robins, James M, Andrea Rotnitzky, and Lue Ping Zhao. 1994. “Estimation of Regression Coefficients When Some Regressors Are Not Always Observed.” Journal of the American Statistical Association 89 (427): 846–66.
Samii, Cyrus, and Peter M Aronow. 2012. “On Equivalencies Between Design-Based and Regression-Based Variance Estimators for Randomized Experiments.” Statistics & Probability Letters 82 (2): 365–70.