Design-based estimation

Macartan Humphreys

1 Frequentist Analysis

Estimation and testing

1.1 Outline

  • Simple estimates from experimental data
  • Weighting, blocking
  • Design-based variance estimates
  • Design-based \(p\) values
  • Reporting

See topics for Controls and doubly robust estimation

1.1.1 ATE: DIM

Unbiased estimates of the (sample) average treatment effect can be obtained (whether or not there is imbalance on covariates) using:

\[ \widehat{ATE} = \frac{1}{n_T}\sum_TY_i - \frac{1}{n_C}\sum_CY_i, \]

1.1.2 ATE: DIM in practice

df <- fabricatr::fabricate(N = 100, Z = rep(0:1, N/2), Y = rnorm(N) + Z)

# by hand
df |>
  summarize(Y1 = mean(Y[Z==1]), 
            Y0 = mean(Y[Z==0]), 
            diff = Y1 - Y0) |> kable(digits = 2)
| Y1   | Y0    | diff |
|------|-------|------|
| 1.07 | -0.28 | 1.35 |
# with estimatr
estimatr::difference_in_means(Y ~ Z, data = df) |>
  tidy() |> kable(digits = 2)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df    | outcome |
|------|----------|-----------|-----------|---------|----------|-----------|-------|---------|
| Z    | 1.35     | 0.17      | 7.94      | 0       | 1.01     | 1.68      | 97.98 | Y       |

1.1.3 ATE: DIM in practice

We can also do this with regression:

estimatr::lm_robust(Y ~ Z, data = df) |>
  tidy() |> kable(digits = 2)
| term        | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|-------------|----------|-----------|-----------|---------|----------|-----------|----|---------|
| (Intercept) | -0.28    | 0.12      | -2.33     | 0.02    | -0.51    | -0.04     | 98 | Y       |
| Z           | 1.35     | 0.17      | 7.94      | 0.00    | 1.01     | 1.68      | 98 | Y       |

See Freedman (2008) on why regression is fine here

1.1.4 ATE: Blocks

Say now different strata or blocks \(\mathcal{S}\) had different assignment probabilities. Then you could estimate:

\[ \widehat{ATE} = \sum_{S\in \mathcal{S}}\frac{n_{S}}{n} \left(\frac{1}{n_{S1}}\sum_{S\cap T}y_i - \frac{1}{n_{S0}}\sum_{S\cap C}y_i \right) \]

Note: you cannot just ignore the blocks because assignment is no longer independent of potential outcomes: you might be sampling units with different potential outcomes with different probabilities.

However, the formula above works fine because assignment is random conditional on blocks.

1.1.5 ATE: Blocks

As a DAG this is just classic confounding:

make_model("Block -> Z ->Y <- Block") |> 
  plot(x_coord = c(2,1,3), y_coord = c(2, 1, 1))

1.1.6 ATE: Blocks in practice

Data with heterogeneous assignment probabilities:

df <- fabricatr::fabricate(
  N = 500, X = rep(0:1, N/2), 
  prob = .2 + .3*X,
  Z = rbinom(N, 1, prob),
  ip = 1/(Z*prob + (1-Z)*(1-prob)), # discuss
  Y = rnorm(N) + Z*X)

True effect is 0.5, but:

estimatr::difference_in_means(Y ~ Z, data = df) |>
  tidy() |> kable(digits = 2)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df     | outcome |
|------|----------|-----------|-----------|---------|----------|-----------|--------|---------|
| Z    | 0.9      | 0.1       | 9.32      | 0       | 0.71     | 1.09      | 377.93 | Y       |

1.1.7 ATE: Blocks in practice

Averaging over effects in blocks

# by hand
estimates <- 
  df |>
  group_by(X) |>
  summarize(Y1 = mean(Y[Z==1]), 
            Y0 = mean(Y[Z==0]), 
            diff = Y1 - Y0,
            W = n())

estimates$diff |> weighted.mean(estimates$W)
[1] 0.7236939
# with estimatr
estimatr::difference_in_means(Y ~ Z, blocks = X, data = df) |>
  tidy() |> kable(digits = 2)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df  | outcome |
|------|----------|-----------|-----------|---------|----------|-----------|-----|---------|
| Z    | 0.72     | 0.11      | 6.66      | 0       | 0.51     | 0.94      | 496 | Y       |

1.1.8 ATE with IPW

This also corresponds to the difference in the weighted average of treatment outcomes (with weights given by the inverse of the probability that each unit is assigned to treatment) and control outcomes (with weights given by the inverse of the probability that each unit is assigned to control).

  • The block-averaged difference-in-means estimator is the same as what you would get if you weighted inversely by the shares of units in the different conditions within blocks.

1.1.9 ATE with IPW in practice

# by hand
df |>
  summarize(Y1 = weighted.mean(Y[Z==1], ip[Z==1]), 
            Y0 = weighted.mean(Y[Z==0],  ip[Z==0]), # note !
            diff = Y1 - Y0)|> 
  kable(digits = 2)
| Y1   | Y0    | diff |
|------|-------|------|
| 0.59 | -0.15 | 0.74 |
# with estimatr
estimatr::difference_in_means(Y ~ Z, weights = ip, data = df) |>
  tidy() |> kable(digits = 2)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df  | outcome |
|------|----------|-----------|-----------|---------|----------|-----------|-----|---------|
| Z    | 0.74     | 0.11      | 6.65      | 0       | 0.52     | 0.96      | 498 | Y       |

1.1.10 ATE with IPW

  • But inverse propensity weighting is a more general principle, which can be used even if you do not have blocks.

  • The intuition for it comes straight from sampling weights — you weight up in order to recover an unbiased estimate of the potential outcomes for all units, whether or not they are assigned to treatment.

  • With sampling weights, however, you can include units even if their weight was 1. Why can you not include such units when doing inverse propensity weighting?

1.1.11 Illustration: Estimating treatment effects with terrible treatment assignments: Fixer

Say you made a mess and used a randomization that was correlated with some variable, \(U\). For example:

  • The randomization is done in a way that introduces a correlation between Treatment Assignment and Potential Outcomes
  • Then possibly, even though there is no true causal effect, we naively estimate a large one — enormous bias
  • However since we know the assignment procedure we can fully correct for the bias

1.1.12 Illustration: Estimating treatment effects with terrible treatment assignments: Fixer

  • In the next example, we do this using “inverse propensity score weighting.”
  • This is exactly analogous to standard survey weighting — since we selected different units for treatment with different probabilities, we weight them differently to recover the average outcome among treated units (same for control).

1.1.13 Basic randomization: Fixer

Bad assignment: some randomization process you can’t understand (but can replicate) that results in unequal probabilities.

N <- 400
U <- runif(N, .1, .9)

 design <- 
   declare_model(N = N,
                 Y_Z_0 = U + rnorm(N, 0, .1),
                 Y_Z_1 = U + rnorm(N, 0, .1),
                 Z = rbinom(N, 1, U)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, label = "naive") 

1.1.14 Basic randomization: Fixer

The result is a sampling distribution that is not centered on the true effect (0):

diagnosis <- diagnose_design(design)
diagnosis$simulations_df |>
  ggplot(aes(estimate)) + geom_histogram() + facet_grid(~estimator) +
  geom_vline(xintercept = 0, color = "red")

1.1.15 A fix

To fix this, you can estimate the assignment probabilities by replicating the assignment many times:

probs <- replicate(1000, design |> draw_data() |> pull(Z)) |> apply(1, mean)

and then use these assignment probabilities in your estimator

design_2 <-
  design +
  declare_measurement(weights = Z/probs + (1-Z)/(1-probs)) +
  declare_estimator(Y ~ Z, weights = weights, label = "smart") 

1.1.16 Basic randomization: Fixer

Implied weights

draw_data(design_2) |> 
  ggplot(aes(probs, weights, color = factor(Z))) + 
  geom_point()

1.1.17 Basic randomization: Fixer

Improved results

diagnosis <- diagnose_design(design_2)
diagnosis$simulations_df |>
  ggplot(aes(estimate)) + geom_histogram() + facet_grid(estimator~.)+
  geom_vline(xintercept = 0, color = "red")

1.1.18 IPW with one unit!

This example is surprising, but it helps you see the logic of why inverse weighting yields unbiased estimates (and why that might not guarantee a reasonable answer).

Imagine there is one unit with potential outcomes \(Y(1) = 2, Y(0) = 1\). So the unit level treatment effect is 1.

You toss a coin.

  • If you assign to treatment you estimate: \(\hat\tau = \frac{2}{0.5} = 4\)
  • If you assign to control you estimate: \(\hat\tau = -\frac{1}{0.5} = -2\)

So your expected estimate is: \[0.5 \times 4 + 0.5 \times (-2) = 1\]

Great on average but always lousy
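
A quick numeric version of this example (a minimal sketch that just enumerates the two equally likely assignments):

# sketch: the single-unit example, enumerating both possible assignments
Y1 <- 2; Y0 <- 1; p <- 0.5

est_if_treated <- Y1 / p          #  4: IPW estimate if the unit lands in treatment
est_if_control <- -Y0 / (1 - p)   # -2: IPW estimate if the unit lands in control

mean(c(est_if_treated, est_if_control))   # 1: the true unit-level effect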

1.1.19 Generalization: why IPW works

  • Say a given unit is assigned to treatment with probability \(\pi_i\)
  • We estimate the average \(Y(1)\) using

\[\hat{\overline{Y_1}} = \frac{1}n\left(\sum_i \frac{Z_iY_i(1)}{\pi_i}\right)\] With independent assignment the expected value of \(\hat{\overline{Y_1}}\) is just:

\[\mathbb{E}[\hat{\overline{Y_1}}] =\frac1n\left( \left(\pi_1 \frac{1\times Y_1(1)}{\pi_1} + (1-\pi_1) \frac{0\times Y_1(1)}{\pi_1}\right) + \left(\pi_2 \frac{1\times Y_2(1)}{\pi_2} + (1-\pi_2) \frac{0\times Y_2(1)}{\pi_2}\right) + \dots\right)\]

\[\mathbb{E}[\hat{\overline{Y_1}}] =\frac1n\left( Y_1(1) + Y_2(1) + \dots\right) = \overline{Y_1}\]

and similarly for \(\mathbb{E}[\hat{\overline{Y_0}}]\) and so using linearity of expectations:

\[\mathbb{E}[\widehat{\overline{Y_1 - Y_0}}] = \overline{Y_1 - Y_0}\]
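
A minimal simulation check of this result, assuming independent assignment with known (heterogeneous) probabilities and fixed potential outcomes; the data-generating choices here are purely illustrative:

# sketch: IPW is unbiased under independent assignment with known probabilities
set.seed(1)
n     <- 200
probs <- runif(n, .2, .8)                 # known assignment probabilities
Y1    <- rnorm(n, 1); Y0 <- rnorm(n, 0)   # fixed potential outcomes

ipw_estimate <- function() {
  Z <- rbinom(n, 1, probs)
  mean(Z * Y1 / probs) - mean((1 - Z) * Y0 / (1 - probs))
}

estimates <- replicate(5000, ipw_estimate())
c(truth = mean(Y1 - Y0), average_estimate = mean(estimates))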

1.1.20 Generalization: why IPW works

  • Note we needed \(\pi_i >0\) and also \(\pi_i <1\) everywhere. Why?
  • We used independence here; sampling theory is used to show similar results for e.g. complete randomization
  • For blocked randomization this is easy to see

1.2 Design-based Estimation of Variance

Let’s talk about “inference”

1.2.1 Var(ATE)

  • Recall that the treatment effect estimate is obtained by taking a sample of outcomes under treatment and comparing it to a sample of outcomes under control
  • Say that there is no “error”
  • Why would this procedure produce uncertainty?

1.2.2 Var(ATE)

  • Why would this procedure produce uncertainty?
  • The uncertainty comes from being uncertain about the average outcome under control from observations of the control units, and from being uncertain about the average outcome under treatment from observation of the treated units
  • In other words, it comes from the variance in the treatment outcomes and variance in the control outcomes (and not, for example, from variance in the treatment effect)

1.2.3 Var(ATE)

  • In classical statistics we characterize our uncertainty over an estimate using an estimate of variance of the sampling distribution of the estimator.

  • The key idea is that we want to be able to say how likely we would be to get such an estimate if the distribution of estimates associated with our design looked a given way.

  • More specifically, we want to estimate the “standard error”, or the “standard deviation of the sampling distribution”

(See Wooldridge (2023), where the standard error is understood as the “estimate of the standard deviation of the sampling distribution”)

1.2.4 Variance and standard errors

Given:

  • \(\hat\tau\) is an estimate for \(\tau\)
  • \(\overline{\hat\tau}\) is the average of the estimates across repeated ‘runs’

The variance of the estimator over \(n\) repeated ‘runs’ of a design is: \(Var(\hat{\tau}) = \frac{1}{n}\sum_i(\hat\tau_i - \overline{\hat\tau})^2\)

And the standard error is:

\(se(\hat{\tau}) = \sqrt{\frac{1}{n}\sum_i(\hat\tau_i - \overline{\hat\tau})^2}\)
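
For concreteness, here is one way to see the standard error as the standard deviation of the sampling distribution: simulate many ‘runs’ of a simple two-arm design and take the standard deviation of the estimates (a sketch; the design parameters are illustrative):

# sketch: the standard error as the sd of a simulated sampling distribution
set.seed(1)
one_run <- function(n = 100, tau = 1) {
  Z <- sample(rep(0:1, n/2))            # complete random assignment
  Y <- tau * Z + rnorm(n)
  mean(Y[Z == 1]) - mean(Y[Z == 0])     # difference in means for this run
}

estimates <- replicate(10000, one_run())
sd(estimates)   # compare with the std.error reported by difference_in_means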

1.2.5 Variance and standard errors

If we have a good measure for the shape of the sampling distribution we can start to make statements of the form:

  • What are the chances that an estimate would be this large or larger?

If the sampling distribution is roughly normal, as it may be with large samples, then we can use procedures such as: “there is a 5% probability that an estimate would be more than 1.96 standard errors away from the mean of the sampling distribution”

1.2.6 Var(ATE)

  • Key idea: you can estimate the variance straight from the data, given knowledge of the assignment process and assuming well-defined potential outcomes.

  • Recall that in general \(Var(x) = \frac{1}{n}\sum_i(x_i - \overline{x})^2\). Here the \(x_i\)s are the treatment effect estimates we might get under different random assignments, \(n\) is the number of different assignments (assumed here all equally likely, but otherwise we can weight), and \(\overline{x}\) is the truth.

  • For intuition imagine we have just two units \(A\), \(B\), with potential outcomes \(A_1\), \(A_0\), \(B_1\), \(B_0\).

  • When there are two units with outcomes \(x_1, x_2\), the variance simplifies like this:

\[Var(x) = \frac{1}2\left(x_1 - \frac{x_1 + x_2}{2}\right)^2 + \frac{1}2\left(x_2 - \frac{x_1 + x_2}{2}\right)^2 = \left(\frac{x_1 - x_2}{2}\right)^2\]

1.2.7 Var(ATE)

In the two unit case the two possible treatment estimates are: \(\hat{\tau}_1=A_1 - B_0\) and \(\hat{\tau}_2=B_1 - A_0\), depending on what gets put into treatment. So the variance is:

\[Var(\hat{\tau}) = \left(\frac{\hat{\tau}_1 - \hat{\tau}_2}{2}\right)^2 = \left(\frac{(A_1 - B_0) - (B_1 - A_0)}{2}\right)^2 =\left(\frac{(A_1 - B_1) + (A_0 - B_0)}{2}\right)^2 \] which we can re-write as:

\[Var(\hat{\tau}) = \left(\frac{A_1 - B_1}{2}\right)^2 + \left(\frac{A_0 - B_0}{2}\right)^2+ 2\frac{(A_1 - B_1)(A_0-B_0)}{4}\] The first two terms correspond to the variance of \(Y(1)\) and the variance of \(Y(0)\). The last term is a bit pesky, though: it corresponds to twice the covariance of \(Y(1)\) and \(Y(0)\).

1.2.8 Var(ATE)

How can we go about estimating this?

\[Var(\hat{\tau}) = \left(\frac{A_1 - B_1}{2}\right)^2 + \left(\frac{A_0 - B_0}{2}\right)^2+ 2\frac{(A_1 - B_1)(A_0-B_0)}{4}\]

In the two unit case it is quite challenging because we do not have an estimate for any of the three terms: we do not have an estimate for the variance in the treatment group or in the control group because we have only one observation in each case; and we do not have an estimate for the covariance because we don’t observe both potential outcomes for any case.

Things do look a bit better however with more units…

1.2.9 Var(ATE): Generalizing

From Freedman (2008), Prop. 1 / Example 1 (using combinatorics!) we have:

\(V(\widehat{ATE}) = \frac{1}{n-1}\left[\frac{n_C}{n_T}V_1 + \frac{n_T}{n_C}V_0 + 2C_{01}\right]\)

… where \(V_0, V_1\) denote the variances of \(Y(0)\) and \(Y(1)\), and \(C_{01}\) their covariance (all computed with divisor \(n\), as above)

This is usefully rewritten as:

\[ \begin{split} V(\widehat{ATE}) & = \frac{1}{n-1}\left[\frac{n - n_T}{n_T}V_1 + \frac{n - n_C}{n_C}V_0 + 2C_{01}\right] \\ & = \frac{n}{n-1}\left[\frac{V_1}{n_T} + \frac{V_0}{n_C}\right] - \frac{1}{n-1}\left[V_1 + V_0 - 2C_{01}\right] \end{split} \]

where the final, subtracted term is nonnegative (it equals \(\frac{1}{n-1}\) times the variance of the unit-level treatment effects)
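
A sketch verifying this formula on a tiny population for which we can enumerate every complete random assignment (the potential outcomes are made up; variances and the covariance use divisor \(n\), as above):

# sketch: check Freedman's variance formula by enumerating all assignments
n_units <- 6; n_T <- 3; n_C <- n_units - n_T
Y1 <- c(2, 4, 1, 5, 3, 6)
Y0 <- c(1, 2, 0, 3, 1, 2)

# exact sampling distribution: one DIM estimate per possible assignment
estimates <- apply(combn(n_units, n_T), 2,
                   function(treated) mean(Y1[treated]) - mean(Y0[-treated]))

# population variances / covariance (divisor n)
V   <- function(x) mean((x - mean(x))^2)
C01 <- mean((Y1 - mean(Y1)) * (Y0 - mean(Y0)))

c(direct  = V(estimates),
  formula = (n_C/n_T * V(Y1) + n_T/n_C * V(Y0) + 2 * C01) / (n_units - 1))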

1.2.10 Var(ATE)

Note:

  • With more than two units we can use the sample estimates \(s^2(\{Y_i\}_{i \in C})\) and \(s^2(\{Y_i\}_{i \in T})\) for the first part.
  • But \(C_{01}\) still cannot be estimated from data.
  • The Neyman estimator ignores the second part (and so is conservative).
  • Tip: for STATA users, use , robust (see Samii and Aronow (2012))

1.2.11 ATE and Var(ATE)

For the case with blocking, the conservative estimator is:

\(V(\widehat{ATE}) = {\sum_{S\in \mathcal{S}}{\left(\frac{n_{S}}{n}\right)^2} \left({\frac{s^2_{S1}}{n_{S1}}} + {\frac{s^2_{S0}}{n_{S0}}} \right)}\)
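
A sketch of this estimator computed by hand on the blocked data generated above (blocks defined by X); the square root can be compared with the std.error reported by difference_in_means with blocks = X:

# sketch: conservative blocked variance estimator, by hand
df |>
  group_by(X) |>
  summarize(n_S = n(),
            v_S = var(Y[Z == 1]) / sum(Z == 1) + var(Y[Z == 0]) / sum(Z == 0)) |>
  summarize(V  = sum((n_S / sum(n_S))^2 * v_S),
            se = sqrt(V))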

1.2.12 Illustration of Neyman Conservative Estimator

An illustration of how conservative the conservative estimator of variance really is (the numbers are correlations between \(Y(1)\) and \(Y(0)\)).

We confirm that:

  1. the estimator is conservative
  2. the estimator is more conservative for negative correlations between \(Y(0)\) and \(Y(1)\) — eg if those cases that do particularly badly in control are the ones that do particularly well in treatment, and
  3. with \(\tau\) and \(V(Y(0))\) fixed, high positive correlations are associated with the highest variance.

1.2.13 Illustration of Neyman Conservative Estimator

| \(\tau\) | \(\rho\) | \(\sigma^2_{Y(1)}\) | \(\Delta\) | \(\sigma^2_{\tau}\) | \(\widehat{\sigma}^2_{\tau}\) | \(\widehat{\sigma}^2_{\tau(\text{Neyman})}\) |
|------|-------|------|-------|------|-------|------|
| 1.00 | -1.00 | 1.00 | -0.04 | 0.00 | -0.00 | 0.04 |
| 1.00 | -0.67 | 1.00 | -0.03 | 0.01 | 0.01  | 0.04 |
| 1.00 | -0.33 | 1.00 | -0.03 | 0.01 | 0.01  | 0.04 |
| 1.00 | 0.00  | 1.00 | -0.02 | 0.02 | 0.02  | 0.04 |
| 1.00 | 0.33  | 1.00 | -0.01 | 0.03 | 0.03  | 0.04 |
| 1.00 | 0.67  | 1.00 | -0.01 | 0.03 | 0.03  | 0.04 |
| 1.00 | 1.00  | 1.00 | 0.00  | 0.04 | 0.04  | 0.04 |

Here \(\rho\) is the unobserved correlation between \(Y(1)\) and \(Y(0)\); and \(\Delta\) is the final term in the sample variance equation that we cannot estimate.

1.2.14 Illustration of Neyman Conservative Estimator

1.2.15 Tighter Bounds On Variance Estimate

The conservativeness of the variance estimate comes from the fact that you do not know the covariance between \(Y(1)\) and \(Y(0)\).

  • But as Aronow, Green, and Lee (2014) point out, you do know something.
  • Intuitively, if you know that the variance of \(Y(1)\) is 0, then the covariance also has to be zero.
  • This basic insight opens a way of calculating bounds on the variance of the sample average treatment effect.

1.2.16 Tighter Bounds On Variance Estimate

Example:

  • Take a million-observation dataset, with treatment randomly assigned
  • Assume \(Y(0)=0\) for everyone and \(Y(1)\) distributed normally with mean 0 and standard deviation of 1000.
  • Note here the covariance of \(Y(1)\) and \(Y(0)\) is 0.
  • Note the true variance of the estimated sample average treatment effect should be (approx) \(\frac{Var(Y(1))}{{1000000}} + \frac{Var(Y(0))}{{1000000}} = 1+0=1\), for an se of \(1\).
  • But using the Neyman estimator (or OLS!) we estimate (approx) \(\frac{Var(Y(1))}{({1000000/2})} + \frac{Var(Y(0))}{({1000000/2})} = 2\), for an se of \(\sqrt{2}\).
  • But we can recover the truth knowing the covariance between \(Y(1)\) and \(Y(0)\) is 0.

1.2.17 Tighter Bounds On Variance Estimate: Code

sharp_var <- function(yt, yc, N = length(c(yt, yc)), upper = TRUE){
  
  m <- length(yt)
  n <- m + length(yc)
  
  # estimate of the finite-population variance (divisor N)
  V <- function(x, N) (N - 1)/(N*(length(x) - 1)) * sum((x - mean(x))^2)
  
  # sort outcomes: same direction for the upper bound, opposite for the lower bound
  yt <- sort(yt)
  if(upper) {yc <- sort(yc)
  } else {
    yc <- sort(yc, decreasing = TRUE)}
  
  # common grid of quantiles at which both empirical distributions are evaluated
  p_i <- unique(sort(c(seq(0, n - m, 1)/(n - m), seq(0, m, 1)/m))) - 
        .Machine$double.eps^.5
  p_i[1] <- .Machine$double.eps^.5
  
  yti <- yt[ceiling(p_i*m)]
  yci <- yc[ceiling(p_i*(n - m))]
  
  p_i_minus <- c(NA, p_i[1:(length(p_i) - 1)])
 
  # Neyman terms plus a bound on the unidentified covariance term
  ((N - m)/m * V(yt, N) + (N - (n - m))/(n - m)*V(yc, N) + 
     2*sum(((p_i - p_i_minus)*yti*yci)[2:length(p_i)]) - 2*mean(yt)*mean(yc))/(N - 1)}

1.2.18 Illustration

n   <- 1000000
Y   <- c(rep(0,n/2), 1000*rnorm(n/2))
X   <- c(rep(0,n/2), rep(1, n/2))

lm_robust(Y~X) |> tidy() |> kable(digits = 2)
| term        | estimate | std.error | statistic | p.value | conf.low | conf.high | df     | outcome |
|-------------|----------|-----------|-----------|---------|----------|-----------|--------|---------|
| (Intercept) | 0.00     | 0.00      | 0.63      | 0.53    | 0.00     | 0.00      | 999998 | Y       |
| X           | 1.21     | 1.41      | 0.86      | 0.39    | -1.56    | 3.98      | 999998 | Y       |
c(sharp_var(Y[X==1], Y[X==0], upper = FALSE),
  sharp_var(Y[X==1], Y[X==0], upper = TRUE)) |> 
  round(2)
[1] 1 1

The sharp bounds are \([1,1]\) but the conservative estimate is \(\sqrt{2}\).

1.2.19 Asymptotics

  • It is a remarkable thing that you can estimate the standard error straight from the data
  • However, once you want to use the standard error to do hypothesis testing you generally end up looking up distributions (\(t\)-distributions or normal distributions)
  • That’s a little disappointing and has been one of the criticisms made by Deaton and Cartwright (2018)

However you can do hypothesis testing even without an estimate of the standard error.

Up next

1.3 Randomization Inference

A procedure for using the randomization distribution to calculate \(p\) values

1.3.1 Calculate a \(p\) value in your head

  • Illustrating \(p\) values via “randomization inference”

  • Say you randomized assignment to treatment and your data looked like this.

| Unit         | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|--------------|---|---|---|---|---|---|---|---|---|----|
| Treatment    | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0  |
| Health score | 4 | 2 | 3 | 1 | 2 | 3 | 4 | 8 | 7 | 6  |

Then:

  • Does the treatment improve your health?
  • What’s the \(p\) value for the null that treatment had no effect on anybody?

1.3.2 Calculate a \(p\) value in your head

  • Illustrating \(p\) values via “randomization inference”
  • Say you randomized assignment to treatment and your data looked like this.

| Unit         | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|--------------|---|---|---|---|---|---|---|---|---|----|
| Treatment    | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0  |
| Health score | 4 | 2 | 3 | 1 | 2 | 3 | 4 | 8 | 7 | 6  |

Then:

  • Does the treatment improve your health?
  • What’s the \(p\) value for the null that treatment had no effect on anybody? (A quick check follows below.)
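
A minimal sketch computing the exact \(p\) values for both versions of this example, given that exactly one of the ten units is assigned to treatment:

# sketch: exact one-sided p values for the two tables above
Y <- c(4, 2, 3, 1, 2, 3, 4, 8, 7, 6)

p_value <- function(treated) {
  # under the sharp null, re-assigning treatment just picks a different unit;
  # the difference in means is then Y_treated - mean(Y_others)
  dims <- sapply(1:10, function(i) Y[i] - mean(Y[-i]))
  mean(dims >= dims[treated])
}

p_value(8)   # first table: treated unit scores 8 -> p = 0.1
p_value(9)   # second table: treated unit scores 7 -> p = 0.2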

1.3.3 Randomization Inference: Some code

  • In principle it is very easy.
  • These few lines generate data, produce the difference-in-means estimate, and then an ri estimate of \(p\):
# data
set.seed(1)
df <- fabricate(N = 1000, Z = rep(c(0,1), N/2), Y=  .1*Z + rnorm(N))

# test stat
test.stat <- function(df) with(df, mean(Y[Z==1])- mean(Y[Z==0]))

# test stat distribution
ts <- replicate(4000, df |> mutate(Z = sample(Z)) |> test.stat())

# test
mean(ts >= test.stat(df))   # One sided p value
[1] 0.025

1.3.4 Randomization Inference: Some code

The \(p\) value is the mass to the right of the vertical line

hist(ts); abline(v = test.stat(df), col = "red") 

1.3.5 Using ri2

You can do the same using Alex Coppock’s ri2 package

library(ri2)

# Declare the assignment
assignment <- declare_ra(N = 1000, m = 500)

# Implement
ri2_out <- conduct_ri(
  formula = Y ~ Z,
  declaration = assignment,
  sharp_hypothesis = 0,
  data = df, 
  p = "upper",
  sims = 4000
)

1.3.6 Using ri2

| term | estimate  | upper_p_value |
|------|-----------|---------------|
| Z    | 0.1321367 | 0.02225       |

You’ll notice a slightly different answer. This is because, although the procedure is “exact”, it is subject to simulation error.

1.3.7 Randomization Inference

  • Randomization inference can get more complicated when you want to test a null other than the sharp null of no effect.
  • Say you wanted to test the null that the effect is 2 for all units. How do you do it?
  • Say you wanted to test the null that an interaction effect is zero. How do you do it?
  • In both cases, by filling in a potential outcomes schedule given the hypothesis in question and then generating a test statistic (see the sketch after the table)

| Observed Y(0) | Observed Y(1) | Y(0) under null: effect is 0 | Y(1) under null: effect is 0 | Y(0) under null: effect is 2 | Y(1) under null: effect is 2 |
|----|----|---|---|---|---|
| 1  | NA | 1 | 1 | 1 | 3 |
| 2  | NA | 2 | 2 | 2 | 4 |
| NA | 4  | 4 | 4 | 2 | 4 |
| NA | 3  | 3 | 3 | 1 | 3 |
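
A minimal sketch for the four-unit example in the table, testing the sharp null that the effect is 2 for every unit: impute the schedule implied by the null and then enumerate all assignments of two units to treatment.

# sketch: testing the sharp null of a constant effect of 2
Z <- c(0, 0, 1, 1)
Y <- c(1, 2, 4, 3)

# potential outcomes schedule implied by the null
Y0 <- Y - 2 * Z      # 1 2 2 1
Y1 <- Y0 + 2         # 3 4 4 3

observed <- mean(Y[Z == 1]) - mean(Y[Z == 0])

# test statistic under every possible assignment of two units to treatment
null_dist <- apply(combn(4, 2), 2, function(treated) {
  Z_sim <- replace(rep(0, 4), treated, 1)
  mean(Y1[Z_sim == 1]) - mean(Y0[Z_sim == 0])
})

mean(abs(null_dist - 2) >= abs(observed - 2))   # two-sided p value

Here the observed difference in means is exactly 2, so the data are fully consistent with this null and the \(p\) value is 1.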

1.3.8 ri and CIs

It is possible to use this procedure to generate confidence intervals with a natural interpretation.

  • The key idea is that we can use the same procedure to assess the probability of the data given a sharp null of no effect, but also a sharp null of any other constant effect.
  • We can then see what set of effects we reject and what set we accept
  • We are left with a set of values that we cannot reject at the 0.05 level.

1.3.9 ri and CIs in practice

candidates <- seq(-.05, .3, length = 50)

get_p <- function(j)
  (conduct_ri(
      formula = Y ~ Z,
      declaration = assignment,
      sharp_hypothesis = j,
      data = df,
      sims = 5000
    ) |> summary()
  )$two_tailed_p_value

# Implement
ps <- candidates |> sapply(get_p)

1.3.10 ri and CIs in practice

Warning: calculating confidence intervals this way can be computationally intensive
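
Given the vector of \(p\) values just computed, the interval of non-rejected constant effects can be read off directly (a sketch):

# sketch: the ri "confidence interval" is the set of hypothesized constant
# effects that we fail to reject at the 0.05 level
range(candidates[ps >= 0.05])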

1.3.11 ri with DeclareDesign

  • DeclareDesign can do randomization inference natively
  • The trick is to ensure that, when calculating the \(p\) values, the only stochastic component is the assignment to treatment

1.3.12 ri with DeclareDesign (advanced)

Here we get minimal detectable effects by using a design with two-stage simulations, so that we can estimate the sampling distribution of summaries of the sampling distribution generated from reassignments.

test_stat <- function(data)
  with(data, data.frame(estimate = mean(Y[Z==1]) - mean(Y[Z==0])))

b <- 0

design <- 
  declare_model(N = 100,  Z = complete_ra(N), Y = b*Z + rnorm(N)) +
  declare_estimator(handler = label_estimator(test_stat), label = "actual")+
  declare_measurement(Z = sample(Z))  + # this is the permutation step
  declare_estimator(handler = label_estimator(test_stat), label = "null")

1.3.13 ri with DeclareDesign (advanced)

Simulations data frame from the two-step simulation. Note the computational intensity: the number of runs is the product of the sims vector. I speed things up by using a simple estimation function and parallelization.

library(future)
options(parallelly.fork.enable = TRUE) 
plan(multicore) 

simulations <- 
  design |> redesign(b = c(0, .25, .5, .75, 1)) |> 
  simulate_design(sims = c(200, 1, 1000, 1)) 

1.3.14 ri with DeclareDesign (advanced)

A snapshot of the simulations data frame: we have multiple step-3 draws for each design and step-1 draw.

simulations |> head() |> kable(digits = 2)
| design   | b | sim_ID | estimator | estimate | step_1_draw | step_3_draw |
|----------|---|--------|-----------|----------|-------------|-------------|
| design_1 | 0 | 1      | actual    | -0.02    | 1           | 1           |
| design_1 | 0 | 1      | null      | 0.06     | 1           | 1           |
| design_1 | 0 | 2      | actual    | -0.02    | 1           | 2           |
| design_1 | 0 | 2      | null      | 0.03     | 1           | 2           |
| design_1 | 0 | 3      | actual    | -0.02    | 1           | 3           |
| design_1 | 0 | 3      | null      | -0.05    | 1           | 3           |

1.3.15 ri with DeclareDesign (advanced)

Power for each value of b.

simulations |> group_by(b, step_1_draw) |> 
  summarize(p = mean(abs(estimate[estimator == "null"]) >= abs(estimate[estimator == "actual"]))) |>
  group_by(b) |> summarize(power = mean(p <= .05)) |>
  ggplot(aes(b, power)) + geom_line() + geom_hline(yintercept = .8, color = "red")

If you want to figure out more precisely what b gives 80% or 90% power you can narrow down the b range.

1.3.16 ri interactions

Let’s now imagine a world with two treatments where we are interested in using ri to assess the interaction. (Code from Coppock, ri2.)

set.seed(1)

N <- 100
declaration <- randomizr::declare_ra(N = N, m = 50)

data <- fabricate(
  N = N,
  Z = conduct_ra(declaration),
  X = rnorm(N),
  Y = .9 * X + .2 * Z + .1 * X * Z + rnorm(N))

1.3.17 ri interactions

The approach is to declare a null model that is nested in the full model. The \(F\) statistic from the model comparison is then taken as the test statistic, and its distribution is built up under re-randomizations.

conduct_ri(
    model_1 = Y ~ Z + X,
    model_2 = Y ~ Z + X + Z * X,
    declaration = declaration,
    assignment = "Z",
    sharp_hypothesis = coef(lm(Y ~ Z, data = data))[2],
    data = data, 
    sims = 1000
  )  |> summary() |> kable()
| term        | estimate | two_tailed_p_value |
|-------------|----------|--------------------|
| F-statistic | 1.954396 | 0.171              |

1.3.18 ri interactions with DeclareDesign

Let’s imagine a true model with interactions. We take an estimate. We then ask how likely such an estimate would be under a null model with constant effects.

Note: this is quite a sharp hypothesis

df <- fabricate(N = 1000, Z1 = rep(c(0,1), N/2), Z2 = sample(Z1), Y = Z1 + Z2 - .15*Z1*Z2 + rnorm(N))

my_estimate <- (lm(Y ~ Z1*Z2, data = df) |> coef())[4]

null_model <-  function(df) {
  M0 <- lm(Y ~ Z1 + Z2, data = df) 
  d1 <- coef(M0)[2]
  d2 <- coef(M0)[3]
  df |> mutate(
    Y_Z1_0_Z2_0 = Y - Z1*d1 - Z2*d2,
    Y_Z1_1_Z2_0 = Y + (1-Z1)*d1 - Z2*d2,
    Y_Z1_0_Z2_1 = Y - Z1*d1 + (1-Z2)*d2,
    Y_Z1_1_Z2_1 = Y + (1-Z1)*d1 + (1-Z2)*d2)
  }

1.3.19 ri interactions with DeclareDesign

Let’s imagine a true model with interactions. We take an estimate. We then ask how likely such an estimate would be under a null model with constant effects.

Imputed potential outcomes look like this:

df <- df |> null_model()

df |>  head() |> kable(digits = 2, align = "c")
| ID   | Z1 | Z2 | Y     | Y_Z1_0_Z2_0 | Y_Z1_1_Z2_0 | Y_Z1_0_Z2_1 | Y_Z1_1_Z2_1 |
|------|----|----|-------|-------------|-------------|-------------|-------------|
| 0001 | 0  | 0  | -0.18 | -0.18       | 0.76        | 0.68        | 1.61        |
| 0002 | 1  | 0  | 0.20  | -0.73       | 0.20        | 0.12        | 1.06        |
| 0003 | 0  | 0  | 2.56  | 2.56        | 3.50        | 3.42        | 4.36        |
| 0004 | 1  | 0  | -0.27 | -1.21       | -0.27       | -0.35       | 0.59        |
| 0005 | 0  | 1  | -2.13 | -2.99       | -2.05       | -2.13       | -1.19       |
| 0006 | 1  | 1  | 3.52  | 1.72        | 2.66        | 2.58        | 3.52        |

1.3.20 ri interactions with DeclareDesign

design <- 
  declare_model(data = df) +
  declare_measurement(Z1 = sample(Z1), Z2 = sample(Z2),
                      Y = reveal_outcomes(Y ~ Z1 + Z2)) +
  declare_estimator(Y ~ Z1*Z2, term = "Z1:Z2")
diagnose_design(design, sims = 1000, diagnosands = ri_ps(my_estimate))
| Design | Estimator | Outcome | Term  | N Sims | One Sided Pos | One Sided Neg | Two Sided |
|--------|-----------|---------|-------|--------|---------------|---------------|-----------|
| design | estimator | Y       | Z1:Z2 | 1000   | 0.95          | 0.05          | 0.10      |
|        |           |         |       |        | (0.01)        | (0.01)        | (0.01)    |

1.3.21 ri in practice

  • In practice (unless you have a design declaration), it is a good idea to create a \(P\) matrix when you do your randomization.
  • This records the set of possible randomizations you might have had, or a sample of these (see the sketch below).
  • So, again: assignments have to be replicable
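
A minimal sketch, reusing the assignment declaration from the ri2 example above:

# sketch: store a sample of the assignments you might have had
P <- replicate(1000, randomizr::conduct_ra(assignment))
dim(P)   # one column per re-draw of the assignment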

1.3.22 ri Applications

  • Recall the silly randomization procedure from the Fixer example above.
  • Say you forgot to take account of the wacky assignment in your estimates and you estimate 0.15.
  • Does the treatment improve your health?: \(p=?\)

1.3.23 ri Applications

  • Randomization procedures are sometimes funky in lab experiments
  • Using randomization inference would force a focus on the true assignment of individuals to treatments
  • Fake (but believable) example follows

1.3.24 ri Applications

| Session  | Capacity | T1 | T2 | T3 |
|----------|----------|----|----|----|
| Thursday | 40       | 10 | 30 | 0  |
| Friday   | 40       | 10 | 0  | 30 |
| Saturday | 10       | 10 | 0  | 0  |

Optimal assignment to treatment given constraints due to facilities

| Subject type | N  | Available  |
|--------------|----|------------|
| A            | 30 | Thurs, Fri |
| B            | 30 | Thurs, Sat |
| C            | 30 | Fri, Sat   |

Constraints due to subjects

1.3.25 ri Applications

If you think hard about assignment you might come up with an allocation like this.

Allocations
| Subject type | N  | Available  | Thurs | Fri | Sat |
|--------------|----|------------|-------|-----|-----|
| A            | 30 | Thurs, Fri | 15    | 15  | NA  |
| B            | 30 | Thurs, Sat | 25    | NA  | 5   |
| C            | 30 | Fri, Sat   | NA    | 25  | 5   |

Assignment of people to days

1.3.26 ri Applications

That allocation balances as much as possible. Given the allocation you might randomly assign individuals to different days as well as randomly assigning them to treatments within days. If you then figure out assignment propensities, this is what you would get:

Assignment Probabilities
| Subject type | N  | Available  | T1    | T2    | T3    |
|--------------|----|------------|-------|-------|-------|
| A            | 30 | Thurs, Fri | 0.250 | 0.375 | 0.375 |
| B            | 30 | Thurs, Sat | 0.375 | 0.625 | 0.000 |
| C            | 30 | Fri, Sat   | 0.375 | NA    | 0.625 |

1.3.27 ri Applications

Even under the assumption that the day of measurement does not matter, these assignment probabilities have big implications for analysis.

Assignment Probabilities
| Subject type | N  | Available  | T1    | T2    | T3    |
|--------------|----|------------|-------|-------|-------|
| A            | 30 | Thurs, Fri | 0.250 | 0.375 | 0.375 |
| B            | 30 | Thurs, Sat | 0.375 | 0.625 | 0.000 |
| C            | 30 | Fri, Sat   | 0.375 | NA    | 0.625 |

  • Only the type \(A\) subjects could have received any of the three treatments.

  • There are no two treatments for which it is possible to compare outcomes for subpopulations \(B\) and \(C\)

  • A comparison of \(T1\) versus \(T2\) can only be made for population \(A \cup B\)

  • However, conditional on being in \(T1\) or \(T2\), subpopulation \(A\) is assigned to \(T1\) with probability 2/5, while population \(B\) is assigned with probability 3/8

1.3.28 ri Applications

  • Implications for design: need to uncluster treatment delivery

  • Implications for analysis: need to take account of propensities

Idea: Wacky assignments happen but if you know the propensities you can do the analysis.

1.3.29 ri Applications: Indirect assignments

A particularly interesting application is where a random assignment combines with existing features to determine an assignment to an “indirect” treatment.

  • For instance: \(n\) of \(N\) are assigned to a treatment.
  • You are interested in whether “having a friend assigned to treatment” makes a difference to a subject. Or maybe “a friend of a friend”
  • That means the subject has a complex clustered assignment that depends on how many friends they have
  • A bit mind-boggling, but (see the sketch after this list):
    • Rerun your assignment many times and each time figure out whether a subject is assigned to an indirect treatment or not
    • Calculate the implied quantity of interest for each assignment
    • Assess the place of the actual quantity in the sampling distribution
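
A minimal sketch with a made-up friendship network (the network, sample size, and number treated are all illustrative):

# sketch: propensities for the indirect treatment "at least one friend treated"
set.seed(1)
N <- 100
friends <- matrix(rbinom(N^2, 1, 0.05), N, N); diag(friends) <- 0

indirect <- function(Z) as.numeric(friends %*% Z > 0)   # 1 if any friend is treated

# re-run the direct assignment many times to learn each unit's indirect propensity
probs <- replicate(2000, indirect(randomizr::complete_ra(N, m = 20))) |>
  apply(1, mean)

summary(probs)   # propensities vary with how many friends a unit has

These indirect propensities can then be used for weighting or, holding the assignment procedure fixed, for randomization inference as above.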

1.4 Principle: Keep the reporting close to the design

1.4.1 Design-based analysis

Report the analysis that is implied by the design.

|          | T2 = N | T2 = Y | All | Diff |
|----------|--------|--------|-----|------|
| T1 = N   | \(\overline{y}_{00}\) (sd) | \(\overline{y}_{01}\) (sd) | \(\overline{y}_{0x}\) (sd) | \(d_2 \mid T1=0\) (sd) |
| T1 = Y   | \(\overline{y}_{10}\) (sd) | \(\overline{y}_{11}\) (sd) | \(\overline{y}_{1x}\) (sd) | \(d_2 \mid T1=1\) (sd) |
| All      | \(\overline{y}_{x0}\) (sd) | \(\overline{y}_{x1}\) (sd) | \(\overline{y}\) (sd) | \(d_2\) (sd) |
| Diff     | \(d_1 \mid T2=0\) (sd) | \(d_1 \mid T2=1\) (sd) | \(d_1\) (sd) | \(d_1d_2\) (sd) |

This is instantly recognizable from the design and returns all the benefits of the factorial design including all main effects, conditional causal effects, interactions and summary outcomes. It is much clearer and more informative than a regression table.

Deaton, Angus, and Nancy Cartwright. 2018. “Understanding and Misunderstanding Randomized Controlled Trials.” Social Science & Medicine 210: 2–21.
Freedman, David A. 2008. “On Regression Adjustments to Experimental Data.” Advances in Applied Mathematics 40 (2): 180–93.
Samii, Cyrus, and P M Aronow. 2012. “On Equivalencies Between Design-Based and Regression-Based Variance Estimators for Randomized Experiments.” Statistics & Probability Letters 82 (2): 365–70.