See the topics on controls and doubly robust estimation
The (sample) average treatment effect can be estimated without bias (whether or not there is imbalance on covariates) using:
\[ \widehat{ATE} = \frac{1}{n_T}\sum_TY_i - \frac{1}{n_C}\sum_CY_i, \]
| Y1 | Y0 | diff |
|---|---|---|
| 1.07 | -0.28 | 1.35 |
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| Z | 1.35 | 0.17 | 7.94 | 0 | 1.01 | 1.68 | 97.98 | Y |
We can also do this with regression:
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | -0.28 | 0.12 | -2.33 | 0.02 | -0.51 | -0.04 | 98 | Y |
| Z | 1.35 | 0.17 | 7.94 | 0.00 | 1.01 | 1.68 | 98 | Y |
See Freedman (2008) on why regression is fine here
Say now different strata or blocks \(\mathcal{S}\) had different assignment probabilities. Then you could estimate:
\[ \widehat{ATE} = \sum_{S\in \mathcal{S}}\frac{n_{S}}{n} \left(\frac{1}{n_{S1}}\sum_{S\cap T}y_i - \frac{1}{n_{S0}}\sum_{S\cap C}y_i \right) \]
Note: you cannot just ignore the blocks, because assignment is no longer independent of potential outcomes: you might be sampling units with different potential outcomes with different probabilities.
The formula above works fine, however, because assignment is random conditional on blocks.
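As a sketch of this estimator in base R (the function name and toy data are mine):

```r
# Block-weighted difference in means: estimate the ATE within each
# block, then average the block estimates with weights n_S / n
blocked_ate <- function(Y, Z, S) {
  n <- length(Y)
  blocks <- split(data.frame(Y, Z), S)
  sum(sapply(blocks, function(b)
    nrow(b) / n * (mean(b$Y[b$Z == 1]) - mean(b$Y[b$Z == 0]))))
}

blocked_ate(Y = c(1, 0, 3, 2), Z = c(1, 0, 1, 0), S = c("a", "a", "b", "b"))
```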
As a DAG this is just classic confounding:
Data with heterogeneous assignments:
True effect is 0.5, but:
Averaging over effects in blocks
[1] 0.7236939
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| Z | 0.72 | 0.11 | 6.66 | 0 | 0.51 | 0.94 | 496 | Y |
This also corresponds to the difference in the weighted average of treatment outcomes (with weights given by the inverse of the probability that each unit is assigned to treatment) and control outcomes (with weights given by the inverse of the probability that each unit is assigned to control).
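A minimal sketch of this weighted estimator in base R (a Hájek-style version with normalized weights; the function name is mine):

```r
# Difference of inverse-propensity-weighted means: pi gives each
# unit's probability of assignment to treatment
ipw_ate <- function(Y, Z, pi) {
  weighted.mean(Y[Z == 1], 1 / pi[Z == 1]) -
    weighted.mean(Y[Z == 0], 1 / (1 - pi[Z == 0]))
}
```

With blocked data, setting `pi` to each unit's block-level assignment probability reproduces the blocked estimator exactly when the realized treatment counts match the block probabilities.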
| Y1 | Y0 | diff |
|---|---|---|
| 0.59 | -0.15 | 0.74 |
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| Z | 0.74 | 0.11 | 6.65 | 0 | 0.52 | 0.96 | 498 | Y |
But inverse propensity weighting is a more general principle, which can be used even if you do not have blocks.
The intuition for it comes straight from sampling weights — you weight up in order to recover an unbiased estimate of the potential outcomes for all units, whether or not they are assigned to treatment.
With sampling weights, however, you can include units even if their weight is 1. Why can you not include such units when doing inverse propensity weighting?
Say you made a mess and used a randomization that was correlated with some variable, \(U\). For example:
Bad assignment: some randomization process you can’t understand (but can replicate) that results in unequal probabilities.
The result is a sampling distribution that is not centered on the true effect (0).
To fix this, you can estimate the assignment probabilities by replicating the assignment many times:
and then use these estimated assignment probabilities in your estimator
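A sketch of that fix in base R: here `odd_assignment()` is a stand-in for whatever opaque but replicable randomization you used (the logistic dependence on \(U\) is an assumption for illustration):

```r
set.seed(1)
# stand-in for the opaque (but replicable) assignment procedure
odd_assignment <- function(U) rbinom(length(U), 1, plogis(U))

U <- rnorm(100)
# replicate the assignment many times; each unit's propensity is
# the share of replicates in which it lands in treatment
pi_hat <- rowMeans(replicate(5000, odd_assignment(U)))
```

The estimated `pi_hat` values can then be plugged into an inverse propensity weighted estimator.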
Implied weights
Improved results
This example is surprising, but it helps you see the logic of why inverse weighting gets unbiased estimates (and why unbiasedness alone does not guarantee a reasonable answer).
Imagine there is one unit with potential outcomes \(Y(1) = 2, Y(0) = 1\). So the unit level treatment effect is 1.
You toss a coin. If it comes up heads the unit is treated and your IPW estimate is \(2/0.5 = 4\); if tails the unit is in control and your estimate is \(-1/0.5 = -2\).
So your expected estimate is: \[0.5 \times 4 + 0.5 \times (-2) = 1\]
Great on average but always lousy
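A quick simulation check of the one-unit example:

```r
set.seed(1)
# each run: toss the coin and record the IPW estimate (4 or -2)
est <- replicate(1e5, if (rbinom(1, 1, 0.5) == 1) 2 / 0.5 else -1 / 0.5)
mean(est)  # close to 1, though every single estimate is 4 or -2
```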
\[\hat{\overline{Y_1}} = \frac{1}n\left(\sum_i \frac{Z_iY_i(1)}{\pi_i}\right)\] With independent assignment the expected value of \(\hat{\overline{Y_1}}\) is just:
\[\mathbb{E}[\hat{\overline{Y_1}}] =\frac1n\left( \left(\pi_1 \frac{1\times Y_1(1)}{\pi_1} + (1-\pi_1) \frac{0\times Y_1(1)}{\pi_1}\right) + \left(\pi_2 \frac{1\times Y_2(1)}{\pi_2} + (1-\pi_2) \frac{0\times Y_2(1)}{\pi_2}\right) + \dots\right)\]
\[\mathbb{E}[\hat{\overline{Y_1}}] =\frac1n\left( Y_1(1) + Y_2(1) + \dots\right) = \overline{Y_1}\]
and similarly for \(\mathbb{E}[\hat{\overline{Y_0}}]\) and so using linearity of expectations:
\[\mathbb{E}[\widehat{\overline{Y_1 - Y_0}}] = \overline{Y_1 - Y_0}\]
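A simulation sketch confirming unbiasedness with unequal (known) assignment probabilities (toy potential outcomes; true ATE is 2):

```r
set.seed(2)
pi <- c(0.2, 0.5, 0.8)              # unequal assignment probabilities
Y1 <- c(1, 2, 3); Y0 <- c(0, 0, 0)  # true ATE = mean(Y1 - Y0) = 2
est <- replicate(20000, {
  Z <- rbinom(3, 1, pi)             # independent assignment
  mean(Z * Y1 / pi) - mean((1 - Z) * Y0 / (1 - pi))
})
mean(est)  # close to 2
```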
Let’s talk about “inference”
In classical statistics we characterize our uncertainty over an estimate using an estimate of variance of the sampling distribution of the estimator.
The key idea is that we want to be able to say: how likely would we be to get such an estimate if the distribution of estimates associated with our design looked a given way?
More specifically, we want to estimate the “standard error”: the “standard deviation of the sampling distribution.”
(See Wooldridge (2023), where the standard error is understood as the “estimate of the standard deviation of the sampling distribution.”)
Given \(n\) repeated ‘runs’ of a design, the variance of the estimator is: \(Var(\hat{\tau}) = \frac{1}{n}\sum_i(\hat\tau_i - \overline{\hat\tau})^2\)
And the standard error is:
\(se(\hat{\tau}) = \sqrt{\frac{1}{n}\sum_i(\hat\tau_i - \overline{\hat\tau})^2}\)
If we have a good measure for the shape of the sampling distribution we can start to make statements of the form:
If the sampling distribution is roughly normal, as it may be with large samples, then we can use procedures such as: “there is a 5% probability that an estimate would be more than 1.96 standard errors away from the mean of the sampling distribution”
Key idea: You can estimate variance straight from the data, given knowledge of the assignment process and assuming well defined potential outcomes
Recall that in general \(Var(x) = \frac{1}{n}\sum_i(x_i - \overline{x})^2\). Here the \(x_i\)s are the treatment effect estimates we might get under different random assignments, \(n\) is the number of different assignments (assumed here to be equally likely, though otherwise we can weight), and \(\overline{x}\) is the truth.
For intuition imagine we have just two units \(A\), \(B\), with potential outcomes \(A_1\), \(A_0\), \(B_1\), \(B_0\).
When there are two units with outcomes \(x_1, x_2\), the variance simplifies like this:
\[Var(x) = \frac{1}2\left(x_1 - \frac{x_1 + x_2}{2}\right)^2 + \frac{1}2\left(x_2 - \frac{x_1 + x_2}{2}\right)^2 = \left(\frac{x_1 - x_2}{2}\right)^2\]
In the two unit case the two possible treatment estimates are: \(\hat{\tau}_1=A_1 - B_0\) and \(\hat{\tau}_2=B_1 - A_0\), depending on what gets put into treatment. So the variance is:
\[Var(\hat{\tau}) = \left(\frac{\hat{\tau}_1 - \hat{\tau}_2}{2}\right)^2 = \left(\frac{(A_1 - B_0) - (B_1 - A_0)}{2}\right)^2 =\left(\frac{(A_1 - B_1) + (A_0 - B_0)}{2}\right)^2 \] which we can re-write as:
\[Var(\hat{\tau}) = \left(\frac{A_1 - B_1}{2}\right)^2 + \left(\frac{A_0 - B_0}{2}\right)^2+ 2\frac{(A_1 - B_1)(A_0-B_0)}{4}\] The first two terms correspond to the variance of \(Y(1)\) and the variance of \(Y(0)\). The last term is a bit pesky though: it corresponds to twice the covariance of \(Y(1)\) and \(Y(0)\).
How can we go about estimating this?
\[Var(\hat{\tau}) = \left(\frac{A_1 - B_1}{2}\right)^2 + \left(\frac{A_0 - B_0}{2}\right)^2+ 2\frac{(A_1 - B_1)(A_0-B_0)}{4}\]
In the two unit case it is quite challenging because we do not have an estimate for any of the three terms: we do not have an estimate for the variance in the treatment group or in the control group because we have only one observation in each case; and we do not have an estimate for the covariance because we don’t observe both potential outcomes for any case.
Things do look a bit better however with more units…
From Freedman Prop 1 / Example 1 (using combinatorics!) we have:
\(V(\widehat{ATE}) = \frac{1}{n-1}\left[\frac{n_C}{n_T}V_1 + \frac{n_T}{n_C}V_0 + 2C_{01}\right]\)
… where \(V_0, V_1\) denote variances and \(C_{01}\) covariance
This is usefully rewritten as:
\[ \begin{split} V(\widehat{ATE}) & = \frac{1}{n-1}\left[\frac{n - n_T}{n_T}V_1 + \frac{n - n_C}{n_C}V_0 + 2C_{01}\right] \\ & = \frac{n}{n-1}\left[\frac{V_1}{n_T} + \frac{V_0}{n_C}\right] - \frac{1}{n-1}\left[V_1 + V_0 - 2C_{01}\right] \end{split} \]
where the final (subtracted) term is nonnegative, since \(V_1 + V_0 - 2C_{01}\) is the variance of the unit-level treatment effects; dropping it therefore yields a conservative (upwards-biased) variance estimate
Note: this conservative estimator is equivalent to the HC2 robust variance estimator (see Samii and Aronow (2012)). For the case with blocking, the conservative estimator is:
\(V(\widehat{ATE}) = {\sum_{S\in \mathcal{S}}{\left(\frac{n_{S}}{n}\right)^2} \left({\frac{s^2_{S1}}{n_{S1}}} + {\frac{s^2_{S0}}{n_{S0}}} \right)}\)
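A sketch of this estimator in base R (the function name and toy data are mine; each block needs at least two units per arm for the within-arm variances):

```r
# Conservative (Neyman) variance for a blocked design:
# sum over blocks of (n_S/n)^2 * (s2_S1/n_S1 + s2_S0/n_S0)
blocked_neyman_var <- function(Y, Z, S) {
  n <- length(Y)
  sum(sapply(split(data.frame(Y, Z), S), function(b)
    (nrow(b) / n)^2 *
      (var(b$Y[b$Z == 1]) / sum(b$Z == 1) +
       var(b$Y[b$Z == 0]) / sum(b$Z == 0))))
}

blocked_neyman_var(Y = c(1, 3, 2, 4), Z = c(1, 1, 0, 0), S = rep("a", 4))
```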
An illustration of how conservative the conservative estimator of variance really is (numbers in the plot are correlations between \(Y(1)\) and \(Y(0)\)).
We confirm that:
| \(\tau\) | \(\rho\) | \(\sigma^2_{Y(1)}\) | \(\Delta\) | \(\sigma^2_{\tau}\) | \(\widehat{\sigma}^2_{\tau}\) | \(\widehat{\sigma}^2_{\tau(\text{Neyman})}\) |
|---|---|---|---|---|---|---|
| 1.00 | -1.00 | 1.00 | -0.04 | 0.00 | -0.00 | 0.04 |
| 1.00 | -0.67 | 1.00 | -0.03 | 0.01 | 0.01 | 0.04 |
| 1.00 | -0.33 | 1.00 | -0.03 | 0.01 | 0.01 | 0.04 |
| 1.00 | 0.00 | 1.00 | -0.02 | 0.02 | 0.02 | 0.04 |
| 1.00 | 0.33 | 1.00 | -0.01 | 0.03 | 0.03 | 0.04 |
| 1.00 | 0.67 | 1.00 | -0.01 | 0.03 | 0.03 | 0.04 |
| 1.00 | 1.00 | 1.00 | 0.00 | 0.04 | 0.04 | 0.04 |
Here \(\rho\) is the unobserved correlation between \(Y(1)\) and \(Y(0)\); and \(\Delta\) is the final term in the sample variance equation that we cannot estimate.
The conservative variance comes from the fact that you do not know the covariance between \(Y(1)\) and \(Y(0)\).
Example:
```r
# Upper (or lower) sharp bound on the variance of the difference-in-means
# estimator, built from the sorted marginal distributions of the
# treated (yt) and control (yc) outcomes
sharp_var <- function(yt, yc, N = length(c(yt, yc)), upper = TRUE) {
  m <- length(yt)
  n <- m + length(yc)
  # finite-population variance of x, rescaled to population size N
  V <- function(x, N) (N - 1) / (N * (length(x) - 1)) * sum((x - mean(x))^2)
  # sort outcomes: comonotonic for the upper bound, antitonic for the lower
  yt <- sort(yt)
  if (upper) {
    yc <- sort(yc)
  } else {
    yc <- sort(yc, decreasing = TRUE)
  }
  # grid of quantile breakpoints from both groups
  p_i <- unique(sort(c(seq(0, n - m, 1) / (n - m), seq(0, m, 1) / m))) -
    .Machine$double.eps^.5
  p_i[1] <- .Machine$double.eps^.5
  yti <- yt[ceiling(p_i * m)]
  yci <- yc[ceiling(p_i * (n - m))]
  p_i_minus <- c(NA, p_i[1:(length(p_i) - 1)])
  # plug the bounded covariance term into the variance formula
  ((N - m) / m * V(yt, N) + (N - (n - m)) / (n - m) * V(yc, N) +
      2 * sum(((p_i - p_i_minus) * yti * yci)[2:length(p_i)]) -
      2 * mean(yt) * mean(yc)) / (N - 1)
}
```

| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 0.00 | 0.00 | 0.63 | 0.53 | 0.00 | 0.00 | 999998 | Y |
| X | 1.21 | 1.41 | 0.86 | 0.39 | -1.56 | 3.98 | 999998 | Y |
[1] 1 1
The sharp bounds are \([1,1]\) but the conservative estimate is \(\sqrt{2}\).
However you can do hypothesis testing even without an estimate of the standard error.
Up next
A procedure for using the randomization distribution to calculate \(p\) values
Illustrating \(p\) values via “randomization inference”
Say you randomized assignment to treatment and your data looked like this.
| Unit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Treatment | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Health score | 4 | 2 | 3 | 1 | 2 | 3 | 4 | 8 | 7 | 6 |
Then:
| Unit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Treatment | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Health score | 4 | 2 | 3 | 1 | 2 | 3 | 4 | 8 | 7 | 6 |
Then:
ri estimate of \(p\):

```r
# data (uses fabricatr and dplyr)
library(fabricatr)
library(dplyr)
set.seed(1)
df <- fabricate(N = 1000, Z = rep(c(0, 1), N / 2), Y = .1 * Z + rnorm(N))
# test stat: difference in means
test.stat <- function(df) with(df, mean(Y[Z == 1]) - mean(Y[Z == 0]))
# test stat distribution under re-randomization
ts <- replicate(4000, df |> mutate(Z = sample(Z)) |> test.stat())
# test: one sided p value
mean(ts >= test.stat(df))
```

[1] 0.025
The \(p\) value is the mass to the right of the vertical line.
ri2
You can do the same using Alex Coppock’s ri2 package:

| term | estimate | upper_p_value |
|---|---|---|
| Z | 0.1321367 | 0.02225 |
You’ll notice a slightly different answer. This is because, although the procedure is “exact,” it is subject to simulation error.
| Observed | | Under null that effect is 0 | | Under null that effect is 2 | |
|---|---|---|---|---|---|
| Y(0) | Y(1) | Y(0) | Y(1) | Y(0) | Y(1) |
| 1 | NA | 1 | 1 | 1 | 3 |
| 2 | NA | 2 | 2 | 2 | 4 |
| NA | 4 | 4 | 4 | 2 | 4 |
| NA | 3 | 3 | 3 | 1 | 3 |
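The same permutation logic handles non-zero sharp nulls: impute the full schedule of potential outcomes under the hypothesized effect, re-assign, and recompute the test statistic. A base R sketch (the function name is mine):

```r
# p value for the sharp null that the effect is tau0 for every unit
ri_p <- function(Y, Z, tau0, sims = 2000) {
  obs <- mean(Y[Z == 1]) - mean(Y[Z == 0])
  Y0 <- Y - Z * tau0           # impute Y(0) under the sharp null
  ts <- replicate(sims, {
    Zs <- sample(Z)            # re-randomize
    Ys <- Y0 + Zs * tau0       # outcome revealed under reassignment
    mean(Ys[Zs == 1]) - mean(Ys[Zs == 0])
  })
  mean(abs(ts - tau0) >= abs(obs - tau0))  # two-sided p value
}
```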
ri and CIs
It is possible to use this procedure to generate confidence intervals with a natural interpretation.
ri and CIs in practice
Warning: calculating confidence intervals this way can be computationally intensive.
ri with DeclareDesign
DeclareDesign can do randomization inference natively.
ri with DeclareDesign (advanced)
Here we get minimal detectable effects by using a design with two-stage simulations, so that we can estimate the sampling distribution of summaries of the sampling distribution generated from reassignments.
```r
test_stat <- function(data)
  with(data, data.frame(estimate = mean(Y[Z==1]) - mean(Y[Z==0])))

b <- 0

design <-
  declare_model(N = 100, Z = complete_ra(N), Y = b*Z + rnorm(N)) +
  declare_estimator(handler = label_estimator(test_stat), label = "actual") +
  declare_measurement(Z = sample(Z)) +  # this is the permutation step
  declare_estimator(handler = label_estimator(test_stat), label = "null")
```

ri with DeclareDesign (advanced)
Simulations data frame from the two-step simulation. Note the computational intensity: the number of runs is the product of the sims vector. I speed things up by using a simple estimation function and by parallelizing.
ri with DeclareDesign (advanced)
A snapshot of the simulations data frame: we have multiple step 3 draws for each design and step 1 draw.
ri with DeclareDesign (advanced)
Power for each value of b.
If you want to figure out more precisely which b gives 80% or 90% power, you can narrow down the range of b.
ri interactions
Let’s now imagine a world with two treatments where we are interested in using ri to assess the interaction. (Code from Coppock, ri2.)
ri interactions
The approach is to declare a null model that is nested in the full model. The \(F\) statistic from the model comparison is taken as the test statistic, and its distribution is built up under re-randomizations.
ri interactions with DeclareDesign
Let’s imagine a true model with interactions. We take an estimate and then ask how likely such an estimate would be under a null model with constant effects.
Note: this is quite a sharp hypothesis.
```r
df <- fabricate(N = 1000, Z1 = rep(c(0,1), N/2), Z2 = sample(Z1),
                Y = Z1 + Z2 - .15*Z1*Z2 + rnorm(N))

my_estimate <- (lm(Y ~ Z1*Z2, data = df) |> coef())[4]

# impute potential outcomes under the null of constant (additive) effects
null_model <- function(df) {
  M0 <- lm(Y ~ Z1 + Z2, data = df)
  d1 <- coef(M0)[2]
  d2 <- coef(M0)[3]
  df |> mutate(
    Y_Z1_0_Z2_0 = Y - Z1*d1 - Z2*d2,
    Y_Z1_1_Z2_0 = Y + (1-Z1)*d1 - Z2*d2,
    Y_Z1_0_Z2_1 = Y - Z1*d1 + (1-Z2)*d2,
    Y_Z1_1_Z2_1 = Y + (1-Z1)*d1 + (1-Z2)*d2)
}
```
Imputed potential outcomes look like this:
| ID | Z1 | Z2 | Y | Y_Z1_0_Z2_0 | Y_Z1_1_Z2_0 | Y_Z1_0_Z2_1 | Y_Z1_1_Z2_1 |
|---|---|---|---|---|---|---|---|
| 0001 | 0 | 0 | -0.18 | -0.18 | 0.76 | 0.68 | 1.61 |
| 0002 | 1 | 0 | 0.20 | -0.73 | 0.20 | 0.12 | 1.06 |
| 0003 | 0 | 0 | 2.56 | 2.56 | 3.50 | 3.42 | 4.36 |
| 0004 | 1 | 0 | -0.27 | -1.21 | -0.27 | -0.35 | 0.59 |
| 0005 | 0 | 1 | -2.13 | -2.99 | -2.05 | -2.13 | -1.19 |
| 0006 | 1 | 1 | 3.52 | 1.72 | 2.66 | 2.58 | 3.52 |
ri interactions with DeclareDesign

| Design | Estimator | Outcome | Term | N Sims | One Sided Pos | One Sided Neg | Two Sided |
|---|---|---|---|---|---|---|---|
| design | estimator | Y | Z1:Z2 | 1000 | 0.95 | 0.05 | 0.10 |
| | | | | | (0.01) | (0.01) | (0.01) |
ri in practice
ri Applications

| | | Capacity | T1 | T2 | T3 |
|---|---|---|---|---|---|
| Session | Thursday | 40 | 10 | 30 | 0 |
| | Friday | 40 | 10 | 0 | 30 |
| | Saturday | 10 | 10 | 0 | 0 |
Optimal assignment to treatment given constraints due to facilities
| Subject type | N | Available |
|---|---|---|
| A | 30 | Thurs, Fri |
| B | 30 | Thurs, Sat |
| C | 30 | Fri, Sat |
Constraints due to subjects
ri Applications
If you think hard about assignment you might come up with an allocation like this.

| | | | Allocations | | |
|---|---|---|---|---|---|
| Subject type | N | Available | Thurs | Fri | Sat |
| A | 30 | Thurs, Fri | 15 | 15 | NA |
| B | 30 | Thurs, Sat | 25 | NA | 5 |
| C | 30 | Fri, Sat | NA | 25 | 5 |

Assignment of people to days
ri Applications
That allocation balances as much as possible. Given the allocation you might randomly assign individuals to different days, as well as randomly assigning them to treatments within days. If you then figure out assignment propensities, this is what you get:

| | | | Assignment probabilities | | |
|---|---|---|---|---|---|
| Subject type | N | Available | T1 | T2 | T3 |
| A | 30 | Thurs, Fri | 0.250 | 0.375 | 0.375 |
| B | 30 | Thurs, Sat | 0.375 | 0.625 | 0.000 |
| C | 30 | Fri, Sat | 0.375 | NA | 0.625 |
ri Applications
Even under the assumption that the day of measurement does not matter, these assignment probabilities have big implications for analysis.

| | | | Assignment probabilities | | |
|---|---|---|---|---|---|
| Subject type | N | Available | T1 | T2 | T3 |
| A | 30 | Thurs, Fri | 0.250 | 0.375 | 0.375 |
| B | 30 | Thurs, Sat | 0.375 | 0.625 | 0.000 |
| C | 30 | Fri, Sat | 0.375 | NA | 0.625 |
Only the type \(A\) subjects could have received any of the three treatments.
There are no two treatments for which outcomes can be compared for both subpopulations \(B\) and \(C\).
A comparison of \(T1\) versus \(T2\) can only be made for population \(A \cup B\).
However, subpopulation \(A\) is assigned to \(T1\) (versus \(T2\)) with probability \(0.25/0.625 = 2/5\), while population \(B\) is assigned with probability \(0.375/1 = 3/8\).
ri Applications
Implications for design: you need to uncluster treatment delivery.
Implications for analysis: you need to take account of propensities.
Idea: wacky assignments happen, but if you know the propensities you can do the analysis.
ri Applications: Indirect assignments
A particularly interesting application is where a random assignment combines with existing features to determine assignment to an “indirect” treatment.
Report the analysis that is implied by the design.
| | | T2 | | | |
|---|---|---|---|---|---|
| | | N | Y | All | Diff |
| T1 | N | \(\overline{y}_{00}\) | \(\overline{y}_{01}\) | \(\overline{y}_{0x}\) | \(d_2 \mid T1{=}0\) |
| | | (sd) | (sd) | (sd) | (sd) |
| | Y | \(\overline{y}_{10}\) | \(\overline{y}_{11}\) | \(\overline{y}_{1x}\) | \(d_2 \mid T1{=}1\) |
| | | (sd) | (sd) | (sd) | (sd) |
| | All | \(\overline{y}_{x0}\) | \(\overline{y}_{x1}\) | \(\overline{y}\) | \(d_2\) |
| | | (sd) | (sd) | (sd) | (sd) |
| | Diff | \(d_1 \mid T2{=}0\) | \(d_1 \mid T2{=}1\) | \(d_1\) | \(d_1d_2\) |
| | | (sd) | (sd) | (sd) | (sd) |
This is instantly recognizable from the design and returns all the benefits of the factorial design including all main effects, conditional causal effects, interactions and summary outcomes. It is much clearer and more informative than a regression table.