Design
Focus on randomization schemes
DeclareDesign
“Experiments are investigations in which an intervention, in all its essential elements, is under the control of the investigator.” (Cox & Reid)
Two major types of control:
In general you might want to set things up so that your randomization is replicable. You can do this by setting a seed:
and again:
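A minimal base-R sketch of seeded, replicable randomization (the same logic applies to randomizr's complete_ra; the seed value 343 is arbitrary):

```r
# Assign 5 of 10 units to treatment, reproducibly
set.seed(343)                     # any fixed seed
Z1 <- sample(rep(0:1, each = 5))  # first draw
set.seed(343)                     # reset to the same seed
Z2 <- sample(rep(0:1, each = 5))  # "and again": identical draw
identical(Z1, Z2)                 # TRUE
```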
Even better is to set it up so that it can reproduce lots of possible draws so that you can check the propensities for each unit.
[1] 0.519 0.496 0.510 0.491 0.524 0.514 0.535 0.497 0.470 0.506
Here the \(P\) matrix gives 1000 possible ways of allocating 5 of 10 units to treatment. We can then confirm that the average propensity is 0.5.
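A sketch of how such a \(P\) matrix might be built in base R (one candidate assignment per column):

```r
set.seed(343)
# 1000 candidate assignments of 5 of 10 units to treatment, one per column
P <- replicate(1000, sample(rep(0:1, each = 5)))
dim(P)                 # 10 x 1000
round(rowMeans(P), 3)  # per-unit propensities, all close to 0.5
```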
A survey dictionary with results from a complex randomization presented in a simple way for enumerators
People often wonder: did randomization work? Common practice is to implement a set of \(t\)-tests to see if there is balance. This makes no sense.
If you doubt whether it was implemented properly, do an \(F\) test. If you worry about variance, specify controls in advance as a function of their relation with outcomes (more on this later). If you worry about conditional bias, then look at substantive differences between groups, not \(t\)-tests.
If you want realizations to have particular properties: build it into the scheme in advance.
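One common way to build properties in is restricted randomization: redraw assignments, keeping only those with the desired property. A sketch under the assumption that we want near-balance on a covariate X (the 0.2 threshold is illustrative):

```r
set.seed(2)
X <- rnorm(10)  # a covariate we want balanced across arms
repeat {
  Z <- sample(rep(0:1, each = 5))
  # keep only draws where treatment and control are nearly balanced on X
  if (abs(mean(X[Z == 1]) - mean(X[Z == 0])) < 0.2) break
}
abs(mean(X[Z == 1]) - mean(X[Z == 0]))  # < 0.2 by construction
```

Note that restricting the set of allowed draws changes assignment probabilities, so inference should be conducted over the restricted set (e.g. via randomization inference).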
Note: clusters are part of your design, not part of the world.
Often used if intervention has to function at the cluster level or if outcome defined at the cluster level.
Disadvantage: loss of statistical power
However: perfectly possible to assign some treatments at cluster level and then other treatments at the individual level
Principle: (unless you are worried about spillovers) generally make clusters as small as possible
Principle: Surprisingly, variability in cluster size makes analysis harder. Try to control assignment so that cluster sizes are similar in treatment and in control.
Be clear about whether you believe effects are operating at the cluster level or at the individual level. This matters for power calculations.
Be clear about whether spillover effects operate only within clusters or also across them. If within only you might be able to interpret treatment as the effect of being in a treated cluster…
Surprisingly, if clusters are of different sizes the difference in means estimator is not unbiased, even if all units are assigned to treatment with the same probability.
Here’s the intuition. Say there are two clusters, each with homogeneous treatment effects:
| Cluster | Size | Y0 | Y1 |
|---|---|---|---|
| 1 | 1000000 | 0 | 1 |
| 2 | 1 | 0 | 0 |
Then: What is the true average treatment effect? What do you expect to estimate from cluster random assignment?
The solution is to block by cluster size. For more see: http://gking.harvard.edu/files/cluster.pdf
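A scaled-down numerical check of this example (cluster 1 has 1,000 rather than 1,000,000 units, purely for speed):

```r
# Cluster 1: 1000 units with Y0 = 0, Y1 = 1; cluster 2: one unit with Y0 = Y1 = 0
Y0 <- rep(0, 1001)
Y1 <- c(rep(1, 1000), 0)
cluster <- c(rep(1, 1000), 2)

mean(Y1 - Y0)  # true ATE, about 0.999

# One cluster treated, one control: only two possible assignments
est_1 <- mean(Y1[cluster == 1]) - mean(Y0[cluster == 2])  # cluster 1 treated: 1
est_2 <- mean(Y1[cluster == 2]) - mean(Y0[cluster == 1])  # cluster 2 treated: 0
mean(c(est_1, est_2))  # expected estimate: 0.5, far from the true ATE
```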
There are more or less efficient ways to randomize.
Consider a case with four units and two strata. There are 6 possible assignments of 2 units to treatment:
| ID | X | Y(0) | Y(1) | R1 | R2 | R3 | R4 | R5 | R6 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 3 | 2 | 1 | 2 | 0 | 1 | 0 | 1 | 0 | 1 |
| 4 | 2 | 1 | 2 | 0 | 0 | 1 | 0 | 1 | 1 |
| – | – | – | – | – | – | – | – | – | – |
| \(\widehat{\tau}\): | – | – | – | 0 | 1 | 1 | 1 | 1 | 2 |
Even with a constant treatment effect and everything uniform within blocks, there is variance in the estimation of \(\widehat{\tau}\). This can be eliminated by excluding R1 and R6.
Simple blocking in R (5 pairs):
| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 |
| 0 | 0 | 1 | 0 | 0 |
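One way to produce an assignment like this in base R (randomizr::block_ra(blocks = rep(1:5, each = 2)) is the more standard route):

```r
set.seed(1)
# Exactly one treated unit in each of 5 pairs
Z <- as.vector(replicate(5, sample(0:1)))
matrix(Z, nrow = 2)  # one column per pair; each column sums to 1
```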
DeclareDesign can help with this. Load up:
| \(T2=0\) | \(T2=1\) | |
|---|---|---|
| T1 = 0 | \(50\%\) | \(0\%\) |
| T1 = 1 | \(50\%\) | \(0\%\) |
Spread out:
| \(T2=0\) | \(T2=1\) | |
|---|---|---|
| T1 = 0 | \(25\%\) | \(25\%\) |
| T1 = 1 | \(25\%\) | \(25\%\) |
Three arm it?:
| \(T2=0\) | \(T2=1\) | |
|---|---|---|
| T1 = 0 | \(33.3\%\) | \(33.3\%\) |
| T1 = 1 | \(33.3\%\) | \(0\%\) |
Bunch it?:
| \(T2=0\) | \(T2=1\) | |
|---|---|---|
| T1 = 0 | \(40\%\) | \(20\%\) |
| T1 = 1 | \(20\%\) | \(20\%\) |
This speaks to “spreading out.” Note: the “bunching” example may not pay off and has the undesirable feature of introducing a correlation between treatment assignments.
Two ways to do factorial assignments in DeclareDesign:
In practice if you have a lot of treatments it can be hard to do full factorial designs – there may be too many combinations.
In such cases people use fractional factorial designs, like the one below (5 treatments but only 8 units!)
| Variation | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 1 |
| 4 | 0 | 1 | 1 | 1 | 0 |
| 5 | 1 | 0 | 0 | 1 | 0 |
| 6 | 1 | 0 | 1 | 0 | 1 |
| 7 | 1 | 1 | 0 | 0 | 0 |
| 8 | 1 | 1 | 1 | 1 | 1 |
Then randomly assign units to rows. Note columns might also be blocking covariates.
In R, look at library(survey)
Muralidharan, Romero, and Wüthrich (2023) write:
Factorial designs are widely used to study multiple treatments in one experiment. While t-tests using a fully-saturated “long” model provide valid inferences, “short” model t-tests (that ignore interactions) yield higher power if interactions are zero, but incorrect inferences otherwise. Of 27 factorial experiments published in top-5 journals (2007–2017), 19 use the short model. After including interactions, over half of their results lose significance. […]
Anything to be done on randomization to address external validity concerns?
DeclareDesign: a design with hierarchical data and different assignment schemes.
design <-
declare_model(
school = add_level(N = 16,
u_school = rnorm(N, mean = 0)),
classroom = add_level(N = 4,
u_classroom = rnorm(N, mean = 0)),
student = add_level(N = 20,
u_student = rnorm(N, mean = 0))
) +
declare_model(
potential_outcomes(Y ~ .1*Z + u_classroom + u_student + u_school)
) +
declare_assignment(Z = simple_ra(N)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, .method = difference_in_means)

Here are the first couple of rows and columns of the resulting data frame.
| school | u_school | classroom | u_classroom | student | u_student | Y_Z_0 | Y_Z_1 | Z | Y |
|---|---|---|---|---|---|---|---|---|---|
| 01 | 1.35 | 01 | 1.26 | 0001 | -1.28 | 1.33 | 1.43 | 0 | 1.33 |
| 01 | 1.35 | 01 | 1.26 | 0002 | 0.79 | 3.40 | 3.50 | 1 | 3.50 |
| 01 | 1.35 | 01 | 1.26 | 0003 | -0.12 | 2.49 | 2.59 | 0 | 2.49 |
| 01 | 1.35 | 01 | 1.26 | 0004 | -0.65 | 1.96 | 2.06 | 1 | 2.06 |
| 01 | 1.35 | 01 | 1.26 | 0005 | 0.36 | 2.97 | 3.07 | 1 | 3.07 |
| 01 | 1.35 | 01 | 1.26 | 0006 | -0.96 | 1.65 | 1.75 | 0 | 1.65 |
Here is the distribution between treatment and control:
We can draw a new set of data and look at the number of subjects in the treatment and control groups.
But what if all students in a given class have to be assigned the same treatment?
assignment_clustered <-
declare_assignment(Z = cluster_ra(clusters = classroom))
estimator_clustered <-
declare_estimator(Y ~ Z, clusters = classroom,
.method = difference_in_means)
design_clustered <-
design |>
replace_step("assignment", assignment_clustered) |>
  replace_step("estimator", estimator_clustered)

assignment_clustered_blocked <-
declare_assignment(Z = block_and_cluster_ra(blocks = school,
clusters = classroom))
estimator_clustered_blocked <-
declare_estimator(Y ~ Z, blocks = school, clusters = classroom,
.method = difference_in_means)
design_clustered_blocked <-
design |>
replace_step("assignment", assignment_clustered_blocked) |>
  replace_step("estimator", estimator_clustered_blocked)

| Design | Power | Coverage |
|---|---|---|
| simple | 0.16 | 0.95 |
| (0.01) | (0.01) | |
| complete | 0.20 | 0.96 |
| (0.01) | (0.01) | |
| blocked | 0.42 | 0.95 |
| (0.01) | (0.01) | |
| clustered | 0.06 | 0.96 |
| (0.01) | (0.01) | |
| clustered_blocked | 0.08 | 0.96 |
| (0.01) | (0.01) |
In many designs you seek to assign an integer number of subjects to treatment from some set.
Sometimes however your assignment targets are not integers.
Example:
Two strategies:
Can also be used to set targets
| ID | size | n_treated |
|---|---|---|
| 1 | 47 | 23 |
| 2 | 53 | 27 |
| 3 | 87 | 43 |
| 4 | 25 | 13 |
| Total | 212 | 106 |
Can also be used for complete assignment with heterogeneous propensities
[1] 0.5
Indirect control
Indirect assignments are generally generated by applying a direct assignment and then figuring out the implied indirect assignment
Looks better: but there are trade-offs between the direct and indirect distributions
Figuring out the optimal procedure requires full diagnosis
A focus on power
In the classical approach to testing a hypothesis we ask:
How likely are we to see data like this if indeed the hypothesis is true?
How unlikely is “not very likely”?
When we test a hypothesis we decide first on what sort of evidence we need to see in order to decide that the hypothesis is not reliable.
Othello has a hypothesis that Desdemona is innocent.
Iago confronts him with evidence:
Note that Othello is focused on the probability of the events if she were innocent but not the probability of the events if Iago were trying to trick him.
He is not assessing his belief in whether she is faithful, but rather how likely the data would be if she were faithful.
So:
Illustrating \(p\) values via “randomization inference”
Say you randomized assignment to treatment and your data looked like this.
| Unit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Treatment | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Health score | 4 | 2 | 3 | 1 | 2 | 3 | 4 | 8 | 7 | 6 |
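The randomization inference logic can be sketched directly: re-assign the single treatment to each unit in turn and ask how extreme the observed difference in means is among all possible assignments.

```r
Y     <- c(4, 2, 3, 1, 2, 3, 4, 8, 7, 6)
Z_obs <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0)

diff_in_means <- function(Z) mean(Y[Z == 1]) - mean(Y[Z == 0])
observed <- diff_in_means(Z_obs)  # 8 - mean(others) = 4.44

# All 10 assignments that treat exactly one unit (rows of the identity matrix)
null_dist <- apply(diag(10), 1, diff_in_means)
mean(null_dist >= observed)       # one-sided p = 0.1
```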
Then:
Power is just the probability of getting a significant result rejecting a hypothesis.
Simple enough but it presupposes:
I want to test the hypothesis that a six never comes up on this die.
Here’s my test:
What is the power of this test?
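The test itself is not spelled out above; for illustration, suppose it is “roll the die \(n\) times and reject if any six appears.” Against a die that is in fact fair, the power is \(1 - (5/6)^n\):

```r
# Power of "reject if any six appears in n rolls" when the die is in fact fair
n <- c(1, 10, 100)
round(1 - (5/6)^n, 3)  # 0.167 0.838 1.000
```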
Power sometimes seems more complicated because hypothesis rejection involves a calculated probability and so you need the probability of a probability.
I want to test the hypothesis that this die is fair.
Here’s my test:
Now:
For this we need to figure a rule for rejection. This is based on identifying events that should be unlikely under the hypothesis.
Here is how many 6’s I would expect if the die is fair:
I can figure out from this that 143 or fewer is really very few and 190 or more is really very many:
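Assuming 1000 rolls (consistent with these cutoffs), the tail probabilities under a fair die can be checked with pbinom:

```r
# Two-sided rejection region for 1000 rolls of a fair die
pbinom(143, 1000, 1/6)      # P(143 or fewer sixes): roughly 0.025
1 - pbinom(189, 1000, 1/6)  # P(190 or more sixes): roughly 0.026
```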
Now we need to stipulate some belief about how the world really works—this is not the null hypothesis that we plan to reject, but something that we actually take to be true.
For instance: we think that in fact sixes appear 20% of the time.
Now what’s the probability of seeing at least 190 sixes?
So given I think 6s appear 20% of the time, I think it likely I’ll see at least 190 sixes and reject the hypothesis of a fair die.
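In R this probability is one line (again assuming 1000 rolls):

```r
# Probability of 190 or more sixes if sixes in fact come up 20% of the time
1 - pbinom(189, 1000, 0.2)  # about 0.8
```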
Simplest intuition on power:
What is the probability of getting a significant estimate given the sampling distribution is centered on \(b\) and the standard error is 1?
Add these together: probability of getting an estimate above 1.96 or below -1.96.
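A sketch of exactly this calculation, with the estimate distributed \(N(b, 1)\):

```r
# Power of a two-sided z test when the estimate ~ N(b, 1):
# sum the probabilities of landing above 1.96 and below -1.96
power_z <- function(b) pnorm(-1.96 - b) + (1 - pnorm(1.96 - b))

power_z(0)     # 0.05: the size of the test under the null
power_z(1.96)  # about 0.50: just-significant in expectation
```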
This is essentially what is done by pwrss::power.z.test – and it produces nice graphs!
Imagine the effect was actually 1.96 what would power be?:
Substantively: if in expectation an estimate will be just significant, then your power is 50%
Imagine the effect was actually 2.5 what would power be? Is this a lot?:
[1] 0.5000586
Intuition:
Of course the standard error will depend on the number of units and the variance of outcomes in treatment and control.
Say \(N\) subjects are divided into two groups and potential outcomes have standard deviation \(\sigma\) in treatment and control. Then the variance of the treatment effect estimator is (approximately / conservatively):
\[Var(\tau)=\frac{\sigma^2}{N/2} + \frac{\sigma^2}{N/2} = 4\frac{\sigma^2}{N}\]
and so the (conservative / approx) standard error is:
\[\sigma_\tau=\frac{2\sigma}{\sqrt{N}}\]
Note here we seem to be using the actual standard error but of course the tests we actually run will use an estimate of the standard error…
This can be done e.g. with pwrss like this:
+--------------------------------------------------+
| POWER CALCULATION |
+--------------------------------------------------+
Welch's T-Test (Independent Samples)
---------------------------------------------------
Hypotheses
---------------------------------------------------
H0 (Null Claim) : d - null.d = 0
H1 (Alt. Claim) : d - null.d != 0
---------------------------------------------------
Results
---------------------------------------------------
Sample Size = 50 and 50
Type 1 Error (alpha) = 0.050
Type 2 Error (beta) = 0.921
Statistical Power = 0.079 <<
[1] 0.7010827
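A close normal-approximation analogue can be computed by hand from the conservative standard error formula above (illustrative numbers: \(\tau = 0.5\), \(\sigma = 1\), \(N = 100\); the exact \(t\)-based answer differs slightly):

```r
# Approximate power for a two-arm trial with se = 2*sigma/sqrt(N)
power_two_arm <- function(tau, sigma, N) {
  se <- 2 * sigma / sqrt(N)
  pnorm(-1.96 - tau / se) + (1 - pnorm(1.96 - tau / se))
}
power_two_arm(tau = 0.5, sigma = 1, N = 100)  # about 0.7
```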
Mostly involve figuring out the standard error.
Consider a cluster randomized trial, with each unit having a cluster level shock \(\epsilon_k\) and an individual shock \(\nu_i\). Say these have variances \(\sigma^2_k\), \(\sigma^2_i\).
The standard error is:
\[\sqrt{\frac{4\sigma^2_k}{K} + \frac{4\sigma^2_i}{nK}}\]
Define \(\rho = \frac{\sigma^2_k}{\sigma^2_k + \sigma^2_i}\)
\[\sqrt{\rho \frac{4\sigma^2}{K} + (1- \rho)\frac{4\sigma^2}{nK}}\]
\[\sqrt{((n - 1)\rho + 1)\frac{4\sigma^2}{nK}}\]
where \(\sigma^2 = \sigma^2_k + \sigma^2_i\)
Plug in these standard errors and proceed as before
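For instance, plugging the clustered standard error into the same normal-approximation power formula (illustrative numbers: \(K = 10\) clusters of \(n = 100\), \(\rho = 0.5\), \(\tau = 0.5\), \(\sigma = 1\)):

```r
# Power for a cluster-randomized design: K clusters of size n, ICC rho
power_clustered <- function(tau, sigma, K, n, rho) {
  se <- sqrt(((n - 1) * rho + 1) * 4 * sigma^2 / (n * K))
  pnorm(-1.96 - tau / se) + (1 - pnorm(1.96 - tau / se))
}
power_clustered(tau = 0.5, sigma = 1, K = 10, n = 100, rho = 0.5)  # about 0.2
```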
Is arbitrarily flexible
| sim_ID | estimate | p.value |
|---|---|---|
| 1 | 0.24 | 0.22 |
| 2 | 0.80 | 0.00 |
| 3 | 0.25 | 0.16 |
| 4 | 0.27 | 0.18 |
| 5 | 0.87 | 0.00 |
| 6 | 0.57 | 0.00 |
Obviously related to the estimates you might get
A valid \(p\)-value satisfies \(\Pr(p≤x)≤x\) for every \(x \in[0,1]\) (under the null)
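A quick simulation check of this validity property for the standard \(t\)-test under a true null:

```r
set.seed(1)
# Under a true null, Pr(p <= 0.05) should be no more than 0.05
p_vals <- replicate(2000, t.test(rnorm(30), rnorm(30))$p.value)
mean(p_vals <= 0.05)  # close to 0.05
```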
| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| 0.50 | -0.00 | 0.20 | 0.20 | 0.69 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
| b | Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|---|
| 0 | 0.00 | 0.00 | 0.20 | 0.20 | 0.05 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
| 0.25 | 0.25 | 0.00 | 0.20 | 0.20 | 0.24 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
| 0.5 | 0.50 | -0.00 | 0.20 | 0.20 | 0.69 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
| 1 | 1.00 | -0.00 | 0.20 | 0.20 | 1.00 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
coming up:
We often focus on sample sizes
But
Power also depends on
Say we have access to a “pre” measure of outcome Y_now; call it Y_base. Y_base is informative about potential outcomes. We are considering using Y_now - Y_base as the outcome instead of Y_now.
N <- 100
rho <- .5
design <-
declare_model(N,
Y_base = rnorm(N),
Y_Z_0 = 1 + correlate(rnorm, given = Y_base, rho = rho),
Y_Z_1 = correlate(rnorm, given = Y_base, rho = rho),
Z = complete_ra(N),
Y_now = Z*Y_Z_1 + (1-Z)*Y_Z_0,
Y_change = Y_now - Y_base) +
declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
declare_estimator(Y_now ~ Z, label = "level") +
declare_estimator(Y_change ~ Z, label = "change")+
  declare_estimator(Y_now ~ Z + Y_base, label = "RHS")

Punchline:
You can see from the null design that power is great but bias is terrible and coverage is way off.
| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| 1.59 | 1.59 | 0.12 | 1.60 | 1.00 | 0.00 |
| (0.01) | (0.01) | (0.00) | (0.01) | (0.00) | (0.00) |
Power without unbiasedness corrupts, absolutely
another_bad_design <-
declare_model(
N = 100,
female = rep(0:1, N/2),
U = rnorm(N),
potential_outcomes(Y ~ female * Z + U)) +
declare_assignment(
Z = block_ra(blocks = female, block_prob = c(.1, .5)),
Y = reveal_outcomes(Y ~ Z)) +
declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
declare_estimator(Y ~ Z + female, inquiry = "ate",
                    .method = lm_robust)

You can see from the null design that power is great but bias is terrible and coverage is way off.
| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| 0.76 | 0.26 | 0.24 | 0.35 | 0.84 | 0.85 |
| (0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.02) |
clustered_design <-
declare_model(
cluster = add_level(N = 10, cluster_shock = rnorm(N)),
individual = add_level(
N = 100,
Y_Z_0 = rnorm(N) + cluster_shock,
Y_Z_1 = rnorm(N) + cluster_shock)) +
declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
declare_assignment(Z = cluster_ra(clusters = cluster)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")

| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| -0.00 | -0.00 | 0.64 | 0.64 | 0.79 | 0.20 |
| (0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.01) |
What alerts you to a problem?
| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| 0.00 | -0.00 | 0.66 | 0.65 | 0.06 | 0.94 |
| (0.02) | (0.02) | (0.01) | (0.01) | (0.01) | (0.01) |
design_uncertain <-
declare_model(N = 1000, b = 1+rnorm(1), Y_Z_1 = rnorm(N), Y_Z_2 = rnorm(N) + b, Y_Z_3 = rnorm(N) + b) +
declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_inquiry(ate = mean(b)) +
declare_estimator(Y ~ factor(Z), term = TRUE)
draw_estimands(design_uncertain)

  inquiry   estimand
1     ate -0.3967765

  inquiry  estimand
1     ate 0.7887188
Say I run two tests and want to correct for multiple comparisons.
Two approaches. First, by hand:
b = .2
design_mc <-
declare_model(N = 1000, Y_Z_1 = rnorm(N), Y_Z_2 = rnorm(N) + b, Y_Z_3 = rnorm(N) + b) +
declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_inquiry(ate = b) +
  declare_estimator(Y ~ factor(Z), term = TRUE)

design_mc |>
simulate_designs(sims = 1000) |>
filter(term != "(Intercept)") |>
group_by(sim_ID) |>
mutate(p_bonferroni = p.adjust(p = p.value, method = "bonferroni"),
p_holm = p.adjust(p = p.value, method = "holm"),
p_fdr = p.adjust(p = p.value, method = "fdr")) |>
ungroup() |>
summarize(
"Power using naive p-values" = mean(p.value <= 0.05),
"Power using Bonferroni correction" = mean(p_bonferroni <= 0.05),
"Power using Holm correction" = mean(p_holm <= 0.05),
"Power using FDR correction" = mean(p_fdr <= 0.05)
  )

| Power using naive p-values | Power using Bonferroni correction | Power using Holm correction | Power using FDR correction |
|---|---|---|---|
| 0.7374 | 0.6318 | 0.6886 | 0.7032 |
The alternative approach (generally better!) is to design with a custom estimator that includes your corrections.
my_estimator <- function(data)
lm_robust(Y ~ factor(Z), data = data) |>
tidy() |>
filter(term != "(Intercept)") |>
mutate(p.naive = p.value,
p.value = p.adjust(p = p.naive, method = "bonferroni"))
design_mc_2 <- design_mc |>
replace_step(5, declare_estimator(handler = label_estimator(my_estimator)))
run_design(design_mc_2) |>
  select(term, estimate, p.value, p.naive) |>
  kable()

| term | estimate | p.value | p.naive |
|---|---|---|---|
| factor(Z)2 | 0.1182516 | 0.2502156 | 0.1251078 |
| factor(Z)3 | 0.1057031 | 0.3337476 | 0.1668738 |
Let’s try the same thing for a null model (using redesign(design_mc_2, b = 0))
…and power:
| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| 0.00 | 0.00 | 0.08 | 0.08 | 0.02 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.01) |
| -0.00 | -0.00 | 0.08 | 0.08 | 0.02 | 0.96 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.01) |
bothered?