Experiments

Design

Macartan Humphreys

1 Design

Focus on randomization schemes

1.1 Aims and practice

  1. Goals
  2. Cluster randomization
  3. Blocked randomization
  4. Factorial designs
  5. External validity
  6. Assignments with DeclareDesign
  7. Indirect assignments

1.1.1 Experiments

  • Experiments are investigations in which an intervention, in all its essential elements, is under the control of the investigator. (Cox & Reid)

  • Two major types of control:

  1. control over assignment to treatment – this is at the heart of many field experiments
  2. control over the treatment itself – this is at the heart of many lab experiments
  • Main focus today is on 1 and on the question: how does control over assignment to treatment allow you to make reasonable statements about causal effects?

1.1.2 Experiments

1.1.3 Basic randomization

  • Basic randomization is very simple. For example, say you want to assign each of 10 units to treatment with probability one half, independently. Here is simple code:
sample(0:1, 10, replace = TRUE)
 [1] 0 1 0 1 1 1 1 1 1 0
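If instead you want to assign exactly 5 of 10 units to treatment (complete random assignment), a minimal base R sketch:

```r
# Complete random assignment: permute a vector with exactly five 1s and five 0s
Z <- sample(rep(0:1, each = 5))
Z
sum(Z)  # always 5
```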

1.1.4 …should be replicable

In general you might want to set things up so that your randomization is replicable. You can do this by setting a seed:

set.seed(20111112)
sample(0:1, 10, replace = TRUE)
 [1] 1 0 1 1 1 0 1 1 1 1

and again:

set.seed(20111112)
sample(0:1, 10, replace = TRUE)
 [1] 1 0 1 1 1 0 1 1 1 1

1.1.5 Basic randomization

Even better is to set it up so that it can reproduce lots of possible draws so that you can check the propensities for each unit.

set.seed(20111112)
P <- replicate(1000, sample(0:1, 10, replace = TRUE)) 
apply(P, 1, mean)
 [1] 0.519 0.496 0.510 0.491 0.524 0.514 0.535 0.497 0.470 0.506

Here the \(P\) matrix gives 1000 possible assignments of the 10 units to treatment and control. We can then confirm that each unit’s average propensity is roughly 0.5.

  • A huge advantage of this approach is that if you make a mess of the random assignment, you can still generate the P matrix and use that for all analyses!

1.1.6 Do it in advance

  • Unless you need them to keep subjects at ease, leave your spinners and your dice and your cards behind.
  • Especially when you have multiple or complex randomizations you are generally much better doing it with a computer in advance

A survey dictionary with results from a complex randomization presented in a simple way for enumerators

1.1.7 Did the randomization “work”?

  • People often wonder: did randomization work? Common practice is to implement a set of \(t\)-tests to see if there is balance. This makes no sense.

  • If you doubt whether it was implemented properly, do an \(F\) test. If you worry about variance, specify controls in advance as a function of their relation with outcomes (more on this later). If you worry about conditional bias, then look at substantive differences between groups, not \(t\)-tests

  • If you want realizations to have particular properties: build it into the scheme in advance.

1.2 Cluster Randomization

1.2.1 Cluster Randomization

  • Simply place units into groups (clusters) and then randomly assign the groups to treatment and control.
  • All units in a given group get the same treatment

Note: clusters are part of your design, not part of the world.
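A minimal base R sketch, with made-up numbers (30 units in 6 clusters of 5):

```r
# Cluster random assignment: randomize at the cluster level, then map to units
clusters <- rep(1:6, each = 5)           # 30 units in 6 clusters
Z_cluster <- sample(rep(0:1, each = 3))  # treat 3 of the 6 clusters
Z <- Z_cluster[clusters]                 # each unit inherits its cluster's assignment
tapply(Z, clusters, unique)              # every cluster is all-treatment or all-control
```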

1.2.2 Cluster Randomization

  • Often used if intervention has to function at the cluster level or if outcome defined at the cluster level.

  • Disadvantage: loss of statistical power

  • However: perfectly possible to assign some treatments at cluster level and then other treatments at the individual level

  • Principle: (unless you are worried about spillovers) generally make clusters as small as possible

  • Principle: Surprisingly, variability in cluster size makes analysis harder. Try to control assignment so that cluster sizes are similar in treatment and in control.

  • Be clear about whether you believe effects are operating at the cluster level or at the individual level. This matters for power calculations.

  • Be clear about whether spillover effects operate only within clusters or also across them. If within only you might be able to interpret treatment as the effect of being in a treated cluster…

1.2.3 Cluster Randomization: Block by cluster size

Surprisingly, if clusters are of different sizes the difference in means estimator is not unbiased, even if all units are assigned to treatment with the same probability.

Here’s the intuition. Say there are two clusters, each with homogeneous treatment effects:

Cluster  Size     Y0  Y1
1        1000000  0   1
2        1        0   0

Then: What is the true average treatment effect? What do you expect to estimate from cluster random assignment?
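The answer can be checked directly. This base R sketch, using the numbers from the table, enumerates both possible cluster assignments:

```r
# Cluster 1: 1,000,000 units with Y0 = 0, Y1 = 1; cluster 2: one unit with no effect
Y0 <- rep(0, 1000001)
Y1 <- c(rep(1, 1000000), 0)
cluster <- c(rep(1, 1000000), 2)

true_ate <- mean(Y1 - Y0)  # essentially 1

# One of the two clusters is treated, each with probability 1/2
est_1 <- mean(Y1[cluster == 1]) - mean(Y0[cluster == 2])  # cluster 1 treated: 1
est_2 <- mean(Y1[cluster == 2]) - mean(Y0[cluster == 1])  # cluster 2 treated: 0
mean(c(est_1, est_2))  # expected estimate: 0.5, far from the true ATE
```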

The solution is to block by cluster size. For more see: http://gking.harvard.edu/files/cluster.pdf

1.3 Blocked assignments and other restricted randomizations

1.3.1 Blocking

There are more or less efficient ways to randomize.

  • Randomization helps ensure good balance on all covariates (observed and unobserved) in expectation.
  • But balance may not be so great in realization
  • Blocking can help ensure balance ex post on observables

1.3.2 Blocking

Consider a case with four units and two strata. There are 6 possible assignments of 2 units to treatment:

ID  X  Y(0)  Y(1)  R1  R2  R3  R4  R5  R6
1   1  0     1     1   1   1   0   0   0
2   1  0     1     1   0   0   1   1   0
3   2  1     2     0   1   0   1   0   1
4   2  1     2     0   0   1   0   1   1
\(\widehat{\tau}\):      0   1   1   1   1   2

Even with a constant treatment effect and everything uniform within blocks, there is variance in the estimation of \(\widehat{\tau}\). This can be eliminated by excluding R1 and R6.
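A sketch that enumerates the six assignments from the table and confirms that restricting to within-block assignments eliminates the variance:

```r
Y0 <- c(0, 0, 1, 1)
Y1 <- c(1, 1, 2, 2)
assignments <- combn(4, 2)  # each column lists the two treated units

tau_hat <- apply(assignments, 2, function(treated) {
  Z <- 1:4 %in% treated
  mean(Y1[Z]) - mean(Y0[!Z])
})
tau_hat  # 0 1 1 1 1 2 under complete randomization

# Blocked: keep only assignments treating one unit per stratum ({1,2} and {3,4})
blocked <- apply(assignments, 2, function(treated) sum(treated %in% 1:2) == 1)
tau_hat[blocked]  # all equal to 1: no variance
```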

1.3.3 Blocking

Simple blocking in R (5 pairs):

sapply(1:5, function(i) sample(0:1))
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    0    1    1
[2,]    0    0    1    0    0

1.3.4 Of blocks and clusters

1.3.5 Blocking

  • Blocking is a case of restricted randomization. Although each unit is sampled with equal probability, the profiles of possible assignments are not.
  • You have to take account of this when doing analysis
  • There are many other approaches.
    • Matched Pairs are a particularly fine approach to blocking
    • You could also randomize and then replace the randomization if you do not like the balance. This sounds tricky (and it is) but it is OK as long as you understand the true lottery process you are employing and incorporate that into analysis
    • It is even possible to block on covariates for which you don’t have data ex ante, by using methods in which you allocate treatment over time as a function of features of your sample (also tricky)

1.3.6 Other types of restricted randomization

  • Really you can set whatever criterion you want for your set of treated units to satisfy (e.g. no treated unit beside another treated unit; at least 5 from the north and 10 from the south; guaranteed balance on some continuous variable; etc.)
  • You just have to be sure that you understand the random process that was used and that you can use it in the analysis stage
  • But here be dragons
    • The more complex your design, the more complex your analysis.
    • General injunction: “as ye randomize, so shall ye analyze” (http://www.ncbi.nlm.nih.gov/pubmed/15580598)
    • In general you should make sure that a given randomization procedure coupled with a given estimation procedure will produce an unbiased estimate. DeclareDesign can help with this.

1.3.7 Challenges with re-randomization

  • We can see that blocked and clustered assignments are actually types of restricted randomizations: they limit the set of acceptable randomizations to those with good properties
  • You could therefore implement the equivalent distribution of assignments by specifying an acceptable rule and then re-randomizing until the rule is met
  • That’s fine but you would then have to take account of clustering and blocking just as you do when you actually cluster or block

1.4 Factorial Designs

1.4.1 Factorial Designs

  • Often when you set up an experiment you want to look at more than one treatment.
  • Should you do this or not? How should you use your power?

1.4.2 Factorial Designs

  • Often when you set up an experiment you want to look at more than one treatment.
  • Should you do this or not? How should you use your power?

Load up:

\(T2=0\) \(T2=1\)
T1 = 0 \(50\%\) \(0\%\)
T1 = 1 \(50\%\) \(0\%\)

Spread out:

\(T2=0\) \(T2=1\)
T1 = 0 \(25\%\) \(25\%\)
T1 = 1 \(25\%\) \(25\%\)

1.4.3 Factorial Designs

  • Often when you set up an experiment you want to look at more than one treatment.
  • Should you do this or not? How should you use your power?

Three arm it?:

\(T2=0\) \(T2=1\)
T1 = 0 \(33.3\%\) \(33.3\%\)
T1 = 1 \(33.3\%\) \(0\%\)

Bunch it?:

\(T2=0\) \(T2=1\)
T1 = 0 \(40\%\) \(20\%\)
T1 = 1 \(20\%\) \(20\%\)

1.4.4 Factorial Designs

  • Surprisingly, adding multiple treatments does not generally eat into your power (unless you are decomposing a complex treatment – then it can. Why?)
  • Especially when you use a fully crossed design like the middle one above.
  • Fisher: “No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or, ideally, one question, at a time. The writer is convinced that this view is wholly mistaken.”
  • However – adding multiple treatments does alter the interpretation of your average treatment effects. If T2 is an unusual treatment for example, then half the T1 effect is measured for unusual situations.

This speaks to “spreading out.” Note: the “bunching” example may not pay off and has the undesirable feature of introducing a correlation between treatment assignments.

1.4.5 Factorial Designs

Two ways to do factorial assignments in DeclareDesign:

# Block the second assignment
declare_assignment(Z1 = complete_ra(N)) +
declare_assignment(Z2 = block_ra(blocks = Z1)) +
  
# Recode four arms  
declare_assignment(Z = complete_ra(N, num_arms = 4)) +
declare_measurement(Z1 = (Z == "T2" | Z == "T4"),
                      Z2 = (Z == "T3" | Z == "T4"))

1.4.6 Factorial Designs: In practice

  • In practice if you have a lot of treatments it can be hard to do full factorial designs – there may be too many combinations.

  • In such cases people use fractional factorial designs, like the one below (5 treatments but only 8 units!)

Variation  T1  T2  T3  T4  T5
1          0   0   0   1   1
2          0   0   1   0   0
3          0   1   0   0   1
4          0   1   1   1   0
5          1   0   0   1   0
6          1   0   1   0   1
7          1   1   0   0   0
8          1   1   1   1   1
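One way to construct a design like the one above in base R (the generators here are reverse-engineered from the table, so treat them as an illustration, not the canonical construction): take a full factorial in T1, T2, T3 and derive the remaining columns.

```r
# 2^(5-2) fractional factorial: full factorial in T1-T3 plus two generated columns
# (assumed generators: T4 = 1 if T2 == T3; T5 = 1 if T1 == T3)
ff <- expand.grid(T1 = 0:1, T2 = 0:1, T3 = 0:1)
ff$T4 <- as.integer(ff$T2 == ff$T3)
ff$T5 <- as.integer(ff$T1 == ff$T3)
colSums(ff)  # each treatment assigned to exactly 4 of the 8 variations
```

The payoff is that the five treatment columns are pairwise orthogonal despite using only 8 variations.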

1.4.7 Factorial Designs: In practice

  • Then randomly assign units to rows. Note columns might also be blocking covariates.

  • In R, look at library(survey)

1.4.8 Factorial Designs: In practice

  • But be careful: you have to be comfortable with possibly not having any simple counterfactual unit for any unit (invoke the sparsity-of-effects principle).

Unit  T1  T2  T3  T4  T5
1     0   0   0   1   1
2     0   0   1   0   0
3     0   1   0   0   1
4     0   1   1   1   0
5     1   0   0   1   0
6     1   0   1   0   1
7     1   1   0   0   0
8     1   1   1   1   1

  • In R, look at library(survey)

1.4.9 Controversy?

Muralidharan, Romero, and Wüthrich (2023) write:

Factorial designs are widely used to study multiple treatments in one experiment. While t-tests using a fully-saturated “long” model provide valid inferences, “short” model t-tests (that ignore interactions) yield higher power if interactions are zero, but incorrect inferences otherwise. Of 27 factorial experiments published in top-5 journals (2007–2017), 19 use the short model. After including interactions, over half of their results lose significance. […]

1.5 External Validity: Can randomization strategies help?

1.5.1 Principle: Address external validity at the design stage

Anything to be done on randomization to address external validity concerns?

  • Note 1: There is little or nothing about field experiments that makes the external validity problem greater for these than for other forms of “sample-based” research
  • Note 2: Studies that use up the available universe (cross national studies) actually have a distinct external validity problem
  • Two ways to think about external validity issues:
    1. Are things likely to operate in other units like they operate in these units? (even with the same intervention)
    2. Are the processes in operation in this treatment likely to operate in other treatments? (even in this population)

1.5.2 Principle: Address external validity at the design stage

  • Two ways to think about external validity issues:
    1. Are things likely to operate in other units like they operate in these units? (even with the same intervention)
    2. Are the processes in operation in this treatment likely to operate in other treatments? (even in this population)
  • Two approaches for 1.
    • Try to sample cases and estimate population average treatment effects
    • Exploit internal variation: block on features that make the case unusual and assess the importance of these (e.g. is the unit poor? assess how effects differ in poor and wealthy components)
  • 2 is harder and requires a sharp identification of context free primitives, if there are such things.

1.6 Assignments with DeclareDesign

1.6.1 A design: Multilevel data

A design with hierarchical data and different assignment schemes.

design <- 
  declare_model(
    school = add_level(N = 16, 
                       u_school = rnorm(N, mean = 0)),     
    classroom = add_level(N = 4,    
                  u_classroom = rnorm(N, mean = 0)),
    student =  add_level(N = 20,    
                         u_student = rnorm(N, mean = 0))
    ) +
  declare_model(
    potential_outcomes(Y ~ .1*Z + u_classroom + u_student + u_school)
    ) +
  declare_assignment(Z = simple_ra(N)) + 
  declare_measurement(Y = reveal_outcomes(Y ~ Z))  +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) + 
  declare_estimator(Y ~ Z, .method = difference_in_means)    

1.6.2 Sample data

Here are the first couple of rows and columns of the resulting data frame.

my_data <- draw_data(design)
kable(head(my_data), digits = 2)
school u_school classroom u_classroom student u_student Y_Z_0 Y_Z_1 Z Y
01 1.35 01 1.26 0001 -1.28 1.33 1.43 0 1.33
01 1.35 01 1.26 0002 0.79 3.40 3.50 1 3.50
01 1.35 01 1.26 0003 -0.12 2.49 2.59 0 2.49
01 1.35 01 1.26 0004 -0.65 1.96 2.06 1 2.06
01 1.35 01 1.26 0005 0.36 2.97 3.07 1 3.07
01 1.35 01 1.26 0006 -0.96 1.65 1.75 0 1.65

1.6.3 Sample data

Here is the distribution between treatment and control:

kable(t(as.matrix(table(my_data$Z))), 
      col.names = c("control", "treatment"))
control treatment
645 635

1.6.4 Complete Random Assignment using the built in function

assignment_complete <-   declare_assignment(Z = complete_ra(N))  

design_complete <- 
  replace_step(design, "assignment", assignment_complete)

1.6.5 Data from complete assignment

We can draw a new set of data and look at the number of subjects in the treatment and control groups.

set.seed(1:5)
data_complete <- draw_data(design_complete)

kable(t(as.matrix(table(data_complete$Z))))
0 1
640 640

1.6.6 Plotted

1.6.7 Block Random Assignment

  • The treatment and control group will in expectation contain the same share of students in different classrooms.
  • But as we saw this does not necessarily hold in realization
  • We make this more obvious by sorting the students by treatment status within schools

1.6.8 Blocked design

assignment_blocked <-   
  declare_assignment(Z = block_ra(blocks = classroom))  

estimator_blocked <- 
  declare_estimator(Y ~ Z, blocks = classroom, 
                    .method = difference_in_means)  

design_blocked <- 
  design |> 
  replace_step("assignment", assignment_blocked) |>
  replace_step("estimator", estimator_blocked)

1.6.9 Illustration of blocked assignment

  • Note that subjects are sorted here after the assignment to make it easier to see that in this case blocking ensures that exactly 5 students within each classroom are assigned to treatment.

1.6.10 Clustering

But what if all students in a given class have to be assigned the same treatment?

assignment_clustered <- 
  declare_assignment(Z = cluster_ra(clusters = classroom))  
estimator_clustered <- 
  declare_estimator(Y ~ Z, clusters = classroom, 
                    .method = difference_in_means)  


design_clustered <- 
  design |> 
  replace_step("assignment", assignment_clustered) |> 
  replace_step("estimator", estimator_clustered)

1.6.11 Illustration of clustered assignment

1.6.12 Clustered and Blocked

assignment_clustered_blocked <-   
  declare_assignment(Z = block_and_cluster_ra(blocks = school,
                                              clusters = classroom))  
estimator_clustered_blocked <- 
  declare_estimator(Y ~ Z, blocks = school, clusters = classroom, 
                    .method = difference_in_means)  


design_clustered_blocked <- 
  design |> 
  replace_step("assignment", assignment_clustered_blocked) |> 
  replace_step("estimator", estimator_clustered_blocked)

1.6.13 Illustration of clustered and blocked assignment

1.6.14 Illustration of efficiency gains from blocking

designs <- 
  list(
    simple = design, 
    complete = design_complete, 
    blocked = design_blocked, 
    clustered = design_clustered,  
    clustered_blocked = design_clustered_blocked) 
diagnoses <- diagnose_design(designs)

1.6.15 Illustration of efficiency gains from blocking

Design Power Coverage
simple 0.16 0.95
(0.01) (0.01)
complete 0.20 0.96
(0.01) (0.01)
blocked 0.42 0.95
(0.01) (0.01)
clustered 0.06 0.96
(0.01) (0.01)
clustered_blocked 0.08 0.96
(0.01) (0.01)

1.6.16 Sampling distributions

diagnoses$simulations_df |> 
  mutate(design = factor(design, c("blocked", "complete", "simple", "clustered_blocked", "clustered"))) |>
  ggplot(aes(estimate)) +
  geom_histogram() + facet_grid(~design)

1.6.17 Nasty integer issues

  • In many designs you seek to assign an integer number of subjects to treatment from some set.

  • Sometimes however your assignment targets are not integers.

Example:

  • I have 12 subjects in four blocks of 3 and I want to assign each subject to treatment with a 50% probability.

Two strategies:

  1. I randomly set a target of either 1 or 2 for each block and then do complete assignment in each block. This can result in the number treated varying from 4 to 8
  2. I randomly assign a target of 1 to two blocks and a target of 2 to the other two blocks. Intuition: set a floor for the minimal target and then distribute the residual probability across blocks
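A minimal base R sketch of the second strategy, floor targets plus a randomized residual:

```r
# 4 blocks of 3 units, target probability 0.5: floor target is 1 per block;
# the residual 0.5 per block is allocated by giving 2 of the 4 blocks one extra
block_size <- 3; p <- 0.5
base  <- floor(block_size * p)       # 1 per block
extra <- sample(rep(0:1, each = 2))  # exactly two blocks get one more
targets <- base + extra
targets       # e.g. 1 2 2 1 (order is random)
sum(targets)  # always 6 = 0.5 * 12
```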

1.6.18 Nasty integer issues

# remotes::install_github("macartan/probra")
library(probra)
set.seed(1)

blocks <- rep(1:4, each = 3)

table(blocks, prob_ra(blocks = blocks))
      
blocks 0 1
     1 1 2
     2 1 2
     3 2 1
     4 2 1
table(blocks, block_ra(blocks = blocks))
      
blocks 0 1
     1 1 2
     2 2 1
     3 1 2
     4 1 2

1.6.19 Nasty integer issues

Can also be used to set targets

# remotes::install_github("macartan/probra")
library(probra)
set.seed(1)

fabricate(N = 4,  size = c(47, 53, 87, 25), n_treated = prob_ra(.5*size)) %>%
  janitor::adorn_totals("row") |> 
  kable(caption = "Setting targets to get 50% targets with minimal variance")
Setting targets to get 50% targets with minimal variance
ID size n_treated
1 47 23
2 53 27
3 87 43
4 25 13
Total 212 106

1.6.20 Nasty integer issues

Can also be used to set for complete assignment with heterogeneous propensities

set.seed(1)

df <- fabricate(N = 100,  p = seq(.1, .9, length = 100), Z = prob_ra(p)) 

mean(df$Z)
[1] 0.5
df |> ggplot(aes(p, Z)) + geom_point() + theme_bw()

1.7 Indirect assignments

Indirect control

1.7.1 Indirect assignments

Indirect assignments are generally generated by applying a direct assignment and then figuring out an implied indirect assignment

set.seed(1)

df <-
  fabricate(
    N = 100, 
    latitude = runif(N),
    longitude = runif(N))

adjacency <- 
  sapply(1:nrow(df), function(i) 
    1*((df$latitude[i] - df$latitude)^2 + (df$longitude[i] - df$longitude)^2)^.5 < .1)

diag(adjacency) <- 0

1.7.2 Indirect assignments: Adjacency matrix

adjacency |>  
  reshape2::melt(c("x", "y"), value.name = "close") |> mutate(close = factor(close)) |>
  ggplot(aes(x=x,y=y,fill=close))+
  geom_tile() + xlab("individual") + ylab("individual") + theme_bw() +
  scale_fill_grey(start = 1, end = 0)  # 1 = white, 0 = black

1.7.3 Indirect assignments

n_assigned <- 50

design <-
  declare_model(data = df) + 
  declare_assignment(
    direct = complete_ra(N, m = n_assigned),
    indirect = 1*(as.vector(direct %*% adjacency) >= 1))

draw_data(design) |> with(table(direct, indirect))
      indirect
direct  0  1
     0 13 37
     1 13 37

1.7.4 Indirect assignments: Properties

indirect_propensities_1 <- replicate(5000, draw_data(design)$indirect) |> 
  apply(1, mean) 

1.7.5 Indirect assignments: Properties

df |> ggplot(aes(latitude, longitude, label = round(indirect_propensities_1, 2))) + geom_text()

1.7.6 Indirect assignments: Redesign

indirect_propensities_2 <- 
  replicate(5000, draw_data(design |> redesign(n_assigned = 25))$indirect) |> 
  apply(1, mean) 

1.7.7 Indirect assignments: Redesign

df |> ggplot(aes(latitude, longitude, label = round(indirect_propensities_2, 2))) + 
  geom_text()

Looks better: but there are trade-offs between the direct and indirect distributions

Figuring out the optimal procedure requires full diagnosis

2 Design evaluation (design diagnosis!)

A focus on power

2.1 Outline

  1. Tests review
  2. \(p\) values and significance
  3. Power
  4. Sources of power
  5. Advanced applications

2.2 Tests

2.2.1 Review

In the classical approach to testing a hypothesis we ask:

How likely are we to see data like this if indeed the hypothesis is true?

  • If the answer is “not very likely” then we treat the hypothesis as suspect.
  • If the answer is not “not very likely” then the hypothesis is maintained (some say “accepted” but this is tricky as you may want to “maintain” multiple incompatible hypotheses)

How unlikely is “not very likely”?

2.2.2 Weighing Evidence

When we test a hypothesis we decide first on what sort of evidence we need to see in order to decide that the hypothesis is not reliable.

  • Othello has a hypothesis that Desdemona is innocent.

  • Iago confronts him with evidence:

    • See how she looks at him: would she look at him like that if she were innocent?
    • … would she defend him like that if she were innocent?
    • … would he have her handkerchief if she were innocent?
    • Othello, the chances of all of these things arising if she were innocent are surely less than 5%

2.2.3 Hypotheses are often rejected, sometimes maintained, but rarely accepted

  • Note that Othello is focused on the probability of the events if she were innocent but not the probability of the events if Iago were trying to trick him.

  • He is not assessing his belief in whether she is faithful, but rather how likely the data would be if she were faithful.

So:

  • He assesses: \(\Pr(\text{Data} | \text{Hypothesis is TRUE})\)
  • While a Bayesian would assess: \(\Pr(\text{Hypothesis is TRUE} | \text{Data})\)

2.2.4 Recap: Calculate a \(p\) value in your head

  • Illustrating \(p\) values via “randomization inference”

  • Say you randomized assignment to treatment and your data looked like this.

Unit          1  2  3  4  5  6  7  8  9  10
Treatment     0  0  0  0  0  0  0  1  0  0
Health score  4  2  3  1  2  3  4  8  7  6

Then:

  • Does the treatment improve your health?
  • What’s the \(p\) value for the null that treatment had no effect on anybody?
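One way to compute this, under the sharp null that treatment had no effect on anybody, is to recompute the test statistic for each of the 10 possible single-unit assignments; a base R sketch:

```r
y <- c(4, 2, 3, 1, 2, 3, 4, 8, 7, 6)
stat <- function(treated) y[treated] - mean(y[-treated])

observed <- stat(8)  # unit 8 was the one actually treated
# Under the sharp null, any of the 10 units could have been the treated one:
null_dist <- sapply(1:10, stat)
mean(null_dist >= observed)  # one-sided p value: 0.1
```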

2.3 Power

2.4 What power is

Power is just the probability of getting a significant result, that is, of rejecting a hypothesis.

Simple enough but it presupposes:

  • A well defined hypothesis
  • An actual stipulation of the world under which you evaluate the probability
  • A procedure for producing results and determining if they are significant / rejecting a hypothesis

2.4.1 By hand

I want to test the hypothesis that a six never comes up on this dice.

Here’s my test:

  • I will roll the dice once.
  • If a six comes up I will reject the hypothesis.

What is the power of this test?

2.4.2 By hand

I want to test the hypothesis that a six never comes up on this dice.

Here’s my test:

  • I will roll the dice twice.
  • If a six comes up either time I will reject the hypothesis.

What is the power of this test?
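Stipulating, say, that the die is in fact fair (our belief about how the world really works), the power of both versions of the test is easy to compute:

```r
# Hypothesis under test: a six never comes up. Stipulated truth: the die is fair.
power_one_roll  <- 1 / 6          # reject iff the single roll is a six
power_two_rolls <- 1 - (5 / 6)^2  # reject iff either of two rolls is a six
c(power_one_roll, power_two_rolls)  # about 0.17 and 0.31
```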

2.4.3 Two probabilities

Power sometimes seems more complicated because hypothesis rejection involves a calculated probability and so you need the probability of a probability.

I want to test the hypothesis that this dice is fair.

Here’s my test:

  • I will roll the dice 1000 times and if I see fewer than x 6s or more than y 6s I will reject the hypothesis.

Now:

  • What should x and y be?
  • What is the power of this test?

2.4.4 Step 1: When do you reject?

For this we need to figure out a rule for rejection. This is based on identifying events that should be unlikely under the hypothesis.

Here is how many 6’s I would expect if the dice is fair:

fabricate(N = 1001, sixes = 0:1000, p = dbinom(sixes, 1000, 1/6)) |>
  ggplot(aes(sixes, p)) + geom_line()

2.4.5 Step 1: When do you reject?

I can figure out from this that 143 or fewer is really very few and 190 or more is really very many:

c(lower = pbinom(143, 1000, 1/6), upper = 1 - pbinom(189, 1000, 1/6))
     lower      upper 
0.02302647 0.02785689 

2.4.6 Step 2: What is the power?

  • Now we need to stipulate some belief about how the world really works—this is not the null hypothesis that we plan to reject, but something that we actually take to be true.

  • For instance: we think that in fact sixes appear 20% of the time.

Now what’s the probability of seeing at least 190 sixes?

1 - pbinom(189, 1000, .2)
[1] 0.796066

So given I think 6s appear 20% of the time, I think it likely I’ll see at least 190 sixes and reject the hypothesis of a fair dice.

2.4.7 Rule of thumb

  • 80% or 90% is a common rule of thumb for “sufficient” power
  • but really, how much power you need depends on the purpose

2.4.8 Think about

  • Are there other tests I could have implemented?
  • Are there other ways to improve this test?

2.4.9 Subtleties

  • Is a significant result from an underpowered study less credible? (only if there is a significance filter)
  • What significance level should you choose for power? (Obviously the stricter the level the lower the power, so use what you will use when you actually implement tests)
  • Do you really have to know the effect size to do power analysis? (No, but you should know at least what effect sizes you would want to be sure about picking up if they were present)
  • Power is just one of many possible diagnosands
  • What’s power for Bayesians?

2.4.10 Power analytics

Simplest intuition on power:

What is the probability of getting a significant estimate given the sampling distribution is centered on \(b\) and the standard error is 1?

  • Probability below \(-1.96\): \(F(-1.96 \mid \tau)\)
  • Probability above \(1.96\): \(1-F(1.96 \mid \tau)\)

Add these together: probability of getting an estimate above 1.96 or below -1.96.

power <- function(b, alpha = 0.05, critical = qnorm(1-alpha/2))  

  1 - pnorm(critical, mean = abs(b)) + pnorm(-critical, mean = abs(b))
power(0)
[1] 0.05
power(1.96)
[1] 0.5000586
power(-1.96)
[1] 0.5000586
power(3)
[1] 0.8508388

2.4.11 Power analytics: graphed

This is essentially what is done by pwrss::power.z.test – and it produces nice graphs!

Imagine the effect was actually 1.96 what would power be?:

pwrss::power.z.test(mean = 1.96, alpha = 0.05, 
                    alternative = "two.sided", plot = TRUE, verbose = FALSE)

2.4.12 Power analytics: graphed

Substantively: if in expectation an estimate will be just significant, then your power is 50%

2.4.13 Power analytics: graphed

This is essentially what is done by pwrss::power.z.test – and it produces nice graphs!

Imagine the effect was actually 2.5 what would power be? Is this a lot?:

pwrss::power.z.test(mean = 2.5, alpha = 0.05, 
                    alternative = "two.sided", plot = TRUE, verbose = FALSE)

2.4.14 Equivalent

power <- function(b, alpha = 0.05, critical = qnorm(1-alpha/2))  

  1 - pnorm(critical - b) + pnorm(-critical - b)

power(1.96)
[1] 0.5000586

Intuition:

x <- seq(-3, 3, .01)
plot(x, dnorm(x), main = "power associated with effect of 1 se")
abline(v = 1.96 - 1)
abline(v = -1.96 - 1)

2.4.15 Power analytics for a trial: by hand

  • Of course the standard error will depend on the number of units and the variance of outcomes in treatment and control.

  • Say \(N\) subjects are divided into two groups and potential outcomes have standard deviation \(\sigma\) in treatment and control. Then the variance of the treatment effect estimator is (approximately / conservatively):

\[Var(\tau)=\frac{\sigma^2}{N/2} + \frac{\sigma^2}{N/2} = 4\frac{\sigma^2}{N}\]

and so the (conservative / approx) standard error is:

\[\sigma_\tau=\frac{2\sigma}{\sqrt{N}}\]

Note here we seem to be using the actual standard error but of course the tests we actually run will use an estimate of the standard error…

2.4.16 Power analytics for a trial: by hand

se <- function(sd, N) (N/(N-1))^.5*2*sd/(N^.5)


power_2 <- function(b, alpha = .05, sd = 1, N = 100, critical = qnorm(1-alpha/2))  

  1 - pnorm(critical, mean = abs(b)/se(sd, N)) + pnorm(-critical, mean = abs(b)/se(sd, N))

power_2(0)
[1] 0.05
power_2(.5)
[1] 0.7010827

2.4.17 Power analytics for a trial: flexible

This can be done e.g. with pwrss like this:

pwrss::pwrss.t.2means(mu1 = .2, mu2 = .1, sd1 = 1, sd2 = 1, 
               n2 = 50, alpha = 0.05,
               alternative = "not equal")
+--------------------------------------------------+
|                POWER CALCULATION                 |
+--------------------------------------------------+

Welch's T-Test (Independent Samples)

---------------------------------------------------
Hypotheses
---------------------------------------------------
  H0 (Null Claim) : d - null.d = 0 
  H1 (Alt. Claim) : d - null.d != 0 

---------------------------------------------------
Results
---------------------------------------------------
  Sample Size            = 50 and 50
  Type 1 Error (alpha)   = 0.050
  Type 2 Error (beta)    = 0.921
  Statistical Power      = 0.079  <<
power_2(.50, N = 100)
[1] 0.7010827

2.4.18 Power for more complex trials: Analytics

Mostly involve figuring out the standard error.

Consider a cluster randomized trial, with each unit having a cluster level shock \(\epsilon_k\) and an individual shock \(\nu_i\). Say these have variances \(\sigma^2_k\), \(\sigma^2_i\).

The standard error is:

\[\sqrt{\frac{4\sigma^2_k}{K} + \frac{4\sigma^2_i}{nK}}\]

Define \(\rho = \frac{\sigma^2_k}{\sigma^2_k + \sigma^2_i}\)

\[\sqrt{\rho \frac{4\sigma^2}{K} + (1- \rho)\frac{4\sigma^2}{nK}}\]

\[\sqrt{((n - 1)\rho + 1)\frac{4\sigma^2}{nK}}\]

where

  • \(((n - 1)\rho + 1)\) is the “design effect”
  • \(\frac{nK}{((n - 1)\rho + 1)}\) is the “effective sample size”

Plug in these standard errors and proceed as before
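For example, plugging this clustered standard error into the earlier normal-approximation power calculation (a sketch; the parameter values are made up):

```r
# Power for a cluster randomized trial with n units in each of K clusters
power_clustered <- function(b, n, K, rho, sigma = 1, alpha = 0.05) {
  deff <- (n - 1) * rho + 1                   # design effect
  se   <- sqrt(deff * 4 * sigma^2 / (n * K))  # clustered standard error
  critical <- qnorm(1 - alpha / 2)
  1 - pnorm(critical - abs(b) / se) + pnorm(-critical - abs(b) / se)
}

power_clustered(b = 0.5, n = 20, K = 16, rho = 0)    # no clustering: high power
power_clustered(b = 0.5, n = 20, K = 16, rho = 0.3)  # design effect 6.7: much lower
```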

2.4.19 Power via design diagnosis

Is arbitrarily flexible

N <- 100
b <- .5

design <- 
  declare_model(N = N, 
    U = rnorm(N),
    potential_outcomes(Y ~ b * Z + U)) + 
  declare_assignment(Z = simple_ra(N),
                     Y = reveal_outcomes(Y ~ Z)) + 
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) + 
  declare_estimator(Y ~ Z, inquiry = "ate", .method = lm_robust)

2.4.20 Run it many times

sims_1 <- simulate_design(design) 

sims_1 |> select(sim_ID, estimate, p.value)
sim_ID  estimate  p.value
     1      0.24     0.22
     2      0.80     0.00
     3      0.25     0.16
     4      0.27     0.18
     5      0.87     0.00
     6      0.57     0.00

2.4.21 Power is mass of the sampling distribution of decisions under the model

sims_1 |>
  ggplot(aes(p.value)) + 
  geom_histogram() +
  geom_vline(xintercept = .05, color = "red")

2.4.22 Power is mass of the sampling distribution of decisions under the model

Obviously related to the estimates you might get

sims_1 |>
  mutate(significant = p.value <= .05) |>
  ggplot(aes(estimate, p.value, color = significant)) + 
  geom_point()

2.4.23 Check coverage is correct

sims_1 |>
  mutate(within = (b > sims_1$conf.low) & (b < sims_1$conf.high)) |> 
  pull(within) |> mean()
[1] 0.944

2.4.24 Check validity of \(p\) value

A valid \(p\)-value satisfies \(\Pr(p≤x)≤x\) for every \(x \in[0,1]\) (under the null)

sims_2 <- 
  redesign(design, b = 0) |>
  simulate_design() 
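A direct check of this condition compares the share of simulated null \(p\)-values below each threshold with the threshold itself. A base-R sketch with stand-in uniform draws (swap in sims_2$p.value from the simulation above):

```r
# Stand-in null p-values; in practice use sims_2$p.value
set.seed(2023)
p_null <- runif(1000)

# Pr(p <= x) should be at or below x (up to sampling error) at each threshold
x <- c(.01, .05, .10)
rbind(threshold = x,
      empirical = sapply(x, function(xi) mean(p_null <= xi)))
```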

2.4.25 Design diagnosis does it all (over multiple designs)

  diagnose_design(design)
Mean Estimate Bias SD Estimate RMSE Power Coverage
0.50 -0.00 0.20 0.20 0.69 0.95
(0.00) (0.00) (0.00) (0.00) (0.00) (0.00)

2.4.26 Design diagnosis does it all

design |>
  redesign(b = c(0, 0.25, 0.5, 1)) |>
  diagnose_design()
b Mean Estimate Bias SD Estimate RMSE Power Coverage
0 0.00 0.00 0.20 0.20 0.05 0.95
(0.00) (0.00) (0.00) (0.00) (0.00) (0.00)
0.25 0.25 0.00 0.20 0.20 0.24 0.95
(0.00) (0.00) (0.00) (0.00) (0.00) (0.00)
0.5 0.50 -0.00 0.20 0.20 0.69 0.95
(0.00) (0.00) (0.00) (0.00) (0.00) (0.00)
1 1.00 -0.00 0.20 0.20 1.00 0.95
(0.00) (0.00) (0.00) (0.00) (0.00) (0.00)

2.4.27 Diagnose over multiple moving parts (and ggplot)

design |>
  ## Redesign
  redesign(b = c(0.1, 0.3, 0.5), N = c(100, 200, 300)) |>
  ## Diagnosis
  diagnose_design() |>
  ## Prep
  tidy() |>
  filter(diagnosand == "power") |>
  ## Plot
  ggplot(aes(N, estimate, color = factor(b))) +
  geom_line()

2.4.28 Diagnose over multiple moving parts (and ggplot)

2.4.29 Diagnose over multiple moving parts and multiple diagnosands (and ggplot)

design |>

  ## Redesign
  redesign(b = c(0.1, 0.3, 0.5), N = c(100, 200, 300)) |>
  
  ## Diagnosis
  diagnose_design() |>
  
  ## Prep
  tidy() |>
  
  ## Plot
  ggplot(aes(N, estimate, color = factor(b))) +
  geom_line()+
  facet_wrap(~diagnosand)

2.4.30 Diagnose over multiple moving parts and multiple diagnosands (and ggplot)

2.5 Beyond basics

2.5.1 Power tips

coming up:

  • power everywhere
  • power with bias
  • power with the wrong standard errors
  • power with uncertainty over effect sizes
  • power and multiple comparisons

2.5.2 Power depends on all parts of MIDA

We often focus on sample sizes

But

Power also depends on

  • the model – obviously signal to noise
  • the assignments and specifics of sampling strategies
  • estimation procedures

2.5.3 Power from a lag?

Say we have access to a “pre” measure of the outcome Y_now; call it Y_base. Y_base is informative about potential outcomes. We are considering using Y_now - Y_base as the outcome instead of Y_now.

N <- 100
rho <- .5

design <- 
  declare_model(N,
                 Y_base = rnorm(N),
                 Y_Z_0 = 1 + correlate(rnorm, given = Y_base, rho = rho),
                 Y_Z_1 = correlate(rnorm, given = Y_base, rho = rho),
                 Z = complete_ra(N),
                 Y_now = Z*Y_Z_1 + (1-Z)*Y_Z_0,
                 Y_change = Y_now - Y_base) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y_now ~ Z, label = "level") +
  declare_estimator(Y_change ~ Z, label = "change")+
  declare_estimator(Y_now ~ Z + Y_base, label = "RHS")

2.5.4 Power from a lag?

design |> redesign(N = c(10, 100, 1000, 10000), rho = c(.1, .5, .9)) |>
  diagnose_design() 

2.5.5 Power from a lag?

Punchline:

  • if you difference: the lag has to be sufficiently informative to pay its way (the \(\rho = .5\) equivalence between levels and changes follows from Gerber and Green (2012), equation 4.6)
  • The right hand side is your friend, at least for experiments (Ding and Li (2019))
  • As \(N\) grows the stakes fall
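The \(\rho = .5\) equivalence is easy to see in variances. With var(Y_now) = var(Y_base) = 1 and correlation \(\rho\), the change score has variance \(2(1 - \rho)\), so differencing only beats the level once \(\rho\) exceeds .5. A base-R sketch:

```r
# Variance of the change score Y_now - Y_base relative to the level Y_now,
# assuming var(Y_now) = var(Y_base) = 1 and correlation rho
rho <- c(.1, .3, .5, .7, .9)
var_level  <- 1
var_change <- 2 * (1 - rho)   # = 1 + 1 - 2 * rho

rbind(rho = rho, var_change = var_change)  # below 1 only when rho > .5
```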

2.5.6 Power when estimates are biased

bad_design <- 
  
  declare_model(N = 100, 
    U = rnorm(N),
    potential_outcomes(Y ~ 0 * X + U, conditions = list(X = 0:1)),
    X = ifelse(U > 0, 1, 0)) + 
  
  declare_measurement(Y = reveal_outcomes(Y ~ X)) + 
  
  declare_inquiry(ate = mean(Y_X_1 - Y_X_0)) + 
  
  declare_estimator(Y ~ X, inquiry = "ate", .method = lm_robust)

2.5.7 Power when estimates are biased

You can see from this null design (the true effect of X is zero) that power is great but bias is terrible and coverage is way off.

diagnose_design(bad_design)
Mean Estimate Bias SD Estimate RMSE Power Coverage
1.59 1.59 0.12 1.60 1.00 0.00
(0.01) (0.01) (0.00) (0.01) (0.00) (0.00)

Power without unbiasedness corrupts, absolutely

2.5.8 Power with a more subtly biased experimental design

another_bad_design <- 
  
  declare_model(
    N = 100, 
    female = rep(0:1, N/2),
    U = rnorm(N),
    potential_outcomes(Y ~ female * Z + U)) + 
  
  declare_assignment(
    Z = block_ra(blocks = female, block_prob = c(.1, .5)),
    Y = reveal_outcomes(Y ~ Z)) + 

  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) + 
  
  declare_estimator(Y ~ Z + female, inquiry = "ate", 
                    .method = lm_robust)

diagnose_design(another_bad_design)

2.5.9 Power with a more subtly biased experimental design

You can see that power looks great but bias is terrible and coverage is way off: the estimator controls for female but ignores the unequal assignment probabilities across blocks.

Mean Estimate Bias SD Estimate RMSE Power Coverage
0.76 0.26 0.24 0.35 0.84 0.85
(0.01) (0.01) (0.01) (0.01) (0.01) (0.02)

2.5.10 Power with the wrong standard errors

clustered_design <-
  declare_model(
    cluster = add_level(N = 10, cluster_shock = rnorm(N)),
    individual = add_level(
        N = 100,
        Y_Z_0 = rnorm(N) + cluster_shock,
        Y_Z_1 = rnorm(N) + cluster_shock)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = cluster_ra(clusters = cluster)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")
Mean Estimate Bias SD Estimate RMSE Power Coverage
-0.00 -0.00 0.64 0.64 0.79 0.20
(0.01) (0.01) (0.01) (0.01) (0.01) (0.01)

What alerts you to a problem?

2.5.11 Let’s fix that one

clustered_design_2  <-
  clustered_design |> replace_step(5, 
  declare_estimator(Y ~ Z, clusters = cluster))
Mean Estimate Bias SD Estimate RMSE Power Coverage
0.00 -0.00 0.66 0.65 0.06 0.94
(0.02) (0.02) (0.01) (0.01) (0.01) (0.01)

2.5.12 Power when you are not sure about effect sizes (always!)

  • you can do power analysis for multiple stipulations
  • or you can design with a distribution of effect sizes
design_uncertain <-
  declare_model(
    N = 1000, 
    b = 1 + rnorm(1), 
    Y_Z_1 = rnorm(N), 
    Y_Z_2 = rnorm(N) + b, 
    Y_Z_3 = rnorm(N) + b) +
  declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ate = mean(b)) +
  declare_estimator(Y ~ factor(Z), term = TRUE)

draw_estimands(design_uncertain)
  inquiry   estimand
1     ate -0.3967765
draw_estimands(design_uncertain)
  inquiry  estimand
1     ate 0.7887188

2.5.13 Multiple comparisons correction (complex code)

Say I run two tests and want to correct for multiple comparisons.

Two approaches. First, by hand:

b <- .2

design_mc <-
  declare_model(
    N = 1000, 
    Y_Z_1 = rnorm(N), 
    Y_Z_2 = rnorm(N) + b, 
    Y_Z_3 = rnorm(N) + b) +
  declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ate = b) +
  declare_estimator(Y ~ factor(Z), term = TRUE)

2.5.14 Multiple comparisons correction (complex code)

design_mc |>
  simulate_designs(sims = 1000) |>
  filter(term != "(Intercept)") |>
  group_by(sim_ID) |>
  mutate(p_bonferroni = p.adjust(p = p.value, method = "bonferroni"),
         p_holm = p.adjust(p = p.value, method = "holm"),
         p_fdr = p.adjust(p = p.value, method = "fdr")) |>
  ungroup() |>
  summarize(
    "Power using naive p-values" = mean(p.value <= 0.05),
    "Power using Bonferroni correction" = mean(p_bonferroni <= 0.05),
    "Power using Holm correction" = mean(p_holm <= 0.05),
    "Power using FDR correction" = mean(p_fdr <= 0.05)
    ) 
Power using naive p-values Power using Bonferroni correction Power using Holm correction Power using FDR correction
0.7374 0.6318 0.6886 0.7032
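To see what p.adjust is doing in the pipeline above, here is a minimal base-R illustration with two p-values:

```r
p <- c(0.03, 0.04)

p.adjust(p, method = "bonferroni")  # multiplies each p-value by the number of tests (capped at 1)
p.adjust(p, method = "holm")        # steps down from the smallest p-value; less conservative
p.adjust(p, method = "fdr")         # Benjamini-Hochberg, controls the false discovery rate
```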

2.5.15 Multiple comparisons correction (approach 2)

The alternative approach (generally better!) is to design with a custom estimator that includes your corrections.

my_estimator <- function(data) 
  lm_robust(Y ~ factor(Z), data = data) |> 
  tidy() |>
  filter(term != "(Intercept)") |>
  mutate(p.naive = p.value,
         p.value = p.adjust(p = p.naive, method = "bonferroni"))
  

design_mc_2 <- design_mc |>
  replace_step(5, declare_estimator(handler = label_estimator(my_estimator))) 

run_design(design_mc_2) |> 
  select(term, estimate, p.value, p.naive) |> kable()
term estimate p.value p.naive
factor(Z)2 0.1182516 0.2502156 0.1251078
factor(Z)3 0.1057031 0.3337476 0.1668738

2.5.16 Multiple comparisons correction (Null model case)

Let’s try the same thing with a null model (using redesign(design_mc_2, b = 0))

design_mc_3 <- 
  design_mc_2 |> 
  redesign(b = 0) 

run_design(design_mc_3) |> select(estimate, p.value, p.naive) |> kable(digits = 3)
estimate p.value p.naive
0.068 0.799 0.399
0.144 0.151 0.076

2.5.17 Multiple comparisons correction (Null model case)

…and power:

Mean Estimate Bias SD Estimate RMSE Power Coverage
0.00 0.00 0.08 0.08 0.02 0.95
(0.00) (0.00) (0.00) (0.00) (0.00) (0.01)
-0.00 -0.00 0.08 0.08 0.02 0.96
(0.00) (0.00) (0.00) (0.00) (0.00) (0.01)

bothered?

2.5.18 You might try

  • Power for an interaction (in a factorial design)
  • Power for a binary variable (versus a continuous variable?)
  • Power gains from blocked randomization
  • Power losses from clustering at different levels
  • Controlling the ICC directly? (see book cluster designs section)

2.5.19 Big takeaways

  • Power is affected not just by sample size, variability, and effect size but also by your data and analysis strategies.
  • Try to estimate power under multiple scenarios
  • Try to use the same code for calculating power as you will use in your ultimate analysis
  • Basically the same procedure can be used for any design. If you can declare a design and have a test, you can calculate power
  • Your power might be right but misleading. For confidence:
    • Don’t just check power, check bias and coverage also
    • Check power especially under the null
  • Don’t let a focus on power distract you from more substantive diagnosands

2.6 References

Ding, Peng, and Fan Li. 2019. “A Bracketing Relationship Between Difference-in-Differences and Lagged-Dependent-Variable Adjustment.” Political Analysis 27 (4): 605–15.
Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. Norton.
Muralidharan, Karthik, Mauricio Romero, and Kaspar Wüthrich. 2023. “Factorial Designs, Model Selection, and (Incorrect) Inference in Randomized Experiments.” Review of Economics and Statistics, 1–44.