Using DeclareDesign for power analysis

Insights from the MIDA framework

Graeme Blair, Alex Coppock, Macartan Humphreys

WZB presentation | November 2022

Plan

Topics

  1. The MIDA framework and the declaration-diagnosis-redesign cycle
  2. Brief intro to DeclareDesign
  3. \(p\)-values review
  4. Power review
  5. Power via design diagnosis: the how to and advantages thereof
  6. Lessons & strategies
  7. Some applications

Timing

  • Part I: Topics 1-4 and exercises 1
  • Break: 10:13 - 10:33
  • Part II: Topics 5-6, 10:33 - 11:30
  • Break: 11:30 - 11:45
  • Part III: Topic 7, 11:45 - 12:30

The MIDA Framework

Declaration

Q: Is my research design good?

A: Well let’s simulate it to see how it performs.

Q: What should I put in the simulation?

A: All elements of a research design.

Q: What are the elements of a research design?

A: M! I! D! A!

Four elements of any research design

  • Model: set of models of what causes what and how
  • Inquiry: a question stated in terms of the model
  • Data strategy: the set of procedures we use to gather information from the world (sampling, assignment, measurement)
  • Answer strategy: how we summarize the data produced by the data strategy


Declaration, Diagnosis, Redesign

Declaration

Telling the computer what M, I, D, and A are.

Diagnosis

Estimating “diagnosands” like power, bias, rmse, error rates, ethical harm, amount learned.

  • want to diagnose over model uncertainty

Redesign

Fine-tuning features of the data and answer strategies to understand how they change the diagnosands

  • Different sample sizes
  • Different randomization procedures
  • Different estimation strategies
  • Implementation: effort into compliance versus more effort into sample size

Very often you have to simulate!

  • This is too hard to work out from rules of thumb or power calculators
  • Specialized formulas exist for some diagnosands, but not all.

DeclareDesign

Key commands for making a design

  • declare_model()
  • declare_inquiry()
  • declare_assignment()
  • declare_measurement()
  • declare_estimator()

and there are more declare_ functions!

Key commands for using a design

  • draw_data(design)
  • draw_estimands(design)
  • draw_estimates(design)
  • get_estimates(design, data)
  • run_design(design), simulate_design(design)
  • diagnose_design(design)
  • redesign(design, N = 200)
  • design |> redesign(N = c(200, 400)) |> diagnose_designs()
  • compare_designs(), compare_diagnoses()

Cheat sheet

https://raw.githubusercontent.com/rstudio/cheatsheets/master/declaredesign.pdf

A simple design

library(DeclareDesign)

N <- 100
b <- .5

design <- 
  declare_model(N = N, U = rnorm(N), 
                potential_outcomes(Y ~ b * Z + U)) + 
  declare_assignment(Z = simple_ra(N), Y = reveal_outcomes(Y ~ Z)) + 
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) + 
  declare_estimator(Y ~ Z, inquiry = "ate", .method = lm_robust)

You now have a two arm design object in memory!

If you just type design it will run the design—a good check to make sure the design has been declared properly.

Make data from the design

data <- draw_data(design)

data |> head() |> kable()
ID U Y_Z_0 Y_Z_1 Z Y
001 -1.3967765 -1.3967765 -0.8967765 1 -0.8967765
002 0.5233121 0.5233121 1.0233121 1 1.0233121
003 0.1422646 0.1422646 0.6422646 0 0.1422646
004 -0.8466480 -0.8466480 -0.3466480 0 -0.8466480
005 -0.4118214 -0.4118214 0.0881786 0 -0.4118214
006 -1.4650350 -1.4650350 -0.9650350 0 -1.4650350

Make data from the design

Play with the data:

lm_robust(Y ~ Z, data = data) |>
  tidy() |>
  kable(digits = 2)
term estimate std.error statistic p.value conf.low conf.high df outcome
(Intercept) -0.26 0.13 -2.06 0.04 -0.52 -0.01 98 Y
Z 0.81 0.17 4.76 0.00 0.47 1.15 98 Y

Draw estimands

draw_estimands(design) |>
  kable(digits = 2)
inquiry estimand
ate 0.5

Draw estimates

draw_estimates(design) |> 
  kable(digits = 2) 
estimator term estimate std.error statistic p.value conf.low conf.high df outcome inquiry
estimator Z 0.22 0.19 1.16 0.25 -0.16 0.6 98 Y ate

Get estimates

get_estimates(design, data) |>
  kable(digits = 2)
estimator term estimate std.error statistic p.value conf.low conf.high df outcome inquiry
estimator Z 0.81 0.17 4.76 0 0.47 1.15 98 Y ate

Simulate design

simulate_design(design, sims = 3) |>
  kable(digits = 2)
design sim_ID inquiry estimand estimator term estimate std.error statistic p.value conf.low conf.high df outcome
design 1 ate 0.5 estimator Z -0.03 0.22 -0.16 0.88 -0.47 0.40 98 Y
design 2 ate 0.5 estimator Z 0.72 0.20 3.58 0.00 0.32 1.12 98 Y
design 3 ate 0.5 estimator Z 0.47 0.20 2.36 0.02 0.08 0.86 98 Y

Diagnose design

design |> 
  diagnose_design(sims = 100) 
Mean Estimate Bias SD Estimate RMSE Power Coverage
0.49 -0.01 0.20 0.20 0.65 0.95
(0.02) (0.02) (0.01) (0.01) (0.05) (0.02)

Redesign

new_design <- design |> redesign(b = 0)

  • Modify any arguments that are explicitly called on by design steps.
  • Or add, remove, or replace steps.

Compare designs

redesign(design, N = 50) |>
  compare_diagnoses(design)
diagnosand mean_1 mean_2 mean_difference conf.low conf.high
mean_estimand 0.50 0.50 0.00 0.00 0.00
mean_estimate 0.50 0.50 -0.01 -0.04 0.02
bias 0.00 0.00 -0.01 -0.04 0.02
sd_estimate 0.30 0.20 -0.09 -0.12 -0.07
rmse 0.29 0.20 -0.09 -0.12 -0.07
power 0.41 0.70 0.29 0.23 0.34
coverage 0.94 0.94 0.00 -0.03 0.02

\(p\)-values review

Tests

In the classical approach to testing a hypothesis we ask:

How likely are we to see data like this if indeed the hypothesis is true?

  • If the answer is “not very likely” then we treat the hypothesis as suspect.
  • If the answer is not “not very likely” then the hypothesis is maintained (some say “accepted” but this is tricky as you may want to “maintain” multiple incompatible hypotheses)

How unlikely is “not very likely”?

Weighing Evidence

When we test a hypothesis we decide first on what sort of evidence we need to see in order to decide that the hypothesis is not reliable.

  • Othello has a hypothesis that Desdemona is innocent.

  • Iago confronts him with evidence:

    • See how she looks at him: would she look at him like that if she were innocent?
    • … would she defend him like that if she were innocent?
    • … would he have her handkerchief if she were innocent?
    • Othello, the chances of all of these things arising if she were innocent are surely less than 5%

Hypotheses are often rejected, sometimes maintained, but rarely accepted

  • Note that Othello is focused on the probability of the events if she were innocent but not the probability of the events if Iago were trying to trick him.

  • He is not assessing his belief in whether she is faithful, but rather how likely the data would be if she were faithful.

So:

  • He assesses: \(\Pr(\text{Data} | \text{Hypothesis is TRUE})\)
  • While a Bayesian would assess: \(\Pr(\text{Hypothesis is TRUE} | \text{Data})\)

Calculate a \(p\) value in your head

  • Illustrating \(p\) values via “randomization inference”

  • Say you randomized assignment to treatment and your data looked like this.

Unit 1 2 3 4 5 6 7 8 9 10
Treatment 0 0 0 0 0 0 0 1 0 0
Health score 4 2 3 1 2 3 4 8 7 6

Then:

  • Does the treatment improve your health?
  • What’s the \(p\) value for the null that treatment had no effect on anybody?

Calculate a \(p\) value in your head

  • Illustrating \(p\) values via “randomization inference”
  • Say you randomized assignment to treatment and your data looked like this.
Unit 1 2 3 4 5 6 7 8 9 10
Treatment 0 0 0 0 0 0 0 0 1 0
Health score 4 2 3 1 2 3 4 8 7 6

Then:

  • Does the treatment improve your health?
  • What’s the \(p\) value for the null that treatment had no effect on anybody?
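A minimal randomization-inference sketch for the two tables above (my own illustration, using the treated unit's score as the test statistic, which with a single treated unit is equivalent to the difference in means):

# Under the sharp null of no effect for anybody, the scores are fixed and
# only the treatment label moves across the 10 equally likely assignments
health <- c(4, 2, 3, 1, 2, 3, 4, 8, 7, 6)

# one-sided p-value: share of assignments in which the treated unit's score
# is at least as large as the one observed
mean(health >= health[8])  # first table, unit 8 treated
## [1] 0.1
mean(health >= health[9])  # second table, unit 9 treated
## [1] 0.2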

Power: Basics

What power is

Power is just the probability of getting a significant result, that is, of rejecting a hypothesis.

Simple enough but it presupposes:

  • A well defined hypothesis
  • An actual stipulation of the world under which you evaluate the probability
  • A procedure for producing results and determining whether they are significant / whether to reject a hypothesis

By hand

I want to test the hypothesis that a six never comes up on this dice.

Here’s my test:

  • I will roll the dice once.
  • If a six comes up I will reject the hypothesis.

What is the power of this test?

By hand

I want to test the hypothesis that a six never comes up on this dice.

Here’s my test:

  • I will roll the dice twice.
  • If a six comes up either time I will reject the hypothesis.

What is the power of this test?
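A quick check on both by-hand tests, under the stipulation (mine, for illustration) that the dice is in fact fair, so a six comes up with probability 1/6 on each roll:

c(one_roll = 1/6, two_rolls = 1 - (5/6)^2)
##  one_roll two_rolls 
## 0.1666667 0.3055556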

Two probabilities

Power sometimes seems more complicated because hypothesis rejection involves a calculated probability and so you need the probability of a probability.

I want to test the hypothesis that this dice is fair.

Here’s my test:

  • I will roll the dice 1000 times and if I see fewer than x 6s or more than y 6s I will reject the hypothesis.

Now:

  • What should x and y be?
  • What is the power of this test?

Step 1: When do you reject?

For this we need to figure out a rule for rejection, based on identifying events that should be unlikely under the hypothesis.

Here is how many 6’s I would expect if the dice is fair:

fabricate(N = 1001, sixes = 0:1000, p = dbinom(sixes, 1000, 1/6)) |>
  ggplot(aes(sixes, p)) + geom_line()

Step 1: When do you reject?

I can figure out from this that 143 or fewer is really very few and 190 or more is really very many:

c(lower = pbinom(143, 1000, 1/6), upper = 1 - pbinom(189, 1000, 1/6))
##      lower      upper 
## 0.02302647 0.02785689

Step 2: What is the power?

  • Now we need to stipulate some belief about how the world really works—this is not the null hypothesis that we plan to reject, but something that we actually take to be true.

  • For instance: we think that in fact sixes appear 20% of the time.

Now what’s the probability of seeing at least 190 sixes?

1 - pbinom(189, 1000, .2)
## [1] 0.796066

So given I think 6s appear 20% of the time, I think it likely I’ll see at least 190 sixes and reject the hypothesis of a fair dice.

Rule of thumb

  • 80% or 90% is a common rule of thumb for “sufficient” power
  • but really, how much power you need depends on the purpose

Think about

  • Are there other tests I could have implemented?
  • Are there other ways to improve this test?

Last subtleties

  • Is a significant result from an underpowered study less credible? (only if there is a significance filter)
  • What significance level should you choose for power? (Obviously the stricter the level the lower the power, so use the level you will actually use when you implement your tests.)
  • Do you really have to know the effect size to do power analysis? (No, but you should know at least what effect sizes you would want to be sure of picking up if they were present.)
  • Power is just one of many possible diagnosands

Power via design diagnosis

A simple design

N <- 100
b <- .5

design <- 
  declare_model(N = N, 
    U = rnorm(N),
    potential_outcomes(Y ~ b * Z + U)) + 
  declare_assignment(Z = simple_ra(N),
                     Y = reveal_outcomes(Y ~ Z)) + 
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) + 
  declare_estimator(Y ~ Z, inquiry = "ate", .method = lm_robust)

“Run” the design once

run_design(design)
Summary of a single ‘run’ of the design
inquiry estimand estimator term estimate std.error statistic p.value conf.low conf.high df outcome
ate 0.5 estimator Z 0.84 0.23 3.63 0 0.38 1.29 98 Y

Run it many times

sims_1 <- simulate_design(design) 

sims_1 |> select(sim_ID, estimate, p.value)
sim_ID estimate p.value
1 0.48 0.03
2 0.33 0.10
3 0.89 0.00
4 0.78 0.00
5 0.58 0.00
6 0.30 0.16

Power is mass of the sampling distribution of decisions under the model

sims_1 |>
  ggplot(aes(p.value)) + 
  geom_histogram() +
  geom_vline(xintercept = .05, color = "red")


Obviously related to the estimates you might get

sims_1 |>
  mutate(significant = p.value <= .05) |>
  ggplot(aes(estimate, p.value, color = significant)) + 
  geom_point()

Check coverage is correct

sims_1 |>
  mutate(within = conf.low < b & b < conf.high) |> 
  pull(within) |> 
  mean()
## [1] 0.9536667

Check validity of \(p\) value

A valid \(p\)-value satisfies \(\Pr(p \leq x) \leq x\) for every \(x \in [0,1]\) (under the null)

sims_2 <- redesign(design, b = 0) |> simulate_design()
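One way to eyeball this condition (a sketch of mine): plot the empirical CDF of the p-values simulated under the null design; the curve should sit at or below the 45-degree line, and the share of p-values at or below .05 should be about .05.

sims_2 |>
  ggplot(aes(p.value)) + 
  stat_ecdf() +                                         # estimated Pr(p <= x)
  geom_abline(intercept = 0, slope = 1, color = "red")  # 45-degree line

mean(sims_2$p.value <= .05)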

Design diagnosis does it all (over multiple designs)

  diagnose_design(design)
Mean Estimate Bias SD Estimate RMSE Power Coverage
0.50 0.00 0.20 0.20 0.70 0.95
(0.00) (0.00) (0.00) (0.00) (0.00) (0.00)

Design diagnosis does it all

design |>
  redesign(b = c(0, 0.25, 0.5, 1)) |>
  diagnose_design()
b Mean Estimate Bias SD Estimate RMSE Power Coverage
0 -0.00 -0.00 0.20 0.20 0.05 0.95
(0.00) (0.00) (0.00) (0.00) (0.00) (0.00)
0.25 0.25 -0.00 0.20 0.20 0.23 0.95
(0.00) (0.00) (0.00) (0.00) (0.00) (0.00)
0.5 0.50 0.00 0.20 0.20 0.70 0.95
(0.00) (0.00) (0.00) (0.00) (0.00) (0.00)
1 1.00 0.00 0.20 0.20 1.00 0.95
(0.00) (0.00) (0.00) (0.00) (0.00) (0.00)

Diagnose over multiple moving parts (and ggplot)

design |>

  # Redesign
  redesign(b = c(0.1, 0.3, 0.5), N = c(100, 200, 300)) |>
  
  # Diagnosis
  diagnose_design() |>
  
  # Prep
  tidy() |>
  filter(diagnosand == "power") |>
  
  # Plot
  ggplot(aes(N, estimate, color = factor(b))) +
  geom_line()


Diagnose over multiple moving parts and multiple diagnosands (and ggplot)

design |>

  # Redesign
  redesign(b = c(0.1, 0.3, 0.5), N = c(100, 200, 300)) |>
  
  # Diagnosis
  diagnose_design() |>
  
  # Prep
  tidy() |>
  
  # Plot
  ggplot(aes(N, estimate, color = factor(b))) +
  geom_line()+
  facet_wrap(~diagnosand)


DeclareDesign tips

  • make sure the design runs by typing the design name in the console (e.g., design) or by calling run_design(design)
  • you can simulate \(k\) times with simulate_design(design, sims = k)
  • you can diagnose the design with \(k\) simulations with diagnose_design(design, sims = k)
  • you can even simulate different steps a different number of times, by making sims a vector
  • you can diagnose a list of designs with diagnose_design(design_list)
  • you can use tidy to get the diagnosis information in a nice format for ggplot
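A few of these tips in code (a sketch; the sims values and the second design are illustrative):

simulate_design(design, sims = 500)    # simulate the design 500 times
diagnose_design(design, sims = 500)    # diagnose with 500 simulations

# sims as a vector, one entry per step: draw the model once and
# re-run the later steps 100 times within that single draw
simulate_design(design, sims = c(1, 100, 1, 1))

# diagnose a list of designs, then tidy the diagnosis for ggplot
design_list <- list(design_100 = design, design_200 = redesign(design, N = 200))
diagnose_design(design_list) |> tidy()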

Lessons and strategies

In this section

  • principles
  • power with bias
  • power with the wrong standard errors
  • power with uncertainty over effect sizes
  • power and multiple comparisons

Power depends on all parts of MIDA

We often focus on sample sizes

But

Power also depends on

  • the model – most obviously, the signal-to-noise ratio
  • the assignment procedure and the specifics of the sampling strategy
  • estimation procedures

Power when estimates are biased

bad_design <- 
  
  declare_model(N = 100, 
    U = rnorm(N),
    potential_outcomes(Y ~ 0 * X + U, conditions = list(X = 0:1)),
    X = ifelse(U > 0, 1, 0)) + 
  
  declare_measurement(Y = reveal_outcomes(Y ~ X)) + 
  
  declare_inquiry(ate = mean(Y_X_1 - Y_X_0)) + 
  
  declare_estimator(Y ~ X, inquiry = "ate", .method = lm_robust)

Power when estimates are biased

You can see from the null design that power is great but bias is terrible and coverage is way off.

diagnose_design(bad_design)
Mean Estimate Bias SD Estimate RMSE Power Coverage
1.59 1.59 0.12 1.60 1.00 0.00
(0.01) (0.01) (0.00) (0.01) (0.00) (0.00)

Power without unbiasedness corrupts, absolutely

Power with a more subtly biased experimental design

another_bad_design <- 
  
  declare_model(
    N = 100, 
    female = rep(0:1, N/2),
    U = rnorm(N),
    potential_outcomes(Y ~ female * Z + U)) + 
  
  declare_assignment(
    Z = block_ra(blocks = female, block_prob = c(.1, .5)),
    Y = reveal_outcomes(Y ~ Z)) + 

  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) + 
  
  declare_estimator(Y ~ Z + female, inquiry = "ate", 
                    .method = lm_robust)

diagnose_design(another_bad_design)

Power with a more subtly biased experimental design

Here power looks healthy, but the estimates are biased and coverage falls well short of the nominal 95%.

Mean Estimate Bias SD Estimate RMSE Power Coverage
0.76 0.26 0.24 0.35 0.84 0.85
(0.01) (0.01) (0.01) (0.01) (0.01) (0.02)
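One possible fix (a sketch of mine, not from the slides): because treatment probabilities differ across blocks, swap in a block-aware estimator such as difference_in_means with blocks = female, which estimates within blocks and then weights blocks by their size.

fixed_design <- another_bad_design |>
  replace_step(4, declare_estimator(Y ~ Z, blocks = female,
                                    .method = difference_in_means,
                                    inquiry = "ate"))

diagnose_design(fixed_design)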

Power with the wrong standard errors

clustered_design <-
  declare_model(
    cluster = add_level(N = 10, cluster_shock = rnorm(N)),
    individual = add_level(
        N = 100,
        Y_Z_0 = rnorm(N) + cluster_shock,
        Y_Z_1 = rnorm(N) + cluster_shock)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = cluster_ra(clusters = cluster)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")
Mean Estimate Bias SD Estimate RMSE Power Coverage
-0.00 -0.00 0.64 0.64 0.79 0.20
(0.01) (0.01) (0.01) (0.01) (0.01) (0.01)

Let’s fix that one

clustered_design_2 <-
  clustered_design |> 
  replace_step(5, declare_estimator(Y ~ Z, clusters = cluster, inquiry = "ATE"))
Mean Estimate Bias SD Estimate RMSE Power Coverage
0.00 -0.00 0.66 0.65 0.06 0.94
(0.02) (0.02) (0.01) (0.01) (0.01) (0.01)

Power when you are not sure about effect sizes (always!)

  • you can do power analysis for multiple stipulations
  • or you can design with a distribution of effect sizes

design_uncertain <-
  declare_model(N = 1000, 
                b = 1 + rnorm(1), 
                Y_Z_1 = rnorm(N), 
                Y_Z_2 = rnorm(N) + b, 
                Y_Z_3 = rnorm(N) + b) +
  declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ate = mean(b)) +
  declare_estimator(Y ~ factor(Z), term = TRUE)

draw_estimands(design_uncertain)
##   inquiry  estimand
## 1     ate 0.7120984
draw_estimands(design_uncertain)
##   inquiry estimand
## 1     ate 2.139047
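Diagnosing this design then averages over the distribution of effect sizes: each simulation draws a fresh b, so the reported power is expected power given that uncertainty (the sims value below is illustrative).

diagnose_design(design_uncertain, sims = 500)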

Multiple comparisons correction (complex code)

Say I run two tests and want to correct for multiple comparisons.

Two approaches. First, by hand:

b = .2

design_mc <-
  declare_model(N = 1000, Y_Z_1 = rnorm(N), Y_Z_2 = rnorm(N) + b, Y_Z_3 = rnorm(N) + b) +
  declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ate = b) +
  declare_estimator(Y ~ factor(Z), term = TRUE)

Multiple comparisons correction (complex code)

design_mc |>
  simulate_designs(sims = 1000) |>
  filter(term != "(Intercept)") |>
  group_by(sim_ID) |>
  mutate(p_bonferroni = p.adjust(p = p.value, method = "bonferroni"),
         p_holm = p.adjust(p = p.value, method = "holm"),
         p_fdr = p.adjust(p = p.value, method = "fdr")) |>
  ungroup() |>
  summarize(
    "Power using naive p-values" = mean(p.value <= 0.05),
    "Power using Bonferroni correction" = mean(p_bonferroni <= 0.05),
    "Power using Holm correction" = mean(p_holm <= 0.05),
    "Power using FDR correction" = mean(p_fdr <= 0.05)
    ) 
Power using naive p-values Power using Bonferroni correction Power using Holm correction Power using FDR correction
0.7374 0.6318 0.6886 0.7032

Multiple comparisons correction (approach 2)

The alternative approach (generally better!) is to design with a custom estimator that includes your corrections.

my_estimator <- function(data) 
  lm_robust(Y ~ factor(Z), data = data) |> 
  tidy() |>
  filter(term != "(Intercept)") |>
  mutate(p.naive = p.value,
         p.value = p.adjust(p = p.naive, method = "bonferroni"))
  

design_mc_2 <- design_mc |>
  replace_step(5, declare_estimator(handler = label_estimator(my_estimator))) 

run_design(design_mc_2) |> 
  select(term, estimate, p.value, p.naive) |> kable()
term estimate p.value p.naive
factor(Z)2 0.0535175 0.9751237 0.4875618
factor(Z)3 0.1970167 0.0213185 0.0106592

Multiple comparisons correction (Null model case)

Let's try the same thing for a null model (using redesign(design_mc_2, b = 0))

design_mc_3 <- 
  design_mc_2 |> 
  redesign(b = 0) 

run_design(design_mc_3) |> select(estimate, p.value, p.naive) |> kable()
estimate p.value p.naive
-0.0366055 1 0.6300116
-0.0266577 1 0.7285604

Multiple comparisons correction (Null model case)

…and power:

Mean Estimate Bias SD Estimate RMSE Power Coverage
0.00 0.00 0.08 0.08 0.02 0.95
(0.00) (0.00) (0.00) (0.00) (0.00) (0.01)
-0.00 -0.00 0.08 0.08 0.02 0.96
(0.00) (0.00) (0.00) (0.00) (0.00) (0.01)

bothered?

Now let's try

  • Power for an interaction (in a factorial design)
  • Power for a binary variable (versus a continuous variable?)
  • Power from adding covariates in the analysis stage
  • Power gains from blocked randomization
  • Power losses from clustering at different levels
  • Controlling the ICC directly? (see book cluster designs section)

Big takeaways

  • Power is affected not just by sample size, variability, and effect size, but also by your data and analysis strategies.
  • Try to estimate power under multiple scenarios
  • Try to use the same code for calculating power as you will use in your ultimate analysis
  • Basically the same procedure can be used for any design. If you can declare a design and have a test, you can calculate power
  • Your power might be right but misleading. For confidence:
    • Don’t just check power, check bias and coverage also
    • Check power especially under the null
  • Don’t let a focus on power distract you from more substantive diagnosands