Design diagnosis

Power and more

Macartan Humphreys

1 Design diagnosis

A focus on power

1.1 Outline

  1. Tests review
  2. \(p\) values and significance
  3. Power
  4. Sources of power
  5. Advanced applications

1.2 Tests

1.2.1 Review

In the classical approach to testing a hypothesis we ask:

How likely are we to see data like this if indeed the hypothesis is true?

  • If the answer is “not very likely” then we treat the hypothesis as suspect.
  • If the answer is not “not very likely” then the hypothesis is maintained (some say “accepted” but this is tricky as you may want to “maintain” multiple incompatible hypotheses)

How unlikely is “not very likely”?

1.2.2 Weighing Evidence

When we test a hypothesis we first decide what sort of evidence we would need to see in order to conclude that the hypothesis is not reliable.

  • Othello has a hypothesis that Desdemona is innocent.

  • Iago confronts him with evidence:

    • See how she looks at him: would she look at him like that if she were innocent?
    • … would she defend him like that if she were innocent?
    • … would he have her handkerchief if she were innocent?
    • Othello, the chances of all of these things arising if she were innocent are surely less than 5%

1.2.3 Hypotheses are often rejected, sometimes maintained, but rarely accepted

  • Note that Othello is focused on the probability of the events if she were innocent but not the probability of the events if Iago were trying to trick him.

  • He is not assessing his belief in whether she is faithful, but rather how likely the data would be if she were faithful.

So:

  • He assesses: \(\Pr(\text{Data} | \text{Hypothesis is TRUE})\)
  • While a Bayesian would assess: \(\Pr(\text{Hypothesis is TRUE} | \text{Data})\)

1.2.4 Recap: Calculate a \(p\) value in your head

  • Illustrating \(p\) values via “randomization inference”

  • Say you randomized assignment to treatment and your data looked like this.

Unit           1   2   3   4   5   6   7   8   9  10
Treatment      0   0   0   0   0   0   0   1   0   0
Health score   4   2   3   1   2   3   4   8   7   6

Then:

  • Does the treatment improve your health?
  • What’s the \(p\) value for the null that treatment had no effect on anybody?

1.3 Power

1.4 What power is

Power is just the probability of getting a significant result, that is, of rejecting a hypothesis.

Simple enough but it presupposes:

  • A well defined hypothesis
  • An actual stipulation of the world under which you evaluate the probability
  • A procedure for producing results and determining whether they are significant / whether the hypothesis is rejected

1.4.1 By hand

I want to test the hypothesis that a six never comes up on this dice.

Here’s my test:

  • I will roll the dice once.
  • If a six comes up I will reject the hypothesis.

What is the power of this test?

1.4.2 By hand

I want to test the hypothesis that a six never comes up on this dice.

Here’s my test:

  • I will roll the dice twice.
  • If a six comes up either time I will reject the hypothesis.

What is the power of this test?
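
If we stipulate that the dice is in fact fair (a stipulation, not something we know), the calculation is short:

# Power under the stipulation that the dice is in fact fair
# One roll: reject if a six comes up
1/6
# Two rolls: reject if a six comes up either time
1 - (5/6)^2

So power is about 0.17 with one roll and about 0.31 with two: more rolls, more power.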

1.4.3 Two probabilities

Power sometimes seems more complicated because hypothesis rejection involves a calculated probability and so you need the probability of a probability.

I want to test the hypothesis that this dice is fair.

Here’s my test:

  • I will roll the dice 1000 times and if I see fewer than x 6s or more than y 6s I will reject the hypothesis.

Now:

  • What should x and y be?
  • What is the power of this test?

1.4.4 Step 1: When do you reject?

For this we need to figure out a rule for rejection. This is based on identifying events that should be unlikely under the hypothesis.

Here is the distribution of the number of 6s I would expect if the dice is fair:

library(DeclareDesign)  # also attaches fabricatr, randomizr, and estimatr
library(tidyverse)      # ggplot2 and dplyr are used throughout
library(knitr)          # for kable()

fabricate(N = 1001, sixes = 0:1000, p = dbinom(sixes, 1000, 1/6)) |>
  ggplot(aes(sixes, p)) + geom_line()

1.4.5 Step 1: When do you reject?

I can figure out from this that 143 or fewer is really very few and 190 or more is really very many:

c(lower = pbinom(143, 1000, 1/6), upper = 1 - pbinom(189, 1000, 1/6))
     lower      upper 
0.02302647 0.02785689 

1.4.6 Step 2: What is the power?

  • Now we need to stipulate some belief about how the world really works—this is not the null hypothesis that we plan to reject, but something that we actually take to be true.

  • For instance: we think that in fact sixes appear 20% of the time.

Now what’s the probability of seeing at least 190 sixes?

1 - pbinom(189, 1000, .2)
[1] 0.796066

So given I think 6s appear 20% of the time, I think it likely I’ll see at least 190 sixes and reject the hypothesis of a fair dice.

1.4.7 Rule of thumb

  • 80% or 90% is a common rule of thumb for “sufficient” power
  • but really, how much power you need depends on the purpose

1.4.8 Think about

  • Are there other tests I could have implemented?
  • Are there other ways to improve this test?
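
One way to think about improvements is to trace out how the power of this rejection rule changes with the stipulated truth. A sketch, reusing the cutoffs from above:

# Power of the "143 or fewer, or 190 or more, sixes" rule as the truth varies
true_p <- seq(1/6, 1/4, by = 0.005)
power_curve <- pbinom(143, 1000, true_p) + 1 - pbinom(189, 1000, true_p)
plot(true_p, power_curve, type = "l",
     xlab = "true probability of a six", ylab = "power")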

1.4.9 Subtleties

  • Is a significant result from an underpowered study less credible? (only if there is a significance filter)
  • What significance level should you choose for power? (Obviously the stricter the level the lower the power, so use what you will use when you actually implement tests)
  • Do you really have to know the effect size to do power analysis? (No, but you should know at least what effect sizes you would want to be sure about picking up if they were present)
  • Power is just one of many possible diagnosands
  • What’s power for Bayesians?

1.4.10 Power analytics

Simplest intuition on power:

What is the probability of getting a significant estimate given the sampling distribution is centered on \(b\) and the standard error is 1?

  • Probability below \(-1.96\): \(F(-1.96 \mid b)\)
  • Probability above \(1.96\): \(1-F(1.96 \mid b)\)

Add these together: probability of getting an estimate above 1.96 or below -1.96.

power <- function(b, alpha = 0.05, critical = qnorm(1 - alpha/2)) {
  1 - pnorm(critical, mean = abs(b)) + pnorm(-critical, mean = abs(b))
}

power(0)
[1] 0.05
power(1.96)
[1] 0.5000586
power(-1.96)
[1] 0.5000586
power(3)
[1] 0.8508388

1.4.11 Power analytics: graphed

This is essentially what is done by pwrss::power.z.test – and it produces nice graphs!

See:

pwrss::power.z.test(ncp = 1.96, alpha = 0.05, alternative = "not equal", plot = TRUE)
     power ncp.alt ncp.null alpha  z.crit.1 z.crit.2
 0.5000586    1.96        0  0.05 -1.959964 1.959964

1.4.12 Power analytics: graphed

Substantively: if in expectation an estimate will be just significant, then your power is 50%

1.4.13 Equivalent

power <- function(b, alpha = 0.05, critical = qnorm(1 - alpha/2)) {
  1 - pnorm(critical - b) + pnorm(-critical - b)
}

power(1.96)
[1] 0.5000586

Intuition:

x <- seq(-3, 3, .01)
plot(x, dnorm(x), main = "power associated with effect of 1 se")
abline(v = 1.96 - 1)
abline(v = -1.96 - 1)

1.4.14 Power analytics for a trial: by hand

  • Of course the standard error will depend on the number of units and the variance of outcomes in treatment and control.

  • Say \(N\) subjects are divided equally into two groups and potential outcomes have standard deviation \(\sigma\) in both treatment and control. Then the (approximate, conservative) variance of the estimated treatment effect is:

\[Var(\tau)=\frac{\sigma^2}{N/2} + \frac{\sigma^2}{N/2} = 4\frac{\sigma^2}{N}\]

and so the (conservative / approx) standard error is:

\[\sigma_\tau=\frac{2\sigma}{\sqrt{N}}\]

Note here we seem to be using the actual standard error but of course the tests we actually run will use an estimate of the standard error…

1.4.15 Power analytics for a trial: by hand

# conservative standard error for a two-arm trial (with a small finite-sample adjustment)
se <- function(sd, N) (N/(N-1))^.5 * 2 * sd / N^.5

power_2 <- function(b, alpha = .05, sd = 1, N = 100, critical = qnorm(1 - alpha/2)) {
  1 - pnorm(critical, mean = abs(b)/se(sd, N)) + pnorm(-critical, mean = abs(b)/se(sd, N))
}

power_2(0)
[1] 0.05
power_2(.5)
[1] 0.7010827

1.4.16 Power analytics for a trial: flexible

This can be done e.g. with pwrss like this:

pwrss::pwrss.t.2means(mu1 = .2, mu2 = .1, sd1 = 1, sd2 = 1, 
               n2 = 50, alpha = 0.05,
               alternative = "not equal")
 Difference between Two means 
 (Independent Samples t Test) 
 H0: mu1 = mu2 
 HA: mu1 != mu2 
 ------------------------------ 
  Statistical power = 0.079 
  n1 = 50 
  n2 = 50 
 ------------------------------ 
 Alternative = "not equal" 
 Degrees of freedom = 98 
 Non-centrality parameter = 0.5 
 Type I error rate = 0.05 
 Type II error rate = 0.921 
power_2(.50, N = 100)
[1] 0.7010827
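
Note that the two calls above answer different questions: the pwrss example stipulates a difference of 0.1 standard deviations, while power_2(.5, N = 100) stipulates 0.5. To line them up you could, for instance, set the means 0.5 apart; the result should come out close to (though not exactly equal to) 0.70, since pwrss.t.2means uses a \(t\) rather than a normal reference distribution:

pwrss::pwrss.t.2means(mu1 = .5, mu2 = 0, sd1 = 1, sd2 = 1, 
               n2 = 50, alpha = 0.05,
               alternative = "not equal")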

1.4.17 Power for more complex trials: Analytics

Mostly involve figuring out the standard error.

Consider a cluster randomized trial with \(K\) clusters of \(n\) units each, with whole clusters assigned to treatment or control. Each unit has a cluster level shock \(\epsilon_k\) and an individual shock \(\nu_i\); say these have variances \(\sigma^2_k\) and \(\sigma^2_i\).

The standard error is:

\[\sqrt{\frac{4\sigma^2_k}{K} + \frac{4\sigma^2_i}{nK}}\]

Define \(\rho = \frac{\sigma^2_k}{\sigma^2_k + \sigma^2_i}\) and write \(\sigma^2 = \sigma^2_k + \sigma^2_i\) for the total variance. Then the standard error can be rewritten as:

\[\sqrt{\rho \frac{4\sigma^2}{K} + (1- \rho)\frac{4\sigma^2}{nK}} = \sqrt{((n - 1)\rho + 1)\frac{4\sigma^2}{nK}}\]

where

  • \(((n - 1)\rho + 1)\) is the “design effect”
  • \(\frac{nK}{((n - 1)\rho + 1)}\) is the “effective sample size”

Plug in these standard errors and proceed as before
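
For instance, a minimal sketch (the sample sizes and effect here are purely illustrative) that plugs the clustered standard error into the same normal-approximation logic as the power function above:

power_clustered <- function(b, n, K, rho, sigma = 1, alpha = 0.05,
                            critical = qnorm(1 - alpha/2)) {
  # clustered standard error, using the design-effect expression above
  se <- sqrt(((n - 1) * rho + 1) * 4 * sigma^2 / (n * K))
  1 - pnorm(critical, mean = abs(b)/se) + pnorm(-critical, mean = abs(b)/se)
}

# illustration: effect of 0.5 sd, 20 clusters of 10 units, varying rho
power_clustered(b = 0.5, n = 10, K = 20, rho = c(0, 0.1, 0.5))

Higher intra-cluster correlation inflates the design effect, shrinks the effective sample size, and so cuts power.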

1.4.18 Power via design diagnosis

Is arbitrarily flexible

N <- 100
b <- .5

design <- 
  declare_model(N = N, 
    U = rnorm(N),
    potential_outcomes(Y ~ b * Z + U)) + 
  declare_assignment(Z = simple_ra(N),
                     Y = reveal_outcomes(Y ~ Z)) + 
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) + 
  declare_estimator(Y ~ Z, inquiry = "ate", .method = lm_robust)

1.4.19 Run it many times

sims_1 <- simulate_design(design) 

sims_1 |> select(sim_ID, estimate, p.value)
sim_ID estimate p.value
1 0.81 0.00
2 0.40 0.04
3 0.88 0.00
4 0.72 0.00
5 0.38 0.05
6 0.44 0.02

1.4.20 Power is mass of the sampling distribution of decisions under the model

sims_1 |>
  ggplot(aes(p.value)) + 
  geom_histogram() +
  geom_vline(xintercept = .05, color = "red")

1.4.21 Power is mass of the sampling distribution of decisions under the model

Obviously related to the estimates you might get

sims_1 |>
  mutate(significant = p.value <= .05) |>
  ggplot(aes(estimate, p.value, color = significant)) + 
  geom_point()

1.4.22 Check coverage is correct

sims_1 |>
  mutate(within = b > conf.low & b < conf.high) |> 
  pull(within) |> mean()
[1] 0.9573333

1.4.23 Check validity of \(p\) value

A valid \(p\)-value satisfies \(\Pr(p \leq x) \leq x\) for every \(x \in [0,1]\) (under the null)

sims_2 <- 
  
  redesign(design, b = 0) |>
  
  simulate_design() 
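
One way to complete the check (a sketch): compare the empirical distribution function of the simulated \(p\) values under the null, which estimates \(\Pr(p \leq x)\), against the 45 degree line; a valid \(p\) value should sit on or below it.

sims_2 |>
  ggplot(aes(p.value)) +
  stat_ecdf() +                                          # empirical Pr(p <= x)
  geom_abline(intercept = 0, slope = 1, color = "red")   # the bound Pr(p <= x) = x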

1.4.24 Design diagnosis does it all (over multiple designs)

  diagnose_design(design)
Mean Estimate      Bias  SD Estimate      RMSE     Power    Coverage
         0.50      0.00         0.20      0.20      0.70        0.95
       (0.00)    (0.00)       (0.00)    (0.00)    (0.00)      (0.00)

1.4.25 Design diagnosis does it all

design |>
  redesign(b = c(0, 0.25, 0.5, 1)) |>
  diagnose_design()
   b  Mean Estimate      Bias  SD Estimate      RMSE     Power    Coverage
   0          -0.00     -0.00         0.20      0.20      0.05        0.95
             (0.00)    (0.00)       (0.00)    (0.00)    (0.00)      (0.00)
0.25           0.25     -0.00         0.20      0.20      0.23        0.95
             (0.00)    (0.00)       (0.00)    (0.00)    (0.00)      (0.00)
 0.5           0.50      0.00         0.20      0.20      0.70        0.95
             (0.00)    (0.00)       (0.00)    (0.00)    (0.00)      (0.00)
   1           1.00      0.00         0.20      0.20      1.00        0.95
             (0.00)    (0.00)       (0.00)    (0.00)    (0.00)      (0.00)

1.4.26 Diagnose over multiple moving parts (and ggplot)

design |>
  ## Redesign
  redesign(b = c(0.1, 0.3, 0.5), N = c(100, 200, 300)) |>
  ## Diagnosis
  diagnose_design() |>
  ## Prep
  tidy() |>
  filter(diagnosand == "power") |>
  ## Plot
  ggplot(aes(N, estimate, color = factor(b))) +
  geom_line()

1.4.27 Diagnose over multiple moving parts (and ggplot)

[Figure: power plotted against N, with separate lines for each value of b]

1.4.28 Diagnose over multiple moving parts and multiple diagnosands (and ggplot)

design |>

  ## Redesign
  redesign(b = c(0.1, 0.3, 0.5), N = c(100, 200, 300)) |>
  
  ## Diagnosis
  diagnose_design() |>
  
  ## Prep
  tidy() |>
  
  ## Plot
  ggplot(aes(N, estimate, color = factor(b))) +
  geom_line() +
  facet_wrap(~diagnosand)

1.4.29 Diagnose over multiple moving parts and multiple diagnosands (and ggplot)

[Figure: estimates plotted against N, with separate lines for each value of b, faceted by diagnosand]

1.5 Beyond basics

1.5.1 Power tips

Coming up:

  • power everywhere
  • power with bias
  • power with the wrong standard errors
  • power with uncertainty over effect sizes
  • power and multiple comparisons

1.5.2 Power depends on all parts of MIDA

We often focus on sample sizes

But

Power also depends on

  • the model – obviously signal to noise
  • the assignments and specifics of sampling strategies
  • estimation procedures

1.5.3 Power from a lag?

Say we have access to a “pre” measure of outcome Y_now; call it Y_base. Y_base is informative about potential outcomes. We are considering using Y_now - Y_base as the outcome instead of Y_now.

N <- 100
rho <- .5

design <- 
  declare_model(N,
                 Y_base = rnorm(N),
                 Y_Z_0 = 1 + correlate(rnorm, given = Y_base, rho = rho),
                 Y_Z_1 = correlate(rnorm, given = Y_base, rho = rho),
                 Z = complete_ra(N),
                 Y_now = Z*Y_Z_1 + (1-Z)*Y_Z_0,
                 Y_change = Y_now - Y_base) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y_now ~ Z, label = "level") +
  declare_estimator(Y_change ~ Z, label = "change")+
  declare_estimator(Y_now ~ Z + Y_base, label = "RHS")

1.5.4 Power from a lag?

design |> redesign(N = c(10, 100, 1000, 10000), rho = c(.1, .5, .9)) |>
  diagnose_design() 

1.5.5 Power from a lag?

Punchline:

  • if you difference: the lag has to be sufficiently informative to pay its way (the \(\rho = .5\) equivalence between level and change follows from Gerber and Green (2012), equation 4.6; see the sketch below)
  • The right hand side is your friend, at least for experiments (Ding and Li (2019))
  • As \(N\) grows the stakes fall
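
A back-of-the-envelope version of the first point, assuming Y_base and Y_now have the same variance \(\sigma^2\) and correlation \(\rho\) within each arm: differencing changes the outcome variance from \(\sigma^2\) to

\[\text{Var}(Y_\text{now} - Y_\text{base}) = 2\sigma^2(1-\rho)\]

which is smaller than \(\sigma^2\), and so reduces the standard error, only when \(\rho > .5\); at \(\rho = .5\) the level and change estimators are equally precise.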

1.5.6 Power when estimates are biased

bad_design <- 
  
  declare_model(N = 100, 
    U = rnorm(N),
    potential_outcomes(Y ~ 0 * X + U, conditions = list(X = 0:1)),
    X = ifelse(U > 0, 1, 0)) + 
  
  declare_measurement(Y = reveal_outcomes(Y ~ X)) + 
  
  declare_inquiry(ate = mean(Y_X_1 - Y_X_0)) + 
  
  declare_estimator(Y ~ X, inquiry = "ate", .method = lm_robust)

1.5.7 Power when estimates are biased

You can see from the null design that power is great but bias is terrible and coverage is way off.

diagnose_design(bad_design)
Mean Estimate      Bias  SD Estimate      RMSE     Power    Coverage
         1.59      1.59         0.12      1.60      1.00        0.00
       (0.01)    (0.01)       (0.00)    (0.01)    (0.00)      (0.00)

Power without unbiasedness corrupts, absolutely

1.5.8 Power with a more subtly biased experimental design

another_bad_design <- 
  
  declare_model(
    N = 100, 
    female = rep(0:1, N/2),
    U = rnorm(N),
    potential_outcomes(Y ~ female * Z + U)) + 
  
  declare_assignment(
    Z = block_ra(blocks = female, block_prob = c(.1, .5)),
    Y = reveal_outcomes(Y ~ Z)) + 

  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) + 
  
  declare_estimator(Y ~ Z + female, inquiry = "ate", 
                    .method = lm_robust)
  diagnose_design(another_bad_design)

1.5.9 Power with a more subtly biased experimental design

Again power looks great, but the estimates are biased and coverage is off.

Mean Estimate      Bias  SD Estimate      RMSE     Power    Coverage
         0.76      0.26         0.24      0.35      0.84        0.85
       (0.01)    (0.01)       (0.01)    (0.01)    (0.01)      (0.02)

1.5.10 Power with the wrong standard errors

clustered_design <-
  declare_model(
    cluster = add_level(N = 10, cluster_shock = rnorm(N)),
    individual = add_level(
        N = 100,
        Y_Z_0 = rnorm(N) + cluster_shock,
        Y_Z_1 = rnorm(N) + cluster_shock)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = cluster_ra(clusters = cluster)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")

Mean Estimate      Bias  SD Estimate      RMSE     Power    Coverage
        -0.00     -0.00         0.64      0.64      0.79        0.20
       (0.01)    (0.01)       (0.01)    (0.01)    (0.01)      (0.01)

What alerts you to a problem?

1.5.11 Let’s fix that one

clustered_design_2  <-
  clustered_design |> replace_step(5, 
  declare_estimator(Y ~ Z, clusters = cluster))

Mean Estimate      Bias  SD Estimate      RMSE     Power    Coverage
         0.00     -0.00         0.66      0.65      0.06        0.94
       (0.02)    (0.02)       (0.01)    (0.01)    (0.01)      (0.01)

1.5.12 Power when you are not sure about effect sizes (always!)

  • you can do power analysis for multiple stipulations
  • or you can design with a distribution of effect sizes

design_uncertain <-
  declare_model(N = 1000, 
                b = 1 + rnorm(1), 
                Y_Z_1 = rnorm(N), 
                Y_Z_2 = rnorm(N) + b, 
                Y_Z_3 = rnorm(N) + b) +
  declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ate = mean(b)) +
  declare_estimator(Y ~ factor(Z), term = TRUE)

draw_estimands(design_uncertain)
  inquiry   estimand
1     ate -0.3967765
draw_estimands(design_uncertain)
  inquiry  estimand
1     ate 0.7887188

1.5.13 Multiple comparisons correction (complex code)

Say I run two tests and want to correct for multiple comparisons.

Two approaches. First, by hand:

b = .2

design_mc <-
  declare_model(N = 1000, Y_Z_1 = rnorm(N), Y_Z_2 = rnorm(N) + b, Y_Z_3 = rnorm(N) + b) +
  declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ate = b) +
  declare_estimator(Y ~ factor(Z), term = TRUE)

1.5.14 Multiple comparisons correction (complex code)

design_mc |>
  simulate_designs(sims = 1000) |>
  filter(term != "(Intercept)") |>
  group_by(sim_ID) |>
  mutate(p_bonferroni = p.adjust(p = p.value, method = "bonferroni"),
         p_holm = p.adjust(p = p.value, method = "holm"),
         p_fdr = p.adjust(p = p.value, method = "fdr")) |>
  ungroup() |>
  summarize(
    "Power using naive p-values" = mean(p.value <= 0.05),
    "Power using Bonferroni correction" = mean(p_bonferroni <= 0.05),
    "Power using Holm correction" = mean(p_holm <= 0.05),
    "Power using FDR correction" = mean(p_fdr <= 0.05)
    ) 
Power using naive p-values          0.7374
Power using Bonferroni correction   0.6318
Power using Holm correction         0.6886
Power using FDR correction          0.7032

1.5.15 Multiple comparisons correction (approach 2)

The alternative approach (generally better!) is to design with a custom estimator that includes your corrections.

my_estimator <- function(data) 
  lm_robust(Y ~ factor(Z), data = data) |> 
  tidy() |>
  filter(term != "(Intercept)") |>
  mutate(p.naive = p.value,
         p.value = p.adjust(p = p.naive, method = "bonferroni"))
  

design_mc_2 <- design_mc |>
  replace_step(5, declare_estimator(handler = label_estimator(my_estimator))) 

run_design(design_mc_2) |> 
  select(term, estimate, p.value, p.naive) |> kable()
term estimate p.value p.naive
factor(Z)2 0.2508003 0.0021145 0.0010573
factor(Z)3 0.2383963 0.0052469 0.0026235

1.5.16 Multiple comparisons correction (Null model case)

Let's try the same thing for a null model (using redesign(design_mc_2, b = 0))

design_mc_3 <- 
  design_mc_2 |> 
  redesign(b = 0) 

run_design(design_mc_3) |> select(estimate, p.value, p.naive) |> kable(digits = 3)
estimate p.value p.naive
-0.036 1 0.630
-0.002 1 0.979

1.5.17 Multiple comparisons correction (Null model case)

…and power:

Mean Estimate      Bias  SD Estimate      RMSE     Power    Coverage
         0.00      0.00         0.08      0.08      0.02        0.95
       (0.00)    (0.00)       (0.00)    (0.00)    (0.00)      (0.01)
        -0.00     -0.00         0.08      0.08      0.02        0.96
       (0.00)    (0.00)       (0.00)    (0.00)    (0.00)      (0.01)

bothered?

1.5.18 You might try

  • Power for an interaction (in a factorial design)
  • Power for a binary variable (versus a continuous variable?)
  • Power gains from blocked randomization
  • Power losses from clustering at different levels
  • Controlling the ICC directly? (see book cluster designs section)
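
For the first of these, a minimal sketch (all effect sizes and sample sizes are illustrative, not recommendations) of a 2 x 2 factorial where the diagnosand of interest is power for the interaction term:

interaction_design <-
  declare_model(
    N = 500,
    U = rnorm(N),
    Z1 = complete_ra(N),
    Z2 = block_ra(blocks = Z1),          # cross-cut the two factors
    Y = 0.2 * Z1 + 0.2 * Z2 + 0.2 * Z1 * Z2 + U) +
  declare_inquiry(interaction = 0.2) +
  declare_estimator(Y ~ Z1 * Z2, term = "Z1:Z2",
                    inquiry = "interaction", .method = lm_robust)

diagnose_design(interaction_design)

You should generally find that, for effects of similar size, power for the interaction is far lower than power for a main effect, so factorial designs need considerably larger samples when interactions are the target.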

1.5.19 Big takeaways

  • Power is affected not just by sample size, variability, and effect size but also by your data and analysis strategies.
  • Try to estimate power under multiple scenarios
  • Try to use the same code for calculating power as you will use in your ultimate analysis
  • Basically the same procedure can be used for any design. If you can declare a design and have a test, you can calculate power
  • Your power might be right but misleading. For confidence:
    • Don’t just check power, check bias and coverage also
    • Check power especially under the null
  • Don’t let a focus on power distract you from more substantive diagnosands
Ding, Peng, and Fan Li. 2019. “A Bracketing Relationship Between Difference-in-Differences and Lagged-Dependent-Variable Adjustment.” Political Analysis 27 (4): 605–15.
Gerber, Alan S, and Donald P Green. 2012. Field Experiments: Design, Analysis, and Interpretation. Norton.