Power and more
A focus on power
In the classical approach to testing a hypothesis we ask:
How likely are we to see data like this if indeed the hypothesis is true?
How unlikely is “not very likely”?
When we test a hypothesis we first decide what sort of evidence we would need to see in order to conclude that the hypothesis is not reliable.
Othello has a hypothesis that Desdemona is innocent.
Iago confronts him with evidence:
Note that Othello is focused on the probability of the events if she were innocent but not the probability of the events if Iago were trying to trick him.
He is not assessing his belief in whether she is faithful, but rather how likely the data would be if she were faithful.
So:
Illustrating \(p\) values via “randomization inference”
Say you randomized assignment to treatment and your data looked like this.
Unit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Treatment | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Health score | 4 | 2 | 3 | 1 | 2 | 3 | 4 | 8 | 7 | 6 |
Then:
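As a sketch of the randomization-inference logic (a one-sided test of the sharp null of no effect, assuming exactly one unit is treated, as in the table above):

```r
# Observed data from the table above
Y <- c(4, 2, 3, 1, 2, 3, 4, 8, 7, 6)
Z <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0)

diff_means <- function(Z, Y) mean(Y[Z == 1]) - mean(Y[Z == 0])
observed <- diff_means(Z, Y)

# Under the sharp null the outcomes are fixed, so we can recompute the
# difference in means for every possible assignment of the single treated unit
sims <- sapply(1:10, function(i) diff_means(replace(rep(0, 10), i, 1), Y))

# One-sided p-value: share of assignments giving a difference at least as large
mean(sims >= observed)   # 0.1 here: only the realized assignment is this extreme
```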
Power is just the probability of getting a significant result, that is, of rejecting a hypothesis.
Simple enough, but it presupposes two things: a rule for deciding when to reject, and a belief about how the world really works.
I want to test the hypothesis that a six never comes up on this dice.
Here’s my test:
What is the power of this test?
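The test itself is not spelled out here, but a natural one is: roll the die \(n\) times and reject the hypothesis if any six appears. A minimal sketch of its power against the (assumed) alternative that the die is in fact fair:

```r
# Under the hypothesis "a six never comes up" we never reject by mistake;
# power is the chance of seeing at least one six if the die is actually fair
n <- 10              # assumed number of rolls, for illustration
1 - (5/6)^n          # about 0.84 with 10 rolls
```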
Power sometimes seems more complicated because hypothesis rejection involves a calculated probability and so you need the probability of a probability.
I want to test the hypothesis that this dice is fair.
Here’s my test:
Now:
For this we need to figure out a rule for rejection. This is based on identifying events that should be unlikely under the hypothesis.
Here is how many 6’s I would expect if the dice is fair:
I can figure out from this that 143 or fewer is really very few and 190 or more is really very many:
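A sketch of where those cutoffs come from, assuming (for illustration) 1,000 rolls: both tails are small, a few percent each, under a fair die.

```r
n <- 1000                      # assumed number of rolls
pbinom(143, n, 1/6)            # P(143 or fewer sixes) under fairness: about 2%
1 - pbinom(189, n, 1/6)        # P(190 or more sixes) under fairness: about 3%
```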
Now we need to stipulate some belief about how the world really works—this is not the null hypothesis that we plan to reject, but something that we actually take to be true.
For instance: we think that in fact sixes appear 20% of the time.
Now what’s the probability of seeing at least 190 sixes?
So given I think 6s appear 20% of the time, I think it likely I’ll see at least 190 sixes and reject the hypothesis of a fair dice.
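A sketch of that power calculation, again assuming 1,000 rolls:

```r
# Probability of landing in the rejection region (190 or more sixes)
# if sixes in fact come up 20% of the time
1 - pbinom(189, 1000, 0.2)     # about 0.8
```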
Simplest intuition on power:
What is the probability of getting a significant estimate given the sampling distribution is centered on \(b\) and the standard error is 1?
Add together the probability of getting an estimate above 1.96 and the probability of getting one below -1.96.
This is essentially what is done by pwrss::power.z.test, and it produces nice graphs!
See:
Substantively: if in expectation an estimate will be just significant, then your power is 50%
# Power: the chance that the estimate lands above the upper critical value or
# below the lower one, when the sampling distribution is centered on b with
# standard error 1
power <- function(b, alpha = 0.05, critical = qnorm(1-alpha/2))
  1 - pnorm(critical - b) + pnorm(-critical - b)

power(1.96)
[1] 0.5000586
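A handy benchmark from the same function: for 80% power the expected estimate needs to sit about 2.8 standard errors away from zero (1.96 + 0.84).

```r
power(2.8)   # about 0.80
```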
Intuition:
Of course the standard error will depend on the number of units and the variance of outcomes in treatment and control.
Say \(N\) subjects are divided into two groups and potential outcomes have standard deviation \(\sigma\) in both treatment and control. Then the variance of the estimated treatment effect is (approximately, and conservatively):
\[Var(\tau)=\frac{\sigma^2}{N/2} + \frac{\sigma^2}{N/2} = 4\frac{\sigma^2}{N}\]
and so the (conservative / approx) standard error is:
\[\sigma_\tau=\frac{2\sigma}{\sqrt{N}}\]
Note here we seem to be using the actual standard error but of course the tests we actually run will use an estimate of the standard error…
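Putting the pieces together, a minimal sketch (the function name and the normal approximation are mine; a t test, as below, gives very slightly different numbers):

```r
# Normal-approximation power for a two-arm trial: N subjects in total,
# true effect b, outcome sd sigma in each arm
power_two_arm <- function(b, sigma, N, alpha = 0.05) {
  se <- 2 * sigma / sqrt(N)          # conservative standard error from above
  critical <- qnorm(1 - alpha / 2)
  1 - pnorm(critical - b / se) + pnorm(-critical - b / se)
}

power_two_arm(b = 0.1, sigma = 1, N = 100)   # about 0.08, close to the pwrss output below
```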
This can be done e.g. with pwrss like this:
pwrss::pwrss.t.2means(mu1 = .2, mu2 = .1, sd1 = 1, sd2 = 1,
n2 = 50, alpha = 0.05,
alternative = "not equal")
Difference between Two means
(Independent Samples t Test)
H0: mu1 = mu2
HA: mu1 != mu2
------------------------------
Statistical power = 0.079
n1 = 50
n2 = 50
------------------------------
Alternative = "not equal"
Degrees of freedom = 98
Non-centrality parameter = 0.5
Type I error rate = 0.05
Type II error rate = 0.921
Power calculations for more complex designs mostly involve figuring out the standard error.
Consider a cluster randomized trial, with each unit having a cluster level shock \(\epsilon_k\) and an individual shock \(\nu_i\). Say these have variances \(\sigma^2_k\), \(\sigma^2_i\).
The standard error is:
\[\sqrt{\frac{4\sigma^2_k}{K} + \frac{4\sigma^2_i}{nK}}\]
Define \(\rho = \frac{\sigma^2_k}{\sigma^2_k + \sigma^2_i}\). Then the standard error can be written:
\[\sqrt{\rho \frac{4\sigma^2}{K} + (1- \rho)\frac{4\sigma^2}{nK}}\]
or, equivalently:
\[\sqrt{((n - 1)\rho + 1)\frac{4\sigma^2}{nK}}\]
where \(\sigma^2 = \sigma^2_k + \sigma^2_i\).
Plug in these standard errors and proceed as before
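For instance, a sketch that plugs the clustered standard error into the same normal-approximation calculation (power_clustered is an illustrative helper, not a package function):

```r
# Power for a cluster-randomized design: K clusters of n units, true effect b,
# total outcome variance sigma^2, intra-cluster correlation rho
power_clustered <- function(b, sigma, K, n, rho, alpha = 0.05) {
  se <- sqrt(((n - 1) * rho + 1) * 4 * sigma^2 / (n * K))
  critical <- qnorm(1 - alpha / 2)
  1 - pnorm(critical - b / se) + pnorm(-critical - b / se)
}

power_clustered(b = 0.5, sigma = 1, K = 10, n = 100, rho = 0.5)
```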
Simulation-based power analysis is arbitrarily flexible.
sim_ID | estimate | p.value |
---|---|---|
1 | 0.81 | 0.00 |
2 | 0.40 | 0.04 |
3 | 0.88 | 0.00 |
4 | 0.72 | 0.00 |
5 | 0.38 | 0.05 |
6 | 0.44 | 0.02 |
Power is then just the share of simulations with \(p \leq 0.05\), so it is obviously related to the estimates you might get.
A valid \(p\)-value satisfies \(\Pr(p \leq x) \leq x\) for every \(x \in [0,1]\) (under the null)
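A quick way to see what this means: under a true null a valid \(p\)-value falls below 0.05 at most 5% of the time. A minimal simulation check (the one-sample t test here is just an example of a valid test):

```r
# Under a true null, p-values from a valid test are (sub-)uniform
p_null <- replicate(2000, t.test(rnorm(30))$p.value)
mean(p_null <= 0.05)   # close to (or below) 0.05
```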
Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|
0.50 | 0.00 | 0.20 | 0.20 | 0.70 | 0.95 |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
b | Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|---|
0 | -0.00 | -0.00 | 0.20 | 0.20 | 0.05 | 0.95 |
 | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
0.25 | 0.25 | -0.00 | 0.20 | 0.20 | 0.23 | 0.95 |
 | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
0.5 | 0.50 | 0.00 | 0.20 | 0.20 | 0.70 | 0.95 |
 | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
1 | 1.00 | 0.00 | 0.20 | 0.20 | 1.00 | 0.95 |
 | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
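A table like this can be produced by diagnosing the same design at several values of the effect size; a sketch, assuming a design object design with a parameter b (in the spirit of the DeclareDesign code used later):

```r
# Diagnose a parameterized design over a range of effect sizes
designs <- redesign(design, b = c(0, 0.25, 0.5, 1))
diagnose_design(designs, sims = 500)
```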
Coming up:
We often focus on sample sizes.
But power also depends on other features of the design and the analysis, as the examples below illustrate.
Say we have access to a "pre" measure of the outcome Y_now; call it Y_base. Y_base is informative about potential outcomes. We are considering using Y_now - Y_base as the outcome instead of Y_now.
N <- 100
rho <- .5
design <-
declare_model(N,
Y_base = rnorm(N),
Y_Z_0 = 1 + correlate(rnorm, given = Y_base, rho = rho),
Y_Z_1 = correlate(rnorm, given = Y_base, rho = rho),
Z = complete_ra(N),
Y_now = Z*Y_Z_1 + (1-Z)*Y_Z_0,
Y_change = Y_now - Y_base) +
declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
declare_estimator(Y_now ~ Z, label = "level") +
declare_estimator(Y_change ~ Z, label = "change")+
declare_estimator(Y_now ~ Z + Y_base, label = "RHS")
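To compare power (and other diagnosands) across the three estimators you can diagnose the design; a minimal call, with an arbitrary number of simulations:

```r
# Simulate the design and summarize power, bias, coverage, etc. by estimator
# (assumes the DeclareDesign package is loaded, as for the code above)
diagnose_design(design, sims = 500)
```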
Punchline:
You can see from the null design that power is great but bias is terrible and coverage is way off.
Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|
1.59 | 1.59 | 0.12 | 1.60 | 1.00 | 0.00 |
(0.01) | (0.01) | (0.00) | (0.01) | (0.00) | (0.00) |
Power without unbiasedness corrupts, absolutely
another_bad_design <-
declare_model(
N = 100,
female = rep(0:1, N/2),
U = rnorm(N),
potential_outcomes(Y ~ female * Z + U)) +
declare_assignment(
Z = block_ra(blocks = female, block_prob = c(.1, .5)),
Y = reveal_outcomes(Y ~ Z)) +
declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
declare_estimator(Y ~ Z + female, inquiry = "ate",
.method = lm_robust)
Again the diagnosis shows that power looks fine but bias is substantial and coverage is off.
Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|
0.76 | 0.26 | 0.24 | 0.35 | 0.84 | 0.85 |
(0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.02) |
clustered_design <-
declare_model(
cluster = add_level(N = 10, cluster_shock = rnorm(N)),
individual = add_level(
N = 100,
Y_Z_0 = rnorm(N) + cluster_shock,
Y_Z_1 = rnorm(N) + cluster_shock)) +
declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
declare_assignment(Z = cluster_ra(clusters = cluster)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_estimator(Y ~ Z, inquiry = "ATE")
Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|
-0.00 | -0.00 | 0.64 | 0.64 | 0.79 | 0.20 |
(0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.01) |
What alerts you to a problem?
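The giveaway in the diagnosis above is the coverage of 0.20, together with a rejection rate far above 0.05 even though the true effect is zero: the naive standard errors are much too small. A sketch of one fix (the exact step behind the next table is not shown, so this is an assumption): have the estimator cluster its standard errors.

```r
# Swap in an estimator that clusters standard errors at the cluster level
clustered_design_2 <- clustered_design |>
  replace_step(5, declare_estimator(Y ~ Z, clusters = cluster,
                                    .method = lm_robust, inquiry = "ATE"))
```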
Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|
0.00 | -0.00 | 0.66 | 0.65 | 0.06 | 0.94 |
(0.02) | (0.02) | (0.01) | (0.01) | (0.01) | (0.01) |
design_uncertain <-
declare_model(N = 1000, b = 1+rnorm(1), Y_Z_1 = rnorm(N), Y_Z_2 = rnorm(N) + b, Y_Z_3 = rnorm(N) + b) +
declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_inquiry(ate = mean(b)) +
declare_estimator(Y ~ factor(Z), term = TRUE)
draw_estimands(design_uncertain)
inquiry estimand
1 ate -0.3967765
inquiry estimand
1 ate 0.7887188
Say I run two tests and want to correct for multiple comparisons.
Two approaches. First, by hand:
b = .2
design_mc <-
declare_model(N = 1000, Y_Z_1 = rnorm(N), Y_Z_2 = rnorm(N) + b, Y_Z_3 = rnorm(N) + b) +
declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_inquiry(ate = b) +
declare_estimator(Y ~ factor(Z), term = TRUE)
design_mc |>
simulate_designs(sims = 1000) |>
filter(term != "(Intercept)") |>
group_by(sim_ID) |>
mutate(p_bonferroni = p.adjust(p = p.value, method = "bonferroni"),
p_holm = p.adjust(p = p.value, method = "holm"),
p_fdr = p.adjust(p = p.value, method = "fdr")) |>
ungroup() |>
summarize(
"Power using naive p-values" = mean(p.value <= 0.05),
"Power using Bonferroni correction" = mean(p_bonferroni <= 0.05),
"Power using Holm correction" = mean(p_holm <= 0.05),
"Power using FDR correction" = mean(p_fdr <= 0.05)
)
Power using naive p-values | Power using Bonferroni correction | Power using Holm correction | Power using FDR correction |
---|---|---|---|
0.7374 | 0.6318 | 0.6886 | 0.7032 |
The alternative approach (generally better!) is to design with a custom estimator that includes your corrections.
my_estimator <- function(data)
lm_robust(Y ~ factor(Z), data = data) |>
tidy() |>
filter(term != "(Intercept)") |>
mutate(p.naive = p.value,
p.value = p.adjust(p = p.naive, method = "bonferroni"))
design_mc_2 <- design_mc |>
replace_step(5, declare_estimator(handler = label_estimator(my_estimator)))
run_design(design_mc_2) |>
select(term, estimate, p.value, p.naive) |> kable()
term | estimate | p.value | p.naive |
---|---|---|---|
factor(Z)2 | 0.2508003 | 0.0021145 | 0.0010573 |
factor(Z)3 | 0.2383963 | 0.0052469 | 0.0026235 |
Let's try the same thing for a null model (using redesign(design_mc_2, b = 0)).
…and power:
Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|
0.00 | 0.00 | 0.08 | 0.08 | 0.02 | 0.95 |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.01) |
-0.00 | -0.00 | 0.08 | 0.08 | 0.02 | 0.96 |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.01) |
Bothered?