Last week in class we discussed the idea that \(p\) values are uniformly distributed under the null. Coincidentally, the same day Richard McElreath put out an interesting thread saying that they are not, not even in theory. https://twitter.com/rlmcelreath/status/1677322772719054849
McElreath showed an example where the distribution seemed very far from uniform for a pretty standard application: an experiment with a binary \(X\) and \(Y\) and a null of no effect for any unit (so his example presupposed a sharp null).
First, a conceptual clarification on what \(p\) values are. We can distinguish between the real, or actual, \(p\) values and the estimated, or nominal, \(p\) values. The real (or actual) \(p\) value reports the probability of seeing a test statistic (e.g. an estimate) at least as large as the one observed under some null, for instance the null of no effect. The estimated (or nominal) \(p\) value is what gets reported in papers; it is an estimate of the real \(p\) value.
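To fix ideas, here is a minimal sketch (added for illustration; the variable names and the 5,000-draw loop are my own and are not part of the analysis below) contrasting the two: the nominal \(p\) comes from a single regression, while the real \(p\) is approximated by simulating the sampling distribution of the estimate under the null.

library(estimatr)

set.seed(1)
X <- rep(0:1, 50)
Y <- rbinom(100, 1, 0.85)              # the null is true: X plays no role in Y

fit <- lm_robust(Y ~ X)
nominal_p <- fit$p.value[2]            # the p value a paper would report
observed  <- abs(fit$coefficients[2])

# real p (up to simulation error): how often, under the null, is the estimate at least this large?
null_draws <- replicate(5000, abs(lm_robust(rbinom(100, 1, 0.85) ~ X)$coefficients[2]))
real_p <- mean(null_draws >= observed)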
If the \(p\) values are far off in this simple case then we really have a problem. (Of course we do have multiple problems with \(p\)s, but the issue here is whether they are reliable on their own merits.)
Why’s that? Because we get two distinct things from the claim that \(p\) values are uniform over \([0,1]\). Claim A: if we reject a null whenever \(p \leq \alpha\), then we reject a true null at most \(\alpha\) of the time. Claim B: the full distribution of \(p\) values under the null is uniform, which matters whenever the whole distribution gets used and not just a rejection threshold.
McElreath’s critique could apply to either the real or the estimated \(p\) values and threaten both claim B and, perhaps, claim A.
To explore further I took a bit of a deep dive into McElreath’s example to understand better what’s going on. I ran the same simulation but increased the number of simulations, letting \(N\) and the estimation method vary. I also calculated the estimated (“nominal”) \(p\) values, the correct \(p\) values (analytically and via simulation), and the exact randomization inference \(p\) values for nulls on a collection of sample average treatment effects.
Full declaration and simulations are below, but the headlines are:
A couple of other things of note.
I use DeclareDesign along with a trick to use multistage simulation to calculate randomization inference \(p\) values. I start with a simulation approach since, although it’s not necessary for the current example, it is more flexible in general.
The simulations are done using DeclareDesign. The design is very simple:
library(DeclareDesign)   # also attaches fabricatr, randomizr, and estimatr
library(tidyverse)

N <- 100

design <-
  declare_model(N = N, X = rep(0:1, N/2), Y = rbinom(N, 1, 0.85)) +
  declare_estimator(Y ~ X, .method = glm, family = binomial(), label = "logit") +
  declare_estimator(Y ~ X, .method = lm_robust, label = "lm_robust")
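Before simulating at scale, a quick sanity check (not part of the original code) is to look at one draw of the data and one run of the estimators:

draw_data(design) |> head()   # one draw of the data
run_design(design)            # estimates and p values from both estimators for one draw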
If you simulate this design and plot the \(p\) values you get essentially the same graph as in McElreath’s post. And you get essentially the same graph whether you use lm_robust or the logit.
Because I am doing a very large number of simulations I am going to save the simulation object and read it back in (run turns the simulations on and off).
if(run)
  design |>
    redesign(N = c(100, 500)) |>        # Vary N
    simulate_design(sims = 100000) |>
    write_rds("sims_100_500.rds")

simulations <- read_rds("sims_100_500.rds") |>
  mutate(cdf = cume_dist(round(p.value, 4)))   # empirical cdf of the (coarsened) p values
simulations |>
ggplot(aes(p.value)) + geom_histogram(bins = 50, boundary = 0) + facet_grid(N~estimator)
Note that I coarsened the p values slightly (rounding to 4 decimal places) to make the cdf easier to read, so that it is unresponsive to extremely tiny differences in estimated \(p\) values.
If you plot the cdf you see it does pretty well at the low end. Things go wacky at the high end, though in practice people mostly use \(p\) values at the low end.
simulations |> ggplot(aes(p.value, cdf)) + geom_point() +
  facet_grid(N ~ estimator) +
  geom_abline(intercept = 0, slope = 1, color = "orange")
Let’s zoom in to \(p\) values in the range \((0, .1]\):
simulations |> filter(p.value <= .1) |>
  ggplot(aes(p.value, cdf)) + geom_point() +
  facet_grid(N ~ estimator) +
  geom_abline(intercept = 0, slope = 1, color = "orange")
On visual inspection here the fit is very good. Things are quite good too when you look at a set of specific values:
simulations |> filter(N == 100 & estimator == "lm_robust") |> pull(p.value) |>
quantile(probs = c(.01, .02, .03, .04, .05))|> round(3)
## 1% 2% 3% 4% 5%
## 0.011 0.021 0.029 0.038 0.051
Fit as a regression, the \(R^2\) would be 99.8% (lm(cdf ~ p.value, data = simulations)).
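For completeness, that number can be pulled directly from the fitted regression (a one-liner added here, not in the original code):

summary(lm(cdf ~ p.value, data = simulations))$r.squared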
Things are better still with a larger \(N\), which allows for fuller support for the \(p\) values.
simulations |> filter(N == 500 & estimator == "lm_robust") |> pull(p.value) |>
quantile(probs = c(.01, .02, .03, .04, .05))|> round(3)
## 1% 2% 3% 4% 5%
## 0.010 0.020 0.030 0.039 0.049
How close are the “nominal” \(p\)-values estimated here to the “real” \(p\)’s?
Recall that the real \(p\) values report the probability of getting an estimate as large as or larger than the one we got, under the null hypothesis. Since we have lots of data generated under the null hypothesis we can use the simulations we already have to calculate these real \(p\) values up to simulation error.
Let’s do it here for the lm_robust case.
true_p <-
  simulations |>
  filter(estimator == "lm_robust") |>
  group_by(N) |>
  # p is the share of simulated |estimates| at least as large as this one; the dummy
  # value 10000 and the [-1] make percent_rank return exactly that share
  mutate(p = 1 - percent_rank(c(10000, abs(round(estimate, 5))))[-1]) |>
  ungroup()
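The percent_rank line is the trick worth flagging. A toy illustration (with made-up numbers, added for exposition): prepending a value larger than any estimate and dropping it afterwards returns, for each estimate, the share of all estimates at least as large in absolute value, which is exactly the simulated \(p\):

x <- c(1, 2, 2, 3)
1 - percent_rank(c(10000, abs(x)))[-1]
## [1] 1.00 0.75 0.75 0.25
sapply(x, function(v) mean(abs(x) >= abs(v)))
## [1] 1.00 0.75 0.75 0.25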
See again that the modal p value is not 1 but below 1, reflecting the case with an imbalance of 1 between treatment and control.
true_p |> ggplot(aes(p)) + geom_histogram(boundary = 0) + facet_grid(~N)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We now graph the estimated \(p\)s against the distribution of the actual \(p\)s (calculated using the full simulated sampling distribution).
true_p |>
  group_by(N, p) |>
  summarize(mean = mean(p.value),
            lower = quantile(p.value, .025),
            upper = quantile(p.value, .975)) |>
  ggplot(aes(p, mean)) + geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.01) +
  geom_abline(intercept = 0, slope = 1, color = "orange") +
  xlab("actual p") + ylab("nominal p") +
  facet_grid(~N)
## `summarise()` has grouped output by 'N'. You can override using the `.groups`
## argument.
The bottom line is that estimated \(p\)’s are imperfect estimates: not just noisy, but biased. That’s the bad news. In fairness, though, they do a strikingly good job given that they are each calculated from only one draw of the data!
Interestingly, if you do this for the logit case you get conflicting \(p\) estimates in cases where there is a very large estimate with large uncertainty: the large estimate produces a \(p\) of essentially 0 when judged against the sampling distribution of estimates, but the large standard error produces a \(p\) of essentially 1 from the significance test.
Note that the set of distinct true \(p\) values is quite limited. This is because the set of possible estimates is itself coarse: with 50 units per arm, for example, an effect estimate of 0.01 or 0.03 is not possible.
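One quick way to see this (an added check, not in the original code) is to enumerate the possible estimates when there are 50 units per arm: they are all multiples of 1/50 = 0.02.

possible_estimates <- sort(unique(as.vector(abs(outer(0:50, 0:50, "-"))))) / 50
head(possible_estimates)
## [1] 0.00 0.02 0.04 0.06 0.08 0.10
0.01 %in% possible_estimates
## [1] FALSE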
Things are again better with larger \(N\).
Given the small space of possible estimates for the linear model we can calculate both the real \(p\)s and the exact distribution of the nominal \(p\)s analytically.
X <- rep(0:1, each = 50)

# function to return the p value we would derive from each possible Y distribution
est_p <- function(y0, y1){
  df <- data.frame(Y = rep(c(0, 1, 0, 1), c(50 - y0, y0, 50 - y1, y1)))
  lm_robust(Y ~ X, data = df)$p.value[2]
}
# All possible distributions of Y (with some variance in Y)
Ys <- expand_grid(Y0 = 0:50, Y1 = 0:50) |>
  filter(Y0 + Y1 != 0, Y0 + Y1 != 100)
# The real p values derived from the sampling distribution of estimates
real_ps <- Ys |>
  mutate(prob = dbinom(Y0, 50, .85) * dbinom(Y1, 50, .85),
         estimate = abs(Y1 - Y0)/50) |>
  group_by(estimate) |>
  summarize(prob_estimate = sum(prob)) |>
  arrange(-estimate) |>
  mutate(real = cumsum(prob_estimate)) |>
  select(prob_estimate, estimate, real)
# The exact distribution of nominal p values; note rounding at 4th digit
nominal_ps <- Ys |>
  mutate(prob = dbinom(Y0, 50, .85) * dbinom(Y1, 50, .85),
         nominal = round(sapply(1:nrow(Ys), function(j) est_p(Ys$Y0[j], Ys$Y1[j])), 4),
         estimate = abs(Y1 - Y0)/50)
# Combine
combined <- left_join(nominal_ps, real_ps)

## Joining with `by = join_by(estimate)`

combined <- combined |>
  mutate(cdf = sapply(combined$nominal, function(p)
    sum(combined$prob[combined$nominal <= p])))
Here is the distribution of the real p values:
real_ps |> ggplot(aes(real, prob_estimate)) + geom_point() +
  geom_linerange(aes(x = real, ymax = prob_estimate, ymin = 0)) +
  ylab("probability") + xlab("real p value")
This distribution obviously fails for purpose B (uniformity of the full distribution) but by construction it performs perfectly for purpose A (controlling rejection rates).
Note that 1 is not the most common value for the actual \(p\) values but it is for the estimated \(p\) values. The reason is that there is more scope for an absolute difference of 0.02 (i.e. 1/50) between treatment and control (which produces a \(p\) value of 0.88) than for a difference of 0 (which produces a \(p\) value of 1). However, the former corresponds to multiple possible estimated \(p\) values, since different data patterns that produce the same estimate can produce different estimates of the standard error.
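You can check the first part of that claim directly from the binomial probabilities (an added check, not in the original code): under the null a one-success difference between arms is indeed more likely than a zero difference.

p_diff_0 <- sum(dbinom(0:50, 50, .85)^2)                            # P(estimate = 0)
p_diff_1 <- 2 * sum(dbinom(0:49, 50, .85) * dbinom(1:50, 50, .85))  # P(|estimate| = 0.02)
c(p_diff_0 = p_diff_0, p_diff_1 = p_diff_1)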
You can also get some intuition for the standard error point from the joint distribution of estimates and \(p\) values: the same estimate can produce different \(p\) values depending on the estimated standard error.
simulations |> filter(N == 500 & estimator == "lm_robust") |>
ggplot(aes(estimate, p.value, color = std.error)) + geom_point()
How good are the estimated \(p\)s? We can answer that exactly here.
A plot of the real against the nominal \(p\) values, both calculated analytically:
combined |> ggplot(aes(real, nominal, size = prob)) + geom_point() + geom_abline(intercept = 0, slope = 1, color = "orange")
The CDF of the nominal \(p\) values:
combined |> ggplot(aes(nominal, cdf, size = prob)) +
  geom_point() + geom_abline(intercept = 0, slope = 1, color = "orange")
Zooming in:
combined |> filter(nominal <=.1) |> ggplot(aes(nominal, cdf, size = prob)) +
geom_point() + geom_abline(intercept = 0, slope = 1, color = "orange")
Performance for small values:
((1:10)/100) |> sapply(function(j) c(j, sum(combined$prob[combined$nominal <= j]) |> round(3)))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0.010 0.020 0.030 0.040 0.050 0.060 0.070 0.080 0.090 0.100
## [2,] 0.009 0.018 0.032 0.042 0.047 0.057 0.072 0.077 0.095 0.096
The approach to getting the “actual” \(p\) above uses the entire data generating process and the simulated (or calculated) distribution of estimates. Of course in practice you won’t have access to this.
A related approach is to use randomization inference (ri) to assess the \(p\) value for the sharp null of no effect. Subtly, the ri procedure conditions on the distribution of \(Y\): it is guaranteed to be correct (up to simulation error) with respect to the distribution generated by the randomization.
However, we see here that it is not correct for the expected \(p\) value when the expectation also takes account of the part of the generation of \(Y\) not due to \(X\).
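For reference, here is what plain randomization inference looks like for a single dataset (a sketch added for exposition, using complete_ra from randomizr as elsewhere in this post): it holds the realized \(Y\) fixed and only re-randomizes \(X\).

Y <- rbinom(100, 1, 0.85)                     # one realized outcome vector
X <- complete_ra(100)
observed <- abs(mean(Y[X == 1]) - mean(Y[X == 0]))

# re-randomize X many times, holding Y fixed
ri_estimates <- replicate(10000, {
  X_ri <- complete_ra(100)
  abs(mean(Y[X_ri == 1]) - mean(Y[X_ri == 0]))
})
p_ri <- mean(ri_estimates >= observed)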
This declaration makes it possible to use the simulations to calculate ri p values.
design <-
  declare_model(N = 100, Y = rbinom(N, 1, 0.85)) +   # Y drawn first (step 1)
  declare_measurement(X = complete_ra(N)) +          # X assigned after Y (step 2)
  declare_estimator(Y ~ X)
if(run)
  design |>
    # 500 draws of Y; for each, 1000 re-randomizations of X
    simulate_design(sims = c(500, 1000, 1)) |>
    write_rds("simulations.rds")
simulations <- read_rds("simulations.rds") |>
  mutate(estimate = round(estimate, 4)) |>
  group_by(step_1_draw) |>                                         # group by draw of Y
  mutate(p_ri = 1 - percent_rank(c(10000, abs(estimate)))[-1],     # ri p value within the Y draw
         cumulative_ri = cume_dist(p_ri)) |>                       # cdf conditional on the Y draw
  ungroup() |>
  mutate(cumulative_ri_all = cume_dist(p_ri))                      # unconditional cdf
We see that we hit the 45 degree line perfectly for each draw of \(Y\).
simulations |>
  filter(step_1_draw <= 4) |>
  ggplot(aes(p_ri, cumulative_ri)) + geom_point() + facet_wrap(~step_1_draw)
However, the cumulative distribution is a bit of a mess when \(Y\) is stochastic. The figure above shows the cdf conditional on the distribution of \(Y\) for 4 draws; the figure below shows the unconditional cdf.
simulations |> ggplot(aes(p_ri, cumulative_ri_all)) + geom_point() +
geom_abline(intercept = 0, slope = 1, color = "orange")
For some intuition: the distributions (1/3, 2/3, 1) and (2/3, 2/3, 1) both have cdfs with \(F(p) = p\) on their supports, but their mixture (1/3, 2/3, 2/3, 2/3, 1, 1) does not.
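A few lines of code (added for illustration) make the same point:

p_a <- c(1/3, 2/3, 1)
p_b <- c(2/3, 2/3, 1)
cdf_at <- function(values, p) sapply(p, function(v) mean(values <= v))

cdf_at(p_a, unique(p_a))             ## 0.333 0.667 1.000 : F(p) = p on the support
cdf_at(p_b, unique(p_b))             ## 0.667 1.000       : F(p) = p on the support
cdf_at(c(p_a, p_b), c(1/3, 2/3, 1))  ## 0.167 0.667 1.000 : F(1/3) is no longer 1/3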
How does coverage look?
Richard pointed to this piece, which highlights bad behavior of confidence intervals for small \(N\) when estimating sample proportions (https://www.jstor.org/stable/2685469?origin=crossref&seq=6). We are interested here in differences in means, which is a somewhat different problem, and it seems that in general things are better behaved.
Even so, for small \(N\) you see that coverage is always too high. For some intuition: if \(N = 6\), with three cases with \(X = 0\) and three with \(X = 1\), the only distribution of \(Y\) that produces a confidence interval that does not cover 0 is one with a perfect correlation between \(X\) and \(Y\). This is most likely when the probability that \(Y = 1\) is 0.5. But even then this occurs, and we get significant results, just \(2 \times 0.5^6 \approx 3\%\) of the time under the null. So coverage never gets as low as 95%.
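To verify the arithmetic (a small added check): the probability of perfect correlation under the null with \(\Pr(Y = 1) = q\) is \(2q^3(1-q)^3\), which is maximized at \(q = 0.5\).

q <- seq(0.01, 0.99, 0.01)
max(2 * q^3 * (1 - q)^3)   ## 0.03125, attained at q = 0.5
2 * 0.5^6                  ## 0.03125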
N <- 100
prob <- .85

design <-
  declare_model(N = N, X = rep(0:1, N/2), Y = rbinom(N, 1, prob)) +
  declare_inquiry(ATE = 0) +
  declare_estimator(Y ~ X)

if(run)
  design |>
    redesign(N = c(10, 100), prob = seq(.04, .96, .02)) |>
    diagnose_design(sims = 20000) |>
    write_rds("saved/coverage2.rds")

read_rds("saved/coverage2.rds") |>
  tidy() |>
  filter(diagnosand == "coverage") |>
  ggplot(aes(prob, estimate)) + geom_line() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high, width = 0)) +
  ylim(.7, 1) +
  facet_grid(N ~ ., labeller = label_both) +
  geom_hline(yintercept = 0.95, color = "orange")
# function to check whether the confidence interval covers 0 for each possible Y distribution
covered <- function(y0, y1, X, n){
  if(y0 == y1) return(TRUE)
  df <- data.frame(Y = rep(c(0, 1, 0, 1), c(n/2 - y0, y0, n/2 - y1, y1)))
  M <- lm_robust(Y ~ X, data = df)
  M$conf.low[2] * M$conf.high[2] <= 0
}
qs <- seq(.02, .98, .01)

coverage <- function(N){
  # Xs
  X <- rep(0:1, each = N/2)
  # All possible distributions of Y
  Ys <- expand_grid(Y0 = 0:(N/2), Y1 = 0:(N/2))
  # Coverage for each distribution
  Ys <- Ys |>
    mutate(covered = sapply(1:nrow(Ys), function(j) covered(Ys$Y0[j], Ys$Y1[j], X, N)))
  # Expected coverage given Pr(Y = 1) = prob
  cov <- function(prob)
    Ys |>
      mutate(q = dbinom(Y0, N/2, prob) * dbinom(Y1, N/2, prob)) |>
      summarize(coverage = weighted.mean(covered, q))
  sapply(qs, cov) |> unlist()
}
c('6' = 6, '10' = 10, '20' = 20, '40' = 40, '100' = 100, '200' = 200) |>
  lapply(function(N) data.frame(N = N, q = qs, coverage = coverage(N))) |>
  bind_rows(.id = "N") |>
  ggplot(aes(q, coverage)) + geom_point() +
  facet_wrap(~as.numeric(N)) + ylim(.7, 1) + geom_hline(yintercept = .95)
Note there’s also a set of interesting posts here: https://twitter.com/dggoldst/status/1656479972129755141