Last week in class we discussed the idea that \(p\) values are uniformly distributed under the null. Coincidentally, the same day Richard McElreath put out an interesting thread saying that they are not, not even in theory: https://twitter.com/rlmcelreath/status/1677322772719054849
McElreath showed an example where the distribution seemed very far from uniform for a pretty standard application: an experiment with a binary \(X\) and \(Y\) and a null of no effect for any unit (so his example presupposed a sharp null).
First, a conceptual clarification on what \(p\) values are. We can distinguish between the real, or actual, \(p\) values and the estimated, or nominal, \(p\) values. The real (or actual) \(p\) value reports the probability of seeing a test statistic (e.g. an estimate) as large as the one observed under some null, for instance the null of no effect. The estimated (or nominal) \(p\) value is what gets reported in papers; it is an estimate of the real \(p\) value.
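In symbols, for a two-sided test based on a statistic \(T\) with observed value \(t\) (a generic formulation, not specific to McElreath's example):
\[
p_{\text{real}}(t) = \Pr\big(|T| \ge |t| \mid H_0\big).
\]
The nominal \(p\) value is whatever the chosen procedure reports as its estimate of this quantity.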
If the \(p\) values are far off in this simple case then we really have a problem. (Of course we do have multiple problems with \(p\)s, but the issue here is whether they are reliable on their own merits.)
Why’s that? Because we get two distinct things from the claim that \(p\) values are uniform over \([0,1]\).
McElreath’s critique could apply to either the real or the estimated \(p\) values and threaten both claim B and, perhaps, claim A.
To explore further I took a deep dive into McElreath’s example to understand better what is going on. I ran the same simulation but increased the number of simulations, let \(N\) and the estimation method vary, and calculated the estimated (“nominal”) \(p\) values, the correct \(p\) values (analytically and via simulation), and the exact randomization inference \(p\) values for a collection of nulls on sample average treatment effects.
Full declaration and simulations are below, but the headlines are:
A couple of other things of note.
The simulations are done using DeclareDesign, along with a trick that uses multistage simulation to calculate randomization inference \(p\) values. I start with a simulation approach since, although it is not necessary for the current example, it is more flexible in general. The design is very simple:
library(DeclareDesign)
library(tidyverse)  # dplyr, ggplot2, and readr are used below

N <- 100

design <-
  declare_model(N = N, X = rep(0:1, N/2), Y = rbinom(N, 1, 0.85)) +
  declare_estimator(Y ~ X, .method = glm, family = binomial(), label = "logit") +
  declare_estimator(Y ~ X, .method = lm_robust, label = "lm_robust")
If you simulate this design and plot the \(p\) values you get essentially the same graph as in the post. And you get essentially the same graph whether you use lm_robust or the logit.
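As a quick look (a minimal sketch with a modest number of simulations; the full runs come next):
design |>
  simulate_design(sims = 2000) |>
  ggplot(aes(p.value)) +
  geom_histogram(bins = 50, boundary = 0) +
  facet_grid(~estimator)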
Because I am doing a very large number of simulations I save the simulation object and read it back in (run turns the simulations on and off).
if (run)
  design |>
    redesign(N = c(100, 500)) |> # Vary N
    simulate_design(sims = 100000) |>
    write_rds("sims_100_500.rds")

simulations <- read_rds("sims_100_500.rds") |>
  mutate(cdf = cume_dist(round(p.value, 4))) # empirical cdf of the (coarsened) p values
simulations |>
  ggplot(aes(p.value)) +
  geom_histogram(bins = 50, boundary = 0) +
  facet_grid(N ~ estimator)
Distribution of nominal p values
Note that I coarsened the p.values slightly, rounding to 4 decimal places, to make the cdf easier to read by making it unresponsive to extremely tiny differences in estimated \(p\) values.
If you plot the cdf you see it does pretty well at the low end; things go wacky at the high end. In practice, though, people mostly use \(p\) values at the low end.
simulations |>
  ggplot(aes(p.value, cdf)) +
  geom_point() +
  facet_grid(N ~ estimator) +
  geom_abline(intercept = 0, slope = 1, color = "orange")
Let’s zoom in to \(p\) values in the range \((0, .1]\):
simulations |>
  filter(p.value <= .1) |>
  ggplot(aes(p.value, cdf)) +
  geom_point() +
  facet_grid(N ~ estimator) +
  geom_abline(intercept = 0, slope = 1, color = "orange")
On visual inspection the fit here is very good. Things also look quite good when you check a set of specific quantiles:
simulations |> filter(N == 100 & estimator == "lm_robust") |> pull(p.value) |>
  quantile(probs = c(.01, .02, .03, .04, .05)) |> round(3)
## 1% 2% 3% 4% 5%
## 0.011 0.021 0.029 0.038 0.051
Fit as a regression, the \(R^2\) would be 99.8% (lm(cdf ~ p.value, data = simulations)).
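To pull that number out directly (a sketch; summary()$r.squared is the standard accessor):
fit <- lm(cdf ~ p.value, data = simulations)  # regress the empirical cdf on the nominal p values
summary(fit)$r.squared                        # about 0.998 in these simulations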
Things are better still with a larger \(N\), which allows for fuller support for the \(p\) values.
simulations |> filter(N == 500 & estimator == "lm_robust") |> pull(p.value) |>
  quantile(probs = c(.01, .02, .03, .04, .05)) |> round(3)
## 1% 2% 3% 4% 5%
## 0.010 0.020 0.030 0.039 0.049
How close are the “nominal” \(p\) values estimated here to the “real” \(p\) values?
Recall that the real \(p\) values report the probability of getting an estimate as large as or larger than the one we observe, under the null hypothesis. Since we have lots of data generated under the null hypothesis, we can use the simulations we already have to calculate these real \(p\) values up to simulation error.
Let’s do it here for the lm_robust case.
true_p <-
  simulations |>
  filter(estimator == "lm_robust") |>
  group_by(N) |>
  # Real (simulation-based) p: the share of simulated |estimates| at least as large as this one.
  # Prepending a huge value and then dropping it with [-1] keeps percent_rank from ever reaching 1.
  mutate(p = 1 - percent_rank(c(10000, abs(round(estimate, 5))))[-1]) |>
  ungroup()
See again that the modal \(p\) value is not 1 but just below 1, reflecting draws with an imbalance of one success between treatment and control, which are more common than exact ties.
true_p |> ggplot(aes(p)) + geom_histogram(boundary = 0) + facet_grid(~N)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
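A quick check of the claim that an imbalance of one success is more common than an exact tie (a sketch, assuming the \(N = 100\) design with 50 units per arm and \(\Pr(Y = 1) = 0.85\)):
p_k <- dbinom(0:50, 50, 0.85)                   # distribution of the success count in one arm
p_tie <- sum(p_k^2)                             # both arms draw exactly the same count
p_off1 <- 2 * sum(p_k[-1] * p_k[-length(p_k)])  # arms differ by exactly one success
c(tie = p_tie, off_by_one = p_off1)
In these draws an off-by-one imbalance is roughly twice as likely as an exact tie, which is why the mode of the \(p\) distribution sits just below 1.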