1 Overview

Last week in class we discussed the idea that \(p\) values are uniformly distributed under the null. Coincidentally, the same day Richard McElreath put out an interesting thread saying that they are not, not even in theory. https://twitter.com/rlmcelreath/status/1677322772719054849 1

McElreath showed an example where the distribution seemed very far from uniform for a pretty standard application: an experiment with a binary \(X\) and \(Y\) and a null of no effect for any unit (so his example presupposed a sharp null).

1.1 Concepts

First, a conceptual clarification on what \(p\) values are. We can distinguish between the real (or actual) \(p\) values and the estimated (or nominal) \(p\) values. The real (or actual) \(p\) value reports the probability of seeing a test statistic (e.g. an estimate) as large as the one observed under some null, for instance the null of no effect. The estimated (or nominal) \(p\) value is what gets reported in papers; it is an estimate of the real \(p\) value.
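
In symbols (a rough statement for a two-sided test, with \(\Pr_0\) denoting probability under the null and \(b\) the observed estimate):

\[
p_{\text{real}}(b) = \Pr_0\left(|\hat\beta| \geq |b|\right),
\qquad
p_{\text{nominal}}(b) \approx 2\left(1 - \Phi\!\left(\frac{|b|}{\widehat{\text{se}}(b)}\right)\right),
\]

where \(\Phi\) is the standard normal cdf and the second expression is the familiar normal (or \(t\)) approximation that plugs in an estimated standard error; it is this plug-in step that makes the nominal \(p\) value an estimate of the real one.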

1.2 What’s at stake

If the \(p\) values are far off in this simple case then we really have a problem. (Of course we do have multiple problems with \(p\)s, but the issue here is whether they are reliable on their own merits.)

Why’s that? Because we get two distinct things from the claim that \(p\) values are uniform over \([0,1]\).

  • A. First we get from this the claim that the probability of seeing a \(p\) value less than or equal to \(\alpha\) under the null is itself \(\alpha\). So the probability of seeing a \(p\) value of 0.05 or less is 0.05. (“Validity” definitions have a weak inequality: \(\Pr(p \leq \alpha) \leq \alpha\).)
  • B. Second we have an expectation of what the distribution of \(p\) values should look like if indeed the null is true – for example for balance tests (should you do them) for baseline covariates after randomization. This feature is sometimes used to assess the risk of fraud. To wit: https://twitter.com/GidMK/status/1656228772633522177

McElreath’s critique could apply to either the real or the estimated \(p\) values and threaten both claim B and, perhaps, claim A.

To explore further I took a bit of a deep dive into McElreath’s example to understand better what’s going on there. I ran the same simulation but increased the number of simulations, let \(N\) and the estimation method vary, and calculated the estimated (“nominal”) \(p\) values, the correct (real) \(p\) values (both analytically and via simulation), and the exact randomization inference \(p\) values for nulls for a collection of sample average treatment effects.

1.3 Headline results

Full declaration and simulations are below but the headlines are:

  • it is unambiguous that both actual and nominal \(p\) values can fail with respect to \(B\): you can see this directly from McElreath’s histogram for the nominal \(p\) values, but it is also true for the real \(p\) values. We saw examples of this already in the randomization inference applications. Thus the distribution of \(p\) values will obviously not be uniform over [0,1]. This is a challenge for claim B for both real and nominal \(p\) values.
  • real \(p\) values should not fail with respect to \(A\). To see why, imagine the sampling distribution of the estimates: all the estimates given our data and answer strategy. Then the \(p\) value associated with some estimate \(b\) that you might get is the probability mass at or beyond \(b\) and \(-b\). Any test statistic beyond \(b\) that might arise will produce a lower \(p\) value. So if test statistic \(x\) produces value \(p\) then the probability of \(p\) or lower is the probability of a statistic as large or larger than \(x\) in absolute value, which is exactly what \(p\) reports (a short derivation follows this list).
  • even if \(B\) fails it is still possible that \(A\) holds for nominal \(p\) values. For instance when we generate \(p\) values through randomization inference we are guaranteed that this property will hold when the distribution is taken with respect to the randomization. In fact when we calculate the exact distribution (or look with a large number of simulations) it looks like this property does hold well for low values of \(\alpha\); things go off somewhat at the high end. However the \(\alpha\)s used in tests are typically taken from the low end, e.g. 0.01, 0.05, 0.1, where as far as I can see things really do look OK.
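
In symbols, write \(p(b) = \Pr_0(|\hat\beta| \geq |b|)\), the probability under the null of an estimate at least as large as \(b\) in absolute value, for the real \(p\) value of \(b\). Since \(p(\cdot)\) is decreasing in \(|b|\) across the support of \(\hat\beta\), the event of getting a \(p\) value at or below \(p(b)\) is exactly the event of getting an estimate at least as large as \(b\) in absolute value:

\[
\Pr_0\big(p(\hat\beta) \leq p(b)\big) = \Pr_0\big(|\hat\beta| \geq |b|\big) = p(b).
\]

For values of \(\alpha\) that fall between points in the support of the real \(p\) values we only get \(\Pr_0(p \leq \alpha) \leq \alpha\), which is the weaker “validity” inequality from claim A.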

1.4 Secondary results

A couple of other things of note.

  • The issue here doesn’t seem to have much to do with OLS or logit; these produce essentially the same distributions of \(p\) values (though with some interesting exceptions).
  • The real \(p\) values have an even smaller support than the nominal \(p\) values from OLS and logit. This is because they are calculated using the sampling distribution of estimates only, whereas the estimated (nominal) \(p\) values also use information about the estimated variance of the sampling distribution, which varies across runs. Thus the distribution of the actual \(p\) values poses a problem for purpose B (it is far from uniform) but not for purpose A.
  • The randomization inference \(p\) values have property B (are valid) conditional on the distribution of \(Y\)—these are constructed to be so; but they do not have this feature across distributions of \(Y\). This is because they target the sample average effect. Nevertheless it means that if we see multiple \(p\) values generated via randomization inference for different outcomes (for instance) we should not expect the \(p\) values to be uniformly distributed.

1.5 And also

  • These notes show how to do this kind of simulation in DeclareDesign along with a trick to use multistage simulation to calculate randomization inference \(p\) values.
  • The code is written so that it is easy to alter the key features of the design: the \(N\) and the probability that \(Y = 1\).

2 Simulation approach

I start with a simulation approach since, although it’s not necessary for the current example, it is more flexible in general.

2.1 Basic Declaration

The simulations are done using DeclareDesign. The design is very simple:

library(DeclareDesign) # also attaches fabricatr, randomizr, and estimatr
library(tidyverse)

N <- 100

design <-
  declare_model(N = N, X = rep(0:1, N/2), Y = rbinom(N, 1, 0.85)) +
  declare_estimator(Y ~ X, .method = glm, family = binomial(), label = "logit") +
  declare_estimator(Y ~ X, .method = lm_robust, label = "lm_robust") 

If you simulate this design and plot the \(p\) values you get essentially the same graph as in the post. And you get essentially the same graph whether you use lm_robust or logit.
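
For instance, a quick look (a minimal sketch with far fewer simulations than used below, assuming the packages loaded above):

design |>
  simulate_design(sims = 2000) |>
  ggplot(aes(p.value)) + 
  geom_histogram(bins = 50, boundary = 0) + 
  facet_wrap(~estimator)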

Because I am doing a very large number of simulations I am going to save the simulation object and read it back in (the run flag turns the simulations on and off).

if(run)
  design |> 
  redesign(N = c(100, 500)) |>   # Vary N
  simulate_design(sims = 100000) |> write_rds("sims_100_500.rds")

simulations <- read_rds("sims_100_500.rds") |>
  mutate(cdf = cume_dist(round(p.value, 4)))

simulations |> 
  ggplot(aes(p.value)) + geom_histogram(bins = 50, boundary = 0) + facet_grid(N~estimator)
Distribution of nominal p values

Note that I coarsened the p.values slightly, rounding to 4 decimal places, so that the cdf is easier to read and not responsive to extremely tiny differences in estimated \(p\) values.

If you plot the cdf you see it does pretty well at the low end. Things go wacky at the high end, though in practice people really use \(p\) values at the low end.

simulations |> ggplot(aes(p.value, cdf)) + geom_point() + 
  facet_grid(N~estimator) + 
  geom_abline(intercept = 0, slope = 1, color = "orange")

Let’s zoom in to \(p\) values in the range \((0, .1]\):

simulations |> filter(p.value <= .1) |>
  ggplot(aes(p.value, cdf)) + geom_point() +
  facet_grid(N~estimator) +
  geom_abline(intercept = 0, slope = 1, color = "orange")

On visual inspection here the fit is very good. Things are quite good too when you look at a set of specific values:

simulations |> filter(N == 100 & estimator == "lm_robust") |> pull(p.value) |>
  quantile(probs = c(.01, .02, .03, .04, .05))|> round(3)
##    1%    2%    3%    4%    5% 
## 0.011 0.021 0.029 0.038 0.051

Fit as a regression, the \(R^2\) would be 99.8% (lm(cdf ~ p.value, data = simulations)).

Things are better still with a larger \(N\), which allows for fuller support for the \(p\) values.

simulations |> filter(N == 500 & estimator == "lm_robust") |> pull(p.value) |>
  quantile(probs = c(.01, .02, .03, .04, .05))|> round(3)
##    1%    2%    3%    4%    5% 
## 0.010 0.020 0.030 0.039 0.049

2.2 Comparison with the real \(p\) values

How close are the “nominal” \(p\)-values estimated here to the “real” \(p\)’s?

Recall that the real \(p\) values report the probability of getting an estimate as large or larger than what we get under the null hypothesis. Since we have lots of data generated under the null hypothesis we can use the simulations we already have to calculate these real \(p\) values up to simulation error.

Let’s do it here for the lm case.

true_p <- 
  simulations |> filter(estimator == "lm_robust") |>
  group_by(N) |>
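  # The real p for each estimate is the share of simulated estimates at least
  # as large in absolute value: the 10000 sentinel keeps percent_rank's
  # denominator at the number of simulated estimates, and [-1] drops it again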
  mutate(p = 1-percent_rank(c(10000, abs(round(estimate, 5))))[-1]) |>
  ungroup() 

See again that the modal p value is not 1 but below 1, reflecting the case with an imbalance of 1 between treatment and control.

true_p |> ggplot(aes(p)) + geom_histogram(boundary = 0) + facet_grid(~N)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of real (actual) \(p\) values is also not uniform

We now graph the estimated \(p\)s against the distribution of the actual \(p\)s (calculated using the full simulated sampling distribution).

true_p |>
  group_by(N, p) |> 
  summarize(mean = mean(p.value), 
            lower = quantile(p.value, .025), 
            upper = quantile(p.value, .975)) |>
  ggplot(aes(p, mean)) + geom_point() + geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.01) + geom_abline(intercept = 0, slope = 1, color = "orange") + 
  xlab("actual p") + ylab("nominal p") + facet_grid(~N)
## `summarise()` has grouped output by 'N'. You can override using the `.groups`
## argument.
The nominal \(p\) values are close to but not centered on the actual \(p\) values

The bottom line is that estimated \(p\)’s are imperfect estimates: not just noisy, but biased. That’s the bad news. In fairness though they do a strikingly good job given that they are (each) calculated from only one draw of the data!

Interestingly, if you do this for the logit case you get conflicting \(p\) estimates in cases where there is a very large estimate with large uncertainty: the very large estimate is extreme relative to the sampling distribution, so it gets a \(p\) near 0 from that distribution, but the very large standard error produces a \(p\) near 1 from significance testing.

Note that the support of the true p values is quite narrow. This is because the set of possible estimates is small: with 50 units per arm the difference in means is a multiple of 1/50, so an estimate of 0.01 or 0.03 is not possible.
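
A quick illustration, just enumerating the possible differences in means for this design:

possible_estimates <- sort(unique(as.vector(abs(outer(0:50, 0:50, "-")) / 50)))
head(possible_estimates) # 0.00 0.02 0.04 0.06 0.08 0.10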

Things are again better with larger \(N\).

3 Analytic results

Given the small space of possible estimates for the linear model we can calculate both the real \(p\)s and the exact distribution of the nominal \(p\)s analytically.

X <- rep(0:1, each = 50)

# function to return the $p$ value we would derive from each possible Y distribution
est_p <- function(y0, y1){
  df <- data.frame(X = X, Y = rep(c(0,1,0,1), c(50-y0, y0, 50-y1, y1)))
  lm_robust(Y ~ X, data = df)$p.value[2]
}

# All possible distributions of Y (with some variance)
Ys <- expand_grid(Y0 = 0:50, Y1 = 0:50) |>
  filter(Y0 + Y1 != 0)|>
  filter(Y0 + Y1 != 100)

# The real p values derived from the sampling distribution of estimates
real_ps <- Ys |>
  mutate(prob = dbinom(Y0, 50, .85)*dbinom(Y1, 50, .85),
         estimate = abs(Y1 - Y0)/50) |>
  group_by(estimate) |>
  summarize(prob_estimate = sum(prob)) |>
  arrange(-estimate) |>
  mutate(real = cumsum(prob_estimate)) |>
  select(prob_estimate, estimate, real)

# The exact distribution of nominal p values; note rounding at 4th digit
nominal_ps <- 
  Ys |>
  mutate(
    prob = dbinom(Y0, 50, .85)*dbinom(Y1, 50, .85),
    nominal = round(sapply(1:nrow(Ys), function(j) est_p(Ys$Y0[j], Ys$Y1[j])), 4),
    estimate = abs(Y1 - Y0)/50) 

# Combine
combined <- 
  left_join(nominal_ps, real_ps) 
## Joining with `by = join_by(estimate)`
combined <- 
  combined |> 
  mutate(cdf = sapply(combined$nominal, function(p) 
    sum(combined$prob[combined$nominal <=p])))

Here is the distribution of the real p values:

real_ps |> ggplot(aes(real, prob_estimate)) + geom_point() + 
  geom_linerange(aes(x=real, ymax=prob_estimate, ymin=0)) + ylab("probability") + 
  xlab("real p value") 
The distribution of real (actual) \(p\) values is also not uniform (here calculated not simulated)

This distribution obviously fails for purpose B but by construction it performs perfectly for purpose A.

Note that 1 is not the most common value for the actual \(p\) values but it is for the estimated \(p\) values. The reason is that there is more scope for having differences of 0.02 (i.e. 1/50) between treatment and control in absolute terms (which produces a \(p\) value of 0.88) than for differences of 0 (which produces a \(p\) value of 1). However the former corresponds to multiple possible estimated \(p\) values, since different data patterns that produce the same estimate can produce different estimates of the standard error.
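
As a quick check of the first part of this claim, reusing the Ys grid from above:

Ys |>
  mutate(prob = dbinom(Y0, 50, .85) * dbinom(Y1, 50, .85),
         diff = abs(Y1 - Y0)) |>
  group_by(diff) |>
  summarize(prob = sum(prob)) |>
  filter(diff <= 1)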

You can get intuition for this from the joint distribution of estimates and p values. The same estimate can produce different p values depending on the estimated standard error.

simulations |> filter(N == 500 & estimator == "lm_robust") |> 
  ggplot(aes(estimate, p.value, color = std.error)) + geom_point()
Each estimate is associated with multiple nominal p values that differ because the estimated standard errors differ

How good are the estimated \(p\)s? We can answer that exactly here.

Here is a plot of the real against the nominal p values, both calculated analytically:

combined |> ggplot(aes(real, nominal, size = prob)) + geom_point() +   geom_abline(intercept = 0, slope = 1, color = "orange")

The CDF of the nominal p values:

combined |> ggplot(aes(nominal, cdf, size = prob)) + 
  geom_point() + geom_abline(intercept = 0, slope = 1, color = "orange")
The calculated cumulative distribution of nominal p values, point sizes show how likely particular values are to arise.

Zooming in:

combined |> filter(nominal <=.1) |> ggplot(aes(nominal, cdf, size = prob)) + 
  geom_point() + geom_abline(intercept = 0, slope = 1, color = "orange")
The calculated cumulative distribution of nominal p values, point sizes show how likely particular values are to arise. Lower end.

Performance for small values:

((1:10)/100) |> sapply(function(j) c(j, sum(combined$prob[combined$nominal <= j]) |> round(3)))
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
## [1,] 0.010 0.020 0.030 0.040 0.050 0.060 0.070 0.080 0.090 0.100
## [2,] 0.009 0.018 0.032 0.042 0.047 0.057 0.072 0.077 0.095 0.096

4 Randomization inference

The approach to getting the “actual” \(p\) above uses the entire data generating process and the simulated (or calculated) distribution of estimates. Of course in practice you won’t have access to this.

A related approach is to use randomization inference (ri) to assess the \(p\) value for the sharp null of no effect. Subtly, the ri procedure conditions on the realized distribution of \(Y\). It is guaranteed to be correct (up to simulation error) with respect to the distribution generated from the randomization.
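
To fix ideas, here is a minimal sketch of the ri calculation for a single realized dataset (ri_p is just an illustrative helper, not part of the packages used here; complete_ra comes from randomizr):

ri_p <- function(Y, X, sims = 2000) {
  observed <- abs(mean(Y[X == 1]) - mean(Y[X == 0]))
  null_dist <- replicate(sims, {
    X_star <- complete_ra(length(Y))                  # re-draw the assignment
    abs(mean(Y[X_star == 1]) - mean(Y[X_star == 0]))  # re-estimate under the sharp null
  })
  mean(null_dist >= observed)                         # share at least as large
}

Note that \(Y\) is held fixed throughout: only the assignment varies, which is what conditioning on the distribution of \(Y\) means here.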

However we see here that it is not correct when the distribution of the \(p\) values is taken over the generation of \(Y\) as well, and not just over the assignment of \(X\).

This declaration makes it possible to use the simulations to calculate ri p values.

design <-
  declare_model(N = 100, Y = rbinom(N, 1, 0.85)) +
  declare_measurement(X = complete_ra(N)) +
  declare_estimator(Y ~ X)

if(run)  
design |> simulate_design(sims = c(500, 1000, 1)) |>
  write_rds("simulations.rds")

simulations <- read_rds("simulations.rds") |>
  mutate(estimate = round(estimate, 4)) |>
  group_by(step_1_draw) |>
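  # step_1_draw indexes the draw of Y; within each draw the re-assignments of X
  # form the randomization distribution used for the ri p value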
  mutate(p_ri = 1-percent_rank(c(10000, abs(estimate)))[-1],
         cumulative_ri = cume_dist(p_ri)) |>
  ungroup() |>
  mutate(cumulative_ri_all = cume_dist(p_ri))

We see that we hit the 45 degree line perfectly for each draw of \(Y\).

  simulations |> 
  filter(step_1_draw <=4) |> 
  ggplot(aes(p_ri, cumulative_ri)) + geom_point() + facet_wrap(~step_1_draw)
The simulated cumulative distribution of nominal p values from randomization inference for different distributions of \(Y\).

However the cumulative distribution is a bit of a mess when \(Y\) is stochastic. The figure above shows the cdf conditional on the distribution of \(Y\) for four draws; the figure below shows the unconditional cdf.

  simulations |> ggplot(aes(p_ri, cumulative_ri_all)) + geom_point() + 
    geom_abline(intercept = 0, slope = 1, color = "orange")
The simulated cumulative distribution of nominal p values from randomization inference where the cumulative distribution takes account of sampling variation and not just variation in assignment.

For some intuition: the distributions (1/3, 2/3, 1) and (2/3, 2/3, 1) both have cdfs with \(F(p) = p\) on their supports, but their mixture (1/3, 2/3, 2/3, 2/3, 1, 1) does not.
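
The same point as a toy calculation (cdf_at is just an ad hoc helper for this note):

cdf_at <- function(x, support) sapply(x, function(v) mean(support <= v))

cdf_at(c(1/3, 2/3, 1), c(1/3, 2/3, 1))               # 1/3, 2/3, 1: F(p) = p
cdf_at(c(2/3, 1),      c(2/3, 2/3, 1))               # 2/3, 1: F(p) = p
cdf_at(c(1/3, 2/3, 1), c(1/3, 2/3, 2/3, 2/3, 1, 1))  # 1/6, 2/3, 1: fails at 1/3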

5 Coverage of confidence intervals

How does coverage look?

Richard pointed to this piece, which highlights bad behavior of confidence intervals for small \(N\) in cases where we are estimating sample proportions: https://www.jstor.org/stable/2685469?origin=crossref&seq=6

We are interested here in differences in means, which is a somewhat different problem, and it seems that in general things are better behaved.

Even still, for small \(N\) you see that coverage is always too high. For some intuition, if \(N=6\) and there are three cases with \(X=0\) and three with \(X=1\), the only distribution of \(Y\) that will produce a confidence interval that does not cover 0 is one in which there is a perfect correlation between \(X\) and \(Y\). This is most likely to happen when the probability that \(Y=1\) is 0.5. But even then this occurs (and we get significant results) just \(2 \times .5^6 \approx 3\%\) of the time under the null. So coverage will never get as low as 95%.
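
A quick check of that arithmetic:

# Under the null the CI misses zero only when X and Y are perfectly correlated:
# all controls 0 and all treated 1, or the reverse; more generally this has
# probability 2 * p^3 * (1 - p)^3, maximized at p = 0.5
p <- 0.5
2 * p^3 * (1 - p)^3 # 0.03125, i.e. about 3%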

5.1 Simulated

N <- 100
prob <- .85

design <-
  declare_model(N = N, X = rep(0:1, N/2), Y = rbinom(N, 1, prob)) +
  declare_inquiry(ATE = 0) +
  declare_estimator(Y ~ X) 

if(run)
design |> redesign(N = c(10, 100), prob = seq(.04, .96, .02)) |> 
  diagnose_design(sims = 20000) |> 
  write_rds("saved/coverage2.rds")
      
read_rds("saved/coverage2.rds") |> tidy() |>
  filter(diagnosand == "coverage") |> ggplot(aes(prob, estimate)) + geom_line() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high, width  = 0)) +
  ylim(.7,1) + facet_grid(N~., labeller = label_both) + geom_hline(yintercept = 0.95, color = "orange")                   

5.2 Exact

# function returning whether the confidence interval covers 0 for each possible Y distribution

covered <- function(y0, y1, X, n){
  if(y0 == y1) return(TRUE)
  df <- data.frame(Y = rep(c(0,1,0,1), c(n/2-y0, y0, n/2-y1, y1)))
  M  <- lm_robust(Y ~ X, data = df)
  M$conf.low[2]*M$conf.high[2] <=0
  }

qs <- seq(.02, .98, .01)

coverage <- function(N){
  
  # Xs
  X <- rep(0:1, each = N/2)
  
  # All possible distributions of Y (with some variance)
  Ys <- expand_grid(Y0 = 0:(N/2), Y1 = 0:(N/2)) 
  
  # Coverage for each distribution
  Ys <- Ys |>
    mutate(covered = sapply(1:nrow(Ys), function(j) covered(Ys$Y0[j], Ys$Y1[j], X, N)))
  
  cov <- function(prob)
    Ys |>
    mutate(q = dbinom(Y0, N/2, prob)*dbinom(Y1, N/2, prob)) |>
    summarize(coverage = weighted.mean(covered, q))
  
  sapply(qs, cov) |> unlist()
  }

c('6'=6, '10'=10, '20'=20, '40'=40, '100'=100, '200'=200) |>
  lapply(function(N) data.frame(N = N, q = qs, coverage = coverage(N))) |> 
  bind_rows(.id = "N") |> 
  ggplot(aes(q, coverage)) + geom_point() + 
  facet_wrap(~as.numeric(N)) +ylim(.7,1) + geom_hline(yintercept = .95)

6 Open questions

  • Why does the cdf of the nominal \(p\) values have the wavy shape that we see?
  • Is it generally the case that the cumulative distribution of ri \(p\) values lies so far below the 45 degree line?

  1. Note there’s also a set of interesting posts here: https://twitter.com/dggoldst/status/1656479972129755141↩︎