DeclareDesign
Q: Is my research design good?
A: Well let’s simulate it to see how it performs.
Q: What should I put in the simulation?
A: All elements of a research design.
Q: What are the elements of a research design?
A: M! I! D! A! (a Model, an Inquiry, a Data strategy, and an Answer strategy)
- Telling the computer what M, I, D, and A are.
- Estimating "diagnosands" like power, bias, RMSE, error rates, ethical harm, and amount learned.
- Fine-tuning features of the data and answer strategies to understand how they change the diagnosands.
- `declare_model()`
- `declare_inquiry()`
- `declare_assignment()`
- `declare_measurement()`
- `declare_estimator()`

…and there are more `declare_` functions!
- `draw_data(design)`
- `draw_estimands(design)`
- `draw_estimates(design)`
- `get_estimates(design, data)`
- `run_design(design)`
- `simulate_design(design)`
- `diagnose_design(design)`
- `redesign(design, N = 200)`
- `design |> redesign(N = c(200, 400)) |> diagnose_designs()`
- `compare_designs()`
- `compare_diagnoses()`
Cheat sheet: https://raw.githubusercontent.com/rstudio/cheatsheets/master/declaredesign.pdf
```r
library(DeclareDesign)  # also attaches randomizr (simple_ra) and estimatr (lm_robust)

N <- 100
b <- .5

design <-
  declare_model(N = N, U = rnorm(N),
                potential_outcomes(Y ~ b * Z + U)) +
  declare_assignment(Z = simple_ra(N), Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, inquiry = "ate", .method = lm_robust)
```
You now have a two-arm design object in memory! If you just type `design`, it will run the design, a good check to make sure the design has been declared properly.
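The printout includes a draw of simulated data like the one below; you can also pull a draw directly (a minimal sketch using `draw_data` from the list above):

```r
draw_data(design) |> head()
```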
ID | U | Y_Z_0 | Y_Z_1 | Z | Y |
---|---|---|---|---|---|
001 | -1.3967765 | -1.3967765 | -0.8967765 | 1 | -0.8967765 |
002 | 0.5233121 | 0.5233121 | 1.0233121 | 1 | 1.0233121 |
003 | 0.1422646 | 0.1422646 | 0.6422646 | 0 | 0.1422646 |
004 | -0.8466480 | -0.8466480 | -0.3466480 | 0 | -0.8466480 |
005 | -0.4118214 | -0.4118214 | 0.0881786 | 0 | -0.4118214 |
006 | -1.4650350 | -1.4650350 | -0.9650350 | 0 | -1.4650350 |
Play with the data:
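For instance, save a draw and estimate by hand (a sketch; `data` is just an illustrative name for the saved draw):

```r
data <- draw_data(design)
lm_robust(Y ~ Z, data = data)
```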
term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
---|---|---|---|---|---|---|---|---|
(Intercept) | -0.26 | 0.13 | -2.06 | 0.04 | -0.52 | -0.01 | 98 | Y |
Z | 0.81 | 0.17 | 4.76 | 0.00 | 0.47 | 1.15 | 98 | Y |
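`draw_estimands` reports the value of the inquiry in a draw of the model (here the ATE is exactly `b`):

```r
draw_estimands(design)
```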
inquiry | estimand |
---|---|
ate | 0.5 |
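`draw_estimates` simulates fresh data and applies the estimator, so the estimate below differs from the one computed on the saved draw:

```r
draw_estimates(design)
```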
estimator | term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome | inquiry |
---|---|---|---|---|---|---|---|---|---|---|
estimator | Z | 0.22 | 0.19 | 1.16 | 0.25 | -0.16 | 0.6 | 98 | Y | ate |
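`get_estimates` instead applies the answer strategy to data you supply; on the saved draw it reproduces the regression from above:

```r
get_estimates(design, data = data)
```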
estimator | term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome | inquiry |
---|---|---|---|---|---|---|---|---|---|---|
estimator | Z | 0.81 | 0.17 | 4.76 | 0 | 0.47 | 1.15 | 98 | Y | ate |
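`simulate_design` runs the full design repeatedly, one row per run (only the first rows shown):

```r
simulate_design(design)
```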
design | sim_ID | inquiry | estimand | estimator | term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
design | 1 | ate | 0.5 | estimator | Z | -0.03 | 0.22 | -0.16 | 0.88 | -0.47 | 0.40 | 98 | Y |
design | 2 | ate | 0.5 | estimator | Z | 0.72 | 0.20 | 3.58 | 0.00 | 0.32 | 1.12 | 98 | Y |
design | 3 | ate | 0.5 | estimator | Z | 0.47 | 0.20 | 2.36 | 0.02 | 0.08 | 0.86 | 98 | Y |
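`diagnose_design` turns the simulations into diagnosands, with bootstrapped standard errors in parentheses:

```r
diagnose_design(design)
```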
Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|
0.49 | -0.01 | 0.20 | 0.20 | 0.65 | 0.95 |
(0.02) | (0.02) | (0.01) | (0.01) | (0.05) | (0.02) |
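`compare_diagnoses` contrasts two designs diagnosand by diagnosand. A sketch of such a call, assuming the comparison below pits a smaller design against the `N = 100` design (the `N = 50` is an illustrative guess):

```r
compare_diagnoses(redesign(design, N = 50), design)
```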
diagnosand | mean_1 | mean_2 | mean_difference | conf.low | conf.high |
---|---|---|---|---|---|
mean_estimand | 0.50 | 0.50 | 0.00 | 0.00 | 0.00 |
mean_estimate | 0.50 | 0.50 | -0.01 | -0.04 | 0.02 |
bias | 0.00 | 0.00 | -0.01 | -0.04 | 0.02 |
sd_estimate | 0.30 | 0.20 | -0.09 | -0.12 | -0.07 |
rmse | 0.29 | 0.20 | -0.09 | -0.12 | -0.07 |
power | 0.41 | 0.70 | 0.29 | 0.23 | 0.34 |
coverage | 0.94 | 0.94 | 0.00 | -0.03 | 0.02 |
In the classical approach to testing a hypothesis we ask:
How likely are we to see data like this if indeed the hypothesis is true?
How unlikely is “not very likely”?
When we test a hypothesis we decide first on what sort of evidence we need to see in order to decide that the hypothesis is not reliable.
Othello has a hypothesis that Desdemona is innocent.
Iago confronts him with evidence:
Note that Othello is focused on the probability of the events if she were innocent but not the probability of the events if Iago were trying to trick him.
He is not assessing his belief in whether she is faithful, but rather how likely the data would be if she were faithful.
So: let's make this logic concrete.
Illustrating \(p\) values via “randomization inference”
Say you randomized assignment to treatment and your data looked like this.
Unit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Treatment | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Health score | 4 | 2 | 3 | 1 | 2 | 3 | 4 | 8 | 7 | 6 |
Then the estimated effect is 8 − 32/9 ≈ 4.4. But suppose the assignment had instead come out like this (the outcomes, under the null, staying fixed):
Unit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Treatment | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Health score | 4 | 2 | 3 | 1 | 2 | 3 | 4 | 8 | 7 | 6 |
Then the estimated effect would have been 7 − 33/9 ≈ 3.3.
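Putting the logic together: under the sharp null of no effect the health scores are fixed and only the assignment varies, so the \(p\)-value is the share of possible assignments yielding an estimate at least as large as the observed one. A minimal sketch for the ten units above, assuming exactly one unit is ever treated:

```r
Y <- c(4, 2, 3, 1, 2, 3, 4, 8, 7, 6)

# Observed estimate: unit 8 was actually treated
observed <- Y[8] - mean(Y[-8])

# Estimates under every possible single-unit assignment
estimates <- sapply(1:10, function(i) Y[i] - mean(Y[-i]))

mean(estimates >= observed)  # one-sided randomization p-value: 0.1
```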
Power is just the probability of getting a significant result: the probability that you reject a hypothesis. Simple enough, but it presupposes a hypothesis to test, a test procedure, and a view of how the world actually works.
I want to test the hypothesis that a six never comes up on this dice.
Here’s my test:
What is the power of this test?
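The test itself is not spelled out on the slide; as an illustration, suppose the test is "roll the dice \(n\) times and reject the hypothesis the moment any six appears" (an assumed test, not the slide's). If the dice is in fact fair, the power is \(1 - (5/6)^n\):

```r
# Assumed test: reject "a six never comes up" if any six appears in n rolls.
# Against a fair dice, the power is 1 - (5/6)^n:
n <- c(1, 5, 10, 25)
round(1 - (5/6)^n, 2)
```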
Power sometimes seems more complicated because hypothesis rejection involves a calculated probability and so you need the probability of a probability.
I want to test the hypothesis that this dice is fair.
Here’s my test: roll the dice many times, count the sixes, and reject if the count is surprisingly high or low.
For this we need to figure out a rule for rejection, based on identifying events that should be unlikely if the hypothesis is true.
Here is how many 6’s I would expect if the dice is fair:
I can figure out from this that 143 or fewer is really very few and 190 or more is really very many:
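A sketch of the calculation (the 1,000 rolls are inferred from the output):

```r
# Tail probabilities for the number of sixes in 1000 rolls of a fair dice
c(lower = pbinom(143, size = 1000, prob = 1/6),
  upper = 1 - pbinom(189, size = 1000, prob = 1/6))
```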
```
##      lower      upper 
## 0.02302647 0.02785689
```
Now we need to stipulate some belief about how the world really works—this is not the null hypothesis that we plan to reject, but something that we actually take to be true.
For instance: we think that in fact sixes appear 20% of the time.
Now what’s the probability of seeing at least 190 sixes?
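A sketch of that calculation (again assuming 1,000 rolls):

```r
# Probability of at least 190 sixes if sixes in fact come up 20% of the time
1 - pbinom(189, size = 1000, prob = 0.2)
```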
```
## [1] 0.796066
```
So, given that I think sixes appear 20% of the time, I think it likely that I will see at least 190 of them and so reject the hypothesis of a fair dice.
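Back to the two-arm trial: a single run of the design gives one estimate and one \(p\)-value (a sketch, assuming the `design` declared earlier):

```r
run_design(design)
```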
inquiry | estimand | estimator | term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
---|---|---|---|---|---|---|---|---|---|---|---|
ate | 0.5 | estimator | Z | 0.84 | 0.23 | 3.63 | 0 | 0.38 | 1.29 | 98 | Y |
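Repeated runs give the distribution of \(p\)-values (a sketch; `select` from dplyr just trims the columns for display):

```r
library(dplyr)

simulate_design(design) |>
  select(sim_ID, estimate, p.value)
```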
sim_ID | estimate | p.value |
---|---|---|
1 | 0.48 | 0.03 |
2 | 0.33 | 0.10 |
3 | 0.89 | 0.00 |
4 | 0.78 | 0.00 |
5 | 0.58 | 0.00 |
6 | 0.30 | 0.16 |
Power is obviously related to the estimates you might get:

```
## [1] 0.9536667
```
A valid \(p\)-value satisfies \(\Pr(p \leq x) \leq x\) for every \(x \in [0,1]\) (under the null).
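Diagnosing with many simulations drives the Monte Carlo error in the diagnosands toward zero (a sketch; the `sims` value is an assumption suggested by the near-zero standard errors below):

```r
diagnose_design(design, sims = 10000)
```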
Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|
0.50 | 0.00 | 0.20 | 0.20 | 0.70 | 0.95 |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
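Power rises with the true effect size. A sketch of the sweep, using the vector-`redesign` pattern from earlier:

```r
design |>
  redesign(b = c(0, 0.25, 0.5, 1)) |>
  diagnose_design()
```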
b | Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|---|
0 | -0.00 | -0.00 | 0.20 | 0.20 | 0.05 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
0.25 | 0.25 | -0.00 | 0.20 | 0.20 | 0.23 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
0.5 | 0.50 | 0.00 | 0.20 | 0.20 | 0.70 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
1 | 1.00 | 0.00 | 0.20 | 0.20 | 1.00 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
To run the design once, just type `design` (or `run_design(design)`).

- `simulate_design(design, sims = k)` simulates the design `k` times
- `diagnose_design(design, sims = k)` computes the diagnosands; `sims` can be a vector
- `diagnose_design(design_list)` diagnoses a whole list of designs at once
- use `tidy` to get the diagnosis information in a nice format for ggplot
We often focus on sample sizes. But power also depends on other features of the design, as the next examples show.
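The code for this first bad design is not shown; here is a hypothetical reconstruction consistent with the diagnosis below (the name `bad_design` and the selection-on-`U` assignment are assumptions): a true effect of zero, with assignment determined by the unobserved shock.

```r
# Hypothetical reconstruction: a null design whose "assignment"
# is perfectly correlated with the error term
bad_design <-
  declare_model(N = 100, U = rnorm(N),
                potential_outcomes(Y ~ 0 * Z + U)) +
  declare_assignment(Z = as.numeric(U > 0),
                     Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, inquiry = "ate", .method = lm_robust)
```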
You can see from the null design that power is great but bias is terrible and coverage is way off.
Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|
1.59 | 1.59 | 0.12 | 1.60 | 1.00 | 0.00 |
(0.01) | (0.01) | (0.00) | (0.01) | (0.00) | (0.00) |
Power without unbiasedness corrupts, absolutely
```r
another_bad_design <-
  declare_model(
    N = 100,
    female = rep(0:1, N / 2),
    U = rnorm(N),
    potential_outcomes(Y ~ female * Z + U)) +
  declare_assignment(
    Z = block_ra(blocks = female, block_prob = c(.1, .5)),
    Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z + female, inquiry = "ate",
                    .method = lm_robust)
```
Power looks good here too, but the estimates are biased and coverage is off: assignment probabilities differ across blocks, and the estimator does not take account of this.
Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|
0.76 | 0.26 | 0.24 | 0.35 | 0.84 | 0.85 |
(0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.02) |
```r
clustered_design <-
  declare_model(
    cluster = add_level(N = 10, cluster_shock = rnorm(N)),
    individual = add_level(
      N = 100,
      Y_Z_0 = rnorm(N) + cluster_shock,
      Y_Z_1 = rnorm(N) + cluster_shock)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = cluster_ra(clusters = cluster)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")
```
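Diagnosis shows the problem: the default standard errors ignore the clustered assignment, so they are far too small and coverage collapses.

```r
diagnose_design(clustered_design)
```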
Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|
-0.00 | -0.00 | 0.64 | 0.64 | 0.79 | 0.20 |
(0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.01) |
Clustering the standard errors fixes it: the true effect here is zero, and the rejection rate falls back to roughly 5% while coverage returns to about 95%.

```r
clustered_design_2 <-
  clustered_design |>
  replace_step(5, declare_estimator(Y ~ Z, clusters = cluster))
```
Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|
0.00 | -0.00 | 0.66 | 0.65 | 0.06 | 0.94 |
(0.02) | (0.02) | (0.01) | (0.01) | (0.01) | (0.01) |
```r
design_uncertain <-
  declare_model(N = 1000, b = 1 + rnorm(1),
                Y_Z_1 = rnorm(N),
                Y_Z_2 = rnorm(N) + b,
                Y_Z_3 = rnorm(N) + b) +
  declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ate = mean(b)) +
  declare_estimator(Y ~ factor(Z), term = TRUE)
```

Because `b` is drawn afresh each time the model runs, the estimand itself is uncertain: each call to `draw_estimands` returns a different answer.

```r
draw_estimands(design_uncertain)
##   inquiry  estimand
## 1     ate 0.7120984

draw_estimands(design_uncertain)
##   inquiry  estimand
## 1     ate 2.139047
```
Say I run two tests and want to correct for multiple comparisons.
Two approaches. First, by hand:
```r
b <- 0.2

design_mc <-
  declare_model(N = 1000,
                Y_Z_1 = rnorm(N),
                Y_Z_2 = rnorm(N) + b,
                Y_Z_3 = rnorm(N) + b) +
  declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ate = b) +
  declare_estimator(Y ~ factor(Z), term = TRUE)
```
```r
library(dplyr)  # filter, group_by, mutate, summarize

design_mc |>
  simulate_designs(sims = 1000) |>
  # keep only the two treatment-arm comparisons
  filter(term != "(Intercept)") |>
  # adjust the two p-values within each simulation run
  group_by(sim_ID) |>
  mutate(p_bonferroni = p.adjust(p = p.value, method = "bonferroni"),
         p_holm = p.adjust(p = p.value, method = "holm"),
         p_fdr = p.adjust(p = p.value, method = "fdr")) |>
  ungroup() |>
  summarize(
    "Power using naive p-values" = mean(p.value <= 0.05),
    "Power using Bonferroni correction" = mean(p_bonferroni <= 0.05),
    "Power using Holm correction" = mean(p_holm <= 0.05),
    "Power using FDR correction" = mean(p_fdr <= 0.05)
  )
```
Power using naive p-values | Power using Bonferroni correction | Power using Holm correction | Power using FDR correction |
---|---|---|---|
0.7374 | 0.6318 | 0.6886 | 0.7032 |
The alternative approach (generally better!) is to design with a custom estimator that includes your corrections.
```r
my_estimator <- function(data) {
  lm_robust(Y ~ factor(Z), data = data) |>
    tidy() |>
    filter(term != "(Intercept)") |>
    mutate(p.naive = p.value,
           p.value = p.adjust(p = p.naive, method = "bonferroni"))
}
```
```r
design_mc_2 <- design_mc |>
  replace_step(5, declare_estimator(handler = label_estimator(my_estimator)))

run_design(design_mc_2) |>
  select(term, estimate, p.value, p.naive) |> kable()
```
term | estimate | p.value | p.naive |
---|---|---|---|
factor(Z)2 | 0.0535175 | 0.9751237 | 0.4875618 |
factor(Z)3 | 0.1970167 | 0.0213185 | 0.0106592 |
Let’s try the same thing for a null model, using `redesign(design_mc_2, b = 0)`:
```r
design_mc_3 <- design_mc_2 |> redesign(b = 0)

run_design(design_mc_3) |> select(estimate, p.value, p.naive) |> kable()
```
estimate | p.value | p.naive |
---|---|---|
-0.0366055 | 1 | 0.6300116 |
-0.0266577 | 1 | 0.7285604 |
…and power:
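A sketch of the call producing the table below (one pair of rows per treatment-arm comparison):

```r
diagnose_design(design_mc_3)
```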
Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|
0.00 | 0.00 | 0.08 | 0.08 | 0.02 | 0.95 |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.01) |
-0.00 | -0.00 | 0.08 | 0.08 | 0.02 | 0.96 |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.01) |
Bothered? Under the null the corrected tests reject only about 2% of the time, below the nominal 5%: Bonferroni is conservative, as expected.