compare_designs: Using design declaration for reconciliation and critique
Graeme Blair, Alex Coppock, Macartan Humphreys
2 June 2021: CAS Workshop on Prediction, Registration, and Replication of Scientific Findings
Roadmap
- The MIDA framework
- Design declaration and diagnosis
- Implementation errors
- Formal reconciliation illustrated
- Takeaways
For more: https://declaredesign.org/
Main idea
- Think of a research design as an interrogable object: design_1
- Try to represent the actually implemented research as an alternative design: design_2
- Reconcile by formally comparing the designs, compare_designs(design_1, design_2), in order to:
  - automate reconciliation
  - clarify the nature of deviations
  - clarify the implications of deviations
  - invite critique
The MIDA framework
Many research designs can be represented as positions or decisions with respect to Models (M), Inquiries (I), Data strategies (D), and Answer strategies (A).
A design
library(DeclareDesign)

b <- .3

design_1 <-
  declare_model(
    N = 500,
    u_0 = rnorm(N),
    u_1 = rnorm(N),
    potential_outcomes(Y ~ u_0 + Z*(b + u_1))) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N = N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, model = lm_robust)
This is a complete design declaration.
- it can be “run”:
run_design(design_1)
- and “diagnosed”:
diagnose_design(design_1)
Diagnosis
diagnose_design(design_1)
| Inquiry | Bias | RMSE | Power | Coverage | Mean Estimate | SD Estimate | Mean Se | Mean Estimand |
|---------|------|------|-------|----------|---------------|-------------|---------|---------------|
| ate | 0.00 | 0.10 | 0.78 | 0.97 | 0.30 | 0.11 | 0.11 | 0.30 |
| | (0.00) | (0.00) | (0.02) | (0.01) | (0.00) | (0.00) | (0.00) | (0.00) |
- We see the design has nice features: reasonable power and unbiasedness.
- Coverage is off, though (because standard error estimates are conservative when there are heterogeneous effects). A quick check of this claim is sketched below.
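One way to check this reasoning (a sketch, not part of the original declaration): re-declare the design without the heterogeneous term u_1 in the treatment effect and confirm that coverage returns to roughly the nominal 95%.

# Sketch: same design but with homogeneous effects; lm_robust standard
# errors should no longer be conservative, so coverage should be ~0.95
design_1_homogeneous <-
  declare_model(
    N = 500,
    u_0 = rnorm(N),
    potential_outcomes(Y ~ u_0 + Z * b)) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N = N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, model = lm_robust)

diagnose_design(design_1_homogeneous)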
Registration
The DeclareDesign idea is that the design object is the thing you register.
The design object captures:
- your background assumptions about the conditions in which you are operating
- what your question really is
- how you plan to generate data
- how you plan to analyze it
It’s short, complete, interrogable.
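For example, standard DeclareDesign helpers let anyone interrogate the registered object (a sketch):

draw_data(design_1)                  # one simulated dataset implied by the model and data strategy
draw_estimands(design_1)             # the value of the inquiry under one draw of the model
draw_estimates(design_1)             # the answer strategy applied to one simulated dataset
simulate_design(design_1, sims = 5)  # estimands and estimates across several simulation runs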
In practice
Say we now implement the study and do a few things differently:
- We gather data on fewer cases than we expected (at random)
- We have some missingness in outcome measures (not at random)
- We end up focusing on analysis in a subgroup
In addition, we learn things from data that we only speculated about at the design stage.
These suggest modifications to:
- M: The model
- I: The inquiry
- A: The answer strategy
(Note: not D, though the data d = D(m) might change because of changes in M.)
Data deviations
Our data might have looked like this:
df <- design_1 %>% redesign(b = .2) %>% draw_data()
But instead it looks like this:
library(dplyr)

df_observed <-
  df %>%
  dplyr::filter(simple_ra(n(), prob = .8) == 1) %>%
  mutate(Male = simple_ra(n()) == 1) %>%
  mutate(Y = ifelse(Y_Z_1 < 0 & Male, NA, Y))
The differences are:
- we have too few observations
- we have an additional variable, Male
- we have data missing not at random among the men (a quick check below confirms this)
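A quick check of the last point (a sketch): tabulate missing outcomes by the Male indicator in the observed data.

# All of the missing Y values should fall in the Male == TRUE group
df_observed %>%
  group_by(Male) %>%
  summarise(n = n(), n_missing_Y = sum(is.na(Y)))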
Analysis deviations
Our analysis should have looked like this:
get_estimates(design_1, df)
| estimate | std.error | statistic | p.value | conf.low | conf.high | df |
|----------|-----------|-----------|---------|----------|-----------|-----|
| 0.13 | 0.11 | 1.22 | 0.22 | -0.08 | 0.34 | 498 |
Given the data we actually observed, it might have looked like this:
get_estimates(design_1, df_observed)
| estimate | std.error | statistic | p.value | conf.low | conf.high | df |
|----------|-----------|-----------|---------|----------|-----------|-----|
| 0.26 | 0.13 | 1.91 | 0.06 | -0.01 | 0.52 | 322 |
But instead it looks like this:
get_estimates(design_1, df_observed %>% filter(Male))
| estimate | std.error | statistic | p.value | conf.low | conf.high | df |
|----------|-----------|-----------|---------|----------|-----------|-----|
| 0.77 | 0.15 | 5.04 | 0 | 0.46 | 1.07 | 108 |
So what to do?
Our analysis has produced a very misleading answer that stems in part from our analysis decision and in part from unexpected selection biases arising during data collection.
Two key ideas:
- The deviations come from two sources: factors that we can control and factors that we cannot control
- We can reconcile our actual design with our planned design in a way that opens up avenues for critique, rather than closing things down.
Let’s see this kind of reconciliation in operation.
Adjusting beliefs I
First, knowing what we think we know now about challenges in the field, we could provide more realistic inputs to the design. Consider:
design_2 <-
  declare_model(
    N = 500,
    u_0 = rnorm(N),
    u_1 = rnorm(N),
    potential_outcomes(Y ~ u_0 + Z*(b + u_1))) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N = N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_sampling(S = complete_rs(N = N, prob = .9)) +
  declare_estimator(Y ~ Z, model = lm_robust)
This design differs by allowing for failure to hit targets in data gathering.
This is treated as a deviation that we did not intend but that we do understand.
Comparisons
If you provide design_2 along with design_1, readers can compare them. A set of tools automates the comparison of the reconciled design to the original.
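Concretely, the comparison helpers in DeclareDesign can be called on the pair of design objects (a sketch; the individual tools are shown on the following slides):

compare_designs(design_1, design_2)           # overall comparison of the two declarations
compare_design_code(design_1, design_2)       # which lines of the declaration changed
compare_design_data(design_1, design_2)       # how a drawn dataset changes
compare_design_summaries(design_1, design_2)  # how the output of each step changes
compare_diagnoses(design_1, design_2)         # whether the diagnoses really differ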
Compare code
compare_design_code(design_1, design_2, rmd = TRUE, format = "auto", pager = "on", context = 0)
"declare_sampling(S = complete_rs(N = N, prob = 0.9))"
Compare data structures
compare_design_data(design_1, design_2, rmd = TRUE, format = "html")
design_1 draw: 'data.frame': 500 obs. of 7 variables
design_2 draw: 'data.frame': 450 obs. of 8 variables

Both data frames contain:
 $ ID   : chr "001" "002" "003" "004" ...
 $ u_0  : num -0.294 -0.563 0.378 -0.143 0.224 ...
 $ u_1  : num -0.933 -0.979 0.67 1.328 -0.766 ...
 $ Y_Z_0: num -0.294 -0.563 0.378 -0.143 0.224 ...
 $ Y_Z_1: num -0.927 -1.243 1.347 1.485 -0.243 ...
 $ Z    : int 1 0 1 1 0 1 0 1 1 1 ...
 $ Y    : num -0.927 -0.563 1.347 1.485 0.224 ...

The design_2 draw additionally contains the sampling indicator:
 $ S    : num 1 1 1 1 1 1 1 1 1 1 ...
Compare design runs
compare_design_summaries(design_1, design_2, rmd = TRUE, format = "html", pager = "on")
The side-by-side output (design_1 alongside design_2) shows that the first four steps are identical and that the designs diverge at step 5:

Step 1 (model): declare_model(N = 500, u_0 = rnorm(N), u_1 = rnorm(N), potential_outcomes(Y ~ u_0 + Z * (b + u_1)))
  [per-variable summary statistics for each design's draw]
Step 2 (inquiry): declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0))
Step 3 (assignment): declare_assignment(Z = complete_ra(N = N))
Step 4 (measurement): declare_measurement(Y = reveal_outcomes(Y ~ Z))
Step 5, design_1 (estimator): declare_estimator(Y ~ Z, model = lm_robust)
Step 5, design_2 (sampling): declare_sampling(S = complete_rs(N = N, prob = 0.9))
  [variable summaries now based on 450 rather than 500 observations]
Step 6, design_2 (estimator): declare_estimator(Y ~ Z, model = lm_robust)

A single draw of the estimator:

          estimator term estimate  std.error statistic p.value      conf.low  conf.high
design_1: estimator Z    0.3653772 0.1097515 3.329133  0.0009357255 0.1497442 0.5810101
design_2: estimator Z    0.3955818 0.1190763 3.322087  0.0009666505 0.1615643 0.6295993
Compare diagnoses
We can diagnose both designs and look to see how they perform side by side.
diagnose_design(design_1, design_2)
| Design | Inquiry | Bias | RMSE | Power | Coverage | Mean Estimate | SD Estimate | Mean Se | Mean Estimand |
|--------|---------|------|------|-------|----------|---------------|-------------|---------|---------------|
| design_1 | ate | 0.00 | 0.10 | 0.79 | 0.97 | 0.30 | 0.11 | 0.11 | 0.30 |
| | | (0.00) | (0.00) | (0.01) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
| design_2 | ate | 0.00 | 0.11 | 0.75 | 0.97 | 0.30 | 0.11 | 0.12 | 0.30 |
| | | (0.00) | (0.00) | (0.01) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
The main thing we see here is a decline in power (and an increase in RMSE and standard errors). There are no major differences beyond this.
Comparing diagnoses
Diagnoses may differ due to simulation error. The compare_diagnoses function helps assess whether two diagnoses are really different. (Best run with a low alpha and many simulations.)
comparison <- compare_diagnoses(design_1, design_2, alpha = .01)
| inquiry | estimator | diagnosand | design1 | design2 | difference |
|---------|-----------|---------------|---------|---------|------------|
| ate | estimator | bias | 0.00 | 0.00 | -0.00 |
| | | | (0.00) | (0.00) | (0.00) |
| ate | estimator | coverage | 0.97 | 0.96 | -0.01 |
| | | | (0.00) | (0.00) | (0.00) |
| ate | estimator | mean_estimand | 0.30 | 0.30 | 0.00 |
| | | | (0.00) | (0.00) | (0.00) |
| ate | estimator | mean_estimate | 0.30 | 0.30 | -0.00 |
| | | | (0.00) | (0.00) | (0.00) |
| ate | estimator | mean_se | 0.11 | 0.12 | 0.01* |
| | | | (0.00) | (0.00) | (0.00) |
| ate | estimator | power | 0.79 | 0.74 | -0.05* |
| | | | (0.01) | (0.01) | (0.01) |
| ate | estimator | rmse | 0.10 | 0.11 | 0.01* |
| | | | (0.00) | (0.00) | (0.00) |
| ate | estimator | sd_estimate | 0.11 | 0.12 | 0.01* |
| | | | (0.00) | (0.00) | (0.00) |
| ate | estimator | type_s_rate | 0.00 | 0.00 | 0.00 |
| | | | (0.00) | (0.00) | (0.00) |
Adjusting beliefs II
Now we also have larger estimated effects for men than we expected, and we know more about the variance within different groups.
- We could include these elements in the reconciled design.
More importantly, we have also switched estimands.
- We should include the new estimand in the reconciled design as well.
Adjusting beliefs II
We introduce male as a variable, include an effect that is closer to the effect we estimated, and include the new estimand in the design.
b <- .5

design_3 <-
  declare_model(
    N = 500,
    male = simple_rs(N = N) == 1,
    u_0 = rnorm(N),
    u_1 = rnorm(N),
    potential_outcomes(Y ~ u_0 + Z*(b + u_1))) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0),
                  ate_men = mean((Y_Z_1 - Y_Z_0)[male])) +
  declare_assignment(Z = complete_ra(N = N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_sampling(S = complete_rs(N = N, prob = .9)) +
  declare_estimator(Y ~ Z, model = lm_robust, inquiry = "ate") +
  declare_estimator(Y ~ Z, model = lm_robust, subset = male, inquiry = "ate_men", label = "men")
Diagnosis 3
Does the design comparison give support to the decision to switch estimands?
diagnose_design(design_1, design_2, design_3)
| Design | Inquiry | Bias | RMSE | Power | Coverage | Mean Estimate | SD Estimate | Mean Se | Mean Estimand |
|--------|---------|------|------|-------|----------|---------------|-------------|---------|---------------|
| design_1 | ate | -0.01 | 0.10 | 0.76 | 0.97 | 0.29 | 0.11 | 0.11 | 0.30 |
| | | (0.00) | (0.00) | (0.01) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
| design_2 | ate | 0.00 | 0.11 | 0.74 | 0.97 | 0.30 | 0.11 | 0.12 | 0.30 |
| | | (0.00) | (0.00) | (0.01) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
| design_3 | ate | 0.01 | 0.11 | 0.99 | 0.97 | 0.51 | 0.11 | 0.12 | 0.50 |
| | | (0.00) | (0.00) | (0.00) | (0.01) | (0.00) | (0.00) | (0.00) | (0.00) |
| design_3 | ate_men | 0.01 | 0.15 | 0.87 | 0.97 | 0.51 | 0.16 | 0.16 | 0.50 |
| | | (0.00) | (0.00) | (0.01) | (0.01) | (0.01) | (0.00) | (0.00) | (0.00) |
A change in estimand is generally hard to justify on design grounds, and it is certainly not supported here. (You have better diagnostics for ate_men in design_3 than for ate in design_1, but the diagnostics for ate in design_3 are better still.)
Last, and bravest: Redesign III
There is a part of the design-in-practice that we do not understand: the pattern of non-random data missingness among men.
Here is an attempt to reconcile.
rho <- .7

design_4 <-
  declare_model(
    N = 500,
    male = simple_rs(N = N) == 1,
    u_0 = rnorm(N),
    u_1 = rnorm(N),
    # A new term, correlated with potential outcomes:
    u_2 = correlate(given = u_1, rho = rho, rnorm),
    potential_outcomes(Y ~ u_0 + Z*(b + u_1))) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0),
                  ate_men = mean((Y_Z_1 - Y_Z_0)[male]),
                  ate_women = mean((Y_Z_1 - Y_Z_0)[!male])) +
  declare_assignment(Z = complete_ra(N = N)) +
  declare_sampling(S = complete_rs(N = N, prob = .9)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  # Non-random missing data
  declare_measurement(Y = ifelse(male & u_2 < quantile(u_2, .25), NA, Y)) +
  declare_estimator(Y ~ Z, model = lm_robust, inquiry = "ate") +
  declare_estimator(Y ~ Z, model = lm_robust, subset = male,
                    inquiry = "ate_men", label = "men") +
  declare_estimator(Y ~ Z, model = lm_robust, subset = !male,
                    inquiry = "ate_women", label = "women")
Compare diagnoses 4
diagnose_design(design_1, design_4)
| Design | Inquiry | Bias | RMSE | Power | Coverage | Mean Estimate | SD Estimate | Mean Se | Mean Estimand |
|--------|---------|------|------|-------|----------|---------------|-------------|---------|---------------|
| design_1 | ate | -0.00 | 0.10 | 0.76 | 0.96 | 0.30 | 0.11 | 0.11 | 0.30 |
| | | (0.00) | (0.00) | (0.02) | (0.01) | (0.00) | (0.00) | (0.00) | (0.00) |
| design_4 | ate | 0.12 | 0.17 | 1.00 | 0.84 | 0.62 | 0.12 | 0.12 | 0.50 |
| | | (0.00) | (0.00) | (0.00) | (0.02) | (0.00) | (0.00) | (0.00) | (0.00) |
| design_4 | ate_men | 0.30 | 0.35 | 1.00 | 0.63 | 0.80 | 0.17 | 0.18 | 0.50 |
| | | (0.01) | (0.01) | (0.00) | (0.02) | (0.01) | (0.01) | (0.00) | (0.00) |
| design_4 | ate_women | -0.01 | 0.15 | 0.84 | 0.97 | 0.49 | 0.17 | 0.16 | 0.50 |
| | | (0.01) | (0.01) | (0.02) | (0.01) | (0.01) | (0.01) | (0.00) | (0.00) |
- The design comparison speaks against the switch in estimand.
- If anything, it supports a switch to the ate_women estimand, which does not suffer from risks of bias.
Critique
The honest reconciliation in the last design includes a possible model of the data missingness.
- It lets people see the risk of bias in the estimand for men (ate_men)
- It lets you see that the switch in estimand might not have been wise
design_4 %>% redesign(rho = c(-.7, 0, .7)) %>% diagnose_design()
| rho | Inquiry | Bias | RMSE | Power | Coverage | Mean Estimate | SD Estimate | Mean Se | Mean Estimand |
|-----|---------|------|------|-------|----------|---------------|-------------|---------|---------------|
| -0.7 | ate_men | -0.29 | 0.34 | 0.21 | 0.64 | 0.21 | 0.19 | 0.18 | 0.50 |
| 0 | ate_men | -0.00 | 0.18 | 0.76 | 0.96 | 0.50 | 0.18 | 0.19 | 0.50 |
| 0.7 | ate_men | 0.30 | 0.35 | 0.99 | 0.62 | 0.79 | 0.19 | 0.18 | 0.50 |
The discussion can then focus on the researcher’s claims about missingness patterns, which are provided in the reconciliation.
Takeaways
Your reconciliation can clarify:
- how things are different
- whether things are different in relevant ways (e.g. under optimistic conditions)
- where conclusions in fact depend on the reasons for deviations
Proposal: attach both the original (registered) design and the reconciled design to your manuscript.
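One concrete way to do this (a sketch; the file name is illustrative): save both design objects alongside the manuscript so that readers can re-run the comparison and diagnosis themselves.

# Bundle the registered and reconciled designs as a supplementary file
saveRDS(list(registered = design_1, reconciled = design_4), file = "designs.rds")

# Readers can then reproduce the reconciliation and diagnosis:
designs <- readRDS("designs.rds")
compare_design_code(designs$registered, designs$reconciled)
diagnose_design(designs$registered, designs$reconciled)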