compare_designs: Using design declaration for reconciliation and critique
Graeme Blair, Alex Coppock, Macartan Humphreys
2 June 2021: CAS Workshop on Prediction, Registration, and Replication of Scientific Findings
Roadmap
- The MIDA framework
- Design declaration and diagnosis
- Implementation errors
- Formal reconciliation illustrated
- Takeaways
For more: https://declaredesign.org/
Main idea
- Think of a research design as an interrogable object: design_1
- Try to represent the actually implemented research as an alternative design: design_2
- Reconcile by formally comparing the designs, compare_designs(design_1, design_2), in order to:
  - automate reconciliation
  - clarify the nature of deviations
  - clarify the implications of deviations
  - invite critique
The MIDA framework
Many research designs can be represented as positions or decisions with respect to Models (M), Inquiries (I), Data strategies (D), and Answer strategies (A).
A design
library(DeclareDesign)

b <- .3

design_1 <-
  declare_model(
    N = 500,
    u_0 = rnorm(N),
    u_1 = rnorm(N),
    potential_outcomes(Y ~ u_0 + Z*(b + u_1))) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N = N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, model = lm_robust)
This is a complete design declaration.
- it can be “run”:
run_design(design_1)
- and “diagnosed”:
diagnose_design(design_1)
Diagnosis
diagnose_design(design_1)
| Inquiry | Bias | RMSE | Power | Coverage | Mean Estimate | SD Estimate | Mean Se | Mean Estimand |
|---------|------|------|-------|----------|---------------|-------------|---------|---------------|
| ate | 0.00 | 0.10 | 0.78 | 0.97 | 0.30 | 0.11 | 0.11 | 0.30 |
| | (0.00) | (0.00) | (0.02) | (0.01) | (0.00) | (0.00) | (0.00) | (0.00) |
- We see the design has nice features: reasonable power and unbiasedness.
- Coverage is off, though (because standard error estimates are conservative when there are heterogeneous effects). A quick check of this claim is sketched below.
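One way to check this reasoning (a sketch, not part of the original declaration): re-declare the design without the heterogeneous term u_1 in the treatment effect and confirm that coverage returns to roughly the nominal 95%.

# Sketch: same design but with homogeneous effects; lm_robust standard
# errors should no longer be conservative, so coverage should be ~0.95
design_1_homogeneous <-
  declare_model(
    N = 500,
    u_0 = rnorm(N),
    potential_outcomes(Y ~ u_0 + Z * b)) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N = N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, model = lm_robust)

diagnose_design(design_1_homogeneous)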
Registration
The DeclareDesign idea is that the design object is the thing you register.
The design object captures:
- your background assumptions about the conditions in which you are operating
- what your question really is
- how you plan to generate data
- how you plan to analyze it
It’s short, complete, interrogable.
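For example, standard DeclareDesign helpers let anyone interrogate the registered object (a sketch):

draw_data(design_1)                  # one simulated dataset implied by the model and data strategy
draw_estimands(design_1)             # the value of the inquiry under one draw of the model
draw_estimates(design_1)             # the answer strategy applied to one simulated dataset
simulate_design(design_1, sims = 5)  # estimands and estimates across several simulation runs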
In practice
Say we now implement the study and do a few things differently:
- We gather data on fewer cases than we expected (at random)
- We have some missingness in outcome measures (not at random)
- We end up focusing on analysis in a subgroup
In addition, we learn things from data that we only speculated about at the design stage.
These suggest modifications to:
- M: The model
- I: The inquiry
- A: The answer strategy
(Note: not D, though the data d = D(m) might change because of changes in M.)
Data deviations
Our data might have looked like this:
df <- design_1 %>% redesign(b = .2) %>% draw_data()
But instead it looks like this:
library(dplyr)

df_observed <-
  df %>%
  dplyr::filter(simple_ra(n(), prob = .8) == 1) %>%
  mutate(Male = simple_ra(n()) == 1) %>%
  mutate(Y = ifelse(Y_Z_1 < 0 & Male, NA, Y))
The differences are:
- we have too few observations
- we have an additional variable, Male
- we have data missing not at random among the men (a quick check below confirms this)
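A quick check of the last point (a sketch): tabulate missing outcomes by the Male indicator in the observed data.

# All of the missing Y values should fall in the Male == TRUE group
df_observed %>%
  group_by(Male) %>%
  summarise(n = n(), n_missing_Y = sum(is.na(Y)))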
Analysis deviations
Our analysis should have looked like this:
get_estimates(design_1, df)
| estimate | std.error | statistic | p.value | conf.low | conf.high | df |
|----------|-----------|-----------|---------|----------|-----------|-----|
| 0.13 | 0.11 | 1.22 | 0.22 | -0.08 | 0.34 | 498 |
Given the data we actually observed, it might have looked like this:
get_estimates(design_1, df_observed)
| estimate | std.error | statistic | p.value | conf.low | conf.high | df |
|----------|-----------|-----------|---------|----------|-----------|-----|
| 0.26 | 0.13 | 1.91 | 0.06 | -0.01 | 0.52 | 322 |
But instead it looks like this:
get_estimates(design_1, df_observed %>% filter(Male))
| estimate | std.error | statistic | p.value | conf.low | conf.high | df |
|----------|-----------|-----------|---------|----------|-----------|-----|
| 0.77 | 0.15 | 5.04 | 0 | 0.46 | 1.07 | 108 |
So what to do?
Our analysis has produced a very misleading answer that stems in part from our analysis decision and in part from unexpected selection biases arising during data collection.
Two key ideas:
- The deviations come from two sources: factors that we can control and factors that we cannot control
- We can reconcile our actual design with our planned design in a way that opens up avenues for critique, rather than closing things down.
Let’s see this kind of reconciliation in operation.
Adjusting beliefs I
First, knowing what we think we know now about challenges in the field, we could provide more realistic inputs to the design. Consider:
design_2 <-
  declare_model(
    N = 500,
    u_0 = rnorm(N),
    u_1 = rnorm(N),
    potential_outcomes(Y ~ u_0 + Z*(b + u_1))) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N = N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_sampling(S = complete_rs(N = N, prob = .9)) +
  declare_estimator(Y ~ Z, model = lm_robust)
This design differs by allowing for failure to hit targets in data gathering.
This is treated as a deviation that we did not intend but that we do understand.
Comparisons
If you provide design_2 along with design_1, readers can compare them. A set of tools automates the comparison of the reconciled design to the original.
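Concretely, the comparison helpers in DeclareDesign can be called on the pair of design objects (a sketch; the individual tools are shown on the following slides):

compare_designs(design_1, design_2)           # overall comparison of the two declarations
compare_design_code(design_1, design_2)       # which lines of the declaration changed
compare_design_data(design_1, design_2)       # how a drawn dataset changes
compare_design_summaries(design_1, design_2)  # how the output of each step changes
compare_diagnoses(design_1, design_2)         # whether the diagnoses really differ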
Compare code
compare_design_code(design_1, design_2, rmd = TRUE, format = "auto", pager = "on", context = 0)
"declare_sampling(S = complete_rs(N = N, prob = 0.9))"
Compare data structures
compare_design_data(design_1, design_2, rmd = TRUE, format = "html")
design_1 draw: 'data.frame': 500 obs. of 7 variables
design_2 draw: 'data.frame': 450 obs. of 8 variables

Both data frames contain:
 $ ID   : chr "001" "002" "003" "004" ...
 $ u_0  : num -0.294 -0.563 0.378 -0.143 0.224 ...
 $ u_1  : num -0.933 -0.979 0.67 1.328 -0.766 ...
 $ Y_Z_0: num -0.294 -0.563 0.378 -0.143 0.224 ...
 $ Y_Z_1: num -0.927 -1.243 1.347 1.485 -0.243 ...
 $ Z    : int 1 0 1 1 0 1 0 1 1 1 ...
 $ Y    : num -0.927 -0.563 1.347 1.485 0.224 ...

The design_2 draw additionally contains the sampling indicator:
 $ S    : num 1 1 1 1 1 1 1 1 1 1 ...
Compare design runs
compare_design_summaries(design_1, design_2, rmd = TRUE, format = "html", pager = "on")
The side-by-side output (design_1 alongside design_2) shows that the first four steps are identical and that the designs diverge at step 5:

Step 1 (model): declare_model(N = 500, u_0 = rnorm(N), u_1 = rnorm(N), potential_outcomes(Y ~ u_0 + Z * (b + u_1)))
  [per-variable summary statistics for each design's draw]
Step 2 (inquiry): declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0))
Step 3 (assignment): declare_assignment(Z = complete_ra(N = N))
Step 4 (measurement): declare_measurement(Y = reveal_outcomes(Y ~ Z))
Step 5, design_1 (estimator): declare_estimator(Y ~ Z, model = lm_robust)
Step 5, design_2 (sampling): declare_sampling(S = complete_rs(N = N, prob = 0.9))
  [variable summaries now based on 450 rather than 500 observations]
Step 6, design_2 (estimator): declare_estimator(Y ~ Z, model = lm_robust)

A single draw of the estimator:

          estimator term estimate  std.error statistic p.value      conf.low  conf.high
design_1: estimator Z    0.3653772 0.1097515 3.329133  0.0009357255 0.1497442 0.5810101
design_2: estimator Z    0.3955818 0.1190763 3.322087  0.0009666505 0.1615643 0.6295993
Compare diagnoses
We can diagnose both designs and look to see how they perform side by side.
diagnose_design(design_1, design_2)
| Design | Inquiry | Bias | RMSE | Power | Coverage | Mean Estimate | SD Estimate | Mean Se | Mean Estimand |
|--------|---------|------|------|-------|----------|---------------|-------------|---------|---------------|
| design_1 | ate | 0.00 | 0.10 | 0.79 | 0.97 | 0.30 | 0.11 | 0.11 | 0.30 |
| | | (0.00) | (0.00) | (0.01) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
| design_2 | ate | 0.00 | 0.11 | 0.75 | 0.97 | 0.30 | 0.11 | 0.12 | 0.30 |
| | | (0.00) | (0.00) | (0.01) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
The main thing we see here is a decline in power (and an increase in RMSE and standard errors). There are no major differences beyond this.
Comparing diagnoses
Diagnoses may differ due to simulation error. The compare_diagnoses function helps assess whether two diagnoses are really different. (Best run with a low alpha and many simulations.)
comparison <- compare_diagnoses(design_1, design_2, alpha = .01)
| inquiry | estimator | diagnosand | design1 | design2 | difference |
|---------|-----------|---------------|---------|---------|------------|
| ate | estimator | bias | 0.00 | 0.00 | -0.00 |
| | | | (0.00) | (0.00) | (0.00) |
| ate | estimator | coverage | 0.97 | 0.96 | -0.01 |
| | | | (0.00) | (0.00) | (0.00) |
| ate | estimator | mean_estimand | 0.30 | 0.30 | 0.00 |
| | | | (0.00) | (0.00) | (0.00) |
| ate | estimator | mean_estimate | 0.30 | 0.30 | -0.00 |
| | | | (0.00) | (0.00) | (0.00) |
| ate | estimator | mean_se | 0.11 | 0.12 | 0.01* |
| | | | (0.00) | (0.00) | (0.00) |
| ate | estimator | power | 0.79 | 0.74 | -0.05* |
| | | | (0.01) | (0.01) | (0.01) |
| ate | estimator | rmse | 0.10 | 0.11 | 0.01* |
| | | | (0.00) | (0.00) | (0.00) |
| ate | estimator | sd_estimate | 0.11 | 0.12 | 0.01* |
| | | | (0.00) | (0.00) | (0.00) |
| ate | estimator | type_s_rate | 0.00 | 0.00 | 0.00 |
| | | | (0.00) | (0.00) | (0.00) |
Adjusting beliefs II
Now we also have larger estimated effects for men than we expected, and we know more about the variance within different groups.
- We could include these elements in the reconciled design.
More importantly, we have also switched estimands.
- We should include the new estimand in the reconciled design as well.
Adjusting beliefs II
We introduce male as a variable, include an effect that is closer to the effect we estimated, and include the new estimand in the design.
b <- .5

design_3 <-
  declare_model(
    N = 500,
    male = simple_rs(N = N) == 1,
    u_0 = rnorm(N),
    u_1 = rnorm(N),
    potential_outcomes(Y ~ u_0 + Z*(b + u_1))) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0),
                  ate_men = mean((Y_Z_1 - Y_Z_0)[male])) +
  declare_assignment(Z = complete_ra(N = N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_sampling(S = complete_rs(N = N, prob = .9)) +
  declare_estimator(Y ~ Z, model = lm_robust, inquiry = "ate") +
  declare_estimator(Y ~ Z, model = lm_robust, subset = male, inquiry = "ate_men", label = "men")
Diagnosis 3
Does the design comparison give support to the decision to switch estimands?
diagnose_design(design_1, design_2, design_3)
| Design | Inquiry | Bias | RMSE | Power | Coverage | Mean Estimate | SD Estimate | Mean Se | Mean Estimand |
|--------|---------|------|------|-------|----------|---------------|-------------|---------|---------------|
| design_1 | ate | -0.01 | 0.10 | 0.76 | 0.97 | 0.29 | 0.11 | 0.11 | 0.30 |
| | | (0.00) | (0.00) | (0.01) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
| design_2 | ate | 0.00 | 0.11 | 0.74 | 0.97 | 0.30 | 0.11 | 0.12 | 0.30 |
| | | (0.00) | (0.00) | (0.01) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
| design_3 | ate | 0.01 | 0.11 | 0.99 | 0.97 | 0.51 | 0.11 | 0.12 | 0.50 |
| | | (0.00) | (0.00) | (0.00) | (0.01) | (0.00) | (0.00) | (0.00) | (0.00) |
| design_3 | ate_men | 0.01 | 0.15 | 0.87 | 0.97 | 0.51 | 0.16 | 0.16 | 0.50 |
| | | (0.00) | (0.00) | (0.01) | (0.01) | (0.01) | (0.00) | (0.00) | (0.00) |
A change in estimand is generally hard to justify on design grounds, and it is certainly not supported here. (You have better diagnostics for ate_men in design_3 than for ate in design_1, but the diagnostics for ate in design_3 are better still.)
Last, and bravest: Redesign III
There is a part of the design-in-practice that we do not understand: the pattern of non-random data missingness among men.
Here is an attempt to reconcile.
rho <- .7

design_4 <-
  declare_model(
    N = 500,
    male = simple_rs(N = N) == 1,
    u_0 = rnorm(N),
    u_1 = rnorm(N),
    # A new term, correlated with potential outcomes:
    u_2 = correlate(given = u_1, rho = rho, rnorm),
    potential_outcomes(Y ~ u_0 + Z*(b + u_1))) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0),
                  ate_men = mean((Y_Z_1 - Y_Z_0)[male]),
                  ate_women = mean((Y_Z_1 - Y_Z_0)[!male])) +
  declare_assignment(Z = complete_ra(N = N)) +
  declare_sampling(S = complete_rs(N = N, prob = .9)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  # Non-random missing data
  declare_measurement(Y = ifelse(male & u_2 < quantile(u_2, .25), NA, Y)) +
  declare_estimator(Y ~ Z, model = lm_robust, inquiry = "ate") +
  declare_estimator(Y ~ Z, model = lm_robust, subset = male,
                    inquiry = "ate_men", label = "men") +
  declare_estimator(Y ~ Z, model = lm_robust, subset = !male,
                    inquiry = "ate_women", label = "women")
Compare diagnoses 4
diagnose_design(design_1, design_4)
| Design | Inquiry | Bias | RMSE | Power | Coverage | Mean Estimate | SD Estimate | Mean Se | Mean Estimand |
|--------|---------|------|------|-------|----------|---------------|-------------|---------|---------------|
| design_1 | ate | -0.00 | 0.10 | 0.76 | 0.96 | 0.30 | 0.11 | 0.11 | 0.30 |
| | | (0.00) | (0.00) | (0.02) | (0.01) | (0.00) | (0.00) | (0.00) | (0.00) |
| design_4 | ate | 0.12 | 0.17 | 1.00 | 0.84 | 0.62 | 0.12 | 0.12 | 0.50 |
| | | (0.00) | (0.00) | (0.00) | (0.02) | (0.00) | (0.00) | (0.00) | (0.00) |
| design_4 | ate_men | 0.30 | 0.35 | 1.00 | 0.63 | 0.80 | 0.17 | 0.18 | 0.50 |
| | | (0.01) | (0.01) | (0.00) | (0.02) | (0.01) | (0.01) | (0.00) | (0.00) |
| design_4 | ate_women | -0.01 | 0.15 | 0.84 | 0.97 | 0.49 | 0.17 | 0.16 | 0.50 |
| | | (0.01) | (0.01) | (0.02) | (0.01) | (0.01) | (0.01) | (0.00) | (0.00) |
- The design comparison speaks against the switch in estimand.
- If anything, it supports a switch to the ate_women estimand, which does not suffer from risks of bias.
Critique
The honest reconciliation in the last design includes a possible model of the data missingness.
- It lets people see the risk of bias in the estimand for men (ate_men)
- It lets you see that the switch in estimand might not have been wise
design_4 %>% redesign(rho = c(-.7, 0, .7)) %>% diagnose_design()
| rho | Inquiry | Bias | RMSE | Power | Coverage | Mean Estimate | SD Estimate | Mean Se | Mean Estimand |
|-----|---------|------|------|-------|----------|---------------|-------------|---------|---------------|
| -0.7 | ate_men | -0.29 | 0.34 | 0.21 | 0.64 | 0.21 | 0.19 | 0.18 | 0.50 |
| 0 | ate_men | -0.00 | 0.18 | 0.76 | 0.96 | 0.50 | 0.18 | 0.19 | 0.50 |
| 0.7 | ate_men | 0.30 | 0.35 | 0.99 | 0.62 | 0.79 | 0.19 | 0.18 | 0.50 |
The discussion can then focus on the researcher’s claims about missingness patterns, which are provided in the reconciliation.
Takeaways
Your reconciliation can clarify:
- how things are different
- whether things are different in relevant ways (e.g. under optimistic conditions)
- where conclusions in fact depend on the reasons for deviations
Proposal: attach both the original (registered) design and the reconciled design to your manuscript.
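One concrete way to do this (a sketch; the file name is illustrative): save both design objects alongside the manuscript so that readers can re-run the comparison and diagnosis themselves.

# Bundle the registered and reconciled designs as a supplementary file
saveRDS(list(registered = design_1, reconciled = design_4), file = "designs.rds")

# Readers can then reproduce the reconciliation and diagnosis:
designs <- readRDS("designs.rds")
compare_design_code(designs$registered, designs$reconciled)
diagnose_design(designs$registered, designs$reconciled)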