Puzzles for DeclareDesign bootcamp
For each puzzle: explore the issues raised by the puzzle and generate a self-contained presentation in .qmd (or .Rmd) that reports on your investigations.
1 Basic DD
Q 1.1
Step by step
Define this tiny design:
design <- declare_model(N = 20) + NULL
Note: the NULL here is needed in order to have a minimal pipe, turning a step into a design.
Draw data from it:
draw_data(design) |> head()
Add some more detail to the model:
design <-
  design +
  declare_model(Y = rnorm(N, mean = 1))
Note that N is figured out by DeclareDesign from the model you have already declared.
Draw data: is it what you expected?
Add an inquiry:
design <-
  design +
  declare_inquiry(Q = mean(Y))
Draw an estimand:
draw_estimand(design)
Do it again.
Add a sampling step:
design <-
  design +
  declare_sampling(S = complete_rs(N = N, n = N/2))
Draw data and check the dimensions of the data:
draw_data(design) |> dim()
Is that what you expect?
Add an estimator:
design <-
  design +
  declare_estimator(Y ~ 1, .method = lm)
This runs lm and returns the constant (the intercept, which here estimates the mean of Y).
Draw an estimate:
draw_estimates(design)
Do it again.
Run the design:
run_design(design)
Check out the structure of your design:
str(design)
Now:
- write code that declares this whole design in a single pipe.
- diagnose the design and interpret the power
diagnose_design(design)
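One possible single-pipe version, as a sketch (it simply repeats the steps declared one by one above, with the same labels and values):

# the whole design in a single pipe, then diagnosed
library(DeclareDesign)

design <-
  declare_model(N = 20, Y = rnorm(N, mean = 1)) +
  declare_inquiry(Q = mean(Y)) +
  declare_sampling(S = complete_rs(N = N, n = N / 2)) +
  declare_estimator(Y ~ 1, .method = lm)

diagnose_design(design)

Here power is the probability of rejecting the null that the constant is zero; check whether it looks large or small given a true mean of 1 and only 10 sampled units.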
Q 1.2
Sometimes people worry that with larger samples you are more likely to get a false positive. Is that true?
Assess this by generating a simple experimental design from scratch in which we can vary N and in which there is no true effect of some treatment.
Then:
- Plot the distribution of \(p\) values from the simulations_df. What shape is it and why?
- Plot the power as \(N\) increases, using the diagnosands_df.
- Plot the estimates against \(p\) values for different values of \(N\); what do you see?
- Discuss
Hint: the slides contain code for a simple experimental design
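One possible sketch, assuming a simple two-arm design in which the true effect is exactly zero and N is a parameter that redesign() can vary (the particular values of N and the number of simulations are illustrative):

# a two-arm design with no true treatment effect
library(DeclareDesign)

N <- 100

design <-
  declare_model(
    N = N,
    U = rnorm(N),
    potential_outcomes(Y ~ 0 * Z + U)   # true effect is zero by construction
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")

designs <- redesign(design, N = c(50, 100, 500, 1000))

simulations_df <- simulate_design(designs, sims = 500)
diagnosands_df <- diagnose_design(designs, sims = 500)$diagnosands_df

With no true effect the \(p\) values should be (roughly) uniform at every \(N\), which speaks directly to the worry in the question.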
2 Inference
Q 2.1
A design with correlated outcomes
The standard error is the standard deviation of the sampling distribution of an estimate.
That sounds complicated, but the sampling distribution of an estimate lives in the simulations data frame, so you can look at its standard deviation and assess whether the standard errors estimate it well.
Challenge: Generate a simple experimental design in which there is a correlation (rho) between the two potential outcomes (Y_Z_0 and Y_Z_1). Show the distribution of the estimates over different values of rho. Assess the performance of the estimates of the standard errors and the coverage as rho goes from -1 to 0 to 1. Describe how coverage changes. (Be sure to be clear on what coverage is!)
Y_Z_0 <- rnorm(1000)
Y_Z_1 <- correlate(rnorm, given = Y_Z_0, rho = .5)
cor(Y_Z_0, Y_Z_1)
[1] 0.5034729
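One way to embed this in a full design, as a sketch: put the correlated potential outcomes in the model and make rho a parameter so that redesign() can sweep it (N, the values of rho, and the number of simulations are illustrative):

# a design with correlated potential outcomes
library(DeclareDesign)

rho <- 0.5

design <-
  declare_model(
    N = 100,
    Y_Z_0 = rnorm(N),
    Y_Z_1 = correlate(rnorm, given = Y_Z_0, rho = rho)
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")

designs <- redesign(design, rho = c(-1, -0.5, 0, 0.5, 1))
diagnosis <- diagnose_design(designs, sims = 500)
# compare the sd_estimate and mean_se diagnosands, and look at coverage, across rho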
Q 2.2
Clustering
- Say that you have a set of 20 schools randomly sampled from a superpopulation of schools. There are 5 classrooms in each school and 5 students in each classroom.
- Say you assign a treatment at the classroom level. Should you cluster your standard errors at the level of the school or at the level of the classroom?
Now:
- Declare a design with this hierarchical data structure. Allow for the possibility that treatment effects vary at the school level. Assess the performance of the standard errors when you cluster at each of these levels (and when you do not cluster at all).
- Examine whether the performance depends on whether you are interested in the population average effects or the sample average effects.
Hint: For generating hierarchical data use add_level. Also: be sure to have a reasonably large top-level shock in order to see differences arising from clustering at the school level. You could also try heterogeneous effects by school. (One possible declaration is sketched after the example below.)
g <-
  declare_model(
    L1 = add_level(N = 10, u = rnorm(N)),
    L2 = add_level(N = 12, v = rnorm(N)))
g() |> slice(1:3, 13:15) |> kable()
| L1 | u | L2 | v |
|----|---|----|---|
| 01 | -0.032162 | 001 | 0.0306150 |
| 01 | -0.032162 | 002 | -0.4170501 |
| 01 | -0.032162 | 003 | -0.2049611 |
| 02 | 1.103179 | 013 | 2.2271018 |
| 02 | 1.103179 | 014 | -0.8099450 |
| 02 | 1.103179 | 015 | 0.5110117 |
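One possible declaration of the school/classroom/student structure, as a sketch; the shock sizes, the school-level effect heterogeneity, and the choice of estimators are all illustrative:

# hierarchical data; treatment assigned at the classroom level;
# the same estimator with standard errors clustered at different levels
library(DeclareDesign)

design <-
  declare_model(
    school = add_level(
      N = 20,
      u_school = rnorm(N, sd = 2),           # large school-level shock
      tau = rnorm(N, mean = 0.5, sd = 0.5)   # school-level effect heterogeneity
    ),
    classroom = add_level(N = 5, u_class = rnorm(N, sd = 0.5)),
    student = add_level(
      N = 5,
      u_student = rnorm(N),
      Y_Z_0 = u_school + u_class + u_student,
      Y_Z_1 = Y_Z_0 + tau
    )
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = cluster_ra(clusters = classroom)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, clusters = school, inquiry = "ATE", label = "cluster_school") +
  declare_estimator(Y ~ Z, clusters = classroom, inquiry = "ATE", label = "cluster_classroom") +
  declare_estimator(Y ~ Z, inquiry = "ATE", label = "no_clustering")

diagnose_design(design, sims = 500)

To get at the population versus sample average effects question, you could add a declare_sampling step that samples schools from a larger population and compare inquiries defined before and after sampling.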
3 Controls
Q 3.1
Declare a design in which:
- The assignment of a treatment \(X\) depends in part upon some other, binary, variable \(W\): in particular, \(\Pr(X=1|W=0) = .2\) and \(\Pr(X=1|W=1) = .5\).
- The outcome \(Y\) depends on both \(X\) and \(W\): in particular \(Y = X*W + u\) where \(u\) is a random shock.
- Diagnose a design with three approaches to estimating the effect of \(X\) on \(Y\): (a) ignoring \(W\) (b) adding \(W\) as a linear control (c) including both \(W\) and an interaction between \(W\) and \(X\).
Discuss results. Do any of these return the right answer?
Hint: You can add three separate declare_estimator
steps. They should have distinct labels. The trickiest part is to figure out how to extract the estimate in (c) because you will have both a main term and an interaction term for \(X\).
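One possible sketch; the N, the distribution of W, and the labels are illustrative. In estimator (c) the term argument pulls out the main coefficient on X, which in an interacted model is the effect of X when W = 0:

# assignment of X depends on W; Y = X*W + u; three ways of handling W
library(DeclareDesign)

design <-
  declare_model(
    N = 500,
    W = rbinom(N, 1, prob = 0.5),
    u = rnorm(N),
    potential_outcomes(Y ~ X * W + u, conditions = list(X = c(0, 1)))
  ) +
  declare_inquiry(ATE = mean(Y_X_1 - Y_X_0)) +
  declare_assignment(X = block_ra(blocks = W, block_prob = c(0.2, 0.5))) +
  declare_measurement(Y = reveal_outcomes(Y ~ X)) +
  declare_estimator(Y ~ X, inquiry = "ATE", label = "a_no_control") +
  declare_estimator(Y ~ X + W, inquiry = "ATE", label = "b_control_W") +
  declare_estimator(Y ~ X * W, term = "X", inquiry = "ATE", label = "c_interaction")

diagnose_design(design, sims = 500)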
Q 3.2
Covariates
Sometimes researchers running an experiment look for imbalance on a covariate and then include the covariate as a control if and only if they see imbalance. Set up a design in which a covariate may or may not affect potential outcomes and assess the performance under the following rules (one possible setup is sketched after this list):
- no control
- control as a function of correlations of covariates with outcomes
- control regardless of correlations
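One possible sketch, using a custom estimator handler for the conditional rule. The rule here controls for W if and only if W looks imbalanced across arms (p < .05 in a regression of W on Z); the threshold, the test, and the coefficient beta_W (which you can set to 0 or vary with redesign() so that the covariate does or does not affect potential outcomes) are all illustrative choices. A rule based on the covariate-outcome correlation could be swapped in the same way.

# never control, conditionally control, always control
library(DeclareDesign)

beta_W <- 0.5   # effect of the covariate on the outcome; set to 0 for no effect

test_then_control <- function(data) {
  # control for W if and only if W is imbalanced across arms (p < .05)
  imbalanced <- lm_robust(W ~ Z, data = data)$p.value[["Z"]] < 0.05
  fit <- if (imbalanced) lm_robust(Y ~ Z + W, data = data) else lm_robust(Y ~ Z, data = data)
  tidy(fit)[2, ]  # keep the row for the Z coefficient
}

design <-
  declare_model(
    N = 100,
    W = rnorm(N),
    U = rnorm(N),
    potential_outcomes(Y ~ 0.2 * Z + beta_W * W + U)
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE", label = "never_control") +
  declare_estimator(handler = label_estimator(test_then_control),
                    inquiry = "ATE", label = "conditional_control") +
  declare_estimator(Y ~ Z + W, inquiry = "ATE", label = "always_control")

diagnose_design(redesign(design, beta_W = c(0, 0.5)), sims = 500)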