puzzles for `DeclareDesign` bootcamp

Author

Macartan Humphreys

Published

April 28, 2024

For each puzzle: explore the issues raised by the puzzle and generate a self contained presentation in .qmd (or .Rmd) that reports on your investigations.

1 Basic DD

Q 1.1

Step by step

Define this tiny design:

design <- declare_model(N = 20) + NULL

Note , the NULL here is needed in order to have a minimal pipe: turning a step into a design.

Draw data from it:

draw_data(design) |> head()

Add some more detail to the model:

design <- 
  design +
  declare_model(Y = rnorm(N, mean = 1))

Note N is figured out by DeclareDesign from the model you have already declared.

Draw data, it is what you expected?

Add an inquiry:

design <- 
  design +
  declare_inquiry(Q = mean(Y))

Draw an estimand:

draw_estimand(design)

Do it again.

Add a sampling step:

design <- 
  design + 
  declare_sampling(S = complete_rs(N=N, n = N/2))

Draw data and check the length of the data:

draw_data(design) |> dim()

Is that what you expect?

Add an estimator:

design <- 
  design +
  declare_estimator(Y ~ 1, .method = lm)

This runs lm and returns the constant.

Draw an estimate:

draw_estimates(design)

Do it again.

Run the design:

run_design(design)

Check out the structure of your design:

str(design)

Now:

write code that declares this whole design in a single pipe.
diagnose the design and interpret the power

diagnose_design(design)

Q 1.2

False positives and \(N\)s

Sometimes people worry that with larger samples you are more likely to get a false positive. Is that true?
Assess by generating a simple experimental design from scratch in which we can vary the N and in which there is no true effect of some treatment.

Then:

Plot the distribution of \(p\) values from the simulations_df. What shape is it and why?
Plot the power as \(N\) increases, using the diagnosands_df
Plot the estimates against \(p\) values for different values of \(N\); what do you see?
Discuss

Hint: the slides contain code for a simple experimental design

2 Inference

Q 2.1

A design with correlated outcomes

The standard error is standard deviation of the sampling distribution of an estimate.
That sounds complicated, but actually the sampling distribution of an estimate lives in the simulations data frame so you can look at its standard deviation and assess whether standard errors estimate it well.
Challenge: Generate a simple experimental design in which there is a correlation (rho) between the two potential outcomes (Y_Z_0 and Y_Z_1).
Show the distribution of the estimates over different values of rho
Assess the performance of the estimates of the standard errors and the coverage as rho goes from -1 to 0 to 1. Describe how coverage changes. (Be sure to be clear on what coverage is!)

Y_Z_0 <- rnorm(1000)
Y_Z_1 <- correlate(rnorm, given = Y_Z_0, rho = .5)

cor(Y_Z_0, Y_Z_1)

[1] 0.5034729

Q 2.2

Clustering

Say that you have a set of 20 schools randomly sampled from a superpopulation of schools. There are 5 classrooms in each school and 5 students in each class room.
Say you assign a treatment at the classroom level. Should you cluster your standard errors at the level of the school or at the level of the classroom?

Now:

Declare a design with this hierarchical data structure. Allow for the possibility that treatment effects vary at the school level. Assess the performance of the standard errors when you cluster at each of these levels (and when you do not cluster at all).
Examine whether the performance depends on whether you are interested in the population average effects or the sample average effects.

Hint For generating hierarchical models use add_level. Also: be sure to have a reasonable large top level shock in order to see differences arising from clustering at the school level. You could also try heterogeneous effects by school.

g <- 
  declare_model(
    L1 = add_level(N = 10, u = rnorm(N)),
    L2 = add_level(N = 12, v = rnorm(N)))

g() |> slice(1:3, 13:15) |> kable()

L1	u	L2	v
01	-0.032162	001	0.0306150
01	-0.032162	002	-0.4170501
01	-0.032162	003	-0.2049611
02	1.103179	013	2.2271018
02	1.103179	014	-0.8099450
02	1.103179	015	0.5110117

3 Controls

Q 3.1

Confounded.

Declare a design in which:

The assignment of a treatment \(X\) depends in part on upon some other, binary, variable \(W\): in particular \(\Pr(X=1|W=0) = .2\) and \(\Pr(X=1|W=1) = .5\))
The outcome \(Y\) depends on both \(X\) and \(W\): in particular \(Y = X*W + u\) where \(u\) is a random shock.
Diagnose a design with three approaches to estimating the effect of \(X\) on \(Y\): (a) ignoring \(W\) (b) adding \(W\) as a linear control (c) including both \(W\) and an interaction between \(W\) and \(X\).

Discuss results. Do any of these return the right answer?

Hint: You can add three separate declare_estimator steps. They should have distinct labels. The trickiest part is to figure out how to extract the estimate in (c) because you will have both a main term and an interaction term for \(X\).

Q 3.2

Covariates

Sometimes researchers running an experiment look for imbalance on a covariate and then include the covariate as a control if and only if they see imbalance. Set up a design in which a covariate may or may not affect potential outcomes and assess the performance given different rules

no control
control as a function of correlations of covariates with outcomes
control regardless of correlations