May 2024
Useful references on design and structuring research: Blair, Coppock, and Humphreys (2023); King, Keohane, and Verba (2021); Lipson (2018); Van Evera (1997).
Writing: people quibble, but there is a lot of wisdom in Strunk Jr and White (2007).
[1] G. Blair, A. Coppock, and M. Humphreys. Research design in the social sciences: declaration, diagnosis, and redesign. Princeton University Press, 2023.
[2] G. King, R. O. Keohane, and S. Verba. Designing social inquiry: Scientific inference in qualitative research. Princeton University Press, 2021.
[3] C. Lipson. How to write a BA thesis: A practical guide from your first ideas to your finished paper. University of Chicago Press, 2018.
[4] W. Strunk Jr and E. B. White. The elements of style illustrated. Penguin, 2007.
[5] S. Van Evera. Guide to methods for students of political science. Cornell University Press, 1997.
https://macartan.github.io/teaching/how-to-write
Classic structure
Here is a complete, albeit barebones (and possibly incorrect), argument:
Say I and G are positively correlated. Does this mean that I causes G?
Say I and G are negatively correlated. Does this mean that I does not cause G?
How might you estimate the effect of I on G?
How does C help establish the link between I and G?
Where is the theory? Is it equivalent to the graph or is it something else that generates the graph?
How might you check if the proposed theory is correct?
Which of the counterarguments are strong and why?
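The first two questions above can be made concrete with a small simulation. This is a minimal sketch in base R, not part of the original argument: it assumes a hypothetical unobserved confounder U (not the C discussed above) and made-up coefficients, and shows that the sign of the correlation between I and G need not match the sign of the causal effect.

```r
set.seed(1)
n <- 10000

# U is a hypothetical unobserved confounder (an assumption of this sketch)
U <- rnorm(n)
I <- U + rnorm(n)

# Case 1: I has no effect on G, yet the two are positively correlated
G1 <- 2 * U + rnorm(n)
cor_case1   <- cor(I, G1)
naive_case1 <- coef(lm(G1 ~ I))[["I"]]      # close to 1, not the true 0
adj_case1   <- coef(lm(G1 ~ I + U))[["I"]]  # close to the true effect, 0

# Case 2: I raises G by 0.5, yet the two are negatively correlated
G2 <- 0.5 * I - 2 * U + rnorm(n)
cor_case2 <- cor(I, G2)
adj_case2 <- coef(lm(G2 ~ I + U))[["I"]]    # close to the true effect, 0.5
```

Conditioning on U recovers the effect here only because the simulation makes U observable; with real data that is exactly what a research design has to justify.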
Four arguments. For each one you should identify the:
In developing countries that discover natural resources, such as oil, the ruling elite can extract wealth without needing to tax citizens and develop the state apparatus. Because the state does not rely on taxation for government revenue, it does not need to set up accountability structures or extend its reach and citizens do not feel that they have ownership over the state. The state therefore becomes both less democratic and weaker than if it had not discovered the resources.
Rich countries are more likely to be democratic for the simple reason that when people become wealthier they refuse to be dictated to by others and they demand a role in government. The marginal effects of income increases are greater for poorer countries because the impacts on education are greatest at these levels. You can test this proposition by exploiting natural variation in commodity prices, which provides shocks to national income, especially for countries dependent on primary commodity exports.
When countries increase trade (imports and exports), the returns to economic factors (such as labor, land and capital) are affected differently. Specifically, the returns to factors that are the most abundant are positive, while the returns to factors that are the most scarce are negative. Therefore, the relative factor endowments of a country will predict what sort of political coalitions will form (e.g., Land versus Labor + Capital) and which groups will favor free trade policies.
In democratic states, leaders are accountable for any losses incurred as a result of the wars that they enter into. Two states with democratic leaders are also more likely to share a common set of norms, and to engage in trade with one another. Therefore, two democracies are far less likely to enter into war with one another than a democracy and a non-democracy, or two non-democracies.
Four elements of any research design:
M: DAGs, game theoretic models
I: ATEs, CATEs, COEs, models
D: Sampling schemes, assignment schemes, text analysis, interview
A: Experiment, observational, quantitative, qualitative
Declaration: Telling the computer what M, I, D, and A are.
Diagnosis: Estimating “diagnosands” like power, bias, RMSE, error rates, ethical harm, amount learned.
Redesign: Fine-tuning features of the data and answer strategies to understand how they change the diagnosands
Different sample sizes
Different randomization procedures
Different estimation strategies
Implementation: effort into compliance versus more effort into sample size
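To make "diagnosis" and "redesign" concrete, here is a hand-rolled sketch in base R. It is not how DeclareDesign implements these steps; the function names `simulate_once` and `diagnose_by_hand`, the effect size of 0.5, and the two sample sizes are all assumptions chosen for illustration. The idea is simply to simulate the design many times, summarize the estimates as diagnosands, and then repeat under a different sample size.

```r
set.seed(2)

# One run of a hypothetical two-arm design with true effect b
simulate_once <- function(N, b = 0.5) {
  U <- rnorm(N)
  Z <- rbinom(N, 1, 0.5)          # simple random assignment
  Y <- b * Z + U
  fit <- summary(lm(Y ~ Z))$coefficients["Z", ]
  c(estimate = fit[["Estimate"]], p = fit[["Pr(>|t|)"]])
}

# Diagnosis: summarize many runs as bias, RMSE, and power
diagnose_by_hand <- function(N, sims = 500, b = 0.5) {
  runs <- replicate(sims, simulate_once(N, b))
  est  <- runs["estimate", ]
  c(N     = N,
    bias  = mean(est) - b,
    rmse  = sqrt(mean((est - b)^2)),
    power = mean(runs["p", ] < 0.05))
}

# Redesign: the same diagnosis under two different sample sizes
results <- rbind(diagnose_by_hand(100), diagnose_by_hand(400))
results
```

Comparing the two rows shows the trade-off a redesign is meant to expose: a larger sample buys lower RMSE and higher power while bias stays near zero in both cases.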
declare_model()
declare_inquiry()
declare_assignment()
declare_measurement()
declare_estimator()
and there are more declare_ functions!
draw_data(design)
draw_estimands(design)
draw_estimates(design)
get_estimates(design, data)
run_design(design), simulate_design(design)
diagnose_design(design)
redesign(design, N = 200)
design |> redesign(N = c(200, 400)) |> diagnose_designs()
compare_designs(), compare_diagnoses()
https://raw.githubusercontent.com/rstudio/cheatsheets/master/declaredesign.pdf
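library(DeclareDesign)  # also attaches randomizr, fabricatr, and estimatr

N <- 100  # sample size
b <- .5   # true treatment effect

design <-
  # M: potential outcomes depend on treatment Z and noise U
  declare_model(N = N, U = rnorm(N),
                potential_outcomes(Y ~ b * Z + U)) +
  # D: assign Z by simple randomization and reveal the observed Y
  declare_assignment(Z = simple_ra(N), Y = reveal_outcomes(Y ~ Z)) +
  # I: the average treatment effect
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  # A: regress Y on Z using lm_robust
  declare_estimator(Y ~ Z, inquiry = "ate", .method = lm_robust)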
You now have a two-arm design object in memory!
If you just type design it will run the design, which is a good check to make sure the design has been declared properly.
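Once a design is declared, the functions listed earlier can be applied to it in turn. The sketch below repeats the two-arm declaration so that it runs on its own; the seed and the small `sims` values are assumptions made to keep it fast, not recommendations. Output tables like those that follow are the kind of thing these calls return, though the exact numbers will differ from run to run.

```r
library(DeclareDesign)
set.seed(3)  # arbitrary seed, only for reproducibility of this sketch

N <- 100
b <- 0.5

design <-
  declare_model(N = N, U = rnorm(N),
                potential_outcomes(Y ~ b * Z + U)) +
  declare_assignment(Z = simple_ra(N), Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, inquiry = "ate", .method = lm_robust)

data      <- draw_data(design)        # one simulated dataset
estimands <- draw_estimands(design)   # value of the inquiry in one draw
estimates <- draw_estimates(design)   # estimate from a fresh draw of data

# Apply the answer strategy to data you already have in hand
estimates_from_data <- get_estimates(design, data)

# A few complete runs of the design, then a (small) diagnosis
sims      <- simulate_design(design, sims = 3)
diagnosis <- diagnose_design(design, sims = 100)

head(data)
diagnosis
```

Note that `draw_estimates(design)` and `get_estimates(design, data)` generally give different numbers, since the first draws new data while the second uses the dataset supplied.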
ID | U | Y_Z_0 | Y_Z_1 | Z | Y |
---|---|---|---|---|---|
001 | 0.8949488 | 0.8949488 | 1.3949488 | 1 | 1.3949488 |
002 | -0.0574970 | -0.0574970 | 0.4425030 | 1 | 0.4425030 |
003 | 0.9280977 | 0.9280977 | 1.4280977 | 0 | 0.9280977 |
004 | 0.3762109 | 0.3762109 | 0.8762109 | 1 | 0.8762109 |
005 | -0.7357462 | -0.7357462 | -0.2357462 | 0 | -0.7357462 |
006 | -0.7711031 | -0.7711031 | -0.2711031 | 1 | -0.2711031 |
inquiry | estimand |
---|---|
ate | 0.5 |
estimator | term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome | inquiry |
---|---|---|---|---|---|---|---|---|---|---|
estimator | Z | 0.8 | 0.19 | 4.22 | 0 | 0.43 | 1.18 | 98 | Y | ate |
estimator | term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome | inquiry |
---|---|---|---|---|---|---|---|---|---|---|
estimator | Z | 0.67 | 0.21 | 3.23 | 0 | 0.26 | 1.09 | 98 | Y | ate |
design | sim_ID | inquiry | estimand | estimator | term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
design | 1 | ate | 0.5 | estimator | Z | 0.95 | 0.21 | 4.60 | 0.00 | 0.54 | 1.37 | 98 | Y |
design | 2 | ate | 0.5 | estimator | Z | 0.51 | 0.21 | 2.35 | 0.02 | 0.08 | 0.93 | 98 | Y |
design | 3 | ate | 0.5 | estimator | Z | 0.57 | 0.20 | 2.83 | 0.01 | 0.17 | 0.97 | 98 | Y |
Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
---|---|---|---|---|---|
0.47 | -0.03 | 0.17 | 0.18 | 0.65 | 0.99 |
(0.02) | (0.02) | (0.01) | (0.01) | (0.04) | (0.01) |
diagnosand | mean_1 | mean_2 | mean_difference | conf.low | conf.high |
---|---|---|---|---|---|
mean_estimand | 0.50 | 0.50 | 0.00 | 0.00 | 0.00 |
mean_estimate | 0.48 | 0.50 | 0.02 | -0.01 | 0.04 |
bias | -0.02 | 0.00 | 0.02 | -0.01 | 0.04 |
sd_estimate | 0.28 | 0.20 | -0.08 | -0.10 | -0.06 |
rmse | 0.28 | 0.20 | -0.08 | -0.10 | -0.06 |
power | 0.38 | 0.71 | 0.32 | 0.26 | 0.37 |
coverage | 0.97 | 0.96 | -0.01 | -0.04 | 0.01 |
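A comparison table of this kind can be produced by redesigning a declared design and diagnosing the variants side by side. The following is a sketch under assumptions: the sample sizes of 50 and 100, the seed, and the low `sims` value are chosen for illustration and are not taken from the slides.

```r
library(DeclareDesign)
set.seed(4)  # arbitrary seed, only for reproducibility of this sketch

N <- 100
b <- 0.5

design <-
  declare_model(N = N, U = rnorm(N),
                potential_outcomes(Y ~ b * Z + U)) +
  declare_assignment(Z = simple_ra(N), Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, inquiry = "ate", .method = lm_robust)

# Redesign: two variants of the same design that differ only in N
designs <- redesign(design, N = c(50, 100))

# Diagnose both variants together
diagnoses <- diagnose_designs(designs, sims = 100)

# Compare diagnosands of the two variants directly
comparison <- compare_diagnoses(designs[[1]], designs[[2]], sims = 100)

diagnoses
comparison
```

With so few simulations the diagnosands are noisy; in practice you would raise `sims` until the bootstrapped standard errors are small enough for the decision at hand.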
Always work from a folder where your work automatically backs up.
Have analysis files integrated with writing files: .qmd and .Rmd are fabulous for this.
Be able to replicate all data work and analysis with one click.
Outsource formatting. .tex, .qmd, and .Rmd automatically format. If you use Word, use their “Styles”.
For citations, write @putnam2000bowling (in Rmd), which produces Putnam (2000) and handles the formatting. Other tools work similarly. Don’t do this by hand.
Have all files (writing files, data files, additional analysis files or images, etc.) in a single directory with relative references.
Use dated file names such as 20201005_paper.Rmd, keep a 0_archive folder, and back up old copies regularly (not so important if you have good versioning).
See examples in sample_project.
You should generate your replication files at the same time as your analysis: .html via .qmd or .Rmd.
Some examples:
pacman
Nothing local, everything relative: please do not include hardcoded paths to your computer.
First best: if someone has access to your .Rmd/.qmd file they can hit render or compile and the whole thing reproduces first time.
But: often you need ancillary files for data and code. That’s OK, but the aim should still be that with a self-contained folder someone can open a master.Rmd file, hit compile and get everything. I usually have an input and an output subfolder.
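A minimal sketch of the "everything relative" layout in base R follows. The folder names `input` and `output` match the convention just described, but the file names and the toy data are hypothetical; the small data frame stands in for raw data that would normally already sit in the input folder.

```r
# All paths are relative to the project folder; nothing points to a
# specific computer. File names below are hypothetical.
input_dir  <- "input"
output_dir <- "output"
dir.create(input_dir,  showWarnings = FALSE)
dir.create(output_dir, showWarnings = FALSE)

# Stand-in for raw data that would normally already be in input/
raw_path <- file.path(input_dir, "raw_data.csv")
write.csv(data.frame(Z = c(0, 1, 0, 1),
                     Y = c(1.0, 1.6, 0.8, 1.4)),
          raw_path, row.names = FALSE)

# Analysis reads from input/ and writes results to output/
raw <- read.csv(raw_path)
fit <- lm(Y ~ Z, data = raw)

results_path <- file.path(output_dir, "estimates.csv")
write.csv(as.data.frame(coef(summary(fit))), results_path)
```

Because every path is built with `file.path()` from the project root, the same script runs unchanged when the folder is copied to another machine, provided the working directory is the project folder.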
Resources and ideas from the Institute for Replication: https://i4replication.org/reproducibility.html