Writing Seminar: Orientation

May 2024

Macartan Humphreys

1 Today

  1. Intros
  2. Lecture
  3. Individual work
  4. Discussion
  5. Perhaps: DD deeper dive

2 Lecture outline

  • General aims and structure
  • Expectations
  • Pointers for coding
  • Topics / refreshers
    • Papers: structures and processes
    • DAGs, variables, identification strategies
    • Designs and design diagnosis

2.1 Aims

  • Write a paper
  • Follow open science principles

2.2 Expectations

  1. Introduction (today!)
    1. Task: Describe the design informally in MIDA terms
    2. Task: Identify relevant literature and a model article
  2. Design
    1. Task: Read one piece from relevant literature of at least 2 classmates
    2. Task: Formally “declare” the design, present the declaration
    3. Task: Situate the contribution, present the contribution
  3. Writing from inside out
    1. Task: Before class: Only if relevant for your project: register the design before implementing analysis. This is not relevant if you have already implemented analysis.
    2. Task: Before class: Share abstract, introduction, and paper outline, identifying any additional analyses
    3. Task: In class: Present Main Results, table or figure

2.3 Expectations (continued)

  4. Draft 1 shared and discussed
    1. Task: Before class: Share draft 1
    2. Task: Read draft of at least 2 classmates
  5. Final class presentations (Date TBD)
    1. Task: Incorporate comments, complete a draft
    2. Task: Present in class using slides

2.4 Some readings

Useful references on design and structuring research: Blair, Coppock, and Humphreys (2023); King, Keohane, and Verba (2021); Lipson (2018); Van Evera (1997).

Writing: people quibble, but there is a lot of wisdom in Strunk Jr and White (2007).

2.5 Full refs

[1] G. Blair, A. Coppock, and M. Humphreys. Research design in the social sciences: declaration, diagnosis, and redesign. Princeton University Press, 2023.

[2] G. King, R. O. Keohane, and S. Verba. Designing social inquiry: Scientific inference in qualitative research. Princeton University Press, 2021.

[3] C. Lipson. How to write a BA thesis: A practical guide from your first ideas to your finished paper. University of Chicago Press, 2018.

[4] W. Strunk Jr and E. B. White. The elements of style illustrated. Penguin, 2007.

[5] S. Van Evera. Guide to methods for students of political science. Cornell University Press, 1997.

3 Topics 1: Papers

3.1 Classic paper structure

https://macartan.github.io/teaching/how-to-write

Classic structure

  1. Motivation
  2. Theory
  3. Strategy (perhaps descriptives here)
  4. Main results
  5. Discussion / deepening
  6. Conclusion

3.2 Classic paper stages

  1. Come up with a question
  2. Come up with an answer strategy, find a data source
  3. Develop the design
  4. Present the design, modify as needed
  5. Register the design
  6. Get ethics approval if needed
  7. Implement data strategy
  8. Implement answer strategy; generate replication material in parallel
  9. Generate slides and present to colleagues
  10. Submit

4 Models: Arguments as DAGs

4.1 An argument:

Here is a complete, albeit barebones (and possibly incorrect), argument:

  • Good institutions (I) cause economic growth (G), except in countries with large stocks of natural resources (N)
  • The reason is that institutions encourage people to invest (V), which spurs growth (this effect does not kick in in natural-resource-rich countries, where people just live off rents)
  • Growth also makes it easier to maintain good institutions, which creates a virtuous cycle
  • Being an ally (A) of the US helps growth, but it can corrupt domestic institutions.
  • Historically, places with climates (C) suitable for colonizers to settle had better institutions. These climatic conditions are otherwise irrelevant for contemporary economic growth.

4.2 Some counterarguments:

  • Places with climates suitable for colonizers benefited from better access to international markets which led to growth.
  • Good soil is also important for growth!
  • Good institutions also make sure that investments yield greater returns and that’s what causes growth

4.3 Questions on Nodes

  • What are the dependent variables?
  • What are the independent variables?
  • What are the mediating variables?
  • What are the conditioning variables?
  • What are the confounding variables?
  • What are the instrumental variables?
  • Graph the relations between the variables.

4.4 Questions on Inference

  • Say I and G are positively correlated. Does this mean that I causes G?

  • Say I and G are negatively correlated. Does this mean that I does not cause G?

  • How might you estimate the effect of I on G?

  • How does C help establish the link between I and G?

  • Where is the theory? Is it equivalent to the graph or is it something else that generates the graph?

  • How might you check if the proposed theory is correct?

  • Which of the counterarguments are strong and why?

4.5 A graph
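
One possible reading of the argument in 4.1, sketched in code. This is an illustration under assumptions, not the definitive graph: it uses the dagitty package (not otherwise used in these slides), it omits the feedback from G to I because a DAG cannot contain cycles, and it draws the moderating role of N simply as an arrow into G.

```r
library(dagitty)

# Assumed edges, read off the bullet points in 4.1:
#   C -> I       climate shaped historical institutions
#   I -> V -> G  institutions encourage investment, which spurs growth
#   N -> G       natural resources matter for growth (here: as a moderator)
#   A -> I, A -> G  US alliance affects both institutions and growth
g <- dagitty("dag {
  C -> I
  I -> V
  V -> G
  N -> G
  A -> I
  A -> G
}")

isAcyclic(g)      # confirm the graph has no cycles
parents(g, "G")   # direct causes of growth in this graph

# Sets of variables that block backdoor paths from I to G
adjustmentSets(g, exposure = "I", outcome = "G")

# Variables that qualify as instruments for the effect of I on G
instrumentalVariables(g, exposure = "I", outcome = "G")
```

Adding the counterarguments from 4.2 (for example an arrow from C to G through market access) and re-running the last two lines shows how the answers to the questions in 4.3 and 4.4 depend on the graph you believe.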

4.6 Exercises: Dissect these arguments

Four arguments. For each one you should identify the:

  • type of argument (effect of \(X\), cause of \(Y\), effect of \(X\) on \(Y\))
  • unit of analysis
  • dependent variable(s)
  • independent variable(s)
  • mediator(s)
  • possible conditioning variable(s)
  • possible confounder(s)
  • possible identification strategy
  • relevant key agent(s) (actor(s))
  • measurement strategy

4.7 A. Natural resources and conflict

In developing countries that discover natural resources, such as oil, the ruling elite can extract wealth without needing to tax citizens and develop the state apparatus. Because the state does not rely on taxation for government revenue, it does not need to set up accountability structures or extend its reach and citizens do not feel that they have ownership over the state. The state therefore becomes both less democratic and weaker than if it had not discovered the resources.

4.8 B. Democracy and growth

Rich countries are more likely to be democratic for the simple reason that when people become wealthier they refuse to be dictated to by others and they demand a role in government. The marginal effects of income increases are greater for poorer countries because the impacts on education are greatest at these levels. You can test this proposition by exploiting natural variation in commodity prices, which provides shocks to national income, especially for countries dependent on primary commodity exports.

4.9 C. Factor Endowments and Coalitions

When countries increase trade (imports and exports), the returns to economic factors (such as labor, land and capital) are affected differently. Specifically, the returns to factors that are the most abundant are positive, while the returns to factors that are the most scarce are negative. Therefore, the relative factor endowments of a country will predict what sort of political coalitions will form (e.g., Land versus Labor + Capital) and which groups will favor free trade policies.

4.10 D. Democratic peace

In democratic states, leaders are accountable for any losses incurred as a result of the wars that they enter into. Two states with democratic leaders are also more likely to share a common set of norms, and to engage in trade with one another. Therefore, two democracies are far less likely to enter into war with one another than a democracy and a non-democracy, or two non-democracies.

5 Design declaration and diagnosis

book: https://book.declaredesign.org/

5.1 The MIDA Framework

Four elements of any research design:

  • Model: set of models of what causes what and how
  • Inquiry: a question stated in terms of the model
  • Data strategy: the set of procedures we use to gather information from the world (sampling, assignment, measurement)
  • Answer strategy: how we summarize the data produced by the data strategy

5.2 Four elements of any research design

5.3 Examples of MIDA elements

  • M: DAGs, game theoretic models
  • I: ATEs, CATEs, COEs, models
  • D: Sampling schemes, assignment schemes, text analysis, interview
  • A: Experiment, observational, quantitative, qualitative:
    • Conditioning on observables
    • Difference in differences
    • RDD
    • Instrumental variables

5.4 Declaration, Diagnosis, Redesign cycle

  • Declaration: Telling the computer what M, I, D, and A are.

  • Diagnosis: Estimating “diagnosands” like power, bias, rmse, error rates, ethical harm, amount learned.

  • Redesign: Fine-tuning features of the data and answer strategies to understand how they change the diagnosands

    • Different sample sizes
    • Different randomization procedures
    • Different estimation strategies
    • Implementation: effort into compliance versus more effort into sample size
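
A minimal sketch of what "estimating diagnosands" can look like in code. The design, the object name my_diagnosands, and the particular diagnosands chosen are illustrative assumptions; diagnose_design() has its own defaults if you supply none.

```r
library(DeclareDesign)

# A small two-arm design (illustrative values: N = 100, true effect 0.5)
design <-
  declare_model(N = 100, U = rnorm(N),
                potential_outcomes(Y ~ 0.5 * Z + U)) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ate", .method = lm_robust)

# Diagnosands are summaries over many simulated runs of the design
my_diagnosands <- declare_diagnosands(
  bias  = mean(estimate - estimand),
  rmse  = sqrt(mean((estimate - estimand)^2)),
  power = mean(p.value <= 0.05)
)

diagnosis <- diagnose_design(design, diagnosands = my_diagnosands, sims = 100)
diagnosis
```

Diagnosands such as ethical harm or amount learned have no built-in formula; you would have to define them yourself as functions of the simulated data.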

5.5 In code: Key commands for making a design

  • declare_model()
  • declare_inquiry()
  • declare_assignment()
  • declare_measurement()
  • declare_estimator()

and there are more declare_ functions!
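
For instance, a sketch of a design that uses one step per element, including declare_sampling(), which is not in the list above. The population size, sample size, and effect size are made-up values for illustration.

```r
library(DeclareDesign)

design_full <-
  # M: a population of 1000 units with potential outcomes
  declare_model(N = 1000, U = rnorm(N),
                potential_outcomes(Y ~ 0.5 * Z + U)) +
  # I: the population average treatment effect (defined before sampling)
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  # D: sample 100 units, assign half to treatment, measure outcomes
  declare_sampling(S = complete_rs(N, n = 100)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  # A: regression with robust standard errors
  declare_estimator(Y ~ Z, inquiry = "ate", .method = lm_robust)

# The data handed to the estimator contain only the sampled units
nrow(draw_data(design_full))
```

Because the inquiry is declared before the sampling step, the estimand refers to all 1000 units, not just the 100 sampled.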

5.6 In code: Key commands for using a design

  • draw_data(design)
  • draw_estimands(design)
  • draw_estimates(design)
  • get_estimates(design, data)
  • run_design(design), simulate_design(design)
  • diagnose_design(design)
  • redesign(design, N = 200)
  • design |> redesign(N = c(200, 400)) |> diagnose_designs()
  • compare_designs(), compare_diagnoses()

5.7 Cheat sheet

https://raw.githubusercontent.com/rstudio/cheatsheets/master/declaredesign.pdf

5.8 A simple design

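library(DeclareDesign)  # also attaches fabricatr, randomizr, and estimatr
library(knitr)          # for kable(), used to print tables

N <- 100   # number of units
b <- .5    # true treatment effect

design <- 
  # M: unit-level noise U and potential outcomes Y_Z_0, Y_Z_1
  declare_model(N = N, U = rnorm(N), 
                potential_outcomes(Y ~ b * Z + U)) + 
  # D: assign Z by coin flip, then reveal the observed outcome Y
  declare_assignment(Z = simple_ra(N), Y = reveal_outcomes(Y ~ Z)) + 
  # I: the average treatment effect
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) + 
  # A: regression of Y on Z with robust standard errors
  declare_estimator(Y ~ Z, inquiry = "ate", .method = lm_robust)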

You now have a two-arm design object in memory!

If you just type design, R runs the design and prints a summary: a good check that the design has been declared properly.

5.9 Make data from the design

data <- draw_data(design)

data |> head() |> kable()

|ID  |          U|      Y_Z_0|      Y_Z_1|  Z|          Y|
|:---|----------:|----------:|----------:|--:|----------:|
|001 |  0.8949488|  0.8949488|  1.3949488|  1|  1.3949488|
|002 | -0.0574970| -0.0574970|  0.4425030|  1|  0.4425030|
|003 |  0.9280977|  0.9280977|  1.4280977|  0|  0.9280977|
|004 |  0.3762109|  0.3762109|  0.8762109|  1|  0.8762109|
|005 | -0.7357462| -0.7357462| -0.2357462|  0| -0.7357462|
|006 | -0.7711031| -0.7711031| -0.2711031|  1| -0.2711031|

5.10 Draw estimands

draw_estimands(design) |>
  kable(digits = 2)

|inquiry | estimand|
|:-------|--------:|
|ate     |      0.5|

5.11 Draw estimates

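draw_estimates(design) |>
  kable(digits = 2)

|estimator |term | estimate| std.error| statistic| p.value| conf.low| conf.high| df|outcome |inquiry |
|:---------|:----|--------:|---------:|---------:|-------:|--------:|---------:|--:|:-------|:-------|
|estimator |Z    |      0.8|      0.19|      4.22|       0|     0.43|      1.18| 98|Y       |ate     |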

5.12 Get estimates

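get_estimates(design, data) |>
  kable(digits = 2)

|estimator |term | estimate| std.error| statistic| p.value| conf.low| conf.high| df|outcome |inquiry |
|:---------|:----|--------:|---------:|---------:|-------:|--------:|---------:|--:|:-------|:-------|
|estimator |Z    |     0.67|      0.21|      3.23|       0|     0.26|      1.09| 98|Y       |ate     |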

5.13 Simulate design

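simulate_design(design, sims = 3) |>
  kable(digits = 2)

|design | sim_ID|inquiry | estimand|estimator |term | estimate| std.error| statistic| p.value| conf.low| conf.high| df|outcome |
|:------|------:|:-------|--------:|:---------|:----|--------:|---------:|---------:|-------:|--------:|---------:|--:|:-------|
|design |      1|ate     |      0.5|estimator |Z    |     0.95|      0.21|      4.60|    0.00|     0.54|      1.37| 98|Y       |
|design |      2|ate     |      0.5|estimator |Z    |     0.51|      0.21|      2.35|    0.02|     0.08|      0.93| 98|Y       |
|design |      3|ate     |      0.5|estimator |Z    |     0.57|      0.20|      2.83|    0.01|     0.17|      0.97| 98|Y       |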

5.14 Diagnose design

design |> 
  diagnose_design(sims = 100)

 Mean Estimate    Bias  SD Estimate    RMSE   Power  Coverage
          0.47   -0.03         0.17    0.18    0.65      0.99
        (0.02)  (0.02)       (0.01)  (0.01)  (0.04)    (0.01)

5.15 Redesign

new_design <- design |> redesign(b = 0)

  • Modify any arguments that are explicitly called on by design steps.
  • Or add, remove, or replace steps
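
A sketch of both routes, assuming a two-arm design like the one in 5.8 but with assignment and measurement declared as separate steps so that each can be swapped on its own. The object names and step labels are illustrative.

```r
library(DeclareDesign)

N <- 100
b <- 0.5

design <-
  declare_model(N = N, U = rnorm(N),
                potential_outcomes(Y ~ b * Z + U)) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = simple_ra(N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ate", .method = lm_robust, label = "ols")

# Modify an argument: set the true effect to zero
design_null <- redesign(design, b = 0)

# Add a step: a second estimator targeting the same inquiry
design_two <- design +
  declare_estimator(Y ~ Z, inquiry = "ate",
                    .method = difference_in_means, label = "dim")

# Replace a step: swap simple for complete random assignment (step 3)
design_complete <- replace_step(
  design, step = 3,
  new_step = declare_assignment(Z = complete_ra(N))
)

# Remove a step: drop the estimator (step 5)
design_no_estimator <- delete_step(design, step = 5)
```

Steps are referred to here by position; counting positions is easy to get wrong in longer designs, so print the design first to check the order.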

5.16 Compare designs

redesign(design, N = 50) |>
  compare_diagnoses(design)

diagnosand      mean_1  mean_2  mean_difference  conf.low  conf.high
mean_estimand     0.50    0.50             0.00      0.00       0.00
mean_estimate     0.48    0.50             0.02     -0.01       0.04
bias             -0.02    0.00             0.02     -0.01       0.04
sd_estimate       0.28    0.20            -0.08     -0.10      -0.06
rmse              0.28    0.20            -0.08     -0.10      -0.06
power             0.38    0.71             0.32      0.26       0.37
coverage          0.97    0.96            -0.01     -0.04       0.01

6 Topics 3: Workflow

6.1 Principles

  • Always work from a folder where your work automatically backs up.

    • Dropbox, Drive, many others
  • Have analysis files integrated with writing files

    • qmd, Rmd fabulous for this
    • Set it up so that you can quickly do reality checks on data and analysis
  • Be able to replicate all data work and analysis with one click

  • Outsource formatting.

    • e.g. .tex, .qmd, and .Rmd files format automatically. If you use Word, use its “Styles”
    • Bibliographies: use bibtex or similar. Keep a bibliography file and cite entries by key, like this: @putnam2000bowling (in Rmd), which produces Putnam (2000) and handles the formatting. Other tools work similarly. Don’t do this by hand.

6.2 Folders and files.

  • Have all files (writing files, data files, additional analysis files or images, etc.) in a single directory with relative references

    • Number your folders.
    • Have few files in each folder.
    • I often label files with date: 20201005_paper.Rmd
    • Compile regularly and have a readable compiled file beside your work file
    • Keep your main document clean
    • Have an archive folder 0_archive and back up old copies regularly (not so important if you have good versioning)
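
A small sketch of how such a skeleton could be set up from R. Apart from 0_archive, the folder names are hypothetical, and tempdir() stands in for wherever your backed-up project folder lives.

```r
# Hypothetical project root; replace tempdir() with your synced folder
project <- file.path(tempdir(), "my_paper")

# Numbered folders keep a stable order in the file browser
folders <- c("0_archive", "1_input", "2_code", "3_output")
for (f in folders) {
  dir.create(file.path(project, f), recursive = TRUE, showWarnings = FALSE)
}

# Date-stamped working file, in the style of 20201005_paper.Rmd
stamp <- format(Sys.Date(), "%Y%m%d")
paper <- file.path(project, paste0(stamp, "_paper.Rmd"))
file.create(paper)

list.files(project)
```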

6.3 Examples

See examples in sample_project

6.4 Tasks

  • Keep a to do list
  • If you use GitHub it is great to use “issues” to keep track of to-dos
  • Try to complete well-defined tasks in one sitting. Work in chunks.
  • If you have a repetitive task there is probably a way to automate it: ask for help
  • If you have a conceptually hard task that you are not making progress on, stop, move away, and try it from a whole new angle
  • Order your tasks: figure out whether you work better linearly or doing parallel work.
  • Cross off tasks when done
  • Don’t be afraid to discard work
  • Be ambitious but don’t let the best be the enemy of the good.

7 Pointers for coding

You should generate your replication files at the same time as your analysis.

7.1 How to: Replication files

  • Best kept in self-contained documents for easy third-party viewing, e.g. .html rendered from .qmd or .Rmd

Some examples:

7.2 Good coding rules

  • Metadata first
  • Call packages at the beginning: use pacman
  • Put options at the top
  • Call all data files once, at the top. Best to call directly from a public archive, when possible.
  • Use functions and define them at the top: comment them; useful sometimes to illustrate what they do
  • Replicate first, re-analyze second. Use sections.
  • Have subsections named after specific tables, figures or analyses
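
One way a file that follows these rules might be laid out. Everything specific here is a placeholder: the metadata fields, the seed, the simulated stand-in data, the helper diff_means(), and the section names.

```r
# --- Metadata ---------------------------------------------------------
# Title:   Replication code for <paper title>   (placeholder)
# Author:  <name>
# Date:    <date>

# --- Packages ---------------------------------------------------------
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(knitr)   # installs if missing, then loads

# --- Options ----------------------------------------------------------
set.seed(20240501)      # arbitrary seed so results reproduce
options(digits = 3)

# --- Data (called once, here) -----------------------------------------
# Stand-in data; in practice read from a public archive or an input folder
dat <- data.frame(Z = rep(0:1, times = 50))
dat$Y <- 0.5 * dat$Z + rnorm(nrow(dat))

# --- Functions --------------------------------------------------------
# diff_means(): difference in mean outcome between treated and control
diff_means <- function(data) {
  mean(data$Y[data$Z == 1]) - mean(data$Y[data$Z == 0])
}
diff_means(dat)   # illustrate what the function does

# --- 1 Replication ----------------------------------------------------
# --- 1.1 Table 1: main estimate ---------------------------------------
table_1 <- coef(summary(lm(Y ~ Z, data = dat)))
kable(table_1, digits = 2)

# --- 2 Re-analysis ----------------------------------------------------
# Additional analyses go here, after the replication sections
```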

7.4 Aim

Nothing local, everything relative: please do not include hardcoded paths to your computer.

  • First best: if someone has access to your .Rmd/.qmd file they can hit render or compile and the whole thing reproduces first time.

  • But: often you need ancillary files for data and code. That’s OK, but the aim should still be that, with a self-contained folder, someone can open a master.Rmd file, hit compile, and get everything. I usually have an input and an output subfolder.
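
A minimal sketch of the input/output pattern with relative paths only. The file names and the toy data are made up; the toy file is written first only so the sketch runs end to end.

```r
# Paths are relative to the project folder (the one holding master.Rmd)
input_dir  <- "input"
output_dir <- "output"
dir.create(input_dir,  showWarnings = FALSE)
dir.create(output_dir, showWarnings = FALSE)

# Toy data, created here only so that there is something to read
toy <- data.frame(id = 1:4, Z = c(0, 1, 0, 1), Y = c(1.0, 2.0, 1.5, 2.5))
write.csv(toy, file.path(input_dir, "toy_data.csv"), row.names = FALSE)

# Good: a relative path that works on any machine
dat <- read.csv(file.path(input_dir, "toy_data.csv"))
# Avoid: read.csv("C:/Users/me/Desktop/project/input/toy_data.csv")

# Analysis results go to the output folder, again via a relative path
group_means <- aggregate(Y ~ Z, data = dat, FUN = mean)
write.csv(group_means, file.path(output_dir, "group_means.csv"),
          row.names = FALSE)
```

file.path() builds the path with the right separator for the operating system, so the same code runs on Windows, macOS, and Linux.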

Resources and ideas from the Institute for Replication: https://i4replication.org/reproducibility.html

8 References

Blair, Graeme, Alexander Coppock, and Macartan Humphreys. 2023. Research Design in the Social Sciences: Declaration, Diagnosis, and Redesign. Princeton University Press.
King, Gary, Robert O Keohane, and Sidney Verba. 2021. Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton University Press.
Lipson, Charles. 2018. How to Write a BA Thesis: A Practical Guide from Your First Ideas to Your Finished Paper. University of Chicago Press.
Putnam, Robert D. 2000. “Bowling Alone: America’s Declining Social Capital.” In Culture and Politics, 223–34. Springer.
Strunk Jr, William, and Elwyn Brooks White. 2007. The Elements of Style Illustrated. Penguin.
Van Evera, Stephen. 1997. Guide to Methods for Students of Political Science. Cornell University Press.