Development strategies: Orientation, May 2023

Macartan Humphreys

Today

  • General aims and structure
  • Expectations
  • Pointers for replication and re-analysis
  • Three topics / refreshers?
    • Basic growth models
    • DAGs, variables, identification strategies
    • Designs and design diagnosis

General aims and structure

Aims

  • Engage deeply with fresh, cutting-edge work (as well as some older canonical work)
  • Develop skills in empirical methods
  • Exposure to open science practices

Key idea: you (often) can’t fully evaluate a piece of work until you have looked at the data and done the analysis yourself.

Syllabus

is here: https://macartan.github.io/teaching/ds-hu-2023

see also the replication repo: https://macartan.github.io/ds_hu_2003_reps/

The readings: 1 Macro processes

Reading Data
1.1 Daron Acemoglu, Simon Johnson, and James A. Robinson. The Colonial Origins of Comparative Development: An Empirical Investigation. AER (2001). Data
1.2 James Fearon and David D. Laitin. Ethnicity, Insurgency, and Civil War. APSR (2003). Data
1.3 Nathan Nunn. The Long-Term Effects of Africa's Slave Trades. QJE (2008). Data: 1, 2
1.4 Daron Acemoglu, Simon Johnson, James A. Robinson, and Pierre Yared. Income and Democracy. AER (2008). Data

The readings: 2 Group politics

Reading Data
2.1 Alberto Alesina, Paola Giuliano, and Nathan Nunn. On the Origins of Gender Roles: Women and the Plough. QJE (2013). Data
2.2 Raghabendra Chattopadhyay and Esther Duflo. Women as Policy Makers: Evidence from a Randomized Policy Experiment in India. Econometrica (2004). Data
2.3 Salma Mousa. Building Social Cohesion Between Christians and Muslims. Science (2020). Data
2.4 Saad Gulzar, Nicholas Haas, and Benjamin Pasquale. Does Political Affirmative Action Work, and for Whom? Theory and Evidence on India's Scheduled Areas. APSR (2020). Data

The readings: 3 Institutions and accountability

Reading Data
3.1 Guy Grossman, Kristin G. Michelitch, and Carlo Prato. The Effect of Sustained Transparency on Electoral Accountability. AJPS (2023). Data
3.2 Claudio Ferraz and Frederico Finan. Electoral Accountability and Corruption: Evidence from the Audits of Local Governments. AER (2011). Data
3.3 Pia J. Raffler. Does Political Oversight of the Bureaucracy Increase Accountability? Field Experimental Evidence from a Dominant Party Regime. APSR (2022). Data
3.4 Thomas Fujiwara and Leonard Wantchekon. Can Informed Public Deliberation Overcome Clientelism? Experimental Evidence from Benin. AEJ (2013). Data

The readings: 4 Aid and interventions

Reading Data
4.1 Nathan Nunn and Nancy Qian. U.S. Food Aid and Civil Conflict. AER (2014). Data
4.2 Robert Blair, Jessica Di Salvatore, and Hannah Smidt. UN Peacekeeping and Democratization in Conflict-Affected Countries. APSR (2023). Data
4.3 Christopher Blattman and Jeannie Annan. Can Employment Reduce Lawlessness and Rebellion? A Field Experiment with High-Risk Men in a Fragile State. (2015). Data
4.4 Karthik Muralidharan, Paul Niehaus, and Sandip Sukhtankar. Building State Capacity: Evidence from Biometric Smartcards in India. AER (2016). https://doi.org/10.1257/aer.20141346. Data

Expectations

Expectations

  • 5 tasks
  • Work in four “rep teams”: 1 team per session \(\times 4\)
  • Prepare a research design or short paper, perhaps building on existing work. Typically this contains: (i) a theoretical argument or motivation, (ii) a proposed empirical test of that argument, (iii) a formal design object, and (iv) a discussion of policy prescriptions that might result from the argument. We will discuss these in the final days of class. Focus on something that might turn into a research project.
  • Plus general reading and participation. Only four papers per session: read them all carefully!

Rep team job

  • Engage with both methods and substance of paper
  • Access data
  • Run basic replication of main results
  • Draft a pre-re-analysis plan
  • Implement new analyses
  • Share clean re-analysis files with class via Moodle
  • Prepare a “replicable presentation” and present to the class (ca 45 mins)

Pointers for replication and re-analysis

How to: Replication files

  • Best provided as self-contained documents for easy third-party viewing, e.g. .html rendered from .qmd or .Rmd

Some examples:

Good coding rules

Good coding rules

  • Metadata first
  • Call packages at the beginning: use pacman (see the sketch after this list)
  • Put options at the top
  • Call all data files once, at the top. Best to call directly from a public archive, when possible.
  • Use functions and define them at the top; comment them, and where useful illustrate what they do
  • Replicate first, re-analyze second. Use sections.
  • Have subsections named after specific tables, figures or analyses
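
A minimal sketch of what these rules can look like at the top of a replication .qmd/.Rmd; the package list, file paths, and helper function are illustrative, not taken from any specific replication:

# Metadata: replication of Author (Year), "Title"; team members; date

# Packages (once, at the top)
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(tidyverse, haven, estimatr, broom, knitr)

# Options
options(digits = 3)

# Data: call all files once, here; prefer a public archive when possible
# df <- haven::read_dta("https://example-archive.org/replication/main.dta")
df <- haven::read_dta("input/main.dta")   # illustrative relative path

# Functions: define and comment helpers here
tidy_estimate <- function(fit) broom::tidy(fit, conf.int = TRUE)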

Aim

Nothing local, everything relative: so please do not include hardcoded paths to your computer

  • First best: if someone has access to your .Rmd/.qmd file they can hit render or compile and the whole thing reproduces first time.

  • But: often you need ancillary files for data and code. That’s OK, but the aim should still be that, given a self-contained folder, someone can open a master.Rmd file, hit compile, and get everything. I usually have an input and an output subfolder.

Resources and ideas from the institute for replication https://i4replication.org/reproducibility.html

Longer-term goal: a replication package to make it easier to access and share replications like these

Collaborative coding / writing

  • Do not get in the business of passing attachments around
  • Share self-contained folders: each folder contains a small set of live documents plus an archive. Old versions of documents go in the archive; only the most recent version of each document sits in the main folder.
  • Data lives in its own self-contained folder (e.g. the input folder) and is never edited directly

How to: Contents

Sample TOC for presentations:

  • Motivation
  • Theory: e.g. a DAG plus motivation for it
  • Design (e.g. ex ante perspective): A design diagnosis
  • Analysis:
    • descriptives: show the data! Make sure that you, and we, all understand the type of data we are facing
    • replication
    • re-analysis: for robustness, for strengthening
  • Interpretation (explanation, implications, generalizability)

Don’t skip the big picture.

Contents

See: How to critique: https://macartan.github.io/teaching/how-to-critique

Biggest message: be probing but be sympathetic

How to: Theory

  • Strongly recommend you try to initially present all arguments as a DAG
library(CausalQueries)

make_model(
  "Inequality -> Mobilization -> Democratization <- Threat <- Inequality;
   Pressure -> Democratization") |>
  plot()

How to: Theory

…and then use the DAG to describe new analyses, e.g. questions regarding the following (see the sketch after this list):

  • identification
  • mechanisms
  • plausible or implausible assumptions
  • possible confounds
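
A sketch of how such questions can be put to the DAG above, here using the dagitty package (an assumption; not used elsewhere in these materials):

library(dagitty)

g <- dagitty("dag {
  Inequality -> Mobilization -> Democratization
  Inequality -> Threat -> Democratization
  Pressure -> Democratization
}")

# Which variables (if any) need to be adjusted for to identify the
# effect of Inequality on Democratization?
adjustmentSets(g, exposure = "Inequality", outcome = "Democratization")

# Through which paths (possible mechanisms) does the effect run?
paths(g, from = "Inequality", to = "Democratization")$paths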

Straight replication

  • In most cases code will be provided
  • We encourage you not to rely on it; try from scratch first

If you fail to replicate:

  • this might be your fault!
  • check your code against their code; compare Ns; break it down to estimate single quantities
  • if it’s a small discrepancy don’t worry
  • if it’s large, check with me; if it’s important check with authors (politely) to help make sense of things; always be generous

Re-analysis

Two distinct overarching goals:

  1. Check robustness
  2. Deepen the analysis

Robustness: Things you might do:

  • Plot the data to understand where effects are coming from and to check for anomalies
  • Swap out datasets (sometimes possible with observational data)
  • Check robustness to specification / assumptions (see the sketch after this list)
  • Check missingness patterns
  • Assess plausible external validity
  • Check for multiple comparisons
  • Check against pre-analysis plans
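
As one illustration of a specification check, the sketch below re-estimates a treatment effect with and without controls; the data frame df and all variable names are hypothetical placeholders:

library(estimatr)
library(broom)

specs <- list(
  original          = y ~ x,
  "with controls"   = y ~ x + control_1 + control_2,
  "with region FEs" = y ~ x + control_1 + control_2 + factor(region)
)

fits <- lapply(specs, function(f) lm_robust(f, data = df))

# Compare the coefficient on x across specifications
do.call(rbind, lapply(names(fits), function(s) {
  td <- tidy(fits[[s]])
  cbind(spec = s,
        td[td$term == "x", c("estimate", "std.error", "conf.low", "conf.high")])
}))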

Deepening: Things you might do

  • Principled exploration of heterogeneity (e.g. using causal forests; see the sketch after this list)
  • Principled incorporation of controls (e.g. Lasso approaches)
  • Search for additional implications of theory: if the theory is right what should we see? If the theory is right where should effects be strongest or weakest?
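
A sketch of what principled heterogeneity exploration might look like with the grf package; the data frame df and the covariate names are hypothetical, and the original studies may call for different choices:

library(grf)

# Outcome y, binary treatment z, covariates (all hypothetical names)
X  <- as.matrix(df[, c("age", "education", "income")])
cf <- causal_forest(X = X, Y = df$y, W = df$z)

average_treatment_effect(cf)        # overall effect
tau_hat <- predict(cf)$predictions  # unit-level effect estimates
variable_importance(cf)             # which covariates drive the heterogeneity?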

Reanalysis: Avoiding fishing

  • There’s a risk of engaging in a type of fishing when you do replication or re-analysis.
  • Say you decide to control for something and a result disappears: is this a threat to the analysis?
  • Not necessarily
  • In our Research Design book we recommend using design declaration to justify re-analysis decisions

Reanalysis: Avoiding fishing

e.g. 

Home ground dominance. Holding the original M constant (i.e., the home ground of the original study), if you can show that a new answer strategy A’ yields better diagnosands than the original A, then A’ can be justified by home ground dominance.

Robustness to alternative models. A second justification for a change in answer strategy is that you can show that a new answer strategy is robust to both the original model M and a new, also plausible, M’.
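
A minimal sketch of a home ground dominance argument in code, assuming the DeclareDesign package; the model and the two answer strategies are illustrative, not drawn from any of the course papers:

library(DeclareDesign)

design <-
  declare_model(N = 500, X = rnorm(N), U = rnorm(N),
                potential_outcomes(Y ~ 0.3 * Z + X + U)) +
  declare_assignment(Z = simple_ra(N), Y = reveal_outcomes(Y ~ Z)) +
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
  # original answer strategy A
  declare_estimator(Y ~ Z, inquiry = "ate",
                    .method = lm_robust, label = "A") +
  # proposed re-analysis A': adjust for X
  declare_estimator(Y ~ Z + X, term = "Z", inquiry = "ate",
                    .method = lm_robust, label = "A_prime")

# If A' beats A on the diagnosands (e.g. lower RMSE) under the original M,
# the change of answer strategy is justified by home ground dominance
diagnose_design(design, sims = 500)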

Topics

Let’s review basic ideas:

  • A simple growth model
  • Variables and arguments
  • Designs and evidence

The simplest growth model: Solow

Can someone walk us through the Solow model?

The simplest growth model: The economy

Production function.

\[Y = F(K,L)\] Per capita:

\[y = f(k)\] e.g.

\[y = Ak^\alpha\]

The simplest growth model: What changes

Savings are a constant share of income:

\[\text{savings} = s y_t\]

Depreciation is a constant share of capital: \[\text{depreciation}=\delta k_t\]

Law of motion of capital:

\[k_{t+1} = k_t + sAk_t^\alpha - \delta k_t\]

Steady state has:

\[k_{t+1} = k_t = k^* \;\leftrightarrow\; sA(k^*)^\alpha = \delta k^* \;\leftrightarrow\; k^* = \left(\frac{sA}{\delta}\right)^{\frac{1}{1-\alpha}}\]
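
A quick sketch of the law of motion in code, with arbitrary parameter values, showing convergence to the analytical steady state:

# Arbitrary parameter values
A <- 1; alpha <- 0.3; s <- 0.2; delta <- 0.1
periods <- 100

k <- numeric(periods)
k[1] <- 1   # initial capital per worker

# k_{t+1} = k_t + s*A*k_t^alpha - delta*k_t
for (t in 1:(periods - 1)) k[t + 1] <- k[t] + s * A * k[t]^alpha - delta * k[t]

k_star <- (s * A / delta)^(1 / (1 - alpha))   # analytical steady state

plot(k, type = "l", xlab = "t", ylab = "k")
abline(h = k_star, lty = 2)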

The simplest growth model: What drives growth?

  • So what causes long term growth?
  • What causes variation in income?
  • Where is the politics?
  • What growth models do you have in your head?

Models: Arguments as DAGs

An argument:

Here is a complete, albeit barebones (and possibly incorrect), argument:

  • Good institutions (I) cause economic growth (G), except in countries with large stocks of natural resources (N)
  • The reason is that institutions encourage people to invest (V) which spurs growth (this effect does not kick in in natural resource rich countries as people just live off rent)
  • Growth also makes it easier to maintain good institutions, which creates a virtuous cycle
  • Being an ally (A) of the US helps growth, but it can corrupt domestic institutions.
  • Historically, places with climates (C) suitable for colonizers to settle had better institutions. These climatic conditions are otherwise irrelevant for contemporary economic growth.

Some counterarguments:

  • Places with climates suitable for colonizers benefited from better access to international markets which led to growth.
  • Good soil is also important for growth!
  • Good institutions also make sure that investments yield greater returns and that’s what causes growth

Questions on Nodes

  • What are the dependent variables?
  • What are the independent variables?
  • What are the mediating variables?
  • What are the conditioning variables?
  • What are the confounding variables?
  • What are the instrumental variables?
  • Graph the relations between the variables.

Questions on Inference

  • Say I and G are positively correlated. Does this mean that I causes G?

  • Say I and G are negatively correlated. Does this mean that I does not cause G?

  • How might you estimate the effect of I on G?

  • How does C help establish the link between I and G?

  • Where is the theory? Is it equivalent to the graph or is it something else that generates the graph?

  • How might you check if the proposed theory is correct?

  • Which of the counterarguments are strong and why?

A graph
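
One way to draw the graph implied by the argument, again using make_model (a sketch: the feedback from growth back to institutions is dropped because a DAG must be acyclic, and the moderating role of natural resources appears only as the edge N -> G):

library(CausalQueries)

make_model(
  "C -> I; A -> I; A -> G;
   I -> V -> G; N -> G") |>
  plot()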

Exercises: Dissect these arguments

Four arguments. For each one you should identify the:

  • type of argument (effect of \(X\), cause of \(Y\), effect of \(X\) on \(Y\))
  • unit of analysis
  • dependent variable(s)
  • independent variable(s)
  • mediator(s)
  • possible conditioning variable(s)
  • possible confounder(s)
  • possible identification strategy
  • relevant key agent(s) (actor(s))
  • measurement strategy

A. Natural resources and conflict

In developing countries that discover natural resources, such as oil, the ruling elite can extract wealth without needing to tax citizens and develop the state apparatus. Because the state does not rely on taxation for government revenue, it does not need to set up accountability structures or extend its reach and citizens do not feel that they have ownership over the state. The state therefore becomes both less democratic and weaker than if it had not discovered the resources.

B. Democracy and growth

Rich countries are more likely to be democratic for the simple reason that when people become wealthier they refuse to be dictated to by others and they demand a role in government. The marginal effects of income increases are greater for poorer countries because the impacts on education are greatest at these levels. You can test this proposition by exploiting natural variation in commodity prices, which provides shocks to national income, especially for countries dependent on primary commodity exports.

C. Factor Endowments and Coalitions

When countries increase trade (imports and exports), the returns to economic factors (such as labor, land, and capital) are affected differently. Specifically, the returns to factors that are the most abundant are positive, while the returns to factors that are the most scarce are negative. Therefore, the relative factor endowments of a country will predict what sort of political coalitions will form (e.g. Land versus Labor + Capital) and which groups will favor free trade policies.

D. Democratic peace

In democratic states, leaders are accountable for any losses incurred as a result of the wars that they enter into. Two states with democratic leaders are also more likely to share a common set of norms, and to engage in trade with one another. Therefore, two democracies are far less likely to enter into war with one another than a democracy and a non-democracy, or two non-democracies.

Design declaration and diagnosis

book: https://book.declaredesign.org/

The MIDA Framework

Q: Is my research design good?

A: Well let’s simulate it to see how it performs.

Q: What should I put in the simulation?

A: All elements of a research design.

Q: What are the elements of a research design?

A: M! I! D! A!

Four elements of any research design

  • Model: set of models of what causes what and how
  • Inquiry: a question stated in terms of the model
  • Data strategy: the set of procedures we use to gather information from the world (sampling, assignment, measurement)
  • Answer strategy: how we summarize the data produced by the data strategy

Four elements of any research design

Examples of MIDA elements

  • M: DAGs, game theoretic models
  • I: ATEs, CATEs, COEs, models
  • D: Sampling schemes, assignment schemes, text analysis, interview
  • A: Experiment, observational, quantitative, qualitative:
    • Conditioning on observables
    • Difference in differences
    • RDD
    • Instrumental variables

Declaration, Diagnosis, Redesign cycle

  • Declaration: Telling the computer what M, I, D, and A are.

  • Diagnosis: Estimating “diagnosands” like power, bias, RMSE, error rates, ethical harm, amount learned.

  • Redesign: Fine-tuning features of the data and answer strategies to understand how they change the diagnosands
    • Different sample sizes
    • Different randomization procedures
    • Different estimation strategies
    • Implementation: effort into compliance versus more effort into sample size

In code: Key commands for making a design

  • declare_model()
  • declare_inquiry()
  • declare_assignment()
  • declare_measurement()
  • declare_estimator()

and there are more declare_ functions!

In code: Key commands for using a design

  • draw_data(design)
  • draw_estimands(design)
  • draw_estimates(design)
  • get_estimates(design, data)
  • run_design(design), simulate_design(design)
  • diagnose_design(design)
  • redesign(design, N = 200)
  • design |> redesign(N = c(200, 400)) |> diagnose_designs()
  • compare_designs(), compare_diagnoses()

Cheat sheet

https://raw.githubusercontent.com/rstudio/cheatsheets/master/declaredesign.pdf

A simple design

library(DeclareDesign)

N <- 100
b <- .5

design <- 
  declare_model(N = N, U = rnorm(N), 
                potential_outcomes(Y ~ b * Z + U)) + 
  declare_assignment(Z = simple_ra(N), Y = reveal_outcomes(Y ~ Z)) + 
  declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) + 
  declare_estimator(Y ~ Z, inquiry = "ate", .method = lm_robust)

You now have a two-arm design object in memory!

If you just type design it will run the design—a good check to make sure the design has been declared properly.

Make data from the design

data <- draw_data(design)

data |> head() |> kable()
ID U Y_Z_0 Y_Z_1 Z Y
001 0.8949488 0.8949488 1.3949488 1 1.3949488
002 -0.0574970 -0.0574970 0.4425030 1 0.4425030
003 0.9280977 0.9280977 1.4280977 0 0.9280977
004 0.3762109 0.3762109 0.8762109 1 0.8762109
005 -0.7357462 -0.7357462 -0.2357462 0 -0.7357462
006 -0.7711031 -0.7711031 -0.2711031 1 -0.2711031

Draw estimands

draw_estimands(design) |>
  kable(digits = 2)
inquiry estimand
ate 0.5

Draw estimates

draw_estimates(design) |> 
  kable(digits = 2) 
estimator term estimate std.error statistic p.value conf.low conf.high df outcome inquiry
estimator Z 0.8 0.19 4.22 0 0.43 1.18 98 Y ate

Get estimates

get_estimates(design, data) |>
  kable(digits = 2)
estimator term estimate std.error statistic p.value conf.low conf.high df outcome inquiry
estimator Z 0.67 0.21 3.23 0 0.26 1.09 98 Y ate

Simulate design

simulate_design(design, sims = 3) |>
  kable(digits = 2)
design sim_ID inquiry estimand estimator term estimate std.error statistic p.value conf.low conf.high df outcome
design 1 ate 0.5 estimator Z 0.43 0.21 2.09 0.04 0.02 0.84 98 Y
design 2 ate 0.5 estimator Z 0.61 0.19 3.21 0.00 0.23 0.99 98 Y
design 3 ate 0.5 estimator Z 0.40 0.18 2.28 0.02 0.05 0.76 98 Y

Diagnose design

design |> 
  diagnose_design(sims = 100) 
Mean Estimate Bias SD Estimate RMSE Power Coverage
0.50 0.00 0.21 0.21 0.68 0.94
(0.02) (0.02) (0.01) (0.01) (0.04) (0.02)

Redesign

new_design <- design |> redesign(b = 0)
  • Modify any arguments that are explicitly called on by design steps.
  • Or add, remove, or replace steps

Compare designs

redesign(design, N = 50) |>
  compare_diagnoses(design)
diagnosand mean_1 mean_2 mean_difference conf.low conf.high
mean_estimand 0.50 0.50 0.00 0.00 0.00
mean_estimate 0.51 0.49 -0.03 -0.06 0.01
bias 0.01 -0.01 -0.03 -0.06 0.01
sd_estimate 0.30 0.20 -0.10 -0.13 -0.08
rmse 0.30 0.20 -0.10 -0.13 -0.08
power 0.43 0.67 0.24 0.18 0.31
coverage 0.94 0.95 0.01 -0.01 0.04