| Day | Topic | Activity |
|---|---|---|
| 1: 1/9 | Getting started Outline | In-class exercise, intro to field experiment |
| 2: 1/16 | Doing experiments: Workflows; ethics | Presentation of plans, ethics application |
| 3: 1/30 | Causality: Foundations, Inquiries | Field reports |
| 4: 2/6 | Answer strategies: Estimation and Inference | Discussion of findings 1; Proposals 2 |
| 5: 2/20 | Data strategies and Design evaluation: DeclareDesign, Assignments, Design evaluation | Proposal refinements 2 |
| 6: 2/27 | Topics: Survey experiments, Spillovers, Downstream experimentation | Design declarations and ethics applications |
We hope to run three experiments:
We are going to do some experiments by hand. You choose assignment schemes and analysis plans.
You have been given an envelope with 20 cards. Each card has two numbers. One in black lettering (on white) and one in white lettering (on black). On the other side of the card there may or may not be writing.
Your job is to figure out whether, on average, the black numbers are larger than, smaller than, or the same as the white numbers.
The catch: when you turn over a card you are only allowed to read the black number or the white number (honor code!!). What's more: you have to decide before reading the card which number you will look at (though of course you can see the symbol on the card, if there is one, before reading the numbers).
To do: Access the complete procedure sheet, get a pack of cards, go!
We are collectively going to try to replicate an experimental design that was implemented in Germany.
We are free to modify this design as we wish. Our goal is to:
We are going to make mistakes and learn to work as a team.
AJPS
Why do native Europeans discriminate against Muslim immigrants? Can shared ideas between natives and immigrants reduce discrimination? We hypothesize that natives’ bias against Muslim immigrants is shaped by the belief that Muslims hold conservative attitudes about women’s rights and this ideational basis for discrimination is more pronounced among native women. We test this hypothesis in a large-scale field experiment conducted in 25 cities across Germany, during which 3,797 unknowing bystanders were exposed to brief social encounters with confederates who revealed their ideas regarding gender roles. We find significant discrimination against Muslim women, but this discrimination is eliminated when Muslim women signal that they hold progressive gender attitudes. Through an implicit association test and a follow-up survey among German adults, we further confirm the centrality of ideational stereotypes in structuring opposition to Muslims. Our findings have important implications for reducing conflict between native–immigrant communities in an era of increased cross-border migration.
Results
Summary:
Team
Despite decades of scholarship on protest effects, we know little about how bystanders—citizens who observe protests without participating—are affected by them. Understanding the impact of protest on bystanders is crucial as they constitute a growing audience whose latent support, normative beliefs, and concrete actions can make or break a movement’s broader societal impact. To credibly assess the effects of protests on observers, we design and implement a field experiment in Berlin in which we randomly route pedestrians past (treatment) or away from (control) three large-scale Fridays for Future (FFF) climate strikes. Using data gathered on protest days as well as through a one-month follow-up survey, we find evidence for a substantial increase in immediate donations to climate causes but no detectable impact on climate attitudes, vote intentions, or norm perceptions. Our findings challenge the prevailing assumption in both scholarship and public discourse that protest effects operate via impacts on public opinion and call for renewed theorizing that centers on observers’ immediate behavioral activation.
Results
Summary:
First best: if someone has access to your .qmd file they can hit render or compile and the whole thing reproduces first time. So: nothing local, everything relative; please do not include hardcoded paths to your computer.
But: often you need ancillary files for data and code. That's OK, but the aim should still be that with a self-contained folder someone can open a main.Rmd file, hit compile, and get everything. I usually have an input and an output subfolder.
… (git, osf, Dropbox, Drive, Nextcloud) and is never edited directly.

Cox & Reid (2000) define experiments as:
investigations in which an intervention, in all its essential elements, is under the control of the investigator.
Two types of control:
Experimental studies use research designs in which the researcher uses:
Let’s discuss:
Then: Deep dive into discussion of actual experiments
Then: Plans for our own
Model: a set of models of what causes what and how

Inquiry: a question stated in terms of the model

Data strategy: the set of procedures we use to gather information from the world (sampling, assignment, measurement)

Answer strategy: how we summarize the data produced by the data strategy

Design declaration is telling the computer (and readers) what M, I, D, and A are.
Design diagnosis is figuring out how the design will perform under imagined conditions.
Estimating “diagnosands” like power, bias, rmse, error rates, ethical harm, “amount learned”.
Diagnosis takes account of model uncertainty: it aims to identify models for which the design works well and models for which it does not
Redesign is the fine-tuning of features of the data- and answer strategies to understand how changing them affects the diagnosands
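To make this concrete, here is a minimal sketch of a declared and diagnosed two-arm design in DeclareDesign (the effect size, sample size, and estimator here are illustrative assumptions of mine, not part of the course materials):

```r
library(DeclareDesign)

design <-
  declare_model(
    N = 100,
    U = rnorm(N),
    potential_outcomes(Y ~ 0.2 * Z + U)                # M: model
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +         # I: inquiry
  declare_assignment(Z = complete_ra(N)) +             # D: assignment
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +    # D: measurement
  declare_estimator(Y ~ Z, inquiry = "ATE")            # A: answer strategy

diagnose_design(design)  # diagnosands: power, bias, rmse, coverage, ...
```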
Good questions studied well
Randomization of
There is no foundationless answer to this question.
Belmont principles commonly used for guidance:
Unfortunately, operationalizing these requires further ethical theories. (1) is often operationalized by informed consent (a very liberal idea); (2) and (3) sometimes by more utilitarian principles.
The major focus on (1) by IRBs might follow from the view that if subjects consent, then they endorse the ethical calculations made for 2 and 3 — they think that it is good and fair.
Trickiness: can a study be good or fair because of implications for non-subjects?
Many (many) field experiments have nothing like informed consent.
For example, whether the government builds a school in your village, whether an ad appears on your favorite radio show, and so on.
Consider three cases:
In all cases, there is no consent given by subjects.
In 2 and 3, the treatment is possibly harmful for subjects, and the results might also be harmful. But even in case 1, there could be major unintended harmful consequences.
In cases 1 and 3, however, the “intervention” is within the sphere of normal activities for the implementer.
Sometimes it is possible to use this point of difference to make a “spheres of ethics” argument for “embedded experimentation.”
Spheres of Ethics Argument: Experimental research that involves manipulations that are not normally appropriate for researchers may nevertheless be ethical if:
Otherwise keep focus on consent and desist if this is not possible
Political science researchers should respect autonomy, consider the wellbeing of participants and other people affected by their research, and be open about the ethical issues they face.
Political science researchers have an individual responsibility to consider the ethics of their research related activities and cannot outsource ethical reflection to review boards, other institutional bodies, or regulatory agencies.
These principles describe the standards of conduct and reflexive openness that are expected of political science researchers. … [In cases of reasonable deviations], researchers should acknowledge and justify deviations in scholarly publications and presentations of their work.
[Note: no general injunction against]
Researchers should generally avoid harm when possible, minimize harm when avoidance is not possible, and not conduct research when harm is excessive.
do not limit concern to physical and psychological risks to the participant.
cases in which research that produces impacts on political processes without consent of individuals directly engaged by the research might be appropriate. [examples]
Studies of interventions by third parties do not usually invoke this principle on impact. [details]
This principle is not intended to discourage any form of political engagement by political scientists in their non-research activities or private lives.
researchers should report likely impacts
Mentors, advisors, dissertation committee members, and instructors
Graduate programs in political science should include ethics instruction in their formal and informal graduate curricula;
Editors and reviewers should encourage researchers to be open about the ethical decisions …
Journals, departments, and associations should incorporate ethical commitments into their mission, bylaws, instruction, practices, and procedures.
Experimental researchers are deeply engaged in the movement towards more transparent social science research.
Contentious issues (mostly):
Data. How soon should you make your data available? My view: as soon as possible. Along with working papers and before publication. Before it affects policy in any case. Own the ideas, not the data.
Where should you make your data available? Dataverse is focal for political science. Not personal website (mea culpa)
What data should you make available? Disagreement is over how raw your data should be. My view: as raw as you can but at least post cleaning and pre-manipulation.
Experimental researchers are deeply engaged in the movement towards more transparent social science research.
Should you register? Hard to find reasons against, but the case is strongest in the testing phase rather than the exploratory phase.
Registration: When should you register? My view: Before treatment assignment. (Not just before analysis, mea culpa)
Registration: Should you deviate from a preanalysis plan if you change your mind about optimal estimation strategies? My view: Yes, but make the case and describe both sets of results.
File drawer bias (Publication bias)
Analysis bias (Fishing)
– Say in truth \(X\) affects \(Y\) in 50% of cases.
– Researchers conduct multiple excellent studies. But they only write up the 50% that produce “positive” results.
– Even if each individual study is indisputably correct, the account in the research record – that X affects Y in 100% of cases – will be wrong.
Exacerbated by:
– Publication bias – the positive results get published
– Citation bias – the positive results get read and cited
– Chatter bias – the positive results gets blogged, tweeted and TEDed.
– Say in truth \(X\) affects \(Y\) in 50% of cases.
– But say that researchers enjoy discretion to select measures for \(X\) or \(Y\), or enjoy discretion to select statistical models after seeing \(X\) and \(Y\) in each case.
– Then, with enough discretion, 100% of analyses may report positive effects, even if all studies get published.
– Try the Exact Fishy Test (https://macartan.shinyapps.io/fish/)
– What’s the problem with this test?
When your conclusions do not really depend on the data
Eg:
– some evidence will always support your proposition
– some interpretation of evidence will always support your proposition
Knowing the mapping from data to inference in advance gives a handle on the false positive rate.
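As a small illustration of that point, the sketch below (my own, with made-up numbers) simulates a world with no true effect and shows that choosing the better of two outcome measures after seeing the data roughly doubles the nominal 5% false positive rate:

```r
set.seed(42)

# No true effects: Z is unrelated to either of two candidate outcome measures.
# "Fishing" = reporting whichever of the two p values is smaller.
fish <- function(n = 100) {
  Z  <- rep(0:1, n / 2)
  p1 <- t.test(rnorm(n) ~ Z)$p.value
  p2 <- t.test(rnorm(n) ~ Z)$p.value
  c(prespecified = p1 < 0.05, fished = min(p1, p2) < 0.05)
}

rowMeans(replicate(2000, fish()))  # roughly 0.05 versus 0.10
```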
Source: Gerber and Malhotra
Implications are:
Summary: we do not know when we can or cannot trust claims made by researchers.
[Not a tradition specific claim]
Simple idea:
Lots of misunderstandings around registration
Fishing can happen in very subtle ways, and may seem natural and justifiable.
Example:
Our journal review process is largely organized around advising researchers how to adjust analysis in light of findings in the data.
Frequentists can do it
Bayesians can do it too.
Qualitative researchers can also do it.
You can even do it with descriptive statistics
The key distinction is between prospective and retrospective studies.
Not between experimental and observational studies.
A reason (from the medical literature) why registration is especially important for experiments: because you owe it to subjects
A reason why registration is less important for experiments: because it is more likely that the intended analysis is implied by the design in an experimental study. Researcher degrees of freedom may be greatest for observational qualitative analyses.
Registration will produce some burden but does not require the creation of content that is not needed anyway
It does shift preparation of analyses forward
And it can also increase the burden of developing analysis plans even for projects that don't work out. But that is, in part, the point.
Upside is that ultimate analyses may be much easier.
In neither case would the creation of a registration facility prevent exploration.
What it might do is make it less credible for someone to claim that they have tested a proposition when in fact the proposition was developed using the data used to test it.
Registration communicates whether researchers are engaged in exploration or not. We love exploration and should be proud of it.
Incentives and strategies
| Inquiry | In the preanalysis plan | In the paper | In the appendix |
|---|---|---|---|
| Gender effect | X | X | |
| Age effect | X | | |
| Inquiry | Following A from the PAP | Following A from the paper | Notes |
|---|---|---|---|
| Gender effect | estimate = 0.6, s.e = 0.31 | estimate = 0.6, s.e = 0.25 | Difference due to change in control variables [provide cross references to tables and code] |
Fundamental problems
Causation as difference making
The intervention based motivation for understanding causal effects:
The problem in 2 is that you need to know what would have happened if things were different. You need information on a counterfactual.
Idea: A causal claim is (in part) a claim about something that did not happen. This makes it metaphysical.
Now that we have a concept of causal effects available, let’s answer two questions:
TRANSITIVITY: If for a given unit \(A\) causes \(B\) and \(B\) causes \(C\), does that mean that \(A\) causes \(C\)?
A boulder is flying down a mountain. You duck. This saves your life.
So the boulder caused the ducking and the ducking caused you to survive.
So: did the boulder cause you to survive?
CONNECTEDNESS Say \(A\) causes \(B\) — does that mean that there is a spatiotemporally continuous sequence of causal intermediates?
The counterfactual model is about contribution and attribution in a very specific sense.
Consider an outcome \(Y\) that might depend on two causes \(X_1\) and \(X_2\):
\[Y(0,0) = 0\] \[Y(1,0) = 0\] \[Y(0,1) = 0\] \[Y(1,1) = 1\]
What caused \(Y\)? Which cause was most important?
The counterfactual model is about attribution in a very conditional sense.
This is a problem for research programs that define "explanation" in terms of figuring out the things that cause \(Y\).
Real difficulties conceptualizing what it means to say one cause is more important than another cause. What does that mean?
Erdogan’s increasing authoritarianism was the most important reason for the attempted coup
More uncomfortably:
What does it mean to say that the tides are caused by the moon? What exactly do we have to imagine…
Jack exploited Jill
It’s Jill’s fault that bucket fell
Jack is the most obstructionist member of Congress
Melania Trump stole from Michelle Obama’s speech
Activists need causal claims
This is sometimes called a “switching equation”
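For reference, the equation usually meant here is the standard potential outcomes switching equation (my restatement):

\[Y_i = Z_i Y_i(1) + (1 - Z_i) Y_i(0)\]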
In DeclareDesign \(Y\) is realised from potential outcomes and assignment in this way using reveal_outcomes
Say \(Z\) is a random variable, then this is a sort of data generating process. BUT the key thing to note is
Now for some magic. We really want to estimate: \[ \tau_i = Y_i(1) - Y_i(0)\]
BUT: We never can observe both \(Y_i(1)\) and \(Y_i(0)\)
Say we lower our sights and try to estimate an average treatment effect: \[ \tau = \mathbb{E} [Y(1)-Y(0)]\]
Now make use of the fact that \[\mathbb E[Y(1)-Y(0)] = \mathbb E[Y(1)]- \mathbb E [Y(0)] \]
In words: The average of differences is equal to the difference of averages.
The magic is that while we can't hope to measure the differences, we are good at measuring averages.
This provides a positive argument for causal inference from randomization, rather than simply saying with randomization “everything else is controlled for”
Let’s discuss:
Idea: random assignment is random sampling from potential worlds: to understand anything you find, you need to know the sampling weights
Idea: We now have a positive argument for claiming unbiased estimation of the average treatment effect following random assignment
But is the average treatment effect a quantity of social scientific interest?
The average of the differences \(\approx\) difference of averages
Question: \(\approx\) or \(=\)?
Consider the following potential outcomes table:
| Unit | Y(0) | Y(1) | \(\tau_i\) |
|---|---|---|---|
| 1 | 4 | 3 | |
| 2 | 2 | 3 | |
| 3 | 1 | 3 | |
| 4 | 1 | 3 | |
| 5 | 2 | 3 | |
Questions for us: What are the unit level treatment effects? What is the average treatment effect?
Consider the following potential outcomes table:
| In treatment? | Y(0) | Y(1) |
|---|---|---|
| Yes | | 2 |
| No | 3 | |
| No | 1 | |
| Yes | | 3 |
| Yes | | 3 |
| No | 2 | |
Questions for us: Fill in the blanks.
What is the actual treatment effect?
Take a short break!
Experiments often give rise to endogenous subgroups. The potential outcomes framework can make it clear why this can cause problems.
Problems arise in analyses of subgroups when the categories themselves are affected by treatment
Example from our work:
| | V(0) | V(1) | R(0,1) | R(1,1) | R(0,0) | R(1,0) |
|---|---|---|---|---|---|---|
| Type 1 (reporter) | 1 | 1 | 1 | 1 | 0 | 0 |
| Type 2 (non reporter) | 1 | 0 | 0 | 0 | 0 | 0 |
Expected reporting given violence in control = Pr(Type 1) (explanation: both types see violence but only Type 1 reports)
Expected reporting given violence in treatment = 100% (explanation: only Type 1 sees violence and this type also reports)
So you might infer a large effect on violence reporting.
Question: What is the actual effect of treatment on the propensity to report violence?
It is possible that in truth no one's reporting behavior has changed; what has changed is the propensity of people with different reporting propensities to experience violence:
| | Reporter | No Violence | Violence | % Report |
|---|---|---|---|---|
| Control | Yes | 25 | 25 | \(\frac{25}{25+25}=50\%\) |
| | No | 25 | 25 | |
| Treatment | Yes | 25 | 25 | \(\frac{25}{25+0}=100\%\) |
| | No | 50 | 0 | |
This problem can arise as easily in seemingly simple field experiments. Example:
What’s the problem?
Question for us:
Which problems face an endogenous subgroup issue?:
In such cases you can:
| Pair | I | I | II | II | |
|---|---|---|---|---|---|
| Unit | 1 | 2 | 3 | 4 | Average |
| Y(0) | 0 | 0 | 0 | 0 | |
| Y(1) | -3 | 1 | 1 | 1 | |
| \(\tau\) | -3 | 1 | 1 | 1 | 0 |
| Pair | I | I | II | II | |
|---|---|---|---|---|---|
| Unit | 1 | 2 | 3 | 4 | Average |
| Y(0) | 0 | 0 | 0 | ||
| Y(1) | 1 | 1 | 1 | ||
| \(\hat{\tau}\) | 1 |
| Pair | I | I | II | II | |
|---|---|---|---|---|---|
| Unit | 1 | 2 | 3 | 4 | Average |
| Y(0) | [0] | 0 | 0 | ||
| Y(1) | [-3] | 1 | 1 | ||
| \(\hat{\tau}\) | 1 |
Note: The right way to think about this is that bias is a property of the strategy over possible realizations of data and not normally a property of the estimator conditional on the data.
Multistage games can also present an endogenous group problem since collections of late stage players facing a given choice have been created by early stage players.
Question: Does visibility alter the extent to which subjects follow norms to punish antisocial behavior (and reward prosocial behavior)? Consider a trust game in which we are interested in how information on receivers affects their actions
Average % returned:

| Visibility Treatment | % invested (average) | ...when 10% invested | ...when 50% invested |
|---|---|---|---|
| Control: Masked information on respondents | 30% | 20% | 40% |
| Treatment: Full information on respondents | 30% | 0% | 60% |
What do we think? Does visibility make people react more to investments?
Imagine you could see all the potential outcomes, and they looked like this:
Responder's return decision (given type):

| Offered behavior | Nice 1 | Nice 2 | Nice 3 | Mean 1 | Mean 2 | Mean 3 | Avg. |
|---|---|---|---|---|---|---|---|
| Invest 10% | 60% | 60% | 60% | 0% | 0% | 0% | 30% |
| Invest 50% | 60% | 60% | 60% | 0% | 0% | 0% | 30% |
Conclusion: Both the offer and the information condition are completely irrelevant for all subjects.
Unfortunately you only see a sample of the potential outcomes, and that looks like this:
Responder's return decision (given type):

| Offered behavior | Nice 1 | Nice 2 | Nice 3 | Mean 1 | Mean 2 | Mean 3 | Avg. |
|---|---|---|---|---|---|---|---|
| Invest 10% | | | | 0% | 0% | 0% | 0% |
| Invest 50% | 60% | 60% | 60% | | | | 60% |
False Conclusion: When not protected, responders condition behavior strongly on offers (because offerers can select on type accurately)
In fact: The nice types invest more because they are nice. The responders return more to the nice types because they are nice.
Unfortunately you only see a (noisier!) sample of the potential outcomes, and that looks like this:
Responder's return decision (given type):

| Offered behavior | Nice 1 | Nice 2 | Nice 3 | Mean 1 | Mean 2 | Mean 3 | Avg. |
|---|---|---|---|---|---|---|---|
| Invest 10% | 60% | | | 0% | 0% | | 20% |
| Invest 50% | | 60% | 60% | | | 0% | 40% |
False Conclusion: When protected, responders condition behavior less strongly on offers (because offerers can select on type less accurately)
What to do?
Solutions?
Take away: Proceed with extreme caution when estimating effects beyond the first stage.
Take a short break!
Directed Acyclic Graphs
The most powerful results from the study of DAGs give procedures for figuring out when conditioning aids or hinders causal identification.
Pearl’s book Causality is the key reference. Pearl (2009) (Though see also older work such as Pearl and Paz (1985))
There is a lot of excellent material on Pearl’s page http://bayes.cs.ucla.edu/WHY/
See also excellent material on Felix Elwert’s page http://www.ssc.wisc.edu/~felwert/causality/?page_id=66
Say you don’t like graphs. Fine.
Consider this causal structure:
Say \(Z\) is temporally prior to \(X\); it is correlated with \(Y\) (because of \(U_1\)) and with \(X\) (because of \(U_2\)).
Question: Would it be useful to “control” for \(Z\) when trying to estimate the effect of \(X\) on \(Y\)?
Say you don’t like graphs. Fine.
Consider this causal structure:
Question: Would it be useful to “control” for \(Z\) when trying to estimate the effect of \(X\) on \(Y\)?
Answer: Hopefully by the end of today you should see that the answer is obviously (or at least, plausibly) “no.”
Variable sets \(A\) and \(B\) are conditionally independent, given \(C\) if for all \(a\), \(b\), \(c\):
\[\Pr(A = a | C = c) = \Pr(A = a | B = b, C = c)\]
Informally; given \(C\), knowing \(B\) tells you nothing more about \(A\).
Three elemental relations of conditional independence.
\(A\) and \(B\) are conditionally independent, given \(C\) if on every path between \(A\) and \(B\):
or
Notes:
Are \(A\) and \(D\) unconditionally independent?
Now: say we removed the arrow from \(X\) to \(Y\)
We now formalize things a little more:
The probability of outcome \(x\) can always be written in the form \[P(X_1 = x_1)P(X_2 = x_2|X_1=x_1)P(X_3 = x_3|X_1=x_1, X_2 = x_2)\dots\]
This can be done with any ordering of variables.
We want to describe the distribution in a simpler way that takes account of parent-child relations and that can be used to capture interventions.
\[P: P(x_1,x_2,\dots x_n) = \prod_{j}P(x_j|pa_j)\]
The distribution under an intervention \(do(X_i=x_i')\) is: \[P_{\hat{x}_i}: P(x_1,x_2,\dots x_n|\hat{x}_i) = \mathbb{I}(x_i = x_i')\prod_{-i}P(x_j|pa_j)\]
where the parents \(pa_j\) are the parents on the graph (parents of \(x_i\) are the nodes with arrows pointing into \(x_i\).)
So:
\[P_{\hat{x}_i}: P(x_1,x_2,\dots x_n|\hat{x}_i) = \mathbb{I}(x_i = x_i')\prod_{-i}P(x_j|pa_j)\]
This means that there is only probability mass on vectors in which \(x_i = x_i'\) (reflecting the success of control) and all other variables are determined by their parents, given the values that have been set for \(x_i\).
Such expressions will be critical later when we want to consider identification.
They let us assess whether the probability of an outcome \(y\), say, depends on the value of some other node, given some other node, or given interventions on some other node.
Illustration: say binary \(X\) causes binary \(M\) which causes binary \(Y\), and say we intervene and set \(M=1\). Then what is the distribution of \((x,m,y)\)?
It is:
\[\Pr(x,m,y) = \Pr(x)\mathbb I(M = 1)\Pr(y|m)\]
We will use these ideas to motivate a general procedure for learning about, updating over, and querying, causal models.
A “causal model” is:
1. Variables: an ordered list of endogenous nodes \(\mathcal{V}\) and exogenous nodes \(\Theta\)
2. A list of \(n\) functions \(\mathcal{F}= (f^1, f^2,\dots, f^n)\), one for each element of \(\mathcal{V}\), such that each \(f^i\) takes as arguments \(\theta^i\) as well as elements of \(\mathcal{V}\) that are prior to \(V^i\) in the ordering
3. A probability distribution over \(\Theta\)
A simple causal model in which high inequality (\(I\)) affects democratization (\(D\)) via redistributive demands (\(R\)) and mass mobilization (\(M\)), which is also a function of ethnic homogeneity (\(E\)). Arrows show relations of causal dependence between variables.
Learning about effects given a model means learning about \(F\) and also the distribution of shocks (\(\Theta\)).
For discrete data this can be reduced to a question about learning about the distribution of \(\Theta\) only.
For instance the simplest model consistent with \(X \rightarrow Y\):
Endogenous Nodes = \(\{X, Y\}\), both with range \(\{0,1\}\)
Exogenous Nodes = \(\{\theta^X, \theta^Y\}\), with ranges \(\{\theta^X_0, \theta^X_1\}\) and \(\{\theta^Y_{00}, \theta^Y_{01}, \theta^Y_{10}, \theta^Y_{11}\}\)
Functional equations:
Distributions on \(\Theta\): \(\Pr(\theta^i = \theta^i_k) = \lambda^i_k\)
What is the probability that \(X\) has a positive causal effect on \(Y\)?
This is equivalent to: \(\Pr(\theta^Y =\theta^Y_{01}) = \lambda^Y_{01}\)
So we want to learn about the distributions of the exogenous nodes
This general principle extends to a vast class of causal models
Estimation and testing
Say that units are randomly assigned to treatment in different strata (maybe just one), with fixed, though possibly different, shares assigned in each stratum. Then the key estimands and estimators are:
| Estimand | Estimator |
|---|---|
| \(\tau_{ATE} \equiv \mathbb{E}[\tau_i]\) | \(\widehat{\tau}_{ATE} = \sum\nolimits_{x} \frac{w_x}{\sum\nolimits_{j}w_{j}}\widehat{\tau}_x\) |
| \(\tau_{ATT} \equiv \mathbb{E}[\tau_i | Z_i = 1]\) | \(\widehat{\tau}_{ATT} = \sum\nolimits_{x} \frac{p_xw_x}{\sum\nolimits_{j}p_jw_j}\widehat{\tau}_x\) |
| \(\tau_{ATC} \equiv \mathbb{E}[\tau_i | Z_i = 0]\) | \(\widehat{\tau}_{ATC} = \sum\nolimits_{x} \frac{(1-p_x)w_x}{\sum\nolimits_{j}(1-p_j)w_j}\widehat{\tau}_x\) |
where \(x\) indexes strata, \(p_x\) is the share of units in each stratum that is treated, and \(w_x\) is the size of a stratum.
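A minimal sketch of how these formulas combine stratum-level estimates (the numbers are invented for illustration):

```r
# Two strata with within-stratum difference-in-means estimates tau_hat,
# stratum sizes w, and treated shares p (all made-up values)
tau_hat <- c(0.5, 1.0)
w       <- c(600, 400)
p       <- c(0.2, 0.5)

c(ATE = weighted.mean(tau_hat, w),            # weight by stratum size
  ATT = weighted.mean(tau_hat, p * w),        # weight by number treated
  ATC = weighted.mean(tau_hat, (1 - p) * w))  # weight by number in control
```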
In addition, each of these can be targets of interest:
And for different subgroups,
The CATEs are conditional average treatment effects, for example the effect for men or for women. These are straightforward.
However we might also imagine conditioning on unobservable or counterfactual features.
\[LATE = \frac{1}{|C|}\sum_{j\in C}(Y_j(X=1) - Y_j(X=0))\] \[C:=\{j:X_j(Z=1) > X_j(Z=0) \}\]
We will return to these in the study of instrumental variables.
Other ways to condition on potential outcomes:
Many inquiries are averages of individual effects, even if the groups are not known, but they do not have to be:
Inquiries might relate to distributional quantities such as:
You might even be interested in \(\min(Y_i(1) - Y_i(0))\).
There are lots of interesting “spillover” estimands.
Imagine there are three individuals and each person's outcome depends on the assignments of all others. For instance \(Y_1(Z_1, Z_2, Z_3)\), or more generally, \(Y_i(Z_i, Z_{i+1 (\text{mod }3)}, Z_{i+2 (\text{mod }3)})\).
Then three estimands might be:
Interpret these. What others might be of interest?
A difference in CATEs is a well defined estimand that might involve interventions on one node only:
It captures differences in effects.
An interaction is an effect on an effect:
Note in the latter the expectation is taken over the whole population.
Say \(X\) can affect \(Y\) directly, or indirectly through \(M\). Then we can write potential outcomes as:
We can then imagine inquiries of the form:
Interpret these. What others might be of interest?
Again we might imagine that these are defined with respect to some group:
Here, among those for whom \(X\) has a positive effect on \(Y\), for what share would there be a positive effect if \(M\) were fixed at 1?
In qualitative research a particularly common inquiry is "did \(X=1\) cause \(Y=1\)?"
This is often given as a probability, the “probability of causation” (though at the case level we might better think of this probability as an estimate rather than an estimand):
\[\Pr(Y_i(0) = 0 | Y_i(1) = 1, X = 1)\]
Intuition: What’s the probability \(X=1\) caused \(Y=1\) in an \(X=1, Y=1\) case drawn from a large population with the following experimental distribution:
| Y=0 | Y=1 | All | |
|---|---|---|---|
| X=0 | 1 | 0 | 1 |
| X=1 | 0.25 | 0.75 | 1 |
Intuition: What’s the probability \(X=1\) caused \(Y=1\) in an \(X=1, Y=1\) case drawn from a large population with the following experimental distribution:
| Y=0 | Y=1 | All | |
|---|---|---|---|
| X=0 | 0.75 | 0.25 | 1 |
| X=1 | 0.25 | 0.75 | 1 |
Other inquiries focus on distinguishing between causes.
For the Billy Suzy problem (Hall 2004), Halpern (2016) focuses on “actual causation” as a way to distinguish between Suzy and Billy:
Imagine Suzy and Billy, simultaneously throwing stones at a bottle. Both are excellent shots and hit whatever they aim at. Suzy’s stone hits first, knocks over the bottle, and the bottle breaks. However, Billy’s stone would have hit had Suzy’s not hit, and again the bottle would have broken. Did Suzy’s throw cause the bottle to break? Did Billy’s?
Actual Causation:
An inquiry: for what share in a population is a possible cause an actual cause?
Pearl (e.g. Pearl and Mackenzie (2018)) describes three types of inquiry:
| Level | Activity | Inquiry |
|---|---|---|
| Association | “Seeing” | If I see \(X=1\) should I expect \(Y=1\)? |
| Intervention | “Doing” | If I set \(X\) to \(1\) should I expect \(Y=1\)? |
| Counterfactual | “Imagining” | If \(X\) were \(0\) instead of 1, would \(Y\) then be \(0\) instead of \(1\)? |
We can understand these as asking different types of questions about a causal model
| Level | Activity | Inquiry |
|---|---|---|
| Association | “Seeing” | \(\Pr(Y=1|X=1)\) |
| Intervention | “Doing” | \(\mathbb{E}[\mathbb{I}(Y(1)=1)]\) |
| Counterfactual | “Imagining” | \(\Pr(Y(1)=1 \& Y(0)=0)\) |
The third is qualitatively different because it requires information about two mutually incompatible conditions for units. This is not (generally) recoverable directly from knowledge of \(\Pr(Y(1)=1)\) and \(\Pr(Y(0)=0)\).
Given a causal model over nodes with discrete ranges, inquiries can generally be described as summaries of the distributions of exogenous nodes.
We already saw two instances of this:
What it is. When you have it. What it’s worth.
Informally a quantity is “identified” if it can be “recovered” once you have enough data.
Say for example the average wage is \(x\) in some very large population. If I gather lots and lots of data on the wages of individuals and take the average, then my estimate will ultimately let me figure out \(x\).
Identifiability Let \(Q(M)\) be a query defined over a class of models \(\mathcal M\); then \(Q\) is identifiable if \(P(M_1) = P(M_2) \rightarrow Q(M_1) = Q(M_2)\).
Identifiability with constrained data Let \(Q(M)\) be a query defined over a class of models \(\mathcal M\); then \(Q\) is identifiable from features \(F(M)\) if \(F(M_1) = F(M_2) \rightarrow Q(M_1) = Q(M_2)\).
Based on Defn 3.2.3 in Pearl.
Informally a quantity is “identified” if it can be “recovered” once you have enough data.
Then with very large (experimental) data we observe:
| Y = 0 | Y = 1 | |
|---|---|---|
| X = 0 | \(\alpha_{00} \rightarrow b/2 + c/2\) | \(\alpha_{01} \rightarrow a/2 + d/2\) |
| X = 1 | \(\alpha_{10} \rightarrow a/2 + c/2\) | \(\alpha_{11} \rightarrow b/2 + d/2\) |
What quantities are identified?
What if we:
| Y = 0 | Y = 1 | |
|---|---|---|
| X = 0 | \(\alpha_{00} \rightarrow b/2 + c/2\) | \(\alpha_{01} \rightarrow a/2 + d/2\) |
| X = 1 | \(\alpha_{10} \rightarrow a/2 + c/2\) | \(\alpha_{11} \rightarrow b/2 + d/2\) |
What quantities are now identified?
Our goal in causal inference is to estimate quantities such as:
\[\Pr(Y|\hat{x})\]
where \(\hat{x}\) is interpreted as \(X\) set to \(x\) by “external” control. Equivalently: \(do(X=x)\) or sometimes \(X \leftarrow x\).
If this quantity is identifiable then we can recover it with infinite data.
If it is not identifiable, then, even in the best case, we are not guaranteed to get the right answer.
Are there general rules for determining whether this quantity can be identified? Yes.
Note first, identifying
\[\Pr(Y|x)\]
is easy.
But we are not always interested in identifying the distribution of \(Y\) given observed values of \(x\), but rather, the distribution of \(Y\) if \(X\) is set to \(x\).
If we can identify the controlled distribution we can calculate other causal quantities of interest.
For example for a binary \(X, Y\) the causal effect of \(X\) on the probability that \(Y=1\) is:
\[\Pr(Y=1|\hat{x}=1) - \Pr(Y=1|\hat{x}=0)\]
Again, this is not the same as:
\[\Pr(Y=1|x=1) - \Pr(Y=1|x=0)\]
It’s the difference between seeing and doing.
The key idea is that you want to find a set of variables such that when you condition on these you get what you would get if you used a do operation.
Intuition:
The backdoor criterion is satisfied by \(Z\) (relative to \(X\), \(Y\)) if:
In that case you can identify the effect of \(X\) on \(Y\) by conditioning on \(Z\):
\[P(Y=y | \hat{x}) = \sum_z P(Y=y| X = x, Z=z)P(z)\] (This is eqn 3.19 in Pearl (2000))
\[P(Y=y | \hat{x}) = \sum_z P(Y=y| X = x, Z=z)P(z)\]
\[P(Y=y | \hat{x}) - P(Y=y | \hat{x}')\]
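A small simulation (my own; the data generating process is invented) showing the backdoor adjustment formula in action for a single binary confounder \(Z\):

```r
set.seed(1)
n <- 1e5
Z <- rbinom(n, 1, 0.5)                       # confounder
X <- rbinom(n, 1, 0.2 + 0.6 * Z)             # Z affects X
Y <- rbinom(n, 1, 0.1 + 0.3 * X + 0.4 * Z)   # true effect of X is +0.3

# P(Y = 1 | do(X = x)) = sum_z P(Y = 1 | X = x, Z = z) P(Z = z)
p_do <- function(x)
  sum(sapply(0:1, function(z) mean(Y[X == x & Z == z]) * mean(Z == z)))

p_do(1) - p_do(0)                  # close to 0.3
mean(Y[X == 1]) - mean(Y[X == 0])  # naive difference, biased upwards
```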
Following Pearl (2009), Chapter 11. Let \(T\) denote the set of parents of \(X\): \(T := pa(X)\), with (possibly vector valued) realizations \(t\). These might not all be observed.
If the backdoor criterion is satisfied, we have:
Key idea: The intervention level relates to the observational level as follows: \[p(y|\hat{x}) = \sum_{t\in T} p(t)p(y|x, t)\]
Think of this as fully accounting for the (possibly unobserved) causes of \(X\), \(T\)
We want to get to:
\[p(y|\hat{x}) = \sum_{t\in T} p(t)p(y|x, t)\]
We bring \(Z\) into the picture by writing:
\[p(y|\hat{x}) = \sum_{t\in T} p(t) \sum_z p(y|x, t, z)p(z|x, t)\]
now we want to get rid of \(T\)…
Using the two conditions from the backdoor definition above:
This gives: \[p(y|\hat x) = \sum_{t \in T} p(t) \sum_z p(y|x, z)p(z|t)\]
Cleaning up, we can get rid of \(T\):
\[p(y|\hat{x}) = \sum_z p(y|x, z)\sum_{t\in T} p(z|t)p(t) = \sum_z p(y| x, z)p(z)\]
For intuition:
We would be happy if we could condition on the parent \(T\), but \(T\) is not observed. However we can use \(Z\) instead making use of the fact that:
See Shpitser, VanderWeele, and Robins (2012)
The adjustment criterion is satisfied by \(Z\) (relative to \(X\), \(Y\)) if:
Note:
Here \(Z\) satisfies the adjustment criterion but not the backdoor criterion:
\(Z\) is descendant of \(X\) but it is not a descendant of a node on a path from \(X\) to \(Y\). No harm adjusting for \(Z\) here, but not necessary either.
Consider this DAG:
Why?
If:
Then \(\Pr(y| \hat x)\) is identifiable and given by:
\[\Pr(y| \hat x) = \sum_m\Pr(m|x)\sum_{x'}\left(\Pr(y|m,x')\Pr(x')\right)\]
We want to get \(\Pr(y | \hat x)\)
From the graph the joint distribution of variables is:
\[\Pr(x,m,y,u) = \Pr(u)\Pr(x|u)\Pr(m|x)\Pr(y|m,u)\] If we intervened on \(X\) we would have (\(\Pr(X = x |u)=1\)):
\[\Pr(m,y,u | \hat x) = \Pr(u)\Pr(m|x)\Pr(y|m,u)\] If we sum up over \(u\) and \(m\) we get:
\[\Pr(m,y| \hat x) = \Pr(m|x)\sum_u\left(\Pr(y|m,u)\Pr(u)\right)\] \[\Pr(y| \hat x) = \sum_m\Pr(m|x)\sum_u\left(\Pr(y|m,u)\Pr(u)\right)\]
The first part is fine; the second part however involves \(u\) which is unobserved. So we need to get the \(u\) out of \(\sum_u\left(\Pr(y|m,u)\Pr(u)\right)\).
Now, from the graph:
1. \(U\) is d-separated from \(M\) given \(X\): \[\Pr(u|m, x) = \Pr(u|x)\]
2. \(X\) is d-separated from \(Y\) by \(M\), \(U\): \[\Pr(y|x, m, u) = \Pr(y|m,u)\]

That's enough to get \(u\) out of \(\sum_u\left(\Pr(y|m,u)\Pr(u)\right)\).
\[\sum_u\left(\Pr(y|m,u)\Pr(u)\right) = \sum_x\sum_u\left(\Pr(y|m,u)\Pr(u|x)\Pr(x)\right)\]
Using the 2 equalities we got from the graph:
\[\sum_u\left(\Pr(y|m,u)\Pr(u)\right) = \sum_x\sum_u\left(\Pr(y|x,m,u)\Pr(u|x,m)\Pr(x)\right)\]
So:
\[\sum_u\left(\Pr(y|m,u)\Pr(u)\right) = \sum_x\left(\Pr(y|m,x)\Pr(x)\right)\]
Intuitively: \(X\) blocks the back door between \(M\) and \(Y\) just as well as \(U\) does.
Substituting we are left with:
\[\Pr(y| \hat x) = \sum_m\Pr(m|x)\sum_{x'}\left(\Pr(y|m,x')\Pr(x')\right)\]
(The \('\) is to distinguish the \(x\) in the summation from the value of \(x\) of interest)
It’s interesting that \(x\) remains in the right hand side in the calculation of the \(m \rightarrow y\) effect, but this is because \(x\) blocks a backdoor from \(m\) to \(y\)
Bringing all this together into a claim we have:
If:
Then \(\Pr(y| \hat x)\) is identifiable and given by:
\[\Pr(y| \hat x) = \sum_m\Pr(m|x)\sum_{x'}\left(\Pr(y|m,x')\Pr(x')\right)\]
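To see the formula at work, here is a simulation sketch of mine (the functional forms are invented) in which \(U\) confounds \(X\) and \(Y\) but \(X\) affects \(Y\) only through \(M\):

```r
set.seed(2)
n <- 1e6
U <- rbinom(n, 1, 0.5)                      # unobserved confounder
X <- rbinom(n, 1, 0.2 + 0.6 * U)
M <- rbinom(n, 1, 0.1 + 0.6 * X)            # X -> M
Y <- rbinom(n, 1, 0.1 + 0.5 * M + 0.3 * U)  # M -> Y, U -> Y, no direct X -> Y

# P(Y = 1 | do(X = x)) = sum_m P(m | x) sum_x' P(Y = 1 | m, x') P(x')
p_do <- function(x)
  sum(sapply(0:1, function(m)
    mean(M[X == x] == m) *
      sum(sapply(0:1, function(xp)
        mean(Y[M == m & X == xp]) * mean(X == xp)))))

p_do(1) - p_do(0)                  # close to 0.6 * 0.5 = 0.3
mean(Y[X == 1]) - mean(Y[X == 0])  # naive difference, biased by U
```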
There is a package (Textor et al. 2016) for figuring out what to condition on.
Define a dag using dagitty syntax:
There is then a simple command to check whether two sets are d-separated by a third set:
And a simple command to identify the adjustments needed to identify the effect of one variable on another:
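A hedged illustration of that workflow (this particular DAG is my own toy example):

```r
library(dagitty)

g <- dagitty("dag{ Z -> X -> Y ; Z -> Y }")

# Are X and Y d-separated given Z? (No: the X -> Y edge is open)
dseparated(g, "X", "Y", "Z")

# What must we adjust for to identify the effect of X on Y?
adjustmentSets(g, exposure = "X", outcome = "Y")  # { Z }
```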
Example where \(Z\) is correlated with \(X\) and \(Y\) and is a confounder
Example where \(Z\) is correlated with \(X\) and \(Y\) but it is not a confounder
But controlling can also cause problems. In fact conditioning on a temporally pre-treatment variable could cause problems. Who’d have thunk? Here is an example from Pearl (2005):
U1 <- rnorm(10000); U2 <- rnorm(10000)
Z <- U1+U2
X <- U2 + rnorm(10000)/2
Y <- U1*2 + X
lm_robust(Y ~ X) |> tidy() |> kable(digits = 2)

| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | -0.02 | 0.02 | -1.21 | 0.23 | -0.06 | 0.01 | 9998 | Y |
| X | 1.02 | 0.02 | 56.52 | 0.00 | 0.98 | 1.05 | 9998 | Y |
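The next table presumably comes from the same regression with \(Z\) added as a control; a reconstruction of the call (not shown in the original) would be something like:

```r
lm_robust(Y ~ X + Z) |> tidy() |> kable(digits = 2)
```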
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | -0.01 | 0.01 | -1.13 | 0.26 | -0.03 | 0.01 | 9997 | Y |
| X | -0.34 | 0.01 | -34.98 | 0.00 | -0.36 | -0.32 | 9997 | Y |
| Z | 1.67 | 0.01 | 220.37 | 0.00 | 1.65 | 1.68 | 9997 | Y |
g <- dagitty("dag{U1 -> Z ; U1 -> y ; U2 -> Z ; U2 -> x -> y}")
adjustmentSets(g, exposure = "x", outcome = "y")

{}
[1] FALSE
[1] TRUE
Which means, no need to condition on anything.
A bind: from Pearl 1995.
For a solution for a class of related problems see Robins, Hernan, and Brumback (2000)
g <- dagitty("dag{U1 -> Z ; U1 -> y ;
U2 -> Z ; U2 -> x -> y;
Z -> x}")
adjustmentSets(g, exposure = "x", outcome = "y")

{ U1 }
{ U2, Z }
which means you have to adjust on an unobservable. Here we double-check whether including or not including \(Z\) is enough:
So we cannot identify the effect here. But can we still learn about it?
See topics for Controls and doubly robust estimation
Unbiased estimates of the (sample) average treatment effect can be obtained (whether or not there is imbalance on covariates) using:
\[ \widehat{ATE} = \frac{1}{n_T}\sum_TY_i - \frac{1}{n_C}\sum_CY_i, \]
df <- fabricatr::fabricate(N = 100, Z = rep(0:1, N/2), Y = rnorm(N) + Z)
# by hand
df |>
summarize(Y1 = mean(Y[Z==1]),
Y0 = mean(Y[Z==0]),
            diff = Y1 - Y0) |> kable(digits = 2)

| Y1 | Y0 | diff |
|---|---|---|
| 1.07 | -0.28 | 1.35 |
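The next table matches the design-based estimate that estimatr's difference_in_means would give; the call (not shown in the original) is presumably something like:

```r
estimatr::difference_in_means(Y ~ Z, data = df) |> tidy() |> kable(digits = 2)
```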
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| Z | 1.35 | 0.17 | 7.94 | 0 | 1.01 | 1.68 | 97.98 | Y |
We can also do this with regression:
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | -0.28 | 0.12 | -2.33 | 0.02 | -0.51 | -0.04 | 98 | Y |
| Z | 1.35 | 0.17 | 7.94 | 0.00 | 1.01 | 1.68 | 98 | Y |
See Freedman (2008) on why regression is fine here
Say now different strata or blocks \(\mathcal{S}\) had different assignment probabilities. Then you could estimate:
\[ \widehat{ATE} = \sum_{S\in \mathcal{S}}\frac{n_{S}}{n} \left(\frac{1}{n_{S1}}\sum_{S\cap T}y_i - \frac{1}{n_{S0}}\sum_{S\cap C}y_i \right) \]
Note: you cannot just ignore the blocks because assignment is no longer independent of potential outcomes: you might be sampling units with different potential outcomes with different probabilities.
However, the formula above works fine because selecting is random conditional on blocks.
As a DAG this is just classic confounding:
Data with heterogeneous assignments:
True effect is 0.5, but:
Averaging over effects in blocks
# by hand
estimates <-
df |>
group_by(X) |>
summarize(Y1 = mean(Y[Z==1]),
Y0 = mean(Y[Z==0]),
diff = Y1 - Y0,
W = n())
estimates$diff |> weighted.mean(estimates$W)

[1] 0.7236939
# with estimatr
estimatr::difference_in_means(Y ~ Z, blocks = X, data = df) |>
  tidy() |> kable(digits = 2)

| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| Z | 0.72 | 0.11 | 6.66 | 0 | 0.51 | 0.94 | 496 | Y |
This also corresponds to the difference in the weighted average of treatment outcomes (with weights given by the inverse of the probability that each unit is assigned to treatment) and control outcomes (with weights given by the inverse of the probability that each unit is assigned to control).
# by hand
df |>
summarize(Y1 = weighted.mean(Y[Z==1], ip[Z==1]),
Y0 = weighted.mean(Y[Z==0], ip[Z==0]), # note !
diff = Y1 - Y0)|>
  kable(digits = 2)

| Y1 | Y0 | diff |
|---|---|---|
| 0.59 | -0.15 | 0.74 |
# with estimatr
estimatr::difference_in_means(Y ~ Z, weights = ip, data = df) |>
  tidy() |> kable(digits = 2)

| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| Z | 0.74 | 0.11 | 6.65 | 0 | 0.52 | 0.96 | 498 | Y |
But inverse propensity weighting is a more general principle, which can be used even if you do not have blocks.
The intuition for it comes straight from sampling weights — you weight up in order to recover an unbiased estimate of the potential outcomes for all units, whether or not they are assigned to treatment.
With sampling weights, however, you can include units even if their sampling weight is 1. Why can you not include such units when doing inverse propensity weighting?
Say you made a mess and used a randomization that was correlated with some variable, \(U\). For example:
Bad assignment, some randomization process you can’t understand (but can replicate) that results in unequal probabilities.
The result is a sampling distribution that is not centered on the true effect (0).
To fix you can estimate the assignment probabilities by replicating the assignment many times:
and then use these assignment probabilities in your estimator
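A sketch of that procedure (the assignment function name here is hypothetical, standing in for whatever opaque-but-replicable process was used):

```r
# Replicate the (opaque but replicable) assignment many times to estimate
# each unit's probability of treatment
assignments  <- replicate(10000, my_weird_assignment(N))  # hypothetical function
prob_treated <- rowMeans(assignments)                     # estimated Pr(Z_i = 1)

# Inverse propensity weights for the realized assignment in df
df$ip <- ifelse(df$Z == 1, 1 / prob_treated, 1 / (1 - prob_treated))

estimatr::difference_in_means(Y ~ Z, weights = ip, data = df)
```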
Implied weights
Improved results
This example is surprising but it helps you see the logic of why inverse weighting gets unbiased estimates (and why that might not guarantee a reasonable answer)
Imagine there is one unit with potential outcomes \(Y(1) = 2, Y(0) = 1\). So the unit level treatment effect is 1.
You toss a coin: if heads you assign the unit to treatment and your estimate is \(Y(1)/0.5 = 4\); if tails you assign it to control and your estimate is \(-Y(0)/0.5 = -2\).
So your expected estimate is: \[0.5 \times 4 + 0.5 \times (-2) = 1\]
Great on average but always lousy
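A quick check of this example by simulation (my own sketch):

```r
set.seed(3)
Y1 <- 2; Y0 <- 1                       # the single unit's potential outcomes
Z  <- rbinom(1e5, 1, 0.5)              # many replications of the coin toss
estimates <- Z * Y1 / 0.5 - (1 - Z) * Y0 / 0.5

mean(estimates)   # approximately 1: unbiased
table(estimates)  # but every single estimate is either 4 or -2
```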
\[\hat{\overline{Y_1}} = \frac{1}n\left(\sum_i \frac{Z_iY_i(1)}{\pi_i}\right)\] With independent assignment the expected value of \(\hat{\overline{Y_1}}\) is just:
\[\mathbb{E}[\hat{\overline{Y_1}}] =\frac1n\left( \left(\pi_1 \frac{1\times Y_1(1)}{\pi_1} + (1-\pi_1) \frac{0\times Y_1(1)}{\pi_1}\right) + \left(\pi_2 \frac{1\times Y_2(1)}{\pi_2} + (1-\pi_2) \frac{0\times Y_2(1)}{\pi_2}\right) + \dots\right)\]
\[\mathbb{E}[\hat{\overline{Y_1}}] =\frac1n\left( Y_1(1) + Y_2(1) + \dots\right) = \overline{Y_1}\]
and similarly for \(\mathbb{E}[\hat{\overline{Y_0}}]\) and so using linearity of expectations:
\[\mathbb{E}[\widehat{\overline{Y_1 - Y_0}}] = \overline{Y_1 - Y_0}\]
Let's talk about "inference"
In classical statistics we characterize our uncertainty over an estimate using an estimate of variance of the sampling distribution of the estimator.
The key idea is that we want to be able to say how likely we were to get such an estimate if the distribution of estimates associated with our design looked a given way.
More specifically we want to estimate “standard error” or the “standard deviation of the sampling distribution”
(See Wooldridge (2023), where the standard error is understood as the "estimate of the standard deviation of the sampling distribution")
Given:
The variance of the estimator over \(n\) repeated 'runs' of a design is: \(Var(\hat{\tau}) = \frac{1}{n}\sum_i(\hat\tau_i - \overline{\hat\tau})^2\)
And the standard error is:
\(se(\hat{\tau}) = \sqrt{\frac{1}n\sum_i(\hat\tau_i - \overline{\hat\tau_i})^2}\)
If we have a good measure for the shape of the sampling distribution we can start to make statements of the form:
If the sampling distribution is roughly normal, as it may be with large samples, then we can use procedures such as: “there is a 5% probability that an estimate would be more than 1.96 standard errors away from the mean of the sampling distribution”
Key idea: you can estimate the variance straight from the data, given knowledge of the assignment process and assuming well-defined potential outcomes.
Recall in general \(Var(x) = \frac{1}{n}\sum_i(x_i - \overline{x})^2\). Here the \(x_i\)s are the treatment effect estimates we might get under different random assignments, \(n\) is the number of different assignments (assumed here all equally likely, though otherwise we can weight), and \(\overline{x}\) is the truth.
For intuition imagine we have just two units \(A\), \(B\), with potential outcomes \(A_1\), \(A_0\), \(B_1\), \(B_0\).
When there are two units with outcomes \(x_1, x_2\), the variance simplifies like this:
\[Var(x) = \frac{1}2\left(x_1 - \frac{x_1 + x_2}{2}\right)^2 + \frac{1}2\left(x_2 - \frac{x_1 + x_2}{2}\right)^2 = \left(\frac{x_1 - x_2}{2}\right)^2\]
In the two unit case the two possible treatment estimates are: \(\hat{\tau}_1=A_1 - B_0\) and \(\hat{\tau}_2=B_1 - A_0\), depending on what gets put into treatment. So the variance is:
\[Var(\hat{\tau}) = \left(\frac{\hat{\tau}_1 - \hat{\tau}_2}{2}\right)^2 = \left(\frac{(A_1 - B_0) - (B_1 - A_0)}{2}\right)^2 =\left(\frac{(A_1 - B_1) + (A_0 - B_0)}{2}\right)^2 \] which we can re-write as:
\[Var(\hat{\tau}) = \left(\frac{A_1 - B_1}{2}\right)^2 + \left(\frac{A_0 - B_0}{2}\right)^2+ 2\frac{(A_1 - B_1)(A_0-B_0)}{4}\] The first two terms correspond to the variance of \(Y(1)\) and the variance of \(Y(0)\). The last term is a bit pesky though; it corresponds to twice the covariance of \(Y(1)\) and \(Y(0)\).
How can we go about estimating this?
\[Var(\hat{\tau}) = \left(\frac{A_1 - B_1}{2}\right)^2 + \left(\frac{A_0 - B_0}{2}\right)^2+ 2\frac{(A_1 - B_1)(A_0-B_0)}{4}\]
In the two unit case it is quite challenging because we do not have an estimate for any of the three terms: we do not have an estimate for the variance in the treatment group or in the control group because we have only one observation in each case; and we do not have an estimate for the covariance because we don’t observe both potential outcomes for any case.
Things do look a bit better however with more units…
From Freedman Prop 1 / Example 1 (using combinatorics!) we have:
\(V(\widehat{ATE}) = \frac{1}{n-1}\left[\frac{n_C}{n_T}V_1 + \frac{n_T}{n_C}V_0 + 2C_{01}\right]\)
… where \(V_0, V_1\) denote variances and \(C_{01}\) covariance
This is usefully rewritten as:
\[ \begin{split} V(\widehat{ATE}) & = \frac{1}{n-1}\left[\frac{n - n_T}{n_T}V_1 + \frac{n - n_C}{n_C}V_0 + 2C_{01}\right] \\ & = \frac{n}{n-1}\left[\frac{V_1}{n_T} + \frac{V_0}{n_C}\right] - \frac{1}{n-1}\left[V_1 + V_0 - 2C_{01}\right] \end{split} \]
where the final term is positive
Note:
This conservative variance estimator corresponds to commonly used "robust" estimators (see Samii and Aronow (2012)).

For the case with blocking, the conservative estimator is:
\(V(\widehat{ATE}) = {\sum_{S\in \mathcal{S}}{\left(\frac{n_{S}}{n}\right)^2} \left({\frac{s^2_{S1}}{n_{S1}}} + {\frac{s^2_{S0}}{n_{S0}}} \right)}\)
An illustration of how conservative the conservative estimator of variance really is (numbers in the plot are correlations between \(Y(1)\) and \(Y(0)\)).
We confirm that:
| \(\tau\) | \(\rho\) | \(\sigma^2_{Y(1)}\) | \(\Delta\) | \(\sigma^2_{\tau}\) | \(\widehat{\sigma}^2_{\tau}\) | \(\widehat{\sigma}^2_{\tau(\text{Neyman})}\) |
|---|---|---|---|---|---|---|
| 1.00 | -1.00 | 1.00 | -0.04 | 0.00 | -0.00 | 0.04 |
| 1.00 | -0.67 | 1.00 | -0.03 | 0.01 | 0.01 | 0.04 |
| 1.00 | -0.33 | 1.00 | -0.03 | 0.01 | 0.01 | 0.04 |
| 1.00 | 0.00 | 1.00 | -0.02 | 0.02 | 0.02 | 0.04 |
| 1.00 | 0.33 | 1.00 | -0.01 | 0.03 | 0.03 | 0.04 |
| 1.00 | 0.67 | 1.00 | -0.01 | 0.03 | 0.03 | 0.04 |
| 1.00 | 1.00 | 1.00 | 0.00 | 0.04 | 0.04 | 0.04 |
Here \(\rho\) is the unobserved correlation between \(Y(1)\) and \(Y(0)\); and \(\Delta\) is the final term in the sample variance equation that we cannot estimate.
The conservative variance comes from the fact that you do not know the covariance between \(Y(1)\) and \(Y(0)\).
Example:
sharp_var <- function(yt, yc, N=length(c(yt,yc)), upper=TRUE){
m <- length(yt)
n <- m + length(yc)
V <- function(x,N) (N-1)/(N*(length(x)-1)) * sum((x - mean(x))^2)
yt <- sort(yt)
if(upper) {yc <- sort(yc)
} else {
yc <- sort(yc,decreasing=TRUE)}
p_i <- unique(sort(c(seq(0,n-m,1)/(n-m),seq(0,m,1)/m)))-
.Machine$double.eps^.5
p_i[1] <- .Machine$double.eps^.5
yti <- yt[ceiling(p_i*m)]
yci <- yc[ceiling(p_i*(n-m))]
p_i_minus <- c(NA,p_i[1: (length(p_i)-1)])
((N-m)/m * V(yt,N) + (N-(n-m))/(n-m)*V(yc,N) +
   2*sum(((p_i-p_i_minus)*yti*yci)[2:length(p_i)]) - 2*mean(yt)*mean(yc))/(N-1)}

n <- 1000000
Y <- c(rep(0,n/2), 1000*rnorm(n/2))
X <- c(rep(0,n/2), rep(1, n/2))
lm_robust(Y~X) |> tidy() |> kable(digits = 2)

| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 0.00 | 0.00 | 0.63 | 0.53 | 0.00 | 0.00 | 999998 | Y |
| X | 1.21 | 1.41 | 0.86 | 0.39 | -1.56 | 3.98 | 999998 | Y |
c(sharp_var(Y[X==1], Y[X==0], upper = FALSE),
sharp_var(Y[X==1], Y[X==0], upper = TRUE)) |>
  round(2)

[1] 1 1
The sharp bounds are \([1,1]\) but the conservative estimate is \(\sqrt{2}\).
However you can do hypothesis testing even without an estimate of the standard error.
Up next
A procedure for using the randomization distribution to calculate \(p\) values
Illustrating \(p\) values via “randomization inference”
Say you randomized assignment to treatment and your data looked like this.
| Unit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Treatment | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Health score | 4 | 2 | 3 | 1 | 2 | 3 | 4 | 8 | 7 | 6 |
Then:
| Unit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Treatment | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Health score | 4 | 2 | 3 | 1 | 2 | 3 | 4 | 8 | 7 | 6 |
Then:
An ri estimate of \(p\):

# data
set.seed(1)
df <- fabricate(N = 1000, Z = rep(c(0,1), N/2), Y= .1*Z + rnorm(N))
# test stat
test.stat <- function(df) with(df, mean(Y[Z==1])- mean(Y[Z==0]))
# test stat distribution
ts <- replicate(4000, df |> mutate(Z = sample(Z)) |> test.stat())
# test
mean(ts >= test.stat(df)) # One sided p value

[1] 0.025
The \(p\) value is the mass to the right of the vertical
You can do the same using Alex Coppock's ri2 package.
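The table below is consistent with a conduct_ri call along these lines (a reconstruction of mine; the exact declaration is not shown in the original):

```r
library(ri2)

# Complete random assignment of 500 of 1000 units, matching df above
declaration <- randomizr::declare_ra(N = 1000, m = 500)

conduct_ri(Y ~ Z,
           declaration = declaration,
           sharp_hypothesis = 0,
           data = df,
           sims = 1000) |>
  summary()
```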
| term | estimate | upper_p_value |
|---|---|---|
| Z | 0.1321367 | 0.02225 |
You’ll notice slightly different answer. This is because although the procedure is “exact” it is subject to simulation error.
| Observed Y(0) | Observed Y(1) | Null (effect 0): Y(0) | Null (effect 0): Y(1) | Null (effect 2): Y(0) | Null (effect 2): Y(1) |
|---|---|---|---|---|---|
| 1 | NA | 1 | 1 | 1 | 3 |
| 2 | NA | 2 | 2 | 2 | 4 |
| NA | 4 | 4 | 4 | 2 | 4 |
| NA | 3 | 3 | 3 | 1 | 3 |
It is possible to use this procedure to generate confidence intervals with a natural interpretation.
Warning: calculating confidence intervals this way can be computationally intensive.
DeclareDesign can do randomization inference natively.

Here we get minimal detectable effects by using a design that has two-stage simulations, so we can estimate the sampling distribution of summaries of the sampling distribution generated from reassignments.
test_stat <- function(data)
with(data, data.frame(estimate = mean(Y[Z==1]) - mean(Y[Z==0])))
b <- 0
design <-
declare_model(N = 100, Z = complete_ra(N), Y = b*Z + rnorm(N)) +
declare_estimator(handler = label_estimator(test_stat), label = "actual")+
declare_measurement(Z = sample(Z)) + # this is the permutation step
  declare_estimator(handler = label_estimator(test_stat), label = "null")

Simulations data frame from the two-step simulation. Note the computational intensity: the number of runs is the product of the sims vector. I speed things up by using a simple estimation function and also using parallelization.
A snapshot of the simulations data frame: we have multiple step 3 draws for each design and step 1 draw.
Power for each value of b.
If you want to figure out more precisely what b gives 80% or 90% power you can narrow down the b range.
Let's now imagine a world with two treatments where we are interested in using ri to assess the interaction. (Code from Coppock, ri2)
The approach is to declare a null model that is nested in the full model. The \(F\) test statistic from the model comparison is then taken as the test statistic and its distribution is built up under re-randomizations.
Let's imagine a true model with interactions. We take an estimate. We then ask how likely that estimate is under a null model with constant effects.
Note: this is quite a sharp hypothesis
df <- fabricate(N = 1000, Z1 = rep(c(0,1), N/2), Z2 = sample(Z1), Y = Z1 + Z2 - .15*Z1*Z2 + rnorm(N))
my_estimate <- (lm(Y ~ Z1*Z2, data = df) |> coef())[4]
null_model <- function(df) {
M0 <- lm(Y ~ Z1 + Z2, data = df)
d1 <- coef(M0)[2]
d2 <- coef(M0)[3]
df |> mutate(
Y_Z1_0_Z2_0 = Y - Z1*d1 - Z2*d2,
Y_Z1_1_Z2_0 = Y + (1-Z1)*d1 - Z2*d2,
Y_Z1_0_Z2_1 = Y - Z1*d1 + (1-Z2)*d2,
Y_Z1_1_Z2_1 = Y + (1-Z1)*d1 + (1-Z2)*d2)
}
Imputed potential outcomes look like this:
| ID | Z1 | Z2 | Y | Y_Z1_0_Z2_0 | Y_Z1_1_Z2_0 | Y_Z1_0_Z2_1 | Y_Z1_1_Z2_1 |
|---|---|---|---|---|---|---|---|
| 0001 | 0 | 0 | -0.18 | -0.18 | 0.76 | 0.68 | 1.61 |
| 0002 | 1 | 0 | 0.20 | -0.73 | 0.20 | 0.12 | 1.06 |
| 0003 | 0 | 0 | 2.56 | 2.56 | 3.50 | 3.42 | 4.36 |
| 0004 | 1 | 0 | -0.27 | -1.21 | -0.27 | -0.35 | 0.59 |
| 0005 | 0 | 1 | -2.13 | -2.99 | -2.05 | -2.13 | -1.19 |
| 0006 | 1 | 1 | 3.52 | 1.72 | 2.66 | 2.58 | 3.52 |
| Design | Estimator | Outcome | Term | N Sims | One Sided Pos | One Sided Neg | Two Sided |
|---|---|---|---|---|---|---|---|
| design | estimator | Y | Z1:Z2 | 1000 | 0.95 | 0.05 | 0.10 |
| | | | | | (0.01) | (0.01) | (0.01) |
| Session | Capacity | T1 | T2 | T3 |
|---|---|---|---|---|
| Thursday | 40 | 10 | 30 | 0 |
| Friday | 40 | 10 | 0 | 30 |
| Saturday | 10 | 10 | 0 | 0 |
Optimal assignment to treatment given constraints due to facilities
| Subject type | N | Available |
|---|---|---|
| A | 30 | Thurs, Fri |
| B | 30 | Thurs, Sat |
| C | 30 | Fri, Sat |
Constraints due to subjects
If you think hard about assignment you might come up with an allocation like this.
| Subject type | N | Available | Thurs | Fri | Sat |
|---|---|---|---|---|---|
| A | 30 | Thurs, Fri | 15 | 15 | NA |
| B | 30 | Thurs, Sat | 25 | NA | 5 |
| C | 30 | Fri, Sat | NA | 25 | 5 |
Assignment of people to days
That allocation balances as much as possible. Given the allocation, you might randomly assign individuals to different days as well as randomly assigning them to treatments within days. If you then figure out the assignment propensities, this is what you would get:
Assignment probabilities:

| Subject type | N | Available | T1 | T2 | T3 |
|---|---|---|---|---|---|
| A | 30 | Thurs, Fri | 0.250 | 0.375 | 0.375 |
| B | 30 | Thurs, Sat | 0.375 | 0.625 | 0.000 |
| C | 30 | Fri, Sat | 0.375 | NA | 0.625 |
Even under the assumption that the day of measurement does not matter, these assignment probabilities have big implications for analysis.
Only the type \(A\) subjects could have received any of the three treatments.
There are no two treatments for which it is possible to compare outcomes for subpopulations \(B\) and \(C\)
A comparison of \(T1\) versus \(T2\) can only be made for population \(A \cup B\)
However, subpopulation \(A\) is assigned to \(T1\) (versus \(T2\)) with conditional probability \(0.25/(0.25+0.375) = 2/5\), while population \(B\) is assigned to \(T1\) with conditional probability \(0.375/(0.375+0.625) = 3/8\)
Implications for design: need to uncluster treatment delivery
Implications for analysis: need to take account of propensities
Idea: Wacky assignments happen but if you know the propensities you can do the analysis.
A particularly interesting application is where a random assignment combines with existing features to determine assignment to an "indirect" treatment.
Report the analysis that is implied by the design.
| | T2 = N | T2 = Y | All | Diff |
|---|---|---|---|---|
| T1 = N | \(\overline{y}_{00}\) (sd) | \(\overline{y}_{01}\) (sd) | \(\overline{y}_{0x}\) (sd) | \(d_2 \mid T1=0\) (sd) |
| T1 = Y | \(\overline{y}_{10}\) (sd) | \(\overline{y}_{11}\) (sd) | \(\overline{y}_{1x}\) (sd) | \(d_2 \mid T1=1\) (sd) |
| All | \(\overline{y}_{x0}\) (sd) | \(\overline{y}_{x1}\) (sd) | \(\overline{y}\) (sd) | \(d_2\) (sd) |
| Diff | \(d_1 \mid T2=0\) (sd) | \(d_1 \mid T2=1\) (sd) | \(d_1\) (sd) | \(d_1 d_2\) (sd) |
This is instantly recognizable from the design and returns all the benefits of the factorial design including all main effects, conditional causal effects, interactions and summary outcomes. It is much clearer and more informative than a regression table.
Focus on randomization schemes
Experiments are investigations in which an intervention, in all its essential elements, is under the control of the investigator. (Cox & Reid)
Two major types of control:
In general you might want to set things up so that your randomization is replicable. You can do this by setting a seed; re-running with the same seed reproduces the same assignment:
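A minimal sketch, assuming the randomizr package (complete_ra is also what DeclareDesign uses under the hood):

```r
library(randomizr)

set.seed(343)
complete_ra(N = 10, m = 5)   # one draw

set.seed(343)
complete_ra(N = 10, m = 5)   # same seed, same assignment
```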
Even better is to set it up so that it can reproduce lots of possible draws so that you can check the propensities for each unit.
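One sketch of how such a matrix of candidate assignments might be built up (the randomizr package is assumed; the propensities reported below come from a draw of this kind):

```r
# 1000 candidate assignments of 5 of 10 units to treatment, one per column
P <- replicate(1000, complete_ra(N = 10, m = 5))

# empirical propensity for each unit: should hover around 0.5
round(rowMeans(P), 3)
```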
[1] 0.519 0.496 0.510 0.491 0.524 0.514 0.535 0.497 0.470 0.506
Here the \(P\) matrix gives 1000 possible ways of allocating 5 of 10 units to treatment. We can then confirm that the average propensity for each unit is approximately 0.5.
A survey dictionary with results from a complex randomization presented in a simple way for enumerators
People often wonder: did randomization work? Common practice is to implement a set of \(t\)-tests to see if there is balance. This makes no sense.
- If you doubt whether it was implemented properly, do an \(F\) test.
- If you worry about variance, specify controls in advance as a function of their relation with outcomes (more on this later).
- If you worry about conditional bias, look at substantive differences between groups, not \(t\)-tests.
If you want realizations to have particular properties: build it into the scheme in advance.
Note: clusters are part of your design, not part of the world.
Often used if intervention has to function at the cluster level or if outcome defined at the cluster level.
Disadvantage: loss of statistical power
However: perfectly possible to assign some treatments at cluster level and then other treatments at the individual level
Principle: (unless you are worried about spillovers) generally make clusters as small as possible
Principle: Surprisingly, variability in cluster size makes analysis harder. Try to control assignment so that cluster sizes are similar in treatment and in control.
Be clear about whether you believe effects are operating at the cluster level or at the individual level. This matters for power calculations.
Be clear about whether spillover effects operate only within clusters or also across them. If within only you might be able to interpret treatment as the effect of being in a treated cluster…
Surprisingly, if clusters are of different sizes the difference in means estimator is not unbiased, even if all units are assigned to treatment with the same probability.
Here's the intuition. Say there are two clusters, each with homogeneous treatment effects:
| Cluster | Size | Y0 | Y1 |
|---|---|---|---|
| 1 | 1000000 | 0 | 1 |
| 2 | 1 | 0 | 0 |
Then: What is the true average treatment effect? What do you expect to estimate from cluster random assignment?
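A quick sketch of the calculation, enumerating the two equally likely cluster assignments:

```r
# cluster 1: 1,000,000 units with Y0 = 0, Y1 = 1; cluster 2: one unit with Y0 = Y1 = 0
n  <- c(1e6, 1)
Y0 <- c(0, 0)
Y1 <- c(1, 0)

true_ATE <- sum(n * (Y1 - Y0)) / sum(n)   # essentially 1

# difference in means under each of the two possible cluster assignments
dim_1 <- Y1[1] - Y0[2]   # cluster 1 treated: 1 - 0 = 1
dim_2 <- Y1[2] - Y0[1]   # cluster 2 treated: 0 - 0 = 0

mean(c(dim_1, dim_2))    # expected estimate = 0.5, far from the true ATE
```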
The solution is to block by cluster size. For more see: http://gking.harvard.edu/files/cluster.pdf
There are more or less efficient ways to randomize.
Consider a case with four units and two strata. There are 6 possible assignments of 2 units to treatment:
| ID | X | Y(0) | Y(1) | R1 | R2 | R3 | R4 | R5 | R6 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 3 | 2 | 1 | 2 | 0 | 1 | 0 | 1 | 0 | 1 |
| 4 | 2 | 1 | 2 | 0 | 0 | 1 | 0 | 1 | 1 |
| – | – | – | – | – | – | – | – | – | – |
| \(\widehat{\tau}\): | | | | 0 | 1 | 1 | 1 | 1 | 2 |
Even with a constant treatment effect and everything uniform within blocks, there is variance in the estimation of \(\widehat{\tau}\). This can be eliminated by excluding R1 and R6.
Simple blocking in R (5 pairs):
| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 |
| 0 | 0 | 1 | 0 | 0 |
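One way to generate an assignment like this, as a sketch using the randomizr package (five blocks of two units, exactly one treated per block; the table above shows each pair as a column):

```r
library(randomizr)

pairs <- rep(1:5, each = 2)                        # 5 blocks (pairs) of 2 units
Z <- block_ra(blocks = pairs, block_m = rep(1, 5)) # one treated unit per pair
matrix(Z, nrow = 2, dimnames = list(NULL, 1:5))    # display pairs as columns
```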
DeclareDesign can help with this. Load up:
| \(T2=0\) | \(T2=1\) | |
|---|---|---|
| T1 = 0 | \(50\%\) | \(0\%\) |
| T1 = 1 | \(50\%\) | \(0\%\) |
Spread out:
| \(T2=0\) | \(T2=1\) | |
|---|---|---|
| T1 = 0 | \(25\%\) | \(25\%\) |
| T1 = 1 | \(25\%\) | \(25\%\) |
Three arm it?:
| \(T2=0\) | \(T2=1\) | |
|---|---|---|
| T1 = 0 | \(33.3\%\) | \(33.3\%\) |
| T1 = 1 | \(33.3\%\) | \(0\%\) |
Bunch it?:
| \(T2=0\) | \(T2=1\) | |
|---|---|---|
| T1 = 0 | \(40\%\) | \(20\%\) |
| T1 = 1 | \(20\%\) | \(20\%\) |
This speaks to "spreading out." Note: the "bunching" example may not pay off and has the undesirable feature of introducing a correlation between treatment assignments.
Two ways to do factorial assignments in DeclareDesign:
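A sketch of two possible approaches (the object and variable names here are illustrative):

```r
library(DeclareDesign)

# Way 1: two independent complete random assignments
two_step <- declare_assignment(
  Z1 = complete_ra(N),
  Z2 = complete_ra(N))

# Way 2: one four-arm assignment, then split the arm label into two factors
four_arm <- declare_assignment(
  arm = complete_ra(N, conditions = c("00", "01", "10", "11")),
  Z1  = as.numeric(substr(as.character(arm), 1, 1)),
  Z2  = as.numeric(substr(as.character(arm), 2, 2)))
```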
In practice if you have a lot of treatments it can be hard to do full factorial designs – there may be too many combinations.
In such cases people use fractional factorial designs, like the one below (5 treatments but only 8 units!)
| Variation | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 1 |
| 4 | 0 | 1 | 1 | 1 | 0 |
| 5 | 1 | 0 | 0 | 1 | 0 |
| 6 | 1 | 0 | 1 | 0 | 1 |
| 7 | 1 | 1 | 0 | 0 | 0 |
| 8 | 1 | 1 | 1 | 1 | 1 |
Then randomly assign units to rows. Note columns might also be blocking covariates.
In R, look at library(survey)
| Unit | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 1 |
| 4 | 0 | 1 | 1 | 1 | 0 |
| 5 | 1 | 0 | 0 | 1 | 0 |
| 6 | 1 | 0 | 1 | 0 | 1 |
| 7 | 1 | 1 | 0 | 0 | 0 |
| 8 | 1 | 1 | 1 | 1 | 1 |
Muralidharan, Romero, and Wüthrich (2023) write:
Factorial designs are widely used to study multiple treatments in one experiment. While t-tests using a fully-saturated “long” model provide valid inferences, “short” model t-tests (that ignore interactions) yield higher power if interactions are zero, but incorrect inferences otherwise. Of 27 factorial experiments published in top-5 journals (2007–2017), 19 use the short model. After including interactions, over half of their results lose significance. […]
Anything to be done on randomization to address external validity concerns?
A design with hierarchical data and different assignment schemes.
design <-
declare_model(
school = add_level(N = 16,
u_school = rnorm(N, mean = 0)),
classroom = add_level(N = 4,
u_classroom = rnorm(N, mean = 0)),
student = add_level(N = 20,
u_student = rnorm(N, mean = 0))
) +
declare_model(
potential_outcomes(Y ~ .1*Z + u_classroom + u_student + u_school)
) +
declare_assignment(Z = simple_ra(N)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, .method = difference_in_means)

Here are the first couple of rows and columns of the resulting data frame.
| school | u_school | classroom | u_classroom | student | u_student | Y_Z_0 | Y_Z_1 | Z | Y |
|---|---|---|---|---|---|---|---|---|---|
| 01 | 1.35 | 01 | 1.26 | 0001 | -1.28 | 1.33 | 1.43 | 0 | 1.33 |
| 01 | 1.35 | 01 | 1.26 | 0002 | 0.79 | 3.40 | 3.50 | 1 | 3.50 |
| 01 | 1.35 | 01 | 1.26 | 0003 | -0.12 | 2.49 | 2.59 | 0 | 2.49 |
| 01 | 1.35 | 01 | 1.26 | 0004 | -0.65 | 1.96 | 2.06 | 1 | 2.06 |
| 01 | 1.35 | 01 | 1.26 | 0005 | 0.36 | 2.97 | 3.07 | 1 | 3.07 |
| 01 | 1.35 | 01 | 1.26 | 0006 | -0.96 | 1.65 | 1.75 | 0 | 1.65 |
Here is the distribution between treatment and control:
We can draw a new set of data and look at the number of subjects in the treatment and control groups.
But what if all students in a given class have to be assigned the same treatment?
assignment_clustered <-
declare_assignment(Z = cluster_ra(clusters = classroom))
estimator_clustered <-
declare_estimator(Y ~ Z, clusters = classroom,
.method = difference_in_means)
design_clustered <-
design |>
replace_step("assignment", assignment_clustered) |>
replace_step("estimator", estimator_clustered)assignment_clustered_blocked <-
declare_assignment(Z = block_and_cluster_ra(blocks = school,
clusters = classroom))
estimator_clustered_blocked <-
declare_estimator(Y ~ Z, blocks = school, clusters = classroom,
.method = difference_in_means)
design_clustered_blocked <-
design |>
replace_step("assignment", assignment_clustered_blocked) |>
replace_step("estimator", estimator_clustered_blocked)| Design | Power | Coverage |
|---|---|---|
| simple | 0.16 | 0.95 |
| (0.01) | (0.01) | |
| complete | 0.20 | 0.96 |
| (0.01) | (0.01) | |
| blocked | 0.42 | 0.95 |
| (0.01) | (0.01) | |
| clustered | 0.06 | 0.96 |
| (0.01) | (0.01) | |
| clustered_blocked | 0.08 | 0.96 |
| (0.01) | (0.01) |
In many designs you seek to assign an integer number of subjects to treatment from some set.
Sometimes, however, your assignment targets are not integers.
Example:
Two strategies:
The prob_ra function (from the probra package, used below) can also be used to set targets:
# remotes::install_github("macartan/probra")
library(probra)
library(knitr)   # for kable()

set.seed(1)
fabricate(N = 4, size = c(47, 53, 87, 25), n_treated = prob_ra(.5*size)) |>
  janitor::adorn_totals("row") |>
  kable(caption = "Setting targets to get 50% targets with minimal variance")

| ID | size | n_treated |
|---|---|---|
| 1 | 47 | 23 |
| 2 | 53 | 27 |
| 3 | 87 | 43 |
| 4 | 25 | 13 |
| Total | 212 | 106 |
It can also be used for complete assignment with heterogeneous propensities:
[1] 0.5
Indirect control
Indirect assignments are generally generated by applying a direct assignment and then figuring out the implied indirect assignment.
This looks better, but there are trade-offs between the direct and indirect distributions.
Figuring out the optimal procedure requires full diagnosis
A focus on power
In the classical approach to testing a hypothesis we ask:
How likely are we to see data like this if indeed the hypothesis is true?
How unlikely is “not very likely”?
When we test a hypothesis we decide first on what sort of evidence we need to see in order to decide that the hypothesis is not reliable.
Othello has a hypothesis that Desdemona is innocent.
Iago confronts him with evidence:
Note that Othello is focused on the probability of the events if she were innocent but not the probability of the events if Iago were trying to trick him.
He is not assessing his belief in whether she is faithful, but rather how likely the data would be if she were faithful.
So:
Illustrating \(p\) values via “randomization inference”
Say you randomized assignment to treatment and your data looked like this.
| Unit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Treatment | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Health score | 4 | 2 | 3 | 1 | 2 | 3 | 4 | 8 | 7 | 6 |
Then:
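As a sketch of the logic (the numbers are from the table above): under the sharp null of no effect, the treated unit could equally well have been any of the ten units, so we can compute the test statistic for each of the ten possible assignments.

```r
Y <- c(4, 2, 3, 1, 2, 3, 4, 8, 7, 6)
Z <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0)

observed <- mean(Y[Z == 1]) - mean(Y[Z == 0])   # about 4.44

# under the sharp null, re-assign the single treated unit to each position
ts <- sapply(1:10, function(i) Y[i] - mean(Y[-i]))

mean(ts >= observed)   # one-sided p value: 0.1
```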
Power is just the probability of getting a significant result, that is, of rejecting a hypothesis.
Simple enough but it presupposes:
I want to test the hypothesis that a six never comes up on this dice.
Here’s my test:
What is the power of this test?
Power sometimes seems more complicated because hypothesis rejection involves a calculated probability and so you need the probability of a probability.
I want to test the hypothesis that this dice is fair.
Here’s my test:
Now:
For this we need to figure a rule for rejection. This is based on identifying events that should be unlikely under the hypothesis.
Here is how many 6’s I would expect if the dice is fair:
I can figure out from this that 143 or fewer is really very few and 190 or more is really very many:
Now we need to stipulate some belief about how the world really works—this is not the null hypothesis that we plan to reject, but something that we actually take to be true.
For instance: we think that in fact sixes appear 20% of the time.
Now what’s the probability of seeing at least 190 sixes?
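A sketch of the calculation, assuming 1,000 rolls (a number consistent with the 143/190 cutoffs above):

```r
n <- 1000

# rejection region under the null of a fair die (p = 1/6)
qbinom(c(0.025, 0.975), n, 1/6)   # roughly 144 and 190

# power if sixes in fact appear 20% of the time:
# the probability of seeing 190 or more sixes
1 - pbinom(189, n, 0.2)           # about 0.8
```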
So given I think 6s appear 20% of the time, I think it likely I’ll see at least 190 sixes and reject the hypothesis of a fair dice.
Simplest intuition on power:
What is the probability of getting a significant estimate given the sampling distribution is centered on \(b\) and the standard error is 1?
Add together the probability of getting an estimate above 1.96 and the probability of getting an estimate below -1.96.
This is essentially what is done by pwrss::power.z.test – and it produces nice graphs!
See:
Substantively: if in expectation an estimate will be just significant, then your power is 50%
power <- function(b, alpha = 0.05, critical = qnorm(1 - alpha/2))
  1 - pnorm(critical - b) + pnorm(-critical - b)

power(1.96)

[1] 0.5000586
Intuition:
Of course the standard error will depend on the number of units and the variance of outcomes in treatment and control.
Say \(N\) subjects are divided into two groups and potential outcomes have standard deviation \(\sigma\) in treatment and control. Then the variance of the estimated treatment effect is (approximately, and conservatively):
\[Var(\tau)=\frac{\sigma^2}{N/2} + \frac{\sigma^2}{N/2} = 4\frac{\sigma^2}{N}\]
and so the (conservative / approx) standard error is:
\[\sigma_\tau=\frac{2\sigma}{\sqrt{N}}\]
Note here we seem to be using the actual standard error but of course the tests we actually run will use an estimate of the standard error…
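Putting these pieces together by hand, here is a sketch that reuses the power function defined above:

```r
# power for a two-arm design with N subjects, outcome sd sigma, and effect size b
power_N <- function(N, b, sigma = 1, alpha = 0.05) {
  se <- 2 * sigma / sqrt(N)
  power(b / se, alpha = alpha)
}

power_N(N = 100, b = 0.1)   # about 0.08: in line with the pwrss output below
```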
This can be done e.g. with pwrss like this:
pwrss::pwrss.t.2means(mu1 = .2, mu2 = .1, sd1 = 1, sd2 = 1,
n2 = 50, alpha = 0.05,
alternative = "not equal")+--------------------------------------------------+
| POWER CALCULATION |
+--------------------------------------------------+
Welch's T-Test (Independent Samples)
---------------------------------------------------
Hypotheses
---------------------------------------------------
H0 (Null Claim) : d - null.d = 0
H1 (Alt. Claim) : d - null.d != 0
---------------------------------------------------
Results
---------------------------------------------------
Sample Size = 50 and 50
Type 1 Error (alpha) = 0.050
Type 2 Error (beta) = 0.921
Statistical Power = 0.079 <<
[1] 0.7010827
Power calculations for more complex designs mostly involve figuring out the standard error.
Consider a cluster randomized trial, with each unit having a cluster level shock \(\epsilon_k\) and an individual shock \(\nu_i\). Say these have variances \(\sigma^2_k\), \(\sigma^2_i\).
The standard error is:
\[\sqrt{\frac{4\sigma^2_k}{K} + \frac{4\sigma^2_i}{nK}}\]
Define \(\rho = \frac{\sigma^2_k}{\sigma^2_k + \sigma^2_i}\). Then the standard error can be written:

\[\sqrt{\rho \frac{4\sigma^2}{K} + (1- \rho)\frac{4\sigma^2}{nK}}\]

or, equivalently:

\[\sqrt{((n - 1)\rho + 1)\frac{4\sigma^2}{nK}}\]

where \(\sigma^2 = \sigma^2_k + \sigma^2_i\) is the total variance, \(K\) is the number of clusters, and \(n\) is the number of units per cluster.
Plug in these standard errors and proceed as before.
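A sketch, reusing the power function defined above (the function name and parameter values are illustrative):

```r
# power for a cluster-randomized design with K clusters of n units,
# intra-cluster correlation rho, total sd sigma, and effect size b
cluster_power <- function(K, n, rho, b, sigma = 1, alpha = 0.05) {
  se <- sqrt(((n - 1) * rho + 1) * 4 * sigma^2 / (n * K))
  power(b / se, alpha = alpha)
}

cluster_power(K = 20, n = 10, rho = 0.1, b = 0.3)
```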
Simulation-based power calculation (for example with DeclareDesign) is arbitrarily flexible:
| sim_ID | estimate | p.value |
|---|---|---|
| 1 | 0.24 | 0.22 |
| 2 | 0.80 | 0.00 |
| 3 | 0.25 | 0.16 |
| 4 | 0.27 | 0.18 |
| 5 | 0.87 | 0.00 |
| 6 | 0.57 | 0.00 |
Power is obviously related to the distribution of estimates you might get.
A valid \(p\)-value satisfies \(\Pr(p \leq x) \leq x\) for every \(x \in [0,1]\) (under the null)
| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| 0.50 | -0.00 | 0.20 | 0.20 | 0.69 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
| b | Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|---|
| 0 | 0.00 | 0.00 | 0.20 | 0.20 | 0.05 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
| 0.25 | 0.25 | 0.00 | 0.20 | 0.20 | 0.24 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
| 0.5 | 0.50 | -0.00 | 0.20 | 0.20 | 0.69 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
| 1 | 1.00 | -0.00 | 0.20 | 0.20 | 1.00 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
coming up:
We often focus on sample sizes, but power also depends on other features of the design and analysis strategy. For example:
Say we have access to a “pre” measure of outcome Y_now; call it Y_base. Y_base is informative about potential outcomes. We are considering using Y_now - Y_base as the outcome instead of Y_now.
N <- 100
rho <- .5
design <-
declare_model(N,
Y_base = rnorm(N),
Y_Z_0 = 1 + correlate(rnorm, given = Y_base, rho = rho),
Y_Z_1 = correlate(rnorm, given = Y_base, rho = rho),
Z = complete_ra(N),
Y_now = Z*Y_Z_1 + (1-Z)*Y_Z_0,
Y_change = Y_now - Y_base) +
declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
declare_estimator(Y_now ~ Z, label = "level") +
  declare_estimator(Y_change ~ Z, label = "change") +
  declare_estimator(Y_now ~ Z + Y_base, label = "RHS")

Punchline:
You can see from the null design that power is great but bias is terrible and coverage is way off.
| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| 1.59 | 1.59 | 0.12 | 1.60 | 1.00 | 0.00 |
| (0.01) | (0.01) | (0.00) | (0.01) | (0.00) | (0.00) |
Power without unbiasedness corrupts, absolutely
another_bad_design <-
declare_model(
N = 100,
female = rep(0:1, N/2),
U = rnorm(N),
potential_outcomes(Y ~ female * Z + U)) +
declare_assignment(
Z = block_ra(blocks = female, block_prob = c(.1, .5)),
Y = reveal_outcomes(Y ~ Z)) +
declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
declare_estimator(Y ~ Z + female, inquiry = "ate",
                    .method = lm_robust)

You can see from the null design that power is great but bias is terrible and coverage is way off.
| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| 0.76 | 0.26 | 0.24 | 0.35 | 0.84 | 0.85 |
| (0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.02) |
clustered_design <-
declare_model(
cluster = add_level(N = 10, cluster_shock = rnorm(N)),
individual = add_level(
N = 100,
Y_Z_0 = rnorm(N) + cluster_shock,
Y_Z_1 = rnorm(N) + cluster_shock)) +
declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
declare_assignment(Z = cluster_ra(clusters = cluster)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_estimator(Y ~ Z, inquiry = "ATE")| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| -0.00 | -0.00 | 0.64 | 0.64 | 0.79 | 0.20 |
| (0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.01) |
What alerts you to a problem?
| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| 0.00 | -0.00 | 0.66 | 0.65 | 0.06 | 0.94 |
| (0.02) | (0.02) | (0.01) | (0.01) | (0.01) | (0.01) |
design_uncertain <-
declare_model(N = 1000, b = 1+rnorm(1), Y_Z_1 = rnorm(N), Y_Z_2 = rnorm(N) + b, Y_Z_3 = rnorm(N) + b) +
declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_inquiry(ate = mean(b)) +
declare_estimator(Y ~ factor(Z), term = TRUE)
draw_estimands(design_uncertain)

  inquiry   estimand
1     ate -0.3967765

and again:

  inquiry  estimand
1     ate  0.7887188
Say I run two tests and want to correct for multiple comparisons.
Two approaches. First, by hand:
b = .2
design_mc <-
declare_model(N = 1000, Y_Z_1 = rnorm(N), Y_Z_2 = rnorm(N) + b, Y_Z_3 = rnorm(N) + b) +
declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_inquiry(ate = b) +
  declare_estimator(Y ~ factor(Z), term = TRUE)

design_mc |>
simulate_designs(sims = 1000) |>
filter(term != "(Intercept)") |>
group_by(sim_ID) |>
mutate(p_bonferroni = p.adjust(p = p.value, method = "bonferroni"),
p_holm = p.adjust(p = p.value, method = "holm"),
p_fdr = p.adjust(p = p.value, method = "fdr")) |>
ungroup() |>
summarize(
"Power using naive p-values" = mean(p.value <= 0.05),
"Power using Bonferroni correction" = mean(p_bonferroni <= 0.05),
"Power using Holm correction" = mean(p_holm <= 0.05),
"Power using FDR correction" = mean(p_fdr <= 0.05)
  )

| Power using naive p-values | Power using Bonferroni correction | Power using Holm correction | Power using FDR correction |
|---|---|---|---|
| 0.7374 | 0.6318 | 0.6886 | 0.7032 |
The alternative approach (generally better!) is to design with a custom estimator that includes your corrections.
my_estimator <- function(data)
lm_robust(Y ~ factor(Z), data = data) |>
tidy() |>
filter(term != "(Intercept)") |>
mutate(p.naive = p.value,
p.value = p.adjust(p = p.naive, method = "bonferroni"))
design_mc_2 <- design_mc |>
replace_step(5, declare_estimator(handler = label_estimator(my_estimator)))
run_design(design_mc_2) |>
  select(term, estimate, p.value, p.naive) |> kable()

| term | estimate | p.value | p.naive |
|---|---|---|---|
| factor(Z)2 | 0.1182516 | 0.2502156 | 0.1251078 |
| factor(Z)3 | 0.1057031 | 0.3337476 | 0.1668738 |
Let's try the same thing for a null model (using redesign(design_mc_2, b = 0)).
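A sketch of that diagnosis call (the number of simulations is illustrative):

```r
design_mc_2 |>
  redesign(b = 0) |>
  diagnose_design(sims = 1000)
```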
…and power:
| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| 0.00 | 0.00 | 0.08 | 0.08 | 0.02 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.01) |
| -0.00 | -0.00 | 0.08 | 0.08 | 0.02 | 0.96 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.01) |
Should you be bothered?
Sometimes you give a medicine but only a nonrandom sample of people actually try to use it. Can you still estimate the medicine’s effect?
| X=0 | X=1 | |
|---|---|---|
| Z=0 | \(\overline{y}_{00}\) (\(n_{00}\)) | \(\overline{y}_{01}\) (\(n_{01}\)) |
| Z=1 | \(\overline{y}_{10}\) (\(n_{10}\)) | \(\overline{y}_{11}\) (\(n_{11}\)) |
Say that people are one of 3 types: always-takers (\(a\)), never-takers (\(n\)), and compliers (\(c\)).
We can figure out something about the types:
| \(X=0\) | \(X=1\) | |
|---|---|---|
| \(Z=0\) | \(\frac{\frac{1}{2}n_c}{\frac{1}{2}n_c + \frac{1}{2}n_n} \overline{y}^0_{c}+\frac{\frac{1}{2}n_n}{\frac{1}{2}n_c + \frac{1}{2}n_n} \overline{y}_{n}\) | \(\overline{y}_{a}\) |
| \(Z=1\) | \(\overline{y}_{n}\) | \(\frac{\frac{1}{2}n_c}{\frac{1}{2}n_c + \frac{1}{2}n_a} \overline{y}^1_{c}+\frac{\frac{1}{2}n_a}{\frac{1}{2}n_c + \frac{1}{2}n_a} \overline{y}_{a}\) |
You give a medicine to 50% but only a non random sample of people actually try to use it. Can you still estimate the medicine’s effect?
| \(X=0\) | \(X=1\) | |
|---|---|---|
| \(Z=0\) | \(\frac{n_c}{n_c + n_n} \overline{y}^0_{c}+\frac{n_n}{n_c + n_n} \overline{y}_n\) | \(\overline{y}_{a}\) |
| (n) | (\(\frac{1}{2}(n_c + n_n)\)) | (\(\frac{1}{2}n_a\)) |
| \(Z=1\) | \(\overline{y}_{n}\) | \(\frac{n_c}{n_c + n_a} \overline{y}^1_{c}+\frac{n_a}{n_c + n_a} \overline{y}_{a}\) |
| (n) | (\(\frac{1}{2}n_n\)) | (\(\frac{1}{2}(n_a+n_c)\)) |
Key insight: the contributions of the \(a\)s and \(n\)s are the same in the \(Z=0\) and \(Z=1\) groups so if you difference you are left with the changes in the contributions of the \(c\)s.
Average in \(Z=0\) group: \(\frac{{n_c} \overline{y}^0_{c}+ \left(n_{n}\overline{y}_{n} +{n_a} \overline{y}_a\right)}{n_a+n_c+n_n}\)
Average in \(Z=1\) group: \(\frac{{n_c} \overline{y}^1_{c} + \left(n_{n}\overline{y}_{n} +{n_a} \overline{y}_a \right)}{n_a+n_c+n_n}\)
So, the difference is the ITT: \(({\overline{y}^1_c-\overline{y}^0_c})\frac{n_c}{n}\)
Last step:
\[ITT = ({\overline{y}^1_c-\overline{y}^0_c})\frac{n_c}{n}\]
\[\leftrightarrow\]
\[LATE = \frac{ITT}{\frac{n_c}{n}}= \frac{\text{Intent to treat effect}}{\text{First stage effect}}\]
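A sketch of this logic with simulated data (all names and parameter values are illustrative; the estimatr, fabricatr, and randomizr packages are assumed):

```r
library(estimatr)
library(fabricatr)
library(randomizr)

set.seed(1)
df <- fabricate(
  N = 1000,
  type = sample(c("complier", "always", "never"), N, replace = TRUE, prob = c(.5, .25, .25)),
  Z = complete_ra(N),                                              # random encouragement
  X = ifelse(type == "always", 1, ifelse(type == "never", 0, Z)),  # actual take-up
  Y = 0.5 * X * (type == "complier") + rnorm(N)                    # effect of 0.5, compliers only
)

ITT         <- coef(lm_robust(Y ~ Z, data = df))["Z"]
first_stage <- coef(lm_robust(X ~ Z, data = df))["Z"]

ITT / first_stage                 # ratio estimate of the LATE (about 0.5)
iv_robust(Y ~ X | Z, data = df)   # the equivalent 2SLS estimate
```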
With and without an imposition of monotonicity
data("lipids_data")
models <-
list(unrestricted = make_model("Z -> X -> Y; X <-> Y"),
restricted = make_model("Z -> X -> Y; X <-> Y") |>
set_restrictions("X[Z=1] < X[Z=0]")) |>
lapply(update_model, data = lipids_data, refresh = 0)
models |>
query_model(query = list(CATE = "Y[X=1] - Y[X=0]",
Nonmonotonic = "X[Z=1] < X[Z=0]"),
given = list("X[Z=1] > X[Z=0]", TRUE),
using = "posteriors") With and without an imposition of monotonicity:
| model | query | mean | sd |
|---|---|---|---|
| unrestricted | CATE | 0.70 | 0.05 |
| restricted | CATE | 0.71 | 0.05 |
| unrestricted | Nonmonotonic | 0.01 | 0.01 |
| restricted | Nonmonotonic | 0.00 | 0.00 |
In one case we assume monotonicity; in the other we update on it (easy in this case because of the empirically verifiable nature of one-sided noncompliance).
Multiple survey experimental designs have been developed to make it easier for subjects to answer sensitive questions.
The key idea is to use inference rather than measurement.
Subjects are placed in different conditions and the conditions affect the answers that are given in such a way that you can infer some underlying quantity of interest
This is an obvious DAG, but the main point is to be clear that the value is the quantity of interest and that this value is not affected by the treatment, \(Z\).
The list experiment supposes that:
In other words: sensitivities notwithstanding, they are happy for the researcher to make correct inferences about them or their group
Respondents are given a short list and a long list.
The long list differs from the short list in having one extra item—the sensitive item
We ask how many items in each list does a respondent agree with:
How many of these do you agree with:
| | Short list | Long list | “Effect” |
|---|---|---|---|
| | “2 + 2 = 4” | “2 + 2 = 4” | |
| | “2 * 3 = 6” | “2 * 3 = 6” | |
| | “3 + 6 = 8” | “Climate change is real” | |
| | | “3 + 6 = 8” | |
| Answer | Y(0) = 2 | Y(1) = 4 | Y(1) - Y(0) = 2 |
[Note: this is obviously not a good list. Why not?]
declaration_17.3 <-
declare_model(
N = 500,
control_count = rbinom(N, size = 3, prob = 0.5),
Y_star = rbinom(N, size = 1, prob = 0.3),
potential_outcomes(Y_list ~ Y_star * Z + control_count)
) +
declare_inquiry(prevalence_rate = mean(Y_star)) +
declare_assignment(Z = complete_ra(N)) +
declare_measurement(Y_list = reveal_outcomes(Y_list ~ Z)) +
declare_estimator(Y_list ~ Z, .method = difference_in_means,
inquiry = "prevalence_rate")
diagnosands <- declare_diagnosands(
bias = mean(estimate - estimand),
mean_CI_width = mean(conf.high - conf.low)
)

| Design | Inquiry | Bias | Mean CI Width |
|---|---|---|---|
| declaration_17.3 | prevalence_rate | 0.00 | 0.32 |
| (0.00) | (0.00) |
declaration_17.4 <-
declare_model(
N = N,
U = rnorm(N),
control_count = rbinom(N, size = 3, prob = 0.5),
Y_star = rbinom(N, size = 1, prob = 0.3),
W = case_when(Y_star == 0 ~ 0L,
Y_star == 1 ~ rbinom(N, size = 1, prob = proportion_hiding)),
potential_outcomes(Y_list ~ Y_star * Z + control_count)
) +
declare_inquiry(prevalence_rate = mean(Y_star)) +
declare_assignment(Z = complete_ra(N)) +
declare_measurement(Y_list = reveal_outcomes(Y_list ~ Z),
Y_direct = Y_star - W) +
declare_estimator(Y_list ~ Z, inquiry = "prevalence_rate", label = "list") +
  declare_estimator(Y_direct ~ 1, inquiry = "prevalence_rate", label = "direct")

rho <- -.8
correlated_lists <-
declare_model(
N = 500,
U = rnorm(N),
control_1 = rbinom(N, size = 1, prob = 0.5),
control_2 = correlate(given = control_1, rho = rho, draw_binary, prob = 0.5),
control_count = control_1 + control_2,
Y_star = rbinom(N, size = 1, prob = 0.3),
potential_outcomes(Y_list ~ Y_star * Z + control_count)
) +
declare_inquiry(prevalence_rate = mean(Y_star)) +
declare_assignment(Z = complete_ra(N)) +
declare_measurement(Y_list = reveal_outcomes(Y_list ~ Z)) +
  declare_estimator(Y_list ~ Z)

This is typically used to estimate average levels.
However you can use it in the obvious way to get average levels for groups: this is equivalent to calculating group level heterogeneous effects
Extending the idea you can even get individual level estimates: for instance you might use causal forests
You can also use this to estimate the effect of an experimental treatment on an item that’s measured using a list, without requiring individual level estimates:
\[Y_i = \beta_0 + \beta_1Z_i + \beta_2Long_i + \beta_3Z_iLong_i\]
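As a sketch (a hypothetical data frame df with treatment Z, an indicator Long for receiving the long list, and the count Y), the interaction coefficient \(\beta_3\) is the quantity of interest:

```r
library(estimatr)

# the Z:Long interaction estimates the effect of Z on the sensitive item
lm_robust(Y ~ Z * Long, data = df)
```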
Note that here we looked at “hiders” – people not answering the direct question truthfully
See Li (2019) on bounds when the “no liars” assumption is threatened — this is about whether people respond truthfully to the list experimental question
Which effects are identified by the random assignment of \(X\)?
An obvious approach is to first examine the (average) effect of X on M1 and then use another manipulation to examine the (average) effect of M1 on Y.
Both instances of unobserved confounding between \(M\) and \(Y\):
Another somewhat obvious approach is to see how the effect of \(X\) on \(Y\) in a regression is reduced when you control for \(M\).
If the effect of \(X\) on \(Y\) passes through \(M\) then surely there should be no effect of \(X\) on \(Y\) after you control for \(M\).
This common strategy associated with Baron and Kenny (1986) is also not guaranteed to produce reliable results. See for instance Green, Ha, and Bullock (2010)
df <- fabricate(N = 1000,
U = rbinom(N, 1, .5), X = rbinom(N, 1, .5),
M = ifelse(U==1, X, 1-X), Y = ifelse(U==1, M, 1-M))
list(lm(Y ~ X, data = df),
     lm(Y ~ X + M, data = df)) |> texreg::htmlreg()

| | Model 1 | Model 2 |
|---|---|---|
| (Intercept) | 0.00*** | 0.00*** |
| (0.00) | (0.00) | |
| X | 1.00*** | 1.00*** |
| (0.00) | (0.00) | |
| M | 0.00 | |
| (0.00) | ||
| R2 | 1.00 | 1.00 |
| Adj. R2 | 1.00 | 1.00 |
| Num. obs. | 1000 | 1000 |
| ***p < 0.001; **p < 0.01; *p < 0.05 | ||
The bad news is that although a single experiment might identify the total effect, it can not identify these elements of the direct effect.
So:
Check formal requirement for identification under single experiment design (“sequential ignorability”—that, conditional on actual treatment, it is as if the value of the mediation variable is randomly assigned relative to potential outcomes). But this is strong (and in fact unverifiable) and if it does not hold, bounds on effects always include zero (Imai et al)
Consider sensitivity analyses
You can use interactions with covariates if you are willing to make assumptions on no heterogeneity of direct treatment effects over covariates.
For example: you think that money makes people get to work faster because they can buy better cars; you look at the marginal effect of more money on time to work for people with and without cars and find it higher for the latter.
This might imply mediation through transport, but only if there is no direct effect heterogeneity (e.g. people with cars are less motivated by money).
Weaker assumptions justify parallel design
Takeaway: Understanding mechanisms is harder than you think. Figure out what assumptions fly.
Let's imagine that sequential ignorability does not hold. What are our posteriors on mediation quantities when in fact all effects are mediated, effects are strong, and we have lots of data?
We imagine a true model and consider estimands:
truth <- make_model("X -> M ->Y") |>
set_parameters(c(.5, .5, .1, 0, .8, .1, .1, 0, .8, .1))
queries <-
list(
indirect = "Y[X = 1, M = M[X=1]] - Y[X = 1, M = M[X=0]]",
direct = "Y[X = 1, M = M[X=0]] - Y[X = 0, M = M[X=0]]"
)
truth |> query_model(queries) |> kable()

| label | query | given | using | case_level | mean | sd | cred.low | cred.high |
|---|---|---|---|---|---|---|---|---|
| indirect | Y[X = 1, M = M[X=1]] - Y[X = 1, M = M[X=0]] | - | parameters | FALSE | 0.64 | NA | 0.64 | 0.64 |
| direct | Y[X = 1, M = M[X=0]] - Y[X = 0, M = M[X=0]] | - | parameters | FALSE | 0.00 | NA | 0.00 | 0.00 |
Why such poor behavior? Why isn’t weight going onto indirect effects?
Turns out the data is consistent with direct effects only: specifically that whenever \(M\) is responsive to \(X\), \(Y\) is responsive to \(X\).
Spillovers can result in the estimation of weaker effects when effects are actually stronger.
The key problem is that \(Y(1)\) and \(Y(0)\) are not sufficient to describe potential outcomes
| Unit | Location | \(D_\emptyset\) | \(y(D_\emptyset)\) | \(D_1\) | \(y(D_1)\) | \(D_2\) | \(y(D_2)\) | \(D_3\) | \(y(D_3)\) | \(D_4\) | \(y(D_4)\) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1 | 0 | 0 | 1 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| B | 2 | 0 | 0 | 0 | 3 | 1 | 3 | 0 | 3 | 0 | 0 |
| C | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 3 | 0 | 3 |
| D | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 3 |
Table: Potential outcomes for four units for different treatment profiles. \(D_i\) is an allocation and \(y_j(D_i)\) is the potential outcome for (row) unit \(j\) given (column) \(D_i\).
Summary quantities for each allocation:

| | \(D_\emptyset\) | \(D_1\) | \(D_2\) | \(D_3\) | \(D_4\) |
|---|---|---|---|---|---|
| \(\bar{y}_\text{treated}\) | - | 3 | 3 | 3 | 3 |
| \(\bar{y}_\text{untreated}\) | 0 | 1 | 4/3 | 4/3 | 1 |
| \(\bar{y}_\text{neighbors}\) | - | 3 | 2 | 2 | 3 |
| \(\bar{y}_\text{pure control}\) | 0 | 0 | 0 | 0 | 0 |
| ATT-direct | - | 3 | 3 | 3 | 3 |
| ATT-indirect | - | 3 | 2 | 2 | 3 |
dgp <- function(i, Z, G) Z[i]/3 + sum(Z[G == G[i]])^2/5 + rnorm(1)
spillover_design <-
declare_model(G = add_level(N = 80),
j = add_level(N = 3, zeros = 0, ones = 1)) +
declare_inquiry(direct = mean(sapply(1:240, # just i treated v no one treated
function(i) { Z_i <- (1:240) == i
dgp(i, Z_i, G) - dgp(i, zeros, G)}))) +
declare_inquiry(indirect = mean(sapply(1:240,
function(i) { Z_i <- (1:240) == i # all but i treated v no one treated
dgp(i, ones - Z_i, G) - dgp(i, zeros, G)}))) +
declare_assignment(Z = complete_ra(N)) +
declare_measurement(
neighbors_treated = sapply(1:N, function(i) sum(Z[-i][G[-i] == G[i]])),
one_neighbor = as.numeric(neighbors_treated == 1),
two_neighbors = as.numeric(neighbors_treated == 2),
Y = sapply(1:N, function(i) dgp(i, Z, G))
) +
declare_estimator(Y ~ Z,
inquiry = "direct",
model = lm_robust,
label = "naive") +
declare_estimator(Y ~ Z * one_neighbor + Z * two_neighbors,
term = c("Z", "two_neighbors"),
inquiry = c("direct", "indirect"),
label = "saturated",
                    model = lm_robust)

You can in principle:
But to estimate effects you still need some SUTVA like assumption.
In this example, if one compared outcomes between treated units and all control units that are at least \(n\) positions away from a treated unit, you would get the wrong answer unless \(n \geq 7\).