| Day | Topic | Activity |
|---|---|---|
| 1: 1/9 | Getting started Outline | In-class exercise, intro to field experiment |
| 2: 1/16 | Doing experiments: Workflows; ethics | Presentation of plans, ethics application |
| 3: 1/30 | Causality: Foundations, Inquiries | Field reports |
| 4: 2/6 | Answer strategies: Estimation and Inference | Discussion of findings 1; Proposals 2 |
| 5: 2/20 | Data strategies and Design evaluation: DeclareDesign, Assignments, Design evaluation | Proposal refinements 2 |
| 6: 2/27 | Topics: Survey experiments, Spillovers, Downstream experimentation | Design declarations and ethics applications |
We hope to run three experiments:
We are going to do some experiments by hand. You choose assignment schemes and analysis plans.
You have been given an envelope with 20 cards. Each card has two numbers. One in black lettering (on white) and one in white lettering (on black). On the other side of the card there may or may not be writing.
Your job is to figure out whether, on average, the black numbers are larger than, smaller than, or the same as the white numbers.
The catch: when you turn over a card you are only allowed to read the black number or the white number (honor code!!). What's more: you have to decide before reading the card which number you will look at (though of course you can see the symbol on the card, if there is one, before reading the numbers).
To do: Access the complete procedure sheet, get a pack of cards, go!
We are collectively going to try to replicate an experimental design that was implemented in Germany.
We are free to modify this design as we wish. Our goal is to:
We are going to make mistakes and learn to work as a team.
AJPS
Why do native Europeans discriminate against Muslim immigrants? Can shared ideas between natives and immigrants reduce discrimination? We hypothesize that natives’ bias against Muslim immigrants is shaped by the belief that Muslims hold conservative attitudes about women’s rights and this ideational basis for discrimination is more pronounced among native women. We test this hypothesis in a large-scale field experiment conducted in 25 cities across Germany, during which 3,797 unknowing bystanders were exposed to brief social encounters with confederates who revealed their ideas regarding gender roles. We find significant discrimination against Muslim women, but this discrimination is eliminated when Muslim women signal that they hold progressive gender attitudes. Through an implicit association test and a follow-up survey among German adults, we further confirm the centrality of ideational stereotypes in structuring opposition to Muslims. Our findings have important implications for reducing conflict between native–immigrant communities in an era of increased cross-border migration.
Results
Summary:
Team
Despite decades of scholarship on protest effects, we know little about how bystanders—citizens who observe protests without participating—are affected by them. Understanding the impact of protest on bystanders is crucial as they constitute a growing audience whose latent support, normative beliefs, and concrete actions can make or break a movement’s broader societal impact. To credibly assess the effects of protests on observers, we design and implement a field experiment in Berlin in which we randomly route pedestrians past (treatment) or away from (control) three large-scale Fridays for Future (FFF) climate strikes. Using data gathered on protest days as well as through a one-month follow-up survey, we find evidence for a substantial increase in immediate donations to climate causes but no detectable impact on climate attitudes, vote intentions, or norm perceptions. Our findings challenge the prevailing assumption in both scholarship and public discourse that protest effects operate via impacts on public opinion and call for renewed theorizing that centers on observers’ immediate behavioral activation.
Results
Summary:
First best: if someone has access to your .qmd file they can hit render or compile and the whole thing reproduces first time. So: nothing local, everything relative; please do not include hardcoded paths to your computer.
But: often you need ancillary files for data and code. That's OK, but the aim should still be that with a self-contained folder someone can open a main.Rmd file, hit compile, and get everything. I usually have an input and an output subfolder.
… (git, osf, Dropbox, Drive, Nextcloud) and is never edited directly.

Cox & Reid (2000) define experiments as:
investigations in which an intervention, in all its essential elements, is under the control of the investigator.
Two types of control:
Experimental studies use research designs in which the researcher uses:
Let’s discuss:
Then: Deep dive into discussion of actual experiments
Then: Plans for our own
Model: a set of models of what causes what and how

Inquiry: a question stated in terms of the model

Data strategy: the set of procedures we use to gather information from the world (sampling, assignment, measurement)

Answer strategy: how we summarize the data produced by the data strategy

Design declaration is telling the computer (and readers) what M, I, D, and A are.
Design diagnosis is figuring out how the design will perform under imagined conditions.
Estimating “diagnosands” like power, bias, rmse, error rates, ethical harm, “amount learned”.
Diagnosis takes account of model uncertainty: it aims to identify models for which the design works well and models for which it does not
Redesign is the fine-tuning of features of the data- and answer strategies to understand how changing them affects the diagnosands
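To make this concrete, here is a minimal sketch of a declared and diagnosed two-arm design in DeclareDesign (the effect size, sample size, and estimator here are illustrative assumptions of mine, not part of the course materials):

```r
library(DeclareDesign)

design <-
  declare_model(
    N = 100,
    U = rnorm(N),
    potential_outcomes(Y ~ 0.2 * Z + U)                # M: model
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +         # I: inquiry
  declare_assignment(Z = complete_ra(N)) +             # D: assignment
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +    # D: measurement
  declare_estimator(Y ~ Z, inquiry = "ATE")            # A: answer strategy

diagnose_design(design)  # diagnosands: power, bias, rmse, coverage, ...
```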
Good questions studied well
Randomization of
There is no foundationless answer to this question.
Belmont principles commonly used for guidance:
Unfortunately, operationalizing these requires further ethical theories. (1) is often operationalized by informed consent (a very liberal idea); (2) and (3) sometimes by more utilitarian principles.
The major focus on (1) by IRBs might follow from the view that if subjects consent, then they endorse the ethical calculations made for 2 and 3 — they think that it is good and fair.
Trickiness: can a study be good or fair because of implications for non-subjects?
Many (many) field experiments have nothing like informed consent.
For example, whether the government builds a school in your village, whether an ad appears on your favorite radio show, and so on.
Consider three cases:
In all cases, there is no consent given by subjects.
In 2 and 3, the treatment is possibly harmful for subjects, and the results might also be harmful. But even in case 1, there could be major unintended harmful consequences.
In cases 1 and 3, however, the “intervention” is within the sphere of normal activities for the implementer.
Sometimes it is possible to use this point of difference to make a “spheres of ethics” argument for “embedded experimentation.”
Spheres of Ethics Argument: Experimental research that involves manipulations that are not normally appropriate for researchers may nevertheless be ethical if:
Otherwise keep focus on consent and desist if this is not possible
Political science researchers should respect autonomy, consider the wellbeing of participants and other people affected by their research, and be open about the ethical issues they face.
Political science researchers have an individual responsibility to consider the ethics of their research related activities and cannot outsource ethical reflection to review boards, other institutional bodies, or regulatory agencies.
These principles describe the standards of conduct and reflexive openness that are expected of political science researchers. … [In cases of reasonable deviations], researchers should acknowledge and justify deviations in scholarly publications and presentations of their work.
[Note: no general injunction against]
Researchers should generally avoid harm when possible, minimize harm when avoidance is not possible, and not conduct research when harm is excessive.
do not limit concern to physical and psychological risks to the participant.
cases in which research that produces impacts on political processes without consent of individuals directly engaged by the research might be appropriate. [examples]
Studies of interventions by third parties do not usually invoke this principle on impact. [details]
This principle is not intended to discourage any form of political engagement by political scientists in their non-research activities or private lives.
researchers should report likely impacts
Mentors, advisors, dissertation committee members, and instructors
Graduate programs in political science should include ethics instruction in their formal and informal graduate curricula;
Editors and reviewers should encourage researchers to be open about the ethical decisions …
Journals, departments, and associations should incorporate ethical commitments into their mission, bylaws, instruction, practices, and procedures.
Experimental researchers are deeply engaged in the movement towards more transparent social science research.
Contentious issues (mostly):
Data. How soon should you make your data available? My view: as soon as possible. Along with working papers and before publication. Before it affects policy in any case. Own the ideas, not the data.
Where should you make your data available? Dataverse is focal for political science. Not personal website (mea culpa)
What data should you make available? Disagreement is over how raw your data should be. My view: as raw as you can but at least post cleaning and pre-manipulation.
Experimental researchers are deeply engaged in the movement towards more transparent social science research.
Should you register? Hard to find reasons against, but the case is strongest in the testing phase rather than the exploratory phase.
Registration: When should you register? My view: Before treatment assignment. (Not just before analysis, mea culpa)
Registration: Should you deviate from a preanalysis plan if you change your mind about optimal estimation strategies? My view: Yes, but make the case and describe both sets of results.
File drawer bias (Publication bias)
Analysis bias (Fishing)
– Say in truth \(X\) affects \(Y\) in 50% of cases.
– Researchers conduct multiple excellent studies. But they only write up the 50% that produce “positive” results.
– Even if each individual study is indisputably correct, the account in the research record – that X affects Y in 100% of cases – will be wrong.
Exacerbated by:
– Publication bias – the positive results get published
– Citation bias – the positive results get read and cited
– Chatter bias – the positive results gets blogged, tweeted and TEDed.
– Say in truth \(X\) affects \(Y\) in 50% of cases.
– But say that researchers enjoy discretion to select measures for \(X\) or \(Y\), or enjoy discretion to select statistical models after seeing \(X\) and \(Y\) in each case.
– Then, with enough discretion, 100% of analyses may report positive effects, even if all studies get published.
– Try the Exact Fishy Test (https://macartan.shinyapps.io/fish/)
– What’s the problem with this test?
When your conclusions do not really depend on the data
Eg:
– some evidence will always support your proposition
– some interpretation of evidence will always support your proposition
Knowing the mapping from data to inference in advance gives a handle on the false positive rate.
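As a small illustration of that point, the sketch below (my own, with made-up numbers) simulates a world with no true effect and shows that choosing the better of two outcome measures after seeing the data roughly doubles the nominal 5% false positive rate:

```r
set.seed(42)

# No true effects: Z is unrelated to either of two candidate outcome measures.
# "Fishing" = reporting whichever of the two p values is smaller.
fish <- function(n = 100) {
  Z  <- rep(0:1, n / 2)
  p1 <- t.test(rnorm(n) ~ Z)$p.value
  p2 <- t.test(rnorm(n) ~ Z)$p.value
  c(prespecified = p1 < 0.05, fished = min(p1, p2) < 0.05)
}

rowMeans(replicate(2000, fish()))  # roughly 0.05 versus 0.10
```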
Source: Gerber and Malhotra
Implications are:
Summary: we do not know when we can or cannot trust claims made by researchers.
[Not a tradition specific claim]
Simple idea:
Lots of misunderstandings around registration
Fishing can happen in very subtle ways, and may seem natural and justifiable.
Example:
Our journal review process is largely organized around advising researchers how to adjust analysis in light of findings in the data.
Frequentists can do it
Bayesians can do it too.
Qualitative researchers can also do it.
You can even do it with descriptive statistics
The key distinction is between prospective and retrospective studies.
Not between experimental and observational studies.
A reason (from the medical literature) why registration is especially important for experiments: because you owe it to subjects
A reason why registration is less important for experiments: because it is more likely that the intended analysis is implied by the design in an experimental study. Researcher degrees of freedom may be greatest for observational qualitative analyses.
Registration will produce some burden but does not require the creation of content that is not needed anyway
It does shift preparation of analyses forward
And it can also increase the burden of developing analysis plans even for projects that don't work out. But that is, in part, the point.
Upside is that ultimate analyses may be much easier.
In neither case would the creation of a registration facility prevent exploration.
What it might do is make it less credible for someone to claim that they have tested a proposition when in fact the proposition was developed using the data used to test it.
Registration communicates whether researchers are engaged in exploration or not. We love exploration and should be proud of it.
Incentives and strategies
| Inquiry | In the preanalysis plan | In the paper | In the appendix |
|---|---|---|---|
| Gender effect | X | X | |
| Age effect | X | | |
| Inquiry | Following A from the PAP | Following A from the paper | Notes |
|---|---|---|---|
| Gender effect | estimate = 0.6, s.e = 0.31 | estimate = 0.6, s.e = 0.25 | Difference due to change in control variables [provide cross references to tables and code] |
Fundamental problems
Causation as difference making
The intervention based motivation for understanding causal effects:
The problem in 2 is that you need to know what would have happened if things were different. You need information on a counterfactual.
Idea: A causal claim is (in part) a claim about something that did not happen. This makes it metaphysical.
Now that we have a concept of causal effects available, let’s answer two questions:
TRANSITIVITY: If for a given unit \(A\) causes \(B\) and \(B\) causes \(C\), does that mean that \(A\) causes \(C\)?
A boulder is flying down a mountain. You duck. This saves your life.
So the boulder caused the ducking and the ducking caused you to survive.
So: did the boulder cause you to survive?
CONNECTEDNESS Say \(A\) causes \(B\) — does that mean that there is a spatiotemporally continuous sequence of causal intermediates?
The counterfactual model is about contribution and attribution in a very specific sense.
Consider an outcome \(Y\) that might depend on two causes \(X_1\) and \(X_2\):
\[Y(0,0) = 0\] \[Y(1,0) = 0\] \[Y(0,1) = 0\] \[Y(1,1) = 1\]
What caused \(Y\)? Which cause was most important?
The counterfactual model is about attribution in a very conditional sense.
This is a problem for research programs that define "explanation" in terms of figuring out the things that cause \(Y\).
Real difficulties conceptualizing what it means to say one cause is more important than another cause. What does that mean?
Erdogan’s increasing authoritarianism was the most important reason for the attempted coup
More uncomfortably:
What does it mean to say that the tides are caused by the moon? What exactly do we have to imagine…
Jack exploited Jill
It’s Jill’s fault that bucket fell
Jack is the most obstructionist member of Congress
Melania Trump stole from Michelle Obama’s speech
Activists need causal claims
This is sometimes called a “switching equation”
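For reference, the equation usually meant here is the standard potential outcomes switching equation (my restatement):

\[Y_i = Z_i Y_i(1) + (1 - Z_i) Y_i(0)\]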
In DeclareDesign \(Y\) is realised from potential outcomes and assignment in this way using reveal_outcomes
Say \(Z\) is a random variable, then this is a sort of data generating process. BUT the key thing to note is
Now for some magic. We really want to estimate: \[ \tau_i = Y_i(1) - Y_i(0)\]
BUT: We never can observe both \(Y_i(1)\) and \(Y_i(0)\)
Say we lower our sights and try to estimate an average treatment effect: \[ \tau = \mathbb{E} [Y(1)-Y(0)]\]
Now make use of the fact that \[\mathbb E[Y(1)-Y(0)] = \mathbb E[Y(1)]- \mathbb E [Y(0)] \]
In words: The average of differences is equal to the difference of averages.
The magic is that while we can't hope to measure the differences, we are good at measuring averages.
This provides a positive argument for causal inference from randomization, rather than simply saying with randomization “everything else is controlled for”
Let’s discuss:
Idea: random assignment is random sampling from potential worlds: to understand anything you find, you need to know the sampling weights
Idea: We now have a positive argument for claiming unbiased estimation of the average treatment effect following random assignment
But is the average treatment effect a quantity of social scientific interest?
The average of the differences \(\approx\) difference of averages
Question: \(\approx\) or \(=\)?
Consider the following potential outcomes table:
| Unit | Y(0) | Y(1) | \(\tau_i\) |
|---|---|---|---|
| 1 | 4 | 3 | |
| 2 | 2 | 3 | |
| 3 | 1 | 3 | |
| 4 | 1 | 3 | |
| 5 | 2 | 3 | |
Questions for us: What are the unit level treatment effects? What is the average treatment effect?
Consider the following potential outcomes table:
| In treatment? | Y(0) | Y(1) |
|---|---|---|
| Yes | | 2 |
| No | 3 | |
| No | 1 | |
| Yes | | 3 |
| Yes | | 3 |
| No | 2 | |
Questions for us: Fill in the blanks.
What is the actual treatment effect?
Take a short break!
Experiments often give rise to endogenous subgroups. The potential outcomes framework can make it clear why this can cause problems.
Problems arise in analyses of subgroups when the categories themselves are affected by treatment
Example from our work:
| | V(0) | V(1) | R(0,1) | R(1,1) | R(0,0) | R(1,0) |
|---|---|---|---|---|---|---|
| Type 1 (reporter) | 1 | 1 | 1 | 1 | 0 | 0 |
| Type 2 (non reporter) | 1 | 0 | 0 | 0 | 0 | 0 |
Expected reporting given violence in control = Pr(Type 1) (explanation: both types see violence but only Type 1 reports)
Expected reporting given violence in treatment = 100% (explanation: only Type 1 sees violence and this type also reports)
So you might infer a large effect on violence reporting.
Question: What is the actual effect of treatment on the propensity to report violence?
It is possible that in truth no one's reporting behavior has changed; what has changed is the propensity of people with different reporting propensities to experience violence:
| | Reporter | No Violence | Violence | % Report |
|---|---|---|---|---|
| Control | Yes | 25 | 25 | \(\frac{25}{25+25}=50\%\) |
| | No | 25 | 25 | |
| Treatment | Yes | 25 | 25 | \(\frac{25}{25+0}=100\%\) |
| | No | 50 | 0 | |
This problem can arise as easily in seemingly simple field experiments. Example:
What’s the problem?
Question for us:
Which problems face an endogenous subgroup issue?:
In such cases you can:
| Pair | I | I | II | II | |
|---|---|---|---|---|---|
| Unit | 1 | 2 | 3 | 4 | Average |
| Y(0) | 0 | 0 | 0 | 0 | |
| Y(1) | -3 | 1 | 1 | 1 | |
| \(\tau\) | -3 | 1 | 1 | 1 | 0 |
| Pair | I | I | II | II | |
|---|---|---|---|---|---|
| Unit | 1 | 2 | 3 | 4 | Average |
| Y(0) | 0 | 0 | 0 | ||
| Y(1) | 1 | 1 | 1 | ||
| \(\hat{\tau}\) | 1 |
| Pair | I | I | II | II | |
|---|---|---|---|---|---|
| Unit | 1 | 2 | 3 | 4 | Average |
| Y(0) | [0] | 0 | 0 | ||
| Y(1) | [-3] | 1 | 1 | ||
| \(\hat{\tau}\) | 1 |
Note: The right way to think about this is that bias is a property of the strategy over possible realizations of data and not normally a property of the estimator conditional on the data.
Multistage games can also present an endogenous group problem since collections of late stage players facing a given choice have been created by early stage players.
Question: Does visibility alter the extent to which subjects follow norms to punish antisocial behavior (and reward prosocial behavior)? Consider a trust game in which we are interested in how information on receivers affects their actions
Average % returned:

| Visibility Treatment | % invested (average) | ...when 10% invested | ...when 50% invested |
|---|---|---|---|
| Control: Masked information on respondents | 30% | 20% | 40% |
| Treatment: Full information on respondents | 30% | 0% | 60% |
What do we think? Does visibility make people react more to investments?
Imagine you could see all the potential outcomes, and they looked like this:
Responder's return decision (given type):

| Offered behavior | Nice 1 | Nice 2 | Nice 3 | Mean 1 | Mean 2 | Mean 3 | Avg. |
|---|---|---|---|---|---|---|---|
| Invest 10% | 60% | 60% | 60% | 0% | 0% | 0% | 30% |
| Invest 50% | 60% | 60% | 60% | 0% | 0% | 0% | 30% |
Conclusion: Both the offer and the information condition are completely irrelevant for all subjects.
Unfortunately you only see a sample of the potential outcomes, and that looks like this:
Responder's return decision (given type):

| Offered behavior | Nice 1 | Nice 2 | Nice 3 | Mean 1 | Mean 2 | Mean 3 | Avg. |
|---|---|---|---|---|---|---|---|
| Invest 10% | | | | 0% | 0% | 0% | 0% |
| Invest 50% | 60% | 60% | 60% | | | | 60% |
False Conclusion: When not protected, responders condition behavior strongly on offers (because offerers can select on type accurately)
In fact: The nice types invest more because they are nice. The responders return more to the nice types because they are nice.
Unfortunately you only see a (noisier!) sample of the potential outcomes, and that looks like this:
Responder's return decision (given type):

| Offered behavior | Nice 1 | Nice 2 | Nice 3 | Mean 1 | Mean 2 | Mean 3 | Avg. |
|---|---|---|---|---|---|---|---|
| Invest 10% | 60% | | | 0% | 0% | | 20% |
| Invest 50% | | 60% | 60% | | | 0% | 40% |
False Conclusion: When protected, responders condition behavior less strongly on offers (because offerers can select on type less accurately)
What to do?
Solutions?
Take away: Proceed with extreme caution when estimating effects beyond the first stage.
Take a short break!
Directed Acyclic Graphs
The most powerful results from the study of DAGs give procedures for figuring out when conditioning aids or hinders causal identification.
Pearl’s book Causality is the key reference. Pearl (2009) (Though see also older work such as Pearl and Paz (1985))
There is a lot of excellent material on Pearl’s page http://bayes.cs.ucla.edu/WHY/
See also excellent material on Felix Elwert’s page http://www.ssc.wisc.edu/~felwert/causality/?page_id=66
Say you don’t like graphs. Fine.
Consider this causal structure:
Say \(Z\) is temporally prior to \(X\); it is correlated with \(Y\) (because of \(U_1\)) and with \(X\) (because of \(U_2\)).
Question: Would it be useful to “control” for \(Z\) when trying to estimate the effect of \(X\) on \(Y\)?
Say you don’t like graphs. Fine.
Consider this causal structure:
Question: Would it be useful to “control” for \(Z\) when trying to estimate the effect of \(X\) on \(Y\)?
Answer: Hopefully by the end of today you should see that the answer is obviously (or at least, plausibly) “no.”
Variable sets \(A\) and \(B\) are conditionally independent, given \(C\) if for all \(a\), \(b\), \(c\):
\[\Pr(A = a | C = c) = \Pr(A = a | B = b, C = c)\]
Informally; given \(C\), knowing \(B\) tells you nothing more about \(A\).
Three elemental relations of conditional independence.
\(A\) and \(B\) are conditionally independent, given \(C\) if on every path between \(A\) and \(B\):
or
Notes:
Are \(A\) and \(D\) unconditionally independent?
Now: say we removed the arrow from \(X\) to \(Y\)
We now formalize things a little more:
The probability of outcome \(x\) can always be written in the form \[P(X_1 = x_1)P(X_2 = x_2|X_1=x_1)P(X_3 = x_3|X_1=x_1, X_2 = x_2)\dots\]
This can be done with any ordering of variables.
We want to describe the distribution in a simpler way that takes account of parent-child relations and that can be used to capture interventions.
\[P: P(x_1,x_2,\dots x_n) = \prod_{j}P(x_j|pa_j)\]
The distribution under an intervention \(do(X_i=x_i')\) is: \[P_{\hat{x}_i}: P(x_1,x_2,\dots x_n|\hat{x}_i) = \mathbb{I}(x_i = x_i')\prod_{-i}P(x_j|pa_j)\]
where the parents \(pa_j\) are the parents on the graph (parents of \(x_i\) are the nodes with arrows pointing into \(x_i\).)
So:
\[P_{\hat{x}_i}: P(x_1,x_2,\dots x_n|\hat{x}_i) = \mathbb{I}(x_i = x_i')\prod_{-i}P(x_j|pa_j)\]
This means that there is only probability mass on vectors in which \(x_i = x_i'\) (reflecting the success of control) and all other variables are determined by their parents, given the values that have been set for \(x_i\).
Such expressions will be critical later when we want to consider identification.
They let us assess whether the probability of an outcome \(y\), say, depends on the value of some other node, given some other node, or given interventions on some other node.
Illustration: say binary \(X\) causes binary \(M\) which causes binary \(Y\), and say we intervene and set \(M=1\). Then what is the distribution of \((x,m,y)\)?
It is:
\[\Pr(x,m,y) = \Pr(x)\mathbb I(M = 1)\Pr(y|m)\]
We will use these ideas to motivate a general procedure for learning about, updating over, and querying, causal models.
A “causal model” is:
1. Variables: an ordered list of endogenous nodes \(\mathcal{V}\) and exogenous nodes \(\Theta\)
2. A list of \(n\) functions \(\mathcal{F}= (f^1, f^2,\dots, f^n)\), one for each element of \(\mathcal{V}\), such that each \(f^i\) takes as arguments \(\theta^i\) as well as elements of \(\mathcal{V}\) that are prior to \(V^i\) in the ordering
3. A probability distribution over \(\Theta\)
A simple causal model in which high inequality (\(I\)) affects democratization (\(D\)) via redistributive demands (\(R\)) and mass mobilization (\(M\)), which is also a function of ethnic homogeneity (\(E\)). Arrows show relations of causal dependence between variables.
Learning about effects given a model means learning about \(F\) and also the distribution of shocks (\(\Theta\)).
For discrete data this can be reduced to a question about learning about the distribution of \(\Theta\) only.
For instance the simplest model consistent with \(X \rightarrow Y\):
Endogenous Nodes = \(\{X, Y\}\), both with range \(\{0,1\}\)
Exogenous Nodes = \(\{\theta^X, \theta^Y\}\), with ranges \(\{\theta^X_0, \theta^X_1\}\) and \(\{\theta^Y_{00}, \theta^Y_{01}, \theta^Y_{10}, \theta^Y_{11}\}\)
Functional equations:
Distributions on \(\Theta\): \(\Pr(\theta^i = \theta^i_k) = \lambda^i_k\)
What is the probability that \(X\) has a positive causal effect on \(Y\)?
This is equivalent to: \(\Pr(\theta^Y =\theta^Y_{01}) = \lambda^Y_{01}\)
So we want to learn about the distributions of the exogenous nodes
This general principle extends to a vast class of causal models
Estimation and testing
Say that units are randomly assigned to treatment in different strata (maybe just one), with fixed, though possibly different, shares assigned in each stratum. Then the key estimands and estimators are:
| Estimand | Estimator |
|---|---|
| \(\tau_{ATE} \equiv \mathbb{E}[\tau_i]\) | \(\widehat{\tau}_{ATE} = \sum\nolimits_{x} \frac{w_x}{\sum\nolimits_{j}w_{j}}\widehat{\tau}_x\) |
| \(\tau_{ATT} \equiv \mathbb{E}[\tau_i | Z_i = 1]\) | \(\widehat{\tau}_{ATT} = \sum\nolimits_{x} \frac{p_xw_x}{\sum\nolimits_{j}p_jw_j}\widehat{\tau}_x\) |
| \(\tau_{ATC} \equiv \mathbb{E}[\tau_i | Z_i = 0]\) | \(\widehat{\tau}_{ATC} = \sum\nolimits_{x} \frac{(1-p_x)w_x}{\sum\nolimits_{j}(1-p_j)w_j}\widehat{\tau}_x\) |
where \(x\) indexes strata, \(p_x\) is the share of units in each stratum that is treated, and \(w_x\) is the size of a stratum.
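A minimal sketch of how these formulas combine stratum-level estimates (the numbers are invented for illustration):

```r
# Two strata with within-stratum difference-in-means estimates tau_hat,
# stratum sizes w, and treated shares p (all made-up values)
tau_hat <- c(0.5, 1.0)
w       <- c(600, 400)
p       <- c(0.2, 0.5)

c(ATE = weighted.mean(tau_hat, w),            # weight by stratum size
  ATT = weighted.mean(tau_hat, p * w),        # weight by number treated
  ATC = weighted.mean(tau_hat, (1 - p) * w))  # weight by number in control
```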
In addition, each of these can be targets of interest:
And for different subgroups,
The CATEs are conditional average treatment effects, for example the effect for men or for women. These are straightforward.
However we might also imagine conditioning on unobservable or counterfactual features.
\[LATE = \frac{1}{|C|}\sum_{j\in C}(Y_j(X=1) - Y_j(X=0))\] \[C:=\{j:X_j(Z=1) > X_j(Z=0) \}\]
We will return to these in the study of instrumental variables.
Other ways to condition on potential outcomes:
Many inquiries are averages of individual effects, even if the groups are not known, but they do not have to be:
Inquiries might relate to distributional quantities such as:
You might even be interested in \(\min(Y_i(1) - Y_i(0))\).
There are lots of interesting “spillover” estimands.
Imagine there are three individuals and each person's outcome depends on the assignments of all others. For instance \(Y_1(Z_1, Z_2, Z_3)\), or more generally, \(Y_i(Z_i, Z_{i+1 (\text{mod }3)}, Z_{i+2 (\text{mod }3)})\).
Then three estimands might be:
Interpret these. What others might be of interest?
A difference in CATEs is a well defined estimand that might involve interventions on one node only:
It captures differences in effects.
An interaction is an effect on an effect:
Note in the latter the expectation is taken over the whole population.
Say \(X\) can affect \(Y\) directly, or indirectly through \(M\). Then we can write potential outcomes as:
We can then imagine inquiries of the form:
Interpret these. What others might be of interest?
Again we might imagine that these are defined with respect to some group:
Here, among those for whom \(X\) has a positive effect on \(Y\), for what share would there be a positive effect if \(M\) were fixed at 1?
In qualitative research a particularly common inquiry is "did \(X=1\) cause \(Y=1\)?"
This is often given as a probability, the “probability of causation” (though at the case level we might better think of this probability as an estimate rather than an estimand):
\[\Pr(Y_i(0) = 0 | Y_i(1) = 1, X = 1)\]
Intuition: What’s the probability \(X=1\) caused \(Y=1\) in an \(X=1, Y=1\) case drawn from a large population with the following experimental distribution:
| Y=0 | Y=1 | All | |
|---|---|---|---|
| X=0 | 1 | 0 | 1 |
| X=1 | 0.25 | 0.75 | 1 |
Intuition: What’s the probability \(X=1\) caused \(Y=1\) in an \(X=1, Y=1\) case drawn from a large population with the following experimental distribution:
| Y=0 | Y=1 | All | |
|---|---|---|---|
| X=0 | 0.75 | 0.25 | 1 |
| X=1 | 0.25 | 0.75 | 1 |
Other inquiries focus on distinguishing between causes.
For the Billy Suzy problem (Hall 2004), Halpern (2016) focuses on “actual causation” as a way to distinguish between Suzy and Billy:
Imagine Suzy and Billy, simultaneously throwing stones at a bottle. Both are excellent shots and hit whatever they aim at. Suzy’s stone hits first, knocks over the bottle, and the bottle breaks. However, Billy’s stone would have hit had Suzy’s not hit, and again the bottle would have broken. Did Suzy’s throw cause the bottle to break? Did Billy’s?
Actual Causation:
An inquiry: for what share in a population is a possible cause an actual cause?
Pearl (e.g. Pearl and Mackenzie (2018)) describes three types of inquiry:
| Level | Activity | Inquiry |
|---|---|---|
| Association | “Seeing” | If I see \(X=1\) should I expect \(Y=1\)? |
| Intervention | “Doing” | If I set \(X\) to \(1\) should I expect \(Y=1\)? |
| Counterfactual | “Imagining” | If \(X\) were \(0\) instead of 1, would \(Y\) then be \(0\) instead of \(1\)? |
We can understand these as asking different types of questions about a causal model
| Level | Activity | Inquiry |
|---|---|---|
| Association | “Seeing” | \(\Pr(Y=1|X=1)\) |
| Intervention | “Doing” | \(\mathbb{E}[\mathbb{I}(Y(1)=1)]\) |
| Counterfactual | “Imagining” | \(\Pr(Y(1)=1 \& Y(0)=0)\) |
The third is qualitatively different because it requires information about two mutually incompatible conditions for units. This is not (generally) recoverable directly from knowledge of \(\Pr(Y(1)=1)\) and \(\Pr(Y(0)=0)\).
Given a causal model over nodes with discrete ranges, inquiries can generally be described as summaries of the distributions of exogenous nodes.
We already saw two instances of this:
What it is. When you have it. What it’s worth.
Informally a quantity is “identified” if it can be “recovered” once you have enough data.
Say for example the average wage is \(x\) in some very large population. If I gather lots and lots of data on the wages of individuals and take the average, then my estimate will ultimately let me figure out \(x\).
Identifiability Let \(Q(M)\) be a query defined over a class of models \(\mathcal M\); then \(Q\) is identifiable if \(P(M_1) = P(M_2) \rightarrow Q(M_1) = Q(M_2)\).
Identifiability with constrained data Let \(Q(M)\) be a query defined over a class of models \(\mathcal M\); then \(Q\) is identifiable from features \(F(M)\) if \(F(M_1) = F(M_2) \rightarrow Q(M_1) = Q(M_2)\).
Based on Defn 3.2.3 in Pearl.
Informally a quantity is “identified” if it can be “recovered” once you have enough data.
Then with very large (experimental) data we observe:
| Y = 0 | Y = 1 | |
|---|---|---|
| X = 0 | \(\alpha_{00} \rightarrow b/2 + c/2\) | \(\alpha_{01} \rightarrow a/2 + d/2\) |
| X = 1 | \(\alpha_{10} \rightarrow a/2 + c/2\) | \(\alpha_{11} \rightarrow b/2 + d/2\) |
What quantities are identified?
What if we:
| Y = 0 | Y = 1 | |
|---|---|---|
| X = 0 | \(\alpha_{00} \rightarrow b/2 + c/2\) | \(\alpha_{01} \rightarrow a/2 + d/2\) |
| X = 1 | \(\alpha_{10} \rightarrow a/2 + c/2\) | \(\alpha_{11} \rightarrow b/2 + d/2\) |
What quantities are now identified?
Our goal in causal inference is to estimate quantities such as:
\[\Pr(Y|\hat{x})\]
where \(\hat{x}\) is interpreted as \(X\) set to \(x\) by “external” control. Equivalently: \(do(X=x)\) or sometimes \(X \leftarrow x\).
If this quantity is identifiable then we can recover it with infinite data.
If it is not identifiable, then, even in the best case, we are not guaranteed to get the right answer.
Are there general rules for determining whether this quantity can be identified? Yes.
Note first, identifying
\[\Pr(Y|x)\]
is easy.
But we are not always interested in identifying the distribution of \(Y\) given observed values of \(x\), but rather, the distribution of \(Y\) if \(X\) is set to \(x\).
If we can identify the controlled distribution we can calculate other causal quantities of interest.
For example for a binary \(X, Y\) the causal effect of \(X\) on the probability that \(Y=1\) is:
\[\Pr(Y=1|\hat{x}=1) - \Pr(Y=1|\hat{x}=0)\]
Again, this is not the same as:
\[\Pr(Y=1|x=1) - \Pr(Y=1|x=0)\]
It’s the difference between seeing and doing.
The key idea is that you want to find a set of variables such that when you condition on these you get what you would get if you used a do operation.
Intuition:
The backdoor criterion is satisfied by \(Z\) (relative to \(X\), \(Y\)) if:
In that case you can identify the effect of \(X\) on \(Y\) by conditioning on \(Z\):
\[P(Y=y | \hat{x}) = \sum_z P(Y=y| X = x, Z=z)P(z)\] (This is eqn 3.19 in Pearl (2000))
\[P(Y=y | \hat{x}) = \sum_z P(Y=y| X = x, Z=z)P(z)\]
\[P(Y=y | \hat{x}) - P(Y=y | \hat{x}')\]
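A small simulation (my own; the data generating process is invented) showing the backdoor adjustment formula in action for a single binary confounder \(Z\):

```r
set.seed(1)
n <- 1e5
Z <- rbinom(n, 1, 0.5)                       # confounder
X <- rbinom(n, 1, 0.2 + 0.6 * Z)             # Z affects X
Y <- rbinom(n, 1, 0.1 + 0.3 * X + 0.4 * Z)   # true effect of X is +0.3

# P(Y = 1 | do(X = x)) = sum_z P(Y = 1 | X = x, Z = z) P(Z = z)
p_do <- function(x)
  sum(sapply(0:1, function(z) mean(Y[X == x & Z == z]) * mean(Z == z)))

p_do(1) - p_do(0)                  # close to 0.3
mean(Y[X == 1]) - mean(Y[X == 0])  # naive difference, biased upwards
```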
Following Pearl (2009), Chapter 11. Let \(T\) denote the set of parents of \(X\): \(T := pa(X)\), with (possibly vector valued) realizations \(t\). These might not all be observed.
If the backdoor criterion is satisfied, we have:
Key idea: The intervention level relates to the observational level as follows: \[p(y|\hat{x}) = \sum_{t\in T} p(t)p(y|x, t)\]
Think of this as fully accounting for the (possibly unobserved) causes of \(X\), \(T\)
We want to get to:
\[p(y|\hat{x}) = \sum_{t\in T} p(t)p(y|x, t)\]
We bring \(Z\) into the picture by writing:
\[p(y|\hat{x}) = \sum_{t\in T} p(t) \sum_z p(y|x, t, z)p(z|x, t)\]
now we want to get rid of \(T\)…
Using the two conditions from the backdoor definition above:
This gives: \[p(y|\hat x) = \sum_{t \in T} p(t) \sum_z p(y|x, z)p(z|t)\]
Cleaning up, we can get rid of \(T\):
\[p(y|\hat{x}) = \sum_z p(y|x, z)\sum_{t\in T} p(z|t)p(t) = \sum_z p(y| x, z)p(z)\]
For intuition:
We would be happy if we could condition on the parent \(T\), but \(T\) is not observed. However we can use \(Z\) instead making use of the fact that:
See Shpitser, VanderWeele, and Robins (2012)
The adjustment criterion is satisfied by \(Z\) (relative to \(X\), \(Y\)) if:
Note:
Here \(Z\) satisfies the adjustment criterion but not the backdoor criterion:
\(Z\) is descendant of \(X\) but it is not a descendant of a node on a path from \(X\) to \(Y\). No harm adjusting for \(Z\) here, but not necessary either.
Consider this DAG:
Why?
If:
Then \(\Pr(y| \hat x)\) is identifiable and given by:
\[\Pr(y| \hat x) = \sum_m\Pr(m|x)\sum_{x'}\left(\Pr(y|m,x')\Pr(x')\right)\]
We want to get \(\Pr(y | \hat x)\)
From the graph the joint distribution of variables is:
\[\Pr(x,m,y,u) = \Pr(u)\Pr(x|u)\Pr(m|x)\Pr(y|m,u)\] If we intervened on \(X\) we would have (\(\Pr(X = x |u)=1\)):
\[\Pr(m,y,u | \hat x) = \Pr(u)\Pr(m|x)\Pr(y|m,u)\] If we sum up over \(u\) and \(m\) we get:
\[\Pr(m,y| \hat x) = \Pr(m|x)\sum_u\left(\Pr(y|m,u)\Pr(u)\right)\] \[\Pr(y| \hat x) = \sum_m\Pr(m|x)\sum_u\left(\Pr(y|m,u)\Pr(u)\right)\]
The first part is fine; the second part however involves \(u\) which is unobserved. So we need to get the \(u\) out of \(\sum_u\left(\Pr(y|m,u)\Pr(u)\right)\).
Now, from the graph:
1. \(U\) is d-separated from \(M\) given \(X\): \[\Pr(u|m, x) = \Pr(u|x)\]
2. \(X\) is d-separated from \(Y\) by \(M\), \(U\): \[\Pr(y|x, m, u) = \Pr(y|m,u)\]

That's enough to get \(u\) out of \(\sum_u\left(\Pr(y|m,u)\Pr(u)\right)\).
\[\sum_u\left(\Pr(y|m,u)\Pr(u)\right) = \sum_x\sum_u\left(\Pr(y|m,u)\Pr(u|x)\Pr(x)\right)\]
Using the 2 equalities we got from the graph:
\[\sum_u\left(\Pr(y|m,u)\Pr(u)\right) = \sum_x\sum_u\left(\Pr(y|x,m,u)\Pr(u|x,m)\Pr(x)\right)\]
So:
\[\sum_u\left(\Pr(y|m,u)\Pr(u)\right) = \sum_x\left(\Pr(y|m,x)\Pr(x)\right)\]
Intuitively: \(X\) blocks the back door between \(M\) and \(Y\) just as well as \(U\) does.
Substituting we are left with:
\[\Pr(y| \hat x) = \sum_m\Pr(m|x)\sum_{x'}\left(\Pr(y|m,x')\Pr(x')\right)\]
(The \('\) is to distinguish the \(x\) in the summation from the value of \(x\) of interest)
It’s interesting that \(x\) remains in the right hand side in the calculation of the \(m \rightarrow y\) effect, but this is because \(x\) blocks a backdoor from \(m\) to \(y\)
Bringing all this together into a claim we have:
If:
Then \(\Pr(y| \hat x)\) is identifiable and given by:
\[\Pr(y| \hat x) = \sum_m\Pr(m|x)\sum_{x'}\left(\Pr(y|m,x')\Pr(x')\right)\]
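To see the formula at work, here is a simulation sketch of mine (the functional forms are invented) in which \(U\) confounds \(X\) and \(Y\) but \(X\) affects \(Y\) only through \(M\):

```r
set.seed(2)
n <- 1e6
U <- rbinom(n, 1, 0.5)                      # unobserved confounder
X <- rbinom(n, 1, 0.2 + 0.6 * U)
M <- rbinom(n, 1, 0.1 + 0.6 * X)            # X -> M
Y <- rbinom(n, 1, 0.1 + 0.5 * M + 0.3 * U)  # M -> Y, U -> Y, no direct X -> Y

# P(Y = 1 | do(X = x)) = sum_m P(m | x) sum_x' P(Y = 1 | m, x') P(x')
p_do <- function(x)
  sum(sapply(0:1, function(m)
    mean(M[X == x] == m) *
      sum(sapply(0:1, function(xp)
        mean(Y[M == m & X == xp]) * mean(X == xp)))))

p_do(1) - p_do(0)                  # close to 0.6 * 0.5 = 0.3
mean(Y[X == 1]) - mean(Y[X == 0])  # naive difference, biased by U
```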
There is a package (Textor et al. 2016) for figuring out what to condition on.
Define a dag using dagitty syntax:
There is then a simple command to check whether two sets are d-separated by a third set:
And a simple command to identify the adjustments needed to identify the effect of one variable on another:
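A hedged illustration of that workflow (this particular DAG is my own toy example):

```r
library(dagitty)

g <- dagitty("dag{ Z -> X -> Y ; Z -> Y }")

# Are X and Y d-separated given Z? (No: the X -> Y edge is open)
dseparated(g, "X", "Y", "Z")

# What must we adjust for to identify the effect of X on Y?
adjustmentSets(g, exposure = "X", outcome = "Y")  # { Z }
```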
Example where \(Z\) is correlated with \(X\) and \(Y\) and is a confounder
Example where \(Z\) is correlated with \(X\) and \(Y\) but it is not a confounder
But controlling can also cause problems. In fact conditioning on a temporally pre-treatment variable could cause problems. Who’d have thunk? Here is an example from Pearl (2005):
U1 <- rnorm(10000); U2 <- rnorm(10000)
Z <- U1+U2
X <- U2 + rnorm(10000)/2
Y <- U1*2 + X
lm_robust(Y ~ X) |> tidy() |> kable(digits = 2)

| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | -0.02 | 0.02 | -1.21 | 0.23 | -0.06 | 0.01 | 9998 | Y |
| X | 1.02 | 0.02 | 56.52 | 0.00 | 0.98 | 1.05 | 9998 | Y |
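The next table presumably comes from the same regression with \(Z\) added as a control; a reconstruction of the call (not shown in the original) would be something like:

```r
lm_robust(Y ~ X + Z) |> tidy() |> kable(digits = 2)
```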
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | -0.01 | 0.01 | -1.13 | 0.26 | -0.03 | 0.01 | 9997 | Y |
| X | -0.34 | 0.01 | -34.98 | 0.00 | -0.36 | -0.32 | 9997 | Y |
| Z | 1.67 | 0.01 | 220.37 | 0.00 | 1.65 | 1.68 | 9997 | Y |
g <- dagitty("dag{U1 -> Z ; U1 -> y ; U2 -> Z ; U2 -> x -> y}")
adjustmentSets(g, exposure = "x", outcome = "y")

{}
[1] FALSE
[1] TRUE
Which means, no need to condition on anything.
A bind: from Pearl 1995.
For a solution for a class of related problems see Robins, Hernan, and Brumback (2000)
g <- dagitty("dag{U1 -> Z ; U1 -> y ;
U2 -> Z ; U2 -> x -> y;
Z -> x}")
adjustmentSets(g, exposure = "x", outcome = "y")

{ U1 }
{ U2, Z }
which means you have to adjust on an unobservable. Here we double-check whether including or not including \(Z\) is enough:
So we cannot identify the effect here. But can we still learn about it?
See topics for Controls and doubly robust estimation
Unbiased estimates of the (sample) average treatment effect can be obtained (whether or not there is imbalance on covariates) using:
\[ \widehat{ATE} = \frac{1}{n_T}\sum_TY_i - \frac{1}{n_C}\sum_CY_i, \]
df <- fabricatr::fabricate(N = 100, Z = rep(0:1, N/2), Y = rnorm(N) + Z)
# by hand
df |>
summarize(Y1 = mean(Y[Z==1]),
Y0 = mean(Y[Z==0]),
            diff = Y1 - Y0) |> kable(digits = 2)

| Y1 | Y0 | diff |
|---|---|---|
| 1.07 | -0.28 | 1.35 |
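The next table matches the design-based estimate that estimatr's difference_in_means would give; the call (not shown in the original) is presumably something like:

```r
estimatr::difference_in_means(Y ~ Z, data = df) |> tidy() |> kable(digits = 2)
```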
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| Z | 1.35 | 0.17 | 7.94 | 0 | 1.01 | 1.68 | 97.98 | Y |
We can also do this with regression:
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | -0.28 | 0.12 | -2.33 | 0.02 | -0.51 | -0.04 | 98 | Y |
| Z | 1.35 | 0.17 | 7.94 | 0.00 | 1.01 | 1.68 | 98 | Y |
See Freedman (2008) on why regression is fine here
Say now different strata or blocks \(\mathcal{S}\) had different assignment probabilities. Then you could estimate:
\[ \widehat{ATE} = \sum_{S\in \mathcal{S}}\frac{n_{S}}{n} \left(\frac{1}{n_{S1}}\sum_{S\cap T}y_i - \frac{1}{n_{S0}}\sum_{S\cap C}y_i \right) \]
Note: you cannot just ignore the blocks because assignment is no longer independent of potential outcomes: you might be sampling units with different potential outcomes with different probabilities.
However, the formula above works fine because selecting is random conditional on blocks.
As a DAG this is just classic confounding:
Data with heterogeneous assignments:
True effect is 0.5, but:
Averaging over effects in blocks
# by hand
estimates <-
df |>
group_by(X) |>
summarize(Y1 = mean(Y[Z==1]),
Y0 = mean(Y[Z==0]),
diff = Y1 - Y0,
W = n())
estimates$diff |> weighted.mean(estimates$W)

[1] 0.7236939
# with estimatr
estimatr::difference_in_means(Y ~ Z, blocks = X, data = df) |>
  tidy() |> kable(digits = 2)

| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| Z | 0.72 | 0.11 | 6.66 | 0 | 0.51 | 0.94 | 496 | Y |
This also corresponds to the difference in the weighted average of treatment outcomes (with weights given by the inverse of the probability that each unit is assigned to treatment) and control outcomes (with weights given by the inverse of the probability that each unit is assigned to control).
# by hand
df |>
summarize(Y1 = weighted.mean(Y[Z==1], ip[Z==1]),
Y0 = weighted.mean(Y[Z==0], ip[Z==0]), # note !
diff = Y1 - Y0)|>
  kable(digits = 2)

| Y1 | Y0 | diff |
|---|---|---|
| 0.59 | -0.15 | 0.74 |
# with estimatr
estimatr::difference_in_means(Y ~ Z, weights = ip, data = df) |>
  tidy() |> kable(digits = 2)

| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| Z | 0.74 | 0.11 | 6.65 | 0 | 0.52 | 0.96 | 498 | Y |
But inverse propensity weighting is a more general principle, which can be used even if you do not have blocks.
The intuition for it comes straight from sampling weights — you weight up in order to recover an unbiased estimate of the potential outcomes for all units, whether or not they are assigned to treatment.
With sampling weights, however, you can include units even if their sampling weight is 1. Why can you not include such units when doing inverse propensity weighting?
Say you made a mess and used a randomization that was correlated with some variable, \(U\). For example:
Bad assignment, some randomization process you can’t understand (but can replicate) that results in unequal probabilities.
The result is a sampling distribution that is not centered on the true effect (0).
To fix you can estimate the assignment probabilities by replicating the assignment many times:
and then use these assignment probabilities in your estimator
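A sketch of that procedure (the assignment function name here is hypothetical, standing in for whatever opaque-but-replicable process was used):

```r
# Replicate the (opaque but replicable) assignment many times to estimate
# each unit's probability of treatment
assignments  <- replicate(10000, my_weird_assignment(N))  # hypothetical function
prob_treated <- rowMeans(assignments)                     # estimated Pr(Z_i = 1)

# Inverse propensity weights for the realized assignment in df
df$ip <- ifelse(df$Z == 1, 1 / prob_treated, 1 / (1 - prob_treated))

estimatr::difference_in_means(Y ~ Z, weights = ip, data = df)
```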
Implied weights
Improved results
This example is surprising but it helps you see the logic of why inverse weighting gets unbiased estimates (and why that might not guarantee a reasonable answer)
Imagine there is one unit with potential outcomes \(Y(1) = 2, Y(0) = 1\). So the unit level treatment effect is 1.
You toss a coin: if heads you assign the unit to treatment and your estimate is \(Y(1)/0.5 = 4\); if tails you assign it to control and your estimate is \(-Y(0)/0.5 = -2\).
So your expected estimate is: \[0.5 \times 4 + 0.5 \times (-2) = 1\]
Great on average but always lousy
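A quick check of this example by simulation (my own sketch):

```r
set.seed(3)
Y1 <- 2; Y0 <- 1                       # the single unit's potential outcomes
Z  <- rbinom(1e5, 1, 0.5)              # many replications of the coin toss
estimates <- Z * Y1 / 0.5 - (1 - Z) * Y0 / 0.5

mean(estimates)   # approximately 1: unbiased
table(estimates)  # but every single estimate is either 4 or -2
```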
\[\hat{\overline{Y_1}} = \frac{1}n\left(\sum_i \frac{Z_iY_i(1)}{\pi_i}\right)\] With independent assignment the expected value of \(\hat{\overline{Y_1}}\) is just:
\[\mathbb{E}[\hat{\overline{Y_1}}] =\frac1n\left( \left(\pi_1 \frac{1\times Y_1(1)}{\pi_1} + (1-\pi_1) \frac{0\times Y_1(1)}{\pi_1}\right) + \left(\pi_2 \frac{1\times Y_2(1)}{\pi_2} + (1-\pi_2) \frac{0\times Y_2(1)}{\pi_2}\right) + \dots\right)\]
\[\mathbb{E}[\hat{\overline{Y_1}}] =\frac1n\left( Y_1(1) + Y_2(1) + \dots\right) = \overline{Y_1}\]
and similarly for \(\mathbb{E}[\hat{\overline{Y_0}}]\) and so using linearity of expectations:
\[\mathbb{E}[\widehat{\overline{Y_1 - Y_0}}] = \overline{Y_1 - Y_0}\]
Let's talk about "inference"
In classical statistics we characterize our uncertainty over an estimate using an estimate of variance of the sampling distribution of the estimator.
The key idea is that we want to be able to say how likely we were to get such an estimate if the distribution of estimates associated with our design looked a given way.
More specifically we want to estimate “standard error” or the “standard deviation of the sampling distribution”
(See Wooldridge (2023), where the standard error is understood as the "estimate of the standard deviation of the sampling distribution")
Given:
The variance of the estimator over \(n\) repeated 'runs' of a design is: \(Var(\hat{\tau}) = \frac{1}{n}\sum_i(\hat\tau_i - \overline{\hat\tau})^2\)
And the standard error is:
\(se(\hat{\tau}) = \sqrt{\frac{1}n\sum_i(\hat\tau_i - \overline{\hat\tau_i})^2}\)
If we have a good measure for the shape of the sampling distribution we can start to make statements of the form:
If the sampling distribution is roughly normal, as it may be with large samples, then we can use procedures such as: “there is a 5% probability that an estimate would be more than 1.96 standard errors away from the mean of the sampling distribution”
Key idea: you can estimate the variance straight from the data, given knowledge of the assignment process and assuming well-defined potential outcomes.
Recall in general \(Var(x) = \frac{1}{n}\sum_i(x_i - \overline{x})^2\). Here the \(x_i\)s are the treatment effect estimates we might get under different random assignments, \(n\) is the number of different assignments (assumed here all equally likely, though otherwise we can weight), and \(\overline{x}\) is the truth.
For intuition imagine we have just two units \(A\), \(B\), with potential outcomes \(A_1\), \(A_0\), \(B_1\), \(B_0\).
When there are two units with outcomes \(x_1, x_2\), the variance simplifies like this:
\[Var(x) = \frac{1}2\left(x_1 - \frac{x_1 + x_2}{2}\right)^2 + \frac{1}2\left(x_2 - \frac{x_1 + x_2}{2}\right)^2 = \left(\frac{x_1 - x_2}{2}\right)^2\]
In the two unit case the two possible treatment estimates are: \(\hat{\tau}_1=A_1 - B_0\) and \(\hat{\tau}_2=B_1 - A_0\), depending on what gets put into treatment. So the variance is:
\[Var(\hat{\tau}) = \left(\frac{\hat{\tau}_1 - \hat{\tau}_2}{2}\right)^2 = \left(\frac{(A_1 - B_0) - (B_1 - A_0)}{2}\right)^2 =\left(\frac{(A_1 - B_1) + (A_0 - B_0)}{2}\right)^2 \] which we can re-write as:
\[Var(\hat{\tau}) = \left(\frac{A_1 - B_1}{2}\right)^2 + \left(\frac{A_0 - B_0}{2}\right)^2+ 2\frac{(A_1 - B_1)(A_0-B_0)}{4}\] The first two terms correspond to the variance of \(Y(1)\) and the variance of \(Y(0)\). The last term is a bit pesky though; it corresponds to twice the covariance of \(Y(1)\) and \(Y(0)\).
How can we go about estimating this?
\[Var(\hat{\tau}) = \left(\frac{A_1 - B_1}{2}\right)^2 + \left(\frac{A_0 - B_0}{2}\right)^2+ 2\frac{(A_1 - B_1)(A_0-B_0)}{4}\]
In the two unit case it is quite challenging because we do not have an estimate for any of the three terms: we do not have an estimate for the variance in the treatment group or in the control group because we have only one observation in each case; and we do not have an estimate for the covariance because we don’t observe both potential outcomes for any case.
Things do look a bit better however with more units…
From Freedman Prop 1 / Example 1 (using combinatorics!) we have:
\(V(\widehat{ATE}) = \frac{1}{n-1}\left[\frac{n_C}{n_T}V_1 + \frac{n_T}{n_C}V_0 + 2C_{01}\right]\)
… where \(V_0, V_1\) denote variances and \(C_{01}\) covariance
This is usefully rewritten as:
\[ \begin{split} V(\widehat{ATE}) & = \frac{1}{n-1}\left[\frac{n - n_T}{n_T}V_1 + \frac{n - n_C}{n_C}V_0 + 2C_{01}\right] \\ & = \frac{n}{n-1}\left[\frac{V_1}{n_T} + \frac{V_0}{n_C}\right] - \frac{1}{n-1}\left[V_1 + V_0 - 2C_{01}\right] \end{split} \]
where the final term is positive
Note:
This conservative variance estimator corresponds to commonly used "robust" estimators (see Samii and Aronow (2012)).

For the case with blocking, the conservative estimator is:
\(V(\widehat{ATE}) = {\sum_{S\in \mathcal{S}}{\left(\frac{n_{S}}{n}\right)^2} \left({\frac{s^2_{S1}}{n_{S1}}} + {\frac{s^2_{S0}}{n_{S0}}} \right)}\)
An illustration of how conservative the conservative estimator of variance really is (numbers in the plot are correlations between \(Y(1)\) and \(Y(0)\)).
We confirm that:
| \(\tau\) | \(\rho\) | \(\sigma^2_{Y(1)}\) | \(\Delta\) | \(\sigma^2_{\tau}\) | \(\widehat{\sigma}^2_{\tau}\) | \(\widehat{\sigma}^2_{\tau(\text{Neyman})}\) |
|---|---|---|---|---|---|---|
| 1.00 | -1.00 | 1.00 | -0.04 | 0.00 | -0.00 | 0.04 |
| 1.00 | -0.67 | 1.00 | -0.03 | 0.01 | 0.01 | 0.04 |
| 1.00 | -0.33 | 1.00 | -0.03 | 0.01 | 0.01 | 0.04 |
| 1.00 | 0.00 | 1.00 | -0.02 | 0.02 | 0.02 | 0.04 |
| 1.00 | 0.33 | 1.00 | -0.01 | 0.03 | 0.03 | 0.04 |
| 1.00 | 0.67 | 1.00 | -0.01 | 0.03 | 0.03 | 0.04 |
| 1.00 | 1.00 | 1.00 | 0.00 | 0.04 | 0.04 | 0.04 |
Here \(\rho\) is the unobserved correlation between \(Y(1)\) and \(Y(0)\); and \(\Delta\) is the final term in the sample variance equation that we cannot estimate.
The conservative variance comes from the fact that you do not know the covariance between \(Y(1)\) and \(Y(0)\).
Example:
sharp_var <- function(yt, yc, N=length(c(yt,yc)), upper=TRUE){
m <- length(yt)
n <- m + length(yc)
V <- function(x,N) (N-1)/(N*(length(x)-1)) * sum((x - mean(x))^2)
yt <- sort(yt)
if(upper) {yc <- sort(yc)
} else {
yc <- sort(yc,decreasing=TRUE)}
p_i <- unique(sort(c(seq(0,n-m,1)/(n-m),seq(0,m,1)/m)))-
.Machine$double.eps^.5
p_i[1] <- .Machine$double.eps^.5
yti <- yt[ceiling(p_i*m)]
yci <- yc[ceiling(p_i*(n-m))]
p_i_minus <- c(NA,p_i[1: (length(p_i)-1)])
((N-m)/m * V(yt,N) + (N-(n-m))/(n-m)*V(yc,N) +
   2*sum(((p_i-p_i_minus)*yti*yci)[2:length(p_i)]) - 2*mean(yt)*mean(yc))/(N-1)}

n <- 1000000
Y <- c(rep(0,n/2), 1000*rnorm(n/2))
X <- c(rep(0,n/2), rep(1, n/2))
lm_robust(Y~X) |> tidy() |> kable(digits = 2)

| term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 0.00 | 0.00 | 0.63 | 0.53 | 0.00 | 0.00 | 999998 | Y |
| X | 1.21 | 1.41 | 0.86 | 0.39 | -1.56 | 3.98 | 999998 | Y |
c(sharp_var(Y[X==1], Y[X==0], upper = FALSE),
sharp_var(Y[X==1], Y[X==0], upper = TRUE)) |>
  round(2)

[1] 1 1
The sharp bounds are \([1,1]\) but the conservative estimate is \(\sqrt{2}\).
However you can do hypothesis testing even without an estimate of the standard error.
Up next
A procedure for using the randomization distribution to calculate \(p\) values
Illustrating \(p\) values via “randomization inference”
Say you randomized assignment to treatment and your data looked like this.
| Unit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Treatment | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Health score | 4 | 2 | 3 | 1 | 2 | 3 | 4 | 8 | 7 | 6 |
Then:
| Unit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Treatment | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Health score | 4 | 2 | 3 | 1 | 2 | 3 | 4 | 8 | 7 | 6 |
Then:
An ri estimate of \(p\):

# data
set.seed(1)
df <- fabricate(N = 1000, Z = rep(c(0,1), N/2), Y= .1*Z + rnorm(N))
# test stat
test.stat <- function(df) with(df, mean(Y[Z==1])- mean(Y[Z==0]))
# test stat distribution
ts <- replicate(4000, df |> mutate(Z = sample(Z)) |> test.stat())
# test
mean(ts >= test.stat(df)) # One sided p value

[1] 0.025
The \(p\) value is the mass to the right of the vertical
You can do the same using Alex Coppock's ri2 package.
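The table below is consistent with a conduct_ri call along these lines (a reconstruction of mine; the exact declaration is not shown in the original):

```r
library(ri2)

# Complete random assignment of 500 of 1000 units, matching df above
declaration <- randomizr::declare_ra(N = 1000, m = 500)

conduct_ri(Y ~ Z,
           declaration = declaration,
           sharp_hypothesis = 0,
           data = df,
           sims = 1000) |>
  summary()
```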
| term | estimate | upper_p_value |
|---|---|---|
| Z | 0.1321367 | 0.02225 |
You’ll notice slightly different answer. This is because although the procedure is “exact” it is subject to simulation error.
| Observed Y(0) | Observed Y(1) | Null (effect 0): Y(0) | Null (effect 0): Y(1) | Null (effect 2): Y(0) | Null (effect 2): Y(1) |
|---|---|---|---|---|---|
| 1 | NA | 1 | 1 | 1 | 3 |
| 2 | NA | 2 | 2 | 2 | 4 |
| NA | 4 | 4 | 4 | 2 | 4 |
| NA | 3 | 3 | 3 | 1 | 3 |
It is possible to use this procedure to generate confidence intervals with a natural interpretation.
Warning: calculating confidence intervals this way can be computationally intensive.
DeclareDesign can do randomization inference natively.

Here we get minimal detectable effects by using a design that has two-stage simulations, so we can estimate the sampling distribution of summaries of the sampling distribution generated from reassignments.
test_stat <- function(data)
with(data, data.frame(estimate = mean(Y[Z==1]) - mean(Y[Z==0])))
b <- 0
design <-
declare_model(N = 100, Z = complete_ra(N), Y = b*Z + rnorm(N)) +
declare_estimator(handler = label_estimator(test_stat), label = "actual")+
declare_measurement(Z = sample(Z)) + # this is the permutation step
  declare_estimator(handler = label_estimator(test_stat), label = "null")

Simulations data frame from the two-step simulation. Note the computational intensity: the number of runs is the product of the sims vector. I speed things up by using a simple estimation function and also using parallelization.
A snapshot of the simulations data frame: we have multiple step 3 draws for each design and step 1 draw.
Power for each value of b.
If you want to figure out more precisely what b gives 80% or 90% power you can narrow down the b range.
Let's now imagine a world with two treatments where we are interested in using ri to assess the interaction. (Code from Coppock, ri2)
The approach is to declare a null model that is nested in the full model. The \(F\) test statistic from the model comparison is then taken as the test statistic and its distribution is built up under re-randomizations.
Let's imagine a true model with interactions. We take an estimate. We then ask how likely that estimate is under a null model with constant effects.
Note: this is quite a sharp hypothesis
df <- fabricate(N = 1000, Z1 = rep(c(0,1), N/2), Z2 = sample(Z1), Y = Z1 + Z2 - .15*Z1*Z2 + rnorm(N))
my_estimate <- (lm(Y ~ Z1*Z2, data = df) |> coef())[4]
null_model <- function(df) {
M0 <- lm(Y ~ Z1 + Z2, data = df)
d1 <- coef(M0)[2]
d2 <- coef(M0)[3]
df |> mutate(
Y_Z1_0_Z2_0 = Y - Z1*d1 - Z2*d2,
Y_Z1_1_Z2_0 = Y + (1-Z1)*d1 - Z2*d2,
Y_Z1_0_Z2_1 = Y - Z1*d1 + (1-Z2)*d2,
Y_Z1_1_Z2_1 = Y + (1-Z1)*d1 + (1-Z2)*d2)
}
Imputed potential outcomes look like this:
| ID | Z1 | Z2 | Y | Y_Z1_0_Z2_0 | Y_Z1_1_Z2_0 | Y_Z1_0_Z2_1 | Y_Z1_1_Z2_1 |
|---|---|---|---|---|---|---|---|
| 0001 | 0 | 0 | -0.18 | -0.18 | 0.76 | 0.68 | 1.61 |
| 0002 | 1 | 0 | 0.20 | -0.73 | 0.20 | 0.12 | 1.06 |
| 0003 | 0 | 0 | 2.56 | 2.56 | 3.50 | 3.42 | 4.36 |
| 0004 | 1 | 0 | -0.27 | -1.21 | -0.27 | -0.35 | 0.59 |
| 0005 | 0 | 1 | -2.13 | -2.99 | -2.05 | -2.13 | -1.19 |
| 0006 | 1 | 1 | 3.52 | 1.72 | 2.66 | 2.58 | 3.52 |
| Design | Estimator | Outcome | Term | N Sims | One Sided Pos | One Sided Neg | Two Sided |
|---|---|---|---|---|---|---|---|
| design | estimator | Y | Z1:Z2 | 1000 | 0.95 | 0.05 | 0.10 |
| | | | | | (0.01) | (0.01) | (0.01) |
| Session | Capacity | T1 | T2 | T3 |
|---|---|---|---|---|
| Thursday | 40 | 10 | 30 | 0 |
| Friday | 40 | 10 | 0 | 30 |
| Saturday | 10 | 10 | 0 | 0 |
Optimal assignment to treatment given constraints due to facilities
| Subject type | N | Available |
|---|---|---|
| A | 30 | Thurs, Fri |
| B | 30 | Thurs, Sat |
| C | 30 | Fri, Sat |
Constraints due to subjects
If you think hard about assignment you might come up with an allocation like this.
| Subject type | N | Available | Thurs | Fri | Sat |
|---|---|---|---|---|---|
| A | 30 | Thurs, Fri | 15 | 15 | NA |
| B | 30 | Thurs, Sat | 25 | NA | 5 |
| C | 30 | Fri, Sat | NA | 25 | 5 |
Assignment of people to days
That allocation balances as much as possible. Given the allocation, you might randomly assign individuals to different days as well as randomly assigning them to treatments within days. If you then figure out the assignment propensities, this is what you would get:
Assignment probabilities:

| Subject type | N | Available | T1 | T2 | T3 |
|---|---|---|---|---|---|
| A | 30 | Thurs, Fri | 0.250 | 0.375 | 0.375 |
| B | 30 | Thurs, Sat | 0.375 | 0.625 | 0.000 |
| C | 30 | Fri, Sat | 0.375 | NA | 0.625 |
Even under the assumption that the day of measurement does not matter, these assignment probabilities have big implications for analysis.
Only the type \(A\) subjects could have received any of the three treatments.
There are no two treatments for which it is possible to compare outcomes for subpopulations \(B\) and \(C\)
A comparison of \(T1\) versus \(T2\) can only be made for population \(A \cup B\)
However, subpopulation \(A\) is assigned to \(T1\) (versus \(T2\)) with conditional probability \(0.25/(0.25+0.375) = 2/5\), while population \(B\) is assigned to \(T1\) with conditional probability \(0.375/(0.375+0.625) = 3/8\)
Implications for design: need to uncluster treatment delivery
Implications for analysis: need to take account of propensities
Idea: Wacky assignments happen but if you know the propensities you can do the analysis.
A particularly interesting application is where a random assignment combines with existing features to determine assignment to an "indirect" treatment.
Report the analysis that is implied by the design.
| | T2 = N | T2 = Y | All | Diff |
|---|---|---|---|---|
| T1 = N | \(\overline{y}_{00}\) (sd) | \(\overline{y}_{01}\) (sd) | \(\overline{y}_{0x}\) (sd) | \(d_2 \mid T1=0\) (sd) |
| T1 = Y | \(\overline{y}_{10}\) (sd) | \(\overline{y}_{11}\) (sd) | \(\overline{y}_{1x}\) (sd) | \(d_2 \mid T1=1\) (sd) |
| All | \(\overline{y}_{x0}\) (sd) | \(\overline{y}_{x1}\) (sd) | \(\overline{y}\) (sd) | \(d_2\) (sd) |
| Diff | \(d_1 \mid T2=0\) (sd) | \(d_1 \mid T2=1\) (sd) | \(d_1\) (sd) | \(d_1 d_2\) (sd) |
This is instantly recognizable from the design and returns all the benefits of the factorial design including all main effects, conditional causal effects, interactions and summary outcomes. It is much clearer and more informative than a regression table.
Focus on randomization schemes
Experiments are investigations in which an intervention, in all its essential elements, is under the control of the investigator. (Cox & Reid)
Two major types of control:
In general you might want to set things up so that your randomization is replicable. You can do this by setting a seed; re-running with the same seed reproduces the same assignment:
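A minimal sketch, assuming the randomizr package (complete_ra is also what DeclareDesign uses under the hood):

```r
library(randomizr)

set.seed(343)
complete_ra(N = 10, m = 5)   # one draw

set.seed(343)
complete_ra(N = 10, m = 5)   # same seed, same assignment
```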
Even better is to set it up so that it can reproduce lots of possible draws so that you can check the propensities for each unit.
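One sketch of how such a matrix of candidate assignments might be built up (the randomizr package is assumed; the propensities reported below come from a draw of this kind):

```r
# 1000 candidate assignments of 5 of 10 units to treatment, one per column
P <- replicate(1000, complete_ra(N = 10, m = 5))

# empirical propensity for each unit: should hover around 0.5
round(rowMeans(P), 3)
```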
[1] 0.519 0.496 0.510 0.491 0.524 0.514 0.535 0.497 0.470 0.506
Here the \(P\) matrix gives 1000 possible ways of allocating 5 of 10 units to treatment. We can then confirm that the average propensity for each unit is approximately 0.5.
A survey dictionary with results from a complex randomization presented in a simple way for enumerators
People often wonder: did randomization work? Common practice is to implement a set of \(t\)-tests to see if there is balance. This makes no sense.
- If you doubt whether it was implemented properly, do an \(F\) test.
- If you worry about variance, specify controls in advance as a function of their relation with outcomes (more on this later).
- If you worry about conditional bias, look at substantive differences between groups, not \(t\)-tests.
If you want realizations to have particular properties: build it into the scheme in advance.
Note: clusters are part of your design, not part of the world.
Often used if intervention has to function at the cluster level or if outcome defined at the cluster level.
Disadvantage: loss of statistical power
However: perfectly possible to assign some treatments at cluster level and then other treatments at the individual level
Principle: (unless you are worried about spillovers) generally make clusters as small as possible
Principle: Surprisingly, variability in cluster size makes analysis harder. Try to control assignment so that cluster sizes are similar in treatment and in control.
Be clear about whether you believe effects are operating at the cluster level or at the individual level. This matters for power calculations.
Be clear about whether spillover effects operate only within clusters or also across them. If within only you might be able to interpret treatment as the effect of being in a treated cluster…
Surprisingly, if clusters are of different sizes the difference in means estimator is not unbiased, even if all units are assigned to treatment with the same probability.
Here's the intuition. Say there are two clusters, each with homogeneous treatment effects:
| Cluster | Size | Y0 | Y1 |
|---|---|---|---|
| 1 | 1000000 | 0 | 1 |
| 2 | 1 | 0 | 0 |
Then: What is the true average treatment effect? What do you expect to estimate from cluster random assignment?
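A quick sketch of the calculation, enumerating the two equally likely cluster assignments:

```r
# cluster 1: 1,000,000 units with Y0 = 0, Y1 = 1; cluster 2: one unit with Y0 = Y1 = 0
n  <- c(1e6, 1)
Y0 <- c(0, 0)
Y1 <- c(1, 0)

true_ATE <- sum(n * (Y1 - Y0)) / sum(n)   # essentially 1

# difference in means under each of the two possible cluster assignments
dim_1 <- Y1[1] - Y0[2]   # cluster 1 treated: 1 - 0 = 1
dim_2 <- Y1[2] - Y0[1]   # cluster 2 treated: 0 - 0 = 0

mean(c(dim_1, dim_2))    # expected estimate = 0.5, far from the true ATE
```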
The solution is to block by cluster size. For more see: http://gking.harvard.edu/files/cluster.pdf
There are more or less efficient ways to randomize.
Consider a case with four units and two strata. There are 6 possible assignments of 2 units to treatment:
| ID | X | Y(0) | Y(1) | R1 | R2 | R3 | R4 | R5 | R6 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 3 | 2 | 1 | 2 | 0 | 1 | 0 | 1 | 0 | 1 |
| 4 | 2 | 1 | 2 | 0 | 0 | 1 | 0 | 1 | 1 |
| – | – | – | – | – | – | – | – | – | – |
| \(\widehat{\tau}\): | | | | 0 | 1 | 1 | 1 | 1 | 2 |
Even with a constant treatment effect and everything uniform within blocks, there is variance in the estimation of \(\widehat{\tau}\). This can be eliminated by excluding R1 and R6.
Simple blocking in R (5 pairs):
| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 |
| 0 | 0 | 1 | 0 | 0 |
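One way to generate an assignment like this, as a sketch using the randomizr package (five blocks of two units, exactly one treated per block; the table above shows each pair as a column):

```r
library(randomizr)

pairs <- rep(1:5, each = 2)                        # 5 blocks (pairs) of 2 units
Z <- block_ra(blocks = pairs, block_m = rep(1, 5)) # one treated unit per pair
matrix(Z, nrow = 2, dimnames = list(NULL, 1:5))    # display pairs as columns
```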
DeclareDesign can help with this. Load up:
| \(T2=0\) | \(T2=1\) | |
|---|---|---|
| T1 = 0 | \(50\%\) | \(0\%\) |
| T1 = 1 | \(50\%\) | \(0\%\) |
Spread out:
| \(T2=0\) | \(T2=1\) | |
|---|---|---|
| T1 = 0 | \(25\%\) | \(25\%\) |
| T1 = 1 | \(25\%\) | \(25\%\) |
Three arm it?:
| \(T2=0\) | \(T2=1\) | |
|---|---|---|
| T1 = 0 | \(33.3\%\) | \(33.3\%\) |
| T1 = 1 | \(33.3\%\) | \(0\%\) |
Bunch it?:
| \(T2=0\) | \(T2=1\) | |
|---|---|---|
| T1 = 0 | \(40\%\) | \(20\%\) |
| T1 = 1 | \(20\%\) | \(20\%\) |
This speaks to "spreading out." Note: the "bunching" example may not pay off and has the undesirable feature of introducing a correlation between treatment assignments.
Two ways to do factorial assignments in DeclareDesign:
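A sketch of two possible approaches (the object and variable names here are illustrative):

```r
library(DeclareDesign)

# Way 1: two independent complete random assignments
two_step <- declare_assignment(
  Z1 = complete_ra(N),
  Z2 = complete_ra(N))

# Way 2: one four-arm assignment, then split the arm label into two factors
four_arm <- declare_assignment(
  arm = complete_ra(N, conditions = c("00", "01", "10", "11")),
  Z1  = as.numeric(substr(as.character(arm), 1, 1)),
  Z2  = as.numeric(substr(as.character(arm), 2, 2)))
```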
In practice if you have a lot of treatments it can be hard to do full factorial designs – there may be too many combinations.
In such cases people use fractional factorial designs, like the one below (5 treatments but only 8 units!)
| Variation | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 1 |
| 4 | 0 | 1 | 1 | 1 | 0 |
| 5 | 1 | 0 | 0 | 1 | 0 |
| 6 | 1 | 0 | 1 | 0 | 1 |
| 7 | 1 | 1 | 0 | 0 | 0 |
| 8 | 1 | 1 | 1 | 1 | 1 |
Then randomly assign units to rows. Note columns might also be blocking covariates.
In R, look at library(survey)
| Unit | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 1 |
| 4 | 0 | 1 | 1 | 1 | 0 |
| 5 | 1 | 0 | 0 | 1 | 0 |
| 6 | 1 | 0 | 1 | 0 | 1 |
| 7 | 1 | 1 | 0 | 0 | 0 |
| 8 | 1 | 1 | 1 | 1 | 1 |
Muralidharan, Romero, and Wüthrich (2023) write:
Factorial designs are widely used to study multiple treatments in one experiment. While t-tests using a fully-saturated “long” model provide valid inferences, “short” model t-tests (that ignore interactions) yield higher power if interactions are zero, but incorrect inferences otherwise. Of 27 factorial experiments published in top-5 journals (2007–2017), 19 use the short model. After including interactions, over half of their results lose significance. […]
Anything to be done on randomization to address external validity concerns?
A design with hierarchical data and different assignment schemes.
design <-
declare_model(
school = add_level(N = 16,
u_school = rnorm(N, mean = 0)),
classroom = add_level(N = 4,
u_classroom = rnorm(N, mean = 0)),
student = add_level(N = 20,
u_student = rnorm(N, mean = 0))
) +
declare_model(
potential_outcomes(Y ~ .1*Z + u_classroom + u_student + u_school)
) +
declare_assignment(Z = simple_ra(N)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, .method = difference_in_means)

Here are the first couple of rows and columns of the resulting data frame.
| school | u_school | classroom | u_classroom | student | u_student | Y_Z_0 | Y_Z_1 | Z | Y |
|---|---|---|---|---|---|---|---|---|---|
| 01 | 1.35 | 01 | 1.26 | 0001 | -1.28 | 1.33 | 1.43 | 0 | 1.33 |
| 01 | 1.35 | 01 | 1.26 | 0002 | 0.79 | 3.40 | 3.50 | 1 | 3.50 |
| 01 | 1.35 | 01 | 1.26 | 0003 | -0.12 | 2.49 | 2.59 | 0 | 2.49 |
| 01 | 1.35 | 01 | 1.26 | 0004 | -0.65 | 1.96 | 2.06 | 1 | 2.06 |
| 01 | 1.35 | 01 | 1.26 | 0005 | 0.36 | 2.97 | 3.07 | 1 | 3.07 |
| 01 | 1.35 | 01 | 1.26 | 0006 | -0.96 | 1.65 | 1.75 | 0 | 1.65 |
Here is the distribution between treatment and control:
We can draw a new set of data and look at the number of subjects in the treatment and control groups.
But what if all students in a given class have to be assigned the same treatment?
assignment_clustered <-
declare_assignment(Z = cluster_ra(clusters = classroom))
estimator_clustered <-
declare_estimator(Y ~ Z, clusters = classroom,
.method = difference_in_means)
design_clustered <-
design |>
replace_step("assignment", assignment_clustered) |>
replace_step("estimator", estimator_clustered)assignment_clustered_blocked <-
declare_assignment(Z = block_and_cluster_ra(blocks = school,
clusters = classroom))
estimator_clustered_blocked <-
declare_estimator(Y ~ Z, blocks = school, clusters = classroom,
.method = difference_in_means)
design_clustered_blocked <-
design |>
replace_step("assignment", assignment_clustered_blocked) |>
replace_step("estimator", estimator_clustered_blocked)| Design | Power | Coverage |
|---|---|---|
| simple | 0.16 | 0.95 |
| (0.01) | (0.01) | |
| complete | 0.20 | 0.96 |
| (0.01) | (0.01) | |
| blocked | 0.42 | 0.95 |
| (0.01) | (0.01) | |
| clustered | 0.06 | 0.96 |
| (0.01) | (0.01) | |
| clustered_blocked | 0.08 | 0.96 |
| (0.01) | (0.01) |
In many designs you seek to assign an integer number of subjects to treatment from some set.
Sometimes, however, your assignment targets are not integers.
Example:
Two strategies:
The prob_ra function (from the probra package, used below) can also be used to set targets:
# remotes::install_github("macartan/probra")
library(probra)
library(knitr)   # for kable()

set.seed(1)
fabricate(N = 4, size = c(47, 53, 87, 25), n_treated = prob_ra(.5*size)) |>
  janitor::adorn_totals("row") |>
  kable(caption = "Setting targets to get 50% targets with minimal variance")

| ID | size | n_treated |
|---|---|---|
| 1 | 47 | 23 |
| 2 | 53 | 27 |
| 3 | 87 | 43 |
| 4 | 25 | 13 |
| Total | 212 | 106 |
It can also be used for complete assignment with heterogeneous propensities:
[1] 0.5
Indirect control
Indirect assignments are generally generated by applying a direct assignment and then figuring out the implied indirect assignment.
This looks better, but there are trade-offs between the direct and indirect distributions.
Figuring out the optimal procedure requires full diagnosis
A focus on power
In the classical approach to testing a hypothesis we ask:
How likely are we to see data like this if indeed the hypothesis is true?
How unlikely is “not very likely”?
When we test a hypothesis we decide first on what sort of evidence we need to see in order to decide that the hypothesis is not reliable.
Othello has a hypothesis that Desdemona is innocent.
Iago confronts him with evidence:
Note that Othello is focused on the probability of the events if she were innocent but not the probability of the events if Iago were trying to trick him.
He is not assessing his belief in whether she is faithful, but rather how likely the data would be if she were faithful.
So:
Illustrating \(p\) values via “randomization inference”
Say you randomized assignment to treatment and your data looked like this.
| Unit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Treatment | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Health score | 4 | 2 | 3 | 1 | 2 | 3 | 4 | 8 | 7 | 6 |
Then:
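As a sketch of the logic (the numbers are from the table above): under the sharp null of no effect, the treated unit could equally well have been any of the ten units, so we can compute the test statistic for each of the ten possible assignments.

```r
Y <- c(4, 2, 3, 1, 2, 3, 4, 8, 7, 6)
Z <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0)

observed <- mean(Y[Z == 1]) - mean(Y[Z == 0])   # about 4.44

# under the sharp null, re-assign the single treated unit to each position
ts <- sapply(1:10, function(i) Y[i] - mean(Y[-i]))

mean(ts >= observed)   # one-sided p value: 0.1
```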
Power is just the probability of getting a significant result, that is, of rejecting a hypothesis.
Simple enough but it presupposes:
I want to test the hypothesis that a six never comes up on this dice.
Here’s my test:
What is the power of this test?
Power sometimes seems more complicated because hypothesis rejection involves a calculated probability and so you need the probability of a probability.
I want to test the hypothesis that this dice is fair.
Here’s my test:
Now:
For this we need to figure a rule for rejection. This is based on identifying events that should be unlikely under the hypothesis.
Here is how many 6’s I would expect if the dice is fair:
I can figure out from this that 143 or fewer is really very few and 190 or more is really very many:
Now we need to stipulate some belief about how the world really works—this is not the null hypothesis that we plan to reject, but something that we actually take to be true.
For instance: we think that in fact sixes appear 20% of the time.
Now what’s the probability of seeing at least 190 sixes?
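A sketch of the calculation, assuming 1,000 rolls (a number consistent with the 143/190 cutoffs above):

```r
n <- 1000

# rejection region under the null of a fair die (p = 1/6)
qbinom(c(0.025, 0.975), n, 1/6)   # roughly 144 and 190

# power if sixes in fact appear 20% of the time:
# the probability of seeing 190 or more sixes
1 - pbinom(189, n, 0.2)           # about 0.8
```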
So given I think 6s appear 20% of the time, I think it likely I’ll see at least 190 sixes and reject the hypothesis of a fair dice.
Simplest intuition on power:
What is the probability of getting a significant estimate given the sampling distribution is centered on \(b\) and the standard error is 1?
Add together the probability of getting an estimate above 1.96 and the probability of getting an estimate below -1.96.
This is essentially what is done by pwrss::power.z.test – and it produces nice graphs!
See:
Substantively: if in expectation an estimate will be just significant, then your power is 50%
power <- function(b, alpha = 0.05, critical = qnorm(1 - alpha/2))
  1 - pnorm(critical - b) + pnorm(-critical - b)

power(1.96)

[1] 0.5000586
Intuition:
Of course the standard error will depend on the number of units and the variance of outcomes in treatment and control.
Say \(N\) subjects are divided into two groups and potential outcomes have standard deviation \(\sigma\) in treatment and control. Then the variance of the estimated treatment effect is (approximately, and conservatively):
\[Var(\tau)=\frac{\sigma^2}{N/2} + \frac{\sigma^2}{N/2} = 4\frac{\sigma^2}{N}\]
and so the (conservative / approx) standard error is:
\[\sigma_\tau=\frac{2\sigma}{\sqrt{N}}\]
Note here we seem to be using the actual standard error but of course the tests we actually run will use an estimate of the standard error…
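Putting these pieces together by hand, here is a sketch that reuses the power function defined above:

```r
# power for a two-arm design with N subjects, outcome sd sigma, and effect size b
power_N <- function(N, b, sigma = 1, alpha = 0.05) {
  se <- 2 * sigma / sqrt(N)
  power(b / se, alpha = alpha)
}

power_N(N = 100, b = 0.1)   # about 0.08: in line with the pwrss output below
```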
This can be done e.g. with pwrss like this:
pwrss::pwrss.t.2means(mu1 = .2, mu2 = .1, sd1 = 1, sd2 = 1,
n2 = 50, alpha = 0.05,
alternative = "not equal")+--------------------------------------------------+
| POWER CALCULATION |
+--------------------------------------------------+
Welch's T-Test (Independent Samples)
---------------------------------------------------
Hypotheses
---------------------------------------------------
H0 (Null Claim) : d - null.d = 0
H1 (Alt. Claim) : d - null.d != 0
---------------------------------------------------
Results
---------------------------------------------------
Sample Size = 50 and 50
Type 1 Error (alpha) = 0.050
Type 2 Error (beta) = 0.921
Statistical Power = 0.079 <<
[1] 0.7010827
Power calculations for more complex designs mostly involve figuring out the standard error.
Consider a cluster randomized trial, with each unit having a cluster level shock \(\epsilon_k\) and an individual shock \(\nu_i\). Say these have variances \(\sigma^2_k\), \(\sigma^2_i\).
The standard error is:
\[\sqrt{\frac{4\sigma^2_k}{K} + \frac{4\sigma^2_i}{nK}}\]
Define \(\rho = \frac{\sigma^2_k}{\sigma^2_k + \sigma^2_i}\). Then the standard error can be written:

\[\sqrt{\rho \frac{4\sigma^2}{K} + (1- \rho)\frac{4\sigma^2}{nK}}\]

or, equivalently:

\[\sqrt{((n - 1)\rho + 1)\frac{4\sigma^2}{nK}}\]

where \(\sigma^2 = \sigma^2_k + \sigma^2_i\) is the total variance, \(K\) is the number of clusters, and \(n\) is the number of units per cluster.
Plug in these standard errors and proceed as before.
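A sketch, reusing the power function defined above (the function name and parameter values are illustrative):

```r
# power for a cluster-randomized design with K clusters of n units,
# intra-cluster correlation rho, total sd sigma, and effect size b
cluster_power <- function(K, n, rho, b, sigma = 1, alpha = 0.05) {
  se <- sqrt(((n - 1) * rho + 1) * 4 * sigma^2 / (n * K))
  power(b / se, alpha = alpha)
}

cluster_power(K = 20, n = 10, rho = 0.1, b = 0.3)
```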
Simulation-based power calculation (for example with DeclareDesign) is arbitrarily flexible:
| sim_ID | estimate | p.value |
|---|---|---|
| 1 | 0.24 | 0.22 |
| 2 | 0.80 | 0.00 |
| 3 | 0.25 | 0.16 |
| 4 | 0.27 | 0.18 |
| 5 | 0.87 | 0.00 |
| 6 | 0.57 | 0.00 |
Power is obviously related to the distribution of estimates you might get.
A valid \(p\)-value satisfies \(\Pr(p \leq x) \leq x\) for every \(x \in [0,1]\) (under the null)
| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| 0.50 | -0.00 | 0.20 | 0.20 | 0.69 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
| b | Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|---|
| 0 | 0.00 | 0.00 | 0.20 | 0.20 | 0.05 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
| 0.25 | 0.25 | 0.00 | 0.20 | 0.20 | 0.24 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
| 0.5 | 0.50 | -0.00 | 0.20 | 0.20 | 0.69 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
| 1 | 1.00 | -0.00 | 0.20 | 0.20 | 1.00 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) |
coming up:
We often focus on sample sizes, but power also depends on other features of the design and analysis strategy. For example:
Say we have access to a “pre” measure of outcome Y_now; call it Y_base. Y_base is informative about potential outcomes. We are considering using Y_now - Y_base as the outcome instead of Y_now.
N <- 100
rho <- .5
design <-
declare_model(N,
Y_base = rnorm(N),
Y_Z_0 = 1 + correlate(rnorm, given = Y_base, rho = rho),
Y_Z_1 = correlate(rnorm, given = Y_base, rho = rho),
Z = complete_ra(N),
Y_now = Z*Y_Z_1 + (1-Z)*Y_Z_0,
Y_change = Y_now - Y_base) +
declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
declare_estimator(Y_now ~ Z, label = "level") +
  declare_estimator(Y_change ~ Z, label = "change") +
  declare_estimator(Y_now ~ Z + Y_base, label = "RHS")

Punchline:
You can see from the null design that power is great but bias is terrible and coverage is way off.
| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| 1.59 | 1.59 | 0.12 | 1.60 | 1.00 | 0.00 |
| (0.01) | (0.01) | (0.00) | (0.01) | (0.00) | (0.00) |
Power without unbiasedness corrupts, absolutely
another_bad_design <-
declare_model(
N = 100,
female = rep(0:1, N/2),
U = rnorm(N),
potential_outcomes(Y ~ female * Z + U)) +
declare_assignment(
Z = block_ra(blocks = female, block_prob = c(.1, .5)),
Y = reveal_outcomes(Y ~ Z)) +
declare_inquiry(ate = mean(Y_Z_1 - Y_Z_0)) +
declare_estimator(Y ~ Z + female, inquiry = "ate",
                    .method = lm_robust)

You can see from the null design that power is great but bias is terrible and coverage is way off.
| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| 0.76 | 0.26 | 0.24 | 0.35 | 0.84 | 0.85 |
| (0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.02) |
clustered_design <-
declare_model(
cluster = add_level(N = 10, cluster_shock = rnorm(N)),
individual = add_level(
N = 100,
Y_Z_0 = rnorm(N) + cluster_shock,
Y_Z_1 = rnorm(N) + cluster_shock)) +
declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
declare_assignment(Z = cluster_ra(clusters = cluster)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_estimator(Y ~ Z, inquiry = "ATE")| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| -0.00 | -0.00 | 0.64 | 0.64 | 0.79 | 0.20 |
| (0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.01) |
What alerts you to a problem?
| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| 0.00 | -0.00 | 0.66 | 0.65 | 0.06 | 0.94 |
| (0.02) | (0.02) | (0.01) | (0.01) | (0.01) | (0.01) |
design_uncertain <-
declare_model(N = 1000, b = 1+rnorm(1), Y_Z_1 = rnorm(N), Y_Z_2 = rnorm(N) + b, Y_Z_3 = rnorm(N) + b) +
declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_inquiry(ate = mean(b)) +
declare_estimator(Y ~ factor(Z), term = TRUE)
draw_estimands(design_uncertain)

  inquiry   estimand
1     ate -0.3967765

and again:

  inquiry  estimand
1     ate  0.7887188
Say I run two tests and want to correct for multiple comparisons.
Two approaches. First, by hand:
b = .2
design_mc <-
declare_model(N = 1000, Y_Z_1 = rnorm(N), Y_Z_2 = rnorm(N) + b, Y_Z_3 = rnorm(N) + b) +
declare_assignment(Z = complete_ra(N = N, num_arms = 3, conditions = 1:3)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_inquiry(ate = b) +
  declare_estimator(Y ~ factor(Z), term = TRUE)

design_mc |>
simulate_designs(sims = 1000) |>
filter(term != "(Intercept)") |>
group_by(sim_ID) |>
mutate(p_bonferroni = p.adjust(p = p.value, method = "bonferroni"),
p_holm = p.adjust(p = p.value, method = "holm"),
p_fdr = p.adjust(p = p.value, method = "fdr")) |>
ungroup() |>
summarize(
"Power using naive p-values" = mean(p.value <= 0.05),
"Power using Bonferroni correction" = mean(p_bonferroni <= 0.05),
"Power using Holm correction" = mean(p_holm <= 0.05),
"Power using FDR correction" = mean(p_fdr <= 0.05)
  )

| Power using naive p-values | Power using Bonferroni correction | Power using Holm correction | Power using FDR correction |
|---|---|---|---|
| 0.7374 | 0.6318 | 0.6886 | 0.7032 |
The alternative approach (generally better!) is to design with a custom estimator that includes your corrections.
my_estimator <- function(data)
lm_robust(Y ~ factor(Z), data = data) |>
tidy() |>
filter(term != "(Intercept)") |>
mutate(p.naive = p.value,
p.value = p.adjust(p = p.naive, method = "bonferroni"))
design_mc_2 <- design_mc |>
replace_step(5, declare_estimator(handler = label_estimator(my_estimator)))
run_design(design_mc_2) |>
  select(term, estimate, p.value, p.naive) |> kable()

| term | estimate | p.value | p.naive |
|---|---|---|---|
| factor(Z)2 | 0.1182516 | 0.2502156 | 0.1251078 |
| factor(Z)3 | 0.1057031 | 0.3337476 | 0.1668738 |
Let's try the same thing for a null model (using redesign(design_mc_2, b = 0)).
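A sketch of that diagnosis call (the number of simulations is illustrative):

```r
design_mc_2 |>
  redesign(b = 0) |>
  diagnose_design(sims = 1000)
```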
…and power:
| Mean Estimate | Bias | SD Estimate | RMSE | Power | Coverage |
|---|---|---|---|---|---|
| 0.00 | 0.00 | 0.08 | 0.08 | 0.02 | 0.95 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.01) |
| -0.00 | -0.00 | 0.08 | 0.08 | 0.02 | 0.96 |
| (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.01) |
Should you be bothered?
Sometimes you give a medicine but only a nonrandom sample of people actually try to use it. Can you still estimate the medicine’s effect?
| X=0 | X=1 | |
|---|---|---|
| Z=0 | \(\overline{y}_{00}\) (\(n_{00}\)) | \(\overline{y}_{01}\) (\(n_{01}\)) |
| Z=1 | \(\overline{y}_{10}\) (\(n_{10}\)) | \(\overline{y}_{11}\) (\(n_{11}\)) |
Say that people are one of 3 types: always-takers (\(a\)), never-takers (\(n\)), and compliers (\(c\)).
We can figure out something about the types:
| \(X=0\) | \(X=1\) | |
|---|---|---|
| \(Z=0\) | \(\frac{\frac{1}{2}n_c}{\frac{1}{2}n_c + \frac{1}{2}n_n} \overline{y}^0_{c}+\frac{\frac{1}{2}n_n}{\frac{1}{2}n_c + \frac{1}{2}n_n} \overline{y}_{n}\) | \(\overline{y}_{a}\) |
| \(Z=1\) | \(\overline{y}_{n}\) | \(\frac{\frac{1}{2}n_c}{\frac{1}{2}n_c + \frac{1}{2}n_a} \overline{y}^1_{c}+\frac{\frac{1}{2}n_a}{\frac{1}{2}n_c + \frac{1}{2}n_a} \overline{y}_{a}\) |
You give a medicine to 50% but only a non random sample of people actually try to use it. Can you still estimate the medicine’s effect?
| \(X=0\) | \(X=1\) | |
|---|---|---|
| \(Z=0\) | \(\frac{n_c}{n_c + n_n} \overline{y}^0_{c}+\frac{n_n}{n_c + n_n} \overline{y}_n\) | \(\overline{y}_{a}\) |
| (n) | (\(\frac{1}{2}(n_c + n_n)\)) | (\(\frac{1}{2}n_a\)) |
| \(Z=1\) | \(\overline{y}_{n}\) | \(\frac{n_c}{n_c + n_a} \overline{y}^1_{c}+\frac{n_a}{n_c + n_a} \overline{y}_{a}\) |
| (n) | (\(\frac{1}{2}n_n\)) | (\(\frac{1}{2}(n_a+n_c)\)) |
Key insight: the contributions of the \(a\)s and \(n\)s are the same in the \(Z=0\) and \(Z=1\) groups so if you difference you are left with the changes in the contributions of the \(c\)s.
Average in \(Z=0\) group: \(\frac{{n_c} \overline{y}^0_{c}+ \left(n_{n}\overline{y}_{n} +{n_a} \overline{y}_a\right)}{n_a+n_c+n_n}\)
Average in \(Z=1\) group: \(\frac{{n_c} \overline{y}^1_{c} + \left(n_{n}\overline{y}_{n} +{n_a} \overline{y}_a \right)}{n_a+n_c+n_n}\)
So, the difference is the ITT: \(({\overline{y}^1_c-\overline{y}^0_c})\frac{n_c}{n}\)
Last step:
\[ITT = ({\overline{y}^1_c-\overline{y}^0_c})\frac{n_c}{n}\]
\[\leftrightarrow\]
\[LATE = \frac{ITT}{\frac{n_c}{n}}= \frac{\text{Intent to treat effect}}{\text{First stage effect}}\]
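A sketch of this logic with simulated data (all names and parameter values are illustrative; the estimatr, fabricatr, and randomizr packages are assumed):

```r
library(estimatr)
library(fabricatr)
library(randomizr)

set.seed(1)
df <- fabricate(
  N = 1000,
  type = sample(c("complier", "always", "never"), N, replace = TRUE, prob = c(.5, .25, .25)),
  Z = complete_ra(N),                                              # random encouragement
  X = ifelse(type == "always", 1, ifelse(type == "never", 0, Z)),  # actual take-up
  Y = 0.5 * X * (type == "complier") + rnorm(N)                    # effect of 0.5, compliers only
)

ITT         <- coef(lm_robust(Y ~ Z, data = df))["Z"]
first_stage <- coef(lm_robust(X ~ Z, data = df))["Z"]

ITT / first_stage                 # ratio estimate of the LATE (about 0.5)
iv_robust(Y ~ X | Z, data = df)   # the equivalent 2SLS estimate
```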
With and without an imposition of monotonicity
data("lipids_data")
models <-
list(unrestricted = make_model("Z -> X -> Y; X <-> Y"),
restricted = make_model("Z -> X -> Y; X <-> Y") |>
set_restrictions("X[Z=1] < X[Z=0]")) |>
lapply(update_model, data = lipids_data, refresh = 0)
models |>
query_model(query = list(CATE = "Y[X=1] - Y[X=0]",
Nonmonotonic = "X[Z=1] < X[Z=0]"),
given = list("X[Z=1] > X[Z=0]", TRUE),
using = "posteriors") With and without an imposition of monotonicity:
| model | query | mean | sd |
|---|---|---|---|
| unrestricted | CATE | 0.70 | 0.05 |
| restricted | CATE | 0.71 | 0.05 |
| unrestricted | Nonmonotonic | 0.01 | 0.01 |
| restricted | Nonmonotonic | 0.00 | 0.00 |
In one case we assume monotonicity; in the other we update on it (easy in this case because of the empirically verifiable nature of one-sided noncompliance).
Multiple survey experimental designs have been developed to make it easier for subjects to answer sensitive questions.
The key idea is to use inference rather than measurement.
Subjects are placed in different conditions and the conditions affect the answers that are given in such a way that you can infer some underlying quantity of interest
This is an obvious DAG, but the main point is to be clear that the value is the quantity of interest and that this value is not affected by the treatment, \(Z\).
The list experiment supposes that:
In other words: sensitivities notwithstanding, they are happy for the researcher to make correct inferences about them or their group
Respondents are given a short list and a long list.
The long list differs from the short list in having one extra item—the sensitive item
We ask how many items in each list does a respondent agree with:
How many of these do you agree with:
| | Short list | Long list | “Effect” |
|---|---|---|---|
| | “2 + 2 = 4” | “2 + 2 = 4” | |
| | “2 * 3 = 6” | “2 * 3 = 6” | |
| | “3 + 6 = 8” | “Climate change is real” | |
| | | “3 + 6 = 8” | |
| Answer | Y(0) = 2 | Y(1) = 4 | Y(1) - Y(0) = 2 |
[Note: this is obviously not a good list. Why not?]
declaration_17.3 <-
declare_model(
N = 500,
control_count = rbinom(N, size = 3, prob = 0.5),
Y_star = rbinom(N, size = 1, prob = 0.3),
potential_outcomes(Y_list ~ Y_star * Z + control_count)
) +
declare_inquiry(prevalence_rate = mean(Y_star)) +
declare_assignment(Z = complete_ra(N)) +
declare_measurement(Y_list = reveal_outcomes(Y_list ~ Z)) +
declare_estimator(Y_list ~ Z, .method = difference_in_means,
inquiry = "prevalence_rate")
diagnosands <- declare_diagnosands(
bias = mean(estimate - estimand),
mean_CI_width = mean(conf.high - conf.low)
)

| Design | Inquiry | Bias | Mean CI Width |
|---|---|---|---|
| declaration_17.3 | prevalence_rate | 0.00 | 0.32 |
| (0.00) | (0.00) |
declaration_17.4 <-
declare_model(
N = N,
U = rnorm(N),
control_count = rbinom(N, size = 3, prob = 0.5),
Y_star = rbinom(N, size = 1, prob = 0.3),
W = case_when(Y_star == 0 ~ 0L,
Y_star == 1 ~ rbinom(N, size = 1, prob = proportion_hiding)),
potential_outcomes(Y_list ~ Y_star * Z + control_count)
) +
declare_inquiry(prevalence_rate = mean(Y_star)) +
declare_assignment(Z = complete_ra(N)) +
declare_measurement(Y_list = reveal_outcomes(Y_list ~ Z),
Y_direct = Y_star - W) +
declare_estimator(Y_list ~ Z, inquiry = "prevalence_rate", label = "list") +
  declare_estimator(Y_direct ~ 1, inquiry = "prevalence_rate", label = "direct")

rho <- -.8
correlated_lists <-
declare_model(
N = 500,
U = rnorm(N),
control_1 = rbinom(N, size = 1, prob = 0.5),
control_2 = correlate(given = control_1, rho = rho, draw_binary, prob = 0.5),
control_count = control_1 + control_2,
Y_star = rbinom(N, size = 1, prob = 0.3),
potential_outcomes(Y_list ~ Y_star * Z + control_count)
) +
declare_inquiry(prevalence_rate = mean(Y_star)) +
declare_assignment(Z = complete_ra(N)) +
declare_measurement(Y_list = reveal_outcomes(Y_list ~ Z)) +
  declare_estimator(Y_list ~ Z)

This is typically used to estimate average levels.
However you can use it in the obvious way to get average levels for groups: this is equivalent to calculating group level heterogeneous effects
Extending the idea you can even get individual level estimates: for instance you might use causal forests
You can also use this to estimate the effect of an experimental treatment on an item that’s measured using a list, without requiring individual level estimates:
\[Y_i = \beta_0 + \beta_1Z_i + \beta_2Long_i + \beta_3Z_iLong_i\]
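As a sketch (a hypothetical data frame df with treatment Z, an indicator Long for receiving the long list, and the count Y), the interaction coefficient \(\beta_3\) is the quantity of interest:

```r
library(estimatr)

# the Z:Long interaction estimates the effect of Z on the sensitive item
lm_robust(Y ~ Z * Long, data = df)
```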
Note that here we looked at “hiders” – people not answering the direct question truthfully
See Li (2019) on bounds when the “no liars” assumption is threatened — this is about whether people respond truthfully to the list experimental question
Which effects are identified by the random assignment of \(X\)?
An obvious approach is to first examine the (average) effect of X on M1 and then use another manipulation to examine the (average) effect of M1 on Y.
Both instances of unobserved confounding between \(M\) and \(Y\):
Another somewhat obvious approach is to see how the effect of \(X\) on \(Y\) in a regression is reduced when you control for \(M\).
If the effect of \(X\) on \(Y\) passes through \(M\) then surely there should be no effect of \(X\) on \(Y\) after you control for \(M\).
This common strategy associated with Baron and Kenny (1986) is also not guaranteed to produce reliable results. See for instance Green, Ha, and Bullock (2010)
df <- fabricate(N = 1000,
U = rbinom(N, 1, .5), X = rbinom(N, 1, .5),
M = ifelse(U==1, X, 1-X), Y = ifelse(U==1, M, 1-M))
list(lm(Y ~ X, data = df),
     lm(Y ~ X + M, data = df)) |> texreg::htmlreg()

| | Model 1 | Model 2 |
|---|---|---|
| (Intercept) | 0.00*** | 0.00*** |
| (0.00) | (0.00) | |
| X | 1.00*** | 1.00*** |
| (0.00) | (0.00) | |
| M | 0.00 | |
| (0.00) | ||
| R2 | 1.00 | 1.00 |
| Adj. R2 | 1.00 | 1.00 |
| Num. obs. | 1000 | 1000 |
| ***p < 0.001; **p < 0.01; *p < 0.05 | ||
The bad news is that although a single experiment might identify the total effect, it can not identify these elements of the direct effect.
So:
Check formal requirement for identification under single experiment design (“sequential ignorability”—that, conditional on actual treatment, it is as if the value of the mediation variable is randomly assigned relative to potential outcomes). But this is strong (and in fact unverifiable) and if it does not hold, bounds on effects always include zero (Imai et al)
Consider sensitivity analyses
You can use interactions with covariates if you are willing to make assumptions on no heterogeneity of direct treatment effects over covariates.
For example: you think that money makes people get to work faster because they can buy better cars; you look at the marginal effect of more money on time to work for people with and without cars and find it higher for the latter.
This might imply mediation through transport, but only if there is no direct effect heterogeneity (e.g. people with cars are less motivated by money).
Weaker assumptions justify parallel design
Takeaway: Understanding mechanisms is harder than you think. Figure out what assumptions fly.
Let's imagine that sequential ignorability does not hold. What are our posteriors on mediation quantities when in fact all effects are mediated, effects are strong, and we have lots of data?
We imagine a true model and consider estimands:
truth <- make_model("X -> M ->Y") |>
set_parameters(c(.5, .5, .1, 0, .8, .1, .1, 0, .8, .1))
queries <-
list(
indirect = "Y[X = 1, M = M[X=1]] - Y[X = 1, M = M[X=0]]",
direct = "Y[X = 1, M = M[X=0]] - Y[X = 0, M = M[X=0]]"
)
truth |> query_model(queries) |> kable()

| label | query | given | using | case_level | mean | sd | cred.low | cred.high |
|---|---|---|---|---|---|---|---|---|
| indirect | Y[X = 1, M = M[X=1]] - Y[X = 1, M = M[X=0]] | - | parameters | FALSE | 0.64 | NA | 0.64 | 0.64 |
| direct | Y[X = 1, M = M[X=0]] - Y[X = 0, M = M[X=0]] | - | parameters | FALSE | 0.00 | NA | 0.00 | 0.00 |
Why such poor behavior? Why isn’t weight going onto indirect effects?
Turns out the data is consistent with direct effects only: specifically that whenever \(M\) is responsive to \(X\), \(Y\) is responsive to \(X\).
Spillovers can result in the estimation of weaker effects when effects are actually stronger.
The key problem is that \(Y(1)\) and \(Y(0)\) are not sufficient to describe potential outcomes
| Unit | Location | \(D_\emptyset\) | \(y(D_\emptyset)\) | \(D_1\) | \(y(D_1)\) | \(D_2\) | \(y(D_2)\) | \(D_3\) | \(y(D_3)\) | \(D_4\) | \(y(D_4)\) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1 | 0 | 0 | 1 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| B | 2 | 0 | 0 | 0 | 3 | 1 | 3 | 0 | 3 | 0 | 0 |
| C | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 3 | 0 | 3 |
| D | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 3 |
Table: Potential outcomes for four units for different treatment profiles. \(D_i\) is an allocation and \(y_j(D_i)\) is the potential outcome for (row) unit \(j\) given (column) \(D_i\).
Summary quantities for each allocation:

| | \(D_\emptyset\) | \(D_1\) | \(D_2\) | \(D_3\) | \(D_4\) |
|---|---|---|---|---|---|
| \(\bar{y}_\text{treated}\) | - | 3 | 3 | 3 | 3 |
| \(\bar{y}_\text{untreated}\) | 0 | 1 | 4/3 | 4/3 | 1 |
| \(\bar{y}_\text{neighbors}\) | - | 3 | 2 | 2 | 3 |
| \(\bar{y}_\text{pure control}\) | 0 | 0 | 0 | 0 | 0 |
| ATT-direct | - | 3 | 3 | 3 | 3 |
| ATT-indirect | - | 3 | 2 | 2 | 3 |
dgp <- function(i, Z, G) Z[i]/3 + sum(Z[G == G[i]])^2/5 + rnorm(1)
spillover_design <-
declare_model(G = add_level(N = 80),
j = add_level(N = 3, zeros = 0, ones = 1)) +
declare_inquiry(direct = mean(sapply(1:240, # just i treated v no one treated
function(i) { Z_i <- (1:240) == i
dgp(i, Z_i, G) - dgp(i, zeros, G)}))) +
declare_inquiry(indirect = mean(sapply(1:240,
function(i) { Z_i <- (1:240) == i # all but i treated v no one treated
dgp(i, ones - Z_i, G) - dgp(i, zeros, G)}))) +
declare_assignment(Z = complete_ra(N)) +
declare_measurement(
neighbors_treated = sapply(1:N, function(i) sum(Z[-i][G[-i] == G[i]])),
one_neighbor = as.numeric(neighbors_treated == 1),
two_neighbors = as.numeric(neighbors_treated == 2),
Y = sapply(1:N, function(i) dgp(i, Z, G))
) +
declare_estimator(Y ~ Z,
inquiry = "direct",
model = lm_robust,
label = "naive") +
declare_estimator(Y ~ Z * one_neighbor + Z * two_neighbors,
term = c("Z", "two_neighbors"),
inquiry = c("direct", "indirect"),
label = "saturated",
                    model = lm_robust)

You can in principle:
But to estimate effects you still need some SUTVA like assumption.
In this example, if one compared outcomes between treated units and all control units that are at least \(n\) positions away from a treated unit, you would get the wrong answer unless \(n \geq 7\).