Updating on causal quantities
Bayesian methods are simply procedures for updating beliefs in light of new information.
We begin with a prior belief about the probability that a hypothesis is true.
New data then allow us to form a posterior belief about the probability of the hypothesis.
Bayesian inference takes into account both your prior beliefs about a hypothesis and how likely the observed data are under that hypothesis.
I draw a card from a deck and ask: what are the chances it is the Jack of Spades?
Now I tell you that the card is indeed a spade. What would you guess?
What if I told you it was a heart?
What if I said it was a face card and a spade?
These answers are applications of Bayes’ rule.
In each case the answer is derived by assessing what is possible, given the new information, and then assessing how likely the outcome of interest is among the states that are possible. In all cases you calculate:
\[\text{Prob Jack of Spades | Info} = \frac{\text{Is Jack of Spades Consistent w/ Info?}}{\text{How many cards are consistent w/ Info?}} \]
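For instance, a quick check of these counting answers in R (a sketch; the deck is built from scratch here):

# Enumerate a standard 52-card deck and count cards consistent with each piece of information
deck <- expand.grid(rank = c(2:10, "J", "Q", "K", "A"),
                    suit = c("spades", "hearts", "diamonds", "clubs"))
jack_of_spades <- deck$rank == "J" & deck$suit == "spades"

mean(jack_of_spades)                          # no information: 1/52
mean(jack_of_spades[deck$suit == "spades"])   # given it is a spade: 1/13
mean(jack_of_spades[deck$suit == "hearts"])   # given it is a heart: 0
mean(jack_of_spades[deck$rank %in% c("J", "Q", "K") & deck$suit == "spades"])  # face card and spade: 1/3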
You take a test to see whether you suffer from a disease that affects 1 in 100 people. The test is good in the following sense: if you have the disease, it comes back positive 99% of the time; if you do not have the disease, it comes back negative 99% of the time.
The test result says that you have the disease. What are the chances you have it?
It is not 99%. 99% is the probability of the result given the disease, but we want the probability of the disease given the result.
The right answer is 50%, which you can think of as the share of people that have the disease among all those that test positive.
For example, if there were 10,000 people, then 100 would have the disease and 99 of these would test positive. But 9,900 would not have the disease and 99 of these would also test positive. So the people with the disease who test positive are half of the total number testing positive.
(Figure: a collection of black and white shapes. What’s the probability of being a circle given you are black?)
As an equation this might be written:
\[\text{Prob You have the Disease | Pos} = \frac{\text{How many have the disease and test pos?}}{\text{How many people test pos?}}\]
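A quick sanity check of this calculation in R (a sketch, using the numbers from the worked example above):

prevalence     <- 0.01   # 1 in 100 have the disease
sensitivity    <- 0.99   # P(positive | disease)
false_positive <- 0.01   # P(positive | no disease), as in the worked example above

p_pos <- sensitivity * prevalence + false_positive * (1 - prevalence)
sensitivity * prevalence / p_pos   # P(disease | positive) = 0.5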
Consider, finally, an old puzzle described in Gardner (1961): a family has two children and you learn that at least one of them is a boy. What is the probability that both children are boys?
To be explicit about the puzzle, we will assume that the information that one child is a boy is given as a truthful answer to the question “is at least one of the children a boy?”
Assume also that each child is a boy with 50% probability, independently of the other.
As an equation:
\[\text{Prob both boys | Not both girls} = \frac{\text{Prob both boys}}{\text{Prob not both girls}} = \frac{1/4}{3/4} = \frac{1}{3}\]
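You can also check this by simulation (a sketch):

# Simulate many two-child families and condition on "not both girls"
set.seed(1)
families <- matrix(sample(c("boy", "girl"), 2 * 1e5, replace = TRUE), ncol = 2)
n_boys <- rowSums(families == "boy")
mean(n_boys[n_boys >= 1] == 2)   # approximately 1/3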
Can anyone describe the Monty Hall puzzle?
Formally, all of these calculations are applications of Bayes’ rule, a simple and powerful formula for deriving updated beliefs from new data.
The formula is given as:
\[\Pr(H|\mathcal{D})=\frac{\Pr(\mathcal{D}|H)\Pr(H)}{\Pr(\mathcal{D})}\\ =\frac{\Pr(\mathcal{D}|H)\Pr(H)}{\sum_{H'}\Pr(\mathcal{D}|H')\Pr(H')}\]
For continuous distributions and parameter vector \(\theta\):
\[p(\theta|\mathcal{D})=\frac{p(\mathcal{D}|\theta)p(\theta)}{\int p(\mathcal{D}|\theta')p(\theta')\,d\theta'}\]
Consider the share of people in a population that voted. This is a quantity between 0 and 1.
Here the parameter of interest is a share. The Beta and Dirichlet distributions are particularly useful for representing beliefs on shares.
An attractive feature is that if one has a Beta(\(\alpha\), \(\beta\)) prior over the probability of some event, and then one observes a positive case, the Bayesian posterior distribution is also a Beta, with parameters \(\alpha+1, \beta\). Thus if people start with uniform priors and build up knowledge by observing outcomes, their posterior beliefs should be Beta distributions.
Here is a set of such distributions.
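A plot of this kind can be produced with something like the following sketch (the particular \(\alpha, \beta\) values here are illustrative):

library(ggplot2)

# Plot a few Beta densities over a grid of shape parameters (illustrative values)
p <- seq(0.001, 0.999, length.out = 200)
betas <- expand.grid(p = p, alpha = c(0.5, 1, 2, 5), beta = c(0.5, 1, 2, 5))
betas$density <- dbeta(betas$p, betas$alpha, betas$beta)

ggplot(betas, aes(p, density)) +
  geom_line() +
  facet_grid(alpha ~ beta, labeller = label_both) +
  theme_bw()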
The Dirichlet distributions are generalizations of the Beta to the situation in which there are beliefs not just over a proportion, or a probability, but over collections of probabilities.
If four outcomes are possible and outcome \(k\) occurs with probability \(p_k\), \(k=1,2,3,4\), then beliefs are distributions over a three dimensional unit simplex.
The distribution has as many parameters as there are outcomes and these are traditionally recorded in a vector, \(\alpha\).
As with the Beta distribution, an uninformative prior (the Jeffreys prior) has \(\alpha\) parameters of \((.5,.5,.5,\dots)\) and a uniform (“flat”) distribution has \(\alpha = (1,1,1,\dots)\).
The Dirichlet updates in a simple way. If you have a Dirichlet prior with parameter vector \(\alpha = (\alpha_1, \alpha_2, \dots)\) and you observe outcome \(1\), for example, then the posterior distribution is also Dirichlet with parameter vector \(\alpha' = (\alpha_1+1, \alpha_2,\dots)\).
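A small sketch of this conjugate updating, drawing from a Dirichlet by normalizing independent Gamma draws (so no extra packages are needed; the counts are made up for illustration):

# Draw from a Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws
rdirichlet <- function(n, alpha) {
  x <- matrix(rgamma(n * length(alpha), shape = alpha), nrow = n, byrow = TRUE)
  x / rowSums(x)
}

alpha_prior <- c(1, 1, 1, 1)            # flat prior over four outcomes
counts      <- c(5, 0, 2, 1)            # hypothetical observed outcome counts
alpha_post  <- alpha_prior + counts     # conjugate posterior parameters

colMeans(rdirichlet(5000, alpha_post))  # posterior means, close to alpha_post / sum(alpha_post)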
Bayes on a Grid
Now with a strongish prior on 50%:
library(fabricatr)   # for fabricate()
library(ggplot2)

# Grid approximation: Beta(20, 20) prior (red) and posterior after seeing 20 successes in 100 trials
fabricate(N = 100,
          parameters = seq(.01, .99, length = N),
          prior = dbeta(parameters, 20, 20),
          prior = prior / sum(prior),
          likelihood = dbinom(20, 100, parameters),
          posterior = likelihood * prior / sum(likelihood * prior)) |>
  ggplot(aes(parameters, posterior)) + geom_line() + theme_bw() +
  geom_line(aes(parameters, prior), color = "red")
This approach is sound, but if you are dealing with many continuous parameters the grid over the full parameter space gets very large, and the number of calculations grows rapidly with the number of parameters.
Luckily other approaches have been developed.
In this short lecture we introduce stan, fit a simple linear model, extend it to a multilevel model, and then fit a model with parameters drawn from theory.
The good news: There is lots of help online. Start with: https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started
We will jump straight into things and work through a session.
To implement a stan model you write the model code in a text editor and save it as a text file (you can also write it directly in your R script as a character string). You can then read the file into R or point to the file directly when fitting.
I saved a simple model called `one_var.stan` locally. Here it is:
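Something like the following sketch (the exact contents of the file are not shown here, so the variable names are assumptions based on the description below):

// one_var.stan (a sketch reconstructed from the description below, not the original file)
data {
  int<lower=0> N;        // number of observations
  vector[N] X;           // predictor
  vector[N] Y;           // outcome
}
parameters {
  real a;                // intercept
  real b;                // slope
  real<lower=0> sigma;   // residual standard deviation
}
model {
  Y ~ normal(a + b * X, sigma);
}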
The key features here are (read from the bottom up!):

- The model block: `Y` is distributed normal with mean `a + b*X` and standard deviation `sigma`.
- The parameters block: `a`, `b`, and `sigma`.
- The data block: the number of observations `N` and the variables `X` and `Y`.

We feed data to the model in the form of a list. The idea of a list is that the data can include all sorts of objects, not just a single dataset.
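For example, a sketch of how the data list and the call to stan might look (the data here are made up for illustration):

library(rstan)

# Hypothetical data for illustration
N <- 50
X <- rnorm(N)
Y <- -0.2 + 0.7 * X + rnorm(N)

data_list <- list(N = N, X = X, Y = Y)

fit <- stan(file = "one_var.stan", data = data_list)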
When you run the model you get a lot of useful output on the estimation and the posterior distribution. Here though are the key results:
| parameter | mean | sd | Rhat |
|---|---|---|---|
| a | -0.179 | 0.214 | 1 |
| b | 0.738 | 0.183 | 1 |
| sigma | 0.950 | 0.175 | 1 |
These look good.
The Rhat at the end tells you about convergence. You want this very close to 1.
The model output contains the full posterior distribution.
With the full posterior you can look at marginal posterior distributions over arbitrary transformations of parameters.
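For example (a sketch, assuming the fitted object is called `fit`):

draws <- rstan::extract(fit)   # a list of posterior draws for each parameter

mean(draws$b > 0)                               # posterior probability that b is positive
quantile(draws$b / draws$sigma, c(.025, .975))  # credible interval for an arbitrary transformation
hist(draws$a + draws$b)                         # posterior for the expected outcome at X = 1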
Let’s go back to the code.
There we had three key blocks: `data`, `parameters`, and `model`.

More generally the blocks you can specify are:

- `data` (define the variables that will be coming in from the data list)
- `transformed data` (can be used for preprocessing)
- `parameters` (required: defines the parameters to be estimated)
- `transformed parameters` (transformations of parameters, useful for computational reasons and sometimes for clarity)
- `model` (give priors and likelihood)
- `generated quantities` (can be used for post processing)

The parameters block declared the set of parameters that we wanted to estimate. In the simple model these were `a`, `b`, and `sigma`. Note that in the declaration we also specified a type for each parameter (`real`) and constrained `sigma` to be positive.
Instead of defining `a` and `b` as separate real parameters, we could have defined a single vector, say `vector[2] coef;`, and then referenced `coef[1]` and `coef[2]` in the model block.

Or we could also have imposed the constraint that the slope coefficient is positive by defining it with a lower bound, for example `real<lower=0> b;`.
In the model block we give the likelihood, for example `Y ~ normal(a + b * X, sigma);`.

But we can also give the priors (if we want to). If priors are not provided, flat (possibly improper) priors are assumed.

In our case, for example, we could have provided a tight prior on `b` centered on -10. That will surely matter for our conclusions. Let's try it:
This time I will write the model right in the editor:
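A sketch of what that might look like; the exact prior scale used in the lecture is not shown, so the `normal(-10, 1)` prior here is an assumption:

one_var_prior <- '
data {
  int<lower=0> N;
  vector[N] X;
  vector[N] Y;
}
parameters {
  real a;
  real b;
  real<lower=0> sigma;
}
model {
  b ~ normal(-10, 1);   // assumed prior scale, for illustration only
  Y ~ normal(a + b * X, sigma);
}
'

# Fit using the same data list as before
fit_prior <- rstan::stan(model_code = one_var_prior, data = data_list)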
| parameter | mean | sd | Rhat |
|---|---|---|---|
| a | -1.338 | 2.444 | 1.003 |
| b | -7.875 | 1.172 | 1.003 |
| sigma | 10.988 | 2.499 | 1.001 |
Note that we get a much lower estimate for `b` with the same data.
Now imagine a setting in which there are 10 villages, each with 10 respondents. Half in each village are assigned to treatment \(X=1\), and half to control \(X=0\).
Say that outcomes have a village-specific average: \(Y_v = a_v + b_vX\), where \(a_v\) and \(b_v\) are each drawn from some distribution with a mean and variance of interest. The individual outcomes are draws from a village-level distribution centered on the village-specific average outcome.
This all implies a multilevel structure.
Here is a model for this:
ml_model <- '
data {
  vector[100] Y;
  int<lower=0,upper=1> X[100];
  int village[100];
}
parameters {
  vector<lower=0>[3] sigma;   // sds of village intercepts, village slopes, and outcomes
  vector[10] a;               // village intercepts
  vector[10] b;               // village slopes
  real mu_a;                  // average intercept
  real mu_b;                  // average effect
}
transformed parameters {
  vector[100] Y_vx;           // expected outcome for each unit
  for (i in 1:100) Y_vx[i] = a[village[i]] + b[village[i]] * X[i];
}
model {
  a ~ normal(mu_a, sigma[1]);
  b ~ normal(mu_b, sigma[2]);
  Y ~ normal(Y_vx, sigma[3]);
}
'
Here is a slightly more general version: https://github.com/stan-dev/example-models/blob/master/ARM/Ch.17/17.1_radon_vary_inter_slope.stan
Let's create some multilevel data. Looking at this, can you tell what the typical village-level effect is? How much heterogeneity is there?
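A sketch of how such data might be created and the model fit (the true parameter values here are assumptions chosen for illustration):

set.seed(1)
n_villages <- 10
n_per      <- 10

village <- rep(1:n_villages, each = n_per)
X       <- rep(rep(0:1, each = n_per / 2), n_villages)   # half treated in each village
a_v     <- rnorm(n_villages, mean = 0, sd = 0.5)          # village intercepts (assumed values)
b_v     <- rnorm(n_villages, mean = 2, sd = 0.5)          # village slopes (assumed values)
Y       <- rnorm(length(X), a_v[village] + b_v[village] * X, sd = 1)

ml_fit <- rstan::stan(model_code = ml_model,
                      data = list(Y = Y, X = X, village = village))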
| parameter | mean | sd | Rhat |
|---|---|---|---|
| mu_a | -0.29 | 0.25 | 1.00 |
| mu_b | 1.89 | 0.26 | 1.00 |
| sigma[1] | 0.61 | 0.25 | 1.00 |
| sigma[2] | 0.47 | 0.28 | 1.01 |
| sigma[3] | 0.98 | 0.08 | 1.00 |
Parameters drawn from theory
Say that a set of people in a population are playing sequential prisoner’s dilemmas.
In such games selfish behavior might suggest defections by everyone everywhere. But of course people often cooperate. Why might this be?
We will capture some of this intuition with a behavioral type model in which players vary in a rationality propensity \(r_i\).
In all, this means that a player with propensity \(r_i>.5\) will cooperate with probability \(1-r_i\); a player with propensity \(r_i<.5\) will cooperate with probability \(1\).
Interestingly the not-very-rational people sometimes cooperate strategically but the really rational people never cooperate strategically because they think it won’t work.
What then are the probabilities of each of the possible outcomes?
where \(p\) is the density function on \(r_i\) given \(\theta\)
Given the assumption on \(p\)
We have data on the actions of the first movers and the second movers and are interested in the distribution of the \(r_i\)s.
Let's collapse the data into a simple list of the number of each type of game outcome:
And say we start off with a uniform prior on \(\theta\).
What should we conclude about \(\theta\)?
Here’s a model:
Note we define event weights as transformed parameters on a simplex. We also constrain \(\theta\) to be \(>.5\). Obviously we are relying a lot on our model.
What is the probability of observing strategic first round cooperation?
A player with rationality \(r_i\) will cooperate strategically with probability \(r_i\) if \(r_i<.5\) and 0 otherwise. Thus we are interested in \(\int_0^{.5}r_i/\theta dr_i = .125/\theta\)
CausalQueries

`CausalQueries` brings these elements together by allowing users to declare a causal model, update it with data, and pose causal queries. Under the hood:

- `CausalQueries` figures out all principal strata and places a prior on these
- `CausalQueries` writes a stan model and updates on all parameters
- `CausalQueries` figures out which parameters correspond to a given causal query

Here, for example, is a dataset on three binary variables, Z, X, and Y, in compact form:

| event | strategy | count |
|---|---|---|
| Z0X0Y0 | ZXY | 158 |
| Z1X0Y0 | ZXY | 52 |
| Z0X1Y0 | ZXY | 0 |
| Z1X1Y0 | ZXY | 23 |
| Z0X0Y1 | ZXY | 14 |
| Z1X0Y1 | ZXY | 12 |
| Z0X1Y1 | ZXY | 0 |
| Z1X1Y1 | ZXY | 78 |
Note that in compact form we simply record the number of units (“count”) that display each possible pattern of outcomes on the three variables (“event”).[^1]
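A sketch of the kind of code involved (the model statement is an assumption based on the queries reported below, and `df` is assumed to be unit-level data with binary columns Z, X, Y):

library(CausalQueries)

# Declare a model: Z instruments X, X affects Y, and X and Y may be confounded (assumed structure)
model <- make_model("Z -> X -> Y; X <-> Y")

# Update the model's parameters using the data
model <- update_model(model, data = df)

# Query the posterior: the average effect of X on Y, and the effect among compliers
query_model(model, "Y[X=1] - Y[X=0]", using = "posteriors")
query_model(model, "Y[X=1] - Y[X=0]", given = "X[Z=1] > X[Z=0]", using = "posteriors")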
| query | given | mean | sd | cred.low.2.5% | cred.high.97.5% |
|---|---|---|---|---|---|
| Y[X=1] - Y[X=0] | - | 0.55 | 0.10 | 0.37 | 0.73 |
| Y[X=1] - Y[X=0] | X==0 & Y==0 | 0.64 | 0.15 | 0.37 | 0.89 |
| Y[X=1] - Y[X=0] | X[Z=1] > X[Z=0] | 0.70 | 0.05 | 0.59 | 0.80 |