Bayes

Macartan Humphreys

1 Bayesian approaches

Updating on causal quantities

1.1 Outline

  1. Bayesian reasoning
  2. Bayesian calculations by hand
  3. stan
  4. A simple structural model with stan
  5. CausalQueries

1.2 Bayes reasoning

  • Bayesian methods are just sets of procedures to figure out how to update beliefs in light of new information.

  • We begin with a prior belief about the probability that a hypothesis is true.

  • New data then allow us to form a posterior belief about the probability of the hypothesis.

1.2.1 Bayes Rule

Bayesian inference takes into account:

  • the consistency of the evidence with a hypothesis
  • the uniqueness of the evidence to that hypothesis
  • background knowledge about the problem.

1.2.2 Illustration 1

I draw a card from a deck and ask: what are the chances it is the Jack of Spades?

  • Just 1 in 52.

Now I tell you that the card is indeed a spade. What would you guess?

  • 1 in 13

What if I told you it was a heart?

  • No chance it is the Jack of Spades

What if I said it was a face card and a spade?

  • 1 in 3.

1.2.3 Illustration 1

These answers are applications of Bayes’ rule.

In each case the answer is derived by assessing what is possible given the new information, and then assessing how likely the outcome of interest is among the states that are possible. In all the cases you calculate:

\[\text{Prob Jack of Spades | Info} = \frac{\text{Is Jack of Spades Consistent w/ Info?}}{\text{How many cards are consistent w/ Info?}} \]
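
The same counting logic can be checked by brute force in R; a minimal sketch (the deck here is built on the fly, purely for illustration):

# Enumerate a 52-card deck and count cards consistent with each piece of information
deck <- expand.grid(rank = c(2:10, "J", "Q", "K", "A"),
                    suit = c("spades", "hearts", "diamonds", "clubs"))

jack_of_spades <- deck$rank == "J" & deck$suit == "spades"

# Prob(Jack of Spades | spade): 1 card out of 13
mean(jack_of_spades[deck$suit == "spades"])

# Prob(Jack of Spades | face card and spade): 1 card out of 3
face_card <- deck$rank %in% c("J", "Q", "K")
mean(jack_of_spades[face_card & deck$suit == "spades"])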

1.2.4 Illustration 2 Interpreting Your Test Results

You take a test to see whether you suffer from a disease that affects 1 in 100 people. The test is good in the following sense:

  • if you have the disease, then with a 99% probability it will say you have the disease
  • if you do not have it, then with a 99% probability, it will say that you do not have it

The test result says that you have the disease. What are the chances you have it?

1.2.5 Illustration 2 Interpreting Your Test Results

  • It is not 99%. 99% is the probability of the result given the disease, but we want the probability of the disease given the result.

  • The right answer is 50%, which you can think of as the share of people that have the disease among all those that test positive.

  • For example, if there were 10,000 people, then 100 would have the disease and 99 of these would test positive. But 9,900 would not have the disease and 99 of these would also test positive. So the people with the disease that test positive are half of the total number testing positive.
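
A quick check of this arithmetic in R, using the same illustrative numbers (10,000 people, 1% prevalence, a 99% accurate test):

n           <- 10000
prevalence  <- 0.01
sensitivity <- 0.99   # Prob(test positive | disease)
specificity <- 0.99   # Prob(test negative | no disease)

true_positives  <- n * prevalence * sensitivity              # 99 people
false_positives <- n * (1 - prevalence) * (1 - specificity)  # 99 people

# Prob(disease | positive test)
true_positives / (true_positives + false_positives)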

1.2.6 Illustration 2. A picture

What’s the probability of being a circle given you are black?

1.2.7 Illustration 2. More formally.

As an equation this might be written:

\[\text{Prob You have the Disease | Pos} = \frac{\text{How many have the disease and test pos?}}{\text{How many people test pos?}}\]

1.2.8 Two Child Problem

Consider last an old puzzle described in Gardner (1961).

  • Mr Smith has two children, \(A\) and \(B\).
  • At least one of them is a boy.
  • What are the chances they are both boys?

To be explicit about the puzzle, we will assume that the information that one child is a boy is given as a truthful answer to the question “is at least one of the children a boy?”

Assume also that each child is a boy with 50% probability, independently of the other.

1.2.9 Two Child Problem

As an equation:

\[\text{Prob both boys | Not both girls} = \frac{\text{Prob both boys}}{\text{Prob not both girls}} = \frac{\text{1 in 4}}{\text{3 in 4}}\]
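
Again this can be checked by enumerating the (assumed equally likely) possibilities in R:

# All four equally likely sex combinations for children A and B
children <- expand.grid(A = c("boy", "girl"), B = c("boy", "girl"))

both_boys    <- children$A == "boy" & children$B == "boy"
at_least_one <- children$A == "boy" | children$B == "boy"

# Prob(both boys | at least one boy) = (1/4)/(3/4) = 1/3
mean(both_boys[at_least_one])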

1.2.10 Monty Hall

Can anyone describe the Monty Hall puzzle?

1.2.11 Bayes Rule

Formally, all of these equations are applications of Bayes’ rule which is a simple and powerful formula for deriving updated beliefs from new data.

The formula is given as:

\[\Pr(H|\mathcal{D})=\frac{\Pr(\mathcal{D}|H)\Pr(H)}{\Pr(\mathcal{D})}\\ =\frac{\Pr(\mathcal{D}|H)\Pr(H)}{\sum_{H'}\Pr(\mathcal{D}|H')\Pr(H')}\]

1.2.12 Bayes Rule

Formally, all of these equations are applications of Bayes’ rule which is a simple and powerful formula for deriving updated beliefs from new data.

For continuous distributions and parameter vector \(\theta\):

\[p(\theta|\mathcal{D})=\frac{p(\mathcal{D}|\theta)p(\theta)}{\int_{\theta'}p(\mathcal{D}|\theta')p(\theta')d\theta'}\]

1.2.13 Useful Distributions: Beta and Dirichlet Distributions

  • Bayes’ rule requires the ability to express a prior distribution, but it does not require that the prior have any particular properties beyond being a probability distribution.
  • Sometimes however it can be useful to make use of “off the shelf” distributions.

Consider the share of people in a population that voted. This is a quantity between 0 and 1.

1.2.14 Useful Distributions: Beta and Dirichlet Distributions

  • Two people may both believe that the turnout was around 50% but differ in how certain they are about this claim.
  • One might claim to have no information and to believe any turnout rate between 0 and 100% is equally likely; another might be completely confident that the number is 50%.

Here the parameter of interest is a share. The Beta and Dirichlet distributions are particularly useful for representing beliefs on shares.

1.2.15 Beta

  • The Beta distribution is a distribution over the \([0,1]\) interval that is governed by two parameters, \(\alpha\) and \(\beta\).
  • In the case in which both \(\alpha\) and \(\beta\) are 1, the distribution is uniform – all values are seen as equally likely.
  • As \(\alpha\) rises, larger outcomes are seen as more likely.
  • As \(\beta\) rises, lower outcomes are seen as more likely.
  • If both rise proportionately the expected outcome does not change but the distribution becomes tighter.

An attractive feature is that if one has a Beta(\(\alpha\), \(\beta\)) prior over the probability of some event, and then one observes a positive case, the Bayesian posterior distribution is also a Beta, with parameters \(\alpha+1, \beta\). Thus if people start with uniform priors and build up knowledge by observing outcomes, their posterior beliefs should be Beta.
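
As a small illustration of this conjugacy (a sketch: the Beta(2, 2) prior is arbitrary and, as elsewhere in these notes, the fabricatr and ggplot2 packages are assumed to be loaded):

fabricate(N = 99,
          p = seq(.01, .99, length = N),
          prior = dbeta(p, 2, 2),             # arbitrary Beta(2, 2) prior
          posterior = dbeta(p, 2 + 1, 2)) |>  # Beta(3, 2) after one positive case
  ggplot(aes(p, posterior)) + geom_line() + theme_bw() +
  geom_line(aes(p, prior), color = "red")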

1.2.16 Beta

Here is a set of such distributions.

[Figure: Beta distributions]

1.2.17 Dirichlet distributions.

The Dirichlet distributions are generalizations of the Beta to the situation in which there are beliefs not just over a proportion, or a probability, but over collections of probabilities.

  • If four outcomes are possible and each occurs with probability \(p_k\), \(k=1,2,3,4\), then beliefs are distributions over a three-dimensional unit simplex.

  • The distribution has as many parameters as there are outcomes and these are traditionally recorded in a vector, \(\alpha\).

  • As with the Beta distribution, an uninformative prior (the Jeffreys prior) has \(\alpha\) parameters of \((.5,.5,.5, \dots)\) and a uniform (“flat”) distribution has \(\alpha = (1,1,1,\dots)\).

  • The Dirichlet updates in a simple way. If you have a Dirichlet prior with parameter \(\alpha = (\alpha_1, \alpha_2, \dots)\) and you observe outcome \(1\), for example, then the posterior distribution is also Dirichlet with parameter vector \(\alpha' = (\alpha_1+1, \alpha_2,\dots)\).
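
A sketch of this updating rule in base R (the observed counts here are invented purely for illustration):

alpha_prior <- c(1, 1, 1, 1)   # flat Dirichlet prior over four outcomes
counts      <- c(5, 0, 2, 1)   # hypothetical observed counts of each outcome

alpha_posterior <- alpha_prior + counts

# Posterior means of the outcome probabilities
alpha_posterior / sum(alpha_posterior)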

1.3 Bayes by hand

Bayes on a Grid

1.3.1 Bayes by hand

  • The simplest and most intuitive way to do Bayesian estimation is just to apply the formula over a grid of possible values
  • This becomes too hard once your parameter space grows, but it is worth working through the logic by hand to get a feel for Bayes

1.3.2 Bayes by hand

  • Let’s say that we want to figure out the share of women in some population.
  • We start off with a flat prior over all possible numbers
  • We draw a sample from the population: 100 people of which 20 are women
  • What’s our posterior?

1.3.3 Bayes by hand

library(fabricatr)   # for fabricate()
library(ggplot2)

fabricate(N = 100,
          parameters = seq(.01, .99, length = N),
          likelihood = dbinom(20, 100, parameters),
          posterior = likelihood/sum(likelihood)) |>
  ggplot(aes(parameters, posterior)) + geom_line() + theme_bw()
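
The same grid can also be used to summarize the posterior; for example (storing the grid in an object, grid_posterior, just for illustration):

grid_posterior <- fabricate(N = 100,
                            parameters = seq(.01, .99, length = N),
                            likelihood = dbinom(20, 100, parameters),
                            posterior = likelihood/sum(likelihood))

# Posterior mean
with(grid_posterior, sum(parameters * posterior))

# Approximate 95% credible interval read off the grid
with(grid_posterior,
     parameters[c(which.min(abs(cumsum(posterior) - .025)),
                  which.min(abs(cumsum(posterior) - .975)))])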

1.3.4 Bayes by hand

Now with a strongish prior on 50%:

fabricate(N = 100,
          parameters = seq(.01, .99, length = N),
          prior = dbeta(parameters, 20, 20), 
          prior = prior/sum(prior),
          likelihood = dbinom(20, 100, parameters),
          posterior = likelihood*prior/sum(likelihood*prior)) |>
  ggplot(aes(parameters, posterior)) + geom_line() + theme_bw() +
  geom_line(aes(parameters, prior), color = "red")

1.3.5 Bayes by hand

  • This approach is sound, but if you are dealing with many continuous parameters, the full parameter space can get very large and so the number of calculations you do increases rapidly.

  • Luckily other approaches have been developed.

1.4 Stan

1.4.1 Plan

In this short lecture we:

  • fire up stan
  • implement a simple linear model and talk through the main model blocks
  • implement a simple hierarchical model
  • describe a behavioral game and set up a model to recover some parameters of interest, given the game

1.4.2 Getting set up

The good news: There is lots of help online. Start with: https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started

We will jump straight into things and work through a session.

  1. Install the rstan package and load it. It is useful to set options so that multiple cores are used:
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

1.4.3 One variable model: Simple example

  1. Now let’s consider the simplest one-variable linear model.
    • We will need model code
    • And data

1.4.4 A simple model: Code

To implement a stan model you can write the code in a text editor and save it as a .stan file, or you can write it directly in your script as a character string. You can then point stan at the file or pass it the string directly.

1.4.5 A simple model: Code

I saved a simple model called one_var.stan locally. Here it is:

readLines("assets/one_var.stan", warn = FALSE) |>
  cat(sep = "\n")
data {
  int<lower=0> N;
  vector[N] Y;
  vector[N] X;
}
parameters {
  real a;
  real b;
  real<lower=0> sigma;
}
model {
  Y ~ normal(a + b * X, sigma);
}

1.4.6 A simple model: Code

The key features here are (read from bottom up!):

  • \(Y\) is assumed to be normally distributed with mean a + bX and standard deviation sigma.
  • There are then three parameters: a, b, sigma.
  • There are no priors placed on these, but sigma is constrained to be positive. Without explicit priors, (possibly improper) flat priors are assumed.
  • Stan expects a data list that contains three things: a scalar N and the X and Y data vectors.

1.4.7 Simple model: Data

We feed data to the model in the form of a list. The idea of a list is that the data can include all sorts of objects, not just a single dataset.

X = rnorm(20)

some_data <- list(
 N = 20,
 X = X,
 Y = X + rnorm(20)
 )

1.4.8 Simple model: Now Let’s Run It

M <- stan(file = "assets/one_var.stan", 
          data = some_data)

When you run the model you get a lot of useful output on the estimation and the posterior distribution. Here though are the key results:

|       | mean   | sd    | Rhat |
|-------|--------|-------|------|
| a     | -0.179 | 0.214 | 1    |
| b     | 0.738  | 0.183 | 1    |
| sigma | 0.950  | 0.175 | 1    |

These look good.

The Rhat at the end tells you about convergence. You want this very close to 1.
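
A couple of standard rstan calls to pull these summaries (including Rhat) from the fitted object yourself:

# Compact summary of selected parameters
print(M, pars = c("a", "b", "sigma"))

# Or extract the full summary table as a matrix
summary(M)$summary[, c("mean", "sd", "Rhat")]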

1.4.9 A simple model: Now let’s use it

The model output contains the full posterior distribution.

my_posterior <- M |> extract() |> data.frame() 

my_posterior |> ggplot(aes(a,b)) + geom_point() + theme_bw()

1.4.10 A simple model: Now let’s use it

With the full posterior you can look at marginal posterior distributions over arbitrary transformations of parameters.

summary((my_posterior$a + my_posterior$b)/my_posterior$a) |> round(2)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2528.86    -3.83    -1.45     0.43    -0.16  6552.41 

1.4.11 Building up

Let’s go back to the code.

There we had three key blocks: data, parameters, and model

More generally the blocks you can specify are:

  • data (define the vars that will be coming in from the data list)
  • transformed data (can be used for preprocessing)
  • parameters (required: defines the parameters to be estimated)
  • transformed parameters (transformations of parameters useful for computational reasons and sometimes for clarity)
  • model (give priors and likelihood)
  • generated quantities (can be used for post processing)
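
For example, a generated quantities block could be appended to the simple model file to have stan return the transformed quantity we computed by hand earlier, rather than post-processing the draws in R (a sketch):

generated quantities {
  real ratio;               // the (a + b)/a quantity computed by hand above
  ratio = (a + b) / a;
}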

1.4.12 Parameters block

The parameters block declares the set of parameters that we want to estimate. In the simple model these were a, b, and sigma. Note that in the declaration we also:

  • said what kind of objects the parameters are (vectors, matrices, simplices, etc.)
  • gave possible constraints

1.4.13 Parameters block

Instead of defining:

real a;
real b;

We could have defined

vector[2] coefs;

and then referenced coefs[1] and coefs[2] in the model block.

1.4.14 Parameters block

Or we could have imposed the constraint that the slope coefficient is positive by defining:

real a;
real<lower = 0> b;

1.4.15 Model Block

In the model block we give the likelihood

But we can also give the priors (if we want to). If priors are not provided, flat (possibly improper) priors are assumed

In our case for example we could have provided something like

model {
  b ~ normal(-10, 1);
  Y ~ normal(a + b * X, sigma);
}

This suggests that we start off believing that b is centered on -10. That will surely matter for our conclusions. Let’s try it:

1.4.16 Version 2

This time I will write the model right in the editor:

new_model <- '
data {
  int<lower=0> N;
  vector[N] Y;
  vector[N] X;
}
parameters {
  real a;
  real b;
  real<lower=0> sigma;
}
model {
  b ~ normal(-10,1);
  Y ~ normal(a + b * X, sigma);
}
'

1.4.17 Estimation 2

M2 <- stan(model_code = new_model, data = some_data)
|       | mean   | sd    | Rhat  |
|-------|--------|-------|-------|
| a     | -1.338 | 2.444 | 1.003 |
| b     | -7.875 | 1.172 | 1.003 |
| sigma | 10.988 | 2.499 | 1.001 |

Note that we get a much lower estimate for b with the same data.

1.4.18 A multilevel model

  • Now imagine a setting in which there are 10 villages, each with 10 respondents. Half in each village are assigned to treatment \(X=1\), and half to control \(X=0\).

  • Say that average outcomes are village-specific: \(Y_v = a_v + b_v X\), where \(a_v\) and \(b_v\) are each drawn from some distribution with a mean and variance of interest. The individual outcomes are draws from a village-level distribution centered on the village-specific average outcome.

  • This all implies a multilevel structure.

1.4.19 A multilevel model

Here is a model for this

ml_model <- '
data {
  vector[100] Y;
  int<lower=0,upper=1> X[100];
  int village[100];
}
parameters {
  vector<lower=0>[3] sigma; 
  vector[10] a;
  vector[10] b;
  real mu_a;
  real mu_b;
}
transformed parameters {
  vector[100] Y_vx;
  for (i in 1:100) Y_vx[i] = a[village[i]] + b[village[i]] * X[i];
}
model {
  a ~ normal(mu_a, sigma[1]);
  b ~ normal(mu_b, sigma[2]);
  Y ~ normal(Y_vx, sigma[3]);
}
'

1.4.20 A multilevel model

Here is a slightly more general version: https://github.com/stan-dev/example-models/blob/master/ARM/Ch.17/17.1_radon_vary_inter_slope.stan

1.4.21 Multilevel model: Data

Let’s create some multilevel data. Looking at this, can you tell what the typical village-level effect is? How much heterogeneity is there?

village   <- rep(1:10, each = 10)
village_b <- 1 + rnorm(10)
X         <- rep(0:1, 50)
Y         <- village_b[village]*X + rnorm(100)

ml_data <- list(
  village = village,
  X = X, 
  Y = Y)

1.4.22 Multilevel Results

M_ml <- stan(model_code = ml_model, data = ml_data)
|          | mean  | sd   | Rhat |
|----------|-------|------|------|
| mu_a     | -0.29 | 0.25 | 1.00 |
| mu_b     | 1.89  | 0.26 | 1.00 |
| sigma[1] | 0.61  | 0.25 | 1.00 |
| sigma[2] | 0.47  | 0.28 | 1.01 |
| sigma[3] | 0.98  | 0.08 | 1.00 |
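
The village-level slope estimates themselves can be pulled out of the posterior draws in the same way as before; for example (using the M_ml fit above):

# Posterior means of the 10 village-specific slopes b[1], ..., b[10]
extract(M_ml)$b |> apply(2, mean)

# Compare with the true village_b values used to simulate the data
village_b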

1.5 A game and a structural model

parameters drawn from theory

1.5.1 A game and a structural model

Say that a set of people in a population are playing sequential prisoner’s dilemmas.

In such games selfish behavior might suggest defections by everyone everywhere. But of course people often cooperate. Why might this be?

  • One possible reason is that some people are irrational, in the sense that they simply choose to cooperate, ignoring the payoffs.
  • Another possibility is that rational people think that others are irrational, in the sense that they think that others will reciprocate when they observe cooperative action

1.5.2 Model

We will capture some of this intuition with a behavioral type model in which

  • each player has a “rationality” propensity of \(r_i\) – this is the probability with which they choose to do the rational thing, rather than the generous thing
  • \(r_i \sim U[0, \theta]\) for \(\theta > .5\).
  • A player with rationality propensity of \(r_i\) believes \(r_j \sim U[0, r_i]\). So everyone assumes that they are the most rational people in the room…
  • The game is such that:
  • second mover: a second mover with rationality propensity \(r_i\) will cooperate with probability \(1-r_i\) if the first mover cooperated; otherwise they defect
  • first mover: a first mover with \(r_i\) will cooperate nonstrategically with probability \((1-r_i)\); however with probability \(r_i\) they will also cooperate strategically if they think that the second mover has \(r_j<.25\).

1.5.3 Expectations from model

  • In all, this means that a player with propensity \(r_i>.5\) will cooperate with probability \(1-r_i\); a player with propensity \(r_i<.5\) will cooperate with probability \(1\).

  • Interestingly the not-very-rational people sometimes cooperate strategically but the really rational people never cooperate strategically because they think it won’t work.

1.5.4 Event Probabilities

What then are the probabilities of each of the possible outcomes?

  • There will be cooperation by both players with probability \((\int_0^{.5} p(r_i) dr_i + \int_{.5}^1 p(r_i)(1-r_i) dr_i)\int_0^1p(r_i)(1-r_i)dr_i\)
  • There will be cooperation by player 1 only with probability \((\int_0^{.5} p(r_i) dr_i + \int_{.5}^1 p(r_i)(1-r_i) dr_i)(\int_0^1p(r_i)(r_i)dr_i)\)
  • There will be cooperation by neither with probability: \(1-\int_0^{.5} p(r_i) dr_i - \int_{.5}^1 p(r_i)(1-r_i) dr_i\)

where \(p\) is the density function on \(r_i\) given \(\theta\)

1.5.5 Event probabilities

Given the assumption on \(p\)

  • There will be cooperation by both players with probability \((1+.25/\theta -.5\theta)(1-.5\theta)\)
  • There will be cooperation by player 1 only with probability \((1+.25/\theta -.5\theta)(.5\theta)\)
  • There will be cooperation by neither player with probability \((.5\theta-.25/\theta)\)

1.5.6 Data

  • We have data on the actions of the first movers and the second movers and are interested in the distribution of the \(r_i\)s.

  • Let’s collapse the data into a simple list recording the number of each type of game outcome.

  • And say we start off with a uniform prior on \(\theta\).

  • What should we conclude about \(\theta\)?

1.5.7 Model

Here’s a model:

game_model <- '
data {
  int<lower=0> play[3];
}
parameters {
  real<lower=.5, upper=1> theta;
}
transformed parameters {
simplex[3] w;
 w[1] = (1+.25*theta - .5*theta)*(1-.5*theta);
 w[2] = (1+.25*theta - .5*theta)*(.5*theta);
 w[3] = (-.25*theta  + .5*theta);
}
model {
  play ~ multinomial(w);
}
'

1.5.8 Model

Note that we define the event weights as transformed parameters on a simplex. We also constrain \(\theta\) to lie between .5 and 1. Obviously we are relying a lot on our model.

1.5.9 Plot posterior on \(\theta\)

M3 <- stan(model_code = game_model,  
           data = list(play = c(10,10,10)))

1.5.10 Plot posterior on \(\theta\)

M4 <- stan(model_code = game_model,  
           data = list(play = c(20,6,4)))

1.5.11 Posterior on a quantity of interest

What is the probability of observing strategic first round cooperation?

A player with rationality \(r_i\) will cooperate strategically with probability \(r_i\) if \(r_i<.5\) and 0 otherwise. Thus we are interested in \(\int_0^{.5}r_i/\theta dr_i = .125/\theta\)
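
Given posterior draws of \(\theta\) from one of the fits above (say M3), the posterior for this quantity follows directly:

# Posterior draws of the probability of strategic first-round cooperation
strategic_coop <- 0.125 / extract(M3)$theta

mean(strategic_coop)
quantile(strategic_coop, c(.025, .975))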

1.6 CausalQueries

CausalQueries brings these elements together

1.6.1 Big picture

CausalQueries brings these elements together by allowing users to:

  1. Specify a DAG: CausalQueries figures out all principal strata and places a prior on these
  2. Provide data to the DAG: CausalQueries writes a stan model and updates on all parameters
  3. Query model: CausalQueries figures out which parameters correspond to a given causal query

1.6.2 Illustration: “Lipids” data

library(CausalQueries)
library(knitr)   # for kable()

data("lipids_data")

lipids_data |> kable()
| event  | strategy | count |
|--------|----------|-------|
| Z0X0Y0 | ZXY      | 158   |
| Z1X0Y0 | ZXY      | 52    |
| Z0X1Y0 | ZXY      | 0     |
| Z1X1Y0 | ZXY      | 23    |
| Z0X0Y1 | ZXY      | 14    |
| Z1X0Y1 | ZXY      | 12    |
| Z0X1Y1 | ZXY      | 0     |
| Z1X1Y1 | ZXY      | 78    |

Note that in compact form we simply record the number of units (“count”) that display each possible pattern of outcomes on the three variables (“event”).

1.6.3 Model

model <- make_model("Z -> X -> Y; X <-> Y") 
model |> plot()

1.6.4 Updating and querying

model |>
  update_model(lipids_data, refresh = 0) |>
  query_model(query = "Y[X=1] - Y[X=0]",
              given = c("All",  "X==0 & Y==0", "X[Z=1] > X[Z=0]"),
              using = "posteriors") 
Table 1: Replication of Chickering and Pearl (1996).

| query           | given           | mean | sd   | cred.low.2.5% | cred.high.97.5% |
|-----------------|-----------------|------|------|---------------|-----------------|
| Y[X=1] - Y[X=0] | -               | 0.55 | 0.10 | 0.37          | 0.73            |
| Y[X=1] - Y[X=0] | X==0 & Y==0     | 0.64 | 0.15 | 0.37          | 0.89            |
| Y[X=1] - Y[X=0] | X[Z=1] > X[Z=0] | 0.70 | 0.05 | 0.59          | 0.80            |

Gardner, Martin. 1961. The Second Scientific American Book of Mathematical Puzzles and Diversions. New York: Simon & Schuster.