Summary: Knox, Lowe, and Mummolo argue that “traditional analyses understate racial bias in police violence.” They bring attention to the important but underappreciated point that conditioning analysis on data from the set of police stops can introduce “collider bias.” In a dispute with Gaebler et al. they dismiss a condition that ensures no such bias as “knife edge.” Interestingly, however, the condition is less an edge than a border: on the other side of the border, bias can go in the opposite direction, and one type of discrimination can be overestimated while another type is even more severely underestimated.

I unpack this all below but emphasize right away that the point here is not that Knox et al. are wrong, but rather that the overall conclusions ride heavily on an assumption that has escaped attention in this debate.

The methodological takeaways are:

  1. Whether selecting on stops leads to over- or underestimation of discrimination depends on the sign of the collider bias, which in turn depends on whether race and the unobserved feature are (log) complements in producing stops.

  2. Knox, Lowe, and Mummolo’s Assumption 3 effectively rules out positive collider bias; on the other side of that boundary the naive estimate can overstate discrimination, even for all of the estimands considered here.

1 Model recap

We are imagining a model in which being stopped by the police (\(M=1\)), and thus entering administrative data, depends on race (\(D\)) as well as some unobserved feature of the situation (\(U\)). Both \(D\) and \(U\) might also affect whether force gets used if you are stopped.1 \(U\) might be how suspicious you are perceived to be, or it might be something about the context or location, or it might be something about the police you are engaging with. For this note I will assume that \(U\) is binary.

The conditional distributions of the variables are as follows:

  1. \(M\) depends on \(D\) and \(U\). Let \(p_{du}\) denote the probability of being stopped if \(D=d\) and \(U=u\).

  2. In the case in which \(m=1\), \(Y\) also depends on \(D\) and \(U\) (otherwise, if \(m=0\), \(Y=0\)). Let \(y_{du}\) denote the average value of \(Y\) if \(D=d\) and \(U=u\) among the cases with \(m=1\).

  3. We also need distributions on \(D\) and \(U\). For simplicity let’s assume that \(\Pr(D=1) = \Pr(U=1) = .5\), and note that \(D\) and \(U\) are (unconditionally) independent.

We also assume that \(M\) is monotonic in \(D\): anyone who would be stopped when \(D=0\) would also be stopped when \(D=1\). That’s Assumption 2 in Knox, Lowe, and Mummolo. It’s a very strong assumption and imposes structure beyond what is available from knowledge of these distributions.

That’s the model.
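To fix ideas, here is a minimal simulation of this data generating process (a sketch: `simulate_model` and its defaults are just for illustration; the default values match the parameter values used later in this note).

# Simulate the model: D and U are independent coin flips, M depends on both
# with probabilities p_du, and Y is only realized when M = 1
simulate_model <- function(n = 1e5,
                           p = c(p00 = .05, p10 = .15, p01 = .1, p11 = .3),
                           y = c(y00 = .1, y10 = .3, y01 = .2, y11 = .3)){
  D <- rbinom(n, 1, .5)                        # race
  U <- rbinom(n, 1, .5)                        # unobserved feature, independent of D
  M <- rbinom(n, 1, p[paste0("p", D, U)])      # stopped with probability p_du
  Y <- M * rbinom(n, 1, y[paste0("y", D, U)])  # force is only possible if stopped
  data.frame(D, U, M, Y)
}

df_sim <- simulate_model()
# The naive comparison uses only the stopped cases:
with(subset(df_sim, M == 1), mean(Y[D == 1]) - mean(Y[D == 0]))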

2 Estimands and estimators

We have enough information here to write the estimands and estimates in terms of \(p_{du}\) and \(y_{du}\).

The estimands are:

\[\begin{eqnarray} ATE &=& (p_{10}y_{10}+p_{11}y_{11})/2 - (p_{00}y_{00}+p_{01}y_{01})/2\\ CDE_{M=1} &=& \frac{(p_{10}+p_{00})(y_{10}-y_{00}) + (p_{11}+p_{01})(y_{11}- y_{01})}{p_{10}+p_{11}+p_{00}+p_{01}} \\ ATE_{M=1} &=& CDE_{M=1} + \frac{(p_{10}-p_{00})y_{00} + (p_{11}-p_{01})y_{01}}{p_{00} + p_{01} + p_{10} + p_{11}} \end{eqnarray}\]

In words, the \(ATE\) is the effect of \(D\) on getting stopped and experiencing force (averaged over the different \(U\) types). \(CDE_{M=1}\) is the average effect of \(D\) on force among those in the \(M=1\) group, keeping \(M=1\) fixed as we consider counterfactual changes in \(D\). \(ATE_{M=1}\) is the average effect of \(D\) on force among those in the \(M=1\) group, but allowing \(M\) to change as we consider counterfactual changes in \(D\).

Note that if \(M\) is monotonic in \(D\) then \(ATE_{M=1} \geq CDE_{M=1}\): the difference is the increased chance of being stopped under \(D=1\) multiplied by the force that would be experienced in the \(D=0\) condition (the difference between outcomes in the \(D=1\) and \(D=0\) conditions is already captured in the CDE part).
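Concretely, the gap between the two estimands is the second term in the expression for \(ATE_{M=1}\) above:

\[ATE_{M=1} - CDE_{M=1} = \frac{(p_{10}-p_{00})y_{00} + (p_{11}-p_{01})y_{01}}{p_{00} + p_{01} + p_{10} + p_{11}} \geq 0,\]

which is nonnegative because monotonicity implies \(p_{1u} \geq p_{0u}\) for each \(u\), while the \(y_{0u}\) are nonnegative.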

For the estimator we will assume we have lots of data. We focus on the “naive estimate,” which takes the difference in average \(Y\) between the \(D=1\) and \(D=0\) cases within the \(M=1\) set:

\[\hat{b} = \frac{p_{10} y_{10} + p_{11}y_{11}}{p_{10} + p_{11}} - \frac{p_{00} y_{00} + p_{01}y_{01}}{p_{00} + p_{01}}\]

Restrictions on \(p\) and \(y\) can capture assumptions that there is discrimination at both stages (\(M\) and \(Y\) are both increasing in \(D\)), or that both being stopped and the use of force if stopped are increasing in \(U\). Such restrictions have implications for whether the naive estimate is biased and for the direction that bias takes.
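As a sketch, such restrictions can be written as simple checks on the parameters (the function names here are purely illustrative):

# Discrimination at both stages: stopping and force (if stopped) are increasing in D
discrimination_in_stops = function(p00, p10, p01, p11) (p10 >= p00) & (p11 >= p01)
discrimination_in_force = function(y00, y10, y01, y11) (y10 >= y00) & (y11 >= y01)

# Both stopping and force (if stopped) are increasing in U
stops_increasing_in_U = function(p00, p10, p01, p11) (p01 >= p00) & (p11 >= p10)
force_increasing_in_U = function(y00, y10, y01, y11) (y01 >= y00) & (y11 >= y10)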

3 Collider bias

We will say “collider bias” is the non-independence of \(D\) and \(U\) in the \(M=1\) set despite unconditional independence between \(D\) and \(U\).

We can write the probability that \(U=1\) given \(M=1\) and \(D=d\) as \(q_d = \frac{p_{d1}}{p_{d0}+p_{d1}}\), and then define collider bias as:

\[s:=q_1 - q_0\]

We have:

\[s \geq 0\leftrightarrow \frac{p_{11}}{p_{10}+p_{11}} \geq \frac{p_{01}}{p_{00}+p_{01}} \leftrightarrow \frac{p_{11}}{p_{01}} \geq \frac{p_{10}}{p_{00}}\]

In words: within the observed data, we expect \(U\) to be higher among minorities if and only if the (multiplicative) effect of minority status on being stopped is greater when \(U=1\) than when \(U=0\). Equivalently: \(U\) is higher for minorities among the stopped if the ratio \(p_{d1}/p_{d0}\) is larger for minorities than for non-minorities.

Collider bias can be thought of as a question of the extent to which \(D\) and \(U\) are complements for \(M\), or more precisely whether there is “log complementarity”: \(\log(p_{11}) - \log(p_{01}) \geq \log(p_{10}) - \log(p_{00})\).
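Here is a quick numerical check of this equivalence (the helper names are purely illustrative):

# Pr(U = 1 | M = 1, D = d) and the implied collider bias
q_d = function(pd0, pd1) pd1 / (pd0 + pd1)
collider_bias = function(p00, p10, p01, p11) q_d(p10, p11) - q_d(p00, p01)

# Log complementarity: log(p11) - log(p01) - (log(p10) - log(p00))
log_complementarity = function(p00, p10, p01, p11) log(p11) - log(p01) - (log(p10) - log(p00))

# The two quantities agree in sign; for example:
collider_bias(.05, .15, .1, .2); log_complementarity(.05, .15, .1, .2)  # both negative
collider_bias(.05, .15, .1, .4); log_complementarity(.05, .15, .1, .4)  # both positive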

4 Implications of collider bias for the direction of bias in the estimation of \(CDE_{M=1}\)

Collider bias can result in different groups in the \(M=1\) set having higher or lower levels of \(U\) on average. If \(U\) affects \(Y\) then we will see differences between these groups that are due to \(U\) and not to \(D\), and hence bias when \(CDE_{M=1}\) is estimated from differences in outcomes between groups.

Consider a simple case in which \(Y = MU\). That is, if stopped, force is entirely determined by the unknown variable \(U\) and does not depend on \(D\), except via being stopped.

In that case, the controlled direct effect of \(D\) is 0. However the estimate will be:

\[\hat{b} = \frac{p_{11}}{p_{10}+p_{11}} - \frac{p_{01}}{p_{00}+p_{01}}\]

which is exactly collider bias! It is the difference between \(D=1\) and \(D=0\) cases in the probability that \(U=1\) among \(M=1\) cases.

To be clear: in this case, the bias in the estimate of \(CDE_{M=1}\) is exactly the collider bias: it can be negative, positive, or zero, depending on log complementarity.
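A self-contained check of this claim: when \(Y = MU\), the average of \(Y\) among the stopped is just the probability that \(U=1\) given \(M=1\) and \(D=d\), so the naive difference in means is exactly \(q_1 - q_0\). (The function name is just illustrative.)

# With Y = M*U we have y_d0 = 0 and y_d1 = 1, so the naive estimate reduces to q_1 - q_0
naive_when_Y_is_U = function(p00, p10, p01, p11){
  naive    = (p10 * 0 + p11 * 1) / (p10 + p11) - (p00 * 0 + p01 * 1) / (p00 + p01)
  collider = p11 / (p10 + p11) - p01 / (p00 + p01)
  c(naive = naive, collider = collider)
}

naive_when_Y_is_U(.05, .15, .1, .2)  # the two entries are identical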

Let’s generalize slightly to a broader set of cases in which \(Y\) is increasing in \(U\).

# Packages used below: dplyr and tidyr for data wrangling, data.table for melt,
# ggplot2 for the figure, knitr and kableExtra for the tables
library(dplyr)
library(tidyr)
library(data.table)
library(ggplot2)
library(knitr)
library(kableExtra)

# Controlled direct effect of D among the stopped (CDE_{M=1})
cde_m1 = function(p00, p10, p01, p11, y00, y10, y01, y11){

  ((p00+p10)*(y10 - y00) +   (p01+p11)*(y11 - y01))/ (p00 + p01 + p10 + p11)   
  
}

# Check of Knox et al.'s Assumption 3 for minorities: average force risk under D = 1 is at
# least as high among those always stopped as among those stopped only because D = 1
A3_b = function(p00, p10, p01, p11, y00, y10, y01, y11)
  ((p00*y10 + p01*y11)/(p00+p01)) >= (((p10-p00)*y10 + (p11-p01)*y11)/(p10+p11 - p00 - p01))

# The analogous check for non-minorities, using the force risks under D = 0
A3_w = function(p00, p10, p01, p11, y00, y10, y01, y11)
  ((p00*y00 + p01*y01)/(p00+p01)) >= (((p10-p00)*y00 + (p11-p01)*y01)/(p10+p11 - p00 - p01))

# Unconditional ATE of D on experiencing force (being stopped and subjected to force)
ate = function(p00, p10, p01, p11, y00, y10, y01, y11)
  (p10* y10 + p11*y11)/2 - (p00* y00 + p01*y01)/2

# ATE among the stopped, allowing M to change with D (ATE_{M=1})
ate_m1 = function(p00, p10, p01, p11, y00, y10, y01, y11){
  (2*p00*(y10 - y00) +   2*p01*(y11 - y01)
   + (p10-p00)*y10 + (p11-p01)*y11)/ (p00 + p01 + p10 + p11)
}

# Gap between ATE_{M=1} and CDE_{M=1} (note: this masks base R's diff())
diff <- function(p00, p10, p01, p11, y00, y10, y01, y11)
  ((p10 - p00)*y00 +   (p11-p01)*y01)/ (p00 + p01 + p10 + p11)

# Collider bias: difference in Pr(U = 1 | M = 1) between D = 1 and D = 0
collider <- function(p00, p10, p01, p11, y00, y10, y01, y11) p11/(p10+p11) - p01/(p00+p01)

# This is the effect of D on force experienced, *not* conditional on M=1 (identical to ate above)
point_est <- function(p00, p10, p01, p11, y00, y10, y01, y11){

  (p10* y10 + p11*y11)/2 - (p00* y00 + p01*y01)/2
    
}


# Naive estimate: difference in mean force between D=1 and D=0 cases among the stopped (M=1)
est = function(p00, p10, p01, p11, y00, y10, y01, y11){

  (p10* y10 + p11*y11)/(p10 + p11) - (p00* y00 + p01*y01)/(p00 + p01)
    
}

p00 = .05
p10 = .15
p01 = .1
p11 = .3
y00 = .1
y10 = .3
y01 = .2
y11 = .3

Say we had the following tables for \(M\) and \(Y\):

data.frame(S = c("D=0", "D=1"), "U=0" = c(p00, p10), "U=1" = c(p01, "p_11")) %>%
  kable(format = "html", escape = F, row.names = FALSE, col.names = c(" ", "U=0", "U=1"), caption = "M: Prob. stopped")  %>%
kable_styling(full_width = F)
M: Prob. stopped

      U=0    U=1
D=0   0.05   0.1
D=1   0.15   p_11
data.frame(S = c("D=0", "D=1"), "U=0" = c(y00, y10), "U=1" = c(y01, y11)) %>%
  kable(format = "html", escape = F, row.names = FALSE, col.names = c(" ", "U=0", "U=1"), caption = "Y: Prob. force | stopped")  %>%
kable_styling(full_width = F)
Y: Prob. force | stopped

      U=0   U=1
D=0   0.1   0.2
D=1   0.3   0.3

Then different values of \(p_{11}\) would result in different collider biases and different estimator biases. You can see these in this figure:

# Vary p11 (and y11), holding the other parameters fixed, and compute the estimands,
# the naive estimate, and the implied biases
df <- expand_grid(p11 = seq(.1, .9, .025), y11 = c(.4, .8, 1)) %>% data.frame() %>%
  mutate(collider = collider(p00, p10, p01, p11, y00, y10, y01, y11),
         est = est(p00, p10, p01, p11, y00, y10, y01, y11),
         cde_m1 = cde_m1(p00, p10, p01, p11, y00, y10, y01, y11),
         ate_m1 = ate_m1(p00, p10, p01, p11, y00, y10, y01, y11),
         ate = ate(p00, p10, p01, p11, y00, y10, y01, y11),
         point_est = point_est(p00, p10, p01, p11, y00, y10, y01, y11),
         ate_m1_bias = est - ate_m1,
         cde_m1_bias = est - cde_m1,
         ate_bias = est - ate
)


df %>% filter(y11 == .4) %>% data.table() %>%
  melt(id.vars = "collider", measure.vars = c("ate_m1_bias", "cde_m1_bias", "ate_bias")) %>%
  ggplot(aes(x = collider, y = value, color = variable)) +
  geom_line() + xlab("Collider bias: p11/(p10+p11) - p01/(p00+p01)") + ylab("bias") +
  geom_hline(yintercept = 0) + geom_vline(xintercept = 0)

In particular we see:

  1. The bias in the estimate of \(CDE_{M=1}\) has the same sign as the collider bias: it is zero when collider bias is zero, negative when it is negative, and positive when it is positive.

  2. The estimate falls below \(ATE_{M=1}\) throughout this range, in line with the underestimation result in Knox, Lowe, and Mummolo.

  3. The unconditional \(ATE\) is overestimated throughout this range.

But it can be worse….

5 The lazy racist: An example where the naive estimate is an overestimate for all three estimands

Imagine a world in which there are two types of police officer. One type (\(U=0\)) is diligent but nonviolent: they stop often but don’t use force so often (though they are still racist in both stops and use of force). The other type (\(U=1\)) is lazier, stopping less overall, but likely to employ violence, especially against the minorities they stop.

M: Prob. stopped

      U=0   U=1
D=0   0.5   0.1
D=1   0.6   0.2

Y: Prob. force | stopped

      U=0   U=1
D=0   0.1   0.1
D=1   0.2   0.9

With such an underlying data generating process we might observe data like this: force was used among 300 of 800 minorities and among 60 of 600 non-minorities.
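As a check on these counts, here is a sketch of the arithmetic, assuming (purely for illustration) 2,000 individuals in each racial group, with \(U=0\) and \(U=1\) equally likely:

n_group = 2000  # hypothetical number of individuals in each racial group

# Non-minorities: stop probabilities .5 (U=0) and .1 (U=1); force probabilities .1 and .1
stops_nonminority = n_group * (.5 + .1) / 2              # 600 stopped
force_nonminority = (n_group / 2) * (.5 * .1 + .1 * .1)  # 60 experience force

# Minorities: stop probabilities .6 (U=0) and .2 (U=1); force probabilities .2 and .9
stops_minority = n_group * (.6 + .2) / 2                 # 800 stopped
force_minority = (n_group / 2) * (.6 * .2 + .2 * .9)     # 300 experience force

c(stops_nonminority, force_nonminority, stops_minority, force_minority)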

These values imply the following estimands and (large N) estimates:

# Parameter values for the lazy racist example (from the tables above)
p00 = .5; p10 = .6; p01 = .1; p11 = .2
y00 = .1; y10 = .2; y01 = .1; y11 = .9

data.frame(quantity = c("CDE | M=1", "ATE | M=1", "ATE", "Estimate"),
           
           #"Assumption 3 (blacks)", "Assumption 3 (whites)"),

           value = c(CDE_M1 = cde_m1(p00, p10, p01, p11, y00, y10, y01, y11),
ATE_M1 = ate_m1(p00, p10, p01, p11, y00, y10, y01, y11),
ATE = ate(p00, p10, p01, p11, y00, y10, y01, y11),
estimate = est(p00, p10, p01, p11, y00, y10, y01, y11)#,
#A3_blacks = A3_b(p00, p10, p01, p11, y00, y10, y01, y11),
#A3_whites = A3_w(p00, p10, p01, p11, y00, y10, y01, y11)
)) %>% kable(row.names = FALSE, digits = 2)  %>%
kable_styling(full_width = F)
quantity      value
CDE | M=1      0.25
ATE | M=1      0.26
ATE            0.12
Estimate       0.28

So the naive estimate overestimates all three quantities.

Note that there is nothing “knife edge” about this example; you can make small changes to any of these numbers without qualitative changes to the conclusion.

6 How can we get overestimation like this when Knox et al. say we should get underestimation of discrimination?

These examples, with overestimation of discrimination, seem contrary to the results in Knox, Lowe, and Mummolo.

In fact nothing here suggests there is any problem with the proofs in Knox et al. The issue rather is that Knox et al. impose an assumption that rules such cases out.

Indeed, for \(M\) and \(Y\) increasing in \(U\) (like the cases in part 4 above) and given monotonicity of \(M\) in \(D\), Assumption 3 (“relative non-severity of racial stops”) is equivalent to the assumption of non-positive collider bias. They say this assumption is necessary for identification, but I don’t understand that.2

To see this, note that the assumption conditions on \(d\) and \(m\), so the only question is whether \(u\) is greater in expectation among those for whom \(d\) makes a difference to being stopped than among those who would be stopped regardless.

Assuming monotonicity, a share \(p_{0u}\) of \(U=u\) types are stopped regardless of \(D\), and a share \(p_{1u} - p_{0u}\) are stopped only when \(D=1\). Thus a share \(\frac{p_{01}}{p_{01} + p_{00}}\) of those who are always stopped have \(U=1\), while a share \(\frac{p_{11} - p_{01}}{p_{11} - p_{01} + p_{10} - p_{00}}\) of those who are stopped because of \(D\) have \(U=1\). The former is larger than the latter if and only if:

\[\frac{p_{11} - p_{01}}{p_{11} - p_{01} + p_{10} - p_{00}} \leq \frac{p_{01}}{p_{01} + p_{00}}\]

or:

\[\frac{p_{11}}{p_{01}} \leq \frac{p_{10}}{p_{00}}\]
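To see the equivalence between these last two inequalities, cross-multiply (both denominators are positive given monotonicity, provided some people are stopped only because of \(D\)) and cancel common terms:

\[\begin{eqnarray} (p_{11} - p_{01})(p_{01} + p_{00}) &\leq& p_{01}(p_{11} - p_{01} + p_{10} - p_{00}) \\ p_{11}p_{01} + p_{11}p_{00} - p_{01}^2 - p_{01}p_{00} &\leq& p_{01}p_{11} - p_{01}^2 + p_{01}p_{10} - p_{01}p_{00} \\ p_{11}p_{00} &\leq& p_{01}p_{10} \end{eqnarray}\]

Dividing both sides by \(p_{01}p_{00}\) gives the condition above, which is exactly non-positive collider bias.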

So, it seems, whether we think there is over- or underestimation depends on what we think about Assumption 3, which has the no-collider-bias condition (corresponding to the condition in Gaebler et al. (2020)) as a boundary case.
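For instance, using the A3_b and A3_w functions coded above (intended as checks of Assumption 3 for minorities and non-minorities respectively), the lazy racist parameter values fail the minority-side check and produce positive collider bias:

# Assumption 3 checks and collider bias for the lazy racist parameter values
A3_b(p00 = .5, p10 = .6, p01 = .1, p11 = .2, y00 = .1, y10 = .2, y01 = .1, y11 = .9)      # FALSE
A3_w(p00 = .5, p10 = .6, p01 = .1, p11 = .2, y00 = .1, y10 = .2, y01 = .1, y11 = .9)      # TRUE
collider(p00 = .5, p10 = .6, p01 = .1, p11 = .2, y00 = .1, y10 = .2, y01 = .1, y11 = .9)  # 0.083 > 0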

Do we think some types of cops can be especially lazy and violent? Do we think that violent cops might be more racist in their stops? If so, we have to worry about Assumption 3 and the take-home from Knox et al.

The bigger take-home, I think, is that we would do well to build on Knox et al.’s contribution in order to figure out the conditions under which bias goes one way or the other, rather than imposing conditions that ensure a particular answer.


  1. So the DAG is: \(Y \leftarrow D \rightarrow M \rightarrow Y \leftarrow U \rightarrow M\).↩︎

  2. Perhaps they mean A5 implies A3. But A5 is not necessary either, as they acknowledge, if you remove other arrows.↩︎