Description of a randomization function that handles heterogeneous probabilities and awkward integer block sizes.
Two illustrations of functionality made possible with this function.
Jack and Jill have a race. Jill is faster than Jack and has a higher probability of winning. You want to simulate a distribution of wins. This is a situation in which probabilities are heterogeneous and in which there is a target number of units to be selected. This problem is neither simple nor complete random assignment, as those terms are understood by randomizr.
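One standard way to hit a target count while still respecting heterogeneous probabilities is systematic sampling from a single random start. The sketch below is illustrative only (the helper `sys_assign` and the probabilities are hypothetical; `prob_ra`'s actual implementation may differ):

```r
# Sketch of systematic assignment with heterogeneous probabilities.
# (Illustrative only; prob_ra's actual implementation may differ.)
sys_assign <- function(p) {
  cum <- cumsum(p)          # running total of probabilities
  u   <- runif(1)           # a single random start in (0, 1)
  # A unit is selected if one of the "ticks" u, u + 1, u + 2, ...
  # falls inside its interval (cum[i-1], cum[i]]:
  floor(cum - u) - floor(c(0, head(cum, -1)) - u)
}

p <- c(.9, .6, .3, .2)      # hypothetical probabilities summing to 2
sum(sys_assign(p))          # total is the floor or ceiling of sum(p)
```

Because consecutive ticks are exactly one unit apart, the realized total can never be more than one away from the expected total, while each unit's marginal probability remains exactly `p[i]`.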
You have 2 districts with 3 villages each. You want to assign 3 villages to treatment, blocking by district, and with equal probabilities for all units. This randomization requires an allocation both across and within blocks. More generally, the issue here is that the target number to be assigned in a given block is not an integer.
Both problems can occur together, and indeed you would expect them to whenever there are generic probabilities and blocks.
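To make the village example concrete, here is a minimal sketch of one way such a two-stage allocation could work (the helper `assign_villages` is hypothetical; `prob_ra` handles this internally): each district's target is 1.5, so one district, chosen at random, treats 2 villages and the other treats 1.

```r
# Sketch: 2 districts x 3 villages, 3 villages to treat overall
# (hypothetical helper; not prob_ra's implementation).
assign_villages <- function() {
  lucky <- sample(1:2, 1)              # district that treats 2 villages
  m <- ifelse(1:2 == lucky, 2, 1)      # per-district treated counts, 1.5 on average
  unlist(lapply(1:2, function(b)       # complete assignment within each district
    sample(rep(c(1, 0), c(m[b], 3 - m[b])))))
}

z <- assign_villages()
sum(z)   # always exactly 3 treated villages
```

Each village is treated with probability 0.5 overall: 0.5 × 2/3 + 0.5 × 1/3 = 0.5.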
With random data:
s <- 100
prob <- runif(s)
blocks <- sample(1:5, s, replace = TRUE, prob = 1:5)
sims <- 10000
runs <- replicate(sims, prob_ra(prob, blocks))
The totals assigned should differ by at most one unit across runs:
table(apply(runs, 2, sum))
##
## 50 51
## 3296 6704
There should be at most 1 unit between min and max in each block:
bin_dist <- apply(runs, 2, function(j) table(blocks, j)[,2])
table_check <- t(rbind(apply(bin_dist, 1, function(j) c(mean(j), min(j), max(j)))))
colnames(table_check) <- c("sim_p", "min", "max")
cbind(size = table(blocks),
      true_p = aggregate(prob, by = list(blocks), FUN = sum)[,2],
      table_check) |>
  kable(digits = 2)
size | true_p | sim_p | min | max |
---|---|---|---|---|
8 | 3.40 | 3.40 | 3 | 4 |
12 | 6.66 | 6.66 | 6 | 7 |
20 | 11.00 | 11.00 | 10 | 11 |
30 | 14.77 | 14.77 | 14 | 15 |
30 | 14.84 | 14.84 | 14 | 15 |
plot(prob, apply(runs, 1, mean), xlim = c(0,1), ylim = c(0,1))
abline(0,1)
The function can also be used sequentially for multiple treatments. In this case it allocates each treatment in a hierarchical manner, which preserves individual probabilities but prioritizes balance in the order in which the treatments are allocated.
s <- 100
blocks <- sample(1:5, s, replace = TRUE, prob = 1:5)
p1 <- runif(s)
p2 <- runif(s)*(1-p1)
prob <- cbind(p1, p2)
Note that t2 will be assigned systematically, like t1, conditional on t1, but it is not unconditionally systematic.
We do a two-step allocation: first allocate t1 optimally, and then, given this allocation, allocate t2. We do this many times to check that the assignment probabilities are all correct for p2.
runs2 <- replicate(sims, 1*(as.vector(prob_ra(prob)) == 2))
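The reason unconditional probabilities can be preserved in a hierarchical scheme is visible in a little arithmetic: if t2 is allocated among the non-t1 units with conditional probability p2/(1 - p1), the unconditional probability of t2 is still p2. The numbers below are hypothetical:

```r
# Why hierarchical allocation can preserve unconditional probabilities
# (hypothetical unit-level probabilities):
p1 <- 0.3                 # probability of t1
p2 <- 0.5                 # desired unconditional probability of t2
p2_cond <- p2 / (1 - p1)  # probability of t2 among units not in t1
(1 - p1) * p2_cond        # recovers p2
```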
The result is much tighter than independent, but not as tight as possible:
par(mfrow=c(1,2))
indep <- replicate(sims, sum(rbinom(length(p2), 1, p2)))
hist(indep, main = "Total t2 allocation | indep")
hist(apply(runs2, 2, sum), main = "Total t2 allocation | scheme", xlim = range(indep))
Ideally there is at most 1 unit between min and max:
bin_dist <- apply(runs2, 2, function(j) table(blocks, j)[,2])
table_check <- t(rbind(apply(bin_dist, 1, function(j) c(mean(j), min(j), max(j)))))
colnames(table_check) <- c("sim_p", "min", "max")
cbind(size = table(blocks),
      true_p = aggregate(p2, by = list(blocks), FUN = sum)[,2],
      table_check) |>
  kable(digits = 2)
size | true_p | sim_p | min | max |
---|---|---|---|---|
3 | 0.11 | 0.11 | 0 | 2 |
14 | 5.26 | 5.26 | 1 | 11 |
22 | 4.74 | 4.74 | 0 | 11 |
28 | 6.54 | 6.55 | 1 | 14 |
33 | 7.39 | 7.39 | 1 | 14 |
plot(p2, apply(runs2, 1, mean), xlim = c(0,1), ylim = c(0,1))
abline(0,1)
Multiple treatments and arms can sometimes be difficult.
“Neat” designs with no integer issues are handled easily:
blocks <- rep(1:4, each = 4)
prob <- matrix(.25, 16, 4)
z <- prob_ra(blocks = blocks, prob = prob)
table(blocks, z)
## z
## blocks 1 2 3 4
## 1 1 1 1 1
## 2 1 1 1 1
## 3 1 1 1 1
## 4 1 1 1 1
A hard example with a reasonable solution:
prob <- rbind(c(2/3, 1/3,   0),
              c(  0, 2/3, 1/3),
              c(1/3,   0, 2/3))
prob <- rbind(prob, prob, prob, prob)
blocks <- rep(1:2, each = 6)
z <- prob_ra(blocks = blocks, prob = prob)
table(z, blocks)
## blocks
## z 1 2
## 1 2 2
## 2 2 2
## 3 2 2
Another hard one:
prob <- t(replicate(12, c(.5, .25, .25)))
blocks <- rep(1:4, each = 3)
z <- prob_ra(blocks = blocks, prob = prob)
table(z, blocks)
## blocks
## z 1 2 3 4
## 1 2 1 1 2
## 2 0 1 1 1
## 3 1 1 1 0
A harder case with suboptimal results and where ordering matters
Here is a hard case with no blocks but multiple treatments. The issue is that optimality depends on the ordering of the treatments.
Consider this:
sims <- 10000
prob <- t(matrix(c(.15, .65, .2, .47, .48, .05), 3, 2))
prob
## [,1] [,2] [,3]
## [1,] 0.15 0.65 0.20
## [2,] 0.47 0.48 0.05
runs <- sapply(1:sims, function(j) prob_ra(prob = prob))
round(sapply(1:3, function(j) apply(runs == j, 1, sum)), 2)/sims
## [,1] [,2] [,3]
## [1,] 0.1356 0.6600 0.2044
## [2,] 0.4740 0.4787 0.0473
The assignment probabilities are correct. But ideally there would be at least one unit in treatment 2 in each draw, and here there is sometimes none:
set.seed(17)
prob_ra(prob = prob)
## [,1]
## [1,] 3
## [2,] 1
Report the number in T2 in each draw:
share_in_t2 <- table(apply(runs == 2, 2, sum))/sims
share_in_t2
##
## 0 1 2
## 0.1269 0.6075 0.2656
This has too much dispersion, as seen in the share of cases in which no unit is assigned to T2. It still produces the correct allocation on average, though:
c(expectation = sum(prob[,2]), average = (share_in_t2%*%as.numeric(names(share_in_t2))))
## expectation average
## 1.1300 1.1387
Compare with this:
prob <- prob[, c(2, 1, 3)]
prob
## [,1] [,2] [,3]
## [1,] 0.65 0.15 0.20
## [2,] 0.48 0.47 0.05
runs2 <- sapply(1:sims, function(j) prob_ra(prob = prob))
round(sapply(1:3, function(j) apply(runs2 == j, 1, mean)), 2)
## [,1] [,2] [,3]
## [1,] 0.65 0.16 0.20
## [2,] 0.48 0.48 0.04
share_in_t1 <- table(apply(runs2 == 1, 2, sum))/sims
share_in_t1
##
## 1 2
## 0.8716 0.1284
c(expectation = sum(prob[,1]), average = (share_in_t1%*%as.numeric(names(share_in_t1))))
## expectation average
## 1.1300 1.1284
Note that in the first case the treatment is given alternately to 1 unit or to none; in the second case it is given to 1 or 2 units.
So, interestingly, smart ordering can solve the problem.
The function is general enough to handle normal blocks and clusters (with a wrapper that supplies cluster information), though it is probably a lot slower than randomizr functions for standard designs.
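A cluster wrapper of the kind mentioned above might look like this sketch (the function `cluster_ra_wrap` and its `assign_fun` argument are hypothetical, just to show the expand-to-units step):

```r
# Sketch of a cluster wrapper: draw an assignment at the cluster level,
# then copy each cluster's draw to its member units.
# (cluster_ra_wrap is hypothetical, not part of the function described here.)
cluster_ra_wrap <- function(clusters, p_cluster, assign_fun) {
  z_c <- assign_fun(p_cluster)                  # one draw per cluster
  z_c[match(clusters, sort(unique(clusters)))]  # expand to units
}

clusters <- c(1, 1, 2, 2, 2, 3)
cluster_ra_wrap(clusters, p_cluster = c(.5, .5, .5),
                assign_fun = function(p) rbinom(length(p), 1, p))
```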
prob_ra(n = 4)
## [1] 1 0 0 1
Here a blocked design with five blocks of three:
prob_ra(blocks=rep(1:5, each = 3))
## [1] 0 1 1 1 0 0 0 1 1 0 0 1 1 0 0
prob_ra(prob = c(.4, .6))
## [1] 0 1
Here probabilities for two treatments are provided and so there is an implicit residual category. The residual is assigned label 0.
prob <- matrix(c(.25, .35, .4, .47, .48, .05), 3, 2)
prob
## [,1] [,2]
## [1,] 0.25 0.47
## [2,] 0.35 0.48
## [3,] 0.40 0.05
set.seed(2)
prob_ra(prob = prob)
## [,1]
## [1,] 0
## [2,] 2
## [3,] 1
One can think of the probability vector as reporting the expected number of units rather than the probability. In this case the values can exceed 1. This is useful, for example, if one simply wanted to do the allocation across blocks first and do the within-block allocation in a second stage.
prob_ra(prob = c(1.2, .8, 2.1, .9))
## [1] 2 1 2 0
round(apply(replicate(2000, prob_ra(prob = c(1.2, .8, 2.1, .9))), 1, mean),2)
## [1] 1.18 0.81 2.09 0.92
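One way to think about how values above 1 can be handled (a sketch of the idea, not necessarily what prob_ra does internally): guarantee each unit its integer part and randomize only the fractional remainder.

```r
# Decomposing expected counts: the integer part can be guaranteed,
# and only the fractional part needs to be randomized.
mu <- c(1.2, .8, 2.1, .9)
floor(mu)            # guaranteed allocation: 1 0 2 0
mu - floor(mu)       # residual probabilities: 0.2 0.8 0.1 0.9
sum(mu - floor(mu))  # the residuals account for 2 more units in expectation
```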