Description of a randomization function that handles heterogeneous probabilities and awkward integer block sizes.
Two illustrations of functionality made possible with this function.
Jack and Jill have a race. Jill is faster than Jack and has a higher probability of winning. You want to simulate a distribution of wins. This is a situation in which probabilities are heterogeneous and in which there is a target number of units to be selected. This problem is neither simple nor complete random assignment, as those terms are understood by randomizr.
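One standard way to hit a target count while still respecting heterogeneous probabilities is systematic sampling from a single random start. The sketch below is illustrative only (the helper `sys_assign` and the probabilities are hypothetical; `prob_ra`'s actual implementation may differ):

```r
# Sketch of systematic assignment with heterogeneous probabilities.
# (Illustrative only; prob_ra's actual implementation may differ.)
sys_assign <- function(p) {
  cum <- cumsum(p)          # running total of probabilities
  u   <- runif(1)           # a single random start in (0, 1)
  # A unit is selected if one of the "ticks" u, u + 1, u + 2, ...
  # falls inside its interval (cum[i-1], cum[i]]:
  floor(cum - u) - floor(c(0, head(cum, -1)) - u)
}

p <- c(.9, .6, .3, .2)      # hypothetical probabilities summing to 2
sum(sys_assign(p))          # total is the floor or ceiling of sum(p)
```

Because consecutive ticks are exactly one unit apart, the realized total can never be more than one away from the expected total, while each unit's marginal probability remains exactly `p[i]`.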
You have 2 districts with 3 villages each. You want to assign 3 villages to treatment, blocking by district, and with equal probabilities for all units. This randomization requires an allocation both across and within blocks. More generally, the issue here is that the target number to be assigned in a given block is not an integer.
Both problems can occur together, and indeed you would expect them to whenever there are generic probabilities and blocks.
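To make the village example concrete, here is a minimal sketch of one way such a two-stage allocation could work (the helper `assign_villages` is hypothetical; `prob_ra` handles this internally): each district's target is 1.5, so one district, chosen at random, treats 2 villages and the other treats 1.

```r
# Sketch: 2 districts x 3 villages, 3 villages to treat overall
# (hypothetical helper; not prob_ra's implementation).
assign_villages <- function() {
  lucky <- sample(1:2, 1)              # district that treats 2 villages
  m <- ifelse(1:2 == lucky, 2, 1)      # per-district treated counts, 1.5 on average
  unlist(lapply(1:2, function(b)       # complete assignment within each district
    sample(rep(c(1, 0), c(m[b], 3 - m[b])))))
}

z <- assign_villages()
sum(z)   # always exactly 3 treated villages
```

Each village is treated with probability 0.5 overall: 0.5 × 2/3 + 0.5 × 1/3 = 0.5.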
With random data:
s <- 100
prob <- runif(s)
blocks <- sample(1:5, s, replace = TRUE, prob = 1:5)
sims <- 10000
runs <- replicate(sims, prob_ra(prob, blocks))
The totals assigned should differ by at most one unit across runs:
table(apply(runs, 2, sum))
##
## 50 51
## 3296 6704
There should be at most 1 unit between min and max in each block:
bin_dist <- apply(runs, 2, function(j) table(blocks, j)[,2])
table_check <- t(rbind(apply(bin_dist, 1, function(j) c(mean(j), min(j), max(j)))))
colnames(table_check) <- c("sim_p", "min", "max")
cbind(size = table(blocks),
      true_p = aggregate(prob, by = list(blocks), FUN = sum)[,2],
      table_check) |>
  kable(digits = 2)
size | true_p | sim_p | min | max |
---|---|---|---|---|
8 | 3.40 | 3.40 | 3 | 4 |
12 | 6.66 | 6.66 | 6 | 7 |
20 | 11.00 | 11.00 | 10 | 11 |
30 | 14.77 | 14.77 | 14 | 15 |
30 | 14.84 | 14.84 | 14 | 15 |
plot(prob, apply(runs, 1, mean), xlim = c(0,1), ylim = c(0,1))
abline(0,1)
The function can also be used sequentially for multiple treatments. In this case it allocates each treatment in a hierarchical manner, which preserves individual probabilities but prioritizes balance in the order in which the treatments are allocated.
s <- 100
blocks <- sample(1:5, s, replace = TRUE, prob = 1:5)
p1 <- runif(s)
p2 <- runif(s)*(1-p1)
prob <- cbind(p1, p2)
Note that t2 will be assigned systematically, like t1, conditional on t1, but it is not unconditionally systematic.
We do a two-step allocation: first allocate t1 optimally, and then, given this allocation, allocate t2. We do this many times to check that the assignment probabilities are all correct for p2.
runs2 <- replicate(sims, 1*(as.vector(prob_ra(prob)) == 2))
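The reason unconditional probabilities can be preserved in a hierarchical scheme is visible in a little arithmetic: if t2 is allocated among the non-t1 units with conditional probability p2/(1 - p1), the unconditional probability of t2 is still p2. The numbers below are hypothetical:

```r
# Why hierarchical allocation can preserve unconditional probabilities
# (hypothetical unit-level probabilities):
p1 <- 0.3                 # probability of t1
p2 <- 0.5                 # desired unconditional probability of t2
p2_cond <- p2 / (1 - p1)  # probability of t2 among units not in t1
(1 - p1) * p2_cond        # recovers p2
```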
The result is much tighter than independent, but not as tight as possible:
par(mfrow=c(1,2))
indep <- replicate(sims, sum(rbinom(length(p2), 1, p2)))
hist(indep, main = "Total t2 allocation | indep")
hist(apply(runs2, 2, sum), main = "Total t2 allocation | scheme", xlim = range(indep))
Ideally there is at most 1 unit between min and max:
bin_dist <- apply(runs2, 2, function(j) table(blocks, j)[,2])
table_check <- t(rbind(apply(bin_dist, 1, function(j) c(mean(j), min(j), max(j)))))
colnames(table_check) <- c("sim_p", "min", "max")
cbind(size = table(blocks),
      true_p = aggregate(p2, by = list(blocks), FUN = sum)[,2],
      table_check) |>
  kable(digits = 2)
size | true_p | sim_p | min | max |
---|---|---|---|---|
3 | 0.11 | 0.11 | 0 | 2 |
14 | 5.26 | 5.26 | 1 | 11 |
22 | 4.74 | 4.74 | 0 | 11 |
28 | 6.54 | 6.55 | 1 | 14 |
33 | 7.39 | 7.39 | 1 | 14 |
plot(p2, apply(runs2, 1, mean), xlim = c(0,1), ylim = c(0,1))
abline(0,1)
Multiple treatments and arms can sometimes be difficult.
“Neat” designs with no integer issues are handled easily:
blocks <- rep(1:4, each = 4)
prob <- matrix(.25, 16, 4)
z <- prob_ra(blocks = blocks, prob = prob)
table(blocks, z)
## z
## blocks 1 2 3 4
## 1 1 1 1 1
## 2 1 1 1 1
## 3 1 1 1 1
## 4 1 1 1 1
A hard example with a reasonable solution:
prob <- rbind(c(2/3, 1/3,   0),
              c(  0, 2/3, 1/3),
              c(1/3,   0, 2/3))
prob <- rbind(prob, prob, prob, prob)
blocks <- rep(1:2, each = 6)
z <- prob_ra(blocks = blocks, prob = prob)
table(z, blocks)
## blocks
## z 1 2
## 1 2 2
## 2 2 2
## 3 2 2
Another hard one:
prob <- t(replicate(12, c(.5, .25, .25)))
blocks <- rep(1:4, each = 3)
z <- prob_ra(blocks = blocks, prob = prob)
table(z, blocks)
## blocks
## z 1 2 3 4
## 1 2 1 1 2
## 2 0 1 1 1
## 3 1 1 1 0
A harder case with suboptimal results and where ordering matters
Here is a hard case with no blocks but multiple treatments. The issue is that optimality depends on the ordering of the treatments.
Consider this:
sims <- 10000
prob <- t(matrix(c(.15, .65, .2, .47, .48, .05), 3, 2))
prob
## [,1] [,2] [,3]
## [1,] 0.15 0.65 0.20
## [2,] 0.47 0.48 0.05
runs <- sapply(1:sims, function(j) prob_ra(prob = prob))
round(sapply(1:3, function(j) apply(runs == j, 1, sum)), 2)/sims
## [,1] [,2] [,3]
## [1,] 0.1356 0.6600 0.2044
## [2,] 0.4740 0.4787 0.0473
The assignment probabilities are correct. But ideally there would be at least one unit in treatment 2 in each draw, and here there is sometimes none:
set.seed(17)
prob_ra(prob = prob)
## [,1]
## [1,] 3
## [2,] 1
Report the number in T2 in each draw:
share_in_t2 <- table(apply(runs == 2, 2, sum))/sims
share_in_t2
##
## 0 1 2
## 0.1269 0.6075 0.2656
This has too much dispersion, as seen in the share of cases in which no unit is assigned to T2. It still produces the correct allocation on average, though:
c(expectation = sum(prob[,2]), average = (share_in_t2%*%as.numeric(names(share_in_t2))))
## expectation average
## 1.1300 1.1387
Compare with this:
prob <- prob[, c(2, 1, 3)]
prob
## [,1] [,2] [,3]
## [1,] 0.65 0.15 0.20
## [2,] 0.48 0.47 0.05
runs2 <- sapply(1:sims, function(j) prob_ra(prob = prob))
round(sapply(1:3, function(j) apply(runs2 == j, 1, mean)), 2)
## [,1] [,2] [,3]
## [1,] 0.65 0.16 0.20
## [2,] 0.48 0.48 0.04
share_in_t1 <- table(apply(runs2 == 1, 2, sum))/sims
share_in_t1
##
## 1 2
## 0.8716 0.1284
c(expectation = sum(prob[,1]), average = (share_in_t1%*%as.numeric(names(share_in_t1))))
## expectation average
## 1.1300 1.1284
Note that in the first case the treatment is given alternately to 1 unit or to none; in the second case it is given to 1 or 2 units.
So, interestingly, smart ordering can solve the problem.
The function is general enough to handle normal blocks and clusters (with a wrapper that supplies cluster information), though it is probably a lot slower than randomizr functions for standard designs.
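A cluster wrapper of the kind mentioned above might look like this sketch (the function `cluster_ra_wrap` and its `assign_fun` argument are hypothetical, just to show the expand-to-units step):

```r
# Sketch of a cluster wrapper: draw an assignment at the cluster level,
# then copy each cluster's draw to its member units.
# (cluster_ra_wrap is hypothetical, not part of the function described here.)
cluster_ra_wrap <- function(clusters, p_cluster, assign_fun) {
  z_c <- assign_fun(p_cluster)                  # one draw per cluster
  z_c[match(clusters, sort(unique(clusters)))]  # expand to units
}

clusters <- c(1, 1, 2, 2, 2, 3)
cluster_ra_wrap(clusters, p_cluster = c(.5, .5, .5),
                assign_fun = function(p) rbinom(length(p), 1, p))
```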
prob_ra(n = 4)
## [1] 1 0 0 1
Here a blocked design with five blocks of three:
prob_ra(blocks=rep(1:5, each = 3))
## [1] 0 1 1 1 0 0 0 1 1 0 0 1 1 0 0
prob_ra(prob = c(.4, .6))
## [1] 0 1
Here probabilities for two treatments are provided and so there is an implicit residual category. The residual is assigned label 0.
prob <- matrix(c(.25, .35, .4, .47, .48, .05), 3, 2)
prob
## [,1] [,2]
## [1,] 0.25 0.47
## [2,] 0.35 0.48
## [3,] 0.40 0.05
set.seed(2)
prob_ra(prob = prob)
## [,1]
## [1,] 0
## [2,] 2
## [3,] 1
One can think of the probability vector as reporting the expected number of units rather than the probability. In this case the values can exceed 1. This is useful, for example, if one simply wanted to do the allocation across blocks first and do the within-block allocation in a second stage.
prob_ra(prob = c(1.2, .8, 2.1, .9))
## [1] 2 1 2 0
round(apply(replicate(2000, prob_ra(prob = c(1.2, .8, 2.1, .9))), 1, mean),2)
## [1] 1.18 0.81 2.09 0.92
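One way to think about how values above 1 can be handled (a sketch of the idea, not necessarily what prob_ra does internally): guarantee each unit its integer part and randomize only the fractional remainder.

```r
# Decomposing expected counts: the integer part can be guaranteed,
# and only the fractional part needs to be randomized.
mu <- c(1.2, .8, 2.1, .9)
floor(mu)            # guaranteed allocation: 1 0 2 0
mu - floor(mu)       # residual probabilities: 0.2 0.8 0.1 0.9
sum(mu - floor(mu))  # the residuals account for 2 more units in expectation
```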