Part 2 Design steps

2.1 Population

The challenge is to generate a population whose members belong to either the hidden or the non-hidden population and who may also belong to other groups. The population must have a network structure that reflects plausible in-group and out-group connections for each of these groups.

The first step is therefore to draw a simulated population, which is done with the get_study_population() function in the package. Using default parameters, we can draw one simulated population of size \(2000\) as follows:

population <- get_study_population()
Table 2.1: Sample data generated by get_study_population (transposed)
name 1 2 3 4 5 6
known 0 0 0 0 0 0
hidden 0 0 0 0 0 0
type 00 00 00 00 00 00
p_visibility_known 1 1 1 1 1 1
known_visible 2 1 2 1 3 1
p_visibility_hidden 0.685 0.726 0.627 0.628 0.762 0.731
hidden_visible 0 0 2 0 0 1
type_visible_00 4 2 1 3 2 1
type_visible_01 0 0 0 0 0 1
type_visible_10 2 1 0 1 3 1
type_visible_11 0 0 2 0 0 0
n_visible 6 3 3 4 5 3
links c(73, 215, 682, 1252, 1380, 1486) c(893, 899, 1420) c(770, 1948, 1993) c(107, 113, 1239, 1353) c(103, 144, 1424, 1637, 1685) c(299, 1478, 1810)
total 2000 2000 2000 2000 2000 2000
service_use 0 1 0 0 1 1
loc_1 0 0 0 1 0 0
loc_2 0 0 0 0 0 0
loc_3 0 0 0 0 0 0
known_2 0 0 0 0 0 0
known_3 0 0 0 0 0 0
total_known 625 625 625 625 625 625
total_hidden 197 197 197 197 197 197
total_service_use 591 591 591 591 591 591
total_loc_1 555 555 555 555 555 555
total_loc_2 208 208 208 208 208 208
total_loc_3 455 455 455 455 455 455
total_known_2 191 191 191 191 191 191
total_known_3 402 402 402 402 402 402
service_use_visible 1 0 0 2 3 1
loc_1_visible 1 2 1 1 0 2
loc_2_visible 1 0 0 0 0 0
loc_3_visible 1 0 0 1 2 2
known_2_visible 1 0 0 0 0 1
known_3_visible 3 1 1 1 1 1

2.1.1 Population Parameters

Full details on parameters are provided in the package documentation (e.g. via ?get_study_population). See also section ??.

The key parameters you can provide to the population function are:

  • N – size of the population
  • K – number of groups (with \(M < K\) such that groups \(K - M + 1, \dots, K\) have unknown size). For simplicity we can assume \(K = 2,\, M = 1\)
  • prev_K – prevalence of each group in the population
  • rho_K – correlations in group membership
  • p_edge_within, p_edge_between – probabilities of edges within and between groups
  • p_visibility – average probability that a particular group membership is revealed, relevant for all types of sampling. Essentially, this determines whether an individual's group membership is observed by others. In RDS sampling this parameter also affects whether members of the hidden group participate in the study when they are given a coupon
  • add_groups – average probabilities of other binary individual characteristics (e.g. service utilization, time-location presence, group memberships)

These parameters are used to form an adjacency matrix \(G\) of the underlying (possibly directed) graph \((V,E)\), with \(\mathbf{d}\) denoting the vector of individual degrees.

The resulting data frame has one row per person. In addition to the network structure, the data record each unit's type (its pattern of membership across groups).
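As a rough illustration of how such parameters could translate into \(G\) and \(\mathbf{d}\), the following base-R sketch draws an undirected network with a simple block structure. It is not the package's implementation: memberships are drawn independently (so rho_K is ignored), only ties between two hidden members get an elevated probability, and the object names (p_mat, p_within_hidden, p_other) are assumptions made for this example.

```r
# Illustrative sketch only; not the code behind get_study_population()
set.seed(1)
n <- 200

# Independent group memberships (assumption: no correlation between groups)
known  <- rbinom(n, 1, 0.3)
hidden <- rbinom(n, 1, 0.1)

# Pairwise edge probabilities: dense ties among hidden members, sparse otherwise
p_within_hidden <- 0.9
p_other <- 0.05
p_mat <- ifelse(outer(hidden, hidden) == 1, p_within_hidden, p_other)

# Draw the upper triangle and mirror it to obtain an undirected graph
G <- matrix(0L, n, n)
upper <- upper.tri(G)
G[upper] <- rbinom(sum(upper), 1, p_mat[upper])
G <- G + t(G)

# Degree vector d and unit type (membership pattern, e.g. "01" = hidden only)
d <- rowSums(G)
type <- paste0(known, hidden)
```

Drawing only the upper triangle guarantees a symmetric matrix with an empty diagonal; a directed graph would instead draw all off-diagonal cells.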

Below we show an example of a simulated population with the following set of parameters:

N = 2000,
K = 2,
prev_K = c(known = 0.3, hidden = 0.1),
rho_K = 0.05,
p_edge_within = list(known = c(0.05, 0.05), hidden = c(0.05, 0.9)),
p_edge_between = list(known = 0.05, hidden = 0.01),
p_visibility = list(hidden = 0.7, known = 0.99),
add_groups = list(service_use = 0.3, loc_1 = 0.3, loc_2 = 0.1, loc_3 = 0.2, known_2 = 0.1, known_3 = 0.2)

2.1.2 Example


population_study <-
  do.call(what = declare_population,
          args = c(handler = get_study_population, study_1[1:8]))


set.seed(19872312)
population <- population_study()
[Figure: Population in study 1]

2.1.3 Considerations

Things to consider in the future:

  • Individuals can have parameters representing barrier effects, transmission bias and recall bias (Maltiel et al. 2015)
  • More complex models (e.g. ERGM or block ERGM) could be used to generate the network structure instead of the simple block structure used currently
  • The propensity to recruit can depend on the blocks (membership in groups) (Berchenko, Rosenblatt, and Frost 2017)

2.2 Sampling strategies

2.2.1 RDS Sampling

The RDS function takes the population data frame and then implements a simulated RDS procedure to return a data frame that records whether someone was sampled as well as information about how they were sampled, particularly in which wave and at what time.

For instance:

population <- get_study_population()
sample     <- sample_rds(population)
name type rds rds_from rds_t rds_wave rds_hidden rds_own_coupon rds_coupon_1 rds_coupon_2 rds_coupon_3
1811 01 1 1918 32 3 1 9-2-2 9-2-2-1 9-2-2-2 9-2-2-3
1816 01 1 1955 25 5 1 6-3-2-2-3 6-3-2-2-3-1 6-3-2-2-3-2 6-3-2-2-3-3
1825 01 1 1839 27 3 1 3-1-2 3-1-2-1 3-1-2-2 3-1-2-3
1826 01 1 1852 28 2 1 8-3 8-3-1 8-3-2 8-3-3
1827 01 1 1893 34 4 1 6-3-3-2 6-3-3-2-1 6-3-3-2-2 6-3-3-2-3
1834 01 1 1845 19 3 1 6-3-2 6-3-2-1 6-3-2-2 6-3-2-3

In an RDS sample we need to draw the recruitment graph, the degrees of recruited individuals, the timing of recruitment, and the coupon matrix. From these data we derive \(I_{t}\), the number of active recruiters at each stage (presuming that anyone who holds a non-activated coupon is an active recruiter).
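To make the definition of \(I_{t}\) concrete, here is a minimal base-R sketch that counts active recruiters from a toy recruitment record. The data frame, the helper active_at(), and the rule that a respondent stays active until all of their coupons have been redeemed are assumptions for illustration. The column names merely mirror rds_from and rds_t from the sample output above.

```r
# Toy recruitment record (hypothetical values)
rds <- data.frame(
  name     = 1:6,
  rds_from = c(NA, NA, 1, 1, 2, 3),  # recruiter; NA marks a seed
  rds_t    = c(1, 1, 2, 3, 3, 4)     # enrollment time
)
n_coupons <- 2

# Number of enrolled respondents who still hold an unused coupon at time t
active_at <- function(t, rds, n_coupons) {
  enrolled <- rds$name[rds$rds_t <= t]
  used <- vapply(
    enrolled,
    function(i) sum(rds$rds_from == i & rds$rds_t <= t, na.rm = TRUE),
    integer(1)
  )
  sum(used < n_coupons)
}

I_t <- vapply(1:4, active_at, integer(1), rds = rds, n_coupons = n_coupons)
I_t
# 2 3 4 5
```

Respondent 1 redeems both coupons by time 3 and drops out of the count from then on, which is why \(I_{t}\) grows by one rather than two at that step.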

RDS parameter options:

  • n_seed – The number of starting seeds drawn from the hidden population. Currently these are assumed to be drawn by simple random sampling.
  • n_coupons – Number of coupons assigned to each seed
  • target_type – Whether the sampling aims to enroll a particular number of respondents or to run a particular number of waves
  • target_n_rds – Target sample size or number of waves, depending on target_type
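The role of these options can be sketched with a stripped-down recruitment loop in base R. This is an assumption-laden toy, not sample_rds(): every unit is treated as a hidden-population member, every coupon handed to an available contact is redeemed, seeds are labelled wave 0, and the target is a number of respondents.

```r
# Illustrative RDS recruitment loop; not the package implementation
set.seed(4)
n <- 60
G <- matrix(0L, n, n)
up <- upper.tri(G)
G[up] <- rbinom(sum(up), 1, 0.1)
G <- G + t(G)                        # undirected toy network

n_seed <- 3; n_coupons <- 2; target_n <- 20

wave <- rep(NA_integer_, n)          # NA = not sampled
from <- rep(NA_integer_, n)          # recruiter of each respondent
seeds <- sample.int(n, n_seed)       # simple random sample of seeds
wave[seeds] <- 0L
queue <- seeds

while (length(queue) > 0 && sum(!is.na(wave)) < target_n) {
  i <- queue[1]
  queue <- queue[-1]
  candidates <- which(G[i, ] == 1 & is.na(wave))   # unsampled contacts of i
  room <- target_n - sum(!is.na(wave))
  k <- min(n_coupons, length(candidates), room)
  if (k > 0) {
    recruits <- candidates[sample.int(length(candidates), k)]
    wave[recruits] <- wave[i] + 1L
    from[recruits] <- i
    queue <- c(queue, recruits)
  }
}

rds_sample <- data.frame(name = which(!is.na(wave)))
rds_sample$rds_from <- from[rds_sample$name]
rds_sample$rds_wave <- wave[rds_sample$name]
```

Processing recruiters first-in, first-out keeps waves in order. If the queue empties before target_n is reached, the loop simply stops, which is one of the open design questions for this sampling step.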

2.2.2 TLS Sampling

The TLS function takes the population data frame and then implements a mapping of locations, samples locations, and samples individuals in those locations.

For instance:

population <- get_study_population(
  N = 1000,
  add_groups = list(p_service = 0.3, loc_1 = .1, loc_2 = .2, loc_3 = .3))

sample <- sample_tls(population)
name hidden tls_loc_sampled tls
890 1 loc_3 1
892 1 loc_2 1
894 1 loc_3 1
895 1 loc_2 1
897 1 loc_3 1
901 1 loc_3 1
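The two-stage logic described for TLS (sample locations, then sample individuals present at them) could be sketched in base R as follows. The number of sampled locations, the target size, and the rule that anyone present at a sampled location is eligible are assumptions for illustration and need not match what sample_tls() does.

```r
# Illustrative two-stage time-location sample; not the package implementation
set.seed(2)
n <- 500
pop <- data.frame(
  name  = seq_len(n),
  loc_1 = rbinom(n, 1, 0.1),
  loc_2 = rbinom(n, 1, 0.2),
  loc_3 = rbinom(n, 1, 0.3)
)
locs <- c("loc_1", "loc_2", "loc_3")

# Stage 1: sample locations
sampled_locs <- sample(locs, 2)

# Stage 2: sample individuals who are present at any sampled location
present  <- rowSums(pop[, sampled_locs, drop = FALSE]) > 0
eligible <- pop$name[present]
target   <- min(50, length(eligible))
chosen   <- eligible[sample.int(length(eligible), target)]

pop$tls <- as.integer(pop$name %in% chosen)
```

Individuals who visit none of the sampled locations have zero probability of selection, which is the main coverage limitation of this design.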

2.2.3 PPS Sampling

The PPS function takes the population data and samples individuals in proportion to the size of the strata defined by their group memberships.

For instance:

population <- get_study_population()
sample     <- sample_pps(population, group_pattern = "^known")
name hidden pps_share pps
2 0 0.484 1
5 0 0.484 1
6 0 0.484 1
7 0 0.484 1
9 0 0.484 1
14 0 0.484 1
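One way to read "in proportion to stratum size" is proportional allocation across strata formed from the known-group indicators. The base-R sketch below follows that reading. The strata definition, the total sample size n_sample, and the rounding rule are assumptions, and sample_pps() may compute its shares differently.

```r
# Illustrative proportional allocation across strata; not the package implementation
set.seed(3)
n <- 1000
pop <- data.frame(
  name    = seq_len(n),
  known   = rbinom(n, 1, 0.3),
  known_2 = rbinom(n, 1, 0.1)
)

# Strata are membership patterns in the known groups, e.g. "10"
pop$stratum <- paste0(pop$known, pop$known_2)
sizes <- vapply(split(pop$name, pop$stratum), length, integer(1))

# Allocate the sample in proportion to stratum size
n_sample <- 100
alloc <- round(n_sample * sizes / n)

chosen <- unlist(lapply(names(alloc), function(s) {
  ids <- pop$name[pop$stratum == s]
  ids[sample.int(length(ids), min(alloc[[s]], length(ids)))]
}))

pop$pps <- as.integer(pop$name %in% chosen)
```

Because of rounding, the realized sample size can differ from n_sample by at most a few units (no more than the number of strata).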

2.2.4 Considerations

  • For NSUM we also need a proportional sample from the population with the same data structure, except that we do not need the timing and coupon matrices

  • For the Service Multiplier we need data on participation in the target program within the RDS sample. In addition, we need an estimate of total program participation in the hidden population

  • How are the seeds in RDS selected? Do we specifically target well-connected individuals (to increase the chances of enrollment) or individuals with varying degrees (to increase sample coverage)?

  • Can RDS enroll members of the non-target population?

  • Are individuals with fewer than n_coupons links allowed to enter the study?

  • What happens if sampling stops before the target size or number of waves is reached? Do we allow enrollment of more seeds?

  • Two features of RDS sampling:

    • Homophily – the tendency of individuals with the same trait to share social links
    • Differential activity – the average connectedness of individuals differs across traits

    Both features are roughly captured by the block structure of the adjacency matrix

2.2.5 Sampling strategies as design steps

Using the default study design we can declare any combination of three main sampling strategies as follows:


rds_study <- 
  do.call(declare_sampling,
          c(handler = sample_rds, 
            sampling_variable = "rds",
            drop_nonsampled = FALSE, study_1[9:12]))

set.seed(19872312)
draw_data(population_study + rds_study)

pps_study <- 
  do.call(declare_sampling,
          c(handler = sample_pps, 
            sampling_variable = "pps",
            drop_nonsampled = FALSE, study_1[13]))

set.seed(19872312)
draw_data(population_study + rds_study + pps_study)

tls_study <- 
  do.call(declare_sampling,
          c(handler = sample_tls, 
            sampling_variable = "tls",
            drop_nonsampled = FALSE, study_1[14]))

set.seed(19872312)
draw_data(population_study + rds_study + pps_study + tls_study)

2.3 Estimands

The true values of the estimands, which we hope to recover, can be calculated from the population dataframe. At the study level we are interested in:

  • For each group of interest, \(k\), size of group \(N_{k}\)
  • For each group of interest, \(k\), prevalence of the group \(p_{k} = \frac{N_{k}}{N}\)
  • Average degree of connectedness
  • Average degree of connectedness in hidden population
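For a concrete sense of these quantities, the sketch below computes them on a five-person toy network in base R. The edge list and membership vector are made up, and reading "average degree in the hidden population" as the mean total degree among hidden members is an assumption. get_study_estimands may define it differently (for example, as ties to hidden members only).

```r
# Toy population: 5 people, people 2 and 3 are hidden (hypothetical values)
hidden <- c(0, 1, 1, 0, 0)
edges  <- rbind(c(1, 2), c(2, 3), c(3, 4), c(2, 4))

G <- matrix(0, 5, 5)
G[edges] <- 1          # fill one direction from the edge list
G <- G + t(G)          # symmetrize
d <- rowSums(G)        # degrees: 1 3 2 2 0

estimands <- c(
  hidden_size           = sum(hidden),          # N_k
  hidden_prev           = mean(hidden),         # p_k = N_k / N
  degree_average        = mean(d),
  degree_hidden_average = mean(d[hidden == 1])
)
estimands
# hidden_size 2, hidden_prev 0.4, degree_average 1.6, degree_hidden_average 2.5
```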

At the meta level we are interested in:

  • Relative bias of each of the methods compared to the truth, e.g. \(|\hat{\tau}_{\mathrm{RDS}} - \tau| - |\hat{\tau}_{\mathrm{NSUM}} - \tau|\)

  • Bias of each of the methods compared to the truth, divided by unit costs (for cost-effectiveness)

  • [Anything else?]
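The comparison in the first item can be evaluated for a single simulation draw as in the sketch below, which plugs in the hidden-size values reported in the tables of this chapter (truth 190, SSPSE on the RDS sample 161, NSUM 169.2416107). A single draw gives a difference in absolute errors rather than a bias. Estimating the bias itself would require averaging over many draws, which is not shown here.

```r
# Values taken from the estimand and estimator tables in this chapter
tau          <- 190           # true hidden_size
tau_hat_rds  <- 161           # hidden_size_sspse (RDS sample)
tau_hat_nsum <- 169.2416107   # hidden_size_nsum (PPS sample)

# Difference in absolute errors for this single draw
rel_error <- abs(tau_hat_rds - tau) - abs(tau_hat_nsum - tau)
rel_error
# 8.241611 (positive: the RDS-based estimate is further from the truth here)
```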

Estimands can be declared and drawn once as follows:


study_estimands <- 
  declare_estimand(handler = get_study_estimands)


set.seed(19872312)
draw_estimands(population_study + rds_study + pps_study + tls_study + study_estimands)  %>% 
  kable(caption = "Estimands as calculated from population dataframe")
Table: Estimands as calculated from population dataframe
estimand_label estimand
hidden_size 190.0000
hidden_prev 0.0950
degree_average 5.0810
degree_hidden_average 0.8655

2.4 Estimators

  • For each estimator used in the studies (e.g. Horvitz-Thompson, RDS+, NSUM), create a handler and an estimator declaration

# Declare SSPSE estimator
estimator_sspse <- 
  declare_estimator(handler = get_study_est_sspse, label = "sspse")

# Declare HT estimator
estimator_ht <- 
  declare_estimator(handler = get_study_est_ht, label = "ht")

# Declare Chords estimator
estimator_chords <- 
  declare_estimator(type = "integrated",
                    handler = get_study_est_chords, label = "chords")

# Declare NSUM estimator
estimator_nsum <- declare_estimator(handler = get_study_est_nsum, label = "nsum")
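As background for the HT declaration, the textbook Horvitz-Thompson estimator weights each sampled unit by the inverse of its inclusion probability. The base-R sketch below applies it to made-up data. The values of y, pi_i and N are hypothetical, and the handler get_study_est_ht may differ in detail.

```r
# Hypothetical sample: hidden-group indicator and known inclusion probabilities
y    <- c(1, 0, 0, 1, 0)
pi_i <- c(0.5, 0.25, 0.25, 0.5, 0.25)
N    <- 12                         # population size (assumed known)

ht_total <- sum(y / pi_i)          # estimated number of hidden members: 4
ht_prev  <- ht_total / N           # estimated prevalence: 1/3
```

Units with a low inclusion probability stand in for more of the population, which is what makes the estimator unbiased when the probabilities are correct.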


set.seed(19872312)

bind_rows(
  draw_estimates(population_study + rds_study + study_estimands + estimator_sspse),
  draw_estimates(population_study + rds_study + study_estimands + estimator_chords),
  draw_estimates(population_study + pps_study + study_estimands + estimator_ht),
  draw_estimates(population_study + pps_study + study_estimands + estimator_nsum)) %>% 
  knitr::kable(caption = "Various study estimators calculated from population and sampling dataframe")
Table: Various study estimators calculated from population and sampling dataframe
estimator_label        estimate      se           estimand_label
hidden_size_sspse      161.0000000   61.2582432   hidden_size
hidden_size_chords     219.0000000                hidden_size
degree_hidden_chords   6.5114155                  degree_hidden_average
hidden_prev_ht         0.0996436     0.0212336    hidden_prev
hidden_size_nsum       169.2416107   12.5299527   hidden_size
degree_average_nsum    4.9337748                  degree_average
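For orientation, the basic network scale-up calculation can be reproduced by hand in base R. The sketch below reuses the six rows shown in Table 2.1 as if they were survey respondents (the visible-tie counts and the known-group totals come from that table). Treating these six rows as a sample and using the basic scale-up formula are assumptions for illustration, and the handler get_study_est_nsum may use a different variant.

```r
N <- 2000
# Known group sizes (total_known, total_known_2, total_known_3 in Table 2.1)
known_sizes <- c(known = 625, known_2 = 191, known_3 = 402)

# Reported ties to each known group for the six rows of Table 2.1
reports_known <- cbind(
  known   = c(2, 1, 2, 1, 3, 1),
  known_2 = c(1, 0, 0, 0, 0, 1),
  known_3 = c(3, 1, 1, 1, 1, 1)
)
# Reported ties to the hidden group (hidden_visible)
reports_hidden <- c(0, 0, 2, 0, 0, 1)

# Step 1: scale up known-group reports to estimate each respondent's degree
degree_hat <- N * rowSums(reports_known) / sum(known_sizes)

# Step 2: scale up hidden-group reports by the estimated total degree
hidden_size_hat <- N * sum(reports_hidden) / sum(degree_hat)
hidden_size_hat
# 182.7
```

The population size cancels out of the final ratio, so the estimate equals the hidden-tie count times the total known-group size divided by the known-tie count (3 × 1218 / 20).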

Service Multiplier

Berchenko, Yakir, Jonathan D. Rosenblatt, and Simon D. W. Frost. 2017. “Modeling and Analyzing Respondent-Driven Sampling as a Counting Process.” Biometrics 73 (4): 1189–98. https://doi.org/10.1111/biom.12678.

Maltiel, Rachael, Adrian E. Raftery, Tyler H. McCormick, and Aaron J. Baraff. 2015. “Estimating Population Size Using the Network Scale-up Method.” The Annals of Applied Statistics 9 (3): 1247–77. https://www.jstor.org/stable/43826420.