Part 2 Design steps

2.1 Population

The challenge is to generate a population whose members belong to the hidden and non-hidden populations, and who may also belong to other groups. The population must have a network structure that reflects plausible types of in-group and out-group connections for each of these groups. This is implemented with the get_study_population() function in the package.

The first step is to draw a simulated population with get_study_population(). With the default parameters, a single draw of a population of size \(2000\) is obtained as follows:

population <- get_study_population()

Table 2.1: Sample data generated by get_study_population (transposed)
name	1	2	3	4	5	6
known	0	0	0	0	0	0
hidden	0	0	0	0	0	0
type	00	00	00	00	00	00
p_visibility_known	1	1	1	1	1	1
known_visible	2	1	2	1	3	1
p_visibility_hidden	0.685	0.726	0.627	0.628	0.762	0.731
hidden_visible	0	0	2	0	0	1
type_visible_00	4	2	1	3	2	1
type_visible_01	0	0	0	0	0	1
type_visible_10	2	1	0	1	3	1
type_visible_11	0	0	2	0	0	0
n_visible	6	3	3	4	5	3
links	c(73, 215, 682, 1252, 1380, 1486)	c(893, 899, 1420)	c(770, 1948, 1993)	c(107, 113, 1239, 1353)	c(103, 144, 1424, 1637, 1685)	c(299, 1478, 1810)
total	2000	2000	2000	2000	2000	2000
service_use	0	1	0	0	1	1
loc_1	0	0	0	1	0	0
loc_2	0	0	0	0	0	0
loc_3	0	0	0	0	0	0
known_2	0	0	0	0	0	0
known_3	0	0	0	0	0	0
total_known	625	625	625	625	625	625
total_hidden	197	197	197	197	197	197
total_service_use	591	591	591	591	591	591
total_loc_1	555	555	555	555	555	555
total_loc_2	208	208	208	208	208	208
total_loc_3	455	455	455	455	455	455
total_known_2	191	191	191	191	191	191
total_known_3	402	402	402	402	402	402
service_use_visible	1	0	0	2	3	1
loc_1_visible	1	2	1	1	0	2
loc_2_visible	1	0	0	0	0	0
loc_3_visible	1	0	0	1	2	2
known_2_visible	1	0	0	0	0	1
known_3_visible	3	1	1	1	1	1

2.1.1 Population Parameters

Full details on parameters are provided in the package documentation (via ?, e.g. ?get_study_population). See also section ??.

The key parameters you can provide to the population function are:

- N – size of population
- K – number of groups (with \(M < K\) such that the \(M\) groups \(K - M + 1, \dots, K\) have unknown size). For simplicity we can assume \(K = 2,\, M = 1\)
- prev_K – prevalence of each group in population
- rho_K – correlations in group membership
- p_edge_within, p_edge_between – probability of edges within and between groups
- p_visibility – average probability of revealing membership in a particular group, relevant for all types of sampling. Essentially, this determines whether an individual’s group membership will be observed by others. In RDS sampling this parameter also affects whether members of the hidden group will participate in the study if they are given a coupon
- add_groups – average probabilities of other binary individual characteristics (e.g. service utilization, time-location presence, group memberships)

These parameters are used to form an adjacency matrix \(G\) of the (possibly directed) underlying graph \((V,E)\), with \(\mathbf{d}\) denoting the vector of individual degrees.
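As a rough illustration of how within-group and between-group edge probabilities can translate into an adjacency matrix and a degree vector, the sketch below draws an undirected graph from a simple two-block model in base R. It is not the package's implementation: the helper simulate_block_adjacency(), its arguments and the toy values are all hypothetical.

```r
# Hypothetical helper (not part of the package): draw an undirected
# adjacency matrix G from a simple block model.
simulate_block_adjacency <- function(block, p_within, p_between) {
  n <- length(block)
  # TRUE where units i and j belong to the same block
  same_block <- outer(block, block, "==")
  # (i, j) entry holds the within-block edge probability of unit i's block
  p_own <- matrix(p_within[block], nrow = n, ncol = n)
  p <- ifelse(same_block, p_own, p_between)
  # Draw each potential edge once (upper triangle), then symmetrise
  G <- matrix(0L, n, n)
  upper <- upper.tri(G)
  G[upper] <- rbinom(sum(upper), size = 1, prob = p[upper])
  G + t(G)
}

set.seed(1)
# Assumed toy population: 180 non-hidden and 20 hidden units
block <- rep(c("non_hidden", "hidden"), times = c(180, 20))
G <- simulate_block_adjacency(
  block,
  p_within  = c(non_hidden = 0.05, hidden = 0.9),
  p_between = 0.01
)
d <- rowSums(G)            # vector of individual degrees
tapply(d, block, mean)     # average degree by block
```

With a high within-group edge probability for the hidden block, hidden units end up with a noticeably higher average degree than non-hidden units, which is the kind of structure the p_edge_within and p_edge_between parameters control.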

The resulting data frame has one row per person. In addition to network structure the data records the unit’s type (membership in multiple groups).

Below we show an example of a simulated population with the following set of parameters:

N = 2000, K = 2, prev_K = c(known = 0.3, hidden = 0.1), rho_K = 0.05, p_edge_within = list(known = c(0.05, 0.05), hidden = c(0.05, 0.9)), p_edge_between = list(known = 0.05, hidden = 0.01), p_visibility = list(hidden = 0.7, known = 0.99), add_groups = list(service_use = 0.3, loc_1 = 0.3, loc_2 = 0.1, loc_3 = 0.2, known_2 = 0.1, known_3 = 0.2)

2.1.2 Example


population_study <-
  do.call(what = declare_population,
          args = c(handler = get_study_population, study_1[1:8]))


set.seed(19872312)
population <- population_study()

Figure: Population in study 1

2.1.3 Considerations

Things to consider in the future:

- Individuals can have parameters representing barrier effects, transmission and recall biases (Maltiel et al. 2015)
- Using more complex models (e.g. ERGM or block ERGM) to generate the network structure instead of simple block structure used currently
- The propensity to recruit can be dependent on the blocks (membership in groups) (Berchenko, Rosenblatt, and Frost 2017)

2.2 Sampling strategies

2.2.1 RDS Sampling

The RDS function takes the population data frame and then implements a simulated RDS procedure to return a data frame that records whether someone was sampled as well as information about how they were sampled, particularly in which wave and at what time.

For instance:

population <- get_study_population()
sample     <- sample_rds(population)

name	type	rds	rds_from	rds_t	rds_wave	rds_hidden	rds_own_coupon	rds_coupon_1	rds_coupon_2	rds_coupon_3
1811	01	1	1918	32	3	1	9-2-2	9-2-2-1	9-2-2-2	9-2-2-3
1816	01	1	1955	25	5	1	6-3-2-2-3	6-3-2-2-3-1	6-3-2-2-3-2	6-3-2-2-3-3
1825	01	1	1839	27	3	1	3-1-2	3-1-2-1	3-1-2-2	3-1-2-3
1826	01	1	1852	28	2	1	8-3	8-3-1	8-3-2	8-3-3
1827	01	1	1893	34	4	1	6-3-3-2	6-3-3-2-1	6-3-3-2-2	6-3-3-2-3
1834	01	1	1845	19	3	1	6-3-2	6-3-2-1	6-3-2-2	6-3-2-3

For an RDS sample we need to draw the recruitment graph, the degrees of recruited individuals, the timing of recruitment and the coupon matrix. From these data we derive \(I_{t}\), the number of active recruiters at any stage (presuming that anyone holding a non-activated coupon is an active recruiter).
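One way to derive \(I_{t}\) from recruitment records is sketched below. It assumes a data frame with unique respondent identifiers in name, the recruiter in rds_from (NA for seeds) and the enrollment time in rds_t, and it treats a respondent as active from the moment of enrollment until all n_coupons coupons have been redeemed. The function count_active_recruiters() and the toy data are hypothetical, not part of the package.

```r
# Hypothetical helper: number of active recruiters I_t at each enrollment time.
# A respondent is active at time t if enrolled by t and fewer than
# n_coupons of their coupons have been redeemed by t.
count_active_recruiters <- function(rds, n_coupons) {
  times <- sort(unique(rds$rds_t))
  active <- vapply(times, function(t) {
    by_t <- rds$rds_t <= t
    enrolled <- rds$name[by_t]
    # Coupons redeemed by time t, counted per enrolled respondent
    # (seeds have rds_from = NA and are ignored by table())
    used <- table(factor(rds$rds_from[by_t], levels = enrolled))
    sum(used < n_coupons)
  }, integer(1))
  data.frame(t = times, I_t = active)
}

# Toy recruitment chain: 1 is a seed, 1 recruits 2 and 3, 2 recruits 4 and 5
rds <- data.frame(
  name     = c(1, 2, 3, 4, 5),
  rds_from = c(NA, 1, 1, 2, 2),
  rds_t    = c(1, 2, 3, 4, 5)
)
I <- count_active_recruiters(rds, n_coupons = 2)
I
```

In this toy chain respondent 1 stops being active at time 3 and respondent 2 at time 5, once each has had both coupons redeemed.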

RDS parameter options:

- n_seed – The number of starting seeds from the hidden population. Currently these are assumed to be drawn by simple random sampling.
- n_coupons – Number of coupons assigned to each seed
- target_type – Whether sampling aims to enroll a particular number of respondents or to complete a particular number of waves
- target_n_rds – Target sample size or number of waves, depending on target_type

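To make the role of these options concrete, here is a deliberately simplified RDS simulation in base R. It is a sketch under strong assumptions, not the logic of sample_rds(): seeds are a simple random sample of hidden members, each respondent passes coupons only to not-yet-sampled hidden contacts, every coupon is redeemed, and sampling stops at a target sample size. The function toy_rds() and the toy network are hypothetical.

```r
# Hypothetical toy RDS sampler on an adjacency list (not sample_rds()).
toy_rds <- function(links, hidden, n_seed, n_coupons, target_n) {
  hidden_ids <- which(hidden == 1)
  # Seeds: simple random sample from the hidden population
  sampled <- hidden_ids[sample.int(length(hidden_ids), n_seed)]
  wave <- rep(0L, n_seed)
  from <- rep(NA_integer_, n_seed)
  i <- 1
  # Respondents recruit in order of enrollment until the target size is
  # reached or nobody is left to recruit
  while (i <= length(sampled) && length(sampled) < target_n) {
    # Eligible peers: linked to the recruiter, hidden, not yet sampled
    peers <- setdiff(links[[sampled[i]]], sampled)
    peers <- peers[hidden[peers] == 1]
    n_new <- min(n_coupons, length(peers), target_n - length(sampled))
    new <- peers[sample.int(length(peers), n_new)]
    sampled <- c(sampled, new)
    wave <- c(wave, rep(wave[i] + 1L, n_new))
    from <- c(from, rep(sampled[i], n_new))
    i <- i + 1
  }
  data.frame(name = sampled, rds_from = from, rds_wave = wave)
}

# Toy network: units 1 to 10 are hidden and fully connected, 11 and 12 isolated
hidden <- c(rep(1, 10), 0, 0)
links <- lapply(1:12, function(i) if (i <= 10) setdiff(1:10, i) else integer(0))

set.seed(4)
rds_sample <- toy_rds(links, hidden, n_seed = 2, n_coupons = 2, target_n = 8)
rds_sample
```

The questions raised under Considerations (who may be enrolled, what happens when recruitment dies out early) correspond to the filtering and stopping rules hard-coded in this sketch.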
2.2.2 TLS Sampling

The TLS function takes the population data frame and then implements a mapping of locations, samples locations, and samples individuals in those locations.

For instance:

population <- get_study_population(
  N = 1000,
  add_groups = list(p_service = 0.3, loc_1 = .1, loc_2 = .2, loc_3 = .3))

sample <- sample_tls(population)

name	hidden	tls_loc_sampled	tls
890	1	loc_3	1
892	1	loc_2	1
894	1	loc_3	1
895	1	loc_2	1
897	1	loc_3	1
901	1	loc_3	1
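The two-stage logic described above (sample locations, then sample individuals present at them) can be sketched in base R as follows. This is only an illustration under assumed rules: locations are drawn with equal probability, a fixed number of individuals is taken per location, and a person present at several sampled locations is recorded at the first one. The function toy_tls() and its arguments are hypothetical and do not reflect the internals of sample_tls().

```r
# Hypothetical toy TLS sampler (not sample_tls()).
toy_tls <- function(population, loc_cols, n_loc, n_per_loc) {
  # Stage 1: sample locations
  sampled_locs <- sample(loc_cols, n_loc)
  population$tls <- 0L
  population$tls_loc_sampled <- NA_character_
  # Stage 2: sample individuals present at each sampled location
  for (loc in sampled_locs) {
    present <- which(population[[loc]] == 1 & population$tls == 0)
    take <- present[sample.int(length(present), min(n_per_loc, length(present)))]
    population$tls[take] <- 1L
    population$tls_loc_sampled[take] <- loc
  }
  population
}

set.seed(2)
# Assumed toy population with three location indicators
toy_population <- data.frame(
  name  = 1:100,
  loc_1 = rbinom(100, 1, 0.3),
  loc_2 = rbinom(100, 1, 0.2),
  loc_3 = rbinom(100, 1, 0.3)
)
tls_sample <- toy_tls(toy_population, c("loc_1", "loc_2", "loc_3"),
                      n_loc = 2, n_per_loc = 5)
subset(tls_sample, tls == 1, select = c(name, tls_loc_sampled, tls))
```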

2.2.3 PPS Sampling

The PPS function takes the population data and samples individuals proportionally to the size of the strata defined by their group memberships.

For instance:

population <- get_study_population()
sample     <- sample_pps(population, group_pattern = "^known")

name	pps_share	pps
2	0.484	1
5	0.484	1
6	0.484	1
7	0.484	1
9	0.484	1
14	0.484	1
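One possible reading of this description is proportional allocation across strata, sketched below in base R. Strata are formed from all columns matching group_pattern, each stratum receives a share of the sample equal to its share of the population, and individuals are drawn by simple random sampling within strata. Treating pps_share as the stratum's population share is an assumption, and toy_pps() is a hypothetical helper rather than the package's sample_pps().

```r
# Hypothetical toy sampler with proportional allocation (not sample_pps()).
toy_pps <- function(population, group_pattern, n_sample) {
  group_cols <- grep(group_pattern, names(population), value = TRUE)
  # Stratum label = combination of memberships in the matched groups
  stratum <- do.call(paste, c(unname(as.list(population[group_cols])), sep = "_"))
  counts <- table(stratum)
  share <- setNames(as.numeric(counts) / nrow(population), names(counts))
  population$pps_share <- as.numeric(share[stratum])
  population$pps <- 0L
  for (s in names(share)) {
    members <- which(stratum == s)
    # Rounded proportional allocation; totals can differ slightly from n_sample
    n_s <- min(length(members), round(n_sample * share[[s]]))
    population$pps[members[sample.int(length(members), n_s)]] <- 1L
  }
  population
}

set.seed(3)
# Assumed toy population with two known groups and one hidden group
toy_population <- data.frame(
  name    = 1:200,
  known   = rbinom(200, 1, 0.3),
  known_2 = rbinom(200, 1, 0.1),
  hidden  = rbinom(200, 1, 0.1)
)
pps_sample <- toy_pps(toy_population, group_pattern = "^known", n_sample = 40)
table(pps_sample$pps)
```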

2.2.4 Considerations

- For NSUM we also need a proportional sample from the population with the same data structure, except that the timing and coupon matrices are not needed
- For the Service Multiplier we need data on participation in the target program within the RDS sample, as well as an estimate of total program participation in the hidden population
- How are the seeds in RDS selected? Do we specifically target connected individuals (to increase chances of enrollment) or individuals with varying degrees (to increase sample coverage)?
- Can RDS enroll members of the non-target population?
- Are individuals with fewer than n_coupons links allowed to enter the study?
- What happens if sampling stops before the target size or number of waves is reached? Do we allow enrollment of more seeds?
- Two features of RDS sampling are roughly captured by the block structure of the adjacency matrix:
  - Homophily – tendency of individuals with the same trait to share social links
  - Differential activity – individuals with different traits differ in their average connectedness
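Homophily and differential activity can both be summarized directly from an adjacency matrix and a binary trait. The sketch below uses two simple summaries chosen for illustration (the share of ties joining same-trait units, and the ratio of average degrees); other definitions exist, and the toy network is made up.

```r
# Toy undirected network on 6 units; units 1 to 3 have the trait
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(3, 4))
G <- matrix(0, 6, 6)
G[edges] <- 1
G[edges[, 2:1]] <- 1          # symmetrise
trait <- c(1, 1, 1, 0, 0, 0)

# Differential activity: ratio of average degrees between trait groups
d <- rowSums(G)
differential_activity <- mean(d[trait == 1]) / mean(d[trait == 0])

# Homophily: share of ties that join two units with the same trait
same_trait <- outer(trait, trait, "==")
homophily_share <- sum(G * same_trait) / sum(G)

differential_activity   # 7/3: units with the trait are better connected
homophily_share         # 0.8: four of the five ties are within-trait
```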

2.2.5 Sampling strategies as design steps

Using the default study design we can declare any combination of three main sampling strategies as follows:


rds_study <- 
  do.call(declare_sampling,
          c(handler = sample_rds, 
            sampling_variable = "rds",
            drop_nonsampled = FALSE, study_1[9:12]))

set.seed(19872312)
draw_data(population_study + rds_study)


pps_study <- 
  do.call(declare_sampling,
          c(handler = sample_pps, 
            sampling_variable = "pps",
            drop_nonsampled = FALSE, study_1[13]))

set.seed(19872312)
draw_data(population_study + rds_study + pps_study)


tls_study <- 
  do.call(declare_sampling,
          c(handler = sample_tls, 
            sampling_variable = "tls",
            drop_nonsampled = FALSE, study_1[14]))

set.seed(19872312)
draw_data(population_study + rds_study + pps_study + tls_study)

2.3 Estimands

The true value of estimands, which we hope to recover, can be calculated from the population dataframe. At the study level we are interested in:

- For each group of interest, \(k\), size of group \(N_{k}\)
- For each group of interest, \(k\), prevalence of the group \(p_{k} = \frac{N_{k}}{N}\)
- Average degree of connectedness
- Average degree of connectedness in hidden population

At the meta level we are interested in:

- Relative bias of the methods, measured as the difference between two methods in absolute deviation from the truth, e.g. \(|\hat{\tau}_{\mathrm{RDS}} - \tau| - |\hat{\tau}_{\mathrm{NSUM}} - \tau|\)
- Bias of each of the methods compared to the truth, divided by unit costs (for cost-effectiveness)
- [Anything else?]

Estimands can be declared and drawn once as follows:


study_estimands <- 
  declare_estimand(handler = get_study_estimands)


set.seed(19872312)
# %>% is provided by magrittr (re-exported by dplyr); kable() comes from knitr
draw_estimands(population_study + rds_study + pps_study + tls_study + study_estimands) %>% 
  knitr::kable(caption = "Estimands as calculated from population dataframe")

Table: Estimands as calculated from population dataframe
estimand_label	estimand
hidden_size	190.0000
hidden_prev	0.0950
degree_average	5.0810
degree_hidden_average	0.8655
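The study-level estimands are simple functions of the population data frame. The sketch below computes them for a made-up six-person population with a hidden indicator and a links list column. It is not get_study_estimands(): in particular, degree_hidden_average is taken here to be the mean degree among hidden members, which is an assumption and may differ from the package's definition.

```r
# Toy population: units 3 and 4 are hidden; links holds each unit's contacts
toy_population <- data.frame(name = 1:6, hidden = c(0, 0, 1, 1, 0, 0))
toy_population$links <- list(c(2, 3), 1, c(1, 4), 3, integer(0), integer(0))

degree <- lengths(toy_population$links)   # number of contacts per unit

toy_estimands <- data.frame(
  estimand_label = c("hidden_size", "hidden_prev",
                     "degree_average", "degree_hidden_average"),
  estimand = c(
    sum(toy_population$hidden),           # N_k for the hidden group
    mean(toy_population$hidden),          # p_k = N_k / N
    mean(degree),                         # average degree, all units
    mean(degree[toy_population$hidden == 1])  # assumed definition, see above
  )
)
toy_estimands
```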

2.4 Estimators

For each estimator used in the studies (Horvitz-Thompson, RDS+, NSUM) we create a handler and an estimator declaration:


# Declare SSPSE estimator
estimator_sspse <- 
  declare_estimator(handler = get_study_est_sspse, label = "sspse")

# Declare HT estimator
estimator_ht <- 
  declare_estimator(handler = get_study_est_ht, label = "ht")

# Declare Chords estimator
estimator_chords <- 
  declare_estimator(type = "integrated",
                    handler = get_study_est_chords, label = "chords")

# Declare NSUM estimator
estimator_nsum <- declare_estimator(handler = get_study_est_nsum, label = "nsum")


set.seed(19872312)

# bind_rows() and %>% are provided by dplyr
library(dplyr)

bind_rows(
  draw_estimates(population_study + rds_study + study_estimands + estimator_sspse),
  draw_estimates(population_study + rds_study + study_estimands + estimator_chords),
  draw_estimates(population_study + pps_study + study_estimands + estimator_ht),
  draw_estimates(population_study + pps_study + study_estimands + estimator_nsum)) %>% 
  knitr::kable(caption = "Various study estimators calculated from population and sampling dataframe")

Table: Various study estimators calculated from population and sampling dataframe
estimator_label	estimate	se	estimand_label
hidden_size_sspse	161.0000000	61.2582432	hidden_size
hidden_size_chords	219.0000000		hidden_size
degree_hidden_chords	6.5114155		degree_hidden_average
hidden_prev_ht	0.0996436	0.0212336	hidden_prev
hidden_size_nsum	169.2416107	12.5299527	hidden_size
degree_average_nsum	4.9337748		degree_average
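For orientation, textbook forms of two of these estimators are sketched below in base R: the Horvitz-Thompson estimator of prevalence and the basic network scale-up estimator of hidden population size, with degrees estimated from known populations. These are standard formulas applied to made-up numbers. They are not the package handlers get_study_est_ht and get_study_est_nsum, whose details (weights, standard errors) may differ, and the function names below are hypothetical.

```r
# Horvitz-Thompson prevalence: (1 / N) * sum over the sample of y_i / pi_i
ht_prevalence <- function(y, pi, N) sum(y / pi) / N

# Basic scale-up: degrees d_i are estimated from reported contacts in known
# groups, then N_hidden is estimated as N * sum(y_hidden) / sum(d_i)
nsum_size <- function(y_hidden, y_known, N_known, N) {
  d_hat <- N * rowSums(y_known) / sum(N_known)
  N * sum(y_hidden) / sum(d_hat)
}

N <- 1000
# Four respondents sampled with equal inclusion probability 4 / 1000
ht_est <- ht_prevalence(y = c(1, 0, 0, 0), pi = rep(0.004, 4), N = N)

# Reported contacts in two known groups of sizes 100 and 200
y_known <- rbind(c(1, 2), c(0, 3), c(2, 1), c(1, 2))
nsum_est <- nsum_size(y_hidden = c(1, 0, 2, 1), y_known = y_known,
                      N_known = c(100, 200), N = N)

ht_est     # 0.25
nsum_est   # 100
```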

Service Multiplier
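In its simplest form the service multiplier estimate divides the number of hidden-population members known to use a service by the proportion of the sample reporting use of that service. The sketch below shows this unweighted textbook version on made-up numbers. The function service_multiplier_size() is hypothetical, and a fuller treatment would weight the RDS sample rather than take a raw mean.

```r
# Hypothetical helper: N_hat = M / P_hat, where M is the number of hidden
# population members recorded as service users and P_hat is the share of
# the sample reporting service use (unweighted here for simplicity).
service_multiplier_size <- function(total_service_users, service_use_sample) {
  total_service_users / mean(service_use_sample)
}

# Toy data: 60 recorded service users; 3 of 10 sampled respondents report use
service_use_sample <- c(1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
size_est <- service_multiplier_size(60, service_use_sample)
size_est   # 200
```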

Berchenko, Yakir, Jonathan D. Rosenblatt, and Simon D. W. Frost. 2017. “Modeling and Analyzing Respondent-Driven Sampling as a Counting Process.” Biometrics 73 (4): 1189–98. https://doi.org/10.1111/biom.12678.

Maltiel, Rachael, Adrian E. Raftery, Tyler H. McCormick, and Aaron J. Baraff. 2015. “Estimating Population Size Using the Network Scale-up Method.” The Annals of Applied Statistics 9 (3): 1247–77. https://www.jstor.org/stable/43826420.