--- title: "Optimization of sampling strata with the SamplingStrata package" author: "Marco Ballin, Giulio Barcaroli" date: "2025-03-02" # output: rmarkdown::html_vignette output: bookdown::html_document2: df_print: kable highlight: tango number_sections: yes theme: flatly toc: yes toc_depth: 2 toc_float: collapsed: no smooth_scroll: yes toc_title: spsur vs spse # code_folding: hide bibliography: SamplingStrata.bib vignette: > %\VignetteIndexEntry{Optimization of sampling strata with the SamplingStrata package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} \usepackage[utf8]{inputenc} --- # Introduction Let us suppose we need to design a sample survey, having a complete frame containing information on the target population (identifiers plus auxiliary information). If our sample design is a stratified one, we need to choose how to form strata in the population, in order to get the maximum advantage by the available auxiliary information. In other words, we have to decide in which way to combine the values of the auxiliary variables (from now on, the 'X' variables) in order to determine a new variable, called 'stratum'. To do so, we have to take into consideration the target variables of our sample survey (from now on, the 'Y' variables): if, to form strata, we choose the X variables most correlated to the Y's, the efficiency of the samples drawn by the resulting stratified frame may be greatly increased. Every combination of values of each active variable determines a particular stratification of the target population, i.e. a possible solution to the problem of 'best' stratification. Here, by best stratification, we mean the stratification that ensures the minimum sample cost, sufficient to satisfy precision constraints set on the accuracy of the estimates of the survey target variables Y's (constraints expressed as maximum allowable coefficients of variation in the different domains of interest). When the cost of data collection is uniform over the strata, then the total cost is directly proportional to the overall sample size, and the convenience of a particular stratification can be measured by the associated size of the sample, whose estimates are expected to satisfy given accuracy levels. This minimum size can be determined by applying the Bethel algorithm, with its Chromy variant (@bethel1989). In general, the number of possible alternative stratifications for a given population may be very high, depending on the number of variables and on the number of their values, and in these cases it is not possible to enumerate them in order to assess the best one. A very convenient solution to this is the adoption of the evolutionary approach, consisting in applying a genetic algorithm that may converge towards a near-optimal solution after a finite number of iterations. The methodology is fully described in @ballin2013, and a complete illustration of the package, together with a comparison with the *stratification* package (@baillargeon2014), is in @barcarol2014. Also a complete application in a case of network data is reported in @ballin2016. The implementation of the genetic algorithm is based on a modification of the functions in the *genalg* package [see @willighagen2005]. In particular, the crossover operator has been modified on the basis of the indications given by @oluing2019. Two more issues, not contained in this vignette, namely: * the use of *models* to take into account the anticipated variance of target variables not directly available in the sampling frame, * the use of SamplingStrata to handle *spatial sampling*, i.e. to optimize the stratification of geo-coded sampling frames, are the object of two dedicated vignettes in the section "Articles" in the website: https://barcaroli.github.io/SamplingStrata/ . # Procedural steps The optimization of the sampling design starts by making the sampling frame available, defining the target estimates of the survey and establishing the precision constraints on them. Then, a choice has to be made with regard to what stratification variables to choose among those available in the frame, on the basis of an analysis of the correlation existing between the two sets of variables (stratification and target). When the chosen stratification variables are both categorical and continuous, in order to make them homogeneuous the continuous ones should be categorized (by using for instance a clustering k-means algorithm). Then, the optimization step making use of the 'atomic' method can be executed. If, conversely, the stratification variables are all of the continuous type the optimization step can be directly executed by making use of the method 'continuous'. It is also possible to perform both kinds of optimization, compare the results and choose the more convenient. Prior to the optimization based on the use of the genetic algorithm, it is advisable to run a different quick optimization task based on the use of the k-means algorithm, with a twofold purpose: * give a hint on a suitable number of final strata; * get an initial 'good' solution to be given as a 'suggestion' to the genetic algorithm in order to speed its convergence to the final solution. In the optimization step it is possible to indicate the set of sampling units that must be selected in any case ('take-all' strata). After the optimization, it is possible to evaluate the quality of the solution by simulating the selection of a high number of samples from the frame, and calculating sampling variance and bias for all the target variables. It is also possible to 'adjust' the sample size of the optimized solution on the basis of the available budget: if a higher size is allowable, sampling rates in strata are increased proportionally until the new total sample size is reached; the opposite is done in case we are obliged to reduce the sample size. Finally, we proceed with the selection of the sample. In the following, each step will be illustrated starting from a real sampling frame, the one that comes with the R package *sampling* (the dataframe *swissmunicipalities*). ## Preparation of inputs required by the optimization step ### Frame For sake of simplicity, let us consider a subset of the *swissmunicipalities* dataset: ```r library(SamplingStrata) data(swissmunicipalities) swissmun <- swissmunicipalities[swissmunicipalities$REG < 4, c("REG","COM","Nom","HApoly", "Surfacesbois","Surfacescult", "Airbat","POPTOT")] head(swissmun) #> REG COM Nom HApoly Surfacesbois Surfacescult Airbat POPTOT #> 2 1 6621 Geneve 1593 67 31 773 177964 #> 3 3 2701 Basel 2391 97 93 1023 166558 #> 4 2 351 Bern 5162 1726 1041 1070 128634 #> 5 1 5586 Lausanne 4136 1635 714 856 124914 #> 9 2 371 Biel (BE) 2123 976 196 463 48655 #> 10 2 942 Thun 2158 425 694 523 40377 ``` In order to limit the processing time we have selected only the first 3 regions and only the variables of interest for our example. Each row in this dataset contains information on a swiss municipalities, identified by *COM* and *Nom*, and belonging to one of three selected regions (*REG*). Suppose we want to plan a sampling survey where the *target estimates Ys* are the totals of *wooded area* (*Surfacesbois*) and *buildings area* (*Airbat*) for each of the 3 regions (*domains of interest*). Suppose also that in each municipalities are always updated the values of *total area* (*HApoly*) and *total population* (*POPTOT*). From the correlation matrix: ```r cor(swissmun[,c(4:8)]) #> HApoly Surfacesbois Surfacescult Airbat POPTOT #> HApoly 1.00000000 0.76920101 0.3398758 0.2602289 0.09940795 #> Surfacesbois 0.76920101 1.00000000 0.4492188 0.2968918 0.09881328 #> Surfacescult 0.33987579 0.44921881 1.0000000 0.3230496 0.11451750 #> Airbat 0.26022890 0.29689183 0.3230496 1.0000000 0.86896631 #> POPTOT 0.09940795 0.09881328 0.1145175 0.8689663 1.00000000 ``` we see that the correlations between *Surfacesbois* and *HApoly*, from one side, and between *Airbat* and *POPTOT* from the other side, are high (respectively 0.77 and 0.87), so we decide that both *HApoly* and *POPTOT* play the role of *stratification variables Xs* in our frame. In a first moment we decide to treat the stratification variables as categorical, so we have to categorize them. A suitable way to do so, is to apply a *k-means clustering* method [see @hartigan:1979] by using the function *var.bin*: ```r swissmun$HApoly.cat <- var.bin(swissmun$HApoly,15) table(swissmun$HApoly.cat) #> #> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 #> 311 345 281 214 158 126 111 84 66 32 29 32 15 11 8 swissmun$POPTOT.cat <- var.bin(swissmun$POPTOT,15) table(swissmun$POPTOT.cat) #> #> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 #> 452 388 266 193 124 109 79 59 42 33 35 28 5 6 4 ``` We can now define the *frame* dataframe in the format required by SamplingStrata. Function *buildFrameDF* permits to organize data in a suitable mode for next processing: ```r frame1 <- buildFrameDF(df = swissmun, id = "COM", X = c("POPTOT.cat","HApoly.cat"), Y = c("Airbat","Surfacesbois"), domainvalue = "REG") head(frame1) #> id X1 X2 Y1 Y2 domainvalue #> 1 6621 15 8 773 67 1 #> 2 2701 15 9 1023 97 3 #> 3 351 15 12 1070 1726 2 #> 4 5586 15 11 856 1635 1 #> 5 371 14 9 463 976 2 #> 6 942 14 9 523 425 2 ``` ### Strata This dataframe is not explicitly required, as it is automatically produced from the *frame* dataframe by the *optimStrata* function. Notwithstanding, it is worthwhile to produce it in order to analyse the initial stratification of the frame, and what could be the associated sample size without optimization. The function *buildStrataDF* is the one to produce the *strata* dataframe: ```r strata1 <- buildStrataDF(frame1, progress=F) #> #> Number of strata: 350 #> ... of which with only one unit: 130 head(strata1) #> STRATO N M1 M2 S1 S2 COST CENS DOM1 X1 X2 #> 1 1*7 1 7.00 360.0 0.000000 0.00000 1 0 1 1 7 #> 2 1*5 5 10.00 467.6 4.195235 79.79624 1 0 1 1 5 #> 3 1*13 5 12.20 1085.4 3.187475 537.57924 1 0 1 1 13 #> 4 1*8 4 16.25 734.5 3.418699 153.20003 1 0 1 1 8 #> 5 1*9 2 15.50 530.0 1.500000 70.00000 1 0 1 1 9 #> 6 10*15 1 258.00 3541.0 0.000000 0.00000 1 0 1 10 15 ``` Each row in this dataframe report information related to a given stratum (obtained by cross-classifying each unit with the values of the X variables) regarding: * the identifier of the stratum (named 'strato'), concatenation of the values of the X variables; * the values of the m auxiliary variables (named from X1 to Xm) corresponding to those in the frame; * the total number of units in the population (named 'N'); * a flag (named 'cens') indicating if the stratum is to be censused (=1) or sampled (=0); * a variable indicating the cost of interviewing per unit in the stratum (named 'cost'); * for each target variable y, its mean and standard deviation, named respectively 'Mi' and 'Si'); * the value of the domain of interest to which the stratum belongs ('DOM1'). ### Precision constraints The *errors* dataframe contains the accuracy constraints that are set on target estimates. This means to define a maximum coefficient of variation for each target variable and for each domain value. Each row of this frame is related to accuracy constraints in a particular subdomain of interest, identified by the *domainvalue* value. In our case, we have chosen to define the following constraints: ```r ndom <- length(unique(swissmun$REG)) cv <- as.data.frame(list(DOM=rep("DOM1",ndom), CV1=rep(0.10,ndom), CV2=rep(0.10,ndom), domainvalue=c(1:ndom) )) cv #> DOM CV1 CV2 domainvalue #> 1 DOM1 0.1 0.1 1 #> 2 DOM1 0.1 0.1 2 #> 3 DOM1 0.1 0.1 3 ``` This example reports accuracy constraints (maximum CV allowable equal to 10%) on variables Y1 and Y2 that are the same for all the 3 different subdomains (Swiss regions) of domain level DOM1. Of course we can differentiate the precision constraints region by region. It is important to underline that the values of 'domainvalue' are the same than those in the *frame* dataframe, and correspond to the values of variable 'DOM1' in the strata dataframe. We check that the dataframes that we have defined so far are correct: ```r checkInput(errors = checkInput(errors = cv, strata = strata1, sampframe = frame1)) #> #> Input data have been checked and are compliant with requirements ``` For instance, this function controls that the number of auxiliary variables is the same in the frame and in the strata dataframes; that the number of target variables indicated in the frame dataframe is the same than the number of means and standard deviations in the strata dataframe, and the same than the number of coefficient of variations indicated in the errors dataframe. So far so good. Now we want to determine the total sample size, and related allocation, under the given strata, using the function *bethel* (@bethel1989): ```r allocation <- bethel(strata1,cv[1,]) sum(allocation) #> [1] 570 ``` This is the total sample size (570) required to satisfy precision constraints under the current stratification, before the optimization. ## Optimization The function *optimStrata* is the one performing the optimization step. Actually, this function is a 'wrapper' that calls three different optimization functions: 1. optimizeStrata (method = *atomic*, if stratification variables are categorical, or reduced to categorical); 2. optimizeStrata2 (method = *continuous*, if stratification variables are continuous); 3. optimizeStrataSpatial (method = *spatial*, if stratification variables are continuous and there is spatial correlation among units in a geo-coded sampling frame). For continuity reasons, these functions are still available to be used standalone, and in some situations it may be useful to use them by a direct call (see related help for details). Here we report the most important parameters related to the three methods (for the others see the help): | Parameter | Description | | ------------- |:-------------------------------------------------------------------------| | *framesamp* | The name of the dataframe containing the information related to the | | | sampling frame. | | *framecens* | The name of the dataframe containing the units to be selected in any case. | | | It has same structure than "framesamp" dataframe. | | *nStrata* | The number of final optimized strata to be obtained. | | *model* | In case the Y variables are not directly observed, but are estimated by | | | means of other explicative variables, in order to compute the anticipated | | | variance, information on models are given by a dataframe "model" with as | | | many rows as the target variables. Default is NULL. | | *errors* | The dataframe containing the precision levels expressed in | | | terms of maximum allowable coefficient of variation on a given estimate. | | *minnumstr* | Minimum number of units that must be allocated in each stratum | | | (default is 2). | | *iter* | Maximum number of iterations (= generations) of the genetic algorithm | | | (default is 50). | | *pops* | The dimension of each generations in terms of individuals (default is 20). | | *suggestions* | Optional parameter for genetic algorithm that indicates a suggested solution | | | to be introduced in the initial population. The most convenient is the | | | one found by the function "KmeanSolution" (default is NULL). | ### Method 'atomic' As a first run we execute the optimization step using the method *atomic* (required as the stratification variables are of the categorical type). ```r set.seed(1234) solution1 <- optimStrata(method = "atomic", errors = cv, nStrata = rep(10,ndom), framesamp = frame1, iter = 50, pops = 10) #> #> Input data have been checked and are compliant with requirements #> #> Number of strata: 350 #> ... of which with only one unit: 130 #> #> *** Starting parallel optimization for 3 domains using 3 cores ```  ``` #> #> *** Sample size : 281 #> *** Number of strata : 22 #> --------------------------- ``` The execution of *optimStrata* produces the solution of 3 different optimization problems, one for each domain. The graphs illustrate the convergence of the solution to the final one starting from the initial one (i.e. the one related to the atomic strata). Along the x-axis are reported the executed iterations, from 1 to the maximum, while on the y-axis are reported the size of the sample required to satisfy precision constraints. The upper (red) line represent the average sample size in each iteration, while the lower (black) line represents the best solution found until the i-th iteration. We can calculate (analytically) the expected CVs by executing the function: ```r expected_CV(solution1$aggr_strata) #> cv(Y1) cv(Y2) #> DOM1 0.0999918 0.0991177 #> DOM2 0.0998998 0.1000010 #> DOM3 0.1000000 0.0907580 ``` The obtained total size of the sample required to satisfy precision constraint is much lower than the one obtained with the simple application of the Bethel algorithm to the initial atomic stratification, but maybe not yet satisfactory. In order to explore other solutions we may want that each unit in the sampling frame be considered as an atomic stratum, and let to the optimization step to aggregate them on the basis of the values of the Y variables. In any case, as we have to indicate at least one X variable, we can use to this purpose a simple progressive number: ```r swissmun$progr <- c(1:nrow(swissmun)) frame2 <- buildFrameDF(df = swissmun, id = "COM", X = "progr", Y = c("Airbat","Surfacesbois"), domainvalue = "REG") head(frame2) #> id X1 Y1 Y2 domainvalue #> 1 6621 1 773 67 1 #> 2 2701 2 1023 97 3 #> 3 351 3 1070 1726 2 #> 4 5586 4 856 1635 1 #> 5 371 5 463 976 2 #> 6 942 6 523 425 2 ``` We can use this approach because the number of units in the frame is small: it would not be possible to consider each unit as a stratum in case of real population registers or even with business registers. Even so, the processing of the 1,823 strata may be slow. In order to speed up the convergence towards the optimal solution, an initial one can be given as a "suggestion" to 'optimizeStrata' function. The function *KmeansSolution* produces this initial solution by clustering atomic strata considering the values of the means of all the target variables Y. For any given number of clusters, the correspondent aggregation of atomic strata is considered as input to the function *bethel*. The number of clusters for which the value of the sample size necessary to fulfil precision constraints is the minimum one, is retained as the optimal one. Also, the optimal number of clusters is determined inside each domain. It is possible to indicate a maximum number of aggregated strata to be obtained by using the parameter *maxcluster*: ```r strata2 <- buildStrataDF(frame2, progress=F) initial_solution2 <- KmeansSolution(strata = strata2, errors = cv, maxclusters = 10) ```  The overall solution is obtained by concatenating optimal clusters obtained in domains. The result is a dataframe with two columns: the first indicates the clusters, the second the domains. On the basis of these, we can calculate the most convenient number of final strata for each domain: ```r nstrata2 <- tapply(initial_solution2$suggestions, initial_solution2$domainvalue, FUN=function(x) length(unique(x))) nstrata2 #> 1 2 3 #> 7 8 9 ``` and we can supply both the *initial_solution* and the number of strata *nstrata* as inputs to the optimization step: ```r set.seed(1234) solution2 <- optimStrata(method = "atomic", errors = cv, framesamp = frame2, iter = 50, pops = 10, nStrata = nstrata2, suggestions = initial_solution2) #> #> Input data have been checked and are compliant with requirements #> #> Number of strata: 1823 #> ... of which with only one unit: 1823 #> #> *** Starting parallel optimization for 3 domains using 3 cores ```  ``` #> #> *** Sample size : 161 #> *** Number of strata : 23 #> --------------------------- ``` Notice that the obtained solution in this run in terms of sample size is significantly better than in the previous one. ```r outstrata2 <- solution2$aggr_strata expected_CV(outstrata2) #> cv(Y1) cv(Y2) #> DOM1 0.0995009 0.0651465 #> DOM2 0.0996311 0.0403613 #> DOM3 0.0992532 0.0331673 ``` ### Method 'continuous' The last thing to do is to test also the *continuous* method. First, we have to redefine the frame dataframe in this way: ```r frame3 <- buildFrameDF(df = swissmun, id = "COM", X = c("POPTOT","HApoly"), Y = c("Airbat","Surfacesbois"), domainvalue = "REG") head(frame3) #> id X1 X2 Y1 Y2 domainvalue #> 1 6621 177964 1593 773 67 1 #> 2 2701 166558 2391 1023 97 3 #> 3 351 128634 5162 1070 1726 2 #> 4 5586 124914 4136 856 1635 1 #> 5 371 48655 2123 463 976 2 #> 6 942 40377 2158 523 425 2 ``` We want to get indications about a suitable number of optimizes strata in each domain: ```r set.seed(1234) init_sol3 <- KmeansSolution2(frame=frame3, errors=cv, maxclusters = 10) ```  ```r nstrata3 <- tapply(init_sol3$suggestions, init_sol3$domainvalue, FUN=function(x) length(unique(x))) nstrata3 # 1 2 3 # 10 10 9 ``` Note that this time we call a different function (*KmeansSolution2*) that requires, instead of the *strata* dataframe, directly the *frame* dataframe. We are now able to perform the optimization step with the *continuous* method: ```r set.seed(1234) solution3 <- optimStrata(method = "continuous", errors = cv, framesamp = frame3, iter = 50, pops = 10, nStrata = nstrata3) #> #> Input data have been checked and are compliant with requirements #> #> *** Starting parallel optimization for 3 domains using 3 cores ```  ``` #> #> *** Sample size : 101 #> *** Number of strata : 23 #> --------------------------- ``` This solution requires a total sample size that is by far the best among those we produced, so we decide to select this one. # Analysis of the obtained solution ## Strata structure The results of the execution are contained in the list 'solution', composed by three elements: 1. *solution$indices*: the vector of the indices that indicates to what aggregated stratum each atomic stratum belongs (if the *atomic* method has been used) or to which aggregated stratum each unit in the frame belongs (in case the *continuous* method has been used); 2. solution$*framenew*: the initial frame updated with the indication, for each unit, of the optimal strata to which each unit belongs; 3. *solution$aggr_strata*: the dataframe containing information on the optimized strata. When stratification variables are of the continuous type, and the *continuous* (or *spatial*) method has been used, it is possible to have detailed information on the structure of the optimized strata, for instance by using the function *summaryStrata*: ```r strataStructure <- summaryStrata(solution3$framenew, solution3$aggr_strata, progress=FALSE) head(strataStructure) #> Domain Stratum Population Allocation SamplingRate Lower_X1 Upper_X1 Lower_X2 Upper_X2 #> 1 1 1 402 19 0.047288 27 3691 32 1281 #> 2 1 2 86 10 0.117364 95 3567 1286 3805 #> 3 1 3 38 6 0.146870 455 12141 129 3926 #> 4 1 4 27 6 0.232000 130 14361 386 6723 #> 5 1 5 13 2 0.157778 5341 29559 239 7093 #> 6 1 6 16 6 0.393347 78 7515 8460 11907 ``` For each otpimized stratum, total number of units together with allocations and sampling rates are reported. Also ranges of the stratification variables are listed, in order to characterize the strata. If the stratification variables are in a limited number, as in our case, it is possible to use the *plotStrata2d* function, that allows also to visualize strata by choosing couples of variables and one domain per time: ```r plotStrata2d(solution3$framenew, solution3$aggr_strata, domain = 3, vars = c("X1","X2"), labels = c("Total Population","Total Area")) ``` 
Stratum | Population | Allocation | SamplingRate | Bounds Total Population | Bounds Total Area |
---|---|---|---|---|---|
1 | 128 | 3 | 0.02638695 | 109-2055 | 32-510 |
2 | 69 | 2 | 0.03407425 | 235-2319 | 226-886 |
3 | 62 | 4 | 0.06825565 | 978-11702 | 155-902 |
4 | 22 | 2 | 0.09567947 | 744-8647 | 925-1065 |
5 | 11 | 2 | 0.18181818 | 566-13977 | 443-1137 |
6 | 16 | 2 | 0.13730442 | 721-14904 | 1070-1603 |
7 | 12 | 2 | 0.17325072 | 2741-20370 | 700-2029 |
8 | 1 | 1 | 1.00000000 | 166558-166558 | 2391-2391 |