
Does my list of potential survey respondents match the population that I want to survey?


In any survey there is a population (e.g. a group of people) that we would like to collect data on, and we often start the survey with a list that is meant to contain every member of that population. To illustrate this, here is a simplified version of a survey that I am currently working on: a survey of all store-holders in Papua New Guinea (PNG). At the start of the survey we obtained a list of all store-holders in PNG from the PNG Trade Commission. We will then visit each store-holder on the list and ask them to complete our survey. The set of all store-holders is the population that we would like to survey (known as the target population), and the set of store-holders on the Trade Commission list is the set that we will survey in practice (known as the sampling frame). We could also choose a representative group of store-holders from this list in order to reduce the time and cost involved in the survey (a group known as the survey sample), but that is another story for another time. There are many reasons why the target population and the sampling frame might differ, and any such difference can introduce bias into our survey results. It is important to understand these differences in order to reduce that bias.

There are four types of mismatch between the target population and the sampling frame:

a) There might be store-holders in operation that are not on the list. This is known as under-coverage.

b) There might be store-holders on the list that are not in operation (e.g. they have recently gone out of business). These store-holders are known as ineligible units.

c) There might be store-holders that appear more than once on the list (e.g. one business might have more than one workplace). This is known as duplication.

d) There might be single entries on the list that correspond to more than one store-holder in practice (e.g. there might be more than one business in the same shop). This is known as clustering.
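As a rough illustration of the first three patterns, the sketch below compares a list of store IDs against the set of stores actually in operation. It assumes that we somehow know which stores are operating (e.g. from a small field check), which in a real survey we usually do not. The store IDs and the function name frame_mismatches are made up for this example. Clustering cannot be detected from IDs alone, so it is left out.

```python
from collections import Counter

def frame_mismatches(frame_records, operating_stores):
    """Compare a sampling frame with a (hypothetical) set of operating stores.

    frame_records: store IDs as they appear on the list (may repeat).
    operating_stores: set of store IDs that are actually in operation.
    """
    listed = set(frame_records)
    counts = Counter(frame_records)
    return {
        # In operation but missing from the list.
        "under_coverage": sorted(operating_stores - listed),
        # On the list but not in operation.
        "ineligible": sorted(listed - operating_stores),
        # Appearing on the list more than once.
        "duplicated": sorted(s for s, n in counts.items() if n > 1),
    }

frame = ["S1", "S2", "S2", "S3"]   # made-up list from the Trade Commission
operating = {"S1", "S2", "S4"}     # made-up field check
result = frame_mismatches(frame, operating)
print(result)
```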


To reduce the bias caused by a difference between the target population and the sampling frame, it is useful to consider these four patterns one at a time (under-coverage, ineligible units, duplication, and clustering), to think about why each pattern might arise in our survey, and to design the data collection process so that the effect of each is as small as possible.



Under-coverage

Under-coverage is often the mechanism that survey teams and subsequent survey users are most concerned about. There are four commonly used methods for tackling under-coverage: the half-open interval, multiplicity sampling, multiple frame designs, and approaches that increase coverage at the cost of also increasing the number of ineligible units.

The simplest illustration of the half-open interval comes from a household survey. Suppose the sampling frame indicates that there are two houses in a street (A and B), listed in that order. We arrive at the street and find that three new houses have been built between house A and house B. If house A was selected for the survey, then we also survey the residents of every house between house A and house B. House A is thereby linked to all of the unlisted houses up to, but not including, the next listed house, which is what gives the method its name.
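The street example can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the houses found in the field are recorded in the same order as the frame, and the names half_open_interval, new1, new2 and new3 are invented for the example.

```python
def half_open_interval(frame_order, found_in_field, selected):
    """Return the units to survey for one selected frame unit.

    frame_order: units on the sampling frame, in listing order.
    found_in_field: every unit found on the ground, in the same order.
    selected: the frame unit that was chosen for the survey.
    """
    frame_set = set(frame_order)
    start = found_in_field.index(selected)
    units = [selected]
    # Walk forward and stop at the next unit that is already on the frame.
    for unit in found_in_field[start + 1:]:
        if unit in frame_set:
            break
        units.append(unit)
    return units

frame_order = ["A", "B"]
street = ["A", "new1", "new2", "new3", "B"]

for_a = half_open_interval(frame_order, street, "A")
for_b = half_open_interval(frame_order, street, "B")
print(for_a)  # house A plus the three new houses
print(for_b)  # house B only
```

Each new house is attached to exactly one listed house, so every house in the street has a route into the survey without appearing twice.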

In multiplicity sampling, when we approach each store-holder in our original sampling frame we also ask them whether they know of any stores that are not already on our list. We need to be careful when applying this method. Well-known stores are easily added to our list, whereas lesser-known stores are likely to be missed, so our results might still be biased by how well-known different stores are. If we had some way of measuring how well-known a store is then we could account for this “knowledge” in our analysis, but if not then the bias will remain in our data.

Multiple frame designs involve more than one list. For example, we could obtain a list of all stores from the Trade Commission, and then approach another government office (e.g. a tax department) and obtain a second list from them. We would now know about more store-holders, as those missing from one list might appear on the other. The problem is that there will be duplicate records (stores that appear on both lists). These duplicates could be removed if both lists carried a unique piece of information for each store (e.g. if each store had one and only one telephone number). Larger stores might have more than one telephone number, however, so we might not be able to identify all of the duplicates.
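A minimal sketch of combining two lists by telephone number is given below. The store names and telephone numbers are made up, and the function merge_frames is only one simple way of doing this. The example also shows the weakness described above: a store listed under two different numbers is not recognized as a duplicate.

```python
def merge_frames(frame_a, frame_b):
    """Combine two lists of (name, phone) records, one record per phone."""
    merged = {}
    for name, phone in frame_a + frame_b:
        # Keep the first record seen for each telephone number.
        merged.setdefault(phone, name)
    return merged

# Made-up records for illustration only.
trade_list = [("Kina Store", "555-0101"), ("Coast Traders", "555-0102")]
tax_list = [
    ("Kina Store Ltd", "555-0101"),    # same phone: caught as a duplicate
    ("Highlands Supply", "555-0103"),  # new store found on the second list
    ("Coast Traders", "555-0199"),     # second phone: duplicate is missed
]

combined = merge_frames(trade_list, tax_list)
for phone, name in combined.items():
    print(phone, name)
```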

Increasing coverage at the cost of more ineligible units involves starting with a broader definition of what a “store-holder” is. In one survey approach we could ask the Trade Commission for a list of all stores. In a second survey approach we could ask a government department for a list of all businesses, and then, when we approach each business, start by determining whether it matches our definition of a “store-holder”. This second approach would take a lot more time, but we might also identify stores that were not on our original list.


Ineligible units

There might be stores listed in our original sampling frame that do not exist in practice (e.g. they have closed, or there were errors in the government records). Ineligible units can often be removed simply by identifying these stores during the data collection stage of the survey. One thing to keep in mind is that ineligible units in the sampling frame reduce the achieved sample size of the survey (e.g. if we planned on surveying 1000 stores and some turned out to be ineligible, we might end up with data from only 900 stores at the end of the survey). Hence we might decide to collect further data after the first data collection phase.
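To get a feel for the size of that second phase, the sketch below estimates how many extra stores we would need to draw from the list. It assumes that the share of eligible stores in the second phase will be about the same as in the first phase (900 out of 1000 in the example above), which may not hold in practice. The function name additional_draws is made up for this illustration.

```python
import math

def additional_draws(planned, completed, attempted):
    """Estimate the extra frame entries needed to reach the planned sample.

    planned: number of completed surveys we wanted.
    completed: number of eligible stores actually surveyed in phase one.
    attempted: number of frame entries approached in phase one.
    """
    if completed <= 0 or completed > attempted:
        raise ValueError("completed must be between 1 and attempted")
    eligibility_rate = completed / attempted
    shortfall = max(planned - completed, 0)
    # Round up, since we cannot approach a fraction of a store.
    return math.ceil(shortfall / eligibility_rate)

extra = additional_draws(planned=1000, completed=900, attempted=1000)
print(extra)
```

With a shortfall of 100 stores and an eligibility rate of 0.9, we would approach 112 further entries rather than 100, because some of those will be ineligible too.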



Clustering

There might be a number of stores that are listed together as one record in the sampling frame (e.g. they might all share the same workplace). This is similar to doing a household survey and finding several suitable respondents living in the same house. In this situation we could complete a survey with every store present within that workplace (or with every resident of that household).

However, we might find that we do not have enough time or money to conduct all of those surveys. In that case we would record the total number of stores in that workplace (or residents in that household), complete the survey with one or a small number of them, and then account in our analysis for the surveys that we did not complete.
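One common way of accounting for the surveys that were not completed is to give each surveyed store a weight equal to the number of stores in the workplace divided by the number surveyed. The sketch below assumes that the surveyed stores were chosen at random within the workplace. The function name within_cluster_weight is made up for this example.

```python
def within_cluster_weight(units_in_cluster, units_surveyed):
    """Weight for each surveyed unit when only some units in a cluster are surveyed."""
    if units_surveyed < 1 or units_surveyed > units_in_cluster:
        raise ValueError("units_surveyed must be between 1 and units_in_cluster")
    # Each surveyed store stands in for this many stores in the workplace.
    return units_in_cluster / units_surveyed

# A workplace with 5 stores, of which we survey 2 at random.
weight = within_cluster_weight(units_in_cluster=5, units_surveyed=2)
print(weight)
```

Here each of the two surveyed stores represents 2.5 stores in the analysis.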



Duplication

In this case a store might be listed more than once in our sampling frame (e.g. it might be recorded more than once if it has more than one telephone number). There are three approaches to duplication. In the first approach we remove the stores that are listed more than once from our initial sampling frame before collecting any data. In the second approach we remove the duplicates during data collection: for example, the first time that a store is contacted we ask whether it has any other telephone numbers and remove those numbers from our sampling frame, so that the store is only contacted once. In the third approach we only recognize that a store appeared more than once after the data has been collected, in which case we account for the duplication during the data analysis stage.
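For the third approach, one standard adjustment is to weight each store by one over the number of times it appears on the list, since a store listed twice has roughly twice the chance of being contacted. The sketch below assumes that each store appears only once in the final data (any repeated responses have been dropped) and that we know how many listings each store has. The values and the function name are made up for this example.

```python
def duplication_weighted_mean(values, listings):
    """Mean of values, weighting each store by 1 / (number of frame listings)."""
    if any(k < 1 for k in listings):
        raise ValueError("each store must have at least one listing")
    weights = [1 / k for k in listings]
    total = sum(w * v for w, v in zip(weights, values))
    return total / sum(weights)

# Two stores: the second is listed twice, so it gets half the weight.
values = [10, 20]     # e.g. number of employees (made up)
listings = [1, 2]     # times each store appears on the list
mean = duplication_weighted_mean(values, listings)
print(round(mean, 2))
```

The unweighted mean of these two values would be 15, whereas the weighted mean is lower because the store with two listings is down-weighted.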



In this tutorial I have presented some of the key ideas involved in discussing the target population and the sampling frame. In most surveys it is inevitable that there will be some difference between the target population and the sampling frame. However, with careful thought and planning the level of bias can be reduced significantly. No single solution to this source of bias works perfectly in all surveys. Instead, one needs to plan a strategy for accounting for the bias according to the nature of each particular survey.

