Selecting a subset of data from a population is called sampling. Working with the full population can be prohibitive and it is convenient to work with a smaller set of data. When doing so, the smaller set should be chosen to be “representative” of the population. Exactly what this means leads to different methods of sampling.
We shall give instructions on how to apply the various sampling methods in a spreadsheet.
We shall assume that our data is in a sheet called unimaginatively Data
and within that sheet it lies in the range A7:M190
with headers in row 6
. To avoid accidentally modifying the data, we shall do our sampling in a separate sheet (we won’t need the name of that sheet, but Sampling
would be a good choice).
We shall select data by selecting rows from this data according to the various schemes. To select contiguous columns, we can specify them using the range syntax:
Data!A7:D190
To select separate columns, we can use the syntax:
{Data!A7:A190, Data!E7:E190}
It will also be useful to copy across the headers, this can be done with the following formula in the first header cell above the selected data (it will expand to fill the other header cells):
={A6:D6}
Or, for separate columns:
={A6,E6}
The key command in each method is the filter
command which has the syntax:
=filter(<range>,<criterion>)
The criterion
does not have to refer to the range
, but they do have to have the same number of rows.
In simple random sampling each data value is given an equal chance of being chosen and then a sample of a given size is selected.
Create a list of random numbers
In the Sampling
sheet, we need to create a list of random numbers. It needs to be big enough to encompass the data, but it can be bigger. We will create it in column A
.
A1
, type the formula =rand()
.A200
(or at least, beyond A190
). This creates a column of random numbers. However, there is a problem with it in that every time the spreadsheet is modified, the random numbers are recalculated. The next step fixes this.A
by clicking on the A
at the top and copy it to the clipboard.B1
. From the Edit -> Paste Special
menu, select Paste values only
(equivalently, Shift+Ctrl+v
). This has the effect of “freezing” the random numbers in column B
. The ones in column A
will continue to change, so these can be used as a source of new random numbers if needed, simply repeat these last two steps with a different target column.We will consider the random numbers in column B
of this sheet to be associated with the corresponding rows of data in the Data
sheet. (We could have created the random numbers in exactly the right rows, but we didn’t need to.)
Filter the data
We will use the random numbers in the criterion
part of the filter
command. We first need to decide on a way of picking a certain quantity of those random numbers. The simplest is to pick the top $N$ of the numbers (since the numbers were assigned randomly, this will pick $N$ rows randomly). To do this, we need to know the $N$th largest of the random numbers. The formula for this is:
large(B7:B190,N)
Note that the range B7:B190
is chosen to match the range of the data in the Data
sheet.
Putting that into the filter
command, to select $25$ rows from the first four columns, we type the formula:
=filter(Data!A7:D190, B7:B190 >= large(B7:B190,25))
Selecting individual columns
To select non-consecutive columns, we can merge ranges using the syntax:
{<range one>, <range two>, ...}
For example:
=filter({Data!A7:A190,Data!E7:E190}, B7:B190 >= large(B7:B190,25))
In systematic sampling, the method is to select “every $N$th item”. More precisely:
For this to work, the data has to be ordered so that every $N$th data point makes sense. The ordering doesn’t have to relate to anything in particular.
Create an ordered list
We’ll create it in column A
of the Sampling
sheet. Put 1
in cell A1
and 2
in cell A2
. Then select both cells. Drag the little square in the corner of the selection down column A
as far as needed (in our example, cell A200
would be plenty).
Filter the data
To filter the data, we use the command:
=filter(<range>, <criterion>)
To determine the criterion, we need to choose the period. We can get the spreadsheet to calculate this. If we want $25$ data points, we divide the total number of data points by $25$ (and round down) using the formula:
=floor(count(A7:A190)/25)
Here, count(A7:A190)
counts how many rows are in the range while floor
rounds down the division.
We also need the starting point, which is a random number between $0$ and one less than this number. The formula for this, assuming the period calculation is in B1
, is:
=randbetween(0,B1 - 1)
This will change with every update of the spreadsheet, so to “freeze” it, we copy it and use Edit -> Paste Special -> Paste values only
. We can overwrite the formula, or paste it into a new cell, say B3
.
The filter command is then
=filter(Data!A7:D190, mod(A7:A190, B1) = B3)
The mod
command calculates remainders: mod(14,3)
is the remainder when $14$ is divided by $3$.
Stratified sampling is used when the population naturally divides into different groups (aka strata) which should be proportionately represented in the sample. The method of sampling within each group could be by simple random sampling or systematic sampling.
As the actual sampling methods are covered above, in this section we will concern ourselves with using the filter
command to select a group.
For the purposes of the demonstration, we will assume that column A
contains dates and we want to group our data according to month. The command for this is:
=filter(Data!A7:D190,Month(Data!A7:A190) = 5)
This can then be used as the source of data in another sampling.