notes
Statistical Sampling

Statistical Sampling

Contents

Concept

Selecting a subset of data from a population is called sampling. Working with the full population can be prohibitive and it is convenient to work with a smaller set of data. When doing so, the smaller set should be chosen to be “representative” of the population. Exactly what this means leads to different methods of sampling.

In a Spreadsheet

We shall give instructions on how to apply the various sampling methods in a spreadsheet.

We shall assume that our data is in a sheet called unimaginatively Data and within that sheet it lies in the range A7:M190 with headers in row 6. To avoid accidentally modifying the data, we shall do our sampling in a separate sheet (we won’t need the name of that sheet, but Sampling would be a good choice).

We shall select data by selecting rows from this data according to the various schemes. To select contiguous columns, we can specify them using the range syntax:

Data!A7:D190

To select separate columns, we can use the syntax:

{Data!A7:A190, Data!E7:E190}

It will also be useful to copy across the headers, this can be done with the following formula in the first header cell above the selected data (it will expand to fill the other header cells):

={A6:D6}

Or, for separate columns:

={A6,E6}

The key command in each method is the filter command which has the syntax:

=filter(<range>,<criterion>)

The criterion does not have to refer to the range, but they do have to have the same number of rows.

Simple Random Sampling

In simple random sampling each data value is given an equal chance of being chosen and then a sample of a given size is selected.

In a Spreadsheet

  1. Create a list of random numbers

    In the Sampling sheet, we need to create a list of random numbers. It needs to be big enough to encompass the data, but it can be bigger. We will create it in column A.

    1. In cell A1, type the formula =rand().
    2. Copy that formula down to cell A200 (or at least, beyond A190). This creates a column of random numbers. However, there is a problem with it in that every time the spreadsheet is modified, the random numbers are recalculated. The next step fixes this.
    3. Select column A by clicking on the A at the top and copy it to the clipboard.
    4. Click on cell B1. From the Edit -> Paste Special menu, select Paste values only (equivalently, Shift+Ctrl+v). This has the effect of “freezing” the random numbers in column B. The ones in column A will continue to change, so these can be used as a source of new random numbers if needed, simply repeat these last two steps with a different target column.

    We will consider the random numbers in column B of this sheet to be associated with the corresponding rows of data in the Data sheet. (We could have created the random numbers in exactly the right rows, but we didn’t need to.)

  2. Filter the data

    We will use the random numbers in the criterion part of the filter command. We first need to decide on a way of picking a certain quantity of those random numbers. The simplest is to pick the top NN of the numbers (since the numbers were assigned randomly, this will pick NN rows randomly). To do this, we need to know the NNth largest of the random numbers. The formula for this is:

     large(B7:B190,N)

    Note that the range B7:B190 is chosen to match the range of the data in the Data sheet.

    Putting that into the filter command, to select 2525 rows from the first four columns, we type the formula:

     =filter(Data!A7:D190, B7:B190 >= large(B7:B190,25))
  3. Selecting individual columns

    To select non-consecutive columns, we can merge ranges using the syntax:

     {<range one>, <range two>, ...}

    For example:

     =filter({Data!A7:A190,Data!E7:E190}, B7:B190 >= large(B7:B190,25))

Systematic Sampling

In systematic sampling, the method is to select “every NNth item”. More precisely:

  1. Decide how many data points to select and divide the total population size by this number to get the selection period, call this NN.
  2. Choose a start point at random in the first NN and then take every NN data point from there.

For this to work, the data has to be ordered so that every NNth data point makes sense. The ordering doesn’t have to relate to anything in particular.

In a Spreadsheet

  1. Create an ordered list

    We’ll create it in column A of the Sampling sheet. Put 1 in cell A1 and 2 in cell A2. Then select both cells. Drag the little square in the corner of the selection down column A as far as needed (in our example, cell A200 would be plenty).

  2. Filter the data

    To filter the data, we use the command:

     =filter(<range>, <criterion>)

    To determine the criterion, we need to choose the period. We can get the spreadsheet to calculate this. If we want 2525 data points, we divide the total number of data points by 2525 (and round down) using the formula:

     =floor(count(A7:A190)/25)

    Here, count(A7:A190) counts how many rows are in the range while floor rounds down the division.

    We also need the starting point, which is a random number between 00 and one less than this number. The formula for this, assuming the period calculation is in B1, is:

     =randbetween(0,B1 - 1)

    This will change with every update of the spreadsheet, so to “freeze” it, we copy it and use Edit -> Paste Special -> Paste values only. We can overwrite the formula, or paste it into a new cell, say B3.

    The filter command is then

     =filter(Data!A7:D190, mod(A7:A190, B1) = B3)

    The mod command calculates remainders: mod(14,3) is the remainder when 1414 is divided by 33.

Stratified Sampling

Stratified sampling is used when the population naturally divides into different groups (aka strata) which should be proportionately represented in the sample. The method of sampling within each group could be by simple random sampling or systematic sampling.

In a Spreadsheet

As the actual sampling methods are covered above, in this section we will concern ourselves with using the filter command to select a group.

For the purposes of the demonstration, we will assume that column A contains dates and we want to group our data according to month. The command for this is:

=filter(Data!A7:D190,Month(Data!A7:A190) = 5)

This can then be used as the source of data in another sampling.

category: statistics