Variance

The **variance** of a population, or of a sample drawn from a population, is a measure of how spread out the data is about its mean. The definition differs slightly depending on whether the data is a *population* or a *sample*.

For population data $x_i$, it is defined as:

$\sigma^2_x = \frac1n \sum (x_i - \bar{x})^2$

where $\bar{x}$ is the mean of the population and $n$ is the number of data values.

For sample data $x_i$ with sample mean $\bar{x}$ and $n$ data values, it is defined as:

$s^2_x = \frac1{n-1} \sum (x_i - \bar{x})^2$
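As a minimal sketch, the two definitions can be written directly in Python. The function names and the data set here are illustrative only; the standard library's `statistics.pvariance` and `statistics.variance` are used as a cross-check.

```python
import statistics


def population_variance(data):
    """Population variance: squared deviations from the mean, divided by n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n


def sample_variance(data):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


# Illustrative data: the mean is 5 and the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(data))  # 32 / 8
print(sample_variance(data))      # 32 / 7

# Cross-check against the standard library.
print(statistics.pvariance(data))
print(statistics.variance(data))
```

The only difference between the two functions is the divisor, so for the same data the sample variance is always slightly larger than the population variance.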

The rationale behind the definition is as follows. We want to measure how spread out the data is, so we imagine an extreme situation in which all the data takes the same value. That value would then be the mean, $\bar{x}$. To measure the spread of our data, we measure the distance between our actual data and this extreme case. One standard way to measure the distance between two lists of numbers is to use a generalisation of Pythagoras' theorem, namely: calculate the differences, square them, and then add them up.

There are two extras here. Firstly, the formula in Pythagoras' theorem then takes the square root. This would result in the standard deviation, but the variance is often easier to work with. Secondly, we often want to compare the variance of data sets of different lengths. Dividing the answer by the number of data values, $n$, means that it makes sense to compare the variance of data sets of different sizes. For a sample, dividing by $n - 1$ instead compensates for the fact that the deviations are measured from the sample mean rather than from the true mean of the population.
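The distance interpretation can be illustrated with a short Python sketch. It assumes Python 3.8 or later for `math.dist`, and the data set and variable names are illustrative only.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# The "extreme case": a list of the same length in which every value is the mean.
constant = [mean] * n

# Euclidean (Pythagorean) distance between the actual data and the extreme case.
distance = math.dist(data, constant)

# Squaring the distance undoes the square root; dividing by n then gives
# the population variance.
variance = distance ** 2 / n

# Taking the square root of the variance gives the standard deviation.
std_dev = math.sqrt(variance)

print(variance, std_dev)
```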

The formula for the population variance can be rearranged to:

$\sigma^2_x = \frac1n \sum x_i^2 - \bar{x}^2$

This is often more straightforward to calculate in practice than the formula in the definition.

The sample variance can be similarly rearranged to:

$s^2_x = \frac1{n-1} \sum x_i^2 - \frac{n}{n-1} \bar{x}^2$

These formulae are derived as follows. The derivation uses the fact that $\bar{x} = \frac1n \sum x_i$, so that $\sum x_i = n \bar{x}$, and that the sum of $n$ copies of a number is $n$ times that number.

$\begin{aligned}
\sigma^2_x &= \frac1n \sum (x_i - \bar{x})^2 \\
&= \frac1n \sum (x_i^2 - 2 x_i \bar{x} + \bar{x}^2) \\
&= \frac1n \sum x_i^2 - 2 \bar{x} \frac1n \sum x_i + \frac1n \sum \bar{x}^2 \\
&= \frac1n \sum x_i^2 - 2 \bar{x} \bar{x} + \frac1n n \bar{x}^2 \\
&= \frac1n \sum x_i^2 - 2 \bar{x}^2 + \bar{x}^2 \\
&= \frac1n \sum x_i^2 - \bar{x}^2 \\
s^2_x &= \frac1{n-1} \sum (x_i - \bar{x})^2 \\
&= \frac1{n-1} \sum (x_i^2 - 2 x_i \bar{x} + \bar{x}^2) \\
&= \frac1{n-1} \sum x_i^2 - 2 \bar{x} \frac1{n-1} \sum x_i + \frac1{n-1} \sum \bar{x}^2 \\
&= \frac1{n-1} \sum x_i^2 - 2 \bar{x} \frac{n}{n-1} \bar{x} + \frac1{n-1} n \bar{x}^2 \\
&= \frac1{n-1} \sum x_i^2 - 2 \frac{n}{n-1} \bar{x}^2 + \frac{n}{n-1}\bar{x}^2 \\
&= \frac1{n-1} \sum x_i^2 - \frac{n}{n-1}\bar{x}^2
\end{aligned}$
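As a numerical check on the rearranged formulae, the following sketch compares them with the definitions on a small data set. The function names are illustrative. One caveat worth keeping in mind: in floating-point arithmetic the rearranged form subtracts two numbers that can be large and nearly equal, so it may lose precision when the mean is large compared with the spread.

```python
import math


def population_variance_shortcut(data):
    """Rearranged form: mean of the squares minus the square of the mean."""
    n = len(data)
    mean = sum(data) / n
    return sum(x * x for x in data) / n - mean ** 2


def sample_variance_shortcut(data):
    """Rearranged form of the sample variance."""
    n = len(data)
    mean = sum(data) / n
    return sum(x * x for x in data) / (n - 1) - n / (n - 1) * mean ** 2


data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Values from the definitions, for comparison.
pop_by_definition = sum((x - mean) ** 2 for x in data) / n
sample_by_definition = sum((x - mean) ** 2 for x in data) / (n - 1)

print(math.isclose(population_variance_shortcut(data), pop_by_definition))
print(math.isclose(sample_variance_shortcut(data), sample_by_definition))
```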

If the data is given in a frequency table, where data point $x_i$ occurs with frequency $f_i$, then $n = \sum f_i$ and $\bar{x} = \frac1n \sum x_i f_i$, and the formulae are:

$\begin{aligned}
\sigma^2_x &= \frac1n \sum (x_i - \bar{x})^2 f_i \\
&= \frac1n \sum x_i^2 f_i - \bar{x}^2 \\
s^2_x &= \frac1{n-1} \sum (x_i - \bar{x})^2 f_i \\
&= \frac1{n-1} \sum x_i^2 f_i - \frac{n}{n-1} \bar{x}^2
\end{aligned}$
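A minimal sketch of the frequency-table calculation, assuming the table is supplied as two parallel lists. The function name `frequency_variances` and the table itself are illustrative.

```python
def frequency_variances(values, freqs):
    """Return (population variance, sample variance) for a frequency table."""
    n = sum(freqs)  # total number of data values
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    # Each squared deviation is counted f times.
    sum_sq = sum((x - mean) ** 2 * f for x, f in zip(values, freqs))
    return sum_sq / n, sum_sq / (n - 1)


# Illustrative table: the value 4 occurs three times, 5 occurs twice, and so on.
values = [2, 4, 5, 7, 9]
freqs = [1, 3, 2, 1, 1]

pop_var, samp_var = frequency_variances(values, freqs)
print(pop_var, samp_var)
```

This table is the list `[2, 4, 4, 4, 5, 5, 7, 9]` in compressed form, so the results agree with those obtained by applying the definitions to the expanded list.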

If the data has been put into a grouped frequency table, the exact values are no longer known, but an estimate for the variance can be calculated by using the midpoint of each class in place of $x_i$ in the frequency table formulae.
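The midpoint estimate might be sketched as follows. The class boundaries, the frequencies and the function name are made up for illustration, and the result is only an estimate because every value in a class is treated as if it sat at the class midpoint.

```python
def grouped_variance_estimate(classes, freqs):
    """Estimate the population variance from a grouped frequency table.

    classes is a list of (lower, upper) class boundaries and freqs holds
    the frequency of each class.
    """
    midpoints = [(lower + upper) / 2 for lower, upper in classes]
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    return sum((m - mean) ** 2 * f for m, f in zip(midpoints, freqs)) / n


# Illustrative grouped table with classes 0-10, 10-20 and 20-30.
classes = [(0, 10), (10, 20), (20, 30)]
freqs = [2, 5, 3]

estimate = grouped_variance_estimate(classes, freqs)
print(estimate)  # midpoints 5, 15, 25 give an estimated mean of 16
```

For the sample version, the final division would be by `n - 1` instead of `n`.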

In a spreadsheet, the formulae to calculate the variance of a range are:

```
=varp(<range>)
=var(<range>)
```

The `p` suffix is for the *population* version; without it, the formula gives the *sample* version.