Measures of Variability

Location measures do not provide a proper summary of the nature of a data set.
We can not make meaningful conclusion without considering sample variability. Example:
- Each data set contains two samples and the difference in the means is roughly the same.
- Data set B provides sharper distinction between two populations.

**Figure 1.7:** Different data sets. Difference in the means is roughly the same.
$\includegraphics[width=3.5in]{figures/01-05}$

The simplest measures of sample variability is the sample range $X_{max}-X_{min}$ .
The range can be very useful in statistical quality control.
The question ``How far is each x from the mean?''
The one that is used most often is the sample standard deviation.
The sample variance, denoted by $\sigma^2$

$\displaystyle \sigma^2=\sum_{i=1}^n\frac{(x_i-\bar{x})^2}{n-1}$
The sample standard deviation, denoted by $\sigma$ , is the positive square root of $\sigma^2$ , that is

$\displaystyle \sigma=\sqrt{\sigma^2}$
The sample variance is measured in squared units. The sample standard deviation is in linear units.

For a bell-shaped distribution,
- within one standard deviation of the mean there will be approximately (empirically) 68% of the data;
- within two standard deviations of the mean there will be approximately 95% of the data;
- within three standard deviations of the mean there will be approximately 99.7% of the data.
That is,

$\displaystyle x \pm \sigma \approx 68\%, x \pm 2\sigma \approx 95\%, x \pm 3\sigma \approx 99.7\%.$
This is a rule of thumb. Since the range $R \approx 6\sigma$ , the rule is also called the $6\sigma$ -rule.
An observation beyond ( $x - 2\sigma, x + 2\sigma$ ) can be declared as an outlier.

The quantity is called the degrees of freedom associated with the variance estimate.
It depicts the number of independent pieces of information available for computing variability. Only terms can vary freely.
In general,

$\displaystyle \sum_{i=1}^n(x_i-\bar{x})=0$
The computation of a sample variance does not involve independent squared deviations from the mean. For example,
- for the data set (5, 17, 6, 4), the sample mean is 8.
- The variance is
  
  $\displaystyle (5-8)^2+(17-8)^2+(6-8)^2+(4-8)^2$
  
  $\displaystyle =(-3)^2+(9)^2+(-2)^2+(-4)^2$
- The quantities inside parentheses sum to zero.

In statistical inference, we like to draw conclusions about characteristics of populations, called population parameters.
Population mean and population variance are two important parameters.
The sample variance is used to draw inferences about the population variance.
The sample standard deviation and the sample mean are used to draw inferences about the population mean.
In general, the variance is considered more in inferential theory, while the standard deviation is used more in applications.

Cem Ozdogan 2012-02-15