08 Sampling Distribution of the Mean

8.1 Distribution of Statistics

As noted in earlier chapters, statistics are the measures of a sample. The measures are used to characterize the sample and to infer measures of the population termed parameters.

Parameter

A parameter is a numerical description of a population. Examples include the population mean μ and the population standard deviation σ.

Statistic

A statistic is a numerical description of a sample. Examples include the sample mean x̄ and the sample standard deviation sx.

Good samples are random samples in which every member of the population is equally likely to be selected and every possible sample of a given size n is equally likely to be selected. Consider four samples selected from a population. The samples need not be mutually exclusive as shown; they may include elements of other samples.

Venn diagram of four samples from a population

The sample means x̄1, x̄2, x̄3, x̄4 will include a smallest sample mean and a largest sample mean. Choosing a number of bins then generates a histogram for the sample means. The question this chapter answers is whether the distribution of sample means from a population can take any shape or has a specific shape.

Sampling Distribution of the Mean

The distribution of the sample mean does not take just any shape. For good random samples with a sample size of 30 or more, the distribution of the sample mean is approximately a normal distribution. That is, if you repeatedly take random samples of 30 or more elements from a population, calculate each sample mean, and then create a relative frequency distribution for the means, the resulting distribution will be close to normal.

In the following diagram the underlying data is bimodal and is depicted by the light blue columns. Thirty data elements were sampled forty times and forty sample means were calculated. A relative frequency histogram of the sample means is plotted in a heavy black outline. Note that though the underlying distribution is bimodal, the distribution of the forty means is heaped and close to symmetrical. The distribution of the forty sample means is approximately normal.

The center of the distribution of the sample means is, theoretically, the population mean. To put this another, simpler way: the average of the sample averages is the population mean. More precisely, the average of the sample averages approaches the population mean as the number of sample averages approaches infinity.

histogram distribution
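The experiment described above can be imitated with a short simulation. This is only a sketch under stated assumptions: the bimodal population here is made up (two humps centered near 20 and 60), the seed is arbitrary, and all variable names are illustrative rather than taken from the text. It draws forty samples of thirty elements and compares the center and spread of the forty sample means with those of the population.

```python
import random
import statistics

random.seed(8)  # arbitrary fixed seed so the sketch is repeatable

# Hypothetical bimodal population: two humps centered near 20 and 60.
population = ([random.gauss(20, 4) for _ in range(5000)]
              + [random.gauss(60, 4) for _ in range(5000)])

sample_size = 30    # data elements in each sample
sample_count = 40   # number of samples, hence number of sample means

# One sample mean per random sample drawn from the population.
sample_means = [
    statistics.mean(random.sample(population, sample_size))
    for _ in range(sample_count)
]

population_mean = statistics.mean(population)
mean_of_means = statistics.mean(sample_means)

print(f"population mean:       {population_mean:.2f}")
print(f"mean of sample means:  {mean_of_means:.2f}")
print(f"population stdev:      {statistics.pstdev(population):.2f}")
print(f"stdev of sample means: {statistics.stdev(sample_means):.2f}")
```

The mean of the forty sample means should land near the population mean, and the sample means should be far less spread out than the bimodal data they were drawn from.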

Another Example (2002)

Consider a population consisting of 61 body fat measurements for women at the COM-FSM national campus:

15.6, 18.9, 20, 20.3, 20.6, 20.8, 21.9, 22.1, 22.2, 22.2, 22.4, 22.7, 22.8, 22.8, 23.5, 23.5, 23.6, 23.8, 23.9, 24.3, 24.4, 25.2, 25.2, 25.5, 25.6, 26.1, 26.2, 27.3, 27.5, 27.8, 27.9, 28, 28, 28.1, 28.1, 28.3, 28.4, 29.2, 29.3, 29.3, 29.5, 29.8, 30.5, 31.1, 31.6, 32.9, 34, 34.4, 34.9, 35.5, 35.8, 35.9, 36, 37.5, 38.2, 38.8, 40, 40.8, 44.1, 47, 50.1

Consider those measurements as being the total population. The population mean (a parameter) for the above data is 28.7. The distribution of those measurements using an eight bin histogram is seen below.

Bin     Freq   RelFreq
19.9       2      0.03
24.2      17      0.28
28.5      18      0.30
32.9       8      0.13
37.2       8      0.13
41.5       5      0.08
45.8       1      0.02
50.1       2      0.03
Sum       61      1.00

Histogram of the 61 body fat measurements

The distribution is skewed right, as seen above.
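The population mean and the eight bin frequency table can be reproduced with a few lines of code. The sketch below assumes equal-width bins running from the smallest to the largest measurement, which is consistent with the bin values in the table; the variable names are illustrative.

```python
import statistics

# The 61 body fat measurements, treated as the whole population.
body_fat = [
    15.6, 18.9, 20, 20.3, 20.6, 20.8, 21.9, 22.1, 22.2, 22.2, 22.4, 22.7,
    22.8, 22.8, 23.5, 23.5, 23.6, 23.8, 23.9, 24.3, 24.4, 25.2, 25.2, 25.5,
    25.6, 26.1, 26.2, 27.3, 27.5, 27.8, 27.9, 28, 28, 28.1, 28.1, 28.3,
    28.4, 29.2, 29.3, 29.3, 29.5, 29.8, 30.5, 31.1, 31.6, 32.9, 34, 34.4,
    34.9, 35.5, 35.8, 35.9, 36, 37.5, 38.2, 38.8, 40, 40.8, 44.1, 47, 50.1,
]

population_mean = statistics.mean(body_fat)

# Assumption: eight equal-width bins spanning minimum to maximum.
bin_count = 8
low, high = min(body_fat), max(body_fat)
width = (high - low) / bin_count

frequencies = [0] * bin_count
for value in body_fat:
    # The maximum value would compute to index 8, so cap it at the last bin.
    index = min(int((value - low) / width), bin_count - 1)
    frequencies[index] += 1

relative_frequencies = [f / len(body_fat) for f in frequencies]

print(f"population mean: {population_mean:.1f}")
for i, (freq, rel) in enumerate(zip(frequencies, relative_frequencies)):
    upper = low + (i + 1) * width  # upper limit of bin i
    print(f"{upper:7.2f} {freq:4d} {rel:6.2f}")
```

The long tail of small frequencies in the upper bins is what makes the distribution skewed right.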

If we were doing a statistical study, we would measure a random sample of women from the population and calculate the mean body fat for our sample. Then we would use our sample statistic (our sample mean) to estimate the population parameter (the population mean). Understanding the SHAPE of the distribution of many sample means is a key to using a single sample mean (a statistic) to estimate the population mean (a parameter).

The table that follows consists of ten randomly selected samples from the population and the means for each sample. Each sample has a size of n=10 women. The bottom row is the mean of each sample.

Smpl 1  Smpl 2  Smpl 3  Smpl 4  Smpl 5  Smpl 6  Smpl 7  Smpl 8  Smpl 9  Smpl 10
  40.8      40    20.3    24.3    21.9    44.1    22.8    22.1    34.4     50.1
  40.8    38.2    27.3    25.2    28.3    38.2      20    29.5    20.8     29.2
    34    27.5      28    35.9    27.9    29.2    38.8    25.6    31.6     35.5
  26.1    35.5      40    23.9    23.8    22.8    24.4    22.2    38.2     28.3
  20.3    27.5    34.9    27.8    32.9    20.6    29.8    27.3    28.1     22.8
  25.2    32.9      34    23.6    29.3    25.6    38.2    27.8    20.3     20.3
  30.5    25.6    29.3    35.5    22.4    27.8    26.2    30.5    22.7     24.4
  37.5      40    23.9    29.5    28.4    24.4    29.2      36    31.1       36
    40    34.4      28    23.6    27.8    31.1    25.2    20.8      47       34
  15.6    27.3    20.8    31.6    35.8      28    35.8    31.1    22.2     22.4
 31.08   32.89   28.65   28.09   27.85   29.18   29.04   27.29   29.64     30.3

The mean of the values in the last row is 29.4. This could be called the mean of the sample means! A histogram can be used to show the distribution of these sample means. These frequencies and relative frequencies are in the two rightmost columns of the table below.

Bin     Freq   RelFreq   AvgDist   RFavg
19.9       2      0.03         0     0.0
24.2      17      0.28         0     0.0
28.5      18      0.30         3     0.3
32.9       8      0.13         6     0.6
37.2       8      0.13         1     0.1
41.5       5      0.08         0     0.0
45.8       1      0.02         0     0.0
50.1       2      0.03         0     0.0
Sum       61      1.00        10    1.00
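The sample means, the mean of the sample means, and the frequencies of the sample means can be recomputed from the ten samples. In the sketch below each inner list is one column of the sample table, and the bins are assumed to be the same eight equal-width bins used for the population (15.6 to 50.1). Variable names are illustrative.

```python
import statistics

# Each inner list is one sample (one column of the sample table), n = 10.
samples = [
    [40.8, 40.8, 34, 26.1, 20.3, 25.2, 30.5, 37.5, 40, 15.6],    # Smpl 1
    [40, 38.2, 27.5, 35.5, 27.5, 32.9, 25.6, 40, 34.4, 27.3],    # Smpl 2
    [20.3, 27.3, 28, 40, 34.9, 34, 29.3, 23.9, 28, 20.8],        # Smpl 3
    [24.3, 25.2, 35.9, 23.9, 27.8, 23.6, 35.5, 29.5, 23.6, 31.6],  # Smpl 4
    [21.9, 28.3, 27.9, 23.8, 32.9, 29.3, 22.4, 28.4, 27.8, 35.8],  # Smpl 5
    [44.1, 38.2, 29.2, 22.8, 20.6, 25.6, 27.8, 24.4, 31.1, 28],  # Smpl 6
    [22.8, 20, 38.8, 24.4, 29.8, 38.2, 26.2, 29.2, 25.2, 35.8],  # Smpl 7
    [22.1, 29.5, 25.6, 22.2, 27.3, 27.8, 30.5, 36, 20.8, 31.1],  # Smpl 8
    [34.4, 20.8, 31.6, 38.2, 28.1, 20.3, 22.7, 31.1, 47, 22.2],  # Smpl 9
    [50.1, 29.2, 35.5, 28.3, 22.8, 20.3, 24.4, 36, 34, 22.4],    # Smpl 10
]

sample_means = [statistics.mean(sample) for sample in samples]
mean_of_means = statistics.mean(sample_means)

# Assumption: reuse the population's eight equal-width bins.
low, high, bin_count = 15.6, 50.1, 8
width = (high - low) / bin_count

mean_frequencies = [0] * bin_count
for m in sample_means:
    index = min(int((m - low) / width), bin_count - 1)
    mean_frequencies[index] += 1

print("sample means:", [round(m, 2) for m in sample_means])
print(f"mean of the sample means: {mean_of_means:.1f}")
print("frequencies of the sample means by bin:", mean_frequencies)
```

All ten sample means fall in just three of the eight bins, which is the tight clustering about the population mean that the table shows.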

Note that the sample means are clustered tightly about the population mean. This can be seen below where the sample mean distribution is superimposed on (placed on top of!) the population distribution.

Histogram of the sample means superimposed on the population histogram

The Shape of the Sample Mean Distribution is Normal!

The sample mean distribution is heap shaped, like the normal distribution, and centered on the population mean.

If the sample size is 30 or more, then the sample means are approximately NORMALLY distributed even when the underlying data is NOT normally distributed! If the sample size is less than 30, then the distribution of the sample means is normal if and only if the underlying data is normally distributed.
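The effect of sample size can be explored with a simulation. The following sketch is an illustration under assumptions that are not in the text: the population is a made-up, strongly right-skewed (exponential) one, the sample sizes 2, 10, and 30 are chosen for contrast, and skewness (the average cubed z-score, which is 0 for a symmetric distribution) stands in as a rough measure of how far the sample means are from a symmetric heap.

```python
import random
import statistics

random.seed(30)  # arbitrary fixed seed so the sketch is repeatable


def skewness(values):
    """Average cubed z-score: near 0 for a symmetric, heap-shaped distribution."""
    center = statistics.mean(values)
    spread = statistics.pstdev(values)
    return sum(((v - center) / spread) ** 3 for v in values) / len(values)


def skewness_of_sample_means(sample_size, sample_count=4000):
    """Draw many samples from a right-skewed population; return the skewness of their means."""
    means = []
    for _ in range(sample_count):
        # expovariate(1.0) draws from a strongly right-skewed population.
        sample = [random.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return skewness(means)


skew_by_size = {n: skewness_of_sample_means(n) for n in (2, 10, 30)}
for n, skew in skew_by_size.items():
    print(f"n = {n:2d}: skewness of the sample means = {skew:.2f}")
```

The skewness of the sample means shrinks toward zero as the sample size grows, though it does not vanish entirely at n = 30. The value 30 is a working rule of thumb rather than a sharp boundary.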

The normal distribution of the sample means (averages) allows us to use our normal distribution probabilities to estimate a range for μ. The mean of the sample means is a point estimate for the population mean μ.

The mean of the sample means can be written as:

Mean of the sample means: $\mu_{\bar{x}}$

In this text the above is sometimes written as μx̄

The value of the mean of the sample means μx̄ is, for a very large number of samples, the population mean. As a practical matter we use the mean of a single large sample. How large? As a working rule, the sample size must be at least n = 30 in order for the sample mean (a statistic) to be a good estimate for the population mean (a parameter). This requirement helps ensure that the distribution of the sample means will be approximately normal even when the underlying data is not normal. If we are certain the data is normally distributed, then a sample size n of less than 30 is acceptable.

Later in the course we will modify the normal distribution to handle samples of sizes less than 30 for which the distribution of the underlying data is either unknown or not normal. This modification will be called the student's t-distribution. The student's t-distribution is also heap-shaped.

The normal distribution, and later the student's t-distribution, will be used to quote a range of possible values for a population mean based on a single sample mean. Knowing that the sample mean comes from a heap-shaped distribution of all possible means, we will center the normal distribution at the sample mean and then use the area under the curve to estimate the probability (confidence) that we have "captured" the population mean in that range.

8.2 Central Limit Theorem

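The Central Limit Theorem is the theory that says that, for increasingly large sample sizes n, the distribution of the sample mean approaches ever closer a normal distribution centered on the population mean, whatever the shape of the underlying data. The related statement that "for increasingly large sample sizes n, the sample mean approaches ever closer the population mean" is known as the law of large numbers.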

Standard Error

The standard deviation of the distribution of the sample means

There is one complication: the sample standard deviation of a single sample is not a good estimate of the standard deviation of the sample means. Note that the distribution of the sample means is NARROWER than the distribution of the underlying data in the above example. The distribution of the sample means is narrower and taller than the distribution of the underlying data. In the diagram, the shape of the underlying data is normal; the taller, narrower distribution is the distribution of all the sample means for all possible samples.

Standard Error

The standard deviation of a single sample has to be reduced to reflect this. This reduction turns out to be inversely related to the square root of the sample size. This is not proven here in this text.

The standard deviation of the distribution of the sample means is equal to the actual population standard deviation divided by the square root of n.

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
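The relationship stated above, that the standard deviation of the sample means equals σ divided by the square root of n, is not proven in this text. For readers who want the reasoning, a brief sketch follows. It assumes the n values in a sample are independent draws from a population with standard deviation σ, and it uses notation introduced only for this sketch (x_i for the individual sample values and Var for variance, the square of the standard deviation). The key fact is that the variances of independent values add.

```latex
\begin{aligned}
\operatorname{Var}(\bar{x})
  &= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)
   = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(x_i)
   = \frac{n\,\sigma^{2}}{n^{2}}
   = \frac{\sigma^{2}}{n}, \\
\sigma_{\bar{x}}
  &= \sqrt{\operatorname{Var}(\bar{x})}
   = \frac{\sigma}{\sqrt{n}}.
\end{aligned}
```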

The standard deviation divided by the square root of the sample size is called the standard error of the mean.
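A simulation can make the standard error of the mean concrete. The sketch below treats the 61 body fat measurements from earlier in the chapter as the population, draws many samples of size n = 10, and compares the standard deviation of the resulting sample means with σ divided by the square root of n. The assumptions are mine, not the text's: sampling is done with replacement (so that the simple formula applies without any adjustment for a small population), 5000 samples are drawn, and the seed and variable names are arbitrary.

```python
import math
import random
import statistics

random.seed(61)  # arbitrary fixed seed so the sketch is repeatable

# The 61 body fat measurements, treated as the whole population.
body_fat = [
    15.6, 18.9, 20, 20.3, 20.6, 20.8, 21.9, 22.1, 22.2, 22.2, 22.4, 22.7,
    22.8, 22.8, 23.5, 23.5, 23.6, 23.8, 23.9, 24.3, 24.4, 25.2, 25.2, 25.5,
    25.6, 26.1, 26.2, 27.3, 27.5, 27.8, 27.9, 28, 28, 28.1, 28.1, 28.3,
    28.4, 29.2, 29.3, 29.3, 29.5, 29.8, 30.5, 31.1, 31.6, 32.9, 34, 34.4,
    34.9, 35.5, 35.8, 35.9, 36, 37.5, 38.2, 38.8, 40, 40.8, 44.1, 47, 50.1,
]

n = 10
sigma = statistics.pstdev(body_fat)      # population standard deviation
predicted = sigma / math.sqrt(n)         # standard error of the mean

# Assumption: sample WITH replacement so that sigma / sqrt(n) applies directly.
sample_means = [
    statistics.mean(random.choices(body_fat, k=n)) for _ in range(5000)
]
observed = statistics.pstdev(sample_means)

print(f"population standard deviation:        {sigma:.2f}")
print(f"predicted standard error (sigma/√n):  {predicted:.2f}")
print(f"observed stdev of the sample means:   {observed:.2f}")
```

The observed spread of the sample means should agree closely with the predicted standard error, and both should be much smaller than the population standard deviation.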

If σ is known, then the above formula can be used and the distribution of the sample mean is normal.

As a practical matter, since we rarely know the population standard deviation σ, we will use the sample standard deviation sx in class to estimate the standard deviation of the sample means. This formula will then appear in various permutations in formulas used to estimate a population mean from a sample mean. When we use the sample standard deviation sx we will use the student's t-distribution. The student's t-distribution looks like a normal distribution. The student's t-distribution, however, is adjusted to be a more accurate predictor of the range for a population mean. Later we will learn to use the student's t-distribution. Until that time we will play a little fast and loose and use sample standard deviations to calculate the standard error of the mean.

$\sigma_{\bar{x}} \approx \frac{s_x}{\sqrt{n}}$
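As a worked example of this estimate, the sketch below uses the ten values of Smpl 1 from the body fat table earlier in the chapter. It computes the sample standard deviation sx and divides by the square root of n. The variable names are illustrative.

```python
import math
import statistics

# Smpl 1 from the body fat sample table, n = 10.
sample_1 = [40.8, 40.8, 34, 26.1, 20.3, 25.2, 30.5, 37.5, 40, 15.6]

n = len(sample_1)
sample_mean = statistics.mean(sample_1)
sx = statistics.stdev(sample_1)       # sample standard deviation (n - 1 divisor)
standard_error = sx / math.sqrt(n)    # estimated standard error of the mean

print(f"n = {n}")
print(f"sample mean x̄ = {sample_mean:.2f}")
print(f"sample standard deviation sx = {sx:.2f}")
print(f"estimated standard error sx/√n = {standard_error:.2f}")
```

For Smpl 1 this works out to a sample standard deviation of about 9.03 and an estimated standard error of about 2.86. With a sample this small, the student's t-distribution discussed above is the appropriate tool for turning the standard error into a range for the population mean.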