06 Probability Distributions

Types of probabilities and distributions

6.1

Mathematically equally likely outcomes usually produce symmetric distributions. The probabilities for a single coin or a single die are uniform in shape. The probabilities for multiple coins or dice form a symmetric heap; for coins, this heap is called a binomial distribution. As the number of dice or pennies increases, the distribution approaches a shape we will later learn to call the "normal" distribution.
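As a minimal sketch of the "symmetric heap" for coins, the following Python computes the probability of each possible number of heads when tossing n fair coins. The function name coin_distribution is made up for this illustration, and the sketch assumes fair, independent coins; it does not cover sums of dice.

```python
from math import comb


def coin_distribution(n):
    """Return [P(0 heads), P(1 head), ..., P(n heads)] for n fair coins."""
    # comb(n, k) counts the ways to get k heads; each of the 2**n
    # equally likely outcomes has probability 1 / 2**n.
    return [comb(n, k) / 2**n for k in range(n + 1)]


for n in (1, 2, 4):
    print(n, [round(p, 4) for p in coin_distribution(n)])
```

One coin gives a flat (uniform) shape, while two or more coins pile the probability up in the middle. Each list reads the same forwards and backwards, which is the symmetry described above.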

Distributions based on relative frequencies can have a variety of shapes, symmetrical or non-symmetrical.

The shape of the distribution of a sample often reflects the shape of the distribution of the population. If the sample is a good, random sample, then the shape of the sample distribution is a good predictor of the shape of the population distribution.

Probability Distributions

A probability distribution usually refers to a relative frequency histogram drawn as a line chart.

Both discrete and continuous variables can have a probability distribution. Intervals (or bins or classes) can be constructed, relative frequencies (or probabilities) can be calculated, and a relative frequency histogram can be drawn. If the data is continuous, then a mean can be calculated directly from the original data. There is also a way to recover the mean from the bin values and the probabilities, although this depends on the bin values being treated as part of a continuous distribution. In later chapters the columns of the histogram will be replaced by a line, specifically a "heap" or "mound" shaped line. The diagrams further below show how one might move from a column chart representation of data to a line chart representation.

The following data consists of 39 body fat measurements for female students at the College of Micronesia-FSM Summer 2001 and Fall 2001. Following the table is a relative frequency histogram, the probability distribution for this data.

BFI fem CUL x   Frequency f   Relative Frequency f/n or P(x)
20.1             2            0.05
24.6            12            0.31
29.2            13            0.33
33.7             5            0.13
38.1             7            0.18
Sum (n):        39            1.00
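The relative frequency column can be reproduced with a few lines of Python. This is only a sketch: the variable names are chosen for this illustration, and the numbers are the class upper limits and frequencies copied from the table above.

```python
# Class upper limits and frequencies copied from the table above.
class_upper_limits = [20.1, 24.6, 29.2, 33.7, 38.1]
frequencies = [2, 12, 13, 5, 7]

n = sum(frequencies)  # sample size
relative_frequencies = [f / n for f in frequencies]  # f/n, also written P(x)

for x, f, p in zip(class_upper_limits, frequencies, relative_frequencies):
    print(f"{x:5.1f} {f:3d} {p:5.2f}")
print(f"Sum:  {n:3d} {sum(relative_frequencies):5.2f}")
```

The relative frequencies add to one because each is the frequency divided by the same total n.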

Relative Frequency Histogram for the BFI for 39 female students

The area under the bars is equal to one, the sum of the relative frequencies. The above diagram consists of five discrete classes. Later we will look at continuous probability distributions using lines to depict the probability distribution. Imagine a line connecting the tops of the columns:

[Image: notes05_histo02.jpg, the histogram with a line connecting the tops of the columns]

If the columns are removed and the class upper limits are shifted to where the right side of each column used to be:

[Image: notes05_histo03.jpg, the line chart with the columns removed and an orange vertical line at the mean]

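The orange vertical line has been drawn at the value of the mean. For a roughly symmetric distribution, this line splits the area under the "curve" approximately in half: about half of the females have a body fat measurement less than this value, and about half have a body fat measurement greater than this value. (Strictly, the value that splits the area exactly in half is the median.)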

We could also draw a vertical line that splits the area under the curve such that ten percent of the area is to the left of the new line and ninety percent is to the right of it. This line would be at the value below which only ten percent of the measurements occur.

Calculations of the mean and the standard deviation

6.2

In some situations we have only the intervals and the frequencies but we do not have the original data. In these situations it would be useful to still be able to calculate a mean and a standard deviation for our data.

If we only have the intervals and frequencies, then we can calculate both the mean and the standard deviation from the class upper limits and the relative frequencies. Here are the mean and standard deviation for the sample of 39 female students:

BFI fem CUL x   Frequency f   Relative Frequency f/n or P(x)   Mean μ: ∑(x*P(x))   stdev σ: √(∑((x-μ)²*P(x)))
20.1             2            0.05                             1.03                 4.52
24.6            12            0.31                             7.58                 7.29
29.2            13            0.33                             9.72                 0.04
33.7             5            0.13                             4.32                 2.23
38.1             7            0.18                             6.86                13.56
Sum:            39            1.00                             μ = 29.51           ∑ = 27.64
                                                                                   σ = 5.26
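The two formulas in the column headings, μ = ∑(x*P(x)) and σ = √(∑((x-μ)²*P(x))), can be sketched in Python as follows. The variable names are chosen for this illustration. The class upper limits are typed in as they are displayed in the table, rounded to one decimal place, so the output should be close to, but need not exactly match, the table's 29.51 and 5.26.

```python
from math import sqrt

# Class upper limits (as displayed, rounded to one decimal) and frequencies.
class_upper_limits = [20.1, 24.6, 29.2, 33.7, 38.1]
frequencies = [2, 12, 13, 5, 7]

n = sum(frequencies)
probabilities = [f / n for f in frequencies]  # P(x) = f/n

# Mean: sum of x * P(x) over the classes.
mean = sum(x * p for x, p in zip(class_upper_limits, probabilities))

# Standard deviation: square root of the sum of (x - mean)^2 * P(x).
variance = sum((x - mean) ** 2 * p for x, p in zip(class_upper_limits, probabilities))
stdev = sqrt(variance)

print(f"mean = {mean:.2f}, stdev = {stdev:.2f}")
```

Note that the sum of squared deviations is weighted by P(x), which is the same as dividing by n. That makes this the population form of the standard deviation, not the sample form that divides by n - 1.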

A spreadsheet with the above data is available at:
http://www.comfsm.fm/~dleeling/statistics/statistics_fall2001.xls

Note that the results are not exactly the same as those obtained by analyzing the original data directly. Where we can, we will analyze the original data. This is not always possible. The following table was taken from the 1994 FSM census. Here the data has already been tallied into intervals, and we do not have access to the original data. Even if we did, it would be 102,724 rows, too many for some of the computers on campus.

Age x   Total f   Relative frequency f/n or P(x)   x*P(x)   (x-μ)²*P(x)
 4       14662    0.14                              0.57     57.78
 9       15090    0.15                              1.32     33.58
14       14944    0.15                              2.04     14.90
19       12425    0.12                              2.30      3.17
24        9192    0.09                              2.15      0.00
29        7042    0.07                              1.99      1.63
34        6800    0.07                              2.25      6.46
39        5998    0.06                              2.28     12.93
44        3131    0.03                              1.34     12.05
49        3601    0.04                              1.72     21.70
54        2271    0.02                              1.19     19.74
59        2089    0.02                              1.20     24.74
64        1978    0.02                              1.23     30.62
69        1308    0.01                              0.88     25.65
74        1169    0.01                              0.84     28.31
79         544    0.01                              0.42     15.95
84         313    0.00                              0.26     10.93
89          99    0.00                              0.09      4.06
94          56    0.00                              0.05      2.66
98          12    0.00                              0.01      0.64
Sums:   102724    1.00                             24.12    327.50
                                                   sqrt:     18.10

The mean μ = 24.12
The population standard deviation σ = 18.10
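The census calculation above can be sketched in Python using only the class upper limits and the totals from the table. The name census_1994 is made up for this illustration. Weighting each x by its frequency and dividing by n is the same as multiplying by P(x) = f/n.

```python
from math import sqrt

# (class upper limit, total) pairs copied from the 1994 census table above.
census_1994 = [
    (4, 14662), (9, 15090), (14, 14944), (19, 12425), (24, 9192),
    (29, 7042), (34, 6800), (39, 5998), (44, 3131), (49, 3601),
    (54, 2271), (59, 2089), (64, 1978), (69, 1308), (74, 1169),
    (79, 544), (84, 313), (89, 99), (94, 56), (98, 12),
]

n = sum(f for _, f in census_1994)

# Mean: sum of x * P(x), with P(x) = f / n.
mean = sum(x * f for x, f in census_1994) / n

# Population standard deviation: sqrt of the sum of (x - mean)^2 * P(x).
variance = sum((x - mean) ** 2 * f for x, f in census_1994) / n
stdev = sqrt(variance)

print(f"n = {n}, mean = {mean:.2f}, stdev = {stdev:.2f}")
```

Twenty rows are enough here; the 102,724 original records are never needed once the data has been tallied into classes.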

A spreadsheet with the above data is available from: http://www.comfsm.fm/~dleeling/statistics/statistics.xls

The result is an average age of 24.12 years for a resident of the FSM in 1994 and a standard deviation of 18.10 years. Because the distribution is skewed toward the young, more than half the population of the nation is younger than the mean age! In fact, fully 56% of the nation is 19 or younger. Bear in mind that this 56% is in school. That means we will need new jobs for that 56% as they mature and enter the workplace. On the order of 57,121 new jobs.
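The 56% and 57,121 figures come from adding up the four youngest classes in the 1994 table. A minimal sketch, with variable names chosen for this illustration:

```python
# Totals for the four youngest classes (upper limits 4, 9, 14, 19)
# and the overall total, all copied from the 1994 census table.
young_counts = [14662, 15090, 14944, 12425]
total_population = 102724

aged_19_or_younger = sum(young_counts)
share = aged_19_or_younger / total_population

print(f"{aged_19_or_younger} people, {share:.0%} of the population")
```

This is a cumulative relative frequency: the sum of P(x) for every class up to and including the one with a class upper limit of 19.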

How old are you? Below, at, or above the mean (average)? Do you have a job?

Note we used the class upper limits to calculate the average age. Potentially this inflates the national average by as much as half a class width, or 2.5 years. Taking this into account would yield an average age of 24.12 - 2.5 = 21.62 years.
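The adjustment can be written out as a short derivation. It assumes every class is represented by a value 2.5 years below its class upper limit, which treats each class as exactly five years wide (the last class in the table is narrower, so this is only an approximation):

```latex
\sum (x - 2.5)\,P(x)
  = \sum x\,P(x) - 2.5 \sum P(x)
  = \mu - 2.5 \cdot 1
  = 24.12 - 2.5
  = 21.62
```

Because every x is shifted by the same amount, the deviations (x - μ) do not change, so under this assumption the standard deviation stays at 18.10 years.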

There is one more small complication to consider. Since the population of the FSM is growing, the number of people at each age in years is different across the five year span of the class. The age groups at the bottom of the class (near the class lower limit) are going to be bigger than the age groups at the top of the class (near the class upper limit). This would act to further reduce the average age.

Homework: Use the 2000 Census data to calculate the mean age in the FSM in 2000.

Age   2000
 4    14782
 9    14168
14    14213
19    13230
24     9527
29     7620
34     6480
39     6016
44     5560
49     4650
54     3205
59     1903
64     1733
69     1487
74      993
79     1441

  1. Did the mean age change?
  2. Are you still (below|at|above) the mean age?

Alternate Homework:

Use the following data to calculate the overall grade point average and the standard deviation of the grade point data for the Pohnpeian students at the national campus during the terms Fall 2000 and Spring 2001.

Grade Point Value x   Frequency f   Relative Frequency f/n or P(x)   Mean: ∑(x*P(x))   stdev: √(∑((x-μ)²*P(x)))
4                      851          ______                           ______            ______
3                     1120          ______                           ______            ______
2                     1023          ______                           ______            ______
1                      459          ______                           ______            ______
0                      690          ______                           ______            ______
Sums:                 ______        ______                           ______            ______
                                                                                       Sqrt: ______