Fall 2001 Statistics MS 150 Notes on Chapter Two of Brase and Brase

MS 150 Statistics Fall 2001 Notes on Chapter Two

2.1 Measures of central tendency: mode, median, mean

Mode

The mode is the value that occurs most frequently in the data. Excel can determine the mode with the function MODE.

= MODE(data)

In the Fall of 2000 the statistics class gathered data on the number of siblings for each member of the class. One student was an only child and had no siblings. One student had 13 brothers and sisters. The complete data set is as follows:

1,2,2,2,2,2,3,3,4,4,4,5,5,5,7,8,9,10,12,12,13

The mode is 2 because 2 occurs more often than any other value. Where there is a tie there is no mode

For the age data 18,19,19,20,20,21,21,21,21,22,22,22,22,23,23,24,24,25,25,26 there is no mode: there is a “tie” between 21 and 22, hence there no single must frequent value. Excel will, however, report a mode of 21 in this case. Excel appears to pick the smallest value in the tie.

If all values appear only once, then there is no mode. Excel will display #N/A.

Median

The median is the central (or “middle”) value in a data set. If a number sits at the middle, then it is the median. If the middle is between two numbers, then the median is half way between the numbers.

For the sibling data 1,2,2,2,2,2,3,3,4,4,|4|,5,5,5,7,8,9,10,12,12,13 the median is 4.

Note the data must be in order (sorted) before you can find the median. For the data 2, 4, 6, 8 the median is 5: (4+6)/2.

The median function in Excel is MEDIAN.

= MEDIAN(data)

Mean (average)

The mean, also called the arithmetic mean and also called the average, is calculated mathematically by adding the values and then dividing by the number of values (the sample size n). If the mean is the mean of population, then it is called the population mean m. If it is the mean of the sample, then it is the sample mean .

mean = sum of data/sample size = Sx/n

The sum of the data Sx can be determined using the Excel function =SUM(data). The sample size n can be determined using =COUNT(data). Thus =SUM(data)/COUNT(data) will calculate the mean. There is also a single function that calculates the mean. The Excel function that directly calculates the mean is AVERAGE

= AVERAGE(data)

Resistant measures: One that is not influenced by extremely high or less data values. Median tends to be more resistant than mean. Trimmed mean: Top 5% bottom 5% passed then mean calculated .

If the mean is measured using the whole population then this would be the population mean. If the mean was calculated from a sample then the mean is the sample mean. Mathematically there is no difference in the way the population and sample mean are calculated.

Differences in the Distribution of Data

Consider the following data:

	A	B	C	D	mean	median	mode	max	min	range	pop stdev s	pop var	samp stdev sx	samp var	CV
Data set 1	5	5	5	5	5	5	5	5	5	0	0	0	0	0	0
Data set 2	2	4	6	8	5	5	#N/A	8	2	6	2.2361	5	2.5820	6.6667	0.5164
Data set 3	2	2	8	8	5	5	#N/A	8	2	6	3	9	3.4641	12	0.6928

Neither the mode, median, nor the mean reveal clearly the differences in the distribution of the data above. The mean and the median are the same for each data set. The mode is the same as the mean and the median for the first data set and is unavailable for the last data set (Excel will report a mode of 2 for the last data set). A single number that would characterize how much the data is spread out would be useful.

Range

The range is one way to capture the spread of the data. The range is calculated by subtracting the smallest value from the largest value. In Excel:

=MAX(data) - MIN(data)

The range still does not characterize the difference between set 2 and 3: the last set has more data further away from the center of the distribution. The range misses this difference.

To capture the spread of the data we use a measure related to the average distance of the data from the mean. We call this “the standard deviation”. If we have a population, we report this average distance as the population standard deviation. If we have a sample, then our average distance value may underestimate the actual population standard deviation. As a result we adjust the result mathematically to be slightly larger. For our purposes these numbers are calculated using Excel functions.

Population standard deviation s

= STDEVP(data)

Sample standard deviation sx

=STDEV(data)

Note that 5,5,5,5 has “no” standard deviation, while the bimodal distribution 2,2,8,8, has the largest standard deviations. Set (3) has the “most spread” in its data. The standard deviation successfully captures this spread.

In some instances we report the square of the standard deviation. The square of the standard deviation is called the variance.

Coefficient of variation CV

The Coefficienct of variation is calculated by dividing the standard deviation (usually the sample standard deviation) by the mean.

= STDEV(data)/AVERAGE(data)

Note that the CV can be expressed as a percentage: “Group 2 has a CV of 52% while group 3 has a CV of 69%”. A deviation of 3.46 is large for a mean of 5 (3.46/5 = 69%) but would be small if the mean were 50 (3.46/50 = 7%). So the CV can tell us how important is the standard deviationrelative to the mean.

Statistics home
Lee Ling home
COM-FSM home page