MS 150 Statistics Fall 2001 Notes on Chapter One

1.1

Population: The complete group of measurements, observations, objects or people.

Sample: A part of the population. A sample is usually more than five measurements, observations, objects, or people, but always smaller than the complete group.

Examples

We could use the ratio of females to males in this class to estimate the ratio of females to males on campus.  Population: the College students.  Sample: the class.

We could use the average body fat index for a randomly selected group of females on campus to estimate the average body fat index for females in the FSM between the ages of 18 and 22.  Population: Females in the FSM.  Sample: those females on campus that we've measured.

Ways to gather statistics

  1. Sampling: Measure a random subgroup of the population.
  2. Experiments: Form a hypothesis, make a prediction, and design an experiment to test the prediction using a control group and an experimental group, with the groups chosen randomly and the experiment run double blind.  Test the hypothesis for a statistically significant difference.
  3. Simulation: The AIDS simulation is a good example here.
  4. Census: Actually measure the entire population.
  5. Surveys: Give a questionnaire to a random sample.  Voluntary responses tend to be negative.  Watch for hidden bias and unfair questions, such as: Are you the only crazy person in your family?

Generalizing: The process of extending from sample results to population.

Level of measurement | Definition | Examples
Nominal | In name only. | Sorting by categories such as red, orange, yellow, green, blue, indigo, violet.
Ordinal | In rank order. There exists an order, but differences and ratios have no meaning. | Grading systems: A, B, C, D, F. Rating systems: ****, ***, **, *.
Interval | Differences have meaning, but not ratios. There is either no zero or the zero has no mathematical meaning. | The numbering of the years: 2001, 2000, 1999. The year 2000 is 1000 years after 1000 A.D. (the difference has meaning), but it is NOT "twice as many years" (the ratio has no meaning). Someone born in 1998 is eight years younger than someone born in 1990, but they are not half the age of someone born in 999 A.D.
Ratio | Differences and ratios have meaning. There is a mathematically meaningful zero. | Physical quantities: distance, height, speed, velocity, time in seconds, altitude, acceleration, mass. 100 kg is twice as heavy as 50 kg. Ten dollars is 1/10 of $100.

1.2

Random samples

n: a variable that represents any number.  Also the number of elements/objects/people in a sample.  The sample size.

A simple random sample of n measurements from a population is one selected in such a way that every member of the population is equally likely to be selected, and every possible group of n members is equally likely to be the sample.
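As a rough sketch of this definition, Python's random.sample draws n members so that every member has the same chance of being picked. The roster of 30 numbered students and the seed value are made up for illustration and are not class data.

```python
import random

# Hypothetical population: 30 students numbered 1 to 30 (not real data).
population = list(range(1, 31))
n = 5  # sample size

rng = random.Random(150)  # arbitrary seed so the example is repeatable
sample = rng.sample(population, n)  # n distinct members, drawn without replacement

print(sample)
```

Because the draw is made without replacement, no student can appear in the sample twice.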

Ensuring that a sample is random is difficult.  Suppose I want to study how many Pohnpeians own cars.  Would the people I meet and poll on main street in Kolonia be a random sample?  Why or why not?

Computers can generate pseudo-random numbers.  Pseudo means false: the numbers come from a fixed calculation, so they are very close to random but are not truly random.  We will look at computer-generated random numbers later in the course.  They are useful in simulations, not in other situations.
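A small Python sketch can show why computer-generated numbers are only pseudo-random: two generators started from the same seed produce exactly the same "random" die rolls. The seed 2001 and the choice of ten rolls are arbitrary assumptions for illustration.

```python
import random

# Two generators started from the same seed (2001 is an arbitrary choice).
rng_a = random.Random(2001)
rng_b = random.Random(2001)

# Simulate ten rolls of a six-sided die with each generator.
rolls_a = [rng_a.randint(1, 6) for _ in range(10)]
rolls_b = [rng_b.randint(1, 6) for _ in range(10)]

print(rolls_a)
print(rolls_a == rolls_b)  # True: same seed, same sequence
```

A real die would not repeat its sequence on demand, which is the difference between random and pseudo-random.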

Coins and dice can be used to generate random numbers.

Methods of sampling

Stratified sampling

To ensure a "balanced sample": Suppose I want to do a study of the average body fat of young people in the FSM.  If I choose as my sample a random selection of students at the national campus, then I am likely to wind up with Pohnpeians being overrepresented.  The national population is half Chuukese, but the campus population is more than half Pohnpeian.  Hence, in a random selection of 100 students, I am likely to pick too many Pohnpeians.

State | Population | Fractional share of national population (relative frequency) | Number of student seats held by state at the national campus | Fractional share of the national campus student seats
Chuuk | 52870 | 0.50 | 679 | 0.20
Kosrae | 7354 | 0.07 | 316 | 0.09
Pohnpei | 33372 | 0.32 | 2122 | 0.62
Yap | 11128 | 0.11 | 287 | 0.08
Total | 104724 | 1.00 | 3404 | 1.00

The solution is to use stratified sampling.  First I decide I want 100 students who are representative of the four states.  Then I randomly pick 50 Chuukese, 7 Kosraean, 32 Pohnpeian, and 11 Yapese students, and the sample will accurately reflect the makeup of the nation rather than the national campus.  Each state would be considered a single "stratum."
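The allocation above can be checked with a short Python sketch. It uses the population and seat figures from the table. The seat numbers assigned to students and the seed are hypothetical, since no actual roster is given here.

```python
import random

# State populations and campus seats, taken from the table above.
population = {"Chuuk": 52870, "Kosrae": 7354, "Pohnpei": 33372, "Yap": 11128}
seats = {"Chuuk": 679, "Kosrae": 316, "Pohnpei": 2122, "Yap": 287}

sample_size = 100
total = sum(population.values())

# Students per stratum, proportional to each state's share of the nation.
allocation = {state: round(sample_size * count / total)
              for state, count in population.items()}
print(allocation)  # {'Chuuk': 50, 'Kosrae': 7, 'Pohnpei': 32, 'Yap': 11}

# Hypothetical rosters: students in each state numbered 1..seats.
rng = random.Random(1)  # arbitrary seed
sample = {state: rng.sample(range(1, seats[state] + 1), allocation[state])
          for state in allocation}
print({state: len(picked) for state, picked in sample.items()})
```

Rounding happens to give exactly 100 students here. With other numbers the rounded counts may add to slightly more or less than the target and need a small adjustment.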

Systematic sampling

Used where a population is in some sequential order.  A start point must be randomly chosen.  Useful in measuring a timed event.  Never used if there is a cyclical or repetitive nature to a system: if the sample rate matches the cycle rate, then the results are not going to be randomly distributed measurements.
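One way to sketch systematic sampling in Python is to pick a random start and then take every k-th item. The function name systematic_sample, the list of 100 numbered customers, and the seed are assumptions made for illustration. The second call shows the danger with cyclical data: when the step equals the cycle length, every measurement lands on the same point of the cycle.

```python
import random

def systematic_sample(items, n, rng):
    """Take every k-th item after a random start, where k = len(items) // n."""
    k = len(items) // n
    if k < 1:
        raise ValueError("sample size is larger than the population")
    start = rng.randrange(k)  # random start point within the first k items
    return [items[start + i * k] for i in range(n)]

# Hypothetical: 100 customers numbered in order of arrival.
queue = list(range(1, 101))
picked = systematic_sample(queue, 10, random.Random(3))
print(picked)  # ten customers, each 10 places after the previous one

# Cyclical data that repeats every 10 items: the step matches the cycle.
cycle = [i % 10 for i in range(100)]
same_phase = systematic_sample(cycle, 10, random.Random(3))
print(same_phase)  # the same value ten times
```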

Cluster sampling

The population is divided into naturally occurring subunits and then subunits are randomly selected for measurement.  In this method it is important that subunits (subgroups) are fairly interchangeable.  Suppose we want to poll the people of Kitti on whether they would pay for water if water were guaranteed to be clean and available 24 hours a day.  We could cluster by breaking up the population by kosapw, then randomly choose a few kosapws and poll everyone in these kosapws.  The results would probably be generalizable to all of Kitti.
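The two stages of cluster sampling (randomly choose whole subunits, then measure everyone in them) might be sketched as follows. The kosapw labels, the residents, and the seed are placeholders invented for illustration, not real data.

```python
import random

# Hypothetical kosapws and residents, for illustration only (not real data).
kosapws = {
    "Kosapw 1": ["resident 1a", "resident 1b", "resident 1c"],
    "Kosapw 2": ["resident 2a", "resident 2b"],
    "Kosapw 3": ["resident 3a", "resident 3b", "resident 3c", "resident 3d"],
    "Kosapw 4": ["resident 4a", "resident 4b", "resident 4c"],
    "Kosapw 5": ["resident 5a", "resident 5b"],
}

rng = random.Random(24)  # arbitrary seed
chosen = rng.sample(list(kosapws), 2)  # randomly choose whole clusters

# Poll everyone in the chosen clusters and no one outside them.
polled = [person for name in chosen for person in kosapws[name]]

print(chosen)
print(polled)
```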

Convenience sampling

Results or data that are easily obtained are used.  Highly unreliable as a method of getting a random sample.  Often biased.

(P13-14 5,7) (P21-23 5, 13)

Using Excel to generate random numbers

1.3

Graphs and Charts

Bar graphs are also called column charts.  Use Excel to do your graphing.  There is an example in the online spreadsheet.  If a column chart is sorted so that the columns are in descending order, then it is called a Pareto chart.  Pareto charts are useful ways to convey rank order as well as numerical data.

Figure: TOEFL Pareto chart, Spring 2002
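The sorting step that turns a column chart into a Pareto chart can be sketched in Python. The data are the state populations from section 1.2. The text bars and their scale (one # per 5000 people) are my own choice for illustration and stand in for the Excel chart.

```python
# State populations from the stratified sampling table in section 1.2.
population = {"Chuuk": 52870, "Kosrae": 7354, "Pohnpei": 33372, "Yap": 11128}

# Pareto order: categories sorted from largest value to smallest.
pareto_order = sorted(population.items(), key=lambda item: item[1], reverse=True)

for state, count in pareto_order:
    bar = "#" * round(count / 5000)  # crude text bar, one # per 5000 people
    print(f"{state:<8}{count:>7}  {bar}")
```

Reading down the output gives the rank order (Chuuk, Pohnpei, Yap, Kosrae) along with the numerical values.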

Circle graphs ("pie" charts): The whole circle is 100%.  Used when data "adds" to a whole, e.g. state populations add to yield the national population.

Line graph

xy graph: When you have two sets of continuous data (value versus value, no categories), use an xy graph.  Often used in science.

1.4

Histograms and Relative Frequency Distributions

Often used for interval or ratio level measurements. Not possible at nominal or ordinal level. Akin to a "bar graph" (column graph in Excel) but each "column" represents an interval that is the same for each and every column. The original data is gathered into sets or groups of data.  Excel refers to this as putting the data into bins.

How to make a histogram

  1. Find the minimum value of the data set using the MIN function in Excel
  2. Find the maximum value of the data set using the MAX function in Excel
  3. Calculate the range by subtracting the MIN from the MAX:
    range = maximum value - minimum value
  4. Decide on the number of bins you are going to use (also called intervals or classes)
  5. Divide the range by the number of bins to calculate the bin width (or interval width or class width)
  6. Calculate the bin upper limit (also called class upper limit or interval upper limit)
  7. Put the bin upper limits into a column of cells in Excel
  8. Manually tally the data into the frequency column or use the FREQUENCY function to determine the frequencies for each bin.  The bin upper limit is included in each tally.
  9. Create a column chart

Bin Upper Limits | Frequency
=min + bin width | =FREQUENCY(data,bins)
+ bin width |
+ bin width |
+ bin width |
max |

For the female height data:

58, 58, 59.5, 59.5, 60, 60, 60, 60, 60, 61, 61, 61.2, 61.5, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 63, 63, 63, 63.5, 64, 64, 64, 64, 65, 65, 66, 66

Five bins would produce the following results:
Min = 58
Max = 66
Range = 8
Width = 1.6

Height CUL (class upper limit) | Frequency
59.6 | 4
61.2 | 8
62.8 | 13
64.4 | 8
66 | 4
Sum: | 37

Note that 61.2 is INCLUDED in the bin that ends at 61.2.  Excel includes the class upper limit.
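For anyone who wants to check the tally outside Excel, here is a minimal Python sketch of the same steps using the female height data above. It is meant to mimic the way FREQUENCY includes the bin upper limit. The call to round() is my own guard against floating-point drift (for example 61.2 being stored as 61.199999...), not part of the Excel method.

```python
from bisect import bisect_left

# Female height data from above (37 values).
heights = ([58, 58, 59.5, 59.5, 60, 60, 60, 60, 60, 61, 61, 61.2, 61.5]
           + [62] * 12
           + [63, 63, 63, 63.5, 64, 64, 64, 64, 65, 65, 66, 66])

num_bins = 5
low, high = min(heights), max(heights)
width = (high - low) / num_bins  # range divided by the number of bins

# Bin upper limits: min + width, min + 2*width, ..., ending at max.
limits = [round(low + width * i, 10) for i in range(1, num_bins + 1)]
limits[-1] = high

frequency = [0] * num_bins
for h in heights:
    # bisect_left finds the first limit >= h, so a value equal to a
    # limit is counted in the bin that ends at that limit.
    frequency[bisect_left(limits, h)] += 1

print(limits)
print(frequency)  # [4, 8, 13, 8, 4]
```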

There are quirks to the process of making charts in Microsoft Excel, and the quirks vary from one version of Excel to another.  This lab has Excel 95, Excel 97, and Excel 2000 present due to the varying ages of the computers.  Because the quirks are specific to each version, moving from computer to computer will cause confusion.  Hence I recommend you sit at the same computer each day so as to become accustomed to the quirks of your particular version.

A histogram of the height of females in statistics Fall 2001

Note that the gap width on the columns has been set to zero.

Relative Frequency (also known as probability)

Divide each frequency by the sum to get the relative frequency

Height CUL | Frequency | Relative Frequency f/n or P(x)
59.6 | 4 | 0.11
61.2 | 8 | 0.22
62.8 | 13 | 0.35
64.4 | 8 | 0.22
66 | 4 | 0.11
Sum: | 37 | 1.00

The relative frequencies always add to one.  (Rounding causes the values above to add to 1.01; if all the decimal places were used, the relative frequencies would add to one.)
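The effect of rounding can be seen in a short Python sketch that divides each frequency from the height table by the sum. The variable names are my own.

```python
# Frequencies for the five height bins from the table above.
frequency = [4, 8, 13, 8, 4]
n = sum(frequency)  # 37

# Relative frequency f/n for each bin, with all decimal places kept.
relative = [f / n for f in frequency]

# The same values rounded to two decimal places, as shown in the table.
rounded = [round(r, 2) for r in relative]

print(rounded)                 # [0.11, 0.22, 0.35, 0.22, 0.11]
print(round(sum(rounded), 2))  # 1.01
print(round(sum(relative), 10))  # 1.0
```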

The heights of the relative frequency columns add to one, so if each column is given a width of one, the area under the columns is equal to one.

An in-class example from Fall 2001 with mostly integer data:

0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4.5, 5, 5, 5, 6, 6, 7, 8, 9, 10

Five bins

min = 0
max = 10
range = 10
width = 10/5 = 2

Bin Num | Calculation | Bins | Frequency | Relative Frequency f/n or P(x)
1 | min + width | 2 | 4 | 0.20
2 | + width | 4 | 6 | 0.30
3 | + width | 6 | 6 | 0.30
4 | + width | 8 | 2 | 0.10
5 | + width | 10 | 2 | 0.10
Sum: | | | 20 | 1.00

Note that the above does not conform to either standard statistical practice or to Brase and Brase.  The above method is simply the easiest way to produce equal width bins and to conform to Microsoft Excel's inclusion of the class upper limit.

Homework Fall 2001: Using five bins, produce both a frequency histogram and a relative frequency histogram for the following 25 body fat percentages for females in MS 150 and MS 101 Summer and Fall 2001.

95, 112, 113, 116, 116, 117, 120, 123, 125, 125, 126, 126, 127, 127, 128, 130, 132, 132, 132, 134, 143, 147, 149, 152, 160.

Shapes of Distributions

See the BFI tab of the notebook statistics_fall2001.xls to see various shapes of histograms.
