01 Introduction: Samples and Levels of Measurement

Populations and Samples

1.1

Statistics studies groups of people, objects, or data measurements and produces summarizing mathematical information on the groups. The groups usually do not include all of the possible people, objects, or data measurements. These groups are called samples. The larger collection of people, objects, or data measurements is called the population.

Statistics attempts to predict measurements for a population from measurements made on the smaller sample. For example, to determine the average weight of a student at the college, a study might select a random sample of fifty students to weigh. The measured average weight could then be used to estimate the average weight for all students at the college. The fifty students would be the sample; all students at the college would be the population.

Population: The complete group of elements, objects, observations, or people.

• Parameters: Measurements of the population.

Sample: A part of the population. A sample is usually more than five measurements, observations, objects, or people, and smaller than the complete population.

• Statistics: Measurements of a sample.

Examples

We could use the ratio of females to males in a class to estimate the ratio of females to males on campus. The sample is the class. The population is all students on campus.

We could use the average body fat index for a randomly selected group of females between the ages of 18 and 22 on campus to determine the average body fat index for females in the FSM between the ages of 18 and 22. The sample is those females on campus that we've measured. The population is all females between the ages of 18 and 22 in the FSM.

Measurements are made of individual elements in a sample or population. The elements could be objects, animals, or people.

Sample size n

The sample size is the number of elements or measurements in a sample. The lower case letter n is used for sample size. If the population size is being reported, then an upper case N is used. The spreadsheet function for calculating the sample size is the COUNT function.

=COUNT(data)
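As a rough Python sketch of the ideas so far, the block below builds a made-up population, draws a sample, and counts both. The weights are invented by a random number generator (they are not real student data), and the names `population`, `sample`, `parameter_mean`, and `statistic_mean` are assumptions introduced only for this illustration. Python's `len` plays the role of COUNT.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: invented weights in kg for N = 2000 students.
population = [rng.gauss(65, 12) for _ in range(2000)]
N = len(population)   # population size, upper case N

# A random sample of n = 50 students drawn from that population.
sample = rng.sample(population, 50)
n = len(sample)       # sample size, lower case n

parameter_mean = statistics.mean(population)  # parameter: measured on the population
statistic_mean = statistics.mean(sample)      # statistic: measured on the sample

print(f"N = {N}, population mean = {parameter_mean:.1f} kg")
print(f"n = {n}, sample mean     = {statistic_mean:.1f} kg")
```

The two means will usually be close but not identical, which is why the sample mean is treated as an estimate of the population mean.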

Types of measurement

Qualitative data refers to descriptive measurements, typically non-numerical.

Quantitative data refers to numerical measurements. Quantitative data can be discrete or continuous.

Discrete: A countable or limited number of possible numeric values, with gaps between the possible values.
Continuous: Any value within a range is possible, so there is an infinite number of possible numeric values.

Levels of measurement

Type	Subtype	Level of measurement	Definition	Examples
Qualitative		Nominal	In name only	Sorting by categories such as red, orange, yellow, green, blue, indigo, violet
Quantitative	Discrete	Ordinal	In rank order; there exists an order, but differences and ratios have no meaning	Grading systems: A, B, C, D, F. Sakau market rating system, based on the number of cups until one is pwopihda, ranked from highest to lowest.
	Continuous	Interval	Differences have meaning, but not ratios. There is either no zero or the zero has no mathematical meaning.	The numbering of the years: 2001, 2000, 1999. The year 2000 is 1000 years after 1000 A.D. (the difference has meaning), but it is NOT twice as many years (the ratio has no meaning). Someone born in 1998 is eight years younger than someone born in 1990: 1998 − 1990 = 8. A vase made in 2000 B.C., however, is not twice as old as a vase made in 1000 B.C. The complication is subtle and can stem from two sources: either there is no zero or the zero is not a true zero. The Fahrenheit and Celsius temperature systems both suffer from the latter defect.
	Continuous	Ratio	Differences and ratios have meaning. There is a mathematically meaningful zero.	Physical quantities: distance, height, speed, velocity, time in seconds, altitude, acceleration, mass,... 100 kg is twice as heavy as 50 kg. Ten dollars is 1/10 of $100.
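The difference between the interval and ratio levels can be checked with a small worked example in Python. The two temperatures are invented for illustration, and the kelvin scale is brought in here as an assumption (it is not mentioned in the table) because it is a temperature scale with a true zero.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to kelvin, a scale with a true zero."""
    return celsius + 273.15

cool, warm = 10.0, 20.0  # two invented temperatures in degrees Celsius

# Differences have meaning on an interval scale: warm is 10 degrees above cool.
difference = warm - cool

# The Celsius ratio comes out as 2, but 20 degrees is not "twice as hot".
naive_ratio = warm / cool

# On the kelvin scale the ratio is meaningful, and it is only a little above 1.
true_ratio = celsius_to_kelvin(warm) / celsius_to_kelvin(cool)

print(difference, naive_ratio, round(true_ratio, 3))
```

The ratio changes when the zero point changes, while the difference does not. That is the sense in which ratios have no meaning for interval data.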

Descriptive statistics: Numerical or graphical representations of samples or populations. These can include numerical measures such as the mode, median, mean, and standard deviation, as well as images such as graphs, charts, and visual linear regressions.

Inferential statistics: Using descriptive statistics of a sample to predict the parameters or distribution of values for a population.

Simple random samples

1.2

The number of measurements, elements, objects, or people in a sample is the sample size n. A simple random sample of n measurements from a population is one selected in a way that:

any member of the population is equally likely to be selected.
any sample of a given size is equally likely to be selected.
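A minimal Python sketch of drawing a simple random sample follows. The roster of ID numbers is hypothetical, and `simple_random_sample` is a helper name introduced here for illustration. It relies on `random.sample` from the Python standard library, which selects without replacement.

```python
import random

def simple_random_sample(population, n, rng=random):
    """Return n distinct members; every member has the same chance of selection."""
    return rng.sample(population, n)

# Hypothetical roster: ID numbers 1 to 200 stand in for members of a population.
roster = list(range(1, 201))
chosen = simple_random_sample(roster, 10)
print(chosen)
```

Each run prints a different set of ten ID numbers, since no seed is fixed.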

Ensuring that a sample is random is difficult. Suppose I want to study how many Pohnpeians own cars. Would people I meet/poll on main street Kolonia be a random sample? Why? Why not?

Studies often use random numbers to help randomly select objects or subjects for a statistical study. Obtaining random numbers can be more difficult than one might at first presume.

Computers can generate pseudo-random numbers. "Pseudo" means seemingly random but not truly random. Computer-generated random numbers behave very much like random numbers, but they are produced by a fixed calculation and so are not truly random. Next we will learn to generate pseudo-random numbers using a computer. This section will also serve as an introduction to functions in spreadsheets.

Coins and dice can be used to generate random numbers.

Using a spreadsheet to generate random numbers

This course presumes prior contact with a course such as CA 100 Computer Literacy where a basic introduction to spreadsheets is made.

The random function RAND generates numbers greater than or equal to 0 and less than 1, that is, from 0 up to 0.9999...

=RAND()

The random number function consists of a function name, RAND, followed by parentheses. For the random function nothing goes between the parentheses, not even a space.

To get other numbers the random function can be multiplied by a coefficient. To get whole numbers the integer function INT can be used to discard the decimal portion.

=INT(argument)

The integer function takes an "argument." The argument is a computer term for an input to the function. Inputs could include a number, a function, a cell address, or a range of cell addresses. The following function, when typed into a spreadsheet, will mimic the flipping of a coin. A 1 will be a head, a 0 will be a tail.

=INT(RAND()*2)

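The spreadsheet can be made to display the word "head" or "tail" using the following code in OpenOffice.org Calc. Because CHOOSE counts its choices starting from 1, one is added to the 0 or 1 produced by the coin flip formula, so a 0 still displays "tail" and a 1 still displays "head":

=CHOOSE(INT(RAND()*2)+1;"tail";"head")

Note that in OpenOffice.org Calc the formula uses semi-colons.

A similar function will do the same in Microsoft Excel. The only difference is that Excel uses commas:

=CHOOSE(INT(RAND()*2)+1,"tail","head")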

A single die can also be simulated using the following function:

=INT(6*RAND()+1)
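For comparison, the coin and die formulas can be mirrored in Python. This is only a sketch: `flip_coin` and `roll_die` are names made up for this example, `random.random()` stands in for RAND, and `math.floor` stands in for INT.

```python
import math
import random

def flip_coin(rng=random):
    """Mirror =INT(RAND()*2): returns 0 (tail) or 1 (head)."""
    return math.floor(rng.random() * 2)

def roll_die(rng=random):
    """Mirror =INT(6*RAND()+1): returns a whole number from 1 to 6."""
    return math.floor(6 * rng.random() + 1)

flips = [flip_coin() for _ in range(10)]  # ten simulated coin flips
rolls = [roll_die() for _ in range(10)]   # ten simulated die rolls
print(flips)
print(rolls)
```

In everyday Python one would more likely call `random.randint(1, 6)` for the die. The version above is written to follow the spreadsheet formula step by step.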

To randomly select among a set of student names, the following model can be built upon. Note the formula uses OpenOffice.org Calc semi-colons.

=CHOOSE(INT(RAND()*6+1);"Jen";"John";"Jess";"Jeff";"Jocelyn";"Jim")

To generate another random choice, press the F9 key on the keyboard. F9 forces a spreadsheet to recalculate all formulas.
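The name-picking model can also be sketched in Python, under the assumption that a list lookup is an acceptable stand-in for CHOOSE. The helper name `pick_name` is hypothetical. The one difference worth noticing is that CHOOSE counts its choices from 1, while Python lists count from 0.

```python
import random

names = ["Jen", "John", "Jess", "Jeff", "Jocelyn", "Jim"]

def pick_name(rng=random):
    """Mirror CHOOSE(INT(RAND()*6+1); ...): a whole number 1 to 6 selects a name."""
    index = int(rng.random() * len(names) + 1)  # whole number from 1 to 6
    return names[index - 1]                     # shift by one: lists count from 0

print(pick_name())
```

Calling `pick_name()` again plays the same role as pressing F9 in the spreadsheet.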

Methods of sampling

When practical, feasible, and worth both the cost and effort, measurements are done on the whole population. In many instances the population cannot be measured. Sampling refers to the ways in which random subgroups of a population can be selected. Some of the ways are listed below.

Census: Measurements done on the whole population.

Sample: Measurements of a representative random sample of the population.

Simulation

Today this often refers to constructing a model of a system using mathematical equations and then using computers to run the model, gathering statistics as the model runs.

Stratified sampling

Stratified sampling is used to ensure a balanced sample. Suppose I want to do a study of the average body fat of young people in the FSM. The FSM population is roughly half Chuukese, but the Palikir campus population is more than half Pohnpeian. If I choose as my sample students at the Palikir campus, then I am likely to wind up with Pohnpeians being overrepresented relative to the actual national proportion of Pohnpeians.

State	Population	Fractional share of national population (relative frequency)	Number of student seats held by state at the national campus	Fractional share of the national campus student seats
Chuuk	53595	0.501	679	0.20
Kosrae	7686	0.072	316	0.09
Pohnpei	34486	0.322	2122	0.62
Yap	11241	0.105	287	0.08
Sums:	107008	1.00	3404	1.00

The solution is to use stratified sampling. First I decide I want 100 students that are representative of the four states. Then I can randomly pick 50 Chuukese students, 7 Kosraean, 32 Pohnpeian, and 11 Yapese, and I will accurately reflect the makeup of the nation rather than the national campus. Each state would be considered a single stratum.
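The arithmetic behind the 50, 7, 32, and 11 can be sketched in Python. The state populations and campus seat counts are copied from the table above. The student ID labels, and the names `proportional_allocation`, `rosters`, and `stratified`, are hypothetical stand-ins introduced for this illustration.

```python
import random

# State populations and campus seat counts copied from the table.
state_population = {"Chuuk": 53595, "Kosrae": 7686, "Pohnpei": 34486, "Yap": 11241}
seats = {"Chuuk": 679, "Kosrae": 316, "Pohnpei": 2122, "Yap": 287}

def proportional_allocation(populations, sample_size):
    """Sample slots per stratum, in proportion to each share of the population."""
    total = sum(populations.values())
    return {name: round(sample_size * count / total)
            for name, count in populations.items()}

allocation = proportional_allocation(state_population, 100)
print(allocation)  # {'Chuuk': 50, 'Kosrae': 7, 'Pohnpei': 32, 'Yap': 11}

# Hypothetical rosters: made-up ID labels, one per campus seat in each state.
rosters = {state: [f"{state}-{i}" for i in range(1, count + 1)]
           for state, count in seats.items()}

# Randomly draw the allocated number of students from within each stratum.
stratified = {state: random.sample(rosters[state], allocation[state])
              for state in rosters}
```

Rounding happens to give exactly 100 here. With other numbers the rounded counts may add up to one more or one less than the target, and a count would need adjusting by hand.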

Systematic sampling

Used where a population is in some sequential order. A start point must be randomly chosen. Useful in measuring a timed event. Systematic sampling should not be used if there is a cyclic or repetitive nature to a system: if the sample rate is roughly equal to the cycle rate, then the results are not going to be randomly distributed measurements. For example, suppose one is studying whether the sidewalks on campus are crowded. If one measures only during the time between class periods, when students are moving to their next class, then one would conclude the sidewalks are crowded. If one measured only when classes were in session, then one would conclude that there is no sidewalk crowding problem. This type of problem in measurement occurs whenever a system behaves in a regular, cyclical manner. The solution would be to ensure that the time interval between measurements is random.
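A minimal Python sketch of systematic sampling with a random start point is given below. The list of sixty numbered observations is hypothetical, and `systematic_sample` is a helper name introduced here, not a standard library function.

```python
import random

def systematic_sample(items, step, rng=random):
    """Take every step-th item, beginning at a randomly chosen start point."""
    start = rng.randrange(step)  # random start among the first `step` positions
    return items[start::step]

# Hypothetical ordered population: 60 numbered observations in sequence.
observations = list(range(1, 61))
picked = systematic_sample(observations, 10)
print(picked)
```

Because the step is fixed, this sketch has exactly the weakness described above: if the system repeats every ten observations, every picked value falls at the same point in the cycle.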

Cluster sampling

The population is divided into naturally occurring subunits and then subunits are randomly selected for measurement. In this method it is important that subunits (subgroups) are fairly interchangeable. Suppose we want to poll the people of Kitti on whether they would pay for water if water were guaranteed to be clean and available 24 hours a day. We could cluster by breaking up the population by kosapw and then randomly choose a few kosapws and poll everyone in these kosapws. The results could probably be generalized to all of Kitti.
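Cluster sampling can be sketched in Python as follows. The five clusters and their resident labels are entirely made up (they are not real kosapws or people), and `cluster_sample` is a hypothetical helper name.

```python
import random

def cluster_sample(clusters, k, rng=random):
    """Randomly choose k whole clusters and return every member of each one."""
    chosen = rng.sample(sorted(clusters), k)
    members = [person for name in chosen for person in clusters[name]]
    return chosen, members

# Hypothetical clusters: five made-up subunits with eight labeled residents each.
clusters = {f"cluster_{c}": [f"{c}{i}" for i in range(1, 9)] for c in "ABCDE"}
chosen, members = cluster_sample(clusters, 2)
print(chosen, len(members))
```

The random choice is made among the clusters, not among the individuals. Everyone in a chosen cluster is then measured.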

Convenience sampling

Results or data that are easily obtained are used. This is highly unreliable as a method of getting a random sample. Examples would include a survey of one's friends and family as a sample population, or the surveys that some newspapers and news programs produce where a reporter surveys people shopping in a store.

Experimental Design

1.3

In science, statistics are gathered by running an experiment and then repeating the experiment. The sample is the experiments that are conducted. The population is the theoretically abstract concept of all possible runs of the experiment for all time.

The method behind experimentation is called the scientific method. In the scientific method, one forms a hypothesis, makes a prediction, formulates an experiment, and runs the experiment.

Some experiments involve new treatments. These require the use of a control group and an experimental group, with the groups being chosen randomly and the experiment run double blind. Double blind means that neither the experimenter nor the subjects know which treatment is the experimental treatment and which is the control treatment. A third party keeps track of which is which, usually using number codes. Then the results are tested for a statistically significant difference between the two groups.
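One way random assignment with number codes might be organized is sketched below in Python. This is an illustration under assumptions, not a description of any particular study protocol: the subject labels are hypothetical, and `assign_groups`, `visible`, and `sealed_key` are names invented for this example.

```python
import random

def assign_groups(subjects, rng=random):
    """Randomly split subjects into two groups, identified only by number codes."""
    order = list(subjects)
    rng.shuffle(order)                  # random order decides the groups
    half = len(order) // 2
    codes = list(range(1, len(order) + 1))
    rng.shuffle(codes)                  # shuffled so a code gives no hint of the group
    visible = {}     # subject -> code: what the experimenter and subjects see
    sealed_key = {}  # code -> group: held only by the third party
    for position, (subject, code) in enumerate(zip(order, codes)):
        visible[subject] = code
        sealed_key[code] = "experimental" if position < half else "control"
    return visible, sealed_key

# Hypothetical subject labels S1 to S20.
subjects = [f"S{i}" for i in range(1, 21)]
visible, sealed_key = assign_groups(subjects)
print(visible["S1"])  # the code for subject S1; the group stays hidden in sealed_key
```

Keeping the two dictionaries separate is the point of the design: the group behind a code can only be looked up by whoever holds `sealed_key`.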

Placebo effect: just believing you will improve can cause improvement in a medical condition.

Replication is also important in the world of science. If an experiment cannot be repeated with the same results, then the theory under test is rejected.

Some of the steps in an experiment are listed below:

Identify the population of interest.
Specify the variables that will be measured. Consider protocols and procedures.
Decide on whether the population can be measured or if the measurements will have to be on a sample of the population. If the latter, determine a method that ensures a random sample that is of sufficient size and representative of the population.
Collect the data (perform the experiment).
Analyze the data.
Write up the results and publish! Note directions for future research, note also any problems or complications that arose in the study.

Observational study

Observational studies gather statistics by observing a system in operation, or by observing people, animals, or plants. Data is recorded by the observer. Someone sitting and counting the number of birds that land on or take off from a bird nesting islet on the reef is performing an observational study.

Surveys

Surveys are usually done by giving a questionnaire to a random sample. Voluntary responses tend to be negative. As a result, there may be a bias towards negative findings. Surveys can also carry hidden bias through unfair questions, for example: "Are you the only crazy person in your family?"

Generalizing

Generalizing is the process of extending from sample results to the population. If a sample is a good random sample, representative of the population, then some sample statistics can be used to estimate population parameters. Sample means and proportions can often be used as point estimates of a population parameter.

Although the mode and median, covered in chapter three, do not always predict the population mode and median well, there are situations in which a mode may be used. If a good, random, and representative sample of students finds that the color blue is the favorite color for the sample, then blue is a best first estimate of the favorite color of the population of students or any future student sample.

Favorite colors
Favorite color	Frequency f	Relative Frequency or p(color)
Blue	32	35%
Black	18	20%
White	10	11%
Green	9	10%
Red	6	7%
Pink	5	5%
Brown	4	4%
Gray	3	3%
Maroon	2	2%
Orange	1	1%
Yellow	1	1%
Sums:	91	100%

If the above sample of 91 students is a good random sample of the population of all students, then we could make a point estimate that roughly 35% of the students in the population will prefer blue.
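The point estimate can be reproduced with a short Python sketch. The frequencies are copied from the favorite colors table. The variable names `frequency`, `relative`, `mode`, and `point_estimate` are assumptions made for this illustration.

```python
# Frequencies copied from the favorite colors table.
frequency = {"Blue": 32, "Black": 18, "White": 10, "Green": 9, "Red": 6, "Pink": 5,
             "Brown": 4, "Gray": 3, "Maroon": 2, "Orange": 1, "Yellow": 1}

n = sum(frequency.values())                        # sample size
relative = {color: f / n for color, f in frequency.items()}  # relative frequencies
mode = max(frequency, key=frequency.get)           # most frequent color
point_estimate = relative[mode]                    # sample proportion for the mode

print(n, mode, f"{point_estimate:.0%}")  # 91 Blue 35%
```

The estimate is only as good as the sample: it describes the population of all students only if the 91 students were a good random sample of that population.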

For sighted users, a pie chart is a good way to convey percentage or proportion data.

Statistics • Lee Ling • COMFSM