Visualizing data

Graphs and Charts

2.1

The table below includes FSM census 2000 data and student seat numbers for the national site of COM-FSM circa 2004.

State Population (20000 Fractional share of national population (relative frequency) Number of student seats held by state at the national campus Fractional share of the national campus student seats
Chuuk535950.56790.2
Kosrae76860.073160.09
Pohnpei344860.3221220.62
Yap112410.112870.08
 107008134041

Circle or pie charts

In a circle chart the whole circle is 100% Used when data adds to a whole, e.g. state populations add to yield national population.

A pie chart of the state populations:

notes002_pie (12K)

Column charts

Column charts are also called bar graphs. A column chart of the student seats held by each state at the national site:

notes002_column (18K)

Pareto chart

If a column chart is sorted so that the columns are in descending order, then it is called a Pareto chart. Descending order means the largest value is on the left and the values decrease as one moves to the right. Pareto charts are useful ways to convey rank order as well as numerical data.

entrance test Pareto: COMET overall test scores

entrance test Pareto: essay test scores

entrance test Pareto: math test scores

Line graph

A line graph is a chart which plots data as a line. The horizontal axis is usually set up with equal intervals. Line graphs are not used in this course and should not be confused with xy scattergraphs.

XY Scatter graph

When you have two sets of continuous data (value versus value, no categories), use an xy graph. These will be covered in more detail in the chapter on linear regressions.

Histograms and Frequency Distributions

2.2

A distribution counts the number of elements of data in either a category or within a range of values. Plotting the count of the elements in each category or range as a column chart generates a chart called a histogram. The histogram shows the distribution of the data. The height of each column shows the frequency of an event. This distribution often provides insight into the data that the data itself does not reveal. In the histogram below, the distribution for male body fat among statistics students has two peaks. The two peaks suggest that there are two subgroups among the men in the statistics course, one subgroup that is at a healthy level of body fat and a second subgroup at a higher level of body fat.

Bimodal distribution

The ranges into which values are gathered are called bins, classes, or intervals.

Nominal level of measurement

At the nominal level of measurement one can determine the frequency of elements in a category, such as students by state in a statistics course.

State FrequencyRel Freq
Chuuk 6 0.11
Kosrae 6 0.11
Pohnpei310.57
Yap 110.20
Sums: 541,00

nominal (13K)

Ordinal level of measurement

Unbinned

At the ordinal level, a frequency distribution can be done using the rank order, counting the number of elements in each rank order to obtain a frequency. When the frequency data is calculated in this way, the distribution is unbinned, not aggregated.

AgeFrequencyRel Freq
1710.02
1850.1
19140.27
20120.24
2190.18
2210.02
2330.06
2430.06
2510.02
2610.02
2710.02
sums511

ordinal (15K)

Binned

The ranks can be collected together, binned, to reduce the number of rank order categories. in the example below the age data in gathered into two-year cohorts.

AgeFrequencyRel Freq
19200.39
21210.41
2340.08
2540.08
2720.04
Sums:511

ordinal_binned (12K)

Ratio level of measurement

At the ratio level data is always gathered into ranges. At the ratio level, binned histograms are used. Ratio level data is not necessarily in a finite number of ranks as was ordinal data.

The ranges into which data is gathered are defined by a bin lower limit and a bin upper limit. The width is the bin upper limit minus the bin lower limit. In this class histograms are generated using the bin upper limit in conformance with the usage in spreadsheets.

To calculate the bin lower and upper limits the minimum and maximum value in a data set must be determined. Spreadsheets include functions to calculate the minimum value MIN and maximum value MAX in a data set.

=MIN(data)

=MAX(data)

In OpenOffice the MIN and MAX function can take a list of semi-colon separated numbers or a range of cells in a spreadsheet. In statistics a range of cells is the most common input for these functions. When a range of cells is the usual input, this text uses the word "data" to refer to the fact that the range of cells is usually your data! Ranges of cells use two cell addresses separated by a full colon. An example is shown below where the data is arranged in a vertical column from A2 to A42.

=MIN(A2:A42)

How to make a binned histogram at the ratio level

  1. Find the minimum value of the data set using the MIN function in Excel
  2. Find the maximum value of the data set using the MAX function in Excel
  3. Calculate the range by subtracting the MIN from the MAX:
    range = maximum value - minimum value
  4. Decide on the number of bins you are going to use (also called intervals or classes)
  5. Divide the range by the number of bins to calculate the bin width (or interval width or class width)
  6. Calculate the bin upper limit (also called class upper limit or interval upper limit)
  7. Put the bin upper limits into a column of cells in Excel
  8. Manually tally the data into the frequency column to determine the frequencies for each bin. The bin upper limit is included in each tally.
  9. Create a column chart
Bin Upper Limits (BUL)Frequency
=min + bin width
+ bin width
+ bin width
+ bin width
+ bin width = max

For the female height data:

58, 58, 59.5, 59.5, 60, 60, 60, 60, 60, 61, 61, 61.2, 61.5, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 63, 63, 63, 63.5, 64, 64, 64, 64, 65, 65, 66, 66

Five bins would produce the following results:
Min = 58
Max = 66
Range = 66 - 58 = 8
Width = 8/5 = 1.6

CalculationHeight (BUL)Frequency
58 + 1.6 59.64
59.6 + 1.661.28
61.2 + 1.662.813
62.8 + 1.664.48
64.4 + 1.6664
Sum:37

Note that 61.2 is INCLUDED in the bin that ends at 61.2. The bin includes values at the bin upper limit. In other words, a bin includes all values up to and including the bin upper limit. After making the column chart, double click on the columns to open the data series dialog box. Find the Options tab and set the spacing (or gap width) to zero.

A histogram of the height of females in statistics Fall 2001

Note that the spacing or gap width on the columns has been set to zero.

Relative Frequency

Relative frequency is one way to determine a probability.

Divide each frequency by the sum (the sample size) to get the relative frequency

Height BULFrequencyRelative Frequency f/n or P(x)
59.640.11
61.280.22
62.8130.35
64.480.22
6640.11
Sum:371.00

The relative frequency always adds to one (rounding causes the above to add to 1.01, if all the decimal places were used the relative frequencies would add to one.

wpe256.jpg (9439 bytes)

The area under the relative frequency columns is equal to one.

Another example using integers:

0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4.5, 5, 5, 5, 6, 6, 7, 8, 9, 10

Five bins

min = 0
max = 10
range = 10
width = 10/5 = 2

Bin NumCalculationBinsFrequency Relative Frequency f/n or P(x)
1min + width240.20
2+ width460.30
3+ width660.30
4+ width820.10
5>+ width1020.10
Sum:201.00

The above method produces equal width bins and to conforms the inclusion of the class upper limit by spreadsheet packages.

Frequency function

For more advanced spreadsheet users, frequency data can be obtained using the frequency function FREQUENCY. This function is also very useful when working with large data sets. In OpenOffice.org Calc the frequency function is:

=FREQUENCY(DATA;BINS)

In Microsoft Excel the function uses a comma instead of a semi-colon:

=FREQUENCY(DATA,BINS)

DATA refers to the range of cells containing the data, BINS refers to the range of cells containing the bin upper limits.

The data set seen below are the height measurements for 49 female students in statistics courses during two consecutive terms.

The frequency function built into spreadsheets works very differently from all other functions. The frequency function called an "array" function because the function places values into an array of cells. For the function to do this, you must first select the cells into which the function will place the frequency values.

Selecting the cells

With the cells still highlighted, start typing the frequency function.

frequency function

After typing the opening parenthesis, drag and select the data to be binned. If the data is more than can be selected by dragging, type the data range in by hand.

drag select

For OpenOffice.org Calc, type a semi-colon. For Microsoft Office Excel, type a comma.

to semi-colon or to comma, that is the question

Drag and select the bin upper limits.

drag the bin upper limits

Type the closing parenthesis.

close it up

Then press and hold down BOTH the CONTROL (Ctrl) key and the SHIFT key. With both the control and shift keys held down, press the Enter (or Return) key.

Press the magic keys ctrl-shift-enter

As noted above, the frequencies should add to the sample size. When working with spreadsheets, internal rounding errors can cause the maximum value in a data set to not get included in the final bin. In the last bin, use the value obtained by the MAX function and not the previous bin + a width formula to generate that bin upper limit.

Shapes of Distributions

2.3

The shapes of distributions have names by which they are known.

shapeofdists (13K)

One of the aspects of a sample that is often similar to the population is the shape of the distribution. If a good random sample of sufficient size has a symmetric distribution, then the population is likely to have a symmetric distribution. The process of projecting results from a sample to a population is called generalizing. Thus we can say that the shape of a sample distribution generalizes to a population.

Endnote: Creating histograms with spreadsheets

Making histograms with OpenOffice.org Calc

Select both the column with the bins and the column with the frequencies.

notes002_histo_oo_01 (10K)

Click on the chart wizard button and then drag the mouse to place and size the histogram.

notes002_histo_oo_02 (1K)

At the first dialog box be sure to click on the "First column as label" check box as indicated by the arrow in the diagram below.

notes002_histo_oo_03 (8K)

For the next two screens simply click on "Next"

notes002_histo_oo_04 (15K) notes002_histo_oo_05 (14K)

On the next screen fill in the appropriate titles. The legend can be "unchecked" as seen below.

notes002_histo_oo_06 (13K)

When done, click on Create.

notes002_histo_oo_07 (4K)

Double click any column to open up the data series dialog box.

notes002_histo_oo_08 (17K)

Click on the options tab and set the spacing to zero.

notes002_histo_oo_09 (17K)

Click on OK.

notes002_histo_oo_10 (3K)

Making histograms with Microsoft Excel

Select ONLY the column with the column with the frequencies. Click on the chart wizard.

notes002_histo_xl_01 (9K)

Click on next.

notes002_histo_xl_02 (21K)

In step 2 of 4, click on the series tab

notes002_histo_xl_03 (17K)

Click in the Category (X) axis labels text box

notes002_histo_xl_05 (19K)

Select the bin upper limits by dragging with the mouse. Click on next when done.

notes002_histo_xl_06 (32K)

Fill in the appropriate titles and then click on finish.

notes002_histo_xl_07 (18K)

Double click any column to open up the Format Data series dialog box.

notes002_histo_xl_08 (21K)

Click on the options tab and set the gap width to zero.

notes002_histo_xl_09 (15K)

Click on OK.

notes002_histo_xl_10 (4K)

StatisticsLee LingCOMFSM