Open data exploration

Starting location           Runner              Time (min)
Spanish wall                Maclino Ardos       13
Beyond Panasang             Magdano Marquez     13
Base Mount Dolon            Jasper Ponapart     13
Daniel store COMFSM         Xavier Edwin        15
Dien Churchill driveway     Jason Ernest        21
Pwudoi                      Marlon Etnold       18
Ricky Jano Store            Dionisio Augustine  17
Lehn Mesi                   Floriano Ponapart   14
Pwok/Jeds gas               Relo-liza Saimon    20
Enipein power bridge        X-Ler Rodriguez     15
Wone/Rohi bridge            RC Lopez            16
Madolehnihmw border         Dana Lee Ling       26
Hill past Soisi residence   Hendy Ardos         13
Parau church                McWelton Gomez      13
Lester Ezekias residence    Justin Rodriguez    21
Borbert Albert residence    Adson Dadius        18
Top of hill to Mesihsou     McCaffrey Gilmete   17
Wensner John Laundry        Diony Setik         15
Ahlo kapw                   Mars Gilmete        13
Miler Benjamin residence    Eugene Amor         16
Simon Kihleng residence     Oneil Cantero       16
Gilmete residence           Vicky Nick          17
Adams apartments            Marino Ardos        12
4TY                         Mike Laurdine       13
Perman hut Lidakihda        Robert Nakasone*    13

In the "real world" no one is likely to ask you to calculate a mean, make a box plot, or run a t-test. More often, someone comes to you with some data and some questions. You then have to decide how to analyze the data and what the results of that analysis mean statistically. This first example comes from a round island relay run in March 2013.

On 30 March, 25 runners from election district three ran the round island relay race. Each runner ran two miles. This open data exploration is based on the estimated time in minutes for each runner. Time-stamped photographs taken that day provided rough estimates of each runner's time. This data set is an exercise in data exploration from the field of exercise sport science.

Imagine that you are the team statistician. Your job is to provide statistical information in a report of use to the coach. There are plans to run this race again with teams invited from abroad. Based on the time data, what useful statistics can you report? The coach wants to be able to make decisions on who should start, who should anchor, who should be retained, and who should be replaced and why. Provide a complete statistical report on the data. For any decisions, cite statistical support. Look also for unusual data, if any, reporting the unusual data and why that data is unusual. Provide statistical values that support your recommendations.
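One possible starting point for the coach's questions is simply to rank the runners from fastest to slowest and attach a z-score to each time. The sketch below does that with the table data. The variable names (`legs`, `ranked`) are illustrative only, and ranking by raw time is an assumption about what the coach cares about: it does not account for any differences between the legs of the course.

```python
import statistics

# (starting location, runner, estimated time in minutes), copied from the table.
legs = [
    ("Spanish wall", "Maclino Ardos", 13),
    ("Beyond Panasang", "Magdano Marquez", 13),
    ("Base Mount Dolon", "Jasper Ponapart", 13),
    ("Daniel store COMFSM", "Xavier Edwin", 15),
    ("Dien Churchill driveway", "Jason Ernest", 21),
    ("Pwudoi", "Marlon Etnold", 18),
    ("Ricky Jano Store", "Dionisio Augustine", 17),
    ("Lehn Mesi", "Floriano Ponapart", 14),
    ("Pwok/Jeds gas", "Relo-liza Saimon", 20),
    ("Enipein power bridge", "X-Ler Rodriguez", 15),
    ("Wone/Rohi bridge", "RC Lopez", 16),
    ("Madolehnihmw border", "Dana Lee Ling", 26),
    ("Hill past Soisi residence", "Hendy Ardos", 13),
    ("Parau church", "McWelton Gomez", 13),
    ("Lester Ezekias residence", "Justin Rodriguez", 21),
    ("Borbert Albert residence", "Adson Dadius", 18),
    ("Top of hill to Mesihsou", "McCaffrey Gilmete", 17),
    ("Wensner John Laundry", "Diony Setik", 15),
    ("Ahlo kapw", "Mars Gilmete", 13),
    ("Miler Benjamin residence", "Eugene Amor", 16),
    ("Simon Kihleng residence", "Oneil Cantero", 16),
    ("Gilmete residence", "Vicky Nick", 17),
    ("Adams apartments", "Marino Ardos", 12),
    ("4TY", "Mike Laurdine", 13),
    ("Perman hut Lidakihda", "Robert Nakasone", 13),
]

times = [minutes for _, _, minutes in legs]
mean = statistics.mean(times)
sx = statistics.stdev(times)  # sample standard deviation

# Fastest first. sorted() is stable, so tied runners keep their running order.
ranked = sorted(legs, key=lambda leg: leg[2])

for location, runner, minutes in ranked:
    z = (minutes - mean) / sx  # negative z means faster than the team mean
    print(f"{runner:<20} {minutes:>3} min   z = {z:+.2f}")
```

Because many runners share the same estimated time, the ranking has large ties; any recommendation built on it should say so.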

* State senator Robert Nakasone was slated to run anchor and did run anchor. Due to a close finish, however, another runner also ran the final two miles, carrying the District 3 sash across the finish line for the win. Other members of the team joined in for the final two miles, following the lead District 3 runner. The team also included Marson Etnol, Mihter Wendolin, Anderson Ponapart, Amanda Wendolin, and Pauraheko Ardos. Images and notes from the run are available in a WordPress blog article on the round-Pohnpei relay run.

Analysis suggestions. Start by determining the level of measurement. The level of measurement tells you which statistics you can calculate for the sample. The time data is ratio level data. Calculate the sample size, the minimum, and the maximum. Calculate the measures of the middle: the mode (if any), the median, and the mean. Find the first and third quartiles. Make a box plot. Look for outliers on a Gnumeric box plot, using the option that displays outliers. If you find outliers, calculate the z-scores for those outliers to determine whether the box plot outliers are also z-score outliers. Make a frequency histogram. You pick the number of classes. Calculate the 95% confidence interval for the mean.
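The suggestions above can also be sketched outside a spreadsheet. The Python below is a minimal sketch using only the standard library, not the method prescribed by this text. It rests on several assumptions that are mine rather than the source's: the inclusive quartile method (chosen to agree with a spreadsheet QUARTILE function), the common 1.5 × IQR fence rule for box plot outliers, five histogram classes (the text leaves the number to you), and a t-critical value of 2.0639 for 24 degrees of freedom taken from a standard t-table.

```python
import bisect
import math
import statistics

# Estimated times in minutes, in table order.
times = [13, 13, 13, 15, 21, 18, 17, 14, 20, 15, 16, 26, 13,
         13, 21, 18, 17, 15, 13, 16, 16, 17, 12, 13, 13]

# Sample size, range, and measures of the middle.
n = len(times)
minimum, maximum = min(times), max(times)
modes = statistics.multimode(times)   # list of the most frequent value(s)
median = statistics.median(times)
mean = statistics.mean(times)
sx = statistics.stdev(times)          # sample standard deviation

# Quartiles. ASSUMPTION: "inclusive" is used to match spreadsheet QUARTILE;
# other quartile methods can give different values.
q1, _, q3 = statistics.quantiles(times, n=4, method="inclusive")
iqr = q3 - q1

# Box plot outliers. ASSUMPTION: the 1.5 * IQR fence rule.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = sorted(t for t in times if t < low_fence or t > high_fence)

# z-scores for the box plot outliers; the cutoff for "unusual" is left to you.
z_scores = {t: (t - mean) / sx for t in outliers}

# Frequency table for a histogram. ASSUMPTION: five equal-width classes,
# each counting values up to and including its class upper limit.
classes = 5
width = (maximum - minimum) / classes
upper_limits = [minimum + width * (i + 1) for i in range(classes - 1)]
upper_limits.append(maximum)          # avoid float round-off at the top class
frequencies = [0] * classes
for t in times:
    frequencies[bisect.bisect_left(upper_limits, t)] += 1

# 95% confidence interval for the mean.
# ASSUMPTION: t-critical for df = n - 1 = 24 read from a t-table.
T_CRITICAL = 2.0639
margin = T_CRITICAL * sx / math.sqrt(n)
ci_low, ci_high = mean - margin, mean + margin

print("n, min, max:", n, minimum, maximum)
print("mode, median, mean:", modes, median, round(mean, 2))
print("Q1, Q3, IQR:", q1, q3, iqr)
print("box plot outliers and z-scores:", z_scores)
for limit, count in zip(upper_limits, frequencies):
    print(f"class upper limit {limit:5.1f}: {count}")
print(f"95% CI for the mean: {ci_low:.2f} to {ci_high:.2f}")
```

The five-number summary printed here (minimum, Q1, median, Q3, maximum) is what a box plot draws. The hard-coded t-critical value only applies to a sample of 25; for a different sample size it has to be looked up again.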