November 30, 2011

"Tidy Data" (Next Week at the Statistics Seminar)

For those of us who use R all the time, next week's speaker needs no introduction. I can't make the kids in statistical computing attend next week's seminar, but I probably ought to.

Hadley Wickham, "Tidy Data"
Abstract: It's often said that 80 percent of the effort of analysis is spent just getting the data ready to analyze, the process of data cleaning. Data cleaning is not only a vital first step, but it is often repeated multiple times over the course of an analysis as new problems come to light. Despite the amount of time it takes up, there has been little research on how to clean data well. Part of the challenge is the breadth of activities that cleaning encompasses, from outlier checking to data parsing to missing value imputation. To get a handle on the problem, this talk focuses on a small, but important, subset of data cleaning that I call data "tidying": getting the data in a format that is easy to manipulate, model, and visualize.
In this talk you'll see some of the crazy data sets that I've struggled with over the years, and learn the basic tools for making messy data tidy. I'll also discuss tidy tools, tools that take tidy data as input and return tidy data as output. The idea of a tidy tool is useful for critiquing existing R functions, and will help to explain why some tasks that seem like they should be easy are in fact quite hard. This work ties together reshape2, plyr and ggplot2 with a consistent philosophy of data. Once you master this data format, you'll find it much easier to manipulate, model and visualize your data.
Time and place: 4--5 pm on Monday, 5 December 2011, in Doherty Hall A310
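
For anyone wondering what "tidying" might look like in practice, here is a minimal sketch of one common step: reshaping a wide table, where several columns hold the same kind of measurement, into a long one with a single row per observation. It is written in plain Python rather than R, and everything in it (the `melt` helper, the column names, the numbers) is made up for illustration. It is not code from the talk, and it is not the reshape2 implementation, though reshape2 does provide a function called `melt` for this kind of reshaping.

```python
def melt(rows, id_vars, var_name="variable", value_name="value"):
    """Reshape wide rows (a list of dicts) into long form.

    Each output row keeps the identifier columns and holds exactly one
    measured value, labeled by the column it came from.
    """
    tidy = []
    for row in rows:
        ids = {key: row[key] for key in id_vars}
        for column, value in row.items():
            if column in id_vars:
                continue  # identifier columns are copied, not melted
            tidy.append({**ids, var_name: column, value_name: value})
    return tidy


# Hypothetical messy table: one row per person, one column per treatment.
messy = [
    {"person": "A", "treatment_a": None, "treatment_b": 2},
    {"person": "B", "treatment_a": 16, "treatment_b": 11},
    {"person": "C", "treatment_a": 3, "treatment_b": 1},
]

# Long version: one row per (person, treatment) pair.
tidy = melt(messy, id_vars=["person"], var_name="treatment", value_name="result")

for record in tidy:
    print(record)
```

The long form has six rows, one per person-treatment pair, with the missing measurement kept as an explicit `None` rather than silently dropped. Filtering, grouping, and plotting by treatment then need no special handling for which column a value happened to live in.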

