Data Wrangling R | RStudio Webinar - 2016

Transcript

This transcript was generated automatically and may contain errors.

So I've prepared some slides sort of off the cuff for the webinar today, and it's all about data wrangling. And one thing I'll point out after I introduce myself: I'm Garrett Grolemund, I work for RStudio, and I write books on R and teach people how to program with R. All the slides I'm going to show you today are available online. They're available right now, so you can download them now or later. They're at this link at the bottom of the slide, and I put that link in the slide template, so it'll show up on every slide. If you want to follow me on Twitter, there's my Twitter address, and also my email address if you have any questions you'd like to follow up with.

This webinar will go through two packages that Hadley Wickham, my colleague, made, and they're both really geared towards working with the structure of data. So that's the tidyr package and the dplyr package. And what I'm going to cover today closely follows a cheat sheet that we published a couple of days ago. This is a cheat sheet you can download at the link at the bottom of the slide here, and it's a two-page sheet that just summarizes the tidyr package and the dplyr package. So it's a great resource for remembering the functions that we're going to go through today, and also for just remembering functions in general as you work with data.

Ground rules: tibbles and the pipe operator

So before we really get into data wrangling, there are a couple of ground rules that I want to familiarize you with. These two packages introduce some things into R that make R work better, but also make R look a little different. So I just want to make sure you're comfortable with that before we start using them. And the first is the tbl structure, or tibble. And what a tbl is, I'm going to say "tibble" because I don't like saying "tbl", but it's basically just a data frame. You can think of it as a data frame that appears differently in your console window.

So for example, if I go over here to my R window, this is just an RStudio window that I put off on the side here, and I open up a familiar library like ggplot2, there's a data set in here called diamonds, which is humongous. And if you try to look at this data frame, well, this is what happens: it fills up my screen, and at some point R tells me that it's not going to show the rest of the data. This data set is 52,000 rows, and really what I do see here isn't very helpful. A, it fills up my memory buffer, which means I can't see what I did before, and B, I can't even see the names of this data frame. So what a tibble is, is a new class that you can give to a data structure like diamonds.

And it's implemented through the dplyr package. So I can change diamonds into a tibble with this function, tbl_df. And now I have the same data frame here, but the print method is only going to show me the part of the data frame that fits in my console window. So it's showing me right now that there's a variable called y and a variable called z in the diamonds data frame that went off the side of the window. Instead of wrapping them below, R now is just going to tell me these variables are here but they're not shown. And then instead of showing me 52,000 rows, it's just going to show me 10 rows. So this is a prettier way to look at your data.
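The conversion described above can be sketched roughly as follows. This is only an illustration: the webinar used dplyr's tbl_df(), which recent dplyr releases deprecate in favor of as_tibble(), so the sketch assumes as_tibble() from the tibble package. It also assumes ggplot2 is installed, and it first converts diamonds to a plain data frame, because newer ggplot2 versions may already ship diamonds as a tibble.

```r
# Start from a plain data frame so the contrast with a tibble is visible
# (assumption: ggplot2 is installed; its diamonds data may already be a tibble).
diamonds_df <- as.data.frame(ggplot2::diamonds)

# Convert to a tibble. The webinar used dplyr::tbl_df(diamonds);
# tibble::as_tibble() is assumed here as the current equivalent.
diamonds_tbl <- tibble::as_tibble(diamonds_df)

class(diamonds_df)   # "data.frame"
class(diamonds_tbl)  # "tbl_df" "tbl" "data.frame"

# Printing the tibble shows only the first rows and the columns that fit
# in the console, and lists the remaining columns by name.
print(diamonds_tbl)
```

The data itself is unchanged by the conversion. Only the class, and therefore the print method, differs.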

So there's a function I recommend using when you want to look at the entire data set, and that's the View function. It's in base R, and if you call it from RStudio, RStudio will open up a spreadsheet-like view window where you can check out your data set, almost as if it were an Excel document. Keep in mind that's View with a capital V, and you can use this on any data frame that you have in R.

Then there's one last function that really changes how R looks, and that's the pipe operator, %>%. This comes from the magrittr package, but it's imported by the dplyr package, and it's a different way to write the same code that you'd write before. So you can probably recognize what this command would do here. We haven't looked at select. We'll look at that today, but it's calling select on an object called tb, and then it has some arguments here. The pipe operator allows you to pipe in the first argument of select, so I'd write tb %>% select(child:elderly), and what the operator will do is insert tb as the first argument of select, so these two lines of code would do the same thing.
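A minimal sketch of the two equivalent calls, assuming dplyr is installed. The tb data frame below is a hypothetical stand-in with made-up values, not the data set shown on the slide.

```r
library(dplyr)

# Hypothetical stand-in for the `tb` object on the slide (values are made up).
tb <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Brazil"),
  year    = c(1999, 2000, 1999),
  child   = c(10, 12, 30),
  adult   = c(100, 120, 300),
  elderly = c(5, 6, 15)
)

# Ordinary call: tb is written as the first argument of select().
nested <- select(tb, child:elderly)

# Piped call: %>% inserts tb as the first argument of select().
piped <- tb %>% select(child:elderly)

# Both forms return the same three columns.
identical(nested, piped)
```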

So at this point, it might not be obvious why you'd use the pipe over anything else, but the cool thing about this format is that you can start chaining functions together. As your chains get longer and longer, this becomes much more efficient than actually managing where you save the in-between states.
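To make the point about in-between states concrete, here is a small sketch that writes the same three steps both ways. It assumes dplyr is installed. The tb data frame, its values, and the intermediate names (tb_1999, tb_counts) are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical stand-in data (values are made up).
tb <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Brazil"),
  year    = c(1999, 2000, 1999),
  child   = c(10, 12, 30),
  adult   = c(100, 120, 300),
  elderly = c(5, 6, 15)
)

# Without the pipe: every in-between state needs its own name.
tb_1999   <- filter(tb, year == 1999)
tb_counts <- select(tb_1999, country, child:elderly)
step_wise <- mutate(tb_counts, total = child + adult + elderly)

# With the pipe: the same steps read top to bottom, with nothing to save.
chained <- tb %>%
  filter(year == 1999) %>%
  select(country, child:elderly) %>%
  mutate(total = child + adult + elderly)

identical(step_wise, chained)
```

Both versions produce the same result. The piped version simply avoids naming and tracking the intermediate objects.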

What is data wrangling?

So let's take a look at the functions that can actually help you wrangle data, and what I mean by data wrangling is what other people call munging or transforming your data or manipulating it, and the reason I use the word wrangling is it sort of captures how painful this process can be. So there's an article we link to in the registration email from the New York Times that said that about, you know, 50 to 80 percent of the data scientist's time is spent doing things like munging and wrangling their data. It also offered a new word for data wrangling, which was data janitor work, which I found amusing. So I don't know where they got this statistic from, but I think a lot of people would say that just getting the format of your data into a format that you can work with is time-consuming, and it's often boring and painful, and if you could do that more efficiently, that would be a big win, and the functions we'll look at today will help you do that.


Normally when you wrangle your data, you have two goals. First, you might need to make your data set suitable to a particular piece of software. For example, we're going to need to make our data suitable to R because that's what we're using, and then second, you can actually reveal information by changing the format and the structure of your data.

The magic happens when you combine a grouped data frame with summarize, or mutate, or filter, or whatnot. What dplyr will do is apply summarize in a group-wise fashion, or mutate in a group-wise fashion, and so on, where it makes sense.

You'll notice that the three variables I calculated with summarize, mean, sum, and n, are in the final data set, but to those, dplyr has added the city variable, and that variable came from the grouping criteria. It's necessary for dplyr to add each variable that's involved in the grouping process, so we know what the values of mean, sum, and n refer to.
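A sketch of the kind of grouped summary described above, assuming dplyr is installed. The pollution data frame, its column names (city, size, amount), and its values are assumptions made for illustration, not data confirmed by the transcript.

```r
library(dplyr)

# Hypothetical data: one measurement per city and particle size.
pollution <- data.frame(
  city   = c("New York", "New York", "London", "London", "Beijing", "Beijing"),
  size   = c("large", "small", "large", "small", "large", "small"),
  amount = c(23, 14, 22, 16, 121, 56)
)

# Group by city, then compute three summaries per group.
by_city <- pollution %>%
  group_by(city) %>%
  summarise(mean = mean(amount), sum = sum(amount), n = n())

# The result has one row per city. The grouping variable `city` is kept
# alongside the three computed columns so each value can be identified.
by_city
```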

So if we put this process together, this is what it looks like. We take our data, we group it, and we get group-wise summaries. In this case, I'm only taking the mean. Here, I'm grouping by size. Before, I was grouping by city. The rows don't have to be near each other in a group. Everything will just be grouped together based on common values, wherever those values appear in the data set. Once you have grouped data, you can remove the grouping information with ungroup.
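The same process, grouped by size this time and followed by ungroup, might look like the sketch below. It assumes dplyr is installed. The pollution data frame is hypothetical, with made-up values; note that rows of the same size are not adjacent in it.

```r
library(dplyr)

# Hypothetical data; "large" and "small" rows alternate rather than sit together.
pollution <- data.frame(
  city   = c("New York", "New York", "London", "London", "Beijing", "Beijing"),
  size   = c("large", "small", "large", "small", "large", "small"),
  amount = c(23, 14, 22, 16, 121, 56)
)

# Group-wise mean: rows are grouped by common values of `size`,
# wherever they appear in the data.
by_size <- pollution %>%
  group_by(size) %>%
  summarise(mean = mean(amount))
by_size

# Grouping information is stored on the data frame and can be removed again.
grouped <- group_by(pollution, size)
group_vars(grouped)           # "size"
group_vars(ungroup(grouped))  # character(0)
```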

So this data set doesn't actually exist. There is a tb data set in the EDAWR package, but it's much more complicated than this one. This is a simplified version. But you can imagine, we could take this data set that has different countries, different years, different genders, and the number of cases of TB. Anyways, we could group it. In this case, we could add more than one variable to the grouping criteria, and what group_by will do is create a separate group for each combination of those variables. So for example, Afghanistan in 1999 will be one group, and Afghanistan in 2000 will be a separate group, and so on. And then we can run summarize on that. And what we'll get is our summary. But when you run summarize on grouped data, summarize will strip off one variable from the grouping criteria. So now that it's made the summary, it's going to strip off the rightmost variable. In this case, that's year. And what you'll end up with is a data set that's still grouped, but it's only grouped by country. It's no longer grouped by country and year.
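The peeling-off of grouping variables can be checked with a small sketch, assuming dplyr is installed. The tb_cases data frame, its column names, and its values are hypothetical and stand in for the simplified data set described above. Newer dplyr versions may also print an informational message about the resulting grouping, which does not change the result.

```r
library(dplyr)

# Hypothetical simplified TB data (values are made up).
tb_cases <- data.frame(
  country = rep(c("Afghanistan", "Brazil"), each = 4),
  year    = rep(c(1999, 1999, 2000, 2000), times = 2),
  sex     = rep(c("female", "male"), times = 4),
  cases   = c(1, 2, 3, 4, 10, 20, 30, 40)
)

# One group per country/year combination, then one summary row per group.
by_country_year <- tb_cases %>%
  group_by(country, year) %>%
  summarise(cases = sum(cases))

# summarise() dropped the rightmost grouping variable (year),
# so the result is still grouped, but only by country.
group_vars(by_country_year)

# Summarising again therefore collapses the years within each country.
by_country <- by_country_year %>%
  summarise(cases = sum(cases))
by_country
```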