Resources

What is data wrangling? Intro, Motivation, Outline, Setup -- Pt. 1 Data Wrangling Introduction

Data wrangling is too often the most time-consuming part of data science and applied statistics. Two tidyverse packages, tidyr and dplyr, help make data manipulation tasks easier. These videos introduce you to these tools. Keep your R code clean and clear and reduce the cognitive load required for common but often complex data science tasks.

Pt. 1: What is data wrangling? Intro, Motivation, Outline, Setup
https://youtu.be/jOd65mR1zfw
- 01:44 Intro and what’s covered
Ground Rules
- 02:40 What’s a tibble
- 04:50 Use View
- 05:25 The pipe operator: `%>%`
- 07:20 What do I mean by data wrangling?

Pt. 2: Tidy Data and tidyr
https://youtu.be/1ELALQlO-yM
- 00:48 Goal 1: Making your data suitable for R
- 01:40 `tidyr` “Tidy” Data introduced and motivated
- 08:15 `tidyr::gather`
- 12:38 `tidyr::spread`
- 15:30 `tidyr::unite`
- 15:30 `tidyr::separate`

Pt. 3: Data manipulation tools: `dplyr`
https://youtu.be/Zc_ufg4uW4U
- 00:40 Setup
- 02:00 `dplyr::select`
- 03:40 `dplyr::filter`
- 05:05 `dplyr::mutate`
- 07:05 `dplyr::summarise`
- 08:30 `dplyr::arrange`
- 09:55 Combining these tools with the pipe (Setup for the Grammar of Data Manipulation)
- 11:45 `dplyr::group_by`
- 15:00 `dplyr::group_by`

Pt. 4: Working with Two Datasets: Binds, Set Operations, and Joins
https://youtu.be/AuBgYDCg1Cg
Combining two datasets together
- 00:42 `dplyr::bind_cols`
- 01:27 `dplyr::bind_rows`
- 01:42 Set operations: `dplyr::union`, `dplyr::intersect`, `dplyr::setdiff`
- 02:15 Joining data: `dplyr::left_join`, `dplyr::inner_join`, `dplyr::right_join`, `dplyr::full_join`

Cheatsheets: https://www.rstudio.com/resources/cheatsheets/

Documentation:
- `tidyr` docs: tidyr.tidyverse.org/reference/
- `tidyr` vignette: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
- `dplyr` docs: http://dplyr.tidyverse.org/reference/
- `dplyr` one-table vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
- `dplyr` two-table (join operations) vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html

New York Times: “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”, by Steve Lohr, Aug. 17, 2014
https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

Transcript

This transcript was generated automatically and may contain errors.

Hi, I'm Garrett Grolemund, a professional educator at RStudio and the author of two books on R. I've put together these videos to teach you some of the main ideas of doing data wrangling with R and the Tidyverse. If you're not familiar with the Tidyverse, it's a collection of R packages that are designed to help you do data science.

In this series of short videos, we will walk through some of the most useful tools for wrangling your data with R. Before we dive in, I want to briefly motivate why this topic is so important and outline how these resources are organized.

To quote an article from the New York Times, data scientists spend from 50% to 80% of their time mired in the mundane labor of collecting and preparing data before it can be explored for useful information. I'm not sure if this labor really is mundane, but the simple fact is that before an R program can look for answers, your data must be cleaned up and converted to a form that makes information accessible. Learning to do this well is the best investment that you can make in yourself as a data scientist.

In these videos, you will learn how to use the dplyr and tidyr packages to optimize the data wrangling process. You'll learn to spot the variables and observations within your data, to quickly derive new variables and observations to explore, to reshape your data into the layout that works best for R, to join multiple data sets together, and to use group-wise summaries to explore hidden levels of information within your data.

Also, keep in mind that you can skip around these videos to the tool that you want by clicking the jump-to links in the video description.

Intro and what's covered

This webinar will go through two packages that Hadley Wickham, my colleague, made, and they're both really geared towards working with the structure of data. So that's the tidyr package and the dplyr package. And then what I'm going to cover today really follows closely a cheat sheet that we published a couple days ago. So this is a cheat sheet you can download at the link at the bottom of the slide here. And it's a two-page sheet that just summarizes the tidyr package and the dplyr package. So it's a great resource for remembering the functions that we're going to go through today, and also for just remembering functions in general as you work with data.

Ground rules: tibbles

So before we really get into data wrangling, there are a couple of ground rules that I want to familiarize you with. These two packages introduce some things into R that make R work better, but also make R look a little different. So I just want to make sure you're comfortable with that before we start using them.

And the first is the table structure, or `tbl`, T-B-L. And what a table is, I'm going to say table because I don't like tbl, but it's just a data frame, basically. You can think of it as a data frame that appears differently in your console window. So, for example, if I go over here to my R window, this is just an RStudio window that I put off on the side here. And I open up a familiar library like ggplot2. There is a data set in here called diamonds, which is humongous. And if you try to look at this data frame, this is what will happen. It fills up my screen, and at some point R tells me that it's not going to show the rest of the data. This data set is 52,000 rows.

And really what I do see here isn't very helpful. A, it fills up my memory buffer, which means I can't see what I did before. And B, I can't even see the column names of this data frame. So what a table is, is a new class that you can give to a data structure like diamonds. And it's implemented through the dplyr package. So I can change diamonds into a table with the function `tbl_df`. And now I have the same data frame here, but the print method is only going to show me the part of the data frame that fits in my console window.

So it's showing me right now that there's a variable called y and a variable called z in the diamonds data frame that went off the side of the window. Instead of wrapping them below, R now is just going to tell me these variables are here, but they're not shown. And then instead of showing me 52,000 rows, it's just going to show me 10 rows. So this is a more pretty way to look at your data. If I were to change my console window to something wider and then look at my table of diamonds, I get those extra variables here. It pays attention to how wide my window is, so it knows how many variables it can show. If I call this again, now it only shows up to depth and so on.
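The comparison above can be sketched roughly as follows. The video uses `tbl_df()`; this sketch assumes a recent setup in which `tbl_df()` is deprecated and `tibble::as_tibble()` does the same job. The names `diamonds_df` and `diamonds_tbl` are illustrative only.

```r
library(ggplot2)  # provides the diamonds data set
library(tibble)   # provides as_tibble()

# Recent ggplot2 releases already ship diamonds as a tibble, so first
# strip it back to a plain data frame to see the difference.
diamonds_df <- as.data.frame(diamonds)

# Convert the plain data frame to a tibble (a "tbl").
diamonds_tbl <- as_tibble(diamonds_df)

class(diamonds_df)   # plain data frame
class(diamonds_tbl)  # tibble classes layered on top of data.frame

# Printing a tibble shows the first 10 rows and only the columns
# that fit the current console width.
diamonds_tbl

# Override the defaults when you need more: 5 rows, every column.
print(diamonds_tbl, n = 5, width = Inf)
```

The underlying data is unchanged by the conversion; only the class, and therefore the print method, differs.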

Using View

So this is a very useful way to look at large data sets or work with large data sets. But you, at times, will still want to see your complete data. So there's a function I recommend using when you want to look at the entire data set, and that's the `View` function. It's in base R, and if you call it from RStudio, RStudio will open up a spreadsheet-like view window where you can check out your data set, almost as if it were an Excel document. Keep in mind that's `View` with a capital V, and you can use this on any data frame that you have in R.
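A minimal sketch of that call, assuming the diamonds data set from ggplot2. The `interactive()` guard is an addition here so the snippet can also be sourced from a script without trying to open a viewer window.

```r
library(ggplot2)  # provides the diamonds data set

# View() opens a spreadsheet-style viewer, so only call it in an
# interactive session (for example, inside RStudio).
if (interactive()) {
  View(diamonds)
}
```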

The pipe operator

Then there's one last function that really changes how R looks, and that's the pipe operator, `%>%`. This comes from the magrittr package, but it's imported by the dplyr package. And it's a different way to write the same code that you'd write before. So you can probably recognize what this command would do here. We haven't looked at select. We'll look at that today. But it's calling select on an object called tb, and then it has some arguments here. The pipe operator allows you to pipe in the first argument of select. So I'd write `tb %>% select(child:elderly)`. And what the operator will do is insert tb as the first argument of select. So these two lines of code would do the same thing.
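As a small sketch of the two equivalent forms: the `tb` object from the slide isn't defined in this transcript, so the data frame below is a hypothetical stand-in with made-up values, chosen only so that the columns `child` through `elderly` exist.

```r
library(dplyr)  # provides select() and re-exports the %>% pipe

# Hypothetical stand-in for the `tb` object on the slide.
tb <- data.frame(
  country = c("A", "B"),
  child   = c(10, 20),
  adult   = c(30, 40),
  elderly = c(5, 15)
)

# Ordinary call: tb is written as the first argument of select().
nested <- select(tb, child:elderly)

# Piped call: %>% inserts tb as the first argument of select().
piped <- tb %>% select(child:elderly)

# Both forms keep the same columns, child through elderly.
identical(nested, piped)
```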

If we come over here, we could take a look at that. So if I had diamonds, I could pipe in... Let me just take one variable from diamonds, maybe the x column. And I could pipe that into mean. And if I run this code, it's going to give me the mean of the x column, which is the same as if I did `mean(diamonds$x)`.
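The demo just described would look roughly like this, assuming ggplot2 is installed for the diamonds data. The variable names `piped_mean` and `direct_mean` are illustrative.

```r
library(ggplot2)   # provides the diamonds data set
library(magrittr)  # provides the %>% pipe

# Pipe the x column into mean().
piped_mean <- diamonds$x %>% mean()

# The equivalent ordinary call.
direct_mean <- mean(diamonds$x)

# The two results are the same number.
identical(piped_mean, direct_mean)
```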

So at this point, it might not be obvious why you'd use the pipe over anything else, but the cool thing about this format is you can start chaining functions together. So if I wanted to round every value in the x column of diamonds to the second decimal place, and then take the mean, I can use this code here, which won't look all that different, instead of saying, okay, let's round all this stuff, save that, and then call mean on the output. As your chains get longer and longer, this becomes much more efficient than actually managing where you save the in-between states.
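A sketch of that chain next to the version that saves the in-between state. The exact code on screen isn't captured in the transcript, so this is one plausible way to write it; `rounded_x`, `mean_saved`, and `mean_piped` are illustrative names.

```r
library(ggplot2)   # provides the diamonds data set
library(magrittr)  # provides the %>% pipe

# Without the pipe: save the in-between state, then use it.
rounded_x <- round(diamonds$x, 2)
mean_saved <- mean(rounded_x)

# With the pipe: each step feeds the next, with no intermediate object.
mean_piped <- diamonds$x %>%
  round(2) %>%
  mean()

# Both approaches compute the same value.
identical(mean_saved, mean_piped)
```

The piped version reads in the order the steps happen (take x, round, average), which is what keeps longer chains manageable.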

What is data wrangling?

So let's take a look at the functions that can actually help you wrangle data. And what I mean by data wrangling is what other people call munging, or transforming your data, or manipulating it. And the reason I use the word wrangling is it sort of captures how painful this process can be. Just getting your data into a format that you can work with is time-consuming, and it's often boring and painful. And if you could do that more efficiently, that would be a big win. And the functions we'll look at today will help you do that.

Thanks for checking out our data wrangling video series. This is just one video in a larger set of resources on data wrangling. So here are links to all of our other videos in the series. And again, have a look at the video description for shortcuts to other parts of the series, as well as other resources we think might be helpful, like cheat sheets and further documentation. We hope you find all of these resources useful. And if you do, as always, we would appreciate a thumbs up or a share. Both go a long way to helping this content reach people who could benefit from it.