What is data wrangling? Intro, Motivation, Outline, Setup -- Pt. 1 Data Wrangling Introduction

Data wrangling is too often the most time-consuming part of data science and applied statistics. Two tidyverse packages, tidyr and dplyr, help make data manipulation tasks easier. These videos introduce you to these tools. Keep your R code clean and clear and reduce the cognitive load required for common but often complex data science tasks.

Pt. 1: What is data wrangling? Intro, Motivation, Outline, Setup https://youtu.be/jOd65mR1zfw
- 01:44 Intro and what’s covered
Ground Rules
- 02:40 What’s a tibble
- 04:50 Use View
- 05:25 The pipe operator
- 07:20 What do I mean by data wrangling?

Pt. 2: Tidy Data and tidyr https://youtu.be/1ELALQlO-yM
- 00:48 Goal 1: Making your data suitable for R
- 01:40 `tidyr` “Tidy” Data introduced and motivated
- 08:15 `tidyr::gather`
- 12:38 `tidyr::spread`
- 15:30 `tidyr::unite`
- 15:30 `tidyr::separate`

Pt. 3: Data manipulation tools: `dplyr` https://youtu.be/Zc_ufg4uW4U
- 00:40 Setup
- 02:00 `dplyr::select`
- 03:40 `dplyr::filter`
- 05:05 `dplyr::mutate`
- 07:05 `dplyr::summarise`
- 08:30 `dplyr::arrange`
- 09:55 Combining these tools with the pipe (Setup for the Grammar of Data Manipulation)
- 11:45 `dplyr::group_by`
- 15:00 `dplyr::group_by`

Pt. 4: Working with Two Datasets: Binds, Set Operations, and Joins https://youtu.be/AuBgYDCg1Cg
Combining two datasets together
- 00:42 `dplyr::bind_cols`
- 01:27 `dplyr::bind_rows`
- 01:42 Set operations: `dplyr::union`, `dplyr::intersect`, `dplyr::setdiff`
- 02:15 Joining data: `dplyr::left_join`, `dplyr::inner_join`, `dplyr::right_join`, `dplyr::full_join`
______________________________________________________________
Cheatsheets: https://www.rstudio.com/resources/cheatsheets/

Documentation:
`tidyr` docs: tidyr.tidyverse.org/reference/
- `tidyr` vignette: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
`dplyr` docs: http://dplyr.tidyverse.org/reference/
- `dplyr` one-table vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
- `dplyr` two-table (join operations) vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html
______________________________________________________________
New York Times, “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”, by Steve Lohr, Aug. 17, 2014 https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
______________________________________________________________

Mar 14, 2018

8 min

Grammar of Data Manipulation Data Science Data Wrangling Applied Statistics Statistics RStudio Data Manipulation

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hi, I'm Garrett Grolemund, a professional educator at RStudio and the author of two books on R. I've put together these videos to teach you some of the main ideas of doing data wrangling with R and the Tidyverse. If you're not familiar with the Tidyverse, it's a collection of R packages that are designed to help you do data science.

In this series of short videos, we will walk through some of the most useful tools for wrangling your data with R. Before we dive in, I want to briefly motivate why this topic is so important and outline how these resources are organized.

To quote an article from the New York Times, data scientists spend from 50% to 80% of their time mired in the mundane labor of collecting and preparing data before it can be explored for useful information. I'm not sure if this labor really is mundane, but the simple fact is that before an R program can look for answers, your data must be cleaned up and converted to a form that makes information accessible. Learning to do this well is the best investment that you can make in yourself as a data scientist.

In these videos, you will learn how to use the dplyr and tidyr packages to optimize the data wrangling process. You'll learn to spot the variables and observations within your data, to quickly derive new variables and observations to explore, to reshape your data into the layout that works best for R, to join multiple data sets together, and to use group-wise summaries to explore hidden levels of information within your data.
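
As a rough illustration of the steps just listed, here is a minimal sketch that reshapes, joins, derives a variable, and computes a group-wise summary. It is not taken from the videos: the tables `cases_wide` and `populations` and all of their columns are hypothetical toy data made up for this example. It uses `tidyr::gather` because that is the reshaping function named in the video description.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per city, one column per year ("wide" layout).
cases_wide <- tibble(
  city   = c("A", "B", "C"),
  region = c("north", "north", "south"),
  `2016` = c(10, 20, 30),
  `2017` = c(15, 25, 35)
)

# Hypothetical second dataset to join against.
populations <- tibble(
  city = c("A", "B", "C"),
  pop  = c(100, 200, 300)
)

cases_long <- cases_wide %>%
  # Reshape: collapse the year columns into one row per city and year.
  gather(key = "year", value = "cases", `2016`, `2017`) %>%
  # Join: bring in the population for each city.
  left_join(populations, by = "city") %>%
  # Derive a new variable from existing ones.
  mutate(rate = cases / pop)

# Group-wise summary: one row per region.
region_summary <- cases_long %>%
  group_by(region) %>%
  summarise(total_cases = sum(cases), mean_rate = mean(rate))

print(region_summary)
```

Each step feeds the next through the pipe, so the code reads in the same order as the description: reshape, join, derive, then summarise by group.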

Also, keep in mind that you can skip around these videos to the tool that you want by clicking the jump-to links in the video description.

And the reason I use the word wrangling is that it sort of captures how painful this process can be. Just getting your data into a format that you can work with is time-consuming, and it's often boring and painful. And if you could do that more efficiently, that would be a big win.

Thanks for checking out our data wrangling video series. This is just one video in a larger set of resources on data wrangling. So here are links to all of our other videos in the series. And again, have a look at the video description for shortcuts to other parts of the series, as well as other resources we think might be helpful, like cheat sheets and further documentation. We hope you find all of these resources useful. And if you do, as always, we would appreciate a thumbs up or a share. Both go a long way to helping this content reach people who could benefit from it.