Working with Two Datasets: Binds, Set Operations, and Joins -- Pt 4 Intro to Data Manipulation

Data wrangling is too often the most time-consuming part of data science and applied statistics. Two tidyverse packages, tidyr and dplyr, help make data manipulation tasks easier. Keep your R code clean and clear and reduce the cognitive load required for common but often complex data science tasks. `dplyr` docs: dplyr.tidyverse.org/reference/ - http://dplyr.tidyverse.org/reference/setops.html - http://dplyr.tidyverse.org/reference/join.html ---------------- Pt. 1: What is data wrangling? Intro, Motivation, Outline, Setup https://youtu.be/jOd65mR1zfw - /01:44 Intro and what’s covered Ground Rules: - /02:40 What’s a tibble - /04:50 Use View - /05:25 The Pipe operator: - /07:20 What do I mean by data wrangling? Pt. 2: Tidy Data and tidyr https://youtu.be/1ELALQlO-yM - /00:48 Goal 1 Making your data suitable for R - /01:40 `tidyr` “Tidy” Data introduced and motivated - /08:10 `tidyr::gather` - /12:30 `tidyr::spread` - /15:23 `tidyr::unite` - /15:23 `tidyr::separate` Pt. 3: Data manipulation tools: `dplyr` https://youtu.be/Zc_ufg4uW4U - /00.40 setup - /02:00 `dplyr::select` - /03:40 `dplyr::filter` - /05:05 `dplyr::mutate` - /07:05 `dplyr::summarise` - /08:30 `dplyr::arrange` - /09:55 Combining these tools with the pipe (Setup for the Grammar of Data Manipulation) - /11:45 `dplyr::group_by` Pt. 4: Working with Two Datasets: Binds, Set Operations, and Joins https://youtu.be/AuBgYDCg1Cg Combining two datasets together - 00.42 `dplyr::bind_cols` - 01:27 `dplyr::bind_rows` - 01:42 Set operations `dplyr::union`, `dplyr::intersect`, `dplyr::set_diff` - 02:15 joining data - `dplyr::left_join`, `dplyr::inner_join`, - `dplyr::right_join`, `dplyr::full_join`, ______________________________________________________________ Cheatsheets: https://www.rstudio.com/resources/cheatsheets/ Documentation: `tidyr` docs: tidyr.tidyverse.org/reference/ - `tidyr` vignette: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html `dplyr` docs: http://dplyr.tidyverse.org/reference/ - `dplyr` one-table vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html - `dplyr` two-table (join operations) vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html ______________________________________________________________

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

I'm Garrett Grohlman, a professional educator at RStudio and the author of two books on R. I've put together these videos to teach you some of the main ideas of doing data wrangling with R and the Tidyverse . If you're not familiar with the Tidyverse, it's a collection of R packages that are designed to help you do data science.

In this series of short videos, we will walk through some of the most useful tools for wrangling your data with R. Also keep in mind that you can skip around these videos to the tool that you want by clicking the jump to links in the video description.

Combining datasets with binds

So there's one last thing that you will probably want to do as you wrangle your data with these packages and that's combine multiple datasets into the same dataset. There's a simple way to combine datasets together in R. If your datasets were meant to go together, you can add them together by pasting one as extra columns onto the first with the bind calls command from dplyr . There's also a cbind command in base R. Bind calls is just a little more optimized. So that's why I'm showing you bind calls.

This is somewhat dangerous to do. R has no way of checking whether the rows are in the same order between the datasets or whether when you combine the datasets together, the row describes an observation that makes sense. For example, here you probably want B to be on the second line as in the first line. So use column binding with caution. Make sure the data is arranged for you to do what you want to do.

You can add the second dataset as additional rows to the bottom of your first dataset with bind rows or rbind in base R. Again, you need to make sure that the variables are in the same order and mean the same thing, but it's probably less risky than column binding.

Set operations on data frames

Even if you actually have two datasets like this where they describe the same variables, they just have different observations of those variables, you could actually use set operations on these datasets. So data frame set operations come with dplyr. You take two data frames and get the union of rows that appear in both data frames, the intersection of rows that appear in both data frames, and the set difference of rows that appear in both data frames. That's just your basic union intersection set different things applied to data frames.

Joining data

Now normally when you want to put datasets together, you need to do something a little more involved, and that's what I call joining data. So here's a trivial example. Here's two data frames. They both talk about the Beatles. One's a list of songs and who wrote each of the songs, and then one's a list of the artists in the Beatles and what each one played or one of the instruments they played. And if you wanted to put these things together into the same dataset, so you match up the instrument that each Beatle played with, you know, what song they wrote, you can do that with a left join.

What left join does is it takes the names of two data frames, and then it takes some joining criteria. These would be the name or names of columns that both datasets share. And then what it will do is it'll go through the second data frame, find the pieces of information that apply to rows in the first data frame, and add them where they belong in that data frame. So order is not preserved. The second data frame's order is not preserved. First data frame, yes, that is preserved. But things are moved around to supply a play column that actually makes sense with your data frame.

And at the end, you have rows in your output that describe a unified observation. John both wrote Across the Universe and played guitar. Guitar had to appear twice because John was also in the second row, and so it appears twice. There's no information about this last row, Buddy Holly, who made Peggy Sue. So for the place column that gets an NA, there's nothing we could put there.

Left joins aren't the only way to join data together in an intelligent fashion. You could do a right join. Well, this would be the syntax for if you want to join based on two columns, both first and last. Use the C to make a vector of column names and pass that to by.

Left join isn't the only way to join data sets together. You could do an inner join, which is a little more exclusive. So an inner join still puts the data sets together in the same fashion, but if there's any row from the first data frame that doesn't have information in the second data frame, like Peggy Sue and Buddy Holly didn't appear in the second data frame, inner join will exclude that from the results. It just contains the rows that appeared in both data frames.

So we just released two new joins in the slides. You'll find them on the cheat sheet. Those are right join and outer join. So right join is like a left join, but it has the same effect as if you switch the order the data frames appear. If you put artist first and song second, that would give you the equivalent of a right join. The second data frames prioritize are the first. An outer join is the complement of an inner join. An inner join only keeps the rows that appear in both data frames. An outer join keeps every row that appears in any data frame, and you get a very big data frame at the end, but that probably has a bunch of missing values. You could look those up on the cheat sheet.

Filtering joins

Then we have two filtering joins, semi-join and anti-join. These don't actually join data sets together at all. These take the first data set and remove rows from it based on what appears in the second data set. So what semi-join will do is it'll take the first data set, look at the data set and see which rows have an equivalent in the second data set and return just those rows that have an equivalent. It doesn't add anything to the rows, it just shows you what rows will actually be joined should you do a join. Anti does the opposite. It shows you which rows do not have equivalent in the second data set. So you can use anti-join to see which rows will not be joined when you do a join. Semi-join and anti-join are good ways to sort of debug or get a feel for how the join's going to work ahead of time.

Semi-join and anti-join are good ways to sort of debug or get a feel for how the join's going to work ahead of time.

Further resources

All right. So let's skip over some of this recap. We're getting near the top of the hour so I just want to point your attention to where you can learn more. This cheat sheet is a really great and concise way to go about learning how to wrangle data with these two packages. You can download it here at this link. We also have other cheat sheets about other things there. If you want to practice with dplyr, there's a course that I've made with Datacamp that has about four hours worth of content on dplyr. There's some video lessons, it's a live coding environment, and mostly it has lots and lots of practice. So this is a good place to go if you want to practice. It does cost, I think, $95 or so.

Thanks for checking out our data wrangling video series. This is just one video in a larger set of resources on data wrangling. So here are links to all of our other videos in the series. And again, have a look at the video description for shortcuts to other parts of the series, as well as other resources we think might be helpful, like cheat sheets and further documentation. We hope you find all of these resources useful, and if you do, as always, we would appreciate a thumbs up or a share. Both go a long way to helping this content reach people who could benefit from it.

Featured software#