
dplyr 1.1.0 Features You Can't Live Without - posit::conf(2023)
Presented by Davis Vaughan

Did you enjoy my clickbait title? Did it work? Either way, welcome! The dplyr 1.1.0 release included a number of new features, such as:

- Per-operation grouping with `.by`
- An overhaul of joins, including new inequality and rolling joins
- New `consecutive_id()` and `case_match()` helpers
- Significant performance improvements in `arrange()`

Join me as we take a tour of this exciting dplyr update and learn how to use these new features in your own work!

Presented at posit::conf(2023), September 19-20, 2023. Learn more at posit.co/conference.

Talk Track: Lightning talks. Session Code: TALK-1162
Transcript
This transcript was generated automatically and may contain errors.
Hi, everyone. I'm here to talk about dplyr 1.1. But really, I'm actually here to talk about reducing friction in the latest version of the dplyr release. And by friction, I really just mean anything that kind of takes you out of that flow state of just writing code.
We've actually reduced friction in three different ways in this release: one in `case_when()`, one in some big changes to grouping, and then some really cool things with joins.
case_when improvements
So we're going to jump right in with an easy win that we had with `case_when()`. My wife is actually a big tidyverse user, which is really cool for me because it means I get a lot of one-on-one feedback about exactly what's going wrong.
And she writes a lot of `case_when()` statements that look like this, and what's going wrong is this. If you haven't seen this before, `case_when()` is fairly strict about the types of the things on the right-hand side. So she would mix character values like "large" and "small" with `NA`, which is a logical `NA`, and it would complain. It would say, no, you have to use this very special `NA_character_` thing instead.
If you use `case_when()` a lot, you've probably seen this. And this isn't something we typically expect people to know. It makes `case_when()` harder to use and harder to teach. So as of 1.1, a very easy win is that `case_when()` is now much less strict about the types on the right-hand side. So this just works.
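A minimal sketch of the pattern described above, using a hypothetical `sizes` vector (not from the talk):

```r
library(dplyr)

sizes <- c(5, 50, 500)

case_when(
  sizes < 10  ~ "small",
  sizes < 100 ~ "large",
  TRUE        ~ NA   # plain logical NA: errored before 1.1.0,
                     # which required NA_character_ here
)
#> [1] "small" "large" NA
```

As of dplyr 1.1.0, the logical `NA` is automatically cast to the common character type instead of raising a type error.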
Per-operation grouping with .by
So I think this section is really cool. I play a lot of Mario Kart, so I came up with a question: which character (Mario, Luigi, and so on) had the most wins on a single track? So if you have this imaginary races data set, I'm looking for just one row back here. I want the character and track with the highest number of wins.
The way you might get here is say, well, I'm going to group by character and track. I'm going to compute the total number of wins, and then I'm going to slice out the top win row. If you were to actually run this code, though, there's a bug in here.
I only wanted Luigi at Moo Moo Farm (yes, that is a real track) with 305 wins. That's the only row I wanted back, but I got multiple rows, one per character. And the reason this happens is that `summarize()`, by default, just peels off one layer of grouping. So when we go into that `slice_max()` call, we're actually still grouped by character, and we end up with one row per character.
Now if you're a keen dplyr user, you know what the problem is. I forgot to `ungroup()`. I'm probably not the only one who's forgotten to ungroup. All of you have forgotten to ungroup.
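The bug and the classic fix can be sketched like this, using a hypothetical stand-in for the talk's races data (the real data set wasn't shown):

```r
library(dplyr)

# Hypothetical stand-in for the races data from the talk
races <- tibble(
  character = c("Mario", "Mario", "Luigi", "Luigi", "Peach"),
  track     = c("Mario Circuit", "Mario Circuit", "Moo Moo Farm",
                "Moo Moo Farm", "Moo Moo Farm"),
  wins      = c(100, 120, 150, 155, 260)
)

# The buggy version: summarize() peels off only the `track` grouping,
# so slice_max() runs per character and returns three rows, not one.
races |>
  group_by(character, track) |>
  summarize(wins = sum(wins)) |>
  slice_max(wins, n = 1)

# The classic fix: drop the leftover grouping first.
races |>
  group_by(character, track) |>
  summarize(wins = sum(wins)) |>
  ungroup() |>
  slice_max(wins, n = 1)
# A single row: Luigi on Moo Moo Farm with 305 wins
```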
Remembering for the 648th time that if dplyr does something weird, it is always because you have forgotten to ungroup. It has become such a thing that people have made a habit of `group_by()`, do the thing, `ungroup()`, every time they call something in dplyr. It has even made it into Allison Horst's artwork, where she is begging you not to forget to invite `ungroup()` to the party.
All right, so we took a step back and asked: is there anything we could do to make it so you never have to remember to ungroup ever again? We've actually taken a page out of the data.table playbook, and as of 1.1, you can now write the grouping inline with the `summarize()` call.
This is pretty cool because `races` goes into `summarize()` as an ungrouped data frame, and it comes out of `summarize()` as an ungrouped data frame, full stop, no strange edge cases whatsoever. This means that `summarize()` no longer has to message you about what it's doing with the grouping, and you never have to worry about that `.groups` argument if you don't want to.
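With the new `.by` argument, the same question becomes a single ungrouped pipeline. Again using a hypothetical `races` tibble standing in for the talk's data:

```r
library(dplyr)

# Hypothetical stand-in for the races data from the talk
races <- tibble(
  character = c("Mario", "Mario", "Luigi", "Luigi", "Peach"),
  track     = c("Mario Circuit", "Mario Circuit", "Moo Moo Farm",
                "Moo Moo Farm", "Moo Moo Farm"),
  wins      = c(100, 120, 150, 155, 260)
)

races |>
  summarize(wins = sum(wins), .by = c(character, track)) |>
  slice_max(wins, n = 1)
# Ungrouped data frame in, ungrouped data frame out:
# one row, and no ungroup() needed anywhere
```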
Now, `group_by()` is not going away, there's no need to worry, but we think this new syntax is pretty cool. All of the major dplyr verbs support it.
New join features
Lastly, I just want to talk very quickly about joins. I don't have much time, but there is a lot here. We've introduced rolling joins, inequality joins, overlap joins, and new arguments that help you with quality control, like checking that a join is one-to-one or one-to-many. The most I can do is show you some questions and say, look, these are things that were really hard to answer with dplyr before, but we've reduced that friction to essentially make them one-liners now.
If you ever work with time series or genomic data, these are things you can do now, and I think you should really look at Hadley's R for Data Science book. He's written a great new section on joins, where we've introduced this new `join_by()` helper that lets you express very complex joins very easily.
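A small hedged sketch of what `join_by()` makes possible, using invented lap and weather tables (not from the talk). The rolling join matches each lap to the most recent earlier observation, and the `relationship` quality-control check (added in a 1.1.x patch release) errors if a lap ever matches more than one row:

```r
library(dplyr)

# Hypothetical time-series example
laps <- tibble(
  lap  = 1:3,
  time = c(20, 50, 90)
)
weather <- tibble(
  observed = c(0, 45, 80),
  rain     = c(FALSE, TRUE, FALSE)
)

# Rolling join: for each lap, take the single closest
# weather observation at or before `time`.
left_join(laps, weather, join_by(closest(time >= observed)))

# Quality control: guarantee each lap matches at most one
# weather row, erroring loudly otherwise.
left_join(laps, weather,
          join_by(closest(time >= observed)),
          relationship = "many-to-one")
```

Before 1.1.0, this kind of rolling match typically required a workaround like a full cross join plus filtering; `join_by()` turns it into a one-liner.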
So to recap, we've talked about these three things. There is a lot more in the 1.1 release that I didn't even get to touch, but we have written a whole series of 1.1 blog posts, so I encourage you to go to tidyverse.org and check those out. Thank you so much.

