What's new in the tidyverse? by Professor Mine Çetinkaya-Rundel

Transcript

This transcript was generated automatically and may contain errors.

All right, thank you very much for having me. I cannot believe I'm here. It's so far away from where I live, but also so exciting to be here. So thank you so much to Monash University for having me out for the next two weeks. And I'm really glad that the timing worked out to be able to give this talk today. So today I want to talk about what's new in the tidyverse.

As Emi mentioned, I work with the tidyverse team pretty closely, mostly in terms of sort of developing educational materials for it. But oftentimes, developing educational materials after something is built isn't necessarily the best way to go about things. So I very much enjoy being able to sort of be involved with the development process, thinking about things from the perspective of learners and educators.

So I'm going to talk a little bit, very quickly, about the tidyverse. I assume a lot of you know about it. But it turns out it is more than just those cute hex logos that we saw here. Yeah, that's one perk of working at Posit: you can always get these hex stickers.

Tidyverse principles and overview

Okay, so let's talk a little bit about the principles of the tidyverse. So tidyverse itself is a meta R package that loads nine core packages when invoked. And what's special about these packages is that they share a design philosophy, a common grammar, and data structures. So this is what it looks like when you first load the tidyverse. And when we think about this sort of figure for data science in terms of the data science cycle, we can see that many of these packages map on to many of the steps of the data science cycle.

I'm going to use the Palmer Penguins dataset, the Penguins dataset from the Palmer Penguins package for some of my examples here. And for those of you who may not be familiar with it, it is a dataset of a little over 300 penguins, and we have some body measurements on them that are numerical variables and a couple of categorical variables like species, island, and sex.

And this is what a typical tidyverse pipeline looks like. So you can see that we start with the data frame, we do some stuff to it, and then we visualize it. And this is what a typical tidyverse workflow looks like. You start with a data frame, and then you say, I want to group my data by species and sex, and then I want to, let's say, calculate the mean body weight of these penguins. And then you realize, oh, there's a warning, I wonder what that means, or a message, and then you sort of adjust to it.

And then, now that I have a data set that I can visualize, I can pipe this into a visualization sort of pipeline and make a bar plot. And that bar plot doesn't really make a whole lot of sense to me. Maybe something like this, where the bars are dodged, makes a little bit more sense. Frankly, as a statistician, these sorts of plots really bother me, where you're sort of plotting the mean without any sort of data about the variability within each of these groups.
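The slides are not part of this transcript, so the following is only a sketch of what the pipeline described above might look like. It assumes the palmerpenguins package is installed; the object names mean_mass and p, the step that drops missing values, and the choice of .groups = "drop" to address the grouping message are assumptions, not taken from the talk.

```r
library(dplyr)
library(ggplot2)

# Assumes the palmerpenguins package is installed.
penguins <- palmerpenguins::penguins

# Mean body mass per species and sex.
# Dropping rows with missing sex or body mass is an assumption made here
# so that the plot has no NA group.
# `.groups = "drop"` returns an ungrouped result, which avoids the
# informational message summarize() prints about grouped output.
mean_mass <- penguins |>
  filter(!is.na(sex), !is.na(body_mass_g)) |>
  group_by(species, sex) |>
  summarize(mean_body_mass = mean(body_mass_g), .groups = "drop")

# Dodged bars: one bar per sex, side by side within each species.
p <- ggplot(mean_mass, aes(x = species, y = mean_body_mass, fill = sex)) +
  geom_col(position = "dodge")
```

Printing p draws the plot. As the speaker notes, a plot of means alone says nothing about the variability within each group.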

So let's keep that in mind, and let's talk a little bit about what I'm going to talk about in this presentation. Sometimes I'm going to show you a slide that looks like this, where it says, previously, you used to do this, and now you should do this. But most of the time in the presentation, I'm going to say, previously, you used to do this in the tidyverse, and now you can do this. And you'll see that the latter will be a lot more often.

So while I'm talking about the updates to the tidyverse, I'm not mostly talking about changes to the tidyverse, where you absolutely have to change your code. It's just sort of quality of life improvements, mostly, for you to either make your code more readable or more efficient, or some combination of those. And because I can't help myself, I'm going to sprinkle in some teaching tips along the way as well.

Tidyverse 2.0 and lubridate

So let's start with tidyverse 2.0, which was released this year. And there are two things in this meta package that are new. One is the lubridate package is now a core tidyverse package. And the other one is that package loading message got even longer. So let's take a peek at what those are.

The lubridate package is an incredibly useful package with a incredibly not-so-great name, in my opinion. But it is a package that makes it really much easier to work with date times. So maybe previously, you used to do this at the beginning of your R scripts or documents, and now you can just get away with doing this, just the library tidyverse.

And this is what lubridate functionality looks like. I'm going to give sort of three progressively, I think, more annoying to parse strings. One of them is just today's date as a numeric value, the other one's a text string, and the other one is an actual sentence. What we can do is we can take each one of these and apply the appropriate lubridate function. So in this case, the data is in the order of year, month, and day. So then I apply ymd(), and we can see that it turns it into a Date class.

I can do the same thing where the data is in the order of month, day, year, with mdy(). Or I can say, just take that sentence and figure out in there a day, month, and year, and an hour as well. And it does a pretty good job of doing so. And you can also supply time zones to it, if that's the sort of thing that you tend to work with. And if so, I am sorry. And then it will give you a POSIXct class.
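As a rough sketch of the kind of parsing described here: the inputs below are made up for illustration and are not the speaker's. The sentence example is a simpler date-only sentence in the style of lubridate's own documentation, not the date-plus-hour sentence from the slides.

```r
library(lubridate)

# A date stored as a number, in year-month-day order.
d_num <- ymd(20231018)

# A text string in month-day-year order.
d_txt <- mdy("10/18/2023")

# A sentence containing a date; the surrounding words are ignored.
# (Illustrative input, modeled on an example in lubridate's documentation.)
d_sentence <- ymd("Created on 2009 1 6")

# A date plus an hour, with an explicit time zone.
# The result is a date-time (POSIXct) rather than a Date.
dt <- ymd_h("2023-10-18 14", tz = "Australia/Melbourne")

class(d_num)
class(dt)
```

The function name spells out the order of the components (y, m, d, then optionally h, m, s), which is what replaces the hand-written regular expression.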

So this is, I think, pretty neat. And what is really neat about this is that particularly for that last example, it lets you get away with not having to write regular expressions. And to me, that is a huge quality of life improvement.

The conflicted package

The other one is that message that I said got longer. So when we load the tidyverse, now we can see an additional informational point that says, use the conflicted package to force all conflicts to become errors. So what does that mean? Basically, when we load the tidyverse, we can explicitly check for conflicts with existing loaded packages. And in this case, I'm loading the tidyverse before loading any other things. So these are the conflicts with base R, basically, or the stats package in this case. filter() and lag() are functions that exist there that the tidyverse, or rather dplyr, is overwriting.

The conflict resolution with base R looks something like this: it just gives precedence to the most recently loaded package. So if I have not loaded the tidyverse yet, and I try to do something like filter for the species where the penguins are Adélie, it will give me an error, but not one saying it can't find the filter function. In fact, it's the error you get when this sort of expression is passed on to stats::filter(). After loading the tidyverse, base R will silently use the last loaded package, and things work nicely. Until they don't, I suppose.

With conflicted, what you can see is that if that package is loaded, the error is a little bit different. It is saying that I am not going to just use one of the ones. It's saying there are two packages that have the filter function, and you need to explicitly choose. And you can explicitly choose with sort of a method that doesn't require conflicted at all by saying which package name and then colon colon the function, or you can say I'm going to use filter a lot. I know that I want the dplyr filter going forward, so for this session, I want to prefer dplyr filter. So from that point onwards in my session, it is going to use the dplyr filter.
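The two ways of choosing described here might look like the following sketch. It assumes the conflicted and palmerpenguins packages are installed; the object names are illustrative.

```r
library(conflicted)
library(dplyr)

# Assumes the palmerpenguins package is installed.
penguins <- palmerpenguins::penguins

# Option 1: name the package explicitly with `::`.
# This works with or without conflicted loaded.
adelie_explicit <- dplyr::filter(penguins, species == "Adelie")

# Option 2: declare a preference once for the rest of the session.
conflicts_prefer(dplyr::filter)

# From here on, a bare filter() means dplyr::filter().
adelie_preferred <- filter(penguins, species == "Adelie")
```

Without either option, a bare filter() call with conflicted loaded stops with an error that lists the packages providing filter(), instead of silently picking one.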

So this is making things a lot more explicit, which, when things are going fine, is almost overhead, frankly. But when they're not, it's actually one of the more annoying, I think, error messages to sort of figure out when you're using the wrong function from a package you didn't intend to. And this can grow to be an even bigger problem when you have lots of packages that might have sort of named conflicts.

So a little bit of teaching tip around this. I said that loading the tidyverse message got even longer now, and I did hear the sigh, and I am with you on that. But I think that it is important to show these startup messages in the teaching materials and not hide them, especially if you teach with slides. I know that it is so tempting to hide them because they take up so much space, but I think it's important to address them early on because, first of all, you want to model good behavior. You can't ask your students to read error messages if you don't read them yourself, or any messages for that matter. So it requires, it encourages reading and understanding messages, warnings, and errors, and what the distinction between them are. And also it helps during hard to debug situations that result from base R's silent conflict resolution.

That being said, do show your students how to hide those, particularly at like an editing or polishing a document stage.

dplyr updates: joins

All right, so let's talk about dplyr, which got a lot of updates over the last year. Many, many updates that sort of expanded its functionality. So I am going to talk about a few things that cross my path on almost a daily basis. This is a non-exhaustive list, so at the end I will point you to other places where you can read about other advances in dplyr.

One of them is improved and expanded join functionality. One of them is added functionality for per-operation grouping. And the other one is some quality of life improvements to case_when() and if_else(). case_when() happens to be my favorite function if I had to pick one, so that's why I'm highlighting it.

And there are a few more. So let's start with the joins. There is a new join_by() function that we can use for the by argument in any of the join functions. And the join functions have all gained new arguments that allow us to handle various matches like one-to-one, one-to-many, many-to-many relationships, as well as explicitly handling unmatched cases.

So let's start with join_by(). Previously, maybe you did something like this and even, I don't know, somehow seems crazier to me that you could actually get away with not putting these quotes here as well, because that's how you would define a named vector in R. Now, optionally, you can do this. So this is optional, but I do think it is a quality of life improvement where we're passing a join_by() function and then we can actually use sort of the non-standard evaluation to not have to quote our variable names.

So what does that look like? I'm going to make up a data set. This is real data, but not necessarily the most useful data. Here are the coordinates for the three islands that appear in our penguins data set, and what we're going to do is we are going to join these to our data. With by alone, you could do something like this, where, and just to remind you, the data frame name is islands and the name of the island is in a column called name. So we can say, take the penguins data frame and then left join it such that island in the penguins data frame matches name in the islands data frame. Or you can do something like this, where you can actually articulate things as, take the penguins data frame, left join it to the islands data frame, where island is equal to name. So we actually can use the sort of logical operators that we use elsewhere in R as well.
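A sketch of the two styles side by side. It assumes the palmerpenguins package is installed. The coordinates are approximate values filled in for illustration only and are not the figures shown in the talk; the column names lat and lon are likewise assumptions.

```r
library(dplyr)

# Assumes the palmerpenguins package is installed.
penguins <- palmerpenguins::penguins

# Approximate, illustrative coordinates (not taken from the talk).
islands <- tibble::tribble(
  ~name,       ~lat,   ~lon,
  "Torgersen", -64.77, -64.07,
  "Biscoe",    -65.43, -65.50,
  "Dream",     -64.73, -64.23
)

# Older style: a named character vector, "x column" = "y column".
joined_by_vector <- left_join(penguins, islands, by = c("island" = "name"))

# Newer style: join_by() with unquoted names and `==`,
# read aloud as "where island is equal to name".
joined_by_expr <- left_join(penguins, islands, by = join_by(island == name))
```

Both calls return the same result: one row per penguin, with the coordinate columns added.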

So my recommendation from a teaching perspective would be to use join_by(), particularly because you can read it out loud as where x is equal to y, which is something you probably already say if you do tend to read your code out loud, wherever you have the double equal sign. And also you don't have to worry about the various ways of sort of passing named vectors, where you can't have both of them unquoted here. This would be invalid, but either this or this would be valid. And personally for me, I find, you know, being able to teach joins early on to be a win. Being able to talk about named vectors early on to be not as big a win. Not that they're not important, but something that can come a little bit later perhaps. So it helps you avoid some awkward conversations, I would say.

Now let's talk a little bit about handling various matches. So previously join functions looked something like this. A few more arguments, but mostly two data frames that you're joining and a by argument. And now they have a few new arguments as well, with extensive documentation around them. But I want to just highlight a few cases that might cross your path, and that might come in handy.

As a setup, we're going to create three data frames. One of them is a data frame of just three penguins. These are just three randomly selected penguins from our data frame, and we know their species and the island name. Another data frame is something hopefully realistic, where if you are measuring anything in a scientific setting, chances are you are measuring things multiple times and averaging them or something like that. So this data frame indicates that for those three penguins, we have their sample IDs and also a measurement ID, where instead of weighing the penguin once, we weighed each penguin twice. So we have a measurement ID of one and two for each penguin. And the same thing for their flipper length as well. So we have these three data frames. You can perhaps imagine a situation where different people have collected these data.

Now, one-to-many relationships, things work out pretty nicely. I have my three penguins data frame that, remember, had three rows, a row per penguin, and I've joined it to the weight measurements data frame, and everything looks fine. It's basically my measurements joined with the species and island of those penguins. Not much to think about here.

Now, with many-to-many relationships, on the other hand, we get a warning. So here, I am taking my weight measurements data frame, maybe from one of the scientists, and the flipper measurements data frame from the other, and I just say, just bring them together by sample ID, and I get a warning saying that row one of your first data frame matches multiple rows in your second data frame, and row one of your second data frame matches multiple rows in your first one, and so on. It says, if you want to silence this, tell me that you are, in fact, intending to do a many-to-many match.

So, we can go ahead and do that. We can go ahead and say, I do want to make a many-to-many match. Does this look correct? What do we have here? We had one scientist take two measurements each on three penguins, so six measurements, and another scientist take two measurements each on the same three penguins, so another six measurements, and somehow what happened is I ended up with a 12-row data frame. So, I have an explosion of rows, which doesn't seem so bad when your options are six versus 12, but you can imagine a scenario where this could be problematic if you had a large data, two large data sets that you're joining.

What actually is happening is that we probably should have joined by both sample ID and measurement ID, right? So, instead of joining by a single variable, the penguin, or the sample ID, we probably should have joined by both so that we ultimately have six rows in our data frame with measurements coming from the two scientists. So, the takeaway message here is that, one, the many-to-many relationships can be costly, particularly computationally, so the warnings are helpful to stop you and make you think, but the warnings themselves aren't enough to get you to the right answer necessarily.
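The row explosion and its fix can be sketched as follows. The measurement values and the column names samp_id and meas_id are hypothetical, and the talk does not say which join function was used, so full_join() is an assumption.

```r
library(dplyr)

# Hypothetical data: 3 penguins, each measured twice by each scientist.
weight_measurements <- tibble::tibble(
  samp_id     = rep(1:3, each = 2),
  meas_id     = rep(1:2, times = 3),
  body_mass_g = c(3220, 3250, 4730, 4725, 4000, 4050)
)
flipper_measurements <- tibble::tibble(
  samp_id           = rep(1:3, each = 2),
  meas_id           = rep(1:2, times = 3),
  flipper_length_mm = c(193, 195, 214, 216, 203, 203)
)

# Joining by samp_id alone pairs every weight row with every flipper row
# for the same penguin: 3 penguins x (2 x 2) = 12 rows.
# relationship = "many-to-many" only silences the warning; it does not
# make the result right.
exploded <- full_join(
  weight_measurements, flipper_measurements,
  by = join_by(samp_id),
  relationship = "many-to-many"
)

# Joining by both keys lines up measurement 1 with measurement 1, and so on.
matched <- full_join(
  weight_measurements, flipper_measurements,
  by = join_by(samp_id, meas_id)
)

nrow(exploded)
nrow(matched)
```

The warning flags the symptom; choosing the second key is the part that needs knowledge of how the data were collected.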

So, the warning that we saw here was simply talking about, I'm observing something happening here, and I want you to be explicit about whether you want that to happen, and this is where the human needs to come into play and say, I know something about this data set joining by these two variables makes a bit more sense.

Now, let's do one more thing. I'm going to bring in one more penguin. We have three penguins so far, and I'm adding in one more, an Adélie penguin from Biscoe Island, and now let's join the four penguins data set to our weight measurements data set. Remember, three penguins were measured so far, and what I've done is I've said, I'm going to take the weight measurements from one of my scientists, and I'm going to join it with my four penguins data set to get the information on the penguins, and what happened to the fourth penguin? Poof, it's gone.

And that's potentially what we intended to do. I mean, I explicitly selected left join here, right? I am saying, keep all the rows in the left data frame, and don't worry about the rows in the right data frame. Maybe that is true, but if you have a large data frame where you can't really see all of the rows, where it's not so obvious what disappeared or did not disappear, that can be a pretty risky move. So one thing you can do with a new argument unmatched is say, if anything is going to go poof, just give me an error first. So let me, stop me in my tracks before I continue.

And I think that this sort of introduces a paradigm of programming where you're not just writing things and then expecting a whole document to render because you wrote the code correctly. It actually is making you explicitly use the code that you're writing to check if what you're doing is indeed what you're intending to do. So in this case, it says if there are any unmatched cases that are going to go poof, just error out, and let me think about it. And it did say that row four of y, that fourth penguin, was not matched.

So now I, as the analyst, have two options. I can say, maybe I just want to do an inner join where I wanted to make sure that basically the matching penguins are the only ones that I want. Or I could say that if there are any unmatched ones, just drop them, which is the exact same result I had at the very beginning. The difference being that I am intentionally sort of doing this and nothing is just sort of disappearing under my, like without me realizing.
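A sketch of the options just described. The data values are hypothetical, as are the column and object names; only the fourth penguin being an Adélie from Biscoe comes from the talk. The error is caught with tryCatch() here only so that the example runs from top to bottom.

```r
library(dplyr)

# Hypothetical data: weights for 3 penguins, measured twice each.
weight_measurements <- tibble::tibble(
  samp_id     = rep(1:3, each = 2),
  meas_id     = rep(1:2, times = 3),
  body_mass_g = c(3220, 3250, 4730, 4725, 4000, 4050)
)
# Four penguins; the fourth has no weight measurements.
four_penguins <- tibble::tibble(
  samp_id = 1:4,
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  island  = c("Torgersen", "Biscoe", "Dream", "Biscoe")
)

# Default: penguin 4 silently disappears from the result.
silent <- left_join(weight_measurements, four_penguins, by = join_by(samp_id))

# unmatched = "error": stop if any row of y would be dropped.
checked <- tryCatch(
  left_join(weight_measurements, four_penguins,
            by = join_by(samp_id), unmatched = "error"),
  error = function(e) e
)

# Option A: an inner join, keeping only penguins present in both tables.
only_matches <- inner_join(weight_measurements, four_penguins,
                           by = join_by(samp_id))

# Option B: keep the left join, but state that dropping is intended.
dropped_on_purpose <- left_join(weight_measurements, four_penguins,
                                by = join_by(samp_id), unmatched = "drop")
```

Option B returns the same rows as the default; the difference is that the intent to drop unmatched rows is written down in the code.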

Or you can just do nothing. You can just go with the default. So the defaults work just as you would expect, but they don't necessarily allow you to catch yourself making errors, and chances are you're working with more than three penguins at a time, and it is a lot easier to sort of introduce these errors, particularly if you write in pipelines.

And there is a bit more on joins as well. So we talked about join_by(), we talked about different ways of handling relationships and unmatched cases. There are two other exciting developments, inequality joins and rolling joins, that are basically all made possible because now, instead of saying, take this one variable from the first data frame and this variable from the second data frame, I can actually pass expressions into the join_by() function that say something like less than or greater than or whatnot. If you want to read more about these, I have included some resources here, and I'll share the link to the slides at the end as well.
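The talk does not show an example of these, so the following is a made-up illustration with hypothetical sales and promotion tables. An inequality join keeps every pair of rows that satisfies the comparison; a rolling join, written with closest(), keeps only the nearest such match.

```r
library(dplyr)

# Hypothetical data, for illustration only.
sales <- tibble::tibble(
  sale_id   = 1:3,
  sale_date = as.Date(c("2023-01-05", "2023-02-10", "2023-03-20"))
)
promos <- tibble::tibble(
  promo = c("A", "B"),
  start = as.Date(c("2023-01-01", "2023-03-01"))
)

# Inequality join: every promotion that started on or before the sale date.
# Sale 3 matches both A and B, so the result has 4 rows.
all_earlier <- inner_join(sales, promos, by = join_by(sale_date >= start))

# Rolling join: only the most recent promotion on or before the sale date.
most_recent <- left_join(sales, promos,
                         by = join_by(closest(sale_date >= start)))
```

Date-based lookups like this are where these joins tend to come up, which fits the speaker's remark about time series-like data.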

And if you're thinking, what in the world are inequality joins and rolling joins? If you know, you know, okay, that is the reality. And I personally, for me, they don't cross my path as often as some of the other things do. But I also don't tend to work a lot with sort of time series-ish data sets, I think where these come into play a bit more.

That being said, we do have a new section in the second edition of R for Data Science that is dedicated to the various types of joins that you can do. And these are also moves that you can easily do with the data.table package and in SQL as well. So basically this sort of expanded functionality in dplyr is matching other tools that you could use in order to achieve the same outcome.

All right, and in terms of a teaching tip, I would say that the exploding joins, particularly where we just ended up with extra rows that we didn't intend, can be hard to debug for students. Because I think joins are inherently, in my experience, relatively straightforward to teach and relatively straightforward to learn. But diagnosing when, you know, cases sort of disappear, or when you gain an unexpected number of cases, or performing a join without thinking and taking down an entire teaching server, which tends to happen whenever we do sort of open-ended projects with our students, these things do happen. So I think that using some of these defensive coding strategies in your own work is good. And I think teaching them can be helpful as well, particularly if you're expecting that students will be working with data that might be unfamiliar to them, where they may not have the intuition to be like, wait, why did I just all of a sudden get 12 rows here, like we were able to do earlier?

And I will say that hearing from educators who actually have to articulate this and teach it to new learners who ask the most critical questions of why did you say that the other day and you're saying this today are also incredibly helpful for finding those misalignment cases.

Okay, well thank you very much for sharing that. Yeah, you're welcome.

I was going to ask the question that Emi asked, so I'll ask a follow-up one, which is what can we expect to see in the next edition of tidyverse?

You know, let me try to think if there's like something I know for sure. That's a really good question and I don't have a good answer for it off the top of my head. And I think it's because over the last couple months I've been so like in the mindset of catching up with what's there in terms of the book that I haven't done a whole lot of a look ahead. I don't think we have a big tidy up like this that's sitting there. I do know that there's quite a bit of work in terms of sort of performance enhancements that happen with dplyr particularly. And I will admit that I tend to be like I think they're like super crucial but they don't like I personally don't get affected by them a lot. And so I tend to be a little bit behind the curve when it comes to performance stuff. So I don't have a great answer for it.

So question from online. So there's a theme_set() function in ggplot2 to overwrite the default. Would something similar for dplyr be a good idea? Set default grouping behavior.

Good question. So I am a big fan of that theme_set() function, particularly for things like writing a report or doing a presentation where I can just set that at the beginning and I don't have to keep adding sort of theme arguments. I can see how setting some of these as an option can be useful. I can also see how that can go wrong if it's not as explicit as theme_set(). There's a difference between something like theme_set(), which you call in your code, and options that you can set. And generally I would think that options are not the way to go because they tend to be sort of not so transferable between folks, like for reproducibility purposes. Off the top of my head I don't see why that wouldn't be a good idea. So I would say that would be a good issue to open to see if there is sort of traction or if others can think of reasons not to.

So another question online is are there any plans to introduce tidyverse paradigm to Python since RStudio is now called Posit?

So I can say that on the tidyverse team there is Michael who works on siuba, which is sort of a dplyr port to Python. There's also Hassan who works on plotnine, which is like a ggplot2 port to Python. So in terms of supporting open source work that takes this design philosophy and implements it in Python, the company has made an investment in it. And actually, just personally, I can say that hearing from them about their development process, and how similar conversations happen in Python land, has been super informative for me and I think has made the team better. I don't think