
Teaching the tidyverse in 2023 | Mine Çetinkaya-Rundel
Recommendations for teaching the tidyverse in 2023, summarizing package updates most relevant for teaching data science with the tidyverse, particularly to new learners.
00:00 Introduction
00:46 Using addins to switch between RStudio themes (See https://github.com/mine-cetinkaya-rundel/addmins for more info)
01:40 Native pipe
03:08 Nine core packages in tidyverse 2.0.0
07:15 Conflict resolution in the tidyverse
11:30 Improved and expanded *_join() functionality
22:05 Per operation grouping
27:41 Quality of life improvements to case_when() and if_else()
31:41 New syntax for separating columns
34:51 New argument for line geoms: linewidth
36:08 Wrap up
See more in the Teaching the tidyverse in 2023 blog post https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23
Transcript
This transcript was generated automatically and may contain errors.
Hello and thank you for joining me today. I'm Mine Çetinkaya-Rundel. I am a developer educator at Posit and a professor of the practice at Duke University. I mainly teach statistics and data science, and today's agenda is teaching the tidyverse in 2023. I am going to talk to you a little bit about the code base that has changed in the tidyverse over the last year, and when I say last year, I really mean the academic year. There's an accompanying blog post to this video that you are welcome and encouraged to read through for more details as well.
But before we get started with the tidyverse, let me start with a little trick for you. This is usually what my RStudio IDE looks like when I get started teaching. This is the theme I prefer to use when I am working on my own during the day, this is what I prefer when I'm working at night, and this is what I prefer when I'm teaching. The way I switch between these themes is that I've written a personal package with some addins, so I can quickly switch between teaching mode and the themes I like to use when I am working on my own. I recommend getting some shortcuts like this for yourself as well.
So we are going to start by loading the tidyverse and palmerpenguins packages, which we're going to use for some of our data examples, and I'm going to load two other packages, gt and pkgload. We're going to use these for just a couple of things along the way.
Native pipe
The first thing I'm going to talk about is the native pipe. Your tidyverse pipelines may have looked something like this in the past: you start with a data frame and then you pipe that with %>% into, say, a dplyr function like count, and the result looks something like this. Since R 4.1 you can use the native pipe, |>, here as well, and you're going to get exactly the same answer. So you might be wondering, why would I want to do that? Well, one of the nice things about the native pipe is that it works even outside of the tidyverse, without loading the magrittr package.
So let's go ahead and restart R, so now none of my packages are loaded, and let's try something else. Let's start with the mpg data frame, or maybe even the mtcars data frame, and then use a base R function like summary. Even though I have none of the tidyverse packages or the magrittr package loaded, I'm able to use the native pipe. So if you're teaching your students to build tidyverse pipelines with the native pipe, they can learn that workflow and apply it beyond the tidyverse as well, should they be using a different package, or in a different course, or at some other point in their life when they're coding in R.
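A minimal sketch of the point above, run in a fresh session with no packages attached. It assumes R 4.1.0 or later; the object name n_four_cyl is only for illustration.

```r
# Base R only: no library() calls are needed for the native pipe (R >= 4.1.0).
# x |> f(y) is evaluated as f(x, y).

# The example from the talk: pipe a built-in data frame into a base R function.
mtcars |> summary()

# Pipelines can be chained with base R functions as well.
n_four_cyl <- mtcars |>
  subset(cyl == 4) |>
  nrow()
n_four_cyl
```

Because nothing here depends on magrittr, the same pipeline habit carries over to code that never loads the tidyverse.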
Nine core packages in tidyverse 2.0.0
Coming back to the tidyverse, one of the updates was to the tidyverse meta-package itself. tidyverse 2.0.0 was released in, I think, March of this year. Let's go ahead and restart our R session one more time and load the tidyverse package. In the past we would also have to load the lubridate package separately, but we don't have to do that anymore. So I'm going to restart, and I'm going to show you that when I load the tidyverse, the lubridate package is loaded as well.
If you have not used or taught the lubridate package before, let's go through its functionality very briefly. Perhaps the most important thing about it, as you can tell from the name, is that it allows you to work with dates and times, and these can be pesky things to work with. So let's go ahead and put in today's date. This is right now a numeric variable, as you can see. Then I can use the lubridate package's ymd function, which stands for year, month, and day, and it will parse that number into a date object for me.
This probably seems pretty straightforward, right? I mean, we could probably have done this with string parsing, without the lubridate package. So let's try something else and make it a little more complicated. I'm going to put this in as text, so I might say 7/19/2023, and we can see that the class of this new object, which I should call today_t, is character. This is in the format month, day, year, so I'm going to use the mdy function, and lubridate will convert that to a date object for me as well. We can check that it does that with mdy of today_t, and there we go.
This is, I think, a little bit more complex, but again we could easily have parsed this string ourselves, using the forward slashes as delimiters. So let's make something even more complex. Let's create an actual text string, and I will type something like: This video is being recorded on 19 July 2023 at 1 pm ET. You can see that the class of this is character, so how are we going to parse the date from here? Doing so with string parsing would be a little more complicated. We would have to catch that 19 July 2023 in there, and that might take a bit of knowledge about regular expressions.
What I can do with lubridate is say that this is in the order day, month, and year, and then there's also an hour there as well, and it will go in and try to parse out the date and time components from that text string. Now, the time zone is incorrect. Time zones are additionally pesky, so in this case you would need to add the time zone yourself, and now we have the right time there as well. If I look at the class of the object we just created, it is a POSIXct object. So lubridate is pretty helpful for dealing with pesky things like dates and times, and now it comes with the core tidyverse as well.
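The three parsing steps described above might look roughly like the sketch below. The names today_n and today_s are assumptions (only today_t is named in the talk), dmy_h() is inferred from the "day, month, year, then hour" description, and "America/New_York" is an assumed stand-in for "ET". The printed result of the last call is not shown here.

```r
library(lubridate)  # attached automatically by library(tidyverse) from 2.0.0 on

# 1. A number in year-month-day order.
today_n <- 20230719
ymd(today_n)

# 2. A character string in month/day/year order.
today_t <- "7/19/2023"
class(today_t)
mdy(today_t)

# 3. A date and hour buried in a sentence: day, month, year, then hour.
today_s <- "This video is being recorded on 19 July 2023 at 1 pm ET."
# lubridate defaults to UTC, so the time zone is supplied explicitly.
# "America/New_York" is an assumption for what "ET" refers to.
recorded <- dmy_h(today_s, tz = "America/New_York")
class(recorded)
```

The order-based function names (ymd, mdy, dmy_h) are what let learners describe the layout of the input rather than write a regular expression for it.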
Conflict resolution in the tidyverse
Next let's go back to when we first loaded the tidyverse package. I am going to restart my R session and talk a little bit about conflict resolution. What do I mean by conflict resolution? When I load the tidyverse package, you can see that there is an updated message. It tells me that there are two functions in the core tidyverse, the filter and lag functions in the dplyr package, which mask the filter and lag functions, that is, functions with the same name, in a package that is already loaded with base R, the stats package. This message has been here for some time, but what is new in tidyverse 2.0.0 is that it specifically advertises the conflicted package as well. You can always get back to this message, even if you're not at the beginning of your R session, with the tidyverse_conflicts function, which will print out the end of that message for you.
The way base R would resolve this is that if you then load the dplyr package, its filter function becomes available, and because it was loaded after base R, R will just automatically choose the version of the function that was loaded last. Things work just fine, until they fail. So if we want to be a little more strict about having multiple packages loaded with the same function names, we can leverage the conflicted package, which is not in the core tidyverse but is a tidyverse package. I'm going to go ahead and load that, and once I load conflicted, I can no longer get away with just using the last loaded version of filter. It explicitly tells me: look, there are two packages loaded, they both have a filter function, tell me which one you want me to use.
One option is to use the :: notation, so that every time I use filter I explicitly say I want the dplyr version. This is the sort of thing you might see in a package development scenario, where you want to be very clear about where functions from other packages are coming from. But I think in a data analysis scenario this gets pretty busy, especially if you're teaching R. If you're teaching dplyr for the first time, you're probably going to have a bunch of filter statements, and constantly prefixing them with dplyr:: while not doing so for all the other functions you're teaching can, I think, get confusing for the students. So instead you can explicitly declare which version of filter you want to use. You can say: in this R session going forward, I prefer the dplyr version of filter. No more complaints from the conflicted package, and I am being very explicit in my code about which version I'm using.
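The two options described above can be sketched as follows. The sketch assumes conflicted 1.2.0 or later (for conflicts_prefer()) and uses the built-in mtcars data and hypothetical object names purely for illustration.

```r
library(conflicted)
library(dplyr)

# With conflicted loaded and no preference declared, a bare filter() call
# raises an error asking whether stats::filter or dplyr::filter is meant.

# Option 1: qualify every call with the package name.
four_cyl_a <- mtcars |> dplyr::filter(cyl == 4)

# Option 2: declare the preference once for the rest of the session.
conflicts_prefer(dplyr::filter)
four_cyl_b <- mtcars |> filter(cyl == 4)

nrow(four_cyl_b)
```

Putting the conflicts_prefer() call next to the library() calls at the top of a script keeps the choice visible without cluttering every filter statement.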
Improved and expanded *_join() functionality
So that was all about the tidyverse meta-package. Now let's talk a little bit about the improved and expanded functionality for joins. Most of these changes came with dplyr 1.1.0 onwards. I think as of today we're at 1.1.2, so there have been additional updates to get the rough edges sorted, but generally these updates are mainly meant to better align dplyr with tools like SQL and data.table. We have here a sample data frame: I've created an islands data frame where I have noted the coordinates of the centroid of each of these three islands.
With this new notation, you can read the join out loud the way we often read logical statements out loud. Double equals is something we would read as: where x is equal to y. So now I'm able to say: take the penguins data frame and left join it to the islands data frame where island is equal to name. While the previous by syntax does not go away, I would strongly recommend updating your teaching materials to use the join_by function. The nice thing about that is that reading the code out loud is made so much easier and is aligned with how we read other logical statements out loud.
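A minimal sketch of the join described above, assuming dplyr 1.1.0 or later. The islands data frame here is a stand-in: the coordinates column holds placeholder strings, not the values used in the talk.

```r
library(dplyr)
library(palmerpenguins)

# Stand-in lookup table; the coordinate strings are placeholders, not real values.
islands <- tibble(
  name = c("Torgersen", "Biscoe", "Dream"),
  coordinates = c("centroid of Torgersen", "centroid of Biscoe", "centroid of Dream")
)

# Reads as: left join penguins to islands where island is equal to name.
joined <- penguins |>
  left_join(islands, join_by(island == name))

# The older spelling still works: by = c("island" = "name")
joined |> select(species, island, coordinates)
```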
Another new argument handles unmatched cases in the improved join functions. I'm going to make one more new data frame. This time my measurements include one more penguin, so I have four penguins in my data frame, and we're going to take the weight measurements from that one scientist and left join them to these four penguins. It's fine, R will happily join them for you, but poof, that fourth penguin is gone. Maybe you never wanted the fourth penguin, and maybe this is exactly what you wanted, but it is likely the case that this was a much larger data set, some of these unmatched cases silently disappeared, and you never caught them.
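One way to surface such silent drops is the unmatched argument of the dplyr join functions (dplyr 1.1.0 or later). The sketch below uses hypothetical toy data: the data frame names, the samp_id key, and the body mass values are all made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: three penguins, but four weight measurements.
three_penguins <- tibble(
  samp_id = 1:3,
  species = c("Adelie", "Gentoo", "Chinstrap")
)
weight_extra <- tibble(
  samp_id = 1:4,
  body_mass_g = c(3220, 4730, 4000, 3000)  # made-up values
)

# Default behavior: the fourth measurement has no match and is dropped quietly.
silent <- three_penguins |>
  left_join(weight_extra, join_by(samp_id))

# unmatched = "error" turns the silent drop into an error.
# tryCatch() is used here only so the sketch runs to completion.
strict <- tryCatch(
  three_penguins |>
    left_join(weight_extra, join_by(samp_id), unmatched = "error"),
  error = function(e) conditionMessage(e)
)
strict
```

For a left join the check applies to rows of the second data frame that would be dropped, which is exactly the case of the fourth penguin here.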
Per operation grouping
So here, for example, I am grouping the data by species and sex again, I am finding the mean body weights again, and then I want the top row of that output. With the grouping still in place I get a top row for each group, but if I actually drop my groups, then the top row of that output is just one row. So when we have grouping that persists through the pipeline, certain functions will output slightly different results, and if you were expecting one and you get the other, or vice versa, it can be pretty confusing to try to catch where this happened. Another example I like to give is with the wonderful gt package for making tables. Here I have data that is still grouped after running the summarize function, so the gt package will nicely group my output and display it as such. However, when I explicitly drop my groups, that grouping structure goes away, so gt doesn't apply that nice grouping. Whichever one you want, the idea is that you probably want to be explicit about it.
And I can do this by species and sex like we were doing before, and again the resulting data frame is not grouped. I didn't have to explicitly say: once you're done, drop the groups, which I think is a slightly less straightforward or intuitive way of going about things. Here we're just explicitly saying: when you're summarizing, group by these two variables, and when you're done, don't leave the data frame grouped anymore. So my recommendation would be to consider teaching the .by version of this, per-operation grouping, particularly if you find that your learners tend to struggle with persistent grouping in dplyr pipelines.
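A short sketch of the persistent-grouping behavior described above, using slice_head() for "the top row". The column name mean_bw and the object names are assumptions for illustration.

```r
library(dplyr)
library(palmerpenguins)

means <- penguins |>
  group_by(species, sex) |>
  summarize(mean_bw = mean(body_mass_g, na.rm = TRUE))
# summarize() peels off only the last grouping variable (sex),
# so `means` is still grouped by species.

# Grouped: one "top row" per species.
top_grouped <- means |> slice_head(n = 1)

# Ungrouped: a single top row for the whole data frame.
top_ungrouped <- means |> ungroup() |> slice_head(n = 1)

nrow(top_grouped)
nrow(top_ungrouped)
```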
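Per-operation grouping with the .by argument (dplyr 1.1.0 or later) might be sketched like this. The column name mean_bw and the object name by_op are assumptions for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Group by species and sex for this one summarize() call only.
by_op <- penguins |>
  summarize(
    mean_bw = mean(body_mass_g, na.rm = TRUE),
    .by = c(species, sex)
  )

# The result is a plain, ungrouped data frame: no ungroup() step is needed.
is_grouped_df(by_op)
```

Since the grouping lives inside the call that uses it, later steps in the pipeline behave the same regardless of what was summarized earlier.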
Quality of life improvements to case_when() and if_else()
New syntax for separating columns
Wrapping up
We've talked about a lot of exciting new updates, so I'd like to wrap up by talking about something maybe a little more light-hearted. One minor update to ggplot2, which is not the only update that has happened to ggplot2, but one that continues to stop me in my tracks, is a new argument for changing how your lines look, for line geoms. If I do something like this, where I am plotting flipper length versus body mass of penguins and I want to overlay a smooth curve, but I want that line to be a bit thicker, I will get a warning that says: please use linewidth instead of size. I do agree that linewidth seems to be the more appropriate name here, and I've in the past struggled with articulating why it was called size. That being said, the best teaching tip I can give you is to check the output of your old teaching materials thoroughly, so that you don't make a fool of yourself when this warning pops up at an unexpected time while you're teaching.
And remember that the tidyverse blog is a great place to catch up with all tidyverse updates, whether teaching related or not. There have been a lot of other updates to the packages that you will see if you dig through this year's posts: updates to stringr, tidyr, and forcats, a new API for the rvest package, and, if you also tend to teach SQL along with your data science content, shorter, more readable, and in some cases faster SQL queries in dbplyr. I'll wrap up by saying that the new version of R for Data Science is out, and many of the updates I've talked about here were made in order to catch up with what we wanted to showcase in the book. So you can expect development to slow down a little bit as you phase these changes into your teaching materials. Thank you very much for listening.
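A sketch of the plot described above, assuming ggplot2 3.4.0 or later. The axis assignment and the value 2 are guesses at what was shown on screen.

```r
library(ggplot2)
library(palmerpenguins)

# Flipper length versus body mass, with a thicker smooth curve overlaid.
p <- ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point() +
  geom_smooth(linewidth = 2)  # previously size = 2, which now triggers the warning

p
```

Searching old materials for size = inside line-based geoms such as geom_line() and geom_smooth() is a quick way to find the places where this warning would appear.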

