Resources

Teaching the tidyverse in 2023 | Mine Çetinkaya-Rundel

Recommendations for teaching the tidyverse in 2023, summarizing package updates most relevant for teaching data science with the tidyverse, particularly to new learners.

00:00 Introduction
00:46 Using addins to switch between RStudio themes (see https://github.com/mine-cetinkaya-rundel/addmins for more info)
01:40 Native pipe
03:08 Nine core packages in tidyverse 2.0.0
07:15 Conflict resolution in the tidyverse
11:30 Improved and expanded *_join() functionality
22:05 Per-operation grouping
27:41 Quality of life improvements to case_when() and if_else()
31:41 New syntax for separating columns
34:51 New argument for line geoms: linewidth
36:08 Wrap up

See more in the Teaching the tidyverse in 2023 blog post: https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23


Transcript#

This transcript was generated automatically and may contain errors.

Hello, and thank you for joining me today. I'm Mine Çetinkaya-Rundel. I am a developer educator at Posit and a professor of the practice at Duke University. I mainly teach statistics and data science, and today's agenda is teaching the tidyverse in 2023. I am going to talk to you a little bit about how the code base of the tidyverse has changed over the last year, and when I say last year, I really mean the academic year. There's an accompanying blog post to this video that you are welcome and encouraged to read through for more details as well.

But before we get started with the tidyverse, let me start with a little trick for you. This is usually what my RStudio IDE looks like when I get started teaching. This is the color scheme, the theme, I prefer to use when I am working on my own during the day; this is what I prefer when I'm working at night; and this is what I prefer when I'm teaching. The way I switch between these different themes is that I've written a personal package with some add-ins for me, so I can quickly switch between teaching mode and the themes I like to use when I am actually working on my own. I recommend setting up some shortcuts like this for yourself as well.

So we are going to start by loading the tidyverse and palmerpenguins packages, which we're going to use for some of our data examples, and I'm going to load two other packages, gt and package load. We're going to use these for just a couple of things along the way.

Native pipe

The first thing I'm going to talk about is the native pipe. Your tidyverse pipelines may have looked something like this in the past: you start with a data frame and then pipe it into, say, a dplyr function like count(), and the result looks something like this. Since R 4.1 you can use the native pipe here as well, and you're going to get exactly the same answer. So you might be wondering, why would I want to do that? Well, one of the nice things about using the native pipe is that it works even outside of the tidyverse, without loading the magrittr package at all.

So let's go ahead and restart R, so that none of my packages are loaded, and let's try something else. Start with the mpg data frame, or maybe even the mtcars data frame, and then use a base R function like summary(). Even though I have none of the tidyverse packages loaded, and the magrittr package isn't loaded either, I'm able to use the native pipe. So if you're teaching your students with the native pipe and teaching them to build these tidyverse pipelines, they can learn that workflow and apply it beyond the tidyverse as well, should they be using a different package, taking a different course, or coding in R at some other point in their life.
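As a minimal sketch of the point above, the native pipe works in a completely fresh R session (4.1 or later) with no packages attached:

```r
# The native pipe |> needs no packages at all -- base R only.
mtcars |> summary()

# It composes left to right just like the magrittr pipe for simple calls:
mtcars |> subset(cyl == 4) |> nrow()
#> [1] 11
```

Because `|>` is part of the language itself, pipelines written this way keep working even in courses or projects that never load the tidyverse.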

Nine core packages in tidyverse 2.0.0

Coming back to the tidyverse, one of the updates was to the tidyverse meta-package itself. Tidyverse 2.0.0 was released in, I think, March of this year. Let's go ahead and restart our R session one more time and load the tidyverse package. In the past we would also have to load the lubridate package separately, but we don't really have to do that anymore. So I'm going to restart, and I'm going to show you that when I load the tidyverse, the lubridate package is loaded as well.

If you have not used or taught the lubridate package before, let's go through its functionality very briefly. Perhaps the most important thing about it, as you can tell from the name, is that it allows you to work with dates and times, and these can be pesky things to work with. So let's go ahead and put in today's date. This is right now a numeric variable, as you can see. Then I can use the lubridate package's ymd() function, which stands for year, month, and day, and it will parse that value into a date object for me.

This probably seems pretty straightforward, right? We could have probably done this with some string parsing without the lubridate package as well. So let's try something a little more complicated. I'm going to put this in as a text field, so I might say "7/19/2023", and we can see that the class of this new object, the variable we created (oops, I should say today_t), is character. This is in the format month, day, year, so I'm going to use the mdy() function, and lubridate will convert that to a date object as well. We can check that it does: mdy(today_t), and there we go.

This is, I think, a little more complex, but again we could have easily parsed this string ourselves using the forward slashes as delimiters. So let's make something even more complex: an actual text string. I will type something like "This video is being recorded on 19 July 2023 at 1pm ET". You can see that the class of this is character, so how are we going to parse the date out of here? Doing so with string parsing would be more complicated; we would have to capture that "19 July 2023", which might take a bit of knowing about regular expressions. But with lubridate I can say this is in the order day, month, year, and there's also an hour in there, and it will go in and try to parse out the date and time components from that text string. Now, the thing is, the time zone is incorrect. Time zones are additionally pesky, so in this case you would actually need to add the time zone yourself, and now we have the right time as well. If I look at the class of the object we created, it is a POSIXct object. So lubridate is pretty helpful for dealing with pesky things like dates and times, and now it comes with the core tidyverse as well.
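The parsing steps above might look like the sketch below; the exact sentence and variable names are illustrative, and the time zone string is one plausible choice for US Eastern time:

```r
library(lubridate)  # attached automatically by tidyverse 2.0.0

ymd("2023-07-19")      # year-month-day string -> Date
mdy("7/19/2023")       # month/day/year string -> Date

# Parsing a date-time out of free text, supplying the time zone explicitly;
# dmy_h() looks for day, month, year, and hour components in the string.
x <- dmy_h("This video is being recorded on 19 July 2023 at 1pm",
           tz = "America/New_York")
class(x)   # "POSIXct" "POSIXt"
```

Note that lubridate's parsers are forgiving about surrounding text, which is exactly what makes them nicer than hand-rolled regular expressions for this kind of input.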

Conflict resolution in the tidyverse

Next, let's go back to when we first loaded the tidyverse package. I am going to restart my R session and talk a little bit about conflict resolution. What do I mean by conflict resolution? When I load the tidyverse package, you can see that there is a new, updated message: it tells me that two functions in the core tidyverse, the filter() and lag() functions in the dplyr package, mask the functions with the same names in a package that is already loaded with base R, the stats package. This message has been here for some time now, but what is new in tidyverse 2.0.0 is that it specifically advertises the conflicted package as well. And you can always get back to this message, even if you're not at the beginning of your R session, with the tidyverse_conflicts() function, which will print out the end of that message for you.

So let's go ahead and unload the dplyr package, and let's try to run this very common tidyverse pipeline: I want to take the penguins data frame and filter it for penguins that are Adelies. You've probably run into this issue before; if you haven't, kudos to you, but your students probably have. What's happening here is that the error we get is not that R can't find the filter() function. It can't find the dplyr filter(), but it is able to find the stats filter(), and that function is complaining that it can't find this penguins object.

The way base R resolves this is that if you then load the dplyr package, its filter() function becomes available, and because it's been loaded after base R, R will automatically choose the version of the function that was loaded last. Things work just fine, until they fail. So if we want to be a little more strict about having multiple packages loaded with the same function names, we can leverage the conflicted package, which is not in the core tidyverse but is a tidyverse package. I'm going to go ahead and load that, and once I load conflicted, I can no longer get away with just using the last-loaded version of filter(). It explicitly tells me: look, there are two packages loaded that both have a filter() function; tell me which one you want me to use.

One option is to use the :: notation to say, every time I use filter(), I explicitly want the dplyr version. This is the sort of thing you might see in a package development scenario, where you want to be very clear about where functions from other packages are coming from. But in a data analysis scenario this gets pretty busy, especially if you're teaching R, and especially if you're teaching dplyr for the first time: you're probably going to have a bunch of filter() statements, and constantly prefixing them with dplyr:: while not doing so for all the other functions you're teaching can, I think, get confusing for students. So instead you can explicitly declare which version of filter() you want to use. You can say: in this R session, going forward, I prefer the dplyr version of filter(). No more complaints from the conflicted package, and I am being very explicit in my code about which version I'm using.
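The two options above look roughly like this in code (a sketch, assuming conflicted 1.2.0 or later, which provides conflicts_prefer()):

```r
library(conflicted)
library(dplyr)

# Option 1: namespace every ambiguous call -- precise, but noisy in teaching.
dplyr::filter(mtcars, cyl == 4)

# Option 2: declare a session-wide preference once, then call filter() bare.
conflicts_prefer(dplyr::filter)
filter(mtcars, cyl == 4)
```

After conflicts_prefer() runs, a bare filter() no longer errors, and the preference is recorded explicitly at the top of your script rather than depending on package load order.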

Improved and expanded *_join() functionality

So that was all about the tidyverse meta-package. Let's talk a little bit about the improved and expanded functionality of joins. Much of this came with updates to dplyr from 1.1.0 onwards; I think as of today we're at 1.1.2, so there have been additional updates to smooth out the rough edges, but generally these updates are mainly to better align with tools like SQL and data.table. We have here a sample data frame: I've created an islands data frame where I have noted the coordinates of the centroid of each of these three islands.

And what I'm going to do is join the penguins data frame to this. I'll take the penguins data frame and left-join it to islands; this should be a pipeline you're familiar with. What we've done in the past was to use the by argument and denote which variable from the penguins data frame (the island variable) should match which variable from the islands data frame (the name variable). Instead, what we can do now is use join_by(), a new function, and say: where island is equal to name.

So with this new notation, you can actually read the join out loud the way we often read logical statements out loud: a double equals is something we would read as "where x is equal to y". So now I'm able to say: take the penguins data frame and left-join it to the islands data frame where island is equal to name. While the previous syntax, the by syntax, does not go away, I would strongly recommend updating your teaching materials to use the join_by() function. The first thing that is nice about it is that reading it out loud is made so much easier, and it aligns with how we read other logical statements out loud.


And number two: with the older version of this, the form we know worked, some other variants worked as well. I could also do something like this, and it would work just because of how named vectors work in R; but this wouldn't work, because R says it can't find the new name you're talking about. We also saw at the beginning that quoting both of these works too. Options are good, but sometimes they can be confusing. By using the join_by() notation we no longer need to quote our variable names, which aligns much better with other functions in the tidyverse.
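Side by side, the old and new syntax might look like this; the islands tibble below uses made-up placeholder coordinates, not the islands' real centroids:

```r
library(dplyr)
library(palmerpenguins)

# Placeholder coordinates for illustration only.
islands <- tibble::tribble(
  ~name,       ~lat,   ~lon,
  "Torgersen", -64.77, -64.08,
  "Biscoe",    -65.43, -65.50,
  "Dream",     -64.73, -64.23
)

# Old style: named character vector, quoted names.
penguins |> left_join(islands, by = c("island" = "name"))

# New style: join_by(), read as "where island is equal to name".
penguins |> left_join(islands, join_by(island == name))
```

Both calls return the same result; join_by() simply drops the quoting and reads like the logical statement it is.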

Another update to the join functions concerns many-to-many relationships. I'm going to create a few more data frames so that we have them in our pocket for our examples. One of them is a data frame called three_penguins: it has three rows, so data on three penguins, with one row per sample ID. Imagine you're a scientist collecting data on these penguin samples; this is what my three_penguins data frame looks like. Then we also have, let's say collected by another scientist, data on weight measurements of these penguins.

If you have ever collected weight data from animals, you have probably done this thing where you have to measure at least twice. So we have this weight measurements data where, for each penguin's sample ID, we have two measurements. And finally we have one more data set, flipper length measurements of these penguins; similarly, we've taken two measurements per penguin. So let's say we have data from these three separate scientists and we want to bring them together.

First I'm going to start by bringing together the three_penguins data set with the weight measurements data set. That's just a one-to-many relationship, and things work fine. However, when I have many-to-many relationships (remember those two scientists who each take two measurements from each of the three penguins), if I try to join the weight measurements and flipper measurements data frames from these two scientists, I now get a warning. It says that it detected a many-to-many relationship between x and y, which was unexpected, and at the end of the warning it says: if a many-to-many relationship is expected, you can set relationship = "many-to-many".

Whenever I see a warning like this, my first hunch is, okay, let's just try what it tells me to do; at least it will silence the warning, and maybe it will also be the right thing to do. So let's go ahead and add that. This relationship argument is new as of dplyr 1.1.0, and now we have explicitly said yes, we were expecting this many-to-many relationship. However, let's think about this for a second. We started with three penguins and two measurements per penguin, so six rows of data, and we're bringing together the data from these two scientists, but all of a sudden I have an explosion-of-rows situation: I have 12 rows. Twelve is not a large number, but imagine if you had hundreds of measurements from these scientists; you would end up with a real explosion of rows.

So in this case, what we actually should be doing, instead of declaring a many-to-many relationship, is to think back to our data structure and say: no, I want to join not just by sample ID but by measurement ID as well. The join_by() function can take multiple variables, just like the by argument of the joins can. The relationship argument of the improved join functions is great for catching you in your tracks when you're making a mistake about how you're joining data sets, but the suggestion it gives may not always be the right one for your use case. So don't just blindly apply what the warnings suggest; instead, think through your data structure to see how the data sets should come together.
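A small reproduction of the trap above; the column names (samp_id, meas_id) and measurement values are my own illustrative choices, not taken from the video:

```r
library(dplyr)

weight_measurements <- tibble::tribble(
  ~samp_id, ~meas_id, ~body_mass_g,
  1, 1, 3220,  1, 2, 3250,
  2, 1, 4730,  2, 2, 4725,
  3, 1, 4000,  3, 2, 4050
)

flipper_measurements <- tibble::tribble(
  ~samp_id, ~meas_id, ~flipper_length_mm,
  1, 1, 193,  1, 2, 195,
  2, 1, 214,  2, 2, 216,
  3, 1, 203,  3, 2, 203
)

# Joining on samp_id alone matches every weight row to both flipper rows
# for that penguin: 12 rows instead of 6. Silencing the warning with
# relationship = "many-to-many" accepts the row explosion.
weight_measurements |>
  left_join(flipper_measurements, join_by(samp_id),
            relationship = "many-to-many")

# The real fix: join on both keys, keeping the expected 6 rows.
weight_measurements |>
  left_join(flipper_measurements, join_by(samp_id, meas_id))
```

Walking learners through the 12-row output before showing the two-key join makes the point about reading warnings critically rather than just silencing them.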

And I think the teaching tip to take away from here is that when you're introducing these functions to students, I would recommend doing something similar to what I've done here: iteratively fall into the traps that you expect (or don't expect) to fall into, read the warnings, messages, or errors out loud, think through them together with your students, and then apply a solution to the problem.

The explosion-of-rows problem is something you have probably come across, particularly when working with new learners, who will happily join any two data frames and sometimes crash their R session, or, if you're teaching on a teaching server, crash the whole server. So walking through these sorts of examples with small data sets, where they can see how the rows are expanding or contracting back to what we expect, is, I think, pretty helpful.

Another new argument handles unmatched cases in the improved join functions. I'm going to make one more new data frame; this time my measurements include one more penguin, so I have four penguins in my data frame. We're going to take the weight measurements from that one scientist and left-join them to these four penguins. R will happily join them for you, but poof, that fourth penguin is gone. Maybe you never wanted the fourth penguin and this is exactly what you wanted, but it is likely that in a much larger data set some of these unmatched cases would silently disappear and you would never catch them.

So one option is to say: well, I definitely don't care about any of the penguins whose weights I don't have, so I'm going to explicitly use an inner join to solve this problem, not a left join, saying that I only want the penguins that are common between the two data frames. Another option is to use the unmatched argument and say: when I do this join, explicitly drop anything that's unmatched. The results we're getting from these are exactly the same, and in fact you can also do nothing, exactly what we did at the beginning, but you do that at your own risk.

The difference between using the unmatched argument or not is basically saying: I'm making it very explicit, so that when I come back to this code I will remember that I explicitly did not care about the penguins whose weights we did not have. Another value you can use for the unmatched argument is "error". This might be a good starting place: if there are any unmatched cases, just error out; I will stop at that point, read through the error, and then decide what I want to do. The error tells me that each row of y must be matched to x, and in this case row 4 of y was not matched. Now it's basically telling you: decide what you want to do with it, and you might go back to saying, fine, drop that one. So the "error" value for unmatched is a great starting place, particularly for interactive data analysis, and then you can explicitly choose one of the other values and go with that.
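A sketch of the three options; the data frames and column names below are illustrative stand-ins for the video's examples:

```r
library(dplyr)

weights <- tibble::tribble(
  ~samp_id, ~body_mass_g,
  1, 3220,  2, 4730,  3, 4000
)
four_penguins <- tibble::tribble(
  ~samp_id, ~species,
  1, "Adelie",  2, "Gentoo",  3, "Chinstrap",  4, "Adelie"
)

# Default: penguin 4 (unmatched row of y) silently disappears.
weights |> left_join(four_penguins, join_by(samp_id))

# Explicit intent: drop unmatched rows of y, same result, documented.
weights |> left_join(four_penguins, join_by(samp_id), unmatched = "drop")

# Or keep only the common penguins with an inner join.
weights |> inner_join(four_penguins, join_by(samp_id))

# A strict starting point for interactive work: error on unmatched rows
# (wrapped in try() here so the script can continue past the error).
try(weights |> left_join(four_penguins, join_by(samp_id),
                         unmatched = "error"))
```

For left_join(), unmatched applies to rows of y, which is why the silently dropped fourth penguin is what triggers the error here.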

Per operation grouping

All right, we've talked about joins in the new dplyr; let's talk about one other feature of the most recent dplyr, which is per-operation grouping. I'm going to start with a very typical tidyverse pipeline here: I want to plot the mean body weights of penguins by species and sex. I'm going to start with the penguins data set; just for ease, I'm going to drop any of the penguins for which we don't have body mass or sex information. Then I am going to group them by species and sex (you can see that this is now a grouped data frame), then summarize to calculate the mean body weights per species and sex of these penguins, and then let's go ahead and actually make a plot of this.

Now, you can see that when I make this plot, or even earlier when I just ran my summary, a message got printed out. It says: summarise() has grouped output by 'species'; you can override this using the .groups argument. This message, which can catch you by surprise, particularly if you're doing something like creating a plot where summarize is not even the last statement in your data pipeline, is basically saying that you had grouped the data by two variables; when we ran summarize, it peeled off the last grouping variable, but the data are still grouped by species.

It turns out that for making this plot it doesn't make any difference, so you can happily ignore that message. But what generally tends to happen is that if we don't address the fact that the data frame is still grouped, it can result in frustrating situations, particularly for new learners, but trust me, even for experienced users, because the underlying data remain grouped, and sometimes this has downstream effects. It didn't for the ggplot, but let's look at a couple of examples.

Here, for example, I am grouping the data by species and sex again, finding the mean body weights again, and then I want the top row of that output. But if I actually drop my groups, then the top row of that output is just one row. So when grouping persists through the pipeline, certain functions will output slightly different results, and if you were expecting one and you get the other, or vice versa, it can be pretty confusing to track down where this happened. Another example I like to give uses the wonderful gt package for making tables. Here I have data that are still grouped after running the summarize function, so the gt package will nicely group my output and display it as such. However, when I explicitly drop my groups, that grouping structure goes away, so gt doesn't apply that nice grouping. Whichever one you want, the idea is that you probably want to be explicit about it.

So one of the options we saw is that every time you run summarize, you can decide what you want to do with your groups: you can decide to drop them, which means the resulting data frame is not grouped, or you can assign another value like "keep" or "drop_last". Alternatively, with the new dplyr, you have a new option: per-operation grouping. You can see that in this new pipeline I no longer have a group_by() statement. I start with my data frame, drop the NAs, and then say: summarize to calculate the mean body masses for me, but do that by species. This new .by argument allows me to do the grouping in the operation where I am doing the summarization, and the resulting data frame is no longer grouped.

And I can actually do this by species and sex, like we were doing before, and again the resulting data frame is not grouped. I didn't have to explicitly say "once you're done with it, drop the groups", which I think is a slightly less straightforward or intuitive way of going about things. Here we're just explicitly saying: when you're summarizing, group by these two variables, and when you're done, don't leave the data frame grouped. So my recommendation would be to consider teaching the .by version of this, per-operation grouping, particularly if you find that your learners tend to struggle with persistent grouping in dplyr pipelines.
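The contrast between persistent and per-operation grouping might be sketched like this (mean_bm is my own illustrative name for the summary column):

```r
library(dplyr)
library(palmerpenguins)

# Persistent grouping: summarize() leaves the result grouped unless
# you remember to set .groups = "drop" (or ungroup() afterwards).
penguins |>
  tidyr::drop_na(body_mass_g, sex) |>
  group_by(species, sex) |>
  summarize(mean_bm = mean(body_mass_g), .groups = "drop")

# Per-operation grouping (dplyr >= 1.1.0): the grouping exists only
# inside this one summarize() call; the result is never grouped.
penguins |>
  tidyr::drop_na(body_mass_g, sex) |>
  summarize(mean_bm = mean(body_mass_g), .by = c(species, sex))
```

Both pipelines return the same four-to-six summary rows, but only the first one requires learners to reason about grouping state that persists beyond the call.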

That being said, the new edition of R for Data Science, which very recently came out, does mention the .by argument for per-operation grouping, but it doesn't dwell on it too much; the majority of the book still uses group_by(). So you can decide based on how strictly you want to stick to the text, if you are using it for your teaching.

Quality of life improvements to case_when() and if_else()

All right, one more bit on dplyr: my most favorite function, case_when(). For some reason I really like that function, and now it has gotten some updates as well, some quality-of-life improvements. Let's start with this: I'm going to calculate some quantiles for the penguins data set, so I have the 25th percentile and the 75th percentile of my data, and using these I want to categorize my penguins as small, medium, or large. If a penguin is less than 3550 grams, I'm going to label them small; if they're between the two cutoffs, I'm going to label them medium; if they're greater than 4750 grams, I'm going to label them large; and if they were NAs, I want to leave them as NAs.

What we're seeing here is a typical case_when() pipeline, where the last condition is TRUE, meaning: if it's not any of these, then for those rows, label things large. Let's also do something that will allow us to more easily see what's going on in this pipeline. Whenever I have a wide data frame with lots of columns, and I can't see the columns I'm operating on when I run my pipeline, I like moving them up. So here I probably want to move up the body mass and the new variable I created, the bm_cat variable; let's relocate those so that I can spot-check that my assignment is working well. In fact, we might temporarily pipe this to the View() function just to make sure that we have some larges as well, and indeed, in our data frame, we do.

I have personally always found explaining why we call this TRUE a little challenging when teaching case_when(), because it doesn't necessarily read like the other conditions, and even when it does, I want to say something like "if all else fails". So case_when() has gained a new argument called .default that basically says: if it's not any of those, make it large. Now we can achieve the same outcome without using TRUE as a catch-all for anything that didn't match.
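A sketch of the new form; the cutoffs match the quartiles mentioned above, and bm_cat is my own illustrative column name:

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  mutate(
    bm_cat = case_when(
      is.na(body_mass_g)  ~ NA,        # plain NA works in dplyr >= 1.1.0
      body_mass_g < 3550  ~ "small",
      body_mass_g <= 4750 ~ "medium",
      .default = "large"               # replaces the old TRUE ~ "large"
    )
  )
```

The .default argument reads as "if all else fails", which is exactly the phrasing that was hard to justify for the old TRUE catch-all.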

And finally, with dplyr, I want to highlight what I think is a very exciting development with if_else(), case_when(), and friends. Let's say that I have my data frame, something like this, and I want to create a new variable that shows the units, the grams, next to the values. So for each of the body masses I also want a character value that brings together the value and the units. It used to be that you had to type NA_character_ here, because all the other values in that column were characters. So, to do this very simple task of creating a character string from a numeric, you had to introduce the notion that there are different sorts of NAs in R, and I think that's a pretty high-level thing to be mentioning. Now you can actually get away with calling this NA, and you don't have to worry about NA_character_ or the other NA variants, which I don't think belong as early in an introductory R curriculum as what we have here.
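This change might look like the following sketch (body_mass_lbl is an illustrative name I've chosen for the new column):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  mutate(
    body_mass_lbl = if_else(
      is.na(body_mass_g),
      NA,                        # previously this had to be NA_character_
      paste(body_mass_g, "g")    # e.g. "3750 g"
    )
  )
```

Before dplyr 1.1.0, the missing branch had to match the character type of the other branch exactly, which forced NA_character_ into introductory material far earlier than the concept deserved.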

New syntax for separating columns

Next, let's leave dplyr aside, take a bit of a detour, and talk about tidyr, specifically about separating columns. I have a new data frame here with data on penguins as text-string descriptions. Maybe you have a different scientist who, instead of putting things in a spreadsheet, has put them in a Word document or something, and you have these three specimens and some descriptions of them. What you might want to do is take these descriptions and turn them into a data frame, a spreadsheet-like version.

The separate() function existed previously in tidyr, but it has evolved to allow for different ways of separating. One option is separate_wider_delim(): we want to separate wider because we want to put these pieces of information in columns that go next to each other, as opposed to longer, into additional rows. We want to separate this description column wider, with the comma as the delimiter: I want something like this for one of my variables and something like this for the other, and I'm going to call those new variables species and island.

This gives me something perhaps a bit more useful than the raw descriptions, but there's a lot of repeated information. If you are into writing regular expressions, you can take things one step further with a new function called separate_wider_regex(), where you give it regular expression patterns and, in one go, end up with a species column and an island column without the repeated information. If you are teaching regular expressions, I would say this is the way to go. If you're not teaching regular expressions in your class, or before you teach them, you could do something like the delimiter version and then use some functions from the stringr package, like str_remove(), to get rid of the redundant species and island tags.
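A sketch of both functions; the description strings below are invented stand-ins for the video's example data:

```r
library(tidyr)

descriptions <- tibble::tribble(
  ~id, ~description,
  1, "Species: Adelie, Island: Torgersen",
  2, "Species: Gentoo, Island: Biscoe",
  3, "Species: Chinstrap, Island: Dream"
)

# Split on the comma into two side-by-side columns (repeated text remains):
descriptions |>
  separate_wider_delim(description, delim = ", ",
                       names = c("species", "island"))

# Or use regex patterns: named patterns become columns, unnamed patterns
# are matched and discarded, dropping the boilerplate in one go.
descriptions |>
  separate_wider_regex(
    description,
    patterns = c("Species: ", species = "[^,]+",
                 ", Island: ", island = ".+")
  )
```

The regex version is the cleaner end state; the delimiter version plus stringr::str_remove() is a reasonable stepping stone before regular expressions are introduced.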

Another exciting development here is what happens when things don't work out as expected. If you look at the documentation for these new separate functions, you will see some new arguments, specifically too_few and too_many. What to do if there are too few components: should they be aligned at the start, aligned at the end, or should it just give you an error so that you can look into it? Or, if there are too many of them, should they be dropped, or merged into the last one? The "debug" option will print out diagnostics; it's probably not something you want in your final code, but it's very helpful for interactive use, where you can take a look and then decide whether "align_start" or "align_end" makes more sense.
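For instance, a row with a missing piece could be handled like this (the messy data frame is an invented example):

```r
library(tidyr)

messy <- tibble::tibble(
  description = c("Adelie, Torgersen", "Gentoo")  # second row lacks an island
)

# Default too_few = "error" would stop here; aligning at the start
# instead fills the trailing column with NA.
messy |>
  separate_wider_delim(description, delim = ", ",
                       names = c("species", "island"),
                       too_few = "align_start")

# too_few = "debug" adds diagnostic columns for interactive inspection.
```

Starting from the error, inspecting with "debug", and then committing to "align_start" or "align_end" mirrors the read-the-error-then-decide workflow recommended for the joins earlier.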

Wrapping up

We've talked about a lot of new, exciting updates, so I'd like to wrap up by talking about something maybe a little more light-hearted: one minor update to ggplot2. This is not the only update that has happened to ggplot2, but it's one that continues to catch me in my tracks: a new argument for changing how your lines look, for line geoms. If I do something like this, where I am plotting flipper length versus body mass of penguins and I want to overlay a smooth curve, but I want that line to be a bit thicker, you will get a warning that says: please use linewidth instead of size. I do agree that linewidth seems to be the more appropriate name here, and in the past I've struggled to articulate why it was called size. That being said, the best teaching tip I can give you is to check the output of your old teaching materials thoroughly, so you don't make a fool of yourself when teaching by having this warning constantly pop up at unexpected times.
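The change is a one-word swap in the geom call, roughly like this:

```r
library(ggplot2)
library(palmerpenguins)

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(linewidth = 2)   # was size = 2; size now warns for line geoms
```

Note that size is still the right aesthetic for points; the rename applies to line thickness, which is exactly why the new name is clearer.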

So I've tried to go through much of the code that is in the accompanying blog post and interactively explain some of these updates. I should mention and emphasize that much of what I've talked through are things you can now do optionally, and in places where I feel like they are an improvement in the context of teaching, I've tried to highlight that. You could also not do any of these things; chances are none of your code is going to fail, per se, unless there was something erroneous slipping through to begin with. But I think these updates are nice to think about as we approach the new academic year and as you consider what to change or update in your teaching materials.

And remember that the tidyverse blog is a great place to catch up with all tidyverse updates, whether teaching-related or not. There have been a lot of other updates to the packages that you will see if you dig through this year's posts: to stringr, to tidyr, to forcats, a new API for the rvest package, and, if you also tend to teach SQL along with your data science content, shorter, more readable, and in some cases faster SQL queries in dbplyr. I'll wrap up by saying that the new edition of R for Data Science is out, and many of the updates I've talked about here were done in order to catch up with what we wanted to showcase in the book, so you can expect development to slow down a little as you phase these new changes into your teaching materials. Thank you very much for listening.