Resources

Teaching the tidyverse in 2023 | Mine Çetinkaya-Rundel

Recommendations for teaching the tidyverse in 2023, summarizing package updates most relevant for teaching data science with the tidyverse, particularly to new learners.

00:00 Introduction
00:46 Using addins to switch between RStudio themes (see https://github.com/mine-cetinkaya-rundel/addmins for more info)
01:40 Native pipe
03:08 Nine core packages in tidyverse 2.0.0
07:15 Conflict resolution in the tidyverse
11:30 Improved and expanded *_join() functionality
22:05 Per-operation grouping
27:41 Quality of life improvements to case_when() and if_else()
31:41 New syntax for separating columns
34:51 New argument for line geoms: linewidth
36:08 Wrap up

See more in the Teaching the tidyverse in 2023 blog post: https://www.tidyverse.org/blog/2023/08/teach-tidyverse-23


Transcript#

This transcript was generated automatically and may contain errors.

Hello, and thank you for joining me today. I'm Mine Çetinkaya-Rundel. I am a developer educator at Posit and a professor of the practice at Duke University. I mainly teach statistics and data science, and today's agenda is teaching the tidyverse in 2023. I am going to talk to you a little bit about how the code base of the tidyverse has changed over the last year, and when I say last year, I really mean the academic year. There's an accompanying blog post to this video that you are welcome and encouraged to read through for more details as well.

But before we get started with the tidyverse, let me start with a little trick for you. This is usually what my RStudio IDE looks like when I get started teaching. This is the color scheme, the theme, I prefer to use when I am working on my own during the day; this is what I prefer when I'm working at night; and this is what I prefer when I'm teaching. The way I switch between these different themes is that I've written a personal package with some add-ins for me, so I can quickly switch between teaching mode and the themes I like to use when I am actually working on my own. I recommend setting up some shortcuts like this for yourself as well.

So we are going to start by loading the tidyverse and palmerpenguins packages, which we're going to use for some of our data examples, and I'm going to load two other packages, gt and package load. We're going to use these for just a couple of things along the way.

Native pipe

The first thing I'm going to talk about is the native pipe. Your tidyverse pipelines may have looked something like this in the past: you start with a data frame and then pipe it into, say, a dplyr function like count(), and the result looks something like this. Since R 4.1 you can use the native pipe here as well, and you're going to get exactly the same answer. So you might be wondering, why would I want to do that? Well, one of the nice things about using the native pipe is that it works even outside of the tidyverse, without loading the magrittr package at all.

So let's go ahead and restart R, so that none of my packages are loaded, and let's try something else. Start with the mpg data frame, or maybe even the mtcars data frame, and then use a base R function like summary(). Even though I have none of the tidyverse packages loaded, and the magrittr package isn't loaded either, I'm able to use the native pipe. So if you're teaching your students with the native pipe and teaching them to build these tidyverse pipelines, they can learn that workflow and apply it beyond the tidyverse as well, should they be using a different package, taking a different course, or coding in R at some other point in their life.
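As a minimal sketch of the point above, the native pipe works in a completely fresh R session (4.1 or later) with no packages attached:

```r
# The native pipe |> needs no packages at all -- base R only.
mtcars |> summary()

# It composes left to right just like the magrittr pipe for simple calls:
mtcars |> subset(cyl == 4) |> nrow()
#> [1] 11
```

Because `|>` is part of the language itself, pipelines written this way keep working even in courses or projects that never load the tidyverse.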

Nine core packages in tidyverse 2.0.0

Coming back to the tidyverse, one of the updates was to the tidyverse meta-package itself. Tidyverse 2.0.0 was released in, I think, March of this year. Let's go ahead and restart our R session one more time and load the tidyverse package. In the past we would also have to load the lubridate package separately, but we don't really have to do that anymore. So I'm going to restart, and I'm going to show you that when I load the tidyverse, the lubridate package is loaded as well.

If you have not used or taught the lubridate package before, let's go through its functionality very briefly. Perhaps the most important thing about it, as you can tell from the name, is that it allows you to work with dates and times, and these can be pesky things to work with. So let's go ahead and put in today's date. This is right now a numeric variable, as you can see. Then I can use the lubridate package's ymd() function, which stands for year, month, and day, and it will parse that value into a date object for me.

This probably seems pretty straightforward, right? We could have probably done this with some string parsing without the lubridate package as well. So let's try something a little more complicated. I'm going to put this in as a text field, so I might say "7/19/2023", and we can see that the class of this new object, the variable we created (oops, I should say today_t), is character. This is in the format month, day, year, so I'm going to use the mdy() function, and lubridate will convert that to a date object as well. We can check that it does: mdy(today_t), and there we go.

This is, I think, a little more complex, but again we could have easily parsed this string ourselves using the forward slashes as delimiters. So let's make something even more complex: an actual text string. I will type something like "This video is being recorded on 19 July 2023 at 1pm ET". You can see that the class of this is character, so how are we going to parse the date out of here? Doing so with string parsing would be more complicated; we would have to capture that "19 July 2023", which might take a bit of knowing about regular expressions. But with lubridate I can say this is in the order day, month, year, and there's also an hour in there, and it will go in and try to parse out the date and time components from that text string. Now, the thing is, the time zone is incorrect. Time zones are additionally pesky, so in this case you would actually need to add the time zone yourself, and now we have the right time as well. If I look at the class of the object we created, it is a POSIXct object. So lubridate is pretty helpful for dealing with pesky things like dates and times, and now it comes with the core tidyverse as well.
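The parsing steps above might look like the sketch below; the exact sentence and variable names are illustrative, and the time zone string is one plausible choice for US Eastern time:

```r
library(lubridate)  # attached automatically by tidyverse 2.0.0

ymd("2023-07-19")      # year-month-day string -> Date
mdy("7/19/2023")       # month/day/year string -> Date

# Parsing a date-time out of free text, supplying the time zone explicitly;
# dmy_h() looks for day, month, year, and hour components in the string.
x <- dmy_h("This video is being recorded on 19 July 2023 at 1pm",
           tz = "America/New_York")
class(x)   # "POSIXct" "POSIXt"
```

Note that lubridate's parsers are forgiving about surrounding text, which is exactly what makes them nicer than hand-rolled regular expressions for this kind of input.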

Conflict resolution in the tidyverse

Next, let's go back to when we first loaded the tidyverse package. I am going to restart my R session and talk a little bit about conflict resolution. What do I mean by conflict resolution? When I load the tidyverse package, you can see that there is a new, updated message: it tells me that two functions in the core tidyverse, the filter() and lag() functions in the dplyr package, mask the functions with the same names in a package that is already loaded with base R, the stats package. This message has been here for some time now, but what is new in tidyverse 2.0.0 is that it specifically advertises the conflicted package as well. And you can always get back to this message, even if you're not at the beginning of your R session, with the tidyverse_conflicts() function, which will print out the end of that message for you.

So let's go ahead and unload the dplyr package, and let's try to run this very common tidyverse pipeline: I want to take the penguins data frame and filter it for penguins that are Adelies. You've probably run into this issue before; if you haven't, kudos to you, but your students probably have. What's happening here is that the error we get is not that R can't find the filter() function. It can't find the dplyr filter(), but it is able to find the stats filter(), and that function is complaining that it can't find this penguins object.

The way base R resolves this is that if you then load the dplyr package, its filter() function becomes available, and because it's been loaded after base R, R will automatically choose the version of the function that was loaded last. Things work just fine, until they fail. So if we want to be a little more strict about having multiple packages loaded with the same function names, we can leverage the conflicted package, which is not in the core tidyverse but is a tidyverse package. I'm going to go ahead and load that, and once I load conflicted, I can no longer get away with just using the last-loaded version of filter(). It explicitly tells me: look, there are two packages loaded that both have a filter() function; tell me which one you want me to use.

One option is to use the :: notation to say, every time I use filter(), I explicitly want the dplyr version. This is the sort of thing you might see in a package development scenario, where you want to be very clear about where functions from other packages are coming from. But in a data analysis scenario this gets pretty busy, especially if you're teaching R, and especially if you're teaching dplyr for the first time: you're probably going to have a bunch of filter() statements, and constantly prefixing them with dplyr:: while not doing so for all the other functions you're teaching can, I think, get confusing for students. So instead you can explicitly declare which version of filter() you want to use. You can say: in this R session, going forward, I prefer the dplyr version of filter(). No more complaints from the conflicted package, and I am being very explicit in my code about which version I'm using.
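The two options above look roughly like this in code (a sketch, assuming conflicted 1.2.0 or later, which provides conflicts_prefer()):

```r
library(conflicted)
library(dplyr)

# Option 1: namespace every ambiguous call -- precise, but noisy in teaching.
dplyr::filter(mtcars, cyl == 4)

# Option 2: declare a session-wide preference once, then call filter() bare.
conflicts_prefer(dplyr::filter)
filter(mtcars, cyl == 4)
```

After conflicts_prefer() runs, a bare filter() no longer errors, and the preference is recorded explicitly at the top of your script rather than depending on package load order.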

Improved and expanded *_join() functionality

So that was all about the tidyverse meta-package. Let's talk a little bit about the improved and expanded functionality of joins. Much of this came with updates to dplyr from 1.1.0 onwards; I think as of today we're at 1.1.2, so there have been additional updates to smooth out the rough edges, but generally these updates are mainly to better align with tools like SQL and data.table. We have here a sample data frame: I've created an islands data frame where I have noted the coordinates of the centroid of each of these three islands.

And what I'm going to do is join the penguins data frame to this. I'll take the penguins data frame and left-join it to islands; this should be a pipeline you're familiar with. What we've done in the past was to use the by argument and denote which variable from the penguins data frame (the island variable) should match which variable from the islands data frame (the name variable). Instead, what we can do now is use join_by(), a new function, and say: where island is equal to name.

So with this new notation, you can actually read the join out loud the way we often read logical statements out loud: a double equals is something we would read as "where x is equal to y". So now I'm able to say: take the penguins data frame and left-join it to the islands data frame where island is equal to name. While the previous syntax, the by syntax, does not go away, I would strongly recommend updating your teaching materials to use the join_by() function. The first thing that is nice about it is that reading it out loud is made so much easier, and it aligns with how we read other logical statements out loud.


And number two: with the older version of this, the form we know worked, some other variants worked as well. I could also do something like this, and it would work just because of how named vectors work in R; but this wouldn't work, because R says it can't find the new name you're talking about. We also saw at the beginning that quoting both of these works too. Options are good, but sometimes they can be confusing. By using the join_by() notation we no longer need to quote our variable names, which aligns much better with other functions in the tidyverse.
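Side by side, the old and new syntax might look like this; the islands tibble below uses made-up placeholder coordinates, not the islands' real centroids:

```r
library(dplyr)
library(palmerpenguins)

# Placeholder coordinates for illustration only.
islands <- tibble::tribble(
  ~name,       ~lat,   ~lon,
  "Torgersen", -64.77, -64.08,
  "Biscoe",    -65.43, -65.50,
  "Dream",     -64.73, -64.23
)

# Old style: named character vector, quoted names.
penguins |> left_join(islands, by = c("island" = "name"))

# New style: join_by(), read as "where island is equal to name".
penguins |> left_join(islands, join_by(island == name))
```

Both calls return the same result; join_by() simply drops the quoting and reads like the logical statement it is.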

Another update to the join functions concerns many-to-many relationships. I'm going to create a few more data frames so that we have them in our pocket for our examples. One of them is a data frame called three_penguins: it has three rows, so data on three penguins, with one row per sample ID. Imagine you're a scientist collecting data on these penguin samples; this is what my three_penguins data frame looks like. Then we also have, let's say collected by another scientist, data on weight measurements of these penguins.

If you have ever collected weight data from animals, you have probably done this thing where you have to measure at least twice. So we have this weight measurements data where, for each penguin's sample ID, we have two measurements. And finally we have one more data set, flipper length measurements of these penguins; similarly, we've taken two measurements per penguin. So let's say we have data from these three separate scientists and we want to bring them together.

First I'm going to start by bringing together the three_penguins data set with the weight measurements data set. That's just a one-to-many relationship, and things work fine. However, when I have many-to-many relationships (remember those two scientists who each take two measurements from each of the three penguins), if I try to join the weight measurements and flipper measurements data frames from these two scientists, I now get a warning. It says that it detected a many-to-many relationship between x and y, which was unexpected, and at the end of the warning it says: if a many-to-many relationship is expected, you can set relationship = "many-to-many".

Whenever I see a warning like this, my first hunch is, okay, let's just try what it tells me to do; at least it will silence the warning, and maybe it will also be the right thing to do. So let's go ahead and add that. This relationship argument is new as of dplyr 1.1.0, and now we have explicitly said yes, we were expecting this many-to-many relationship. However, let's think about this for a second. We started with three penguins and two measurements per penguin, so six rows of data, and we're bringing together the data from these two scientists, but all of a sudden I have an explosion-of-rows situation: I have 12 rows. Twelve is not a large number, but imagine if you had hundreds of measurements from these scientists; you would end up with a real explosion of rows.

So in this case, what we actually should be doing, instead of declaring a many-to-many relationship, is to think back to our data structure and say: no, I want to join not just by sample ID but by measurement ID as well. The join_by() function can take multiple variables, just like the by argument of the joins can. The relationship argument of the improved join functions is great for catching you in your tracks when you're making a mistake about how you're joining data sets, but the suggestion it gives may not always be the right one for your use case. So don't just blindly apply what the warnings suggest; instead, think through your data structure to see how the data sets should come together.
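A small reproduction of the trap above; the column names (samp_id, meas_id) and measurement values are my own illustrative choices, not taken from the video:

```r
library(dplyr)

weight_measurements <- tibble::tribble(
  ~samp_id, ~meas_id, ~body_mass_g,
  1, 1, 3220,  1, 2, 3250,
  2, 1, 4730,  2, 2, 4725,
  3, 1, 4000,  3, 2, 4050
)

flipper_measurements <- tibble::tribble(
  ~samp_id, ~meas_id, ~flipper_length_mm,
  1, 1, 193,  1, 2, 195,
  2, 1, 214,  2, 2, 216,
  3, 1, 203,  3, 2, 203
)

# Joining on samp_id alone matches every weight row to both flipper rows
# for that penguin: 12 rows instead of 6. Silencing the warning with
# relationship = "many-to-many" accepts the row explosion.
weight_measurements |>
  left_join(flipper_measurements, join_by(samp_id),
            relationship = "many-to-many")

# The real fix: join on both keys, keeping the expected 6 rows.
weight_measurements |>
  left_join(flipper_measurements, join_by(samp_id, meas_id))
```

Walking learners through the 12-row output before showing the two-key join makes the point about reading warnings critically rather than just silencing them.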

And I think the teaching tip to take away from here is that when you're introducing these functions to students, I would recommend doing something similar to what I've done here: iteratively fall into the traps that you expect (or don't expect) to fall into, read the warnings, messages, or errors out loud, think through them together with your students, and then apply a solution to the problem.

The explosion-of-rows problem is something you have probably come across, particularly when working with new learners, who will happily join any two data frames and sometimes crash their R session, or, if you're teaching on a teaching server, crash the whole server. So walking through these sorts of examples with small data sets, where they can see how the rows are expanding or contracting back to what we expect, is, I think, pretty helpful.

Another new argument handles unmatched cases in the improved join functions. I'm going to make one more new data frame; this time my measurements include one more penguin, so I have four penguins in my data frame. We're going to take the weight measurements from that one scientist and left-join them to these four penguins. R will happily join them for you, but poof, that fourth penguin is gone. Maybe you never wanted the fourth penguin and this is exactly what you wanted, but it is likely that in a much larger data set some of these unmatched cases would silently disappear and you would never catch them.

So one option is to say: well, I definitely don't care about any of the penguins whose weights I don't have, so I'm going to explicitly use an inner join to solve this problem, not a left join, saying that I only want the penguins that are common between the two data frames. Another option is to use the unmatched argument and say: when I do this join, explicitly drop anything that's unmatched. The results we're getting from these are exactly the same, and in fact you can also do nothing, exactly what we did at the beginning, but you do that at your own risk.

The difference between using the unmatched argument or not is basically saying: I'm making it very explicit, so that when I come back to this code I will remember that I explicitly did not care about the penguins whose weights we did not have. Another value you can use for the unmatched argument is "error". This might be a good starting place: if there are any unmatched cases, just error out; I will stop at that point, read through the error, and then decide what I want to do. The error tells me that each row of y must be matched to x, and in this case row 4 of y was not matched. Now it's basically telling you: decide what you want to do with it, and you might go back to saying, fine, drop that one. So the "error" value for unmatched is a great starting place, particularly for interactive data analysis, and then you can explicitly choose one of the other values and go with that.
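A sketch of the three options; the data frames and column names below are illustrative stand-ins for the video's examples:

```r
library(dplyr)

weights <- tibble::tribble(
  ~samp_id, ~body_mass_g,
  1, 3220,  2, 4730,  3, 4000
)
four_penguins <- tibble::tribble(
  ~samp_id, ~species,
  1, "Adelie",  2, "Gentoo",  3, "Chinstrap",  4, "Adelie"
)

# Default: penguin 4 (unmatched row of y) silently disappears.
weights |> left_join(four_penguins, join_by(samp_id))

# Explicit intent: drop unmatched rows of y, same result, documented.
weights |> left_join(four_penguins, join_by(samp_id), unmatched = "drop")

# Or keep only the common penguins with an inner join.
weights |> inner_join(four_penguins, join_by(samp_id))

# A strict starting point for interactive work: error on unmatched rows
# (wrapped in try() here so the script can continue past the error).
try(weights |> left_join(four_penguins, join_by(samp_id),
                         unmatched = "error"))
```

For left_join(), unmatched applies to rows of y, which is why the silently dropped fourth penguin is what triggers the error here.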

Per operation grouping

All right, we've talked about joins in the new dplyr; let's talk about one other feature of the most recent dplyr, which is per-operation grouping. I'm going to start with a very typical tidyverse pipeline here: I want to plot the mean body weights of penguins by species and sex. I'm going to start with the penguins data set; just for ease, I'm going to drop any of the penguins for which we don't have body mass or sex information. Then I am going to group them by species and sex (you can see that this is now a grouped data frame), then summarize to calculate the mean body weights per species and sex of these penguins, and then let's go ahead and actually make a plot of this.

Now, you can see that when I make this plot, or even earlier when I just ran my summary, a message got printed out. It says: summarise() has grouped output by 'species'; you can override this using the .groups argument. This message, which can catch you by surprise, particularly if you're doing something like creating a plot where summarize is not even the last statement in your data pipeline, is basically saying that you had grouped the data by two variables; when we ran summarize, it peeled off the last grouping variable, but the data are still grouped by species.

It turns out that for making this plot it doesn't make any difference, so you can happily ignore that message. But what generally tends to happen is that if we don't address the fact that the data frame is still grouped, it can result in frustrating situations, particularly for new learners, but trust me, even for experienced users, because the underlying data remain grouped, and sometimes this has downstream effects. It didn't for the ggplot, but let's look at a couple of examples.

Here, for example, I am grouping the data by species and sex again, finding the mean body weights again, and then I want the top row of that output. But if I actually drop my groups, then the top row of that output is just one row. So when grouping persists through the pipeline, certain functions will output slightly different results, and if you were expecting one and you get the other, or vice versa, it can be pretty confusing to track down where this happened. Another example I like to give uses the wonderful gt package for making tables. Here I have data that are still grouped after running the summarize function, so the gt package will nicely group my output and display it as such. However, when I explicitly drop my groups, that grouping structure goes away, so gt doesn't apply that nice grouping. Whichever one you want, the idea is that you probably want to be explicit about it.

So one of the options we saw is that every time you run summarize, you can decide what you want to do with your groups: you can decide to drop them, which means the resulting data frame is not grouped, or you can assign another value like "keep" or "drop_last". Alternatively, with the new dplyr, you have a new option: per-operation grouping. You can see that in this new pipeline I no longer have a group_by() statement. I start with my data frame, drop the NAs, and then say: summarize to calculate the mean body masses for me, but do that by species. This new .by argument allows me to do the grouping in the operation where I am doing the summarization, and the resulting data frame is no longer grouped.

And I can actually do this by species and sex, like we were doing before, and again the resulting data frame is not grouped. I didn't have to explicitly say "once you're done with it, drop the groups", which I think is a slightly less straightforward or intuitive way of going about things. Here we're just explicitly saying: when you're summarizing, group by these two variables, and when you're done, don't leave the data frame grouped. So my recommendation would be to consider teaching the .by version of this, per-operation grouping, particularly if you find that your learners tend to struggle with persistent grouping in dplyr pipelines.
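The contrast between persistent and per-operation grouping might be sketched like this (mean_bm is my own illustrative name for the summary column):

```r
library(dplyr)
library(palmerpenguins)

# Persistent grouping: summarize() leaves the result grouped unless
# you remember to set .groups = "drop" (or ungroup() afterwards).
penguins |>
  tidyr::drop_na(body_mass_g, sex) |>
  group_by(species, sex) |>
  summarize(mean_bm = mean(body_mass_g), .groups = "drop")

# Per-operation grouping (dplyr >= 1.1.0): the grouping exists only
# inside this one summarize() call; the result is never grouped.
penguins |>
  tidyr::drop_na(body_mass_g, sex) |>
  summarize(mean_bm = mean(body_mass_g), .by = c(species, sex))
```

Both pipelines return the same four-to-six summary rows, but only the first one requires learners to reason about grouping state that persists beyond the call.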

That being said, the new edition of R for Data Science, which very recently came out, does mention the .by argument for per-operation grouping, but it doesn't dwell on it too much; the majority of the book still uses group_by(). So you can decide based on how strictly you want to stick to the text, if you are using it for your teaching.

Quality of life improvements to case_when() and if_else()

All right, one more bit on dplyr: my most favorite function, case_when(). For some reason I really like that function, and now it has gotten some updates as well, some quality-of-life improvements. Let's start with this: I'm going to calculate some quantiles for the penguins data set, so I have the 25th percentile and the 75th percentile of my data, and using these I want to categorize my penguins as small, medium, or large. If a penguin is less than 3550 grams, I'm going to label them small; if they're between the two cutoffs, I'm going to label them medium; if they're greater than 4750 grams, I'm going to label them large; and if they were NAs, I want to leave them as NAs.

What we're seeing here is a typical case_when() pipeline, where the last condition is TRUE, meaning: if it's not any of these, then for those rows, label things large. Let's also do something that will allow us to more easily see what's going on in this pipeline. Whenever I have a wide data frame with lots of columns, and I can't see the columns I'm operating on when I run my pipeline, I like moving them up. So here I probably want to move up the body mass and the new variable I created, the bm_cat variable; let's relocate those so that I can spot-check that my assignment is working well. In fact, we might temporarily pipe this to the View() function just to make sure that we have some larges as well, and indeed, in our data frame, we do.

I have personally always found explaining why we call this TRUE a little challenging when teaching case_when(), because it doesn't necessarily read like the other conditions, and even when it does, I want to say something like "if all else fails". So case_when() has gained a new argument called .default that basically says: if it's not any of those, make it large. Now we can achieve the same outcome without using TRUE as a catch-all for anything that didn't match.
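A sketch of the new form; the cutoffs match the quartiles mentioned above, and bm_cat is my own illustrative column name:

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  mutate(
    bm_cat = case_when(
      is.na(body_mass_g)  ~ NA,        # plain NA works in dplyr >= 1.1.0
      body_mass_g < 3550  ~ "small",
      body_mass_g <= 4750 ~ "medium",
      .default = "large"               # replaces the old TRUE ~ "large"
    )
  )
```

The .default argument reads as "if all else fails", which is exactly the phrasing that was hard to justify for the old TRUE catch-all.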

And finally, with dplyr, I want to highlight what I think is a very exciting development with if_else(), case_when(), and friends. Let's say that I have my data frame, something like this, and I want to create a new variable that shows the units, the grams, next to the values. So for each of the body masses I also want a character value that brings together the value and the units. It used to be that you had to type NA_character_ here, because all the other values in that column were characters. So, to do this very simple task of creating a character string from a numeric, you had to introduce the notion that there are different sorts of NAs in R, and I think that's a pretty high-level thing to be mentioning. Now you can actually get away with calling this NA, and you don't have to worry about NA_character_ or the other NA variants, which I don't think belong as early in an introductory R curriculum as what we have here.
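This change might look like the following sketch (body_mass_lbl is an illustrative name I've chosen for the new column):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  mutate(
    body_mass_lbl = if_else(
      is.na(body_mass_g),
      NA,                        # previously this had to be NA_character_
      paste(body_mass_g, "g")    # e.g. "3750 g"
    )
  )
```

Before dplyr 1.1.0, the missing branch had to match the character type of the other branch exactly, which forced NA_character_ into introductory material far earlier than the concept deserved.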

New syntax for separating columns

Next, let's leave dplyr aside, take a bit of a detour, and talk about tidyr, specifically about separating columns. I have a new data frame here with data on penguins as text-string descriptions. Maybe you have a different scientist who, instead of putting things in a spreadsheet, has put them in a Word document or something, and you have these three specimens and some descriptions of them. What you might want to do is take these descriptions and turn them into a data frame, a spreadsheet-like version.

The separate() function existed previously in tidyr, but it has evolved to allow for different ways of separating. One option is separate_wider_delim(): we want to separate wider because we want to put these pieces of information in columns that go next to each other, as opposed to longer, into additional rows. We want to separate this description column wider, with the comma as the delimiter: I want something like this for one of my variables and something like this for the other, and I'm going to call those new variables species and island.

This gives me something perhaps a bit more useful than the raw descriptions, but there's a lot of repeated information. If you are into writing regular expressions, you can take things one step further with a new function called separate_wider_regex(), where you give it regular expression patterns and, in one go, end up with a species column and an island column without the repeated information. If you are teaching regular expressions, I would say this is the way to go. If you're not teaching regular expressions in your class, or before you teach them, you could do something like the delimiter version and then use some functions from the stringr package, like str_remove(), to get rid of the redundant species and island tags.
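A sketch of both functions; the description strings below are invented stand-ins for the video's example data:

```r
library(tidyr)

descriptions <- tibble::tribble(
  ~id, ~description,
  1, "Species: Adelie, Island: Torgersen",
  2, "Species: Gentoo, Island: Biscoe",
  3, "Species: Chinstrap, Island: Dream"
)

# Split on the comma into two side-by-side columns (repeated text remains):
descriptions |>
  separate_wider_delim(description, delim = ", ",
                       names = c("species", "island"))

# Or use regex patterns: named patterns become columns, unnamed patterns
# are matched and discarded, dropping the boilerplate in one go.
descriptions |>
  separate_wider_regex(
    description,
    patterns = c("Species: ", species = "[^,]+",
                 ", Island: ", island = ".+")
  )
```

The regex version is the cleaner end state; the delimiter version plus stringr::str_remove() is a reasonable stepping stone before regular expressions are introduced.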

Another exciting development here is what happens when things don't work out as expected. If you look at the documentation for these new separate functions, you will see some new arguments, specifically too_few and too_many. What to do if there are too few components: should they be aligned at the start, aligned at the end, or should it just give you an error so that you can look into it? Or, if there are too many of them, should they be dropped, or merged into the last one? The "debug" option will print out diagnostics; it's probably not something you want in your final code, but it's very helpful for interactive use, where you can take a look and then decide whether "align_start" or "align_end" makes more sense.
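For instance, a row with a missing piece could be handled like this (the messy data frame is an invented example):

```r
library(tidyr)

messy <- tibble::tibble(
  description = c("Adelie, Torgersen", "Gentoo")  # second row lacks an island
)

# Default too_few = "error" would stop here; aligning at the start
# instead fills the trailing column with NA.
messy |>
  separate_wider_delim(description, delim = ", ",
                       names = c("species", "island"),
                       too_few = "align_start")

# too_few = "debug" adds diagnostic columns for interactive inspection.
```

Starting from the error, inspecting with "debug", and then committing to "align_start" or "align_end" mirrors the read-the-error-then-decide workflow recommended for the joins earlier.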

Wrapping up

We've talked about a lot of new, exciting updates, so I'd like to wrap up by talking about something maybe a little more light-hearted: one minor update to ggplot2. This is not the only update that has happened to ggplot2, but it's one that continues to catch me in my tracks: a new argument for changing how your lines look, for line geoms. If I do something like this, where I am plotting flipper length versus body mass of penguins and I want to overlay a smooth curve, but I want that line to be a bit thicker, you will get a warning that says: please use linewidth instead of size. I do agree that linewidth seems to be the more appropriate name here, and in the past I've struggled to articulate why it was called size. That being said, the best teaching tip I can give you is to check the output of your old teaching materials thoroughly, so you don't make a fool of yourself when teaching by having this warning constantly pop up at unexpected times.
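The change is a one-word swap in the geom call, roughly like this:

```r
library(ggplot2)
library(palmerpenguins)

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(linewidth = 2)   # was size = 2; size now warns for line geoms
```

Note that size is still the right aesthetic for points; the rename applies to line thickness, which is exactly why the new name is clearer.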

So I've tried to go through much of the code that is in the accompanying blog post and interactively explain some of these updates. I should mention and emphasize that much of what I've talked through are things you can now do optionally, and in places where I feel like they are an improvement in the context of teaching, I've tried to highlight that. You could also not do any of these things; chances are none of your code is going to fail, per se, unless there was something erroneous slipping through to begin with. But I think these updates are nice to think about as we approach the new academic year and as you consider what to change or update in your teaching materials.

And remember that the tidyverse blog is a great place to catch up with all tidyverse updates, whether teaching-related or not. There have been a lot of other updates to the packages that you will see if you dig through this year's posts: to stringr, to tidyr, to forcats, a new API for the rvest package, and, if you also tend to teach SQL along with your data science content, shorter, more readable, and in some cases faster SQL queries in dbplyr. I'll wrap up by saying that the new edition of R for Data Science is out, and many of the updates I've talked about here were done in order to catch up with what we wanted to showcase in the book, so you can expect development to slow down a little as you phase these new changes into your teaching materials. Thank you very much for listening.