
Jenny Bryan | Lazy evaluation | RStudio (2019)
The "tidy eval" framework is implemented in the rlang package and is rolling out in packages across the tidyverse and beyond. There is a lively conversation these days, as people come to terms with tidy eval and share their struggles and successes with the community. Why is this such a big deal? For starters, never before have so many people engaged with R's lazy evaluation model and been encouraged and/or required to manipulate it. I'll cover some background fundamentals that provide the rationale for tidy eval and that equip you to get the most from other talks. VIEW MATERIALS https://github.com/jennybc/tidy-eval-context#readme About the Author Jenny Bryan Jenny is a recovering biostatistician who takes special delight in eliminating the small agonies of data analysis. She’s part of Hadley’s team, working on R packages and integrating them into fluid workflows. She’s been working in R/S for over 20 years, serves in the leadership of rOpenSci and Forwards, and is an Ordinary Member of the R Foundation. Jenny is an Associate Professor of Statistics (on leave) at the University of British Columbia, where she created the course STAT 545
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
If you haven't heard of tidy eval, I'm about to explain it, but a lot of people have heard about it and have a certain amount of discomfort and anxiety as evidenced by, for example, these two provocatively titled community posts.
And so part of what I want to answer is: why are we putting you through this? What is tidy evaluation, and is it going to be bad for your health, something you might care about? And I think part of the reason that so many people know about it but don't know it, or know about it and fear that they're missing something or falling behind, is this phenomenon that I liken to when people become vegan or start doing CrossFit: they tend to tell you about it.
So when people start learning tidy eval, they tell you about it, and they blog about it and they tweet about it, and so it creates in some sense perhaps more general awareness than is actually necessary.
Things to learn before tidy eval
So I'd like to kind of get out ahead of some of this, but first I'd like to undermine my whole talk with a list of things that might give you more bang-bang for your buck than learning tidy eval.
And this was kind of crowdsourced within the tidyverse team to describe a few things that you probably want to learn first before you tackle tidy eval. So being pretty comfortable writing functions in general would be a really sound thing to start with. I think a lot of people work in a domain like maps or time series or text analysis where you have some very effective tooling in your area and getting really up to speed on what the bleeding edge is there could be extremely fruitful.
You know that I'm very into data wrangling, and so I think being comfortable with all of the different data receptacles in R is really rewarding, especially lists: being comfortable putting your lists inside your data frame, and the nesting/unnesting paradigm, which is often how this happens. It can be uncomfortable in there, so you often want to get back out of that situation, which is the unnesting.
I'm a big fan of the purrr package. It's described as functional programming, which is accurate; I also think a completely fair way to think of it is as a way to iterate really well in R with one common pattern. And then, to elevate one specific set of functions, I decided to give the award to the scoped dplyr verbs, which are probably underutilized: they let you go in and do a mutate or something like that very selectively, at certain positions or for certain names.
What is tidy eval?
But if you've decided that you may in fact actually need tidy eval, I'd like to demystify a bit what it is. For some people here this is more than you needed to know; or you already know this and you're going to love Lionel's talk, which I'm kind of setting up a bit.
So what is tidy eval? I would call it a toolkit for metaprogramming: for code that writes code, or code that mutates code. And why is it called tidy eval? Is there something about it that is intrinsically tidy, instead of us just putting tidy on everything?
And I think there is, though explaining it would be at a level that's not what this talk is about. But I think quosures are an example: there is actually something intrinsically tidy about the model of changing how evaluation works in the tidy eval world, so it really deserves that name.
But it is also true that the tidyverse itself does a lot of metaprogramming inside, and therefore it requires a certain toolkit to exist, and tidy eval powers that. So it is also the putting tidy on everything; it's both.
So if you don't know what I mean by the tidyverse doing a lot of metaprogramming, let's just look at a little bit of canonical tidyverse code. Here I'm working with the Star Wars dataset that comes with dplyr, doing a little bit of data wrangling in that first example, and then I switch to a different dataset and show you what ggplot2 does.
And I've been talking a lot to Greg Wilson who joined RStudio this year, and he's really coming from Python, and one of the things we end up talking about a lot is like what's up with all these unquoted variable names and everything necessary to make that happen and everything necessary to preserve that in your own functions?
So the lack of quotes around homeworld, height, hwy, and class, that is brought to you by all the use of metaprogramming under the hood.
Nonstandard evaluation explained
And here are several equalities that are technically not true, but they're kind of true for the duration of my talk, and they're very useful. When I talk about metaprogramming here, which is a really general term, I'm basically talking about a more R-specific term (I associate it with R; maybe it's used in other languages) called nonstandard evaluation, NSE. It's a term you'll see a lot, especially once you're walking around in the tidy eval world. And I'm going to go even further and kind of equate that with using unquoted variable names, which is a further abuse of all this, but that's mostly what I'm talking about here.
So what is nonstandard evaluation? Every time we evaluate an expression, an environment, or a chain of environments, is going to be searched for name-value bindings: you've got the name of something, and you need to look up the value currently associated with that name. Nonstandard evaluation means you've got this expression, and before you evaluate it, you mess with it.
You might actually edit the expression, and/or, right before you evaluate it, you manipulate the chain of environments so that it's not the default one that everyone would expect; often you're doing both. And this is what gives you unquoted variable names and working inside the data frame. Both of those things are going on in basically every dplyr function that you use: it puts the data frame first in the search path, the data mask, and then you have these unquoted variable names.
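The two ingredients just described can be sketched in a few lines of base R (a minimal illustration of my own, not code from the talk): quote() captures an expression without evaluating it, the captured expression can be edited, and eval() can be pointed at a non-default place to look up names, such as a data frame acting as a data mask.

```r
# Ingredient 1: capture an expression instead of evaluating it.
e <- quote(height * 2)

# Ingredient 2: evaluate it somewhere non-default -- here the data frame is
# searched first for name-value bindings, i.e. it acts as a data mask.
df <- data.frame(height = c(150, 160, 170))
eval(e, envir = df)
#> [1] 300 320 340

# You can also edit the expression itself before evaluating it.
e[[3]] <- 3          # turn `height * 2` into `height * 3`
eval(e, envir = df)
#> [1] 450 480 510
```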
So any function that accepts an unquoted variable name in a data frame is using NSE under the hood. That means if you want to write functions, you have to confront that somehow, and I'm going to show you that there are different levels of confronting it and how easy or hard that can feel.
This top sentence is referring to a phrase I think was about Ruby, which said that if you make everything equally easy, you're also making everything equally hard. Certain things in Ruby were made very easy, and the bargain you strike is that you make other things hard.
if you make everything equally easy, you're also making everything equally hard.
So the tidyverse struck this bargain in a really, really big way. If you make direct specification of variable names really easy, so people can just type them without quotes, you make indirect specification harder for all of us. And it turns out indirect specification still comes up a lot.
So the two main examples here: what if you have a variable name stored as an object, and you want to drop that in there to drive some dplyr function? Or what if a user has passed a variable name as a function argument; how do you pass it through? These are two pretty mundane examples of where this design decision in the tidyverse makes things harder.
NSE in base R
So is this a problem created by the tidyverse for itself? The answer is no. Before I worked for RStudio, I was a professor and I taught a course on applied statistics, STAT 545, and the example I'm about to show you really comes from teaching that course.
So here are four functions in base R that all use nonstandard evaluation, and before the tidyverse existed, I was a heavy user of these functions because I liked how fluid they were; I liked not retyping my data frame name all the time. lm(), subset(), transform() and with() are all great examples where this ease of interactive use was prioritized.
The problem is, if you read the help for subset(), transform() or with(), all of them contain almost exactly the same wording; this is pulled from one of them. It makes it very clear that this is a convenience function you're really only supposed to be using interactively, when you're sitting there ready to see the errors, and that using it in a programming context may have unanticipated consequences. So you've been warned not to program around these functions.
And I'm going to make this example actually more about lm(). I would teach these students how to do a little bit of data wrangling and data visualization, and so they might write a function that fits a quadratic model; I'm going to use the gapminder data in a moment. Let's say we want to explain life expectancy as a function of time, of the year. This is a super simple model, and the lm() code that might fit it.
And then I would teach them how to do this for every country in the gapminder dataset. They know that they need to wrap lm() in a function and then drop that into some sort of iterative machine, and before the tidyverse existed, we were heavy users and teachers of the apply functions. So here I am dropping that into by(), and indeed it works: it goes and fits this polynomial model to every country in the gapminder dataset.
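A minimal sketch of that workflow, using a toy data frame in place of the gapminder data (which has country, year, and lifeExp columns):

```r
# Fit a quadratic-in-year model for life expectancy; lifeExp is hard-wired.
fit_quadratic <- function(df) {
  lm(lifeExp ~ I(year - 1952) + I((year - 1952)^2), data = df)
}

# A toy stand-in for gapminder: two countries, five years each.
toy <- data.frame(
  country = rep(c("A", "B"), each = 5),
  year    = rep(seq(1952, 1992, by = 10), times = 2),
  lifeExp = c(50, 52, 55, 57, 60, 40, 43, 45, 48, 50)
)

# by() splits the rows by country and fits one model per group.
fits <- by(toy, toy$country, fit_quadratic)
length(fits)
#> [1] 2
```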
And this is where I outsmarted myself. The students were like, okay, this is great, I have this function written. What if I want to do this now for GDP per capita instead of life expectancy? What if I want to let y and x be general here? You might hope, as many do, that that code right there will work. And it won't, okay?
Here are examples of it not working. If you just put those unquoted variable names in in the usual way and cross your fingers, hoping they get passed through, you get an error. If you think, what if I add quotes? You get a different error. It was usually around this time in the course that I'd say, oh gosh, we need to move on to visualization, and I just never really answered this question.
Because this is what the answer would look like. Here is an example of me improving that little function so that you can use it in the usual way. And I really set myself up for failure in some of the ways that I taught things: I would show my students these cool functions with nonstandard evaluation, and then I would teach them how to write functions, and then I'd say, but please do not combine the two, because I couldn't teach them this toolkit.
But if you put yourself through that pain, you get a function that's actually quite pleasant to use, and it does work when dropped into by(), for example, in this case. You can change the variables, and you can still use expressions like year - 1952. So that's pretty cool.
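One base-R way to get that flexibility, as a hedged sketch (the function and argument names are mine, not the slide's code): substitute() captures what the caller typed, and bquote() splices the captured expressions into a model formula.

```r
fit_quadratic <- function(df, y, x) {
  y <- substitute(y)   # capture the unevaluated y expression
  x <- substitute(x)   # capture the unevaluated x expression
  # Splice the captured expressions into a formula, then evaluate it.
  f <- eval(bquote(.(y) ~ poly(.(x), 2)))
  lm(f, data = df)
}

d <- data.frame(year = 1952:1961, lifeExp = 50:59, gdpPercap = (1:10)^2)

# The variables are now general, and expressions still work:
m1 <- fit_quadratic(d, lifeExp, year)
m2 <- fit_quadratic(d, gdpPercap, year - 1952)
```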
So in base R, programming around these NSE functions is possible for sure, but it's always been either explicitly discouraged, with those warning messages, or implicitly discouraged, because it's not made to be very easy.
The tidyverse and the messy eval era
So going back to tidyverse history, I've been told I'm allowed to say there was a messy eval era. I hear there was some thought of maybe not letting people use unquoted variable names in the tidyverse, and that was discarded. The first thing that really rolled out, and they're still there, are functions like aes_string() instead of aes(), for when you have the variable name as a character, or select_() versus regular select(): a standard evaluation version alongside the nonstandard evaluation version. It just turned out that was pretty unpredictable for users, and I think not much fun to maintain; it's not very sustainable. Are you going to create two versions of every function in the tidyverse? That doesn't feel so good.
So the good news is that the tidyverse did ultimately decide to prioritize usability: putting the data mask on, allowing unquoted variable names, allowing you to use unquoted variables in expressions. The bad news is that this made programming around it harder for everyone. And the good news, the inevitable news, is that there then has to be some sort of toolkit, used internally to create consistency, and available to others so they can extend things. That toolkit is provided by rlang.
So this is a package which you may or may not have heard of. It's kind of like vctrs in the sense that ideally most people will not have heard of rlang; it's more of a developer-facing package. That's where the tidy eval toolkit lives, and other things live there as well. Most people should not need to use rlang with their bare hands.
How much tidy eval do you actually need?
So I'm going to close with a rapid-fire set of scenarios. So I'm going to sort of fake what a user might want to do. And then I'm going to tell you how much tidy eval you would need to know.
So if you want to use existing tidyverse functions to analyze data, you do not need to know tidy eval. Congratulations. Continue on in your sort of blissful existence.
But if you see that you have a lot of code that looks incredibly duplicated, where there's common logic and all that's changing is the variable you're doing something to, you probably want to write a function to DRY this code out, to reduce the duplication. But what if the duplicated code is using dplyr? It is still possible that you do not actually need to know tidy eval; "pass the dots" is the phrase I'm going to use, and Lionel will show that in much greater detail. And you definitely don't need rlang.
I'll show one example of this, though it's really more suited to the next talk. If I wanted to take a data frame in, group it by something unspecified, and do a summarize, you could put dots in the signature of your convenience function and pass them straight through. You're keeping up the NSE chain; you are not the weak link in the chain, but you really didn't have to do anything explicitly. And indeed that works: you can take the Star Wars data and group by homeworld or species and it still works.
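A sketch of what that looks like (the function name is mine, not from the slide): the dots in the signature forward whatever grouping variables the caller typed straight into group_by().

```r
library(dplyr)

# Summarize mean height, grouped by whatever the caller passes via `...`.
grouped_mean_height <- function(df, ...) {
  df %>%
    group_by(...) %>%
    summarize(mean_height = mean(height, na.rm = TRUE))
}

# Unquoted grouping variables pass straight through; no tidy eval needed.
grouped_mean_height(starwars, homeworld)
grouped_mean_height(starwars, species, gender)
```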
Again, you're still in this world where you want to write simple functions to reduce duplication. Sometimes your life is not simple enough for pass-the-dots to get the job done, in which case the paradigm you'll need is a function called enquo(): you capture that unquoted thing and quote it, and then at the place you need to mention it, you unquote it with bang-bang (!!), hence the really bad dad joke at the beginning.
So dplyr, ggplot2 and tidyr, for example, all expose this syntax to you, and that's why you don't need to know rlang: it's really anticipated that people will need to use these.
So here's an example where I need to capture a grouping variable and a summary variable with this enquo() bang-bang pattern. The number of arrows on the slide tips you off to why the dot-dot-dot trick wouldn't have worked. And again, this code also works, and I'm able to change, for example, the variable that I'm summarizing over.
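A sketch of that pattern (the names here are mine): enquo() captures what the user typed for each argument, and !! unquotes it at the point of use.

```r
library(dplyr)

grouped_mean <- function(df, group_var, summary_var) {
  group_var   <- enquo(group_var)    # capture the grouping variable
  summary_var <- enquo(summary_var)  # capture the variable to summarize
  df %>%
    group_by(!!group_var) %>%
    summarize(mean = mean(!!summary_var, na.rm = TRUE))
}

# The summary variable can now be changed without touching the function:
grouped_mean(starwars, species, height)
grouped_mean(starwars, species, mass)
```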
And what if you want to make names out of user input? You have to use, I asked Lionel how to say this symbol and he said we call it colon-equals, so you need the := thing. And again, dplyr, ggplot2 and tidyr will know about this; they make it available, and you do not need rlang.
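A sketch of using := to name the output column after the user's input (the names are mine; quo_name() comes from rlang but is re-exported by dplyr, so rlang is not attached directly):

```r
library(dplyr)

grouped_mean <- function(df, group_var, summary_var) {
  group_var   <- enquo(group_var)
  summary_var <- enquo(summary_var)
  # Build the output column name from what the user typed.
  nm <- paste0("mean_", quo_name(summary_var))
  df %>%
    group_by(!!group_var) %>%
    summarize(!!nm := mean(!!summary_var, na.rm = TRUE))
}

# The summary column is now named after the input, e.g. mean_height:
grouped_mean(starwars, species, height)
```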
And if you're someone who wants to compute on expressions and manipulate environments, I'm afraid you do really need to understand R's evaluation model and how the tidy eval toolkit manipulates it. And you will be using rlang to do that, if you're trying to produce something with this tidy eval feel.
Resources
I want to give a shout out to a few resources that were helpful to me as background research. I hope Thomas doesn't mind me pointing this out, but Thomas Lumley wrote something, I'm not sure what he would call it, it's clearly not a paper, this document called Standard Nonstandard Evaluation, back in 2003. It's a nice summary of where things were in base R at that time, which was already feeling a bit of pain from people doing NSE in different ways. And so he does a nice survey of the ways.
And I imagine this document was meant to generate a conversation and move things towards doing it in a more standard way; sort of the base R conversation that ultimately is also happening in the tidyverse. Then Thomas Mailund has a really beautiful pair of blog posts explaining scoping rules and NSE. And Hiroaki Yutani has a beautiful talk on tidy eval; in particular, he has a great example that builds to the end and really explains why tidy eval itself is necessary, which gets at that quality I was hinting at earlier: there is something about tidy eval that's intrinsically tidy.
Internal resources come from the people who bring you tidy eval; these are resources they've written. The second edition of Advanced R, which is under development by Hadley, has a whole part on metaprogramming, with several chapters that are really useful. Lionel Henry is working on a bookdown site, tidyeval.tidyverse.org, still a work in progress, to centralize some of this information and provide some recipes. And there's a great RStudio community thread where people submit wild-caught examples, and then Hadley and Lionel tell them they don't actually need to use tidy eval. It's a nice little miniature code-makeover thread to look over.

