
Jenny Bryan | Lazy evaluation | RStudio (2019)
The "tidy eval" framework is implemented in the rlang package and is rolling out in packages across the tidyverse and beyond. There is a lively conversation these days, as people come to terms with tidy eval and share their struggles and successes with the community. Why is this such a big deal? For starters, never before have so many people engaged with R's lazy evaluation model and been encouraged and/or required to manipulate it. I'll cover some background fundamentals that provide the rationale for tidy eval and that equip you to get the most from other talks. VIEW MATERIALS https://github.com/jennybc/tidy-eval-context#readme About the Author Jenny Bryan Jenny is a recovering biostatistician who takes special delight in eliminating the small agonies of data analysis. She’s part of Hadley’s team, working on R packages and integrating them into fluid workflows. She’s been working in R/S for over 20 years, serves in the leadership of rOpenSci and Forwards, and is an Ordinary Member of the R Foundation. Jenny is an Associate Professor of Statistics (on leave) at the University of British Columbia, where she created the course STAT 545
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
If you haven't heard of tidy eval, I'm about to explain it, but a lot of people have heard about it and have a certain amount of discomfort and anxiety as evidenced by, for example, these two provocatively titled community posts.
And so part of what I want to answer is: why are we putting you through this? What is tidy evaluation, and is it going to be bad for your health, something you might care about? And I think part of the reason that so many people know about it but don't know it, or know about it and fear that they're missing something or falling behind, is this phenomenon that I liken to when people become vegan or start doing CrossFit: they tend to tell you about it.
So when people start learning tidy eval, they tell you about it, and they blog about it and they tweet about it, and so it creates in some sense perhaps more general awareness than is actually necessary.
Things to learn before tidy eval
So I'd like to kind of get out ahead of some of this, but first I'd like to undermine my whole talk with a list of things that might give you more bang-bang for your buck than learning tidy eval.
And this was kind of crowdsourced within the tidyverse team to describe a few things that you probably want to learn first before you tackle tidy eval. So being pretty comfortable writing functions in general would be a really sound thing to start with. I think a lot of people work in a domain like maps or time series or text analysis where you have some very effective tooling in your area and getting really up to speed on what the bleeding edge is there could be extremely fruitful.
You know that I'm very into data wrangling, and so I think being comfortable with all of the different data receptacles in R is really rewarding, especially lists: being comfortable putting your lists inside your data frame, and the nesting/unnesting paradigm, which is often how this happens. It can be uncomfortable in there, so you often want to get back out of that situation, which is the unnesting.
I'm a big fan of the purrr package. It's described as functional programming, which is accurate; I also think a completely fair way to think of it is as a way to iterate really well in R with one common pattern. And then, to elevate one specific set of functions, I decided to give the award to the scoped dplyr verbs, which are probably underutilized: they let you go in and do a mutate or something like that very selectively, at certain positions or for certain names.
What is tidy eval?
But if you've decided that you may in fact actually need tidy eval, I'd like to demystify a bit what it is. For some people here this is more than you needed to know; or you already know this and you're going to love Lionel's talk, which I'm kind of setting up a bit.
So what is tidy eval? I would call it a toolkit for metaprogramming: for code that writes code, or code that mutates code. And why is it called tidy eval? Is there something about it that is intrinsically tidy, instead of us just putting tidy on everything?
And I think there is, though explaining it would be at a level that's not what this talk is about. But I think quosures are an example: there is actually something intrinsically tidy about the model of changing how evaluation works in the tidy eval world, so it really deserves that name.
But it is also true that the tidyverse itself does a lot of metaprogramming inside, and therefore it requires a certain toolkit to exist, and tidy eval powers that. So it is also the putting tidy on everything; it's both.
So if you don't know what I mean by the tidyverse doing a lot of metaprogramming, let's just look at a little bit of canonical tidyverse code. Here I'm working with the Star Wars dataset that comes with dplyr, doing a little bit of data wrangling in that first example, and then I switch to a different dataset and show you what ggplot2 does.
And I've been talking a lot to Greg Wilson who joined RStudio this year, and he's really coming from Python, and one of the things we end up talking about a lot is like what's up with all these unquoted variable names and everything necessary to make that happen and everything necessary to preserve that in your own functions?
So the lack of quotes around homeworld, height, hwy, and class, that is brought to you by all the use of metaprogramming under the hood.
Nonstandard evaluation explained
And here are several equalities that are technically not true, but they're kind of true for the duration of my talk, and they're very useful. When I talk about metaprogramming here, which is a really general term, I'm basically talking about a more R-specific term (I associate it with R; maybe it's used in other languages) called nonstandard evaluation, NSE. It's a term you'll see a lot, especially once you're walking around in the tidy eval world. And I'm going to go even further and kind of equate that with using unquoted variable names, which is a further abuse of all this, but that's mostly what I'm talking about here.
So what is nonstandard evaluation? Every time we evaluate an expression, an environment, or a chain of environments, is going to be searched for name-value bindings: you've got the name of something, and you need to look up the value currently associated with that name. Nonstandard evaluation means you've got this expression, and before you evaluate it, you mess with it.
You might actually edit the expression, and/or, right before you evaluate it, you manipulate the chain of environments so that it's not the default one that everyone would expect; often you're doing both. And this is what gives you unquoted variable names and working inside the data frame. Both of those things are going on in basically every dplyr function that you use: it puts the data frame first in the search path, the data mask, and then you have these unquoted variable names.
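The two ingredients just described can be sketched in a few lines of base R (a minimal illustration of my own, not code from the talk): quote() captures an expression without evaluating it, the captured expression can be edited, and eval() can be pointed at a non-default place to look up names, such as a data frame acting as a data mask.

```r
# Ingredient 1: capture an expression instead of evaluating it.
e <- quote(height * 2)

# Ingredient 2: evaluate it somewhere non-default -- here the data frame is
# searched first for name-value bindings, i.e. it acts as a data mask.
df <- data.frame(height = c(150, 160, 170))
eval(e, envir = df)
#> [1] 300 320 340

# You can also edit the expression itself before evaluating it.
e[[3]] <- 3          # turn `height * 2` into `height * 3`
eval(e, envir = df)
#> [1] 450 480 510
```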
So any function that accepts an unquoted variable name in a data frame is using NSE under the hood. That means if you want to write functions, you have to confront that somehow, and I'm going to show you that there are different levels of confronting it and how easy or hard that can feel.
This top sentence is referring to a phrase I think was about Ruby, which said that if you make everything equally easy, you're also making everything equally hard. Certain things in Ruby were made very easy, and the bargain you strike is that you make other things hard.
if you make everything equally easy, you're also making everything equally hard.
So the tidyverse struck this bargain in a really, really big way. If you make direct specification of variable names really easy, so people can just type them without quotes, you make indirect specification harder for all of us. And it turns out indirect specification still comes up a lot.
So the two main examples here: what if you have a variable name stored as an object, and you want to drop that in there to drive some dplyr function? Or what if a user has passed a variable name as a function argument; how do you pass it through? These are two pretty mundane examples of where this design decision in the tidyverse makes things harder.
NSE in base R
So is this a problem created by the tidyverse for itself? The answer is no. Before I worked for RStudio, I was a professor and I taught a course on applied statistics, STAT 545, and the example I'm about to show you really comes from teaching that course.
So here are four functions in base R that all use nonstandard evaluation, and before the tidyverse existed, I was a heavy user of these functions because I liked how fluid they were; I liked not retyping my data frame name all the time. lm(), subset(), transform() and with() are all great examples where this ease of interactive use was prioritized.
The problem is, if you read the help for subset(), transform() or with(), all of them contain almost exactly the same wording; this is pulled from one of them. It makes it very clear that this is a convenience function you're really only supposed to be using interactively, when you're sitting there ready to see the errors, and that using it in a programming context may have unanticipated consequences. So you've been warned not to program around these functions.
And I'm going to make this example actually more about lm(). I would teach these students how to do a little bit of data wrangling and data visualization, and so they might write a function that fits a quadratic model; I'm going to use the gapminder data in a moment. Let's say we want to explain life expectancy as a function of time, of the year. This is a super simple model, and the lm() code that might fit it.
And then I would teach them how to do this for every country in the gapminder dataset. They know that they need to wrap lm() in a function and then drop that into some sort of iterative machine, and before the tidyverse existed, we were heavy users and teachers of the apply functions. So here I am dropping that into by(), and indeed it works: it goes and fits this polynomial model to every country in the gapminder dataset.
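A minimal sketch of that workflow, using a toy data frame in place of the gapminder data (which has country, year, and lifeExp columns):

```r
# Fit a quadratic-in-year model for life expectancy; lifeExp is hard-wired.
fit_quadratic <- function(df) {
  lm(lifeExp ~ I(year - 1952) + I((year - 1952)^2), data = df)
}

# A toy stand-in for gapminder: two countries, five years each.
toy <- data.frame(
  country = rep(c("A", "B"), each = 5),
  year    = rep(seq(1952, 1992, by = 10), times = 2),
  lifeExp = c(50, 52, 55, 57, 60, 40, 43, 45, 48, 50)
)

# by() splits the rows by country and fits one model per group.
fits <- by(toy, toy$country, fit_quadratic)
length(fits)
#> [1] 2
```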
And this is where I outsmarted myself. The students were like, okay, this is great, I have this function written. What if I want to do this now for GDP per capita instead of life expectancy? What if I want to let y and x be general here? You might hope, as many do, that that code right there will work. And it won't, okay?
Here are examples of it not working. If you just put those unquoted variable names in in the usual way and cross your fingers, hoping they get passed through, you get an error. If you think, what if I add quotes? You get a different error. It was usually around this time in the course that I'd say, oh gosh, we need to move on to visualization, and I just never really answered this question.
Because this is what the answer would look like. Here is an example of me improving that little function so that you can use it in the usual way. And I really set myself up for failure in some of the ways that I taught things: I would show my students these cool functions with nonstandard evaluation, and then I would teach them how to write functions, and then I'd say, but please do not combine the two, because I couldn't teach them this toolkit.
But if you put yourself through that pain, you get a function that's actually quite pleasant to use, and it does work when dropped into by(), for example, in this case. You can change the variables, and you can still use expressions like year - 1952. So that's pretty cool.
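One base-R way to get that flexibility, as a hedged sketch (the function and argument names are mine, not the slide's code): substitute() captures what the caller typed, and bquote() splices the captured expressions into a model formula.

```r
fit_quadratic <- function(df, y, x) {
  y <- substitute(y)   # capture the unevaluated y expression
  x <- substitute(x)   # capture the unevaluated x expression
  # Splice the captured expressions into a formula, then evaluate it.
  f <- eval(bquote(.(y) ~ poly(.(x), 2)))
  lm(f, data = df)
}

d <- data.frame(year = 1952:1961, lifeExp = 50:59, gdpPercap = (1:10)^2)

# The variables are now general, and expressions still work:
m1 <- fit_quadratic(d, lifeExp, year)
m2 <- fit_quadratic(d, gdpPercap, year - 1952)
```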
So in base R, programming around these NSE functions is possible for sure, but it's always been either explicitly discouraged, with those warning messages, or implicitly discouraged, because it's not made to be very easy.
The tidyverse and the messy eval era
So going back to tidyverse history, I've been told I'm allowed to say there was a messy eval era. I hear there was some thought of maybe not letting people use unquoted variable names in the tidyverse, and that was discarded. The first thing that really rolled out, and they're still there, are functions like aes_string() instead of aes(), for when you have the variable name as a character, or select_() versus regular select(): a standard evaluation version alongside the nonstandard evaluation version. It just turned out that was pretty unpredictable for users, and I think not much fun to maintain; it's not very sustainable. Are you going to create two versions of every function in the tidyverse? That doesn't feel so good.
So the good news is that the tidyverse did ultimately decide to prioritize usability: putting the data mask on, allowing unquoted variable names, allowing you to use unquoted variables in expressions. The bad news is that this made programming around it harder for everyone. And the good news, the inevitable news, is that there then has to be some sort of toolkit, used internally to create consistency, and available to others so they can extend things. That toolkit is provided by rlang.
So this is a package which you may or may not have heard of. It's kind of like vctrs in the sense that ideally most people will not have heard of rlang; it's more of a developer-facing package. That's where the tidy eval toolkit lives, and other things live there as well. Most people should not need to use rlang with their bare hands.
How much tidy eval do you actually need?
So I'm going to close with a rapid-fire set of scenarios. So I'm going to sort of fake what a user might want to do. And then I'm going to tell you how much tidy eval you would need to know.
So if you want to use existing tidyverse functions to analyze data, you do not need to know tidy eval. Congratulations. Continue on in your sort of blissful existence.
But if you see that you have a lot of code that looks incredibly duplicated, where there's common logic and all that's changing is the variable you're doing something to, you probably want to write a function to DRY this code out, to reduce the duplication. But what if the duplicated code is using dplyr? It is still possible that you do not actually need to know tidy eval; "pass the dots" is the phrase I'm going to use, and Lionel will show that in much greater detail. And you definitely don't need rlang.
I'll show one example of this, though it's really more suited to the next talk. If I wanted to take a data frame in, group it by something unspecified, and do a summarize, you could put dots in the signature of your convenience function and pass them straight through. You're keeping up the NSE chain; you are not the weak link in the chain, but you really didn't have to do anything explicitly. And indeed that works: you can take the Star Wars data and group by homeworld or species and it still works.
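A sketch of what that looks like (the function name is mine, not from the slide): the dots in the signature forward whatever grouping variables the caller typed straight into group_by().

```r
library(dplyr)

# Summarize mean height, grouped by whatever the caller passes via `...`.
grouped_mean_height <- function(df, ...) {
  df %>%
    group_by(...) %>%
    summarize(mean_height = mean(height, na.rm = TRUE))
}

# Unquoted grouping variables pass straight through; no tidy eval needed.
grouped_mean_height(starwars, homeworld)
grouped_mean_height(starwars, species, gender)
```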
Again, you're still in this world where you want to write simple functions to reduce duplication. Sometimes your life is not simple enough for pass-the-dots to get the job done, in which case the paradigm you'll need is a function called enquo(): you capture that unquoted thing and quote it, and then at the place you need to mention it, you unquote it with bang-bang (!!), hence the really bad dad joke at the beginning.
So dplyr, ggplot2 and tidyr, for example, all expose this syntax to you, and that's why you don't need to know rlang: it's really anticipated that people will need to use these.
So here's an example where I need to capture a grouping variable and a summary variable with this enquo() bang-bang pattern. The number of arrows on the slide tips you off to why the dot-dot-dot trick wouldn't have worked. And again, this code also works, and I'm able to change, for example, the variable that I'm summarizing over.
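A sketch of that pattern (the names here are mine): enquo() captures what the user typed for each argument, and !! unquotes it at the point of use.

```r
library(dplyr)

grouped_mean <- function(df, group_var, summary_var) {
  group_var   <- enquo(group_var)    # capture the grouping variable
  summary_var <- enquo(summary_var)  # capture the variable to summarize
  df %>%
    group_by(!!group_var) %>%
    summarize(mean = mean(!!summary_var, na.rm = TRUE))
}

# The summary variable can now be changed without touching the function:
grouped_mean(starwars, species, height)
grouped_mean(starwars, species, mass)
```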
And what if you want to make names out of user input? You have to use, I asked Lionel how to say this symbol and he said we call it colon-equals, so you need the := thing. And again, dplyr, ggplot2 and tidyr will know about this; they make it available, and you do not need rlang.
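A sketch of using := to name the output column after the user's input (the names are mine; quo_name() comes from rlang but is re-exported by dplyr, so rlang is not attached directly):

```r
library(dplyr)

grouped_mean <- function(df, group_var, summary_var) {
  group_var   <- enquo(group_var)
  summary_var <- enquo(summary_var)
  # Build the output column name from what the user typed.
  nm <- paste0("mean_", quo_name(summary_var))
  df %>%
    group_by(!!group_var) %>%
    summarize(!!nm := mean(!!summary_var, na.rm = TRUE))
}

# The summary column is now named after the input, e.g. mean_height:
grouped_mean(starwars, species, height)
```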
And if you're someone who wants to compute on expressions and manipulate environments, I'm afraid you do really need to understand R's evaluation model and how the tidy eval toolkit manipulates it. And you will be using rlang to do that, if you're trying to produce something with this tidy eval feel.
Resources
I want to give a shout out to a few resources that were helpful to me as background research. I hope Thomas doesn't mind me pointing this out, but Thomas Lumley wrote something, I'm not sure what he would call it, it's clearly not a paper, this document called Standard Nonstandard Evaluation, back in 2003. It's a nice summary of where things were in base R at that time, which was already feeling a bit of pain from people doing NSE in different ways. And so he does a nice survey of the ways.
And I imagine this document was meant to generate a conversation and move things towards doing it in a more standard way; sort of the base R conversation that ultimately is also happening in the tidyverse. Then Thomas Mailund has a really beautiful pair of blog posts explaining scoping rules and NSE. And Hiroaki Yutani has a beautiful talk on tidy eval; in particular, he has a great example that builds to the end and really explains why tidy eval itself is necessary, which gets at that quality I was hinting at earlier: there is something about tidy eval that's intrinsically tidy.
Internal resources come from the people who bring you tidy eval; these are resources they've written. The second edition of Advanced R, which is under development by Hadley, has a whole part on metaprogramming, with several chapters that are really useful. Lionel Henry is working on a bookdown site, tidyeval.tidyverse.org, still a work in progress, to centralize some of this information and provide some recipes. And there's a great RStudio community thread where people submit wild-caught examples, and then Hadley and Lionel tell them they don't actually need to use tidy eval. It's a nice little miniature code-makeover thread to look over.

