What's New in dplyr (0.7.0) | RStudio Webinar

Transcript

This transcript was generated automatically and may contain errors.

Well hi everyone and thanks for joining me on this webinar. I'm going to do kind of a quick overview of some of the new stuff in dplyr 0.7. We'll start with kind of a bunch of smaller stuff, none of it's kind of particularly exciting by itself, but in aggregate it adds up to just relieving a whole bunch of small pain points in dplyr. Then I'll talk a little bit about databases and then finish off by talking about TidyEval, which is a new system for programming in dplyr. So there should be plenty of time for questions. I'll stop after each section and take any that look particularly interesting to the whole crowd.

Okay, so the first thing I wanted to mention, if you don't already know about it, dplyr has a website now, dplyr.tidyverse.org. That's a great way if you want to see the examples and the code, see the vignettes, and generally get an overview of the package. So that's a great place to start if there's anything you want to know more about dplyr after this webinar. I also wanted to mention briefly the tidyverse package. If you haven't heard of the tidyverse, the tidyverse package just basically makes it easy to get a bunch of packages that work well together. So when you install it, it installs all of the packages in the tidyverse, and then when you load it, it loads the most important packages.

So here I'm showing you the development version, which in conjunction with the development version of RStudio gives you this pretty colorful display telling you which packages are loaded. So you can see here I'm using the development version of purrr. It gives you a little bit of session info, so if your code works on your computer and not someone else's, you can often track down the difference to one of these issues. And then finally it tells you what are the functions that the tidyverse is masking from base R or other packages.

So if you don't use schemas, this does not affect you at all, but many big corporate databases use schemas extensively, and this should make your life much, much easier.

TidyEval: programming with dplyr

Okay, so let's finish off by talking about tidy eval, which is kind of the biggest and most complicated new feature in dplyr. And before I sort of talk about exactly what this is, I want to like illustrate the problem that we're trying to solve. And so here I've got a little table and I'm doing some data analysis with it. And basically each of these three clumps of code, I am doing the same summary, but with a different group. And so to do this with dplyr, you might be copying and pasting your code. And that's great. I think a rule of thumb is it's fine to kind of copy and paste up to like three times. But as soon as you go beyond that, it's worth the investment to write a function.

And that's because often, you know, the thing that you're doing changes, or you discover a bug. And if you've copied and pasted that code everywhere, tracking down every single instance of that bug can be really painful. So it's a really good idea to turn repeated code that you've created by copying and pasting into a function. But this is hard because dplyr lacks a property called referential transparency.
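The three copied-and-pasted clumps described above might look something like the following sketch. The data frame df, its columns, and the summary itself are made up for illustration; only the grouping names g1, g2 and g3 come from the talk.

```r
library(dplyr)

# Hypothetical data: three grouping columns and one numeric column.
df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  g3 = c(1, 1, 1, 2, 2),
  a  = c(1, 2, 3, 4, 5)
)

# The same summary three times; only the grouping column changes.
by_g1 <- df %>% group_by(g1) %>% summarise(a = mean(a))
by_g2 <- df %>% group_by(g2) %>% summarise(a = mean(a))
by_g3 <- df %>% group_by(g3) %>% summarise(a = mean(a))
```

If the summary needs to change, for example from mean() to median(), all three copies have to be edited by hand, which is the maintenance cost being described.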

So let me try and just describe referential transparency quickly. I've created two simple functions, one that just multiplies its input by 10, and one that adds 10. And normally, when you're writing functions in R, you can extract out a subexpression, or part of that function, and assign it to a variable. And that doesn't matter. You can do that as much as you like, and you always get the same result. That's what's called referential transparency. You can create new variables, and it doesn't change how the expression is computed. But this doesn't work with dplyr because you can't pull out that expression. You can't pull out this repeated code, this g1, this g2, and this g3. You can't assign that to a variable because there is no object called that. Because inside the dplyr expressions, to make your data analysis, your data manipulation as fast and fluid as possible, this variable refers to a variable inside the data.
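A minimal sketch of that contrast, assuming function names (times_ten, add_ten) and a small data frame that are not from the talk, and assuming no object called g1 already exists in the workspace:

```r
library(dplyr)

# Hypothetical names: the talk only says one function multiplies by 10
# and the other adds 10.
times_ten <- function(x) x * 10
add_ten <- function(x) x + 10

# Ordinary R code: pulling a subexpression into a variable changes nothing.
direct <- add_ten(times_ten(3))
inner <- times_ten(3)
via_variable <- add_ten(inner)
same_result <- identical(direct, via_variable)

# Inside dplyr, g1 is looked up as a column of df, so this works...
df <- tibble(g1 = c(1, 1, 2), a = c(1, 2, 3))
grouped <- df %>% group_by(g1) %>% summarise(a = mean(a))

# ...but g1 cannot be pulled out into a variable, because outside the
# data frame there is no object called g1.
assignment <- tryCatch(
  {
    group_var <- g1
    "assigned"
  },
  error = function(e) "error: no object called g1"
)
```

Here same_result is TRUE, while assignment ends up holding the error string, because the bare name g1 only has a meaning inside the dplyr call.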

So this is one of the features of R, sometimes called non-standard evaluation, where we're taking this expression and we're evaluating it in a special way. And this is really great because it allows you to save so much typing. You're not constantly having to type the name of the data frame again and again and again in your expressions. But the downside of this implicitness is that it's hard to program. And so in this version of dplyr, we have a new system called tidy eval, which is basically a new framework for programming with dplyr, dealing with this type of problem in general.
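To make the typing saving concrete, here is a small comparison between base R subsetting and a dplyr verb. The data frame and the filter condition are invented for illustration.

```r
library(dplyr)

# Hypothetical data for illustration.
df <- tibble(g1 = c(1, 1, 2, 2, 2), a = c(1, 2, 3, 4, 5))

# Base R: the data frame name is repeated for every column reference.
base_result <- df[df$g1 == 2 & df$a > 3, ]

# dplyr: g1 and a are evaluated inside df, so df is named only once.
dplyr_result <- filter(df, g1 == 2, a > 3)
```

Both calls keep the same rows. The dplyr version is shorter precisely because the column names are evaluated in the context of the data, which is also what makes it harder to wrap in a function.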

And so if you encounter this problem, you might say, well, I can't assign the bare variable name. Maybe I could try putting that inside a string. And that still doesn't work because it's looking for a column called group_var. It's not looking inside that variable to find the g2. So to solve this problem, we have a new data structure called a quosure, which you can create with the quo function. So this basically kind of captures the expression. So this is what dplyr is all about. It's about capturing what you want and evaluating it at a different time.

So when you do this in dplyr, it's about capturing this expression and evaluating it in a different context, in a context where you don't have to say which data frame every variable comes from. Or in the case of when you're talking to a database, the expression actually doesn't even get evaluated by R. It gets evaluated by the database much, much later on. So now we can use this quo function to kind of capture this variable. So we're saying we want to use this variable g1 later on.
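As a small sketch of the capturing step: quo() stores the expression g1 without evaluating it, so it does not matter that no object called g1 exists yet. The variable name group_var follows the talk.

```r
library(dplyr)

# quo() captures the expression g1 instead of evaluating it,
# so this succeeds even though there is no object called g1.
group_var <- quo(g1)

# Printing the quosure shows the captured expression and the
# environment it was captured in.
print(group_var)
```

Nothing has been computed at this point. The quosure is only a record of what to evaluate later, once dplyr has a data frame (or a database table) to evaluate it against.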

Now, unfortunately, that still doesn't work with group_var. And what we need is some way to kind of unquote this variable. We just say, don't take this variable, don't take this input literally, look inside this variable and see what it looks like. And so we now have this unquoting tool, bang-bang (!!). And so what this tells dplyr to do is, don't look at group_var literally, look at what's inside group_var and evaluate that. And so this gives us the ability to program with dplyr. We can now write functions that work like dplyr does. We can even take this further. You can not only do the grouping variables, but you can use this in every single function in dplyr, whether it's summarize or mutate or mean or whatever.
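Putting the two pieces together, a sketch of unquoting might look like this. The data frame and the helper summarise_by() are hypothetical. The helper uses enquo(), which is not named in this part of the talk; it is the function that captures an argument the way quo() captures an expression typed directly.

```r
library(dplyr)

# Hypothetical data for illustration.
df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a  = c(1, 2, 3, 4, 5)
)

# Capture the column with quo(), then unquote it with !! so that
# group_by() sees g1 rather than the literal name group_var.
group_var <- quo(g1)
by_g1 <- df %>%
  group_by(!!group_var) %>%
  summarise(a = mean(a))

# Unquoting also works inside other verbs, here within summarise().
summary_var <- quo(a)
overall <- df %>% summarise(mean_a = mean(!!summary_var))

# Hypothetical helper: enquo() captures what the caller typed, so the
# function accepts a bare column name just like the dplyr verbs do.
summarise_by <- function(data, group_var) {
  group_var <- enquo(group_var)
  data %>%
    group_by(!!group_var) %>%
    summarise(a = mean(a))
}

by_g2 <- summarise_by(df, g2)
```

With a helper like this, the repeated group-and-summarise code collapses into one function call per grouping column.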

So this, I am like a hundred percent confident that the theory behind this is correct and robust and works like a hundred percent of the time, not just 95% of the time. We are still working to like try and explain it in a way that you can understand. This is basically, it's sort of a similar level of complexity to learning how to write functions. So if my explanation right now just left you reeling and confused, do not blame yourself. That is my fault that I do not know a good way of explaining that yet, but we are working on that. You can always see kind of our latest efforts on the dplyr website. There was an article about programming with dplyr that explains the problem in much more depth. And we will keep rewriting that until we are confident that we have figured out a good way to explain what's going on here in a way that you can understand and then deploy in your own code.