
Hadley Wickham | State of the Tidyverse 2020 | RStudio (2020)
State of the Tidyverse 2020
Transcript
This transcript was generated automatically and may contain errors.
This is Hadley Wickham; take it away. Thanks, Mara. So I wanted to talk, not about visualization, unfortunately, but a little bit about the tidyverse generally.
So I wanted to give you a little bit of an update: where things are going in the tidyverse, where we are today. I'll talk a little bit about tidy evaluation, which I know has caused some unhappiness in various places in the R community, and then talk a little bit about what we're doing to address some of those problems in the future.
So I made a few plots to show how things are going. The first one is just the cumulative number of package downloads, with 2018 on the left and 2019 on the right. This data is not super trustworthy; I would love to believe that three quarters of a million people decided to use the tidyverse basically on my birthday.
But I think the main takeaway here is that the numbers are continuing to grow, really just continues to amaze me how many people find these tools useful. The other thing that I think is really fascinating is that 2019 was the first year that more people downloaded dplyr than ggplot2. I don't really know what that means, but I think that's kind of interesting.
The tidyverse team and community
So who works on these packages? My team at RStudio encompasses a bunch of people: Jim Hester, Thomas Lin Pedersen, Gábor Csárdi, Romain François, Jenny Bryan, Max Kuhn, Lionel Henry, me, Mara Averick, and Davis Vaughan. And very, very soon, on Monday, we'll be joined by Julia Silge.
But as well as us, there's also a huge number of people in the community who contribute to the Tidyverse. And last year, in fact, we had 236 unique contributors, which is just so amazing and feels so wonderful to me.
And so I also made this chart of the number of unique contributors, comparing 2018 to last year. We ended up with quite a few more unique contributors last year because of two main events: the tidyverse developer days, which we ran for the first time last year and will keep doing a lot in the future. These happen after rstudio::conf and after useR!, and they're basically a chance to get a bunch of people interested in the tidyverse, or contributing to the tidyverse, into a room with helpers, and we do a bunch of PRs to fix a bunch of issues.
You might also notice that there was a big jump in May of 2018. I did a little investigation to figure out what this was: it was actually me going into a PR closing-and-merging frenzy in ggplot2. There were a lot of PRs that had just been sitting there, so we got about 20 new contributors in one day because I merged all those PRs.
Highlights of the year
A few highlights of this year. I'm really excited that we now have a paper that you can cite if you want to cite the entire tidyverse, which you can get by running citation("tidyverse"). The idea of that paper is that rather than having to cite every package you might use individually, there's just one place you can cite. It's a published article in the Journal of Open Source Software. I have to admit that I'm slightly addicted to checking the citation count. Normally when you check the citation count of a paper, it changes maybe once every six months or something, but we've already had 17 citations in the two months it's been published, which is really, really amazing.
Also, we introduced embracing, the double curly braces `{{ }}`. I'll talk about that a little bit later, but it's part of our efforts to make tidy evaluation easier and more approachable, so you don't have to learn about all the theory in order to use it.
The vctrs package, which is a bit of a mystifying package in some ways, because if vctrs is successful in its goals, you will never know that it exists. The goal of vctrs really is just to make things work more consistently behind the scenes, so that when you use a function in one package, your predictions about functions in other packages are more likely to be correct. That's the main impact for data scientists. If you're a developer, it also makes it much, much easier to create new types of S3 vectors. So if you have a new type of thing, something that makes sense as a column in a data frame, vctrs provides a really nice set of tools for this.
And if you want to learn more about this, you can check out, when it comes out, Jesse Sadler's really nice talk about how you can create new types of vectors using this package.
Another really cool package, which I think reflects some development processes and coding practices that we'll see more of in the future, is the vroom package. vroom is a really, really fast way of getting data off disk, from CSV files or TSV files or whatever, into R. The first thing is that it makes heavy use of C++11 behind the scenes. This is our first major use of C++11 in the tidyverse, because we've been a little bit worried about whether enough people can use C++11, even though it is now nine years after 2011. But the main advantage there is that it gives us really easy access to multi-threaded computation, so vroom will use many if not all of the cores available in your computer to speed up data ingest.
The other thing that's really interesting about vroom is that it makes extensive use of a new technology in base R called ALTREP, which allows us to lazily load the data in the file, so that if you're working with a very large data set, vroom only reads in the data that you actually touch. So if you read in a 10 gigabyte file but you only look at one column or a small set of rows, it's only going to read that data into memory, which of course makes things much faster and saves memory.
And then last but certainly not least, another big thing that's been happening this year is that Max and Davis have been working on tidymodels, a collection of packages to bring modeling into the tidyverse. I think that's really coming up to speed this year with Julia joining that team. People are already using tidymodels effectively, but I think in the next year that's really going to become a killer set of tools.
Looking ahead to 2020
A few things to look forward to that I'm kind of excited about for 2020. I don't want to make any predictions that turn out to be famously wrong, so I will not be declaring that 2020 is the year of anything, unless I want to kill that thing.
But a few things that I'm excited about: we've been putting a bunch of work into dplyr 1.0. This is one of the first times we've done sort of a full court press, so there have been four or five people from my team working on this in various ways. It's a really new implementation, which makes it much, much easier to add new features, and it's going to provide a really, really solid base for future extensions.
We're also working on adding more problem-oriented documentation. I think we've always had pretty good, or at least decent, tutorials and reference documentation; now we're working more on documentation that helps you solve specific problems.
You're also going to see less purrr, I think. If you're teaching an intro data science course, you will teach purrr much, much later, as we provide tools that eliminate the use of purrr for some of the most important uses. So vroom, for example, allows you to slurp up an entire directory of CSV files in a single call. tidyr has new functions for rectangling complicated or deeply nested JSON data. And dplyr is going to be bringing back the rowwise() verb. So there should be many fewer times that you need to use purrr when you're doing data science. I still really, really strongly believe in purrr and functional programming as a programming toolkit, but you won't be forced to learn it just to do data science.
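As a small sketch of the multi-file case (the directory and file names here are invented for illustration, using a temp directory so the example is self-contained):

```r
library(vroom)

# Stand-in for a directory full of monthly CSV files with the same columns;
# the file names here are invented for illustration
dir <- file.path(tempdir(), "monthly-csvs")
dir.create(dir, showWarnings = FALSE)
vroom_write(data.frame(x = 1:2), file.path(dir, "jan.csv"), delim = ",")
vroom_write(data.frame(x = 3:4), file.path(dir, "feb.csv"), delim = ",")

# A single vroom() call reads and row-binds every file, so there's no
# need for an explicit loop or purrr::map_dfr(); id = "source" records
# which file each row came from
files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
all_rows <- vroom(files, id = "source")
```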
And then I'm also really excited about googlesheets4. We've got some really great ideas, I think, for making that even more flexible, and Jenny is working hard on it. And I am personally really hopeful that we're going to be able to use it, so Google Sheets will be our primary data source for rstudio::conf next year, which will avoid a bunch of problems we had this year with keeping data synchronized in various places.
Tidy evaluation: mistakes and lessons
So what I did really want to talk about, though, is tidy evaluation, because I think we did make some mistakes. Here's a provocative question that someone asked a little while ago: will tidy eval kill the tidyverse? I'm pretty confident the answer is no.
But if you haven't seen tidy evaluation before: one of the challenges of programming with functions from dplyr and ggplot2, for example, is introducing indirection. In dplyr and ggplot2, normally you provide the name of the variable directly. But what do you do if you want the user to supply the name of the variable in a function? Well, previously we had this technique where you wrote `!!enquo()`, which required that you learn about the theory of quasiquotation and enquo(), and made you think about these complicated things called quosures. Now we have a system called embracing, `{{ }}`, which, along with a few other techniques, should allow you to solve the vast majority of problems, and which means that you don't have to learn the theory if you don't want to.
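As a concrete sketch of the two styles (var_mean_old() and var_mean() are made-up names for illustration, not real tidyverse functions):

```r
library(dplyr)

# Old pattern: quote the user's argument with enquo(), then unquote with !!
var_mean_old <- function(df, var) {
  var <- enquo(var)
  summarise(df, mean = mean(!!var))
}

# New pattern: "embrace" the argument with {{ }}; no theory required
var_mean <- function(df, var) {
  summarise(df, mean = mean({{ var }}))
}

var_mean(mtcars, mpg)  # a one-row data frame containing mean(mtcars$mpg)
```

Both functions do the same thing; embracing just collapses the quote-then-unquote dance into a single visual marker.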
So we've also been working on lots of articles to explain this; I'll have those up again at the end if you want to grab them. But I wanted to talk about the mistakes we made. And I think the first mistake we made is that this problem as a whole was much, much harder than we thought. There have been at least five points where I said, yes, we finally understand how all of this should work, and every time I was wrong, except for today. I'm reasonably confident now, more confident than I've ever been before, basically because we've been doing a much better job, I think, of creating this problem-oriented documentation, and we can see from it that we can solve the problems that most people are having.
I think the next mistake we made is this: the theory really is beautiful and elegant. It is, I think, pretty much unique amongst programming languages, because very few languages have the combination of first-class environments and computing on the language. I think it's really cool, and Lionel, who has worked on this a lot with me, thinks it's really cool. But most other people do not think it's really cool. It's totally still worth learning the theory if you want to understand more about how these things work, but generally, if you're a data scientist just trying to solve some problem, the cost-benefit of spending the time to learn the theory just wasn't there.
And then I think we also ended up introducing too much. We wanted to be precise, and so we ended up creating a lot of vocabulary that just ended up overwhelming people.
Function life cycles
So I think there are two takeaway messages from this. One is that just being aware of the problem is important: we are now more aware that there are things we get excited about that are really exciting to us, but that we know the rest of the R community is not going to be very excited about.
So we're going to try and make it clearer what the status is of the various things we're working on. Is this something that we think is really cool, but that's pretty experimental and maybe going to change radically? Or is this something that we now have grave doubts about, and that may be on its way out in the future? And we want to think about how we can get feedback on ideas that need more feedback, without forcing everyone to have to think about those issues.
And one of the ways we're doing that is trying to be clearer about where functions live in the life cycle. So what is the life cycle of a function? Well, the place where most functions live is the stable stage. We're pretty confident that a stable function does what it says on the can, it's useful, and we don't think it's going to change majorly.
With some functions, when we introduce them, we think: seems like a good idea. When we first started introducing the pipe, I thought, well, this seems like a really cool idea, but no one's going to understand how it works. And it turns out that doesn't actually matter. The pipe is useful because it does something: it allows you to write code in a really clear way. It doesn't matter if you don't understand how it works; you can still use it.
Sometimes, though, we think we have a really good idea, and then later on we think, hmm, maybe that wasn't such a good idea. So we're starting to label those functions with questioning: we're not sure if those functions are a good idea or not. Sometimes we'll think about it more and end up back in stable. Other times we'll finally figure out a better way to implement the function, or a better way to solve the problem, and those functions become superseded.
So just to give you an example of this in dplyr 1.0: there were two functions, rowwise() and do(), that a lot of people really liked and that had been in questioning for a while. We figured out that rowwise() can actually go back to being stable: I figured out why I didn't like it and fixed that problem, so it's stable again. For do(), we figured out a better way of solving the problem, and so it becomes superseded. Superseded functions, which I'll talk about a little more shortly, are functions we don't think are the best solution anymore, but they're not going away. We've got a better approach, but you don't need to worry that we're going to yank the rug out from under you if you're using these functions.
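A small sketch of what rowwise() does in dplyr 1.0 (the data here is made up): it turns each row into its own group, so ordinary summary functions compute one row at a time, covering a common job that do() used to handle.

```r
library(dplyr)

df <- tibble(a = c(1, 2), b = c(3, 4), c = c(5, 6))

# rowwise() makes each row its own group, so mean(c(a, b, c)) is
# computed per row instead of across whole columns
res <- df %>%
  rowwise() %>%
  mutate(row_mean = mean(c(a, b, c))) %>%
  ungroup()

res$row_mean  # 3 4
```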
The only time a function goes away is when we want to be clear that it's going away: we'll tell you it's deprecated, and then, when it eventually gets removed altogether, it becomes defunct.
So just to contrast deprecated and superseded a little more clearly: a deprecated function is clearly on its way out in the near future, which probably means in the next year or two. You'll be warned when you use it; whenever you use a deprecated function, you will get a warning. But you only get that warning once per session, and I think maybe that's not warning you quite enough. We don't want to warn you every time you use it, because if we did, it would be basically just as annoying as taking the function away in the first place. So we're still working on trying to get that balance right: how do we gently nudge you that maybe you should be moving away from this function, without getting in your face all the time?
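This warning behavior comes from the lifecycle package; here's a sketch of how a package author might use it (old_fun() and new_fun() are hypothetical names, not real tidyverse functions):

```r
library(lifecycle)

# Hypothetical deprecated function: old_fun() and new_fun() are
# invented names used only to demonstrate deprecate_warn()
old_fun <- function(x) {
  deprecate_warn("1.0.0", "old_fun()", "new_fun()")
  x * 2
}

old_fun(1)  # warns that old_fun() was deprecated in favour of new_fun()
old_fun(1)  # quiet: by default the warning fires only once per session
```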
Two examples of that: the tbl_df() function in dplyr, which is one way you used to create tibbles before we called them tibbles, is on its way out. Another is dplyr's do(). Generally, unless it's a very niche function, a function will not go from stable to deprecated immediately; normally it will go through a questioning phase first, just to make sure, so you've got plenty of notice if we're uncertain about something.
Now, to contrast this, let's talk about the superseded life cycle. We've also called this retired in the past. My thinking was that when you retire, you're not actively working anymore, but you're still a productive member of society. But when people heard that a function was retired, they thought, oh, they're going to take it out back and shoot it in the head. So we're changing the name to superseded, to make it clear that we think there's a better alternative: one that's maybe easier to use, or faster, or easier to learn, or more powerful. And we think you should learn that new approach when you've got some free time.
So these superseded functions are not going anywhere. They're going to hang around for a long time, but they're not going to get any new features, and they'll only receive really critical bug fixes. A really good example of this is spread() and gather() in tidyr: they've been around for a long time, and hundreds of thousands of people rely on them. We think we've got a better approach with pivot_longer() and pivot_wider(), an approach that hopefully you can actually remember how to use, but spread() and gather() are not going away. They will probably eventually go away, but in their case that's probably at least five years away. So we do encourage you to switch, and we will change our documentation and our books so that we don't teach or advise them, but those functions will live on for a long time.
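For example, on a tiny made-up table, the superseded and current spellings side by side:

```r
library(tidyr)

wide <- tibble::tibble(id = c("a", "b"), x = c(1, 2), y = c(3, 4))

# Superseded style: gather() collapses the x and y columns into
# key/value pairs
long_old <- gather(wide, key = "name", value = "value", x, y)

# Current style: pivot_longer() does the same job (default column
# names are "name" and "value"), and pivot_wider() reverses it
long_new <- pivot_longer(wide, cols = c(x, y))
wide_again <- pivot_wider(long_new, names_from = name, values_from = value)
```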
And then on the other end of the spectrum, we've got experimental functions. These are functions that we have played around with internally and think are kind of interesting or cool, but we're not 100% sure about. Maybe they'll go away, maybe they'll stay; we just don't know. So if you are adventurous, we want you to try these functions out and tell us: do you love this function, do you hate this function, what's going on? We can use that to inform our decisions. But don't use them in critical code, because they may get removed in the future; don't rely on them 100%. But if they are useful to a lot of people, we'll keep them around and hopefully remove that experimental label.
And really all of this is in service of managing a tension: we want to give you a stable foundation that you can build upon and rely upon, but at the same time, because design is fundamentally iterative, we know we can't get it right the first time, and sometimes we want to have a go at it a few times before we're confident. So we really want to make it clear where functions are in this life cycle, so we can continue to build the stable foundation while trying to get it right in the long term.
So just to finish up, I think the big messages here are: the tidyverse seems to be doing pretty well, with lots more contributors, and I'm really excited about how the community is contributing to the tidyverse. We messed up with tidy eval, but I think we've learned from it, and one of the things we've learned is that making the life cycle clearer will hopefully make it easier for other people to know what's going on. Thank you.
Q&A
Thanks, Hadley. We have time for a couple of questions, but if yours doesn't get asked, I'm sure Hadley's really easy to find and has free time for you to ask him afterwards.
Is vroom eventually replacing readr? Yes. So vroom will not replace readr so much as readr will use vroom under the hood. In the long run, you will just use readr and it will use the fast vroom code. That should hopefully happen in the next year.
Why less purrr? It's awesome. Yeah, yeah, purrr absolutely is awesome. But one thing we see when teaching newcomers to data science is that there are a few things you just want to be able to deal with, like having a directory full of CSV files. That is such a common problem, and it's really nice to be able to deal with it before teaching about functions or functional programming or iteration. So for the most important tools, having some high-level way to express what you want is a little bit easier. It's not about not teaching purrr, it's about teaching purrr later in the curriculum. It's still really, really powerful, I still love purrr, and it's not going away, but we're not forcing quite as many people to use it.
Will there be a tidyverse solution to interactive web graphics, ggplot3? Thomas and I are arguing about that already, so maybe. Interactive graphics is still something that's very near and dear to my heart. I think we now understand some of the data structures, and understand how we could fix some of the mistakes in ggplot2. But it still doesn't feel to me quite like the bottleneck, the critical problem that we need to solve next. Hopefully in the long term.
And the last one I think we have time for: will vroom be faster than fread and data.table? vroom is already faster than data.table in some circumstances, basically because it's fast because it's lazy: it doesn't do as much work as data.table's fread if you don't have to work with all of the data in the data set. If you go to the vroom website, Jim Hester, the author of vroom, has put together a bunch of benchmarks, and you can see where it does better than fread and where it does worse.

