Resources

Hadley Wickham | Maintaining the house the tidyverse built | RStudio

Hadley will talk about how the tidyverse has evolved since its creation (just five years ago!). You'll learn about our greatest successes, learn from our biggest failures, and get some hints of what's coming down the pipeline. About Hadley: Hadley Wickham is the Chief Scientist at RStudio, a member of the R Foundation, and Adjunct Professor at Stanford University and the University of Auckland. He builds tools (both computational and cognitive) to make data science easier, faster, and more fun. You may be familiar with his packages for data science (the tidyverse: including ggplot2, dplyr, tidyr, purrr, and readr) and principled software development (roxygen2, testthat, devtools, pkgdown). Much of the material for the course is drawn from two of his existing books, Advanced R and R Packages, but the course also includes a lot of new material that will eventually become a book called "Tidy tools".


Transcript

This transcript was generated automatically and may contain errors.

Hi, I'm Hadley Wickham and today I'm excited to talk to you about a topic that's near and dear to my heart, code maintenance. This certainly isn't one of the sexiest topics, but as your career in programming advances it becomes increasingly important. And today I wanted to talk about code maintenance through the lens of home maintenance.

Because in many ways I think the tidyverse is a little bit like a hardware store. We don't build the house for you, we don't do the data analysis for you, but we provide you the tools to do it. And like a good hardware store, we also help you pick the right tools and give you some education about how to use them effectively. Now unlike most hardware stores, we also build many of the tools that we sell. So we're also listening to hear what problems you have with the tools, to learn how we can make them better.

Frustration and home maintenance

I've also been thinking about code maintenance and home maintenance a little bit more than usual lately because of a very frustrating experience that I had recently. If you don't own your own home, you might kind of think of houses as things that just exist. But in reality, they generate this steady stream of small maintenance tasks. And one that I needed to tackle recently was that a couple of exterior lights had stopped working. And so I thought, you know, this seems like a pretty simple task, this should be something I can tackle by myself. Take maybe an hour or two and a trip to Home Depot.

Four hours and three trips to Home Depot later, I gave up frustrated and extremely angry. And so the first thing I want to talk about today is how do you deal with that frustration that comes up while doing maintenance? Like how do you deal with all those negative emotions that this thing that should just work doesn't?

And when I'm tackling a home maintenance task like this, and it's going wrong, I often start telling myself these pretty negative stories. Like this is a really simple task. And if I can't perform it, then I must be an idiot. Or you know, I'm looking at the documentation in the box that describes how to install these lights, looking at what's on the outside of my house, and they look nothing alike. Like the documentation is utterly useless. Like it must have been an idiot who wrote this. Or at the end of those four hours, you know, I just tell myself that was a total waste of time. You'll never get those four hours back.

And maybe you are much better at home maintenance than I am. You almost certainly are. But you might have experienced some of these thoughts when dealing with code maintenance, with dealing with the fact that your code that worked fine a couple of years ago hasn't changed, but it breaks today. And now you've got to do all of this extra work to get it working again.

Dealing with negative thoughts

And I think a really useful technique to deal with these emotions, with these thoughts, is to have some kind of rational canned responses you can tell yourself. So for me, it's telling myself, well, you know, actually, I don't do that much home maintenance. It's kind of unreasonable to expect that I'd be an expert at it right away. Or you know, it's like there's no way the documentation could explain what every single light looks like on every single person's house in the world. So I can't expect the documentation is going to apply exactly to my situation, but it should contain some hints so I can read it and take away some important messages. And then sure, I didn't succeed in the goal, but it's not a waste of time. At least I've learned in the future I'm going to need to budget more time. And now I know where all the various bits and pieces are at Home Depot. So I've saved some time for the future as well.

Now this technique comes from cognitive behavioral therapy. It's a way of responding to these automatic thoughts with kind of balanced alternatives. And you can apply it yourself in lots of situations. The main thing to know is that it's very, very difficult to come up with these balanced alternatives in the moment. You want to sit down when you're in a good place and think about some of these negative thoughts that come up and how you can argue against them. I learned about this technique a few years ago when I read the book Feeling Good by David D. Burns.

The whole idea of cognitive behavioral therapy, I think, is really powerful because it gives you this toolkit for understanding how your thoughts and emotions are connected together and how by changing your thoughts, sometimes you can succeed in changing your emotions.


Now that's obviously a very general technique for dealing with frustration, however it arises in your life. But I think it's also worth recognizing, like, where does frustration come from? Because I think at the heart of pretty much every frustration is this conflict between what you want to be true and some unyielding reality.

And I think it's easy to think, like, when you write code that you're creating this monolith that will last a thousand years, that you wrote some code two years ago, you can tell your code hasn't changed. Why doesn't it work today? And the reason is, you know, while your code may not have changed, the world has changed around it. And instead of thinking about code as this kind of monolith, I think it's better to think about code more like a smoke alarm. It's something that needs regular maintenance. You have to change the batteries once a year and replace the whole thing every ten years.

The tidyverse lifecycle

Now let's start to get into some more of the details of dealing with change in code. And this is a place where the metaphor of the hardware store starts to break down a little. Because once you've purchased something from the hardware store, it stays the same. But when you use code from a package, that code might work differently when you update the package. On the whole, this is normally a good thing. It's like the package developer can come into your house and replace all of those expensive, inefficient incandescent bulbs with new LED bulbs that are much more efficient. You get new features, you get bug fixes, you get performance improvements. But any update to a package also comes with a risk that it might break your code.

And I want to talk in a little bit more detail about how we think about that, particularly in the tidyverse. These ideas apply in general, but I'm going to talk about the kind of framework that we are using currently in the tidyverse. And I want to talk a little bit about the kind of set of promises, the tidyverse money back guarantee, if you will, that we want to provide to you so that you can better understand what is likely to break your code and what's likely not to break your code. And to do that, we first have to talk a bit about the tidyverse lifecycle.

So you might have seen this before, because the documentation of certain functions contains these badges, like lifecycle deprecated or lifecycle superseded. The full tidyverse lifecycle is relatively complicated, but today I wanted to talk about the four most important components: experimental, deprecated, stable, and superseded. And you can kind of roughly divide these in half by whether the functions in these various stages are out of warranty or in warranty. And I'll talk about what that means precisely shortly. But roughly, you know, if a function is out of warranty, it might change at any point. If it's in warranty, we make some pretty strict promises about how it will change.

So out of warranty, we've got experimental on the one hand. These are new functions, new features. We don't know exactly how they should work, but we want to get them out into the world so that you can try them out and give us feedback. This is the type of code that you need to have a pretty close relationship with. You need to be working with it a lot because it's going to change based on your feedback. Now most functions are on a stable lifecycle. That means the developer is basically happy with them and has no major plans to change them in the future.

There's two ways that a function can exit that stable lifecycle stage. It can become deprecated, which means basically that it's a bad idea, you should stop using it. Or it can become superseded. We know that there's a better alternative, but the old function's not going away.

So the first thing we're going to guarantee, the first promise we're going to make to you, is that we're going to try, as much as possible, to avoid making breaking changes in any stable functions. So a breaking change is a change that we predict will break the vast majority of correctly written code that uses it. So what are some examples of breaking changes? Well firstly, if you remove a function, obviously any code that uses that function is going to break. Or if you remove an argument to a function, any code that uses that argument breaks. Or if you decrease the set of allowed inputs to an argument, that's also going to break some code. So in general, making the scope of a function smaller is going to be a breaking change, because code that works today will not work after the change happens.

Now as well as the input to a function, there's also the output. Changing the type of the output of a function is also a breaking change. If a function previously returned a numeric vector and now it returns a data frame, it's likely to break. So what are non-breaking changes? Well of course we can change the output, because we have to be able to fix bugs. And this means that technically something that's a non-breaking change can still actually break your code if you've accidentally depended on a bug. So this is a really important topic that I'm going to come back to later when we talk about off-label usage of functions. Otherwise, non-breaking changes are generally increasing the scope of things: increasing the possible set of allowed inputs, adding new arguments, or adding new functions. So anything that grows the interface of a function or a package is going to be a non-breaking change.
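As an illustration of why changing the output type counts as a breaking change, here's a small made-up sketch (the function names are hypothetical, not from any real package): a caller written against version 1 no longer works against version 2.

```r
# Hypothetical v1: returns a named numeric vector of group means
mean_scores_v1 <- function(df) {
  tapply(df$score, df$group, mean)
}

# Hypothetical v2: "improved" to return a data frame instead
mean_scores_v2 <- function(df) {
  aggregate(score ~ group, data = df, FUN = mean)
}

df <- data.frame(group = c("a", "a", "b"), score = c(1, 3, 10))

# Caller code written against v1:
res <- mean_scores_v1(df)
res[["a"]]  # 2: numeric-vector subsetting by name works

# The same caller code against v2 breaks:
res2 <- mean_scores_v2(df)
# res2[["a"]] now errors, because [[ on a data frame looks up
# a column called "a", which doesn't exist
```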

Deprecation in practice

Now sometimes we will have to make a breaking change to a function that is stable, but we're never going to do that suddenly. We're going to do it through a gradual process of deprecation. We're going to give you a chance to find out that something is changing and respond to that change before we take it away for good. So what does that look like? Well, let's take a function from the tibble package called data_frame(). Now we, probably I, created this function when we just started working with the idea of tibbles, and because tibbles are kind of like a modern reimagining of a data frame, I thought, well, let's give it a modern name by taking the dot and replacing it with an underscore. Pretty soon afterwards I realized that was (a) kind of cutesy, and (b) misleading, because this function doesn't return a data frame, it returns a tibble. So we decided to deprecate that function.

So what happens if you call that function today? Well, you're going to get a warning that tells you it's deprecated, and it tells you when it was deprecated. So tibble 1.1 was actually released in 2016, about four years ago. So deprecated functions don't go away immediately, they're going to hang around for a while so that you can find out about the change. Generally, the more important or the more widely used a function was, the longer that deprecation message will hang around. Next you're going to find out what you should use instead. So here you can use the tibble() function instead; it does exactly what it says on the tin, it creates a tibble for you.

Now what happens if you're looking at some old code that has hundreds of data_frame() calls in it? It would be super annoying if every single one of those generated a message, so we're only going to show you this warning every eight hours. But it's also important for you to fix these deprecated functions, so we're going to help you find out where they were by giving you this advice to call lifecycle's last warnings function.
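As a hedged sketch of what this deprecation experience looks like in a session (assuming the tibble package is installed; the exact warning text varies by version):

```r
library(tibble)

# Deprecated, but still works: you get a warning, not an error,
# and the result is the same tibble you always got.
old <- data_frame(x = 1:3)
# Warning: `data_frame()` is deprecated; please use `tibble()` instead.

# The replacement does the same job:
new <- tibble(x = 1:3)
identical(old, new)  # TRUE

# If the warning fired deep inside someone else's code, the lifecycle
# package can print the backtraces that led to it:
# lifecycle::last_lifecycle_warnings()
```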

There's one important thing that you do not see here, and that's an error. So a deprecated function does not generate an error, it still works, it still does exactly the same thing it used to do, but it is going to go away in the future, and the warning message is going to encourage you to update.

So if you do call that last warnings function, you'll get that same message again, and you'll also get a backtrace. And a backtrace is just the sequence of function calls that eventually lead to that deprecated function. In this made-up example we've got a function f that calls g, that calls h, that calls i, that finally calls data_frame(). And this is quite important because it allows you to narrow in on exactly where that warning is coming from as quickly as possible.

We really do not want deprecated functions to be that one smoke alarm in your house that needs its batteries changed, where the beep feels perfectly calibrated: it happens frequently enough that it drives you to the point of insanity, but not frequently enough that you can ever figure out which of the four smoke alarms it's actually coming from.

Superseded functions

Now the lifecycle stages, which I've talked about mostly in the context of functions, also apply to packages in slightly different ways, and they also apply to arguments of functions and even specific values of arguments of functions. And so for example, in the nest() function in tidyr 1.0, we wanted to update it to match the emerging consensus on function interfaces across the tidyverse, so we deprecated one argument and we deprecated one way of using `...`.

So deprecated functions, features, and arguments are things that are moving out of warranty. What about things that we just kind of regret a little? They're not wrong, but maybe there's a better alternative available now. That's the idea of the superseded lifecycle stage, which I'm going to talk about with a little example.

Because a few years ago it sort of felt like there was this disturbance in the tidyverse force: a lot of people were having problems remembering which of spread() and gather() was which, and how to actually use them. Like every time you'd go to use them you'd have to look up the documentation or spend some time googling. And this wasn't just happening in the community, this was also happening to me. Like, I forgot how to use these functions, which seemed like a really good sign it was time to put some work into them. And so last year we spent a bunch of time talking to the community, looking at other approaches taken by packages like data.table, and we came up with two new functions: pivot_longer(), which makes your data longer, and pivot_wider(), which makes your data wider. And overall it seems like those new functions have been successful. People have found them, or at least people in this cherry-picked sequence of tweets have found them, much easier to use and remember.

But just because pivot_longer() and pivot_wider() exist and are great and are easier to remember, it doesn't mean that spread() and gather() are wrong or bad. They still do what they say on the tin, they still work the same way that they always have, and by any measure they're some of the most successful functions I've ever written, probably used by hundreds of thousands of people. So spread() and gather() are never going to go away, but we want to make it clear that newer approaches are available. And that's the idea of this superseded lifecycle badge, which basically says development of this function is now complete: it's never going to get any new features, and it will only receive the sorts of critical bug fixes that keep it alive and useful.
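To make the old and new spellings concrete, here's a hedged sketch (assuming tidyr 1.0 or later is installed; the data is made up). Both calls reshape the same wide table into a long one; note that row ordering differs between the two.

```r
library(tidyr)

# A small wide table: one row per person, one column per year
wide <- data.frame(name = c("ann", "bob"),
                   `2019` = c(1, 2),
                   `2020` = c(3, 4),
                   check.names = FALSE)

# Superseded spelling: still works, never going away
long_old <- gather(wide, key = "year", value = "count", -name)

# Current spelling: easier to read and to remember
long_new <- pivot_longer(wide, cols = -name,
                         names_to = "year", values_to = "count")
```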

And to me, the way that I think about superseded functions is kind of like the idea of building codes. So a building code is a set of best practices that you need to apply when you are building a new home today. And obviously over time those standards evolve. So what happens if you have a 50-year-old home or a 100-year-old home? Do you have to go and update your house to meet all of today's standards? That would be crazy. So the idea of building standards is that they are important when you are building a new house, or whenever you touch something in your existing house. So if you're renovating your kitchen and discover the wiring is no longer up to code, you're going to have to replace that wiring, but you don't have to rip out all of the walls and wires elsewhere in your house.

And the same applies to superseded functions. Really good idea to use them in new projects. If you are touching old projects, if you're renovating old projects, update superseded functions when you need them, but don't otherwise worry about it. It's totally fine for the old and the new to coexist seamlessly. The other place it's worthwhile updating is if you're teaching. Really important when you're teaching to be teaching the latest best practices as much as possible.

You might wonder, well, when are these superseded functions going to go away? And you know, I don't want to make any promises that I can't commit to. Who knows what our priorities are going to be in five years or 10 years, and what our resources will be. But I think you can get some sense of how we care about superseded functions by looking at some superseded packages, like for example the reshape package, which I wrote in 2005 and was superseded in 2010 by reshape2. Because people are still using it, because packages are still using it, we've kept reshape working. It's not getting any new features, so it's not a lot of work. We're going to keep that alive on CRAN as long as possible. Similarly, same story for plyr and reshape2. Both created a long time ago. They've been superseded. We're going to keep those packages alive and on CRAN for the foreseeable future.

Off-label usage of functions

Now so far, I've talked about these lifecycle stages, which we as the creators of the functions have control over. There's an important thing that you have control over, which is how you use functions. And I want to talk a little bit about the kind of off-label usage of functions. Because in medication, off-label usage is basically prescribing a medicine for something it's not formally approved for. This is not illegal. It's not wrong. It's useful in lots of cases because there's lots of knowledge about medication that hasn't been formally approved yet. But it is a little riskier than using something for its intended purpose. And particularly for functions, it exposes you, I think, to a greater risk of breakage.

So what exactly do I mean by off-label usage? Let's take a little example. I've got a factor, and I want to extract the underlying numeric levels of the factor. Now one way to do this is to use the c() function, because c() doesn't know how to handle factors, and so it just drops all the levels. But I would say that this is off-label usage, because the job of c() is to concatenate, or combine, vectors together. And here we've only got one vector. So you, or someone else, or you in the future, looking at this code might scratch your head a little bit and think, well, what does it mean to combine one vector? Here you're really relying on an unintended side effect of c(), which, because it doesn't know about factors, drops those levels. You're going to be much better off explicitly declaring that you want the underlying integer levels by using something like as.integer().
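A minimal base-R sketch of this example. (One caveat from memory of the R changelog, worth double-checking: in R 4.1.0, c() learned to combine factors into a factor, so the off-label trick quietly changed behaviour in newer versions of R, which is exactly the risk being described.)

```r
f <- factor(c("low", "high", "high"), levels = c("low", "high"))

# Off-label: historically c() dropped the factor class as a side
# effect, exposing the integer codes. In recent R versions c()
# combines factors instead, so the trick no longer does what it did.
c(f)

# On-label: say what you mean
as.integer(f)
# [1] 1 2 2
```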

Or maybe you've got a data frame, and you want to find all the rows in this data frame where x equals 1. And you do that using subsetting in R, and because R has this fantastic rule that missing values don't go silently missing, you're going to get a missing row in the output, which a lot of the time is not what you want. And so you might think, well, I'll just use which() to get rid of this. Now this certainly isn't as egregious an off-label usage as the last example, because the primary purpose of which() is to find the locations of true values in a logical vector, and it's hard to know what else which() would do with missing values. But you're not really fully communicating what you're trying to do here. What you're trying to do is remove missing values. And so in this case, I think you're better off switching to a function like dplyr::filter() or base subset(), explicitly designed for subsetting rows of a data frame, which knows that you don't want to get those NA rows back.
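A small base-R sketch of the difference (the data is made up):

```r
df <- data.frame(x = c(1, NA, 1, 2), y = c("a", "b", "c", "d"))

# Logical subsetting propagates the NA: you get an all-NA row back
df[df$x == 1, ]         # 3 rows, one of them all-NA

# Off-label-ish workaround: which() happens to drop NAs
df[which(df$x == 1), ]  # 2 rows

# Clearer: a tool that is documented to drop rows where the
# condition is NA
subset(df, x == 1)      # the two rows where x really is 1
# or equivalently: dplyr::filter(df, x == 1)
```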

And the reason you want to avoid this off-label usage is that the original author has no way to anticipate what you're doing. And so when they make changes to the function, it's much more likely to negatively affect these off-label usages than the documented usage. Another way of saying that is that off-label usage voids your warranty. All bets are off when you start using a function in a way that the author did not envisage.


So how do you avoid it? Well, the first thing you need to recognize is that understanding off-label usage, and what the intended purpose of a function is, is a skill. It's not something that you're born knowing. It's something you're going to develop over the course of your career in R. And in particular, when you start out learning R, your primary goal is to solve the problem in front of you and get the result that you need. And that is 100% OK. But as you get better, it's a really good practice to ask: are you using functions not just because they do what you want, but because they do what they say? And the easiest way to discover that is to read the documentation. If you are using a function for a purpose that is not mentioned anywhere in its documentation, you're likely to be in the danger zone.

But I think the best technique for learning more about the intended usage of a function is to do code review. So this is going to help you when you get your code reviewed by someone else, because they're going to look at that, they're going to look at you concatenating a single factor and they're going to ask you, like, what are you doing here? Because the chances are they don't know all the same quirks of the functions that you do. They'll know the intended purpose of a function, but they're less likely to know some side effect that you're relying on. But code review also helps you when you're reading other people's code, because you're going to look at that code and you're going to think about it. You're going to say, well, like, what is this doing? Maybe you'll read a little bit of documentation, you'll learn some new functions. And that process of reflecting on other people's code is really going to help you understand and improve your ability to write good code.

Personally, one of the biggest improvements to my coding style, I think, happened when I started grading code, when I started teaching data analysis at Rice University. That process of having to read other people's code and think about, like, why was this hard to understand, really made my own code much more elegant and clearer.

Avoiding unforeseen consequences

Now, while off-label usage of a function will void your warranty, we as maintainers of the package still want to avoid unforeseen consequences as much as possible. And I think there's a really interesting unforeseen consequence of the transition from incandescent light bulbs to LED light bulbs. A few years back, cities started converting their traffic lights from incandescent to LED, because it saved a bunch of time: they don't need to be replaced as frequently, and they use much, much less power. But unfortunately, the first winter after this happened, they discovered a problem: snow was accumulating in the traffic lights and blocking them, so people couldn't see them. And this was because one of the reasons that incandescent bulbs were so inefficient was that they also produced a bunch of heat. Previously, that heat would have melted the snow; with the new LED bulbs it didn't, snow accumulated, and there were a bunch of car crashes.

So in the tidyverse, we want to avoid that as much as possible, particularly when we make big changes, by making sure that there aren't common off-label usages that we need to support. And I want to talk about this in the context of a recent change to magrittr: a set of big changes that improved backtraces (so when you get an error message, it's much easier to see where in the pipeline it actually happened), reduced overhead (so it's faster), and brought it much closer to the upcoming base pipe. So I think these are three really big and important improvements, but to get them, we had to fundamentally re-engineer how the pipe worked. And our analysis suggested that it should be fine for any correct usage of the pipe, or any way that we used the pipe, but we wanted to make sure there weren't a bunch of uses that we hadn't imagined that we still needed to support.

So we did three things, and you can learn from those three things. The first thing we always do is run R CMD check on all of the packages that use our package on CRAN. Now, code in CRAN packages is not exactly the same as data analysis code, but it is a very large body of R code that is very easy for us to rerun. So this is a great first pass to pick up any major problems with a new version of our package. And, you know, if you can go to the effort of putting a package on CRAN, this also gets you some extra protection: before any package in the tidyverse changes, all of your tests will be rerun and will tell us if we've broken something.

We also tweeted about it; we tweet about pretty much everything we do, so following us on Twitter is a great way to keep up with the latest and greatest changes in the tidyverse. But for changes of this magnitude, we also blog about it. And so in this case, Lionel wrote a blog post saying, hey, there's a big new version of magrittr coming along, please try it out and let us know what you think. Does it break anything in your code?

And in the process of coming up with this slide, I realized: if you don't use Twitter and you don't use blogs, how are you supposed to keep up with these things? So we're going to add a mailing list so you can easily sign up for updates over email in the not too distant future.

Now when we did this for magrittr, we actually uncovered a bunch of smaller and not-so-small things. So we either fixed magrittr to make them work, or, if we couldn't do that, at least made it give an informative error message. But there were still a few things where the tradeoff between the amount of time it would take to fix them and the frequency of that usage in the community didn't add up. And at the end of the day, we have to make the judgment that these large positive improvements for the vast majority of people are an okay tradeoff for a few new problems for a very small number of people.

Opting out of package updates

Now even after learning about the tidyverse lifecycle and the various promises that we make to you, you might still be worried that, for particularly critical projects, an otherwise innocuous package update might break your code. And so you have a fallback, which is to opt out of the hurly-burly of package updates. Now this is generally not a good idea, because you're not just opting out of changes that might break your code, you're also opting out of bug fixes, improvements, and new features. But for some types of code, particularly code that is run unattended, like code in production, it can be a good idea to use one of these techniques so that your code is 100% isolated and protected from changes to other packages.

So the first thing you can do is use the renv package. Basically what this does is isolate each project you work on, giving it its own library of R packages. So it doesn't matter what you install elsewhere, this project will always use the same set of packages. It also comes with a pair of functions that let you save those package versions to disk and then restore them elsewhere, so you can run your code on another computer and it'll hopefully give you exactly the same results, because you've installed exactly the same versions of the packages. Now because renv works by giving you a custom, isolated library of packages, it works with any package, no matter where it comes from, whether that's CRAN or GitHub or your company's internal package repository.
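As a sketch of that workflow (assuming the renv package is installed; these calls scaffold files in the current project directory, so only run them inside a project you mean to isolate):

```r
# One-time, per project: give this project its own private
# package library, separate from your global one
renv::init()

# After installing or updating packages, record the exact
# versions in a lockfile (renv.lock)
renv::snapshot()

# On another computer, or months later, reinstall exactly the
# versions recorded in the lockfile
renv::restore()
```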

But renv is a bit of work to set up. It's not a huge amount of work, but there is a simpler alternative if you only care about CRAN packages, and that's to use a CRAN time machine. There are two options here: you can use the public version of RStudio's package manager, or you can use Microsoft's MRAN. Both of these basically work the same way. They take a regular, roughly daily snapshot of CRAN, so you can pick a day in the past and install the packages that were on CRAN on that day. Now this is obviously much simpler to set up, you just have to change the repos option in your .Rprofile, but it has the limitation that it only works with CRAN packages.
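A sketch of what that configuration looks like. The snapshot URLs below are illustrative assumptions about the URL scheme; check each service's documentation for the exact form before relying on them.

```r
# In your .Rprofile (or at the top of a script): point R at a
# frozen, date-stamped snapshot of CRAN.

# Microsoft's MRAN time machine (URL form is an assumption):
options(repos = c(CRAN = "https://cran.microsoft.com/snapshot/2020-09-01"))

# Or the public RStudio Package Manager (URL form is an assumption):
# options(repos = c(CRAN = "https://packagemanager.rstudio.com/cran/2020-09-01"))

# From now on, install.packages() sees CRAN as it was on that date.
getOption("repos")[["CRAN"]]
```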

Summary

So to sum up, I've told you about some of my frustrations with home maintenance, which I think also apply very much to code maintenance, particularly accepting that the reason these maintenance tasks cause frustration is this collision between what you want to be true and what really is true. And personally I've found the ideas of cognitive behavioral therapy really useful for this, helping me to identify my automatic thoughts and come up with more balanced, rational alternatives that help me think about things more productively.

We also talked about the four most important stages of the tidyverse lifecycle, experimental, deprecated, stable, and superseded, and talked about some of the guarantees that we in the tidyverse team do our best to provide. So firstly, we're going to try our best to avoid breaking changes in stable functions, and if there ever are any breaking changes, we're going to make them gradually, through a deprecation cycle, so you have time to discover that there's a change and respond to it. But remember, if you're going to use a function off-label, that voids the warranty, because we don't know how you're using the function, and it's much, much more difficult for us to make sure it keeps working in the way that you're relying on today. Finally, if the hurly-burly of package updates is just too much and you want to opt out, particularly if you have a project that's running unattended in production, you can use tools like renv or the CRAN time machines to isolate your project, or pin a date, so that the packages you install never change.

So I hope you've enjoyed this journey of code maintenance and home maintenance with me, and I'm really looking forward to taking your questions in just a couple of minutes. Thank you.