
Hadley Wickham | An introduction to R7 | RStudio (2022)
The R7 package is a new OOP system designed to be a successor to S3 and S4. It has been designed and implemented collaboratively by the R Consortium Object-Oriented Programming Working Group, which includes representatives from R-Core, Bioconductor, RStudio/tidyverse, and the wider R community. In this talk, I'll introduce R7 to the wider world. Attendees will learn why we created R7 and how they can use it to create new classes and packages. I hope to inspire folks to download, try it out, and give us the feedback we need to make it better. Session: Just typing R code: advanced R programming
Transcript
This transcript was generated automatically and may contain errors.
Thank you. So I was thinking earlier, like, what idiot scheduled this talk at this time? And like, because immediately afterwards, I have to go and introduce Jeff for the keynote. And of course, the idiot was me. So today, I want to give you a bit of a first look at this new OOP system for R called R7. And obviously, like, I'm giving this talk, but I didn't do all the work for this talk. This is a joint effort from the R Consortium Working Group on object-oriented programming. So we'll talk a little bit about that later, because I think that's a really important part of this project. This is not just, like, some idea that I've had. It's a team effort from a number of very important stakeholders in the R community.
Why we need OOP
But to begin, I wanted to kind of talk about, like, why do we need OOP? And I want to do that by talking about this sort of this Bizarro function. And the goal of this Bizarro function is to take an R object and turn it upside down somehow. So for example, if we've got a numeric vector, maybe we'll flip the sign. Or if we've got a logical vector, maybe I'll turn trues to falses and falses to trues. Or if I've got a character vector, I'll flip the letters in each of the strings in that vector. Or if I've got a factor, I'll flip the levels. And if there's anything else, I'll just throw an error.
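The if-else version of bizarro described above might look roughly like the following base R sketch. The slide itself is not part of the transcript, so the string-reversal code and the error message are assumptions.

```r
# Sketch of the if-else style bizarro() described in the talk.
# The slide code is not in the transcript; details here are assumptions.
bizarro <- function(x) {
  if (is.numeric(x)) {
    -x                                   # flip the sign
  } else if (is.logical(x)) {
    !x                                   # TRUE <-> FALSE
  } else if (is.character(x)) {
    # reverse the letters within each string
    vapply(
      strsplit(x, ""),
      function(chars) paste(rev(chars), collapse = ""),
      character(1)
    )
  } else if (is.factor(x)) {
    levels(x) <- rev(levels(x))          # flip the levels
    x
  } else {
    stop("Not supported")
  }
}

bizarro(c(1, -2))
bizarro(c(TRUE, FALSE))
bizarro(c("abc", "de"))
```

Every new type of input means another branch inside this one function, which is the problem the next part of the talk picks up.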
So what's wrong with this function? Well, there's nothing really wrong with this function, because it's so simple. Like, you can already see it on one slide. But there are some problems with this kind of general approach. And the first one is, as we handle more and more types of data, this function is going to get bigger and bigger and bigger. And you're going to have to have all of that code in a single file.
And if you think about functions in base R that have to have different behavior for different types of things, this is going to become really problematic. And I think the function in base R that does the most different things for the most different types of objects is print. And if you were to write print with this kind of if-else style, this is what it looks like just for the types of objects in base R that begin with the letter A. That's already not enough to fit on a slide, and I haven't even included the implementations. So you're going to imagine, as you deal with more and more types of things, it's going to get bigger and bigger and bigger.
And the other possibly more important problem is that there's only one person who can add new types of behavior to that function, that can teach that function to handle new types of object, and that's the original author. And again, you can imagine for R core, like if every time someone added a new type of thing, a new class, that some code in R itself would have to change, that's obviously going to be a huge, huge hassle. So that's really the inspiration for OO programming in R.
S3, S4, and where R7 fits in
But there are already two types of OO systems in R. So how does R7 fit in? Well, first of all, we have S3, or Mama Bear. Now, S3 has its name because it was the OOP system developed for the third version of the S programming language. It really is the simplest OOP system that could possibly work. It works basically mostly on conventions. It doesn't have a lot of structure. And it works great, but it's very, very simple.
And then a few years later, as kind of a reaction to S3, along came Papa Bear, or S4. And S4 is very strict, very rigorous. You get a lot more guarantees, but it's also a lot more complicated and a lot more work. So S4, so-called, because it was designed for the fourth version of the S programming language.
And so now we've been working on this tool called R7, where we want to try and be like Baby Bear. We want to be just right. We want to strike the right balance between complexity and power. And we're calling it R7 because it's hopefully 3 plus 4, like the best parts of S3 and S4. And it's designed for R, not for backward compatibility with S.
Implementing the bizarro function in R7
So what would it look like if we were going to implement this bizarro function in R7? Well the first thing we're going to have to do is load the R7 package. And this is kind of probably temporary. So currently R7 is just a package on GitHub. Hopefully in a few months it'll become a package on CRAN. But the long-term goal is to get it into R itself, so that this is not an additional dependency. It's something that's built into R.
But for now we're going to load that package. And then we're going to create a generic. And a generic is a special type of function whose behavior depends on the type of one or more of its arguments. So here we're defining a new generic. And crazily enough, we're going to use a function called new_generic to do that. And we're going to give that function the name of the generic, so bizarro. And we're also going to store that result in a variable called bizarro. That is going to be our generic. So it's a little bit unfortunate that we've got to have this repetition here. We're saying bizarro twice. We considered a number of ways to work around this and decided that this little bit of repetition was the least worst solution. There's one other argument here, and that is the name of the argument that's going to be used for dispatch. Dispatch is the process of finding the implementation given the type of the argument. So this generic doesn't really do anything. There's no code here. There's no implementation. But it defines the shape of what this function is going to look like. And if we want to add implementation, what we have to do is define methods.
So if we go back to that bizarro function, look at that first if block, if is.numeric(x). What we're going to do now, instead of an if block, is we're going to define a method. And we're going to define that method with the method function. So the first argument of method is the generic. And the second argument is the type of object we want to provide an implementation or a method for. And so here we're going to use a built-in class. Numeric vectors are something that come built into R, so this is a class that R7 is going to provide for you. And then we provide the implementation, which is exactly the same as before. So the implementation has not changed, but the way we get to that implementation has: instead of having a bunch of if-else statements, we're going to use this method lookup. And so if we work through the other methods, the logical, exactly the same thing. Now instead of testing if it is.logical, we say we're defining a method for the bizarro generic with a logical class. Same thing for character and so on.
And once we've done that, the generic is going to work exactly the same way as our original function. Nothing's going to differ there, except that when we call it with some type of object we haven't provided a method for, we're going to get a slightly different error message. Because the R7 system is going to say, well, I can't find a method for an object that's a function. If we print that generic, we can see that this is an R7 generic that has a bunch of methods. And if you're paying attention, you might notice that there are four methods there, but I only just defined three methods. And that's because the numeric class is a little bit of a fiction. In R, there isn't really such a thing as a numeric vector. A numeric vector is either actually an integer vector or a double vector, and that's what you can see in this printout.
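A minimal sketch of the generic and methods described above. The slides are not part of the transcript, so this assumes the S7 package from CRAN (the name under which R7 was later released) and assumes that its new_generic(), method(), and class_* objects match what was shown in the talk.

```r
# Assumption: S7 is the CRAN release of the package called R7 in the talk.
library(S7)

# Define the generic; "x" is the argument used for dispatch.
bizarro <- new_generic("bizarro", "x")

# Register one method per class instead of one if-else branch per type.
method(bizarro, class_numeric) <- function(x) -x
method(bizarro, class_logical) <- function(x) !x
method(bizarro, class_character) <- function(x) {
  vapply(
    strsplit(x, ""),
    function(chars) paste(rev(chars), collapse = ""),
    character(1)
  )
}

bizarro(c(1, -2))
bizarro(c(TRUE, FALSE))

# Calling it with a type that has no method (here, a function) is an error.
result <- tryCatch(bizarro(mean), error = function(e) "no method")
```

Typing bizarro at the console prints the generic together with the methods it knows about.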
Now you've probably seen a similar thing if you've ever printed an S3 generic, but this is what an S3 generic looks like. So if you've ever been hunting for the source code for a function and you see this weird UseMethod call, that tells you you're dealing with an S3 generic. And so one of the things we've tried to do with R7 is to be a little bit more helpful. When you look at a generic, it's going to tell you, hey, this is a generic and these are the methods that I know about. And those four methods there, which look like function calls, those are actually function calls. And if you copy and paste that code into your console, you'll see the implementation of each of those methods.
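For comparison, here is a small base R sketch of the S3 style mentioned above. The name bizarro_s3 is made up for illustration.

```r
# An S3 generic is just a function whose body calls UseMethod().
bizarro_s3 <- function(x) UseMethod("bizarro_s3")

# Methods are ordinary functions named <generic>.<class>.
bizarro_s3.numeric <- function(x) -x
bizarro_s3.logical <- function(x) !x

bizarro_s3(1:3)
bizarro_s3(TRUE)

# Printing the generic shows only the UseMethod() call, not its methods.
print(bizarro_s3)
```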
Classes in R7
So there are two parts to object-oriented programming systems in R, generics, generic functions, which define behavior, and classes, which define data. So I've shown you generics, and now I want to show you a little bit about what classes in R7 look like.
So I'm going to create a new class. Surprise, surprise, I'm going to use a function called new_class to do that. I'm going to call it a range class. Again, I'm going to assign the result of that function to an object with the same name as the class, and that is going to produce what we call the constructor function. This is both kind of a definition of the class and the thing you're going to use to create instances or objects of that class.
This class has some properties. So properties are the data that an object possesses. If you've used S3 before, these are the same as attributes. If you've used S4, these are the same as slots. So these are the data that every object of this range class is going to possess. Every one is going to have a start property and an end property that are both numeric vectors.
So now that we've created that object, that class object, that class constructor, we can create range objects just by calling that function. We can print it, we get some information about the data that's in it, and we get this hint that maybe we can access the values inside that object with the at symbol. So if you've ever used S4 before, that's the way you extract data out of objects in S4 with at, and R7 takes that.
So you can get values, you can set values, but R7 has got some built-in safety rails. So if you try to assign a character vector to a property that we said had to be numeric, you're going to get an error. But that doesn't protect you against every possible error, every possible way you might create an invalid object. Because you might say, well, actually, I want to have a range object where the start is always less than the end. And there's currently nothing stopping me from creating invalid ranges just by using a larger start than an end.
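The class described above could be sketched as follows. As before, this assumes the S7 package from CRAN (the later name for R7); the class name Range and the example values are illustrative rather than taken from the slides.

```r
# Assumption: S7 is the CRAN release of the package called R7 in the talk.
library(S7)

# Hypothetical range class with two numeric properties.
Range <- new_class("Range",
  properties = list(
    start = class_numeric,
    end = class_numeric
  )
)

r <- Range(start = 1, end = 10)  # the class object doubles as the constructor
r@start                          # read a property with @
r@end <- 20                      # set a property with @

# Assigning the wrong type is rejected.
type_check <- tryCatch(
  {
    r@end <- "twenty"
    "accepted"
  },
  error = function(e) "rejected"
)

# Nothing yet stops a range whose start is larger than its end.
backwards <- Range(start = 10, end = 1)
```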
You can resolve that in R7 by providing something called a validator. A validator is just a function that takes an object and either says, all is well, or it returns a message telling you what is wrong with that object. And so once you supply a validator, then that is going to ensure that every instance of that object has to be valid. It's still possible to create an invalid object, but you've got to jump through some hoops to do that.
And this is a really valuable property, because this means that if you have a range object, you know that start is always going to be less than end. You know that these conditions are true, which is really useful.
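A validator along these lines might look like the sketch below, again assuming the S7 package from CRAN as the later release of R7. The exact checks and messages are assumptions, not the ones from the slides.

```r
# Assumption: S7 is the CRAN release of the package called R7 in the talk.
library(S7)

Range <- new_class("Range",
  properties = list(start = class_numeric, end = class_numeric),
  # Return NULL when all is well, or a string describing the problem.
  validator = function(self) {
    if (length(self@start) != 1 || length(self@end) != 1) {
      "@start and @end must each be length 1"
    } else if (self@start > self@end) {
      "@start must not be larger than @end"
    }
  }
)

ok <- Range(start = 1, end = 10)

# Both construction and later modification are validated.
bad_new <- tryCatch(Range(start = 10, end = 1), error = function(e) "invalid")
bad_set <- tryCatch(
  {
    ok@start <- 20
    "valid"
  },
  error = function(e) "invalid"
)
```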
Another really cool thing you can do is you can make properties that aren't actually data, that are computed on demand. They behave exactly as if there was some data behind them, but you can do basically anything you want inside these properties. So these work a little bit like active bindings if you've used those before.
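A computed property could be sketched like this, assuming the S7 package from CRAN and its new_property() function. The length property is an illustrative choice, not necessarily the one from the slides.

```r
# Assumption: S7 is the CRAN release of the package called R7 in the talk.
library(S7)

Range <- new_class("Range",
  properties = list(
    start = class_numeric,
    end = class_numeric,
    # Computed on access; no data is stored for this property.
    length = new_property(getter = function(self) self@end - self@start)
  )
)

r <- Range(start = 1, end = 10)
r@length
r@end <- 20
r@length  # recomputed from the current start and end
```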
R7 vs S3 and S4
So that's R7 in a nutshell. But how does it compare to S3 and S4? Well, I think if we're going to compare OO systems, it's good to think about what's the complexity, how hard it is to use, and what's the payoff. And S3 has a very, very low complexity. It's very, very simple, and the payoff is relatively low. S4, the payoff is much, much higher, but the complexity is also much, much higher. And so our goal with R7 is to create something that's just a little bit more complex than S3, but has almost all of the features, all of the payoff of S4.
Now whenever I hear about someone trying to solve a problem where there's two existing solutions by creating a third solution, I always think of this XKCD comic. And we really want to avoid this with R7. We don't want to make the whole problem more complicated by now giving people three possible choices of object-oriented programming instead of the two that are baked into base R, not to mention like the five or six other packages on CRAN that have other object-oriented programming systems.
So how are we going to do that? Well, first of all, this is a team effort. The R Consortium Working Group on Object-Oriented Programming, that's a mouthful, includes representatives from R-Core, from Bioconductor, from rOpenSci, from RStudio/tidyverse, and the general R community. This is not an effort to solve the problems of just a small group of people, but this is an effort to solve the problems faced by a large proportion of the R community.
Next we're trying to take what we believe to be the best parts of S4 and keep them. And that's like the formal definition of classes, so when you get an object of a specific type you know a bunch about it right off the bat. There's that at helper, which gives you a useful error message if you try to extract something that does not exist. R7 also provides multiple dispatch when needed, which is really important for, for example, arithmetic operations where you need to choose the implementation based on the left-hand side and the right-hand side of addition. And finally we have this validator function which allows us to apply additional conditions to make sure our objects are valid.
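To make the multiple dispatch point concrete, here is a small sketch that picks a method based on the classes of two arguments. It assumes the S7 package from CRAN (the later name for R7); the combine generic is hypothetical.

```r
# Assumption: S7 is the CRAN release of the package called R7 in the talk.
library(S7)

# Hypothetical generic that dispatches on both x and y.
combine <- new_generic("combine", c("x", "y"))

# The signature is a list with one class per dispatch argument.
method(combine, list(class_numeric, class_numeric)) <- function(x, y) x + y
method(combine, list(class_character, class_character)) <- function(x, y) paste0(x, y)
method(combine, list(class_numeric, class_character)) <- function(x, y) paste0(x, y)

combine(1, 2)
combine("a", "b")
combine(1, "b")
```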
And we've tried to do that while keeping R7 100% backward compatible with S3. So if you have an existing S3 class, you can convert it to R7 in such a way that everyone who uses that object as if it's an R7 object gets all of these new goodies, but no existing code breaks.
And one of the ways we've done this is a little weird in terms of the spectrum of object-oriented programming systems and other environments. So one of the core kind of precepts of OO programming is normally encapsulation. And that means like only the person who writes a class can access the data inside that class.
And this is I think a really good idea in like 99% of programming languages, but it doesn't really work in R because the people who use R are data scientists and they want that data inside that object and they will do whatever it takes to get that data if they want it.
So S4 tried to kind of protect people from this, like that you're never, as a user, you're not really supposed to use the at symbol. You're always supposed to use some kind of accessor function, but it just didn't really work. People want that data.
And so rather than trying to fight this very natural tendency, R7 accepts it while giving you the ability to change the internals of your class if you need to with these dynamic properties. So if you have to, you can change the internal implementation. You can deprecate fields, properties inside your object and still have existing code work, which is a really important property to allow code to evolve over time.
And then finally we've tried to dig a pit of success with thoughtful function names, argument names, documentation and errors. So hopefully after you've used it a little bit you can start to guess what the code you need is and if that guess is slightly wrong, you will get useful feedback that points you in the right direction.
So really, we'd really, really appreciate it if you would try out R7. As I said, it's currently a GitHub package which you can install pretty easily if you want. We'd love to know not just what doesn't work or how it doesn't help you solve your problems, but we also want to know what doesn't make sense, what don't you understand. Where are the problems in the documentation? Where are the problems in the error messages so that we can do better and that we can make R7 just right. Thank you.

