Resources

The R6 Class System by Winston Chang from useR! Brussels 2017

R6 is an implementation of a classical object-oriented programming system for R. In classical OOP, objects have mutable state and they contain methods to modify and access internal state. This stands in contrast with the functional style of object-oriented programming provided by the S3 and S4 class systems, where the objects are (typically) not mutable, and the methods to modify and access their contents are external to the objects themselves. R6 has some similarities with R's built-in Reference Class system. Although the implementation of R6 is simpler and lighter weight than that of Reference Classes, it offers some additional features such as private members and robust cross-package inheritance. In this talk I will discuss when it makes sense to use R6 as opposed to functional OOP, demonstrate how to use the package, and explore some of the internal design of R6.

Transcript

This transcript was generated automatically and may contain errors.

All right. Okay. I'm Winston Chang, and I'm going to be talking about the R6 Class System.

So, first off, one of my colleagues pointed me to this webpage. This is at rdocumentation.org, run by DataCamp, and it lists the most downloaded packages. And I didn't realize this, but apparently R6 is the most downloaded package. It's for object-oriented programming. And so that was a little surprising to me.

But when I thought about it, it made sense because there's a lot of very popular packages that use R6, like dplyr, for example. One thing I did find very surprising, though, was that in this column here, it apparently is the most popular directly downloaded package, which means that a lot of people are doing install.packages("R6").

So, either people are very, very into object-oriented programming in R, or there's some problem with the data. Okay, so they're still working on this at DataCamp. So, this is not true, sadly. As I suspected.

R6 by the numbers

All right, so I'm just going to give you some numbers here, R6 by the numbers. It's been out on CRAN for three years. It's only 588 lines of code. There's about 1,700 lines of tests, so it's a well-tested package. Apparently, it's the number one package on CRAN, at least for indirect downloads. I hope that's true. And I've given zero talks about it until today, so I'm very excited. This is my first talk about R6, even though it's been out for a while and is widely used.

Functional vs. encapsulated OOP

So, I'm going to start off with talking about R's built-in object-oriented programming systems. There's a classification that John Chambers uses in his book, Extending R: functional object-oriented programming systems and encapsulated object-oriented programming systems. Functional object-oriented programming is exemplified by the S3 and S4 systems in R. And encapsulated object-oriented programming is exemplified by reference classes in R. And these are the ones that are built into R.

So, the differences between them include this. For functional object-oriented programming, the objects contain data, but class methods that operate on those objects, they're separate. The methods themselves are separate from the objects, and the objects are not mutable. So, whenever you have a function that operates on an object, and if it modifies the object, what actually happens is you get a copy of the object, and then you can choose to store that modified copy or not.

For encapsulated object-oriented programming, the objects contain both the data and the methods. The objects are mutable. They can change when you call a method or if you modify things in it. And it's also known as classical object-oriented programming. So, this is the type of system you'd see with Java, C++, or Python. This is much more common in the wider programming world.

Okay, so for using functional object-oriented programming, you might do something like this: if you've got your object x, you'd call summary(x), and there might be a method for your particular class that gives you a summary. If you want to modify x, you'd call some function, here, modify(x). As I said, that doesn't modify x in place. It returns a new object, which you would have to save back in x.

Now, for encapsulated object-oriented programming, the methods are part of the object. The dollar sign is the subset operator, so you'd say x$summary(). You might have a method that modifies the object, like x$modify(). That could change x in place. You don't have to save that back in x.

If you pass x to a function, like this function here called trim, it may change x, or it might not change x. So, from the outside, you can't tell. Because of this, it can be easier to reason about functional object-oriented programming, where every time you want to modify an object, you always have an assignment operator. Whereas with encapsulated, you don't know if it's going to change. But encapsulated object-oriented programming is useful for representing data that evolves over time.
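To make the contrast concrete, here is a minimal sketch of the two styles side by side. The counter example and all the names in it (new_counter, increment, Counter, bump) are hypothetical and are not taken from the slides; it only assumes the R6 package is installed.

```r
library(R6)

# Functional style (S3): the function returns a modified copy.
new_counter <- function() structure(list(n = 0), class = "counter")
increment <- function(x) UseMethod("increment")
increment.counter <- function(x) {
  x$n <- x$n + 1
  x                       # a modified copy; the caller's object is untouched
}

a <- new_counter()
discarded <- increment(a) # `a` itself is unchanged by this call
a_before <- a$n           # 0
a <- increment(a)         # the assignment is what makes the change stick

# Encapsulated style (R6): the method modifies the object in place.
Counter <- R6Class("Counter",
  public = list(
    n = 0,
    increment = function() {
      self$n <- self$n + 1
      invisible(self)
    }
  )
)

b <- Counter$new()
b$increment()             # no assignment needed; b$n is now 1

# A function that receives an R6 object can change it for the caller too,
# and nothing at the call site shows that.
bump <- function(obj) obj$increment()
bump(b)                   # b$n is now 2
```

In the functional half every change is visible as an assignment; in the R6 half the call to bump(b) changes b with no assignment at all, which is the trade-off described above.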

Some real-world examples of this, where R6 is used, are in Shiny. The Shiny session is represented with an R6 object. It represents the state of the application as the user interacts with it. In dplyr, R6 objects are used to represent database connections. In a relatively new package called processx by Gábor Csárdi, R6 objects are used to represent system processes. In this case, these are all things that have connections to things that are outside of R. This is one very common use case for encapsulated object-oriented programming.

Defining a basic R6 class

Here's a basic R6 class. This is how you define a class. You call the function R6Class(). You give it a label, or a name. In this case, it's Accumulator. Then here, I say there's a list of these public properties. One of them is sum. With this accumulator, I just want to add values to a total sum. The sum starts off at zero. Then there's a method called add. This is a function that takes a value x. It takes self$sum, adds x, and then assigns it back to self$sum. Then this method returns self. It returns the accumulator object so that you can chain calls together.

That defines the class. To instantiate the class, to create an object of that class, you'd say Accumulator$new(), and save that in x. Now I can say x$add(4), or x$add(10)$add(10), because this first add(10) returns that accumulator object, so I can keep using the dollar sign and calling this add method. At the end, I can access the sum just by using x$sum. That's the very basics of creating an R6 class.
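The class described here can be sketched roughly as follows. This is a reconstruction from the spoken description, not a copy of the slide; the capitalized name Accumulator and the use of invisible(self) (so that chained calls do not print the object at the console) are assumptions.

```r
library(R6)

Accumulator <- R6Class("Accumulator",
  public = list(
    sum = 0,                    # running total, starts at zero
    add = function(x) {
      self$sum <- self$sum + x  # update the field in place
      invisible(self)           # return the object so calls can be chained
    }
  )
)

x <- Accumulator$new()
x$add(4)
x$add(10)$add(10)               # chaining works because add() returns self
x$sum                           # 24
```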

Stack example with public and private members

For an example that's a little bit more useful, here's an example of a stack. In Shiny, we use stacks to keep track of various things, keep track of various bits of state. I won't go into the details of this code, but the important things are that there's some public methods here. Push, you push things on the stack. Pop, you take the top item off the stack. Size is a method that tells you how many items are in the stack.

Now the new thing here is that there's also some private stuff. This is a list. It contains one thing, the items in the stack, which also begins as an empty list. They're not accessible from outside. From inside the class, you can use private$items to access that value, so I can modify it or read it using private$items. This keeps it encapsulated. All you get to see from the outside are the functions. To use it, you'd say s <- Stack$new(), then s$push(1), s$push(5:10), s$push("text"), and then if you pop them off, it'll take them off in reverse order. That's a slightly more useful example of a class.
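A stack along these lines might look like the sketch below. The method bodies are an assumption, since the talk skips over the details of the code; only the member names (push, pop, size, and the private items list) come from the description.

```r
library(R6)

Stack <- R6Class("Stack",
  public = list(
    push = function(x) {
      # Append x as a single item, even if x is itself a vector.
      private$items[[length(private$items) + 1]] <- x
      invisible(self)
    },
    pop = function() {
      n <- length(private$items)
      if (n == 0) stop("stack is empty")
      item <- private$items[[n]]
      private$items[[n]] <- NULL   # drop the top item
      item
    },
    size = function() length(private$items)
  ),
  private = list(
    items = list()                 # not reachable as s$items from outside
  )
)

s <- Stack$new()
s$push(1)
s$push(5:10)
s$push("text")

first  <- s$pop()   # "text"
second <- s$pop()   # 5:10
third  <- s$pop()   # 1
s$items             # NULL: the private field is not part of the public interface
```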

If you were to do this with functional programming, you might have a function stack() that creates a stack object and you save it in s. Then when you want to add things to it, you'd have to say push(s, 1), and that doesn't modify s in place, so you have to save that back in s. Every time you push things, you'd have to do that. If you want to get the top item and then remove it from the stack, first you'd call a function that returns that top item, let's say last(s), and save that in x1. Then you'd call removeLast(s) to get a stack without that item, and then you'd have to save that back in s. There's two steps here. Whereas if you're using R6, the pop method both modifies s and returns the value that you're interested in.
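For comparison, a functional version could be sketched like this. The function names stack, push, last and removeLast follow the spoken description; their bodies are hypothetical, and plain functions are used rather than S3 generics to keep the sketch short.

```r
# Each function takes a stack and returns a value or a new stack;
# nothing is modified in place.
stack <- function() structure(list(items = list()), class = "stack")

push <- function(s, x) {
  s$items[[length(s$items) + 1]] <- x
  s                                   # modified copy
}

last <- function(s) s$items[[length(s$items)]]

removeLast <- function(s) {
  s$items[[length(s$items)]] <- NULL
  s                                   # modified copy
}

s <- stack()
s <- push(s, 1)        # must assign the result back every time
s <- push(s, "text")

x1 <- last(s)          # step 1: read the top item
s  <- removeLast(s)    # step 2: get a stack without it and save it back
```

Popping takes two calls here because a function can return either the item or the updated stack, but not both without wrapping them together.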

Initializers and inheritance

Another handy feature of R6 classes is initializer methods. The stack class that I showed you just starts off empty, but you can add a special method called initialize. In this one, it's a function that takes dot-dot-dot arguments and it just assigns those elements as the starting state of the stack. If I say Stack$new(5, 6, 7), then the stack starts with those items in it.
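A sketch of such an initializer, assuming the same hypothetical Stack layout as before (a private items list plus pop and size methods); the arguments given to Stack$new() are forwarded to initialize().

```r
library(R6)

Stack <- R6Class("Stack",
  public = list(
    initialize = function(...) {
      # Arguments passed to Stack$new() arrive here.
      private$items <- list(...)
    },
    pop = function() {
      n <- length(private$items)
      if (n == 0) stop("stack is empty")
      item <- private$items[[n]]
      private$items[[n]] <- NULL
      item
    },
    size = function() length(private$items)
  ),
  private = list(items = list())
)

s <- Stack$new(5, 6, 7)
size_at_start <- s$size()   # 3
top <- s$pop()              # 7, the last item given to new()
```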

It also supports inheritance. If you want to create a subclass, I've got my base class Stack, and then I want to create this fancier stack class called Stack2 with some extra features. I'd call R6Class() with inherit = Stack. Then there's an initialize method. I can add methods and I can override methods. I can add a push method, which overrides the base class's push method. In this one, it takes, again, a variable list of arguments and it loops over them and it calls the superclass's push method, super$push, and just pushes each one onto the stack. You can use super to access methods from the superclass.
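The overriding push method might be sketched as follows. The base class is repeated in a reduced, hypothetical form so that the block runs on its own, and the subclass's initialize method mentioned in the talk is left out.

```r
library(R6)

Stack <- R6Class("Stack",
  public = list(
    push = function(x) {
      private$items[[length(private$items) + 1]] <- x
      invisible(self)
    },
    size = function() length(private$items)
  ),
  private = list(items = list())
)

Stack2 <- R6Class("Stack2",
  inherit = Stack,
  public = list(
    # Overrides Stack's push(): accepts any number of items.
    push = function(...) {
      for (item in list(...)) {
        super$push(item)        # delegate each item to the base class method
      }
      invisible(self)
    }
  )
)

s <- Stack2$new()
s$push(1, 2, "text")
s$size()                        # 3
```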

One nice thing, again, is that the API encapsulates the data, so all you get to see from the outside, going back to the original stack class, is just these methods. If you want to refactor it and make it more performant, make it more efficient, you just have to make sure those methods continue to work properly. If you change the internal data structure, you don't have to worry that somebody has written code that uses that internal data structure and that will break when you change it.

When to use R6

Now, when should you use R6? When you need objects with reference semantics, that's often, again, when you're representing external resources, that's a common case. When you want to encapsulate the data and just expose a particular interface to your data, and also when you don't want to pollute your working environment with a whole bunch of methods. Those methods, instead of just floating out there, they're part of your objects, so they're contained in there.

R6 vs. reference classes

Reference classes existed before R6 did, so here's a comparison between them. Some of the advantages of R6 are that it's simple and it's lightweight. As I said, it's only 588 lines of code. It supports private properties, whereas with reference classes, everything is public. There's robust inheritance. You can have inheritance across packages without any compromises, and that can be a bit of a problem for reference classes.

Also, the generators and the R6 objects are fully encapsulated. Once you've created them, they don't call any functions from the R6 package, which means that you can save them out and then load them in another R session which has, say, a different version of R6 or doesn't even have R6 installed at all, and they will continue to function. That can avoid a class of hard-to-diagnose bugs.

For reference classes, it's built into R, which is a big thing, so there's built-in support everywhere. It's built using S4 and the methods package. Also, the fields can have types in reference classes. This is a comparison. Obviously, I'm biased. I'm sure somebody here will have something to say about this. For more information, you can see the vignettes. I think that's a pretty good introduction if you go to the CRAN package page, or you can just run the vignette function.

How R6 is implemented

How R6 is implemented. R6 objects are just environments with a particular structure. It's pretty simple once you understand it. Normally, when you're running R code, you have your object here. This is just the value 10. You have a binding to that object. This is the name x, and it refers to this object here, value 10. That binding lives in an environment. You're probably used to working in the global environment if you're working in R at the console. Environments have parents, so that's what this arrow represents here, that relationship. If it doesn't find a binding here, it'll search in the parent environment and it'll keep going until it hits the bottom.

There's also function objects. They have a special thing, which is this arrow here that represents the enclosing environment. That's, in some sense, the environment where the function runs. It'll search for things in this enclosing environment. This is the basic way that objects, bindings, and environments work.

For R6, an R6 object consists of usually two or maybe three environments, I should say. One is the public-facing environment. I call it the public binding environment. It has x, and this is a different function name from before. It's pow x y. This points to a function, and this function's enclosing environment is not this public binding environment. It's a special enclosing environment which contains just the binding self. Self refers back to this public binding environment. If you want to access x, you'd say self$x. You can't just access x from that function directly. The parent of this enclosing environment, if you're in a package, would typically be the package namespace, so these functions can call all the internal functions from that package.
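The two-environment structure can be inspected directly. The following is a small sketch using a hypothetical Adder class with a field x and a method add (these names are not from the slides); it relies only on base R functions and the .__enclos_env__ binding that R6 places in each object.

```r
library(R6)

Adder <- R6Class("Adder",
  public = list(
    x = 1,
    add = function(y) self$x + y   # must go through self to reach x
  )
)

obj <- Adder$new()

is.environment(obj)                # TRUE: obj is the public binding environment
enclos <- environment(obj$add)     # the method's enclosing environment
ls(enclos)                         # "self": the only binding, since there is
                                   # no private list and no superclass
identical(enclos$self, obj)        # TRUE: self points back at the object
identical(obj$.__enclos_env__, enclos)  # TRUE: the object keeps a reference to it

# The enclosing environment's parent is the environment where R6Class()
# was called; inside a package that would be the package namespace.
parent.env(enclos)
```

A class with private members would also show a private binding in the enclosing environment, pointing at a third environment that holds the private fields and methods.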

It looks like I'm out of time. I'll just go back to the original slide. This is the basics of how R6 is implemented. There's two separate environments. I will go back to this, and I guess I can take some questions.

Q&A

As you mentioned, the reference classes seem very similar to classes implemented in C++, Java, Python. I was wondering if you could do a quick compare-contrast, if there are things you can do here that can't be done there, or vice versa.

I am not expert enough in those languages to really say. I'm sorry.

Since the functions or the methods are hidden behind the R6 class, what's the best way to troubleshoot or profile a function when we're using R6?

For troubleshooting, you can call debug, not on the class, but on a method of an instance, like debug(s$push) or something. You call debug on that, and then when you run it, you can step through it. Or you can put a browser() statement in it. That's what I do a lot. For profiling, I use profvis. It's a profiling package. It should show up in a way that's somewhat easy to understand using profvis.

Years ago, S4 was considerably slower than S3. I don't know if that's the case anymore. Do you have any comment on the speed of R6 compared with the others?

When I created R6, I wrote performance comparisons with reference classes, and reference classes are built on S4. That's actually in one of the vignettes. And it's considerably faster. I mean, doing simple operations, like just accessing a field or modifying a field, can be, I don't know, on the order of 5 or 10 times faster because there's type checking that happens with reference classes. The vignette is automatically built every time I submit to CRAN, so there's benchmarks on there, which I haven't looked at recently, but I assume that there's still quite a bit of performance advantage.

So, can you defeat the privacy using, like, the triple colon operator or things like that?

There is a way, yeah. It's actually not that hard. It used to be much harder, but now you can do, like, x$ and then there's a special variable in there called .__enclos_env__. If you have your tab auto-completion, it should just come up. And then you do that, that's the enclosing environment, so from there, you can access private. Let me see if I can, yeah, so, that refers to this thing, and you can access private from there.
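As a small illustration of that answer, here is a sketch with a hypothetical Secret class; the only R6 feature it relies on is the .__enclos_env__ binding mentioned above.

```r
library(R6)

Secret <- R6Class("Secret",
  public = list(
    reveal = function() private$value
  ),
  private = list(
    value = 42
  )
)

x <- Secret$new()
x$value                            # NULL: not visible through the public interface
x$.__enclos_env__$private$value    # 42: reachable through the enclosing environment
```

So private members are hidden by convention and by the public interface, not protected against a determined caller.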

Sorry, is a private property of a class, does that, is that still private for the subclass?

Sorry, can you say that again?

Is it private for the subclass of that class?

The subclass can access it.

Can access it, so it's not private.
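A minimal sketch of what that answer means in practice, using hypothetical Base and Child classes: a method defined in the subclass can read a private field declared in the base class.

```r
library(R6)

Base <- R6Class("Base",
  private = list(secret = "hidden")
)

Child <- R6Class("Child",
  inherit = Base,
  public = list(
    # Defined in the subclass, but reads the base class's private field.
    peek = function() private$secret
  )
)

obj <- Child$new()
peeked <- obj$peek()   # "hidden"
obj$secret             # NULL: still not visible from outside the object
```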

Byte compilation, does that all work fine with R6, or?

I believe that when you build a package with R6, the byte compilation does not work. I'm not an expert on that stuff. I believe if you use, like, the just-in-time compiler, it will probably compile it as it goes, but it will not, as far as I know, compile it for the class as a whole. It'll probably compile each instance, each method instance.