
Hadley Wickham | vctrs Tools for making size and type consistent functions | RStudio (2019)
vctrs is a new package that provides tools (cognitive and computational) to ensure that functions behave consistently with respect to inputs of varying length and type. The end goal of vctrs is to be invisible to the end user of the tidyverse (simply enabling their predictions about function outputs to be more correct), but it will also help developers write functions that "just work".
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Today I want to talk about a new package that I've been working on called vctrs and this associated idea of type stability, which basically means that ideally it should be easy to predict what type of thing your function returns.
But before we get into that, I want to talk a little bit about the motivation in terms of some WATs. How many of you have heard of WAT before? Okay, WAT is like WTF, but even more surprising. There's a really great talk at this URL that created this term, talking about some of the really surprising things about Ruby and JavaScript, and so I'm going to show you some really kind of crazy stuff in R, and this is sort of the motivation.
WAT moments in R
So I'm going to show you a little bit of code. Here I've got two factors, and I combine them together, and what do I get? WAT. You might hope that I'd get an "a" and a "b" in there, and maybe the output would be a factor. You probably did not expect to get two 1s out of that.
Or what happens if I take a date and a datetime and combine them together? WAT. It took me a while to even figure out what was going on here, but this is the date I gave it, followed by a date several million years in the future, on the 24th of January. Or what happens if you do the other thing? What happens if you combine a datetime with a date? WAT. Again, we get the datetime that I gave it, plus this weird date in 1969.
What happens if you even look at a datetime the wrong way? What happens if you take a datetime and just concatenate it with nothing? You lose the time zone. Or what happens if you combine it with a literal nothing, a NULL? You get this random number.
So I seriously hope that no one would argue that this behavior is correct. There are some reasons for this. They're not great reasons, but there are accidents of history that have led us to this place, where we can get some pretty surprising results from pretty unsurprising data types and a pretty unsurprising operation, c().
And what's even worse is that I've just shown you one function, but there are a bunch of other places where we have to take two vectors of different types and combine them together. You get this in unlist() and rbind() and ifelse() and merge() and a bunch of other places. And I don't mean to just pick on base R here; the tidyverse has exactly the same problem, in that some functions for combining two vectors together do something different from other functions, which is clearly a really, really bad thing.
The rules for combining types
And so the motivation for the vctrs package was to think about: what currently happens is clearly crazy, so how can we do better? What should we do? And I think we can get some hint of that by looking at the rules that c(), concatenate, applies to atomic vectors, because the rules here are pretty simple and pretty reasonable.
And here I've displayed them in painful detail. If you've got a logical and a logical, you get a logical. If you've got a logical and an integer, you get an integer. If you list them all out like this, it seems pretty complicated, but you can display this in table form too. If you've got a logical and a logical, you get a logical. If you've got a double and a character, you get a character. And if you stare at this table a little bit, you'll notice that it's actually symmetric.
And that's a really nice principle, right? Because it means it doesn't matter which order the arguments are in; you're going to get the same type of output. And it means you've got less than half of the table to remember now.
And that also means we can express it even more simply. This is technically a tree, a very simple tree with no branches; I'll show you a tree with some branches a little bit later. You work this tree by taking, say, a logical and a double, and following the arrows until they meet up at a common place, which in this case is a double. So you just put your finger on the two types and then follow the arrows from one to the other to know which type you're going to get.
And so this is a pretty simple summary of the rules that base R pretty consistently applies to atomic vectors. And I think even if you've never been explicitly taught this, you've probably picked it up implicitly, because this rule is consistently applied.
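These rules are easy to verify at the console with typeof():

```r
# Coercion always moves toward the "richer" type:
# logical -> integer -> double -> character
typeof(c(TRUE, 1L))   # "integer"
typeof(c(1L, 2.5))    # "double"
typeof(c(2.5, "a"))   # "character"

# Symmetry: the order of the arguments doesn't change the output type.
typeof(c(TRUE, "a"))  # "character"
typeof(c("a", TRUE))  # "character"
```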
The problem with c(), however, arises when we get to the more complicated S3 vectors that are built on top of these atomic vectors, because there aren't any simple rules anymore. When we get to factors and dates and datetimes and data frames, things start to get more complicated, and it is exceedingly difficult to elucidate the complete set of rules.
Figuring out what the rules should be, at least in my opinion, is the goal of the vctrs package. The thing I like most about this package is possibly the logo, which is very Tron inspired.
So the goal of vctrs is to figure out these principles once and for all and apply them everywhere. The goal of vctrs is in some sense a little bit weird, because I think if vctrs is successful, and there is this one consistent set of rules, the package itself becomes invisible. So my goal is that you should never, ever have to know that this package exists, unless you're a package developer and you start to dive into the details. But for now, we can start to learn this pretty simple theory.
And the basic idea of vctrs is to take that very simple tree we had before, that line, and turn it into a forest. The same principles apply: if you want to combine two types of vector, you put your finger on each one, and then you follow the arrows. You will notice, however, that this is not a tree; this is a forest. Some of the things are not connected. You cannot put your finger on factor and double and find any way to go between them. And I think this is really, really important. There are some coercions that simply do not exist.
There is no way that it makes sense to combine a factor and a datetime together. Instead of trying to guess at what you mean, it's much safer to get an error that forces you to confront the fact that you've got a mix of weird data types. You have to deal with that yourself.
Live demo of vctrs
So I'm going to do a little demo of this. Live demos are always risky, but this should be fairly safe. I've got the vctrs package loaded, and basically I'm going to show you this function called vec_c(), which is basically the same as the c() function you're familiar with from base R, but obeys this consistent set of rules. If I put a logical and an integer together, I get an integer; a logical and a double gives a double. If I take a double and an integer, I get a double. These are the rules that you're already familiar with. You don't even have to think about them.
It is a little bit stricter about coercing character vectors. If you try and combine a double and a character together, you get an error. This is unlike base R, where you can just combine these things that are a little bit different, and R will just do it for you.
vctrs can be a little bit more strict in general because it has this escape hatch: all of the vctrs functions allow you to declare the desired prototype of the output. So here I can say, actually, I want the output to have this character prototype. And I call it a prototype because you don't give it the name of a type; you give it an example of a type. Here I'm saying, make sure the output looks like an empty character vector. And now, instead of implicitly coercing the vectors to a common type, we're going to explicitly cast to the type that we have provided. So we have this consistent way to say, I want to make sure the output is a character vector, and it will either be a character vector or you will get an error. You will never get any other type of output when you call the function in this way.
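Assuming the vctrs package is installed, the demo above looks roughly like this (note that exactly which implicit coercions and explicit casts are allowed has been tuned in later vctrs releases):

```r
library(vctrs)

vec_c(TRUE, 1L)   # logical + integer -> integer: c(1L, 1L)
vec_c(TRUE, 1.5)  # logical + double  -> double:  c(1, 1.5)

# Mixing double and character is an error, unlike base c():
# vec_c(1.5, "a")  # Error: can't combine <double> and <character>

# The escape hatch: declare the prototype of the output.
# The result is guaranteed to be a character vector, or an error:
vec_c(factor("a"), factor("b"), .ptype = character())  # "a" "b"
```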
We get a similar property with factors. We can combine them together, and we always get a factor back. Note that the order in which we combine the factors is important, because the levels are different. If I combine a and b first, the levels are a and then b. If I combine b, then a, the levels are in the opposite order. So it turns out it's not possible to be completely symmetric here, because the only option, if I wanted to be completely symmetric, would be to sort the factor levels, which seems like a really bad idea and not very useful.
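Concretely (again assuming vctrs is available):

```r
library(vctrs)

fa <- factor("a")
fb <- factor("b")

# vctrs unions the levels in the order the inputs are supplied:
levels(vec_c(fa, fb))  # "a" "b"
levels(vec_c(fb, fa))  # "b" "a": input order decides level order
```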
So the thinking behind vctrs is a combination of philosophy and pragmatics. There are some things that would be philosophically more beautiful or more pure, but pragmatically would cause you a bunch more pain when trying to use this in your code. vctrs tries to balance those things: it tries to steer you in a direction of greater safety, of greater strictness, of this idea of type stability, which I'll come back to shortly, and away from just always doing something, whatever happens.
The same thing works with ordered factors. If you try and combine a factor and an ordered factor, you're going to get an error. You can combine factors and characters, or, again, you've always got the escape hatch: if there isn't an automatic coercion, I can always say I want the output to be of this type. So vctrs is going to be more likely to give you errors than base R is, but it gives you a standard way of resolving those errors by explicitly declaring the type of output that you want.
And again, with dates and datetimes, this is basically the most boring demo ever, because it just does what I think any reasonable person would expect. If you combine dates and datetimes together, you don't get crazy dates far in the future or far in the past. And if you just concatenate a single vector with itself, the time zone is unchanged. When you combine different things in base R, as I showed you before, if you combine a factor and a datetime you get an integer, but if you combine a datetime and a factor you get a datetime plus a weird date. vctrs just always gives you an error.
And then there's always one last escape hatch. What happens if you really do want to combine a datetime and a factor into a single vector? Well, you can always fall back to a list, because a list in R can contain anything. So we can always say, make sure the output is a list, and it will just combine all these things together.
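A sketch of that list fallback (assuming vctrs is installed):

```r
library(vctrs)

# list() is the universal container, so it always works as a prototype:
mixed <- vec_c(Sys.time(), factor("a"), .ptype = list())
length(mixed)   # 2
is.list(mixed)  # TRUE
```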
Type stability
So the underlying idea behind vctrs is this idea of type stability, which I'm still trying to figure out how to articulate as clearly as possible. The way I think about it is this: when you're reading R code, what you are actually doing is running that code in your internal mental model of R. Obviously, your internal mental model of R is much, much simpler than the real thing, but one thing that's really useful to track in it is the type of each variable. And I notice that when I read other people's code, when I'm reviewing pull requests, for example, if I look at a variable and I have no idea what type it is, whether it's a list or a character or a data frame, that code feels very dangerous to me, because I cannot accurately predict what's going to happen.
So the first principle of type stability is that the output type should depend only on the types of the inputs. One violation of this in base R is the ifelse() function. Here I've got three calls to ifelse(). The first argument is always a logical, the second argument is always an integer, and the third argument is always a character vector. And yet, depending on the value of the logical vector, the output could be a logical vector, an integer vector, or a character vector. I think this is an undesirable property, because when you read code that involves ifelse(), it makes it hard to know what type of output you're going to get.
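You can check this in base R directly; all three calls have the same input types but return three different output types:

```r
# First argument logical, second integer, third character in every call:
class(ifelse(NA,    1L, "a"))  # "logical" (an NA test gives an NA result)
class(ifelse(TRUE,  1L, "a"))  # "integer"
class(ifelse(FALSE, 1L, "a"))  # "character"
```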
And again, I don't want to say the tidyverse is perfect; it's far from perfect. dplyr's if_else() is actually equally annoying, but in the opposite direction: it's too strict, which forces you, if you want to insert a missing value, to know about all of the different typed missing values (NA_integer_, NA_character_, and so on) that you should never be forced to confront, and it's pretty annoying.
The second principle is that if you're just combining a bunch of vectors, if your function has ... in it, the order of the inputs should not affect the output type. I showed you lots of examples of this with c(): ideally, the order in which you supply arguments to c() should not affect the output type. This is the idea that gave us the symmetry in the table. It makes predicting the output much, much simpler, because there's a much smaller set of rules that you need to remember and internalize.
And again, the goal here is not that you should ever have to explicitly memorize the rules, but hopefully, by having this small set of reasonable and consistent rules, you will just come to learn them over time, and your naive guess about what a function will return should be more accurate from the get-go.
And then the final principle is almost so obvious that you shouldn't have to say it: ideally, there should be one set of rules that is applied everywhere. Just to illustrate, if you concatenate two factors together, you get an integer vector, but if you put those factors in a list and then unlist them, you will get a factor, as you expected. The fact that we have these inconsistencies makes it hard to build up an accurate mental model, and it gets confusing.
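To see the inconsistency at the console (note that c() for factors was changed in R 4.1 to return a factor, so the first behavior only reproduces on older versions of R):

```r
fa <- factor("a")
fb <- factor("b")

# In R < 4.1, c(fa, fb) returned the bare integer codes: c(1L, 1L).
# unlist() has always had its own, different rule: it rebuilds a factor
# from the union of the levels.
u <- unlist(list(fa, fb))
is.factor(u)     # TRUE
as.character(u)  # "a" "b"
```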
And I think there's sort of both this first-order effect that when you look at some code, you're like, oh, I don't know what this is going to return, but you also get the second-order effect that's even worse, that over time you sort of get this learned helplessness. You're like, I have no idea, I've been burned so many times before, I've got no idea what this is going to return, so I'm just not even going to try to guess, which I think is even more harmful.
Other features of vctrs
Now, vctrs does a bunch of other things, which I want to talk about briefly. Max really, really, really wants to give me the gong if I talk too long, so I don't want to make him too happy, but I want to briefly cover these other parts of vctrs.
So one thing that I have recently discovered about data frames is that you can have a column of a data frame that is itself a data frame. And this is supported in base R, so here I'm creating a data frame, and then I'm making a column of that data frame that is itself a data frame. And you may ask yourself, well, why on earth would I ever do such a crazy thing? And I don't really have a great answer for you right now, but it seems pretty cool and like it might be useful, and so I thought it would be worthwhile to kind of work through all of the implications of that, and so that's one of the things that vectors does.
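A quick base-R sketch of a data frame column that is itself a data frame:

```r
df <- data.frame(x = 1:2)
df$y <- data.frame(a = 3:4, b = 5:6)  # assign a whole data frame as one column

ncol(df)             # 2: the columns are x and y
is.data.frame(df$y)  # TRUE
df$y$a               # 3 4
```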
Another thing very closely related to type stability is an idea that I call size stability. This is basically about the recycling rules of base R: when you combine two vectors of different lengths, there are these things called the recycling rules, which determine what's going to happen. I think the recycling rules are a really good idea, but they have two flaws. The first flaw is that if you combine a vector of length 2 and a vector of length 3, base R will recycle the vector of length 2 a non-integer number of times, which I think is just a little bit too clever in most cases. The other problem is that if you start looking into the recycling rules with a fine-tooth comb, you discover there isn't actually one set of them; there are something like four sets of recycling rules that are all subtly different.
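A minimal sketch of the first flaw, with vctrs::vec_recycle() showing the stricter tidyverse rules for comparison (assuming vctrs is installed):

```r
# Base R recycles length 2 against length 3 with only a warning,
# repeating the shorter vector a non-integer number of times:
x <- suppressWarnings(1:2 + 1:3)
x  # 2 4 4

# The tidyverse recycling rules only recycle size-1 inputs:
library(vctrs)
vec_recycle(1, 3)     # 1 1 1
# vec_recycle(1:2, 3) # Error: can't recycle input of size 2 to size 3
```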
So the idea of these size-stable functions is that, given the sizes of the inputs, you should be able to easily predict the size of the output. And then the last feature of vctrs is that we should have easier ways to create new vector classes. If you have ever tried to create a new type of vector in R, something like a factor or a datetime, currently it's really painful: you need to understand a lot about the S3 object system, and you've got to define methods for a lot of different generics. vctrs provides some support here, with clear documentation: if you want your vector to do this, you need to define these methods; if you want it to do that, you need to define these other methods.
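As a hedged sketch of what that looks like, here is a toy "percent" class built with vctrs::new_vctr() (the class name and helpers are illustrative, similar in spirit to the example in the vctrs S3-vectors vignette):

```r
library(vctrs)

# Constructor: a double vector with an extra class attached.
percent <- function(x = double()) {
  new_vctr(vec_cast(x, double()), class = "percent")
}

# Defining format() is enough to get printing for free
# from the vctrs_vctr base class.
format.percent <- function(x, ...) {
  paste0(vec_data(x) * 100, "%")
}

p <- percent(c(0.1, 0.25))
class(p)   # "percent" "vctrs_vctr"
format(p)  # "10%" "25%"
```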
So if you want to learn more about vctrs, two places to start are the vctrs website itself, which has a lot of very detailed information about how the package works (not that accessible right now, but hopefully it will get a little easier to understand over time), and the vctrs material in Advanced R. One of the big goals of the tidyverse in 2019 is to thoroughly integrate these ideas.
One last thing before I go: if you are interested in working with RStudio on the tidyverse or Shiny team, we are having an internship program again this year. We'll have about 10 interns in paid positions over the summer. You have to be able to work in the U.S.; we're working on international students for next year. If you would like to find out more or apply, go to this URL. Thank you.
Q&A
We probably have time for one or two questions while we switch over mics. So if you got a question, wave your hand, and someone will throw a mic to you or at you.
This is really exciting, but one of the things I always struggle with when we talk about these things is that, in the community, we don't have a great concrete definition of what we're talking about when we say type. Does the metadata matter for the type, or are we just talking about built-in types? Are we talking about classes? Are there resources, or some kind of common language we're developing, for what we mean by this stuff?
Yeah, so that's another aspect of the vctrs package: this idea of the prototype, which I just kind of glossed over here. The goal of that is to provide a consistent and concrete definition of what I actually mean when I say type. I think 90% of the time your everyday understanding is enough, but yeah, we need a very concrete and precise definition, and my goal is to supply that in the vctrs package.
We could do one more quick question. So, I thought that factors were internally represented as integers, yet they were on separate trees in the forest that you showed. Is it not the case that they're integers in their normal representation, or is there something else that's different there?
So they are internally represented as integers, but you as a data scientist should never be forced to grapple with that. The fact that when you concatenate them together that underlying implementation is exposed is, I think, a really bad thing. And so the idea of vctrs is to hide that from you.
So I noticed there's a bunch of people coming in for Jenny's talk. So if there is a seat next to you, please move towards the middle, or do not be surprised if the tidyverse suddenly stops working on your computer.

