
Jesse Sadler | vctrs: Creating custom vector classes with the vctrs package | RStudio (2020)
The base R types of vectors enable the representation of an amazingly wide array of data types. There is so much you can do with R. However, there may be times when your data does not fit into one of the base types and/or you want to add metadata to vectors. vctrs is a developer-focused package that provides a clear path for creating your own S3-vector class, while ensuring that the classes you build integrate into user expectations for how vectors work in R. This presentation will discuss the why and how of using vctrs through the example of debkeepr, a package for integrating historical non-decimal currencies such as pounds, shillings, and pence into R. The presentation will provide a step-by-step process for developing various types of vectors and thinking through the design process of how vectors of different classes should work together.
Transcript
This transcript was generated automatically and may contain errors.
Thank you. All right. So my name is Jesse Sadler. I am a historian. I teach at Loyola Marymount University. And so even though I am talking about vectors today, what I normally do, and just quickly, the slides are online and I'll have a link at the end, but what I normally do is I study history. I am a historian. I'm interested in merchant networks and merchant families in the 16th and 17th century. And I'm particularly interested in inheritance and what inheritance shows about how people interacted with each other and how families interacted with each other.
And so that leads me to doing math, specifically in terms of doing math with non-decimal currencies. So these are currencies like here, this is Flemish pounds, but basically currencies that are in pounds, shillings, and pence. And this is a problem because we don't learn how to do non-decimal arithmetic anymore. And computers are obviously based on decimal arithmetic. And so you can't just plug these numbers into a computer and do an analysis on them. It doesn't really work that way, unfortunately.
The problem space: compound unit arithmetic
So what exactly is the problem space? The problem space is that with non-decimal currencies, you have to do something called compound unit arithmetic. So you have to add up the separate columns and then divide by the base of that column, carrying the quotient over to the next column and keeping the remainder. And then you can get your answer. So in other words, we have a number of different problems here. We have three separate units that make up one value. Those units do not have decimal bases. So therefore, we need to use this compound unit arithmetic. And finally, to make things even worse, the bases of the units can differ. So the most commonly used bases in historical currencies are 20 shillings in a pound and 12 pence in a shilling, but they could change.
So you guys didn't know you were getting so much history today. JJ, in the keynote, already talked about history of corporations. This is what he's talking about. I'll have some ideas about corporations.
So there's a problem, and you can't just plug this in. However, programming languages like R are very flexible. And so you can use them to do this compound unit arithmetic yourself. So what we want to do in a situation like this is we want to take these totals, which are just added up, and then normalize them, do the compound unit arithmetic in order to get the answer that we want. And so you can make what I'm calling here a normalize function, which basically takes a vector of length 3, the number of units that we have, and then normalizes them using integer and remainder division, and then brings them all together.
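The normalization described here can be sketched in a few lines of base R. This is only an illustration, not the code from the slides: the function name normalize, the bases argument, and the sample amounts are assumptions, and the sketch only handles whole, non-negative numbers.

```r
# Hypothetical sketch: normalize c(pounds, shillings, pence) totals.
# bases = c(shillings per pound, pence per shilling).
normalize <- function(x, bases = c(20, 12)) {
  stopifnot(is.numeric(x), length(x) == 3)
  # Carry whole shillings out of the pence column.
  shillings <- x[2] + x[3] %/% bases[2]
  pence <- x[3] %% bases[2]
  # Carry whole pounds out of the shillings column.
  pounds <- x[1] + shillings %/% bases[1]
  shillings <- shillings %% bases[1]
  c(pounds, shillings, pence)
}

# Add three amounts column by column, then normalize the totals.
amounts <- rbind(c(5, 15, 9), c(3, 12, 8), c(2, 4, 2))
totals <- colSums(amounts)   # 10 pounds, 31 shillings, 19 pence
normalize(totals)            # 11 pounds, 12 shillings, 7 pence
```

Passing a different bases argument is all it takes to handle a currency whose units do not follow the 20 and 12 convention.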
Building an S3 class
OK, so this is great. We have a normalize function. We're ready to go. We've got something. We can even create an S3 class. And S3 classes, who here has built their own S3 class before? All right, a number of you, right? S3 classes are great because they're really easy to make, right? I just built an S3 class, right? The key here is the structure part, right? And so this is a class that has a value in it, x. And then I call the class lsd, which is the Latin terminology for pounds, shillings, and pence. And then I've added a bases attribute so that we can know what the bases of the shillings and pence units are.
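A bare S3 class of the kind described here can be sketched with structure() alone. The constructor name lsd and its arguments are assumptions made for illustration; the slide's exact code is not reproduced in the transcript.

```r
# Hypothetical constructor: a bare S3 class built with structure().
lsd <- function(x, bases = c(20, 12)) {
  stopifnot(is.numeric(x), length(x) == 3)
  structure(x, class = "lsd", bases = bases)
}

x <- lsd(c(10, 3, 6))
class(x)          # "lsd"
attr(x, "bases")  # 20 12
unclass(x)        # the underlying numbers plus the bases attribute
```

Nothing beyond the class label and the attribute exists at this point, which is why printing, subsetting, and combining all still need methods of their own.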
All right? And so I can just do this, and I can create something. And it doesn't have a nice printing method or anything right now, but I've created an S3 class. OK, so great. I've created an S3 class. Now I'm going to sit down. I'm going to figure out what am I going to do with this? What else do I need to do? Well, right now, I can only have one value at a time. So maybe I need to have multiple values, so I should use a list or something. And then I need to change the normalization method. OK, no problem.
All right, so what else do I need to do? OK, at this point, it gets really confusing because you have to implement all these methods. And it's not really clear how exactly you're going to do them, and what methods you have to implement, and why you need to do them, and so on. And so basically, I went through this, and I implemented some methods and some I didn't. And then I talked to Hadley Wickham, and he said, no, what you should do is that you should use vctrs. And so vctrs is a package that gives you a path to implementing all of those different methods, to actually creating your own S3 class without having to know everything there is to know about the different S3 classes.
Why use vctrs
So the goals of the vctrs package are fairly simple in some ways, but very deep. It's about type stability and size stability, and Hadley talked about this at the last RStudio conference. But I am going to focus on this last aspect of building S3 classes.
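As a rough illustration of what type stability and size stability mean in practice, the sketch below uses a few vctrs functions on base vectors. The inputs are made up for the example.

```r
library(vctrs)

# Type stability: the common type depends only on the input types.
vec_ptype_common(1L, 2.5)   # an empty double vector
# Incompatible types are an error rather than a silent conversion.
tryCatch(vec_c(1, "a"), error = function(e) "incompatible types")

# Size stability: only size-1 inputs are recycled.
vec_size_common(1:3, 10)    # 3
tryCatch(vec_size_common(1:3, 1:2), error = function(e) "incompatible sizes")
```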
So what do you get by using this vctrs package? Well, as I said before, you get a clear path. But you also get consistency with base R. It's just an S3 class, and so there's no reason why your users would need to know that it's based on vctrs. And then finally, it is integrated and continuing to be integrated into the tidyverse.
Okay, so what I want to do in this talk, and for the rest of the talk, basically, is talk about why you might want to create your own S3 class through this example of non-decimal currencies, why you should use vctrs, and to point you to some things of how you can do it. And I don't have enough time to go through a whole tutorial on this and show all the different steps. And so what I've done is I've created a package that I'm calling debvctrs, which is a simplified version of debkeepr, the package where I've done more of this, and you see the URLs there. And what the debvctrs package does is provide a step-by-step tutorial to get through this, and I will go through some of these steps now.
Six steps to creating an S3 class with vctrs
Okay, so there's basically six different steps to creating an S3 class with vctrs. Again, I'm not going to go through all these. I'm going to concentrate on the first half, the first three, and I won't go into the last three, but I will talk about them. You'll see how they are when they're implemented. And if you're creating a class that's based on double vectors, you only have to do four of these; two of these steps you get for free. And so just to give you an idea, so debvctrs is a package, and it's tested, and it has everything, but the way that the scripts are organized is to follow this, and then within the scripts, there's a lot of different comments on how to do these things. So I definitely recommend that you go and look at this that's on GitHub, and you see the URL there.
Okay, so let's go through some of these choices that I made with debkeepr/debvctrs. And so just as a reminder, right, to go back to our problem space, we have to deal with these issues of compound unit arithmetic, of representing a single value with three different numbers. So in designing this, I thought about different things. So one, I didn't want to just decimalize everything. I wanted to maintain the structure of the numbers. I wanted to keep track of the bases of the shillings and pence because they could differ, and I wanted to make sure that if two values had different bases that they couldn't be combined in any way.
But at the same time, I did want to create a decimalized class that I could use as a fallback, and this would have the same bases attribute, so that you could follow what the different bases of the shillings and pence are, but in addition, it would have an attribute for the unit. So it would be a decimalized value of the pounds unit, or the shillings unit, or the pence unit, right? And different units can come together because they can be the same currency, but there needs to be a way in which that's done. And so I'm calling these classes deb_lsd, where deb is short for double-entry bookkeeping, which is where you find all these different values, and then deb_decimal.
So again, I'm not going to go into this, but this just gives you an idea of these are the steps within the first major step of creating the vector. And so it's broken down here. And I'll just give you this slide, which is a simplified version of the creation of the class. This doesn't have any checks or anything, but you can see here the different arguments that we have on the left, the different vectors that can be put in for pounds, shillings, and pence. And then on the right, just the single one, and then the attributes of each class.
And then here we have the actual creation. And so with the deb_lsd, when we want to keep these tripartite structures, I'm going to use a record-style vector. And so this is essentially a list that has vectors of equal length. And so that gives us a basis from which we can have this different structure, this tripartite structure. On the right, it's just essentially, again, a double vector.
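The two kinds of vector can be sketched with the vctrs constructors new_rcrd() and new_vctr(). This is a simplified illustration rather than the package's code: the constructor names new_lsd and new_decimal, the field names l, s, and d, and the defaults are assumptions, and there are no input checks.

```r
library(vctrs)

# Record-style vector: a list of equal-length fields, one per unit.
new_lsd <- function(l = double(), s = double(), d = double(), bases = c(20, 12)) {
  new_rcrd(list(l = l, s = s, d = d), bases = bases, class = "deb_lsd")
}

# Decimalized vector: a plain double with unit and bases attributes.
new_decimal <- function(x = double(), unit = "l", bases = c(20, 12)) {
  new_vctr(x, unit = unit, bases = bases, class = "deb_decimal")
}

x <- new_lsd(l = c(10, 3), s = c(3, 11), d = c(6, 3))
vec_size(x)       # 2: two values, each made of three fields
field(x, "s")     # 3 11
y <- new_decimal(c(10.175, 3.5625), unit = "l")
attr(y, "unit")   # "l"
```

The record-style vector has no format() method yet, so the sketch inspects it with field() instead of printing it.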
So here we have now, I'm skipping some steps, right? But this is what it looks like if we implement our class. And so we can start here on the left. We have now a function where we can create a class. And so we have different things that we can add into the pounds values, the shillings values, the pence values. And here, we're going to take our standard bases of 20 and 12. And on the right is an equivalent vector that is decimalized. And here we have the printing methods that I've chosen. So again, these are things that you can create. So here, I've chosen to include the bases attribute and the unit attribute. So it's very clear. And both work natively with tibble. The record-style vectors are not fully integrated, but they're getting there.
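Printing a record-style vector comes down to supplying a format() method. The sketch below assumes a hypothetical constructor new_lsd with fields l, s, and d, and the output layout is one possible choice, not necessarily the one shown on the slide.

```r
library(vctrs)

# Hypothetical constructor, repeated so the example is self-contained.
new_lsd <- function(l = double(), s = double(), d = double(), bases = c(20, 12)) {
  new_rcrd(list(l = l, s = s, d = d), bases = bases, class = "deb_lsd")
}

# format() controls how each element is shown when the vector prints.
format.deb_lsd <- function(x, ...) {
  paste0(field(x, "l"), ":", field(x, "s"), "s:", field(x, "d"), "d")
}

x <- new_lsd(c(10, 3), c(3, 11), c(6, 3))
format(x)   # "10:3s:6d" "3:11s:3d"
```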
Casting and coercion
So the last thing I want to talk about here that I have time to talk about is the issue of casting and coercion, which is really at the heart of what is happening as you are creating your vectors, your classes, and want them to interact with other different classes. So the workflow is fairly simple. You have boilerplate that you're going to use each time. And this is the same for casting and coercion, where you define the method for the class, and then you give a default. Then you do the methods within the class, and then the methods with any compatible classes that you decide to choose.
So coercion and casting are kind of two sides of the same coin. Coercion looks for and determines what the common type is with this function, vec_ptype2(), while casting does the actual transformation. And so things like comparison between classes are made possible by implementing both of these things.
So again, I don't have a chance to go through all this, but coercion is really about design choices. There's not necessarily that much code that goes into something like this. And so in this instance with debvctrs, I decided to have a double go to a deb_decimal, which would then go to a deb_lsd. So if you had these three types in a combine function, then you would get a deb_lsd.
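The hierarchy of double, then decimalized class, then tripartite class can be sketched as a set of vec_ptype2() methods. The sketch assumes vctrs 0.3.0 or later, where methods are named vec_ptype2.<x class>.<y class> and the per-class boilerplate mentioned in the talk is no longer required. The constructors new_lsd and new_decimal are hypothetical, and the check that bases match is left out.

```r
library(vctrs)

new_lsd <- function(l = double(), s = double(), d = double(), bases = c(20, 12)) {
  new_rcrd(list(l = l, s = s, d = d), bases = bases, class = "deb_lsd")
}
new_decimal <- function(x = double(), unit = "l", bases = c(20, 12)) {
  new_vctr(x, unit = unit, bases = bases, class = "deb_decimal")
}

# Each method returns an empty prototype of the common type.
lsd_ptype <- function(x, y, ...) new_lsd()
decimal_ptype <- function(x, y, ...) new_decimal()

# Anything combined with the tripartite class becomes the tripartite class.
vec_ptype2.deb_lsd.deb_lsd <- lsd_ptype
vec_ptype2.deb_lsd.deb_decimal <- lsd_ptype
vec_ptype2.deb_decimal.deb_lsd <- lsd_ptype
vec_ptype2.deb_lsd.double <- lsd_ptype
vec_ptype2.double.deb_lsd <- lsd_ptype

# A double combined with the decimalized class becomes the decimalized class.
vec_ptype2.deb_decimal.deb_decimal <- decimal_ptype
vec_ptype2.deb_decimal.double <- decimal_ptype
vec_ptype2.double.deb_decimal <- decimal_ptype

class(vec_ptype2(1, new_decimal(1)))[1]                               # "deb_decimal"
class(vec_ptype_common(1, new_decimal(1), new_lsd(1, 2, 3)))[1]       # "deb_lsd"
```

These methods only answer the question of which type wins. Combining actual values additionally needs the matching cast methods.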
Casting is a little bit more interesting, because this is where we have the programmatic logic of the actual transformation. And one reason why I think that debvctrs is an interesting example of this is because I have two different kinds of vectors, and so you have to think about how you can combine them in different ways. And so here, again, the code is not necessarily super easy to read, but this is taking a deb_decimal and then converting it to this tripartite structure. And essentially what I'm doing is taking a simple if-else statement, what is the unit, and then from that unit, placing it where it should be in a call to create a deb_lsd vector, and then finally normalizing it so that we get a normalized value, so that we've done the compound unit arithmetic.
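A cast of this shape can be sketched as follows, again assuming vctrs 0.3.0 or later, where cast methods are named vec_cast.<to class>.<from class>. The constructors and the helper normalize_lsd are hypothetical, and the sketch does not check that the bases of the two vectors agree.

```r
library(vctrs)

new_lsd <- function(l = double(), s = double(), d = double(), bases = c(20, 12)) {
  new_rcrd(list(l = l, s = s, d = d), bases = bases, class = "deb_lsd")
}
new_decimal <- function(x = double(), unit = "l", bases = c(20, 12)) {
  new_vctr(x, unit = unit, bases = bases, class = "deb_decimal")
}

# Hypothetical helper: do the compound unit arithmetic on whole vectors.
normalize_lsd <- function(l, s, d, bases) {
  pence <- (l * bases[1] + s) * bases[2] + d   # everything in pence
  shillings <- pence %/% bases[2]
  new_lsd(shillings %/% bases[1], shillings %% bases[1], pence %% bases[2], bases)
}

# Place the decimal value in the column named by its unit, then normalize.
vec_cast.deb_lsd.deb_decimal <- function(x, to, ...) {
  value <- vec_data(x)
  zero <- rep(0, length(value))
  unit <- attr(x, "unit")
  bases <- attr(x, "bases")
  if (unit == "l") {
    normalize_lsd(value, zero, zero, bases)
  } else if (unit == "s") {
    normalize_lsd(zero, value, zero, bases)
  } else {
    normalize_lsd(zero, zero, value, bases)
  }
}

x <- vec_cast(new_decimal(3255, unit = "d"), new_lsd())
c(field(x, "l"), field(x, "s"), field(x, "d"))   # 13 11 3
```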
Okay, so how do we put it all together? So this is essentially the endpoint. Here I'm just showing what you can get. So we can combine deb_lsd, deb_decimal, and double, and we get deb_lsd. We can compare different types, so 3,255 pence is less than 15 pounds. And again, if you go on, you can do arithmetic, including arithmetic with different types. So what I want to say is that you can create your own S3 vectors. You can extend the capabilities of R to fit your own needs, and that vctrs provides this clear developmental path that enables you to do this. Thank you.
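Combining the three types can be sketched end to end once coercion and casting methods toward the tripartite class exist. As before, this assumes vctrs 0.3.0 or later, uses hypothetical constructors and helpers, treats a bare double as a number of pounds, and leaves out the bases check, comparison, and arithmetic.

```r
library(vctrs)

new_lsd <- function(l = double(), s = double(), d = double(), bases = c(20, 12)) {
  new_rcrd(list(l = l, s = s, d = d), bases = bases, class = "deb_lsd")
}
new_decimal <- function(x = double(), unit = "l", bases = c(20, 12)) {
  new_vctr(x, unit = unit, bases = bases, class = "deb_decimal")
}

# Hypothetical helpers: convert a value in a given unit to pence and back.
to_pence <- function(x, unit, bases) {
  switch(unit, l = x * bases[1] * bases[2], s = x * bases[2], d = x)
}
from_pence <- function(pence, bases) {
  shillings <- pence %/% bases[2]
  new_lsd(shillings %/% bases[1], shillings %% bases[1], pence %% bases[2], bases)
}

# Coercion: the tripartite class is the common type of all three.
lsd_ptype <- function(x, y, ...) new_lsd()
vec_ptype2.deb_lsd.deb_lsd <- lsd_ptype
vec_ptype2.deb_lsd.deb_decimal <- lsd_ptype
vec_ptype2.deb_decimal.deb_lsd <- lsd_ptype
vec_ptype2.deb_lsd.double <- lsd_ptype
vec_ptype2.double.deb_lsd <- lsd_ptype

# Casting: how each type is turned into the tripartite class.
vec_cast.deb_lsd.deb_lsd <- function(x, to, ...) x
vec_cast.deb_lsd.deb_decimal <- function(x, to, ...) {
  bases <- attr(x, "bases")
  from_pence(to_pence(vec_data(x), attr(x, "unit"), bases), bases)
}
vec_cast.deb_lsd.double <- function(x, to, ...) {
  bases <- attr(to, "bases")
  from_pence(to_pence(x, "l", bases), bases)
}

x <- vec_c(new_lsd(15, 0, 0), new_decimal(3255, unit = "d"), 1.5)
class(x)[1]     # "deb_lsd"
field(x, "l")   # 15 13 1
field(x, "s")   # 0 11 10
field(x, "d")   # 0 3 0
```

The second element shows why 3,255 pence is less than 15 pounds: it normalizes to 13 pounds, 11 shillings, and 3 pence.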
Q&A
Thank you very much, Jesse. That was fantastic. We have a couple of questions. If you still want to sneak one in, please do. There's a question about whether you can comment on your choice between, you mentioned explicitly S3 and vctrs. Did you consider R6 classes, and how did that factor into your decision making?
So I looked at it a little bit, but mostly I was sort of interested in the simplicity of S3. And like I said, I basically wrote the entire package with S3 based on lists, and then I talked to Hadley, and then I had like a month of work to do.
So the second question is, once you had your data in vctrs format, what did you have to do additionally to make it compatible with the tidyverse?
So like I said, especially if your vector is based on a double, then it pretty much works right out of the package. Right now with record style vectors, there are some things like mutate that don't quite work, but I know that RStudio is working on that. But there you can just go back to base R things that work just fine, and hopefully soon that will also get figured out.
And the final question, what was the process, what was your original data format? I mean, how do you go from paper record into vctrs?
Yeah, so it's fun thinking about the questions of big data, because I have to create all of my own data. And it's also funny to think about it in terms of reproducible data. Oh, don't change the data. And it's like, well, wait, I created it, I can change it if I want to. But basically, I take these numbers, I take these documents, and I input them by hand, trying to figure out what the numbers are, and do it in a spreadsheet program, and then export it to CSV. And there is a function in debkeepr, but not in debvctrs, that will take that and create deb_lsd columns from those three different columns. Thank you very much. Thank you.
