
Tyson Barrett | List-columns in data.table | RStudio (2020)
The use of list-columns in data frames and tibbles is well documented (e.g. Bryan, 2018), providing a cognitively efficient way to organize results of complex data (e.g. several statistical models, groupings of text, data summaries, or even graphics) with corresponding data. For example, one can store student information within classrooms, player information within teams, or analyses within groups. This allows the data to be of variable sizes without overly complicating or adding redundancies to the structure of the data. In turn, this can make analyses of the data more reliable. Because of its efficiency and speed, being able to use data.table to work with list-columns would be beneficial in many data contexts (e.g. to reduce memory usage in large data sets). Herein, I demonstrate how one can create list-columns in a data table using the by argument in data.table and purrr::map(). I compare the behavior of the data.table approaches to the dplyr::group_nest() function and tidyr::unnest(), two of the several powerful Tidyverse nesting and unnesting functions. Results using bench::mark() show the speed and efficiency of using data.table to work with list-columns.
Transcript
This transcript was generated automatically and may contain errors.
Thank you. This is the corner room, right? So we have to, everyone leaves and comes at weird times. If you're here for doing, learning about list columns and data table, you're in the right place. Or maybe you wanted to see Julie's and Allison's beautiful artwork. You're also in the right place.
Maybe you just heard that I had a few data table hex stickers. They're really rare in the wild. I do have like two more. So come find me. In any case, whatever brings you here, welcome. RStudio Conf is one of my favorite things in the world. I have to say that very carefully. It's one of my favorites because I think my wife and kids are watching. Just to be clear, they are my favorite. RStudio is close, though.
What is a list column?
So if this is you, where you look at the tidyverse code and you're like, wow, that is beautiful. But then you also feel this when you look at data table code. I understand nothing. That is okay. That is expected for this audience for the most part. I know a lot of people have used data table. However, my goal today is to show how the tidyverse and data table are complementary. And to show that, I'm going to show it using list columns and how they're beneficial and how you can use them in data table.
So to start, what even is a list column? I'm curious, how many have purposefully used a list column in their work? You guys are awesome. It's about half. That's awesome. So for those that haven't, a list column is basically where you take some subset of the data, either a vector or a data table, or there's a bunch of other things you can put in there, and you group them into a single column.
So this example shows we have this data, and then we nest it. There's a group_nest function in dplyr. And essentially what it does is it grabs all the data that's associated with group one and shoves it right here into a single cell. What's cool about this is you can put all sorts of stuff in here. You can put expressions. You can put visualizations like ggplot code. All of that stuff can go into a list column.
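As a minimal sketch of the nesting described here, the following uses dplyr::group_nest() on a small made-up tibble. The data and the column names group and x are assumptions for illustration, not the data shown on the slide.

```r
library(dplyr)

# Hypothetical toy data: two groups with different numbers of rows
d <- tibble(
  group = c(1, 1, 2, 2, 2),
  x     = c(10, 20, 30, 40, 50)
)

# One row per group; the remaining columns are packed into a list column
# named `data` (the default name used by group_nest)
nested <- group_nest(d, group)
nested

# Each cell of the list column holds the rows that belong to that group
nested$data[[1]]
```

The list column keeps groups of different sizes side by side without repeating the grouping value on every row.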
Why use list columns?
But you may be wondering, like, why would you want to do that? That is weird. That's at least my first impression when I saw a list column. I was like, you just can't stop, can you? You just always want to add something new. But really, it is a great question. Why would we want to use this? I've compiled a short list of some of the reasons.
The first is that there's several functions in the tidyverse, which means it must be cool. So that's one reason to learn it. I'm going to show today that it can make some data manipulations actually safer. You can avoid some common problems by using a list column. It's also memory efficient. So if you're working with really big data, a list column can reduce redundancies in your data. And in some cases, it can even make understanding what you're looking at, your data, much simpler when your data is really complex. And the last reason I had was it just is going to show up. If you've ever used something like bench::mark, it's there. And you can't avoid them. They're just going to show up in your life. So it's good to get to know list columns a little bit.
Why data.table?
On the other hand, why data table? We have tons of amazing tools, especially stuff that we've learned here at RStudioConf. According to the readme, they give these six reasons why a data table is important. I'm going to focus on these first three. The first one, a concise syntax. So data table, it's built up with these parts right here. You start with your data table. We'll call it DT for short. In the square brackets, you have three main parts. There's other parts that you can use, but there's three main ones. There's I, J, and by.
So to translate that, after talking about translating, into dplyr verbs, I, you can think of that as either filter or arrange. So the things that you would normally put in those two, you would put in I. In J, you can think mutate. So when you see code in that place, think mutate. What would I put in mutate? Then by is like group_by. So I'm going to show some examples that highlight exactly what those look like, but whenever you're seeing the I, J, by, I often translate that in my head to filter or arrange, mutate, and group_by.
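A small sketch of the three parts, using a made-up table (the names dt, team, and points are assumptions for illustration). Note that j also covers summarise-style work, not only mutate-style work.

```r
library(data.table)

# Hypothetical toy data
dt <- data.table(
  team   = c("a", "a", "b", "b", "b"),
  points = c(3, 7, 2, 9, 4)
)

# i:  points > 2 keeps matching rows (like filter; order(points) would act like arrange)
# j:  what to compute
# by: the grouping (like group_by)
res <- dt[points > 2, .(mean_points = mean(points)), by = team]
res

# A mutate-style j: add a column by group, leaving i blank
dt[, total := sum(points), by = team]
dt
```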
Next thing is that it's really fast. That may be the only reason why you're here is you heard that it's really fast. Responding to why data table is so fast, this is what Hadley, I don't know if you've heard of him, this is what he said. I think it's a relentless focus on performance across the entire package. And I think this is true. I've contributed a little bit to the data table package, and every time they pause and look at anything that could be slowing down your code or using up extra memory. And they do it to a crazy degree. And it's really paid off. The package is incredibly fast.
Which also goes along with it being very memory efficient. This was an analysis done by Matt Dowle, the creator of data table. This was with a huge data set. Might not be huge to some of you if you're working with a lot of flowing data. This one has a billion rows and nine columns. It's about 50 gigabytes of memory. He compared all these different frameworks, and for the most part, they either just ran out of time or couldn't even start. With data table, it was able to accomplish a fairly complex summing by group request in a reasonable amount of time, too.
Combining tidyverse and data.table
So because you have such a great framework in the tidyverse and you have such great performance in the data table package, I say let's team up. Let's make them work together. And so for today's example, we're going to work with some grammar that's provided with the tidyverse. It's nest and unnest. So nesting, when you think of that, it's making a list column of data tables or vectors or visualizations. Unnest is unnesting those list columns back into what we normally think of in data frames or data tables.
So in order to highlight this, I'm going to go through a brief example. We're going to use data from Statista and the Star Wars data from the dplyr package. So this has actually become really common for me. I start with these two lines of code at the top of my scripts. So I have the tidyverse and data table. Note that the approaches that I show you today actually work with the development version of data table. But there are other ways of doing this. This is just the way that it's going to work going forward.
So the two data sets we're going to play around with, the dplyr one is the top one. It has the name of the character. It has their species, their home world, and then it has, oh, no, there's a list column. It shows up already. So we have films. So it's a character vector of the films that, for example, on the first row that Luke Skywalker is in. It has five films listed. The other data set, it's a revenue data set, and it's by film. So A New Hope made $775 million, roughly.
So these two, if our goal is to understand how things in the first data set relate to revenue for the movies, we're going to have to do some work to get it in that format. Because right now, we couldn't join these very easily. So just to get us to where so we have a framework we're working from, you are here at the tidying and transforming steps of the data work life. Again, this is from Allison. Beautiful. Just makes me happy looking at it.
Unnesting and nesting with data.table
So one of our issues right here that we have to deal with is that one of the data sets is at the film level. So we have per film the revenue. And the other one is at the character level. By character, we have information about them. And so they're not in a format that we can easily join them so that we can look at things together. So the first thing we're going to do is unnest. So we're going to unnest the films column. It's, again, a list column of character vectors.
And so the first thing we're going to do is we're going to let R know that we're going to be using data table. This is one of the really important steps to use in data table. In order for the square brackets to work, you have to tell R that it's a data table. So in this one, we're just grabbing it straight from the dplyr package. And we're just letting it know that it's a data table. We'll call it sw here.
And then this is where the unnesting happens. So if you remember what we were talking about before, you can look at the placeholder. So you have a placeholder here for i. In this case, we're not going to be filtering or arranging. So we just leave that blank, do a comma. And then this placeholder is for like things that you would do in mutate. Here we're creating a new column or, in essence, replacing it. By taking the elements we're unlisting here in the currently list column called films. And then in the by statement, we're just going to put all the other variables that are at the character level that we want to hold on to, for example, like name and species.
So this produces this data set. So now instead of just one Luke Skywalker row and the films being shoved into a single cell, now we have the name and we have each film as a row. So we can see that Luke was in Revenge of the Sith, Return of the Jedi, et cetera.
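The unnesting step described above can be sketched as follows. The talk converts dplyr::starwars with as.data.table(); to stay self-contained, this sketch builds a two-row stand-in by hand, so the table contents and the name sw_long are assumptions.

```r
library(data.table)

# Hypothetical miniature of the Star Wars data, with `films` as a list column
sw <- data.table(
  name    = c("Luke Skywalker", "R2-D2"),
  species = c("Human", "Droid"),
  films   = list(
    c("A New Hope", "Return of the Jedi"),
    c("A New Hope", "The Phantom Menace", "Return of the Jedi")
  )
)

# i is left blank; j unlists the list column;
# by lists the character-level columns we want to hold on to
sw_long <- sw[, .(films = unlist(films)), by = .(name, species)]
sw_long
```

Each character now has one row per film, and films is an ordinary character column.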
Next I propose that we actually do another nesting. So we're not getting away from the list columns that easy. So we're going to create here a new column called data. And what we're going to do is we're going to create a list column. That's what this is doing. Of this symbol. It's .SD. And if you haven't used data table very much before, you can think of it as the subset of the data. So SD for subset of data. And in this case, that subset is going to be everything that's not in the by statement.
So what does this produce? This. So now by film, we have the associated data in a single cell. So we have A New Hope. And all the data that goes with A New Hope is now comfortably packed into that single cell as a data table.
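A minimal sketch of nesting with .SD, assuming a reasonably recent data.table release and a hand-built long table (the rows and the names sw_long and sw_n are illustrative assumptions).

```r
library(data.table)

# Hypothetical long-format data: one row per character per film
sw_long <- data.table(
  name    = c("Luke Skywalker", "Luke Skywalker", "R2-D2", "R2-D2"),
  species = c("Human", "Human", "Droid", "Droid"),
  films   = c("A New Hope", "Return of the Jedi",
              "A New Hope", "Return of the Jedi")
)

# For each film, wrap .SD (every column not named in `by`) in a list,
# giving one data.table per cell of the new `data` column
sw_n <- sw_long[, .(data = list(.SD)), by = films]
sw_n

# Look inside the first cell
sw_n$data[[1]]
```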
Joining and analyzing the data
So now we actually can join them. And I suggest that this is a great way of doing this. In data table, one way to join is you have one data table. And then the first part of the square bracket in the I part, you actually put another data table. So revenue is another data table. And then you tell it what you're joining on. And this is a really fast form of joining. When you print this out, this is what we now have. So we have per film, we have the associated data. And we have revenue.
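The join can be sketched like this. Only the A New Hope figure (roughly 775, in millions of dollars) comes from the talk; the second revenue value is a placeholder, and the table contents are assumptions for illustration.

```r
library(data.table)

# Hypothetical nested table: one row per film with a list column of data.tables
sw_n <- data.table(
  name    = c("Luke Skywalker", "R2-D2", "Luke Skywalker"),
  species = c("Human", "Droid", "Human"),
  films   = c("A New Hope", "A New Hope", "Return of the Jedi")
)[, .(data = list(.SD)), by = films]

# Revenue by film, in millions of dollars
revenue <- data.table(
  films   = c("A New Hope", "Return of the Jedi"),
  revenue = c(775, 100)  # 100 is a placeholder, not a real figure
)

# Put the other data.table in i and name the key with `on`
joined <- sw_n[revenue, on = "films"]
joined
```

Because the character-level rows are tucked into data, the result has one row per film, so a missing or misspelled key is easy to spot.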
So the reason why I think nesting before we joined was really useful is, I don't know, you don't have to raise hands to admit it. But has anyone made a mistake joining data or been surprised by the result when you're joining data? If so, you're human. I've done it numerous times. And it's usually like a misspelling in one of the keys. There's a million ways that things can go wrong when you're joining data. And so one way to make it simpler is actually using this nesting. And so all we had to look for was the key and anything that the new data set's bringing in. And now it's really easy to problem solve. I can look through if there's a surprise or one of the revenues wasn't there. I would be able to find out really quickly. Whereas, if my data set was really long and not nested, it would be much harder.
So now that we have the data sets together, we actually can do some analyses. Here, like I mentioned before, all of this happening here is mutate. But this time, we have what is called the walrus operator, :=. That's my favorite name for an operator. And you can see why it's called a walrus operator, right? The teeth. So what it's doing is it's creating a new variable called counts. And it's reaching into each cell of data. That's a column that we have in our data table. And we're using count to just get the count by each species in the data that we have.
What this operator does that's different than what we're normally used to in R is that it's going to modify in place. And this is a nuance that is important when you start using data table. Because it is changing SWN, that object, in place. It's not making a copy. It's just changing it right there. And so you don't actually have any assignment towards the end of this. This actually takes care of the assignment, which is sometimes a hurdle for people to get used to as they're getting in there.
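A small sketch of what modifying in place means in practice. The object names dt, alias, and safe are hypothetical.

```r
library(data.table)

dt <- data.table(x = 1:3)

# Plain assignment does not copy a data.table; both names point at one object
alias <- dt

# := adds the column by reference, with no `<-` on the left
alias[, y := x * 2L]

# The column shows up under the original name as well
"y" %in% names(dt)

# copy() is the explicit way to get an independent object
safe <- copy(dt)
safe[, z := 0L]
"z" %in% names(dt)
```

The first check is TRUE and the second is FALSE: changes made through alias reach dt, while changes to the explicit copy do not.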
So this is going to create a list column again. But this time using purrr's map, of counts by each. And I have an outdated thing there. It's by species instead of gender. You'll notice at the end of the object, I keep putting the square brackets. That forces the print in data table. When you use this operator, sometimes it doesn't print. And so when you are using data table, this is a great way to force it to print if you want it to.
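A sketch of adding a list column of per-film counts. The talk uses purrr's map with a count function; this sketch swaps in base R's lapply() and data.table's .N to avoid extra dependencies, and the data are made up.

```r
library(data.table)

# Hypothetical nested table: one row per film
sw_n <- data.table(
  name    = c("Luke Skywalker", "R2-D2", "C-3PO", "Luke Skywalker"),
  species = c("Human", "Droid", "Droid", "Human"),
  films   = c("A New Hope", "A New Hope", "A New Hope", "Return of the Jedi")
)[, .(data = list(.SD)), by = films]

# Reach into each cell of `data` and count rows per species.
# := stores the resulting list as a new list column, in place;
# the trailing [] forces the result to print.
sw_n[, counts := lapply(data, function(d) d[, .N, by = species])][]

# The counts for the first film
sw_n$counts[[1]]
```

With purrr installed, purrr::map(data, ...) should be a drop-in replacement for lapply() here.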
And now, last step. We're just about there. Now we're going to unnest a list column that has data tables in it or tables in this case instead of vectors. So this time we don't actually need to give it a name like we did before. Now we're just unnesting it. So this is the new variable. And we're just grabbing that and unnesting by film and revenue. And what this is going to do is it's just going to grab everything in each of those individual data sets and it's just going to unnest it so it looks like a normal data table.
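One way to sketch this last unnesting step uses rbindlist() in j. The exact expression on the slide is not in the transcript, so this is an assumption, and apart from the A New Hope revenue the values are placeholders.

```r
library(data.table)

# Hypothetical long data; revenue is in millions of dollars
long <- data.table(
  films   = c("A New Hope", "A New Hope", "A New Hope", "Return of the Jedi"),
  revenue = c(775, 775, 775, 100),  # 100 is a placeholder, not a real figure
  species = c("Human", "Droid", "Droid", "Human")
)

# Build a list column holding one small table of species counts per film
nested <- long[, .(counts = list(.SD[, .N, by = species])),
               by = .(films, revenue)]

# Unnest: within each group, bind the tables stored in the list column.
# The columns in `by` are repeated for every row that comes out.
flat <- nested[, rbindlist(counts), by = .(films, revenue)]
flat
```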
So this is what we get. We get for A New Hope, we can see the revenue that's tied to it. We can see that there are three droids in A New Hope. If you're familiar with the movies, you probably can list off what droids those are. For A New Hope, there's 12 humans, et cetera. And from here, we can get proportions and things and we could possibly come up with this and see how revenue is associated with how many, the percent of the major characters being human in a movie. The Force Awakens really throws things off because that's the new one. It made a ton of money. But for the most part, it looked like people watched it less if there were more humans. We just like, we want other stuff.
Performance benchmarks
But at this point, you're like, okay, cool, whatever. We're here for the performance. And it is actually kind of cool looking at performance when you're working with packages like this. So we've packaged up some of these functions to make them easier to test and to use. It's in tidyfast. If you're interested, it's on GitHub. And so for nesting, this is a really small dataset. But you already start to see some improvement. The place that you see a lot of improvement with nesting is the memory that it uses to do the operation.
So the improvement in performance, both with memory and time, actually grows as your dataset gets bigger and bigger and as you have more and more groups. So as your data gets more complicated and bigger, the benefits grow. For unnesting, similar thing. You do see benefit even with a small dataset. But again, it grows. The benefits grow with the size of the data and the complexity of it.
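A rough sketch of how such a comparison could be set up with bench::mark(), assuming the bench and dplyr packages are installed. The simulated data, group count, and sizes are arbitrary assumptions, and this is not the benchmark shown in the talk. Timings and memory depend on the machine, so no results are shown.

```r
library(data.table)

# Hypothetical simulated data: 100,000 rows spread over up to 1,000 groups
set.seed(1)
n  <- 1e5
df <- data.frame(g = sample(1000, n, replace = TRUE), x = rnorm(n))
dt <- as.data.table(df)

# Time and memory for nesting with each approach
res <- bench::mark(
  dplyr      = dplyr::group_nest(df, g),
  data.table = dt[, .(data = list(.SD)), by = g],
  check = FALSE  # the two results differ in class, so skip the equality check
)

res[, c("expression", "median", "mem_alloc")]
```

Raising n or the number of groups is one way to check how the gap changes as the data grow.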
So that's everything I have for you. I hope you got to see a little bit how data table can be integrated into your work, that it's definitely not a replacement of anything in dplyr or vice versa, that they're very complementary. This is the picture slide. If you want to find all the resources that are tied to me, take a picture. I'm really grateful for the data table team, everything they've done, and for Julie and Allison for letting me use their beautiful art. Thank you.
Q&A
First one, from William Chu, asks: why did you use as.data.table rather than setDT? So setDT will convert the object in place. And so when the dataset's not huge, I recommend using as.data.table, because it creates a copy, and then you're not going to be modifying your first object, whatever that was before. When your data get really big and you don't want to make copies, then setDT is the way forward. Thank you. Great question.
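The difference can be sketched on a toy data frame (the names df and dt are hypothetical).

```r
library(data.table)

df <- data.frame(x = 1:3)

# as.data.table() returns a new object, so later changes leave df alone
dt <- as.data.table(df)
dt[, y := x * 2L]
names(df)  # df still has only the column x

# setDT() converts df itself by reference, without making a copy
setDT(df)
is.data.table(df)
```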
The second question is, what are your next steps for tidyfast? So it's a great question. As of right now, tidyfast is more trying to implement tidyr functions with data table as the back end, because I didn't want to even try to compete with something like dtplyr. It's a fantastic package. So it's mostly tidyr functions. It's testing, bug fixing, and just improving the package.
And I have one last one here. It's going to be a quick one. Favorite Star Wars character? I really fell for Rey. I think Rey is the hero that we all have wanted. So yeah, Rey's my favorite. Thank you, Tyson. Appreciate it.
