
Neal Richardson | Bigger Data With Ease Using Apache Arrow | RStudio
The Apache Arrow project enables data scientists using R, Python, and other languages to work with large datasets efficiently and with interactive speed. Arrow is so fast at some workflows that it seems to defy reality, or at least the limits of R's capabilities. This talk examines the unique characteristics of the Arrow project that enable it to redefine what is possible in R. The talk also highlights some of the latest developments in the arrow R package, including how you can query and manipulate multi-file datasets, and it presents strategies for speeding up workflows by up to 100x. About Neal: Currently Director of Engineering at Ursa Labs / RStudio. Previously led product and engineering at Crunch.io. Ph.D. in Political Science from the University of California, Berkeley.
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
You know that feeling you get when you've got these really good results and you're just really excited to share them with the world, and you think everyone's going to be really excited to see what results you have? And then you know that other feeling you sometimes get with that one, when you get this sinking suspicion that maybe those really good results are just too good to be true?
So this happened to me a few months ago. I lead the engineering team at Ursa Labs. We're one of the leading developers behind the Apache Arrow project. So Arrow, as you may know, is essentially building the foundation for the next generation of data frames and bigger data analysis. And I maintain Arrow's R package.
So I was meeting with RStudio's leadership, because RStudio has been a generous sponsor of Ursa Labs from the beginning. So I'm on the call with Hadley Wickham, with JJ Allaire, founder and CEO of RStudio, and Tareef Kawaf, who's president. And I was talking about this result I'd found recently: I'd been benchmarking Arrow's CSV reader. Arrow has lots of great things in its library, and I'm going to show you some of them in this talk. But even if you just use Arrow as a CSV reader, to read a data frame into R, and then use any other R package on that data frame afterwards, you can get a big benefit.
And in fact, even on this particular New York City taxi data set that I was testing on, I found that Arrow was two to three times faster than data.table's CSV reader. And at this point, Hadley cuts me off and says, yeah, I don't know, that doesn't sound right. The data.table folks have optimized really everything you could optimize out of reading data into R. So maybe you could match their performance. Maybe you could get a little bit better. But you couldn't be that much better, because they've done everything you could do. And Tareef added, yeah, you probably screwed up something in your benchmarking code. You probably didn't actually do all the work you thought you did.
And I thought about it, and, wow, that's actually a much more logical, simpler explanation for the result I found: that I screwed up something in my code. But I figured I'd go check it out and get back to them. So I compared the objects I had. The data frames read by both were the same size, 15 million rows, both equally big in memory. So it checked out: Arrow's CSV reader really was two to three times faster than data.table's here. We'd achieved an incredible result. It was literally incredible. This group of people, who know a lot about the internals of R, literally did not believe that it was this fast.
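A sanity check like the one described might look roughly like this in R. This is a sketch, not the actual benchmark code from the talk; the file path is illustrative, and the real comparison was run on the full New York City taxi files.

```r
library(arrow)
library(data.table)

path <- "nyc-taxi/2019-01.csv"  # illustrative path, not the benchmark file

# Time both readers on the same file
system.time(df_arrow <- arrow::read_csv_arrow(path))
system.time(df_dt    <- data.table::fread(path))

# Sanity-check that both readers actually did the full job:
# same number of rows and columns read by each
stopifnot(nrow(df_arrow) == nrow(df_dt), ncol(df_arrow) == ncol(df_dt))
```

`read_csv_arrow()` returns an ordinary data frame (a tibble), so everything downstream of the read works exactly as it would with any other reader.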
What makes Arrow so fast
So how were we able to do that? So what I want to talk about today is how characteristics of the Arrow project, and particularly the community around it, enable us to do things that we might otherwise have thought would be impossible to do in R.
So a little history. So Arrow was started in 2016. A group of database developers and data frame library maintainers got together and realized that they were all trying to solve similar problems. And rather than saying, this is fine, and we're just going to keep duplicating each other's work, what if we work together and create a shared foundation for our work? And we would all benefit from that.
So Arrow is fundamentally about three things. First, it is a format: a specification for how data is represented in memory. It's columnar, and it's designed to take advantage of features of modern CPUs, GPUs, and other hardware. Second, there's a set of 12 libraries that implement it in different languages; Python and R both use the C++ library, but there are many others in the Arrow project. And third, there's a broader ecosystem of packages and projects around Arrow that use Arrow in some form, whether as their internal data model or just as a means of exchange between projects.
So how do these characteristics of Arrow get us such unbelievable performance, and enable us to do things that we didn't think were possible in R? Start with modern hardware. Arrow is designed to take advantage of many features of CPUs that exist now but didn't exist when R was initially developed. R was first released in 1995, and computer technology has progressed quite a bit since then. There are things I can do on my laptop now that simply weren't possible back then.
So for example, my laptop here has eight cores, CPUs that can run in parallel. You can get a host on AWS with 96 cores. That's pretty good. But R is generally only using one of them, so we're leaving a lot of performance on the table. Newer CPUs can also take advantage of what's called SIMD: single instruction, multiple data. My laptop, for example, can take in up to 256 bits at a time; that's the size of eight integers in R. Other, newer chips can do 512 bits. So you can feed a lot more data into the CPU at a time if your code is designed to take advantage of that, but R generally is not.
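You can see the core count, and how many of those cores Arrow will use, from R itself. A small sketch (the numbers printed will of course depend on your machine):

```r
# How many cores does this machine have?
parallel::detectCores()

# How many threads is Arrow's thread pool using? By default it uses
# the available cores; you can cap it to leave cores free for other work.
arrow::cpu_count()
arrow::set_cpu_count(4)
```

This is the difference the talk is pointing at: base R runs your code on one of those cores, while Arrow's C++ library spreads work like CSV parsing across all of them without you writing any threading code.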
So it's like you have this super fast sports car and you just want to take it for a spin around the track and see what you can do with it. But for whatever reason, you have to keep it under 25 miles an hour, drive, keep your hands at 10 and 2 on the steering wheel, nothing too crazy. And it's a really missed opportunity.
So what do you do about that? You could write some C or C++ code in your R package to do multi-threading and take advantage of all those extra cores you have available. You could write code to do SIMD optimizations. But that's really hard, frankly. The number of people who know how to do that well isn't that large, and the number of them who are going to be writing R packages is even smaller.
The power of the Arrow community
So here's where the large Arrow community comes in handy. The Arrow project, since it was created in 2016, has had steady growth in contributors over time, well over 500 by now. What this means is that all of these people are working together on this project, and the benefits are shared. To have smart multi-threading in the R package, I didn't have to write it; people who understand how to do that much better than I would did it. I didn't have to write SIMD code either. There are other people, people who work on hardware, on the actual CPUs themselves, who know what those options are, and they wrote that code. And they're all contributing it in. They're not contributing it because they want to make the R package faster; they're contributing it because they're using the C++ library for some other project, or they're using Python, which uses the same C++ library that the R package does. So we benefit from all of these other communities' contributions.
So what does that look like? To give an example of how this work plays out in reality, I want to talk about a demo I gave last year at the RStudio conference. I was introducing the arrow R package, and I went through an example with a data set of about ten and a half years of New York City taxi data, 2 billion rows, showing that you could scan it and get results on your laptop. I did some sample dplyr filtering, selecting, grouping, and aggregating on this data set, which I'd opened with the arrow package by pointing it at a directory of files so that I could query it. And I got a result in four seconds over 2 billion rows. That was pretty good.
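That style of query looks roughly like this. This is a hedged sketch, not the exact demo code: the directory path is illustrative, and the column names are assumed from the NYC taxi schema. At the time of the demo, the filtering and selecting were pushed down into Arrow and the aggregation ran in R after `collect()`; newer arrow versions can push more of the query down.

```r
library(arrow)
library(dplyr)

# Point open_dataset() at a directory of parquet files; nothing is read yet
ds <- open_dataset("nyc-taxi/")

ds %>%
  filter(total_amount > 100, year == 2019) %>%        # pushed down to Arrow
  select(tip_amount, total_amount, passenger_count) %>%
  group_by(passenger_count) %>%
  collect() %>%                                       # pull matching rows into R
  summarise(
    median_tip_pct = median(tip_amount / total_amount * 100),
    n = n()
  )
```

The key point is that `open_dataset()` is lazy: only the columns you select and the row groups that can match your filter are ever read from disk, which is how a 2-billion-row query fits on a laptop.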
I did a talk this past summer and I redid the result just on the latest version of the code and it was twice as fast. And I hadn't really done anything in the R package to do that. This is all based on improvements in the underlying C++ code that all the rest of the community was working on. Right before I did this talk now, I ran the code again on the latest version and it's another 25% faster. Again, I didn't do anything to make that happen. It's just based on the Arrow community and its work.
Arrow as a universal standard
There are other ways that this ecosystem plays out and benefits us in our community. When you hear that a bunch of database developers got together, decided there wasn't a good standard for columnar data, and set out to create their own standard to unify all the others, it might bring to mind the XKCD comic where the creation of a standard to solve the lack of standards just makes more problems: more standards, more competing standards. And there certainly is that risk. But five years on from the creation of Arrow, we can see how the benefits of this approach have played out.
So just to give an example: suppose I've got data in Spark and I want really quick access to it in Python and R. What you could do is write, from Spark, a special adapter for Python and a special adapter for R, one that understands, in R's case, R's vector types, the specific bits in the vectors that make up a data frame. You'd have to do the same thing for Python, for NumPy or pandas, and Spark's Java code would have to understand how to do all of this. I put another arrow in the chart to connect Python and R, because interchange between them is obviously something you'd want too, and it's another place where you'd need someone on the other side of the language barrier to understand the internals of your format.
So that's feasible if this is all you care about. But in reality there are many more languages and many more projects, too many to fit on the slide. And as you can imagine, if you try to draw all of the lines connecting the various adapters, it gets crazy really quickly. So what do you do? You're obviously not going to do all that work. What you want is a ubiquitous standard that everyone can write to and read from, so that you don't have to re-implement everything every time.
So prior to Arrow, what you'd probably do is just dump out a CSV. Every language knows how to read a CSV; every database knows how to read in a CSV. But this comes with costs. CSVs can be expensive to read and write, because you have to convert between a string format on disk and the arrays of bits that R and other languages use in memory to do work, and that conversion is costly. CSV also doesn't have rich support for types: you may have timestamps, but you have no way of indicating that a column is a timestamp, so a reader has to guess whether it should be a regular string or a timestamp. There are lots of costs and penalties that come from doing this. But CSV is the lowest common denominator, and that's why it gets picked up.
But Arrow is a happier version of that: you have a standard, and it's columnar and binary, so it matches very closely how R or Julia or Spark or any of these systems actually work with data in memory. The conversion is minimal, and in some cases doesn't involve copying memory at all, where the layouts actually line up as the same shape. So it's very efficient. From our perspective in the R community, what this means is that now that we have an arrow R package, we have very efficient access to all of these other projects, databases, and languages, as long as they can also write Arrow. And over the last five years, we've seen an increasing number of projects pick up Arrow for exactly this reason.
So going back to Spark for our example: early in Arrow's life, we added support in PySpark, the Python Spark library, for communicating very efficiently with Spark itself, which runs on the JVM, using Arrow. And you'd get a hundredfold speedup on certain operations, whether with user-defined functions or when pulling a large amount of data into your Python environment for further analysis. That worked because Java has an Arrow library, Python has an Arrow library, and they write the same format. The beauty of this is that once we finally got the R package going, we could pick up and do the same thing. (These are both references to blog posts on the Arrow website.) Because we can read and write Arrow from R, we didn't need any other special Spark work to get those 100x speedups pulling data to and from Spark in R.
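On the R side, the integration is nearly invisible in user code: with sparklyr, attaching the arrow package is enough to switch data transfer to the Arrow format. A minimal sketch, assuming a local Spark installation and the nycflights13 package for sample data (both assumptions, not from the talk):

```r
library(sparklyr)
library(arrow)   # attaching arrow switches sparklyr to Arrow serialization
library(dplyr)

sc <- spark_connect(master = "local")

# R -> Spark and Spark -> R transfers now go through the Arrow format,
# which is where the large speedups on big transfers come from
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
flights_df  <- collect(flights_tbl)

spark_disconnect(sc)
```

Nothing about the query code changes; the same `copy_to()` and `collect()` calls simply move data as Arrow instead of row by row.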
Going beyond what's possible in R today
So, the Arrow format's design for modern hardware and the power of the large community of developers around the Arrow project have really changed what we're able to do in R, enabling all sorts of new workflows that weren't possible or feasible before. It's nice, as I showed at the beginning, to see that Arrow makes reading CSVs into R faster, and we like seeing that, but that's not really the goal. We're not just trying to shave a little time off work you can already do now. We're trying to go beyond that: to enable new workflows that were either cost-prohibitive before or just, frankly, not possible at all.
There's a lot more coming in the releases this year, in 2021. I'm really excited to be able to tell you about them when they arrive, but for now I want to walk you through another example of something you can do today in the arrow package that is a little beyond what you would normally do in R, but can have some impossible-seeming results.
So earlier I showed reading a dataset of 125 files of New York City taxi data that was in the parquet format. Parquet is a file format that you can read in the Arrow package. It's a columnar, very efficient file format. Often of course you don't have parquet files yet, and one of the reasons you want to use Arrow is to get your less efficient CSV data into parquet so that you can read it more quickly. So the Arrow package has some facilities for doing this.
So here's an example. I converted some of those parquet files back to CSV; I only did six months of the data, because CSVs are big and take a lot of space. Just as you can point at a directory of parquet files and say open_dataset, you can do that with CSVs, and you can run the same kinds of queries on them that I showed with the parquet dataset: selecting, filtering, grouping, aggregating, all of these things. As you see, the result here (it's a different query from the one I showed, so it's not apples to apples) took seven and a half seconds over this data. That's pretty good. And I didn't have to mess with reading each of the files individually and doing things with them. That's good. But I think we can do better.
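The CSV version of the workflow is nearly identical to the parquet one; you just declare the format. A sketch (directory path and column names are assumptions based on the taxi schema, not the exact demo query):

```r
library(arrow)
library(dplyr)

# Same open_dataset() call, just pointed at CSVs instead of parquet
csv_ds <- open_dataset("nyc-taxi-csv/", format = "csv")

csv_ds %>%
  filter(payment_type == 3) %>%
  select(fare_amount, tip_amount) %>%
  collect() %>%
  summarise(n = n(), avg_tip = mean(tip_amount))
```

The catch, as the timing shows, is that every query has to re-parse the CSV text, which is why converting to a binary format pays off so dramatically below.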
So one thing we can do is write the dataset out to a more efficient format: either parquet, as I mentioned, or feather, which is literally the Arrow format on disk. We can use similar dplyr-style syntax to do this. I can say group_by the payment type column in the dataset; that was one of the columns I filtered on before. In this case, when I'm writing a dataset, group_by means: partition my data by that variable. It's going to write out separate files inside different directories based on the value of that variable. Payment type in this dataset takes values one to five, so now inside my feather taxi directory I've got five subdirectories, one for each value of payment type, each containing feather files.
And what this means is that when I query this dataset again and filter on payment type, I only have to look at the files inside the directory corresponding to the value of payment type I'm filtering on. I don't even have to read the other files to see if they match my filter, because the partitioning has pre-filtered them. So indeed, if we do the same query that I just did on the CSVs, but on the feather version of the dataset, it's now a hundred times faster than before. The exact same query, just with a more efficient file format, and taking advantage of partitioning in the dataset query.
So this is great. We're trying to make things that have been difficult or impossible, possible. Trying to help you take full advantage of the sports car you've got when you're working in R on your laptop. But really, I like to think that what we have is not just any old sports car: it's a DeLorean, perhaps with a flux capacitor in it. And where we're going, we don't need roads. We're going beyond what you could be doing on your laptop today, even. So thank you very much.

