
Neal Richardson | Bigger Data With Ease Using Apache Arrow | RStudio
The Apache Arrow project enables data scientists using R, Python, and other languages to work with large datasets efficiently and with interactive speed. Arrow is so fast at some workflows that it seems to defy reality, or at least the limits of R's capabilities. This talk examines the unique characteristics of the Arrow project that enable it to redefine what is possible in R. The talk also highlights some of the latest developments in the arrow R package, including how you can query and manipulate multi-file datasets, and it presents strategies for speeding up workflows by up to 100x. About Neal: Currently Director of Engineering at Ursa Labs / RStudio. Previously led product and engineering at Crunch.io. Ph.D. in Political Science from the University of California, Berkeley.
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
You know that feeling you get when you've got these really good results and you're just really excited to share them with the world, and you think everyone's going to be really excited to see what results you have? And then you know that other feeling you sometimes get with that one, when you get this sinking suspicion that maybe those really good results are just too good to be true?
So this happened to me a few months ago. I lead the engineering team at Ursa Labs. We're one of the leading developers behind the Apache Arrow project. So Arrow, as you may know, is essentially building the foundation for the next generation of data frames and bigger data analysis. And I maintain Arrow's R package.
So I was meeting with RStudio's leadership, because RStudio has been a generous sponsor of Ursa Labs from the beginning. So I'm on the call with Hadley Wickham, with JJ Allaire, founder and CEO of RStudio, and Tareef Kawaf, who's president. And I was talking about this result I'd found recently: I'd been benchmarking Arrow's CSV reader. Arrow has lots of great things in its library, and I'm going to show you some of them in this talk. But even if you just use Arrow as a CSV reader, to read a data frame into R, and then use any other R package on that data frame afterwards, you can get a big benefit.
And in fact, even on this particular New York City taxi data set that I was testing on, I found that Arrow was two to three times faster than data.table's CSV reader. And at this point, Hadley cuts me off and says, yeah, I don't know, that doesn't sound right. The data.table folks have optimized really everything you could optimize out of reading data into R. So maybe you could match their performance. Maybe you could get a little bit better. But you couldn't be that much better, because they've done everything you could do. And Tareef added, yeah, you probably screwed up something in your benchmarking code. You probably didn't actually do all the work you thought you did.
And I thought about it, and, wow, that's actually a much more logical, simpler explanation for the result I found: that I screwed up something in my code. But I figured I'd go check it out and get back to them. So I compared the objects I had. The data frames read by both were the same size, 15 million rows, both equally big in memory. So it checked out: Arrow's CSV reader really was two to three times faster than data.table's here. We'd achieved an incredible result. It was literally incredible. This group of people, who know a lot about the internals of R, literally did not believe that it was this fast.
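A sanity check like the one described might look roughly like this in R. This is a sketch, not the actual benchmark code from the talk; the file path is illustrative, and the real comparison was run on the full New York City taxi files.

```r
library(arrow)
library(data.table)

path <- "nyc-taxi/2019-01.csv"  # illustrative path, not the benchmark file

# Time both readers on the same file
system.time(df_arrow <- arrow::read_csv_arrow(path))
system.time(df_dt    <- data.table::fread(path))

# Sanity-check that both readers actually did the full job:
# same number of rows and columns read by each
stopifnot(nrow(df_arrow) == nrow(df_dt), ncol(df_arrow) == ncol(df_dt))
```

`read_csv_arrow()` returns an ordinary data frame (a tibble), so everything downstream of the read works exactly as it would with any other reader.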
What makes Arrow so fast
So how were we able to do that? So what I want to talk about today is how characteristics of the Arrow project, and particularly the community around it, enable us to do things that we might otherwise have thought would be impossible to do in R.
So a little history. So Arrow was started in 2016. A group of database developers and data frame library maintainers got together and realized that they were all trying to solve similar problems. And rather than saying, this is fine, and we're just going to keep duplicating each other's work, what if we work together and create a shared foundation for our work? And we would all benefit from that.
So Arrow is fundamentally about three things. First, it is a format: a specification for how data is represented in memory. It's columnar, and it's designed to take advantage of features of modern CPUs, GPUs, and other hardware. Second, there's a set of 12 libraries that implement it in different languages; Python and R both use the C++ library, but there are many others in the Arrow project. And third, there's a broader ecosystem of packages and projects around Arrow that use Arrow in some form, whether as their internal data model or just as a means of exchange between projects.
So how do these characteristics of Arrow get us such unbelievable performance, and enable us to do things that we didn't think were possible in R? Start with modern hardware. Arrow is designed to take advantage of many features of CPUs that exist now but didn't exist when R was initially developed. R was first released in 1995, and computer technology has progressed quite a bit since then. There are things I can do on my laptop now that simply weren't possible back then.
So for example, my laptop here has eight cores, CPUs that can run in parallel. You can get a host on AWS with 96 cores. That's pretty good. But R is generally only using one of them, so we're leaving a lot of performance on the table. Newer CPUs can also take advantage of what's called SIMD: single instruction, multiple data. My laptop, for example, can take in up to 256 bits at a time; that's the size of eight integers in R. Other, newer chips can do 512 bits. So you can feed a lot more data into the CPU at a time if your code is designed to take advantage of that, but R generally is not.
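You can see the core count, and how many of those cores Arrow will use, from R itself. A small sketch (the numbers printed will of course depend on your machine):

```r
# How many cores does this machine have?
parallel::detectCores()

# How many threads is Arrow's thread pool using? By default it uses
# the available cores; you can cap it to leave cores free for other work.
arrow::cpu_count()
arrow::set_cpu_count(4)
```

This is the difference the talk is pointing at: base R runs your code on one of those cores, while Arrow's C++ library spreads work like CSV parsing across all of them without you writing any threading code.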
So it's like you have this super fast sports car and you just want to take it for a spin around the track and see what you can do with it. But for whatever reason, you have to keep it under 25 miles an hour, drive, keep your hands at 10 and 2 on the steering wheel, nothing too crazy. And it's a really missed opportunity.
So what do you do about that? You could write some C or C++ code in your R package to do multi-threading and take advantage of all those extra cores you have available. You could write code to do SIMD optimizations. But that's really hard, frankly. The number of people who know how to do that well isn't that large, and the number of them who are going to be writing R packages is even smaller.
The power of the Arrow community
So here's where the large Arrow community comes in handy. The Arrow project, since it was created in 2016, has had steady growth in contributors over time, well over 500 by now. What this means is that all of these people are working together on this project, and the benefits are shared. To have smart multi-threading in the R package, I didn't have to write it; people who understand how to do that much better than I would did it. I didn't have to write SIMD code either. There are other people, people who work on hardware, on the actual CPUs themselves, who know what those options are, and they wrote that code. And they're all contributing it in. They're not contributing it because they want to make the R package faster; they're contributing it because they're using the C++ library for some other project, or they're using Python, which uses the same C++ library that the R package does. So we benefit from all of these other communities' contributions.
So what does that look like? To give an example of how this work plays out in reality, I want to talk about a demo I gave last year at the RStudio conference. I was introducing the arrow R package, and I went through an example with a data set of about ten and a half years of New York City taxi data, 2 billion rows, showing that you could scan it and get results on your laptop. I did some sample dplyr filtering, selecting, grouping, and aggregating on this data set, which I'd opened with the arrow package by pointing it at a directory of files so that I could query it. And I got a result in four seconds over 2 billion rows. That was pretty good.
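That style of query looks roughly like this. This is a hedged sketch, not the exact demo code: the directory path is illustrative, and the column names are assumed from the NYC taxi schema. At the time of the demo, the filtering and selecting were pushed down into Arrow and the aggregation ran in R after `collect()`; newer arrow versions can push more of the query down.

```r
library(arrow)
library(dplyr)

# Point open_dataset() at a directory of parquet files; nothing is read yet
ds <- open_dataset("nyc-taxi/")

ds %>%
  filter(total_amount > 100, year == 2019) %>%        # pushed down to Arrow
  select(tip_amount, total_amount, passenger_count) %>%
  group_by(passenger_count) %>%
  collect() %>%                                       # pull matching rows into R
  summarise(
    median_tip_pct = median(tip_amount / total_amount * 100),
    n = n()
  )
```

The key point is that `open_dataset()` is lazy: only the columns you select and the row groups that can match your filter are ever read from disk, which is how a 2-billion-row query fits on a laptop.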
I did a talk this past summer and I redid the result just on the latest version of the code and it was twice as fast. And I hadn't really done anything in the R package to do that. This is all based on improvements in the underlying C++ code that all the rest of the community was working on. Right before I did this talk now, I ran the code again on the latest version and it's another 25% faster. Again, I didn't do anything to make that happen. It's just based on the Arrow community and its work.
Arrow as a universal standard
There are other ways that this ecosystem plays out and benefits us in our community. When you hear that a bunch of database developers got together, decided there wasn't a good standard for columnar data, and set out to create their own standard to unify all the others, it might bring to mind the XKCD comic where the creation of a standard to solve the lack of standards just makes more problems: more standards, more competing standards. And there certainly is that risk. But five years on from the creation of Arrow, we can see how the benefits of this approach have played out.
So just to give an example: suppose I've got data in Spark and I want really quick access to it in Python and R. What you could do is write, from Spark, a special adapter for Python and a special adapter for R, one that understands, in R's case, R's vector types, the specific bits in the vectors that make up a data frame. You'd have to do the same thing for Python, for NumPy or pandas, and Spark's Java code would have to understand how to do all of this. I put another arrow in the chart to connect Python and R, because interchange between them is obviously something you'd want too, and it's another place where you'd need someone on the other side of the language barrier to understand the internals of your format.
So that's feasible if this is all you care about. But in reality there are many more languages and many more projects, too many to fit on the slide. And as you can imagine, if you try to draw all of the lines connecting the various adapters, it gets crazy really quickly. So what do you do? You're obviously not going to do all that work. What you want is a ubiquitous standard that everyone can write to and read from, so that you don't have to re-implement everything every time.
So prior to Arrow, what you'd probably do is just dump out a CSV. Every language knows how to read a CSV; every database knows how to read in a CSV. But this comes with costs. CSVs can be expensive to read and write, because you have to convert between a string format on disk and the arrays of bits that R and other languages use in memory to do work, and that conversion is costly. CSV also doesn't have rich support for types: you may have timestamps, but you have no way of indicating that a column is a timestamp, so a reader has to guess whether it should be a regular string or a timestamp. There are lots of costs and penalties that come from doing this. But CSV is the lowest common denominator, and that's why it gets picked up.
But Arrow is a happier version of that: you have a standard, and it's columnar and binary, so it matches very closely how R or Julia or Spark or any of these systems actually work with data in memory. The conversion is minimal, and in some cases doesn't involve copying memory at all, where the layouts actually line up as the same shape. So it's very efficient. From our perspective in the R community, what this means is that now that we have an arrow R package, we have very efficient access to all of these other projects, databases, and languages, as long as they can also write Arrow. And over the last five years, we've seen an increasing number of projects pick up Arrow for exactly this reason.
So going back to Spark for our example: early in Arrow's life, we added support in PySpark, the Python Spark library, for communicating very efficiently with Spark itself, which runs on the JVM, using Arrow. And you'd get a hundredfold speedup on certain operations, whether with user-defined functions or when pulling a large amount of data into your Python environment for further analysis. That worked because Java has an Arrow library, Python has an Arrow library, and they write the same format. The beauty of this is that once we finally got the R package going, we could pick up and do the same thing. (These are both references to blog posts on the Arrow website.) Because we can read and write Arrow from R, we didn't need any other special Spark work to get those 100x speedups pulling data to and from Spark in R.
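On the R side, the integration is nearly invisible in user code: with sparklyr, attaching the arrow package is enough to switch data transfer to the Arrow format. A minimal sketch, assuming a local Spark installation and the nycflights13 package for sample data (both assumptions, not from the talk):

```r
library(sparklyr)
library(arrow)   # attaching arrow switches sparklyr to Arrow serialization
library(dplyr)

sc <- spark_connect(master = "local")

# R -> Spark and Spark -> R transfers now go through the Arrow format,
# which is where the large speedups on big transfers come from
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
flights_df  <- collect(flights_tbl)

spark_disconnect(sc)
```

Nothing about the query code changes; the same `copy_to()` and `collect()` calls simply move data as Arrow instead of row by row.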
Going beyond what's possible in R today
So, the Arrow format's design for modern hardware and the power of the large community of developers around the Arrow project have really changed what we're able to do in R, enabling all sorts of new workflows that weren't possible or feasible before. It's nice, as I showed at the beginning, to see that Arrow makes reading CSVs into R faster, and we like seeing that, but that's not really the goal. We're not just trying to shave a little time off work you can already do now. We're trying to go beyond that: to enable new workflows that were either cost-prohibitive before or just, frankly, not possible at all.
There's a lot more coming in the releases this year, in 2021. I'm really excited to be able to tell you about them when they arrive, but for now I want to walk you through another example of something you can do today in the arrow package that is a little beyond what you would normally do in R, but can have some impossible-seeming results.
So earlier I showed reading a dataset of 125 files of New York City taxi data that was in the parquet format. Parquet is a file format that you can read in the Arrow package. It's a columnar, very efficient file format. Often of course you don't have parquet files yet, and one of the reasons you want to use Arrow is to get your less efficient CSV data into parquet so that you can read it more quickly. So the Arrow package has some facilities for doing this.
So here's an example. I converted some of those parquet files back to CSV; I only did six months of the data, because CSVs are big and take a lot of space. Just as you can point at a directory of parquet files and say open_dataset, you can do that with CSVs, and you can run the same kinds of queries on them that I showed with the parquet dataset: selecting, filtering, grouping, aggregating, all of these things. As you see, the result here (it's a different query from the one I showed, so it's not apples to apples) took seven and a half seconds over this data. That's pretty good. And I didn't have to mess with reading each of the files individually and doing things with them. That's good. But I think we can do better.
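The CSV version of the workflow is nearly identical to the parquet one; you just declare the format. A sketch (directory path and column names are assumptions based on the taxi schema, not the exact demo query):

```r
library(arrow)
library(dplyr)

# Same open_dataset() call, just pointed at CSVs instead of parquet
csv_ds <- open_dataset("nyc-taxi-csv/", format = "csv")

csv_ds %>%
  filter(payment_type == 3) %>%
  select(fare_amount, tip_amount) %>%
  collect() %>%
  summarise(n = n(), avg_tip = mean(tip_amount))
```

The catch, as the timing shows, is that every query has to re-parse the CSV text, which is why converting to a binary format pays off so dramatically below.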
So one thing we can do is write the dataset out to a more efficient format: either parquet, as I mentioned, or feather, which is literally the Arrow format on disk. We can use similar dplyr-style syntax to do this. I can say group_by the payment type column in the dataset; that was one of the columns I filtered on before. In this case, when I'm writing a dataset, group_by means: partition my data by that variable. It's going to write out separate files inside different directories based on the value of that variable. Payment type in this dataset takes values one to five, so now inside my feather taxi directory I've got five subdirectories, one for each value of payment type, each containing feather files.
And what this means is that when I query this dataset again and filter on payment type, I only have to look at the files inside the directory corresponding to the value of payment type I'm filtering on. I don't even have to read the other files to see if they match my filter, because the partitioning has pre-filtered them. So indeed, if we do the same query that I just did on the CSVs, but on the feather version of the dataset, it's now a hundred times faster than before. The exact same query, just with a more efficient file format, and taking advantage of partitioning in the dataset query.
So this is great. We're trying to make things that have been difficult or impossible, possible. Trying to help you take full advantage of the sports car you've got when you're working in R on your laptop. But really, I like to think that what we have is not just any old sports car: it's a DeLorean, perhaps with a flux capacitor in it. And where we're going, we don't need roads. We're going beyond what you could be doing on your laptop today, even. So thank you very much.

