
Neal Richardson | Accelerating Analytics with Apache Arrow | RStudio (2020)
The Apache Arrow project is a cross-language development platform for in-memory data designed to improve system performance, memory use, and interoperability. This talk presents recent developments in the 'arrow' package, which provides an R interface to the Arrow C++ library. We'll cover the goals of the broader Arrow project, how to get started with the 'arrow' package in R, some general concepts for working with data efficiently in Arrow, and a brief overview of upcoming features.
Transcript
This transcript was generated automatically and may contain errors.
Our first speaker is Neal Richardson from Ursa Labs. Thanks. All right. So I'm going to talk today about the Arrow project, both the Arrow R package, as well as the broader Apache Arrow project that it's within.
I've got a link to the slides there, and I've also tweeted it at @enpiar. You can also tweet at or follow @ApacheArrow for more news. Don't worry if you don't copy this down; there's a link to the slides on every slide.
So before I jump in to talk about Arrow, I first wanted to introduce who I am and who Ursa Labs is. I'm the maintainer of the R package for Arrow, and I'm also the engineering director at Ursa Labs. We're a not-for-profit organization focused on building open data science tools, sponsored by RStudio and a number of other companies, and centered on the Apache Arrow ecosystem.
If you're interested in learning more about us or have follow-up questions from the talk, we're going to be in the expo hall, the Yosemite room, at 3:45 today.
What is Arrow?
So generally when I tell people that I work on the Arrow project in data science, they're like, oh, yeah, Arrow, that's awesome, I've heard of that, that's really cool. What is Arrow?
And Arrow's a lot of things, and that kind of defies a simple explanation. One way that I like to frame it in terms particularly of what it means for the R community is that it's a foundation for the next generation of data frames.
As anyone who's worked with bigger data in R has surely experienced, R is great: there are lots of powerful tools and really expressive ways of getting work done. But once your data sets get bigger than the memory on your laptop, you have to do some special tricks to make things work. Often the data is split across multiple files, and you can work with that, but again, it's not as natural. Many data sets also have more complex data types than the base character, numeric, and factor types. And of course, the more data we have, the more we want to throw all of our hardware at it. All of this is possible in R, but it's just not how R itself was designed.
But we're in good company. These are not just problems of R. So Wes McKinney, who created the Pandas library for Python, ran into similar challenges there, and constraints on memory and on data types and whatnot. And so Wes and some others from the data community created the Apache Arrow project in 2016.
You may have used the feather package, which came out of some of the initial explorations there, a prototype of the Arrow format that makes it really easy to write data frames in R and read them in Pandas, and vice versa. The idea was to draw on all these lessons from R, from Pandas, and from the database community, and build the next generation of data frame tools, not tied to a specific language or implementation.
So Arrow is fundamentally two classes of things. First, there's the format: the specification of how bits are arranged in memory. It's columnar, and it's independent of any language. Then there are implementations, or bindings to other implementations, currently in 11 languages. The Python library and the R package both bind to the C++ library, which is where a lot of the development happens.
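To make that concrete, here's a minimal sketch of moving data between R and Arrow's columnar format; it assumes a recent version of the arrow package, where Table$create() builds an Arrow Table from a data frame:

```r
library(arrow)

df <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"),
                 stringsAsFactors = FALSE)
tab <- Table$create(df)  # columnar, language-independent memory layout
tab$schema               # e.g. x: int32, y: string
as.data.frame(tab)       # and back to an ordinary data frame
```

The same Table, handed to Python or any other Arrow implementation, is readable with no conversion, because the in-memory layout is the specification.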
The Arrow R package
So the R package has been on CRAN since August, and we're about to do a new release; I'm hoping it will be up on CRAN next week, CRAN willing. Some of the things I'm going to show you right now, a little code and a few other features, are not on CRAN yet. If you're interested in checking them out, not right now, but after I'm done talking, I guess, we build nightly versions of the packages that you can download: there's a repository you can point at and then run install.packages() against. Don't worry, I'll show these links again, and they're also on the slides if you want to follow along.
Datasets API demo
I mentioned code. So let me do this. Let me show you some code.
So one of the cool features coming out in the upcoming release of the R package is the datasets API. This lets us point at a directory of files, treat them as a single entity, and then do dplyr on them. dplyr is not a hard dependency of the arrow package; it's an optional Suggests dependency, so we have to load both packages to use this feature.
I'm going to look at the New York City taxi data: over ten years of data, split into one file per month. You can see what's in the directory here: 125 files, so ten years and five months. Each directory is a year, then a month, with a Parquet file inside. This is a lot of data, two billion rows, and even with compression it's 37 gigabytes, more than I can read into memory on this laptop.
The new version of the R package has a function, open_dataset(), that lets me just point at a directory. I can also specify information about the path segments of the files in there; in this case, the paths encode year and month, which is useful data in our dataset. So when it scans all of the files and inspects the metadata of those Parquet files to get their common schema, you can see here that year and month also get added as essentially virtual columns. I can then use those in my queries, and importantly, I can filter against them, which means I don't have to actually load all of the files to know that, for example, only files in the 2015 directory will have data for 2015.
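A sketch of that call, assuming the files live under a local nyc-taxi/ directory laid out as year/month/data.parquet (the path here is illustrative):

```r
library(arrow)

# The path segments become virtual "year" and "month" columns
ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
ds  # prints the common schema, including year and month
```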
So I can run this fairly basic dplyr: do some filtering, select columns, and then compute the median tip percentage on these rides. I just scanned two billion rows in under four seconds, and here's my table.
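The query looked roughly like this; the column names (tip_amount, total_amount, passenger_count) come from the taxi data and are illustrative. The filter and select are pushed down into the file scan, and only the matching rows are pulled into R by collect(), where the aggregation runs:

```r
library(arrow)
library(dplyr)

ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))

ds %>%
  filter(total_amount > 100, year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  group_by(passenger_count) %>%
  collect() %>%  # pull only the filtered rows into R
  summarize(median_tip_pct = median(tip_amount / total_amount * 100))
```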
For those of you following along on the slides, there's a link there to that notebook, which is also online, so you can check it out later. There's also a vignette in the new version of the package that walks through this and talks a little more about the feature.
So with this datasets feature, we're able to point at a directory, and we can use parallel processing. I didn't have to tell it to do it. It's just taking advantage of all the CPU it can. And we're able to make efficient queries on this. And all the work is pushed down into these individual file chunks. So I don't need to be able to read the whole dataset into memory in order to query with it.
There's a lot more we plan to do on this this year. So if this is interesting to you, stay tuned. More file formats, cloud storage, so that I can point at an S3 bucket, for example, and do the same type of operation just as smoothly. As well as faster aggregation in C++.
Parquet file format
I mentioned Parquet along the way. I believe the arrow package is the only one on CRAN that lets you read and write Parquet files. If you're not familiar with it, Parquet is a common columnar binary file format used in a lot of data warehousing and ETL systems. The really cool thing about it is that it supports multiple forms of compression and encoding, so you can get really small files that are also really fast to load and work with. We did some benchmarking of this a few months ago, and you can see that with the example file we're looking at here, the read time is in the same ballpark as all these other readers, but the file size is an order of magnitude smaller. So if you're not using Parquet, check it out. There's some cool stuff there.
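A quick sketch of the Parquet functions in the arrow package; write_parquet() uses snappy compression by default when it's available, and because the format is columnar, reading a subset of columns is cheap:

```r
library(arrow)

write_parquet(mtcars, "mtcars.parquet")

# Read the whole file back...
df <- read_parquet("mtcars.parquet")

# ...or just the columns you need, without touching the rest
df_small <- read_parquet("mtcars.parquet", col_select = c(mpg, cyl))
```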
Why Arrow is unique
So I'm excited about this stuff. I think there's a lot of cool work in the new version of the arrow package and in what we're working on this year. But there are lots of ways to solve these problems. This is R, of course, so there are many solutions to every problem, and you could cobble together something from a number of different packages: memory mapping, handling multiple files, pointing at a database, any number of approaches. Those are totally valid, and I'm certainly not going to try to convince you that you should use Arrow instead.
What I want to talk about now, though, is why I think the Arrow project is unique and why it's worth following and considering whether it's appropriate for your projects.
So the first thing is that I just demoed one feature here, but the Apache Arrow project is much larger than that. One measure of that is the number of contributors: we've got around 400, and it's a steadily growing number. As you can see over the course of the project's history here, we're picking up about two new contributors every week. So this is a growing community, with lots of people adding things to make the software better for you. In contrast with a project that has maybe one or two core maintainers, I think the sustainability of a project with this type of foundation is really good.
Second, as I mentioned, there's a lot more going on than just the features I demoed in R. These projects here are part of the broader Arrow project, and they don't currently have R bindings; hopefully we'll get some of those this year. One that's really cool, just to highlight one, is Flight, a client-server framework for shipping columnar data around the network. Some folks have been using it and found it to be really fast and performant, so I'm excited about seeing that in R as well.
So there are lots of contributors. There are lots of projects that are also already using Arrow. And I don't say this to say, well, everyone else is doing it, so you should do it. But what it means is that there's a lot of other projects that have a vested interest in the growth and health of Arrow and in making it better. And also, if you're using some of these other projects, Arrow improves interoperability. So it is language independent. The format is language independent. And so it makes it easier for you to efficiently get data in and out of whatever database or system.
One example of this in R is the sparklyr package. Javier, who's talking next, added features to sparklyr to take advantage of Arrow. Apache Spark can emit Arrow data, and by adding support for reading that in R, just by loading the arrow package, you can get a 40x speedup when pulling data out of Spark. You don't have to do anything else. I think there are a lot of opportunities for things like this. And like I said, if this is interesting to you, don't leave after I'm done talking. I know you're all here just for me, but Javier's going to talk a bit more about things in that ecosystem.
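A sketch of what that looks like, assuming a local Spark installation set up through sparklyr; simply attaching the arrow package is what switches sparklyr's data transfer over to Arrow:

```r
library(sparklyr)
library(dplyr)
library(arrow)  # loading this is enough; no other code changes

sc <- spark_connect(master = "local")

tbl <- copy_to(sc, mtcars)  # R -> Spark, serialized with Arrow
df  <- collect(tbl)         # Spark -> R, also via Arrow

spark_disconnect(sc)
```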
I've said this a few times, but Arrow is language agnostic, and I think this has a number of nice properties for us in the R community. Unlike saving an R data file or something, if you write using Arrow, the data you write is immediately readable by people in Python or Java or Rust or Go or JavaScript or whatever. That really helps us collaborate with people who, for whatever reason, aren't using R. It also gives us an on-ramp to all these other projects that use Arrow as their core representation but are not implemented in R.
So one thing I'm hoping to get out this year at some point is support in reticulate for Arrow data types. This is just hypothetical code, because it hasn't been released yet. But for one example, NVIDIA has the RAPIDS project, a number of largely Python-based libraries that use GPU acceleration for machine learning. The one in this example, cuDF, is a data frame library that runs on GPUs. With reticulate and Arrow, we should be able to do our work using cuDF and then continue with our workflow in R as if it had all been in R. So again, this is not in the package; don't run this code right now. But hopefully very soon.
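In that spirit, a hypothetical sketch of what such code might look like; none of this is released, and the cuDF calls shown (from_arrow(), to_arrow()) are assumptions about the Python API as seen through reticulate:

```r
library(arrow)
library(reticulate)

cudf <- import("cudf")  # NVIDIA's GPU data frame library

# R data frame -> Arrow Table -> GPU data frame (hypothetical path)
gdf <- cudf$DataFrame$from_arrow(Table$create(mtcars))

# Do the heavy lifting on the GPU, then come back to R via Arrow
result <- gdf$groupby("cyl")$mean()
as.data.frame(result$to_arrow())
```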
Getting involved
All right, so if this sounds interesting to you, I have some offers for you. The first thing you can do to get involved is, of course, try out the package. I mentioned it's on CRAN, and if you use conda, we also have a conda recipe. Here's the link again to our nightly repositories: we build binary packages for Mac and Windows, as well as C++ binaries for a number of Linux distributions, so you can just point at our repository and download the latest development build that way.
I mentioned Linux. If anyone here has tried installing arrow on Linux previously and felt frustrated because you had to install system packages or whatever, I see some heads nodding, sorry. But as of the upcoming release, this should no longer be an issue: it will just work on Linux with no system dependencies required. If you're really curious and want to learn more about that, we have a vignette that shows how it works.
I mentioned our dev builds. We only do CRAN releases every three months or so, so if you want to check out what's going on in between, which is often a lot, you should try our nightly builds. I would not recommend using install_github() for arrow, just because of the C++ build it has to deal with; if you use our nightly builds, you won't have to worry about that.
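The install options, as a sketch; the nightly repository URL below is a placeholder, since the real link is on the slides:

```r
# Released version from CRAN
install.packages("arrow")

# Latest nightly build: point install.packages() at the nightly
# repository (placeholder URL -- see the slides for the real one)
install.packages("arrow", repos = "https://example.com/arrow-r-nightly")
```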
When you use Arrow, tell us about it. We are actively working hard to improve it, and there is zero chance we have anticipated every use case or weird data shape out there. I say weird, not in a judgmental way. There's lots of interesting data out there. And so tell us about it. We'd love to make it better.
And of course, this is open source, community-supported software, and we can't do it without contributors and users like you. If you file an issue, we will probably ask whether you're interested in submitting a pull request for it; please do. Some of the projects can be highly technical, but other parts are much more accessible, and we'll do our best to provide on-ramps for people who want to contribute. Documentation improvements are totally fair game, and I'm sure there's plenty of room for those.
Finally, you can also sponsor Ursa Labs. As I mentioned, we're a not-for-profit organization funded by a number of companies, first among them RStudio. That lets us do some of the important things, like what JJ was talking about in the keynote this morning: having a dedicated team working to improve and foster the health of the community. It also lets us get bigger projects off the ground and build the critical mass to kick off things like the datasets feature or the query engine we're working on this year, which is really difficult to do in a pure volunteer open source project where people contribute the time that they can.
And also, we take seriously the open source community and our role within it as kind of maintainers and helping to make it easier for other people to contribute. So info at ursalabs.org if you're interested in this. And also, I don't know if this link works yet, but it looks like it might. We are hopefully coming up on GitHub sponsors, so you can just click there and give us money that way too.
Thanks a lot. As I mentioned, I'll take some questions now. You can tweet at or follow @ApacheArrow, and that's me, @enpiar. And we'll be in the Yosemite room in about an hour, after this session, if you have any further questions. Thanks a lot.
Q&A
So we do have time for a few questions. If you haven't discovered it already, you can also submit questions on slido.com using the code HEXAGON that gives you a chance not only to submit questions, but vote on other people's questions as well. So does anybody have a question? There are a bunch on Slido. I was giving the room first. All right. Well, I'll go right to Slido then.
The most popular question on Slido, with 10 votes: data.table is multi-threaded and very fast. What are the performance metrics between Arrow and data.table?
So it's not really a fair comparison; they have different goals. What I just showed here, querying against a multi-file dataset in Parquet, I haven't tried with data.table. I don't know if it supports multi-file datasets; if it does, that would be interesting. I think the goals of what we're doing in Arrow are slightly different from data.table's goals. But one of the things that's very important to us is performance, obviously, and one of our goals for the year is to put up benchmarking information, so that when we have questions like this, I can give a hard number instead of not having an answer.
So Javier, if you want to step up and start getting ready, we'll do another question from the audience. I just wanted to say that I was testing that the other day, and on some very large but very simply structured data sets, Arrow was reading and writing about twice as fast. Cool. But on relatively small data sets, data.table was quite a bit faster. All right. Good to know. Thank you.
Another popular question is: is Feather dead? And there was a related question about whether Feather is appropriate for long-term storage. Excellent questions. We view Feather as kind of like the first version of the Arrow format standard. It's very effective, but it has a number of limitations in terms of the data types it supports. What we're looking to do is have a 2.0 version of Feather that is essentially the Arrow format going forward. There are a number of things we're working on this year around the stability of the Arrow format and its long-term applications, and we've got our 1.0 release coming up, hopefully in the coming months. Right, Wes? So no, it's not at all dead, and I think there will be an evolution going forward. Excellent. Thank you. Another round of applause for Neal, please. Thank you.

