Transcript

This transcript was generated automatically and may contain errors.

So the summary is that we decided last year to create a not-for-profit development group to work on the Apache Arrow project, which I will tell you about. There are a number of reasons why we decided to do this: partly because the Arrow project involves not just the R world but also the Python world and many other programming language ecosystems, and we wanted to be able to build relationships with many different companies around the world that want data science tools to become faster and more scalable in the future.

So Hadley and I are the leadership for the organization. We now have five full-time engineers: three people in North America and two in Europe. Ben and François are here, so if you see them, please say hello. Folks from the Tidyverse team are also involved on a part-time basis to work on the R side of things.

We are still hiring, so if you know anyone who likes to work on database systems, or likes to do benchmarking and test and build automation, we have a lot of that to do, so please definitely get in touch.

So as Tareef was saying in his keynote earlier today, RStudio has been really amazing in helping us get this off the ground, essentially making it so that I don't have the overhead of having to run a company and can focus on doing software engineering. I used to work at Two Sigma, and they're continuing to sponsor the work, and we've had NVIDIA and the Open Data Science Conference join as sponsors for 2019. NVIDIA just gave us some really high-end hardware for testing the software, so that's been great.

The problem with current data science tools

So the executive summary of what this is all about: if you look at how data science tools work now, like base R, NumPy and SciPy in the Python ecosystem, or MATLAB, a lot of the semantics of how these tools work are based on the way that computational systems worked in the 1980s and 1990s, so everything's built on a foundation of 30- to 40-year-old Fortran code.

We have seen tools augmented with multi-core algorithms and some more sophisticated execution, but in general, you have things that are single-core, that evaluate eagerly. The language ecosystems are pretty fragmented, so you very rarely see R programmers collaborating with Python programmers and sharing libraries of code to power their applications.

So the world that I would like to see is one where we can take advantage of all this fancy hardware that we have now. We have graphics cards, and you can buy a desktop with 16 physical cores and 32 virtual cores, so you should be able to take advantage of that. We can compile code at runtime and generate custom functions that are specialized for your hardware. But a big part of it, also, is that I would like to see the language ecosystems able to collaborate and share code and data.

So one reason that not as much of this work has happened yet is Moore's Law: if your code ran slow, you could just wait, because in a couple of years your processor would be twice as fast. That has essentially stopped happening, so we need to get more out of the hardware that we have, because it's not getting that much faster year over year. Although we do have graphics cards, which seemingly are continuing to scale, with thousands of cores on a single card.

Apache Arrow's vision

So I helped launch the Arrow project about three years ago, and our central idea is that we wanted to define a language-independent open standard for data frames. You might say, well, we have data frames in R and data frames in Python, but if you look under the hood at how those data frames are implemented, how they're arranged in memory, the memory representation is very different. As a result, you can't really share code, and if you want to share data from one runtime environment to another, you have to serialize it, which is very expensive.

So essentially the view of the world that we would like to see is to have tools able to collaborate and work on a common data representation that is not tied to a particular runtime environment or programming language.
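To make the idea concrete, here is a minimal sketch in plain Python (not the real Arrow implementation, and the function names are made up for illustration) of an Arrow-style column: values live in one contiguous buffer plus a validity bitmap for nulls, so any runtime that understands the layout can interpret the same bytes without a serialization step.

```python
# Conceptual sketch of a columnar layout in the spirit of Arrow:
# one contiguous values buffer plus a validity bitmap for nulls.
from array import array

def make_int64_column(values):
    """Build (validity_bitmap, values_buffer) from a list, with None for nulls."""
    bitmap = bytearray((len(values) + 7) // 8)
    buf = array("q", [0] * len(values))      # contiguous 64-bit signed ints
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)   # mark slot i as valid
            buf[i] = v
    return bitmap, buf

def column_to_list(bitmap, buf):
    """Any consumer that knows the layout can decode the buffers directly."""
    return [buf[i] if (bitmap[i // 8] >> (i % 8)) & 1 else None
            for i in range(len(buf))]

bitmap, buf = make_int64_column([1, None, 3])
print(column_to_list(bitmap, buf))  # [1, None, 3]
```

The point of the shared layout is that the producer and the consumer here could just as well be two different processes, or two different languages, reading the same buffers.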

So here's kind of the bird's-eye view of where the project is going, since I have very little time left. What I would like to do, and the way that we've been building the community, is to bring together the data science world with the analytic database community. A lot of the performance and scalability work in working with big data, I don't know if we're still saying "big data" now, has happened in the analytic database world, and relatively little of that work has made its way into the data science ecosystem.

And if you look at the way that database engines are built, typically you interact with them through SQL or some other frontend API, usually SQL queries, but those queries then interact with a bunch of components that deal with query execution, in-memory data management, and interactions with storage, whether that's disk or cloud storage or whatever. So what we are doing in the Arrow project is building a set of high-quality components with public APIs that can be used as the basis for building different kinds of essentially embedded database engines, which can run on a laptop or in a distributed cluster.

But we aren't really being prescriptive about the frontend: if you want to build a database, you can build a database; if you want to build a data frame library, or build something that can be used by an R or Python programmer, you can do that as well.

Primary use cases

So we have three primary use cases, and I'm going to tell you very briefly about how this is relevant to R programmers. The first big thing we're concerned with is getting faster access to data: being able to read and write datasets in the most important file formats as fast as possible. The second is moving data around as efficiently as possible, whether within a single node, so between one process and another through shared memory, or between nodes in a network through remote procedure calls, so through sockets or other inter-node communication.
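The shared-memory case can be sketched with Python's standard library (this is illustrative only, not how Arrow itself is implemented): because a column is a plain contiguous buffer, a second process can attach to the same memory segment and read the values with no serialization step. Here the "producer" and "consumer" are simulated by two handles in one process.

```python
# Illustrative sketch of sharing a contiguous numeric buffer through
# shared memory, so a consumer can read it without serialization.
from multiprocessing import shared_memory
import struct

values = [10, 20, 30]

# "Producer": allocate a shared segment and write three int64s into it.
shm = shared_memory.SharedMemory(create=True, size=8 * len(values))
struct.pack_into("<3q", shm.buf, 0, *values)

# "Consumer": attach to the same segment by name and decode in place.
reader = shared_memory.SharedMemory(name=shm.name)
decoded = list(struct.unpack_from("<3q", reader.buf, 0))
print(decoded)  # [10, 20, 30]

reader.close()
shm.close()
shm.unlink()
```

No bytes were copied into a wire format and parsed back; both sides agreed on the memory layout, which is the essential idea behind Arrow's interprocess data movement.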

The last thing is that we want to be able to compute analytics. If you're using R or Python, you're already doing analytics every day when you use data frame libraries. These would be standardized libraries of algorithms that could be used in many different scenarios, and we want to take advantage of the latest and greatest in compiler technology to specialize functions, so they go even faster based on prior knowledge of what you're doing with your data.

Relevance to R

So how is this all relevant to R? Well, we are building bindings for the Arrow libraries, which are mostly written in C++. That work on the R side has been led by Romain François, who you may know from the dplyr project.

So I will talk briefly about R data frames, because I have little time. The key use case where there's a lot of interest in the R world is being able to memory-map and interact with very large on-disk data sets that don't fit into memory. This is not about replacing dplyr or replacing data.table; it's about having tools that can be used to accelerate those projects while leaving the user interface, the code that you write, largely unchanged.

So you may know about dplyr. dplyr is already a domain-specific language where you write a sequence of dplyr verbs and expressions, and those expressions are effectively compiled into whatever backend dplyr is executing against. So dplyr can execute against SQL engines, or it can execute using its in-memory engine. Our goal is to handle the case where you have a very large data set coming out of the Arrow ecosystem, whether it's memory-mapped on-disk data, a directory of Parquet files, or a set of cloud buckets that contain Parquet files.

So using dplyr, you may have an Arrow data set and a collection of dplyr verbs, and we can translate these standard dplyr expressions into computation graphs that are evaluated by a parallel runtime coming out of the Arrow project. We can also do fancy things with non-standard evaluation, kind of unevaluated R expressions: when you write this in dplyr, the expressions don't get evaluated immediately, so we can use LLVM to compile them to run even faster against the data sets.
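The deferred-evaluation idea described above can be sketched in a few lines of Python (the class and method names here are invented for illustration; this is not the dplyr or Arrow API): each verb records an operation instead of running it, and the whole pipeline only executes when you ask for the result, which is the point at which a backend could optimize or compile the plan.

```python
# Minimal sketch of a lazy, dplyr-style pipeline: verbs build an
# unevaluated plan; nothing runs until collect() is called.
class LazyFrame:
    def __init__(self, rows, ops=()):
        self.rows = rows          # underlying data (list of dicts here)
        self.ops = ops            # recorded pipeline of verbs

    def filter(self, pred):      # analogous to dplyr::filter: record, don't run
        return LazyFrame(self.rows, self.ops + (("filter", pred),))

    def mutate(self, **cols):    # analogous to dplyr::mutate
        return LazyFrame(self.rows, self.ops + (("mutate", cols),))

    def collect(self):           # only now is the plan actually executed
        rows = self.rows
        for op, arg in self.ops:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "mutate":
                rows = [{**r, **{k: f(r) for k, f in arg.items()}}
                        for r in rows]
        return rows

df = LazyFrame([{"x": 1}, {"x": 5}])
plan = df.filter(lambda r: r["x"] > 2).mutate(y=lambda r: r["x"] * 10)
print(plan.collect())  # [{'x': 5, 'y': 50}]
```

Because `plan` is just data describing the computation, a real engine could inspect it, fuse steps, run it in parallel, or hand it to a compiler before touching any rows.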

So anyway, I'm sorry about the AV problems, but these things happen. There are a lot of interesting things on the horizon for the project, and I'm excited to collaborate with the R community on getting the software into production and making your data science work faster, more efficient, and hopefully a little bit more fun.

Many of you may be aware that about three years ago, Hadley and I started the Feather project, which uses the Arrow technology internally. Another thing we'd like to see is an evolved version of Feather that has more features and is faster and more efficient than what is there now. So even if you're already a Feather user, that will be improved eventually. Anyway, check out the slides, and I'll be around the conference if you'd like to chat. Thank you.