Dewey Dunnington | Accelerating geospatial computing using Apache Arrow | RStudio (2022)
The ‘arrow’ R package and wider Apache Arrow ecosystem provide an end-to-end solution for querying and computing on in-memory and bigger-than-memory data sets using the Apache Arrow C++ library. In this talk we introduce the ‘geoarrow’ package, which extends Arrow to provide efficient columnar storage for spatial types and functions to support spatial queries in the Arrow compute engine. We focus on a workflow where (1) data are stored in multiple files that can be hosted remotely (e.g., on S3-compatible storage), (2) queries are processed batchwise and in parallel, allowing for efficient processing of bigger-than-memory geospatial data, and (3) results can be passed without copying to Rust, Python, or other R packages for further analysis. Talk materials are available at https://github.com/rstudio/rstudio-conf/blob/master/2022/deweydunnington/Accelerating%20geospatial%20computing%20using%20Apache%20Arrow%20-%20Dewey%20Dunnington.pdf
Session: Lightning Talks
Transcript
This transcript was generated automatically and may contain errors.
My name is Dewey Dunnington. I'm a software engineer at Voltron Data and a developer on the Apache Arrow project, where I get to work on cool things like geospatial data and R. One of the things that came out of that is the geoarrow package, which takes my favorite things about Arrow and my favorite things about R, which is rspatial, and puts them together.
Parquet files
My first favorite thing about the arrow package is Parquet files. Parquet files are a little bit like CSVs in that you can read and write tables, but they're binary, so they're smaller and faster to read and write, and they remember your data types. If you write a date to a Parquet file, you get a date back. If you write a number to a Parquet file, you get a number back.
With geoarrow, if you write an sf object to a Parquet file, you get an sf object back, which is really useful, and it turns out it's also really fast. So if you find yourself waiting for your files to load, try Parquet. If you find yourself waiting for your geospatial files to load, try geoarrow's GeoParquet reader.
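A minimal sketch of the round trip described above, assuming the arrow package is installed; the file name is hypothetical, and the geoarrow calls are shown as comments because the package was still in development at the time of this talk and its function names may differ between versions:

```r
library(arrow)

# Data types round-trip through Parquet: write a data frame with a
# Date column and read it back unchanged.
df <- data.frame(
  city = c("Halifax", "Philadelphia"),
  visited = as.Date(c("2022-01-01", "2022-06-15"))
)
write_parquet(df, "cities.parquet")
df2 <- read_parquet("cities.parquet")
class(df2$visited)  # still a Date, not a character string

# With geoarrow, sf objects round-trip the same way (hypothetical
# object `nc_sf`; exact function names depend on the geoarrow version):
# geoarrow::write_geoparquet(nc_sf, "nc.parquet")
# nc2 <- geoarrow::read_geoparquet_sf("nc.parquet")
```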
Datasets and parallel queries
My second favorite thing about the arrow package is datasets. Datasets let you take a whole lot of files, split them up however you want, and query them like they're a database.
In this example, I have a whole bunch of parking infractions from the city of Philadelphia, and I have them hosted on an S3 bucket. They could be on my computer, too. Here, using the arrow package, I can open them as a dataset, and I can query them using dplyr, and when I hit collect(), Arrow takes care of the details.
And there's a lot of details to take care of. Arrow pulls the data from a remote data source or from your computer, and Arrow runs your query in parallel using all the cores on your computer, and it splits up the data into manageable chunks just in case the dataset is bigger than your memory. A lot of geospatial data is really big, and a lot of it is bigger than memory, so this lets you use that workflow with geospatial data, and it just works.
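The dataset workflow above can be sketched as follows, assuming the arrow and dplyr packages are installed; the bucket path and column names are hypothetical stand-ins for the Philadelphia parking data:

```r
library(arrow)
library(dplyr)

# Open a directory of Parquet files (local or on S3) as one dataset.
# Nothing is read into memory yet.
ds <- open_dataset("s3://my-bucket/philly-parking/")

# Query it like a database: dplyr verbs are translated to Arrow
# compute operations, and collect() triggers a parallel, batchwise
# scan so the full dataset never has to fit in memory.
ds |>
  filter(fine > 50) |>
  count(violation_desc, sort = TRUE) |>
  collect()
```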
Simplifying the rspatial ecosystem
The final thing is something a lot of users don't think about, but it's something I'm really passionate about as an rspatial package maintainer: simplifying how all of the components that make up the rspatial ecosystem are maintained and how they fit together.
There's a lot of components, and they all come together, hopefully silently, to make your experience as an rspatial user awesome. But there's a lot of interactions, and there's a lot of people that spend a lot of time managing all of those interactions and maintaining them.
GeoArrow uses a data structure that we can pass between all of those, which means that we can define an interaction for every single component rather than every single pair of components, which is a lot less maintenance work, and it means that rspatial developers like me can spend more time adding features, which is fun, and less time fixing bugs, which is less fun.
And finally, because that thing that we're passing around is an Arrow array, we get all of the Arrow ecosystem at our disposal, which includes the Arrow package, Parquet files and datasets, but also the wider Arrow ecosystem, which includes bindings for Python and Rust and Julia and a lot of other languages and a lot of other frameworks that have adopted Arrow.
The GeoArrow package is still under development, as well as the Apache Arrow integration that enables this, but if you want to follow along, you can follow me on Twitter or follow me on GitHub, or visit my website where I blog about all sorts of incredibly nerdy things involving geospatial, R, and Arrow.