
Make Big Geospatial Data Accessible with Arrow (Cari Gostic, Sonoma Technology) | posit::conf(2025)
Abstract: Firelytics is a methodology that computes sub-daily wildfire growth metrics for over 22,000 wildfires that have burned in California since 2012. The Firelytics dataset is used in studies funded by NOAA and NASA to better understand wildfire behavior and the health implications of wildfire smoke. The accompanying Firelytics Dashboard lets a user map the sub-daily time series of fire progression for any wildfire in the dataset. Firelytics is a novel tool for historical fire analysis in California that will help to better predict and plan for future wildfires. sfarrow, which brings Arrow's capabilities to spatial (sf) data, facilitates the real-time querying of big spatial data that is necessary to share the Firelytics dataset in this accessible, visual format.

Speaker(s): Cari Gostic

Subscribe to posit::conf updates: https://posit.co/about/subscription-management/
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
In this room full of programmers, I'm wondering if this situation is familiar to anyone. We made a really cool data product. It's really valuable and we're excited to share it. But then you realize what you've actually made is a firehose of information. Because much like a talk at a conference, even the most valuable data product can be useless if the delivery isn't catered to your audience.
So I was in this situation recently with a geospatial data product that my team developed. And I'm here to tell you how we use Arrow for geospatial data to deliver our product in an accessible dashboard.
Introducing Firelytics
So our cool data product is called Firelytics. And it's a historical database of fire progression in California. So that map up there shows the fire footprint in California just for 2020. Which, that's pretty scary. That's a lot of land burned in just one year. And so I'll circle the largest fire in California's history, the August Complex fire.
And this is the true value of Firelytics. It's a database that holds the spatial progression for every fire that's been satellite-detected going back to 2012. So this set of growth polygons exists for all fires on that map, plus ten more years of data. So Firelytics holds over 22,000 fires and millions of polygons.
And this data could be really useful for a variety of audiences. For example, scientists making fire predictive models. We have people working on fuel reduction or evacuation planning. And then we have the general public. Anyone who has been in the path of a wildfire recently might find this interesting. Or my sister actually pointed out this is really useful for hunting for morel mushrooms, which tend to pop up in burn scars. So she was really interested in this.
And for my team, full of R programmers, the obvious solution to deliver our data to this whole range of audiences is a Shiny dashboard. And our goal for this dashboard is for a user to pick any fire in that historical record and map its progression over time. But generally spatial data is big and slow, and it's not well suited for a dashboard, where users aren't going to sit around and wait 10 or 20 seconds for data to load.
Using Arrow for geospatial data
And that's where Arrow comes in. The geospatial Arrow suite consists of the GeoParquet file format, and we use two R packages, geoarrow and sfarrow, to help us out.
So I'll show you some benchmarks for that 2020 fire data I showed earlier. We saw some savings in memory. The GeoParquet file is a little bit smaller than a shapefile. But what was really impressive to me and my team is the read-in time, which for Arrow is less than half that of sf. And this is just the beginning of the time efficiency offered by Arrow.
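A comparison like the one described here might be sketched as follows. The file paths are hypothetical, and this assumes the sf, sfarrow, and bench packages are installed; sfarrow::st_read_parquet() reads a GeoParquet file directly into an sf object.

```r
# Sketch: benchmarking read-in time for a shapefile vs. GeoParquet.
# File names are illustrative, not the actual Firelytics files.
library(sf)       # reads shapefiles via st_read()
library(sfarrow)  # reads GeoParquet into sf objects

bench::mark(
  shapefile  = sf::st_read("fires_2020.shp", quiet = TRUE),
  geoparquet = sfarrow::st_read_parquet("fires_2020.parquet"),
  check = FALSE  # the geometries match, but object metadata may differ
)
```

bench::mark() also reports memory allocated per expression, which covers the memory comparison mentioned above.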
If you've used Arrow before, all of the functionalities that are available for tabular data are also available for geospatial data. And that includes things like partitioning your data by a grouping variable; here we partition by fire ID. As well as pre-filtering and pre-aggregating your data before loading it into working memory. So here we use a dplyr filter to load only fire ID number one into working memory, out of the whole Firelytics database.
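The partition-then-filter pattern described here could look roughly like this. The path and the `fire_id` column are illustrative (based on the talk, not the actual Firelytics code), and `fires` is assumed to be an sf data frame of growth polygons:

```r
# Sketch: partition a spatial dataset by fire ID, then query a single fire.
library(arrow)
library(dplyr)
library(sfarrow)

# Write the sf object `fires` as a Parquet dataset partitioned by fire ID,
# so each fire's polygons land in their own directory.
sfarrow::write_sf_dataset(fires, path = "firelytics", partitioning = "fire_id")

# Open the dataset lazily -- no data is read into memory yet.
ds <- arrow::open_dataset("firelytics")

# Filter before collecting, so only one fire's polygons are ever loaded.
fire_1 <- ds |>
  dplyr::filter(fire_id == 1) |>
  sfarrow::read_sf_dataset()
```

Because the dataset is partitioned on `fire_id`, the filter prunes to a single partition directory rather than scanning the full database.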
And this is really well suited for our dashboard. We can very quickly load very small subsets of Firelytics into our app.
So here is someone clicking through the fire progression for that August fire. And each click here runs a pull to the Firelytics database, which is hosted in the cloud as an Arrow data set. So this app is actually very light and responsive because only the shape that you see on the map at any given time is loaded into working memory.
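In a Shiny app, the per-click pull described here might be sketched like this. The S3 URI, column names, and input IDs are all hypothetical; the point is that the reactive collects only the currently selected shape from the cloud-hosted Arrow dataset:

```r
# Sketch of the dashboard pattern: each click queries the cloud-hosted
# dataset and loads only the polygon currently shown on the map.
library(shiny)
library(arrow)
library(dplyr)
library(sfarrow)

# Hypothetical cloud location of the partitioned Firelytics dataset.
ds <- arrow::open_dataset("s3://firelytics-bucket/firelytics")

server <- function(input, output, session) {
  # Re-runs on every click; only the selected fire/timestep is collected.
  current_shape <- reactive({
    ds |>
      dplyr::filter(fire_id == input$fire, timestep == input$step) |>
      sfarrow::read_sf_dataset()
  })

  output$map <- leaflet::renderLeaflet({
    leaflet::leaflet(current_shape()) |> leaflet::addPolygons()
  })
}
```

Keeping `ds` as a lazy dataset handle outside the server function means the app never holds more than one shape's worth of geometry in working memory at a time.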
And the geospatial suite for Arrow is relatively new, within the past few years. So this is a PSA that it exists. And it's really powerful and impressive. I actually also found it very easy to use. I've been known as a bit of a ding dong, and I didn't really have much trouble learning it. So next time you find yourself loading geospatial data into R, I really suggest you give it a try.
