
Edzer Pebesma | Spatial data science in the Tidyverse | RStudio (2019)
Package sf (simple features) and ggplot2::geom_sf have caused a fast uptake of tidy spatial data analysis by data scientists. Important spatial data science challenges are not handled by them, including raster and vector data cubes (e.g. socio-economic time series, satellite imagery, weather forecast or climate prediction data) and out-of-memory datasets. Powerful methods to analyse such datasets have been developed in the packages stars (spatiotemporal tidy arrays) and tidync (tidy analysis of NetCDF files). This talk discusses how the simple feature and tidy data frameworks are extended to handle these challenging data types, and shows how R can be used for out-of-memory spatial and spatiotemporal datasets using tidy concepts.

VIEW MATERIALS https://edzer.github.io/rstudio_conf/2019/index.html

About the Author

Edzer Pebesma

I lead the spatio-temporal modelling laboratory at the Institute for Geoinformatics. I hold a PhD in geosciences, and am interested in spatial statistics, environmental modelling, geoinformatics and GI Science, semantic technology for spatial analysis, optimizing environmental monitoring, and also in e-Science and reproducible research. I am an ordinary member of the R Foundation. I am one of the authors of Applied Spatial Data Analysis with R (second edition), Co-Editor-in-Chief of the Journal of Statistical Software, and associate editor of Spatial Statistics. I believe that research is useful in particular when it helps solve real-world problems.
Transcript
This transcript was generated automatically and may contain errors.
This is joint work with Michael Sumner, who is regrettably not here, and Etienne Racine, who is here, sitting in the front, and I'm pretty confident, I think I can promise you, that this will be the only talk in this session without moving images.
So I talked last year about simple features, and I couldn't find a better summary than this lovely illustration that Allison Horst posted on Twitter recently. Simple features basically makes handling feature data, that's point, line, and polygon spatial data, easier in R, because it glues the geometries in as an additional column in your data frame or tibble, and it keeps them there, and then you can do all your regular manipulations with that. In addition, you can do all kinds of spatial operations, like spatial joins, and operate on the geometries.
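As a minimal sketch of that idea, assuming the sf and dplyr packages are installed, and using the North Carolina demo shapefile that ships with sf:

```r
library(sf)
library(dplyr)

# read a demo shapefile shipped with sf; the result is a data.frame
# with an extra geometry list-column
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
class(nc)  # "sf" "data.frame"

# regular tidy manipulation works; the geometry column is "sticky"
# and is kept even when not explicitly selected
nc %>%
  filter(AREA > 0.2) %>%
  select(NAME, BIR74)
```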
Simple features package updates
So that caught on very well, that is the good news. There's a strong uptake of the simple features package by both users and developers, and we improved with that the interoperability with spatial databases, GeoJSON, and also coordinate reference system handling in R. Spatial indexes are now computed on the fly, for instance by st_intersects, and we now also have plot support in ggplot2 that takes care of reprojecting incompatible layers.
There is an additional thing that we added to the simple features package, and it addresses the following issue. It is very easy to find out whether a point lies inside a polygon or not; that's a simple geometry problem. But if you work with properties, like you saw here in this graph where we work with geometries and properties of these geometries, it's not so easy, when you query a point in a polygon, to assess what the value at that point would be. If you have, on the left-hand side for instance, a land use map, you could reasonably say that at a particular point in that polygon the land use will be forest. But if you have a population density map, which is also a polygon with a value, you cannot say that, because the population density is an aggregate property of the entire polygon. So we made categories for that, and that information is propagated, and if you don't know, or if you are on the right-hand side of the figure, you get warnings when you query these points, telling you that you are making assumptions that are not supported by the data.
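In sf these categories are recorded as the attribute-geometry relationship (agr). A small sketch, assuming sf is installed; the labels chosen for BIR74 and NAME here are just for illustration:

```r
library(sf)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# declare how attributes relate to their geometries:
# "aggregate" = a summary over the whole polygon (like population density),
# "identity" = identifies the geometry itself
nc <- st_set_agr(nc, c(BIR74 = "aggregate", NAME = "identity"))
st_agr(nc)["BIR74"]

# point-in-polygon queries on attributes with unknown or aggregate agr
# trigger a warning that the assumption may not be supported by the data
```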
Challenges with time-dependent and raster data
So this is all very nice. It starts getting more challenging when you have time-dependent data. Here we have the classical North Carolina sudden infant death data set, where we have two rounds of measurements, in 1974 and 1979. This is one of the first published spatial statistics data sets. We can plot them, and then basically we have them as two columns in the data set, and we need to use gather here to stack them into a single column, and then we have to replicate all the geometries, and we can do that. But then what happens if you have 50 instances of time, or 500, or 1,000? You would have that many columns, and you are going to stack all of them, and so on. It gets very messy.
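The stacking step can be sketched like this, assuming sf, dplyr, and tidyr are installed (sf registers tidyverse methods, so gather works on sf objects and replicates the geometries):

```r
library(sf)
library(dplyr)
library(tidyr)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# stack the two measurement rounds into a single column;
# each geometry is duplicated once per time instance
nc_long <- nc %>%
  select(NAME, SID74, SID79) %>%
  gather(key = "year", value = "SID", SID74, SID79)

nrow(nc_long)  # twice the number of counties
```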
And also keeping time in columns is not a good idea at all, because where do you put the timestamp? You would have to put it in the column name. So time is an issue here. We have time-dependent data: where do we put it? Over multiple columns, or do we recycle features and put it in a long, tidy table? That is all not ideal, and easy plotting also becomes an issue.
There is also a recurring question about support for raster data, and about solid vector-raster integration. This is something you can emulate by having points on the raster, or little square polygons, but that is not going to bring you far in the sense of scaling. Why raster data, for those not involved in the raster world? Well, variables that vary continuously over space or over time are typically represented more naturally by regular sampling. Think of still images or video, but also Earth observation data.
So satellite data: we now have optical satellites that give information about land, about the ocean, about temperature; radar satellites; satellites that measure atmospheric composition. And these data all come down freely now. Europe is clearly on the ball here at the moment: in the Copernicus programme, with a little under 10 satellites now running, it comes down to 150 terabytes per day. So that's quite a bandwidth for the data sets that you can have access to.
Climate model data is also free and large. The Coupled Model Intercomparison Project, now in its sixth phase, will generate, over a couple of years, around 15 to 30 petabytes. That's a data center, imagine. And there are numerous other sources of raster data that are interesting. And that is also what really motivates me: there is so incredibly much free information being collected or generated that you could access and use, if only you could work with it, right? But quite often it is so hard to do.
So here's an example of an image from Sentinel-5P, where six months of data were composited together. This looks like what you typically see from models, but this is actually measurement: this is measured atmospheric composition. So this is six months of imagery compiled together, and for those living in Europe you see a fairly familiar pattern. But what you also see, for instance, are the shipping lanes showing here. So you even see the NO2 emitted by sea traffic. It is quite revolutionary that you get this from pure measurements. There is, of course, a lot of heavy processing going on there.
Data cubes as a solution
The way out, to handle time properly and to deal with raster data and with more complicated data, in my eyes, is that of data cubes. The idea of data cubes goes back to OLAP cubes; there's a 1996 paper by Jim Gray on this, mentioned on Wikipedia, where values basically are given for each combination of dimension values. So we can have sales by product, store, and week: for every combination of product, store, and week, we get a sales figure. We can have population by gender, age class, region, and census round. We can have temperature by x, y, z, and time: a four-dimensional data cube. We can have screen pixels, which are two-dimensional, or, when they vary over time, three-dimensional, like video. And we can have forecasts, like weather forecasts, by time of forecast and time to forecast.
Why not handle them entirely as tidy, long tables, as a strict tidy-data approach would suggest? Well, there is storage, speed, and indexes, right? I welcome anyone to put a video stream into an x, y, time, red, green, blue table. You don't do that. You could do it, but it's not going to scale up very far, only for smaller subsets.
So what we do is basically work with arrays, and arrays have a long history of support in base R and also in dplyr. The base array is a very strong structure that has dimnames, and dimnames are pretty limited because they are characters: if you want to put time in dimnames, you have to encode it as character, and that's messy. But the good thing is that you can use apply, which means you can aggregate over particular sets of dimensions. You can do subsetting for cropping, slicing, and dicing. There is support for binding arrays, and arrays are atomic.
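Those base-R capabilities can be sketched with a tiny cube:

```r
# a 2 x 3 x 4 cube with named dimnames; naming the dimensions lets
# apply() address margins by name
a <- array(1:24, dim = c(2, 3, 4),
           dimnames = list(x = c("x1", "x2"),
                           y = c("y1", "y2", "y3"),
                           time = paste0("t", 1:4)))

apply(a, c("x", "y"), mean)  # aggregate: reduce the time dimension
a[, , 1:2]                   # slice: keep the first two time steps
```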
Then there's the tbl_cube in dplyr, which is a structure that hasn't received very much love or exposure, but it's a really cool idea: a list of arrays, which basically mimics heterogeneous array records. And it has an additional list with dimension values, which is much more flexible than dimnames because it can also hold things like dates, and it has filter methods and group_by and so on. But it cannot, for instance, deal with simple features as dimensions, which you would want. And it also doesn't have support for regular dimensions, meaning mapping space linearly to a dimension index.
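A sketch of the tbl_cube idea, using the built-in nasa cube. Note that tbl_cube lived in dplyr at the time of this talk; it has since been moved to the cubelyr package, which this sketch assumes is installed:

```r
library(dplyr)
library(cubelyr)  # home of tbl_cube and the nasa data since dplyr 1.0

nasa  # a tbl_cube: measurement arrays over lat, long, month, year

# dplyr verbs operate on the cube's dimensions and measures
mean_temp <- nasa %>%
  filter(year == 2000) %>%
  group_by(long, lat) %>%
  summarise(temp = mean(temperature))
```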
The stars package
So what we do in the stars package, in the stars project, is implement raster and vector data cubes, and those are multidimensional cubes where one or more of the dimensions relate to space, and typically also to time. Raster data cubes typically have x and y as spatial dimensions; vector data cubes have a single dimension that lists the points, lines, or polygons that the index refers to. We basically have a list of arrays and a set of dimensions; we support regular dimensions and dimensions as values; we take care of measurement units and coordinate reference systems; we do the I/O with GDAL, which we already did with sf; and we also do a lot of work with NetCDF, where Michael Sumner is very instrumental. And we support out-of-core processing, and we will support cloud processing; I'll say a little bit about how that works.
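Reading a raster into a stars data cube can be sketched like this, using the Landsat 7 test image that ships with the package (assuming stars is installed):

```r
library(stars)

tif <- system.file("tif/L7_ETMs.tif", package = "stars")
x <- read_stars(tif)  # the I/O goes through GDAL

x                 # prints the array summary and its dimensions table
st_dimensions(x)  # x, y, band: offset, delta, and reference system
```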
So you might think that raster types are easy. Well, they're not. There are all these kinds of varieties; this is the nightmare for the developer. The first three have the array index mapped linearly to x and y; the last two do not. This last one looks like a graticule, but basically if you have a regular raster on the earth and you then project it, it automatically becomes curvilinear, unless you resample to a new regular raster, which is what GDAL calls warping.
So here's an image similar to what I showed last year of what stars does. We read a little GeoTIFF here, but now we use ggplot2 with geom_stars, which is very primitive: a 10-line function that essentially decides whether it should call geom_raster, geom_tile, or geom_rect, depending on what it has. In any case, there are a lot of other things you still need to do yourself, like coord_equal and so on. And this basically plots the data that is there.
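That plot can be sketched like this (assuming stars and ggplot2 are installed):

```r
library(stars)
library(ggplot2)

x <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))

# geom_stars picks geom_raster/geom_rect depending on the grid type;
# coord_equal and faceting still have to be added by hand
ggplot() +
  geom_stars(data = x) +
  coord_equal() +
  facet_wrap(~ band)
```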
And that is kind of okay, but it makes the data into a long data frame and then plots all of the pixels. And that is okay up to a certain data size. A typical Sentinel-2 image has 10,000 by 10,000 pixels, which means a 100-kilometer by 100-kilometer area with 10-meter grid cells. This is the state of the art in free, optical remote sensing data.
So when you plot this thing, you think: hey, what's happening here? The thing is read here with proxy = TRUE, and what then happens is that none of the data is read, only the metadata, meaning that only the dimensions are read and none of the image pixels are touched. And that's why I didn't do the plot with ggplot2 inside R Markdown, because that would take 10 minutes every time before my markdown compiled into slides, and you can't work like that. So this plot method looks at: what is my plot area, how many pixels do I have there? Oh, it's much less than 10,000 by 10,000, it's only 800 by 800. So let's downsample: read only as many pixels as I can maximally view, render them, and throw them to the plot device. And that is done in a matter of seconds, compared to 5 or 10 minutes.
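The lazy-reading behaviour can be sketched as follows (assuming stars is installed):

```r
library(stars)

tif <- system.file("tif/L7_ETMs.tif", package = "stars")
p <- read_stars(tif, proxy = TRUE)  # reads metadata only, no pixels

class(p)  # a stars_proxy object
# plot() computes a downsampling rate from the device size, so only
# roughly as many pixels as can actually be shown are read:
plot(p)
```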
And you can continue with this idea. We have an st_apply here that is applied to this object, which is still not read, and it says: to the dimensions x and y, apply this function, which computes NDVI, a vegetation index, by combining two bands; that's this little function. So it applies this to every pixel, because I have four bands here, and you think: oh, this is going to take time, because it's going to do this for these 10,000 by 10,000 cells. But no, it does nothing, as you see here. Only when I plot this thing does it do something, but then it changes the execution order. It says: oh, I have to plot something, that means I can downsample. And it downsamples, reads only the downsampled data, computes the NDVI for those pixels, and plots the thing. So you could call this a query optimization.
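The NDVI step can be sketched as follows (assuming stars is installed; for this Landsat test image, band 3 is red and band 4 near-infrared):

```r
library(stars)

p <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"),
                proxy = TRUE)

# normalized difference vegetation index from the band vector of a pixel
ndvi <- function(x) (x[4] - x[3]) / (x[4] + x[3])

# lazy: nothing is computed here, the call is only recorded
nd <- st_apply(p, c("x", "y"), ndvi)

# plotting triggers execution: downsample first, then compute NDVI
plot(nd)
```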
Cloud proxies are something we have in mind for when the data are not on your local machine but in the cloud. And the next idea is that of a data cube view, where you have dimension settings that you don't read from an image, but that you set independently, and we query the imagery based on those settings. So you could then, for instance, have a collection of images that are in different projections, like you typically have with Sentinel or Landsat data, where images are spread over different UTM zones, and it handles that. This is an idea pioneered by, and inspired by, Google Earth Engine, a very strong platform.
Non-cube spatial data: hurricane example
So a lot of data are, of course, non-cube data, right? We have spatiotemporal events, we have spatial networks, we have movement data. And there is a nice data set, a subset of which is available in the dplyr package as storms, from the hurricane data set, which luckily is still online. It's updated regularly, I think once per year or so. What you see here is, in color, the year of the hurricane, and the path it took, without reference, only the latitudes and longitudes. And this is difficult to interpret, right? It is a time series, and the question is: is anything changing over time? But it's all, you know, a hairy ball.
So you can cut it up, and you can say: oh, this is nice, ggplot2 facets, and so on. But it's still a lot of hairy balls, and you can't really tell; there's a lot of overlap. You can't really say whether anything is increasing or decreasing, because with all the overlap you don't know how much there is. So what you would do is make little boxes, and then count in those little boxes. And that's exactly what I did here: I made little boxes, a raster, basically, in two dimensions. And then for each of the raster cells, I intersected all of the lines with it, and counted whenever part of a hurricane track passed through the raster cell.
Now, if you're really smart, you think: hey, but this is running from 10 to 80 degrees latitude. What happens, of course, is that if you distribute raster cells regularly over x and y, you get boxes on the earth that are of completely different sizes, because they get narrower as you get close to the pole. So this would not work very well. What I did instead is let the cell heights in the y direction increase: the cells are very short in the south, and get taller as you go north, such that the boxes have approximately equal areas. That makes it a rectilinear grid, and then counts are proportional to densities. Of course, this ignores track length, duration, and strength, so you can do much more with this data, but this is the very simple analysis I started with.
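The counting itself can be sketched as follows, here with a plain regular grid for brevity (assuming sf and dplyr are installed; grouping by name and year is an assumption about what identifies one storm in dplyr::storms):

```r
library(sf)
library(dplyr)

# one LINESTRING per storm track, built from the storms point records
tracks <- storms %>%
  st_as_sf(coords = c("long", "lat"), crs = 4326) %>%
  group_by(name, year) %>%
  filter(n() >= 2) %>%           # a LINESTRING needs at least two points
  summarise(do_union = FALSE) %>%
  st_cast("LINESTRING")

# count, for each grid cell, how many tracks intersect it
grid <- st_make_grid(tracks, n = c(30, 30))
counts <- lengths(st_intersects(grid, tracks))
plot(st_sf(n = counts, geometry = grid)["n"])
```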
Conclusion
So to conclude: spatial data science is an exciting field, full of data, maps, challenges, and tidyverse extensions. With sf, the simple features package, we extend tables to spatial tables. With the stars package, we extend that to raster and vector data cubes, which include spatial time series as a special case. stars can do out-of-core raster, is lazy, does some optimizations there, and will eventually also address cloud raster processing, which, at these data sizes, is where everything in this world is going in order to do something useful. We can analyze big image collections interactively only when we downsample cleverly. And there is more to read about the stuff we do in the book I am writing together with Roger Bivand, called Spatial Data Science, which has a link here.
The link to the slides is in a tweet that I sent an hour ago. And I also want to draw your attention to a spatial birds-of-a-feather meeting, which takes place tomorrow at breakfast. Thank you very much.
Q&A
So how are sf and stars going to interoperate?
So sf is for vector spatial data, and stars is for raster. Are they going to interoperate in some sense, to align the coordinate systems?
I should have put this slide full with vector data cubes; this is the confusion. The confusion is this: stars is not only about raster data. stars is about spatiotemporal data cubes, and there are two types of data cubes, raster data cubes and vector data cubes. So if you have this kind of data, geometries with time series, then the geometries are one dimension and time is another one. And you could have population data per region, by age class, for different censuses, and then you have a 3D vector data cube, where space takes only one dimension. So stars does both, and by that it integrates them automatically. You could, for instance, do an aggregation over your raster and reduce the x and y dimensions to a set of features, which could be polygons.
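That raster-to-vector reduction can be sketched as follows (assuming sf and stars are installed; the 3 x 3 polygon grid over the image is just an illustration):

```r
library(stars)
library(sf)

x <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))

# some polygons covering the image: here, a 3 x 3 grid over its bbox
polys <- st_as_sf(st_make_grid(st_as_sfc(st_bbox(x)), n = c(3, 3)))

# aggregate mean band values per polygon: x and y collapse into a
# single geometry dimension -> a vector data cube
a <- aggregate(x, polys, FUN = mean)
st_dimensions(a)  # dimensions: geometry, band
```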
