
Edzer Pebesma | Spatial data science in the Tidyverse | RStudio (2019)
Package sf (simple features) and ggplot2::geom_sf have caused a fast uptake of tidy spatial data analysis by data scientists. Important spatial data science challenges are not handled by them, including raster and vector data cubes (e.g. socio-economic time series, satellite imagery, weather forecast or climate prediction data) and out-of-memory datasets. Powerful methods to analyse such datasets have been developed in the packages stars (spatiotemporal tidy arrays) and tidync (tidy analysis of NetCDF files). This talk discusses how the simple feature and tidy data frameworks are extended to handle these challenging data types, and shows how R can be used for out-of-memory spatial and spatiotemporal datasets using tidy concepts.

VIEW MATERIALS https://edzer.github.io/rstudio_conf/2019/index.html

About the Author

Edzer Pebesma

I lead the spatio-temporal modelling laboratory at the Institute for Geoinformatics. I hold a PhD in geosciences, and am interested in spatial statistics, environmental modelling, geoinformatics and GI Science, semantic technology for spatial analysis, optimizing environmental monitoring, and also in e-Science and reproducible research. I am an ordinary member of the R Foundation. I am one of the authors of Applied Spatial Data Analysis with R (second edition), Co-Editor-in-Chief of the Journal of Statistical Software, and associate editor of Spatial Statistics. I believe that research is useful in particular when it helps solve real-world problems.
Transcript
This transcript was generated automatically and may contain errors.
This is joint work with Michael Sumner, who is regrettably not here, and Etienne Racine, who is here, sitting in the front, and I'm pretty confident, I think I can promise you, that this will be the only talk in this session without moving images.
So I talked last year about simple features, and I couldn't find a better summary than this lovely illustration that Allison Horst posted on Twitter recently. Simple features basically makes handling feature data, that's point, line, and polygon spatial data, easier in R, because it glues the geometries in as an additional column in your data frame or tibble, and it keeps them there, and then you can do all your regular manipulations with that. In addition, you can do all kinds of spatial operations, like spatial joins, and operate on the geometries.
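As a minimal sketch of that idea, assuming the sf and dplyr packages are installed, and using the North Carolina demo shapefile that ships with sf:

```r
library(sf)
library(dplyr)

# read a demo shapefile shipped with sf; the result is a data.frame
# with an extra geometry list-column
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
class(nc)  # "sf" "data.frame"

# regular tidy manipulation works; the geometry column is "sticky"
# and is kept even when not explicitly selected
nc %>%
  filter(AREA > 0.2) %>%
  select(NAME, BIR74)
```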
Simple features package updates
So that caught on very well, that is the good news. There's a strong uptake of the simple features package by both users and developers, and we improved with that the interoperability with spatial databases, GeoJSON, and also coordinate reference system handling in R. Spatial indexes are now computed on the fly, for instance by st_intersects, and we now also have plot support in ggplot2 that takes care of reprojecting incompatible layers.
There is an additional thing that we added to the simple features package, and it addresses the following issue. It is very easy to find out whether a point lies inside a polygon or not; that's a simple geometry problem. But if you work with properties, like you saw here in this graph where we work with geometries and properties of these geometries, it's not so easy, when you query a point in a polygon, to assess what the value at that point would be. If you have, on the left-hand side for instance, a land use map, you could reasonably say that at a particular point in that polygon the land use will be forest. But if you have a population density map, which is also a polygon with a value, you cannot say that, because the population density is an aggregate property of the entire polygon. So we made categories for that, and that information is propagated, and if you don't know, or if you are on the right-hand side of the figure, you get warnings when you query these points, telling you that you are making assumptions that are not supported by the data.
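In sf these categories are recorded as the attribute-geometry relationship (agr). A small sketch, assuming sf is installed; the labels chosen for BIR74 and NAME here are just for illustration:

```r
library(sf)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# declare how attributes relate to their geometries:
# "aggregate" = a summary over the whole polygon (like population density),
# "identity" = identifies the geometry itself
nc <- st_set_agr(nc, c(BIR74 = "aggregate", NAME = "identity"))
st_agr(nc)["BIR74"]

# point-in-polygon queries on attributes with unknown or aggregate agr
# trigger a warning that the assumption may not be supported by the data
```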
Challenges with time-dependent and raster data
So this is all very nice. It starts getting more challenging when you have time-dependent data. Here we have the classical North Carolina sudden infant death data set, where we have two rounds of measurements, in 1974 and 1979. This is one of the first published spatial statistics data sets. We can plot them, and then basically we have them as two columns in the data set, and we need to use gather here to stack them into a single column, and then we have to replicate all the geometries, and we can do that. But then what happens if you have 50 instances of time, or 500, or 1,000? You would have that many columns, and you are going to stack all of them, and so on. It gets very messy.
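The stacking step can be sketched like this, assuming sf, dplyr, and tidyr are installed (sf registers tidyverse methods, so gather works on sf objects and replicates the geometries):

```r
library(sf)
library(dplyr)
library(tidyr)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# stack the two measurement rounds into a single column;
# each geometry is duplicated once per time instance
nc_long <- nc %>%
  select(NAME, SID74, SID79) %>%
  gather(key = "year", value = "SID", SID74, SID79)

nrow(nc_long)  # twice the number of counties
```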
And also keeping time in columns is not a good idea at all, because where do you put the timestamp? You would have to put it in the column name. So time is an issue here. We have time-dependent data: where do we put it? Over multiple columns, or do we recycle features and put it in a long, tidy table? That is all not ideal, and easy plotting also becomes an issue.
There is also a recurring question about support for raster data, and about solid vector-raster integration. This is something you can emulate by having points on the raster, or little square polygons, but that is not going to bring you far in the sense of scaling. Why raster data, for those not involved in the raster world? Well, variables that vary continuously over space or over time are typically represented more naturally by regular sampling. Think of still images or video, but also Earth observation data.
So satellite data: we now have optical satellites that give information about land, about the ocean, about temperature; radar satellites; satellites that measure atmospheric composition. And these data all come down freely now. Europe is clearly on the ball here at the moment: in the Copernicus programme, with a little under 10 satellites now running, it comes down to 150 terabytes per day. So that's quite a bandwidth for the data sets that you can have access to.
Climate model data is also free and large. The Coupled Model Intercomparison Project, now in its sixth phase, will generate, over a couple of years, around 15 to 30 petabytes. That's a data center, imagine. And there are numerous other sources of raster data that are interesting. And that is also what really motivates me: there is so incredibly much free information being collected or generated that you could access and use, if only you could work with it, right? But quite often it is so hard to do.
So here's an example of an image from Sentinel-5P, where six months of data were composited together. This looks like what you typically see from models, but this is actually measurement: this is measured atmospheric composition. So this is six months of imagery compiled together, and for those living in Europe you see a fairly familiar pattern. But what you also see, for instance, are the shipping lanes showing here. So you even see the NO2 emitted by sea traffic. It is quite revolutionary that you get this from pure measurements. There is, of course, a lot of heavy processing going on there.
Data cubes as a solution
The way out, to handle time properly and to deal with raster data and with more complicated data, in my eyes, is that of data cubes. The idea of data cubes goes back to OLAP cubes; there's a 1996 paper by Jim Gray on this, mentioned on Wikipedia, where values basically are given for each combination of dimension values. So we can have sales by product, store, and week: for every combination of product, store, and week, we get a sales figure. We can have population by gender, age class, region, and census round. We can have temperature by x, y, z, and time: a four-dimensional data cube. We can have screen pixels, which are two-dimensional, or, when they vary over time, three-dimensional, like video. And we can have forecasts, like weather forecasts, by time of forecast and time to forecast.
Why not handle them entirely as tidy, long tables, as a strict tidy-data approach would suggest? Well, there is storage, speed, and indexes, right? I welcome anyone to put a video stream into an x, y, time, red, green, blue table. You don't do that. You could do it, but it's not going to scale up very far, only for smaller subsets.
So what we do is basically work with arrays, and arrays have a long history of support in base R and also in dplyr. The base array is a very strong structure that has dimnames, and dimnames are pretty limited because they are characters: if you want to put time in dimnames, you have to encode it as character, and that's messy. But the good thing is that you can use apply, which means you can aggregate over particular sets of dimensions. You can do subsetting for cropping, slicing, and dicing. There is support for binding arrays, and arrays are atomic.
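Those base-R capabilities can be sketched with a tiny cube:

```r
# a 2 x 3 x 4 cube with named dimnames; naming the dimensions lets
# apply() address margins by name
a <- array(1:24, dim = c(2, 3, 4),
           dimnames = list(x = c("x1", "x2"),
                           y = c("y1", "y2", "y3"),
                           time = paste0("t", 1:4)))

apply(a, c("x", "y"), mean)  # aggregate: reduce the time dimension
a[, , 1:2]                   # slice: keep the first two time steps
```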
Then there's the tbl_cube in dplyr, which is a structure that hasn't received very much love or exposure, but it's a really cool idea: a list of arrays, which basically mimics heterogeneous array records. And it has an additional list with dimension values, which is much more flexible than dimnames because it can also hold things like dates, and it has filter methods and group_by and so on. But it cannot, for instance, deal with simple features as dimensions, which you would want. And it also doesn't have support for regular dimensions, meaning mapping space linearly to a dimension index.
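A sketch of the tbl_cube idea, using the built-in nasa cube. Note that tbl_cube lived in dplyr at the time of this talk; it has since been moved to the cubelyr package, which this sketch assumes is installed:

```r
library(dplyr)
library(cubelyr)  # home of tbl_cube and the nasa data since dplyr 1.0

nasa  # a tbl_cube: measurement arrays over lat, long, month, year

# dplyr verbs operate on the cube's dimensions and measures
mean_temp <- nasa %>%
  filter(year == 2000) %>%
  group_by(long, lat) %>%
  summarise(temp = mean(temperature))
```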
The stars package
So what we do in the stars package, in the stars project, is implement raster and vector data cubes, and those are multidimensional cubes where one or more of the dimensions relate to space, and typically also to time. Raster data cubes typically have x and y as spatial dimensions; vector data cubes have a single dimension that lists the points, lines, or polygons that the index refers to. We basically have a list of arrays and a set of dimensions; we support regular dimensions and dimensions as values; we take care of measurement units and coordinate reference systems; we do the I/O with GDAL, which we already did with sf; and we also do a lot of work with NetCDF, where Michael Sumner is very instrumental. And we support out-of-core processing, and we will support cloud processing; I'll say a little bit about how that works.
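Reading a raster into a stars data cube can be sketched like this, using the Landsat 7 test image that ships with the package (assuming stars is installed):

```r
library(stars)

tif <- system.file("tif/L7_ETMs.tif", package = "stars")
x <- read_stars(tif)  # the I/O goes through GDAL

x                 # prints the array summary and its dimensions table
st_dimensions(x)  # x, y, band: offset, delta, and reference system
```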
So you might think that raster types are easy. Well, they're not. There are all these kinds of varieties; this is the nightmare for the developer. The first three have the array index mapped linearly to x and y; the last two do not. This last one looks like a graticule, but basically if you have a regular raster on the earth and you then project it, it automatically becomes curvilinear, unless you resample to a new regular raster, which is what GDAL calls warping.
So here's an image similar to what I showed last year of what stars does. We read a little GeoTIFF here, but now we use ggplot2 with geom_stars, which is very primitive: a 10-line function that essentially decides whether it should call geom_raster, geom_tile, or geom_rect, depending on what it has. In any case, there are a lot of other things you still need to do yourself, like coord_equal and so on. And this basically plots the data that is there.
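That plot can be sketched like this (assuming stars and ggplot2 are installed):

```r
library(stars)
library(ggplot2)

x <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))

# geom_stars picks geom_raster/geom_rect depending on the grid type;
# coord_equal and faceting still have to be added by hand
ggplot() +
  geom_stars(data = x) +
  coord_equal() +
  facet_wrap(~ band)
```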
And that is kind of okay, but it makes the data into a long data frame and then plots all of the pixels. And that is okay up to a certain data size. A typical Sentinel-2 image has 10,000 by 10,000 pixels, which means a 100-kilometer by 100-kilometer area with 10-meter grid cells. This is the state of the art in free, optical remote sensing data.
So when you plot this thing, you think: hey, what's happening here? The thing is read here with proxy = TRUE, and what then happens is that none of the data is read, only the metadata, meaning that only the dimensions are read and none of the image pixels are touched. And that's why I didn't do the plot with ggplot2 inside R Markdown, because that would take 10 minutes every time before my markdown compiled into slides, and you can't work like that. So this plot method looks at: what is my plot area, how many pixels do I have there? Oh, it's much less than 10,000 by 10,000, it's only 800 by 800. So let's downsample: read only as many pixels as I can maximally view, render them, and throw them to the plot device. And that is done in a matter of seconds, compared to 5 or 10 minutes.
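The lazy-reading behaviour can be sketched as follows (assuming stars is installed):

```r
library(stars)

tif <- system.file("tif/L7_ETMs.tif", package = "stars")
p <- read_stars(tif, proxy = TRUE)  # reads metadata only, no pixels

class(p)  # a stars_proxy object
# plot() computes a downsampling rate from the device size, so only
# roughly as many pixels as can actually be shown are read:
plot(p)
```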
And you can continue with this idea. We have an st_apply here that is applied to this object, which is still not read, and it says: to the dimensions x and y, apply this function, which computes NDVI, a vegetation index, by combining two bands; that's this little function. So it applies this to every pixel, because I have four bands here, and you think: oh, this is going to take time, because it's going to do this for these 10,000 by 10,000 cells. But no, it does nothing, as you see here. Only when I plot this thing does it do something, but then it changes the execution order. It says: oh, I have to plot something, that means I can downsample. And it downsamples, reads only the downsampled data, computes the NDVI for those pixels, and plots the thing. So you could call this a query optimization.
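The NDVI step can be sketched as follows (assuming stars is installed; for this Landsat test image, band 3 is red and band 4 near-infrared):

```r
library(stars)

p <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"),
                proxy = TRUE)

# normalized difference vegetation index from the band vector of a pixel
ndvi <- function(x) (x[4] - x[3]) / (x[4] + x[3])

# lazy: nothing is computed here, the call is only recorded
nd <- st_apply(p, c("x", "y"), ndvi)

# plotting triggers execution: downsample first, then compute NDVI
plot(nd)
```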
Cloud proxies are something we have in mind for when the data are not on your local machine but in the cloud. And the next idea is that of a data cube view, where you have dimension settings that you don't read from an image, but that you set independently, and we query the imagery based on those settings. So you could then, for instance, have a collection of images that are in different projections, like you typically have with Sentinel or Landsat data, where images are spread over different UTM zones, and it handles that. This is an idea pioneered by, and inspired by, Google Earth Engine, a very strong platform.
Non-cube spatial data: hurricane example
So a lot of data are, of course, non-cube data, right? We have spatiotemporal events, we have spatial networks, we have movement data. And there is a nice data set, a subset of which is available in the dplyr package as storms, from the hurricane data set, which luckily is still online. It's updated regularly, I think once per year or so. What you see here is, in color, the year of the hurricane, and the path it took, without reference, only the latitudes and longitudes. And this is difficult to interpret, right? It is a time series, and the question is: is anything changing over time? But it's all, you know, a hairy ball.
So you can cut it up, and you can say: oh, this is nice, ggplot2 facets, and so on. But it's still a lot of hairy balls, and you can't really tell; there's a lot of overlap. You can't really say whether anything is increasing or decreasing, because with all the overlap you don't know how much there is. So what you would do is make little boxes, and then count in those little boxes. And that's exactly what I did here: I made little boxes, a raster, basically, in two dimensions. And then for each of the raster cells, I intersected all of the lines with it, and counted whenever part of a hurricane track passed through the raster cell.
Now, if you're really smart, you think: hey, but this is running from 10 to 80 degrees latitude. What happens, of course, is that if you distribute raster cells regularly over x and y, you get boxes on the earth that are of completely different sizes, because they get narrower as you get close to the pole. So this would not work very well. What I did instead is let the cell heights in the y direction increase: the cells are very short in the south, and get taller as you go north, such that the boxes have approximately equal areas. That makes it a rectilinear grid, and then counts are proportional to densities. Of course, this ignores track length, duration, and strength, so you can do much more with this data, but this is the very simple analysis I started with.
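The counting itself can be sketched as follows, here with a plain regular grid for brevity (assuming sf and dplyr are installed; grouping by name and year is an assumption about what identifies one storm in dplyr::storms):

```r
library(sf)
library(dplyr)

# one LINESTRING per storm track, built from the storms point records
tracks <- storms %>%
  st_as_sf(coords = c("long", "lat"), crs = 4326) %>%
  group_by(name, year) %>%
  filter(n() >= 2) %>%           # a LINESTRING needs at least two points
  summarise(do_union = FALSE) %>%
  st_cast("LINESTRING")

# count, for each grid cell, how many tracks intersect it
grid <- st_make_grid(tracks, n = c(30, 30))
counts <- lengths(st_intersects(grid, tracks))
plot(st_sf(n = counts, geometry = grid)["n"])
```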
Conclusion
So to conclude: spatial data science is an exciting field, full of data, maps, challenges, and tidyverse extensions. With sf, the simple features package, we extend tables to spatial tables. With the stars package, we extend that to raster and vector data cubes, which include spatial time series as a special case. stars can do out-of-core raster, is lazy, does some optimizations there, and will eventually also address cloud raster processing, which, at these data sizes, is where everything in this world is going in order to do something useful. We can analyze big image collections interactively only when we downsample cleverly. And there is more to read about the stuff we do in the book I am writing together with Roger Bivand, called Spatial Data Science, which has a link here.
The link to the slides is in a tweet that I sent an hour ago. And I also want to draw your attention to a spatial birds-of-a-feather meeting, which takes place tomorrow at breakfast. Thank you very much.
Q&A
So how are sf and stars going to interoperate?
So sf is for vector spatial data, and stars is for raster. Are they going to interoperate in some sense, to align the coordinate systems?
I should have put this slide full with vector data cubes; this is the confusion. The confusion is this: stars is not only about raster data. stars is about spatiotemporal data cubes, and there are two types of data cubes, raster data cubes and vector data cubes. So if you have this kind of data, geometries with time series, then the geometries are one dimension and time is another one. And you could have population data per region, by age class, for different censuses, and then you have a 3D vector data cube, where space takes only one dimension. So stars does both, and by that it integrates them automatically. You could, for instance, do an aggregation over your raster and reduce the x and y dimensions to a set of features, which could be polygons.
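That raster-to-vector reduction can be sketched as follows (assuming sf and stars are installed; the 3 x 3 polygon grid over the image is just an illustration):

```r
library(stars)
library(sf)

x <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))

# some polygons covering the image: here, a 3 x 3 grid over its bbox
polys <- st_as_sf(st_make_grid(st_as_sfc(st_bbox(x)), n = c(3, 3)))

# aggregate mean band values per polygon: x and y collapse into a
# single geometry dimension -> a vector data cube
a <- aggregate(x, polys, FUN = mean)
st_dimensions(a)  # dimensions: geometry, band
```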
