Resources

Earo Wang | Melt the clock: Tidy time series analysis | RStudio (2019)

Time series can be frustrating to work with, particularly when processing raw data into model-ready data. This work presents two new packages that address a gap in existing methodology for time series analysis (raised at rstudio::conf 2018). The tsibble package supports organizing and manipulating modern time series, leveraging tidy data principles along with contextual semantics: index and key. The tsibble data structure seamlessly flows into forecasting routines. The fable package is a tidy renovation of the forecast package. It promotes transparent forecasting practices and concise model representations, empowering analysts to tackle a broad domain of forecasting problems. This collection of packages forms the tidyverts, which facilitates a fluent and fluid workflow for analyzing time series.

View materials: https://slides.earo.me/rstudioconf19

About the Author

Earo Wang: I'm currently doing my Ph.D. on statistical visualisation of temporal-context data at Monash University, supervised by Professor Di Cook and Professor Rob J Hyndman. I enjoy developing open-source tools with R, and am the (co)author of some widely used R packages, including anomalous, hts, sugrrants, rwalkr, and tsibble. My research areas involve data visualisation, time series analysis, and computational statistics.


Transcript

This transcript was generated automatically and may contain errors.

Good day, everybody. Last year I was at the RStudio conference as well, and several questions were raised about tidy time series analysis during the conference, especially forecasting with tidy objects. So today I'm going to present a solution to that: a streamlined workflow for time series under the tidyverse framework using two packages. I'll introduce you to three big ideas behind those two packages, tsibble, mable, and fable, and how they link together. I hope to explain them concretely with data stories.

So I believe this diagram isn't foreign to you. This is the tidyverse model, and each module here is powered by one of the tidyverse packages. The tidyverse packages play seamlessly with each other, and one of the fundamental reasons is that they all share the same underlying data structure, which is the data frame, or tibble. So the data is actually placed in the center of the diagram. But why can't we bring this workflow easily into time series? Because the current time series objects in R are model-focused. By model, I mean not only statistical models, but also forecasting, decomposition, autocorrelation functions, and other time series tools. All those methods or functions expect matrices as inputs. But the data arrives at the very beginning of this process, and it rarely comes in matrix form. We have to write so much ad hoc code to get the data into a time series model-ready object. It is a pain because of the mismatch between temporal data and time series models.

So we hope to change that. Some new tools are provided to streamline this workflow and make time series analysis a bit easier, more fun, and more intuitive. The tsibble package focuses on the tidying and transformation parts, the fable package does time series forecasting, and visualization is done with ggplot2 and its extensions. To make this workflow work, they share the same underlying data structure, which is a new data abstraction for time series called a tsibble. A tsibble is a time series tibble: ts is the native object for representing time series in R, and tibble is the tidyverse data frame, and that's where the name tsibble comes from.

Introducing the tsibble data structure

So now we are going to have some fun with an open data set about residential electricity consumption in Australia. It's a table with 46 million observations and eight variables. The column customer ID contains unique identifiers for each household, and there are thousands of households in this data set. The reading datetime gives the time stamps when the meter reading is recorded every 30 minutes. General supply is the variable we are interested in forecasting, and there are some other measurements in the table as well. So first, how are we going to turn this data into a tsibble?

So what makes time series special, or different from a normal data frame? It has its own semantics. The first one is obviously the index, the variable that represents time. In this case, it's the reading datetime, and tsibble supports a wide range of time classes, from numerics to date-times to nanotime. The second component is what we call the key: the identifying variables that define series or entities over time. For this example, each household is a series, and the key variable is the customer ID. You can also include multiple variables in the key. So basically, the index defines the time and the key defines the series. The key together with the index uniquely identifies each observation in a tsibble. Basically, it means each customer has to have unique time stamps, and for this data, yes, they do.
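The coercion step described above can be sketched as follows. This is a minimal example with toy data; the column names `customer_id`, `reading_datetime`, and `general_supply` are assumptions based on the talk's description.

```r
library(tsibble)
library(dplyr)

# Toy stand-in for the electricity data (real column names are assumptions)
elec <- tibble(
  customer_id      = rep(c("A", "B"), each = 4),
  reading_datetime = rep(seq(as.POSIXct("2019-01-01 00:00", tz = "UTC"),
                             by = "30 min", length.out = 4), times = 2),
  general_supply   = runif(8)
)

# key + index must uniquely identify rows; as_tsibble() validates this
elec_ts <- elec %>%
  as_tsibble(key = customer_id, index = reading_datetime)

elec_ts  # prints the 30-minute interval, time zone, and key (2 customers)
```

The print method surfaces the contextual information mentioned next: the detected interval, the index's time zone, and the number of series under the key.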

So now we have a tsibble. The tibble already gives a very nice printing method, and tsibble enhances that by adding contextual information. We have a tsibble with the data dimensions, and it recognizes the 30-minute interval and the time zone associated with the index. Here we have UTC, but you might have time zones with daylight saving, for example, and tsibble will respect any time zone in your data. The key variable is reported with the number of series, so we have 2,924 customers in this big table. We also have some time gaps in this data, so there are implicit missing values. Since ggplot2 isn't aware of those gaps, it always draws a straight line between data segments if you use geom_line. tsibble comes with a very handy function called fill_gaps(), which fills in those gaps with NA by default, or you can fill by values or functions. So those lines can be removed from the plot easily with fill_gaps().

Besides fill_gaps(), tsibble provides three other verbs to handle implicit missing values, all suffixed with _gaps. Nearly 20% of the customers have time gaps, and in the output shown here you can see the last two customers with gaps. Almost all model functions require complete time series, so it's good practice to look at and fill in those gaps first.
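A small sketch of the gap-handling verbs on a toy series with one implicit gap (the 10:00 reading is missing from a 30-minute series):

```r
library(tsibble)

# Three readings at a 30-minute interval, with 10:00 implicitly missing
x <- tsibble(
  time  = as.POSIXct(c("2019-01-01 09:00", "2019-01-01 09:30",
                       "2019-01-01 10:30"), tz = "UTC"),
  value = c(1, 2, 4),
  index = time
)

has_gaps(x)              # one row per key: does the series have gaps?
count_gaps(x)            # where each gap starts/ends and how long it is
scan_gaps(x)             # the missing time stamps themselves
fill_gaps(x)             # make gaps explicit: value becomes NA at 10:00
fill_gaps(x, value = 0)  # ...or fill with a constant instead of NA
```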

Wrangling time series with tsibble

tsibble works nicely with the dplyr and tidyr verbs. A new verb you will use quite often is index_by(). It's similar to group_by() in that it prepares a grouping structure, but it only groups the index. Combined with summarise(), the data can be aggregated to any coarser time resolution. For example, if I want to work with hourly data instead of 30-minute data, I use floor_date(), ceiling_date(), or round_date() from the lubridate package on the index variable inside index_by(), followed by a summarise(). I get the hourly average electricity usage across all the households. The result is a single time series with a one-hour interval, and the key is implicit.
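The aggregation step can be sketched like this, again with toy data and assumed column names:

```r
library(tsibble)
library(dplyr)
library(lubridate)

# Two customers, four half-hourly readings each (names are assumptions)
elec_ts <- tsibble(
  customer_id      = rep(c("A", "B"), each = 4),
  reading_datetime = rep(seq(ymd_hm("2019-01-01 00:00", tz = "UTC"),
                             by = "30 min", length.out = 4), times = 2),
  general_supply   = c(1, 2, 3, 4, 2, 2, 2, 2),
  key   = customer_id,
  index = reading_datetime
)

# index_by() groups the index only; summarise() then collapses both the
# key (all customers) and the finer time stamps into hourly averages
elec_hourly <- elec_ts %>%
  index_by(hour = floor_date(reading_datetime, "hour")) %>%
  summarise(avg_supply = mean(general_supply))

elec_hourly  # a single series with a one-hour interval; the key is gone
```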

Another big chunk of operations for time series is definitely the rolling window, which not only iterates over each element, like purrr's map() does, but also needs to roll. So what I'm going to do is wake up this cat and let her roll. There are actually three different types of rolling operations: sliding, tiling, and stretching. And it's just like purrr: slide() takes one input, slide2() takes two inputs, and pslide() handles multiple inputs. If you have a data frame to roll over by observations, you need to use pslide(), and for the purpose of type stability, they all return a list. Other variants return integers, characters, logicals, et cetera. In the recent version, I've added parallel support: they're all prefixed with future_, using the furrr package by Davis Vaughan as a backend. So if you're doing some rolling regressions, it's nice to save some time by rolling them in parallel.

You can put an arbitrary function through those rolling windows. A simple example is a rolling average. It uses purrr-like syntax, so you can pass a function name or an anonymous function using the tilde. I specify the window size to be 24, so basically one day, and it rolls forward from left to right. If you want to move in the opposite direction, just change the window size to be negative. Those three rolling windows, as you can see, result in different averages. A more advanced but extremely useful example is rolling forecasts. I define a custom function called expand_forecast and also enable parallel processing by using future_pstretch(). It's never been that easy to do expanding-window forecasting. A nice thing about functional programming is that we can focus on writing expressions instead of writing a long for loop.
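Note that tsibble's rolling verbs have since been split out into the slider package, so the exact names differ from the talk. A minimal rolling-average sketch with slider, covering the same three flavors of window:

```r
library(slider)

x <- c(1, 3, 5, 7, 9, 11)

# Sliding: overlapping trailing windows of size 3 (current point + 2 before)
slide_dbl(x, mean, .before = 2, .complete = TRUE)
# -> NA NA 3 5 7 9

# The "opposite direction": look ahead instead of behind
slide_dbl(x, mean, .after = 2, .complete = TRUE)
# -> 3 5 7 9 NA NA

# Stretching: an expanding window from the start (cumulative mean),
# using the tilde/anonymous-function syntax mentioned in the talk
slide_dbl(x, ~ mean(.x), .before = Inf)
# -> 1 2 3 4 5 6

# Tiling (non-overlapping blocks) is available via the .step argument
```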

Tidy forecasting with fable

So what sort of code goes into expand_forecast? This question brings us to the next part of the presentation: tidy forecasting. How many of you have used the forecast package before? Quite a lot. Thank you. So fable is a tidy replacement of the forecast package. Why do we call it fable? First, it makes forecasting tables. Second, a fable is like a forecast: it's never true, but it tells you something useful.


So let's take a look at the data for the first 30 days of January. Each facet gives a daily snapshot of hourly electricity demand, and the peak in the late afternoon is driven by the use of air conditioning; January is summertime in Australia. You can see some days have much higher usage, colored red, because they are very hot days with maximum temperatures greater than 32 degrees. I'm going to use this subset to forecast the demand one day ahead, and I hold out the data of January 31 as a test set.

So let's model the data. I construct two models for the energy consumption with the model() verb. The first model is a naive model as a benchmark: the naive method simply uses the observed values from yesterday as forecasts. The second model is ETS, exponential smoothing, which can be thought of as a weighted average of past values; ETS is also short for error, trend, and seasonality. The model() function uses a formula interface. On the left-hand side, we specify the average supply as the response variable, and on the right-hand side, we can put some specials related to the method. I've specified the naive function to use the 24 values from yesterday instead of the value from the previous hour. And if we don't specify the right-hand side, as with ETS, it will do automatic model selection and pick the best model for you.
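A sketch of the modelling step with fable, using simulated hourly data in place of the talk's electricity subset; the column names and the exact specials are assumptions:

```r
library(fable)
library(tsibble)
library(dplyr)

# Simulated hourly demand with a daily cycle (stand-in for the real data)
n <- 24 * 30
elec_hourly <- tsibble(
  hour       = seq(as.POSIXct("2019-01-01 00:00", tz = "UTC"),
                   by = "1 hour", length.out = n),
  avg_supply = 1 + 0.5 * sin(2 * pi * (0:(n - 1)) / 24) + rnorm(n, sd = 0.05),
  index = hour
)

fit <- elec_hourly %>%
  model(
    snaive = SNAIVE(avg_supply ~ lag(24)),  # repeat yesterday's 24 values
    ets    = ETS(avg_supply)                # no RHS: automatic selection
  )

fit  # a mable: one row per series, one column per model
```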

So now we get a mable back. A mable is a model table that contains model objects. Each cell shows a succinct model representation, saying I have a seasonal naive model and an ETS with three selected components. Models are reduced forms of the data, and the model() function is an analog of summarise() because they use the same semantics: model() also reduces the data down to a single summary, but that summary happens to be a model object. To look at things like parameter estimates, information criteria, or residuals from the model objects, we just use the familiar broom functions: tidy(), glance(), and augment(). By applying tidy() to the mable, we get the parameter estimates for the models we've built. We get a bunch of parameters for ETS, like alpha and beta, and I know those are boring parameters.
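The broom-style verbs on a mable can be sketched on a tiny toy fit (the real mable comes from the electricity data above):

```r
library(fable)
library(tsibble)

# A tiny toy mable standing in for the one built in the talk
y   <- tsibble(idx = 1:50, value = rnorm(50, mean = 10), index = idx)
fit <- model(y, ets = ETS(value))

tidy(fit)     # parameter estimates (e.g. alpha) as a tidy table
glance(fit)   # one row per model: AIC, AICc, BIC, ...
augment(fit)  # fitted values and residuals aligned with the observations
```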

So it's time to forecast. We pipe the mable into the forecast() function, and we're doing a one-day-ahead forecast, equivalent to 24 steps ahead. It supports human-friendly input, so it reads more naturally, like forecasting with a one-day horizon. It's convenient because we no longer need to mentally compute how many hours, minutes, or seconds there are in a day, but it can still do h = 24. And we're done with the forecasts: we have a forecasting table, which is a fable. It's a special table that includes the future predictions. It not only tells you the point forecasts, but also the underlying prediction distribution that captures the uncertainty, because we are forecasters, not fortune tellers. You can see the normal distribution with its mean and standard deviation in the last column, .distribution. This is one of my favorite features of fable, reporting distribution forecasts, because you're able to produce any level of prediction interval you like.
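A sketch of the forecasting step; note that for a time-indexed tsibble the horizon can be given as a human-friendly string like "1 day" instead of a step count:

```r
library(fable)
library(tsibble)

y   <- tsibble(idx = 1:100, value = 10 + rnorm(100), index = idx)
fit <- model(y, ets = ETS(value))

fc <- forecast(fit, h = 24)  # h = "1 day" also works on time-indexed data
fc                           # a fable: point forecasts + distribution column

# Distribution forecasts mean intervals can be produced at any level
hilo(fc, level = c(80, 95))
```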


We'll see the forecasts more clearly with plots using geom_forecast(). The naive method repeats yesterday's pattern, but the 80% and 95% prediction intervals are quite large, and some even go below zero. How about ETS? ETS nicely captures the daily trend and produces much narrower prediction intervals. And which model performs better? Use the accuracy() function to compare the predictions with the test set I held out before. Looking at the accuracy measures, ETS does slightly better than the naive method in terms of root mean squared error, but they both tend to give underestimated predictions. The black line in the plot is the actual data, so it looks like it's another hot day in January. If we had weather information like temperatures, we could include it to improve the forecast, but we'd need a different model that allows for exogenous regressors, for example an ARIMA model.
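The hold-out comparison can be sketched end to end; the train/test split mirrors the talk's "hold out the last day" setup on toy data:

```r
library(fable)
library(tsibble)
library(dplyr)

# 5 "days" of toy hourly data; hold out the last day as a test set
y     <- tsibble(idx = 1:120, value = 10 + sin(2 * pi * (1:120) / 24),
                 index = idx)
train <- filter(y, idx <= 96)

fit <- model(train,
             snaive = SNAIVE(value ~ lag(24)),
             ets    = ETS(value))
fc  <- forecast(fit, h = 24)

# accuracy() matches forecasts against the full data to score the test set
accuracy(fc, y)  # RMSE, MAE, MAPE, ... one row per model
```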

Scaling to many time series

So far, I have shown all the steps from model building to model assessment for just one time series. Would it be any different if we had multiple time series in a table? No, because models are fundamentally scalable. tsibble is a modern reimagining of time series, designed for hosting many time series together. Essentially, the series have already been defined when we created the table, and we are obviously interested in forecasting the demand for each household here. I've removed some troublesome series and ended up with 1,480 households. No extra steps are needed to forecast at scale: just as before, we can directly pipe them into the model() function. It will fit an ETS model for each customer at once, and then happily forecast. I also take a log transformation of the response variable to ensure that I get positive forecasts back, and the forecast() function will take care of the back-transformation for you. You can see 1,480 models have been fitted in the mable, and the key variable is always the key to refer to a series across tsibble, mable, and fable. Models are scalable, but visualization is not, so I just plot four customers with their forecasts here. At the individual level there's lots of noise, producing much larger prediction intervals as well.
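Fitting per-series models with a transformed response can be sketched like this; two toy households stand in for the 1,480, and the column names are assumptions:

```r
library(fable)
library(tsibble)
library(dplyr)

# Two toy households; in the talk there are 1,480
elec <- tsibble(
  customer_id = rep(c("A", "B"), each = 48),
  hour        = rep(1:48, times = 2),
  supply      = rexp(96, rate = 1) + 0.1,
  key   = customer_id,
  index = hour
)

# One ETS model per key; log() on the left-hand side guarantees positive
# forecasts, and forecast() handles the back-transformation automatically
fit_all <- elec %>% model(ets = ETS(log(supply)))
fc_all  <- fit_all %>% forecast(h = 24)

fit_all  # a mable with one row per customer_id
```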

So I've shown a portion of what tsibble and fable can do. There are also decomposition, simulation based on model fits, interpolation of missing values, and model support for streaming data, so please check them out. It's joint work with Di Cook, Rob Hyndman, and Mitch O'Hara-Wild. I need to mention that tsibble is on CRAN, but fable is on GitHub at the moment, and they all belong to tidyverts.org. Those are the useful links to the packages, my slides, and the source code behind the slides. That's all from me. Thank you.

Q&A

Thank you so much, Earo. We have time for a couple of questions, and we have throwable mics moving around, so just raise your hand.

Hello? Can you hear me? Okay. I'm wondering if any of these tools are well-suited for irregular time series?

Not yet, but tsibble will support irregular data structures.

Hi. I was wondering if this supports hierarchical time series reconciliation?

Yes, we're considering that, probably in the second half of this year.

What is the level of maturity of the package compared with the models that are available today in the forecast package?

We're thinking of going to CRAN sometime in March or April, and replacing the whole forecast package by, I'd say, the end of this year.

I noticed that there's a rival tibbletime package, and I'm just wondering what the difference between tsibble and tibbletime is.

So basically there are two time series structures, and they've got a couple of things in common. First, both are built on top of tibble, and second, both declare the index variable. For tsibble, we also require the key, and we check that the key and index give distinct observations. This makes tsibble different from tibbletime, and the function interfaces are also different. Does that answer your question?

So, kind of an offshoot of the last question: do you have any plans to incorporate the ability to specify time windows using shorthand, like xts does or like tibbletime does?

Yes, there is such a function. It's called filter_index(), and it does support the shorthand. Just check the documentation, please.