Resources

Dr. Sydeaka Watson | Neural Networks for Longitudinal Data Analysis | RStudio (2020)

Longitudinal data (or panel data) arise when observations are recorded on the same individuals at multiple points in time. For example, a longitudinal baseball study might track individual player characteristics (team affiliation, age, height, weight, etc.) and outcomes (batting average, stolen bases, runs, strikeouts, etc.) over multiple seasons, where the number of seasons could vary across players. Neural network frameworks such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs) can flexibly accommodate this data structure while preserving and exploiting temporal relationships. In this presentation, we highlight the use of neural networks for longitudinal data analysis with TensorFlow and Keras in R.


Transcript

This transcript was generated automatically and may contain errors.

Alright, and next up we have Dr. Sydeaka Watson.

So perhaps as I get set up here, I can add to my resume that Max opened for me. So now I'm kind of a big deal, I guess.

Okay, so good afternoon. So my name is Sydeaka Watson. I am a senior data scientist at a consulting firm called Elicit Insights. Most of our projects at Elicit are in the customer analytics and employee analytics space, and I'd be happy to chat more about some of those projects after this session. But this presentation is about a fun little side project that I've been working on. So I've listed my speaker affiliation here, Korelasi Data Insights.

What is longitudinal data?

So title, Neural Networks for Longitudinal Data Analysis. First let's start with the definition. So longitudinal data are recorded on the same units at multiple points in time. And so as an example, you could think about a clinical trial where you might have a patient observed over multiple time points, maybe every three months, every six months or something. They come in and you take their vital signs, you weigh them, you take blood or tissue samples, and then maybe you record some biomarkers at each of the different time points. So that's a healthcare example.

Another example could come from retail where maybe you'd observe a customer's behaviors or characteristics every month, or perhaps you're analyzing user behavior, so maybe you're at Facebook or Twitter, and you're looking at how the users behave or interact with the platform every day. So there are lots of other examples.

In those applications, we're looking at people over time, but we could also have a longitudinal data set that stores information about buildings or objects, pets, anything really, where again, we're looking at the same thing over multiple time points. They don't have to be, but usually the time points are evenly spaced, say we're making observations every day or every month. And the units don't necessarily have to have the same number of observations. So thinking back to that clinical trial example, some patients might come in only twice, and some patients might come in 10 times. And you can think of each unit as having a matrix of data where the variables are across the top of the matrix, and the time points are in the rows.
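
To make the per-unit matrix picture concrete, here is a minimal sketch (Python is used here for brevity; the talk's implementation is in R, and all names and values below are invented for illustration):

```python
# Sketch: one unit's longitudinal record as a matrix, with variables in
# columns and time points in rows. All values are invented for illustration.
patient_a = [
    # (month, weight_kg, biomarker)
    (0, 82.5, 1.10),
    (3, 81.0, 1.05),
    (6, 79.8, 0.98),
]

# Units need not share the same number of time points: a second patient
# with only two visits is still a valid unit in the same dataset.
patient_b = [
    (0, 95.0, 1.40),
    (6, 93.2, 1.31),
]

n_time_points = len(patient_a)   # rows of the matrix
n_variables = len(patient_a[0])  # columns of the matrix
print(n_time_points, n_variables)  # 3 3
```

Each unit's matrix is later reshaped into the (samples, time steps, features) arrays that Keras recurrent layers expect.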

The baseball dataset

Sean Lahman has a baseball database that is a fantastic example of longitudinal data collection. His data set has information about players in various roles and positions going back to about 1871, and these data are available in the R package called Lahman, so when you install the package, you'll automatically get access to all of those data sets. But you could also visit his website and download the CSVs directly.

And when we load the tables, we get information about how well the players performed, how well their teams performed as a group, whether they won any awards, whether they switched teams, all sorts of great information about the players. And for baseball hitters in particular, the batting statistics table captures all of the usual performance metrics that are commonly reported on how well they were hitting. In particular, there are three primary baseball outcomes that make up what's called the slash line. And these include the batting average, the on-base percentage and the slugging percentage. And these are usually reported as decimal numbers that are rounded to three decimal places with slashes in between them in that particular order.
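
As a small aside, the slash-line convention described above is easy to reproduce. This is a hypothetical helper, sketched in Python rather than the talk's R; the dropped-leading-zero formatting follows common baseball convention:

```python
def slash_line(avg, obp, slg):
    """Format batting average, on-base percentage, and slugging
    percentage as a conventional slash line, e.g. '.317/.398/.645'."""
    def fmt(x):
        # Baseball convention drops the leading zero: 0.317 -> .317
        return f"{x:.3f}".lstrip("0")
    return "/".join(fmt(x) for x in (avg, obp, slg))

# Illustrative numbers only, not taken from the Lahman dataset.
print(slash_line(0.317, 0.398, 0.645))  # .317/.398/.645
```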

So let's suppose that we are particularly interested in three baseball players, and we know a lot of information about them over the past three seasons, but we'd like to know or predict how they're going to perform in the following season. So can we do that?

The first player in our collection of three players is Willie Mays, and again, we're just a stone's throw away from Black History Month, so I wanted to make sure I highlighted a high-performing African-American player. So Willie Mays, he had this incredible season in 1965, and in his slash line, both his on-base percentage and his slugging percentage led the league, so he's just this powerhouse of a player, right? And in particular, 1965 was a good year for him. The second player is Graig Nettles. In 1986 he was playing his 20th season as a 41-year-old, which is really old for a player, especially a non-pitching player; I'm almost 40 myself, so it's very young in other situations. And so his performance in 1986 was okay, right? So he's an example of a moderate player.

And then we finally have Chris Davis, who had a terrible season in 2018. I'm sure maybe some of the baseball fans probably know a lot about him. He signed this massive $161 million contract and has been just vastly underperforming that. And that year in particular, like I said, was a terrible year.

Modeling considerations

So the question is, now that we have these three players in mind with varying performance, with their actual slash lines shown below, we'd like to somehow build a model that can predict these outcomes.

Now for this type of scenario, we have three key modeling considerations. First, we know that there is a very strong correlation between past performance and future performance. So data from one year to the next are highly correlated, and so we should have some sort of smart way to encode these temporal dependencies and exploit them if possible.

Next we'd like to avoid feature engineering over the different time periods. So if you've dealt with this type of scenario in the past, with data over multiple time points, you might say, OK, let me aggregate somehow over the different time points how many runs they had, or look at some percentage change from one period to the next, or something like that. And then we might also have to manually encode some of the interactions between some of the variables. So obviously we're limited by time and our own imaginations, right? There's almost an infinite number of ways that you could do that. And so we'd like to, if possible, allow the algorithm to come up with those types of transformations.

And then finally, we would like this model to flexibly accommodate different types of model outputs. We might want to predict more than one outcome at a time, such as in our example where we want to predict the three numbers in the slash line. Normally in a regression, you would have three separate models, right, where you individually predict those three outcomes. So each of those individual models is learning separate patterns that are predictive of each of those outcomes. But maybe you want to have one model, perhaps because it's easier to maintain, that looks at individual patterns that can come together and simultaneously predict all of those outcomes. And maybe we also want to think about predicting over multiple time points. So maybe saying, for those three players, could we predict how they performed over the next two or three seasons, right? So that's a lot to ask.


Recurrent neural networks for longitudinal data

Okay, so recurrent neural network models to the rescue. For those who have seen these before, you may be aware that they naturally accommodate temporal dependencies. You've probably seen them in different types of sequential data analyses, such as written text or audio. There are some variants, such as long short-term memory and gated recurrent units, that can regulate the flow of information through the network, in particular looking at data from one period to the next.
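
To make "regulating the flow of information" concrete, here is a minimal pure-Python sketch of one GRU step with scalar state and hand-picked (not learned) weights; real layers use learned weight matrices and are fit via TensorFlow/Keras, as in the talk:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, w):
    """One GRU step with scalar input x and scalar hidden state h.
    w is a dict of scalar weights; real layers learn weight matrices.
    Follows Cho et al. (2014): h_t = (1 - z) * h_prev + z * h_candidate."""
    z = sigmoid(w["wz"] * x + w["uz"] * h + w["bz"])  # update gate in (0, 1)
    r = sigmoid(w["wr"] * x + w["ur"] * h + w["br"])  # reset gate in (0, 1)
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h) + w["bh"])
    return (1.0 - z) * h + z * h_cand

# Hand-picked weights, just to run the recurrence.
w = {"wz": 1.0, "uz": 0.5, "bz": 0.0,
     "wr": 1.0, "ur": 0.5, "br": 0.0,
     "wh": 1.0, "uh": 1.0, "bh": 0.0}

h = 0.0
for x in [0.2, 0.5, 0.1]:  # a short "season-to-season" input sequence
    h = gru_step(x, h, w)
print(round(h, 4))  # final hidden state summarizing the sequence
```

In the talk's setting, this corresponds to a `layer_gru()` layer in the R keras package, where all of the weights above are learned from data.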

So this slide shows an example of how information would flow through such a network for the incredible 1965 season of our high-performing player Willie Mays. And we see that he has the same attributes, age, games, number of hits, number of runs, etc., recorded with different values for each of the years 1962, 1963, and 1964. And so the model is learning all of these season-to-season patterns, over all the players, that are predictive of these three outcomes in the slash line.

Data preparation

To prepare the data, we must break each player's record into these rolling four-season windows. And so what that includes is the three seasons, remember, that we're using for prediction. So those will be the input years of data. And then we've got the one season in the future, right? So one of the things that was an issue, of course, because players are not active every single year, is that if they were inactive for a particular season, we have to do some padding. And so there are some utilities in R that were helpful for that. We have to pad the series to accommodate the skipped years.

But for example, for Willie Mays, we could think about predicting 1962. So we're using the data from 59, 60, and 61, and using that to predict his 62 performance. And he was also active in 63. So we can slide that window to the right and have 60, 61, and 62 predicting 63. And we can do it again to predict 64, and slide it over one more time to predict 65. Now, this, of course, means that our data samples are not independent. And this is OK for now, because we're focusing primarily on the predictions that we're getting from the model at the moment. But there are some other aspects of this model where we will have to address this dependence issue head on. And so I'll revisit that a little later in the talk.
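
The rolling-window construction with padding for inactive seasons could be sketched like this (Python for illustration; the talk's implementation is in R, and the player record below is made up, including a skipped 1961 season):

```python
def rolling_windows(seasons, lookback=3, pad_value=0.0):
    """Build (input_seasons, target_season) pairs from a player's
    per-season feature dict, padding seasons the player sat out.

    seasons: dict mapping year -> feature vector (list of floats)
    Returns a list of (X, y) pairs where X stacks `lookback` consecutive
    seasons (padded where missing) and y is the following season's features.
    """
    n_features = len(next(iter(seasons.values())))
    pad = [pad_value] * n_features
    windows = []
    for target_year in sorted(seasons):
        inputs = [seasons.get(target_year - k, pad)
                  for k in range(lookback, 0, -1)]
        # Require at least one real (non-padded) input season.
        if any(target_year - k in seasons for k in range(lookback, 0, -1)):
            windows.append((inputs, seasons[target_year]))
    return windows

# Hypothetical mini-record: two features per season; 1961 was skipped.
player_seasons = {1959: [0.310, 30], 1960: [0.320, 28], 1962: [0.305, 45]}
windows = rolling_windows(player_seasons)
print(len(windows))  # 2: target 1960 (two padded inputs) and 1962 (1961 padded)
```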

I focused on data collected between 1916 and 2018. And I only looked at the players' first stint. So I didn't consider what happened after they were traded to another team. And this was really just because of the way the data were collected; performance was actually reported for each individual stint. I could have also aggregated across the different stints if I was doing this for real. But again, this was just an illustration. I only looked at players that had at least four consecutive seasons.

And think about the way that I'm constructing these rolling windows. So a player that has a longer career has more rolling windows, right? I'm sliding that window over a longer period of time. So that player will have more points in the data set. So to limit their influence, I downsampled players that had longer careers. And then to focus on hitters, I required the player to have at least 85 plate appearances.
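
One plausible way to implement the downsampling described above is to cap the number of rolling windows kept per player; the talk doesn't state its exact scheme, so this Python sketch is an assumption:

```python
import random

def downsample_windows(windows_by_player, max_per_player, seed=42):
    """Limit each player's influence by keeping at most `max_per_player`
    randomly chosen rolling windows per player. (The exact cap used in
    the talk isn't stated; this is one plausible scheme.)"""
    rng = random.Random(seed)
    kept = []
    for player, windows in windows_by_player.items():
        if len(windows) > max_per_player:
            windows = rng.sample(windows, max_per_player)
        kept.extend((player, w) for w in windows)
    return kept

# Hypothetical window counts: a long career yields many more windows
# than a short one, so we cap the long career's contribution.
data = {"long_career": list(range(17)), "short_career": list(range(2))}
sample = downsample_windows(data, max_per_player=5)
print(len(sample))  # 7 (= 5 + 2)
```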

Now another issue is that, so we also want to avoid data leakage. And so think about the fact that we randomly are splitting the data into training and testing and validation sets. It's possible to have the same player represented in training and testing and validation if we're not careful, right? And so in order to avoid that, one thing we could do is to just randomly split on the players. So you have the training players, the validation players, the testing players, and then you can compute the rolling windows for those individual players all across the data splits.
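
The leakage-free split described above operates on player IDs, not on windows; windows are built afterwards, within each split. A minimal Python sketch (the split fractions are illustrative):

```python
import random

def split_players(player_ids, train=0.7, valid=0.15, seed=7):
    """Split at the *player* level so no player appears in more than one
    of train/validation/test. Rolling windows are computed afterwards,
    within each split, which prevents leakage across splits."""
    ids = sorted(player_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_valid]),
            set(ids[n_train + n_valid:]))

train_ids, valid_ids, test_ids = split_players(
    [f"player_{i}" for i in range(20)])
# Every player lands in exactly one split.
assert not (train_ids & valid_ids or train_ids & test_ids
            or valid_ids & test_ids)
print(len(train_ids), len(valid_ids), len(test_ids))  # 14 3 3
```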

Model fitting and benchmarks

As a benchmark, I fit an ordinary least squares regression model where the three seasons of player data are represented in wide format. So if we had 100 features, that means we've got 300 columns of data, 100 for each time period. I didn't do any manual feature engineering. Remember, that was one of the things I said at the beginning that I was trying to avoid. I just used the data as they were, aside from doing some normalization to the 0-1 scale for the attributes and features. And I did that for the neural network approach as well.
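
The wide-format benchmark representation and the 0-1 normalization can be sketched as follows (Python for illustration; the feature counts below are toy-sized rather than the 100 features mentioned):

```python
def to_wide(window_inputs):
    """Flatten consecutive seasons of features into one wide row:
    e.g. 100 features per season over 3 seasons -> 300 columns,
    as in an OLS design matrix."""
    return [value for season in window_inputs for value in season]

def minmax_scale(column):
    """Normalize one column of the design matrix to the 0-1 scale."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical: 3 seasons x 2 features -> 6 wide columns.
row = to_wide([[10, 0.250], [12, 0.270], [9, 0.300]])
print(len(row))                   # 6
print(minmax_scale([10, 12, 9]))  # min maps to 0.0, max to 1.0
```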

For the neural network approach, I tried different models that included one or more gated recurrent unit layers and then one or more fully connected dense layers with tons of hyperparameter values tuned over the grid. And then I represented the model in two ways. The first approach was to look at three single outcome models where each model was just predicting a single metric in the slash line. And then I also looked at a multi-outcome model where I'm simultaneously predicting all three metrics.

It isn't on the slide, but I mentioned that normalization to the 0-1 scale. And so Max's talk just before this one was timely, thinking about model tuning and how important that is in machine learning, and neural networks in particular. So you're probably aware that these models involve a lot of parameters that must be tuned. A single run of this kind of model can take many minutes or hours, or sometimes days. And so to cut down on the computational time, I used free services from Google Colab. They allow you to use a GPU for free, within reason, obviously. I think they put limits of up to 12 hours of modeling. But that was completely free, so that was helpful here, especially for projects where you're just kind of playing around with something. You'd probably want to use something like AWS GPUs if you're working on a real project.

My GitHub repository has a complete minimal example showing how you could potentially implement this type of model in R. The version that I'm showing doesn't include all of the data joins and filtering and data transformations that were specific to the baseball example. So if you run this, you won't see the results that I'm going to share on the next few slides. But I did it this way because I wanted to keep this example simple enough so that you could think of this as a template that you could modify for your purposes and use it as inspiration for your projects.

Results

Okay. So after all of that, how did we do, right, in our three modeling approaches? The first graph that I'm going to show here shows the performance of the ordinary least squares regression model on our held-out test set. And so for each outcome in the slash line, I'm plotting the predicted values against the actual values. And this dashed line is a reference that shows where the actuals match the predicted, which is where we want a majority of our points to lie in a great prediction model. And as we can see, this is not a great prediction model. Right? It's terrible.

And our three selected players' performances are shown in the colored stars. I don't know if you can see those. But I chose red for terrible, yellow for moderate, and green for great. So again, it doesn't seem to be doing a good job of predicting the outcomes for any of them.

Next, we see the performance of the best gradient-boosted machine regression model that we observed over the grid of hyperparameter values. And you might have noticed from this slide compared to the previous one that the mean square error is lower. But it doesn't appear to be doing a much better job of predicting compared to even ordinary least squares. And so I have a couple of theories about that. One is that, again, there's the data from time point one, time point two, time point three, but it's not actually clear to the model that there's some temporal dependency here. So that's not being captured. And the second is that there might be some types of data transformations or interactions that, if we were to manually encode them, might make the model better.

Okay. And then finally, we have our neural network model fits. And notice how a majority of the points lie close to the line, often on the line. And for our three selected players, the model was able to capture Willie Mays's excellent performance, Graig Nettles's okay performance, and Chris Davis's terrible performance in those three specific years.


So yeah, I can quit my job and work in fantasy baseball and make a lot of money, right? No. So obviously, this was done more for illustration purposes. If I was really trying to deploy this, of course, I'd have to do a lot of other types of validation. But again, this is illustrating the point that these types of models are useful for longitudinal data analysis.

Future work and open issues

There are some issues that I haven't talked about in this short talk that we might explore in a future session. I mentioned earlier that the neural network enables you to predict multiple time steps into the future, but I haven't talked about how you would do that. Really, it just involves changing the shape of the output layer. You'd also have to make some adjustments to the loss function, potentially, if you wanted to, for example, emphasize that the model should perform better at earlier time points compared to later ones, or vice versa. So those are things that you could do.
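
One way to make the loss-function adjustment described above concrete is a per-time-step weighted mean squared error; this Python sketch is an illustration, not the talk's actual loss:

```python
def weighted_mse(y_true, y_pred, weights):
    """Mean squared error over multiple predicted time steps, with
    per-step weights, e.g. to penalize errors at earlier horizons more.
    (A sketch of the idea described in the talk, not its exact loss.)"""
    assert len(y_true) == len(y_pred) == len(weights)
    total = sum(w * (t - p) ** 2
                for t, p, w in zip(y_true, y_pred, weights))
    return total / sum(weights)

# Two-seasons-ahead prediction: weight the first step twice as heavily.
loss = weighted_mse(y_true=[0.300, 0.310],
                    y_pred=[0.290, 0.330],
                    weights=[2.0, 1.0])
print(round(loss, 6))  # 0.0002
```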

And then the elephant in the room, again, is that I have not talked about prediction intervals. So we talked about point estimates, estimating that one number. But getting some sort of estimate of a reasonable range of values that the actual value could lie within is something that we would have to spend a little bit more time thinking about. Remember, we're talking about that correlation structure from before. There's this nagging dependence issue that we've introduced by producing these rolling windows. So that's something we would certainly have to work out.

And then here are the reference links to the Sean Lahman baseball datasets that I referenced earlier, and I'll also give you the link to my GitHub repo where the slides and the code are available. Thank you.

Q&A

We have time for a few questions. First one is, what packages did you use to train your models? Great. And I actually just deleted that slide. So the ordinary least squares regression model was just based on the stats package. I fit the GBMs using H2O. And then the neural network models were fit using TensorFlow and Keras packages. And all of this was done in R, of course.

And then, how do you deal with cases where certain players may have more or fewer past observations? For example, predicting one player in his 12th season and another in his third? So remember, we intentionally used a common window length, where the task was to predict, using the player's previous three seasons, what their performance was going to be in the following season. So in that sense, it really didn't matter how long they had played overall. It was a matter of how to translate the data structure into the one that represents this particular modeling task, the one the model is actually going to use.

But that is actually a really good question, because as I'm sure you know if you've worked with LSTMs and these kinds of models, they can accommodate various lengths, right? So you could potentially say, given the player's entire history, whether they played two years or ten years or however long, what is their next season going to be, right? You don't necessarily have to specify ahead of time or lock in a particular window length. That's fairly easy to do, and I don't have time to go into that solution here, but it is very straightforward.

One more question. What are some diagnostic plots slash sanity checks you can use to see if your RNN is behaving properly? Sure. Great question. I didn't really go into it as much, but I flashed it up on the screen. So the example that I provided on my GitHub page does provide two types of graphs. One is the individual losses for the individual outcomes, so that's for each of the different outcomes in the slash line individually, and then there's the overall loss function. So you can actually see how those lines are behaving, whether or not we're actually getting convergence, whether or not we're continuing to see improvement beyond a certain number of iterations, and then also, just as a means of monitoring in intuitive terms, how well the model's doing. On the right side, I'm also showing that predicted versus actual plot as well. But yeah, there's a ton of other things that you could do.

Perfect. Thank you so much. And next up, we have Nick Strayer. Thank you.