Resources

Dr. Sydeaka Watson | Neural Networks for Longitudinal Data Analysis | RStudio (2020)

Longitudinal data (or panel data) arise when observations are recorded on the same individuals at multiple points in time. For example, a longitudinal baseball study might track individual player characteristics (team affiliation, age, height, weight, etc.) and outcomes (batting average, stolen bases, runs, strikeouts, etc.) over multiple seasons, where the number of seasons could vary across players. Neural network frameworks such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs) can flexibly accommodate this data structure while preserving and exploiting temporal relationships. In this presentation, we highlight the use of neural networks for longitudinal data analysis with TensorFlow and Keras in R.


Transcript

This transcript was generated automatically and may contain errors.

Alright, and next up we have Dr. Sydeaka Watson.

So perhaps as I get set up here, I can add to my resume that Max opened for me. So now I'm kind of a big deal, I guess.

Okay, so good afternoon. So my name is Sydeaka Watson. I am a senior data scientist at a consulting firm called Elicit Insights. Most of our projects at Elicit are in the customer analytics and employee analytics space, and I'd be happy to chat more about some of those projects after this session. But this presentation is about a fun little side project that I've been working on. So I've listed my speaker affiliation here, Korelasi Data Insights.

What is longitudinal data?

So title, Neural Networks for Longitudinal Data Analysis. First let's start with the definition. So longitudinal data are recorded on the same units at multiple points in time. And so as an example, you could think about a clinical trial where you might have a patient observed over multiple time points, maybe every three months, every six months or something. They come in and you take their vital signs, you weigh them, you take blood or tissue samples, and then maybe you record some biomarkers at each of the different time points. So that's a healthcare example.

Another example could come from retail where maybe you'd observe a customer's behaviors or characteristics every month, or perhaps you're analyzing user behavior, so maybe you're at Facebook or Twitter, and you're looking at how the users behave or interact with the platform every day. So there are lots of other examples.

In those applications, we're looking at people over time, but we could also have a longitudinal data set that stores information about buildings or objects, pets, anything really, where again, we're looking at the same thing over multiple time points. They don't have to be, but usually the time points are evenly spaced, say we're making observations every day or every month. And the units don't necessarily have to have the same number of observations. So thinking back to that clinical trial example, some patients might come in only twice, and some patients might come in 10 times. And you can think of each unit as having a matrix of data where the variables are across the top of the matrix, and the time points are in the rows.
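
To make the per-unit matrix picture concrete, here is a minimal sketch (Python is used here for brevity; the talk's implementation is in R, and all names and values below are invented for illustration):

```python
# Sketch: one unit's longitudinal record as a matrix, with variables in
# columns and time points in rows. All values are invented for illustration.
patient_a = [
    # (month, weight_kg, biomarker)
    (0, 82.5, 1.10),
    (3, 81.0, 1.05),
    (6, 79.8, 0.98),
]

# Units need not share the same number of time points: a second patient
# with only two visits is still a valid unit in the same dataset.
patient_b = [
    (0, 95.0, 1.40),
    (6, 93.2, 1.31),
]

n_time_points = len(patient_a)   # rows of the matrix
n_variables = len(patient_a[0])  # columns of the matrix
print(n_time_points, n_variables)  # 3 3
```

Each unit's matrix is later reshaped into the (samples, time steps, features) arrays that Keras recurrent layers expect.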

The baseball dataset

Sean Lahman has a baseball database that is a fantastic example of longitudinal data collection. His data set has information about players in various roles and positions going back to about 1871, and these data are available in the R package called Lahman, so when you install the package, you'll automatically get access to all of those data sets. But you could also visit his website and download the CSVs directly.

And when we load the tables, we get information about how well the players performed, how well their teams performed as a group, whether they won any awards, whether they switched teams, all sorts of great information about the players. And for baseball hitters in particular, the batting statistics table captures all of the usual performance metrics that are commonly reported on how well they were hitting. In particular, there are three primary baseball outcomes that make up what's called the slash line. And these include the batting average, the on-base percentage and the slugging percentage. And these are usually reported as decimal numbers that are rounded to three decimal places with slashes in between them in that particular order.
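
As a small aside, the slash-line convention described above is easy to reproduce. This is a hypothetical helper, sketched in Python rather than the talk's R; the dropped-leading-zero formatting follows common baseball convention:

```python
def slash_line(avg, obp, slg):
    """Format batting average, on-base percentage, and slugging
    percentage as a conventional slash line, e.g. '.317/.398/.645'."""
    def fmt(x):
        # Baseball convention drops the leading zero: 0.317 -> .317
        return f"{x:.3f}".lstrip("0")
    return "/".join(fmt(x) for x in (avg, obp, slg))

# Illustrative numbers only, not taken from the Lahman dataset.
print(slash_line(0.317, 0.398, 0.645))  # .317/.398/.645
```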

So let's suppose that we are particularly interested in three baseball players, and we know a lot of information about them over the past three seasons, but we'd like to know or predict how they're going to perform in the following season. So can we do that?

The first player in our collection of three players is Willie Mays, and again, we're just a stone's throw away from Black History Month, so I wanted to make sure I highlighted a high-performing African-American player. So Willie Mays, he had this incredible season in 1965, and in his slash line, both his on-base percentage and his slugging percentage led the league, so he's just this powerhouse of a player, right? And in particular, 1965 was a good year for him. The second player is Graig Nettles. In 1986 he was playing his 20th season as a 41-year-old, which is really old for a player, especially a non-pitching player; I'm almost 40 myself, so it's very young in other situations. And so his performance in 1986 was okay, right? So he's an example of a moderate player.

And then we finally have Chris Davis, who had a terrible season in 2018. I'm sure maybe some of the baseball fans probably know a lot about him. He signed this massive $161 million contract and has been just vastly underperforming that. And that year in particular, like I said, was a terrible year.

Modeling considerations

So the question is, now that we have these three players in mind with varying performance, with their actual slash lines shown below, we'd like to somehow build a model that can predict these outcomes.

Now for this type of scenario, we have three key modeling considerations. First, we know that there is a very strong correlation between past performance and future performance. So data from one year to the next are highly correlated, and so we should have some sort of smart way to encode these temporal dependencies and exploit them if possible.

Next we'd like to avoid feature engineering over the different time periods. So if you've dealt with this type of scenario in the past, with data over multiple time points, you might say, OK, let me aggregate somehow over the different time points how many runs they had, or look at some percentage change from one period to the next, or something like that. And then we might also have to manually encode some of the interactions between some of the variables. So obviously we're limited by time and our own imaginations, right? There's almost an infinite number of ways that you could do that. And so we'd like to, if possible, allow the algorithm to come up with those types of transformations.

And then finally, we would like this model to flexibly accommodate different types of model outputs. We might want to predict more than one outcome at a time, such as in our example where we want to predict the three numbers in the slash line. Normally in a regression, you would have three separate models, right, where you individually predict those three outcomes. So each of those individual models is learning separate patterns that are predictive of each of those outcomes. But maybe you want to have one model, perhaps because it's easier to maintain, that looks at individual patterns that can come together and simultaneously predict all of those outcomes. And maybe we also want to think about predicting over multiple time points. So maybe saying, for those three players, could we predict how they performed over the next two or three seasons, right? So that's a lot to ask.


Recurrent neural networks for longitudinal data

Okay, so recurrent neural network models to the rescue. For those who have seen these before, you may be aware that they naturally accommodate temporal dependencies. You've probably seen them in different types of sequential data analyses, such as written text or audio. There are some variants, such as long short-term memory and gated recurrent units, that can regulate the flow of information through the network, in particular looking at data from one period to the next.
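
To make "regulating the flow of information" concrete, here is a minimal pure-Python sketch of one GRU step with scalar state and hand-picked (not learned) weights; real layers use learned weight matrices and are fit via TensorFlow/Keras, as in the talk:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, w):
    """One GRU step with scalar input x and scalar hidden state h.
    w is a dict of scalar weights; real layers learn weight matrices.
    Follows Cho et al. (2014): h_t = (1 - z) * h_prev + z * h_candidate."""
    z = sigmoid(w["wz"] * x + w["uz"] * h + w["bz"])  # update gate in (0, 1)
    r = sigmoid(w["wr"] * x + w["ur"] * h + w["br"])  # reset gate in (0, 1)
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h) + w["bh"])
    return (1.0 - z) * h + z * h_cand

# Hand-picked weights, just to run the recurrence.
w = {"wz": 1.0, "uz": 0.5, "bz": 0.0,
     "wr": 1.0, "ur": 0.5, "br": 0.0,
     "wh": 1.0, "uh": 1.0, "bh": 0.0}

h = 0.0
for x in [0.2, 0.5, 0.1]:  # a short "season-to-season" input sequence
    h = gru_step(x, h, w)
print(round(h, 4))  # final hidden state summarizing the sequence
```

In the talk's setting, this corresponds to a `layer_gru()` layer in the R keras package, where all of the weights above are learned from data.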

So this slide shows an example of how information would flow through such a network for the incredible 1965 season of our high-performing player Willie Mays. And we see that he has the same attributes, age, games, number of hits, number of runs, etc., recorded with different values for each of the years 1962, 1963, and 1964. And so the model is learning all of these season-to-season patterns, over all the players, that are predictive of these three outcomes in the slash line.

Data preparation

To prepare the data, we must break each player's record into these rolling four-season windows. And so what that includes is the three seasons, remember, that we're using for prediction. So those will be the input years of data. And then we've got the one season in the future, right? So one of the things that was an issue, of course, because players are not active every single year, is that if they were inactive for a particular season, we have to do some padding. And so there are some utilities in R that were helpful for that. We have to pad the series to accommodate the skipped years.

But for example, for Willie Mays, we could think about predicting 1962. So we're using the data from 59, 60, and 61, and using that to predict his 62 performance. And he was also active in 63. So we can slide that window to the right and have 60, 61, and 62 predicting 63. And we can do it again to predict 64, and slide it over one more time to predict 65. Now, this, of course, means that our data samples are not independent. And this is OK for now, because we're focusing primarily on the predictions that we're getting from the model at the moment. But there are some other aspects of this model where we will have to address this dependence issue head on. And so I'll revisit that a little later in the talk.
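
The rolling-window construction with padding for inactive seasons could be sketched like this (Python for illustration; the talk's implementation is in R, and the player record below is made up, including a skipped 1961 season):

```python
def rolling_windows(seasons, lookback=3, pad_value=0.0):
    """Build (input_seasons, target_season) pairs from a player's
    per-season feature dict, padding seasons the player sat out.

    seasons: dict mapping year -> feature vector (list of floats)
    Returns a list of (X, y) pairs where X stacks `lookback` consecutive
    seasons (padded where missing) and y is the following season's features.
    """
    n_features = len(next(iter(seasons.values())))
    pad = [pad_value] * n_features
    windows = []
    for target_year in sorted(seasons):
        inputs = [seasons.get(target_year - k, pad)
                  for k in range(lookback, 0, -1)]
        # Require at least one real (non-padded) input season.
        if any(target_year - k in seasons for k in range(lookback, 0, -1)):
            windows.append((inputs, seasons[target_year]))
    return windows

# Hypothetical mini-record: two features per season; 1961 was skipped.
player_seasons = {1959: [0.310, 30], 1960: [0.320, 28], 1962: [0.305, 45]}
windows = rolling_windows(player_seasons)
print(len(windows))  # 2: target 1960 (two padded inputs) and 1962 (1961 padded)
```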

I focused on data collected between 1916 and 2018. And I only looked at the players' first stint. So I didn't consider what happened after they were traded to another team. And this was really just because of the way the data were collected; performance was actually reported for each individual stint. I could have also aggregated across the different stints if I was doing this for real. But again, this was just an illustration. I only looked at players that had at least four consecutive seasons.

And think about the way that I'm constructing these rolling windows. So a player that has a longer career has more rolling windows, right? I'm sliding that window over a longer period of time. So that player will have more points in the data set. So to limit their influence, I downsampled players that had longer careers. And then to focus on hitters, I required the player to have at least 85 plate appearances.
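
One plausible way to implement the downsampling described above is to cap the number of rolling windows kept per player; the talk doesn't state its exact scheme, so this Python sketch is an assumption:

```python
import random

def downsample_windows(windows_by_player, max_per_player, seed=42):
    """Limit each player's influence by keeping at most `max_per_player`
    randomly chosen rolling windows per player. (The exact cap used in
    the talk isn't stated; this is one plausible scheme.)"""
    rng = random.Random(seed)
    kept = []
    for player, windows in windows_by_player.items():
        if len(windows) > max_per_player:
            windows = rng.sample(windows, max_per_player)
        kept.extend((player, w) for w in windows)
    return kept

# Hypothetical window counts: a long career yields many more windows
# than a short one, so we cap the long career's contribution.
data = {"long_career": list(range(17)), "short_career": list(range(2))}
sample = downsample_windows(data, max_per_player=5)
print(len(sample))  # 7 (= 5 + 2)
```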

Now another issue is that, so we also want to avoid data leakage. And so think about the fact that we randomly are splitting the data into training and testing and validation sets. It's possible to have the same player represented in training and testing and validation if we're not careful, right? And so in order to avoid that, one thing we could do is to just randomly split on the players. So you have the training players, the validation players, the testing players, and then you can compute the rolling windows for those individual players all across the data splits.
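
The leakage-free split described above operates on player IDs, not on windows; windows are built afterwards, within each split. A minimal Python sketch (the split fractions are illustrative):

```python
import random

def split_players(player_ids, train=0.7, valid=0.15, seed=7):
    """Split at the *player* level so no player appears in more than one
    of train/validation/test. Rolling windows are computed afterwards,
    within each split, which prevents leakage across splits."""
    ids = sorted(player_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_valid]),
            set(ids[n_train + n_valid:]))

train_ids, valid_ids, test_ids = split_players(
    [f"player_{i}" for i in range(20)])
# Every player lands in exactly one split.
assert not (train_ids & valid_ids or train_ids & test_ids
            or valid_ids & test_ids)
print(len(train_ids), len(valid_ids), len(test_ids))  # 14 3 3
```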

Model fitting and benchmarks

As a benchmark, I fit an ordinary least squares regression model where the three seasons of player data are represented in wide format. So if we had 100 features, that means we've got 300 columns of data, 100 for each time period. I didn't do any manual feature engineering. Remember, that was one of the things I said at the beginning that I was trying to avoid. I just used the data as they were, aside from doing some normalization to the 0-1 scale for the attributes and features. And I did that for the neural network approach as well.
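
The wide-format benchmark representation and the 0-1 normalization can be sketched as follows (Python for illustration; the feature counts below are toy-sized rather than the 100 features mentioned):

```python
def to_wide(window_inputs):
    """Flatten consecutive seasons of features into one wide row:
    e.g. 100 features per season over 3 seasons -> 300 columns,
    as in an OLS design matrix."""
    return [value for season in window_inputs for value in season]

def minmax_scale(column):
    """Normalize one column of the design matrix to the 0-1 scale."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical: 3 seasons x 2 features -> 6 wide columns.
row = to_wide([[10, 0.250], [12, 0.270], [9, 0.300]])
print(len(row))                   # 6
print(minmax_scale([10, 12, 9]))  # min maps to 0.0, max to 1.0
```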

For the neural network approach, I tried different models that included one or more gated recurrent unit layers and then one or more fully connected dense layers with tons of hyperparameter values tuned over the grid. And then I represented the model in two ways. The first approach was to look at three single outcome models where each model was just predicting a single metric in the slash line. And then I also looked at a multi-outcome model where I'm simultaneously predicting all three metrics.

It isn't on the slide, but I mentioned that normalization to the 0-1 scale. And so Max's talk just before this one was timely, thinking about model tuning and how important that is in machine learning, and neural networks in particular. So you're probably aware that these models involve a lot of parameters that must be tuned. A single run of this kind of model can take many minutes or hours, or sometimes days. And so to cut down on the computational time, I used free services from Google Colab. They allow you to use a GPU for free, within reason, obviously. I think they put limits of up to 12 hours of modeling. But that was completely free, so that was helpful here, especially for projects where you're just kind of playing around with something. You'd probably want to use something like AWS GPUs if you're working on a real project.

My GitHub repository has a complete minimal example showing how you could potentially implement this type of model in R. The version that I'm showing doesn't include all of the data joins and filtering and data transformations that were specific to the baseball example. So if you run this, you won't see the results that I'm going to share on the next few slides. But I did it this way because I wanted to keep this example simple enough so that you could think of this as a template that you could modify for your purposes and use it as inspiration for your projects.

Results

Okay. So after all of that, how did we do, right, in our three modeling approaches? The first graph that I'm going to show here shows the performance of the ordinary least squares regression model on our held-out test set. And so for each outcome in the slash line, I'm plotting the predicted values against the actual values. And this dashed line is a reference that shows where the actuals match the predicted, which is where we want a majority of our points to lie in a great prediction model. And as we can see, this is not a great prediction model. Right? It's terrible.

And our three selected players' performances are shown in the colored stars. I don't know if you can see those. But I chose red for terrible, yellow for moderate, and green for great. So again, it doesn't seem to be doing a good job of predicting the outcomes for any of them.

Next, we see the performance of the best gradient-boosted machine regression model that we observed over the grid of hyperparameter values. And you might have noticed from this slide compared to the previous one that the mean square error is lower. But it doesn't appear to be doing a much better job of predicting compared to even ordinary least squares. And so I have a couple of theories about that. One is that, again, there's the data from time point one, time point two, time point three, but it's not actually clear to the model that there's some temporal dependency here. So that's not being captured. And the second is that there might be some types of data transformations or interactions that, if we were to manually encode them, might make the model better.

Okay. And then finally, we have our neural network model fits. And notice how a majority of the points lie close to the line, often on the line. And for our three selected players, the model was able to capture Willie Mays's excellent performance, Graig Nettles's okay performance, and Chris Davis's terrible performance in those three specific years.


So yeah, I can quit my job and work in fantasy baseball and make a lot of money, right? No. So obviously, this was done more for illustration purposes. If I was really trying to deploy this, of course, I'd have to do a lot of other types of validation. But again, this is illustrating the point that these types of models are useful for longitudinal data analysis.

Future work and open issues

There are some issues that I haven't talked about in this short talk that we might explore in a future session. I mentioned earlier that the neural network enables you to predict multiple time steps into the future, but I haven't talked about how you would do that. Really, it just involves changing the shape of the output layer. You'd also have to make some adjustments to the loss function, potentially, if you wanted to, for example, emphasize that the model should perform better at earlier time points compared to later ones, or vice versa. So those are things that you could do.
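
One way to make the loss-function adjustment described above concrete is a per-time-step weighted mean squared error; this Python sketch is an illustration, not the talk's actual loss:

```python
def weighted_mse(y_true, y_pred, weights):
    """Mean squared error over multiple predicted time steps, with
    per-step weights, e.g. to penalize errors at earlier horizons more.
    (A sketch of the idea described in the talk, not its exact loss.)"""
    assert len(y_true) == len(y_pred) == len(weights)
    total = sum(w * (t - p) ** 2
                for t, p, w in zip(y_true, y_pred, weights))
    return total / sum(weights)

# Two-seasons-ahead prediction: weight the first step twice as heavily.
loss = weighted_mse(y_true=[0.300, 0.310],
                    y_pred=[0.290, 0.330],
                    weights=[2.0, 1.0])
print(round(loss, 6))  # 0.0002
```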

And then the elephant in the room, again, is that I have not talked about prediction intervals. So we talked about point estimates, estimating that one number. But getting some sort of estimate of a reasonable range of values that the actual value could lie within is something that we would have to spend a little bit more time thinking about. Remember, we're talking about that correlation structure from before. There's this nagging dependence issue that we've introduced by producing these rolling windows. So that's something we would certainly have to work out.

And then here are the reference links to the Sean Lahman baseball datasets that I referenced earlier, and I'll also give you the link to my GitHub repo where the slides and the code are available. Thank you.

Q&A

We have time for a few questions. First one is, what packages did you use to train your models? Great. And I actually just deleted that slide. So the ordinary least squares regression model was just based on the stats package. I fit the GBMs using H2O. And then the neural network models were fit using TensorFlow and Keras packages. And all of this was done in R, of course.

And then, how do you deal with cases where certain players may have more or fewer past observations? For example, predicting one player in his 12th season and another in his third? So remember, we intentionally used a common window length, where the task was to predict, using the player's previous three seasons, what their performance was going to be in the following season. So in that sense, it really didn't matter how long they had played overall. It was a matter of how to translate the data structure into the one that represents this particular modeling task, the one the model is actually going to use.

But that is actually a really good question, because as I'm sure you know if you've worked with LSTMs and these kinds of models, they can accommodate various lengths, right? So you could potentially say, given the player's entire history, whether they played two years or ten years or however long, what is their next season going to be, right? You don't necessarily have to specify ahead of time or lock in a particular window length. That's fairly easy to do, and I don't have time to go into that solution here, but it is very straightforward.

One more question. What are some diagnostic plots slash sanity checks you can use to see if your RNN is behaving properly? Sure. Great question. I didn't really go into it as much, but I flashed it up on the screen. So the example that I provided on my GitHub page does provide two types of graphs. One is the individual losses for the individual outcomes, so that's for each of the different outcomes in the slash line individually, and then there's the overall loss function. So you can actually see how those lines are behaving, whether or not we're actually getting convergence, whether or not we're continuing to see improvement beyond a certain number of iterations, and then also, just as a means of monitoring in intuitive terms, how well the model's doing. On the right side, I'm also showing that predicted versus actual plot as well. But yeah, there's a ton of other things that you could do.

Perfect. Thank you so much. And next up, we have Nick Strayer. Thank you.