
Namita Nandakumar | R + Tidyverse in Sports | RStudio (2020)
This talk will use a case study, most likely in hockey, to showcase the many ways in which R and the Tidyverse can be used to analyze sports data as well as the unique priorities and considerations that are involved in applying statistical tools to sports problems
Transcript
This transcript was generated automatically and may contain errors.
Alright, thanks everyone. It's exciting to be here. So Danny was a great lead into this because he works in hockey and was talking about football, and I work in football, and I'm going to talk about hockey.
Just to give you a bit of background, I'm a quantitative analyst with the Philadelphia Eagles, but I also have done a lot of public hockey research in the past. To be honest, a lot of my public research is like really niche stuff about the draft that frankly none of you need to hear right now. So I thought that it would be maybe more interesting to learn about one of the most popular stats in public hockey analytics today. So that's expected goals, which is why the title is Finding Expected and Unexpected Goals with R.
Just to give you guys a bit of background, there might be some non-sports fans in here, but this definition worked really well the last time I used it at a technical conference, so I'm going to stick with it: violent soccer on ice. And there are 31, soon to be 32, as Danny knows, teams in the NHL who spend all year trying to be good enough at violent soccer on ice that they win the Stanley Cup.
Why does it matter? I don't know. Some people think it's fun. There are literally dozens of us who love it.
Evaluating hockey teams and players
And so we can start to think about what are some ways to evaluate hockey teams and players. Maybe the most obvious one is the percentage of times that they win hockey games. That seems relevant. But also, in order to win games, you need to score more goals than your opponent. So goals for, goals against also seems valid. In order to score a goal, you need to take a shot. What's that saying? You miss 100% of the shots you don't take. So you can look at shots for and shots against.
You could also just rely on your personal feelings. I think that that is equally valid, so I just wanted to put that up there. But since we are at an R conference, let's talk about the statistical aspect.
So shots for and shots against have this really nice property for analytical purposes of being one of the most granular things we can observe. If there are five, six, seven goals in a game, there are dozens of shots that we can look at. But the natural retort to this, which has been true for years, as long as people have evaluated shot-based metrics, is that not all shots are equally dangerous. And that's right. We have a right to criticize that.
Breaking down a shot
So I wanted to give you guys a little visual here. This is a goal that happened recently. Again, we're starting from the basics here. And to give you a little bit of a closer look at this goal: this is a randomly chosen goal. Well, it's not really random. I'm from Philly, I'm a Flyers fan, so this is a Flyers goal from this season. It happened a couple weeks ago.
And it's worth trying to figure out what we can and can't say about this goal. Just by looking at the play-by-play data, we know that it was, in fact, a goal. It was scored by Flyers forward Travis Konecny on Kings goalie Jack Campbell. It was at even strength, so there were five skaters on the ice for each team; it wasn't a penalty situation or anything weird like that. It was right in front of the net, about 18 feet away. It was a wrist shot, and it put the Flyers up 2-0.
But what is the play-by-play data hiding? It's worth knowing the strengths and limitations of our data. So first of all, it's really nice that they labeled the shot location for us, but frankly, the angle looks a bit off, which is definitely plausible because these shots are manually tracked, for now. I say "for now" because the NHL has promised that they're going to implement tracking data this year, which will probably change that.
But for now, we can assume there is recording bias to some extent with these shot locations. So if we look back at this, kind of looking at where Travis Konecny is, I probably would have labeled the shot a little bit to the right. So we can say the distance looks pretty good, but the angle is a bit off. If we're making any models with that data, that's something to keep in mind.
It's also worth noting that before the goal happened, Konecny was set up with a great quick pass that he was able to handle right away. If the pass wasn't as good, he might not have gotten as good a shot off. He was also wide open. There weren't any defenders in his immediate vicinity, and there were two players screening the net, as we can see right here. That probably gave the goalie some trouble.
So all of that is to say there's a lot of stuff we know about the shot. There's a lot of stuff that we don't know. Probably if we incorporate all of this stuff that we can't see in the play-by-play data, it's probably a more dangerous shot than we would be able to estimate. But let's try it anyway, see what we can come up with.
Introducing expected goals
So unblocked shots. Blocked shots are like if you shoot and I stand in front of you and it hits me, and I get hurt real bad, but it doesn't go near the net. So unblocked shots that actually make it to the net at 5v5 have scored about 6% of the time this season. But I think TK's, Travis Konecny's, shot was better than that. So how can we quantify this?
And everyone with a passable understanding of hockey, which I've just given to you right now, and of R, which I hope a lot of you came in with, can build an expected goals model right now. These are facts. The opinion is that this is a fun thing to do. So you'll just have to take my word for it on that.
And before I go any further, it's definitely worth noting that this is not some idea that I came up with. Expected goals has been a stat in the hockey analytics community for quite a while. And there have been a lot of really smart researchers to take a crack at this. And I can't go through all of their write-ups, but this is just to say that none of the ideas that follow are brand new or uniquely mine or anything like that.
Getting the data
And so I was talking about the play-by-play data. It's worth looking into, you know, where are we finding that? The short answer is the NHL. The longer answer is that a lot of people have put a lot of work into scraping NHL data and cleaning it. One of these people is Peter Tanner from moneypuck.com. So I give him a lot of credit for making this data really easily available for everyone. Because he scrapes NHL data every single day and shares season-level CSVs of every unblocked shot and all of the associated play-by-play information.
And so for the purposes of this talk, we're going to use the information from 2017 to present. So, like, two and a half seasons, the most recent ones. And there is a lot of stuff that he did for us that I didn't want to do, so thank you so much, Peter. Including that he labeled shot types such as rush shots, rebounds, et cetera. So for example, the play-by-play won't say rebound, but you know that if the shot happened within two seconds of a previous shot, that that was probably a rebound.
He also adjusted XY coordinates for recording bias. So I mentioned that these shot locations are manually tracked by, you know, a consistent group of people at each arena. So the recording bias has some consistent patterns based on the arena that you're in. So there are adjustments for that. And in general, cleaned it up, cross-referenced it, made sure that it was nice and ready to be very tidily analyzed by us. So again, shout out to him.
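The two-second rebound rule mentioned above is easy to express with a lagged time difference. Here is a minimal sketch on toy data; the column names are hypothetical, not MoneyPuck's actual schema.

```r
library(dplyr)

# Toy play-by-play: unblocked shots from one game, ordered by game clock.
# Column names here are made up, not MoneyPuck's actual schema.
shots <- tibble::tibble(
  game_seconds = c(100, 101.5, 130, 400, 401),
  shooter      = c("A", "B", "C", "D", "E")
)

# Heuristic from the talk: a shot within two seconds of the previous
# shot was probably a rebound. The first shot has no predecessor,
# so coalesce() turns its NA into FALSE.
shots <- shots |>
  mutate(is_rebound = coalesce(game_seconds - lag(game_seconds) <= 2, FALSE))
```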
This is the big code chunk that I'm sharing. It's not a complicated model or anything, but I really did want to share this to show that this is literally how easy it is to get NHL shot data into your RStudio instance and start to analyze it. All I did was write a function to download the CSVs on his website and then use a bit of purrr to map that over the three most recent seasons. And then I love the janitor package. I love clean_names. I love snake case. I can't look at anything else anymore.
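A sketch of that pattern, with the download URL left as an assumption (check moneypuck.com for the real file paths before relying on it):

```r
# Hypothetical URL pattern for MoneyPuck's season-level shot CSVs;
# verify the real path on moneypuck.com before relying on it.
shot_file_url <- function(season) {
  paste0("https://moneypuck.com/moneypuck/playerData/shots/shots_",
         season, ".csv")
}

# With the tidyverse loaded, downloading and stacking the three most
# recent seasons is one line (network access required):
#   library(purrr); library(readr); library(janitor)
#   shots <- map_dfr(2017:2019, ~ read_csv(shot_file_url(.x))) |> clean_names()
```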
Exploring the data with heat maps
The first thing we can do before we get back to expected goals, heat maps. Everyone loves heat maps. So there is also code available to essentially make your own rink in ggplot. So I used that here to make a heat map of just all the unblocked shots in this data and where they were taken. And you can see they're generally taken closer to the net. I also made a heat map of the goals. And actually this alone can also tell you that, you know, shot location matters because if it didn't matter, if all shots were created equal, these heat maps would look the same. But clearly you can see that shots closer to the net are more dangerous, which also just makes sense like logically.
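The heat maps above can be approximated in a few lines. This sketch uses simulated shot coordinates; publicly shared code exists to draw a full NHL rink in ggplot, but a plain 2D bin plot is enough to show the idea.

```r
library(ggplot2)

# Simulated shot coordinates standing in for the real data: most shots
# cluster near the net, which sits at x = 89 feet.
set.seed(1)
shots <- data.frame(
  x = pmin(89, 89 - rexp(5000, rate = 1 / 25)),
  y = rnorm(5000, mean = 0, sd = 15)
)

p <- ggplot(shots, aes(x, y)) +
  geom_bin2d(bins = 40) +
  labs(title = "Unblocked shots (simulated)",
       x = "Length of ice (ft)", y = "Width of ice (ft)")
```

Making the same plot with only the goals, and comparing, is what reveals that shot location matters.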
Building the model
The first thing we can do as kind of a really lazy baseline thing is just look at distance, because I feel like that's something that's really important. And we can just use a logistic regression with a polynomial term and estimate the probability of scoring a goal based on the absolute distance to the net. So the X axis there is shot distance, the Y axis is the probability, and we can see that just based on this one feature, your estimate can be anywhere from near 0% to upwards of 20% if you're really close to the net.
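A self-contained version of that baseline, on simulated data (the decay coefficients here are made up for illustration, not fit to real NHL shots):

```r
# Simulated data where the chance of a goal decays with shot distance.
set.seed(42)
n <- 20000
distance <- runif(n, 0, 90)
goal <- rbinom(n, 1, plogis(-1.2 - 0.06 * distance))

# Logistic regression with a polynomial distance term, as in the talk.
fit <- glm(goal ~ poly(distance, 3), family = binomial)

# Estimated scoring probability at 5, 18, and 60 feet from the net:
preds <- predict(fit, newdata = data.frame(distance = c(5, 18, 60)),
                 type = "response")
```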
The slightly less lazy thing to do would be to use something like XGBoost and also incorporate shot angle, shot type, time and distance from the last event. And also not assume linearity. We can do some cross validation to choose the number of boosting rounds. And I always feel like whenever I talk about machine learning, I make it sound really simple and unimpressive. And I know you're not supposed to do that, which is why I've incorporated these images from Google searches of machine learning of like big brains and lots of nodes and stuff.
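The steps above (gradient boosting with extra features, cross-validated round selection, then variable importance) can be sketched like this; the features are simulated stand-ins, not the real data, and shot type would additionally need one-hot encoding.

```r
library(xgboost)

# Simulated stand-ins for the features the talk lists.
set.seed(7)
n <- 5000
X <- cbind(distance        = runif(n, 0, 90),
           angle           = runif(n, -70, 70),
           time_since_last = rexp(n, 1 / 10))
y <- rbinom(n, 1, plogis(-1 - 0.05 * X[, "distance"]))

dtrain <- xgb.DMatrix(X, label = y)
params <- list(objective = "binary:logistic", eval_metric = "logloss",
               max_depth = 4, eta = 0.1)

# Cross-validation to choose the number of boosting rounds.
cv <- xgb.cv(params = params, data = dtrain, nrounds = 200, nfold = 5,
             early_stopping_rounds = 10, verbose = 0)

model <- xgb.train(params = params, data = dtrain,
                   nrounds = cv$best_iteration)
imp <- xgb.importance(model = model)  # variable importance, as on the slide
```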
And when we're talking about variable inclusion, it's worth really thinking about it. Because frankly, we're not working with big data here. There's not that many shots even when you're going over multiple seasons. And so I don't think this is the type of problem that requires like a kitchen sink approach of just like tossing in all the variables and seeing what sticks. I think it's worth it to kind of rely on some domain knowledge as well and think about what should be important, things like shot distance, shot angle, and what might not be as important.
So I'm not including score state, because I can't think of a compelling reason why, after adjusting for everything else we already know about a shot, like the distance, angle, type, et cetera, it would matter whether you took that shot while up two or down three or whatever else. And fortunately, this has also been confirmed by previous expected goals research: researchers that have included score state have found that it's not really that important. So it's good to hear that confirmation.
You could also include shooters and goalies. And I would say that it is important to know who was the goalie for the shot, who was the person who was shooting the puck. But you know, once you incorporate them as variables, you have to make decisions about sort of model structures and especially what you're going to do about players who have really small samples. You know, if they've only shot the puck five times, you know, are we going to regress them to the mean? How are we going to do that, et cetera? So for the sake of simplicity, I'm going to evaluate shooters and goalies by their residuals in this model and limit it to players with enough of a sample size that we feel comfortable evaluating them.
And just really quick, some variable importance metrics from the XGBoost model. We can see shot distance is the most important one by far. And actually even the second most important is also just a component of shot distance. So really that is what matters, which I think makes a lot of sense. The type of shot and like whether it was a defender or forward matters a lot less. I think especially the fact of like position not mattering, probably what's happening is that's being incorporated in the shot distance variables because defenders tend to take shots much further away from the net.
Model evaluation
So I love fitting smooths to glance at calibration in addition to looking at sort of standard evaluation metrics. So here I have three models, the logistic model, my XGBoost model, and also the MoneyPuck data includes Peter Tanner's expected goals model as well, so we can compare that. And so the X axis is the fitted expected goal value. The Y axis is, I mean, it's the indicator of whether it was a goal. And then as we fit a smooth, it should be on line Y equals X if it's really well calibrated.
You can see that for the logistic model, you can barely see it because it's just kind of on Y equals X from zero to 25% basically, which is saying that like it's well calibrated, but obviously limited in a lot of ways. It doesn't have information to call any shots super dangerous. The XGBoost model does a bit better in that respect, although some of the really high fitted values are becoming goals like less than you'd expect. MoneyPuck also strays a little bit once you get to the really high values, but there honestly are much fewer shots in those categories.
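The calibration plot described above is just a smooth of the goal indicator against the fitted value, overlaid on y = x. A sketch on simulated values; because the outcomes here are drawn from the fitted probabilities, this "model" is perfectly calibrated by construction and the smooth should hug the dashed line.

```r
library(ggplot2)

set.seed(3)
xg <- rbeta(10000, 1, 12)      # mostly low-probability shots
goal <- rbinom(10000, 1, xg)   # outcomes drawn from the fitted values

calib <- ggplot(data.frame(xg, goal), aes(xg, goal)) +
  geom_smooth() +                                       # smooth of goal rate
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  coord_cartesian(xlim = c(0, 0.5), ylim = c(0, 0.5)) +
  labs(x = "Fitted expected goal value", y = "Observed goal rate")
```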
Some more model comparisons. So I always like to make the point that this is a problem with a lot of class imbalance. I could stand here and tell you, oh, my accuracy is 95% or whatever, and that would sound so good, but all I did was say that no shot will ever become a goal, and goals do happen sometimes. So we can look to things like log loss and compare it to the situation where we say all shots are equally dangerous. We can compare that to our logistic model with the one feature. We can compare that to the XGBoost model. And here I'm using the cross validation predictions.
So rest assured that these predictions have not looked at the data of that shot in question that is being predicted. And also, again, the MoneyPuck model, which is a nice benchmark for comparison as well. And you look at log loss, you look at area under the curve, you can see that basically what it comes down to is just distance alone will get you most of the way there. The XGBoost model with a few additional features gets you a little bit further. And then probably whatever Peter did for his model was a little bit better. But I think it's great that I was able to get close to his results in, frankly, like a day of work. So calling that a win.
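Both metrics are a few lines of base R. This sketch uses made-up outcomes and predictions, purely to show the comparison against the "every shot is equally dangerous" baseline:

```r
# Log loss: mean negative log-likelihood of the observed outcomes.
log_loss <- function(actual, predicted, eps = 1e-15) {
  p <- pmin(pmax(predicted, eps), 1 - eps)
  -mean(actual * log(p) + (1 - actual) * log(1 - p))
}

# AUC via the rank (Mann-Whitney) formulation.
auc <- function(actual, predicted) {
  r <- rank(predicted)
  n1 <- sum(actual == 1); n0 <- sum(actual == 0)
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up outcomes and model predictions, for illustration only.
actual   <- c(0, 0, 1, 0, 1, 0, 0, 0)
model_p  <- c(0.05, 0.10, 0.60, 0.02, 0.40, 0.20, 0.01, 0.08)
baseline <- rep(mean(actual), length(actual))  # all shots equally dangerous

log_loss(actual, baseline)  # the baseline to beat
log_loss(actual, model_p)   # lower is better
auc(actual, model_p)
```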
Evaluating shooters with unexpected goals
And then also to bring things full circle, we looked at Travis Konecny's goal earlier, and we knew that it wouldn't be right to assume that it had the same probability of scoring as the average shot, which would be about 5.7%. If we incorporate the shot distance and look at the logistic model, we would assume about a 10% chance of scoring. The XGBoost model goes a little bit higher, closer to 12%. And then MoneyPuck's model, I think, does the best job by saying it's 15%. And then, if we have additional features, maybe from tracking data or something like that, we might even say that it's more dangerous than that.
And lastly, because I said expected and unexpected goals, here's what I mean by unexpected goals. We can use this to evaluate shooters. What we do, essentially, and I have some nice tidyverse verbs there, is: for each shooter, over all the unblocked shots they have in this dataset, we calculate their expected shooting percentage based on our expected goals model, their actual shooting percentage based on the goals they scored, and the difference. And then once we filter down to enough shots that we don't have really wonky results, these are the top six for the timeframe that I looked at.
And so we can see Brett Connolly, Logan Couture, Burakovsky, Draisaitl, Kopitar, Panarin. These guys, conditioning on where they were shooting, the angle, et cetera, have scored at the highest percentage relative to that. And so that is interesting from a player evaluation standpoint.
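The tidyverse verbs behind that table boil down to a group_by/summarise on shot-level data. A minimal sketch with toy numbers (the column names are assumptions):

```r
library(dplyr)

# Toy shot-level data: one row per unblocked shot, with the model's xG.
shots <- tibble::tibble(
  shooter = c("A", "A", "A", "B", "B", "B"),
  xg      = c(0.10, 0.05, 0.15, 0.10, 0.05, 0.15),
  goal    = c(1,    0,    1,    0,    0,    1)
)

unexpected <- shots |>
  group_by(shooter) |>
  summarise(n_shots         = n(),
            expected_sh_pct = mean(xg),      # what the model predicted
            actual_sh_pct   = mean(goal),    # what actually happened
            diff            = actual_sh_pct - expected_sh_pct) |>
  arrange(desc(diff))
# With real data you'd also filter(n_shots >= some_minimum) before ranking.
```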
So with that, that's all I have for you guys. Thank you so much for listening. Thank you again to MoneyPuck and some of my other friends, Manny, Micah, Luke, and Josh, and all the other researchers I've read and cited. I'll be tweeting this out at @NNStats, so you can find me there. And I'll take a few questions if there's time.
Q&A
We do have time for one question. How is the information communicated to the coaches and players? Are there analytics that can be provided mid-game that leads to strategy changes?
This is actually something that I'm curious about myself. And frankly, I'll mention a little bit of my day job here. Because football is so discrete in its plays and stuff like that, there are definitely materials that we can provide saying, if you see this situation, this is what you should do. But frankly, I'm curious about it myself. In a hockey context, what are the sorts of adjustments that you can make in-game? A lot of the research that I've done has been more on the player evaluation side of things: just trying to build up the best roster and then kind of be like, all right, go do your best.
