Tom Bliss | R in Sports Analytics - Player Tracking Data & Big Data Bowl | RStudio
NFL Big Data Bowl & Analyzing Tracking Data
Presentation by Thompson Bliss

Analyzing Tracking Data: Sports tracking data (location-based x/y coordinates of every player and the ball) is a new source of information that allows researchers to perform more complex analyses than they could with traditional sports data. The Big Data Bowl is an NFL data competition designed to spur insights in Next Gen Stats, the league's player tracking data. This presentation will give background about the Big Data Bowl and how to analyze the NFL tracking data in R.

Speaker Bio: Thompson Bliss is a Data Scientist for the National Football League. He completed his master's degree in Data Science at Columbia University in the City of New York in December 2019. At Columbia, he worked as a graduate assistant for a Sports Analytics course taught by Professor Mark Broadie. He received a Bachelor of Science in Physics and Astronomy at the University of Wisconsin-Madison in 2018.

Agenda: Introduction to NFL Big Data Bowl; Tracking Data Overview & Demo; Q&A / Open Discussion
Transcript
This transcript was generated automatically and may contain errors.
Awesome. Well, I'm very excited to be here, and thanks to everybody for organizing this. I'll just go ahead and get started. So yeah, the focus today will be sports tracking data, specifically NFL tracking data and the Big Data Bowl. I listed my Twitter handle there, @DataWithBliss; feel free to contact me there with any follow-up questions. I also just tweeted out links to both the Kaggle competition, the Big Data Bowl, and the code that I'll be sharing today, so if you want to follow along as we go, it should be easy to do so.
What is sports tracking data?
So first of all, what is sports tracking data? Why is it important, and why is it interesting? Essentially, in a sport like the NFL, we don't just record the outcome of a play: in the NBA, maybe a made basket; in baseball, a hit; in hockey, a pass. We go a step further. Throughout the entirety of a given play, or series of plays, we're looking at the x, y coordinates of where the players actually are, and that can give us a lot more information.
And this is relatively new data. Other sports analytics data sources go back a bit further; for example, there's NFL play-by-play data going back to about 1999. But this tracking data, which gives more than just the outcome of the play, which actually gives the coordinates of every player throughout the play, is relatively new. Another thing to note is that it's not typically publicly available. The NBA has a little bit of it publicly available, but for the NFL, outside of the Big Data Bowl, there really isn't any publicly available source that I know of.
And the last thing I'll say before I move on is that there are two main types of tracking data. You can use video, which is what other leagues such as the NBA, the NHL, and soccer use: you take the broadcast video and convert its images into x, y coordinates. What the NFL does instead is use RFID chips, literal radio chips in the shoulder pads of every single player and in the ball, which relay to stationary receivers that constantly measure where the players are throughout the entirety of the play. So NFL tracking data is based on these radio coordinates.
What is the Big Data Bowl?
So next, what is the Big Data Bowl? It's an online, free-to-enter competition, and I encourage everybody who's interested to look at it after this and think about signing up. The focus is on the NFL tracking data I've been discussing. We give folks the data, and they have an opportunity to analyze it, create new metrics, visualize it, and do fun stuff. As the league, we get to see a lot of cool new content, and it also provides a pipeline for NFL team hires, and honestly other sports team hires too. It shows teams that you have the skills to work with this complex data, which a lot of teams are asking for. It's a good way to show off your skills, make new contacts, and create some cool stuff.
This is a very general timeline of a typical Big Data Bowl. It usually starts in September or October, and the submission deadline falls between November and January. Then there's judging, and the winners are announced. We're unsure what will happen this year with COVID, but in the past the winners have presented at the NFL Scouting Combine in February or March. Eventually, the work is put into production by teams and by the league, and it becomes useful. So yeah, currently we're in the middle of the 2022 Big Data Bowl, which launched in late September and runs through January, so it's definitely not too late to sign up if you're interested.
So here's a quick overview of what the Big Data Bowl competitions have looked like in the past. We've had three so far, and we're currently in the middle of our fourth. The first Big Data Bowl was fairly open-ended: we shared a bunch of tracking data and asked folks to look at things like route combinations, meaning how a receiver might run to try to get open, along with questions about rule changes and player speed. The second one, in 2020, focused on expected rushing yards; I don't know if anyone has seen the rush yards over expectation metric, but that's something that stemmed from the Big Data Bowl. And last year, we were looking at the secondary, the defensive players in coverage trying to prevent receivers from catching the ball. All three years have been very cool, a lot of very good work has come out of each of them, and it's been really fun so far.
And I guess the other thing I'd like to note is that each year we have more entrants and a lot more attendees; it's growing, and we're seeing NFL hires each year. The first year we had 11, then three, then seven. There's a cash prize that goes to the winner, and a lot of countries participate; for example, in 2020 there were 32 different countries. The reason 2020 had a bit more participation than the other years is that it was a traditional Kaggle-style competition, in the sense that there was one metric to predict, where in other years it was more open-ended and submission-based.
The 2022 Big Data Bowl
So my final slide in this brief overview is about the 2022 Big Data Bowl, the one that's going on today. Again, it runs September to January, and participants are asked to submit a Kaggle notebook that presents an analysis and results. The goal is to submit a paper that someone can read and learn something new from. The theme is NFL special teams plays, which includes kickoffs, field goals, and punts, as opposed to offensive and defensive plays. We're also including some external data, which comes from Pro Football Focus. This is the first year where we actually go across three years of data, which is pretty cool, and there's a lot of cool data that we're sharing. You can see it on Twitter, and we encourage people to check it out at kaggle.com/c/nfl-big-data-bowl-2022.
So with that, I'm going to share a quick animation of what this data looks like, and then I'll take a quick pause for questions after I show the animation, before I get into the code. Here we're going to look at some data that's included in the NFL Big Data Bowl 2022: tracking data from a kickoff play between the Bears and the Vikings. I'm going to show what the video looks like alongside what the actual play looks like in the data.
So as you can see, before the video starts we have some players lined up, and they align with the players lined up here. The players in white in the video, that's the Vikings, and they're white here too; the players in dark blue, the Bears, are over here. I'm going to play the animation and the video at the same time. This is just to give a sense of what we're really doing: we're taking the video, taking what actually happens in the game, and translating it to data, to x, y coordinates, so we can analyze players and plays.
So at the same time, you can see in the video the ball is moving; now the player has the ball and he's running up the field. The video is not working, sometimes this happens, but you can see the player with the ball is running towards the end zone, and you can see that in both the video and the animation. So this is just a way to analyze what's actually happening in the game, now that we have x, y coordinates. It's a good way to see how everything works, and it's efficient, easier than watching every play. So with that, I'll take a quick break for questions, and then afterwards I'll get into the code.
Thanks so much, Tom. I just took a quick look at Slido, and I don't see any specific questions on the Big Data Bowl yet. I'm going to check again, but feel free to raise your hand, and I can turn on the ability for people to unmute themselves as well.
And we can also save questions till the end as well. Yeah, that's fine. So if there aren't any questions... here's one: what is the sampling rate for the data? That's a great question. It's 10 times a second, so we do a reading every tenth of a second.
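For readers following along in R, that 10 Hz rate means you can approximate speed from successive x/y readings (the tracking data also ships an `s` column with speed already computed, so this is just illustrative; the frame values below are made up):

```r
library(dplyr)

# Three hypothetical frames for one player, 0.1 s apart (10 Hz sampling)
frames <- tibble(
  frameId = 1:3,
  x = c(60.0, 60.4, 60.9),  # yards
  y = c(26.6, 26.7, 26.9)
)

frames %>%
  arrange(frameId) %>%
  mutate(
    dist  = sqrt((x - lag(x))^2 + (y - lag(y))^2),  # yards moved between frames
    speed = dist / 0.1                               # yards per second at 10 Hz
  )
```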
Cool. All right. One other question was, when will the Data Bowl for 2023 begin? Yeah, I would assume September or October 2022, towards the beginning of the 2022 NFL season. But again, there's still plenty of time left for this Big Data Bowl. Maybe you'll be less busy during that later window, so that might make more sense for you, but if you're interested, I highly encourage you to still check this one out. There's plenty of time to get a submission in.
Is there a reason why the data isn't public? Yeah, the NFL doesn't want to share the data; it's proprietary. Part of it is that teams have access to it, and it just doesn't make sense to give it away to everybody outside of this event. So maybe I'm not the best person to answer that question, but yeah, we don't share it outside of the Big Data Bowl.
What is the error? I don't know if there's an RMSE I can give you, but certainly there are some plays with mistakes. Sometimes when the ball is punted and it's high up, it might lose a little bit of connection to the stationary receivers, so it might be a little harder to read exactly where it is. That's one source of error. Typically, though, most of the error-ridden plays are removed for the Big Data Bowl, so everything should be good. Maybe a better way to put it: I believe it's accurate to about six inches, certainly less than a foot of error typically. So it's pretty good in that regard.
Analyzing kick attempt offset from center
Awesome. There are a few other questions around specific packages that I think might be helpful to cover after the next part as well. Okay, yeah, that sounds good. So I'll jump into my code if no one else has any questions.
So yeah, I'm sharing this from Kaggle, and unfortunately I couldn't figure out a way to make it bigger, so hopefully this is readable for everybody. If it's not, I can zoom in, but I'm hoping everyone is able to read it. That looks good to me.
Okay, good. So in the first code sample, and this is something I've publicly shared before, we're going to use the Big Data Bowl 2022 data to analyze kick attempt offset from center, which is a metric we're creating, and one you can really only create with the tracking data. So the first question is, what exactly is this metric? As football fans know, a field goal kicker tries to kick the ball between the field goal posts.
As with kicking a soccer ball or shooting a hockey puck, assuming there's no goalie, you try to get it as centered as possible, because that increases your chance of making it. So what we're doing here is measuring how close to the center the ball is, as it crosses the uprights, for a typical field goal kicker. That's a way for us to say who's the most accurate, who's best at getting it right between the posts. There's maybe also an element of luck: one kicker might be making 90% of their field goals with a lot of them barely in, while another kicker makes 89% with all of theirs completely in. That suggests a little bit of luck is going on, and in the future we might expect the second kicker to do better. So this is a way to measure kicker skill by looking at that offset from center.
So we're going to load up some libraries: tidyverse, of course, gganimate since we're going to animate some things, and ggridges since we're going to make a ridge plot. Nothing too crazy in terms of packages to start. We're going to load in the plays data. You'll see plays data in every sport; it's a description of what's going on on a given play. Football is a very broken-up sport, in the sense that you're playing, then you're not, so each row represents a given play: a snap, and then what happens after the snap.
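The setup step he describes might look roughly like this in R; the file names assume the Kaggle Big Data Bowl 2022 layout, so treat them as placeholders rather than the exact code shown on screen:

```r
library(tidyverse)
library(gganimate)
library(ggridges)

# File names assume the Kaggle competition's data layout
plays   <- read_csv("plays.csv")    # gameId, playId, quarter, down, distance, ...
players <- read_csv("players.csv")  # nflId, displayName, birthDate, ...

head(players)
head(plays)
```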
Players, again, you're going to see that in every sport; every sport has players, and this is just who they are. If we look at the head of the players table, we have an ID, we have their name, and we have some other information. Really we're just going to care about the ID and the display name, which is their actual name. And this is the plays table: it has a game ID, a play ID, a quarter, down, distance, information like that.
Okay, cool. So now we're going to load in the tracking data, and the tracking data is very big, which makes sense: you're measuring every tenth of a second, and you have a row for each player on each tenth of a second of each play. There are 11 players per team plus the ball, so for each tenth of a second you have 23 rows, and typical plays last around four to six seconds, plus some time immediately before and after the play. So it grows really quickly, and that's why we have to load it in iteratively, because it's big data. So we load it in, look at the head, and as you can see we have a time, an x and a y coordinate, an NFL ID we can match to players, a game ID and play ID we can match to plays, and a frame ID, which is what frame it is, so you can match across different players on a given play. Some of the other variables we're not going to use here; we might use them in the next one. S is speed, A is acceleration. So this is what the data we're going to be working with looks like.
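A sketch of that iterative load, assuming one tracking file per season as in the 2022 competition (the file names are assumptions):

```r
# Read the season files one at a time and stack them; each row is one
# tracked object (player or ball) at one tenth of a second on one play
tracking <- purrr::map_df(
  c("tracking2018.csv", "tracking2019.csv", "tracking2020.csv"),
  readr::read_csv
)

head(tracking)  # time, x, y, s, a, nflId, frameId, gameId, playId, ...
```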
So next, we're going to think about cleaning the data. One thing to note, and I'm going to show you a map of what the data looks like, is that we have an x and a y. This is an NFL field: the x coordinate goes from end zone to end zone, and the y coordinate goes from one side of the field across to the other. So (0, 0) is the corner of the home end zone, while (120, 0) is the corner of the visitor end zone. A football field is 100 yards between the end zones, and each end zone is 10 yards deep.
One thing to note, and that's what we're about to do in the next part of the code: because teams change sides at the end of each quarter, sometimes the offense and the kicker will be kicking in one direction and sometimes in the other. As you can see, the x and y coordinates don't change with who has the ball or which direction they're going. So it's important to check the play direction, the direction the offense is going, or in this case the direction the team is kicking, and flip the coordinates so they're consistent. That way, if you're kicking a field goal, you're always kicking in the same direction in the coordinates.
So we're flipping it so that, no matter what, the team will be kicking this way. That's somewhat arbitrary; we could just as well say they're always kicking the other way. We just want to keep it consistent: we don't want a field goal kicked toward one end zone to look totally different from a field goal kicked toward the other end zone. We want to compare them apples to apples, since they're essentially the same.
So we do that flip. We take the tracking data and filter for when it's the football; the tracking data also contains players, but in this case we only care about the ball. Although we're interested in who's kicking it, we're really just curious about the trajectory, not what the kicker is doing as they kick. First, note that I'm grouping by game and play, so we're only looking at a given play in a given game, and arranging by game, play, and frame, so we're looking at things in order. Then we look at the x coordinate and filter for frames where x has passed the 120 mark but the lag of x, the previous frame, was before 120. Looking back at the map, that's right back here, where the upright is located. So we're filtering for the spot right when the ball crosses the back of the end zone, right when it would be crossing the uprights.
We're selecting the first occurrence just in case, maybe the ball rolls afterward, and we don't want any weirdness going on. Then we calculate the key metric, offset from center: we take the y variable, the cross-field coordinate, and compute the difference between it and the center, which we see as the goal, since an ideal kick is right down the middle, as they say. Then for each game and play we select only these variables, and we have the offset from center. An ideal offset from center would be zero; a bad one might be something like 10. The goal posts are about 18 and a half feet apart, and all these coordinates are in yards, the units of football, so the opening is a little over six yards wide. That means if your offset is above roughly 3.3 to 3.5 yards, you're probably missing the kick.
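Putting the flipping and filtering steps together, a sketch might look like the following. The column names (`playDirection`, `displayName`) follow the Big Data Bowl schema, but treat the details as an approximation of the code shown on screen rather than the exact script:

```r
ball_at_posts <- tracking %>%
  filter(displayName == "football") %>%          # keep only the ball
  mutate(
    # Flip so every kick travels toward x = 120, regardless of quarter
    x = if_else(playDirection == "left", 120 - x, x),
    y = if_else(playDirection == "left", 160 / 3 - y, y)
  ) %>%
  group_by(gameId, playId) %>%
  arrange(frameId, .by_group = TRUE) %>%
  filter(x >= 120, lag(x) < 120) %>%             # first frame past the back of the end zone
  slice(1) %>%                                   # guard against the ball re-crossing
  ungroup() %>%
  mutate(offset_from_center = abs(y - 160 / 6)) %>%  # field is 53.33 yd wide, center ~26.67
  select(gameId, playId, offset_from_center)
```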
So we're merging it to the plays table. We only look at plays with a field goal attempt, and we remove plays where the attempt was short. We look at the yards from the end zone so we have a sense of how far the kick was: an easy kick from close range should be treated differently than a kick from a very far distance. We calculate the kick length, which is the yards from the end zone plus 18, which accounts for the length of the end zone and some other factors. We join the players, so we have names, and we join the metric we computed. Now we keep only what we're interested in: the game ID, the play ID, the name of the kicker, how long the field goal attempt was, and the offset from center, which is the outcome variable.
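The joining step might be sketched like this; `specialTeamsPlayType`, `kickerId`, and `yards_from_endzone` are assumed names, with the last standing in for however the distance was derived from the yard line in the actual notebook:

```r
fg_offsets <- ball_at_posts %>%
  inner_join(plays, by = c("gameId", "playId")) %>%
  filter(specialTeamsPlayType == "Field Goal") %>%   # field goal attempts only
  mutate(kick_length = yards_from_endzone + 18) %>%  # +18 per the talk's definition
  inner_join(players, by = c("kickerId" = "nflId")) %>%
  select(gameId, playId, displayName, kick_length,
         offset_from_center, playDescription)
```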
So we know that kicking a 50-yard field goal versus kicking a 20-yard field goal, for those of you who know football, is very different in difficulty. So we want to make sure that when we're looking at this offset from center variable, we're being fair. This is what our final table looks like: a game ID, a play ID, a display name, a kick length, the offset from center, and a description of what happened.
Animating and visualizing the metric
Okay, cool. So before we get into visualizing and figuring out who's best in this metric, it's ideal to animate a play or two. We're going to use a field-plotting function from Marschall Furman's GitHub. We're selecting a play, specifically the one with the minimum offset from center, to give an example; this is one where it was a very nearly perfect kick, and we're joining that to the tracking data to pull that one play out. I'm not going to go too deep into all the colors; essentially I'm setting a lot of colors and widths to make sure the field looks good, and setting the plot title to the play description, the variable that says what happened. I'm going to animate using ggplot and gg_field, and I'm adding a segment for the uprights. We'll animate each frame, and it will look similar to what we saw right before the first break: we add some points and animate the players and the ball as the field goal goes in, using geom_point plus some colors and some text for the jersey numbers.
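A sketch of that animation step, assuming `gg_field()` has been sourced from that GitHub repo and that the `team`, `jerseyNumber`, and `playDescription` columns follow the Big Data Bowl schema:

```r
# Pick the kick with the smallest offset from center
best_play <- fg_offsets %>% slice_min(offset_from_center, n = 1)

one_play <- tracking %>%
  semi_join(best_play, by = c("gameId", "playId"))

anim <- ggplot(one_play, aes(x = x, y = y)) +
  gg_field() +                                  # field background, sourced separately
  annotate("segment", x = 120, xend = 120,      # uprights, about 18.5 ft wide
           y = 160 / 6 - 3.08, yend = 160 / 6 + 3.08,
           colour = "yellow", size = 2) +
  geom_point(aes(colour = team), size = 4) +
  geom_text(aes(label = jerseyNumber), colour = "white", size = 2.5) +
  labs(title = best_play$playDescription) +
  transition_time(frameId)                      # one animation frame per tracking frame

animate(anim, fps = 10)   # play back at the 10 Hz sampling rate
```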
So this is what it looks like. This was the kick with the lowest offset from center, the best in the metric, and it's a 50-yard field goal; we're looking at 50-yard field goals only here. As you can see, if you look right there, it's perfectly down the middle. This is about as in between the uprights as you can get, so that's what we see as an ideal kick. Now we can animate another one, but instead look at the maximum offset from center among 50-yard kicks. This one might be the worst in terms of being able to kick it between the uprights.
So as you can see, this one wasn't really that close. It was a couple yards outside the uprights, it wasn't made, and compared to the one that was perfectly down the middle, this was not a good kick. Another way to think about it: if a kicker is going to miss, you probably feel a little more promise about their performance if they miss just barely outside than if they miss by a decent margin like this one.
Okay, cool. So we're going to visualize this metric by kicker. As we saw before, we were able to merge in who's kicking on a given play, so we have a sense of who's doing well in this metric. For this, we'll just look at all kicks between 30 and 40 yards, because that's where the biggest sample size is. And again, we don't want results to vary by kick length, because maybe some kickers are just taking easier field goals, and we don't want to give them the benefit of the doubt there. So we group by the display name, which is the name of the kicker, and limit to kickers with 75-plus attempts: we don't want a kicker who had one perfect attempt showing up as the best when we know he might not be as good with a bigger sample size. So we filter for a sample size of 75 or more, calculate the average offset from center, and then do a ggplot bar plot, setting the theme, labels, and other details.
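The summarizing and bar plot step, sketched under the same assumed column names:

```r
kicker_avgs <- fg_offsets %>%
  filter(between(kick_length, 30, 40)) %>%   # the range with the biggest sample
  group_by(displayName) %>%
  filter(n() >= 75) %>%                      # drop small-sample kickers
  summarise(avg_offset = mean(offset_from_center)) %>%
  arrange(avg_offset)

ggplot(kicker_avgs,
       aes(x = reorder(displayName, -avg_offset), y = avg_offset)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Average offset from center (yards)",
       title = "Kick attempt offset from center, 30-40 yard field goals")
```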
And now look at this: we have a plot of each kicker's average offset from center, ordered from the best kicker to the worst. For those of you who know NFL football, it's probably not a surprise that Justin Tucker comes out as the best kicker. He's universally seen as probably the best kicker today, and certainly over 2018 to 2020; I think a lot of people would say he's the best kicker over that time. The worst in this metric is Adam Vinatieri, although he historically was a very good kicker. His 2018 and 2019 seasons, I think his last two, were probably not his best, and that's why he might not rank as high in this metric as some others.
So again, we can compare. On average they're all making them, and that makes sense, because a typical kicker today is probably making 80 to 90 percent. If the average were greater than about three yards, maybe we could say they're missing on average, but on average it's between the uprights; they just vary in how close to right down the middle they are. Tucker, unsurprisingly, is the best, and by a wide margin, which also agrees with intuition, and then you see a lot of kickers clustered in the middle.
We're going to take this exact same data and look at it as a ridge plot, a series of density plots. That gives us a little more information than a bar, which is just one average. Maybe some kickers are wild, sometimes very good and sometimes very bad, so they look average; maybe other kickers are consistently in a certain range. A density plot gives us a sense of whether a kicker is consistently where he's at, or sometimes right on the money and sometimes well off. So we're going to use geom_density_ridges from the ggridges package to plot this, ordered the exact same way by smallest average. Tucker, we can see, is fairly consistent. Adam Vinatieri probably has one of the flattest distributions: sometimes he was really good, sometimes he was really bad, and that agrees with what we saw. In 2019 he wasn't very good at all; in 2018 he was decent.
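And the ridge plot version, reusing the same filtered data (again a sketch, with the same assumed column names):

```r
fg_offsets %>%
  filter(between(kick_length, 30, 40)) %>%
  group_by(displayName) %>%
  filter(n() >= 75) %>%
  ungroup() %>%
  mutate(displayName = forcats::fct_reorder(displayName,
                                            offset_from_center,
                                            .desc = TRUE)) %>%  # order by the metric
  ggplot(aes(x = offset_from_center, y = displayName)) +
  ggridges::geom_density_ridges() +
  labs(x = "Offset from center (yards)", y = NULL)
```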
Q&A: packages, insights, and the z-coordinate
So the second example I might go through a little more quickly, but maybe I'll stop for questions here; people can also ask at the end again. So does anyone have any questions?
I can start by reading a few from Slido. Just a reminder: if you're putting questions in the Zoom chat, I love to see them there too, but if you don't mind copying them over to Slido, I want to make sure I don't miss any. One of the questions is: what are some preferred packages for dealing with x/y data? Yeah, I typically just use the tidyverse, but I definitely know there are some good ones. I know Ben Baldwin has a package, and I can't think of its name off the top of my head, that does some data cleaning for Next Gen Stats data such as what you might see in the Big Data Bowl. That's something I would suggest to start. But there are a lot of different ways to tackle it; it really depends on what you're doing, and there are some spatial packages that sometimes make sense for specific tasks, but typically I just use the tidyverse. I'd also recommend an animation package such as gganimate, because with this data it's hard to even know what's going on. I like to use View() sometimes just to look at what data I'm dealing with, and even that can be really difficult because there's so much going on, so animating the data goes leaps and bounds in terms of understanding what's happening.
Awesome. One other question is: what insights are you most interested in with this year's competition? You know, that's a great question. Obviously the theme is special teams. The advice I would give isn't about any specific insight, but if you're going to participate, rather than trying to solve ten problems at once, here's a metric for punters on punt plays, and here's another metric for field goal kickers, and here's another for kickoffs, a bunch of metrics for a bunch of different things, I would highly suggest looking at one specific position on a specific play type and really digging deep.
Definitely, teams and analytics staffers from teams are going to be the judges, and they gain a lot more from something really in-depth that solves one issue than from trying to solve all of special teams with tracking data in one submission. So whatever you do, and there are plenty of things you can look at with punting and kicking, even something that seems simple can go leaps and bounds if it's done well. Just don't try to tackle every single problem; find something you're interested in and that you can do well, and tackle that really, really well.
Tom, there's been quite a bit of discussion in the chat about the z-coordinate, especially as it relates to the innovation that you created. Can you talk a little bit about what's available and what isn't for that? Yeah. So unfortunately, we don't currently measure the z-coordinate with our tracking data, but I would say there are definitely ways to figure it out. You can look at the event data, where events are tagged like field goal kick, field goal make, field goal miss, or kickoff land, and you can use the PFF hang time data to get a sense of when the ball landed. So there are certainly ways to infer the z-coordinate: you know the speed, you know how long it was in the air, you know where it landed. And I'm a physics person, so I like physics a lot; you can get a sense of the z-coordinate using kinematics. I think there has been some work on that in previous Big Data Bowls, but definitely that's something that could be inferred using some of the other variables. Unfortunately, it's not measured on-site.
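As a rough illustration of that kinematics idea (not anything from the talk's code): ignoring drag and spin, and assuming the ball lands at about the height it left from, hang time alone pins down the peak height of the kick:

```r
# Peak height of a kick from its hang time, via constant-acceleration kinematics:
# the ball spends half the hang time rising, and h = g * t_up^2 / 2
peak_height_yards <- function(hang_time_s) {
  g    <- 9.81                 # gravitational acceleration, m/s^2
  t_up <- hang_time_s / 2      # time to the apex
  h_m  <- 0.5 * g * t_up^2     # apex height in meters
  h_m * 1.09361                # convert meters to yards
}

peak_height_yards(4.5)  # a 4.5 s punt peaks around 27 yards up
```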
Thank you. I see a few people have put their name when they ask a question on Slido, so I'd love to turn the mic over to you to have you ask those, if that's okay. Amelia, I see you asked a question about different factors pertaining to the kicker. Would you like to ask that one live? Sure. Hi, I was just curious how you thought about using age as a factor in how these kickers perform? You know, not in this specific analysis, but that's definitely something that could be explored. And although it's only three years, which maybe isn't a huge sample size of years, this is the first time we have multiple years of data, so maybe that is something to look at. But yeah, definitely something that you should think about exploring. And you have the birth dates from the players data, so you should feel free to explore that.
Awesome. Brian Filker, I see you had another question as well, if I could pass the mic over to you. Yeah, of course. Yeah, Tom, I was kind of curious about factors such as, you know, wind direction and speed. Are those the kind of things you try to find uniformity in when setting the kick direction, given different stadium constructions? I always think of how Phil Dawson mentioned that with the Cleveland Browns, they had a special flag installed so he could see the wind direction and wind speed and factor them into his kicks. Is that taken into consideration at all? Yeah, so again, not for this analysis, and this was sort of built to get people started. But definitely, if you're going to look at that and you're interested in weather stuff, go for it. I posted on the Kaggle website that external data for this competition is totally fine to use as long as it's publicly available to everybody for free. So I would highly encourage you to seek out weather data. I posted some on the Kaggle site and tweeted some, and it's also on my GitHub, but there are plenty of other sources you can use for weather data. So feel free to use it.
Modeling player speed on kickoff plays
Perfect. Perfect. Yeah. So I'm probably just going to skip the cleaning and the reading of the data for this second demo, but I'll briefly go over what the metric is about and talk about the modeling technique. So for the other demo that we put together, we looked at speed on kickoff plays. On the kickoff, the special teamers sprint down the field after the ball is kicked off toward the end zone and try to make a tackle, running as fast as possible to prevent the return team from gaining yards. So for those of you who know football, you know the kickoff.
So essentially, when the kicker kicks it off, all the other members of the kickoff team are sprinting as fast as possible. One thing we can look at is their speed: how fast do they go on these plays? I'm going to skip the cleaning and the reading, although those are very important, and just talk about a type of model that we like to use at the league, which also applies here. So, assuming the data has already been cleaned, what we have is a game ID, a play ID, and an NFL ID, so a given player on the kickoff team, and their max speed during the play, and we're going to try to model that. We also have surface, which we merged in, as part of it.
So we're going to create a mixed effects model. We use a mixed effects model to account for the lack of independence among observations: we assume that each player is drawn from the same population of players, and that no player is going to be drastically faster than another. So we assume each player comes from a distribution, and as they sprint more, as we get more observations, we have more of a posterior and can estimate where that player might fall. If they have few observations, we assume they're probably close to the overall league mean in terms of speed. So we're just modeling the max speed, taking into account the NFL ID and the surface.
We assume players are a random effect, meaning if we have a lot of information about a player, we're going to believe what that information says; otherwise, if they have one observation, we're going to assume they're probably closer to the mean. That's just to prevent a player with one really good observation from being rated leaps and bounds above everyone else. And if we were to use a dummy variable or a factor variable for each NFL ID, we'd probably overfit and have a ton of variables. So this is just a way to have a value for each NFL ID without overfitting.
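A minimal sketch of the model described here, assuming the lme4 package and illustrative column names (`nflId`, `surface`, `maxSpeed`); the data below are simulated stand-ins, not the real tracking data:

```r
library(lme4)

# Toy stand-in for the cleaned kickoff data described above
# (IDs, speeds, and the surface bump are all made up)
set.seed(42)
n_players  <- 30
true_speed <- rnorm(n_players, mean = 8, sd = 0.5)  # each player's "true" top speed
player     <- sample(seq_len(n_players), 500, replace = TRUE)
surface    <- sample(c("grass", "turf"), 500, replace = TRUE)

kickoff_speeds <- data.frame(
  nflId    = factor(player),
  surface  = factor(surface),
  maxSpeed = true_speed[player] + ifelse(surface == "turf", 0.1, 0) +
             rnorm(500, sd = 0.3)  # yards/second, with per-play noise
)

# Fixed effect for surface, random intercept per player: players with
# few observations are shrunk toward the overall league mean
speed_model <- lmer(maxSpeed ~ surface + (1 | nflId), data = kickoff_speeds)
summary(speed_model)
```

The shrinkage comes from the `(1 | nflId)` term: a player's intercept is pulled toward the league mean in proportion to how little data we have on them, which is exactly the behavior described above.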
So we're going to look at the summary of the model. We have a fixed effect, which is the surface. That's a pretty typical thing; it's treated like a normal factor-type variable. And then we have the NFL ID, which gets a random intercept. Again, we're assuming each player is from a population where all the players are more or less the same and will have more or less the same speed. We can plot the results of the model, the player effects. And again, this is just to take into account the fact that each player's max speed is observed across different surfaces.
Each player will have an effect based on how they've done across different surfaces, but this modeling technique takes each surface into account. So perhaps it's really hard to run fast on grass; players that run on grass will get an adjustment to their coefficient. It's like any other model, where the variables take each other into account in order to produce values that make sense. So, after taking the surface into account and figuring out the intercept for each NFL ID in terms of max speed, we have player effects that can be thought of as the plus or minus you might expect compared to the average player on the average surface.
So the interpretation is that Matt Cole will typically be about 1.7 yards per second faster than the average kickoff coverage player on the average surface. This is the top 25 of all players by this effect. The Y axis is player name; the X axis is that plus-or-minus player effect. Matt Cole is the fastest; Devin Duvernay is the slowest of the top 25, but he's still in the top 25. So this is just a way for us to take surface into account and really measure how well we expect these players to do compared to the average player.
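The ranking just described can be sketched with lme4's `ranef()` (again with toy data and made-up IDs; the real demo's column names may differ):

```r
library(lme4)

# Compact toy fit (made-up data) just to show the extraction step
set.seed(7)
ids <- sample(1:60, 1500, replace = TRUE)
d <- data.frame(
  nflId    = factor(ids),
  surface  = factor(sample(c("grass", "turf"), 1500, replace = TRUE)),
  maxSpeed = rnorm(60, 8, 0.6)[ids] + rnorm(1500, sd = 0.3)
)
m <- lmer(maxSpeed ~ surface + (1 | nflId), data = d)

# ranef() returns the per-player random intercepts: the plus/minus
# yards-per-second versus the average player on the average surface
re <- ranef(m)$nflId
player_effects <- data.frame(
  nflId  = rownames(re),
  effect = re[["(Intercept)"]]
)

# Top 25 by effect, like the plot described above
top25 <- head(player_effects[order(-player_effects$effect), ], 25)
```

In the real data you would join `top25` back to the players file on `nflId` to get display names for the plot.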
This is what the entire population of these player effects looks like. It's a left-skewed distribution with mean zero and a standard deviation around one, or actually maybe more like 0.5. And the distribution gives us information on what the spread looks like: you can see that the difference between the best player and the worst player isn't necessarily that large. It's about two yards per second.
The last thing that I'm going to touch on in this part of the presentation is one thing that I would highly suggest people do if they're planning on participating in the Big Data Bowl: analyze model stability. Make sure that, okay, if this model predicts well for the first season, how well will it do in the second season? Or if it predicts well for the beginning of the season, how well will it do at the end of the season? It's a way to make sure that your results make sense and are picking up on something real. So one thing we can do here is run the exact same process as before to get the player effects, but fit one model using only weeks one to eight and another model using only weeks nine to seventeen. For each, we get the NFL ID, the display name, and the player effect. So we have two different models, each using a subset of the data, and we can merge them together and compare them on an x-y plot.
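A sketch of that stability check, with simulated data standing in for the real tracking data (the `week` column and all names are illustrative):

```r
library(lme4)

# Toy data with a week column (values made up), so we can fit the same
# model on the two halves of the season and compare player effects
set.seed(11)
true_eff <- rnorm(60, 0, 0.5)  # each player's "true" effect
ids <- sample(1:60, 3000, replace = TRUE)
d <- data.frame(
  nflId    = factor(ids),
  surface  = factor(sample(c("grass", "turf"), 3000, replace = TRUE)),
  week     = sample(1:17, 3000, replace = TRUE),
  maxSpeed = 8 + true_eff[ids] + rnorm(3000, sd = 0.3)
)

# Fit the model on a subset and return the per-player effects
get_effects <- function(data) {
  data$nflId <- droplevels(data$nflId)
  m <- lmer(maxSpeed ~ surface + (1 | nflId), data = data)
  re <- ranef(m)$nflId
  data.frame(nflId = rownames(re), effect = re[["(Intercept)"]])
}

first_half  <- get_effects(subset(d, week <= 8))
second_half <- get_effects(subset(d, week >= 9))

# Merge the two sets of player effects and compare them on an x-y plot
both <- merge(first_half, second_half, by = "nflId",
              suffixes = c("_wk1to8", "_wk9to17"))
plot(both$effect_wk1to8, both$effect_wk9to17)
abline(0, 1, col = "red")  # the y = x "perfectly stable" line

# One-number stability summary: squared correlation between the halves
cor(both$effect_wk1to8, both$effect_wk9to17)^2
```

If the effects cluster tightly around the y = x line and the squared correlation is high, the metric is measuring something persistent about players rather than noise.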
This red line is y equals x, so if everything falls perfectly on that line, that means the two halves agree perfectly. Obviously, Tre Flowers here at the bottom seems a little bit off, but since we're seeing a pretty good fit to the line, we feel pretty confident. And if we remove the outliers, we feel a little bit better.
Yeah, so then we can look at the r-squared. There are two things I would say to take away from this part: one, analyzing stability is a good way to make sure your model is doing what you hope it does; and two, there are a lot of different modeling techniques to consider here. Take into account that if you're going to do something with a bunch of players, fitting a model where 300 different players each have their own variable can be difficult, especially because each of them will have a different number of observations. They'll play different numbers of plays, have different roles, and be on different teams, so really thinking about that stuff is important.
Final Q&A
So yeah, I'll pause again. I understand I probably went through this section a little bit fast, so hopefully people were able to catch stuff, but I'll pause and answer some questions. I can answer until 1:15. Hopefully I won't get kicked out of the room that I'm in, and if I do, I'll just have to turn off video for a sec and jump back in to answer questions.
I see there's an anonymous question that I think you touched upon already: is environmental data included or allowed in this Big Data Bowl? Yeah, it's not necessarily included with what we're providing, but it's certainly allowed as long as it's publicly available and free to all. So it's definitely allowed if you can find something that you want to use.
Awesome. Josh Goldberg, I see you asked a question in Slido around peak age. Would you want to ask that question live, or add any context? Sure. I think Amelia kind of had a similar question before, but I was just curious if you had looked into the distribution of that metric for players over the course of their careers, whether there's a certain peak age where players have the lowest absolute value there, kind of what that looks like. Yeah, no, I have not, but I would say that's a great Big Data Bowl project if somebody wants to take what I was doing there a step further. The data we provided was 2018 to 2020, but honestly that's essentially all the data that we at the league have, because that's when it really started being recorded. The coolest thing about this data is answering questions like that: because it's so new, you might be one of the first people, if not the first person, to answer that question. So no, I didn't look at it in this analysis, but I would highly recommend anyone who's interested in the Big Data Bowl look at that, because I think it'd be a cool project.
Let's see, going over to Slido: for the x-y coordinates, how does it work with sidelines? And Zach, I see you've actually asked this question, if you want to jump in and add some context. Right, yeah, I was mainly looking at kick returns and punt returns, and of course, quite often they run it over to the sideline. I see that it starts with zero and zero on the home sideline; I presume it just goes into negative coordinates? Yeah, that's exactly right. So if you're the returner and you run out, if this mouse is the returner and you're running out right here, you're probably going to be around negative one, negative two. And if you go out on the other side, you're going to be above 53 and a third, which is where the field ends. If you're out the back of your own end zone, you'll be negative, and if you're past the far end zone, you'll be above 120. Perfect, thank you very much.
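A tiny helper reflecting the coordinate convention just described (the field runs 0 to 120 yards in x including both end zones, and 0 to 53 1/3 yards in y; the function name is just for illustration):

```r
# Field dimensions in the tracking data's coordinate system:
# x runs 0-120 yards (including both end zones), y runs 0 to 53 1/3
# yards; out-of-bounds positions go negative or past the maximum.
field_length <- 120
field_width  <- 53 + 1/3

in_bounds <- function(x, y) {
  x >= 0 & x <= field_length & y >= 0 & y <= field_width
}

in_bounds(60, 26.7)   # midfield: TRUE
in_bounds(60, -1.5)   # a returner pushed out past the home sideline: FALSE
```

Because the comparisons are vectorized, the same function can flag an entire column of tracking frames at once.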
Yeah, and I tweeted out the code that I shared, as well as the link to the Big Data Bowl, from my Twitter, @datawithbliss, so feel free to go on there; all those links are available. You can also just google Kaggle Big Data Bowl 2022, and all of this will show up.
Awesome. There's a broad question as well, which I think is great to ask all of our speakers: what is the best advice or courses for someone who has little understanding of R to develop better skills? Yeah, so I would say a lot of how I learned R was just learning by doing. Hopefully that helps; maybe it doesn't answer the question completely, because I don't know if I have a specific tutorial that I leaned on. But essentially, the first thing I ever did in R was a class I took in undergrad, where you could do a project on whatever you wanted, and I just said, oh, I'm interested in basketball, I'm going to do something with that. If you do something that you're interested in with R, the more you do it, the easier it's going to be to keep at it, because you're going to be interested in the results and in the project, and by the end, you're going to get a little bit better. If you keep doing stuff like that, that's interesting to you, eventually you're going to get very comfortable with it, and it's not easy