Tan Ho | Project Immortality: Using GitHub To Make Your Work Live Forever | Posit (2022)
If you've invested a lot of time and energy in a data science project, you might be ready to move on to new and exciting things. Don't let your old projects wither away and die! There are some powerful and free resources from GitHub that you can leverage to help pay it forward to the next person looking to use your work. In this talk, I'll showcase how you can transform ordinary R scripts into self-sufficient, robust projects by converting your code into a package, adding some GitHub Actions, and storing data in GitHub Releases. This will help make your projects more useful - now and long after you've stopped working on them! Talk materials are available at https://github.com/tanho63/project_immortality_with_github/ Session: Generating high quality data
Transcript
This transcript was generated automatically and may contain errors.
My talk today is going to be about Project Immortality: using GitHub to make your work live forever. You can follow me on Twitter at underscore Tan Ho. The slides are themselves a GitHub repository, and they're uploaded there.
So I'd like to introduce you to the life and times of a data science project, okay? This might sound familiar to most of you, certainly the story of most of my projects if not all of them. You start by importing some data, okay, whether that's from scraping, APIs, a database, pre-existing data, stuff that your company has, whatever it is, okay? You start by importing some data. Then you get to know this data. You do some wrangling, you do some feature engineering, you start doing some exploratory data analysis, trying to understand how messy the data is and how bad your data is, shout out to Jim.
And then once you're done with that - once you understand the data a little bit from a human perspective - you start trying to teach your computer to do something with it. You try to model: you try to get some clustering done, or regressions, or try to predict a number. And then you study the results of that output, you plot it, and then you tweet the plot, okay? So, you know, you share it with the world: you've learned something, here's what I've learned, get some feedback on it, validation, 15 minutes of fame, hopefully.
Hopefully everyone likes it, you get some feedback, and then most projects die here, right? And rightfully so, because you've done what you set out to do: you've learned something from that project and you're ready to move on.
The problem of abandoned projects
Sometime later, this happens. Relevant XKCD, as ever: never have I ever felt so close to another soul, and yet so helplessly alone, as when I Googled this problem and there's one result - a thread by someone who studied the exact same problem I'm interested in, last posted to in 2020. (2020 feels like 2003.) Who were you, DenverCoder9? Who were you, Tan? What did you see? What did you learn?
This story happens a lot, right? Whether it's a Stack Overflow answer, a problem on GitHub - sometimes now you'll Google and you'll find a Twitter thread. How do you help this person? Actually, I'm not that interested in how you help that person, because the answer is: they'll reach out to you, and if you're available, you'll spend the time and energy and give it back to them - if you have time, right? What I'm actually more interested in is: how can your project help this person? Because at the end of the day, you being available isn't always possible. Your time has limits, your energy has limits. Your project should be able to help them get what they need from it and move on.
So my question - my talk today - is going to be about how you can help your project help this person. And I think that GitHub has a bunch of resources that can really help you along this path.
So who is this talk for? It's for people who are doing personal hobby projects, people who are doing academic research, and people who are interested in public and open source work. More broadly, it's for people who don't have a budget for their work, and people who are interested in helping others use their work and helping their projects live on.
So why do I care so much about this? This is me - this is literally my profile picture on every single Twitter and Twitch and everything else about me. In a past life, I was a property manager: grew up in the family business, absolutely hated that job, okay? So on the side, I started doing fantasy football, started analyzing it, started teaching myself some data analysis to go with it. I started with Excel and Power Query, eventually learned R, and taught myself football analysis - NFL, fantasy football, et cetera. Eventually that led to a data science career: I got a job at a home building company doing data science and programming, and recently I've accepted a position with Zelus Analytics to work on pro soccer. So you can make that career switch, if you're interested, by doing this sort of thing.
And today, along the way, I've become a maintainer of public NFL data: I maintain the nflverse and ffverse R packages for NFL and fantasy football, respectively. So why do I care so much about this? I've been the hobbyist programmer. I've been the person who's broke and wants to do stuff without a budget. And I care, even today, about making sure that people can build on my work and can use my work in their own projects.
FF Opportunity: a case study
So let's talk about one of my projects, FF Opportunity. It's a project that uses nflverse play-by-play data to study expected fantasy points. Now, I'm not going to go into the details about the NFL or expected fantasy points - you don't need to know anything about that - but if you're interested, expected fantasy points help measure the value of play opportunities in fantasy football. And you can find my project at github.com/ffverse/ffopportunity.
So this might sound familiar: I start by importing some nflverse play-by-play data, then I wrangle some features together, train an XGBoost model, and then use that model to predict fantasy points, right? This is the story of every machine learning project - every data science project, really. Now what? How can I make FF Opportunity live on?
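The pipeline just described has a familiar shape. As a rough sketch only - not the actual FF Opportunity code, and with a deliberately tiny, made-up feature set - it might look something like this (assumes an internet connection and the real `nflreadr` and `xgboost` packages):

```r
library(nflreadr)
library(xgboost)

# import: nflverse play-by-play data for one season
pbp <- nflreadr::load_pbp(2021)

# wrangle: a hypothetical, deliberately tiny feature set for illustration
features <- data.frame(
  yardline = pbp$yardline_100,
  down     = pbp$down,
  ydstogo  = pbp$ydstogo
)
target <- pbp$yards_gained

keep   <- stats::complete.cases(features, target)
dtrain <- xgboost::xgb.DMatrix(as.matrix(features[keep, ]), label = target[keep])

# model: train a basic XGBoost regression
model <- xgboost::xgb.train(
  params  = list(objective = "reg:squarederror"),
  data    = dtrain,
  nrounds = 50
)

# predict: apply the model back to the features
preds <- predict(model, as.matrix(features[keep, ]))
```

The real project's features, target, and tuning are all more involved; the point is only that the import, wrangle, model, predict structure is the thing the rest of the talk is about keeping alive.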
I'm actually going to frame this question as: what is the person who stumbles on my project later interested in, and what kinds of questions are they going to be asking? First, how can they use and improve on my model? No model is perfect, but models are useful - and understanding the past work in a field helps you improve on it: where the gaps are, how to use it, how to improve it, that kind of thing. Second, can new predictions be automated every week? Maybe they don't care about the model itself, but they're interested in the predictions - what the model says about games that haven't happened yet and games that are going to happen soon. There's a new slate of NFL games every week from September through to January or February, so can you automate new predictions every week of the season? Lastly, where can they find these predictions? Where can I put them so people can access the predictions without having to ask me to email or tweet or otherwise send them? It's no good if it all lives on my computer, right?
Making your model reusable: R package infrastructure
So let's go through these questions one by one. Can someone use and improve on my model? There's a lot of work in the R space now in terms of reproducibility: renv is a great option, Docker is a great option. The one I actually like to use is making it a package - and I'm not talking about making it for CRAN, but there are two elements of packages that I think are really underrated and help build things going forward. One of them is adding a DESCRIPTION file, and the other is wrapping code into functions.
One of the things adding a DESCRIPTION file does is force you - or ask you - to add a license. And a license is really important, especially when you're doing public work, because it tells people what you're okay with them doing with your work, right? If you're okay with them doing anything with it, say so: add an MIT license, that's basically what it's there for. If you only want them to use this work to extend other public work, you can use a license like GPLv3. So just choosing the license helps communicate how you want them to use the work, and by sharing the work under that license, people can understand what you're okay with them doing with it.
The other thing a DESCRIPTION file does is make installation of dependencies really easy, because you're forced to list all the dependencies - and if you're also able to list the minimum versions, it makes tools like remotes or pak or pacman work, because they can read the DESCRIPTION file from GitHub and install all the dependencies for you while installing the package. Even if the package doesn't have any functions in it, it'll still install the dependencies for you, and then you can run the script after you've cloned it from GitHub. Of course, usethis has this covered, so you don't actually need to type out a DESCRIPTION file by hand. There are four functions you need to know to do exactly what I'm talking about: use_description(), use_mit_license(), use_package() for each package that you're using, and - once you've added all the packages - use_latest_dependencies(), which version-locks to whatever's on your machine. It'll put those in as the minimum versions, and then you've set this up for success.
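In code, those four usethis calls look like this (run once, from inside your project; the `dplyr` dependency is just an example):

```r
# One-time project setup with usethis:
usethis::use_description()          # create a DESCRIPTION file
usethis::use_mit_license()          # declare what others may do with the work
usethis::use_package("dplyr")       # record a dependency (repeat per package)
usethis::use_latest_dependencies()  # pin minimum versions to what's installed
```

Once that's pushed, someone else can install everything in one step with something like `remotes::install_github("yourname/yourrepo")` (repo name hypothetical), and the dependency installation comes along for free.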
The other thing I like to do - this is kind of the next level - is to wrap logic into functions. It makes things really easy for users. From your end, what you need to do is convert the hard-coded variables and data frames into arguments. Extract Function is a cool feature in RStudio: if you select the entire code chunk and press the shortcut - Ctrl+Alt+X, or on a Mac I think it's Cmd+Option+X, I don't use one - it'll automagically create a function, put the skeleton of your code inside it, and turn all the variables into arguments. So that already basically does it for you. The other thing you can do is add some usage notes - and I say usage notes rather than documentation, because documentation scares people - but really explaining how you use the function and what it's meant to do will help other people understand where you're going with it. The goal is to make it easy for users to run, right? R users love functions: they understand exactly what to do when you give them a function - you put arguments into it, and you go from there.
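As a small illustration of that hard-coded-variables-to-arguments step (the function and helper names here are hypothetical, not from the real project):

```r
# Before: a script with values baked in
#   season <- 2021
#   pbp    <- nflreadr::load_pbp(season)
#   preds  <- predict(model, build_features(pbp))

# After: the same logic wrapped into a function, with the hard-coded
# variables promoted to arguments (build_features() is a stand-in helper)
predict_expected_points <- function(season = 2021, model) {
  pbp      <- nflreadr::load_pbp(season)
  features <- build_features(pbp)
  predict(model, features)
}
```

A user now knows exactly how to run it for a different season - `predict_expected_points(season = 2022, model = my_model)` - without reading or editing the script's internals.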
Automating predictions with GitHub Actions
Okay. So we talked about packages. That's the bare minimum - there's a whole bunch more on packages, there's a whole book on it - but from my end of things, that's the bare minimum. The next thing I'm going to talk about is: can you automate predictions every week? Because the users who don't care about the model itself are really only interested in its output. So can you automate giving them predictions every single week? The answer is yes, of course - with GitHub Actions.
This is literally the 20-line file that automates the FF Opportunity predictions, and it uses the DESCRIPTION file we just talked about. It runs every night after every game - that's 4 or 5 a.m. on Monday, Tuesday, and Friday, from September to January. It installs R on the virtual machine, installs the packages from the dependencies I listed in my DESCRIPTION, and then it runs a script. And that's it. That's all you need to do. Obviously there's a whole bunch of magic that happens on the back end here, but from your perspective as someone automating your projects, you're leaning on a bunch of work by the r-lib/actions team that's out there and ready for you to use, right? And GitHub Actions is free: if you've got public projects, there are no Actions limits.
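A workflow of that shape might look roughly like this - a sketch, not the actual FF Opportunity file; the schedule, job name, and script path are illustrative, while the `r-lib/actions` steps are the real reusable actions being described:

```yaml
# .github/workflows/predict.yaml - a sketch of a scheduled R automation
name: automate-predictions
on:
  schedule:
    - cron: "0 9 * * 1,2,5"   # 09:00 UTC (~4-5am ET) on Mon, Tue, Fri
  workflow_dispatch:           # also allow manual runs

jobs:
  predict:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2                 # install R on the VM
      - uses: r-lib/actions/setup-r-dependencies@v2    # install from DESCRIPTION
      - name: Run predictions
        run: Rscript scripts/update_predictions.R      # hypothetical script
```

The `setup-r-dependencies` step is where the DESCRIPTION file pays off again: it reads the dependency list and installs everything, so the workflow itself stays short.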
Storing data with GitHub Releases
The last question - and the one I think more people need to know about - is: where can I make predictions accessible? We're already talking about GitHub, so the common approach is to just commit things to your GitHub repository, right? Well, there are three common problems with that: file size limits, inefficient binary data storage, and commit history chaos.
We'll start with file size limits. This is approximate, but I think the GitHub file size limits are something like 75 megabytes per file and 2 gigabytes per repo. In today's modern age, that's basically nothing, right? So it's very easy to blow through. If you have a CSV, for example, you might think: okay, maybe what I should do is use an RDS file, or a Parquet file with Arrow, or whatever gives a small file size - my friend Seb likes to use qs. Same idea: they're all binary files.
The problem with that is that version control is incredibly bad at dealing with binary files. Here's an example of a repository with a daily automation that basically overwrites an RDS file every single day. The text is really tiny, so I've annotated it for you: the actual data in this repository is 60 megabytes in size, and Git - since this ran over the course of six months to a year - has automatically backed up 6 gigabytes' worth of data that I didn't know about until I decided I needed to clone it down and do something with the code. And once I'd cloned it down, I couldn't push it back, because GitHub has a 2-gigabyte limit.
I actually rage-tweeted a whole bunch about this when I found out - if you're really keen, you can go find it. So it's incredibly bad. What's happening is that version control is literally tracking versions, right? It compares the differences and tries to log just the differences - but a binary file is entirely different every time you write it, so Git naturally stores a whole new copy. That's why Git is really bad at dealing with binary files. So even if you're into Parquet and the latest and greatest there, it's very easy - especially when you pair it with Actions - to kill your repo like this.
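You can watch this happen in miniature with nothing but `git` and `dd` - a small demo (not from the talk) that commits the same 5 MB binary file twice and then checks how much Git is actually storing:

```shell
# Overwriting a binary file makes git store a full second copy
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email "demo@example.com"
git config user.name  "demo"

# "day 1": commit a 5 MB file of random (incompressible) bytes
dd if=/dev/urandom of=data.bin bs=1M count=5 2>/dev/null
git add data.bin && git commit -qm "day 1 data"

# "day 2": overwrite and commit again - binary files don't delta-compress
# like text, so git keeps a whole new 5 MB object
dd if=/dev/urandom of=data.bin bs=1M count=5 2>/dev/null
git add data.bin && git commit -qm "day 2 data"

git count-objects -v   # "size" is now roughly double the working-tree size
```

Two commits in, the repository already stores about 10 MB for a 5 MB working tree; a daily automation just repeats that compounding forever.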
Lastly, this is another problem: if you're committing to the same repo that your code lives in, you end up with, like, 6,000 commits. Now, pop quiz: which commit did I make a typo in and automatically write to the wrong file? Well, it was, like, 4,222 commits ago. How do I revert to that? Right? It's impossible. So: commit history chaos.
There are obviously better solutions - but first, here are some meh solutions. There's Git Large File Storage. It takes care of some of the large file size problems, but it also wrecks your Git repository: it's impossible to collaborate with, and it's really hard for users to work with, because your data isn't actually stored on GitHub as files anymore - it's sharded everywhere. Amazon S3 buckets: two problems. One, they cost money - again, we're talking about free, sustainable projects. And the other problem is that the data is stored away from your repository, so you have no idea what data is available for that repository. Someone's got this bucket and it's got things on it, and it's really hard for the novice user - someone who isn't coming at it from a terminal, from a programming perspective - to visually understand what data is there. Lastly, people still use Dropbox for this sort of problem, but that's also really difficult to work with - now from the command-line side. How do you get files back and forth from there?
The real solution, again pairing with the GitHub theme: GitHub Releases. Releases are kind of a lesser-known thing, but you can make a new release, and releases are stored right next to your repo - there's a Releases section right beside your repository. You make a new release, you upload files to the release, and then, when you're ready, you can update those files as desired. How is this different from committing data to the repo? Well, there are friendlier file size limits: each file you upload to a release is understood to be bigger, so the limit is something like 2 gigabytes per file, and you can upload as many of these files as you want - much more generous than 2 gigabytes in total and 75 megabytes per file. You version by choice: when you upload, you overwrite the file unless you make a new release, and if you're ready to archive that data, you make a new release, upload to that new release, and you still have the old version. Your commit history stays clean: you're not committing, you're uploading - manually or through an API. And the history stays with your project: if I'm interested in what data is available in the nflverse data releases, I can go through the releases, figure it out, and point and click to download. Best of all, it's free - for both public and private repositories.
Of course, R is wonderful, and there's a package for everything. The piggyback package - which I've started working on, and which is originally by Carl Boettiger - covers that whole process with three functions. pb_release_create() creates a new release. pb_upload() uploads a file to a specific release. And pb_download() lets you download all of the files, some of the files, or just one file from a release.
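Put together, the piggyback workflow looks something like this (requires a GitHub token such as `GITHUB_PAT` in your environment; the repo name, tag, and file name here are illustrative):

```r
library(piggyback)

# create a release to hold the data
pb_release_create(repo = "tanho63/project_example", tag = "predictions-2022")

# upload the latest predictions; re-uploading the same file name
# overwrites it in place, so the release always holds the current version
pb_upload("predictions.csv",
          repo = "tanho63/project_example",
          tag  = "predictions-2022")

# anyone can then pull the data back down
pb_download("predictions.csv",
            repo = "tanho63/project_example",
            tag  = "predictions-2022")
```

Calling `pb_download()` without a file name grabs every asset attached to the release, which is handy when a release holds one file per week of the season.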
So, to recap: today I presented three tools for project immortality. I talked about R package infrastructure, to make things easy to install and to communicate what you want people to do with your work; GitHub Actions, to schedule and automate things; and GitHub Releases, as an awesome and wonderful way to store data. With those three tools under your belt, I'd like to leave you with one last question: how can you help your projects help others, now and into the future? Thank you.