
Jim Hester | Azure Pipelines and GitHub Actions | RStudio (2020)
Open source R packages on GitHub often take advantage of continuous integration services to automatically check their packages for errors. This is very useful for catching problems quickly, as well as increasing confidence in proposed changes, since pull requests can be checked before they are merged. Travis CI and AppVeyor are the most popular current methods. However, newer services, Azure Pipelines and GitHub Actions, show promise for being more powerful and simpler to configure and debug. I will discuss these services and demonstrate some of their capabilities and how to configure them for your own use in packages and reports.
Transcript
This transcript was generated automatically and may contain errors.
Welcome, everybody, to our final presentation in this programming track. I want to introduce you to Jim Hester. Jim is a software engineer on the Tidyverse team, and he is going to talk about his work on GitHub Actions.
All right, thanks, everyone, for coming. As Andre said, I'm Jim Hester, I'm a software engineer at RStudio, and I'm going to talk about GitHub Actions for R.
So GitHub Actions is a new service recently released by GitHub, and it allows you to automate common workflows in software packages and in repositories that you have on GitHub. And one of the most common things that people want to do with these types of services is what's called continuous integration, or CI. And continuous integration, if you're not familiar with it, is a way to run all the tests and checks on your project every time you make a change to the project. So if you commit a change to the repo, or if someone submits a pull request to your repo, continuous integration would run all of the package checks on that new code to verify that it doesn't break your existing functionality.
And this is a really great way to make sure that you don't accidentally break something, and also to give you confidence that the user, the person who submitted the pull request to you, isn't going to break something in your project. And this is also a good way to ensure your project is reproducible, because the code is running on someone else's machine, which is really one of the first steps in reproducibility.
So Martin Fowler has this famous quote that, if it hurts, you should do it more often. And that's really the crux of continuous integration. Rather than running your tests maybe only before you do a CRAN release or a big release of your package, you run the tests every time you make a change. And that really makes sure that if you introduce a bug, it's only going to be a bug in a very small subset of your package, because it's only the things that have changed since the last time you ran a successful test.
What GitHub Actions looks like in practice
So what does this actually look like on GitHub? So this is all of the commits in one of my repositories, the vroom repository. And next to each commit, you can see, it's kind of small, but there's a green check mark next to some, or a red X next to some others. And this is showing the status of the checks that I ran on GitHub Actions for each of those commits.
And if you click on one of these red X marks, it'll give you a more detailed view of all of the checks that were run. So here I'm running checks on multiple different operating systems. And you can see the Windows and macOS checks worked fine, but all of the checks on Linux seem to have broken. So clearly, that gives you some idea of where the problem might lie already. And if you need more detail, there's a live log for the checks, and you can find exactly what line might have caused the error.
Why choose GitHub Actions
So GitHub Actions is not the only service that provides these sorts of features. So why might GitHub Actions be a good thing for you to try on your repository? The first thing is that it's one of the few services that has support for all three major operating systems: Linux, Windows, and macOS. And it also has first-class support for Docker containers. So if you're used to working in Docker containers, and you have all of your project's dependencies already bundled into a Docker container, you can use those on GitHub Actions very easily.
It also has very generous resource allocations for your jobs. So you can run 20 concurrent jobs for a given repository. And those jobs can take up to six hours. And because GitHub Actions is built into GitHub, it doesn't require separate authentication and setup steps apart from your regular GitHub authentication. So as soon as you commit the configuration file that describes your workflow to your repository, it's going to start running on GitHub Actions. You don't have to do any additional manual setup.
And you might ask, is this service free? And it is for open source, academic, and educational repositories. If you don't fall into one of those buckets, you can still use GitHub Actions, but you have to pay a fairly reasonable fee.
So I mentioned there were some other services that do similar things to what GitHub Actions provides. Some of the most popular that are used in the R community are Travis CI, CircleCI, and AppVeyor. You can see compared to these, GitHub Actions is one of the few that has support for all of these three operating systems and really good support for Docker containers. And then it also gives you many more concurrent builds than the other services provide for open source projects. And the maximum job length is the longest of any of the other options. And again, in terms of setup, because this is integrated into GitHub, it's basically painless to set up, whereas some of the other services range from moderately annoying to extremely annoying.
Azure Pipelines vs GitHub Actions
So you might have noticed that in my talk abstract, it was about Azure Pipelines and GitHub Actions. So why am I not talking about Azure at all? So the first thing is that it is true that Azure Pipelines has been around as a publicly available service for a lot longer than GitHub Actions. So as a result, it's much more mature. There's definitely more features available in Azure. However, because it's not integrated into GitHub, it's actually much more challenging to set up and get working on your project.
And they actually share the same infrastructure. So GitHub Actions is running on the Azure cloud. And actually, a lot of the code, the backend code, is identical between the two services. So a lot of the features are basically the same between the two. And because GitHub Actions is so much easier to get set up, and I think the features are just going to continue to improve over time, I just would recommend using GitHub Actions.
Configuring GitHub Actions for R packages
So how do you actually do this on your repository? So in the development version of the usethis package, we have a function called use_github_actions(). And this will create a workflow configuration file. This is a YAML file, if you're familiar with that format. And this describes to GitHub Actions the workflow that you want it to perform.
So at the beginning, we have to tell GitHub Actions what GitHub event we want to run the action on, the workflow on. And so in this case, we want to run it any time there's a push to the repository or there's a pull request. And the next thing we are specifying is what operating system the workflow should run on. In this case, we're running on macOS. And then we need to check out the code. So this checks out your Git code on the worker. And then we set up R using this custom action that I wrote to install R on the worker.
So this is checking an R package, and R packages usually have additional packages they depend on. So this set of code is installing the remotes package and using it to install all of the package dependencies for the R package we're trying to check. And then finally, we can actually run the check on the R package with the rcmdcheck package.
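The simple workflow walked through above can be sketched roughly as follows. This is a minimal reconstruction rather than the exact file shown on the slide: the file path, the job name, and the action version tags (`@v4`, `@v2`) are assumptions and will differ from what was current at the time of the talk.

```yaml
# Assumed location: .github/workflows/R-CMD-check.yaml
# Run the workflow on every push and every pull request.
on: [push, pull_request]

name: R-CMD-check

jobs:
  R-CMD-check:
    # Operating system of the worker, macOS as in the talk.
    runs-on: macos-latest
    steps:
      # Check out the repository's code on the worker.
      - uses: actions/checkout@v4
      # Install R with the custom action from r-lib/actions
      # (the version tag is an assumption).
      - uses: r-lib/actions/setup-r@v2
      # Install remotes and rcmdcheck, then the package's own dependencies.
      - name: Install dependencies
        run: |
          install.packages(c("remotes", "rcmdcheck"))
          remotes::install_deps(dependencies = TRUE)
        shell: Rscript {0}
      # Run R CMD check through the rcmdcheck package; fail the job on errors.
      - name: Check
        run: rcmdcheck::rcmdcheck(args = "--no-manual", error_on = "error")
        shell: Rscript {0}
```

Setting `shell: Rscript {0}` makes the `run:` body execute as R code instead of a shell script, which keeps the R steps short.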
So that's the simplest workflow for doing a package check. In the tidyverse, we try to have a little bit more robust checks. We want to check on more than one operating system and more than one version of R. So we have this lengthier workflow. This is the entire file, which obviously you can't read. But I'm going to go through some of the features in this. And this uses a different function, use_github_actions_tidy().
So the first thing that this does is use a matrix build, which means that instead of only building on one operating system and one version of R, we're building on multiple different operating systems. So we're building on Windows, macOS, and Ubuntu Linux. And then we're also building on multiple versions of R. So it's building on R-devel, R 3.6, and all the way back to R 3.2. And we do this in the tidyverse packages to ensure that all of our packages work on even pretty old versions of R.
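A matrix build of this kind might look like the sketch below. The operating systems and R versions mirror the ones named in the talk. The exact combinations, the `config` key name, and the version tags are assumptions, and very old R versions may no longer install on current hosted runners.

```yaml
on: [push, pull_request]

name: R-CMD-check

jobs:
  R-CMD-check:
    # Each matrix entry below becomes its own job on its own worker.
    runs-on: ${{ matrix.config.os }}
    name: ${{ matrix.config.os }} (${{ matrix.config.r }})
    strategy:
      # Keep the other jobs running when one combination fails.
      fail-fast: false
      matrix:
        config:
          - {os: windows-latest, r: '3.6'}
          - {os: macos-latest, r: '3.6'}
          - {os: ubuntu-latest, r: 'devel'}
          - {os: ubuntu-latest, r: '3.6'}
          - {os: ubuntu-latest, r: '3.2'}
    steps:
      - uses: actions/checkout@v4
      # Install the R version requested by this matrix entry.
      - uses: r-lib/actions/setup-r@v2
        with:
          r-version: ${{ matrix.config.r }}
```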
The next thing this does is take advantage of RStudio Package Manager, which now provides Linux binaries. So that really speeds up installing your package dependencies on Linux.
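As a hedged illustration of the Package Manager step: in later releases of the r-lib setup action this is, as far as I know, a single input, shown below. The workflow from the talk most likely configured the repository URL differently, so treat the input name and the version tag as assumptions to verify against the r-lib/actions documentation.

```yaml
# Fragment of a job's steps list, not a complete workflow.
steps:
  - uses: r-lib/actions/setup-r@v2
    with:
      r-version: '3.6'
      # Assumption: this input points package installation at the public
      # RStudio Package Manager, which serves prebuilt Linux binaries.
      use-public-rspm: true
```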
Another thing that helps with installing package dependencies quickly is package caching, which GitHub Actions also provides. And this code that I'm showing here is how you set that up for an R package. Another thing that this workflow does is, using the R-hub sysreqs database, it automatically queries any Linux system dependencies that your package or its dependencies need and installs them. So rather than having to know about these Linux system dependencies yourself, this will query them and install them for you automatically.
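The caching setup described above can be sketched with the standard `actions/cache` action. The idea is to write the resolved dependency list to a file and use its hash as the cache key, so the cache is reused until the dependencies change. The file name `.github/depends.Rds` and the key layout are illustrative assumptions, and `R_LIBS_USER` is assumed to have been set by the R setup step.

```yaml
# Fragment of a job's steps list, placed after R has been installed.
steps:
  # Record the package's dependencies so their hash can key the cache.
  - name: Query dependencies
    run: |
      install.packages("remotes")
      saveRDS(remotes::dev_package_deps(dependencies = TRUE), ".github/depends.Rds", version = 2)
    shell: Rscript {0}
  # Restore the R library from a previous run when the dependency list matches.
  - name: Cache R packages
    uses: actions/cache@v4
    with:
      path: ${{ env.R_LIBS_USER }}
      key: ${{ runner.os }}-r-${{ hashFiles('.github/depends.Rds') }}
      restore-keys: ${{ runner.os }}-r-
```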
And then, when a failure happens, it's often useful to download all of the logs for that check run to your local machine so you can inspect them more closely. So this chunk of code will, if a failure happens, upload all of the check logs as an artifact of the build so you can download them if you need to. And then, finally, we use the covr package to run test coverage on the R package and use the codecov.io service to upload and store the results of our test coverage.
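These last two steps might be written as below. The `if: failure()` condition makes the upload run only when an earlier step has failed. The artifact name and the `check` directory are assumptions (the directory presumes the check step wrote its output there), and `covr::codecov()` assumes the covr package is installed and the repository is set up on codecov.io.

```yaml
# Fragment of a job's steps list, placed after the check step.
steps:
  # Only runs when a previous step in the job has failed.
  - name: Upload check results
    if: failure()
    uses: actions/upload-artifact@v4
    with:
      name: ${{ runner.os }}-results
      path: check
  # Compute test coverage and send it to codecov.io.
  - name: Test coverage
    run: covr::codecov()
    shell: Rscript {0}
```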
Additional workflows
So that was the more complicated workflow. I should mention that all of these workflows that I'm going through are available on GitHub at github.com/r-lib/actions. So it has examples of everything I'm going to be talking about today and all of the custom actions that we've built for R packages.
So there are additional workflows that we have available. So these are three similar workflows for building other things related to R packages. So the first one is a workflow that builds the pkgdown site for your R package. The second one is if you have a blogdown site, this will use GitHub Actions to build your blogdown site and deploy it to Netlify. And the third one is if you have a bookdown book, you can use GitHub Actions to build your book and then deploy it to Netlify as well.
So these are all fairly similar, so I'm only going to go through one of them, the blogdown one. And you can add these to your project using usethis::use_github_action() and then the file name. In this case, it's blogdown.yaml.
So that's the full file. So the first thing this does is install R and Pandoc. This is very similar to the workflow we already saw. The next thing is for this blogdown site, I'm actually going to use the renv package to track the package dependencies I used in my site and then install exactly the same versions that I used locally on the GitHub Actions worker. So that's what this is doing. And then it's also installing the Hugo web framework that you need for blogdown. And then finally, I can build the site using blogdown::build_site() and then deploy the site to Netlify by installing the Netlify command line tool and then using that to deploy the site.
So the other thing to point out in this slide is that I'm using a secret. So GitHub Actions lets you add secrets to your repository in the repository settings. And then you can use these secrets within your workflows for things like what I'm doing here, where I have an authentication token I need.
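Put together, the blogdown workflow described above could look something like this sketch. The branch name, the secret names (`NETLIFY_AUTH_TOKEN`, `NETLIFY_SITE_ID`), the `public` output directory, and the action version tags are assumptions. The secrets would need to be added in the repository settings before the deploy step can work.

```yaml
on:
  push:
    branches: main

name: blogdown

jobs:
  build:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      # Install R and Pandoc on the worker.
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-pandoc@v2
      # Restore the exact package versions recorded in renv.lock.
      - name: Install packages
        run: |
          install.packages("renv")
          renv::restore()
        shell: Rscript {0}
      # Install the Hugo static site generator that blogdown relies on.
      - name: Install Hugo
        run: blogdown::install_hugo()
        shell: Rscript {0}
      - name: Build site
        run: blogdown::build_site()
        shell: Rscript {0}
      # The Netlify command line tool is distributed through npm.
      - uses: actions/setup-node@v4
      - name: Deploy to Netlify
        env:
          # Secrets are read from the repository settings, never from the file.
          NETLIFY_AUTH_TOKEN: ${{ secrets.NETLIFY_AUTH_TOKEN }}
          NETLIFY_SITE_ID: ${{ secrets.NETLIFY_SITE_ID }}
        run: |
          npm install netlify-cli -g
          netlify deploy --prod --dir public
```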
Triggering actions on different events
So all of the workflows I've shown so far have basically used the same events. So they're running on either a push or a pull request event. But you can actually trigger a GitHub Actions workflow on any event in a GitHub repository.
So some examples we have of triggering things on other events are an action that will re-render a README.Rmd file any time you modify the contents of that file. So that won't run normally if you're editing something else in your repository. But if a given commit or pull request edits something in that README, then this action will run and re-render. And then actually it will also push the results back to the repository. So you don't ever have to do that rendering locally.
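A path-filtered trigger of this kind can be sketched as follows. The `paths` filter restricts the workflow to pushes that touch `README.Rmd`. The commit identity, the commit message, the `contents: write` permission, and the version tags are assumptions for illustration.

```yaml
on:
  push:
    # Only run when a pushed commit changes this file.
    paths:
      - README.Rmd

name: Render README

jobs:
  render:
    runs-on: macos-latest
    # Allow the job's token to push the rendered file back.
    permissions:
      contents: write
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-pandoc@v2
      - name: Install rmarkdown
        run: install.packages("rmarkdown")
        shell: Rscript {0}
      - name: Render README
        run: rmarkdown::render("README.Rmd")
        shell: Rscript {0}
      # Commit the rendered README.md and push it to the same branch.
      - name: Commit results
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git commit README.md -m "Re-build README.Rmd" || echo "No changes to commit"
          git push origin || echo "No changes to push"
```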
The second thing is something to help package maintainers when they're dealing with pull requests. So we have a custom GitHub Action that looks for special comments that the package maintainer posts on a pull request, /document and /style. GitHub Actions triggers on this comment event, and when those comments appear, it will automatically run devtools::document() or use the styler package to restyle the pull request, and then push the result of that code back to the pull request. So rather than the maintainer having to download the code locally, run, like, devtools::document(), commit the result and push it back, they can just comment /document. GitHub Actions will do all of this. And then you can go on with your code review.
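The comment-triggered workflow might be sketched as below for the /document command. The `issue_comment` event and the `if:` condition are standard GitHub Actions features. The `pr-fetch` and `pr-push` helper actions are, to my knowledge, provided in r-lib/actions, but their inputs and version tags are written from memory and should be checked there. The sketch calls roxygen2 directly, which is what devtools::document() wraps.

```yaml
on:
  issue_comment:
    types: [created]

name: Commands

jobs:
  document:
    # Run only for comments on pull requests that start with /document.
    if: ${{ github.event.issue.pull_request && startsWith(github.event.comment.body, '/document') }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Check out the pull request's branch (helper action, assumed inputs).
      - uses: r-lib/actions/pr-fetch@v2
        with:
          repo-token: ${{ secrets.GITHUB_TOKEN }}
      - uses: r-lib/actions/setup-r@v2
      - name: Install dependencies
        run: |
          install.packages(c("remotes", "roxygen2"))
          remotes::install_deps(dependencies = TRUE)
        shell: Rscript {0}
      # Regenerate the man pages and NAMESPACE.
      - name: Document
        run: roxygen2::roxygenise()
        shell: Rscript {0}
      - name: Commit
        run: |
          git config --local user.name "$GITHUB_ACTOR"
          git config --local user.email "$GITHUB_ACTOR@users.noreply.github.com"
          git add man/\* NAMESPACE
          git commit -m "Document"
      # Push the new commit back to the pull request (helper action, assumed inputs).
      - uses: r-lib/actions/pr-push@v2
        with:
          repo-token: ${{ secrets.GITHUB_TOKEN }}
```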
Another example is on my personal website, jimhester.com, where I have this GitHub-inspired calendar that is showing, like, all of my activity throughout the year. And I have to update this every day. So I build my site with GitHub Actions, and you can schedule an action to run at a certain time. So I have this action running every day at 11 o'clock. And so that way, I don't have to, like, update this manually every day.
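A scheduled trigger uses cron syntax, as in the sketch below. The talk does not say whether 11 o'clock means morning or evening, or which time zone, so the expression here (11:00 UTC) is an assumption. The build step is a placeholder for whatever builds and deploys the site.

```yaml
on:
  schedule:
    # minute hour day-of-month month day-of-week; evaluated in UTC.
    - cron: '0 11 * * *'

name: Daily site build

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Placeholder: replace with the real site build and deploy steps.
      - name: Build site
        run: echo "build and deploy the site here"
```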
Using Docker containers
So I mentioned GitHub Actions has support for Docker containers. So how would that look? So this is a simple example of using a Docker container in a GitHub Actions workflow. So you have to specify which container you want to use. In this case, I'm using the rocker/verse container, and this is pulling from Docker Hub. So this container is available on Docker Hub. And it will run all the rest of the code in that container. So here I'm fitting a model in one R script. And then I'm rendering an Rmd file and creating an HTML output and then uploading that HTML output as an artifact of the build.
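The container example described above could be sketched like this. The `container` key tells GitHub Actions to pull the image from Docker Hub and run every step inside it. The script and report file names (`fit_model.R`, `report.Rmd`) and the artifact name are hypothetical stand-ins for the ones on the slide.

```yaml
on: [push]

name: Docker example

jobs:
  render:
    # Container jobs run on a Linux worker.
    runs-on: ubuntu-latest
    # Image pulled from Docker Hub; all steps below run inside it.
    container: rocker/verse
    steps:
      - uses: actions/checkout@v4
      # Hypothetical script that fits the model.
      - name: Fit model
        run: Rscript fit_model.R
      # Hypothetical R Markdown report rendered to HTML.
      - name: Render report
        run: Rscript -e 'rmarkdown::render("report.Rmd")'
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: results
          path: report.html
```

Because rocker/verse already ships R, the tidyverse, and the R Markdown toolchain, no separate R setup step is needed in this job.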
So this presentation was about GitHub Actions for R. This work was done by me. And also, we had an RStudio intern, Max Held, who did a lot of the early pioneering work on this over the summer. So this wouldn't be possible without him. All of the workflows that I talked about are available on github.com/r-lib/actions. And that also has all of our custom actions. So if you're looking for information about using GitHub Actions for R, that's the best place to go. And if you want to try these out on your repository, you can use usethis::use_github_actions(). And the slides are available on Speaker Deck at that URL. So thank you.
Q&A
Thank you very much, Jim. The first question is, if you use Azure DevOps for source control, do you think the increased difficulty over GitHub Actions is largely eliminated?
I know basically nothing about Azure DevOps. But yeah, I think so. I think if you're already sort of in the Azure ecosystem, Azure Pipelines might work. But I don't know for sure.
We have a rather starry-eyed question here, an important one. How did you make these slides?
This is just Keynote with a theme I found online. I just Googled, you know.
Are there plans and/or resources for using Azure Pipelines within Azure DevOps when testing and checking internal R packages?
So there are no plans on our team. In terms of the Tidyverse team, we're very much, I think, going to just steer towards GitHub Actions for all of our repositories. For our open source work, we do all of our work on GitHub. So I think it makes a lot of sense for us to invest resources there rather than in Azure.
Can you say something about your experience of moving from Travis CI to GitHub Actions for the Tidyverse project?
Yes, yeah. So we've made the configuration you need for Travis very simple. Like you can basically just have language: r and a couple more lines. The workflow files you use in GitHub Actions are a little bit longer. But I think some people, when they're using Travis, don't understand all the steps that are going on when you set language: r. So part of my motivation in GitHub Actions was to make that a little bit more transparent, so it would be easier for people to understand exactly what was happening, and so that they could modify it if they need to, yeah.
This person is asking, you mentioned crontab. You can schedule kicking off an action. Can GitHub Actions be used to provide a package that has data that refreshes hourly or at some frequency? Or are there pros and cons? Are there reasons why you would not do that?
You can definitely do that. So you can, yeah, you could schedule a GitHub Action to run hourly and then create data and deploy it wherever. I didn't talk about any of this, but there's like a whole marketplace of custom actions that people have developed for use in GitHub Actions. And they can do things like deploy to Amazon S3 or Azure Cloud or wherever, basically. And so you can do a lot of sophisticated things with scheduled jobs like that, yeah.
