Isaac Florence | Scaling and automating R workflows with Kubernetes and Airflow | Posit (2022)
During the pandemic, epidemiologists have been forced to adapt to the unprecedented scale of the data and the high cadence of reporting. At the UK Health Security Agency, we have created a platform that lets teams easily deploy R and/or Python tasks onto our High-Performance Computing resources, schedule their execution, and run previously unthinkable workloads with ease. Thanks to Kubernetes, git, Docker, and Airflow, our epidemiologists can stop worrying about their laptop's memory and bandwidth and focus on answering the crucial questions of the pandemic. We'd like to tell you how we did it. Session: R be nimble, R be quick, R help me plan my vaccine stick: Rapidly responding to world events with R
Transcript
This transcript was generated automatically and may contain errors.
Good afternoon, everyone, and thank you very much for having me. I'm sorry not to be with you in Washington, D.C., but to be joining you virtually. So I'll kick off by talking about scaling and automating R workflows with Kubernetes and Airflow.
I spent a lot of my childhood and teenage years working on farms, and any of you who have done the same will recognize baler twine. Now, the joke goes that you can solve 90% of your problems on a farm with baler twine. This is essentially nylon string that is an offcut from, or a byproduct of, tying up bales of hay, but it ends up being a useful tool for tying together, say, a farm gate, or a trailer to an ATV, or even a lead for your dog. It's something that's lying around that's tough and sturdy and largely gets the job done. The equivalent, I guess, would be duct tape. And as many of you will have experienced, duct tape isn't really the appropriate tool to get the job done here, or in many cases, but it sort of works and it's sort of okay up to a certain point. The farm analogy really is that you can get so far with baler twine, but at some point you're going to need real infrastructure. You're going to need bricks and cement.
Background and context
So who am I? I'm an infectious disease epidemiologist working at the UK Health Security Agency, or what used to be called Public Health England. And like many of you at the RStudio conference today, I imagine you'll have started as an analyst of some sort in a particular specialist area. As you've fallen further in love with R, you've ended up becoming more of a proper data scientist, as it would have been called back then. And then — and I think this will be a familiar story, whether you've already gone through it, hope to go through it, or will inevitably find yourself going through it — you find yourself without quite the right tools for your specialist area, and you end up becoming more of a software engineer as you try to create the tools for yourself and your colleagues to work with.
And then once you've got those tools, as we heard in the last talk, it's often not the tools you're lacking so much as the data. So you end up spending all your life moving data around, cleaning it, linking it, and preparing it, and then you may as well be a data engineer. And the final stage of transformation, as you get further and further away from analysis, is when people start to presume you're a member of your ICT department, and you fancily decide to refer to yourself as a DevOps engineer. Now, in my case, this all boils down to having a profound love for both epidemiology and code. But it's taken a lot for me to find the area I'm in now, which is enabling my colleagues and me to analyze epidemiology and infectious disease data at scale.
So what am I going to be talking about? Exactly that: how we replaced our baler-twine-like solutions — the ones that sort of work, but that we need to be a bit better — with true infrastructure. I'm going to talk about the main tools we've ended up using: why Kubernetes, and what even is Kubernetes; what we mean by automation, and why it's valuable; and then Airflow, and why it's solved so many of our problems.
So the context for all of this, which I imagine will be familiar but is important to buzz through nonetheless, is that public health agencies have traditionally worked with maybe hundreds of cases of each infectious disease they specialize in each year. These are small datasets that are easy to manage with simple analytical tools — both the traditional ones and the more modern ones like R and Python. So there's never been an impetus to begin thinking about scale or automation, as you're not working at breakneck pace.
And the turning point in recent years has been that both the data that is captured and the answers that we're required to provide to decision makers and health protection colleagues on the ground are becoming more detailed and more complex. In the UK, thanks to our National Health Service, where care is free and standardized to a certain extent, we have collected fantastic volumes of high-quality data, and we're needing to understand it in much more detail than ever before. So scale has started to creep in.
We're at the point where we're needing to do more complex analysis, and so we're using more appropriate tools, moving away from the kind of legacy tools that many of us epidemiologists in the room would have been taught in our infectious disease courses at university. R quickly becomes the lingua franca for our topic area: it's open source, it's accessible, it's usable by all of us, and it answers all the questions we need it to.
Now, COVID is where we started to blend those much more complex data requirements with the need to work at unprecedented scale, going from hundreds of cases up to millions of laboratory-confirmed cases in the UK. So we've needed to move away from those traditional ways of working. Some people will say, well, just use SQL and you'll be fine. But as we all know, SQL will only take you so far before you need to be able to wrangle your data in a much more flexible and accessible way. And so although ODBC-powered workloads creep into your scripts and your processes, really you just need to be able to do R, but bigger.
Now, the final piece of context, as all of our colleagues who work in the public sector will know, is that you're required to deliver high value for money, with everything maintained by the rather small but mighty teams that exist in many public sector organisations, and you're also expected to perform at least to the industry standard, if not ahead of it. All of that drives quite strong innovation; in many cases it's really sink or swim, and COVID was the sink-or-swim moment for us. Now in the UK we're very lucky in that the UK government prefers open source; it's a written-down truth that, in order to achieve high value for money, you need to be using open source technology where possible.
From baler twine to real infrastructure
So where have we come from? We've come from a place of people running an individual script as part of a standard operating procedure, then maybe thinking, okay, I'm going to write all of this out instead and run it as a scheduled task on my laptop for our Windows users, or a cron job for our Unix users, then setting up the scheduled task on a virtual machine when your laptop started to feel a little unrobust from needing to be on all the time. That's really where the baler-twine solutions start to creep in, and you find yourself wrapping more and more of your solutions in bits of tough nylon string, trying to keep it all together. That's when you need to start pushing into a place of more robust infrastructure. So you might set up scheduled jobs on an older tool like Slurm or, in more modern cases, on Kubernetes, and then finally get to that true infrastructure of having both an orchestration tool and a compute tool: your DAGs in Airflow running on Kubernetes.
What is Kubernetes?
So what is Kubernetes? What are we talking about? Kubernetes, first, is open source. It's a Cloud Native Computing Foundation tool that's become pervasive for running, typically, web apps, but now increasingly data science workloads, across the world. Container images move you away from "it works on my machine" by shipping the entire operating system and all of your system packages. Those container images were popularized by Docker, which everyone will have heard of, and by the slightly less popular but up-and-coming Podman, which provides very similar functionality. Kubernetes extends that idea of running your workloads in container images.
And instead of having those container images run on your laptop, it lets you orchestrate having them run anywhere. You could have Kubernetes on your laptop; you could have it anywhere, and I'll come on to that in a moment. You specify your software and your operating system in your container image, and then you assign resources — your compute, your memory, your storage — to your pods and run those together.
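As a concrete sketch of that idea — pod, image, and explicit resources — here is roughly what a single-script pod could look like. All the names, the namespace, and the registry path are illustrative, not from the talk:

```yaml
# Hypothetical pod spec: one container image running one R script,
# with explicit CPU/memory requests and limits.
apiVersion: v1
kind: Pod
metadata:
  name: daily-report
  namespace: epi-team                  # illustrative team namespace
spec:
  restartPolicy: Never
  containers:
    - name: report
      image: registry.example.org/epi-team/r-analysis:latest  # illustrative
      command: ["Rscript", "/code/report.R"]
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "4"
          memory: 8Gi
```

The requests/limits block is what replaces "a virtual machine per task": the scheduler packs these pods onto shared hardware instead.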
Now, pods are entirely isolated, which is either very helpful or not helpful at all — in which case you can configure them, very specifically, not to be isolated from each other — but it provides a great deal of security and reassurance that your workloads can coexist in whatever environment you have created. At the UK Health Security Agency, we prefer the Red Hat flavor of Kubernetes, known as OpenShift, and I'll talk a little bit about that now, because it comes with easy-to-use continuous integration and continuous delivery add-ons. I think everyone, particularly RStudio — or now Posit — fans, will know that using open source tools is certainly cost effective, but there does come a time when paying for the expertise and the tools that enable easier use of those open source tools is really worthwhile. That's certainly the case with OpenShift, in our opinion.
Now, continuous integration and continuous deployment is more of a DevOps-y term, but to the user, what's important is that you can have your GitLab or GitHub repository containing your code, and whenever you push to your master branch, say, that code is replicated into your container image in Kubernetes or OpenShift, ready to go, running just as it would on your laptop, and allowing you to push that workload into whatever environment you want.
Now, why have we used it? Primarily because you want to use fewer resources. Instead of having a virtual machine per task, configured just the way you like it, you configure for each of your workloads exactly the operating system, the system packages, your code, and any of your connections, all in one bundle. You can put that in one place and use smaller volumes of compute and storage capacity, which saves money — increasing that value for money, for a start. It's also easier to manage for ICT administrators. I think all of us will feel either hamstrung by or very sorry for our ICT colleagues, particularly in the public sector, where you want to make their life as easy as possible so they can make your life as easy as possible. Having all of your workloads standardized, or at least given a standard definition, is incredibly helpful for the simplified management of those resources.
And the other great thing about Kubernetes, as well as it being open source, is that it runs anywhere. You can have it in the public cloud, like Azure, AWS, or GCP; in the hybrid cloud if you're from a larger organization; on-premises; or even on your own laptop when you're doing development. And Kubernetes, as I said, is frankly already the way forward — increasingly pervasive — for running containerized web applications and microservices. Although data science is lagging behind, a lot of tools and projects are propelling it into being more and more commonplace.
What is Airflow?
So then Airflow, the second piece of our puzzle, or the second piece of our true infrastructure and solution, is again open source, from the Apache Software Foundation. It's a workflow scheduler, similar to a cron job — one of the funniest descriptions I've heard of it is "Apache Fancy Cron", as opposed to Apache Airflow, or maybe "cron++" — because it gives you more than just scheduling. You don't just say, I want to run this command at 7 a.m. every day, seven days a week; you can also say, email me when there's a problem, or show me all of the details of which days it succeeded, which days it failed, how many times it tried to retry itself, and so on. It's got those added extras: instead of a myriad of tasks and custom scripts, each with their own monitoring and logging built in, you can offload all of that thought to Airflow itself.
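As a minimal sketch of those added extras — scheduling, retries, and failure emails declared rather than hand-coded — a DAG file could look like this. This assumes Airflow 2.x; the DAG name, email address, and script path are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Run an R script at 7 a.m. every day, retry twice on failure,
# and email the team if it still fails -- all declared as configuration.
with DAG(
    dag_id="daily_line_list",             # illustrative name
    schedule_interval="0 7 * * *",        # 7 a.m., seven days a week
    start_date=datetime(2022, 1, 1),
    catchup=False,
    default_args={
        "retries": 2,
        "email": ["epi-team@example.org"],  # illustrative address
        "email_on_failure": True,
    },
) as dag:
    BashOperator(task_id="run_report", bash_command="Rscript report.R")
```

Airflow then records every run — successes, failures, retry counts — in its web UI, which is exactly the monitoring you would otherwise hand-roll around each script.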
Now, you define your workflows in directed acyclic graphs, or DAGs, as they're more commonly referred to. DAG is not an uncommon term to those in data science: if any of you use the targets R package, you define your work in a DAG, and those who work in machine learning will quite often have come across it too. For those who don't know, a DAG is simply a set of tasks and the relationships between them, showing the dependencies between one thing having happened and the next thing happening, and Airflow manages all of its workloads in DAGs. Now, DAGs in Airflow are very simple, because you have simple code-as-configuration. Although it's written in Python, to an R user who hadn't written any Python before he started using Airflow, it's remarkably simple, so certainly don't let that put you off.
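To illustrate just the concept — tasks plus dependencies, from which a valid running order follows — here is a hypothetical four-step pipeline sketched with Python's standard-library `graphlib` rather than Airflow itself:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
deps = {
    "clean":  {"extract"},            # clean runs after extract
    "model":  {"clean"},
    "report": {"clean", "model"},     # report waits for both
}

# static_order() yields the tasks in a dependency-respecting order,
# which is exactly what a DAG scheduler computes before running anything.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Here `extract` must come first and `report` last; Airflow does the same resolution, then also runs, retries, and logs each task.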
And then this allows you to do anything from something as simple as running an arbitrary R script, to linking Airflow to your Hadoop cluster or any other large compute cluster or data processing tool. Now, Airflow is incredibly flexible, which is fantastic because it fits a wide variety of use cases, but at UKHSA we've decided to distill that into a very opinionated setup, not making use of much of that flexibility and instead trading it for a user-friendly, simple-to-learn setup, which I'll go into in more detail now.
How we use Kubernetes and Airflow together
So, how have we used these together? How have we melded them, and why is that useful? Each team has its own Kubernetes namespace, so called, or project, if you're using OpenShift-specific language, and that allows you to define users, resources, security, and other details about all of your workloads, so you can have tenants — different teams within the organization. Many teams, both public sector and private sector, will know that not everyone in your organization should see all of your data; you've often got to lock it down. So having those individual tenants within your organization is incredibly helpful, and it also ensures no one's getting in their own way, or in each other's way, even.
And then, when you integrate that with Airflow, you have all of your container images, all of your code, and all of your credentials set up in OpenShift or Kubernetes, and you assign each DAG to a project. So a team in their own project or namespace can have multiple DAGs, but each DAG belongs to only one team, and that's been remarkably helpful for making sure people aren't confused about which workload is failing and which isn't, and whether it's theirs or not. Teams, therefore, can only see their own projects in OpenShift and their own DAGs in Airflow, which, again, helps keep things simple and secure.
And then, within that, rather than each different task being a different operator, as they're known in Airflow, we say that every Airflow task simply creates a new pod in Kubernetes or OpenShift with your container image and all of your code, and then you give it a specific command. Most often, in my workloads, that command is simply to run an R script, because we offload all of the connections to our other resources, and our secrets and credentials, to Kubernetes, where that enterprise-level security and resource management happens; Airflow is simply telling Kubernetes how to run a pod and with what resources. After a lot of battle testing and user acceptance, this is what we found to be the most simple, secure, and easy-to-manage way to move workloads from people running individual scripts on their laptops into Kubernetes and Airflow. As you can already imagine — you don't need me to spell it out — that saves vast amounts of staff time, and it also gives you observability, making sure the entire team knows when something's happened, rather than needing to send Teams or Slack messages to ask, have you run the script for today?
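A hedged sketch of what one such "task = pod" step might look like, using the KubernetesPodOperator from Airflow's CNCF Kubernetes provider. The import path varies by provider version, and the namespace, image, and script path are illustrative, not UKHSA's actual configuration:

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# One Airflow task == one Kubernetes pod: the team's container image,
# the team's namespace, and a single command -- here, running an R script.
# Secrets and connections live in Kubernetes, not in this file.
run_report = KubernetesPodOperator(
    task_id="run_report",
    name="run-report",
    namespace="epi-team",                                     # team's project
    image="registry.example.org/epi-team/r-analysis:latest",  # illustrative
    cmds=["Rscript"],
    arguments=["/code/report.R"],
    get_logs=True,   # stream the script's output back into the Airflow UI
)
```

The design choice this reflects is the one described above: Airflow only says "run this pod with these resources", while Kubernetes holds the credentials and enforces the isolation.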
Getting started
So the important bit then is: how do you get started? Well, if you haven't already, you can use tools like Docker — popular, although not open source in Docker's case — and Podman to write and build your container images and get your workloads running in containers to start with. Now Red Hat, with its commitment to open source, provides free developer licenses, which means you can have your own OpenShift cluster on your laptop — the perfect way to start learning how Kubernetes itself works, with the user-friendly, easy-to-get-started add-ons that the OpenShift flavor of Kubernetes provides. Airflow, as we said, is open source. It's simply a Python package; you can pip install it very happily and begin playing with simple DAGs within half an hour.
And then, when you want to go to the next step — you're using OpenShift or Kubernetes yourself, you're using Airflow yourself — it's time to put Airflow onto Kubernetes instead of your own computer. There are official pre-built container images to run Airflow on Kubernetes or on your own machine. After that, when you've got your workloads running with small amounts of resources on your own laptop, you can say, okay, I'm ready now to move on to the main Kubernetes cluster, and that takes you from the 8–16 gigabytes of memory and 8 CPUs you have on a laptop up to as much compute as you can possibly achieve with the hardware you currently have.
Now, there are a couple of extra tools that will really help you with this kind of very high-level summary of a massively complicated topic. The Rocker project provides container images with R installed and ready to use. Also — and I'm not just saying this because we're speaking at the RStudio conference — the public version of RStudio's Package Manager has been enormously time-saving and helpful, making R package binaries for Linux available to the public. It's a truly wonderful resource, which I couldn't recommend highly enough. And if you're just wanting to get started with Kubernetes — not in a production sense, but for your own development — the absolute best way to learn and to move your development workloads into Kubernetes and the cloud is RStudio Workbench, which I couldn't endorse more thoroughly. I know I've come up to time now, so I'm going to hold there. Thank you very much for listening.
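To make those two resources concrete, a minimal image built on them might look like the sketch below. The Rocker tag, package list, and Package Manager repository URL are illustrative — check the current Rocker and Package Manager documentation before copying:

```dockerfile
# Start from a Rocker image with a pinned R version (illustrative tag).
FROM rocker/r-ver:4.2.1

# Install R packages as Linux binaries from the public RStudio Package
# Manager -- far faster than compiling each package from source.
RUN Rscript -e 'install.packages( \
      c("dplyr", "ggplot2"), \
      repos = "https://packagemanager.rstudio.com/cran/__linux__/focal/latest")'

# Bake the analysis script into the image and run it by default.
COPY report.R /code/report.R
CMD ["Rscript", "/code/report.R"]
```

An image like this is exactly what the pods above would run: build it once in CI, push it to your registry, and let Airflow schedule it on Kubernetes.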