Isaac Florence | Scaling and automating R workflows with Kubernetes and Airflow | Posit (2022)
During the pandemic, epidemiologists have been forced to adapt to the unprecedented scale of the data and the high cadence of reporting. At the UK Health Security Agency, we have created a platform that lets teams easily deploy R and/or Python tasks onto our High-Performance Computing resources, schedule their execution, and run previously unthinkable workloads with ease. Thanks to Kubernetes, git, Docker, and Airflow, our epidemiologists can stop worrying about their laptop's memory and bandwidth and focus on answering the crucial questions of the pandemic. We'd like to tell you how we did it. Session: R be nimble, R be quick, R help me plan my vaccine stick: Rapidly responding to world events with R
Transcript
This transcript was generated automatically and may contain errors.
Good afternoon, everyone, and thank you very much for having me. I'm sorry not to be with you in Washington, D.C., but to be joining you virtually. So I'll kick off by talking about scaling and automating R workflows with Kubernetes and Airflow.
I spent a lot of my childhood and teenage years working on farms, and any of you who have done the same will recognize baler twine. Now, the joke goes that you can solve 90% of your problems on a farm with baler twine. This is essentially nylon string that is an offcut from, or a byproduct of, tying up bales of hay, but it ends up being a useful tool for tying together, say, a farm gate, or a trailer to an ATV, or even a lead for your dog. It's something that's lying around that's tough and sturdy and largely gets the job done. The equivalent, I guess, would be duct tape. And as many of you will have experienced, duct tape isn't really the appropriate tool to get the job done here, or in many cases, but it sort of works and it's sort of okay up to a certain point. The farm analogy really is that you can get so far with baler twine, but at some point you're going to need real infrastructure. You're going to need bricks and cement.
Background and context
So who am I? I'm an infectious disease epidemiologist working at the UK Health Security Agency, or what used to be called Public Health England. And like many of you at the RStudio conference today, I imagine you'll have started as an analyst of some sort in a particular specialist area. As you've fallen further in love with R, you've ended up becoming more of a proper data scientist, as it would have been called back then. And then — and I think this will be a familiar story, whether you've already gone through it, hope to go through it, or will inevitably find yourself going through it — you find yourself without quite the right tools for your specialist area, and you end up becoming more of a software engineer as you try to create the tools for yourself and your colleagues to work with.
And then once you've got those tools, as we heard in the last talk, it's often not the tools you're lacking so much as the data. So you end up spending all your life moving data around, cleaning it, linking it, and preparing it, and then you may as well be a data engineer. And the final stage of transformation, as you get further and further away from analysis, is when people start to presume you're a member of your ICT department, and you fancily decide to refer to yourself as a DevOps engineer. Now, in my case, this all boils down to having a profound love for both epidemiology and code. But it's taken a lot for me to find the area I'm in now, which is enabling my colleagues and me to analyze epidemiology and infectious disease data at scale.
So what am I going to be talking about? Exactly that: how we replaced our baler-twine-like solutions — the ones that sort of work, but that we need to be a bit better — with true infrastructure. I'm going to talk about the main tools we've ended up using: why Kubernetes, and what even is Kubernetes; what we mean by automation, and why it's valuable; and then Airflow, and why it's solved so many of our problems.
So the context for all of this, which I imagine will be familiar but is important to buzz through nonetheless, is that public health agencies have traditionally worked with maybe hundreds of cases of each infectious disease they specialize in each year. These are small datasets that are easy to manage with simple analytical tools — both the traditional ones and the more modern ones like R and Python. So there's never been an impetus to begin thinking about scale or automation, as you're not working at breakneck pace.
And the turning point in recent years has been that both the data that is captured and the answers that we're required to provide to decision makers and health protection colleagues on the ground are becoming more detailed and more complex. In the UK, thanks to our National Health Service, where care is free and standardized to a certain extent, we have collected fantastic volumes of high-quality data, and we're needing to understand it in much more detail than ever before. So scale has started to creep in.
We're at the point where we're needing to do more complex analysis, and so we're using more appropriate tools, moving away from the kind of legacy tools that many of us epidemiologists in the room would have been taught in our infectious disease courses at university. R quickly becomes the lingua franca for our topic area: it's open source, it's accessible, it's usable by all of us, and it answers all the questions we need it to.
Now, COVID is where we started to blend those much more complex data requirements with the need to work at unprecedented scale, going from hundreds of cases up to millions of laboratory-confirmed cases in the UK. So we've needed to move away from those traditional ways of working. Some people will say, well, just use SQL and you'll be fine. But as we all know, SQL will only take you so far before you need to be able to wrangle your data in a much more flexible and accessible way. And so although ODBC-powered workloads creep into your scripts and your processes, really you just need to be able to do R, but bigger.
Now, the final piece of context, as all of our colleagues who work in the public sector will know, is that you're required to deliver high value for money, with everything maintained by the rather small but mighty teams that exist in many public sector organisations, and you're also expected to perform at least to the industry standard, if not ahead of it. All of that drives quite strong innovation; in many cases it's really sink or swim, and COVID was the sink-or-swim moment for us. Now in the UK we're very lucky in that the UK government prefers open source; it's a written-down truth that, in order to achieve high value for money, you need to be using open source technology where possible.
From baler twine to real infrastructure
So where have we come from? We've come from a place of people running an individual script as part of a standard operating procedure, then maybe thinking, okay, I'm going to write all of this out instead and run it as a scheduled task on my laptop for our Windows users, or a cron job for our Unix users, then setting up the scheduled task on a virtual machine when your laptop started to feel a little unrobust from needing to be on all the time. That's really where the baler-twine solutions start to creep in, and you find yourself wrapping more and more of your solutions in bits of tough nylon string, trying to keep it all together. That's when you need to start pushing into a place of more robust infrastructure. So you might set up scheduled jobs on an older tool like Slurm or, in more modern cases, on Kubernetes, and then finally get to that true infrastructure of having both an orchestration tool and a compute tool: your DAGs in Airflow running on Kubernetes.
What is Kubernetes?
So what is Kubernetes? What are we talking about? Kubernetes, first, is open source. It's a Cloud Native Computing Foundation tool that's become pervasive for running, typically, web apps, but now increasingly data science workloads, across the world. Container images move you away from "it works on my machine" by shipping the entire operating system and all of your system packages. Those container images were popularized by Docker, which everyone will have heard of, and by the slightly less popular but up-and-coming Podman, which provides very similar functionality. Kubernetes extends that idea of running your workloads in container images.
And instead of having those container images run on your laptop, it lets you orchestrate having them run anywhere. You could have Kubernetes on your laptop; you could have it anywhere, and I'll come on to that in a moment. You specify your software and your operating system in your container image, and then you assign resources — your compute, your memory, your storage — to your pods and run those together.
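As a concrete sketch of that idea — pod, image, and explicit resources — here is roughly what a single-script pod could look like. All the names, the namespace, and the registry path are illustrative, not from the talk:

```yaml
# Hypothetical pod spec: one container image running one R script,
# with explicit CPU/memory requests and limits.
apiVersion: v1
kind: Pod
metadata:
  name: daily-report
  namespace: epi-team                  # illustrative team namespace
spec:
  restartPolicy: Never
  containers:
    - name: report
      image: registry.example.org/epi-team/r-analysis:latest  # illustrative
      command: ["Rscript", "/code/report.R"]
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "4"
          memory: 8Gi
```

The requests/limits block is what replaces "a virtual machine per task": the scheduler packs these pods onto shared hardware instead.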
Now, pods are entirely isolated, which is either very helpful or not helpful at all — in which case you can configure them, very specifically, not to be isolated from each other — but it provides a great deal of security and reassurance that your workloads can coexist in whatever environment you have created. At the UK Health Security Agency, we prefer the Red Hat flavor of Kubernetes, known as OpenShift, and I'll talk a little bit about that now, because it comes with easy-to-use continuous integration and continuous delivery add-ons. I think everyone, particularly RStudio — or now Posit — fans, will know that using open source tools is certainly cost effective, but there does come a time when paying for the expertise and the tools that enable easier use of those open source tools is really worthwhile. That's certainly the case with OpenShift, in our opinion.
Now, continuous integration and continuous deployment is more of a DevOps-y term, but to the user, what's important is that you can have your GitLab or GitHub repository containing your code, and whenever you push to your master branch, say, that code is replicated into your container image in Kubernetes or OpenShift, ready to go, running just as it would on your laptop, and allowing you to push that workload into whatever environment you want.
Now, why have we used it? Primarily because you want to use fewer resources. Instead of having a virtual machine per task, configured just the way you like it, you configure for each of your workloads exactly the operating system, the system packages, your code, and any of your connections, all in one bundle. You can put that in one place and use smaller volumes of compute and storage capacity, which saves money — increasing that value for money, for a start. It's also easier to manage for ICT administrators. I think all of us will feel either hamstrung by or very sorry for our ICT colleagues, particularly in the public sector, where you want to make their life as easy as possible so they can make your life as easy as possible. Having all of your workloads standardized, or at least given a standard definition, is incredibly helpful for the simplified management of those resources.
And the other great thing about Kubernetes, as well as it being open source, is that it runs anywhere. You can have it in the public cloud, like Azure, AWS, or GCP; in the hybrid cloud if you're from a larger organization; on-premises; or even on your own laptop when you're doing development. And Kubernetes, as I said, is frankly already the way forward — increasingly pervasive — for running containerized web applications and microservices. Although data science is lagging behind, a lot of tools and projects are propelling it into being more and more commonplace.
What is Airflow?
So then Airflow, the second piece of our puzzle, or the second piece of our true infrastructure and solution, is again open source, from the Apache Software Foundation. It's a workflow scheduler, similar to a cron job — one of the funniest descriptions I've heard of it is "Apache Fancy Cron", as opposed to Apache Airflow, or maybe "cron++" — because it gives you more than just scheduling. You don't just say, I want to run this command at 7 a.m. every day, seven days a week; you can also say, email me when there's a problem, or show me all of the details of which days it succeeded, which days it failed, how many times it tried to retry itself, and so on. It's got those added extras: instead of a myriad of tasks and custom scripts, each with their own monitoring and logging built in, you can offload all of that thought to Airflow itself.
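As a minimal sketch of those added extras — scheduling, retries, and failure emails declared rather than hand-coded — a DAG file could look like this. This assumes Airflow 2.x; the DAG name, email address, and script path are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Run an R script at 7 a.m. every day, retry twice on failure,
# and email the team if it still fails -- all declared as configuration.
with DAG(
    dag_id="daily_line_list",             # illustrative name
    schedule_interval="0 7 * * *",        # 7 a.m., seven days a week
    start_date=datetime(2022, 1, 1),
    catchup=False,
    default_args={
        "retries": 2,
        "email": ["epi-team@example.org"],  # illustrative address
        "email_on_failure": True,
    },
) as dag:
    BashOperator(task_id="run_report", bash_command="Rscript report.R")
```

Airflow then records every run — successes, failures, retry counts — in its web UI, which is exactly the monitoring you would otherwise hand-roll around each script.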
Now, you define your workflows in directed acyclic graphs, or DAGs, as they're more commonly referred to. DAG is not an uncommon term to those in data science: if any of you use the targets R package, you define your work in a DAG, and those who work in machine learning will quite often have come across it too. For those who don't know, a DAG is simply a set of tasks and the relationships between them, showing the dependencies between one thing having happened and the next thing happening, and Airflow manages all of its workloads in DAGs. Now, DAGs in Airflow are very simple, because you have simple code-as-configuration. Although it's written in Python, to an R user who hadn't written any Python before he started using Airflow, it's remarkably simple, so certainly don't let that put you off.
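To illustrate just the concept — tasks plus dependencies, from which a valid running order follows — here is a hypothetical four-step pipeline sketched with Python's standard-library `graphlib` rather than Airflow itself:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
deps = {
    "clean":  {"extract"},            # clean runs after extract
    "model":  {"clean"},
    "report": {"clean", "model"},     # report waits for both
}

# static_order() yields the tasks in a dependency-respecting order,
# which is exactly what a DAG scheduler computes before running anything.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Here `extract` must come first and `report` last; Airflow does the same resolution, then also runs, retries, and logs each task.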
And then this allows you to do anything from something as simple as running an arbitrary R script, to linking Airflow to your Hadoop cluster or any other large compute cluster or data processing tool. Now, Airflow is incredibly flexible, which is fantastic because it fits a wide variety of use cases, but at UKHSA we've decided to distill that into a very opinionated setup, not making use of much of that flexibility and instead trading it for a user-friendly, simple-to-learn setup, which I'll go into in more detail now.
How we use Kubernetes and Airflow together
So, how have we used these together? How have we melded them, and why is that useful? Each team has its own Kubernetes namespace, so called, or project, if you're using OpenShift-specific language, and that allows you to define users, resources, security, and other details about all of your workloads, so you can have tenants — different teams within the organization. Many teams, both public sector and private sector, will know that not everyone in your organization should see all of your data; you've often got to lock it down. So having those individual tenants within your organization is incredibly helpful, and it also ensures no one's getting in their own way, or in each other's way, even.
And then, when you integrate that with Airflow, you have all of your container images, all of your code, and all of your credentials set up in OpenShift or Kubernetes, and you assign each DAG to a project. So a team in their own project or namespace can have multiple DAGs, but each DAG belongs to only one team, and that's been remarkably helpful for making sure people aren't confused about which workload is failing and which isn't, and whether it's theirs or not. Teams, therefore, can only see their own projects in OpenShift and their own DAGs in Airflow, which, again, helps keep things simple and secure.
And then, within that, rather than each different task being a different operator, as they're known in Airflow, we say that every Airflow task simply creates a new pod in Kubernetes or OpenShift with your container image and all of your code, and then you give it a specific command. Most often, in my workloads, that command is simply to run an R script, because we offload all of the connections to our other resources, and our secrets and credentials, to Kubernetes, where that enterprise-level security and resource management happens; Airflow is simply telling Kubernetes how to run a pod and with what resources. After a lot of battle testing and user acceptance, this is what we found to be the most simple, secure, and easy-to-manage way to move workloads from people running individual scripts on their laptops into Kubernetes and Airflow. As you can already imagine — you don't need me to spell it out — that saves vast amounts of staff time, and it also gives you observability, making sure the entire team knows when something's happened, rather than needing to send Teams or Slack messages to ask, have you run the script for today?
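A hedged sketch of what one such "task = pod" step might look like, using the KubernetesPodOperator from Airflow's CNCF Kubernetes provider. The import path varies by provider version, and the namespace, image, and script path are illustrative, not UKHSA's actual configuration:

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# One Airflow task == one Kubernetes pod: the team's container image,
# the team's namespace, and a single command -- here, running an R script.
# Secrets and connections live in Kubernetes, not in this file.
run_report = KubernetesPodOperator(
    task_id="run_report",
    name="run-report",
    namespace="epi-team",                                     # team's project
    image="registry.example.org/epi-team/r-analysis:latest",  # illustrative
    cmds=["Rscript"],
    arguments=["/code/report.R"],
    get_logs=True,   # stream the script's output back into the Airflow UI
)
```

The design choice this reflects is the one described above: Airflow only says "run this pod with these resources", while Kubernetes holds the credentials and enforces the isolation.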
Getting started
So the important bit then is: how do you get started? Well, if you haven't already, you can use tools like Docker — popular, although not open source in Docker's case — and Podman to write and build your container images and get your workloads running in containers to start with. Now Red Hat, with its commitment to open source, provides free developer licenses, which means you can have your own OpenShift cluster on your laptop — the perfect way to start learning how Kubernetes itself works, with the user-friendly, easy-to-get-started add-ons that the OpenShift flavor of Kubernetes provides. Airflow, as we said, is open source. It's simply a Python package; you can pip install it very happily and begin playing with simple DAGs within half an hour.
And then, when you want to go to the next step — you're using OpenShift or Kubernetes yourself, you're using Airflow yourself — it's time to put Airflow onto Kubernetes instead of your own computer. There are official pre-built container images to run Airflow on Kubernetes or on your own machine. After that, when you've got your workloads running with small amounts of resources on your own laptop, you can say, okay, I'm ready now to move on to the main Kubernetes cluster, and that takes you from the 8–16 gigabytes of memory and 8 CPUs you have on a laptop up to as much compute as you can possibly achieve with the hardware you currently have.
Now, there are a couple of extra tools that will really help you with this kind of very high-level summary of a massively complicated topic. The Rocker project provides container images with R installed and ready to use. Also — and I'm not just saying this because we're speaking at the RStudio conference — the public version of RStudio's Package Manager has been enormously time-saving and helpful, making R package binaries for Linux available to the public. It's a truly wonderful resource, which I couldn't recommend highly enough. And if you're just wanting to get started with Kubernetes — not in a production sense, but for your own development — the absolute best way to learn and to move your development workloads into Kubernetes and the cloud is RStudio Workbench, which I couldn't endorse more thoroughly. I know I've come up to time now, so I'm going to hold there. Thank you very much for listening.
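To make those two resources concrete, a minimal image built on them might look like the sketch below. The Rocker tag, package list, and Package Manager repository URL are illustrative — check the current Rocker and Package Manager documentation before copying:

```dockerfile
# Start from a Rocker image with a pinned R version (illustrative tag).
FROM rocker/r-ver:4.2.1

# Install R packages as Linux binaries from the public RStudio Package
# Manager -- far faster than compiling each package from source.
RUN Rscript -e 'install.packages( \
      c("dplyr", "ggplot2"), \
      repos = "https://packagemanager.rstudio.com/cran/__linux__/focal/latest")'

# Bake the analysis script into the image and run it by default.
COPY report.R /code/report.R
CMD ["Rscript", "/code/report.R"]
```

An image like this is exactly what the pods above would run: build it once in CI, push it to your registry, and let Airflow schedule it on Kubernetes.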