Resources

Darby Hadley | RStudio Job Launcher: Changing where we run R stuff | RStudio (2019)

RStudio Job Launcher provides the ability to start processes within batch processing systems and container orchestration platforms. In this talk, we will explore what is possible when you have the ability to launch containerized R sessions, including scaling, isolating, and customizing environments. We will review examples of launching ad-hoc jobs as well as dockerized R sessions in Kubernetes using the Job Launcher.

About the Author: Darby Hadley is a QA engineer for multiple teams at RStudio. He has a passion for improving products, creating efficient processes, and helping people. Before joining RStudio he worked primarily in the video game industry.

Transcript#

This transcript was generated automatically and may contain errors.

I'm going to be talking about the RStudio Job Launcher and how it can change where we run R stuff.

What is the launcher?

So what is the Launcher? The Launcher will allow applications, in this case we're talking about RStudio Server Pro, to easily launch jobs, and in this case we're talking about R jobs, somewhere. Now this sounds pretty simple, but it opens up some interesting use cases. I'd like to show you some of the cool things it allows you to do.

So I thought I'd start off with an example of how I've used the Launcher, then we can dive more into the details of how it actually works.

A motivating example: Twitter portraits

So at the last RStudio conference, I saw Gyor's talk on five packages in five weeks, you might have been there. In it he mentions a post about creating portraits with text using R, so here's a portrait of Hadley made entirely of dplyr code.

I read through his post and I thought it was interesting. So before this conference I thought it would be cool to create the same kind of portraits with all the RStudio employees and conference speakers using what Twitter profiles I had access to.

So I get working on my script. I'm doing this work on RStudio Server Pro, connecting using a browser. It doesn't take that long because I have all the pieces, I just need to put it together. It does take some time to run, though. It downloads all the text and images, it does some transformation to it, and then it plots the text to create the image, and it does this one at a time.

So I wait for it to finish, and then I have my beautiful Twitter artwork. Each one of these images is created using the text from the author's tweets. That's pretty cool, I guess. Then I thought it would be cool to do the same thing with the top 100 Twitter users by number of followers, but I didn't want to wait as long, so I figured out how to parallelize my code. I only have access to a couple of CPU cores on this server, but it helps cut the time it would take in half.
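The parallelization step can be sketched roughly like this, using base R's parallel package. The make_portrait() helper here is hypothetical, standing in for the download/transform/plot work described above:

```r
library(parallel)

# Hypothetical helper: downloads a user's tweets and profile image,
# transforms them, then plots the text to create the portrait.
make_portrait <- function(handle) {
  # ... download, transform, plot ...
}

handles <- c("hadleywickham", "rstudio")  # in the talk: top 100 users

# mclapply() forks one worker per core; with two cores the run time
# is roughly halved compared to a sequential lapply() over handles.
results <- mclapply(handles, make_portrait, mc.cores = 2)
```

This is the "couple of CPU cores" version; the rest of the talk is about scaling the same idea past what one box can offer.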

So it takes about seven minutes to complete 100 images on those two cores, and basically I'm ignoring all of Joe's talk this morning on making things more efficient and just throwing money at it.

Scaling up with Kubernetes

So at this point, my imagination starts to go a little crazy to see what I could do next. So I'm a gamer, and I thought it'd be cool to take a video of the first level of Doom, separate all the individual frames, and then recreate each frame with this same portrait style using text from the source code of the game.

So it seems like a good use of my time, right?

Now that was a lot of individual frames to deal with. It was about 1,000 images, and it could take six or seven hours to complete even if I parallelized my code. I'm restricted by the amount of CPU and memory on the server, and it would be cost-inefficient for me to have some really beefy system doing something like this.

But I do have access to a Kubernetes cluster up on Google Cloud, and this cluster is actually set up for me to have a lot more CPU and memory. If you don't know what Kubernetes is, don't worry about it. We'll talk a little more about it later. And each one of these nodes has 32 CPUs, 120 gigabytes of memory, and I have access to use it on demand for some reason. Because of its auto-scaling ability, I could send the cluster this type of job, it'll scale up, and then when it's not in use, it'll scale back down, thus making it way cheaper to operate.

If only I could use 25 cores or so, I could get this done really fast. There are ways I can do that, but they can be somewhat painful, and nothing was fully integrated within the IDE until now.

So this is introducing the RStudio Job Launcher. I'm on a server that has a launcher already configured and set up, and I have the ability to launch these jobs on the Kubernetes cluster right from RStudio. All I have to do is go to the new Jobs pane in RStudio, which is new for the RStudio 1.2 release, start a launcher job, and select my script and whatever options I want. Don't worry too much about the details here; in this case I'll choose 25 cores and start it.

The script will run in the cluster and I can see the status and output in the jobs pane. This will be running completely independent of my currently running R session.

Now I'm able to do a much more intensive task in a much shorter amount of time. What would have taken forever now takes about 15 minutes with the 25 cores. I don't know if you can see it very well, but it's pretty cool.

This might seem like a silly example, but the ability to launch R jobs on a Kubernetes cluster opens up some neat possibilities. You can see how if you have an intensive job, you can utilize another processing environment to speed up that analysis.

How the launcher works

Now that you've seen one of the things the launcher can do, let's talk about what is actually going on. We've added two additional layers to the system. The RStudio Job Launcher is a separate service that is independent of RStudio Server, and it communicates with plug-ins to connect to these different environments.

So what am I talking about with these plug-ins? The core of the Job Launcher is simply a framework for loading external plug-ins, which contain the actual logic for communicating with the job's destination. So how do these things work together? Once your plug-in is configured and the RStudio Launcher process starts, it executes your plug-in process as a child of itself and communicates with it via standard in and standard out.
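As a rough illustration of how those pieces are wired together, the launcher reads a configuration file that declares a server section plus one cluster section per plug-in. This is an illustrative sketch, not a copy of a real config; the exact option names are in the Job Launcher admin documentation:

```ini
# /etc/rstudio/launcher.conf (illustrative sketch; see the Job
# Launcher admin docs for the authoritative option names)
[server]
address=localhost
port=5559

# One [cluster] section per plug-in the launcher should load.
[cluster]
name=Local
type=Local

[cluster]
name=Kubernetes
type=Kubernetes
```

Each cluster entry tells the launcher which plug-in binary to start as a child process; the plug-in then handles all communication with its destination (the local machine, a Kubernetes API server, and so on).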

So what kind of plug-ins are available? For the 1.2 release of RStudio, we are releasing three plug-ins: Local, Kubernetes, and Slurm. A side note on the Slurm plug-in: it's still in development and not really ready to be played with yet, but it's coming soon.

So what about other plug-ins? We're still trying to figure out what comes next, but we're planning on adding more plug-ins in later RStudio releases. But what if you have a super awesome processing platform that you use in your own company? Well, the good news is that you can create your own plug-ins, and here's the documentation to do so. That should have everything you need to get up and running with whatever environment you want to connect to.

Types of jobs: ad hoc and containerized sessions

So there are two types of jobs you can launch from RStudio. The first is ad hoc jobs, and that's what we were talking about in the first example. You can use the Jobs pane in the IDE to launch specific scripts to run wherever the plug-in is configured to send them. You might see that I snuck a Python job up there; the Job Launcher can also run Python jobs directly from RStudio.

The second is containerized sessions. So in this example I developed my code with an R session on the RStudio server that lived on that box. This is where I wrote and tested my script and it's a session that's tied to the console in the IDE. Then I launched the script as an ad hoc job to the Kubernetes cluster.

But with the launcher I can also develop and test directly on the cluster with a containerized session. I can write and debug the code interactively with all the capabilities your code will have when it runs. That means you can completely separate these R sessions from the RStudio server if you would like. And this is integrated within the RStudio UI.

So typically you have a single server and everyone is forced to use the same version or versions of R and Linux dependencies and an admin has to maintain that box for those users if they need updates or new software. But with the launcher and containerized session those dependencies don't even have to be installed on the server. They live in the Docker image itself. So various users can be using different images with different versions of R, specific R packages installed by default. And this allows you to create those truly reproducible environments. Also these images can be shared and distributed on Docker Hub or any other container registry.
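For example, a session image with a pinned R version and a few default packages baked in might look something like this. This is a minimal sketch built on the public rocker images; a real RStudio session image would also need the session components installed (see the RStudio documentation for supported base images):

```dockerfile
# Sketch of a reproducible session image: pin the R version and
# pre-install the packages every session should start with.
FROM rocker/r-ver:3.5.2

RUN R -e "install.packages(c('dplyr', 'ggplot2'), \
          repos = 'https://cran.r-project.org')"
```

Because everything the session needs lives in the image, pushing it to Docker Hub or another registry is all it takes to share the exact same environment with another user or cluster.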

So on the same RStudio server, user 1 can be running a session with a Docker image that contains a specific version of R and specific Linux packages, and user 2 could be running a completely different version of R with different R packages installed in shared libraries by default. Or one user could be running a bunch of different types of images in different sessions for various types of work. And these dependencies are now self-contained and isolated from other containers being run, so a user can completely screw up everything in the container and it won't affect the others.

Because of these containerized sessions, one huge benefit is scaling. Currently with RStudio Server Pro, if you'd like to scale, you have to set up multiple RSP nodes with all the required dependencies and load balance between them. If you get more users and sessions, you have to provision more nodes and reconfigure the group. Things like autoscaling become a much harder problem to solve if that's what you want to do.

But using the launcher, you can let a container orchestration platform handle that scaling for you. As you can see here, we have RSP running on one server with the launcher and the launcher communicates with the Kubernetes cluster to create these containerized sessions there. Kubernetes has all that functionality to do the scaling for you. So this allows you to just have that one RSP server. You could have multiple for resiliency, but all the memory and processing load is kicked to the cluster and not on the RSP server itself.

Live demo

Now I'd like to demonstrate launching the two types of jobs from within RStudio Server Pro: ad hoc jobs and containerized sessions.

So here's RSP running the latest 1.2 release. New to 1.2 is this Jobs pane, as some people have already talked about. And here I have a test script that prints a message, sleeps for a little bit, and then prints the R version string.
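The test script is just a few lines; something along these lines:

```r
# test.R: print a message, wait a bit, then report which R
# interpreter the job actually ran on.
message("Starting test job...")
Sys.sleep(30)
print(R.version.string)
```

Printing R.version.string is what makes the later part of the demo work: it shows at a glance which image (and therefore which R version) a given job ran in.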

So in the Jobs pane, I have the ability to start local jobs, and on RSP, if I have the launcher set up, I can also start launcher jobs. So first I'm going to start a local job with this test script, just to show you how the Jobs pane works; Jonathan already talked about it, if you didn't see.

And so as you can see, I can watch the output as it comes in. It shows a progress bar and I can see a list of all my other running jobs.

And then while that's running, I have another script, my top-Twitter R script, which is the one I used to get the top 100 Twitter users, and I'm going to start this as a launcher job.

So as you can see, I've got slightly different options when starting a launcher job. I have the environment here where I can choose my R script and working directory, just like I can for local jobs, but I can also name it, so I'll name this RStudio. I can choose the cluster where I want this job to go. Right now this RStudio Server Pro has the Local plug-in and the Kubernetes plug-in, and I'm going to choose the Kubernetes plug-in. I can choose the amount of CPU and memory; you can see that these default to certain numbers, and there's also a maximum, and this can all be configured by an RStudio admin in the profiles.

So for this script, I'm going to run with 10 CPUs, I'm going to leave it at 10 gigs, and then this is where I can choose my image. I have a couple of images already set up on Google Container Registry, and I also have the ability to choose "other" and enter whatever URL I want for whatever image I have, so you can choose a Docker Hub image or whatnot. I'm going to go with this default that I have set up, which has the dependencies for the script, and hit start.
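Those defaults and maximums come from the launcher profiles, which an admin edits on the server. An illustrative fragment might look like the following; the registry path is a made-up placeholder, and the exact file and key names should be checked against the RStudio Server Pro admin guide:

```ini
# Kubernetes launcher profiles (illustrative sketch; key and file
# names per the RStudio Server Pro admin guide)
[*]
default-cpus=2
max-cpus=25
default-mem-mb=10240
max-mem-mb=20480
container-images=gcr.io/my-project/r-session:3.5.1,r-base
allow-unknown-images=1
```

The `[*]` section applies to every user; per-user or per-group sections can tighten or loosen these limits, which is how the Q&A question about restricting clusters and resources per user is answered.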

So as you can see, it's now running, it has this little icon showing that it's a launcher job instead of a local job. I can click into it to see the output, and then it will just continue to run.

So I'm going to go back to this test script, change it so it doesn't take so long, and then run it as a launcher job. Just to show off that you're running in a different environment: this current environment that I'm in is running R version 3.5.1, and I'm going to start this launcher job in a different image.

Some of you might be familiar with the Docker Hub image r-base, and that will be running the latest version of R, which is 3.5.2. So I'll see what the output is here.

Start the job. Look at the output. And as you can see, in my current running environment, I'm running 3.5.1, and this image is 3.5.2. So you can see how you can run different images with whatever dependencies you have for whatever scripts you want to run.

So now I want to show off containerized sessions. Tareef showed this off a little bit in the keynote. If you're not familiar with the RStudio Server Pro homepage, it shows your currently active sessions, your currently running jobs, and your completed jobs. I'm going to go here to New Session to create a new session, and from here I have the same type of options as I had before. So I'm going to name this, choose the same cluster, use the default two CPUs and 10 gigs of memory, and the same default image.

So now this currently running RSP session is running in Kubernetes and not on the RSP box. It looks exactly the same and runs exactly the same, but now any CPU and memory you use is actually on that cluster.

Frequently asked questions

So I have a couple of frequently asked questions here. How do I get the launcher and plug-ins? For the 1.2 release, the launcher will come bundled with RStudio Server Pro and include those three plug-ins that we talked about. The release is currently available as a preview if you want to download it and try it out.

Will this come to RStudio Desktop? The answer is yes, it will, but in the 1.3 release; we're going to implement it there after this release, when we start working on 1.3. If you're unfamiliar with RStudio Desktop Pro, which is the only edition that will get it, it's an enhanced version of the IDE with enterprise features that we're shipping with the 1.2 release.

What about other RStudio products? So the launcher is something we eventually want our products to standardize on and so RStudio Connect and Package Manager will be integrating with the launcher in future releases.

So to sum everything up: RStudio Job Launcher is a separate service bundled with RStudio Server Pro, available now as a preview release. The three plug-ins are Kubernetes, Slurm, and Local. And you can do some cool stuff with it, like launching ad hoc jobs, so you can run those computationally expensive R or Python scripts somewhere else, and running containerized sessions, so you get an isolated environment with all the desired dependencies, can create reproducible environments, and can scale. So it should make admins happy.

So I have some further reading here: the admin guide for RStudio Server Pro and the Job Launcher documentation. If you have any additional questions or want to talk to us, please meet us in the professional lounge. Thank you very much.

Q&A

So we actually have time for a couple of questions, if anyone has some.

I have a question about admin access. Can you set the Kubernetes clusters that users have access to individually? Because some users may need a cluster in one region, some users in another, things like that. Yes, you can use the RStudio profiles to set whatever clusters are available to them, as well as how much CPU and memory they can utilize, so you can set those restrictions.

What happens to your jobs if the RStudio server crashes? They'll actually continue to run, and if RStudio is able to come back, you should be able to reconnect to them.

Is there a command line for the launcher, or can we call it inside R? So, we're actually going to be working on an implementation in our rstudioapi package, so that you can launch jobs through that R package.