
Purrrfectly parallel, purrrfectly distributed (Charlie Gao, Posit) | posit::conf(2025)
Purrrfectly parallel, purrrfectly distributed
Speaker(s): Charlie Gao
Abstract: purrr is a powerful functional programming toolkit that has long been a cornerstone of the tidyverse. In 2025, it receives a modernization that lets you harness the power of all the computing cores on your machine, dramatically speeding up map operations. More excitingly, it opens the doors to distributed computing. Through the mirai framework used by purrr, this is made embarrassingly simple. For those in small businesses, or even large ones: wherever there is a spare server on your network, you can now put it to good use in simple, straightforward steps. Let us show you how distributed computing is no longer the preserve of those with access to high-performance compute clusters.
Materials: https://shikokuchuo-posit2025.share.connect.posit.cloud/
Subscribe to posit::conf updates: https://posit.co/about/subscription-management/
Transcript
This transcript was generated automatically and may contain errors.
Great, thank you Max, and thank you for attending this session and my talk on Purrrfectly parallel, purrrfectly distributed.
I'm Charlie Gao, I'm a software engineer on the open source team here at Posit. And hopefully you're suitably curious about one of the two things in the title. So, in this presentation, I'm hopefully going to be answering the question, why do we have parallel purrr, and why do we have it now? And then distributed purrr, why might I want to use that? And if so, how would I go about doing that?
Background: purrr and the tidyverse
So I'm going to start with this diagram, which I borrowed from Hadley Wickham: his famous life cycle of a data scientist. So where does purrr feature in this diagram? Well, you can find it in the bottom left-hand corner, in the programming section. So it's not used for any particular step in this life cycle, but it's used throughout the life cycle as a means to program more reliably and consistently.
So how has the purrr ecosystem evolved over time? Well, purrr itself is now 10 years old. Purrr has one of the most adorable, most recognizable hex logos in the entire R ecosystem. And it's maintained, of course, by our very own Hadley Wickham.
Some of you may have also come across this package, furrr. furrr has been with us for a surprisingly long time, since 2018, in fact. And this is a parallel copy of purrr, which uses similar syntax to run things in parallel. It's maintained by Davis Vaughan, one of my colleagues, also in the tidyverse. And he has also been instrumental in developing parallel purrr. So we've taken his years of experience in maintaining furrr and put them into making parallel purrr the best that it can be today.
Introducing mirai
Fast forward to 2022. This is the year that I released mirai. mirai is a modern way to do parallel and async programming in R.
So if we cross purrr and mirai to do parallel purrr, first of all, what would the hex logo look like? For those observant amongst you, you would have noticed that furrr upgrades the cat of purrr into a lion, albeit still a very sleepy one. So if you take mirai, which is much more high-powered, and you cross it with purrr, what would you get? And this is something that I actually asked Hadley himself a couple of weeks ago. And he answered without any hesitation at all. He said, you would probably get four cats. So now you have literally four cats running in parallel.
So jokes on the hex logo aside, this was born in July 2025, so a couple of months ago. This is very much the newest, shiniest thing in the R ecosystem.
Why parallel purrr, and why now
So to answer the question, why do we have parallel purrr and why do we have it now? The answer is both purrr and mirai were designed to be used reliably in programming. So if you remember back to the first slide, purrr was used for programming. And why can I say the same thing about mirai? Well, mirai has this design philosophy.
It's based on four pillars. It has a modern foundation, which gives it the performance that it has. And because it was designed for production, you can have the confidence to deploy it everywhere.
What do I mean by this? Well, in terms of foundation, it's built on NNG. This is short for Nanomsg Next Generation, a high-performance messaging library. And because it's built on that, we have the optimal connection types built in: things like inter-process communication, TCP, and TLS if you need it. We've also engineered the base R serialization mechanism to better support custom serialization of newer cross-language data formats, things like Arrow or safetensors.
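As a hedged sketch of what custom serialization can look like (the `serial_config()` helper and the `serial` argument to `daemons()` follow the pattern in the mirai documentation, but check the current docs for exact signatures; the arrow package is assumed to be installed):

```r
library(mirai)

# Register a custom serialization for Arrow tables: objects of class
# "ArrowTabular" are written to raw bytes using Arrow's own IPC format
# instead of base R serialization when they cross process boundaries.
cfg <- serial_config(
  class = "ArrowTabular",
  sfunc = arrow::write_to_raw,
  ufunc = arrow::read_ipc_stream
)

# Apply the configuration when setting up daemons
daemons(4, serial = cfg)
```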
Because we have this foundation, mirai scales to millions of tasks over thousands of connections, and it can do all of this at 1,000 times the efficiency we could otherwise achieve. The extra bit of engineering that we've done is to create these zero-latency promises. So if you use mirai together with something like Shiny, you will notice the extra responsiveness from that.
And we come to this point, which is why we have parallel purrr in the first place. mirai was designed to be used in production, i.e. reliably. And it does this by adopting a very clear evaluation model, with a clean separation between what's in your current environment and what's in the environment of the process which is actually doing the parallel work.
We've minimized the complexity of the package itself, and we've eliminated any hidden state. What this means for you is the code you write, you can expect it to be evaluated transparently and reliably over time.
And you can literally deploy this anywhere. So locally, on your own laptop, on a remote machine over SSH, and this is what I'm gonna be covering in the second half of the presentation, or if you're lucky enough to have access to a high-performance compute cluster, you can use your scheduler of choice. And then mirai has this concept of modular compute profiles. So you can actually be connected to all three resource types at the same time.
And I'm gonna say one last thing about mirai before we move on to distributed computing, and that is: we've not just put mirai into purrr, we've rolled mirai out across our ecosystem. So mirai is now the primary async backend for Shiny. It is the only built-in async evaluator for Plumber 2. And it's also an option for running things in parallel in tidymodels, such as hyperparameter tuning.
Now, mirai is an r-lib package, so you can find extensive documentation at mirai.r-lib.org. The next part of the presentation does contain some code, but the idea is not for you to memorize the syntax in particular; it's just to show you intuitively how mirai works a bit under the hood, and to show you some patterns that you might not have come across yet. So just try and follow the code, no need to memorize it; there's extensive documentation.
Distributed purrr
So how might you use distributed purrr? And all I mean by distributed is: have your compute run on another machine, so not on the laptop or the local machine you're using. Well, first and most simply, you can use it to replace local compute. So maybe you're working on an underpowered laptop, and you actually need to do heavy computations, which are more suited to a workstation or a server.
Another way you can use distributed compute is to extend your local compute. So as well as using all the cores on your laptop, you can also use another machine, or even multiple machines on your network.
And lastly, I'm gonna show you how to use differentiated compute. So this is a case where you have a workflow where certain calculations may require a GPU, which you don't have on your local machine, but on another machine.
So before I do that, for those of you who have not encountered the new parallel functionality in purrr yet, what we've done is we've added this in parallel adverb. So whereas before you would map a function over a list or a vector, all you would do now is map in parallel a function over a list or a vector.
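As a minimal sketch of this change (assuming purrr 1.1.0 or later together with the mirai package):

```r
library(purrr)
library(mirai)

daemons(4)  # four background daemons on this machine

# sequential map
map(1:4, \(x) x^2)

# the same map, run in parallel: just wrap the function in in_parallel()
map(1:4, in_parallel(\(x) x^2))

daemons(0)  # reset when done
```

Note that functions passed to `in_parallel()` run in fresh processes, so any objects they rely on must be passed in explicitly rather than captured from the calling environment.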
And again, before we move on to the distributed case, how would you run that parallel map on your local computer, using all the cores available to you? Well, as a user, you would call this function daemons in the mirai package. Daemons sets where and how you actually do this parallelization. And it's for the user to set this, because when you write the code, your code may be in a package, for example, and you don't know the resources that are available to the end user. So when the user actually runs your code, they will call this daemons function. And daemons sets where and how this parallelization actually happens.
So in this case, to run things on your local laptop, all you would do is pass a local URL to daemons. What this does is create a socket which is listening for connections from other processes on your own machine. Think of this as setting up a base station. You would then call this function launch_local. This actually spawns local processes, which dial back in to your local URL. So these are your daemons. And this is how mirai actually works under the hood. In practice, you would just do this in one line and simply say how many processes you want.
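The two-step setup and its one-line equivalent look roughly like this (a sketch; see the mirai documentation for exact signatures):

```r
library(mirai)

# Step 1: listen on a local URL -- the "base station"
daemons(url = local_url())

# Step 2: spawn 6 local processes that dial back in
launch_local(6)

# Equivalent one-liner: 6 daemons on the local machine
daemons(6)
```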
Replacing local compute with SSH
So in the distributed case, you're not just starting processes on your local machine. You're starting them on another machine. Now, SSH is the most ubiquitous way to run things on other machines. So you would have come across this if you've ever set up a home network, for example. Or conversely, if you spun up a cloud instance, something like an EC2 on AWS. The first thing you would get in that case is the IP address of your cloud instance, and you would SSH in, and you would do all the setup for that instance.
Well, how would you use that in mirai? Well, you would take that IP address, or if it's just a machine on your local network, you can also use the host name, and you would pass that to this ssh_config function. What that does is create a remote launch configuration. This is a simple list; there's no other special sauce in it. So you can save it, you can reuse it, you can even put this line in your .Rprofile.
Next, we would call daemons like we did before, and in this case, we would use a host URL. The only difference here is that a host URL is available for connections from your network, so not just on your local machine. And then instead of launch local, we now say launch remote, and we pass it the configuration that you've created. You can also do all of that on one line.
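Putting those steps together as a sketch (the `ssh://` address is a placeholder for your own machine's host name or IP):

```r
library(mirai)

# A remote launch configuration: just a simple list, so it can be
# saved, reused, or created in your .Rprofile
cfg <- ssh_config(remotes = "ssh://10.75.32.90")

# Listen on a URL reachable from the network, then launch over SSH
daemons(url = host_url())
launch_remote(6, remote = cfg)

# Or all in one line:
daemons(6, url = host_url(), remote = cfg)
```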
And once you've done that, when you launch those daemons, mirai actually SSHes into those machines and creates those processes, which dial back in to your host URL. So your host has to be able to accept these incoming connections. Now, that might already be set up, but if you're just using something like a local laptop, typically that won't be the case, because there'll be a firewall blocking all of these connections.
What you would do in that case is use SSH tunneling to get around this problem. Now, I'm gonna explain SSH tunneling in the simplest way possible, without the use of a diagram. So, whereas before, you have a remote machine connecting back into your local machine, all that's happening now is, on both ends, connections are local to local, so think of it as a loop, local to local, and you only have the SSH connection acting as a bridge.
What does that mean for mirai? Well, all we have to do is add tunnel = TRUE when creating the configuration. That's the only change you need. But when you set up daemons, instead of a host URL, you move back to a local URL, because all connections are actually local now. You just have to remember to set tcp = TRUE, because SSH tunneling requires TCP connections. And we launch_remote the same as before. You can do it in one line, and once you've done that, you will have six daemons on a remote machine, regardless of the firewall settings that you have.
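The tunnelled version of the one-liner, again as a sketch with a placeholder address:

```r
library(mirai)

# tunnel = TRUE routes the daemon connections through SSH itself,
# so no inbound firewall ports need to be open on your laptop
cfg <- ssh_config(remotes = "ssh://10.75.32.90", tunnel = TRUE)

# Connections are now local at both ends of the tunnel;
# SSH tunneling requires TCP, hence tcp = TRUE
daemons(6, url = local_url(tcp = TRUE), remote = cfg)
```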
Extending and differentiating compute
So, the second way you can use distributed compute is to extend your local compute. So you can use both the cores on your local machine and remote daemons on another machine at the same time. Or this could be daemons on many machines. As an example of doing this: you create your remote configuration, and you set daemons with a host URL. This is available to other computers, but also to processes on your own machine. So you can launch daemons remotely, and you can launch daemons locally. And you will have 12 daemons to spread your work over.
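A sketch of mixing the two (placeholder address as before):

```r
library(mirai)

cfg <- ssh_config(remotes = "ssh://10.75.32.90")

# A host URL is visible to other machines *and* to local processes
daemons(url = host_url())

# 6 daemons on the remote machine, 6 on this one: 12 in total
launch_remote(6, remote = cfg)
launch_local(6)
```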
Lastly, I'm gonna show you how to do differentiated compute. So this is where parts of your code need to be run on a GPU, which you may not have on your laptop, but which is available on another machine. To do this, you would make use of mirai's concept of compute profiles. These are different daemon settings which can coexist at the same time. So in this example, if I call daemons(6), that uses the default compute profile. I can also set up remote daemons and pass a name to the .compute argument. This creates a compute profile called "gpu". And how would I use this? If I do a parallel map, that will use the default compute profile, which in this case will run locally. Anything I wrap in with() and daemons, where I give it the name of the compute profile "gpu", will run with that compute profile. So in this case, remotely.
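A sketch of this pattern. The host name, the exact `with()`/`.compute` wiring, and the `cpu_step()`/`gpu_step()` functions are all illustrative placeholders, so check the mirai documentation for the precise idiom:

```r
library(purrr)
library(mirai)

# Default compute profile: 6 local daemons
daemons(6)

# Placeholder functions standing in for real work
cpu_step <- function(x) x^2
gpu_step <- function(x) x^2  # stands in for GPU-bound work

# Uses the default (local) profile
map(1:10, in_parallel(cpu_step))

# A named "gpu" profile on a remote machine; the map inside this
# with() scope runs on that profile instead ("gpu-server" is a placeholder)
cfg <- ssh_config(remotes = "ssh://gpu-server")
with(
  daemons(6, url = host_url(), remote = cfg, .compute = "gpu"),
  map(1:10, in_parallel(gpu_step))
)
```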
And that's all I've got to share with you. I've shown you how to do distributed computing to replace local compute, extend local compute, and to do differentiated compute. Thank you very much, I'm open for questions.
Q&A
Package versions can vary in a remote parallel plan. Is it the responsibility of the developer to ensure versions are compatible, or does mirai attempt to enforce similar environments? So mirai has a very explicit evaluation model; it doesn't do anything implicit under the hood, and it doesn't try to second-guess what you have in your code. So it's your responsibility to have the same package versions on your own computer and on any other remote computer that you want to run mirai daemons on.
Do mirai's parallel tasks need an underlying event loop, like that of the later package? So the event-driven promises use the later event loop, and they use it in a very efficient way. The model for mirai is that it uses a thread pool, so it's not limited to just the R event loop, but it uses later basically to call into the R event loop when needed, for things like promises in applications like Shiny and Plumber.
Are there any tips for deciding how many daemons to use? Right, so mirai consciously doesn't automatically detect or attempt to set the number of workers; this is always the responsibility of the user. The reason for this is that it's fairly simple to know how many cores your laptop has, so most of the time automatic detection isn't that useful, but also the software can never know what else is going on on your machine at the same time. So you might be running something else; it may not even be in R. And this becomes especially important if you're actually on a shared resource, so not just your local laptop but, for example, a high-performance compute cluster, where you have to share the resources with others, so you need to explicitly specify what those resources are.


