
Purrrfectly parallel, purrrfectly distributed (Charlie Gao, Posit) | posit::conf(2025)
Purrrfectly parallel, purrrfectly distributed
Speaker(s): Charlie Gao
Abstract: purrr is a powerful functional programming toolkit that has long been a cornerstone of the tidyverse. In 2025, it receives a modernization that lets you harness the power of all the computing cores on your machine, dramatically speeding up map operations. More excitingly, it opens the doors to distributed computing. Through the mirai framework used by purrr, this is made embarrassingly simple. For those in small businesses, or even large ones: wherever there is a spare server on your network, you can now put it to good use in simple, straightforward steps. Let us show you how distributed computing is no longer the preserve of those with access to high-performance compute clusters.
Materials: https://shikokuchuo-posit2025.share.connect.posit.cloud/
Subscribe to posit::conf updates: https://posit.co/about/subscription-management/
Transcript
This transcript was generated automatically and may contain errors.
Great, thank you Max, and thank you for attending this session and my talk on Purrrfectly parallel, purrrfectly distributed.
I'm Charlie Gao, I'm a software engineer on the open source team here at Posit. And hopefully you're suitably curious about one of the two things in the title. So, in this presentation, I'm hopefully going to be answering the question, why do we have parallel purrr, and why do we have it now? And then distributed purrr, why might I want to use that? And if so, how would I go about doing that?
Background: purrr and the tidyverse
So I'm going to start with this diagram, which I borrowed from Hadley Wickham: his famous life cycle of a data scientist. So where does purrr feature in this diagram? Well, you can find it in the bottom left-hand corner, in the programming section. So it's not used for any particular step in this life cycle, but it's used throughout the life cycle as a means to program more reliably and consistently.
So how has the purrr ecosystem evolved over time? Well, purrr itself is now 10 years old. Purrr has one of the most adorable, most recognizable hex logos in the entire R ecosystem. And it's maintained, of course, by our very own Hadley Wickham.
Some of you may have also come across this package, furrr. furrr has been with us for a surprisingly long time, since 2018, in fact. And this is a parallel copy of purrr, which uses similar syntax to run things in parallel. It's maintained by Davis Vaughan, one of my colleagues, also in the tidyverse. And he has also been instrumental in developing parallel purrr. So we've taken his years of experience in maintaining furrr and put them into making parallel purrr the best that it can be today.
Introducing mirai
Fast forward to 2022. This is the year that I released mirai. mirai is a modern way to do parallel and async programming in R.
So if we cross purrr and mirai to do parallel purrr, first of all, what would the hex logo look like? For those observant amongst you, you would have noticed that furrr upgrades the cat of purrr into a lion, albeit still a very sleepy one. So if you take mirai, which is much more high-powered, and you cross it with purrr, what would you get? And this is something that I actually asked Hadley himself a couple of weeks ago. And he answered without any hesitation at all. He said, you would probably get four cats. So now you have literally four cats running in parallel.
So jokes on the hex logo aside, this was born in July 2025, so a couple of months ago. This is very much the newest, shiniest thing in the R ecosystem.
Why parallel purrr, and why now
So to answer the question, why do we have parallel purrr and why do we have it now? The answer is both purrr and mirai were designed to be used reliably in programming. So if you remember back to the first slide, purrr was used for programming. And why can I say the same thing about mirai? Well, mirai has this design philosophy.
It's based on four pillars. It has a modern foundation, which gives it the performance that it has. And because it was designed for production, you can have the confidence to deploy it everywhere.
What do I mean by this? Well, in terms of foundation, it's built on NNG. This is short for Nanomsg Next Generation, a high-performance messaging library. And because it's built on that, we have the optimal connection types built in: things like inter-process communication, TCP, and TLS if you need it. We've also engineered the base R serialization mechanism to better support custom serialization of newer cross-language data formats, things like Arrow or safetensors.
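As a hedged sketch of what custom serialization can look like (the `serial_config()` helper and the `serial` argument to `daemons()` follow the pattern in the mirai documentation, but check the current docs for exact signatures; the arrow package is assumed to be installed):

```r
library(mirai)

# Register a custom serialization for Arrow tables: objects of class
# "ArrowTabular" are written to raw bytes using Arrow's own IPC format
# instead of base R serialization when they cross process boundaries.
cfg <- serial_config(
  class = "ArrowTabular",
  sfunc = arrow::write_to_raw,
  ufunc = arrow::read_ipc_stream
)

# Apply the configuration when setting up daemons
daemons(4, serial = cfg)
```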
Because we have this foundation, mirai scales to millions of tasks over thousands of connections, and it can do all of this at 1,000 times the efficiency we could otherwise achieve. The extra bit of engineering that we've done is to create these zero-latency promises. So if you use mirai together with something like Shiny, you will notice the extra responsiveness from that.
And we come to this point, which is why we have parallel purrr in the first place. mirai was designed to be used in production, i.e. reliably. And it does this by adopting a very clear evaluation model, with a clean separation between what's in your current environment and what's in the environment of the process which is actually doing the parallel work.
We've minimized the complexity of the package itself, and we've eliminated any hidden state. What this means for you is the code you write, you can expect it to be evaluated transparently and reliably over time.
And you can literally deploy this anywhere. So locally, on your own laptop, on a remote machine over SSH, and this is what I'm gonna be covering in the second half of the presentation, or if you're lucky enough to have access to a high-performance compute cluster, you can use your scheduler of choice. And then mirai has this concept of modular compute profiles. So you can actually be connected to all three resource types at the same time.
And I'm gonna say one last thing about mirai before we move on to distributed computing, and that is: we've not just put mirai into purrr, we've rolled mirai out across our ecosystem. So mirai is now the primary async backend for Shiny. It is the only built-in async evaluator for Plumber 2. And it's also an option for running things in parallel in tidymodels, such as hyperparameter tuning.
Now, mirai is an r-lib package, so you can find extensive documentation at mirai.r-lib.org. The next part of the presentation does contain some code, but the idea is not for you to memorize the syntax in particular; it's just to show you intuitively how mirai works a bit under the hood, and to show you some patterns that you might not have come across yet. So just try and follow the code, no need to memorize it; there's extensive documentation.
Distributed purrr
So how might you use distributed purrr? And all I mean by distributed is: have your compute run on another machine, so not on the laptop or the local machine you're using. Well, first and most simply, you can use it to replace local compute. So maybe you're working on an underpowered laptop, and you actually need to do heavy computations, which are more suited to a workstation or a server.
Another way you can use distributed compute is to extend your local compute. So as well as using all the cores on your laptop, you can also use another machine, or even multiple machines on your network.
And lastly, I'm gonna show you how to use differentiated compute. So this is a case where you have a workflow where certain calculations may require a GPU, which you don't have on your local machine, but on another machine.
So before I do that, for those of you who have not encountered the new parallel functionality in purrr yet, what we've done is we've added this in parallel adverb. So whereas before you would map a function over a list or a vector, all you would do now is map in parallel a function over a list or a vector.
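As a minimal sketch of this change (assuming purrr 1.1.0 or later together with the mirai package):

```r
library(purrr)
library(mirai)

daemons(4)  # four background daemons on this machine

# sequential map
map(1:4, \(x) x^2)

# the same map, run in parallel: just wrap the function in in_parallel()
map(1:4, in_parallel(\(x) x^2))

daemons(0)  # reset when done
```

Note that functions passed to `in_parallel()` run in fresh processes, so any objects they rely on must be passed in explicitly rather than captured from the calling environment.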
And again, before we move on to the distributed case, how would you run that parallel map on your local computer, using all the cores available to you? Well, as a user, you would call this function daemons in the mirai package. Daemons sets where and how you actually do this parallelization. And it's for the user to set this, because when you write the code, your code may be in a package, for example, and you don't know the resources that are available to the end user. So when the user actually runs your code, they will call this daemons function. And daemons sets where and how this parallelization actually happens.
So in this case, to run things on your local laptop, all you would do is pass a local URL to daemons. What this does is create a socket which is listening for connections from other processes on your own machine. Think of this as setting up a base station. You would then call this function launch_local. This actually spawns local processes, which dial back in to your local URL. So these are your daemons. And this is how mirai actually works under the hood. In practice, you would just do this in one line and simply say how many processes you want.
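The two-step setup and its one-line equivalent look roughly like this (a sketch; see the mirai documentation for exact signatures):

```r
library(mirai)

# Step 1: listen on a local URL -- the "base station"
daemons(url = local_url())

# Step 2: spawn 6 local processes that dial back in
launch_local(6)

# Equivalent one-liner: 6 daemons on the local machine
daemons(6)
```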
Replacing local compute with SSH
So in the distributed case, you're not just starting processes on your local machine. You're starting them on another machine. Now, SSH is the most ubiquitous way to run things on other machines. So you would have come across this if you've ever set up a home network, for example. Or conversely, if you spun up a cloud instance, something like an EC2 on AWS. The first thing you would get in that case is the IP address of your cloud instance, and you would SSH in, and you would do all the setup for that instance.
Well, how would you use that in mirai? Well, you would take that IP address, or if it's just a machine on your local network, you can also use the host name, and you would pass that to this ssh_config function. What that does is create a remote launch configuration. This is a simple list; there's no other special sauce in it. So you can save it, you can reuse it, you can even put this line in your .Rprofile.
Next, we would call daemons like we did before, and in this case, we would use a host URL. The only difference here is that a host URL is available for connections from your network, so not just on your local machine. And then instead of launch local, we now say launch remote, and we pass it the configuration that you've created. You can also do all of that on one line.
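Putting those steps together as a sketch (the `ssh://` address is a placeholder for your own machine's host name or IP):

```r
library(mirai)

# A remote launch configuration: just a simple list, so it can be
# saved, reused, or created in your .Rprofile
cfg <- ssh_config(remotes = "ssh://10.75.32.90")

# Listen on a URL reachable from the network, then launch over SSH
daemons(url = host_url())
launch_remote(6, remote = cfg)

# Or all in one line:
daemons(6, url = host_url(), remote = cfg)
```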
And once you've done that, when you launch those daemons, mirai actually SSHes into those machines and creates those processes, which dial back in to your host URL. So your host has to be able to accept these incoming connections. Now, that might already be set up, but if you're just using something like a local laptop, typically that won't be the case, because there'll be a firewall blocking all of these connections.
What you would do in that case is use SSH tunneling to get around this problem. Now, I'm gonna explain SSH tunneling in the simplest way possible, without the use of a diagram. So, whereas before, you have a remote machine connecting back into your local machine, all that's happening now is, on both ends, connections are local to local, so think of it as a loop, local to local, and you only have the SSH connection acting as a bridge.
What does that mean for mirai? Well, all we have to do is add tunnel = TRUE when creating the configuration. That's the only change you need. But when you set up daemons, instead of a host URL, you move back to a local URL, because all connections are actually local now. You just have to remember to set tcp = TRUE, because SSH tunneling requires TCP connections. And we launch_remote the same as before. You can do it in one line, and once you've done that, you will have six daemons on a remote machine, regardless of the firewall settings that you have.
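The tunnelled version of the one-liner, again as a sketch with a placeholder address:

```r
library(mirai)

# tunnel = TRUE routes the daemon connections through SSH itself,
# so no inbound firewall ports need to be open on your laptop
cfg <- ssh_config(remotes = "ssh://10.75.32.90", tunnel = TRUE)

# Connections are now local at both ends of the tunnel;
# SSH tunneling requires TCP, hence tcp = TRUE
daemons(6, url = local_url(tcp = TRUE), remote = cfg)
```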
Extending and differentiating compute
So, the second way you can use distributed compute is to extend your local compute. So you can use both the cores on your local machine and remote daemons on another machine at the same time. Or this could be daemons on many machines. As an example of doing this: you create your remote configuration, and you set daemons with a host URL. This is available to other computers, but also to processes on your own machine. So you can launch daemons remotely, and you can launch daemons locally. And you will have 12 daemons to spread your work over.
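A sketch of mixing the two (placeholder address as before):

```r
library(mirai)

cfg <- ssh_config(remotes = "ssh://10.75.32.90")

# A host URL is visible to other machines *and* to local processes
daemons(url = host_url())

# 6 daemons on the remote machine, 6 on this one: 12 in total
launch_remote(6, remote = cfg)
launch_local(6)
```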
Lastly, I'm gonna show you how to do differentiated compute. So this is where parts of your code need to be run on a GPU, which you may not have on your laptop, but which is available on another machine. To do this, you would make use of mirai's concept of compute profiles. These are different daemon settings which can coexist at the same time. So in this example, if I call daemons(6), that uses the default compute profile. I can also set up remote daemons and pass a name to the .compute argument. This creates a compute profile called "gpu". And how would I use this? If I do a parallel map, that will use the default compute profile, which in this case will run locally. Anything I wrap in with() and daemons, where I give it the name of the compute profile "gpu", will run with that compute profile. So in this case, remotely.
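A sketch of this pattern. The host name, the exact `with()`/`.compute` wiring, and the `cpu_step()`/`gpu_step()` functions are all illustrative placeholders, so check the mirai documentation for the precise idiom:

```r
library(purrr)
library(mirai)

# Default compute profile: 6 local daemons
daemons(6)

# Placeholder functions standing in for real work
cpu_step <- function(x) x^2
gpu_step <- function(x) x^2  # stands in for GPU-bound work

# Uses the default (local) profile
map(1:10, in_parallel(cpu_step))

# A named "gpu" profile on a remote machine; the map inside this
# with() scope runs on that profile instead ("gpu-server" is a placeholder)
cfg <- ssh_config(remotes = "ssh://gpu-server")
with(
  daemons(6, url = host_url(), remote = cfg, .compute = "gpu"),
  map(1:10, in_parallel(gpu_step))
)
```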
And that's all I've got to share with you. I've shown you how to do distributed computing to replace local compute, extend local compute, and to do differentiated compute. Thank you very much, I'm open for questions.
Q&A
Package versions can vary in a remote parallel plan. Is it the responsibility of the developer to ensure versions are compatible, or does mirai attempt to enforce similar environments? So mirai has a very explicit evaluation model; it doesn't do anything implicit under the hood, and it doesn't try to second-guess what you have in your code. So it's your responsibility to have the same package versions on your own computer and on any other remote computer that you want to run mirai daemons on.
Do mirai's parallel tasks need an underlying event loop, like that of the later package? So the event-driven promises use the later event loop, and they use it in a very efficient way. The model for mirai is that it uses a thread pool, so it's not limited to just the R event loop, but it uses later basically to call into the R event loop when needed, for things like promises in applications like Shiny and Plumber.
Are there any tips for deciding how many daemons to use? Right, so mirai consciously doesn't automatically detect or attempt to set the number of workers; this is always the responsibility of the user. The reason for this is that it's fairly simple to know how many cores your laptop has, so most of the time automatic detection isn't that useful, but also the software can never know what else is going on on your machine at the same time. So you might be running something else; it may not even be in R. And this becomes especially important if you're actually on a shared resource, so not just your local laptop but, for example, a high-performance compute cluster, where you have to share the resources with others, so you need to explicitly specify what those resources are.


