Resources

{mirai} and {crew}: next-generation async to supercharge {promises}, Plumber, Shiny, and {targets}

{mirai} is a minimalist, futuristic, and reliable way to parallelise computations – either on the local machine, or across the network. It combines the latest scheduling technologies with fast, secure connection types. With built-in integration to {promises}, {mirai} provides a simple and efficient asynchronous back-end for Shiny and Plumber apps. The {crew} package extends {mirai} to batch computing environments for massively parallel statistical pipelines, e.g. Bayesian modeling, simulations, and machine learning. It consolidates tasks in a central {R6} controller, auto-scales workers, and helps users create plug-ins for platforms like SLURM and AWS Batch. It is the new workhorse powering high-performance computing in {targets}.

Talk by Charlie Gao and Will Landau
Slides: https://wlandau.github.io/posit2024
GitHub Repo: https://github.com/wlandau/posit2024
`mirai`: https://shikokuchuo.net/mirai/
`crew`: https://wlandau.github.io/crew/

Oct 31, 2024
20 min

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

Good morning, everyone. I'm Charlie Gao. I'm the author of mirai, an asynchronous evaluation framework for R. I'm really glad to be joined here by Will Landau. He's not quite on stage, but will be.

Will Landau is the author of targets and also the author of crew, which extends mirai to high-performance computing. We'd also like to think that we're joined in spirit, at least, by this guy, Joe Cheng, creator of Shiny. You'll notice he's not on stage because he's at this moment on another stage talking about some other little project that he's been working on, something about bringing Shiny to Python. So it will be down to me to do the big reveal on another little project that all three of us have been working on collaboratively, just a little something to advance the state of async for R and Shiny. And after I'm done with my part of the presentation, Will is going to highlight some of the important scientific work that all of this is supporting.

What is async?

So going back to the title of the presentation, what do we actually mean when we say next-generation async? Or what do we mean by async in the first place? Well, let me give you a very simple definition, which is if we say that parallelism is the ability to do multiple things at once, then async is just not waiting around while that's happening.

Now, this definition, this concept isn't very controversial in a lot of other programming languages. So even if you've never programmed in Rust or Go or JavaScript, you might have heard that async generally works very well in those contexts. Most modern websites have some JavaScript in them. So if you open a web page, perhaps on your mobile, and you click a button, you generally expect it to just work. When you're hitting that button, you're not likely to be wondering whether that thing is going to hang on you.

And that conveniently brings us to where we are now in R. So if I bring back this definition of async, so we have this in mind, then I suggest that perhaps most of us are missing what I term a first class async experience. And why do I say that? Well, I think for most of us, our typical experience with parallelism is with the parallel package, which has been part of base R for over 20 years now. And that is simply parallel and not async. So you send tasks to a lot of parallel workers, but you're waiting until they all happen, and then you collect the results at the end.

Of course, there are many other excellent packages, such as Polar, which is limited to local parallelism because it relies on writing and reading files from disk. But again, that's not truly async: if you create what are called persistent sessions, then while those sessions are all busy doing jobs, you cannot send another job to them. And that's actually the same with the future package, which also promises async. But again, if you simply have more tasks than the number of workers, this will actually block your session.

Introducing mirai

So what would it actually take to bring first-class async to R? Well, Nanomsg Next Generation, or NNG, implements async in C. And this is async on a par with what I've mentioned before in Go and Rust, etc. It's a very lightweight C library, and it implements a brokerless model. By that, I just mean there's no need for a central server anywhere. And nanonext, the package, brings NNG to R.

So coming now to mirai, the star of the show, well, at least my part of the show. And in case you're still wondering, mirai is simply Japanese for future. And it's an async evaluation framework for R. Again, if I just bring back the definition of async so we have this in mind, then mirai uses nanonext to deliver true first-class async. And what I mean by that is you can connect thousands of parallel processes and launch millions of tasks all at once. So many more tasks than processes. And because mirai uses nanonext, which is so lightweight, the response times of these processes come right down from the millisecond to the microsecond range.
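To make the workflow concrete, here is a minimal sketch of launching daemons and evaluating a task asynchronously with mirai. This is an illustrative example, not from the talk; the `slow_sum` function is a stand-in for any long-running computation.

```r
# Minimal async evaluation with mirai (assumes the mirai package is installed)
library(mirai)

daemons(4)  # launch 4 background daemon processes

# mirai() sends the expression for async evaluation and returns immediately
m <- mirai(
  slow_sum(x),
  slow_sum = function(x) { Sys.sleep(1); sum(x) },
  x = 1:10
)

unresolved(m)  # TRUE while the task is still running; never blocks

result <- m[]  # collecting with `[]` waits for and returns the value (55)

daemons(0)     # shut the daemons down when finished
```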


mirai promises for Shiny and Plumber

And for all you Shiny and Plumber developers out there, you'll have noticed my use of the word responsiveness, and that's obviously something that you care a lot about. To use async in either of those contexts, you use a promise, or what's called a promising object, which is easily converted into a promise. And a promise in this context is just the availability of a value sometime in the future.

So to date, to create these promises, you could use an actual future from the future package. But as we've seen, this will actually block your session if you have more tasks than workers. Or you can use what's called a future promise, which is implemented in the promises package, but this has always been experimental. Even more importantly, both of these solutions require constant polling to check whether the future value is available. Every 0.1 seconds, it's checking: has this future resolved? Has it resolved? Has it resolved?

And you can imagine if your Shiny app has quite a few of these promises around, then it might look something like this. And you can sort of see that there's a lot going on. When in fact, none of this is really necessary. And this is the next generation of promises. Because now mirai is a promising object. It has native support for Shiny extended tasks and Plumber. And it's completely event driven. That means there's no need for any of this polling. And because you're not waiting for the next time it polls, there's zero latency.
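The event-driven pattern the speaker describes can be sketched with Shiny's ExtendedTask backed by a mirai. This is an illustrative reconstruction, not the app shown on stage; it assumes shiny 1.8.1 or later and the mirai package, and the coin-flip computation is a simplified stand-in.

```r
# Sketch: an event-driven Shiny ExtendedTask backed by a mirai promise
library(shiny)
library(mirai)

daemons(2)  # background processes to evaluate tasks

ui <- fluidPage(
  actionButton("go", "Flip 10,000 coins"),
  textOutput("result")
)

server <- function(input, output, session) {
  task <- ExtendedTask$new(function(n) {
    # a mirai is a promising object: resolution is event-driven, no polling
    mirai(mean(sample(c("H", "T"), n, replace = TRUE) == "H"), n = n)
  })
  observeEvent(input$go, task$invoke(10000))
  output$result <- renderText(task$result())
}

shinyApp(ui, server)
```

Because the mirai resolves via an event rather than a polling loop, the app session stays unblocked while the flips run in a daemon.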

So now what we have is something that looks like this. And this is simply launching one million promises all at once. And this is actually something you can try at home. But we're not going to show you this here. Instead, we're going to show a somewhat simpler Shiny app, which does 10,000 coin flips asynchronously in parallel. And you can see on the left is the old polling approach. And here, no matter how fast you poll, the update will always be discrete and in chunks. Whereas on the right, you have the new mirai promises and you can see it updating almost continuously. So on this slide, I just have one last thing to say, which is what you're seeing, rather, is not even mirai. It's actually crew, which does use the same mirai promises underneath. But who better to talk to you about crew than Will Landau himself.

crew: extending mirai to high-performance computing

Thanks. I'm Will. I am a statistician and package developer. And I am mirai's all-time biggest fan. It solves a problem that I and a lot of other people have needed solved for years. And I'm really glad to join my friend and collaborator Charlie to talk about an extension to mirai that I wrote called crew. To begin, I work in the life sciences, and my group and I design and analyze clinical trials to make the best-informed decisions we can about whether a medical treatment works and is safe, or whether it doesn't.

And we throw all the advanced modeling we can at this, because these are some of the most important decisions that can be made with statistics. And those models take a long time to run, potentially hours for a single model. And when we design the clinical trial, it gets even more challenging, because to design a trial, we often have to simulate it. And to simulate a trial, we often need to run these models thousands of times. If you were to try this on your local laptop, you would never get it done. We need entire armies of computers to do this. Armies of computers working together on a cluster that's managed with a resource manager like Slurm or Grid Engine, or a cloud system, or a service like AWS Batch.

And these systems are really tricky to use. For the uninitiated, they're difficult even to access. And we get into challenges of overhead and cost. To give you an idea, if you request a virtual machine or a job on one of these systems, you may be waiting several minutes, maybe even hours, to actually have that resource available and running for you. And so it's tempting to keep it running, because what if you need to submit work? But if it's idling and sitting there, then on your company's cluster it might be occupying resources that other people could be using, or in the cloud, you risk racking up a huge bill. And this is where crew comes in.

crew is designed to navigate these kinds of tricky waters. So the solution to the cost overhead trade-off that crew provides is autoscaling. crew plugs into high-performance computing environments through the crew cluster package or the crew AWS Batch package or other plugins to make it easier to access these systems. And thanks to mirai, which does by far the most difficult part of all of this, when we manage those tasks, the overhead is really low.

How crew works

So how does all this work? So in crew, there's this thing called a controller, which is an interface that sits locally. And then there are workers, and each worker is an R process. And these R processes could live on different computers on the local network, and they do one or more tasks. Now, these workers need to get the instructions somehow. They need to know what to do, and they need to be able to send results back. So there's got to be communication. And this is exactly the challenge that mirai automatically solves.

So the controller calls a function in mirai that says: here are a bunch of WebSockets on the local network; all you workers out there, come find me. And then each worker dials into one of those WebSockets with another function in mirai, and each worker accepts instructions, does tasks, and sends the results back. This way, mirai doesn't care where the workers are, but it manages the tasks, and crew gets the workers up and running.
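The listen-and-dial pattern described here can be sketched with mirai's own functions. The host address and port below are placeholders, and in practice crew issues these calls for you.

```r
# Sketch of the mirai networking that crew builds on.
# Run the first snippet on the controlling machine, the second on each worker.

# Controller side: listen on a WebSocket URL and wait for workers to connect
library(mirai)
daemons(n = 2, url = "ws://10.0.0.1:5555")  # placeholder host and port

# Worker side (a separate R process, possibly on another machine):
# dial in to the controller's WebSocket and start serving tasks
library(mirai)
daemon("ws://10.0.0.1:5555")
```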

Autoscaling

And the way mirai lets crew operate gives it the flexibility to autoscale. To demonstrate, let's take this pipeline for simulation. This is a directed acyclic graph, like the one Shatish mentioned in the previous presentation. And each of these tasks, from simulated datasets to models, is kind of like a unit of demand for a worker. So if you have tasks to do, you require workers to do them. When crew sees a pipeline like this, one of the first things it does is spin up the number of workers required for the tasks that can currently be done. And so a worker starts. And if the task load increases in the pipeline, if you can fan out and accomplish parallel tasks simultaneously, then crew will increase the number of workers to meet that demand.

When you proceed along the pipeline, crew reuses the workers that may have been running previously, but are just finishing up tasks. And that way you avoid the overhead of requesting brand new resources. And when the workload subsides, demand decreases, some workers may be sleeping on the job, doing nothing. And so to avoid the extra cost, we evict them from the team.

So this fluctuation in demand can happen any number of times over the course of a pipeline, and crew will scale up and scale down automatically. And you can control that with settings when you create the controller: how many idle seconds do you wait before you terminate a worker? How many tasks can a worker run before it retires? And this gives us a continuum between perfectly transient workers, which run only one task and exit, and fully persistent workers, on the other hand, which keep running until there's no more work to do. This continuum between transient and persistent workers is something we've never had in R before. And with crew, you can hit any point in the middle.
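The settings mentioned here map onto arguments of the controller constructor. The values below are illustrative, not recommendations.

```r
# Sketch: controller settings placing workers on the transient-persistent
# continuum (assumes the crew package is installed)
library(crew)

controller <- crew_controller_local(
  workers = 4,        # upper bound on simultaneous workers (autoscaling ceiling)
  seconds_idle = 10,  # terminate a worker after 10 idle seconds
  tasks_max = Inf     # tasks_max = 1 would give perfectly transient workers
)
```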


Managing tasks and plugins

Managing tasks is, in some sense, the easier part, because all the configuration and optimization details are part of the controller. crew has standard verbs for this. To submit a task, there's push: you give it an R expression and the data it needs, and it submits the task to a worker. The pop verb gets the result of a task, if one is available. And there are functional-programming verbs like map, walk, and collect to work with multiple tasks at once.
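A short sketch of these verbs, assuming the crew package and a local controller; the expressions are illustrative placeholders.

```r
# Sketch of the standard crew task verbs (assumes the crew package)
library(crew)

controller <- crew_controller_local(workers = 2)
controller$start()

# push: submit a task as an R expression plus the data it needs
controller$push(name = "example", command = sum(x), data = list(x = 1:10))

# pop: retrieve the result of a task if one is available (NULL otherwise)
result <- controller$pop()

# map: submit and collect many tasks at once, functional-programming style
out <- controller$map(command = x^2, iterate = list(x = 1:4))

controller$terminate()  # shut down the workers when done
```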

Like I mentioned before, crew can plug into a lot of different high-performance computing systems. Slurm and Grid Engine are a couple examples, but also AWS Batch. And in addition to the settings I described earlier, it's just a matter of plugging in the configuration details for how you want to access Batch, your job definition, job queue. Networking may be a little tricky, but that's at the platform level. And happy to talk about that afterwards.

crew is also designed for you to be able to write your own plugins. If you know how to launch a worker or terminate one in your system, then you can write one of your own. And crew is designed for users to be able to do this. And there's a whole package vignette that describes it.

crew with Shiny and targets

Now, as Charlie mentioned, a lot of this enables really cool stuff to be done with Shiny. So far, I've only been talking about parallel computing, but async computing is equally possible. The controller's push method returns a mirai task object that automatically becomes a promise when it needs to. And these promises automatically invalidate Shiny reactive expressions as soon as the promise resolves, so you get really responsive apps. Every time this text changes on the screen, a promise is resolving and triggering an update. This can happen really fast, and it unblocks the current app session when it does. So whether you're running a single app session or multiple sessions of a single app, you see really high responsiveness like this.

Up to this point, I haven't even mentioned targets. And targets is really the tool I designed crew to support; it's the primary use case. The previous talk mentioned pipelines arranged in a directed acyclic graph, and that's exactly the contribution of targets. All you need to do to use crew, and all the distributed and parallel computing that comes with it, is to supply a controller. The rest, targets takes care of. And there is a whole vignette on that as well.
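The "supply a controller" step can be sketched as a `_targets.R` file. This is an illustrative reconstruction assuming the targets and crew packages; `simulate_rep` is a hypothetical user-defined simulation function.

```r
# Sketch of a _targets.R file that hands crew a controller for the pipeline
library(targets)

tar_option_set(
  # targets dispatches all parallel work through this crew controller
  controller = crew::crew_controller_local(workers = 4, seconds_idle = 10)
)

list(
  tar_target(reps, seq_len(1000)),
  # fan out: one branch per replication, run across crew workers
  tar_target(sims, simulate_rep(reps), pattern = map(reps))
)
```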

To recap, mirai is this fantastic, blindingly fast parallel package with first-class async that we have needed for a long time. And crew tries to take this to the next level by plugging it into high-performance computing systems, where people doing scientific work need mirai the most. Thanks very much.

Q&A

Well, thank you both. That's incredible. You have my mind racing on a bunch of different ways I want to apply this in my normal day-to-day. So we have time for maybe just one or two questions from the Slido. Are there any hardware requirements for mirai and crew?

Not that I can think of. It supports R 3.6 onwards. And for crew, it really depends on the plugin. So if you want to use it locally, there's a local controller. And if you use a plugin for, let's say, Slurm, then you'll need access to Slurm. But that's very situation-specific.

What thoughts do you have about coro and async/await?

So mirai supports coro. You can use it instead of a future or any other asynchronous function; it plugs well into coro. I've tested it with Lionel, the author of coro. And yeah, you can use coro to write more succinct code than if you were to chain promises. A good way to use mirai.
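The async/await style mentioned here can be sketched as follows, assuming the coro and mirai packages. Because a mirai converts to a promise, coro's await() can suspend on it directly instead of chaining then() calls.

```r
# Sketch: awaiting a mirai inside a coro async function
# instead of chaining promises (assumes coro and mirai are installed)
library(coro)
library(mirai)

daemons(1)  # one background daemon to evaluate the task

task <- async(function() {
  # the mirai is promise-convertible, so await() suspends until it resolves
  value <- await(mirai({
    Sys.sleep(1)  # stand-in for a long-running computation
    42
  }))
  print(value)
})

task()  # returns immediately; the body resumes when the mirai resolves
```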