
Henrik Bengtsson | Future: Simple Async, Parallel & Distributed Processing in R | RStudio (2020)
Future is a minimal and unifying framework for asynchronous, parallel, and distributed computing in R. It is designed for robustness, consistency, scalability, extendability, and adoptability - all in the spirit of "developer writes code once, user runs it anywhere". It is being used in production for high-performance computing and asynchronous UX, among other things. In this talk, I will discuss common feature requests, recent progress we have made, and what is in the pipeline.
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
We're going to get started with the last talk in this section by Henrik Bengtsson about Future, Simple Async, Parallel & Distributed Processing in R. What's next?
Hi everyone, my name is Henrik. I work at UCSF here in San Francisco. But I first want to say that I'm super honored to be mentioned in the same sentence as Steve about the foreach project, so thank you very much. I'm a trained computer scientist, I did a PhD in mathematical statistics, and I'm doing very applied work, mostly in cancer genomics. We process a lot of large data sets, and I prefer to do things from R. By large, we're talking terabytes and sometimes hundreds of terabytes. So R is the glue for me.
What is parallel processing?
So what do we mean by parallel processing? The most common use case is that we want things to finish sooner: if we run it sequentially, it's going to finish tomorrow, but if we can parallelize, it finishes today. And we typically go to our local computer and say, I'm going to use my different cores here. Distributed computing is basically when you want to reach out to more machines and parallelize over those. And I do want to mention asynchronous programming and processing; it's related to parallel processing, and the most common use case people might know here is a Shiny application. There is a button, you click on that button, and that needs to run a lot of computation that would take a long time to finish. You don't want your user interface to freeze up, so you want to run that in the background, in parallel. The future framework supports these needs.
So we got a brilliant introduction to parallel processing in R already; I'm going to repeat a little bit. If you as a developer have sequential code and want to move to parallel processing, you want it to be simple; it shouldn't bother your development. So let's consider this simple case. We have a vector of 20 elements, and you want to call a function called slow() on each of the elements. What happens inside is that R will basically go through them step by step, and if slow() takes one minute each time, this is going to take 20 minutes. And in the parallel package, we have the beautiful mclapply(), which we've heard about in several talks now. It looks very similar, so as a developer it feels familiar. The only difference is that you need to specify how many cores you want to parallelize over.
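The sequential and forked versions might look like this; slow() is a hypothetical stand-in for the one-minute-per-element function, shortened here so the example runs quickly:

```r
library(parallel)

slow <- function(x) { Sys.sleep(0.1); x^2 }  # stand-in; imagine ~1 minute each
x <- 1:20

y_seq <- lapply(x, slow)                  # elements processed one by one
y_par <- mclapply(x, slow, mc.cores = 2)  # forked workers share the elements
```

As the talk notes, the forking that mclapply() relies on is not available on Windows.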
So if you parallelize on two cores, what happens inside is that the data, the x, is split up into chunks, workers are started, and they start working on these downstream. And since there are two workers on the same task, it's going to finish sooner, in about 10 minutes. So everything is done and everything is beautiful. And as we heard, if you do this, you might say, oops, you discover it does not work on Windows. And there are other cases where you should not use the forking property of mclapply(). So you start exploring other things. In the parallel package, we also have parLapply() and that family of functions, which run background R sessions, and they work on all operating systems. foreach is the big player here; there are others, and in Bioconductor you have other things again. So you start thinking, which one should I pick here?
Another thing is what operating system should I support? Today, I only need Linux. But maybe next year, someone will ask to support Windows or you get collaborators on Windows or Mac. So you start thinking about that on day one when you start. You don't want to think about these things, but you have to. Will it scale? It runs on my local machine. What if we get 96 cores? Will it work? Will it work on five machines and so on? And do I need to maintain several code bases? We already saw that example in the previous talk, where typically, people have sequential code. And by adding support for parallel code, it's very common to say, if not parallel, run the sequential or run some parallel version of it.
So now you have two code bases to maintain; you have to figure out test coverage for both, and you have basically twice the risk of bugs. So you have questions like this. And then, when you've tried everything, you might actually end up with everything working for you, and then there is some user saying, it didn't work for me, because some global variable wasn't handled.
The future framework
So that's where the future framework and the future package come in. It tries to take away all those things so you don't have to worry about them. It's a simple unifying solution for parallel APIs in R. The motto is write once, run anywhere: 100% cross-platform. It installs super easily on all platforms, so you don't have to worry about that. I claim it's very well tested; it's got lots of CPU mileage and is used in production. And it's meant so that things just work. There will be corner cases, but they can be fixed. Most of the time, it will just work.
The motto is write once, run anywhere. It's got lots of CPU mileage and used in production. But most of the time, it will just work.
So I need to tell you a little bit about the core of this. If you peek into mclapply(), parLapply(), foreach, and a lot of other solutions, they have in principle three things in common, three building blocks. And we pulled those concepts out into the future package. One is that you have an R expression and you want to run it in parallel; there's a function in the future package called future() where you say, take this expression, run it in parallel, and keep track of it in a handle. Then you can go to that thing in the background and ask, are you done? You call resolved(), which returns TRUE or FALSE, and depending on the answer, your code can do different things. And of course, at the end of the day, you want the value of the expression to come back to you; that's a function called value(), and if it's not done yet, it will wait until it's done. So these three constructs are very powerful.
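A minimal sketch of the three constructs, assuming a hypothetical slow() function and a multisession backend:

```r
library(future)
plan(multisession, workers = 2)  # evaluate futures in background R sessions

slow <- function(x) { Sys.sleep(0.5); x^2 }  # hypothetical expensive function

f <- future(slow(7))  # 1) create: start evaluating the expression in parallel
resolved(f)           # 2) poll: TRUE/FALSE, without blocking
value(f)              # 3) collect: wait if needed, then return the result, 49
```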
And I didn't invent them. This came up in the mid-'70s by different groups. And it's been studied in computer science. That's where the term future was coined.
So with these things, we can now build our first parallel lapply. And this code works; it shows the gist of the idea. For the real thing, you want a few more bells and whistles under the hood, but it works. The idea is that instead of calling the function on each element as-is in your current session, you spawn each call off as a future, in parallel. So you get a list of futures here, handles to those. And at the end of your function, you say, give me the values of these, because that's what I'm interested in, and return them. And behind the scenes, the workers will start working on the futures as they become available, and this runs in parallel.
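The gist might be sketched as below; the real future.apply::future_lapply() adds chunking, load balancing, and parallel-safe random numbers on top of this:

```r
library(future)
plan(multisession)  # the user decides the backend

parallel_lapply <- function(X, FUN) {
  futures <- lapply(X, function(x) future(FUN(x)))  # spawn one future per element
  lapply(futures, value)                            # collect values, waiting as needed
}
```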
So these are very powerful building blocks, I'd say. And if you want to write parallel-processing packages, they allow you to stay with your coding style. If you like base R lapply(), there is the future.apply package that provides future_lapply(), future_sapply(), and all the different styles there, and they work the same; you basically just have to prepend with future_, and then magic happens. If you like the tidyverse and use purrr with its map() functions, Davis created the furrr package with the same idea: prepend with future_, and it will just work. And if you like foreach, or you want to use that API style, there is the doFuture adapter for foreach, which brings all of the future framework and options into the foreach framework. So you can stick with whatever you like here. And other people can invent other APIs on top of this, too, in the future, so to speak.
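Side by side, the three styles might look like this; slow() and x are stand-ins:

```r
library(future.apply)  # base-R style
library(furrr)         # purrr style
library(doFuture)      # foreach style (attaches foreach)

slow <- function(x) { Sys.sleep(0.1); x^2 }  # stand-in
x <- 1:8

y1 <- future_lapply(x, slow)  # cf. lapply(x, slow)
y2 <- future_map(x, slow)     # cf. purrr::map(x, slow)

registerDoFuture()            # route %dopar% through futures
y3 <- foreach(xi = x) %dopar% slow(xi)
```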
Letting users control parallelization
So I didn't mention how you specify the number of cores and such things. In the same spirit as foreach, that is not your decision as a developer; that is something the user should decide, because you can never predict what your current or future users will have as computational resources. Some user might want to run your code sequentially; they just call plan(sequential), and they get that setting, and this is actually the default. They might want to run it in parallel on their local machine; they can specify how many cores, and all cores are used by default. Someone may have a local machine, a machine at another university, and at the same time something in the cloud; they can specify this and parallelize the work over all of them. Or if you have a high-performance compute cluster, you can use the Slurm job scheduler, for instance, to parallelize. But the key thing here is that, as a developer, your code will look exactly the same all the time. So you don't have to worry about this.
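Switching backends is a one-line, user-side change; the host names and worker counts below are made up for illustration:

```r
library(future)

plan(sequential)                 # the default: plain sequential processing
plan(multisession, workers = 4)  # four background R sessions on this machine
plan(cluster, workers = c("n1.example.org", "n2.example.org"))  # remote machines
plan(future.batchtools::batchtools_slurm)  # submit via the Slurm scheduler
```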
So I've been saying it's worry-free, or at least worry-less. Does it work? Don't take my word for it. It's been on CRAN since 2015, but the code is older. drake, for instance, one of the first workflow managers in R, similar to Unix make with bells and whistles, can use parallelization with futures underneath. And Shiny: if you use asynchronous Shiny, at the very bottom you can use futures to achieve that.
And from a testing perspective, there's a lot of testing: all platforms possible, all new versions of R, and of course reverse-dependency checks on now more than 100 packages; those are the immediate dependencies. Another thing: because of the foreach success, there are 600 or 700 packages on CRAN that use foreach (plyr, caret, glmnet, et cetera), and a lot of them have really good examples. So I tweaked the example code to force them to run in parallel with futures. That allows me to test futures in all these real use cases, and they pass. So that's a validation for futures.
And there is also the future.tests package. If anyone comes along and wants to write a new backend for parallel processing, they can use this suite of validation tests to make sure they conform to the future standard. And that's a nice contract, because as a developer you don't have to worry about testing on all possible future backends; that's taken care of. And as a user, you know that when you pick up a validated backend, it will work.
What's new: output, warnings, and progress bars
So what's new? In 2019, I finally wrapped up the design of how output, messages, warnings, and errors are handled. Errors have been handled since the early days, but people have been asking about warnings and output, and we heard that today.
If you look at this toy example, it's not very efficient, but you have a vector of -1, 10, and 30, and you want to take the log of those. You can do that in an lapply() call, and in each iteration you also print which element you're operating on. You're going to get the output z = -1, z = 10, z = 30, and also a warning, because log(-1) is not defined. If you try to do the same thing in, for instance, mclapply(), it's going to be silent. If you're in a Linux terminal, you're probably going to get some of the messages, but not the warning, and that's by pure luck and not by design.
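The toy example might look like this in base R:

```r
x <- c(-1, 10, 30)
y <- lapply(x, function(z) {
  cat("z =", z, "\n")  # standard output from each iteration
  log(z)               # log(-1) signals a warning ("NaNs produced")
})
```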
So now, since 2019, the future framework at its core, regardless of whether you use foreach, future.apply, or furrr, will relay the output and the warnings back to your main session. And you can use capture.output() and calling handlers and things like that; everything is relayed back to you in the same order as you would expect from sequential processing.
Okay, so now this important thing, the number-one feature that everyone asks for when they do parallel processing: I want progress bars. And why is that? Because they're probably running something that takes a long time, and when you spawn it off in parallel, you have no idea what's happening. It's a really hard problem to solve in a proper way. You can do it for certain things, on a local machine and so on, but maybe you run on a remote machine, and the key with the future framework is that your code should look the same.
So I think I solved this. Forget about parallel processing for now; if you're just interested in progress updates, you can use this package: progressr. It's been on CRAN since two weeks ago. It's an inclusive, unifying API for progress updates, and it works with any map-reduce style you want to come up with. The idea here, too, is that there's an API for developers and one for end users. As a developer, the only thing you should worry about is how many progress steps you want to report on. You do that with a function called progressor(); it creates a new function that you can call, and every time you call that function, it signals, I made one step. You can send an optional message along if you want to.
The user just has to decide, I want to listen to these updates or not. And they do that with the with_progress() function, which they wrap around the part they want to listen to.
So how does this work? Here's another toy example called slow_sum(). It's basically just doing an sapply(), calling the slow function again on each element of the vector x, and at the end it sums all the results together and returns that.
So if the vector x has 50 elements and slow() takes one minute, it's going to take 50 minutes, and it's going to be dead silent; you don't know if it's running or not. As a developer, you can now just add these two things: you create a progressor along x, and then you call it in every iteration with a message saying which element you're working on.
And the user calls it again. Of course, nothing happens yet, because they need to use with_progress(). Once they do, they're listening, and they're going to get progress updates.
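Put together, the developer and user sides might look like this; function and variable names are illustrative, and slow() is again a stand-in:

```r
library(progressr)

slow <- function(x) { Sys.sleep(0.1); x }  # stand-in for the slow step

slow_sum <- function(x) {
  p <- progressor(along = x)  # developer: one progress step per element
  y <- sapply(x, function(z) {
    p(sprintf("x = %g", z))   # signal one step, with an optional message
    slow(z)
  })
  sum(y)
}

with_progress(total <- slow_sum(1:10))  # user: opt in to progress updates
```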
So I mentioned inclusive; I mean inclusive design. Not everyone wants to see a progress bar in the terminal. Someone might want audio feedback, so the end user can choose to use the beepr package: when the thing starts, you get one sound, then you get ticks along the way, and when it's done, it goes ka-ching. And you can combine these; you can do both.
Someone might want to send a Telegram message, a tweet, or an email when it's done or when it starts. It fits that model, too. So it's the end user who decides how progress is reported.
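Reporting is configured on the user side via handlers(); for example, a terminal progress bar combined with audio feedback, assuming the beepr package is installed (the function f() is illustrative):

```r
library(progressr)
handlers("txtprogressbar", "beepr")  # combine handlers: bar plus sound

f <- function(n) {
  p <- progressor(steps = n)  # n progress steps in total
  for (i in seq_len(n)) { Sys.sleep(0.2); p() }
}
with_progress(f(10))
```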
This also works with Shiny. If you use progressr, there's a withProgressShiny() function that you can use, and it will just work with the Shiny progress interface.
Progress bars in parallel
OK, so now back to parallel processing. Both the future framework and the progressr framework were designed so that they shouldn't have to know about each other. There's basically one line of code in future that knows about progressr; I had to do that, but it doesn't depend on the package. It's just a condition class, technically.
And these progression updates will work in parallel here. And to answer the question that was asked in the previous session, can we get more live updates? Yes, you can, for certain types of backends. Again, your code doesn't change.
So this is the same toy example again, and the only thing I changed is that I made the sapply() call use futures. So now, if the user wants to run it in parallel or on remote machines, they can. In this case, the user decided to run it in parallel on the local machine and also see progress bars in the terminal, plus sound. And in the message here, I added which process ID was outputting it. These updates come in through a polling mechanism, but it's very fast, so progress arrives in a different order depending on which worker is first, but it keeps coming in. This works with the future framework since two weeks ago.
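The parallel version might be sketched like this; the only change to the function body is sapply() becoming future_sapply(), and slow() is again a stand-in:

```r
library(future.apply)  # also attaches future
library(progressr)

slow <- function(x) { Sys.sleep(0.1); x }  # stand-in for the slow step

slow_sum <- function(x) {
  p <- progressor(along = x)
  y <- future_sapply(x, function(z) {
    p(sprintf("pid %d: x = %g", Sys.getpid(), z))  # tag updates with the worker PID
    slow(z)
  })
  sum(y)
}

plan(multisession, workers = 2)         # the user's choice of backend
with_progress(total <- slow_sum(1:10))  # updates arrive as workers report
```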
One thing before I go to my last slide: there is a big hope and a good chance that in R 4.0, which comes out this spring, global calling handlers will be available. That basically means that, in this case, we don't need to use with_progress(); the end user just calls the function. So that's a cleaner interface. And there are other use cases for global handlers that I'm really keen on.
Okay, so the take-home message: try it. As a developer, focus on what to parallelize, not how; that's the user's decision. Stay with your favorite coding style. And it's automagic; as with foreach, globals, packages, output, warnings, errors, and progress bars are handled. Thank you.
As a developer, focus on what to parallelize, not how. That's the user's decision.
Q&A
Thank you, Henrik. So we have time for a couple questions. The first one here, what is the communication overhead between processes? And then I would kind of add on top of that, like when you have large data sets, like when you're transferring them between processes, what's your advice there?
Yeah, so two things here. Parallel processing in R at this level, not native multithreading in compiled code, is more for things that take several seconds to run; you don't do super small things. And typically, if the task runs for, say, a whole minute, you don't have to worry about the orchestration overhead. You also have overhead from passing data, and there are things in the furrr package and in future.apply that try to do this in a clever way, so you don't send out everything. But you do need to transfer data if you have big data; of course, if you send that to Mars or whatever, it will take a long time, so you have to think about that. And that's the corner question: should the data live locally, or should it be distributed?
Whoa, that's out of my league, is that the expression in English? But I think you can use the later package; you can use them combined. You can use later and run multiple things in parallel for later.
So you kind of spoke on this a little bit, but how does future parallelize on local machines? Does it fork R processes like parallel, or what does it do on Windows?
So that's a good question. I didn't have time to go through all the details, but the future package doesn't really implement the backends itself; it's largely leveraging the parallel package. So you have the option to use multicore forking like mclapply(), or cluster workers, but backends can also do other things. There's the callr package, for instance, and there's a callr backend. The future package tries to stay away from implementing that itself, but it comes with the parallel package backends built in.
