
Henrik Bengtsson | Future: Simple Async, Parallel & Distributed Processing in R | RStudio (2020)
Future is a minimal and unifying framework for asynchronous, parallel, and distributed computing in R. It is designed for robustness, consistency, scalability, extendability, and adoptability - all in the spirit of "developer writes code once, user runs it anywhere". It is being used in production for high-performance computing and asynchronous UX, among other things. In this talk, I will discuss common feature requests, recent progress we have made, and what is in the pipeline.
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
We're going to get started with the last talk in this section by Henrik Bengtsson about Future, Simple Async, Parallel & Distributed Processing in R. What's next?
Hi everyone, my name is Henrik. I work at UCSF here in San Francisco. But I first want to say that I'm super honored to be mentioned in the same sentence as Steve about the foreach project, so thank you very much. I'm a trained computer scientist, I did a PhD in mathematical statistics, and I'm doing very applied work, mostly in cancer genomics. We process a lot of large data sets, and I prefer to do things from R. By large, we're talking terabytes and sometimes hundreds of terabytes. So R is the glue for me.
What is parallel processing?
So what do we mean by parallel processing? The most common use case is that we want things to finish sooner: if we run it sequentially, it's going to finish tomorrow, but if we can parallelize, it finishes today. And we typically go to our local computer and say, I'm going to use my different cores here. Distributed computing is basically when you want to reach out to more machines and parallelize over those. And I do want to mention asynchronous programming and processing; it's related to parallel processing, and the most common use case people might know here is a Shiny application. There is a button, you click on that button, and that needs to run a lot of computation that would take a long time to finish. You don't want your user interface to freeze up, so you want to run that in the background, in parallel. The future framework supports these needs.
So we got a brilliant introduction to parallel processing in R already; I'm going to repeat a little bit. If you as a developer have sequential code and want to move to parallel processing, you want it to be simple; it shouldn't bother your development. So let's consider this simple case. We have a vector of 20 elements, and you want to call a function called slow() on each of the elements. What happens inside is that R will basically go through them step by step, and if slow() takes one minute each time, this is going to take 20 minutes. And in the parallel package, we have the beautiful mclapply(), which we've heard about in several talks now. It looks very similar, so as a developer it feels familiar. The only difference is that you need to specify how many cores you want to parallelize over.
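The sequential and forked versions might look like this; slow() is a hypothetical stand-in for the one-minute-per-element function, shortened here so the example runs quickly:

```r
library(parallel)

slow <- function(x) { Sys.sleep(0.1); x^2 }  # stand-in; imagine ~1 minute each
x <- 1:20

y_seq <- lapply(x, slow)                  # elements processed one by one
y_par <- mclapply(x, slow, mc.cores = 2)  # forked workers share the elements
```

As the talk notes, the forking that mclapply() relies on is not available on Windows.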
So if you parallelize on two cores, what happens inside is that the data, the x, is split up into chunks, workers are started, and they start working on these downstream. And since there are two workers on the same task, it's going to finish sooner, in about 10 minutes. So everything is done and everything is beautiful. And as we heard, if you do this, you might say, oops, you discover it does not work on Windows. And there are other cases where you should not use the forking property of mclapply(). So you start exploring other things. In the parallel package, we also have parLapply() and that family of functions, which run background R sessions, and they work on all operating systems. foreach is the big player here; there are others, and in Bioconductor you have other things again. So you start thinking, which one should I pick here?
Another thing is what operating system should I support? Today, I only need Linux. But maybe next year, someone will ask to support Windows or you get collaborators on Windows or Mac. So you start thinking about that on day one when you start. You don't want to think about these things, but you have to. Will it scale? It runs on my local machine. What if we get 96 cores? Will it work? Will it work on five machines and so on? And do I need to maintain several code bases? We already saw that example in the previous talk, where typically, people have sequential code. And by adding support for parallel code, it's very common to say, if not parallel, run the sequential or run some parallel version of it.
So now you have two code bases to maintain; you have to figure out test coverage for both, and you have basically twice the risk of bugs. So you have questions like this. And then, when you've tried everything, you might actually end up with everything working for you, and then there is some user saying, it didn't work for me, because some global variable wasn't handled.
The future framework
So that's where the future framework and the future package come in. It tries to take away all those things so you don't have to worry about them. It's a simple unifying solution for parallel APIs in R. The motto is write once, run anywhere: 100% cross-platform. It installs super easily on all platforms, so you don't have to worry about that. I claim it's very well tested; it's got lots of CPU mileage and is used in production. And it's meant so that things just work. There will be corner cases, but they can be fixed. Most of the time, it will just work.
The motto is write once, run anywhere. It's got lots of CPU mileage and used in production. But most of the time, it will just work.
So I need to tell you a little bit about the core of this. If you peek into mclapply(), parLapply(), foreach, and a lot of other solutions, they have in principle three things in common, three building blocks. And we pulled those concepts out into the future package. One is that you have an R expression and you want to run it in parallel; there's a function in the future package called future() where you say, take this expression, run it in parallel, and keep track of it in a handle. Then you can go to that thing in the background and ask, are you done? You call resolved(), which returns TRUE or FALSE, and depending on the answer, your code can do different things. And of course, at the end of the day, you want the value of the expression to come back to you; that's a function called value(), and if it's not done yet, it will wait until it's done. So these three constructs are very powerful.
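A minimal sketch of the three constructs, assuming a hypothetical slow() function and a multisession backend:

```r
library(future)
plan(multisession, workers = 2)  # evaluate futures in background R sessions

slow <- function(x) { Sys.sleep(0.5); x^2 }  # hypothetical expensive function

f <- future(slow(7))  # 1) create: start evaluating the expression in parallel
resolved(f)           # 2) poll: TRUE/FALSE, without blocking
value(f)              # 3) collect: wait if needed, then return the result, 49
```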
And I didn't invent them. This came up in the mid-'70s by different groups. And it's been studied in computer science. That's where the term future was coined.
So with these things, we can now build our first parallel lapply. And this code works; it shows the gist of the idea. For the real thing, you want a few more bells and whistles under the hood, but it works. The idea is that instead of calling the function on each element as-is in your current session, you spawn each call off as a future, in parallel. So you get a list of futures here, handles to those. And at the end of your function, you say, give me the values of these, because that's what I'm interested in, and return them. And behind the scenes, the workers will start working on the futures as they become available, and this runs in parallel.
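The gist might be sketched as below; the real future.apply::future_lapply() adds chunking, load balancing, and parallel-safe random numbers on top of this:

```r
library(future)
plan(multisession)  # the user decides the backend

parallel_lapply <- function(X, FUN) {
  futures <- lapply(X, function(x) future(FUN(x)))  # spawn one future per element
  lapply(futures, value)                            # collect values, waiting as needed
}
```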
So these are very powerful building blocks, I'd say. And if you want to write parallel-processing packages, they allow you to stay with your coding style. If you like base R lapply(), there is the future.apply package that provides future_lapply(), future_sapply(), and all the different styles there, and they work the same; you basically just have to prepend with future_, and then magic happens. If you like the tidyverse and use purrr with its map() functions, Davis created the furrr package with the same idea: prepend with future_, and it will just work. And if you like foreach, or you want to use that API style, there is the doFuture adapter for foreach, which brings all of the future framework and options into the foreach framework. So you can stick with whatever you like here. And other people can invent other APIs on top of this, too, in the future, so to speak.
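Side by side, the three styles might look like this; slow() and x are stand-ins:

```r
library(future.apply)  # base-R style
library(furrr)         # purrr style
library(doFuture)      # foreach style (attaches foreach)

slow <- function(x) { Sys.sleep(0.1); x^2 }  # stand-in
x <- 1:8

y1 <- future_lapply(x, slow)  # cf. lapply(x, slow)
y2 <- future_map(x, slow)     # cf. purrr::map(x, slow)

registerDoFuture()            # route %dopar% through futures
y3 <- foreach(xi = x) %dopar% slow(xi)
```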
Letting users control parallelization
So I didn't mention how you specify the number of cores and such things. In the same spirit as foreach, that is not your decision as a developer; that is something the user should decide, because you can never predict what your current or future users will have as computational resources. Some user might want to run your code sequentially; they just call plan(sequential), and they get that setting, and this is actually the default. They might want to run it in parallel on their local machine; they can specify how many cores, and all cores are used by default. Someone may have a local machine, a machine at another university, and at the same time something in the cloud; they can specify this and parallelize the work over all of them. Or if you have a high-performance compute cluster, you can use the Slurm job scheduler, for instance, to parallelize. But the key thing here is that, as a developer, your code will look exactly the same all the time. So you don't have to worry about this.
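Switching backends is a one-line, user-side change; the host names and worker counts below are made up for illustration:

```r
library(future)

plan(sequential)                 # the default: plain sequential processing
plan(multisession, workers = 4)  # four background R sessions on this machine
plan(cluster, workers = c("n1.example.org", "n2.example.org"))  # remote machines
plan(future.batchtools::batchtools_slurm)  # submit via the Slurm scheduler
```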
So I've been saying it's worry-free, or at least worry-less. Does it work? Don't take my word for it. It's been on CRAN since 2015, but the code is older. drake, for instance, one of the first workflow managers in R, similar to Unix make with bells and whistles, can use parallelization with futures underneath. And Shiny: if you use asynchronous Shiny, at the very bottom you can use futures to achieve that.
And from a testing perspective, there's a lot of testing: all platforms possible, all new versions of R, and of course reverse-dependency checks on now more than 100 packages; those are the immediate dependencies. Another thing: because of the foreach success, there are 600 or 700 packages on CRAN that use foreach (plyr, caret, glmnet, et cetera), and a lot of them have really good examples. So I tweaked the example code to force them to run in parallel with futures. That allows me to test futures in all these real use cases, and they pass. So that's a validation for futures.
And there is also the future.tests package. If anyone comes along and wants to write a new backend for parallel processing, they can use this suite of validation tests to make sure they conform to the future standard. And that's a nice contract, because as a developer you don't have to worry about testing on all possible future backends; that's taken care of. And as a user, you know that when you pick up a validated backend, it will work.
What's new: output, warnings, and progress bars
So what's new? In 2019, I finally wrapped up the design of how output, messages, warnings, and errors are handled. Errors have been handled since the early days, but people have been asking about warnings and output, and we heard that today.
If you look at this toy example, it's not very efficient, but you have a vector of -1, 10, and 30, and you want to take the log of those. You can do that in an lapply() call, and in each iteration you also print which element you're operating on. You're going to get the output z = -1, z = 10, z = 30, and also a warning, because log(-1) is not defined. If you try to do the same thing in, for instance, mclapply(), it's going to be silent. If you're in a Linux terminal, you're probably going to get some of the messages, but not the warning, and that's by pure luck and not by design.
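The toy example might look like this in base R:

```r
x <- c(-1, 10, 30)
y <- lapply(x, function(z) {
  cat("z =", z, "\n")  # standard output from each iteration
  log(z)               # log(-1) signals a warning ("NaNs produced")
})
```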
So now, since 2019, the future framework at its core, regardless of whether you use foreach, future.apply, or furrr, will relay the output and the warnings back to your main session. And you can use capture.output() and calling handlers and things like that; everything is relayed back to you in the same order as you would expect from sequential processing.
Okay, so now this important thing, the number-one feature that everyone asks for when they do parallel processing: I want progress bars. And why is that? Because they're probably running something that takes a long time, and when you spawn it off in parallel, you have no idea what's happening. It's a really hard problem to solve in a proper way. You can do it for certain things, on a local machine and so on, but maybe you run on a remote machine, and the key with the future framework is that your code should look the same.
So I think I solved this. Forget about parallel processing for now; if you're just interested in progress updates, you can use this package: progressr. It's been on CRAN since two weeks ago. It's an inclusive, unifying API for progress updates, and it works with any map-reduce style you want to come up with. The idea here, too, is that there's an API for developers and one for end users. As a developer, the only thing you should worry about is how many progress steps you want to report on. You do that with a function called progressor(); it creates a new function that you can call, and every time you call that function, it signals, I made one step. You can send an optional message along if you want to.
The user just has to decide, I want to listen to these updates or not. And they do that with the with_progress() function, which they wrap around the part they want to listen to.
So how does this work? Here's another toy example called slow_sum(). It's basically just doing an sapply(), calling the slow function again on each element of the vector x, and at the end it sums all the results together and returns that.
So if the vector x has 50 elements and slow() takes one minute, it's going to take 50 minutes, and it's going to be dead silent; you don't know if it's running or not. As a developer, you can now just add these two things: you create a progressor along x, and then you call it in every iteration with a message saying which element you're working on.
And the user calls it again. Of course, nothing happens yet, because they need to use with_progress(). Once they do, they're listening, and they're going to get progress updates.
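Put together, the developer and user sides might look like this; function and variable names are illustrative, and slow() is again a stand-in:

```r
library(progressr)

slow <- function(x) { Sys.sleep(0.1); x }  # stand-in for the slow step

slow_sum <- function(x) {
  p <- progressor(along = x)  # developer: one progress step per element
  y <- sapply(x, function(z) {
    p(sprintf("x = %g", z))   # signal one step, with an optional message
    slow(z)
  })
  sum(y)
}

with_progress(total <- slow_sum(1:10))  # user: opt in to progress updates
```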
So I mentioned inclusive; I mean inclusive design. Not everyone wants to see a progress bar in the terminal. Someone might want audio feedback, so the end user can choose to use the beepr package: when the thing starts, you get one sound, then you get ticks along the way, and when it's done, it goes ka-ching. And you can combine these; you can do both.
Someone might want to send a Telegram message, a tweet, or an email when it's done or when it starts. It fits that model, too. So it's the end user who decides how progress is reported.
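Reporting is configured on the user side via handlers(); for example, a terminal progress bar combined with audio feedback, assuming the beepr package is installed (the function f() is illustrative):

```r
library(progressr)
handlers("txtprogressbar", "beepr")  # combine handlers: bar plus sound

f <- function(n) {
  p <- progressor(steps = n)  # n progress steps in total
  for (i in seq_len(n)) { Sys.sleep(0.2); p() }
}
with_progress(f(10))
```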
This also works with Shiny. If you use progressr, there's a withProgressShiny() function that you can use, and it will just work with the Shiny progress interface.
Progress bars in parallel
OK, so now back to parallel processing. Both the future framework and the progressr framework were designed so that they shouldn't have to know about each other. There's basically one line of code in future that knows about progressr; I had to do that, but it doesn't depend on the package. It's just a condition class, technically.
And these progression updates will work in parallel here. And to answer the question that was asked in the previous session, can we get more live updates? Yes, you can, for certain types of backends. Again, your code doesn't change.
So this is the same toy example again, and the only thing I changed is that I made the sapply() call use futures. So now, if the user wants to run it in parallel or on remote machines, they can. In this case, the user decided to run it in parallel on the local machine and also see progress bars in the terminal, plus sound. And in the message here, I added which process ID was outputting it. These updates come in through a polling mechanism, but it's very fast, so progress arrives in a different order depending on which worker is first, but it keeps coming in. This works with the future framework since two weeks ago.
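The parallel version might be sketched like this; the only change to the function body is sapply() becoming future_sapply(), and slow() is again a stand-in:

```r
library(future.apply)  # also attaches future
library(progressr)

slow <- function(x) { Sys.sleep(0.1); x }  # stand-in for the slow step

slow_sum <- function(x) {
  p <- progressor(along = x)
  y <- future_sapply(x, function(z) {
    p(sprintf("pid %d: x = %g", Sys.getpid(), z))  # tag updates with the worker PID
    slow(z)
  })
  sum(y)
}

plan(multisession, workers = 2)         # the user's choice of backend
with_progress(total <- slow_sum(1:10))  # updates arrive as workers report
```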
One thing before I go to my last slide: there is a big hope and a good chance that in R 4.0, which comes out this spring, global calling handlers will be available. That basically means that, in this case, we don't need to use with_progress(); the end user just calls the function. So that's a cleaner interface. And there are other use cases for global handlers that I'm really keen on.
Okay, so the take-home message: try it. As a developer, focus on what to parallelize, not how; that's the user's decision. Stay with your favorite coding style. And it's automagic; as with foreach, globals, packages, output, warnings, errors, and progress bars are handled. Thank you.
As a developer, focus on what to parallelize, not how. That's the user's decision.
Q&A
Thank you, Henrik. So we have time for a couple questions. The first one here, what is the communication overhead between processes? And then I would kind of add on top of that, like when you have large data sets, like when you're transferring them between processes, what's your advice there?
Yeah, so two things here. Parallel processing in R at this level, not native multithreading in compiled code, is more for things that take several seconds to run; you don't do super small things. And typically, if the task runs for, say, a whole minute, you don't have to worry about the orchestration overhead. You also have overhead from passing data, and there are things in the furrr package and in future.apply that try to do this in a clever way, so you don't send out everything. But you do need to transfer data if you have big data; of course, if you send that to Mars or whatever, it will take a long time, so you have to think about that. And that's the corner question: should the data live locally, or should it be distributed?
Whoa, that's out of my league, is that the expression in English? But I think you can use the later package; you can use them combined. You can use later and run multiple things in parallel for later.
So you kind of spoke on this a little bit, but how does future parallelize on local machines? Does it fork R processes like parallel, or what does it do on Windows?
So that's a good question. I didn't have time to go through all the details, but the future package doesn't really implement the backends itself; it's largely leveraging the parallel package. So you have the option to use multicore forking like mclapply(), or cluster workers, but backends can also do other things. There's the callr package, for instance, and there's a callr backend. The future package tries to stay away from implementing that itself, but it comes with the parallel package backends built in.
