Resources

Bryan Lewis | Parallel computing with R using foreach, future, and other packages | RStudio (2020)

Steve Weston's foreach package defines a simple but powerful framework for map/reduce and list-comprehension-style parallel computation in R. One of its great innovations is its support for many interchangeable back-end computing systems, so that *the same R code* can run sequentially, in parallel on your laptop, or across a supercomputer. Newer packages like future define elegant programming approaches that can use the foreach framework to run across a wide variety of parallel computing systems. This talk introduces the basics of the foreach and future packages, with examples using a variety of back-end systems including MPI, Redis, and R's default parallel package clusters.


Transcript

This transcript was generated automatically and may contain errors.

Alright, so next up, please join me in welcoming Bryan Lewis, who will be talking about Parallel Computing with R using foreach, future, and other packages.

Alright well, people are shuffling in and out. First it goes without saying that it's a huge privilege for me to be able to hang out with so many cool and interesting people. It's not, I normally don't get to hang out with cool people. So I'd really like to thank Joe, and Hadley, and JJ, and everyone at RStudio for making that possible.

And can you all hear me okay? Because I'm going to talk really fast because I've got a ton of slides, right? So, I'm kidding, but there really is a lot of slides.

The problem with parallel computing in R

So, you're writing some R code, or maybe you're working on an R package. And your package is a super big data number crunching science package, right? So you're really, really interested and worried about performance of your code. And you need to study this. So maybe you pick up Patrick Burns's excellent book, The R Inferno. And you learn a lot from this book. Tons of performance tips about R, and lots of other interesting nuggets about the R language. But the book kind of scares you a little bit.

So then you move on to Hadley's book, right? And you learn even more about the R language, and in particular about performance. And you also stroke your chin thoughtfully as you ponder all those Donald Knuth quotations in the book, right?

And after studying these things for a long time, you've become an expert in vectorization and byte-code compilation and all of these things. And finally, with great anticipation, you get to run your package. And it runs really, really slow. It's very, very cute. Your package is super cute, but it's also super, super slow, right?

And then you remember, oh, wait. I've got RStudio. And they've got all these newfangled profiling tools, like profvis and other things that help you look at your R code, visualize it, see where it's spending all its time. So you load up your package into RStudio, and you run profvis, and you look at where you're spending all your time in code. And you notice that your package is spending a lot of time in these loop-like structures like lapply or maybe replicate or Reduce or Map.

And then it strikes you. Of course. I spent all this crazy amount of money on my super, super fancy Mac Pro Escalade, which has got a zillion CPU cores. And all those lapply loops that I'm running are only using a single CPU. But of course, this is R. And you realize there's a package for that. The parallel package. And it even comes with R. So of course, this is what I'll use.

So with renewed vigor, you go back into your package, and you look at all your lapply calls, and you change all of the lapply calls to mclapply, and you change all your Map calls to mcMap. All the while, you're thinking, what if McDonald's sues for trademark infringement? I mean, Simon Urbanek is going to be in some serious trouble, right?

And so anyway, you get all this done. And finally, you're ready to run your package again. And it blows your mind. It's running so fast that your Mac Pro Escalade is glowing red hot. It's ripping through your data, big science code, really super fast. You've never seen R or even your Mac perform like this before. It's fantastic, right? So you tell all your friends about it. You're so excited.

Everybody wants to use your package. And so your friend Jared Lander, he calls you up, and he's excited about running your package. So you give your package to Jared. And of course, Jared's using a Windows laptop, and when he runs it, it doesn't work, right? Because Windows, right?

And then you think, well, yeah, wait, though. Those parallel package functions, not all of them work on all the operating systems, especially Windows. So you tell Jared, you know, that's not a problem, Jared. I'll just rip out some of my mclapply calls, and I'll replace them with clusterApply or whatever, right? And I'll get it all to work for you on your Windows machine. And you do that, and it works fine.

Meanwhile, while you're working on that, your other friend, that's right, you've only got two friends. Your other friend, George Ostrouchov, he's really excited about your package, too. And he's got, of course, he's got this super petascale computer in his garage down there in Tennessee somewhere in the Appalachian Mountains. And he wants to run it. And you start talking with George, and you're excited to run it on such a big computer. But you realize you've got to run it using something called MPI, because George says real HPC programmers use MPI, you know, whatever.

But there's a package for that, too, right? There's this Rmpi package. So you start rewriting your code again, and this is harder this time, because this MPI stuff is really tricky. And it's taking quite a while. And meanwhile, this is snowballing. Other people are asking, well, what about this? What about this? Can I run it on that? Can I run it on that? And before long, you're losing friends, and your family's pissed off at you, because you're spending all of your time staring at an RStudio console, maintaining like 50 versions of your package for all these different parallel distributed network runtime systems, right? It's driving you crazy.

And then, you know, you're neurotic, right? So you start to get really worried about this, and you're like, well, what if I lose friends over this? I might start losing sleep. And then maybe I would get depressed. I could lose my job. That would probably lead, inevitably, to a life of crime and woe. And then, finally, you have a vision of yourself as a crazy old person on a street corner ranting and raving about lazy evaluation, dark, dark, dark thoughts.

The foreach philosophy

This is the situation we found ourselves in in 2008. And my friend and, at the time, my colleague, Steve Weston, he looked back at this problem, and he thought, there's got to be a better way, a different way, a different approach. What if package authors and code writers could decide which parts of their program can run in parallel, and separately, later, the users of those packages decide how those codes should run in parallel based on their circumstances and available resources?

This is the philosophy of foreach. And it's a philosophy shared by Henrik's much more recent future package. And it's a very, very important point, because we abstract away the implementation of a particular parallel portion of our code and leave that up to the user to decide at runtime how things should run.


Now, back in 2008, Steve made the decision to syntactically structure foreach in the form of a for loop. And because back then Steve was really enamored of list comprehensions and set comprehensions and all these kinds of Haskell-y things, he was really into that.

Speaking of Haskell, you know the reticulate package and how beautifully it integrates Python with R? It's lovely, right? Have you used reticulate? Wouldn't it be really cool if somebody did that with R and Haskell, right, and they could call it Rascal? I would really like that. That would be an excellent package.

How foreach works

But back to foreach, despite its for loop syntax, foreach really kind of works like a reduce of a map. And let's make that concrete by looking at a really kind of simple contrived example. Here's a foreach loop, and I'll explain each part.

foreach is a function, and these iteration variables i and j in the blue circle, they kind of play the role of any old iterator variable in a regular old for loop. And we can supply a reduction function using this .combine argument, and that can be any function you want of two or more arguments. And in this case, I'm just using the rbind function. And then in the purple circle, there's just a regular old R expression. It can be any R expression you want, and it can use those iterator variables. But in our case, we're just going to run over the unique values of i and j. Each iteration of the loop is going to produce a single row data frame, and we'll pass it to rbind.

But wait, there's this weird, mysterious operator, %dopar%, that takes a function call on one side and an R expression on the other. It's kind of bizarre. If we run this, well, big surprise. We have two loop iterations. Each one emits a single-row data frame. They're passed into rbind, and our ultimate output is a two-row data frame. Rocket science.
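A minimal sketch of a loop like the one on the slide (the variable and column names here are illustrative):

```r
library(foreach)

# Iterate over paired values of i and j; each iteration returns a
# one-row data frame, and .combine = rbind stacks the rows together.
result <- foreach(i = 1:2, j = c(10, 20), .combine = rbind) %dopar% {
  data.frame(i = i, j = j, total = i + j)
}
print(result)
# With no back end registered, %dopar% warns that it is executing
# sequentially and runs the loop in the plain R session.
```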

But we also get this interesting warning message from %dopar%. And that's %dopar% telling us, look, man, you didn't tell me that you could run it in parallel or you wanted to run it in parallel, so just so you know, this was run on a plain old sequential R session. And that's what %dopar% does. That strange operator is effectively the API of the foreach package. It's the glue that takes your R statements and variables and puts them together with various ways to run those things in parallel, like MPI or whatever. And that's the abstraction layer, the API.

And so here's an example of that. A user then can register a particular parallel adapter package, in this case doParallel, with this registration function, and %dopar% sees that through a hidden global variable somewhere. And now the loop, without changing the code in any way, runs and gives you the same answer, but this time it's run in parallel. In this example, two separate R processes were started for each loop iteration. Their output was combined through the rbind function to give you the same result. That's the philosophy of foreach and the future package in a nutshell.
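A sketch of that registration step, using the standard doParallel adapter; the loop body is the same one as before:

```r
library(foreach)
library(doParallel)

# Register a back end with two worker processes; %dopar% finds it
# through foreach's registration mechanism, so the loop itself
# does not change at all.
cl <- makeCluster(2)
registerDoParallel(cl)

result <- foreach(i = 1:2, j = c(10, 20), .combine = rbind) %dopar% {
  data.frame(i = i, j = j, total = i + j)
}

stopCluster(cl)
print(result)
```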

The future package

Thanks to Henrik's amazing work, the future package is like a new extension of this idea. Unlike foreach, where you have a very opinionated syntactical structure of a for loop, the future package lets you do things like this using any type of R syntax that you want. You can use regular old for loops, lapply, Map, purrr's map, replicate, or no loop at all. You can do anything you want to do.
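For example, the same two-row result from earlier can be computed with futures and no special loop syntax at all (a sketch using the multisession plan):

```r
library(future)

# The user decides at runtime how futures resolve; here, two
# background R sessions on the local machine.
plan(multisession, workers = 2)

# Wrap ordinary R expressions in future(); collect with value().
f1 <- future(data.frame(i = 1, j = 10, total = 11))
f2 <- future(data.frame(i = 2, j = 20, total = 22))
result <- rbind(value(f1), value(f2))
print(result)
```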

And the cool thing is both the future package and the foreach package use magic. They do. They automagically detect and make sure that lexically scoped R objects are available to the worker processes. So in this case, it's the same loop we were running before, except notice there's this k variable at the top that's not defined within the body, the expression that's in the loop. And foreach and %dopar% are smart enough to identify that and say, wait a minute, that worker process is going to need that value. So I make sure that it has it, even though it's defined in a lexical scope. And sure enough, it works. And that'll work if you're running it on the same computer, or if you're running it across your departmental network, or if you're using AWS to run it across the country, or even if you're running it on another planet. It'll all just work magically. It's really quite amazing.
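A sketch of that automatic export of lexically scoped variables:

```r
library(foreach)
library(doParallel)
registerDoParallel(2)

k <- 100  # defined outside the loop body, in the enclosing scope

# foreach detects that the body uses k and ships it to the workers.
res <- foreach(i = 1:2, .combine = c) %dopar% (i + k)
print(res)  # 101 102
```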

There are a few other kind of interesting things about the foreach package that are more esoteric. You can compose two foreach loops together with the composition operator, %:%, and that produces a third foreach loop. And that can be really useful for nested parallelism, to make it work, actually, in a more predictable way.
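A sketch of loop composition with %:% (run here with %do%, i.e. sequentially, so no back end is needed):

```r
library(foreach)

# %:% merges two foreach loops into a single loop over all (i, j)
# pairs, so every combined iteration can be scheduled in parallel.
tab <- foreach(i = 1:3, .combine = rbind) %:%
  foreach(j = 1:3, .combine = c) %do% (i * j)
print(tab)  # the 3x3 multiplication table
```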

And going back to the Haskell-y ideas, you can compose a foreach loop with a predicate filtering operation, the when function in this case, to produce a set comprehension, a very Haskell-y kind of thing. And Steve was very much into those ideas.
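A sketch of that when() filter:

```r
library(foreach)

# when() drops iterations whose predicate is FALSE before they run,
# giving a set-comprehension feel: keep only the even values.
evens <- foreach(i = 1:10, .combine = c) %:% when(i %% 2 == 0) %do% i
print(evens)  # 2 4 6 8 10
```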

Now, again, thanks to this amazing guy right here who's going to talk about the future package just in the very next talk, so you all should stick around. Future works interoperably with foreach through the doFuture parallel adapter package, as do a number of other older parallel adapters that are listed there.

Before this talk, I went and updated some that I've been neglecting for a while. The doRedis and doMPI packages, the Redis and MPI back ends for foreach, are recently updated, and they're either on CRAN now or pending on CRAN. And I started writing, kind of hacking together, a guide to explain the internals of foreach to prospective parallel adapter authors. If you want to write your own parallel adapter to connect to your super-duper fancy whatever multi-GPU computer thing, you can do that, and there's a little guide there that helps walk you through how to dot the i's and cross the t's.

Steve Weston's legacy

Now I have to say I believe that this is a vitally important addition to the R software ecosystem. It really fills a gap, or filled one back in 2009 when we introduced it, something that was really missing in R, and it's been very important. That's why I've tried to support it a little bit in a limited capacity here and there over the years. But you don't have to take my word for that. All you need to do is go on CRAN and look at the foreach page, and you can see all of these R packages that either depend explicitly or implicitly on foreach, and there's a lot of them. So there's a huge impact from this software.

And that really makes me quite happy, because as some of you know... So we lost Steve last summer; he passed away, sadly. But in a way he didn't, right? His ideas are as vital today to the R software ecosystem, and to spawning new work by new people, as they were 10 years ago. So I think it's a testament to Steve's vision how important this work was.


And this has been kind of a weird lightning introduction to an abstraction layer for parallel computing. I know I didn't go very deep into things, but I'm going to conclude this talk because I don't want to burden you with all of these super technical details. I'll leave that to Henrik. But I'd like to conclude the talk in kind of a weird, characteristically weird way. I'll paraphrase the great labor organizer and folk singer, Steve was also into folk music, Joe Hill. And on Joe's deathbed, he said, don't mourn, don't mourn my passing. Parallelize. Well, he didn't say parallelize, but Steve would have said parallelize. That's all I have for you. Thank you very much.

Q&A

So as the author of this package, I'm kind of curious about your answer. What do you think about the furrr package as the combination of purrr and future? Yeah, I think so. So the functional concepts, I mean, as you saw, I mentioned Haskell, right? So having side-effect-free operations whenever possible, which are promoted by things like the purrr package, is important. And they're in fact essential for the embarrassingly parallel, map-reduce-style computation that we get with future and with foreach. So I think it's a natural fit, actually.
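A sketch of that purrr-plus-future combination via furrr (assuming the furrr package is installed):

```r
library(furrr)
library(future)

plan(multisession, workers = 2)

# future_map_dbl is purrr::map_dbl with the iterations resolved as
# futures on whatever plan the user has registered.
squares <- future_map_dbl(1:4, ~ .x^2)
print(squares)  # 1 4 9 16
```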

One other question here. So this is just kind of technical, but how do we assign the number of cores to be used? Yeah, that's up to the parallel adapter code. I mean, that's the point. At runtime, the user can make a decision. So if you're just using R's native parallel package, you would supply a number. When you say register, I had that slide up there that said registerDoParallel. If you just put a number in there, that's the number of cores it's going to use. And that, of course, depends. If you're using an MPI back end, you've got a different convention. Or if you're using Redis or one of these other job queuing systems or whatever, they all have slightly different conventions for dealing with how the work is going to be run.
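With doParallel, for example, the worker count is just an argument to the registration call:

```r
library(doParallel)

# Ask for two workers; getDoParWorkers() reports what is registered.
registerDoParallel(cores = 2)
print(getDoParWorkers())  # 2
```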

Are there any recommendations about debugging when using parallel computing? Oh, that's a great question, actually. It's challenging, right? Parallel computing in general is nightmarishly complicated to debug. And yeah, in both the future work and in foreach, there are various tricks that you can use to help debug your code. There's an option in foreach to stop on the first error that's encountered on any worker process anywhere and throw that error in the R session. That's its default. But you can also just pass the errors as they come. So if you're getting lots of errors in your loop iterations, you can just see all of them to see what's happening. And you can also, this is probably not advisable, but you can also ignore errors.
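In foreach, that knob is the .errorhandling argument (a sketch):

```r
library(foreach)

# "stop" (the default) rethrows the first worker error in your
# session; "pass" returns the error objects among the results;
# "remove" silently drops the failed iterations.
res <- foreach(i = 1:3, .errorhandling = "pass") %do% {
  if (i == 2) stop("bad iteration") else i
}
# res[[2]] is the captured error object; iterations 1 and 3 succeeded.
```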

Depending on the back ends, the kind of adapter packages, the ones that work with the standard R parallel package and the mc functions, they don't do so well at debugging. But some of the other parallel adapters have provisions in them for logging and maintaining logs to just regular R connections. So you can direct logs to files on the back-end workers and inspect those after. Or if you have the ability to look at standard error, standard out, you can actually look at those as it's running on the other workers, too, to help debug. But it's still tricky.
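One such provision in the standard parallel package is the outfile argument to makeCluster, which redirects the workers' output to a file you can inspect afterwards (a sketch):

```r
library(foreach)
library(doParallel)

# Workers' stdout and stderr land in a log file instead of vanishing.
log <- tempfile(fileext = ".log")
cl <- makeCluster(2, outfile = log)
registerDoParallel(cl)

res <- foreach(i = 1:2, .combine = c) %dopar% {
  cat("worker handling iteration", i, "\n")  # written to the log
  i
}

stopCluster(cl)
print(res)  # 1 2
```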

So he and I are really interested in getting messages kind of back in real time. Like any warnings or error messages coming back in real time from your workers. Is that even possible, depending on... Well, I don't know about future. In foreach, you can throw an error, and that's coming back in real time if you pass the errors through, right? So that would be the only way to do that right now in foreach, in an abstract way.

Like warnings and messages, maybe, where it's still continuing to run? Like that didn't stop the worker? Yeah. That's a great question. And I don't know of any facility for doing that. That would be... You'd need some extra communication channel off the main channel to send that information through. You could do it through files, but you may not have access to files. If they're running on another planet, maybe you can't get those, right? So I don't know. It's a tricky one, Max. I figured you would ask something tricky like that.