Resources

Bryan Lewis | Parallel computing with R using foreach, future, and other packages | RStudio (2020)

Steve Weston's foreach package defines a simple but powerful framework for map/reduce and list-comprehension-style parallel computation in R. One of its great innovations is its support for many interchangeable back-end computing systems, so that *the same R code* can run sequentially, in parallel on your laptop, or across a supercomputer. Newer packages like future define elegant programming approaches that can use the foreach framework to run across a wide variety of parallel computing systems. This talk introduces the basics of the foreach and future packages, with examples using a variety of back-end systems including MPI, Redis, and R's default parallel package clusters.


Transcript

This transcript was generated automatically and may contain errors.

Alright, so next up, please join me in welcoming Bryan Lewis, who will be talking about Parallel Computing with R using foreach, future, and other packages.

Alright well, people are shuffling in and out. First it goes without saying that it's a huge privilege for me to be able to hang out with so many cool and interesting people. It's not, I normally don't get to hang out with cool people. So I'd really like to thank Joe, and Hadley, and JJ, and everyone at RStudio for making that possible.

And can you all hear me okay? Because I'm going to talk really fast because I've got a ton of slides, right? So, I'm kidding, but there really is a lot of slides.

The problem with parallel computing in R

So, you're writing some R code, or maybe you're working on an R package. And your package is a super big data number crunching science package, right? So you're really, really interested and worried about performance of your code. And you need to study this. So maybe you pick up Patrick Burns's excellent book, The R Inferno. And you learn a lot from this book. Tons of performance tips about R, and lots of other interesting nuggets about the R language. But the book kind of scares you a little bit.

So then you move on to Hadley's book, right? And you learn even more about the R language, and in particular about performance. And you also stroke your chin thoughtfully as you ponder all those Donald Knuth quotations in the book, right?

And after studying these things for a long time, you've become an expert in vectorization and byte-code compilation and all of these things. And finally, with great anticipation, you get to run your package. And it runs really, really slow. It's very, very cute. Your package is super cute, but it's also super, super slow, right?

And then you remember, oh, wait. I've got RStudio. And they've got all these newfangled profiling tools, like profvis and other things that help you look at your R code, visualize it, see where it's spending all its time. So you load up your package into RStudio, and you run profvis, and you look at where you're spending all your time in code. And you notice that your package is spending a lot of time in these loop-like structures like lapply or maybe replicate or Reduce or Map.

And then it strikes you. Of course. I spent all this crazy amount of money on my super, super fancy Mac Pro Escalade, which has got a zillion CPU cores. And all those lapply loops that I'm running are only using a single CPU. But of course, this is R. And you realize there's a package for that. The parallel package. And it even comes with R. So of course, this is what I'll use.

So with renewed vigor, you go back into your package, and you look at all your lapply calls, and you change all of the lapply calls to mclapply, and you change all your Map calls to mcMap. All the while, you're thinking, what if McDonald's sues for trademark infringement? I mean, Simon Urbanek is going to be in some serious trouble, right?

And so anyway, you get all this done. And finally, you're ready to run your package again. And it blows your mind. It's running so fast that your Mac Pro Escalade is glowing red hot. It's ripping through your data, big science code, really super fast. You've never seen R or even your Mac perform like this before. It's fantastic, right? So you tell all your friends about it. You're so excited.

Everybody wants to use your package. And so your friend Jared Lander, he calls you up, and he's excited about running your package. So you give your package to Jared. And of course, Jared's using a Windows laptop, and when he runs it, it doesn't work, right? Because Windows, right?

And then you think, well, yeah, wait, though. Those parallel package functions, not all of them work on all the operating systems, especially Windows. So you tell Jared, you know, that's not a problem, Jared. I'll just rip out some of my mclapply calls, and I'll replace them with clusterApply or whatever, right? And I'll get it all to work for you on your Windows machine. And you do that, and it works fine.

Meanwhile, while you're working on that, your other friend, that's right, you've only got two friends. Your other friend, George Ostrouchov, he's really excited about your package, too. And he's got, of course, he's got this super petascale computer in his garage down there in Tennessee somewhere in the Appalachian Mountains. And he wants to run it. And you start talking with George, and you're excited to run it on such a big computer. But you realize you've got to run it using something called MPI, because George says real HPC programmers use MPI, you know, whatever.

But there's a package for that, too, right? There's this Rmpi package. So you start rewriting your code again, and this is harder this time, because this MPI stuff is really tricky. And it's taking quite a while. And meanwhile, this is snowballing. Other people are asking, well, what about this? What about this? Can I run it on that? Can I run it on that? And before long, you're losing friends, and your family's pissed off at you, because you're spending all of your time staring at an RStudio console, maintaining like 50 versions of your package for all these different parallel distributed network runtime systems, right? It's driving you crazy.

And then, you know, you're neurotic, right? So you start to get really worried about this, and you're like, well, what if I lose friends over this? I might start losing sleep. And then maybe I would get depressed. I could lose my job. That would probably lead, inevitably, to a life of crime and woe. And then, finally, you have a vision of yourself as a crazy old person on a street corner ranting and raving about lazy evaluation, dark, dark, dark thoughts.

The foreach philosophy

This is the situation we found ourselves in in 2008. And my friend and, at the time, my colleague, Steve Weston, he looked back at this problem, and he thought, there's got to be a better way, a different way, a different approach. What if package authors and code writers could decide which parts of their program can run in parallel, and separately, later, the users of those packages decide how those codes should run in parallel based on their circumstances and available resources?

This is the philosophy of foreach. And it's a philosophy shared by Henrik's much more recent future package. And it's a very, very important point, because we abstract away the implementation of a particular parallel portion of our code and leave that up to the user to decide at runtime how things should run.


Now, back in 2008, Steve made the decision to syntactically structure foreach in the form of a for loop. And because back then Steve was really enamored of list comprehensions and set comprehensions and all these kinds of Haskell-y things, he was really into that.

Speaking of Haskell, you know the reticulate package and how beautifully it integrates Python with R? It's lovely, right? Have you used reticulate? Wouldn't it be really cool if somebody did that with R and Haskell, right, and they could call it Rascal? I would really like that. That would be an excellent package.

How foreach works

But back to foreach, despite its for loop syntax, foreach really kind of works like a reduce of a map. And let's make that concrete by looking at a really kind of simple contrived example. Here's a foreach loop, and I'll explain each part.

foreach is a function, and these iteration variables i and j in the blue circle, they kind of play the role of any old iterator variable in a regular old for loop. And we can supply a reduction function using this .combine argument, and that can be any function you want of two or more arguments. And in this case, I'm just using the rbind function. And then in the purple circle, there's just a regular old R expression. It can be any R expression you want, and it can use those iterator variables. But in our case, we're just going to run over the unique values of i and j. Each iteration of the loop is going to produce a single row data frame, and we'll pass it to rbind.

But wait, there's this weird, mysterious operator, %dopar%, that takes a function call on one side and an R expression on the other. It's kind of bizarre. If we run this, well, big surprise. We have two loop iterations. Each one emits a single-row data frame. They're passed into rbind, and our ultimate output is a two-row data frame. Rocket science.
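A minimal sketch of a loop like the one on the slide (the variable and column names here are illustrative):

```r
library(foreach)

# Iterate over paired values of i and j; each iteration returns a
# one-row data frame, and .combine = rbind stacks the rows together.
result <- foreach(i = 1:2, j = c(10, 20), .combine = rbind) %dopar% {
  data.frame(i = i, j = j, total = i + j)
}
print(result)
# With no back end registered, %dopar% warns that it is executing
# sequentially and runs the loop in the plain R session.
```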

But we also get this interesting warning message from %dopar%. And that's %dopar% telling us, look, man, you didn't tell me that you could run it in parallel or you wanted to run it in parallel, so just so you know, this was run on a plain old sequential R session. And that's what %dopar% does. That strange operator is effectively the API of the foreach package. It's the glue that takes your R statements and variables and puts them together with various ways to run those things in parallel, like MPI or whatever. And that's the abstraction layer, the API.

And so here's an example of that. A user then can register a particular parallel adapter package, in this case doParallel, with this registration function, and %dopar% sees that through a hidden global variable somewhere. And now the loop, without changing the code in any way, runs and gives you the same answer, but this time it's run in parallel. In this example, two separate R processes were started for each loop iteration. Their output was combined through the rbind function to give you the same result. That's the philosophy of foreach and the future package in a nutshell.
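A sketch of that registration step, using the standard doParallel adapter; the loop body is the same one as before:

```r
library(foreach)
library(doParallel)

# Register a back end with two worker processes; %dopar% finds it
# through foreach's registration mechanism, so the loop itself
# does not change at all.
cl <- makeCluster(2)
registerDoParallel(cl)

result <- foreach(i = 1:2, j = c(10, 20), .combine = rbind) %dopar% {
  data.frame(i = i, j = j, total = i + j)
}

stopCluster(cl)
print(result)
```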

The future package

Thanks to Henrik's amazing work, the future package is like a new extension of this idea. Unlike foreach, where you have a very opinionated syntactical structure of a for loop, the future package lets you do things like this using any type of R syntax that you want. You can use regular old for loops, lapply, Map, purrr's map, replicate, or no loop at all. You can do anything you want to do.
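For example, the same two-row result from earlier can be computed with futures and no special loop syntax at all (a sketch using the multisession plan):

```r
library(future)

# The user decides at runtime how futures resolve; here, two
# background R sessions on the local machine.
plan(multisession, workers = 2)

# Wrap ordinary R expressions in future(); collect with value().
f1 <- future(data.frame(i = 1, j = 10, total = 11))
f2 <- future(data.frame(i = 2, j = 20, total = 22))
result <- rbind(value(f1), value(f2))
print(result)
```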

And the cool thing is both the future package and the foreach package use magic. They do. They automagically detect and make sure that lexically scoped R objects are available to the worker processes. So in this case, it's the same loop we were running before, except notice there's this k variable at the top that's not defined within the body, the expression that's in the loop. And foreach and %dopar% are smart enough to identify that and say, wait a minute, that worker process is going to need that value. So I make sure that it has it, even though it's defined in a lexical scope. And sure enough, it works. And that'll work if you're running it on the same computer, or if you're running it across your departmental network, or if you're using AWS to run it across the country, or even if you're running it on another planet. It'll all just work magically. It's really quite amazing.
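A sketch of that automatic export of lexically scoped variables:

```r
library(foreach)
library(doParallel)
registerDoParallel(2)

k <- 100  # defined outside the loop body, in the enclosing scope

# foreach detects that the body uses k and ships it to the workers.
res <- foreach(i = 1:2, .combine = c) %dopar% (i + k)
print(res)  # 101 102
```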

There are a few other kind of interesting things about the foreach package that are more esoteric. You can compose two foreach loops together with the composition operator, %:%, and that produces a third foreach loop. And that can be really useful for nested parallelism, to make it work, actually, in a more predictable way.
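A sketch of loop composition with %:% (run here with %do%, i.e. sequentially, so no back end is needed):

```r
library(foreach)

# %:% merges two foreach loops into a single loop over all (i, j)
# pairs, so every combined iteration can be scheduled in parallel.
tab <- foreach(i = 1:3, .combine = rbind) %:%
  foreach(j = 1:3, .combine = c) %do% (i * j)
print(tab)  # the 3x3 multiplication table
```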

And going back to the Haskell-y ideas, you can compose a foreach loop with a predicate filtering operation, the when function in this case, to produce a set comprehension, a very Haskell-y kind of thing. And Steve was very much into those ideas.
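A sketch of that when() filter:

```r
library(foreach)

# when() drops iterations whose predicate is FALSE before they run,
# giving a set-comprehension feel: keep only the even values.
evens <- foreach(i = 1:10, .combine = c) %:% when(i %% 2 == 0) %do% i
print(evens)  # 2 4 6 8 10
```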

Now, again, thanks to this amazing guy right here who's going to talk about the future package just in the very next talk, so you all should stick around. Future works interoperably with foreach through the doFuture parallel adapter package, as do a number of other older parallel adapters that are listed there.

Before this talk, I went and updated some that I've been neglecting for a while. The doRedis and doMPI packages, the Redis and MPI back ends for foreach, are recently updated, and they're either on CRAN now or pending on CRAN. And I started writing, kind of hacking together, a guide to explain the internals of foreach to prospective parallel adapter authors. If you want to write your own parallel adapter to connect to your super-duper fancy whatever multi-GPU computer thing, you can do that, and there's a little guide there that helps walk you through how to dot the i's and cross the t's.

Steve Weston's legacy

Now I have to say I believe that this is a vitally important addition to the R software ecosystem. It really fills a gap, or filled one back in 2009 when we introduced it, something that was really missing in R, and it's been very important. That's why I've tried to support it a little bit in a limited capacity here and there over the years. But you don't have to take my word for that. All you need to do is go on CRAN and look at the foreach page, and you can see all of these R packages that either depend explicitly or implicitly on foreach, and there's a lot of them. So there's a huge impact from this software.

And that really makes me quite happy, because as some of you know... So we lost Steve last summer; he passed away, sadly. But in a way he didn't, right? His ideas are as vital today to the R software ecosystem, and to spawning new work by new people, as they were 10 years ago. So I think it's a testament to Steve's vision how important this work was.


And this has been kind of a weird lightning introduction to an abstraction layer for parallel computing. I know I didn't go very deep into things, but I'm going to conclude this talk because I don't want to burden you with all of these super technical details. I'll leave that to Henrik. But I'd like to conclude the talk in kind of a weird, characteristically weird way. I'll paraphrase the great labor organizer and folk singer, Steve was also into folk music, Joe Hill. And on Joe's deathbed, he said, don't mourn, don't mourn my passing. Parallelize. Well, he didn't say parallelize, but Steve would have said parallelize. That's all I have for you. Thank you very much.

Q&A

So as the author of this package, I'm kind of curious about your answer. What do you think about the furrr package as the combination of purrr and future? Yeah, I think so. So the functional concepts, I mean, as you saw, I mentioned Haskell, right? So having side-effect-free operations whenever possible, which are promoted by things like the purrr package, is important. And they're in fact essential for the embarrassingly parallel, map-reduce-style computation that we get with future and with foreach. So I think it's a natural fit, actually.
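A sketch of that purrr-plus-future combination via furrr (assuming the furrr package is installed):

```r
library(furrr)
library(future)

plan(multisession, workers = 2)

# future_map_dbl is purrr::map_dbl with the iterations resolved as
# futures on whatever plan the user has registered.
squares <- future_map_dbl(1:4, ~ .x^2)
print(squares)  # 1 4 9 16
```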

One other question here. So this is just kind of technical, but how do we assign the number of cores to be used? Yeah, that's up to the parallel adapter code. I mean, that's the point. At runtime, the user can make a decision. So if you're just using R's native parallel package, you would supply a number. When you say register, I had that slide up there that said registerDoParallel. If you just put a number in there, that's the number of cores it's going to use. And that, of course, depends. If you're using an MPI back end, you've got a different convention. Or if you're using Redis or one of these other job queuing systems or whatever, they all have slightly different conventions for dealing with how the work is going to be run.
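With doParallel, for example, the worker count is just an argument to the registration call:

```r
library(doParallel)

# Ask for two workers; getDoParWorkers() reports what is registered.
registerDoParallel(cores = 2)
print(getDoParWorkers())  # 2
```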

Are there any recommendations about debugging when using parallel computing? Oh, that's a great question, actually. It's challenging, right? Parallel computing in general is nightmarishly complicated to debug. And yeah, in both the future work and in foreach, there are various tricks that you can use to help debug your code. There's an option in foreach to stop on the first error that's encountered on any worker process anywhere and throw that error in the R session. That's its default. But you can also just pass the errors as they come. So if you're getting lots of errors in your loop iterations, you can just see all of them to see what's happening. And you can also, this is probably not advisable, but you can also ignore errors.
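In foreach, that knob is the .errorhandling argument (a sketch):

```r
library(foreach)

# "stop" (the default) rethrows the first worker error in your
# session; "pass" returns the error objects among the results;
# "remove" silently drops the failed iterations.
res <- foreach(i = 1:3, .errorhandling = "pass") %do% {
  if (i == 2) stop("bad iteration") else i
}
# res[[2]] is the captured error object; iterations 1 and 3 succeeded.
```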

Depending on the back ends, the kind of adapter packages, the ones that work with the standard R parallel package and the mc functions, they don't do so well at debugging. But some of the other parallel adapters have provisions in them for logging and maintaining logs to just regular R connections. So you can direct logs to files on the back-end workers and inspect those after. Or if you have the ability to look at standard error, standard out, you can actually look at those as it's running on the other workers, too, to help debug. But it's still tricky.
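One such provision in the standard parallel package is the outfile argument to makeCluster, which redirects the workers' output to a file you can inspect afterwards (a sketch):

```r
library(foreach)
library(doParallel)

# Workers' stdout and stderr land in a log file instead of vanishing.
log <- tempfile(fileext = ".log")
cl <- makeCluster(2, outfile = log)
registerDoParallel(cl)

res <- foreach(i = 1:2, .combine = c) %dopar% {
  cat("worker handling iteration", i, "\n")  # written to the log
  i
}

stopCluster(cl)
print(res)  # 1 2
```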

So he and I are really interested in getting messages kind of back in real time. Like any warnings or error messages coming back in real time from your workers. Is that even possible, depending on... Well, I don't know about future. In foreach, you can throw an error, and that's coming back in real time if you pass the errors through, right? So that would be the only way to do that right now in foreach, in an abstract way.

Like warnings and messages, maybe, where it's still continuing to run? Like that didn't stop the worker? Yeah. That's a great question. And I don't know of any facility for doing that. That would be... You'd need some extra communication channel off the main channel to send that information through. You could do it through files, but you may not have access to files. If they're running on another planet, maybe you can't get those, right? So I don't know. It's a tricky one, Max. I figured you would ask something tricky like that.