Winston Chang | Asynchronous programming in R | RStudio (2020)

Writing regular R code is straightforward: you tell R to do something, it does it, and then it returns control back to you. This is called synchronous programming. However, if you use R to coordinate threads, processes, or network communication, the regular model may be unable to do what you want, or it may only be able to do it with a significant performance penalty. In this talk I'll explain how asynchronous programming with the later package can handle these kinds of programming problems. I'll also show how to provide a synchronous interface for asynchronous code, so that users will have a simple, familiar way to use your code. Materials: GitHub repo.

Transcript

This transcript was generated automatically and may contain errors.

Thanks, everybody. Okay, so my talk is entitled Asynchronous Programming in R, but I have to give a little disclaimer. It's sort of, but not really about asynchronous programming in R. When I was developing the materials for this talk, I realized I couldn't... Asynchronous programming is such a huge topic, I would have a hard time really explaining it in the way that I found satisfying in 20 minutes. And then I thought about all these other topics that come along with asynchronous programming, in my experience and the stuff that I work on.

So there's also parallelism, concurrency, and event-driven programming. Let me define these for you in case you're not really familiar with them. Asynchronous programming is when you call a function and it doesn't block. Normally, when you're writing your R code and you're running it, it steps through, it does each thing, and if you do something that takes a long time, like, let's say, you tell R to go download a file, it stops there, and then once the file is downloaded, your script continues to run. In an asynchronous program, if there's an asynchronous download, it would essentially say, hey, some other thing in the computer, go download this file, and the code would keep running. Then later you'd check, hey, is my file downloaded, or it might run a callback and tell you, hey, the file is done now. That's asynchronous programming.

Parallelism is when you do multiple things at the same time. That's very common on modern computers with multiple cores. Concurrency is when it seems like you're doing multiple things at the same time, but you might not actually be doing multiple things at the same time. So if you're familiar with JavaScript in a web browser, that's single-threaded, and it can seem like a web page is doing a lot of things at once, but it's really splitting its time in between different tasks and switching between them very quickly, so it just seems like it's doing things in parallel. And finally, there's event-driven programming, where you have events that occur, there might be some outside signal, and that causes some code to run.

So I was thinking about this, and asynchronous programming was too big, and so I thought I'd try to talk about all of these a little bit, or at least a common thread that runs through all of them, which is a package called later. And later was originally created by Joe Cheng, and I don't know if it's a coincidence, but it was a conversation with him that helped me settle on this topic. So thanks, Joe.

Introducing the later package

All right, so I'm gonna show you a demo here of... Let's say I'm creating a data frame called data, and I've populated it with some X and Y values, and this plot magically appeared in RStudio here. And if I modify it, so I square all the X values, that plot redraws, and now I have a parabola shape here. And I can do the same with Y, and if I restore data to what it was before, it re-plots again. So there's something going on here. It's not any magic from RStudio, the IDE, but it involves later.

And what you're seeing here is sort of, at least from the user perspective, this is event-driven programming. So I'm not telling it to re-plot every time I change data, but every time I change data, it causes this plotting code to re-execute. And I ran that, actually, ahead of time, before I started this talk, and I'll show it to you in a little bit.

But let's talk about what the later package does. So later provides something called an event loop. An event loop is a queue of functions that will run in the future, and it's very similar to setTimeout in JavaScript, if you're familiar with JavaScript. And if this is confusing to you, I will show you a very simple example of how later is used. So you load the later package. I'm setting a flag to signal when something is done. And then I say, later, run this function here, which prints out a message and updates the flag, after five seconds. And then at the end, I'll say, while I'm not done, run_now. It means keep running the functions that are in this event loop, in this queue.
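The example described above was not captured in the transcript. A minimal sketch of what it might look like follows; the flag name done and the exact message text are assumptions, while later() and run_now() are the functions the later package exports.

```r
library(later)

# Flag that the scheduled callback flips when it has run.
# (The name `done` is an assumption, not taken from the slides.)
done <- FALSE

# Schedule a function to run after five seconds.
later(function() {
  message("Hello, world")
  done <<- TRUE
}, delay = 5)

# Keep servicing the event loop until the callback has fired.
# timeoutSecs makes run_now() wait for a callback to become ready
# instead of spinning in a tight loop.
while (!done) {
  run_now(timeoutSecs = 1)
}
```

At an interactive console the while loop can be left out, because later also runs callbacks when the console is idle.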

So let's do this. Let's run this stuff here. And you won't be surprised to see that after five seconds, it prints out this message. Hello, world. And I know there's people out there who have already figured out how to implement this in about six lines of code. So this part is not really that difficult. But what you might be a little bit more surprised to see is that if I run this same code here, and I don't do run_now, and we wait five seconds, it will also print hello, world. So later has some C code that runs. And when your R console is idle, when the call stack is empty and you're not running anything else, it will continue running this event loop.

Okay. So you saw the plot watcher before, and this is the code. It's pretty simple. First, I'm just setting data to NULL and the last value to NULL. And then I have this function called plot watch. If the data's not NULL, and if it's different from the last value, then plot it, and then update the last value. So that's all really standard R code. The thing that's different is, right here, I call later with plot watch. So this function is rescheduling itself to run after a quarter second. And then after we define the function, we have to kick it off. We have to get it started by invoking the function once. And it's doing what I guess you might call a polling loop, where it just keeps running every quarter of a second. And every time data changes, it executes the plotting code.
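The plot watcher code itself is not in the transcript. The sketch below reconstructs it from the description; the names plot_watch and last_value, and the use of identical() for the comparison, are assumptions.

```r
library(later)

# The watched variable and the last value that was plotted.
data <- NULL
last_value <- NULL

plot_watch <- function() {
  # Re-plot only when data exists and has changed since the last plot.
  if (!is.null(data) && !identical(data, last_value)) {
    plot(data)
    last_value <<- data
  }
  # Reschedule this same function to run again in a quarter second.
  later(plot_watch, 0.25)
}

# Kick off the polling loop by invoking the function once.
plot_watch()
```

At the console, assigning something like data <- data.frame(x = 1:10, y = (1:10)^2) would then trigger a redraw within about a quarter of a second.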

The C API and thread safety

So later also has a C API. This is important when you're doing a lot of this... When you're doing asynchronous programming. Because a lot of times, you have to interface with external libraries. So from C code, you can call later to schedule a C function to execute. There's a C function later. And that C function that you scheduled to execute can in turn call an R function. And also, another very important point is that this function... The later function is thread safe. So you can have another thread schedule something to run. It can schedule an R function to run.

Okay. So that's a brief overview of later. What it does is pretty simple, but it unlocks a lot of possibilities. So here's some real world uses of later. And I'm gonna show you some stuff... Some demos and explain how some of these things work.

WebSocket client demo

So one thing that we've worked on recently is a WebSocket client. WebSocket is a protocol for, basically, communicating between computers, through a... well, through a fancy web server. So this is how it's used. I create a WebSocket. I say WebSocket$new and connect to the server. And then I tell it, hey, when you receive a message, invoke this function. So this is event-driven programming here. It doesn't actually call this function right away. What this function will do is print out, hey, I've received this, whatever the message is.

All right. So I've set up that event handler. And then at the end, I can say... I can tell it to send hello world. So let's see this in action. So if you haven't... If you can't see it, this is the same code that you saw before. And so I sent hello world, and what I received back is that same string reversed. So it's talking to a WebSocket server that takes a string, reverses it, and sends it back. So it's just doing this really basic transformation on it.
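A rough sketch of that client code, using the websocket package, is below. The server URL is a hypothetical placeholder, since the transcript does not give the address of the string-reversing server. Sending from an onOpen handler is also a choice made for this sketch, so that it works as a script; in the talk the message is sent from the console.

```r
# Hypothetical address; replace with a real WebSocket server.
url <- "ws://127.0.0.1:8080/"

# The websocket package relies on later's event loop, which runs on its
# own only when the console is idle, so this is guarded for interactive use.
if (interactive()) {
  ws <- websocket::WebSocket$new(url)

  # Event handler: invoked whenever a message arrives, not right away.
  ws$onMessage(function(event) {
    cat("Received:", event$data, "\n")
  })

  # Send once the connection is open.
  ws$onOpen(function(event) {
    ws$send("hello world")
  })
}
```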

All right. Now, the way that this works internally is it uses polling, similar to that plot watcher, or the data watcher, that we had before. So we have this arrow representing time. And several times per second, we're polling and we're checking if there's any input or output that has happened on this WebSocket. Now, at some arbitrary point in time, in R, you might tell it to send a message. Send hello world. And that puts it in this output queue. But that output queue isn't handled until this polling event here. And when we get to the polling event, that actually calls out to a C++ library that handles all this input and output stuff. And it sends a message to the server. When that happens, the server sends a message back. And then that message sits in the input queue until the next polling event happens. And when that occurs, then it calls the on message handler. And it prints out, hey, I've received this hello world reversed. And in this particular slide, the blue indicates that this is something that was triggered by the polling event that later provided.
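The polling design can be imitated in plain R. The sketch below is not the websocket package's implementation: it replaces the C++ I/O library and the remote server with a hypothetical fake_server function, and every name in it is an assumption. It only shows how messages wait in queues until the next poll scheduled by later.

```r
library(later)

output_queue <- character(0)  # messages waiting to be sent
input_queue  <- character(0)  # replies waiting to be handled
received     <- character(0)  # what the message handler has seen
polling      <- TRUE          # set to FALSE to stop the loop

# Hypothetical stand-in for the remote server: reverses a string.
fake_server <- function(msg) {
  paste(rev(strsplit(msg, "")[[1]]), collapse = "")
}

on_message <- function(msg) {
  received <<- c(received, msg)
  cat("Received:", msg, "\n")
}

# Sending only queues the message; nothing happens until the next poll.
send <- function(msg) {
  output_queue <<- c(output_queue, msg)
}

poll <- function() {
  # Deliver replies that arrived since the previous poll.
  for (msg in input_queue) on_message(msg)
  input_queue <<- character(0)
  # Flush outgoing messages; each reply waits in the input queue.
  for (msg in output_queue) {
    input_queue <<- c(input_queue, fake_server(msg))
  }
  output_queue <<- character(0)
  if (polling) later(poll, 0.1)
}

poll()
send("hello world")
```

The reply is printed two polls after send() is called, which is the latency cost of polling.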

Okay. So that's how the WebSocket package works. At least that's how the CRAN version of WebSocket works. On GitHub, there's actually... The development version of WebSocket has been changed to use threads. Actually, let me show you one thing real quick about this. So if you're familiar with this sort of programming, you know that polling is not the ideal way to do this. There's a little bit of latency that can occur between these events here. And it also... If you're doing a lot of polling really quick, it can use CPU time that shouldn't be... That really shouldn't be used. But it's simple to implement.

So the development version of WebSocket is threaded. This is... And so there's two threads. There's this main R thread. And then it launches another thread that handles the input and output. All right. So if I tell it to send a message from... And that happens in R. What it does is it calls the C function... Or C++ function send, which is thread safe. Which tells the IO thread, hey, I want you to send this message. And the IO thread at this point will be idle. And so it will deal with this request right away. Send the message to the server. Server sends something back. And if the IO thread is idle then, it will immediately use later to tell the R thread... This is the C function later. To tell R thread, hey, handle this message that I just got. And if R thread is sitting idle at that point, it will print it out. If R thread is busy at that time, like represented by that red bar there, then it will... It will wait until it's done doing whatever computation it's doing. And then it will handle that queued function.

All right. So these are two common ways of handling input and output, asynchronous I/O: with polling and with threads.

httpuv web server

Another use that we have is a web server called httpuv. This package, you might have heard of it before. It has an awkward name. It underlies Shiny and Plumber. And to start a web server, it's actually very simple. You just say startServer. You give it the port you want to listen on. And then this function here, which is a callback that gets invoked when an event happens, when an incoming request occurs. So when you get an incoming HTTP request, it calls this function and passes in this object called req. And in this case, I'm just having it create a web page that has the time and the path that was requested. And then it returns some other HTTP stuff along with that. So you can imagine that you can use this in all sorts of different ways. It just will execute an R function whenever you get an HTTP request.
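A small sketch of such a server follows. The port number, the HTML, and the app variable name are assumptions; startServer() and the req$PATH_INFO field come from httpuv.

```r
# The application is a list with a call() function that httpuv invokes
# once per incoming HTTP request.
app <- list(
  call = function(req) {
    body <- paste0(
      "<html><body>",
      "<p>Time: ", format(Sys.time()), "</p>",
      "<p>Path requested: ", req$PATH_INFO, "</p>",
      "</body></html>"
    )
    # The "other HTTP stuff": status code, headers, and the body.
    list(
      status = 200L,
      headers = list("Content-Type" = "text/html"),
      body = body
    )
  }
)

# Requests are served from later's event loop, which runs on its own
# only when the console is idle, so start the server interactively.
if (interactive()) {
  server <- httpuv::startServer("127.0.0.1", 8080, app)
  # httpuv::stopServer(server)  # run this to shut the server down
}
```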

So I'll show you this real quick in action. Gives you the time and the path that was requested. So that was just slash. I could say R studio conf. And then it prints out that path there.

Remote R console and Chromote demo

All right. Now I'm just gonna show you some other cool stuff. Some fun stuff. So before I started doing the talk, I also started up another httpuv application, which provides a remote R console, a REPL. So I can say, you know, 1, 1 plus 1, sum of 1 through 10. And this is causing R code to execute. So this is actually using a WebSocket connection. The web browser sends this string over the WebSocket. And then the R process gets the string, it evaluates it, captures the output, and then sends it back. So that's what's happening here. The interesting thing about this is it's actually running in the same R process that we've been doing all the rest of this stuff in.

And of course, you're familiar with Shiny. Shiny is based on httpuv, so I can run this Shiny app here, which is displaying the same data. And if I click this button to randomize, it does... You've seen this sort of thing with a Shiny app. And again, these are connected here. So when I randomize the data, I'm using poor Shiny programming practices where I'm modifying a global variable. But for this demo, it's good. And it's changing that plot right there.

Okay. So that's a taste of some of the things that are possible. There's one other thing I want to show. Using a package called Chromote, which is a headless Chrome web browser package, which uses a lot of asynchronous programming. And I won't go into depth about it right now. I just want to show some other fun stuff that you can do. So I can start a Chromote session. I'll navigate to the same application. And I'll take a screenshot of it. And when it takes a screenshot, it actually will cause that screenshot to appear in this viewer pane here. This is not the Shiny app here. This is a screenshot of a Shiny app. If you see, I can drag this around here. And just to prove to you that it's actually... Oops. I tried to run it in RStudio, but RStudio console is blocked. I have to run it over here. Because this one is driven by later. This is... You can get a view into the headless browser. And then you can sort of interact with it here. I can click on randomize over there. And then if I take a screenshot of it again, you'll see that... Sorry. Okay. If I take a screenshot again, you'll see that that button is sort of grayed, which indicates that it was selected, which is what I was doing over there.

Summary and the later ecosystem

All right. So that's the fun demos. All right. So let's get back to what we saw here. So we saw this WebSocket server that reverses strings. Web server that shows you the time. Web server with this remote R console, which you can use while you're running a Shiny application. Running a Shiny app. We had this plot data watcher. And we had this headless Chrome client. All of these use later. That's what makes it all possible for them to run concurrently. And this is all in one R process.

Okay. So one thing... I took a look at the later CRAN page. And I looked at the packages that were using later. And one thing I noticed was that even though it's been out for a couple of years, there's not that many packages that use it. And all of them are maintained by people that work at RStudio. So I'm hoping that... We have a lot of this knowledge internally. We've got battle scars from working on this stuff. But hopefully if you're working on async programming or parallelism or concurrency, this will be useful for you. And I have the URL here. The materials aren't actually up there right now, but they will be. Thank you.

Q&A

Thanks, Winston. That was fascinating. I can assure you that I am using your work. I have Chromote running on a scheduled basis to take photographs of the work my colleagues are doing. For those of you that are leaving, we are just carrying on with questions. So please do so quietly if you can.

Question number one. Does later deal with the issues of tail call stack optimization that could arise from the recursive event loop format?

It does not. That's actually a great question. I wanted to mention it, but I didn't have time. If you're going to be having the function call itself, you have to make sure not to create any closures that will keep increasing the depth of the call stack. So you have to create a function outside of the function that you're calling.

And the second question is, if later is running and you want to run another line of code, does the new line you are running jump ahead of later, or does it wait those five seconds to run?

It will jump ahead of later. And once five seconds have elapsed, whatever you've written, whatever you've done before, it won't wait an extra five seconds. It will wait for five seconds total, unless R is occupied at the moment that callback would occur.
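That answer can be checked with a short script. This is only an illustration under assumed names (start, elapsed): a callback is scheduled for one second, R is kept busy for two, and the callback can only run once R gets back to the event loop.

```r
library(later)

start <- Sys.time()
elapsed <- NULL

# Ask for the callback after one second.
later(function() {
  elapsed <<- as.numeric(Sys.time() - start, units = "secs")
}, delay = 1)

# Keep R occupied past the moment the callback is due.
Sys.sleep(2)

# In a script the event loop has to be run explicitly. At an idle
# console, later would run the callback without this call.
run_now()

cat("Callback ran after", round(elapsed, 2), "seconds\n")
```

The callback runs as soon as R is free again, not an additional second after that.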

Thank you. And last question. It's getting quite a few votes. Is there a way to prioritize different asynchronous streams?

I'm not sure I fully understand the question, so I'm sorry I can't answer that.

Winston, thank you very much. This was fascinating. Thank you.