
Alex Gold | Deploying End-To-End Data Science with Shiny, Plumber, and Pins | RStudio
It’s easier than ever to craft a complete R-centric data science pipeline thanks to packages like Shiny, Plumber, and Pins. In this talk, you’ll learn how to use R to bring your modeling and visualization work into production. You’ll walk away with recipes, tips, and tricks to deploy data, models, and apps to ensure your work is as impactful as possible. About Alex: Alex is a Solutions Engineer at RStudio, where he helps organizations succeed using R and RStudio products. Before coming to RStudio, Alex was a data scientist and worked on economic policy research, political campaigns, and federal consulting.
Transcript
This transcript was generated automatically and may contain errors.
I have great pleasure to introduce our first speaker today. He is my colleague, a good friend of mine, and so without further ado, I'd like us all to join me in welcoming Alex Gold.
Hey, everybody. Like Kelly said, my name is Alex. I'm here to talk today about end-to-end data science deployment with R Markdown, Shiny, Plumber, and Pins. The slides and code are available online at rstudio.io. Although as I'm saying this, I'm realizing that the slides are not there yet, but they will be. So look forward to that.
The Capital Bike Share app
So this story starts, as all great stories do, with a Shiny app. And this Shiny app in particular uses the Capital Bike Share data. So if you don't know what the Capital Bike Share is, it's the system of docked bicycles in Washington, D.C. That's what they look like. That's what the docks look like. You can rent a bike from a dock and ride it to a different dock and dock it there. And so, you know, what I'm going to try and do with this Shiny app is predict the number of bikes at different stations at different times.
So where's my window? Here's my window. So this is what the app looks like. I've got a map of Washington, D.C. I can click on a particular station. You'll see this is at 21st and M. And if I scroll down, now I can see the number of bikes that are available at that station or predicted to be available at that station in the next 24 hours. Important maybe if I wanted to go get a bike in D.C. at 8 p.m. tonight because there are going to be none at 21st and M. Too bad.
So let's talk a little bit about sort of how this app works. So the Capital Bike Share organization makes available an API that provides real-time bike data. How many bikes are at each station right now available on the API. And I'm going to import this using an R Markdown document. From there, I'm going to build an XGBoost model to do all my prediction for me. I'm going to do that also in R Markdown. I'm going to serve this model with a Plumber API. And if you're interested in Plumber, I'm not going to talk a lot more about it right now, but hang out because, like, the rest of this track is going to be a lot of Plumber. So that will be awesome. And then, of course, I'm serving this to the client app at the end.
And so I had this app, but I wanted to take it further. It just lives on my machine. I don't want to have to, like, walk my computer to people to show them my app. And so I knew where I wanted to get to was, you know, my Shiny app is available all the time to whoever wants it. The model gets retrained every day, right? Like weather's going to change, different seasons. I want to be able to have, you know, the model retraining periodically. And I need to do this data import every 20 minutes. It's real-time data. So once it's gone, it's gone. And so I want to get those real-time data, pull them in, and store them somewhere so I can do my predictions as a time series or panel kind of thing.
Principles for deployment
So I'm there. But that leads to a whole bunch of questions. What should I prioritize while I'm doing this? There are all these things that I could want to do as I'm trying to, like, you know, sort of deploy this app and, you know, add some horsepower. And these are, like, hard, scary, big questions. And really what they boil down to is, you know, is my deployment sophisticated enough? Am I doing it for real or not?
Of course, I then, you know, did what I sometimes do when I'm procrastinating, which was I started watching Star Wars, just because, you know, why not do that? And you all might know this guy. Green, big ears, talks funny. He's the Jedi Master Yoda, of course. And he says this line in the middle of one of the movies, size matters not. Look at me. Judge me by my size, do you? And I thought this was really appropriate. If you just sub in the word, like, sophistication here, this is a great way to think of why do we care about deploying Shiny apps? And, of course, we care because we want to increase value. We want to provide value. Sophistication is irrelevant, right? Are you providing value is the right question.
And so how do we deploy to increase value? How was I going to do that? And so there were sort of three principles that I took as my guides. And the first one was that I wanted to make my content accessible. And what that meant is that people who should have access to the content can get it when they need it and somewhere that they know where to find it. The second thing was reproducible, right? Can I reliably understand how the content exists, why it is the way it is? Can I sort of audit it? And then secure. And that's sort of the inverse of accessible, right? Should people who don't — people who shouldn't have access don't have access? That's pretty important.
Introducing pins
So, you know, the question is how do I get there? And this is honestly not a trivial problem, right? In particular, in between those important train steps with my data, I have to store that data somewhere. And then in between training and serving the model, I have to store that model somewhere. And until recently, this wasn't a solved problem. You know, I could put the analysis data in a database. If I have one handy, if I have right privileges to one, that's fine. But if I don't or if I don't want to put it in a database for some reason, there kind of was no good answer there. And similarly with modeling. There was kind of no good answer.
So I went back to Star Wars. Went back to, you know, not working on this. Because that was easier. And of course Yoda came to my rescue yet again. Saying, you know, when hopeless the task seems, come out with a sweet new R package. Someone will. It's inevitable.
You know? And of course that was true. The pins package is a new package by Javier Luraschi. This is a super cool package. I'm super excited about it. Basically what it lets you do is take something, an R object, put it somewhere else, and then get it back when you want it. This is sort of the essence of the pins package. But let's go into a little more detail about exactly how it works. So I'm going to start off with my RStudio session, and I have an object I want to save. You can think like a data frame. It could also be a model. It could be any sort of S3-compliant object. And I have a board also. And my board is where I'm going to save it. Pins currently supports a bunch of different boards: RStudio Connect, Kaggle, GitHub, Azure, Google Cloud Platform, S3, or a website. I will also add that Javier is constantly adding new boards. So if you want something that isn't here, put in a request on GitHub and he will probably have it done before the end of conf. It's kind of incredible.
But that aside, it supports a lot of different places that you can put your data. Great. So it's really easy to use as well. You put your data onto the board with the pin() command and get it back with the pin_get() command. So let me show you how this works just to be a little more concrete here.
So the way this works is that I've got my pins demo here. Let me make this bigger for y'all. Better. It's probably easier to see, right? So I'm going to load up the pins package. I'm going to register my board. In this case, I'm registering RStudio Connect. And this just tells the R session, you know, this is a board you can talk to. So it's going to register. Takes just a moment. And then I'm going to pin my data to the board, right? It's pin() with whatever the object is I'm pinning, the name of the object, and the board where it's going. And it's going. Taking a moment to load. And so here I am in the connections pane. And you can see that like any other type of connection, like a database connection, it loads up right here. And if what I pinned is a data frame, I can actually see what's going on there. I can see the different fields in my data frame. And I can also go over here. I can preview the table. I can see the first 1,000 rows. This is mtcars. There are only 32. But, you know, you can see up to 1,000. And then if I want to get the pin back, I just run this pin_get() command, saving it to a local object. And now I've got mtcars back.
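The commands from that demo look roughly like this. This is a sketch using the pins API as it existed at the time of the talk; the Connect server URL and the environment variable holding the API key are placeholders, not the actual values from the demo.

```r
# Sketch of the pins demo, legacy (pre-1.0) pins API.
# Server URL and API key below are placeholders.
library(pins)

# Register the board so this R session knows it can talk to it
board_register_rsconnect(server = "https://connect.example.com",
                         key = Sys.getenv("CONNECT_API_KEY"))

# Pin an object to the board: the object, a name, and the board
pin(mtcars, name = "mtcars", board = "rsconnect")

# Get the newest version back, from this session or any other
my_mtcars <- pin_get("mtcars", board = "rsconnect")
```

Anyone else who can reach the board only needs the board and the pin name; pin_get() always returns the newest version.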
To me, this is so cool. Because what's happened now is instead of having to email around data or having to save data somewhere and tell everybody where it is and version 2 final, actually final, Alex final, none of that. The person needs to know the name of the board or the URL of the board. And they need to know what the name of the pin is. And then what they get when they do a pin get is always the newest version. This is the thing that's really exciting. The newest version is always available on this pin.
What to pin
So the question is sort of, you know, what are good things to pin and what are bad things? Obviously you don't want to store every single R object you've ever used in a pin. So again, wisdom from Yoda. Just before Luke goes off to fight Vader at Cloud City. Before that, Yoda gives him some very sage advice. Best considered a caching mechanism, pins are. And what this means, A, it's kind of, you know, this movie came out in 1980 and this package came out in 2020. So this is very impressive to begin with. But even more impressive that he's right. And so, you know, you generally don't want to save your data of record in a pin. It's not a database. It doesn't have all the, like, bells and whistles of a database, being able to audit. It's not a database replacement. What pins are great for is something that you could compute at the time or you could get at the time, but it's easier to store it somewhere. You'd rather cache it than recreate it on demand.
So there are three attributes that good candidates for a pin have. They're relatively small, say under a gigabyte. There's no hard rule, but a gigabyte is a good rule of thumb. They're reused, meaning you actually want this thing over and over and over again in different pieces of your content. And they're current: you want the tip of the pin; you only care about the newest version. If you care about archival versions, a pin is not a great choice. But if you care about just the newest one, that's really good. And again, pins are for things you can recreate. It's not for your data of record. That's not a great choice. I mean, if you do it, whatever. I'm not going to stop you. But officially recommending against that.
Pins in the bike app
So let me show you, let's go back to our bike app here. And just show a little bit of where pins show up in this app. We're very zoomed in. And so here we have the app, right? Here, this is what our app looks like. We've got a map. We click on the map. We get the prediction of the number of bikes at different stations throughout the day. And so there are a couple places that pins show up here. And I'm going to be very honest, the code that goes into this app is quite large. It demos a lot of different things. So I'm going to go real quick through a couple little things here and ignore a lot of the rest of the code. So that's just how it's going to go. Sorry.
And so when I go to my pins here, I have my ingest and tidy part here. This is where I'm pulling in from the API. So I'm registering with both my database connection, right? That's this line 15. And in line 16 and 17, I'm registering the pins board. Because I'm going to put things in both places. So first what I'm going to do is pull down from the API the number of bikes. Oh, interesting. Maybe I should run this line first. There we go. And so what I'm going to do is I'm going to pull down the number of bikes. And you can see this is just a data frame that came back. All this code in here is just sort of interpreting the API call. And, you know, what I'm doing is I'm just pulling in the station ID and the number of bikes available. These are really sort of the most important fields to me. This is going to be my target when I'm doing my machine learning. And then I'm going to write this to my database. That's getting written to my database. And that's actually scheduled to happen on RStudio Connect every 20 minutes. So that's great. Taken care of.
The next piece, though, and this is what I'm going to pin. So this data frame, this data frame, this is the station information. So it's those IDs, those station IDs, which are just numeric. But I have the names here, which is what I'm putting, you know, pop up there in the app. I'm putting up the latitude and longitude, which I'm going to use both to put them on the map and also as a feature in my machine learning model, as well as the capacity of the stations, which is also an important feature of my machine learning model. And this is a perfect candidate for a pin. It's small. It's less than 600 rows. I reuse it several different times across this piece of content. And I always want the most current version. On my map, I don't want to be showing stations that don't exist or that have moved or anything like that. And so this is what I'm going to be pinning in my code.
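The ingest step described above might be sketched like this: the real-time bike counts are appended to a database, and the small, reused, always-current station metadata goes on a pin. The GBFS endpoint URLs, the DSN, and the table name here are assumptions for illustration, not necessarily what the app actually uses.

```r
# Hedged sketch of the ingest-and-tidy step (scheduled every 20 minutes).
# Endpoint URLs, DSN, and table name are illustrative assumptions.
library(pins)
library(httr)
library(DBI)

board_register_rsconnect()                        # pins board
con <- dbConnect(odbc::odbc(), "bikeshare-db")    # database for the time series

# Real-time bike counts: append to the database so we build a panel over time
status <- content(GET("https://gbfs.capitalbikeshare.com/gbfs/en/station_status.json"))
bikes <- do.call(rbind, lapply(status$data$stations, function(s)
  data.frame(station_id = s$station_id,
             num_bikes_available = s$num_bikes_available,
             time = Sys.time())))
dbWriteTable(con, "bike_station_status", bikes, append = TRUE)

# Station metadata: small, reused, always want the newest version -- a good pin
info <- content(GET("https://gbfs.capitalbikeshare.com/gbfs/en/station_information.json"))
stations <- do.call(rbind, lapply(info$data$stations, function(s)
  data.frame(station_id = s$station_id, name = s$name,
             lat = s$lat, lon = s$lon, capacity = s$capacity)))
pin(stations, name = "bike_station_info", board = "rsconnect")
```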
Similarly, then when I go on to build my XGBoost model, again, I'm going to skip through a lot of code here. But you can see that I'm pulling in another pin here. This is my modeling parameters, actually, that I'm writing elsewhere. But what I really want to show is down here, I'm training my model with parsnip's xgb_train(). And then what I'm saving here is this list object, right? This could be a formal S3 object. It's not. It's just a list. But I have my model. I'm saving some metadata with it as well: the training date, the splitting date between my testing and training set, and the recipe that I'm using from the recipes package to actually make my training data. And then I'm pinning that here onto RStudio Connect. And this is super useful, right? Now I have a really good way to know what I've done.
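Pinning the model together with its metadata might look like the following. The object names here (model_fit, split_date, prepped_recipe) and the pin name are stand-ins, not the app's actual names.

```r
# Sketch: bundle the fitted model with its metadata and pin the bundle.
# model_fit, split_date, and prepped_recipe are illustrative placeholders.
library(pins)

model_details <- list(
  model = model_fit,            # the fitted XGBoost model
  train_date = Sys.Date(),      # when it was trained
  split_date = split_date,      # test/train split point
  recipe = prepped_recipe       # recipes object used to prep the data
)

pin(model_details, name = "bike_predict_model", board = "rsconnect")
```

Because it's just a list, anything that helps you audit the model later can ride along with it.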
And then last, my Plumber API that's serving those predictions, I've got my model. I'm reading it in here. And then all this code basically is just sort of interpreting the input from the API and making sure it's ready to be fed into the model. And then down here, right, I'm running that predict command on that model that's from the pin. And this is, you know, purring away on RStudio Connect. And so, you know, I can come here to the Swagger interface and get back, you know, these are the predictions for station 75. I don't know what station that is. But that's why I have the other one, the other pin with the station names.
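A minimal plumber.R along those lines might look like this; the pin name, route, and the make_features() helper are assumptions, and the real API does more input handling than shown here.

```r
# plumber.R -- hedged sketch of an API serving predictions from the pinned model.
# Pin name, route, and make_features() are illustrative, not the app's actual code.
library(plumber)
library(pins)
library(recipes)

board_register_rsconnect()
model_details <- pin_get("bike_predict_model", board = "rsconnect")

#* Predict bikes available at a station
#* @param station_id The numeric station ID, e.g. 75
#* @get /predict
function(station_id) {
  # Turn the raw input into the feature frame the model expects
  new_data <- make_features(station_id)   # hypothetical helper
  baked <- bake(model_details$recipe, new_data = new_data)
  predict(model_details$model, new_data = baked)
}
```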
So that's how pins sort of fit into this app. So now I've got a solution for this, right? For my analysis data, I can do it in a database or in a pin. For my model, I can do it in a pin. One thing I do want to say is, like, these are just slots. And you can put whatever you want into these slots. So, for example, you know, if you're using some model in Python or some import method in Python, you can totally do that and use reticulate to go back and forth between R and Python super easily. If what you're serving isn't a model, if it's, you know, plots or just some kind of counting, right, the same principle totally works. And when you're sharing, right, if what you've got is an R Markdown document that you want to render and email out or something like that, again, think of these as slots. You know, my version happened to be R Markdown, R Markdown Plumber Shiny app, but yours can be whatever you want.
Security and deployment options
Okay. So pins has really helped me a lot make this content more accessible, more reproducible. But I haven't really addressed security at all. And that's a really important piece here. The security of my app. And so, of course, I went back to Yoda. And, you know, in Return of the Jedi, as Yoda is dying, fading away after 800 years of training Jedi knights, witnessing the downfall. Sorry, I really like Star Wars. He gives this one final piece of advice to Luke, and he says, deploy to the right server and secure your data will be. Which, you know, I think is very prescient of him.
And so, you know, you can do this in an open source way if that's what you're more comfortable doing. For this important training step, put it on a server; any server that has R works. You can use a cron job if you want to schedule it. For your analysis data and your model, if you have a database, use that for the database stuff, or a pins board; I showed there are a whole bunch of options, right? For serving the model, any server with R, again, can work. You do have some things you have to be delicate with here around how auth works, right? So you can do that with networking rules, you can put a proxy in front of it, you can, like, route it through a Lambda gateway to take care of API keys. You have lots of options, but you're going to want to do something to secure this API. And then for sharing, you have a lot of options. If it's an R Markdown document, either a website or a rendered document, Netlify is a great choice. Shiny Server Open Source is available to you, and shinyapps.io is also an option, which is not open source, but is relatively affordable. So, you know, that can be a really good option here if you're doing a private thing or something that's for your own use.
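For the cron route, the schedule described in this talk (ingest every 20 minutes, retrain the model daily) might look like this in a crontab; the file paths are placeholders.

```
# Illustrative crontab entries -- paths are placeholders.
# Render the ingest document every 20 minutes:
*/20 * * * * Rscript -e 'rmarkdown::render("/srv/bike/ingest.Rmd")'
# Retrain the model once a day at 2 a.m.:
0 2 * * * Rscript -e 'rmarkdown::render("/srv/bike/train_model.Rmd")'
```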
Of course, we see a lot of organizations that sort of struggle with pieces of this. Either they don't want to manage this many sort of separate pieces, their IT organizations don't want to manage stuff in R. We see that a lot. That's totally fine. And so, of course, sort of easy mode here is our professional product, RStudio Connect, and that takes care of a lot of these pieces for you, right? It can deal with the API keys, it can deal with scheduling the R Markdown documents, it can deal with securing the Plumber APIs, it can deal with accessibility controls and that kind of stuff, and you can put pins on RStudio Connect. So this can be a really useful way to do all of this.
Takeaways
So, big takeaways. When you are thinking about deploying content, the top three things to really optimize for are accessibility, reproducibility, and security, right? If you're hitting these three keys, you're probably increasing value with what you're doing as opposed to just adding sophistication. While you're doing that, R Markdown, Shiny, Plumber, and Pins are good friends of yours. And, of course, may the force be with you. Thanks, everybody.
Q&A
So, I failed to mention at the top of things here, but if you have questions, we're going to continue using Slido as we did in the keynote. So we're in Grand Ballroom B, so if you have a question, go to the same Slido page, select Grand Ballroom B, and you can enter your questions there or vote on questions that are already asked there. So I'm going to ask Alex some questions here since we have a couple minutes. Maybe this was covered in the last half of your presentation, but do you want to readdress how pins is different from a well-organized shared file system with correct ACLs?
Yeah, that's a great question. To be honest, under the hood, it's not that different, right? You have the same kind of control and access controls. The main difference is that it's a very R-centric answer, and so you can, from within R, address those objects without having to go through the hassle of figuring out how your shared file system works, finding things and that sort of thing. So from an IT perspective, maybe not that different, but from the perspective of an R developer, it's a very different development experience. It's much more straightforward.
Yeah. This next one we get a lot, so I'm happy to ask it here. What's the benefit of using a Plumber API for saving the object to S3 and then calling it with a Shiny app?
Oh, that's a great question. I love that question. There are a couple benefits. The biggest benefit to me, and this happens in several different ways, is that it decouples the logic of serving the model from the logic of displaying whatever you're displaying or doing something else with it. So you decouple those things, and that makes your life easier. It makes your life easier in a couple ways. One of them is purely from a code perspective. It makes it easier to write the code because you write your Plumber API to serve the model, you write your Shiny app or whatever it is to consume the model, and you get to sort of separate those concerns, which is really nice. The other piece, from a performance perspective, is you can test those two things independently. You can test the Plumber API with, say, the loadtest package, which I think some people might be talking about next. And then you can test the Shiny app separately. You can decouple both the coding and the performance issues between the API and the app, which is great.
Excellent. Okay. Next, is there a limit to the amount of data you should be using with pins?

It depends on which board you're using. Some of those boards do have limits to how much you can save. You know, my suggestion would be that if you're saving more than a gigabyte's worth of data, there's probably a better solution for what you're doing. And that's totally a rule of thumb, not a real thing. But just in general, if you're finding yourself doing more than that, you're probably saving an object that's bigger than pins really is designed for. One thing that can be helpful is if you're saving model objects, often the model objects package up a lot of stuff with them that you don't necessarily want, like the training data. And that could make your model really big and sort of inflate the size. There is a package called butcher, like, you know, a person who makes meat cuts, that you can use to sort of chop out some of the stuff in your model object you don't want and just save, like, actually the model. So if you're saving models, that can be a useful thing to do.
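As a sketch of that butcher workflow, where model_fit stands in for whatever fitted model you have:

```r
# Trim a fitted model with butcher before pinning it.
# model_fit and the pin name are illustrative; which components butcher
# can strip depends on the model class.
library(butcher)
library(pins)

lobstr::obj_size(model_fit)        # size before trimming
small_fit <- butcher(model_fit)    # strip training data, environments, etc.
lobstr::obj_size(small_fit)        # typically much smaller
pin(small_fit, name = "bike_predict_model", board = "rsconnect")
```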
Cool. Do one more here. I like this one as well. Is there a reason your import and model training code is in R Markdown files instead of plain R scripts?
Yes, that's a great question. I love it when things that I put in the presentation and then took out because I didn't have time come back in questions. Yeah, so I really love R Markdown for two main reasons. One is that it encourages literate programming that is interspersing code and prose, right? Because you have just, like, a place to write prose. It doesn't even have to be comments. You just write it, and it's great. The other reason that you have a record of what you did. So when you come back later and something has gone awry with your model, you have output there with the code of what actually happened, unlike if it's just an R script and then whatever happened was totally ephemeral and it's gone forever.
Well, everyone, please join me in thanking Alex again. Thanks, everybody.
