
Alex Gold | Deploying End-To-End Data Science with Shiny, Plumber, and Pins | RStudio
It’s easier than ever to craft a complete R-centric data science pipeline thanks to packages like Shiny, Plumber, and Pins. In this talk, you’ll learn how to use R to bring your modeling and visualization work into production. You’ll walk away with recipes, tips, and tricks to deploy data, models, and apps to ensure your work is as impactful as possible. About Alex: Alex is a Solutions Engineer at RStudio, where he helps organizations succeed using R and RStudio products. Before coming to RStudio, Alex was a data scientist and worked on economic policy research, political campaigns, and federal consulting.
Transcript
This transcript was generated automatically and may contain errors.
I have great pleasure to introduce our first speaker today. He is my colleague, a good friend of mine, and so without further ado, I'd like us all to join me in welcoming Alex Gold.
Hey, everybody. Like Kelly said, my name is Alex. I'm here to talk today about end-to-end data science deployment with R Markdown, Shiny, Plumber, and Pins. The slides and code are available online at rstudio.io. Although as I'm saying this, I'm realizing that the slides are not there yet, but they will be. So look forward to that.
The Capital Bike Share app
So this story starts, as all great stories do, with a Shiny app. And this Shiny app in particular uses the Capital Bike Share data. So if you don't know what the Capital Bike Share is, it's the system of docked bicycles in Washington, D.C. That's what they look like. That's what the docks look like. You can rent a bike from a dock and ride it to a different dock and dock it there. And so, you know, what I'm going to try and do with this Shiny app is predict the number of bikes at different stations at different times.
So where's my window? Here's my window. So this is what the app looks like. I've got a map of Washington, D.C. I can click on a particular station. You'll see this is at 21st and M. And if I scroll down, now I can see the number of bikes that are available at that station or predicted to be available at that station in the next 24 hours. Important maybe if I wanted to go get a bike in D.C. at 8 p.m. tonight because there are going to be none at 21st and M. Too bad.
So let's talk a little bit about sort of how this app works. So the Capital Bike Share organization makes available an API that provides real-time bike data. How many bikes are at each station right now available on the API. And I'm going to import this using an R Markdown document. From there, I'm going to build an XGBoost model to do all my prediction for me. I'm going to do that also in R Markdown. I'm going to serve this model with a Plumber API. And if you're interested in Plumber, I'm not going to talk a lot more about it right now, but hang out because, like, the rest of this track is going to be a lot of Plumber. So that will be awesome. And then, of course, I'm serving this to the client app at the end.
And so I had this app, but I wanted to take it further. It just lives on my machine. I don't want to have to, like, walk my computer to people to show them my app. And so I knew where I wanted to get to was, you know, my Shiny app is available all the time to whoever wants it. The model gets retrained every day, right? Like weather's going to change, different seasons. I want to be able to have, you know, the model retraining periodically. And I need to do this data import every 20 minutes. It's real-time data. So once it's gone, it's gone. And so I want to get those real-time data, pull them in, and store them somewhere so I can do my predictions as a time series or panel kind of thing.
Principles for deployment
So I'm there. But that leads to a whole bunch of questions. What should I prioritize while I'm doing this? There are all these things that I could want to do as I'm trying to, like, you know, sort of deploy this app and, you know, add some horsepower. And these are, like, hard, scary, big questions. And really what they boil down to is, you know, is my deployment sophisticated enough? Am I doing it for real or not?
Of course, I then, you know, did what I sometimes do when I'm procrastinating, which was I started watching Star Wars, just because, you know, why not do that? And you all might know this guy. Green, big ears, talks funny. He's the Jedi Master Yoda, of course. And he says this line in the middle of one of the movies, size matters not. Look at me. Judge me by my size, do you? And I thought this was really appropriate. If you just sub in the word, like, sophistication here, this is a great way to think of why do we care about deploying Shiny apps? And, of course, we care because we want to increase value. We want to provide value. Sophistication is irrelevant, right? Are you providing value is the right question.
And so how do we deploy to increase value? How was I going to do that? And so there were sort of three principles that I took as my guides. And the first one was that I wanted to make my content accessible. And what that meant is that people who should have access to the content can get it when they need it and somewhere that they know where to find it. The second thing was reproducible, right? Can I reliably understand how the content exists, why it is the way it is? Can I sort of audit it? And then secure. And that's sort of the inverse of accessible, right? Should people who don't — people who shouldn't have access don't have access? That's pretty important.
Introducing pins
So, you know, the question is how do I get there? And this is honestly not a trivial problem, right? In particular, in between those important train steps with my data, I have to store that data somewhere. And then in between training and serving the model, I have to store that model somewhere. And until recently, this wasn't a solved problem. You know, I could put the analysis data in a database. If I have one handy, if I have right privileges to one, that's fine. But if I don't or if I don't want to put it in a database for some reason, there kind of was no good answer there. And similarly with modeling. There was kind of no good answer.
So I went back to Star Wars. Went back to, you know, not working on this. Because that was easier. And of course Yoda came to my rescue yet again. Saying, you know, when hopeless the task seems, come out with a sweet new R package. Someone will. It's inevitable.
You know? And of course that was true. The pins package is a new package by Javier Luraschi. This is a super cool package. I'm super excited about it. Basically what it lets you do is take something, an R object, put it somewhere else, and then get it back when you want it. This is sort of the essence of the pins package. But let's go into a little more detail about exactly how it works. So I'm going to start off with my RStudio session, and I have an object I want to save. You can think like a data frame. It could also be a model. It could be any sort of S3-compliant object. And I have a board also. And my board is where I'm going to save it. Pins currently supports a bunch of different boards: RStudio Connect, Kaggle, GitHub, Azure, Google Cloud Platform, S3, or a website. I will also add that Javier is constantly adding new boards. So if you want something that isn't here, put in a request on GitHub and he will probably have it done before the end of conf. It's kind of incredible.
But that aside, it supports a lot of different places that you can put your data. Great. So it's really easy to use as well. You put your data onto the board with the pin() command and get it back with the pin_get() command. So let me show you how this works just to be a little more concrete here.
So the way this works is that I've got my pins demo here. Let me make this bigger for y'all. Better. It's probably easier to see, right? So I'm going to load up the pins package. I'm going to register my board. In this case, I'm registering RStudio Connect. And this just tells the R session, you know, this is a board you can talk to. So it's going to register. Takes just a moment. And then I'm going to pin my data to the board, right? It's pin() with whatever the object is I'm pinning, the name of the object, and the board where it's going. And it's going. Taking a moment to load. And so here I am in the connections pane. And you can see that like any other type of connection, like a database connection, it loads up right here. And if what I pinned is a data frame, I can actually see what's going on there. I can see the different fields in my data frame. And I can also go over here. I can preview the table. I can see the first 1,000 rows. This is mtcars. There are only 32. But, you know, you can see up to 1,000. And then if I want to get the pin back, I just run this pin_get() command, saving it to a local object. And now I've got mtcars back.
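The commands from that demo look roughly like this. This is a sketch using the pins API as it existed at the time of the talk; the Connect server URL and the environment variable holding the API key are placeholders, not the actual values from the demo.

```r
# Sketch of the pins demo, legacy (pre-1.0) pins API.
# Server URL and API key below are placeholders.
library(pins)

# Register the board so this R session knows it can talk to it
board_register_rsconnect(server = "https://connect.example.com",
                         key = Sys.getenv("CONNECT_API_KEY"))

# Pin an object to the board: the object, a name, and the board
pin(mtcars, name = "mtcars", board = "rsconnect")

# Get the newest version back, from this session or any other
my_mtcars <- pin_get("mtcars", board = "rsconnect")
```

Anyone else who can reach the board only needs the board and the pin name; pin_get() always returns the newest version.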
To me, this is so cool. Because what's happened now is instead of having to email around data or having to save data somewhere and tell everybody where it is and version 2 final, actually final, Alex final, none of that. The person needs to know the name of the board or the URL of the board. And they need to know what the name of the pin is. And then what they get when they do a pin get is always the newest version. This is the thing that's really exciting. The newest version is always available on this pin.
What to pin
So the question is sort of, you know, what are good things to pin and what are bad things? Obviously you don't want to store every single R object you've ever used in a pin. So again, wisdom from Yoda. Just before Luke goes off to fight Vader at Cloud City. Before that, Yoda gives him some very sage advice. Best considered a caching mechanism, pins are. And what this means, A, it's kind of, you know, this movie came out in 1980 and this package came out in 2020. So this is very impressive to begin with. But even more impressive that he's right. And so, you know, you generally don't want to save your data of record in a pin. It's not a database. It doesn't have all the, like, bells and whistles of a database, being able to audit. It's not a database replacement. What pins are great for is something that you could compute at the time or you could get at the time, but it's easier to store it somewhere. You'd rather cache it than recreate it on demand.
So there are three attributes that good candidates for a pin have. They're relatively small, say under a gigabyte. There's no hard rule, but a gigabyte is a good rule of thumb. They're reused, meaning you actually want this thing over and over and over again in different pieces of your content. And they're current: you want the tip of the pin; you only care about the newest version. If you care about archival versions, a pin is not a great choice. But if you care about just the newest one, that's really good. And again, pins are for things you can recreate. It's not for your data of record. That's not a great choice. I mean, if you do it, whatever. I'm not going to stop you. But officially recommending against that.
Pins in the bike app
So let me show you, let's go back to our bike app here. And just show a little bit of where pins show up in this app. We're very zoomed in. And so here we have the app, right? Here, this is what our app looks like. We've got a map. We click on the map. We get the prediction of the number of bikes at different stations throughout the day. And so there are a couple places that pins show up here. And I'm going to be very honest, the code that goes into this app is quite large. It demos a lot of different things. So I'm going to go real quick through a couple little things here and ignore a lot of the rest of the code. So that's just how it's going to go. Sorry.
And so when I go to my pins here, I have my ingest and tidy part here. This is where I'm pulling in from the API. So I'm registering with both my database connection, right? That's this line 15. And in line 16 and 17, I'm registering the pins board. Because I'm going to put things in both places. So first what I'm going to do is pull down from the API the number of bikes. Oh, interesting. Maybe I should run this line first. There we go. And so what I'm going to do is I'm going to pull down the number of bikes. And you can see this is just a data frame that came back. All this code in here is just sort of interpreting the API call. And, you know, what I'm doing is I'm just pulling in the station ID and the number of bikes available. These are really sort of the most important fields to me. This is going to be my target when I'm doing my machine learning. And then I'm going to write this to my database. That's getting written to my database. And that's actually scheduled to happen on RStudio Connect every 20 minutes. So that's great. Taken care of.
The next piece, though, and this is what I'm going to pin. So this data frame, this data frame, this is the station information. So it's those IDs, those station IDs, which are just numeric. But I have the names here, which is what I'm putting, you know, pop up there in the app. I'm putting up the latitude and longitude, which I'm going to use both to put them on the map and also as a feature in my machine learning model, as well as the capacity of the stations, which is also an important feature of my machine learning model. And this is a perfect candidate for a pin. It's small. It's less than 600 rows. I reuse it several different times across this piece of content. And I always want the most current version. On my map, I don't want to be showing stations that don't exist or that have moved or anything like that. And so this is what I'm going to be pinning in my code.
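The ingest step described above might be sketched like this: the real-time bike counts are appended to a database, and the small, reused, always-current station metadata goes on a pin. The GBFS endpoint URLs, the DSN, and the table name here are assumptions for illustration, not necessarily what the app actually uses.

```r
# Hedged sketch of the ingest-and-tidy step (scheduled every 20 minutes).
# Endpoint URLs, DSN, and table name are illustrative assumptions.
library(pins)
library(httr)
library(DBI)

board_register_rsconnect()                        # pins board
con <- dbConnect(odbc::odbc(), "bikeshare-db")    # database for the time series

# Real-time bike counts: append to the database so we build a panel over time
status <- content(GET("https://gbfs.capitalbikeshare.com/gbfs/en/station_status.json"))
bikes <- do.call(rbind, lapply(status$data$stations, function(s)
  data.frame(station_id = s$station_id,
             num_bikes_available = s$num_bikes_available,
             time = Sys.time())))
dbWriteTable(con, "bike_station_status", bikes, append = TRUE)

# Station metadata: small, reused, always want the newest version -- a good pin
info <- content(GET("https://gbfs.capitalbikeshare.com/gbfs/en/station_information.json"))
stations <- do.call(rbind, lapply(info$data$stations, function(s)
  data.frame(station_id = s$station_id, name = s$name,
             lat = s$lat, lon = s$lon, capacity = s$capacity)))
pin(stations, name = "bike_station_info", board = "rsconnect")
```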
Similarly, then when I go on to build my XGBoost model, again, I'm going to skip through a lot of code here. But you can see that I'm pulling in another pin here. This is my modeling parameters, actually, that I'm writing elsewhere. But what I really want to show is down here, I'm training my model with parsnip's xgb_train(). And then what I'm saving here is this list object, right? This could be a formal S3 object. It's not. It's just a list. But I have my model. I'm saving some metadata with it as well: the training date, the splitting date between my testing and training set, and the recipe that I'm using from the recipes package to actually make my training data. And then I'm pinning that here onto RStudio Connect. And this is super useful, right? Now I have a really good way to know what I've done.
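Pinning the model together with its metadata might look like the following. The object names here (model_fit, split_date, prepped_recipe) and the pin name are stand-ins, not the app's actual names.

```r
# Sketch: bundle the fitted model with its metadata and pin the bundle.
# model_fit, split_date, and prepped_recipe are illustrative placeholders.
library(pins)

model_details <- list(
  model = model_fit,            # the fitted XGBoost model
  train_date = Sys.Date(),      # when it was trained
  split_date = split_date,      # test/train split point
  recipe = prepped_recipe       # recipes object used to prep the data
)

pin(model_details, name = "bike_predict_model", board = "rsconnect")
```

Because it's just a list, anything that helps you audit the model later can ride along with it.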
And then last, my Plumber API that's serving those predictions, I've got my model. I'm reading it in here. And then all this code basically is just sort of interpreting the input from the API and making sure it's ready to be fed into the model. And then down here, right, I'm running that predict command on that model that's from the pin. And this is, you know, purring away on RStudio Connect. And so, you know, I can come here to the Swagger interface and get back, you know, these are the predictions for station 75. I don't know what station that is. But that's why I have the other one, the other pin with the station names.
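A minimal plumber.R along those lines might look like this; the pin name, route, and the make_features() helper are assumptions, and the real API does more input handling than shown here.

```r
# plumber.R -- hedged sketch of an API serving predictions from the pinned model.
# Pin name, route, and make_features() are illustrative, not the app's actual code.
library(plumber)
library(pins)
library(recipes)

board_register_rsconnect()
model_details <- pin_get("bike_predict_model", board = "rsconnect")

#* Predict bikes available at a station
#* @param station_id The numeric station ID, e.g. 75
#* @get /predict
function(station_id) {
  # Turn the raw input into the feature frame the model expects
  new_data <- make_features(station_id)   # hypothetical helper
  baked <- bake(model_details$recipe, new_data = new_data)
  predict(model_details$model, new_data = baked)
}
```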
So that's how pins sort of fit into this app. So now I've got a solution for this, right? For my analysis data, I can do it in a database or in a pin. For my model, I can do it in a pin. One thing I do want to say is, like, these are just slots. And you can put whatever you want into these slots. So, for example, you know, if you're using some model in Python or some import method in Python, you can totally do that and use reticulate to go back and forth between R and Python super easily. If what you're serving isn't a model, if it's, you know, plots or just some kind of counting, right, the same principle totally works. And when you're sharing, right, if what you've got is an R Markdown document that you want to render and email out or something like that, again, think of these as slots. You know, my version happened to be R Markdown, R Markdown Plumber Shiny app, but yours can be whatever you want.
Security and deployment options
Okay. So pins has really helped me a lot make this content more accessible, more reproducible. But I haven't really addressed security at all. And that's a really important piece here. The security of my app. And so, of course, I went back to Yoda. And, you know, in Return of the Jedi, as Yoda is dying, fading away after 800 years of training Jedi knights, witnessing the downfall. Sorry, I really like Star Wars. He gives this one final piece of advice to Luke, and he says, deploy to the right server and secure your data will be. Which, you know, I think is very prescient of him.
And so, you know, you can do this in an open source way if that's what you're more comfortable doing. For this important training step, put it on a server; any server that has R works. You can use a cron job if you want to schedule it. For your analysis data and your model, if you have a database, use that for the database stuff, or a pins board; I showed there are a whole bunch of options, right? For serving the model, any server with R, again, can work. You do have some things you have to be delicate with here around how auth works, right? So you can do that with networking rules, you can put a proxy in front of it, you can, like, route it through a Lambda gateway to take care of API keys. You have lots of options, but you're going to want to do something to secure this API. And then for sharing, you have a lot of options. If it's an R Markdown document, either a website or a rendered document, Netlify is a great choice. Shiny Server Open Source is available to you, and shinyapps.io is also an option, which is not open source, but is relatively affordable. So, you know, that can be a really good option here if you're doing a private thing or something that's for your own use.
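For the cron route, the schedule described in this talk (ingest every 20 minutes, retrain the model daily) might look like this in a crontab; the file paths are placeholders.

```
# Illustrative crontab entries -- paths are placeholders.
# Render the ingest document every 20 minutes:
*/20 * * * * Rscript -e 'rmarkdown::render("/srv/bike/ingest.Rmd")'
# Retrain the model once a day at 2 a.m.:
0 2 * * * Rscript -e 'rmarkdown::render("/srv/bike/train_model.Rmd")'
```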
Of course, we see a lot of organizations that sort of struggle with pieces of this. Either they don't want to manage this many sort of separate pieces, their IT organizations don't want to manage stuff in R. We see that a lot. That's totally fine. And so, of course, sort of easy mode here is our professional product, RStudio Connect, and that takes care of a lot of these pieces for you, right? It can deal with the API keys, it can deal with scheduling the R Markdown documents, it can deal with securing the Plumber APIs, it can deal with accessibility controls and that kind of stuff, and you can put pins on RStudio Connect. So this can be a really useful way to do all of this.
Takeaways
So, big takeaways. When you are thinking about deploying content, the top three things to really optimize for are accessibility, reproducibility, and security, right? If you're hitting these three keys, you're probably increasing value with what you're doing as opposed to just adding sophistication. While you're doing that, R Markdown, Shiny, Plumber, and Pins are good friends of yours. And, of course, may the force be with you. Thanks, everybody.
Q&A
So, I failed to mention at the top of things here, but if you have questions, we're going to continue using Slido as we did in the keynote. So we're in Grand Ballroom B, so if you have a question, go to the same Slido page, select Grand Ballroom B, and you can enter your questions there or vote on questions that are already asked there. So I'm going to ask Alex some questions here since we have a couple minutes. Maybe this was covered in the last half of your presentation, but do you want to readdress how pins is different from a well-organized shared file system with correct ACLs?
Yeah, that's a great question. To be honest, under the hood, it's not that different, right? You have the same kind of control and access controls. The main difference is that it's a very R-centric answer, and so you can, from within R, address those objects without having to go through the hassle of figuring out how your shared file system works, finding things and that sort of thing. So from an IT perspective, maybe not that different, but from the perspective of an R developer, it's a very different development experience. It's much more straightforward.
Yeah. This next one we get a lot, so I'm happy to ask it here. What's the benefit of using a Plumber API for saving the object to S3 and then calling it with a Shiny app?
Oh, that's a great question. I love that question. There are a couple benefits. The biggest benefit to me, and this happens in several different ways, is that it decouples the logic of serving the model from the logic of displaying whatever you're displaying or doing something else with it. So you decouple those things, and that makes your life easier. It makes your life easier in a couple ways. One of them is purely from a code perspective. It makes it easier to write the code because you write your Plumber API to serve the model, you write your Shiny app or whatever it is to consume the model, and you get to sort of separate those concerns, which is really nice. The other piece, from a performance perspective, is you can test those two things independently. You can test the Plumber API with, say, the loadtest package, which I think some people might be talking about next. And then you can test the Shiny app separately. You can decouple both the coding and the performance issues between the API and the app, which is great.
Excellent. Okay. Next, is there a limit to the amount of data you should be using with pins?

It depends on which board you're using. Some of those boards do have limits to how much you can save. You know, my suggestion would be that if you're saving more than a gigabyte's worth of data, there's probably a better solution for what you're doing. And that's totally a rule of thumb, not a real thing. But just in general, if you're finding yourself doing more than that, you're probably saving an object that's bigger than pins really is designed for. One thing that can be helpful is if you're saving model objects, often the model objects package up a lot of stuff with them that you don't necessarily want, like the training data. And that could make your model really big and sort of inflate the size. There is a package called butcher, like, you know, a person who makes meat cuts, that you can use to sort of chop out some of the stuff in your model object you don't want and just save, like, actually the model. So if you're saving models, that can be a useful thing to do.
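As a sketch of that butcher workflow, where model_fit stands in for whatever fitted model you have:

```r
# Trim a fitted model with butcher before pinning it.
# model_fit and the pin name are illustrative; which components butcher
# can strip depends on the model class.
library(butcher)
library(pins)

lobstr::obj_size(model_fit)        # size before trimming
small_fit <- butcher(model_fit)    # strip training data, environments, etc.
lobstr::obj_size(small_fit)        # typically much smaller
pin(small_fit, name = "bike_predict_model", board = "rsconnect")
```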
Cool. Do one more here. I like this one as well. Is there a reason your import and model training code is in R Markdown files instead of plain R scripts?
Yes, that's a great question. I love it when things that I put in the presentation and then took out because I didn't have time come back in questions. Yeah, so I really love R Markdown for two main reasons. One is that it encourages literate programming that is interspersing code and prose, right? Because you have just, like, a place to write prose. It doesn't even have to be comments. You just write it, and it's great. The other reason that you have a record of what you did. So when you come back later and something has gone awry with your model, you have output there with the code of what actually happened, unlike if it's just an R script and then whatever happened was totally ephemeral and it's gone forever.
Well, everyone, please join me in thanking Alex again. Thanks, everybody.
