
RStudio's {pins} package: what it is, how it works, and what it can do for you! || RStudio
00:00 Introduction
00:09 What is the pins package?
01:49 pins - not just for RStudio Connect!
02:31 pins use cases
04:47 How to use pins instead of final_final_01_noreallyfinal.xls
06:37 How do pin boards work?
08:55 Getting started with pins
10:42 Versioning with pins at the board or pin level
11:47 pins and caching
12:13 Things you shouldn't pin
14:00 Major functions in the pins package
17:21 Using pin_upload( ) and pin_download( )
19:52 pins and Google Cloud
21:27 pins and modelops with the vetiver package
Learn more about the pins package here: https://pins.rstudio.com/
Got questions? The RStudio Community site is a great place to get assistance: https://community.rstudio.com/
Content: Katie Masiello (@katieontheridge) and Jesse Mostipak (@kierisi)
Animation, motion design, and editing: Jesse Mostipak (@kierisi)
Theme song: Contrarian by Blue Dot Sessions (https://app.sessions.blue/browse/track/64281)
Transcript
This transcript was generated automatically and may contain errors.
So pins provides a way to easily share things, either across projects or across people. What ends up happening oftentimes in data science work is we have some asset, and maybe it's data or it's a model or some other R object. And we either need to reuse this asset later on downstream or across different projects, or we might need to share it with different members of our team. We say you can pin anything that can be serialized in R. So think about how you might stick something on a corkboard so you can keep it handy, right? Or you'd stick a flyer on a community bulletin board at the grocery store, right? So that other people can see it. And so the pins package pins objects to what's called a board.
This is analogous to this corkboard idea. And so this board is this place where files are written and they're read from. Examples of boards are things like you can have a board that's S3 storage, shared folders like network drives or Dropbox. You can have your RStudio Connect be a board, a SharePoint site be a board. Really, it's just a file store, a file share. Nothing magical about that. It's just pins has a nomenclature and a way of organizing things in this place on the board so that you can interact with them through the package.
Okay. So what I'm hearing is I can pin all kinds of things in all kinds of locations.
It works nicely with Connect, but there's nothing specifically implemented in pins for Connect. There are things that Connect does to make life for pins a little bit better. Connect will give you access to the same content sharing settings that you have for any piece of content on Connect. So I can put my pin on Connect and I can specify who should be able to access it and share it. And then Connect does have a nice preview of the pinned object. I like that feature, but that's a Connect implementation for how it reads pins, nothing specific in the pins package itself.
Pins use cases
So if I work by myself, either for whatever reason, I am a team of one, how might I use pins? A lot comes down to what are your pain points and what do you have available to you already for storing data, sharing data, accessing data downstream, things like that. Right. So do you, in your team of one, do you need to reuse something in other work? Right. Do you have like a reference table or, you know, some kind of other information that multiple projects are going to use? You might need to come back to this.
Pins could be a pretty good use case for that. Right. Or, you know, do you just not have a convenient place to put things? You know, pins comes in and gets to be really helpful when you just maybe don't have another place to store data. Sometimes, you know, teams that don't have ready access to a database find pins to be pretty helpful because it kind of makes them self-sufficient. They can be more autonomous in this way. Maybe you don't have the mechanism to either get ahold of a database or have one set up, or the data that you're working with isn't, I'm sorry, worthy of being in the database. Right. Then you can use a pin as a place to say, I'm going to have this nice location where it's organized, it's versioned. I can work through cached versions of it as well. So things go faster.
And it helps to alleviate other pain points. If putting things in Dropbox or in GitHub is working for you already, then, you know, you don't need to bring this tool into your toolbox per se. It's sort of like, what pain points do you have and what needs do you maybe have in terms of being able to track versions or share things more readily.
Replacing final_final_01_noreallyfinal.xls
That was sort of my entry point for pins too, is working with Excel files just like that. Right. And what do you do when you've got this Excel file in your workflow? At some point, you're going to get another version of it. Right. And either on the file server where the file's stored or in your code, every time there's a new version, you're going to go in there and rename it. Right. And so it's going to be final version one or latest copy or whatever. Right. Final, final, final.
Right. With your initials on the end. So somewhere along the line, you're going to be putting some kind of hacked versioning nomenclature on it. And this introduces the possibility for error, of course. Right. How many times have you gone in and run your whole analysis and realized, oh, I forgot to change the read_csv to the latest version. Right. Or you actually have more rigor, going to the file server and changing the file name so that the latest version is always called whatever, you know, data.csv, and then everything else is archived. In some form or fashion, you're implementing this hacky way of keeping things current. Whereas with pins, right, you can pin that CSV. And so pins has different options for versioning. But, you know, fundamentally, you can have pins always pull the latest version.
And so whenever, you know, your code just says pin_read, you know, data from my pin board, it's always going to pull the latest version. Or if you want to get specific, you can say, I want this particular version every time, so that even if new versions come online, you're always referencing the specific one.
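A minimal sketch of the two reading styles described above, not code from the video. It assumes a temporary versioned board (`board_temp()`) in place of a real pin board, and the pin name `data` and its contents are made up for illustration.

```r
library(pins)

# Assumption: a temporary versioned board stands in for "my pin board".
board <- board_temp(versioned = TRUE)

# Two versions of the same pin, written at least a second apart so the
# timestamped version names differ in their time component.
pin_write(board, data.frame(id = 1:2, value = c(10, 20)), name = "data", type = "csv")
Sys.sleep(1)
pin_write(board, data.frame(id = 1:3, value = c(10, 20, 30)), name = "data", type = "csv")

# With no version argument, pin_read() returns the newest version.
latest <- pin_read(board, "data")

# To freeze an analysis on one version, look it up and pass it explicitly.
versions <- pin_versions(board, "data")
oldest_version <- versions$version[which.min(versions$created)]
frozen <- pin_read(board, "data", version = oldest_version)
```

The downstream code only changes if you decide to freeze a version; otherwise the same `pin_read()` call keeps picking up whatever was pinned most recently.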
How pin boards work
Let's say I'm using Dropbox. Do I just get one board? Or can I set up different boards for different projects? Or is it like one board per system? Like I get one Dropbox board, one Google Drive board, one Connect board. Yeah, it took me a while to kind of get my mind around what's this thing called a board. It sounds like there's magic under the hood or something really special about boards. There's nothing really super exciting about boards per se.
It's a file path. Basically, it's a file path. So I have on my screen here, this is my browser window here, right? I started a Dropbox location, I have a folder called pins, like I manually created this folder called pins. And so earlier today, I just started pinning stuff. But this, this is all that my board is. My board is a file path that's users, Katie, Dropbox, pins. And this is, this is where I've defined it. This is my board. And what happens when you define a board, it just says like, this is where I'm going to go write or read my files and where they live.
You know, and while we're here, right, what do these magical, mythical pins look like on the file path? It's not that special. So it's a folder named with the date and the time and a particular hash for the pin itself. And two things inside. Here's an RDS file with my classic penguins data. And here's a text file. It's just got metadata included in it. So, you know, if I wanted to, I can just go into my code and read that RDS file manually. Now it's not like once a pin, always a pin. It's just a thing. It's just a serializable thing in R, but pins is going to make it easier for me to find these locations, work with them in a consistent manner, pull them in, read them in and kind of interpret and work with this metadata overall.
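To see that layout for yourself, here is a small sketch under stated assumptions: a temporary directory stands in for the Dropbox folder, and a few rows of `mtcars` stand in for the penguins data. The pin name `cars_sample` is hypothetical.

```r
library(pins)

# Assumption: a temporary directory stands in for the Dropbox "pins" folder.
board_path <- tempfile("pins-board-")
board <- board_folder(board_path)

# Stand-in data; the video pins the penguins data set instead.
pin_write(board, head(mtcars), name = "cars_sample", type = "rds")

# Everything the board knows lives under board_path, laid out as
# <pin name>/<timestamp-hash version>/ with a data file and data.txt metadata.
files <- list.files(board_path, recursive = TRUE)
print(files)

# Nothing stops you from reading the RDS file directly, without pins.
rds_file <- file.path(board_path, files[grepl("\\.rds$", files)])
same_data <- readRDS(rds_file)
```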
Getting started with pins
One of the first things that you do with pins is define your board, right? And like I said, your board can be on Connect, it can be on S3, it can be GitHub, or a folder, a URL, local or whatever. So in this case, I like being able to sort of see what we're doing while we're playing. So I'm creating this board. That's just a folder. This folder happens to be my Dropbox path here, right? But I can, to my heart's content, let's make another board and make it versioned. We'll talk about that in a second. Folder. Oh gosh, I'm going to have to type all this.
You know the autocomplete? Dropbox. This is fabulous.
So I've created two different definitions for my board. Okay. This one's versioned. This one, oh sorry. This one is not versioned. This one is versioned. It's the same file path. Okay. I can have multiple boards, right? I'm just describing with each definition of board I'm just describing where it is, how I'm accessing it and different characteristics of it. So for example, you know, these two different boards, regular board and board versioned, I'm just saying when I'm using this file path, when I'm using this board, either do or don't automatically keep multiple versions of my pin for me.
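The two board definitions can be sketched roughly like this. The path is a temporary stand-in for the Dropbox folder, and the variable and pin names (`board`, `board_versioned`, `lookup`, `lookup_history`) are illustrative rather than taken from the video.

```r
library(pins)

# Assumption: a temporary directory in place of the Dropbox "pins" folder.
board_path <- tempfile("pins-board-")

# Two definitions of the same location, differing only in versioning.
board <- board_folder(board_path)
board_versioned <- board_folder(board_path, versioned = TRUE)

# Through the unversioned board, a second write replaces the first.
pin_write(board, data.frame(x = 1), name = "lookup", type = "rds")
pin_write(board, data.frame(x = 2), name = "lookup", type = "rds")

# Through the versioned board, each write is kept as its own version.
pin_write(board_versioned, data.frame(x = 1), name = "lookup_history", type = "rds")
pin_write(board_versioned, data.frame(x = 2), name = "lookup_history", type = "rds")

n_plain <- nrow(pin_versions(board, "lookup"))
n_versioned <- nrow(pin_versions(board_versioned, "lookup_history"))
```

Both objects point at the same files; the board definition only controls how pins behaves when it reads and writes there.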
Versioning with pins
And so there's different ways, you know, for dealing with versions. You can define things at this higher level, on the board level, and say anything that I pin to this board, I want it automatically to be versioned. You know, and for example, if you're doing an RStudio Connect board, by default, the RSConnect board itself is versioned equals true, just out of the box default. But typically the versioning is false, is turned off by default.
And if so, you can then do versioning at the pin level. And so you can see down here, as I've defined my pin, getting ahead of ourselves, I can specify, even if my board isn't versioned, I can make my specific pin versioned and then have different iterations of it. Okay. That's really cool.
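As a rough illustration of pin-level versioning on a board that is not versioned, assuming a temporary board and made-up pin names (`results`, `scratch`):

```r
library(pins)

# The board itself is left at the default, which is unversioned
# for temporary and folder boards.
board <- board_temp()

# versioned = TRUE on the write opts this one pin into keeping history.
pin_write(board, data.frame(run = 1), name = "results", type = "rds", versioned = TRUE)
pin_write(board, data.frame(run = 2), name = "results", type = "rds", versioned = TRUE)

# A pin written without that argument follows the board default.
pin_write(board, data.frame(run = 1), name = "scratch", type = "rds")
pin_write(board, data.frame(run = 2), name = "scratch", type = "rds")

n_results <- nrow(pin_versions(board, "results"))
n_scratch <- nrow(pin_versions(board, "scratch"))
```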
Caching is another nice feature, right? Once you've got it set up and you've pulled in your pin, there will be a cached version of that. And so it will only pull a fresh version if the data has changed. And so it can be a time saver utilizing that cache overall.
Things you shouldn't pin
Are there some limits? Can I just pin everything? No, no, that would be irresponsible. There's, you need to think a little bit about what's appropriate. So there's sort of two elements, right? What's appropriate in terms of what are you trying to pin? Is it something that maybe belongs in a database? You know, don't, don't use pins for long-term archival storage of data, right? Things like your, your crown jewel data. Don't make that a pin. That needs to go in a database, right? There are best practices around data and databases and backing things up and access all these things.
Don't replace your databases with pins. A good practice is to make the pinnable object something that is reproducible, right? Don't count on your pin being your source of truth for all of eternity. Make sure it's something that you can programmatically reproduce.
Pins are often really good with like ephemeral data, stuff that's just like coming in, it's sort of in the middle of my process. I kind of need to stick this somewhere and then be done and keep working with it. Ephemeral data is really good, really lightweight data, you know, lookup tables, stuff like that. Things that make your life easier is good to use for a pin. And then in terms of size, a few hundred megabytes is great. If it's really, if it's too big for Excel, it's too big to pin.
Major functions in the pins package
So, we talked about boards already, right? Boards, it's not magic. Think of it just as a file path. Where am I going? And so, in this case, let's just use this board here. That is my Dropbox location. Get rid of this one here. If I'm on the board, so first of all, I can just do a quick query. It's like, what's out there? So, I'm going to do board and pins is pipeable, which is nice. So, you can work with it, right? I typically, you know, define my board as my first step and then pipe that into all of my different functions. So, we'll pipe board into pin list and you can see here's all of my pins that I have. And this is mimicking, right? It's bringing in what you can see here in my file browser.
There's also, if you're not quite sure what's out there or if you kind of want to get more information, pin search is nice. Pin search will search the names and the titles of your pins. So, in this instance, I don't have very many pins. So, I'm just going to search with a blank query with the double quotation marks. And it does present kind of in a nicer format than just the list that we got in the last one, right? You can see a little bit more information, like when it was created and the file size and whatnot. And like I said, it will search the name and the title. So, if I'm looking for penguins, I've got two pins, right? I've got penguins and I have not penguins, right? So, searching those different things.
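A small sketch of listing and searching, assuming a temporary board and three stand-in pins whose names echo the ones on screen. It uses the native pipe (`|>`, R 4.1 or later) to mirror the piped style described above, and it calls `pin_search()` with no query instead of the blank string used in the video.

```r
library(pins)

board <- board_temp()

# Stand-in pins named after the ones mentioned in the video.
pin_write(board, data.frame(species = "Adelie"), name = "penguins", type = "rds")
pin_write(board, data.frame(species = "none"), name = "not_penguins", type = "rds")
pin_write(board, data.frame(when = "now"), name = "timestamp", type = "rds")

# pin_list() returns just the pin names.
all_pins <- board |> pin_list()

# pin_search() returns a data frame with one row per matching pin;
# with no query it returns every pin on the board.
everything <- board |> pin_search()

# The query is matched against pin names and titles.
penguin_hits <- board |> pin_search("penguins")
```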
But this is good for finding things, right? You can also, so pins has metadata associated with them, which can be pretty helpful. So, I've got this pin here. It's not super exciting, but it's just a timestamp. But the metadata associated with that, I can see what it is. And in this case, when I pinned it, you can also include your own custom metadata. So, that can be very useful as you're building out a project or you want to put specific things into that metadata. There is a place to put that into the metadata. You can see how I put that in up here as I wrote this pin. And I just specified the metadata as a list. I love that.
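For the custom metadata step, a hedged sketch: the board is temporary, and the metadata field names (`owner`, `source`) are invented for illustration. The video's actual fields are not shown in the transcript.

```r
library(pins)

board <- board_temp()

# Custom metadata is passed as a named list; these field names are made up.
pin_write(
  board,
  list(written_at = Sys.time()),
  name = "timestamp",
  type = "rds",
  metadata = list(owner = "analytics-team", source = "nightly-job")
)

meta <- pin_meta(board, "timestamp")
meta$type     # how the object was stored
meta$created  # when the pin was written
meta$user     # the custom metadata supplied above
```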
And then different versions. We've talked about versions, right? So, I can see what versions I have. So, here are my pins versions for timestamps. I've got five different versions. And let's just see for sanity, right? Over here in my actual file browser, here's my five different versions. And you can see, you know, I mean, it does get to be duplicative, right? It's just copy, copy, copy, one after another. So, it's not the smartest package in terms of file space, right? This is kind of going back to this use pins if it makes your life easier. It's kind of quick and dirty and it's there to help you to kind of get things done. Are there more elegant ways of doing things? Sure. Are there maybe better workflows for long-term archival data storage? Absolutely. But pins is definitely useful in these use cases.
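The sanity check described above, comparing what pins reports with what is on disk, might look like this. The folder is a temporary stand-in, and the loop simply writes a changing object five times.

```r
library(pins)

# Assumption: a temporary directory in place of the Dropbox folder.
board_path <- tempfile("pins-board-")
board <- board_folder(board_path, versioned = TRUE)

# Write the same pin five times with changing content.
for (i in 1:5) {
  pin_write(
    board,
    list(run = i, written_at = Sys.time()),
    name = "timestamp",
    type = "rds"
  )
}

# What pins reports...
versions <- pin_versions(board, "timestamp")

# ...and what is on disk: one directory, holding a full copy, per version.
version_dirs <- list.dirs(file.path(board_path, "timestamp"), recursive = FALSE)
```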
Using pin_upload() and pin_download()
So, we've talked about the pin read and the pin write functions. There is another set of functions for pin upload and pin download. And so, these kind of take a little bit of an abstraction layer further out. And really, this is for being able to pin any kind of file. Like an RData file, for example, right? So, if you have an RData file, again, here's my board, my Dropbox board. I have this RData file.
And what is it, right? It's just this collection of mine of different variables and things that I like to have for breakfast. So, if I wanted to put this on a board, I'd use the pin upload function. And pin upload, right? We can sort of see what's going to happen over here. I have this over here, right? And this is looking familiar, right? Here's my hash, but here's my RData file itself. No magic there, right? Here's my metadata. And it's just tidily nested over here in my file folder. And you'll note here, right? It's replaced the version that I had before, as opposed to appending a version, because I didn't tell it I want a versioned board.
And so, the complement to pin upload is pin download. And what pin download does is it returns a path to the cached file that it's downloaded, right? So, what does that look like? Here's my output. Here's where it's living. Here's where it's downloaded that file and cached it. And so, I can now, right? If I want to bring that object in, pins, board, path to cached file, right? I have my path and let's just load it. So, you can use pin upload and pin download to be able to bring other objects, files, things like that, into that structure that pins provides for being able to read metadata, play with versions, call things, have caching, things like that.
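A sketch of the upload and download round trip, under the assumption of a temporary board and a small made-up `.RData` file (the variable names `breakfast` and `servings` are illustrative):

```r
library(pins)

board <- board_temp()

# Build a small .RData file to stand in for the one in the video.
breakfast <- c("coffee", "toast")
servings <- 2
rdata_path <- file.path(tempdir(), "breakfast.RData")
save(breakfast, servings, file = rdata_path)

# pin_upload() stores the file as-is, whatever its format.
pin_upload(board, rdata_path, name = "breakfast")

# pin_download() returns the path to the local copy, not the contents.
cached_path <- pin_download(board, "breakfast")

# Load into a fresh environment to show what came back.
restored <- new.env()
load(cached_path, envir = restored)
```

Because `pin_download()` hands back a path, what you do next depends on the file type: `load()` here, but it could just as well be a reader for some other format.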
Pins and Google Cloud
It's one of the legacy functions in the API, and sort of what that means, right? So, the first iteration, I would say, of the pins package had a different overall format and a different API behind it for doing the read and write functions. And so, gosh, maybe nine months ago, maybe a year ago, I don't know. Pandemic time. Pandemic time. It doesn't make sense anymore. A little while ago, a new API version of pins came out. And it started with pins version 1.0. We're at 1.0.1 right now, I believe. And that retooled, under the hood, the API that was calling and writing and interacting with pin boards.
And we also saw some of the older functions drop off. That being said, though, that doesn't mean that Google Cloud is not going to be a board in the future. There's currently an issue right now on the repo to bring Google Cloud as one of the new implementations of boards. And really, it's just a matter of if that's something that's useful for folks. We just need people to go in and put a plus one on the issue to help weigh the priority for developing that.
Pins and modelops with the vetiver package
So, we've talked about pinning data, pinning data frames, things like that. But pinning models is a really nice workflow. And all of the work that Julia's been doing around Vetiver, I think, showcases that really nicely. And so, what you can do with the package, right, with Vetiver, you can pin a Vetiver model. And then with that, you'll have all of the information about that model that you need to deploy it and call it and work with it. And so, then you can then call that model or different versions of that model into other assets downstream, right? So, a super simple example of that.
So, earlier today, let's see, what do I have? I borrowed some of Julia's nice work here. A model that she has just demonstrated here with home prices in Sacramento. She's using a random forest model. And creating this model, saving it as a vetiver model. And if we do this, right, let's actually look at what this looks like. All right. So, what is V? V is a ranger regression modeling workflow using four features. So, with this, I'm going to pin this. In this case, again, I'm going to just pin it on my Dropbox folder. So, we'll pin this. And where's my model? So, I have these different models here.
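A minimal sketch of pinning a model with vetiver. To stay self-contained it swaps in a plain `lm()` fit on `mtcars` for the ranger workflow on the Sacramento data, and a temporary versioned board for the Dropbox folder; the model name `cars_mpg` is hypothetical.

```r
library(pins)
library(vetiver)

# Swapped-in model: a plain lm() on mtcars, not the ranger workflow from the video.
cars_lm <- lm(mpg ~ wt + hp, data = mtcars)

# Wrap the fitted model together with the metadata vetiver needs to deploy it.
v <- vetiver_model(cars_lm, "cars_mpg")

# Assumption: a temporary versioned board in place of the Dropbox folder.
board <- board_temp(versioned = TRUE)
vetiver_pin_write(board, v)

# The model is now an ordinary pin on the board.
pin_names <- pin_list(board)

# Reading it back returns a vetiver model object again.
v_from_pin <- vetiver_pin_read(board, "cars_mpg")
```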
And now, I'm going to use this model. So, first of all, one of these workflows that she's illustrated is to wrap this model up into an API so that you can ping it and query it, right? So, right. So, I'll take this model and my background job here. There's functions within the Vetiver package for creating a Vetiver API. In this case, I'm just going to run this locally. I could publish my API. I could publish it somewhere, host it on Connect, or host it in some other place that I have in my infrastructure. But right now, I'm just going to run this locally so I can ping off of this.
And what this looks like, so, if we run this, this is just running locally. This has a nice Swagger interface so I can test out and check, try my model, right? So, first, I'm going to just do a, this is just a ping, right? And it's just responding with the time and the date. But if I want to run this locally, I'm actually going to run this as a background job, because when this is running, let me run this again, right, if this is running, it's just consuming and occupying my console so I can't interact with it. So, a whole lot of good that does. I'm going to run this as a background job so that I can interact and still play with my console. So, here's my background job. And I'm going to run this. And this is now running. So, I just go to a browser.
All right. Let's pull up a browser window. So, here I am in a browser window. Just like we had in the preview. But now I have a console window. Okay. And so, I'm going to use my model that's been pinned. And I'm going to define an endpoint. My endpoint is this prediction endpoint here. And I'm going to create some new data. And then predict off that endpoint. And so, I just run predictions off of that pinned model that I have hosted that I have running as an API in the background.
But what's nice is that, right, in this model that I'm running, remember I pulled it from a pin. I specified the version. So, say I had a different version of this model that I wanted to run. Well, let's see, you know, which versions do I have? I've got three different versions right here. I want to instead, you know, grab this version. So, right, if I wanted to use this version, then I would just change here which particular version I'd be reading in from my pin and serving up via my API overall. And so, this is a nice way that you can toggle back and forth and try different versions of your pinned model.
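The version toggle can be sketched as follows. As an assumption, two `lm()` fits on `mtcars` stand in for versions of the pinned model, on a temporary versioned board, under the hypothetical name `cars_mpg`. The serving and querying calls are left as comments because `pr_run()` blocks the console (hence the background job in the video); the port and URL there are arbitrary choices.

```r
library(pins)
library(vetiver)

board <- board_temp(versioned = TRUE)

# Two versions under the same model name, written a second apart so the
# timestamped version names differ in their time component.
vetiver_pin_write(board, vetiver_model(lm(mpg ~ wt, data = mtcars), "cars_mpg"))
Sys.sleep(1)
vetiver_pin_write(board, vetiver_model(lm(mpg ~ wt + hp, data = mtcars), "cars_mpg"))

# Which versions do I have?
versions <- pin_versions(board, "cars_mpg")
oldest <- versions$version[which.min(versions$created)]

# Choose which version to serve by passing `version`; omit it for the latest.
v <- vetiver_pin_read(board, "cars_mpg", version = oldest)

# Serving (not run here because pr_run() blocks the console):
# library(plumber)
# pr() |> vetiver_api(v) |> pr_run(port = 8080)
#
# Querying from another R session:
# endpoint <- vetiver_endpoint("http://127.0.0.1:8080/predict")
# predict(endpoint, data.frame(wt = 3.0))
```

Switching the served model is then a one-argument change to the `vetiver_pin_read()` call.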
