How to keep data up-to-date with 6 pins workflows (aka avoid data-final.csv & data-final-final.csv)
Ever chase a CSV through a series of emails or had to decide between data-final.csv and data-final-final.csv? Pins (both for R & Python) is a package that a bunch of people at the Data Science Hangout wish they knew about earlier. It allows you to publish and share objects (data, models, etc.) across projects and with your colleagues.
Pins package (R) - https://pins.rstudio.com/
Pins package (Python) - https://pypi.org/project/pins/
Timestamps:
1:15 - Posit Team Overview
2:18 - Introduction to pins (scenarios where you might want to consider using pins)
4:42 - Installing pins
6:24 - Workflow #1: Pinning an R Object to Posit Connect (from RStudio)
10:23 - Workflow #2: Pinning a Python Object to Posit Connect (from JupyterLab)
15:19 - Workflow #3: Reading in a Python pin in an R Session
16:07 - Workflow #4: Reading an R pin into a Python session
17:50 - Workflow #5: Pin versioning
21:50 - Workflow #6: Automating the pin writing process (through job scheduling on Connect)
Helpful resources:
Q&A for this session on August 30th: https://youtube.com/live/8hc9ck1ZNLE
Blog post on pinning an R dataset to Posit Connect: https://posit.co/blog/pins-posit-connect/
Many people find this useful for:
1. Scheduling reports that need to be updated with the newest data each week
2. Reusing data across multiple projects or content (Shiny app, Jupyter Notebook, Quarto doc, etc.)
We host these end-to-end workflow demos on the last Wednesday of every month. No registration is required to attend - simply add it to your calendar using this link: pos.it/team-demo
If you ever have ideas for topics or questions about them, please let us know in the comments
Transcript
This transcript was generated automatically and may contain errors.
Hi, and welcome to our fifth Enterprise Community Meetup, where we'll discuss another end-to-end data science workflow using Posit Team. My name is Ryan Johnson, and I'm a data science advisor here at Posit.
Now, for today's session, we're going to do things a little bit differently, and rather than discuss a single end-to-end data science workflow, we're actually going to break it up into a series of smaller workflows using one of my favorite packages in both R and Python called Pins.
Now, during today's session, we'll start off with a brief introduction into Posit Team, and then we'll talk about how Posit Team, specifically Posit Connect, can be used to enhance your Pins workflows. So with that, let's go ahead and get started with an introduction to Posit Team.
Posit Team overview
All right, if you joined us in the past months, then you've probably seen this diagram before. This is a quick overview of Posit Team, and we'll start right up here at the very top with your data scientists, your data analysts, your end users. And they will be creating insights using Posit Workbench.
This is a development environment, and they can develop within RStudio, Jupyter, or VS Code, creating things in both R and Python. So if you have R developers, maybe they're creating Shiny applications, Quarto documents, Plumber APIs. Or if you have Python developers, maybe they're creating applications using things like Bokeh, Dash, or Streamlit. You can have Jupyter Notebooks as well, Shiny for Python, Pins for Python, which we'll talk about in a bit here.
But it doesn't do the data scientists any good if they can't share that content with the people that need to see it. That could be their coworkers, it could be their boss, or maybe just their friends and family. And that's where Posit Connect comes into play. Connect is our professional publishing platform and allows you to easily share those insights with whoever you want.
And then finally, we have Posit Package Manager, which does exactly as its name implies, helps control all those great open source packages in both R and Python. And you can also host and share any internally developed R and Python packages with your team as well.
Introduction to pins
Now, for today's workflow, we're going to focus on Pins. And in a second, I'll explain what the Pins package is and how you use it, but I thought it might be helpful to first describe a few scenarios where you might want to consider using the Pins package for your workflows.
In this first scenario, have you ever created an asset, maybe like a piece of data, that you thought was the final version of the data, but then you had to go back and make some edits and create another final version? And rather than overwriting the original data set, you decide to give it a new name. And maybe you go through this process again and again and again, and you end up with multiple final versions of a data set, and no one knows what the true final version is.
Maybe this data we just talked about is used by many different content types, including a Shiny app in R or a Quarto document in Python, and you're worried that the data may not be consistent across all assets. Additionally, maybe you want to share this data with multiple people, and it's typically much easier and safer to share a Pin rather than trying to email files around.
And finally, maybe this data contains sensitive information, and you only want to share it with specific people or groups of people. And this is actually possible when using pins in combination with Posit Connect.
So how do pins actually work? The best way to think of pins, in my opinion, is to think of a corkboard. Similar to how you can pin a note to a corkboard, you can pin an R or Python object to a board where you and others can access it.
To further explain this corkboard analogy, the pushpin is the tool we use to pin an object to the corkboard. This is the pins package. The note that we pin is our R or Python object, and this could be a dataset, a model, plots, or even files. Finally, the corkboard in our workflows for today will be Posit Connect. Now, there are a lot of other boards you could use, including Amazon S3, Azure, or Google Cloud Storage, or even local or shared folders. But Posit Connect does come with some added advantages, which we'll discuss today.
Installing pins and connecting to a board
Before we get started with pins, we need to do a small amount of prep work. And since we are going to show some workflows in both R and Python, we'll first need to install the pins package in both languages, and then we'll need to connect to our board, which again will be Posit Connect. So let's do that now, and we'll start with R in RStudio.
So here I am within Posit Workbench, and the first thing I'm going to do is log in.
Now that we're within Posit Workbench, the first thing I'm going to do is click on a new session and select RStudio.
The easiest way to connect to Posit Connect from within the RStudio IDE is to go to Tools at the top of your screen and select Global Options. On the left-hand side, there will be a Publishing option, and we'll want to select Connect. In the pop-up menu, we want to select Posit Connect. And here is where you want to add the URL for your Connect server, where you can see mine is already pre-populated.
So I'll click Next. We'll get another pop-up window. This is asking if we want to confirm the connection. I'll hit Connect. Oh, got to add my password one more time here. And then we can click Log In. I'll close out of this window and hit Connect Account. I'll apply these changes and then hit OK.
Workflow 1: pinning an R object to Posit Connect
So coming back to my slides here, this is going to be the first workflow that we're going to discuss today. I'm going to take an object that we built in R and show you how to pin it to Posit Connect.
All right, so coming back to my RStudio session within Posit Workbench, the first thing we need to do in order to pin an object to Posit Connect is first install and load the pins package. So coming over to my console, I'm going to first install pins. I'm going to hit Enter, and I want you to note how fast it goes. It's already done.
And that's because I pulled in a pre-built binary, a Linux binary from my instance of Posit Package Manager. So that's one of the advantages of using Package Manager as well. And once we've installed it, I can go ahead and hit or type out library, and we're going to load the pins package. And I'll clear my screen, and we're good to go.
And next we need to define our pins board. So we're going to create a new variable. I'm going to call it my board. And we're going to assign this variable. We're going to give it a function from within the pins package. So within pins, there is a function called board connect. And that's it. That's the only thing we need to run. I'll hit Enter. And you can see it successfully connected to Posit Connect.
Now we need something to pin to Posit Connect. So to keep it nice and simple, we're going to use a data set within R called mtcars. And if I hit Enter, you can see what that data set looks like. So we're going to take this data set, and we're going to pin it to Posit Connect. To do that, we are going to use from within the pins package, the pin write function. We then need to give it the name of my board. So my board. Then we need to give it the pin, the R object. So that's the mtcars data set. And then we want to give this pin a name. And so I'm going to call it my username in this instance, which is Posit. I'll say mtcars underscore, and I'm going to put an R since this is an R object. I'll hit Enter. And now you can see it successfully wrote the pin as Posit mtcars underscore R.
So let's go ahead and navigate to our instance of Posit Connect. And there's our pin right at the top. I'll go ahead and click on the pin. And this is what it looks like hosted on Posit Connect. I could click that link right there at the top to download the raw RDS file. And right below that is all the code you would need to read that pin into your various workflows. And we'll talk about that here in a second. And then at the very bottom, you can actually get a snippet of what the data looks like.
Now, before I show you that exact same workflow in Python, I first want to talk a little bit about what you're seeing over here on the right-hand side of your screen within Posit Connect. And these are your access controls. So you can determine how you want this pin, this data set, to be shared across your team. Currently, I have it set to specific users or groups. And you can define those users and groups right here in the search box, which will be tied to your authentication.
Now, if you want to be a little less strict, I can select all users, login required. And that means if you can log in to this instance of Posit Connect, you can view this pin data set. Or I can select anyone, no login required. And that will open up this pin data set to the world so that anyone could access it.
Workflow 2: pinning a Python object to Posit Connect
All right, so if I come back to my slides, for our second workflow, I'm going to show you how to pin a Python object to Posit Connect.
Now, coming back to our Posit Workbench homepage, I'm going to open up a new session alongside our RStudio session. So I'll select new session. And this time, I'm going to select JupyterLab, and we'll hit start session.
And here I have a terminal session opened up already. And we're first going to install the pins package. So to do that for Python, we're going to use pip to install pins and hit enter. And you can see the requirements are already satisfied. So we should be good to go. And I'll clear my screen.
Next, I'm going to open up a Jupyter notebook. So I'll click on this blue icon over here and select a new notebook. Now, the first thing we want to do is to import the pins package that we just installed. And we also need, similar to our R workflow, we need something to pin. And we're actually going to pin the same data set. But again, we're doing this within Python. So from pins data, we're going to import the mtcars data set. And I'll go ahead and run this.
And to pin this mtcars data set to Posit Connect, we first need to register our board, just like we did before. So our board, I'll call it my board again. And this is going to be created using, from within the pins package, the board connect function. And we're going to run this. But before we do that, we need two things. We first need to define our server URL. And we also need to provide this function an API key so that this Jupyter notebook knows who I am.
So the first thing we're going to add is the URL. So I'm going to come back to Posit Connect. I'm going to grab the URL. So you probably can't see this on your screen. That's okay. So copy this. And I'm going to add the URL.
And the next thing we need to add is the API key. To generate an API key in Posit Connect, we're going to go back to Connect. And I'm going to click my name here in the top right-hand corner and select API key. Now, as a quick caveat here, I am using a temporary environment. And as soon as this demo is over, it's going to be torn down. So you will see my API key, but that's okay. For future reference, you should always treat your API key just like you would a password. You want to keep them safe and not share them with anybody.
So to generate an API key in Posit Connect, I'll select new API key. And I'll call this Workbench API key and hit OK. I'm going to copy this, come back to our Python notebook, and paste it in there. And then we'll run this command.
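Since an API key should be treated like a password, one common alternative to pasting it into a notebook is to read it from environment variables. The sketch below is only an illustration: the helper name connect_credentials is hypothetical, the example URL and key are made up, and it assumes the variables are called CONNECT_SERVER and CONNECT_API_KEY (to my understanding, the names the Python pins package looks for by default).

```python
import os


def connect_credentials(env=os.environ):
    """Return the Connect server URL and API key found in environment variables."""
    server_url = env.get("CONNECT_SERVER")
    api_key = env.get("CONNECT_API_KEY")
    pairs = (("CONNECT_SERVER", server_url), ("CONNECT_API_KEY", api_key))
    missing = [name for name, value in pairs if not value]
    if missing:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError("Set these environment variables first: " + ", ".join(missing))
    return {"server_url": server_url, "api_key": api_key}


# Demonstration with a made-up environment; a real session would use os.environ.
demo_env = {
    "CONNECT_SERVER": "https://connect.example.com",
    "CONNECT_API_KEY": "not-a-real-key",
}
creds = connect_credentials(demo_env)
print(creds["server_url"])
```

In a real notebook, the result could be passed along as board_connect(**connect_credentials()), which keeps the key out of the notebook file itself.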
And so the last thing we need to do is pin our mtcars dataset to Posit Connect. So to start, we are going to type out my board. And we're going to use the pin write function. Now, within Python, it takes three arguments. And the first argument is the pin itself. So we're going to use mtcars. The next is the name. And we're going to give this a similar name. So I'll say my username, which is Posit. And I'm going to call it mtcars underscore Python. Because if you recall, the other name was underscore R, since that was an R object. And the last thing we need to give it in this environment here is the type. So how do we want to save this? We're going to save it as a CSV file. And that's it. We'll run this command. And now you can see that it has written this pin to Posit Connect.
And if we come back to Posit Connect, which you see on your screen, and I click Content, now we can see our two pins. Here's our mtcars pin that we created in R. And then here's the mtcars pin we created in Python.
Reading pins across R and Python
So now that we have both objects pinned to Posit Connect, I'm going to show you how to read pins. And we're going to start here with reading a pin into an R session. And to make things interesting, we're actually going to read the Python object into our R session and then vice versa.
All right. So reading a pin is super easy. We're going to use from within the pins package, the pin read function. We'll next give it our board, which again we call my board. And now the name of the pin. And so I want to read in the Python pin, which we called Posit mtcars underscore Python. And that's pretty much it. We'll hit enter. And that's how you can read in the Python mtcars dataset into an R session.
So let's go ahead and do the same thing, but this time in Python. So we're going to read in our R object from Posit Connect into our Python session.
All right. So now we're back in our JupyterLab session and to read in a pin in Python, we're going to again start with my board and we'll use the pin read function. And the only thing you need to give it is the name of the pin. And so if you recall, we saved it as Posit mtcars underscore R. And I want to read in this R object. Now I'm actually expecting an error message. So let me run this.
Yep. And so we do get an error message. And if we look towards the bottom here, no driver for type RDS. So when we save the R object to Posit Connect, by default, it uses an RDS format and that's actually not compatible with Python. So we need to change that.
Let me come back to my RStudio session and we're going to rewrite that pin. So from the pins package, we use pin write, we'll give it the name of my board, the object we want to pin, which is mtcars, the name, which again, keep the name the same, Posit mtcars underscore R. And this time we're going to add a type argument and we're going to save it as a CSV file. Hit enter. And there we go.
So now let's come back to our Jupyter notebook and let's try rerunning this command. And there we go. There's our R object read into a Python session.
Pin versioning
So for all of our subsequent workflows, I'm going to show them in R in RStudio just for the sake of time, but everything I show can also be implemented in Python too. So for our next workflow, workflow five, I'm going to introduce you to pin versioning. Now it may come as a surprise that we actually already did this when we rewrote our R pin as a CSV file. We didn't delete the original. We actually just created a new version on top of the original. So in this workflow, I'll show you how you can see the various versions of your pin and how you can easily switch between them.
So now if I come back to Posit Connect, here you can see the R object that we pinned. So if I click into it, you can see the most recent version is the CSV format that we saved. Now Posit Connect also allows you to view the previous versions as well. So if I click on the three dots you see here in the top left corner, I can select history and we have two current versions. So here's the most recent version. I can click on a previous version and see there's the RDS format that we saved previously.
Now, just for fun, let's go ahead and create a third version of this pin. So I'm going to just extract the first row of the mtcars dataset. We'll save that as a new variable and then we'll pin that to the same exact location, the same pin. So I'm going to create a new dataset, the mtcars dataset, where I'm just going to extract the first row. So that's what the new pin is going to look like. So I'm going to go ahead and save that as mtcars2 and create a new variable. And now let's go ahead and save it as a pin to the same location. From within the pins package, pin write, give it the name of my board, the new object, so that's mtcars2, the name of the pin. So we want to save it to the same location. So it's Posit mtcars underscore R and we'll make the type again a CSV format and hit enter. And now we've pinned a third version of this dataset.
And if we come back to Posit Connect, I'll click into my R pin, select history, and now you can see I have three versions of the pinned dataset. So this is the most recent pin, which will be the active pin, and then we can see the historical versions as well.
And you can also view the previous versions from within your R session. So there is a function within pins called pin versions. We'll give it again my board and then the name of the pin, so that's posit mtcars underscore R and hit enter. And so you can see we have three versions of the same pin and the most recent version that we pinned is going to be the active version. So if we ever read in the pin, this is going to be the one that it reads in.
But let's say you want to go back to a previous version. You can see that each version has a unique number and unique identifier to it. So if I want to read in, maybe we'll just read in the most recent version to start. So I do pin read. Again, we'll give it my board and the name of the pin. So that's the most recent version. But if I run the same command, but this time I'm going to add in another argument called version and I'm going to give it this previous version right here. So 20, hit enter. And that's how you can read in a previous version of your pin.
Automating pin updates with scheduled jobs
For our last workflow for today, we are going to show you how you can automate the pin writing process. This is probably the workflow I use the most often and it's perfect for those assets that are constantly changing or being updated. Think of a data set where new rows are being added as data is being collected.
So this specific pins workflow is made possible by the job scheduling feature on Posit Connect. So to get started, we first need to create a document that houses our pin writing workflow. So let me navigate back to Posit Workbench.
All right, so now that I'm back within this RStudio session within Posit Workbench, the first thing I'm going to do is open up a new file, and we'll select Quarto for this workflow. And if you're not familiar with Quarto, we actually have a previous end-to-end workflow session where we really focused in on Quarto. So in the top left corner, I'm going to select Quarto document. And I'm going to call this my pins workflow and we'll hit create.
So here is a new Quarto document and it has some placeholder text in here, which I'm just going to delete for right now.
So let's go ahead and script this pins writing workflow. The first thing I want to do is load the pins package. So I'm going to insert a new R cell and we're going to load the pins package and we'll run that.
After that, I'm going to read and transform my raw data. Now we're going to stick with the mtcars data set, but I'm going to extract a few additional rows. So think of this as pulling in some raw data set and doing some type of transformation to clean it up, which we'll eventually write as a pin.
So we'll insert a new cell here. And the first thing I'm going to do is create a new variable. I'll call it mtcars3. And we are going to make that or assign that to the mtcars data set. And I'm just going to extract the first five rows. And then we can print it here and we can run this code cell. And so that's what this new pin is going to look like. So this is what we're going to read, or sorry, write to Posit Connect.
So the last thing we want to do is the actual pin writing process. So we want to pin to Posit Connect. Now since it's going to be hosted within Posit Connect, we still need to make that board connection. So I'm going to write out my board. And this is going to be board connect. That's the pins function we want to use here. And then let's go ahead and write out the actual pin write function. So we assign it, provide it with my board, the new mtcars3 data set. And let's go ahead and give this a new name. So I'll say Posit, and we'll just keep it simple and call it mtcars3. And let's go ahead and save it as a CSV file.
Now let's go ahead and run all this and just make sure everything works as expected. First, I'm going to actually save it. So I'll call this pins workflow and hit save. And now you can see it down here in my file directory. So let's go ahead and run all these code cells and just make sure everything works well. And it looks good. You can see that it successfully pinned it to Posit Connect.
Now again, I want you to think of this data set, maybe it changes every single day or every single week, and you want to set up this workflow to run automatically. To do that, the first thing we need to do is publish it to Posit Connect. So the easiest way to publish this Quarto document to Posit Connect is to go into this little blue button you see at the top of your screen and select publish document.
So it's going to ask for a few additional packages that need to be installed. And again, this will go nice and quick since we're using Posit Package Manager. We want to select Posit Connect. And we want to make sure we publish the document with the source code because that's going to allow us to rerun it on the Connect server. So this is the only file we need to publish. This is the location of our Connect server, which you can see we already registered very early on in this workflow. And we'll just leave the title as Pins Workflow. And I'll go ahead and hit publish, and that's it.
You're going to see this deploy tab that opens up, which is going to print some logs to the screen. And we'll just give this a few seconds to publish to Posit Connect.
And there we go. Here is the Quarto document we just created now hosted on Posit Connect.
So to set this up for job scheduling, you'll want to click the schedule tab, which can be accessed by clicking this gear icon on the top of your screen. So I'll click schedule. We'll select this box. And we'll go ahead and create a new job. You can choose your time zone and the start date and time for this recurring job. Next, you can choose the frequency. So by default, you can see it's set up to run every single day at 11 a.m. But you can change this. You can make it as frequent as by the minute or every two minutes, or you can make it as infrequent as every year or every two years if you really wanted to.
But I think daily sounds good. And we'll have that run every single day. You can also choose every single weekday. And then once we are finished, it's going to rerun this report and publish it. And you also have the option to send an email. So if you want to let someone on your team know that this Quarto document was rerun, they can get an email directly in their inbox. And so once we save this, now every single day at 11 a.m., it is going to rewrite this mtcars3 data set to that pin on Posit Connect. So it's a great way to keep those pins updated on a recurring schedule that you choose.
And with that, we've come to the end of our session today. Now, I hope you found this content helpful and you can feel a bit more confident about implementing pins into your own workflows. Thanks again, everyone, for joining today. And feel free to stick around if you have any questions you'd like to ask us. Otherwise, have a great rest of your day, and we hope to see you again next month.
Thanks so much, Ryan. Pins is something that has come up quite a bit in the Data Science Hangouts lately, so it's awesome to have these pins workflows to point to as well. As Ryan mentioned, we are going to stick around here for some Q&A for another 15 minutes or so, and YouTube should automatically push you over to that. If for any reason it doesn't, I've also included the link in the details below, and I'm going to copy that over to the chat right now too. As a reminder, there's also a Slido open for anonymous questions, which you can see on the screen, but that's at pos.it/demo-questions. And we'll keep that open for the rest of the week as well, and I'll add the answers in there too. Thanks again for joining us today, and we'll see you over there for Q&A.