
Easier data and asset sharing across projects and teams with {pins} and Databricks
Led by Edgar Ruiz, Software Engineer at Posit PBC
April 30th at 11 am ET / 8 am PT

Sharing data assets can be challenging for many teams. Some may rely on emailed files to keep analyses up to date, making it difficult to stay current or know which version of the data is in use. {pins} makes it easier to share data and other assets across projects and teams. It lets us publish, or 'pin', to a variety of places, such as Amazon S3, Posit Connect, and Dropbox. Following recent customer feedback, the ability to publish, or 'pin', to Databricks Volumes has been added to the R package. The same capability is currently in the works for the Python version of {pins}.

This session on April 30th will showcase accelerating predictions by distributing a 'pinned' model using pins and Spark in Databricks. We'll walk through integrating {pins} with Databricks in your team's projects and cover novel uses of pins inside the Databricks ecosystem.

GitHub repo: https://github.com/edgararuiz/talks/tree/main/end-to-end

Here are a few additional resources that you might find interesting:

1. Pins for R: https://pins.rstudio.com/
2. Pins for Python: https://rstudio.github.io/pins-python/
3. More information on how Posit and Databricks work together: https://posit.co/use-cases/databricks/
4. Customer Spotlight: Standardizing a safety model with tidymodels, Posit Team & Databricks at Suffolk Construction: https://youtu.be/yavHEWpgrCQ
5. Q&A Recording: https://youtube.com/live/HDTDmEaK5zQ?feature=share
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Hi, my name is Edgar. I'm with the open source team here at Posit, and today we're going to talk about using pins with Databricks. We're going to do an end-to-end workflow demo that I hope you'll enjoy.
What is pins?
So I'd like to start first by talking about how we can conceptualize what pins does. And we can start by thinking of a physical board, right? A board that you may have at home or at the office, in which you would use pins to pin certain information, right? It could be notes, pictures, and things like that, that you can use as a reference for later, or even as a reference for others inside that location that they can see. But in the case of the pins package, what you are actually pinning are files, right? Could be CSV files, could be arrow files, could be RDS files.
So if we're pinning that, what is a board then, right? A board is anything that could contain a file, such as a folder. It could also be regular cloud storage, or even SharePoint, or especially Posit Connect, where you can not only publish Shiny apps or Dash apps, you can also publish pins. And lastly, we have a Databricks volume, which is what we just recently added as a backend to the pins package. So the idea here would be that as a user or a developer, you can write a pin into the board that can then be used downstream, either by yourself in other projects, or by other members of your team, or maybe even apps can read the pin and use it.
So why would I want to use a pin, right? I could easily use, you know, a shared drive that the team has access to. Well, pins are automatically versioned, which I think is really cool because you can access the versions of those pins programmatically. They also allow you to go back in case there are mistakes that you need to fix, or even help with reproducibility. You can also customize a pin's metadata in case there's additional information you want to add to make it clear what it does, which we'll see here in a minute. And it goes beyond just rectangular data. You can save list objects as pins, which pins converts into a JSON file.
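The versioning behavior described above can be sketched with a temporary local board, which runs anywhere without credentials (the pin name and data are made up for illustration):

```r
library(pins)

# A temporary, versioned local board stands in for S3, Connect, or a
# Databricks volume
board <- board_temp(versioned = TRUE)

board |> pin_write(head(mtcars, 3), name = "cars_demo")
board |> pin_write(head(mtcars, 5), name = "cars_demo")  # changed data: new version

# Versions are accessible programmatically
board |> pin_versions("cars_demo")

# Roll back by reading a specific version
v <- pin_versions(board, "cars_demo")$version[1]
board |> pin_read("cars_demo", version = v)
```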
Another really cool thing is that it goes beyond R, right? You can use the pins package both in R and Python. Having said that, the Databricks functionality that I've been mentioning exists only in the R package today, but we are working on adding it to the Python package. In fact, I have a PR open for it that would hopefully get that functionality into the Python package soon.
Pins vs. databases
I wanted to mention this real quickly because we had a customer asking about when to use a database versus a pin. The main thing that you have to keep in mind when deciding is how big your data is, right? And how much of that data are you using? If at the time of your analysis you're only going to use part of the data, then it's better to use a database, because you can query it to download just the portion you need and then use it in your analysis. With a pin, when you call pin_read, for example, it downloads the entire dataset. So if the data is really big, it may not be feasible to download it every time if you're not going to use the whole thing.
So that's the main thing that you have to keep in mind. I would say to think of pins as another tool in your toolbox rather than a replacement for something such as a database.
Demo: creating a volume and writing pins
So next, let's see an example of how we can use pins.
Okay, we're going to start in the Databricks web UI. And here we're going to go to Catalog. Inside the catalog, I'm going to choose this one. And the end-to-end schema is where I want to do today's demo. I'm going to create a new volume; you can see how easy it is: just press Create and then Create Volume. I'm going to name it "my volume" and hit Create.
So that's it. That's all it takes to create a volume. Of course, you need to have rights to it, but once you do, it's easy to do. And what I like about it is that the web UI automatically gives you the path that you can copy, which I just did. And what's so neat is that I can go into R and load the library.
And then I can define my board. And now let's just copy and paste the path. Now that I have my reference to it, I can query it; I can say pin_list to get a list of the boards that are there, or rather the pins. And you see there's nothing there yet. So let's put something in there.
And here, I'm just going to say pin_write(board, df). So I'm making reference to my board, and then sending it a data frame. And you'll notice that creates a new pin with this version. So if we go to the volume, I can refresh it. And you'll see that we have a new folder named df, which matches the name of our pin. And then inside that folder, there's another folder that matches the version of the pin. And inside that folder, we'll find the actual file that contains the data, and also a data.txt file that contains the metadata of the pin.
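The steps so far might look like this in R; the catalog, schema, and volume names are placeholders, and running it requires a live Databricks workspace with credentials in the usual environment variables:

```r
library(pins)

# Path copied from the Databricks web UI (placeholder names)
board <- board_databricks(folder_url = "/Volumes/my_catalog/end_to_end/my_volume")

board |> pin_list()    # nothing there yet

df <- data.frame(x = 1:3, y = letters[1:3])

# Creates df/<version>/df.rds plus a data.txt metadata file in the volume
board |> pin_write(df)
```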
So we usually don't go this far into examining, you know, the pin structure, but because we're introducing it today, I just wanted to show you how it works. So let me go back to R here and try to run this again, right? If I run the same pin_write with the exact same data frame, you'll notice that it tells me it didn't make any changes to the actual pin in the volume. And that's because the data frame is still the same, which is great, because that way it prevents creating multiple folders with copies of the same data, bloating the volume and taking up more space in the catalog.
Another thing that we can do is add another version. So let's change a piece of the data here and try to write it again. This has changed. See how now it creates a new version, and if we come back and check on our pin, we see a new folder with that version. So that's what's really nice about this, right? I mentioned you can also have a list. So let's create a list object real quick, a nested list, which we can then write.
And as I mentioned, that creates a JSON file. We can also customize the metadata by using the metadata argument. So we just created a new pin with the additional information for the metadata.
Then I can see we got the list here, and we got the test2 pin that has my customized metadata.
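The list pin and the custom metadata can be sketched like this, assuming the `board` object from earlier; the pin names and metadata fields are made up:

```r
# A non-rectangular object: pins stores a nested list as JSON
nested <- list(a = 1, b = list(c = "x", d = 2:3))
board |> pin_write(nested, name = "test", type = "json")

# Custom metadata travels with the pin in its data.txt file
board |> pin_write(
  nested,
  name     = "test2",
  metadata = list(owner = "edgar", purpose = "demo")
)

board |> pin_meta("test2")   # shows the user-supplied metadata
```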
Accessing pins from Python
Another cool thing is that if I were to go to Python (in this case, I have Python loaded here in Positron), I can import pins. Now, I have installed the version I mentioned that we're about to merge to add the functionality for Databricks, but I figured I would show you today. So I can say board = pins.board_databricks() with the volume path, and then board.pin_list().
You can see our pins, and we can read one. Let's read df. Remember, this is RDS, but what's neat is if it's a table, then this version of pins will automatically convert it for us to a pandas data frame, which is really nice.
So it's going to download the pin and give us the information back. Right, so that's the basic functionality that I wanted to show you.
A novel workflow: model scoring with Spark
Okay, so this diagram reasserts what I mentioned earlier about how we can create a pin or several pins, put them in a board, in this case a Databricks volume, that can then be used by projects, teams, or even apps. I didn't mention, though, that there's another possibility where you can have a job, for example, that runs on a regular schedule, reads the latest data from the database or whatever data sources, and then writes the pin, and that pin then becomes available downstream for the apps, projects, and teams that we're talking about. The jobs don't necessarily have to create new pins; they could update existing pins. So that's another way we can think about adding pins to our workflow.
Right, moving forward to this novel workflow that I want to propose. I'm calling it novel because it's a bit different from other things that we have shown, or from how we usually think about pins, but I think it's really useful and interesting to keep in consideration. I mention this especially because the Databricks ecosystem is so extensive. There's a lot that you can use as a data scientist to build more complex processes and be even more effective in your daily work. And I hesitate to call it complex, because these tools are actually really easy to use and easily accessible; the complexity is not in how to set them up, but rather in what you can do with them.
So we have talked before about how we can download portions of the data into R and train a model locally. In fact, even before Databricks, I'm sure we've all done this, right? We sample the data, download it into R, and train. What happens afterward varies a lot, right? Depending on what you need to do, there are a bunch of ways you can go.
But what I'm proposing is that whatever end-result model you get, you return it to the Databricks environment and save it as a pin inside a volume. So once you publish that model as a pin, what's really cool is that if you want to score your data, let's say it's a huge dataset, now you can use Spark to do that, right? Because then you can have Spark read the data, read the model, and actually run the predictions. On top of that, you can trigger that prediction using spark_apply through sparklyr. So it's really cool how this works because it allows us to do one thing that sets it apart from the regular way we serve models. Whenever you serve a model, generally what you're doing is serving it as a REST API endpoint.
So a typical served model would require that you send it some set of data and it would return a prediction, right? So let's say we have a table that has a million records. If you were to use the traditional model, you would then have to make a million calls in order to score the entire dataset. That, of course, is totally acceptable today; we do that all day long. But what I'm proposing is that we do this a little bit differently. So the idea is that Spark reads the data, and by default Spark partitions the data into smaller sections.
Then, because Spark allows us to run R UDFs, meaning that you can run R code inside Spark, you can have a small program that reads the model from the volume and runs the predictions on each of these segments or partitions individually, and also in parallel. So that way, instead of querying the model a million times, you're querying it only as many times as there are partitions. There may be tens, maybe even a hundred, depending on how big your data is and how much Spark partitions it. But it's much more efficient and definitely faster. So that's why I think this is a very good model to consider.
Demo: training a model and scoring with Spark
So let's see how this works in practice.
Okay, I want to start by showing you this table that we have in that same schema. It's a table named loans full schema, which has about 10,000 records that we'll use for our example. So going into RStudio, I have loaded a Quarto document that we're going to use to show the process. And we're going to start by connecting to Databricks cluster.
So now that I'm connected, you'll notice that I can browse the catalog almost the same way as we can in the web UI. So you see the loans full schema table, which we're going to make a reference to here. This lending table variable, you'll notice, is going to return the top 1,000 rows. And it's actually not importing the entire dataset, right? It's only showing 1,000 rows as a preview of the data. All it holds is just a reference, like I said, to that table. So it's really nice because we don't have to import everything, but we can refer to it programmatically.
To start, we're going to use the slice_sample function, which sparklyr supports, to do the sampling. Now, it is a very basic sampling, but at least it's something, right? It's not the top 1,000 records or something you had to filter specifically, which may be a problem when you sample. This sampling comes from Spark SQL. So we can just run this, and in this case, we're actually collecting. And now we can see the 1,000 records that I collected that are going to be part of my model.
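The connection, table reference, and sampling steps might be sketched as follows; the cluster ID and catalog/schema names are placeholders, and the connection assumes DATABRICKS_HOST and DATABRICKS_TOKEN are set in the environment:

```r
library(sparklyr)
library(dplyr)

# Connect to a Databricks cluster (placeholder cluster ID)
sc <- spark_connect(
  method     = "databricks_connect",
  cluster_id = "0123-456789-abcdefgh"
)

# A lazy reference to the table; nothing is downloaded yet
lendingclub <- tbl(
  sc,
  dbplyr::in_catalog("my_catalog", "end_to_end", "loans_full_schema")
)

# Sample 1,000 rows on the Spark side, then collect them into R for training
lending_sample <- lendingclub |>
  slice_sample(n = 1000) |>
  collect()
```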
Notice that I'm not typing all this live because there's a lot to go through, but all this material will be available in the GitHub repo so you can review it. Now we're going to build the model. We're going to use tidymodels, which is definitely something I would recommend you consider whenever you do this kind of thing. What's really cool about tidymodels is that it has all these enhancements, such as the ability to do pre-processing steps and resampling, to confirm that your model works properly, and to create a specific workflow that you can reuse.
And in this case, I'm not going to go through each step of what I'm doing. It's not the best model; it's just something I wanted to show as an example of things we can do. I have a split of the data through rsample. Then I'm going to use a recipe, and then parsnip to create the model, and then create the workflow, which does the pre-processing steps and then the predictions if we need to. And in this case, we're going to fit the model. Now that I have a fitted model, I'm going to test it with the test sample and just look at my predictions. So this is something that you would do locally, right? This is where we would spend most of our time.
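A minimal version of that tidymodels pipeline could look like this; the predictor columns are illustrative rather than the exact recipe from the talk, and `lending_sample` is assumed to hold the 1,000 collected rows:

```r
library(tidymodels)

# Split the collected sample into training and test sets (rsample)
lending_split <- initial_split(lending_sample)

# Pre-processing recipe: predict the interest rate from a few loan fields
lending_rec <- recipe(
  interest_rate ~ annual_income + term + loan_amount,
  data = training(lending_split)
) |>
  step_impute_mean(all_numeric_predictors())

# Bundle the recipe and a linear model (parsnip) into a workflow
lending_wf <- workflow() |>
  add_recipe(lending_rec) |>
  add_model(linear_reg())

fitted_model <- fit(lending_wf, data = training(lending_split))

# Check the predictions against the held-out test sample
predict(fitted_model, testing(lending_split))
```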
So the next part is what I really want to show. Next I want to connect to the R models board, and I'm going to save that new model we just created into the lending model linear pin. Because every time I run the sample, it's going to give me different data; the weights are going to be different, the coefficients are going to be different. So we're going to get a new version every time, which is okay. Most likely it's going to run the same, but I want to show you this.
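Publishing the fitted model as a pin might look like this; the board path and pin name are placeholders matching the talk's setup, and `fitted_model` is assumed from the training step:

```r
library(pins)

# Board for models (placeholder volume path)
models_board <- board_databricks(
  folder_url = "/Volumes/my_catalog/end_to_end/r_models"
)

# Each re-run of the sampling yields different coefficients, so each
# write creates a new version of the model pin
models_board |> pin_write(fitted_model, name = "lending_model_linear", type = "rds")
```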
Next, we're going to skip this part at first; I'll explain this function. If you recall, we're telling Spark to run this particular R code for each of the partitions, and the way we express that is by putting all of it into a single function that we can pass to spark_apply. This is the function that I'm going to use. And at the end of the day, all it's doing is this: running predictions, binding the predictions onto the partition of the dataset it received, and reordering the column layout.
All this section right here is about where I'm going to get the model, or rather the file for the model. What's really neat about using a pin inside Databricks is that, even though I've done this before with pins, I couldn't do it directly because the pin was outside of Databricks. But now that it's inside, it's really easy to refer to the RDS file that has the model, because it's all inside, right? The Spark cluster is inside the entire Databricks environment. This allows us to just create a quick reference and use readRDS to get the pin. If I don't find that path, which most likely means I'm running locally, then I use board_databricks to get the connection, get the pin, and use it as the model. So it basically just switches back and forth, because I want to test it with local data before I try it against the Spark cluster.
So that's why I do this. And the part that we skipped does exactly that: it gives me the path to the latest pin. It basically builds the path with the latest version of the pin so that I can just copy and paste it. So the idea is that if you try to do this, you can copy this particular piece of code, replace the name of your pin, and use that as your reference.
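The switching logic described above could be sketched as a per-partition scoring function; the volume path, version folder, and pin name are placeholders (in practice the path is built from the latest pin version, as just described):

```r
# Per-partition scoring function to pass to spark_apply()
predict_partition <- function(df) {
  # Placeholder path to the pinned model file inside the volume
  model_path <- file.path(
    "/Volumes/my_catalog/end_to_end/r_models",
    "lending_model_linear/20240430T120000Z-00001/lending_model_linear.rds"
  )
  model <- if (file.exists(model_path)) {
    # Inside the Databricks cluster, the volume is reachable as a local path
    readRDS(model_path)
  } else {
    # Running locally: fall back to reading the pin through the board
    board <- pins::board_databricks("/Volumes/my_catalog/end_to_end/r_models")
    pins::pin_read(board, "lending_model_linear")
  }
  preds <- predict(model, df)
  names(preds) <- "_pred"   # Databricks rejects the tidymodels default ".pred"
  cbind(preds, df)          # prediction column first, then the partition's columns
}
```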
So a little bit of a trick there, but it works. And it's not obfuscated or hidden; you can literally put all this together. Now that we've created the function, let me run this. We test it locally with the local lending data.
So what it's doing is running the predictions, right? And then returning the dataset. Each one of the jobs is going to create its own predicted partition of the dataset, and at the end of the day, Spark is going to put it all together and give us a single dataset.
The next thing I'll do is test that, but only over the first 10 records of the actual data in Databricks, you know, that table that we saw that exists in Databricks. So I'm right here locally on my laptop, using RStudio locally, and I'm able to run this against the data that's in Databricks. I'm not downloading anything. I'm just sending over the function and saying, hey, run this over the dataset. So it's running right now, and I'm not running it over everything, only a few records, just to make sure that it works. It's a good thing to do first. As you can see from what it returns, it worked, right? We can see the predictions.
And one of the things that is interesting is that tidymodels names the prediction column .pred, which is something the Databricks environment doesn't like. So I rename it to _pred. That is the reason why in the function, when I reorder, I say 56, because that way it works for both, right? Just a little quick thing for me to do, but that's why I do it. That way it doesn't matter if it's .pred or _pred; that function will always work.
Okay. So having said that, again, not the interesting part, but now we saw the interesting part, which is that it actually ran, right? We have a function that ran in Spark, and it worked. The other thing I want to mention is that Spark requires that you pass a column spec for whatever the output of the R program will be. If you don't pass it, then sparklyr will attempt to create the column spec for you and run it. If it does that, then as part of the output in R, it will tell you that it created one for you, and it gives you the exact spec that it used. So you can literally copy and paste this into a variable, like I'm doing here, and keep it.
So you can keep that for further use. I ran it as a test over 10 records. Now I want to run it over the entire dataset; see, I don't have any limit on how many rows I'm grabbing, but I am using the column spec, right? So I copy that and then run spark_apply for the entire dataset. Then I'm going to mutate it and add a new field that calculates the difference between the interest rate for that particular loan and what we predicted it should have been. And then we're going to compute, which means it's going to create a temporary table inside the Spark session with all the data. Now, this may not always be something we can do, if the data is way too big and may not fit in memory in Spark, but in this case we can.
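Put together, the full scoring run might look like this, assuming `sc`, the `lendingclub` table reference, the `predict_partition()` function, and a `col_spec` string copied from the trial run:

```r
preds <- lendingclub |>
  # Spark decides the partitioning; the model pin is read once per partition
  spark_apply(predict_partition, columns = col_spec) |>
  # Difference between the actual rate and the prediction
  mutate(rate_diff = interest_rate - `_pred`) |>
  # Materialize as a temporary table inside the Spark session
  compute()
```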
So we're going to run this. Again, all I'm doing with the model now is sending it directly to Spark, and Spark decides how it splits the data. It's only going to read that pinned model as many times as it created partitions, and it returns the entire thing for us, which I think is really cool.
So what I want to do as a kind of cool thing for this demo is essentially figure out which ones are outliers. The model said, hey, you should have this interest rate, and then we see that the actual interest rate the customer got is way higher, right? But "way higher" may vary; we don't really want to set a fixed number. We may want to set it based on the population. So we'll just get the standard deviation, which in this case, with this population today, is 4.12. So we want to find, and again, this is all kind of a toy idea, but trying to give it some significance, what loans are actually three times the standard deviation out, right? Or outside of that threshold. In this case, we have 85 loans.
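The outlier step could be sketched like this, assuming `preds` from the full scoring run and the `board` and `sc` objects from earlier; the pin name follows the talk:

```r
# Standard deviation of the rate difference, computed inside Spark
sd_diff <- preds |>
  summarise(sd_diff = sd(rate_diff)) |>
  pull(sd_diff)

# Loans whose actual rate sits more than three standard deviations
# above the prediction; collect() is the first time full rows leave Spark
large_differences <- preds |>
  filter(rate_diff > 3 * sd_diff) |>
  collect()

# Save the outliers back to Databricks as a pin, then disconnect
board |> pin_write(large_differences, name = "large_differences")
spark_disconnect(sc)
```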
Notice that I have not collected any data until just now. So all the calculations, the predictions, and the identification of the 85 loans have all happened externally. And now that I have the actual dataset that I'm interested in, I'm going to create another pin, named large differences, and save it back into Databricks. And now that I've done that, I can disconnect.
Publishing to Posit Connect and scheduling jobs
Next, we can use that pin in something such as a Shiny app, for example, that can load the model pin and then use it to, in this case, simulate what the resulting interest rate should have been based on certain characteristics of the loan. I actually have this Shiny app already published on Posit Connect, which I can show you here, so you can interact with it. It's not the best Shiny app, but it shows you, for example, one avenue that you can use.
The other avenue, which is the one I'm most interested in, is to show you this. What I did here was copy all the necessary code from the analysis into this Quarto doc, and essentially make it run to do the same thing: get the standard deviation, get the large differences, and then save the new data into the same pin. The idea is that we can also publish this same document to Posit Connect. And if I go back to Connect, into this tab which I have loaded, you can have the same job that reads the pin, connects to Spark, runs the predictions, finds those outliers, and then saves them back to the same pin. I can have it run on a schedule, on a monthly, weekly, or even daily basis.
What's neat about it is that the sparklyr tooling has a convenience function, deploy_databricks(), that allows you to easily publish the job, and you can make some selections here. It will automatically look at your Databricks host and token environment variables. And if I go back to my published document here, you can see that it has them here. So it's very easy to publish, right? I don't want it to seem like I spent a lot of time and magically have it working and actually running the process; it's just that we don't have enough time to show the whole thing. So that's the other thing we can do that's really cool.
Sharing pins across teams and languages
Okay, let me go back to the Databricks UI. And here we're going to go to the workspace real quick. Remember, I mentioned that we can have that board accessible to others inside your organization. Well, think in terms of a different team now. Maybe the credit team needs to analyze those loans that have the large differences. Here I'm using a notebook inside Databricks where I actually installed that version of pins, and, assuming the credit team is using Python, they can do the same thing to access the board. See, I'm able to access it from here.
Then I can use pin_read to read from that board. The Python version of pins uses the Databricks SDK as its main integration point for Databricks, so you can easily use the same functionality once it's available inside Databricks notebooks. Once the data is read in, we can run any pandas command and continue the analysis.
So this is the final workflow of what we just reviewed. We started with an initial analysis that gave us a model we can use, which in turn fed the Shiny app that we saw. That Shiny app was also published to Posit Connect. We also had that initial analysis create essentially another job to help us automate the process: it reads the model, reads the data from Databricks, and, through that scheduled job inside Posit Connect, creates, or rather updates, the large differences pin, which can then be used by others inside the organization to continue researching, even in a different language. So we can see the combination here, this wonderful combination of Posit and Databricks, and R and Python, and pins. I hope you have enjoyed this presentation, and thank you for your presence here today.


