
Easier data and asset sharing across projects and teams with {pins} and Databricks
Led by Edgar Ruiz, Software Engineer at Posit PBC
April 30th at 11 am ET / 8 am PT

Sharing data assets can be challenging for many teams. Some may rely on emailed files to keep analyses up to date, making it difficult to stay current or know which version of the data is in use. {pins} makes it easier to share data and other assets across projects and teams. It lets us publish, or 'pin', to a variety of places, such as Amazon S3, Posit Connect, and Dropbox. Following recent customer feedback, the ability to publish, or 'pin', to Databricks Volumes has been added to the R package. The same capability is currently in the works for the Python version of {pins}.

This session on April 30th will showcase accelerating predictions by distributing a 'pinned' model using pins and Spark in Databricks. We'll walk through integrating {pins} with Databricks in your team's projects and cover novel uses of pins inside the Databricks ecosystem.

GitHub repo: https://github.com/edgararuiz/talks/tree/main/end-to-end

Here are a few additional resources that you might find interesting:

1. Pins for R: https://pins.rstudio.com/
2. Pins for Python: https://rstudio.github.io/pins-python/
3. More information on how Posit and Databricks work together: https://posit.co/use-cases/databricks/
4. Customer Spotlight: Standardizing a safety model with tidymodels, Posit Team & Databricks at Suffolk Construction: https://youtu.be/yavHEWpgrCQ
5. Q&A Recording: https://youtube.com/live/HDTDmEaK5zQ?feature=share
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Hi, my name is Edgar. I'm with the open source team here at Posit, and today we're going to talk about using pins with Databricks. We're going to do an end-to-end workflow demo that I hope you'll enjoy.
What is pins?
So I'd like to start first by talking about how we can conceptualize what pins does. And we can start by thinking of a physical board, right? A board that you may have at home or at the office, in which you would use pins to pin certain information, right? It could be notes, pictures, and things like that, that you can use as a reference for later, or even as a reference for others inside that location that they can see. But in the case of the pins package, what you are actually pinning are files, right? Could be CSV files, could be arrow files, could be RDS files.
So if we're pinning that, what is a board then, right? A board is anything that could contain a file, such as a folder. It could also be regular cloud storage, or even SharePoint, or especially Posit Connect, where you can not only publish Shiny apps or Dash apps, you can also publish pins. And lastly, we have a Databricks volume, which is what we just recently added as a backend to the pins package. So the idea here would be that as a user or a developer, you can write a pin into the board that can then be used downstream, either by yourself in other projects, or by other members of your team, or maybe even apps can read the pin and use it.
So why would I want to use a pin, right? I could easily use, you know, a shared drive that the team has access to. Well, pins are automatically versioned, which I think is really cool because you can access the versions of those pins programmatically. They also allow you to go back in case there are mistakes that you need to fix, or even help with reproducibility. You can also customize a pin's metadata in case there's additional information you want to add to make it clear what it does, which we'll see here in a minute. And it goes beyond just rectangular data. You can save list objects as pins, which pins converts into a JSON file.
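The versioning behavior described above can be sketched with a temporary local board, which runs anywhere without credentials (the pin name and data are made up for illustration):

```r
library(pins)

# A temporary, versioned local board stands in for S3, Connect, or a
# Databricks volume
board <- board_temp(versioned = TRUE)

board |> pin_write(head(mtcars, 3), name = "cars_demo")
board |> pin_write(head(mtcars, 5), name = "cars_demo")  # changed data: new version

# Versions are accessible programmatically
board |> pin_versions("cars_demo")

# Roll back by reading a specific version
v <- pin_versions(board, "cars_demo")$version[1]
board |> pin_read("cars_demo", version = v)
```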
Another really cool thing is that it goes beyond R, right? You can use the pins package both in R and Python. Having said that, the Databricks functionality that I've been mentioning exists only in the R package today, but we are working on adding it to the Python package. In fact, I have a PR open for it that would hopefully get that functionality into the Python package soon.
Pins vs. databases
I wanted to mention this real quickly because we had a customer asking about when to use a database versus a pin. The main thing that you have to keep in mind when deciding is how big your data is, right? And how much of that data are you using? If at the time of your analysis you're only going to use part of the data, then it's better to use a database, because you can query it to download just the portion you need and then use it in your analysis. With a pin, when you call pin_read, for example, it downloads the entire dataset. So if the data is really big, it may not be feasible to download it every time if you're not going to use the whole thing.
So that's the main thing that you have to keep in mind. I would say to think of pins as another tool in your toolbox rather than a replacement for something such as a database.
Demo: creating a volume and writing pins
So next, let's see an example of how we can use pins.
Okay, we're going to start in the Databricks web UI. And here we're going to go to Catalog. Inside the catalog, I'm going to choose this one. And the end-to-end schema is where I want to do today's demo. I'm going to create a new volume; you can see how easy it is: just press Create and then Create Volume. I'm going to name it "my volume" and hit Create.
So that's it. That's all it takes to create a volume. Of course, you need to have rights to it, but once you do, it's easy to do. And what I like about it is that the web UI automatically gives you the path that you can copy, which I just did. And what's so neat is that I can go into R and load the library.
And then I can define my board. And now let's just copy and paste the path. Now that I have my reference to it, I can query it; I can say pin_list to get a list of the boards that are there, or rather the pins. And you see there's nothing there yet. So let's put something in there.
And here, I'm just going to say pin_write(board, df). So I'm making reference to my board, and then sending it a data frame. And you'll notice that creates a new pin with this version. So if we go to the volume, I can refresh it. And you'll see that we have a new folder named df, which matches the name of our pin. And then inside that folder, there's another folder that matches the version of the pin. And inside that folder, we'll find the actual file that contains the data, and also a data.txt file that contains the metadata of the pin.
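The steps so far might look like this in R; the catalog, schema, and volume names are placeholders, and running it requires a live Databricks workspace with credentials in the usual environment variables:

```r
library(pins)

# Path copied from the Databricks web UI (placeholder names)
board <- board_databricks(folder_url = "/Volumes/my_catalog/end_to_end/my_volume")

board |> pin_list()    # nothing there yet

df <- data.frame(x = 1:3, y = letters[1:3])

# Creates df/<version>/df.rds plus a data.txt metadata file in the volume
board |> pin_write(df)
```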
So we usually don't go this far into examining, you know, the pin structure, but because we're introducing it today, I just wanted to show you how it works. So let me go back to R here and try to run this again, right? If I run the same pin_write with the exact same data frame, you'll notice that it tells me it didn't make any changes to the actual pin in the volume. And that's because the data frame is still the same, which is great, because that way it prevents creating multiple folders with copies of the same data, bloating the volume and taking up more space in the catalog.
Another thing that we can do is add another version. So let's change a piece of the data here and try to write it again. This has changed. See how now it creates a new version, and if we come back and check on our pin, we see a new folder with that version. So that's what's really nice about this, right? I mentioned you can also have a list. So let's create a list object real quick, a nested list, which we can then write.
And as I mentioned, that creates a JSON file. We can also customize the metadata by using the metadata argument. So we just created a new pin with the additional information for the metadata.
Then I can see we got the list here, and we got the test2 pin that has my customized metadata.
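The list pin and the custom metadata can be sketched like this, assuming the `board` object from earlier; the pin names and metadata fields are made up:

```r
# A non-rectangular object: pins stores a nested list as JSON
nested <- list(a = 1, b = list(c = "x", d = 2:3))
board |> pin_write(nested, name = "test", type = "json")

# Custom metadata travels with the pin in its data.txt file
board |> pin_write(
  nested,
  name     = "test2",
  metadata = list(owner = "edgar", purpose = "demo")
)

board |> pin_meta("test2")   # shows the user-supplied metadata
```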
Accessing pins from Python
Another cool thing is that if I were to go to Python (in this case, I have Python loaded here in Positron), I can import pins. Now, I have installed the version I mentioned that we're about to merge to add the functionality for Databricks, but I figured I would show you today. So I can say board = pins.board_databricks() with the volume path, and then board.pin_list().
You can see our pins, and we can read one. Let's read df. Remember, this is RDS, but what's neat is if it's a table, then this version of pins will automatically convert it for us to a pandas data frame, which is really nice.
So it's going to download the pin and give us the information back. Right, so that's the basic functionality that I wanted to show you.
A novel workflow: model scoring with Spark
Okay, so this diagram reasserts what I mentioned earlier about how we can create a pin or several pins, put them in a board, in this case a Databricks volume, that can then be used by projects, teams, or even apps. I didn't mention, though, that there's another possibility where you can have a job, for example, that runs on a regular schedule, reads the latest data from the database or whatever data sources, and then writes the pin, and that pin then becomes available downstream for the apps, projects, and teams that we're talking about. The jobs don't necessarily have to create new pins; they could update existing pins. So that's another way we can think about adding pins to our workflow.
Right, moving forward to this novel workflow that I want to propose. I'm calling it novel because it's a bit different from other things that we have shown, or from how we usually think about pins, but I think it's really useful and interesting to keep in consideration. I mention this especially because the Databricks ecosystem is so extensive. There's a lot that you can use as a data scientist to build more complex processes and be even more effective in your daily work. And I hesitate to call it complex, because these tools are actually really easy to use and easily accessible; the complexity is not in how to set them up, but rather in what you can do with them.
So we have talked before about how we can download portions of the data into R and train a model locally. In fact, even before Databricks, I'm sure we've all done this, right? We sample the data, download it into R, and train. What happens afterward varies a lot, right? Depending on what you need to do, there are a bunch of ways you can go.
But what I'm proposing is that whatever end-result model you get, you return it to the Databricks environment and save it as a pin inside a volume. So once you publish that model as a pin, what's really cool is that if you want to score your data, let's say it's a huge dataset, now you can use Spark to do that, right? Because then you can have Spark read the data, read the model, and actually run the predictions. On top of that, you can trigger that prediction using spark_apply through sparklyr. So it's really cool how this works because it allows us to do one thing that sets it apart from the regular way we serve models. Whenever you serve a model, generally what you're doing is serving it as a REST API endpoint.
So a typical served model would require that you send it some set of data and it would return a prediction, right? So let's say we have a table that has a million records. If you were to use the traditional model, you would then have to make a million calls in order to score the entire dataset. That, of course, is totally acceptable today; we do that all day long. But what I'm proposing is that we do this a little bit differently. So the idea is that Spark reads the data, and by default Spark partitions the data into smaller sections.
Then, because Spark allows us to run R UDFs, meaning that you can run R code inside Spark, you can have a small program that reads the model from the volume and runs the predictions on each of these segments or partitions individually, and also in parallel. So that way, instead of querying the model a million times, you're querying it only as many times as there are partitions. There may be tens, maybe even a hundred, depending on how big your data is and how much Spark partitions it. But it's much more efficient and definitely faster. So that's why I think this is a very good model to consider.
Demo: training a model and scoring with Spark
So let's see how this works in practice.
Okay, I want to start by showing you this table that we have in that same schema. It's a table named loans full schema, which has about 10,000 records that we'll use for our example. So going into RStudio, I have loaded a Quarto document that we're going to use to show the process. And we're going to start by connecting to Databricks cluster.
So now that I'm connected, you'll notice that I can browse the catalog almost the same way as we can in the web UI. So you see the loans full schema table, which we're going to make a reference to here. This lending table variable, you'll notice, is going to return the top 1,000 rows. And it's actually not importing the entire dataset, right? It's only showing 1,000 rows as a preview of the data. All it holds is just a reference, like I said, to that table. So it's really nice because we don't have to import everything, but we can refer to it programmatically.
To start, we're going to use the slice_sample function, which sparklyr supports, to do the sampling. Now, it is a very basic sampling, but at least it's something, right? It's not the top 1,000 records or something you had to filter specifically, which may be a problem when you sample. This sampling comes from Spark SQL. So we can just run this, and in this case, we're actually collecting. And now we can see the 1,000 records that I collected that are going to be part of my model.
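The connection, table reference, and sampling steps might be sketched as follows; the cluster ID and catalog/schema names are placeholders, and the connection assumes DATABRICKS_HOST and DATABRICKS_TOKEN are set in the environment:

```r
library(sparklyr)
library(dplyr)

# Connect to a Databricks cluster (placeholder cluster ID)
sc <- spark_connect(
  method     = "databricks_connect",
  cluster_id = "0123-456789-abcdefgh"
)

# A lazy reference to the table; nothing is downloaded yet
lendingclub <- tbl(
  sc,
  dbplyr::in_catalog("my_catalog", "end_to_end", "loans_full_schema")
)

# Sample 1,000 rows on the Spark side, then collect them into R for training
lending_sample <- lendingclub |>
  slice_sample(n = 1000) |>
  collect()
```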
Notice that I'm not typing all this live because there's a lot to go through, but all this material will be available in the GitHub repo so you can review it. Now we're going to build the model. We're going to use tidymodels, which is definitely something I would recommend you consider whenever you do this kind of thing. What's really cool about tidymodels is that it has all these enhancements, such as the ability to do pre-processing steps and resampling, to confirm that your model works properly, and to create a specific workflow that you can reuse.
And in this case, I'm not going to go through each step of what I'm doing. It's not the best model; it's just something I wanted to show as an example of things we can do. I have a split of the data through rsample. Then I'm going to use a recipe, and then parsnip to create the model, and then create the workflow, which does the pre-processing steps and then the predictions if we need to. And in this case, we're going to fit the model. Now that I have a fitted model, I'm going to test it with the test sample and just look at my predictions. So this is something that you would do locally, right? This is where we would spend most of our time.
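A minimal version of that tidymodels pipeline could look like this; the predictor columns are illustrative rather than the exact recipe from the talk, and `lending_sample` is assumed to hold the 1,000 collected rows:

```r
library(tidymodels)

# Split the collected sample into training and test sets (rsample)
lending_split <- initial_split(lending_sample)

# Pre-processing recipe: predict the interest rate from a few loan fields
lending_rec <- recipe(
  interest_rate ~ annual_income + term + loan_amount,
  data = training(lending_split)
) |>
  step_impute_mean(all_numeric_predictors())

# Bundle the recipe and a linear model (parsnip) into a workflow
lending_wf <- workflow() |>
  add_recipe(lending_rec) |>
  add_model(linear_reg())

fitted_model <- fit(lending_wf, data = training(lending_split))

# Check the predictions against the held-out test sample
predict(fitted_model, testing(lending_split))
```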
So the next part is what I really want to show. Next I want to connect to the R models board, and I'm going to save that new model we just created into the lending model linear pin. Because every time I run the sample, it's going to give me different data; the weights are going to be different, the coefficients are going to be different. So we're going to get a new version every time, which is okay. Most likely it's going to run the same, but I want to show you this.
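Publishing the fitted model as a pin might look like this; the board path and pin name are placeholders matching the talk's setup, and `fitted_model` is assumed from the training step:

```r
library(pins)

# Board for models (placeholder volume path)
models_board <- board_databricks(
  folder_url = "/Volumes/my_catalog/end_to_end/r_models"
)

# Each re-run of the sampling yields different coefficients, so each
# write creates a new version of the model pin
models_board |> pin_write(fitted_model, name = "lending_model_linear", type = "rds")
```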
Next, we're going to skip this part at first; I'll explain this function. If you recall, we're telling Spark to run this particular R code for each of the partitions, and the way we express that is by putting all of it into a single function that we can pass to spark_apply. This is the function that I'm going to use. And at the end of the day, all it's doing is this: running predictions, binding the predictions onto the partition of the dataset it received, and reordering the column layout.
All this section right here is about where I'm going to get the model, or rather the file for the model. What's really neat about using a pin inside Databricks is that, even though I've done this before with pins, I couldn't do it directly because the pin was outside of Databricks. But now that it's inside, it's really easy to refer to the RDS file that has the model, because it's all inside, right? The Spark cluster is inside the entire Databricks environment. This allows us to just create a quick reference and use readRDS to get the pin. If I don't find that path, which most likely means I'm running locally, then I use board_databricks to get the connection, get the pin, and use it as the model. So it basically just switches back and forth, because I want to test it with local data before I try it against the Spark cluster.
So that's why I do this. And the part that we skipped does exactly that: it gives me the path to the latest pin. It basically builds the path with the latest version of the pin so that I can just copy and paste it. So the idea is that if you try to do this, you can copy this particular piece of code, replace the name of your pin, and use that as your reference.
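The switching logic described above could be sketched as a per-partition scoring function; the volume path, version folder, and pin name are placeholders (in practice the path is built from the latest pin version, as just described):

```r
# Per-partition scoring function to pass to spark_apply()
predict_partition <- function(df) {
  # Placeholder path to the pinned model file inside the volume
  model_path <- file.path(
    "/Volumes/my_catalog/end_to_end/r_models",
    "lending_model_linear/20240430T120000Z-00001/lending_model_linear.rds"
  )
  model <- if (file.exists(model_path)) {
    # Inside the Databricks cluster, the volume is reachable as a local path
    readRDS(model_path)
  } else {
    # Running locally: fall back to reading the pin through the board
    board <- pins::board_databricks("/Volumes/my_catalog/end_to_end/r_models")
    pins::pin_read(board, "lending_model_linear")
  }
  preds <- predict(model, df)
  names(preds) <- "_pred"   # Databricks rejects the tidymodels default ".pred"
  cbind(preds, df)          # prediction column first, then the partition's columns
}
```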
So a little bit of a trick there, but it works. And it's not obfuscated or hidden; you can literally put all this together. Now that we've created the function, let me run this. We test it locally with the local lending data.
So what it's doing is running the predictions, right? And then returning the dataset. Each one of the jobs is going to create its own predicted partition of the dataset, and at the end of the day, Spark is going to put it all together and give us a single dataset.
The next thing I'll do is test that, but only over the first 10 records of the actual data in Databricks, you know, that table that we saw that exists in Databricks. So I'm right here locally on my laptop, using RStudio locally, and I'm able to run this against the data that's in Databricks. I'm not downloading anything. I'm just sending over the function and saying, hey, run this over the dataset. So it's running right now, and I'm not running it over everything, only a few records, just to make sure that it works. It's a good thing to do first. As you can see from what it returns, it worked, right? We can see the predictions.
And one of the things that is interesting is that tidymodels names the prediction column .pred, which is something the Databricks environment doesn't like. So I rename it to _pred. That is the reason why in the function, when I reorder, I say 56, because that way it works for both, right? Just a little quick thing for me to do, but that's why I do it. That way it doesn't matter if it's .pred or _pred; that function will always work.
Okay. So having said that, again, not the interesting part, but now we saw the interesting part, which is that it actually ran, right? We have a function that ran in Spark, and it worked. The other thing I want to mention is that Spark requires that you pass a column spec for whatever the output of the R program will be. If you don't pass it, then sparklyr will attempt to create the column spec for you and run it. If it does that, then as part of the output in R, it will tell you that it created one for you, and it gives you the exact spec that it used. So you can literally copy and paste this into a variable, like I'm doing here, and keep it.
So you can keep that for further use. I ran it as a test over 10 records. Now I want to run it over the entire dataset; see, I don't have any limit on how many rows I'm grabbing, but I am using the column spec, right? So I copy that and then run spark_apply for the entire dataset. Then I'm going to mutate it and add a new field that calculates the difference between the interest rate for that particular loan and what we predicted it should have been. And then we're going to compute, which means it's going to create a temporary table inside the Spark session with all the data. Now, this may not always be something we can do, if the data is way too big and may not fit in memory in Spark, but in this case we can.
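Put together, the full scoring run might look like this, assuming `sc`, the `lendingclub` table reference, the `predict_partition()` function, and a `col_spec` string copied from the trial run:

```r
preds <- lendingclub |>
  # Spark decides the partitioning; the model pin is read once per partition
  spark_apply(predict_partition, columns = col_spec) |>
  # Difference between the actual rate and the prediction
  mutate(rate_diff = interest_rate - `_pred`) |>
  # Materialize as a temporary table inside the Spark session
  compute()
```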
So we're going to run this. Again, all I'm doing with the model now is sending it directly to Spark, and Spark decides how it splits the data. It's only going to read that pinned model as many times as it created partitions, and it returns the entire thing for us, which I think is really cool.
So what I want to do as a kind of cool thing for this demo is essentially figure out which ones are outliers. The model said, hey, you should have this interest rate, and then we see that the actual interest rate the customer got is way higher, right? But "way higher" may vary; we don't really want to set a fixed number. We may want to set it based on the population. So we'll just get the standard deviation, which in this case, with this population today, is 4.12. So we want to find, and again, this is all kind of a toy idea, but trying to give it some significance, what loans are actually three times the standard deviation out, right? Or outside of that threshold. In this case, we have 85 loans.
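The outlier step could be sketched like this, assuming `preds` from the full scoring run and the `board` and `sc` objects from earlier; the pin name follows the talk:

```r
# Standard deviation of the rate difference, computed inside Spark
sd_diff <- preds |>
  summarise(sd_diff = sd(rate_diff)) |>
  pull(sd_diff)

# Loans whose actual rate sits more than three standard deviations
# above the prediction; collect() is the first time full rows leave Spark
large_differences <- preds |>
  filter(rate_diff > 3 * sd_diff) |>
  collect()

# Save the outliers back to Databricks as a pin, then disconnect
board |> pin_write(large_differences, name = "large_differences")
spark_disconnect(sc)
```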
Notice that I have not collected any data until just now. So all the calculations, the predictions, and the identification of the 85 loans have all happened externally. And now that I have the actual dataset that I'm interested in, I'm going to create another pin, named large differences, and save it back into Databricks. And now that I've done that, I can disconnect.
Publishing to Posit Connect and scheduling jobs
Next, we can use that pin in something such as a Shiny app, for example, that can load the model pin and then use it to, in this case, simulate what the resulting interest rate should have been based on certain characteristics of the loan. I actually have this Shiny app already published on Posit Connect, which I can show you here, so you can interact with it. It's not the best Shiny app, but it shows you, for example, one avenue that you can use.
The other avenue, which is the one I'm most interested in, is to show you this. What I did here was copy all the necessary code from the analysis into this Quarto doc, and essentially make it run to do the same thing: get the standard deviation, get the large differences, and then save the new data into the same pin. The idea is that we can also publish this same document to Posit Connect. And if I go back to Connect, into this tab which I have loaded, you can have the same job that reads the pin, connects to Spark, runs the predictions, finds those outliers, and then saves them back to the same pin. I can have it run on a schedule, on a monthly, weekly, or even daily basis.
What's neat about it is that the sparklyr tooling has a convenience function, deploy_databricks(), that allows you to easily publish the job, and you can make some selections here. It will automatically look at your Databricks host and token environment variables. And if I go back to my published document here, you can see that it has them here. So it's very easy to publish, right? I don't want it to seem like I spent a lot of time and magically have it working and actually running the process; it's just that we don't have enough time to show the whole thing. So that's the other thing we can do that's really cool.
Sharing pins across teams and languages
Okay, let me go back to the Databricks UI. And here we're going to go to the workspace real quick. Remember, I mentioned that we can have that board accessible to others inside your organization. Well, think in terms of a different team now. Maybe the credit team needs to analyze those loans that have the large differences. Here I'm using a notebook inside Databricks where I actually installed that version of pins, and, assuming the credit team is using Python, they can do the same thing to access the board. See, I'm able to access it from here.
Then I can use pin_read to read from that board. The Python version of pins uses the Databricks SDK as its main integration point for Databricks, so you can easily use the same functionality once it's available inside Databricks notebooks. Once the data is read in, we can run any pandas command and continue the analysis.
So this is the final workflow of what we just reviewed. We started with an initial analysis that gave us a model we can use, which in turn fed the Shiny app that we saw. That Shiny app was also published to Posit Connect. We also had that initial analysis create essentially another job to help us automate the process: it reads the model, reads the data from Databricks, and, through that scheduled job inside Posit Connect, creates, or rather updates, the large differences pin, which can then be used by others inside the organization to continue researching, even in a different language. So we can see the combination here, this wonderful combination of Posit and Databricks, and R and Python, and pins. I hope you have enjoyed this presentation, and thank you for your presence here today.


