
Parallelize R code using user-defined functions in sparklyr
If you’re an Apache Spark user, you benefit from its speed and scalability for big data processing. However, you might still want to leverage R’s extensive ecosystem of packages and intuitive syntax. One effective way to do this is by writing user-defined functions (UDFs) with sparklyr. UDFs enable you to execute R functions within Spark, harnessing Spark’s processing power and combining the strengths of both tools.

In this tutorial, you'll learn how to:
- Open Posit Workbench as a Databricks user
- Start a Databricks cluster within Posit Workbench
- Connect to a cluster within Posit Workbench
- View Databricks data in RStudio
- Create a prediction function
- Create a user-defined function with sparklyr

Read our most recent blog that covers parallelizing R code using user-defined functions (UDFs) in sparklyr: https://posit.co/blog/databricks-udfs/

Learn more about our Databricks partnership: https://posit.co/solutions/databricks/

Watch other tutorials on using Databricks and RStudio: https://youtube.com/playlist?list=PL9HYL-VRX0oR-3AgWbXtlfdr29626EjRJ&feature=shared
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
So let's say you've created this great model using tidymodels, which is an opinionated set of R packages for machine learning. You've used recipes to detail the preprocessing steps you want, you've fitted a LASSO model, and you've created a workflow that adds your model and your preprocessing recipe to your training data. When you fit your data, it all looks great. It's exactly the model you want to use. But you're an Apache Spark user, so you want to benefit from Spark's speed and scalability for big data. But you also want to maintain your R code. You want to use R's extensive ecosystem of packages and its syntax, and you definitely don't want to rewrite this great model you've created.
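A minimal sketch of what such a tidymodels workflow might look like. The dataset (`lending_train`), column names, and penalty value are placeholders for illustration, not the exact model from the video:

```r
library(tidymodels)

# A recipe detailing the preprocessing steps (placeholder formula and data)
lending_rec <- recipe(int_rate ~ term + bc_util + all_util,
                      data = lending_train) |>
  step_normalize(all_numeric_predictors())

# A LASSO model: mixture = 1 selects the lasso penalty in glmnet
lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) |>
  set_engine("glmnet")

# A workflow that bundles the recipe and the model together
lending_wf <- workflow() |>
  add_recipe(lending_rec) |>
  add_model(lasso_spec)

# Fit the workflow to the training data
lending_fit <- fit(lending_wf, data = lending_train)
```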
So to do that, we can create what are called user-defined functions in sparklyr. sparklyr is a package that provides an R interface to Spark and allows you to execute distributed R code within the Spark environment. For Databricks users, we at Posit have built many features into Posit Workbench, our enterprise-level development environment, to make the integration between Posit Workbench and your Databricks environment seamless.
Setting up the model and storing it as a pin
So let's see how this actually looks. Here are the steps that we ran to create our model. Then we're going to use the vetiver package, which is a package for machine learning operations and deploying models in production in R or Python, to store a version of this model on Posit Connect as a pin. A pin allows you to easily share data, models, or other objects across different projects. So in this case, the model is going to be stored as an RDS file that we're going to read in other places.
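Pinning the fitted workflow to Posit Connect with vetiver might look roughly like this. The object name `lending_fit` and the model name are placeholders, and `board_connect()` assumes your Connect server and API key are available in the environment:

```r
library(vetiver)
library(pins)

# Wrap the fitted tidymodels workflow as a deployable vetiver model
v <- vetiver_model(lending_fit, model_name = "lending-club-model")

# Authenticate to Posit Connect (reads CONNECT_SERVER / CONNECT_API_KEY)
board <- board_connect()

# Write a versioned pin; the model is stored so other sessions can read it
vetiver_pin_write(board, v)
```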
Opening Posit Workbench and connecting to a Databricks cluster
So this is the Posit Workbench front page. So I'm going to walk through how you sign in as a Databricks user. If you click New Session, there's RStudio Pro. And then under Session Credentials, I already have my AWS Databricks workspace that I've signed into using SSO. I click Start Session. It's starting my RStudio Pro session.
Once it is ready, it will automatically drop me into RStudio Pro. So as mentioned earlier, we have built a bunch of features into Posit Workbench to make the integration between Posit Workbench and Databricks seamless. One example of that is the Databricks pane. So if you notice here, the Databricks pane automatically detects any clusters that I have configured in my personal environment. If I click into here, I can see more information about them. And I can also start my cluster from within Posit Workbench. So I'm going to click here, Start Cluster. Once it's ready, I can then actually connect to my cluster via Posit Workbench, too.
You may have seen the Connections pane before for connecting to databases from within RStudio. And here, it provides me the sparklyr code that I need to connect. But with the Databricks pane, I can actually do that within here too. I click this plus sign. This pane opens up and it automatically detects my cluster ID and then also my Python environment and also provides the code for connecting using sparklyr. If I click OK, here it is running that sparklyr code to actually connect to Databricks.
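The connection code the pane generates is along these lines. The cluster ID below is a placeholder (Workbench pre-fills your real one), and this assumes the Databricks Connect backend that recent sparklyr versions use:

```r
library(sparklyr)

# Connect to a Databricks cluster via Databricks Connect
# (cluster_id is a placeholder; Workbench detects the real value)
sc <- spark_connect(
  cluster_id = "0123-456789-example",
  method     = "databricks_connect"
)
```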
Great. Now we're connected. So back in the Connections pane, it's loading the objects. Once it's ready, I can actually see everything that's contained within that Databricks catalog. So I can check out my Hive Metastore. I can take a look at a certain table and see what's contained within it. I can click the Viewer if I want to see the table within RStudio as well.
Writing the user-defined function
So I'm going to go into my Environment pane now, where we can see our connection. And now let's walk through the actual code for creating our user-defined function. So first, I'm going to load the packages that we're going to need: sparklyr, dplyr, and dbplyr. And then I'm going to connect to my table. I'm going to save it in an object called LendingClubDat.
Here, I can manipulate my data using dplyr and dbplyr code without having to actually bring it into R. I'm still running everything on the database. I'm going to load the vetiver package, the package that I mentioned earlier for machine learning operations.
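A sketch of referencing the table and manipulating it lazily. The catalog, schema, and table names are placeholders; `tbl()` with `in_catalog()` builds a lazy reference, so the dplyr verbs below are translated to Spark SQL and executed on the cluster, not in R:

```r
library(sparklyr)
library(dplyr)
library(dbplyr)

# Lazy reference to a Databricks table (names are placeholders)
LendingClubDat <- tbl(sc, in_catalog("hive_metastore", "default", "lendingclub"))

# These verbs run on the cluster; nothing is pulled into R until collect()
LendingClubDat |>
  filter(!is.na(term)) |>
  count(term)
```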
And here we're going to create our R function that is going to generate predictions on our data based on the model that we have saved on Posit Connect. So the function is called predictVetiver. It's going to load the workflows library. And then it's going to connect to my board where my pin is, in this case, Posit Connect. If you notice, I have the key here, and I'm going to replace this with my actual key in a second. Then I'm going to save an object called model, which is actually going to read that pin and pull that model into RStudio. Then I'm going to run my predictions and output the predictions from this function. After running that, the function is saved. So I'll be right back after I put my API key in this function.
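The function might look roughly like this. The server URL and pin name are placeholders, and `placeholderConnectAPIKey` stands in for the real API key, as described above. Note that the function loads its packages inside the body, because it will later run on Spark worker nodes where the session hasn't loaded them:

```r
predictVetiver <- function(df) {
  # Loaded inside the function so it works on Spark workers
  library(workflows)
  library(vetiver)
  library(pins)

  # Connect to the Posit Connect board holding the model pin
  # (server URL and key are placeholders)
  board <- board_connect(
    server = "https://connect.example.com",
    key    = placeholderConnectAPIKey
  )

  # Read the pinned vetiver model and predict on the incoming data
  model <- vetiver_pin_read(board, "lending-club-model")
  preds <- predict(model, df)

  # Return the input data alongside its predictions
  cbind(df, preds)
}
```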
In the predictVetiver function above, I've replaced placeholderConnectAPIKey with my actual API key. So that way, I can actually connect to Posit Connect and pull the model pin off the board. I'm going to create a slightly smaller data frame just for demonstration purposes. And then I'm going to run that predictVetiver function to make sure that it actually worked. And taking a look, it looks like it succeeded. It was able to load the appropriate packages, connect to Posit Connect, and use that model RDS file to run predictions. In this case, the predicted value is 20.6.
Running the UDF with spark_apply
But what if we want to use Spark instead? That is when we create a user-defined function. To do that, we use the spark_apply() function from the sparklyr package, and within it, we give it the function that we want it to run: in this case, predictVetiver. Notice that I didn't have to edit any R code. I didn't have to edit the function, and I didn't have to add any code that wasn't R. All I had to do was pass the function's name to spark_apply(). Then I give it the columns that I'm going to use to create the prediction and what I want to get back. And running this, we can see that it worked.
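The call might look like this sketch. `LendingClubSmall` stands in for the smaller Spark DataFrame created earlier, and the column specification is a placeholder; `spark_apply()` ships the unchanged R function to the workers and applies it to each partition:

```r
library(sparklyr)

# Run the unchanged R function on the cluster via a sparklyr UDF
# (DataFrame name and columns spec are placeholders)
spark_apply(
  LendingClubSmall,
  predictVetiver,
  columns = c("term", "bc_util", "all_util", ".pred")
)
```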
So using sparklyr, we were able to use Spark to run R code that pulls down a model created with tidymodels in R that has been pinned on Posit Connect, and then run predictions on a data frame using Spark, all without changing any of our R code. I hope you found this introduction to user-defined functions helpful, and I look forward to seeing what you create.
