
Using R with Databricks Connect - posit::conf(2023)
Presented by Edgar Ruiz. Spark Connect and Databricks Connect enable remote interaction with standalone Spark clusters, which improves our ability to perform data science at scale. We will share the work in `sparklyr`, and other products, that will make it easier for R users to take advantage of this new framework. Presented at Posit Conference, Sept 19-20, 2023. Learn more at posit.co/conference. -------------------------- Talk Track: Tidy up your models. Session Code: TALK-1084
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
So thank you, everyone. My name is Edgar Ruiz. I work on the sparklyr package. I've been doing large-scale implementations that involve R, as well as SQL and, in this case, Spark. What I'm going to talk about today is sparklyr's integration with a new kind of Spark connection, which is Spark Connect.
What is Spark Connect?
Spark Connect is something that started in a very, very recent version of Spark, 3.4. In fact, it's so recent that 3.5 literally came out yesterday. So we're talking about that close. And Spark Connect is not so much a new deployment mode, like standalone or Kubernetes or YARN. This is more about how you connect to Spark, right?
So the remote way of connecting to Spark, as it used to be before, was through something called Livy. And in order for us to use Livy, we needed to have a full-on YARN-based cluster, such as a Hadoop cluster, to be able to take advantage of that service. That's not ideal in the new world that we live in, where most of our data is now in S3 buckets, right? What we need now is not so much a place to land our data, like Hadoop does, but a way to process it easily, which is through Spark, right?
So with Spark Connect, it's basically doing that for us, where we can actually have a Spark cluster that we can actually interact with remotely, right? In our laptops, we can easily go back and forth and interact with it.
Databricks Connect, for DBR version 13 and above, is based on Spark Connect. So you'll be able to take advantage of this if you're currently using Databricks Connect or thinking about using it.
How Spark Connect works under the hood
So how does it work? Well, underneath it, instead of doing like straight-up REST APIs like it used to be with Livy, Spark Connect is using gRPC, which actually works as that layer of communication. And at this point, I can confidently say that the best way to interact with that gRPC is through PySpark. So PySpark will use other Python libraries that implement gRPC, and then talk to Spark.
For the machine learning stuff, which again just came out yesterday, it's going to be done through Torch. And right now, I believe there's one machine learning model implemented in the new version.
sparklyr and the reticulate integration
So what about R, right? What about sparklyr? Well, because of the implementation that's going on with PySpark, we made the call to go ahead and integrate through that library. And in order to do that, we're using the reticulate package. So if you're not familiar with it, this is the package that in R allows us to integrate with Python directly from our R session, so we can have two-way communication with it.
So as you can see in this little diagram, you can go from reticulate to PySpark into the Spark cluster. So the question is, okay, if I can do this from reticulate, what do I need sparklyr for? Well, if you are a user today, you know that sparklyr does other things, right? It gives you that interface for dplyr, so you can use dplyr commands to interact with Spark. You can also use DBI, and also the connections pane, as well as other very easy-to-use functions that we, as data scientists, use to run models and things like that.
The other great thing that I'm very excited about is like, once you start using it, you don't have to have Java installed in your computer anymore. This has been the subject of a lot of heartache for a lot of us, to have Java working. Well, because it's all remote, and because it's using this gRPC, you don't have to worry about that anymore.
Getting started with pysparklyr
To get started, it's very easy. You would use the latest version of sparklyr, as well as the way that we've chosen to implement it at this time, which is through an extension package called pysparklyr, which you can install from GitHub. Hopefully, we'll have this on CRAN soon. So all you have to do is upgrade your sparklyr and install it.
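A minimal install sketch of those two steps, assuming pysparklyr lives in the mlverse GitHub organization:

```r
# latest sparklyr from CRAN
install.packages("sparklyr")

# pysparklyr from GitHub (not yet on CRAN at the time of the talk)
install.packages("remotes")
remotes::install_github("mlverse/pysparklyr")
```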
Then to get the Python components that you'll need, pysparklyr gives you a convenience function that will install all the Python libraries that you will need to use in order to get it working. So if you're a Python user already and you're saying, I don't need the convenience function, I can do it myself. These are the packages that you'll need. There is a requirement for you to have Python 3.9 or above.
If you are an R user and the other way around, you don't really want to mess too much with Python, that's totally fine too. I feel exactly the same way. So I've gone through the pain of doing all this and putting all the stuff that you need in that function. If you don't have 3.9 available in your laptop, it'll warn you and it'll give you some tips on how to do your upgrade if you need to.
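That convenience function can be sketched like this; `install_pyspark()` is the helper that pysparklyr provides to set up the Python side:

```r
# installs PySpark and the other Python libraries sparklyr needs,
# inside a dedicated Python environment; requires Python 3.9 or above
pysparklyr::install_pyspark()
```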
You can also run Spark Connect locally. This is something that we're used to as we start learning Spark: we run Spark locally, and what it does is start Spark whenever you open a connection and stop the service when you disconnect. For Spark Connect, it's going to be a little bit different. You're going to have to start the service separately, and then stop it separately from your connection. At this point, we can make this function a little bit better, a bit more convenient. We have it available now, and we're definitely going to be improving it to make it easier to use.
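A sketch of that start/connect/disconnect/stop cycle, using the helper functions pysparklyr ships with:

```r
# start the local Spark Connect service; unlike a classic local
# connection, it keeps running until you stop it explicitly
pysparklyr::spark_connect_service_start()

sc <- sparklyr::spark_connect(
  master = "sc://localhost",
  method = "spark_connect"
)

sparklyr::spark_disconnect(sc)

# the service has to be stopped separately from the connection
pysparklyr::spark_connect_service_stop()
```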
Connecting to Databricks Connect
So now we come to Databricks Connect. How does Spark Connect relate to Databricks Connect? If you've worked with Databricks before, you've heard the term Databricks Connect before. But now you'll also hear the term Databricks Connect v2. That version two is what we're talking about now: it started with Spark 3.4, through Spark Connect, and it's available in DBR 13 and above.
Databricks Connect, same thing as Spark Connect: now you can use your laptop to interact with Spark. So you don't have to be inside the environment, inside Databricks, and open RStudio there and all that. You can actually do that on your laptop. In order to connect today, you'll need four things: the master, which will be the URL of your organization's Databricks workspace; the ID of the cluster that you want to work with; your personal access token, which you can also get from Databricks; and the method, which will be `databricks_connect`, as opposed to `databricks-connect` or `databricks`.
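Put together, those four pieces look roughly like this; the workspace URL and cluster ID below are hypothetical placeholders:

```r
library(sparklyr)

sc <- spark_connect(
  master     = "https://my-org.cloud.databricks.com",  # hypothetical workspace URL
  cluster_id = "0608-170338-abcd1234",                 # hypothetical cluster ID
  token      = Sys.getenv("DATABRICKS_TOKEN"),         # keep tokens out of code
  method     = "databricks_connect"
)
```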
So we try to make it as confusing as possible for you and I apologize for that. So but we have a lot of warnings that hopefully will get you there.
All right. The other thing I want to mention is that these two environment variables, DATABRICKS_HOST and DATABRICKS_TOKEN, are becoming very standardized across different applications as you start working with Databricks Connect v2. So sparklyr picks them up. That way, whenever you connect to your cluster, you don't need to set everything up; you basically just provide your cluster ID and the `databricks_connect` method. Of course, please, please, please keep your Databricks token in an environment variable. Do not put your credentials in open text in your code. That's definitely best practice.
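With those variables set, for example in your .Renviron, the connection call shrinks to something like this (the cluster ID is again a hypothetical placeholder):

```r
# DATABRICKS_HOST and DATABRICKS_TOKEN are picked up automatically,
# so only the cluster ID and the method are needed
sc <- sparklyr::spark_connect(
  cluster_id = "0608-170338-abcd1234",  # hypothetical cluster ID
  method     = "databricks_connect"
)
```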
Unity catalog and the connections pane
Because we're able to do that integration directly to PySpark, we're getting a lot of nice goodies to integrate better. So this is the Unity Catalog, where you can see your tables inside the Databricks web UI. Well, now that's inside RStudio. This is the first time in sparklyr history that we're offering more than one layer. It used to be that we would stop at two levels, table and schema. Now we go all three layers, matching what Databricks does: catalog, schema, table. That means you can be here and have the exact same navigational structure that you see inside Unity Catalog, so it's a lot easier for you to find the tables that you want to interact with.
Of course, you can also preview the tables if you need to. By just clicking on it, you'll see the first 1,000 rows. So that way that give you that visual first look of the data.
Also, accessing the catalog data is very easy now. dbplyr has a new function called in_catalog(). You may be aware of in_schema(), which gives you two levels, schema and table. in_catalog() does all three levels. So, as you see, you have samples, then nyctaxi, then trips. You basically put those in the same order inside the function call, and you're good to go. That creates the pointer to the table.
Then, whenever you just call the variable, it's only going to bring you the top rows, because dbplyr and sparklyr have those guardrails that prevent you, or really me, from downloading the entire billion records when I do this. So you get those goodies again from the layers that we wrap around all this communication with Spark.
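A short sketch of that three-level reference, assuming an open connection `sc` and the Databricks sample dataset:

```r
library(dplyr)
library(dbplyr)

# pointer to samples.nyctaxi.trips; nothing is downloaded yet
trips <- tbl(sc, in_catalog("samples", "nyctaxi", "trips"))

trips  # printing only pulls the first few rows, not the full table
```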
Once you have your table, excuse me, your pointer, you can use dplyr as we do today. With the same commands, you're able to push all that computation to the cluster. So instead of having to download the billions of records and then do the summarization on your laptop, you can just write your dplyr code as you would, treating trips just as if it were a table that you have locally. What it does in the background is translate your dplyr commands into SQL. So all the computation, all the aggregation, is happening remotely, and you're just getting back the results, which is exactly the ideal way to deal with these kinds of database and big-data backends.
If you want to see the query that's being generated, you can use show_query(), and that will print it out for you.
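For example, a sketch assuming the `trips` pointer from before, with hypothetical column names from the NYC taxi sample:

```r
library(dplyr)

avg_fare <- trips |>
  group_by(pickup_zip) |>
  summarise(avg_fare = mean(fare_amount, na.rm = TRUE))

show_query(avg_fare)  # the SQL that will run on the cluster
collect(avg_fare)     # only the aggregated result is downloaded
```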
Coming soon: Databricks pane in Posit Workbench
Also coming soon, and I'm very excited to talk about this, because it's the first time we're actually showing this work. Inside the Databricks web UI, again, you have a place where you can manage the clusters, where you can start and stop a cluster and get the information that you need in order to connect to it. Well, in an upcoming version of Posit Workbench, we'll have that same ability inside a new pane, a Databricks pane. That pane will match what you see today in the web UI: you can start and stop a cluster, and you can also expand the details of a specific cluster and see the same information that you would see in the web UI.
So that way, again, you start in RStudio and you end in RStudio. We started with the catalog, and now we're able to administer the clusters by starting and stopping them. And this is my favorite feature, because of what happens the first time you connect to a cluster. Again, remember, the only thing that you will need is the cluster ID. Well, the very genius folks working on this solution, which is not me, by the way, this is the IDE folks, have put a copy button right there for you to get your cluster ID, which you then just paste into your code and you're good to go. No problem: you start and end in RStudio.
Once you connect for the first time, the connections pane takes over, just like what you've been using before. You'll have that connection saved. So you end your work and come back tomorrow, and you'll have it right there: you click on it, it shows you the code that you used to connect, and once you say, yes, I want to connect, you go right back to the Unity Catalog. So, I mean, this is really exciting for me, because I'm sure a lot of folks who are working with this environment today will see a lot of improvements when it comes to what we call quality of life as far as interacting with it.
What's supported and what's not
Some additional information, there's some limitations right now with Spark 3.4 and 3.5, when it comes to Spark Connect.
So what we support, as I mentioned before: we have the dplyr and DBI APIs, the invoke() command will work, and the connections pane, like I showed. Also, the personal access token will work. Another thing I'm glad to announce is that we're also working on OAuth, which will work directly from Workbench. And most of the read and write commands work today. What's not supported, especially in 3.4, which is what's GA right now: ML functions, SDF functions, and tidyr. Although that's not quite right, because pivot_longer() is available now, although not with all arguments. So if you want to try existing pivot_longer() code that you have on Spark Connect, it will not silently ignore the arguments; it will just tell you when one is not supported. So it's safe for you to try out.
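A small sketch of that, assuming a Spark tbl `trips` with fare and tip columns (hypothetical names); unsupported arguments raise an informative error rather than being silently ignored:

```r
library(tidyr)

# reshape two charge columns into long form, computed on the cluster
trips |>
  pivot_longer(
    cols      = c(fare_amount, tip_amount),
    names_to  = "charge_type",
    values_to = "amount"
  )
```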
Accessing the full API via reticulate
Also, you can access the entire PySpark API. sparklyr will not have everything all the time, but you should be able to access everything, as opposed to how it was before, where you had to wait on me to go in and make the change because, from one version to the next, Spark added a new argument, and now you had to wait for me to add it. You don't have to do that anymore. Because of how reticulate works and how it does the integration, you can access those Python objects directly.
So here, I'm using the createDataFrame function from PySpark to add mtcars. Then when I pull it up, it's not printing the table like it does with a tbl; it's actually showing you the Python object that's been loaded into the Spark session. Then I can access functions that are part of that DataFrame, in this case corr(), for example. Obviously, this is not the way that you will be doing it day-to-day, but you can see that there are no limitations.
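A hedged sketch of that idea; how you get at the underlying PySpark SparkSession may vary by pysparklyr version, so the `session` variable here is an assumption:

```r
library(reticulate)

# assume `session` holds the PySpark SparkSession behind the sparklyr
# connection (the exact accessor differs across pysparklyr versions)
df <- session$createDataFrame(r_to_py(mtcars))

df                    # prints the PySpark DataFrame object, not a tibble
df$corr("mpg", "wt")  # call DataFrame methods directly from R
```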
Another way that we can see that there are no limitations is that, when you extend it with reticulate, you can literally call the same Python libraries inside your RStudio sessions. So in this case, for something so recent, like it literally started yesterday, you can go and get the pyspark.ml.connect classification module, and then start accessing its functionality. Here, I'm using the same mtcars table, and I prepare it so I can run the machine learning algorithm.
Then I run it here and I fit it, I get my logistic regression, and then I can use it to do the transform, which is basically the predict, and I extract the data to pandas through the toPandas() function, which gives me everything, like the predictions here. You can see the predictions directly as new columns inside my DataFrame, which, by the way, is already inside R. So I run all the models out there and I can get the results back. This is just an example; obviously, we want you to use the ml_logistic_regression() function once those features are available. We're not there yet, but you can access them. We're not locking you out here.
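Roughly, that flow looks like the following sketch, assuming PySpark 3.5 with the pyspark.ml.connect module and a prepared PySpark DataFrame `training` that already has a features vector column and an `am` label column (both assumptions):

```r
library(reticulate)

classification <- import("pyspark.ml.connect.classification")

lr <- classification$LogisticRegression(
  featuresCol = "features",
  labelCol    = "am"
)

model <- lr$fit(training)          # fit runs remotely on the cluster
preds <- model$transform(training) # transform is basically predict

preds$toPandas()                   # bring the predictions back into R
```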
Closing summary
So in closing, Spark Connect enables us to communicate with Spark remotely. Databricks Connect has that enabled on DBR 13 and above. We're using PySpark, through reticulate, to integrate that into sparklyr. And again, we're not limiting you; you can extend it with reticulate. Here are some links, because I want to thank you, first of all, for your attention, and here's the QR code if you want to get to this presentation, which has all the links that you need. Thank you so much.

