Resources

From SDKs to Agents: Building with R and Python on Databricks (Rafi Kurlansik & Zac Davies)

Speakers: Zac Davies; Rafi Kurlansik

Abstract: Databricks offers a rich ecosystem of packages for R and Python developers. This session explores the key tools available, like {ellmer}, {brickster}, and the Databricks SDKs, helping developers build scalable workflows and AI-powered applications. Whether you're working in R or Python, you'll learn how to integrate your IDE, leverage Databricks-hosted LLMs, and build agentic workflows. We'll cover practical implementations, best practices, and how to scale your projects using Databricks' robust infrastructure. Whether you're automating workflows or deploying AI-driven solutions, this talk provides the essential toolkit for success.

posit::conf(2025). Subscribe to posit::conf updates: https://posit.co/about/subscription-management/


Transcript

This transcript was generated automatically and may contain errors.

Hello everyone. Hopefully this will be as engaging and interesting as the last talk. It's not going to be too much about Databricks specifically, but about the cool things that you can build, and Databricks will be part of that. So: how we can go from doing things with APIs to really, really cool ways to do workflows that you might not think you can even do today.

So we'll start off with a typical day. You've probably had a message like this, where you get something from your manager. It's a quick one (it's never a quick one). And it turns out the CEO wants to figure out if return to office is a sensible strategy. And they need some analysis done, but it needs to be done by Monday. And today's Thursday, and your manager's out until next week. So it's got to be there on his desk when he comes back. Painful.

I love doing this to Zac. And this starts off with a process that we're all familiar with, which is we're going to have to go do a whole bunch of exploration, iteration, and then come back and communicate something to the team. And it looks something like this. We're going to be in Positron. We're going to do a whole bunch of work. And this is sped up. And sometimes it feels like it's going at this speed and not really doing too much. But at the end of the day, we're going to be doing visualizations, looking through the data platform, in this case Databricks, pulling back data. And as it turns out, I didn't actually lift my fingers at all. And what you just saw was completely programmatic. I was using Positron Assistant this whole time. And it looks like I was working.

So one of the cool things about this is not only was the Assistant doing it the whole time, it actually wrote the report in Confluence for me. So not only did I not lift a finger to write a little bit of code, it also did the report. Now, the whole point here is that whilst it can do these things, and these are really cool, and I'm going to show you how you can do this yourself, this is not to say that it should do all your work. It allows you to free your mind up to think more strategically. You don't have to bog yourself down in the little details of code. You can sit there iterating with the Assistant. Very, very cool.


Addressing the skeptics

Okay. Awesome. That truly is incredible. We are living in the future. But I have to say, how realistic is that, really? Because there are a few considerations here, some obstacles that you might be facing. So I'm a little bit skeptical, Zac, of your ability to actually pull this off. First of all, I saw you using an LLM. Where was that data going? You were sending all this metadata off to some third party. How do we know how secure that is?

The other thing is, we have a lot of our data in Databricks, and we very carefully govern that data using Unity Catalog. So how do I know if the Assistant was actually following the governance policies? Was it accessing data that it wasn't supposed to? Was it bringing back data that you weren't supposed to see? And beyond data security, there are also questions around resource constraints. I saw you were using Claude, and Claude is pretty expensive. So how are you controlling how much spend you were actually using there? Another is: are you sure that you actually ran this on the entire data set? Because we have a lot of data related to this report that you're writing, and maybe you were only able to run it on a subset of the data. And then lastly, that was actually kind of complicated, because you were using Databricks with Positron, but then you were also writing things to Confluence. So how did you go about authenticating there? Sometimes that's also kind of a complicated setup. Did you work with IT on that? What's going on here that enabled you to do that?

How it works behind the scenes

A lot of questions. I think it's best explained if I show you from scratch how we built the system up to work. So let's talk a little bit about how this works behind the scenes.

So what actually happened? I opened up Positron. I had a question: what impact do we have in terms of the office and productivity? And that requires us to have an MCP server, which, if you haven't heard of it, is a very fancy way of saying there's a standard way for a model to use what's called a tool to talk to a system. That's just fancy talk for basically a function. And we talk to the Databricks MCP server, which knows how to talk to Databricks in a standardized way. And once Positron can talk to Databricks and knows how to see what's inside, it can then go ahead and write some code. And it can push that back into R or Python. It's going to then use that code and use brickster, which is an R package that makes it really easy to work with Databricks, and you're able to run that analysis through things like dplyr. And then I've got to have the report pushed into Confluence. So we're going to need another MCP server that standardizes how Positron can talk to Confluence. And that's all here.

Now, if this is still a little bit confusing and you're not following, that's okay. I'm going to go through and build up a very simple example step by step. And hopefully you'll walk away thinking that is a lot easier than I thought it was going to be.

Building the MCP server step by step

So the first thing you need to do is you need to create a tool. And this is a tool specifically to discover resources in Databricks. And we're going to talk about warehouses. You don't necessarily need to know what that is. But the code will look something like this. It's just two lines of code. Show me all the warehouses that are there. And I'm going to take some of that metadata and it will return just a named list. Nothing too fancy.
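The slides aren't captured in the transcript, but the two lines might look roughly like this. This is a sketch using httr2 against the Databricks SQL Warehouses REST endpoint rather than the exact code from the talk; the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables are assumptions about how credentials are supplied:

```r
library(httr2)

# List all SQL warehouses in the workspace (GET /api/2.0/sql/warehouses),
# authenticating with a token read from the environment.
resp <- request(Sys.getenv("DATABRICKS_HOST")) |>
  req_url_path("/api/2.0/sql/warehouses") |>
  req_auth_bearer_token(Sys.getenv("DATABRICKS_TOKEN")) |>
  req_perform() |>
  resp_body_json()

# Keep just the metadata we care about: nothing fancy, a named list per warehouse.
warehouses <- lapply(resp$warehouses, function(w) {
  list(name = w$name, id = w$id, state = w$state, size = w$cluster_size)
})
```

In the talk this is done through brickster rather than raw HTTP, but the shape of the result, a small named list of warehouse metadata, is the same.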

So we end up in a situation where we've got this tool call in ellmer. So we're going to keep adding a few packages in. Really, this is quite simple at the end of the day. This is just the code from before. So those two lines of code formatted a bit more nicely. And it's going to have some metadata. And this is just a self-describing way of saying, what does this function do? And the reason this is important is because you need to tell the model why and when it should use this code. So this is basically how to use it.

And then you need to turn this into an MCP server. So this is the same code we just saw a moment ago. But there is a very subtle difference. And it's just this one line. And this is saying, hey, I want an MCP server with this tool that I've built. That's it. So is this going to be a fully fledged MCP server? Is this actually going to work, you might ask?
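Put together, the whole mini-script might look something like the sketch below. The `ellmer::tool()` and `mcptools::mcp_server()` signatures here are from memory and may differ slightly between package versions, so treat this as a shape, not a copy-paste implementation:

```r
library(ellmer)
library(mcptools)

# The function from before: list warehouses and return a named list of metadata.
list_warehouses <- function() {
  resp <- httr2::request(Sys.getenv("DATABRICKS_HOST")) |>
    httr2::req_url_path("/api/2.0/sql/warehouses") |>
    httr2::req_auth_bearer_token(Sys.getenv("DATABRICKS_TOKEN")) |>
    httr2::req_perform() |>
    httr2::resp_body_json()
  lapply(resp$warehouses, function(w) list(name = w$name, id = w$id, state = w$state))
}

# Wrap it as an ellmer tool. The description is the important part: it is the
# self-describing metadata that tells the model why and when to call this function.
warehouse_tool <- tool(
  list_warehouses,
  name = "list_databricks_warehouses",
  description = "List the SQL warehouses available in the Databricks workspace, with their names, ids, and states."
)

# The one extra line: serve that tool over MCP so Positron (or any other MCP
# client) can discover and call it.
mcp_server(tools = list(warehouse_tool))
```

The subtle difference the talk points at is just that last call: the same tool definition, handed to an MCP server instead of used directly in an ellmer chat.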

So let's go into the Positron IDE. Let's look at the assistant. And we're going to show you in real time how easy this is to set up. We're going to copy the script path. We're going to click into the assistant. Let's say we want to configure a new MCP server. We give it the command to run the script, which is the 20 lines of code. We give it a name. We make it available everywhere. And that's it. It writes a little JSON configuration based on the steps we went through. And now I'm going to ask a question. And it's going to go ahead and respond. And it's asking me to use the tool that I just made. I'm going to say, yep, that's okay. And it's going to pop open the authentication to Databricks. I'm logged in already. So it's going to act on behalf of me, my user, using OAuth. It's going to answer that question. And I can see I've got a warehouse already there, ready to be used.

So that JSON, we're not going to spend a whole bunch of time on this. But this is essentially to say, hey, Positron, look over here. This is the configuration that you need. This is a fairly standardized format. And this is just saying, hey, run the script to start the server.
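The configuration file itself is short. A hypothetical mcp.json along these lines captures the idea; the server name and script path are placeholders rather than the ones from the demo, and the exact top-level key can vary between MCP clients:

```json
{
  "servers": {
    "databricks-warehouses": {
      "command": "Rscript",
      "args": ["path/to/mcp_server.R"]
    }
  }
}
```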

So what just happened? Let's go over that and make sure we revise what just happened. We have the IDE, whatever language we want, using Positron in this case. And we've got the assistant, which is using Anthropic's Claude. And we're going to configure the little example MCP server that we had. That's using ellmer and mcptools and running in R. And it knows how to talk to Databricks. And that's it.

If we want to add Confluence back into the picture, we need to extend the mcp.json. We tell it to use the MCP server that Atlassian provides. One extra entry, pretty much. And we go back to where we were a moment ago. And we can extend that a little bit further. And now we're starting to be a bit closer to what we saw at the start of the talk, where I was able to have it do all of these things. So we add in that little configuration, and it knows how to talk to Confluence now. And that's it. Other than the fact that we don't need just warehouses; we need the full Databricks MCP server. So we can switch that out. And the talk appendices and resources will have that code, if you're interested. It's not just 20 lines, it's 150. So a little bit more, but quite simple.
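As a sketch, the extended configuration might look like this. The Atlassian entry here is an assumption: `mcp-remote` is a common way to connect a local client to a remote MCP server, but check Atlassian's documentation for the current endpoint and recommended setup:

```json
{
  "servers": {
    "databricks": {
      "command": "Rscript",
      "args": ["path/to/databricks_mcp_server.R"]
    },
    "confluence": {
      "command": "npx",
      "args": ["-y", "mcp-remote", "https://mcp.atlassian.com/v1/sse"]
    }
  }
}
```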

The cool thing here is you actually still get to decide how this code runs. So just because I wrote the MCP server in R, it doesn't mean I need to use R to run everything. I can use Python, and it's just fine. Because the assistant is going to write the code that talks to Databricks using whatever method I want. If I want to use PySpark, that's fine. If I want to use Brickster in R, that's also fine. If I want to use ODBC, that's fine, too. So you get to choose how the code is being written. This is a collaboration with the assistant. It is not just a hands-off keyboard end-to-end.

How Databricks addresses the obstacles

So coming back to reality, hopefully that's explained a little bit more about how we actually did things. Okay. That was pretty compelling. I think it's worth revisiting sort of how you did that through our lenses here. So when it comes to how Databricks actually gets around some of these obstacles, there's really three products that play a major role.

So the first is Unity Catalog, which is essentially the governance layer in Databricks. It provides access control for all of your tables and all of your models, and it even extends to objects like functions and files. It's called Unity for a reason. Whatever access Zac has to those assets in Databricks is governed by Unity Catalog. So when he makes this connection and the assistant uses his authorization via OAuth, it's respecting all of the Unity Catalog access control.

The next thing, in terms of compute: do you have enough compute? Zac is actually running this on serverless compute on Databricks. So we were basically able to scale to whatever we need for the data volume for any particular query. Even if you're running small queries alongside larger queries, serverless compute handles that all for you in the back end. You don't have to worry about that. So yes, you do have enough compute.

And then another really big piece here is the AI Gateway. The idea of the AI Gateway on Databricks is that no matter which LLM you're using, whether it's an open source model hosted on Databricks Model Serving or a model from OpenAI or Anthropic, you can essentially control and govern access to those endpoints through the AI Gateway. So your authentication to OpenAI or something like that can be handled by the AI Gateway. Anthropic actually has a partnership with Databricks, so you can use Claude natively in Databricks, and you don't have to worry about the governance there. It's totally handled by the Databricks security perimeter. And when it comes to tokens, the AI Gateway can really help there too, because you can set guardrails for how many tokens your users can consume. So we can keep things from getting out of control with the assistant.

Just one last thing on the complexity of the authentication. In both cases, Zac used OAuth, so he signed in as himself. He was prompted by the assistant to sign in as himself, and those credentials were brought into the Positron session and used by the assistant. So it wasn't able to go beyond the scope of what Zac has access to across these different systems.

And the access is limited, right? If it's OAuth, the sessions are not infinite in terms of their span, so they're only going to be short-lived credentials.

Try this yourself

So what I want everyone to go away and do is at least try to work with something like this. It doesn't have to be using Databricks. I want you to try this for yourself. I've found this to be incredibly productive, and it allows me to deliver better results than I would have otherwise delivered, because I'm thinking more about the end output and how that appears and looks rather than writing the minute details. I'm still obviously sanity-checking the code, its correctness, and all of that.

To try this yourself, you need a few things. You're going to need a language. Python and R are a good start. You're going to need an assistant. I recommend the Positron assistant. It's actually quite easy to set up with a whole bunch of other tooling. So you'll also need an Anthropic token for that from wherever you choose to get your Anthropic token.

If you want to use Databricks, this whole talk was using the Free Edition of Databricks, which has no cost and no credit card requirement. You can sign up, and it's free forever; there are just limits on what you can do with it. So I used the Free Edition for this, and I used Confluence. However, this is all quite flexible, and everything here is optional. You can switch out Confluence for Notion. You can switch out Databricks for DuckDB. The whole point is these systems are modular. So you shouldn't feel limited to say, hey, I don't have Databricks, I can't do what Zac just showed me. That's not true. You just have to change the MCP servers that you're using.

So that's pretty much it. We've got time for questions. Thank you very much. I'm going to leave the QR code up for a moment, but there's a repository here with all the resources and more if you're interested in reading anything else. Thanks.

Q&A

All right. Thank you, Zach and Rafi. We do have a few questions. So let's take a look. How might you share reports in this architecture? Do you use apps hosted on Databricks?

It's up to you, but if you're using Databricks and you set up the assistant to author these applications, which I know people have done, then yes, you can share with Databricks Apps. That's not a problem. If you're using Posit Connect, there's no reason that you can't have it author and write things that you can publish to Posit Connect either. It doesn't really matter where you're publishing to. It's just about how you work to get to that end output and then wherever you need to distribute it. Happy days.

Another question. Is Databricks currently supported as a backend in Positron Assistant?

Backend in terms of the model provider: that is coming, I believe, next month. There's an active GitHub issue to add support for providers beyond just Anthropic or Copilot directly. So that'll be soon. And then you'll be able to use the full Databricks end-to-end stack if you wanted.

All right. Awesome. A few more. This one is from Steve. How should I make the argument to management that we should move our sensitive data off of our file servers and onto Databricks?

Yeah, that's... we could have a long discussion about this. I think it comes down to your company's security posture, and what the arguments for moving are. If you have data in the cloud already, there's obviously a precedent to move data off of your file server. So I would start with: where's the core aversion coming from? Is it security? Is it IT? Get to the fundamental thing they want to solve. And then if you're working or talking with a company like Databricks, there are usually people dedicated to helping you solve those problems, like myself or Rafi. At the end of the day, there's a whole bunch of documentation we provide on security, and we're happy to have discussions. And there are companies doing immensely sensitive work on Databricks day to day: companies doing healthcare research, financial companies working with a whole bunch of sensitive data. It should be a solvable problem. It's just a matter of effort.

I want to add one thing there. This was not a Lakehouse talk; it was more about MCP and Posit and Databricks working together. But Lakehouse, Steve, is really the concept I think you want to sell to the folks who are running the file servers now. Because in a Lakehouse like Databricks, which is a pioneer of that whole concept, you can manage and govern unstructured data, whether that's flat files or images or video, alongside structured tables. And then with Unity Catalog, you can now also add to that list machine learning models and just any arbitrary function. So all of those things can be governed in the same platform. When I hear "file server", I hear "data silo". This is a technology that's been around for years now. You can get away from that. You have thousands of customers getting value from that. It's an established pattern now. So yeah, go watch a Lakehouse talk or contact us, and we can help you with that.

I'm sure you've got a pain to solve for as well. If there's no precedent to actually move, then it's a bit harder of a discussion as well.

All right. Let's do one last question. Does Positron integrate with Databricks Connect?

Yes. The Databricks integration with Visual Studio Code, just like the Snowflake one you saw a moment ago, carries over in the same way. And Databricks Connect works through Spark if you're using that. So it should be no problem at all.

All right. Thank you very much. Round of applause, please.