Resources

Next-Gen Data Science: How Posit and Databricks Are Transforming Analytics at Scale

Modern data science teams face the challenge of navigating complex landscapes of languages, tools, and infrastructure. Positron, Posit's next-generation IDE, offers a powerful environment tailored for data science, seamlessly integrating with Databricks to empower teams working in Python and R. Now integrated within Posit Workbench, Positron enables data scientists to efficiently develop, iterate, and analyze data with Databricks, all while maintaining their preferred workflows. In this session, we'll explore how Python and R users can develop, deploy, and scale their data science workflows by combining Posit tools with Databricks. We'll showcase how Positron simplifies development for both Python and R, and how Posit Connect enables seamless deployment of applications, reports, and APIs powered by Databricks. Join us to see how Posit + Databricks create a frictionless, scalable, and collaborative data science experience, so your teams can focus on insights, not infrastructure.

Talk by: James Blair, Senior Product Manager, Cloud Integrations, Posit, PBC

Jul 7, 2025
19 min

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

Welcome everyone, hopefully the headphones are working and you're able to hear. A couple of housekeeping items: everything here is forward-looking, we're making no promises or guarantees, but we're excited to talk to you about what we're doing. And if you can, fill out the surveys when you're done with the session today, that'd be great. We love getting the feedback and learning from your experience here.

The title here is Next-Gen Data Science. What does that mean, what does it look like? We obviously live in a very exciting time, maybe a confusing time, a world of agents and agentic AI tools that are rapidly changing the landscape of data science workflows and what it looks like to do data science today. So I wanna talk to you a little bit about our perspective at Posit, what this means for us, and what we're doing as we look toward the future of data science work in what we believe is still a critically open-source world.

Here's kind of what we're gonna cover, we'll talk a little bit about who we are as Posit, what we do with Databricks, if that's of interest, and then we'll dive into some examples of tools that we're building at Posit to help address the needs of today's data scientists who are using either Python or R to work with data that's increasingly in these platforms like Databricks.

About Posit

You may not know us, we are the company that created the RStudio development environment, we used to be called RStudio until a couple of years ago, we rebranded ourselves to be Posit, and our focus has remained the same, which is we wanna support open-source data science and scientific research so that anyone, regardless of economic means, has the ability to make sense of a world of ever-increasing data.

And so even today, 15 years after the company's creation, we remain firmly committed to this notion of open-source data science. In fact, nearly half of the engineers at Posit are focused purely on open-source development, in either Python or R; that's a critical component of how we operate today as a company. We're also the winner of this year's Developer Tools Partner of the Year Award, two years in a row now, which we've been really excited about, demonstrating the depth and strength of our partnership with Databricks.

So a little bit about me, I'm a very mediocre but avid cyclist. This is me in a local race just a couple weeks ago, coming in around 15th out of 20, nothing exciting. But if you know any cyclists, we all have one thing in common, which is that we never shut up about it. So naturally, because this is who I am, we're gonna look at some data from the Tour de France. We're gonna take data about every stage that's been ridden in the Tour de France, from its inception in 1903 up through 2019, and we're gonna use that as the backbone to tell this story of what it looks like to do data science both today and in the future.

Introducing Positron

To do this, I wanna introduce a tool that we call Positron. Some of you may have heard of this, some of you may have not, but Positron is our next generation development environment built for data science specific use cases.

Now like I mentioned at the beginning, we are the company behind RStudio, and the reason that I bring that up here is because we've learned a tremendous amount in our time building RStudio for the past 15 years about what it takes to do successful data science. At some point in the past five years, we realized that as a company, we wanted to embrace open source data science for all languages, not just R specifically, but for Python and whatever might be introduced in the next 15, 20, 50 years. And that led us down this road to create what we're calling Positron.

When you open up Positron, one of the ways to access this is through Posit Workbench, which is one of our commercial tools. Positron's also available as a desktop installation, which I'll talk more about here in a moment. So from Posit Workbench, I can log in, I can say I want to use Positron, and I can also sign directly into my preferred Databricks workspace, and I'm in the front door. I'm running Posit tools alongside Databricks with direct access into everything that Databricks provides.

When I launch Positron, this is what greets me. And at first glance, this looks like every other VS Code fork on the market, right? It's very familiar if you've ever opened up VS Code, Cursor, or Windsurf; it looks and feels in many ways like some of those tools. But what makes Positron different is that it's designed specifically for data scientists and developers who are writing and working with languages like Python and R to understand data.

On the left-hand side, we have what we call Positron Assistant, which just barely entered kind of its public preview phase. Any sort of developer tool worth its salt these days has some sort of AI interface, and this is what Positron brings to the table.

The thing that makes Positron Assistant unique is that it has full context of not only the static code in the project you're working with, but also what plots you've generated and what variables you've loaded in your Python interpreter. It sees the exact same context that you as a data scientist see. So not only can I ask things like, help me write code to do this thing, I can also ask things like, hey, we just made this plot, help me understand what it's telling me. Or, this plot could use something a little bit different, give me some suggestions for how to improve it. And instead of just looking at the code and saying, well, you could add a title, it's going to look at the plot itself and say, you know what, this plot could be a little bit better if we adjusted these things.

If we move kind of across the screen here, we have a text editor, which is a text editor, but this has code completion, everything you would expect to see in a modern data science developer tool. Below the text editor, we have support for native interpreters for both R and Python. And this is where we start to depart a little bit from what we saw inside of RStudio, where all you could do is run an R session. Here I can run three R consoles and five Python consoles. They can all be from different virtual environments, managed with different dependencies, but I can access each of them from the same place, and I can submit code for execution into each of those places.

Over on the right-hand side, if I create visualizations, if I work with interactive plots, if I create interactive dashboards or web apps, those will render directly inside the developer tool for me to access without having to leave and open a browser window or figure out what local port I'm running on. Things will just render directly for me inside the IDE. And then above that, we have an environments pane that will show me what things I've loaded in memory, what variables I've loaded, what data objects I've created, so that it's easy for me to understand the current context of what my session looks like.

It also makes this really easy for Positron Assistant to understand what's available, so that it can give very tailored specific suggestions in response to my natural language queries. All of us, or I shouldn't assume, but many of us have likely interacted with LLM agents before where we say, hey, I'm working in this context, help me write some code to do this thing, and what we get back is like 80 to 85% there, but the column names are all wrong, the variables it assumes aren't correct, and so we have to go and edit it to work correctly. Positron Assistant's success rate on the first shot is much higher, because it sees what variables I have loaded, it knows what data I have access to, and so it knows what column names to use, it knows what variables to refer to without making things up.

Building a Tour de France dashboard

This is like a very, very brief introduction, but a workflow that we see very typically is data scientists will come into an environment like Positron, they'll explore some data, and then they'll want to deliver what they've learned or some sort of insight to business users and stakeholders, and in many cases that takes the form of some sort of dashboard or interactive application. Positron's a great environment for building these kinds of dashboards and interactive applications.

Here's our Tour de France data. We have all this information about how long it took people to ride a given stage, what the result was, who the rider was, and we want to present this information to our business users in a way that enables them to get their own insights from this particular dataset. Inside of Positron, we're going to write an interactive web application using the Shiny framework in Python. We could do this in Streamlit, we could do this in R, we could do this in Dash, we could do this in Gradio. Choose the framework you want.

I'm choosing Shiny here, and then I can click on this little play button inside the text editor and say, run the app. What happens is Positron runs the app directly inside the IDE. I don't have to open another browser window, I don't have to figure out what uvicorn is and how it's related to a unicorn, I can just run the app directly in the IDE, and it live updates. If I make changes to my source code, the app refreshes, and I see those changes directly. It's a very nice pattern for doing this sort of interactive application development.

So let's take a look at our app for a moment. We have some charts that indicate who had the most stage wins in the dataset, and other information here, who participated in the most stages. But the critical thing about this app is that the only user input is this chat interface on the left-hand side. And so here, as a user, I've come in and said, hey, how many total unique cyclists are there in this dataset? And you'll notice that it generates this SQL query, and it says, look, there's 5,162 unique cyclists.

Something really interesting is happening here. We're not giving all this data to the LLM and saying, hey, tell me how many cyclists are in this data. Instead, we've given the LLM the schema of our data, and then we've said, hey, given this schema, write a SQL query that will help us answer this question. So it's fully transparent, and much less prone to hallucinations, because we're not asking the LLM to interpret data itself, we're asking it to write the code that will then give us the answer to the problem that we have, or the question that we pose.
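A minimal sketch of this schema-only prompting pattern; the helper, table, and column names are hypothetical, and the prompt wording is illustrative (querychat automates this for you):

```python
def schema_prompt(table: str, columns: dict[str, str]) -> str:
    """Build a system prompt that gives the LLM only the schema, never the rows."""
    cols = "\n".join(f"  {name} {sql_type}" for name, sql_type in columns.items())
    return (
        f"You can query one table, {table}, with this schema:\n{cols}\n"
        "Answer every question by writing a single SQL SELECT statement. "
        "Do not guess values; only reference the columns listed above."
    )

# Hypothetical schema for the stage data.
prompt = schema_prompt("stages", {"rider": "STRING", "year": "INT", "winner": "BOOLEAN"})
```

Because the model only ever sees the schema and returns SQL, the generated query can be shown to the user before execution, which is what keeps the pipeline transparent.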

Interestingly, if we switch over to Databricks for a moment and look at our workspace, here's my Databricks query history. So not only did I ask this question in natural language inside of my application, that resulting SQL query was submitted to a SQL warehouse in Databricks for execution. The result came back, and I was given the answer inside my application. Not only that, but the underlying LLM, which in this case is Claude 3.7, is also coming from Databricks.

So we have this really unique environment where we're interacting with this LLM in Databricks and submitting the resulting SQL queries back into Databricks, and that's all supported through the querychat package that we're currently in the process of developing. What querychat does is it allows me to build this sort of natural language interface into my Shiny applications with very little overhead on the developer side. I have to configure a connection into Databricks, tell querychat to use this Claude model that's available through Databricks and to point resulting queries back into Databricks, and make sure that I have authentication configured the right way. If I'm running in Posit Workbench, authentication's already done for me, and it's just gonna work.

And then I have this ability to submit these natural language queries and get the response back in native SQL that's executed in these SQL warehouses. So from my Shiny application, I write a natural language query that gets submitted to Claude 3.7 in Databricks, which responds with a SQL query; we then turn that SQL query around to the SQL warehouse in Databricks, and we finally get our query results back. And we can ask aggregation questions, like how many cyclists are there, or filtering questions, like show only cyclists who ever won a stage. It will filter the underlying data, and everything in the dashboard will update, all according to my Databricks user permissions.
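The round trip can be sketched end to end. In this toy version, Python's built-in sqlite3 stands in for the Databricks SQL warehouse and a canned function stands in for the Claude call; every name and row is illustrative:

```python
import sqlite3

# Stand-in for the SQL warehouse: a tiny in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stages (rider TEXT, year INT, won INT)")
conn.executemany(
    "INSERT INTO stages VALUES (?, ?, ?)",
    [("A. Rider", 2018, 1), ("B. Rider", 2018, 0), ("A. Rider", 2019, 1)],
)

def fake_llm(question: str) -> str:
    # Stand-in for the model call: it returns SQL, never an answer.
    return "SELECT COUNT(DISTINCT rider) FROM stages"

def ask(question: str) -> tuple[str, object]:
    sql = fake_llm(question)                  # 1. the LLM writes SQL from the schema
    result = conn.execute(sql).fetchone()[0]  # 2. the warehouse executes it
    return sql, result                        # 3. both are surfaced to the user

sql, answer = ask("How many unique cyclists are there?")
print(sql, "->", answer)  # answer is 2
```

Swapping the stubs for a real model call and a Databricks SQL warehouse connection gives the pattern the talk describes: natural language in, auditable SQL out, results executed under your own permissions.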

Publishing and deploying the app

So now we've built this perhaps interesting application, it's running in my Databricks environment, but I need somebody else to look at it. And I want to share this with them. How do I do it? Well, we heard this morning from Justin and Deborah Braun about Databricks apps, and the way that Databricks is looking to use apps as a way to democratize access to their data intelligence platform. One of the things that we're working on at Posit is adding support for Databricks apps publishing directly into Positron.

So that inside of Positron, I can say, hey, I've built this app, I want to publish this app now into Databricks. We can open a publisher dialog and we can say publish this to Databricks. This is proof of concept work that's subject to change, but something that we're actively working on. And once I go through this process, then I run over to Databricks, I have this new app, this Tour de France dashboard that's now running inside of Databricks. And if I open this application up, I see what I saw in my local Positron environment.

Here I have the Tour de France application running, and I can ask questions like, hey, filter to only data from the most recent decade. It will figure out what the last year in the dataset is, figure out what 10 years back from that is, and then filter the underlying data, so that now we're looking at not the most stage wins of all time, where Eddy Merckx is no longer on the list, but the most stage wins in the past decade, where Mark Cavendish floats to the top. All powered by Databricks, running on Databricks and Databricks apps.
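The "most recent decade" request reduces to a small piece of query logic: find the latest year in the data, subtract nine, and re-aggregate. A standard-library sketch with made-up riders and years:

```python
from collections import Counter

# Toy (rider, year) stage-win records; the real data runs 1903-2019.
wins = [("Rider A", 1970), ("Rider A", 1971), ("Rider B", 2012),
        ("Rider B", 2015), ("Rider C", 2016)]

last_year = max(year for _, year in wins)            # latest year in the data
cutoff = last_year - 9                               # ten-year window, inclusive
recent = Counter(r for r, y in wins if y >= cutoff)  # wins per rider, recent only

print(recent.most_common())  # Rider A's older wins are filtered out
```

In the actual app this arithmetic lives in the generated SQL rather than in Python, but the filtering logic the LLM has to express is the same.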

One of the other deployment solutions that exists in this case will be using something like Posit Connect, where I can publish to my Posit Connect environment in much the same way that I did the Databricks apps. And now I have some additional functionality and features around how I share this across my organization and how access is managed and controlled within my organization.

Unity Catalog and row-level permissions

Now, one kind of interesting piece here is you'll notice on the right-hand side, we've shown that all users who have a login can come visit this application. And then down at the bottom, there's this little piece that says, hey, this application is connected to Databricks. It shows me that it's connected into Azure Databricks. So anytime somebody views this application, they're prompted to log in to their Azure Databricks account. Now, that's critical because what that means is if you have granular permissions defined inside of Unity Catalog, if you've got a Unity Catalog filter that you've written on top of this data, that can apply all the way through.

So let's illustrate this with a very infamous example, right? Let's take Lance Armstrong, most infamous cyclist of all time, famous for doping violations that he later admitted to and being stripped of all seven Tour de France overall victories, as well as all UCI World Tour victories.

So here in Databricks, we've written a Unity Catalog filter that says, hey, any cyclist convicted of doping is invisible, or not accessible, to certain users. If I come into my application as my user and say, hey, application, how many stages of the Tour de France did Lance Armstrong win? It comes back and says he won 24 stages. But if a less-permissioned user logs in and asks the same question, they get a very different response. It says zero.
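In Unity Catalog terms, a filter like that is a SQL function attached to the table as a row filter. A minimal sketch using Databricks' row filter DDL, with all table, column, and group names hypothetical:

```sql
-- Return TRUE only for rows the current user is allowed to see.
CREATE OR REPLACE FUNCTION no_doping_filter(rider STRING)
RETURN is_account_group_member('analysts_full')
       OR rider NOT IN (SELECT name FROM convicted_riders);

-- Attach the filter; every query against the table is now filtered,
-- regardless of which app or notebook issues it.
ALTER TABLE tour_de_france.stages
  SET ROW FILTER no_doping_filter ON (rider);
```

Because the filter is enforced at the catalog level, the Shiny app needs no access-control logic of its own.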

And then the LLM kind of goes on a journey, because it thinks Lance did win some stages, and so it tries a couple of times to figure out: is his name misspelled? Is he recorded in here a little bit differently? And then finally it concludes and says, you know what? I can't find any results for Lance Armstrong, and it's probably because he was convicted of doping and all of his results were stripped.

Same application, same URL, two different users, two different answers, because they have different permissions in Unity Catalog. This allows you to use Unity Catalog as the source of truth for data and data access, and then serve an entire collection of business users with a single entry point without needing to worry about building in logic for who can see what, because Unity Catalog is already taking care of it. Not only is this best practice, it also makes the life of the developers a lot easier, because one app can now serve multiple audiences within the organization.

What's next

We've covered a lot of ground, so I want to take just a minute to review a little bit about what we've discussed, and then look a little bit towards the future and what we see coming up next. We propose, and obviously I'm biased, but we propose that Positron is the next generation IDE for data scientists. We're hard at work on exciting new artificial intelligence features that will make Positron an intuitive and natural companion for data scientists who are interactively exploring and working with data.

Databricks OAuth support inside of Posit Workbench and Posit Connect makes it very, very easy for you to use Unity Catalog to define and govern permission to data, and then for data scientists, developers, and business users to access data with their appropriate credentials. querychat is a fabulous new package for building natural language interfaces into interactive applications, all of which can be powered through Databricks, and you can publish and deploy these applications to both Posit Connect and Databricks apps directly from within Positron.

So what do we have next? We're still working on building out this functionality for one-click publishing to Databricks apps, making it as easy as possible for you to build in your preferred environment. We would hope that that would be Positron, and then publish that out to Databricks to make it accessible across the rest of your organization. We're looking to bring Posit tooling into the Databricks ecosystem to make it possible to use tools like Positron directly inside your Databricks environment without the need to spin up additional infrastructure.

We're looking to add Databricks support to Positron Assistant so that when you open up Positron Assistant and say, help me write this code or help me improve this plot, it's working with Claude Sonnet or whatever model you've chosen within your Databricks account. Everything stays inside of that governance platform, and everything's defined by your Databricks permissions. And then one of the things that is maybe a little bit underrated, but that we're very excited about, is bringing improved SQL support into Positron as well, so that it can function not only as an editor and tool for R and Python developers, but also for those working directly with SQL and backend data sources like Databricks.

If you're interested in learning more about what we've talked about today, on the left-hand side is a QR code that will take you to the GitHub repository that contains the slides, the material, the content, everything that I've walked through today is available on GitHub at that link. On the right-hand side is some information about Positron and the ability to download it directly to your desktop today if you're interested in trying it out. We'd love feedback, questions. If you want to learn more while you're here at Summit, you can visit us. We have a booth at E405. And with that, I appreciate your time today. Thank you for stopping by.