Resources

Posit and the modern data stack: Using Posit, Snowflake, and Databricks (James Blair)

James Blair, a senior product manager at Posit, discusses the integration of Posit tools with partners like Snowflake and Databricks, emphasizing the role of AI in enhancing data scientists' efficiency. posit::conf(2025) Subscribe to posit::conf updates: https://posit.co/about/subscription-management/


Transcript

This transcript was generated automatically and may contain errors.

All right, welcome everyone. We'll jump in since we have some limited time. Just as like a brief introduction, so I'm James Blair. I work at Posit as a senior product manager over cloud integrations, which people ask me all the time what that means. I don't really know, but mostly it means that I work with our third-party partners, Snowflake, Databricks, kind of the prominent ones, but then also AWS and Azure and GCP and those kinds of things, working to make sure that we build joint product experiences that meet the needs of commercial customers.

So today, our kind of walkthrough and demo is focused on, there's, you know, like AI is everywhere, so we're gonna really lean into that today. We've covered a lot of, I think, really exciting things throughout this conference about the work that we've been doing at Posit, work that I've been really excited about. And one of the things that I've been really interested in is sort of this intersection of commercial data platforms like Snowflake and Databricks and others, and all of the sort of AI work that we've been doing. The goal from our perspective has been how do we enable data scientists to be more efficient at their jobs and to be able to arrive more quickly at significant results that can drive business outcomes. And ideally, right, those business outcomes are eventually tied to significant outcomes in the lives of others.

And so what I wanna walk through is kind of what the experience looks like today if you're going to use tools like Positron Assistant and Databot with some of these third-party backends. What we think is going really well in this space, what we recognize isn't great right now, and then kind of give an idea of what we see coming as we continue to work on this sort of intersection that, again, is really interesting to me and I think provides potentially a lot of value.

Connecting to Snowflake with Ibis in Quarto

So the first thing we're gonna do, I'm in Positron, hopefully by this point you know what that is. If not, welcome, this is Positron. We're going to kind of approach this from a couple of different routes. The first thing I'm gonna do, I'm actually gonna flip over to this. So I've got a Quarto document here.

I will make a note really quickly here at the beginning. I am doing all of this off of Positron Desktop, but all of this also works out of Posit Workbench. Positron's available as an IDE there. I'm using the desktop version just because things in this space, particularly with Positron Assistant and Databot, are evolving so quickly. A lot of our internal workbench environments are struggling to keep up. And so to show kind of the latest state of the art of where things are at, I'm on the desktop today.

So I'm gonna come in. I've got this Quarto document pulled open. And the only thing that I'm gonna do is import this Ibis library. This is using Python; we'll use R in just a moment. I just want to kind of demonstrate that either one works here. So we're gonna bring in this Ibis library, and then we're gonna connect to Snowflake using Snowflake's Python SQL connector, which Ibis uses behind the scenes. So we're gonna make this connection to Snowflake.

There we go. Okay, so I'm gonna make this connection into Snowflake. I have this connection now. This is, like I've already signed into Snowflake earlier today, so it's using those credentials. If you're on Posit Workbench, just as a note, one of the nice things there is that you can automatically log in to your Snowflake account when you start a session in Workbench, and all of this just works there too. Then I'm gonna create a connection to this stages table that exists inside of Snowflake, and we'll take a look at what this is.

So, if you've ever spent more than like five minutes with me, you probably know this: I'm a huge cycling nerd. And so this is actually Tour de France data that's inside of our Snowflake instance. And this just represents stage results from 1903 up until 2019. So not the most recent editions of the Tour, but most editions of the Tour: information about the rider, the stage, the amount of time it took them to complete it, whether they won, what their position was, things of that nature.

Positron Assistant and remote data awareness

So now that I've done this, I've created a connection to Snowflake, and then I've created this reference to this table. The thing that I wanna point out before we jump into Positron Assistant is if I look at my environment right now, I only have two things that I've created, the connection to Snowflake and the stages table, but the stages table is not local data. I didn't pull all of this data into my local environment, it's still just in Snowflake, right? All I've done is create a reference to this object in Snowflake. And the reason that's significant is because in many cases, if I'm connecting to some sort of external data store, it might be Snowflake, it might be something else, it might be BigQuery, it might be Redshift, who knows, right? There's a growing catalog of external data sources I might wanna connect to.

Doesn't really matter, the point is when I'm connected to this data, the reason probably is something to do with the fact that that data is far larger than what I'd wanna work with locally. So I haven't brought anything locally, I've just created this reference, and now I'm gonna pop open Positron Assistant over here on the left-hand side. I'm gonna close this so our screen's not too crowded.
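In Ibis terms, the connection and reference described here are roughly `con = ibis.snowflake.connect(...)` followed by `stages = con.table("stages")`, and nothing is fetched until you call `.execute()`. To make the "reference, not a copy" idea concrete, here's a minimal, hedged sketch of the pushdown pattern, with stdlib `sqlite3` standing in for Snowflake; the `stages` columns and values below are made up for illustration:

```python
import sqlite3

# In-memory SQLite stands in for the remote warehouse (Snowflake, here).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stages (rider TEXT, year INTEGER, won INTEGER)")
con.executemany(
    "INSERT INTO stages VALUES (?, ?, ?)",
    [("Merckx", 1969, 1), ("Merckx", 1970, 1), ("Poulidor", 1969, 0)],
)

# Anti-pattern: SELECT * pulls every row into the local session.
all_rows = con.execute("SELECT * FROM stages").fetchall()

# Pushdown: the database does the aggregation; only one small
# result set crosses the wire.
wins = con.execute(
    "SELECT rider, SUM(won) AS stage_wins FROM stages "
    "GROUP BY rider ORDER BY stage_wins DESC"
).fetchall()
print(wins)  # largest winners first
```

With a real warehouse the same principle applies: the aggregation is compiled to SQL, executed remotely, and only the summary comes back.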

And then a couple things, again, we support Anthropic today via API key. I'm pointing to Claude Sonnet, and then I'm in this edit mode over here when I look at the different modes that I have available to me. We're gonna use edit for the time being today. And I can say, okay, let's update this document with an exploration of the stages data.

All right, so now Positron Assistant's gonna start kind of going through its thing. It's gonna say, look, I'll help you update this Quarto document based on what you have in your session. I can see that you've already got this stages table loaded. Looks like it's related to Tour de France data. And then it's gonna go through, and depending on our network speed and a couple other things, this should resolve pretty quickly. And we'll start to see changes being made to our Quarto document to help us explore this data.

Now, as this is happening, I wanna call back, if you participated in the keynote that Joe and Hadley gave yesterday, one of the things that Joe kind of described was this difference between the agentic tools that we've created. There's Positron Assistant, which we're looking at now, then we have Databot, which we'll look at in a moment. And the difference is, we look at Positron Assistant as a way to help you write code kind of generally. Right, and so here I've said, hey, I wanna explore this data. Help me kind of figure out what this might look like.

Databot, on the other hand, is very specifically tuned to exploratory data analysis, and we'll look at what that workflow looks like in just a moment, where I open up Databot and say, hey, let's just look at this together, and we'll kind of compare and contrast the differences between these two things.

All right, there we go. Okay, so if we look at what we've got, let me minimize this pane over here. So this is starting to go through, and I can look through the code. Again, this is all code first. We're not passing a bunch of data to the model that we've connected to, in this case Claude Sonnet from Anthropic, and saying, hey, help me explore this. We're just saying, look, I've got a connection to some data in Snowflake. I wanna explore that data, and it's now writing code to help me do that.

And the thing that's really interesting about this, and the thing that I wanna point out is, as we go through and look at this, Positron Assistant and Databot both do a really good job, once I've made this initial connection to the data, in writing code that adheres to whatever framework I've chosen. So in this case, we chose to connect to Snowflake using Ibis. If I come back up here, we can see that I made this Ibis connection to Snowflake, and then this stages data as a reference based on that connection. And so because of that fact, Positron Assistant recognizes, okay, I know what you're trying to do, or I at least know what connection type you've made, and I'll use the tools available through that connection type to continue to work with this data without bringing all that data into your local environment.

So if we scroll through and look at some of what this is doing, it's looking at the schema so that it can gain some information about what this data looks like, it's printing out the first few rows, and then it's running through some data quality checks, it does some analysis, so on and so forth, and we can run through each of these blocks if we wanted to, and start to kind of get a feel for what this data looks like. It's written quite a bit of code here, we can go through, if any of this errors out, we can certainly go back and have it address those errors.

But the key kind of discovery here, the key point, is that in all of this, I'm not bringing any of this data locally. Positron Assistant is keeping the data inside of Snowflake, and I'm using, in this case, Ibis, to execute queries against that data to explore it. So just like you wouldn't wanna select star everything into your local environment, Positron Assistant is aware of those best practices and writes code that adheres to them, and if we were to open up our Snowflake account, we'd be able to see queries being submitted against this data, rather than data moving across into our local environment.
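The exploration code the assistant writes follows the same shape: inspect the schema, peek at a few rows, run data-quality checks, all as queries executed by the database. A hedged sketch of those steps, again with stdlib `sqlite3` standing in for Snowflake and hypothetical column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stages (rider TEXT, stage INTEGER, time_s REAL)")
con.executemany(
    "INSERT INTO stages VALUES (?, ?, ?)",
    [("Anquetil", 1, 14520.0), ("Bahamontes", 1, 14580.0), ("Anquetil", 2, None)],
)

# 1. Schema inspection: ask the database, don't download the table.
schema = [
    (name, col_type)
    for _, name, col_type, *_ in con.execute("PRAGMA table_info(stages)")
]

# 2. Peek at the first few rows only, rather than fetching everything.
head = con.execute("SELECT * FROM stages LIMIT 2").fetchall()

# 3. Data-quality check: count missing stage times database-side.
(missing,) = con.execute(
    "SELECT COUNT(*) FROM stages WHERE time_s IS NULL"
).fetchone()
print(schema, len(head), missing)
```

Each step returns a small result, never the full table, which is exactly what you'd want against a multi-gigabyte warehouse table.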


Switching to R and Databricks with sparklyr

Okay, I'm gonna shift gears a little bit and come over here now, we're gonna close this out, there we go. Okay, so it's kind of the same exercise. We're gonna open this up and just run it really quickly. So now we're gonna switch into R, same process: I'm gonna use the sparklyr package, I'm connecting to a Databricks cluster that I have inside my Databricks account, and then we have the same data there, this stages data that exists inside of our Databricks account. We've created a reference to it using the in_catalog() function, with the catalog, schema, and table definition of where this data lives. So once this connects to that data, we can open up Databot, and Databot's friendly and tells me hi, and I can come in and say, okay, help me analyze the stages data.

Okay, so it's gonna say, look, I can see you have stages loaded in your R session, it looks like this is a Spark DataFrame, which is right. So again, the same principle applies: it recognizes the type of connection I've made, and now it's going to use the Spark API available through sparklyr to execute queries against this dataset. So I can come in here, it's gonna use a glimpse, and you can see here, one of the key tells that I'm working with remote data is that this comes back and says, I actually don't know how many rows are here, because I didn't ask the database for that; I just took a snapshot of the first few rows so that you could gain some visibility. So we can see that it's submitting queries over to this data connection inside of Databricks, and building up its understanding of what this data looks like.

And this is kind of much more of an agentic style workflow, where it's executing some code, it's looking at the results, it's executing some more code, it's looking at the results, and then we can kind of see its dialogue here. If we go down, it's like, okay, I can see this is a comprehensive cycling dataset, it seems like it's Tour de France data, with some rich historical information, there's about 250,000 rows, there's 106 different editions, there's unique stages, so here's kind of what I found, right, it's giving us some information about what it sees in the data.

And then let's go ahead and say, let's see, it gives us some suggestions here, let's look at the distribution of ages across the Tour de France from 1903 to 2019, and how that's evolved. So again, it's gonna go through, write queries, and what I'm gonna do now, actually, I think I've got this, I'm gonna open up our Databricks instance really quickly, just because I think it's kind of interesting to see if the Wi-Fi's happy, there we go.

All right. So if I come into Databricks and look at the cluster that I've connected to, let me zoom in. All right, let's go one more, there we go. Okay, so if I come to compute, I'm connected to this cluster, I can come over here, I can look at the driver logs, no, I don't need driver logs, what do I need? There we go. If I look at the SQL queries that are coming in and expand any of these, I can see that these are queries related to this stages data, right? The formatting here is not super great, there's a bunch of line breaks that are just included as plain text, but this is the SQL that's being submitted for execution by Databricks. So Databricks is running all the compute; it's this cluster on Databricks that's doing all the calculation for me, but it's Databot that's driving that calculation.

So I have kind of the best of both worlds. I'm able to keep all of my data securely in the database that I wanna keep it in. Again, this might be BigQuery, it might be Redshift, it might be DuckDB, it doesn't really matter. The backend is less important here than the principle: Databot and Positron Assistant are both very capable of driving analysis against remote data without being overly eager to pull all that data into your local environment.


If we come back here into our Positron environment now, we can see we've created some plots: evolution of rider ages, oldest rider, youngest rider, average age. So average age has been slightly increasing over time, and early editions of the Tour had some fairly old riders by athlete standards. I think the oldest rider in the dataset is 49 years old. And then you can see that the youngest riders' ages kind of crept up; young riders now are typically in their early 20s.
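The age analysis Databot ran here boils down to a grouped aggregation pushed to the cluster. A hedged sketch of that query shape, with made-up numbers and stdlib `sqlite3` standing in for Spark/Databricks:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stages (rider TEXT, year INTEGER, age INTEGER)")
con.executemany(
    "INSERT INTO stages VALUES (?, ?, ?)",
    [("A", 1903, 49), ("B", 1903, 21), ("C", 2019, 33), ("D", 2019, 22)],
)

# Per-edition age summary computed database-side; only a tiny
# (year, youngest, oldest, average) result comes back for plotting.
summary = con.execute(
    "SELECT year, MIN(age), MAX(age), AVG(age) FROM stages "
    "GROUP BY year ORDER BY year"
).fetchall()
for year, youngest, oldest, avg_age in summary:
    print(year, youngest, oldest, avg_age)
```

Against the real data this returns one row per Tour edition, which is what feeds the evolution-of-ages plot, regardless of how many stage results sit behind it.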

The zero-to-one connection problem

So this process of working through and exploring data in this sort of interactive AI-assisted way doesn't just work with local data, which is a lot of what we've seen demonstrated at the conference this year; it also works really well against these remote data sources. Now I want to point out a couple of things that I think are significant. One is this works really well when you're already connected to the data. So notice in both cases, with both Snowflake and Databricks, the first thing I did was create the connection to the data myself and then say, hey, help me explore this thing.

That's fine, and you may be well familiar with how to establish these connections, and in fact, if you're in an environment where you're often interacting with remote data, you're probably familiar with how to do this. But we've been doing some experiments. One of my favorite things, about a month ago, we just opened up sort of a blank Positron session, loaded Databot, we had nothing set up, right? I've got a blank session, I've loaded no packages, I've loaded no variables into my local environment, and I said, hey, I have some data in Databricks, I wanna explore it. And Databot tried its very hardest to help me. It was like, okay, let's connect to Databricks. How do you wanna connect? And I was like, all right, let's try ODBC. And then, of course, it couldn't find the ODBC driver, and it really was like, I don't know how to make this work.

And then finally, and this was a very surprising and good reminder of how creative these LLMs can be, it was like, you know what I'm gonna do? I found documentation about the Databricks API. I'm just gonna write an R wrapper around that API. So for like five minutes, Databot just went and wrote its own R package for accessing data in Databricks, and then gave me a bunch of methods for doing so, which is really cool, but not super practical, because that package is not portable. It's like its own little SDK that it wrote. So one of the things that we're working on immediately right now is solving this zero-to-one problem. If I'm opening one of these tools, whether it's Positron Assistant or Databot, and I wanna connect to data that's not local, I wanna connect to some data that's in some place, making it a lot easier to say, hey, help me explore this data that's here, the table's called this, and have the right context and the right understanding to be able to go and do that without reinventing the wheel, which is kind of what we saw previously.
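For a flavor of what Databot improvised, a minimal wrapper around Databricks' SQL statement REST endpoint might look like the sketch below. The endpoint path, payload shape, and `warehouse_id` parameter are assumptions from memory of the SQL Statement Execution API, not something the talk specifies, and the host, token, and warehouse values are hypothetical; you'd normally reach for an established client rather than rolling your own like this.

```python
import json
import urllib.request


def submit_statement(host: str, token: str, warehouse_id: str,
                     sql: str) -> urllib.request.Request:
    """Build (but don't send) a request to Databricks' SQL statement endpoint.

    The path below is an assumption based on the SQL Statement Execution API.
    """
    url = f"https://{host}/api/2.0/sql/statements"
    body = json.dumps({"statement": sql, "warehouse_id": warehouse_id}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values; actually sending would be urllib.request.urlopen(req).
req = submit_statement("example.cloud.databricks.com", "dapi-XXXX", "wh123",
                       "SELECT COUNT(*) FROM stages")
print(req.full_url)
```

This is roughly the kind of one-off SDK the agent generated: functional, but bespoke and non-portable, which is exactly the problem the zero-to-one work aims to solve.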

QueryChat and enterprise data permissions

The last thing I'll point out here, and then I'll see what questions we have, is kind of the flip side of this, and this has been brought up a couple of times here in the conference as well. In addition to these AI sort of assistant tools that exist from a developer standpoint, we've also created a lot of tools that enable you as developers and as data scientists to build on top of these AI systems. One of those tools that's been brought up a couple of times here in the conference has been QueryChat.

So let me just share this kind of example. So this is a dashboard written in Shiny that we created for this Tour de France data, where we have some visualizations and some information that we can see, and then on the left-hand side, we have this chat interface where we can ask questions about the data or create filters for the data. And this is really, like, the way to set this up, the way to configure this is remarkably easy in terms of there's not a lot of plumbing to do, and then you have this really nice, intuitive, natural language interface to your data.

But the other thing that's really great is if you build these tools on top of enterprise data sources like Snowflake, you can get this behavior where I can come in here and say, okay, show me the riders who ever won a stage. There's one really glaring inclusion here. This is really fuzzy, I'm sorry, but this is Lance Armstrong, notorious for doping his way to several Tour de France titles. So, officially, he's won zero, right? But in this data, it shows that he's won 24 stages. So what we did is we went into Snowflake and wrote a filter that says, hey, certain users actually cannot see records from doped cyclists. Well, if one of those users goes to this dashboard that we have running on Posit Connect, and they log in and say, hey, tell me how many stages Lance Armstrong won, they don't see 24; the answer they get back is zero. And it's not because the dashboard's different, it's because their permission on the data is different.
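The Snowflake-side filter described here is what Snowflake calls a row access policy. A hedged sketch of what that setup can look like; the policy, role, table, and column names (`hide_doped_riders`, `ANALYST_FULL`, `doping_cases`, `stages.rider`) are illustrative, not the actual configuration from the demo:

```sql
-- Rows for riders with doping cases are hidden from most roles.
CREATE OR REPLACE ROW ACCESS POLICY hide_doped_riders
  AS (rider VARCHAR) RETURNS BOOLEAN ->
    CURRENT_ROLE() = 'ANALYST_FULL'
    OR rider NOT IN (SELECT name FROM doping_cases);

ALTER TABLE stages ADD ROW ACCESS POLICY hide_doped_riders ON (rider);
```

Because the policy lives on the table itself, every query, whether it comes from a Shiny app on Posit Connect, Positron Assistant, or a plain SQL client, is filtered according to the querying user's role.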

So this sort of intersection of Posit tools and enterprise data sources gives you this really flexible way to build resources and tools using agentic helpers that also adhere to enterprise best data practices. People that shouldn't see certain data don't see it through your application, right? The same filters, the same restrictions apply on the underlying data when they come in here to Posit Connect.


Q&A

We have about one minute, and afterwards you can find me on the Positron Lounge channel in Discord. Yes. Okay, so I will keep an eye out for questions on the Positron Lounge channel inside of Discord, and I think we've got maybe a minute for questions here. So I'll wrap up and see if there's any questions with this group.

I'm gonna start with one question right now. George has asked: in terms of cost, generally speaking, is it more efficient to run data wrangling and analysis tasks in Positron or RStudio versus a cloud environment?

I think cost-wise, it's an interesting question because these cloud environments, depending on what it is, right? Snowflake and Databricks both charge by compute unit, and so you do incur some costs there. But it's less of a cost trade-off, I think, in my opinion, and it's more of a performance trade-off, and just like a security thing. In a lot of cases, security doesn't want data being shuffled around into local environments and then back into the cloud. They want data to stay within the security boundary of whatever warehouse you've selected to use, and in a lot of cases, that data's gonna be large enough that moving it locally and then trying to work with it there is gonna end up becoming a bottleneck. So just keeping it in the warehouse, I think, is typically best practice.

If it's small enough data that you want to move it locally, I think that's a conversation to have internally, maybe with security and others, but in many cases, this is a fairly simple toy example, but when you're dealing with terabytes of data that might be in these warehouses, then letting the warehouse do the heavy lifting makes a lot of sense.

Do we have a question from the room? Well, I have a question. Oh, go ahead. I was just wondering, given that the assistant is using Claude, will Claude have visibility into the Snowflake connection credentials? Do you see that as potentially being a security issue?

Yeah, so that's kind of part of what we're starting to think through right now: how do we expose metadata from these connections so that the suggestions and the code that's written are accurate in terms of column names and references and things like that, without giving free rein over credentials, right? So we're very conscious of that. We recognize that there's a high level of risk in exposing these tools to enterprise data platforms, so we're working to approach it the right way, where you can control what sort of context is passed into these tools, and you're not just passing a free pass into your enterprise data.

All right, we gotta get ready for the next session, but thank you so much. Thank you, everyone.