Transcript
This transcript was generated automatically and may contain errors.
Hi, everyone. Thank you for joining us.
Yeah, sounds good. Hi, everyone. I'm Sara, a senior developer advocate on the AI team here at Posit. Thanks, Libby and Isabella, for hosting this. Today we're going to walk through data analysis with AI assistants. If you've seen the AI work Posit does, you might know we have packages, like ellmer and chatlas, that let you build things that use LLMs, and then we have tools built with LLMs that help you do data analysis and data science tasks. I'm going to focus on the second kind: not the packages, but tools where an LLM is embedded in the tool and we're using it to do data analysis. I'll show you three or four tools, depending on how much we get through. I'd also like to talk about why you might want to use AI for data analysis or data science, and why you might not.
Introducing Positron and its AI tools
Okay. Can you all see that? This is the Positron website, because we're going to start by talking about how to install these things. The first couple of tools I'll cover are the two AI assistants within Positron. If you haven't used Positron, it's Posit's next-generation data science IDE; you can do both R and Python in it. I'll also show you a tool that you can use in RStudio, and then a more general thing at the end. But we're going to start with Positron. Before we talk about how to get started, I want to mention a couple of central ideas I want to get across.
The first is that if you haven't used these tools and don't know how to get started, I'm hopeful this can help you just start playing around with them. The second is that AI usage for data analysis or data science really isn't an all-or-nothing thing. You don't have to suddenly convert to doing absolutely everything with an AI assistant, never looking at your code again and never knowing what's going on with your data. You could do that; you probably don't want to. There's a wide variety of things you can do, and you can pick and choose which tasks you want to use AI for, and which tools are most useful for you.
And so if you are approaching this skeptically, because you don't want to lose control over your work or you're wary of the risks, I just want to emphasize that you can do a little bit of AI assistance, or use it for only some tasks. It's really not an all-or-nothing thing.
OK, so the two tools in Positron are Positron Assistant and DataBot, and both of them you have to enable. You want to set up Positron Assistant first and then DataBot. To enable Positron Assistant, there's a setting that you enable. Then you will add a language model provider. I know people often have questions about which models you can use, so I can go over this, or feel free to ask specific questions in a second. And then you're basically ready to go. Let me talk about DataBot before we go over to Positron, so I don't have to keep switching windows.
It's the same kind of thing for DataBot: there's a setting you need to enable, but you also need to install the DataBot extension. Hopefully we can provide this link somewhere so you can follow along; I just wanted to show you where this documentation is.
And then you're basically ready to go. Importantly, you do need access to some LLM, and if you're going to use DataBot, you specifically need access to a cloud model. That could be through an API key that you have or that your organization has provided, or through some kind of enterprise setup like AWS Bedrock access, which is how Posit does it and what you're going to see on my screen. But you can also use an API key if you have one available.
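As a rough sketch, the enablement step just flips feature flags in Positron's settings.json. The setting keys below are placeholders, not the real names; use the gear-icon links in the documentation, which jump straight to the correct setting for your build:

```jsonc
{
  // Placeholder keys for illustration only: follow the
  // documentation links to get the exact setting names.
  "positron.assistant.enable": true,
  "positron.databot.enable": true
}
```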
So Positron Assistant is a general coding assistant. I'm going to show you both of these tools and we can talk more about this, but at a high level, Positron Assistant is a general-purpose coding assistant: you might use it for debugging, or to write a bit of code for you, the sort of general way you'd use any tool when you want help with code. DataBot is much more specific: it is geared almost entirely towards exploratory data analysis, as you'll see in a bit.
Setting up Positron Assistant
I wanted to hop in and say: whenever you see Sara hovering over the links that have a little gear icon next to them, for example in step one, enable Positron Assistant, that's actually a link directly to the settings in Positron. When you click it, your browser will ask if you want to open Positron; say yes, and it will take you directly to the correct setting that you need to enable. So having Positron open first is a good idea, and using those links is going to be really helpful, especially if you've never modified settings.json in Positron before.
Yeah, that's a good point; I was just about to click on that. So I did click on it, and you didn't see Positron open just because I don't have it in the share window, but it has now opened this settings file. I already have it enabled, but you would then check this enable option to enable Positron Assistant. Okay. And we said the next step is to set up your model provider.
If you search for Positron Assistant, or for configure language model providers, this will come up. Then you choose your option; right now it's giving me the choice to either add an Anthropic API key or sign in with Amazon Bedrock. Posit hosts all its models through Bedrock, which is why I've signed in here, but you can also provide a key.
Simon, shouldn't the OpenAI-compatible endpoint option also be over here? It was here yesterday when I did this. I wonder if it depends on some sort of experimental setting; I can poke around and send a note.
It was there yesterday when I looked at this. I am on a daily build of Positron, so it may be something that changed between builds. But we're going to use Anthropic models anyway, so it doesn't matter here; if you have access to another model that you could provide a key for in that way, you can use Positron Assistant. DataBot, right now, we're saying should only use cloud models.
Exploring data with DataBot
So I said we're going to look at the hallucination cases data. There are a variety of ways we might open this conversation. I could just ask it to take a look at the hallucination data (sorry, it keeps thinking I've clicked into Zoom), and it'll look through the files and probably find it. But I also wanted to point out that you can ask it to load the file directly: if you type @, you get a file picker, and then I can select the exact file.
So now we're getting this little pop up that is showing us what code it wants to run and asking us if we can allow it. I'm going to say allow for session so that I don't have to keep clicking this.
And this is the core of what DataBot does: you ask it a question, it runs some code, it gives you a little summary, and then it gives you some suggestions. It didn't do much this time because all I asked it to do was load the data, but now it has suggestions of things we might want to look at. So let's just get a summary of the data set. You can ask questions that are not in the suggestions, but you can also just click on them.
Kieran mentioned how interesting it is that it defaulted to read.csv and not tidyverse read_csv. Same, interesting. If I were doing a code review, I would have said: maybe use read_csv.
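For context on that comment, the two readers differ in more than style. A quick sketch, assuming a hypothetical hallucination_cases.csv in the working directory:

```r
# Base R reader: returns a plain data.frame and, by default,
# rewrites non-syntactic column names (check.names = TRUE)
df_base <- read.csv("hallucination_cases.csv")

# readr (tidyverse) reader: returns a tibble, leaves column
# names as-is, is faster on large files, and prints the column
# types it guessed so you can spot parsing surprises
library(readr)
df_tidy <- read_csv("hallucination_cases.csv")
```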
Okay. So it has run some code to explore the data, and we're getting some output: a summary of the data that we can now take a look at, and then some suggestions at the end. A couple of things I want to point out. First, by default you can see all of this code. There is a way to hide it, but we generally want you to be able to see it easily, so it is possible to review if you want to. Second, we want DataBot to not give you too much output at any one time. As soon as it starts producing tons and tons of code, it becomes very hard for you as the human to keep track of what's going on, and for this kind of tool, we want the person to keep pace with DataBot, essentially.
So you can keep track of what's happening in your data. One primary reason for this is that if you're doing exploratory data analysis, the point is for you, the person, to have an understanding of the data. If only the LLM understands the data, that doesn't really help you much. If you're doing EDA, you want to know what is going on. And if it just runs a bunch of code and gives you tons of output that you're never going to sift through, you're not actually doing the sense-making process of understanding what the patterns in the data are, whether there are any problems, all of that.
So this is just text that was, technically, the output of code that it ran. If you scroll up, eventually we'll see more code-looking output and not just the output of cat(). All of this code is being run here.
One other thing to point out is that DataBot and Positron Assistant have access to your variables. I didn't really have anything loaded when I started up DataBot, but if I did, I could just say, in a new session, look at hallucination cases, and it wouldn't need to do anything; it would already have access to the data. This also means that when it runs code, I now have access to the results.
DataBot memory and the databot.md file
So that would be a good thing for the databot.md file. One way to create this file is by clicking the little elephant where it says memory. I don't think I have one already. There's nothing really in here that I want to save to the memory, but we could say something like: always use readr::read_csv to read in CSVs.
And this is going to create a databot.md file, which is a set of instructions for DataBot within this particular directory; it's a specific instantiation of an agents.md file, which you might have seen or used with other agents. You could put in things like: always use read_csv, always use the native pipe, but also specifics about your data. If there were things specific to this hallucination cases data that I definitely wanted DataBot to know about every time I looked at it, I could put that in the databot.md file.
The example I was going to go into later with this data set is that there are a couple of issues in the data that are impossible to know about unless you tell the model they're there. That's something you want any agent analyzing the data to always know, so you can put it in the databot.md file or the agents.md file, and it'll be loaded at the start of every conversation.
So here we could put that kind of data issue, or we might even just put general documentation for this data set. This one is relatively straightforward; in my playing around with DataBot, it does a pretty good job of understanding what's going on. But if the data were more complicated, or the column names were inscrutable, you could put a data dictionary, more information on what's going on, your goals for the analysis, other things like that in the file.
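To make that concrete, here's a sketch of what such a databot.md might contain. The specific notes are hypothetical, since we don't know this data set's real issues:

```markdown
<!-- databot.md: loaded at the start of every conversation -->

## Code style
- Always use readr::read_csv() to read CSVs, not read.csv().
- Use the native pipe |>.

## About the hallucination cases data (hypothetical notes)
- Data dictionary: explain any inscrutable column names here.
- Known issues: list problems the model can't discover on its own.
- Analysis goals: what you're trying to learn from this data.
```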
So the first thing is that you can look back at previous conversations. If you make a new conversation (I have a bunch open here), you can go back and look at previous ones that way. But you can also write the code to a report. We didn't do that much in this conversation, but it'll still put something in there, and I'll show you what happens when I run this. Let's just say summarize findings. This is a slash command, and you can also run it by itself: /report summarize findings.
So it made a Quarto report with our findings. This is one way to save the code. It probably put more information than we wanted in here, and it also wrote a bunch of prose, but it did include at least some of the code that we wrote. If we had done more manipulation or exploration, we could have said specifically: I want to include these pieces of the analysis, please put them in the QMD file. It did remember the name fix I asked it to make, and it put that in the setup chunk, which is nice.
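The report it generates is an ordinary Quarto document. Roughly, the shape is something like this; the title and chunk contents here are hypothetical, just to show the structure:

````markdown
---
title: "Hallucination cases: summary of findings"
format: html
---

```{r setup}
library(readr)
hallucination_cases <- read_csv("hallucination_cases.csv")
```

Prose summarizing the findings, followed by the code chunks
from the conversation that you asked it to include.
````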
And yeah, I feel like I'm rushing a little bit. Sorry, but there's other things that I want to show you.
Positron Assistant for debugging
One thing that Positron Assistant is really useful for is debugging. So the render fails, and we can just say: QMD didn't render. Please fix. I like the brevity.
So it correctly identified the issue. Like DataBot, it has access to all of the pieces of information it needs to be helpful: it can see your variables, your console, your files. And it fixed the problem. Positron Assistant has an interface where it shows you the suggested change and you can keep or undo it.
Again, one thing I want to point out is that DataBot specifically is geared towards you still having quite a bit of control. It could have just done a bunch of analysis behind the scenes and plopped it all in that Quarto report. But instead, how DataBot works is that you build up an understanding of the data alongside it: you're keeping track of what's happening, you're seeing the code, you're looking at the little summaries it gives you, and you're exploring the data together. Then you might put the results in a QMD file. This was a conscious choice; we want the human to be keeping up with DataBot and iterating together with it.
Part of the reason is so that you can assess whether or not the analysis is correct. But it's also because, as I said, if you do EDA and at the end you as the person have no idea what's going on in the data, you didn't really do EDA, and it probably wasn't that helpful for you. There are uses where you just need report output, but if your goal is to understand the data, you need a tool that allows you to do that.
Posit Assistant in RStudio
OK, let's quickly go to Posit Assistant, which is in RStudio. I'm going to share that. And this is brand new; I've never seen this before.
OK, so it did make it. I'm not going to claim this is the best app of all time, and if I'd given it more instructions, it probably would have done a better job. But it did just go off and successfully make a Shiny app that we can now continue to iterate on, which is pretty cool.
And I think this kind of task is a good example of one you're generally OK with an AI assistant doing more autonomously, versus something like exploratory data analysis, where you need to be involved frequently so that you can understand your data. If I know what I want to go into the Shiny app, I'm generally OK with Posit Assistant or something similar taking what I know about the data and putting it into a Shiny app relatively autonomously.
One reason is that it is easy to verify, for both computers and humans, whether the Shiny app is working. I can just look at it and see immediately if this is what I want, and then go back to the assistant if I want adjustments. The assistant itself can also tell if the code is broken and fix it, whether while running the app for the first time or when it produces an error and I ask for a fix. So it's generally easy to verify whether it's working, whereas with a data analysis it's often harder for the LLM, or for you as the person, to tell quickly whether it was done correctly, so you often need more human involvement.
When to use AI for data analysis
OK, the three things I wanted to cover: first, I want to talk a little bit more about what kinds of tasks you might want to do with AI and what kinds you might not. Then someone asked about skills, so we can show that. And then Simon or I will show the reviewer tool.
Just quickly: we talked a little bit about verification of task completion, or task correctness. Generally, if you are working on some kind of analysis, one thing you might want to ask yourself is: can you verify it? How will you know if it is correct? If you're doing something with AI and you have no clear answer for how you would tell whether it's correct, you might want to be wary of using it.
Say you're doing an analysis where you can't tell whether the code or the data is correct, and you're asking the LLM to tell you whether the data is correct. It's not going to know that unless it has access to tons and tons of context. So you might want to be wary, or just use something with more human involvement.
A couple of other things you might want to ask yourself: how much does this thing matter? If it is very important and hard to verify, you might not want to use AI, or you should be cautious and go over the results carefully. And finally: are current LLMs good at this type of task? Generally, they are very good at pure coding tasks for most purposes, but they are worse at some things. If you are doing some kind of task, you should have a general sense of whether LLMs are good at it or not. It's not a blanket statement that they are bad at all of data science or good at all of data science; they are good at different types of tasks, or different areas.
And so because of that, one important thing is providing guardrails for the assistant, or adding information through customization. We already saw one way to do that: creating the MD file for the agent with more information. You can use this to give DataBot or Posit Assistant way more information about your data than it would otherwise have, which can help prevent it from, say, hallucinating column names that don't exist, or using a column that you don't want it to use. You might have data where two columns do roughly the same thing and you only want it to use one. That kind of information could go in the databot.md file and could save you from an incorrect analysis.
Skills in DataBot
The other thing is that if it's possible to add determinism to this largely non-deterministic process, that can be really helpful, and one way to do that is through a skill. We only have five minutes, but you can add skills for DataBot and for Posit Assistant.
And hopefully there is a skill-creator skill: a skill that helps you create other skills. The example I was going to use the other data for, but let's see if it'll work here.
Create a skill that creates a specific type of table. We were going to make a table and then create a skill for that, but this should work. Notice that it has this little output label that says skill: create skill. So this is using the create-skill skill to make a skill.
So now it thinks it has enough information. If you were really doing this, you would probably want to be much clearer about what you wanted it to do, but I just wanted to start the process so I can show you where skills go. It's going to put this in dot Posit AI slash skills. After it works for a little bit, it will come up with a markdown file that specifies the skill: a bit of code and some information about it. In the future, when I want to create this type of table, I can just reference the skill by name, and it will always use that information to make the table. This is really helpful if you always make a certain type of formatted table; you can put that in a skill so you don't have to provide that context every time you talk to it.
Did that make sense? I know that was very brief. The important thing to remember is that there's a skill creator skill that can help you.
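As an illustration, a skill file of this kind is a small markdown document. The name, front matter, and code below are entirely hypothetical, just to show the general shape:

````markdown
---
name: summary-table
description: Build our standard formatted summary table
---

When asked for the standard summary table, use this pattern:

```r
# Hypothetical house style: kable with two-digit rounding
knitr::kable(summary_df, digits = 2)
```
````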
The reviewer package
So this last example I want to show you is a package called reviewer. What it does is help you review code from a tidy perspective, a good-R-code perspective. Simon made this package. One reason I wanted to show it is that it's cool and useful, but also it's a good example of something where you as the human have more control over how you use the LLM's suggestions than if you were using an agent that just goes off and works very autonomously.
So let's see how this works. This is a sample script that has some issues in it. If we call reviewer::review() and pass it the file path to the script, it should open, though you can't see it in my share window.
And it gives you suggestions in the side pane for things that would be bad for reproducibility, bad code style, tidy conventions, and you can either accept or reject them. This script happened to use setwd(), which it didn't like, and it mixed different types of pipes. It only looks at chunks of the script at a time, and then it gives you more suggestions that you can accept or reject.
So this is nice if you want to learn various conventions, because you kind of have to read each suggestion before accepting it, versus the LLM just going off and making a bunch of edits for you in a process you never saw.
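Based on the demo above, usage looks roughly like this. The call shape is my reading of what was shown, and since the package is an unreleased experiment, treat it as a sketch rather than a stable API:

```r
# Review an R script for reproducibility and style issues;
# suggestions open alongside the script to accept or reject.
reviewer::review("sample_script.R")
```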
This is very cool. So reviewer, the reviewer package: is this on CRAN and working in both Positron and RStudio?
The current status of this project: I only put a few hours into it and ended up feeling pretty unsure about how useful the idea was, so since then I haven't put much more time into it. But if people feel this is an interface they would find useful, I would definitely be into trying to fix a bunch of bugs, push it further, and send it to CRAN as well. So do let me know if this is an idea you're interested in.
Fantastic, I think this is really interesting. If you want to go look at reviewer and test it, please do. I think it would be really interesting if teachers could put their preferences inside of it, so students could get a pseudo-review replicating feedback from their professor. That would be super cool. All right, we have reached the top of the hour, and there's lots of interest about reviewer in the chat. Thank you so much, Sara and Simon, for coming and hanging out with us.
All right, everybody. Goodbye. We'll see you next week. Thanks.
