Transcript
This transcript was generated automatically and may contain errors.
Hi, everyone. Thank you for joining us.
Yeah, sounds good. Hi, everyone. I'm Sara, a senior developer advocate on the AI team here at Posit. Thanks, Libby and Isabella, for hosting this. Today we're going to walk through data analysis with AI assistants. If you've seen the AI work Posit does, you might know we have packages, like ellmer and chatlas, that let you build things that use LLMs, and then we have tools built with LLMs that help you do data analysis and data science tasks. I'm going to focus on the second kind: not the packages, but tools where an LLM is embedded in the tool and we're using it to do data analysis. I'll show you three or four tools, depending on how much we get through. I'd also like to talk about why you might want to use AI for data analysis or data science, and why you might not.
Introducing Positron and its AI tools
Okay. Can you all see that? This is the Positron website, because we're going to start by talking about how to install these things. The first couple of tools I'll cover are the two AI assistants within Positron. If you haven't used Positron, it's Posit's next-generation data science IDE; you can do both R and Python in it. I'll also show you a tool that you can use in RStudio, and then a more general thing at the end. But we're going to start with Positron. Before we talk about how to get started, I want to mention a couple of central ideas I want to get across.
The first is that if you haven't used these tools and don't know how to get started, I'm hopeful this can help you just start playing around with them. The second is that AI usage for data analysis or data science really isn't an all-or-nothing thing. You don't have to suddenly convert to doing absolutely everything with an AI assistant, never looking at your code again and never knowing what's going on with your data. You could do that; you probably don't want to. There's a wide variety of things you can do, and you can pick and choose which tasks you want to use AI for, and which tools are most useful for you.
And so if you are approaching this skeptically, because you don't want to lose control over your work or you're wary of the risks, I just want to emphasize that you can do a little bit of AI assistance, or use it for only some tasks. It's really not an all-or-nothing thing.
OK, so the two tools in Positron are Positron Assistant and DataBot, and both of them you have to enable. You want to set up Positron Assistant first and then DataBot. To enable Positron Assistant, there's a setting that you enable. Then you will add a language model provider. I know people often have questions about which models you can use, so I can go over this, or feel free to ask specific questions in a second. And then you're basically ready to go. Let me talk about DataBot before we go over to Positron, so I don't have to keep switching windows.
It's the same kind of thing for DataBot: there's a setting you need to enable, but you also need to install the DataBot extension. Hopefully we can provide this link somewhere so you can follow along; I just wanted to show you where this documentation is.
And then you're basically ready to go. Importantly, you do need access to some LLM, and if you're going to use DataBot, you specifically need access to a cloud model. That could be through an API key that you have or that your organization has provided, or through some kind of enterprise setup like AWS Bedrock access, which is how Posit does it and what you're going to see on my screen. But you can also use an API key if you have one available.
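As a rough sketch, the enablement step just flips feature flags in Positron's settings.json. The setting keys below are placeholders, not the real names; use the gear-icon links in the documentation, which jump straight to the correct setting for your build:

```jsonc
{
  // Placeholder keys for illustration only: follow the
  // documentation links to get the exact setting names.
  "positron.assistant.enable": true,
  "positron.databot.enable": true
}
```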
So Positron Assistant is a general coding assistant. I'm going to show you both of these tools and we can talk more about this, but at a high level, Positron Assistant is a general-purpose coding assistant: you might use it for debugging, or to write a bit of code for you, the sort of general way you'd use any tool when you want help with code. DataBot is much more specific: it is geared almost entirely towards exploratory data analysis, as you'll see in a bit.
Setting up Positron Assistant
I wanted to hop in and say: whenever you see Sara hovering over the links that have a little gear icon next to them, for example in step one, enable Positron Assistant, that's actually a link directly to the settings in Positron. When you click it, your browser will ask if you want to open Positron; say yes, and it will take you directly to the correct setting that you need to enable. So having Positron open first is a good idea, and using those links is going to be really helpful, especially if you've never modified settings.json in Positron before.
Yeah, that's a good point; I was just about to click on that. So I did click on it, and you didn't see Positron open just because I don't have it in the share window, but it has now opened this settings file. I already have it enabled, but you would then check this enable option to enable Positron Assistant. Okay. And we said the next step is to set up your model provider.
If you search for Positron Assistant, or for configure language model providers, this will come up. Then you choose your option; right now it's giving me the choice to either add an Anthropic API key or sign in with Amazon Bedrock. Posit hosts all its models through Bedrock, which is why I've signed in here, but you can also provide a key.
Simon, shouldn't the OpenAI-compatible endpoint option also be over here? It was here yesterday when I did this. I wonder if it depends on some sort of experimental setting; I can poke around and send a note.
It was there yesterday when I looked at this. I am on a daily build of Positron, so it may be something that changed between builds. But we're going to use Anthropic models anyway, so it doesn't matter here; if you have access to another model that you could provide a key for in that way, you can use Positron Assistant. DataBot, right now, we're saying should only use cloud models.
Exploring data with DataBot
So I said we're going to look at the hallucination cases data. There are a variety of ways we might open this conversation. I could just ask it to take a look at the hallucination data (sorry, it keeps thinking I've clicked into Zoom), and it'll look through the files and probably find it. But I also wanted to point out that you can ask it to load the file directly: if you type @, you get a file picker, and then I can select the exact file.
So now we're getting this little pop up that is showing us what code it wants to run and asking us if we can allow it. I'm going to say allow for session so that I don't have to keep clicking this.
And this is the core of what DataBot does: you ask it a question, it runs some code, it gives you a little summary, and then it gives you some suggestions. It didn't do much this time because all I asked it to do was load the data, but now it has suggestions of things we might want to look at. So let's just get a summary of the data set. You can ask questions that are not in the suggestions, but you can also just click on them.
Kieran mentioned how interesting it is that it defaulted to read.csv and not tidyverse read_csv. Same, interesting. If I were doing a code review, I would have said: maybe use read_csv.
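For context on that comment, the two readers differ in more than style. A quick sketch, assuming a hypothetical hallucination_cases.csv in the working directory:

```r
# Base R reader: returns a plain data.frame and, by default,
# rewrites non-syntactic column names (check.names = TRUE)
df_base <- read.csv("hallucination_cases.csv")

# readr (tidyverse) reader: returns a tibble, leaves column
# names as-is, is faster on large files, and prints the column
# types it guessed so you can spot parsing surprises
library(readr)
df_tidy <- read_csv("hallucination_cases.csv")
```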
Okay. So it has run some code to explore the data, and we're getting some output: a summary of the data that we can now take a look at, and then some suggestions at the end. A couple of things I want to point out. First, by default you can see all of this code. There is a way to hide it, but we generally want you to be able to see it easily, so it is possible to review if you want to. Second, we want DataBot to not give you too much output at any one time. As soon as it starts producing tons and tons of code, it becomes very hard for you as the human to keep track of what's going on, and for this kind of tool, we want the person to keep pace with DataBot, essentially.
So you can keep track of what's happening in your data. One primary reason for this is that if you're doing exploratory data analysis, the point is for you, the person, to have an understanding of the data. If only the LLM understands the data, that doesn't really help you much. If you're doing EDA, you want to know what is going on. And if it just runs a bunch of code and gives you tons of output that you're never going to sift through, you're not actually doing the sense-making process of understanding what the patterns in the data are, whether there are any problems, all of that.
So this is just text that was, technically, the output of code that it ran. If you scroll up, eventually we'll see more code-looking output and not just the output of cat(). All of this code is being run here.
One other thing to point out is that DataBot and Positron Assistant have access to your variables. I didn't really have anything loaded when I started up DataBot, but if I did, I could just say, in a new session, look at hallucination cases, and it wouldn't need to do anything; it would already have access to the data. This also means that when it runs code, I now have access to the results.
DataBot memory and the databot.md file
So that would be a good thing for the databot.md file. One way to create this file is by clicking the little elephant where it says memory. I don't think I have one already. There's nothing really in here that I want to save to the memory, but we could say something like: always use readr::read_csv to read in CSVs.
And this is going to create a databot.md file, which is a set of instructions for DataBot within this particular directory; it's a specific instantiation of an agents.md file, which you might have seen or used with other agents. You could put in things like: always use read_csv, always use the native pipe, but also specifics about your data. If there were things specific to this hallucination cases data that I definitely wanted DataBot to know about every time I looked at it, I could put that in the databot.md file.
The example I was going to go into later with this data set is that there are a couple of issues in the data that are impossible to know about unless you tell the model they're there. That's something you want any agent analyzing the data to always know, so you can put it in the databot.md file or the agents.md file, and it'll be loaded at the start of every conversation.
So here we could put that kind of data issue, or we might even just put general documentation for this data set. This one is relatively straightforward; in my playing around with DataBot, it does a pretty good job of understanding what's going on. But if the data were more complicated, or the column names were inscrutable, you could put a data dictionary, more information on what's going on, your goals for the analysis, other things like that in the file.
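To make that concrete, here's a sketch of what such a databot.md might contain. The specific notes are hypothetical, since we don't know this data set's real issues:

```markdown
<!-- databot.md: loaded at the start of every conversation -->

## Code style
- Always use readr::read_csv() to read CSVs, not read.csv().
- Use the native pipe |>.

## About the hallucination cases data (hypothetical notes)
- Data dictionary: explain any inscrutable column names here.
- Known issues: list problems the model can't discover on its own.
- Analysis goals: what you're trying to learn from this data.
```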
So the first thing is that you can look back at previous conversations. If you make a new conversation (I have a bunch open here), you can go back and look at previous ones that way. But you can also write the code to a report. We didn't do that much in this conversation, but it'll still put something in there, and I'll show you what happens when I run this. Let's just say summarize findings. This is a slash command, and you can also run it by itself: /report summarize findings.
So it made a Quarto report with our findings. This is one way to save the code. It probably put more information than we wanted in here, and it also wrote a bunch of prose, but it did include at least some of the code that we wrote. If we had done more manipulation or exploration, we could have said specifically: I want to include these pieces of the analysis, please put them in the QMD file. It did remember the name fix I asked it to make, and it put that in the setup chunk, which is nice.
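The report it generates is an ordinary Quarto document. Roughly, the shape is something like this; the title and chunk contents here are hypothetical, just to show the structure:

````markdown
---
title: "Hallucination cases: summary of findings"
format: html
---

```{r setup}
library(readr)
hallucination_cases <- read_csv("hallucination_cases.csv")
```

Prose summarizing the findings, followed by the code chunks
from the conversation that you asked it to include.
````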
And yeah, I feel like I'm rushing a little bit. Sorry, but there's other things that I want to show you.
Positron Assistant for debugging
One thing that Positron Assistant is really useful for is debugging. So the render fails, and we can just say: QMD didn't render. Please fix. I like the brevity.
So it correctly identified the issue. Like DataBot, it has access to all of the pieces of information it needs to be helpful: it can see your variables, your console, your files. And it fixed the problem. Positron Assistant has an interface where it shows you the suggested change and you can keep or undo it.
Again, one thing I want to point out is that DataBot specifically is geared towards you still having quite a bit of control. It could have just done a bunch of analysis behind the scenes and plopped it all in that Quarto report. But instead, how DataBot works is that you build up an understanding of the data alongside it: you're keeping track of what's happening, you're seeing the code, you're looking at the little summaries it gives you, and you're exploring the data together. Then you might put the results in a QMD file. This was a conscious choice; we want the human to be keeping up with DataBot and iterating together with it.
Part of the reason is so that you can assess whether or not the analysis is correct. But it's also because, as I said, if you do EDA and at the end you as the person have no idea what's going on in the data, you didn't really do EDA, and it probably wasn't that helpful for you. There are uses where you just need report output, but if your goal is to understand the data, you need a tool that allows you to do that.
Posit Assistant in RStudio
OK, let's quickly go to Posit Assistant, which is in RStudio. I'm going to share that. And this is brand new; I've never seen this before.
OK, so it did make it. I'm not going to claim this is the best app of all time, and if I'd given it more instructions, it probably would have done a better job. But it did just go off and successfully make a Shiny app that we can now continue to iterate on, which is pretty cool.
And I think this kind of task is a good example of one you're generally OK with an AI assistant doing more autonomously, versus something like exploratory data analysis, where you need to be involved frequently so that you can understand your data. If I know what I want to go into the Shiny app, I'm generally OK with Posit Assistant or something similar taking what I know about the data and putting it into a Shiny app relatively autonomously.
One reason is that it is easy to verify, for both computers and humans, whether the Shiny app is working. I can just look at it and see immediately if this is what I want, and then go back to the assistant if I want adjustments. The assistant itself can also tell if the code is broken and fix it, whether while running the app for the first time or when it produces an error and I ask for a fix. So it's generally easy to verify whether it's working, whereas with a data analysis it's often harder for the LLM, or for you as the person, to tell quickly whether it was done correctly, so you often need more human involvement.
When to use AI for data analysis
OK, the three things I wanted to cover: first, I want to talk a little bit more about what kinds of tasks you might want to do with AI and what kinds you might not. Then someone asked about skills, so we can show that. And then Simon or I will show the reviewer tool.
Just quickly: we talked a little bit about verification of task completion, or task correctness. Generally, if you are working on some kind of analysis, one thing you might want to ask yourself is: can you verify it? How will you know if it is correct? If you're doing something with AI and you have no clear answer for how you would tell whether it's correct, you might want to be wary of using it.
Say you're doing an analysis where you can't tell whether the code or the data is correct, and you're asking the LLM to tell you whether the data is correct. It's not going to know that unless it has access to tons and tons of context. So you might want to be wary, or just use something with more human involvement.
A couple of other things you might want to ask yourself: how much does this thing matter? If it is very important and hard to verify, you might not want to use AI, or you should be cautious and go over the results carefully. And finally: are current LLMs good at this type of task? Generally, they are very good at pure coding tasks for most purposes, but they are worse at some things. If you are doing some kind of task, you should have a general sense of whether LLMs are good at it or not. It's not a blanket statement that they are bad at all of data science or good at all of data science; they are good at different types of tasks, or different areas.
And so because of that, one important thing is providing guardrails for the assistant, or adding information through customization. We already saw one way to do that: creating the MD file for the agent with more information. You can use this to give DataBot or Posit Assistant way more information about your data than it would otherwise have, which can help prevent it from, say, hallucinating column names that don't exist, or using a column that you don't want it to use. You might have data where two columns do roughly the same thing and you only want it to use one. That kind of information could go in the databot.md file and could save you from an incorrect analysis.
Skills in DataBot
The other thing is that if it's possible to add determinism to this largely non-deterministic process, that can be really helpful, and one way to do that is through a skill. We only have five minutes, but you can add skills for DataBot and for Posit Assistant.
And hopefully there is a skill-creator skill: a skill that helps you create other skills. The example I was going to use the other data for, but let's see if it'll work here.
Create a skill that creates a specific type of table. We were going to make a table and then create a skill for that, but this should work. Notice that it has this little output label that says skill: create skill. So this is using the create-skill skill to make a skill.
So now it thinks it has enough information. If you were really doing this, you would probably want to be much clearer about what you wanted it to do, but I just wanted to start the process so I can show you where skills go. It's going to put this in dot Posit AI slash skills. After it works for a little bit, it will come up with a markdown file that specifies the skill: a bit of code and some information about it. In the future, when I want to create this type of table, I can just reference the skill by name, and it will always use that information to make the table. This is really helpful if you always make a certain type of formatted table; you can put that in a skill so you don't have to provide that context every time you talk to it.
Did that make sense? I know that was very brief. The important thing to remember is that there's a skill creator skill that can help you.
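As an illustration, a skill file of this kind is a small markdown document. The name, front matter, and code below are entirely hypothetical, just to show the general shape:

````markdown
---
name: summary-table
description: Build our standard formatted summary table
---

When asked for the standard summary table, use this pattern:

```r
# Hypothetical house style: kable with two-digit rounding
knitr::kable(summary_df, digits = 2)
```
````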
The reviewer package
So this last example I want to show you is a package called reviewer. What it does is help you review code from a tidy perspective, a good-R-code perspective. Simon made this package. One reason I wanted to show it is that it's cool and useful, but also it's a good example of something where you as the human have more control over how you use the LLM's suggestions than if you were using an agent that just goes off and works very autonomously.
So let's see how this works. This is a sample script that has some issues in it. If we call reviewer::review() and pass it the file path to the script, it should open, though you can't see it in my share window.
And it gives you suggestions in the side pane for things that would be bad for reproducibility, bad code style, tidy conventions, and you can either accept or reject them. This script happened to use setwd(), which it didn't like, and it mixed different types of pipes. It only looks at chunks of the script at a time, and then it gives you more suggestions that you can accept or reject.
So this is nice if you want to learn various conventions, because you kind of have to read each suggestion before accepting it, versus the LLM just going off and making a bunch of edits for you in a process you never saw.
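Based on the demo above, usage looks roughly like this. The call shape is my reading of what was shown, and since the package is an unreleased experiment, treat it as a sketch rather than a stable API:

```r
# Review an R script for reproducibility and style issues;
# suggestions open alongside the script to accept or reject.
reviewer::review("sample_script.R")
```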
This is very cool. So reviewer, the reviewer package: is this on CRAN and working in both Positron and RStudio?
The current status of this project: I only put a few hours into it and ended up feeling pretty unsure about how useful the idea was, so since then I haven't put much more time into it. But if people feel this is an interface they would find useful, I would definitely be into trying to fix a bunch of bugs, push it further, and send it to CRAN as well. So do let me know if this is an idea you're interested in.
Fantastic, I think this is really interesting. If you want to go look at reviewer and test it, please do. I think it would be really interesting if teachers could put their preferences inside of it, so students could get a pseudo-review replicating feedback from their professor. That would be super cool. All right, we have reached the top of the hour, and there's lots of interest about reviewer in the chat. Thank you so much, Sara and Simon, for coming and hanging out with us.
All right, everybody. Goodbye. We'll see you next week. Thanks.
