Resources

Keeping LLMs in Their Lane: Focused AI for Data Science and Research

From R+AI 2025, hosted by R Consortium.

Keynote

LLMs are powerful, flexible, easy to use… and often wrong. This is a dangerous combination, especially for data analysis and scientific research, where correctness and reproducibility are core requirements. Fortunately, it turns out that by carefully applying LLMs to narrower use cases, we can turn them into surprisingly reliable assistants that accelerate and enhance, rather than undermine, scientific work. This is not just theory—I’ll showcase working examples of seamlessly integrating LLMs into analytic workflows, helping data scientists build interactive, intelligent applications without needing to be web developers. You’ll see firsthand how keeping LLMs focused lets us leverage their “intelligence” in a way that’s practical, rigorous, and reproducible.

Bio

Joe Cheng is the CTO and first employee at Posit, PBC (formerly known as RStudio), where he helped create the RStudio IDE, the Shiny web framework, and the Databot agent for exploratory data analysis.

R Consortium Resources

- Main R Consortium site: https://www.r-consortium.org/
- R+AI website: https://rconsortium.github.io/RplusAI_website
- R Consortium webinars: https://r-consortium.org/webinars/webinars.html
- Blog: https://r-consortium.org/blog/
- LinkedIn: https://www.linkedin.com/company/r-consortium/

Nov 15, 2025
46 min


Transcript

This transcript was generated automatically and may contain errors.

Hi, my name is Joe Cheng. I am the CTO of Posit, and I'm so pleased to be here speaking with you at R Plus AI.

Today, I'm gonna talk about how we can take LLMs and really focus them for working with data science and research in a way that is useful and responsible.

And when I'm talking about LLMs today, I'm really not talking about ChatGPT, and I'm not talking about Copilot or any kind of coding agent. These tools are very useful, but I think we're all sort of accustomed to them. We're used to their trade-offs. And I think that one of the most interesting ways we can think of LLMs today is how to harness them as custom agents, how we take this sort of underlying ability that LLMs have and use them to our own ends in our own custom workflows.

And to that end, we've really tried to make it as easy as possible for anyone with a background in R to be able to harness these LLMs for their own purposes. We have this package called ellmer that I think is a really beautiful and elegant package, and it is one of my favorite packages that we've released in recent years. It gives you the ability to so easily tap into the power that LLMs have: not just simple chat, as you can see here, but the ability to do tool calling, to create agentic workflows, and to create user interfaces on top of LLMs.

Now, I'm not gonna get too far into ellmer today, but I'd like you to take my word for it that if you are here and you know how to write R code, you are ready to harness LLMs to create advanced agents. It really is quite easy to get started with these things.

The prime directive: correctness, transparency, reproducibility

So if we all have the ability to do this, that doesn't answer the question of whether we should, whether it's a good idea. And this is a question that is very near and dear to my heart and to the heart of Posit, and to talk about why, I wanna talk a little bit about where Posit comes from. So we are a public benefit corp, and if you are not familiar with that concept, you can sort of think of it as being a compromise between a for-profit company and a nonprofit organization. So we have all of the flexibility of a for-profit company, but we are motivated by a mission.

And as a PBC, we have to state out loud what that mission is. And for us, this is sort of an abridged version of the beginning of our official mission. Our mission is to create open source software for data science, scientific research, and technical communication.

And I really wanna focus on this data science and scientific research. A lot of the use cases that are really important to us are things like people doing academic and scientific research, people who are really using this stuff for high stakes kinds of scenarios, people who are analyzing healthcare data, or doing drug development, deciding public policy, environmental studies, epidemiology, things that it's really important that your answers are correct or people could really get hurt.

And to kind of hammer this home, I wanna point out this passage from a book called Software for Data Analysis by John Chambers, who is the creator of S, the precursor of R. And Chambers makes this point. He says, those who receive the results of modern data analysis have limited opportunity to verify the results by direct observation. Users of the analysis have no option but to trust the analysis and by extension, the software that produced it.

Both the data analysts, that's probably most of you, and the software provider, that's us and people who write packages for R, therefore have a strong responsibility to produce a result that is trustworthy and if possible, one that can be shown to be trustworthy. This obligation I label the prime directive.


So that's how seriously John Chambers takes it, and we've adopted this prime directive as our own from the earliest days of the company. One of our software engineers on the tidymodels team really hammered this home by saying, I'm aware that if I make a mistake, bad things happen, death and other things.

So we have this principle in mind, and for us, it really is more than a best practice. It's a moral and ethical obligation to adhere to this prime directive. What does that mean concretely? How does that translate into real concerns as we work on data science software?

So number one, correctness is obviously paramount at all times. We wanna do our best to write software without bugs, especially in the code paths that produce answers. Secondly, transparency. We want the methods of our analysis to be inspectable, and having open source code goes a long way towards those ends. And third, reproducibility: the ability for the analysis to be repeated on the same data in the future, hopefully producing the same results. These three principles of correctness, transparency, and reproducibility are first and foremost in our minds any time we are working on a new tool for helping with data analysis.

Why LLMs are a poor fit — and what that means

And when we hold these three principles against LLMs, this is not an obvious fit. Like it's actually quite an obvious misfit because when it comes to correctness, LLMs are infamous for giving convincing but wrong answers. They're getting better all the time, but this phenomenon of very convincingly telling you the wrong thing is certainly still very prevalent. Secondly, transparency. I mean, LLMs are the ultimate black box. We don't really understand what they do. We don't really understand how they do it. And in fact, the more you understand about them, the less what they are already capable of seems possible.

And then finally, they are kind of inherently not reproducible. They're inherently non-deterministic just by virtue of the way that they work. Some of the model providers will give you the ability to set a random seed, and even they will warn you that this is not fully reproducible because the answers will still vary based on the underlying hardware or their software configuration. So we inherently do not have the ability to reproduce conversations or actions that LLMs take.

And we've all seen this in action. Like, we've all asked LLMs things like, draw an intricate piece of ASCII art, and it says, here's an intricate piece of ASCII art of a wolf, and what it has actually drawn is a penguin. And it's very confidently saying, I've done this for you. Why does it do that? We don't really know.

And I wanna make sure you understand that my point here is not that LLMs are bad or that they can't do things. They can do a lot of things. And in fact, the shape of the things that they can and can't do is quite surprising. I think you could intuitively imagine that LLMs work like this, that there is a spectrum of tasks that they could possibly do that range from very easy, according to some criteria or intuition, to very hard. And you would think that the easy tasks they excel at, and as these tasks get harder, they get worse and worse, which is what you see here, kind of a smooth, relatively smooth capability curve.

And that is not the world that we live in. Today's LLMs do not reflect this kind of behavior. In fact, it's more like this, where there are very easy tasks that they excel at and very easy tasks that they are terrible at and very difficult tasks that they excel at and very difficult tasks that they're terrible at. And it is very, very difficult to intuit without trying to know what category or whether this will be inside or outside of their capability curve when you're thinking of a new task.

Just to take two examples, they're quite good at coding. And I think you would think, as a human, that coding is a pretty hard thing to do. There's a lot of logic and reasoning. There's a lot of nuance. There's abstractions. There's a lot of context that you need to hold in your head. There's a lot of intent that needs to be inferred from reading code. And yet, these models are incredibly good at it. But at the same time, the very simplest data tasks, they are really surprisingly bad at. And let me just give you a quick demonstration of that.

So I tried to think of, what is the very easiest data operation that you could do? And the one I thought of was length. Given an array of values, how long is the array? How many elements are in the array? So to do this in R, we create an array of length n. These are random floating point values between zero and one. And then we ask, in this case, GPT-4.1, we ask it, how long is this array? And then we pass it those random numbers encoded as a JSON array.

And if you pass it 10 values, to its credit, it will get it correct. At 100, it gets it correct. And even at 1,000, it gives the correct answer. But then at 10,000, it says 1,000. I mean, it's not even close. And even these are cherry-picked round numbers, just powers of 10. When I gave it 103 elements, it said there were 100. So this is a pretty disappointing showing: even such a fundamental operation as length, these models are not reliable for.
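For the curious, the experiment above is only a few lines with the ellmer package. This is a sketch, not the exact code from the talk; it assumes an OpenAI API key is available in the environment, and the model's answer is non-deterministic, so results will vary:

```r
library(ellmer)    # Posit's R package for talking to LLMs
library(jsonlite)

# Sketch of the array-length experiment described above.
# Assumes OPENAI_API_KEY is set in the environment.
n <- 103
x <- runif(n)  # n random values between 0 and 1

chat <- chat_openai(model = "gpt-4.1")
chat$chat(paste0(
  "How long is this array? Answer with only the number.\n",
  toJSON(x)
))
# In the talk, the model claimed 100 elements for a 103-element array.
```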

In fairness, when I made this slide, this was definitely true, and GPT-4.1 was state-of-the-art. With GPT-5, it takes a very long time, but I have mostly seen it give correct answers, and the same with Claude Opus 4.1. So the state-of-the-art is changing, but the fact remains that for an operation that feels like it should be simple and fundamental, we cannot trust these LLMs. And that does not bode well for all the other fundamental data science operations you'd wanna do.

Three approaches to responsible LLM use

So given these inherent challenges, these challenges around correctness, transparency, reproducibility, are there responsible ways to use LLMs in the service of data science? Are there responsible ways that we can leverage these kind of magical new tools?

So what I am gonna show you now are three different approaches that we have taken. And I wanna emphasize that these are not principles that I think everybody should be abiding by. These are just experiments that we ran, and approaches that we observed have worked more or less well for the different scenarios that we tried. This is all evolving very quickly, and every model behaves differently. So we are approaching this with a spirit of experimentation and humility and openness for each generation of LLMs to bring us a different set of opportunities. And I highly encourage you to think of these approaches in that spirit.

So the first approach I wanna talk about is to constrain these LLMs, okay? And what that looks like is we identify some things that these LLMs can do, at least for a particular model that you're targeting. And we decide, based on our empirical experience: okay, this LLM is very, very good at this one task or this one skill. It is firmly inside the jagged frontier of its capability. Then, in order to make this one skill more useful, we augment the LLM with other kinds of software we write, deterministic software that leverages that LLM skill and makes it more useful. And then we instruct the LLM: just stick to the task we are giving you, stick to doing this thing that we've decided you're very good at. And resist the urge to add features that would pull the LLM in directions that it's still okay at, but maybe not quite as solidly good at.

So the example I have in mind for this is to use an LLM to drive SQL, to drive a dashboard. Let me show you what that looks like. So this is a Shiny dashboard. This is not an LLM-powered dashboard. This is just a regular Shiny dashboard. This is actually one of the templates for Shiny for Python.

And just like a typical Shiny application, you can see that there are some reactive outputs, and then there are some inputs on the side. You know, if this was a real dashboard, there would probably be a lot more inputs than this. But as you change these values, you can see that the values on the right update, okay?

So we asked ourselves, could we leverage these LLMs' ability to write SQL very well, and use that to enhance our dashboard with more capabilities? So here we've replaced all the filters with a sidebar. This is now written in R with ellmer and shinychat. And we can ask filtering questions like, show only male smokers who had dinner on Saturday. This request goes out to an LLM, and the LLM comes back with a SELECT statement. That SELECT statement is shown to the user in two places, and then it is applied to a data source by Shiny using reactive code. And everything that you see down here, this code has not changed. It is the same reactive Shiny code; it's just keyed off of a different reactive data frame, one that is driven by this SQL. This technique works with any SQL-speaking database that you can get to from R.

And what this means is that we do not have to worry about these values being hallucinated. We don't have to worry about this plot or this table of numbers being hallucinated. All we have to worry about is whether this SQL is correct. And if you are someone who speaks SQL fluently, then great, that's a very easy task for you. But even if this is something that you're handing off to, say, your boss who doesn't speak SQL as well, there's a really good chance that, by reading the SQL, they can intuit whether it's right or wrong. And at worst, they can take that SQL to you and ask you to verify it.

But importantly, modern LLMs are quite good at writing SQL, so it's very infrequent that they make mistakes here. And we can use this capability to achieve filtering that we could not do with traditional Shiny controls. Selecting these three criteria is a very easy thing to reproduce in normal Shiny, but something as simple as invert this filter, that is, take the rows we're showing now and show everything except them, is not an easy thing to do with most dashboard UIs.

This approach, this LLM SQL dashboard approach turns out to be incredibly, incredibly useful. And I want to use this illustration from Rick and Morty. If you haven't seen the show, there's this one scene where Rick, who is a scientist, he's tinkering at the breakfast table, and he builds this little robot. And the robot wakes up, and it immediately asks, what is my purpose? And instead of answering, Rick asks the robot to pass the butter. And the robot brings the butter over, and Rick's very happy. And then other things happen in this scene. They're having conversation while the robot just waits there. And then when there's a pause, the robot says, what is my purpose? And Rick says, you pass butter. And this poor robot goes, oh my god.

And this is sort of how I think of this scenario, where you have this LLM that is good at a lot of things. It has this very general capability, this very general intelligence. And we are really saying, you pass butter. You do this one thing, you just write SQL, and we do not want you to do anything else.

So this approach of massively constraining what the model can do, is this a responsible approach? In this case, in terms of correctness, we are asking it to generate SQL, and it generally does that incredibly well, certainly, I would say, at a superhuman level compared to the median data analyst, or even one well above the median. In terms of transparency, all the analysis that you're seeing on the screen is driven off of a SQL query that is very clearly displayed. And in terms of reproducibility, the fact that you can take that SQL and apply it later on, and get the same result as long as the data has not changed, I think that qualifies.

So this approach of taking a SQL chatbot and applying it to a Shiny data dashboard has worked so well that we have created an open source package called querychat. If you have a dashboard that is primarily driven off of a single data frame, it's very easy to drop in querychat and recreate this experience for your data.
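In spirit, the pattern querychat packages up looks something like this minimal Shiny sketch. This is not querychat's actual implementation; the system prompt, the `tips` table name and CSV file, and the use of sqldf to apply the query are all illustrative assumptions:

```r
library(shiny)
library(ellmer)

# Minimal sketch of the "LLM writes SQL, Shiny applies it" pattern.
# The only thing the model produces is a SELECT statement; everything
# downstream is ordinary deterministic reactive code.
tips <- read.csv("tips.csv")  # assumed data source

ui <- fluidPage(
  textInput("request", "Describe a filter in plain language"),
  verbatimTextOutput("sql"),  # generated SQL, always shown to the user
  tableOutput("tbl")
)

server <- function(input, output, session) {
  chat <- chat_openai(system_prompt = paste(
    "Translate the user's request into one SQL SELECT statement",
    "over a table named `tips`. Reply with SQL only."
  ))
  sql <- reactive({
    req(input$request)
    chat$chat(input$request)
  })
  # sqldf runs the SQL against the local data frame; swap in DBI
  # for a real database.
  filtered <- reactive(sqldf::sqldf(sql()))
  output$sql <- renderText(sql())
  output$tbl <- renderTable(filtered())
}

shinyApp(ui, server)
```

The key design property is that the model's output is inspectable text, and the numbers on screen come only from deterministic code applying that text.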

Approach 2: micromanaging the model

So that's the first approach, this approach of constraining. The second approach that we're going to take is to micromanage the model. And in this case, we are going to have an incredibly tight, very, very tight human AI feedback loop, where we're asking it to do tiny steps that we are then immediately judging to be right or wrong. Or if it's not a matter of right or wrong, just whether we like it or don't like the result. And then we can correct it immediately, and again, see the new result.

I think the idea here is that if we are micromanaging the AI this closely, then mistakes are all but guaranteed to be caught right away. So even if the model is only 85% reliable at what we're asking it to do, not 100%, that is fine. That 15% of the time, we'll just notice, we'll correct it, we'll help it get to the right place.

So the example I want to use here is a tool that I vibe coded this week to tweak plots. Believe it or not, everything I'm going to show you is a Shiny app. One realization that I've had using ChatGPT over the last year is that I find using my voice to speak to an LLM incredibly helpful for reducing the friction of fast iteration, or for giving a lot of information to an LLM. So that's the approach we're taking here: instead of typing what I would like to see happen into a chat box, I'm going to use my voice, and use that to shorten the feedback cycle.

So I'm going to start by, I'm going to press and hold space bar and that'll let me speak and then I'll let go and that'll send my command.

Let's create a plot of mtcars. Okay, great. So you can see here it did that, it did create a plot. That's fine, it even colored by cylinder, that's nice.

Can you make the text significantly bigger and move the legend to the bottom? Great, make the data points a little bit bigger as well.

All right, let's see. I'd like to see these sort of groupings a little bit better. Can you add a cell above to calculate the convex hulls based on the cylinders?

Oh, great, okay. That looks pretty good, but let's get rid of the border of the convex hull. Just use the fill.

All right, great. And finally, I'm kind of colorblind. I'm having trouble telling between these colors. Can you pick some more colorblind friendly colors?

All right, great, I think that looks pretty good.

So as you can see in this example, this feedback loop is so tight and I'm so on top of every little thing that it's doing that it's sort of hard for it to get too far off track.

And in terms of an illustration for this one, what this feels like, this is maybe a slightly dark example, but I think of the movie Whiplash, this famous not quite my tempo scene. If you haven't seen it, J.K. Simmons plays this band director and he's extremely, extremely exacting. So they start playing and he says the equivalent of, let's plot mtcars, and Miles Teller is the drummer and he's like, aha, I did this. This is going great. And nope, not quite. The scatter plot points are a little small, and okay, so he fixes that. Not quite, the text annotations, and I want the y-axis to start at zero, and are you serious with this color palette? And this poor model is just doing its best to satisfy our every whim.

So this is the idea: we're just so on top of it, so exacting, examining every minute detail of what this model is creating. Is this approach responsible, this plot-tweaking tool with its tight feedback loop? In practice, I do feel like it makes far fewer mistakes than a human does, and the kinds of mistakes that it makes are usually easy to catch.

For example, if it treats your factors as unordered instead of ordered, usually that's something that you can just sort of see and deal with quickly. Of course, it depends on what you're doing with this bot, but in the case of visualizations, I think it's pretty good. In terms of transparency, the user is directing what's happening here, and you can see the R code that it's generating at all times, and I think that's really important. And finally, it's reproducible: the R code that you are creating can generally be saved in a script and uploaded to Git, and all the normal ways our scripts are reproducible apply.
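For instance, the script a session like the one above leaves behind might look roughly like this. This is an illustrative reconstruction of the requested plot, not the tool's actual output:

```r
library(ggplot2)
library(dplyr)

# One convex hull per cylinder group, as requested in the demo.
# chull() (base R) returns the row indices of the hull points.
hulls <- mtcars |>
  group_by(cyl) |>
  slice(chull(wt, mpg))

ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) +
  geom_polygon(
    data = hulls, aes(fill = factor(cyl)),
    alpha = 0.2, colour = NA          # fill only, no border
  ) +
  geom_point(size = 3) +              # bigger data points
  scale_colour_viridis_d() +          # colorblind-friendly palette
  scale_fill_viridis_d() +
  theme_minimal(base_size = 16) +     # significantly bigger text
  theme(legend.position = "bottom")   # legend moved to the bottom
```

Because the session bottoms out in an ordinary script like this, the usual reproducibility story for R code applies unchanged.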

Approach 3: deferred review with DataBot

So the third and final approach that I wanna talk about is deferred review. The idea here is that we're gonna have a similar approach to the last one, but with a looser rein, a looser leash. We're gonna ask the AI to do things in slightly bigger steps, to take bigger leaps without us, but still not just go off and do stuff on its own. So we're still there: we're aware of what it's doing and what kinds of questions it wants to answer, and we are directing it on what kinds of things to do next, but we are not closely scrutinizing every single step and every line of code that it's writing for errors and hallucinations.

Obviously, it needs to be decent at whatever it's doing, but we're not gonna go over the top in terms of reviewing. What that means is we get to enjoy really fast movement upfront, because we're not stopping to scrutinize everything, but we are piling up a sort of review debt along the way that will eventually come due.

And that is usually at the latest before you ship your work, before you take your work and actually put it in front of people who might act on it or who might misinterpret it as fully vetted analysis, that is when you need to stop and carefully review. I think of this as being akin to working in Git, having your own branch and you're kind of working through things and you kind of clean it up and get it all ready and have it reviewed before you merge but you generally do that at the end.

So the agent that I wanna show you for this is DataBot. DataBot is our exploratory data analysis agent that we released recently and I think it's just one of the most exciting pieces of software that I've ever worked on.


If you have not seen DataBot, it is an extension for Positron exclusively right now. So you need a relatively recent version of Positron and then you can install DataBot by going to the extensions pane here. You will need an Anthropic API key or I think we can also use AWS Bedrock and very shortly, you'll be able to connect to other LLMs as well. By the way, this is not currently available in RStudio but it or something much like it will be coming to RStudio very soon.

So I have DataBot here, and you wanna start out in a project or a folder where you're ready to do some work. So maybe you brought some data into this folder. In this case, I have a CSV, but we could connect to a database also, or use Databricks or an S3 bucket or SAS data, just anything you can do in R.

And then I'm just gonna start asking it questions and I can ask it relatively high level questions. I can say just like load the CSV data in this directory. And after I give it a command, it is going to usually either call some tools or write some code to try to answer that question. And in this case, it's doing both. It is both looking at what files are available in this directory and then it is executing some R code. And very importantly, both the model and ourselves as the users, we can both see this information that's coming out of the R code.

So it looks at what happened here. It's looking at the summary here and it comes to these conclusions about what kind of data is here. And this appears to be like weather data about various countries and their capitals, okay? 60,000 rows and 22 columns.

And then it stops and gives us four suggestions for how to proceed. And this is incredibly important. If you look at these suggestions, they're quite specific and sensitive to what just happened. Like these are not just scripted or stock suggestions like, do you wanna do some data quality checks? Do you want to transform the data? Do you wanna visualize the data? It's asking like pretty specific questions here like create visualizations of temperature patterns over time.

So every interaction you have with DataBot is gonna be like this. You ask a question or give it an instruction. It will run some code. It'll look at the results. It will make some observations, make some suggestions and then stop. And it is very important that it stop and not just run off and just answer whatever questions it feels like leaving us behind. DataBot is carefully tuned to work in chunks that feel good to a human, that are enough for us to feel like we're making progress, but also that we can sort of have a general handle of what it's doing and why.

So now let's ask it like create a visualization of temperature patterns over time. It's looking at how many countries. And okay, so this is temperature patterns for all 252 countries. This is just data from September to October of this year. And then it's separating it out by continent. And as before, we can see these plots and the model can see the plot. And then it makes some conclusions about what it sees here.

And now we've asked and answered a whole bunch of questions. Like we've really taken it down a number of different pathways and we've ended with a pretty interesting, they're pretty scrunched because of my screen resolution here, this plot that I think is pretty interesting. And I really don't care about everything else. Like really, this is the conclusion that I would really like to draw attention to.

So with DataBot, I have the ability to tell it: okay, we've done all this stuff. You've written all this code and answered all these questions, most of which I don't care about anymore. Let's create a reproducible artifact just about the thing that I do care about. So I'll type /report to say, I wanna make a Quarto report.

So what this is gonna do is create a reproducible Quarto document that takes what it's learned during the course of this conversation and turns it into a reproducible version. Okay, so you can see here that it created this report for us and reproduced this very nice visualization. And it also automatically drops this warning at the top. So this is your cue to carefully review each piece of code and each conclusion found in this report, and then you can remove that little disclaimer.

So is DataBot and this approach of deferred review, is it responsible? It is responsible in terms of correctness to the degree that you have the discipline and expertise to actually go back and review the work created by the model. In terms of transparency, I mean, there is R code. It is telling you how it's coming to these conclusions. But I mean, to be honest, while you're in the thick of the exploration, there's a lot of R code coming at you and it's going by pretty fast. In terms of reproducibility, I think pretty good. This capability to generate a reproducible report for you, I think is incredibly important.

So overall, compared to the other approaches, this is definitely riskier. But I will say it is so unbelievably useful that I certainly will not be doing exploratory data analysis without DataBot or something like DataBot in the future.


Choosing your risk tolerance

So this plot shows, for each approach that I've shown today, along with ChatGPT's deep research mode, how likely I think that approach is to produce a mistake from the model. And then on the y-axis, how likely is it that that mistake would be overlooked by a human operator?

And based on these different points in space, you have to decide what makes sense for you. Where is your line in terms of what is responsible use of LLMs and what is not? So for example, if you work in marketing and you are trying to optimize the click-through rates of your mobile ads, then maybe you would draw the line somewhere here and anything below and to the left of the line is acceptable risk for you. But maybe you work in some field where lives are at stake. Maybe you're working on nuclear power plant safety or drug development. Maybe in some of those cases, your tolerance is gonna be way down here, that none of today's tools might be good enough for what you need. Or on the other hand, you might be an AI influencer on LinkedIn and then your tolerance is way out here.

So that is what I have for you today. Thank you so much for your time and attention. And I'd be happy to take questions if there's time. Thank you.

Q&A

Amazing. Thank you, Joe, for putting together this talk and thank you everyone else for being with us today. We do have questions in the Q&A. I'm sure you can see everything. So why don't we just go ahead and have you answer at your leisure. And we have until 45 minutes after the hour, so about eight minutes. So let's go.

Oh, there are questions showing up here. Okay. Yeah, so let me first address, first of all, thank you for having me and thank you for your time and attention. Let me first address the local LLM question.

So as we were developing DataBot specifically, DataBot really emerged from Claude Sonnet's ability to not only answer your data questions, but to be really sensitive about what would make sense to do next. And what we found is that with other models at the time, even the frontier models from other proprietary labs, the quality fell off to the point that, to me, it didn't make sense to use this tool with another model.

I do think that that situation has changed over the last few months. Certainly GPT-5 can do a pretty decent job, albeit it seems to require a lot more thinking, a lot more runtime token usage, to do so. But it seems very promising that we should be able to make it work well with things besides Claude. Local models have gotten a lot better over the last few months, or at least earlier this year they seemed to take a big leap forward. But I'm still waiting to see them get better before we can in good conscience recommend them.

That said, for Positron in particular, we have the ability to connect to any OpenAI API compatible endpoint, which I think somebody asked about, like, do you have the ability to connect to your organization's local LLMs? So we can now say, yes, if it hasn't shipped already, it will be shipping shortly. I think it's already shipped for the desktop versions of Positron and should be coming out for Workbench shortly.

So I think this does give you the ability to connect to a lot of different kinds of models. But I continue to think that today, you're not likely to have an awesome experience unless you're connecting to one of the Claude models. Actually, Claude Sonnet or Claude Haiku are both really excellent. And I think, partially, for these kinds of advanced agents, there's more tailoring to a specific model than you would think. A lot of what we do, it's almost like we're trying to fit a custom suit onto this body. And then everybody's asking, can I just swap this out for a totally different body? And it's really difficult, because we have so carefully measured and tweaked and tucked so that it fits just well with this model.

So I can't really guarantee that you'll get good results if you use it with something else.