Resources

Keeping LLMs in Their Lane: Focused AI for Data Science and Research

From R+AI 2025, hosted by R Consortium.

Keynote

LLMs are powerful, flexible, easy to use… and often wrong. This is a dangerous combination, especially for data analysis and scientific research, where correctness and reproducibility are core requirements. Fortunately, it turns out that by carefully applying LLMs to narrower use cases, we can turn them into surprisingly reliable assistants that accelerate and enhance, rather than undermine, scientific work. This is not just theory—I’ll showcase working examples of seamlessly integrating LLMs into analytic workflows, helping data scientists build interactive, intelligent applications without needing to be web developers. You’ll see firsthand how keeping LLMs focused lets us leverage their “intelligence” in a way that’s practical, rigorous, and reproducible.

Bio

Joe Cheng is the CTO and first employee at Posit, PBC (formerly known as RStudio), where he helped create the RStudio IDE, the Shiny web framework, and the Databot agent for exploratory data analysis.

R Consortium Resources

- Main R Consortium site: https://www.r-consortium.org/
- R+AI website: https://rconsortium.github.io/RplusAI_website
- R Consortium webinars: https://r-consortium.org/webinars/webinars.html
- Blog: https://r-consortium.org/blog/
- LinkedIn: https://www.linkedin.com/company/r-consortium/

Nov 15, 2025
46 min


Transcript

This transcript was generated automatically and may contain errors.

Hi, my name is Joe Cheng. I am the CTO of Posit, and I'm so pleased to be here speaking with you at R Plus AI.

Today, I'm gonna talk about how we can take LLMs and really focus them for working with data science and research in a way that is useful and responsible.

And when I'm talking about LLMs today, I'm really not talking about ChatGPT, and I'm not talking about Copilot or any kind of coding agent. These tools are very useful, but I think we're all sort of accustomed to them. We're used to their trade-offs. And I think that one of the most interesting ways we can think of LLMs today is how to harness them as custom agents, how we take this sort of underlying ability that LLMs have and use them to our own ends in our own custom workflows.

And to that end, we've really tried to make it as easy as possible for anyone with a background in R to be able to harness these LLMs for their own purposes. We have this package called ellmer that I think is a really beautiful and elegant package, and it is one of my favorite packages that we've released in recent years. It gives you the ability to so easily tap into the power that LLMs have: not just simple chat, as you can see here, but the ability to do tool calling, to create agentic workflows, and to create user interfaces on top of LLMs.

Now, I'm not gonna get too far into ellmer today, but I'd like you to take my word for it that if you are here and you know how to write R code, you are ready to harness LLMs to create advanced agents. It really is quite easy to get started with these things.

The prime directive: correctness, transparency, reproducibility

So if we all have the ability to do this, that doesn't answer the question of whether we should, whether it's a good idea. And this is a question that is very near and dear to my heart and to the heart of Posit, and to talk about why, I wanna talk a little bit about where Posit comes from. So we are a public benefit corp, and if you are not familiar with that concept, you can sort of think of it as being a compromise between a for-profit company and a nonprofit organization. So we have all of the flexibility of a for-profit company, but we are motivated by a mission.

And as a PBC, we have to state out loud what that mission is. And for us, this is sort of an abridged version of the beginning of our official mission. Our mission is to create open source software for data science, scientific research, and technical communication.

And I really wanna focus on this data science and scientific research. A lot of the use cases that are really important to us are things like people doing academic and scientific research, people who are really using this stuff for high stakes kinds of scenarios, people who are analyzing healthcare data, or doing drug development, deciding public policy, environmental studies, epidemiology, things that it's really important that your answers are correct or people could really get hurt.

And to kind of hammer this home, I wanna point out this passage from a book called Software for Data Analysis by John Chambers, who is the creator of S, the precursor of R. And Chambers makes this point. He says, those who receive the results of modern data analysis have limited opportunity to verify the results by direct observation. Users of the analysis have no option but to trust the analysis and by extension, the software that produced it.

Both the data analysts, that's probably most of you, and the software provider, that's us and people who write packages for R, therefore have a strong responsibility to produce a result that is trustworthy and if possible, one that can be shown to be trustworthy. This obligation I label the prime directive.


So that's how seriously John Chambers takes it, and we've adopted this prime directive as our own from the earliest days of the company. One of our software engineers on the tidymodels team really hammered this home by saying, I'm aware that if I make a mistake, bad things happen, death and other things.

So we have this principle in mind, and for us, it really is more than a best practice. It's a moral and ethical obligation to adhere to this prime directive. What does that mean concretely? How does that translate into real concerns as we work on data science software?

So number one, correctness is obviously paramount at all times. We wanna do our best to write software without bugs, especially in the code paths that produce answers. Secondly, transparency. We want the methods of our analysis to be inspectable, and having open source code goes a long way towards those ends. And third, reproducibility: the ability for the analysis to be repeated on the same data in the future, hopefully producing the same results. These three principles of correctness, transparency, and reproducibility are first and foremost in our minds any time we are working on a new tool for helping with data analysis.

Why LLMs are a poor fit — and what that means

And when we hold these three principles against LLMs, this is not an obvious fit. Like it's actually quite an obvious misfit because when it comes to correctness, LLMs are infamous for giving convincing but wrong answers. They're getting better all the time, but this phenomenon of very convincingly telling you the wrong thing is certainly still very prevalent. Secondly, transparency. I mean, LLMs are the ultimate black box. We don't really understand what they do. We don't really understand how they do it. And in fact, the more you understand about them, the less what they are already capable of seems possible.

And then finally, they are kind of inherently not reproducible. They're inherently non-deterministic just by virtue of the way that they work. Some of the model providers will give you the ability to set a random seed, and even they will warn you that this is not fully reproducible because the answers will still vary based on the underlying hardware or their software configuration. So we inherently do not have the ability to reproduce conversations or actions that LLMs take.

And we've all seen this in action. Like, we've all asked LLMs things like, draw an intricate piece of ASCII art, and it says, here's an intricate piece of ASCII art of a wolf, and what it has actually drawn is a penguin. And it's very confidently saying, I've done this for you. Why does it do that? We don't really know.

And I wanna make sure you understand that my point here is not that LLMs are bad or that they can't do things. They can do a lot of things. And in fact, the shape of the things that they can and can't do is quite surprising. I think you could intuitively imagine that LLMs work like this, that there is a spectrum of tasks that they could possibly do that range from very easy, according to some criteria or intuition, to very hard. And you would think that the easy tasks they excel at, and as these tasks get harder, they get worse and worse, which is what you see here, kind of a smooth, relatively smooth capability curve.

And that is not the world that we live in. Today's LLMs do not reflect this kind of behavior. In fact, it's more like this, where there are very easy tasks that they excel at and very easy tasks that they are terrible at and very difficult tasks that they excel at and very difficult tasks that they're terrible at. And it is very, very difficult to intuit without trying to know what category or whether this will be inside or outside of their capability curve when you're thinking of a new task.

Just to take two examples, they're quite good at coding. And I think you would think, as a human, that coding is a pretty hard thing to do. There's a lot of logic and reasoning. There's a lot of nuance. There's abstractions. There's a lot of context that you need to hold in your head. There's a lot of intent that needs to be inferred from reading code. And yet, these models are incredibly good at it. But at the same time, the very simplest data tasks, they are really surprisingly bad at. And let me just give you a quick demonstration of that.

So I tried to think of, what is the very easiest data operation that you could do? And the one I thought of was length. Given an array of values, how long is the array? How many elements are in the array? So to do this in R, we create an array of length n. These are random floating point values between zero and one. And then we ask, in this case, GPT-4.1, we ask it, how long is this array? And then we pass it those random numbers encoded as a JSON array.

And if you pass it 10 values, to its credit, it will get it correct. At 100, it gets it correct. And even at 1,000, it gives the correct answer. But then at 10,000, it says 1,000. I mean, it's not even close. And even these are cherry-picked round numbers, just powers of 10. When I gave it 103 elements, it said there were 100. So this is a pretty disappointing showing: even such a fundamental operation as length, these models are not reliable for.
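For the curious, the experiment above is only a few lines with the ellmer package. This is a sketch, not the exact code from the talk; it assumes an OpenAI API key is available in the environment, and the model's answer is non-deterministic, so results will vary:

```r
library(ellmer)    # Posit's R package for talking to LLMs
library(jsonlite)

# Sketch of the array-length experiment described above.
# Assumes OPENAI_API_KEY is set in the environment.
n <- 103
x <- runif(n)  # n random values between 0 and 1

chat <- chat_openai(model = "gpt-4.1")
chat$chat(paste0(
  "How long is this array? Answer with only the number.\n",
  toJSON(x)
))
# In the talk, the model claimed 100 elements for a 103-element array.
```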

In fairness, when I made this slide, this was definitely true, and GPT-4.1 was state-of-the-art. With GPT-5, it takes a very long time, but I have mostly seen it give correct answers, and the same with Claude Opus 4.1. So the state-of-the-art is changing, but the fact remains that for an operation that feels like it should be simple and fundamental, we cannot trust these LLMs. And that does not bode well for all the other fundamental data science operations you'd wanna do.

Three approaches to responsible LLM use

So given these inherent challenges, these challenges around correctness, transparency, reproducibility, are there responsible ways to use LLMs in the service of data science? Are there responsible ways that we can leverage these kind of magical new tools?

So what I am gonna show you now are three different approaches that we have taken. And I wanna emphasize that these are not principles that I think everybody should be abiding by. These are just experiments that we ran, and approaches that we observed have worked more or less well for the different scenarios that we tried. This is all evolving very quickly, and every model behaves differently. So we are approaching this with a spirit of experimentation and humility and openness for each generation of LLMs to bring us a different set of opportunities. And I highly encourage you to think of these approaches in that spirit.

So the first approach I wanna talk about is to constrain these LLMs, okay? And what that looks like is we identify some things that these LLMs can do, at least for a particular model that you're targeting. And we decide, based on our empirical experience: okay, this LLM is very, very good at this one task or this one skill. It is firmly inside the jagged frontier of its capability. Then, in order to make this one skill more useful, we augment the LLM with other kinds of software we write, deterministic software that leverages that LLM skill and makes it more useful. And then we instruct the LLM: just stick to the task we are giving you, stick to doing this thing that we've decided you're very good at. And resist the urge to add features that would pull the LLM in directions that it's still okay at, but maybe not quite as solidly good at.

So the example I have in mind for this is to use an LLM to drive SQL, to drive a dashboard. Let me show you what that looks like. So this is a Shiny dashboard. This is not an LLM-powered dashboard. This is just a regular Shiny dashboard. This is actually one of the templates for Shiny for Python.

And just like a typical Shiny application, you can see that there are some reactive outputs, and then there are some inputs on the side. You know, if this was a real dashboard, there would probably be a lot more inputs than this. But as you change these values, you can see that the values on the right update, okay?

So we asked ourselves, could we leverage these LLMs' ability to write SQL very well, and use that to enhance our dashboard with more capabilities? So here we've replaced all the filters with a sidebar. This is now written in R with ellmer and shinychat. And we can ask filtering questions like, show only male smokers who had dinner on Saturday. This request goes out to an LLM, and the LLM comes back with a SELECT statement. That SELECT statement is shown to the user in two places, and then it is applied to a data source by Shiny using reactive code. And everything that you see down here, this code has not changed. It is the same reactive Shiny code; it's just keyed off of a different reactive data frame, one that is driven by this SQL. This technique works with any SQL-speaking database that you can get to from R.

And what this means is that we do not have to worry about these values being hallucinated. We don't have to worry about this plot or this table of numbers being hallucinated. All we have to worry about is whether this SQL is correct. And if you are someone who speaks SQL fluently, then great, that's a very easy task for you. But even if this is something that you're handing off to, say, your boss who doesn't speak SQL as well, there's a really good chance that, by reading the SQL, they can intuit whether it's right or wrong. And at worst, they can take that SQL to you and ask you to verify it.

But importantly, modern LLMs are quite good at writing SQL, so it's very infrequent that they make mistakes here. And we can use this capability to achieve filtering that we could not do with traditional Shiny controls. Selecting these three criteria is a very easy thing to reproduce in normal Shiny, but something as simple as invert this filter, that is, take the rows we're showing now and show everything except them, is not an easy thing to do with most dashboard UIs.

This approach, this LLM SQL dashboard approach turns out to be incredibly, incredibly useful. And I want to use this illustration from Rick and Morty. If you haven't seen the show, there's this one scene where Rick, who is a scientist, he's tinkering at the breakfast table, and he builds this little robot. And the robot wakes up, and it immediately asks, what is my purpose? And instead of answering, Rick asks the robot to pass the butter. And the robot brings the butter over, and Rick's very happy. And then other things happen in this scene. They're having conversation while the robot just waits there. And then when there's a pause, the robot says, what is my purpose? And Rick says, you pass butter. And this poor robot goes, oh my god.

And this is sort of how I think of this scenario, where you have this LLM that is good at a lot of things. It has this very general capability, this very general intelligence. And we are really saying, you pass butter. You do this one thing, you just write SQL, and we do not want you to do anything else.

So this approach of massively constraining what the model can do, is this a responsible approach? In this case, in terms of correctness, we are asking it to generate SQL, and it generally does that incredibly well, certainly, I would say, at a superhuman level compared to the median data analyst, or even one well above the median. In terms of transparency, all the analysis that you're seeing on the screen is driven off of a SQL query that is very clearly displayed. And in terms of reproducibility, the fact that you can take that SQL and apply it later on, and get the same result as long as the data has not changed, I think that qualifies.

So this approach of taking a SQL chatbot and applying it to a Shiny data dashboard has worked so well that we have created an open source package called querychat. If you have a dashboard that is primarily driven off of a single data frame, it's very easy to drop in querychat and recreate this experience for your data.
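In spirit, the pattern querychat packages up looks something like this minimal Shiny sketch. This is not querychat's actual implementation; the system prompt, the `tips` table name and CSV file, and the use of sqldf to apply the query are all illustrative assumptions:

```r
library(shiny)
library(ellmer)

# Minimal sketch of the "LLM writes SQL, Shiny applies it" pattern.
# The only thing the model produces is a SELECT statement; everything
# downstream is ordinary deterministic reactive code.
tips <- read.csv("tips.csv")  # assumed data source

ui <- fluidPage(
  textInput("request", "Describe a filter in plain language"),
  verbatimTextOutput("sql"),  # generated SQL, always shown to the user
  tableOutput("tbl")
)

server <- function(input, output, session) {
  chat <- chat_openai(system_prompt = paste(
    "Translate the user's request into one SQL SELECT statement",
    "over a table named `tips`. Reply with SQL only."
  ))
  sql <- reactive({
    req(input$request)
    chat$chat(input$request)
  })
  # sqldf runs the SQL against the local data frame; swap in DBI
  # for a real database.
  filtered <- reactive(sqldf::sqldf(sql()))
  output$sql <- renderText(sql())
  output$tbl <- renderTable(filtered())
}

shinyApp(ui, server)
```

The key design property is that the model's output is inspectable text, and the numbers on screen come only from deterministic code applying that text.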

Approach 2: micromanaging the model

So that's the first approach, this approach of constraining. The second approach that we're going to take is to micromanage the model. And in this case, we are going to have an incredibly tight, very, very tight human AI feedback loop, where we're asking it to do tiny steps that we are then immediately judging to be right or wrong. Or if it's not a matter of right or wrong, just whether we like it or don't like the result. And then we can correct it immediately, and again, see the new result.

I think the idea here is that if we are micromanaging the AI this closely, then mistakes are all but guaranteed to be caught right away. So even if the model is only 85% reliable at what we're asking it to do, not 100%, that is fine. That 15% of the time, we'll just notice, we'll correct it, we'll help it get to the right place.

So the example I want to use here is a tool that I vibe coded this week to tweak plots. Believe it or not, everything I'm going to show you is a Shiny app. One realization that I've had using ChatGPT over the last year is that I find using my voice to speak to an LLM incredibly helpful for reducing the friction of fast iteration, or for giving a lot of information to an LLM. So that's the approach we're taking here: instead of typing what I would like to see happen into a chat box, I'm going to use my voice, and use that to shorten the feedback cycle.

So I'm going to start by, I'm going to press and hold space bar and that'll let me speak and then I'll let go and that'll send my command.

Let's create a plot of mtcars. Okay, great. So you can see here it did that, it did create a plot. That's fine, it even colored by cylinder, that's nice.

Can you make the text significantly bigger and move the legend to the bottom? Great, make the data points a little bit bigger as well.

All right, let's see. I'd like to see these sort of groupings a little bit better. Can you add a cell above to calculate the convex hulls based on the cylinders?

Oh, great, okay. That looks pretty good, but let's get rid of the border of the convex hull. Just use the fill.

All right, great. And finally, I'm kind of colorblind. I'm having trouble telling between these colors. Can you pick some more colorblind friendly colors?

All right, great, I think that looks pretty good.

So as you can see in this example, this feedback loop is so tight and I'm so on top of every little thing that it's doing that it's sort of hard for it to get too far off track.

And in terms of an illustration for this one, what this feels like, this is maybe a slightly dark example, but I think of the movie Whiplash, this famous not quite my tempo scene. If you haven't seen it, J.K. Simmons plays this band director and he's extremely, extremely exacting. So they start playing and he says the equivalent of, let's plot mtcars, and Miles Teller is the drummer and he's like, aha, I did this. This is going great. And nope, not quite. The scatter plot points are a little small, and okay, so he fixes that. Not quite, the text annotations, and I want the y-axis to start at zero, and are you serious with this color palette? And this poor model is just doing its best to satisfy our every whim.

So this is the idea: we're just so on top of it, so exacting, examining every minute detail of what this model is creating. Is this approach responsible, this plot-tweaking tool with its tight feedback loop? In practice, I do feel like it makes far fewer mistakes than a human does, and the kinds of mistakes that it makes are usually easy to catch.

For example, if it treats your factors as unordered instead of ordered, usually that's something that you can just sort of see and deal with quickly. Of course, it depends on what you're doing with this bot, but in the case of visualizations, I think it's pretty good. In terms of transparency, the user is directing what's happening here, and you can see the R code that it's generating at all times, and I think that's really important. And finally, it's reproducible: the R code that you are creating can generally be saved in a script and uploaded to Git, and all the normal ways our scripts are reproducible apply.
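For instance, the script a session like the one above leaves behind might look roughly like this. This is an illustrative reconstruction of the requested plot, not the tool's actual output:

```r
library(ggplot2)
library(dplyr)

# One convex hull per cylinder group, as requested in the demo.
# chull() (base R) returns the row indices of the hull points.
hulls <- mtcars |>
  group_by(cyl) |>
  slice(chull(wt, mpg))

ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) +
  geom_polygon(
    data = hulls, aes(fill = factor(cyl)),
    alpha = 0.2, colour = NA          # fill only, no border
  ) +
  geom_point(size = 3) +              # bigger data points
  scale_colour_viridis_d() +          # colorblind-friendly palette
  scale_fill_viridis_d() +
  theme_minimal(base_size = 16) +     # significantly bigger text
  theme(legend.position = "bottom")   # legend moved to the bottom
```

Because the session bottoms out in an ordinary script like this, the usual reproducibility story for R code applies unchanged.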

Approach 3: deferred review with DataBot

So the third and final approach that I wanna talk about is deferred review. The idea here is that we're gonna have a similar approach to the last one, but with a looser rein, a looser leash. We're gonna ask the AI to do things in slightly bigger steps, to take bigger leaps without us, but still not just go off and do stuff on its own. So we're still there: we're aware of what it's doing and what kinds of questions it wants to answer, and we are directing it on what kinds of things to do next, but we are not closely scrutinizing every single step and every line of code that it's writing for errors and hallucinations.

Obviously, it needs to be decent at whatever it's doing, but we're not gonna go over the top in terms of reviewing. What that means is we get to enjoy really fast movement upfront, because we're not stopping to scrutinize everything, but we are piling up a sort of review debt along the way that will eventually come due.

And that is usually at the latest before you ship your work, before you take your work and actually put it in front of people who might act on it or who might misinterpret it as fully vetted analysis, that is when you need to stop and carefully review. I think of this as being akin to working in Git, having your own branch and you're kind of working through things and you kind of clean it up and get it all ready and have it reviewed before you merge but you generally do that at the end.

So the agent that I wanna show you for this is DataBot. DataBot is our exploratory data analysis agent that we released recently and I think it's just one of the most exciting pieces of software that I've ever worked on.


If you have not seen DataBot, it is an extension for Positron exclusively right now. So you need a relatively recent version of Positron and then you can install DataBot by going to the extensions pane here. You will need an Anthropic API key or I think we can also use AWS Bedrock and very shortly, you'll be able to connect to other LLMs as well. By the way, this is not currently available in RStudio but it or something much like it will be coming to RStudio very soon.

So I have DataBot here, and you wanna start out in a project or a folder where you're ready to do some work. So maybe you brought some data into this folder. In this case, I have a CSV, but we could connect to a database also, or use Databricks or an S3 bucket or SAS data, just anything you can do in R.

And then I'm just gonna start asking it questions and I can ask it relatively high level questions. I can say just like load the CSV data in this directory. And after I give it a command, it is going to usually either call some tools or write some code to try to answer that question. And in this case, it's doing both. It is both looking at what files are available in this directory and then it is executing some R code. And very importantly, both the model and ourselves as the users, we can both see this information that's coming out of the R code.

So it looks at what happened here. It's looking at the summary here and it comes to these conclusions about what kind of data is here. And this appears to be like weather data about various countries and their capitals, okay? 60,000 rows and 22 columns.

And then it stops and gives us four suggestions for how to proceed. And this is incredibly important. If you look at these suggestions, they're quite specific and sensitive to what just happened. Like these are not just scripted or stock suggestions like, do you wanna do some data quality checks? Do you want to transform the data? Do you wanna visualize the data? It's asking like pretty specific questions here like create visualizations of temperature patterns over time.

So every interaction you have with DataBot is gonna be like this. You ask a question or give it an instruction. It will run some code. It'll look at the results. It will make some observations, make some suggestions and then stop. And it is very important that it stop and not just run off and just answer whatever questions it feels like leaving us behind. DataBot is carefully tuned to work in chunks that feel good to a human, that are enough for us to feel like we're making progress, but also that we can sort of have a general handle of what it's doing and why.

So now let's ask it like create a visualization of temperature patterns over time. It's looking at how many countries. And okay, so this is temperature patterns for all 252 countries. This is just data from September to October of this year. And then it's separating it out by continent. And as before, we can see these plots and the model can see the plot. And then it makes some conclusions about what it sees here.

And now we've asked and answered a whole bunch of questions. Like we've really taken it down a number of different pathways and we've ended with a pretty interesting, they're pretty scrunched because of my screen resolution here, this plot that I think is pretty interesting. And I really don't care about everything else. Like really, this is the conclusion that I would really like to draw attention to.

So with DataBot, I have the ability to tell it: okay, we've done all this stuff. You've written all this code and answered all these questions, most of which I don't care about anymore. Let's create a reproducible artifact just about the thing that I do care about. So I'll type /report to say, I wanna make a Quarto report.

So what this is gonna do is create a reproducible Quarto document that takes what it's learned during the course of this conversation and turns it into a reproducible version. Okay, so you can see here that it created this report for us and reproduced this very nice visualization. And it also automatically drops this warning at the top. So this is your cue to carefully review each piece of code and each conclusion found in this report, and then you can remove that little disclaimer.

So is DataBot and this approach of deferred review, is it responsible? It is responsible in terms of correctness to the degree that you have the discipline and expertise to actually go back and review the work created by the model. In terms of transparency, I mean, there is R code. It is telling you how it's coming to these conclusions. But I mean, to be honest, while you're in the thick of the exploration, there's a lot of R code coming at you and it's going by pretty fast. In terms of reproducibility, I think pretty good. This capability to generate a reproducible report for you, I think is incredibly important.

So overall, compared to the other approaches, this is definitely riskier. But I will say it is so unbelievably useful that I certainly will not be doing exploratory data analysis without DataBot or something like DataBot in the future.


Choosing your risk tolerance

So this plot shows, for each approach that I've shown today, along with ChatGPT's deep research mode, how likely I think that approach is to produce a mistake from the model. And then on the y-axis, how likely is it that that mistake would be overlooked by a human operator?

And based on these different points in space, you have to decide what makes sense for you. Where is your line in terms of what is responsible use of LLMs and what is not? So for example, if you work in marketing and you are trying to optimize the click-through rates of your mobile ads, then maybe you would draw the line somewhere here and anything below and to the left of the line is acceptable risk for you. But maybe you work in some field where lives are at stake. Maybe you're working on nuclear power plant safety or drug development. Maybe in some of those cases, your tolerance is gonna be way down here, that none of today's tools might be good enough for what you need. Or on the other hand, you might be an AI influencer on LinkedIn and then your tolerance is way out here.

So that is what I have for you today. Thank you so much for your time and attention. And I'd be happy to take questions if there's time. Thank you.

Q&A

Amazing. Thank you, Joe, for putting together this talk and thank you everyone else for being with us today. We do have questions in the Q&A. I'm sure you can see everything. So why don't we just go ahead and have you answer at your leisure. And we have until 45 minutes after the hour, so about eight minutes. So let's go.

Oh, there are questions showing up here. Okay. Yeah, so let me first address, first of all, thank you for having me and thank you for your time and attention. Let me first address the local LLM question.

So as we were developing DataBot specifically, DataBot really emerged from Claude Sonnet's ability to not only answer your data questions, but to be really sensitive about what would make sense to do next. And what we found is that with other models at the time, even the frontier models from other proprietary labs, the quality fell off to the point that, to me, it didn't make sense to use this tool with another model.

I do think that that situation has changed over the last few months. Certainly GPT-5 can do a pretty decent job, albeit it seems to require a lot more thinking, a lot more runtime token usage, to do so. But it seems very promising that we should be able to make it work well with things besides Claude. Local models have gotten a lot better over the last few months, or at least earlier this year they seemed to take a big leap forward. But I'm still waiting to see them get better before we can in good conscience recommend them.

That said, for Positron in particular, we have the ability to connect to any OpenAI API compatible endpoint, which I think somebody asked about, like, do you have the ability to connect to your organization's local LLMs? So we can now say, yes, if it hasn't shipped already, it will be shipping shortly. I think it's already shipped for the desktop versions of Positron and should be coming out for Workbench shortly.

So I think this does give you the ability to connect to a lot of different kinds of models. But I continue to think that today, you're not likely to have an awesome experience unless you're connecting to one of the Claude models. Actually, Claude Sonnet or Claude Haiku are both really excellent. And I think, partially, for these kinds of advanced agents, there's more tailoring to a specific model than you would think. A lot of what we do, it's almost like we're trying to fit a custom suit onto this body. And then everybody's asking, can I just swap this out for a totally different body? And it's really difficult, because we have so carefully measured and tweaked and tucked so that it fits just well with this model.

So I can't really guarantee that you'll get good results if you use it with something else.