
Simon Couch - Practical AI for data science
Abstract: While most discourse about AI focuses on glamorous, ungrounded applications, data scientists spend most of their days tackling unglamorous problems in sensitive data. Integrated thoughtfully, LLMs are quite useful in practice for all sorts of everyday data science tasks, even when restricted to secure deployments that protect proprietary information. At Posit, our work on ellmer and related R packages has focused on enabling these practical uses. This talk will outline three practical AI use-cases—structured data extraction, tool calling, and coding—and offer guidance on getting started with LLMs when your data and code are confidential.

Presented at the 2025 R/Pharma Conference Europe/US Track.

Resources mentioned in the presentation:
- {vitals}: Large Language Model Evaluations https://vitals.tidyverse.org/
- {mcptools}: Model Context Protocol for R https://posit-dev.github.io/mcptools/
- {btw}: A complete toolkit for connecting R and LLMs https://posit-dev.github.io/btw/
- {gander}: High-performance, low-friction Large Language Model chat for data scientists https://simonpcouch.github.io/gander/
- {chores}: A collection of large language model assistants https://simonpcouch.github.io/chores/
- {predictive}: A frontend for predictive modeling with tidymodels https://github.com/simonpcouch/predictive
- {kapa}: RAG-based search via the kapa.ai API https://github.com/simonpcouch/kapa
- Databot https://positron.posit.co/dat
Transcript
This transcript was generated automatically and may contain errors.
Thanks, Hari. Yeah, good morning, folks. Happy to be here.
I find a good bit of AI discourse pretty annoying. If you're a LinkedIn user, you've probably seen a post that looks something like this before, where it starts out with some, like, vague, bold assertion, like, AI will change everything about data science. And then, like, in order to read anything about the actual content of the post, you have to, like, click in and actually engage, so it gets a boost in the algorithm.
And then there's some sort of picture, usually the imagery we've converged on is a humanoid robot typing on something between a laptop and a desktop, and it has, like, 200 keys on the keyboard.
What AI discourse gets wrong about data science
So in these posts, for one, robots, or in reality, API keys to LLMs, are free or happily paid for by someone else. In reality, when these big new feature drops happen from the major labs, they're often quite expensive. And even with the ones that have been around for almost a year now, like Claude Code, a day of development where you're really just using the tool for eight hours can easily rack up 50 or 100 bucks in billing.
Another sort of assumption that underlies a lot of these posts is that the data being scienced can just be sent straight to OpenAI's servers or Anthropic's servers or Google's, when in reality, the data that many of us work on day to day is confidential, and we can't just send it over to any company that tells us they can do interesting things with it.
And finally, there's often this assumption that data science can be one-shotted, which is this term in the AI space, meaning you just like pass along your data, and then you like go make a coffee or have lunch or something, and you come back to a complete analysis.
There's a lot about real world data science that doesn't get represented when we make these assumptions. So in reality, for one, frontier LLMs cost money, right? Even technologies like Claude Code, again, if you're using it all day, you can easily rack up a bill that you probably don't want to foot yourself.
Data science happens on mostly sensitive and confidential data. So whether that results from clinical trials or you're pulling something from an EMR, this data is subject to privacy constraints that we have to be conscious of. And so you can't just make use of the latest and greatest from any lab if you need those privacy guarantees.
And finally, in the real world, data science is messy and it's subtle and it's context-rich. The problems that data scientists solve day-to-day aren't just like data set in, insight out. There's all sorts of context and understanding that we bring to data science. Integration with data sources and tools that are outside of maybe the most obvious choices. And so data science really needs that like real human cognition that LLMs can be proxies for, but don't truly have.
What's possible with ellmer in R
So I'm not just going to be ragging on folks on LinkedIn throughout the rest of this talk. I want to focus on two big ideas. The first one is that I'd like to show what's currently possible in R with ellmer. ellmer is a package that lets you talk to LLMs in R. It's made leaps and bounds over the last year. And I think it's like the best interface to program against LLMs that there is.
And then I also want to speak to specifically the constraints that people doing data science in organizations, and specifically in organizations like those in pharma, what that data science actually looks like, what using LLMs in those circumstances actually looks like. And so I want to help you imagine making it work in practice. But we'll start first with a glimpse of what's possible in R.
Again, the ellmer package is a package that allows you to talk to LLMs in R. And it enables three sort of boosts for data science that I think are underappreciated. And I want to go through those three in order.
The first is about extracting structured data from unstructured data. So that could be free text fields that live in an EMR or result from some clinical trial. Those could be images. LLMs are incredible when situated in the right way to extract structured data from these unstructured sources.
The next thing is tool calling. Tool calling is a way to give LLMs the ability to run deterministic, quote, normal software. And equipping LLMs with tools fills in a lot of their gaps, the things we'd expect them to be good at but that they actually aren't.
And then finally, coding and specifically coding agents. LLMs are quite strong in terms of generating code to carry out a given task. And so if you provide them with the right scaffold, you provide them with true human cognition, they can be really helpful in writing data science code.
So we'll start out with just a quick intro into what it looks like to talk to an LLM via ellmer in the first place. And there's sort of three high level steps here. The first one is loading the package, the second is creating this chat object, and the third is calling a method on that object.
There are something like 20 different functions inside of ellmer that start with this chat underscore prefix, and then they have the name of some sort of provider. So that could be one of the three major labs: Anthropic, OpenAI, or Google Gemini. That could also be a local model on your own machine, via chat_ollama. And that could also be a private or secure deployment of an LLM that your organization might have deployed.
So if your organization uses Bedrock or Databricks or Snowflake or some other private vendor, you can probably connect to that vendor via ellmer. But regardless, whichever provider that you choose, you end up with this chat object. And the chat object has a bunch of methods attached to it. So you can put a dollar sign after, and there's all sorts of functions that can act on the chat object.
And so we'll use this $chat() method, which is the easiest entry point to start talking to these LLMs. So I could say something like, who are you? And because I have made a connection to the Anthropic API, I'm talking to Claude Sonnet, and the model will tell me that it's Claude.
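As a sketch, those three steps look something like this, assuming an Anthropic API key is available in your environment; any other chat_*() constructor works the same way:

```r
library(ellmer)

# Create a chat object; swap in chat_openai(), chat_google_gemini(),
# chat_ollama(), etc. to use a different provider
chat <- chat_anthropic()

# $chat() sends a message and streams the model's reply to the console
chat$chat("Who are you?")
```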
Structured data extraction
Through this interface, we can take advantage of those three different capabilities to do data science. So first, we'll talk about turning unstructured data into structured data.
How are our regex chops in the crowd? If I saw a set of sentences like this, and I needed to extract name and age from this data, I don't even know that I could write a regex that makes this happen. I don't even know if that's possible.
These are sort of contrived examples, but we can imagine in recordings of patient conversations or interviews through a clinical trial or something, we have all sorts of these unstructured free text fields, and we need to extract information from them. So we have these six different sentences where at some point somebody says their name and their age. And using traditional software, using something like regex, this would be a really difficult problem to solve.
This is something that LLMs are actually quite good at. So using that chat function, we could say something like, extract the name and age from each sentence I give you, and then just pass through each of those sentences one by one. As you can see in these results, the LLM is quite good at pulling out both of those pieces of information.
This does kind of put us in a situation where we still need to use regex, because the LLM has outputted more free text: it just writes `**Name:**` and then the name, and it does a similar thing for the age. So we would still have to process that a little bit.
It turns out that we can do that explicitly. We can get an output that looks a little bit more like this, a normal R list where there's a name field and an age field.
The way that we do that is via a different method on this same object, and that method, instead of being called chat, is called chat_structured. So we pass in the prompt just as before, but we also pass in this type object. Type objects are just specifications of the format of the response that you would like to get back. So in our example, we can say we want the name, and that's going to be a string, and then we want the age, and that's going to be a number. And if we pass that along, we get an R list back with the name and age.
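A minimal sketch of that, with a made-up example sentence standing in for the slides:

```r
library(ellmer)

chat <- chat_anthropic()

# A type object describing the shape of the response we want back
type_person <- type_object(
  name = type_string("The person's name."),
  age = type_number("The person's age, in years.")
)

# Returns an R list with $name and $age instead of free text
chat$chat_structured(
  "Hi, I'm Greta and I'm 42 years old.",
  type = type_person
)
```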
This sort of thing is super useful from going to free text, unstructured data, to tables. Gets you to happy data frame land as quickly as possible.
And you don't just have to do this one by one. You can do this in parallel with ellmer, via a function called parallel_chat_structured. This is parallel in the sense of a ton of HTTP requests being sent off at once and then collected as they come back, not in the sense of spinning up a bunch of R processes on your machine. And so this means that you can pass along all six of those prompts at once, and using a model like Claude Sonnet, you get a response back in a second or something like that. And so this enables you to process that data at the scale that you actually need to.
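Roughly, with a few made-up sentences standing in for the six on the slides (in the ellmer versions I've used, the result simplifies to a data frame, one row per prompt):

```r
library(ellmer)

chat <- chat_anthropic()

sentences <- c(
  "Hi, I'm Greta and I'm 42 years old.",
  "Theo here, age 29.",
  "My name is Priya and I just turned 35."
)

type_person <- type_object(
  name = type_string("The person's name."),
  age = type_number("The person's age, in years.")
)

# Fires off all of the requests concurrently and collects the results
parallel_chat_structured(chat, sentences, type = type_person)
```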
It's not just text that you can use this structured data extraction with. You can also do so with images. So if we imagine that I have this directory called animals, and there's a bunch of paths to images inside of that directory, I can use the function content_image_file() on each of those paths to load in the images, and then pass them along in the same way using parallel_chat_structured(). So if I have these images of animals and I want to know which animal it is and what the background color of the picture is, I can also specify that in a type object and send it along.
So you can imagine inside of a clinical trial, if there's like imagery or some other image information that is collected as part of the trial, you can process it with these models.
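A sketch of the image version, assuming a local animals/ directory containing image files:

```r
library(ellmer)

chat <- chat_anthropic()

# Paths to every image inside the (assumed) animals/ directory
paths <- list.files("animals", full.names = TRUE)

type_animal <- type_object(
  animal = type_string("Which animal is pictured."),
  background_color = type_string("The background color of the picture.")
)

# content_image_file() reads each image so it can be sent as a prompt
prompts <- lapply(paths, content_image_file)

parallel_chat_structured(chat, prompts, type = type_animal)
```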
So at a high level, this is what's possible with structured data extraction with LLMs. And this is like such a useful tool in getting yourself into a data frame that you know how to work with, with your usual dplyr and tidyr and ggplot and things like that.
Tool calling
The next capability that I want to talk about is something called tool calling. And practically, this is a super helpful capability when you're trying to get LLMs to help you with certain tasks.
So one thing that LLMs can't do is access real time information. This just isn't how they work. They have a knowledge cutoff. They're trained on a bunch of information. Natively, they have no way to learn what the current date is. So if I initialize this chat object and I say, what day is it today? Claude will respond to me and say, I don't know.
If you're using an app like chatgpt.com or claude.ai or whatever, they'll often inject that information into the prompt so that it appears that they do know what the date is. But that's scaffolding on top of the model rather than the raw model itself.
So what we can do is provide a tool to the model. A tool is just a function and then some documentation on how to use the function. So, for example, I could make this tool called today, which gets today's date. And all it does when it's called is just call the Sys.Date() function. And so we describe this function as "gets today's date" to the model. The function has no arguments, so we just provide an empty list.
And then we use yet another method on this chat object, which is the register_tool() method. The tool() function here is exported from the ellmer package, and it just allows you to attach some metadata to a given function. And once you've done that, you can register it with the chat.
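Putting those pieces together looks roughly like this (the tool() argument names have shifted a bit across ellmer releases, so check ?tool against your installed version):

```r
library(ellmer)

chat <- chat_anthropic()

# A plain R function the model will be allowed to call
today <- function() as.character(Sys.Date())

# tool() attaches the metadata the model needs: a name, a description,
# and (here) an empty argument specification
chat$register_tool(tool(
  today,
  name = "today",
  description = "Gets today's date.",
  arguments = list()
))

# The model can now answer by requesting a call to today()
chat$chat("What day is it today?")
```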
So then if I go back to that chat object and I ask the same question, I say, what day is it today? The model can choose to return a message back to me in this structured way, which ellmer will see and say, oh, this isn't a message that's intended for the user. This is a message that I can process myself and immediately return a response back. So the LLM decides to send me this special formatted message that says call this function today.
And the function returns the date at the moment at which it was called. So this is August 8th, 2025. So then it's again the model's turn to do whatever it wants to do. And in this case, it's just going to situate this response from the tool inside of its message back to you. So it can say today is August 8th, 2025. And it happens to know that it's Friday because LLMs are weird.
This is a strange concept to wrap your head around at first. So I just want to like actually diagram this out and make sure this is clear what's actually happening here. So this is me last week. And last week, I was having a conversation with this humanoid robot typing on a laptop slash desktop. So last week, I send this message over to the humanoid robot. And I say, what day is it today?
So the only thing that the model can do is respond back with another HTTP request. Except that it can format that request in such a way that ellmer will see the response. And it will realize that the response shouldn't be shown to me. So I never see this message, please call today. ellmer sees it and it processes it automatically and returns the result back to the model.
So again, the humanoid robot sends this specially formatted message that says, please call today. The response is returned to the robot. And then the humanoid robot can situate that response in a response back to me.
So this is what we call a complete tool calling loop, where a model chooses to call a tool. It gets a response back, and then it chooses what to do next. So often, a model can call a tool. It can observe the result. And then once it observes the result, it can either respond back to me, as it did. Or it can just continue to call tools until it's accomplished the task that it would like to do.
Coding agents
So LLMs, through their hazy recollection of the entire Internet, are pretty good at generating code. And if we give LLMs the ability to run code in a thoughtful way, we provide them with a scaffold that helps them assist us as best as they can. Then these models can be pretty helpful at generating data science code.
So again, this is a callback to the tool calling loop. When people say coding agent, which is a pretty popular term now, this can refer to agents like Claude Code or Codex, and also agents built by Posit, like Positron Assistant or Databot. This is LLMs calling tools in a loop.
And usually, when we say something is an agent, we mean that those tools give the model the ability to read and write state. So what do I mean? If I make a fresh chat object, and in that chat I say, delete the CSV files in my working directory. If the model were given the ability to run code, it could probably figure out how to do that pretty easily.
So, for example, this model has chosen to show me how to do this in Bash if it were to be given the ability to run code, or it can just tell me this is the code that you would use to make that happen. What this model is not actually doing is going and deleting the CSV files, because it doesn't have tools that give it the ability to read and write state.
We can do this with two ellmer tools. The first one we will call ls, and that just calls the dir function in R, which will list the files inside of the current directory. So this tool will give the model the ability to read state.
So if we return to that fresh chat again and ask the same question, delete all the CSV files in the current directory, the model can choose to call the ls tool, which will return a bunch of information about the objects that are inside of that directory. The model can identify which of those are CSVs and call another tool.
So this is an example of that full tool calling loop, where the model calls a tool, it gets a response back, and before it says something and yields control back to the user, it decides to call another tool before it does so. So it will call this remove tool, and it will provide these two paths to the CSV files that it found, and then the tool will return.
So in this case, because all the rm tool is doing is writing state, it just says, okay, I did that, done, which is what this true represents. And only at this point will the model complete the tool calling loop and yield back to the user saying, done, I deleted those files.
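A sketch of those two tools, one reading state and one writing it (same caveat as before about tool() argument names across ellmer versions, and note that this really will delete files if you run it):

```r
library(ellmer)

chat <- chat_anthropic()

# Read state: list the files in the working directory
chat$register_tool(tool(
  function() dir(),
  name = "ls",
  description = "List the files in the current working directory.",
  arguments = list()
))

# Write state: delete the files at the given paths
chat$register_tool(tool(
  function(paths) file.remove(paths),
  name = "rm",
  description = "Delete the files at the given paths.",
  arguments = list(
    paths = type_array(items = type_string(), description = "Paths to delete.")
  )
))

# The model can now ls, pick out the CSVs, and rm them
chat$chat("Delete all of the CSV files in the current directory.")
```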
Databot and Positron Assistant
The first of those Posit coding agents is Databot, which is an assistant for exploratory data analysis inside of Positron. When you install Databot, it looks like a typical chat window, except that it has access to your active R environment. So imagine I've loaded this data already into my R session, it's called forested; I can tell the coding agent Databot to go ahead and check that data out and make some ggplots.
So first thing, the model will take a look at the data and make sure that it knows the column names to use and what those column types are. And then once it's done so, it'll glimpse that out through running R code and then start making ggplots.
After a couple plots, the model will then yield control back to the user. And this is like one of the really nice, powerful parts about DataBot, which is that it's sort of tuned to just call a couple tools, push the analysis forward a little bit, and then yield control back to the user, maybe provide some suggestions on where to go from there. But this really keeps the user in the loop through that process.
So how is this agent reading and writing state? Often in these applied contexts, the distinction between reading tools for reading state and writing state is a little bit blurred. Because if you give a model the ability to run R code, generally, then it can do anything that it can do via bash even.
So the run-R-code tool, which allows the model to write R code and then see its results, as well as show the user the results, can both read and write state. And then there's also this create-Quarto-report tool. So once you're finished with some EDA, you can take that initial analysis and situate it inside a persistent, reproducible Quarto report.
For more general coding assistance tasks, Positron comes with one more coding agent, which is called Positron Assistant. And again, Assistant is focused a little bit more on coding than EDA specifically, but it can help along with some EDA as well.
To demonstrate this, we can take that same forested data and imagine we have situated it inside of this readme file. I can ask the coding agent, could you add some ggplots to the end of this file? And in this case, the model will read the readme.rmd file, it will take a look at the forested dataset, and then it will add some code to the end of the file.
Positron Assistant is especially nice because it's really tightly integrated with many different elements of the IDE. So it can see your active R session, it can see the files you have open, and it's really easy to give it the right context.
But there's also these special tools that allow it to access different parts of the IDE. And its implementations of tools display tool requests really nicely inside of the interface. So, for example, if the Positron Assistant wants to edit a file, like adding those lines to the readme, that's displayed nicely in that diff format inside of the editor. So it's really easy to see the changes that it wants to make.
Sidekick: a coding agent for RStudio
So up to this point, I've just talked about coding agents for Positron. I know that many of the folks here are probably still RStudio users, and I want to call out a quick coding agent that I've been spending some time on in recent weeks, and I hope you might be able to get some use from it.
That coding agent is called Sidekick. Sidekick is built entirely in R, and it runs inside of a Shiny app. So if you run this function, side::kick(), the first thing that it will do is spin up a Shiny app that runs in a background job inside of RStudio.
And it's a little hard to see because the viewer here has a light background, but there's a Shiny app pulled up on the left here that allows you to converse with the LLM via ellmer. So Sidekick can do the normal Claude Code or Codex coding agent sorts of tasks. So for one, we could ask this agent to refactor some code into a helper.
So these six lines, seven lines are pretty constrained in what they do, and we could probably make use of it in a few different places inside this package. So the model will first choose to run this code search tool that will help it find the code that I just passed along to it. And then it will read the whole file once it finds the function or the lines of code that I mentioned.
Once it's found those lines of code, it can then propose edits to the files using an edit files tool, and that tool will display the changes that the model wants to make inside of a nice diff format. You can choose to reject or approve the edits, but once you've approved them, those changes will show up inside of the real files.
You can also start new chats, and chats are persistent inside of Sidekick. The Sidekick agent is also able to run R code inside of the active session. It can see your active session, and you'll be able to see the results of the code that it runs nicely formatted inside of this UI.
So if you've had an eye out for a coding agent for RStudio, I would encourage you to give Sidekick a try. This is freely available on my GitHub in the link shown at the bottom, and this will also be in the notes for the talk that I'll share at the end.
I just want to quickly say: in Joe Cheng's keynote at posit::conf, Joe said that some AI functionality was coming for RStudio. I just want to clarify that this is not that. I'm really excited about the work that we're doing to integrate AI into RStudio for the users that do want that in the coming months. But this is sort of an early experiment for me.
Making it work in practice with secure deployments
This is the extent to which I'll talk about what's possible in R right now. And so I want to come back and revisit that second part of the conversation, which is what it looks like to actually make this work in practice.
So far in this talk, I've just talked about the Anthropic provider, which does not have zero data retention by default. And so you probably could not pass data that is confidential to these APIs. So how can you do that?
If I can, I would like to attempt a little mind reading here. And so if this statement applies to you, just say yes in the chat. Is it the case that your workplace has some approved secure deployment of an LLM? So in like AWS Bedrock or Databricks or Snowflake or some other provider of an LLM that you have the thumbs up to provide the data that you're doing data science on to that deployment?
Is it the case that the LLMs that you have access to via that deployment are, like, frontier-ish? Like, at this point, you might have Claude Sonnet 4.5, but you probably have Sonnet 4. You probably don't have Claude Haiku 4.5, which came out like a week and a half ago or something like that. Maybe you have access to DeepSeek models or some other open source models, or maybe you just recently got access to, like, the GPT-5 drop.
I would guess that when this deployment was first announced to you, it was via something like a chat bot. And it might be the case that that LLM has an API that it's surfaced from, and somebody on your team might have figured out how to talk to it via ellmer, but that might not be the case.
Okay, if you relate to some of these points, here's what I have to say about how you can make use of these tools using that approved secure deployment. It's likely the case that that deployment surfaces an OpenAI API-compatible endpoint.
When I say this OpenAI API-compatible endpoint, I don't necessarily mean that the model is an OpenAI model, a GPT-5 or something like that. What I mean is that a few years ago, when OpenAI was kind of the first big player that made this splash with the first release of ChatGPT, they also made that model accessible via this API called the v1 chat completions API. And since then, that API format has been the format by which almost all of the players have been surfacing their models.
You can connect to any deployment of an OpenAI API-compatible endpoint via ellmer. And the way that you do that is by setting a couple arguments to chat_openai(). You can probably get away with just setting the base_url and api_key arguments. Those arguments allow you to point the connection away from a given model from the OpenAI API and toward your own secure deployment.
And we've seen that a couple of our customers need the ability to set API headers specifically, so we've also provided an entry point for that. So if you wanted to use this sort of endpoint inside of Sidekick, I would take a look at these arguments inside of chat_openai(). The same goes for Positron Assistant. This is still preview functionality and isn't necessarily something that we're screaming from the rooftops yet. But if you are on the bleeding edge and you want to give this a try, you can connect to custom providers inside of Positron Assistant. Again, the edges are sort of rough here and we're still working on tuning this, but this is a possibility.
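In code, that might look like the following; the URL, model name, and environment variable here are hypothetical placeholders you'd swap for your organization's actual deployment details:

```r
library(ellmer)

# base_url points at your deployment's OpenAI-compatible endpoint;
# all of the values below are made-up placeholders
chat <- chat_openai(
  base_url = "https://llm.example.com/v1",
  api_key = Sys.getenv("INTERNAL_LLM_API_KEY"),
  model = "my-deployed-model"
)

chat$chat("Who are you?")
```

If your deployment also requires custom request headers, recent ellmer versions expose an entry point for those as well; see ?chat_openai for your installed version.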
Closing thoughts
This was Practical AI for Data Science. In like 2017 or 2018 or something, when support vector machines were what we referred to as AI, Posit really leaned into this phrase "serious data science," which I remember sort of speaking to: OK, there's a ton of hype about this AI stuff and a lot of it is pretty annoying, but there's also some real utility here. And if we're thoughtful about the way that we integrate this into our workflows, we can really get something out of it.
And I sort of see us now on a serious AI moment where we see some real utility with these tools. And we think that if you integrate them thoughtfully into your work, you can really get some value out of them.
So if you want to stay up to date with what we're thinking about when it comes to AI at Posit, Sara Altman and I are working on a newsletter that releases every two weeks. And you can check that out on the Posit blog under the AI newsletter tag. There will be a link to this in the repo that I'll mention now, which is github.com/simonpcouch/rpharma-25. This repository has a bunch of links to different things that I've mentioned throughout the talk, and I hope you can get some mileage out of it. Thank you for having me.
Q&A
Well, thank you so much, Simon. So many wonderful insights here. And we really thank you for your contributions here. And we may have time for one or two questions here, if you don't mind. I've recorded them from the chat.
First one comes from Tiana. She asks, if you ask the LLM to extract specific information from free text that wasn't actually there, is there a risk that it might return hallucinated data? And if so, is there a way to account or control for that?
Yes. So it is absolutely possible that a model could return hallucinated information here. The easiest entry point that I would think about for reducing the chances of this happening is providing some way for the model, in a specific format, to say "I don't have that information here," which in most cases will get you a lot of mileage. I would also say this is the sort of context where using the best possible, most expensive model that you have access to will greatly increase the accuracy here.
When you're building a system like this and you want to make sure that it behaves in the way that you'd like, there's a package called vitals, which is a sort of companion package to ellmer. And vitals implements LLM evaluation in R, which would allow you to measure empirically like, OK, here's 100 real use cases of trying to extract this information. If I've labeled these 100 and I know what the right answer is, how often does the system get it right? And so you can provide some evidence as to how that works.
And I've been watching vitals really closely. You've been one of the key leaders in this space on debunking a lot of the AI myths out there, while being really practical about how we can assess performance and where we use it in our day-to-day data science. So it's been wonderful to see those resources come through. So certainly, yeah, thank you.
Thank you again so much, Simon. We'll make sure to put the links that you mentioned in the chat as well as the recording when we get that up on the YouTube channel. So again, really appreciate that. And you also teased up the fact that we'll be going in depth in the data bot actually this afternoon with Joe Cheng himself. So you gave us a good little teaser for that. And we're going to dive right into making those details you gave a snippet about today. Great. Thanks, Eric.


