Resources

Is that LLM feature any good? (Simon P. Couch, Posit) | posit::conf(2025)

Is that LLM feature any good?

Speaker(s): Simon P. Couch

Abstract: The ellmer package has enabled R users to build all sorts of powerful LLM-enabled tools. How do you test these features, though? How do you know whether a change to your prompt made any difference, or if a much cheaper model would work just as well for your users? This talk introduces an R port of Inspect, a Python framework for LLM evaluation that has been widely adopted by both LLM developers and tool builders. Attendees will learn about the process of, and the importance of, evaluating LLM-enabled apps empirically.

Materials: https://github.com/simonpcouch/conf-25

Subscribe to posit::conf updates: https://posit.co/about/subscription-management/


Transcript

This transcript was generated automatically and may contain errors.

A month or two ago, I was browsing social media, and I came across this post from Posit on LinkedIn. And it reads, Are you ready to plan your posit::conf adventure? The agenda is out now, packed with workshops, keynotes, and talks. And it linked to this chat bot that allows you to explore the schedule via an interface like ChatGPT.

This app was built by an attendee to Conf. His name is Sam Palmer. I think Sam is in the audience. If you're here, can you give us a wave? There's Sam in the back. Can we give him a round of applause?

So if anybody gave this chat bot a go before the conf, it's really cool. What Sam did was scrape all the data from the schedule. And so there's the talks, and the speaker names, and the rooms, and the times, and so on and so forth. And all that gets stuck inside a database. And then an LLM has access to this database via something called Retrieval Augmented Generation, or RAG. And so that LLM can then search inside of the database to respond to your questions.

So you could say, At what time is Simon Couch speaking? And hopefully it'll tell you right now.

So this chat bot was built with ellmer and shinychat, which are two R packages, as well as ragnar, which is an R package supporting RAG inside of R. And it's something like 20 to 30 lines of R code.

So I'm giving a talk. I see this chat bot. I'm like, I wonder how good it is at surfacing my own stuff. So I have a little narcissistic moment. I'm staring at my own reflection. I say, Is Simon Couch giving a talk? It says, Yes. Here's the date and time. So on and so forth. And I get progressively more and more arcane just to kind of see, like, what are the limits of this?

And at some point I ask, Are there any talks about evals? This is not really a fair question. Evals is sort of slang in the LLM space for large language model evaluation. But I give it a go anyway. And the model says, There's no talk explicitly focused on evals. However, you can go watch this other talk, which happens to be concurrent with mine. So it appears that nobody in this room typed in that question, or else you might not be here.

Why evals matter

So again, you can build an app like this in something like 20 to 30 lines now. These ellmer and shinychat packages are super powerful. And it's really cool how far you can get in such a short amount of time. At the same time, once you've started to build out a demo of an AI app, taking that app to production requires you to be able to test it and iterate on it quickly. And it's difficult to do that without an ability to evaluate quality and debug issues.

This is where the vitals R package comes in, which is a toolkit for LLM evaluation in R. vitals is an R port of a framework in Python called Inspect AI. So if you're a Pythonista: for the most part in this talk, I'm going to refer to vitals, but Inspect AI uses the same language and the same framework. This framework is widely adopted by many of the frontier labs as they're building the models themselves.

And also on the Python side, we have peers to ellmer and shinychat, which are chatlas and the appropriately named shinychat. But I'll focus on vitals throughout the rest of this talk.

In the rest of the talk, I want to remind you of two things. The first is that if you're building an LLM app, you really should be running evals on it. And the second is that if you've managed to build an LLM app, you really can build evals. Building evals is straightforward once you have an app built with ellmer.

So let's go ahead and start on the really-you-really-should-be side. I hope to convince you that you really should be evaluating your LLM product.

This is a quote from Hamel Husain, who is a leader in the AI eval space. He writes, Like software engineering, success with AI hinges on how fast you can iterate. So whether you're a data scientist or a software engineer, I don't know if you immediately agree with that statement, but I think I might be able to convince you that at some level you do, because you adopt tooling that allows you to iterate quickly.


So in software engineering, we adopt tools that allow us to make changes, like our code editors, and we have autocomplete, maybe LLM-assisted autocomplete, and now these agentic editors like Claude Code or Positron Assistant. But in addition to these tools to make changes, we have tools to evaluate quality, like unit tests. So if I go and make some changes in an R package I'm working on, I can both test that it works as I expect it to and that I didn't break anything else in the process. And we also have tools to help us debug issues, like the debugger and reprexes, and so on and so forth.

In AI, we also have all sorts of tools that allow us to make these sorts of changes to our apps. So we can engineer the system prompt, we can put together a RAG tool, like the one under the hood in this conf chatbot app, and so on and so forth, and there's a lot of effort in this part of the space. What about those other two steps? It's mostly like a vibe check. So I make some changes to the prompt, and I ask a couple questions, and I'm like, that seems reasonable. And so we have a look-see, and if you've tried to bring your AI demos to production before, you know that this can kind of become a whack-a-mole sort of process, where you might resolve one issue and, at the same time, accidentally introduce another one.

And so vitals allows us to iterate more quickly by giving us tools to evaluate quality and debug issues. I'll show you in a moment that when you run evals with vitals, you get these higher-level summaries of how well your app is performing on different axes, and when there are issues, you have the ability to drill down into the details to debug what went wrong.

Building an eval with vitals

So hopefully I've convinced you that this is something worth doing. Now I want to convince you that if you're a user of ellmer and you've made a chatbot already, you have the ability to make some evals as well. And that's because we designed vitals with ellmer specifically in mind, so that once you have an app, you can just plug it right into vitals; it's plug-and-play.

To talk through that process, we'll walk through the three pieces that compose an evaluation: a dataset, a solver, and a scorer. And to do so, we'll use Sam's app, the conf chatbot.

So first let's talk about a dataset. Minimally, a dataset is just a data frame with two columns. The first column is an input, and this is something that a user might ask of your app. So for example, on that last slide, I said, are there any talks about evals? The second column in that data frame is a target, and a target gives you corresponding grading guidance for that given input, or maybe even the answer itself. So a target for that last example might be: yes, Simon Couch will be giving a talk called Is That LLM Feature Any Good?

So I know this is a conference about serious data science, and I'm an R package developer, so I take data structures really seriously and yada yada. You can just use Google Sheets. It's okay. It's really sort of painful to type out these long structured text responses inside of a call to data.frame() or tibble() or tribble() or whatever it may be. This is a perfect use case for something like a Google Sheet or an Excel file. So again, this is just a data frame with two columns. One is the input, one is the target, and then you can read that into R or Python after.

So for example, another possible input you might ask is, are there any sessions about causal inference? And your grading guidance would be, yes, there's this workshop called Causal Inference in R by Malcolm and Lucy.
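As a sketch, that two-column dataset could be written out in R like this; the exact inputs and targets here are illustrative stand-ins for whatever questions matter for your app:

```r
library(tibble)

# A minimal eval dataset: one row per test case, with the question a user
# might ask (`input`) and grading guidance for that question (`target`).
dataset <- tribble(
  ~input,
  ~target,
  "Are there any talks about evals?",
  "Yes, Simon Couch is giving a talk called 'Is That LLM Feature Any Good?'",
  "Are there any sessions about causal inference?",
  "Yes, there is a workshop called 'Causal Inference in R'."
)
```

In practice, as noted above, keeping these rows in a Google Sheet or Excel file and reading them in is often more pleasant than typing them into a call like this.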

So we have our dataset. The next step is building a solver. If you've built a chat app with ellmer, you know it looks roughly something like this. You load the ellmer package and then initialize a client, which is just a specification of how you connect to a given LLM. So if I run chat_openai(), I've connected by default to GPT-4.1, but you can connect to Anthropic or Bedrock or Snowflake or Databricks or whatever it may be with ellmer. Then you make your changes: you write out a system prompt and you incorporate tools; in this conf chatbot app, that's a RAG tool, and so on and so forth. And once you're finished with that, you just drop it into live_browser(). This is a function in ellmer that makes use of shinychat and puts your client into an interface that looks something like chatgpt.com.
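A minimal sketch of such a client, with a placeholder system prompt standing in for the real one and the RAG tool omitted:

```r
library(ellmer)

# Initialize a client; chat_openai() connects to an OpenAI model by default.
chat <- chat_openai(
  system_prompt = "You help attendees explore the posit::conf schedule."
)

# Tools would be incorporated here, e.g. a ragnar-backed retrieval tool
# registered on the client; omitted for brevity.

# Drop the client into a chat interface built on shinychat:
live_browser(chat)
```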

So when I say that evaluation with vitals is plug-and-play, I mean that you can just plop your client right into vitals, and that's your solver. So you're done.

So up to this point now, the dataset has two columns: the input and then the corresponding grading guidance for that input, which is the target. We pass the input from the dataset to the solver, so we ask the solver, are there any talks about evals, and we get some response back. And so up to this point, we now have three data points. One way that we can score this is by concatenating all of these together into a string; you might have seen where I was going here. And we can say: you're assessing a submitted answer on a given task based on a criterion. Here's the task. Here's the submission, which is the response from the solver. And here's the criterion, which is that target grading guidance. Does the submission meet the criterion? And you pass that along to some LLM.

The first time I saw this, I definitely had this sort of reaction, like, are we really serious? This is the way that we're going to do this? But there's some strong empirical evidence that this generally works and tends to agree well with human raters when these systems are designed well. So on the vitals package website, as well as the Inspect AI website, there's all sorts of guidance about how to put together these systems thoughtfully. For now, we'll just take it for granted that, yes, this is the way that we're going to do it. This technique, where you use an LLM itself to score the response, is what they call LLM-as-a-judge. And if you're like, absolutely not, this is not the way that this is going to go, that's not a deal-breaker: there are also scorers that don't need an LLM at all.
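vitals ships a prebuilt scorer along these lines; assuming the scorer name as I understand it from the package documentation, this grading recipe looks something like:

```r
library(vitals)

# model_graded_qa() implements the LLM-as-a-judge recipe described above:
# it templates the input, the solver's response, and the target grading
# guidance into a prompt, and asks a model whether the criterion is met.
scorer <- model_graded_qa()
```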

Running the eval and the log viewer

But now we have a dataset, a solver, and a scorer. The container for these three objects in vitals and Inspect is called a task. And so we can use googlesheets4 to read in that spreadsheet that we put online, and then we situate the dataset, the solver, and the scorer inside of this task. This task has a bunch of methods. One of them is called eval(), which first runs all the solver inputs in parallel and then scores all of them in parallel. And so for these eight inputs, this took something like 11 or 12 seconds.
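Assembling the pieces might look roughly like the following; the sheet URL is a placeholder, and the Task, generate(), and model_graded_qa() names are as I understand them from the vitals documentation:

```r
library(googlesheets4)
library(ellmer)
library(vitals)

# Read the two-column (input, target) dataset from the spreadsheet.
dataset <- read_sheet("https://docs.google.com/spreadsheets/...")  # placeholder URL

# A task bundles the dataset, the solver (here, an ellmer client wrapped
# in generate()), and the scorer.
tsk <- Task$new(
  dataset = dataset,
  solver  = generate(chat_openai()),  # plug your app's client in here
  scorer  = model_graded_qa()
)

# Run every solver input in parallel, then score every response in parallel.
tsk$eval()
```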

In addition to a modified version of the task itself, something that both vitals and Inspect give back to you is this thing called the log viewer. And the log viewer is sort of a foundational tool for both evaluating quality and debugging issues. So at the highest level, you get these scores, like the accuracy: in this case, in six of eight responses, the solver from this conf chatbot app was rated as being correct. And we can scroll down through this app until we see where some sort of issue came up. So this score here was marked as incorrect with a red circle. And I can click into this and drill down into the details to figure out what went wrong.

So the user says, will there be any talks about making R data packages? The LLM then chooses to use this RAG tool to go search for relevant speakers. And the model doesn't find anything that it particularly thinks is compelling, but it tosses something out anyway. And it turns out that the target we were looking for is a talk from Kelly McConville this afternoon called Teaching Data Sharing Through R Data Packages. And so since that LLM system didn't surface the response that we were interested in, this is marked as incorrect.

I hope in this talk that I've managed to convince you that when it comes to running evals on your LLM apps, you really should be doing it, and also that it is something you're capable of doing. If you're building LLM apps already, you really should be evaluating them. And if you're able to build those apps, you can build evals.


I'm very grateful for the opportunity to be here. If you want to learn more about vitals, see the source code for these slides, or find the resources and quotes that I've pointed to throughout the talk, you can go to https://github.com/simonpcouch/conf-25. Thank you.

Q&A

Thanks, Simon. We do have some time for questions. Should we view evals as a replacement or complement to vibes, as in vibe coding?

Yeah, this is quite the debate. There was this interview with, like, the person who wrote Claude Code that came out a week or two ago, and at some point in this interview, he's like, yeah, we tried evals, and I don't know about that. Are they a replacement for vibes? I think, especially early on in the process, as you're building these demos, the vibes are the thing to focus on. Only once you start to discover patterns where you see the same sorts of errors again and again might you want to start introducing evals, to make sure you have an eye out for those errors.

When you get bad evals, how do you improve the output?

This is really where a lot of the focus in the LLM space is going right now: the changes that you can make to the app. The places that we tend to focus on when we're advising people on where to start with building LLM apps are stuffing relevant context into the prompt, and then giving LLMs the ability to surface information and run the code they need to answer your question using tools. The ellmer and chatlas packages try to make it as easy as possible to iterate on those system prompts and your tools, so that's a good place to start.