
Max Kuhn - Measuring LLM Effectiveness
For information on upcoming conferences, visit https://www.dataconf.ai.

Measuring LLM Effectiveness by Max Kuhn

Abstract: How can we quantify how accurately LLMs perform? In late 2024, Anthropic released a preprint about statistically analyzing model evaluations. The concepts are on target, but the statistical tactics have narrow applicability. A simpler statistical framework can quantify LLM performance and applies to many more scenarios and experimental designs. We'll describe these methods and show an example.

Bio: Max Kuhn is a software engineer at Posit PBC (née RStudio). He works on improving R's modeling capabilities and maintains about 30 packages, including caret. He was previously a Senior Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut and has been applying models in the pharmaceutical and diagnostic industries for over 18 years. Max has a Ph.D. in Biostatistics. He and Kjell Johnson wrote the book Applied Predictive Modeling, which won the Ziegel award from the American Statistical Association, recognizing the best book reviewed in Technometrics in 2015. He has co-written several other books: Feature Engineering and Selection, Tidy Models with R, and Applied Machine Learning for Tabular Data (in progress).

Presented at The New York Data Science & AI Conference by Lander Analytics (August 27, 2025). Hosted by Lander Analytics (https://www.landeranalytics.com).
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Our next speaker didn't give us a fun fact. He said I could make up whatever I want about him. I said you would make up. Would, could, can, all the same thing. A lot of you probably know him from his great contributions to open source and the fantastic work he's done. A lot of you might know him from the great books he's written. His original book is about this thick. It balances all my other books out. Really good reading too, though. He's good for reading; don't just put stuff on top of it.
But he's also just a very lively person, has some great stories, and is really fun to hang out with. He's a great person to be around, and someone I really treasure knowing for as long as I have, and such a good friend. So please, everyone, welcome Max.
Great. Thanks for coming back after the break. I'm here to talk about how we would measure effectiveness with LLMs in terms of their evaluations. There are a fair number of links here, so if you go on GitHub, I'm topepo, and you'll find the 2025, oops, NYR repository right at the top. Forgive me.
Motivation for measuring LLM effectiveness
All right, so we pose some question or questions to LLMs. They're constantly changing. There are plenty of vendors, and even with the same vendor at the same time, queried a few times, you can get different results. A lot of the time, we want to measure which one is working best for us. If you're a developer or something like that, you'd hopefully have a qualitative or a quantitative way to answer: if I'm developing a prompt, is it getting any better, or am I making it worse? So we basically want to have results, and I'm here to talk about the ways we can analyze those results.
So this is with Simon Couch, who used to be in the tidymodels group working for me. He's been so adept at making really good small-scale AI tools in R that he's now in our new AI group. He wrote a post yesterday about something very much related to this: he's working on an assistant for tidymodels, and he did a lot of work to get the LLMs to recognize the right syntax for tidymodels. Then all the new models came out, and they just worked better without all the work he had previously done. It seemed like a good example, and it's an interesting read.
It's a good example, though, because it's not so much volatility as that everything is very dynamic. It'd be nice to have a way to measure effectiveness. So hopefully, one thing we'll eventually get to is a setup where we have some leader LLM from the last time we looked at things, and a set of questions that we want to pose, and we can just turn that loose and measure. I have it written as LLM or model here, but it's really the combination of the model with your prompt, and whether you're using RAG or MCP or something like that: however you get the system working.
And hopefully, you can then make some inferential statement about the difference in accuracy between these models. That's what we hope to get to.
Inspect and vitals
So I thought I'd mention two things: Inspect and vitals. Inspect is a Python-based framework that the UK's AI Safety Institute created. To be honest, the only reason I'm familiar with it is because the person who owns our company went off to the UK and worked on it for a long time. It's kind of an amazing feat of engineering. It's built for very, very large-scale work; it can handle the crazy, agentic stuff that large companies would build. But it's a little overkill if you're just somebody doing package development, especially in R, and you want to measure these things. So Simon created a package called vitals. It takes a little bit from Inspect, and, I don't know if he would say it this way, but I think of it as unit tests for LLMs.
So you write a list of questions that you want to evaluate continually. You can expand them, but they're basically a static set of questions you want answered, so you can see: are these models better at this corpus of questions? So it's almost like a unit test. The example we're using is not a new one; you can tell by the models listed there. There are 25 questions related to R. Some of them are R6, or tidyverse, or base R questions, so fairly diverse. Simon ran them within vitals using GPT 4.1, Gemini 2.5, and Claude 4 Sonnet. And we know that a lot of these models are stochastic, so if you run them more than once, you'll get different answers. So for each one, vitals lets you say: measure it three times, or however many times you want. So he ran three repeats of each.
So we have 25 things measured over three models, three times each. A little bit of notation and thought about the data: these are fairly small-scope questions, so we can programmatically rate them. I'm going to punt on all of the different ways you might take the output of LLMs and measure it in terms of correctness; Bill talked about that yesterday. But we could very easily say that things were either incorrect, partially correct, or correct. Partially correct could mean, say, I asked it to do a geom_smooth and it did that well, but then it hallucinated a link to the documentation or something like that. It did a decent job, but not quite all the way there.
So there's a little bit of notation; I'll show you an equation in a minute where this will matter. We have C equal to three outcome levels: incorrect, partially correct, and so on. Three LLMs, so that's P equal to three. And I'm going to use the term epochs for replicates: if we run each of these three times, that's three epochs. I'll be talking about these ordinal outcomes as the thing we're trying to measure. This is not the most complicated model in the world, and if you were doing something like binary outcomes, proportion correct, or even some numeric score, the model actually gets a lot simpler than the one I'll show you in a minute. So the system I'm going to talk about here can work on pretty much anything.
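To make the layout concrete, here is a minimal sketch of the design as a long-format table, one row per scored response. The question labels, model names, and column names are stand-ins; the real data come from Simon's vitals run.

```python
from itertools import product

# Hypothetical stand-ins for the real evaluation data.
questions = [f"q{i:02d}" for i in range(1, 26)]  # 25 questions (the experimental units)
llms = ["GPT 4.1", "Gemini 2.5", "Claude"]       # P = 3 models
epochs = [1, 2, 3]                               # 3 replicates ("epochs") each

# One row per (question, model, epoch): 25 * 3 * 3 = 225 scored responses.
# Each score would be one of the C = 3 ordered levels.
rows = [
    {"id": q, "llm": m, "epoch": e, "score": None}
    for q, m, e in product(questions, llms, epochs)
]

print(len(rows))  # prints 225
```

The key point of this shape is that rows sharing an `id` are replicates of the same question, so they are correlated; the model described later accounts for that.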
The Anthropic paper and a better statistical framework
Here's a visualization of the data. You can see the questions on the y-axis, and we've color-coded the answers. When I inspect this with my eyeballs, I think it's maybe 40% or 50% correct, just based on the colors; that's my visual assessment. But you can see some go from completely correct to completely incorrect, or land in between at partially correct. And some of them, at the top and bottom, are pretty consistently good or bad.
So last year, somebody from Anthropic released a manuscript on statistical approaches for rating LLMs. It's actually quite a good manuscript, especially in terms of the spirit of what it's trying to do. It talks about how, if you measure these things multiple times, you have to account for the within-question correlation, and a lot of the ideas are good. The part that stood out for me is the details of how they do it. It's not wrong, but it's not especially good methodology. It's also a situation where, for different experimental designs, you would need different equations, and as statisticians we'd never think about it that way. So I'll talk about a more top-level approach where you won't be deriving equations for these things; that's already done for you. The methods are very versatile, and they can work for pretty much any experimental design.
And I'm a person who's been mostly focused on estimation problems my whole career, and machine learning, and I'm a statistician, so for me to stand up here and say we need statistical inference is kind of a change. But that's what we need, and fairly standard kinds of it at that. These problems have, believe it or not, already been solved. Just as an example, generalized linear models are 51 years old. That's crazy to me. A framework like that solves almost all the problems we're working on. So although the paper is really good, their derivations of equations for this and that were really unnecessary, because we already have frameworks to do what we want.
Most of the experimental designs that we would do fall into basically ANOVA models, like analysis of variance. Like we have these LLMs and their groups, and we want to make comparisons between them. There might be like some sort of numerical covariate, like the number of tokens or the cost that we might put in these models. But there's nothing crazy about them. And the experimental design is probably fairly balanced and complete. So they don't seem like they're hard to fit models to. So there's a lot of basic things that, at least as a statistician, you would maybe get in like your first year of graduate school, if not sooner, that can handle like this design as well as other types of outcomes.
The proportional odds model
So this is really the only equation we'll see. It looks kind of daunting, but if you've ever seen the equation for logistic regression, that's two-thirds of what's here. On the left-hand side, you basically have a logit, just like you would with logistic regression. This is called a proportional odds model. It's built for cumulative or ordinal outcomes, like incorrect, partially correct, correct. The only real difference on the left-hand side between this and logistic regression is that you're modeling a cumulative probability. So if you have three outcomes, like we do, you build two logistic regressions. The constraint is that the probabilities for incorrect, partially correct, and correct have to add up to one, so you can't fit three independent models; there's a dependency in there. You fit two of the three, and then you can infer the third probability from those two models.
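To see that constraint concretely, here is a small sketch, with made-up cutpoint values, of how two cumulative logits determine all three class probabilities:

```python
import math

def inv_logit(x):
    """Inverse logit: map a logit back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical cutpoints for the two cumulative logits:
# Pr(incorrect) and Pr(incorrect or partially correct).
theta_1, theta_2 = -0.4, 0.9  # must satisfy theta_1 < theta_2

p_incorrect = inv_logit(theta_1)
p_partial   = inv_logit(theta_2) - inv_logit(theta_1)
p_correct   = 1.0 - inv_logit(theta_2)

# Two fitted logits pin down all three probabilities, and they sum to one.
print(round(p_incorrect + p_partial + p_correct, 10))  # prints 1.0
```

This is why only C - 1 = 2 logistic-regression-like equations are fitted; the third probability is implied.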
On the left-hand side, the logit is a probability: if c here were equal to two, the top would be the probability of being incorrect or partially correct, and the bottom would be the probability of being correct. So it's very much like a logit, where you have a probability in the parentheses on top and its complement on the bottom, except it's cumulative. On the right-hand side, the betas are the same thing you would see in linear regression or logistic regression. These are the things we actually care about. So if we ordered the models as ChatGPT, Gemini, and Claude, beta two would be the estimated difference in, let's say, accuracy between what GPT gives you and what Gemini gives you. And the third coefficient would be the difference between GPT and Claude.
Now, the two new things here: this theta is necessary for the ordinal regression part. It's basically an intercept that is specific to the level of the outcome being modeled. And the alpha here is not always needed, but it is for our design, because we've taken each question and replicated it three times. We've taken the same setting and just repeated it. In statistics, we call the question the independent experimental unit, but the rows in our dataset aren't independent: if the three replicates are in rows one through three, those results are correlated with one another, and they're more like each other than they are like any of the other rows. If you've ever had a lecture on a paired t-test versus a t-test, it's the same idea. If we don't model that correlation that's already baked into the data, then our standard errors are a lot higher, and we've basically underpowered our whole design. So what alpha does is estimate that correlation and factor it out. It looks like a lot, but the left side and the far right side are basically logistic regression.
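Putting those pieces together, the model being described is a cumulative-logit (proportional odds) mixed model. The slide's equation isn't reproduced in this transcript, so this is my best reconstruction from the description: theta_c is the outcome-level intercept, the betas are the LLM contrasts (with indicator variables x for the second and third models), and alpha_j is the per-question random intercept:

```latex
\log\!\left(
  \frac{\Pr(Y_{ij} \le c)}{1 - \Pr(Y_{ij} \le c)}
\right)
= \theta_c - \left( \beta_2\, x_{ij2} + \beta_3\, x_{ij3} + \alpha_j \right),
\qquad c = 1, \dots, C - 1, \qquad \alpha_j \sim N(0, \sigma^2)
```

The sign convention (subtracting the linear predictor from theta_c) follows the usual parameterization for cumulative link models; some references add it instead, which only flips the signs of the coefficients.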
Frequentist estimation
So Dr. Gelman's not here right now, so I'm going to talk about frequentist and Bayesian methods for estimation, at a very high level. On the frequentist side: if you've ever used a mixed model in statistics, that's what we're talking about. It's the same model; we're just looking at different ways to estimate all these coefficients. In the mixed model, we know what the likelihood is, because we've written down a probability model, an ordinal multinomial model, so we can write down its likelihood and then optimize it to find the maximum likelihood estimates. It's pretty straightforward.
For a design like this, it's pretty easy to fit, and the inference is pretty easy. If you're into p-values, I'm happy for you; you can do that. Confidence intervals are kind of better, but either way you can make inferences on these things. And it's really easy to do in R: there's the ordinal package, and in one line you have the entire model that I wrote. The only interesting thing here is the formula. This is the standard R hierarchical model formula, where we're modeling the scores as a function of llm, which is just a factor variable, an indicator for whether it's Claude or GPT and so on. Then there's the (1 | id) part: id is the independent experimental unit, and the 1 means we have a random intercept driven by that unit, which is basically the alpha intercept from before. So it's a complicated model, but the code is trivially easy.
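A sketch of what that one-liner might look like with the ordinal package, whose `clmm()` function fits cumulative link mixed models. The data frame name and its columns are assumptions matching the design described above, not the actual code from the talk:

```r
library(ordinal)

# Assumed data frame eval_results with columns:
#   score: ordered factor (incorrect < partially correct < correct)
#   llm:   factor for the model (GPT 4.1, Gemini 2.5, Claude)
#   id:    the question, i.e., the independent experimental unit
fit <- clmm(score ~ llm + (1 | id), data = eval_results, link = "logit")

summary(fit)  # coefficients for the LLM contrasts plus the random-intercept SD
```

The `(1 | id)` term is the random intercept per question that absorbs the within-question correlation.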
Once you fit this, you can get your confidence intervals and p-values. One thing you might want to do is fit a null model, which assumes the LLMs are all the same, and then do a likelihood ratio test up front to ask: is there any difference between these things at all, before I try to pull out their pairwise differences? At least from a statistical perspective, you're safe doing this before the all-combinations sort of analysis. Then, if you want all the p-values and confidence intervals, you run the tidy method on the fit, and you have a nice data frame of results. So it seems really complicated, but it's pretty easy.
Here's what the results look like for our dataset. On the left-hand side, where we have the parameters, those are the logit-scale estimates. The logit goes from negative infinity to infinity, and zero is the 50% mark in the transformation. When you look at the Gemini results, the estimate is fairly low, and from the confidence interval and the p-value you can pretty easily say there's really no difference between the ChatGPT results and the Gemini results for this particular analysis. Then you get to the Claude results. The parameter estimate is a lot higher, and the confidence interval barely covers zero. A 0.5 estimate on the logit scale is a pretty sizable number, but it's not statistically significant; there's not enough evidence to say it's different from zero.
We can also convert these to odds ratios if that's more interpretable for you. For the odds ratio, the null value is one; that would mean they're equal. And 1.73 means there's about a 73% improvement in the odds of a better outcome using Claude over GPT for this dataset, but again, that's not statistically significant here. And speaking of statistical significance: having to write out what a confidence interval means, because I used to have to do this in previous jobs, just kills me every time. It's this circuitous, almost terrible way to officially explain it: if you repeated the procedure many times, 95% of the resulting intervals would contain the true value. Ugh. But that's frequentist methods for you.
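The conversion from the logit scale to an odds ratio is just exponentiation. A quick sketch, using an illustrative estimate and interval near the ones described in the talk (the exact fitted values are not in the transcript):

```python
import math

# Hypothetical logit-scale estimate and confidence bounds for the
# Claude-vs-GPT contrast; the real values come from the fitted model.
estimate, lower, upper = 0.55, -0.05, 1.15

# Exponentiating maps logit-scale quantities to odds ratios; the null
# value moves from 0 (no difference) to 1 (equal odds).
odds_ratio = math.exp(estimate)
or_lower, or_upper = math.exp(lower), math.exp(upper)

print(round(odds_ratio, 2))  # prints 1.73
# The odds-ratio interval covers 1 exactly when the logit-scale interval
# covers 0, so the significance conclusion is unchanged by the transform.
```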
One cool thing about these models is you can get estimates of difficulty. These are the intercept estimates for the actual questions. You can imagine, if you had 500 or thousands of questions, you could really easily use a good statistical method to rank them: which ones did the models just not do well on? And then adjust your prompt accordingly. So we have a few here that the models got all right, and they're hitting the top of the scale. With this model, the intercepts are constrained to follow a normal distribution, so they fall under a normality-type situation. And then there are three down here, especially, that just weren't answered well at all. Zero is average difficulty. So that's one thing you can get out of these models.
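Ranking questions by their estimated intercepts is then trivial. A sketch with made-up per-question estimates (the question labels and values are hypothetical):

```python
# Hypothetical random-intercept estimates per question: 0 is average
# difficulty; more negative means harder (lower log-odds of correctness).
difficulty = {"q03": 1.8, "q07": -2.4, "q12": 0.1, "q18": -1.9, "q21": -2.1}

# Sort ascending so the hardest questions (most negative) come first.
hardest_first = sorted(difficulty, key=difficulty.get)
print(hardest_first[:3])  # prints ['q07', 'q21', 'q18']
```

With hundreds of questions, this ranking points directly at where to spend prompt-improvement effort.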
Bayesian estimation
So that was a summary of the frequentist methods. Now we move to the Bayesian methods. Like Dr. Gelman was talking about, you put priors on your parameters, and they express your belief about what those parameters could possibly be. Most Bayesian software has pretty good default prior distributions, and you can use those. They're very vague, which means they're designed to let the data speak more to the parameter estimates than the prior, especially when there's a fair amount of data. So the defaults are pretty good if you don't know what your prior should be; I'll show you in a second one that I changed from the default. You estimate this with Markov chain Monte Carlo (MCMC), which is quite a feat sometimes. It takes a little while to run, even with a design as small as this one. The work you have to do to set a prior and run MCMC is not trivial, but it's workable, and it's well worth it when you get to the inference stage. Because inference with Bayesian models is comparatively delightful, if I ever felt that way about inference. It's just really easy to do.
If you want to fit this model, you can use the brms package. R has a standard formula method for this type of thing, so we just recycle our formula. The main thing is right here: we choose a logit model with a cumulative link. That was the left-hand side of the equation I showed you earlier. One thing I did was take the prior for the intercepts that are due to our questions and give them a very heavy-tailed distribution. It's like a normal distribution, but the tails, the extremes of the distribution, have more mass in them. The reason I chose that is because you can imagine some questions that would be extraordinarily difficult or extraordinarily easy, and a normal distribution sort of constrains that range. A t-distribution allows you to estimate questions that the normal would consider outliers in terms of difficulty or easiness. So it's just a different way to think about what you would expect from your questions. And then you can add all the options you want: how many chains, how many iterations, and so on.
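A sketch of what that brms call might look like. The data frame name and sampler settings are assumptions, and `gr(id, dist = "student")` is one way brms supports heavy-tailed (t-distributed) random intercepts; the talk's actual code may differ:

```r
library(brms)

fit_bayes <- brm(
  # t-distributed per-question intercepts instead of the default normal,
  # so extraordinarily easy or hard questions are not shrunk as much.
  score ~ llm + (1 | gr(id, dist = "student")),
  data = eval_results,
  family = cumulative(link = "logit"),  # the cumulative-logit (proportional odds) family
  chains = 4, iter = 2000
)

summary(fit_bayes)
```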
There's a lot you can do with the Bayesian models, and it works very well. Because I didn't really change the default priors much and it's a pretty stable model, these results are pretty close to what we got with the frequentist model. You'll notice it says 5th percentile and 95th percentile: those are credible intervals. One thing that's different: when you fit a model with maximum likelihood, your parameters take on a single value, the best parameters that explain your data. With Bayesian methods, you get a distribution around your parameters at the end: a posterior distribution, which is the probability distribution of your parameters. Typically, when we want to summarize it, we use the mean or the mode of the distribution. So the means here are pretty close to what we got before. And instead of a confidence interval, we have credible intervals; the easiest thing to do is take some quantiles of that distribution to give us a sense of the variability around things.
And again, as with the maximum likelihood results, the nature of these models and their properties allows us, when we want odds ratios, to just exponentiate the parameter estimates and their distributions or intervals. The results are remarkably similar to what we got on the frequentist side. So you might think: why would I go to all the trouble of doing that? Here's the answer: the inference is incredibly simple. Instead of talking about "if we repeated this" and so on, I can look at the posterior distribution, say for the Claude parameter estimate, and ask how much of that distribution is above zero. When I do that for this particular experiment, there's a 92% probability that Claude 4 is better than GPT 4.1. It's a really direct statement. It also seems a lot more emphatic than what we got from the frequentist analysis, because the frequentist analysis hides behind a somewhat arbitrary decision and gives you indirect ways of saying how true that statement is. And this is delightful when you have to explain things to people, because it's a very direct way to look at, I don't want to call it significance, but how likely one model is to be better than the alternative.
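Given posterior draws for the Claude coefficient, that probability statement is just the fraction of draws above zero. A sketch with simulated draws standing in for real MCMC output (the mean and spread here are made up for illustration):

```python
import random
import statistics

random.seed(1)

# Simulated stand-ins for MCMC draws of the Claude-vs-GPT coefficient;
# in practice these come out of the fitted Bayesian model.
draws = [random.gauss(0.55, 0.40) for _ in range(4000)]

posterior_mean = statistics.fmean(draws)

# A 90% credible interval: the 5th and 95th percentiles of the draws.
pct = statistics.quantiles(draws, n=100)
ci_90 = (pct[4], pct[94])

# The direct probability statement: Pr(coefficient > 0 | data).
p_positive = sum(d > 0 for d in draws) / len(draws)
```

No sampling-theory gymnastics are needed: `p_positive` is read straight off the posterior.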
And we also get the difficulty estimates. The nice thing about the Bayesian model is that we can put credible intervals around them and get some sense of their variation, which is kind of nice too. If you're looking for what seem to be conclusively difficult questions, you can find that by using these intervals.
Conclusions
So I think one of the conclusions I have, and this talk is not a reaction to that paper, but it did make me think a lot that even a company like Anthropic might not have an actual statistician on board. Because if you think about these designs, they just scream out for basic methods that, again, I think most statisticians would know how to do. It's not hard to do them, either. Python has its own Stan packages and libraries, so you can do the hierarchical Bayesian model almost the same way we did it here. R, for example, has many different packages to choose from that fit these models. So it's not some arcane thing you'd have to write a dissertation to work on. It turns out to be a fairly typical design, and that makes it really easy to analyze. And that's pretty much it. Thanks for listening.

