
Max Kuhn - Measuring LLM Effectiveness
For information on upcoming conferences, visit https://www.dataconf.ai.

Measuring LLM Effectiveness by Max Kuhn

Abstract: How can we quantify how accurately LLMs perform? In late 2024, Anthropic released a preprint about statistically analyzing model evaluations. The concepts are on target, but the statistical tactics have narrow applicability. A simpler statistical framework can quantify LLM performance and applies to many more scenarios and experimental designs. We'll describe these methods and show an example.

Bio: Max Kuhn is a software engineer at Posit PBC (née RStudio). He works on improving R's modeling capabilities and maintains about 30 packages, including caret. He was previously a Senior Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut and has been applying models in the pharmaceutical and diagnostic industries for over 18 years. Max has a Ph.D. in Biostatistics. He and Kjell Johnson wrote the book Applied Predictive Modeling, which won the Ziegel award from the American Statistical Association, recognizing the best book reviewed in Technometrics in 2015. He has co-written several other books: Feature Engineering and Selection, Tidy Models with R, and Applied Machine Learning for Tabular Data (in progress).

Presented at The New York Data Science & AI Conference by Lander Analytics (August 27, 2025). Hosted by Lander Analytics (https://www.landeranalytics.com).
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Our next speaker didn't give us a fun fact. He said I could make up whatever I want about him. I said you would make up. Would, could, can, all the same thing. A lot of you probably know him from his great contributions to open source and the fantastic work he's done. A lot of you might know him from the great books he's written. His original book is about this thick. It balances all my other books out. Really good reading too, though. He's good for reading; don't just put stuff on top of it.
But he's also just a very lively person, has some great stories, and is really fun to hang out with. He's a great person to be around, and someone I really treasure knowing for as long as I have, and such a good friend. So please, everyone, welcome Max.
Great. Thanks for coming back after the break. I'm here to talk about how we would measure effectiveness with LLMs in terms of their evaluations. There are a fair number of links here, so if you go on GitHub, I'm topepo, and you'll find the 2025, oops, NYR repository right at the top. Forgive me.
Motivation for measuring LLM effectiveness
All right, so we pose some question or questions to LLMs. They're constantly changing. There are plenty of vendors, and even with the same vendor at the same time, queried a few times, you can get different results. A lot of the time, we want to measure which one is working best for us. If you're a developer or something like that, you'd hopefully have a qualitative or a quantitative way to answer: if I'm developing a prompt, is it getting any better, or am I making it worse? So we basically want to have results, and I'm here to talk about the ways we can analyze those results.
So this is with Simon Couch, who used to be in the tidymodels group working for me. He's been so adept at making really good small-scale AI tools in R that he's now in our new AI group. He wrote a post yesterday about something very much related to this: he's working on an assistant for tidymodels, and he did a lot of work to get the LLMs to recognize the right syntax for tidymodels. Then all the new models came out, and they just worked better without all the work he had previously done. It seemed like a good example, and it's an interesting read.
It's a good example, though, because it's not so much volatility as that everything is very dynamic. It'd be nice to have a way to measure effectiveness. So hopefully, one thing we'll eventually get to is a setup where we have some leader LLM from the last time we looked at things, and a set of questions that we want to pose, and we can just turn that loose and measure. I have it written as LLM or model here, but it's really the combination of the model with your prompt, and whether you're using RAG or MCP or something like that: however you get the system working.
And hopefully, you can then make some inferential statement about the difference in accuracy between these models. That's what we hope to get to.
Inspect and vitals
So I thought I'd mention two things: Inspect and vitals. Inspect is a Python-based framework that the UK's AI Safety Institute created. To be honest, the only reason I'm familiar with it is because the person who owns our company went off to the UK and worked on it for a long time. It's kind of an amazing feat of engineering. It's built for very, very large-scale work; it can handle the crazy, agentic stuff that large companies would build. But it's a little overkill if you're just somebody doing package development, especially in R, and you want to measure these things. So Simon created a package called vitals. It takes a little bit from Inspect, and, I don't know if he would say it this way, but I think of it as unit tests for LLMs.
So you write a list of questions that you want to evaluate continually. You can expand them, but they're basically a static set of questions you want answered, so you can see: are these models better at this corpus of questions? So it's almost like a unit test. The example we're using is not a new one; you can tell by the models listed there. There are 25 questions related to R. Some of them are R6, or tidyverse, or base R questions, so fairly diverse. Simon ran them within vitals using GPT 4.1, Gemini 2.5, and Claude 4 Sonnet. And we know that a lot of these models are stochastic, so if you run them more than once, you'll get different answers. So for each one, vitals lets you say: measure it three times, or however many times you want. So he ran three repeats of each.
So we have 25 things measured over three models, three times each. A little bit of notation and thought about the data: these are fairly small-scope questions, so we can programmatically rate them. I'm going to punt on all of the different ways you might take the output of LLMs and measure it in terms of correctness; Bill talked about that yesterday. But we could very easily say that things were either incorrect, partially correct, or correct. Partially correct could mean, say, I asked it to do a geom_smooth and it did that well, but then it hallucinated a link to the documentation or something like that. It did a decent job, but not quite all the way there.
So there's a little bit of notation; I'll show you an equation in a minute where this will matter. We have C equal to three outcome levels: incorrect, partially correct, and so on. Three LLMs, so that's P equal to three. And I'm going to use the term epochs for replicates: if we run each of these three times, that's three epochs. I'll be talking about these ordinal outcomes as the thing we're trying to measure. This is not the most complicated model in the world, and if you were doing something like binary outcomes, proportion correct, or even some numeric score, the model actually gets a lot simpler than the one I'll show you in a minute. So the system I'm going to talk about here can work on pretty much anything.
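To make the layout concrete, here is a minimal sketch of the design as a long-format table, one row per scored response. The question labels, model names, and column names are stand-ins; the real data come from Simon's vitals run.

```python
from itertools import product

# Hypothetical stand-ins for the real evaluation data.
questions = [f"q{i:02d}" for i in range(1, 26)]  # 25 questions (the experimental units)
llms = ["GPT 4.1", "Gemini 2.5", "Claude"]       # P = 3 models
epochs = [1, 2, 3]                               # 3 replicates ("epochs") each

# One row per (question, model, epoch): 25 * 3 * 3 = 225 scored responses.
# Each score would be one of the C = 3 ordered levels.
rows = [
    {"id": q, "llm": m, "epoch": e, "score": None}
    for q, m, e in product(questions, llms, epochs)
]

print(len(rows))  # prints 225
```

The key point of this shape is that rows sharing an `id` are replicates of the same question, so they are correlated; the model described later accounts for that.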
The Anthropic paper and a better statistical framework
Here's a visualization of the data. You can see the questions on the y-axis, and we've color-coded the answers. When I inspect this with my eyeballs, I think it's maybe 40% or 50% correct, just based on the colors; that's my visual assessment. But you can see some go from completely correct to completely incorrect, or land in between at partially correct. And some of them, at the top and bottom, are pretty consistently good or bad.
So last year, somebody from Anthropic released a manuscript on statistical approaches for rating LLMs. It's actually quite a good manuscript, especially in terms of the spirit of what it's trying to do. It talks about how, if you measure these things multiple times, you have to account for the within-question correlation, and a lot of the ideas are good. The part that stood out for me is the details of how they do it. It's not wrong, but it's not especially good methodology. It's also a situation where, for different experimental designs, you would need different equations, and as statisticians we'd never think about it that way. So I'll talk about a more top-level approach where you won't be deriving equations for these things; that's already done for you. The methods are very versatile, and they can work for pretty much any experimental design.
And I'm a person who's been mostly focused on estimation problems my whole career, and machine learning, and I'm a statistician, so for me to stand up here and say we need statistical inference is kind of a change. But that's what we need, and fairly standard kinds of it at that. These problems have, believe it or not, already been solved. Just as an example, generalized linear models are 51 years old. That's crazy to me. A framework like that solves almost all the problems we're working on. So although the paper is really good, their derivations of equations for this and that were really unnecessary, because we already have frameworks to do what we want.
Most of the experimental designs that we would do fall into basically ANOVA models, like analysis of variance. Like we have these LLMs and their groups, and we want to make comparisons between them. There might be like some sort of numerical covariate, like the number of tokens or the cost that we might put in these models. But there's nothing crazy about them. And the experimental design is probably fairly balanced and complete. So they don't seem like they're hard to fit models to. So there's a lot of basic things that, at least as a statistician, you would maybe get in like your first year of graduate school, if not sooner, that can handle like this design as well as other types of outcomes.
The proportional odds model
So this is really the only equation we'll see. It looks kind of daunting, but if you've ever seen the equation for logistic regression, that's two-thirds of what's here. On the left-hand side, you basically have a logit, just like you would with logistic regression. This is called a proportional odds model. It's built for cumulative or ordinal outcomes, like incorrect, partially correct, correct. The only real difference on the left-hand side between this and logistic regression is that you're modeling a cumulative probability. So if you have three outcomes, like we do, you build two logistic regressions. The constraint is that the probabilities for incorrect, partially correct, and correct have to add up to one, so you can't fit three independent models; there's a dependency in there. You fit two of the three, and then you can infer the third probability from those two models.
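To see that constraint concretely, here is a small sketch, with made-up cutpoint values, of how two cumulative logits determine all three class probabilities:

```python
import math

def inv_logit(x):
    """Inverse logit: map a logit back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical cutpoints for the two cumulative logits:
# Pr(incorrect) and Pr(incorrect or partially correct).
theta_1, theta_2 = -0.4, 0.9  # must satisfy theta_1 < theta_2

p_incorrect = inv_logit(theta_1)
p_partial   = inv_logit(theta_2) - inv_logit(theta_1)
p_correct   = 1.0 - inv_logit(theta_2)

# Two fitted logits pin down all three probabilities, and they sum to one.
print(round(p_incorrect + p_partial + p_correct, 10))  # prints 1.0
```

This is why only C - 1 = 2 logistic-regression-like equations are fitted; the third probability is implied.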
On the left-hand side, the logit is a probability: if c here were equal to two, the top would be the probability of being incorrect or partially correct, and the bottom would be the probability of being correct. So it's very much like a logit, where you have a probability in the parentheses on top and its complement on the bottom, except it's cumulative. On the right-hand side, the betas are the same thing you would see in linear regression or logistic regression. These are the things we actually care about. So if we ordered the models as ChatGPT, Gemini, and Claude, beta two would be the estimated difference in, let's say, accuracy between what GPT gives you and what Gemini gives you. And the third coefficient would be the difference between GPT and Claude.
Now, the two new things here: this theta is necessary for the ordinal regression part. It's basically an intercept that is specific to the level of the outcome being modeled. And the alpha here is not always needed, but it is for our design, because we've taken each question and replicated it three times. We've taken the same setting and just repeated it. In statistics, we call the question the independent experimental unit, but the rows in our dataset aren't independent: if the three replicates are in rows one through three, those results are correlated with one another, and they're more like each other than they are like any of the other rows. If you've ever had a lecture on a paired t-test versus a t-test, it's the same idea. If we don't model that correlation that's already baked into the data, then our standard errors are a lot higher, and we've basically underpowered our whole design. So what alpha does is estimate that correlation and factor it out. It looks like a lot, but the left side and the far right side are basically logistic regression.
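Putting those pieces together, the model being described is a cumulative-logit (proportional odds) mixed model. The slide's equation isn't reproduced in this transcript, so this is my best reconstruction from the description: theta_c is the outcome-level intercept, the betas are the LLM contrasts (with indicator variables x for the second and third models), and alpha_j is the per-question random intercept:

```latex
\log\!\left(
  \frac{\Pr(Y_{ij} \le c)}{1 - \Pr(Y_{ij} \le c)}
\right)
= \theta_c - \left( \beta_2\, x_{ij2} + \beta_3\, x_{ij3} + \alpha_j \right),
\qquad c = 1, \dots, C - 1, \qquad \alpha_j \sim N(0, \sigma^2)
```

The sign convention (subtracting the linear predictor from theta_c) follows the usual parameterization for cumulative link models; some references add it instead, which only flips the signs of the coefficients.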
Frequentist estimation
So Dr. Gelman's not here right now, so I'm going to talk about frequentist and Bayesian methods for estimation, at a very high level. On the frequentist side: if you've ever used a mixed model in statistics, that's what we're talking about. It's the same model; we're just looking at different ways to estimate all these coefficients. In the mixed model, we know what the likelihood is, because we've written down a probability model, an ordinal multinomial model, so we can write down its likelihood and then optimize it to find the maximum likelihood estimates. It's pretty straightforward.
For a design like this, it's pretty easy to fit, and the inference is pretty easy. If you're into p-values, I'm happy for you; you can do that. Confidence intervals are kind of better, but either way you can make inferences on these things. And it's really easy to do in R: there's the ordinal package, and in one line you have the entire model that I wrote. The only interesting thing here is the formula. This is the standard R hierarchical model formula, where we're modeling the scores as a function of llm, which is just a factor variable, an indicator for whether it's Claude or GPT and so on. Then there's the (1 | id) part: id is the independent experimental unit, and the 1 means we have a random intercept driven by that unit, which is basically the alpha intercept from before. So it's a complicated model, but the code is trivially easy.
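A sketch of what that one-liner might look like with the ordinal package, whose `clmm()` function fits cumulative link mixed models. The data frame name and its columns are assumptions matching the design described above, not the actual code from the talk:

```r
library(ordinal)

# Assumed data frame eval_results with columns:
#   score: ordered factor (incorrect < partially correct < correct)
#   llm:   factor for the model (GPT 4.1, Gemini 2.5, Claude)
#   id:    the question, i.e., the independent experimental unit
fit <- clmm(score ~ llm + (1 | id), data = eval_results, link = "logit")

summary(fit)  # coefficients for the LLM contrasts plus the random-intercept SD
```

The `(1 | id)` term is the random intercept per question that absorbs the within-question correlation.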
Once you fit this, you can get your confidence intervals and p-values. One thing you might want to do is fit a null model, which assumes the LLMs are all the same, and then do a likelihood ratio test up front to ask: is there any difference between these things at all, before I try to pull out their pairwise differences? At least from a statistical perspective, you're safe doing this before the all-combinations sort of analysis. Then, if you want all the p-values and confidence intervals, you run the tidy method on the fit, and you have a nice data frame of results. So it seems really complicated, but it's pretty easy.
Here's what the results look like for our dataset. On the left-hand side, where we have the parameters, those are the logit-scale estimates. The logit goes from negative infinity to infinity, and zero is the 50% mark in the transformation. When you look at the Gemini results, the estimate is fairly low, and from the confidence interval and the p-value you can pretty easily say there's really no difference between the ChatGPT results and the Gemini results for this particular analysis. Then you get to the Claude results. The parameter estimate is a lot higher, and the confidence interval barely covers zero. A 0.5 estimate on the logit scale is a pretty sizable number, but it's not statistically significant; there's not enough evidence to say it's different from zero.
We can also convert these to odds ratios if that's more interpretable for you. For the odds ratio, the null value is one; that would mean they're equal. And 1.73 means there's about a 73% improvement in the odds of a better outcome using Claude over GPT for this dataset, but again, that's not statistically significant here. And speaking of statistical significance: having to write out what a confidence interval means, because I used to have to do this in previous jobs, just kills me every time. It's this circuitous, almost terrible way to officially explain it: if you repeated the procedure many times, 95% of the resulting intervals would contain the true value. Ugh. But that's frequentist methods for you.
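The conversion from the logit scale to an odds ratio is just exponentiation. A quick sketch, using an illustrative estimate and interval near the ones described in the talk (the exact fitted values are not in the transcript):

```python
import math

# Hypothetical logit-scale estimate and confidence bounds for the
# Claude-vs-GPT contrast; the real values come from the fitted model.
estimate, lower, upper = 0.55, -0.05, 1.15

# Exponentiating maps logit-scale quantities to odds ratios; the null
# value moves from 0 (no difference) to 1 (equal odds).
odds_ratio = math.exp(estimate)
or_lower, or_upper = math.exp(lower), math.exp(upper)

print(round(odds_ratio, 2))  # prints 1.73
# The odds-ratio interval covers 1 exactly when the logit-scale interval
# covers 0, so the significance conclusion is unchanged by the transform.
```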
One cool thing about these models is you can get estimates of difficulty. These are the intercept estimates for the actual questions. You can imagine, if you had 500 or thousands of questions, you could really easily use a good statistical method to rank them: which ones did the models just not do well on? And then adjust your prompt accordingly. So we have a few here that the models got all right, and they're hitting the top of the scale. With this model, the intercepts are constrained to follow a normal distribution, so they fall under a normality-type situation. And then there are three down here, especially, that just weren't answered well at all. Zero is average difficulty. So that's one thing you can get out of these models.
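Ranking questions by their estimated intercepts is then trivial. A sketch with made-up per-question estimates (the question labels and values are hypothetical):

```python
# Hypothetical random-intercept estimates per question: 0 is average
# difficulty; more negative means harder (lower log-odds of correctness).
difficulty = {"q03": 1.8, "q07": -2.4, "q12": 0.1, "q18": -1.9, "q21": -2.1}

# Sort ascending so the hardest questions (most negative) come first.
hardest_first = sorted(difficulty, key=difficulty.get)
print(hardest_first[:3])  # prints ['q07', 'q21', 'q18']
```

With hundreds of questions, this ranking points directly at where to spend prompt-improvement effort.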
Bayesian estimation
So that was a summary of the frequentist methods. Now we move to the Bayesian methods. Like Dr. Gelman was talking about, you put priors on your parameters, and they express your belief about what those parameters could possibly be. Most Bayesian software has pretty good default prior distributions, and you can use those. They're very vague, which means they're designed to let the data speak more to the parameter estimates than the prior, especially when there's a fair amount of data. So the defaults are pretty good if you don't know what your prior should be; I'll show you in a second one that I changed from the default. You estimate this with Markov chain Monte Carlo (MCMC), which is quite a feat sometimes. It takes a little while to run, even with a design as small as this one. The work you have to do to set a prior and run MCMC is not trivial, but it's workable, and it's well worth it when you get to the inference stage. Because inference with Bayesian models is comparatively delightful, if I ever felt that way about inference. It's just really easy to do.
If you want to fit this model, you can use the brms package. R has a standard formula method for this type of thing, so we just recycle our formula. The main thing is right here: we choose a logit model with a cumulative link. That was the left-hand side of the equation I showed you earlier. One thing I did was take the prior for the intercepts that are due to our questions and give them a very heavy-tailed distribution. It's like a normal distribution, but the tails, the extremes of the distribution, have more mass in them. The reason I chose that is because you can imagine some questions that would be extraordinarily difficult or extraordinarily easy, and a normal distribution sort of constrains that range. A t-distribution allows you to estimate questions that the normal would consider outliers in terms of difficulty or easiness. So it's just a different way to think about what you would expect from your questions. And then you can add all the options you want: how many chains, how many iterations, and so on.
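A sketch of what that brms call might look like. The data frame name and sampler settings are assumptions, and `gr(id, dist = "student")` is one way brms supports heavy-tailed (t-distributed) random intercepts; the talk's actual code may differ:

```r
library(brms)

fit_bayes <- brm(
  # t-distributed per-question intercepts instead of the default normal,
  # so extraordinarily easy or hard questions are not shrunk as much.
  score ~ llm + (1 | gr(id, dist = "student")),
  data = eval_results,
  family = cumulative(link = "logit"),  # the cumulative-logit (proportional odds) family
  chains = 4, iter = 2000
)

summary(fit_bayes)
```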
There's a lot you can do with the Bayesian models, and it works very well. Because I didn't really change the default priors much and it's a pretty stable model, these results are pretty close to what we got with the frequentist model. You'll notice it says 5th percentile and 95th percentile: those are credible intervals. One thing that's different: when you fit a model with maximum likelihood, your parameters take on a single value, the best parameters that explain your data. With Bayesian methods, you get a distribution around your parameters at the end: a posterior distribution, which is the probability distribution of your parameters. Typically, when we want to summarize it, we use the mean or the mode of the distribution. So the means here are pretty close to what we got before. And instead of a confidence interval, we have credible intervals; the easiest thing to do is take some quantiles of that distribution to give us a sense of the variability around things.
And again, as with the maximum likelihood results, the nature of these models and their properties allows us, when we want odds ratios, to just exponentiate the parameter estimates and their distributions or intervals. The results are remarkably similar to what we got on the frequentist side. So you might think: why would I go to all the trouble of doing that? Here's the answer: the inference is incredibly simple. Instead of talking about "if we repeated this" and so on, I can look at the posterior distribution, say for the Claude parameter estimate, and ask how much of that distribution is above zero. When I do that for this particular experiment, there's a 92% probability that Claude 4 is better than GPT 4.1. It's a really direct statement. It also seems a lot more emphatic than what we got from the frequentist analysis, because the frequentist analysis hides behind a somewhat arbitrary decision and gives you indirect ways of saying how true that statement is. And this is delightful when you have to explain things to people, because it's a very direct way to look at, I don't want to call it significance, but how likely one model is to be better than the alternative.
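Given posterior draws for the Claude coefficient, that probability statement is just the fraction of draws above zero. A sketch with simulated draws standing in for real MCMC output (the mean and spread here are made up for illustration):

```python
import random
import statistics

random.seed(1)

# Simulated stand-ins for MCMC draws of the Claude-vs-GPT coefficient;
# in practice these come out of the fitted Bayesian model.
draws = [random.gauss(0.55, 0.40) for _ in range(4000)]

posterior_mean = statistics.fmean(draws)

# A 90% credible interval: the 5th and 95th percentiles of the draws.
pct = statistics.quantiles(draws, n=100)
ci_90 = (pct[4], pct[94])

# The direct probability statement: Pr(coefficient > 0 | data).
p_positive = sum(d > 0 for d in draws) / len(draws)
```

No sampling-theory gymnastics are needed: `p_positive` is read straight off the posterior.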
And we also get the difficulty estimates. The nice thing about the Bayesian model is that we can put credible intervals around them and get some sense of their variation, which is kind of nice too. If you're looking for what seem to be conclusively difficult questions, you can find that by using these intervals.
Conclusions
So I think one of the conclusions I have, and this talk is not a reaction to that paper, but it did make me think a lot that even a company like Anthropic might not have an actual statistician on board. Because if you think about these designs, they just scream out for basic methods that, again, I think most statisticians would know how to do. It's not hard to do them, either. Python has its own Stan packages and libraries, so you can do the hierarchical Bayesian model almost the same way we did it here. R, for example, has many different packages to choose from that fit these models. So it's not some arcane thing you'd have to write a dissertation to work on. It turns out to be a fairly typical design, and that makes it really easy to analyze. And that's pretty much it. Thanks for listening.

