
Simon Couch - Practical AI for data science
Abstract: While most discourse about AI focuses on glamorous, ungrounded applications, data scientists spend most of their days tackling unglamorous problems in sensitive data. Integrated thoughtfully, LLMs are quite useful in practice for all sorts of everyday data science tasks, even when restricted to secure deployments that protect proprietary information. At Posit, our work on ellmer and related R packages has focused on enabling these practical uses. This talk will outline three practical AI use-cases—structured data extraction, tool calling, and coding—and offer guidance on getting started with LLMs when your data and code are confidential.

Presented at the 2025 R/Pharma Conference Europe/US Track.

Resources mentioned in the presentation:
- {vitals}: Large Language Model Evaluations https://vitals.tidyverse.org/
- {mcptools}: Model Context Protocol for R https://posit-dev.github.io/mcptools/
- {btw}: A complete toolkit for connecting R and LLMs https://posit-dev.github.io/btw/
- {gander}: High-performance, low-friction Large Language Model chat for data scientists https://simonpcouch.github.io/gander/
- {chores}: A collection of large language model assistants https://simonpcouch.github.io/chores/
- {predictive}: A frontend for predictive modeling with tidymodels https://github.com/simonpcouch/predictive
- {kapa}: RAG-based search via the kapa.ai API https://github.com/simonpcouch/kapa
- Databot https://positron.posit.co/dat
Transcript
This transcript was generated automatically and may contain errors.
Thanks, Hari. Yeah, good morning, folks. Happy to be here.
I find a good bit of AI discourse pretty annoying. If you're a LinkedIn user, you've probably seen a post that looks something like this before, where it starts out with some, like, vague, bold assertion, like, AI will change everything about data science. And then, like, in order to read anything about the actual content of the post, you have to, like, click in and actually engage, so it gets a boost in the algorithm.
And then there's some sort of picture, usually the imagery we've converged on is a humanoid robot typing on something between a laptop and a desktop, and it has, like, 200 keys on the keyboard.
What AI discourse gets wrong about data science
So in these posts, for one, robots, or in reality, API keys to LLMs, are free or happily paid for by someone else. In reality, when these big new feature drops happen from the major labs, they're often quite expensive. And even with the ones that have been around for almost a year now, like Claude Code, a day of development where you're really just using the tool for eight hours can easily rack up 50 or 100 bucks in billing.
Another sort of assumption that underlies a lot of these posts is that the data being scienced can just be sent straight to OpenAI's servers or Anthropic's servers or Google's, when in reality, the data that many of us work on day to day is confidential, and we can't just send it over to any company that tells us they can do interesting things with it.
And finally, there's often this assumption that data science can be one-shotted, which is this term in the AI space, meaning you just like pass along your data, and then you like go make a coffee or have lunch or something, and you come back to a complete analysis.
There's a lot about real world data science that doesn't get represented when we make these assumptions. So in reality, for one, frontier LLMs cost money, right? Even technologies like Claude Code, again, if you're using it all day, you can easily rack up a bill that you probably don't want to foot yourself.
Data science happens on mostly sensitive and confidential data. So whether that results from clinical trials or you're pulling something from an EMR, this data is subject to privacy constraints that we have to be conscious of. And so you can't just make use of the latest and greatest from any lab if you need those privacy guarantees.
And finally, in the real world, data science is messy and it's subtle and it's context-rich. The problems that data scientists solve day-to-day aren't just like data set in, insight out. There's all sorts of context and understanding that we bring to data science. Integration with data sources and tools that are outside of maybe the most obvious choices. And so data science really needs that like real human cognition that LLMs can be proxies for, but don't truly have.
What's possible with ellmer in R
So I'm not just going to be ragging on folks on LinkedIn throughout the rest of this talk. I want to focus on two big ideas. The first one is that I'd like to show what's currently possible in R with ellmer. ellmer is a package that lets you talk to LLMs in R. It's made leaps and bounds over the last year. And I think it's like the best interface to program against LLMs that there is.
And then I also want to speak to specifically the constraints that people doing data science in organizations, and specifically in organizations like those in pharma, what that data science actually looks like, what using LLMs in those circumstances actually looks like. And so I want to help you imagine making it work in practice. But we'll start first with a glimpse of what's possible in R.
Again, the ellmer package is a package that allows you to talk to LLMs in R. And it enables three sort of boosts for data science that I think are underappreciated. And I want to go through those three in order.
The first is about extracting structured data from unstructured data. So that could be free text fields that live in an EMR or result from some clinical trial. Those could be images. LLMs are incredible when situated in the right way to extract structured data from these unstructured sources.
The next thing is tool calling. Tool calling is a way to give LLMs the ability to run deterministic, quote, normal software. And equipping LLMs with tools fills in a lot of their gaps, the things we'd expect them to be good at but that they actually aren't.
And then finally, coding and specifically coding agents. LLMs are quite strong in terms of generating code to carry out a given task. And so if you provide them with the right scaffold, you provide them with true human cognition, they can be really helpful in writing data science code.
So we'll start out with just a quick intro into what it looks like to talk to an LLM via ellmer in the first place. And there's sort of three high level steps here. The first one is loading the package, the second is creating this chat object, and the third is calling a method on that object.
There are something like 20 different functions inside of ellmer that start with this chat underscore prefix, and then they have the name of some sort of provider. So that could be one of the three major labs: Anthropic, OpenAI, or Google Gemini. That could also be a local model on your own machine, via chat_ollama. And that could also be a private or secure deployment of an LLM that your organization might have deployed.
So if your organization uses Bedrock or Databricks or Snowflake or some other private vendor, you can probably connect to that vendor via ellmer. But regardless, whichever provider that you choose, you end up with this chat object. And the chat object has a bunch of methods attached to it. So you can put a dollar sign after, and there's all sorts of functions that can act on the chat object.
And so we'll use this $chat() method, which is the easiest entry point to start talking to these LLMs. So I could say something like, who are you? And because I have made a connection to the Anthropic API, I'm talking to Claude Sonnet, and the model will tell me that it's Claude.
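As a sketch, those three steps look something like this, assuming an Anthropic API key is available in your environment; any other chat_*() constructor works the same way:

```r
library(ellmer)

# Create a chat object; swap in chat_openai(), chat_google_gemini(),
# chat_ollama(), etc. to use a different provider
chat <- chat_anthropic()

# $chat() sends a message and streams the model's reply to the console
chat$chat("Who are you?")
```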
Structured data extraction
Through this interface, we can take advantage of those three different capabilities to do data science. So first, we'll talk about turning unstructured data into structured data.
How are our regex chops in the crowd? If I saw a set of sentences like this, and I needed to extract name and age from this data, I don't even know that I could write a regex that makes this happen. I don't even know if that's possible.
These are sort of contrived examples, but we can imagine in recordings of patient conversations or interviews through a clinical trial or something, we have all sorts of these unstructured free text fields, and we need to extract information from them. So we have these six different sentences where at some point somebody says their name and their age. And using traditional software, using something like regex, this would be a really difficult problem to solve.
This is something that LLMs are actually quite good at. So using that chat function, we could say something like, extract the name and age from each sentence I give you, and then just pass through each of those sentences one by one. As you can see in these results, the LLM is quite good at pulling out both of those pieces of information.
This does kind of put us in a situation where we still need to use regex, because the LLM has outputted more free text: it just writes `**Name:**` and then the name, and it does a similar thing for the age. So we would still have to process that a little bit.
It turns out that we can do that explicitly. We can get an output that looks a little bit more like this, a normal R list where there's a name field and an age field.
The way that we do that is via a different method on this same object, and that method, instead of being called chat, is called chat_structured. So we pass in the prompt just as before, but we also pass in this type object. Type objects are just specifications of the format of the response that you would like to get back. So in our example, we can say we want the name, and that's going to be a string, and then we want the age, and that's going to be a number. And if we pass that along, we get an R list back with the name and age.
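A minimal sketch of that, with a made-up example sentence standing in for the slides:

```r
library(ellmer)

chat <- chat_anthropic()

# A type object describing the shape of the response we want back
type_person <- type_object(
  name = type_string("The person's name."),
  age = type_number("The person's age, in years.")
)

# Returns an R list with $name and $age instead of free text
chat$chat_structured(
  "Hi, I'm Greta and I'm 42 years old.",
  type = type_person
)
```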
This sort of thing is super useful from going to free text, unstructured data, to tables. Gets you to happy data frame land as quickly as possible.
And you don't just have to do this one by one. You can do this in parallel with ellmer, via a function called parallel_chat_structured. This is parallel in the sense of a ton of HTTP requests being sent off at once and then collected as they come back, not in the sense of spinning up a bunch of R processes on your machine. And so this means that you can pass along all six of those prompts at once, and using a model like Claude Sonnet, you get a response back in a second or something like that. And so this enables you to process that data at the scale that you actually need to.
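Roughly, with a few made-up sentences standing in for the six on the slides (in the ellmer versions I've used, the result simplifies to a data frame, one row per prompt):

```r
library(ellmer)

chat <- chat_anthropic()

sentences <- c(
  "Hi, I'm Greta and I'm 42 years old.",
  "Theo here, age 29.",
  "My name is Priya and I just turned 35."
)

type_person <- type_object(
  name = type_string("The person's name."),
  age = type_number("The person's age, in years.")
)

# Fires off all of the requests concurrently and collects the results
parallel_chat_structured(chat, sentences, type = type_person)
```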
It's not just text that you can use this structured data extraction with. You can also do so with images. So if we imagine that I have this directory called animals, and there's a bunch of paths to images inside of that directory, I can use the function content_image_file() on each of those paths to load in the images, and then pass them along in the same way using parallel_chat_structured(). So if I have these images of animals and I want to know which animal it is and what the background color of the picture is, I can also specify that in a type object and send it along.
So you can imagine inside of a clinical trial, if there's like imagery or some other image information that is collected as part of the trial, you can process it with these models.
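A sketch of the image version, assuming a local animals/ directory containing image files:

```r
library(ellmer)

chat <- chat_anthropic()

# Paths to every image inside the (assumed) animals/ directory
paths <- list.files("animals", full.names = TRUE)

type_animal <- type_object(
  animal = type_string("Which animal is pictured."),
  background_color = type_string("The background color of the picture.")
)

# content_image_file() reads each image so it can be sent as a prompt
prompts <- lapply(paths, content_image_file)

parallel_chat_structured(chat, prompts, type = type_animal)
```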
So at a high level, this is what's possible with structured data extraction with LLMs. And this is like such a useful tool in getting yourself into a data frame that you know how to work with, with your usual dplyr and tidyr and ggplot and things like that.
Tool calling
The next capability that I want to talk about is something called tool calling. And practically, this is a super helpful capability when you're trying to get LLMs to help you with certain tasks.
So one thing that LLMs can't do is access real time information. This just isn't how they work. They have a knowledge cutoff. They're trained on a bunch of information. Natively, they have no way to learn what the current date is. So if I initialize this chat object and I say, what day is it today? Claude will respond to me and say, I don't know.
If you're using an app like chatgpt.com or claude.ai or whatever, they'll often inject that information into the prompt so that it appears that they do know what the date is. But that's scaffolding on top of the model rather than the raw model itself.
So what we can do is provide a tool to the model. A tool is just a function and then some documentation on how to use the function. So, for example, I could make this tool called today, which gets today's date. And all it does when it's called is just call the Sys.Date() function. And so we describe this function as "gets today's date" to the model. The function has no arguments, so we just provide an empty list.
And then we use yet another method on this chat object, which is the register_tool() method. The tool() function here is exported from the ellmer package, and it just allows you to attach some metadata to a given function. And once you've done that, you can register it with the chat.
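Putting those pieces together looks roughly like this (the tool() argument names have shifted a bit across ellmer releases, so check ?tool against your installed version):

```r
library(ellmer)

chat <- chat_anthropic()

# A plain R function the model will be allowed to call
today <- function() as.character(Sys.Date())

# tool() attaches the metadata the model needs: a name, a description,
# and (here) an empty argument specification
chat$register_tool(tool(
  today,
  name = "today",
  description = "Gets today's date.",
  arguments = list()
))

# The model can now answer by requesting a call to today()
chat$chat("What day is it today?")
```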
So then if I go back to that chat object and I ask the same question, I say, what day is it today? The model can choose to return a message back to me in this structured way, which ellmer will see and say, oh, this isn't a message that's intended for the user. This is a message that I can process myself and immediately return a response back. So the LLM decides to send me this special formatted message that says call this function today.
And the function returns the date at the moment at which it was called. So this is August 8th, 2025. So then it's again the model's turn to do whatever it wants to do. And in this case, it's just going to situate this response from the tool inside of its message back to you. So it can say today is August 8th, 2025. And it happens to know that it's Friday because LLMs are weird.
This is a strange concept to wrap your head around at first. So I just want to like actually diagram this out and make sure this is clear what's actually happening here. So this is me last week. And last week, I was having a conversation with this humanoid robot typing on a laptop slash desktop. So last week, I send this message over to the humanoid robot. And I say, what day is it today?
So the only thing that the model can do is respond back with another HTTP request. Except that it can format that request in such a way that ellmer will see the response. And it will realize that the response shouldn't be shown to me. So I never see this message, please call today. ellmer sees it and it processes it automatically and returns the result back to the model.
So again, the humanoid robot sends this specially formatted message that says, please call today. The response is returned to the robot. And then the humanoid robot can situate that response in a response back to me.
So this is what we call a complete tool calling loop, where a model chooses to call a tool. It gets a response back, and then it chooses what to do next. So often, a model can call a tool. It can observe the result. And then once it observes the result, it can either respond back to me, as it did. Or it can just continue to call tools until it's accomplished the task that it would like to do.
Coding agents
So LLMs, through their hazy recollection of the entire Internet, are pretty good at generating code. And if we give LLMs the ability to run code in a thoughtful way, we provide them with a scaffold that helps them assist us as best as they can. Then these models can be pretty helpful at generating data science code.
So again, this is a callback to the tool calling loop. When people say coding agent, which is a pretty popular term now, this can refer to agents like Claude Code or Codex, and also agents built by Posit, like Positron Assistant or Databot. This is LLMs calling tools in a loop.
And usually, when we say something is an agent, we mean that those tools give the model the ability to read and write state. So what do I mean? If I make a fresh chat object, and in that chat I say, delete the CSV files in my working directory. If the model were given the ability to run code, it could probably figure out how to do that pretty easily.
So, for example, this model has chosen to show me how to do this in Bash if it were to be given the ability to run code, or it can just tell me this is the code that you would use to make that happen. What this model is not actually doing is going and deleting the CSV files, because it doesn't have tools that give it the ability to read and write state.
We can do this with two ellmer tools. The first one we will call ls, and that just calls the dir function in R, which will list the files inside of the current directory. So this tool will give the model the ability to read state.
So if we return to that fresh chat again and ask the same question, delete all the CSV files in the current directory, the model can choose to call the ls tool, which will return a bunch of information about the objects that are inside of that directory. The model can identify which of those are CSVs and call another tool.
So this is an example of that full tool calling loop, where the model calls a tool, it gets a response back, and before it says something and yields control back to the user, it decides to call another tool before it does so. So it will call this remove tool, and it will provide these two paths to the CSV files that it found, and then the tool will return.
So in this case, because all the rm tool is doing is writing state, it just says, okay, I did that, done, which is what this true represents. And only at this point will the model complete the tool calling loop and yield back to the user saying, done, I deleted those files.
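A sketch of those two tools, one reading state and one writing it (same caveat as before about tool() argument names across ellmer versions, and note that this really will delete files if you run it):

```r
library(ellmer)

chat <- chat_anthropic()

# Read state: list the files in the working directory
chat$register_tool(tool(
  function() dir(),
  name = "ls",
  description = "List the files in the current working directory.",
  arguments = list()
))

# Write state: delete the files at the given paths
chat$register_tool(tool(
  function(paths) file.remove(paths),
  name = "rm",
  description = "Delete the files at the given paths.",
  arguments = list(
    paths = type_array(items = type_string(), description = "Paths to delete.")
  )
))

# The model can now ls, pick out the CSVs, and rm them
chat$chat("Delete all of the CSV files in the current directory.")
```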
Databot and Positron Assistant
The first of those Posit coding agents is Databot, which is an assistant for exploratory data analysis inside of Positron. When you install Databot, it looks like a typical chat window, except that it has access to your active R environment. So imagine I've loaded this data already into my R session, it's called forested; I can tell the coding agent Databot to go ahead and check that data out and make some ggplots.
So first thing, the model will take a look at the data and make sure that it knows the column names to use and what those column types are. And then once it's done so, it'll glimpse that out through running R code and then start making ggplots.
After a couple plots, the model will then yield control back to the user. And this is like one of the really nice, powerful parts about DataBot, which is that it's sort of tuned to just call a couple tools, push the analysis forward a little bit, and then yield control back to the user, maybe provide some suggestions on where to go from there. But this really keeps the user in the loop through that process.
So how is this agent reading and writing state? Often in these applied contexts, the distinction between reading tools for reading state and writing state is a little bit blurred. Because if you give a model the ability to run R code, generally, then it can do anything that it can do via bash even.
So the run-R-code tool, which allows the model to write R code and then see its results, as well as show the user the results, can both read and write state. And then there's also this create-Quarto-report tool. So once you're finished with some EDA, you can take that initial analysis and situate it inside a persistent, reproducible Quarto report.
For more general coding assistance tasks, Positron comes with one more coding agent, which is called Positron Assistant. And again, Assistant is focused a little bit more on coding than EDA specifically, but it can help along with some EDA as well.
To demonstrate this, we can take that same forested data and imagine we have situated it inside of this readme file. I can ask the coding agent, could you add some ggplots to the end of this file? And in this case, the model will read the readme.rmd file, it will take a look at the forested dataset, and then it will add some code to the end of the file.
Positron Assistant is especially nice because it's really tightly integrated with many different elements of the IDE. So it can see your active R session, it can see the files you have open, and it's really easy to give it the right context.
But there's also these special tools that allow it to access different parts of the IDE. And its implementations of tools display tool requests really nicely inside of the interface. So, for example, if the Positron Assistant wants to edit a file, like adding those lines to the readme, that's displayed nicely in that diff format inside of the editor. So it's really easy to see the changes that it wants to make.
Sidekick: a coding agent for RStudio
So up to this point, I've just talked about coding agents for Positron. I know that many of the folks here are probably still RStudio users, and I want to call out a quick coding agent that I've been spending some time on in recent weeks, and I hope you might be able to get some use from it.
That coding agent is called Sidekick. Sidekick is built entirely in R, and it runs inside of a Shiny app. So if you run this function, side::kick(), the first thing that it will do is spin up a Shiny app that runs in a background job inside of RStudio.
And it's a little hard to see because the viewer here has a light background, but there's a Shiny app pulled up on the left here that allows you to converse with the LLM via ellmer. So Sidekick can do the normal Claude Code or Codex coding agent sorts of tasks. So for one, we could ask this agent to refactor some code into a helper.
So these six lines, seven lines are pretty constrained in what they do, and we could probably make use of it in a few different places inside this package. So the model will first choose to run this code search tool that will help it find the code that I just passed along to it. And then it will read the whole file once it finds the function or the lines of code that I mentioned.
Once it's found those lines of code, it can then propose edits to the files using an edit files tool, and that tool will display the changes that the model wants to make inside of a nice diff format. You can choose to reject or approve the edits, but once you've approved them, those changes will show up inside of the real files.
You can also start new chats, and chats are persistent inside of Sidekick. The Sidekick agent is also able to run R code inside of the active session. It can see your active session, and you'll be able to see the results of the code that it runs nicely formatted inside of this UI.
So if you've had an eye out for a coding agent for RStudio, I would encourage you to give Sidekick a try. This is freely available on my GitHub in the link shown at the bottom, and this will also be in the notes for the talk that I'll share at the end.
I just want to quickly say: in Joe Cheng's keynote at posit::conf, Joe said that some AI functionality was coming for RStudio. I just want to clarify that this is not that. I'm really excited about the work that we're doing to integrate AI into RStudio for the users that do want that in the coming months. But this is sort of an early experiment for me.
Making it work in practice with secure deployments
This is the extent to which I'll talk about what's possible in R right now. And so I want to come back and revisit that second part of the conversation, which is what it looks like to actually make this work in practice.
So far in this talk, I've just talked about the Anthropic provider, which does not have zero data retention by default. And so you probably could not pass data that is confidential to these APIs. So how can you do that?
If I can, I would like to attempt a little mind reading here. And so if this statement applies to you, just say yes in the chat. Is it the case that your workplace has some approved secure deployment of an LLM? So in like AWS Bedrock or Databricks or Snowflake or some other provider of an LLM that you have the thumbs up to provide the data that you're doing data science on to that deployment?
Is it the case that the LLMs that you have access to via that deployment are, like, frontier-ish? Like, at this point, you might have Claude Sonnet 4.5, but you probably have Sonnet 4. You probably don't have Claude Haiku 4.5, which came out like a week and a half ago or something like that. Maybe you have access to DeepSeek models or some other open source models, or maybe you just recently got access to, like, the GPT-5 drop.
I would guess that when this deployment was first announced to you, it was via something like a chat bot. And it might be the case that that LLM has an API that it's surfaced from, and somebody on your team might have figured out how to talk to it via ellmer, but that might not be the case.
Okay, if you relate to some of these points, here's what I have to say about how you can make use of these tools using that approved secure deployment. It's likely the case that that deployment surfaces an OpenAI API-compatible endpoint.
When I say this OpenAI API-compatible endpoint, I don't necessarily mean that the model is an OpenAI model, a GPT-5 or something like that. What I mean is that a few years ago, when OpenAI was kind of the first big player that made this splash with the first release of ChatGPT, they also made that model accessible via this API called the v1 chat completions API. And since then, that API format has been the format by which almost all of the players have been surfacing their models.
You can connect to any deployment of an OpenAI API-compatible endpoint via ellmer. And the way that you do that is by setting a couple arguments to chat_openai(). You can probably get away with just setting the base_url and api_key arguments. Those arguments allow you to point the connection away from a given model from the OpenAI API and toward your own secure deployment.
And we've seen that a couple of our customers need the ability to set API headers specifically, so we've also provided an entry point for that. So if you wanted to use this sort of endpoint inside of Sidekick, I would take a look at these arguments inside of chat_openai(). The same goes for Positron Assistant. This is still preview functionality and isn't necessarily something that we're screaming from the rooftops yet. But if you are on the bleeding edge and you want to give this a try, you can connect to custom providers inside of Positron Assistant. Again, the edges are sort of rough here and we're still working on tuning this, but this is a possibility.
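In code, that might look like the following; the URL, model name, and environment variable here are hypothetical placeholders you'd swap for your organization's actual deployment details:

```r
library(ellmer)

# base_url points at your deployment's OpenAI-compatible endpoint;
# all of the values below are made-up placeholders
chat <- chat_openai(
  base_url = "https://llm.example.com/v1",
  api_key = Sys.getenv("INTERNAL_LLM_API_KEY"),
  model = "my-deployed-model"
)

chat$chat("Who are you?")
```

If your deployment also requires custom request headers, recent ellmer versions expose an entry point for those as well; see ?chat_openai for your installed version.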
Closing thoughts
This was Practical AI for Data Science. In like 2017 or 2018 or something, when support vector machines were what we referred to as AI, Posit really leaned into this phrase "serious data science," which I remember sort of speaking to: OK, there's a ton of hype about this AI stuff and a lot of it is pretty annoying, but there's also some real utility here. And if we're thoughtful about the way that we integrate this into our workflows, we can really get something out of it.
And I sort of see us now on a serious AI moment where we see some real utility with these tools. And we think that if you integrate them thoughtfully into your work, you can really get some value out of them.
So if you want to stay up to date with what we're thinking about when it comes to AI at Posit, Sara Altman and I are working on a newsletter that releases every two weeks. And you can check that out on the Posit blog under the AI newsletter tag. There will be a link to this in the repo that I'll mention now, which is github.com/simonpcouch/rpharma-25. This repository has a bunch of links to different things that I've mentioned throughout the talk, and I hope you can get some mileage out of it. Thank you for having me.
Q&A
Well, thank you so much, Simon. So many wonderful insights here. And we really thank you for your contributions here. And we may have time for one or two questions here, if you don't mind. I've recorded them from the chat.
First one comes from Tiana. She asks, if you ask the LLM to extract specific information from free text that wasn't actually there, is there a risk that it might return hallucinated data? And if so, is there a way to account or control for that?
Yes. So it is absolutely possible that a model could return hallucinated information here. The easiest entry point that I would think about for reducing the chances of this happening is providing some way for the model, in a specific format, to say "I don't have that information here," which in most cases will get you a lot of mileage. I would also say this is the sort of context where using the best possible, most expensive model that you have access to will greatly increase the accuracy here.
When you're building a system like this and you want to make sure that it behaves in the way that you'd like, there's a package called vitals, which is a sort of companion package to ellmer. And vitals implements LLM evaluation in R, which would allow you to measure empirically like, OK, here's 100 real use cases of trying to extract this information. If I've labeled these 100 and I know what the right answer is, how often does the system get it right? And so you can provide some evidence as to how that works.
And I've been watching vitals really closely. You've been one of the key leaders in this space on debunking a lot of the AI myths out there, while being really practical about how we can assess performance and where we use it in our day-to-day data science. So it's been wonderful to see those resources come through. So certainly, yeah, thank you.
Thank you again so much, Simon. We'll make sure to put the links that you mentioned in the chat as well as the recording when we get that up on the YouTube channel. So again, really appreciate that. And you also teased up the fact that we'll be going in depth in the data bot actually this afternoon with Joe Cheng himself. So you gave us a good little teaser for that. And we're going to dive right into making those details you gave a snippet about today. Great. Thanks, Eric.


