Resources

LLMs for Data Science

[2025 - Day 1 - Data Science & Algos] Hadley Wickham shares insights from practical applications of LLMs in data science, exploring three key areas where these tools prove genuinely useful beyond the hype: writing code, writing prose, and rectangling non-rectangular data. For data scientists working with text, images, videos, or audio data, this talk offers valuable perspectives on leveraging LLMs effectively for real workflows and transforming fundamentally unstructured information. ABOUT THE SPEAKER: Hadley Wickham, Chief Scientist, Posit. ABOUT DATA COUNCIL: Data Council brings together the brightest minds in data to share industry knowledge, technical architectures and best practices in building cutting-edge data & AI systems and tools.


Transcript

This transcript was generated automatically and may contain errors.

I wanted to start with a little bit of level-setting just about sort of LLMs and AI in general. I think one of the things that makes them really hard for me to kind of reason about is they sort of flip on their head things that computers are traditionally good at and things that computers are traditionally bad at. So, you know, you can ask ChatGPT, the latest ChatGPT, how many N's are there in unconventional? And you know, usually we expect computers to be pretty good at counting things, but I'm pretty sure there are one, two, three, four N's in there.

Or people have done various experiments asking ChatGPT to multiply large numbers together. And here the color scale shows the accuracy, right? Traditionally computers are pretty good at multiplying numbers together, and LLMs are terrible at it.

But LLMs are also, like, really good at things that computers traditionally could not do. Like, I can ask for a limerick about the R programming language, and maybe it's not the best limerick in the world, but it's better than I could do with 15 seconds of thinking. Or write an acrostic about ggplot2. It really turns on its head the things that we're used to computers being good at and the things that we just expect computers can't do.

That said, however, they might be able to do poetry, but they still cannot do jokes. Why don't data scientists ever go to the beach? Because they're afraid of overfitting their sunscreen and ending up with a collinear tan line. Like, it has kind of all the bits of a joke, but doesn't quite land.

The jagged frontier

And I think a really useful term for understanding the state of AI right now is this idea of the jagged frontier. The difference between the things that LLMs can do well and the things that they can't do well is often very, very narrow, and it makes it hard to get a sense of, like, where can I use this tool? What's it good at?

But today I really want to focus mostly on the things that it is good at, and I'm going to show you, across a few areas, kind of the places where I've had a lot of success. They're really useful for, like, brainstorming: give me a bunch of ideas. Doesn't matter if a bunch of them are shitty. I'm going to find four or five good ideas and go from there. Another thing I find profoundly useful is this idea of the blank page problem. Often getting started is really hard. It's often much easier to look at something that's, like, bad and be like, oh, that's wrong. That's wrong. I'll fix that. And so getting an LLM to spew out something that's kind of in the right ballpark can often be a great way to get started as well.

Clearly they're awesome at rapid prototyping. You can spin up and do experiments very, very quickly. And they're also great at getting rid of kind of boilerplate in your code and your everyday life.


So I think, like, overall, the thing that is hard is trying to figure out: where do we as humans play? What are the things that we are uniquely good at that we want to spend our time on? What are the things we want to hand over to LLMs to do? And what are the things where we're going to use the traditional stuff, you know, the things that computers are traditionally good at?

And my kind of fervent hope is that we will become, like, data science centaurs. I got this from the idea of chess centaurs. I don't know if you've heard this term before, but these are teams of humans and AIs playing chess together who do better than either does individually. So my fervent hope is that we can form this kind of centaur, where the AI is, like, the horse body that can carry us as fast as we need to go, but we've still got the human on top directing us to where we need to go.

I will say I tried very hard, but I could not get ChatGPT to generate a correctly proportioned horse body for this data scientist, one of AI's great weaknesses today.

Coding with LLMs

And so I'm going to talk about three areas that I think are pretty familiar to you as data scientists, where I'm using LLMs, and you probably are too. So coding, super obvious; writing, a little bit less obvious; and then this kind of general catch-all term that I've called data sciencing, which is not a great category system, but you'll just see a bunch of kind of interesting things.

So I'm going to talk the least about coding, because I think this is kind of the most obvious place. Like, everyone is talking about how to use LLMs to program better, so I want to focus a little less on this, but I do want to show you my personal favorite use case for LLM-powered code complete. And that is this case where I want to create a new section in my document, and an LLM fills in the dashes for me. There's just something about this I find so appealing: I'm using, like, literally billions of dollars of technology to complete dashes on a line, and it doesn't even get the right number of dashes. But still super useful, and I love it.

On the more practical side, this is a really cool application from one of the folks on my team, George. He was like, let me just try: I'm going to sketch out an app. I'm not going to show it live now, because the wifi is a little bit flaky, but I asked Claude Sonnet to, like, turn this sketch into a Shiny app, and it just did it. And to me, this is the coolest thing: you can take a new technology, for better or worse, something you don't know a lot about, you can make a sketch, and you can turn that into a working app in a matter of minutes. I think that is so, so, so cool.

I just wanted to use that to say, you know, I write R code; all of the examples I'm going to show you are basically R code, but obviously you can use other things like Python instead. If you are using R a lot, Simon Couch on my team has been working on a series of R evals, you know, formal evaluations of which models do a good job at generating R code. Currently, Claude 3.7 Sonnet is kind of my go-to, but the latest evaluation he did shows that OpenAI's o3 and o4-mini are just starting to edge ahead a little bit.

The last kind of coding thing, and I think this has been my biggest LLM-powered win, is this thing which I've tried to draw a picture of: the leaky token bucket algorithm. Now, how many people have heard of the leaky token bucket? Okay, well, you're lucky, because I had never heard of this before. I was implementing a new feature in one of my packages, and I just started brainstorming with Claude, like, hey, this is a new feature I want to add, how should I tackle it? And it mentioned this idea of using a leaky token bucket, and I was like, I've never heard of that before, but it turned out to be exactly the thing that I'd been looking for, maybe not my whole life, but certainly for a few years.

And just finding that idea. I think, to me, the weakness of Google was always: if you knew the name of something, you could find out information about it, but if you didn't know the name of something, it was terribly, terribly difficult to find. So now, this idea that you can have a conversation with an LLM, find the name of something, and once you've got the name of that thing, it gives you a tremendous amount of power. That was really useful. Then I could ask it to sketch out the algorithm for me. Also really useful, because the token bucket algorithm is something that I am terrible at implementing. There are a lot of small details that I am just not good at getting right, and I always want to try and just intuit my way to the answer, so that was really, really useful.
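To make the idea concrete, here is a minimal sketch of a token bucket rate limiter, in Python rather than R, since the talk notes Python works just as well. Tokens drip into the bucket at a fixed rate up to a capacity; each request spends one token, and a request is refused when the bucket is empty. All names and structure here are illustrative, not the actual implementation from the package mentioned in the talk.

```python
class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # maximum tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full
        self.last = 0.0                 # timestamp of the last refill

    def _refill(self, now: float) -> None:
        # Add tokens for the elapsed time, clamped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now

    def allow(self, now: float) -> bool:
        # Spend one token if available; otherwise refuse the request.
        self._refill(now)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Burst of 2 allowed up front; third call is refused; after 1.5 s a token
# has dripped back in, so the fourth call succeeds.
bucket = TokenBucket(capacity=2, refill_rate=1.0)
print([bucket.allow(t) for t in [0.0, 0.0, 0.0, 1.5]])  # → [True, True, False, True]
```

Passing `now` explicitly (instead of calling a clock) keeps the sketch deterministic and easy to test, which is also a reasonable way to structure a real implementation.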

Writing with LLMs

I've found it really useful, instead of letting it go away and immediately start drawing a picture, to say: ask me questions first. Being really explicit about, like, I want to give you more feedback before you get started can be really useful.

Oh, yeah. So, I'll tell you the thing. You know, I'm a long-term R user. The thing I fundamentally do not get about the Python ecosystem is that Python packages do not have cool logos. It's really, really important in the R community to have, like, a cool hex logo. So we do a lot of brainstorming for cute logos. I give it a pretty short prompt and then say, like, ask me some questions, and there's a little bit of back and forth before it comes up with something, and I think that seems useful. And then you always get to some point where you're, like, hey, can you try and put all the letters inside the hex again, and it's, like, no, and you try many times and just get slightly different versions.

Okay. Any other last writing prompt or writing ideas folks want to share? Yeah. Sean?

Something I just remembered that my company now uses it for is summarizing incident reports. So, when something's gone terribly wrong, it's like, right, this is what actually happened, and this is what went wrong, and so on and so forth.

Yeah. Yeah, I guess the other writing thing I've used it for, you know, occasionally you get emails where they're just, like, either, like, so toxic or just, like, push all your buttons in the wrong way that it's very difficult to read the content of the email without getting, like, emotionally triggered. And so, I have also asked ChatGPT to, like, rewrite an email that's been sent to me and be, like, hey, rewrite this in, like, a more neutral tone so I can actually, like, read that, and then that's the email I respond to.

Data sciencing: rectangling unstructured data

It's quite useful. Okay. And so now I wanted to talk about this last, kind of catch-all, data sciencing term. As, you know, an amateur data scientist myself, I guess, one of the things that I'm really excited about LLMs for is taking all of these types of data that I previously had no kind of handle on and turning them into nice, tidy, rectangular data sets. That's unstructured text, that's PDFs, that's images, that's video, that's audio. Now you can pretty easily take whatever those are, narrow down a little bit on what you want, and then extract data into the rectangles that we're all very used to dealing with.

So, I'm going to show you a few demos here, which will hopefully work.

So, the first one is I'm on the conference committee for posit::conf, and one of the things that we do a little differently at posit::conf is when you submit a talk, you don't submit a title and abstract, you make a one-minute video. And as someone who reviews talks, that's the best way to get a sense of: is this an interesting topic and an interesting speaker? That one-minute video gives you so much more information than an abstract. On the downside, it's in this video form, which we can watch as humans, but it's much harder to analyze. So this time, what I did is I used Gemini, and I'm not going to show you that here, but I used Gemini to get a transcript of every single one of the videos. And then what I wanted to do is take all those transcripts and try to figure out: what is this talk about?

There are a few buckets at posit::conf we're particularly interested in, like, what's the balance of talks about R and Python, and we know there are a lot of talks about reporting and Shiny and AI. So, what I'm using here is structured data extraction, which is a pretty common feature of most foundation models these days, where you give it effectively a JSON schema describing the data that you want. So you don't just get blobs of text; you get a nicely structured JSON output. Here, I'm using the ellmer package, which is my package to do this.

So, you can see I've got a talk object. It consists of a summary, which I want to be a two- to four-bullet summary of the talk, and then a bunch of topics: is it about R or Python or teaching or community, and then give me a few keywords. So then I'm going to use Google Gemini here, and I'm going to feed it, basically, let's see if this is going to work, all 300-and-something transcripts, so each of these is a separate request.
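To show roughly what that "talk object" looks like as a schema, here is a hedged sketch in Python (the talk notes Python works in place of R). The field names mirror the talk's description, a summary, per-topic major/minor labels with an explicit "unknown" escape hatch, and keywords, but the schema itself is illustrative and is not ellmer's actual API.

```python
import json

# Hypothetical JSON schema for the "talk object" described in the talk.
talk_schema = {
    "type": "object",
    "properties": {
        "summary": {
            "type": "string",
            "description": "A two- to four-bullet summary of the talk.",
        },
        "topics": {
            "type": "object",
            # For each topic, is it a major or minor feature of the talk?
            # "unknown" gives the model an explicit way to opt out.
            "properties": {
                t: {"type": "string", "enum": ["major", "minor", "unknown"]}
                for t in ["r", "python", "shiny", "ai", "teaching", "community"]
            },
        },
        "keywords": {
            "type": "array",
            "items": {"type": "string"},
            "description": "A few keywords describing the talk.",
        },
    },
    "required": ["summary", "topics", "keywords"],
}

# With structured output enabled, the model's reply parses straight into
# this shape instead of arriving as a blob of free text (values invented):
example_reply = json.dumps({
    "summary": "- Uses LLMs to rectangle video transcripts",
    "topics": {"r": "major", "python": "minor", "shiny": "unknown"},
    "keywords": ["LLM", "structured extraction"],
})
parsed = json.loads(example_reply)
```

The point is that the schema, not post-hoc parsing, is what guarantees you get rectangles back.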

Now we cross our fingers and see if this works. Oh, whoops, it's still running the Shiny app I was using earlier. Let's start that again. So, it's running my R code, and now it's sending all of those requests in parallel. I think this is pretty neat. So it's working through all, I guess, 200-and-something, and now I can find out how many tokens I spent. That was a grand total cost of 12 cents, and then I can turn it into a nice data frame.

If I do that, I get a nice data frame which tells me, for each of these topics, whether it's a major or minor feature. And then I do a little bit more data manipulation and turn it into a plot, and this is just so useful for us as a program committee to get a sense of what people actually submitted.
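The "little bit more data manipulation" step can be sketched very simply: collect each talk's topic labels and tally major versus minor mentions per topic, ready to plot as a bar chart. This is a Python stand-in for the R data frame work in the demo, and the data values are made up for illustration.

```python
from collections import Counter

# One dict of topic -> label per extracted talk (invented examples).
talks = [
    {"r": "major", "python": "minor", "shiny": "major"},
    {"r": "major", "python": "major", "ai": "minor"},
    {"r": "minor", "ai": "major"},
]

# Tally (topic, label) pairs across all talks.
counts = Counter(
    (topic, level) for talk in talks for topic, level in talk.items()
)
print(counts[("r", "major")])  # → 2
```

Each `(topic, level)` count becomes one bar segment in the kind of plot shown in the talk.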

So, you can see, not too surprisingly, we used to be rstudio::conf, so a lot of talks that majorly feature R, you know, lots of Python, lots of Shiny, quite a bit of AI. Sort of interesting. And I think another piece of good general prompting advice is to always give the model some explicit way to opt out if it doesn't know what it's doing, so I gave it this "unknown" option. Every time I run this, the bar chart looks a little bit different, but that's kind of okay in this case.

What I'm mostly interested in is, overall, on average, how are things looking? So this combination of LLMs, which are stochastic and give you slightly different responses all the time, with this more statistical mindset, where I don't actually care too much about the individual answers, I'm looking more at the averages and the trends, just feels like a really, really nice pairing to me. And the ability to churn through all of this text analysis so quickly, with very, very general tools, limited only really by my ability to write good prompts, just feels really, really cool and useful to me.


Chatbots and interactive tools

So, I have a couple of other kind of projects that I wanted to show. Sort of trying to think around, like, you know, like, what might data scientists use AI for? Of course, you can write chatbots. Everyone's most beloved way of interacting.

This is a chatbot we just made to make it easier for people to learn how to use Shiny. The idea of taking a chatbot that you've given some additional information, or some ability to call tools, can be really, really amazing.

So, one of the coolest applications of this that we had internally: we've been running some AI hackathons to try and just generally get people up to speed on what they can do with AI. And then our engineering R&D spend got audited for, like, 2022 or something. And part of that audit meant that everyone had to go back and figure out, like, what the heck was I working on three years ago? So our director of engineering made this chatbot and gave it the ability to do some tool calls to GitHub. So you could ask it, like, hey, what was I working on in 2023? What were the projects? What were the risks associated with them? And, sort of similar to your story about compliance, that sort of boilerplate-y compliance thing, it's important, no one really loves to do it, and having an LLM help you is really, really beneficial.

So, kind of along those lines, we have another tool called Query Chat, which is sort of interesting. As a data scientist, I think, unfortunately, dashboards tend to be our kind of bread and butter. And dashboards are great in terms of: here's the important stuff. But if the person looking at the dashboard has some slightly related question that they also want to answer, their only recourse is basically to email you. And, you know, you don't want people emailing you, asking you for things. So what if you could make some special-purpose little chatbot that you could slap on the side of any dashboard? Someone looking at it is still kind of in the scope of that dashboard, but now they can ask novel questions.

So, this is just some very old tipping data. I could say: please show me only smokers. You can tell how old this data is because it includes whether the people were sitting in the smoking section of a restaurant or not. So it goes away, generates some SQL, you know, not super complicated SQL, and then applies it to the dashboard. So now you've got this customized, filtered view, and your stakeholders can be asking and answering their own questions to some extent. It's a little scary; you have to think about how you can protect them from themselves to some extent. But that feels pretty compelling to me. Empowering other people in your organization to answer their own questions in this fairly scoped environment seems like a big, big win to me.
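A hedged sketch of what that round trip boils down to: the LLM turns "please show me only smokers" into a SQL filter, which is then applied to the dashboard's data. Here we mimic that with Python's built-in sqlite3 and a tiny stand-in for the classic tips data set; the column names and generated query are assumptions for illustration, not Query Chat's actual output.

```python
import sqlite3

# A tiny in-memory stand-in for the tipping data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tips (total_bill REAL, tip REAL, smoker TEXT)")
conn.executemany(
    "INSERT INTO tips VALUES (?, ?, ?)",
    [(16.99, 1.01, "No"), (10.34, 1.66, "No"), (21.01, 3.50, "Yes")],
)

# The kind of SQL the chatbot might generate -- not super complicated:
generated_sql = "SELECT * FROM tips WHERE smoker = 'Yes'"
rows = conn.execute(generated_sql).fetchall()
print(rows)  # only the smoker rows remain
```

Keeping the model's output confined to a `SELECT` over known tables is one simple way to start "protecting stakeholders from themselves" in a scoped environment like this.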

AI-assisted exploratory data analysis

And the last thing I wanted to show you is sort of tools for doing data science yourself. So, I'm going to load in a bunch of data about the Simpsons from a Tidy Tuesday data set.

I tell you, nothing is more fun than debugging code in front of an audience with one hand. Okay, here we have a data bot. So, it's going to get started, and I'm going to say, like, hey, just take a look at the data that's loaded in my session. So it's going to go away and say, oh, what are all the data sets you've got loaded? I'm going to take a look at the Simpsons data set. And then it's prompted to really engage with you: rather than just going off and doing stuff, it's going to say, hey, what do you want to do? Well, let's look at the trends over time. And so this is generating R code, or Python code if you're so inclined. It's running that code. It's creating plots. It's sending the plots back to the LLM so the model can interpret them. It's creating summary tables and interpreting those. And it's just trying to help you do this whole exploratory data analysis cycle of looking at the data, generating questions, writing code to answer those questions, and iterating again and again and again.

I don't see this as being, like, a replacement for the data scientists. But this is something that I think can really help you get up to speed with a new data set very, very quickly.

Wrapping up

So, we've got just a couple of minutes left, so we won't have any more time to chat. I just want to recap: as a data scientist, a lot of your job is coding, a lot of your job is writing, and the rest of it is data sciencing. Coding, I think, in general, we've got a good handle on how you can use LLMs to code. Writing, we're still learning about. Data sciencing, tons and tons of opportunities.

And I just wanted to finish with a few of the resources that I find most useful for keeping up to date. Simon Willison does a great job of giving you just sort of a vibe check on the latest models. I really enjoy Ethan Mollick's Substack, which is data science-y, but sort of AI in general. I just started following Hilary Gridley, who gives good advice for using LLMs more generally in business. And then, finally, I love Lynn Cherny's newsletter, which is really about weird and whimsical uses of AI and LLMs. Thank you.