Resources

Making the most of artificial and human intelligence for data science (Hadley Wickham, Joe Cheng)

Abstract: This presentation explores the complex and often contradictory nature of large language models (LLMs) in data science, acknowledging the simultaneous excitement and apprehension that we feel toward these technologies. We'll provide a practical framework to help you understand the LLM ecosystem (from foundation models and hosting to SDKs and applications) that supports our current philosophy: augmenting, not replacing, human intelligence. The talk demonstrates how Posit is addressing this space through two complementary approaches: building SDKs and tools that help you create your own LLM-powered solutions, and developing integrated LLM capabilities directly into data science workflows through tools like Positron Assistant and Databot. We'll showcase practical, immediately useful applications while addressing current limitations, providing you with both the emotional preparation and technical foundation needed to effectively leverage LLMs in your data science practice today. posit::conf(2025). Subscribe to posit::conf updates: https://posit.co/about/subscription-management/


Transcript#

This transcript was generated automatically and may contain errors.

Good afternoon, welcome back. Right now we're about to begin our afternoon keynote, and this keynote is on the topic of AI, so I thought I would use AI to write the speaker introductions. But you know, I think we've all been spoiled by the recent, really good AI, and we've forgotten that AI mostly used to be made of Markov chains with spelling errors. And so here is today's keynote introduction, written with yesterday's AI.

Hadley Wickham arrived on earth at the age of new, and immediately began organizing his toys into tidy data sets. The Chief Data Scientist at Posit, PBC, Hadley is mostly made of code and polite enthusiasm. He has science and models. He invented the tidyverse while waiting for the bus. And the bus has still not arrived. Hadley loves writing packages, writing functions that write functions, and teaching humans to pivot longer and wider until they disappear entirely. He once tried to eat spaghetti but turned it into a graph instead. His hobbies include wrangling data, wrangling more data, and quietly becoming data himself. The community gave him a syntax because he spoke like a plot, and he looked like a plot, and in many ways, Hadley was a plot. We owe him so many vignettes.

Joe Cheng is the former, sorry, current CTO of Posit, PBC, a former child, and the original inventor of Shiny, which is both an R package and a lifestyle choice. Joe once discovered that data wanted to become apps, and instead of ignoring it like the rest of us, he simply allowed it. He has been programming since before computers were invented. Colleagues describe him as compile once, run away, and garbage collected, but still extremely memory safe. His hobbies include inventing recursion, collecting bugs for sport, and disguising machine learning as a small talk. Joe is widely respected as a thought leader, but only by thoughts. He has no servers, but he thinks he is one. Joe speaks in Dashboard, his teeth are reactive inputs, and he frequently reminds reality that consistency models are just very polite disagreements.

Please welcome to the stage Joe Cheng and Hadley Wickham.

Acknowledging the complexity of AI

Thanks, thanks Jonathan. So Joe and I are going to talk about AI today, and whenever you talk about AI, I think it's really important to acknowledge the big feelings that many of us have about it. My pendulum of existential dread swings between mild and moderate depending on the day. There are so many cool things that LLMs can help us do, but there are also so many things that are clearly evil and that we should be worried about, whether it's the fact that they're worsening climate change, the blatant theft of copyrighted materials, the misleading answers they generate, the fact that they're destroying jobs, or that they're concentrating wealth into the hands of those who are already incredibly wealthy.

There's a lot to worry about when it comes to AI, and I think it's very easy in the midst of all this to feel disempowered: what can you as an individual do? So one of our goals today is to empower you to understand a little bit more about LLMs and how you can use them to do data science, and not just do data science, but do it responsibly, because that's one of the core values we have at Posit.

So I'm going to show you a bunch of different ways you can do this, but there are three themes you'll see throughout my presentation and Joe's presentation. You want to find problems that LLMs are good at; this is kind of self-evident, I guess. But you also want to augment them with your existing programming skills: the fact that you can program still makes you incredibly valuable, and those skills become even more useful when you combine them with an LLM. And finally, for many applications it's really important to keep LLMs on track by tightly integrating some kind of human feedback loop.

What makes LLMs hard to reason about

But before we talk about those specific applications, I want to talk about two things that I think make LLMs particularly hard for us to grapple with. The first is that they turn what we know about what computers are good and bad at on its head. Traditionally, we think of computers as being pretty good at calculating things, but if I give some numbers to an LLM and say, hey, can you compute the mean and the correlation of these numbers, they'll 100% give it a shot. But every single number will be incorrect.

So LLMs can be really bad at things that we expect computers to be good at, but they can now do things, like write poetry, that we never expected computers to be able to do. And maybe this is not the best poetry in the world: "A thousand tiny verbs curl like commas, they lean into sunlight, pipes become rivers, and data sings itself clean." But the fact that we can now use computers to write poems, I think, is really cool.

And this slide reminds me of my next point, which is one of the other reasons LLMs are so tough to think about: it's a real head-scratcher to predict whether they're going to do a given job well or poorly. If you imagine a two-dimensional chart with difficulty on the x-axis and performance on the y-axis, you'd normally expect to see some kind of inverse relationship: as the task gets harder, performance decreases. But if you've read anything about LLMs, you've seen them absolutely fail on the simplest of tasks, like "how many b's are there in blueberry", which is a pretty famous recent example. And if you've engaged with them a little, you've probably also had the experience of giving one a really hard task, like a casual handwritten sketch of a Shiny app, and having it one-shot that into perfectly working R code.

So as you gain more experience with LLMs, I don't think you get much better at predicting what they're good at and what they're bad at. And so my first point to you: if you've tried LLMs a little and noticed that they're pretty crappy at pretty simple things, that's not a reason to give up on them, because for some weird reason they seem to be really good at really hard problems as well.

Programming with LLMs using ellmer

But let's go back to my goal, our goal in this talk, which is to empower you to use LLMs for data science responsibly. We're going to tackle this in two parts. First, I'm going to talk about the tools we are building that help you use LLMs to do data science, and then Joe is going to talk about some of the tools we have been building that use LLMs to help you do data science. So I'm going to talk a little bit about programming with LLMs, hopefully demystify a little bit of jargon, and then show some very concrete examples of where we believe you can use LLMs today to do data science responsibly. And we're going to do that by focusing on cases where either we have a really tightly constrained scope, or we've really thought about the human factors and developed a really tight feedback loop between the human and the LLM.

So at Posit we've been working on two packages, ellmer for R and chatlas for Python, that aim to make it as easy as possible for you to use LLMs from R or Python. For hopefully obvious reasons I'm going to focus on ellmer, but pretty much everything I show you in ellmer you can do in chatlas as well. So what does using an LLM from code look like? Well, here I load the ellmer package and create a chat object with chat_anthropic(); Claude is one of my favorite models, not least because it does a pretty good job of generating R code. Then I can ask it a question, and it answers just as if I were using a chatbot on the web. And if you want an experience even more like a chatbot on the web, you can use shinychat.

And we can get a really funny joke about scientists. Cue laughter.

And we'll see later that integrating LLMs into your existing Shiny app can be a pretty powerful way of adding new features in ways that we think are probably pretty low risk. Now of course, as well as Anthropic, ellmer supports a bunch of other providers: all of the major frontier models, and all of the major providers that you might be using inside your enterprise.

Tool calling

To go a little further with ellmer, you're going to need to learn a little bit more about how LLMs work today, or at least how the APIs that we use to work with LLMs work. And if you read anything about LLMs, there is just so much jargon, so much bullshit to wade through. I think one of the services I hope we can provide at Posit is to demystify some of that, to dig through some of that bullshit so that you don't have to.

And the first thing I want to talk about is a really powerful idea called tools, or tool calling. Tools were invented to solve a pretty major problem with LLMs, namely that they don't have any access to data about the real world. So if you ask an LLM what day it is today, best case scenario, it's going to say "I don't know". Worst case scenario, it's just going to make some shit up. That's where tools come in. And if you know how to program, it's really easy to define what a tool is, because a tool is just a function, an R function or a Python function, along with some metadata that tells the LLM how to use it.

So let's define a tool. Here I define a tool that provides today's date. It's a function that calls the base Sys.Date() function, which returns today's date. I give it some metadata: a name, a description, and the arguments it takes, in this case none. Then I register it. And once I've done that, if I ask the LLM again what day it is today, it can now give me the correct answer.
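To make "a function plus metadata" concrete, here is the same idea sketched in plain Python. The function and the dict are illustrative: the exact registration call and field names differ between ellmer, chatlas, and the underlying provider APIs, so treat this as the general shape rather than any one library's API.

```python
from datetime import date

# The tool itself is just an ordinary function.
def get_current_date() -> str:
    """Return today's date in ISO format (YYYY-MM-DD)."""
    return date.today().isoformat()

# The metadata tells the model what the tool does and what arguments it
# takes (here: none). The field names follow the rough shape that LLM
# tool-calling APIs expect, but they vary by provider.
get_current_date_spec = {
    "name": "get_current_date",
    "description": "Returns today's date. Use for questions like 'what day is it?'",
    "parameters": {"type": "object", "properties": {}, "required": []},
}

print(get_current_date())
```

Registering the tool amounts to handing both the function and this spec to your chat object; the spec is what actually gets sent to the model.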

So how does this work? It would be pretty surprising if every LLM provider was running R in the back end and could somehow magically run all of my R code. That's not quite what's happening. When I send this question to the LLM, it does a little bit of thinking and realizes: okay, I'm going to need to call this tool. So it sends that request back to me. Then ellmer (or chatlas) receives the request, calls the function, and sends the result back to the LLM, which can now generate an answer that comes back to me, the human. This is really powerful, because anything you can do in your R session or your Python session, you can now have an LLM do for you. Really cool, really powerful, but also really dangerous, as you can imagine, and as I'm going to demonstrate on my next slide.
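That round trip is easy to simulate without any API at all. In this Python sketch, fake_model is a scripted stand-in for the LLM (a real one would be an API call); the important part is who runs the function: the client-side loop, not the provider.

```python
def fake_model(messages):
    """Stand-in for the LLM API: first asks for the tool, then answers."""
    last = messages[-1]
    if last["role"] == "user":
        # The model can't know the date, so it requests a tool call.
        return {"type": "tool_call", "name": "get_current_date", "args": {}}
    # Once it sees the tool result, it can produce a final answer.
    return {"type": "answer", "text": f"Today is {last['content']}."}

def chat(question, tools):
    """The client-side loop that ellmer/chatlas implement for you."""
    messages = [{"role": "user", "content": question}]
    while True:
        reply = fake_model(messages)
        if reply["type"] == "answer":
            return reply["text"]
        # The client, not the provider, runs the function and sends the
        # result back to the model as a new message.
        result = tools[reply["name"]](**reply["args"])
        messages.append({"role": "tool", "content": result})

print(chat("What day is it today?", {"get_current_date": lambda: "2025-09-17"}))
# → Today is 2025-09-17.
```

This is why tool calling works with any function you can write: the provider only ever sees the tool's name, arguments, and result.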

So here I'm going to empower my LLM with a tool that allows it to run arbitrary R code. Now, I'm showing this mostly for the purposes of illustration; you probably shouldn't do this, right? Because now the LLM can do anything you can do in R, and that is pretty much anything. You could ask this LLM to delete every file on my computer, and potentially it could do so. But what's really interesting, and I think one of the reasons people are so excited about LLMs at the moment, is that if I give it this tool and re-ask that initial question, what's the mean of x and y and their correlation, the LLM, instead of trying to compute it incorrectly itself, will write a bunch of R code that does the computation correctly, and now I get the correct results.
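If you do want to give a model a code-running tool, one mitigation is to constrain what the tool will execute. This is a toy illustration, not how the demo above works (that one really does run arbitrary R): a Python calculator tool that walks the syntax tree and evaluates arithmetic only, so the model can compute a mean but can't reach the filesystem.

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_calc(expr: str) -> float:
    """Evaluate an arithmetic expression without eval(): only numeric
    literals and +, -, *, / are permitted; anything else raises."""
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("only arithmetic on numbers is allowed")
    return ev(ast.parse(expr, mode="eval").body)

# The model can ask for a mean...
print(safe_calc("(85 + 90 + 78) / 3"))
# ...but code that reaches outside arithmetic is rejected.
try:
    safe_calc("__import__('os').system('rm -rf /')")
except ValueError as e:
    print("rejected:", e)
```

The trade-off is exactly the one in the talk: the narrower the tool, the safer it is, and the less the LLM can do with it.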

And I think this is really neat because it means that LLMs are not a closed system, they are something that you can actually add to. If you know how to write an R function, if you know how to use an R package, or a Python function, or a Python package, you can extend LLMs in pretty much any way you can imagine.


Agents and structured data extraction

And that brings me to my next topic: agents. If you've read anything about AI lately, you've probably heard "agent" a million times, or "agentic" a million times, but it's really, really difficult to figure out exactly what an agent is. If you know what a tool is, though, I think it's actually easy to understand: an agent is just an LLM equipped with tools, a tool that allows it to read something from the world, and a tool that allows it to change the world in some way. Or a little more broadly, an agent is just an LLM combined with things that we already know how to do in programming languages, like for loops and functions. I'm going to leave you there with agents for now. They're really cool, but you'll have to wait until Joe's demos to see some of the agents that we've been working on.
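As a toy illustration of that definition, here is an agent loop in plain Python. The "model" is a scripted stand-in that picks one action at a time (a real agent would ask an LLM); everything else, the tools, the loop, and the stop condition, is ordinary code. The file names and tool names are made up for the example.

```python
# A pretend filesystem the agent can read from and write to.
fs = {"notes.txt": "hello world"}

tools = {
    "read_file": lambda path: fs[path],                          # observe the world
    "write_file": lambda path, text: fs.__setitem__(path, text), # change the world
}

def scripted_model(observations):
    """Stand-in LLM: read the file, write an upper-cased copy, then stop."""
    if not observations:
        return ("read_file", {"path": "notes.txt"})
    if len(observations) == 1:
        return ("write_file", {"path": "NOTES.TXT", "text": observations[0].upper()})
    return ("done", {})

# The agent loop: ask the model for an action, run the tool, repeat.
observations = []
while True:
    action, args = scripted_model(observations)
    if action == "done":
        break
    observations.append(tools[action](**args))

print(fs["NOTES.TXT"])
# → HELLO WORLD
```

Swap the scripted function for a real model call and this while loop is, structurally, the whole agent.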

If you are interested in learning more about AI, Simon Couch and Sara Altman write a Posit AI newsletter that comes out every two weeks. This is really our effort to help you keep up with all the news, explain some of the jargon, and update you on what we're doing with AI as well.

So now I want to show you a couple of applications using these tools where we think LLMs can really help you do data science, by really tightly scoping the problem. In the first case, I've got some data: names and ages embedded in unstructured text, and I want to extract them. If you spend a few seconds contemplating this, maybe, if you're really good with regular expressions, you could come up with something to extract the names and maybe the ages, but it's going to be pretty tough.

But this is actually something that LLMs are really good at, and there's a pretty low risk of hallucination here, because you're not asking the model to create something new, you're asking it to extract something from a larger text. And if you want to do this programmatically, all you need to do is define the schema: the names of the fields, in this case name and age, and the types of those variables, a string and a number. Then you can use a function like parallel_chat_structured(), which will call the LLM APIs in parallel, extract all that data, and get it back to you in a nice rectangular data frame.
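The shape of that workflow is easy to sketch in Python. Here the extract() function is a canned stand-in for a single structured-output LLM call (a real one would send the text plus the schema to a model), so the sketch runs without an API key; the parallelism and the stack-into-rows step are the real pattern.

```python
from concurrent.futures import ThreadPoolExecutor

# The schema: the fields we want and their types. ellmer expresses this
# with its type_*() helpers; this dict is the same idea in plain Python.
schema = {"name": "string", "age": "number"}

def extract(text):
    """Stand-in for one structured-extraction LLM call. Hard-coded
    results so the example is self-contained."""
    canned = {
        "John Smith turns 31 tomorrow": {"name": "John Smith", "age": 31},
        "Her pal Jane is 1 year older": {"name": "Jane", "age": 32},
    }
    return canned[text]

texts = ["John Smith turns 31 tomorrow", "Her pal Jane is 1 year older"]

# parallel_chat_structured() fires the per-document calls concurrently and
# stacks the results into rows; ThreadPoolExecutor plays that role here.
with ThreadPoolExecutor() as pool:
    rows = list(pool.map(extract, texts))

print(rows)
# → [{'name': 'John Smith', 'age': 31}, {'name': 'Jane', 'age': 32}]
```

One unstructured document in, one row out, in the original order: that's the rectangle.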

And I genuinely believe this is a data science superpower: the ability to go from unstructured text into nice, tidy rectangles. Because if there's one thing data scientists are good at, it's working with rectangles.

And you're not just limited to text; the same ideas work with images, video, audio, PDFs, and so on. So here I've got four images, and I can write a little bit of R code to say: take these images, extract the animal, and extract the background color. Again I use parallel_chat_structured(), and I can turn these into a data frame. Now, if I do this a few times, you might notice I don't always get exactly the same values. I'm not going to make you play spot-the-difference here, but sometimes the LLM reports this as a kitten with a gray background, and sometimes it says it's a cat with a beige background. I don't think this is a problem with the LLM per se; it's a problem with us not having precisely specified what we want.

And I think one of the reasons it's important to think about this is that whenever you're using an LLM seriously, you need to think about how you evaluate it: how do you tell if the LLM is doing a good job? The art and science of evaluating your LLM are complicated but really important to learn, and I think people with a background in statistics and modeling are really well suited to understand that and do a good job of it. So at Posit, and friends, we've worked on two packages for this: vitals for R and the Inspect package for Python.

QueryChat for dashboards

Now another domain where this kind of tight constraint can be really useful is dashboards. So here I've made a dashboard visualizing some data about James Beard award-winning restaurants. I'm not going to go into too much detail on my not particularly amazing dashboard, but you can see there's a map, and if you look at the top you can imagine there are maybe some charts and tables. What I want you to focus on is a very common need in dashboards: filtering. In this case we might want to filter by year, by whether they were a winner or just a nominee, by what category the award was for, what the cuisine was, or what the restaurant's price range was.

And writing this code is pretty tedious, because you've got to write a bunch of front-end code and then carefully pair it with back-end code. This is certainly less tedious now than it has ever been; I'm not going to lie, I had Claude write all that code for me. But it's still a lot of code, and at the end of the day it's not very flexible. That's the problem querychat was designed to solve. querychat asks: what if we just turned this on its head? What if we just gave you a little chat box and you could ask for whatever data filtering you wanted?

And this sounds a little dangerous, but I think we've made it pretty safe by really constraining it: all this LLM can do is generate SQL SELECT queries. So you can ask a question like, show me all the winners in Texas with a price of "$$$" or less. That's something you could easily do with a point-and-click filtering interface, but now the LLM goes away and generates the SQL. And LLMs have seen a lot of SQL code, so they're by and large pretty good at generating it. Or you could ask questions that you could certainly answer with that point-and-click interface but that would be pretty tedious, like show me all the non-American restaurants in the Midwest. And while this is not exactly the SQL code I would write personally, it certainly gets the job done.
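querychat's real implementation is more careful than this, but the shape of the guardrail is easy to sketch: validate model-generated SQL as a single read-only SELECT before executing it. Here is a minimal Python version using sqlite3 and made-up restaurant data; the startswith check is deliberately crude (it would reject a legitimate WITH ... SELECT, for instance), so treat it as an illustration, not a production validator.

```python
import sqlite3

def run_select_only(conn, sql):
    """Execute model-generated SQL only if it's a single SELECT statement."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped or not stripped.lower().startswith("select"):
        raise ValueError("only a single SELECT statement is allowed")
    return conn.execute(stripped).fetchall()

# Made-up data standing in for the James Beard dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE awards (restaurant TEXT, state TEXT, price TEXT)")
conn.executemany("INSERT INTO awards VALUES (?, ?, ?)", [
    ("Uchi", "TX", "$$$"),
    ("Franklin Barbecue", "TX", "$$"),
    ("Alinea", "IL", "$$$$"),
])

# A query the model might generate for "show me the winners in Texas":
print(run_select_only(conn, "SELECT restaurant FROM awards WHERE state = 'TX'"))

# Anything else is rejected before it touches the data:
try:
    run_select_only(conn, "DROP TABLE awards")
except ValueError as e:
    print("blocked:", e)
```

Because the tool refuses anything but a SELECT, the worst a bad generation can do is return the wrong rows, not mutate or delete the data.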

And you can start to ask questions that there's no way you could possibly express in a dynamic filtering UI, like: can you show me all the restaurants that were nominated at least five times before winning? And that generates a lot of SQL. I'm not going to lie, I think this is currently a weak point of querychat, because I would certainly want to sit down and spend 15 minutes seriously contemplating that SQL before I could convince myself it was correct. So my hope is that we can get querychat to the point where, instead of just showing you the SQL, it says: hey, this SQL is pretty complex; you might want to talk to your friendly local data scientist to confirm that these results are correct.

ggbot2 demo

So the last thing I want to show off, and the transition between my part of the talk and Joe's part, is a little tool that Joe spent about five hours on and I maybe put another hour into. It's not quite at the point where you can recreate it easily, but it's something we hope to package up soon so that it's easy to add to your own apps. A little something I call ggbot2.

Hey ggbot2, can you help me visualize the James Beard data? Absolutely. I'm ready to dive into the James Beard data with you. Just let me know what kind of plot you want to create. Maybe you're thinking about comparing chefs, exploring restaurant locations, or looking at trends over the years. Just say the word and we'll whip up a beautiful plot in no time. Let's start with a map.

Great choice. A map can give us a fantastic overview of where all these award-winning restaurants are located. We can plot their latitude and longitude and maybe even color them by something like the award category or the year. Let me whip up a basic map showing the locations of all the restaurants. Now let's focus on the restaurants just in the continental U.S.

Looks like there's a ton of overplotting; each city has probably got lots of points on it. Can we display that better in some way? Absolutely. Let's fix that by aggregating the points. We could use something like a hex bin map or a 2D density plot to show the concentration of restaurants. This will give us a much clearer picture. Let me whip that up real quick. Let's start with the original city locations, but just make the size of the point proportional to the number of restaurants there.

Yeah, I like that, but let's make the size differences way bigger: the big points bigger and the small points smaller. It looks like the borders are on top of the points. Can you make them go behind the points?

Yeah and maybe let's try making the points hollow.

There's still a lot of points on this plot. What if we just focus on the top 20 cities?

Let's do the top 50 cities.

Yeah I like that and can we label them somehow?

Oof, there's a lot of overplotting in those labels. Can you make it easy to read every single label, even if they're pretty close together?

Yeah, that's looking pretty good. I think we can finish it off by adding a little pizzazz. This is a map, so can you make it look more like one? Like get rid of the background and the axis names, etc.

And how about we make the states like pale gray with white borders?

Yeah I like that. Thanks ggbot. You're welcome. I'm glad you like how it turned out. If you ever want to tweak it more or explore other visualizations just let me know.

So I think there's something really interesting to me about this interface. Obviously the AI here is a bit of a suck-up; that's pretty common. But it's still very code focused: I'm still generating ggplot2 code, I'm just not literally doing it by frantically typing at the keyboard. And I have to say, I actually created this map by hand first, and it took me like 20 minutes to remember how to put state borders on a map in ggplot2. That tight feedback loop is not always perfect, and it makes mistakes, but you can correct it pretty naturally and easily.

That very fast and fluid kind of interaction is really important; my goal has always been for the code to get out of the way so that you can express your ideas and interact with the data as quickly as possible. And I think the other thing you see here is that you couldn't just give this to anyone. I'm using my knowledge of good visualization practices; I know what ggplot2 can do. In other rounds, not the one I recorded, it would go off the deep end and start using weird ggplot2 features, and I would just tell it: hey, actually, I want to use this geom. So it absolutely benefits from my expertise as a data scientist and software engineer, but I'm no longer quite so limited by my memory of ggplot2 code.

So on that note, I'm going to hand over to Joe. This is a tool we've spent maybe six hours on collectively; Joe's going to talk about some of the tools that we've spent tens or even hundreds of hours on. Yeah, yeah: but your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should.

Positron Assistant and Databot

Thanks Hadley.

I'm here to talk about the two agents that we're working on right now, which are Positron Assistant and Databot. Actually, how many of you have tried one or the other of these? Okay, quite a few. All right, sweet.

So Positron Assistant is the first one, and it's for general purpose coding; the second one, Databot, is about exploratory data analysis. Now, a lot of people seem to get them confused and say they look alike; they clearly have the same taste in glasses and mouth parts, I guess. But it's important to realize what each is for. Positron Assistant is the one you'd ask questions like "please help me write this code", while Databot is the one you'd ask to help you explore data. Positron Assistant is a built-in feature of Positron; Databot you need to install as an extension from the Extensions pane.

And finally, Positron Assistant is a research preview. There are definitely rough edges, things that need to be polished and improved, but we already think it's very useful, and we're seeing rapid improvements each week and month. Databot is also a research preview. We have been using it a lot internally, but it has not had that much real-world use by you all in the community, so we are very eager to hear your feedback, and whether you find this style of EDA bot useful.

So I only have time to give you the briefest introduction to each of them today, so I encourage you to try them out. If you are here at posit::conf in person, there will be short talks by George Stagg and Winston Chang tomorrow at 12:10 p.m. in the Learning Center. I found out after I made this slide that the capacity of the Learning Center is maybe 60 people, so line up early. But luckily, you can go to the lounge at any time; there are four stations set up with people ready to walk you through these agents, and you can do that anytime tomorrow.

If you are joining us virtually or if you have a laptop you can download the latest Positron build at the Positron homepage and you can click on the features link at the top of that website and look for AI tools and there will be very detailed instructions for how to get started.

So let's first talk about Positron Assistant. Positron Assistant is a general purpose coding agent, and it looks like this. It is analogous to Copilot Chat. Probably the number one feature request we've had for Positron from the beginning is "where is Copilot?", and here is our version of Copilot.

In addition to doing normal coding agent things like reading your code and helping you write it we also have some Positron specific features that add to the context and tools that are available to the LLM. So for example one of the unique features of Positron is that we have a live R or Python session or console at all times and with Positron Assistant it has the ability to read your variables or inspect the shape of those variables that you currently have loaded so that it can write better code for you. You can also have it execute code directly in your current R or Python session. It can also read your console and help you with tasks that you're doing just in the live console. And finally it can see your plots in the plots pane so that you can work together with it to diagnose problems you're seeing in the plots you're generating.

So enough talk; let me show you a little bit of a recording. In this case we have some bird data on disk, so we say: please write a Python script that loads this data. This works equally well with R. It looks to see what packages are installed; it's going to find polars and write some polars code to load this CSV into a data frame. I can take this code and insert it into my existing document or copy it to the clipboard, but in this case I'm going to insert it into a new file. I'll save that as birds.py, and in Positron, with one click, you can run it in your active console. You can see in the upper right that it has created the df data frame, and we can show the first few rows if we like.

Now I'm going to ask a follow-up question: let's do the simplest thing and have it write a little more code to print some basic summary statistics. And now it offers some edits to my existing code. For some reason, this particular time, it decided to rewrite the original code as well. No worries: when you go to apply it, Positron Assistant is smart enough to know that it only needs to apply the new code and not the existing code.

So we can review this code and decide to keep it or undo the change. In this case it looks good, but when we hit run, it turns out there are some deprecation warnings. So it worked, but there were some warnings. I can go straight to Positron Assistant and ask it to please explain the errors that are in the console. I don't have to copy and paste those errors; it can see them, and it immediately knows that they're deprecation warnings. Which is a little bit frustrating, because it also wrote the code, but, you know, we are in 2025, so it is what it is. And I can decide to apply those changes if I like. I found out after I made this recording that the good folks on the Positron team have added a shortcut, /fix, that does exactly that. Or in the case of true errors, not just warnings, you will see a link right underneath the error that says "fix" or "explain", which is pretty nice.

So besides writing code in files for you, you can also have Positron Assistant directly execute code at the console. So in this case I just want to see a plot of the top birds by count. I don't need to save this. So it is going to write that code for me and allow me to run it in my session directly. So I can review the code I can hit run it executes in my normal console and then we can see the results in our normal plots pane.

So there's so much more that Positron Assistant can do but I'm going to have to leave it there just for time. Again I encourage you to check it out in the lounge or just download it and try it yourself. Now before you can do that I have to point out some requirements and some limitations and these apply to both Positron Assistant and Databot. Number one, these agents are for Positron only and I know a lot of you are asking where is AI for RStudio. That is what we are working on next and it is something that we are taking very very seriously. I can't promise that they're going to take exactly the shape of the agents that you see today only because the ground is shifting so fast and tomorrow's models are going to be different than today's models. But certainly we're bringing the same values that we have in building these agents into working for RStudio as well.

Secondly, and maybe most surprisingly, both of these agents require an Anthropic API key; they only work with the Anthropic API currently. We are imminently going to introduce support for accessing Anthropic models through many other providers. So Copilot, Bedrock, and LLM gateways like Portkey, as well as Databricks and Snowflake, all provide access to those same Anthropic models, and we'll let you hook those up to Positron Assistant and Databot soon. A trickier question is when we will support other models. That's a little more TBD, because these models are not one size fits all; they are not fungible at all when you get into more advanced agents. We have had a lot of success using the Anthropic models and less success using the others, but I'm very hopeful that with a little more work we can prompt our way into getting good results with a number of frontier models, and then we'll be able to open up support for OpenAI and Gemini as well. But obviously I can't promise when that's going to happen until we've done that work.

Finally, from a security perspective, I do need to warn you and rain on the party a little bit. These are advanced coding agents, and all advanced coding agents today suffer from some very scary security implications. If you have not heard the terms "lethal trifecta" or "OWASP LLM Top 10", I encourage you to take a picture of this slide, especially if you work with any sensitive information or if you have a say in the security posture at your organization and how LLMs and data science can work together. Google those later, and good luck sleeping tonight.

The upshot is that this dialogue, the "do you trust the files in this folder?" one, is something we're going to have to take much more seriously. I think a lot of us have been conditioned to just click yes, but the attack surface is rapidly increasing. And you need to think of the files that can hurt you as no longer just being the build scripts or the code. Now every markdown file, every CSV is potentially an attack vector. I don't mean to overstate this; a lot of these are more theoretical than things we are seeing in the wild. But nevertheless, the theoretical dangers are quite real and quite surprising, even if you have done security work before. These are new attack vectors. I apologize, this talk is going to feel like a bit of a roller coaster: there's going to be stuff that feels really exciting, and then it's going to feel like this, a little scary and a downer. But hey, this is the world that we're living in right now.

Databot: exploratory data analysis agent

I'm really excited to talk to you about DataBot, our exploratory data analysis agent. This is one of the most fascinating and useful pieces of software I've ever worked on in my career. Let me start off by giving you a quick taste of what it's like to use DataBot, and then we'll zoom out a little bit and talk about why it works this way. Full disclosure, this playback is at 1.5x speed, just in the interest of time.

So you open DataBot with a command and then you ask it, often starting with a very general question, like: please load this restaurant data. I'm using the same restaurant data that Hadley showed in his Shiny dashboard, but it's a little bit different. So it's writing some Python code that I can review. I'm going to YOLO and say allow always. You can see the output of that code, and then it looks at the output itself and makes some observations here in prose. And finally, it provides five suggestion links for next steps that we could take together.

So let's just, like, talk about this for a second. I've been calling this the where loop. So far, I'm the only one who's been calling it the where loop. Even inside the company, this is not catching on. They're not all winners.

So it starts with you asking a question or providing an instruction. DataBot will respond to that instruction or question by first writing Python or R code, depending on what kind of session you're using. And secondly, it will go ahead and execute the code for you once you click allow. Third, and very importantly, it sees the output. The human sees the output and the machine sees the output. And you look at it together. And this output can include the console output, it can include data frames, and it can include plots.

So it makes whatever observations it wants to make about that data that it thinks you should know. And then it repeats. If there are additional steps it wants to take in order to answer your question or follow your instruction, it will do that up to three to five times, and then we tell it to stop. And this is a really important point: we have finely tuned the granularity of this human-machine feedback loop to what we think is appropriate for exploratory data analysis. It can go off and do enough work that your exploration proceeds very quickly, but in small enough chunks that the human is not left behind and can actually keep up with the quote-unquote insights being thrown off by the machine.

After it's done that, then it comes back to the user, suggests next steps. It always suggests here three to five things that we can do next. So you're never left at a blank prompt just thinking, like, what now? And then it waits for your instruction. So you can click one of the links that it provides or you can type in your own instruction. And you just do this again and again until you're done exploring your data.
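The loop described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration of the control flow, not DataBot's actual implementation: `model_step` stands in for a real LLM call and is a canned stub here so the shape of the loop is visible.

```python
MAX_STEPS = 5  # the talk mentions stopping after roughly three to five steps

def model_step(question, history):
    """Pretend LLM call: returns code to run next, or None when satisfied."""
    if len(history) >= 2:
        return None  # in the real tool, the model decides when it's done
    # Stand-in for real analysis code the model would write:
    return f"len({question!r}) + {len(history)}"

def run_loop(question):
    history = []
    for _ in range(MAX_STEPS):
        code = model_step(question, history)
        if code is None:
            break
        # In the real tool the human clicks "allow" before execution, and both
        # human and model see the output (console text, data frames, plots).
        output = eval(code)
        history.append((code, output))
    # The model would then state its observations in prose and suggest
    # three to five next steps before waiting for your instruction.
    return history

steps = run_loop("load the restaurant data")
print(steps)
```

The design point is the bounded step count: the agent does enough work per turn to keep exploration fast, but stops and returns control before the human falls behind.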

Finally, once you have all the answers that you're looking for, you probably have a giant chat transcript, which is not useful to anybody. So there is an explicit set of commands in DataBot to extract the important insights; you tell it what you think the important insights are, and it will create a reproducible version of your chat, limited to just the code and prose necessary to get your point across.

So let's look again at this example. These are the suggestions it made. Again, this is the James Beard Restaurant Awards dataset, with the winners over the years. If you look at the suggestions, it's really striking to me, even after using DataBot all this time, how good these suggestions are, every time. These suggestions don't feel scripted or canned. They're quite sensitive to the specific nature of the data that you've shown it, or whatever was revealed in the last step that you took with it.

In a lot of cases, when I show people, they see a visualization or something fly by and they say, oh, what about this? And it turns out that whatever this was is one of the three to five suggestions that DataBot makes. And that's really encouraging. It feels like you have an active and engaged partner that's making really good suggestions at every step. So in this case, I'm going to say, investigate the geographic distribution of awards by city.

So it's going to do a number of things here. First, it is tallying by city, and no surprise, New York is at the top. By state, California is at the top, as we would expect. And then it makes a plotnine visualization. Then it's making a bunch of observations, and then, again, three to five suggestions.

But what's most interesting to me in this particular case is, look at this. It's observed that of this data, over a third of the records are missing city data. Why would a restaurant data set be missing city data? Full disclosure, none of the suggestions that it made were, hey, let's stop and investigate the missing city data, right? So like as Hadley said, this is where our expertise is essential in knowing that, wait, we have to stop. And in order to do a robust analysis, we need to get to the bottom of this question.
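The check behind this observation is simple to express. Here is a hedged sketch of tallying awards by city and measuring how much city data is missing; `awards` is a tiny synthetic stand-in I made up, not the real James Beard data.

```python
import pandas as pd

# Synthetic stand-in for the awards data; None marks a missing city.
awards = pd.DataFrame({
    "city": ["New York", "Chicago", None, "New York", None, "San Francisco"],
})

# Tally by city, most awards first.
by_city = awards["city"].value_counts()
print(by_city)

# Fraction of records with no city at all -- the red flag worth chasing.
missing_share = awards["city"].isna().mean()
print(f"{missing_share:.0%} of rows are missing city")
```

A large missing share like this is exactly the kind of crack in the foundation that the talk says deserves a stop-and-investigate, even when the agent doesn't suggest it.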

So to me, this kind of question, why so much missing data? These are often the most interesting questions, right? They expose cracks in the foundation of our analysis. And to DataBot's credit, it does pretty much immediately figure out exactly what's going on here. It sees that the data is not actually missing. It's in the wrong column. For a third of the data, yeah, a lot of you are scrunching up your faces. That's the appropriate response. And it also noticed that these problems seem to be localized in certain types of awards.

So let me just illustrate this for you. This is what the data is supposed to look like. This is a clean version of this data. And some of these rows are not about chefs, but about restaurants. And that name column really should be chef's name. So some of these, a third of these rows are about bars or restaurants. And there is no name to put there. So the correct data should have missing cells there. But that's not what happened with this CSV. This CSV looks like this. It's as if a comma was missing in the CSV, right?
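A repair for the shifted-column problem described above might look like the following. This is a hypothetical illustration: for the affected rows it is as if one comma went missing, so every field after `name` landed one column to the left. The column names and the trigger condition are assumptions for the sake of the example, not the actual Tidy Tuesday schema.

```python
import pandas as pd

# Toy version of the broken CSV: row 1 is a restaurant award, so its values
# are shifted left (restaurant in `name`, city in `restaurant`, `city` empty).
raw = pd.DataFrame({
    "name":       ["Alice Chef", "The Bistro", "Bob Chef"],
    "restaurant": ["Bistro One", "Portland",   "Cafe Two"],
    "city":       ["New York",   None,         "Chicago"],
})

# In this toy example, rows with a missing `city` are the shifted ones.
shifted = raw["city"].isna()

fixed = raw.copy()
fixed.loc[shifted, "city"] = raw.loc[shifted, "restaurant"]
fixed.loc[shifted, "restaurant"] = raw.loc[shifted, "name"]
fixed.loc[shifted, "name"] = None  # restaurant awards genuinely have no chef name
print(fixed)
```

After the repair, the truly-missing value is the chef's name, which is what the clean data should have had all along.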

And it turns out this is not a synthetic example that I made up. This was actually the Tidy Tuesday for December 31st, 2024. And just coincidentally, I'm not making this up, I was using the very first prototype of DataBot for the very first time. I said, let me just try the most recent Tidy Tuesday, and this was the first thing that happened to me. It was a very auspicious start, because I was so blown away by how quickly it allowed me to identify and chase down this problem. And it was also able to quickly fix the problem so that we could do a clean analysis. I mean, clean-ish. There's a NaN there. I don't know what that is. I didn't ask the follow-up question yet.

But when I did this in January, I was so perplexed by this that I went to Jon Harmon, who is the person who curates the Tidy Tuesday datasets. And I asked him: number one, did you do this on purpose? Did you hide this data quality issue as an exercise for the reader? And number two, how many people noticed this data quality problem? This is a very fundamental data quality problem. And he said, first of all, this was not intentional: the restaurant awards and the chef awards came from two different data sources, and he combined them without realizing they had a different set of columns. So, I mean, totally a mistake any of us could have made. And he said that not one other person had noticed this.

So, it's a really interesting example where a human and a fallible agent working together were actually able to find a mistake that a lot of humans missed. And it turns out this was not the last time this has happened to me. Over and over, I've been able to discover issues like this in work done by humans with the help of this agent.

So, I'm super excited about DataBot. Sara Altman and I published a blog post a couple weeks ago about all the reasons we think exploratory data analysis is fundamental to doing good data science, and how we designed DataBot to work really well with how we think EDA should be done. But at the exact same time, we also published this other blog post: DataBot is not a flotation device.

And this is several hundred words about ways that we are worried you're going to misuse this piece of software. "DataBot is not a flotation device" means that I think this is a really easily misunderstood piece of software, and if you use it wrong, it's dangerous. You might think that with DataBot you don't need to know how to code. That is wrong. Or that with DataBot anyone can be a data scientist. Sorry, that is incorrect. Or that with DataBot you can sit back and let the machine do the thinking. Again, that's wrong.

And the reason all of these assumptions are wrong is because today's best models are not good enough for trustworthy EDA. Now, they are more than good enough to be useful for EDA, but they are not quite at the level where you can just sit back and trust them. They make coding errors. They make errors in judgment. They make errors of omission. They often will see what they expect to see. If you were in my workshop yesterday, you saw a very scary example of that. And they will often cut corners in really inhuman ways to try to reach the objective that you set for them.

So the way I think of it is like Tesla's, quote-unquote, full self-driving mode today. If you've ever experienced that, you know that it's fascinating and it's surprising, but it is not perfect. And therefore, you would never take a child and put them in the driver's seat and send them off. You cannot have a passive bystander sitting in the driver's seat. And you cannot have a non-data scientist or a data scientist that's leaning back driving DataBot. And the reason is because you always need someone who's ready to grab the wheel and be an active participant and partner. Okay, lecture over.

The EDA analogy: teleporting through insights

That being said, I love this tool. I am so excited about what it feels like to use DataBot. And it's really hard to convey unless you've had the opportunity to try it yourself on data that you don't know. So I encourage you to do that. But since I have all of your attention here, I ask myself, like, how could I create an illustration that would sort of capture the feeling of what it's like to use this tool?

So if you will indulge me for a few minutes, my few remaining minutes, I want to start by showing you an illustration of what it's like to do traditional EDA, traditional exploratory data analysis with Python or R in a pre-LLM world. So for this analogy, imagine that exploring an unknown data set is like being dropped into a foreign land that you don't know anything about. So you have this map, but you can only see the area immediately around you. So you see an interesting feature in the distance and you go walk up to it. So this would be an interesting feature of your data and walking is like writing code. So you're writing your code. You're writing your dplyr and ggplot code or your pandas code. And oh, okay, so now I'm looking at this feature. I'm looking at the distribution of this variable. And that's interesting. There's something to learn here. It's made out of stone. I didn't know that. And after being up close and personal with this feature, now you've learned a little bit of something about this landscape. You know something about your data that you didn't know before. It's a little bit, but it's something. Just repeat that over and over again for the next interesting thing you see and the next interesting thing you see. And pretty soon you've got an entire map that you've uncovered that shows you every corner of this data.

Now that's what the world has been like. A lot of companies have a very different vision for what data science should be like. So not to name any names, but a lot of them want to sell you an experience like this where you're still in this unfamiliar landscape, but instead of exploring, you ask an agent to find you treasure, right? Like ask your top level question, how do I increase revenue for Q3? Or how do I improve conversion rates for my sales pipeline, right? And this agent swings into action. It's clearly working very hard. It's analyzing and it's exploring and it's referring to things and you are not a participant in any of it. In fact, you're not even privy to what it's doing. So it is doing all the work and if all goes well, I mean, maybe 20 minutes later, it comes to you and says, ta-da, I have found the treasure. I have teleported you directly to the treasure. Here it is. It's a chest and it's got some diamond gear in it. Congrats. Okay, so great. We have a diamond pickaxe now. We have a diamond sword. And what have we learned about our data? Actually, not that much.

And I think to a lot of people, they think that data science is about finding diamond pickaxes. And that's not the approach that we've taken with DataBot. With DataBot, we start in this landscape and again, we identify an interesting feature in the distance. But this time, instead of walking up to it, we teleport. Instead of writing the code, DataBot is writing the code for you, but you are gaining the same insights as if you had written the code yourself. And it becomes so fast and fluid to teleport from insight to insight, from variable to variable, that the cost of seeing something that's not interesting is, that's fine, you just move on to the next thing and after enough teleporting around within seconds, what would have taken minutes or minutes would have taken hours, you find your diamond