Resources

Harnessing LLMs for Data Analysis | Led by Joe Cheng, CTO at Posit

When we think of LLMs (large language models), usually what comes to mind are general purpose chatbots like ChatGPT or code assistants like GitHub Copilot. But as useful as ChatGPT and Copilot are, LLMs have so much more to offer—if you know how to code. In this demo Joe Cheng will explain LLM APIs from zero, and have you building and deploying custom LLM-empowered data workflows and apps in no time.

Posit PBC hosts these Workflow Demos the last Wednesday of every month. To join us for future events, you can register here: https://posit.co/events/

Slides: https://jcheng5.github.io/workflow-demo/
GitHub repo: https://github.com/jcheng5/workflow-demo

Resources shared during the demo:
- ellmer: https://ellmer.tidyverse.org/
- chatlas: https://posit-dev.github.io/chatlas/
- Environment variable management:
  - For R: https://docs.posit.co/ide/user/ide/guide/environments/r/managing-r.html#renviron
  - For Python: https://pypi.org/project/python-dotenv/
- Shiny chatbot UI:
  - For R, shinychat: https://posit-dev.github.io/shinychat/
  - For Python, ui.Chat: https://shiny.posit.co/py/docs/genai-inspiration.html
- Deployment:
  - Cloud hosting: https://connect.posit.cloud
  - On-premises (Enterprise): https://posit.co/products/enterprise/connect/
  - On-premises (Open source): https://posit.co/products/open-source/shiny-server/
- querychat:
  - Demo: https://jcheng.shinyapps.io/sidebot/
  - Package: https://github.com/posit-dev/querychat/

If you have specific follow-up questions about our professional products, you can schedule time to chat with our team: pos.it/llm-demo

May 28, 2025
32 min

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

Hi, my name is Joe Cheng. I'm the CTO of Posit, and today I'm going to talk to you about harnessing LLMs from code in the service of data analysis.

This talk is intended for people who know how to code in either R or Python. I'm assuming that you have used LLMs via something like ChatGPT, Copilot, Cursor, or Windsurf, but I've also assumed that you have not written code that calls LLMs, that you have not called LLM APIs. If you have a lot of experience using LLMs from code, maybe you're also just interested in Posit's take on these tools, in which case, welcome to you as well.

Getting started with LLMs

So as we're getting started with LLMs, there's a couple of things you need to know right up front. One is that LLMs are generally accessible these days through HTTP APIs. There are other ways to call LLMs than through HTTP APIs, but I don't think there are any other ways that matter; this is the way most models are accessed today. Therefore, Posit has created packages that are designed to make it really easy to call LLMs through these HTTP APIs, and we're going to talk a lot about these packages today. For R, there's ellmer, and for Python, there's chatlas.

So as we are getting started using these LLMs, let's just set our expectations correctly. First of all, you can expect that this is going to be super easy as we get started; ellmer and chatlas are both extremely easy to use. These models, the best models out there at least, the best LLMs, are highly capable and they're getting better all the time. So if that's not your expectation, maybe because you haven't used these models in a while or you haven't persevered through seeing them make mistakes, just know that they're quite good at a lot of things these days. And they are extensible: we're going to talk today about some ways that we can add our own behaviors and even add code, well, code in a way, to these models.

However, these models do make mistakes, and they make them a lot. And not only do they make mistakes, they make them unpredictably, nondeterministically, which makes it even more frustrating. I think in summary, today, you know, May 2025, these LLMs have a jagged frontier. I think that's a term you're going to hear more and more. There are things that they are surprisingly good at, even shockingly good at, and then things that they are really surprisingly bad at. So as we explore them, let's just expect to be surprised in both directions.

So I have a couple of tips as well for how we should not approach getting started with playing with LLMs. First of all, you may have heard a lot about local models or open-weights models. These are models that you can download to your own machine or your own server and run them. I highly recommend that you do not start with them, and instead start with the very best models that the world has right now. Those are Claude 3.7 Sonnet, or actually, as of today, Claude 4 Sonnet. OpenAI has a number of state-of-the-art models that have different trade-offs, and Google's Gemini 2.5 is, I've heard, very good. The open-weights ones, like Llama and its derivatives, Qwen, DeepSeek, they're just not as good as you have heard. I'm sorry. Sorry, not sorry.

I also suggest that as you're getting started, don't start with non-public information or data. Don't start with your customer data. Don't start, especially, with patient data. Instead, as you're getting to know these APIs, do it with low-stakes, publicly available, or just non-proprietary data. And only after you get a good sense of how to use these APIs, and have a good sense for how they work, then you can think about security and data privacy and whether it's safe for you to use non-public data.

So to get started, we're going to need to do a couple of things. The by far most annoying part of this is we need to sign up for a developer account with Anthropic or OpenAI, or Google if you want to go the Gemini route, and you need to grab an API key. It's pretty simple to do this if you just go to their site and sign up, but it does require a credit card. These sites generally will not allow you to make API calls without putting down a credit card and paying a few dollars. All I have to say is: just do it already. I know this stops a lot of people from proceeding; it's just enough of a pain, enough friction. But it is really worth pushing through this step, and it really is not going to be a lot of money. You can do a lot of experimentation with just, you know, $20. That might last you a couple of months of quite intense playing with these models.

Once you have that API key, then you need to add it to your environment variables, and I have links that I'll drop in the description for doing that, for both R and Python. Then it's time to install an LLM client package: for R there's ellmer, and for Python there is chatlas. I'll just note that when you go to use chatlas, as you try to connect, it may ask you to install other packages depending on what provider you're using. So just be aware of that.
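As a sketch of what that environment-variable step does: once the key is in the environment (via .Renviron in R, or a .env file loaded by python-dotenv in Python), the client packages can pick it up automatically. The `load_env` helper below is a stdlib-only stand-in for python-dotenv's `load_dotenv`, to show the pattern, not the real library's implementation:

```python
import os

def load_env(path=".env"):
    """Minimal stand-in for python-dotenv's load_dotenv():
    reads KEY=value lines from a file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks and comments; keep only KEY=value lines.
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# A .env file containing a line like
#   ANTHROPIC_API_KEY=sk-ant-...
# would, after load_env(), make the key visible to client packages
# that look it up via os.environ["ANTHROPIC_API_KEY"].
```

The point of the pattern is that the key lives outside your source code, so scripts and apps can be shared without leaking credentials.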

Hello world with ellmer

So this is sort of the hello world of using ellmer. This is the R version. You can see here, it's extremely short; there's just basically two lines of code here. The first creates a chat object by calling this chat_anthropic() function, and passes in the model that we want to speak with. This is Claude 3.7 Sonnet, and we are asking for the latest version. You could change this from chat_anthropic() to chat_openai() and put in an OpenAI model identifier, and now you're chatting with OpenAI. In either case, you have this client object that you get back, we often refer to it as a chat object, and you call the chat method on it and pass in your question, and you get a response. So in this case: summarize the plot of Romeo and Juliet, and it does so.

Another interesting thing about this client variable is that it is not just a way to ask questions; it is a way to carry on a conversation. So you can call chat multiple times, and each subsequent call is like the next question in this ongoing conversation. So here I say, "Now Hamlet." This is not just me asking this model context-free, "Hey model, now Hamlet." It is a follow-on to the previous question about Romeo and Juliet, so it knows how to answer with a summary of 20 words or less.

This is the example in R, and then this is the example in Python. It is almost identical. You can see here that the capitalization has changed because I'm calling a Python constructor instead of an R function, but other than that, it's extremely similar.
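The multi-turn behavior described above comes from the chat object keeping the whole conversation history and resending it with every new message. Here is a rough, provider-agnostic sketch of that mechanism; the `FakeChat` class is hypothetical (a real ellmer/chatlas object sends the accumulated history to an HTTP API instead of echoing it):

```python
class FakeChat:
    """Sketch of a chat object: each .chat() call appends to a
    running history, so later questions see earlier turns."""

    def __init__(self, system_prompt=None):
        self.history = []
        if system_prompt:
            self.history.append({"role": "system", "content": system_prompt})

    def chat(self, message):
        self.history.append({"role": "user", "content": message})
        # A real client would POST self.history to the provider's API here;
        # we just report how many prior messages the model would see.
        reply = f"(model sees {len(self.history)} messages)"
        self.history.append({"role": "assistant", "content": reply})
        return reply
```

So after `chat.chat("Summarize the plot of Romeo and Juliet")`, a follow-up `chat.chat("Now Hamlet")` includes the first question and answer in what gets sent, which is why "Now Hamlet" makes sense to the model.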

So, a few things to know about ellmer and chatlas. Number one, they support many different LLM providers, pretty much all the ones that we can get our hands on. And this is really important, because you want the ability to experiment with different model providers and just kind of see whether there are certain things that one provider does better than another. Secondly, you have the ability to have a chat, then take that chat object and just print it at the console or in a Jupyter notebook, and you'll be able to view the conversation history just by printing the object.

Third, both ellmer and chatlas have built-in chat UIs. And in fact, they each have two built-in chat UIs: one for a web browser and one for the console. I'll show you real quick how that works.

So for example, let's take that Romeo and Juliet example, and instead of doing this at the console, let's do it in an app. So we can ask this question, and yeah, it's very convenient to be able to do this in a web interface. Or maybe instead of a web interface you prefer a console interface. I'm continuing the conversation here; I can say, like, "How about Hamlet?" And there you go, it streams the answer back.

Customizing with system prompts

Okay, so it's super easy to use and designed to work really well interactively. But that's not actually that interesting, right? It's not that interesting for us to take a bone-stock LLM; you already have a web interface for that, you could just go to claude.ai. Instead, we want to be able to customize these models, and the first way we often want to customize them is by adding knowledge. We want to tell these models something that they don't know from the factory.

Just a couple of quick discouragements here. You may have heard a lot about RAG (retrieval-augmented generation). I recommend that you do not start there. It is a very useful technology, but I don't think it's the one you should be starting with. Same thing, even more so, with fine-tuning; you may have heard this term, and I recommend you do not start there either. When you want to add knowledge, the first place you should start is with system prompt customization. It is easy, it is effective, it's super quick, whereas these other techniques take a little bit of fiddling.

So here is one example of adding some information via the system prompt. What the system prompt is, is a way of us telling the model that for this interaction, or for this conversation, these are the ground rules that I want you to obey. In this case, we're saying: you're a bot that answers questions based on this expense policy doc. And I have an expense policy doc here that is actually for a fictional company called Example Inc. Really boring, right? It's saying here's what is eligible for expensing when you're traveling, when you're visiting clients, when you're buying office supplies. Super boring, but important to get right. So here I'm creating a string that says, "You're a bot that answers questions based on this expense policy doc," and then I put in some XML-tag-looking things. This is just to organize all the lines I'm about to paste in, and then I put in this entire doc. I literally copy and paste this doc into this system prompt string, and then I have a couple more follow-up bullet points: cite policy section numbers, and use examples when helpful. Now I can take that system prompt and pass it as an argument when I create a chat object, and when I do that, the answer incorporates the knowledge from that document. So in this case: "Can I expense a programming book?" It says, based on the expense policy, it might be eligible; ask your manager, section 3.4.
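A sketch of that prompt-assembly step, with a one-line hypothetical policy string standing in for the real document (the tag name and wording here are illustrative, not the exact prompt from the demo):

```python
# Hypothetical stand-in for the pasted-in expense policy document.
EXPENSE_POLICY = (
    "3.4 Books and training materials may be eligible; "
    "check with your manager.\n"
)

def build_system_prompt(policy_doc):
    """Wrap the policy text in XML-ish tags so the model can tell
    the instructions apart from the pasted-in document."""
    return (
        "You are a bot that answers questions based on this "
        "expense policy doc.\n"
        "<policy-doc>\n"
        f"{policy_doc}"
        "</policy-doc>\n"
        "- Cite policy section numbers.\n"
        "- Use examples when helpful.\n"
    )

system_prompt = build_system_prompt(EXPENSE_POLICY)
# This string is then passed when creating the chat object, e.g. as the
# system prompt argument to chat_anthropic() in ellmer or ChatAnthropic()
# in chatlas.
```

The XML-ish tags do no parsing work; they are just a convention that helps the model keep "rules I must obey" separate from "reference text I should cite".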

So let's do one that is a little bit more in the vein of data analysis. For this one, maybe you can relate to this experience that I've had multiple times, where there's a dataset that I'm very excited to get my hands on. In this case, it's the College Scorecard data that the federal government puts out, and it has a lot of information about different characteristics of colleges and universities, as well as a lot of, I think, data from the IRS about the outcomes that students experience six years and ten years after they graduate. So, a really rich dataset. There's just one problem with this dataset.

We've got this CSV here, and if we look in Positron, there are thousands of columns, and these are quite impenetrable column names. And the data dictionary is 800 kilobytes of Excel spreadsheets. I mean, it's just a lot. So what I've done here is I've taken that Excel data dictionary and used Claude Code to come up with sort of a condensed summary as a Markdown file. And now I have created a system prompt that says: you're a bot that helps with data analysis for College Scorecard data. Do not attempt to analyze the data, but you can write R code. I'm telling it to use the readr package to read CSVs, and to assume that the CSV files are in the working directory. Then I read in the entire file I have that summarizes this data, and you can see it talks about hundreds of available variables.

And this actually works awesome. So I'm able to run this example, and I can ask a question like, "What are some demographic variables?"

All right, pretty cool. And by the way, note what we did not do here: we did not actually load the College Scorecard data into the LLM and ask it to start answering data questions. We did not do that. We asked it to write code, and there's a really good reason for that. Because on their own, the LLMs themselves are terrible at analyzing data; they simply are not suited for this task. If you load data into them in the format that they understand best, which is maybe JSON, they are actually not able to reliably do even simple arithmetic. In fact, they cannot even reliably count rows in a data frame. Just today, I tried it with a small CSV, 299 rows, loaded it into Claude 4, the latest that just came out today, and it said that there were 216 rows. So we cannot trust them on their own.

Instead, what we can almost trust them to do is write code that analyzes data, and sometimes we can even let the LLM have enough control that it can execute the code for you. That is a much, much better way to have LLMs help us with data tasks: we want them to write code that then gets executed.
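A toy illustration of why this division of labor works. The hard-coded `generated_code` string below stands in for what a model might write (no actual LLM call here): the model proposes code, and the deterministic runtime, not the model, produces the number.

```python
import csv, io

# Pretend CSV data with 299 data rows; in the talk this would be the
# College Scorecard file.
csv_text = "name,city\n" + "\n".join(f"u{i},town{i}" for i in range(299))

# A hypothetical model-generated snippet: the model writes code that
# counts rows rather than trying to count them "in its head".
generated_code = """
import csv, io
rows = list(csv.reader(io.StringIO(csv_text)))
row_count = len(rows) - 1  # minus the header row
"""

# The client executes the generated code; the answer comes from the
# interpreter, so it is exact (299), not an LLM's guess (like 216).
namespace = {"csv_text": csv_text}
exec(generated_code, namespace)
print(namespace["row_count"])  # → 299
```

The key property: even if the model's arithmetic is unreliable, the code it writes is either right or visibly wrong, and when it's right, the computed answer is exact.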

Customizing behavior and building Shiny apps

So those last two examples really only involved adding knowledge. What they did not really involve very much of was customizing behavior. So let's use our system prompt again, and just by putting some different instructions there, we can have these LLMs act in quite a different way than just a general assistant. In this case, I'm going to say: you're a dungeon master for a Dungeons and Dragons-style game whose theme is this expense policy. And I'm reusing this expense policy document. So if I go ahead and run that, and this time I'm doing it in Python, I can say "begin," and it starts a ridiculous game of "Dungeons and Deductions." Sure, okay. So you can see here, I mean, this is really pretty bizarre, that we can just write two sentences and get this completely different behavior out of this model.

So far, all of these examples have just created ellmer and chatlas chat objects, and we've called them from either their built-in chat UIs or called the chat method from the console. But it's more common, I think, to take these customized ellmer and chatlas instances and build them into a Shiny front end. In R, we have a separate package called shinychat that you can use to put a chatbot into your Shiny app. And in Python, there's not a separate package; there's a built-in component for Shiny for Python called ui.Chat. I will drop the links for the documentation for both of those in the description as well.

It's very easy to do. With just a few lines of code, you can have ellmer and chatlas chatbots working very well. And because these are Shiny apps, they are super easy for you to deploy. You can deploy your apps with one click, or I guess one command, depending on how you do it. On-prem, we have a commercial product called Posit Connect. I imagine that many of you on this call already have access to a Posit Connect at your companies; sometimes you may not even know it, but you do. We also have a cloud-hosted service called Posit Connect Cloud. This is sort of the next generation of shinyapps.io, if you're familiar with that. Oh, and I forgot to put it on the slide, but we also have open-source on-prem hosting with open-source Shiny Server.

And Shiny apps are so easy to write and deploy that, if you're pretty skilled at Shiny, or you use Shiny Assistant, which is our AI assistant for writing Shiny apps, it's often worth creating, deploying, using, and throwing away a Shiny app all within a week or a day. It's so easy and so fast to do, and even more so with ellmer and chatlas, because it's so easy to create these LLM apps. We've definitely found that within Posit, we've had a proliferation of internal chatbots that each just serve one specific purpose. It would not have been worth the effort to create, and especially to deploy, a custom chatbot for that one specific purpose, you know, like this expense report bot. But it's just so easy to do with this technology stack that we do it.

Vision, structured output, and tool calling

A couple of other ellmer and chatlas features that I want to show you: you have the ability to do image input, and the ability to do structured output; I'll come back to that in a moment. You have the ability to not just take a response as one big string all at once, but to take the response as a stream of little chunks of strings. And if you know what async is, just know that both ellmer and chatlas work in synchronous or async modes.

Just to show you how easy vision is: this is an image called photo.jpg. When I'm chatting with, in this case, Claude, it's as simple as, instead of passing a string, or in addition to passing a string, you pass the result of content_image_file() and give it a file name. And, you know, Claude is able to see this image and tell you what the mood of the image is in a single phrase. This is what it looks like in R, and this is what it looks like in Python. Super easy to do. I will warn you, this is one of those jagged-frontier areas: it's incredibly sensitive and nuanced about certain aspects of photos, and then other things, like counting how many objects are in a photo, it just can't do sometimes. It's really surprising.

So it's definitely worth playing with, though I won't dwell on this too much. Just know that these models also often have the ability to not just give you unstructured text back, but to give you JSON objects that you can then use, to put into a database or to code against for other reasons. So this is what it looks like in R: you specify the shape of the data you want back, and then you call chat_structured() instead of chat(). And this is what it looks like in Python. It looks a little bit different; here we're using a package called Pydantic, which is a de facto standard in the Python world, and we call extract_data() instead of chat() and tell it this is the shape of data that we want. And yeah, instead of getting a string, you get back some JSON that is very easy to use from code.
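The essence of structured output is that the model is constrained to emit JSON matching a shape you declare, which your code then parses into a typed object. A stdlib-only sketch of the receiving side, with a hard-coded string standing in for the model's JSON reply (real ellmer/chatlas calls hand you this already parsed, and Pydantic would do the validation in the chatlas example):

```python
import json
from dataclasses import dataclass

@dataclass
class Person:
    """The shape of data we asked the model to return."""
    name: str
    age: int

# Stand-in for the JSON a structured-output call would return.
model_reply = '{"name": "Juliet", "age": 13}'

person = Person(**json.loads(model_reply))
# Now it's ordinary typed data, easy to put in a database or
# code against, rather than free-form prose to re-parse.
```

This is why structured output matters for workflows: downstream code can rely on fields and types instead of scraping answers out of sentences.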

The last topic: I really don't have time for this, but I'm going to do it anyway; it's just such an important topic. Tool calling allows LLMs to interact with other systems. So instead of just taking text in and giving text back out, it gives the LLMs the ability to call out to APIs, to databases; the ability to execute code, to do things to your file system. Just all sorts of things become possible. And that sounds a little bit scary, and it sounds complicated, but it is not as bad as you think. So I'm going to try to explain how this works by first telling you how it doesn't work. When I first heard about this functionality, this is how I imagined it working, but it is not how it works.

So, one example of something that a pure LLM cannot do: an LLM is just a model; it can't get the current weather, right? You need a weather service for that. So in this case, let's say we have a user asking the assistant, "What's the weather right now at Fenway Park?" When I heard that there was something called tool calling that lets LLMs access other systems, what I imagined is that somehow we would be teaching the assistant to reach out to openweather.org (it's a fictional service, but just pretend) and call "give me the weather for 02215"; that's the ZIP code where Fenway Park is. OpenWeather would return the data as JSON, and then the assistant would incorporate that JSON into its response and say "65 degrees and sunny." Okay. This is how I imagined it working, but this is not how it works.

Instead, it works like this. The user asks the question, "What's the weather right now at Fenway Park?" and includes along with that request a description of a tool called get_current_weather. So it's not passing code; it is just passing metadata about a function. It's saying: the name of this function is get_current_weather; it takes a single argument; that argument is a ZIP code, and it's a string. So the assistant, instead of answering with what the weather is, responds saying, "I would like to call this tool that you said was available: get_current_weather, 02215." And it is actually the responsibility of the client, not the server, to go off and perform this request against openweather.org, or however else it wants to fulfill the task of getting the current weather.

When a response is received, the user continues the conversation by taking this response that the assistant asked for and sending it to the assistant. And the assistant now has the original question, it sees the request that it sent, and it sees the response that came back from the user. Therefore, it's now able to actually give a proper response back to the client, incorporating this data. So it seems a lot more complicated, but it is actually much simpler.

Basically, just to recap: the user asks the assistant a question and includes metadata for tools or functions that are available. The assistant asks the user to invoke a tool on its behalf, and passes the arguments that should be used. The user invokes the tool and returns the output to the assistant, which the assistant can then use as context for its final response. It turns out this is an incredibly general and powerful way to extend the capabilities of these assistants. And anytime you hear the word "agent" used, which is a lot these days, just know that all of that is made possible by creating these tools and giving LLMs access to them.
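The recap above can be sketched as a loop on the client side. The `fake_assistant` below is a scripted stand-in for a real model (no API calls are made), and the message shapes are illustrative, only loosely modeled on what providers actually send:

```python
# Client-side tools: the assistant never runs these itself; it only
# asks for them by name and arguments.
def get_current_weather(zip_code):
    # Stand-in for a real call to a weather service.
    return {"temp_f": 65, "conditions": "sunny"} if zip_code == "02215" else {}

TOOLS = {"get_current_weather": get_current_weather}

def fake_assistant(messages):
    """Scripted stand-in for the LLM. First turn: request a tool call.
    After seeing the tool result: answer in words."""
    last = messages[-1]
    if last["role"] == "user":
        return {"role": "assistant",
                "tool_call": {"name": "get_current_weather",
                              "args": {"zip_code": "02215"}}}
    result = last["content"]
    return {"role": "assistant",
            "content": f"It's {result['temp_f']} degrees and {result['conditions']}."}

def run_conversation(question):
    messages = [{"role": "user", "content": question}]
    while True:
        reply = fake_assistant(messages)
        messages.append(reply)
        if "tool_call" not in reply:
            return reply["content"]
        # The CLIENT executes the requested tool and sends the result back.
        call = reply["tool_call"]
        output = TOOLS[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": output})

print(run_conversation("What's the weather right now at Fenway Park?"))
# → It's 65 degrees and sunny.
```

Notice that the loop, not the model, holds all the real capability: the model only ever emits a request (tool name plus arguments), and the client decides whether and how to fulfill it.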

So another way to think of it is that the client can do things that the assistant can't do: not only call out to things like a weather service, but read and write to databases, put files on disk, turn the lights on and off in a smart home. Anything you can think of doing in code, you can empower the assistant to do. And we put the control into the hands of the assistant, so that the assistant can decide when it's appropriate to use which tools, what arguments to pass in, and what to do with the results. It turns out that's just an incredibly general and powerful capability, having that intelligence in using and coordinating these tools.

Putting it all together: the querychat demo

So let's take everything that we've talked about so far: ellmer and chatlas, Shiny apps, tool calling, customizing system prompts. We're going to put all of that together into an example, and this is an application that I have demoed before. It's a restaurant tipping app. What you're looking at right now does not involve any LLMs or AI. This is a simple Shiny dashboard that lets you, you know, move some controls here, and it updates on the right.

What we have on the next tab is a new version of that dashboard that does incorporate AI. Here, we've replaced those controls on the side with a chatbot in the sidebar. And this chatbot, you can ask it filtering questions, like "show only male smokers who had dinner on Saturday," and instead of having to move a bunch of sliders, you just say what you want in plain English. What this application does is: it doesn't ask the LLM to populate all these outputs on the right. I would never trust an LLM to do that. All we are asking the LLM to do is to fulfill the user's request by writing SQL. This SQL is passed as a tool call to the rest of the Shiny app; the tool is called "update the dashboard with this filter," and you can see here that we've shown the SQL in two places. The thing that's key about this is that if the SQL is right, then everything below it is guaranteed to be right, because the only thing that this chatbot has the ability to influence is the SQL that drives the rest of the app. The rest of the app is just normal Shiny reactivity; all of this code is just normal, deterministic Shiny code.
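A minimal sketch of that division of responsibility, using sqlite3 and a toy tips table. The hard-coded `llm_generated_sql` string stands in for what the LLM would generate from the user's plain-English request; the column names here are illustrative, not the app's actual schema:

```python
import sqlite3

# Toy stand-in for the restaurant tips dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tips (sex TEXT, smoker TEXT, day TEXT, time TEXT, total_bill REAL)"
)
conn.executemany(
    "INSERT INTO tips VALUES (?, ?, ?, ?, ?)",
    [
        ("Male", "Yes", "Sat", "Dinner", 20.0),
        ("Male", "No", "Sat", "Dinner", 15.0),
        ("Female", "Yes", "Sun", "Lunch", 12.0),
        ("Male", "Yes", "Sat", "Dinner", 30.0),
    ],
)

# Stand-in for the SQL the LLM writes for "show only male smokers
# who had dinner on Saturday". The LLM's ONLY output is this string.
llm_generated_sql = (
    "SELECT * FROM tips "
    "WHERE sex = 'Male' AND smoker = 'Yes' AND day = 'Sat' AND time = 'Dinner'"
)

# Everything downstream is deterministic: if the SQL is right, every
# output computed from these rows is right too.
rows = conn.execute(llm_generated_sql).fetchall()
print(len(rows))  # → 2
```

Showing the generated SQL in the UI, as the demo does, closes the loop: the one artifact the model controls is the one artifact the user can inspect and verify.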

But the point is, this is all using the pieces that we've already talked about, with ellmer and chatlas and tool calling and system prompt customization: we have the ability to embed really intelligent agents into our Shiny apps and deploy them. I demoed this application last year at posit::conf, and at the time, this was just an example that you could fork if you wanted to, and replace everything to have your own data. Since then, we've made things a lot easier by creating this package called querychat. This is for both R and Python, and it is designed to make it dead easy to build these kinds of sidebar chatbots that return reactive data frames using SQL. So it's extremely easy to build your own version of this app with your own data, your own visualizations, and your own downstream reactives. Again, that's called querychat, and there's documentation for both R and Python.

Closing thoughts

That's all the time that I have today. I hope that over the last 30 minutes, you've seen how easy it is to get started with this technology, how easy it is to play with LLMs and ask them to do different things and just see how it works out. And now is really the time for that. Now is the time to play with these technologies. They are coming, I mean, they are here, and there is a tidal wave of change ahead of us. The best thing I can recommend is that you be on the forefront of that change by understanding how these things work, what they're good for, what they're really terrible for, and help guide the future of how we should be using these things, especially when it comes to data analysis.

There are so many more things happening inside of Posit: packages and tools and services that we are building, that we have already done and released. So I encourage you to stay up to date by following the Posit blog, the tidyverse blog, Simon Couch's blog. There's just so much interesting stuff happening around Posit, around AI, and we have a lot more interesting announcements coming soon. So stay tuned, and with that, let's go over to Q&A.