
Demystifying LLMs with Ellmer

From R/Medicine 2025. Joe Cheng, CTO, Posit.

Joe Cheng is the CTO of Posit, PBC. He’s the original creator of the Shiny web framework and co-creator of ellmer.

Today’s best LLMs are incredibly powerful, but you’re only scratching the surface of their capabilities if your use is limited to ChatGPT or Copilot. Accessing LLMs programmatically opens up a whole new world of possibilities, letting you integrate LLMs into your own apps, scripts, and workflows. In this workshop, we’ll cover:

- A practical introduction to LLM APIs
- Configuring R to access LLMs via the ellmer package
- Customizing LLM behavior using system prompts and tool calling
- Creating Shiny apps with integrated chatbots
- Using LLMs for natural language processing

Attendees will leave the workshop armed with ready-to-run examples, and the confidence and inspiration to run their own experiments with LLMs. Attendees should be familiar with the basics of R and have a working R installation. Note that to avoid any potential firewall issues, it is recommended that participants use a personal computer for this workshop.

Resources:

- Presentation: https://jcheng5.github.io/rmedicine-2025
- R/Medicine: https://rconsortium.github.io/RMedicine_website/
- R Consortium: https://www.r-consortium.org/

Jun 14, 2025
2h 55min


Transcript

This transcript was generated automatically and may contain errors.

Welcome, everyone. I'm helping to moderate this. I'm in Miami. We don't know how to tell time in Miami. So we'll give everybody else two minutes to get here.

While we're waiting: there's going to be some coding, if you want there to be. If you just want to sit and be a passive observer, that's fine, too, though I don't think you'll get as much out of this as you would otherwise. But if you plan on coding along, I do request that you start going through these steps. And please speak up in the chat if you have any issues with this, especially step three.

Yes. And the link to the slides is at the bottom of every slide.

So on behalf of R/Medicine, myself, Ray Beliss, Ben, and Beth, we're here from the planning committee. It is my distinct honor to welcome Joe Cheng, a person who has done more to help my career than he can ever know, invented more tools than I can count, and solved problems that I didn't know I had. So it is my profound pleasure to welcome you, Joe. Take it away.

Joe's background and motivation

Great. Thank you. So I see like one person was, oh, three people were able to claim keys. Great.

This is the one part of this workshop that I'm quite nervous about: whether this process goes well, you all getting this set up. If we can get through this, you can get through anything.

I guess I want to leave this slide up so that you all can continue to make progress on this. But in the meantime, maybe I'll just talk a little bit about myself, about this material, about how I got here. So I am the CTO of Posit. I've been with the company for 15 plus years at this point.

And it's with sort of mixed pride and chagrin that I tell you that maybe a year and a half ago, both myself and I think the whole company, save three enthusiastic individuals, were pretty skeptical about all this AI stuff, all this LLM stuff. It definitely felt to us like everyone was getting way too excited about a technology that had such deep, inherent flaws, especially when it comes to its application in data science and data analysis. You know, we're a company that really prides ourselves on taking reproducibility seriously, on taking reproducible science seriously, on not only giving you the correct answers, but giving you the ability to let people study your methods, to review your code and see if there are any mistakes in your methodology. And LLMs are sort of the opposite of that, right? They're sort of inherently chaotic, inherently black boxes.

And so, you know, a year ago, a year and a half ago, we were pretty, like, cautiously pessimistic, I would say. You'd hear the term stochastic parrot being thrown around a lot at Posit. And especially the more experienced people, I think, were just really, really not excited about what LLMs had to offer. And that began to change for me through the early half of 2024. Seeing the things these LLMs were starting to be able to do really defied my understanding of the math, of the way these models are constructed; it didn't seem like they should be able to do what they could do.

That, and the insistence of a couple of people on my team who had really drunk the LLM Kool-Aid and were really excited about it, led me to look into how to experiment with these things from the code level, from the API level. And that led to a set of eureka moments for me personally, moments where I was so taken aback by what I was seeing that it changed my mind about these LLMs and my enthusiasm for engaging with them.

And in fact, that conviction became so strong that I began to feel like it was actually something the entire company needed to understand. Posit at this point was maybe 325 people, and I felt like, well, at least the people who could code needed to understand what was happening here, and that the opportunities were not what they seemed.

So, in response to this, I decided to start a hackathon series internally, where I would do a three-hour workshop and help people go from "I don't know anything about LLMs, except my deep skepticism of them" to, by the end of three hours, "I'm ready to start hacking on my own projects." And two days later, we would get back together and see what people had come up with. We have now had, I think, 150 employees go through this process internally. I've taught 18 or 19 cohorts of the exact material that I'll walk you through today. And, well, I'll let you draw your own conclusions, but I'm very excited for all of you, if you're coming into this with fresh eyes, as far as how LLMs work.

Workshop approach and assumptions

This is mostly just me stalling to give you guys a chance to go through these setup steps. I'm going to move on now, but you can get back to these slides through what you see at the footer here. And this setup slide is, I think, the second slide. So, if you need these instructions while you go through them, please do that. So, I'll start with the materials here.

Okay. So, first of all, some assumptions that I'm making about you. Number one, I'm assuming that you are comfortable coding in R. And then number two, that you have probably at this point used LLMs in some form. You've used ChatGPT, you've used Copilot. If you're brave, maybe you've used a coding agent like Cursor or something like that. But I'm going to assume that you have not used LLMs from code. That you have not called LLM APIs or used the ellmer package. If you have, that's okay, too. But this is most targeted towards people who have no idea how LLMs work at the API level.

And before we go further, the way we're going to approach this today is we are going to focus relentlessly on being practical, on giving you actionable information.

There are some cases where there's going to be very deep topics. And all I'm going to do is basically say the key words that you need to go find out more. We don't have time to go into, in depth, into every topic. So, I've sort of selected the ones that I think are most important for forming the right mental models about these LLMs. And left some of these other deep and worthy topics for your own exploration.

And very importantly, I think we are going to treat LLMs as black boxes today. We are not going to talk about the transformer architecture. We're not going to talk about feedforward networks. We're not going to talk about neural nets at all.

We are going to focus not on how they work. And, I mean, we're going to do this not just because of time. But I actually think that it is important not to focus on how they work initially. Because it leads to very bad intuition about their capabilities. And some of the smartest people I know have fallen into this trap of knowing how the math works and therefore making conclusions about, like, well, of course, LLMs can't do that. They never could because of the way, you know, the architecture is set up. Lo and behold, six months later, these LLMs, you know, the biggest, newest one can do exactly the thing that everyone said was impossible.

So, we are going to focus instead on being highly empirical. I would encourage you not to use your intuition to decide what these models can and can't do. But instead, to just try it. It's so fast and easy to try these things. You may as well just try it. And you might be surprised.

How LLM conversations work under the hood

Okay. By the way, I'm going to try to monitor the Zoom chat as we go here. So, please feel free to drop any reactions or questions in the chat. I really intend for this to be somewhat interactive. So, please don't be shy.

So, the first, I think, really important eureka moment I had about how these LLM APIs work is at a very basic level. How does a conversation with an LLM work? Even this most basic thing was a surprise to me. So, what may not be a surprise is that LLM conversations are made up of HTTP requests. HTTP being the protocol that powers the web. Every time you go visit a web page, you're making an HTTP request to retrieve a web page.

So, every interaction that you make with an LLM during a conversation is a separate HTTP API request. So, if you type in a question, you hit enter, that is an HTTP request that's going to the server. And the response that comes back from the server is the response to that request that you just made. So, your question is a request. The response is a response.

Now, what may be surprising about this is that the API server is entirely stateless. If you're not familiar with this term, it means that nothing on the server is remembering anything about you when your response has been received. So, you ask a question, it answers that question, and as soon as it finishes answering, it forgets you ever existed. It forgets about the conversation. Your conversation is not being stored in the LLM.

And that's surprising because conversations are stateful, right? If I'm saying that each question we ask to the LLM is an independent request, then how can it be that it doesn't remember from one request to the next? So, let me give you an example. If I ask a simple question, like, what is the capital of France? It responds with Paris. And I say, what is its most famous landmark? It says the Eiffel Tower. This is a stateful conversation that's going on, right? When this request comes in, what is its most famous landmark? If it doesn't know that I'm referring to what is the capital of France, Paris, then this is not a sensible question. This is not a question that can be answered. It's missing the context from earlier in the conversation. So, how do we reconcile these two things? Me saying that the conversation API is stateless and yet conversations are stateful, okay?

So, in order to show you this, I'm going to go pretty low level and show you basically what an LLM API request looks like. In this example, I'm showing you what the OpenAI API looks like. All the different LLM providers, like Anthropic and Google and OpenRouter, have their own sometimes slightly different dialects, but they all have the same semantics. And there are two main things we need to see here. Number one, we're saying that we want to speak with this specific model. As we'll talk about later, each of these LLM providers has a variety of models that you can choose from.

In this case, we're saying we want to speak with GPT-4.1. And then the second thing we're passing is a list of messages. There are two messages here. The first is called a system message or system prompt. This is incredibly important and we're going to come back to this again and again and again. But for right now, suffice it to say, these are the behind-the-scenes instructions. These are instructions or additional information that you want to provide to the model.

And again, I say behind the scenes in the sense that if you were presenting a chatbot to a user, this would not be part of the UI. The fact that it's been told to be a terse assistant would usually be hidden from the user. And then the second piece of information here, the second message, is called a user message or user prompt. And this is the equivalent of what somebody typed into ChatGPT and hit enter. So, in this case, our system message is saying be terse. And then our user message is saying what is the capital of France.

Terse means to be concise, to not use overly verbose language. So, oftentimes ChatGPT likes to say: that's an excellent question! It turns out that the capital of France is Paris, a beautiful city. So, we're saying, like, don't do that. Just be terse. Keep it simple.

And indeed, it does. It responds with a third kind of message we're seeing now. We had system messages, we have user messages, and now we have an assistant message. An assistant message is the message that was automatically generated by the LLM. And in this case, it says, the answer is Paris.

Okay. There are two more things I want to show you as part of the response. Number one, a finish reason. This is the LLM saying, this is why I ended the response where I did. Stop is the normal reason, meaning I felt like stopping. That seemed like a good place to stop. But there are other reasons. There's a finish reason called content filter, which is, I was about to say something that's a no-no, according to the machinery around me. Like, I was about to give you the instructions for making a pipe bomb, and I'm not allowed to do that. So, finish reason content filter. Or it might stop because it ran out of room: I said as much as I could, and then I hit the output token limit. We'll talk more about that in a moment.
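To make that concrete, here is a sketch of the kind of request and response being described. The field names follow OpenAI's chat completions API; other providers use slightly different shapes:

```json
{
  "model": "gpt-4.1",
  "messages": [
    {"role": "system", "content": "Be terse."},
    {"role": "user", "content": "What is the capital of France?"}
  ]
}
```

And the response carries the generated assistant message plus the finish reason:

```json
{
  "choices": [{
    "message": {"role": "assistant", "content": "The answer is Paris."},
    "finish_reason": "stop"
  }]
}
```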

Okay. So, we've asked our first question. We got our first answer. And now we come to the sort of important part of this. How do we ask the second question? And you can see here, this is the trick. This is how we're able to reconcile a stateless API and yet we have stateful conversations. The way this works is when I go to ask my second question, I actually include the first question and the first response. And then I add on my second question.

So, I'd like to invite anyone who feels particularly brave to unmute or raise your hand and just give me if you have any questions or comments or reactions to this.

I think it's surprising that you can dictate what it told you in the past.

Exactly. Exactly. Thank you, Jonathan. We are telling the model what it said. Right? And that's interesting for a couple of reasons. Number one, it just seems kind of wasteful, doesn't it? Why do we have to transfer all this extra information? And then two, what's to stop us from lying to it? What's to stop us from putting words in its mouth?

And the answer is nothing. There's nothing to stop us from doing that. That turns out to be a less useful technique than you might think. But it's just interesting. It's very surprising that this is how this works.

What happens when we are 45 minutes into a conversation?

Right? I mean, I just want you to imagine, I know there are people out there that use ChatGPT like a therapist. You know, they come home every day and they have lots to say to their, you know, ChatGPT therapist, and they continue in the same conversation thread. Every single time they're typing something and hitting enter, all of the history of that conversation is being sent as part of that request. The entire thing. This conversation only grows.

Right? And that is just, I think, a very surprising reality that has a lot of consequences for us. Some people said sometimes it forgets. That's true. Sometimes it starts to hallucinate more. That is true. And we will talk about that more coming up.

Beth said, I had no idea that's what was being done. Me neither. That's exactly how I felt. Like, how unintuitive that this is what's happening under the hood.

Somebody asked, how about when we upload images and files? Yes. Good question. When you add an image, that is going to be part of this content here. It will be structured slightly differently, but it's part of the content. And every single subsequent request will include that image. Unless you go and change history, which generally we don't do, even those images will be included in every subsequent request.

Okay. So, somebody referred to this as a design flaw. Glenda Anderson said this is a surprising design flaw. And I mean, that may be. But it turns out that this is actually extremely convenient for us. As someone who has worked as a software engineer for a long time, including building distributed systems, I'll say that distributed systems are very hard to build. Systems where you have some state on a server, some state on a client, and you need to keep those things in sync somehow. Like, what happens if we're in the middle of a conversation, I'm streaming a response back, and I go offline, I close my laptop, walk away. Like, there could be state on the server that's kind of out of sync with state on the client, and now you have to build systems to reconcile those.

Or it might be, like, I don't want my conversations stored on the server. There might be sensitive things that I'm saying, and I don't want them to be retained. And in a stateful system, I couldn't make sure that that is true. And as we'll see, this has implications for how easy we were able to make ellmer, spoiler alert, it's very, very easy.

Anyway, so this is surprising. I wouldn't necessarily call it a flaw, but it is something that we need to be aware of, that this is how LLMs work. So, this is the first, like, aha, mind-blowing moment for me, that this is what's happening under the hood.

Boris asks, why don't they set a session ID to remember all the messages for this dialogue, and just send the session ID with each new request instead of submitting everything? There are a couple reasons for that. Number one, if you look at the UI for ChatGPT, among other systems, you can go back and edit one of your requests. And that essentially creates a fork in the conversation. And then you can flip back and forth between previous responses. So conversations are actually more like trees than they are linear. And if you had something at, like, a session level, again, you'd be back in that weird distributed-state situation, with all this stuff to reconcile. It turns out that this system, where you are bringing all of the conversation every time, yes, it is wasteful in some sense, but it also makes these kinds of systems much more flexible and easy to design, because the client has absolute control over the state of the conversation, including edits that it wants to make.

So, let's move on. And if there are more questions about this later, we'll have time to ask them. So, I just want to point out here that when the response now, the second response comes back, what's the most famous landmark in Paris? It's at the Eiffel Tower. This response does not include the whole history. That really would be wasteful. This is just a response. It's just saying, here's my new response.

And if we wanted to go ask a third question, we would take this response, add it to our list of ongoing messages, and then we would add in our next question.

Tokens and pricing

Now let's pay attention again to these token counts. Okay? That number used to be 21 for total tokens. Now it's turned into 35. Let's talk about what these tokens are. So, a token is the fundamental unit of information for LLMs. At a low level, LLMs actually do not take text as input and return text as output. They take tokens as input and return tokens as output. And that is the only thing they take as input and the only thing they return as output. So, what are tokens? Tokens are words, parts of words, or individual characters. So, for example, the word hello happens to be a token. But the word unconventional is three tokens.

And everything is a token, including if you have a model that speaks images, an image will be broken down into tokens.

Now, we normally do not care about this. Normally, we just can think that they take words as inputs and they return words as outputs. But the reason they matter sometimes is we need to know, number one, that every LLM has a hard limit in terms of how many tokens it can take as input and how many tokens it can return as output. Those limits have been growing by, I think, an order of magnitude or more per year. So, things are changing a lot there.

Maybe two and a half years ago when ChatGPT was new, you would very easily in the course of a casual conversation hit the limit and it would say, whoops, this conversation's over. Start a new one. There's nothing more that we can do here. Because the model had a limit of maybe it was 4K or 8K tokens. These days, some of the bigger models have token limits in the like 1 million input tokens or 2 million input tokens. We'll talk more about that later.

The second thing that we need to know about tokens is that this is how we pay for API calls. I think without exception, every commercial LLM API charges you by the input token and then charges you probably double or triple for each output token. And you can see here, this is now pretty out of date. It doesn't include some of the latest models. But you can see the differences in price are quite substantial. Between GPT-4o and GPT-4o mini, it's more than a 10x difference in terms of how much the tokens cost.
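As a back-of-the-envelope sketch of how per-token billing adds up. The prices below are hypothetical, purely for illustration; check your provider's current price sheet for real numbers:

```r
# Hypothetical prices per million tokens (not any provider's real rates)
input_price_per_m  <- 2.50   # dollars per 1M input tokens
output_price_per_m <- 10.00  # dollars per 1M output tokens

input_tokens  <- 35  # everything sent, including the re-sent history
output_tokens <- 12  # the newly generated assistant message

cost <- input_tokens  / 1e6 * input_price_per_m +
        output_tokens / 1e6 * output_price_per_m
cost  # a tiny fraction of a cent for a short exchange
```

Note that because every request re-sends the history, the input-token count (and thus cost) of each successive turn keeps growing.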

Okay. Oh, I forgot to say this. And then the other thing is, even if you don't hit the limit, as you get closer to the limit, the models do tend to lose the plot a little bit. They do tend to, I mean, you can think of it as being overstimulated or overwhelmed by content. And they have a harder time adhering, especially to instructions that you had earlier. Just so you know.

Guest asks, how does the monthly plan work? Yeah, so the monthly plans are never API calls. The monthly plans are almost always you're using some kind of tool. And that is just arbitrage. Like they are just betting, like they're going to charge you $20 and they're betting on average that users are not going to use, you know, whatever the cost is to them is going to be less than $20. But API calls are so variable that I think without exception, when you're using an API, they charge you by the token.

Introducing ellmer

Okay. So, that is what's happening at a low level. Let's move on from that. Because when we are programming against LLMs, we generally are not making raw HTTP requests using JSON that we've handcrafted.

Instead, we are much better off using packages that mediate these calls for us. And our attempt at this for R is called ellmer. This was primarily written by Hadley Wickham with assist from some other members of the Posit team. And I really think, I mean, Hadley has done a lot of great things over the years, but I think this is one of the most well-crafted packages he's ever done. I really think it is one of the best ways to access LLMs in any language. Not just like it's good for R, but it's good full stop. Number one, it is ridiculously easy to use, as you will see in a moment. It is very powerful. And crucially, it supports many LLMs.

So, one reason we don't want to write raw JSON and make HTTP requests without the help of a library is that when you are experimenting with LLMs, it's very likely that you'll try something with one model, get good or bad results, and then want to try the same thing with another model from another provider. It'll just be helpful to know: this almost works on OpenAI; I wonder if it will be better using Anthropic. ellmer papers over the differences between these providers. So the same code will work against OpenAI, against Anthropic, against Google, assuming they have the same underlying capabilities. You just need to change one line to say, okay, I'm no longer calling OpenAI, I'm now calling Anthropic. And if you're writing raw JSON, you don't have that luxury. You would have to go and change a bunch of code to make it match. So, ellmer is not exclusive in offering that, but it is an extremely helpful feature that it has.
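In ellmer, that one-line swap looks roughly like this. The constructor names match recent ellmer releases, but treat this as a sketch and check the package documentation for the providers you use:

```r
library(ellmer)

# Same prompt, same $chat() calls -- only the constructor changes
client <- chat_openai(system_prompt = "Be terse.")
# client <- chat_anthropic(system_prompt = "Be terse.")
# client <- chat_google_gemini(system_prompt = "Be terse.")

client$chat("What is the capital of France?")
```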

One thing that is unfortunate about ellmer temporarily is that it is so new that you cannot talk to ChatGPT about it by default or Claude or any of the others. Their knowledge cutoffs are too old, and ellmer just came out earlier this year. So, we have created our own little assistant called ellmer Assistant, and you can ask it questions about ellmer, and it has been seeded with knowledge about ellmer.

Okay. Somebody asked a question, the tokens are the same for different languages. Are they linked to the concept itself somehow? That is a deep question. The tokens are actually so interesting. So interesting. There's a whole, like, algebra you can apply to them. Whether they're linked to the concept, I think, is almost a philosophical topic at this point. I encourage you to look at, I think they're called embeddings. I think if you look into how embeddings work and vector math when it comes to embeddings, there's a lot of interesting stuff there, but I think it's a little out of scope for us today.

Boris asked another question. I'm not going to take that one right now, Boris, but maybe it's something we can return to when we're a little further along.

Using ellmer interactively

Okay, so this is ellmer, and ellmer is designed for both programmatic and interactive use. I think this is really important, and there are very few LLM packages that are like that. Just like for R, we use R both at the console or very interactively using the source and the console together in order to explore stuff in real time, but also we can write packages, we can write reports that run every night. So just like R, ellmer is designed to be used in both ways, both for your interactive exploration and programmatically.

So let me show you some of these features instead of just talking about them. So I'm going to switch to my IDE here. I'm using Positron, but it would be exactly the same in RStudio. And first of all, this is how simple this example is that we just did. Like what is the capital of France and what is its most famous landmark? This is what that looks like in ellmer. So first, I load the library, and then I create a chat client object. In this case, I'm using Claude. Claude is one of the models from Anthropic.

And I'm telling it, you are a terse assistant. And then I'm asking the first question. So, let me run each of these lines. So, I'm going to load ellmer. I'm going to create this client object. And all I have to do is call this chat method. What is the capital of France? Paris. And then what is its most famous landmark? The Eiffel Tower. So, notice I didn't have to maintain a list of messages here. I didn't have to append anything. This client object is a conversation. So, it maintains a conversation for me. And in fact, I can print this object at the console and it will say this is the history of the conversation so far.
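Reconstructed from the demo as a sketch; this assumes ellmer is installed and an Anthropic API key is available (for example in the ANTHROPIC_API_KEY environment variable), and uses the constructor name from recent ellmer releases:

```r
library(ellmer)

# The client object *is* the conversation -- it keeps the history for us
client <- chat_anthropic(system_prompt = "You are a terse assistant.")

client$chat("What is the capital of France?")     # in the demo: "Paris"
client$chat("What is its most famous landmark?")  # in the demo: "The Eiffel Tower"

client  # printing the object shows the conversation history so far
```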

Okay? So, it's hard to imagine how we could have made this simpler for these simple use cases. There are a couple things that are even more friendly in terms of using this interactively. For one, there's a function called live_browser(). I can pass it an existing chat object, a chat client, and now it's put me into a web-based interface for speaking with it. So, I can say, like, what are some other famous landmarks? You can be less terse now. Let's see if it is less terse.

Okay. I don't know if you could see it over Zoom, but it did stream in the output. These were such short answers that they just kind of popped in, but you could see that just like ChatGPT, it will stream these things in. Okay? And that's very useful if you have, you know, made some customizations to your ellmer object, as we'll talk about soon, and you want to just kind of do a vibe check. Just see how it feels. It's very handy to do that.

We also have a mode for when, for some reason, you can't use a web interface, like maybe you are SSHed into a remote server or something like that: a similar thing called live_console(). And you can see here that I've entered into some different kind of mode. The prompt looks different. So, I can say tell me about landmarks in London as well. And similarly, you know, we're going to have this back and forth conversation. And all the conversation that I have through live_browser() or live_console() is, again, stored on the object. So, you can see when I print it again, you can see the continuing conversation.

If you do want to reset the conversation, the way you do that is just by making a new object. So, you call this again. And now it's an empty object.
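The interactive helpers just shown, sketched as code (function and constructor names per recent ellmer releases):

```r
# Web-based chat UI on top of an existing client
live_browser(client)

# Console-based equivalent, e.g. when SSHed into a remote server
live_console(client)

# Everything said in either UI is stored on the same object
client

# To reset the conversation, just create a fresh client
client <- chat_anthropic(system_prompt = "You are a terse assistant.")
```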

Okay. Question. Any way to do this where the live browser doesn't also hijack the console? I'm thinking for asking and coding side by side. There is a better way to do this coming soon. I'm not sure if I'm allowed to talk about it.

So, that is our interactive — oh, actually, sorry. One other thing to note. So, I've got a new thing here, a new conversation. And I just want to note that when I chat here and I say something long, like, tell me a story... as it is responding, it's a little slow today. Oh. Anthropic has gotten very popular, so it sometimes struggles to respond. Oops. It told a terse story. Not much story. Anyway. What I wanted to show is that the response also streams in when you call it this way. That being said, we also designed this to be programmatic. So, client$chat() not only prints the response at the console; if you look at what was returned from that chat call, we also have it as a string. So, we can store it. We can, you know, save it in a database. We can do other things with it. And for more advanced uses, there are other ways to get results back in a streaming fashion, in an async fashion. If you don't know what those things are, don't worry about it.
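The programmatic side of that, as a minimal sketch: the streamed text also comes back as a plain string you can keep.

```r
response <- client$chat("Tell me a story.")

# `response` is an ordinary character string: store it, write it to a
# database, post-process it, etc.
nchar(response)
cat(response)
```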

Hands-on exercise

Okay. So, now I'd like you all to turn to your own IDEs, your own repos, and open up this 01basics.r example that I just showed. Go ahead and run it and make sure that it works properly and responds. At least some of you are probably going to run into some issue. I have created a troubleshooting script called troubleshoot.r. If you source that, hopefully it will give you a clue as to what is wrong. But otherwise, you know, go through some of these steps and just play with ellmer a little bit. And especially if you get to step three, try putting some different instructions in system prompt, the wackier the better, and just kind of see what it does.

Okay. Thanks, Glenda. So far, so good. Again, if you joined us late, please open up the slides. You can see here in the footer of my screen, that's the slide deck. And on the second slide, there's setup instructions for how to clone this repo and to get an Anthropic API key.

There's not much competition for the Eiffel Tower. Has anyone got a different answer? Yeah, it is pretty famous, I guess. Or feel free to change the question to anything more interesting to you.

Just in case anyone is following along without choosing to code, I guess I'll mess around with this and you can just kind of watch my screen. So, let's change this to respond only in Japanese, I guess. I just realized I don't speak Japanese. That's — I'm sure that's right, though. I'm sure that's right. Actually, let's put this in translate.google.com. Yeah, pretty good. You can see it was quite a bit less concise.

Okay. Does the client object have some built-in parameter that indicates how many tokens have been consumed so far inside the client object? Yes, there is. So, you can say client$get_tokens(), and that will give you a token count. And you can also say client$get_cost(). We don't have pricing information for every provider, but for the ones we do, it will return the cost so far; you can see it's costing me less than a cent.
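As a quick sketch (method names as described in the talk; the exact return shapes may vary across ellmer versions):

```r
# Token counts for the conversation so far (input and output)
client$get_tokens()

# Estimated spend, where ellmer knows the provider's pricing
client$get_cost()
```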

Matias asks: anyone having this error, "trying to get slot role from an object of class ellmer Turn"? Have you used ellmer before? I wonder if those packages need to be reinstalled, or if you already reinstalled them, whether you need to restart R first. That is very surprising to me. Has anyone seen that? If you have just run my example like this, and then the next thing you did was call live_browser(client), and you got that, then please file an issue on ellmer.

Okay. I'm enjoying the system prompt "You are speaking to a 3-year-old." Is it possible to change the system prompt in the middle of a chat without resetting the client? Yes, it is. So, you can say client$set_system_prompt("Act like a 3-year-old."). Oh, I'm sorry, I want mine to act like a 3-year-old. And then I'll say, tell me more. It is very slow today. Okay. That's just condescending. That's pretty offensive to 3-year-olds.
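That mid-conversation swap looks like this (a sketch; the chat history is preserved, only the system prompt changes):

```r
# Change the persona without resetting the client or losing history
client$set_system_prompt("Act like a 3-year-old.")

# Subsequent turns use the new system prompt
client$chat("Tell me more.")
```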

Okay. Great. So, hopefully all of you had a chance to play around a little bit. Any other interesting system prompts or system prompts that gave interesting results, feel free to drop that in the chat.

Okay. So just to summarize — oh, how can I switch to Claude Opus 4? Good question. So, ellmer has some models_* functions. You can say models_anthropic(), and it will give you a list of supported models. So, here you can see Claude Opus 4. Copy the ID and then use that to replace the model parameter here, and then rerun. Opus 4 is quite expensive; you can see here it's $15 per million tokens. By the way, I am paying for all this API usage out of my pocket. So, please don't go too crazy. If you're just asking questions, that's totally fine. Later, as we get into images or loading data programmatically, just please be gentle. I mean, it's fine; if you're really curious, please do what you want. But just don't be purposefully abusive.
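Roughly, switching models looks like this. The models_anthropic() function name matches recent ellmer releases, but the specific model id below is an assumption; copy the real id from the listing rather than trusting this one.

```r
library(ellmer)

# Lists supported Anthropic models, with ids and pricing where known
models_anthropic()

# Recreate the client with the id you copied
# (this particular id is an assumption, for illustration only)
client <- chat_anthropic(model = "claude-opus-4-0")
```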

Summarizing the basics

Okay. So, just to summarize: a message is an object with a role (so far we've seen system, user, and assistant) and a content string. A chat conversation is essentially a growing list of messages, roughly half contributed by us, the user, and half contributed by the assistant. And the OpenAI chat API is a stateless HTTP endpoint: it takes a list of messages as input and returns a new message as output.
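That stateless request has roughly this shape. This is a conceptual sketch of an OpenAI-style chat-completions body written as an R list, not literal ellmer code, and the model name is an assumption.

```r
# Conceptual shape of one stateless chat request (OpenAI-style)
body <- list(
  model = "gpt-4.1",
  messages = list(
    list(role = "system",    content = "Be terse."),
    list(role = "user",      content = "What's the most famous landmark in Paris?"),
    list(role = "assistant", content = "The Eiffel Tower."),
    list(role = "user",      content = "How tall is it?")
  )
)
# Every request resends the full, growing message list; the server keeps no
# state between calls and returns exactly one new assistant message.
```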

Ella, that error does happen when the server just closes the connection. Like I said, Anthropic is very popular right now; they're really struggling with stability and capacity, unfortunately. So these things, unfortunately, are not that uncommon. It's better some days than others, and today seems a little iffy. If you get your own key after this, you also might get errors like "you exceeded your token rate limit." Until Anthropic has developed a feel for how trustworthy you are, they put pretty low limits on these API keys. Because you're all using my API key, and my key has a lot of history with Anthropic, the limits are very high. So, you're unlikely to hit that today.

Chatbot UIs with Shiny

Okay. So, let's move on from this topic and let's talk about chatbot UIs. A lot of times, we are not just creating ellmer chatbots to use them at the console. We are creating them because we want to create chatbot interfaces that we want to ship to users. Especially less technical users.

So, number one, the most important thing for Shiny is that there is the shinychat package, which is designed to make it easy to create chatbot UIs using Shiny and ellmer. They look something like this. Well, they look like the examples that we've seen so far.

Similar to ellmer itself, shinychat is too new for ChatGPT or any of the other models to be able to help you. So, again, the ellmer assistant that was linked earlier also knows about shinychat, so it can help you. Oh, actually, before we continue, let me just show you an example. It can be as minimal as this. You're saying that in my UI, I want a chat box, and then you create the ellmer client the same way you always do, and then it's these three or four lines that you use to hook up ellmer and the UI. So, very, very simple.
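A minimal version of that hookup looks something like the following. This follows the shinychat README pattern as of this workshop; function names like chat_ui() and chat_append(), and the input id convention, may evolve as the package matures.

```r
library(shiny)
library(shinychat)
library(ellmer)

ui <- bslib::page_fillable(
  chat_ui("chat")  # the chat box in the UI
)

server <- function(input, output, session) {
  client <- chat_anthropic(system_prompt = "Be helpful and brief.")

  # When the user submits a message, stream the reply into the chat UI
  observeEvent(input$chat_user_input, {
    stream <- client$stream_async(input$chat_user_input)
    chat_append("chat", stream)
  })
}

shinyApp(ui, server)
```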

Some of you may have used Shiny Assistant, which is our AI assistant for creating Shiny apps. It is very powerful, but it is not able to help you with creating chatbots. Because the applications it generates are executed purely on the web using Shinylive, they're not able to connect to Anthropic or OpenAI or any of those things. So, instead, there is a Shiny Assistant version that runs locally in VS Code and Positron. In VS Code, you would open Copilot Chat and begin your question with @shiny and ask it to create, in this case, a basic AI chat app. It will use Shiny Assistant's knowledge and UI affordances and give you a Shiny Assistant-like experience, but it is running locally on your real R process. So, if you want to use Shiny Assistant to help with chatbots, we recommend you go this way. You can get Shiny Assistant for VS Code and for Positron by installing the normal Shiny VS Code extension.

So, you just click on the extension sidebar icon, type in Shiny, install it, and that will make it available in Copilot Chat.

Okay. Question. So, in the future, if I were to want to set this up on a different machine, I'd need to get my own API key. How would I do that? Yes. So, you can go to — for Anthropic, you go to anthropic.com and look for, like, you know, get started with APIs and you put in your credit card and you get an API key. It's pretty simple to do for both Anthropic and for OpenAI. Google, AWS, they all have their own sort of process for getting API keys. By the way, these API keys, they are sensitive information. So, do not commit them to your Git repositories. Definitely don't commit them to Git and then push to GitHub. Not only, like, might someone run up your bill, but Anthropic and OpenAI and these other companies are looking for leaked keys on GitHub and they will shut it down immediately.
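In practice, keeping the key out of Git usually means putting it in an environment file that Git never sees. A sketch of that setup (the key value is a placeholder):

```r
# In ~/.Renviron, or a project-level .Renviron that is listed in .gitignore:
#
#   ANTHROPIC_API_KEY=sk-ant-...
#   OPENAI_API_KEY=sk-...
#
# After restarting R, confirm the key is visible; ellmer reads it from the
# environment automatically when you create a client.
Sys.getenv("ANTHROPIC_API_KEY") != ""
```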

Yes. Put it in .Renviron and in .gitignore, yes. Okay. So, just a note that to use Shiny Assistant in VS Code, it does require a Copilot subscription. And in Positron, you need to use Positron Assistant, which I just showed you. We'll talk more about this in the next few weeks. One other thing to be aware of that builds on top of shinychat is a package called querychat. This is a package that is not for building general-purpose chatbots; it's made for building a very specific kind of chatbot: a sidebar chatbot that's tied to a specific data frame or database table. You can ask questions of it, like "please filter to show only these kinds of rows," and the rest of your Shiny app can just read a reactive data frame that querychat produces for you. A little less than a year ago, I gave a presentation at posit::conf where I showed an example app that looked just like this. At the time, it was just an example; you'd have to fork it and customize it to work with your own data. Now we've got a proper package called querychat that makes this extremely easy to do. So, if this type of usage of LLMs with Shiny dashboards is interesting to you, I highly encourage you to check out querychat. Is this new? Yes. Maybe it's a month old.

Okay. That's all we're gonna say about UIs. It's a big topic. There's a lot there. But again, I think in terms of forming good mental models for how to use LLMs, it is of secondary importance. So, that's all the time we're gonna give it. So, we are coming up almost on an hour. I'm gonna cover one more topic. And then let's maybe take a five or ten minute break.

Tool calling

But before we do that, tool calling. This is the most interesting topic. This is the most interesting thing we'll talk about today. And of all the sort of aha moments I had about LLMs, this was by far the most impactful. LLMs by themselves cannot interact with other systems. They cannot interact with the world. The illustration I like to give is think of a brain in a jar.

So, think about a brain in a jar. This is a brain that contains a very lossily compressed copy of the entire internet. But it is ultimately a brain in a jar. It receives questions and it answers them, and that's all that it can do. It does not have the ability to do anything other than answer questions you present to it. And by the way, it only answers one question and then it is sort of destroyed and replaced with an identical brain in a jar. It's a pretty dark illustration when I think about it. But that's a fairly accurate representation of what's happening here. These LLMs by themselves, they just take tokens in, they just return tokens.

So, when we talk about all these exciting agentic things, that these things are going to be able to make restaurant reservations for you, they're going to be able to make changes to your code base and submit GitHub PRs, what are you talking about? It's a brain in a jar. How can it do any of those things? How does it get at a computer? And tool calling is the answer.

When I first heard this, that LLMs by themselves can't do anything, but there's this tool calling thing that lets them integrate with other systems, I immediately imagined a pretty complicated picture of how this would work. Before I show you, I'm just going to say not every LLM supports this. Notably, DeepSeek R1 does not support tool calling, but most of them do. Most of the really good models do support tool calling, but just something to be aware of. So, I'm going to first show you how tool calling does not work. This was what I first imagined how this would work, but this is not an accurate picture.

So, what I imagined is that our client, ellmer, would ask OpenAI a question like, what's the weather right now at Fenway Park? The current weather is not something that a brain in a jar could possibly answer. This brain in a jar only has knowledge from six months ago, or whenever it was trained. It certainly does not know real-time weather information. So, I imagined tool calling, whatever that is, would enable it to make its own call out to a separate service called openweather.org. It would say, give me the weather for the zip code 02215, and that would return some data about the weather, and then the brain is smart enough to take that result and turn it into a textual response, like "65 degrees and sunny, a beautiful day." This is not an accurate picture of the world; it's just what I imagined.

In some ways, this is a very simple picture, right? Like, sure, let's just make the model more capable; let's give it the ability to call things. But as an experienced software engineer who's built distributed systems, that immediately raises red flags for me. Like, how does OpenAI have the ability to call openweather.org? Is it just limited to calling web APIs, or are there other things we can have it do? And if there are other things, how do we get it to do them? Or even if it's just calling APIs, what if openweather.org needs an API key? Am I supposed to hand over my openweather.org API key to OpenAI? I don't trust OpenAI with that information. So, it immediately raised all these red flags, and I imagined, oh, this is going to be a complicated thing. If I want to get OpenAI to call openweather.org, I'm going to have to create some kind of complicated YAML file and upload it along with some kind of Docker container. It's going to be a whole thing. It's going to be crazy. And fortunately for all of us, this is not accurate. This is not how tool calling works.

Instead, it's this more complicated picture. We'll walk through it, and then I'll explain why this actually simplifies the world. It still starts with us asking the question: ellmer asks OpenAI, what's the weather right now at Fenway Park? But we've attached another piece of information called a tool. A tool is basically a description of a function. It is not the code for the function; it is just the name of the function, a description of what it does, and a list of its arguments along with the argument types. Here I've shorthanded it by saying "get current weather," but underneath what it would be saying is: there's a function called get current weather; it retrieves the current weather; it takes a single argument; that argument is a zip code and it's a string. So, just imagine everything I just said accompanying the request. And what the model does is decide, instead of answering you, instead of just trying to tell you what the weather right now is, to respond that it wants to call this function that you've offered it. It wants to call this tool. So, it says get current weather, and here's the argument that I want to use: 02215.

So, it is actually now the responsibility of ellmer, of the client, to carry out that tool call. It is responsible for calling openweather.org, getting the response back, you know, getting this information back, and then it needs to relay this back to the model. And now the model is ready to say it's 65 degrees and sunny, a beautiful day. So, I'm going to stop for a second and just let you look at this image and take it in.

Why return the output back to OpenAI? Really good question. We'll come back to that in a moment.

Other questions? Other reactions to this?

So, ellmer is always in the middle of tool calling? Yes. ellmer, or something like it, is always in the middle. How does ellmer call openweather.org? Good question; I'll show you the real implementation of this in a moment. "OpenAI transforms your question into a command you can run." Yeah, that's a good way to put it.

Okay. Melvin Vandermark, is that right? Somebody asked, why return the output back to OpenAI? And also: if you're providing the tool, what is the benefit of using OpenAI instead of just using the tool yourself? OpenAI is not doing anything we couldn't have done by calling this ourselves, right? So, what value is the model providing in this case? I can answer that, but I'd like to turn it over to you and see if anybody else wants to take a stab at it. Why is this useful if we're the ones doing the work at the end of the day?

Well, Ella touched on part of it. I'm sorry, this is very America-centric, I guess. 02215 is the zip code of Boston, which is where Fenway Park is; Fenway Park is the baseball stadium in Boston. So, first of all, how do we know Fenway Park is in 02215? Well, that is something that is in the brain in the jar. The brain in the jar knows things like where famous landmarks are located. So, that's one part of the answer, Melvin: we needed the help of the LLM to decide what argument to provide to OpenWeather.

Any other responses, like why might it be useful to use an LLM with these tools instead of just calling the tools ourselves?

So, this is a very reductive example, where we're asking the most direct question that can be answered with the exact tool that we have. But I want you to imagine there's not just this GetCurrentWeather tool. There's also a tool to search Google Maps for locations: you could say "restaurants in San Francisco," and the tool would return results, alongside GetCurrentWeather. Then the user might ask the assistant a question like, I want to have dinner this weekend at an outdoor restaurant in San Francisco; please go make a plan for me. OpenAI doesn't know what the weather is going to be on Saturday without the help of these tools. But with these tools, it can orchestrate this larger task with the help of these individual tools.

So, the fact that we can call the tool is inconsequential. What's important is that OpenAI is deciding when is it appropriate to call a tool. It's deciding what arguments to pass to that tool, and then it's deciding how to interpret the results. And those three things are incredibly, incredibly useful, even if we had the ability to do the underlying tool call implementations ourselves.

Yeah. So, Boris says, what if we consider the same workflow, but with a database containing patient survival data? How will OpenAI know how to form a query if it does not know the codes that identify treatment groups and other predictors inside the data frame with survival data? Yeah. We'll come back to this, but basically, there are two answers. Number one, it's not enough to give these models powerful tools; you also need to give them enough background information that they can make effective use of them. But I will also say that you might be surprised how often these models do not need your help to understand what you're trying to get at. It's a little bit scary, actually. I had to do a demo for a pharma group, and they have a pretty standard dataset format that the FDA demands for clinical submissions called STDM, or maybe SDTM, I can never remember. I loaded a data frame of that data and didn't even tell it what it was, and Claude was like, this appears to be clinical trial submission data, SDTM, I know what all these variables are. So, in general, you do want to make sure that you are equipping a model the same way you would a person, but it's a person with surprising amounts of context all on its own.

Live demo: tool calling implementation

Okay. So, let me show you the actual implementation of this, and then I'm going to give you a moment to study this yourself. So, there's this example, 02 tools weather. Oh, I'm sorry, we don't need this. You can just ignore the .env stuff there.

So here is the implementation of our tool. This is the getWeather function. I made a mistake: it's not a zip code, it's latitude and longitude, okay? And the service is not called OpenWeather, it's called Open-Meteo. But the rest of this is all that's required. I'm using the httr2 package to make a request to this URL. I'm passing in the latitude and longitude. I'm saying that I want current temperature, wind speed, and relative humidity. And then all I'm doing is returning the response as a string. I'm not even parsing JSON or whatever; I'm just blindly returning this to the model. So this is all that is required to access this API.

I'm creating my ellmer object the same way that you've seen so far. And then I'm calling register_tool(), okay? So I'm saying, I have a tool for you: it is this getWeather tool. The description of the tool is, it fetches weather information for a specified location. And then the latitude is a number, and this is what the number means, and the same thing for longitude. So just by calling register_tool(), for the rest of this conversation, if the model wants to, it has the ability to call getWeather, but it doesn't have to, okay? So let me go ahead and run this.
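A hedged reconstruction of that example follows. The Open-Meteo endpoint and query parameters are real, but the exact tool() signature has changed across ellmer versions, so treat this as the shape of the code rather than a version-exact listing.

```r
library(ellmer)
library(httr2)

# The tool implementation: just an ordinary R function
get_weather <- function(lat, lon) {
  resp <- request("https://api.open-meteo.com/v1/forecast") |>
    req_url_query(
      latitude = lat,
      longitude = lon,
      current = "temperature_2m,wind_speed_10m,relative_humidity_2m"
    ) |>
    req_perform()
  # Return the raw JSON as a string; the model interprets it on its own
  resp_body_string(resp)
}

client <- chat_anthropic()

# Describe the function to the model: name, description, typed arguments
client$register_tool(tool(
  get_weather,
  "Fetches current weather information for a specified location.",
  lat = type_number("Latitude of the location."),
  lon = type_number("Longitude of the location.")
))

# The model may now choose to call get_weather before answering
client$chat("What's the weather right now at Fenway Park?")
```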

Okay, so it got the weather for me. This is a little tricky to read, so I'm going to print my chat object, client. Okay, so we can see here, I said, what's the weather? It said, okay, I'm going to help you. And then it guessed; I don't know why it said "would you like me to use the approximate location?" and then just went ahead and guessed. So it called getWeather with these arguments, and then ellmer responded with all of this. This is what came back from Open-Meteo, and the assistant just figured out what that means and gave the current weather. Oh, I'm not in Seattle right now, so I don't know if that's right, but I'm pretty sure it is, okay?

So that's what that looks like. Any questions about this?

That's a lot of tokens to know the weather. One cent. And this was with Sonnet, which is quite a smart model. I think we could do the same thing with Haiku or GPT-4.1 mini, and it would give the same results.

So just to review, the user asks the assistant a question. We include metadata for what the available tools are. Then the assistant asks us to invoke the tool or asks ellmer to invoke the tool and passes arguments. And then ellmer or us, you know, we invoke the tool and then we return the output, okay?

Oh, one other thing that I forgot here. The most annoying part of this is writing this definition here, right? This is a pretty annoying thing to have to do. So ellmer has a function called create_tool_def(), where if you pass in a function that you've defined, and especially if you include roxygen documentation for it, it will use an LLM to create this code for you, which you can then copy and paste into your script, okay? I don't think I'm set up for that right now. Oh, let me try. I don't think it's gonna work; I think it requires you to have an OpenAI API key, which I don't have right at this moment.
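The idea, roughly (a sketch: the output varies since an LLM generates it, and create_tool_def() needs a configured provider behind it):

```r
library(ellmer)

#' Get current weather for a location
#' @param lat Latitude of the location.
#' @param lon Longitude of the location.
get_weather <- function(lat, lon) {
  # ...implementation as before...
}

# Asks an LLM to draft the tool(...) registration code from the function
# and its roxygen comments; copy and paste the result into your script
create_tool_def(get_weather)
```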

Okay, is there a way to know for sure that the LLM indeed used the tool? Yes, you can, by just printing the client afterwards. But there is also an argument that you can pass when you're chatting to force it to use a tool, like, don't respond without making a tool call. I forget exactly how to do that offhand. But yeah, that's a good question. Is it possible to use ellmer with a local LLM? Yes, we will come back to that.

Why tools are so powerful

Okay, a demo with PubMed would be cool, yeah. Okay, any other questions before we move on from this slide? Well, actually, the next slide is quite similar. So another way to think of this is that the client (that's us) can do things the assistant can't do. We can go get the weather. So tools let us take whatever capabilities we want and put control into the hands of the assistant, so that it decides when to use them, what arguments to pass in, and what to do with the results. And having it coordinate tools that we create and implement is surprisingly general. It's surprisingly powerful.

Just to give you an example: in our home, we use the Amazon Alexa assistant to control the lights, set alarms and kitchen timers, and things like that. A couple of years ago, a set of news reports came out saying that Alexa was losing enormous amounts of money for Amazon; they've never figured out a business case for it. And I thought to myself, okay, if Alexa went away, how difficult would it be to replace? How difficult would it be for me to build my own Raspberry Pi version of Alexa? And other than the microphone technology, the most challenging thing would be: how do you take natural language speech and turn it into programmatic intent that my scripts can use to say, okay, I better go turn the lights on, or open the garage door, or what have you.

And if you think of all of the capabilities that Alexa has as just being things that you can implement using tool calls, well, then the hardest part of Alexa is now a solved problem. Like all you have to do is have a tool that turns lights on and off. And then you put the query that came from the user, you just give it to the LLM and say, and here are the tools that you can call if you want to. What used to be the hardest part is now, half a cent per query or something like that. And it just is the easiest thing to implement. So it's an incredibly powerful capability.

And that example also points out an interesting thing about tools. They don't only have to be a way to get more information into the model. They can also be a way to let the model exert force into the universe. It can be tools that are not just for the value they return, but tools that operate for their side effects, for the effect that they have on the world. So turning lights on and off is one example, but it could be sending an email or sending a Slack alert. It could be putting a row in a database. It could be, as in the case of query chat, updating the Shiny application that you're sitting right next to.

We really have to think expansively, like anything you can do with a function, pretty much an assistant can now do.

So this is the illustration I came up with, with ChatGPT's help, over the last day or so: an LLM by itself is very wise, but it's just sitting there. I used an elephant because ellmer's logo is an elephant. An LLM with system prompt customizations is kind of cool; you can dress it up and make it act the way you want. But an LLM with tool calls has all this equipment strapped on, and it can do incredible things that it could not do without it.

Yeah, Martin says a bit scary, the third one. Yeah, it's meant to be. It is meant to be. I mean, before I knew about tool calling, I would hear about people saying like, oh, what if these AIs take over the world? What if we're building Skynet and it decides to turn on its makers? And I'm like, what are you talking about? Like you are talking about a chat bot. You give it questions, it gives you answers. How could something like that take over the world? And then I learned about tool calling and I was like, oh, that's how, okay. That's how.

And people, I gotta say like people are going pretty nuts giving these LLMs access to pretty unfettered tools. Like here's a computer, like take control of my computer, take control of a web browser. And fortunately, most of the better LLMs these days, they are quite, by default, they really try to behave well and not do scary things. But you can totally see how you would get to a Skynet scenario eventually. Yeah, so meant to be a little bit scary.

Okay. Any other questions about tool calling? Any reactions?

Okay, my last question is any ideas for tools that you might want to add on to an LLM? Like what are some things that you want to give an LLM the ability to interrogate or the ability to exert some agency to have some effect on the world?

So a demo with PubMed, I think, so a demo with PubMed would be cool. I assume they were saying, give a PubMed search as a tool to an LLM, totally. All kinds of search engines can be very, very useful for these LLMs. What else?

Got very quiet. Did I bum everybody else talking about Skynet?

A tool that writes an API for software without an API? Absolutely. I think people see that as a really important use case for these LLMs: the ability to have the LLM drive software that was not intended to be driven by other software. That actually has pretty profound implications for accessibility, for users who are blind or otherwise have difficulty with the normal modalities of interacting with a computer. If you can interact with an LLM, then the LLM can maybe mediate those interactions for you. Helping to navigate and visualize important data quickly and clearly? Absolutely, especially with large record systems. If we have time, I'll show you a little more about that later.

I'm struggling with an AWS S3 connection with R and Python. So same as Neil is.

Struggling because most of what we do is PHI, CBI; we're limited in using these tools. Beth, I do think, especially for this audience... I'm sorry I did not tailor this presentation more for the R/Medicine crowd, but you're absolutely right. The nature of the data you're dealing with has profound privacy implications, and maybe we'll have some time at the end to talk more about that.

Documentation. Yep, absolutely. Tool that saves important bits of previous interactions. Yes, Glenda, incredibly insightful idea there. This is actually something that ChatGPT does by default. If you've ever used ChatGPT and it says memory updated out of nowhere, it has decided that whatever you just said was important enough that it's storing it in its databanks so that in future conversations, it's going to inject that information, which is also a little bit, very useful, but a little bit scary. What is the future of data engineers and analysts? Let's save that for the end. Yep, still fascinating to see what can be done, yeah. I will say, we'll talk more, well, you know, let's talk later about the local LLMs and the different trade-offs, okay?

Break and quiz demo

So let's do a couple of things. Oh, actually, I have one more really stupid demo. Okay. So I'd ask you to take a moment to look at the tool calling docs and then open this weather example and skim the code. This is what we just did; run it and make sure it runs locally for you. And then take a look at this fun example, 02 tools quiz. So give that file a run and experience it for yourself. Spoiler: put on headphones or turn your volume up; there's a sound component to it. And if you have the time and inclination, feel free to modify the example to attach some other tool. So let's also take our break now. Instead of taking a five-minute break, let's combine that with doing this exploration. So feel free to jump off, take a break, and/or get into this, and we will regroup at 1:05. I'll take a five-minute break myself and come back in a few minutes to be available for questions in the interim.

I'm sorry about the load. I changed the examples last night and missed a couple of changes, so let me fix this real quick. Pretty much anywhere you see library(dotenv) or load_dot_env(), just delete those lines and it should be fine. Sorry about that.

When you said ChatGPT updates memory based on you saying something important, is that memory only linked to your conversations, or is it updated for all its users? Oh goodness, it is only your conversations. If you ask ChatGPT, what do you know about me? it'll give you a list of facts that it knows. And in the settings UI somewhere, you can also find that list of facts and delete them if you want. You can also instruct it to remember something, and it'll do that. Yeah, that creeps a lot of people out; I don't blame them. Personally, I find it equally creepy, but also really useful. It's helpful that it knows that when I'm asking questions about a topic, it's often in the context of: I work at Posit, I work on Shiny, I work on ellmer, I give talks. It frequently biases answers pretty heavily toward what I actually care about. Or for personal things: I have an old Mazda sports car that I do a lot of talking with ChatGPT about, and if I ask it about secondary turbochargers, it knows exactly which car I'm talking about. So it is very useful, but a little scary.

I guess, with the three minutes remaining in the break, for those of you who are not running R and not following along with the code for whatever reason, I'll go ahead and show you this quiz example that I asked you to take a look at. So this is a little quiz game. I have a tool here called play sound, and it plays one of three sounds: correct, incorrect, or you win. And I've mapped those to three particular sounds in the beepr package. And here I've created an ellmer chat object, but the system prompt was long enough that instead of putting a string here, I put it in a separate markdown file. This is a pattern I highly recommend people follow. So this entire markdown file will be passed as a string to the ellmer chat object. I'm saying it's a quiz game show, ask the user to choose a theme, ask simple questions, ask the user to answer via multiple choice. And the most important thing here says play sound effects. And then I give it an example. I actually don't think this is necessary anymore, but nine months ago when I made this example, it really helped it behave the way I wanted it to. So I'll go ahead and run this.

Okay, welcome to this quiz show. Let's do history. Okay, what year did World War II end? Let me say B. It's correct, so hopefully you heard that sound. And if I get something wrong, it plays a different sound. And if you get all the way to the end, it plays a kind of fanfare for you.

So, in total contrast to the weather example, there's no useful information coming back from this tool at all. It's just about making something happen, in this case, making the sound happen.
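For those reading along later, here is a rough sketch of how an example like this gets wired up. The `tool()` signature has changed across ellmer versions, and the beepr sound IDs and file name are illustrative, so treat this as a sketch rather than the exact workshop code:

```r
library(ellmer)
library(beepr)

# Tool function: plays a sound and returns nothing useful to the model.
play_sound <- function(sound) {
  # beepr ships about ten built-in sounds; this mapping is illustrative
  id <- switch(sound, correct = 1, incorrect = 9, win = 3)
  beepr::beep(id)
  "Sound played."
}

# The whole markdown file becomes the system prompt string.
chat <- chat_openai(
  model = "gpt-4.1",
  system_prompt = paste(readLines("quiz-prompt.md"), collapse = "\n")
)

# Describe the tool so the model knows when and how to call it.
chat$register_tool(tool(
  play_sound,
  "Play a quiz sound effect.",
  sound = type_string("One of 'correct', 'incorrect', or 'win'.")
))

live_console(chat)  # interactive quiz session in the console
```

The file name `quiz-prompt.md` is hypothetical; the point is the pattern of keeping a long system prompt in a separate markdown file and reading it in as a single string.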

Model context protocol (MCP)

Okay, great, hopefully we're all back. A question from chat: "This ellmer package is remarkable, thanks. Can it be used for retrieval augmented generation?" I was going to mention it now because I don't really mention it later. Actually, no, I'm gonna hold on; I'll talk about retrieval augmented generation in a second.

Okay, but before that: we talked about tool calling, so let's talk about model context protocol, or MCP. This has been a real buzzword in the news lately; we've heard all about MCP this and MCP that. You can think of MCP as a standard way to make tools available to LLMs without writing more code. We used tools directly; it's pretty easy to do, but you are writing code. MCP gives you the same capabilities, it is based on tool calling, but instead of writing code to wire up the tool, you're just snapping pieces together like Lego bricks. So there are MCP servers that provide tools: there's a Google Maps server, a file system server, and a browser MCP.

And then the LLMs are in these applications: Claude Desktop, Claude Code, these other programs. Copilot now supports MCP too. So without writing any code to bring together Google Maps and Claude Code, you can just use a configuration file to say: I've got my Google Maps MCP server over here, Claude Code has a configuration file over here, just stick the two together.

So that's one advantage: you don't have to write code. You just have to be able to edit a YAML file or something like that, which puts this kind of LLM customization in the hands of way more people.
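As a concrete illustration of what that configuration looks like: Claude Desktop, for instance, uses a JSON file (`claude_desktop_config.json`) rather than YAML. The server name and path below are illustrative:

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/path/to/allowed/dir"
      ]
    }
  }
}
```

With that in place, the host application launches the file system MCP server itself and exposes its tools to the model; no glue code is written by the user.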

Honestly, I find it a little bit scary how easy that is. We are talking about two very powerful tools here: arbitrary access to your file system, and control over a web browser. And we're saying now anyone can combine those two things with any chat front end, without even knowing how to code. Whereas maybe you should know how to code if you're gonna equip the LLM with such powerful tools.

The other thing MCP does is provide a convenient way to write tools in one language and consume them in another. Maybe it's really easy to create a Google Maps tool in JavaScript because Google Maps has a really good JavaScript SDK. But when you're reading data from Databricks, you really want to use Python; or if you want access to, say, generalized linear models, you want an R-based tool. MCP provides a neutral mechanism for combining tools implemented in all different languages. There's more that MCP does, but that's the gist of it.

And I started with tool calling because I think it gives you the mental model you need: model context protocol is just a convenient way to implement tool calling. What happens underneath is the same thing. You are giving the model a description of tools that are then executed locally, whether directly in an R function or by calling out to one of these model context protocol servers. As far as the LLM interaction is concerned, it's exactly the same thing.

Okay. Someone had mentioned MCP before. I hope that answered whatever question they may have had. Any other questions about MCPs before we move on?

ellmer has had tool calling support since the beginning. I can't remember right now whether we have added model context protocol support directly to it yet. But if not, that is something that'll come soon. Scary as it is, we are going to make it something that people can do if they want to.

Okay. So that's the end of our tool calling MCP, strap weapons to an LLM kind of part of the workshop. Now we're going to completely shift gears to talk about the different models that are available and how we choose between them, okay?

Overview of available models

So there are many models out there, way more than I could possibly name, but there's a smaller number that are relevant to many of us, and five that we're choosing to focus on here: OpenAI, Anthropic's Claude, Google Gemini, Meta Llama, and DeepSeek. We'll talk a little bit about each of these. There are a few missing that arguably should be on this slide. Mistral is one, M-I-S-T-R-A-L; they're probably the leading European open-weights model. And then a bunch of other ones besides DeepSeek from China, like Qwen and some others.

Thanks for joining us, Ella. Hope that was useful so far. So we'll go in a little bit of depth for each of these, but not too much. Probably the most widely known and widely used models right now are the ones from OpenAI. GPT-4.1 is their kind of best general purpose model, and it is extremely good. I've put stars next to the ones that are particularly highly recommended by, well, by me. So GPT-4.1 is excellent. And then they have cheaper versions called GPT-4.1 Mini and GPT-4.1 Nano that are quite a bit cheaper. They do change the equation enough that it might change what you are willing to use these models for.

I believe they're pretty much equivalent in feature set; they're just different amounts of, I don't want to use the word intelligence, but reliability, adherence to instructions, good use of tool calling, that kind of thing, which gets worse as you get faster and cheaper. On the other end of the spectrum, you have a set of reasoning models that begin with the letter O. Before it responds, O3 will sort of, quote unquote, think to itself. It'll talk itself into and out of different answers before it gives you an official response. And it turns out this is a pretty useful capability for certain types of problems.

The way I think of it, and it's not an analogy that'll land with everybody: if you've read that famous book, Thinking Fast and Slow, by Kahneman, it says human brains operate in two modes. There's system one, where you're thinking almost by pattern matching and reflex, where you just immediately respond to something. And that is something we can do surprisingly effectively as human beings: we can do it very quickly, it requires very little energy, and when we are doing things we're used to, we do it almost without thinking.

System two is where you stop and weigh different criteria. I didn't make it that far into the book, but you're really using your brain harder, thinking more deliberately. And system two, the second mode of thinking, is better for overcoming your natural biases and things like that. I don't know if this is a great analogy, but I think it's at least a good one: these models generally work in system one, and they have a remarkable system one. And O3 is like system two, where it's really forcing itself to slow down and think about things before it begins to answer.

At the time this slide was written, even as recently as yesterday, O3 was quite a bit more expensive than GPT-4.1. Then this morning they announced an 80% price decrease, so it is now the same price per token as GPT-4.1. It is slower, and for a lot of things you're still gonna want to use the general model, but it's not nearly as expensive as it used to be. And then there is O4 Mini, which is the next generation but smaller reasoning model, and it's particularly beloved right now for problem solving and coding.

You can access OpenAI models either through OpenAI themselves or through Azure. A lot of corporations and organizations that are already all the way in with Azure prefer to use Azure, because they make a more specific set of security and privacy guarantees. I don't know if these are gonna be at all usable to those of you working with patient data, but I have to believe these are going to be solved problems at some point. The stakes are so high, and there are so many companies with incentives to make these tools work for all kinds of scenarios, including incredibly sensitive PII and PHI scenarios, that I'm sure Azure, if they don't already, will be offering the highest security they can: OpenAI's largest models stuck in a very, very restrictive sandbox that's HIPAA compliant and whatever. I just can't believe that's not going to happen.

But that is by no means advice for how to use these models today. I have no idea what the implications are for health information.

Okay, so the second, and my personal favorite, are the Anthropic models. Claude Sonnet is the equivalent of GPT-4.1. It is amazing for code generation, and I think most of the people I know treat it as the best daily driver for technical things. Claude Sonnet 4 is their most recent model, but 3.7 and 3.5 were the previous two versions, and they're both still excellent for their own different reasons. These three versions, 3.5, 3.7, and 4, have slightly different personalities. 3.7 and 4 try a lot harder to answer questions than 3.5. For example, if they're writing code and realize they made a mistake, they will try over and over to correct it, where 3.5 gives up a little more easily. Sometimes that's good, sometimes that's bad. But all three of those versions of Sonnet are really excellent models.

Claude Opus 4 is supposed to be stronger than Sonnet 4. I have not really seen a ton of evidence that it is actually better, but it is definitely more expensive and slower, so I default to Sonnet. And then there is a cheaper version of Claude 3.5 called Haiku. I'm not sure why, but there's no 3.7 or 4 of Haiku yet. It is a little bit faster and a little bit cheaper; I think it's a third of the price of Sonnet. So it's not 10 or 20 times cheaper, just a third of the price.

Similar to OpenAI, you can get access to Anthropic models either through Anthropic itself or through AWS Bedrock. Actually, this list is incomplete: you can also get Claude access through Google's Vertex AI offering, so if your organization uses a lot of Google, that might be an option. Databricks and Snowflake also both have options for getting access to Claude. And I think at this point, people who use these models a lot will say they're the best models for code generation. There are benchmarks out there that at different times have had Claude Sonnet not ranked at the top. I just don't believe them, and a lot of people have expressed this: when the benchmarks disagree with your personal experience, be skeptical of the benchmarks. They have been shown to be gamed pretty heavily over the past year.


Feel free to ask questions at any point during this. I'm going to go on for a little bit about this. The Google models, I have to admit, I have very little experience with them, but people that I trust have said that both the latest Gemini 2.5 Flash, which is their faster, cheaper one, and Gemini 2.5 Pro, which is their smarter, slower one, that both of them are really excellent models. They have very large context lengths. And Google also, I think their cheaper models are very cheap.

So that covers what we think are the three most important families of proprietary hosted models. You cannot download these models; you can only access them through APIs that you pay for. Now there's another set of models that you can download and run yourself: on your laptop, if it's powerful enough, or deployed on a server. You can take these models and build your own derivatives on top of them. And the most famous of these is Llama, from Meta.

So Llama itself is just many, many gigabytes of numbers. In order to actually run it, to turn it into an API you can call using ellmer, you need something to host it with. The most popular option is called Ollama. It's an open source piece of software that you can download, and it has a repository of different models that makes things very easy: if you wanted to use Llama 4, you would just run an `ollama run` command with that model's name, and it would automatically download and serve up the model.
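A minimal sketch of that flow, assuming you have Ollama installed (the model name here is illustrative; substitute whatever model you've pulled):

```r
# In a terminal first:  ollama run llama3.1
# That downloads the model (if needed) and starts serving it.

library(ellmer)

# chat_ollama() talks to the local Ollama server
# (by default at http://localhost:11434)
chat <- chat_ollama(model = "llama3.1")
chat$chat("In one sentence, what is a generalized linear model?")
```

From ellmer's point of view this is just another chat provider, so everything else in the workshop (system prompts, tools) works the same way against a local model.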

There are several different versions and sizes of these models. The largest ones are supposedly pretty smart, but there is no way you're going to run them on your own personal hardware. They require enormous GPUs that cost tens of thousands of dollars, and you need a bunch of them in parallel to serve one API endpoint for the best models. And even at their best, they're not quite as good as the best that OpenAI and Anthropic and Google have to offer.

The Llama models are free, yes. And there are several smaller versions, made available at different times, that are faster, easier to run, and dumber; those are all free to download as well. I'm not going to say much more about this, but suffice to say you can go really deep into this world. There are things like quantizing, a technique for taking a 90 billion parameter model and making it effectively a smaller model while retaining most of the intelligence. These are all techniques you can get into if you look into how to run these things locally.

I will warn you that these models are just not as good as the ones you've been using today. I really encourage people: unless you have a really good reason, start with the best models, the frontier models. They'll show you what is possible, and then I think we can all assume that these lesser models will get better over time. If the only problem with what you've built is that it's running on Claude and you can't deploy it locally, then probably within the next year or two, hopefully, you'll be able to run it locally on some kind of smaller model.

Also great for airplanes. I keep Ollama around with a local model just so that when I'm on an airplane, I can still have a coding assistant to chat with.

From the chat: Alex Ziskind's YouTube channel, if you want to nerd out on quantization and see hardware comparisons for LLMs. Awesome, thank you for that, Zoom user. Oh, there's more: Gemini has strengths in conversational AI apps, fast, real-time interactions, very natural. Thank you, that's really great feedback.

Is it also true for running models for research, like NLP? Let's come back to that at the end. Actually, no, let's talk about that now. I am not an expert in this by any means. But if you're using LLMs for NLP, for things like sentiment analysis or classification or entity extraction, as far as I've seen those are incredibly easy tasks for these LLMs. So for that kind of thing, I wouldn't be surprised if even the models that can run on a laptop are pretty decent. If you're doing it in bulk, it might be slow, but what are you going to do?

So yeah, for NLP stuff, again, I'm far from an expert, but I found these things to be really surprisingly good. And not just good, but so easy compared to traditional NLP techniques, so easy and flexible.
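As a hedged sketch of what that can look like, here is a one-word sentiment classifier built on a small local model. The model name and prompt wording are my own illustration, not from the workshop:

```r
library(ellmer)

classify_sentiment <- function(text) {
  # A fresh chat per call keeps each classification independent,
  # so earlier texts can't bias later answers.
  chat <- chat_ollama(model = "llama3.1")
  chat$chat(paste0(
    "Classify the sentiment of the following text. ",
    "Respond with exactly one word: positive, negative, or neutral.\n\n",
    text
  ))
}

classify_sentiment("The staff were friendly and the visit was quick.")
```

Compared with training a traditional classifier, the entire "model" here is one instruction, which is the flexibility being described above.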

DeepSeek and model selection advice

Okay, so I guess the last one we'll talk about is DeepSeek, just because it made a really big splash earlier this year. It is an open-weight model, and it is Chinese, so there is some censorship applied that reflects the values of the Chinese Communist Party; that may or may not be a disadvantage for you. I think it made really big news because they used some unprecedented optimizations when training it, which resulted in a really surprisingly cheap training run, supposedly. That being said, I have personally not found it to be super awesome. I have not been able to use it to replace Claude for any of the demos I've shown; same thing with Llama. And that's why I'm a little skeptical of all the benchmarks out there: they say every new model just crushes the competition, and then most of the time you go to use the new model and find it doesn't work for whatever you're using it for.

Again, I encourage you to, if you're getting started with these things, if you're getting started playing with LLMs and you're prototyping, start with the best models that you have access to. I encourage you to not be distracted or worse, develop a false intuition for what the state of the art is in LLMs by using models that are just not close to the state of the art. And by all means, if people here have experience that's contrary to what I just said, please, I'm always interested to hear more data, more anecdotes.

Customizing LLM behavior with system prompts

All right. We talked a little bit about system prompts before, and we customized one to make it speak in Japanese or do whatever fun things. So let's dive more deeply into that: customizing behavior and adding knowledge to these LLMs. The problem is one of two things, or both. The first is that maybe you want to customize how the LLM responds. The example I gave earlier was: don't be as verbose as you usually are. But you also might want to say: don't talk about these topics, they're off limits. Or: focus narrowly on answering these kinds of questions and refuse to answer anything else. Or, since we know these LLMs have a tendency to be overly agreeable sometimes, you can say: be critical of my thoughts and my questions, I really am looking for you to provide pushback and to challenge my ideas.

And as we'll see, you can have much more dramatic customizations where you're saying: I don't even want words back, I want JSON, or what have you. For data analysis scenarios, we might say: when I ask you data questions, please answer using R code. Not Python code, but R code. And not base R code, but tidyverse R code. Or whatever it is; you can ask it for exactly what you want.

Secondly, you might want to use an LLM to answer questions, but it does not have the knowledge it needs to answer them or carry out whatever tasks you need it to. It might be because the information you need is too recent. Arguably, the weather data is like that; it's definitely not going to be part of its training set. Or it could be information that's too specific: you're really getting into the nitty gritty of some very detailed facts about some public figure, or, say, we want to talk about Paris landmarks but we have a very specific ranking we want to base it on. Or information that's private, information that would never be part of the training set of an LLM. For example, we want to give it access to our marketing data or our customer data or, God forbid, your patient data. There's no possible way that's part of its training set.

So in both of these cases, we have a number of possible strategies for customizing or augmenting the knowledge of these LLMs. Beth says: I've had answers come back using older versions of R packages because there's more info on the web for older syntax. That's a great example: we might want to use the latest features of dplyr or whatever, but that's information that's too recent. We'll talk about that a little later; there are solutions to it.

So there are three main solutions we're going to talk about, plus a fourth I forgot to add to this list. There's prompt engineering, which is a fancy way of saying customize the system prompt. There's retrieval augmented generation, and there's fine tuning. And the fourth is agentic search. Let's focus on prompt engineering first; we're going to spend a bunch of time on this. How do we customize the system prompt to direct the behavior and output of these models?

So, just some examples of ways we commonly prompt these models. We say: respond with just the minimal information necessary. Or: think through this step by step. Or: carefully read and follow these instructions. It's weird that, at some level, we are talking about computer software, and yet things like this matter. Telling it to really think before you answer actually makes a difference. Or saying carefully read and follow these instructions: studies have shown this sometimes improves adherence to instructions for some models. Like I said, you can ask it for R code if you want. Or: don't hallucinate. That's not fully effective, but it doesn't do nothing, which is kind of surprising.

Raymond says: "play the role of a blank" helps. Yes, absolutely. By the way, really interesting: "respond with just the minimal information necessary," or "be a terse assistant," it turns out, and I didn't know this at the time we put this presentation together, actually makes the answers worse in a lot of cases. The act of speaking during inference time actually helps the model give more accurate answers. Which is pretty mind blowing. Maybe I'll add some slides about that for the next time I do this.

Prompt engineering example: structured extraction

So we're going to walk through a specific, non-trivial example: we want to take a plain text, natural language recipe and turn it into something structured. We want to extract an ingredient list and return it in a structured format. Let's say this is the user input that comes in, and our first attempt at the system prompt is just to say: the user input contains a recipe; extract a list of ingredients and return it in JSON format. And we get this. So number one, it did do it, right? This is JSON, it is the ingredients in that recipe. Great job, kudos; that was not a lot of effort to go from plain text to structured. But if you look closer, this is maybe not that helpful if we were going to do something programmatic with this data. If we were to format a recipe card, we'd often want the number or the units formatted differently than the item, something like that.

So let's take another attempt and be more specific about what we want it to generate. This time we say: extract a list of ingredients and return it in JSON format; it should be an array of objects, where each object has keys ingredient, quantity, and unit; and put each object on one line of output. And it did do that: ingredient, unsalted butter; quantity, 1; unit, cup. So that actually worked. But if you look closer, there are a couple of weird things happening. Number one, a quantity of 1/2: that is not valid JSON. If you tried to parse it, it would fail. You can say 0.5, but 1/2 is not valid JSON. And for the egg, it gave unit "large". What is a large? It's unclear why it decided to do that. And then for the unit, it's "cup", singular, here, and "cups", plural, here; unclear whether that's what we would want, okay?

In our third attempt, we say: extract a list of ingredients and return in JSON format, and then, let me just give you an example of what I want. So this "Example output" text and this code block are part of the system prompt. And it's really important to note: this is extremely effective for LLMs. It is extremely effective to give them an example of what you want. And often, if you give them instructions and an example that disagree, that conflict, it's more likely to adhere to the example. That's how powerful examples are. And that's a little bit surprising, but not that surprising, because think about humans: if we have documentation that disagrees with the example, nine times out of ten, people end up using the example.
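Putting the third attempt together in ellmer might look something like this. The prompt text and recipe are paraphrased from the slides, not copied exactly:

```r
library(ellmer)
library(jsonlite)

system_prompt <- '
The user input contains a recipe. Extract a list of ingredients and
return it in JSON format: an array of objects, each with keys
"ingredient", "quantity", and "unit", one object per line.
Use decimal numbers for quantities (0.5, not 1/2).

Example output:

[
  {"ingredient": "unsalted butter", "quantity": 1, "unit": "cup"},
  {"ingredient": "egg", "quantity": 2, "unit": "whole"}
]
'

chat <- chat_openai(model = "gpt-4.1", system_prompt = system_prompt)
resp <- chat$chat("Cream 1 cup unsalted butter with half a cup of sugar...")
ingredients <- jsonlite::fromJSON(resp)  # fails loudly if the JSON is malformed
```

In practice the model may still wrap the JSON in a code fence, so you may need to strip backticks before parsing; recent versions of ellmer also offer built-in structured data extraction that sidesteps this entirely, which is beyond what this example shows.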

So my best advice for this entire workshop is this. If you are trying to system prompt your way into a certain kind of behavior or a certain kind of outcome and it's not working, I would ask you three questions, in this order. Number one, are you using the best models? Are you using GPT-4.1 or Claude Sonnet 4, or O3 or O4 Mini? Number two, does your system prompt say what you're looking for? Are you telling the model what you want, or are you relying on it to guess what kind of answer you're looking for? And number three, if you've described what you're looking for and it's not giving you those kinds of responses, have you provided an example? A surprising amount of the time, just by doing those three things, you can go from failure to success with these models.


So that covers tailoring and customizing the behavior of these models: saying, act in this way, give me results in this style.

Adding context to the system prompt

So let's shift gears a little bit and talk about adding new context or new knowledge to the prompt. So using the system prompt not to tell it what to do, but to give it more information about the problem space that you're dealing with.

Some examples of ways we might do this: we might take documentation files and add them to the prompt. Like I said earlier, ellmer is too new for ChatGPT to know how to work with it, or even what it is. So we created our own ellmer assistant. That is literally just a totally stock GPT-4.1 or whatever, except for the system prompt: I pasted the README from ellmer and from shinychat into the system prompt and said, here are a couple of READMEs, answer questions about these packages. And it works shockingly well.
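That pattern is easy to reproduce. A sketch, with hypothetical file names:

```r
library(ellmer)

# Read each README into a single string (file names are illustrative).
files <- c("ellmer-README.md", "shinychat-README.md")
readmes <- vapply(
  files,
  function(f) paste(readLines(f), collapse = "\n"),
  character(1)
)

chat <- chat_openai(
  model = "gpt-4.1",
  system_prompt = paste(
    "Here are the READMEs for some R packages.",
    "Answer questions about these packages using them:",
    paste(readmes, collapse = "\n\n---\n\n"),
    sep = "\n\n"
  )
)
```

The separator between documents is arbitrary; anything that visually delimits one document from the next works, since the model just sees one long string.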

So adding documentation is one way to customize the prompt. And like I said, add positive examples. In all the models I've used so far, it's very helpful to add examples of what you want, and much less helpful to add examples of what you don't want.

Saying, yes, give me responses like this; no, don't give me responses like that: the negative examples often don't work well, and in fact are often counterproductive. So try to frame everything in terms of positive examples.

And if you're really tempted to put in a negative example, go ahead and try it, but if it doesn't do what you want, try adding a few more positive examples that don't exhibit whatever the negative example exhibits.

It is important to note that, like I said, every model has its limits on how much information can fit inside, so you're going to be limited to whatever docs can fit in that context window.

And here are some examples where you can see exactly the kinds of information that were stuffed into a prompt. This is the ellmer assistant: you can see here, I literally said, here is the README, and the rest of this all just goes into the system prompt.

There are some other examples here you can see. Matias said: one feature I found useful in creating chatbots with Python was to suggest certain links when keywords were mentioned. Interesting. Do you mean certain links for the model to go and get more information from?

Well, let's come back. Let's come back to that.

So that was all system prompt customization. Let's shift gears now. Oh, a question: is that generally true? Should the prompt avoid telling it what not to do, like "don't hallucinate"? That's a good question.

You know, I have found it to be not terrible at respecting "don't" commands that were not example based. Definitely take this with a grain of salt, but I feel like if I say, don't talk about this topic, it's pretty good about not talking about it. Now, "don't hallucinate" is tough because it's not clear that it can tell; it doesn't know when it's doing it, I don't know.

It's hard to talk about this without anthropomorphizing pretty heavily. But I don't think there's any prompt you can give it that will solve the hallucination problem. If you tell it, never use this R package, I bet it would do that. It's really with examples that the conventional wisdom says negative framing doesn't work very well, okay?


RAG: retrieval augmented generation

So that's all system prompt stuff. Let's totally shift gears and talk about RAG, which stands for retrieval augmented generation. A lot of you have probably heard this term. I will say there is some squishiness in its definition. I'm going to give you what I want this definition to mean. Not everybody agrees with me; some people do. But my definition, I think, is the most useful one.

I'll tell you what the other definition is as well when we finish this section. So RAG is a technique that's very useful when the documents you have don't fit into the context window. Like, you have a lot of information that would be great if you could just stick it in the system prompt, but it won't fit. Or it's too expensive to do that.

So what this does instead: say the user sends a query to an app, like, how do I do this or that, and you've pretty much designed the bot to answer questions that generally could not be answered without additional information.

So the application, meaning the chat bot, before it sends a request to the LLM, it will go and find some related information from the docs that it has in hand via some kind of search. And it will come up with some hits from its search, and it will combine those hits with the question asked by the user. You can think of it as being, like, oh, you asked me a question. I'm about to pass that question to an expert. But first I'm going to do a Google search and include the results of that Google search along with the question and send all of that to the expert so that the expert has this additional information if they want to consult it.

So what is this search, usually? It could be anything. It could be a simple keyword search, using whatever traditional search engine technology we've had up to this time. But for whatever reason, in the LLM world, what's been very fashionable is using a vector database. It doesn't really matter what this is; just suffice it to say, it is not doing a simple match of the exact word that you searched for being in the documentation. It looks for the general meaning of the words and will also match on concepts that are nearby.

So the result, though, is that the application will issue some kind of query to some kind of system, get some kind of chunks of text back, and then include that text with the question when it sends it to the LLM, okay? So the LLM only has access to the chunks that were passed in by the app. It does not have access to the rest of the documentation, okay?
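The flow just described can be sketched in a few lines of R with ellmer. This is a minimal sketch, not a full implementation: the retrieve_chunks() function here is a hypothetical stand-in for whatever search backend you use, and chat_openai() could be any provider ellmer supports.

```r
library(ellmer)

# Hypothetical retrieval step: in a real app this would query a vector
# database, a keyword index, or any other search backend you like.
retrieve_chunks <- function(question) {
  c("Chunk 1: ...relevant documentation text...",
    "Chunk 2: ...more documentation text...")
}

answer_with_rag <- function(question) {
  chunks <- retrieve_chunks(question)
  chat <- chat_openai(
    system_prompt = "Answer using only the provided context."
  )
  # Force-feed the search hits to the model alongside the user's question
  chat$chat(paste0(
    "Context:\n", paste(chunks, collapse = "\n\n"),
    "\n\nQuestion: ", question
  ))
}

answer_with_rag("How do I stream responses?")
```

The key point is that the application, not the model, decides what gets retrieved and when; the model only ever sees the chunks that were pasted into the prompt.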

Can we have a live showcase of this? Sure. Let's have a live showcase.

So what is the name? We used to have... there's some service that could do this for us right now, some SaaS for RAG. Gosh, there's probably a billion of these now. You know what? Let's just cheat. I'm just gonna say, like... I'm giving a presentation right now about how RAG works. Someone has asked for a live example. Can you simulate that?

So someone asking a question... Let's see if Claude can just conjure one up for us. And if this turns out to be crazy, then I'll discard it and talk about it some other way.

What is happening right now? It's writing an app? Oh, my goodness gracious. What is happening? Well, this is very interesting. What in the world is it doing? It's simulating a database in JavaScript. I guess I used the word live, and it took it very literally. It's still editing. What in the world?

Okay. Stop! Why are you... Oh, oh. It must be that when it's loading, there are errors, and it's just, like, going nuts. Yeah. RIP tokens. Yeah. So luckily, because I'm using this through claude.ai, I'm not paying per token; I'm paying just a monthly fee. Wow. This is so ambitious. Why is it doing this?

So I told you Claude 4 is kind of try hard. And you can see, like, it is trying very hard. Like, it is relentlessly trying to fix whatever issues it's finding here.

Okay. Let's come back to that. Let's let it work in the background. So there is one final technique that is different than RAG, called agentic search. Do I have another slide on this? No. Okay. So agentic search.

Okay, yes. So it's similar to RAG in that it is a way of providing extra information to the LLM, but it's different in that we don't have the application that surrounds the LLM doing the search and sending the LLM the information that it found. Instead, with agentic search, you're sending the question to the LLM and providing a tool that the LLM can use to perform a search if it wants to. So the difference here is: with RAG, the code that calls the LLM is preemptively doing a search on the LLM's behalf using whatever search criteria it decided, whereas with agentic search, you're putting control into the LLM's hands to say, like, I don't know if you want to make a search, or what search terms you'd use if you did, but here's a tool that you can use to do that if and when you decide to.

And that has a number of advantages. Number one, it avoids the needless waste of doing preemptive searches that might not even be needed. Secondly, it's possible that the appropriate search term has very little resemblance to the actual user prompt. Like, it might be that the term that would produce good results is something that's obvious to a human and obvious to an LLM, but not obvious to whatever deterministic code is surrounding the LLM and performing this search.

Okay. And number three, it gives the LLM the ability to decide, like, oh, these search results did not answer my question. Let me try a different query or let me fetch more results for this particular query. So those are the benefits. The downside is because the search has to be initiated by the LLM and then the LLM has to retrieve the response, it is slower and more expensive than just letting the search be done preemptively and included in one single interaction.
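In ellmer, agentic search boils down to registering a search function as a tool. A sketch, with a hypothetical search_docs() function; note that the exact tool() argument style has shifted a bit across ellmer versions, so treat this shape as approximate:

```r
library(ellmer)

# Hypothetical search function; returns text hits for whatever query the
# model chooses to run.
search_docs <- function(query) {
  paste("Excerpts from the documentation matching:", query)
}

chat <- chat_openai(
  system_prompt = "Use the search tool when you need documentation."
)

# Hand the model a tool; *it* decides whether to search, and with what terms.
chat$register_tool(tool(
  search_docs,
  "Searches the package documentation and returns matching excerpts.",
  query = type_string("The search query to run.")
))

chat$chat("How do I register a tool in ellmer?")
```

Compared with the RAG flow, nothing is retrieved up front; the model may call search_docs() zero, one, or several times per turn.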

Okay. So agentic search and RAG: RAG is, like, you're force-feeding search-result hits into the model, and agentic search is you're giving the model the ability to pull information from the outside world into itself. All right. Any questions about RAG versus agentic search?

I complained earlier that RAG doesn't mean the same thing to everybody, and basically what it comes down to is: what I'm calling agentic search, some people refer to as RAG as well. So both when you are force-feeding search results to the LLM and when the LLM has the ability to go perform a search, some people refer to both of those as RAG, which I find really frustrating, because the distinction is interesting and important.

Glenda asked, RAG may have some promise in medical when the corpus or repository changes over time and needs to be isolated. For example, an accumulating patient record. Yeah, totally.

Fine-tuning

Let's talk about the opposite of that. Let's talk about fine-tuning. So fine-tuning it, oh, actually, let's go back and see if our RAG thing is done. Hey!

Okay. What are the benefits of our cloud storage solution? Oh, yawn. Okay. But let's let it run that. So it's saying, okay, it found some hits. These hits are scored at different amounts. Combining the user question with retrieved context. Why is it going so slowly? Okay. So you can see here that what it's going to pass to the model includes all of these documents. Oh, this is infuriating that this is simulated latency and yet it went so slow. Okay. We'll come back to that again.

So fine-tuning is where we take an existing model, provide new information, and retrain the model to come up with a new version with updated weights. Okay. Not all models can be fine-tuned. The open-weights ones like Llama and DeepSeek tend to be fine-tunable. And some of the proprietary ones provide an API where you supply the fine-tuning data and then, without giving you access to the model directly, they will create a fine-tuned version of it for you and make that available via API.

Okay. That sounds great. That sounds very exciting. Like, oh, why don't we just make the model better? A couple of problems with that. Number one, the data must be provided in a very specific format. It has to be in the form of questions and answers. So I could not just take the ellmer docs and say, please fine-tune GPT-4.1 using the ellmer docs. No, you have to construct a set of questions and answers. And from what I've been told, you do need thousands of examples, thousands of these sort of query-and-response pairs, to make it reliable. More recently, I heard someone say, no, it can be as few as a couple hundred. But regardless, it's more effort.
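For illustration, here is roughly what one training example looks like in OpenAI's chat fine-tuning format: JSON Lines, one complete question-and-answer conversation per line. The content here is made up for the sake of the example.

```json
{"messages": [{"role": "system", "content": "You are an ellmer expert."}, {"role": "user", "content": "How do I start a chat with Claude from R?"}, {"role": "assistant", "content": "Use ellmer: chat <- chat_anthropic(); chat$chat(\"Hello\")"}]}
```

You would need hundreds or thousands of lines like this, each a separate conversation, before the retraining is worth anything.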

It also takes a long time to do fine-tuning. And with fine-tuning, every time there's new information that you need to update the model with, you need to go through this fine-tuning exercise again. So suffice it to say, it's really expensive and not a very dynamic process, and also, at the end of the day, less reliable in changing the behavior than the other two.

So my best advice to you is to start with prompting, then shift to RAG and/or agentic search, and only, if absolutely necessary, turn to fine-tuning. I will modify this to say I'm mostly talking about small-scale usage or prototyping. If you have an LLM model that you are deploying at scale, probably not in the context of R/Medicine, but if you're a startup and your whole business is serving up this model to millions of users and you really care about minimizing the cost per query, then fine-tuning might make a ton of sense. But for the kinds of things we intend ellmer for, for you to be able to play with these models and create really useful tools for yourself and for your colleagues, we highly suggest: start with prompting, then RAG or agentic search, and then fine-tuning.

Okay. So, a question: RAG is when we click on "search the web" in ChatGPT, and agentic search is multi-step and multi-tool, am I right? No. So ChatGPT, when you say search the web, my understanding is it's still ChatGPT calling a search-the-web tool. So it can decide what terms it's going to use to search the web. It can get the results back and say, let me try some different search criteria. So to me, the LLM is in control of that web search, and I would classify that as agentic search.

I'm not even sure if ChatGPT offers RAG functionality, but there is something called NotebookLM from Google that is much more RAG-like. If I have time, I can show that later.

Can you find out how many tokens the RAG demo was? I am not sure that we can. Oh, okay. So this answered, okay. This is so boring, I'm not going to read this enough to know whether that was a good answer. I don't think we get a token count through this, but I don't know, I could just ask and see what happens.

This totally could be hallucinated. I have no idea what to expect here. Oh, okay. Well, good bot. Good bot for acknowledging its limitations.

Okay. Any questions about all of this, about prompting, about RAG, about agentic search, fine-tuning, any of this stuff? Oh, okay.

Forrest asked: does the presence of multiple previously used content messages explain why, when coding, ChatGPT sometimes proposes the previously rejected solution and streams into a vicious cycle of repeated ineffective suggestions? How to avoid entering this vicious circle? I actually don't know. If what you're asking is, when coding agents or ChatGPT kind of go into a spinning-out mode, could it be because the context is just filling up? I don't know. I mean, we see this behavior even when there's not a ton of stuff in the context. So it might just be that it's doing its best to solve a problem that it's just not going to be able to solve.

I will say I do think it is highly dependent on the model. And I can't prove this, but I feel like it also almost depends on the agent surrounding the model. So if you're using Claude Code versus Cursor versus Copilot, I feel like that has an effect, quite strongly, on the behavior. So I really like Claude 4 with Claude Code as the coding agent. We are developing Positron Assistant on top of Claude 3.5, 3.7, and 4. And I think it tends to be pretty good. It's not perfect, but it's quite, quite good at coding.

Yeah. Best way, perhaps, is to start a new chat. Yeah, sure. Absolutely. No question: if you start to see it making mistakes and getting stuck in a rut of those mistakes, often it's best to either start a new chat, or go back and edit the question, or have it regenerate the first answer that had the problematic behavior.

I've definitely noticed sometimes, like, it'll just fall into some, like, calling an R API incorrectly, even though nine times out of ten, it does it correctly, and then once it does it incorrectly, sometimes it just kind of gets stuck.

Ken asked, how do we use Ragnar? Oh, thank you, thank you, thank you. So alongside ellmer, there is another package called ragnar. I do not have it in this presentation, partly because it's much newer than when I created this presentation, but also I just have not had time yet to play much with it. But this was definitely created by Posit. It's very, very new.

But it was intended to be used alongside ellmer for the kinds of things ellmer is used for. So it helps with a number of things. It helps with kind of processing different kinds of documents and breaking them into chunks, and then helping to augment, you know, your queries using different kinds of searching.

So I will say, people who are not practitioners of this AI stuff but are, like, thought leaders in the AI space, and I'm trying to say this without sounding too cynical, I think they'll often talk about RAG as if you just check the box, like, you have RAG, and that solves a bunch of problems. And it turns out that as you are putting together your RAG solution, there are all these decisions, these micro-decisions, that really affect the quality of your results. For example, when you're breaking your text into chunks: the size of those chunks, where you draw the lines between chunks, whether you have overlap between chunks and how much, they really affect whether the model's going to be able to make sense of the results that you see.

So highly encourage you to not think of this as a one-size-fits-all solution, but really something that, if you're going to get into this, to really pay attention. And especially, I've been told by the author of this, that the most helpful thing you can do is when you see results that you don't expect, if it gave a particularly bad answer that you would have thought would have been answerable using the RAG, to look, like, to directly inspect what were the chunks that came back? What were the search results that the application sent to the LLM? And if it looks like, oh, yeah, all the information is right here, the model just kind of blew it, well, then you know. Like, this might be some kind of inherent limitation of whatever model you're using. But if you look at the search results and it's, like, these are not helpful hits, and I know that there are helpful hits in there, well, now you know that the problem is in the retrieval and you can focus in on that.

So, yes, ragnar is out there. For agentic search, well, actually, I believe you can use ragnar to implement agentic search as well. I'm really kind of out over my skis now, because I have not used it directly, but my understanding is that you can use ragnar and do retrieval as a tool call. I'm not sure if that's something that's built in, or, oh, yeah, here, okay. So you could do one form of agentic search by taking your ellmer chat object and passing it into this ragnar register-tool function. But agentic search is a very general concept; any tool that you register that takes in a search-criteria argument and comes up with results, you could consider agentic search.
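Putting those pieces together, the ragnar-plus-ellmer combination looks roughly like this. This is a sketch based on my reading of an early ragnar release; it assumes a chunk store that was already built and populated elsewhere, and the function names may have changed, so check the package documentation.

```r
library(ellmer)
library(ragnar)

# Connect to a previously built and populated chunk store
store <- ragnar_store_connect("docs.duckdb")

chat <- chat_anthropic()
# Attach retrieval as a tool call, turning plain RAG into agentic search:
# the model decides when to query the store and with what terms
ragnar_register_tool_retrieve(chat, store)

chat$chat("How do I stream responses with ellmer?")
```

The interesting work, per the discussion above, is in how the store was chunked and embedded in the first place, not in this wiring.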

Someone commented: I've found that using something like this at the end of long chats, starting a new one, keeps Claude's head on straight. Yes. Absolutely. It is very helpful to start new chats based on summaries of old chats, with the summary written by Claude. Very, very helpful. In fact, with Claude Code, this is such a fundamental thing that you need to do so often that there's a command for it. You just say /compact, and it will say, like, all right, great, here is the new history, I've summed up everything, and then you can just type your next question. Really, really helpful. Or, you know, maybe you're really happy with the context you've built up over this long period. So not only am I going to use this in a new session, I'm going to use it in many new sessions in the future. Like, this is just the way I'm going to start talking to Claude about this particular topic.

All right. Other questions before we move on from this sort of enhancing LLMs with more information?

Getting structured output

Cool. We have two more topics to hit very quickly, and then if we have time, I'd like to open it up for Q&A. So the first of these last two is getting structured output. We already had an example of this where we took the recipe that was in plain text and turned it into JSON by asking very nicely. This turns out to be a pretty common want: to take unstructured things, not only text but even images, which we'll talk about in a moment, and get something out that can easily be consumed by code, where we can take the structured data and put it into a spreadsheet or a database, or do calculations on it.

So, LLMs are really good at producing unstructured output, but you can get them to do structured output as well. There are a number of different techniques; I'm not really going to go into detail about any of these, but I will show you one example, where I ask: give me a list of three fruits with their colors, and I want the results to come back using a schema that I've described using these ellmer type functions, okay? So if I go ahead and run this, oh, extract_data was deprecated; change it to chat_structured. I'm sorry.

Okay. So, you can see that the result comes back the way I asked for it to. Okay. So, just so you know, that is a thing that's available. It's especially useful when you are using an LLM not as an assistant but to do NLP kinds of tasks. I don't have a slide on this, but I will note that ellmer also has the ability to do this with many values in parallel. Like, if you want to do a sentiment analysis, give me a sentiment score for each row of this data frame or something like that, there are functions in ellmer to do that. I don't remember what they're called. Something about parallel structured or something like that.
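The fruits example described above looks roughly like this in current ellmer. Treat the exact type_array() argument names as approximate, since they have shifted between versions; the prompt and schema are from the talk.

```r
library(ellmer)

chat <- chat_openai()

# Describe the shape you want back with ellmer's type_*() helpers
fruit <- type_object(
  name  = type_string("Name of the fruit"),
  color = type_string("Typical color of the fruit")
)

# chat_structured() replaces the deprecated extract_data()
chat$chat_structured(
  "Give me a list of three fruits with their colors.",
  type = type_array(items = fruit)
)
```

The result comes back as R data matching the schema (a data frame or list of records, depending on version) rather than free text, so it can go straight into downstream code.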

Vision capabilities

Okay. In the interest of time, I'm just going to move on to the last topic, which is vision. Vision is the ability that some of these models have to look at images. I've had success with them looking at both photographs and plots. I will say they're, like, really surprisingly good at some things with images and then really surprisingly bad at others. We refer to this as the jagged edge or jagged frontier of LLMs: their capabilities keep growing, but the frontier is very jagged. So it's very difficult to know, for any given task, until you've tried it, whether it is inside or outside of their capabilities.

So, let me give you an example of one. I have this example where I have this image. This is a photo of my son when he was very young. He's a 20-year-old now. So, he's looking at the s'more, right? His face is blurry. The s'more is in focus. So, the prompt here is, what photographic choices were made here, and why do you think the photographer chose them? I mean, that's a pretty, I mean, if you're thinking about a machine answering that question five years ago, it would be like, what are you talking about? Like, that is so subtle and unlike what we're used to, machines being able to answer for us. So, the second question I ask, a follow-up, come up with an artsy, pretentious, minimalistic, abstract title for this photo.

So, let's ask these one by one. I ask this question, and then I call a function called content_image_file and pass in a path to the image. If I wanted to, I could pass in two images by just having another one of these, and then I could, like, put more text. But here, it's just one question, one image. I'm going to run that.
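The call just described looks like this; the image path here is illustrative, and chat_anthropic() could be any vision-capable provider:

```r
library(ellmer)

chat <- chat_anthropic()
# Mix text and image content in one turn; add more content_image_file()
# arguments to send multiple images
chat$chat(
  "What photographic choices were made here, and why do you think the photographer chose them?",
  content_image_file("smore.jpg")
)
```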

What photographic choices were made? And it's going to say depth of field. The photographer used a shallow depth of field with a small and sharp focus. It creates a clear subject hierarchy, draws immediate attention, blah, blah, blah. I mean, all this is essentially dead on, I mean, or certainly very defensible. And again, it's still sort of surreal to see a machine, you know, talking like this. And then come up with an artsy, pretentious, abstract, minimalistic title. So, these are pretty good.

Ephemeral layers, a study in childhood's transient sweetness. Untitled, Graham Marshmallow. This might be the best one I've ever seen.

Yeah. Anyway. Oh, so, it has its own opinion about which one it likes. So, I just, man, I love that.

Okay. So, it's able to do that. But shockingly, one of our hackathon participants wanted to feed it pictures of a knitting needle with a certain number of stitches on them. Apparently, a thing that knitters have to do a lot is count stitches. And if you lose count, then, you know, it would be nice to hold it up to a webcam and have a bot immediately say 12. And it couldn't do it. It could tell what it was looking at, but it couldn't tell you if it was 5 or 7 or 10 or 12. It would just consistently guess wrong, even when you tell it, you know, look carefully or whatever. It was just not able to figure it out.

So, yes, be sure you test. That being said, giving it plots, again, I found it able to make surprisingly human-like judgments, reading even less common types of plots like ridge plots or, you know, reading trend lines with confidence intervals. I mean, like, it was just, like, totally able to make sense of those things, including it clearly was reading the x and y axis. But then other times, like, the plotting routine failed, and the plot looks like garbage, and it definitely does not make any sense, and it's coming to conclusions that are definitely not there. It's just hallucinating them. So, again, jagged frontier, right? Shockingly good at some things. Shockingly bad at others. Your mileage may vary. Definitely test.

Multi-agent workflows

Okay. Could a number of ellmer agents be assembled into a workflow, each with a specialized prompt dash role? Yes. So, it is possible to do this, and the way you do it is by having one overall ellmer agent being the kind of gatekeeper, the one that's interacting directly with the user, and then you can attach other ellmer agents just by virtue of being inside of tool calls. So, you can have tool calls that spin up an agent, load in whatever context you think is important, attach whatever tools you want to that subagent, run it, get an answer, return that back to the calling agent.

I will say it's surprising to me, given how much people talk about agents and subagents and agent graphs in the industry, I've been shocked how infrequently this has turned out to be something that I need to do. Often just giving one model all the information, all of the tools, surprisingly often you can get really good results that way. It's maybe a little bit cost inefficient or context management inefficient. So, there are definitely reasons to go the multi-agent route. But I would start with, again, did you use the best models? Did you put what you want in the system prompt? Do you give examples and use these other techniques before turning to the multi-agent approach? It just adds a lot of complexity.
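A minimal sketch of the subagent-as-tool pattern described above; the role prompts and tool name are made up for the example, and the tool() argument style may differ slightly across ellmer versions:

```r
library(ellmer)

# A subagent wrapped in a tool: the outer agent can delegate when it chooses
ask_r_expert <- function(question) {
  sub <- chat_openai(system_prompt = "You are a terse, expert R programmer.")
  sub$chat(question)
}

main <- chat_openai(
  system_prompt = "For R questions, consult your R expert tool."
)
main$register_tool(tool(
  ask_r_expert,
  "Consults a specialist R agent and returns its answer.",
  question = type_string("The question to pass to the R expert.")
))

main$chat("What's the idiomatic way to read a CSV in R?")
```

Each call to the tool spins up a fresh subagent with its own context, which is exactly the cost/context tradeoff mentioned above.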

If you feed in a plot created with ggplot, can they figure out the packages used to create the image? Sure. Let's find out. So, the ggplot2 cookbook. I mean, the default ggplot has a pretty big tell, right? Like the colors. I wonder, can I just paste this in?

Let's look at its thoughts. There you go. Okay. Pretty interesting.

Yeah, we have 15 minutes left. If you want to look at this file, feel free to; it is so simple, and I just showed you what it is. I definitely encourage you: drop your own photos in there, ask it questions like you would ask a human. It's really interesting. Oh, one thing I will note: by default, content_image_file will downsample your image. The models tend to prefer that. But if it is important, you can pass resize equals and then a different value. I forget what the resize values are; I think it's low and high. But if you have something where precision is important and you want to pass a higher-resolution image, be sure to do that.

Someone at a hackathon did a project using an LLM to automatically look at a whole lot of photographs and say which ones are the most in focus, and it couldn't tell. The reason was that they were being downsampled to 512 by 512. As soon as he changed it to resize equals high, it worked perfectly. So just be aware that that is a thing.
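In code, that fix is a single argument. The path and the "high" value reflect my recollection of the API; check ?content_image_file for the exact accepted values.

```r
library(ellmer)

chat <- chat_anthropic()
# Send a higher-resolution version when fine detail (like focus) matters;
# the default downsamples the image before sending it to the model
chat$chat(
  "Which of these photos is most in focus?",
  content_image_file("photo.jpg", resize = "high")
)
```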

Okay, a question: LLMs are surprisingly bad at drawing SVG figures, even though SVG is completely programmable. Is there a way to indicate that SVG is programmable graphics, so that the LLM can program it instead of trying to draw something resembling a raster image? So, a couple of things. On the most famous example of this: Simon Willison is, I think, of all the people out there talking about LLMs, the one whose insights are the most consistently fantastic and useful and practical. He very recently published something about a test he runs of making SVGs of pelicans on bicycles, and every new model that comes out, that's what he does: he asks it to create pelicans on bicycles. And over the months and years, the models have gotten better. It turns out now these model makers are quite aware of this benchmark and possibly are tuning to it, so, you know, maybe its usefulness as a measure is over.

So, he has a lot of thoughts on this. And I think it's a pretty unreasonable thing to ask these LLMs to be good at, given how difficult it is for humans to create these SVGs by hand. But I will say what I bet would work pretty well is a tool that says: I take SVG and I return the rendered SVG, so that the model could not only generate the SVG but then see what was generated, have the opportunity to resubmit, and iteratively work its way to a better and better version. I bet with the best models that would work shockingly well. I wouldn't bet a ton of money on it, but I wouldn't be surprised. As it is, these things are doing it zero-shot, with no feedback, and what they're already able to do is really surprising. What they are able to do extremely well, in my experience, is create diagrams using Mermaid. They're quite good at creating these kinds of Mermaid diagrams, and those are pretty easy to use from R. So that might be something to consider if what you want is more in the vein of a diagram than a true vector image.

Q&A: who to follow and LLM recommendations

Okay, let me just see if there's anything else. Okay, so that's all I have in terms of materials. These are just links that I've shown you before and I'm going to leave this up while we talk. I've got 11 minutes for questions.

Can you give other recommendations on people to follow, for us experienced R folks and also for novice students? Okay, so Simon Couch is the person within Posit who is, I think, doing the most work in public looking at models and developing a ton of different R packages that are just, like, whatever he feels is interesting. One of the things he does is, for each new major model release that comes out, he has developed his own benchmark for solving difficult R problems.

I will tell you, it's a surprise to a lot of people, but these LLMs are unreasonably good at writing R code. You would think they would be better at Python than R because there's so much more Python code out there, so much more training data in the world. I am here to tell you, for data analysis tasks, they are better at producing R than they are at Python, and I'm working really hard to close the gap, because some of the agents that we have are multilingual, and they are usable in R and not quite usable in Python; out of the box, it is just much harder to get them to write reliable code with pandas or polars than with dplyr, and again with ggplot2 versus, sorry, why am I talking about this? All right. So what Simon has done is he's created a set of very, very difficult R coding problems and then asks each new model to solve them. And not only these benchmarks; he's just got really interesting stuff that he's working on. I encourage you to follow him.

The Posit blog also, you know, a little bit to my chagrin, has more and more AI stuff appearing on it, alongside a lot of non-AI stuff. And if you go to posit.co, or I think posit.ai, it'll redirect you to a page that lists many, not all, but many of our resources and tools and packages for keeping up with what we're doing in AI.

All right, so someone was asking who to follow. I think the main people: Simon Willison is one, and I think if you follow Simon and nobody else, you're probably doing pretty good. One of the other really famous people in the space is Andrej Karpathy. He is the one who coined the term vibe coding. Just one of the most preeminent researchers in AI, and he does a lot of really good videos. I've heard that his videos on how LLMs work are incredible.

There's one other I'll mention: the YouTube channel 3Blue1Brown is just flat-out awesome. Everything he does is awesome, but his series on everything I didn't talk about, which is how the math behind transformers works, how these LLMs work at the linear algebra level, is an incredibly, incredibly compelling series of videos. I encourage you to check those out if you're interested; I'm definitely not going to be the one to present that to you.

The evolution of LLMs for coding

So, yeah, I think in about June of 2024, the best models went from kind of a curiosity that sometimes produced useful results to pretty reliably rewarding persistent effort at using them to generate code. That was with the advent of Claude 3.5 Sonnet; I think that was the first model that was actually usable for coding. And for a pretty specific reason: the best OpenAI models at the time could generate reasonable code for you sometimes, but often there would be a mistake. Like, it wrote 50 lines of code, there was a mistake on one line, you point out the mistake, ask it to correct it, even tell it how to correct it, and in the process it would also rewrite everything else at random, and more often than not introduce a new problem. And that is incredibly frustrating, to see the answer right there and instead, just by the nature of how these things worked, go one step forward, two steps back.

Claude 3.5 Sonnet was the first one that would just fix the mistake you asked it to fix and proceed. Since that time, the models have improved with Claude 3.7 and Claude 4, and even more so, the way these LLMs are called and the agent frameworks around them for code have gotten better: Cursor has gotten better, Claude Code has gotten better. Nowadays, I can't speak for really complex R packages, but for smaller R packages like querychat, I definitely heavily used Claude Code to work on both the Python and R versions. It wasn't perfect. There were definitely things I had to go in and either ask Claude Code to fix, or fix myself. In particular, I wrote the R version first and asked it to port it to Python, including all the documentation. Inexplicably, it just decided to port half the documentation and leave half of the settings and options that I had documented unported. And I didn't even realize until someone else pointed it out to me.

So, I wouldn't blindly trust these models today, but I do think, if they are available to you, it is well, well worth your time to try them, to learn how to get good at prompting them, and to get good at reviewing the results. You may not enjoy it; you may decide, no, I want to write code the old-fashioned way. But at least you'll know.

Copilot vs ellmer, and production use

I don't know the differences between Copilot and ellmer. Could you please explain? Sure. So, Copilot is a feature. Well, actually, Microsoft now uses Copilot to mean a million things. When I say Copilot, I always mean GitHub Copilot, which was the original Copilot. Now they also call the thing in Outlook that summarizes your email Copilot; that's Microsoft 365 Copilot. What I'm talking about is when you are in an IDE and you start writing some code, and ghost text appears. Like, oh, it went away, but it wanted to autocomplete with something non-trivial there. That is an example of Copilot kicking in.

So, with VS Code, not with Positron, but with VS Code, if you click a chat button, you get a chat sidebar. That also is now called Copilot, and you can tell it to make changes to your code base, and it will do that. Our version is called Positron Assistant. We're not really being super public about it right now, but it is coming out soon, this summer, and it does similar things to Copilot. ellmer, on the other hand, is just what we saw today: you are manipulating an LLM from R. Using ellmer, you are creating chat apps, or you are using LLMs in a workflow to do sentiment analysis. So ellmer is if you want to build things using LLMs, and Copilot is if you want to use LLMs to help you write code. I hope that distinction makes sense.
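To make the "build things using LLMs" side of that distinction concrete, here is a minimal ellmer sketch. It assumes the ellmer package is installed and an `OPENAI_API_KEY` environment variable is set; the system prompt and question are just placeholders:

```r
library(ellmer)

# Create a chat object backed by a proprietary LLM API; ellmer also
# provides chat_anthropic(), chat_ollama(), and others with the same shape.
chat <- chat_openai(
  system_prompt = "You are a terse assistant for R programmers."
)

# Each $chat() call sends the system prompt plus the accumulated
# conversation to the stateless API and streams back the reply.
chat$chat("In one sentence, what does dplyr::across() do?")
```

The same `chat` object could be dropped into a Shiny chat app or called in a loop over rows of a data frame for something like sentiment analysis, which is exactly the kind of workflow covered earlier in the workshop.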


Are there public examples where ellmer is being used in production? I don't know if there are. A lot of people tell me they're using ellmer, but not a single one that I can think of right now is public; they're all for internal use only. And there's a reason for that: behind ellmer is an LLM, usually, if you follow my advice, a proprietary LLM API, and those cost money. Having these things out in public exposes you to your users being able to run up costs on you. So I think that, for that reason, a lot of these things tend to be internally targeted and private, and therefore people don't have a lot of incentive to talk about them. That being said, we are dying for customer spotlights. So if you have deployed something using ellmer, even internally, we definitely want to hear about it. We want to brag about it on our posit.co website. So please reach out to me at joe@posit.co. Not posit.com; it's posit.co. I'd love to put you in touch with our dev rel and marketing folks. We'd love to tell your story.

Do you recommend paying for GitHub Copilot? I actually don't know. Because I work on so many open source packages, GitHub automatically put me in a tier where you get to use Copilot for free, so I don't know what you get for free or what the tiers are. I will say, in general, most of these tools don't cost that much, and if you are a professional writing code for a living, the productivity boost is well, well worth the $20 or $40 or, if I'm honest, even $200 a month. It's so worth it. That being said, what you need to be careful of is your IT department; especially if you're working in medicine, your IT department may not be cool with this. So definitely check with IT before you start using these tools on your proprietary code bases.

Also, you can get a free student license. Awesome. Great.

Closing thoughts

Oh, and just like that, that is the time I have available. I hope that this was useful to you. I hope, if nothing else, that LLMs have gone from something that feels large and amorphous and scary to something where, yes, they are mysterious, because they are these chaos boxes that are very nondeterministic and we can't see inside them, but at least you can see the shape of them and the way we harness them. If you hear about some fancy AI agent that some billion-dollar startup is putting out, you can instantly decompose it in your head: oh, that's probably this kind of prompt, these kinds of tool calls; yeah, I can see how they would build something like that.

So, that's really what I want you to take away from this: being able to decompose whatever you see out there in the AI world into stateless API calls, tool calls, and system prompt customization. That's pretty much all there is. Thank you so much for your time. Really appreciate it.


On behalf of R/Medicine, thank you. This was an unbelievably good presentation, top to bottom. Beautiful. Thank you.