Large Language Models in RStudio - posit::conf(2023)

Transcript

This transcript was generated automatically and may contain errors.

Hello, my name is James Wade, and I'm a research scientist at Dow, a chemistry and materials science company. I'm also the maintainer of two packages that let you use large language models inside of RStudio, called GPT Studio and GPT Tools.

Today I'll show you how you can use those, but first, I want to start with a bit of a concern. I'm worried that large language models are a threat to the R community, but I hope to show you today that we have both the power and the obligation to shape how we use these models to get our work done.

I want to take us back to a little over a year ago, at the last RStudio Conf. When I look back at these pictures, I'm reminded of a few things. For one, the masks remind me of the influence that the world around us has on our own community. The hex sticker wall reminds me of all of the fantastic R capabilities that make it such a powerful tool, built by global open source contributors and developed by all of us. And standing there with my friends and colleagues, I'm reminded of what makes this community so great. It's the people, and it's the culture.

And even the name itself, RStudio Conf, reminds me that change can happen. Change can be good, but sometimes change can be scary.

The rise of large language models

Like it or not, large language models are already everywhere. If you don't use them today, it's likely that you will be using them soon. They're being integrated directly inside the tools that we use to do our jobs. Whether we're data scientists or developers, or we sit in meetings all day, write emails, and respond to chats, Microsoft 365 Copilot is bringing language models directly into tools like Teams, Outlook, Word, and PowerPoint. And if you're like me, that covers a lot of what you're using day to day. Google's Duet AI is doing the same thing for Google Workspace. And even autocomplete features in our text messages and emails are bringing these models to our fingertips by default, whether we want them there or not.

These models are a big change for all of us, developers or not, but my own experience started with code. Specifically, it started with me using GitHub Copilot, a large language model-powered tool that could help me write code. This was before the craze of ChatGPT, but I was still quite excited by the ability of a development environment to fix my code, give me new ideas, and really just lower the barrier to me making progress on a data science project.

I was so enamored by Copilot that for a time I actually switched away from the RStudio IDE and instead was using Visual Studio Code. I did that because that's where I could access this tool, and it taught me that these models are impacting not only how we work, but where we're doing that work. It's also worth noting that this tool was not designed to help me write R, even though that's what I was using it for. It was designed for more popular programming languages like Python or JavaScript, but I still found it quite useful to help me write my R code.

I hadn't fully appreciated how large language models were already changing the way that I worked until the demise of Stack Overflow started trending on Twitter. Now, like many rumors of demise, these were largely overstated, and the article from the Stack Overflow team that I have a screenshot of on the slide here firmly rebuts those claims. For me, though, I can tell you that I'm using Stack Overflow less and ChatGPT more. I'm also encouraging colleagues who are new to R to use large language models so that they can overcome cumbersome challenges like syntax or other peculiarities of learning to code for the first time.

This is where my concern for the R community comes in. What does it mean for us if those new to R, or to code in general, stop visiting community forums like Stack Overflow and the Posit community forum, and instead go to ChatGPT or other language model-based coding tools? What does it mean if the models that are helping people learn to code are much better at helping them in languages other than our own?

Well, I'm here to tell you that the use and the application of large language models within the R community is wide open. Now is the time for us to define how we use these tools. And I hope to use today's talk to give you some ideas towards just that.

When you think of large language models, maybe you think about hundreds of billions of parameters, trillions of tokens, or whatever the architecture of a transformer really is. They can seem untouchable and complex. I don't know about you, but I don't have tens of millions of dollars to go train a new one, so it can feel like there's not much for us to do. But when we think about how we use large language models, they're actually quite simple. We write a prompt, send it over to one of those hundred-billion-parameter models, and get a response back. By making simple tweaks to the prompt, like including documentation from our favorite packages or maybe even some snippets of code that we've found useful, we can augment that prompt to get better and better responses from those models.
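
The prompt augmentation described above comes down to string assembly before anything is sent to a model. The following is a minimal sketch of that idea in Python, not the approach used by any particular package: the function name `build_augmented_prompt`, its parameters, and the example documentation line and snippet are all hypothetical and chosen only for illustration. No model is called; the sketch only shows how the prompt text might be put together.

```python
def build_augmented_prompt(question, docs=(), snippets=()):
    """Assemble one prompt string from a question plus optional context.

    Hypothetical helper for illustration only: `docs` holds short excerpts
    of package documentation, `snippets` holds code the user found useful.
    """
    parts = []
    if docs:
        parts.append("Relevant package documentation:")
        parts.extend(f"- {doc}" for doc in docs)
    if snippets:
        parts.append("Code snippets I have found useful:")
        parts.extend(snippets)
    # The user's actual question always comes last.
    parts.append(f"Question: {question}")
    return "\n".join(parts)


# Illustrative inputs (assumptions, not taken from the talk).
prompt = build_augmented_prompt(
    "How do I reshape a data frame from wide to long in R?",
    docs=["tidyr::pivot_longer() lengthens data by increasing rows and decreasing columns."],
    snippets=["pivot_longer(df, cols = -id)"],
)
print(prompt)
```

The resulting string would then be sent to whichever model is in use; the point of the sketch is only that the context goes into the prompt itself, so improving responses does not require retraining anything.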

Today's talk is going to focus on us as the users of these tools: how we can use them right inside the RStudio environment, and how we can use them to make our work even better. Let's dive in.

He says that these people have no idea how incredibly hard it is for me, Michel, a dyslexic non-native, to produce passable English.

Another package co-author, Samuel Calderon, a data analyst at the Ministry of the Interior in Peru, taught me about the importance of language barriers in learning to code. In his experience mentoring peers in Peru, he found that they would often give up if the source documentation was not in their native language, finding it difficult to learn another spoken language while also learning how to code. Because of his insights, the GPT Studio app now includes translations into Spanish and, because of another open source contribution, German as well, both to make it easier for others to use and to make them feel more welcome in using the app. I should also note that Samuel is the reason that we have streaming in the app. These are only two lessons, and I'm sure that I have many more to learn.

How you can contribute

Now that's where you come in. These packages, in addition to some others like chattr, or openai if you want to make direct queries to the OpenAI API, are still wide open. They each have single-digit numbers of contributors, maybe a handful of issues, and a couple of open pull requests. It's a wide open landscape for you to contribute code and help define how we use these models. If you don't want to contribute code, I strongly encourage you to become users of some or all of these packages. Help us learn how we should use these large language models to help us get our work done. Help us make R the best place to use these large language models. Thank you.

Featured software

rstudio