Resources

Github Copilot integration with RStudio, it's finally here! - posit::conf(2023)

Presented by Tom Mock

This talk closes issue #10148, "Github Copilot integration with RStudio", the most upvoted feature request in RStudio's history. Code-generating AI tools like GitHub Copilot promise an "AI pair programmer that offers autocomplete-style suggestions as you code". For the first time, we'll show a native integration of Copilot into RStudio, helping to build on that promise by providing AI-generated "ghost text" autocompletion with R and other languages. I'll also provide a comparison of Copilot's "ghost text" to a chat-style interface in RStudio via the {chattr} package from the Posit MLVerse team. To make the most of these new features, I'll walk through some examples of how sharing additional context, comments, code, and other "prompt engineering" can help you go from code-generating AI tools that feel like an annoying backseat driver to an experienced copilot. We'll close with a robust end-to-end example of how these new RStudio integrations and packages can help you be a more productive developer.

Presented at Posit Conference, between Sept 19-20 2023. Learn more at posit.co/conference.

Talk Track: Data science infrastructure for your org. Session Code: TALK-1117


Transcript

This transcript was generated automatically and may contain errors.

and RStudio. And I'm here with a one-slide presentation, which is GitHub Copilot is available in RStudio, and it's finally here! Okay, so we have a couple more slides, because we want to talk about it, but this talk officially closes issue 10148, can we have GitHub Copilot integration with RStudio, which is by far the most requested feature of all time in RStudio, with about 519 upvotes. But we don't want to stop there. Just because it's available doesn't mean it's the most useful tool, and you need to know how to use it to get the most out of it. So that's what I'll spend today talking about.

To really understand what Copilot is, I first want to just briefly mention the idea of generative AI, which is a tool that's designed to create novel or new content, right, so it's been trained to identify patterns and then generate new outcomes based upon them. For today's talk I was playing around with Midjourney, which is another tool with generative AI for creating images, and so I created my own little Copilot here, my dog Howard, who's always with me as I'm programming in my home. And you can see the prompt I used was actually about 15 words or so, and it came out with this remarkably novel output in terms of an image that's fairly complex. I could not probably describe this image in 15 words, but I was able to create this image in 15 words. So Copilot is a tool with generative AI to do similar things but with code.

Ghost text vs. autocomplete

So Copilot is kind of autocomplete style suggestions as ghost text available inside RStudio. Now, I want to differentiate a little bit here though. You already have autocomplete available inside RStudio, right? So you have autocomplete that will parse the code and the environment, and importantly it's supplying possible completions, right? It has to have some text and it's finishing the word for you. So this is a static set of completions and a little pop-up, and it's provided from your IDE, from your local disk or your local environment.

So Copilot can also provide autocomplete style suggestions as ghost text, and I'll show that in a second, but let's differentiate a little bit here. So Copilot, yes, it'll parse the code, it'll parse the environment, but it also has billions of examples of code in the training data of how other people have approached coding problems. So it can supply likely completions. They don't even have to be possible or present in your environment, just the likely completion. And this dynamic set of completions it's providing is available via ghost text. So it's not just that one little snippet you're completing, but it could be multiple lines or an entire function that you're defining. And importantly, it is a generative AI tool and it's provided via an API endpoint that you're querying and getting back into your IDE.

So let's do this visually. We like graphics, we like images, so here's autocomplete inside RStudio. I'm autocompleting the word mean, and I have a couple different variations of mean that I can complete here. Great. This works fantastic in RStudio. But if you notice, I have a comment earlier in my script saying take the mean of the mpg column in mtcars dataset. I've already said what I want to do, so why don't I just get the average of the mtcars dataset mpg column. So here Copilot is providing ghost text completion, and it's actually parsing that comment and doing a prediction of what it thinks I want to get out of this. So my intent has been codified into a comment, which is then reflected in an output. But we can go a lot farther than that. Maybe I want to do a group by summarize. So I can have a comment saying calculate the average fuel efficiency of cars grouped by cylinder and load dplyr. So now I have context that includes the comment and that I've loaded a library. Let's use that library to do something. I get Copilot-generated ghost text that's multiple lines, not just one word. And importantly, this ghost text doesn't exist in the IDE yet. It's just a UI element, and I can choose to accept it. If I don't like that output, I'll back it out and try again, add some more context, go forward.
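As a rough illustration of the two prompts described above, the comments below stand in for what was typed, and the code under each one is the kind of completion that might come back. This is a sketch, not the exact ghost text from the slides; the object names `mean_mpg` and `avg_mpg_by_cyl` are made up for the example, and actual Copilot output will vary.

```r
# Take the mean of the mpg column in the mtcars dataset
mean_mpg <- mean(mtcars$mpg)

# Calculate the average fuel efficiency of cars grouped by cylinder
library(dplyr)

avg_mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))

avg_mpg_by_cyl
```

The second completion only makes sense because dplyr is loaded in the script, which is the extra context the comment alone does not carry.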


So you can be really powerful with this. You can codify more of your intent into the document and be more productive. And we'll go into some more advanced examples. But I've only been using screenshots, and I at least want to show one little video of what this looks like in practice. It's going to be fast, but we'll watch it together. I'm speeding up my development process here. Three lines, a function I made, mode, it's very simple, but I can move forward with it and use it almost immediately. And it's really at the speed of auto-completion. So again, I provided a little bit of context in the script. I provided a scoped and specific prompt, which is return a function to calculate the mode of a vector. It defines a multi-line function for me, and then I can immediately use it in my script. I also want to call out the status of the request at the bottom of the IDE saying Copilot is communicating, or it's finished sending me back a response.
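A function along the lines of the one in the video might look like the sketch below. The body is one plausible implementation, not a copy of what Copilot generated, and the name `stat_mode` is chosen here only to avoid masking base R's `mode()`, which reports an object's storage mode.

```r
# Return a function to calculate the mode of a vector
stat_mode <- function(x) {
  unique_values <- unique(x)
  # Count how often each unique value occurs and keep the most frequent one
  counts <- tabulate(match(x, unique_values))
  unique_values[which.max(counts)]
}

stat_mode(c(1, 2, 2, 3, 3, 3))
```

When several values tie for most frequent, `which.max()` returns the first of them, so this sketch reports the tied value that appears earliest in the vector.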

Getting started with Copilot in RStudio

So we're going to build off this idea of, sure, it can generate ghost text, it can be single line, multi-line, and we'll go into a bit more of a complicated example. If you want to get started with Copilot in RStudio, importantly, you do have to have a subscription to Copilot. So you can go to GitHub, get a personal subscription, or you can get it through your enterprise for business. In the upcoming release of RStudio and Workbench, which we're expecting in September, you'll be able to enable this within the IDE, or you can try it in the dailies that are available right now. So you go to tools, global options, Copilot tab, you go to sign in, it takes you to a prompt, you sign into GitHub, and then you're signed in as your user. In this case, I'm J. Thomas Mock on GitHub. I'm signed in and can use Copilot.

Playing keyword: a generative loop demo

Now that I've got Copilot available in my IDE and I want to get started, I think the most important thing you can do with any brand new tool is play around with it. Don't just use something basic, don't necessarily try to use it in production, but play around with something and try and break it, try and see where it works for you and where it doesn't. I like a game called Keyword. This is something my family started playing. It's similar to like Wordle, but it's like a mini crossword puzzle that you can play on your phone and we can do it as a family. It looks a little bit like this. So a miniature crossword, but there's no hints. All you're trying to do is fill the word across horizontally, and in this case the word is felony, and the horizontal word is felony, and then each of those letters spell a different word. So how can we solve the keyword game? I can sit there, I can try and figure out what the letters are, or because I'm doing a conference talk, I can say, let's try and solve this with R, RStudio, and Copilot and see how far we can get.

That's exactly what I did. I played around with it, went through for about an hour, prompting, trying to see how far Copilot could take me with me doing as minimal coding as possible. Now to do this, I had to get the most out of the generative loop, and I want y'all to also get the most out of the generative loop. Copilot is just a generative AI tool. It doesn't have intelligence, it doesn't really know anything, it's just a prediction engine. So the more context and the more of our intent we can write down, the better predictions it's going to give us. That context is like what prompts have been provided, what source code is there, what packages are loaded, what's in your environment, kind of all this metadata that you can provide, which is quite different than the intent, which is what I actually want to do. So again, I'm trying to codify my intent into a prompt that I can get a response back from. And then there's the output that Copilot actually returns, which hopefully is close to my intent, which was provided by the context.

So in this case, I'm going to show you a few ways to do better context, to get you closer to what you actually want to do with your intent, and get overall a better output. And my little buzzword for what we'll use for today is called S2C, for simple, specific, and use comments. And I mean use comments throughout. Inline, big comments all over the place, and simplify and break up your problems into specific parts.

So again, we're trying to solve keyword, which is a pretty open-ended problem to solve with a programming language. So we want to break down the task, give it an overall structure, what are we going to do moving forward. I know how to play keyword, but now I have to teach Copilot, or teach R, to play keyword. Although I know other people have solved games like Wordle with R already. So I know it's possible. I wrote a bunch of comments. I'm not trying to get code back yet. I'm just writing down for both myself and for Copilot as a prompt, what am I trying to do. I'm trying to solve the keyword game. I talk about how to play the game and provide a little bit of context. Again, this is not really saying anything about R. This is just background. I'm not saying write this function, but we're going to create a function ultimately.

So I take that context and then I move down in my script and start adding a little bit more. Again, I want to break down this complex task of solve keyword into something more simple. The simplest method is to not cheat, but be clever. So I know that this is a web portal. There is a JSON API that's available. So I can get the JSON blob from behind the scenes. And I get the answer from that. So Copilot has solved it and we're done for the day. The answer is staple for August 9th. But that's cheating and I don't really want to cheat. I want to show you how cool Copilot is inside RStudio. So what else is in there? We actually have the words or the hints that are provided. So that's something unique. I can provide those hints to R, to RStudio, to Copilot, and we can move forward and try and solve each of the individual words to make up the larger word. And if I fill in those with staple, I get see, chant, bear, spur, really, scale. So we kind of know where we're trying to go with this simpler problem as we move forward.

So we're just going to get the hint words, breaking the problem down into its component parts and working with those. Now I've provided a little bit longer prompt. So six lines of comments and then I say I'm basically trying to get that URL for any specific date, right? I don't want to have to manually type in the date. I can use today's date, Sys.Date(). And here I'm getting a function back that takes the date, glues it together, returns me a URL that I can use to get these hints every single time. So then I can use those outputs and those words moving forward.
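The date-to-URL step could be sketched as below. The real endpoint is not given in the talk, so the address here is a placeholder on example.com and the function name `keyword_url` is hypothetical. The talk mentions gluing the pieces together; this sketch uses base R's `sprintf()` instead so it has no package dependencies.

```r
# Build the URL for a given day's puzzle data.
# NOTE: the base URL is a placeholder, not the game's real API.
keyword_url <- function(date = Sys.Date()) {
  date <- as.Date(date)
  sprintf("https://example.com/keyword/%s.json", format(date, "%Y-%m-%d"))
}

keyword_url("2023-08-09")
keyword_url()  # defaults to today's date
```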

So now that I've got the hint words, I can save those into an object and then I get them back out and be like, all right, now I can actually start working with these hints in R and I'm not cheating anymore, I'm programming. Now while that was a lot of comments, I can also use expressive names. So on top of comments that say what I'm actually trying to do, I can name things well. So the variables, the functions, the objects I'm working with, these names help provide additional context about what I'm trying to do. So here I only have two comments and I start typing regex and it returns back a regex from the input word. In this case, replacing any underscore with an [a-z] match, so replacing it with any lowercase letter. Again, I've provided only minimal comments and I'm just starting to use more expressive names to go along with those comments and getting back a nice output of what I actually want. And then I can apply that regex from word. So I can say, oh, well, given that, let's do matched words. No comments, but just by using matched words as the expressive name, now it's saying, oh, well, we'll subset, limit the words, get the regex, apply those together, and these are the words from my little database of words I have locally that are four letters long and could possibly match the missing letters. But no comments, just expressive names and the additional context from within the document.
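The two helpers described above could look roughly like this. It is a sketch under assumptions: the names `regex_from_word` and `matched_words` follow the spoken description, the anchoring with `^` and `$` is an addition here, and `word_list` is a tiny stand-in for the local word database.

```r
# Turn a hint such as "be_r" into a regex where each blank matches a-z
regex_from_word <- function(word) {
  paste0("^", gsub("_", "[a-z]", word, fixed = TRUE), "$")
}

# Keep dictionary words with the same length as the hint that match it
matched_words <- function(word, words) {
  candidates <- words[nchar(words) == nchar(word)]
  candidates[grepl(regex_from_word(word), candidates)]
}

# Stand-in for a local word list
word_list <- c("bear", "beer", "boar", "chant", "spur", "bead")

regex_from_word("be_r")           # "^be[a-z]r$"
matched_words("be_r", word_list)  # "bear" "beer"
```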

And again, we can get a little bit longer, and by saying I want the top words and doing a little bit more comments, now we have about 12 or 15 lines, I think, of code that it's generating, a longer function that's using that matched words we used previously, applying it, splitting it out into characters, scoring the letters to say, oh, these letters are higher value, setting those to the name, sorting it, and then putting out the top. So I'm not really worried about what the code is. It's functional. It's working. But I'm just trying to see, okay, how far can we get? I've got a lot of momentum going forward. I'm really excited. Let's keep going. And then without any comments, I can say, well, give me just the letters from the blank, and it gives me a five-line function to get that out. So, again, as you build up your script and as you add more context into Copilot, it will give you a better content out, so you get a better output.
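One way the scoring step could work is sketched below. The talk does not say how letters were valued, so scoring by how often each letter appears across the candidates is an assumption, as are the names `top_words` and `letters_for_blank`. The second helper also assumes the hint contains a single blank.

```r
# Rank candidate words by how common their letters are among all candidates
top_words <- function(candidates, n = 3) {
  letters_in_words <- strsplit(candidates, "")
  letter_freq <- table(unlist(letters_in_words))
  scores <- vapply(
    letters_in_words,
    # Count each distinct letter once so repeated letters are not rewarded
    function(chars) as.numeric(sum(letter_freq[unique(chars)])),
    numeric(1)
  )
  names(scores) <- candidates
  head(sort(scores, decreasing = TRUE), n)
}

# Pull out the letters that the candidates place in the blank position
letters_for_blank <- function(word, candidates) {
  blank_pos <- which(strsplit(word, "")[[1]] == "_")[1]
  unique(substr(candidates, blank_pos, blank_pos))
}

top_words(c("bear", "beer", "boar"), n = 1)
letters_for_blank("be_r", c("bear", "beer"))
```

Collecting the blank-position letters for each hint word gives the pool of letters from which the horizontal keyword has to be assembled.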

Ultimately, rather than showing you every step of how to solve keyword, we did eventually get to the keyword, so I can say that for September 9th, a month after I started trying to do this, I can get the keyword as one of the following words, recipe or repipe. And as much as I love pipes, as a Tidyverse user, I don't really know if repipe is a word, so I'm going to go with recipe, and recipes is a wonderful Tidymodels package, so we'll accept that. So there you go, Max Kuhn.

Tips for getting unstuck

Ultimately, I want to show you a couple things that you can look at after the fact. There's a gist with the full transcript of me prompting, trying to say, do this, do that, give me this, get back, and you can look at it. It's a couple hundred lines of me exploring. And then the keyword function, if you really are interested and want to impress your family, and solve keyword in about two seconds every single day.

Now, while this was fun, and we went through and solved a complex problem, you're going to get stuck, and I got stuck and had to interject myself in the middle at certain points. So, again, what you're trying to do to get yourself unstuck is add more context. And again, follow S2C: simple, specific, and use comments. Break up your problem into simpler problems, each solving a specific task, that are well-commented. You can also try prompting again or in a different way. So, rather than just trying to get a function back, you could have an inline comment in an existing function that you're trying to expand, adding new arguments or expanding it in a different way. And again, more comments, more top-level comments, more inline comments, more comments throughout. Even in the middle of a dplyr pipeline, you can add a comment to have it do a mutate step, to specifically add something in there. And then, ultimately, you're trying to build off your own momentum. As my script got longer, and as I got closer to solving my problem, Copilot was actually doing a much better job with the context it was getting from my script. And ultimately, write some of your own code. This is not going to replace who you are as a programmer. It's going to expand your skills or make you a bit faster. So, solving a problem in an hour, as opposed to me just kind of searching around, trying to find examples, and then spending a whole day or two on it.
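The mid-pipeline comment idea might look like the sketch below: the inline comment is the prompt, and the `mutate()` line under it is the sort of step Copilot could fill in. The unit conversion and the names `kpl` and `efficiency_by_cyl` are illustrative choices, not taken from the talk.

```r
library(dplyr)

efficiency_by_cyl <- mtcars %>%
  # add a column converting miles per gallon to kilometers per liter
  mutate(kpl = mpg * 0.425144) %>%
  group_by(cyl) %>%
  summarize(avg_kpl = mean(kpl))

efficiency_by_cyl
```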

The chattr package: chat-style AI in RStudio

So, we talked a lot about Copilot, but I've got Edgar Ruiz here, so I do want to do a call-out to say that there's more than one way to generate text. There's ghost text, which is really cool. It's Copilot. It's in RStudio. But maybe I want to chat with somebody. Maybe I want to do ChatGPT style. So, Edgar has a package called chattr, through the MLverse group here at Posit. And you can actually run an OpenAI chat interface or other local models like Llama on your laptop and communicate with them through RStudio in the viewer pane. So, the chattr app is what this basically is, and it can allow you to connect to a lot of different models.

Now, inside of RStudio proper, number one, great logo. chattr is awesome. You can call chattr::chattr_app(), and you can pop it into RStudio, or you can run it as a background job to leave your console free so that you can interact with it. But in short, it'll just be in your little viewer pane, and you can interact with it and ask more open-ended questions. But it's not even that basic. It really is essentially ChatGPT running inside of RStudio in this example. But what Edgar has done here, and what's really exciting, is the prompt is pre-populated with useful things. So, if you look at this, and I don't want you to read it all because it's too much, but there's like 20 lines of text in here of enriched information. So, when you actually go to your prompt, it already has amazing amounts of context to help your responses be better for the task at hand, which is probably R programming. Although you can always customize this. We've provided nice defaults, but you can customize it, modify it, and make it behave a bit more like you want it to.

So, ultimately, this is more of what you're doing with a chat-style interface: you're asking a question, but your question also has some enrichments that chattr does, like saying, oh, there's files in the environment, there's data frame names in the environment, there's additional prompts that we're providing, and it has the history of the discussion so far. And this is submitted to the model, and you get a response back in the IDE, and happy, healthy, you can ask it questions, as well as get your Copilot results in the IDE.

Wrap-up and resources

So, ultimately, this was a wild run through keyword and solving problems with Copilot in RStudio, just showing you a few different ways you can use these tools inside Workbench and RStudio. Again, for the best outputs, at least with Copilot, you want to follow simple, specific, and use comments to be productive. Copilot is available as an optional integration. It's available as a preview feature in our upcoming 2023.09 release of RStudio and Workbench. And if you want to provide feedback or report bugs about this feature, please open a GitHub issue on our repo, and we'll make improvements through that. For chattr, you can install it from GitHub as well. So, remotes::install_github("mlverse/chattr"). And you can use it, again, with OpenAI endpoints, some other endpoints that are available via API interfaces, or open source models running on your local laptop, like Llama. So, ultimately, whether you are a cat person, a dog person, or you like Totoro, there's a generative AI tool out there that you can use in RStudio, be productive, and you can have your own little copilot that'll work with you. So, thank you for your time, and I have some time for questions.

Q&A

Yes, thank you, Tom. Thank you for that awesome presentation. Thank you. I'm very excited about this, and we definitely have a lot of questions, and in fact, I've never seen this room so full, so I'm excited and a little bit nervous at this point. All right. So, the question is, and this is connected to the Copilot thing, has there been any special tuning or training done to optimize Copilot for R?

Gotcha. Yeah. So, Copilot is a third-party integration, fully, absolutely. So, it's been trained on, I want to say, several billion examples of code. So, it is available for R, Python, SQL, JavaScript, all sorts of different things. It's not necessarily optimized for R or for Python or for SQL. Whatever examples were fed into that model are what it's good at. I found it remarkably good for specific tasks in R. If you're trying to do data manipulation pipelines, pretty good. If you're trying to get a 20-line data analysis script, it might kind of hallucinate or send you in a direction that's a bit odd, if the problem you're asking it to solve is too broad. But for these programming tasks, writing functions, turning repeated things into functions, adding little steps into your mutate pipeline, remarkably good. And again, good at capturing my intent and putting it into an output.

Thank you. The next question says, what elements of your R environment are being sent to Copilot? All the text in the script, contents of objects in the environment, something more?

Yeah. So, there's an option. You can have it limited to only the file at hand, so the script you're working on, or you can actually have it index additional files in your project. That's really helpful when you're working on, say, a multi-file application, or you're sourcing functions from external files. So, you can have it get that in the prompt as well. So, it's controllable, but it's reading primarily from the script at hand and additional files, if possible.

Thank you. The next question says, how do I avoid flinging key material into the cloud via the Copilot API?

Yeah, absolutely. So, again, optional integration, there's like six steps you have to go through to even opt into this, which is buy a subscription, opt in, install something, and then sign in. Again, we're setting up some things where you can have it focus on only the file at hand, so if you have sensitive information in other files, you don't open them there. You can also turn it on and off really quickly. So, with the command palette in RStudio, Cmd+Shift+P or Ctrl+Shift+P, you can turn it on and off immediately if you're switching between an open kind of script and one where you don't want to send that out.

Thank you. So, the next question says, can I use one Copilot license with RStudio and other IDEs, for example, VS Code?

Absolutely, yeah. There's nothing that we're doing to limit that access, so you should be able to use your login. I used it for VS Code as well as for RStudio with my personal account.

Great. I think we have time for one more question. Does Copilot work within notebook chunks and know which language you are using?

Yes, you can use it inside Quarto source mode, and I actually am writing up in some of the documentation how you can do things in the body of a document where you can actually include HTML or JavaScript, and then when you go into a code chunk, it is indicating this is R or it's Python or it's SQL or whatever. There are times where it kind of thinks the wrong thing and it gives the wrong prediction, but most of the time it was accurate to say, oh, this is R, this is Python, or this is SQL, but you can always give it more context saying Python chunk, R chunk, if it's giving you the wrong outputs. Thank you so much, Edgar.

Thank you again. Appreciate it. Thank you, Tom.