
Jonathan McPherson | New language features in RStudio | RStudio (2019)
RStudio 1.2 dramatically improves support for many languages frequently used alongside R in data science projects, including SQL, D3, Stan, and Python. In this talk, you'll learn how to use RStudio 1.2's new language features to work more efficiently and fluidly in multi-lingual projects.
Materials: https://github.com/rstudio/rstudio-conf/tree/master/2019/RStudio_1.2_Language_Features--Jonathan_McPherson
About the author: Jonathan McPherson is a software engineer at RStudio working on the IDE. In the past, he’s written Web applications at a nuclear site in the desert, exploratory information visualization systems at UC Davis, and features for flagship Office products and modern web applications at Microsoft.
Transcript
This transcript was generated automatically and may contain errors.
Good afternoon everyone, by good afternoon I of course mean good morning. My name is Jonathan McPherson, like she said, I am a software engineer on the RStudio IDE and today I'm going to talk to you about some of the new language features that we've built into RStudio 1.2.
Before I do that though, a bit of background. It's been our experience that as people work with R, they are often working with more than one tool at the same time, and they're creating projects composed of bits and pieces in different languages. For example, if you're working with an R project in RStudio, you might need to get some data from a SQL database, so you open up another window to create a SQL query and look at the data there so you can analyze it in R. Perhaps by the time you're done with this query, you think of the fine work that Wes McKinney did on Pandas and think maybe you'll use that to analyze the data, so you open up PyCharm. Then you remember that you've actually been asked to visualize this data in a bubble chart, which you can't think of an R package to do, so you start visualizing that in D3. Then you start thinking about modeling, and you start browsing Wikipedia for modeling languages like Stan. And then you play that game on Wikipedia. Have you ever played that game where you click on the very first link of every article and you always wind up at philosophy?
Every time. Try it out. And then the next thing you know, you're at the zoo and you are shaving a yak.
So this will not do. This is one of the kinds of problems we are trying to solve with RStudio 1.2.
So our goals for RStudio 1.2 are many but one of them is to be a more comprehensive workbench for your R projects. So we're going to embrace some of the languages that we commonly see people use in their R data science projects and we're going to make the interoperability between those languages a lot more seamless so that you're not doing so much context switching. If you think about that slide that I just showed with all those windows, like every time you have to switch between tools, switch between windows, every time you have to import data from one tool, export it into another tool, that takes a lot of your time and mental energy and kind of breaks you out of the state of flow while you're working on your data science project. So we're hoping that we can take some of these workflows and make them really easy to do seamlessly right inside of RStudio.
We also have some non-goals. We don't want to make one of these things. We are not trying to become a general purpose IDE. We're not trying to fully replace your dedicated tools or lose focus on R. So RStudio is and always will be first and foremost a workbench for data science with R.
So here is our agenda. We're going to cover just a few languages. They are as follows: SQL, Python, we'll do D3, we'll do Stan, and if we have time at the end, I'm going to show you a couple of the other fun things we've added to RStudio 1.2.
SQL demo
So for most of the rest of this talk, I'm just going to do a live demo of some of these new language features in RStudio, and I'm going to start with SQL. This will look pretty familiar if you went to the keynote this morning. So one of the things that we built into RStudio is data connectivity, and here I'm connecting to a database for a record store. For those of you who still remember what a record store is.
So this database has information about all of the records that we sell, all the artists who made those records, and there is information about customers, our employees, how much people have paid for the records, et cetera. So I'm going to start out by creating a simple SQL query report that tells me all of the albums that I sell and which artists created those albums. As you can see, this data is currently in separate tables.
So I'm going to start by pressing this new SQL button you'll see here. And we'll call this Albums, okay, and you'll notice right away that RStudio has suddenly become an interface that should be pretty familiar to anyone who has spent time working on SQL queries. Notice I have a list of tables over here, I have a query over here, and then I have the results of the query right here. And it's quite easy for me to work on the query and to get real-time feedback about the results of the query here.
So I just selected from the artist table as well. As you can see, this has a Cartesian join result, which is not at all what I want for this query. So I'm going to add a where clause. And you'll notice that as I'm typing, I'm getting real-time autocompletion of the fields in the tables. So I can say where albums.artistID, and I get autocompletion there, equals artists.artistID. And there we go. So now I've got a quick little SQL report that tells me all of the albums that I sell and all the artists who made the albums. Again, fairly straightforward, but really easy and very fluid for me to do this without opening 12 windows to look at my schema and my database and my results.
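The difference between the Cartesian result and the filtered query can be sketched outside RStudio. What follows is a minimal illustration using Python's built-in sqlite3 module. The table names, column names, and sample rows are assumptions made up for the example, not the schema shown in the demo.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two small, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE artists (ArtistId INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER);
        INSERT INTO artists VALUES (1, 'Artist A'), (2, 'Artist B');
        INSERT INTO albums VALUES
            (1, 'First Album', 1),
            (2, 'Second Album', 1),
            (3, 'Third Album', 2);
        """
    )
    return conn


conn = build_demo_db()

# With no condition, every album is paired with every artist (a Cartesian product).
cartesian = conn.execute(
    "SELECT albums.Title, artists.Name FROM albums, artists"
).fetchall()

# The WHERE clause keeps only the pairs where the album's artist matches.
joined = conn.execute(
    "SELECT albums.Title, artists.Name FROM albums, artists "
    "WHERE albums.ArtistId = artists.ArtistId "
    "ORDER BY albums.AlbumId"
).fetchall()

print(len(cartesian))  # 3 albums x 2 artists
print(joined)
```

Without the condition the row count is the product of the two table sizes, which is why the unfiltered result in the demo was not useful as a report.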
Python and reticulate in R Markdown
The next thing that we're going to do is we are going to look at a little notebook that I've created. So this is a notebook that does a little bit more in-depth analysis. It's going to help me figure out who the top customers are at my failing record shop so I can figure out how I can get more money out of them.
To start off with, we're just going to connect to the SQL database, and I'm going to use the reticulate package. Now, a lot of the things you're about to see are a direct result of the reticulate package. This package doesn't require RStudio to use. It was put together by JJ and Kevin, and it powers a lot of the Python interoperability stuff you're about to see.
But before we get to the Python stuff, I wanted to note that all of the really nice things that I just showed you are not only in the SQL window. They are also available to you inside of R Markdown documents. Here I have a SQL query, and what this query does is it simply goes out and gets me a list of every customer and every invoice that that customer has ever had. So all that nice stuff that I just showed you is still available, right? So I can get the same kind of auto-completion that I'm used to.
And I actually want to save this query into an output variable so I can use it later. So now I'm going to use Pandas. I'm going to use Pandas to summarize the data that I just created. So I made a list of invoices, now I'm going to use Pandas to group together and summarize that data to figure out who has spent the most total money on invoices. Let's go ahead and run that. And you can see here are my top five customers.
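The group-and-summarize step can be sketched in plain Python. This is not the notebook's code: the demo uses Pandas (where the same idea is a group-by followed by a sum), and the customer names and totals below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical invoice rows: (customer name, invoice total).
invoices = [
    ("Alice", 10.0),
    ("Bob", 5.0),
    ("Alice", 7.5),
    ("Carol", 20.0),
    ("Bob", 1.0),
]


def top_spenders(rows, n=5):
    """Sum invoice totals per customer and return the n largest, biggest first."""
    totals = defaultdict(float)
    for customer, total in rows:
        totals[customer] += total
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


top = top_spenders(invoices, n=2)
print(top)
```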
Now I wanted to show you a couple of things that you can do while you're authoring these chunks. First of all, you get the same kind of auto-completion you do in R. So just like you can type library in R and get a list of your R packages, you can type import in a Python chunk and you'll get a list of your Python packages. You also get the same kind of method completion that you get inside of R. So for instance, we've got variables, we can see the names of the functions here. If we use a function, we can actually get help for that function right in the help pane. You can jump to the definition of that function right in its Python file. So we think that these capabilities will make it really easy for you to very fluidly and naturally author Python chunks inside of R Markdown or other bits and pieces of your R project.
The other thing I'd like to point out here, and some of you who are a little bit more astute might have already noticed it, is that I didn't have to do anything to get that data from SQL into my Python chunk. And the reason is this. Notice that when I created an output variable here in SQL, it went into my global environment right here. See, here's that data from SQL. And the only thing I needed to do to reference that data in Python is this: r.spinning. That just takes the variable from R called spinning and allows me to use it right inside my Python code.
In previous versions of RStudio and R Markdown, Python chunks basically ran in a vacuum. You had to import the data at the beginning of the chunk from wherever and save it at the end. And that's no longer necessary. Reticulate actually embeds a Python session inside of your R session. So it's very easy and seamless to get data in and out of Python.
Speaking of getting data in and out of Python, it's also quite easy to do the opposite of what I just showed you. Just like you can say r. in Python, you can say py$ in R. Notice here I'm getting a nice real-time autocompletion that not only tells me what data from Python is available but also gives me some information about that data. So here I can run that chunk and now I've got the summarized data from Python.
Now let's do a little bit of visualization. This is something you also kind of saw hinted at in the keynote. I'm going to use Matplotlib, again right inside of RStudio, to visualize these top five spenders at my fancy record shop.
Now another thing I want you to notice here is that each one of these Python chunks is running in the same Python session. In previous versions you basically had to run each Python session independently. Every one of these chunks would start a new Python session, run all of the code, and then quit. So now that we have an embedded session, these things can build on each other. You can actually, just like you can with R chunks, you can build seamlessly chunk to chunk inside of your Python chunks.
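The difference between one session per chunk and one shared session can be imitated in plain Python. This is only an analogy and not how reticulate or R Markdown is implemented: each "chunk" below is a string of source code, and a dictionary stands in for a session's variables. All names here are hypothetical.

```python
# Two hypothetical chunks; the second depends on a variable defined by the first.
chunks = [
    "totals = {'Alice': 17.5, 'Carol': 20.0}",
    "best = max(totals, key=totals.get)",
]


def run_isolated(sources):
    """Run every chunk in a fresh namespace, like starting a new session each time."""
    outcomes = []
    for source in sources:
        namespace = {}
        try:
            exec(source, namespace)
            outcomes.append("ok")
        except NameError:
            # The chunk referred to a variable that an earlier chunk defined.
            outcomes.append("NameError")
    return outcomes


def run_shared(sources):
    """Run all chunks in one namespace, like a single embedded session."""
    namespace = {}
    for source in sources:
        exec(source, namespace)
    return namespace


isolated = run_isolated(chunks)
shared = run_shared(chunks)
print(isolated)
print(shared["best"])
```

In the isolated run the second chunk fails because `totals` no longer exists, while in the shared run it can build on the first chunk's result.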
D3 visualization
So the last chunk in this notebook is a D3 visualization. So it's basically going to take that same data. And I'm going to sort by the number of people here and there we go. I made a nice little bubble chart.
So I want you to notice here that RStudio is not only, you know, an environment where you can look at these D3 visualizations, it's also an environment where you can create them. So you'll notice here that this bubble chart that I made is a D3 script called bubbles.js and I can just open that and start working on that visualization. You'll notice I've got a preview button here so I can take a look at what this visualization looks like. And I think you'll notice that the font is really a little bit too small here. So let's go ahead and bump that up. Notice, again, there's some sample data up here so that I don't have to have any actual live data connected to this thing to work with it. Let's go ahead and preview it. You can see the font's slightly more reasonable. And if I go back here and run my visualization, you'll see that now my customer's names are a lot easier to see. So again, very seamless authoring and integration of D3 visualizations inside of RStudio.
So now that we've kind of taken a look at our top five spenders and spent a little bit of time kind of passing data back and forth, I want you to take a step back and think about what we've just done here. We've brought together SQL and R and Python and D3 into a reproducible publishable report all without leaving RStudio. It's really our hope that it's going to be very easy for you to compose these same kinds of multi-language workflows yourself.
Python scripts and Stan
So there's a couple other Python features I wanted to show you really quickly while we're talking about Python. And that is mainly that there's a couple things that you can do in Python that are only possible in Python scripts. So you'll notice here I have a new source script button. This just runs the same code that we looked at earlier that analyzes my top five spenders. However, it's not only possible for me to run the whole script at once, but just like you can in R, you can run the script one line at a time. And you'll notice that when I do that, RStudio automatically switches to an embedded Python console. And this is also powered by the reticulate package and it uses that same embedded R, sorry, that same embedded Python session that the chunks were using. So all of my state is still here. And it's very easy for me to iterate on it and I can, you know, type my own Python code if I want to. And there we go.
So the only language we haven't talked about is Stan. I'm not going to spend a whole lot of time on it. But for those of you who have worked with Stan, we've now made Stan quite a lot easier to work with. We've got a dedicated editor mode for it now. All the same things that I just showed you with respect to Python and SQL, you will also get in Stan. So you have auto completion. You can see here I've got a list of all the functions I can call. You'll notice I've already created a syntax error, which is right there. And we also have a document outline that makes it really easy for you to navigate your Stan code. I'm not a Stan expert, so I'm not going to spend a lot more time on this, but we hope this makes it a lot easier for you to work with Stan projects inside of the IDE.
Other new features in RStudio 1.2
So with our last five minutes or so, I want to show you a couple of other small features we've added to RStudio 1.2. So the first of these is background jobs. So it's been our experience that when people work with R, they will often create R scripts that really take quite a lot of time to run. And if you've ever had to work with these inside of RStudio, you know that you basically have to run the script and then wait. Because you can't do a whole lot of things with RStudio while the R session is busy. If you're on RStudio desktop, you can launch another copy of RStudio to get some other possibly unrelated work done. But it can be quite challenging if you want to run a couple of these things.
So a background job is basically just running an R script or something like it inside a background R task. So we've got a new button here called Source as Local Job. And that's just going to take the script that I wrote and run it in the background, and get me some results when it's done. I'm going to copy the results to the global environment and go ahead and start the job. You'll notice here that I get progress and output for the job as it runs. I've got a list of running jobs over here. I can monitor them in the console. And while the job is running, all of the things that RStudio does are still available. So the R session is not busy. And I'm able to continue to do work while I run these background processes. You're not limited to just one job. You can run multiple jobs at once. It is possible to run as many jobs as your machine has the capacity for. So hopefully this will make it a lot easier for you to take those long running computations and make them work really nicely without, again, having to constantly switch contexts.
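The general idea of a background job, running a script in a separate process so the main session stays free, can be sketched in Python with the standard subprocess module. This is an analogy under stated assumptions and not RStudio's implementation: `start_job`, `collect`, and `slow_script` are hypothetical names, and the "script" is a short stand-in for a long computation.

```python
import subprocess
import sys


def start_job(source):
    """Launch Python source in a separate interpreter process and return immediately."""
    return subprocess.Popen(
        [sys.executable, "-c", source],
        stdout=subprocess.PIPE,
        text=True,
    )


def collect(job):
    """Wait for a job to finish and return what it printed."""
    output, _ = job.communicate()
    return output.strip()


# A stand-in for a long-running script.
slow_script = "import time; time.sleep(0.5); print(sum(range(1000)))"

# Several jobs can run at once, each in its own process.
jobs = [start_job(slow_script), start_job(slow_script)]

# The main session is not blocked while the jobs run.
foreground_result = 2 + 2

# Copy the results back once the jobs are done.
results = [collect(job) for job in jobs]
print(foreground_result, results)
```

Because each job lives in its own process, a slow job does not make the launching session busy, and the number of jobs is limited only by what the machine can handle.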
The next little thing that I wanted to show is auto installation of missing R packages. This is another quality of life improvement that we've made. And we've noticed when people open up an R script that they themselves wrote a long time ago or from another coworker and they don't have the right packages installed, it's kind of a pain. You have to go manually find the packages that are missing and manually install them. It sometimes takes a couple of times to run the script to figure out which packages you actually need.
So in RStudio 1.2, we have created a code parser that basically analyzes your code to figure out what packages you're going to need in order to run the script and then gives you a little prompt that offers to install those packages for you. You'll notice here that I have asked to run an R script that uses the caret package and the ML package. Because I don't actually have those packages installed, it's going to offer to install them for me.
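A rough idea of how package requirements can be read out of a script is sketched below in Python. This is not RStudio's parser: it is a simple regular-expression scan for `library()` and `require()` calls, and the function names, the sample script, and the set of installed packages are all assumptions made for the example.

```python
import re

# Matches library(pkg), library("pkg"), require(pkg), and require('pkg').
LIBRARY_CALL = re.compile(
    r"""\b(?:library|require)\(\s*["']?([A-Za-z][A-Za-z0-9.]*)["']?\s*\)"""
)


def required_packages(r_source):
    """Return package names loaded in the R source, in first-seen order."""
    found = []
    for line in r_source.splitlines():
        # Drop comments (naive: a '#' inside a string literal is not handled).
        code = line.split("#", 1)[0]
        for name in LIBRARY_CALL.findall(code):
            if name not in found:
                found.append(name)
    return found


def missing_packages(r_source, installed):
    """Return the required packages that are not in the installed set."""
    return [name for name in required_packages(r_source) if name not in installed]


# A hypothetical R script and a hypothetical set of installed packages.
script = """
library(caret)
# library(ignored)
require("ggplot2")
library(caret)
"""

needed = required_packages(script)
to_install = missing_packages(script, installed={"ggplot2"})
print(needed, to_install)
```

A scan like this only sees package names, not where a package came from, so the most it can do is offer to install whatever version a default repository provides.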
One other little quality of life improvement that we've added is the ability to create PowerPoint presentations. I'm not going to spend a whole lot of time on this, but a lot of people just for purposes of, you know, getting their workflow done, like, their R stuff winds up in a PowerPoint presentation. So you can now make those very easily. It's just as easy as making a Word document with R Markdown.
And finally, I wanted to show you the appearance pane here. So we've also made it really easy to change RStudio's colors. This is another small quality of life improvement that people have been asking us for for many years. And it basically makes it possible for you to add or remove your own custom themes. You can make these themes yourself online. It's possible to do them with any text theme editor. And we have our own theme format, too, if you need further tweaking. I particularly like this theme that somebody made for us. It's a Christmas theme, and it has a candy cane background for your cursor line. So that's pretty fun.
Summary and Q&A
So I just have a few minutes left, so let's wrap it up. We did the demo. So in summary, we hope that the 1.2 release makes it a lot more seamless for you to work with your data science projects that span languages and technologies. We've added a whole bunch of quality of life improvements. You can do background jobs. You can install packages that you don't have automatically. We have custom themes. There's a whole bunch of other stuff. I'm glossing over large swaths of improvements here.
And all the stuff that you're seeing today is available in the RStudio 1.2 public preview, which is getting quite close to a stable release, which we expect to have out this spring. And if you want to get at any of the code that I've shown in this presentation, I've put it up at this link. It's got everything that you need to run it. All you need to do is download the RStudio 1.2 preview, which I have linked there as well, and you can run all that code yourself. All the data is included. So if you'd like to play with it and kind of get a feel for how working with those languages is, you're more than welcome to do so.
And that's all I got. Thank you very much for your time.
Thank you, Jonathan. We have time for a few questions while the next speaker is setting up. Just wave me down, and I'll throw this in your direction.
Yeah, thanks for the great talk, Jonathan. For those background jobs, can you launch them on a compute cluster besides running them locally like you were doing there? Like any external R server or things like that?
Got it. So the question was, and you may have been sort of hinted into this by the fact that it says run local job, can you run these jobs somewhere else. So there's a feature in RStudio Server Pro that allows you to run these jobs basically anywhere that your R sessions can run. If you're interested in learning more about that, there's a talk about the RStudio job launcher that will go into a lot more detail about how you can use the background jobs feature to launch much bigger jobs on your compute cluster.
Are there any plans to open up the languages to user-created highlighting and completion and stuff like that?
So the question was, are we going to basically create a system that allows you to plug your own language engines into RStudio so that there's a possibility to make user extensible languages? We don't currently have plans for that, but we are watching the Language Server Protocol project, which Microsoft started for Visual Studio Code, that aims to provide a consistent back end for that kind of thing. So we're keeping an eye on it, but we don't have specific plans for it right now.
The package installation, will it look for GitHub as well or just CRAN packages?
It only looks at CRAN packages. So when you open up an R script, there's really no way of telling where the package came from. So it's not looking for installation commands. It's looking for, like, library commands. So because it doesn't really know where the packages came from, all it can do is offer to install the CRAN versions. Now, that said, it is also pretty simple to use something like packrat, and we're also working on a few other initiatives that will make it easier for you to snapshot your package dependencies and replay them on someone else's machine. So if you really care about, like, installing the exact version of the package that worked with your R script, you need some other metadata of some kind. We don't currently support that for this particular feature.
