
Outgrowing your laptop with Positron - Julia Silge
Have you ever run out of memory or time when tidying data, making a visualization, or training a model? An R user may find their laptop more than sufficient to start their journey with statistical computing, but as datasets grow in size and complexity, so does the necessity for more sophisticated tooling. This talk will step through a set of approaches to scale your tasks beyond in-memory analysis on your local machine, using the Positron IDE: adopting a lazy evaluation engine like DuckDB, connecting to remote databases with fluent workflows, and even migrating from desktop analysis entirely to server or cloud compute using SSH tunnelling. The transition away from a local, in-memory programming paradigm can be challenging for R users, who may not have much exposure to tools or training for these ways of working. This talk will explore available options which make crossing this boundary more approachable, and how they can be used with an advanced development environment suited for statistical computing with R. Integrations in the Positron IDE make all these tasks easier; for example, remote development in Positron allows an R user to seamlessly write code on their local machine, and execute that code on a remote host without tedious interactions outside the IDE. Whether you train statistical models, build interactive apps, or work with large datasets, after this talk you'll walk away with techniques for doing it better with Positron. https://positron.posit.co
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Great, I am very excited to be here today to talk to you about how you can think about outgrowing your laptop. Something that Henrik said that really transfers from his talk to mine is that it is not magic. If we can all have a little bit more understanding of how the things I'm going to talk about today actually work, we can know when to use and reach for different tools.
So, I come from a background in physics and astronomy, and I did some things on servers then. I then transitioned into a career in data science, and as a data scientist I came somewhere that had, you know, larger-than-memory data in databases that I had to learn how to use. Now I work at Posit building tools for people, and all the experiences I had in the past, about what do I do when I cannot deal with my data in memory on my laptop, have led me to think: you know what, I learned about these things in an apprenticeship-style model, often thinking they were specific to where I was, rather than seeing that they are somewhat generalizable, that these things are related and solving related problems.
So, when I think about my own path with this now, I think, oh, it would be really helpful for all of us to have some shared understanding of what the options are and what some of these trade-offs are.
Starting on your laptop
So, we almost all start out, maybe in a data science class or a stats class, working like this: on the laptop that is sitting in front of you. R is on your laptop; your IDE, RStudio or maybe Positron, which is what I'm focusing on in this talk, is also running on your laptop; and, say, a CSV file is literally on your laptop. Everything is here together. We often start in a situation like this, and we run code that may look like this. This is example tidyverse code, but I'm sure you can imagine the code you might write if you use a different paradigm for writing your R code.
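The slide's code isn't reproduced in this transcript, but the fully in-memory workflow being described looks something like this sketch (the file name and columns, flights.csv with carrier and dep_delay, are placeholders, not the talk's actual example):

```r
library(readr)
library(dplyr)

# read_csv() pulls every row of the file into R's memory
flights <- read_csv("flights.csv")

# all of these operations happen on the in-memory data frame
flights |>
  group_by(carrier) |>
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE)) |>
  arrange(desc(mean_delay))
```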
So, what's happening here is that the CSV file is going into R, the data is fully in memory in R, and then I can do operations on it in memory there in R. When you're working in an IDE in this kind of mode, you can see it's telling me information about this thing, and it is all on the laptop, in memory. The data that was used to make this plot came from the thing that was in memory. This is Positron's data explorer, and everything that you see there is literally connected to something that is in memory in R right there in front of you.
So, this is a great place to start. I am not saying that we should be changing how we introduce people to learning R, but this is something that comes with limitations, and often we hit a wall where we're like, oh man, I can no longer do this. The first limitation we often hit is around memory: we can no longer read all the data we need into whatever laptop or desktop we're working on. The other thing we often run into is problems around performance: it takes too long to run the analysis, or even to do the query that I want to, the sort of group-by-summarize that I showed. I can't even do that because of how many groups there are or how much data there is.
So, there are different options that have different trade-offs when we need to start fighting this fight. When it's time, even though your laptop is a great place to start, when it's time to do something else, there are different options that we have that have different trade-offs to deal with these problems of memory and performance.
So, this is the outline of my talk right here. We often start with a CSV file, and that is great, but we're going to walk through some different options: Parquet and DuckDB, databases, and remote SSH. Using this meme kind of implies this goes from less great to more great, and that's not true. These are just different from each other. So don't overinterpret the order that this is in, because these options are different from each other and have different kinds of trade-offs.
Parquet files and DuckDB
So, let's dig in to what it looks like if we go to the next step, at least in the hierarchy that I made here in a somewhat artificial way. You'll notice this looks really similar. The first next step is that instead of a CSV file on your laptop, you can deal with a Parquet file that is on your laptop. A Parquet file is a binary file format, so in that sense you can't open it in a text editor and see the rows, but it is a file format used to store rectangular data along with metadata like the column names and the column types. It's specially made to deal with rectangular data.
And the thing that is great about Parquet files is that we have ways to run queries on the file. The code you write, you'll notice, looks really similar to the code that I wrote before. Instead of doing a read_csv, I'm using a different function from, in this case, duckplyr: read_parquet_duckdb(). Here's the thing. When I run that line, it actually does not read the whole file into memory. It looks at the file and understands what's in it, but the whole file is not being brought into memory.
I can write code against this object that acts like a data frame, but actually is not a fully in-memory data frame in R. So I can do something like a group-by-summarize, including an arrange here, and I get out results like this. I want to highlight for you that it doesn't say it's a table, and if you were using non-tidyverse printing, it wouldn't say it's a data frame, because it actually has not been brought over the wire, if you will, into R at all yet. There is no data frame in R at this point. If I decided I wanted to bring it into R, I would do something like this: explicitly say, okay, I need to collect it from this sort of hazy, not-yet-materialized state into an actual R object. So now it is a regular R object.
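A sketch of this lazy workflow with duckplyr, again with placeholder file and column names rather than the talk's actual example:

```r
library(duckplyr)

# registers the file with DuckDB and reads its metadata;
# the rows themselves stay on disk
df <- read_parquet_duckdb("flights.parquet")

# builds up a lazy query; nothing is computed in R yet
result <- df |>
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE), .by = carrier) |>
  arrange(desc(mean_delay))

result            # prints a preview, still not an ordinary R data frame
collect(result)   # only now is the summarized result materialized in R
```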
So, this solves problems both about memory and performance, because you don't have to bring the whole thing into memory. You can just bring a summarized result into memory. Instead of bringing every single row over into R, you can just bring the summarized data that you need into R, that you might need to make a visualization or to train a model.
So this is the initial architecture: you just switch out a CSV for a Parquet file, and this helps you with performance, because it turns out DuckDB is a really efficient and fast query engine. It can run queries faster than even some of the, say, powered-by-C dplyr operations, or data.table. DuckDB as an engine for queries is faster than anything we can do from R in many situations. So it solves both the memory problems and the problems with performance.
Another kind of cool thing that can happen here is that you don't actually have to have the Parquet file on your laptop. The Parquet file can be somewhere else. If you're writing code that looks like this, again, you're not having to change the way you write code very much. We still have something very similar: read_parquet_duckdb, group by, summarize, arrange. But now I'm going to some enormous Parquet files that are very far away. And when I'm writing and running this code, it runs really fast, because it actually is not bringing everything into memory. It's super speedy; I'm able to iterate really quickly. It's only when I do something that actually materializes it that it's like, okay, now it's actually doing the whole query and bringing it into memory.
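The same shape of code works when the file lives somewhere remote; this URL is a placeholder, not the dataset from the talk, and DuckDB's ability to read over HTTP is what makes this possible:

```r
library(duckplyr)

# same function as before, but pointed at a remote Parquet file;
# only metadata is fetched at this point, not the rows
df <- read_parquet_duckdb("https://example.com/data/flights.parquet")

# still lazy and fast to iterate on; the full query only runs
# (and only the summarized rows travel over the network) on collect()
df |>
  summarize(n = n(), .by = carrier) |>
  collect()
```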
When you're working in this kind of mode in an IDE, like I'm showing Positron here, there are some things that help you understand what it is that you are doing. I want to highlight the printing: as we look around, we're getting cues that tell us what we're doing. I can notice that this df object here tells me it's a duckplyr data frame, so I can think, oh, yep, I don't actually have that in memory in R. If I'm looking up in the variables pane, you notice it says question mark rows, because it has not been brought all the way over, so we don't actually know how many rows are in it at that point. This is good, because it means we're able to write the code we need really quickly. In this case, the only thing that has actually come over into R is the data that we used to make this plot. So this plot was actually made by R with an R object, which is only 24 numbers here, right, rather than maybe millions of rows. That makes for a good, healthy, quick workflow.
Tips for lazy evaluation
So let's go over a couple of tips for what it's like when you work in a situation like this, when you use lazy evaluation, meaning we're not actually saying what these numbers are for real until we actually need them. The first is to get clear on what some of this verbiage means. DuckDB is an engine for executing queries. It can execute queries against a lot of different things, like a SQLite database, or actually a CSV file. But one of the things it's especially powerful at working with is Parquet files. Parquet, that's a file format, a way to save something to a file on either your local computer or a faraway computer. So it's good to get some understanding of what these different words mean, so that you can have clarity and understand that it is not magic.
Another tip I'll share is that working with lazy queries can take some getting used to. The first stuff you do feels so fast that you're like, yeah, yeah, yeah. And then you'll accidentally do something that brings it into memory, and the whole thing comes crashing to a stop, and you're like, oh, crap, I accidentally materialized it. This is a very common thing to happen with this kind of workflow. So it can take some getting used to: something that is not actually in memory feels super speedy, until you maybe accidentally bring the whole thing into memory by materializing it when you didn't need to.
The translation between dplyr and DuckDB is pretty amazing. We can write a familiar domain-specific language. But it is true that not every single thing is supported. You'll notice I didn't write a group by followed by summarize; I wrote summarize with .by. So there are some differences to be aware of.
If you yourself have to directly read and write Parquet files, I would invite you to look at the nanoparquet package. That's a very useful package if all you need to do is read and write them. If you use something like read_parquet from nanoparquet, it doesn't do lazy evaluation; it reads the whole thing. But sometimes that is what you need: to read and write them with a lightweight, fast package. And once you're dealing with files that are not on your laptop, often the hardest thing is not the R code you need to write, or keeping track of how these things work, but dealing with some of the messiness around authentication. So that's just something to be aware of as one of the pain points here.
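For the eager read-and-write case described above, nanoparquet looks roughly like this (using mtcars as stand-in data; the file name is a placeholder):

```r
library(nanoparquet)

# write a data frame out as a Parquet file
write_parquet(mtcars, "mtcars.parquet")

# read the whole file back into memory; no lazy evaluation here,
# every row comes into R immediately
dat <- read_parquet("mtcars.parquet")
nrow(dat)
```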
Connecting to databases
So my colleague Neil is going to speak in more detail about some of these technologies right after me, so stay to hear more. We are going to move on and talk about databases. The architecture, when you think about where stuff is, is similar to what I just showed you: on your laptop you have your IDE, such as Positron, and your R session, and then R itself is connecting to something else on a remote server. A database is different from a Parquet file in that it's not one single file. Databases are so interesting, so cool; there's whole academic research around them. Similar to DuckDB, they have query engines inside of them that are finely honed to be super efficient.
So they maybe solve these problems around memory and around performance in similar ways that maybe using DuckDB and parquet does, in that you don't have to bring everything into memory. You can have the database with millions and millions of rows, and you can use a query engine to get a smaller summarized result out that you can then do something with.
So if you were going to write code in R to access a database and do something, you would do something like this. First, you would create a connection to your database, then you would pull out a table from that database, and then do something like group by summarize again here. You'll notice that as this is printed out, once again it's giving me some cues, by how it's printing, that this is not an R object. So once again, we're executing a query and getting a preview back, but this is not yet in R. And once again, if you were to use an explicit gesture like collect, that would bring it into memory. So now this thing is an R object here.
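A sketch of that connect-then-query pattern with DBI and dplyr; the driver choice (RPostgres here), connection details, and table/column names are all placeholders, not the talk's actual example:

```r
library(DBI)
library(dplyr)

# create a connection to the database; credentials come from the
# environment rather than being hard-coded
con <- dbConnect(
  RPostgres::Postgres(),
  host = "db.example.com", dbname = "analytics",
  user = Sys.getenv("DB_USER"), password = Sys.getenv("DB_PASSWORD")
)

# a lazy reference to a remote table; no rows are in R yet
flights_db <- tbl(con, "flights")

# the query runs inside the database's own engine;
# only the summarized rows come back into R at collect()
result <- flights_db |>
  group_by(carrier) |>
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE)) |>
  collect()

dbDisconnect(con)
```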
So when you're working with a database in an IDE like Positron, again, you have some nice supports for how you go about dealing with it. I'm writing, again, really similar R code up here. I'm interacting with it in a way that gives me those cues about when I do and don't have this thing in memory, versus a preview of what is going on with the database remotely. There's also a connections pane in Positron that has some similarities to the connections pane in RStudio, if you've ever used that. What it does is help you visualize and interact with what is on the remote server. So here is an example with some different tables in the database, where I can see both the tables and the columns that are there.
So the tips that we have here are similar. It's worth it to learn a little bit of SQL, because it turns out that when we know a little of what's going on under the hood, it helps us understand it's not magic and solve problems that we may run into. And again, we get these kinds of subtleties here. That connections pane I showed you also works with DuckDB, so there's overlap in these technologies; they're not all different from each other, but rather fairly similar. And guess what? Working with databases, often the hardest part, the really most frustrating stuff, is not the R code you need to write, but rather dealing with auth.
Remote SSH with Positron
All right, so let's now move to the last step in this meme, which is dealing with remote SSH. This is a slightly different architecture, as you'll notice here. Now what's on your laptop is just the front end of Positron, and what is on the remote server is the back end. R itself is on that remote server, and so are your files, whether they're Parquet files or some connection to a database on yet another server. The separation between what's local and what's remote is now between the front end and the back end of the IDE itself. This is something RStudio doesn't support and cannot support, so it's kind of a unique thing you might choose to use Positron for.
This is what it looks like when you're working in remote SSH. So it looks very comfortable, very similar. But we get some sort of clues in the IDE about what is going on. Like here, you can click on the ports and see how all the ports are getting forwarded and whatnot.
So my tips for dealing with remote SSH sessions: this one is a little different. The terminals that you have in the IDE, your files, your extensions, they're all on the remote server. They're not local at all; R itself is on the remote server. There are some nice features around how you set this up. If you set this shutdown timeout quite long, you can actually close Positron locally, you can shut your computer, and your long-running R process will keep going on the remote server.
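Connecting over SSH typically leans on a standard SSH config entry like the sketch below; the host alias, hostname, username, and key path are all placeholders. With an entry like this in place, an SSH-based remote connection can refer to the server by its short alias:

```
# Example ~/.ssh/config entry (all values are placeholders)
Host analytics-server
    HostName server.example.com
    User your-username
    IdentityFile ~/.ssh/id_ed25519
```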
You can, if you're interested, use all of that infrastructure entirely locally, actually, which you might do if you have high reproducibility needs. This does not solve problems around memory and performance; actually, everything gets slightly worse, because you're running it inside a mini computer on your laptop. But you might choose to do that for other reasons. And I bet you know what I'm going to say next: often when you're working with remote sessions, again, the hardest thing is auth. It's not that you have to deal with different code; it's some of the details that go along with that.
And I will drop some links here. Another option in this list of ways to work is a truly server-type environment, where you log in via a browser and connect to a server that's running Positron; you can do that via Workbench. And with that, I will say thank you very much.


