
Outgrowing your laptop with Positron - Julia Silge
Have you ever run out of memory or time when tidying data, making a visualization, or training a model? An R user may find their laptop more than sufficient to start their journey with statistical computing, but as datasets grow in size and complexity, so does the necessity for more sophisticated tooling. This talk will step through a set of approaches to scale your tasks beyond in-memory analysis on your local machine, using the Positron IDE: adopting a lazy evaluation engine like DuckDB, connecting to remote databases with fluent workflows, and even migrating from desktop analysis entirely to server or cloud compute using SSH tunnelling. The transition away from a local, in-memory programming paradigm can be challenging for R users, who may not have much exposure to tools or training for these ways of working. This talk will explore available options which make crossing this boundary more approachable, and how they can be used with an advanced development environment suited for statistical computing with R. Integrations in the Positron IDE make all these tasks easier; for example, remote development in Positron allows an R user to seamlessly write code on their local machine, and execute that code on a remote host without tedious interactions outside the IDE. Whether you train statistical models, build interactive apps, or work with large datasets, after this talk you'll walk away with techniques for doing it better with Positron. https://positron.posit.co
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Great, I am very excited to be here today to talk to you about how you can think about outgrowing your laptop. Something that Henrik said that really transfers from his talk to mine is that it is not magic. If we can all have a little bit more understanding of how the things I'm going to talk about today actually work, we can know when to use and reach for different tools.
So, I come from a background in physics and astronomy, and I did some things on servers then. I then transitioned into a career in data science, and as a data scientist I came somewhere that had, you know, larger-than-memory data in databases that I had to learn how to use. Now I work at Posit building tools for people, and all the experiences I had in the past, about what do I do when I cannot deal with my data in memory on my laptop, have led me to think: you know what, I learned about these things in an apprenticeship-style model, often thinking they were specific to where I was, rather than seeing that they are somewhat generalizable, that these things are related and solving related problems.
So, when I think about my own path with this now, I think, oh, it would be really helpful for all of us to have some shared understanding of what the options are and what some of these trade-offs are.
Starting on your laptop
So, we almost all start out, maybe in a data science class or a stats class, working like this: on the laptop that is sitting in front of you. R is on your laptop; your IDE, RStudio or maybe Positron, which is what I'm focusing on in this talk, is also running on your laptop; and, say, a CSV file is literally on your laptop. Everything is here together. We often start in a situation like this, and we run code that may look like this. This is example tidyverse code, but I'm sure you can imagine the code you might write if you use a different paradigm for writing your R code.
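The slide's code isn't reproduced in this transcript, but the fully in-memory workflow being described looks something like this sketch (the file name and columns, flights.csv with carrier and dep_delay, are placeholders, not the talk's actual example):

```r
library(readr)
library(dplyr)

# read_csv() pulls every row of the file into R's memory
flights <- read_csv("flights.csv")

# all of these operations happen on the in-memory data frame
flights |>
  group_by(carrier) |>
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE)) |>
  arrange(desc(mean_delay))
```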
So, what's happening here is that the CSV file is going into R, the data is fully in memory in R, and then I can do operations on it in memory there in R. When you're working in an IDE in this kind of mode, you can see it's telling me information about this thing, and it is all on the laptop, in memory. The data that was used to make this plot came from the thing that was in memory. This is Positron's data explorer, and everything that you see there is literally connected to something that is in memory in R right there in front of you.
So, this is a great place to start. I am not saying that we should be changing how we introduce people to learning R, but this is something that comes with limitations, and often we hit a wall where we're like, oh man, I can no longer do this. The first limitation we often hit is around memory: we can no longer read all the data we need into whatever laptop or desktop we're working on. The other thing we often run into is problems around performance: it takes too long to run the analysis, or even to do the query that I want to, the sort of group-by-summarize that I showed. I can't even do that because of how many groups there are or how much data there is.
So, there are different options that have different trade-offs when we need to start fighting this fight. When it's time, even though your laptop is a great place to start, when it's time to do something else, there are different options that we have that have different trade-offs to deal with these problems of memory and performance.
So, this is the outline of my talk right here. We often start with a CSV file, and that is great, but we're going to walk through some different options: Parquet and DuckDB, databases, and remote SSH. Using this meme kind of implies this goes from less great to more great, and that's not true. These are just different from each other. So don't overinterpret the order that this is in, because these options are different from each other and have different kinds of trade-offs.
Parquet files and DuckDB
So, let's dig in to what it looks like if we go to the next step, at least in the hierarchy that I made here in a somewhat artificial way. You'll notice this looks really similar. The first next step is that instead of a CSV file on your laptop, you can deal with a Parquet file that is on your laptop. A Parquet file is a binary file format, so in that sense you can't open it in a text editor and see the rows, but it is a file format used to store rectangular data along with metadata like the column names and the column types. It's specially made to deal with rectangular data.
And the thing that is great about Parquet files is that we have ways to run queries on the file. The code you write, you'll notice, looks really similar to the code that I wrote before. Instead of doing a read_csv, I'm using a different function from, in this case, duckplyr: read_parquet_duckdb(). Here's the thing. When I run that line, it actually does not read the whole file into memory. It looks at the file and understands what's in it, but the whole file is not being brought into memory.
I can write code against this object that acts like a data frame, but actually is not a fully in-memory data frame in R. So I can do something like a group-by-summarize, including an arrange here, and I get out results like this. I want to highlight for you that it doesn't say it's a table, and if you were using non-tidyverse printing, it wouldn't say it's a data frame, because it actually has not been brought over the wire, if you will, into R at all yet. There is no data frame in R at this point. If I decided I wanted to bring it into R, I would do something like this: explicitly say, okay, I need to collect it from this sort of hazy, not-yet-materialized state into an actual R object. So now it is a regular R object.
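A sketch of this lazy workflow with duckplyr, again with placeholder file and column names rather than the talk's actual example:

```r
library(duckplyr)

# registers the file with DuckDB and reads its metadata;
# the rows themselves stay on disk
df <- read_parquet_duckdb("flights.parquet")

# builds up a lazy query; nothing is computed in R yet
result <- df |>
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE), .by = carrier) |>
  arrange(desc(mean_delay))

result            # prints a preview, still not an ordinary R data frame
collect(result)   # only now is the summarized result materialized in R
```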
So, this solves problems both about memory and performance, because you don't have to bring the whole thing into memory. You can just bring a summarized result into memory. Instead of bringing every single row over into R, you can just bring the summarized data that you need into R, that you might need to make a visualization or to train a model.
So this is the initial architecture: you just switch out a CSV for a Parquet file, and this helps you with performance, because it turns out DuckDB is a really efficient and fast query engine. It can run queries faster than even some of the, say, powered-by-C dplyr operations, or data.table. DuckDB as an engine for queries is faster than anything we can do from R in many situations. So it solves both the memory problems and the problems with performance.
Another kind of cool thing that can happen here is that you don't actually have to have the Parquet file on your laptop. The Parquet file can be somewhere else. If you're writing code that looks like this, again, you're not having to change the way you write code very much. We still have something very similar: read_parquet_duckdb, group by, summarize, arrange. But now I'm going to some enormous Parquet files that are very far away. And when I'm writing and running this code, it runs really fast, because it actually is not bringing everything into memory. It's super speedy; I'm able to iterate really quickly. It's only when I do something that actually materializes it that it's like, okay, now it's actually doing the whole query and bringing it into memory.
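The same shape of code works when the file lives somewhere remote; this URL is a placeholder, not the dataset from the talk, and DuckDB's ability to read over HTTP is what makes this possible:

```r
library(duckplyr)

# same function as before, but pointed at a remote Parquet file;
# only metadata is fetched at this point, not the rows
df <- read_parquet_duckdb("https://example.com/data/flights.parquet")

# still lazy and fast to iterate on; the full query only runs
# (and only the summarized rows travel over the network) on collect()
df |>
  summarize(n = n(), .by = carrier) |>
  collect()
```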
When you're working in this kind of mode in an IDE, like I'm showing Positron here, there are some things that help you understand what it is that you are doing. I want to highlight the printing: as we look around, we're getting cues that tell us what we're doing. I can notice that this df object here tells me it's a duckplyr data frame, so I can think, oh, yep, I don't actually have that in memory in R. If I'm looking up in the variables pane, you notice it says question mark rows, because it has not been brought all the way over, so we don't actually know how many rows are in it at that point. This is good, because it means we're able to write the code we need really quickly. In this case, the only thing that has actually come over into R is the data that we used to make this plot. So this plot was actually made by R with an R object, which is only 24 numbers here, right, rather than maybe millions of rows. That makes for a good, healthy, quick workflow.
Tips for lazy evaluation
So let's go over a couple of tips for what it's like when you work in a situation like this, when you use lazy evaluation, meaning we're not actually saying what these numbers are for real until we actually need them. The first is to get clear on what some of this verbiage means. DuckDB is an engine for executing queries. It can execute queries against a lot of different things, like a SQLite database, or actually a CSV file. But one of the things it's especially powerful at working with is Parquet files. Parquet, that's a file format, a way to save something to a file on either your local computer or a faraway computer. So it's good to get some understanding of what these different words mean, so that you can have clarity and understand that it is not magic.
Another tip I'll share is that working with lazy queries can take some getting used to. The first stuff you do feels so fast that you're like, yeah, yeah, yeah. And then you'll accidentally do something that brings it into memory, and the whole thing comes crashing to a stop, and you're like, oh, crap, I accidentally materialized it. This is a very common thing to happen with this kind of workflow. So it can take some getting used to: something that is not actually in memory feels super speedy, until you maybe accidentally bring the whole thing into memory by materializing it when you didn't need to.
The translation between dplyr and DuckDB is pretty amazing. We can write a familiar domain-specific language. But it is true that not every single thing is supported. You'll notice I didn't write a group by followed by summarize; I wrote summarize with .by. So there are some differences to be aware of.
If you yourself have to directly read and write Parquet files, I would invite you to look at the nanoparquet package. That's a very useful package if all you need to do is read and write them. If you use something like read_parquet from nanoparquet, it doesn't do lazy evaluation; it reads the whole thing. But sometimes that is what you need: to read and write them with a lightweight, fast package. And once you're dealing with files that are not on your laptop, often the hardest thing is not the R code you need to write, or keeping track of how these things work, but dealing with some of the messiness around authentication. So that's just something to be aware of as one of the pain points here.
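For the eager read-and-write case described above, nanoparquet looks roughly like this (using mtcars as stand-in data; the file name is a placeholder):

```r
library(nanoparquet)

# write a data frame out as a Parquet file
write_parquet(mtcars, "mtcars.parquet")

# read the whole file back into memory; no lazy evaluation here,
# every row comes into R immediately
dat <- read_parquet("mtcars.parquet")
nrow(dat)
```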
Connecting to databases
So my colleague Neil is going to speak in more detail about some of these technologies right after me, so stay to hear more. We are going to move on and talk about databases. The architecture, when you think about where stuff is, is similar to what I just showed you: on your laptop you have your IDE, such as Positron, and your R session, and then R itself is connecting to something else on a remote server. A database is different from a Parquet file in that it's not one single file. Databases are so interesting, so cool; there's whole academic research around them. Similar to DuckDB, they have query engines inside of them that are finely honed to be super efficient.
So they maybe solve these problems around memory and around performance in similar ways that maybe using DuckDB and parquet does, in that you don't have to bring everything into memory. You can have the database with millions and millions of rows, and you can use a query engine to get a smaller summarized result out that you can then do something with.
So if you were going to write code in R to access a database and do something, you would do something like this. First, you would create a connection to your database, then you would pull out a table from that database, and then do something like group by summarize again here. You'll notice that as this is printed out, once again it's giving me some cues, by how it's printing, that this is not an R object. So once again, we're executing a query and getting a preview back, but this is not yet in R. And once again, if you were to use an explicit gesture like collect, that would bring it into memory. So now this thing is an R object here.
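A sketch of that connect-then-query pattern with DBI and dplyr; the driver choice (RPostgres here), connection details, and table/column names are all placeholders, not the talk's actual example:

```r
library(DBI)
library(dplyr)

# create a connection to the database; credentials come from the
# environment rather than being hard-coded
con <- dbConnect(
  RPostgres::Postgres(),
  host = "db.example.com", dbname = "analytics",
  user = Sys.getenv("DB_USER"), password = Sys.getenv("DB_PASSWORD")
)

# a lazy reference to a remote table; no rows are in R yet
flights_db <- tbl(con, "flights")

# the query runs inside the database's own engine;
# only the summarized rows come back into R at collect()
result <- flights_db |>
  group_by(carrier) |>
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE)) |>
  collect()

dbDisconnect(con)
```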
So when you're working with a database in an IDE like Positron, again, you have some nice supports for how you go about dealing with it. I'm writing, again, really similar R code up here. I'm interacting with it in a way that gives me those cues about when I do and don't have this thing in memory, versus a preview of what is going on with the database remotely. There's also a connections pane in Positron that has some similarities to the connections pane in RStudio, if you've ever used that. What it does is help you visualize and interact with what is on the remote server. So here is an example with some different tables in the database, where I can see both the tables and the columns that are there.
So the tips that we have here are similar. It's worth it to learn a little bit of SQL, because it turns out that when we know a little of what's going on under the hood, it helps us understand it's not magic and solve problems that we may run into. And again, we get these kinds of subtleties here. That connections pane I showed you also works with DuckDB, so there's overlap in these technologies; they're not all different from each other, but rather fairly similar. And guess what? Working with databases, often the hardest part, the really most frustrating stuff, is not the R code you need to write, but rather dealing with auth.
Remote SSH with Positron
All right, so let's now move to the last step in this meme, which is dealing with remote SSH. This is a slightly different architecture, as you'll notice here. Now what's on your laptop is just the front end of Positron, and what is on the remote server is the back end. R itself is on that remote server, and so are your files, whether they're Parquet files or some connection to a database on yet another server. The separation between what's local and what's remote is now between the front end and the back end of the IDE itself. This is something RStudio doesn't support and cannot support, so it's kind of a unique thing you might choose to use Positron for.
This is what it looks like when you're working in remote SSH. So it looks very comfortable, very similar. But we get some sort of clues in the IDE about what is going on. Like here, you can click on the ports and see how all the ports are getting forwarded and whatnot.
So my tips for dealing with remote SSH sessions: this one is a little different. The terminals that you have in the IDE, your files, your extensions, they're all on the remote server. They're not local at all; R itself is on the remote server. There are some nice features around how you set this up. If you set this shutdown timeout quite long, you can actually close Positron locally, you can shut your computer, and your long-running R process will keep going on the remote server.
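Connecting over SSH typically leans on a standard SSH config entry like the sketch below; the host alias, hostname, username, and key path are all placeholders. With an entry like this in place, an SSH-based remote connection can refer to the server by its short alias:

```
# Example ~/.ssh/config entry (all values are placeholders)
Host analytics-server
    HostName server.example.com
    User your-username
    IdentityFile ~/.ssh/id_ed25519
```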
You can, if you're interested, use all of that infrastructure entirely locally, actually, which you might do if you have high reproducibility needs. This does not solve problems around memory and performance; actually, everything gets slightly worse, because you're running it inside a mini computer on your laptop. But you might choose to do that for other reasons. And I bet you know what I'm going to say next: often when you're working with remote sessions, again, the hardest thing is auth. It's not that you have to deal with different code; it's some of the details that go along with that.
And I will drop some links here. Another option in this list of ways to work is a truly server-type environment, where you log in via a browser and connect to a server that's running Positron; you can do that via Workbench. And with that, I will say thank you very much.


