
Outgrowing Your Laptop with Positron (Austin Dickey, Posit) | posit::conf(2025)
Speaker(s): Austin Dickey

Abstract: Ever run out of memory or time when crunching data, making a visualization, or training a model? As computational demands and data sizes grow, many practitioners find their laptops behaving more like cranky toddlers than high-performance machines. In this talk, we'll demo how the Positron IDE helps you scale your development without losing your sanity or your data. You'll learn how Positron can integrate into different setups and see how lazy-evaluated libraries like Ibis can manage data too big for memory. Whether you're building AI models, running complex simulations, or working with large-scale datasets, you'll walk away with techniques for doing it better with Positron.

Materials: https://github.com/austin3dickey/ssh-demo
Transcript
This transcript was generated automatically and may contain errors.
Cool. For the last year and a half or so, I've been training a neural network. And her name is Joy. This is my daughter.
Joy is great. Over the last, you know, over the time that I've been raising her, I've learned a lot of things. And one of the things I've learned is that the toys, the clothes, the activities that work today, you cannot trust to work tomorrow. She grows out of things really quickly. But some of the things that I love the best are the things that grow up with her. Because that means I don't have to go out and buy a bunch of new stuff.
Here's that little blue ball that she's still playing with to this day. As much as I love to keep sharing baby photos, this talk, surprise, is about data. As your data grows and as your workflow size grows over time, and kind of outside of what your laptop can handle, you want a good tool that grows with you.
Because you don't want to be in the situation where none of your clothes fit, basically, and all of your options kind of seem insurmountable. You can't run anything on your laptop. So how do we solve that problem? That's what today's talk is about. And I have a tool that I know grows with you really well, and that's Positron, as you probably guessed.
Before I continue, I'm going to put a little QR code there in the bottom right. So if you want to scan it, you'll go to my GitHub repo with all these slides. It'll remain there the rest of the talk. It'll also have my demo code that I'll show off and links to further documentation and resources that you might want.
So if you take nothing else out of the talk today, take this extremely complicated flow chart, which I will talk through for the rest of the talk, obviously. What do you do in the situation where you are running out of memory or time when you're running your workflows on your laptop? I'm going to talk a little bit about modern data processing tooling, how that might help you. I'm going to talk a little bit about a feature that I work on that I'm really excited about called Positron Remote SSH, which now that I say it out loud, sounds really sci-fi. But it's not. It's regular.
Modern data processing tooling
So let's talk about modern data processing tooling. Basically, who is this for? I think this is for people like the computational biologists at my sister's wedding, who were assigned laptops by their organization that they call, quote, crappy, and who need to run their R scripts for weeks on end while making sure their laptop stays connected to a power source.
This is for people who don't have access to any other machines and are confined to their laptop. It might also be for people who maybe have access to other stuff, but they're working really quickly on quick local iterations or proof of concepts. This is kind of a different way of thinking how to do data processing. But I want to point out that these techniques are always useful, no matter your scenario. So I would definitely consider learning these kinds of things as you walk away from this talk.
So as an example, throughout this talk, I'm going to work with the NYC taxi dataset. Some of you might be familiar with this. It's a lot bigger than Iris or Penguins: it's basically all the trips that New York City taxis have taken since 2009. The copy that I'm using is hosted by Ursa Labs; it's a dataset in Parquet format that covers ten years of data, up through 2019, and lives in the cloud. It's about 36 gigabytes on disk, and if you were to read it all into pandas in Python, it would take about 400 gigabytes of RAM, and I don't know about you, but my MacBook does not have 400 gigabytes of RAM.
So what can you do in this scenario? I mean, luckily for you, there are a lot of tools that help with this. I put up some logos here on the slides. You've got DuckDB, Polars, Apache Spark, Dask. Python and R work with a lot of these. The purpose of this talk is not which one to choose, because often that's determined by whatever your organization is using or, you know, different pros and cons between these things, but instead I want to focus on their similarities.
So they typically use these strategies. The first one is deferred evaluation or lazy evaluation, just like these cats, and basically that means you don't query the data until the last possible second, and that's good because you don't want a bunch of intermediate data sets living in your laptop's memory and crashing your process, right?
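To make the lazy-evaluation idea concrete, here's a minimal, hypothetical sketch using plain Python generators (not any of the libraries above): the pipeline is built up first, and nothing is computed until a terminal step actually asks for results.

```python
# Hypothetical sketch of deferred (lazy) evaluation with generators:
# building the pipeline does no work; only consuming it does.

def where(rows, predicate):
    # Returns a generator; nothing is evaluated yet.
    return (row for row in rows if predicate(row))

# Tiny stand-in dataset; imagine billions of rows instead.
trips = [{"fare": 5.0}, {"fare": 120.0}, {"fare": 250.0}]

pipeline = where(iter(trips), lambda r: r["fare"] > 100)  # still no work done

# Only at the last possible second do we pull results through the pipeline,
# so no intermediate dataset ever sits fully in memory.
result = list(pipeline)
```

Libraries like Ibis, Polars, and Dask apply the same principle at the query-plan level, which also lets them optimize the whole plan before touching any data.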
We also have a strategy called predicate pushdown, along with column pruning, which means you only read the data that you need. It's really helpful if you have data in Parquet format that's partitioned. When you write a query, and again, this can be in Python, R, SQL, or anything, it only accesses the data it needs in order to execute that query. So if you have a Parquet file with a million rows or a million columns and you only need a few of those columns for the query you're doing, it only reads those columns, and in some cases, if your Parquet file is partitioned the right way, it only reads certain rows from that file as well.
And that's good because you're not wasting time reading a bunch of data that you don't need. And then finally, these libraries and tools split the work among multiple processors if possible. So if your laptop, or any computer you're working on, has multiple CPUs, or maybe even GPUs or vCPUs, it's going to split up that work, parallelize it, and make things faster. So that's great.
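Here's a toy sketch of the column-pruning side of this, using made-up data structures: the "file" stores data by column, and the reader materializes only the columns the query asks for. Real engines do the analogous thing against Parquet column chunks and row groups.

```python
# Toy sketch of column pruning: only requested columns are "read from disk".
# The column names here are invented to echo the taxi dataset.

columnar_file = {
    "passenger_count": [1, 2, 4],
    "total_amount": [50.0, 150.0, 220.0],
    "tip_amount": [5.0, 30.0, 40.0],
}

columns_read = []  # track what the reader actually touched

def read_columns(file, needed):
    # Materialize only the requested columns; skip everything else.
    out = {}
    for name in needed:
        columns_read.append(name)
        out[name] = file[name]
    return out

# This query needs two of the three columns, so only two are ever read.
data = read_columns(columnar_file, ["total_amount", "tip_amount"])
```

Predicate pushdown works the same way in the row direction: partition values and row-group statistics let the engine skip whole chunks whose rows cannot match the filter.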
Ibis demo in Positron
I'm going to go through a demo today in Python. I like the Ibis library; hopefully some of you have heard of it. I think of Ibis in Python as a universal adapter. It's a dataframe library whose verbs and syntax are very similar to dplyr in R. And basically you can connect to any of these backends.
And that's what I'm doing here at the top of this slide: a bunch of different ways to connect to different backends, right? I'm connecting to DuckDB, Polars, PySpark, Postgres. Once you have the connection details stored in your connection object, then you use that connection object to define a table that you work with, or a data frame. That's what I'm doing in the second part of the slide. I'm reading these Parquet files that live in the cloud. I'm saying, hey, go to Amazon S3, read the Ursa Labs taxi bucket, and look at all of those Parquet files' metadata. Call it NYC taxi. And there you go. You can manipulate these Parquet files just by referring to this table.
Let me actually jump into a demo here instead of just showing code on a slide. So here I am in Positron. I love dark mode; sorry, everyone else. I've got my file up here in the top left and my console in the bottom left, which we've walked through in the past couple of presentations. The Plots pane is bottom right, and the Variables pane is top right.
I know it's a little bit hard to read, so I'm going to read through what's happening here. I did prerecord this, but it is live. Let me start importing my Python libraries and setting up a connection. In this case, I'm connecting to a local in-memory database, because that's the easiest thing to do, in my opinion. And then I'm going to define that table, which reads the Parquet files in S3.
You see kind of a green bar there in my console, which means that code was executing, but now it's done. Again, all this is doing is reading the Parquet metadata, not the actual data. So now I can inspect this table object, again, this has been pretty quick so far for such a large dataset. I can see the different columns in this dataset and their column types. I can see that I have 22 of those columns, and then I'm actually going to execute a count against that dataset and see that we have over a billion and a half rows, which is kind of a lot. Again, if I were to read this in as a CSV, it would definitely crash my Python process. Yeah, that's pretty big, and that count query was actually really fast, too, because it's just reading the metadata of these Parquet files.
Next, I want to do an actual query that you might use for, I don't know, EDA or something. Let's say I have this question: I want to find the mean tip percentage grouped by the number of passengers in the taxi, but only look at the data where the total price was greater than $100, right? So this is the query that I would construct in Ibis. What you'll see, if you can read it, is that Ibis has a lot of verbs that are very dplyr-familiar, right? So I've got filter, select, mutate.
The filter that I'm using, this first one, I'm filtering the timestamp to January of 2009, and the reason I'm doing this is because this is my local laptop, and I'm reading data over the Internet, and I don't want it to take too long for this demo, right? So when I filter it to only January of 2009, what that means is when I actually execute the query, it only reads the Parquet file that contains that data for January 2009, and it's going to skip the rest of the Parquet files, which is, again, nice so that I don't have to wait for the query to execute over the Internet.
Next I'm going to filter to the total amount, greater than 100, the passenger count is greater than zero, just to exclude some outliers or bad data, right? This is just very typical stuff, right? I'll select some columns, I'll add a column, I'll group by passenger count, add some aggregated columns and mutate, again, add another column. So let's execute this query, and that was instantaneous, right? Because, again, this is lazy evaluation. It's not going to actually look at the data until I ask it to.
Finally, I execute my ggplot code, right? This is from plotnine, a very good plotting library in Python. This will take about 20 seconds to execute, because now it's finally querying the data. But honestly, that's not that long if you think about the time it would otherwise take to read the entire dataset, filter, and group by, right? I already got a plot here, and this is on my bad Internet at home. This is just Positron Desktop. So that's really cool.
All this stuff I've talked about so far would work in any IDE, right? So why in Positron? I have the hot take that Positron is the best IDE to learn in. At least that's how I'm developing it, right? So Positron has a bunch of features that will help you learn, right? So our help pane is going to help you learn new tools like Ibis faster. If you're new to Ibis as a Python programmer, you can go into the help pane and read all the documentation in your IDE for what's supposed to happen.
We've also got a Connections pane that will help you understand your databases better, right? I've got a screenshot of that on the bottom left, and you can see that I'm talking to my DuckDB database. I can see my NYC taxi data and all of the different columns that it has. And then finally, we've gone over the Data Explorer a lot; the next talk, by Wes, is about it. But that lets you understand your data better with sparklines, and you can explore that data. There are just a bunch of different integrations: I think Positron is built for data science, and all of these built-in integrations are going to make your experience a lot better even as you're working with big data like this.
Positron Remote SSH
So that's an overview of what I believe modern data processing tooling looks like. Next I want to talk about what I think is one of the most undervalued features of Positron, which is Remote SSH. So SSH stands for Secure Shell. If you haven't heard of it, it's a way to connect to a remote machine. This is for people who have access to another machine, right? This could be anything from a spare computer you've got lying around your house to a high-performance cluster that your organization has set up in the cloud. Or it's for people who don't have one today but could set one up and maintain it if they wanted to, right? So they might set up an EC2 instance on Amazon or a VM of some sort.
I've seen this a lot in scrappy data science teams. I was on one at a weather company in a previous job where we had, you know, terabytes of data to analyze because we're trying to predict the weather all over the globe. And you just can't read terabytes of data over the Internet all the time, in my opinion. So we had a VM set up that we mounted that data set to, which basically just meant that whenever it read that data, it didn't have to talk over the Internet. It was in the same data center, and the reads were so much faster because it looked like you were just reading from the same file system, right?
How does Remote SSH work? Let me point out that, by the way, this is completely free, right? It comes shipped with Positron Desktop. All you need to do is install Positron Desktop, and if you have access to this remote machine, this should just work, theoretically, as long as you can SSH into that remote machine. That's usually the hardest part: authentication. But once you have that figured out with your IT team, this should just work, right? So you go to the bottom left, you click Open a Remote Window, you enter your user, password, host name, all that information, and then you click go, basically, and this is what pops up, right?
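For reference, the connection details usually boil down to an entry in your local `~/.ssh/config`. Every value in this sketch is a placeholder; your actual host, user, and key come from your IT team or cloud provider.

```text
# Hypothetical ~/.ssh/config entry; all values below are placeholders.
Host beefy-vm
    HostName your-vm.example.com
    User your-username
    IdentityFile ~/.ssh/your-key.pem
```

Once plain `ssh beefy-vm` works in a terminal, that same host should be selectable from Positron's Open a Remote Window flow.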
So this looks a lot like the demo that I showed before, and that's the point, right? There are a few differences here that I've highlighted with the arrows, but basically all you see is that it highlights that you're using SSH mode, it's showing me the host name that I've connected to, but everything else is the same, and that's by design, right?
So here's the architecture of remote SSH, what it's doing, if you allow me to nerd out for a second, on your local machine is just the UI, right? Like the pretty pixels that show up. On the remote machine is where you do all of your compute, right? So not only are your Python session or your R session happening on the remote machine, but your file system that you're working with is there. The Positron back end itself is also there, so we're running that extension host that Jonathan mentioned on the remote machine, and that's great, because you can take full advantage of your more beefy machine's capabilities.
I have another demo. Let me show off what happens when I set up a beefy machine. So in this case, this is my local Positron desktop. I've connected to a machine before, so I can go here to recent and click on it. And here I go. I'm connecting. You can see it's setting up the host. This is the layout of the last time I connected to this host. So this is great. Again, I've got my file up.
On the left here is the file explorer, and what you might notice if you have 20-20 vision is that I have downloaded these parquet files to this remote machine, right? So it's in a folder called data. And that's great, again, because like I mentioned, it's good if you're not talking over the Internet and you have local access to these data parquet files. So here's all my months of parquet files.
So let me close that. The two things I've changed about this script for this demo are: number one, I'm reading the Parquet files from a local folder called data instead of from the cloud. And number two, I commented out the first filter that filters the dataset to January of 2009, which means we are going to load the entire dataset here instead of just that first month. And I think I can do that, because this machine is, like I said, beefier, and I'm reading the data locally.
So let me just go through this code here again. I'll kick off that ggplot code. This is, again, querying the data. You can see the progress bar querying all these Parquet files, and, wow, that was it. So here's the plot that pops up. Because this is loading the entire dataset instead of just that one month, the plot looks a little prettier. Honestly, it went faster, too. I love Remote SSH.
A few more benefits. Again, it's free to use, including at work. That's great. You don't have to set anything up on your remote machine; you don't have to set up Positron Pro or something like RStudio Server. This should just work. Also, you don't have to change any code if you don't want to. If you're using pandas, let's say, and the only thing that's limiting you on your laptop is memory, and you think this would work on a more powerful computer, you don't have to change your code. You can just run the same code on your larger machine, and it should work, right?
And then finally, this is a huge feature that I think a lot of people enjoy. You can disconnect and reconnect when running this code. I didn't demo that, because that's not a very interesting demo, but it's a big deal if you're the type of person who wants to kick off a job, go have coffee or go to bed and then come back the next day and reconnect and see the results of that job, right? This is something that VS Code doesn't have. I think a lot of people have been asking for it, but Positron, it's implemented in remote SSH, which is cool.
Posit Workbench
Finally, I want to give a shout-out to Posit Workbench. Posit Workbench is our enterprise solution. It's for people who need a lot more power and scaling, right? So this is going to be for people who want a managed solution to bring up this cluster. It's also for industries, like pharmaceuticals, for example, with high compliance and trust standards, where you need auditability and security, right? Posit Workbench comes with a lot of administrative controls over the code your data scientists are using, and over access to that data, too, right?
So this is also for teams looking for standardized development environments and data access. What does that mean? You know, again, it's a lot easier if all your data scientists can log on to the same box and immediately have access to the same environments, the same Python interpreters, the same packages, the same access to the same data, right? All of these administrative tools make it a lot easier. I think of it as SSH plus more, right? It's like SSH on steroids because it is still using the power of a remote cluster, right? But it's got additional features and customization that make it really good for these large teams, let's say, or even small teams who need these key features, like cluster management, security and auditing, and consistent development environments.
We are also working with partners like Snowflake to offer managed offerings of Posit Workbench. So you can just click a button, basically, and if you're in the Snowflake marketplace, it works, right? If you want to learn more about Posit Workbench, I don't have a lot of time left, so I will call out the Posit commercial products talk track, which is today at 1. There are also some specific examples of people using Workbench and Snowflake on enterprise data platforms tomorrow at 2:40. And if you want to talk to me about anything else I covered today, I'll be hanging out in the Positron lounge tomorrow morning with Sharon and Isabel.
I am on Discord. I am in the GitHub discussions. I saw my name up on Jonathan's keynote, and I got really happy. So if you are interested in that, please come see me afterwards. And I will see you at the aquarium tonight, which, by the way, is an experience that absolutely grows with you, no matter what age.
I have one quick question for you. When you remote SSH into another machine, is there an ability to choose how much memory and CPU you'd like to request?
By default, I believe it will just use whatever it has access to. Correct me if I'm wrong. I don't know exactly the answer to that, but I could get back to you. Yeah. Appreciate it. Thank you so much.

