Resources

Exploring Datasets in Positron (Wes McKinney, Posit) | posit::conf(2025)

Exploring Datasets in Positron
Speaker(s): Wes McKinney
Abstract: Inspecting raw data in data frames and tables can be a critical tool in the data preparation, tidying, and feature engineering process. In Positron, we made it a priority to design a modern Data Explorer component that works well for both large and small datasets. In this talk, I will discuss the design of the Data Explorer UI and its backends for Python, R, and DuckDB, and how we made it work smoothly with massive datasets having millions of rows or thousands of columns. Additionally, I will discuss the sorting, filtering, search, and statistical data visualization capabilities that we have added to help make users more productive.
posit::conf(2025)
Subscribe to posit::conf updates: https://posit.co/about/subscription-management/


Transcript

This transcript was generated automatically and may contain errors.

Okay, very good. Hi, I'm Wes McKinney. I'm a software architect here at Posit, and I'm here to do a bit of a deeper dive into the Data Explorer, which you've already seen featured in all of my colleagues' talks in this Positron session.

So I wanted to talk a little bit about how I got involved in the project, why I'm working on the Data Explorer for Positron, and how we've designed it for great performance and scalability, as well as to integrate into your development workflow as a natural tool that makes you more productive.

Most of you know me as the creator of the pandas project, and I've spent a lot of the last 20 years staring at datasets in the console and in other environments. I was really excited when the Jupyter notebook (it used to be called the IPython notebook) came out in 2011, and for many years I had a very console- and notebook-centric environment for working with datasets and developing open source libraries that process and manipulate them. My book, Python for Data Analysis, is about teaching people how to use these tools.

But one of the first things I did, after five years of working on pandas, was to start a visual analytics company called DataPad to help build visual environments for working with datasets. So this has been a long-time passion of mine. Later, we started the Ibis project to provide portability and adaptability between different backends, so that you can build data expressions once and then run them in many different places, and you just saw that featured prominently in Austin's talk.

Pain points in data viewing

But when we think about the data viewer components within IDEs, we often run into a number of pain points. One pain point is the tedium of jumping between writing code and inspecting the results: seeing the part of the dataset you're focused on. Maybe you're doing some data cleaning or data preparation and you want to see specifically the part of the dataset you're trying to fix. We want to make that easier.

We've all seen a data viewer break down on the scale limitations of large datasets. It works great when your dataset is small, but then suddenly you have a wild dataset with 50,000 columns and it completely grinds to a halt. Or the dataset has a billion rows; as long as you have a backend like DuckDB, it can handle a billion rows, no problem. So we should be able to interact with those types of datasets in our data viewer.

Having things become laggy or unresponsive is something we definitely don't want. Another problem, which is a little more subtle, is the low-information-density problem. This is definitely associated with those terminal-centric data inspection workflows, where you're just not seeing that much about the dataset in the terminal. The human mind is able to process a lot of information, and so in building the Data Explorer for Positron, we wanted to pack in a lot of visual information to leverage the power of human cognition: to recognize outliers and see issues in the dataset that you're trying to fix.

So we think that human cognition is underrated, and we want to help augment your ability to see issues with your data, or to find areas that you want to look at more closely, so that ultimately you can iterate faster in your data wrangling, whether you're writing the code yourself or working with Positron Assistant or another LLM to write it.

Design principles and inspirations

As we built all of this in Positron, we had inspirations from some other tools that we love. Of course, there's the data viewer, which has been a popular and long-standing tool in RStudio. There are some other really cool projects that I've admired from afar as well, like the Data Wrangler for VS Code. There are a ton of other projects we can learn from, and that fed into our design principles for the Data Explorer.

We wanted it to enable ephemeral exploration without disrupting your coding flow, and to be fast and responsive. It should work just as well whether you're in a Python session, an R session, or just clicking on a CSV or Parquet file. It should update in real time as you work: if your dataset changes, whether you manipulate the data frame or change the data on disk, it should update almost instantaneously. Easy to say, hard to implement in practice.

Pulling this off required a lot of custom work. In particular, we spent a lot of time deciding whether we could do this by cobbling together off-the-shelf components, and we ultimately came to the conclusion that we needed to build a custom virtual data grid, so that we could achieve that level of snappiness even with the most unwieldy datasets: really wide datasets, or wide and long datasets. We don't want any lagging or unresponsiveness.
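Virtual grids of this kind typically render only the rows visible in the viewport and fetch data on demand, so memory use stays flat no matter how large the table is. Here is a minimal sketch of that idea in Python; the class and method names are invented for illustration and are not Positron's actual API:

```python
import pandas as pd

class VirtualGridBackend:
    """Serves fixed-size windows of a large table, so the UI never
    materializes more rows than the viewport can display."""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def fetch_window(self, first_row: int, num_rows: int) -> pd.DataFrame:
        # Clamp the request to the table bounds and return only that slice.
        first_row = max(0, min(first_row, len(self.df)))
        return self.df.iloc[first_row:first_row + num_rows]

# A million-row frame: the grid only ever asks for ~50 rows at a time,
# which is why scrolling to the middle can feel instantaneous.
big = pd.DataFrame({"x": range(1_000_000)})
backend = VirtualGridBackend(big)
window = backend.fetch_window(500_000, 50)
```

The real implementation also has to cache windows and talk to Python, R, and DuckDB backends, but the windowed-fetch idea is the core of it.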

We've done a lot of optimization work to achieve good performance across the different environments where you're looking at data. We don't want the Data Explorer to weigh down your computer by holding a lot of unneeded memory, so it's very memory efficient, and we've done a lot of work to enable that live-update workflow, which I'll show you in the live demo portion.

Features of the Data Explorer

We want it to be something you can launch easily from wherever you are in Positron. You've already seen this in some of the other talks: if you have a data file in your workspace, you can click on it or visit it through the command palette and it will just open. If you have a data frame in the Variables pane, there's a little button you can click to open the Data Explorer. But if you're working in the console, we also want you to be able to open one from there.

Let's say you have an ad hoc expression. In Python, you can use the %view magic, and in R the capital-V View() function, which you can pipe into from a dplyr expression, for example, and open a Data Explorer just to look at the particular expression you're working on.
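The expression you hand to %view is just an ordinary data-frame expression. A small sketch in Python (the data frame and column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["cat", "dog", "cat"],
    "weight": [4.2, 9.1, 3.8],
})

# In the Positron console you would prefix the expression with the
# %view magic to open it in a Data Explorer tab:
#   %view df[df["species"] == "cat"]
# The expression itself is plain pandas and evaluates anywhere:
cats = df[df["species"] == "cat"]
```

In R, the equivalent would be piping a dplyr expression into View().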

So you've seen the Data Explorer; this is what it looks like. Let me go a little bit through its layers. First, the grid, a digital frontier (shout-out to my Tron: Legacy fans). Again, it's a custom-built, spreadsheet-like table UI. It provides instant scrolling to anywhere in the dataset, selection, and copy and paste, and soon, in the next major release of Positron, it will support column and row pinning, which I will show you. The summary pane gives you those at-a-glance statistical data summaries.

When you open the Data Explorer, you have histograms, or for categorical data you have value counts, so you can see the most frequently occurring values in each column. These can be expanded to show more detailed summary statistics. Soon there's a feature launching to sort and search within the columns: if you have a dataset with hundreds or thousands of columns and there are certain columns you want to narrow down to, there will soon be a filter bar that lets you find just those columns of interest in the dataset.
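The summary pane's statistics correspond to ordinary per-column aggregations. A rough sketch in pandas of the kinds of summaries described, using made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [1.0, 2.5, 2.5, 10.0],
    "city": ["NYC", "NYC", "LGA", "NYC"],
})

# Numeric columns: the at-a-glance statistics behind the histograms
# (count, mean, quartiles, min/max).
numeric_summary = df["amount"].describe()

# Categorical columns: the most frequently occurring values, as shown
# in the summary pane's value counts.
top_values = df["city"].value_counts()
```

Positron computes these in the session's backend (Python, R, or DuckDB), so the numbers reflect whatever engine holds the data.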

I'm keen to eventually have a click-and-drag filter capability within these sparklines. So stay tuned for that, and if you're interested, we can chat about how it should work on GitHub.

Filtering and sorting I'll show more of in the live demo, but we want to enable you to find the parts of the dataset you're after very quickly. Those filters show up in the filter bar, and you can sort columns in ascending or descending order by clicking on the column drop-down menu in the Data Explorer; you can sort by multiple columns. Just recently, Isabelle added a convert-to-code capability: if there's a particular view you're looking at, whether it's in Python, R, or from clicking on a file in the Explorer, you can get the exact code that would produce that view and copy and paste it into your code window, which is super useful.
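To give a feel for what convert-to-code hands back, a filtered and sorted view corresponds to code along these lines. This is only the general shape in pandas, with made-up data, not Positron's actual generated output:

```python
import pandas as pd

df = pd.DataFrame({
    "carrier": ["AA", "UA", "AA", "DL"],
    "delay": [12, None, 3, 25],
})

# A view with a "delay is not missing" filter, sorted descending on
# delay, corresponds to pandas code of roughly this shape:
view = df[df["delay"].notna()].sort_values("delay", ascending=False)
```

The point of the feature is that you can paste this back into your script, so an interactive exploration becomes reproducible code.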

Live demo

It's much more exciting to see a live demo, and I'm going to pray to the demo gods that there will not be too many problems. I've run into so many bugs in my career with live demos, so I will do my best.

So here we are in Positron. I'll make it a little bigger. First of all, in the file browser, if we have a CSV or Parquet file, we can just click on it. I click on a Parquet file, and it opens and populates. This uses DuckDB, which, amazingly, ships with Positron to provide this capability. When you sort from the column drop-down, it sorts and updates pretty much instantaneously, and you can jump to the middle of a sorted table and it's essentially instantaneous. Now to the summary pane.

We have tooltips on these histograms. For example, suppose you were interested in data that's missing in the departure time column. If I double-click here, I jump right to that column, and I'm going to add a filter: departure time is missing. Now we see just the rows where that value is missing, and all of the summary plots have updated to show statistics for just those rows. And suppose you wondered whether there's just one pesky origin; it turns out it's these New York airports where there happen to be missing departure times, so they're not reporting their data very accurately.
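The filter in this step is a simple missing-value predicate. A toy pandas stand-in for the flights table (column names assumed from the demo; the values are made up):

```python
import pandas as pd

# A tiny stand-in for the flights dataset from the demo.
flights = pd.DataFrame({
    "dep_time": [517.0, None, 542.0, None, None],
    "origin": ["EWR", "JFK", "LGA", "JFK", "EWR"],
})

# The "departure time is missing" filter from the demo:
missing = flights[flights["dep_time"].isna()]

# And the follow-up question: which origins account for the
# missing departure times?
origin_counts = missing["origin"].value_counts()
```

In the Data Explorer you express this by clicking rather than typing, and the summary plots recompute over the filtered rows automatically.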

A cool thing we can do is the column pinning functionality I just mentioned, which is a new feature. When you have a dataset with many columns, sometimes you want to see a column that's in the middle of the table and then jump somewhere else, so you can visually compare the data values in that column with a column in some other part of the dataset. The same is true for rows: if there's a wily row that you want to compare with other rows in the dataset, we can pin that row to the top of the Data Explorer and then jump elsewhere in the table. We hope you find that useful.

Very quickly, I wanted to show you a little of how the live-updating functionality works. Here I created a pandas data frame, which is populated in the Variables pane. I click to view the data table, which opens it there, and then I'll split right and collapse the Variables pane. It's a small dataset. I'll go back to the file where I'm editing the data frame, and I'll quickly add a new column of random values. I run that line, it adds the new column, and you can see that each time I run it, it's new random data and the view updates in real time. The same is true if you, say, clicked on a CSV file; let's open that CSV file as plain text.
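The column-adding step from the demo is an ordinary pandas assignment; each re-run generates fresh random values, and an open Data Explorer view of the frame refreshes to match. A minimal sketch, with made-up data frame contents:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"label": ["a", "b", "c"]})

# Re-running this line overwrites the column with new random values.
# With a Data Explorer tab open on `df`, the grid and summary pane
# update almost immediately after each run.
df["noise"] = np.random.default_rng().normal(size=len(df))
```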

I'll split right so we can see them side by side. If I change this value, dog, to bird, and save the file, it updates immediately in the table. I find it's very useful to have a Data Explorer open; you can even pop any of these tabs out into a new window. If you have a large monitor, like one of those ultra-wide monitors I have at home, it's nice to have the Data Explorer out as a separate window alongside my code editor and my console, so that I can be looking at the dataset while I'm writing my code and doing all of my data munging.

Closing remarks

So we put a lot of work into this. Of course, we built it partly for ourselves, but mostly for you, and we want to hear from you about how we can make this part of Positron even better than it is. We have tons of ideas about new capabilities to add to it. But in addition to using it, one of the best ways you can help us is by providing your feedback. We're on GitHub: start a GitHub discussion, or open a GitHub issue if you run into challenges. We just wanted to build the ideal tool that we always wanted, and we think we've made a lot of progress. We're excited to see where we can take it as an important part of Positron going forward. So thank you.

Q&A

Thank you so much. We have quite a few questions here. Does the Data Explorer work with Ibis?

There is an open pull request to add direct Data Explorer support for Ibis expressions. If you have an Ibis expression, it will show up in the Variables pane, but you cannot currently open it as a Data Explorer tab; you can execute it and then open the result. But as soon as we merge Isabelle's pull request, we will have direct Ibis support in the Data Explorer, which is great. Outstanding, thank you.

What are your thoughts about data exploration for other data types such as multidimensional arrays?

We've discussed adding support in the Data Explorer for non-tabular data. It would be tricky from a user interface standpoint, because if the data doesn't map neatly into the grid, we would need to develop some other type of UI layer that lets you drill down into it. Let's say you were looking at a large nested dictionary-type object, or a nested array, like a NumPy array with more than two dimensions, that type of thing. You can open R matrices in the Data Explorer, but for more dimensions than that, we haven't built support yet. If you have an idea of how you think that should work in the Data Explorer, we would be interested in hearing from you.

How well does the data viewer play with the dreaded list column?

I know that you can see list columns; that's one area we have not fully addressed. If you go on the Positron issue tracker, there are issues about improved UI for list types and for looking at complex values. For example, if you have a large value that does not fit in a data cell, you can hover over it and a tooltip will show you the whole value. For more complex objects, including list or array types (if you're using Polars, which has a built-in list type, it will format the list of values within the cell), we'd like to be able to drill into those cells and get a richer UI display, similar to how you can inspect a JSON object in the Chrome developer tools pane, if you've ever done that. We'd like to expose that type of visual display of complex values within the Data Explorer.
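As a stopgap for list columns today, you can flatten them yourself before viewing. A small pandas sketch with made-up data; explode() is standard pandas, not a Data Explorer feature:

```python
import pandas as pd

# A "dreaded" list column: each cell of `tags` holds a Python list.
df = pd.DataFrame({
    "id": [1, 2],
    "tags": [["red", "blue"], ["green"]],
})

# explode() turns each list element into its own row, producing an
# ordinary flat table that is easy to scan in any data viewer.
flat = df.explode("tags")
```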

Great, I think we have time for one more quick question. Are there capabilities to explore data from relational databases without having to load the tables into the environment?

We're planning to develop an integration between the Data Explorer and the Connections pane, so that if you're connected to a remote database, say a Postgres database or a DuckDB database, you will be able in the future to click on a table and get a Data Explorer pane. There are some subtleties: we want to give control to the user so that it does not execute too much unwanted computation, so you don't get a large bill from Snowflake or from your database provider because you clicked on the wrong table in the Connections pane and ran tons of queries. So probably some of the summary pane plots will be opt-in, or click-to-compute, so that we aren't spamming your Snowflake warehouse with too many potentially unwanted queries on a massive table.

I think this will go hand in hand with the SQL editing and development experience that we're planning to develop in Positron in the fullness of time: being able to develop, test, and run SQL queries directly within Positron, to see the outputs of those queries within the Data Explorer, and to see your data warehouse as a pane in the Connections pane, directly in Positron. Thank you so much.