
Visualizing Data Analysis Pipelines with Pandas Tutor and Tidy Data Tutor - posit::conf(2023)
Presented by Sean Kross

The data frame is a fundamental data structure for data scientists using Python and R. Pandas and the tidyverse are designed to center building pipelines for the transformation of data frames. However, within these pipelines it is not always clear how each operation is changing the underlying data frame. To explain each step in a pipeline, data science instructors resort to hand-drawing diagrams to illustrate the semantics of operations such as filtering, sorting, and grouping. In this talk, I will introduce Pandas Tutor and Tidy Data Tutor, step-by-step visual representation engines for data frame transformations. Both tools illustrate the row, column, and cell-wise relationships between an operation's input and output data frames.

Presented at Posit Conference, Sept 19-20, 2023. Learn more at posit.co/conference.

Talk Track: Teaching data science. Session Code: TALK-1096
Transcript
This transcript was generated automatically and may contain errors.
We have Sean Kross, who's going to be talking about Visualizing Data Analysis Pipelines with Pandas Tutor and Tidy Data Tutor.
Good afternoon, everyone. So this is a real organic screenshot of my desktop by day and a similar screenshot by night. You could easily say that code dominates my work life. And I believe that code is still important. I think that's part of the reason why we're here at posit.conf.
We're not at tableau.conf, we're not at low-code, no-code.conf. Those things are certainly important and no negative judgment on my part, but I think that there's still a really important thing about building lasting visual infrastructure with code.
But there are some issues with code that we just can't get around. Code can be intimidating. You hear about you're inheriting a new code base and you're like, oh my gosh, I can't believe I have to get into this. You get into that code and it can be opaque. You can read the code several times and you don't even really understand what's going on inside the code itself. And it can be overwhelming. You've got like a ton of code files that you need to integrate into a process that you're in charge of and it's just a lot to handle.
And to me, these ideas about what can be wrong with code sound a lot like data in many circumstances, right? What is different with this audience, though, is that most of us have a data science background. We've got all of this knowledge about tools that we can use to dig into data and to overcome some of those challenges that I've discussed, challenges that data and code share.
So a question that I really want you to ask yourself is, what could you build if you could apply your favorite data tools to your code?
Code as data
This is John McCarthy. He invented a lot of the fundamentals of computer science, including the programming language Lisp. And he also had this great idea that unfortunately I think has kind of been forgotten, which is this idea of code as data or code is data. And what I mean by this, this phrase code is data really refers to the idea that source code written in a programming language can be manipulated as data.
And I don't know about you, but that sounds really exciting to me, right? Because we've got all these incredible data transformation and manipulation tools that we're all very familiar with. We've been learning about them for the last day at least, including yesterday and this weekend. Wouldn't it be great if we could apply some of these tools to code?
And in R especially, this is kind of going on in the background in ways that we're not even aware of. So take, for example, the single line of code `plot(1:10)`, which makes a very simple plot. But how does R know that the way we specified the integers from 1 to 10 is with the expression `1:10`? There are lots of different ways you could write that in R. Yet R knows that's exactly what we wrote in the code, and it actually puts it into the y-axis label on this graph, which is kind of funny.
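The same "code is data" idea is available in Python through the standard-library `ast` module, which Pandas Tutor-style tools can build on. As a minimal sketch (the expression being parsed is just an illustrative stand-in), a line of source code can be turned into a syntax tree and inspected like any other data structure:

```python
import ast

# Parse a line of source code into a syntax tree: code becomes data.
source = "plot(range(1, 11))"
tree = ast.parse(source, mode="eval")

# The call and its argument are now ordinary objects we can walk and inspect.
call = tree.body
print(type(call).__name__)        # Call
print(ast.unparse(call.args[0]))  # range(1, 11)
```

This mirrors what R does with `plot(1:10)`: the tool can recover the exact expression the author wrote, not just the value it evaluates to.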
So the two things I want to discuss for the rest of this talk is I want to show an example of what can be built with this idea of code as data. And then I want to point you on some directions of how you can build with this concept of code as data.
Data analysis pipelines at Fred Hutch
So I work in the data science lab at the Fred Hutch Cancer Center in Seattle, Washington. And I am extremely privileged, because I get to work with these incredible biomedical data science practitioners and these incredible health care providers. And what a lot of them do is that they spend half of their time in the lab at the bench, or they spend half their time in the clinic. And they spend the other half of their time kind of doing this, right? They're looking at their computer. And they're looking at data. They're transforming data.
And a lot of the main data artifact that they're dealing with all the time is with data frames, data tables that we're familiar with. And the way that they choose to manipulate those data frames is through writing code. There's a lot of R written at Fred Hutch and a lot of Python written at Fred Hutch. And these are just kinds of the tools of the trade in biomedical data science.
And an artifact within this code that I'm sure many of us are familiar with is the data analysis pipeline. So we want to break down one of these data analysis pipelines a little bit. The first speaker mentioned penguins. I'm going to talk about penguins now.
So the number one is pointing to this variable penguins, which represents and contains a data frame about measurements of penguins taken during a scientific study. Number two is a pipe operator. And what the pipe operator does is that it takes what's on the left-hand side of the pipe operator, in this case, the data frame penguins, and it makes it the first argument to the function on the right-hand side of the pipe operator, which in this case is select. So what's happening here is that the select function is picking out these two individual columns from the penguins data set, the species of the penguin and the bill length in millimeters of the penguins.
And what's nice about this kind of data analysis pipeline framework is that as long as the function on the right-hand side produces a data frame, you can chain these functions together. It's like a step-by-step recipe of the data manipulations happening to this data set. We're going to select these two columns. We're going to group the rows by the species of the penguin. We're going to arrange the rows in descending order according to bill length within each individual group. And then we're going to take the top row in each group. That, essentially, is what these five lines of code say.
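Pandas Tutor visualizes the same kind of pipeline in Python. As a rough pandas sketch of the five steps just described (the tiny data frame and its column names are illustrative stand-ins for the penguins data set), it might look like this:

```python
import pandas as pd

# A tiny stand-in for the penguins data frame (assumed column names).
penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.1, 40.3, 47.5, 50.0],
})

result = (
    penguins[["species", "bill_length_mm"]]          # select two columns
    .sort_values("bill_length_mm", ascending=False)  # arrange in descending order
    .groupby("species")                              # group by species
    .head(1)                                         # take the top row per group
)
```

Because each step returns a data frame (or a grouped view of one), the chain reads top to bottom as the same recipe as the R pipeline, which is exactly the structure these visualization tools exploit.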
And I'm explaining these data analysis pipelines to people relatively often, either people who I'm onboarding them to some data analysis code because they're going to be changing it or they want to adapt it for their own purposes. Or I also do some teaching in different contexts at Fred Hutch. And so if I want to explain, for example, this filter function, I might draw a diagram like this and say, OK, on the left-hand side of the pipe, this is the status of the data frame. Then we're going to cross all of these rows out because they're going to get filtered out. And then the rows that are remaining move up to the top of the data frame. And it's the same kind of situation with the summarize example on the right-hand side.
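The two operations drawn in those diagrams can be sketched in pandas as well (again with a small made-up data frame; `filter` corresponds to boolean row selection, and `summarize` to a grouped aggregation):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie"],
    "bill_length_mm": [39.1, 47.5, 36.0],
})

# filter: keep only the rows matching a condition;
# the other rows are "crossed out" and the survivors move up.
kept = df[df["bill_length_mm"] > 38]

# summarize: collapse each group down to a single summary row.
summary = df.groupby("species", as_index=False)["bill_length_mm"].mean()
```

The diagrams make visible what these one-liners hide: filtering drops rows in place, while summarizing changes the shape of the data frame entirely.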
So after drawing enough of these diagrams, I thought, well, maybe I could automate this process a little bit. So I, along with colleagues of mine, Sam Lau and Philip Guo, built tidydatatutor.com and also similarly pandastutor.com, which automates the creation of these diagrams.
Tidy Data Tutor in action
So for example, if we start with the code that I was discussing before, the first thing that gets illustrated by Tidy Data Tutor is the select operation. As you can see, we're picking out these two individual columns, and the arrows show how the columns move from the data frame on the left-hand side of the pipe to the one on the right-hand side. Then we're going to group by the species. So the species column is highlighted on the left-hand side, and on the right-hand side each row gets an individual color according to which species group it belongs to.
Then we're arranging within each group in descending order. So the rows get rearranged, and the arrows show where each individual row moves in that arrangement operation. And then finally, we're taking the top row off of each group and returning the maximum bill length for each species in the final data frame.
So this project has been pretty popular so far. Like I mentioned, there's also pandastutor.com. And a lot of the ideas in Tidy Data Tutor and Pandas Tutor were inspired by our colleague, Philip Guo, who built pythontutor.com.
How to start building with code as data
So those projects are just kind of like one example of what can be built with this idea of code as data. I now want to talk just a little bit and kind of point you all in the direction of how maybe you could start building things with this idea. And I really hope that this will facilitate a conversation and that we can talk after this session.
The first thing I want to mention is a paper which describes how a couple of packages built by myself, Lucy D'Agostino McGowan, and Jeff Leek work together. That paper is called Tools for Analyzing R Code the Tidy Way. Secondly, I really encourage everyone to read the metaprogramming section, a few chapters of Advanced R, which is available online for free.
And the other thing that I encourage everyone to do, in terms of an entry point into what packages can help you treat your code as data, is to mess around with the rlang package. It's a great package to explore. And when I really got into it, it changed how I see both the code that I write and other R code that I have to approach through my work.
Yeah, so that's really it. I really appreciate your attention, and thank you so much.
Q&A
So no questions in Slido right now, but are there any questions from the audience that you immediately had?
Hello, so you introduced this as a useful tool for new people who might be taking over the code base. Has this been helpful in explaining what you do to non-data people?
That's a great question. So the answer is absolutely yes. I've used it before in meetings, both with people who have a coding background and people who don't, really to just explain how certain decisions were made during a data analysis. And also to give them another artifact, both to verify that the logic that I'm using during a data analysis process is a logic that they agree with. And it also is an artifact that helps them really have something to point to and gives us something much deeper to discuss in terms of really the decisions that are made as we're doing an ongoing iterative data analysis together.
I have a little addition on the code that I've made. What I usually see when these things are being explained, like the pipe operator applied across multiple processes stacked on top of each other, is that they're explained with tables, just like you'd explain how a left join or a right join or an inner join works.

And I think it might work even better for non-data people if real visualizations were used at each step: when you aggregate certain things, or when you apply these pipe operators across the several steps in the full stack of those five manipulations you did on the raw data of, I think, the penguins data set. Imagine a real scatterplot of just the raw data, and then the grouping, and then the filtering, and the aggregation. Because then a non-data person can really see what the unit of analysis is in your data later on, when you do the aggregation, versus what the unit of observation is, what the really raw data looks like.

I think that would perhaps be a nice addition for explaining to non-data people what these steps really mean. Because sometimes I get the impression that non-data people think, oh, the only thing these people do is visualize data, when you do so much more than that: the data visualization is usually just the last step in that stack of pre-processing before you can actually make sense of the data.
Yeah, the transformation of raw data into anything else is something I spend a lot of time thinking about. In fact, one of my major inspirations for that thinking is in this room right now. But illustrating that value for different stakeholders, I think, would be really interesting. The challenge, as someone else at this conference said, is that it's an Anna Karenina type of problem, where all tidy data is the same and all untidy data is different in its own way.

And so it's hard to generalize. What's nice about reasoning over code is that the logic in the code gives things a bounded structure, whereas if the raw data is messy enough, the structure might be incoherent. I'm sure we've all been down in that dark place before. But I appreciate it. Thank you.
