Resources

779: The Tidyverse of Essential R Libraries and their Python Analogues — with Dr. Hadley Wickham

Tidyverse #RProgramming #RLibraries Tidyverse, ggplot2, and the secret to a tech company’s longevity: Hadley Wickham talks to @JonKrohnLearns about Posit’s rebrand, Tidyverse and why it needs to be in every data scientist’s toolkit, and why getting your hands dirty with open-source projects can be so lucrative for your career. This episode is brought to you by Intel and HPE Ezmeral Software (https://bit.ly/hpeintel). Interested in sponsoring a SuperDataScience Podcast episode? Visit https://passionfroot.me/superdatascience for sponsorship information. In this episode you will learn: • [00:00:00] Introduction • [00:02:55] All about the Tidyverse • [00:15:19] Hadley’s favorite R libraries • [00:28:39] The goal of Posit • [00:34:12] On bringing multiple programming languages together • [00:50:19] The principles for a long-lasting tech company • [00:53:34] How Hadley developed ggplot2 • [01:03:52] How to contribute to the open-source community Additional materials: https://www.superdatascience.com/779

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

Do you personally encourage teams to be multilingual? Or do you often write in multiple different languages, maybe even within the same project?

I think most people tend to be like, you know, like 90% R or 90% Python. But in general, it's just better to be pragmatic. And I think that's one area where I think generative AI is like really interesting. It's really easy to translate between them. Maybe this means the barriers between them are going to erode a little bit more.

If we were going to have like a metric that me and my team are going to try and optimize, like I think the metric that I want to optimize is like the amount of time you spend in like flow states.

Dr. Hadley, welcome to the Super Data Science Podcast. It is a surreal experience for me to have you here. I have seen you in person, in fact, we have actually seen each other in person, I'm sure of this. So let me tell you these stories. I didn't tell you this before we started recording, so you're just getting this on air.

Circa 2014, in New York at an O'Reilly Strata + Hadoop World conference, you did some kind of hands-on training, it might have been a half-day training, using the airline flight times dataset. Does that all track? Yeah. I was in the audience for that. And then a couple of years later, in 2016, at the Joint Statistical Meetings in Chicago,

there was an announcement from RStudio that Hadley Wickham would be at the RStudio booth during a certain window of time. And I walked by a couple of times. And we made eye contact and you gave a friendly smile. But I was too nervous to talk to you. I didn't know what to say.

You were at that time, and still are today, one of the most iconic people in data science to me. And I was just like, well, you know, what do I do? What do I say? How do I introduce myself? And so now I finally know what to say. I've got a question to ask you. Hadley, welcome to the Super Data Science Podcast. Where in the world are you calling in from? I'm calling in from Houston, Texas. Nice. It is truly such an honor to have you on the show. You were on the show in the past.

So four years ago, you were on the program. And that was specifically episode number 337. But at that time, our host was Kirill Eremenko. And the timestamp on this is pretty interesting, because that episode was published in February 2020. So it was a pre-pandemic world.

A very different, different time. Exactly. So all of that has passed and we're almost back to normal, except that so many data scientists are working from home. All right. So straight into the technical content. Hadley, if there's one word that is most associated with you, it's got to be tidy. In 2014, you wrote a highly cited paper called Tidy Data. And you're also an author of the popular tidyverse, a collection of packages that share a high level design philosophy.

Last but not least, you have been writing a book called Tidy Data Principles, which is to be completed next year in 2025. What does tidy mean in the context of data programming? And yeah, what's the guiding principle? That's definitely not a question that ChatGPT would have asked you.

What "tidy" means

Tidy to me is about having things that are kind of like well organized and like well broken down into kind of like little pieces that you can then like reassemble, like Legos. I think that's been a motivation for a lot of my work is like, how do you take some like big, maybe kind of vaguely ill-defined problem and then break it down into like concrete pieces that you can actually get stuck into?

And experiment with and play around with and iterate towards a final solution. All right. And yeah, so in your tidy data paper, you draw parallels between tidy data and the principles of relational databases, specifically Codd's relational algebra. What is Codd's relational algebra? Could you elaborate on how database design can benefit statisticians and data analysts in their work?

Yeah, so you can go and look up on Wikipedia or somewhere what Codd's relational algebra actually is, but I can never remember it.

Yeah, it's one of those kind of very precise definitions where every word makes sense individually, but when you string them together in a sentence, it's very hard to understand what it means.

And I sort of got into relational data because my dad had done a lot of database design for capturing data about cows in particular, for cattle breeding.

And this idea of making sure that each unique fact is recorded once in a dataset, rather than having it either split across multiple places or recorded in multiple different ways in different places.

And so I think the ideas of Codd's relational algebra are really important. Like, you want to make sure you don't have inconsistencies in your data, but it's really difficult for folks who are not trained in databases and computer science to get the idea of the algebra.
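The "each fact recorded once" idea is database normalization in miniature. A hedged sketch in pandas, with made-up cattle data in the spirit of the anecdote (none of these column names come from the episode):

```python
import pandas as pd

# Denormalized: the breed's origin is repeated on every row,
# so a typo in one row silently creates an inconsistency.
flat = pd.DataFrame({
    "cow_id": [1, 2, 3],
    "breed": ["Angus", "Angus", "Jersey"],
    "breed_origin": ["Scotland", "Scotland", "Jersey Isle"],
    "weight_kg": [540, 560, 410],
})

# Normalized: each fact about a breed lives in exactly one place.
cows = flat[["cow_id", "breed", "weight_kg"]]
breeds = flat[["breed", "breed_origin"]].drop_duplicates()

# A join recovers the original rectangle on demand.
rebuilt = cows.merge(breeds, on="breed")
print(rebuilt)
```

To update a breed's origin you now edit one row of `breeds`, not every matching row of a big flat table.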

And so a lot of the idea of tidy data was, how can I frame this in a way that makes more sense to statisticians and data scientists and people working with data? And to me, it's just: you've got a rectangle, and that's all tidy data really is.

You put the variables in the columns and you put the observations in the rows. And you kind of wonder, like, that legitimately took me like eight years to figure out. It seems so simple in hindsight, but it's just one of those things that once you figure it out and explain it to other people, it makes a lot of sense, but it takes a while to get there.

It takes a while. Even for me, my first time using the tidy data principles, which must have been many years ago now, probably a decade ago or more, that first time wrapping your head around shaping the data in this tidy way, it is so different from the way that we're typically taught in university.

And so that first time doing things in a tidy way, exactly as you described, where each piece of information is only recorded once, as opposed to having a giant table where, let's say you have a binary outcome, every single row repeats the outcome label zero or one.

It's so wasteful, especially if that's a string. It's incredibly inefficient. And so popping over to the tidy principles, where that goes away, is transformative. But I remember the first time trying to melt data.

And I'm like, what? Even when I saw it the first time, it took me a while to figure out, and I was kind of like, this isn't right. I felt like I needed to change it back into the way that I'm used to, to even work with it.
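For readers who haven't seen it, the reshaping Jon describes, melting a wide table into a tidy one with variables in columns and observations in rows, looks like this in pandas (toy data made up for illustration):

```python
import pandas as pd

# Wide: one row per country, one column per year.
wide = pd.DataFrame({
    "country": ["NZ", "US"],
    "2019": [4.9, 328.2],
    "2020": [5.0, 331.5],
})

# Tidy: every column is a variable (country, year, population)
# and every row is one observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="population")
print(tidy)
```

The tidyverse equivalent is `tidyr::pivot_longer()` (formerly `gather()`, and before that `reshape2::melt()`), which is the lineage behind the word "melt" here.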

Yeah, it's really interesting. At the moment, I'm sort of on the program committee for posit::conf. So we decide on what talks we're going to have and how we're going to arrange them. And the program committee is mostly data scientists.

And so we do a lot of organizing data in Google Sheets, because it's so easy to collaboratively edit, but it's always in a tidy format.

And what that typically means is you can't really understand the shape of the whole program without joining three different things together.

And it's interesting, when I share it with my colleagues in marketing who have to then turn this into the website, the way they want to put that in a Google Sheet is just so totally different. There are merged cells, there are colors. I get it, but you could not compute on data in that form. It's just so much easier to look at as a human.

When you're in a data science company like Posit, are you able to kind of dictate, you know, marketing, you've got to do things our way, the data scientist way? Or is that an impossible ask?

It's impossible. I mean, I don't think I could dictate it. I could certainly do more to try and persuade people, or help them do their jobs better, but it's a lot of work, and you're fighting

the fact that none of their tools think about data in this way. And I don't really want to have to go and create the tidyverse for marketing and the tidyverse for finance and every other field.

I think we certainly have more penetration of Quarto and those sorts of document-generation tools, but there's still quite a high bar: if you're not used to using Git and GitHub, collaborating in this way is pretty tough.

Even though that final product can be pretty nice. And that's one of the things I sort of hope for the future of Quarto: these tools for scientific documents that let you mingle text and code, but that can also work like Google Docs, where multiple people can be contributing to them.

You can comment on them. You can share them with non-technical folks.

Yeah. Quarto is something that we also have as a whole topic area later for discussion. It is a great tool that I think anybody can be using within data science or amongst data analysts, particularly.

Have there been persistent challenges that you've faced, like technical hurdles, adoption resistance, or conceptual misunderstandings? Has there been any of that over the years?

With tidy data in particular, sorry.

Tidy data. Not too much. I think there's definitely some areas where there's just such strong conventions for the field of having things that I would think want to be in a single column spread across multiple columns.

There are some types of data where arranging things in non-tidy forms is just much, much more memory efficient.

But by and large, I think most people have looked at the tidy data framework, and even if they don't agree with all the tools or don't use them, they'll still find that framing really, really useful.

And I think that thought of, let's separate that out into a separate step, rather than trying to make every other tool do a little bit of tidying, just makes life so, so much easier.

100%. And for people out there who haven't had the tidy experience, it is absolutely worth wrapping your brain around it. Because once you do get used to it, everything becomes so much easier.

And all of the tools in your tidyverse work so seamlessly together. It's like, I don't know, when I was a kid, you'd have some computer video games that were just garbage and buggy.

And you'd constantly, you know, be able to walk through walls or whatever. And then, for me, it was like having a Super Nintendo for the first time. Nothing ever crashes and everything just works.

And the tidyverse is kind of like that: you get the data formatted in that way, and it's just smooth sailing.

Flow states and the tidyverse goal

Yeah. I mean, one of the things we've sort of wondered about is, if we were going to have a metric that me and my team are going to try and optimize, I think the metric that I want to optimize is the amount of time you spend in flow states, where you're just thinking of stuff you want to do with data and the code just kind of flows out of your fingers.

And it all just works. And, you know, we're certainly still some distance from that perfect utopia. But over time, it really feels like the amount of time you can stay in that flow state has grown: staying in the questions you actually care about asking of the data, not, how do I get this thing into this other function so I can actually just do what I want to do.

On that note, in a recent article about strategies and function design, you have discussed the importance of making strategic choices explicit to users.

So hopefully this is ringing a bell.

Do you think that this concept of providing a clear presentation of strategies in the way that you're designing a tool or function actually enhances the user's experience, maybe allowing them to be in that flow state for more of the time that they're programming, because they understand the thinking behind the strategic choices in the tools?

Yeah, I think so. And framing that even a little more broadly, one of the things that me and my teams think and talk about quite a lot is: how much do we want to force you to learn some new concept, one that might really be a better mental model that's going to help you in the long run?

Because until you learn what that thing is, the code is going to be a bit of a mystery for you. And there's that balance. Sometimes we want you to learn some new ideas, like this idea of tidy data.

There's pretty clearly a big payoff to getting that concept into your head. Versus other times, are we just teaching you some kind of technical jargon that's really useful for us, but maybe is just more junk to fill your brain up with?

So that's one of the things we think about a lot: how much do we want to accommodate your existing mental model, versus how much do we want to give you a new and better mental model, possibly kind of against your will?

Favorite libraries

Amongst all of the libraries that you've developed in the tidyverse, so there's things like reshape and plyr for shaping the data and being able to have pipelines of data within this tidy framework.

Is there any particular library that is near and dear to your heart that you feel like this was one that either I don't know, conceptually developing it like you must kind of have favorites, right?

Like they're not equal to you, right?

Yeah, I have to say one of my favorites is dbplyr, which allows you to write R code, dplyr code, and then translates it automatically to SQL.

And one of the things I kind of love about it is this sort of combination of like this really kind of deep technical knowledge of R that you need to make this work in terms of like translating the R code.

And there's also kind of like there's no way to do it perfectly like you cannot translate perfectly every piece of R code to equivalent SQL.

There's also this kind of like how do I carve out like the biggest benefit with the smallest amount of work and I think the combination of those two things is something I really enjoy.
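The core trick Hadley describes, capturing high-level data verbs and emitting SQL instead of executing them, can be sketched in a few lines. This is a toy illustration of the translation concept only, not dbplyr's actual machinery, and all names here are made up:

```python
# Toy "verbs to SQL" translator: each verb records an intent,
# and the final step assembles a single SQL string.
class Query:
    def __init__(self, table):
        self.table = table
        self.filters = []
        self.columns = ["*"]

    def select(self, *cols):
        self.columns = list(cols)
        return self  # returning self lets verbs chain like a pipe

    def filter(self, condition):
        self.filters.append(condition)
        return self

    def to_sql(self):
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql

q = Query("flights").select("carrier", "dep_delay").filter("dep_delay > 60")
print(q.to_sql())  # SELECT carrier, dep_delay FROM flights WHERE dep_delay > 60
```

The hard part dbplyr actually solves, translating arbitrary R expressions into each database's SQL dialect, is exactly the "can't be done perfectly" problem discussed above.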

Kind of related to that, there are bits of the testthat package, which is a package for doing unit testing in R.

And one of the things I worked on a couple of years ago was this package called waldo, which is all about concisely describing the difference between two objects.

And that's kind of a similar problem. You've got this deep technical understanding of the language and all the objects, and you're writing C code to iterate through them.

And then how do you like present that to the user in like a way that helps them see the differences as easily as possible.

So that kind of tension there, between programming and human psychology, I just find really interesting and fun to explore.
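The "concisely describe the difference" problem waldo solves has a small stdlib cousin in Python's difflib. A hedged sketch of the same goal, reporting only what changed rather than dumping both objects in full:

```python
import difflib

old = ["name: Hadley", "role: Chief Scientist", "language: R"]
new = ["name: Hadley", "role: Chief Scientist", "language: R and Python"]

# unified_diff reports only the changed lines plus minimal context,
# the same spirit as waldo's concise object diffs in R.
diff = [line for line in difflib.unified_diff(old, new, lineterm="")
        if line.startswith(("-", "+")) and not line.startswith(("---", "+++"))]
print(diff)
```

A real object differ, like waldo, has to recurse through nested structures and decide how much context a human needs, which is where the psychology comes in.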

Yeah, it must be interesting to think yourself. Well, there's clearly this need to be addressed and it's a it's an impossible problem to solve perfectly.

So how can we get most of the way there in a way that will satisfy most people?

And those are two libraries that I definitely need to spend more time with. I don't think I've used dbplyr or testthat, actually.

dbplyr is also just one of those things where I'm like, it seems like a miracle that it works so well.

Like, I think the thing that's fascinating to me is that it really reveals that for the core stuff you do to vectors in data science, whether you're summarizing them or filtering them,

it's basically the same code in R or Python or SQL or JavaScript. You can express pretty much the same things in every single language, in a way that is surprising and interesting.

Shiny and reactive programming

Beyond the libraries that we just mentioned in the tidyverse, one favorite of mine is Shiny, because it allows you to so rapidly build interactive web applications for data analysis, especially compared to any other web development framework that I've tried.

And I don't have much experience developing any kind of web tools, but I can very easily use Shiny to get a web application up and running for people to have a self-service dashboard that they can click around in.

Do you want to talk a bit about Shiny? And actually, I think something notable about it, and we're going to talk about Posit soon and the change from RStudio, is that it now works across programming languages.

Yeah, exactly. So there's now Shiny for R and Shiny for Python. And, you know, they're completely separate code bases. But the idea that really unifies them is this idea of reactive programming.

And the idea of reactive programming, I think, at its heart is pretty simple. You've got like a bunch of inputs to your app, things that people can change. And you've got a bunch of outputs.

And what reactive programming does is just like automatically figures out what's the minimal amount of work to do when you change one of the inputs to update the needed outputs.

And again, that's one of these ideas, like tidy data, that takes a little while to get your head around. It's quite possibly an idea you've never encountered before in programming.

It works a little bit differently to things you might have encountered. But like once you get that idea, it just gives you this incredible tool set to create apps that like where things just work.

And you don't have to worry about things either updating too often, doing a bunch of needless work and making your app too slow, or failing to update,

so that you've got these mysterious bugs in your app where things don't change when you expect them to, which is like one of the most frustrating things to try and debug, when something doesn't happen.
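A minimal sketch of the reactive idea (hypothetical names, not Shiny's actual API): outputs declare which inputs they depend on, and changing an input recomputes only the outputs that need it:

```python
class Input:
    def __init__(self, value):
        self._value = value
        self._listeners = []  # outputs that depend on this input

    def set(self, value):
        self._value = value
        for output in self._listeners:  # invalidate only the dependents
            output.recompute()

    def get(self):
        return self._value


class Output:
    def __init__(self, compute, inputs):
        self.compute = compute
        self.recomputes = 0
        for inp in inputs:
            inp._listeners.append(self)
        self.recompute()

    def recompute(self):
        self.value = self.compute()
        self.recomputes += 1


n = Input(10)
label = Input("rows")

# This output depends only on n; changing `label` never touches it.
summary = Output(lambda: n.get() * 2, inputs=[n])

n.set(21)         # summary recomputes, value is now 42
label.set("obs")  # no needless recompute of summary
```

Real Shiny discovers the dependency graph automatically while a reactive expression runs, rather than taking an explicit `inputs` list, but the minimal-work property is the same.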

That that is fun. Yeah. So, yeah, shiny. Really, really, really cool.

It allows you to spin up basically that Super Nintendo game that I was just describing. It just kind of works like you think it should. People don't walk through walls accidentally as they're using the dashboard that you developed in literally minutes.

Yeah, it's funny. I remember talking to Joe Cheng, who wrote Shiny, very early on.

And I was like, Joe, you think people will use this to make websites? You can use Ruby for that, use PHP for that. Why on earth would a data scientist want to make a website?

And now it's like so obvious because you don't want to give decision makers in the organization just like a PDF. You want to give them like a little interactive app. And there's just been so many examples of people just like really impressing their bosses with shiny.

Because you can like whip up something in a couple of hours that looks like a polished app. It does exactly what you want. I remember a very early phone call from a shiny user saying like we saved him a quarter of a million dollars.

Because instead of going and like finding a like a contractor to implement a web app and a dashboard, he just did it himself over a weekend.

And not only is that a cost and time benefit, but if you as a data scientist can do it yourself, you don't have to try and communicate to someone else exactly what you want.

That is tough.

It is. Well, and this also allows you to make changes yourself. You know, if you notice an issue or a user complains to you, you can just go in and fix it, as opposed to needing a middleman, a middle person, I guess.

Yeah, I think one of the interesting things about dashboards is like if your dashboard is successful, like people are going to demand changes to it like very, very quickly.

But if you have a really, really good dashboard, that means like there's going to be like two or three execs in your company who now want to like make a bunch of tweaks to it.

And if that's like some weeks long process where you've got to figure it out and communicate to some like web engineering team that just like kills the whole thing.

And think about how often executives think they want a dashboard, relative to how often they actually use it. That is another strong point for using Shiny, because you're not wasting weeks or months developing a dashboard.

You're spending hours or days. I mean, just in general, the more you can do to increase your iteration speed, the more effective it makes you.

Because, again, like it's so hard to predict in advance, like what's the thing that's going to be valuable? There's definitely a lot to be said to just like trying out a ton of things and seeing what sticks rather than like doing a bunch of upfront planning and just hoping desperately that you've got a really good mental model of the world and your idea works.

Why use R? R vs Python

So we are going to talk about, as I already mentioned, the Posit name change, and we'll end up talking about Python a bit. For our listeners who don't already use R, why should they be using it?

For me, I can actually give one example, which is data visualization. I still find I can do things way more quickly, have much more fun making visualizations in R, and get exactly what I want.

There had been in the past attempts to create a ggplot2 style Python library, but the one that I had been using became deprecated and harder and harder to use.

It never had all the functionality of your ggplot2 anyway. Anyway, so that's like my big example. I don't know if you have big examples of why people might want to use R still today.

On the topic of ggplot2 specifically, I think the best Python equivalent is plotnine. That's actually by a developer, Hassan Kibirige, that we've been sponsoring at RStudio.

I think that's the best possible realization of ggplot2 you can get in Python.

But I think there's things about the design of the R language that just make certain tasks much easier and more natural to express in R code than you'll ever be able to do in Python.

I think that comes down to at the heart of it, R is more of a special purpose programming language. It's designed from the ground up to support statistics and data science.

I think that has a lot of benefits, particularly if you've never programmed before. I think you can get up and running in R using R to do data science. You can do that without learning a ton of programming. You can get up and running pretty quickly.

Then there's just things about the language that other languages look at R and they're like, oh my god, that's a terrible idea or that makes me want to throw up in my mouth.

There's just so many things that are so well placed to support interactive data science where you really want that fast and fluid cycle where you're trying things out.

That obviously leads to maybe a little bit of weakness when I'm like, now I've got this thing, and I just want to do the same thing again and again and again.

R tends to be a little bit magical. It tries to guess a little bit more of what you want and that's great when you're working interactively and it guesses correctly. It's not so great when you're working on a server somewhere else and it guesses the wrong thing.

Everything about R I think makes it such a fluid environment for really exploring your data, digging into it, figuring out what's going on.

Posit's rebrand and the R/Python ecosystem

Speaking of differences between R and Python, I seem to remember, and you can correct me if I'm wrong about this, but I feel like you have a famous tweet from years ago,

and it must have been a famous poster themselves that you responded to, I can't remember, it might have been someone like Wes McKinney, saying that one of the advantages of Python is that it's faster than R.

And then you have this super famous reply along the lines of, is that so? I will make it faster. Do you know what I'm talking about?

I don't, but I know I've heard things like that in the past.

Yeah, it's kind of a misperception because Python isn't actually that fast itself. I mean, languages like Julia have come up to be faster than Python.

Yeah, I think often you have the worst arguments with your family, not with strangers. With people who are so similar to you, you tend to have more friction than with people who are really different.

I think because R and Python are actually really close together in the spectrum of programming languages. It's so easy to see all of the little things that look weird to you as opposed to looking at some programming language that's miles away, and it just looks totally different.

I just think there's something to that. Because we're close, you can see all these little differences.

Certainly, when I see things in Python that people are like, wow, that's really cool. I'm like, challenge accepted. I will make that better in R.

Yeah, exactly. Let's dig into that a bit now. For 11 years, you've been the chief scientist at Posit, makers of open-source software for data science, scientific research, and technical communication.

Many R users will know Posit as the makers of RStudio, a full-featured integrated development environment (IDE) for R, which I myself have been using for as long as I can remember.

Basically, as long as I have been typing, I have been using RStudio.

RStudio, as you actually kind of let slip earlier in this episode when you were talking about Joe Cheng, I think. Oh, no, no, no. You were talking about plotnine and how you said RStudio is supporting it. Wait, no, Posit. Two years ago, the company name changed to Posit.

From a distance, I mean, I don't even think it's from a distance. I think this is explicitly related to how Posit is now supporting more than just R. Is that right?

Yeah, yeah. I mean, the goal of RStudio and now Posit has always been to be this kind of like a company with a long-term vision. Like we talk internally about this sort of idea of a hundred-year company.

And when you think about a company like that, obviously no programming language is going to be around in a hundred years' time. We started with R, and that's something that's near and dear to many of our hearts. It always will be.

But we also want it in the name of our company to like embrace that there are now other languages and there will be even more languages in the future.

And I kind of think about this as like the Burlington Coat Factory problem. I don't know if you know Burlington Coat Factory, but we have a lot of ads for them on television. But for a long time, they were like, no, it's like Burlington Coat Factory. It's not just coats.

And for us to go into customers and say like, buy RStudio. It's not just R. It's hard to make that story.

So I really, really wanted to say, hey, for a long time now, our products have supported not just R but Python and Julia and other tools. We don't want to lock ourselves into being R forever, regardless of what happens with the rest of the world.

So renaming to Posit, rebranding to Posit, was really about saying we're in this for the long haul. And we care about data scientists regardless of what tool they're using.

Piping in R and Python

One of my favorite things, which you can do really well thanks to the dplyr library that you led development of, is piping. You can extremely easily pass output along a chain of functions, just like Unix pipes, where the output from one function goes in as the input to the next function.

And prior to me discovering dplyr, which was probably around 2010. Does that make sense? Prior to that, I would have so many variables in my workspace. It was just such a pain to keep them all straight.

And you just end up in these weird situations where like should I be investing time thinking about the name of this intermediate variable? Am I going to use this later?

Or should I just name it like intermediate variable 15 and have really ugly code?

So piping gets rid of all that. The flow reads like a sentence: okay, this preprocessing step happens, then this next one, and you can see it so easily. It makes the code so elegant to read.

Do you think we'll get to a point where Python catches up? I have used some kinds of piping attempts in Python, though I guess it's been a few years since I've tried, but my experience is that it's never been as smooth or as easy as with R. And maybe that's related to what you were talking about earlier with data visualization.

Yeah. So the native equivalent of piping in Python is method chaining. You know, if you're using pandas, you do something dot something dot something.

But the big difference between like method chaining and the pipe is in method chaining, like all of those methods have to come from the same class. They have to live in the same library, the same package. Whereas with piping, they can come from any package.

And I think the thing that's really interesting is that this has meant Python has tended to have these fewer, bigger packages, like pandas, scikit-learn, matplotlib, because in order to work with method chaining, everything has to be glommed into one giant package.

Whereas with R, because you can combine things from different packages, the equivalent of pandas is kind of like dplyr and tidyr and readr and a bunch of other things.

It's way easier to add extensions to ggplot2 than to matplotlib, extensions that work exactly the same way as the original, because you can just combine the different pieces.

So I think this is one of those interesting, subtle differences in language design that leads to fairly big impacts on the user experience, and even on how the community works together and forms.

Yeah, it makes perfect sense. And your explanation is so simple for how that's happened. It had kind of escaped my attention as to why it works so well in R.

Marrying R and Python: Arrow, DuckDB, and beyond

When you were last on this podcast four years ago, you said that you wanted to marry the Python and R languages. Four years on, how do you assess the progress made in achieving this dream, especially through projects like Apache Arrow?

Yeah, I think we've come a long way. And Arrow has made a big difference in just being able to seamlessly move data from one platform, one programming language, to another.

And coupled with that, the other technology that's really, really interesting is DuckDB. You can use DuckDB from R, you can use it from Python, and you don't even have to have a database file; you can just have a directory full of Parquet files.

And it means that like, you know, people are using the same kinds of tools, just expressing them, you know, in the language that they feel most comfortable with.

Another, sort of similar thing is Keras and a lot of the machine learning toolkits in Python. The reason that they are fast is not because Python is fast. It's because you express those high-level ideas in Python, and then they get compiled down to some low-level machine code.

And that's why packages like the Keras package for R, which is maintained by one of my colleagues at Posit, Tomasz Kalinowski, do the same thing. You express these ideas in R rather than Python, but then they get compiled down to machine code using exactly the same toolkit.

So I think we're just going to continue to see more and more of that. R is not fast. Python is not particularly fast. What is fast is carefully written code in languages like Rust and C. And then you write a more user-friendly interface on top of that, in the programming languages that data scientists use every day.
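A toy illustration of that layering, using Python's standard-library ctypes to wrap the system C math library (the wrapper name fast_sqrt is made up for this sketch; real packages build far more elaborate bindings on the same principle):

```python
import ctypes
import ctypes.util

# Load the system C math library: the fast code lives in compiled C,
# while the user-facing interface is plain Python.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

def fast_sqrt(x: float) -> float:
    """Friendly Python wrapper around the C implementation."""
    return libm.sqrt(x)

print(fast_sqrt(9.0))  # 3.0
```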

And those libraries that you mentioned there are super cool, in addition to the Arrow that I mentioned. And speaking of Wes McKinney, we actually have a whole episode about that.

So back in episode number 523, we had Wes McKinney on, and he talks about the Apache Arrow project at length. Really cool one. And the other projects you mentioned there, DuckDB as well as Keras for R: super cool, invaluable packages that people should be trying out, for sure.

Arrow's also been top of mind for me lately because I've been working on some enhancements for the bigrquery package, which allows you to get data from BigQuery.

And previously, the way that the bigrquery package got data was by downloading it as JSON.

And JSON is a great interchange format, but it is horrendously inefficient if you're sending data frames of data.
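A quick stdlib-only sketch of why: encoding a numeric column as JSON text takes far more bytes than a packed binary layout. (Arrow's actual format is more involved; this just compares decimal text against fixed-width binary.)

```python
import json
import struct

# A column of 1,000 floats, standing in for one data-frame column.
# Dividing by 7 means most values need full double-precision text.
column = [i / 7 for i in range(1000)]

# JSON encodes every value as decimal text, plus brackets and commas.
json_bytes = len(json.dumps(column).encode("utf-8"))

# A binary columnar layout stores each float64 in exactly 8 bytes.
binary_bytes = len(struct.pack(f"{len(column)}d", *column))

print(json_bytes, binary_bytes)
```

On top of the size difference, a binary columnar format also avoids the cost of parsing decimal text back into floats on the receiving end.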

And so, thanks to some folks who have been working on the bigrquerystorage package, it now talks to the Google API using Arrow instead of JSON.

You can download data an order, or sometimes two orders, of magnitude faster, because you're using a data format specifically designed for the type of data that data scientists care about.

And it's really nice to see that in practice: the dream of Arrow, that you've got data over here and you want to get it over there, so let's make it as easy as possible.

Multilingual programming and generative AI

Speaking of multiple languages and interchanging between them, do you personally encourage teams to be multilingual or do you often write in multiple different languages, maybe even within the same project?

How do you think about that in your own work?

I'm 100% R, and as far as I can tell, I probably always will be. That's my job. If there's something that I could do better in Python, I will write an R package instead.

But, you know, but that's not the reality of like most people's lives and most people's jobs. I think most people tend to be like, you know, like 90% R or 90% Python.

But in general, it's better to be pragmatic. If there is something that's way easier to do in another language, you can learn the basics of R, or the basics of Python, pretty quickly, so that you can use that tool.

And I think that's one area where I think generative AI is like really interesting, like just being able to like generate code in another language quickly.

And I've been using it quite a bit to generate JavaScript, because I do the occasional web thing. And I really like it because I know enough about JavaScript that I can look at what it produces and say, that looks right. But for me to figure it out manually would just take so much longer.

And so I think that's really, I think it's really interesting to kind of think about how that's going to affect kind of programming languages. If it's really easy to translate between them, maybe this means the barriers between them are going to erode a little bit more.

Yeah, I think that's right. It's amazing that we've gotten this far in the episode without talking about generative AI yet, which is kind of refreshing. And actually, looking at all the topics we have lined up, I don't think any of them touch on gen AI, which is kind of crazy today.

But as you were talking earlier about how Posit is aiming to be a hundred-year company, and we think about what programming languages will look like a hundred years from now, that was bouncing around in my head as you were speaking. I wonder whether, a hundred years from now, anybody will be programming at all.

Because I wonder if just natural-language expression of things will be so powerful. Or I wonder if we'll be working at all, or it's just going to be a Mad Max hellscape.

I think the other thing that's really interesting to me, though, is like if people are really going to be using like generative AI for programming a lot. Like what does that mean for new programming languages, which are not going to have any training data available for them? Like that seems like that's going to kind of raise the barrier to new languages even further.

Yeah, it's also just, what happens to Stack Overflow? There's this idea of poisoning the well: people are stopping using Stack Overflow, which is kind of fine in the short term. But where's all the training data going to come from in the future?

It's exciting and scary. Yeah, it is exciting and scary.

I have this perhaps completely unfounded intuition that somehow it's going to be fine. And there are people way smarter than me who have spent a lot of time thinking about this, who could easily crush what I'm about to say. You might even do it right now.

But somehow, based on how quickly issues like hallucinations have been stamped out, like the jump between GPT-3.5 and GPT-4 in how much less it hallucinates, I have this completely unscientific, uneducated intuition that we're going to be fine on this front, that we're not going to end up having a complete... there's a specific term for this.

Yeah, I think the thing that makes me less optimistic is my Tesla and, you know, this promise of self-driving cars, which just doesn't seem to be getting any closer. Like, I don't know, 50% of the time that I pull into our garage, it thinks the random collection of tools on the wall is a semi.

That's about it, so I'm just like, meh. And that's clearly something that's been very much hyped, and a bunch of money has been put into it. It's just going to be interesting to see. We're clearly in this explosive growth phase, and is it going to flatten off, or is it going to keep going, or is it going to get steeper? Who knows?

S7 and object-oriented programming in R

So maybe we won't be going down the Excel route. But another big innovation for R that has actually happened recently is R7.

Which I heard about for the first time while doing the research for this episode, reading our researchers' research. Yeah. Do you want to tell us about R7 and the problems that it's aimed at addressing?

So we actually renamed it to S7 relatively recently. Oh, really? It's called S7.

So it's called S7 because there are two, and this is a lot of historical minutiae, but the language that came before R was called S, and S introduced object-oriented programming, in S version 3 and S version 4.

So in R there are two types of object-oriented programming, S3 and S4. The two chief types.

And the idea of S7, or R7, was to try to add those two things together and get the best of both worlds.

So S3 is really just a very lightweight set of conventions. It's not like object-oriented programming in any other language; it's very, very lightweight. S4 kind of swings too far in the other direction: it's very formal, there's a lot of boilerplate, it's quite complicated, and things can go wrong in weird ways.

So the idea of S7 was really to try and find like the sweet spot in between them. Like to take the nice features that S4 had, add them on top of S3 in a backward-compatible way so that we can hopefully switch.

Hopefully, we're not just adding another object-oriented programming style to R, but we're actually supplanting S3 and S4 over time. Because everything you can do in those two, you can do in S7 and you can do it more easily. And there's better documentation and tooling and that kind of stuff.
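For readers coming from Python, S3's generic-function dispatch is loosely analogous to functools.singledispatch, where a generic function picks an implementation based on the class of its first argument. (This is a Python analogy, not R's actual mechanism; the generic name describe is made up for the sketch.)

```python
from functools import singledispatch

# In R's S3 system, a generic like print() dispatches on the class of
# its first argument. singledispatch gives Python the same flavor of
# generic-function OOP.

@singledispatch
def describe(x):
    # Fallback, like a default S3 method.
    return f"some object: {x!r}"

@describe.register
def _(x: int):
    return f"an integer: {x}"

@describe.register
def _(x: list):
    return f"a list of {len(x)} elements"

print(describe(42))      # an integer: 42
print(describe([1, 2]))  # a list of 2 elements
```

Note the methods can be registered anywhere, even in other modules, which mirrors how R packages extend each other's generics.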

For our listeners who don't have a computer science background, what does it mean for a language to be object-oriented? And how can that come in these kinds of grades, from lightweight, like you were describing with S3, to more formal?

Yeah, I don't know. I don't know. You have objects and you program with them. I don't know.

And it's especially weird in R because when you're using R, you're not really aware that you're using object-oriented programming. Unlike in Python, where I think you're much more aware that you have objects and you call methods on those objects.

Object-oriented programming is much, much less important in R for a data scientist. I think you benefit from it because packages use it.

So I think the main benefit to you as a data scientist is not that you're going to be writing S7 code. But the packages that you use are going to and they're going to be able to write code faster and more correctly from the get-go.

So hopefully more of a general uplift of developer productivity in R. Probably not going to affect data scientists day-to-day that much.

Posit's mission and the hundred-year company

You've already mentioned on the show how Posit has an ambition to build a company, a suite of tools that could last 100 years. What kinds of principles or philosophies do you think are critical to creating a legacy that lasts that long in technology?

Yeah, that's a good question. I don't think we know for sure.

But one of the things that make us different as a company is that we are a public benefit corp, a PBC or a B corp, rather than being an LLC.

And what that means is kind of fundamentally baked into our charter is that our sole goal is not to optimize shareholder revenue, which is kind of the classic LLC model.

We explicitly consider other stakeholders, like the community and our employees, in what we're trying to do as a company.

And I think that is pretty special, because we can legitimately say: we don't want to make products that lock you in. We want to help you, and we're going to sell you products that are going to help you do your job.

And you're going to hopefully pay us money for those because they save you time. They allow you to do things that you couldn't otherwise do. But we're not just about that money. Like we really care about your kind of life as a data scientist. We want to build tools for you.

We want to build open-source tools for people that don't have a bunch of money. We want to improve academia. So part of the mission is that we're not optimizing for short-term profit. We can say we're going to take a longer view.

And, you know, part of that is also that we're not a VC-driven company. We don't have to explode, in either a good way or a bad way, in three years' time, when our investors want to get their money back, so that we can