Hadley Wickham: Spreadsheets, bikes, and the accidental empire of R packages

Transcript

This transcript was generated automatically and may contain errors.

Welcome to The Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning, digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field. On this episode, we sit down with Hadley Wickham, Hall of Fame statistician, open source programmer, and chief scientist at Posit. I'm Michael Chow. Thanks for joining us.

You're the chief scientist at Posit, formerly RStudio, and you created this really great ecosystem of R packages for making data analysis delightful in R. And you work on the tidyverse, which is a collection of tools in R. And you're wearing the tidyverse socks. Yeah, I've got a little, I'm repping the tidyverse a little bit.

You're also, I think you produced this sock.

But the tidyverse is a collection of R tools to make data science easy and delightful in R.

So I was hoping we could talk a little bit through the history. And maybe to start out, we could just go back to the very beginning, which is your very first data analysis.

So what that effectively means is, there's a lot of freedom, as long as your major advisor is happy with what you want.

Joining RStudio and learning C++

I think, yeah, one thing I'm really curious about is, as you transitioned from more academic work, and as you rolled out some of these packages, and they got adoption. I know, in like 2012, you joined RStudio. And I'm really curious what that looked like, as you sort of had this little system of packages going, and made that jump. What was that like? Yeah, that was really a pretty amazing time for me, because I was moving from academia, where the packages were kind of valued, but mostly, to prove in academia that the packages had value, I had to turn them into papers somehow. And that sort of pushes you onto this relentless path of making new things; you can't write a paper about a version update to a package. And that just didn't feel super compelling to me, like I could see all of these problems I really wanted to work on. So when I started at RStudio, it was just this amazing, like, oh, there's nothing on my calendar, I can just spend all day programming.

Yeah. And that's basically what I did. And around that time, I guess, that was when I started learning C++. And there was something about the process of learning that, I kind of got addicted to writing C++ code the way, you know, 10 years previous, I was addicted to playing Age of Empires or whatever. There was something about that, this kind of perfect on-ramp of challenge and results. And I got really excited. And I remember, you know, I was waking up at like 5am and rolling out of bed and starting to write C++ code.

Yeah. I saw you mentioned JJ helped you. Yeah. Are you saying you were, like, throwing emails at each other? Or like? Yeah, we were chatting. You know, I was reading books and stuff. But he was there whenever I got stuck with something I couldn't understand. Whenever I just needed to talk something through with someone, he was there. And that was incredible. That was so, so valuable. Yeah. And I still think one of the best compliments I got in my life was when JJ described me as, like, an intermediate C++ programmer. I was like, wow.

There's something about a, like, dig, like a kind of barbed compliment, I feel like. But someone like JJ calling someone an intermediate C++ programmer, that's like, I was like, wow. Yeah. And I guess for broader context, JJ Allaire is the CEO, or president, and was one of the founders of Posit.

Building the tidyverse team

He was the CEO at the time. Yeah. And I do feel like something I noticed, talking to some of the tidyverse team, it does feel like, over the next couple years, there was this kind of, like, revenge of the C++ users, as I understand it. So I heard, two people on your team, Jim Hester and Lionel. Yeah. Came in sort of around 2015, 2016. There's something about, like, Jim Hester uses Vim. Lionel uses Emacs. Yeah. And so, maybe you recruited kind of like a C++ nerd army. Yeah. And can you say a bit about their stack? Yeah. I mean, I think that was the time where I was like, okay, clearly C++ opens a lot of doors, both in terms of performance for R packages, but also connecting to existing tools written in C and C++. And when I got the opportunity through RStudio to grow my team, to hire some people, that was very, very high on my list.

Because, you know, I was, you know, a fine R programmer. I was an okay C++ programmer, but that's where I really felt like there was leverage to, like. Yeah. Connect R to more systems, to make R packages that are faster and do more things. Oh, cool. And then both Jim and Lionel, I had interacted with a bunch on GitHub already. So they felt like, you know, kind of safe hires because I knew I could work with them.

Lionel mentioned that, after a little bit of collaborating, you essentially sent him an email that was like, hey, I need C++ programmers. Yeah. Are you interested? Yeah. And according to Lionel, he said yes, but he didn't consider himself a C++ programmer. So for the next month, I think he just frantically studied C++. Yeah, that's funny. I did not know that.

Focus, workflow, and the forcats origin story

And I was surprised too, because I think, at this time, as you're spinning people up, talking to the tidyverse team a little bit, and this is one thing I'm really curious about, is your approach to tools, like what you choose to focus on. Yeah. It was really around that time where I just had so many R packages that I couldn't work on all of them simultaneously. There's just too much context and background knowledge that I've got to load up into my brain. And so I started being pretty disciplined about saying I'm only going to focus on this repo for now and I'm going to ignore everything else.

And that's obviously got drawbacks, because you end up in these situations where someone, you know, does a cool, useful contribution and it just sits there for a year. But I decided I just had to do that in order to remain sane and a functioning human and to keep making traction and stuff. So that batched approach, keeping my focus narrow but regularly switching it between different projects, I think that has been super, super useful.

Like thinking about workflow. So this idea, you're switching between packages and really focused on one. Another one I'm really curious about is forcats, which is around the time you kind of started the tidyverse and the tidyverse team. Yeah. And it's kind of a quirky package. Yeah. But the reason I'm curious is Jenny Bryan on your team told me that you were sort of anti-factor. Yeah, yeah. At the time. So for background, forcats is a package to kind of wrangle factors and make them nice or fun to use. And I was surprised she mentioned, like, yeah, you two were just chatting, basically. Yeah. She said you were anti-factor and she basically, like. Yeah, Jenny persuaded me. Incepted it into your brain. Yeah. Yeah. I think that was interesting because once she persuaded me and the switch kind of flipped and I was like, oh, OK, I should create this, I just felt like there was this outpouring of functions that I had always really needed, but I'd never really noticed. And once I had the place to put them, it was super useful.

And I think the other thing that I took away from that is just the importance of having this team of people I trust. Like when Jenny says something, even if I don't fully believe it or agree, I'm like, it's worth my time to really process this and think about it, compared to some, you know, random person filing a GitHub issue that's not necessarily going to be convincing, because, you know, of the people who randomly file GitHub issues, most of them are not going to persuade me to create a new package or tackle something a totally different way. So I felt the benefit of having a team of people you trust and work closely with, and who can call you out when you're making a mistake in a way that I wouldn't listen to if it's just coming from a stranger. That was pretty important.

Vctrs, recycling rules, and handing over the keys

I guess the last sort of situation I was really curious about is vctrs. So I know Davis on your team started kind of doing other stuff, like a lot of tidymodels, and sort of got pulled into, like really absorbed into vctrs. Could you say a bit about that path he took? I don't even really remember how that, like, the process. Yeah, I think, I mean, Davis started in tidymodels, but I think he was always pretty interested in the low-level programming as well. And that kind of drew him towards problems at the intersection between tidyverse and tidymodels. Yeah. And vctrs was this attempt to try and pull out some underlying principles of vectors. And if you want to create a new vector class, what are the things you need to do and think about? And then how do we plumb those into the rest of the tidyverse? So basically, other people can extend the tidyverse without having to do pull requests into every single package.

And we spent a lot of effort on that package. I'm still not 100 percent convinced it was worth it. Or maybe we could have found a way to solve the same problem with less effort. But it certainly did force us to really think about some of the low-lying invariants about data frames and things where base R is a little bit, like, kind of murky or sloppy. Like base R's recycling rules, which are basically, when you have vectors of two different lengths, what happens. And we kind of talk about recycling rules like there's one set of recycling rules. But if you look really carefully, you can figure out that different functions have implemented things slightly differently. So there's actually like three or four different variations of recycling rules. Yeah. OK. And so that was sort of interesting, really thinking about that, and that ended up with our own tidyverse recycling rules where we only ever recycle vectors of length one.
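
The contrast between the two recycling behaviours described here can be sketched in a few lines of Python. This is an illustration only and not the vctrs implementation: the function names recycle_base_style and recycle_strict are hypothetical, and the base-style version only approximates one of the several variations found in base R.

```python
from itertools import cycle, islice


def recycle_base_style(x, y):
    """Loosely mimic base R arithmetic: repeat the shorter input until it
    matches the longer one, even if the lengths do not divide evenly."""
    if len(x) == 0 or len(y) == 0:
        return [], []
    n = max(len(x), len(y))
    return list(islice(cycle(x), n)), list(islice(cycle(y), n))


def recycle_strict(*vectors):
    """Strict rule from the discussion: only length-1 inputs are recycled."""
    # Collect the distinct lengths of every input that is not length 1.
    sizes = {len(v) for v in vectors if len(v) != 1}
    if len(sizes) > 1:
        raise ValueError(f"incompatible sizes: {sorted(sizes)}")
    n = sizes.pop() if sizes else 1
    # Length-1 inputs are stretched to the common size; the rest pass through.
    return [list(v) * n if len(v) == 1 else list(v) for v in vectors]


# A length-1 input is recycled under both rules.
print(recycle_strict([1, 2, 3], [10]))
# Lengths 4 and 2 are silently recycled in the base-style version...
print(recycle_base_style([1, 2, 3, 4], [1, 2]))
# ...but rejected under the strict rule.
try:
    recycle_strict([1, 2, 3, 4], [1, 2])
except ValueError as err:
    print("error:", err)
```

The strict version fails loudly on mismatched lengths instead of guessing, which is the kind of invariant the conversation says the team had to pin down in code.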

But that, you know, so that process of really thinking about it, and to me, the only way you can think precisely about any of these programming things is to actually program them and try them out, because that forces the vague ideas in your head into a very concrete reality that you can actually, like. You're saying vctrs is like you all working through it. Yeah. Collectively, with code to keep us 100 percent honest about does this work or not.

I'm really curious about how someone like Davis goes from contributing to something like vctrs to kind of being like the boss of vctrs. Yeah. Could you say a little bit about what that path looks like? Yeah, I think it's a pretty, you know, gradual process. And to some extent, this happens to people who actually work for the tidyverse team and just ordinary, you know, outside people who contribute a bunch, that over time, as you get more and more comfortable and you have contributed to more and more parts, it just becomes pretty natural. You know, the existing contributors will start to cede authority and say, like, you probably know more about this than us by now. And then soon, you're the boss, you get to make the decisions.

And are you just chucking them the keys or are you, like, knighting them? Yeah, we have a little bit of a process. It's fairly informal, but we have a governance doc where we have some steps written down. But really, a lot of the governance process was written initially to help me feel safer giving the keys away, because that felt pretty nerve-wracking at first. You know, now that I've done it like five or six times and no one's done anything bad, and everyone's been really grateful to have the opportunity and made the most of it, it feels pretty easy for me to say, hey, OK, you can now contribute to this directly without my required oversight.

AI, LLMs, and vibe coding

Yeah, I mean, like everyone, I guess it's AI and LLMs. So I've been spending a lot of time working on ellmer, which is a package to, you know, use LLMs from R, which is really the kind of infrastructure to make it happen, like just making sure you can call ChatGPT or Anthropic APIs and get the results back in ways that are convenient to you as an R user. What I'm still hoping to do more of is really get into, well, how can you actually use LLMs as an R programmer or as a data scientist to make your life easier, better, more fun in some way? So I'm hoping I'm going to get out of the weeds of ellmer soon and get into actually using this to do cool stuff.

Yeah. Do you have a sense of how you'll kind of unpack that problem? Like how to use LLMs effectively? Not really, I guess. I mean, I think it feels to me like it's in the stage where I don't have any kind of big cohesive theory in my head of how the pieces fit together. And so that just means trying stuff out, playing around, figuring out lots of little things that might be useful, in the hope that over time that kind of accumulates. And then I'm eventually able to say, okay, there's like four main ways you're going to find using LLMs useful. And those four ways can be divided up further into these seven sub-ways. But no, I have no real sense of what that is. Yeah. Which is, you know, it's a totally new thing, which is both kind of exciting and a little scary. And, you know, what does it mean to be a data scientist today? It's very clearly very different than when I started my career.

Yeah. Do you feel like, as part of your due diligence in finding out how to best wrangle LLMs, you will need to do a little bit of vibe coding? Do you feel like that's part of the. I think so. Like I've done a little bit of it and I think it is fascinating and empowering to be able to tackle areas that I know nothing about and I can now get traction on pretty easily. It's also scary because I'm writing code that I don't really understand. Like, where does it go from here? But I think certainly it feels like LLMs kind of give you the ability to be, yeah, like an intermediate programmer in any language or any tool with very little effort. And that's, yeah, that's kind of exciting. That gives you the capability to do a lot of things you couldn't before. But once you get past that initial core prototype, where does it go from there? I think that's, yeah, that's kind of the next step.

Writing a Shiny book for a bicycle

Yeah. I guess the last thing I'm really curious about, which is kind of a total tangent, but is it true that you wrote a textbook on Shiny in exchange for a bicycle, a custom bicycle? That is correct. Although it was more like, I gave Joe Cheng, the author of Shiny, the gift of writing a book for him about Shiny. Yeah. In exchange. Well, and then after the book was written, he was like, oh, can I build you a bike to say thanks? Yeah. I love that. Yeah. But it was like a mutual, like, I like writing books and he likes building bikes. So it was mutual, yeah, it worked out really well. We both got to do something we enjoyed and got something that we wouldn't have done ourselves out of it.

I feel like it helps a lot too, that I don't know how many people in the world love writing books. Like to just drop in and write a book for someone about their thing is a pretty big stunt. You, like, kickflipped over the Shiny docs. I feel like, because you actually do have so many books, what do you like about it? I don't know. Some of it is, now I'm just in the habit of it. And so it's just easy to keep going with the habit. I do generally like this sort of tidying idea. Like how do you organize information in a way that makes sense and is elegant and helps people learn it better. Yeah. So I think that is something I find intellectually satisfying about it. And there's something nice about writing a book about someone else's things, because you can just be like, oh, and this bit's not that good, or it's a little bit fiddly to use. Yeah. Whereas when I write stuff about the tidyverse, I'm like, oh, I should fix that. Tweak it. Yeah. Yeah. So it's just nice. I can just write the book about this thing as it is without having to worry about making it better as I go.

I mean, it's pretty wild. As I was preparing for this discussion, I did at some point try to figure out how many books you're writing right now or have written. And I do feel like it was pretty unclear. Do you have a sense for how many books you're writing right now? Like, how many sticks are on the fire? I mean, there's the third edition of the ggplot2 book, which is kind of on hold currently. I guess there's the tidyverse design principles, which is also on hold currently, and the R in production book, which is what I'm trying to focus on. But I've mostly been pulled away into LLM stuff. So like three. Is there an LLM book cooking? No. Okay. No, I do not want to write a book about LLMs. Yeah. I've got, like, maybe once I finish these, like, three other books. But at the same time, I've written a lot of vignettes about LLMs for the ellmer package. So it's not a book, but there's still quite a lot of writing going on there.

Yeah, geez. The Test Set is a production of Posit PBC, an open source and enterprise tooling data science software company. This episode was produced in collaboration with branding and design agency, Agi. For more episodes, visit thetestset.co or find us on your favorite podcast platform.