
Hadley Wickham: Spreadsheets, bikes, and the accidental empire of R packages
Before Hadley Wickham became a pillar of modern data science, he was a spreadsheet-loving teenager making databases for his dad’s job. In this episode, he reflects on the early days of his involvement with R, the birth of the tidyverse, and how real-world unpredictability (like a bear in a field) shapes data science.
Data scientists around the world owe a lot to Hadley Wickham, but how did he go from wrangling databases as a teen to shaping the future of statistical computing? In this episode of The Test Set, Hadley takes us through the winding path of his early days with R, the birth of the tidyverse, and the quirks of open source development. We dig into the philosophies behind his tools, why team dynamics matter, and how large language models are sparking his curiosity all over again.
What’s Inside:
• Hadley’s first brush with R code … inside a Word doc
• Consulting as a grad student, and learning what people really want from stats
• How messy Excel sheets inspired the tidy data revolution
• Writing R packages as a form of self-defense (and productivity)
• The secret sauce of building the tidyverse team
• On focus, burnout, and saying “no” to GitHub pull requests
• Current obsession: using LLMs to make data science faster, easier, and more fun
• How writing books is a form of tidying ideas, and how a Shiny textbook led to a custom bike
Transcript
This transcript was generated automatically and may contain errors.
Welcome to The Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning, digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field. On this episode, we sit down with Hadley Wickham, Hall of Fame statistician, open source programmer, and chief scientist at Posit. I'm Michael Chow. Thanks for joining us.
You're the chief scientist at Posit, formerly RStudio, and you created this really great ecosystem of R packages for making data analysis delightful in R. And you work on the tidyverse, which is a collection of tools in R. And as you're wearing the tidyverse socks. Yeah, I've got a little, I'm repping the tidyverse a little bit.
You're also, I think you produced this sock.
But the tidyverse is a collection of R tools to make data science easy and delightful in R.
So I was hoping we could talk a little bit through the history. And maybe to start out, we could just go back to the very beginning, which is your very first data analysis.
Early days: statistics, consulting, and tidy data
So it would have been at the University of Auckland where I did my undergraduate in statistics. I don't remember my first data analysis, but I remember my like first R code, because I still have some of the files where it's like R code, like written inside a Word document that I clearly like copied and pasted to run it somewhere else. But that was mostly, like we didn't, yeah, like we didn't really do data analysis in a statistics undergrad, which is a little odd in retrospect. So I think probably it would have been like during my PhD when I really started to learn more about exploratory data analysis.
And I also had an assistantship that was helping PhD students in other departments analyze their data. So that like really, like that's when I was really like forced to be like, okay, now I've got to take this like, you know, random PhD student in a random discipline with some data and some questions and think about like, how do I actually use what I know about statistics to answer those. And you were kind of, it was sort of like a consulting thing, is that right? So you were rotating through? Yeah. So yeah. So, but it was sort of interesting because it was, you learn a little bit about consulting too, because you, you know, you saw a bunch of different people, but also sometimes if the person didn't like the answer they got from you, they would like try other consultants in the same team to see if they could get like a different answer that they preferred.
Oh, interesting. So you're doing like, this is like competitive, helpful data analysis. Kind of, because you know, they're like, I want a significant p-value so I can publish this. And so if you don't give them a significant p-value, then they'll like go to someone else and, you know, try and try again, which is, you know, it's just fascinating. You know, reproducible data analysis and the reproducibility crisis and all this kind of stuff. You got to see that kind of like firsthand. Yeah. That's a pretty brutal, I feel like. Yeah. But like, at the same time, like nothing malicious, but just people like trying, you know, they've spent their last three years, like collecting this data and they want to be able to publish something for their PhD.
And it's not like they're just like getting data out of a database. Like they're going out in the field, like collecting data. I remember there was like one case it was like missing values because there was a bear in the field on that day. Yeah. They couldn't even go to collect their data or like growing plants, like stuff that takes like years. That was also my introduction to just like how people kind of naturally record data in Excel spreadsheets. And that is like really what led me to a lot of like work around like tidy data. Because I'm like, I have no idea how to get like, this is like, I can look at this and be like, this is not the way I would organize it. But like, how do I like, why would I organize it a different way? And how do I get it from how it was to like where it should be?
Spreadsheets, databases, and a teenage nerd
Yeah, for sure. And were you, I seem to recall someone saying that you actually like were wrangling spreadsheets as a kid or had like some pretty early, you like cut your teeth on spreadsheets pretty young. Is that? Yeah. Well, like, I mean, databases really first. I guess when I was about 15, I started, I got really into like Microsoft Access, like obviously a very nerdy child, but I was really into it. My dad had like a lot of his career had been around databases. So he kind of like supported me in that. Like I had part-time jobs, like developing databases, like for his work or had another part-time job, like doing database documentation.
So I was kind of like familiar with that. And then I also remember like one of dad's friends, like ag research in New Zealand had this, was like really, really into Excel. And he had like implemented a rotating 3D scatterplot in Excel. And I was like, just like blown away by like how cool that was.
And then I also developed like, while I was in university, I developed a database for the Northern Regional Health Genetic Health Service, which is like one of the health departments that like helps people like do prenatal screenings and detect genetic anomalies and stuff. And then like, I completely forgot about that. But then one of my friends, like graduated from med school, ended up working at the Northern Regional Health Genetic Service like 10 years later. And they were like, we're still like, oh, no, I can open it up. And I had like my name on it. And they're like, we're still using your database. Yeah, that's fun. And except, except because it's been like 10 years with like no maintenance. They were like, your database is like, like everyone hates it. It's so flaky and unreliable. And I was like, bittersweet. They're like, 10 years on, we still use it. It's still there. And we hate you for it. But then they had to spend like a quarter of a million dollars replacing it with something professional. And I was like, I was like vindicated again, because they probably paid me like $10,000 to develop it.
Building R packages and the PhD thesis
Yeah, geez. And is it in your time, when you were doing a lot of the consulting, and then in grad school, that you ended up building, sort of, reshape and ggplot2 out of those? Yeah. Oh, cool. And did you? Were people just building R packages at the time? Like what? What was that? Not really. Like, I actually, I guess I created my first R package when I did my master's at the University of Auckland. That was for visualizing microarrays, like the, the bioinformatics data. That package does not exist today. But I think that just because I was like at the University of Auckland, where you know, the home of the birthplace of R, that kind of creating an R package felt natural.
That, yeah, I mean, I guess like that, that sort of led to like this whole other thread of my work, which is like, I started creating so many R packages that I needed to create other packages to make making packages easier. Right. And that kind of like, spiraled into a lot of other tooling that just made my life and other people's lives easier.
Yeah. Like a lot of your early work too, working on like reshape and ggplot, like are pretty, they're pretty unusual for a statistics department. Yeah. I think that's probably still true today. And you know, because my PhD thesis was basically reshape, ggplot2, and then another paper on like model visualization, which is like really, you know, I didn't have like a bunch of equations or like proving theorems. So it was like pretty lucky, I think, like I had really supportive advisors, Di Cook and Heike Hofmann, and that made that possible. Like kind of afterwards, like I sort of realized, like, particularly at Rice, when I was actually a faculty member, like there's a little bit of like, kind of like, with a PhD thesis, like, there's some component of mutually assured destruction, because like every person is both like the main advisor on someone's thesis, and like a co-advisor on other students. So like, if someone like rejects your student, then you're going to be like, okay, well, maybe your student won't pass, and then like there's a spiral and no PhD student will graduate. So what that effectively means is like, there's a lot of freedom, like, as long as your kind of major advisor is happy with what you want.
Joining RStudio and learning C++
I think, yeah, one thing I'm really curious about is, as you transition from more academic work, and as you rolled out some of these packages, and they got adoption. I know, in like 2012, you joined RStudio. And I'm really curious what that looked like, as you sort of had this little system of packages going. Yeah, and made that jump. What was that? Like, that? Yeah, that was like, really a pretty amazing time for me, because I've moved, you know, moving from academia, where the packages were like, kind of valued, but mostly, like, to prove in academia that packages had value, I had to turn them into papers, somehow. And that sort of pushes you onto this, like, relentless path of like making new things. You can't write a paper about like a version update to a package. And that just didn't feel like super compelling to me, like I could see all of these problems I really wanted to work on. So when I started at RStudio, it was just like this amazing, like, oh, like, there's nothing on my calendar, like I can just spend all day programming.
Yeah. And that's like, basically what I did. And around that time, I guess, that was when I started like learning C++. And there was something about the process of learning that that that felt like, like, I kind of got addicted to writing C++ code and the way like, you know, 10 years previous, I was like addicted to playing like Age of Empires or whatever, like there was something about that. This kind of perfect on ramp of like challenge and results. And I got really excited. And like, I remember, you know, I was like waking up at like, 5am and like rolling out of bed and like starting to write C++ code.
Yeah. I saw you mentioned JJ helped you. Yeah. Are you like, are you saying you're like, throwing emails at each other? Or like? Yeah, we were like chatting. Like, he wasn't, you know, I was like reading books and stuff. But he was there, like, whenever I got stuck with something I couldn't understand. Whenever I just needed to talk something through with someone, like he was there. And that was like, incredible. That was like, so, so valuable. Yeah. And I still think like, one of the like, best compliments I got in my life was when JJ described me as like an intermediate C++ programmer. I was like, wow.
There's something about a like, dig, like a kind of barbed compliment, I feel like. But like, someone like JJ calling someone an intermediate C++ programmer. That's like, I was like, wow. Yeah. And I guess for broader context, JJ Allaire is the CEO, or president, was one of the founders of Posit.
Building the tidyverse team
He was the CEO at the time. Yeah. And I do feel like something I noticed, so talking to some of the tidyverse team, it does feel like, as like, over the next couple years, there was this kind of like, revenge of the C++ users, as I understand it. Like, so I heard, something about like, so two people on your team, Jim Hester and Lionel. Yeah. Came in sort of around 2015, 2016. That there, there's something about like, Jim Hester uses Vim. Lionel uses Emacs. Yeah. And so, something about maybe you recruited kind of like a C++ nerd army. Yeah. And can you say a bit about their stack and. Yeah. I mean, I think they, like that was the time where I was like, okay, clearly like C++ like opens a lot of doors, both in terms of like performance for R packages, but also connecting to existing tools written in C and C++. And that was, and when I got the opportunity through RStudio to like, hire, to grow my team, to hire some people, that was like very, very high on my list.
Because, you know, I was, you know, a fine R programmer. I was an okay C++ programmer, but that's where I really felt like there was leverage to like. Yeah. Connect R to more systems to make R packages that are faster and do more things. Oh, cool. And then both like Jim and Lionel, I had interacted with a bunch on GitHub already. So they felt like, you know, like they were kind of safe hires because I knew I could work with them.
Lionel mentioned that you essentially, after a little bit of collaborating, you sent him an email that was like, hey, I need C++ programmers. Yeah. Are you interested? Yeah. And according to Lionel, he said, yes, but he didn't consider himself a C++ programmer. So for the next month, I think he just frantically studied C++. Yeah, that's fine. I did not know that.
Focus, workflow, and the forcats origin story
And I was surprised too, because I think the, at this time, as you're spinning people up, talking to the tidyverse team a little bit, and this is one thing I'm really curious about, is your approach to tools, like what you choose to focus on. Yeah. It was really around that time where I just had like so many R packages that I couldn't like work on all of them simultaneously. There's just like too much like context and background knowledge I've got to load up into my brain. And so I started being pretty disciplined about saying I'm only going to like focus on this repo for now and I'm going to ignore everything else.
And that, like that's obviously got drawbacks because you end up in these situations where someone like, you know, does a cool, useful contribution and it just sits there for a year. But I decided like I just like had to do that in order to remain like sane and a functioning human and to keep like making traction and stuff. So that like batched, like switching my, like keeping my focus narrow, but like regularly switching it between different projects. I think that that has been like super, super useful.
Like thinking about workflow. So like this idea, you're switching between packages and really focused on one. Another one I'm really curious about is forcats, which is around the time you kind of started the tidyverse and the tidyverse team. Yeah. And it's kind of a quirky package. Yeah. But the reason I'm curious is Jenny Bryan on your team told me that you were sort of anti-factor. Yeah, yeah. At the time. So like for background, like forcats is a package to kind of wrangle factors and make them nice or fun to use. And I was surprised she mentioned like, yeah, you two just chatting basically like. Yeah. She said you were anti-factor and she basically like. Yeah, Jenny like persuaded me. Incepted into your brain. Yeah. Yeah. I think that was like both interesting because like once I was, like once she persuaded me and the kind of switch like flipped and I was like, oh, OK, I should create this. And I just felt like there was this like outpouring of like functions that I had always really needed, but I'd never really noticed. And like once I had the place to put them, it was super useful.
And I think the other thing that I took away from that is just that importance of having this like team of people I trust, like when Jenny says something, like even if I don't like like fully believe it or agree, I'm like I like it's worth my time to like really like process this and think about it compared to like some like, you know, random person like filing a GitHub issue that's not necessarily going to be convincing because, you know, of the people who randomly file GitHub issues, most of them are like, you know, they're not they're not going to persuade me to create a new package or tackle something a totally different way. So that like I felt that like the benefit of having a team of people like you trust and you work closely with and who can like call you out on like when you're making a mistake in a way that I wouldn't listen to if it's just coming from a stranger. That was pretty important.
Vctrs, recycling rules, and handing over the keys
I guess the last sort of situation I was really curious about is vctrs. So I know Davis on your team started kind of doing other stuff like a lot of tidymodels and sort of got pulled into like really absorbed into vctrs. Could you could you say a bit about kind of that path he took? I don't even really remember how that like the process, except, yeah, I think I mean, Davis started in tidymodels, but I think he was always like pretty interested in like the low level programming as well. And that kind of drew him towards problems that kind of intersect, these sort of at the intersection between tidyverse and tidymodels. Yeah. And vctrs was this attempt to try and pull out some like underlying principles of like vectors. And if you want to create a new vector class, like what are the things you need to do and think about? And then how do we plumb those into the rest of the tidyverse? So basically, like other people can extend the tidyverse without having to like do pull requests into every single package.
And we spent a lot of effort on that package. I'm still not 100 percent convinced it was worth it. Or maybe we could have found a way to like solve the same problem with less effort. But it certainly did force us to really think about some of the low-lying invariants about like data frames and things where like base R is like a little bit, like kind of murky or sloppy. Like, you know, base R's recycling rules, which are basically when you have vectors of two different lengths, like what happens. And we kind of talk about recycling rules like there's like one set of recycling rules. But if you look really carefully, you can figure out like different functions have implemented things slightly differently. So there's actually like three or four different variations of recycling rules. Yeah. OK. And so that like that was sort of interesting and like really thinking about that and that, you know, that ended up with our own like tidyverse recycling rules where we only recycle vectors of length one.
But that, you know, so that process of like really thinking about and to me, the only way you can think precisely about any of these programming things is actually like program them and try them out because that forces you, forces the like the vague ideas in your head into like very concrete reality that you can actually like. You're saying vctrs is like you all working through it. Yeah. Collectively, like with with code to like keep us 100 percent honest about does this work or not.
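The recycling behavior discussed above can be sketched in a few lines of R. This is a minimal illustration, not the vctrs implementation: `recycle_strict()` is a hypothetical helper written only for this example, and it assumes plain atomic vectors. The first part shows base R silently repeating a shorter vector; the helper then applies the stricter rule described in the conversation, where only length-one inputs are recycled.

```r
# Base R: the shorter vector is repeated to match the longer one.
# When the lengths divide evenly this happens without any warning.
base_sum <- 1:6 + 1:2
print(base_sum)  # 2 4 4 6 6 8

# Hypothetical helper (an assumption for illustration, not vctrs code):
# recycle only length-1 inputs to the common size and reject anything else.
recycle_strict <- function(...) {
  args <- list(...)
  sizes <- lengths(args)
  # Sizes other than 1 must all agree, otherwise there is no common size.
  other <- unique(sizes[sizes != 1L])
  if (length(other) > 1L) {
    stop("Can't recycle inputs of sizes ", paste(sizes, collapse = ", "))
  }
  target <- if (length(other) == 0L) 1L else other
  lapply(args, function(x) if (length(x) == 1L) rep(x, target) else x)
}

# A length-1 input is stretched to match the length-4 input.
ok <- recycle_strict(1:4, 10)
print(ok[[2]])  # 10 10 10 10

# Lengths 4 and 2 are rejected instead of being silently repeated.
result <- tryCatch(
  recycle_strict(1:4, 1:2),
  error = function(e) conditionMessage(e)
)
print(result)
```

In the tidyverse itself this rule lives in vctrs (for example `vctrs::vec_recycle_common()`), which also handles data frames and other vector types that this sketch ignores.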
I'm really curious about how someone like Davis goes from contributing to something like vctrs to kind of being like the boss of vctrs. Yeah. Could you say a little bit about the what that path looks like? Yeah, I think it's like a pretty, you know, gradual process. And to some extent, like people, this happens to people like who actually work for the tidyverse team and just like ordinary, you know, outside people who contribute a bunch that just like like over time as you get, you know, more and more comfortable and you kind of have contributed to more and more parts, it just becomes pretty natural. You know, the existing contributors will start to like cede authority and say, like, you probably know more about this than us by now. So and then soon, like you're the boss, like you get to make the decisions.
And is there, are you like just chucking them the keys or are you like knighting them? We have like, yeah, we have a little bit of a process like it's fairly informal, but we have like a governance doc where we kind of have some steps written down. But really, like a lot of it, the governance process was kind of written initially to help me feel like safer, like giving the keys away because that felt like pretty nerve wracking at first, you know, now that I've done it like five or six times and no one's done anything like bad. Everyone's been like really grateful to have the opportunity and made the most of it. It feels like pretty easy for me to say like, hey, OK, you can now contribute to this directly without my required oversight.
AI, LLMs, and vibe coding
Yeah, I mean, like everyone, I guess it's like AI and LLMs. So I've been spending a lot of time working on ellmer, which is a package to call, you know, use LLMs from R, which is really the kind of infrastructure to make it happen, like just making sure you can call ChatGPT or Anthropic APIs and get the results back in ways that are convenient to you as an R user. What I'm still like hoping to do more of is like really get into like, well, how can you actually use LLMs as an R programmer or as a data scientist to make your life easier, better, more fun in some way? So I'm hoping I'm going to get out of the weeds of ellmer soon and get into actually like using this to do cool stuff.
Yeah. Do you have a sense of how you'll kind of unpack that problem? Like how to use LLMs effectively? Not really, I guess. I mean, I think, like it feels to me it's in the stage where there's not like there's no, I don't have any kind of big cohesive like theory in my head of how the pieces fit together. And so that just means like trying stuff out, playing around, figuring out lots of like little things that might be useful and the hope that like over time that kind of accumulates. And then I'm eventually able to say, okay, there's like four main ways you're going to find using LLMs useful. And those four ways can be divided up further into these seven subways. And, but no, I have no real sense of what that, that is. Yeah. Which is like, you know, it's a totally new thing, which is like both kind of exciting and a little scary. And like, you know, what does that mean to be a data scientist today? It's like very clearly very different than when I started my career.
Yeah. Do you feel like as part of your due diligence in finding out how to best wrangle LLMs that you will need to do a little bit of vibe coding? Do you feel like that's part of the. I think so. Like I've done a little bit of it and I think it's sort of, it is fascinating and like empowering to be able to like tackle areas that I know like nothing about and I can now get traction on pretty easily. It's also scary because I'm like, I'm like writing code that I don't really understand. Like, where does it go from here? But I think certainly like, it feels like LLMs kind of give you the ability to sort of be like a, yeah, like an intermediate programmer and like any language or any tool with like very little effort. And that's, yeah, that's kind of exciting. That gives you the capability to do a lot of things you couldn't before. But like once you get past that, like initial core prototype, like where does it go from there? I think that's a, yeah, that's a kind of next step.
Writing a Shiny book for a bicycle
Yeah. I guess the last thing I'm really curious about, which is kind of a total tangent, but is it true that you wrote a textbook on Shiny in exchange for a bicycle, a custom bicycle? That is correct. Although it was not like, it was more like I wrote, I gave Joe Cheng, the author of Shiny, the gift of writing a book for him about Shiny. Yeah. In exchange. Well, and then like after the book was written, he was like, Oh, can I build you a bike to say thanks? Yeah. I love that. Yeah. But it was like a mutual, like, like I like writing books and he likes building bikes. So it was like a mutual, yeah, it worked out really well. Like we both got to do something we enjoyed and got something that we wouldn't have done ourselves out of it.
I feel like it helps a lot too, that I don't know how many people in the world love writing books. Like to just drop in and write a book for someone about their thing is a pretty, like, that's a pretty big stunt. You're like, you like kickflipped over the Shiny docs. I feel like because you actually do have so many books, like, what do you like about it? Like what? I don't know. That's just like, some of it is now I'm just like in the habit of it. And so it's just easy to keep going with the habit. I do like that there was something, I mean, I think I do generally like this sort of like this tidying idea. Like how do you organize information in a way that makes sense and is elegant and helps people learn it better. Yeah. So I think that was something I find like intellectually satisfying about that. And there's something nice about writing a book about someone else's things, because you can just be like, oh, and this bit's not like that good, or it's a little bit fiddly to use. Yeah. And like, whereas when I write stuff about the tidyverse, I'm like, oh, I like, I should fix that. Tweak it. Yeah. Yeah. So it's just nice. And I can just like write the book like about this thing as it is without having to worry about like making it better as I go.
I mean, it's pretty wild. As I was like preparing for this discussion, I did at some point try to figure out how many books you're writing right now or have written. And I do feel like it was pretty unclear. I feel like there are. Do you have a sense for like how many books you're writing right now? Like how many? Well, how many sticks are on the fire? I mean, there's like the third edition of the ggplot2 book, which is kind of like on hold currently. I guess there's the tidyverse design principles, which is also on hold currently, and the R in production book, which is what I'm trying to focus on. But I've mostly been pulled away into LLM stuff. So like three. Is there an LLM book cooking? No. Okay. No, I do not want to write a book about LLMs. Yes. I've got like one, maybe once I finish this, like three other books. But I mean, but at the same time, like I've written a lot of vignettes about LLMs for the ellmer package. So it's not a book, but there's still like quite a lot of writing going on there.
Yeah, geez. The Test Set is a production of Posit PBC, an open source and enterprise tooling, data science software company. This episode was produced in collaboration with branding and design agency, Agi. For more episodes, visit thetestset.co or find us on your favorite podcast platform.

