Resources

Wes McKinney: Part 1 — Building Pandas, Arrow and a speedrunning legacy

Wes McKinney’s fingerprints are all over the modern data stack, from inventing pandas to co-creating Arrow. But before all that, Wes was organizing speedrun communities and hacking together better ways to wrangle datasets in finance. In this conversation, he shares his origin story and what makes good tools good. In this episode of The Test Set, we talk with Wes McKinney about the origin story of pandas, what he learned from the R ecosystem, and his legacy as a community organizer for GoldenEye 007 speedruns (and how this shaped his approach to building tools). We dig into the early days of open source Python, the evolution of pandas, and the rise of Arrow and Ibis. Wes also shares his thoughts on community stewardship and the power of letting go.

What’s Inside:
• How frustration with data work led Wes to build pandas (and leave a PhD)
• A nostalgic dive into the GoldenEye speedrunning scene
• Why read_csv performance is a deeply personal crusade
• Lessons from convincing friends to quit finance and go open source
• Founding startups, launching Arrow, and the Ibis origin story
• The beauty of letting contributors take the reins
• Shoutout to Phillip Cloud, pandas’ resident pun master
• Why open communities win, and what it takes to build them

Dec 3, 2025
23 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Welcome to The Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning, digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field.

This episode is part 1 of a conversation with Wes McKinney, open source software developer, author, metalhead, and principal architect at Posit. I'm Michael Chow. Thanks for joining us. All right, Wes, thanks for coming on The Test Set. I'm so — yeah, I feel so privileged and honored to be able to interview you about your contributions to the Python open source ecosystem. And just to be doing this podcast with you feels like a really incredible opportunity. Maybe to start out, do you mind just catching everybody in the world up on what you've worked on?

Origin story: from quant finance to pandas

Well, happy to be here. And yeah, well, it's been a lot of projects over time, but I definitely get asked about this a lot. It's like, how did you start building pandas? And it was — we were just talking earlier today, I think it was 17 years ago this month that I started working on the code base that became pandas. But I got interested in Python as a programming language in my first job out of school. I was a math major and had a little bit of programming experience, but not much. I didn't do computer science. I took a couple of theoretical computer science classes in college. But then I got a job in quant finance and thought that I would be doing financial modeling and stuff like that. It turns out I was doing a lot of data processing and data analysis, data cleaning, that kind of thing.

And I found that working with data was fairly frustrating. So I initially became really interested in how I could build tools for myself to make myself more productive and make it easier to work with data. And I was very influenced by the R ecosystem, R circa 2008. So this was a really long time ago. The tidyverse didn't exist yet. There was ggplot2, but there was no dplyr. A lot of the things that we now take for granted didn't really exist. And as for the development environment, there was no RStudio either. So R was much more primordial in those days. And I was doing a lot of SQL.

And so I started building pandas, which initially was an internal project at AQR, where I worked. I eventually got permission to release that as an open source project. And so I spent a couple of years working on pandas a little bit on the side. It didn't get really that popular until maybe 2011, 2012, when I switched to being a grad student in statistics. I was interested in becoming a statistician because I wanted to get better at data science by learning about statistics. But I saw that Python was getting more popular, and I just felt like I needed to figure out how to work full time on pandas and help grow this ecosystem. It was around that time that I started writing Python for Data Analysis. I ended up dropping out of my PhD and working on pandas full time. I recruited a couple of my colleagues from AQR to come and work with me on that. So that was 2011. I released my book, Python for Data Analysis, in 2012.

And then I started doing a hybrid of open source development and entrepreneurship, because I was trying to figure out how I could build businesses or create some kind of business model to support the open source development I was doing. So I ended up founding a company in 2013 with Chang She, who was one of the first contributors to pandas besides me. He was also at AQR. So we did that startup, and we ended up getting acquired, acquihired, by Cloudera. While I was at Cloudera, a group of us started the Apache Arrow project, which is entering its 10th year of development now. That was a big focus for many years. You mentioned Ibis; that started right around the time we started the Arrow project. Arrow was more about building a better computational foundation for data science, for data frame libraries, but also for interacting with databases in the outside world, file formats, all that fun stuff. And Ibis was a Python project about building a richer user interface: taking the composable data frame concepts that people were familiar with from pandas and bridging them into the SQL database and data warehouse world, which I was seeing a lot of in the big data ecosystem. Ibis is still an active project in development, though I'm not working on it so much lately.

Speed running and community organizing

I feel like there's so much. It's just a tremendous amount of work. And I feel like it's so neat. One thing that really stuck out to me is a lot of the history and not just the creation, which I think you hit on a bit, but all the organizing and getting people together. The first most impressive, maybe not the most impressive example of organizing, but the first case that really stuck out to me was your time speed running. And I don't want to take away from the other things, but I almost do feel like I'd love to explore just all the organizing you've done and almost maybe starting there and working forward to the Python ecosystem.

This is going back in another decade.

Just like dial it back one more.

So in the late 90s, there was a video game that came out in 1997 called GoldenEye 007. It was one of the most popular console first-person shooters ever; I think it was the first really popular console first-person shooter. And it developed this cult following, which included me. For some weird reason, the developers had included a feature in the game where it would show you the fastest time in which you beat each level. I discovered the speedrunning community when it already existed. There was a website run by this fellow Glenn McDiarmid in Australia that assembled people's fastest times playing the game. And for whatever reason, I was really attracted to helping that community thrive and grow. There was an opportunity for me to take over stewardship of the community, and I ran it for a couple of years. I didn't really know that much about web development, but I really did enjoy, as you said, the community development aspect: building processes and thinking about governance and how we make decisions. If you suspected that somebody was cheating at the game, we had to create processes: people would submit videotapes, and we would digitize the videotapes and put them online. So that was my first exposure to community development, building that speedrunning community.

But yeah, so in a sense, when I later got involved in open source development, some of those behaviors were familiar to me from having been active for a few years in that.

And when you were getting involved, how did you find this community? And what stood out to you the most? Being part of the speed running community, what really drew you to it, would you say?

I mean, I think the people in it were mostly interacting through email and eventually message boards. Eventually we had a message board. This was pre-Slack and pre-Discord, way back in ancient history. But I found that the people in the community were a lot like me: they liked video games and they were interested in technology and the internet. And there was this sense that we didn't yet know how to create the right tools and systems to make this speedrunning ecosystem function, but we'd figure it out together.

Eventually the community grew, and I also went to college and got busy. I didn't have time to speedrun. When you're 13, 14 years old, you have infinite time and schoolwork doesn't take that much time. So I stopped having that much time to play games. But yeah, eventually all my times were surpassed, and I'm just a drop in the ocean at this point if you look at the latest GoldenEye world records, which still exist; there are still records being set all the time. It's pretty wild.

Yeah, I'd imagine as time goes on, they're really cooking on it and figuring it out.

Maybe you could psychoanalyze it as that fixation on efficiency and thinking about how to do things optimally, translated in some way to building tools. Later on, it frustrated me that doing some task related to data processing or data science was tedious or inefficient, and so I wanted to find a way to make it fit my brain better or make it more efficient. So in a sense, I guess it's just a personality trait of trying to speedrun the data science workflow a little bit.

Performance and the CSV bottleneck

Yeah, that's fair. And I guess you see that with pandas too. I feel like through your work, there's a lot of optimizing and speeding up.

Yeah, early on in pandas, because it was the only tool around for a lot of the things it was doing, just reading a CSV file and doing some basic data preparation and data manipulation, it was the best, but also effectively the only, tool available in Python for doing that. So if you could make something twice as fast or three times as fast, that would have a meaningful improvement in people's lives. I remember there were some things that were out of my control, for example, getting data out of databases. In my first job, there was a lot of frustration around doing select star from a database table, and that whole pipeline of how do I get this data into pandas so that I can do my data analysis was just completely inaccessible to me. I couldn't do stuff with the database driver; a lot of the machinery there was just out of my reach to really make better. I think that definitely motivated the Arrow work later on.

I do remember, and I think you've maybe talked or written about this too: read_csv is a big one, where making CSV reading in pandas fast, as it turned out, was its own whole project.

Yeah, it's a huge bottleneck. And this is going back to 2011. I had built a fairly primitive CSV reader for pandas that was written in pure Python, but there really was no other tool in Python that did the equivalent of read.csv in R. I think there was also an early version of fread in data.table at the time, so there was a faster version of CSV reading in R. But it's this kind of small, self-contained problem, and even though it's a little bit boring, it's weirdly appealing to work on, because everybody's got CSV files, as much as we talk about how much we hate them. So you can improve that fundamental workflow, whether it's making the CSV reader automatically recognize more weird scenarios, like weird delimiters or other things that often foul up CSV readers, or making it twice as fast. A lot of data analysis work, even still, happens on a laptop with a 100 megabyte or smaller CSV file. But if you're reading it a lot of times and doing the same analysis over and over, and you can take something that takes three seconds down to 300 milliseconds, that's a big deal in terms of: you press return, and how long do you have to wait to rerun your whole analysis? It might be that 50% of the time is just reading the CSV file, which is pretty annoying.
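To make the rerun-loop arithmetic concrete, here is a minimal, hypothetical sketch that times repeated pandas.read_csv calls over a small in-memory CSV. The row count, column names, and loop count are invented for illustration; the point is just that the same file gets parsed on every rerun of an analysis.

```python
# Hypothetical sketch: time rereading the same CSV on every "rerun".
import io
import time

import pandas as pd

# Build a toy CSV in memory (sizes and column names are made up).
rows = "\n".join(f"{i},{i * 0.5},label_{i % 3}" for i in range(10_000))
csv_text = "id,value,category\n" + rows

start = time.perf_counter()
for _ in range(5):  # the "read it again every run" pattern
    df = pd.read_csv(io.StringIO(csv_text))
elapsed = time.perf_counter() - start

print(df.shape)  # (10000, 3)
print(f"5 reads took {elapsed:.3f}s")
```

If parsing dominates that elapsed time, halving the parser's cost halves the wait on every single rerun, which is why a faster read_csv felt so worthwhile.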

Building the early pandas team

Yeah. I'm also curious. I know you've mentioned before that in your first year or so on pandas, you convinced some friends to go at it with you. I'm really curious what that looked like in the early days: convincing your buds to abandon their jobs?

Well, it was a little more complicated than that. I worked with Adam Klein and Chang She at AQR, and internally we had been, call it, partners in crime in evangelizing Python and building early tooling around Python there. They saw the potential of Python as a language that could straddle the interactive computing and data analysis world and the production software world. And in particular, we saw a lot of potential in finance. But as we explored the ecosystem and talked to a lot of banks and hedge funds, we found that it was going to be difficult to build a toolkit we could license, because a lot of financial firms want to build all of their own tools, top to bottom. Even though we were former quants ourselves, getting to the point where they could license our toolkit and depend on us as a company, in addition to adopting Python at that time, which was 2011, that was a bridge too far. So ultimately Adam ended up going back into finance, and Chang and I decided to do a different startup. Not totally unrelated, because it was still using Python and the Python data stack, but we decided to build more of a visual analytics, data exploration tool company. That's what brought us from New York to the West Coast, because that's where the center of gravity for those types of tools was in those days.

Gotcha. And I guess at that point the community had kind of started to pick up a lot of the maintenance of pandas?

Yeah. It was very fortunate that right around the time we were figuring out what kind of startup to do, the pandas community started getting new core team members. Two of the first major contributors to pandas were Jeff Reback and Phillip Cloud. I remember the first time I got a contribution from Jeff Reback: he sent me an email saying, look, I edited these files in pandas and made them better, will you accept my changes? This was pre GitHub pull requests, I guess. I think GitHub existed at the time, but pandas was on Google Code and hadn't yet adopted GitHub. They were essential to the project becoming more than just my passion project. Eventually it got to a point in early 2013 where I felt comfortable handing off the project to them and saying, I trust you, we've been working together for more than a year on this, and I'm doing the startup. And they really took it and ran with it. I frankly credit them and the other people that came out of the woodwork and became really active in the project. I think I helped seed the project by developing it for the first five years and getting that first critical mass of contributors, but going from 10 or 20 contributors to something like 2,000, I really have to give them the credit for essentially building out the project and nurturing it over these last 11 or 12 years.

Yeah. It's really wild, the scale of some of these Python projects like pandas: the sheer number of contributors and the scope they have to wrangle. It seems like a really heroic effort.

Yeah. I mean, I guess pandas became this gravity well for anything having to do with data manipulation or data wrangling. But the community has done an amazing job. They've also done a lot in terms of community outreach, saying: we want all kinds of contributions to this project, we want help with documentation; even small things in an open source project make a difference. By really casting a wide net and making the tent bigger, they've invited as many people from all around the world as possible to contribute, both big and small. You do see some open source projects where the contributors don't communicate so openly, or there's a small, insular group of developers that doesn't try to recruit and bring people into the project and be inclusive. So having international documentation sprints and community events to get people involved in pandas development has gone a long way toward saying: all are welcome, come here, work on this project, and make something better that you use all the time. I think that's clearly been a success. I don't know the exact number of unique contributors to pandas, but I think it's several thousand at least at this point.

Phillip Cloud and the Ibis project

I'm curious too about Phillip Cloud, because I know Phillip, and I actually love interacting with him. I don't know a better way to say this: he's super funny as a person. He's just got a lot of zingers, I feel like.

Well, one of Phillip's main things is that he loves puns. He's the person you can always count on to not miss a pun.

Yeah, I do feel like we all need a pun person in our lives. All lives are made better with a pun person. I'm glad that pandas got its resident punner.

What's interesting about Phillip is that he came from more of a science and research background, and that's how he initially got involved in pandas. Then he worked at both Anaconda (formerly Continuum Analytics) and Facebook. So he got a lot of exposure to the development of the Python data science world in the early 2010s, as well as, at Facebook, now Meta, what enterprise ETL, SQL, data warehousing, and data management look like at large scale. I think it was while he was working for Facebook that he saw I was working on Ibis. The appeal of working on a framework that could translate between the SQL data warehouse world and the data frame world of pandas was really strong. Ibis is essentially a query transpiler. It gives you an API for expressing data frame operations, with the benefits of code reuse and programming in Python, tab completion, all the things that SQL doesn't give you: reusable functions, basically.
So you build these big expressions, and then you say, I want to run this on this SQL engine, and it will give you a SQL query that runs on BigQuery, or on PostgreSQL, or SQLite, or what have you. Now the default backend for Ibis is DuckDB, but when Ibis started, DuckDB didn't exist at all. So anyway, having him involved in the project, and having the practical experience of coming from that large data warehousing environment at Facebook, was really critical for the development of the Ibis project. And I've looked for ways to continue working with him in any capacity over the last 10 years. It's really wild; it's been getting on 13 or 14 years since we started working on pandas together.
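As a rough illustration of the "build an expression, then render SQL" idea Wes describes, here is a toy deferred-query builder. This is not Ibis's actual API or implementation (real Ibis builds typed expression trees and targets many backends); the Table class, its method names, and the trades example are all invented for the sketch.

```python
# Toy sketch of the query-transpiler idea: record operations in Python,
# then render them to a SQL string. NOT Ibis's real design — just the concept.

class Table:
    def __init__(self, name, columns):
        self.name = name
        self.columns = columns
        self._filters = []              # deferred WHERE clauses
        self._selected = list(columns)  # deferred SELECT list

    def filter(self, condition):
        # Record the condition; nothing executes yet.
        self._filters.append(condition)
        return self

    def select(self, *cols):
        self._selected = list(cols)
        return self

    def to_sql(self):
        # Only here does the deferred expression become a query.
        sql = f"SELECT {', '.join(self._selected)} FROM {self.name}"
        if self._filters:
            sql += " WHERE " + " AND ".join(self._filters)
        return sql

trades = Table("trades", ["symbol", "price", "qty"])
query = trades.filter("price > 100").select("symbol", "qty")
print(query.to_sql())
# SELECT symbol, qty FROM trades WHERE price > 100
```

The key design point, which Ibis shares, is that nothing runs when you chain the methods; the expression is only turned into SQL (for whichever engine you target) at the end.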

Yeah. And how does that work? Do y'all have some kind of chat? Is Phillip like, yo, I'm looking to get into Ibis? Or is he working on pandas, y'all are collaborating on pandas, and then he slides over to Ibis?

I mean, he just started sending pull requests, initially. I'd have to look back, but I'm sure I sent him an email saying, hey, you're interested in this, tell me more: why are you interested in contributing? So we developed a collaboration on that, and then, fast forward several years, around 2015 he got involved in Ibis development. Then about six years later, when Ibis was about six years old and Arrow had been in development for six years, I founded a company called Voltron Data with a group of people. We decided we were going to make a big investment in Ibis as a user interface to a lot of the tools for accelerated computing that we were building. Phillip was basically the first person I called, because he was the perfect person: he knew a lot about this and had already done a lot more development on Ibis than I had at that point. By that time there was also an additional collection of people who'd gotten involved in Ibis development. So even though it's not as popular or well known a project as pandas, it's still developed a healthy and reasonably thriving community around it: people who are passionate about Python but also need to interact with all these different SQL systems, and who want to write more Python and less SQL.

Right. It seems nice. I feel like when you're working on open source, it's kind of the dream to get a call and be asked, would you want to work on this full time, as a job? That's a real dream scenario.

Right. Absolutely.

The Test Set is a production of Posit, PBC, an open source and enterprise data science software company. This episode was produced in collaboration with branding and design agency, Agi. For more episodes, visit thetestset.co or find us on your favorite podcast platform.