Resources

Wes McKinney: Part 1 — Building Pandas, Arrow and a speedrunning legacy

Wes McKinney’s fingerprints are all over the modern data stack, from inventing pandas to co-creating Arrow. But before all that, Wes was organizing speedrun communities and hacking together better ways to wrangle datasets in finance. In this conversation, he shares his origin story and what makes good tools good. In this episode of The Test Set, we talk with Wes McKinney about the origin story of pandas, what he learned from the R ecosystem, and his legacy as a community organizer for GoldenEye 007 speedruns (and how this shaped his approach to building tools). We dig into the early days of open source Python, the evolution of pandas, and the rise of Arrow and Ibis. Wes also shares his thoughts on community stewardship and the power of letting go.

What’s Inside:
• How frustration with data work led Wes to build pandas (and leave a PhD)
• A nostalgic dive into the GoldenEye speedrunning scene
• Why read_csv performance is a deeply personal crusade
• Lessons from convincing friends to quit finance and go open source
• Founding startups, launching Arrow, and the Ibis origin story
• The beauty of letting contributors take the reins
• Shoutout to Phillip Cloud, pandas’ resident pun master
• Why open communities win, and what it takes to build them

Dec 3, 2025
23 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Welcome to The Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning, digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field.

This episode is part 1 of a conversation with Wes McKinney, open source software developer, author, metalhead, and principal architect at Posit. I'm Michael Chow. Thanks for joining us. All right, Wes, thanks for coming on The Test Set. I'm so — yeah, I feel so privileged and honored to be able to interview you about your contributions to the Python open source ecosystem. And just to be doing this podcast with you feels like a really incredible opportunity. Maybe to start out, do you mind just catching everybody in the world up on what you've worked on?

Origin story: from quant finance to pandas

Well, happy to be here. And yeah, well, it's been a lot of projects over time, but I definitely get asked about this a lot. It's like, how did you start building pandas? And it was — we were just talking earlier today, I think it was 17 years ago this month that I started working on the code base that became pandas. But I got interested in Python as a programming language in my first job out of school. I was a math major and had a little bit of programming experience, but not much. I didn't do computer science. I took a couple of theoretical computer science classes in college. But then I got a job in quant finance and thought that I would be doing financial modeling and stuff like that. It turns out I was doing a lot of data processing and data analysis, data cleaning, that kind of thing.

And I found that working with data was fairly frustrating. So I initially became really interested in how I could build tools for myself to make myself more productive and make it easier to work with data. And I was very influenced by the R ecosystem, R circa 2008. So this was a really long time ago. The tidyverse didn't exist yet. There was ggplot2, but there was no dplyr. A lot of the things that we now take for granted didn't really exist. And as for the development environment, there was no RStudio either. So R was much more primordial in those days. And I was doing a lot of SQL.

And so I started building pandas, which initially was an internal project at AQR, where I worked. I eventually got permission to release that as an open source project. And so I spent a couple of years working on pandas a little bit on the side. It didn't get really that popular until maybe 2011, 2012, when I switched to being a grad student in statistics. I was interested in becoming a statistician because I wanted to get better at data science by learning about statistics. But I saw that Python was getting more popular, and I just felt like I needed to figure out how to work full time on pandas and help grow this ecosystem. It was around that time that I started writing Python for Data Analysis. I ended up dropping out of my PhD and working on pandas full time. I recruited a couple of my colleagues from AQR to come and work with me on that. So that was 2011. I released my book, Python for Data Analysis, in 2012.

And then I started doing a hybrid of open source development and entrepreneurship, because I was trying to figure out how I could build businesses or create some kind of business model to support the open source development I was doing. So I ended up founding a company in 2013 with Chang She, who was one of the first contributors to pandas besides me. He was also at AQR. So we did that startup, and we ended up getting acquired, acquihired, by Cloudera. While I was at Cloudera, a group of us started the Apache Arrow project, which is entering its 10th year of development now. That was a big focus for many years. You mentioned Ibis; that started right around the time we started the Arrow project. Arrow was more about building a better computational foundation for data science, for data frame libraries, but also for interacting with databases in the outside world, file formats, all that fun stuff. And Ibis was a Python project about building a richer user interface: taking the composable data frame concepts that people were familiar with from pandas and bridging them into the SQL database and data warehouse world, which I was seeing a lot of in the big data ecosystem. Ibis is still an active project in development, though I'm not working on it so much lately.

Speed running and community organizing

I feel like there's so much. It's just a tremendous amount of work. And I feel like it's so neat. One thing that really stuck out to me is a lot of the history and not just the creation, which I think you hit on a bit, but all the organizing and getting people together. The first most impressive, maybe not the most impressive example of organizing, but the first case that really stuck out to me was your time speed running. And I don't want to take away from the other things, but I almost do feel like I'd love to explore just all the organizing you've done and almost maybe starting there and working forward to the Python ecosystem.

This is going back in another decade.

Just like dial it back one more.

So in the late 90s, there was a video game that came out in 1997 called GoldenEye 007. It was one of the most popular console first-person shooters ever; I think it was the first really popular console first-person shooter. And it developed this cult following, which included me. For some weird reason, the developers had included a feature in the game where it would show you the fastest time in which you beat each level. I discovered the speedrunning community when it already existed. There was a website run by this fellow Glenn McDiarmid in Australia that assembled people's fastest times playing the game. And for whatever reason, I was really attracted to helping that community thrive and grow. There was an opportunity for me to take over stewardship of the community, and I ran it for a couple of years. I didn't really know that much about web development, but I really did enjoy, as you said, the community development aspect: building processes and thinking about governance and how we make decisions. If you suspected that somebody was cheating at the game, we had to create processes: people would submit videotapes, and we would digitize the videotapes and put them online. So that was my first exposure to community development, building that speedrunning community.

But yeah, so in a sense, when I later got involved in open source development, some of those behaviors were familiar to me from having been active for a few years in that.

And when you were getting involved, how did you find this community? And what stood out to you the most? Being part of the speed running community, what really drew you to it, would you say?

I mean, I think the people in it were mostly interacting through email and eventually message boards. Eventually we had a message board. This was pre-Slack and pre-Discord, way back in ancient history. But I found that the people in the community were a lot like me: they liked video games and they were interested in technology and the internet. And there was this sense that we didn't yet know how to create the right tools and systems to make this speedrunning ecosystem function, but we'd figure it out together.

Eventually the community grew, and I also went to college and got busy. I didn't have time to speedrun. When you're 13, 14 years old, you have infinite time and schoolwork doesn't take that much time. So I stopped having that much time to play games. But yeah, eventually all my times were surpassed, and I'm just a drop in the ocean at this point if you look at the latest GoldenEye world records, which still exist; there are still records being set all the time. It's pretty wild.

Yeah, I'd imagine as time goes on, they're really cooking on it and figuring it out.

Maybe you could psychoanalyze it as that fixation on efficiency and thinking about how to do things optimally, translated in some way to building tools. Later on, it frustrated me that doing some task related to data processing or data science was tedious or inefficient, and so I wanted to find a way to make it fit my brain better or make it more efficient. So in a sense, I guess it's just a personality trait of trying to speedrun the data science workflow a little bit.

Performance and the CSV bottleneck

Yeah, that's fair. And I guess you see that with pandas too. I feel like through your work, there's a lot of optimizing and speeding up.

Yeah, early on in pandas, because it was the only tool around for a lot of the things it was doing, just reading a CSV file and doing some basic data preparation and data manipulation, it was the best, but also effectively the only, tool available in Python for doing that. So if you could make something twice as fast or three times as fast, that would have a meaningful improvement in people's lives. I remember there were some things that were out of my control, for example, getting data out of databases. In my first job, there was a lot of frustration around doing select star from a database table, and that whole pipeline of how do I get this data into pandas so that I can do my data analysis was just completely inaccessible to me. I couldn't do stuff with the database driver; a lot of the machinery there was just out of my reach to really make better. I think that definitely motivated the Arrow work later on.

I do remember, and I think you've maybe talked or written about this too: read_csv is a big one, where making CSV reading in pandas fast, as it turned out, was its own whole project.

Yeah, it's a huge bottleneck. And this is going back to 2011. I had built a fairly primitive CSV reader for pandas that was written in pure Python, but there really was no other tool in Python that did the equivalent of read.csv in R. I think there was also an early version of fread in data.table at the time, so there was a faster version of CSV reading in R. But it's this kind of small, self-contained problem, and even though it's a little bit boring, it's weirdly appealing to work on, because everybody's got CSV files, as much as we talk about how much we hate them. So you can improve that fundamental workflow, whether it's making the CSV reader automatically recognize more weird scenarios, like weird delimiters or other things that often foul up CSV readers, or making it twice as fast. A lot of data analysis work, even still, happens on a laptop with a 100 megabyte or smaller CSV file. But if you're reading it a lot of times and doing the same analysis over and over, and you can take something that takes three seconds down to 300 milliseconds, that's a big deal in terms of: you press return, and how long do you have to wait to rerun your whole analysis? It might be that 50% of the time is just reading the CSV file, which is pretty annoying.
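To make the rerun-loop arithmetic concrete, here is a minimal, hypothetical sketch that times repeated pandas.read_csv calls over a small in-memory CSV. The row count, column names, and loop count are invented for illustration; the point is just that the same file gets parsed on every rerun of an analysis.

```python
# Hypothetical sketch: time rereading the same CSV on every "rerun".
import io
import time

import pandas as pd

# Build a toy CSV in memory (sizes and column names are made up).
rows = "\n".join(f"{i},{i * 0.5},label_{i % 3}" for i in range(10_000))
csv_text = "id,value,category\n" + rows

start = time.perf_counter()
for _ in range(5):  # the "read it again every run" pattern
    df = pd.read_csv(io.StringIO(csv_text))
elapsed = time.perf_counter() - start

print(df.shape)  # (10000, 3)
print(f"5 reads took {elapsed:.3f}s")
```

If parsing dominates that elapsed time, halving the parser's cost halves the wait on every single rerun, which is why a faster read_csv felt so worthwhile.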

Building the early pandas team

Yeah. I'm also curious. I know you've mentioned before that in your first year or so on pandas, you convinced some friends to go at it with you. I'm really curious what that looked like in the early days: convincing your buds to abandon their jobs?

Well, it was a little more complicated than that. I worked with Adam Klein and Chang She at AQR, and internally we had been, call it, partners in crime in evangelizing Python and building early tooling around Python there. They saw the potential of Python as a language that could straddle the interactive computing and data analysis world and the production software world. And in particular, we saw a lot of potential in finance. But as we explored the ecosystem and talked to a lot of banks and hedge funds, we found that it was going to be difficult to build a toolkit we could license, because a lot of financial firms want to build all of their own tools, top to bottom. Even though we were former quants ourselves, getting to the point where they could license our toolkit and depend on us as a company, in addition to adopting Python at that time, which was 2011, that was a bridge too far. So ultimately Adam ended up going back into finance, and Chang and I decided to do a different startup. Not totally unrelated, because it was still using Python and the Python data stack, but we decided to build more of a visual analytics, data exploration tool company. That's what brought us from New York to the West Coast, because that's where the center of gravity for those types of tools was in those days.

Gotcha. And I guess at that point the community had kind of started to pick up a lot of the maintenance of pandas?

Yeah. It was very fortunate that right around the time we were figuring out what kind of startup to do, the pandas community started getting new core team members. Two of the first major contributors to pandas were Jeff Reback and Phillip Cloud. I remember the first time I got a contribution from Jeff Reback: he sent me an email saying, look, I edited these files in pandas and made them better, will you accept my changes? This was pre GitHub pull requests, I guess. I think GitHub existed at the time, but pandas was on Google Code and hadn't yet adopted GitHub. They were essential to the project becoming more than just my passion project. Eventually it got to a point in early 2013 where I felt comfortable handing off the project to them and saying, I trust you, we've been working together for more than a year on this, and I'm doing the startup. And they really took it and ran with it. I frankly credit them and the other people that came out of the woodwork and became really active in the project. I think I helped seed the project by developing it for the first five years and getting that first critical mass of contributors, but going from 10 or 20 contributors to something like 2,000, I really have to give them the credit for essentially building out the project and nurturing it over these last 11 or 12 years.

Yeah. It's really wild, the scale of some of these Python projects like pandas: the sheer number of contributors and the scope they have to wrangle. It seems like a really heroic effort.

Yeah. I mean, I guess pandas became this gravity well for anything having to do with data manipulation or data wrangling. But the community has done an amazing job. They've also done a lot in terms of community outreach, saying: we want all kinds of contributions to this project, we want help with documentation; even small things in an open source project make a difference. By really casting a wide net and making the tent bigger, they've invited as many people from all around the world as possible to contribute, both big and small. You do see some open source projects where the contributors don't communicate so openly, or there's a small, insular group of developers that doesn't try to recruit and bring people into the project and be inclusive. So having international documentation sprints and community events to get people involved in pandas development has gone a long way toward saying: all are welcome, come here, work on this project, and make something better that you use all the time. I think that's clearly been a success. I don't know the exact number of unique contributors to pandas, but I think it's several thousand at least at this point.

Phillip Cloud and the Ibis project

I'm curious too about Phillip Cloud, because I know Phillip, and I actually love interacting with him. I don't know a better way to say this: he's super funny as a person. He's just got a lot of zingers, I feel like.

Well, one of Phillip's main things is that he loves puns. He's the person you can always count on to not miss a pun.

Yeah, I do feel like we all need a pun person in our lives. All lives are made better with a pun person. I'm glad that pandas got its resident punner.

What's interesting about Phillip is that he came from more of a science and research background, and that's how he initially got involved in pandas. Then he worked at both Anaconda (formerly Continuum Analytics) and Facebook. So he got a lot of exposure to the development of the Python data science world in the early 2010s, as well as, at Facebook, now Meta, what enterprise ETL, SQL, data warehousing, and data management look like at large scale. I think it was while he was working for Facebook that he saw I was working on Ibis. The appeal of working on a framework that could translate between the SQL data warehouse world and the data frame world of pandas was really strong. Ibis is essentially a query transpiler. It gives you an API for expressing data frame operations, with the benefits of code reuse and programming in Python, tab completion, all the things that SQL doesn't give you: reusable functions, basically.
So you build these big expressions, and then you say, I want to run this on this SQL engine, and it will give you a SQL query that runs on BigQuery, or on PostgreSQL, or SQLite, or what have you. Now the default backend for Ibis is DuckDB, but when Ibis started, DuckDB didn't exist at all. So anyway, having him involved in the project, and having the practical experience of coming from that large data warehousing environment at Facebook, was really critical for the development of the Ibis project. And I've looked for ways to continue working with him in any capacity over the last 10 years. It's really wild; it's been getting on 13 or 14 years since we started working on pandas together.
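As a rough illustration of the "build an expression, then render SQL" idea Wes describes, here is a toy deferred-query builder. This is not Ibis's actual API or implementation (real Ibis builds typed expression trees and targets many backends); the Table class, its method names, and the trades example are all invented for the sketch.

```python
# Toy sketch of the query-transpiler idea: record operations in Python,
# then render them to a SQL string. NOT Ibis's real design — just the concept.

class Table:
    def __init__(self, name, columns):
        self.name = name
        self.columns = columns
        self._filters = []              # deferred WHERE clauses
        self._selected = list(columns)  # deferred SELECT list

    def filter(self, condition):
        # Record the condition; nothing executes yet.
        self._filters.append(condition)
        return self

    def select(self, *cols):
        self._selected = list(cols)
        return self

    def to_sql(self):
        # Only here does the deferred expression become a query.
        sql = f"SELECT {', '.join(self._selected)} FROM {self.name}"
        if self._filters:
            sql += " WHERE " + " AND ".join(self._filters)
        return sql

trades = Table("trades", ["symbol", "price", "qty"])
query = trades.filter("price > 100").select("symbol", "qty")
print(query.to_sql())
# SELECT symbol, qty FROM trades WHERE price > 100
```

The key design point, which Ibis shares, is that nothing runs when you chain the methods; the expression is only turned into SQL (for whichever engine you target) at the end.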

Yeah. And how does that work? Do y'all have some kind of chat? Is Phillip like, yo, I'm looking to get into Ibis? Or is he working on pandas, y'all are collaborating on pandas, and then he slides over to Ibis?

I mean, he just started sending pull requests, initially. I'd have to look back, but I'm sure I sent him an email saying, hey, you're interested in this, tell me more: why are you interested in contributing? So we developed a collaboration on that, and then, fast forward several years, around 2015 he got involved in Ibis development. Then about six years later, when Ibis was about six years old and Arrow had been in development for six years, I founded a company called Voltron Data with a group of people. We decided we were going to make a big investment in Ibis as a user interface to a lot of the tools for accelerated computing that we were building. Phillip was basically the first person I called, because he was the perfect person: he knew a lot about this and had already done a lot more development on Ibis than I had at that point. By that time there was also an additional collection of people who'd gotten involved in Ibis development. So even though it's not as popular or well known a project as pandas, it's still developed a healthy and reasonably thriving community around it: people who are passionate about Python but also need to interact with all these different SQL systems, and who want to write more Python and less SQL.

Right. It seems nice. I feel like when you're working on open source, it's kind of the dream to get a call and be asked, would you want to work on this full time, as a job? That's a real dream scenario.

Right. Absolutely.

The Test Set is a production of Posit, PBC, an open source and enterprise data science software company. This episode was produced in collaboration with branding and design agency, Agi. For more episodes, visit thetestset.co or find us on your favorite podcast platform.