Resources

Michael Chow: From psychology and Python to constrained creativity

For this one, we turn the mic around. Wes McKinney takes over the interviewer’s chair to chat with his co-host, Michael Chow. Michael’s a principal software engineer at Posit, but he started out studying how people think — literally, with a PhD in cognitive psychology. Somewhere along the way, he got hooked on data science, helped build adaptive learning tools at DataCamp, and now spends his days thinking about how to make Python easier to use and more fun. The two dig into what drives Michael’s curiosity, how a “weird obsession with tables” turned into a beloved open source project, and the future of data science and data scientists.

We explore Michael’s path from studying the mind to shaping the Python data science ecosystem. From adaptive learning platforms to Great Tables, Michael shares how following unexpected curiosities can spark tools and communities that last.

What’s Inside:

• Michael’s pivot from an academic career to data science
• Behind-the-scenes messiness of building data and learning platforms
• Open source projects born out of zany, single-minded passions
• Bringing beauty to rows and columns
• Big-picture thoughts on where data science — and open source tooling — are headed

Dec 3, 2025
1h 7min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Welcome to the Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning. Digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field.

Hi everyone, I'm Wes McKinney, Principal Architect at Posit and co-host of the Test Set podcast. A little while ago, we recorded and released a number of interviews where Michael Chow was the host. I was interviewed along with Hadley Wickham, Roger Peng, and Mine Çetinkaya-Rundel. We wanted to turn the tables and interview Michael so you can learn more about him as we continue to build and produce this podcast. Michael Chow is a Principal Software Engineer at Posit and a key contributor to the Python community. He received a PhD in Cognitive Psychology from Princeton University and is interested in what drives expert data science performance. This led him to build DataCamp's Signal, an adaptive test of data science skill.

So I'll turn it over to you, Michael, to give a little brief intro to yourself and your career up until this point, and then we'll dive into all of the things that make you tick and the things that you're thinking about these days. I've got a lot on my mind with everything that's going on with AI, so I'm sure we have a lot to talk about.

Yeah, excited for the inevitable AI blurb.

And I have to say, as a person who works full-time on Python open source at Posit, formerly RStudio, I'm a giant fan of all your work in Python, so I'm excited to be able to talk about all things data, but also what's going on in the Python ecosystem. As a little background: I did a PhD in Cognitive Psychology, and a few years in, I realized that I wasn't sure I wanted to fight tooth and claw for an academic position, so I started doing way more Python and R for data analysis. And then, after graduating, I was able to go to DataCamp, where I worked on how we grade people's submissions: how do instructors who are writing these interactive exercises write code to grade the code that learners submit for those exercises? So, a very meta activity.

Yeah, so basically, from there, I did a lot more data science on how people learn on this interactive platform, and that led me into: how do I best analyze data in Python and R? And then finally that took me to Posit, where I'm really excited to work on open source tools around data analysis, and to think about the nicest things we could do during data analysis to speed people up, or help them go end-to-end from raw data to some kind of report or insight.

Background and tool building

It's interesting because I found in our field of open source data science tool builders that there's a lot of people who started out in a scientific field. So cognitive psychology is a heavily statistical field where there's a lot of modeling and statistics and signal processing and latent factor models and that sort of thing. Myself, I started out in mathematics and then did quant finance and then statistics, more Bayesian statistics. But I was also attracted to the tool-building aspect because of how difficult it was to work with data, and I found that just building tools and accelerating workflows and making people more productive was really satisfying. But I think it's important to have that background in science and actually doing statistics, doing data science to know what do good ergonomics look like.

I remember the first time that you came on my radar was through the siuba project, and more recently you've worked on Great Tables and quartodoc. Great Tables is a port, or a reimagining, of the gt R package, bringing rich and beautiful tables to Python. Having spent a lot of time formatting and producing tabular output for pandas, it's nice that that work can be delegated to Great Tables now, because it can be extremely tedious to produce these nitty-gritty tables that have all the bells and whistles. I guess you would know as much about that as anyone nowadays.

Yeah, I feel like I'm still learning a lot about that world. For a little background, I joined up with Rich, who did the gt package in R. And it is funny, tables were not on my radar. I'd used the pandas Styler a little bit, but this idea of being so obsessed with how to style a table and produce a nice table was almost fascinating to me: Rich was just obsessed with this one singular activity. It was really exciting to be able to ride along with that and figure out how, in Python, we make it really easy to do. But I do love how these activities are super zany. Formatting a table is a really specific, tiny thing to get into.

I love that Rich was just obsessed with this one activity and just digging deeper and deeper into it. Yeah, well, I think it's cool also, now, with the development of new DataFrame libraries like Polars. In the past, we wanted to ship plotting, table rendering, and table styling in pandas, and so we built those things inside pandas. But a library like Polars is starting from scratch, and there's now a much richer ecosystem of open source libraries. So it's an easier decision to say: let's delegate to a library like Great Tables, rather than Polars implementing all of its table styling itself.

So how have you found some of those types of collaborations in the open source ecosystem? I mean, because we integrated with Polars really early in Great Tables, I do feel like it was almost the perfect time, to your point. Polars has this thing called lazy expressions, and we could tell them: hey, we think that if we just take the job of displaying tables and making them pretty, we can use your lazy expressions to make it really easy to, say, highlight a specific row, like the row where some value is at its maximum in the table. We can do that with your expressions, and you can focus on how people compute things in Polars and use a data frame. We can borrow this one piece and do this job you don't want, or probably don't get a ton of joy out of.

I think you're right that it's tricky. In pandas, the Styler was kind of rolled into pandas, and that is a hard decision library builders have to make: can I quickly provide a good-enough way to do this in our tool, or should we really encourage someone else to do it? And one of the challenges, I'd imagine, for pandas is that once you roll it in, you sort of have to maintain it. And then if no one comes along who's really gassed up about the problem, it can be kind of tricky to keep maintaining all these pieces.

Yeah. And one issue that I've seen is that you also end up with a tightly coupled release process, where maybe you want to release bug fixes or incremental improvements in some of these presentation and styling layers, but those end up being gated. One frustration we've seen is that whenever pandas makes a major release and somebody then finds a bug in one of the peripheral parts of the project, there has to be a bug-fix or patch release, and that can get hung up for a number of weeks. But if you encounter a bug in Great Tables, you can roll out a new release with fixes much more easily.

What drives Michael's curiosity

So I wanted to get back to kind of maybe rewinding a little bit just like how you got involved in this field and sort of what makes you tick. So I'm curious, like, you know, your core motivations, like what gets you out of bed in the morning personally and professionally? And, you know, what are some of the things that help keep you engaged in this space of open source tool building?

When I was an undergrad, something that really appealed to me was my statistics advisor saying something like, statisticians get to play in everyone's backyard. And I feel like something about that I just loved, like that statistics is used in so many fields that you can sort of collaborate broadly across a lot of different fields of research. And I think with tools, it's sort of the same that like a tool builder also often gets to play in everyone's backyard, that there are all these people doing data analysis in different domains. And I think what gets me up is I've really liked diving into a few domains where I feel like data is used for incredible impact.

So I think, for example, the legal system: data is used incredibly creatively there, oftentimes for the public good by nonprofits. And then the other one is transit. Public transit is a really nice area where we interact with data all the time, like when's my train coming, when's my bus coming? And advances like real-time tracking have gone super far. But also being able to report how busy our network is, and where, and to design effective transit networks, I find to be a really impactful data problem. So as a tool builder, I've really tried to ground myself in different areas. And I don't know if you've ever felt this way, but sometimes the longer I spend on open source tools, the less I actually do data analysis. So I've really tried to, yeah, just touch grass and analyze some data.

Yeah. For me, one of the challenges also is that the feedback that you get from the open source community tends to be fairly delayed. And I think for some new open source contributors, that can be a little discouraging. But the reality is that it takes a while for the software to trickle out and for people to learn how to use something new. In my experience, six months is my expectation for when I can really start getting some feedback about something new I've built.

Yeah, it's an interesting question, because I think you're right that if you're putting open source tools out, it could be quite a long time before someone picks them up. Or you might even be frustrated and notice nobody's picked it up; your biggest risk is maybe obscurity sometimes. But one interesting in-between that I've experienced is helping build tools internally in organizations. There are a lot of tool builders who aren't quite doing open source, but they start internally inside a company, building a package that does a thing, and they have really good feedback loops with the team that's using it in the company. And then maybe after a while they bring it into open source, or it inspires a similar tool. I find that arc to be really powerful: you can't be precious about it, because you're building the tool for a specific group of people in your company.

Suba, Ibis, and lazy expressions

I will say siuba was a little bit like that. So, for some background, siuba is a tool to analyze data in Python, and it sort of wrapped pandas. I would say its big thing is that it implemented lazy expressions, so that you could say what you want to do.

Yeah, the way I describe it to people is: dplyr is a library that Hadley Wickham created for R, which provides these lazy table expressions. I had created a project called Ibis, which was a similar concept, but it wasn't trying to pattern-match or implement the exact same API flow as dplyr; it was its own portable data frame API. Whereas what was cool about siuba was that if you had used dplyr and were working in Python, you could pick up siuba and see the same concepts expressed. You could port between dplyr expressions and siuba expressions fairly comfortably.

I think this is such a good example, because I was building siuba basically to defeat my boss at data analysis. Ibis had been out for a couple of years. So you got nerd-sniped, essentially. Yeah, by some kind of spite arc against my boss. But I will say, just to foreshadow: today, if people asked, should I use siuba, I would probably just point them to Ibis. And essentially, a lot of it boils down to Ibis also implementing lazy expressions. In the PR, they mentioned that siuba has these lazy expressions using the underscore, and asked, would that be useful? At the time I was like, oh, that's interesting and neat. But now I really appreciate and respect the Ibis team, especially Phillip Cloud and, you know, your work.

Yeah, I mean, I think it was cool, because there was a period of time, during Ibis's early development (it first started around 2015, so say the first five years of the project), when the Python community was trying to figure out this lazy table expression API. What is a good API? What do good API ergonomics look like? How do we get around the limitation of not having non-standard evaluation? That's the feature in R where you can late-bind variables based on context: in an expression inside a function call, you can refer to column names in a data frame, and within the evaluation layer those are late-bound to the physical columns within the table. Whereas in Python, you have to create actual objects that either are the column, or are an expression object that can be composed with other expressions.
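As an illustration of that last point, here is a toy "underscore" expression builder in the spirit of siuba's `_` (a simplified sketch, not siuba's real implementation): operator overloading builds up a deferred function that only runs when handed concrete data.

```python
# Toy lazy-expression builder: attribute access and comparisons construct a
# deferred function instead of evaluating immediately.
class Lazy:
    def __init__(self, fn=lambda row: row):
        self._fn = fn

    def __getattr__(self, name):
        # _.age builds "look up 'age' in the row" without running it.
        return Lazy(lambda row, f=self._fn, n=name: f(row)[n])

    def __gt__(self, other):
        # _.age > 30 builds a deferred predicate.
        return Lazy(lambda row, f=self._fn: f(row) > other)

    def __call__(self, row):
        # Only here is the accumulated description finally evaluated.
        return self._fn(row)

_ = Lazy()
expr = _.age > 30          # nothing runs yet; expr is a description
print(expr({"age": 42}))   # -> True
print(expr({"age": 12}))   # -> False
```

A real library builds the same idea out to arithmetic, method calls, and whole-column semantics, but the core trick is the same: dunder methods record the computation for later.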

So there were a bunch of projects. There was the Blaze project at Continuum Analytics, now Anaconda, then I started Ibis, and then came siuba. I think that cross-pollination of ideas, like the underscore expression builder, was really cool, and something that eventually got adopted in Ibis. We emerged with some really nice tools on the other end. And I think all of this work directly or indirectly influenced the design of Polars and its expression system, which I think works really well.

And people seem to be happy with it. There's another new project, Narwhals. I don't know if you've looked at Narwhals, but it's basically a Polars API compatibility layer for different backends. So it's like Ibis, but with just the Polars API, which is cool, and I appreciate the goal there of wanting to simplify. I see all these projects as being part of a collective open source community effort to reckon with what good ergonomics look like for lazy table expressions and efficient data frame operations across different types of backends.

Yeah, I do think they've all kind of wrapped around that concept. In 2025, lazy expressions have emerged in every tool; pandas just added pd.col in the spirit of Polars, so they have their lazy expressions too. To tie it all back: the siuba story is that in 2019, I needed to data-fight my boss. He's analyzing data live in R and just mocking Python, and I need to keep up with him. Today, hopefully it'd be Narwhals or Ibis keeping pace.

I do remember Dave Robinson's live coding streams and demos at meetups and things. When I used to live in New York City, I would often see Dave at New York City data science meetups. He was definitely a fixture in that world. It's also kind of cruel, because he's like an animal. He goes so fast, and he's just talking as he goes. If I'm in Python and I can't keep up, he's just pummeling me, like, you can't do this. But I'm just having a good time.

Origin story and pivotal moments

So yeah, just thinking back: before you turned into the person you are today, what were some of the experiences that you feel put you on this path?

Yeah, I've been thinking about this a lot, because I visited my undergrad advisor, Candy Turley, a couple of weeks ago. Maybe I can start with the three most surprising things I tell people from undergrad. I failed my first year of undergrad, just beautifully; I had like a 1.8 GPA. I ended up going to Princeton, through the support of these really great mentors. And I really had no idea how I was going to dig out after I failed my first year. I found it so hard. And I remember my dad asked me how I was going to pay for Princeton. None of us realized that Princeton funds PhDs: it's fully paid and has a stipend.

So to me, the most pivotal moment into data was connecting with my undergrad mentors: Candy, who was in psychology, and Duane Derryberry, who did stats. Just realizing how helpful professors are, and getting so much support from them to figure out what grad programs were like and how to even pursue statistics. But I will flag: I think today, with AI, a lot of the problems I had... like, I'm embarrassed that I didn't know Princeton would pay for my PhD. I went to the visit at Princeton not knowing they would pay for my school, and I only found out from my host, Fowder, that it's funded. I think they kind of take for granted that it's obvious. But ChatGPT or Claude, I think, would have just blasted that question.

So, yeah, I mean, it's amazing. Today with AI, versus the pre-AI internet, is almost as big a difference as... I mean, yes, you could have looked things up on Google or whatnot, but it feels like as big a change as going from no internet to internet, in many ways.

The evolving Python ecosystem and data science roles

And this gets a little bit into, you know, the next topic that I wanted to talk with you a little bit about, and it does start to tie into AI, so I promise I'm going to get to more of the AI topics. But, you know, as the Python ecosystem has expanded so rapidly in recent years, you know, I started out building tools that I thought were for, you know, my idea of what is a data scientist. But now, if you look at people that are using Python, there's so many different types of business functions and roles. There's data engineers, there's AI infrastructure engineers, AI engineers, there's people doing machine learning and more modeling type work. And the way that people work and, like, their level of technical ability and the other types of tools that they're using can vary so much. And so I think for open source tool builders, it creates this challenge of, like, who are you building for? Or who is your primary audience? And even, like, what is a data scientist now in 2025? Is that even an outdated or old-fashioned term at this point?

I think today, especially more so than six years ago, a lot of my thoughts switch between two worlds. There's how I thought of data six years ago, which was using R or Python to take raw data and put out a report or a dashboard or a model; all these tools could query databases, so if I had to, I could hit our warehouse as part of the process. So I switch between that kind of raw-data-to-dashboard flow, maybe with a little warehouse in between, and thinking about the dbt Labs community and this ELT, analytics engineer, data modeling world, where actually transforming data in the warehouse is this huge activity. And it's not just the coding of it; it's also the design of the warehouse.

And I think that thinking between those worlds I find really interesting. Like, I noticed that the analytics engineering community is much more focused on, like, BI tools, so business intelligence, where there's often, like, a UI where it's really quick to make plots and dashboards. And I would say, like, if you think of tools like Great Tables, it's a kind of funny example because I think that a person who's using BI tools is probably not going to use Great Tables a lot because they already are in a platform that has some way to format tables for them. But I think the challenge is, like, if that BI platform doesn't format the table in the way you want or if it has kind of funky ergonomics, it's often hard to break out of the platform.

So I would say I've been thinking a lot about these two worlds, but a lot of the tool building I've been doing is still more focused on R or Python data scientists who are maybe just trying to take raw data and churn out a report or a dashboard, and can use coding to kind of dune-buggy over rough edges, or to get different tools to do stuff that BI tools might struggle to do.

Yeah, it's definitely something I've thought a fair bit about, especially being back at Posit the last couple of years. I've gotten involved in development on Positron, the new data science IDE that Posit has been building for the last few years. And it's interesting, because Posit as a company is still very focused, and tends to remain focused, on that data science archetype: somebody with scientific skills who is doing data analysis, data visualization, and modeling, but also technical communication, and needs to be able to build reports, static reports, interactive reports. Maybe they're authoring some type of document that's going to be published. So there are many different publication modalities, but they're still very involved in writing data analysis code and doing interactive exploratory computing.

I did a little consulting before joining Posit, with the California Department of Transportation, on a pipeline. And I do feel like that was what demystified a lot of these roles for me. I ended up having to get a lot more into dbt, and ended up going to Coalesce, and just running into people there was so helpful. Coming out of that, meeting some of the data engineers and analytics engineers, I do feel like the big difference was: data engineers can ship data through pipes, but they don't really care what the data is. They're like, pipe the data, whatever; I'll think about it just enough to help you pipe it. Whereas analytics engineers, once it's in the warehouse, think so hard about the SQL transformations and the representation and data marts.

Right. And there's a big focus on semantic layers, like what are the metrics, essentially creating the higher-level concepts that map onto the BI tools. Yeah. But I've never completely wrapped my head around it, in the sense that I've always wondered: are there a lot of problems where this kind of handoff struggles? Having a data engineer do the pipes and an analytics engineer do the transforms, is this handoff leaving behind some really high-leverage analyses, or turning the work into a bucket-brigade-style hand-down? And I do think that's the big difference in how I really like to work: my favorite thing is when data scientists go straight into the buckets, straight into the raw data, and the data engineers are kind of freaking out, and the analytics engineers are freaking out, but they're like, I just have to get this report out.

Current projects and day-to-day tools

So these days, what are you actively working on right now, and what sort of tools do you use in your day-to-day activities? Well, one thing, and this is maybe a funny thing to mention for open source tools, is that we run an annual table contest and plotting contest. Any tables you create in R, Python, or honestly anywhere; we just want submissions of cool, interesting, beautiful, publication-ready tables. I think we're taking submissions through early October. And then plots, which we did last year too, just looking for plots using plotnine to make beautiful visuals. But I'll say I actually love these contests, because it's just nice to see what people can do. And it's such a good reminder that people are better at using my tools than I am. I feel like that's the one nicest experience you can have as a tool builder: people stunting on you with the tool you built.

So I'm excited to get stunted on two months from now. I guess the other thing I've been looking at, and this is a kind of small thing, is this very tiny activity: autocompleting the column names of your data frame in lazy expressions. It's become more of a side obsession of seeing, across all these different platforms, and in the different places I can be within a platform, how and when do they autocomplete the names of columns? I've been really interested in JupyterLab, because I can control the suggestions it gives. But across the range of IDEs, it's an interesting issue, because a lot of IDEs roll their own tooling to handle data frames specifically; there's data-frame-specific code, or something that detects whether you're using pandas or Polars. I've always found it kind of funny that there are a million ways to autocomplete a column name, but oftentimes it just doesn't happen yet in Python.
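One simple mechanism, sketched here with an invented `ColumnNamespace` class (hypothetical, not JupyterLab's or any IDE's actual machinery): an object can advertise column names to completers by implementing `__dir__`, which tools like IPython consult when collecting tab suggestions.

```python
# Sketch: expose data frame column names as attributes so that dir()-based
# completers (e.g. IPython/Jupyter tab completion) can suggest them.
class ColumnNamespace:
    def __init__(self, columns):
        self._columns = list(columns)

    def __dir__(self):
        # Completers call dir() on the object to gather candidate names.
        return self._columns

    def __getattr__(self, name):
        # Resolve a completed name to a (stand-in) column reference.
        if name in self._columns:
            return f"<column {name!r}>"
        raise AttributeError(name)

cols = ColumnNamespace(["city", "riders", "fare"])
# dir() sorts the names returned by __dir__.
print([c for c in dir(cols) if not c.startswith("_")])  # -> ['city', 'fare', 'riders']
```

This is one reason attribute-style column access is pleasant in notebooks: the completion falls out of the ordinary Python object protocol, with no editor-specific plumbing.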

Right. Yeah. It's hard to have autocomplete, especially when people are writing these cascading chained expressions and things like that, because you don't want to make an incorrect suggestion, or suggest something that isn't going to bind correctly. One thing that we tried to do in Ibis was to constrain the column methods that are available based on the inferred type of the column. If you infer that this is a number or this is a string, then only string methods will be available on that expression object, or only numerical ones. You wouldn't want to call square root on a string, for example, or call .contains or .lower on a number. But that also can be limiting, and there are situations where it can be challenging to infer the type of an expression.
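A toy sketch of that design (illustrative classes, not Ibis's actual ones): the expression object's type determines which methods even exist, so an invalid call fails at authoring time rather than at query time.

```python
# Type-constrained expression objects: the class of the column decides
# which operations are offered, so invalid combinations can't be written.
class StringColumn:
    def __init__(self, name):
        self.name = name

    def lower(self):
        return f"lower({self.name})"
    # No sqrt() here: string columns simply don't offer numeric methods.

class NumericColumn:
    def __init__(self, name):
        self.name = name

    def sqrt(self):
        return f"sqrt({self.name})"
    # No lower() here: numeric columns don't offer string methods.

city = StringColumn("city")
fare = NumericColumn("fare")
print(city.lower())            # -> lower(city)
print(fare.sqrt())             # -> sqrt(fare)
print(hasattr(city, "sqrt"))   # -> False
```

The trade-off Wes describes follows directly: this only works when the type can actually be inferred, and every method you withhold is a bet that your model of the backend is right.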

Yeah. It's such an interesting dynamic of what do you offer, what do you take away, how close can you get to something. There's the risk that the map is not the territory: if you try to model the database really hard, but it turns out databases are really quirky, then just the act of trying to take away options that shouldn't exist can become a really quirky problem itself, given the unpredictability of databases and their edge behaviors, things like that.

Right. I feel like that was one of the fun parts of siuba, and I know Ibis has something like this. I think Ibis has a tiny note called "cursed knowledge," which is some incredibly cryptic piece of information you learn about a database that you wish you didn't have to learn. You end up stubbing your toes on the same problem too many times, so it's good to write down the things that you wish you didn't have to commit to memory. I'm sure if I thought about it, there are plenty of hard lessons learned that I wish I didn't have to learn, or wish I didn't know so intimately.

Yeah. I do think that also hits on, when you're designing tools, the pure relief of delegating to the user the figuring-out of some aspect of the tool. In that case, for example, being able to say: we aren't going to model the column types, so some of the things you can express can't happen or might produce an error, but at least now it's your job to figure it out. There are things you can say that error out, versus not being able to say them at all.

Yeah, one of the tricky things in Ibis is that there are some operations which are permissible in certain backends but not in others, or maybe in some backends there needs to be an explicit cast introduced in order for the operation to work. So what the project has tried to do is make things just work, but in a way that is safe and doesn't lead to shooting your toes off. Even so, I'm sure there are ways to cook up diabolical expressions that execute correctly in one backend but, in some edge case with the wrong kind of input data, error in another backend. That's just one of the frictions that comes into play: the little subtle differences between backends. I think Ibis supports more than 20 different backends, so there are bound to be certain incompatibilities, or ways you can concoct expressions that fail inconsistently. But at all times there's the Zen of Python mantra: in the face of ambiguity, refuse the temptation to guess.

Yeah, and that's kind of how I knew. I love the Ibis team. I think we met for the first time at SciPy, and we were chatting a little bit, and then inevitably what happened was a few hours of just database quirk talk, the things that keep us up at night when you try to translate something to different databases. When you rank something, where do nulls go? It was so nice to be able to nerd out. That's how I knew we had to put a ring on it, that the Ibis team was sort of my ride or die.

Yeah, it's a great team there, and in general I've had really wonderful experiences working with the pandas community. Phillip Cloud, who has led Ibis for a number of years, we met through the pandas community, so it's a really cool group of people. I will say, too, one thing I often tell people working on open source libraries: if you think you have a competitor, you should just chat with them, because you're two people obsessed with the same thing, maybe working for free, so you might as well get a friend out of it, because you probably obsess about the same kinds of things. But I do find tool creators are sometimes a little hesitant to chat with each other if they do similar things. We often create our tools in reaction to other tools, but at the end of the day it's nice to just chat and realize that we have so much in common.

If you think you have a competitor, you should just chat with them, because you're two people obsessed with the same thing, maybe working for free. You might as well get a friend out of it.

Yeah, I totally agree. So just being proactive about reaching out, proactive about communicating, because the whole intent of open source is to be collaborative, not competitive. People often ask me, Wes, how do you feel about Polars? And my answer is always: of course, learn Polars, but there are extraordinary amounts of pandas code in the wild, and LLMs are really good at writing pandas code, so knowing how to use pandas is also a good idea. Learn them both. And I'm happy to see that there's innovation happening in DataFrame libraries. Polars in particular is Arrow native, so Arrow is the fast in-memory representation layer that powers Polars and a bunch of other new open source projects.

AI tools and the future of open source

But yeah, I guess to segue into a somewhat related topic: firstly, I'm interested in what AI tools you use in your day to day work. What are your daily drivers? And then after that, the next topic. A big thing that's been occupying a lot of my mental space lately is what open source development is going to look like for the rest of our careers, because it feels like this is a pretty permanent inflection point in how we approach tool building, now that everyone is working in this more AI-first fashion. If the LLM can't help them with something, or doesn't already know how to use a tool they might want, they're just not going to use it. They're going to use something that's in the LLM's training set.

Well, I will say my first tool in the LLM AI stack is dread: knowing that I should be learning these tools, and that there's actually so much out there. I feel very motivated right now to dig a lot deeper. I'd say I use Copilot quite a bit for autocomplete. And I was using Cline for a long time. And what is Cline? Remind me again, I haven't actually used it. So Cline is an agentic assistant that can execute actions for you in VS Code. It's a VS Code extension where you can ask it to do something, and it will say, should I write this code to this file? There's a fork of Cline called Roo Code, which I think a lot of people use now. Yeah, that rings a bell. And I don't doubt there will be forks that end up being more compelling, or people will move around. Cline came before Copilot had an agentic mode, so it was sort of a precursor to that. But I'd say I do a lot more Copilot autocomplete.

Definitely Claude and ChatGPT, just for quick question asking. And I will say I use chatlas from time to time, which is Posit's open source chat interface. I find it really nice for quick LLM interactions that you can code, so you can stitch together workflows and interactions. What I like about it is that it's not a million Lego pieces for creating agents. It's just chatting; you can add images or audio, I think, but it's a pretty simple interface. The way I think of chatlas is that it's a port of the ellmer package for R. It simplifies the whole process of using these different APIs and interacting with the LLMs, which can be a little tedious if you're doing it from scratch, because every provider has slightly different API incantations.

Yeah. And a lot of the other tools I've looked at are sort of "get ready to build something very production grade," whereas with chatlas it's a few actions and you're up and running, which has been nice. Imagine the ChatGPT web UI or the Claude UI, but you're chatting from your console, and you're able to take the responses and add a reasonable number of tools using Python code. It's just about the right level of customization.
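The "one call signature over many providers" idea described here can be sketched in plain Python. This is a toy adapter, not the actual chatlas or ellmer API: the provider functions, the `Chat` class, and the model names are all invented for illustration.

```python
from dataclasses import dataclass

# Two fake providers with deliberately different "API incantations",
# standing in for the way real LLM provider SDKs differ.
def openai_style(messages, model):
    # Mimics a chat-completions-shaped response.
    return {"choices": [{"message": {"content": f"[{model}] " + messages[-1]["content"]}}]}

def anthropic_style(prompt, model_name):
    # Mimics a completion-shaped response with different argument names.
    return {"completion": f"[{model_name}] {prompt}"}

@dataclass
class Chat:
    """Tiny unified front end: the same .chat() call regardless of provider."""
    provider: str
    model: str

    def chat(self, text: str) -> str:
        if self.provider == "openai":
            resp = openai_style([{"role": "user", "content": text}], self.model)
            return resp["choices"][0]["message"]["content"]
        if self.provider == "anthropic":
            resp = anthropic_style(text, self.model)
            return resp["completion"]
        raise ValueError(f"unknown provider: {self.provider}")

# Same interface, two different back ends.
a = Chat("openai", "gpt-x").chat("hello")
b = Chat("anthropic", "claude-x").chat("hello")
```

The design point is that the per-provider quirks are paid for once, inside the wrapper, rather than in every script that wants to ask a quick question.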

So how do you feel about this? One thing I've been thinking about a lot is that there's going to be a growing segment of users who, if you're building a new open source project, aren't going to use your project until they can use it via vibe coding, essentially, or until it gets indexed by the LLMs, which read the documentation and all of the example code they can find on GitHub and learn how to use your library. I've talked to people building new open source projects, and they've mentioned that they can pinpoint an exact point in time when their project got picked up in the training sets for Claude or for the OpenAI models. Then they see this massive increase in usage: installations on the Python Package Index, bug reports, things like that. So I feel like that is going to have a major impact on folks like us who build new open source projects, that many people are going to be put off from really engaging with a new project until the LLMs know about it. But that also creates a chicken-and-egg problem: how do the LLM providers decide which projects to include in their training sets and index?

Yeah, it's a good question. I feel like it's something we will have to reckon with: how will people find our tools, and given the way they're coding now, will our tools afford them the ability to use them out of the box? I'm so haunted by the challenge of just getting in front of a group of people, of zooming in, finding a person who is willing to use your tool every day, who you can talk to and watch get frustrated. The LLM issue just adds another thing to our plate, which is how that person will be able to interface with your tool at all. For me, so much of the challenge has been getting into a room with the right people, and I suspect that will continue to be a big factor, maybe less so for shooting your tool into the atmosphere in terms of popularity, but in terms of being sure you're building the right thing. I just find it so hard sometimes to get into a room with people and watch them use the tool.

I know Polars, which you mentioned, has this issue a little bit, where, I don't know how it is now, but autocomplete or suggestions for pandas do tend to be a bit better. Sometimes Polars suggestions are just straight wrong, which is not great. Yeah, there's a combination of things: API changes and the LLMs having outdated knowledge of Polars, maybe the training set is a year old, or six months old, something like that. But sometimes there are just outright hallucinations. Because there's not as much training data from the internet and GitHub for Polars as there is for pandas, I would guess the incidence rate of hallucinated Polars APIs is more significant. Or maybe not hallucinations so much as wishful thinking: wouldn't it be nice if this method existed? I guess you could take the hallucinations and turn them into feature requests for Polars, why not? Yeah, just make it real.

But I do think you're right. Just to answer your question, there is this challenge of LLMs indexing your tool and being able to give the right answers. For me, the signal is once I start seeing people battle this, getting bad advice; that's usually what sends me down these rabbit holes. Fortunately I haven't seen it yet, but I do think it's coming, so I do kind of dread that.

Open source and enterprise data science

But on a separate but related topic: we're both at Posit. A large portion of Posit's engineering builds open source software for data science, open source projects that we want to be freely available and widely adopted. But we also build professional products to empower enterprises to do open source data science while keeping up with all of their security and compliance requirements, data governance, and all the bells and whistles. So I'm curious: in your time as an engineer at Posit so far, what are some of your reflections or learnings on the common pain points you see amongst users of Posit's pro products?

Yeah, the enterprise is so interesting. A lot of people talk about innovation in the enterprise, about tools doing really fancy things there, but the thing I find most compelling and surprising is that for an enterprise team, the number one thing I hear is: I can't just do normal things. Suddenly there's compliance, there's an IT team, there are a lot of hurdles in a business, especially a big business, where even running Docker on your computer, or working outside of a hosted Jupyter notebook, becomes a problem.

And so it's interesting, because when I came to Posit, I came from building out a data team for the government, for transit. That was a very different place, where we were hosting Jupyter notebooks for analysts. They can't run Docker on their machines, but we can run services for them, really working to build good pipelines that meet the analysts where they are. The first project I took up was pins, which is a tool to save data to different places. You have this thing called a pin board, and you pin your data sets to it, and every time you save, or pin, a new version of a data set, it keeps the old version. So it's really easy to share.
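The keep-every-version behavior described here can be sketched in a few lines of plain Python. This is a toy illustration of the idea, not the pins API; the `ToyBoard` class and its method names are invented for this sketch.

```python
import json
import tempfile
from pathlib import Path

class ToyBoard:
    """Toy pin board: each write of a named data set keeps prior versions on disk."""

    def __init__(self, root: str):
        self.root = Path(root)

    def pin_write(self, name: str, data) -> str:
        """Save a new version of `name` without overwriting older ones."""
        versions = self.root / name
        versions.mkdir(parents=True, exist_ok=True)
        version = f"v{len(list(versions.iterdir())) + 1}"
        (versions / f"{version}.json").write_text(json.dumps(data))
        return version

    def pin_read(self, name: str, version=None):
        """Read the latest version by default, or a specific one by name."""
        files = sorted((self.root / name).iterdir())
        path = files[-1] if version is None else self.root / name / f"{version}.json"
        return json.loads(path.read_text())

board = ToyBoard(tempfile.mkdtemp())
board.pin_write("sales", [1, 2])       # becomes v1
board.pin_write("sales", [1, 2, 3])    # becomes v2; v1 is kept
latest = board.pin_read("sales")
first = board.pin_read("sales", "v1")
```

Because old versions are never overwritten, a colleague reading the pin always gets a consistent snapshot, which is the sharing property Michael is describing.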

I mean, it's kind of like a caching solution. Let's say you're building an R Shiny application or a Streamlit app, publishing it to Posit Connect, and you don't want to keep running the same expensive SQL query every time somebody loads the dashboard. Is that mostly the type of use case you see?

Yeah, I think so. Or, I will say, Ibis uses it now, or did after we met at SciPy, because they had example data sets. They wanted to fetch those example data sets, but not fetch them every time. And it just so happened that the caching in pins was good enough that they said, we like that you've solved this, and we don't have to figure it out.
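The fetch-once pattern being described can be sketched with a tiny file cache. Again, this is a toy illustration, not the pins caching implementation: `expensive_fetch`, `cached_fetch`, and the data set name are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # toy local cache directory

fetch_count = 0  # track how often the "expensive" fetch actually runs

def expensive_fetch(name: str):
    """Stand-in for downloading an example data set or running a slow query."""
    global fetch_count
    fetch_count += 1
    return {"name": name, "rows": [1, 2, 3]}

def cached_fetch(name: str):
    """Return the cached copy if present; otherwise fetch once and cache it."""
    path = CACHE_DIR / f"{name}.json"
    if path.exists():
        return json.loads(path.read_text())
    data = expensive_fetch(name)
    path.write_text(json.dumps(data))
    return data

first = cached_fetch("penguins")
second = cached_fetch("penguins")  # served from disk; no second fetch
```

The point of handing this off to a library is exactly what's said above: the consuming project gets "don't fetch it every time" without having to write or maintain the invalidation logic themselves.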

Yeah, I think so. It's easy for users, because you just say fetch this pin, and you have the data, and you can list out your pins, each of your data sets, on this thing called a board, which is kind of like a folder. But I will say, in business, one really interesting use case I noticed: so Posit Connect is this tool to run your reports and host dashboards and such. A publishing platform, yeah.

Yeah, thanks, that's a much better word. I feel like I'm very much a freestyler when I'm describing things. But I was really confused about why we were implementing pins over Posit Connect; in fact, the concept made way less sense to me than pinning to S3. One thing I realized is that data teams in large orgs with big IT groups might actually struggle to get an AWS bucket or something. So if they have Posit Connect, and you can let them pin data to Posit Connect, they don't have to battle their IT team to get, say, AWS buckets. It's this funny situation where pins works well enough with Posit Connect, and as it turns out, when you can store data on Posit Connect, you can work around a lot of requirements from your IT team and just have things in one place. It's really convenient when you're just trying to put up a report.

One of the things that has been interesting for me to get exposed to are the data governance and IT security concerns of businesses that have really sensitive data, where there may be government compliance issues. Posit has customers all around the world, and every government has different data handling and compliance requirements. So with a publishing platform where you're publishing data, results, reports, and things like that, it's really important to make sure that the wrong person doesn't access the wrong data, essentially.

And it is so interesting, because I do think data teams, and especially people analyzing data, are sometimes pinched between a rock and a hard place. Maybe there are compliance issues limiting where data can live, or they need to restrict who has access to what. But on the other end, sometimes you have data engineers and analytics engineers going crazy, saying: you've got to put this in the warehouse, you can't just be pinning data, this actually needs to be part of the pipeline. And to me, that's actually the beauty of Posit Connect: let analysts cook. Maybe they do have to move things back into the warehouse at some point, but just facilitate that movement. Let them at least store data for now, with control over who can access it, so they have the luxury of time to migrate back into a pipeline. Or maybe they put out a report and nobody's interested in it, and then they know: we don't even need this in our pipeline. That's actually the best outcome, because now you've saved an analytics engineer some work.

Hot takes

Yeah. Cool. Well, I think we're getting pretty close to wrapping up here, so I'll end the interview with, let's call it, the hot takes section. Curious if you have any top-of-mind hot takes or controversial opinions nowadays.

For data or just in life?

It's totally up to you. It doesn't need to be related to data science or data science tools. It doesn't need to be AI related at all. It feels like all content nowadays has become AI content, so non-data, non-AI hot takes are totally okay as well.

Maybe, in a world of AI, this is a hot take that the people need: I think that label makers are the greatest tool for artistic expression. Meaning, in a world where you can generate anything, I actually think creativity thrives under constraint. Just having to generate words on a piece of tape, maybe with a beautified frame around it and thirty possible icons, is actually the peak of artistry. So much of my work is inspired by artists who do more with less, who beautify anything they touch. And that's part of the challenge: how are we going to constrain? What does a good constraint do for me?

I actually think creativity thrives under constraint. Just having to generate words on a piece of tape, maybe with a beautified frame around it and thirty possible icons, is actually the peak of artistry.

And yeah, I'm a big fan. We actually have a label maker in our house, but I left it downstairs. I can get it. I'm just saying, if you need a demo, just give the word.

That's all right. I admit that I also weirdly adore my label maker. I was probably 35 or so before I got my first one, and there was a little bit of this feeling of, where has this been? A latecomer to label theory.

Yeah. So now I'm putting labels on everything. It's a small thing, but I appreciate its simplicity and utility. I've been learning a little bit of Japanese, and the label maker I have, I think it's a Brother label maker, doesn't have the ability to write hiragana or katakana. So I was thinking I should seek out a Japanese label maker, just so I can put little labels on things to remind myself of what they're called.

I feel like the big question is: is it phone only, or does it have the tactile keyboard? The tactile keyboard, yeah. Maybe the phone would be a better way, because then you could probably input non-Roman character sets and it would print them.

So I'll look into that. I don't know. Gabe will recognize this, because I started with the purely phone-based Bluetooth label maker, but nobody at our house would use it. I actually had to switch to the tactile label maker, because as it turns out, the people in my house want the keys. The keys are so important. I admit, it's satisfying. I'm not gonna argue with it. Yeah, it slaps. It hits. It's analog. For sure. Yeah, there are a lot of cheeky notes around our house. That's the beauty: you can put your opinion on anything, and sometimes the names you call things are important messages to people. So if people out there haven't got a label maker in their life, I highly recommend getting one. It's just a soul-enriching practice.

Yeah, I will keep that in mind, and maybe it will give me some inspiration to be a little more creative with my labeling. I'm curious whether Hadley has a label maker; that's the big open question now. We're two for three. Yeah, I guess we'll have to find out on a future interview when Hadley joins us.

Well, Michael, it's been a pleasure. So I appreciate your time and the conversation.