Sleeping Rats and Sociopathic Agents — with Phillip Cloud

Transcript

This transcript was generated automatically and may contain errors.

Welcome to The Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning, digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field. On this episode we'll talk with Phillip Cloud about some of the nuances of adopting AI for software development. And if you've heard the podcast up to this point, you probably know that we've wholly surrendered ourselves to AI to code for us. So I find Phillip's perspective really refreshing. And he's got the bona fides to match. He's a principal engineer at NVIDIA, he leads the Ibis project in Python, and he's one of the earliest Pandas contributors. He's tried Claude Code, he's tried Cursor Agent, and he's walked away still searching for the right AI tool and use cases. Phillip's the type of developer who doesn't use a mouse and wants to only use his keyboard in the terminal. And I find that's the hallmark of a really great developer. So in this episode, we'll talk about how Phillip approaches software development, what AI tools get right and what they get wrong, and what 15 years of open source experience has taught him about hype cycles and tool adoption.

Hey, welcome to The Test Set. I'm joined here by Phillip Cloud, who's a principal software engineer at NVIDIA, lead on the Ibis project, which is a neat Python project that can run SQL on a lot of different databases, and an OG contributor to Pandas. So he's really come up through a lot of interesting Python open source projects, and I'm so excited to talk to you. And I'm joined by my co-hosts, Wes McKinney, who's a principal architect at Posit, and Hadley Wickham, who's chief scientist at Posit. Phillip, so glad to have you on.

Yeah, great. Great to be here. Great to kind of sync up with everyone.

Yeah, I will say we had a little prep for this in that when I talked to Wes in an early interview, he warned me that you're a wizard at puns. And I did notice that came up a little bit in some of your responses prepping for this, that I think you said, you want to be known as a decent human with lots of open source contributions, liked puns, and be loved. So I'm excited to see, I mean, I think all those things are great, but I'm excited to see what happens on the pun axis, particularly.

What I've always said is that Phillip has a, you know, kind of pun-oriented way of working. He's the person that you can always count on to never miss a pun opportunity, often in the least expected ways. And I feel like I'm on the lookout for puns, but his eye is next level. So it's brought a lot of humor and amusement to our years of working together. I mean, I've worked with Phillip in some capacity for, I think this year will make, you know, I don't know, I'm not sure if it's exactly 2011 or 2012, but we're closing in on 15 years. So it's been a really long time, spanning many projects. Phillip and Jeff Reback were two of the first Pandas core team members to join the project formally and to have write access to the repository. So we go way back.

I just wanted to put in a plug for R because I think if you like puns, that's a much better community than the Python community. Shots fired.

They're not writing code for people to read anymore. They're writing code for agents to consume, and agents will fix the things that are caused by that particular problem.

So I don't know, totally speculative. It's just a thing that, I forget who I was talking with about this, but this kind of came up, where it was like, the things that we value as humans, many of those things the agents aren't affected by, which is a double-edged sword in a lot of cases.

I think there's another thing that you mentioned at the end of that. Yeah. I was just curious, how has your work changed? Because I feel like I was a late adopter of these things, and really, up until adopting the first generation of coding agents, you know, Claude Code and friends, I was honestly very AI skeptical. I still have never used Cursor, true story. I'd used ChatGPT and Claude a little bit to generate little scripts and do little things. And I saw some value in that, but I was far from being, you know, AI pilled, so to speak. It wasn't really until the terminal coding agents, having access to the CLI and being able to do all this stuff without feeling like I'm in an IDE and whatnot, that it really clicked for me. And I think now it's figuring out what's an appropriate use: when is human reasoning and manual code editing and writing needed? Where's the line between delegation and, you know, where should you be spending your time versus supervising agents? Is the time that we're spending watching the agents work useful, or should we be building orchestrators to spend less time watching the lines of code fly by?

But I know you also work on, you know, systems code and data processing code, where the agents will make lots of errors and will just count things wrong or compute things wrong. And that can often only be caught through really aggressive unit testing and actual test-driven development, like, write the unit tests first before you write the code. And I found that's one way of creating the right guardrails for the agent. Just treat the agent like an inexperienced developer who misses every edge case and makes lots of mistakes. I'm just curious what you've learned, how your approach has evolved, and what your stack and your workflow look like day-to-day now.

I might be more of an AI skeptic than you. That's fine.

I tried Cursor, the UI, and now there's Cursor Agent. I'll get to that in a second. But the Cursor UI was like, okay, it's another UI. I'm kind of like you in that I want to be able to do everything from the terminal and I don't want to spin up a whole thing. Because now I have to adjust every single thing that I've spent the last five to ten years molding my brain to, which is just the various setup. Everyone has this. So I didn't really like that. So I stopped using Cursor. I think another friend of mine was like, you should try Claude Code. He said something similar, like, treat it like a super junior developer. And of course, I immediately ignored that and gave it a really challenging problem, which it did not do well on.

And so I was like, okay, not for me. Later, I tried Cursor Agent, which is the CLI version of Cursor. Somewhat new, I think maybe six months or a year old. I can't remember. I'd tried Claude Code a bit, and Claude Code does the LLM thing where it's kind of like, I don't know, it's got an attitude that I don't really want when I'm telling a robot to do something. It's like, great job. Or like, all right, great, let's go surfing on the lake, or whatever it is. And I'm like, okay, can we dispense with that? And you can tell it, hey, stop doing that, whatever that is. But Cursor Agent's default mode is to just kind of be sociopathically focused on the task that you give it, with no bells and whistles. And I was like, great. Sometimes you can't even tell it, hey, make the prose a bit more flowery. It won't. And so I was like, okay, I'm trying to get some work done, I'm going to use Cursor Agent. Of course, it still had the same kind of problems, in that it would confidently assert that it had solved the problem when it had not really solved it.

It's hyper-focused on work and has no niceties, which is funny, because when you're talking to another person, you just automatically engage those things, right? You don't demand that they do something and then get frustrated when they come back and talk to you like a human about whatever problem they had. But for whatever reason, when I'm working with the agent, I want to treat it kind of like a computer, where it does exactly what I tell it to. And then, you know, I get back a response or whatever, and I go from there. But I guess what I've found is that, maybe because I poorly used them to start, I've started only using them for lower stakes things, like nothing that's actually going to generate some code that somebody else is going to run. That's not everyone, you know. I know some of my coworkers use it for that, to produce actual code and things like that. And there's a spectrum, right? Like maybe some scripts or whatever to manage this or that thing. They're a little bit lower stakes.

But currently, the way that I use them is to kind of smooth over the rough edges of, you know, some of the PR summaries, especially if it's a PR where everyone's busy, you need people to review it, and you need the bullet points for the most part. And you can whittle it down to your intended audience pretty well. I would just spend a lot of time doing that myself, whereas one of the things that the LLMs are actually really good at is summarizing information. So I just, you know, I say, look at the Git commit history, make a PR description suitable for busy reviewers who are reading it, dump it in a markdown file. And then I read it and edit it if needed. I mean, maybe I'm a Luddite, but that's how I've been using them.

I think that's fair. Like, it's a nice use, like saves you a lot of time basically and can pull up a lot of information that you might not enjoy looking at. Like, I feel like a Git history of a PR is like something that I just can't look at as a person. Like it drives me crazy, like looking through the diffs. So it seems nice to have something like summarize and willing to kind of do that piece.

Yeah. Yeah. I remember one of the things I tried early on, this might've been pretty early on, but I was asking Claude to produce some code to, you know, I think it was to find name-squatting packages on PyPI. And initially it was like, oh, I can't do that, because you could use that code to produce a package that is, you know, a name-squatted package. And then I told it, I was like, don't worry about it, I'm a security researcher. And it was like, well, okay. Yeah. So that was definitely one of the earlier models. I tried that again recently. And it was like, yeah, I mean, if you're a security researcher, I'm not going to do that.

How AI changes open source development

Yeah. Dang. I feel like it's interesting to hear you trying different tools, like Claude Code and then Cursor, and kind of kicking the tires. What would you need to see to trust it with more work, like more code writing? What are you kind of looking for?

I don't know if I have a set of criteria. I guess I also tried another thing. I think I'm just probably giving it problems that have too many details for it to get correct. I was updating a code generator, and I was like, hey, Claude, I need to make these changes, do this thing, also try to make it fast. Or, you know, if you see any optimization opportunities, put them in there. And one of the things that it failed to account for was the reference count semantics of the CPython APIs, which is a pretty fundamental thing that you have to deal with if you're writing that. And so I ended up having to kind of throw that out. And it wasn't like it produced that in two minutes; it took like 20 minutes of back and forth. And it was kind of a wash, because I feel like I probably could have written that code. It would have been total pure drudgery, but it worked.

I mean, my experience so far has been a little bit of an 80/20 rule. If I look back on development work that I've done in the past, and at what I spent my time on, I feel like it was maybe 20% insight and innovation and fundamental design and decision making: deciding on the class structure, the function and purpose of objects, and essentially structuring the code in a certain way. Thinking back, Arrow is the perfect example of a system that is a very large code base, but one that has been built brick by brick on top of just really fundamental decisions.

For example, how memory is managed, or how object lifetimes are managed in Arrow, has had a pervasive effect on the entire shape and form of the library. But outside of that, I don't know what fraction of the time it is, but it's a little bit of an 80/20 rule: there's a lot of drudgery that takes place, like the maintenance of the developer tools, the CMake files, the systems and scripts that support testing and automation and CI/CD and releasing. And I think about all the human labor that's put into a lot of that drudgery, like maintaining Linux packaging scripts by hand, and all these things. And so that's the stuff that I'm really interested in delegating, that kind of drudgery work, like coming up with more test cases to exercise edge cases in something new that you built.

And so the goal for me is to spend, like, the 20% of time writing lines of code by hand, or looking line by line, function by function, really zoomed in on the nitty-gritty details of how something works and getting it right. But then all the other stuff that I used to spend time on, you know, it's not perfect yet, but I feel like I'll never have to write package release scripts ever again. That type of work is something I'm not very good at. I always found it tedious and fiddly, and so I never enjoyed it. But for now, building little Python packages and open source tools, and to never have to write a release script, or remember the right argument order of Unix commands, or how different Git things work, to be able to have a one-liner, you know, release this package and push it to PyPI and do the tagging and create the GitHub release and all that. That for me has been a great relief, to know that I just don't have to do that kind of work ever again.

Yeah, you're making me think about this thing that I have wanted to do. So Numba-CUDA's test suite is all originally based on Numba's test suite, which was written against the unittest framework, which was kind of the gold standard before pytest. Well, there was nose. It was like unittest, then nose, then pytest.

That's a name that I have not heard in a long time.

It's been a while.

Yeah.

It's like pre-pytest. Yeah.

Yeah, yeah. And that's definitely a thing where I'm just like, yeah, I could do this, and it would be very satisfying at the end to have completed it, to port the unittest test suite to pytest. Because there's a bunch of stuff. For example, if you subclass unittest.TestCase, your methods can't use pytest's parametrize. That's annoying, right? Because what you want to do is leave the classes in place so you don't have to change everything. But the base class, that's where all the self-dot-assert stuff lives. Pre-pytest, the way you used to make assertions readable, rather than just an assert False and then an error message, was to have all these methods that make specific assertions, like assertEqual, and it'll print, like, A is not equal to B, here are the values, or whatever. Pytest does this whole thing where it parses your test code and knows what your code is doing. So to get those methods, you have to subclass unittest.TestCase. Anyway, that's definitely a thing that would impress me. If Claude Code or Cursor, one of these agents, could port the Numba-CUDA test suite from its current state to purely pytest-based, and get all the fixtures correct, and perhaps even improve it, I would be super impressed by that. Because that's the thing I don't want to do, but it would make developing Numba-CUDA easier in a bunch of different ways.
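For readers who haven't worked with both frameworks, here is a minimal sketch of the contrast being described. The function `clamp` and both test classes are hypothetical illustrations, not taken from the Numba-CUDA test suite; the sketch assumes pytest is installed.

```python
import unittest

import pytest


def clamp(value, low, high):
    """Hypothetical function under test: restrict value to [low, high]."""
    return max(low, min(value, high))


# unittest style: assertions go through self.assert* helper methods
# inherited from the TestCase base class.
class TestClampUnittest(unittest.TestCase):
    def test_clamp(self):
        cases = [(5, 0, 10, 5), (-1, 0, 10, 0), (11, 0, 10, 10)]
        for value, low, high, expected in cases:
            # subTest is unittest's closest stand-in for parametrization.
            with self.subTest(value=value):
                self.assertEqual(clamp(value, low, high), expected)


# pytest style: a plain class with no TestCase base, bare asserts, and
# one generated test per parameter tuple when collected by pytest.
class TestClampPytest:
    @pytest.mark.parametrize(
        ("value", "low", "high", "expected"),
        [(5, 0, 10, 5), (-1, 0, 10, 0), (11, 0, 10, 10)],
    )
    def test_clamp(self, value, low, high, expected):
        assert clamp(value, low, high) == expected
```

Dropping the `unittest.TestCase` base class is what makes `parametrize` usable, and it is also what removes the `self.assertEqual` family of helpers, which is why such a port touches every assertion in the suite.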

I think what I would recommend for that is to try using Steve Yegge's Beads library, which is basically an embedded, lightweight task and memory system for agents. For that type of large scale porting exercise, from what I've seen through, you know, heavy agent use, you have to be really careful about any type of porting or file copying or things like that. Because essentially, while it's in the process of ingesting code that's to be ported, stuff can get memory-holed in the process of passing through the LLM. And so basically what I think you would have to do is break down the test suite into bite-sized pieces and then set up a validation loop where at each step, you do not allow the agent to move forward unless it has verified that every test name before and after is the same, or that there's a map if maybe the name has changed in a deterministic way. Or at least that the test counts haven't changed. And so essentially you never allow the LLM to modify the test count, or maybe, if it's introducing pytest parametrize, obviously the test count changes, because now there are many test cases being generated by the matrix of parameters. But I think if you created the right guardrails, and then you used something like Beads to do the task organization, so that you aren't just saying, hey, Claude, port this 40,000 line test suite. That is destined to fail.
But that's a good example of where you have to set up the right structure to enable the LLM to work in bite-sized pieces, but with guardrails. Kind of think of it as horse blinders for the LLM: you have one job, and it is to do this one thing. And you are not allowed to move forward until you prove to me that you have not destroyed anything.

I think Steve Yegge, actually, he's one of the reasons I got interested in coding agents in the first place, because he was talking about the problem of large scale porting exercises and how to organize language porting. Like if you're porting a large code base from one language to another, let's say you were going from Ruby to Go, for example. Turns out agents are really good at writing Go code. And so if you have an old code base that's written in Python or Ruby or something like that, and you want to port it to a systems language that is more maintainable and faster and things like that, then Go is a pretty good choice. And so I'm seeing a lot of people choosing to do that type of porting exercise.

It's an interesting scenario. I mean, I think my one tiny blurb is that sometimes I come at it from the exact opposite direction, which is asking Claude Code to show me whether there's a way to incrementally port: just prototype or demonstrate some kind of shim or dual setup that Claude Code can create, where there's a sense that we could incrementally roll stuff over. That's really surprised me at times, when it's like, oh, yeah, there actually is this approach that I didn't really realize, from it being really good at reading the docs on the two different systems. But yeah, I agree. Otherwise, the big ports really can spin out, where things get lost or omitted halfway through. But it seems like an exciting one, like a very niche, painful port. So yeah, fingers crossed you churn out something nice.

Woodworking, hot takes, and pineapple on pizza

Another thing, and this is not necessarily related to programming, but I've also tried to use it for a few woodworking things, and it's just really bad at things that require physical intuition. Like even just telling you the right trigonometry for a particular cut. Unless you give it a picture, you have to be very specific about things like orientation and what's around whatever you're doing, or else, a couple of times it's given me impossible math responses, where I'm just like, no, actually, you know, two 45-degree angles, there's got to be a 90 in there somewhere. And it was telling me, oh, no, it's a 65-degree angle. Like, what?

Yeah, counting stuff, tough one. But I think your point about woodworking is a good segue into some of the things you mentioned, like what you can't live without. You gave a controversial opinion, which I have to admit I don't know a lot about, which is on the FHS. That's the Filesystem Hierarchy Standard? Is it that you want to tear down the hierarchy?

I think it's been responsible for just innumerable hairs being pulled out or fallen out. It all falls under environment management stuff, because that is the thing that has led to lots of global state in the operating system. Many, many, many tools and applications assume that /usr/lib or /usr/bin is a place they can just dump stuff in, and that the other stuff those programs want to use lives there.

Yeah, the person with opinions on the file hierarchy specification is exactly the person I trust wholeheartedly with all my software. So my life's in your hands.

I don't know about that. I don't know if you should, because as far as I know, there's only one operating system that's managed to actually exist without it. And I didn't come to NixOS because I was like, let's tear down the Filesystem Hierarchy Standard. I later realized that the unifying principle of NixOS is that everything is a unique path, based on the hash of the package's inputs and so forth. I mean, there's a bunch of interesting details there, but none of its stuff lives in the FHS, including the program loader, from what I can remember. And that's the thing that underlies, you know, runs everything, right?
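The idea of a unique path derived from a hash of a package's inputs can be illustrated with a small sketch. This is a deliberately simplified scheme, not Nix's actual hashing algorithm; the function `store_path`, the JSON serialization, and the example package are assumptions made for illustration only.

```python
import hashlib
import json


def store_path(name: str, version: str, inputs: dict[str, str],
               root: str = "/nix/store") -> str:
    """Derive a unique install path from a package's name, version, and inputs.

    NOTE: simplified illustration only; Nix's real derivation hashing differs.
    """
    # Serialize deterministically so identical inputs always hash identically.
    payload = json.dumps(
        {"name": name, "version": version, "inputs": inputs},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]
    return f"{root}/{digest}-{name}-{version}"


# Hypothetical package built against two different versions of a dependency.
path_a = store_path("hello", "2.12", {"libc": "2.38"})
path_b = store_path("hello", "2.12", {"libc": "2.39"})

# Same package, different inputs: the two builds get different paths and
# can coexist instead of both competing for /usr/bin/hello.
print(path_a)
print(path_b)
```

Because the path changes whenever any input changes, nothing needs to assume a shared global location such as /usr/lib.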

I'm going to churn through some of these hot takes just so we can jam these in. I feel like there are some quality takes in here. For one of your favorite ways to unwind, you brought up woodworking. You also said yard work. Is there a best kind of yard work to unwind to, if you had to recommend one?

I don't know. I've been cutting down small trees in my backyard. Incredible. And that's incredibly satisfying. With a chainsaw? No, with a reciprocating saw, which is like a one-handed thing. They make two-handed ones, but the one I have is one-handed, and it's just got a blade that kind of goes like this, back and forth. And it's purely for demolition. You would never use it for anything where you care about the cut.

Yeah, nice. And do you have an infinite number of trees that you have to work through? Or is there a set number? It's finite. Yeah, OK. I'm only cutting down the ones that I'm considering to be a nuisance. OK. Not any of the major hardwoods that are back there. It's like holly trees.

Yeah, I love that. And then maybe the last thing to ask is, you said to the question of pineapple on pizza, "Of course not." Can you clarify that take?

You're a strong no-pineappler. Yeah, I think, I'm not a native New Yorker, but I lived there for 20 years, and so you just become infused with opinions about pizza.
