Kelly Bodwin: Quarto hacks, AI in the classroom, and why R should stay weird
In this episode, we're joined by Kelly Bodwin, candy corn defender, board game enthusiast, and Associate Professor of Statistics and Data Science at Cal Poly. We discuss her path from English and French to statistics, how she builds teaching tools and navigates AI in the classroom, and what it takes to keep a programming community weird in the best possible way. Kelly is curious, collaborative, and unafraid to lean in on quirky.
Kelly shares how she balances teaching three courses with master's student supervision, applied research projects spanning Polish history and beyond, and her belief that the best part of academia is the people. We also dive into the practical and philosophical challenges of staying current in a field that reinvents itself every few years.
What's inside:
• Breakfast mixology
• Building Quarto extensions with JavaScript and AI
• When ChatGPT helps students learn (and when it doesn't)
• Applied stats meets history: analyzing social networks from the Polish Revolution
• Why remarkable, welcoming communities matter more than perfect code
Transcript
This transcript was generated automatically and may contain errors.
Welcome to The Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning, digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field.
On this episode, we sit down with Kelly Bodwin, board game nerd, candy corn defender, and associate professor of statistics and data science at Cal Poly.
Hey, Kelly, welcome to The Test Set, where we chat with interesting folks in data and see what makes them tick. We're so happy to have you on. For a little bit of background, Kelly Bodwin is a professor at Cal Poly, which is short for California Polytechnic State University, San Luis Obispo. And we're also joined by Wes McKinney, who's a principal architect at Posit and the creator of pandas and interesting libraries like Ibis in Python. So, Kelly, we're so excited to have you on.
We've chatted with a few different people who are R professors and educators. I'm so excited to talk because I love the energy you bring any time we see you at conferences. And I know when preparing for this, you mentioned a lot of really interesting stuff about sort of how you got to where you are today and also what makes you tick in terms of collaboration. But to maybe kick it off, I'm curious to hear what gets you out of bed in the morning. And is it, per our conversation, a Coke and orange juice?
Yeah, I think I put it as a controversial opinion that a morning drink I enjoy. I don't drink it that much because it's very sugary. But yeah, a half Coke, Coca-Cola, and a half orange juice. And it's like, it's like Orangina, but with a little caffeine.
It looks weird, but it's delicious. So don't knock it till you try it.
It was like born in college. So it was from the machines, you know, so it was in a little plastic cup. So I think, I think you have to see the weird color to truly appreciate the drink.
What gets Kelly out of bed
Um, I mean, so definitely the main motivator for me is always like other people, either excitement to see them or knowing they have expectations on me. Um, so this is why the professor job works well for me, because I got to get out of bed because I got to teach a class. So I can't just not go to that. I'm not a morning person. So, you know, literally what gets me out of bed is that I have commitments.
What's exciting to me is always the, like the human interaction. I was not a happy person during COVID. Um, and I would rather, I know it's not possible, but I'd rather be in person with you two, as well. I just, I like being in my department, you know, I'm friends with all my colleagues in my department and we chat in the hallways and then, you know, being in the classroom with students is really fun. I really like the lecturing and I really like chatting with them. So those are the things that motivate me.
A week in the life of a professor
This quarter I'm, you know, pretty chock-a-block like nine to six. Uh, so I have three different classes right now that I'm teaching, um, which is a difficult load for us, three courses. Um, you know, so after this, I've got to rush over and teach four hours of lecture. Um, but, uh, and then I have, we just started a master's program, which is really cool. Um, so this is the third year of it, the second year that I've supervised students. So I've got four master's students. Um, so I've got, you know, meetings with them scattered throughout the week and then lots of admin and grading type stuff throughout the day too, you know, on committees and whatnot. So, uh, what I like about the job, right, is that, like, every day looks quite different and every quarter looks quite different.
Um, right now it's week seven out of 10, we just gave midterms. So I've got a pile of midterms sitting over here to grade and that, and that makes it a tough time in the quarter. Um, but yeah, I mean, the average day is just like running across campus from a meeting to a lecture to a meeting to office hours.
Um, I mean, it's a statistics master's program, so everything is statistics adjacent. Um, but there's quite a variety of projects like in the program. Um, some of them will be applied projects with kind of partners on campus. Um, mine, I mean, last year, three of them were tidyclust R projects. So contributing new algorithms to tidyclust, um, or, I say new algorithms, ones that have not been built into tidyclust yet. Uh, so it's a thesis driven program. So they do independent research.
Kelly's path to statistics and data science
Yeah. I mean, so I started college thinking I would do either pure math and physics or English and French. So I was like all over the place. And then when I took a probability class with Joe Blitzstein, um, who was my advisor in undergrad, that was what made me say, oh no, I'm going to do statistics. This is so cool. Um, so it was very much a focus on statistics.
Um, and so then I guess, you know, my classes used R, but I didn't really learn R. And then I did a senior thesis with Joe Blitzstein, uh, where I was doing a lot of simulation in R. So that was where I came to, I guess. I wouldn't say at that point I loved it. I would say at that point I knew how to use it. Um, and then, uh, and then I went to grad school and a lot of my work was also like R, but again, that wasn't the main focus.
And then I came to Cal Poly and my first year at Cal Poly, I went down with a colleague, with Shannon Pileggi, to, um, the RStudio conference. And it was just like, that was the awakening. It blew my mind. But even then, like, I wouldn't have said data science. Even then, the data science program at Cal Poly, we have a, um, we call it a cross-disciplinary studies minor, but it's more of a double major with math, with CS and stat. Um, but I wasn't teaching in that because that's largely Python.
And then eventually, I'd say during COVID, uh, I picked up the Python class and I learned Python. Um, and that was, yeah, more or less, that was when I was teaching in the data science program. So I was like, okay, I guess I'm a data science professor now. But to me, to me, there's not a difference between data science and statistics. Like, I think they're kind of two words for the same thing.
I mean, I think one of the problems that emerged was that at a certain point, data scientist wasn't a specific enough term, and it started to include too many skills, and so then businesses were searching for these, like, unicorn individuals who are, like, really good at DevOps and building pipelines and configuring stuff in the cloud, but also, like, know how to do statistics and do causal inference and all those things, as well as being, like, really good software engineers. But I think, like, the reality is that now things have become, like, more specialized, and so I find that I see fewer and fewer job postings and companies that are searching specifically for a data scientist and more looking for, like, a statistician, or a research engineer, or an AI infrastructure engineer.
AI in the classroom and the job market
Yeah, it's pretty, like, the AI can ace all my exams, including my conceptual exams. So it can figure out, like, what's the right statistical test or model to use, given some data information. So it can do a lot of, like, sort of base level things.
The places where I've found the human judgment most needed, I guess, would be the data, the EDA, honestly. I mean, AI can produce, like, exploratory plots and so forth, or maybe spot little issues. But the understanding of the structure of the data, like, does this observational unit really address the problem that I'm trying to address? Like, that still feels like it needs a human. And then I've found, you know, every now and then, when I'm, like, stuck on my own applied project, and I'm like, what model should I be trying here? If I turn to AI, it usually gives me unnecessarily overcomplicated answers. They're not wrong, per se, but they're, like, more than is needed. So I suspect that just will continue to improve a little bit. But the judgment to know when AI has gone overboard is maybe where a human is needed.
Yeah, certainly you can't, like, quite vibe code as directly as you can with some of these coding tasks. But I think that those jobs, especially, like, more basic data analysis jobs, are still in danger of being taken over.
Using AI as a tool: the Quarto extension story
I'm very resistant to using it for any writing, because I'm an English minor. I still am attached to writing my own things, so I'm resistant to that. But on the other hand, like, the essay or the papers I'm getting from my students are much better. I don't hate that I don't have to read as bad of writing anymore.
But for myself, I honestly don't use it that much in my research. I use it, like I said, sometimes to, like, suggest the next path forward. I've used it to unstick myself when I can't motivate myself to work. More often than not, I'll say, you know, I'm trying to write this thing, can you write it for me? And then it writes it, and I hate what it's written. And so then it motivates me, because I'm like, this is not correct, here's the correct way to do it, you know, and so I actually go do the thing.
The place where I really use it, this might be what you're referencing, is a student and I have been working on a Quarto extension, and, like, figuring out the structure to make a Quarto extension was pretty straightforward. I took a really great workshop at posit::conf about it, but also there's good documentation. But the one we wanted to write, it had to be JavaScript, it couldn't be done in R, and I don't know JavaScript. So the way that I did this is I literally went line by line. I didn't write the whole program in R, because it didn't really make sense, but I would say, okay, like, what I'm trying to do here is, I don't know, loop through all of these strings looking for a regular expression. So I'd write the one line in R, like, you know, purrr map, str_detect, whatever, and copy that line into ChatGPT and say translate this to JavaScript, and then I'd copy that back in and I'd run it on some tests. So it was, like, a very tedious process, and it didn't work to just design the whole thing and say, write me this in JavaScript, partially because it wasn't perfect, and partially because, like, I don't have the ability to debug a full program in JavaScript, but I do have the ability to debug line by line.
Yeah, it's, I don't know if you ever used flair back in the day, but flair was the version for R Markdown that was totally built in R, and, like, Quarto opens up so many cleaner ways to do what flair was doing. flair was very hacky. But what it's for is so that you can, like, for teaching mainly, so that you can establish, you know, you've written a code chunk, and when you show it to the students, you want it to highlight, you know, all the functions or something like that, but you still want it to be reproducible so that what you're running is the output of that code.
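The kind of single step described here, looping over strings and testing each one against a regular expression, might look roughly like this once translated. This is a minimal sketch in TypeScript, not code from the extension itself: the function name `detectPattern`, the sample strings, and the pattern are all hypothetical, and the behavior only loosely mirrors mapping `str_detect` over a character vector in R.

```typescript
// Hypothetical helper: for each string, report whether the pattern matches.
// Loosely analogous to mapping stringr::str_detect over a character vector in R.
function detectPattern(lines: string[], pattern: RegExp): boolean[] {
  // String.prototype.search always scans from the start of the string,
  // so a pattern with the global flag carries no state between strings.
  return lines.map((line) => line.search(pattern) !== -1);
}

// Illustrative input only: flag lines that contain something shaped like a call, name(
const sample = ["x <- mean(y)", "# a comment", "plot(x)"];
const hits = detectPattern(sample, /[A-Za-z_.][A-Za-z0-9_.]*\(/);
console.log(hits);
```

Keeping each translated step this small is what makes the line-by-line approach workable: every piece can be run against a handful of test strings before moving on to the next.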
Applied research: Polish history and collaboration
Yeah, the, I think the one that is the most fun to talk about is my collaborator in history. His name is Greg Domber, and he works on the Polish Revolution. So, like, in 1989, you know, in Poland, they all met, this group of 500-and-some people met at a series of meetings and, like, peacefully transitioned from communism, or dictatorship, to democracy. And so he has collected, like, painstakingly by hand over a decade, data about those 500 people and what they were doing since 1950. So he has this, like, wonderful longitudinal social network data that is, you know, these two people were in this organization together. These two people co-signed this protest letter, et cetera.
Yeah, and then I have been working for, like, eight years now on that data. You know, the first thing was cleaning it, which was big. I give him credit because he wrote it all down in a spreadsheet, and that's, like, very impressive for a historian. But there was a lot of work to do. And then we made a Shiny app where you can explore these, and we presented it in Poland a year and a half ago, which was very cool. And now we're working on kind of doing some, like, modeling analysis to talk about, you know, people that were in this one organization, were they in some way more impactful than people in this other organization, that sort of thing.
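To make that data shape concrete, records of who belonged to which organization can be turned into pairwise ties between people. The following is a minimal sketch in TypeScript under stated assumptions: the `Membership` type, the `coMembershipEdges` function, and the placeholder rows are hypothetical and are not drawn from the actual dataset, and the sketch ignores the time dimension entirely.

```typescript
// Hypothetical record type: one row per (person, organization) pair.
interface Membership {
  person: string;
  organization: string;
}

// Returns one undirected edge [personA, personB, organization] for every
// pair of people who appear in the same organization.
function coMembershipEdges(rows: Membership[]): [string, string, string][] {
  // Group the distinct members of each organization.
  const byOrg: { [org: string]: string[] } = Object.create(null);
  for (const row of rows) {
    const members = byOrg[row.organization] || (byOrg[row.organization] = []);
    if (members.indexOf(row.person) === -1) members.push(row.person);
  }

  // Emit every unordered pair within each group.
  const edges: [string, string, string][] = [];
  for (const org of Object.keys(byOrg)) {
    const people = byOrg[org].slice().sort();
    for (let i = 0; i < people.length; i++) {
      for (let j = i + 1; j < people.length; j++) {
        edges.push([people[i], people[j], org]);
      }
    }
  }
  return edges;
}

// Placeholder data, not real records.
const rows: Membership[] = [
  { person: "A", organization: "org1" },
  { person: "B", organization: "org1" },
  { person: "C", organization: "org1" },
  { person: "A", organization: "org2" },
];
const edges = coMembershipEdges(rows);
console.log(edges);
```

An organization with a single recorded member, like `org2` here, contributes no ties, which is one reason the consistency of the underlying spreadsheet matters so much before any network analysis.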
Like, the second week that I landed on campus, he sent me an email because I had listed my research interests on the website, and I put digital humanities because I really like, you know, text analysis, literary analysis, that sort of thing. And we got coffee on campus, and he was telling me all this, like, oh, I have this idea that if we had, you know, these connections and we could study it. And he had a really good vision for it as someone who isn't a data scientist. And I was getting more and more skeptical. Like, okay, like, plenty of people have these aspirations where they think that statistical analysis is just, like, throw everything in a blender and you get a magic answer, you know? And then he opened his spreadsheet, and I was like, oh, you have data! You've actually collected this data and structured it in a clean, not clean, but a consistent, anyway, form.
Staying current as an educator
Um, I mean, I think, like, a thing that I wish people thought about more as far as educators, and especially educators, you know, my job is not a research job. There's a little bit of research element, but it's a teaching job, right? You know, people in the industry say, oh, like, Python is now a little more of a lingua franca. We want Python, or we want SQL, or we want whatever. And how am I going to teach that to a student if I didn't learn it, right? And so the only way that I'm going to teach that to a student is I'm going to go out and learn it myself. That's kind of fun sometimes, but it's not, you know, I don't have any extra time in my day to do that. I'm not paid to do that. It really is a labor of love anytime you pick up a new skill.
So as far as new skills that I pick up for fun for myself, you know, I think of a lot of the R work I do not as work. I think of it as a thing I do for fun. So when something exciting, you know, comes across my Bluesky, or I find out at a conference or something, and then I, like, get an idea and sit down and do it, that doesn't feel like a work task. It feels like the same as when I cut stickers on my Cricut. You know, it's like a hobby. So I pick up new skills in the R world just because I'm excited to.
But when it's, like, skills needed for class. And then there's this decision in the classroom, too, of there's these skills that I've picked up. Do I build them into the class? You know, I mean, even the moment where we converted our R class from base R primary to tidyverse primary, that was a lot of work, even though I knew the tidyverse. There's the conversion. And I'm actually, Wes, you might have an interesting thought. I went through this fall of whether I needed to rebuild my Python class with Polars, and I ended up not doing it this time. But I feel like I need to build it in there soon. And then there's the whole, you know, DuckDB Arrow world. Do I build that in at some point? So even the stuff that I know how to do, that's a different thing than having built the materials to teach it, which also is in my free time.
Right now I'm having a conversation with my editors at O'Reilly about Python for Data Analysis, which is now a 13-year-old book. And we're talking about a fourth edition. And so I think we're going to do a fourth edition that still uses pandas, but I'm using Polars more and more. And maybe two years ago, I used Polars a little bit, and it would run into bugs and it would crash and it felt like it wasn't ready for primetime. Whereas now I'm building stuff with Polars. I'm not writing that much Polars code directly; I'm having Claude Code write a lot of the Polars code. But it's fast. It's got an API that's, especially if you're dealing with complex datasets and dealing with more Arrow-like data, like JSON data and stuff like that, it's definitely a lot more powerful and expressive for some of those more complex transformations.
One of the good things for education is that Polars is a lot smaller. The API is a lot less complex in many ways than pandas. And so I feel like it may actually be easier to teach. And there's fewer rough edges. There's no indexes, so you don't have to care about them. And I know that indexes are one of those complexities that people coming from the R world find very tedious and unintuitive in pandas.
Open source sustainability and AI
Fortunately for me, it seems that the pattern that I've developed over the years is that I start or help start projects and then get them to a place of critical mass where I recruit developers to join the core team to maintain the project. And then when the project is being sufficiently well looked after and maintained, I move on to start the next project. And so a fortunate side effect for me is that I'm not actively maintaining any open source projects right now.
But it's hard. And another thing that I spoke about in a talk recently was that now, with generative AI and ChatGPT and Claude and all these things, it's going to lead to much more use of these open source projects because ChatGPT and friends are all really good at writing pandas code, and they're all really good at writing Polars code. And so now the effective, the addressable audience, the addressable user base of people who can become Polars users or become pandas users is probably 10 or 100 times what it was before. And so to go from 10 million global pandas users to half a billion pandas users who are now running into the same bugs and issues that everyone else is running into. But, you know, OpenAI isn't helping maintain pandas. So like, what the heck?
Well, yeah, well, the flip side of that question, and the thing that I've been thinking a lot about lately, and I ask everyone what they think about it, is that it will be hard to get people to use new open source projects when their AI assistants don't know how to use them yet. And so it creates this like chicken and egg problem where ChatGPT is really good at using pandas because there's huge amounts of training data available on GitHub and all over the internet of people using pandas to solve problems. And so they've been able to take in all of this data and become really good at solving problems with pandas. But suppose a new tool comes along, and there's just not a corpus of training data to teach the LLMs how to use it. And so basically, if people become utterly dependent on AI tools to do anything, then new projects will never get used, they'll never get training data generated.
I think, you know, when I watch my students try to adopt a new function, you know, they certainly turn to AI first. And it's not necessarily a bad thing, I would actually say, like, the AI explaining the function tends to be more digestible than reading the documentation. But there's two things that are skills that they don't seem to have right now. One is doing the right query, right? So if you ask, here's this function, show me some examples and explain why they work, that's going to get you much further than what is this function? So the prompt engineering, which is a very fancy way of saying it, but what should you ask is hard. And then the checking afterwards, you know, not just because AI isn't always accurate, but just because, like, it's still good to look and say, this argument needs to be a data frame or whatever. And so I'm seeing them not do those steps.
Keeping R weird
One thing I really want to be sure to loop back to is I love that you gave a keynote at useR! on keeping R weird. I was wondering if you could just say a little bit about that.
No, yeah, that was, that was so cool of a thing to get to do. I'm not really a nervous speaker. And you can really hear that I started that one out super nervous just because I looked out and I was like, all my heroes are in this audience plus a thousand other people. It was like really intimidating. But no, it was a really fun talk to put together because, because there's kind of the two types of weirdness, right? There's the language is a weird language. And I didn't know that because it was my first fluency. Well, it wasn't my first language, but it was the first one that I was really embedded in, you know, and I'm not a computer scientist. And so learning from people as I got deeper into R, how it's weird was fun. But then the community is very quirky, especially the like sort of tidyverse pocket.
I mean, I don't know if Python, maybe Python's already weird. I do think the reputation from the outside for Python has changed a lot. It was, you know, maybe 10 years ago or something, it was a bro language and I had no real interest in getting involved in that community. And now it feels more like the R community, like there's PyLadies and there's fun stickers. And I met this woman at useR! this past summer who had a really cool Python tattoo and like all these things that feel special to me seem to be existing. And that makes me more excited to get involved with Python, even though I don't love writing Python code. I prefer not to.
And I think a lot of that comes with an openness, like when you don't take yourself so seriously, it also makes a culture where when a beginner is asking a question, you know, you don't jump on them. You don't say, well, it's in the documentation. The beginner doesn't know how to read documentation, you know. And so I've always really appreciated that about specifically the tidyverse pocket of the R community where back in the old days of #rstats Twitter, like someone would ask a question, very basic question, and it might be like Hadley or Jenny or Mine who would jump in and answer it. And that's, you know, that feeling of my question is so not stupid that a name that I recognize will answer me. It is like so important to making people excited to work on it.
And so that's where my perception, whether it was true or not, of the bro culture back in the day for Python has changed. I now feel like it's a language that is welcoming to beginners, and I'm not sure what shifted that. This is all my semi-outsider perspective. But yeah, I think that's a value of turning into like one big party instead of like this very serious, you know, let me think about the stack and the heap sort of approach to open source.
I think on the Python side, like I think compared with R, I think the R community by comparison, not to say it's 100% like this, but the R community is a lot more, trying to find the right description, like culturally homogenous and that like a lot of the R community comes from like a statistical origin. Like people, you know, they entered into learning R because they started out in statistics or biostatistics or like more of like a statistical field. And then, and so like that's like the foundation or core of the R ecosystem. Whereas like Python started out being like a little bit more like Unix sysadmin, and then it developed the web development ecosystem. And then, but there was always this scientific computing thing off on the side. And then that turned into a data thing, which now has gotten really big. And the data thing grew an AI appendage.
Most people come to Python through computer science. Like, most people see Python and they learn it. That's true. That's absolutely true. Yeah. And CS101. So most people come to R through some variety of statistics, but they're not statisticians, or they're not even taking a statistics class necessarily. Like a lot of people are coming to R because they have data in biology, or they have data in history now, or, you know, they have data, and they're like, this is the language people told me I should use to run my t-test, you know. So the statistics part is true, but it's not statisticians. And I wouldn't say the plurality even of R users are statisticians.
I think most people coming to Python know what a for loop is, you know. And like, many people are using R successfully and don't know what a for loop is. So there's a background hetero, how do you say that word? Heterogeneity. Yeah. That makes the R community a little more, you just get a lot more different perspectives, perhaps.
Candy corn and closing thoughts
Absolutely. Yeah. It's like red vines, candy corn. You want to see candy corn brought out of the seasonal.
It is sugar in triangular prism form and the important thing about candy corn, any sugar is sugar, but is the texture, right? It has the perfect texture to make your teeth feel good chomping on it without it sticking.
Well, Kelly, thanks so much for coming on. I feel like I just love the energy you bring to the R community. And anytime I see you at a conf and you're wheeling, I'm actually so sad I didn't. Yeah, you were slinging earrings, laser cut earrings. I so deeply appreciate the community and just like welcoming. Also, the level of pump you bring into situations like the whole. I mean, I just it's fun, right? Like for me, posit::conf is like a highlight of the year. We go to board game conventions and then we go to my R convention.
I had planned my wedding so that I could go last year, because, like, I don't want to miss it. Priorities. And having now gone for eight years, I guess not counting COVID, like, these people are my real friends.
The enthusiasm is not. Again, I don't I don't view it as work. Like I'm really appreciative of all the things that the people are doing in the community. I didn't start the earring thing. I didn't start the crafting. I didn't start the hex stickers. I just get to benefit from that. And it's yeah, I like feel very grateful for all the things that the people have built, both the actual technological tools, but the community that's formed, it's yeah, it makes my job much more fun for sure.
I learned that I didn't coin that. I don't know how that phrase got stuck in my head, but Hadley dug up some talk he gave from way back when where he had a "keep R weird" slide. And I feel like it's almost certain that I saw that and it stuck in my head. So everything I do is I think taking a slide to keynote is such a beautiful number. Someone had to like expound to write a thesis on the keeping of weirdness. So I'm glad it was you. And thanks. Thanks so much for coming on. And I'll see you at the next conf and on the internet, probably.
That's great. And Wes, great to talk to you. I really appreciate everything you've done.
The Test Set is a production of Posit PBC, an open source and enterprise tooling data science software company. This episode was produced in collaboration with creative studio Agi. For more episodes, visit thetestset.co or find us on your favorite podcast platform.