Resources

Mine Çetinkaya-Rundel: Teaching in the AI era — and keeping students engaged

In this conversation, Mine Çetinkaya-Rundel, data science educator at Duke University and Posit, joins Michael, Hadley, and Wes to talk about teaching data science at a time when AI can write the code for you. Mine shares her journey from actuarial science to academia, the teaching philosophy behind the “whole game” approach, and her experiments using LLMs for instant student feedback. Along the way, the group dives into the joys and risks of coding by hand, the role of open source in the classroom, and what it’s like to work across both the R and Python communities. Here’s your behind-the-scenes pass to Mine’s world: teaching hacks, AI perspectives and experiments, R/Python crossover conversation, and more. Plus, how a U.S. News & World Report publication set her career in motion (thanks, Mom).

What’s Inside:

• How a career in actuarial science led Mine to the world of data science and teaching
• The “whole game” approach to learning and how it helps students stay motivated
• Building an LLM-powered feedback tool for low-stakes assignments
• Balancing AI assistance with the need for hands-on coding experience
• The shared DNA of the R and Python scientific computing communities
• The hidden value of live coding, pair programming, and seeing the process, not just the output

Dec 3, 2025
54 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Welcome to The Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning, digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field.

On this episode, we chat with Mine Çetinkaya-Rundel, professor of the practice of statistical science at Duke University and a professional educator at Posit. I'm Michael Chow. Thanks for joining us.

All right. Hey, everyone out there. Welcome to The Test Set. I'm Michael Chow, and I'm joined by Hadley Wickham and Wes McKinney. And we're here with our fantastic guest, Mine Çetinkaya-Rundel, who's a professor at Duke and works on education at Posit as well. Mine, we're so excited to have you, and thanks for coming in.

Yeah, thanks for having me.

Mine's path from actuarial science to data science

I was hoping to ask, just to start out, could you just tell us how you arrived in your roles?

Yeah. I wish I had a very exciting story. I don't. I had an undergraduate degree from a business school, honestly because my parents wanted it. And I didn't really know what to do in business school, so I thought, I'll do the most mathematical thing one can do. So I was an actuarial science major, and then I worked as an actuary for two years in New York. And at the end of that, I decided, I'm not loving this. What can I do for free?

A PhD is a thing you can do for free, if you can get into it. And I also really enjoyed teaching when I was an undergrad. I used to do a lot of tutoring; that's how I paid a lot of bills. So I thought, this could be an interesting thing to try. Looking at my background, and also at the work I was doing as an actuary, I really enjoyed the pieces of the work where we did things with data, which I hadn't really done a lot of as an undergrad. I thought, I'll apply to some stats programs. So that's how I ended up in a stats PhD, and I fell in love with teaching even more.

I also realized that I really enjoy coding. I wrote my first line of code when I was in graduate school, I think, which is late, but whatever. And then I decided to stay in academia. At some point, I was doing lots of public-facing things as well, using a lot of the tools developed by then-RStudio, now Posit. And so I now also work part-time at Posit as a developer educator.

Oh, awesome. And that reminds me, I don't know a lot about actuaries, but you mentioned writing your first line of code in grad school. Are actuaries not slinging a lot of code, is that right?

Yeah. I mean, actually, that's not 100% fair. There was one piece of software we used. I believe it was called Ginsu, like the knife. It was for cleaning data.

One of the clients we had was the United Nations; we did the retirement plan for the United Nations. The data came in literal boxes of CDs. So we had to load them up and check for things like: did the employment date change? Did the birth date change from last year's data to now? Each row of the data set is an employee, and those things shouldn't change. So there was a lot of data quality checking. I remember using that software, though I don't remember anything about it; I believe it was in-house software, to be honest. But what I do remember is that you, or someone else previously, had written these tests, like did the birthday change, and it would print the results out on a dot matrix printer. So I would go through the paper, looking at the results, and then go back to the client and ask, can we actually confirm these lines? I really enjoyed that piece of the work.

There was the other piece of the work, which was more regulatory. You have to do these filings, sort of like an accountant, which was doable but wasn't the enjoyable part. So I did work with code others had written, but didn't really write code myself as an actuary.

Yeah, nice. I love that, though. It's very farm to table. You're actually in the data physically, striking things out with a pen.

And then another thing I really enjoyed: there's all this analysis that you do, and then all of it needs to go into a report. Oftentimes those reports were Word documents, so someone would take a number from a printout and put it into the Word document. And that really bugged me. So I changed all of our reports to actually be Excel files that, when you print them out, look like Word documents, but the cells were automatically calculated.

I think that's sort of what I later found out was called reproducibility. But I just didn't know. I was just like, I don't trust this process.

I was an actuary because my mom had read a U.S. News & World Report thing that said actuary is the best job you can have. That was the reason I was an actuary.

It's funny. I actually also had a dalliance with actuarial science. I had an internship at an actuarial science firm that was also in the business of certifying pension plans, which I didn't know a lot about, and which is kind of a dying field now that defined benefit pension plans have moved to 401(k)s. And I saw some of the same issues: can't this be automated? This seems really error prone. I thought that I would be doing more math and science, but there was so much data entry, filling out tax forms, and double-checking numbers. By the end I was thinking, I hadn't written very much code at the time, but, can we write code to do this? But the actuarial firms are also in the business of staying in business, and if you automate too much of the work, there's less work to do. So there was almost a counter-incentive to making things too efficient.

Yeah. I love that being actuaries agitated both of you into data science.

Joining Posit and the value of classroom teaching

I'm really curious about that too. So you're at Duke, and you also do work at Posit. How did you get involved with Posit?

I took a year's leave to try out industry work. And there were parts of that work that I really, really enjoyed: working on open source packages, thinking about how we communicate about them. I really like writing documentation, that sort of stuff. But I really missed classroom teaching during that year. I also realized that some of what used to motivate me to open issues in the repos of these packages was observing students getting confused or making a mistake, and then realizing, actually, that's the fault of how this is written, this could be clearer, as opposed to a complete misconception by the student. And I realized that, not being in the classroom with students, I had lost that.

Now, of course, you can do workshops and such, which is something I still do and did during that year as well. But those are such short-term engagements that you don't end up seeing where someone starts and how they get better. And to me, that's the joyous part of teaching: seeing students from their first year to their fourth year, and how they develop. So I missed that part of it. That's why I tried to negotiate: well, can I do a little bit of both? And that's how I settled into what I do now.

Oh, neat. And you've been at Posit for almost eight years, is it?

Yeah. I feel like this has come up before, this making contact with students. Hadley, I'm curious what you think of that dynamic, of students keeping you grounded, getting you in touch with those realities.

Yeah, I really enjoyed that at Rice, especially teaching STAT 405, which is basically an introduction to data science. You teach it, you get a fresh batch of students every year, so you get to see: has my teaching gotten better, have my tools gotten better, do they get a little bit further this year than last year? And if you pay attention, you learn which things you just assume everyone knows, things that I know and then have to communicate to the students.

The "whole game" approach and R for Data Science

And after I left Rice, I was definitely worried for a long time that I'd just kind of drift away into ivory-tower land, with nothing to pull me back to reality. But that doesn't seem to have happened, and I've stopped worrying about it.

Oh, nice. I know you're an author on the second edition of R for Data Science. Do you feel like teaching in the classroom influenced what you brought into that book?

Yeah, totally. We reordered some things. We also restructured the book to have this whole game part, which was actually inspired by another book of Hadley and Jenny's; the R Packages book has that as well. The idea is that you may only get so much of it. I think about it in an academic sense: maybe you're at an institution on a quarter system, or maybe the course has other learning goals, and you're not going to get to see all the intricacies of the tooling the tidyverse offers, for example. The whole game can be a really good start. I also find that getting to a complete story quickly can be really motivating for students. And then we go into the details in the rest of the book.

So that was fun to write. The tooling had also evolved over time, so this was also an opportunity to rethink which aspects we want to highlight and which maybe we don't need to as much. Some of the things that were callouts or warnings could actually just be removed, because the tooling now does the right thing or already gives you the warnings anyway.

Have you read the teaching book that the whole game idea came from? Yes. Because I found the baseball example striking: when you teach kids to play baseball, you don't teach them how to swing the bat for three weeks, then how to catch the ball for three weeks, then how to run the bases for three weeks. You introduce a simplified version of the whole game, and that's more fun. That was really influential, I think, for a lot of our teaching materials recently: give people the experience of the whole thing, a quick first pass, and then loop back and go deeper each time.

Yeah. It's a really striking feature of the books where this was added. I noticed a bunch of the books now, for different tools or different parts of the tidyverse, use the whole game.

It's difficult in general to write books about open source software, especially software that is relatively new and changing. You're also writing a book in the context of all the other tools it's related to, which includes how you install the software and the other packages you mention in the book. I saw this very acutely with Python for Data Analysis. When it came out in 2012, pandas was a pretty new package, and it's clearly evolved a lot in the intervening ten years. Just how you get Python up and running with all the packages installed has changed a lot. So you definitely have to go back and revisit and redo all of that whenever you do subsequent editions.

But as Hadley mentioned, I think a key thing is not losing sight of the things that people struggle with when they're learning how to use these tools, because as a tool builder, it is easy to take things for granted, to not realize that something might be easy for you but may not be easy or intuitive for everyone. Even when you're writing, one thing I found is the importance of the language you use and how you describe things: to not make assumptions, to not suggest that things are easy or simple. I remember when I went through Python for Data Analysis for the second edition, I removed every "just" and every "this is easy," "this is simple," because, well, it may not be obvious.

Open source in the classroom and GitHub workflows

Yeah, that's really cool. And it's neat that all these books, Python for Data Analysis and R for Data Science, are both open, so you can just go to the repository. I think, now versus when we were starting out, there wasn't GitHub; there was open source, but the open source development process wasn't nearly so accessible. So I'd be curious: I imagine students want to learn how to not just use open source, but be involved in open source and engage with the tools and packages they use.

Yeah. It's something I demo in class as well, not necessarily the whole let's-clone-this-repo-and-rebuild-the-thing, but: you caught a typo, you can literally go onto GitHub and edit it, and that makes a pull request. If I remember correctly, I did this in a talk and Hadley merged the request during the talk. It was a talk generally about contributing to open source software, and that was a prearranged thing, but just to be able to see that it's there. And we always try to make a point of saying thanks when we merge these contributions, and even if we don't merge, we try to say thanks for the feedback. I think that's helpful for any learner, student or not, to feel like, oh, I can be part of this and I can help make it better.

Do I remember that you had a more advanced course where students had to submit all their homework through GitHub?

They do. My intro data science course is like that. Every single assignment is a GitHub repository, and each gets submitted as such. From semester to semester, I vary what we do with the artifacts: there were semesters, particularly during COVID, where I wrote more automated tests to check things; now it's more human feedback, and I'm working on a project where AI could give feedback. But ultimately every single thing is a GitHub repository, and the tooling is there to make that happen. Now, they're not doing all the GitHub things; there are no pull requests or branches, just one repo you push to. But the students even mention that it's quite nice, as a first-year student maybe applying for internships, to have Git and GitHub on your resume. And that's very fair; it's not an exaggeration. They really do have the basic skills.

It is nice. You're kind of playing the whole game as part of the course. Rather than talking about how they could do it, you're just getting them in.

And let's be 100% honest here. There's self-preservation as well. The alternative is a course management system that I have to deal with. So I'm doing it for their learning, but also to protect my sanity a little bit.

Using LLMs for student feedback

You mentioned building an LLM assistant to give feedback, which I think is super intriguing. Could you say a little more about what that looked like?

Yeah, it's one of those projects I've been working on this year that's fueled by frustration, but I'm trying to do something productive with it. I teach introductory data science to students who have not coded, or at least not coded in R. Nowadays it's hard to say any university student has never coded before, because they may have had some exposure to programming in high school, for example, but rarely to working with data and code at the same time prior to this course. A lot of the content is what's in R for Data Science, plus some modeling as well.

The tasks we ask students to do are simple, particularly in the first few weeks, but as LLMs get better, even the ones further into the semester are things that many LLMs can generate somewhat reasonable answers for. Over the last couple of years, the quality of those answers has been increasing. Although when you look at a submitted homework assignment, it really seems like someone with multiple personalities has written it: from question one to question two, very different styles, neither of them wrong necessarily, just not ideal, or not conforming to the principles we teach in class.

Could you give a quick example of something they might answer? Yeah. For example: take this data set and pivot it longer so you can then make a visualization. And when you pivot it longer, maybe you need to separate a column into two to get the year and the month out, something like that. Now, asking this question at a very high level, without giving interim steps, would make it harder to get the exact answer we're looking for. But because these are students who are just learning, it doesn't seem 100% fair not to scaffold it a little for them. And as soon as you scaffold it, that's basically better prompting for an LLM. So I've done that prompting work for it.
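
The kind of task described can be sketched quickly. Here is a hedged illustration in pandas (the course itself uses the tidyverse; the data set and column names below are made up for illustration):

```python
import pandas as pd

# Hypothetical wide data: one row per city, one column per "YYYY-MM" period.
wide = pd.DataFrame({
    "city": ["Durham", "Raleigh"],
    "2023-01": [10, 12],
    "2023-02": [11, 14],
})

# "Pivot longer": gather the period columns into name/value pairs
# (tidyr calls this pivot_longer(); pandas calls it melt()).
long = wide.melt(id_vars="city", var_name="period", value_name="count")

# "Separate" the period column into year and month (tidyr's separate()).
long[["year", "month"]] = long["period"].str.split("-", expand=True)

# The result is tidy: one row per city-period, ready to visualize.
```

Each scaffolded step (pivot, then separate) maps onto one line here, which is why, as she notes, the scaffolding itself reads like a well-structured prompt.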

So the little pieces of code work pretty nicely. And that's been frustrating, because what happens is we then have human teaching assistants giving feedback on machine-generated responses that, on average, no one really reads, because the students never wrote the code in the first place. This is all on average, not every student. But I really don't think saying "don't use AI tools" is the right approach either. For one, I don't think that's preparing them for the right thing, and it's unrealistic. So that doesn't seem productive.

Anyway, the idea was: could we use LLMs to give immediate feedback? I already write very detailed rubrics so that the 20 TAs I have in this 300-person course can grade consistently. So we're building it such that the LLM gives you feedback, with the idea that if students know they can get immediate feedback on low-stakes assessments through an LLM, they will maybe attempt it themselves first. Because who wants these things just feeding output to each other? You may want to, for fun, but I'm hoping we can remove a little bit of the grade anxiety and give students a safe space to practice by themselves, with the LLM giving them some feedback. So not grading, but giving feedback.
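
A minimal sketch of that feedback-not-grading design, assuming a rubric-driven prompt. The function name and prompt wording here are hypothetical illustrations, not the actual tool (which is built as an R package):

```python
# Hypothetical sketch: combine an instructor-written rubric with a student's
# submission into one prompt for a feedback-only (never grading) model call.
def build_feedback_prompt(rubric: str, student_code: str) -> str:
    """Return a prompt asking an LLM for formative feedback on student code.

    The model is asked for feedback only, never a grade, matching the
    low-stakes, formative design described in the conversation.
    """
    return (
        "You are a teaching assistant for an intro data science course.\n"
        "Using the rubric below, give constructive feedback on the "
        "student's code. Do not assign a grade or a score.\n\n"
        f"## Rubric\n{rubric}\n\n"
        f"## Student code\n{student_code}\n"
    )
```

The detailed rubrics she already writes for TA consistency slot directly into a prompt like this, which is presumably why the same material works for both human and LLM feedback.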

And then break things down so that some things happen collaboratively in class. Even if students do use an LLM, if ChatGPT is giving you something and you're looking over it with a friend and analyzing it, that's still good learning, in my opinion. It's the copy-paste, just because it happened to run, that I'm trying to reduce.

And then spending human TA time grading that, that's just a—

Yeah. And the idea is that the freed-up time can go into more high-touch things like office hours and problem-solving sessions, things students always say they enjoy a lot. The TAs and I get really tired of reading LLM-generated code, I think because their job is defined as: to do a good job, you need to write good feedback. And it's really hard to motivate yourself to write good feedback if you don't think a human wrote that code in the first place.

Like no one's going to learn from it.

Yeah, you're like, you could be like fighting the battle of who could care less, I guess.

Yeah, and I'm choosing to continue to care. So this is the thing that I want. I've been working on testing it and building it and turning it into an R package, so it's not about students copy-pasting; they can highlight the code in the IDE, push a button, and get the response back immediately. I will class-test it in the fall.

Seems great. Wait, just to double-check that I've got it: are you saying it used to be that they were getting grades back for the assignments, but now it's shifted to be more about feedback? It's a shift, I guess, from what people sometimes call summative to formative. Now it's a formative assessment, where they're being assessed just for feedback, so they can learn really quickly.

Yeah. One thing I haven't yet figured out, but that seems figure-out-able, is some artifact that can be submitted to say, I have done this, plus some reflection. Because, as I said, I have some hypotheses about how this may be motivating, but I don't know if they're true. So I'll be collecting some data from the students, and also giving them some bounties: if you catch feedback that's wrong, submit it. That can be fun. People love catching other people's, or machines', mistakes, you know?

Yeah. And students will do a lot for extra credit, if you make the extra credit "find a case where it was wrong." And I think: find a case where it's wrong, and then articulate it. How would you make it better? That's a paragraph you might write, but it's intellectually quite high-value work, in my opinion.

Yeah. And are you developing the tool with ellmer, our library? Yes. That's the package doing the sending of the code to the model. And the model we're using is one that's built with the open source models out there, plus content from R for Data Science and my course materials, because we don't want students running out of tokens. Although the idea of a cost is also intriguing to play with. Should you really get limitless feedback, or should you sometimes have to think, am I ready for feedback? I hadn't thought about that before, and to me that's a more cognitive thing that's interesting to think about, though not necessarily my area of expertise. But this way the model can be hosted at the university, so we don't get any differentiation between students based on what they can afford to pay for.

Muscle memory, LLMs, and the future of learning to code

Yeah. One thing I think about a lot lately, and I don't know how much it's been on your mind: I have perhaps old-fashioned ideas about developing muscle memory around the really basic, mundane details of data manipulation, using, for example, dplyr's API or pandas's API. With LLMs in the mix now, I feel like a lot of students and new programmers just won't get as much of that basic hands-on practice doing really simple, mundane tasks. So they won't build up that base of muscle memory where you see a data set, you need to do X, Y, and Z, and, in the old days before LLMs, you would have had to learn those basic commands. You would see the structure and say, okay, this is this sequence of transformations with dplyr, this sequence of pandas operations. Again, maybe it's an old-fashioned feeling that something will somehow be lost, or that this will result in people being less effective. But if they always have an LLM available to write that code for them, maybe it doesn't matter. I don't know if you have some way to predict, longer term, what effect that will have on the practice of data science.

I do think some of it is about reframing it to be more like learning to play an instrument. Lots of people learn the guitar or piano, and 100%, no matter what song you want, you can download a professional recording of it that's going to be way better than you. But that doesn't change the fact that there's still some intrinsic joy in learning an instrument. And I think, for most of us, there's that intrinsic joy in programming too, even if you could automate it. The other thing I think about is Japanese hand-tool woodworking, where people do everything with hand tools: I'm not going to use nails, I'm not going to use screws. Of course it's way less efficient, but there's still that joy of doing it by hand.

I mean, intricate joints, where rather than just nailing things together or using a power saw to cut things, you're making beautifully intricate dovetails that just fit together. But it feels like some of that is happening to coding now. We've got these power tools where you can just saw through whatever.

Yeah, but with AI-powered autocomplete, sometimes the LLM will predict what it sees you about to do and essentially suggest the code you were going to write. So you press tab and move on. And sometimes it's not what you want at all, and you say, sorry, ChatGPT, or sorry, GPT-4o, you're drunk.

But with autocomplete, for example, saying "that's not what I want" requires background knowledge and experience that, at least for the students I work with, they don't necessarily have. And I see this myself as well. I'd say I'm a pretty experienced programmer in R and much less so in Python. When I use the autocomplete stuff while writing Python code, it just feels like some annoying person is interjecting into my thoughts too quickly, and I don't have the experience to just say, that's not what I want. So I run the code, and running the code and getting an error is more frustrating, because now I feel like I have to debug someone else's idea, if that makes sense. And that's a harder skill.

It's also like having a conversation where the other person is constantly finishing your sentences, which is just so irritating.

It reminds me: if you're driving a car, you can have a conversation really easily; it's not very cognitively demanding. But if you're driving in a storm, or in traffic, you might have trouble talking. And a person in the car with you might actually regulate how they talk to you, because they know that. But with LLM autocomplete, like you mentioned with Python, you're just getting hit with suggestions. Maybe it's not as sensitive to where you are, and your need to work through things.

Yeah. So I turn that off when I'm doing something I'm not as comfortable with, versus when I can immediately see if it's the right thing; sure, then it saves me some time typing.

I think another thing, though, is that if a lot of this stuff is automated and you're not in there looking at the data and the interim output, I'm not sure how one develops a sense of the data set. So that at a later point in your analysis, say you fit a model and you're looking at the coefficients, you might say, that doesn't seem right to me. I still do lots of statistical data analysis where I actually look at the coefficients, not did-we-predict-right-or-wrong, but the actual numbers. And sometimes a magnitude is so much larger than it should be, and I don't know that I would have the sense to evaluate that if I hadn't spent hours and hours looking at interim output.

And maybe I could be disciplined enough. There are these AI packages even, and I guess with LLMs you can do it more: here's a data set, give me some preliminary summaries. I don't tend to use them, because I want to go through that at my own speed to get to know the data, you know? But yeah, it might be old school.

Pair programming and live coding with students

It does remind me of your, I don't know if this is a deep cut, but I noticed you did a few episodes on YouTube of pair programming. Oh, yeah, with people. Yeah, like students. Yeah, yes. And what you said reminded me of that, like how much of data analysis is actually nonverbal, like where people put their eyes, what they look at, and even how they shift things around.

And I'm curious what you took from those. Like, do you have any thoughts on what people get out of that type of activity, versus like LLM feedback? Or maybe just how that went?

Yeah, that activity was fun. I wish we could have kept it up. But video making is always such a high activation energy thing that we didn't. But the idea there was that we would take a data set, like one of the tidy Tuesday data sets, maybe or maybe something else that the student was interested in. And we would have a conversation about like how to mostly visualize it or like ask some other questions.

And in some of the episodes, I am the person coding and the student is saying, I wonder if... and then we talk through how we might do it. Or in other cases, they're the one coding and then I'm the one sort of trying to generate ideas. What was nice about that, I think, and also the feedback that I got from students after chatting with them after the recording, was that it takes them a bit of time to think about something and to be able to implement it, so they sometimes give up along the way, like when they're driving. Yeah, yeah. But to see over and over that I also didn't get there immediately was very helpful. Which is something, I do live coding in the classroom as well. It's just I teach large classes, so it's hard to get that kind of interaction in a large class. But when you're one on one or with a small group, they can give you ideas, and they see you having to look at documentation or having to, you know, run into errors and correct them.

I imagine, for example, for folks who are better at using these AI tools for coding, to be able to do that sort of thing would also be very instructive: how does an expert programmer use these to their advantage, as opposed to as an annoying voice that gets in the way?

Yeah, I mean, it probably won't surprise anybody to know that there's a whole, you know, kind of ecosystem developing of startups and companies building basically vibe data science, some version of vibe coding for data science. So much so that some of the tools are generating code, but there's no point even showing you the code by default, because most people using the tool aren't going to read it anyway. And so you'll just look at the output and say, well, this seems wrong, or, you know, could you double check that? That doesn't seem right. So you could give feedback to the agent that's doing the coding for you, but a lot of the users aren't actually going to read the code. And to me, it feels a bit weird, rather dangerous even, that you may get to the point of making a real-life decision that impacts people, and you've got this whole vibe-coded analysis where you haven't read the code. So I predict that we'll probably see some success stories, but also some horror stories, of how somebody pushed a button or made a business decision, and it was based on vibe-coded slop.

R and Python conferences

I was thinking maybe we could shift. I'm really interested, because I know you do both R and Python conferences for some of your workshops. Is that right? Yeah. Since we have both R and Python representation here. Yeah, I'm super curious how you've experienced R and Python conferences. What's that been like, to go kind of across the languages?

I think I've been to more R conferences than Python conferences. But last year, I went to SciPy, which I really, really enjoyed. I was at the talks, even when it wasn't my sort of focus area, and I really enjoyed it. It reminded me of useR from back in the day, when it was about the same size, maybe, I don't know, 2018, 2019, something like that. I really enjoyed that. I enjoyed the workshops, I think they were really thoughtfully designed. And I felt like I learned a lot.

I've been to a couple of other Python conferences as well, where I wasn't exactly sure where I fit. I could see that what I was teaching in a tutorial could be useful for folks. So for example, one of them was like, take a Jupyter Notebook and turn it into a website. Well, you can use Quarto to do that. That's, I think, a useful tool for a lot of folks coming from many different avenues, or turn it into a book or something like that.

But in terms of the focus of applications, in some of the other conferences I've been to, I found less applicability. That being said, I think the right thing to say is that I'm still learning Python, but very much in a data science context. And when you go to a Python conference, that's not the only context, even if it has the word data in it. Versus I do think that most R conferences, at least the ones I have been to, have always had the data or stats side ingrained in them. So it's a little easier for me to see how I fit into that ecosystem and what I can get out of it.

Versus the other one is nice for exposure, but a little harder for me to walk away with: these are things I can do. And I often try to measure the value of a conference, at least personally to me, with: did I walk away with some new things that I can actually use, like, tomorrow?

So I will, I think, never forget the useR, I think it was in Nashville, where I learned what R Markdown is, and I literally stopped listening to the talk so I could start converting my course materials. It was that useful to me.

Wait, I think I saw a little thread of that. That's back when it was called knitr. Is that... yeah, that's the, yes. Yeah, it was like Rnw files, 2012 or something. 2012 sounds right. Yeah, yes, wait, that's neat.

It might be useful to for us to unpack because I think there are so few people in the data world who will both have spanned useR and SciPy like that slice of the Venn diagram is like mostly just all border.

Yeah, yeah, I've been to SciPy too. They feel like the people are very similar. Yeah, like it's pretty academic. It's, you know, people struggling to understand the data sets and talking about it.

Pretty academic, but also, I would say, with a strong commitment to maintainable open source software that can serve others as well. I feel like there's a subset of academic coders, which I totally was when I was writing my own packages, where I'm like, I need this to work for me so I can get this paper out, and the rest, I don't know, once this is published, you know. But I think in these conferences, there's a lot of academic motivation for the start of the projects, and a commitment to building tools that others can use as well. And that's the piece that I think is enjoyable to hear about.

Yeah, I think SciPy is the Python-related conference that I've been to the most over the years. And there was a time where it was the only conference that existed where people came to talk about scientific computing, and there was some emergent work in doing statistics and statistical data analysis.

So the first two Python talks I gave were in 2010: one at PyCon in Atlanta, and then at SciPy in Austin. And that's where initially I networked with and met Fernando Pérez from IPython and Jupyter, Brian Granger, Travis Oliphant from NumPy, Peter Wang, and you know, Travis and Peter went on to found Anaconda. And so this was kind of the original community that spawned the now much larger PyData ecosystem. But I think it's interesting because Python has always had this pretty passionate scientific computing community, like high performance computing, people doing high energy physics. They were refugees from MATLAB and Fortran, and they were just happy that they could wrap all their Fortran libraries in Python and script them that way. And so I'm happy that that community still exists, and there are still people in academia and in research labs doing hard science and scientific computing and talking about scientific data formats, so it hasn't totally been taken over by data science and machine learning and AI.

Yeah. Yeah. It's neat to hear that that too was one of the first places you debuted. It was a pandas talk, is that right? Yeah. Yeah. And there was a paper. SciPy, I think, still has paper submissions. And so the first academic-style paper about pandas came out as part of the SciPy proceedings in 2010. So when people cite pandas, they usually cite that paper. It all had to be written in LaTeX, of course.

Yeah, that's wild.

Yeah, the other thing that's sort of mind blowing to me looking back now is, like, you know, a lot of the early days of R and Python were about rejecting this idea that you should have to pay for a programming environment. Like, that used to be the norm. You would go and buy MATLAB, you'd buy a Fortran compiler. And now the idea that you would buy a programming environment just seems bizarre. Yeah. But that was where we started from.

Python in universities and teaching across languages

I'm also curious how interest in Python has shifted in universities.

Yes, Python is usually the 101 course, because just about everyone at every university seems to take a course like that. So for our stats students, we recommend that, and quite a few of them do stats and computer science together. Some out of genuine intellectual interest, and some, I think, hedging their bets in terms of: where will the hiring market be by the time I graduate? Will it be more data science and machine learning modeling, or might I get a software engineering job? So they tend to sort of do both.

And so they do get exposure to that. At the graduate level, a lot of our PhD students, for example, still ultimately write code in R and use R packages. But I feel like I see more of them, particularly coming to a graduate program after like a couple years in industry, coming back with R and Python skills.

And the thing I really appreciate in some of these students is how they're good at going between languages, like, very versatile. I feel like knowing the ecosystem, and knowing that there's this package or library that I can leverage and I can figure that out, sort of enlarges your ability to do things.

It's like, I think about it as people who rap in multiple languages. All of a sudden you have more words you can rhyme, you know, like you've expanded your vocabulary. And it both sounds cool and also gives you options.

What students want and what they need

It's a nice metaphor. I was thinking a little more broadly, like, you know, I'm sure you get a lot of feedback from students about what they want, what they think you should be teaching them. What are the things where they think it's a good idea and you know it's not, and you tell them that? But also, what are the things where you're like, oh yeah, we should change our curriculum to teach that?

Well, so one example for the latter, I can say, is that, you know, ever since the R Markdown ecosystem and now Quarto has been around, I've made my course websites with one of these tools. I've made slides with one of these tools. But it used to not be a learning goal for the intro data science course. My students did write Quarto documents for their homework assignments, so they know how to write computational documents.

But turning it into a website, for example, was not one of the learning goals, because they already have to learn so much. It seemed like putting a lot on them. But then I would hear things from students like, oh, you can only do data analysis with R, but with Python you can do everything.

And I feel like Shiny changed things a little bit. Oh, now I can make web applications with R. Like that's a thing that's useful outside of doing my homework. And so now, for example, in the intro data science course, the project that they do at the end actually is a Quarto website.

Oh, cool. And there's very little additional overhead to make this happen if you set it up for them and they're just putting their content in there. But I feel like just that knowledge that, with this language that Wikipedia calls a statistical programming language, you can build more things, is actually really useful. Maybe I don't then teach them every single other thing you can do with R, but it motivates some curiosity for them, to be like, I wonder if we can do this with R, as opposed to thinking there's no way I can do this with R.
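For readers wondering what that "very little additional overhead" looks like: a Quarto website is essentially a folder containing a `_quarto.yml` configuration file plus the `.qmd` documents students already know how to write. A minimal sketch (the file names and titles here are just illustrative):

```yaml
# _quarto.yml: minimal configuration for a Quarto website project
project:
  type: website

website:
  title: "Intro Data Science Project"
  navbar:
    left:
      - href: index.qmd      # landing page
        text: Home
      - href: analysis.qmd   # a student's analysis document
        text: Analysis
```

Running `quarto render` in that folder builds the site, and `quarto preview` serves it locally while editing, so students mostly just drop their content into pre-made pages.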

Yeah, I could see how if you put up one demo of doing the kind of thing they thought was impossible, it really opens the door. And I think you're writing a book on Quarto. So at some point people will hopefully have a whole game of Quarto to go through.

Yeah, we do have that. I am writing a book on Quarto.

Why data science matters

I feel like we've touched so much on stuff that students will get a lot out of. I don't know, maybe we can just go around really quick with the time we have left and say why data science matters to you. Because I do think this is so useful for students, I'd love to hear: why data science? Since we've talked so much about learning and conferences.

Hadley, do you want to start?

No, I just think it's like so empowering. Like there's so much data around us, like data that you're generating, data that things that you care about are generating. Like learning a little bit of data science just gives you this amazing power to kind of like dig in and learn stuff. And I think it's like so rewarding and so fun when you find like that data you really care about. And now you're empowered to learn more about yourself. And it's like super, super cool.

Yeah, it's really neat. Thanks. Wes?

I mean, I think it's a big question, but I think that data literacy and statistical literacy in general is probably, you know, something that's missing from a lot of basic education, especially in the United States. Like we learned to do trigonometry and geometry and algebra and I don't know what else is in the general high school curriculum, but I did not learn any statistics or data literacy in elementary school, middle school, high school.

And so when you go out into the world, I think people are missing this foundation of how to make judgments, how to interpret information they're receiving from the standpoint of data. Like, how do you understand taking risks? How do you understand your finances? And whenever somebody presents you a fact, you should be asking: is that fact supported by data? If so, what's the data? Can I have a look at it? If you could get access to the data set, maybe you could explore it yourself and see if the analysis is cooked, or has been spun in a way to support a narrative. And so I feel like equipping people with tools to make data analysis more accessible feels like just a really valuable thing to do in the world, so that more people can be data literate and equipped to ask their own questions and have the tools to answer them.

Yeah, thanks. And Mine?

And learn this about myself. I think those are just cool little things you can do to learn either about you or the world around you. That's helpful. And that's sort of the working-with-data side. There's also the more statistics aspect of things, making sense of uncertainty, which I think is even harder, and which is a thing I try to keep in my teaching as much as possible. Both because I teach in a stats department, but I also think it takes some time to be comfortable with making decisions around risk and whatnot when there's uncertainty around the estimates that you're looking at.

And I think that's something we need to get people to understand as well, because at very critical times of your life, that might be the sort of statistics someone tells you and you need to make a medical decision or something. So having some experience with that sort of thinking, I think is very helpful.

Yeah, thanks. Yeah, it's so helpful to hear the need for kind of code-first data science and the power of data literacy. And yeah, how you might need to make personal decisions with data, or big decisions with data, and having a sense for things like uncertainty is really important. Yeah, yeah. Thanks. Thanks so much, everyone. I think this has been so helpful to hear.

Do you want to answer the question too?

What's the question?

Why is data science important?

It's such a good. Wow, what a great question.

Yeah, I think data science is important. I mean, going back to what Mine said, I do think everybody interacts with data, whether it's a little bit or you're being hit with data. And I think all of our interactions with data matter. Like, we use data for so many things.

And so I think that just I think data literacy really equips people, you know, whether it's statistics or coding to just handle all the situations that might be thrown at us, whether you're like tracking coupons in an Excel spreadsheet, which is like my dream scenario, or you're like, yeah, having to make an important decision that involves a statistic, which will have uncertainty. I just think that it can really improve so many people's lives in so many ways. And some of those are really critical decisions that impact a lot of us.

So, yeah, I really appreciate everybody taking the time. And I mean, I feel like if I were a student, this is exactly what I would have wanted to hear, from going from actuarial science to data science, to the role of LLMs in education. Yeah, I mean, I really appreciate you coming on and just hitting us with so many things to think about.

Thank you. Thanks for having me.

Thanks, Mine.

The Test Set is a production of Posit PBC, an open source and enterprise tooling data science software company. This episode was produced in collaboration with branding and design agency, Agi. For more episodes, visit thetestset.co or find us on your favorite podcast platform.