Resources

Mine Çetinkaya-Rundel: Teaching in the AI era — and keeping students engaged

In this conversation, Mine Çetinkaya-Rundel, data science educator at Duke University and Posit, joins Michael, Hadley, and Wes to talk about teaching data science at a time when AI can write the code for you. Mine shares her journey from actuarial science to academia, the teaching philosophy behind the “whole game” approach, and her experiments using LLMs for instant student feedback. Along the way, the group dives into the joys and risks of coding by hand, the role of open source in the classroom, and what it’s like to work across both the R and Python communities. Here’s your behind-the-scenes pass to Mine’s world: teaching hacks, AI perspectives and experiments, R/Python crossover conversation, and more. Plus, how a U.S. News & World Report publication set her career in motion (thanks, Mom).

What’s Inside:

• How a career in actuarial science led Mine to the world of data science and teaching
• The “whole game” approach to learning and how it helps students stay motivated
• Building an LLM-powered feedback tool for low-stakes assignments
• Balancing AI assistance with the need for hands-on coding experience
• The shared DNA of the R and Python scientific computing communities
• The hidden value of live coding, pair programming, and seeing the process, not just the output

Dec 3, 2025
54 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Welcome to The Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning, digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field.

On this episode, we chat with Mine Çetinkaya-Rundel, professor of the practice of statistical science at Duke University and a professional educator at Posit. I'm Michael Chow. Thanks for joining us.

All right. Hey, everyone out there. Welcome to The Test Set. I'm Michael Chow, and I'm joined by Hadley Wickham and Wes McKinney. And we're here with our fantastic guest, Mine Çetinkaya-Rundel, who's a professor at Duke and works on education at Posit as well. Mine, we're so excited to have you, and thanks for coming in.

Yeah, thanks for having me.

Mine's path from actuarial science to data science

I was hoping to ask, just to start out, could you just tell us how you arrived in your roles?

Yeah. I wish I had a very exciting story. I don't. I had an undergraduate degree from a business school, honestly because my parents wanted it. And I didn't really know what to do in business school, so I thought, I'll do the most mathematical thing one can do. So I was an actuarial science major, and then I worked as an actuary for two years in New York. And at the end of that, I decided, I'm not loving this. What can I do for free?

A PhD is a thing you can do for free, if you can get into it. And I also really enjoyed teaching when I was an undergrad. I used to do a lot of tutoring; that's how I paid a lot of bills. So I thought, this could be an interesting thing to try. Looking at my background, and also at the work I was doing as an actuary, I really enjoyed the pieces of the work where we did things with data, which I hadn't really done a lot of as an undergrad. I thought, I'll apply to some stats programs. So that's how I ended up in a stats PhD, and I fell in love with teaching even more.

I also realized that I really enjoy coding. I wrote my first line of code when I was in graduate school, I think, which is late, but whatever. And then I decided to stay in academia. At some point, I was doing lots of public-facing things as well, using a lot of the tools developed by then-RStudio, now Posit. And so I now also work part-time at Posit as a developer educator.

Oh, awesome. And that reminds me, I don't know a lot about actuaries, but you mentioned writing your first line of code in grad school. Are actuaries not slinging a lot of code, is that right?

Yeah. I mean, actually, that's not 100% fair. There was one piece of software we used. I believe it was called Ginsu, like the knife. It was for cleaning data.

One of the clients we had was the United Nations; we did the retirement plan for the United Nations. The data came in literal boxes of CDs. So we had to load them up and check for things like: did the employment date change? Did the birth date change from last year's data to now? Each row of the data set is an employee, and those things shouldn't change. So there was a lot of data quality checking. I remember using that software, though I don't remember anything about it; I believe it was in-house software, to be honest. But what I do remember is that you, or someone else previously, had written these tests, like did the birthday change, and it would print the results out on a dot matrix printer. So I would go through the paper, looking at the results, and then go back to the client and ask, can we actually confirm these lines? I really enjoyed that piece of the work.

There was the other piece of the work, which was more regulatory. You have to do these filings, sort of like an accountant, which was doable but wasn't the enjoyable part. So I did work with code others had written, but didn't really write code myself as an actuary.

Yeah, nice. I love that, though. It's very farm to table. You're actually in the data physically, striking things out with a pen.

And then another thing I really enjoyed: there's all this analysis that you do, and then all of it needs to go into a report. Oftentimes those reports were Word documents, so someone would take a number from a printout and put it into the Word document. And that really bugged me. So I changed all of our reports to actually be Excel files that, when you print them out, look like Word documents, but the cells were automatically calculated.

I think that's sort of what I later found out was called reproducibility. But I just didn't know. I was just like, I don't trust this process.

I was an actuary because my mom had read a U.S. News & World Report thing that said actuary is the best job you can have. That was the reason I was an actuary.

It's funny. I actually also had a dalliance with actuarial science. I had an internship at an actuarial science firm that was also in the business of certifying pension plans, which I didn't know a lot about, and which is kind of a dying field now that defined benefit pension plans have moved to 401(k)s. And I saw some of the same issues: can't this be automated? This seems really error prone. I thought that I would be doing more math and science, but there was so much data entry, filling out tax forms, and double-checking numbers. By the end I was thinking, I hadn't written very much code at the time, but, can we write code to do this? But the actuarial firms are also in the business of staying in business, and if you automate too much of the work, there's less work to do. So there was almost a counter-incentive to making things too efficient.

Yeah. I love that being actuaries agitated both of you into data science.

Joining Posit and the value of classroom teaching

I'm really curious about that too. So you're at Duke, and you also do work at Posit. How did you get involved with Posit?

I took a year's leave to try out industry work. And there were parts of that work that I really, really enjoyed: working on open source packages, thinking about how we communicate about them. I really like writing documentation, that sort of stuff. But I really missed classroom teaching during that year. I also realized that some of what used to motivate me to open issues in the repos of these packages was observing students getting confused or making a mistake, and then realizing, actually, that's the fault of how this is written, this could be clearer, as opposed to a complete misconception by the student. And I realized that, not being in the classroom with students, I had lost that.

Now, of course, you can do workshops and such, which is something I still do and did during that year as well. But those are such short-term engagements that you don't end up seeing where someone starts and how they get better. And to me, that's the joyous part of teaching: seeing students from their first year to their fourth year, and how they develop. So I missed that part of it. That's why I tried to negotiate: well, can I do a little bit of both? And that's how I settled into what I do now.

Oh, neat. And you've been at Posit for almost eight years, is it?

Yeah. I feel like this has come up before, this making contact with students. Hadley, I'm curious what you think of that dynamic, of students keeping you grounded, getting you in touch with those realities.

Yeah, I really enjoyed that at Rice, especially teaching STAT 405, which is basically an introduction to data science. You teach it, you get a fresh batch of students every year, so you get to see: has my teaching gotten better, have my tools gotten better, do they get a little bit further this year than last year? And if you pay attention, you learn which things you just assume everyone knows, things that I know and then have to communicate to the students.

The "whole game" approach and R for Data Science

And after I left Rice, I was definitely worried for a long time that I'd just kind of drift away into ivory-tower land, with nothing to pull me back to reality. But that doesn't seem to have happened, and I've stopped worrying about it.

Oh, nice. I know you're an author on the second edition of R for Data Science. Do you feel like teaching in the classroom influenced what you brought into that book?

Yeah, totally. We reordered some things. We also restructured the book to have this whole game part, which was actually inspired by another book of Hadley and Jenny's; the R Packages book has that as well. The idea is that you may only get so much of it. I think about it in an academic sense: maybe you're at an institution on a quarter system, or maybe the course has other learning goals, and you're not going to get to see all the intricacies of the tooling the tidyverse offers, for example. The whole game can be a really good start. I also find that getting to a complete story quickly can be really motivating for students. And then we go into the details in the rest of the book.

So that was fun to write. The tooling had also evolved over time, so this was also an opportunity to rethink which aspects we want to highlight and which maybe we don't need to as much. Some of the things that were callouts or warnings could actually just be removed, because the tooling now does the right thing or already gives you the warnings anyway.

Have you read the teaching book that the whole game idea came from? Yes. Because I found the baseball example striking: when you teach kids to play baseball, you don't teach them how to swing the bat for three weeks, then how to catch the ball for three weeks, then how to run the bases for three weeks. You introduce a simplified version of the whole game, and that's more fun. That was really influential, I think, for a lot of our teaching materials recently: give people the experience of the whole thing, a quick first pass, and then loop back and go deeper each time.

Yeah. It's a really striking feature of the books where this was added. I noticed a bunch of the books now, for different tools or different parts of the tidyverse, use the whole game.

It's difficult in general to write books about open source software, especially software that is relatively new and changing. You're also writing a book in the context of all the other tools it's related to, which includes how you install the software and the other packages you mention in the book. I saw this very acutely with Python for Data Analysis. When it came out in 2012, pandas was a pretty new package, and it's clearly evolved a lot in the intervening ten years. Just how you get Python up and running with all the packages installed has changed a lot. So you definitely have to go back and revisit and redo all of that whenever you do subsequent editions.

But as Hadley mentioned, I think a key thing is not losing sight of the things that people struggle with when they're learning how to use these tools, because as a tool builder, it is easy to take things for granted, to not realize that something might be easy for you but may not be easy or intuitive for everyone. Even when you're writing, one thing I found is the importance of the language you use and how you describe things: to not make assumptions, to not suggest that things are easy or simple. I remember when I went through Python for Data Analysis for the second edition, I removed every "just" and every "this is easy," "this is simple," because, well, it may not be obvious.

Open source in the classroom and GitHub workflows

Yeah, that's really cool. And it's neat that all these books, Python for Data Analysis and R for Data Science, are both open, so you can just go to the repository. I think, now versus when we were starting out, there wasn't GitHub; there was open source, but the open source development process wasn't nearly so accessible. So I'd be curious: I imagine students want to learn how to not just use open source, but be involved in open source and engage with the tools and packages they use.

Yeah. It's something I demo in class as well, not necessarily the whole let's-clone-this-repo-and-rebuild-the-thing, but: you caught a typo, you can literally go onto GitHub and edit it, and that makes a pull request. If I remember correctly, I did this in a talk and Hadley merged the request during the talk. It was a talk generally about contributing to open source software, and that was a prearranged thing, but just to be able to see that it's there. And we always try to make a point of saying thanks when we merge these contributions, and even if we don't merge, we try to say thanks for the feedback. I think that's helpful for any learner, student or not, to feel like, oh, I can be part of this and I can help make it better.

Do I remember that you had a more advanced course where students had to submit all their homework through GitHub?

They do. My intro data science course is like that. Every single assignment is a GitHub repository, and each gets submitted as such. From semester to semester, I vary what we do with the artifacts: there were semesters, particularly during COVID, where I wrote more automated tests to check things; now it's more human feedback, and I'm working on a project where AI could give feedback. But ultimately every single thing is a GitHub repository, and the tooling is there to make that happen. Now, they're not doing all the GitHub things; there are no pull requests or branches, just one repo you push to. But the students even mention that it's quite nice, as a first-year student maybe applying for internships, to have Git and GitHub on your resume. And that's very fair; it's not an exaggeration. They really do have the basic skills.

It is nice. You're kind of playing the whole game as part of the course. Rather than talking about how they could do it, you're just getting them in.

And let's be 100% honest here. There's self-preservation as well. The alternative is a course management system that I have to deal with. So I'm doing it for their learning, but also to protect my sanity a little bit.

Using LLMs for student feedback

You mentioned building an LLM assistant to give feedback, which I think is super intriguing. Could you say a little more about what that looked like?

Yeah, it's one of those projects I've been working on this year that's fueled by frustration, but I'm trying to do something productive with it. I teach introductory data science to students who have not coded, or at least not coded in R. Nowadays it's hard to say any university student has never coded before, because they may have had some exposure to programming in high school, for example, but rarely to working with data and code at the same time prior to this course. A lot of the content is what's in R for Data Science, plus some modeling as well.

The tasks we ask students to do are simple, particularly in the first few weeks, but as LLMs get better, even the ones further into the semester are things that many LLMs can generate somewhat reasonable answers for. Over the last couple of years, the quality of those answers has been increasing. Although when you look at a submitted homework assignment, it really seems like someone with multiple personalities has written it: from question one to question two, very different styles, neither of them wrong necessarily, just not ideal, or not conforming to the principles we teach in class.

Could you give a quick example of something they might answer? Yeah. For example: take this data set and pivot it longer so you can then make a visualization. And when you pivot it longer, maybe you need to separate a column into two to get the year and the month out, something like that. Now, asking this question at a very high level, without giving interim steps, would make it harder to get the exact answer we're looking for. But because these are students who are just learning, it doesn't seem 100% fair not to scaffold it a little for them. And as soon as you scaffold it, that's basically better prompting for an LLM. So I've done that prompting work for it.
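
The kind of task described can be sketched quickly. Here is a hedged illustration in pandas (the course itself uses the tidyverse; the data set and column names below are made up for illustration):

```python
import pandas as pd

# Hypothetical wide data: one row per city, one column per "YYYY-MM" period.
wide = pd.DataFrame({
    "city": ["Durham", "Raleigh"],
    "2023-01": [10, 12],
    "2023-02": [11, 14],
})

# "Pivot longer": gather the period columns into name/value pairs
# (tidyr calls this pivot_longer(); pandas calls it melt()).
long = wide.melt(id_vars="city", var_name="period", value_name="count")

# "Separate" the period column into year and month (tidyr's separate()).
long[["year", "month"]] = long["period"].str.split("-", expand=True)

# The result is tidy: one row per city-period, ready to visualize.
```

Each scaffolded step (pivot, then separate) maps onto one line here, which is why, as she notes, the scaffolding itself reads like a well-structured prompt.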

So the little pieces of code work pretty nicely. And that's been frustrating, because what happens is we then have human teaching assistants giving feedback on machine-generated responses that, on average, no one really reads, because the students never wrote the code in the first place. This is all on average, not every student. But I really don't think saying "don't use AI tools" is the right approach either. For one, I don't think that's preparing them for the right thing, and it's unrealistic. So that doesn't seem productive.

Anyway, the idea was: could we use LLMs to give immediate feedback? I already write very detailed rubrics so that the 20 TAs I have in this 300-person course can grade consistently. So we're building it such that the LLM gives you feedback, with the idea that if students know they can get immediate feedback on low-stakes assessments through an LLM, they will maybe attempt it themselves first. Because who wants these things just feeding output to each other? You may want to, for fun, but I'm hoping we can remove a little bit of the grade anxiety and give students a safe space to practice by themselves, with the LLM giving them some feedback. So not grading, but giving feedback.
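
A minimal sketch of that feedback-not-grading design, assuming a rubric-driven prompt. The function name and prompt wording here are hypothetical illustrations, not the actual tool (which is built as an R package):

```python
# Hypothetical sketch: combine an instructor-written rubric with a student's
# submission into one prompt for a feedback-only (never grading) model call.
def build_feedback_prompt(rubric: str, student_code: str) -> str:
    """Return a prompt asking an LLM for formative feedback on student code.

    The model is asked for feedback only, never a grade, matching the
    low-stakes, formative design described in the conversation.
    """
    return (
        "You are a teaching assistant for an intro data science course.\n"
        "Using the rubric below, give constructive feedback on the "
        "student's code. Do not assign a grade or a score.\n\n"
        f"## Rubric\n{rubric}\n\n"
        f"## Student code\n{student_code}\n"
    )
```

The detailed rubrics she already writes for TA consistency slot directly into a prompt like this, which is presumably why the same material works for both human and LLM feedback.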

And then break things down so that some things happen collaboratively in class. Even if students do use an LLM, if ChatGPT is giving you something and you're looking over it with a friend and analyzing it, that's still good learning, in my opinion. It's the copy-paste, just because it happened to run, that I'm trying to reduce.

And then spending human TA time grading that, that's just a—

Yeah. And the idea is that the freed-up time can go into more high-touch things like office hours and problem-solving sessions, things students always say they enjoy a lot. The TAs and I get really tired of reading LLM-generated code, I think because their job is defined as: to do a good job, you need to write good feedback. And it's really hard to motivate yourself to write good feedback if you don't think a human wrote that code in the first place.

Like no one's going to learn from it.

Yeah, you're like, you could be like fighting the battle of who could care less, I guess.

Yeah, and I'm choosing to continue to care. So this is the thing that I want. I've been working on testing it and building it and turning it into an R package, so it's not about students copy-pasting; they can highlight the code in the IDE, push a button, and get the response back immediately. I will class-test it in the fall.

Seems great. Wait, just to double-check that I've got it: are you saying it used to be that they were getting grades back for the assignments, but now it's shifted to be more about feedback? It's a shift, I guess, from what people sometimes call summative to formative. Now it's a formative assessment, where they're being assessed just for feedback, so they can learn really quickly.

Yeah. One thing I haven't yet figured out, but that seems figure-out-able, is some artifact that can be submitted to say, I have done this, plus some reflection. Because, as I said, I have some hypotheses about how this may be motivating, but I don't know if they're true. So I'll be collecting some data from the students, and also giving them some bounties: if you catch feedback that's wrong, submit it. That can be fun. People love catching other people's, or machines', mistakes, you know?

Yeah. And students will do a lot for extra credit, if you make the extra credit "find a case where it was wrong." And I think: find a case where it's wrong, and then articulate it. How would you make it better? That's a paragraph you might write, but it's intellectually quite high-value work, in my opinion.

Yeah. And are you developing the tool with ellmer, our library? Yes. That's the package doing the sending of the code to the model. And the model we're using is one that's built with the open source models out there, plus content from R for Data Science and my course materials, because we don't want students running out of tokens. Although the idea of a cost is also intriguing to play with. Should you really get limitless feedback, or should you sometimes have to think, am I ready for feedback? I hadn't thought about that before, and to me that's a more cognitive thing that's interesting to think about, though not necessarily my area of expertise. But this way the model can be hosted at the university, so we don't get any differentiation between students based on what they can afford to pay for.

Muscle memory, LLMs, and the future of learning to code

Yeah. One thing I think about a lot lately, and I don't know how much it's been on your mind: I have perhaps old-fashioned ideas about developing muscle memory around the really basic, mundane details of data manipulation, using, for example, dplyr's API or pandas's API. With LLMs in the mix now, I feel like a lot of students and new programmers just won't get as much of that basic hands-on practice doing really simple, mundane tasks. So they won't build up that base of muscle memory where you see a data set, you need to do X, Y, and Z, and, in the old days before LLMs, you would have had to learn those basic commands. You would see the structure and say, okay, this is this sequence of transformations with dplyr, this sequence of pandas operations. Again, maybe it's an old-fashioned feeling that something will somehow be lost, or that this will result in people being less effective. But if they always have an LLM available to write that code for them, maybe it doesn't matter. I don't know if you have some way to predict, longer term, what effect that will have on the practice of data science.

I do think some of it is about reframing it to be more like learning to play an instrument. Lots of people learn the guitar or piano, and 100%, no matter what song you want, you can download a professional recording of it that's going to be way better than you. But that doesn't change the fact that there's still some intrinsic joy in learning an instrument. And I think, for most of us, there's that intrinsic joy in programming too, even if you could automate it. The other thing I think about is Japanese hand-tool woodworking, where people do everything with hand tools: I'm not going to use nails, I'm not going to use screws. Of course it's way less efficient, but there's still that joy of doing it by hand.

I mean, intricate joints, where rather than just nailing things together or using a power saw to cut things, you're making beautifully intricate dovetails that just fit together. But it feels like some of that is happening to coding now. We've got these power tools where you can just saw through whatever.

Yeah, but with AI-powered autocomplete, sometimes the LLM will predict what it sees you about to do and essentially suggest the code you were going to write. So you press tab and move on. And sometimes it's not what you want at all, and you say, sorry, ChatGPT, or sorry, GPT-4o, you're drunk.

But with autocomplete, for example, saying "that's not what I want" requires background knowledge and experience that, at least for the students I work with, they don't necessarily have. And I see this myself as well. I'd say I'm a pretty experienced programmer in R and much less so in Python. When I use the autocomplete stuff while writing Python code, it just feels like some annoying person is interjecting into my thoughts too quickly, and I don't have the experience to just say, that's not what I want. So I run the code, and running the code and getting an error is more frustrating, because now I feel like I have to debug someone else's idea, if that makes sense. And that's a harder skill.

It's also like having a conversation where the other person is constantly finishing your sentences, which is just so irritating.

It reminds me: if you're driving a car, you can have a conversation really easily; it's not very cognitively demanding. But if you're driving in a storm, or in traffic, you might have trouble talking. And a person in the car with you might actually regulate how they talk to you, because they know that. But with LLM autocomplete, like you mentioned with Python, you're just getting hit with suggestions. Maybe it's not as sensitive to where you are, and your need to work through things.

Yeah. So I turn that off when I'm doing something I'm not as comfortable with, versus when I can immediately see if it's the right thing; sure, then it saves me some time typing.

I think another thing, though, is that if a lot of this stuff is automated and you're not in there looking at the data and the interim output, I'm not sure how one develops a sense of the data set. So that at a later point in your analysis, say you fit a model and you're looking at the coefficients, you might say, that doesn't seem right to me. I still do lots of statistical data analysis where I actually look at the coefficients, not did-we-predict-right-or-wrong, but the actual numbers. And sometimes a magnitude is so much larger than it should be, and I don't know that I would have the sense to evaluate that if I hadn't spent hours and hours looking at interim output.

And maybe I could be disciplined enough. There are these AI packages even, and I guess with LLMs you can do it more: here's a data set, give me some preliminary summaries. I don't tend to use them, because I want to go through that at my own speed to get to know the data, you know? But yeah, it might be old school.

Pair programming and live coding with students

It does remind me of your, I don't know if this is a deep cut, but I noticed you did a few episodes on YouTube of pair programming. Oh, yeah, with people. Yeah, like students. Yeah, yes. And what you said reminded me of that, like how much of data analysis is actually nonverbal, like where people put their eyes, what they look at, and even how they shift things around.

And I'm curious what you took from those. Like, do you have any thoughts on what people get out of that type of activity, versus like LLM feedback? Or maybe just how that went?

Yeah, that activity was fun. I wish we could have kept it up. But video making is always such a high activation energy thing that we didn't. But the idea there was that we would take a data set, like one of the tidy Tuesday data sets, maybe or maybe something else that the student was interested in. And we would have a conversation about like how to mostly visualize it or like ask some other questions.

And in some of the episodes, I am the person coding and the student is saying, I wonder if... and then we talk through how we might do it. Or in other cases, they're the one coding and then I'm the one sort of trying to generate ideas. What was nice about that, I think, and also the feedback that I got from students after chatting with them after the recording, was that it takes them a bit of time to think about something and to be able to implement it, so they sometimes give up along the way, like when they're driving. Yeah, yeah. But to see over and over that I also didn't get there immediately was very helpful. Which is something, I do live coding in the classroom as well. It's just I teach large classes, so it's hard to get that kind of interaction in a large class. But when you're one on one or with a small group, they can give you ideas, and they see you having to look at documentation or having to, you know, run into errors and correct them.

I imagine, for example, for folks who are better at using these AI tools for coding, to be able to do that sort of thing would also be very instructive: how does an expert programmer use these to their advantage, as opposed to as an annoying voice that gets in the way?

Yeah, I mean, it probably won't surprise anybody to know that there's a whole, you know, kind of ecosystem developing of startups and companies building basically vibe data science, some version of vibe coding for data science. So much so that some of the tools are generating code, but there's no point even showing you the code by default, because most people using the tool aren't going to read it anyway. And so you'll just look at the output and say, well, this seems wrong, or, you know, could you double check that? That doesn't seem right. So you could give feedback to the agent that's doing the coding for you, but a lot of the users aren't actually going to read the code. And to me, it feels a bit weird, rather dangerous even, that you may get to the point of making a real-life decision that impacts people, and you've got this whole vibe-coded analysis where you haven't read the code. So I predict that we'll probably see some success stories, but also some horror stories, of how somebody pushed a button or made a business decision, and it was based on vibe-coded slop.

R and Python conferences

I was thinking maybe we could shift. I'm really interested, because I know you do both R and Python conferences for some of your workshops. Is that right? Yeah. Since we have both R and Python representation here. Yeah, I'm super curious how you've experienced R and Python conferences. What's that been like, to go kind of across the languages?

I think I've been to more R conferences than Python conferences. But last year, I went to SciPy, which I really, really enjoyed. I was at the talks, even when it wasn't my sort of focus area, and I really enjoyed it. It reminded me of useR from back in the day, when it was about the same size, maybe, I don't know, 2018, 2019, something like that. I really enjoyed that. I enjoyed the workshops, I think they were really thoughtfully designed. And I felt like I learned a lot.

I've been to a couple of other Python conferences as well, where I wasn't exactly sure where I fit. I could see that what I was teaching in a tutorial could be useful for folks. So for example, one of them was like, take a Jupyter Notebook and turn it into a website. Well, you can use Quarto to do that. That's, I think, a useful tool for a lot of folks coming from many different avenues, or turn it into a book or something like that.

But in terms of the focus of applications, in some of the other conferences I've been to, I found less applicability. That being said, I think the right thing to say is that I'm still learning Python, but very much in a data science context. And when you go to a Python conference, that's not the only context, even if it has the word data in it. Versus I do think that most R conferences, at least the ones I have been to, have always had the data or stats side ingrained in them. So it's a little easier for me to see how I fit into that ecosystem and what I can get out of it.

Versus the other one is nice for exposure, but a little harder for me to walk away with: these are things I can do. And I often try to measure the value of a conference, at least personally to me, with: did I walk away with some new things that I can actually use, like, tomorrow?

So I will, I think, never forget the useR, I think it was in Nashville, where I learned what R Markdown is, and I literally stopped listening to the talk so I could start converting my course materials. It was that useful to me.

Wait, I think I saw a little thread of that. That's back when it was called knitr. Is that... yeah, that's the, yes. Yeah, it was like Rnw files, 2012 or something. 2012 sounds right. Yeah, yes, wait, that's neat.

It might be useful to for us to unpack because I think there are so few people in the data world who will both have spanned useR and SciPy like that slice of the Venn diagram is like mostly just all border.

Yeah, yeah, I've been to SciPy too. They feel like the people are very similar. Yeah, like it's pretty academic. It's, you know, people struggling to understand the data sets and talking about it.

Pretty academic, but also, I would say, with a strong commitment to maintainable open source software that can serve others as well. I feel like there's a subset of academic coders, which I totally was when I was writing my own packages, where I'm like, I need this to work for me so I can get this paper out, and the rest, I don't know, once this is published, you know. But I think in these conferences, there's a lot of academic motivation for the start of the projects, and a commitment to building tools that others can use as well. And that's the piece that I think is enjoyable to hear about.

Yeah, I think SciPy is the Python-related conference that I've been to the most over the years. And there was a time where it was the only conference that existed where people came to talk about scientific computing, and there was some emergent work in doing statistics and statistical data analysis.

So the first two Python talks I gave were in 2010: one at PyCon in Atlanta, and then at SciPy in Austin. And that's where initially I networked with and met Fernando Pérez from IPython and Jupyter, Brian Granger, Travis Oliphant from NumPy, Peter Wang, and you know, Travis and Peter went on to found Anaconda. And so this was kind of the original community that spawned the now much larger PyData ecosystem. But I think it's interesting because Python has always had this pretty passionate scientific computing community, like high performance computing, people doing high energy physics. They were refugees from MATLAB and Fortran, and they were just happy that they could wrap all their Fortran libraries in Python and script them that way. And so I'm happy that that community still exists, and there are still people in academia and in research labs doing hard science and scientific computing and talking about scientific data formats, so it hasn't totally been taken over by data science and machine learning and AI.

Yeah. Yeah. It's neat to hear that that too was one of the first places you debuted. It was a pandas talk, is that right? Yeah. Yeah. And there was a paper. SciPy, I think, still has paper submissions. And so the first academic-style paper about pandas came out as part of the SciPy proceedings in 2010. So when people cite pandas, they usually cite that paper. It all had to be written in LaTeX, of course.

Yeah, that's wild.

Yeah, the other thing that's sort of mind blowing to me looking back now is, like, you know, a lot of the early days of R and Python were about rejecting this idea that you should have to pay for a programming environment. Like, that used to be the norm. You would go and buy MATLAB, you'd buy a Fortran compiler. And now the idea that you would buy a programming environment just seems bizarre. Yeah. But that was where we started from.

Python in universities and teaching across languages

I'm also curious how interest in Python has shifted in universities.

Yes, Python is usually the 101 course, because just about everyone at every university seems to take a course like that. So for our stats students, we recommend that, and quite a few of them do stats and computer science together. Some out of genuine intellectual interest, and some, I think, hedging their bets in terms of: where will the hiring market be by the time I graduate? Will it be more data science and machine learning modeling, or might I get a software engineering job? So they tend to sort of do both.

And so they do get exposure to that. At the graduate level, a lot of our PhD students, for example, still ultimately write code in R and use R packages. But I feel like I see more of them, particularly coming to a graduate program after like a couple years in industry, coming back with R and Python skills.

And the thing I really appreciate in some of these students is how they're good at going between languages, like, very versatile. I feel like knowing the ecosystem, and knowing that there's this package or library that I can leverage and I can figure that out, sort of enlarges your ability to do things.

It's like, I think about it as people who rap in multiple languages. All of a sudden you have more words you can rhyme, you know, like you've expanded your vocabulary. And it both sounds cool and also gives you options.

What students want and what they need

It's a nice metaphor. I was thinking a little more broadly, like, you know, I'm sure you get a lot of feedback from students about what they want, what they think you should be teaching them. What are the things where they think it's a good idea and you know it's not, and you tell them that? But also, what are the things where you're like, oh yeah, we should change our curriculum to teach that?

Well, so one example for the latter, I can say, is that, you know, ever since the R Markdown ecosystem and now Quarto has been around, I've made my course websites with one of these tools. I've made slides with one of these tools. But it used to not be a learning goal for the intro data science course. My students did write Quarto documents for their homework assignments, so they know how to write computational documents.

But turning it into a website, for example, was not one of the learning goals, because they already have to learn so much. It seemed like putting a lot on them. But then I would hear things from students like, oh, you can only do data analysis with R, but with Python you can do everything.

And I feel like Shiny changed things a little bit. Oh, now I can make web applications with R. Like that's a thing that's useful outside of doing my homework. And so now, for example, in the intro data science course, the project that they do at the end actually is a Quarto website.

Oh, cool. And there's very little additional overhead to make this happen if you set it up for them and they're just putting their content in there. But I feel like just that knowledge that, with this language that Wikipedia calls a statistical programming language, you can build more things, is actually really useful. Maybe I don't then teach them every single other thing you can do with R, but it motivates some curiosity for them, to be like, I wonder if we can do this with R, as opposed to thinking there's no way I can do this with R.
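For readers wondering what that "very little additional overhead" looks like: a Quarto website is essentially a folder containing a `_quarto.yml` configuration file plus the `.qmd` documents students already know how to write. A minimal sketch (the file names and titles here are just illustrative):

```yaml
# _quarto.yml: minimal configuration for a Quarto website project
project:
  type: website

website:
  title: "Intro Data Science Project"
  navbar:
    left:
      - href: index.qmd      # landing page
        text: Home
      - href: analysis.qmd   # a student's analysis document
        text: Analysis
```

Running `quarto render` in that folder builds the site, and `quarto preview` serves it locally while editing, so students mostly just drop their content into pre-made pages.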

Yeah, I could see how if you put up one demo of doing the kind of thing they thought was impossible, it really opens the door. And I think you're writing a book on Quarto. So at some point people will hopefully have a whole game of Quarto to go through.

Yeah, we do have that. I am writing a book on Quarto.

Why data science matters

I feel like we've touched so much on stuff that students will get a lot out of. I don't know, maybe we can just go around really quick with the time we have left and say why data science matters to you. Because I do think this is so useful for students, I'd love to hear: why data science? Since we've talked so much about learning and conferences.

Hadley, do you want to start?

No, I just think it's like so empowering. Like there's so much data around us, like data that you're generating, data that things that you care about are generating. Like learning a little bit of data science just gives you this amazing power to kind of like dig in and learn stuff. And I think it's like so rewarding and so fun when you find like that data you really care about. And now you're empowered to learn more about yourself. And it's like super, super cool.

Yeah, it's really neat. Thanks. Wes?

I mean, I think it's a big question, but I think that data literacy and statistical literacy in general is probably, you know, something that's missing from a lot of basic education, especially in the United States. Like we learned to do trigonometry and geometry and algebra and I don't know what else is in the general high school curriculum, but I did not learn any statistics or data literacy in elementary school, middle school, high school.

And so when you go out into the world, I think people are missing this foundation of how to make judgments, how to interpret information they're receiving from the standpoint of data. Like, how do you understand taking risks? How do you understand your finances? And whenever somebody presents you a fact, you should be asking: is that fact supported by data? If so, what's the data? Can I have a look at it? If you could get access to the data set, maybe you could explore it yourself and see if the analysis is cooked, or has been spun in a way to support a narrative. And so I feel like equipping people with tools to make data analysis more accessible feels like just a really valuable thing to do in the world, so that more people can be data literate and equipped to ask their own questions and have the tools to answer them.

Yeah, thanks. And Mine?

And learn this about myself. I think those are just cool little things you can do to learn either about you or the world around you. That's helpful. And that's sort of the working-with-data side. There's also the more statistics aspect of things, making sense of uncertainty, which I think is even harder, and which is a thing I try to keep in my teaching as much as possible. Both because I teach in a stats department, but I also think it takes some time to be comfortable with making decisions around risk and whatnot when there's uncertainty around the estimates that you're looking at.

And I think that's something we need to get people to understand as well, because at very critical times of your life, that might be the sort of statistics someone tells you and you need to make a medical decision or something. So having some experience with that sort of thinking, I think is very helpful.

Yeah, thanks. Yeah, it's so helpful to hear the need for kind of code-first data science and the power of data literacy. And yeah, how you might need to make personal decisions with data, or big decisions with data, and having a sense for things like uncertainty is really important. Yeah, yeah. Thanks. Thanks so much, everyone. I think this has been so helpful to hear.

Do you want to answer the question too?

What's the question?

Why is data science important?

It's such a good. Wow, what a great question.

Yeah, I think data science is important. I mean, going back to what Mine said, I do think everybody interacts with data, whether it's a little bit or you're being hit with data. And I think all of our interactions with data matter. Like, we use data for so many things.

And so I think that just I think data literacy really equips people, you know, whether it's statistics or coding to just handle all the situations that might be thrown at us, whether you're like tracking coupons in an Excel spreadsheet, which is like my dream scenario, or you're like, yeah, having to make an important decision that involves a statistic, which will have uncertainty. I just think that it can really improve so many people's lives in so many ways. And some of those are really critical decisions that impact a lot of us.

So, yeah, I really appreciate everybody taking the time. And I mean, I feel like if I were a student, this is exactly what I would have wanted to hear, from going from actuarial science to data science, to the role of LLMs in education. Yeah, I mean, I really appreciate you coming on and just hitting us with so many things to think about.

Thank you. Thanks for having me.

Thanks, Mine.

The Test Set is a production of Posit PBC, an open source and enterprise tooling data science software company. This episode was produced in collaboration with branding and design agency, Agi. For more episodes, visit thetestset.co or find us on your favorite podcast platform.