The dessert-first approach to teaching data science | Mine Cetinkaya-Rundel | Data Science Hangout

Transcript

This transcript was generated automatically and may contain errors.

Hey there, welcome to the Posit Data Science Hangout. I'm Libby Heeren, and this is a recording of our weekly community call that happens every Thursday at 12 p.m. U.S. Eastern Time. If you are not joining us live, you miss out on the amazing chat that's going on. So find the link in the description where you can add our call to your calendar and come hang out with the most supportive, friendly, and funny data community you'll ever experience.

I would love to introduce you to our featured leader today, Mine Cetinkaya-Rundel. She is a professor of the practice at Duke University and a senior developer advocate at Posit PBC. Mine, I would love it if you could introduce yourself. Tell us a little bit about what you do, all the multifaceted things you do, and something you like to do for fun.

Yeah, thank you so much for having me. This is such a lovely crowd. I'm seeing some familiar faces and some new faces, so that's all great. Yeah, so as Libby said, I'm a professor at Duke University. I'm in the statistical science department. I've been here for 15 years, and I primarily teach introductory data science and data visualization with sort of a focus on statistical thinking as well as sort of programming. So I fell in love with sort of coding and programming much later in life myself and found it a struggle at the beginning to love it, to be perfectly honest. So I feel like I don't know if it's the scars from that first year of graduate school or just the mere fact that I've always enjoyed working with new learners. That has been my niche area, and I've also been working at RStudio slash Posit for many, many years now.

As part of that work, I have worked with the Shiny team, I think, when I first started, then primarily with the Tidyverse team. I've worked with Quarto and Positron as well, and oftentimes I find myself sort of in the space of there's something new and exciting happening that probably needs some learning materials, probably needs some documentation, and probably would benefit a lot from people sort of like working with it, maybe teaching their students with it and bring back some feedback from it. So I found this space where I have sort of a foot in both doors and get to navigate the exciting space of data science day-to-day from learners to builders to be an exciting place to be. And I also do lots of teaching outside of the university. I teach courses on Coursera, I like teaching workshops, so whatever conference I'm going to, I always try to see if I can manage to, you know, teach a workshop there and get to interact with folks who are learning new things, maybe not in a university course setting, but in other settings and for other reasons in their lives.

So it's possible I've crossed paths with some of you as a workshop instructor. I feel like I recognize a couple of faces in that way as well. I think, well, me for one. I took your workshop a couple of years ago. And you asked for one other thing from me. Yeah, something you like to do for fun. What do I like to do for fun? Honestly, lately I like to do Pilates for fun. Is that too lame? That has been my thing that I've decided I'm going to try to make time for, but I also really love to spend time with my kid, particularly building Legos.

I like that it's hard. I can't think of anything else during that hour. I think that's why I enjoy it. It sounds very meditative. Yeah.

Why R?

All right. Well, let's hop into questions. My first question for Mine is, why R initially and why still R? Yeah. So my very initial intro to R was, I think, boot camp for graduate school, where we were told, hey, you probably know this. This is meant to be a reminder for you. And I don't think I had written a single line of code in my life at that point. I think I had maybe written a little bit of code. I worked for two years as an actuary prior to grad school. And we had like an in-house language. I remember it being called Ginsu, like a knife, because you used it to sort of chop data.

So yeah, I learned R because I was in a statistics PhD program. That's how it started for me. So I would say that the first few years of me using R was very much lm() and then go on, glm() and then go on. You fit a model. You get the sort of the results you need. But throughout that, I've also taken some computational statistics courses and realized that I am not ashamed to say the following. And I wear this badge very proudly. I'm pretty good at Excel. I had to get very good at Excel in my actuarial science job. And I'm still pretty good at it. And there are certain things that I use it for, not data analysis, but sort of like compiling data and whatnot. And when I realized that things I liked to do in Excel are a lot easier to do if you can actually code your way around it and document as well, I feel like that's when I decided this is something that I enjoy.

During the time that I was working, I wrote a lot of documentation because I inherited a lot of projects where you had to talk to another human. You had to make sure that human was in the office in order to get the information you need. And I was always wondering, why didn't anyone write down exactly what needs to happen? Then I realized, well, actually, if you code it, you don't have to separately write it down as well. It's sort of like in the code itself, what is happening. And that's when I really realized, oh, this is great. This is something that's worth investing time in.

I think for a lot of people also, it's finding out about Quarto and parameterized reporting where they're like, oh, it can all be one step. It can all be one step where I'm doing the analysis, I'm doing the report, and I'm also having it parameterized out to multiple things, like a report for every state or a report for every school in your district or whatever. That's the click that I see for a lot of people where they're like, oh, this is worth investing the time to figure out how to not just do it in Excel, even though I'm really good at Excel and I'm a spreadsheet queen.
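
As a minimal sketch of the "report for every state" idea, the Python below builds one `quarto render` command per parameter value. The file name `report.qmd`, the parameter name `state`, and the state codes are hypothetical and chosen only for illustration; `-P name:value` and `--output` are Quarto CLI options, and the document is assumed to declare a matching `state` parameter.

```python
def build_render_commands(qmd_file, param_name, values):
    """Return one `quarto render` argument list per parameter value."""
    # Use the file name without its extension as the output prefix.
    stem = qmd_file.rsplit(".", 1)[0]
    commands = []
    for value in values:
        commands.append([
            "quarto", "render", qmd_file,
            "-P", f"{param_name}:{value}",        # set the document parameter
            "--output", f"{stem}-{value}.html",   # one output file per value
        ])
    return commands


# Hypothetical inputs: a report.qmd with a `state` parameter.
commands = build_render_commands("report.qmd", "state", ["NC", "VA", "SC"])
for cmd in commands:
    print(" ".join(cmd))
```

On a machine with Quarto installed, each list could be passed to `subprocess.run(cmd, check=True)` to render the reports; the sketch only prints the commands so it runs anywhere.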

Absolutely. And my starting using R predates R Markdown even. However, I, to this day, remember the useR conference where the keynote was about R Markdown. This was in Nashville, I want to say 2011 or 2012, something like that. And I remember tuning out about minute 10 of that keynote because I was like, I'm going to use it right now. Like, I need to start using it right now. I cannot wait for this talk to be over.

And I cannot imagine like sort of being dropped into this ecosystem with all of this tooling here. I think it would be an absolute no brainer to just like dive right in and see how many problems it solves for me. Well, what's wild is that then 10 years later in 2022, I think you were giving the keynote on Quarto, which is where a lot of us heard about it for the first time. That's a full circle moment. Lots, lots of full circle moments. And I, it was a great delight to be able to give that keynote with Julie Lowndes as well. It was really fun working on it, but it was also, I think Quarto is one of the projects where I feel like I had like not just the opportunity, but genuinely the privilege of being involved with it from like day one. So I would be like testing things out as they were being developed. And it was just so nice to be able to sort of like play with something that I knew I was going to use basically every single day of my life going forward.

Instead of starting from the beginning and hoping that people will stick with you until the end, I tried to give away the punchline first and then take it back and say, hopefully, I have made it clear to you that it's worthwhile to stick with this lesson because you know where we're going to get to at the end.

I often think about when I first learned linear algebra, for example, which was, I don't know, second or third year of college. And then when I got why anyone thought I should learn linear algebra, which was second year in graduate school, and I told you I worked for two years in between, that's a pretty big gap in between those two things. And that's not to say I didn't have a good professor, like that is not it at all. It just was not how I was taught versus, for example, our students at Duke now, our statistics majors have an option to take a linear algebra course that's designed specifically for people working with data. So they get the nuggets of "where will I apply this?" earlier on, which they report to be a lot more motivating.

So, I sort of try to think about that delay I had related to these important foundational concepts and how long it took me to get the punchline of why anyone thought I should be learning them. And I try to reduce that time as much as possible. I find that students, when they realize this is worth investing time and brain cells in, they are more motivated to stick with it or ask questions. And if I just tell them, like, if you just stick with me, at the end, I will show you that it's worthwhile, I lose quite a few of them along the way.

Finding data sets for beginners

How do you identify appropriate data sets for beginners? Do you have any particular resources or tips for doing that?

The way I think about data sets is not so much is it in its rawest form appropriate for new learners, but more is the context something that might be interesting to them. And turns out, there is one thing years of experience is bad for and that's staying connected with the youth. What I think is interesting every day is farther from what my first and second year undergraduate students think is interesting. So every semester, at the beginning of the semester, I always do a Getting to Know You survey. Sometimes I use some of that data just to genuinely get to know my students. Sometimes I teach courses up to 300 students, so I don't end up getting to individually know them. So just reading through the narratives that they write there whenever I can make time sort of helps me stay connected with them a little bit. But one of the questions we ask on the survey is like, what sorts of data are you interested in exploring?

So they will say some things like related to criminal justice or related to public health. And sometimes I try to prompt them to be as specific as possible, not like linked to a data set, but to be as specific as possible, just so I can sort of then keep that in the back of my mind. So maybe next time there's a Tidy Tuesday data set, I'm like, oh, one of the students had mentioned they'd be interested in something like this. It gives me a cue to take a note of it. I also mentioned I like listening to the radio a lot. I don't drive much actually nowadays, but when I do, I always have NPR on. And if they mention a study or something like that, I will quickly go look up if I think it might be interesting to see if the data is available with it and then just download it.

And then what I do is once I have the raw data, I think about it as, what do I need to do to this data to make it semi-prepared for the audience that I have or for the topic that I want to teach? And I am a firm believer in bringing real data sets into the classroom. I don't necessarily think that every real data set has to be brought into the classroom in the rawest form that I have found it. I think a little bit of mise en place, like a little bit of prep is okay, just so that we're not always sort of spending class time to get it to the point and we can get to the point that we want to make a lot more easily. And you know, whenever I teach, students work on projects where they work with data starting in its rawest form because they are finding the data sets themselves. Hopefully we teach them enough to, you know, the skills enough between different data sets to sort of do the tasks that they need to do to prep their data. But I oftentimes will do like halfway prep so that we can pick things up and I can just get to the point of the sort of the topic for that day.

I do try to sort of bring current data sets in whenever possible. But then there are some sort of canonical data sets that, you know, work well for things. But I will say that I love, I love the penguins data set. But last semester when I taught data science, I told myself I don't get to use it past week two because it's so neat and they're so cute. I sometimes feel like if I'm running out of time, I'm just going to plug it in there. I was like, nope, that's it. No more penguins.

Misleading statistics in the media

How frustrating is it for you when you see statistics being used in a misleading way in the media and social media? And do you address that with your students in your intro class and how you deal with that?

Yeah, very frustrating, as you can imagine. Although sometimes, I feel like I've heard this. I don't really watch a lot of TV, but I watch lots of clips of like late night comedy TV shows, you know, the talk shows. And they often joke that like if things are sort of not the brightest in politics, it creates a lot of like material for them. So the joke is that they don't want these things to be happening, but it makes their job easier. Honestly, sometimes I feel like seeing these awful, awful visualizations that are not just awful, because someone made an honest mistake, but they were clearly designed to mislead people. While it does very much break my heart, sometimes I can't help but feel like it makes my job a lot easier to say, this is not how to do things. So to be able to bring them to the classroom, and have the students reflect on it a little bit, and then sort of talk about how would we fix it.

Another thing that we do, I do this in my intro data science course, and in my visualization course as well. If I see a visualization like that, I either track down, try to track down the data set, or to be perfectly honest, the most misleading visualization examples I've seen actually are visualizing like five, you know, data points anyway, that you can sort of like glean from the picture, and make a data frame yourself by looking at it. And I ask students to plot it themselves on the correct scale, for example, just so they can see the stark difference, so that we can actually tell them, look, like it would have been hard to make an honest mistake here. Someone really was moving these points around the plot to make the story what they want it to be.
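
A rough sketch of the exercise described above, written in Python rather than the R used in the course: take a couple of values read off a chart and compare how the bars would look on a truncated axis versus a zero-based one. The two values and the axis start of 34 are invented for illustration and do not come from any real chart; `drawn_height_ratio` is a hypothetical helper name.

```python
def drawn_height_ratio(small, large, axis_start=0.0):
    """Ratio of bar heights as drawn when the value axis starts at axis_start."""
    if axis_start >= small:
        raise ValueError("axis_start must be below the smaller value")
    # A bar's drawn height is its value minus where the axis begins.
    return (large - axis_start) / (small - axis_start)


# Hypothetical values read off a chart by eye.
points = {"before": 35.0, "after": 39.5}

# Same data, two axis choices: truncated at 34 versus starting at zero.
truncated = drawn_height_ratio(points["before"], points["after"], axis_start=34.0)
honest = drawn_height_ratio(points["before"], points["after"], axis_start=0.0)

print(f"Truncated axis: the taller bar is drawn {truncated:.1f}x as tall")
print(f"Zero-based axis: the taller bar is drawn {honest:.2f}x as tall")
```

With these made-up numbers, the same 4.5-point gap is drawn as a 5.5-fold difference on the truncated axis and as roughly a 1.13-fold difference on the zero-based one, which is the kind of stark contrast the replotting exercise is meant to surface.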

I think one of the trickier parts of this is to sort of pick examples where we can keep the conversation around the misuse of statistics, maybe perhaps as opposed to around sort of my personal opinions around whether we should be discussing that topic, or whether this is even something to be presented in that way or not. That's a personal struggle. But beyond that, in terms of the misuse of statistics, unfortunately, there are good examples of this out there, and I think I certainly do bring them to the classroom. In my introductory data science course, we do a module on data science ethics. When I first started teaching this course, this was the last module in the course, and quickly I realized that may not be sending the right message. The last topic is like, sometimes you don't even end up having time for it, because it snowed that semester or whatever. So I actually moved it to the middle of the semester, when students are looking for data sets and trying to come up with a proposal for their project, so that we are talking about data science ethics at the time that they are building a data science project themselves, so they can ask themselves the question of, is this even something I should be collecting data on? Is this even a question I should be posing in that way? And the three sort of units in that module are misrepresentation (so we talk about visualizations and other representations of data, particularly in news media), algorithmic bias, and data privacy.

Communicating with non-technical stakeholders

In your classes, do you give any advice to students on how to communicate with people from different fields?

I think a little bit, and I'll try to sort of say some sort of bigger idea things in terms of how I try to communicate with folks like that, and then maybe some little tips as well. So on one hand, when I am talking with folks who are interested in the results that will ultimately be sort of squeezed out of the data, and the answers to the research questions they're interested in, and they are not that interested in how you got there, I sort of think about it, I don't know why I always come up with food analogies, but it's always like, well, when I go to a restaurant, and the dish is really good, sometimes I'm curious how it was made, and sometimes I'm not. Sometimes I'm just happy that someone figured out how to make this for me, and I can just enjoy it as is, and I might have questions about which wine it pairs with, but I don't need to try to make it myself. So oftentimes I'm thinking about it as, if the person is interested in genuinely sort of understanding how the analysis was carried out, then we can have a detailed conversation about that. But if they're just interested in the results, and if I'm talking to, for example, a practitioner, what I try to get out of them is, when I presented this table or this visualization to you, do these numbers make sense to you at all? Are these in the order of what you would expect to see or not? And if not, why? Can you articulate that to me?

I can then translate that conversation back to my analysis steps, and then come back with, well, I think here is the disjoint. Either you were right, I assumed something during my analysis that I shouldn't have, or no, you have this sort of preconceived notion about what this should have been because of what you know about the domain, but here is what I am finding, how I'm finding this data to be different. So how do we negotiate this gap between what you were expecting to see and the results that are coming out of this?

So I think I will basically say, I don't try to communicate about the code or the methodology sometimes unless that person is willing to invest the time in for that. The more sort of like tip I give my students, and mind you, a majority of them are sort of like intro students, but they probably will, you know, at a minimum, have an internship near term after that course where they might generate a data analysis report that someone's going to read. I find that they get so sort of hung up on how they got to a particular result, like the code that was needed to get there, because so much of what I teach ends up being about that, that they think of the number of hours that went into producing the results as something your audience should appreciate, and I just don't think that's true. Sadly, that's not true.

I often, not often, always tell my students, before you finalize your data science project, you have to add echo: false to your Quarto document and hide all of the code, and now read it again. Does it actually hang together? And what they submit, even though we do evaluate their code as well, what they submit is a write-up without the code. It's just the words and the results, and I want to make sure that they, at a minimum, once read it as such. It turns out, as much as I love Quarto, and I think I've said it enough that everyone believes me, when you actually have a document with a bunch of code in it, I think it's so hard to focus on the takeaway message that you just need to sort of like take that away and read that document one more time to see is what I'm saying hanging together to the person who can't see how these results were produced.

And oftentimes, I have them do peer review across teams that are working with the code hidden, like they do a code peer review that's separate from the content peer review, if you will, and the comments they get are very different. And I think it's really important if the goal is to communicate results that you're putting in just as much effort into the narrative around the results and not just about like the excruciating pain you had to get to the results.

Career advice

Do you have a piece of career advice for us that has been meaningful to you or helpful or you like to give others?

Yeah, let me see what kind of, I think so many times in my life, I have decided it's too late to do X. And then turns out it's not. I tried to learn to play the guitar when I was 16. And then I told myself, it's too late. Like my life has passed already. All my friends who know how to play the guitar already are good. Like it's too late to get into it. And I think I've regretted it to this day that I did. And so if I can have one piece of career advice, it is that it is okay to try new things. Actually, I love learning new things. I fully acknowledge I won't be as good at some of the new things I learn as I am at other things. And I have just learned to make peace with that.

And then the other thing, I'm not necessarily sure how many people this would apply to, but I will give this. So as we said at the beginning, I'm a professor. For those of you who are academics, you might know that a professor's life is like teaching, research, and there's like some service component. I'm on a bunch of committees. Sometimes people talk about this as a drag. I have managed to finagle my way into committee work that allows me to code in R and generate Quarto reports and impress people. So just find the things you love in the things that you might not enjoy as much and see if you can like bring them in there. That's how I try to find joy in some of the parts of my life that at face value don't seem like they're what I want to be spending my time on. But turns out I can always impress people with some data analysis. So I've enjoyed doing that.

Statistics in the age of AI

University of Nebraska recently announced they'll be shutting down their stats program. How do you feel about how statistics is viewed these days in the context of things like AI and ML?

I am aware of this and I am so heartbroken and sort of appalled by it, to be perfectly honest. I have colleagues and dear friends who work and teach there. For those of you who may not be familiar with it, I highly recommend reading some of the pieces by Susan Vanderplas and Heike Hofmann, who did an analysis of how these decisions were reached and how there are so many gaps in the reasoning that the university has said led to this closure. So I cannot believe it, but I will say I personally find it hard to draw a line between sort of like how stats is viewed and valued in the face of like AI and ML and like what happened there. I think bigger problems existed in terms of how that particular decision was made.

I will say that I think it's my sort of like my feeling that statisticians tend to be on average pretty humble, I think, and sometimes we're not the loudest in a room where these conversations are happening. Again, this is not a comment about what happened at UNL, but in general in terms of this space of this like AI and ML, I think there is incredible value in sort of knowing and understanding and appreciating statistical modeling when you're operating in these circles, and I don't know if we managed to shout it as loud as we could from the rooftops. I do think that there is room for sort of statisticians in these domains, and one thing I will say is that when I look at some of these, you know, big companies that have big AI and ML teams, among their leadership is often folks with stats PhDs, and they're there for a reason for sure. That's not to say you need a stat PhD to be working on these problems, of course, but I think it's still heartening to me to see that those folks are there because when it comes to really sort of building a roadmap for like where to take projects to, that statistician's insight clearly is still valuable, and I believe it will continue to be so. At the same time, we as statisticians need to be nimble ourselves and need to sort of rethink our workflows, processes, techniques, sort of in light of these developments happening. So just saying, oh, this is all hype, I don't think is all that productive either.
