Resources

Roger Peng: Sustaining data science — in classrooms, code, and conversations

Michael, Hadley, and Wes welcome Roger Peng, professor of statistics and data science at UT Austin and co-host of Not So Standard Deviations. Together they trace Roger’s journey from early R adopter to pioneering online educator and prolific podcaster. The conversation ranges from the accidental rise of “data science” as a field, to the tension between research papers and software maintenance, to what makes for meaningful, lasting creative work. This episode is about more than methods — it’s about endurance, curiosity, and the conversations that carry data science forward. What’s Inside: • Roger’s first analysis project and what it taught him about authorship and data • Roger’s advice for students testing the waters in data science • Why software has become the unifying language of modern statistics • The origins of “data science” as a field and a label • Reflections on Coursera, MOOCs, and opening education to the world • What keeps a podcast (and a career) going strong after a decade-plus

Dec 3, 2025
45 min

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

Welcome to The Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning, digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field.

On this episode, we chat with Roger Peng, author and professor of statistics and data science at the University of Texas at Austin. I'm Michael Chow. Thanks for joining us.

Welcome, everyone in the world. This is The Test Set, a podcast where I'm joined by my fabulous co-hosts, Hadley Wickham and Wes McKinney. And today we're here with Roger Peng, professor at the University of Texas at Austin, and prolific course creator, podcaster, and researcher. Roger, do you mind telling us a little bit about yourself?

Sure, yeah. So I'm a professor of statistics and data sciences at UT Austin. I've been there only about three years, actually. I used to be at Johns Hopkins University in the Department of Biostatistics. And like you said, I do a number of things. I do a little bit of research, primarily kind of developing statistical methods for environmental health problems. But then I've also developed a number of online classes, developed two podcasts, and written a couple of books. So yeah, general content creator.

Yeah, awesome. Yeah, and we're so glad you came. Like, we hit Roger up kind of last minute, and Roger was like, I'm there, you know, hopping the car, getting the band together. And I think it's super relevant because The Test Set is just, it's a little baby podcast right now. And we're feeling out sort of how baby's first steps into podcasting. We've got some really great people, and we feel like there's a lot of chance to kind of look at the people behind the data and the landscape.

But with you as a kind of like grizzled podcaster, I think you've been doing a podcast for like 10? Something like 10 years, yeah. Not so standard deviations. How many episodes? 186, I think it was.

Roger's background and early data analysis

Yeah, I think we're really excited to hear your take on like why podcasts exist and how to make them nice and what to do. But I think maybe to start out, it'd be really interesting to hear maybe some of your background in data too. Like, what was your very first data analysis?

Oh, that's a good question. I have to go back pretty far now. Every year, it just gets harder and harder. I think one of the first data analyses I did, which actually is the reason I started using R actually, was looking at, I was doing like a project in school and looking at different texts and trying to be able to develop a way to kind of identify who wrote which text, like author identification, using just discriminant analysis, very basic tools. And so there was like some text processing, a little statistical modeling, and yeah, and it was all done in R.

Oh, nice. When was this? Or what stage of life? Well, I can tell you. It was R version 0.65, something like that. Yeah, it was like 97. Two people speak in R stages.

Did you use anything before R? Well, yeah. So I was like a S plus user for sure. Yeah. So, but like S plus plus, I don't know, it was like hundreds of dollars. And I was looking around for a free alternative and in the Yahoo search directory, R came up.

Do you mean you're searching, you're like free alternative to S plus? No, well, you didn't really search in Yahoo, you know, because it was like, it was like you click on a link and then you click on another link. It was like a directory kind of thing. You click it and it was like, it was like free software, statistics, something. And then eventually there was like a couple of options. Yeah. You like went down a series of pipes. Exactly. Yeah. There was no like searching.

Advice for students entering data science

And do you have any advice you'd give students today trying to get into data science?

Yeah. So I also find that I get that question a lot. As you might imagine, I teach a lot of undergraduate students and they're interested in data science. And it's always a little weird for me because I kind of feel like for me, and I don't know, maybe for you Hadley, like data science happened to me. Like, I don't feel like I got into data science. You mean like it tackled you like on the street? Yeah, more or less.

Yeah. Like I double majored in computer science and statistics, which is a weird combination. Like alchemy. Yeah. And so I did statistics. I got my PhD in statistics, but I just happened to really enjoy programming. I took a lot of CS courses in college and I just really enjoyed programming. And I think it wasn't immediately clear that that would be useful in getting a PhD in statistics.

So I think, but yeah, so now that data science has happened, people kind of assume, especially the students that I have, they kind of assume it's been around forever. And so it's a little tricky. But I think, I mean, one of the things I generally try to tell them, especially at an early stage, is kind of like, is make sure that you can have some way to show people what you've done. Like if you've analyzed data, that you've been able to ask a question, draw some sort of even basic conclusion from the data you've looked at or from the work you've done and then kind of go from there and see if it interests you. If doing that doesn't interest you, then that's a good sign. Then you'll know early.

And one of the things that I think is a little different now is that there are like a lot of programs that you can go into and maybe get a certificate or get a degree or something like that. And I'm always a little hesitant to, even though I teach at a program, obviously, I'm always a little hesitant to say like, oh yeah, go get a master's or go, you know, because that costs a lot of money. And if you want to find out, you hate doing this, you know. So at least, but now, you know, it's pretty easy to get data. You know, there's a lot of data around. There's a lot of things that you could try. And I think if you could try it and find out that you enjoy doing it, then it's maybe worth investing a little bit more.

I mean, even the term like the field of data sciences, I think as recently as 15 years ago, that term didn't really exist. It was almost like a marketing term to describe doing statistical computing or data analysis in a business setting. So like some combination of like programming and data and data analysis combined with reporting and communication, and then combining that with, you know, combining that with domain knowledge about some business domain, you know, the famous Drew Conway Venn diagram. And then suddenly, you know, it became so popular. And also there was this big shift toward open source. And then every university needs to have a data science program because students are arriving and they're like, well, I want me to learn how to do data science because that's where the job opportunity is.

Like there's a great need for more data scientists in this industry. And so it seems like universities have had to respond by, you know, reorganizing themselves to create these cross-disciplinary data science programs to pull together statistics, computer science, and maybe, you know, political science and economics and other fields that cross over into where, because you have statisticians that live in all these different departments, but they do statistics within these domain-specific settings. But there's a lot to be gained from making all of those different research specialties accessible to undergrads and master students and PhD students who are, you know, building skills in that area.

Yeah. I think, yeah, it's been interesting to see just kind of how that has evolved at various universities because there are like, you know, as I said, there's lots of people analyzing data all across different areas. And they're in different departments usually because they're looking at different kinds of data like psychology or, you know, economics or whatever. And so it's a little bit of a challenge because then you've got to put everyone together. You have to think about like, well, what is the unifying principle? And I think a lot of times it's software, you know. So it's like, you know, there's lots of different places that you can analyze different kinds of data. But like, you know, in other areas it might be like theorems or whatever, laws, you know. But I think in data science a lot of the unifying thing is software. It's like, you know, we can use a piece of software for so many different things. You know, that's the generalizing kind of element.

Software, academia, and maintenance

So do you think there's like more respect in academia for like developing software now compared to like 10 years ago or 12 years ago when I left?

Yes. I think, yes, compared to 10 years ago, there's definitely, whether that's enough, I don't know. But there's definitely more, I would say, than, you know, compared to back when you were there. I think one of the challenges I think with academia and software is that, you know, academia to this day, I think it's set up to discover things. It's not set up to develop things or maintain them for that matter. And I think, you know, if you think about what a paper is, it's like, you know, you discover a fact, you write a paper about it. You don't have to maintain that fact, right? I mean, for the most part. And so you can just let it go and then move on to the next one. And it's not how software works, obviously. And so it is a challenge in that environment, I think.

I think one of the challenges I think with academia and software is that, you know, academia to this day, I think it's set up to discover things. It's not set up to develop things or maintain them for that matter.

It's kind of interesting to think that like maybe more papers, like if you had a more of a maintenance approach to papers, like maybe there'd be like less of a reproducibility crisis. You can like update. Yeah. Or if you were like allowed to go back to it. And, you know, and check. Like, oh, actually, we're kind of like not so sure about this.

The secret of a great podcast

Yeah. I'm super curious where we're kind of kicking off a podcast now about data and data science. And you've had, you have like 10 years of doing kind of podcasts. And then earlier than that, like Coursera courses. Yeah. I'd be really curious to hear from you, like what do you think is like the secret of a great podcast?

If I knew that, I'd have one.

I don't know for sure. I don't know. And I think that question can be asked, you know, from two perspectives. One's from the audience and one is from the actual like podcaster, right? Yeah. Like who's it for, I guess. Yeah. And also like, you know, what are your goals at, you know, developing the podcast? Do you want it to be something, you know, from my perspective, I want it to be something that, you know, I have a job, I have all these things that I'm working on, had to be kind of easy to do, fun to do, but still kind of have some sort of useful product at the end of the day. And so, you know, it's very minimal setup, you know, just recording over the internet. And so, and, you know, not too much prep. You know, the other end of the spectrum, you got like these narrative podcasts where everything's scripted and everything, you know, and, you know, that's a super expensive way to do things, but you get a really amazing product at the end of the day. And so, then everything else is kind of in between, I think.

Personally, I like listening to conversations between people that are like kind of deep and kind of involved and where there's like a relationship there and kind of people and they can just time to move into like more than just scratch the surface. But that's just my personal thing. And that's kind of how I modeled the two podcasts that I, you know, that I built.

Yeah. That's really interesting. I mean, for some context. So, yeah, I think we mentioned like not so standard deviations. So, you can go in like 10 years and you have over 180 episodes. How did that kick off? Like what led to not so standard deviations?

Yeah. So, I mean, I think I had listened to, I've been listening to podcasts, I mean, probably since the early 2000s, you know, so like when they first came, it came out and I've always enjoyed most of those podcasts were like tech podcasts for the most part, you know. And I think I always felt like there, at the time, which is like 2015, I didn't feel like there was, there were, there were too many podcasts where you just, a couple of data scientists just talking about stuff they're doing. Yeah. Just very informal, very just kind of casual. And there were a couple of things where like, you know, there are a lot of podcasts where like, you know, this type tonight we're going to talk, or today we're going to talk about linear regression and tomorrow we're going to talk about random fours or different techniques and different software packages. They were more like, for lack of a better word, instructional. And I think I wanted to just have like, imagine just like a hallway conversation between two data scientists someplace, you know, and that's kind of like how I wanted to feel, whether I achieved it, I don't know.

Yeah. And I think it really shines through like the responses to things and like what people are up to and getting takes. It's like your cohost, Hillary Parker, like, I do feel like my take, and this might be wrong, it's like 70% Hillary's going, you know, what's latest views, hot takes. And then you're like, yeah, yeah. You're like, maybe, or like, yeah, complicated.

One of the things that's actually been quite interesting, I think, is that she's changed jobs a couple of times, you know, which I think is the norm in the data science industry. And so we've been able to like kind of take a little tour of like what the different kinds of things that can happen in different companies and, you know, in the data science world.

Oh, nice. Yeah. And did you, did you approach her? Did she approach you about the podcast? How did you? I approached her. Okay. Yeah. So our connection, she graduated from the PhD program at Johns Hopkins. Yeah. And then we kind of lost touch for a number of years. And then I think we reconnected at, at like one of these ROpenSci kind of on conferences. Yeah. I think it was in the GitHub headquarters. Yeah. And so I just kind of, we were chatting at that time and I kind of just proposed the idea.

Do you, are you like keeping a log of topics? Like how are you hitting stuff? Or are you texting latest news? There's a little bit of that, a little bit of the latest news. There's a little bit of like, let's, let's maybe let's take a topic and just kind of go deep that we choose on our own. We've had a couple episodes where we'll like read certain books or something like that. And so, but it's a, yeah, I mean, I try to keep the, for better, for worse, I keep the prep to a minimum just because I've got, I don't want it to take over my life. But, but usually, especially nowadays, there's always stuff happening in the news and you know, whatever.

I mean, there is an interesting tension between like, you know, kind of burning brightly for a short amount of time versus like that sustained, like if you want to do something for like decades, then you can't like kill yourself putting together every episode. Like it's got to be sustainable. Yeah. Yeah. So, and I think that's true for so many things. Like you look at blogging, you look at YouTube, you know, YouTubing or whatever, anything like if someone's been doing it for 10 years, they have, it's like, it's gotta be like a sustainable kind of like manageable approach.

That's a good point. I mean, it's interesting parallel too, between the like papers versus software, this like big that you kind of like go and publish and then like drop the mic versus software where you're like chipping away for decades. Only if people are using it. Yeah. Yeah. You're chipping away while you like quietly disappear and kind of like close the door.

How do you, I mean, how do you think your listener community has, has shaped the podcast? Like, you know, positive feedback, negative feedback. And like, I find that you know, from, from having, you know, worked in data science tools for the better part of two decades, like the like fads come and go, there's like always like the things that are in, in, you know, really hyped and everyone talks about and wants to hear your, hear your opinions on. So there must be a little bit of like, you know, talking about the things you want to talk about, the stuff that's interesting, but then also maybe hitting on some of the flavors of the week or flavors of the month that are like on everybody's mind.

Yeah. Yeah. I think you have to, there's gotta be some blend of that. I, you know, we don't have really any data on our audience. We don't really collect any data, which I think is good for the most part. And I think, but, but my suspicion is just from the feedback that we get is that it has kind of like turned over a little bit, you know, over the last 10 years. And so it's not like there are maybe some people who are still listening from 10 years ago, but probably not that many. And I think, cause I think like we try to, it's pretty me a little bit more attractive to somebody who's like starting out just to kind of hear this kind of stuff. But yeah, so we do have, there is an element of like keeping, keeping up with the latest things, even if I don't use them and just kind of understanding how they work and what they do.

Yeah. Do you feel like it's more, yeah. Is the podcast more for you or more for the audience? That's a good question. Yeah. A little of both. I mean, I mean, I would, I mean, I like, I think doing it for us has been really helpful just to think through ideas and yeah, actually some of the stuff that we've talked about have, you know, over time turned into papers, for example. So yeah. Yeah. It's almost like, I mean, it almost strikes me as like pair programming. Like you spent 180 hours, like, or episodes, which are like almost an hour, just like working through something with a person.

How the community knows each other

Did you say you you're how you got started with the podcast with Hillary is you kind of ran into her at a conference. Yeah. Yeah. Yeah. Nice. I figure I've realized like it might be good to loop back to how y'all know each other and like the places y'all know each other from. Is it conferences? Is it like cryptic R mailing lists?

Yeah. I think it might, might be something along those lines. Yeah. Initially. Yeah. I think there, I think I might've responded to what are your early R like mailing lists posts. And then, and then, yeah. And then also in academia, there's kind of like a network of, you know, it's a pretty small world of statistics. You kind of see, you hear about people and that.

Yeah. Nice. I'd actually never met Roger before today. I mean, I was, I was in statistics. Like I was a, I was a PhD student at Duke before I dropped out to, you know, work on work more on pandas and do entrepreneurial things. But I had met, I had met Hadley in early pandas days. He invited me to give a, give a talk at, at rise statistics, which I think was maybe 13 or 14 years ago. Something, something like that. That's wild. It's way back. Yeah. I mean, I've known about Wes forever, you know, but I just, yeah, it's nice to be able to finally meet him.

Yeah. Yeah. Where are like the places that a lot of the kind of like development and the conversations happen? Like how much of it's happening at like conferences or other places?

I mean, there's a surprising amount of things that come out of conferences and like the hallway, you know, the hallway track. And, um, you know, you, maybe there's something you're thinking about and just when you're together with like birds of a feather, they have these, you know, birds of a feather want to talk about, I don't know, data visualization or data frames or, you know, data access and different things. And, you know, you'd be surprised how many times that just like a casual conversation turns into like, Hey, we should, we should do something together. We should, we should start something.

And so like, I know like, you know, Hadley, you know, Hadley and I created a file format called feather based on just like Hadley was visiting. And I said, Hey, I'm thinking about this new thing called arrow. It's like, why not? How hard could it be?

Yeah. But that was also like, I think, especially during, during COVID, like we really, I think we all realized like how difficult it was to be deprived of having those serendipitous in-person reactions with people at conferences and meetups and things, because all of a sudden, you know, you have to go out of your way to like, you know, that, that, you know, bumping into somebody at a, you know, statistics conference or at a data science conference, like suddenly those interactions are not happening. And so but yeah, so for me, at least like that's, that's been a rich source of like inspiration and new ideas and new, new collaborations.

How often is it like a friendly collab conversation? Like, Oh, I love your work versus like strong opinion. Hey, I got a bone to pick, you know, like, Ooh, we got to get this package sorted out. I feel like people are almost always pretty friendly in person. Like I've got a bone to pick things are always like email. We're all hiding behind masks.

Yeah. I mean, or at least in my experience, if it does, if it is like that, it's for like five seconds and then it's like, okay, let's just have a conversation about whatever, you know. And then you can often sort it out. I think.

I do feel like that's going to remind me that in-person conferences are kind of like a nice bomb sometimes where like, yeah, you might interact with someone online and they seem like kind of intense or mad, but like getting in person does feel like it kind of like. But I guess there's still like the, especially like academic conferences, like the questions at the end of your talk, it can be like, sometimes they're like pointed or like, this isn't actually a question. I'm just going to talk about what I'm working on.

Yeah. But there is still that hallway element, you know, like I think conferences are useful if you've got like ideas kind of cooking in your head for a while. And then you see someone who like, Oh, you know, mentions the same idea, then you see a little validation, like, Oh, maybe there's something to this, you know?

I do think there's like something about travel as well. That just kind of knocks you out of your day-to-day routine and it makes you think about, and then like a bunch of new ideas. And even if you're not really necessarily paying attention to like all the talks you're in, but there's just like something about like jolting you out of your, yeah, like local minima.

R versus Python (and SQL)

Since we have you all here, I thought this was the perfect time to broach the podcast question. R versus Python. Is this the only time we should ask this question?

I don't know. Like I was talking to Hillary last week and she's like, we're post-language. I completely agree with that. Yeah. And I think Hadley and I have like kind of been on the same page about that for a long time. Just like that the language wars are not productive. Or at least I think we've reached a point where, you know, we've established like a basis of, like, there are certain things that where we figured out how to share technology between the Python and R worlds. Like we built, like we built Arrow and like that was like a many year endeavor. And now there's DuckDB and you can use that. And, you know, of course now, like so many, we're using all these services that live in the cloud. And so, you know, we can use those portably.

But it's also interesting because I think like Python and R, like at one point, it seemed like they were in direct competition with each other. Maybe like, you know, the early, early 2010s, like Python started to become popular in a business setting for doing data analysis and some data science type work. There were some new open source projects for doing statistics and some of the work that you would do in R. You had some data visualization still not as good as R, even today still not as good as R, I would argue. But I feel like at some point, maybe in the mid 2010s and like certainly, you know, more recent times, like Python world has diverged a lot more into like analytics engineering, data engineering, AI engineering, you know, machine learning engineering. And so in a sense, like it feels like where Python was in terms of like statistical computing and data analysis hasn't evolved that much, where like I still wouldn't, if you're doing, you know, coming from SAS or SPSS or like a commercial statistics language, like R is the, if you want to be open source, like R is the thing to use. I think R is the thing to use regardless for that, you know, for that type of work. But certainly in an open source, whereas the Python ecosystem has, you know, gone in kind of a different direction. And it certainly has become like, you know, the scripting and automation language of AI.

Yeah. But like the conference that Wes and I were at last week, the language that everyone was talking about was SQL. Oh, interesting. Okay. Like that was like by far and away more people talking about SQL. Like SQL seems to be having like a resurgence right now over the last few years. Like we're asking R versus Python and SQL is laughing. Yeah, like 40 years later. Yeah. It's like, it's my time.

Do you feel like it's, do you think that SQL is seeing a surge of popularity or it was like already there and we're just sort of like noticing it now?

Yeah. I mean, I think it's just been like steadily. It's just like this layer that underpins all data activities is like SQL is there. And my feeling is like, it always will be there in like a hundred years time. People will still be running SQL. We're stuck with it like forever.

Advice on learning data science in the age of AI

Any advice to students learning data science today? I have to say, it was really interesting. Like I think Roger's advice was mostly like, learn if you actually want to do data science, which like thinking about like, yeah, like when you're a student, like you've got no idea about what is actually appealing to you. So I think there's like two, like part of it is like finding out like what like really resonates with you. And part of it is the sort of like optionality, like making sure you're trying to keep your options open so that you can, you know, get a job and earn, you know, decent money and still consider, you know, still be able to find out like, what is it that you like really care about?

So I don't know, to me, it's about like, don't get like too locked into like one path or like one vision for your future, but make sure you're like taking, you know, these weird and wacky courses that, you know, maybe sound like something you're not be interested in, but might turn out to be, that's actually the thing that like drives the rest of your career.

Yeah. Yeah. That's a good point about kind of discovery and feeling it out. I mean, I have to, I mean, I have to think that now that if you were, you know, 17, 18 years old and undergrad entering and wanting to become a data scientist, that the path now is probably, you know, radically different, not just like what it's like to, you know, go from not knowing, you know, not knowing R, not knowing Python to, you know, it's like, what's like, what does it mean to now master, master a programming language in the age of LLMs where in a sense, like you want to, you want the, you know, AI tools to help you automate away a lot of the boring stuff. Like you would have to learn all of this boilerplate and, you know, stuff that frankly was getting in the way and the higher level thinking, like actually thinking about the data, actually asking questions and engaging in more of like the scientific reasoning process. And so you would have to like learn how to program and learn the mechanics of writing code and all of the, you know, package management and environment management and the nuts and bolts of doing that. And so now we can rely on, you know, AI autocomplete and LLMs and agents and all those things to write a lot of that boilerplate for us.

So one, you know, one optimistic view is that, you know, that it will allow students to focus more on like the scientific reasoning and knowing like, you know, what are the right tools to use and like how can I give the best quality feedback to the AI, you know, code generator. If you don't know which models to use and like, I guess you can, there will be certainly a lot of data scientists who are just letting, going full vibe coding, letting the LLM, you know, do all the work. And they look at it, they get a nice PDF document or, you know, a markdown document at the end and they're like, it looks good. You know, it looks good. Send to boss. But I think that it's still going to be important for data science. Like there needs to be a human in the loop to like be able to judge the output to say like, is the LLM doing the right thing? Does this analysis make sense? Is it asking the right questions? Do we need to ask more questions? And so I think I have a more optimistic view that, you know, it's not like, you know, AI is going to eliminate the need for data scientists, but it is going to, I think it's going to require people to get better at the actual data science and maybe developing advanced programming skills, I think is going to become less, you know, maybe on a relative basis, less important.

Yeah. That's actually my hope. I don't think we're quite there yet. Yeah. Kind of in a weird in-between stage, I think right now, but I think what you said, it highlights, I think there's like aspects about doing data analysis and looking at data that are difficult to really kind of quantify or really understand. So it's like, it will really highlight that we don't, there's a lot of things that we don't know yet because the easy stuff, which is really, or so to speak, the programming, you know, if that's all, you know, taken away and we don't have to do that anymore, then we have to like actually think about data and that's going to be hard.

Yeah. I have to say, like, I'm sort of, I don't know, I'm optimistic about the, like this idea that like human plus AI can do better than either alone. But I would still say like, if I was a student today, I'd also be making sure there's something I can do, like that AI can not, like something with my hands, like just have a, like a backup plan. Like manual dexterity as a job insurance. Yeah. Yeah. Or like human to human interaction. It just feels like I'd be like, just to make sure you've got like a backup plan.

Communities and the data engineering ecosystem

Are there any communities like in data science or kind of around the ecosystem that you're most curious about? Just to inject one, I've, like DBT, I think is a really interesting, like wherever SQL people live, I find is a really interesting.

Yeah. We were just talking about SQL. Well, I think it's, it's interesting that, you know, I guess the emergence of like, you know, Python is the mainstream business computing and, you know, data orchestration and automation language has brought a lot of work that was taking place in like proprietary tools. And by like, you know, stodgy, you know, database administration tools has brought all of that into the, into the land of open source. And so, so DBT is taking on the role of like, you know, enterprise ETL, which used to be done with, like, you know, all these products that people bought that were really expensive. And people still had to build, you know, 30 years ago, people were still building data warehouses and they had much smaller data sets. There wasn't, you know, what we now think of as big data, but they still had to like design, design schemas and architect the data warehouse. And people would spend their whole careers just like architecting Oracle databases, like designing things for performance. Like, so like that work has always existed, but now there's like, you know, hype, you know, sort of really hyped up open source projects that everybody's scrambling to learn because new companies are being founded all the time and they need to say, okay, what's my analytics stack? Like, what's my, you know, what's my, how do I go from collecting data out of my mobile application or my website to like, I have a data warehouse where I can run a SQL query and build a dashboard. And dbt has become like one of the, you know, sort of mainstream technologies to make, you know, to make all that work.

I've never actually used dbt directly, so I will admit that. But I've been following its trajectory and I think it's definitely very interesting. It's not data science, it's like analytics engineering or data engineering, but because it's adjacent to our space, like it creeps up and, you know, it's not uncommon for people doing data science to incidentally say, oh gosh, like I have to learn how to do dbt because like I don't have the data that I need. And in order to have the data that I need reliably for my data science application, I need to set up some dbt workflows or pipelines so that I can deliver this, you know, data science application.

Yeah, that's a good point. It is an interesting case too, like the R versus Python situation and the like, what is data science and the kind of role that dbt is carved out with analytics engineers. Like I know they, one infamous talk, which I think is pretty tongue in cheek, like down with data science. Like it's an interesting kind of like comparison there too, like analytics engineering versus data science is kind of like roles in the ecosystem.

I think that, I feel like that's really changed a lot over the course, like when I started doing data science, like you were the data engineer. Like the data would be in like, oh, here's like a hundred CSV files in this directory and here's a random API, here's a SQL database, like your first job as a data scientist is just to get the data into a usable form for you. And I think that's no longer so much the case. There's still like challenges, but that's much less of a challenge now than it was 10 years ago, which I feel like a little sad about because that's like part of the job that I like love doing, like collecting data sets and organizing them.

Roger, do you see much interest in like analytics engineering or do you see it show up much with students?

I'm not a hundred percent clear that they necessarily are exposed to that kind of work or that they necessarily would be aware that it exists. But I think as we teach them data science and they get to understand like what are the different components that are involved and how do you get data from A to B so that you can analyze it, they eventually kind of get an understanding of why you need something like that.

Yeah. Yeah, that's fair. I mean, there's probably like a fair amount of like, I hesitate to call it bait and switch, but like students are getting recruited into these data science type roles and then finding that the actual work that they're doing is a lot of, they're maybe doing a lot more ETL and SQL and data engineering than they expected. I know that that certainly happened to me. I got a degree in pure math, this is before I did statistics, but I was working for a quantitative hedge fund and I got so excited like, gosh, I'm going to do modeling and like working out equations and like applying math in the real world. And I sit down the first day, it's like, okay, here you go. Here's SQL Server Management Studio, Microsoft SQL Servers over here. A lot of the work was like data quality, like data cleaning, data like really, but it's quality, like having good quality data for as inputs to your data analysis process is essential. So I think there was an initial shock of like, gosh, like I'm spending all my time like monkeying with sort of procedures and looking at Excel data quality reports, but then if you don't have that, it's like a little bit like if you haven't got your health, you haven't got anything.

There's a little bit of that, I think, everywhere, like in academia, I think it's just kind of the same thing. They think they're going to be doing all these like great scientific theories and then at the end of the day, it's like you're merging CSV files and you're going through Excel spreadsheets or something like that. I think a lot of that too is like, to be useful, like you have to not be too constrained by what you already know how to do. You have to like look, when you're faced with a new problem, be like, well, where's the value? Like, what do I need to learn in order to be good at this, rather than thinking like, these are the six things I know, like how do I apply them to the situation?

What's been most impactful

I think one more question for you all. What do you think has been most impactful from your work? And yeah, what excites you most about data science?

It's actually in many ways a lot easier now that I've been doing this for quite a long time, like 20, 25 years-ish. One thing I can see is that impact often takes a lot of time to kind of unfold, but it does happen. And so I think it's easier for me to justify to myself like, why am I doing this? Because I have seen some impact. Things like environmental policies change because of the research that we've done and things like that. And also, some of the online courses that we've built, you have like millions of people who have taken these courses. I've met them in person. They say they've gone into data science or whatever. But that takes a huge amount of time. That doesn't happen tomorrow or overnight. And it takes a long time for it to unfold. But then it does happen, and it's really rewarding.

One thing I can see is that impact often takes a lot of time to kind of unfold, but it does happen. And so I think it's easier for me to justify to myself like, why am I doing this? Because I have seen some impact.

I mean, I've really enjoyed just expanding access to technology, like to open-source technology. So building open-source projects and then creating technical communication, whether I wrote a Python book. I haven't written as many books as Hadley has, but Hadley enjoys writing more books more than I do, it seems. Now it's been, I think, Python for Data Analysis came out 13 years ago, and Pandas has been around for a few more years than that. And so I think, for me, the most satisfying thing has been hearing from people all around the world about their path to discovering Python, learning language, learning how to use Pandas, and the impact that it's had in their lives. And so hearing from people in really humble circumstances about how they were able to learn these tools, and it got them their first job, or it was really essential to the way that they work.

In the past, a lot of computing technology was hidden behind a paywall, essentially. So you had to buy a license to a piece of software. And for many people, that was a practical barrier to learning and to being able to use those tools. And so I think having free software available, anybody with an internet connection, and now computers have gotten a lot more capable. And so people all around the world can use these tools to empower themselves and change their lives. And I also find that building the open-source communities around them, and building communities that eventually run themselves, so that you can step away and think about what's next, I think has been really satisfying, too. And I like seeing people more productive. And so seeing that has been pretty exciting.

Yeah, it's really incredible. I mean, I think you mentioned, too, before, I think, Pandas, you said is 17 years old? 2000, 2008, yeah. Yeah, wow. Okay. So yeah, it's powerful to hear both of you talk about sort of this timescale, you know, where I'd imagine a lot of people cobbling a tool today are not thinking about, like, 17 years from now.

Yeah, I mean, I think I always was just thinking about tomorrow. Like, I want it to be better tomorrow. And so, and, you know, thinking about years from now, I mean, of course, people find it useful and they use it, and that's great. But yeah, often, like, the motivation is, you know, the problems that are right in front of you and the things that you see are not good now. And that, like, if you perceive that you have the ability to, like, you know, whether it's, you know, writing something that helps people learn or making the software better and solving a problem that's annoying you, then it's really satisfying to see that through and to put it, get it in people's hands.

Yeah. Yeah. Thanks. Yeah, I mean, I do think that is, like, one of the positive things about academia is, like, that more of that, like, long-term timescale. Like, things don't have to, you know, like, you don't have, like, a deadline on next Friday to get this done. That things can take, like, months or years to do, because it does feel like that, like, yeah, like, the impact you have, it just, it takes, like, a long time to build up and to, like, really know what the things you did, like, what were the things, what were the actually impactful things that you did?

And I, like, I had a few conversations, again, this conference we went to last week. Like, I kind of like to think, like, it's my code that's most impactful. But, you know, like, I talked to a lot of people that, you know, that maybe they used R and enjoyed using ggplot2 or the tidyverse. But, like, what stuck with them was, like, the underlying ideas or some of them, like, I was just, like, kind to people when they, like, talked to me, like, 10 years ago. And they're like, oh, you know, I still remember that. And that just feels, like, just kind of, like, reaffirmed to me, like, you know, that I had made this deliberate choice. Like, I had had some, like, negative role models of, like, you know, people higher up in the, in their fields who were, like, did not, I did not have a good experience with. And I, like, very deliberately, you know, try to be, like, you know, kind and patient to people, regardless of, you know, like, how much they know about the field. And that just feels, like, over time, like, that's increasingly valuable.

Like, what stuck with them was, like, the underlying ideas or some of them, like, I was just, like, kind to people when they, like, talked to me, like, 10 years ago. And they're like, oh, you know, I still remember that. And I, like, very deliberately, you know, try to be, like, you know, kind and patient to people, regardless of, you know, like, how much they know about the field. And that just feels, like, over time, like, that's increasingly valuable.

Yeah. Yeah, it's really impactful. It's really amazing to hear it circle back, like, 10 years later, that somebody might still be kind of carrying that and thinking about that.

Yeah. And it's also, like, scary when you think, like, oh, you know, I've had a bad day, and I get a bad interaction with someone, and I don't even remember that, like, 10 years later, like, when Hadley said something really mean to me. Like, you don't you don't hear the people who are like, I'm like, I'm leaving data science because of Hadley. Like, I never hear from those people.

Yeah. Well, thanks. Yeah. Thanks, everybody, so much for coming on. Thank you, Roger. But no, it's been it's been so helpful, I think, and incredible just hearing about the field of data science and, you know, your interactions with open source over the years and how that timescale can be really long. And I do think that's one really neat thing about this podcast is the opportunity to hear kind of some of these long term projects, or like the histories that brought people to where they are. And so just really appreciate you being here and taking the time to come down.

The test set is a production of Paws at PBC, an open source and enterprise tooling data science software company. This episode was produced in collaboration with branding and design agency Aggie. For more episodes, visit the test set.co or find us on your favorite podcast platform.