
Hunter Glanz | The Five Principles of Data Science Education | RStudio (2020)
In this talk, I will outline a unified philosophy of data science education, and provide tips and tools for implementing these principles in the classroom using R and RStudio. Although data science as a professional discipline is well established, its pedagogy is still in a period of growth. Even within a single university, multiple data science courses may be offered across different departments, leading to inevitable redundancy of efforts amidst rich domain-specific innovations. My experience as an instructor in many such courses has led me to five principles that transcend domain, context, and choice of language: reproducibility, communication, version control, practical application, and data ethics. For each of these full-stack themes, I will share examples of how to leverage tools in R and RStudio to enhance learning. A 5-minute presentation in our Lightning Talks series.
Transcript
This transcript was generated automatically and may contain errors.
I am the aforementioned Hunter, who is very lucky to work with Kelly Bodwin at Cal Poly University down in San Luis Obispo. I teach a lot of statistical computing and data science type courses in lots of different languages and stuff. And I feel like what's pretty much agreed on at this point, even though a lot of the education surrounding data science is still in flux, is there's math, there's statistics, there's computer science, and then there's some kind of domain application knowledge. That third, usually, element of the Venn diagram that you've probably all seen or heard of or something like that.
But it's that third piece, the domain knowledge or application, that can vary so much in everybody's kind of implementation of data science education. But I would argue that these things should remain constant. So if we can all agree on the math, the statistics, the computer science, and those kinds of skills with whatever language you may happen to see it in, and no matter what domain application you see or experience or deliver to your students, these things need to be in all of it.
You may have courses in data structures, you may have courses in regression models or machine learning and stuff like that. But at least at Cal Poly, we don't have a course in version control or a course in reproducibility. But these are things that are inherent and important for all of these topics and all of these courses. Students need to see these things. These need to be harped on and emphasized, I think, in any data science education.
So besides positing that these are kind of foundational, R and RStudio are really special tools for implementing and involving these topics, I think.
Reproducibility
So to start off, reproducibility. This is like the deadest horse, I feel like, of this conference. You guys have heard about reproducibility in lots and lots of talks already, so there's actually not too much for me to say here except what you already know, in one sense. So the idea, if you can't read the text, Shai is saying, "What am I supposed to be looking for?" If you're being passed code by somebody else, or if I have students passing code around or things like that, then that code should be robust enough and built in such a good way that it can be run as is, so that analyses, workflows, and pipelines are reproducible.
So you shouldn't have to really look for anything. And if you are going to look for anything, you want to tweak anything, it should be kind of parameterized well, built in a robust way. But again, you guys have already heard this a lot of times. If you weren't at JJ's keynote, go check that out for how RStudio is epic at reproducibility or Carl Howe's talk from yesterday on education where he also talked about how awesome RStudio is for reproducibility.
Communication
Communication should not be magic, right? It shouldn't be just a black box for how all the data science gets done, right? We're all delivering to somebody, it might be a teammate, it might be a supervisor, it might be people in all different kinds of positions or on different teams and things like that. And the how of what we did and the why and all that kind of stuff shouldn't be magic. But this GIF's awesome, so I had to include that.
And so communication has to be a huge part of whatever coursework our students are going through, right? And RMarkdown and Shiny are phenomenal at that. You guys have seen tons and tons of talks. I'm learning new things all the time about different ways that these can be used to that end. So Gordon Shotwell's talk was really nice yesterday on technical debt, but everything you've seen so far on RMarkdown here is good for communication.
Version control
Version control, right? There's all these different versions of possibly the same thing, right? Code, analysis, and stuff like that going on. And you've got to track it, right? Along with reproducibility, we need resiliency in the data science pipeline. And so the Git integration in RStudio, again, is totally epic, right? So if you're not familiar with this, you probably all use version control of some sort, and you may use it outside of RStudio, but you can use it from within RStudio, so I would definitely recommend checking that out and getting your students hooked up to that.
Practical application and data ethics
Practical application, you've probably heard this implemented in a lot of people's different universities or programs. There's a practicum, right? Or some client-based project or something like that. This is also critical. We can equip students all we want in the classroom, but this football helmet's not really useful in the classroom, right? You can give them all the tools and stuff, but you've got to get them practice with it, like real practice with it. And so learn by doing, client-based projects are huge. R projects are obviously a really nice tool for that. R packages and having students maybe build packages and stuff like that are all really good ways to kind of get them into that mode of what they might be doing later on the job.
And finally, data ethics. This often maybe gets swept to the side or isn't explicitly addressed, but everything and everybody involved in the data need to be cool with it, right? Need to be on the right side of it. This has also been talked and touched on in a lot of different ways here at the conference, and so I won't say too much, except that everybody needs to be treated and considered well.
So that is it. Please reach out. I'm happy to discuss education with anybody and everybody, so. Thank you.
