Teaching Data Sharing through R Data Packages (Kelly McConville, Bucknell)

Transcript

This transcript was generated automatically and may contain errors.

Hello. Thanks, everyone, for coming. So Ryan wanted you to imagine that you were babies. I would like you to imagine that you are students in my Intro Stats and Data Science class.

So we're about four weeks into the semester. This means you've learned how to wrangle and summarize data using dplyr. You can make beautiful visualizations with ggplot2. And now I'm going to give you a new-to-you dataset on trees in Portland's parks. I'm going to ask you a more open-ended question than I have so far. I'm going to ask you to just do some exploratory data analysis on these trees. And what I want you to just ponder for a moment, as my student four weeks into the semester, is whether or not you feel ready to explore these data.

And when I say a moment, I really do mean just a moment, because I'm now going to show you a couple of the typical responses I've gotten from students when I did just this. All right.

So one of my students decided to take two quantitative variables and make a scatter plot. That's the appropriate graph. And they went a step further. They realized there were lots of points, so they tried to deal with the overplotting by making the points transparent. They know I love colors, so they customized the color with a hex code. And so their graph really looks pretty good. But then we get to their explanation. And we see that it says, here's a graph of the DBH and the crown width EW. We can see that as the DBH increases, so does the crown width EW. All right. So this is not as bad as just saying, as x increases, so does y. They used the variable names. But I don't really get a sense of these trees. I don't feel like we learned anything about the trees from their interpretation of their graph.
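A minimal sketch of the kind of plot described here, not the student's actual code. The data frame is simulated as a hypothetical stand-in for the park trees data, the column names `DBH` and `Crown_Width_EW` are assumptions, and so is reading "EW" as east to west. The hex color is arbitrary.

```r
library(ggplot2)

# Hypothetical stand-in for the park trees data (simulated, not real measurements).
set.seed(1)
trees <- data.frame(DBH = runif(200, min = 2, max = 60))
trees$Crown_Width_EW <- 5 + 0.8 * trees$DBH + rnorm(200, sd = 6)

# Scatter plot with semi-transparent points (alpha) to ease overplotting
# and a custom hex color.
p <- ggplot(trees, aes(x = DBH, y = Crown_Width_EW)) +
  geom_point(alpha = 0.3, color = "#1B9E77") +
  labs(
    x = "Diameter at breast height (DBH)",
    y = "Crown width, east to west (assumed meaning of EW)"
  )

# Run print(p) to draw the plot.
```

Spelling the variables out in the axis labels is a small step toward the context the interpretation was missing.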

Another student decided to look at species and to see what types of trees there are in Portland. And honestly, I was pretty impressed with their data wrangling skills, right? They found that there were lots of different species. So after they counted them up, they subsetted down to just the five most common. And then they even sorted their bar graph based on frequency. Wonderful. Used a lovely color again. But again, their interpretation of their graph, they noticed we have a lot of these PSME trees. I'm not a botanist. I don't think the student here was either. We don't have a great sense of trees in Portland.
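The wrangling steps described here (count the species, keep the five most common, sort the bars by frequency) can be sketched as follows. This is an illustration under assumptions, not the student's code: the tiny data frame is made up, the `Species` column name is assumed, and all species codes other than PSME are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the park trees data: one row per tree.
trees <- data.frame(
  Species = c("PSME", "PSME", "PSME", "ACMA", "ACMA", "THPL",
              "QURU", "ALRU", "PSME", "ACPL", "ACPL", "THPL")
)

top_species <- trees %>%
  count(Species, sort = TRUE) %>%  # tally trees per species, most common first
  slice_head(n = 5)                # keep the five most frequent species

# Bar graph with bars ordered from most to least frequent.
p_bar <- ggplot(top_species, aes(x = reorder(Species, -n), y = n)) +
  geom_col(fill = "#2E8B57") +
  labs(x = "Species code", y = "Number of trees")

# Run print(p_bar) to draw the plot.
```

The x axis still shows only species codes, which is exactly the gap in context the talk goes on to discuss.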

All right. But you know what? This is not my students' fault. They've done what I've taught them to do. I have taught them to wrangle code. I've taught them to make graphs. And I gave them a one-sentence explanation of this data. So maybe it's obvious to all of you. It should have been obvious maybe to me. But to do good data work, we need context. And I was not providing my students with enough context to go to the last step and to actually tell stories with their data.

But you know what? Providing good context is tough and sounds exhausting. If you've ever taught an intro stats and data science course, I already have jammed enough things into that class, right? I have to teach them all of the statistical concepts. I have to teach them how to code. And now I also have to teach them about the data. So I'm going to admit to you that I mostly ignored this problem. Sure, instead of a sentence, I started giving them a paragraph about the data. And I would link out to where I got the data from. But I'm going to tell you right now that their interpretations of their summary statistics and their graphs didn't really improve very much.

Context really is key to doing good data work.

So as I said, I really did make that genes demo repo so you can check it out. I also have hex stickers for my pdxTrees package. If you'd like one, come see me after the session.

Q&A

Okay. We have a few questions in a minute or two. Do you teach about data ownership and how you navigate this as students create their own packages?

Yeah. So I mean, have you ever thought about what license to have, right? So now that'll be a question. You're going to get an error, or at least a warning, one of those two, when you try to build that package if you've not put a license in there. And so yes, this is a perfect moment. And you know what I do when we're going to talk about licenses and ownership? I bring a data librarian into the classroom, because they are much better at talking about that than I am. And so they talk about the different types of licenses and what impact that would have downstream on people using the data set in your package. Great question.

Can you talk about having students make their own, how it went for the students?

Yeah. So, you know, even if on day one we do a demo where we go from start to finish, and the TAs and I are walking around the room and helping them, you're still going to hit snags, right? So absolutely, there are going to be issues that you hit along the way. So we would bake some time into class where, again, they would get to work on their packages in their groups, and I'm walking around the room and helping them. So I would say with enough scaffolding, it really can go okay, but you do really need some experts around, because you don't want them to then sit in their dorm room, you know, for four hours with this one bug that you would be able to catch in two seconds, and now just decide that coding is terrible because "I can't get my package to compile." So yes, you just have to do a lot of scaffolding. During the time when they're working on this package, I always have muffins and cookies at my office hours as an extra incentive to really get them coming to see me. So yes, I think it's just all about providing them with the right support. And we spend four class periods actually learning about the different steps, because talking about the documentation and helping them write good documentation takes time. So you'd still want to also devote a significant chunk of class to this.

All right, last question. Did any of them send their packages to CRAN?

Ah, so good question. And you'll notice one piece of documentation I did not talk about was making a website with pkgdown. So, right, I give the students ownership over whether or not they want to share things publicly. And so we talk about how you can make a website, but I don't actually require that as part of the project. So at the end of the project, they all get to decide, as a group, whether or not they want to make their GitHub repo public instead of private. But I'm not sure any of them have submitted to CRAN yet. So that should be an extra credit point for the next iteration.

They should get free tuition for that.

Above my pay grade.

Let's thank Kelly again.

Teaching Data Sharing through R Data Packages (Kelly McConville, Bucknell) | posit::conf(2025)

Featured software

devtools

usethis