
Teaching Data Sharing through R Data Packages (Kelly McConville, Bucknell) | posit::conf(2025)
Speaker(s): Kelly McConville
Abstract: Data science courses tend to teach students reproducible workflows. However, the origin of the data used in these workflows and definitions of the variables used are often not emphasized. This talk addresses this gap by focusing on how to teach students effective data sharing through the creation of R data packages. We’ll explore how to leverage key packages, such as devtools and usethis, and will demonstrate how to guide students in generating appropriate documentation through ReadMes, help files, and vignettes. Furthermore, we’ll discuss common pitfalls encountered when first learning to create R packages and will propose how to structure a project assignment where an R data package serves as the primary deliverable.
Materials: https://mcconvil.github.io/r-data-package-talk-f25/data_packages_talk
Transcript
This transcript was generated automatically and may contain errors.
Hello. Thanks, everyone, for coming. So Ryan wanted you to imagine that you were babies. I would like you to imagine that you are students in my Intro Stats and Data Science class.
So we're about four weeks into the semester. This means you've learned how to wrangle and summarize data using dplyr. You can make beautiful visualizations with ggplot2. And now I'm going to give you a new-to-you dataset on trees in Portland's parks. I'm going to ask you a more open-ended question than I have so far. I'm going to ask you to just do some exploratory data analysis on these trees. And what I want you to just ponder for a moment, as my student four weeks into the semester, is whether or not you feel ready to explore these data.
And when I say a moment, I really do mean just a moment, because I'm now going to show you a couple of the typical responses I've gotten from students when I did just this. All right.
So one of my students decided to take two quantitative variables and make a scatter plot. That's the appropriate graph. And they went a step further. They realized there were lots of points, so they tried to deal with the overplotting by making the points transparent. They know I love colors, so they customized the color with a hex code. And so their graph really looks pretty good. But then we get to their explanation. And we see that it says, here's a graph of the DBH and the crown width EW. We can see that as the DBH increases, so does the crown width EW. All right. So this is not as bad as just saying, as x increases, so does y. They use the variable names. But I don't really get a sense of these trees. I don't feel like we learned anything about the trees from their interpretation of their graph.
Another student decided to look at species and to see what types of trees there are in Portland. And honestly, I was pretty impressed with their data wrangling skills, right? They found that there were lots of different species. So after they counted them up, they subsetted down to just the five most common. And then they even sorted their bar graph based on frequency. Wonderful. Used a lovely color again. But again, their interpretation of their graph, they noticed we have a lot of these PSME trees. I'm not a botanist. I don't think the student here was either. We don't have a great sense of trees in Portland.
All right. But you know what? This is not my students' fault. They've done what I've taught them to do. I have taught them to wrangle code. I've taught them to make graphs. And I gave them a one-sentence explanation of this data. So maybe it's obvious to all of you. It should have been obvious maybe to me. But to do good data work, we need context. And I was not providing my students with enough context to go to the last step and to actually tell stories with their data.
But you know what? Providing good context is tough and sounds exhausting. If you've ever taught an intro stats and data science course, I already have jammed enough things into that class, right? I have to teach them all of the statistical concepts. I have to teach them how to code. And now I also have to teach them about the data. So I'm going to admit to you that I mostly ignored this problem. Sure, instead of a sentence, I started giving them a paragraph about the data. And I would link out to where I got the data from. But I'm going to tell you right now that their interpretations of their summary statistics and their graphs didn't really improve very much.
The origin of the R data package idea
Until I had to co-lead a workshop on teaching intro stats with R. So this was to instructors, most of whom had been teaching intro stats for quite a while. But they were new to R. They were new to coding. It was a half-day workshop. And you know what I didn't want to have to do in this half-day workshop? I did not want to teach these people file paths. I did not want to spend an hour of the workshop just loading the data in. I needed to teach them how to wrangle and summarize and model and visualize data, right? And so my co-instructor said, hey, Kelly, you have that really cool data set on trees in Portland. How about you turn that into an R data package? Because you got this train ride coming up, you can just do it on the train.
All right. So my first R data package was created called pdxtrees. I have a public GitHub repository where the participants in my workshop, but also all of you, can learn about the package. Being a good data scientist, I made a hex sticker. If you're like, hmm, what in the world is going on with that hex sticker and those colors? Well, if you've never been to Portland's airport, it is fashioned after the PDX airport carpet. Now I'm all worried that you're going to go to Portland's airport and you're going to be like, that's not the carpet. It is true. They ripped up our beloved carpet. I think there's like one terminal that still has it. But this is my favorite Portland carpet.
Okay. So my first R data package was born, and this solved our problem. They could install the package, and then they all had the same code to load the data set. I didn't have to teach them about file paths. And we could go through the workshop then, and they could learn to wrangle and visualize and model and summarize the data.
Okay. But there was a really lovely side effect of creating an R data package. As I told you, I made a readme. So I shared that URL, which had the readme as the landing page where they could learn more about the data set. And then as they were using the data set, if they had a question about a particular variable, they could just do question mark and the function name, and they could learn about that variable. Very simply.
Okay. So now that I had this R data package that I used, sure, originally for a workshop, I now use it in my classes. And I have to say that the students' interpretations of their work have improved massively. So the student that now decides to make a scatter plot, again, of these two variables, tells me a lovely story about the trees. As the diameter of the tree increases, well, it makes sense that the canopy is going to get larger. We're going to have more branches and leaves and foliage. We don't have to be a botanist for that to make sense. We can now start to actually see the trees. And that most common tree? Well, that's Douglas fir. Douglas fir is Oregon's state tree. It makes sense that it was the most common tree in our data set.
All right. So context is key to doing good data work. And what I learned is because the context is so close to the data in an R data package, that is a great way to give people access to the context, the story of the data. And then I went a step further, not with my intro students, with my second semester students. I decided once I'm now teaching students about data management and data sharing, I should be teaching them to share their data with R data packages. And I will say their favorite part of the project is creating their hex sticker. So here are just a selection of hex stickers from my students over the years.
How to make an R data package
Okay. Let's talk about how to make an R data package. First, all right, this seems obvious, but you need to have a data set that you want to share with others. So I'm going to just admit to you all, I had not bought a new pair of jeans since before the pandemic. I was really dreading it. And so I was a good data scientist, and I went on Anthropologie's website and did what any data scientist would do, and I created a data set of jeans. Spoiler alert, people really like wide leg jeans nowadays. Okay. That's a thing.
Leg opening. All right. So now I want to share this jean data set with others so that when they want to go shopping for jeans, it's not as painful. All right. So let's put this data in an R data package. I'm going to admit something right now. I am not about to go through all of the steps to make an R data package because you should not learn how to make an R data package by passively sitting there and watching me talk through the steps. Instead, you should learn from the pros. But even still, you should not sit here and listen to Jenny and Hadley tell you the steps to make an R data package. What you should do is you should go get their book, which you can read for free online, and they actually go through the steps. So you should actively go through the steps of creating a package. Now, their book is about making R packages in general, but they have sections specifically about how to put data into your package.
So instead, what I want to focus on are some of the tips and the documentation. So in terms of tips, I recommend that you lean heavily on the wonderful helper functions that are out there. devtools, usethis, roxygen2. I just don't understand. I'm from Iowa. I'm going to pronounce things however I want. But these helper packages are really useful. They streamline the process. They automate a lot of the steps. They just make it easier to go from raw data to finished package. The other tip I have for you is that there are lots of wonderful R data packages out there. So mimic them. In particular, the palmerpenguins package has wonderful documentation inside. And so I always have my students, when they're working on this project, go look at some of these packages, go to their GitHub repositories, and see how they're actually talking about the data.
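Going from raw data to a package skeleton with these helpers might look roughly like the sketch below. It is not the speaker's actual code: it assumes the usethis package is installed, and the jeans data frame, its column names and values, and the temporary directory are all hypothetical stand-ins.

```r
library(usethis)

# Hypothetical stand-in for a hand-collected data set; the columns and
# values are made up for illustration
jeans <- data.frame(
  style = c("wide leg", "straight", "barrel"),
  leg_opening = c(24, 16, 20),
  price = c(148, 128, 158)
)

# Create the package skeleton in a throwaway directory
parent <- tempfile("pkg-demo-")
dir.create(parent)
pkg_path <- file.path(parent, "jeans")
create_package(pkg_path, open = FALSE)

# Make the new package the active usethis project
proj_set(pkg_path)

# Save the data frame as data/jeans.rda inside the package
use_data(jeans)
```

In a real project the package would live in a permanent folder, ideally under version control, rather than in a temporary directory.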
Creating good documentation
Okay. But as I said, what I want to focus on in terms of the package development is creating good documentation. Because that is really the piece that is so wonderful about an R data package. I'm going to talk about three different types of documentation. And I'll show just a little bit of code that you can use to help with that process.
So let's start with the readme. You've already seen the readme for my pdxtrees R data package. We need to now make a readme for the jeans package. To get the skeleton of a readme, you can use the use_readme_rmd() function. And so the way I want us to think about the readme is, all right. I'm on Anthropologie's website. I've got a sea of jeans in front of me. And I see one I think I might be interested in. So there's that nice little button, the quick view button, which will pull up just a small page that tries to give me the essential information I need to decide whether or not I want to look at that pair of jeans more, or if I want to move on to some other pair. So I want you to think about the readme as the quick view. You're trying to help the user decide very quickly whether or not they should be interested in your package. So you want to give them a big picture overview of what the package does, how they can install it, and maybe a simple example of how they could use your package.
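As a rough illustration of the readme step, the sketch below creates a README.Rmd skeleton. It assumes usethis and rmarkdown are installed; the package name "jeans" and the throwaway directory are placeholders used only so that the sketch runs on its own.

```r
library(usethis)

# Throwaway package so the sketch is self-contained; "jeans" is a placeholder
parent <- tempfile("readme-demo-")
dir.create(parent)
pkg_path <- file.path(parent, "jeans")
create_package(pkg_path, open = FALSE)
proj_set(pkg_path)

# Create README.Rmd at the package root, ready to be filled in with an
# overview, installation instructions, and a simple example
use_readme_rmd(open = FALSE)

# After editing README.Rmd, README.md can be rendered with devtools
# (left commented out because it builds and installs the package):
# devtools::build_readme()
```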
Okay? So we have that for our jeans. Someone decides, all right, yes, I am interested in this package. I'm going to install it. I'm going to load it. So now we should move on to the help file. So the help file is our next layer in to help them get a better sense of this package and the data inside. The use_r() function is useful for creating the shell of where you're going to put the contents of your help file. You'll put in some roxygen code. And then you'll run the document() function to turn that into the help files we're used to. So that when someone does question mark jeans, they'll see the help file for this data set. And the way I like to think about the help file is this is like when I've now clicked through to that jeans' actual page, I get the product details, right? I get to learn about what it's made out of. Was it made in the U.S.? What's the style? So on and so forth. That's the same idea as the help file, right? I get to learn about each of the individual variables. What's the class of those variables? What are the values it can take on? How many jeans are in my data set? So I get a sense of how I could use this. What are the variables present? Why would I want to interact with this data set?
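For a data set, the roxygen code in that file might look something like the sketch below. The data set name, the description, and every variable listed are hypothetical, invented here only to show the shape of the documentation.

```r
# Contents of R/jeans.R, a file that usethis::use_r("jeans") would create.
# The data set name and all variables described below are hypothetical.

#' Jeans listed on a retailer's website
#'
#' One row per pair of jeans, collected by hand to make shopping for
#' jeans a little less painful.
#'
#' @format A data frame with one row per pair of jeans and 3 variables:
#' \describe{
#'   \item{style}{Name of the style, a character string.}
#'   \item{leg_opening}{Leg opening in inches, numeric.}
#'   \item{price}{Listed price in US dollars, numeric.}
#' }
#' @source Collected by hand from the retailer's website.
"jeans"

# Then, from the package root, turn the comments above into man/jeans.Rd:
# devtools::document()
```

The quoted name on the last line tells roxygen2 which data object the comments describe; the data itself must already be saved in the package's data folder.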
So now I've decided I want to interact with it. I have a better sense of it. But I would like to see some examples. Now, you can put examples in your help file. But maybe I want to have more fully fleshed out examples. So vignettes were brought up at the end of the last talk. So vignettes are going to be the last type of documentation that I'll mention. And so the use_vignette() function will again make the shell of a vignette for you that you can fill in. And the way I want us to think about the vignette is on Anthropologie's page, underneath that product details, they have this lovely ways to wear section where they'll show the jeans with different shirts and shoes and accessories. Yes, they're just trying to get you to buy more on their website. But it's giving you concrete ideas on how you could wear those jeans. That's the same purpose our vignette should be serving. It should give the user concrete ideas on how they could use our dataset. How they could wrangle or visualize or model this dataset.
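Creating the vignette shell could be sketched as follows. It assumes usethis, knitr, and rmarkdown are installed; the package name, the vignette name "exploring-jeans", and its title are made up for illustration.

```r
library(usethis)

# Throwaway package so the sketch is self-contained; "jeans" is a placeholder
parent <- tempfile("vignette-demo-")
dir.create(parent)
pkg_path <- file.path(parent, "jeans")
create_package(pkg_path, open = FALSE)
proj_set(pkg_path)

# Create vignettes/exploring-jeans.Rmd and register knitr as the
# vignette builder in DESCRIPTION
use_vignette("exploring-jeans", title = "Exploring the jeans data")
```

The resulting .Rmd file is where worked examples of wrangling, visualizing, or modeling the data would go.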
Final thoughts and advice
Okay. So I just have a few final thoughts I want to share with all of you around R data packages. So I did tell Nick this. But I'm also going to tell all of you, it took me longer than a three-hour train ride to create my first R data package. There are just errors you're going to run into that you have not seen before. And the Wi-Fi on Amtrak is terrible. So that's true. But after I made my first package, making the second and so on went much, much faster. And much more smoothly. And so my biggest piece of advice, if you're going to make an R data package is make it with someone else who's already made one before. Because once you hit some weird snag, they can come in, look over your shoulder and tell you what you're missing.
So this is really important to me in terms of teaching my students how to make these. So the first day of class or the first day of the project, we actually during the class period all make the same package from start to finish using the whole game chapter of Hadley and Jenny's book. So that again, I'm around when they, you know, on step four forget to put quotes in the name of the package at the end of their script file. I can find these mistakes and help them see that really they have the skills and the tool set to make a package. But again, I think it's really important to have someone around who can help you troubleshoot errors.
And then I want to talk about that lonely CSV for a moment. You're going to still share data as lonely CSVs. I absolutely still do that with my students. I don't just give them data from R data packages in part because after they leave my class, people are not going to share data with them always with R data packages. So it's okay to give people CSVs, but it is important that we are investing time in the context so that when we share data with others, with our students, our collaborators, frankly ourselves, we actually are giving people the information they need to do good data work because context really is key to doing good data work. Okay. Thank you so much for coming.
So as I said, I really did make that jeans demo repo so you can check it out. I also have hex stickers for my pdxtrees package. If you'd like one, come see me after the session.
Q&A
Okay. We have a few questions in a minute or two. Do you teach about data ownership and how you navigate this as students create their own packages?
Yeah. So I mean, like have you ever thought about what license to have, right? So now that'll be a question. You're going to get an error, or at least a warning, one of those two, when you try to build that package if you've not put a license in there. And so yes, this is a perfect moment. And I'm going to say, you know what I do when we're going to talk about licenses and ownership? I bring a data librarian into the classroom because they are much better at talking about that than I am. And so they talk about the different types of licenses and what impact that would have downstream on people using the data set in your package. Great question.
Can you talk about having students make their own, how it went for the students?
Yeah. So, you know, even if on day one we do a demo where we go from start to finish, and me and the TAs are walking around the room and helping them, you're still going to hit snags, right? So absolutely there are going to be issues that you hit along the way. So we would bake some time into class where, again, they would get to work on their packages in their groups, and I'm walking around the room and helping them. So I would say with enough scaffolding, it really can go okay, but you do really need some experts around, because you don't want them to then sit in their dorm room, you know, for four hours with this one bug that in, you know, two seconds you would be able to catch, and now just decide, like, coding is terrible because I can't get my package to compile. So yes, you just have to do a lot of scaffolding. My office hours during the time when they're working on this package, I always have like muffins and cookies as an extra incentive to really get them coming to see me. So yes, I think it's just all about providing them with the right support. And we spend four class periods actually learning about the different steps, because talking about the documentation and helping them write good documentation takes time. So you'd still want to also devote a significant chunk of class to this.
All right, last question. Did any of them send their packages to CRAN?
Ah, so good question. And you'll notice one piece of documentation I did not talk about was making a website with pkgdown. So, right, I give the students ownership over whether or not they do want to share things publicly or not. And so we talk about how you can make a website, but I don't actually require that as part of the project. So at the end of the project, they all get to decide whether or not as a group, they want to turn their GitHub repo to be now public instead of private. But I'm not sure any of them have submitted to CRAN yet. So that should be an extra credit point for the next iteration.
