
Using R package structure for data science projects | Kylie Ainslie | Data Science Lab
The Data Science Lab is a live weekly call. Register at pos.it/dslab! Discord invites go out each week on live calls. We'd love to have you! The Lab is an open, messy space for learning and asking questions. Think of it like pair coding with a friend or two. Learn something new, and share what you know to help others grow.
On this call, Libby Heeren is joined by Kylie Ainslie, who walks through how structuring data science projects as R packages provides a consistent framework that integrates documentation for you and facilitates collaboration with others by organizing things really well. Kylie says, "I stumbled on using an R package structure to organize my projects a number of years ago and it has changed how I work in such a positive way that I want to share it with others! In a world where our attention is constantly being pulled in many directions, efficiency is crucial. Structuring projects as R packages is how I work more efficiently."
Hosting crew from Posit: Libby Heeren, Isabella Velasquez
Kylie's Bluesky: @kylieainslie.bsky.social
Kylie's LinkedIn: https://www.linkedin.com/in/kylieainslie/
Kylie's Website: https://kylieainslie.github.io/
Kylie's GitHub: https://github.com/kylieainslie
Resources from the hosts and chat:
posit::conf(2026) call for talks: https://posit.co/blog/posit-conf-2026-call-for-talks/
Kylie's posit::conf(2025) talk: https://www.youtube.com/watch?v=YzIiWg4rySA
{usethis} package: https://usethis.r-lib.org/
R Packages (2e) book: https://r-pkgs.org/
Paquetes de R (R Packages in Spanish): https://davidrsch.github.io/rpkgs-es/
{box} package: https://github.com/klmr/box
extdata docs in Writing R Extensions: https://cran.r-project.org/doc/manuals/R-exts.html#Data-in-packages-1
Tan Ho's talk on NFL data: https://tanho.ca/talks/rsconf2022-github/
{rv} package: https://a2-ai.github.io/rv-docs/
Whether to Import or Depend: https://r-pkgs.org/dependencies-mindset-background.html#sec-dependencies-imports-vs-depends
{pkgdown} package: https://pkgdown.r-lib.org/
Edgar Ruiz's {pkgsite} package: https://github.com/edgararuiz/pkgsite
Attendees shared examples of data packages in the chat! Here they are:
https://kjhealy.github.io/nycdogs/
https://kjhealy.github.io/gssr/
https://github.com/deepshamenghani/richmondway
https://github.com/kyleGrealis/nascaR.data
https://github.com/ivelasq/leaidr
► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu
Follow Us Here:
Website: https://www.posit.co
The Lab: https://pos.it/dslab
Hangout: https://pos.it/dsh
LinkedIn: https://www.linkedin.com/company/posit-software
Bluesky: https://bsky.app/profile/posit.co
Thanks for learning with us!
Timestamps:
00:00 Introduction
06:17 Reviewing the disorganized project example
10:01 Creating the package structure using create_package
17:50 Organizing external data and scripts in the inst folder
22:55 Adding a README and License
29:06 "What are the advantages to packaging a project?"
33:35 Writing Roxygen2 documentation
36:06 "Do you type return at the end of your functions?"
41:55 Handling dependencies with use_package
43:53 "Can you just use require(dplyr) at the top?"
47:45 Setting up a pkgdown site
50:11 Creating vignettes
52:22 "What is the role of the usethis package?"
54:18 Loading the package with devtools::load_all
Transcript
This transcript was generated automatically and may contain errors.
I would love to introduce our esteemed guest, our lab manager for today, Kylie Ainslie. Kylie, would you like to say hello? Sure. Hi, everybody. Thanks for coming.
All right, Kylie, tell us a little bit of background about what you do in data science and what we are going to be talking about today, because it is sort of a nice expansion of the conf talk that you gave last year at posit::conf. Yep. So I don't necessarily consider myself a data scientist. I am an infectious disease modeler. So I basically use math, stats and R to study how diseases spread and how interventions can mitigate that spread. And I'm currently in the Netherlands, which is why it might seem a bit dark behind me. And I've just moved to Australia to work for the University of Melbourne.
But I do a lot of coding stuff. And so I was really actually excited that Libby asked me to be here today because I would like to talk about sort of a workflow that I kind of happened upon accidentally. So I was working in a really messy way and I just got really frustrated because all of my coding projects had different organization. And so today I sort of want to talk about the way that I've found to make that messiness go away and keep everything consistent and more efficient. So I'll leave it at that for now and then we can dive in.
Awesome. Yeah, I think that this is something that I feel might be common, where you give a conf talk and there's a bunch of stuff you wish you could have done live that you can't because it's a conf talk. You're like, I have 15 minutes and I can't live code. It's going to be all messed up if something goes wrong. This is a place where things can go wrong and it's totally fine because we are just going to be sharing screens willy nilly and being messy and transparent and open.
So with that in mind, everybody, this is the place to stop and ask questions. This is not a wait till the end and raise your hand type of deal. This is a place where we put questions in Discord and I can interrupt Kylie and we can ask a question. So if you have clarifying questions, this is the place to ask them and be completely open and totally fine.
Setting up the project
So basically the thing that I use the most, or did use the most, in trying to create this was the usethis package, which is amazing and actually makes setting up the R file structure really, really simple. It has some really nifty advantages and allows you to connect with GitHub and all sorts of stuff, so it's fun.
So we'll first just install some relevant packages here because I don't actually know if they're installed yet.
So I always light a candle when anyone's live coding. It's like a superstitious thing, but I didn't like when something's gonna go wrong. Well, it probably will because I'm also pushing myself to use Positron and I usually use RStudio, but I thought, you know what? Live coding is the time to do something brand new. You're right. Yeah.
This is going to be me in about a week when we have Edgar Ruiz on to talk about the mall package. As I live work with LLMs, I'm sure something's going to go wrong.
OK, it looked like that worked. So basically we've got that. We're just going to create. I'll just do this so people can see what package I'm using. So actually, creating the package is really easy. You just write usethis::create_package, and I'm going to call it isolator because this is actually a project that was about testing and quarantine during the COVID pandemic and the strategies to do that so that people can get out of quarantine faster.
Huzzah, look at that. It's already made my package and it has an R directory. It's got a couple of folders, and the R directory currently has nothing in it. It's got .gitignore and .Rbuildignore. These are things for GitHub and the R package build. It's got a DESCRIPTION file, and if you've never written an R package, the DESCRIPTION file basically tells what the package does.
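The package-creation step described above can be sketched as follows. This is a minimal sketch, assuming the usethis package is installed. It builds the skeleton in a temporary folder (an assumption made here so the example leaves your working directory untouched), whereas on the call the package was created in the current working directory.

```r
library(usethis)

# Assumption: a throwaway parent folder keeps the example safe to re-run.
# In real use, pass the path where the project should live.
parent_dir <- tempfile("pkg-demo-")
dir.create(parent_dir)
pkg_path <- file.path(parent_dir, "isolator")

# open = FALSE keeps the current session instead of opening the new project.
create_package(pkg_path, open = FALSE)

# List what was generated, including any dotfiles.
list.files(pkg_path, all.files = TRUE, no.. = TRUE)
```

The listing should show at least DESCRIPTION, NAMESPACE and an empty R folder. Which dotfiles appear can depend on the editor in use.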
So we can just say, sorry, guys, it's an Australian group, so I have to use an S, not a Z: analyses quarantine testing strategies or whatever. OK, and then we'll just put my name, even though it's really my colleague's, but I didn't actually ask if he wants me to reveal him.
OK, cool, that's great. We did that, and then the NAMESPACE file. If you aren't familiar with it, it's something that's automatically generated, so you don't need to mess with it, but it essentially tells what is exported and, I believe, imported for the package to work. So how it interacts with other packages and functions. I will come back to that. Do not edit by hand. I need that warning, because that's something I would absolutely do. Yeah, yeah, so just don't touch it. I will do stuff later and come back to it and show you what pops up there.
Organizing files and folders
OK, what should we do next? So one of the things I want to do is just move some files, right? OK, so I now have this isolator thing and it's got an R directory. So what goes in the R directory? What goes in the R directory are the functions. So all the things that I have in this folder, except for that readme there, go into isolator/R.
So the quarantine testing was just a name of a folder that I gave myself. You can create a package in the same directory that you're working in, but I like to give packages fun names, so I came up with isolator. So I thought I'll just let it make a new package, and that's actually a really good point. So when I ran the create_package command, I just let it put the folder in my current working directory, but you can actually tell it where to put it if you use the path argument of create_package.
David says that when creating his first package, he wasted a lot of time editing the NAMESPACE file and then noticed it wasn't necessary when running devtools::document(). Yeah, see, that would be me too.
Yeah, and so that's actually one of the really nice things about the sort of R package workflow that I'll show you: it can actually help you catch mistakes and it'll point out some issues for you. But if your code isn't wrapped into a package, there's no real way to do that. And so that's one of the nice things, just sort of catching little things like, oh, you've written the wrong thing in the NAMESPACE file or you forgot to annotate what input you need here. So it's really helpful.
So one of the first things that I do is connect it to GitHub. Oh, you've got to use use_git first and then use_github, although it already has a .gitignore file in here. I always just do it anyway. There we go. It just initializes with usethis. Yeah.
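The Git and GitHub step above can be sketched as two usethis calls. This is a hedged sketch, assuming it is run from the package root in an interactive session, with Git installed and GitHub credentials already configured. Both helpers ask questions along the way, which is why the calls are guarded with interactive().

```r
# Assumption: the working directory is the package root.
if (interactive()) {
  usethis::use_git()     # initialise a local Git repository and offer a first commit
  usethis::use_github()  # create a matching GitHub repository and push to it
}
```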
So in an R package, all of that other stuff that doesn't have a specific place, so unlike the help files, which we'll generate later, and the code files, goes into the inst folder. And then within the inst folder, you can have whatever you want. I typically have a folder called extdata, or external data, and then one that's called something like examples or scripts. So we'll call this one examples.
So I kind of think of the inst folder as sort of my special folder where I can really do whatever I want. So basically the R package build kind of now somebody correct me if I'm wrong, but the R package build kind of ignores it. It's like, OK, this is other stuff and doesn't put too many regulations on what that stuff is.
So it is not going to be accessible through the package as data that's built in? No, not unless you assign it as such. But if this package is wrapped in a GitHub repo, then it's available via the GitHub repo.
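As a small sketch of the inst layout described above: according to Writing R Extensions, the contents of inst are copied into the top level of the installed package, so a file stored under inst/extdata can be located with system.file() once the package is installed. The folder here is a throwaway stand-in for the package root, and cases.csv is a hypothetical file name with made-up values used only for illustration.

```r
# Assumption: a temporary folder stands in for the package root.
pkg_root <- file.path(tempfile("pkg-demo-"), "isolator")
dir.create(file.path(pkg_root, "inst", "extdata"), recursive = TRUE)
dir.create(file.path(pkg_root, "inst", "examples"), recursive = TRUE)

# Hypothetical raw data file stored alongside the package code.
cases <- data.frame(day = 1:3, positive_tests = c(2L, 5L, 3L))
extdata_file <- file.path(pkg_root, "inst", "extdata", "cases.csv")
write.csv(cases, extdata_file, row.names = FALSE)

list.files(file.path(pkg_root, "inst"), recursive = TRUE)

# After installation the same file is found without the "inst/" prefix.
# system.file() returns "" when the package or the file is not available.
installed_path <- system.file("extdata", "cases.csv", package = "isolator")
if (nzchar(installed_path)) {
  head(read.csv(installed_path))
}
```

Data meant to be loaded as a built-in dataset would instead go in the data folder, which is a separate mechanism from inst.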
Adding a readme and license
So one of those best practices that I just mentioned, that I've gradually learned, is having things like a readme to tell anyone who's looking at your package what it does and maybe how to use it or how it's structured, and any sort of data sources. And so you can also use usethis to add some of these features. So you can add a readme file. And I like to use the Rmd. And so this creates a readme file.
And then the other thing that's also really good to have, particularly if you want to make your code public, is a license file. And that tells people how they can and cannot use your code. And there you see here we've now got a license. And I've just picked MIT, which is pretty standard and gives people a lot of freedom to use your code. And then now this line has been added to our DESCRIPTION file that says this package is covered by the MIT license. So it's just as easy as that. There are different licenses. You can pick which one, and your organization may have a specific one that they want you to use.
So I guess one thing that I think is really nice is the readme. So obviously there's nothing in here right now. But where is my hold on.
All right. I'm just filling out the readme a little bit. So because basically I have two different files here. One is an MD and one is an RMD. And an RMD is an R Markdown file. MD is a Markdown file. And an R Markdown file just looks a bit nicer. Or at least you can look at it a bit easier.
And so now we have some very basic stuff. Almost nothing on our readme, but we can actually build it. Let's see if this works. But notice what it's doing. It's installing your package in a temporary library and building out the readme file in the isolator folder.
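The README and license steps above can be sketched with three calls. This is a minimal sketch, assuming an interactive session in the package root with usethis and devtools installed. "Your Name" is a placeholder for the copyright holder.

```r
# Assumption: the working directory is the package root.
if (interactive()) {
  usethis::use_readme_rmd()              # creates README.Rmd
  usethis::use_mit_license("Your Name")  # adds the license files and the License field
  devtools::build_readme()               # renders README.Rmd to README.md
}
```

As noted on the call, build_readme() installs the package into a temporary library before rendering, so code chunks in the README can use the package's own functions.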
So to give you an example of a more finished version, I wrote my first CRAN package last year. And so my readme has things like what the package does. It's called mitey. It's based on a scabies project. It's so fun. How to install it. And some features about how long it should take, the main functions. So if people are just trying to get oriented, which functions they're probably going to use. It's not a very big package. And then how you use it. And so all this is available when someone is just looking at this package. It's not hidden anywhere. It's right front and center.
And so I found that to be really helpful for myself, but also for anyone else who I might want to collaborate with. I can show them, hey, this is what I'm working on, or anyone who may want to actually use the package.
Advantages of the package structure
There was one from James that was, what are the advantages to packaging a project compared to a more generalized templated folder structure like R Markdown, data, and R? Or, you know, some people will have, instead of R, it'll be called source. And then you might have a data folder with a data raw and data final, or just data. So what are the advantages of doing it in the R package style?
Yeah, so that's a great question. So I would say the biggest advantage is how you share it and how other people can access it. So if you have that file structure that you mentioned where you've just got everything nice and organized and neat, if somebody wants to use your code, they still have to download all of those files. But if you have it wrapped in an R package and you have it somewhere that they can access it, whether that's a shared network drive at work or GitHub or GitLab, all they have to do to use the code or to reproduce what you've done is a single line of code, which is install_github or install.packages. And so it just makes it so much easier to use.
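The single line of code mentioned above might look like the following. This is a sketch only: "your-username/isolator" is a hypothetical repository name, and the call needs network access, so it is guarded with interactive().

```r
if (interactive()) {
  # Hypothetical repository: replace "your-username/isolator" with the real one.
  devtools::install_github("your-username/isolator")
}
```

For a package that is on CRAN, the equivalent is install.packages() with the package name.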
The other really added benefit is that when it's in a package structure, it sort of forces you to do some documentation things, which I'll go through next, like documenting, you know, functions, inputs, outputs, expected behavior and things like that, that you don't have to do if you're just working sort of by yourself in your own project. But that then make your package way, way easier to use.
And they're automatically generated when you build the package, which doesn't happen if you just use a nice organized file structure. So if you want, we can go through a quick example of sort of the benefits of some of that documentation, because I think that's really one of the main benefits of using this approach is that there are different ways to document what you're doing, and they help you because then you have to actually explain what you've done. So if you come back to it, you know what you did, you know what decisions you made, maybe during an analysis.
But also anyone else. I mean, what if you have to hand over a project to someone? Then you're like, oh, I've got to write handover notes. But actually, you can build all that into an R package and say, here you go. It explains what to do.
And Tan made a great point, which is that tooling for R packages is very robust, right? So you're not on your own. You have roxygen2 for commenting. You have testthat. You have usethis especially, which is already building you whole structures and skeletons. So I agree. I think that's a really big benefit.
So that's why I say actually using this structure may seem like a bit of work at first, but it really has some benefits if you just get over that hump. And the likes of Jenny and Hadley have made that hump so much smaller in recent years.
Writing roxygen2 documentation
We have about 22 minutes left. Let's see how much further we can get in our example project here. I told you I could fill 50 minutes, Libby. I told you. OK. So I'm going to try to just write some quick roxygen2 preamble. So the first thing I've done is I've just added this apostrophe after the hash rather than just a typical comment. So this is recognized by roxygen2. Is it Roxygen? R-Oxygen? How do you all say it? I say Roxygen. I also say Roxygen.
So anyway, we'll go through this. So hopefully we can generate a man file. So basically with the roxygen2 preamble, you write it before each of your functions. And it should have sort of a title-ish or a very brief statement of what the function does and then more details on the function. This is a really simple one that just abbreviates Australian state names. But then you can specify what goes into the function. So in this case, the parameter is any sort of input argument, and that is state names, if I can spell.
I actually got this tip from Jenny Bryan because I watched a talk of hers about making packages, and she put the type of argument and then what it does. And I thought, oh, actually, that's really helpful because just saying maybe number of, I don't know, scabies mites on your skin doesn't necessarily tell you if it needs to be a character string or a number or an integer or something.
So then we also want to put what we expect the function to return, which this function was written poorly and there is no return. So I guess it returns. There's an implicit return here, not an explicit. Yes. I got into it one time with somebody over implicit versus explicit returns. I would love to see people in the chat. Tell me, do you type return at the end of your functions or do you leave it implicit? I am an explicit girly. Every single person. Yeah. Every single thing I do. I want it to be like clearly typed, clearly telling you what you're going to get.
And then you export it, which basically means you want the function to be sort of visible. So when you type out isolator and then your double colon, it's like not hidden and accessible as a function in your package.
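A roxygen2 preamble along the lines described above might look like this. It is a sketch under assumptions: the function name abbreviate_states and its body are hypothetical, and case_when is only one plausible choice of dplyr function, since the transcript does not show the original code. The argument type comes first in each description, following the tip mentioned on the call, and the return is explicit.

```r
#' Abbreviate Australian state names
#'
#' Converts full Australian state and territory names to their usual
#' abbreviations.
#'
#' @param state_names character vector. Full state or territory names.
#' @return character vector. The abbreviation for each name, or `NA` when a
#'   name is not recognised.
#' @importFrom dplyr case_when
#' @export
abbreviate_states <- function(state_names) {
  abbreviations <- dplyr::case_when(
    state_names == "New South Wales" ~ "NSW",
    state_names == "Victoria" ~ "VIC",
    state_names == "Queensland" ~ "QLD",
    state_names == "South Australia" ~ "SA",
    state_names == "Western Australia" ~ "WA",
    state_names == "Tasmania" ~ "TAS",
    state_names == "Northern Territory" ~ "NT",
    state_names == "Australian Capital Territory" ~ "ACT",
    TRUE ~ NA_character_
  )
  return(abbreviations)
}

abbreviate_states(c("Victoria", "Queensland"))
```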
Whoever said the man folder would appear when we did document: Megan, absolutely right. Thank you, Megan. We have a man folder. And I don't know why it has figures, but sure, that's great. Oh, there must be figures in my readme file. And then we have this automatically generated .Rd file, which has all the information that we put in. And because we hit export, let's go look at our NAMESPACE. It has export and import. So that's awesome.
So the other great thing about an R package that you don't get with a different file structure is that if you're using your own package or someone else's and you think, what does this function do? Or what type of argument do I need? Booyah, look at that. Automatically generated help file that you can access by just using a question mark and the name of the function, rather than having to go search through some sort of file structure to find a readme or go through comments in the code. It's right there. Amazing.
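The document step described above can be reproduced end to end in a throwaway package. This is a hedged sketch, assuming usethis, devtools and roxygen2 are installed. The function double_it is a hypothetical stand-in for a real project function.

```r
# Assumption: a temporary package stands in for the real project.
parent_dir <- tempfile("pkg-demo-")
dir.create(parent_dir)
pkg_path <- file.path(parent_dir, "isolator")
usethis::create_package(pkg_path, open = FALSE)

# A hypothetical documented function, written to R/ as it would be by hand.
writeLines(
  c(
    "#' Double a number",
    "#'",
    "#' @param x numeric. Value to double.",
    "#' @return numeric. Twice the input.",
    "#' @export",
    "double_it <- function(x) {",
    "  return(x * 2)",
    "}"
  ),
  file.path(pkg_path, "R", "double_it.R")
)

# Regenerate the man/*.Rd help files and NAMESPACE from the roxygen2 comments.
devtools::document(pkg_path)

list.files(file.path(pkg_path, "man"))       # the generated help file(s)
readLines(file.path(pkg_path, "NAMESPACE"))  # the generated export directives
```

After devtools::load_all(pkg_path), typing ?double_it in the console opens the generated help page.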
And the other thing that we now need to deal with is that because I've told the package gods that I need to import from dplyr, I need to make sure that in our DESCRIPTION file we have the dependencies listed. So there should be a section for the packages that your package depends on, or if you have a vignette, what packages that needs. And usethis makes this really easy. So you just say use_package("dplyr", type = "Imports"). So that means that when your package is installed, it will also install dplyr, because in order for our functions, or this particular function, to run, we need dplyr. And now look at that. Our DESCRIPTION file has Imports: dplyr. That's amazing.
Yeah. So it's really, really easy. And so roxygen2 rocks. But the other piece about it that I talk about in my talk, which I'm sure is a more eloquent version of this lab, is that you can actually write that preamble as you're live coding. You know, you're writing a function to do stuff, and you can just quickly say, okay, what parameters, what kind, what do I expect? Because I don't know about you, but with me, I'll write a function and then I'll change my mind and I'll say, oh, I have this as a matrix. Oh, but I think a data frame would be better, or I had a character, but this is getting too confusing. It's a string. I need a number. I'll just convert it. So I change my mind and I can kind of keep track of all of that by just updating the preamble.
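The dependency step above can be sketched in a throwaway package. This is a minimal sketch, assuming usethis and dplyr are both installed, since use_package() expects the dependency to be available. The temporary package is an assumption made so the example does not touch a real project.

```r
# Assumption: a temporary package stands in for the real project.
parent_dir <- tempfile("pkg-demo-")
dir.create(parent_dir)
pkg_path <- file.path(parent_dir, "isolator")
usethis::create_package(pkg_path, open = FALSE)

# Record dplyr under Imports in DESCRIPTION.
usethis::with_project(pkg_path, {
  usethis::use_package("dplyr", type = "Imports")
})

# Inspect the Imports field that was written.
imports_field <- read.dcf(file.path(pkg_path, "DESCRIPTION"), fields = "Imports")
imports_field
```

In day-to-day use, from the package root, the call is just usethis::use_package("dplyr", type = "Imports").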
And then, you know, if you use something like GitHub or version control, you can obviously keep track of the changes that you've made, but it's always there. And so you can just do that as you're writing. And so it's something that I've started to do. It's like I'm making a function. And rather than writing all of my code and then going back and documenting it, I'm starting to do it as I code.
Building a pkgdown website and vignettes
So one, Libby, this one's for you, is we can make a pkgdown website. Yeah. That's my favorite part. So if you want to, you can make a pkgdown website. And I hope this works because, you know, I haven't connected it to anything. Fingers crossed.
So now we've done use_pkgdown. It's added some stuff to our directory. So now it's got this _pkgdown.yml, which is great. And it's just done some setup for us. And then let's all cross our fingers that this works. In just two lines of code, I think we should be able to build a site.
Yeah, so this is nonsensical because this is basically just the basic readme with the title, my name, our license, but it was that simple. So now you can have this. And this shows the functions. We've only exported one function, so, you know, it's not a very fun bunch of functions in there. That reference then becomes, hey, here is a web page for each of your functions and all the information about it. And for that to be automatic is amazing.
And one of my favorite aspects of R packages is the vignette. So I was talking about how in the readme you can combine code and results and text and everything. You can also do that here. Well, it's the same file format. So with a vignette, you can have an R Markdown file, which has the analysis that you've done. And actually, I can show an example.
So this is the package that I made. So this is the readme that I showed you earlier, which is always the homepage of the pkgdown site. But then I have some vignettes, and one of them is like a quick start guide. So, you know, welcome to my package. Let's load some required stuff. This is what it's going to do. Here's some useful information. And then let's start doing stuff with data. Let's simulate some data. Let's look at it. Let's visualize it. And then let's use the package to do stuff. And that's really awesome.
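The two lines of code mentioned above are sketched below. This assumes an interactive session in the package root with usethis and pkgdown installed. Building the site renders the README, the function reference and any vignettes into a local site folder.

```r
# Assumption: the working directory is the package root.
if (interactive()) {
  usethis::use_pkgdown()  # adds the pkgdown configuration file
  pkgdown::build_site()   # builds the site locally and previews it
}
```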
So you could think of this as your sort of lab notebook. You could use this as, I'm doing my analysis, right? And you can just kind of work through it. And to set up the vignettes is, again, very, very simple. It is just use_vignette. And you can use different engines inside of use_vignette. So if you wanted to use Quarto as your engine, you could, or R Markdown as your engine. I am not an expert in all of that, but you can go check out the vignette engines and Google away and you will find lots of info.
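Setting up a vignette as described above is a single usethis call. In this sketch, "quick-start" and its title are hypothetical names, and the call assumes an interactive session in the package root.

```r
# Assumption: the working directory is the package root.
if (interactive()) {
  # Creates vignettes/quick-start.Rmd and records the vignette dependencies.
  usethis::use_vignette("quick-start", title = "Quick start guide")
}
```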
Loading your package with devtools
So I guess the other great thing about this is that now that we have this package, while we're testing it, you can actually install everything in one line. So if it's on GitHub or something, this changes. But while you're working, let's say you go away, you close your computer, you come back and you want to start using this. Rather than having to source all those different files or write a file that sources all of them, you just go into your package directory and type devtools::load_all() and it loads everything. So now you can just use it as if you were using any other package, even though this is very much still being built. And I found that saves a lot of time when just working on something.
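The loading step above, sketched under the assumption that the session is interactive and the working directory is the package root:

```r
# Assumption: the working directory is the package root.
if (interactive()) {
  # Sources everything in R/ and makes the functions available,
  # roughly as if the package had been installed and attached.
  devtools::load_all()
}
```

Re-running load_all() after editing a function picks up the change without reinstalling the package.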
I love it. Yeah. I hope everybody learned a lot. We're at the top of the hour. So we will let you go. We know that you have meetings to go to Kylie. Thank you so much for joining us today and walking us through and being vulnerable and live coding. I hope that you learned a ton. Please hop in the chat and thank Kylie for being so, so brave to live code in front of the entire Internet.
If you have extra questions, pop them in the chat. We will come through and attempt to answer them. Thank you for spending time with us. See you on Thursday at the Hangout if you would like to come to that, and then see you next week, where I think we have Edgar going through the mall package and ellmer. Thank you so much. We'll see you in a couple of days or a week. Bye, everybody. Bye, everybody. Thanks for joining.

