Using R package structure for data science projects | Kylie Ainslie

Transcript

This transcript was generated automatically and may contain errors.

I would love to introduce our esteemed guest, our lab manager for today, Kylie Ainslie. Kylie, would you like to say hello? Sure. Hi, everybody. Thanks for coming.

All right, Kylie, tell us a little bit of background about what you do in data science and what we are going to be talking about today, because it is sort of a nice expansion of the conf talk that you gave last year at posit::conf. Yep. So I don't necessarily consider myself a data scientist. I am an infectious disease modeler. So I basically use math, stats and R to study how diseases spread and how interventions can mitigate that spread. And I'm currently in the Netherlands, which is why it might seem a bit dark behind me. And I've just moved to Australia to work for the University of Melbourne.

But I do a lot of coding stuff. And so I was really actually excited that Libby asked me to be here today because I would like to talk about sort of a workflow that I kind of happened upon accidentally. So I was working in a really messy way and I just got really frustrated because all of my coding projects had different organization. And so today I sort of want to talk about the way that I've found to make that messiness go away and keep everything consistent and more efficient. So I'll leave it at that for now and then we can dive in.

Awesome. Yeah, I think that this is something that I feel might be common, where you give a conf talk and there's a bunch of stuff you wish you could have done live that you can't because it's a conf talk. You're like, "I have 15 minutes and I can't live code. It's going to be all messed up if something goes wrong." This is a place where things can go wrong and it's totally fine, because we are just going to be sharing screens willy-nilly and being messy and transparent and open.

So with that in mind, everybody, this is the place to stop and ask questions. This is not a wait-till-the-end-and-raise-your-hand type of deal. This is a place where we put questions in Discord and I can interrupt Kylie and we can ask a question. So if you have clarifying questions, this is the place to ask them and be completely open, and that's totally fine.

The other really added benefit is that when it's in a package structure, it sort of forces you to do some documentation things, which I'll go through next, like documenting, you know, functions, inputs, outputs, expected behavior and things like that, that you don't have to do if you're just working sort of by yourself in your own project. But that then make your package way, way easier to use.

And they're automatically generated when you build the package, which doesn't happen if you just use a nice organized file structure. So if you want, we can go through a quick example of sort of the benefits of some of that documentation, because I think that's really one of the main benefits of using this approach is that there are different ways to document what you're doing, and they help you because then you have to actually explain what you've done. So if you come back to it, you know what you did, you know what decisions you made, maybe during an analysis.

But also anyone else. I mean, what if you have to hand over a project to someone? Then you're like, oh, I've got to write handover notes. But actually, you can build all that into an R package and say, here you go. It explains what to do.

And Tan made a great point, which is that tooling for R packages is very robust, right? So you're not on your own. You have roxygen2 for commenting. You have testthat. You have usethis especially, which is already building you whole structures and skeletons. So I agree. I think that's a really big benefit.

So that's why I say actually using this structure may seem like a bit of work at first, but it really has some benefits if you just get over that hump. And the likes of Jenny and Hadley have made that hump so much smaller in recent years.

Writing roxygen2 documentation

We have about 22 minutes left. Let's see how much further we can get in our example project here. I told you I could fill 50 minutes, Libby. I told you. OK. So I'm going to try to just write some quick roxygen2 preamble. So the first thing I've done is I've just added this apostrophe after the hash (#') rather than just a typical comment. So this is recognized by roxygen2. Is it Roxygen? R-Oxygen? How do you all say it? I say Roxygen. I also say Roxygen.

So anyway, we'll go through this. So hopefully we can generate a man file. So basically with roxygen2 preamble, you write it before any of your functions. And it should have sort of a title-ish or a very brief statement of what the function does and then more details on the function. This is a really simple one that just abbreviates Australian state names. But then you can specify what goes into the function. So in this case, a parameter is any sort of input argument, and here that is state names, if I can spell.

I actually got this tip from Jenny Bryan because I watched a talk of hers about making packages, and she put the type of argument and then what it does. And I thought, oh, actually, that's really helpful because just saying maybe number of, I don't know, scabies mites on your skin doesn't necessarily tell you if it needs to be a character string or a number or an integer or something.

So then we also want to put what we expect the function to return, which this function was written poorly and there is no return. So I guess it returns. There's an implicit return here, not an explicit. Yes. I got into it one time with somebody over implicit versus explicit returns. I would love to see people in the chat. Tell me, do you type return at the end of your functions or do you leave it implicit? I am an explicit girly. Every single person. Yeah. Every single thing I do. I want it to be like clearly typed, clearly telling you what you're going to get.
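To make the implicit versus explicit distinction concrete, here is a minimal sketch in R. The function names add_implicit and add_explicit are made up for illustration and do not come from the session. An R function returns the value of the last expression it evaluates, so both versions give the same result; the explicit form just states it.

```r
# Implicit return: the value of the last evaluated expression is returned.
add_implicit <- function(x, y) {
  x + y
}

# Explicit return: the same result, but the return value is spelled out.
add_explicit <- function(x, y) {
  return(x + y)
}

# Both styles produce the same value.
same_result <- identical(add_implicit(1, 2), add_explicit(1, 2))
print(same_result)
```

Which style to use is a matter of team convention; the behaviour is the same either way.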

And then you export it, which basically means you want the function to be sort of visible. So when you type out isolator and then your double colon, it's like not hidden and accessible as a function in your package.
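Put together, a roxygen2 preamble along the lines described might look like the sketch below. This is not the code from the session: the function name abbreviate_states, the argument name state_names, and the base R lookup inside the body are all assumptions made for illustration (the function shown live imported from dplyr). Only the tags themselves (@param, @return, @export) are standard roxygen2.

```r
#' Abbreviate Australian state names
#'
#' Converts full state and territory names to their usual abbreviations.
#'
#' @param state_names character vector of full state or territory names,
#'   for example "Victoria".
#' @return character vector of abbreviations, the same length as
#'   `state_names`. Names that are not recognised come back as NA.
#' @export
abbreviate_states <- function(state_names) {
  # Hypothetical lookup table: full name -> abbreviation
  lookup <- c(
    "New South Wales" = "NSW",
    "Victoria" = "VIC",
    "Queensland" = "QLD",
    "South Australia" = "SA",
    "Western Australia" = "WA",
    "Tasmania" = "TAS",
    "Northern Territory" = "NT",
    "Australian Capital Territory" = "ACT"
  )
  # Index by name, then drop the names so only abbreviations are returned
  return(unname(lookup[state_names]))
}

abbreviate_states(c("Victoria", "Tasmania"))
```

Note how the @param line gives the type first (character vector) and then the meaning, following the tip mentioned above.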

I'm going to try to basically sort of implement this preamble that I've just written by running devtools::document().

Who said the man folder would appear when we did document? Megan. Absolutely right. Thank you, Megan. We have a man folder. And I don't know why it has figures, but sure, that's great. Oh, there must be figures in my readme file. And then we have this automatically generated .Rd file, which has all the information that we put in. And because we hit export, let's go look at our NAMESPACE file. It had export and import. So that's awesome.
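The effect of documenting can be reproduced on a throwaway package. The sketch below is an illustration under stated assumptions, not the project from the session: the package name demopkg, its DESCRIPTION contents, and the function abbreviate_states are all hypothetical, and the devtools call only runs if devtools is installed.

```r
# Build a minimal, hypothetical package skeleton in a temporary directory.
pkg_dir <- file.path(tempdir(), "demopkg")
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)

writeLines(c(
  "Package: demopkg",
  "Title: Throwaway Demo Package",
  "Version: 0.0.0.9000",
  "Description: Used only to illustrate what document() generates.",
  "License: GPL-3",
  "Encoding: UTF-8"
), file.path(pkg_dir, "DESCRIPTION"))

# Same header line that usethis puts in a fresh NAMESPACE, so roxygen2
# treats the file as its own and is willing to overwrite it.
writeLines("# Generated by roxygen2: do not edit by hand",
           file.path(pkg_dir, "NAMESPACE"))

# One documented, exported function (hypothetical).
writeLines(c(
  "#' Abbreviate Australian state names",
  "#'",
  "#' @param state_names character vector of full state names",
  "#' @return character vector of abbreviations",
  "#' @export",
  "abbreviate_states <- function(state_names) {",
  "  lookup <- c(\"Victoria\" = \"VIC\", \"Tasmania\" = \"TAS\")",
  "  return(unname(lookup[state_names]))",
  "}"
), file.path(pkg_dir, "R", "abbreviate_states.R"))

# Generate man/*.Rd and update NAMESPACE, only if devtools is available.
docs_built <- FALSE
if (requireNamespace("devtools", quietly = TRUE)) {
  devtools::document(pkg_dir)
  docs_built <- TRUE
  print(list.files(pkg_dir, recursive = TRUE))
  print(readLines(file.path(pkg_dir, "NAMESPACE")))
}
```

If the call runs, the file listing should include a man folder containing an .Rd file for the function, and NAMESPACE should contain an export line for it.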

So the other great thing about an R package that you don't get with a different file structure is that if you're using your own package or someone else's and you think, what does this function do? Or what type of argument do I need? Booyah, look at that. Automatically generated help file that you can access by just using a question mark and the name of the function, rather than having to go search through some sort of file structure to find a readme or go through comments in the code. It's right there. Amazing.

And the other thing that now we need to deal with is that because I've told the package gods that I need to import from dplyr, now I need to make sure that in our DESCRIPTION file we have the dependencies listed. So there should be a section for the packages that your package depends on, or if you have a vignette, what packages that needs. And usethis makes this really easy. So you just call use_package with "dplyr" and type = "Imports". So that means that when your package is installed, it will also install dplyr, because in order for our functions, or this particular function, to run, we need dplyr. And we do that. And now look at that. Our DESCRIPTION file has Imports: dplyr. That's amazing.
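Written out, the call described here would be roughly the following. It is a sketch that assumes usethis and dplyr are installed and that the working directory is the root of the package project; it edits that project's DESCRIPTION file, so it is a configuration step rather than something to run in a scratch session.

```r
# Run from the root of the package project.
# Assumes usethis and dplyr are both installed.
usethis::use_package("dplyr", type = "Imports")

# Afterwards, DESCRIPTION should contain a field along these lines:
#
# Imports:
#     dplyr
```

Packages needed only for vignettes or tests would normally go under type = "Suggests" instead, so they are not forced on everyone who installs the package.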