Resources

Sharla Gelfand | Don’t repeat yourself, talk to yourself! Reporting with R | RStudio (2020)

If you’re responsible for analyses that need updating or repeating on a semi-regular basis, you might find yourself doing the same work over and over again. The principle of "don’t repeat yourself" from software engineering motivates us to use functions and packages, the core tools for reducing repetition in the R universe. For analyses, it can be difficult to know how to apply this principle and move beyond "copying and pasting scripts and changing the data file and the object names and updating the dates and results in R Markdown", especially when there’s some element of human intervention required, whether it be for validating assumptions or cleaning artisanal data. This talk will focus on those next steps, showcasing opportunities to stop repeating yourself and instead anticipate the needs of, and communicate effectively with, your future self (or the next person with your job!) using project-oriented workflows, clever interactivity, templated analyses, functions, and yes, your own packages.


Transcript

This transcript was generated automatically and may contain errors.

Hi everyone, thanks for coming. I am so excited to be here. RStudioConf is my favorite thing in the world, pretty much, and it took me quite a trajectory to get here.

So, three years ago, I came to RStudioConf for the first time, and I lost my wallet and Joe Chang had to lend me money to get home. And now, I'm here talking, so if anyone has lost their wallet, cancel your credit cards and please submit a talk abstract in 2023.

So I'm here to talk to you about repeated reporting in the R universe. If you're interested in the slides, they are available at bit.ly slash rstudioconf, and you may know me on Twitter as Sharla Gelfand, and if not, please follow me.

The old workflow: SPSS, Excel, and Word

So the inspiration for this talk came from my previous job, where I was a statistician working with data on Ontario's nursing workforce, and at this job, we had a set of reports that we had to write every year, released to the government, and these reports used to be written using a combination of SPSS, Excel, and Word, with a lot of decisions kept track of in email and paper notes.

The process was pretty manual, and it might sound familiar to some of you. First, we would somehow acquire the data. In a lot of cases, this involved emailing someone from the IT team, waiting for them to have time to work on the project, and then they would query the data and send it to us. Then we'd use SPSS to clean and explore the data and make some summary tables. We'd move over to Excel, where the nicely formatted tables live, shift last year's data over, and paste the new data in. Then we'd go to Word and replace the tables, change some numbers and text. Then we'd discover that there's a mistake in the data in the report, so we'd go back and do everything all over again.

So these reports, which we had to write every year, took all year. They weren't reproducible, and things were scattered, hard to keep track of, and the work was boring, manual, and repetitive, and not easily repeatable. And instead of being able to do new things on the job, we were stuck doing the same work over and over again.

Transitioning to R

So this is a bit of a spoiler, but I'm sure you knew where this was going. First, I have to say that I have not been completely honest: I didn't do any of the previous work. My coworker, Charlene, who is here, and previous statisticians did this, but I started the job knowing that I had no interest in using SPSS.

So I started the job, I took about a week to understand how things were going, how we made these reports, and I was having lunch with my coworker, and I'm like, hey, have you heard of R? It took a week.

And she had, thank God. So Charlene had used R in grad school and was more than happy to have opportunities to use it more. She had wanted to improve our processes, so she quickly learned the tidyverse, fell right into the pit of success, and we transitioned all of our projects over to R.

So we got the queries from our database team, we learned about our database structure, and we used packages like dplyr and dbplyr to query them. Then we used tidyr, janitor, and other tidyverse packages to clean our data, and we used R Markdown, kableExtra, and ggplot2 to write our reports. And everything fit so well, almost as if it was designed to work together, and pretty soon our workflow started looking a lot like my laptop.

We were so proud of what we had accomplished. Our code was reproducible, and we didn't have to redo everything if there was a mistake. Our code was all in one place and easier to keep track of, and it took a little bit of time to transition things, but we knew it would be worth it because we had time to do new things on the job.

And next year arrives, and considering that there's 15 minutes left in this talk, I'm guessing you can tell that things did not go as expected. But, you know, I was feeling good, I go to work on these projects next year, and as an obsessed R user, I was like, there's nothing that can go wrong. We're using R, it's foolproof, we're all good.

And then I would come back to projects, just ready to knock them off the to-do list, and see something that looks like this. So I come back to my code, and I'm like, okay, so we've got check demographic assumptions done, check demographics old code, then we have, you know, demographics report, copy of demographics report, we've got try cleaning demographics again, untitled.R.

And so I'm like, yeah, I'm a pretty organized person, but I come back to this, and I have just no idea where to start. Things are not named clearly, it is a jumbled mess, and I don't know what files are what, what order to do them in, and what's usable or what's old.

So I'm like, okay, you know, this one says done, so that's probably a pretty good candidate to open it. So I open the code, and so I take the data, and I'm like, okay, so I change the value of a field for one ID, why? I don't know. And the value of a different field for a different ID, again, I have no idea why. And then I take the data, I filter it, and I do some aggregation. And at this point, I'm like, cool, cool, cool, cool. So I have just no idea what is going on here.

Because the code that I was writing was, like, so specific for last year's project, and I wasn't thinking about how to reuse this code. So it was salvageable, you know, I have to go through, I remember what's what, I copy files over, I change filters, but I was so frustrated, because we had stopped using SPSS, Excel, and Word, because things were not reproducible. They were difficult to keep track of, manual and repetitive, not easily repeatable, and time-consuming. But the way that I was leaving my R code, things were kind of reproducible, difficult to keep track of, manual and repetitive, not easily repeatable, and time-consuming.

Two guiding principles

So in order to avoid having this happen again, I developed a set of approaches guided by two principles, don't repeat yourself and talk to yourself. And I hope that it's useful for those of you who use R for this kind of repetitive reporting.

So don't repeat yourself is a concept from software engineering aimed at reducing repetition. It says that you should abstract away repetitive logic and automate repetitive processes. So why do we want to do this? Well, copy and pasting your code is really error-prone, and you have to make an update everywhere that the code appears. And instead of focusing on what changes, we're busy focusing on what stays the same. You've probably all heard the mantra that if you copy and paste the same code three times, you should turn it into a function. So functions and packages are really the core of don't repeat yourself in the R universe.

But repetition is also present in processes. Having to redo things, remember where things are, remember what you did is pretty pointless, and you might not be able to abstract away that process, but you can definitely try to automate things.

Talk to yourself is a concept from me that I thought sounded good with don't repeat yourself. And it basically is just aimed at making work easier to use for your future self. So it's way easier to document why you're doing something now than to remember why you did it a year ago. The goal here is to focus on your future self instead of your present self, and just because we write easy-to-read code because of the tidyverse doesn't mean that it actually explains the motivations of why we're doing something.

Cleaning up: file organization and setup files

So with these principles, the first thing I did is I cleaned up after myself. And I needed to do this because my old code was just not reusable, and there was so much repetition of process having to relearn where things were, what they did, and what to do. So you've seen this before, and I think we can agree that leaving files like this is just rude.

So the first thing I did to reorganize that is I separated out data, reference code, and old code and reorganized things and named them clearly to communicate what they're actually doing. I also took advantage of the default ordering, which does things by name, and gave numbers to indicate what order things happen in. And even if this is all you do, next year when you come back, you can just copy the old code over and go through things in order.

But in case things were not obvious enough from the order, I also started creating this setup file, which is really useful for starting a project, like having a readme or an introductory vignette when using a new package. This is a good place to explain the project, where to start, like if you have to email someone and what steps to take. And the goal here is to explain how to do something and why to do something instead of just leaving an artifact of what you already did.

So this one's pretty basic, but it's, hello, welcome to the annual demographics report. Step one, query the data, open the file, and run the code. Step two, clean the data, open the file, and run the code.

Writing functions and making them chatty

And once I left my files usable as a whole, I had to start thinking about them individually and how I could make going through things less manually repetitive and what to abstract away. So functions are a great way to do this. So I had some code to clean and query data that was a good candidate for this.

So if we look at this code, we have a long and complicated SQL query, or a dplyr version of it. Then you filter the data where the year is 2019. You have a bunch of code to rename and select columns, thirty-odd lines to clean up and summarize the data, and some more code to save the data. I mean, talking through this is boring, and I like to talk, so you definitely don't want to have to run through this code every time. And the thing with this code is that everything stays the same, and the only thing that changes is the year.

So it's a really good set of code to turn into a function where you take what stays the same, make it into the body of the function, and then you take what changes and make it into the argument of the function. Then you don't have to run through this code, copy it, or update logic in multiple places if it changes.
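A minimal sketch of that refactor, with a toy stand-in for the queried data (the function name, table, and columns here are illustrative, not from the talk):

```r
library(dplyr)

# Toy stand-in for the queried data (columns are illustrative)
demographics_raw <- tibble(
  member_id = 1:4,
  year = c(2018, 2019, 2019, 2019),
  region_code = c("A", "A", "B", "B")
)

# Everything that stays the same becomes the body of the function;
# the one thing that changes -- the year -- becomes the argument
query_demographics <- function(analysis_year) {
  demographics_raw %>%
    filter(year == analysis_year) %>%
    rename(id = member_id, region = region_code) %>%
    count(region)
}

# One call per year instead of a copied-and-edited script
query_demographics(2019)
```

If next year's logic needs to change, it changes in exactly one place.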

I also put this function into the setup file so that you can just run the function and have it query the data rather than saying go and open 01 queried data.

But abstraction was not enough for me, so I also decided to make my functions chatty as an opportunity to talk to myself. The usethis package is amazing for many, many things, but I like this lesser-known part where you can print UI messages to your console. So if you call ui_info() and you say hi, it'll print a little i for info and "hi" to the console, and you can use these helpers in your own functions to print different kinds of output to the console.

You can also interpolate code into these messages. So if I set the analysis year to 2019 and use ui_todo(), I can write "querying demographics data for {analysis_year}", and it'll print "querying demographics data for 2019". So at the end, I had a function that, when I ran it, actually helped me keep track of what was going on. You run it and it says it's querying demographics data for 2019, and then it tells you where the result is saved. Because if you're coming back to this project a year later, you don't need to track those things down. It'll just tell you.
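A sketch of such a chatty function (the messages and file path are reconstructions, not the talk's actual code). The usethis ui_* helpers use glue-style interpolation, so {analysis_year} is filled in when the message prints:

```r
library(usethis)

# A chatty wrapper: reports what it is doing and where results go,
# so next year's reader doesn't have to hunt for either
query_data <- function(analysis_year) {
  ui_todo("Querying demographics data for {analysis_year}")
  # ... run the actual query here ...
  ui_done("Demographics data saved to data/demographics-{analysis_year}.csv")
}

query_data(2019)
```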

Artisanal data and generalizable code

So I knew that not all of my code had logic that could abstract it away. Our projects had a lot of really specific data cleaning steps. There was a lot of back and forth and a lot of checking of assumptions and always some level of human interaction required.

But a big lesson that I learned is that artisanal data does not require artisanal code. So we had very special data, like locally sourced, handcrafted, new mistakes daily artisanal data.

But the mistake that I always made was that I was so focused on cleaning that set of data, and I wasn't thinking about how to clean a new batch of the data. And I only knew why I had done something last time and not what to look for this time. So instead of writing super specific code, I started writing generalizable code with super specific instructions, including things on data cleaning, who to ask, what to look for and what to do. And this actually gives you time to deal with cleaning your artisanal data because you're not wasting time remembering how to clean it.

So we've seen this before. But again, what my code used to look like is that I would take the data, change the value of a field for one ID and the value of a different field for a different ID with no idea why I had done it. But what I replaced it with is something that looks like this. So first I'm saying what to do. So an example is I want to check for any cases where the province is Ontario, but outside of Ontario is true, because that's obviously not right. And then why to do it? Because some of them may have province wrong, but some may have outside Ontario wrong. And then you say who to talk to. So email Carla.

And ask them to check paper files to confirm which is correct and then what to do next. So based on the response, update any manually until the database is updated. Because when I started this job, I found it so difficult to keep track of who to ask for different things. You know, like you have to email that person and if they're gone, you have to email someone else. So I decided to just codify it so I wasn't leaving myself wondering. And then to avoid repetition, I could write a function that actually automates producing the template, again, using the usethis package. So you take the template, it makes and saves a copy of it, and you can even pass data to it.
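A sketch of such a template-copying function, assuming it lives in your own package (the package name "demographicsreport" and template file name are hypothetical). usethis::use_template() looks the template up in the named package's inst/templates/ directory, fills in placeholders from data via whisker, and saves a copy into the active project:

```r
# Hypothetical wrapper around usethis::use_template(): makes a
# year-specific copy of the prescriptive cleaning checklist
check_demographics <- function(analysis_year) {
  usethis::use_template(
    template = "check-demographics.Rmd",  # from inst/templates/
    save_as = paste0("check-demographics-", analysis_year, ".Rmd"),
    data = list(year = analysis_year),    # filled into {{year}} slots
    open = TRUE,
    package = "demographicsreport"
  )
}
```

Calling check_demographics(2020) would then drop a fresh, pre-filled checklist into the project instead of you copying and editing last year's file.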

Putting it all together in a package

And finally, I put everything together into a package. A project like this is a great candidate for making a package, and it is not as scary as it seems. And once I started making packages for this, I started making packages for everything else in my life.

So why do we want to make a package? I mean, as I said, making a package is really an ultimate form of do not repeat yourself. Your functions are centralized in a specific place and it's really clear where to update them. Your templates also have a place to live and then you can use other functions to copy and pass data to them. There's robust documentation and you can share code with your coworkers in a better way than emailing it or having it on a network drive.

So when all was said and done, I had a package that looked something like this. So my functions had a home in the R directory, and then the templates live in a special place so you can copy and use them later. There's documentation in the form of a readme, function documentation in the man folder, and vignettes for detailed usage. And then to put it all together, the setup function pulls the setup template and makes a new file with the template, which contains other functions that can run through the analysis. And then when you run the function to query the data, it does so, and it does so chattily so that it's abstracting away the logic but making sure that you know where things are saved.

So I use don't repeat yourself and talk to yourself by cleaning up code with clear names and ordering. I made functions, and I made them chatty to communicate what they were doing. I made automated templates, and I made them prescriptive to help me know what to do year after year. And then I made a package to put it all together.

And using this form of repeated reporting in R, I had a project that was reproducible, easy to keep track of, it had minimal manual work, was easily repeatable, and it freed up time, which I spent making other R packages.

So my takeaway to you is that if you have a repetitive task that you find confusing and you're doing the same work over and over, abstract away or automate what you can and loudly communicate the rest. Or because it's catchier, don't repeat yourself, talk to yourself. Thank you.

Q&A

The first one is: how do you name your script if your boss asks you to double-check something and you end up with another script that doesn't fit into the 01, 02 workflow? I would say if it's not something that's going to stay the same year after year, keep it somewhere else, maybe a folder that has extra things. The nice thing with this workflow is that it gives you a fresh copy of the directory, which you can use, but you can always add things to that.

Another question was: as a person who's never made a package before, how do you make your package available to someone else within your organization without making it public? If your organization is small, you can have a free private GitHub repo with up to three collaborators, so that's a good option. If you have another form of version control at your work, or if you have a network drive, you can install a package from there, or even send them the source code and they can install it using that.

What are the advantages of usethis's ui_info() function over a simple print call? I like the little icons that come with usethis; they communicate useful information, right? There's a little red dot that might indicate you need to do something, or there's a check mark that says this is done and this is where it is. It's just a friendlier form of communicating, I think.

We have a few more minutes for questions, so can you estimate the time trade-off? That is, how much additional time did you spend and how much time did you save in the long run? I've done a few iterations of this, and it does take some time investment up front, for sure. As for the long run, I quit my job, so you'll have to ask my coworker Charlene how much time it saves. Going through these projects was back and forth that sometimes took many, many months of checking assumptions. Even if you have to do that, I think it's a time investment in the long run.

Someone else asked, a lot of your recommendations seem to be software engineering principles. Would you recommend that our users have a formal education in this field to prevent some troubles? No. I have no training in software engineering at all. Everything I know is from Jenny Bryan, so I think you're all fine.

Did you get pushback when initially transitioning to this workflow? If so, how did you convince them that this method may be better? I think there's always pushback for these kind of things, but if someone is in a position like your boss where they want to advance and they don't want you to have to waste your time doing the same thing over and over again, I think there's an argument to be made. I think it's a lot easier to keep people at a job if they're happy and doing work that they're interested in rather than copy and pasting things over and over again.

One more recent one is: what is the name of your package? We had one called rcno: the organization was called CNO, so you just put r before it. And we had a report on exams that we wrote, so that package was just called exams report. I think as long as you don't name it like dplyr, you're likely not to have any conflicts.

One is, the hardest thing with this approach seems to be maintenance and sustaining. Can you speak to that side of package development and functionalizing? I think for this, it wasn't as much of an issue because things were so similar year after year. Like, the point was kind of that things never changed, so you wanted to, like, abstract those things away because it was not useful going through the same thing over and over again. So I think for maintaining, I did not cover it at all, but writing tests for your package is a really big thing, so if things are changing in your data or in your code or things that you're getting, being able to keep track of those.

And then the last question is asking about what you use for version control. I use Git.