Resources

Sharla Gelfand | Don’t repeat yourself, talk to yourself! Reporting with R | RStudio (2020)

If you’re responsible for analyses that need updating or repeating on a semi-regular basis, you might find yourself doing the same work over and over again. The principle of "don’t repeat yourself" from software engineering motivates us to use functions and packages, the core tools for reducing repetition in the R universe. For analyses, it can be difficult to know how to apply this principle and move beyond "copying and pasting scripts and changing the data file and the object names and updating the dates and results in R Markdown", especially when there’s some element of human intervention required, whether it be for validating assumptions or cleaning artisanal data. This talk will focus on those next steps, showcasing opportunities to stop repeating yourself and instead anticipate the needs of, and communicate effectively with, your future self (or the next person with your job!) using project-oriented workflows, clever interactivity, templated analyses, functions, and yes, your own packages.


Transcript

This transcript was generated automatically and may contain errors.

Hi everyone, thanks for coming. I am so excited to be here. RStudioConf is my favorite thing in the world, pretty much, and it took me quite a trajectory to get here.

So, three years ago, I came to RStudioConf for the first time, and I lost my wallet and Joe Chang had to lend me money to get home. And now, I'm here talking, so if anyone has lost their wallet, cancel your credit cards and please submit a talk abstract in 2023.

So I'm here to talk to you about repeated reporting in the R universe. If you're interested in the slides, they are available at bit.ly slash rstudioconf, and you may know me on Twitter as Sharla Gelfand, and if not, please follow me.

The old workflow: SPSS, Excel, and Word

So the inspiration for this talk came from my previous job, where I was a statistician working with data on Ontario's nursing workforce, and at this job, we had a set of reports that we had to write every year, released to the government, and these reports used to be written using a combination of SPSS, Excel, and Word, with a lot of decisions kept track of in email and paper notes.

The process was pretty manual, and it might sound familiar to some of you. First, we would somehow acquire the data. In a lot of cases, this involved emailing someone from the IT team, waiting for them to have time to work on the project, and then they would query the data and send it to us. Then we'd use SPSS to clean and explore the data and make some summary tables. We'd move over to Excel, where the nicely formatted tables live, shift last year's data over, and paste the new data in. Then we'd go to Word and replace the tables, change some numbers and text. Then we'd discover that there's a mistake in the data in the report, so we'd go back and do everything all over again.

So these reports, which we had to write every year, took all year. They weren't reproducible, and things were scattered, hard to keep track of, and the work was boring, manual, and repetitive, and not easily repeatable. And instead of being able to do new things on the job, we were stuck doing the same work over and over again.

Transitioning to R

So this is a bit of a spoiler, but I'm sure you knew where this was going. First, I have to say that I have not been completely honest: I didn't do any of the previous work. My coworker, Charlene, who is here, and previous statisticians did this, but I started the job knowing that I had no interest in using SPSS.

So I started the job, I took about a week to understand how things were going, how we made these reports, and I was having lunch with my coworker, and I'm like, hey, have you heard of R? It took a week.

And she had, thank God. So Charlene had used R in grad school and was more than happy to have opportunities to use it more. She had wanted to improve our processes, so she quickly learned the tidyverse, fell right into the pit of success, and we transitioned all of our projects over to R.

So we got the queries from our database team, we learned about our database structure, and we used packages like dplyr and dbplyr to query them. Then we used tidyr, janitor, and other tidyverse packages to clean our data, and we used R Markdown, kableExtra, and ggplot2 to write our reports. And everything fit so well, almost as if it was designed to work together, and pretty soon our workflow started looking a lot like my laptop.

We were so proud of what we had accomplished. Our code was reproducible, and we didn't have to redo everything if there was a mistake. Our code was all in one place and easier to keep track of, and it took a little bit of time to transition things, but we knew it would be worth it because we had time to do new things on the job.

And next year arrives, and considering that there's 15 minutes left in this talk, I'm guessing you can tell that things did not go as expected. But, you know, I was feeling good, I go to work on these projects next year, and as an obsessed R user, I was like, there's nothing that can go wrong. We're using R, it's foolproof, we're all good.

And then I would come back to projects, just ready to knock them off the to-do list, and see something that looks like this. So I come back to my code, and I'm like, okay, so we've got check demographic assumptions done, check demographics old code, then we have, you know, demographics report, copy of demographics report, we've got try cleaning demographics again, untitled.R.

And so I'm like, yeah, I'm a pretty organized person, but I come back to this, and I have just no idea where to start. Things are not named clearly, it is a jumbled mess, and I don't know what files are what, what order to do them in, and what's usable or what's old.

So I'm like, okay, you know, this one says done, so that's probably a pretty good candidate to open it. So I open the code, and so I take the data, and I'm like, okay, so I change the value of a field for one ID, why? I don't know. And the value of a different field for a different ID, again, I have no idea why. And then I take the data, I filter it, and I do some aggregation. And at this point, I'm like, cool, cool, cool, cool. So I have just no idea what is going on here.

Because the code that I was writing was, like, so specific for last year's project, and I wasn't thinking about how to reuse this code. So it was salvageable, you know, I have to go through, I remember what's what, I copy files over, I change filters, but I was so frustrated, because we had stopped using SPSS, Excel, and Word, because things were not reproducible. They were difficult to keep track of, manual and repetitive, not easily repeatable, and time-consuming. But the way that I was leaving my R code, things were kind of reproducible, difficult to keep track of, manual and repetitive, not easily repeatable, and time-consuming.

Two guiding principles

So in order to avoid having this happen again, I developed a set of approaches guided by two principles, don't repeat yourself and talk to yourself. And I hope that it's useful for those of you who use R for this kind of repetitive reporting.

So don't repeat yourself is a concept from software engineering aimed at reducing repetition. It says that you should abstract away repetitive logic and automate repetitive processes. So why do we want to do this? Well, copy and pasting your code is really error-prone, and you have to make an update everywhere that the code appears. And instead of focusing on what changes, we're busy focusing on what stays the same. You've probably all heard the mantra that if you copy and paste the same code three times, you should turn it into a function. So functions and packages are really the core of don't repeat yourself in the R universe.

But repetition is also present in processes. Having to redo things, remember where things are, remember what you did is pretty pointless, and you might not be able to abstract away that process, but you can definitely try to automate things.

Talk to yourself is a concept from me that I thought sounded good with don't repeat yourself. And it basically is just aimed at making work easier to use for your future self. So it's way easier to document why you're doing something now than to remember why you did it a year ago. The goal here is to focus on your future self instead of your present self, and just because we write easy-to-read code because of the tidyverse doesn't mean that it actually explains the motivations of why we're doing something.

Cleaning up: file organization and setup files

So with these principles, the first thing I did is I cleaned up after myself. And I needed to do this because my old code was just not reusable, and there was so much repetition of process having to relearn where things were, what they did, and what to do. So you've seen this before, and I think we can agree that leaving files like this is just rude.

So the first thing I did to reorganize that is I separated out data, reference code, and old code and reorganized things and named them clearly to communicate what they're actually doing. I also took advantage of the default ordering, which does things by name, and gave numbers to indicate what order things happen in. And even if this is all you do, next year when you come back, you can just copy the old code over and go through things in order.

But in case things were not obvious enough from the order, I also started creating this setup file, which is really useful for starting a project, like having a readme or an introductory vignette when using a new package. This is a good place to explain the project, where to start, like if you have to email someone and what steps to take. And the goal here is to explain how to do something and why to do something instead of just leaving an artifact of what you already did.

So this one's pretty basic, but it's, hello, welcome to the annual demographics report. Step one, query the data, open the file, and run the code. Step two, clean the data, open the file, and run the code.

Writing functions and making them chatty

And once I left my files usable as a whole, I had to start thinking about them individually and how I could make going through things less manually repetitive and what to abstract away. So functions are a great way to do this. So I had some code to clean and query data that was a good candidate for this.

So if we look at this code, we have a long and complicated SQL query, or a dplyr version of it. Then you filter the data where the year is 2019. You have a bunch of code to rename and select columns, thirty-odd lines to clean up and summarize the data, and some more code to save the data. I mean, talking through this is boring, and I like to talk, so you definitely don't want to have to run through this code every time. And the thing with this code is that everything stays the same, and the only thing that changes is the year.

So it's a really good set of code to turn into a function where you take what stays the same, make it into the body of the function, and then you take what changes and make it into the argument of the function. Then you don't have to run through this code, copy it, or update logic in multiple places if it changes.
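A minimal sketch of that refactor, with a toy stand-in for the queried data (the function name, table, and columns here are illustrative, not from the talk):

```r
library(dplyr)

# Toy stand-in for the queried data (columns are illustrative)
demographics_raw <- tibble(
  member_id = 1:4,
  year = c(2018, 2019, 2019, 2019),
  region_code = c("A", "A", "B", "B")
)

# Everything that stays the same becomes the body of the function;
# the one thing that changes -- the year -- becomes the argument
query_demographics <- function(analysis_year) {
  demographics_raw %>%
    filter(year == analysis_year) %>%
    rename(id = member_id, region = region_code) %>%
    count(region)
}

# One call per year instead of a copied-and-edited script
query_demographics(2019)
```

If next year's logic needs to change, it changes in exactly one place.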

I also put this function into the setup file so that you can just run the function and have it query the data rather than saying go and open 01 queried data.

But abstraction was not enough for me, so I also decided to make my functions chatty as an opportunity to talk to myself. The usethis package is amazing for many, many things, but I like this lesser-known part where you can print UI messages to your console. So if you call ui_info() and you say hi, it'll print a little i for info and "hi" to the console, and you can use these helpers in your own functions to print different kinds of output to the console.

You can also interpolate code into these messages. So if I set the analysis year to 2019 and use ui_todo(), I can write "querying demographics data for {analysis_year}", and it'll print "querying demographics data for 2019". So at the end, I had a function that, when I ran it, actually helped me keep track of what was going on. You run it and it says it's querying demographics data for 2019, and then it tells you where the result is saved. Because if you're coming back to this project a year later, you don't need to track those things down. It'll just tell you.
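A sketch of such a chatty function (the messages and file path are reconstructions, not the talk's actual code). The usethis ui_* helpers use glue-style interpolation, so {analysis_year} is filled in when the message prints:

```r
library(usethis)

# A chatty wrapper: reports what it is doing and where results go,
# so next year's reader doesn't have to hunt for either
query_data <- function(analysis_year) {
  ui_todo("Querying demographics data for {analysis_year}")
  # ... run the actual query here ...
  ui_done("Demographics data saved to data/demographics-{analysis_year}.csv")
}

query_data(2019)
```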

Artisanal data and generalizable code

So I knew that not all of my code had logic that could abstract it away. Our projects had a lot of really specific data cleaning steps. There was a lot of back and forth and a lot of checking of assumptions and always some level of human interaction required.

But a big lesson that I learned is that artisanal data does not require artisanal code. So we had very special data, like locally sourced, handcrafted, new mistakes daily artisanal data.

But the mistake that I always made was that I was so focused on cleaning that set of data, and I wasn't thinking about how to clean a new batch of the data. And I only knew why I had done something last time and not what to look for this time. So instead of writing super specific code, I started writing generalizable code with super specific instructions, including things on data cleaning, who to ask, what to look for and what to do. And this actually gives you time to deal with cleaning your artisanal data because you're not wasting time remembering how to clean it.

So we've seen this before. But again, what my code used to look like is that I would take the data, change the value of a field for one ID and the value of a different field for a different ID with no idea why I had done it. But what I replaced it with is something that looks like this. So first I'm saying what to do. So an example is I want to check for any cases where the province is Ontario, but outside of Ontario is true, because that's obviously not right. And then why to do it? Because some of them may have province wrong, but some may have outside Ontario wrong. And then you say who to talk to. So email Carla.

And ask them to check paper files to confirm which is correct and then what to do next. So based on the response, update any manually until the database is updated. Because when I started this job, I found it so difficult to keep track of who to ask for different things. You know, like you have to email that person and if they're gone, you have to email someone else. So I decided to just codify it so I wasn't leaving myself wondering. And then to avoid repetition, I could write a function that actually automates producing the template, again, using the usethis package. So you take the template, it makes and saves a copy of it, and you can even pass data to it.
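A sketch of such a template-copying function, assuming it lives in your own package (the package name "demographicsreport" and template file name are hypothetical). usethis::use_template() looks the template up in the named package's inst/templates/ directory, fills in placeholders from data via whisker, and saves a copy into the active project:

```r
# Hypothetical wrapper around usethis::use_template(): makes a
# year-specific copy of the prescriptive cleaning checklist
check_demographics <- function(analysis_year) {
  usethis::use_template(
    template = "check-demographics.Rmd",  # from inst/templates/
    save_as = paste0("check-demographics-", analysis_year, ".Rmd"),
    data = list(year = analysis_year),    # filled into {{year}} slots
    open = TRUE,
    package = "demographicsreport"
  )
}
```

Calling check_demographics(2020) would then drop a fresh, pre-filled checklist into the project instead of you copying and editing last year's file.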

Putting it all together in a package

And finally, I put everything together into a package. A project like this is a great candidate for making a package, and it is not as scary as it seems. And once I started making packages for this, I started making packages for everything else in my life.

So why do we want to make a package? I mean, as I said, making a package is really an ultimate form of do not repeat yourself. Your functions are centralized in a specific place and it's really clear where to update them. Your templates also have a place to live and then you can use other functions to copy and pass data to them. There's robust documentation and you can share code with your coworkers in a better way than emailing it or having it on a network drive.

So when all was said and done, I had a package that looked something like this. So my functions had a home in the R directory, and then the templates live in a special place so you can copy and use them later. There's documentation in the form of a readme, function documentation in the man folder, and vignettes for detailed usage. And then to put it all together, the setup function pulls the setup template and makes a new file with the template, which contains other functions that can run through the analysis. And then when you run the function to query the data, it does so, and it does so chattily so that it's abstracting away the logic but making sure that you know where things are saved.

So I use don't repeat yourself and talk to yourself by cleaning up code with clear names and ordering. I made functions, and I made them chatty to communicate what they were doing. I made automated templates, and I made them prescriptive to help me know what to do year after year. And then I made a package to put it all together.

And using this form of repeated reporting in R, I had a project that was reproducible, easy to keep track of, it had minimal manual work, was easily repeatable, and it freed up time, which I spent making other R packages.

So my takeaway to you is that if you have a repetitive task that you find confusing and you're doing the same work over and over, abstract away or automate what you can and loudly communicate the rest. Or because it's catchier, don't repeat yourself, talk to yourself. Thank you.

Q&A

The first one is: how do you name your script if your boss asks you to double-check something and you end up with another script that doesn't fit into the 01, 02 workflow? I would say if it's not something that's going to stay the same year after year, keep it somewhere else, maybe a folder that has extra things. The nice thing with this workflow is that it gives you a fresh copy of the directory, which you can use, but you can always add things to that.

Another question was: as a person who's never made a package before, how do you make your package available to someone else within your organization without making it public? If your organization is small, you can have a free private GitHub repo with up to three collaborators, so that's a good option. If you have another form of version control at your work, or if you have a network drive, you can install a package from there, or even send them the source code and they can install it using that.

What are the advantages of usethis's ui_info() function over a simple print call? I like the little icons that come with usethis; they communicate useful information, right? There's a little red dot that might indicate you need to do something, or there's a check mark that says this is done and this is where it is. It's just a friendlier form of communicating, I think.

We have a few more minutes for questions, so can you estimate the time trade-off? That is, how much additional time did you spend and how much time did you save in the long run? I've done a few iterations of this, and it does take some time investment up front, for sure. As for the long run, I quit my job, so you'll have to ask my coworker Charlene how much time it saves. Going through these projects was back and forth that sometimes took many, many months of checking assumptions. Even if you have to do that, I think it's a time investment in the long run.

Someone else asked, a lot of your recommendations seem to be software engineering principles. Would you recommend that our users have a formal education in this field to prevent some troubles? No. I have no training in software engineering at all. Everything I know is from Jenny Bryan, so I think you're all fine.

Did you get pushback when initially transitioning to this workflow? If so, how did you convince them that this method may be better? I think there's always pushback for these kind of things, but if someone is in a position like your boss where they want to advance and they don't want you to have to waste your time doing the same thing over and over again, I think there's an argument to be made. I think it's a lot easier to keep people at a job if they're happy and doing work that they're interested in rather than copy and pasting things over and over again.

One more recent one is: what is the name of your package? We had one called rcno: the organization was called CNO, so you just put r before it. And we had a report on exams that we wrote, so that package was just called exams report. I think as long as you don't name it like dplyr, you're likely not to have any conflicts.

One is, the hardest thing with this approach seems to be maintenance and sustaining. Can you speak to that side of package development and functionalizing? I think for this, it wasn't as much of an issue because things were so similar year after year. Like, the point was kind of that things never changed, so you wanted to, like, abstract those things away because it was not useful going through the same thing over and over again. So I think for maintaining, I did not cover it at all, but writing tests for your package is a really big thing, so if things are changing in your data or in your code or things that you're getting, being able to keep track of those.

And then the last question is asking about what you use for version control. I use Git.