
Emily Riederer | RMarkdown Driven Development | RStudio (2020)
RMarkdown enables analysts to engage with code interactively, embrace literate programming, and rapidly produce a wide variety of high-quality data products such as documents, emails, dashboards, and websites. However, RMarkdown is less commonly explored and celebrated for the important role it can play in helping R users grow into developers. In this talk, I will provide an overview of RMarkdown Driven Development: a workflow for converting one-off analysis into a well-engineered and well-designed R package with deep empathy for user needs. We will explore how the methodical incorporation of good coding practices such as modularization and testing naturally evolves a single-file RMarkdown into an R project or package. Along the way, we will discuss big-picture questions like “optimal stopping” (why some data products are better left as single files or projects) and concrete details such as the {here} and {testthat} packages, which can provide step-change improvements to project sustainability.
Transcript
This transcript was generated automatically and may contain errors.
All right. Hi, everyone. Thank you so much for coming. My name is Emily Riederer and I'm here today to talk about RMarkdown Driven Development. I assume many of us in this room are already familiar with RMarkdown and what an amazing tool it is for literate programming and combining both code and narrative in plain text to create an amazing variety of different types of outputs. But something I'm particularly fascinated by is how RMarkdown can also be used as a prototyping tool for more advanced analytical tools.
I contend that any analysis that you've created in an RMarkdown document actually contains a latent underlying implicit package or analytical tool that is custom tuned to solve your domain specific workflow from everything from dealing with custom nuances in your data to making the specific types of deliverables that the customers of your analysis need. And the goal of RMarkdown Driven Development is to take this implicit tool and make it an explicit one.
We'll do this by harvesting all of the assets that you've already created in that implicit tool. For example, from the development side, you've already probably had to go through and curate a wide set of packages that relate to your problem and play nicely together. And you hopefully have developed in your one-off analysis a consistent set of code that works to solve your problem and ideally is at least somewhat well tested and well styled.
But beyond pure code development that we have to work with, we also have a lot of great design elements already in place. One of the hardest parts generally of creating a software product is understanding user requirements and getting deep empathy for user needs. But as the person that did the original RMarkdown analysis, you get all of that for free. You understand how all of the pieces of analysis need to come together, a sane workflow for doing the work and processing your data, and again, answering those questions that are important to the people that you want to share your analysis with. And finally, you have already a complete and compelling marketing example of how your latent tool goes into use and can solve real-world problems in the wild.
So there are five main steps to our RMarkdown driven development that I want to talk about today. And along the way, we'll talk about how these steps can result in many different types of outputs and many different types of analytical products. You can think about producing well-engineered single files, projects, or packages. And while I depict this as a spectrum, I think it's important to call out at the beginning, there's not necessarily a clear value judgment in here. There's no better or worse in terms of quality, user experience, or your talent as a coder, depending on what part of the spectrum you stop at. Really, all it is, and we'll see as we go through this, is whether you want to tune your final analysis tool more to solving another instance of your very specific problem or a tool that solves more of a generic class of problems.
Step 1: Cleaning up your RMarkdown
So now we'll go through the steps of our RMarkdown driven development one at a time. The first step's pretty easy because we don't actually have to generate anything new. We just simply have to delete a lot of things that shouldn't be in our RMarkdown to begin with. So for example, some of these things might include hard-coded variables, which are common when we're doing a one-off analysis. But thinking about ourselves or a future user that might want to update this analysis later on, it can be really detrimental to have hard-coded variables late in our analysis that users may not realize when or where they have to change. Or even worse, they might change them inconsistently some places and not the others and introduce inconsistencies in their analysis.
Instead, RMarkdown parameters let us front load all of this in the YAML header of your RMarkdown document, which means that it's very easy for a future user to see exactly in front of them what changes they need to make and have everything flow on through your analysis. Effectively then, your entire RMarkdown is acting kind of like one mega function through which everything can flow. This is also an effective strategy for dealing with credentials that you might not want in your final output in case you end up sharing your code with a colleague and even just for good data security practices. There's really great synergy here with the RStudio IDE because if you click knit with parameters, you can get a nice little pop-up box and replace dummy credentials in your header with your actual credentials, so you've never hard-coded them anywhere.
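As a minimal sketch of what this looks like (the parameter names and values here are hypothetical), the YAML header of the RMarkdown front-loads everything a future user might need to change:

```yaml
---
title: "Monthly Sales Report"
params:
  month: "2020-01"       # surfaced in the Knit with Parameters dialog
  region: "midwest"
  db_user: "dummy-user"  # dummy credential, replaced at knit time
---
```

Inside code chunks, these values are then available as `params$month`, `params$region`, and so on, so nothing downstream needs to be hard-coded.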
Next up, I think we already learned from Jenny a little bit this morning about file paths, and we know that we don't want to include global file paths in our RMarkdown because it makes it very unresilient to working in the future. Local file paths are a slight improvement over these, but can still break as we change across operating systems or later on as we move our files around inside of our working directory. So best of all, we can use the here package to create very resilient file paths to make sure our analysis always works as planned.
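A quick sketch of the progression, with a hypothetical file name, might look like this:

```r
library(here)

# Global path: breaks on any other machine or operating system
# dat <- read.csv("C:/Users/emily/project/data/raw-sales.csv")

# Local relative path: breaks if files move or the working directory changes
# dat <- read.csv("../data/raw-sales.csv")

# here() resolves the path from the project root, so this works
# from anywhere inside the project
dat <- read.csv(here("data", "raw-sales.csv"))
```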
And finally, and this step is by far the hardest for me, sometimes we have to kill our darlings. You can't let your RMarkdown be a junk drawer for code that you wrote, rabbit holes you went down, that you really, really thought would work, but then didn't end up being the direction your analysis went in. Take this time to remove package loads that you didn't need and delete code experiments that you maybe have commented out in your code just to create a very clean code script with a lot of hygiene on which to proceed with your refactoring.
Step 2: Rearranging code chunks
The next step in RMarkdown-driven development is simply rearranging your code chunks. RMarkdown is great because it allows us to capture our thought process as we go through our analysis, but the one downside of this is our thought processes are rarely linear and as a result, often neither are the resulting markdowns. The strategy here is to move to the top a lot of the heavy-duty infrastructure and computing chunks. These can be things that set up our environment, such as library loads, data loads, and a lot of our heavy-duty computation with data wrangling, data cleaning, model fitting, et cetera. And let's sink down to the bottom, the more communication and narrative aspects.
This has many benefits. First of all, it allows us users to clearly see the main dependencies of our script at the top as we have things like external file references and package loads right where the user opens the document. Similarly, by grouping like pieces of code together, we can more easily start to see patterns throughout the code and get a sense where we can later consolidate. And finally, often we'll be working on an analysis with colleagues that don't code and having all of the more narrative elements consolidated at the bottom may make it easier for them to find places where they can actually jump in and edit the text parts in RMarkdown without being kind of intimidated by large blocks of code sprinkled throughout.
Again, there's another great opportunity here to leverage the RStudio IDE for even more synergy and take advantage of some tips like naming your chunks and commenting your code with the quadruple dash, as you see in this example. The combination of these two features allows you to create a really friendly, nice table of contents that you can pop out to help a user navigate your script better and jump to the exact component that they want to work on.
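As a sketch (the section names are illustrative): in a chunk header you would write a name like `{r fit-model}`, and within code a comment ending in four or more dashes becomes an entry in the RStudio outline:

```r
# Load packages ----
library(dplyr)

# Wrangle data ----
clean <- filter(raw, !is.na(value))

# Fit model ----
fit <- lm(value ~ group, data = clean)
```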
Step 3: Extracting functions
So after we've done all of this rearranging, one of the main benefits is that it can more easily help you understand related parts of your script and find places where we can cut out duplicate code and replace it with functions for an even more streamlined approach. As a somewhat trivial example of this, in an exploratory data analysis part of your workflow, you probably end up visualizing the relationship between multiple variables in a lot of related plots. Doing this without functions leads to a lot of code that's both repetitive and inefficient, and it also makes it harder to tell exactly what piece of those lines is changing. All I'm actually changing there is the variable being plotted on the y-axis, but it kind of takes some staring at it to tell.
Instead, breaking these apart into separate chunks to define my function and then to use my function gives me numerous benefits. First, if I want to change something about those plots, like theme, I can easily just change that code in one place instead of multiple places. And secondly, from a more literate programming perspective, it's much easier in the view on the right to see what stayed the same versus what's changed between iterations.
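A sketch of this refactoring, using a made-up dataset and function name, might look like the following; the `{{ }}` tidy-evaluation operator lets callers pass a bare column name, so the only thing that varies between calls is visible at a glance:

```r
library(ggplot2)

# Hypothetical example data standing in for the analysis dataset
sales <- data.frame(
  date     = as.Date("2020-01-01") + 0:5,
  revenue  = c(10, 12, 9, 14, 15, 13),
  n_orders = c(2, 3, 2, 4, 4, 3)
)

# One function replaces several near-identical plotting chunks;
# a theme change now happens in exactly one place
plot_by_date <- function(data, yvar) {
  ggplot(data, aes(x = date, y = {{ yvar }})) +
    geom_line() +
    theme_minimal()
}

plot_by_date(sales, revenue)
plot_by_date(sales, n_orders)
```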
This is also a good time to start thinking about adding roxygen2 comments to your R Markdown. For those that haven't ever built a package, roxygen2 is the tool that you'll use to document functions and create the help pages that we've all likely seen at some point when we've pulled up R documentation. But even before you're building a package, you can easily, in the RStudio IDE, add in this roxygen2 skeleton and use it to document your function. This is nice because it's a format that users are already used to seeing and gives them the content they expect to know about a function. It can also help you think more critically about the design of your functions because it forces you to almost make a contract with yourself about precisely what those inputs and those outputs ought to look like.
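Continuing with the hypothetical plotting function from before, a roxygen2-documented version might look like this:

```r
#' Plot a variable over time
#'
#' @param data A data frame with a `date` column.
#' @param yvar Unquoted name of the column to plot on the y-axis.
#'
#' @return A ggplot object.
#' @examples
#' plot_by_date(sales, revenue)
plot_by_date <- function(data, yvar) {
  ggplot2::ggplot(data, ggplot2::aes(x = date, y = {{ yvar }})) +
    ggplot2::geom_line() +
    ggplot2::theme_minimal()
}
```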
Step 4: Cleaning up style
So now that we've cleaned up the actual content of our code a lot, one final step to consider at this point is also cleaning up the style. And we have many great R packages to help out here, including the lintr, styler, and spelling packages. Linters and stylers help us adhere to a specific style guide for our script by either flagging issues for us to change or by proactively editing our script themselves to make it adhere to certain rules, such as rules about spacing and indentation. And spelling similarly can help us catch typos in our prose for the cleanest possible view of our report.
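A sketch of how these three packages might be invoked on a report (the file name is hypothetical):

```r
lintr::lint("analysis.Rmd")                  # flag style issues for manual review
styler::style_file("analysis.Rmd")           # auto-reformat the file in place
spelling::spell_check_files("analysis.Rmd")  # catch typos in the narrative text
```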
From file to project
So going through all these steps, we've now really done a lot of work to move our R Markdown to something that's pretty sustainable, pretty readable, pretty navigable. And in some cases, this in and of itself may be a really good stopping point. In resource-constrained environments, if you don't have a good way to share files with a colleague, a single file can be very easy to send. If you don't have access to a good version control system, it can be very easy to just take a diff between versions. So there are some benefits to leaving things in a single file. Finally, you can also quickly refresh your code by just running one single file in isolation.
However, these can still sometimes be lengthy, monolithic, and sort of intimidating to new users. And you may actually be forcing R Markdown to do a lot more computation than necessary any time you want to knit. So if these sound like problems to you, we may want to consider moving from a file to an R project.
An R project looks something like this. And at a high level, our goal is to take a lot of the assets we have in our single file R Markdown document and break them apart and modularize them into many different folders with semantic meaning. This can be very helpful if you or your organization uses a very standardized file structure across projects, because everyone will know where to find specific data assets within a project. For example, one structure I like is shown here on the screen, where I would save my actual R Markdown report and the output of that in my analysis folder. I put a lot of those functions that I defined and other little bits of R code, such as those that read and process data, in a variety of R scripts in my src folder. And I can save both my raw data as well as data artifacts that I create throughout the process in different data folders. And finally, it's a good place to have catch-all files for different types of documentation or external pieces of context.
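One possible layout along these lines (the exact folder names are a matter of convention):

```
analysis/   # the R Markdown report and its rendered output
src/        # reusable R scripts: function definitions, data reads
data/       # raw, read-only input data
output/     # intermediate data artifacts created along the way
doc/        # supporting documentation and external context
```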
This allows me to create a much leaner R Markdown, where I just read in the minimal amount of resources from these files that I actually need. To take two more specific examples of what this looks like, instead of defining both a function and calling that function in the same R Markdown, I can break that up and define my R function in a modularized script that could be easily stolen and repurposed in another analysis. And I could simply load that in by executing that code remotely with the source function in my R Markdown.
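As a sketch, with hypothetical file and function names, the R Markdown then just sources the modularized script:

```r
# src/plot-helpers.R defines plot_by_date(); the report only loads and uses it
source(here::here("src", "plot-helpers.R"))

plot_by_date(sales, revenue)
```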
Another big kind of step change benefit of going through this process is in your actual data processing. Instead of doing all of your data processing within your R Markdown, I tend to think about doing it in three different steps. First, I use some R scripts to access my data from my database or my API or wherever that data is coming from, and save that raw data in a read-only data file. Then, through, again, a variety of R scripts that might help me with a lot of my data cleaning and wrangling and model fitting, I can create a lot of smaller data artifacts and have only those be what I read into my R Markdown. This means that I can run my R Markdown much faster without having to repeat many long, lengthy simulations or model builds or whatever it is you may be doing if I just want to make some small change. And it also removes upstream dependencies on external systems, like APIs or databases, to be live and working. If you just want to do something really small, like fix a typo in your R Markdown, you don't want to be unable to do that just because for some reason your database wasn't running.
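The three steps might be sketched like this; the connection object, table name, and file names are all hypothetical:

```r
# 1. src/get-data.R: pull from the source system once, save read-only raw data
raw <- DBI::dbGetQuery(con, "select * from sales")
saveRDS(raw, here::here("data", "raw-sales.rds"))

# 2. src/clean-data.R: heavy wrangling and modeling, saved as small artifacts
clean <- dplyr::filter(readRDS(here::here("data", "raw-sales.rds")),
                       !is.na(revenue))
saveRDS(clean, here::here("output", "sales-clean.rds"))

# 3. In the R Markdown: read only the light artifact -- fast to knit,
#    and no live database connection required
sales <- readRDS(here::here("output", "sales-clean.rds"))
```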
And one final note on projects. There exist a lot of great packages out there to help you set up these projects, and some of them also have different opinionated file structures, maybe a little different than the exact one I showed here, but very similar in spirit. So I definitely encourage you to check out usethis project templates and starters to find the one that works best for you. And making a project is also a great time to think more about managing the package dependencies that are going into that project.
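One possible combination of tools for this step (the project name is made up):

```r
# Scaffold a new project with a standard structure
usethis::create_project("sales-analysis")

# Record the exact package versions the project depends on
renv::init()
renv::snapshot()
```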
From project to package
So again, stepping back a minute to take a view of where we are, we now have a pretty well refactored analytical project as opposed to a file. And one benefit of this is that we've really broken apart all sorts of pieces that might be independently useful and put them in a place where they're very easily found and accessed to be repurposed. But at the same time, because we're in sort of project mode, we've really preserved a lot of the context specific to our individual analysis while doing this. So for people that are more interested in building off of our analysis, or taking the actual learnings from the question we were trying to answer, it's still very easy to do that.
However, for that very reason, projects can still cause some confusion because they kind of blur the line sometimes between what's a tool versus what's part of a higher level process that solves a class of problems versus your individual problem. If you want to focus mostly on solving that broad class of problems, that's when we can move to a full-blown R package.
To me, one magical thing about R packages that I didn't realize when I first started using R was they're really as simple as saving a lot of R code in the right places. And there's pretty clear one-to-one mapping between the kind of structure you might use for a project that I've talked about previously to the kind of structure you'd use for an R package. All of the functions that we just broke up into separate R scripts translate easily, go right into a single R folder.
The actual R Markdown that we've been working with this whole time can actually serve numerous purposes. It can go in your package as an R Markdown template, so users of your package immediately have some code they can start playing and experimenting with, a working R Markdown example of how to use what you've built. It can also be repurposed as a vignette, the long-form documentation that really teaches people more deeply how to think about the problem they're working on.
Similarly, all of the data artifacts we created, if we're careful to anonymize them and take out anything that we shouldn't be sharing with the public, can be very useful as example data in your package. By shipping example data in the data folder, this gives us a great way to build more vignettes, to conduct unit testing, and to give examples in your roxygen2 documentation. Really, the only two things we're missing at this stage are a man folder, where the documentation manual lives, and the tests folder for unit tests.
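A sketch of shipping an anonymized slice of the data with the package (the dataset and column names are hypothetical):

```r
# Truncate and anonymize before sharing: drop identifying columns,
# keep just enough rows for examples and tests
sales_demo <- head(subset(sales_clean, select = -customer_id), 100)

# Saves data/sales_demo.rda so the package can use it in vignettes,
# roxygen2 @examples, and unit tests
usethis::use_data(sales_demo)
```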
And at this point, I doubt it should surprise anyone here that there are more great R packages to help us go this last mile. First of all, once again, the usethis package is phenomenal for helping us set up the package infrastructure we need and making sure we do, in fact, save all those individual package assets in the right places with the right file structures and configuration files to make everything flow through seamlessly.
devtools can automatically generate our R documentation from the roxygen2 comments that we already wrote many steps back in this process. And testthat provides a really friendly interface for doing unit testing, which, quite honestly, if we want to trust the outcome of any analysis, we probably could have started doing many steps earlier in this process. And beyond all of those absolute requirements, the R ecosystem even goes above and beyond what we need. There also exists pkgdown, which can similarly reuse all of those great assets you made throughout your project and throughout the R Markdown-driven development process and create a very user-friendly, attractive, polished website to help people learn about and use your analytical tool.
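Taken together, the last mile might look something like this (the package and test names are illustrative):

```r
usethis::create_package("salestools")  # scaffold the package skeleton
devtools::document()                   # build man/ pages from roxygen2 comments
usethis::use_test("plot-by-date")      # creates tests/testthat/test-plot-by-date.R
devtools::test()                       # run the testthat suite
pkgdown::build_site()                  # generate the documentation website
```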
Wrapping up
So now we've created kind of the spectrum of possible outputs and possible analytical outcomes of our R Markdown-driven development process. I think all of us are probably very well aware what incremental benefits the package provides. It's, of course, one of the most formal distribution mechanisms, especially if you have access to a CRAN or a CRAN-like repository to share your work. And it also adds benefits because it provides a very formal and familiar way for people to learn about and engage with your tool. Pretty much any R user is used to working with an R package. That said, kind of the tradeoff again that exists here is that at this point, we've abstracted very far away from our initial problem. And if simply recreating your work is, you think, going to be the primary goal of future users, then maybe we've gone a bit too far.
So in summary, no matter which path you choose, your original R Markdown analysis can be a great starting point for a very wide variety of data products. And by leveraging all the work you did in conducting your initial analysis, you're much closer than you may realize to building a very sustainable and empathetic data tool. Thank you all very much for coming and for your time today. If you're interested in learning more, I have a blog post on this topic that you can find at tiny.cc/rmddd.
Thank you, Emily. That was excellent. We have time for just a couple questions, maybe just one. The most popular is, how does this workflow change if the end result is an automated scheduled job?
I think that probably can vary in a couple of ways depending on how complicated the job is. I know I've been reading about and learning about RStudio Connect a bit recently, and I think that actually has a lot of good options for automation based purely out of that original R Markdown. So I think that could be another good use case where the one-file R Markdown approach, even if it is handling kind of the full ETL process, is actually a good solution, because some of the rerunning problems aren't really problems there: rerunning your work is actually what you're trying to accomplish. But for some more complicated batch processes, even if you're automating them, you'd still probably want to go all the way, build the package, and then go back and use the tools you've created to build the job, if you want more robust and well-tested and well-documented components to that batch job.
