Garrett Grolemund | Reproducibility in Production | RStudio (2019)

Transcript

This transcript was generated automatically and may contain errors.

Thank you, everyone, for attending. This will be the first in a series of three webinars that focuses on how to use reproducibility in a business setting. And the real gist of what we're doing here is, over the past few years, as you may know, academics have been very concerned with the problem of reproducibility in data science research. Here's just one of many headlines that the public has read about reproducibility. This one comes from Forbes. But there's been a quote-unquote reproducibility crisis in data science research. And there's been a lot of progress made to address that crisis. And from a technological standpoint, the problem's been solved, in my opinion. But there are many cultural changes that still need to be made. The cultural changes don't concern us so much as the technological solution.

The new technology that data scientists are using to make their work reproducible has created unintentional benefits for people who use data science in a business setting. And that's what these webinars will look at. The first webinar, which you're at today, is reproducibility in production. And I'm going to talk about the technological solutions to the reproducibility crisis, specifically one type of technology that I'm going to call computational documents, which are just documents with executable code inside of them. I'll show you how you can use those to create opportunities both for yourself and the people who consume your data science intelligence, whether that's customers, clients, bosses, colleagues, so on.

In the second webinar on September 18th, Thomas Mock will talk about RStudio Connect in production. RStudio Connect is a production platform that allows you to share the types of documents that we'll be talking about today. And it really completes the circle in terms of providing opportunities and advantages in an everyday business situation. In the final webinar, Kelly O'Brien will talk about interactivity, which is probably the largest opportunity created by the technology of computational documents. She'll go over the best practices for making interactive material and tell you the things that you should be considering as you develop interactive material for others to consume. That webinar will take place on October 2nd.

What are computational documents?

So for today, let's start by looking at what computational documents are. This is what I mean by computational document. Let me exit the slides for a moment. Let's go to RStudio. This is a computational document. It's just a regular document, a file. You can see text in here. But it has executable code embedded throughout the file. And a computational document is just a document that contains executable code. And when you can put executable code into a document, then you can allow the reader of the document to run the code. And the code can do things for them. As someone who knows how to write code, you could write whatever code you like to do whatever you want the reader to be able to do with your document.

This document contains a data analysis, a very simple one, as a matter of fact. And the code here just recreates the analysis. And this is typical of what computational documents do. They're a way to solve the reproducibility crisis by putting the actual reproducible data analysis in your report. So whoever you pass the report on to has everything they need to recreate the data analysis. For example, I can run the code that's in this document with these buttons and reproduce the document as I go. So here this analysis is making some graphs. We'll look at it in detail later. And then the other thing that this particular type of document allows you to do is you can knit this report. Both the text that the author wrote in the document and the code that runs the document are combined. And so we see the text here with the code results. And it creates a finished presentation to pass on to someone who wants to know about our results, but maybe not about the code.
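As a rough illustration of the knitting step described above, here is a toy sketch in Python. It is not how R Markdown or knitr works internally. The `<<code>>` and `<<end>>` delimiters, the `knit` function, and the sample report are all hypothetical names invented for this example. The sketch only shows the idea: run each embedded chunk and put its output where the chunk was, next to the author's text.

```python
import contextlib
import io
import re

# Hypothetical chunk delimiters for this toy format: a "<<code>>" line opens
# a chunk and a "<<end>>" line closes it. R Markdown uses fenced chunks instead.
CHUNK = re.compile(r"<<code>>\n(.*?)<<end>>\n?", re.DOTALL)


def knit(source: str) -> str:
    """Run each embedded chunk and replace it with whatever it printed."""
    env = {}  # shared across chunks so later chunks can reuse earlier variables

    def run(match: re.Match) -> str:
        buffer = io.StringIO()
        # Capture everything the chunk prints instead of sending it to the console.
        with contextlib.redirect_stdout(buffer):
            exec(match.group(1), env)
        return buffer.getvalue()

    return CHUNK.sub(run, source)


# A made-up report: author text mixed with executable chunks.
report = """Sales summary
<<code>>
sales = [120, 95, 143]
print(f"Total units sold: {sum(sales)}")
<<end>>
The best month is computed from the same data.
<<code>>
print(f"Best month: {max(sales)} units")
<<end>>
"""

print(knit(report))
```

Because the numbers in the finished text come from running the code, changing the `sales` list and knitting again updates every figure in the report without any manual editing.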

So that's the gist of a computational document. It is something that can contain embedded code. And because it can do that, it can automatically reproduce the data analysis done with code. And you can see here how computational documents are a definitive solution to the reproducibility crisis in data science. Everything we need to reproduce our work is here. But because we're reproducing it with code, we can do more than simply reproduce the analysis: we can reproduce it automatically.

That automation is what creates opportunities for businesses who want to use computational documents. And not only does it create opportunities, it's going to be a disruptive technology that changes how you think about delivering and performing data science if you deliver or perform or conduct data science as part of your job.

What I recommend is that you aim for the level of interactivity that's as simple as possible to do the task that you want, because that leaves less room for things to go wrong and need to be debugged on your part, less room for things to go wrong and for the user to have a poor experience, and less time needed to create those documents.

Here are some simple bright lines that you could use to think about this hierarchy. Ask yourself, how many times do I need to change the data in my report during my presentation or during the viewer's presentation? How many times do they need to change the data for themselves? If it's no more than once, then you don't need anything that involves Shiny. This is a very important question to ask yourself, because we tend to see people learn Shiny and then use it for everything. It's very hard to walk back from Shiny once you've adopted a Shiny workflow, and that's okay. Shiny is excellent, but it's very easy to overestimate what you need to accomplish the results you want. If you're not updating your data more than once per session, then you can just use a simple non-Shiny R Markdown document. You can create a new document before each session, and that document will suffice for that session.

The next thing to ask is, do you need to create a custom UI for your app? If what you want to report to your user would do fine in a document format or a dashboard format, you can do that with an R Markdown document, and you don't need to invest in creating a Shiny or an HTML UI to contain your interactive components.
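The two questions above can be written down as a small decision function. This is only a sketch in Python: the function name, parameter names, and returned labels are made up for illustration. The threshold of one data change per session follows the "more than once per session" wording, and the final tier (a standalone Shiny app) is an inference, since the passage stops at the second question.

```python
def recommend_format(data_changes_per_session: int, needs_custom_ui: bool) -> str:
    """Return the simplest document type that satisfies the two questions."""
    # Question 1: does the data change more than once per session?
    if data_changes_per_session <= 1:
        return "static R Markdown document, re-knit before each session"
    # Question 2: would a document or dashboard layout be enough?
    if not needs_custom_ui:
        return "R Markdown document or dashboard with interactive components"
    # Assumption: this last tier is inferred, not stated in the talk.
    return "standalone Shiny app with a custom UI"


print(recommend_format(data_changes_per_session=0, needs_custom_ui=False))
print(recommend_format(data_changes_per_session=5, needs_custom_ui=False))
print(recommend_format(data_changes_per_session=5, needs_custom_ui=True))
```

The checks run from the simplest option to the most complex, which mirrors the advice to aim for the lowest level of interactivity that still does the job.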