
Garrett Grolemund | R Markdown The bigger picture | RStudio (2019)
Statistics has made science resemble math, so much so that we've begun to conflate p-values with mathematical proofs. We need to return to evaluating a scientific discovery by its reproducibility, which will require a change in how we report scientific results. This change will be a windfall to commercial data scientists, because reproducible means repeatable, automatable, parameterizable, and schedulable.

VIEW MATERIALS: https://github.com/garrettgman/rmarkdown-the-bigger-picture

About the Author

Garrett Grolemund

Garrett is a data scientist and master instructor for RStudio. He excels at teaching, statistics, and teaching statistics. He wrote the popular lubridate package and is the author of Hands-On Programming with R and the upcoming book, Data Science with R, from O'Reilly Media. He holds a PhD in Statistics and specializes in data visualization.
Transcript
This transcript was generated automatically and may contain errors.
Last summer, I had the chance to participate in the writing of this book, which is called R Markdown The Definitive Guide, and it's available for free online at this web address. I'm particularly pleased to be involved with this project because I believe that history will see R Markdown as a turning point in the replication crisis.
If you haven't realized it yet, the replication crisis is the one dark cloud in the otherwise bright future of data science, and it's a darker cloud than you might think. I'll assume that you may have heard something about the replication crisis because it's been widely reported on in academic journals like Nature, but also it's crossed over into the mainstream, being published in papers like the Wall Street Journal and the New York Times.
Basically, earlier this decade, pharmaceutical companies noticed something alarming. As part of their work, pharmaceutical companies replicate the results of promising studies to evaluate whether or not they could turn the findings into a drug or a treatment that they could then sell. Normally, they keep what they find in-house because it's a competitive advantage, but in 2012, the Amgen company noticed a picture that was so disturbing, they decided to make it public.
Amgen had replicated the results of 53 landmark studies. Now, landmark is Amgen's word, but by that I mean studies that are influential and that other studies relied on. What Amgen discovered is they could only get the same results as six of the 53 studies. Now, that's really bad. In scientific terms, that means the other 47 studies might as well be wrong. In fact, they probably were wrong, but they were published in peer-reviewed journals and they're accepted as true by academia.
After this announcement, the Bayer pharmaceutical company confirmed that they could only replicate the results of about 25% of the studies that they recreated. Since then, we've studied this in depth, and our best estimate is that about 75% to 90% of research in preclinical studies is irreplicable. That means, again, that this research might be wrong. The results are likely coincidence at best.
This should be concerning to you. Not only are academic reputations at stake, but there's a lot of money that is wasted on this research that can't be replicated. In fact, we have a very good estimate of how much money is wasted on irreplicable research in biomedicine, and that is $28 billion per year in the United States. Now, to put that in perspective, with $28 billion, you could buy a latte for everyone on the planet from Starbucks.
Now, if you're like me, you might not have a good sense of how many people are on the planet, but estimating based on the number of people in this room, with $28 billion, you could buy everybody in this room their own private island in the Bahamas, assuming that supplies last.
This is just for research done in one year, and it's just money wasted in the United States, and it's just money wasted in the field of biomedicine. Unfortunately, all signs suggest that the replication crisis is occurring across every branch of science.
For example, a study of 18 influential economics articles from prestigious journals revealed that only six of them had results that could be replicated. A study of 21 articles from Nature and Science showed that only 13 of those articles had results that could be replicated. Now, Nature and Science are considered the most prestigious academic journals, and they publish work across all the domains of science. So while 13 out of 21 isn't technically half bad, it's still very alarming.
And there's other reasons to be alarmed, too. We've seen that money is being wasted, but there's a real opportunity cost here. These studies are meant to do things like generate wealth, heal the environment, and cure cancer, but they can't do that, and worse, they're misleading the people who would otherwise be solving those problems. I like a healthy environment, and I really do want them to cure cancer, especially before I get too old.
The other thing is, these studies have become part of the scientific consensus, even though they're useless, and we don't know which part of the consensus they are. I mean, yes, we do know the ones that were studied, but everything we haven't looked at, we don't know what's true and what's false.
But more personally, if your expertise is closely associated with data science, and I suspect that it is since you're here, the replication crisis for you is a credibility crisis. The common denominator of all these studies is that they rely heavily on data and methods for analyzing data. And like it or not, commentators have observed this pattern. Academics, and presumably the people who read the New York Times and Wall Street Journal, are starting to realize that data is not a panacea, and neither are sophisticated methods for analyzing the data.
What's really causing the replication crisis
You can solve this problem, but you need to spot the cause first. I can tell you what academia thinks is the cause, because the American Statistical Association published an article on it last fall. It's right here next to the cover story on R, which, by the way, is also a very good read, and the ASA used the metaphor of a cargo cult to explain what's going on.
So do you know what a cargo cult is? During and after World War II, cargo cults developed on isolated South Pacific islands. During the war, natives who lived on these islands saw soldiers arrive and build airfields and radio towers and whatnot, and then miraculously, from the natives' perspective, planes descended from the sky, laden with cargo for the war effort, and a lot of that cargo made its way into the hands of the natives, who found it very, very useful. But the war ended, and the planes stopped coming.
So in some places, the natives reconstructed landing fields, radio towers, radar dishes, the things that they had seen the soldiers use to try to summon the planes back. They didn't understand the original technology, but they assumed if they did something that looked similar, they could get similar results. Well that's the metaphor the ASA uses. The ASA says that many applications of statistics are cargo cult statistics. Practitioners go through the motions with scant understanding. In other words, the people analyzing data just don't know what they're doing.
It's a convincing story in some ways. I'll show you. This is a worked example from Fisher's Statistical Methods for Research Workers. We can assume that Sir Ronald Fisher understood the original technology of statistics because he invented most of it. The example begins here in blue text and continues for a few pages, and what Ronald Fisher is doing is he's stating his problem very precisely. He's describing what he thinks are important characteristics of the problem. He's stating some reasonable assumptions, and then he's inventing a method that can help him answer his question. And down here we get his answer.
This is a really simple thing like, you know, what's the correlation? But this is how he did it. Now here's how modern researchers, at least the ones who don't use R, might do the same thing or handle the same problem. They'd say, well, okay, my software gives me these 20 tests to choose from, so I'll pick this one.
As I said, it is a seductive story, and if the authors of the ASA article are in here, I apologize, but I don't agree with the explanation. And that's largely because I have a PhD in statistics, and I know that if you do statistics completely correctly, you can still have a replication crisis.
Now when people talk about the replication crisis, they often mention p-values. Let's use p-values as an example. People say we're p-hacking, we're misusing p-values, but let's look at what happens when you use a p-value correctly. A p-value just means that you've done a statistical hypothesis test. Almost all statistical tests account for one source of variation: the uncertainty that comes from taking a random sample from a population. A p-value attempts to quantify that uncertainty.
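To make that concrete, here is a minimal R sketch of a correctly used p-value. The data are simulated, so the numbers are purely illustrative; the point is what the test does and does not account for.

```r
# Simulate a random sample from a population whose true mean is 750 ml
set.seed(1)
fills <- rnorm(30, mean = 750, sd = 5)

# A one-sample t-test quantifies ONE source of uncertainty:
# the variation that comes from drawing a random sample
result <- t.test(fills, mu = 750)
result$p.value

# The p-value says nothing about measurement error, sampling bias,
# model choice, or any other source of uncertainty in the study.
```

Even used perfectly, the test addresses only sampling variation, which is exactly the point being made here.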
But this source of uncertainty is just one source of uncertainty that you will encounter every time you try to use data to answer questions about nature. This is a graphic developed by my friend and colleague Drew Levy that depicts the other sources of unavoidable uncertainty that are involved in the process. Each small white bullet point here is a different source of uncertainty, but p-values only address that one segment over here. And those of you who suggest that we should replace p-values with something like a Bayesian odds ratio or whatnot need to account for those other sources of uncertainty too. A replacement for a p-value isn't going to solve the problem if it also only looks at that one source of uncertainty.
We act as if accounting for that one source of uncertainty justifies the entire process. But it doesn't.
Confusing science with math
So let me tell you what I think is going on. And I could do it with this example. This is just one of the pages from that Fisher example we were looking at. If you look closely at this page, you'll see things like formulas, algebraic variables, factorial signs. It looks like math. Why might that be a problem?
Well, let's do a thought experiment. Which set of words do you associate with math? I'm going to guess that's not hypotheses, messy, best guess, or discover, unless you're very new to math or very advanced at math.
The beauty of math is that it's so precise. It's so logically certain. You can prove things with math. And that's why we love to use it when we can. But now think of science, the sort of roll your sleeves up, get your hands dirty and make an experiment science. You're always working with hypotheses that only represent your best guess. At any point in the future, your hypotheses could be revised due to new data or completely overturned. Your only road to glory in science really relies on discovering something that no one else has documented yet. With science, you cannot prove things.
These two systems are complementary. Math as a form of logic can prove things for you, but only things that are already present in your premises or implied by your definitions. Math can't tell you if those definitions could correspond well with reality. But science can. That's science's job. Science helps you pick the most pragmatic definition, and it helps you keep track of whether or not that definition corresponds to your current state of knowledge. But science can never prove that that definition is correct. It will always be an estimate or a guess.
Now those of you who are fans of hypothesis testing might say, well, no, Garrett, you're wrong. You can prove that a hypothesis is false. Well, no. You are wrong. If there is a probability model involved, there will always be some probability, even if it's very small, that your hypothesis is correct, no matter what the test says. We just round that probability down to zero if it happens to be small, say less than one in 20. And that's the point. A statistical hypothesis test looks like math, but it's not math. It doesn't deliver logical certainty.
So we've created a cargo cult by confusing science with math. As we started to use more and more data in our science, our methods started to resemble more and more math. Somewhere along the way, as a group of people, we forgot or stopped acting like math was only a tool for scientific reasoning. We began to believe that our work was like math. It could deliver logical proofs.
Now we must undo that cargo cult, and this is very important because we're on the verge of starting a second cargo cult as we use machine learning. Our science is starting to look more and more like computer algorithms. Computer algorithms are very powerful, automatable, reliable in their own way, but our science does not gain those qualities just because we use a computer algorithm. And what's worse, when machine learning fails, it seems to fail in a much more public way.
Reproducibility as the solution
But scientists know what to do. It's in your DNA. We were taught it all in grade school, and I have a very useful metaphor you could take away, and that is the age of exploration when navigators were going out there and discovering new continents. They were searching for parts of the world that had not yet been observed, and that is exactly what scientists do.
So think of someone like Christopher Columbus. He sailed across the sea, he discovered the new world, and he came back to Spain. Did he offer a logical proof that the new world existed? No, that wouldn't make any sense. He offered a map, and the map spoke for itself. If other explorers followed the map and got to the new world, well, that spoke for itself. If they followed the map and didn't get to the new world, then that would speak for itself. And that's what scientists should do. We should create maps of the things we find for other scientists to follow, and then let the destination speak for itself when they get there.
Now scientists are actually pretty good at this, and if you look at the typical scientific report, the thing where you start with your hypothesis, you talk about your methods, materials, and then the results, that is a very good map that goes from the questions you're asking to the data that you collect. And back in the days when data was almost synonymous with conclusions, that sufficed. But now in the days of data science, there's a second arc in the journey. You have to go from your data to your conclusions, and it's not straightforward. You need to provide your fellow scientists a map that can take them along the way that you went.
You can see why this is sort of difficult, because a map that other scientists could use to reproduce this journey would require some tough things. They would need your data. They would need the code you use to analyze your data, and they would need the software required to run that code, and also they need the reasoning that you use to make every decision along the way and to interpret the results at the end. It's an impossible-sounding combination of things.
But this is exactly what R Markdown lets you put in a single report. R Markdown is a free R package that provides an authoring format for data science. It's a plain text document that lets you put all of those things in one place: your data, your code, your results, and your reasoning.
R Markdown demo
So I don't really have much time to give you a demo, but let me give you a taste of R Markdown. This is a story about a scientist named Bill who hasn't updated his headshot in a very long time. His colleague, Virginia, asks him to solve a problem for their company. They work for a brewing factory, and they want to know if the bottling machine is malfunctioning. They've heard reports that less than 750 milliliters of beer is being put into each of their bottles, so Virginia asks Bill to go check it out.
He collects some data right off the assembly line, puts it into a spreadsheet, and then he opens an R Markdown file, a plain text file. He can add code chunks to the file. He can run the code chunks as if it were a Jupyter Notebook or something, if he wants to see the results inline. So this is the graph that he's making here. And then he can put text in between the code chunks that explains what he's doing. So he's saying, look, this is what I'm doing, this is the method I'm using. And he can even embed R code inline in his text, so that the text updates depending on the results of his code chunks.
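A minimal version of Bill's file might look like the sketch below. The file name, data file, and column name are hypothetical; the structure (YAML header, code chunks, inline R) is standard R Markdown.

````markdown
---
title: "Bottling Machine Check"
author: "Bill"
output: pdf_document
---

We suspect the machine is filling bottles below the 750 ml target.

```{r}
# Read the measurements Bill collected off the line
fills <- read.csv("fills.csv")
hist(fills$ml)
test <- t.test(fills$ml, mu = 750, alternative = "less")
```

The mean fill volume was `r round(mean(fills$ml), 1)` ml
(p = `r signif(test$p.value, 2)`).
````

The inline `` `r ` `` expressions are the part that makes the prose recompute along with the analysis: if the data change, the reported numbers change with them.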
So he does this. The plain text document is very easy to put on GitHub, very easy to diff, see what's changed. Not too pretty. But the point of R Markdown is that you can then publish results to an impressive format, almost any format you like. For example, you could use this document, R Markdown will run all the code, use the code and the text to make a PDF that you can then pass on to the person you want to impress. Or a PowerPoint. Or a book. Or a blog post. Or a poster.
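In code, publishing that same plain text source to different formats is a single function call to the rmarkdown package. The file name here is hypothetical; the output formats are ones rmarkdown ships with.

```r
library(rmarkdown)

# One source document, several destinations
render("report.Rmd", output_format = "pdf_document")
render("report.Rmd", output_format = "powerpoint_presentation")
render("report.Rmd", output_format = "html_document")
```

Each call reruns all the code chunks from a clean state and weaves the results into the chosen output, which is what makes the rendered document a faithful map of the analysis.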
But everything you need to create the map to make your research reproducible, and therefore effectively replicable, is there in that R Markdown document. So that's a taste of R Markdown. What I'd like you to do, if you haven't used R Markdown before, is try it out, experience it for yourself, and make your research reproducible. Because it's good science, it's good data science, and it's good business. Thank you very much.
Q&A
Is there functionality in R Markdown to track changes across different versions when something changes? Because what if you can't reproduce the result? What if something changed in your data and you get a different set of results? How is that tracked in the R Markdown framework?
R Markdown itself is just the authoring format, but its existence allows peripheral software to take advantage of it. So the obvious answer would be any version control system you could use to track the R Markdown file. But RStudio has also developed a publishing platform called RStudio Connect for R Markdown documents and other documents that contain code that can be rerun. With RStudio Connect, you can publish your R Markdown document and then schedule it to run on a recurring basis, and change the parameters right there without having to rewrite code. You've seen it in Tarif's keynote and maybe other talks around here, too. We're putting more and more features into that, and I think tracking is one of those features. I could be wrong. If not, I would expect it to come in the future. Until then, in the meantime, you can definitely use Git and other version control systems.
What you can keep in your Git repository is the PDF output that contains the results, or the HTML output, or whatever your output happens to be.
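The parameterized, schedulable reruns mentioned in this answer work by declaring params in the document's YAML header. A sketch, with hypothetical parameter names and file names:

````markdown
---
title: "Bottling Machine Check"
output: pdf_document
params:
  target_ml: 750
  data_file: "fills.csv"
---

```{r}
# The analysis reads its inputs from params, not hard-coded values
fills <- read.csv(params$data_file)
t.test(fills$ml, mu = params$target_ml, alternative = "less")
```
````

You can then rerun the same document against new inputs without editing any code, for example `rmarkdown::render("report.Rmd", params = list(data_file = "fills_march.csv"))`, which is exactly what makes the report schedulable.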
