Resources

Mike K Smith | Using rmarkdown and parameterised reports | RStudio

My brain is lazy, shallow and easily distracted. Learn how I use notebooks to keep my present self organised and my future self up to speed with what I was thinking months ago, and how I use parameterised reports to share results with both quantitative and non-quantitative audiences across multiple endpoints. I can update and render outputs for a variety of audiences from a single markdown notebook or report. I'll show you how I organise my work using the Tidyverse, how I use child documents with parameterisation, and how this is served out to my colleagues via RStudio Connect.

About Mike K Smith: I have 25 years' experience of working in the pharmaceutical industry (Pfizer), with more than 15 years working on modelling and simulation projects. I am a keen advocate of smarter drug development, with a particular interest in Bayesian methods, dose-response, reproducible research and knowledge management. My particular expertise is in the use of simulation methodology to predict drug outcomes, find efficient trial designs, assess decision criteria and evaluate analysis methodologies. My current role at Pfizer is as a specialist in computation and modelling solutions, evaluating and deploying new tools and training colleagues. I am an RStudio certified tidyverse trainer.


Transcript

This transcript was generated automatically and may contain errors.

So, my name is Mike Smith, I work at Pfizer in the UK, and I'm very pleased to be here to talk to you today.

Because I realise that people are lazy and easily distracted, I'm going to give you the summary of the whole talk right up front. I use parameterised markdown reports with child documents to write up an exploratory analysis which I shared with some colleagues, some of whom were quantitative and some of whom were not. The quantitative guys were the statistician on the team and clinical pharmacologists from my department, including my manager; there were also non-quantitative guys, and it's maybe unfair to call a clinician non-quantitative, but I'm going to. So this report was intended to serve all of those purposes.

The analysis I'm going to show today doesn't use that real data, for confidentiality reasons, but the data has very similar properties.

But what I really want to talk to you today is about cutlery drawers, and what they say about you. Now, these cutlery drawers are from people that you might bump into at this conference, and I think it's an interesting exercise in how people arrange things, structure things, whether they do it by size, whether they do it by most frequently used, the differences between perhaps US and European cutlery drawers.

But here's mine. If you visit my house, you'll see my cutlery drawer. My wife will also be very surprised that you're visiting the house, and quite alarmed that you just want to see the cutlery drawer.

Now, I did the tidyverse Train the Trainer course, and thanks to Greg and Garrett for teaching on that, and I'm now taking every opportunity to have a learning experience for the tidyverse. So, if you take cutlery drawer, you might want to group by type. You want to gather those things together and arrange, and if you could do that from my cutlery drawer, I would be really happy.

New hashtag for the conference is untidyverse. And I'm really pleased that XKCD views the need for tidiness in the house in very similar ways. You can get Wi-Fi in a laptop. You can put all your other possessions in a big bucket marked miscellaneous.

The lazy, distracted brain problem

Okay, here's the premise for my talk. My brain is lazy, shallow, and very easily distracted. I hesitate to say that your brain is the same, but then we had Joe Cheng up here telling us all that our intuition sucks. So I think some people in this audience must have a brain that's lazy, shallow, and easily distracted.

My brain is lazy, shallow, and very easily distracted.

Now, we're all familiar with this plot from the R for Data Science book, and it gives us a framework or a structure for understanding any analysis. And on the Not So Standard Deviations podcast recently, there was a good discussion about this, and they came to the conclusion that really this is just a framework. It's a mental model. Real data analyses don't necessarily follow all of these steps, or in this sequence.

So let me explain to you something about my data analysis and how it works in practice. And I'm sure that these experiences I'm going to relate are just for me alone. So it's not you guys, you have your stuff well together, this is just me.

So I go and I get an email from my colleague who has a link to the data source. I download that data source, I read it into R, I wrangle the data, and I plot it. And then there's some time to step back, to reflect on my plot, to think about what it looks like, so that after lunch I can come back and make better plots. And I might fit a preliminary model to the data, hey, it's the end of the day, it's time to go home.

So the next day I come in and I find an email saying that the team has found an error in the data, so here's the new version of the data. And if your workflow is not reproducible, you're in a world of hurt here. So because I'm reproducible, I go to my markdown, I change the input data, I recompile my report, I see what differences there are, I check against the previous version, and we're good to go on.

I discuss the findings with my boss, I circulate the report, and my job is done.

Okay, the other thing is, did anyone notice that the transform and visualize bits from the R for data science diagram were back to front and swapped over? Cognitive load theory score one.

So anyway, six months passed. This is not an exaggeration. I got an email yesterday from my manager saying, you know that team and the work that you did for them? They're interested in your results, can you dig them out and share them? Now if you're anything like me, what happens is you pop open your R script or something like that, and you think, what the heck did I do, right?

rmarkdown notebooks to the rescue

To the rescue, for me, come rmarkdown and notebooks, right? These guys are now saving my life. I love notebooks. Thank you, Yihui and the team who are working on notebooks. And whenever you're writing, you need to think about the audience you're writing for. Well, for me, when I write a notebook, it's for distracted me: the one that is hopping between activities, not focusing 24-7 on my nice little analysis, but with a billion things to do.

It's also the future me, the six months later me that pops open the Markdown document and goes, all right, I'm good, because I've got my code, I've got the explanation, I can see the outputs. I've even got text that explains what I was thinking. Hurrah.

But here's another audience for your reports. It could be a quantitative colleague who wants to see code, who wants to see data, who wants to see your assumptions, who wants to dig into the residual plots from your model. I can guarantee you the non-quantitative people won't want to see that. So it's good if you can balance the two, and have techniques that allow you to hide the bits that the non-quantitative people don't want to see, while still letting the quantitative people get what they want from the report.

So, back to notebooks just for a second. I realise, and again, thank you Yihui, that there is a debate here. But for analysis, if you're writing up an analysis, I think notebooks are fantastic. And back to Garrett's point about markdown: if you're writing more comments than code, then you should be using rmarkdown and notebooks. If you're writing more code than comments, write more comments and use rmarkdown.

But, if you're not writing for analysis, then I really recommend you go and read this blog post because it's very interesting.

Parameterised reports

Also, because I'm lazy, I knew my manager would come back and ask for the same analysis across the three endpoints that are in my data. So I'm trying to set up my report to serve quantitative and non-quantitative audiences, for three different endpoints.

So, if you have to paste your code more than three times, what should you do? Write a function. If you have to perform an analysis across more than three endpoints, what do you do? Do you write multiple rmarkdown reports? No: parameterised reports. Thanks very much. Otherwise, why would you be here? I mean, honestly, I'm not that entertaining.

Anyway, back to XKCD. XKCD talks about automation. And the top panel talks about the theory of automation. It's a bit like the theory of data science. It's the mental model for what we all hope will happen. We automate things. We can get on and we can do completely different work. The reality is that when you automate things and try to be clever, often you wind up troubleshooting the thing that's gone wrong.

But here's the thing that makes all of this work, is the YAML header parameters. If you're familiar with rmarkdown, then you'll know about the YAML header. If you're not familiar with rmarkdown, this may be a bit of a stretch. But bear with me.

So the YAML header says something about the document. It says the title, the author, the date it was done. And it says something about the formatting for the output. So here it's an HTML document and it has certain attributes. The bit I've highlighted in red is the bit we need to focus on. These are parameters that you can pass into your document. And it comes into your document, you know, like an R object that you can then use. You can compute on. You can do all kinds of clever things with it.

So here in my document, in my header, I've got a parameter which is called endpoint. It has a default value of HamDTL17. But it also has three distinct choices. So it's not just a free text. You have to choose one of those three options. Also I've got a Boolean parameter which is called quantitative audience. And if that is true, then I'm going to include a bunch of other stuff. But if it's false, then this is a stripped back report just for the non-quantitative audience.
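As a sketch of what such a header can look like (the talk only names the default endpoint, so the other two choices below are placeholder names), the parameters are declared in the YAML like this:

```yaml
---
title: "Exploratory analysis"
author: "Mike K Smith"
output: html_document
params:
  endpoint:
    label: "Endpoint"
    value: HamDTL17
    input: select
    choices: [HamDTL17, endpoint_2, endpoint_3]  # last two names are placeholders
  quantitative:
    label: "Quantitative audience?"
    value: TRUE
---
```

The `input: select` plus `choices` fields restrict the endpoint to the listed options rather than free text, and a logical `value` becomes a checkbox in the "Knit with Parameters" dialog.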

Now, the specification of those is very close to what you might do with shiny inputs, if you're familiar with them. So then when you're ready, you can choose to knit that document. If you just hit the knit button, you'll get the default values. But if you choose to knit with parameters, then a little pop-up box comes up, a bit like shiny. You choose your endpoint, you say whether you're in the quantitative audience, and you can knit your document.

Now, parameters are really cool because you can do things with them. So here I've embedded the parameter endpoint, that is `params$endpoint`, and that value gets pasted into the text. So then you can talk about the endpoint for the report you're writing. You can pass the parameter in and use it within the markdown text. You can use it in the header or the axis labels for the ggplot. And so it can be used in a variety of ways, sky's the limit, okay?
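In rmarkdown the parameters arrive in a list called `params`, so a sketch of that inline and ggplot usage (the data frame and variable names here are hypothetical, not from the talk) might look like:

````
The endpoint for this report is `r params$endpoint`.

```{r endpoint-plot}
library(ggplot2)

# params$endpoint feeds the title and axis label directly
ggplot(analysis_data, aes(x = time, y = outcome)) +
  geom_point() +
  labs(title = paste("Exploratory analysis of", params$endpoint),
       y = params$endpoint)
```
````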

But the other thing that's kind of cool is that you can use the parameters in determination of whether the code chunk runs or not. So here I'm saying if it's not the quantitative audience, in other words, if it's the non-quantitative audience, then hide all the code, right?
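One way to do that (a sketch of the pattern, not necessarily the talk's exact source) is a setup chunk that drives the `echo` chunk option globally from the parameter:

````
```{r setup, include=FALSE}
# Show code only when knitting for the quantitative audience;
# for everyone else, all code chunks are hidden from the output
knitr::opts_chunk$set(echo = params$quantitative)
```
````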

The other thing that makes this work is in the top box: what I did was to rename the endpoint column in my data with the value of the parameter endpoint, so that downstream I can fit a linear model on something called outcome and not on something called `params$endpoint`, okay? It just cleans up the code and makes things easier to see further down.
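A minimal sketch of that renaming step, assuming a data frame `raw_data` with one column per endpoint (the data and model variable names here are hypothetical):

````
```{r rename-endpoint}
library(dplyr)

# The column named by the endpoint parameter becomes plain "outcome"
analysis_data <- raw_data %>%
  rename(outcome = all_of(params$endpoint))

# Downstream code refers to "outcome", whatever endpoint was chosen
fit <- lm(outcome ~ time, data = analysis_data)
```
````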

Child documents and RStudio Connect

The other thing that I've used is a child document, because if you remember when I was presenting this to the quantitative audience, I want to pull in some extra information. The code chunks will run or not according to the settings here, but you might want some additional bit of text that comes in and says, okay, for the quantitative guys, here's what I'm doing, and that's where the child document comes in. They're just plain text, rmarkdown files, and those guys get pulled in if needed.
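Conditionally pulling in a child document can be done with knitr's `child` chunk option, gated on the parameter (the file name here is a placeholder):

````
```{r, child = "quantitative-details.Rmd", eval = params$quantitative}
```
````

When `eval` is FALSE, the child document's text is simply left out of the rendered report.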

So for the quantitative audience, what you see here is that we're showing the code, I've run the chunk, you can see the data, and the bit at the bottom that says data manipulations is that child document text.

Now, Tareef kind of stole some of my thunder in his opening talk, but that's okay, he's the president, that's fine. What you can do in RStudio Connect is go across to the left-hand side where it says input, pop that open, and it will have the same ability to select parameters and define how the report is going to run. But the other nice thing is that if you do that and save those runs as named variants, then in RStudio Connect you've got a little drop-down menu of pre-compiled reports.

So what you might want to do is set up the commonly used ones. Here it was all right because there are only six reports in total that could be produced, but the point is that if someone has already run a report, you can just quickly grab it and it doesn't have to recompile.

More on parameterisation

Okay, so more about parameterisation. What we saw was that when we rendered the report, you go to render with parameters or knit with parameters, but from the command line you can pass in your parameters through a list, and it's really straightforward. The second thing is you might want to change your analysis depending on which endpoint you're looking at. If you're looking at a categorical outcome, you're not going to want to fit a linear model, or at least you shouldn't.
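From the console, that render call looks something like the following (the file name and the output file name are my own illustrative choices):

```r
rmarkdown::render(
  "report.Rmd",
  params = list(endpoint = "HamDTL17", quantitative = FALSE),
  output_file = "report_HamDTL17_nonquant.html"
)
```

Looping that call over each endpoint and audience combination is how you get all the variants from a single source document.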

So in that case you may want to tailor your analysis depending on the parameter endpoint, but again, it's just another thing that you can compute on within a chunk. Also, if something goes wrong with your analysis, then if you're handling the error appropriately, using tryCatch or something like that, you can compute on that and have a child document that says something helpful like: something's gone wrong, contact your friendly data scientist, here are his details, okay?
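A sketch of that error handling, with hypothetical model, data and child-document names:

````
```{r fit-model}
# Catch any modelling error rather than letting the render abort
fit <- tryCatch(
  lm(outcome ~ dose, data = analysis_data),  # hypothetical model and data
  error = function(e) NULL
)
```

```{r, child = "analysis-failed.Rmd", eval = is.null(fit)}
```
````

Here `analysis-failed.Rmd` is a placeholder child document holding the "contact your friendly data scientist" text, pulled in only when the fit failed.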

So the last thing, back to XKCD, is: how long should you spend parameterising your report and setting things up? Now, this is kind of alarming. I know that JD Long has also talked about this graph today, but the thing I want to impress on you is that this is over five years. So if you spend a day, or two days, or five days sorting this out, and only two people are going to use it once every six months, maybe I wasted my time. But on the other hand, I got a conference talk out of it, so that's good.

Anyway, thank you very much for your attention. Feel free to ask questions.

Q&A

All right, thank you very much, Mike. We have time for questions before the break for lunch. Hands, please.

Can I ask one myself? So you're building all of this wonderful machinery and then you move on and somebody else has to maintain it. How many of the people that you work with understand and value the machinery that you've just shown us?

Ooh, that's a tricky question. Not many at present.

And how are you tackling that?

Well, I've just done the train the trainer on the tidyverse and I'm here, so I'm going to go back and evangelise. But yes, it's a problem; we need to try to roll this out and get more and more people familiar with how it works.