
Instant Impact: Developing {docorator} to Simplify R Adoption for Teams (Becca Krouse, GSK)
Speaker(s): Becca Krouse
Abstract: Although R supports comprehensive analysis workflows, creating polished, production-ready PDFs directly from R remained a challenge for our pharma teams. With teams facing looming deadlines, our R enablement team swiftly created {docorator}, an open-source R package that transforms R-based tables and figures into production-level PDFs. By adorning results with “decorations” like headers, footers, and page numbers, {docorator} produces seamless, polished documents. Powered by Quarto, it also auto-sizes {gt} tables for user ease. Attendees will learn how {docorator} became the missing piece in GSK’s R workflows and learn how focusing on quick, simple solutions can have a lasting impact.
posit::conf(2025)
Transcript
This transcript was generated automatically and may contain errors.
Okay. Thank you all for being here today. I'm excited to share an R package with you all that we've been working on called Docorator.
So I'll start by saying that this presentation is my personal opinion and not that of my organization.
So in my house lately, my son has taken a liking to home renovation shows, and especially this one called Fixer Upper. I don't know if you guys have seen it before, but if you haven't, the premise is that the hosts, Chip and Joanna, they work with a different set of clients in every episode, and they help them pick out a house that's pretty worse for wear, and they spend the episode redesigning, giving it a facelift, gutting out some pieces of it until it becomes a total dream home in the end. And it's kind of addicting, even though pretty much every episode has the same storyline with that transformation.
R adoption in pharma
And so I work in Pharma, where my role is to help enable teams and people using R for the first time, and even people who are a little further along. So just helping people on their open source journey. And more broadly, Pharma is going through a pretty significant movement towards open source adoption. It's pretty cool to see. Lots and lots of organizations are getting on board. We had a summit meeting on Monday, where everyone kind of came in and talked about challenges together. So a lot of commonalities there.
People, I'm sorry, organizations are going through this process, and they're not only upskilling, but they're also looking at their legacy systems and tooling and seeing, okay, what do we need to change? What do we need to revamp and transform here? But yeah, a lot of commonalities, and because of this, we have a lot of community efforts towards shared tooling.
So a few years ago, the Pharmaverse came to be. And the Pharmaverse is a collection of R packages. It's like an ecosystem of R packages that cover the routine milestones that many pharma companies have to cover, like standard data transformations, all the way through to standard-looking tables and figures, just the routine steps. So over the past few years, a lot of progress has been made. And today, lots of organizations are well-poised to cover their bases with these packages. And they work together in conjunction with the Tidyverse and beyond, which is why you see them all kind of coexisting here.
But the Pharmaverse isn't static. It's ever-evolving, it's expanding all the time because there's a lot of different needs popping up, different use cases. So there's bound to be gaps discovered, like the one that we discovered for production outputs, in particular, PDFs. And these are not just any PDF. They are rooted in years and years of tradition and regulatory expectations. They have really finicky formatting and just a traditional appearance. And out of that whole reporting process, this is the thing that lands on the desks of stakeholders. It's what they really care about, feeling very comfortable and familiar, like they're sitting at home by the fire, right?
Introducing Docorator
So that's the job of Docorator. It's an R package that helps take an R-based display, whether it's a table or a figure or a lot of tables and figures, and puts them into a very traditional-looking output.
And so that goal of feeling like home, that's the goal in these home renovation shows. You wanna have a dream home in the end that the client is super excited about. But it's not all smooth sailing. There's typically a little bit of drama along the way where there's some unexpected mold or maybe there's a lot of duct issues. Lots of different challenges pop up. And there's still the original budget, the original timelines that need to be met. And when we discovered the gap for production outputs, we hadn't necessarily budgeted for that gap. And when you have study teams who are trying to meet their own deadlines, if there's a gap, they will feel it and they will try to solve it themselves. And all of a sudden, you have 10 different teams creating 10 different workarounds, right? And it gets kind of crazy.
Requirements and tooling choices
So time was of the essence here. So we took a look at what are the basic requirements, what are the absolute musts that we need to cover if we're building a solution, and what tooling is already available for us to help us out.
So taking a look at our traditional looking output, first and foremost, gt compatibility. So gt, the R package for making tables developed by Rich from Posit. This is really, really important to Pharma. We use it a lot for tables. It's very, very powerful with formatting. So whatever solution we created, we wanted to make sure it worked well with gt. Didn't want to disrupt that.
Headers and footers. These are really, really important, part of this whole traditional appearance and traceability. So we want to know what study this table came from, for instance, when the display was created, things like that. We want to have control over where this metadata goes as well.
And finally, this one may seem obvious, but we want a workflow that's based in R. This is for a few reasons. One is we have some teams who are very, very new to R, so we don't want to throw another thing in. We want to just kind of keep it consistent. The other is we're, as developers, we already have some tooling we're using and developing in R. So if we're trying to build something very, very quickly, we want to minimize the amount of change that we're making for ourselves and make it easy to develop and maintain.
So, gt. I took this image from the website here. It just shows the workflow for gt. I think many of us know it by now, but you start with your input data, and then you work with it in R to create a nice formatted table using the gt package, and you end up with an object that's in your R session that you're just kind of looking at in the RStudio viewer, and you can then save it out to a document if you like. And there's a few different choices.
So we kind of thought, okay, what's going to get us closest to PDF? And if we consider the first option listed there, HTML, Pharma's not quite ready for HTML yet. We're still making our paginated displays. You could also, you could take an image, like a screenshot of an HTML table and have a nice little pretty picture of it and put it in a PDF, but traceability is very important for us. So we wanted to make sure our values were embedded in that PDF. So HTML was unfortunately off the table.
RTF is something that has a pretty rich history in Pharma, so you could imagine that being an option, but going from RTF to PDF is not, it's not easy or smooth. It's kind of a manual step. We don't want manual. So that leaves us with LaTeX, and we just heard about LaTeX from Keaton. That was a nice primer. But LaTeX is a typesetting system that has been around for a long time, and once you get things into LaTeX, you're not quite to PDF, but you're very, very close. It just needs to be compiled, and that's a pretty, pretty simple step. So LaTeX puts us in the best position for PDF here.
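As a rough illustration of that LaTeX route, gt can emit LaTeX source for a table through its `as_latex()` function. This is a minimal sketch using gt's bundled `exibble` example data, not code shown in the talk, and compiling the result to a PDF remains a separate step.

```r
library(gt)

# Build a small gt table from gt's bundled example data
tbl <- gt(head(exibble, 3))

# Convert the table to LaTeX source (compiling to PDF happens elsewhere)
latex_src <- as.character(as_latex(tbl))

# Peek at the start of the generated LaTeX
cat(substr(latex_src, 1, 300))
```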
Speaking of LaTeX, there's a lot of libraries in LaTeX. gt uses a few of those to format those tables, so we took a look at what else might be available there, and you saw it flash up on the screen in the last presentation, but fancyhdr is a pretty flexible LaTeX library for doing our headers and footers in different places, in the top and bottom of the page, with controlling all of those bits. So a really, really promising option for us.
And making it based in R. So we're already, our teams are already developing their tables in R, their figures in R, they're doing it in the R script. How are we gonna get them into the PDF? And how does LaTeX come into play? Well, that's where R Markdown and Quarto come in. Instead of an R script, we can use these tools to render a PDF directly through the PDF formatting engine, and under the hood, it's doing the LaTeX, and so it makes it very easy for us to also throw in some LaTeX as well.
Building the package
So our basic requirements are met here, and that's great, we have kind of all the pieces, right? But there's just one catch. It's that our teams are, you know, I've mentioned this a few times, they may be learning a lot for the first time. And asking them to change from using an R script and running things that way to all of a sudden opening up a QMD and doing things a different way is a change in process. And that's not necessarily something that's trivial to implement when there's already a lot of change going on. So we don't want these teams to have to do a bunch of maneuvers to use our solution, which is why we created Docorator.
We wrapped it all up, we took those bits and pieces, the R Markdown and Quarto, the fancyhdr, the gt LaTeX conversion, all of those pieces could come together and live in a package. And this is a parameterized report, so we're able to pass in that crucial metadata that can change from study to study and make it all happen dynamically.
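To make the idea of a parameterized report concrete, here is a minimal sketch of rendering a Quarto template from a plain R script while overriding a parameter. The template, the `study` parameter, and the file name are hypothetical and are not Docorator's actual internals. The render step assumes the quarto R package, the Quarto CLI, and a LaTeX installation are available.

```r
# Write a tiny parameterized Quarto template (hypothetical, for illustration only)
fence <- strrep("`", 3)
qmd <- file.path(tempdir(), "display_template.qmd")
writeLines(c(
  "---",
  "format: pdf",
  "params:",
  "  study: \"placeholder\"",
  "---",
  "",
  paste0(fence, "{r}"),
  "#| echo: false",
  "cat(\"Study:\", params$study)",
  fence
), qmd)

# Render from a plain R script, overriding the parameter for this study.
# Skipped when Quarto is missing; try() keeps the script going if LaTeX is absent.
if (requireNamespace("quarto", quietly = TRUE) && !is.null(quarto::quarto_path())) {
  try(quarto::quarto_render(qmd, execute_params = list(study = "Study ABC-123")))
}
```

Wrapping this kind of call inside a package function is what lets users stay in their R script instead of opening a QMD themselves.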
So in practice, it looks something like this, where we have our original R script, and we can, with a couple extra lines of code, pass or pipe our display right into the Docorator function calls and get a PDF on the other side.
If we look a little closer at these function calls, we have two here. There's an as_docorator function and a render_pdf. So as_docorator holds everything about our display. The display itself, that gt table, as well as important metadata, like the headers and footers and display name and sizing things I don't show right here, but lots of key metadata that gets pulled together and then passed to the rendering engine.
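The slide itself is not part of this transcript, so the following is a hedged sketch of the two calls described, written as `as_docorator()` and `render_pdf()` in the docorator package. The argument names `display_name` and `display_loc`, and the assumption that a default header is applied, reflect my reading of the package documentation and should be checked there. The table, file stem, and output folder are made up for illustration, and rendering needs Quarto and LaTeX installed.

```r
library(gt)
library(docorator)

out_dir <- tempdir()  # hypothetical output location

# Build the display as usual in an R script
tbl <- gt(head(exibble, 5))

# Wrap the display with its metadata, then render it to PDF
tbl |>
  as_docorator(
    display_name = "demo_table",  # hypothetical file stem
    display_loc  = out_dir
  ) |>
  render_pdf()
```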
Headers, footers, and flexibility
Because headers and footers are so important, I wanted to zoom in even closer. Here is our original display, just blown up a little bit, so we can see the headers more easily. So we have a couple lines of headers here that we can specify through the header argument in as_docorator, and there we use a fancyhead function, and this is totally an interface to the fancyhdr LaTeX library. And fancyhead takes a series of fancyrows as an argument, so row by row in the header, you can say, what's my left, center, and right? And so here, we want something on the left, the study name, we want some automatic page numbering on the right through our little helper. In the second row, we just want something on the left, the population, so we leave everything else empty. And then in the final row, we want to put our centered table title.
And so users can do this for the footer as well, it's a footer argument in as_docorator, and a fancyfoot function instead, but all the fancyrow stuff is the same, and they can just add in as many rows as they want within reason, because it will start to kind of encroach on the display area.
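The three-row header and the footer described above might be written roughly as follows. This is a sketch: the function names `fancyhead()`, `fancyfoot()`, `fancyrow()`, and the page-number helper `doc_pagenum()` follow my understanding of the docorator package, and the study name, population, title, and footnote text are invented placeholders.

```r
library(docorator)

# Three header rows, matching the layout described above
hdr <- fancyhead(
  fancyrow(left = "Study ABC-123", center = NA, right = doc_pagenum()),
  fancyrow(left = "Safety Population", center = NA, right = NA),
  fancyrow(left = NA, center = "Table 1. Demographics", right = NA)
)

# Footer rows use the same fancyrow() building block
ftr <- fancyfoot(
  fancyrow(left = "Source: hypothetical footnote text", center = NA, right = NA)
)
```

These objects would then be supplied as `header = hdr` and `footer = ftr` in the `as_docorator()` call.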
And this kind of flexibility is super important. We know that our user, when we went out to build the solution, we knew there was gonna be a lot of different variations in formatting coming at the tool, and if you've ever built a tool and rolled it out before, you probably know that people will start to throw things at it that you didn't expect, and it could break easily, so we wanted to just avoid that situation as best we could. When we first did a release, we tried to cover the essentials, but make it easy to adapt as people gave us feedback, and we really tried to listen to them early and often.
And something that has come up, and we tried to get ahead of it quickly, is accommodating lots and lots of data, lots of columns, many pages. There's a lot of data to display, and so we added some automated scaling for gt tables in particular. If a lot of columns are coming through, we'll try to help make sure everything fits to the width on the page, and also help with some proportional sizing of the left group and row data, values that tend to be very text-heavy, and then the numeric values in the rest of the columns, just try to help kind of even things out. If a user needs finer control, they can certainly do that through gt directly, and Docorator will kind of assess the situation, and will spit out some messages if things look a little, look like they might be a little bit of a weird result. So we hope to build this out even further, just to keep it nice and user-friendly.
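For the finer control mentioned above, column widths can be set on the gt side with `cols_width()`. This is a small sketch using gt's `exibble` data, with percentages chosen arbitrarily. How Docorator's automated scaling treats widths set this way is not covered in the talk, so treat the interaction as something to verify.

```r
library(gt)

# Give the text-heavy column more room than the numeric columns
tbl <- exibble[, c("char", "num", "currency")] |>
  gt() |>
  cols_width(
    char ~ pct(40),
    everything() ~ pct(30)
  )
```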
Evolving the framework
We also realized pretty early on that our framework was generalizable, and useful in other scenarios for other document types. So we care a lot about PDFs, but there's other things that need to be made, not just for us, but for others as well, and so we started with just a render_pdf function that was kind of all Docorator did in the beginning, and we decided to break it into pieces, and that's where the as_docorator function came to be. So we split up Docorator creation and rendering, and produced an intermediate Docorator object, and optionally a permanent file as well. And so that holds the display itself, and all of that important metadata that can be picked up and reused in different templates. So we have a couple templates now, the original PDF, the RTF, which is kind of very basic compared to the PDF, but it's in a position now, the framework, that it could expand to future output types, just need the template for it.
The ability to evolve is really, really important as well, and so we wanna be able to take our framework and adapt it as technologies change, as new things come in, much like if you're doing the kitchen part of your home renovation, you might make certain aesthetic choices today that you, you know, the white countertops and cabinets and everything. Five, 10 years from now, that may not be what you want anymore, there may be totally different trends. And so with Quarto in particular, we're well-poised to adapt to these trends.
Something like Docorator for Python may be useful one day, and so it's an easier transition to building something like that. And, you know, today we're using LaTeX, which is a little bit finicky with the syntax, it comes with the burden of system dependencies as well, so like Keaton, we're looking forward to the capabilities of Typst, it's a little bit of early days with gt, and because that's so important to us, we're not quite there, but, you know, looking forward to that potentially being a better option down the line, and having, you know, just a smoother maintenance process as well.
And something I'm pretty excited about with Typst is branding, you know, I've talked up a lot about how we care about that traditional appearance, but hopefully that will evolve as well, right? And so with branding, we can take things like our fancy dashboards, and we can do something similar between our PDFs and our dashboards and our websites and things like that, and maybe one day our stakeholders will be very excited for pumpkin spice season, and they'll wanna do something fun like this.
But all that to say, Docorator has been kind of the missing piece of the puzzle, the final proof of confidence that R can really be used end to end and create that final output. So it's been very, very useful for us, and, you know, we noticed the gap in open source, and so it was important for us to fill it back into the open source. You can find Docorator on our GitHub page, here's the website, we're working on a CRAN submission now, and likely a Pharmaverse submission as well to contribute that back. So thank you all, and happy to answer any questions.
It will just print page by page. And you can also fuss with the bigger dimensions through Docorator.
Then, is the goal of Docorator to get Pharma to use R and eventually switch to a different reporting tool? Well, it helps with R adoption, certainly. So, you know, like I said, it's that missing piece and the end to end. We talk about end to end, we did an end to end workshop yesterday, about like, here's the pieces and putting them all together and five or so years ago, that was pretty, pretty early on. And I don't think we had all those pieces. So it's exciting to have all the evolution.
