
Using Quarto to Improve Formatting/Automate the Generation of Hundreds of Reports (Keaton Wilson)
Speaker(s): Keaton Wilson

Abstract: This presentation showcases how KS&R’s Decision Sciences and Innovation (DSI) team modernized a legacy reporting pipeline to automate and scale custom survey report generation. Using tidyverse and Quarto, the team produced hundreds of personalized PDFs weekly over three months. Hosted on GitHub, the project integrated version control and streamlined collaboration while documentation ensured easy onboarding and adaptability. Attendees will gain insights into automating report workflows, overcoming implementation challenges, integrating custom formatting, and fostering collaboration using tidyverse, Quarto, and GitHub.

Materials: https://github.com/ksrinc/posit_conf_2025_quarto_automation

posit::conf(2025). Subscribe to posit::conf updates: https://posit.co/about/subscription-management/
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
I'm Keaton. I'm a solutions developer at KS&R. We're a market research company, about 40 years old, and we do database work with a variety of clients across a variety of sectors. I feel like this talk is appropriately placed because a lot of the data that we get is from surveys, so I don't have to explain survey data. But I want to talk today a little bit about some work we've done over the past year and a half to leverage Quarto and some data wrangling tools to automate reporting.
Before I do that though, I want to talk about my great grandmother's zucchini pickles. Bear with me. So I'm a bit of a foodie and a collector of recipes. I have a lot of these recipe cards that have been handed down in my family. And they provide a really nice window into history. But they also have some interesting facets about them.
So the first is that they're kind of hard to read. I don't know about you all. I don't read or write cursive on a regular basis very much anymore. And when I first looked at this recipe, I actually thought it called for a half a cup of suet instead of salt. And I think those pickles probably would have turned out a lot differently.
You can also see some self-edits here. My great grandmother is correcting her cursive or her spelling, which is pretty great and pretty interesting to see. There's also some ambiguous instructions, right? What is a fairly fine chop? That means different things to different people. There's retro or hard-to-find components, right? Pickling spice varies by region, right? The same spices in one part of the world in pickling spice might be very different from another. There's some scaling challenges, right? If I wanted to become a zucchini pickle maven and make thousands of gallons of these, this recipe might not scale well. And finally, I can't ask for clarification, right? My great grandmother passed away a number of years ago. I can't call her on the phone and ask her questions about this recipe.
Reporting pipelines as recipes
So why am I talking about zucchini pickles? I would argue that reporting and data pipelines share a lot of similarities with recipes. Often we don't have up-to-date or thorough documentation on our pipelines. If those pipelines don't contain mechanisms for version control, or are based on GUI (graphical user interface) clicking workflows, it's hard to see the history and understand those self-edits. Often we're working in orgs with outdated toolkits, and there are conversations like: why should I fix this? It isn't broken. Why are we spending time updating things? There are scaling challenges; just like scaling up to that thousand-gallon batch of pickles, it's often hard. And sometimes we can't ask for clarification, right? We have team members who leave or move on, and we can't ask them specifics about knowledge that only exists in their heads.
The recipe challenge
So I want to talk a little bit about our recipe challenge that we've been working on. So we have survey data from a variety of global respondents, folks all over the world, for this particular project. Our client wanted PDFs that represent the answers for every respondent in the survey. So one PDF per respondent. We have hundreds of respondents per wave, and that wave is repeated and this project is repeated every year, so we're doing this on a rolling basis. Individual reports when we're in that period where we're collecting data from the surveys need to be generated and passed to the client on a weekly basis.
We had a historic pipeline, but it was built in maybe a bit of a retro toolkit, right? A Microsoft mail merge, not a lot of version control, and it was hard to understand what's going on. That's compounded by the fact that we had a team member leave who had that knowledge in his head, and we can't ask him how the original pipeline was built. So it's a challenge, but it's also an opportunity to rebuild things from scratch.
At KS&R, we're often confronted with really wide SPSS data, lots of columns, so we need to think a little bit about cross-functional data structure and naming conventions, so that when we go from the raw survey data that comes out of our platform, it'll eventually play nice with Quarto and the tools that I think a lot of us in the room are familiar with.
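To give a flavor of the kind of reshaping that implies, here is a minimal sketch of going from a wide, SPSS-style export to a longer, tidier shape with tidyr; the column names and data are hypothetical, not the project's actual conventions (the real code is in the linked GitHub repo).

```r
library(tidyr)

# Hypothetical wide survey export: one row per respondent,
# one column per question
raw <- data.frame(
  respondent_id   = c("r001", "r002"),
  q1_satisfaction = c(4, 5),
  q2_recommend    = c(3, 4)
)

# Reshape to one row per respondent-question pair, which plays
# much more nicely with downstream templating
tidy <- pivot_longer(
  raw,
  cols = starts_with("q"),
  names_to = "question",
  values_to = "response"
)
```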
And finally, we really wanted to de-silo this entire pipeline, right? We don't want knowledge living in the head of one person. We wanted to build replicable infrastructure on GitHub hosted internally and thorough documentation throughout, so multiple team members can engage with the process.
How the pipeline works
Okay. So we have our recipe challenge, right? Let's talk a little bit about the how. I'm going to take a 30,000-foot view of this, mostly because of time constraints. There is a QR code throughout. It'll show up again here in a second, and it'll show up again at the end, where we have a GitHub repo that has a completely anonymized reproducible example for folks who really want to do the deep dive on all the code.
But it starts with raw survey data, and we need to get to something a little closer to the tidy data that we want to work with in our pipeline. This was a people process, right? It involved a lot of collaborative effort with our survey programming team and conversations around why the data needed to be wrangled a bit and reshaped a little longer. Eventually we settled on a format that worked, and they also helped us build a pipeline where this data is automatically generated or updated hourly. So we have a flow of data coming through that we can now put into the pipeline.
So this is the pipeline. A lot of components here. On a given run of this, the data, that sort of semi-wrangled data, kind of tidy data, gets put into a data store, and that data store then gets fed into this main automation script. That main automation script ingests a lot of other things, so we have some helper functions that do more complete data wrangling and deal with some other issues, and then we also have a Quarto template and a YAML header, and those define the basic structure of what the PDF reports should look like, right? And we have a template and a YAML header so we can parameterize some of that data, pass it through the entire system, and create the individual reports.
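As a rough sketch of that template-plus-YAML-header pattern, a parameterized Quarto document might open with something like this; the field names here are illustrative, not the project's actual header.

```yaml
---
title: "Survey Respondent Report"
format: pdf
params:
  respondent_id: "r001"
  wave_date: "2025-06-01"
---
```

The main automation script can then override `params` per respondent at render time, so one shared template produces many individual PDFs.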
A lot of, again, familiar data wrangling tools are leveraged throughout, particularly in those helper functions. The output of a run via the main automation script is that the PDFs get pushed to the report store, and then finally someone, often me, who is doing these runs, moves them to our shared network drive and hands them off to the business team so they can do QA, QC, and eventually deliver them to the client.
Hurdles: automation vs. flexibility
Okay, 30,000 foot view, some basics on overall how it works, but like most things, there's some hurdles, right? Like Julia Child here, confronted with this massive piece of tuna. How do we cut this up? And I want to talk about two hurdles today. The first is a little bit of thought when we were designing this about where we wanted things to sit on the automation and flexibility spectrum. The second is we wanted to update the design.
There were a lot of conversations about how far down that road we wanted to go, if we wanted to update it at all, and then a little bit about the mechanisms of changing that formatting. So early in our conversations with project teams, this quote came up. They said, we really need it to be automated, but we also need to pick and choose which dates we want to run reports for, ad hoc, at a given time. To me, that doesn't sound a lot like automation. It sounds like we need to be somewhere in the middle, right, where we don't have a fully automated solution. Components of it are automated, so it's easier and faster, but we can jump in where we need to and be flexible to get the clients what they need.
And there were a couple of places where I think this reared its head. I'm going to focus on one, and that is the story of date ranges. So I'm going to show just a little bit of code here. Again, this is straight from the reproducible example on GitHub. This is the top of that main automation script. We're reading in the data; here it's a tab-delimited file. We're employing a couple of those custom helper functions, clean_df() and strip_html_tags(), to clean up some of the survey text that we're getting from the platform. And then we're setting up this filter window. And this is pretty straightforward, right? We're just saying, filter the data between this start date and this end date, and pass that forward. But it's a really easy way to give folks across our team, who have more or less experience coding, a really clear ability to go and run this on their own. And to that end, we also use functions from the cli package pretty frequently, providing good context in the console, again thinking about the spectrum of experience that folks have on our team.
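Mirroring that top-of-script pattern, a simplified sketch might look like the following. The helper internals, column names, and data here are stand-ins for illustration, not the repo's actual code.

```r
library(dplyr)
library(cli)

# Hypothetical stand-ins for the custom helper functions mentioned
clean_df <- function(df) { names(df) <- tolower(names(df)); df }
strip_html_tags <- function(x) gsub("<[^>]+>", "", x)

# Hypothetical slice of the semi-wrangled survey data
responses <- data.frame(
  RESPONDENT_ID = c("r001", "r002", "r003"),
  COMPLETED_AT  = as.Date(c("2025-06-02", "2025-06-09", "2025-06-16")),
  COMMENT       = c("<p>Great</p>", "Fine", "<b>Poor</b>")
)

# The ad hoc filter window for this run: easy for any team member to edit
window_start <- as.Date("2025-06-01")
window_end   <- as.Date("2025-06-10")

run_data <- responses |>
  clean_df() |>
  mutate(comment = strip_html_tags(comment)) |>
  filter(completed_at >= window_start, completed_at <= window_end)

# cli gives readable console context for less experienced runners
cli_alert_info("Generating reports for {nrow(run_data)} respondents.")
```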
Design improvements
On to design. So, again, there were a lot of conversations about if we should change the PDF, and if we do want to change it, how we should change it. We ultimately did decide to make some upgrades, and I think that comes down to three reasons. The first is that some minimal upgrades in the formatting and design improved trust and clarity on the client side. They can see that we're engaged with providing a great product for them and continually upgrading their experience. That professional look and those upgrades equal more engagement from them, even on something as simple as this, where it's really just survey responses. And finally, there were internal impacts: better formatted PDFs allow for faster QA and QC internally and a better overall experience for our team.
So, let's look at the before and after. On the left is what clients were getting before. Obviously, I've done some modifications to the content here to anonymize it. And on the right is after. This is not the most beautiful PDF report in the world, by any stretch of the imagination. These are small formatting changes, but we think they've made a real impact. We gave things a little bit of space and breathing room in appropriate places. We added quality-of-life improvements, like subsections, right? You might want to know which portion of the survey you're in and be able to quickly navigate to it. We removed all of the boxing from the legacy PDF formatting, and we added some custom headers and footers. Again, pretty basic quality-of-life improvements, but generally enhancing the design of the PDF.
Okay. So, how did we implement all of those formatting changes? Well, the answer, and again, this isn't meant to be read in detail; if you want to dive into all of this code, it all comes from the GitHub repo. It's a lot of LaTeX. I'm not an expert in LaTeX, and it turns out no one at our company is an expert in LaTeX. So that makes this portion of the pipeline pretty brittle, right? We're trying to de-silo it. We're trying to make sure that it's maintainable into the future. And so one thing that's on our roadmap is converting this over to Typst in the next year, I think. That comes out of great talks at last year's posit::conf about Typst integration with Quarto. That's something we're really excited about, to make these formatting changes easier to understand and maybe de-silo that a little bit.
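For context, LaTeX customization like this is typically injected through the Quarto YAML header along these lines; this is a generic sketch of the mechanism, not the project's actual header.

```yaml
format:
  pdf:
    include-in-header:
      text: |
        \usepackage{fancyhdr}
        \pagestyle{fancy}
        \fancyhead[L]{Survey Respondent Report}
        \fancyfoot[C]{Page \thepage}
```

Part of the appeal of Typst is replacing exactly this kind of opaque LaTeX preamble with more readable styling.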
Results and takeaways
So, take-home story. We saw really big efficiency gains. This is the second time that we've run this pipeline; I think it finished about a month and a half ago. The project teams reported efficiency gains of 50%, which is pretty big. We cut in half the amount of time that they have to spend generating these reports and interacting with the process, which is pretty huge. The design improvements were well received by the client. We had an opportunity to level up our team skills by migrating away from the mail merge approach to a modern toolkit with Quarto and the tidyverse. Consistency was essentially the same, and errors were the same; part of that is because of the speed of the QA and QC with the improved formatting. And then we built collaboration and some shared ownership of this pipeline, again moving it from the head of one person into a replicable, well-documented example.
I think we started with something like this. It works. The concept is great, and there's good bones here. But if we want to scale it and we want to bring it into 2025, we ended up making something like this, where the idea is multiple chefs or cooks can interact with a recipe that's well-documented and replicable. We employed some automation here in the form of a multi-handed chef robot to do some of the hard work for us. And then this allows us to build into the future. Again, shooting towards this goal of the 1,000-gallon jar of pickles.
So a big thank you to a lot of folks, colleagues on the DS&I team, Jamie Favada and Mike C. They are part of our survey programming crew. Really instrumental in, again, that first step of getting the survey data into a place where we could start working with it on our pipeline. And then, of course, the project team, Gunnar and Amy, those folks are the ones working with the clients directly and are seeing the efficiency gains and really instrumental in this whole process.
Again, feel free to connect with me. And if you have any questions on the code, feel free to submit an issue or shoot me a message on GitHub or find me somewhere over the next day or two. And happy to answer any questions.
Q&A
All right. So question number one. Do you have any logic for flagging or removing outlier responses?
A lot of that happens before we get the data. So we have mechanisms in our survey reporting platform. And as part of that wrangling process, that's where a lot of the sort of first step QA, QC happens. So luckily, we didn't have to worry about that too much once we got to our portion of things.
Is this workflow for web-based surveys? They're wondering if you use any validation tooling like Pointblank to flag issues among respondent data.
Yeah, that is a great question. It is from web-based surveys. Our survey platform is Forsta Plus, although, I don't know, surveydown may be in our future. But I think the pointblank validation is a good idea and really, really cool; I think it's an amazing package. I think we just need to figure out where in the entire process would be best to implement it. And again, there's so much data QA, QC that happens before we even get the data.
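For readers curious what such a check might look like, here is a minimal pointblank sketch on hypothetical respondent data; these rules and columns are made up for illustration and are not part of the actual pipeline.

```r
library(pointblank)

# Hypothetical slice of respondent data
survey_tbl <- data.frame(
  respondent_id   = c("r001", "r002"),
  q1_satisfaction = c(4, 5)
)

# Build a validation agent, declare rules, then interrogate the data
agent <- create_agent(tbl = survey_tbl) |>
  col_vals_not_null(columns = respondent_id) |>
  col_vals_between(columns = q1_satisfaction, left = 1, right = 5) |>
  interrogate()
```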
Yeah, if you start looking at the code base, there are opportunities for improvements on the automation front. The way the parameterization works now is by passing through respondent IDs tied to the particular date range we selected. Those get passed through to each individual report: a dynamic set of Quarto files based on the template gets rendered, and those ultimately get rendered into the PDFs. So, short answer: there's some parameterization, and I think more could be done.
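That per-respondent render step could be sketched roughly like this; the template path, parameter name, and filename scheme are hypothetical, not the repo's actual code.

```r
# Hypothetical helper: one output filename per respondent
report_filename <- function(respondent_id) {
  paste0("report_", respondent_id, ".pdf")
}

# Render one PDF per respondent by passing the ID through as a
# Quarto parameter to a shared template
render_reports <- function(ids, template = "report_template.qmd") {
  for (id in ids) {
    quarto::quarto_render(
      input = template,
      output_file = report_filename(id),
      execute_params = list(respondent_id = id)
    )
  }
}
```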
Were there any challenges in convincing the team to switch to Typst?
How are the client reports commented on in the process?
Yeah, so the way that it works is we do an internal QA, QC check of all the reports for a particular batch, right? So, we have eyeballs on every report that gets sent to the client before it goes, and then they're sent in batch. And if there's additional comments, right, that goes back to the business team and may get filtered back to us. I hope that's what they meant about commenting.
Yeah, it's not insanely speedy, but again, we leverage a lot of Posit infrastructure like Workbench, so it can just run in the background. Generally, I would say 100 reports takes about five to seven minutes.
