Resources

Get Your Ducks in a Row with Databases (Melissa Albino Hegeman, NYS OGS) | posit::conf(2025)

Get your ducks in a row... faster Shiny apps with DuckDB

Speaker(s): Melissa Albino Hegeman

Abstract: Our small team tackled sluggish Shiny applications by implementing DuckDB as a cache layer, transforming slow, resource-intensive operations into responsive user experiences. Initially, our application pulled data directly from SQL Server, performing real-time aggregations that resulted in poor response times. By restructuring our pipeline to pre-compute results in DuckDB, we significantly improved performance. This presentation demonstrates how we identified bottlenecks, implemented DuckDB integration, and measured improvements. We'll share practical examples of DuckDB integration with R, discuss trade-offs, and show how this accessible solution can benefit other small teams with limited resources.

posit::conf(2025)

Subscribe to posit::conf updates: https://posit.co/about/subscription-management/


Transcript

This transcript was generated automatically and may contain errors.

Without further ado, let's get your ducks in a row, Faster Shiny Apps with DuckDB.

So if you get nothing else out of this talk, I hope you realize that slow Shiny apps are a lifestyle problem. They're cumbersome, frustrating, and they make people's day worse, but you don't have to deal with it. We solved this problem by caching results in DuckDB, and using that to feed our Shiny app. And we got faster app performance for no infrastructure cost on our end, and very minimal effort.


Background and context

All right, so a little bit about who I am. I work for the New York State Office of General Services. I've been with New York State for over 20 years, but these opinions are my own, and don't reflect New York State policy.

I just moved to the Office of General Services earlier this year, but before that, I was with the Department of Environmental Conservation for most of my time as a marine biologist. And you probably have ideas about what marine biologists do. Everyone gets excited when I say that. They think you swim with dolphins, save whales, or you're out in the field, maybe on a boat surveying surf clams in the Atlantic. And they imagine it's a lot of field-based work, but there is a lot of quantitative work happening in certain sub-disciplines, such as fisheries management and fishery science, which is where I eventually ended up.

I can't believe they found a picture of me smiling on that boat. It was probably when we got back to the dock. I get very seasick, so moving over into more of a data science role worked out really well for me.

And while I was at DEC, I led a few small data teams. And when I say small, I mean fewer than 10 people, all of whom were marine biologists. There was no software development person, no database admin. We were kind of an insular group working on fishery science questions and managing that data for New York. And we always started out with Excel and Access, because that's what you have when you start, and we always outgrew it. It couldn't do the kind of analysis and reports, in a repeatable way, that we were looking for. So we had to figure out something else.

Also, since we're such a small team, there's no off-the-shelf software for fisheries management. Every state does it a little different, so there's no consistency. And that's where R and Shiny came in. Those tools were available to us, and the ability to create our own custom tools was a real game-changer.

But, and there's always a but, as it became more widely used, it eventually started getting sluggish and frustrating, and it wasn't the big success that it was in the beginning. People were getting annoyed all the time. So how do we address that?

The original system and its problems

So to give you an idea about what our system looked like at the beginning, it started out with a SQL Server database. We had to get all our data in one place. We were previously using distributed Access databases, which were terrible because we needed one for each year's worth of data. Nothing fit. Once we got the data in the server, the next part was to create an internal package for us to use. We had to make sure everyone was counting fish the exact same way.

And once that was done, it made sense to make a Shiny front-end so that we didn't have to look at the code all the time. We could view it in a browser.

And that was wonderful for a while. We did have some constraints on us, though.

One, all of our team was running the Shiny app locally. We did not have access to a Shiny server, so everyone had to clone the repo and make sure it ran on their machine. And that was okay. People were using it every day.

Doing the queries, running the reports, and they would get ideas about how it could be better. And they'd be really good ideas. And since we were in control of app development, we got to decide what we wanted to add and what we didn't. And so the cycle continued. In every iteration, the app got better and better.

But eventually, things started to catch up to us. If you've ever waited tables, you're probably familiar with this problem, or maybe it's a problem at your dinner table with your own family. Back in college, I waited tables, and you'd go to your four-top, and someone would say, hey, can I have a glass of water? You'd say, sure, no problem. You'd go get the glass of water, and you'd bring it back. And then the second person would ask for a glass of water. You'd go, you'd get the glass of water, and you'd bring it back. And now the third person also wants a glass of water. And at this point, you're just like, come on, couldn't we have done this all in one shot? So you wise up, and you ask the fourth person at the same time if they want a glass of water, too. And this was a pretty good analogy for how our app was running. It wasn't well thought out, because it was built to do a very small thing to start, and we kept adding all these features.

So in terms of the app, as with all Shiny apps, we had an interface that took inputs and then returned results. And every time you changed an input, that was a new query to the SQL Server.

And we found out each query took like eight seconds. And that's an eternity. I thought about just sitting in silence for eight seconds up here, but I thought it would be too uncomfortable for all of us.

So where was this bottleneck happening? It was with all these calls to the SQL Server. Every time we had a new question, we'd go back to the source to get the answer, recalculate everything, and display results. So too many queries, run too often, and most of them were really unnecessary.

Introducing DuckDB as a cache layer

And this is where DuckDB came in. We decided we should try pre-processing all this data and doing everything in one fell swoop at the beginning, when people opened the app. DuckDB gave us an easy way to create a single file of that data, again, with no cost to us and no additional infrastructure needs. It worked out of the box with everything we were already using. And since it was approved by IT, we could implement it ourselves and didn't have to wait. And if that wasn't enough (and that was enough for us to decide to give this a try), it also added spatial capabilities to our data that weren't available in the SQL Server we were using. So in the future, we had the option to increase some of our QA/QC and mapping capabilities that way.
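For context, the spatial capabilities mentioned come from DuckDB's spatial extension, which can be installed and loaded from R through a plain SQL statement. A minimal sketch (the point coordinates here are invented for illustration, and installing the extension requires internet access):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())

# Install and load the spatial extension (one-time install, then load per session)
dbExecute(con, "INSTALL spatial;")
dbExecute(con, "LOAD spatial;")

# Geometry values come back most readably as well-known text
dbGetQuery(con, "SELECT ST_AsText(ST_Point(-73.76, 42.65)) AS pt")

dbDisconnect(con, shutdown = TRUE)
```

This is what opens the door to QA/QC checks like "is this sampling station actually in the water?" without any extra infrastructure.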

So where does this fit in with the workflow we had? Our SQL Server is still the database of record. That is where all the data entry goes, and that's where all the corrections go when we find mistakes. But it's just really slow. So we wanted our app to call that server only once on opening, and then write a file locally that holds all of that data.

So now, what is in the DuckDB file? It's a series of tables holding the raw, row-level data that the app needs to be able to display. It also has the results of all our calculations and aggregations stored as separate tables. So everything happens immediately when the app is opened, and the app no longer has to do calculations or queries on the fly. We have essentially asked for all four glasses of water at the beginning, when we opened the app.

Implementation details

And I'm just gonna show you a little bit of where we made the changes in the code to make this happen. So we're using pretty standard Shiny dashboard stuff, where we call our libraries at the beginning, and then we run a bunch of source files at the top. I'm gonna talk about the specifics on the next slide, because that's where all the changes are happening. But the header, the body, and the sidebar remain untouched, as do the UI and server functions.

So we didn't have to change anything on that app that people are opening and running. What we did have to change was some of the files and modules we were calling at the beginning of the app. So we have our functions for the app, and we have all our modules, but we wanted to add a script that did that data pull and all those calculations for us, and stored it locally. So we created a very creatively named script called CreateDuckDBDatabase.

And in that script, we're using the duckdb and DBI packages, although I'm gonna have to look into duckplyr later to see if that would be better. And we're making the database. If the database already exists, that one line just connects to it. And now here's where we do our time-expensive SQL queries and all of the calculations. We save them as data frames, and then write them to tables in the database. And that's it.
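The shape of that script can be sketched roughly like this. This is a minimal sketch, not the actual code from the talk: the file name, table names, and the `sqlserver_con` connection to the database of record are all invented for illustration.

```r
library(DBI)
library(duckdb)

# Connect to the local DuckDB file; this creates it if it doesn't exist,
# or simply connects to the existing file if it does
con <- dbConnect(duckdb::duckdb(), dbdir = "app_cache.duckdb")

# The one time-expensive pull from the database of record
# (sqlserver_con would be an odbc/DBI connection set up elsewhere)
trips <- dbGetQuery(sqlserver_con, "SELECT * FROM trips")

# Pre-compute the aggregations the app will display
trip_summary <- aggregate(count ~ year + species, data = trips, FUN = sum)

# Write both the row-level data and the pre-computed results as tables
dbWriteTable(con, "trips", trips, overwrite = TRUE)
dbWriteTable(con, "trip_summary", trip_summary, overwrite = TRUE)

dbDisconnect(con, shutdown = TRUE)
```

Everything above runs once at app startup; after that, the app never touches SQL Server again for the rest of the session.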

Additionally, we just had to make sure all of our modules were pointing to the right database. And since the connection was defined at the top of each of those scripts, it was very easy. This was probably an afternoon of work to get a testable version ready for everybody else.
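In a module, that change amounts to querying the local DuckDB file instead of the remote server. A hypothetical sketch (the module name, table, and input are invented, not taken from the app):

```r
library(DBI)
library(duckdb)
library(shiny)

# Defined at the top of the script: a read-only connection to the local cache
con <- dbConnect(duckdb::duckdb(), dbdir = "app_cache.duckdb", read_only = TRUE)

summary_server <- function(id) {
  moduleServer(id, function(input, output, session) {
    output$table <- renderTable({
      # Near-instant: the aggregation was already computed at startup,
      # so this is just a lookup in a local file
      dbGetQuery(con,
                 "SELECT * FROM trip_summary WHERE year = ?",
                 params = list(input$year))
    })
  })
}
```

Because only the connection target changed, the UI and server functions could stay exactly as they were.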

Results and caveats

And the results were great. We went from an eight-second query run many, many times throughout the app, because people could be using it for hours, they usually had it open all day, to pretty much instantaneous changes when we changed an input. The boot-up time for the app was about two seconds longer, so even if someone only opened it to run one query, we were still ahead by six seconds. So it made the app feel much better to use. It was less frustrating. We were able to do it ourselves. And again, I can't emphasize the low-cost, no-cost part enough, because we don't really have a budget to make these changes on the fly. I'm sure if any of you are working in the public sector in government, you understand how long it takes to get those things established.

But there are some big caveats I want to talk about. And it has to do with storing the data locally on the PCs. Because remember, everyone's making a copy and putting it on their machine for that Shiny app to run. So depending on when they open up their app in the morning, there is a chance two people in the office could have different data sets. Most of the time, that's not a problem, but there could be situations where they're getting different results for the same task, and the most likely reason is that there was a big data upload between the times they started the app.

We are uploading hundreds of records every day, so in the grand scheme of things, that's not really a difference when you're looking at hundreds of thousands of records, but just by bad luck, it could make a difference if you're helping a specific customer. And the other issue, again with that local storage of the file, is that you need to be conscious that it meets the permissions your users are allowed to have. In our situation, everybody could look at row-level and trip-level data as part of their job, but if your users were only supposed to look at calculations or aggregations of the data, you don't want to store the row-level data in the DuckDB database, because then they could look at it even if they're not supposed to.

And I feel like I ran through that so fast. But yeah, so my name is Melissa Albino Hegeman. I'm with the Office of General Services. You can check out our public-facing projects at ogs.ny.gov. You can find me on LinkedIn and GitHub at melissahageman.com. I will have notes from today up later this evening, and I'm trying to be more active on Bluesky around our stuff, so you can find me there at mellikesmaps, and thank you very much.

Q&A

How do you decide what data to cache? Are you caching everything?

It was kind of based on what people were using. We started with the stuff that was most important on a daily basis, and we only have eight users, so it's really easy to get very specific feedback. If they say, hey, five of us can't do this, then it's easy to choose to fix that problem.

What advice do you have for shiny apps that process data based on user input, not processed all at the start?

I'm sorry, repeat that? What advice do you have for shiny apps that process data based on user input, not processed all at the start?

Well, you could guess what they want and pre-process some things ahead of time. In this case, the output changed based on the user input, but everything was calculated beforehand, so they were choosing options that had already been calculated.

Why keep the SQL server rather than move everything to DuckDB?

That's a good question. DuckDB isn't designed for transactional data entry, so we have another application for that. We are still entering a lot of paper records in this program, and that's not what DuckDB is designed for, as I understand it. It's more of a search-and-retrieval, analytical database. The SQL Server is also backed up and supported by IT, and I do not want to mess around with losing 20 years of fisheries data.

Is there an official CountFish package? Asking for a friend.

Oh no, I made that up. We do have an official package, but it is not publicly available, and it would only really work for New York anyway. I think even if there's not an official package, you could have the hex sticker available. People would be into that.

And I have a question. You mentioned you only have eight users. Yes. What do you think the considerations would be if you wanted to scale that up to a larger number of users?

Like I said, the permissions make a big difference, and I'd have to think carefully about how we store the data. If we want people who don't have access to everything to even use that database, we might have to do something else.

For folks about to embark on the same journey you did, any hard-earned lessons you wish you knew about before you started?

That your first versions are gonna be horrible, and that's okay.