
Oops! I accidentally made a production dashboard (Jonathan Keane, Posit) | posit::conf(2025)
Speaker(s): Jonathan Keane

Abstract: As data scientists we love making decisions with data. But we don't always do this with our own work. Ever wonder if that dashboard you are about to spend hours updating from line charts to 3D pie charts is actually being used? With usage metrics it's easy for you to see, analyze, and show just how much traction your app is getting. Not only does this let you decide where to prioritize your efforts, it can help you demonstrate the impact you have on your business. This talk will explore how to be data-driven with your own work and the tools we have for maintaining our work when it becomes important.

Subscribe to posit::conf updates: https://posit.co/about/subscription-management/
Transcript
This transcript was generated automatically and may contain errors.
Thank you, everyone. And I'm going to talk to you about a time that I accidentally made a production dashboard. I think this is something that will be familiar to many people in this room.
So a bit of a story. I was working at a job, and I was working on a dashboard. We were looking at benchmarking reports, and it ran every night. I thought it was pretty cool at first. We shared it with the engineering team. We started getting some questions. People seemed interested in it. And I thought that was that. And then one morning, I was in a meeting, and the CEO of the company said, oh, yeah, that thing, that thing's great. I look at it every morning before I even had my coffee. And that hit me like a ton of bricks. This was not a proof of concept anymore. This was something that I needed to maintain. I needed to make sure it had accurate numbers, or else I was going to get a call before I had my coffee, and they were one time zone ahead of me, so that would be even worse.
This is very common in data science. Projects transition very quickly from a proof of concept (is this even possible? let's try it out) to something that is suddenly actually production, something that people depend on day to day, sometimes in ways that we don't realize, and that's where things get dangerous. A lot of what we've talked about in this session, about how to productionize and what it means to be production, matters most at exactly that point.
Living in the slush
So I like to think about this as living in the slush. I live in Chicago, and one of the things that I had to learn when I moved to Chicago was that walking in slush can be really dangerous. And so if you're walking in the rain, it's pretty easy. You don't have to take many precautions. You can just wear normal shoes, maybe have an umbrella, and you're all good. This is kind of the POC phase. Things are flowing. Things are great. You don't have to worry about anything. When you're walking on slush, things are usually pretty fine as well. There's no big deal. You can walk. You're not worried about anything. But the thing that's dangerous with slush is that sometimes underneath that slush, there's ice. And you can slip on ice if you're not careful or you're not wearing the right shoes. And data science is exactly like this. A lot of data science operates in this slushy area where you're going from flowing with a proof of concept to something that you actually need to harden like ice or else you're going to slip and fall and you need to do something about that. So how do you tell that you're walking on ice?
Usage monitoring
The first and most important thing is usage monitoring. This is a lesson I have personally learned over and over in my career, and I think everyone else here does the same thing. We as data scientists bring data-driven decision making to our organizations, but we don't always take that same skill and that same approach with our own work. And this is what bit me with my production dashboard. I didn't have any usage monitoring. I had no clue who was looking at it. If we do usage monitoring, figuring out who is looking at our reports, our dashboards, our apps, when they're looking, and how widespread that usage is, that tells us whether this is just a cool proof of concept that we can let be, or something that is production and that we actually need to maintain.
Usage monitoring is not new. It's used throughout the software world, and not just to see who is looking at something, but also at a detailed feature level: who's clicking on this button, how long are they staying on this page? This might sound scary, but it's easier than you would think. If you've got a Shiny app, you can write events to local storage when people click on things, and there are packages like shiny.telemetry that make it super easy to collect this type of usage data. If you've got Quarto documents or other static things hosted on some CDN, AWS, or Posit Connect, you can use access logs to see who's viewing them. That's a bit lower fidelity, but you can get really good data out of something pretty simple.
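As a sketch of the Shiny route mentioned here, using the shiny.telemetry package (the app name and log file path below are placeholders, and this follows the package's documented pattern rather than anything specific from the talk):

```r
library(shiny)
library(shiny.telemetry)  # community package for Shiny usage analytics

# Events (session starts, input changes, clicks) get written to a
# local log file that you can analyze later.
telemetry <- Telemetry$new(
  app_name = "benchmark-dashboard",  # hypothetical app name
  data_storage = DataStorageLogFile$new(log_file_path = "telemetry.txt")
)

ui <- fluidPage(
  use_telemetry(),  # injects the JavaScript that reports browser events
  actionButton("refresh", "Refresh data")
)

server <- function(input, output, session) {
  telemetry$start_session()  # logs the session and tracks inputs
}

# shinyApp(ui, server)
```

The same storage backend can point at a database instead of a flat file if several apps should report to one place.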
If your reports are public, or your organization allows it, you can use things like Google Analytics, and there are other analytics-style tools that are basically JavaScript that logs when someone has actually loaded a page. Quarto even has a really easy way of connecting this: you just add your Google Analytics identifier, and Quarto inserts the JavaScript for you. You don't even have to think about it, and then you know exactly who's looking at your dashboards. And if you have Posit Connect, this is actually baked in. Earlier today, Toph Allen shared the Connect Gallery, and one of the items in the Connect Gallery is this Usage Dashboard. It lets you see on Connect how popular the different items you've deployed are, who's looking at things, and you can even drill into the detail and see the time of day that people are looking. If I had had this at the time, I would have seen: oh, the CEO is logging in every single morning. I need to actually take this seriously.
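For a Quarto website, that Google Analytics hookup is a single option in the project file. A minimal sketch (the measurement ID below is a placeholder):

```yaml
# _quarto.yml
website:
  title: "My Dashboard"
  # Quarto injects the Google Analytics JavaScript for you
  google-analytics: "G-XXXXXXXXXX"
```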
And so, in summary, once you've figured out that you've got a thing that is popular, you land on the happy side of this. It's great. You're celebrating. You've got a thing that's popular. You're providing value. You're excited. But there's also a lot of chaos going on. You're going to start getting feature requests. You've got to deal with a critical bug fix. Someone wants some extra data in there. So there's a real mixture of emotions: you've done something successful, but now you've got all this extra work. On the other hand, if you have something that was a cool proof of concept, you said, hey, look, we can do this, this is the answer to this question, you don't have those feature requests. You don't have to put in the effort to make it production grade. And that's great, because you can sit and relax. The easiest work is work you don't have to do at all. Of course, that's also a little bit sad, because people aren't really using it. But the nice thing is you can take that energy and put it into the things that are actually important to maintain.
Modularity and testing
So, what do we do now that we know we've got something that we have to maintain? We really need to invest time in it. I'm going to share a couple of techniques from the realm of software engineering; bringing them into data science can really help improve the way we maintain our apps. The first two are interrelated: modularity and testing. And much like Tom's talk, a lot of this will not be new to anybody, but it is important to think about it, and especially to do these things only when you actually have to.
So, what is modularity? It is taking a long script, a very procedural set of code with a lot of nested if-elses, and breaking it down into logical chunks. When you're prototyping something, it's really easy to just write out the code, and that's one of the superpowers: you can just write and see if it works, without worrying about abstraction layers or anything. But once you move into that production realm, you do want to factor things out: pull your data in one chunk, clean your data in another chunk, fit a model, things like that. Frankly, even those chunks are probably a little too broad, and you should have chunks smaller than that, but they illustrate the point. Modularizing your code makes it much easier to figure out what's going wrong if something's going wrong. It also helps you focus in. If what you're working on is your data cleaning code, you don't have to think about how you're getting data from the database or where it's going in the model. You can say, okay, I just need to transform these columns, I need to fill in NAs in this way, and focus on that. Modularity helps you do that at a code level, but also at the level of what you're thinking about while you're working on that code.
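As a minimal sketch of what that factoring might look like (the data and function names here are hypothetical, not from the talk):

```r
# Each pipeline stage becomes a small function with one job.

load_sales <- function(path) {
  # Pull the data in one chunk
  read.csv(path, stringsAsFactors = FALSE)
}

clean_sales <- function(df) {
  # Clean in another chunk: fill missing amounts, drop rows without an id
  df$amount[is.na(df$amount)] <- 0
  df[!is.na(df$id), , drop = FALSE]
}

summarise_sales <- function(df) {
  # Aggregate in a third chunk
  aggregate(amount ~ region, data = df, FUN = sum)
}

# The top-level script then reads as a sequence of named steps:
# sales   <- load_sales("sales.csv")
# cleaned <- clean_sales(sales)
# report  <- summarise_sales(cleaned)
```

Each stage can now be reasoned about, and tested, without thinking about the others.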
And the best part of decomposing things is that it makes testing a lot easier. You could test the thing on the left: take a big set of inputs, run them through your whole script, and check the outputs at the end, but that gets really messy really quickly. At their core, tests are: you've got some known inputs, you've got your function, you put the known inputs into the function, and you compare against known outputs at the end. If you're testing smaller chunks of code, the number of inputs and input options can be relatively large, but the tests run quickly and stay constrained, because the domain is just your data cleaning step or just your model fitting step. Testing at the modular level is much easier than testing one giant, long script.
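The known-inputs-to-known-outputs pattern could be written with testthat, which the R ecosystem commonly uses for this (the cleaning function and its behavior here are hypothetical examples, not code from the talk):

```r
library(testthat)

# A small, self-contained stage to test: fill missing amounts with 0,
# drop rows that have no id.
clean_sales <- function(df) {
  df$amount[is.na(df$amount)] <- 0
  df[!is.na(df$id), , drop = FALSE]
}

test_that("clean_sales fills NAs and drops rows without an id", {
  input <- data.frame(id = c(1, NA, 2), amount = c(NA, 3, 7))
  result <- clean_sales(input)
  expect_equal(nrow(result), 2)          # the id-less row is gone
  expect_equal(result$amount, c(0, 7))   # the NA amount became 0
})
```

Because the domain is just the cleaning step, the inputs stay tiny and the test runs in milliseconds.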
The other nice thing is that I have found that using your tests as a diagnostic for whether you factored your code correctly is really powerful. If something is difficult to test, sometimes that's because the interface, the handoff from one function to the next, isn't quite clean. You've got a couple of things interplaying, so you've got to test both at the same time. For me, that's a sign that maybe I need to move where things happen between those two functions. It's a virtuous cycle of testing and modularity: something you can cleanly test is generally cleanly factored.
Continuous integration
Okay. Now that you've got your tests, you can run them locally. That's great. That will confirm you don't have a bug. And that's what this person is doing: running a bunch of tests all the time. They make a change, they run a test. But if your code is going to be shipped off somewhere, hosted on a service and run somewhere else, you can't just test on your local laptop, because you might have installed specific dependencies without realizing they matter to how the code actually runs in production. So one of the first things you need when testing is to have your tests run on some other computer. You could call a colleague and say, hey, can you run these tests for me? That will work a few times. But if you want that every single time you make a change, you're going to turn a friend into an enemy very, very quickly. Ultimately, you want a way to run those tests on some computer that's somewhere else, not your computer. And that's exactly what continuous integration is: a container, effectively a laptop in the cloud somewhere, that runs your tests on another computer so you don't have to bug your colleague to do it for you.
And that is effectively what continuous integration is: running your tests every time you commit to your GitHub repository or other Git repository, spinning up a container that's totally isolated, and running them there, every time. And that is fantastic. Continuous integration these days is typically connected to your source control system. If you're using GitHub, GitHub Actions is very integrated; it will spin up those cloud laptops for you to run your tests. If you're using GitLab, GitLab runners are the same thing; they're very similar, just with a different configuration syntax. In a larger organization, you might see other things like Travis CI or CircleCI or Jenkins, and other teams might already have these set up. You can hook into those too. They're not quite as integrated with your version control system, but they can do basically the same thing as GitHub Actions or GitLab runners.
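For an R project on GitHub, the workflow might be sketched like this (the file name and test directory are assumptions; the setup steps come from the r-lib/actions collection):

```yaml
# .github/workflows/test.yml (hypothetical name)
on: push

jobs:
  test:
    runs-on: ubuntu-latest  # the "laptop in the cloud"
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
      - name: Run tests
        run: Rscript -e 'testthat::test_dir("tests")'
```

Every push then gets a green check or a red X, with no colleague involved.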
Keeping dependencies up to date
Okay. So, now that we've got our tests going, we have a dashboard that's running, with a bunch of dependencies like ggplot2 and dplyr and plumber, et cetera. What do we do as time marches on? One of the things I have always found arduous on a long-lived project is staying up to date with my dependencies. One approach you could take is to constantly upgrade the dependencies bit by bit: every day, check whether there's something new, and bump it if there is. This is really fantastic, because when you take small steps with your dependencies, frequently there's nothing you have to change in your code. There are no breaking changes, and you can just move on with your life. And if there are breaking changes, going from version 1 to version 2 usually means only one or two things changed; it's quick and easy to make those changes and keep up with the upgrade. But if you're jumping from version 1 to version 8, that can be a huge pain. A bunch of things changed in your dependency, your code interacts with all of them, and you have to parse through the dependency's changelog.
But the nice thing is you don't have to sit there every day wondering, do I need to update my dependencies today? With modern CI, like GitHub Actions, you can use what's called Dependabot. It scans your dependencies and sends you a PR every time there's an update or a critical bug fix, and you can just accept it. If you've got tests that check whether your code runs, the tests run on the Dependabot PR, and if everything is green, you can just merge it. That's basically a bot doing the constant updating for you, and you don't have to think about it. This is just one of the many, many fabulous things you can do with modern CI. It means you're not worrying about dependency upgrades except when you actually have to dig in and figure out what changed.
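A minimal Dependabot configuration might look like the sketch below. One caveat worth noting: Dependabot covers ecosystems like GitHub Actions, npm, and pip, so for an R project this example keeps your workflow actions current rather than your CRAN packages:

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "github-actions"  # keep workflow actions up to date
    directory: "/"
    schedule:
      interval: "weekly"  # one PR batch per week instead of every day
```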
And there are many other things that you can do once you realize you've got a production app. Many of the topics that we talked about here, and Joe will talk about what to do when you've got REST APIs that are slowing down because you've got a lot of people that are poking at them suddenly.
And so ultimately, what I want you to walk away with is going from "oops, I made a production dashboard" to "yay, I made a production dashboard," and actually getting to be happy about it, because you have tools to maintain it, and you're not spending time maintaining dashboards that aren't actually production, that aren't actually popular. With these tools, you can monitor the dashboards you're building and know whether they're popular, and for the ones that are, you can keep them running stably with modularization and testing, and keep them current with things like dependency updates. But again, most importantly, you only need to do that if it truly is production. If you've got a proof of concept that was cool and answered some question, but nobody is looking at it, you can go home. You're good to go. Thank you.
Q&A
Thank you. I think we have time for a couple of questions. First one: do you have any advice for someone who's trying to start modularizing their data code, but is trying to balance speed with best practices? Oh, that's complicated. I actually kind of want to know: is that speed of your development or speed of the code itself? But I know you can't clarify. Yeah, that's how I would interpret it. Yeah, yeah, yeah. I would say start small. And, it sounds a little religious in some ways, but doing your modularization and then writing your tests, and using your tests as confirmation that you've modularized well, is super powerful. Once you fall into that cycle, I have found it helps a lot.
Awesome. Do you have a favorite tracker for web analytics, and why? Favorite tracker for web analytics? Yeah, for web analytics. Or do you tend to use open source tools like Shiny Logger and stuff? Yeah, it really depends on the context you're in and where you're deploying your app to. If it's something that's public on the internet, I think Google Analytics is really fantastic. It's basically free and it works with a bunch of different things. If you've got a dashboard that's internal to an organization, that can be more difficult and might be against security practices, so you'd use some of the tools where you're building it out yourself, or, like I said, if you have Posit Connect, usage tracking comes built in. But it really depends on whether it's public or internal to an organization. Right. And I just wanted to say thank you, because honestly, I use Posit Connect every day and I do use the built-in features for that. But I didn't know that Shiny package and that Quarto package for logging existed. So thanks a lot for letting me know about that. Awesome. Yeah. Thank you. Thank you.
