
Observability at scale (Barret Schloerke, Posit) | posit::conf(2025)
Observability at scale: Monitoring Shiny Applications with OpenTelemetry
Speaker: Barret Schloerke
Abstract: Understanding what happens behind the scenes in production Shiny applications has always been challenging. When users experience slow response times or unexpected behavior, developers are left guessing where bottlenecks occur. This talk introduces OpenTelemetry integration for Shiny for R, a new approach to profiling code and understanding application behavior. Through a chat-enabled weather application, we'll explore how complex user interactions trigger cascading events across multiple processes. You'll learn how OpenTelemetry's "high-quality, ubiquitous, and portable telemetry" can provide complete visibility into your Shiny application's performance with minimal setup... just a few environment variables! After walking through the chat app, you'll have a taste of how to implement comprehensive monitoring for your Shiny applications in production, enabling you to proactively identify and resolve unexpected performance issues.
Materials: https://github.com/schloerke/presentation-2025-09-17-posit-conf-otel
Transcript
This transcript was generated automatically and may contain errors.
Hi everybody. My name is Barret Schloerke. I like the wave. Thank you. There's a QR link to the slides that will also be at the very last slide if you miss it now.
So today I want to start off with a code demo. It's going to be a small chat app. Nothing new. Nothing exciting. But we'll see what happens.
So let's look at this app. It is a small chat app that has a single tool call available to it that can ask for the weather. Can I have a city to ask for the current weather? I'm going to go with that one. So: what is the weather in Chicago? And then it's going to think about it, ask the tool, run it, and then summarize that tool response. And for the purposes of the demo, it's going to shut down the session. That will come in later.
But nothing surprising. I mean, if we want, we can sneak a peek at what the tool output was. We can see that it asked the available tool to look up Chicago, and then wrote up a summary of that information.
Nothing too fancy. But let's actually look what happens underneath the hood. So we'll rerun that. And we'll say, what is the weather in Chicago?
Exploring the flame graph
Let's check out the chart on the left. We have a simple session start, and then we have some reactive updating. And this is the whole crux of the talk today: this little reactive update is going to spawn a call to Claude to do some communication. That is going to make a POST request, and then it's going to stream back some results if there are any.
This then prompted us to execute a tool call, which I just happened to do through mirai, on a background R process. That process said, hey, I'm going to go get the weather forecast, and it happened to make an httr2 request or two. And then finally, I came back to Claude with the POST to say, here's the response, and then we're streaming results back as Claude responds within the website.
This is awesome. I now have an 8 second demo and I can point fingers as to who took the most time. In this case, while it is about 7 seconds total, the first 3 seconds were Claude. Then the tool call took about 650 milliseconds, roughly half a second. And then finally Claude's response took another roughly 3 seconds. So my tool call is, in the big picture, not the big time sink. But there are little spots where we can improve what's happening.
What is OpenTelemetry?
This is all being done through OpenTelemetry. And OpenTelemetry, according to its website, is "high-quality, ubiquitous, and portable telemetry to enable effective observability." If we're just looking at our app the way we did the first time, we can see what's happening on the surface, but what's happening underneath, where the computations are being done, is the important part to developers and in production. OpenTelemetry provides a set of APIs, libraries, agents, and instrumentation to provide observability for your applications. In this case, OpenTelemetry is being integrated into the R ecosystem and is currently under development.
So let's look at this a little more generically. OpenTelemetry is really good at capturing a burst of activity from an action, such as a user clicking the website or Shiny's reactivity starting. After this action, different services are engaged, in this case the green, then the blue, then the yellow, which then calls a database, and then each winds down as it finishes processing. So we can see how long the blue takes, then how long the yellow takes, and finally how much of your computation time is spent accessing that database.
Integrating OpenTelemetry
So, seems like a good idea to have for production. How can we do it? What does it take to integrate? There are two steps. The first one is we need to install otel and otelsdk. Packages will depend on otel for you; you as an app author will need otelsdk. And I'd like to take a small pause here to thank Gabor, because this went from concept to implementation and on CRAN in a really short time, with lots of iterations, so I just have to give a big shout-out to Gabor.
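As a quick sketch of that install step (assuming the CRAN package names otel and otelsdk mentioned in the talk):

```r
# App authors install the SDK; packages that emit telemetry
# only need to depend on the lightweight otel API package.
install.packages(c("otel", "otelsdk"))
```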
The second thing you need to do is add roughly three environment variables; it depends on your endpoint. In the demo I happened to be using LogFire, a third-party website that can host the OpenTelemetry data. What the environment variables are doing is saying: please send the information over the HTTP protocol, the location you're sending to is my LogFire endpoint, and here is some authorization.
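These follow OpenTelemetry's standard OTLP exporter environment-variable conventions; a hypothetical setup (the endpoint URL and token below are placeholders, not LogFire's real values) might look like:

```shell
# Export telemetry over OTLP/HTTP (standard OpenTelemetry variables)
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
# Placeholder: wherever your collector or vendor ingests OTLP
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otel.example.com"
# Placeholder write token for authorization
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=your-write-token"
```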
OpenTelemetry vs Profvis and ReactLog
So you're probably seeing these flame-graph-looking things and thinking: well, what about profvis? What about reactlog? This is how I debug my applications already. profvis allows for local R code profiling and also displays some memory usage. With reactlog you can replay all of the local reactivity that you have captured, and it always displays a full dependency graph for every reactive component within your application. So if you have a very large application, you'll have a very large connected graph. I highly recommend it if things aren't working the way you expect.
The downside of both of these packages is that they only support the main R process. So if you're using mirai, which does computing on a different R process, we don't have access to that with profvis or reactlog. But with OpenTelemetry we can pass through some extra headers, and mirai can do the proper reporting. Also, profvis and reactlog are not for production. By definition, both of those packages produce a memory leak: they're constantly recording, and even if we were to write out to disk, you're going to run out of space eventually. So both of these packages are great for local development, great for debugging, but I do not recommend them for production.
Analyzing span performance
So let's look at these flame graphs. In this case we have a long grey one, a green, a blue, and a gold. How can we improve their performance? There are two approaches. The first is to reduce the length of the long-running spans. In this picture and in the demos there wasn't something that ran for 10 seconds when I really wanted it to run for one; that would be a situation where you want to reduce it, but now you can point fingers as to who the culprit is. The other is to limit excessive span nesting. In this demo it's about five nesting layers. Not an issue. I'd say 10, not an issue. When you're expecting 10 and it's 100, probably an issue.
So let's look at a screenshot of a similar demo. In this case the tool call happened to take 1.5 seconds to run. It happened to make two HTTP requests inside that tool call to get the weather. I believe you use the city to get the lat/long, and then given the lat/long it comes back with the weather for that location. But of the 1.5 seconds, only half a second was used for the second request, and the first request was maybe 100 milliseconds. So of the 1.5 seconds total, those aren't really the culprits I should be focusing on. Instead, I really think we should focus on the second of preprocessing that happens before that POST request or the GET request. That would be something we could look at to try to reduce, because that tool call is all my code. It's not Claude. It's not the response time. It's our code.
In addition, if you're looking at this as well, we can see Claude Sonnet was streaming, and that took 2.5 seconds. You could argue, if you want less load on your server, because you're essentially playing proxy to these models, to use a model that responds faster and with fewer chunks. I mean, there's a Facebook model that just kind of goes blah, answer. It's great, but maybe it's not as good as what you want, so you'll have to play with that sliding scale to decide whether you're okay with streaming and a longer response, or an almost instantaneous but maybe not as accurate response.
The 2025 approach to Shiny in production
When looking at a lot of GitHub issues, I'm constantly referring back to Joe Cheng's 2019 keynote, Shiny in Production. That's where he described how you should iterate to make your apps perform better, but I would like to update that now to leverage OpenTelemetry, so let's walk through it. This is the 2025 version.
First, enable OpenTelemetry to see your span durations. Second, look at those long spans that run in R and use profvis to see where that code is slow. When the code is slow, try to optimize it. We can, one, move work outside of Shiny. In the standard case, if you have consistent computing being done, maybe lift it out into your global.R, or try to do some preprocessing and have that be loaded before your server function. Just relocating it sometimes makes a big, big difference.
Two, make your code faster. If you can't move it outside the function, maybe use DuckDB as a backend instead. Even if it's still being processed in memory, it'll possibly be faster than what you have already with dplyr. Use caching: if you're returning the same answer for the same input, maybe we can add a layer of caching on top, and then your spans will just disappear. And finally, use some non-blocking reactivity. In the demo, I happened to have the tool call use mirai to pull it off the main R process so that others could do work. I wasn't showing it in the use case, but it did pull it off the main R process, and that way it wasn't blocking others from streaming or doing any calculations.
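As a minimal sketch of the caching idea, Shiny's own bindCache() can memoize a render by its inputs (the weather_for() helper here is hypothetical, standing in for the slow network call):

```r
library(shiny)

# Hypothetical helper: imagine this makes the slow HTTP request.
weather_for <- function(city) {
  Sys.sleep(1)  # stand-in for real network latency
  paste("Weather for", city)
}

server <- function(input, output, session) {
  output$weather <- renderText({
    weather_for(input$city)
  }) |>
    # Same city -> cached answer; on repeat requests the slow
    # span effectively disappears from the flame graph.
    bindCache(input$city)
}
```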
And then repeat. When you make some of these changes, you may have to go back up to step two, look at profvis, and re-evaluate your app, because sometimes just by enabling caching, spans will disappear and your computation will be very different in shape compared to the first time.
Next steps and active development
And some next steps. This is all in active development. I was very excited to be able to do a live demo there, because I was on something like five dev branches at once. A little terrifying, but it worked great. When we get back from conf, we'll be trying to tighten all of those up and finish them. One thing I want to add is cleaner error detection and reporting. OpenTelemetry does a nice job of saying: here's an error, here's the error message, possibly even the file locations of where that occurred within the app. We're trying to walk the line of: are we giving away company secrets to your own OpenTelemetry? But it's not going to the user, it's going to you, so we'll see.
This would also involve setting the status of the span: saying, oh, we had an error, I knew this was bad, let's set it to an error, versus leaving it unset or okay. I'd also like to have native Posit Connect integration. There's been a lot of work on having information about which user is visiting your application, so I want to sneak some of that header information into the session start to say, oh, Barret started this Shiny session, even though the app is owned by Daniel.
And then finally, Shiny for Python. Similar to most things in the Python world, we won't have to reinvent the wheel ourselves. There's already wonderful tools out there, but doing a lot of learning here on the R side first.
Q&A
All right, we have a few questions. Does it, I'm assuming meaning OpenTelemetry, come with any performance penalties for the application? It's pretty similar. A lot of the functions that we've been integrating are in the microsecond world, like tens of microseconds, so most likely the app author's logic is going to be slower than that. I think we're okay.
How is OpenTelemetry different from writing my own logs? Is it just another logging framework? It is just a logging framework, absolutely. It should be treated like that: you have log, debug, trace, fatal, error, all the different levels, just like a logging framework. I think the interesting part here is that OTel is defined to have a collector, so you can have multiple Shiny applications all pointing to the same collector. The collector will batch and then send upstream to the third-party website, and that process is much better than logs. Logs are wonderful if you own the server, you only have one server, and you're writing to a single location; sure, just use logs. But when you start having multiple machines that may or may not exist in five minutes, having OpenTelemetry collect to a third-party site makes it that much easier.
Alright, cool. Last question: does it also expose cat, print, message? Oh, no, sorry, does it override the cat function, the message function, print, str, any of those? It does not, because unlike spans, logging at the different levels is something you must opt in to as a developer. So you'll have to change your code instead to say otel log info or otel log error, instead of saying message or cat.
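That opt-in would look something like swapping base messaging for the otel logging helpers; the function names below are assumed from how the speaker says them ("otel log info", "otel log error"), not verified against the package:

```r
# Before: visible only on the console, invisible to your backend
# message("fetching forecast")

# After (assumed names, per the talk): emitted as OTel log records
otel::log_info("fetching forecast")
otel::log_error("geocoding request failed")
```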
Alright, and I think in your last slide you were talking about it not being there for Shiny for Python yet, but is there a rough ETA for when that might be up? No. Alright, yep, and that's it. That's all our questions, thank you.

