
Old Apps, New Tricks: How AI can write Automated Tests for your Shiny Apps (Karan Gathani, Posit)
Speaker: Karan Gathani Abstract: As Shiny applications grow in complexity, comprehensive testing becomes crucial yet often overlooked. Many developers struggle to implement proper testing due to time constraints and technical barriers. To address this challenge, we're introducing an innovative solution that automatically generates regression tests for existing Shiny apps. By leveraging AI models trained on testing best practices, the Shiny package will streamline the testing process, making it more accessible and efficient for developers of all skill levels. Materials: https://docs.google.com/presentation/d/130bJNGp3XIlKPaouA8K0Lv0O3P7OFesW/edit?usp=sharing&ouid=105060716153365836674&rtpof=true&sd=true posit::conf(2025)
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
So today I'll be talking about how we can integrate AI to generate tests for our Shiny apps. Just a disclaimer: this is Shiny for Python only, for now. All right, so for folks who have not seen me before, I'm Karan Gathani. You might have been Shiny fans for a long time; I joined the team two years ago, so I thought I'd tell you a little about myself. I'm a QA engineer on the Shiny team, but I wanted to share some details that I feel will resonate with the group.
So if you see something that you resonate with or love to do as well, say "aye", all right? So, some things I love to do: Shiny, pizza, identifying bugs inside code. Okay, that's good. Here's something a lot of folks might not know about me: I love to observe bugs outside of code as well. And when I say bugs, I mean the six-legged ones. I love to look at the insects that most people just want to squash and squish, and I do see them teaching us really valuable life lessons.
Hopefully I can bridge the gap between the bugs that live inside our screens and the bugs that live outside. That said, let's go around the room; I just want to get a feel for how well people know their bugs. How many of you in this room, by a show of hands, have seen grasshoppers? Okay, good. How many of you have seen locusts? Okay, good. And how many of you know the difference between the two? Okay, good, all right.
Gregory the grasshopper
Okay, so for this session, I'm going to introduce my friend Gregory the grasshopper. Gregory is an off-the-shelf grasshopper, the kind you might see jumping around in your park or backyard. Gregory likes to do all the grasshopper-y things: eat grass, eat berries, watch a little TikTok, and collect Labubus. Like I said, just your off-the-shelf grasshopper.
Here's what most people wouldn't know about Gregory and other grasshoppers: grasshoppers are introverts. How do we know that? Because if you get close to a grasshopper, you'll see them wearing t-shirts like this. All right, so here's a chain of events that transforms Gregory drastically. If Gregory lives in a location that gets a severe drought followed by a good rainy season, the rains transform the barren wasteland into lush green grasslands. When that happens, it attracts grasshoppers from all across the area, and the place suddenly becomes crowded. This is any introvert's worst nightmare: they get surrounded by others of their own kind, and then they swarm.
But here's the super interesting part. When these grasshoppers are bumping into each other, they unintentionally end up bumping into the hind legs of other grasshoppers, and when that happens, it causes a surge of serotonin released in their brains. At that point you might say: what's the problem? The problem is that this grasshopper, the green one we knew as shy and awkward, transforms into a yellow-and-black locust. It's the same animal, transformed into a locust. When that happens, they throw away their introvert personality and become rowdy, like a group of sports fans who love to watch football and drink beer. And just like sports fans, when you get a lot of them together, they start creating chaos. They'll go into fields and start damaging them; they'll chomp through all the fields and farms we have and strip them to the bone, causing millions of dollars in damage.
So what's the theme here? The theme here is dangerous threats can lie dormant until a trigger awakens them. So what we learned is a dormant threat by itself, not a big deal. When you mix it with another dormant threat, it can become a dangerous threat, which in this case is a locust.
How bugs affect Shiny apps
So how does this translate to how our Shiny apps are affected by bugs? Our bugs are dormant until they get triggered, and then they can escalate really quickly. Unless a scenario triggers them, they won't affect your app; but once they do, they will take the shine out of your Shiny app. Tests are a way to protect against that: tests are an umbrella that helps your Shiny app keep the shine it deserves.
State of testing in the Shiny ecosystem
All right. While we're talking about testing, let's talk about the state of testing in the Shiny ecosystem. Maybe you know a little about it already; I'll go through a quick refresher. In the Shiny ecosystem we have Shiny for R and Shiny for Python. For Shiny for R we have shinytest2, and for Shiny for Python we use Playwright. I'll go into the details of both.
So shinytest2: let's say you have an app where you can type your name, click a button, and get greeted, and you want to test this. shinytest2 is essentially a UI tool where you go and record your actions, and it generates the code for you. You can have that code run again, and you have your test just like that; it's really easy to generate. Things in the Python ecosystem are a little different. We use Playwright, a really popular test framework developed by Microsoft. The way you do it is you figure out which elements you want to test, then you perform certain actions with those elements — clicking, dragging, filling things in — and then you make some assertions. In this case, I'm asserting that the greeting is what I wanted; if it fails, you know the assertion is failing.
So in 2023, we introduced the concept of controllers. The way it works: let's say you have a checkbox you can interact with, and you want to write a test for it. Controllers essentially simplify how you interact with it. The way I like to think of controllers is like the ball launcher you'd use to keep your dog busy so you can browse more TikTok. It doesn't matter how the ball launcher works inside — it might be a really complicated design or a simple one — but for the user it's a simple thing they can keep flicking, whether they're a kid or an adult, whether they do CrossFit six times a week or have never touched a ball launcher at all. It's supposed to give you consistent results, time after time. That's essentially what controllers are. If you look at the controller for an input checkbox, this is what it looks like: we expose all the methods you might physically want when testing that component in your Shiny app.
So in 2024, we added `shiny add test`, a CLI command that lets people get started with their first test. How does the command work? You type it in your terminal, select the app file you want the test generated for, and your test is created. When you look at the test file, it's essentially a scaffold for your test. Inside, it imports the controller module you'd use to write your test, and it creates the fixture — "fixture" is just a fancy word for things you want to happen before and after your test. In this case, it starts the Shiny app locally. Finally, it navigates to the URL: if you've ever run a Shiny app locally, it won't always be on the same port (for example, if you're using an extension), so the fixture gets that URL dynamically and navigates to it. But that's it; that's where it stops, because it was really hard to predict what people's Shiny apps look like, at least in 2024.
One thing we realized is that this approach, while well-intentioned, only gave you the ingredients for writing your own test. It still takes time and effort to go through the documentation and figure out which methods work and which controllers to use. So we started thinking: what if we could provide you a ready-made meal instead of just the ingredients? But that was 2024.
Integrating AI with Shiny add test
And we're in a new year now. It's 2025, and we are officially surrounded by AI. People are using AI for cheating on their partners — or, after cheating on their partners, for seeking couples therapy — for coming up with fake citations for their research papers, or for their presentations, like I did. We're constantly being hassled by AI providers who want you to just adopt them in your package or whatever. So we went ahead and integrated AI with the `shiny add test` command.
Here's how the command works. It's similar to the 2024 version; the only difference is that this time you bring your own API key. We integrate with different providers, so if you have Anthropic or OpenAI, we allow you to use that. You just type the command and select the app file. And here's an interesting thing that happens: when we send your app code, we also send another box — you can see the two boxes the delivery person is handing over to Claude. That second box is the system instructions and training, and that's what helps Claude come up with tests that use our controllers and are actually valuable for your use case. Then we let Claude do the thinking — that's what it gets paid for — and it creates a test file. That's it. It scans through your Shiny app code and generates the test file, and all of this happens within about 5 to 10 seconds. That just leaves you more time to browse the internet for Labubus and Dubai chocolate, or whatever people do.
All right. So why do we recommend this over the chatbots everyone already has access to, where you could just paste in your app code and get tests generated that way? If you look at off-the-shelf AI, one thing they all have is a knowledge cutoff date: their knowledge always lags behind their training. But with Shiny, we release new features pretty regularly, so a chatbot's knowledge base may always be lagging behind where Shiny is. If there's a latest-and-greatest feature, you might not be able to test it that way. With `shiny add test`, the developers who build the components also build the matching controllers, and the command is aware of them — so even for the latest features, you'll be able to write tests right away.
The other thing is that off-the-shelf AI isn't aware of a lot of the concepts Shiny exposes, like controllers and fixtures. It ends up reinventing the wheel and produces really verbose code. Right now, a test for something like a Shiny checkbox can run to 100 lines — starting the server, stopping the server, and so on — whereas with `shiny add test` it would be a fraction of that, because it uses the existing methods we expose for writing tests against your components. Also, since off-the-shelf AI isn't fully aware of how our components are built, it makes best guesses about how and where to target individual components. Sometimes that's a hit, but sometimes it targets the outer element, which causes flaky tests — and flaky tests mean people start losing trust in your testing system. With `shiny add test`, the code is aware of how our components are designed, so it's much more targeted, and the chances of getting flaky tests or incorrect locators are close to zero.
Dealing with hallucinations
Okay, and this is the part where we talk about the challenges of using AI. For folks wondering: that is Bane, the Batman villain, not a new COVID mask design. And that is hallucinations — something you'll have come across if you've ever used AI. Hallucinations are what keep people who use AI awake at night: it's when the AI comes up with answers, friends, or facts that do not exist. In this picture, the dog thinks its shadow is a threat and won't let its owner sleep. So how do we deal with that? Because that's a big part of integrating with AI.
We try to avoid that, when we build the training data we pass along, by running an evaluation suite. We monitor how well the AI integration is doing by running it through ten test apps and asking it to generate tests for them. For that we use Inspect AI, an evaluation framework in Python that lets you grade the quality of the response you get from the LLM. In this case, we check: did the test use this controller? Did it use this method inside the controller? Did it have a reasonable fixture? We rate it on those criteria, and only if it passes more than 80% do we allow it. And all the test apps are working apps, which means the tests generated for them should also work — so if it's generating faulty tests, you know something is messed up in the code, and we won't allow that code to be merged in. We have these checks in place because we want to be confident that, across the different providers we integrate with, we can provide consistent results time after time.
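The talk doesn't show the evaluation suite itself, but the kind of rubric it describes can be sketched in plain Python. The check strings, function name, and 80% threshold below mirror the description above; this is an illustrative sketch, not Inspect AI's API:

```python
def grade_generated_test(test_code: str) -> float:
    """Return the fraction of rubric checks a generated test passes."""
    checks = [
        "controller.InputCheckbox" in test_code,  # right controller used?
        ".expect_checked(" in test_code,          # right method called on it?
        "create_app_fixture" in test_code,        # reasonable fixture present?
    ]
    return sum(checks) / len(checks)

# A (hypothetical) LLM response for a checkbox app:
generated = '''
app = create_app_fixture("app.py")

def test_checkbox(page, app):
    page.goto(app.url)
    cb = controller.InputCheckbox(page, "enable")
    cb.set(True)
    cb.expect_checked(True)
'''

score = grade_generated_test(generated)
assert score >= 0.8  # only responses above the 80% bar would be accepted
```
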
Demo and housekeeping
All right. So what do the steps look like? Type in the command, enter the path to your Shiny app, and select the path where you want your test file to live. Let's look at the demo. (All right — we missed the train on this one, so we'll wait for the train to come back.) So I had the app and the location, and you can see this is in real time; I haven't sped up or slowed down the recording. It tells you how long it took: about six seconds, and about $0.07 to generate the whole thing. Like I said, you can run it on any of your Shiny for Python apps for now, and it will generate code just like that.
And you can always go in and change things if you need to; it gives you a really good first pass to start with. All right, some housekeeping for when you use it. We always recommend premium models — by that we mean the workhorse models the providers recommend, the ones you'd use for daily work. There are cheap models out there, like the Nano and Mini tiers, but in our testing we did not get good results from them. The way I like to think of it: you could use huskies to pull a sled, or you could use chihuahuas. You get what you pay for, so we recommend premium models if possible.
The generated tests will only cover Shiny components — something to keep in mind. If your app reads from a database or does something similar, that's really hard to tell just from looking at your Shiny app code, so be mindful of that; it will test the Shiny components within your app. Make sure you have IDs for every component, because that's how we seamlessly wire up controllers for testing. And just a word of caution when sharing code with LLMs: if you don't want that code or data used by LLMs for training, don't share it — and don't share your crypto keys or anything like that.
All right, final slide. This is available right now on PyPI, as of version 1.5. Like I said, writing tests is a good thing, so write them if possible; if you don't have the time, use `shiny add test` — it makes your life easier. Also, this is not the be-all and end-all: you always have the option to add more and extend your tests, for example if you have non-Shiny business logic. And, yeah, if you see grasshoppers getting too close to each other, just stop them. That's it. Thank you so much.
Q&A
It depends. If you're building the app, I think it's good to have tests in place from the start: for any change you make, they help you detect whether it's an intended change or a side effect. But if you have an app that's super complicated and already built, adding tests is still always a good thing, so you have a guardrail against things falling out of place.
Are there Shiny tests you would recommend continue to write manually?
Like I said, this is not an end-all solution; it gives you a really good base, but there will be some things it might not test. It's an LLM — we try to make sure it gives the best performance, but you might need human intervention to verify the output and make changes if needed. Maybe it doesn't test a certain thing you wanted tested, and you have to go in and add it. So it's always good to review, but it should do most of the heavy lifting for you.
Is there a version of this coming for R anytime soon?
For R, in my experience, the LLMs already have really good support, since they've been training on R material for such a long time. We could do it; I just don't see the return on investment in terms of how much better it would be, because there's no concept of controllers in R, and I believe shinytest2 already does a really good job with whatever you want to do. You could use an LLM directly; I just don't see us adding something that would provide huge value, because the LLMs are already super aware of R code.
Not yet, but it would be really easy to add, because we use chatlas under the hood, and chatlas supports those providers. If you create an issue, we can work on it and add it.

