Resources

Using R, Python, and Cloud Infrastructure to Battle Aquatic Invasive Species - posit::conf(2023)

Presented by Uli Muellner and Nicholas Snellgrove. Invasive species are a huge threat to lake ecosystems in Minnesota. With over 10,000 water bodies across the state, having up-to-date data and decision support is critical. Researchers at the University of Minnesota have created four complex R and Python models to support lake managers, all pulled together and presented with the most recent infestation data available. Come along with us to see how we connected these models in the AIS Explorer, a decision support application built in Shiny that helps prioritize risks and place watercraft inspectors, using tools like OpenCPU and AWS services such as Lambda, EventBridge, and S3. Presented at Posit Conference, Sept 19-20, 2023. Learn more at posit.co/conference. Talk Track: R or Python? Why not both! Session Code: TALK-1118


Transcript

This transcript was generated automatically and may contain errors.

I want to start off today by talking a little bit about Minnesota, the land of 10,000 lakes, as the locals call it. Now, fair call: before about a week ago, I hadn't actually been anywhere near Minnesota. But in spite of that, I do know a few things about the state. I know that, depending on which sources you use, there are anywhere between 18,000 and 24,000 uniquely registered water bodies across the state. I know that of those, around 9,000 have detailed inspection data available, covering about 4 million acres or so. And almost 10% of those are infested with invasive species, which is a big problem for the state and the people who enjoy those lakes.

Maybe let's back up for a second. Who is this guy? Why is this man from the other side of the world standing here talking about invasive species in Minnesota? Does he realize he's not in Minnesota right now? These are fair questions. I'll start off. My name is Nick Snellgrove. I'm a tech lead at Epi-interactive. We're full service partners with Posit for Australia and New Zealand. And the reason that I'm standing here right now is because of these four invasive species.

So for the last four years or so, we've been working with collaborators at the University of Minnesota's Aquatic Invasive Species Research Center, battling these species in particular: zebra mussel, starry stonewort, Eurasian watermilfoil, and spiny water flea. Now, when I say something like battling invasive species, you might be tempted to think of something a bit more dramatic. It reminds me of the movie Jaws, you know, people out on boats hunting down these creatures and bringing them to justice. Maybe it is that way for some, but in our case, it's a little more subtle. The creatures we're battling are a little smaller, so we have to be a little more methodical.

So for me, when I'm talking about battling invasive species, I'm actually talking about how we're using cloud-based infrastructure and a Shiny application called the AIS Explorer. And that's what we're going to go over today. So in this talk, we're going to be going over how we've built up this cloud-based infrastructure around the AIS Explorer to make sure that it's accessible to the people who need it, relevant for their uses, and that it performs well so that they can focus on what they need to do with it.

Background on the AIS Explorer

So let's have a little bit of background about this Shiny app. It was initially conceptualized back in 2019 with the support of the people at the University of Minnesota, and specifically the Minnesota Aquatic Invasive Species Research Center. We wouldn't be able to do any of this without their support. So a special call out and thank you to Nick Phelps, Amy Kinsley, and Alex Bajcz for all their support over the years. They've been wonderful.

So the AIS Explorer started its life as a decision support and risk management tool to help decision makers on the ground in the state protect their lakes from these invasive species, by linking the complex analytics with the practical decisions they need to make day-to-day. We finished development of the first version of the AIS Explorer back in mid-2020, and it has remained in production and in active use since then. These decision makers, the county managers, have been using this app. It's battle-tested, and it's been through several updates over that time. We've continued to maintain it and provide new features for the people who are using it.

And there's some pretty complex stuff going on behind the scenes here. Now, I'm not a data scientist. My background is in software, so I probably can't tell you a lot about the details of these models, but I can tell you that we've linked this collection of Python and R models into the application. So briefly: we have a risk model, written in Python, which estimates the risk of lakes becoming infested within the next eight years, I think. Then we have an intervention model, which lets users simulate what happens if they intervene at certain lakes in certain ways (things like inspections, decontamination, or education and outreach) and how that changes the story for those lakes. We have an inspection model, which helps people determine the most cost-effective or most beneficial way to place inspectors around the lakes in a county. And finally, the collaboration model, one of the newer additions, which lets county managers see how the story changes if they pool their resources together.

So these models are pretty complicated, and that makes it a challenging process to keep them fed, watered, and operating as they need to. The source data that feeds these models updates pretty frequently, and the computations that produce the results we present in the app are intensive; they take quite a long time. We want the Shiny app to perform well for the people using it, so we need to address that in how we host it. And even a manual update process isn't practical for us here, because we don't actually know when these updates will need to be applied; the timing is variable.

So this kind of boils down to three challenges that we have, which is, how do we ensure that the AIS Explorer is accessible, relevant, and performant for the people who are using it? So that's what we'll go over now.

Accessibility: hosting on AWS

We're going to start off with accessibility. This is how we're making sure that these people can actually use the Shiny app that we've built and prepared for them. And the first question we ask is: where do we actually host this? There are usually a couple of options: do we host it on-prem on some server that we have, or do we host it somewhere in the cloud? For obvious reasons, it's probably not going to be great if we host it back in our office in Wellington. So we decided to go with cloud infrastructure, and some discussions with the folks at MAISRC led us to Amazon Web Services. With over 200 cloud-based services to choose from, 32 regions, and 102 availability zones, we can take our application and put it right next door, where it's going to be nice and accessible for the folks in Minnesota who need to use it.

So now that we've decided on AWS, that leads us to the Elastic Compute Cloud, or EC2. We create one of these cloud-based servers and upload the code for our Shiny application onto it. From there, we can configure a lot if we need to: adjust the performance, change security groups, customize the instance however we need, really. We can also pick from a variety of performance tiers, so we can crank up the performance of the application just by picking a different option from a dropdown and restarting the server, which can be really helpful, because this app can take quite a bit of effort sometimes. The other really useful thing is backups: we can take snapshots of the state of these servers as Amazon Machine Images, these AMIs, and they're going to come in handy later as well.

So we have our Shiny code on this EC2 instance, and then we actually need some way of running the Shiny app. A fun little detour we had to take here is that we actually run the Shiny app directly on the EC2 instance. Traditionally, you would probably go with Shiny Server, but as I painfully found out during the initial development, there are some edge cases that come up when you try to throw 18,000 lakes onto a single leaflet map, which meant that Shiny Server really wasn't having it for us. So we decided to just run the app directly on the EC2 instance. And that brings up some questions, which we'll talk about later, about how we deal with multiple users.

Relevance: automated data updates

With these solutions, we've made our app accessible to the people in Minnesota, they can use it, and it should be performing reasonably well. The next thing we need to think about is how we keep this relevant. So we've got these four complicated models in here, and like I said, I'm not going to go too much into their details, but there are a couple of things to call out. If you look at the runtimes of these different models, you'll see a pretty variable scale: we've got models which take anywhere from one to five minutes, depending on the inputs, up to 20 hours or more. So it's a pretty variable range of models and how long it takes for their outputs to be produced. And all of these outputs need to come into our app at the same time.

The other thing to call out here is that there's a common thread running between these models, which is the source data: the DNR infested waters list, produced by the Minnesota Department of Natural Resources. Basically, this is feeding everything else we put into the AIS Explorer. All of our models depend on this Excel file, which lives on the DNR website. So we obviously need to know when it changes; that's the trigger for us to run our models again and produce the results we need. Specifically, we need two things from that file: the species a lake is infested with, and the DOW number, a unique identifier for each water body on that list.
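As a sketch of what we pull from that file, extracting those two fields takes only a few lines of Python. Everything here is illustrative: the real source is an Excel file on the DNR website, and its column names and rows are hypothetical stand-ins, not the actual list.

```python
import csv
import io

# Hypothetical CSV extract of the DNR infested waters list; the real file
# is an Excel sheet and its column names may differ.
SAMPLE = """water_body,dow_number,species
Lake A,10000100,zebra mussel
Lake B,10000200,starry stonewort
"""

def parse_infested_waters(text):
    """Return the (DOW number, species) pairs the models depend on."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["dow_number"], row["species"]) for row in reader]

print(parse_infested_waters(SAMPLE))
```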

I talked earlier about variable update timing, and that's down to the season. When it's winter in Minnesota, the lakes freeze over and people aren't really inspecting all that much, so not a lot of new data lands in this spreadsheet. But during the summer, we might see updates to the spreadsheet multiple times a week. So we need an automated process to know when it's changing. And again, for that, we turn to AWS.

So we're using Lambda to write some serverless code that checks the infested waters list and sees whether anything has changed since the last time we looked at it. If there are no changes, that's great; we don't need to worry. If there are changes, though, then we need to do something, and we spin up a processing instance to actually run these models. We use EventBridge to schedule that check to run every 12 hours, which is how I've ended up with an email inbox that twice a day gives me a notification: has this file changed, yes or no? Which is great for my mental well-being, but that doesn't matter.
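The core of that Lambda check can be sketched as a hash comparison, leaving out the download and storage details. This is a minimal illustration, not the production function; in the real Lambda the previous digest would also need to be persisted somewhere durable between invocations (for example, an S3 object).

```python
import hashlib

def file_digest(content):
    """SHA-256 digest of the downloaded spreadsheet's bytes."""
    return hashlib.sha256(content).hexdigest()

def has_changed(content, previous_digest):
    """True when the infested waters list differs from the last run.

    On the first run there is no stored digest, so everything counts
    as changed and the processing instance gets spun up.
    """
    return previous_digest != file_digest(content)

content = b"fake spreadsheet bytes"
print(has_changed(content, None))                  # first run: changed
print(has_changed(content, file_digest(content)))  # same bytes: unchanged
```

The EventBridge rule that triggers the check on a 12-hour cycle would use a schedule expression like `rate(12 hours)`.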

So if there has been a change, we spin up a processing instance, and this is how we actually run the model code and produce the outputs we need. It takes a decent amount of time: probably anywhere from 36 to 48 hours once we kick it off. We used one of those AMIs we created earlier to take a copy of a server containing all of our preprocessing code, and a cron job on the newly created instance kickstarts our pipeline. That runs through, and when everything has completed, another script on there uploads all of our results, the RDS files which are produced, to an S3 bucket, where they sit waiting to be ingested by the app. So that all happens without us even needing to think about it, really.

But then how do we get it into the app? That's the next important question. The way we do that is on the EC2 instance hosting the Shiny application, we have a cron job which, once a day, goes to our S3 bucket, retrieves the latest data available there, and then restarts the Shiny application. We've scheduled this for about 5 in the morning Minnesota time, so that hopefully nobody is affected when the server restarts. This means that when people come to the application, all of the files they need are already available, loaded directly from the same server the application is running on. So they can get right to what they need to do.
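That daily refresh boils down to three steps: list the bucket, pull the model outputs, restart the app. A sketch of that logic, with the actual S3 download and service restart passed in as stand-in functions (the names here are hypothetical, not our production scripts):

```python
def rds_keys(objects):
    """Filter an S3 listing (the shape of boto3 list_objects_v2
    'Contents' entries) down to the model output files the app pulls."""
    return sorted(o["Key"] for o in objects if o["Key"].endswith(".rds"))

def refresh(objects, download, restart_app):
    """Daily cron job: fetch the latest outputs, then restart Shiny so
    it loads them. `download` and `restart_app` stand in for the real
    S3 download and service-restart steps."""
    for key in rds_keys(objects):
        download(key)
    restart_app()

# Demo with a fake bucket listing and recording stubs.
listing = [{"Key": "outputs/risk.rds"}, {"Key": "outputs/notes.txt"},
           {"Key": "outputs/inspection.rds"}]
fetched, restarts = [], []
refresh(listing, fetched.append, lambda: restarts.append(True))
print(fetched)   # only the .rds files, in sorted order
print(restarts)
```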

Performance: load balancing and tuning

That brings us around to the last of the challenges we need to solve: the performance of the application. We've already spoken to some of these points as we've gone along. We're hosting the Shiny app in the right place. We've increased the specs of our EC2 instance so it has enough power behind it to run the more complicated tasks in the app, things like presenting all those lakes on a leaflet map. And we've preprocessed our expensive model outputs so that they're sitting there ready to go whenever someone comes to the application.

But the question then is: what else should we do? And that comes back to the way we're running the Shiny app. I mentioned before that we're not using Shiny Server on these instances because it was a bit too much in some cases, which means we need to handle the question of load balancing ourselves. We want to make sure that if there are multiple concurrent users of the app, they're not going to bump into each other and cause problems for each other's experience. And the way we do that, again turning to AWS, is with a load balancer along with target groups and autoscaling groups.

So our load balancer is linked up to the application URL. So when someone arrives intending to use the AIS Explorer, instead of just going directly to the server which is running the application, they'll instead be directed to the load balancer, which we have in the AWS Cloud. From there, the load balancer is going to take their connection, and it's going to redirect them into the target group, which has one of however many instances we have of the AIS Explorer at the time. And so it's going to distribute that traffic into the target group, depending on how it's currently being used and how that load is currently shared across the instances in that target group.

In terms of how many instances we have in that target group, we use the autoscaling group to kind of control that for us. So we have a minimum number of instances. So we always have at least one running. And then we have a certain threshold, which if the traffic on that instance gets to a certain point, then we'll spin up a new instance automatically so that as new people arrive, they will be redirected. Instead of going onto the same instance, they'll end up on one of the new ones, and it'll kind of share that load out so that people aren't really all crammed into one specific server. And as it kind of automatically creates these instances, as that traffic dies down, it will also safely tear those down without causing any problems for people who may be using the app.
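The scaling decision the autoscaling group makes can be illustrated with a small function. The thresholds and limits below are made up for illustration; they are not our production configuration, which AWS manages from CloudWatch metrics rather than code like this.

```python
import math

def desired_instances(active_connections, per_instance_limit,
                      minimum=1, maximum=4):
    """How many app instances to run for the current load.

    Always keep at least `minimum` running, add instances as traffic
    crosses the per-instance threshold, and tear them back down as
    traffic dies off. All numbers are illustrative.
    """
    needed = math.ceil(active_connections / per_instance_limit)
    return max(minimum, min(maximum, needed))

print(desired_instances(0, 20))    # idle: keep the minimum of 1
print(desired_instances(45, 20))   # 45 users at 20 per instance -> 3
print(desired_instances(500, 20))  # capped at the maximum of 4
```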

So besides the load balancing, there's one other thing to briefly consider, which is performance tuning. And we've already done some of this. We want to do as little in the Shiny app as we can. We've preprocessed our expensive operations, and we cache results in the Shiny app, using functions like bindCache() to hold on to expensive operations people may have already run. But the last question that comes up for me is: what if we need to run one of our expensive models?

So to answer that, we actually went a little bit outside of AWS. It crosses over a bit, but on another EC2 instance we have OpenCPU installed. We've taken the intervention model, in this case, and put it into OpenCPU as a package. Then, when someone tries to run one of these intervention models, instead of sitting there waiting 10 minutes while the model runs and the report generates, they send the request and continue using the app while OpenCPU, in the background on a different server, works on it. When it's done, we email the results as a report to the person who made the request. So it's a very hands-off approach, and it allows people to use these expensive models without needing to sit around and wait for the results.
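The fire-and-forget pattern this gives us can be illustrated in plain Python: submit the job, hand control straight back to the user, and deliver the result out-of-band when it finishes. The OpenCPU HTTP call and the real email step are stood in for by local functions here; this is the pattern, not our actual integration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)

def slow_model(lake_id):
    time.sleep(0.1)  # stands in for a ~10 minute intervention model run
    return f"intervention report for lake {lake_id}"

def email_report(report):
    # In production, this is an actual email to the requesting user.
    print(f"emailing: {report}")

def submit_intervention_run(lake_id):
    """Kick off the model in the background; the caller returns
    immediately, and the report is delivered via the callback once
    the run completes."""
    future = executor.submit(slow_model, lake_id)
    future.add_done_callback(lambda f: email_report(f.result()))
    return future

future = submit_intervention_run("10000100")
# The Shiny session keeps serving the user here while the model runs.
future.result()  # demo only: wait so the callback fires before we exit
```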


So using these cloud services, we've taken our Shiny application and we've made it accessible to the people who need it most. We have made sure that it is relevant, so it's always kept as up-to-date as possible so that the decision makers on the ground can make well-informed decisions about the ways that they're managing these species. And we've made sure that it performs well so that people don't have to waste time and wait around for the results that they want to see. They can focus more on making informed decisions and kind of taking that to battle these invasive species.

We never really win the fight against these invasive species. We just keep on fighting them. And because of these services we've introduced, the AIS Explorer is and will remain a very effective tool in the fight against those species now and in the future. Thank you.

Q&A

We have a few questions. I'll probably take a few, but if you have more, feel free to ask Nicholas maybe outside after this talk. So the first question is, do you use an Amazon AMI or how do you get the latest version of R onto your EC2?

So do we use the AMI? Yeah, we have these AMIs set up: one for the instance which holds the Shiny app, and one for the instance which contains our processing code. So there are two of those which we're mainly using, and we have other services that we then use to get those going. But actually creating them is a manual process as we go through development. Whenever we do a round of updates (we did one recently where we introduced some new models), we upload the new code, update the dependencies and the other packages we're using, and create a new image from that. So as we go through our development process, we have a development instance running in the background, and we gradually switch that over to be the new template for the production version. Does that make sense?

Before we move on: could your methods be generalized to work with other cloud providers? Yeah, I can't see why not. I'm personally not super familiar with things like Azure, but I see no reason why these couldn't be generalized to other cloud providers, definitely.