
Data 911: how Posit can support decision-makers in times of environmental crisis (Marcus Beck)
Speaker(s): Marcus Beck

Abstract: Over 200 million gallons of mining wastewater were released into Tampa Bay in March of 2021. Concerns about the environmental impacts prompted a multi-agency response to monitor water quality changes in the bay, producing thousands of sample points in need of synthesis and communication to a concerned public. This talk will describe how the Tampa Bay Estuary Program leveraged Posit products to create a data synthesis workflow and Shiny dashboard to inform decision-makers on how, where, and when water quality was affected by this pollution. Our experience navigating this event in real time will be shared with the broader community as a successful example of how Posit products can address environmental crises.

posit::conf(2025)

Subscribe to posit::conf updates: https://posit.co/about/subscription-management/
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
All right, well, good morning, everybody. My name is Marcus. I am the senior scientist at the Tampa Bay Estuary Program. And I'm going to be talking about what arguably was the most stressful six months of my life. So I really felt that keynote this morning. This is really about data science under stress and how do you make decisions when there's a sense of urgency around those decisions.
So we're going to do the retrospective approach here. If you go back to July of 2021, I live in Tampa Bay. That's where I work. If you were to walk around the shores of Tampa Bay, you'd notice these really conspicuous dumpsters with a sign out in front of them that said dead fish only. It's really gross. There were dead fish all along the shoreline, right? So there was like a massive fish kill that had happened in Tampa Bay. So many fish that the powers that be didn't really know what to do about it. So the city was like, we'll just put out these dumpsters. The public can clean it up. It was out of control. I actually looked in the dumpster. Zero of ten. Would not recommend. It was disgusting.
But it was just a gnarly situation. And so a couple months earlier, what was kind of leading up to this event, there's this facility on the southeast shore of Tampa Bay called Piney Point. It is a legacy fertilizer processing plant. Long story short, there's a convoluted history as to why this facility exists, why it led to this emergency discharge. But there was really a leak in the holding tanks for the wastewater that was sitting on site for years. And to prevent, you know, issues with public safety and destruction of public property, the best decision at that time was to release all of that water into the bay to prevent catastrophic failure of the stacks. This was a couple months prior to the fish kill. Nobody wanted this to happen. It was not ideal.
And this water was nasty. You know, very acidic, excess nutrients, all that bad stuff. Over a ten-day period when they were releasing this water, that part of the bay received an amount of pollution that it normally would have received in a year. So we expected all sorts of negative outcomes to occur. And to further underscore the horrific nature of this event, my boy Stephen King was tweeting about this. Back when Twitter was a thing, I used to follow him. And he's quite witty. But I think he has a vacation home in Florida. So he often would tweet about Florida-based things. And needless to say, this was receiving national attention, given that, you know, there was an expected negative outcome from this event.
The data challenge
So I work in environmental science. I'm the only data scientist on my team. And we knew that this event was going to have a negative effect on the bay. And everybody that was involved, the public, environmental managers, policy makers, really wanted to know, obviously, how is this going to affect the environmental resources in Tampa Bay? From the environmental management perspective, they wanted to know where are the conditions going to be the worst? So where do they focus their efforts for, say, cleaning up dead fish? And we also just had a general need to understand what can we do about it? Like, what can we learn from the situation so that if it were to happen again, how do we sort of mitigate it or even prevent it? And this is a time of crisis, right? So these are questions that were important, but they needed the answers right now.
So an alternative title that I was thinking for this talk was good enough crisis management. You know, one thing I learned about data science coming to this conference is that we're really concerned with efficiency and refactoring everything to make it as good as possible as the best product we can make. That's not what this talk is about. This is about doing scrappy data science when you want to get information out there as fast as possible.
So the program where I work facilitates a lot of the decisions around how Tampa Bay is managed. We facilitated this response-based monitoring effort: dozens of agencies collecting data at hundreds of sites, looking at dozens of parameters, and, as an environmental scientist, tens of thousands of measurements is a lot for me. So a lot of good data was being collected, but of course we all know data is not information. So that needed to be distilled.
So I'll cut to the chase here. We produced a dashboard. This is, you know, the end product here. This is how we distilled it. We were quite happy with this product. It was useful for the public as well as decision makers and policy makers. But it took a lot to go from that data set to the dashboard.
I appreciated the last talk kind of showing a similar example. This is like the scenario that keeps data scientists up at night. You just have tons of data files coming in. This was the situation we were in. We were the only organization at the time that was synthesizing this information. And I said, please, just give it to me. I'll deal with it. And if you look at that scroll bar on the right there, it goes down and down.
The synthesis workflow
So this is the situation we were in. Quickly my dumpster was on fire, and I needed to distill this information as rapidly as possible. So this is a common, like, workflow that I think many data scientists are familiar with. But this is kind of how I set this up. This is a gross overview of what this actually is. But we said, give us your data, put it on Google Drive, we'll deal with it. So it was great. We were getting the information or rather getting the data. But, of course, all that data needed to be synthesized into a format that was able to be basically pushed up to the dashboard. So I was in the trenches this entire time basically synthesizing this information. Then it was pushed up to GitHub. It was run through a suite of tests to verify the accuracy of that data. And then ultimately pushed to the dashboard.
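The pull-synthesize-push loop described above might be sketched in R roughly like this. To be clear, this is an illustrative assumption, not the program's actual code: the folder name, file layout, and the `standardize_provider()` helper are all hypothetical stand-ins for the hand-built synthesis steps.

```r
# Illustrative sketch: pull raw provider files from a shared Google Drive
# folder, standardize each one, and write the combined result that gets
# committed to GitHub for the dashboard. Names are hypothetical.
library(googledrive)
library(dplyr)
library(purrr)
library(readr)

raw_files <- drive_ls(path = "piney-point-raw")  # list provider uploads

combined <- raw_files$name |>
  map(function(fl) {
    local_path <- file.path(tempdir(), fl)
    drive_download(fl, path = local_path, overwrite = TRUE)
    read_csv(local_path, show_col_types = FALSE)
  }) |>
  # standardize_provider() would hold the hand-coded station renames,
  # unit conversions, etc. described in the talk (hypothetical helper)
  map(standardize_provider) |>
  bind_rows()

write_csv(combined, "data/combined.csv")  # pushed to GitHub, then the dashboard
```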
So going back to this idea of efficiency that we think about as data scientists, I wrote a ton of code just getting the raw data from Google Drive into a format that could be pushed up to GitHub and then the dashboard. This is code I'm not proud of. It's not pretty. It's not efficient. Sure, it could be improved. This is, you know, me hand coding a lot of the renames for the stations. But this code worked. It did what I needed it to do. I did not spend a lot of time refactoring it because I simply didn't have the time to do that.
Testing as guardrails
So I'm embarrassed of this code, but it worked. Where I wasn't going to sacrifice attention was the accuracy of the information coming out of that data synthesis workflow. And so, as sort of an insurance policy that I set up for myself, when the data came out of that workflow, it was pushed to GitHub and I set up a series of tests using testthat. This isn't your conventional testing that you would do for, like, a package, but these are things I wrote to verify that the data coming out of that workflow wasn't garbage. Are the names correct? Are the measurements in range? Are there duplicate values? Things like that. Was the data that I was churning out following the schema that I developed for the dashboard? These are my guardrails. This is the most important piece of this talk, I think: this is where I did not sacrifice attention, because I knew the data coming out of this workflow had to be accurate and had to be something that would be useful on the dashboard.
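Checks like these are straightforward to write with testthat against a data frame rather than a package. A minimal sketch, assuming a combined data frame `combined` with a schema along the lines described; the column names, ranges, and key columns here are illustrative, not the actual schema:

```r
library(testthat)

# Hypothetical schema: expected columns for the dashboard
expected_cols <- c("station", "date", "param", "value", "units")

test_that("data matches the dashboard schema", {
  expect_named(combined, expected_cols)
})

test_that("measurements are in plausible ranges", {
  chl <- combined$value[combined$param == "chla"]
  expect_true(all(chl >= 0 & chl <= 500))  # ug/L, illustrative bounds
})

test_that("no duplicate station/date/parameter records", {
  expect_false(any(duplicated(combined[, c("station", "date", "param")])))
})
```

On a failing check, the GitHub push stops before bad data ever reaches the dashboard, which is exactly the guardrail role described here.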
So I'm going to look at this a little bit more carefully. This is just by the numbers over basically the six-month period from the discharge release to when conditions improved. There were 641 commits to GitHub, and among those commits, about 8,500 tests. The good news is a large majority of those tests passed. So my workflow, even though it was ugly, was churning out data that I was happy with. However, a small minority of those tests failed. And this is small in the grand scheme of things, but these tests that failed saved my ass in a lot of instances.
And I'm going to show you what that looked like. If these tests were not in place, the obvious thing that would happen is we would break the dashboard. This is a flexdashboard, so we would get a Pandoc error if we pushed data that was not compatible with the dashboard. Obviously, we don't want to do this. We don't want to break the dashboard. It looks bad, and it's not serving the users. So the tests caught these instances where we would simply break the dashboard. But I think more importantly, they caught more insidious issues that wouldn't necessarily break the dashboard, but would lead to us showing inaccurate information that could lead to wrong decisions about how to respond to this event.
And so this is an example where I'm showing basically just a couple days of data for chlorophyll a, which is sort of a proxy for water quality. The points are sized by the concentration, so very generally speaking, the bigger the point, the worse the water quality. These small points here are actually instances where the units of those points mismatched what my schema was meant to be. When the data was provided to me, it was in different units, and because the units were different, the data basically shows up as NA values on the dashboard. The location is still shown, but there's no value associated with it. So this provides a false indication of water quality: because the points are small, it looks like water quality is better than it actually is. The tests were able to identify that. If they weren't there, I would have pushed these points to the dashboard and given a false sense of water quality that, of course, we didn't want to give.
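A units check is the kind of test that catches this failure mode before the data reaches the dashboard, instead of letting mismatched records silently turn into NA points. A sketch, with the same illustrative column names and an assumed canonical unit:

```r
library(testthat)

# Guard against unit mismatches: a provider reporting chlorophyll in mg/L
# instead of the expected ug/L would show up as implausibly small values
# and NA points on the dashboard. Column names are illustrative.
test_that("all chlorophyll records use the schema's units", {
  chl <- combined[combined$param == "chla", ]
  expect_true(all(chl$units == "ug/L"))
})
```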
Champion allies and user feedback
So what was kind of interesting when I was preparing this talk, we went through the speaker coaching, and one of the questions I got from one of the other speakers was when you're developing a front-facing dashboard, obviously you want to work with your users, get their feedback on how to, you know, suit their needs, basically meet them where they're at. How did we do this in a time of crisis?
So obviously, you know, the naive developer envisions this ideal scenario where, hey, I made this awesome thing for you. That's great. The user says, I love this, but can you please change this one part? And then you say, yeah, sure, no problem, I'll fix it for you, and everyone's happy. I realized the first keynote had a similar slide. I made this before him, so it was my idea first. But this is an ideal scenario. We know it doesn't work like that. It's more like this, where, you know, step one of whatever, you get some requests that make sense, some that are about incorporating data, some are just not helpful, and others are just downright mean.
So we wanted to avoid this. You know, I was busy developing that synthesis workflow, creating the tests. I didn't have time to filter all of the requests and address the user needs. Oh, and my heart was broken in this example. So what we did, and this is not something that is ideal when you're developing a product, but in a time of crisis, I think this is really important. I had what I'm calling champion allies, right, that served as a buffer between the users and me. And essentially, this was my boss and the assistant director. They were also in the trenches, talking to the media, talking to the public, so they had their finger on the pulse of what was actually happening. But what they could do for me is, you know, while I was in my cave developing the app, they could filter, you know, obviously the nonsense requests. They also, I've worked with them for a long time, so they know what I can accomplish. So a request could have been valid, but given the urgency of this, wasn't worth the amount of time. And they also, most importantly, know the science. So they know, you know, what water quality is all about. They know what type of information needs to be displayed for it, and they can then filter the most appropriate requests down to me. So I was insulated, and I could focus on what I needed to focus the most on and have my champion allies buffer me from all of this extra noise while I was creating these things.
The dashboard and its impact
This is just a short GIF of the dashboard, just to show you what it looked like. A lot of stuff on there. I'm going to show you some of the water quality results. You can click through multiple parameters. Here we're looking again at chlorophyll a. What's cool is you can toggle that and quickly see which sites are out of the normal range for that time of year and location, and you can zoom in and get more detailed information about individual sites. So this is a time series, again showing that context of what is normal. So you can see that things are kind of out of whack.
And we also, if we hadn't had this workflow in place, we wouldn't be able to create infographics like this. And this is so important to my program, where we're kind of in the business of storytelling in a way, where we interact with the public and we want to tell a narrative about environmental health and environmental quality. And had we not had this workflow to support this narrative or this infographic, we wouldn't be able to tell this story. So this is something the public can see and look at and, you know, quickly understand what happened, but it is data supported, and it's supported by data that we were able to synthesize and make sense of using this workflow.
We know it was reaching its end users. So again, back when Twitter was a thing, we were getting some attention. Charlie Justice, he was the chair of our policy board, county commissioner at the time. He was giving us unsolicited advertising for the dashboard. And it was also really cool, the media was picking up on this. We weren't actively reaching out to the media, but because we were doing open science and pushing graphics on GitHub, they were finding them and using them and developing a narrative. So this is a newscast where this is one of my graphics here that I created. It's pretty funny. If you go to that link, you'll see a video of me being interviewed where I really needed a haircut. But again, this is what I love to see happen when you're doing data science. You want people to use this information. To me, this is the best example of how this can be done or how this was done.
Fortunately, you know, this was a couple years ago. Piney Point is currently being closed. Obviously the work I did was not the sole reason why this place is finally being closed down, but I do like to think that some of the work we did did help. It did help the conversation around what it means to manage these facilities responsibly and shed a light on how they can really have negative impacts when they're not managed responsibly.
Key takeaways
So I think this is, you know, pretty straightforward. Again, this was about how we do data science when there's a tremendous sense of urgency. Obviously you can't focus on everything: triage, figure out what's important, prioritize your time. But when you do take these shortcuts, use the guardrails, because they will save you. Like I said, the testing that I did was just instrumental to making sure we didn't break things or push inaccurate information that could lead to wrong decisions. And finally, get those champion allies, the experts that can help you with what you're doing. I'm the only data scientist on the team, but they were so instrumental in helping me do the work that I needed to do when everyone was sort of running around in chaos at the time.
So that's it. Those are my champions. Shout out to them. I want to thank posit::conf for giving me the time to talk about this. I appreciate your time, and I'll take questions if there's time.
Q&A
Thank you so much, Marcus. One question here. Do you have suggestions on how to test for data quality issues that you don't necessarily expect to occur?
That's a very good question, because the things I coded for were things that I expected to occur. And I'll say, you know, I know some of the information that was getting out there was not perfect. A good example: what's common with water quality data is you have replicate measurements or lab blanks that are meant to be QC information, which typically doesn't go on a dashboard. I was pushing some of that information up there. It wasn't wrong information. It was just QC data. And what was cool is, because this was a high-profile dashboard, the data provider actually called me and was like, hey, you know, you're pushing QC data to the dashboard, just FYI. So I don't have a good answer to that. We know it wasn't perfect, but it met our needs at the time.
Great. Could you expand on the testthat capabilities that you used for the dashboard? Maybe briefly?
Yeah. So I'm not a database manager or database engineer. I kind of developed my ad hoc schema of what I wanted the data to look like. So it was in that unified format. And I basically developed the tests around that. And it also looked for really egregious things like negative values for dissolved oxygen. It can't go negative, for example. So there's some obvious cases, but it was also supported by decades of monitoring data that we have for the region that gives context of what is normal, what is abnormal. And we sort of, we try to incorporate some of that knowledge into the testing as well. So it was informed by my experience as well as just what was feasible or physically impossible.
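The two kinds of checks described in this answer, hard physical limits and soft limits drawn from historical monitoring, might be sketched like this. The `baseline` table of historical percentile bounds by parameter and month is a hypothetical stand-in for the decades of regional monitoring data mentioned:

```r
library(testthat)

# Hard physical limits: dissolved oxygen cannot be negative
test_that("dissolved oxygen is non-negative", {
  do_vals <- combined$value[combined$param == "do"]
  expect_true(all(do_vals >= 0))
})

# Soft limits from historical context: flag values outside the 1st-99th
# percentile for that parameter and month (hypothetical baseline table
# with columns param, month, p01, p99)
test_that("values fall within historical context", {
  chk <- merge(combined, baseline, by = c("param", "month"))
  expect_true(all(chk$value >= chk$p01 & chk$value <= chk$p99))
})
```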
Amazing. Together with your champion allies, could you give a quick suggestion on anything you've learned about managing user requests, considering that you've probably got a lot of them if the dashboard was public?
Yeah. When we developed this dashboard, we kind of flipped the process on its head. It was a situation where this information needed to get out there ASAP. So I just pushed something out there. I think it was out there within four days of the release starting. So we had information out there because there was a concerned public, decision makers. So we didn't have the time to incorporate user requests like you normally would or the intentional design process where you kind of meet their needs and where they're at.
So I'll say we kind of just went at it in sort of an ad hoc way and did the best we could. And I think the proof of concept that it was being picked up by the media, it was something I tweeted about. I had analytics running on it too. I know people were using it. And I would go to talks and see screen grabs of the dashboard. So I had a good sense that it did what we wanted it to do. We just had to be really intentional with the types of requests we were getting from users.
