Resources

Tanya Cashorali | Cross-Industry Anomaly Detection Solutions with R and Shiny | RStudio (2022)

This session highlights two anomaly detection use cases in production: identification of problematic life sciences manufacturing units and identification of significant newsworthy events. With both solutions, Shiny is integrated with live data to provide early detection for proactive intervention. Shiny’s intuitive user interface also allows for interaction with the data behind anomalies to uncover potential causes and paths to action or resolution. The session also briefly highlights a rapid prototyping development approach with Shiny. This technique allows for collaborative refinement of the underlying anomaly detection model in R, quickly incorporating user feedback, where end users may not have in-depth machine learning knowledge.

Talk materials are available at https://docs.google.com/presentation/d/e/2PACX-1vTE7Ee2QIUGDUmfEKmF8l_WTQPVgnGaLJLGuuMquio57bXojeeb5YYSjuzO-xzYxMHxuX2cm_QNC2y-/pub?start=false&loop=false&delayms=60000&slide=id.gbb68c6dbe2_1_44

Session: Working with people is hard

Oct 24, 2022
17 min

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

My goal at the end of this talk is you walk away learning something, whether it's just about an R package or you have some new strategies in terms of approaching some of your more technical solutions with end users or clients or what have you.

So quick introduction. My name is Tanya Cashorali. I got my first start with R in 2005. I was working at Children's Hospital Informatics Program analyzing genetic data, and I've been using it ever since. I love R. I started my own company based out of Boston called TCB Analytics in 2015, and I always have to include Porkchop. He's my assistant. You may see him come in here at some point, but you feel free to contact me through any of these avenues. I'm more than happy to answer questions if we don't have enough time either.

What is anomaly detection?

So what is anomaly detection? This is just the Wikipedia definition. It's sometimes referred to as outlier detection, but I like to say it gets a little fancier than just detecting standard deviations above or below a mean. It's a way to identify rare items or events. I typically use it for time series data. Twitter actually developed a package, which I'll get into, and they use it to look at tweets per second. You can use it for CPU metrics and utilization, but I'll talk about two different industry cases in which we were able to use it and had a lot of success doing so.

So the R package, like I said, was developed first at Twitter, and there are some really good papers that I link at the end of this. The slides are available, and I definitely recommend them; there's some good reading in there. I won't have time to get into all the technical details, but I actually used the fork from hrbrmstr, who made a few modifications to the package: it returns tidy data frames (tibbles), and he also gave the functions easier snake_case aliases, so you'll see ad_ts() for time series and ad_vec() for vectors. So this is the package. It's been phenomenal for us so far.
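As a rough sketch of what calling the fork looks like (the argument names follow the package docs, but treat the exact values here as illustrative assumptions):

```r
# remotes::install_github("hrbrmstr/AnomalyDetection")
library(AnomalyDetection)

# ad_ts() expects a two-column data frame: a timestamp and a numeric value
raw <- data.frame(
  timestamp = seq(as.POSIXct("2022-01-01", tz = "UTC"),
                  by = "hour", length.out = 24 * 14),
  count = rpois(24 * 14, lambda = 100)
)
raw$count[c(50, 200)] <- c(400, 5)  # inject a couple of obvious anomalies

# max_anoms caps the share of points flagged; direction looks above,
# below, or both; alpha is the significance level for the ESD test
res <- ad_ts(raw, max_anoms = 0.02, direction = "both", alpha = 0.05)
res  # a tidy tibble of anomalous timestamps and values
```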

Now, just to talk about the implementation, or the underlying algorithm, very briefly: it builds upon a generalized extreme studentized deviate (ESD) test, which is just a big phrase for saying it does some time series decomposition and robust statistics. The paper I mentioned actually benchmarks four approaches: quantile regression with a linear model, seasonal decomposition with loess (STL), quantile regression with splines, and a piecewise median. The one they said actually performed best was the piecewise median. Stable metrics typically exhibited very little change in the median over two-week windows, so it defaults to breaking your data into two-week windows and calculating a moving median. It also accounts for intraday and weekly seasonalities. It was performing pretty much on par with the more expensive calculations while running about four times faster than the other ones, and that's the red line you're seeing here. We've had no performance issues even after giving it quite a bit of data, and we run everything in real time.
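To make the piecewise median idea concrete, here is a minimal sketch in R (my own illustration, not the package's code): break the series into two-week windows, take each window's median as the expected level, and judge anomalies against the residuals.

```r
piecewise_median <- function(dates, values, window = "2 weeks") {
  bucket <- cut(dates, breaks = window)   # assign each point to a window
  ave(values, bucket, FUN = median)       # moving median per window
}

dates    <- seq(as.Date("2022-01-01"), by = "day", length.out = 56)
x        <- sin(seq_along(dates) / 7) + rnorm(56, sd = 0.1)
baseline <- piecewise_median(dates, x)
residual <- x - baseline  # the ESD test then runs on residuals like these
```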

Working with people and rapid prototyping

So I love that this track is also called working with people, because it's the hardest thing we do. R and code and math are typically easy compared to working with people, and that's what we do day to day. Typically, the problems when implementing complex technical solutions come down to a communication breakdown, usually between developers and end users. So oftentimes we'll get a set of initial requirements, and when building data products, they almost always change or get thrown out the window. No matter how well an end user knows the product or the data, there are so many things they'll never be able to predict, and that can happen as soon as you start building. Before an app even gets into production, it just doesn't meet the end user's expectations. They may have given you a mock-up or a wireframe, but there were a whole bunch of other edge cases that they just didn't think of that they wanted to tackle, and we see that happen a lot.

And this happens because not all users have the technical understanding. Sometimes they have a limited understanding of the statistical processes and principles, and developers may just take those user requirements, run with them, disappear, and come back with an end product. You don't want to disappear and then come back to the user with something they may not have actually wanted. So part of the art is figuring out exactly what the user does want compared to what they're telling you. And that's where Shiny comes into play.

So we've had a lot of success just taking ideas, connecting to real data, and showing the user something tangible, something that they can interact with. And what we found is using Shiny as a collaboration aid, we see those requirements quickly change, we see the user get excited because they're now touching something real and working with something real, and they feel more engaged. They feel a sense of ownership that typically they wouldn't have if you were just a developer off doing your own thing, and disappearing for weeks or months, and then coming back with something that they didn't necessarily feel like they were a part of.

So this is what I like to call rapid prototyping, and it enables you to demonstrate value before doing this full-scale implementation, and it enables you to find all these different edge cases, and data scientists can focus on problem-solving instead of having to play this back and forth of, well, what exactly do you want? You can solve a problem quickly, come back, show the client something, and then keep iterating that way.

Use case 1: life sciences hardware manufacturer

So to get into the examples, this first one I think is pretty cool. It's a hardware manufacturer in life sciences. They essentially build the rapid blood testing machines in hospitals. So if you go to the hospital, hopefully not, when you're there and you get your blood taken, this is the company creating the devices that give you those rapid results. They have sensors in them that relay quantitative metrics and help diagnose instrument failures, but what this company was doing was deploying the hardware in the field and finding out retroactively that there was a problem, so they would take the machine back and analyze all of the data coming off the machines. This seemed like a perfect opportunity to use anomaly detection. We wanted to identify those problems before they went out into the field.

So here's what the data and experimentation look like. We have various metrics coming off these machines, things like protein levels and analytes, and they come off at different aggregations: 24 hours, 12 hours, et cetera. This anomaly detection function has a bunch of parameters, and we just played around with some of them and iterated on things like: what percentage of the data should we capture anomalies for? Should it be the positive and negative direction, so above or below the expected threshold? The level of significance: do we really need to be stringent here at 99% confidence, or can we go with 90%? And then that piecewise median period I mentioned. Another thing we realized was that there are some outliers that are just insane, which anomaly detection would always pick up, and that's very indicative of a hardware failure, but the client already knew about these. They said, well, we know these are problematic, so we actually did some outlier removal and then reran the anomaly detection, and, of course, those were no longer detected. The input data very much makes a difference in your end results.

So those are all parameters we wanted to play with, and we were literally analyzing all of this manually. It turns out that by putting it into a Shiny app and just allowing myself and the client to play with these different parameters, we quickly got feedback: hey, this actually looks like a good threshold to use, these are good parameters, let's run with these. From there we thought, well, instead of this sort of top-down approach of exploring all the different options, let's alert the user on exactly where to go. So we took the high-significance anomalies, highlighted them, and allowed the user to click one: I want to know more about this one, what's going on with potassium or calcium. It would bring you right back to that screen and pre-populate all the parameters, a nice little user experience touch and time saver, and they could dig into the details a little more.
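A toy skeleton of that kind of parameter-tweaking app might look like this (the layout, input names, and the detect_and_plot() helper are all illustrative, not the client app):

```r
library(shiny)

ui <- fluidPage(
  sliderInput("max_anoms", "Max share of points flagged",
              min = 0.01, max = 0.20, value = 0.02),
  selectInput("direction", "Direction", c("pos", "neg", "both"), "both"),
  sliderInput("alpha", "Significance level",
              min = 0.01, max = 0.10, value = 0.05),
  plotOutput("anomaly_plot")
)

server <- function(input, output, session) {
  output$anomaly_plot <- renderPlot({
    # rerun detection with the user's parameters and redraw the series;
    # detect_and_plot() stands in for the project's own helper function
    detect_and_plot(input$max_anoms, input$direction, input$alpha)
  })
}

shinyApp(ui, server)
```

The point of a skeleton like this is speed: the client turns knobs on live data instead of waiting for the next round of static charts.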

The code was pretty straightforward. The input data looks just like you'd expect for a time series. There's a timestamp and a measure. We iterated over the different significance values. We also defaulted to using one year of data prior to the date range that the client is actually interested in. So if I care about May of 2020, let's start with data from May 1st of 2019 to get some of that seasonality, and more data is better for this type of algorithm.
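A hedged sketch of that setup (the sensor_data frame and its column names are invented for illustration; ad_ts() is from the AnomalyDetection fork mentioned earlier):

```r
library(AnomalyDetection)
library(dplyr)

# pull one year of history before the month the client cares about,
# so the algorithm can see the seasonality
focus_start <- as.POSIXct("2020-05-01", tz = "UTC")
history <- sensor_data %>%
  filter(timestamp >= focus_start - 365 * 86400,
         timestamp <  focus_start + 31 * 86400) %>%
  select(timestamp, measure)

# sweep a few significance levels and compare how many points get flagged
results <- lapply(c(0.10, 0.05, 0.01), function(a) {
  anoms <- ad_ts(history, max_anoms = 0.02, direction = "both", alpha = a)
  data.frame(alpha = a, n_anomalies = nrow(anoms))
})
do.call(rbind, results)
```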

But throughout the project, they informed us, hey, we actually have these limits that we already know about. You could consider them like a prior in Bayesian terms: an upper limit and a lower limit beyond which a reading is automatically bad. This seemed like it could guide us a little more, and we could probably integrate it. We had started with a completely data-driven approach, agnostic of any business logic, but then we said, well, maybe we can incorporate these limits. Even giving them a simple tool like this to explore those different outliers was something they didn't have; they had really been doing all of this manually and by hand.

So then we combined those rules with the anomaly detection. Over the course of a few months we got better and better at finding these parameters. So now, in the colors you see here, the green is just an anomaly that the algorithm found based on the parameters we had already tweaked. The blue means it's approaching that threshold limit, and these vertical lines are those thresholds, which vary by analyte. And then the red is both. We decided those are probably the most important to look at because there's strong signal there: not only are you approaching a limit, but the anomaly detection is saying this is trending differently based on the piecewise median.
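That red/blue/green triage can be expressed as a simple classification pass; here's a sketch where the readings frame, its is_anomaly flag, and the near_limit rule are all invented stand-ins for the project's own data:

```r
library(dplyr)

# is_anomaly: the flag returned by the detection step
# near_limit: the client's "approaching the threshold" business rule
labeled <- readings %>%
  mutate(status = case_when(
    is_anomaly & near_limit ~ "red",    # strong signal: both sources agree
    near_limit              ~ "blue",   # approaching a known limit only
    is_anomaly              ~ "green",  # algorithmic anomaly only
    TRUE                    ~ "normal"
  ))
```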

The other thing we did was allow the user to specify how stringent they wanted this approaching-limit check to be. So here you can see it's not that close to the line, but we were able to define thresholds in case they really wanted to get close. The way we did that was based on quartiles: the closer a reading was to the top quartile of that range, the more stringent the flag, and we let them open up the lens, so to speak, to be looser or more stringent. This ended up being the best approach for the client: not only the algorithm, but incorporating the business logic as well.
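One way to express that quartile-based "approaching the limit" check (my own illustration of the idea, with made-up function and argument names):

```r
# stringency = 0.75 flags the top quartile of the range between the limits;
# pushing it toward 1 narrows the lens, lowering it opens it up
approaching_limit <- function(x, lower, upper, stringency = 0.75) {
  cutoff_hi <- lower + stringency * (upper - lower)
  cutoff_lo <- upper - stringency * (upper - lower)
  x >= cutoff_hi | x <= cutoff_lo
}

approaching_limit(c(5, 9.6, 0.2), lower = 0, upper = 10)
# FALSE TRUE TRUE
```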


So I want to jump quickly into the next example so that we have time for questions. But to wrap up this one: they're using it on a day-to-day basis. It runs as a Shiny app, they're able to connect live to their database, and it's been used as a tool to direct their investigations and identify these hardware failures before the machines get shipped out to the hospitals. So we think that's a really cool application, and it's R and Shiny, so the price tag is pretty good.

Use case 2: sports and entertainment news anomalies

So the next real-world example is another client that works in the sports and entertainment business. They actually do market research around sports technology, things like in-venue betting, understanding how to engage fans in this new technical age, biometric sensors on athletes, you name it, basically any sports technology. I like to call them kind of a market research company for sports tech. What they have is a whole bunch of news articles that they collect via Feedly and DiffBot, which are third-party tools, and Leo, which is, I believe, part of Feedly and analyzes all the information for you. It's kind of like a smart filter to bring in the information they care about.

Okay, so the data is thousands of sports news articles. There's also entity tagging that this Leo tool does. It tags sports teams, technologies, vendors, meaning vendors like Google, Microsoft, Nike, technologies being anything from blockchain to augmented reality. And so what we thought here was they wanted interesting news stories to be surfaced to them. So we simply took the number of mentions of these various entities over time. You could imagine Boston Red Sox are mentioned 10 times a week. Suddenly they're mentioned 30. So we thought, okay, let's investigate why.
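The mentions-over-time series feeding the anomaly detection can be built with a simple count; in this sketch, the articles frame and its columns are invented stand-ins for the Leo/DiffBot-tagged data:

```r
library(dplyr)

weekly_mentions <- articles %>%
  filter(entity == "Boston Red Sox") %>%
  count(week = as.Date(cut(published_at, breaks = "week")),
        name = "mentions")

# weekly_mentions is now one row per week: the time series shape that
# an ad_ts()-style detector expects as input
```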

So here's what the first iteration looks like. We have a date range, and they can filter by league if they'd like; we're looking at all leagues here. And we have a list of anomalies that our package bubbled up to the top. The Seattle Sounders were detected here. We click that, and it gives us a time series look at that particular anomaly. We can then click it and discover more about what's happening. So on that date, Shiny pops up this nice little module: the Seattle Sounders and Amazon worked together. Amazon had acquired streaming rights for Prime Video soccer matches, and that's a huge thing this company would like to know about.

So from there, you could see how you can rapid prototype, and it might be manual at first as you surface some anomalies. But you could imagine how this could be automated into a report: give me the five most interesting stories that week, give me the teams that are spiking in the news, and so on.

Another cool use case for them was that it doubled as a QC tool. Certain things would be spiking that shouldn't be. For example, the Ottawa Senators, an NHL team, weren't doing particularly well, and there was really no reason for a spike; it wasn't the playoffs or anything. So we decided to dig into it, and it turned out that DiffBot was tagging OTT for the Ottawa Senators, because that's their abbreviation, but it also stands for over-the-top streaming services. So we had identified a problem with the tagging. We alerted DiffBot, and we got that fixed in the data itself. So it can also be used as a data QC tool, not only to surface interesting things but potentially problematic ones, as we also saw in the first example of finding hardware failures.

Final thoughts

And I'm leaving, I think, at least four minutes for questions, but final thoughts. Again, working with people is the hardest thing. We're lucky that usually people we've worked with so far understand the problem, and they're very collaborative. And I think being able to show them these results piecemeal is what built up the confidence and their understanding of the tool and why it matters. Giving them the why or allowing them to dig in to the why and explore more on their own is that sort of self-service and freedom of analysis that they need to really feel like they're engaged.


So Shiny's been just incredible for that, because I remember the days where you just had to show them static charts and put together a PowerPoint, and it's very much like I'm just talking to you rather than working together.

So anomaly detection is a useful data-driven tool when used appropriately. In our cases, we identified specific needs for each scenario and tailored the solutions accordingly. The first use case was proactively identifying hardware failures, where we learned over time that incorporating the client's threshold limits would give us more accurate results than relying on the data alone. And the other use case surfaced interesting news stories and doubled as a data QC tool.

Always work closely with your end users via that iterative process. Shiny, again, is fantastic for this. I always say build fast and fail quickly. I'd rather show you something that you think is terrible and you tear it apart. And getting that constructive criticism, to me, is a good sign. It means you're invested. It means you have ideas. And it means that we can improve upon it very fast.

And then don't leave out important business logic where it can help you. We don't always have to just rely on statistical models and assume it's going to solve all our problems. There's oftentimes that we can incorporate that domain knowledge and it can really improve our results. So always be looking for those pieces of logic that can help your automation or your math or your R package. And questions, feel free again to contact me here. I look forward to hopefully being there next year and enjoy the rest of the conference.