Resources

Tanya Cashorali | Cross-Industry Anomaly Detection Solutions with R and Shiny | RStudio (2022)

This session highlights two anomaly detection use cases in production: identification of problematic life sciences manufacturing units and identification of significant newsworthy events. With both solutions, Shiny is integrated with live data to provide early detection for proactive intervention. Shiny’s intuitive user interface also allows for interaction with the data behind anomalies to uncover potential causes and paths to action or resolution. The session also briefly highlights a rapid prototyping development approach with Shiny. This technique allows for collaborative refinement of the underlying anomaly detection model in R, quickly incorporating user feedback, where end users may not have in-depth machine learning knowledge.

Talk materials are available at https://docs.google.com/presentation/d/e/2PACX-1vTE7Ee2QIUGDUmfEKmF8l_WTQPVgnGaLJLGuuMquio57bXojeeb5YYSjuzO-xzYxMHxuX2cm_QNC2y-/pub?start=false&loop=false&delayms=60000&slide=id.gbb68c6dbe2_1_44

Session: Working with people is hard

Oct 24, 2022
17 min

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

My goal at the end of this talk is you walk away learning something, whether it's just about an R package or you have some new strategies in terms of approaching some of your more technical solutions with end users or clients or what have you.

So quick introduction. My name is Tanya Cashorali. I got my first start with R in 2005. I was working at Children's Hospital Informatics Program analyzing genetic data, and I've been using it ever since. I love R. I started my own company based out of Boston called TCB Analytics in 2015, and I always have to include Porkchop. He's my assistant. You may see him come in here at some point, but you feel free to contact me through any of these avenues. I'm more than happy to answer questions if we don't have enough time either.

What is anomaly detection?

So what is anomaly detection? This is just the Wikipedia definition. It's sometimes referred to as outlier detection, but I like to say it gets a little fancier than just detecting standard deviations above or below a mean. It's a way to identify rare items or events. I typically use it for time series data. Twitter actually developed a package, which I'll get into, and they use it to look at tweets per second. You can use it for CPU metrics and utilization, but I'll talk about two different industry cases in which we were able to use it and had a lot of success doing so.

So the R package, like I said, was developed first at Twitter, and there are some really good papers that I link at the end of this. The slides are available, and I definitely recommend them; there's some good reading in there. I won't have time to get into all the technical details, but I actually used the fork from hrbrmstr, who made a few modifications to the package: it returns tidy data frames (tibbles), and he also gave the functions easier snake_case aliases, so you'll see ad_ts() for time series and ad_vec() for vectors. So this is the package. It's been phenomenal for us so far.
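As a rough sketch of what calling the fork looks like (the argument names follow the package docs, but treat the exact values here as illustrative assumptions):

```r
# remotes::install_github("hrbrmstr/AnomalyDetection")
library(AnomalyDetection)

# ad_ts() expects a two-column data frame: a timestamp and a numeric value
raw <- data.frame(
  timestamp = seq(as.POSIXct("2022-01-01", tz = "UTC"),
                  by = "hour", length.out = 24 * 14),
  count = rpois(24 * 14, lambda = 100)
)
raw$count[c(50, 200)] <- c(400, 5)  # inject a couple of obvious anomalies

# max_anoms caps the share of points flagged; direction looks above,
# below, or both; alpha is the significance level for the ESD test
res <- ad_ts(raw, max_anoms = 0.02, direction = "both", alpha = 0.05)
res  # a tidy tibble of anomalous timestamps and values
```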

Now, just to talk about the implementation, or the underlying algorithm, very briefly: it builds upon a generalized extreme studentized deviate (ESD) test, which is just a big phrase for saying it does some time series decomposition and robust statistics. The paper I mentioned actually benchmarks four approaches: quantile regression with a linear model, seasonal decomposition with loess (STL), quantile regression with splines, and a piecewise median. The one they said actually performed best was the piecewise median. Stable metrics typically exhibited very little change in the median over two-week windows, so it defaults to breaking your data into two-week windows and calculating a moving median. It also accounts for intraday and weekly seasonalities. It was performing pretty much on par with the more expensive calculations while running about four times faster than the other ones, and that's the red line you're seeing here. We've had no performance issues even after giving it quite a bit of data, and we run everything in real time.
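To make the piecewise median idea concrete, here is a minimal sketch in R (my own illustration, not the package's code): break the series into two-week windows, take each window's median as the expected level, and judge anomalies against the residuals.

```r
piecewise_median <- function(dates, values, window = "2 weeks") {
  bucket <- cut(dates, breaks = window)   # assign each point to a window
  ave(values, bucket, FUN = median)       # moving median per window
}

dates    <- seq(as.Date("2022-01-01"), by = "day", length.out = 56)
x        <- sin(seq_along(dates) / 7) + rnorm(56, sd = 0.1)
baseline <- piecewise_median(dates, x)
residual <- x - baseline  # the ESD test then runs on residuals like these
```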

Working with people and rapid prototyping

So I love that this track is also called working with people, because it's the hardest thing we do. R and code and math are typically easy compared to working with people, and that's what we do day to day. Typically, the problems when implementing complex technical solutions come down to a communication breakdown, usually between developers and end users. So oftentimes we'll get a set of initial requirements, and when building data products, they almost always change or get thrown out the window. No matter how well an end user knows the product or the data, there are so many things they'll never be able to predict, and that can happen as soon as you start building. Before an app even gets into production, it just doesn't meet the end user's expectations. They may have given you a mock-up or a wireframe, but there were a whole bunch of other edge cases that they just didn't think of that they wanted to tackle, and we see that happen a lot.

And this happens because not all users have the technical understanding. Sometimes they have a limited understanding of the statistical processes and principles, and developers may just take those user requirements, run with them, disappear, and come back with an end product. You don't want to disappear and then come back to the user with something they may not have actually wanted. So part of the art is figuring out exactly what the user does want compared to what they're telling you. And that's where Shiny comes into play.

So we've had a lot of success just taking ideas, connecting to real data, and showing the user something tangible, something that they can interact with. And what we found is using Shiny as a collaboration aid, we see those requirements quickly change, we see the user get excited because they're now touching something real and working with something real, and they feel more engaged. They feel a sense of ownership that typically they wouldn't have if you were just a developer off doing your own thing, and disappearing for weeks or months, and then coming back with something that they didn't necessarily feel like they were a part of.

So this is what I like to call rapid prototyping, and it enables you to demonstrate value before doing this full-scale implementation, and it enables you to find all these different edge cases, and data scientists can focus on problem-solving instead of having to play this back and forth of, well, what exactly do you want? You can solve a problem quickly, come back, show the client something, and then keep iterating that way.

Use case 1: life sciences hardware manufacturer

So to get into the examples, this first one I think is pretty cool. It's a hardware manufacturer in life sciences. They essentially build the rapid blood testing machines in hospitals. So if you go to the hospital, hopefully not, when you're there and you get your blood taken, this is the company creating the devices that give you those rapid results. They have sensors in them that relay quantitative metrics and help diagnose instrument failures, but what this company was doing was deploying the hardware in the field and finding out retroactively that there was a problem, so they would take the machine back and analyze all of the data coming off the machines. This seemed like a perfect opportunity to use anomaly detection. We wanted to identify those problems before they went out into the field.

So here's what the data and experimentation look like. We have various metrics coming off these machines, things like protein levels and analytes, and they come off at different aggregations: 24 hours, 12 hours, et cetera. This anomaly detection function has a bunch of parameters, and we just played around with some of them and iterated on things like: what percentage of the data should we capture anomalies for? Should it be the positive and negative direction, so above or below the expected threshold? The level of significance: do we really need to be stringent here at 99% confidence, or can we go with 90%? And then that piecewise median period I mentioned. Another thing we realized was that there are some outliers that are just insane, which anomaly detection would always pick up, and that's very indicative of a hardware failure, but the client already knew about these. They said, well, we know these are problematic, so we actually did some outlier removal and then reran the anomaly detection, and, of course, those were no longer detected. The input data very much makes a difference in your end results.

So those are all parameters we wanted to play with, and we were literally analyzing all of this manually. It turns out that by putting it into a Shiny app and just allowing myself and the client to play with these different parameters, we quickly got feedback: hey, this actually looks like a good threshold to use, these are good parameters, let's run with these. From there we thought, well, instead of this sort of top-down approach of exploring all the different options, let's alert the user on exactly where to go. So we took the high-significance anomalies, highlighted them, and allowed the user to click one: I want to know more about this one, what's going on with potassium or calcium. It would bring you right back to that screen and pre-populate all the parameters, a nice little user experience touch and time saver, and they could dig into the details a little more.
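A toy skeleton of that kind of parameter-tweaking app might look like this (the layout, input names, and the detect_and_plot() helper are all illustrative, not the client app):

```r
library(shiny)

ui <- fluidPage(
  sliderInput("max_anoms", "Max share of points flagged",
              min = 0.01, max = 0.20, value = 0.02),
  selectInput("direction", "Direction", c("pos", "neg", "both"), "both"),
  sliderInput("alpha", "Significance level",
              min = 0.01, max = 0.10, value = 0.05),
  plotOutput("anomaly_plot")
)

server <- function(input, output, session) {
  output$anomaly_plot <- renderPlot({
    # rerun detection with the user's parameters and redraw the series;
    # detect_and_plot() stands in for the project's own helper function
    detect_and_plot(input$max_anoms, input$direction, input$alpha)
  })
}

shinyApp(ui, server)
```

The point of a skeleton like this is speed: the client turns knobs on live data instead of waiting for the next round of static charts.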

The code was pretty straightforward. The input data looks just like you'd expect for a time series. There's a timestamp and a measure. We iterated over the different significance values. We also defaulted to using one year of data prior to the date range that the client is actually interested in. So if I care about May of 2020, let's start with data from May 1st of 2019 to get some of that seasonality, and more data is better for this type of algorithm.
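A hedged sketch of that setup (the sensor_data frame and its column names are invented for illustration; ad_ts() is from the AnomalyDetection fork mentioned earlier):

```r
library(AnomalyDetection)
library(dplyr)

# pull one year of history before the month the client cares about,
# so the algorithm can see the seasonality
focus_start <- as.POSIXct("2020-05-01", tz = "UTC")
history <- sensor_data %>%
  filter(timestamp >= focus_start - 365 * 86400,
         timestamp <  focus_start + 31 * 86400) %>%
  select(timestamp, measure)

# sweep a few significance levels and compare how many points get flagged
results <- lapply(c(0.10, 0.05, 0.01), function(a) {
  anoms <- ad_ts(history, max_anoms = 0.02, direction = "both", alpha = a)
  data.frame(alpha = a, n_anomalies = nrow(anoms))
})
do.call(rbind, results)
```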

But throughout the project, they informed us, hey, we actually have these limits that we already know about. You could consider them like a prior in Bayesian terms: an upper limit and a lower limit beyond which a reading is automatically bad. This seemed like it could guide us a little more, and we could probably integrate it. We had started with a completely data-driven approach, agnostic of any business logic, but then we said, well, maybe we can incorporate these limits. Even giving them a simple tool like this to explore those different outliers was something they didn't have; they had really been doing all of this manually and by hand.

So then we combined those rules with the anomaly detection. Over the course of a few months we got better and better at finding these parameters. So now, in the colors you see here, the green is just an anomaly that the algorithm found based on the parameters we had already tweaked. The blue means it's approaching that threshold limit, and these vertical lines are those thresholds, which vary by analyte. And then the red is both. We decided those are probably the most important to look at because there's strong signal there: not only are you approaching a limit, but the anomaly detection is saying this is trending differently based on the piecewise median.
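That red/blue/green triage can be expressed as a simple classification pass; here's a sketch where the readings frame, its is_anomaly flag, and the near_limit rule are all invented stand-ins for the project's own data:

```r
library(dplyr)

# is_anomaly: the flag returned by the detection step
# near_limit: the client's "approaching the threshold" business rule
labeled <- readings %>%
  mutate(status = case_when(
    is_anomaly & near_limit ~ "red",    # strong signal: both sources agree
    near_limit              ~ "blue",   # approaching a known limit only
    is_anomaly              ~ "green",  # algorithmic anomaly only
    TRUE                    ~ "normal"
  ))
```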

The other thing we did was allow the user to specify how stringent they wanted this approaching-limit check to be. So here you can see it's not that close to the line, but we were able to define thresholds in case they really wanted to get close. The way we did that was based on quartiles: the closer a reading was to the top quartile of that range, the more stringent the flag, and we let them open up the lens, so to speak, to be looser or more stringent. This ended up being the best approach for the client: not only the algorithm, but incorporating the business logic as well.
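One way to express that quartile-based "approaching the limit" check (my own illustration of the idea, with made-up function and argument names):

```r
# stringency = 0.75 flags the top quartile of the range between the limits;
# pushing it toward 1 narrows the lens, lowering it opens it up
approaching_limit <- function(x, lower, upper, stringency = 0.75) {
  cutoff_hi <- lower + stringency * (upper - lower)
  cutoff_lo <- upper - stringency * (upper - lower)
  x >= cutoff_hi | x <= cutoff_lo
}

approaching_limit(c(5, 9.6, 0.2), lower = 0, upper = 10)
# FALSE TRUE TRUE
```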


So I want to jump quickly into the next example so that we have time for questions. But to wrap up this one: they're using it on a day-to-day basis. It runs as a Shiny app, they're able to connect live to their database, and it's been used as a tool to direct their investigations and identify these hardware failures before the machines get shipped out to the hospitals. So we think that's a really cool application, and it's R and Shiny, so the price tag is pretty good.

Use case 2: sports and entertainment news anomalies

So the next real-world example is another client that works in the sports and entertainment business. They actually do market research around sports technology, things like in-venue betting, understanding how to engage fans in this new technical age, biometric sensors on athletes, you name it, basically any sports technology. I like to call them kind of a market research company for sports tech. What they have is a whole bunch of news articles that they collect via Feedly and DiffBot, which are third-party tools, and Leo, which is, I believe, part of Feedly and analyzes all the information for you. It's kind of like a smart filter to bring in the information they care about.

Okay, so the data is thousands of sports news articles. There's also entity tagging that this Leo tool does. It tags sports teams, technologies, vendors, meaning vendors like Google, Microsoft, Nike, technologies being anything from blockchain to augmented reality. And so what we thought here was they wanted interesting news stories to be surfaced to them. So we simply took the number of mentions of these various entities over time. You could imagine Boston Red Sox are mentioned 10 times a week. Suddenly they're mentioned 30. So we thought, okay, let's investigate why.
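The mentions-over-time series feeding the anomaly detection can be built with a simple count; in this sketch, the articles frame and its columns are invented stand-ins for the Leo/DiffBot-tagged data:

```r
library(dplyr)

weekly_mentions <- articles %>%
  filter(entity == "Boston Red Sox") %>%
  count(week = as.Date(cut(published_at, breaks = "week")),
        name = "mentions")

# weekly_mentions is now one row per week: the time series shape that
# an ad_ts()-style detector expects as input
```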

So here's what the first iteration looks like. We have a date range, and they can filter by league if they'd like; we're looking at all leagues here. And we have a list of anomalies that our package bubbled up to the top. The Seattle Sounders were detected here. We click that, and it gives us a time series look at that particular anomaly. We can then click it and discover more about what's happening. So on that date, Shiny pops up this nice little module: the Seattle Sounders and Amazon worked together. Amazon had acquired streaming rights for Prime Video soccer matches, and that's a huge thing this company would like to know about.

So from there, you could see how you can rapid prototype, and it might be manual at first as you surface some anomalies. But you could imagine how this could be automated into a report: give me the five most interesting stories that week, give me the teams that are spiking in the news, and so on.

Another cool use case for them was that it doubled as a QC tool. Certain things would be spiking that shouldn't be. For example, the Ottawa Senators, an NHL team, weren't doing particularly well, and there was really no reason for a spike; it wasn't the playoffs or anything. So we decided to dig into it, and it turned out that DiffBot was tagging OTT for the Ottawa Senators, because that's their abbreviation, but it also stands for over-the-top streaming services. So we had identified a problem with the tagging. We alerted DiffBot, and we got that fixed in the data itself. So it can also be used as a data QC tool, not only to surface interesting things but potentially problematic ones, as we also saw in the first example of finding hardware failures.

Final thoughts

And I'm leaving, I think, at least four minutes for questions, but final thoughts. Again, working with people is the hardest thing. We're lucky that usually people we've worked with so far understand the problem, and they're very collaborative. And I think being able to show them these results piecemeal is what built up the confidence and their understanding of the tool and why it matters. Giving them the why or allowing them to dig in to the why and explore more on their own is that sort of self-service and freedom of analysis that they need to really feel like they're engaged.


So Shiny's been just incredible for that, because I remember the days where you just had to show them static charts and put together a PowerPoint, and it's very much like I'm just talking to you rather than working together.

So anomaly detection is a useful data-driven tool when used appropriately. In our cases, we identified specific needs for each scenario and tailored the solutions accordingly. The first use case was proactively identifying hardware failures, where we learned over time that incorporating the client's threshold limits would give us more accurate results than relying on the data alone. And the other use case surfaced interesting news stories and doubled as a data QC tool.

Always work closely with your end users via that iterative process. Shiny, again, is fantastic for this. I always say build fast and fail quickly. I'd rather show you something that you think is terrible and you tear it apart. And getting that constructive criticism, to me, is a good sign. It means you're invested. It means you have ideas. And it means that we can improve upon it very fast.

And then don't leave out important business logic where it can help you. We don't always have to just rely on statistical models and assume it's going to solve all our problems. There's oftentimes that we can incorporate that domain knowledge and it can really improve our results. So always be looking for those pieces of logic that can help your automation or your math or your R package. And questions, feel free again to contact me here. I look forward to hopefully being there next year and enjoy the rest of the conference.