
Colin Gillespie | How to win an AI Hackathon, without using AI | RStudio (2020)
Anyone reading a newspaper or listening to the news is led to believe that AI is the solution to all problems. From self-driving cars to detecting disease to catching fraud, there doesn’t seem to be a situation that AI can’t tackle. Once “big data” is thrown into the mix, the AI solution is all but certain. But is AI always needed? Over the last eighteen months, Jumping Rivers has entered (and won) four Hackathons. All Hackathons were characterised by “big data” and the need to improve prediction. All Hackathons were won without using AI (or any sort of machine learning). This talk will focus on one particular competition around reducing leakage at Northumbrian Water. Using a combination of R, Shiny, and Tidyverse (and a few other tricks), we were able to demonstrate within the short Hackathon time frame that clear presentation of data to the front line engineers was more likely to reduce leakage than simply providing vague estimates of a potential future leak.
Transcript
This transcript was generated automatically and may contain errors.
hackathons, without using AI hacking. I will explain the basics of the data. Spatial data: Northumbrian Water has two sites in the UK, one by the border in Northumbria in England and one down in the south.
We have spatial data on that. We had time series. We had data on 15-minute intervals, we had data on that. We had leakage estimates, so how much water was actually lost due to leakage. We had flow rates.
So in the UK, you can imagine how the water sort of authority, the water companies have evolved over the last few decades. And we have something called a DMA. It's a collection of zones of postcodes, some are quite big, some are quite small, and the flow rate is just how much water is passing through, how much water is being used by people.
We have the reported leaks; it's basically an Excel spreadsheet with some dates, times, postcodes and sort of what the person said over the phone. We have pipe size, so that was measurements in metric, so centimetres, millimetres, Imperial UK, Imperial US, Imperial Roman, you name it. We have pipe location. That was more an indication rather than an exact measurement.
Again, pipes have been laid over the UK for decades, so they knew there was a pipe on this road somewhere on this road and they just hoped for the best. We've got pipe type, so you've got plastic, you've got metal. Turns out there's also asbestos, which, to be honest, worried me somewhat, but asbestos, we were told, was fine as long as no one drills into it, so we're all safe.
We've got social data, that's basically saying who's living in the area: is it students, is it families, is it business, what sort of mixture. We've got weather: rain, sun (not so often), but how much rain we've got. We've got business: what sort of businesses are nearby and what their usage is. And then we've got stuff. What I'm guessing is someone had a week to put this data together. The first four days, they got all the stuff on the right, and then they spent the last day just putting in other stuff, and I can't remember what the other stuff was, but it's just general stuff, so it might be interesting. So this is what we had.
The machine learning framing
And the problem set up that we were given is a very standard, I would say, machine learning framework. So we've got all this data, all the data that you would think would give you an insight to leakage. And we've got all the previous results, so we've got leakage from the past of two, three, five years. We've got covariates or predictors that we think would be useful, so weather, businesses, pipe type, pipe size.
Intuitively, you'd think if the pipe was older, then it might break. Intuitively, you'd think if a pipe is made of lead, then that's more likely to have a leak than a pipe made of plastic. And so we're led up to this very classic machine learning problem of lots of data, now put it into your computer, turn the handle, and try and predict what's going to happen.
And in my head, I had this amazing algorithm that would tell someone the precise coordinates of where the next leak was going to happen, and engineers would rush out and fix the leak before anything would happen. That's what I thought would happen.
Day one: coding vs. coffee
So this is me, approximately like this, and I worked hard all day. So this was a Wednesday, and I started coding about quarter past nine on a Wednesday. So I just got the data, and I opened up the folder, and I looked at it, and I automatically went to work. So I was a dplyr machine. Hadley would have been proud; this was amazing.
It turns out that there are multiple ways of spelling DMA. You'd think there's only one way of spelling DMA, but they spell DMA multiple ways, so there's lots of left joins, right joins, every join you can imagine. I even did some AI, or regression, as I like to call it. So I did some regression, I even did some boosty, baggy tree things to try and predict what was happening. I was working hard all day. I was doing what I was told.
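The inconsistent DMA spellings are the kind of thing that quietly breaks a join. As a minimal sketch (the data, column names and the `clean_dma()` helper are all hypothetical, not taken from the hackathon files), one way to handle it is to normalise the key on both sides before joining, then use an anti-join to check that nothing was left unmatched:

```r
library(dplyr)

# Hypothetical data: the same DMA written three different ways.
flows <- tibble(
  dma  = c("DMA-001", "dma 001", " Dma_001 ", "DMA-002"),
  flow = c(10.2, 9.8, 11.1, 4.0)
)
pipes <- tibble(
  dma       = c("DMA001", "DMA002"),
  pipe_type = c("plastic", "metal")
)

# Hypothetical helper: upper-case the key and drop everything
# that is not a letter or a digit.
clean_dma <- function(x) gsub("[^A-Z0-9]", "", toupper(x))

flows_clean <- flows %>% mutate(dma = clean_dma(dma))
pipes_clean <- pipes %>% mutate(dma = clean_dma(dma))

# Every flow reading should now pick up a pipe type.
joined <- left_join(flows_clean, pipes_clean, by = "dma")

# Rows of flows_clean with no partner in pipes_clean; ideally empty.
unmatched <- anti_join(flows_clean, pipes_clean, by = "dma")
```

Checking `unmatched` after every join is a cheap way to find the next spelling variant before it turns into missing values further down the pipeline.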
This is Seb, an exact likeness, and Seb had a different strategy. So we were on the same team. So Seb is also in the booth at Jumping Rivers at San Francisco, so please go and embarrass him about this. So Seb spent all day drinking coffee. And Seb ate Hobnobs as well. So if you don't know what a Hobnob is, it's a top class biscuit made in the UK.
And Seb asked questions. So in this room there were 10 teams, and then there were a bunch of tables where people from Northumbrian Water were there ready to answer questions. And there were some people who were engineers, there were some people who were sort of in charge of the analytics, some people who were in charge of customer service, just, you know, people in tables who were waiting for questions. So Seb would walk around drinking his coffee, eating his Hobnobs and having a nice jolly old time while I was working hard.
And I have to be honest, okay? So I was doing all the work, and I wanted to eat Hobnobs, and I wasn't getting very many Hobnobs. I was drinking lots of coffee, and if I'm being truthful, I sort of started Jumping Rivers to avoid doing work, you know, that's what founding a company is, it's about skiving from work. And, well, I seemed to be doing lots of work, and Seb seemed to be wandering around doing no work whatsoever.
The real problem: reducing engineering time
Do you remember back at the start, the goal of the hackathon was to reduce leaks by 20%. And that's sort of an obvious thing, you know, who wouldn't want to reduce leaks by 20%? But a more obvious question, or a less obvious question is, why, you know, why 20%? You know, why reduce leaks? You know, is it a big problem?
And, after chatting with a few people, Seb found out, and essentially they get charged a lot. Okay? So, they're wanting to reduce leaks because they're going to be charged an awful lot of money. And next year they're going to be charged even more money, and the year after that they're going to be charged even more money again. So each year they're going to get charged more and more money for having leakage.
So, the goal wasn't to reduce leaks by 20%, the goal was really to reduce the amount lost due to leakage. And they're both similar ideas, but if at the end of the year you had 10,000 leaks but lost very little water, then that solves the second one but not the first one. And that's what they're wanting, they're wanting to reduce the amount lost due to leakage. They didn't really care about leaks, it was just the overall amount.
Second one is, how do you estimate leakage? Now, in the UK, you pick up a newspaper and you read something like the amount lost due to leakage was 5,323 litres and 3 teaspoons. And you think, they've got that measured, that's accurate. Lots of decimal places, you know exactly what's going on. This isn't made up.
Well, what happens in practice is you have a DMA, so a DMA is just a collection of zones, so postcodes. Again, some are quite big. And the thought process goes, well, we know how much water is going into the DMA, we know how much water is going out, that's how much water is being used. And, well, no one would use water at 3 o'clock in the morning. So any water being used at 3 o'clock in the morning must be a leak.
Which isn't the most precise of measurements when you think about it over the course of two seconds. And Northumbrian Water know this. They know this. It's not a surprise; they know this measurement is, shall we say, noisy or imprecise. There's nothing they can do about it. And we're currently working with another company to actually measure soil temperature and do a bit of prediction. But what we have on that day is that's the column that says leakage. It's the minimum flow rate at 3 o'clock in the morning. Would you believe that on New Year's Day the leakage goes through the roof? It turns out people are up at 3 o'clock in the morning. So that's what they have.
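To make the night-flow idea concrete, here is a rough sketch of how such a leakage column could be derived from 15-minute flow readings. Everything in it is an assumption for illustration: the readings are invented, the units are arbitrary, and "3 o'clock in the morning" is taken to mean the 03:00 to 03:59 window. It is not Northumbrian Water's actual calculation.

```r
library(dplyr)

# Invented 15-minute readings for one DMA over two days (96 per day).
times <- seq(as.POSIXct("2020-01-01 00:00", tz = "UTC"),
             by = "15 min", length.out = 2 * 96)

readings <- tibble(time = times) %>%
  mutate(
    date = as.Date(time, tz = "UTC"),
    hour = as.integer(format(time, "%H", tz = "UTC")),
    # A constant background flow of 5 (the "leak"), plus daytime usage,
    # plus some extra genuine usage in the early hours of New Year's Day.
    flow = 5 +
      ifelse(hour >= 6 & hour < 23, 10, 0) +
      ifelse(date == as.Date("2020-01-01") & hour < 6, 3, 0)
  )

# Leakage estimate: the minimum flow seen during the 3 a.m. hour each day.
night_flow <- readings %>%
  filter(hour == 3) %>%
  group_by(date) %>%
  summarise(leakage_estimate = min(flow))
```

In this toy data the underlying leak is identical on both days, yet the estimate comes out at 8 on New Year's Day and 5 the day after, because real usage at 3 a.m. is counted as leakage. That is the noise described above.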
What do you need to fix leaks? Well, you need an engineer and you need data. So the engineer is obvious. You need someone to go out there, probably in a van, with some tools to fix a leak. What data do we need? Well, when an engineer goes out, there's a leak at this region. Well, they want to know what's the flow rate, how much water is being used? What's the pipe type? Is it steel? Is it plastic? Is it asbestos? What's the historical leakage? Was there someone in the team that went there last week and perhaps it's the same thing again? What's the weather? Is it sunny? Is it raining? They need to be able to predict this and it could cause problems. Traffic? Allegedly, they're not allowed to just dig up roads when they feel like it. So if it's a main road, they have to schedule that at night time. Crime? Do they need to bring a friend to watch the van when they're going looking for leaks?
And this takes them two hours per day to get this information. So every single day, an engineer walks in and they say there's a leak at this area and they spend two hours getting all this information. All this information that we've just been given for the hack. There are 35 engineers. Engineers aren't cheap, and currently, every morning, 35 engineers each spend two hours trying to get all this data together.
So our solution is not to try and predict leaks. When I say our, this was Seb's solution to be perfectly honest but I'll go for our now. Our solution is not to try and predict where leaks are going to happen, but to reduce engineering time. We can get that two hours down to about 10 or 15 minutes. It's just giving them the data that they need. And this is relatively easy and it's definitely doable. There's not a sort of R squared or predictive value or anything like that. It's just about, they have data in one place, we're wanting to make it accessible in another. So this we can definitely do.
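As a back-of-the-envelope check on what that saving is worth, the figures from the talk (35 engineers, two hours a day now, 10 to 15 minutes with the app) can be plugged into a few lines of R. The assumption that every engineer does the lookup every working day is ours, and no cost per hour is given in the talk, so this stays in engineer-hours:

```r
engineers     <- 35         # from the talk
hours_before  <- 2          # hours per engineer per day gathering data
minutes_after <- c(10, 15)  # target range with the data in one place

# Assumption: every engineer does the lookup every working day.
daily_hours_before <- engineers * hours_before        # 70 engineer-hours
daily_hours_after  <- engineers * minutes_after / 60  # about 5.8 to 8.75
daily_hours_saved  <- daily_hours_before - daily_hours_after

round(daily_hours_saved, 1)
```

Under that assumption the app frees up a little over 60 engineer-hours every day, with no prediction model involved.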
Building the Shiny app in half a day
Now, we're now in day two of the hack. And on the second half of the day we have to do a presentation of what we've built. So we've got half a day to do an app, and there's myself and Seb. So what we need to do is we need to decide what we're going to make and decide what to bluff, or lie, or pretend, or whatever you want to call it.
So for example, if we can demo a single DMA, we don't need to show two, three, four, five. We can do one, we don't need to do more. If we can display pipe location, we don't need to tell them about pipe type, that's obvious. We must be able to do it. So we've decided what to make and what to bluff.
So this is a screenshot of the winning app. This first screenshot contains lots of bluffing. So for example, in the top left-hand corner, there's a little person that indicates we've got authentication. We can do authentication, we do authentication for our customers. We did not do authentication on that half a day. We've got some high-priority DMAs all bluffed. We can do that. We didn't need to do that in a day.
Here we've got a screenshot of what an engineer could possibly see. So this was a long column of different graphs and different data. The first couple of graphs were nice and busy and interactive. The last few graphs were a bit more static in nature. And if you look at that little blue slider, when that was clicked on, the graph was going grey, and that was indicating that it wasn't going to be part of an overall report the engineer would take with them. Also an idea about taking a picture and then being able to upload it onto the app. Again, we didn't do this. It was just an indication. And this was all done using lots of dplyr, ggplot, Shiny app, and all done in half a day.
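For a sense of how little code a demo like this needs, below is a minimal Shiny sketch in the same spirit. It is not the winning app: the data is invented, only one DMA is offered (as in the demo), and a plain checkbox stands in for the blue slider that greyed out a graph when it was excluded from the engineer's report. All input and output IDs are made up for the example.

```r
library(shiny)
library(ggplot2)

# Invented hourly flow profile for a single demo DMA.
flow_data <- data.frame(
  hour = 0:23,
  flow = c(5, 5, 5, 5, 5, 6, 12, 15, 14, 13, 12, 12,
           13, 12, 12, 13, 14, 16, 17, 15, 12, 9, 7, 5)
)

ui <- fluidPage(
  titlePanel("DMA overview (sketch)"),
  selectInput("dma", "DMA", choices = "DMA001"),
  checkboxInput("include_flow", "Include flow plot in report", value = TRUE),
  plotOutput("flow_plot")
)

server <- function(input, output, session) {
  output$flow_plot <- renderPlot({
    # Grey the line out when the plot is excluded from the report.
    line_colour <- if (input$include_flow) "steelblue" else "grey70"
    ggplot(flow_data, aes(hour, flow)) +
      geom_line(colour = line_colour) +
      labs(title = paste("Flow rate for", input$dma),
           x = "Hour of day", y = "Flow")
  })
}

# Build the app object; run it interactively with runApp(app).
app <- shinyApp(ui, server)
```

The structure scales in the obvious way: each extra piece of information the engineer needs (pipe type, historical leaks, weather) becomes another output in the column, with its own include-in-report toggle.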
So the app has now been used in production. I'm the second person on the right, the tall baldy chap. And I would love to show you screenshots, but because it's got national infrastructure data, it's got locations of pipes, we can't just show screenshots easily. So that was when we won it two years ago. We also won it this year as well. So sorry for bragging, but we won this year's hackathon as well. And we also won that one as well. So I'm not really sorry at all about bragging. I'm now just bragging completely.
Summary and lessons learned
It's obvious, but think about your problem. And if someone told me before the hack, I would say, of course you'll think about your problem. But I really didn't. I jumped in with both feet. I'm good at dplyr, I'm good at R, I'm good at building models, and I went straight for that. And that's not the best way. Talk to people, ask them what the problem really is.
Also, don't send me to a hackathon without others. So the one hackathon that we've lost is the only one that I've been to by myself. So that's somewhat embarrassing. So I'm now in this strange situation where if the team go on a hack, I'm not sure if I want them to win or lose, because, yeah.
Anyway, so thank you very much for listening. Sorry I couldn't join you in San Francisco. It's really painful watching Twitter and just seeing all this. Isn't our conference absolutely wonderful? So it's painful over in Newcastle. And by all means, please come over and say hi, not to me, but to Seb and the others at the stall, and thank you very much for listening. Thanks.
Thank you so much, Colin. That was great. Do you have any advice on where to find some hackathons in North America?
Not so many. Not so much. Typically, "I don't know" would be my answer. I think you've caught me with that question. I suspect user groups would know, so if you're part of an R user group or a data science user group, typically any hackathons would approach the user group first of all for a bit of advertising. So that's where I think you'd find them. And once you sort of stumble into that field, you then find them everywhere.
Awesome. Well, thank you so much for joining us from the UK. We really appreciate it. So this concludes this track. I believe lunch is going to be served somewhere out there around 1pm. So thank you so much.
