
Caitlin Hudon | Learning from eight years of data science mistakes | RStudio (2019)
Over the past eight years of doing data science, I've made plenty of mistakes, and I'd love to share them with you -- including what I've learned and what I'd do differently with some hindsight. This talk will cover mistakes made during analyses (including communication when delivering results), team and infrastructure mistakes, plus some advice for incoming data scientists.

About the Author

Caitlin Hudon

My name is Caitlin Hudon and I am lead data scientist at OnlineMedEd, a startup in Austin. I have about eight years of experience doing data science-y things in a variety of industries including IoT, marketing, higher education, non-profits, and start-ups. I am also the co-founder of R-Ladies Austin, founder of the 'ALL the Ladies in Tech' quarterly happy hour here in ATX, and a member of the Fall 2017 NASA Datanaut class. Outside of data, I love tacos (especially trading taco spots in Austin), traveling (I've been to all 50 states and am working on the continents), the Cubs (including Will Ferrell's Harry Caray impressions), and live music.
Transcript
This transcript was generated automatically and may contain errors.
Hi, I'm Caitlin Hudon. I'm here to talk about eight years of making data science mistakes.
So just a quick about me, I am lead data scientist at OnlineMedEd. That's a startup here in Austin. I'm a co-organizer of R-Ladies Austin, really big fan of that community, also into traveling, hiking, tacos. I blog at CaitlinHudon.com. I tweet at @beeonaposy. And I also have a hashtag, #DSLearnings, where I track some of the things that I've been learning and try to share that with the community.
So before we get into all the mistakes I've made, let's talk a little bit about why we should share our mistakes. One thing that's really important to me is transparency. I think if we were all a little bit more transparent, it would make our community better. I think it would help fend off a little bit of imposter syndrome if we admitted things that we don't know alongside things that we do. I also think it's helpful for newcomers to see that other people have faced the same issues that they're seeing. And it helps other people: I've learned a ton from other people's mistakes when I'm implementing new products, that sort of thing.
Also growth. So I am the sum of all of the mistakes that I've made up to this point. Mistakes count as experience. It's how you gain experience. It's how you become the analyst or the person that you eventually are.
I'm going to talk about two types of mistakes mainly today. So one is technical mistakes. The other is communication mistakes. I think that people tend to make a lot more technical mistakes in the beginning. And over time, you maybe make fewer communication mistakes, but I think that fundamentally, communication is harder than tech. So we'll cover a little bit of both of these. And we'll cover four sort of mistake zones.
Mistake zone one: building the first model
So we're going to start at the beginning of my career. And I was working as a statistical analyst. It was my job to build predictive models for nonprofits, higher education, small businesses, that sort of thing. And I was the first hire at this company. And so before me, the CEO had built all of the models. So no pressure there.
So this is me when I built my first model. I was super pumped about it, very confident. I thought I was good to go. I had set up a meeting to present it to a client. This was after a lot of training. I didn't start off building models. I did a lot of training first. So I had to sit down with my CEO so that he could review my work. I was ready to go. I thought it was good. Completely bombed.
And so he found a couple of issues with the model that I built. I was trying to predict enrollment. So based on students who had been accepted at a college, I was going to predict like who would actually accept that offer and show up in the fall.
In doing so, I had included two problematic variables in my model. One was deposit. So if you've ever gone to a college, before you can do that, you have to make a deposit. That's pretty much a proxy for enrollment. They have a thing called summer melt, so it's not exactly the same. But it's highly problematic to have that in a model.
The other thing was a variable called campus visit. So any training data that we have, we get after the fact. And so we didn't have enough information to actually use this variable. We had campus visit but no date. Some people visit campus in the fall while they're looking at schools. Some people go to like an accepted student's day. And we had no way of knowing whether these people fell into the former category or the latter.
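One crude way to catch this kind of leakage before a model ships is to screen for features that agree with the target suspiciously often. Here's a minimal sketch; the field names and the admissions rows are made up for illustration:

```python
# Hypothetical admissions rows; field names are made up for illustration.
rows = [
    {"deposit_paid": 1, "campus_visit": 1, "enrolled": 1},
    {"deposit_paid": 1, "campus_visit": 0, "enrolled": 1},
    {"deposit_paid": 0, "campus_visit": 1, "enrolled": 0},
    {"deposit_paid": 1, "campus_visit": 1, "enrolled": 1},
    {"deposit_paid": 0, "campus_visit": 0, "enrolled": 0},
    {"deposit_paid": 0, "campus_visit": 1, "enrolled": 0},
]

def leakage_suspects(rows, target, threshold=0.95):
    """Flag binary features that agree with the target suspiciously often."""
    features = [k for k in rows[0] if k != target]
    suspects = []
    for feat in features:
        agreement = sum(r[feat] == r[target] for r in rows) / len(rows)
        if agreement > threshold:
            suspects.append(feat)
    return suspects

print(leakage_suspects(rows, "enrolled"))  # ['deposit_paid']
```

A check like this only flags the blatant proxies; variables like campus visit, which leak through timing rather than values, still take domain knowledge to spot.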
So those were two variables that I couldn't use. I spent about four years at this company building models. And during that time I made a couple of mistakes. Those mistakes became things that I knew to look for when I was building models. So I worked in the same areas. And those models started to look similar. I started to get really good at building those types of models, looking for those types of mistakes.
So my DS learning is to look for the usual suspects. So as you start modeling in a particular domain, you'll start to see the same problems crop up over and over again in your data. For me, a lot of that was like understanding the data source and data collection. I think that's really important. The assumptions that we're making, any filters we're adding, missing values, and then understanding the goals of the model, when it was supposed to be implemented, who it was supposed to impact, that sort of thing.
Mistake zone two: communicating with devs
So we're going to fast forward to mistake zone number two. This is communicating with devs. So I moved to Austin, started a new role. And as part of that role, we were working on this really cool feature in a product that was marketed to dentists, dental offices. So when you cancel an appointment, there is a hole in the calendar. And we wanted to be able to fill that hole with the person who would be most likely to take a last minute appointment. I was doing R&D for that algorithm. So I was looking at patterns in appointment history. Like, do people tend to always make their appointments in the morning or on the same day of the week? And so we would ask one question that would beget more questions. And we went down the rabbit hole.
One thing that we found is after we did our first analysis, we realized that there was an issue with time zones. Nobody had noticed it until we were having a conversation about the data a little while later.
And so we did sort of a postmortem. And we figured out there were three reasons for that. One is that most of the data that I worked with wasn't time specific. It wasn't, like, minute and second specific. It was more day specific. So I think everyone has to get bit by time zones once, and then you know. Another was the way that we set up our analysis hid some errors. So we decided to look at a.m. and p.m. appointments. We didn't count the number that were happening per hour. If we did, we probably would have noticed this a lot sooner. The other issue was that we had a lack of shared resources explaining what the data was. So any time I had a question, I had to bug a dev, ask them for their explanation. I got an oral history of data, which is never good. And as people left the company, they took some of that institutional knowledge with them.
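The aggregation lesson shows up in a small sketch: bucketing timestamps only into a.m./p.m. (or by day) masks a time zone bug, while an hour-level histogram makes it jump out. The timestamps and the fixed UTC-6 offset below are hypothetical:

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical appointment timestamps stored in UTC for a clinic in US
# Central time (a fixed UTC-6 offset here, ignoring DST for the sketch).
utc_appointments = [
    datetime(2019, 3, 4, 15, 0),   # 9:00 local
    datetime(2019, 3, 4, 16, 30),  # 10:30 local
    datetime(2019, 3, 4, 20, 0),   # 14:00 local
    datetime(2019, 3, 4, 23, 0),   # 17:00 local
]

# Bucketing only into AM/PM on the raw UTC times hides the shift...
am_pm = Counter("AM" if t.hour < 12 else "PM" for t in utc_appointments)

# ...but an hour-level histogram makes "appointments at 11 p.m." jump out.
by_hour = Counter(t.hour for t in utc_appointments)

local = [t - timedelta(hours=6) for t in utc_appointments]
by_hour_local = Counter(t.hour for t in local)

print(sorted(by_hour))        # [15, 16, 20, 23] -> suspicious late hours
print(sorted(by_hour_local))  # [9, 10, 14, 17]  -> plausible clinic hours
```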
Eventually, we got the fill-in model to a good place, and we were ready to start A/B testing it. And so this was the first time that these devs had worked on A/B testing. So we were doing a lot of communication. We were getting everything set up. We were really excited. And when we got the results in and we started to talk about what we'd done, we realized that we had run an A/A test.
So we were trying to test different versions of the algorithm with our, like, high users, and we actually had applied the new algorithm to everyone who was a high user, and all of the people who didn't really use it very much had the old algorithm. So that didn't work. We had to discuss it. We had to go back. It was kind of a waste of time and resources. Eventually, we were able to do kind of a pre-post, so we sort of saved it, but it was a big learning experience about working with devs for me.
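A cheap sanity check that would have caught this before launch is to compare arm assignment across user segments: in a real A/B split, each segment should land in each arm at roughly the same rate. A minimal sketch with made-up users and segment names:

```python
import random

# Hypothetical users: half "high usage", half "low usage".
random.seed(0)
users = [{"id": i, "high_usage": i < 500} for i in range(1000)]

# The broken setup: every high-usage user gets the new algorithm.
broken = {u["id"]: ("new" if u["high_usage"] else "old") for u in users}

# A real A/B split randomizes assignment regardless of segment.
randomized = {u["id"]: random.choice(["new", "old"]) for u in users}

def share_new(assignment, users, high):
    """Share of one usage segment that landed in the 'new' arm."""
    seg = [u for u in users if u["high_usage"] == high]
    return sum(assignment[u["id"]] == "new" for u in seg) / len(seg)

# Sanity check before launch: the arms should look alike across segments.
print(share_new(broken, users, True), share_new(broken, users, False))  # 1.0 0.0
print(share_new(randomized, users, True))  # close to 0.5, as it should be
```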
So a few things there. Being willing to teach and to learn, and doing so very respectfully. I've worked in jobs where people have worked with a lot of data scientists. I've worked in jobs where people have never worked with a data scientist before, and so you really have to work with people and find a middle ground. Another piece was just don't play telephone. So some of the people who were actually implementing the A.B. test weren't in the important meetings up front. They were just asked to sort of do the work. They didn't have the full understanding, and that was borne out in the results. Another is avoiding jargon for clarity. So I've found that computer scientists tend to use different jargon. Sometimes they'll say things like data models. That means something very different than what a model means to me. And so making sure that we're being very clear about what we mean and we're actually speaking the same language has been really important. The other thing is just keeping in mind that we're all on the same team. We all have the same goals. We're all working towards the same thing.
Mistake zone three: communicating with business stakeholders
So the third mistake zone, communicating with business stakeholders. The next thing that I was tasked to do was to figure out the impact of this new feature. So when we rolled out that feature, we expected it to have an impact on sales. We thought that we would be able to track how many more sales we made after we'd rolled out the feature. It would be a really easy analysis. We were really excited about it. In talking with the team, we realized the way we would tell the difference was the sales script; the script changes when a new feature gets rolled out. But we had rolled out three features at the same time and updated the script for all of them. So we couldn't actually attribute any growth to a single feature.
So we went on a data adventure and I started talking to different groups of the business to see what data they had that we could use to try to prove that the work that we'd done had some value. When I did that, I encountered a lot of artisanal data, a lot of handcrafted, very lovingly tended-to spreadsheets, which were great, but they meant that this analysis was not reproducible. At this point, the business teams were familiar with the types of reproducible analyses we could do. So we were running R Markdown reports for A/B tests, that sort of thing. They'd seen those outputs. They wanted that kind of output to look at every month and see what the impact had been. So this was problematic.
I thought a little bit about how to solve this. So what's the best way for me to communicate the value of this data? Some of the data was actually really valuable. Some of it was awful. It was things that I would never want people to make decisions on, but they had also asked for it. So a secret about me is I was also an English major. I really like English lit, language, that sort of thing. And this is something I picked up in those studies. So the rhetorical triangle is a way of framing an argument. And so it consists of three parts. The speaker, that's you, all the credentials, all the background that you bring along with you, also the way that people see you. That's really important. So having sort of some awareness of how people see analytics and you within an organization. The next part is the audience. So what is the audience like? Are they technical? Are they non-technical? Have they been informed of all of the ins and outs of this problem? Are they just there to make decisions as part of a larger meeting? Getting some idea of who you're talking to will help you to frame your argument. And then the final piece is the context. Are you delivering good news or bad news? The things that you're asking for, are they things that people can help with? Will they understand the kind of problems that you have? So putting all of this together has been really helpful for me whenever I'm talking to people who are more on the business side of things to figure out ways that I can help them and analyses can help them and we can work together.
What I came up with for this particular exercise was a set of metametrics. And so at the bottom of every slide I made sort of like a stoplight thing that told you if the data was good, if it was trustworthy, if it was a repeatable analysis. And they got this. They totally understood it. It was awesome. One of the things I found is, for example, we go to trade shows and the way that we were tracking trade show data was we were hoping that a rep would tell the manager and then that the manager would remember to write down that a sale was the result of this feature. That data was horrible. It was reds all the way across. But then they knew that. And so if we wanted to invest in that data, if we wanted to start analyzing, they started to understand why we needed to put some money and some time into collecting the data and doing so in a good way.
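A sketch of what such metametrics might look like in code. The flags, sources, and scoring rule here are hypothetical, not the actual slide deck:

```python
# A minimal sketch of "metametrics": quality flags attached to each data
# source, rendered as a stoplight on every slide. Names are made up.
def stoplight(source):
    """Red/yellow/green summary from three yes/no quality flags."""
    score = sum([source["trusted"], source["repeatable"], source["complete"]])
    return {3: "green", 2: "yellow"}.get(score, "red")

sources = {
    "billing_db":      {"trusted": True,  "repeatable": True,  "complete": True},
    "tradeshow_notes": {"trusted": False, "repeatable": False, "complete": False},
}

for name, flags in sources.items():
    print(name, stoplight(flags))
# billing_db green
# tradeshow_notes red
```

The point isn't the scoring rule itself; it's that the quality verdict travels with the number, so stakeholders see at a glance which figures are safe to act on.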
So a few learnings there. Getting stakeholders involved early is really helpful to get everyone on the same page as you're working through these projects and making sure that you understand the business problem as well. So one thing that's really bad and I have done before is solving the wrong business problem, or realizing at the end of an analysis that your model solves the wrong business problem. A sort of funny example of that is that at my first job, we were building a model to predict who would be a major giver, or rather, a bequest. That's usually when someone passes away and gives a large gift to the university or to a museum or that sort of thing. We actually figured out that our model was predicting death rather than the gift that would come after. And so that was really bad.
So translating between the data and the business problem is actually a really tricky thing. And so it's really helpful to talk to the business stakeholders to make sure that you're doing it right. Another piece is framing analysis in a way that makes sense. So I might be thinking of things in terms of what they look like in the database. Those things might have totally different colloquial names around the business. And so it's really important that I'm using the terminology that other people are using throughout the business.
Another thing that I personally am working on, and even after years of making mistakes, I still do this all the time. A lot of times you're in a meeting and people just want to know who was in your data set, how many people it was, and what they did. I get so wrapped up in the details of my analysis and all the things I want to tell them that I will often forget, like, oh yeah, it was 600 people that I looked at. And so I tend to write those things down. Inevitably, those always come up. So making sure that you know the basic facts is really helpful. Another thing is knowing where they get their data. So I've worked at companies where the BI system is different from the data that the analysts work with, like data analysts or devs, that sort of thing. That can be really problematic. So if you have people in the C-suite or stakeholders who are looking at reports coming out of a system that you don't use, your numbers are not going to match. And so understanding where they get their data, and how you can explain why your data might be different, is really important if you want to get everyone on the same page.
Mistake zone four: infrastructure and team
One last problem that we'll talk about is infrastructure and team. So one of the roles that I took on was to work on the algorithm. So we had a company where a large portion of the business ran on this single set of machine learning algorithms that were in production. The fact that we referred to it as "the algorithm" should tell you sort of everything you need to know about what the rest of the business understood about how this all worked. So we had a couple of data scientists leave at the same time. We were lucky enough that one of them sat down with us and spent a whole week teaching us how the algorithm worked, the ins and outs, that sort of thing. When it came time for us to pick up the algorithm and just continue to work with it, it was super scary. We did not have the documentation that we actually needed to be able to understand the algorithm inside out. So we had to spend a ton of time trying to reverse engineer the code, figure out where these things came from, and then explain it to other people.
So one of the things that we did is wrote up some pseudocode. And I've been doing this ever since for anything that's related to machine learning or that might be a little bit tougher for people to understand. So it's taking the ideas, like the inputs, the outputs, the business problem you're trying to solve, and putting them into plain English. And these documents can live close to the code or separately from the code. I've even written pseudocode documents for different audiences. So one for engineers who understand the databases I'm talking about, that sort of thing, and then one for business people who need to understand the basics of the algorithm, sort of what it's doing to be able to explain it in a business context.
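A pseudocode document of that kind might look something like this. This is a hypothetical reconstruction for illustration, not the actual fill-in algorithm from the talk:

```
ALGORITHM: appointment fill-in ranking (hypothetical example)
BUSINESS PROBLEM: when an appointment is cancelled, suggest the patients
  most likely to take the last-minute slot.
INPUTS:  patient appointment history (dates, times, cancellations)
OUTPUT:  ranked list of patients to contact about the open slot
STEPS:
  1. For each patient, summarize past behavior: preferred weekday,
     preferred time of day, how often they accepted short-notice slots.
  2. Score each patient's fit against the cancelled slot's weekday and time.
  3. Rank patients by score and return the top N.
```

A business-facing version would keep only the problem, inputs, and output; an engineering-facing version would add table names and edge cases.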
So some learnings around infrastructure and team. I can't stress documentation enough. I've written a little bit about data dictionaries and SQL query libraries. Those have been really helpful, especially for making sure that you're not duplicating work. If you are working with different queries to find the number of active clients, for example, that should be a really simple problem. It's amazing how many times I've seen this done differently by different people in the same organization. So doing things like code reviews can help you with that, making sure that everyone is on the same page. If they know the ins and outs of certain systems, that those are being communicated to the whole team.
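One way to keep a definition like "active clients" from drifting across the team is to put the canonical query in a shared library and reuse it everywhere. A minimal sketch using SQLite; the schema and the 30-day rule are hypothetical:

```python
import sqlite3

# A tiny sketch of a shared query library: one canonical definition of
# "active clients" that everyone reuses instead of rewriting it themselves.
ACTIVE_CLIENTS_SQL = """
    SELECT COUNT(DISTINCT client_id)
    FROM logins
    WHERE login_date >= DATE('now', '-30 days')
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (client_id INTEGER, login_date TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, DATE('now', ?))",
    [(1, "-1 days"), (1, "-5 days"), (2, "-10 days"), (3, "-90 days")],
)

active = conn.execute(ACTIVE_CLIENTS_SQL).fetchone()[0]
print(active)  # 2: clients 1 and 2; client 3 last logged in 90 days ago
```

With the definition living in one place, a code review only has to check that people import it, not re-derive the 30-day rule in every analysis.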
Another thing is core team meetings. So when I first started working at one particular job, our analysis team was a little bit siloed, and then we started working with more and more teams. One of the best things we ever did was get into these core team meetings. So it was business stakeholders. We would have, like, someone from marketing, a head of product, someone from engineering, someone from data, someone from customer service. All of those people, the heads of state of a single product, would come together once a week just for an hour over lunch and just kind of talk about the product, things that were coming out. But it was great, because we always had the most up-to-date information about what was going on, any issues that other teams were seeing, especially for data analysis. That context is key. It's really crucial to be able to do the right kinds of analysis. And then pseudocode. Pseudocode helps everyone. It's really helpful documentation to have.
Advice for aspiring data scientists
So I promised a little bit of advice. I have just one slide. So aspiring data scientists, learn SQL. Communication is a technical skill. It's every bit as hard as anything else that's technical. Start a blog. I think Dave Robinson's going to talk a little bit more about that later on, but I love that advice. Teach others. You learn concepts more deeply when you teach them yourself. Don't worry about learning everything at first. Find your community. Community is really helpful for helping to learn. Stay curious. Have fun. Don't be afraid to use Google. And remember that everybody is winging it. Thanks.
