Resources

Matt Dancho | Using R, the Tidyverse, H2O, and Shiny to reduce employee attrition | RStudio (2019)

An organization that loses 200 high-performing employees per year has a lost productivity cost of about $15M/year. This cost is massive, yet many organizations don’t know it exists. It doesn’t show up on a financial statement. Therefore, it goes unnoticed. This presentation showcases how several open source tools integrate to form a solution to the employee attrition problem. Specifically: (1) How the tidyverse enables problem identification through visualization. (2) How recipes + H2O can be combined to explain key relationships to attrition and predict employee attrition. (3) How Shiny can be used to create a powerful dashboard that empowers business leaders to make data-driven decisions across the organization.

Transcript

This transcript was generated automatically and may contain errors.

Oh man, this is exciting. This is my first RStudio conference, by the way, guys. Yeah. I'm pumped. I'm a big fan of RStudio, the company. I think you guys know why. It's probably the same reasons that you're here.

So today what we're going to be doing is talking about R for business. So that's kind of where I specialize. I marry up business with data science. My company is Business Science, so you can probably guess how I got the name. Business plus data science. Merge the two. You got it.

So what we're going to be talking about today is a lot of workflow. We're going to be talking about how to solve a specific business problem. It's called employee attrition, and it ends up being a huge problem for businesses. It's a big dollar figure, $15 million per year, and we'll talk a little bit about how I came up with that number. But more importantly, what we're going to be doing is we're going to be showing how the different tool sets, including the tidyverse, my favorite modeling package, H2O, and also Shiny combine to really help us solve this business problem.

So I'll be your host. Again, my name is Matt Dancho. I am the founder of Business Science, an educational company. I'm a lover of R. I've been using it for quite a while. And I've even contributed some open source packages as well, probably the most popular of which is tidyquant. And I do dabble in finance. But really, I'm an educator of data science. I specialize in teaching data science.

I both do onsite workshops at companies, and also what I've done is created an online platform called Business Science University, where students can come and take a range of courses that really up-level them, accelerating their careers. So oh, and one more thing, one of my special passions is converting business people to data scientists. So I just want you guys to know, if you are a business person in the audience, you don't need a Ph.D. to be a data scientist. Just throwing that out there.

The $15 million per year problem

All right, so agenda for today. We're going to be talking about a few different things. The $15 million per year problem, that's going to be our focus. We're focusing on a business problem. It's employee attrition, and we'll find out a little bit more about what that means. The second thing, and this is what I'm super excited about. I'm unveiling a new Shiny web app that we're going to be teaching in the 300 series course, part of the Data Science for Business program at Business Science University. So you guys are going to see it first, right here at RStudio.

And just one other thing about that app, just want to give credit to Kelly O'Briant. She's the one who developed it. So she's an RStudio employee, and I work with her quite a bit. Then we're going to talk about the internals of the app, the data science workflow, what powers this app, the tidyverse, we're going to talk about H2O and also another package called lime that I'm very excited about. And then finally, we're going to pull it all back together and talk about learning R and how you guys yourselves can figure out how to do all of this stuff that I'm teaching you.

So that $15 million per year problem. So you guys might recognize this gentleman's face, Bill Gates. He was once quoted as saying, you take away our top 20 employees and overnight, we, Microsoft, become a mediocre company. Let's dissect what he's saying there. Take away. He's talking about a concept called employee attrition, employee turnover. Top 20. He's talking about high performers.

This is the top 20 people in his company, and often what we find is the 80-20 rule applies. Top 20% of employees in a company tend to generate about 80% of the results. So you really want to do what you can to preserve and keep and retain those high performers. Otherwise, your company is going to become a mediocre company.

So let's dive into this a little bit further. I want to talk about this curve. It's called the economic value of an employee over time. It looks a little bit like this. So we'll see that there's four different boxes up here. The first box is the curve. So when an employee just starts at a company, they represent that green dot there right at the beginning. So time has not elapsed, and actually what is happening is that company is investing and actually losing money having you as part of that company. And that's an investment that they're making in you.

So they have to provide you training, mentorship, they have to integrate you, and it takes a while to do this before you can become a productive member of that company. Then eventually you make it to the second box, and this is what's called the break-even point. So as you gradually get to that point where you begin to start generating returns for that company. And they call that the return zone. So you've got the investment zone, you've got the return zone. And that process can take as little as three weeks for jobs that aren't overly difficult or for highly technical jobs. It can take upwards of a year or longer. So that's the type of investment that that company is making in you.

So what happens when a person decides to quit? That's what this third box represents. So that employee has become a productive member of the company. They're generating returns, doing really good work, and then all of a sudden they decide to quit. And that could be because they hate their boss. That could be because their work-life balance is all out of whack. Or they may just not like the work that they're doing. And what ends up happening is when that person quits, that line immediately drops down to zero, and then it extends for a period of time, and eventually that company decides to replace that person, typically. And then that cycle repeats itself. So they're reinvesting in the new person, getting them up to speed, and it takes a while before they get to the return zone. So as you can see in box number four, we've got lost time and lost productivity. This is what you want to avoid, especially for high performers.

So there's two different types of attrition, and we're going to focus on trying to prevent your high-performing employees from leaving. So not all employees are created equally. Some employees just never quite get there. And that's what this first box here is what we call necessary attrition. And this could be because that person just might not be a good fit for the culture of the company, may not mesh well, it might be a poor job fit. Whatever reason, they just don't quite cut it, and it's okay to lose these people. But what you really want to prevent is the right-hand box, which is bad attrition. And this is when you have high performers that have generated returns, you don't want to see them leave the organization.

So you can actually assign a dollar figure to this particular problem, and it's actually a very simple calculation. In fact, this is some R code. I know it might be a little bit difficult to read, but what this is is a function that I created. It's called calculate attrition cost, and it takes a few different parameters, and then it performs just a vectorized calculation that incorporates direct costs, lost productivity, and some assumptions in there, and then also it subtracts out the salary and benefits, which is what the company actually benefits from losing an employee. So for a high-performing employee that is a productive member of that company, it can be upwards of $78,000 per employee that that company loses when you assign a cost to it.

So the funny thing that happens is typically companies don't just lose one person, it ends up being more of a systemic issue. And so what happens when a company such as, say, Walmart or Target, you know, a competitor moves in and starts stealing your high performers, before you know it, you've got a couple hundred that you've lost, and if you lose 200 performers each year, that's a $15 million problem. Yikes. So we want to prevent these high performers from leaving the company. I think that's pretty obvious by now. So the way we can do that is through using data science.
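
For readers following along, the cost calculation described above can be sketched in R. This is a minimal sketch, not the function shown on the slide: the name calculate_attrition_cost follows the spoken name, and every parameter name and default value below is an illustrative assumption, chosen so that the result lands near the $78,000 per employee mentioned in the talk.

```r
# Sketch of an attrition cost calculation.
# All defaults are illustrative assumptions, not values from the talk.
calculate_attrition_cost <- function(
  n                        = 1,       # number of employees lost
  salary_benefits          = 80000,   # annual salary plus benefits
  direct_cost              = 18900,   # separation, job ads, recruiting, placement
  net_revenue_per_employee = 250000,  # annual revenue attributed to one employee
  workdays_per_year        = 240,
  workdays_position_open   = 40,      # days the seat stays empty
  workdays_onboarding      = 60,      # days the replacement is ramping up
  onboarding_efficiency    = 0.50     # share of full output during ramp-up
) {
  daily_revenue <- net_revenue_per_employee / workdays_per_year

  # Lost productivity: every day the seat is empty, plus the shortfall
  # while the replacement is onboarding
  lost_days <- workdays_position_open +
    workdays_onboarding * (1 - onboarding_efficiency)
  productivity_cost <- daily_revenue * lost_days

  # Salary and benefits the company does not pay while the seat is empty
  salary_savings <- salary_benefits / workdays_per_year * workdays_position_open

  # Vectorized over n and any other argument
  n * (direct_cost + productivity_cost - salary_savings)
}

cost_one <- calculate_attrition_cost()         # one high performer
cost_200 <- calculate_attrition_cost(n = 200)  # 200 high performers per year
```

With these assumed defaults, one departure comes to roughly $78,000 and 200 departures to roughly $15.7 million, which is in line with the $15 million figure quoted in the talk.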

Shiny app demo

And I'm going to show you the product first, just because I want to show you guys the end result. So we're going to do a quick Shiny web app demo. And again, this is something that's exclusive to the RStudio conference, I'm really excited about this. This is what we're developing as part of the 300 series course, so there's a 100, a 200, and a 300, and those who take that course are going to be able to learn how to build this app. So I'm just going to click this run app button, and if all goes well, we get a Shiny app.

So what this app represents is the end product. You can imagine that you've got managers out there that are responsible for their employees. They are responsible for retaining them, making sure that they develop them into productive people, making sure that they're happy. And that's exactly what this app allows us to do. So I'm just going to scroll and kind of show you, each one of these numbers represents a specific employee. And this particular one, employee number 891, has a 47% prediction risk. So this is, this employee is actually predicted to leave, because that prediction risk is above the threshold for deciding whether or not that employee leaves. And this is actually H2O under the hood that's generating this predictive model, and the app is encapsulating it.

So you can imagine, put yourself in that manager's shoes. They see that they've got an employee that's 47% likely to leave, and then what we have over here are the reasons, the features, of why that person is predicted to leave. So this actually comes from the lime package, and this first feature here is stock option. So that person has stock option level 0. So this is something that the manager can actually affect, or change. It's what's called a lever. It's a feature that the manager can adjust. So maybe moving that employee from level 0 to level 1 and giving that employee stock options, that might be enough to help that employee to stay.

The next one is the employee has over 28 years at the company. The next one is that the number of jobs that that employee worked at is over 6. And then the next one is that that person has a training time last year of 2. So these two here are not levers that can be adjusted. You can't really toggle the number of years that that employee works, or how many jobs that they've had previously, but that manager can then take a look at maybe training times, maybe to help get them engaged, give them a few more training sessions a year.

So this is really cool. We're able to actually bottle up data science into this machine learning app, and even further than that, we can develop predictive recommendations that are able to be presented to the manager. So that way they don't have to kind of think of strategies to do, but that's already incorporated for them. So for example, this management recommendation strategy has a work environment strategy of promote job engagement. So that manager should then focus on activities that will promote job engagement.

So this is what we're talking about. When we provide Shiny web apps to non-technical people, business leaders that have a stake in the game where they can actually affect and make decisions that better improve the company, that can make a huge difference.

The better decision making effect

So this is what I call the better decision making effect. This is when you provide the Shiny web app to the decision makers in your company. You've trained them on the app, how to use it, and you see them making positive decisions. What ends up happening is as you monitor the impact of the decision making improvements, you can start to see your percentage of attrition going down, down, and down. So it starts up here around 20%, ending in about a year at 9%. And then you can actually use that cost of attrition function to calculate how much you're saving the organization. And then how does that look when you show your boss at the end of the year, hey, I just saved you, our company, $4.3 million. I mean, that looks pretty good. And you'll probably get a, I don't know, maybe a promotion out of it.
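
To make the savings arithmetic concrete, here is a rough sketch in R. The attrition rates (20% down to 9%) and the roughly $78,000 cost per employee come from the talk. The headcount of 500 is a hypothetical value, since the talk does not state one, and the function name attrition_savings is made up for this illustration.

```r
# Sketch: annual savings from lowering the attrition rate.
# headcount = 500 is a hypothetical assumption; the talk gives no headcount.
attrition_savings <- function(headcount, rate_before, rate_after,
                              cost_per_employee) {
  # Departures that no longer happen once the rate drops
  departures_avoided <- headcount * (rate_before - rate_after)
  departures_avoided * cost_per_employee
}

savings <- attrition_savings(
  headcount         = 500,    # hypothetical
  rate_before       = 0.20,   # from the talk
  rate_after        = 0.09,   # from the talk
  cost_per_employee = 78000   # approximate figure from the talk
)
```

Under that assumed headcount, 55 fewer departures at about $78,000 each comes to about $4.29 million, close to the $4.3 million quoted above.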

Anyways, the better decision making effect. That's what I want to get across here is that we can use Shiny apps to actually influence the decisions within our own organizations. And that's how we generate business value.

The data science workflow

So what I want to talk about next now is the data science workflow. So we talked a lot about the end product. Hopefully you like the app. And now what I want to do is talk about the process to get to that app. So how do you go from business problem to business value? It requires a process that can be shown as a workflow. There's three stages, preparation, experimentation, and distribution. So in the preparation stage, you're acquiring data and you're reformatting and cleaning data.

At that stage, then you move into the learning, which is the experimentation phase. You've got your data, you're coming up with hypotheses, you're meeting with stakeholders in the company that know that data inside and out to help you develop your strategies. And then you move into the transformation and visualization stage. And then you get to modeling and validation. And then, oh, you find, okay, we aren't getting a good model yet. I've got to go back and reevaluate our hypotheses. Or you may have to go back and acquire data. So you've got kind of this process that's iterative, and eventually you get a model that starts to show improvement, that gives you an idea that there could be a possibility of changing decisions. You then move into that final phase of distribution, you develop reports, you convince people in the company, executive leadership, that this is the right thing to do. And then you build an app and you deploy it and you start getting those business results. So that's how you generate business value.

What I want to show you is the tidyverse. It's a perfect fit. You can see we've got a bunch of packages that I've overlaid. You've got readr for reading in data. You've got dplyr, tidyr, lubridate for time series, stringr for text, forcats for categorical data to help you clean that data. You get into the experimentation stage. And once you're doing the visualization and the transformations, you're working with dplyr and ggplot2, H2O, once you get into the modeling and validation, and then you move into reporting, you've got R Markdown and then deployment, you've got Shiny. It just fits. And this is what I call the R toolchain. It's a fabulous toolchain. I've been teaching it for quite a while, and I'll tell you what, this gets results.

So if we examine our workflow, dplyr, it's awesome for sizing the problem, identifying where we've got issues in our data. This particular analysis, what we're doing is looking within job roles to define where we've got high cost of attrition by cohorts within the data. ggplot2, imagine if you make a lollipop chart like this and you show your management that, hey, we've got a problem here, and you know what? I've identified where we've got that problem. The sales executive role, 3.92 million per year, that's how much it's costing us. We better try and retain some of those high performing sales execs. Laboratory technician, that's the next one, 3.85 million. We should focus on them.
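
The sizing step described above might look something like the following sketch. The data frame is a tiny hypothetical stand-in for the HR data, the column names are assumptions, and the flat $78,000 cost per departure is a simplification, so the numbers will not match the slide.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row per employee
employees <- data.frame(
  job_role = c("Sales Executive", "Sales Executive", "Sales Executive",
               "Laboratory Technician", "Laboratory Technician",
               "Research Scientist", "Research Scientist", "Research Scientist"),
  attrition = c("Yes", "Yes", "No", "Yes", "No", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

cost_per_employee <- 78000  # approximate figure from the talk

# Size the problem: count departures and attach a cost, by job role
attrition_by_role <- employees %>%
  group_by(job_role) %>%
  summarise(
    n_employees = n(),
    n_left      = sum(attrition == "Yes"),
    .groups     = "drop"
  ) %>%
  mutate(cost_of_attrition = n_left * cost_per_employee) %>%
  arrange(desc(cost_of_attrition))

# Lollipop chart: a thin segment from zero plus a point at the value
lollipop <- attrition_by_role %>%
  mutate(job_role = reorder(job_role, cost_of_attrition)) %>%
  ggplot(aes(x = cost_of_attrition, y = job_role)) +
  geom_segment(aes(x = 0, xend = cost_of_attrition, yend = job_role)) +
  geom_point(size = 4) +
  labs(
    title = "Cost of attrition by job role",
    x     = "Estimated annual cost of attrition (USD)",
    y     = NULL
  )
```

Printing lollipop in an interactive session draws the chart, with the most expensive role at the top.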

Then you get into the modeling, you've got H2O, and literally, this is all the code that it takes to make that model. I want to give a quick shout out to Erin LeDell at H2O. Her team has done a fabulous job to make a super scalable, high performing machine learning algorithm, and literally, this is called automated machine learning. It's automated. So it makes it super simple to get all the models, 20 different models, GLMs, deep learning, stacked ensembles when they merge them all together and average them, GBMs. They have just done a really good job. This is all it takes to start getting a predictive model.
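
The slide itself is not reproduced in this transcript, so the following is only a sketch of what an H2O AutoML call can look like in R. The data is synthetic, the column names are assumptions, and max_models is kept small so the sketch finishes quickly. It needs the h2o package and a Java runtime.

```r
library(h2o)

h2o.init()  # starts or connects to a local H2O cluster (requires Java)

# Hypothetical synthetic data standing in for the HR data set
set.seed(123)
n <- 300
employees <- data.frame(
  over_time          = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  stock_option_level = factor(sample(0:3, n, replace = TRUE)),
  years_at_company   = sample(0:30, n, replace = TRUE),
  monthly_income     = round(runif(n, 2000, 15000))
)

# Simulated outcome: overtime and no stock options raise the chance of leaving
p_leave <- plogis(-2 +
  1.5 * (employees$over_time == "Yes") +
  1.0 * (employees$stock_option_level == "0"))
employees$attrition <- factor(
  ifelse(runif(n) < p_leave, "Yes", "No"),
  levels = c("No", "Yes")
)

train_h2o <- as.h2o(employees[1:240, ])
test_h2o  <- as.h2o(employees[241:300, ])

# Automated machine learning: trains several model families and ensembles
automl_models <- h2o.automl(
  x              = setdiff(names(employees), "attrition"),
  y              = "attrition",
  training_frame = train_h2o,
  max_models     = 5,    # small cap so the sketch runs quickly
  seed           = 123
)

# The best model on the leaderboard, used to score held-out employees
leader      <- automl_models@leader
predictions <- as.data.frame(h2o.predict(leader, newdata = test_h2o))
```

For a two-class target like this one, the prediction frame has a predict column with the predicted class and one probability column per class, here No and Yes. The Yes column is the kind of prediction risk shown in the app.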

And then we need to explain this feature. So it's not good enough to have a prediction risk. Management doesn't care about that. What they want to know is how can I change my decision making, and this is what LIME gets you. So this is all the code, literally, it takes you to develop that LIME. This is it. And you make a LIME chart like that, and you can see right away that this employee has overtime equals yes, so maybe their work life balance is out of whack. Maybe you should look at that to try and retain them.
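
As with the modeling slide, the LIME code is not captured in the transcript, so here is a hedged sketch using the lime package's lime() and explain() functions. To keep it quick it explains a single H2O gradient boosting model trained on synthetic data rather than an AutoML leader; the data and column names are assumptions.

```r
library(h2o)
library(lime)

h2o.init()  # starts or connects to a local H2O cluster (requires Java)

# Hypothetical synthetic data standing in for the HR data set
set.seed(123)
n <- 300
employees <- data.frame(
  over_time            = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  stock_option_level   = factor(sample(0:3, n, replace = TRUE)),
  years_at_company     = sample(0:30, n, replace = TRUE),
  num_companies_worked = sample(0:9, n, replace = TRUE),
  monthly_income       = round(runif(n, 2000, 15000))
)
p_leave <- plogis(-2 +
  1.5 * (employees$over_time == "Yes") +
  1.0 * (employees$stock_option_level == "0"))
employees$attrition <- factor(
  ifelse(runif(n) < p_leave, "Yes", "No"),
  levels = c("No", "Yes")
)

predictors <- setdiff(names(employees), "attrition")
train_df   <- employees[1:240, ]
new_obs    <- employees[241, predictors]  # the employee to explain

# A single GBM stands in for the AutoML leader to keep the sketch fast
model <- h2o.gbm(
  x              = predictors,
  y              = "attrition",
  training_frame = as.h2o(train_df),
  ntrees         = 50,
  seed           = 123
)

# Step 1: build an explainer from the training features and the model
explainer <- lime::lime(
  x              = train_df[, predictors],
  model          = model,
  bin_continuous = TRUE,
  n_bins         = 4
)

# Step 2: explain one employee's prediction with the top 4 features
explanation <- lime::explain(
  x          = new_obs,
  explainer  = explainer,
  n_labels   = 1,
  n_features = 4
)

# plot_features(explanation) draws the LIME chart in an interactive session
```

Each row of explanation is one feature with a weight that supports or contradicts the predicted label, which is the information the app surfaces to the manager.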

And then Shiny. You've got H2O, LIME, dplyr, and the rest of the packages. That's all under the hood. Shiny is really the glue that allows us to join everything together, and that's what I'm so excited about, is just the Shiny web infrastructure makes everything so simple to get business value, business results. So again, this is the data science workflow. Just to recap, you've got this amazing tool chain, and what that allows you to do is build web apps that can change decisions, and then don't forget, you need to monitor those decisions and show the savings that you're generating for the organization.
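
To give a feel for how Shiny acts as the glue, here is a deliberately tiny sketch, nowhere near the app shown in the demo. The predictions table is hypothetical, standing in for the output of the H2O and lime steps. Only employee 891 with its 47% risk and stock option level 0 echoes the demo; the other rows are invented.

```r
library(shiny)

# Hypothetical precomputed output of the modeling and explanation steps
predictions <- data.frame(
  employee_id = c(891, 1024, 1187),
  risk        = c(0.47, 0.12, 0.63),
  top_factor  = c("Stock option level 0",
                  "Training times last year: 2",
                  "Overtime: Yes"),
  stringsAsFactors = FALSE
)

ui <- fluidPage(
  titlePanel("Employee attrition risk (sketch)"),
  selectInput("employee_id", "Employee", choices = predictions$employee_id),
  textOutput("risk"),
  textOutput("top_factor")
)

server <- function(input, output, session) {
  # Row for the employee currently selected in the dropdown
  selected <- reactive({
    predictions[predictions$employee_id == as.numeric(input$employee_id), ]
  })

  output$risk <- renderText({
    sprintf("Predicted risk of leaving: %.0f%%", 100 * selected()$risk)
  })

  output$top_factor <- renderText({
    paste("Top contributing factor:", selected()$top_factor)
  })
}

app <- shinyApp(ui, server)
# In an interactive session, launch with: runApp(app)
```

The design point is that the manager only ever sees the dropdown and the two lines of text, while the modeling packages stay under the hood.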

Learning R

I just showed you an amazing tool chain. You're probably wondering, how do I learn all this stuff? Well, in my experience, I've been doing workshops, and I've been doing online education for quite a while now. Learning R is like a hill climb. The good data scientists really focus on the fundamentals first, and then they move their way up the hill. So the fundamentals, data cleaning and manipulation, that's learning like dplyr, tidyr, stringr for text, lubridate, and so on. We've got visualization, functional programming, advanced data science, and the goal, which is Shiny and R Markdown.

So before I end here, I just want to give you guys a quick bonus. We actually do have a number of courses that we're developing strategically to be able to help get you there. If you want to get there fast, it doesn't take years, it takes weeks, and I even have a special bonus for you. We're giving everyone 15% off of Business Science University. The promotional code is RStudio. All right. Thanks, everybody, for your time.