Resources

Dr. Travis Gerke | UnicoRns are real | RStudio (2020)

Common advice from experienced data scientists to job-seekers is to avoid job postings that describe a "data science unicorn": someone who has experience performing an unrealistically large array of technical and business-related job duties. Seeking a unicorn is viewed as a potential indicator that the company fails to understand their data science needs, and that new hires will not be poised for success due to lacking support and resources [Robinson & Nolis, 2019]. The R language, particularly when used with RStudio products, has evolved to enable production-level activities in the areas of data wrangling, reporting/dashboarding, database/software engineering, machine learning, and web application development. It is increasingly plausible that a data scientist will be able to efficiently perform a wide variety of job functions with experience only in a single language (R). Indeed, even entry-level R users may tread into "unicorn" territory. Current standards for data scientist job descriptions and salaries do not accommodate this nuance, leaving both job-seekers and hiring managers unable to distinguish job requirements which should be read as warning signs from listings which are idyllic matches for the modern R unicorn. In this talk, we present data aggregated from several large compensation analytics companies which summarize current benchmarks for data science job descriptions and corresponding salary ranges. We then suggest job description language to target modern R users, considering both job duty compatibility and job post findability. These descriptions are presented with likely salary range pairings. Attention is given to deviations from traditional degree requirements, years of experience, and demands for multiple programming language literacy which may lack relevance for the R unicorn. Our overarching goal is to provide job description templates which encourage optimal matchmaking between R job seekers and organizations in need of their talents.

Transcript

This transcript was generated automatically and may contain errors.

Hi, I'm Travis Gerke. Thanks for being here. Thanks for having me. This is tremendous. I'm excited. I don't know how else to put it. So, I lead a couple of data science teams at Moffitt Cancer Center in Tampa, Florida. You can find me at that Twitter handle, it's my name.

And I actually just tweeted out all the slides and the code that kind of goes along with making these slides, although this is not a technical talk. This is just a talk about writing job descriptions and things which are typically boring, probably, HR nuances. I'm sorry if there's any HR people in here, but we're going to talk about that.

So in particular, I'm interested in how we write job descriptions and how those end up mapping to certain salaries for data scientists, and in particular for R data scientists. I've often found this challenging, and I thought I'd just share some experiences and hopefully get some good information for you all.

The unicorn job posting problem

I hope this is a message that most data science managers seek to convey, and I hope they practice it. Certainly they all want to hire the most talented data scientists, but importantly, I hope they want to compensate them accordingly. Seems pretty simple, but in practice it doesn't always pan out that way.

This is a retweet of a parody account. So Associate Deans is a parody account that talks about sort of bureaucratic silliness, mostly in academia. And here what they're talking about is soft money positions. A soft money position is one where an investigator has brought grant money to an institution, and they hire someone to work on their grant with their grant money. And they feel empowered. At least they feel like they should have full control over how that grant money is spent. So they should be able to hire the people they want and pay them as much as they want, because they sort of earned that money.

But then HR kind of gets in the way, so to speak. So here Jason is lamenting this fact, and he's saying, well, I know who I want to hire and I want to pay them some amount, but HR is now telling me that I have to pay them less. And I jumped onto this thread and I said, yes, I feel this as well. I've experienced this myself many times, where I want to pay a data scientist who I know is worth their weight in gold some amount of money, and then HR might say, oh, no, no, there's another calculation that has to happen here, and you have to pay them less.

So we're left kind of with the question, these HR people, are they our friends? Are they our enemies?

Whichever the case, and again, I'm so sorry if there's HR people in here. I promise it gets better. Whatever the case may be, I started to think, maybe there's another way, right? Maybe we can trick them. Like, maybe if I write a job description that describes a person who does not actually exist, then they can't do a salary benchmark analysis, and they can't establish a baseline for my person who doesn't exist, and then I can price point how I want. Seems pretty fair. And maybe other people have had this idea, and thus, unicorns are born, right? Maybe this is how it happened.

We've all seen these kinds of postings, these data science unicorn postings out in job boards and whatever, where they ask for a really, really long list of skills and needs and technical requirements and things like that. And so we're going to talk about those kinds of people.

When I thought I would take this strategy, that I'm going to write a unicorn posting, I thought, well, I should find one to use as a template, because it's kind of a lot. I don't know a whole lot of programming languages. I know R pretty well. I know some other things, but I don't know what I'd put in those. So I started shopping around, and with some help, we identified this one, which was live just last week on Indeed, and you don't have to read all that. I'll walk you through what this person's going to do. This is not my posting. This is someone else's posting.

So this person will do some machine learning. They'll deploy some models in production, so they'll do some machine learning engineering, it looks like. They're going to build applications, and they're going to develop and implement cloud-based security solutions. Cool. And then they'll integrate data and do some decision-making, so it sounds like there's some decision science tasks that are on this person's plate. Fair enough.

So here are the technical requirements for that person. I'll mention something at the bottom that didn't make the cut: they only have zero to one year of experience, and they know all these things. But anyway, don't read all this stuff.

I did read all these things, and as I stared at them, I became more convinced that a contemporary R user who is equipped with enough R packages at their disposal, which all of you have, and the RStudio suite of tools, which makes doing a lot of these things very easy, they can do all of these things with a single language and a single toolkit.

So there's a lot of languages listed on there, and I don't really think they're all necessary. They want this person to do exploratory data analysis: okay, tidyverse. They want them to do some visualizations: okay, ggplot. And then they want them to wrangle large, complex data potentially, so there's all kinds of things out there: data.table, vroom, other resources, even dbplyr now. Dashboards, Shiny, visualizations, all this stuff, it happens. Interfacing with modern database technologies: we have packages for that now. Deep learning, machine learning, all of it in R.

So I became convinced, right, so the R unicorn is real, and I want to actually not trick HR, but write the correct job posting for the person that I want. I want a unicorn-like person who knows R, and I think we should be able to write a job description like that.

But in order to be able to do so, well, hang on, there's a problem. We've seen warnings like this out there on Twitter and from other resources, and in particular from this book. I want to point out that if you read one book this year, read this one, even if you're not seeking a job. It's called Build a Career in Data Science, from Emily Robinson and Jacqueline Nolis. I devoured it in like three days, and they very eloquently map out the challenge here. They say there are lots of listings that look like the one I just described, but it might be a red flag that the company doesn't know what they actually want from a data scientist, so you might want to beware. And yet I just claimed that these unicorns are real and, importantly, that we should pay them the right amount.

How HR salary benchmarking works

So back to the exercise: how do we do this? You'll notice I had a co-author on this talk. Her name is Donna Evans, and she's a senior compensation consultant at Moffitt. To know how the job description moves from those words to a quantitative salary estimate for the person that you want to hire, I needed to understand the black box voodoo that happens in the middle. So she helped me with that, and here it goes.

So the hiring manager writes a job description. We all knew that was coming first, I think. You actually only have to write a couple of things in that job description. The first is the primary purpose and the expected deliverables. That's almost a one-liner: why does this job exist in this company? You just have to say that. And then some requirements around technical skills, prior experience, and/or education, and these don't have to be extremely verbose.

With that information, HR will then classify the role. They'll give it a title; in this case we're going to be working with the data scientist title. Within that domain they'll classify whether this is going to be an hourly or a salaried role, so non-exempt versus exempt; the type of contributor, whether they're a technician, a professional, or a scientist; and within each of those domains, what level they're contributing at, so entry through principal.

Once they have that classification in hand, they turn to benchmarking data, which is purchased from compensation survey companies, and those data resources are a bit different than what you might be used to, so if like me you've sat down and Googled what does a data scientist at company ABC make, you'll get some results back, and it'll probably be self-reported results from someone who said I worked at company ABC a couple of years ago and I made this. And that's valid, but there are problems with that sort of data, self-report bias, it's incomplete, other things like that, so it might not be the best.

These compensation survey companies are pretty different. What they do is they have clients who purchase the data from them, and as part of that agreement, they contribute data back, so the organization that has bought the data then goes across their organization and they say here are all the job descriptions and the job titles that we have, and here's what those people actually make today, and so then they send those back to the compensation survey company and they aggregate it and they send it back out to all their clients, right? So importantly, these are now salaries that are real-time, more or less, from warm bodies in those positions, and that's important.

Then HR divisions usually purchase more than one such survey. At Moffitt we have many, and our HR group will take them and aggregate them together, maybe summarize them, take a median and an interquartile range. And then, according to any individual organization's compensation philosophy, that's how you arrive at the number. The compensation philosophy could vary a little bit. For example, in the startup world, where things like bonus incentives and stock options are more common, the base salary might dip a little bit lower because you have all those other things. And of course there are companies on the other end of that spectrum. But that's how this happens, more or less.
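As a rough illustration of the aggregation step described above, here is a minimal Python sketch. The survey names and salary figures are made up for illustration and do not come from the talk or from any real compensation survey. Pooling all reported salaries before summarizing is also an assumption, since the talk does not say exactly how the surveys are combined.

```python
from statistics import median, quantiles

# Hypothetical base salaries (USD) for one title and level, as reported by
# three made-up surveys. Illustrative numbers only, not data from the talk.
surveys = {
    "survey_a": [58_000, 61_000, 64_000, 70_000],
    "survey_b": [55_000, 62_000, 66_000],
    "survey_c": [60_000, 63_000, 68_000, 72_000, 75_000],
}


def three_number_summary(values):
    """Return (lower quartile, median, upper quartile) for a list of salaries."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q1, median(values), q3


# Assumption: pool every reported salary across surveys, then summarize.
pooled = [salary for reported in surveys.values() for salary in reported]
q1, mid, q3 = three_number_summary(pooled)
print(f"median={mid:,.0f}  IQR=({q1:,.0f}, {q3:,.0f})")
```

An organization's compensation philosophy would then decide where in that range an offer lands, which is a policy choice rather than something the summary itself determines.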

Data scientist levels and job descriptions

So I iterated with Donna a couple times to understand how I could write a job description that mapped to roles that appear in those compensation surveys. So for title, the data scientist title was obvious. We started there, and that ended up being a pretty good decision. And then within these surveys, they tend to have five levels for data scientists, so one being entry level, and then five being a principal data scientist, kind of at the senior end of these things.

So here, we'll walk through them, here's a data scientist one. Generically, they're going to do the thing that all data scientists do, right? Summarize and analyze potentially complex and large data to guide business insights. That's just what probably most of you do. And then they'll do some other stuff that we would expect, merge and tidy data, do some visualizations, organize databases, maybe do some analysis, they'll produce reports, and they'll make it accessible to the end users.

I have a bullet in there where I just wrote R stack experience off the top of my head, and this is incomplete even for me. But importantly, I put a wild card here for you if you're writing job descriptions. You can put in whatever R stuff you would like your person to know, and it will not change the leveling of these things, to my knowledge. I checked that with Donna, and that seems to be true.

The last two are actually the important bullet points when it comes to salary benchmarking. May require an advanced degree. So "may" is HR code speak for not required, so we don't actually need that in this role. And zero to two years of related experience preferred, with "preferred" also being code speak for not required. So this is very much aligning with an entry-level position, and that's good.

And so the notes are things that do not actually appear in the job posting, but they're important for you to know when you think about how this data scientist will operate. This one will report to a manager and have their work closely managed, because they're entry-level, and their projects often have limited complexity as they learn the nuances of the role.

I'll move to two, and you'll see that nothing changes except that last bullet point. So I left the purpose and the R stack, all that stuff, the same, and now we just have two to four years of related experience, typically required. So they need some kind of experience, but here's where you can start to mix and match experience with advanced degrees. If someone has an advanced degree in some quantitative science or computer science or whatever it may be, you can swap that out for the two to four years of related experience, and you'll get a data scientist two. And now they're requiring occasional direction, so you don't need someone looking over their shoulder all the time, and they're getting some exposure to complex tasks of the job.

Three, now we have four to seven years of experience, or an appropriate mix of that and an advanced degree, right, so if they have an advanced degree, this counts as two to five years of experience, depending on what sort of degree we're talking about. And now they require minimal direction, so more autonomy as we move through these stages, more complex challenges.

Here's a four. This work is primarily independent from this contributor, and they often lead teams for complex problems. Seven plus years of related experience, again, may be an appropriate mix with an advanced degree.

And lastly we have the principal data scientist, data scientist five, who is fully autonomous, and they're leading their teams to tackle the most challenging problems encountered by the data science unit. So that's it. That's one through five. These are the job descriptions. And they map to titles and levels, which appear in the compensation surveys.
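The leveling rules walked through above can be sketched as a small Python function. This is a minimal sketch under stated assumptions, not HR's actual procedure: the function name, its parameters, and the `leads_unit` flag are hypothetical. The quoted ranges overlap at their edges (two years appears in both level one and level two), so the sketch assumes boundary values land in the higher level. Because the talk distinguishes level five by autonomy rather than by years, that level is only returned when the caller flags it explicitly.

```python
def data_scientist_level(years_experience, degree_credit_years=0, leads_unit=False):
    """Map experience to a level from 1 to 5 using the ranges quoted in the talk.

    Assumptions (not from the talk): boundary values fall into the higher
    level, and level 5 requires the hypothetical `leads_unit` flag because
    it is defined by autonomy rather than by years of experience.
    `degree_credit_years` is the experience credit granted for an advanced
    degree (the talk mentions two to five years, depending on the degree).
    """
    if years_experience < 0 or degree_credit_years < 0:
        raise ValueError("years must be non-negative")

    effective_years = years_experience + degree_credit_years
    if effective_years < 2:
        return 1  # entry level, closely managed
    if effective_years < 4:
        return 2  # occasional direction
    if effective_years < 7:
        return 3  # minimal direction
    return 5 if leads_unit else 4  # independent; principal if leading the unit


# Example: a new graduate with a degree credited as two years of experience.
print(data_scientist_level(0, degree_credit_years=2))
```

Note that the number of programming languages a candidate knows is deliberately absent from the inputs, which matches the point made later in the talk about what drives the benchmark.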

Salary data by region and level

So here's maybe what you were hoping to see. Here's what those people actually make across the country. This is aggregated survey data from many surveys that Moffitt has purchased, so I'm not showing you any one survey for proprietary data reasons, and I've only got three-number summaries here, so just a median and an interquartile range that I'm presenting, stratified by region.

So here we see the dark blue is across the whole United States for data scientists one through five, and we see regional trends that we might expect because of cost of living. So the light blue is the southeast US, then we have orange and yellow as the West Coast and the Northeast, higher cost-of-living regions that make a bit more. Data scientist one tends to be in the 60,000 a year range, all the way up through the five, who tend to appear north of 150,000 a year.

If that's not granular enough, there's also this map, which you can use if you go to the slides. It's interactive. So you can kind of mouse over it and see what these people would make. If you are interested in a census tract or a particular metropolitan area, if you either work there or are job hunting or hiring in those areas, this could be useful for you. So as an arbitrary example, here's Nashville. Data scientist one in the 55,000 per year range, data scientist five in the 130,000 per year range. We can jog over here to where we are now, Bay Area. You all are, well, you're either money bags or you have a high cost of living, maybe it's both. So you've got 87,000 a year at the one and then north of 200,000 a year for the principal data scientist, right?

Lessons learned

So hopefully these things are useful. What did I learn from going through this process? I didn't know anything about HR when I went through this, but I was experiencing the pain of having to interact with HR and not knowing how they work well enough to communicate with them, to hire the people that I want and pay them what they're worth, right? So at your own institution, do the same. HR is actually your friend. They're helpful. It was so enlightening for me. They're really trying to do their best. They often don't understand data science roles because they're not data scientists for the most part. So talk to them. That will help.

This is something important that I took home from this. The primary drivers of salary estimates are experience, as in years of experience, and autonomy of the role. It's not the number of languages that you know. I think there's a temptation among all of us. Certainly it's true for me. I'm always like, well, if I just knew that one more language, or if I took this bootcamp or whatever, my career trajectory would look different. I don't think that's necessarily true.

Particularly for R users, you know a lot and you can deliver a lot of value to any institution that you choose to join. Feel empowered to apply for those roles, and you should get them, because I think you're all awesome. So you don't need, like, language X. You just probably need R, at least in my opinion.

And these drivers, unfortunately, I don't know that they're totally correct, right, for data science in particular. Autonomy, fair enough. But experience, it's hard to have ten plus years of experience in a technology which isn't ten or more years old. Shiny is a semi-arbitrary example, right? I don't really know how to navigate that, and there aren't really good solutions in the HR domain, but I think at least they're aware of it. So at individual organizations, you can talk to them about that sort of thing.

The salary surveys are not yet capturing specific data science roles. So again, in the Build a Career in Data Science book, they do a great job of spelling out all the data science subdomains. So like machine learning engineer, decision scientist, all kinds of business intelligence analysts, all this stuff. There's all these words. Because that hasn't been standardized for a long period of time, many companies aren't hiring into those titles, and so they're not reporting out data on those. So we do not yet have data on how much each of those sub-niche roles makes. But that will change in the coming years.

I'm not arguing for or against the current process. I'm just telling you what it is. I'm kind of powerless against the machine, just like so many of us are, right? I mean, there are federal laws that are in play here. So you really can't diverge too much from what the standards actually are. But at least by understanding the process, you may be empowered to understand if you're a job seeker, where you land in the data science one through five track. And if you're a manager, it might also empower you to help know how you want to write the job description to get the unicorn that you actually need.

A few to thank here. So Donna Evans, of course, my compensation consultant and HR guru, she was immensely helpful. Jordan Creed, many of you probably already saw her because she was giving out the unicorn hex stickers, which she also made. She's a great data scientist and apparently also a future PR representative. Garrick Aden-Buie, I've seen him thanked in at least half the talks here, and it's similar here, because he makes so many good tools that everybody uses. I benefit from working with him. I was able to get advance access to the xaringanExtra package, which made a lot of the fanciness that you saw in these slides happen, if you're interested in such things. The ggiraph package, I think that's how it's called, is what made that interactive map happen within the slides, outside the context of Plotly. If you're interested in how those things work, that's a pretty great package.

Thanks to the Tampa R Users Group. I gave a little preview of this talk a few weeks ago, and they gave some great, quick feedback. Again, Emily and Jacqueline for that book. There's a link. Please do check it out. If you're interested in any of these sorts of topics, you'll find that book fascinating. There are the resources, and that's how you find me on Twitter. Thank you again. It's so awesome to be here.

Q&A

Thank you so much, Travis. We have a few questions that are queued up for you. The first one is, there are still many job descriptions that specify Python only, and sometimes they even say R is insufficient. How should our community approach this situation?

That can be valid. If an institution has a large technology stack that's built around Python, there's probably not much of a way to get around it. You can go in there and do a lot of the same things with R, but in particular if it's a large organization, you're not going to be able to knock down the Python vertical and put up R in its place as a new employee. I disagree strongly with the "R is insufficient" part. I don't know. You probably don't want to work at that place. Let's just say it.

Next one is, how can data scientists develop their autonomy? What should we read, learn, and do?

I suppose demonstrating that you have independent packages. Everybody looks for the portfolio. Every hiring manager says, oh, do you have a GitHub? Most these days do, right? What's your GitHub portfolio look like? And I also appreciate the fact that it's a lot to ask people on their free time to build up their own projects because they're often currently working with a company where things are protected and they can't share them openly. I don't know what a good solution for that is. A lot of people are talking about that, but that probably is the best way, honestly. As someone who looks at CVs and resumes, I can usually get a good sense of how autonomous they are by looking at their repository, set of repositories.

One more question. What are the federal regulations that hamstring HR into these processes that you criticized?

So what is it? The Fair Labor Standards Act? Are there any HR people? Yeah, I'm seeing nodding heads. I know Jeff Leek alluded to this yesterday in his talk in the morning. I missed it, unfortunately, but I heard it was awesome. So he's facing the same things, whereby he is training people who don't have advanced degrees through data science, and he wants to hire them into advanced data science roles. But often that particular law says that those people cannot be salaried because they fall in between a couple of different categories. So Jeff is really a go-to on that, but that is one of the nuances that makes this really hard.

How does this correlate with the number of hex stickers on my laptop?

I can give Donna some numbers, and then I can report back. I'll fix the math. Or do a pull request if anybody wants to do that, and we can do it that way.

A lot of jobs are requiring data scientist skills, but calling it a data analyst position. How do you differentiate the two?

Yeah. So, I mean, you're going to have to try to read between the lines in those job descriptions. Data analyst is a different thing than a data scientist. I think, at least in the Build a Career in Data Science book, they specify an analyst to be a bit of a tier below the data scientist. They're not aggregating as much data. They're typically closely managed and things like that. But there are other organizations that, for historical reasons, have a data analyst role which performs all the tasks of a full-blown data scientist. And when they benchmark the salaries of the data analyst against these survey companies, they'll actually use the data scientist title, even though that's not what they're being called in the company. So, it's tricky. Just read the description, and maybe, if you go on an interview, try to tease out what it is they're actually having you do.

So, public universities often have rigid, capped salary scales. R unicorns are golden. Full stop. These are golden-hooved data scientists who value working in academia. Tips for those job ads?

I mean, academia is tricky. There's another layer of federal requirements around those things that are related to federal grant money. If you spelled out in a budget that you're going to pay a data analyst or something like that X amount of dollars, you're kind of bound within the context of that grant to pay them that. Although many faculty or investigators will have additional streams of money that they may have access to. So, feel empowered to seek the salary that you know you're worth. Those are my parting words.

Travis, thank you so much. Thank you.