
Eduardo Ariño de la Rubia | Value in Data Science Beyond Models in Production | RStudio (2020)
ML in production is one of the most obvious ways that data science organizations create value in business. However, these models are at the very end of a long story of how quantitative research changes and enhances organizations. In this talk I will discuss how I have found DS organizations to be truly transformative outside of ML in the loop. Bio: Eduardo Ariño de la Rubia is a DS manager and educator. He loves R and RStudio. He has a Masters in Negotiation, Conflict Resolution and Peacebuilding, which is probably the most useful training he could have received.
Transcript
This transcript was generated automatically and may contain errors.
Hello, everyone. So again, my name is Eduardo Ariño de la Rubia. You can't follow me on social media anymore because all my accounts are locked down. Too bad. I'm awesome. But if you want, you can follow me on LinkedIn. So first a disclaimer, this is not a talk about facts. If you wanted to go to a talk about facts, there's an awesome ggplot talk going on right now. This is the wrong talk if you wanted facts. This is a talk about opinions, okay? And they're my opinions. They're not my current esteemed employer's. They're not my previous awesome employers'. They're not my wife's. They're not my sweet angel mother's. This is just me. If you got a problem with me, with my opinions, let's have it out. Let's have a great discussion.
And if you're a scientist working, we had somebody earlier at a cancer research lab and things like that, this ain't about your space either. I don't know anything about your space. This is about data science in business. I think it's also important to give you a little bit of context. I'm sad to report that I'm no longer an IC. I'm no longer an individual contributor. I'm not even a manager anymore. It's absolutely repugnant. I'm a manager of managers. You're getting a real sort of different point of view here than you do from a lot of the practitioners. At this conference, you've had world-class practitioner after practitioner after practitioner tell you about the craft that they've honed. This is a totally different perspective. I used to be one of those folks, and now instead mostly I sit around in meetings and think about the design of organizations. In the last couple of years, I have built, spun up, and handed off at least two dozen data science organizations across multiple time zones.
So take it for what you will. And more importantly, any sufficiently advanced trolling is indistinguishable from thought leadership. This is just a fact, and I'm a cage-fighting referee, so let's do this. Am I trolling you? Is this real? Let's make it happen.
The webmaster parallel
So of course, as a data science conference, I want to talk about webmasters. Webmasters existed for a very long time, and tell me if this sounds familiar. They were individuals who had an incredibly rare set of skills all inside one meat human, and they were able to do amazing things, and they were incredibly well-paid, and no one really understood what they did. This is an article from 1997 already saying that these folks are making 60 to 90 grand. I don't know about you, but it's a lot later than 1997, and 90 grand is still a pretty sweet salary. And so these webmasters existed. And so why did they exist? What did they do?
Every time in technology, there is some fundamental evolution. You find these roles pop up for some period of time. What webmasters did, if you're old like I am, you remember working at a business and being like, hey, Stephanie, we need 27 flanges, and Stephanie would take out a book and go to F for flanges and call or file a purchase order, and that's how business happened, is human beings were forced to talk to each other and exchange documents and shit like that. And what happened was that that was incredibly inefficient, and capitalism demands a push towards efficiency, and so webmasters brought offline businesses to online.
Here is what's happened to webmasters, by the way. This is Google Trends, and it only goes back to 2008. Imagine what it would look like if we were to look now, but even 2008, 2009, 11, 12, there was still some sort of trend, and there was a massive collapse. There is no such thing as a webmaster in any real sense. Maybe you've got a little website and you want to call yourself a webmaster. That's awesome, but the massive set of skills that were required to transform an entity from offline into online no longer really exist inside of a single human. So where did they go?
It used to be a webmaster. I remember being a webmaster at a company, and it was me, and there was 1,000 rando IT dudes that did Lord knows what to keep the COBOL systems going, and all of those people basically have had to be re-skilled, and now they're a front-end engineer, a back-end engineer, a cloud architect, a site reliability engineer, a network engineer, a test engineer. The entire world that was the webmaster role exploded to become what is basically technology careers of today, not all of them, but a significant number of them. Job titles are fragile. Responsibilities that prove themselves in adding value to the business become durable.
So, the death of webmasters is actually a story about success. Businesses went online. Instead of changing the internet, which happened a little bit, what really happened was businesses changed. It wasn't a business going online, it was a business transforming to become an online business. The way that business was done fundamentally changed. Software ate the world, as that famous quote says, and by software, I mean like this online interconnected friction-reducing thing that now happens that allows me to take my pocket glass box of horrors and push a button and get a pizza or a car delivered to me. Business fundamentally changed, and you couldn't scale that if you were looking for one person or two people who had this massive set of skills.
Data science hype is machine learning hype
Okay, so now you're like, cool, we're at RStudio conference, what does this have to do with value in data science beyond models in production? Hold on, I'm coming to it. Because this is a really interesting chart, right? So, if you take Google Trends and you overlay webmaster and you overlay data science as a field of study, they have this really neat intersection that happens shortly after the release of Google Panda. Do people know what Google Panda is? Google Panda was this massive seismic shift in which Google said quality matters and we're creating quality guidelines, and suddenly SEO and SEM advice all overnight disappeared. There were all of these incredible tricks and tips and people who were making their entire career on you got to do this thing with the meta keywords and you got to buy these links in this link exchange and you got to do keyword stuffing and comment spamming and blah, blah, blah, blah, blah. And all of that went away. Google's entire quality algorithm moved to machine learning.
They ended up training on massive data sets based on human-rated statements of this website has quality, this website doesn't have quality. SEO advice became try to make good content. This is a massive shift. Suddenly the entire world of people selling content online and selling attention online basically realized, we can't make crap tutorials and hope that we capture enough time to get somebody to click on some random car ad or Lord knows what ad. They actually had to figure out how to make good content, and it turned out that's difficult.
So what's the parallel, right? The parallel is that a webmaster took an offline business and turned it into an online business. A data scientist takes a process and rules based business and turns it into an algorithmic business. What this really fundamentally means is they could no longer work on the process of we're going to stuff these meta keywords, we're going to buy these link exchanges. They had to constantly be experimenting to understand what is quality, what actually ranks higher, what are the characteristics of the data that we present and the narrative that we present that actually end up being incentivized by the Google algorithm. And this happened not just across content businesses but across basically the entire spectrum of business. Data scientists often turn process businesses into algorithmic businesses.
So here's one of my first hot takes. Data science hype is not real. It's actually machine learning hype. Okay? And this is where it gets interesting, right? Why? Because machine learning is the thing that engineers know how to embed into existing software systems. Data scientists do not have a sustainable advantage over engineers in producing models of sufficient quality which can be leveraged in production. Right now in the world there are massive advances happening in how you make machine learning models when you don't know statistics. Companies are building end-to-end tooling that fundamentally makes it possible for an engineer to build a good enough model.
And data science teams or organizations which are ML focused, if your data science organization principally is an organization that makes ML models, you are either going to become an engineering organization or if we talked right now and I pushed, you would admit that you are an engineering organization principally or you will be replaced by engineering organizations. If your data science team focuses on adding value other than models in production, I feel and predict that your organization is going to flourish.
By the way, somebody asked me what do I mean by production? Production is the set of hardware and software systems where money is made. That is what I mean by production. There is no other magical concept of production.
And so just as like if you overlay the Google trends of the field of study of data science and the field of study of machine learning, I understand that the world is full of spurious correlations but I argue that like this is not one of them. Like the data science hype cycle is intrinsically tied to the machine learning hype cycle. And let's be clear, today there's 15,408 data science jobs on monster.com but there's 17,931 machine learning jobs. So it's one of those things that like I want to keep making this point that actually like machine learning is not going to be the data scientist's competitive advantage. It is going to be other aspects of what they do.
And lastly, if you don't believe me that you don't have a competitive advantage, AutoML tools are getting pretty darned good. Here's a tweet from Erin LeDell. If you don't know her, I believe she's the chief research scientist at what? Chief machine learning scientist. Thank you very much, audience member. At H2O. She's absolutely brilliant and she works a lot on AutoML. I believe her PhD was on ensembling and bagging of models. I don't remember the specifics. Quite frankly, AutoML is getting pretty darned good.
And I hear you saying, oh, but that was a Kaggle dataset. That wasn't real. Real-world data doesn't look like that. So I've already made the argument that ML engineers are likely going to take the ML work that data scientists do to add value. Here's my hot take. There's already a group of people that do this better than data scientists. It's called a data engineer. A data engineer is a whole field. They've got a journal, like they exist, right? They develop and construct architecture, data acquisition. They prepare data for predictive and prescriptive modeling. The idea that data scientists have some sort of moat around massaging data and making it useful is wrong. At scale, in production, data engineers have an incredible series of advantages over them.
And let's be clear, I believe that in a mature organization, data engineers are the product owners of the data product. Here's one of my hot takes. Think about the analyses that you do. If, to do your analysis, you're having to join more than a couple of tables, you don't have well-designed data products and you need a data engineer. So what have we got left? We've taken away ML modeling. We've taken away the massaging of data. We've taken all these things.
Quick mid-talk summary. I'm not claiming that these responsibilities are going away. As a matter of fact, if you remember, I said that it is because of their success that these responsibilities are so durable. I'm actually claiming that much like offline became online and businesses became online, data-informed is the DNA of the business. I'm definitely not claiming that data scientist is a bad job. I don't know that the title is going to exist for a real long time, but the job is freaking awesome, pays real well and allows you to do some amazing stuff. I'm claiming that whether it's this year or a few years down the road, your value will not be ML in production. And this is a side effect of success.
Where data scientists have a competitive advantage
Okay. So now we talk about what do I think data science organizations can do other than ML in production, right? Where do data scientists have a competitive advantage?
So you've probably heard the phrase that the product manager is the CEO of a product, right? That's the fundamental theory of the product manager. We actually have PM job responsibilities on the left, CEO job responsibilities on the right. It's really about the scale that you're looking at, right? Are you looking at an individual product or set of features? Are you looking at the entire company? But realistically, I believe this. I believe that if you're working in a standard product team, the product manager is the CEO.
Okay. Well, guess what? I believe that data scientists are the CFO of product teams. So the CFO is responsible in a traditional company for determining and reporting the financial information. If money is a resource, the CFO is responsible for making sure that that resource is accounted for appropriately, is going to the right things, that there isn't any weird fraud or leakage. They are the ones who make decisions about how this resource is allocated. The CFO is to money like data scientists are to attention and effort. You are the ones who have the incredible power of deciding what the team focuses on, what the team builds, what the company is looking to build. This is the key, I believe, of why data scientists' responsibilities are incredibly durable.
Does your company really understand what drives its success and capabilities? Right now, are you prepared to say that you really truly understand exactly why your company is where it is, where it's going, where it's successful, where it's not? Have you done the six foundational analyses? Have you done them recently? Can you talk to your company about this allocation of resources and this true understanding? Have you understood when people use your product, why they use it and why they stay? Have you understood when people fail to use your product somehow, how is it they're failing? What happened? How did they get there and were they able to self-remediate?
Have you understood your power users? The fact of the matter is that a power law distribution almost certainly is happening inside of your product. Do you understand these power users, the ones that are driving your features, that are driving what you're doing? Do you understand in your experimentation framework why it is that you need to be very careful about power users? Because they will dominate the statistics of your A-B tests if you're not careful, and you end up just making your product better and better and better for the one percent of people who were already never going to leave your product in the first place.
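To make the power-user point concrete, here is a minimal sketch using invented usage numbers (minutes per user, 99 typical users and one power user in each arm). Capping each user's value, often called winsorizing, is one common mitigation and is an assumption of this example, not something the talk prescribes; the cap of 100 is arbitrary.

```python
from statistics import mean


def top_share(values, fraction=0.01):
    """Share of the metric total contributed by the top `fraction` of users."""
    ordered = sorted(values, reverse=True)
    k = max(1, round(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


def winsorize(values, cap):
    """Cap each user's value so a handful of heavy users cannot dominate the mean."""
    return [min(v, cap) for v in values]


# Hypothetical data: typical users got slightly worse in treatment,
# while the single power user became much more engaged.
control = [10] * 99 + [1000]
treatment = [9] * 99 + [1500]

raw_lift = mean(treatment) - mean(control)
capped_lift = mean(winsorize(treatment, 100)) - mean(winsorize(control, 100))

print(f"top 1% share of control usage: {top_share(control):.0%}")
print(f"raw lift in mean minutes:      {raw_lift:+.2f}")
print(f"capped lift in mean minutes:   {capped_lift:+.2f}")
```

With these made-up numbers the raw comparison shows a positive lift while the capped comparison is negative: the test "wins" only because of the one user who was never going to leave.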
Are you doing enough segmentation analysis? Do you understand where your product is strong, where your product is weak, which country loves it, which country hates it, which group of people love it, which group of people hate it? Do you know this? Is it operationalized? Have you done an ecosystem analysis? Do you understand if for every one minute that is used in hypothetically RStudio, is that negative 30 seconds in Jupyter notebooks, is it incremental to, is it instead of? What is happening inside of the broader ecosystem in which you function? Have you done an inflection point analysis? Do you understand what it takes to drive long-term growth and adoption, not just one-step wins? Do you have a large enough holdback that you can understand how a person who hasn't received any of your treatments over a year is different from a person who's received every marketing message over the previous year? Do you have this deep understanding about where the resources, the attention, and the skills of your organization are being spent?
Specific ways data orgs add value
To get real specific now, like what are the ways that data orgs add value? Number one is metric design. Are you monitoring what you should? Do you have the right metrics in place? Does your metric actually measure not just the immediate value of an A-B test, but the longitudinal value? Do you have the right counter metrics in place? Is it possible that you're increasing the click through rate, but destroying the adoption or destroying the renewal? Do you have the basket of metrics that you have proven time and time again have the characteristics to help you by being real proxies to the things that are stated in your company values?
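As a rough illustration of counter metrics, the sketch below approves a change only when the primary metric improves and no counter metric drops by more than a tolerance. The function name, the metric names, and the 1% tolerance are all hypothetical choices made for this example.

```python
def ship_decision(primary_lift, counter_lifts, tolerance=-0.01):
    """Return (approved, harmed_metrics).

    Lifts are relative changes (0.05 means +5%). A counter metric counts as
    harmed when its lift falls below `tolerance`.
    """
    harmed = sorted(name for name, lift in counter_lifts.items() if lift < tolerance)
    approved = primary_lift > 0 and not harmed
    return approved, harmed


# Hypothetical experiment readout: click-through rate is up 8%,
# but renewals are down 4%.
approved, harmed = ship_decision(
    primary_lift=0.08,
    counter_lifts={"adoption_rate": 0.00, "renewal_rate": -0.04},
)
print(approved, harmed)
```

Here the click-through win is rejected because the renewal counter metric moved the wrong way by more than the tolerance.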
Maybe, maybe not. Do you measure right? Are you sure you're logging everything well? Do you work with your engineers to make sure that at every single screen, everything is logged so that all of your experiments aren't actually telling you something about the data generating process of improper logging as opposed to actually what the users experience? Really, really, I'm loving the nods, by the way. Those of you nodding like you're giving me life, right? No, this is a big deal. If you're not measuring properly, if you're not measuring consistently, you're always chasing noise. Your organization is not chasing any signal.
Goaling, I actually believe, is the single thing that data science organizations need to own with ferocity. There is no more powerful thing in organizing a group of people to take a mountain, take an objective, or do an audacious thing than a goal that is at this perfect sweet spot of audacious but believable. Do you as an organization have a formal praxis for determining what is a 50-50 goal, a goal that you will not hit half the time, a 90-10 goal, a 10-90 goal? Do you understand which one of the metrics you should actually goal on? Do you understand whether you should be goaling on the operational metrics or the tracking metrics? Do you have all of this in place? Is it a rote machine that adds value to the business every quarter, every half?
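One way to read the 50-50, 90-10 and 10-90 language is as quantiles of a forecast: a goal the team is expected to hit 50%, 90% or 10% of the time. That reading is an assumption of this sketch, as is the idea that you already have a list of plausible outcomes (from history or a simulation). The function name and the numbers are hypothetical.

```python
def goal_for_hit_rate(outcomes, hit_percent):
    """Highest level that at least `hit_percent` percent of the outcomes reach."""
    ordered = sorted(outcomes, reverse=True)  # best outcome first
    # Ceiling division: how many outcomes must meet or exceed the goal.
    needed = max(1, -(-hit_percent * len(ordered) // 100))
    return ordered[needed - 1]


def hit_rate(outcomes, goal):
    """Fraction of outcomes that meet or exceed `goal`."""
    return sum(o >= goal for o in outcomes) / len(outcomes)


# Hypothetical forecast: 100 equally likely values of a quarterly metric.
forecast = list(range(1, 101))

for label, pct in [("90-10 (safe)", 90), ("50-50 (stretch)", 50), ("10-90 (audacious)", 10)]:
    goal = goal_for_hit_rate(forecast, pct)
    print(f"{label}: goal={goal}, hit rate={hit_rate(forecast, goal):.0%}")
```

The design choice here is to pick the goal from the forecast distribution rather than from last quarter's number plus a round percentage, so the stated odds of hitting it can be checked afterwards.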
I wanted to actually point out, I said value beyond models in production. That's not no models; you're still going to be doing modeling. Modeling is still going to be an important part of your life, right? Sometimes you still need to make a model, and that can be the best way to size an opportunity, to understand what's going on. Thank you, I see the time. You're going to end up doing it, and R is a fantastic tool to make these incredibly fast models, to make them quickly, and to be able to educate the business.
More stuff. Are you creating enough self-service tools? Every time that you are getting feedback from a business partner that they need something done, are you asking yourself, is now the appropriate time to turn this into something that we leverage? Just a quick shout out here to RStudio Connect. I think that is a fantastic tool for R organizations that are trying to do this exact thing, but if you're not building self-service tools as part of your practice, you're leaving value to the organization on the table.
Prioritization. Any one engineering company, any one company building a thing can build a hundred different features onto the thing. How are you choosing which one should be built? The average ratio, and I'm not saying it's a good or bad ratio, is that for every one data scientist, analytics person, whatever you want to call them, there's a minimum of 10 engineers. Let's make that clear. Standard Silicon Valley salary, that's $2 million worth of salary, right? 10 times $200,000. That one data scientist driving the prioritization of those 10 Silicon Valley engineers is making the choice whether $2 million worth of human capital is building a feature that is going to be good, or a feature that is actually not going to be used, or a feature that people don't really want, right? Off of all of the analysis we did earlier, which is a thing that you do via opportunity sizing. Like I said, there's 10 things that can be done, but which one should be, right? Are you working on a feature that only your power users are going to use? Are you working on a feature that is going to be a fundamental lever to the sustained adoption of your product? If you're not doing that opportunity sizing, you're not being a good CFO of your company.
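The arithmetic above (10 engineers at $200,000 each) can be turned into a back-of-the-envelope opportunity-sizing sketch. Only the two salary figures come from the talk; the candidate features, their reach, adoption, value and build-time estimates are invented for illustration, and the simple reach × adoption × value formula is one assumed way to size an opportunity.

```python
from dataclasses import dataclass

# Figures quoted in the talk: 10 engineers at $200,000 each.
ENGINEERS_PER_DATA_SCIENTIST = 10
SALARY_PER_ENGINEER = 200_000
TEAM_COST_PER_YEAR = ENGINEERS_PER_DATA_SCIENTIST * SALARY_PER_ENGINEER


@dataclass
class Opportunity:
    name: str
    reachable_users: int      # users who could plausibly see the feature
    expected_adoption: float  # fraction of them expected to use it
    value_per_adopter: float  # dollars per adopting user per year
    team_years: float         # share of a team-year needed to build it

    def expected_value(self) -> float:
        return self.reachable_users * self.expected_adoption * self.value_per_adopter

    def cost(self) -> float:
        return self.team_years * TEAM_COST_PER_YEAR

    def return_per_dollar(self) -> float:
        return self.expected_value() / self.cost()


# Hypothetical candidates; every number below is invented.
candidates = [
    Opportunity("power-user export", 5_000, 0.6, 50.0, 0.25),
    Opportunity("onboarding checklist", 400_000, 0.2, 10.0, 0.25),
    Opportunity("dark mode", 400_000, 0.5, 1.0, 0.5),
]

ranked = sorted(candidates, key=lambda o: o.return_per_dollar(), reverse=True)
for o in ranked:
    print(f"{o.name}: ${o.expected_value():,.0f} expected for ${o.cost():,.0f} of team time")
```

With these made-up inputs the feature aimed only at power users ranks well below the one that touches broad adoption, which is the kind of trade-off the sizing is meant to surface.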
And obviously experimentation, right? It's incredible. I'm not going to do a show of hands thing, but if you've ever worked with an engineer who set up an A-B test and said, oh yeah, but we made it real easy for people to opt in and opt out of the A-B test. Have you had that experience? Have you had to explain to them how they no longer have an A-B test, how there's no longer an experiment there, and all that you're measuring is now bias? You as the data science organization add that value by being the people who make sure that experiments are actually meaningful.
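A small simulation can show why letting users opt in breaks the experiment. This is a sketch under stated assumptions: each simulated user has a baseline engagement score, the treatment has no effect at all, and more engaged users are more likely to opt in. The function name, the uniform engagement distribution and the opt-in rule are all hypothetical.

```python
import random
from statistics import mean


def simulate(n_users=20_000, seed=0):
    """Compare a randomized split with a self-selected one when the true effect is zero."""
    rng = random.Random(seed)
    # Baseline engagement per user; the outcome equals this baseline
    # because the treatment is assumed to do nothing.
    engagement = [rng.uniform(0, 100) for _ in range(n_users)]

    # Proper A-B test: a coin flip decides who is treated.
    randomized = [rng.random() < 0.5 for _ in engagement]
    # Opt-in "test": more engaged users are more likely to choose treatment.
    opted_in = [rng.random() < e / 100 for e in engagement]

    def difference(flags):
        treated = [e for e, f in zip(engagement, flags) if f]
        untreated = [e for e, f in zip(engagement, flags) if not f]
        return mean(treated) - mean(untreated)

    return difference(randomized), difference(opted_in)


randomized_diff, opt_in_diff = simulate()
print(f"randomized difference: {randomized_diff:+.2f}")
print(f"opt-in difference:     {opt_in_diff:+.2f}")
```

The randomized difference stays close to zero, as it should for a treatment that does nothing, while the opt-in difference is large: it measures who chose the treatment, not what the treatment did.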
And how do data orgs add value? Hey, there's this little thing called ethics. Oftentimes, data scientists in an organization are the ones who have the most training on ethical decision making. It's just a fact. If you look at an engineering curriculum, at least when I did my undergrad computer science curriculum, there was one one-hour class on ethics in which we learned about utilitarianism and deontology. It was not exactly a useful course. Data scientists have oftentimes in graduate school worked with real human subjects. If you have, great. If you haven't, you need to learn about this because almost certainly somewhere in your value chain there's humans. Check out the Menlo Report for really great principles on ethics: respect for persons, beneficence, justice, and respect for law and public interest. This is something that as a data organization, you can add a lot of value to your company by making sure that your experiments, your prioritization, your goaling are always being done from an ethical perspective.
Wrapping up
Wrapping up, the future of data science is the future of business. There is no difference. You as data scientists add value by making every part of that business data informed. That is our competitive advantage, and that is the value other than another freaking model in production that you are adding to the business.
My two little pitches, if you haven't read these two books, read these two books. If you don't know how to, if you've never taken formal negotiation coursework, learn it. These are the two most important books that I can suggest to you right now because in all of these things that I've discussed, how the data science teams add value, it's through communication, negotiation, and influence, and these will be superpowers.
And last but not least, hey, fist bumps, all right? There's a global epidemic right here, so no high fives, no handshaking, fist bumps against Wuhan coronavirus. So thank you for your time. I am hiring. I'm looking for data scientists in the Bay Area, Seattle, and Zurich. Feel free to email me at iorino at gmail.com. Thank you.
Q&A
Thank you, Eduardo. We have time for one question, and it's the most popular one on this list, and it's coming from Concerned Data Scientist. What are the fundamental differences between a regular organization and an engineering organization?
So what does an engineering organization do? Engineering is the praxis of how, right? Engineering cares about how things are done, not why they are done, right? And so what is an engineering organization? An engineering organization cares deeply about how and cares deeply about the creation of artifacts that themselves have some life afterwards, whether it be a piece of code, and it's usually a piece of code, or a piece of hardware. That's one of the fundamental differences. At the end of the day, you want to be a data scientist. My guess is you want to ask questions about causality. You want to ask why questions. You want to understand deeper. If you're in an engineering organization, it's fine, but expect to spend your life answering how questions. Thank you very much. Thank you, everyone, for attending this segment.
