
How Can Data Teams Get Out of Their Own Way (Alex Gold) - KNN Ep. 196
Today, I had the pleasure of interviewing Alex Gold. Alex is the Director of Solutions Engineering + Support at Posit and author of DevOps for Data Science. In this episode, we talk about Alex's experience working in data for presidential campaigns, his experience in solutions engineering, and the role that DevOps plays in the data domain. Book: https://www.routledge.com/DevOps-for-Data-Science/Gold/p/book/9781003213345 (You can use Code AFLY04 for a 20% discount) Alex's Links: LinkedIn - https://www.linkedin.com/in/alexkgold/ Twitter - https://x.com/alexkgold Podcast Sponsors, Affiliates, and Partners: - Pathrise - http://pathrise.com/KenJee | Career mentorship for job applicants (Free till you land a job) - Taro - http://jointaro.com/r/kenj308 (20% discount) | Career mentorship if you already have a job - 365 Data Science (57% discount) - https://365datascience.pxf.io/P0jbBY | Learn data science today - Interview Query (10% discount) - https://www.interviewquery.com/?ref=kenjee | Interview prep questions Listen to Ken's Nearest Neighbors on all the main podcast platforms! On Apple Podcasts: https://podcasts.apple.com/us/podcast/kens-nearest-neighbors/id1538368692 (Please rate if you enjoy it!) On Spotify: https://open.spotify.com/show/7fJsuxiZl4TS1hqPUmDFbl On Google: https://podcasts.google.com/feed/aHR0cHM6Ly9mZWVkcy5idXp6c3Byb3V0LmNvbS8xNDMwMDQxLnJzcw?sa=X&ved=0CAMQ4aUDahcKEwjQ2bGBhfbsAhUAAAAAHQAAAAAQAQ MORE DATA SCIENCE CONTENT HERE: My Twitter - https://twitter.com/KenJee_DS LinkedIn - https://www.linkedin.com/in/kenjee/ Kaggle - https://www.kaggle.com/kenjee Medium Articles - https://medium.com/@kenneth.b.jee Github - https://github.com/PlayingNumbers My Sports Blog - https://www.playingnumbers.com 66DaysOfData Discord Server - https://discord.com/invite/4p37sy5muZ
Transcript
This transcript was generated automatically and may contain errors.
My first actual data science job was literally the result of waking up from a dream. How does that lead you to where you are now? I had an experience that I think a lot of data scientists have, which is getting sort of inexorably dragged down the stack. What I wanted to do was do all this cool data science that would like make a difference, but what it kept turning out I had to do was make the data better. Are there any key factors for improving voter turnout? Turns out that working for political campaigns is like really exhausting because everything is on fire all the time. It's an interesting data problem. How do you fund open source software? And there have been a lot of different models to try and fund open source software. And the reality is, unfortunately, that most of them kind of suck. What do you think the state of the language war is now? I don't understand why people like put up with Jupyter Notebooks. Like y'all can have nice things too.
Today, I have the pleasure of interviewing Alex Gold. So Alex recently published the book DevOps for Data Science. And in this episode, we're going to dive into a little bit of what DevOps is, what it isn't, and how it might be different from MLOps, which we've talked at length about on the podcast. Alex also has had a very interesting career. He's currently the director of solutions engineering at Posit. And we're going to ask some questions about what solutions engineering is, who might be a good fit for those types of roles, and what opportunities are available there.
We'll also dive into some of the details on the language wars. Obviously, Alex works at Posit. So he has a little bit of a strong take on the R versus Python debate. I really enjoyed this conversation with Alex, and I'm sure you will as well.
Alex's background and path into data
Alex, welcome to the Ken's Nearest Neighbors podcast. Thank you so much for coming on. So we met through some mutual friends at Posit. Obviously, I've worked with you guys a lot in the past, and you have a book coming out. I wanted to learn a little bit more, one, about your story, and two, a little bit more about the overarching process and your book in general. I'd love to first start with a little bit of background on yourself. Maybe let's dive into where you first got interested in data.
So my background, my undergrad was in math and econ. I went to Wesleyan University. They don't have minors. At least they didn't at the time. But so that's what I did. And the thing I thought was really cool about economics, and the reason I really loved it, is that it was numbers, it was math, but it was about the real world, right? It was not just sort of math for its own sake. It was really, you know, related to things that people care about in their day-to-day lives.
And so I spent the first few years of my career working in sort of the economic policy world in Washington, D.C. I worked at a couple different think tanks and found myself, I did some data-related stuff, some just, like, general, you know, qualitative policy work. Found myself enjoying the data part much more. And then, you know, a few years down the road, I started an econ PhD program and dropped out. And then a few years later, ended up going sort of all the way into the data world. And, you know, it was really more of the same. It was numbers, it was math, it was statistics, but it was about the real world.
We had a very similar experience there. I mean, in college, I remember taking my first econ course, and I was honestly a very bad student until that point. And it all just sort of made sense. I realized that I could see the world through trends, through graphs, and I could start to understand it in a very different way. And I remember how powerful that was that sort of opened my mind up to the possibilities of math or statistics being an applied thing rather than just a theoretical thing, which was super, super cool.
Yeah, absolutely. And I think what's really interesting about data science as a field is that I think that's the story of a lot of us who are in data science or who got into data science. I mean, these days, there are a lot of data science programs, which I think is awesome. But, you know, when I was in school, that didn't exist. And so everybody who sort of ended up in data science was coming from somewhere else. And that's something that I think is really cool about this field is just how it sort of is this marriage of a lot of really, you know, interesting statistics, machine learning, computer science kind of stuff. And then, like, what can you do with this in the real world?
Political data science and the voter file
So my sort of policy world led me into data science. And my first data science job was a political job, doing work on political campaigns. Then I had another role after that in data science, more like generic data science. I was working for a big consulting firm, did some work with hospital systems, did some work with the Social Security Administration. And that was, you know, sort of general data science. I ended up leading a team in that role.
And then from there, you know, I had obviously been spending a lot of time using R, getting excited about the power of programming for data science, right? But I had an experience that I think a lot of people, a lot of data scientists have, which is getting sort of inexorably dragged down the stack. That like, what I wanted to do was do all this cool data science that would like make a difference, you know, but what it kept turning out I had to do was like, make the data better, right? And then eventually you end up like managing a database. And at some point I found myself managing an RStudio server because like somebody needed to do it.
And so what I found was I actually kind of enjoyed that part of it, right? I enjoyed learning about how to like administer a Linux server. I enjoyed learning about how sort of database administration worked. Then what happened honestly is I got connected to some folks at then RStudio, now Posit, and learned a little bit about solutions engineering and made the hop over to then RStudio about five and a half years ago. And so spent, you know, about a year and a half being a solutions engineer and happy to talk more about sort of what that means, right?
I'll just share like my first actual data science job was literally the result of waking up from a dream that I had. I like woke up, I'd had a dream that I was going to work for, it was 2015, fall of 2015. I had a dream that I was working for the Hillary campaign. I was like, okay, I'm going to go do political data science stuff. And that was just a complete, it was a great left turn, but it was a total left turn for my career. And it was literally the result of a random dream that I had. I just like woke up and I was like, oh, what if I actually did that?
You know, the more people I talk to on the podcast, the more that that really hits home, including in my life. I mean, nothing really makes sense until afterwards when you look back at it and you put it all together and going through it, we're all kind of just fumbling through the dark and trying to figure it all out.
I think that it is election season right now. And I don't particularly want to get into any politics, but I'm interested in the data surrounding the politics and like what goes into those types of things that maybe the traditional, the layman isn't familiar with.
So I worked in both policy and politics, which from like, if you're outside of it, you're like, those are the same thing, right? But like, fundamentally, they are quite different, right? Policy is about what is the government going to do for people or not do, right? And then politics is about who wins elections. And so while they are related, obviously, you only get to do policy if you win elections.
So in the policy world, I did some data work, for example, where we were using some longitudinal surveys of basically the economic success achieved by children. So we were trying to do this like very complex modeling. If anybody's familiar with the National Longitudinal Survey of Youth, the NLSY surveys, these are what we used. And we built this model that, at least in theory, created a synthetic data set of people from age zero to 40. Now, there's no actual data set that follows people from zero to 40. And so we took these several different data sets and stitched them together with a variety of different sort of like matching, statistical matching techniques and other kinds of inference techniques.
And for what it's worth, I did this all in Stata, which was a real... I know there are people out there who think like R is a difficult language to work with. You should go have some fun with Stata, that's a bear.
So that was what I did on the policy side. There are interesting statistical problems. For the most part, a lot of data on the policy side is smallish data problems, right? They're longitudinal surveys, which are very expensive to collect.
In politics, there are a variety of different sources of data. One of the biggest ones is something called the voter file, which people who are not in the political world might be sort of horrified to know exists, but it's a standard thing. Each secretary of state for every state keeps a voter file, which is a list of all the people who are registered voters in that state and when they have voted. So as we like to say to people, who you vote for is private, but whether you voted is a matter of public record.
And so there's a lot of work in politics done off of these voter files. And so these data files are compiled by state secretaries of state. There are then companies that go around and gather the data from all 50 states, because actually it's like a pain in the ass to homogenize 50 different state voter files.
The role I was in, we were doing voter outreach experiments. And so we were looking at things like, does this message or that message make people more likely to go out to vote? And so the kind of things we would do is before the election, we would take the voter file, we would randomize it. And some people would get this piece of mail. Some people would get that piece of mail. Some people would get nothing or this text or that text. And then after the election, we would gather back up the voter files and see who voted and who didn't. And so they were literal like field experiments on voter outreach. This is sort of the gold standard, obviously, because it's an actual randomized controlled trial. But there's just a ton of data stuff going on in politics these days.
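As a rough illustration of the experiment design described here (randomize the voter file, send each group a different message, then compare turnout afterward), here is a minimal sketch in Python. The function names, arm labels, voter IDs, and turnout data are all hypothetical and made up for illustration; a real analysis would also need significance testing, which this skips.

```python
import random


def assign_treatments(voter_ids, arms, seed=0):
    """Randomly assign each voter to one arm, in near-equal group sizes."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = list(voter_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled voters out to the arms like cards.
    return {vid: arms[i % len(arms)] for i, vid in enumerate(shuffled)}


def turnout_by_arm(assignment, voted):
    """Turnout rate per arm; `voted` is the set of voter IDs who turned out."""
    counts = {}
    for vid, arm in assignment.items():
        total, turned_out = counts.get(arm, (0, 0))
        counts[arm] = (total + 1, turned_out + (vid in voted))
    return {arm: turned_out / total for arm, (total, turned_out) in counts.items()}


# Hypothetical toy voter file and treatment arms.
voter_ids = range(1, 13)
arms = ["control", "mailer_a", "mailer_b"]
assignment = assign_treatments(voter_ids, arms, seed=42)

# Made-up post-election record of who voted (whether, not for whom).
voted = {1, 2, 3, 5, 8}
rates = turnout_by_arm(assignment, voted)
```

Because the assignment is random, a difference in turnout rate between an arm and the control group can be attributed to the message rather than to who happened to receive it.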
I mean, that makes a ton of sense to me. You think about a lot of the campaigns that are run about getting people to vote. I would imagine it's a lot easier to get someone to vote who's aligned with a specific party than to convert someone to vote for a different candidate. And it's an interesting data problem of, hey, let's invest money in just finding people and getting them to vote versus trying to change opinion, which is maybe a little bit sad.
The organization I worked for was aligned firmly with progressive causes and candidates. And so if that's the side you're on, what you're trying to do is you are trying to increase the vote margin for progressive candidates. And there are basically three things you could do. You can make people who were going to vote for a Republican vote for a Democrat instead. You can make more Democrats or people who are going to vote for Democrats, at least because very few people actually register these days. They say they're independents, but almost nobody's actually independent. And so you take people who are going to vote Democratic and you get them to come out, or you get the Republicans to stay home. Those are the three things you can do.
There is a really active, I think, debate, honestly, going on about whether persuasion, which is getting people to go to the other side, or straight turnout stuff is more useful. It definitely varies election by election. The other thing that's really interesting is, obviously, you can do a lot more in a smaller election, right? So if you're talking about in 2015, 2016, obviously, Hillary-Trump was the election. And most people knew a lot about that election. And there really is only so much you can do to affect that.
But when you're talking about down-ballot races, when you're talking about a state senate race, or the district attorney in Chicago, those are races, actually, where this kind of paid media can have a much bigger effect, because people aren't paying as close attention. Their minds aren't made up ahead of time. They're more persuadable. And they might not be planning to vote at all unless they learn something about the candidates or are sort of incentivized to come out for it.
Key factors in voter turnout
So this might be an impossible-to-answer question, and it might be highly variable, depending where you are, markets, whatever it is. But are there any key factors for improving voter turnout that are, you know, just, like, hugely correlated with improving?
Yeah, so one of the – there is this line of research called social pressure, which basically is, like, we are – we, as humans, are very susceptible to what other people are doing. And so – and again, I kind of got out of this world in 2016, so this was the state of the art in, like, 2016, which is now quite old. But at least at that point, people – this had sort of been fully internalized, but, like, everybody was using these sort of social pressure techniques, which basically are, like, every – you know, three out of four of your neighbors voted in this last election. Are you going to vote in this election? And it's sort of this sense that, like, everybody is voting. It is your civic duty. You should go out and vote. And that – at least among any of the sort of get-out-and-vote messages that people tried, that one is a real winner, you know, showing people's – you know, like, showing your voter record versus your neighbors. And, like, your neighbors are more consistent voters. You should probably get out there and vote. People respond pretty strongly to that one.
It turns out that working for political campaigns is, like, really exhausting because everything is on fire all the time. And I was very happy to move on from – I'm really glad I did it, but I was also very happy to, like, move on out of that world. Also, the reality for me, at least, is that I enjoyed the work. I did not find that the work of being, you know, trying to win elections was work that I found compelling enough to overcome the fact that it was a – it was a pretty grueling pace, especially, you know, in cycle.
Solutions engineering at Posit
I think that that is a really good segue to compare maybe your data science work to what a solutions engineer does. Because I think it's – most people are very familiar with what data work is. How does that differ from maybe a traditional solutions engineer role?
So, I mean, it's interesting, right? When I came to RStudio, then Posit, right? For people who are in the R community, I think it's easy to imagine, at least at the time, I think this is maybe less true than it was back then. But, like, at the time, it was, like, easy to imagine that all of RStudio was just, like, Hadley and JJ and, like, Winston and Joe hanging out and, like, writing open source packages. That was what RStudio was, right? It was, like, the RStudio IDE and Shiny and, like, that was kind of what we did. You know, and then I started working there, right? It's a software company, right? We are a software company.
Because, you know, I think underlying the way Posit works is this question of how do you fund open source software? And there have been a lot of different models to try and fund open source software. And the reality is, unfortunately, that most of them kind of suck. Like, you know, you can get out there and you can do grants. You can try and find a funder. You can do, like, freemium stuff. You can do services on top of open source software. And some of those work and can be sustainable. A lot of them are just not that sustainable.
And so the model that Posit has adopted, right, is we have many engineers working on free and open source stuff that we provide to the community, which is really cool. But then we need to pay for that. And so what we do is we have, right, professional products that we believe provide value over and above the open source software. And in particular, the way we think about it at Posit is, you know, we really give away all the stuff that an individual data scientist needs to be successful on their own. Where we charge for software is the things that teams and organizations need to adopt and succeed with open source at scale.
And that's sort of the solutions engineering role: it's this really cool role where we work with Posit prospects, people who are considering our professional products, or customers who have purchased them. And we help them figure out how do the Posit products fit in to the ecosystem of other data things they are doing, right? These days, nobody has one data product. And so we help them understand where does Posit fit into the stack that they already have. And then we help them implement it, right? We help them figure out how do I make authentication work?
And so that's really the solutions engineering role is this really cool, it's a very technical role, but also like you work with people all day, every day. And I think it's a really cool role for people who, you know, really enjoy technical stuff, but also enjoy working with people, enjoy talking to people, enjoy explaining things to people.
So that was sort of a meandering path to get there. But I want to set some context around, like, what does a solutions engineer do? Because it's a foreign role to a data scientist because it's really a role that comes out of being a software company, not out of doing data science.
I would say it's, it really depends on the role, right? We have people who are doing more engagement with our sales team, and those people are closer to sales, right? Most of what they are doing is they're putting together demos. They're putting together proof of concepts. They're helping organizations try the Posit software and figure out if it works for them. So it's still a very technical role, but ultimately the job is to help that company figure out, like, yes, Posit is the right choice for me. I'm going to pay you for the software.
Then there are folks who are more on the post-sales side. And that is more of a, you know, enhancing adoption, making sure that the platform is really great. You need the data science to be able to understand what customers are trying to do with our product. So they, when they say like, yeah, I'm trying to deploy this model. And like, I want to like have it run when this Quarto doc runs and like push it out to the system. Like you have to understand what all of that means.
And so the day-to-day work is not a ton of data science, but a lot of us on the team have this background in data science because it's what we're talking about all day, every day. We're helping data scientists articulate to the IT admins, hey, here's why we need to set this up in this way, because like, that's how I'm able to do my work. And then we talk to the IT admins to help them translate with the data science folks and sort of play this role of, you know, a little bit of translation. Sometimes there's a little bit of therapy in there too, depending on the organization.
You know, the other thing that I think is really cool about the way we do it at Posit at least is that the solutions engineering role is not, we're not just customer facing, right? We also, because we are in this unique position of being, you know, very technical, very knowledgeable about the products, very knowledgeable about data science, and we're spending all this time with customers. We just know a lot about what customers are trying to do, what they need, what their pain is. And so we really think a lot about the responsibility to like take that and help other people, other people at Posit, right? Bring that back into the product, bring that back into our documentation, bring that back to the open source engineers and the professional product engineers.
That's such a cool feedback loop. I don't think a lot of people really think about that as much.
Like people who succeed at Posit in this role are excited about having a big job with a lot of freedom where they just have the ability, they are asked to go try and, you know, improve the products based on what they've learned from talking to customers. I think people who thrive in the role find that exciting. I think everybody finds it terrifying when they start. I certainly did. But, you know, people who thrive in the role find it both terrifying, but also invigorating and not paralyzing.
The language wars: R vs Python
What do you think the state of the language war is now?
Yeah, I really think that we are in a spot where what language you choose to use, I mean, between Python and R, right? Like if you want to go write some C++, like you're doing a different thing entirely. But between Python and R, I really feel at this point, like a lot of it is style and background. Like I think personally, I really love R. That's the background I come from, right? I come from a social science background. I did two semesters of computer science in college, but like I'm not a computer scientist. And so like for me, R makes a lot of sense in the way it treats data. I really like the interface. I think the tidyverse is awesome. I grew up using RStudio and learned a lot there.
I think there are some libraries in Python that are better. There's some libraries in R that are better. These days, I think that's the reason, like that's why you would choose one or the other, or you're on a team that has standardized on one or the other. The degree of interoperability, especially like if you look at like Quarto for creating interactive documents, you can go like one code chunk to the next, different languages, fine, no problem, right?
You know, I think recently, right, we released, is it now beta? The Positron IDE, which is a new IDE that sort of is growing off of what, you know, JJ and folks learned about building RStudio and creating a, you know, from the ground up multilingual data science thing. And honestly, for me, and I know Python people feel differently. For me, the main reason I didn't want to do Python is like, I don't understand why people like put up with Jupyter notebooks. Like y'all can have nice things too. Like, I just think the RStudio IDE is so great. I've always loved it.
So when I started writing in Python, I actually use Anaconda, which is like very, very similar to RStudio. And then people made fun of me for that. So I started using VS Code and I use VS Code now. And I think VS Code, you have to install a lot of plugins to get the environment right. But the funny thing is, is my VS Code setup runs very similar to how my Anaconda environment was. It wasn't Anaconda, it was Spyder, Spyder not Anaconda, sorry, sorry. Spyder is very similar.
Have you tried Positron yet? I mean, so it's fundamentally a, like it's a VS Code. But it, for people who are comfortable in VS Code, it really like pulls downstream of that, but builds off of, right, the idea of a four-pane data science IDE, which like, same as Spyder or RStudio, but sort of modernized and based off of VS Code.
So anyway, I mean, that was, so to go back to your original question, like, you know, to me, it's just like, it's more, and the other thing I didn't mention, right, is obviously like Arrow and other, you know, storage formats that allow basically symmetric access from R and Python. And so you can just like save in Parquet or whatever. And like, it's fine, right? Like it's actually nice, right? It's not like saving in CSV where like, then you import and everything kind of, you know, you got to like recast all your dates and stuff, you know, but like, it like actually works.
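To make the CSV complaint concrete, here is a small sketch of the recasting step using only the Python standard library. The column names and rows are hypothetical. It deliberately shows only the CSV side: writing Parquet needs a third-party library, so that half is left out, but the point of a typed format is that the schema travels with the data and the final recasting step is not needed.

```python
import csv
import io
from datetime import date

# Hypothetical rows with a real integer and a real date.
rows = [
    {"id": 1, "event_date": date(2016, 11, 8)},
    {"id": 2, "event_date": date(2020, 11, 3)},
]

# Write to an in-memory CSV file.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["id", "event_date"])
writer.writeheader()
writer.writerows(rows)

# Read it back: CSV carries no type information, so every value is a string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))

# The manual recasting step that an untyped format forces on the reader.
recast = [
    {"id": int(r["id"]), "event_date": date.fromisoformat(r["event_date"])}
    for r in loaded
]
```

After the round trip, `loaded` holds strings such as `"2016-11-08"`, and only the explicit recast restores the original types.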
And so I think like, increasingly, it's an interface level, the language is an interface level. And like, I think that's going to become even more true as LLMs become more and more capable in this regard that it really is going to come down to some combination of personal preference. And like, there's this or that package that happens to work better, right?
LLMs and the future of junior data scientists
I'm actually a little concerned on that front because the way these LLMs work, right, they are good at a lot of basic tasks. And I feel like you still need an expert to look at it and make sure it's doing what you think it's doing. And I do wonder if like, it introduces this at least question of like, how do junior people get their start? If like, if you're able to do a lot of the stuff that a junior data scientist would do, if an LLM can kind of do that, how do you get over that hump to like being able to then review what the LLM wrote and really like really knowing like, yep, what it wrote, like that was what I meant or like, oh shoot, that was not really what I meant. Like, try again, Claude.
That feels like an unsolved problem to me of like, it's cool that it can do some of these different kinds of tasks, but it still feels to me like there's a need for expert eyes on it to make sure it's doing what you want. And I'm not sure how you get that exposure in a world where like LLMs are, you know, not today, but like in five years kind of territory.
So I think as they are right now, agree 100%. Basically, a lot of the stuff that it produces is wrong, or it doesn't work, or it's like, I had an issue where I was writing some code. And when I ran it on my machine, I just got different results. But I'm a little less concerned because I think that it allows people to iterate faster and see dramatically more code. And that evens out eventually, because there are going to be bugs in the code, you're going to be getting bad results. And like, frankly, you're not going to have a job if you're producing bad results.
And so inevitably, I think sort of the market corrects itself that, hey, after you make a couple of mistakes, you start scrutinizing it really carefully. And then you can actually use it as a tool for debugging itself in the sense that like, hey, like, why is it this way? Why did you choose that rather than this decision? Are there other ways to do this? I think I believe a lot of those problems can be solved with good prompting. And by good prompting, I mean, asking questions of what is being built and having it break itself down.
I think one of the things I'm concerned about, and we got like, what, we're like 40 minutes in. This is the first time I'm mentioning my book, DevOps for Data Science. But like, in the book, right, one of the things I talk about is that, you know, like, if you're doing regular software engineering, there's basically like, one kind of correctness you care about. It's basically like, does it run in a performant enough way? But like, when you're talking about data science, there's this second sense in which it needs to work, which is like, the answers need to be correct. And they need to be what you meant them to be, right?
And like, the problem with that is that unlike the way you can write unit tests for a software stack, it's very difficult to write testing for actual results in data science, because usually, like, you wouldn't be doing it if you knew what the answer was. And you could just run it and make sure you got the right answer. It is still scary to me that probably you're going to find out those mistakes by like, putting them in front of somebody who is an expert. And they'll be like, this, this number makes, I mean, we've all been, right? Like, this number makes no sense.
But, you know, there's a second level of correctness that you care about in data science that isn't there in pure software engineering that I think, you know, makes that sort of self-correcting piece a little scarier, I think, for a data scientist, whereas like, if you're just writing code, if it doesn't run, it doesn't run. But like, you know, in data science, like your data merge can work in some sense of the word, but completely screw up your data along the way. And you may not realize that for a long time.
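A small sketch of the kind of merge that "works" while quietly corrupting the data: a join against a lookup table that contains a duplicate key. Everything here (the table names, the rows, the `left_join` and `assert_unique_key` helpers) is hypothetical and written in plain Python for illustration; it is not taken from the book.

```python
def left_join(left, right, key):
    """Naive left join: one output row per matching pair of rows."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        # Unmatched left rows are kept once, with no extra columns.
        for match in index.get(row[key], [{}]):
            joined.append({**row, **match})
    return joined


def assert_unique_key(rows, key):
    """Raise if any key value appears more than once."""
    keys = [row[key] for row in rows]
    duplicates = sorted({k for k in keys if keys.count(k) > 1})
    if duplicates:
        raise ValueError(f"duplicate values for {key!r}: {duplicates}")


# Hypothetical data: customer 1 appears twice in the lookup table.
orders = [
    {"customer_id": 1, "amount": 100},
    {"customer_id": 2, "amount": 50},
]
customers = [
    {"customer_id": 1, "region": "east"},
    {"customer_id": 1, "region": "east"},  # accidental duplicate
    {"customer_id": 2, "region": "west"},
]

# The join runs without error, but customer 1's order is now counted twice.
joined = left_join(orders, customers, "customer_id")
total_before = sum(o["amount"] for o in orders)
total_after = sum(r["amount"] for r in joined)
```

Nothing fails at run time, yet the revenue total changes across the join. Calling `assert_unique_key(customers, "customer_id")` before joining, or comparing row counts and totals before and after, is one cheap way to catch this class of error without knowing the final answer in advance.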
DevOps for data science vs MLOps
So, you know, talking about the book, talking about the nature of, I guess, like DevOps in the data science space. Can you explain maybe what that actually means and how that might be different from, for example, MLOps, which is talked about quite a bit?
So, you know, fundamentally DevOps is a set of practices, procedures, and tools, sort of, right? So here's like, I'm going to digress here and tell a little history, right? So you're like in the nineties and, um, right, at least according to these stories, which I believe are completely apocryphal, everybody is writing their software using the waterfall method where you like spend a year gathering requirements. You spend two years writing software and then voila, it doesn't do what you thought, right? Like that's the like waterfall story. I'm somewhat doubtful that that ever actually happened, but like, that's the story.
And so then along comes the rise of agile, right? The agile software movement, which is like this idea that you're going to, instead of trying to like do a year of requirements gathering and two years of building, you're going to build small increments. You're going to deliver quickly. And like, fundamentally, while agile is a development method, it's also a method of like checking that you're doing the right thing, right? Because you're frequently going back to your customer, whether that's a customer-customer or an internal customer, and you're saying like, does this do what you want it to do?
But there's a problem, which is you can write the code that does, you know, creates this thing. But like, how do you then deliver it, right? You have to put it into production somehow to get it in front of those people. And if you're doing this right before you did that, like once, right? But now, you know, like if you're, if you're talking about a major code base, like Facebook, right, like, they push, I think, like it's thousands of updates a week to their code base, and they go live on the site, like they're constantly just pushing updates.
So DevOps is this system of tools, procedures, processes, so that you can build the software to make it easy to put into production and put things into production in a way that they're going to be safe, secure, observable. You can understand what happens when they go wrong, recreate the error and try and fix it on the code side. And so it's basically trying to bring closer together the development, the dev, and the day-to-day running of the software, the ops, right? So that's, that's where it comes from. It's a sort of pure software engineering kind of concept.
And so, you know, as might not be surprising, given how I described what the solutions engineer role is, like, that's what I spend my time discussing with our customers. Like, how do you take data science and put it into production? How do you make it production grade? How do you like write code that is ready for production? Then how do you actually put it into production? And a lot of that is lessons from DevOps or an interpretation of DevOps. And, you know, it's tempting, I think, for people who are software engineers to be like, you just do DevOps, but you should just do it in R or Python. And I think that's really wrong. Like, I think fundamentally data science is a different thing than software engineering.
Like the analogy I draw is like, if software engineering is like architecture, data science is like archeology. Like they're, you're doing something in both cases, but they're very different. And the way you think about building software, and if you're doing data science, like you're building software, even if you don't like it, or if you're bad at it, like you're building software.
And so that's really what the book was, was a reaction to was observing that people knew a lot about data science, about, you know, machine learning, about, you know, statistics about, you know, sort of like graphical design and these sorts of things. But then they got really stumped when it came time to put it actually into production. And so the book is sort of split into three parts, where the first part is really about, like, how do you write your code, your R or Python code in a way that makes it easy to put it into production when the time comes?
The second section of the book is the part that I really didn't want to write, but I kind of felt like I had to, which is like, if you have to manage the server that's hosting all this, like, how do you, how do you do that? How do you manage a server? How do you manage a database? What the hell is up with Docker? Like, you know, those, those kinds of questions, which I think a lot of us end up facing. I think most data scientists at some point in their career have to answer these questions, but are really lost about where to start.
And then the last section of the book is about, okay, you're working at a big company that has an IT admin organization. It's a sophisticated IT admin organization. What concerns are they going to have about the work you're doing? And how do you communicate with them about the work that you need to do, make sure that they understand what you need from them, and give them what they need from you, so that you can create a great environment to do data science?
So DevOps for data science, it probably encompasses more of the pipeline than just MLOps.
That's right. You did ask about that. So to me, I mean, MLOps is such an interesting, like to me, MLOps is just this like, and I don't know, there's been so much hype around it, but like it's this tiny narrow slice of the pie, which is like, once you've built a machine learning model, how do you serve that machine learning model? And then how do you, you know, do things like horse racing them? And I'm being a little like glib about it, but to me, it's such a small part of the work you need to do as a data scientist. And at least in my observation, there are organizations that have sophisticated MLOps needs, but they're much rarer than organizations that like have a dashboard in R or, you know, some sort of image processing pipeline in Python. And they need to somehow productionize that.
I mean, unless, you know, if you work at Netflix, right? Like there are just straight ML engineers at Netflix and that's awesome if you can get that job, but like most data scientists are doing a lot more than just building machine learning models. And particularly if you want to deliver value, there's so many other things you have to do than just building and serving machine learning models.
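The "horse racing" of models mentioned above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not a description of any particular MLOps tool: the two "models" are plain functions, the holdout data is invented, and the names `mean_abs_error` and `horse_race` are hypothetical.

```python
def mean_abs_error(model, rows):
    """Average absolute gap between the model's prediction and the true value."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)


def horse_race(models, holdout):
    """Score every candidate on the same holdout rows and pick the lowest error."""
    scores = {name: mean_abs_error(fn, holdout) for name, fn in models.items()}
    winner = min(scores, key=scores.get)
    return winner, scores


# Hypothetical candidates: the model currently in production and a proposed replacement.
models = {
    "champion": lambda x: 2 * x,
    "challenger": lambda x: 2 * x + 1,
}

# Invented holdout rows of (input, true value).
holdout = [(1, 3), (2, 5), (3, 7)]

winner, scores = horse_race(models, holdout)
```

A real setup would also have to decide which metric matters, how fresh the holdout data is, and how the winner gets promoted to serving, which is where the tooling comes in.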
The most overlooked aspect of DevOps for data science
Yeah, I mean, this is the like unsexy answer, but to me, it's the people. It's that fundamentally what you're doing is you are a person who is working with other people to try and do something that is valuable to your organization or to the world. And so, you know, there are tools, there are systems, but fundamentally what you're trying to do is to work with other people to get something, hopefully something cool, done.
And so to me, and I see this particularly from junior folks, right? When I, my first job managing a data science team, I was working with some folks who were just a couple of years out of college and they would be doing a lot of different kinds of work and they'd be like, when do we get to do the data science? Like, when do we get to build the models? And I'd be like, well, if what we want to do is provide value to our organization, ultimately that's what matters. We have to be focused on understanding what is valuable to the organization, what is valuable to the people in the organization. And then there's a lot of like relationship building to get that done.
And so, you know, that's why my favorite section of the book by far is the third section of the book, which does not have any code in it. It's all just like helping people understand what it is that IT admins are concerned about when they're talking about security, like, what do they care about when they're talking about authentication? What do they care about when they're talking about stability? What do they actually care about? Because to me, that's when data scientists can have the best outcomes, is when the data scientists are able to do data science and when they can communicate effectively with other people who are concerned about all these sort of production-y things. But you need to be able to communicate with each other and need to understand what those people care about, like what keeps them up at night. It's probably not, you know, did you add that extra feature to your model? That's probably not the thing that keeps them up at night.
Although it might be what keeps you up at night as a data scientist and communicating about that I think is the easiest part to overlook because it's right, like people go, they're technical people. We want technical solutions. We like the elegance of, you know, APIs fitting together like puzzle pieces and like, but then you have to like go talk to the person who's going to like put the API into your like, you know, production system. And that, that's a people problem. And that's always going to be mushier than just writing a bunch of code.
I like that answer. It's a little harder to write or research or talk about the mushier side of things. And I'm excited that your book connects those two so elegantly.
Where can people learn more about it?
Where can people learn more about and hear more from you?
So the entire book is available for free online, do4ds.com, d-o number four, d-s.com. Entire thing is available for free there. You can also buy print or EPUB versions on Amazon or wherever you buy books. So if you want a print copy or an EPUB version, you can do that. I occasionally blog at alexkgold.space. I've written a bunch there about management, if that's something that you care about. I wrote about why I dropped out of a PhD program, if you're thinking about that. That's on my personal blog. And then, you know, if you want to reach out and chat with me, it used to be Twitter, but I don't think that's true anymore. I think it's probably LinkedIn. I'm Alex K Gold on LinkedIn.
I will link all of those in the description and in the show notes as well.
