How Can Data Teams Get Out of Their Own Way (Alex Gold)

Transcript

This transcript was generated automatically and may contain errors.

My first actual data science job was literally the result of waking up from a dream. How does that lead you to where you are now? I had an experience that I think a lot of data scientists have, which is getting sort of inexorably dragged down the stack. What I wanted to do was do all this cool data science that would like make a difference, but what it kept turning out I had to do was make the data better. Are there any key factors for improving voter turnout? Turns out that working for political campaigns is like really exhausting because everything is on fire all the time. It's an interesting data problem. How do you fund open source software? And there have been a lot of different models to try and fund open source software. And the reality is, unfortunately, that most of them kind of suck. What do you think the state of the language war is now? I don't understand why people like put up with Jupyter Notebooks. Like y'all can have nice things too.

Today, I have the pleasure of interviewing Alex Gold. So Alex recently published the book DevOps for Data Science. And in this episode, we're going to dive into a little bit of what DevOps is, what it isn't, and how it might be different from MLOps, which we've talked at length about on the podcast. Alex also has had a very interesting career. He's currently the director of solutions engineering at Posit. And we're going to ask some questions about what solutions engineering is, who might be a good fit for those types of roles, and what opportunities are available there.

We'll also dive into some of the details on the language wars. Obviously, Alex works at Posit. So he has a little bit of a strong take on the R versus Python debate. I really enjoyed this conversation with Alex, and I'm sure you will as well.

Alex's background and path into data

Alex, welcome to the Ken's Nearest Neighbors podcast. Thank you so much for coming on. So we met through some mutual friends at Posit. Obviously, I've worked with you guys a lot in the past, and you have a book coming out. I wanted to learn a little bit more, one, about your story, and two, a little bit more about the overarching process and your book in general. I'd love to first start with a little bit of background on yourself. Maybe let's dive into where you first got interested in data.

So my background, my undergrad was in math and econ. I went to Wesleyan University. They don't have minors. At least they didn't at the time. But so that's what I did. And the thing I thought was really cool about economics, and the reason I really loved it, is that it was numbers, it was math, but it was about the real world, right? It was not just sort of math for its own sake. It was really, you know, related to things that people care about in their day-to-day lives.

And so I spent the first few years of my career working in sort of the economic policy world in Washington, D.C. I worked at a couple different think tanks and found myself, I did some data-related stuff, some just, like, general, you know, qualitative policy work. Found myself enjoying the data part much more. And then, you know, a few years down the road, I started an econ PhD program and dropped out. And then a few years later, ended up going sort of all the way into the data world. And, you know, it was really more of the same. It was numbers, it was math, it was statistics, but it was about the real world.

We had a very similar experience there. I mean, in college, I remember taking my first econ course, and I was honestly a very bad student until that point. And it all just sort of made sense. I realized that I could see the world through trends, through graphs, and I could start to understand it in a very different way. And I remember how powerful that was that sort of opened my mind up to the possibilities of math or statistics being an applied thing rather than just a theoretical thing, which was super, super cool.

Yeah, absolutely. And I think what's really interesting about data science as a field is that I think that's the story of a lot of us who are in data science or who got into data science. I mean, these days, there are a lot of data science programs, which I think is awesome. But, you know, when I was in school, that didn't exist. And so everybody who sort of ended up in data science was coming from somewhere else. And that's something that I think is really cool about this field is just how it sort of is this marriage of a lot of really, you know, interesting statistics, machine learning, computer science kind of stuff. And then, like, what can you do with this in the real world?

Political data science and the voter file

So my sort of policy world led me into data science. And my first data science job was a political job, doing work on political campaigns. Then I had another role after that in data science, more like generic data science. I was working for a big consulting firm, did some work with hospital systems, did some work with the Social Security Administration. And that was, you know, sort of general data science. I ended up leading a team in that role.

And then from there, you know, I had obviously been spending a lot of time using R, getting excited about the power of programming for data science, right? But I had an experience that I think a lot of people, a lot of data scientists have, which is getting sort of inexorably dragged down the stack. That like, what I wanted to do was do all this cool data science that would like make a difference, you know, but what it kept turning out I had to do was like, make the data better, right? And then eventually you end up like managing a database. And at some point I found myself managing an RStudio server because like somebody needed to do it.

And so what I found was I actually kind of enjoyed that part of it, right? I enjoyed learning about how to like administer a Linux server. I enjoyed learning about how sort of database administration worked. Then what happened honestly is I got connected to some folks at then RStudio, now Posit, and learned a little bit about solutions engineering and made the hop over to then RStudio about five and a half years ago. And so spent, you know, about a year and a half being a solutions engineer and happy to talk more about sort of what that means, right?

I'll just share like my first actual data science job was literally the result of waking up from a dream that I had. I like woke up, I'd had a dream that I was going to work for, it was 2015, fall of 2015. I had a dream that I was working for the Hillary campaign. I was like, okay, I'm going to go do political data science stuff. And that was just a complete, it was a great left turn, but it was a total left turn for my career. And it was literally the result of a random dream that I had. I just like woke up and I was like, oh, what if I actually did that?

You know, the more people I talk to on the podcast, the more that that really hits home, including in my life. I mean, nothing really makes sense until afterwards when you look back at it and you put it all together and going through it, we're all kind of just fumbling through the dark and trying to figure it all out.

I think that it is election season right now. And I don't particularly want to get into any politics, but I'm interested in the data surrounding the politics and like what goes into those types of things that maybe the traditional, the layman isn't familiar with.

So I worked in both policy and politics, which from like, if you're outside of it, you're like, those are the same thing, right? But like, fundamentally, they are quite different, right? Policy is about what is the government going to do for people or not do, right? And then politics is about who wins elections. And so while they are related, obviously, you only get to do policy if you win elections.

So in the policy world, I did some data work, for example, where we were using some longitudinal surveys of basically the economic success achieved by children. So we were trying to do this like very complex modeling. If anybody's familiar with the National Longitudinal Survey of Youth, the NLSY surveys, these are what we used. And we built this model that could, at least in theory, synthetic, we created a synthetic data set of people from age zero to 40. Now, there's no actual data set that follows people from zero to 40. And so we took these several different data sets and stitched them together with a variety of different sort of like matching, statistical matching techniques and other kinds of inference techniques.
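The stitching described here can be illustrated with a minimal sketch of nearest-neighbor statistical matching. This is not the model Alex's team built (that work was done in Stata and is not detailed in the conversation); the field names `parent_income`, `test_score`, and `earnings_at_40`, the toy records, and the helper `match_nearest` are all hypothetical. The idea is that two surveys share some variables, and each record in one survey borrows the later-life outcome of the most similar record in the other.

```python
def match_nearest(recipients, donors, shared, carry):
    """For each recipient, copy the `carry` fields from the donor that is
    closest on the `shared` fields (squared Euclidean distance)."""
    def distance(a, b):
        return sum((a[f] - b[f]) ** 2 for f in shared)

    stitched = []
    for rec in recipients:
        donor = min(donors, key=lambda d: distance(rec, d))
        stitched.append({**rec, **{f: donor[f] for f in carry}})
    return stitched


# Hypothetical childhood survey: covariates only, already scaled to 0-1.
children = [
    {"id": "c1", "parent_income": 0.2, "test_score": 0.9},
    {"id": "c2", "parent_income": 0.8, "test_score": 0.3},
]

# Hypothetical adult survey: same covariates plus an adult outcome.
adults = [
    {"parent_income": 0.25, "test_score": 0.85, "earnings_at_40": 72000},
    {"parent_income": 0.75, "test_score": 0.35, "earnings_at_40": 41000},
]

synthetic = match_nearest(
    children, adults,
    shared=["parent_income", "test_score"],
    carry=["earnings_at_40"],
)
```

Each synthetic record now spans childhood covariates and an adult outcome, even though no real person was followed that long. Real matching work would also need to handle survey weights, covariate scaling, and the uncertainty the matching introduces, none of which this sketch attempts.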

And for what it's worth, I did this all in Stata, which was a real... I know there are people out there who think like R is a difficult language to work with. You should go have some fun with Stata, that's a bear.

So that was what I did on the policy side. There are interesting statistical problems. For the most part, a lot of data on the policy side is smallish data problems, right? They're longitudinal surveys, which are very expensive to collect.

In politics, there are a variety of different sources of data. One of the biggest ones is something called the voter file, which people who are not in the political world might be sort of horrified to know exists, but it's a standard thing. Each secretary of state for every state keeps a voter file, which is a list of all the people who are registered voters in that state and when they have voted. So as we like to say to people, who you vote for is private, but whether you voted is a matter of public record.

And so there's a lot of work in politics done off of these voter files. And so these data files are compiled by state secretaries of state. There are then companies that go around and gather the data from all 50 states, because actually it's like a pain in the ass to homogenize 50 different state voter files.

The role I was in, we were doing voter outreach experiments. And so we were looking at things like, does this message or that message make people more likely to go out to vote? And so the kind of things we would do is before the election, we would take the voter file, we would randomize it. And some people would get this piece of mail. Some people would get that piece of mail. Some people would get nothing or this text or that text. And then after the election, we would gather back up the voter files and see who voted and who didn't. And so they were literal like field experiments on voter outreach. This is sort of the gold standard, obviously, because it's an actual randomized controlled trial. But there's just a ton of data stuff going on in politics these days.
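The experiment workflow described above (randomize the voter file, send different outreach to each group, then compare turnout after the election) can be sketched as follows. This is a toy illustration under stated assumptions, not the organization's actual pipeline: the arm names, the voter IDs, the made-up `voted` set, and the helpers `assign_treatments` and `turnout_by_arm` are all hypothetical.

```python
import random


def assign_treatments(voter_ids, arms, seed=0):
    """Shuffle the voters, then deal them out to arms round-robin so the
    groups are random and near-equal in size."""
    rng = random.Random(seed)
    shuffled = list(voter_ids)
    rng.shuffle(shuffled)
    return {vid: arms[i % len(arms)] for i, vid in enumerate(shuffled)}


def turnout_by_arm(assignment, voted):
    """Share of voters in each arm who show up as having voted."""
    totals, counts = {}, {}
    for vid, arm in assignment.items():
        totals[arm] = totals.get(arm, 0) + 1
        counts[arm] = counts.get(arm, 0) + (1 if vid in voted else 0)
    return {arm: counts[arm] / totals[arm] for arm in totals}


# Before the election: randomize a (hypothetical) 12-person voter file.
voter_ids = range(1, 13)
arms = ["control", "mailer_a", "mailer_b"]
assignment = assign_treatments(voter_ids, arms, seed=42)

# After the election: the updated voter file says who voted (made-up data).
voted = {1, 2, 5, 8, 9, 12}

rates = turnout_by_arm(assignment, voted)
# Estimated effect of each mailer: turnout difference versus control.
effects = {arm: rates[arm] - rates["control"] for arm in arms if arm != "control"}
```

Because assignment is random, a turnout difference between an arm and the control group can be attributed to the outreach itself. A real analysis would use far larger groups and report uncertainty around each estimate.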

I mean, that makes a ton of sense to me. You think about a lot of the campaigns that are run about getting people to vote. I would imagine it's a lot easier to get someone to vote who's already aligned with a specific party than to convert someone to vote for a different candidate. And it's an interesting data problem of, hey, let's invest money in just finding people and getting them to vote versus trying to change opinion, which is maybe a little bit sad.

The organization I worked for was aligned firmly with progressive causes and candidates. And so if that's the side you're on, what you're trying to do is you are trying to increase the vote margin for progressive candidates. And there are basically three things you could do. You can make people who were going to vote for a Republican vote for a Democrat instead. You can make more Democrats or people who are going to vote for Democrats, at least because very few people actually register these days. They say they're independents, but almost nobody's actually independent. And so you take people who are going to vote Democratic and you get them to come out, or you get the Republicans to stay home. Those are the three things you can do.

There is a really active, I think, debate, honestly, going on about whether persuasion, which is getting people to go to the other side, or straight turnout stuff is more useful. It definitely varies election by election. The other thing that's really interesting is, obviously, you can do a lot more in a smaller election, right? So if you're talking about 2015, 2016, obviously, Hillary-Trump was the election. And most people knew a lot about that election. And there really is only so much you can do to affect that.

But when you're talking about down-ballot races, when you're talking about a state senate race, or the district attorney in Chicago, those are races, actually, where this kind of paid media can have a much bigger effect, because people aren't paying as close attention. Their minds aren't made up ahead of time. They're more persuadable. And they might not be planning to vote at all unless they learn something about the candidates or are sort of incentivized to come out for it.

I'm actually a little concerned on that front because the way these LLMs work, right, they are good at a lot of basic tasks. And I feel like you still need an expert to look at it and make sure it's doing what you think it's doing.

That feels like an unsolved problem to me of like, it's cool that it can do some of these different kinds of tasks, but it still feels to me like there's a need for expert eyes on it to make sure it's doing what you want. And I'm not sure how you get that exposure in a world where like LLMs are, you know, not today, but like in five years kind of territory.

So I think as they are right now, agree 100%. Basically, a lot of the stuff that it produces is wrong, or it doesn't work. Like, I had an issue where I was writing some code, and when I ran it on my machine, I just got different results. But I'm a little less concerned because I think that it allows people to iterate faster and see dramatically more code. And that evens out eventually, because there are going to be bugs in the code, you're going to be getting bad results. And like, frankly, you're not going to have a job if you're producing bad results.

And so inevitably, I think sort of the market corrects itself that, hey, after you make a couple of mistakes, you start scrutinizing it really carefully. And then you can actually use it as a tool for debugging itself in the sense that like, hey, like, why is it this way? Why did you choose that rather than this decision? Are there other ways to do this? I think I believe a lot of those problems can be solved with good prompting. And by good prompting, I mean, asking questions of what is being built and having it break itself down.

I think one of the things I'm concerned about, and we got like, what, we're like 40 minutes in. This is the first time I'm mentioning my book, DevOps for Data Science. But like, in the book, right, one of the things I talk about is that, you know, like, if you're doing regular software engineering, there's basically like, one kind of correctness you care about. It's basically like, does it run in a performant enough way? But like, when you're talking about data science, there's this second sense in which it needs to work, which is like, the answers need to be correct. And they need to be what you meant them to be, right?

And like, the problem with that is that unlike the way you can write unit tests for a software stack, it's very difficult to write tests for actual results in data science, because usually, like, you wouldn't be doing it if you knew what the answer was. And you could just run it and make sure you got the right answer. It is still scary to me that probably you're going to find out those mistakes by like, putting them in front of somebody who is an expert. And they'll be like, this number makes, I mean, we've all been there, right? Like, this number makes no sense.

But, you know, there's a second level of correctness that you care about in data science that isn't there in pure software engineering that I think, you know, makes that sort of self-correcting piece a little scarier, I think, for a data scientist, whereas like, if you're just writing code, if it doesn't run, it doesn't run. But like, you know, in data science, like your data merge can work in some sense of the word, but completely screw up your data along the way. And you may not realize that for a long time.
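The point about a merge that "works" but corrupts the data can be made concrete with a small sketch. The tables, field names, and the helpers `left_join` and `check_merge` below are hypothetical, written in plain Python rather than any particular data library. The join runs without error, but a duplicated key on the right-hand side silently duplicates a row and inflates a total. A couple of cheap invariant checks (row count and column total unchanged) catch it without needing to know the "right answer" in advance.

```python
def left_join(left, right, key):
    """Naive left join: one output row per matching (left, right) pair."""
    out = []
    for l_row in left:
        matches = [r_row for r_row in right if r_row[key] == l_row[key]]
        if not matches:
            out.append(dict(l_row))
        for r_row in matches:
            out.append({**l_row, **r_row})
    return out


def check_merge(left, merged, total_field):
    """Return a list of invariant violations (an empty list means no problems found)."""
    problems = []
    if len(merged) != len(left):
        problems.append(f"row count changed: {len(left)} -> {len(merged)}")
    before = sum(row[total_field] for row in left)
    after = sum(row[total_field] for row in merged)
    if before != after:
        problems.append(f"total {total_field} changed: {before} -> {after}")
    return problems


# Hypothetical data: patient 2 appears twice in the lookup table.
patients = [{"id": 1, "cost": 100}, {"id": 2, "cost": 250}]
sites = [
    {"id": 1, "site": "A"},
    {"id": 2, "site": "B"},
    {"id": 2, "site": "B2"},
]

merged = left_join(patients, sites, "id")        # runs without any error
problems = check_merge(patients, merged, "cost")  # but the checks flag it
```

Here the merge yields three rows instead of two and the total cost goes from 350 to 600, even though nothing crashed. Checks like these do not prove the analysis is right, but they turn one class of silent error into a loud one.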

DevOps for data science vs MLOps

So, you know, talking about the book, talking about the nature of, I guess, like DevOps in the data science space. Can you explain maybe what that actually means and how that might be different from, for example, MLOps, which is talked about quite a bit?

So, you know, fundamentally DevOps is a set of practices, procedures, and tools, sort of, right? So here's like, I'm going to digress here and tell a little history, right? So you're like in the nineties and, um, right, at least according to these stories, which I believe are completely apocryphal, everybody is writing their software using the waterfall method where you like spend a year gathering requirements. You spend two years writing software and then voila, it doesn't do what you thought, right? Like that's the like waterfall story. I'm somewhat doubtful that that ever actually happened, but like, that's the story.

And so then comes the rise of agile, right? The agile software movement, which is like this idea that you're going to, instead of trying to like do a year of requirements gathering and two years of building, you're going to build small increments. You're going to deliver quickly. And like, fundamentally, while agile is a development method, it's also a method of like checking that you're doing the right thing, right? Because you're frequently going back to your customer, whether that's a customer-customer or an internal customer, and you're saying like, does this do what you want it to do?

But there's a problem, which is you can write the code that does, you know, creates this thing. But like, how do you then deliver it, right? You have to put it into production somehow to get it in front of those people. And if you're doing this right before you did that, like once, right? But now, you know, like if you're, if you're talking about a major code base, like Facebook, right, like, they push, I think, like it's thousands of updates a week to their code base, and they go live on the site, like they're constantly just pushing updates.

So DevOps is this system of tools, procedures, processes, so that you can build the software to make it easy to put into production and put things into production in a way that they're going to be safe, secure, observable. You can understand what happens when they go wrong, recreate the error and try and fix it on the code side. And so it's basically trying to bring closer together the development, the dev, and the day-to-day running of the software, the ops, right? So that's, that's where it comes from. It's a sort of pure software engineering kind of concept.

And so, you know, as might not be surprising, given how I described what the solutions engineer role is, like, that's what I spend my time discussing with our customers. Like, how do you take data science and put it into production? How do you make it production grade? How do you like write code that is ready for production? Then how do you actually put it into production? And a lot of that is lessons from DevOps or they're an interpretation of DevOps. And, you know, it's tempting, I think, for people who are software engineers to be like, you just do DevOps, but you should just do it in R or Python. And I think that's really wrong. Like, I think fundamentally data science is a different thing than software engineering.

Like the analogy I draw is like, if software engineering is like architecture, data science is like archeology. Like they're, you're doing something in both cases, but they're very different. And the way you think about building software, and if you're doing data science, like you're building software, even if you don't like it, or if you're bad at it, like you're building software.

And so that's really what the book was a reaction to: observing that people knew a lot about data science, about, you know, machine learning, about, you know, statistics, about, you know, sort of like graphical design and these sorts of things. But then they got really stumped when it came time to actually put it into production. And so the book is sort of split into three parts, where the first part is really about, like, how do you write your code, your R or Python code, in a way that makes it easy to put it into production when the time comes?

The second section of the book is the part that I really didn't want to write, but I kind of felt like I had to, which is like, if you have to manage the server that's hosting all this, like, how do you, how do you do that? How do you manage a server? How do you manage a database? What the hell is up with Docker? Like, you know, those, those kinds of questions, which I think a lot of us end up facing. I think most data scientists at some point in their career have to answer these questions, but are really lost about where to start.

And then the last section of the book is about, okay, you're working at a big company that has an IT admin organization. It's a sophisticated IT admin organization. What concerns are they going to have about the work you're doing? And how do you communicate with them about the work that you need to do and make sure that they understand what you need from them and giving them what they need from you so that you can create a great environment to do data science.

So DevOps for data science, it probably encompasses more of the pipeline than just MLOps.

That's right. You did ask about that. So to me, I mean, MLOps is such an interesting, like to me, MLOps is just this like, and I don't know, there's been so much hype around it, but like it's this tiny narrow slice of the pie, which is like, once you've built a machine learning model, how do you serve that machine learning model? And then how do you, you know, do things like horse racing them? And I'm being a little like glib about it, but to me, it's such a small part of the work you need to do as a data scientist. And at least in my observation, there are organizations that have sophisticated MLOps needs, but they're much rarer than organizations that like have a dashboard in R or a, you know, some sort of image processing pipeline in Python. And they need to somehow productionize that.