
Julia Silge @ Posit | Data Science Hangout

We were recently joined by Julia Silge, Data Scientist & Software Engineer at Posit, PBC, to talk about the skills and tools that data scientists need to deploy their models and scale their impact.

Bio: Julia Silge is a data scientist and software engineer at Posit, PBC, where she works on open source modeling and MLOps tools. She is an author, an international keynote speaker, and a real-world practitioner focusing on data analysis and machine learning. Julia loves text analysis, making beautiful charts, and communicating about technical topics with diverse audiences.

To join future Data Science Hangouts, add them to your calendar here: pos.it/dsh (All are welcome! We'd love to see you!)

Nov 27, 2023
1h 2min


Transcript

This transcript was generated automatically and may contain errors.

Hi everybody, welcome to the Data Science Hangout. If we haven't met before, I'm Rachel Dempsey, the host of our Hangout here, and I lead customer marketing at Posit. This is our open space to chat about data science leadership, questions you're facing, and what's going on in the world of data across different industries. We're here every Thursday at the same time, same place.

So if you are watching this on YouTube at some point in the future, and you want to join us live, there'll be a link in the details below where you can add it to your calendar. At the Hangouts, we're all dedicated to making this a welcoming environment for everybody. So we love to hear from everyone, no matter your years of experience, your titles, your industry, or the languages that you work in too.

It's totally okay to just listen in here as well. I know sometimes people will be out for a walk, maybe on a lunch break, or it's later in your day. But there are always three ways you can jump in and ask questions or offer your own perspective. One, you can raise your hand here on Zoom, and I'll be on the lookout. Two, you can put questions into the Zoom chat, and if it's something you want me to read instead, just put a little star next to it. And three, we also have a Slido link, which I'm sure Curtis or someone from the team will share in just a second, where you can ask questions anonymously too.

But with all that, thank you for spending your Thursday with us. I'm so excited to introduce my colleague and co-host for today, Julia Silge. Julia is a software engineer here at Posit. Julia, I'd love to have you introduce yourself and share a little bit about your role, maybe something you like to do outside of work too.

Great. Yeah, hello, everyone. I'm so glad to be here. My name is Julia and I have worked at Posit for going on four years now. I work on one of the open source teams, focused mostly on R. When I first came to what was then RStudio, I was on the tidymodels team, and I was there for about two and a half years. It was a really exciting time to work on tidymodels because it was just getting to the point where it was mature enough to be used for people's real machine learning and modeling workflows.

From when I was originally hired, one of the things on the table was working on the really applied, practical problems of deployment: what you need to do with a model after you have trained it. After being on the tidymodels team for about a year, I started digging into that, and maybe a year after that I transitioned to working on it close to full time, with most of my emphasis going to those deployment kinds of things.

From the beginning, the plan was always to make that something that worked for both R and Python, because the technologies people use to deploy models are really not very specific to the language used to train them. We saw this as an opportunity to help people have choices and have a unified approach at the end. So the MLOps project is bilingual: R and Python. If you have ever used something like pins, my team works on that for R and Python. The particular package focused on model deployment is vetiver, and then there are a handful of lower-level things behind these packages that we also maintain.
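For a sense of the workflow being described, here is a minimal sketch in R of versioning a model with pins and serving it with vetiver; the model, the model name "mtcars-mpg", and the temporary board are illustrative placeholders, not anything from the conversation.

```r
library(parsnip)
library(pins)
library(vetiver)
library(plumber)

# Train a simple placeholder model on built-in data
model <- linear_reg() |> fit(mpg ~ wt + cyl, data = mtcars)

# Wrap it with the metadata vetiver needs for deployment
v <- vetiver_model(model, "mtcars-mpg")

# Version the model on a pins board (temporary here; in practice
# this might be Posit Connect, S3, or another shared board)
board <- board_temp()
vetiver_pin_write(board, v)

# Serve predictions as a REST API via plumber
pr() |>
  vetiver_api(v) |>
  pr_run(port = 8080)
```

The same flow exists on the Python side of the project (a VetiverModel served with FastAPI), which is the "bilingual" point being made above.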

Julia's YouTube channel and blogging

So I think it started at first as an experiment. I've done blogging for a long time, and my initial motivation for blogging was that I was transitioning into data science as a career. This was maybe eight to ten years ago. My academic background is in physics and astronomy; I was in research for a long time, and then I worked at an ed tech kind of company, and I was transitioning into data science. I thought, okay, I have this weird background, how are people going to know that I can do this job at all? So blogging was part of trying to build a kind of portfolio for my own career growth. But it really led to so much that was good in my life and career.

Honestly, it's interesting: personally, I would always rather read something than watch a video. I'm like, oh my gosh, just let me read it, why is everything a video? I'm in my mid-forties, so, you know. But at the same time, there were times when I would watch videos and think, oh, I didn't know you could do that, or, I never saw someone do that, in the way they actually were writing code or using their IDE or something like that.

So I see the benefits of both video and written material for communication and teaching, and written is certainly my first choice. You'll notice that for all my screencasts, I pop up a quick blog post. I don't put a ton of energy into it, but there's always a place you can go to read it. I also do that partly because people have talked to me about the accessibility issues around video for communication. If you have something that is text, HTML on a website, that's much more accessible to people who need to use screen readers. So I always try to do both.

Maintaining excitement on long-term projects

So the main question is: how do you maintain excitement and enthusiasm, but also maintain a clear scope, over a long time-to-value project?

I think the number one thing is, you know, you've all seen these kinds of pictures comparing waterfall versus MVP, right? The ones about making a bike. What do you make first? Do you make a wheel, and say, I made a wheel, it's ready, and then you make the seat, here's a seat? Or is the first thing you make a really simple skateboard-y thing that someone can use to get somewhere, and then you change and change it?

So I think it is so important to build in such a way that at any given point there's something you can show and say: here is the idea of what it's going to be, and I'm going to improve it in this way and this way. It is true that sometimes this means it can take longer to get to the eventual thing, the perfect bike. But it is, I think, so important for buy-in from stakeholders, and so important for people seeing you as someone who ships and gets things done, that you provide something useful as early as possible, even if it is not the eventual way you're going to do it.

I think adopting that kind of mindset makes a huge difference in how collaborative and responsive people think you are, and in your position within an organization, if you can break things down. There are tons of books and blog posts about this; I'm not saying anything crazy or new. But even for something statistical: can I give someone something that is dead simple now, and then iterate, iterate, iterate? In a machine learning context, can I start with something I almost didn't have to train, or that doesn't use machine learning at all? An initial version that says: here, it's going to look like this, it's going to be this API, and right now what's serving it is something really simple. Then you do a model that you can train in literally ten minutes, and then you try to improve it toward something great. (See the sketch below for one way that first step can look.)
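As a concrete illustration of that "serving it is something really simple" idea, here is a hypothetical R sketch: a constant baseline that needs almost no training but pins down the interface, so a real model can be swapped in behind the same function later. All names here (`baseline_fit`, `predict_mpg`) are invented for illustration.

```r
# Version 0: predict the training-set mean for everyone.
# No real training needed, but it defines the interface.
baseline_fit <- list(mean_mpg = mean(mtcars$mpg))

predict_mpg <- function(fit, new_data) {
  rep(fit$mean_mpg, nrow(new_data))
}

# Version 1: a real model behind the same function signature;
# anything calling predict_mpg() doesn't have to change.
library(parsnip)
real_fit <- linear_reg() |> fit(mpg ~ wt + cyl, data = mtcars)

predict_mpg <- function(fit, new_data) {
  predict(fit, new_data)$.pred
}
```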

Within the scope of whatever kind of project you're doing, if you can really think about building it up in pieces so that the individual pieces are as useful as early as possible, even if that means building them simpler, that is important, and it really helps the teams I've been on with maintaining motivation. If people are actually using even the very early versions of what you're doing, you get feedback on it; you learn earlier what people are actually looking for. That's versus the very siloed feeling of, well, I'm just going to grind away for a very long time on this big project. Instead it feels like: what about this? What about this? Getting that feedback has, for teams I have worked on, been very motivating.

Vetting ideas and running small experiments

So, you're asking about vetting ideas. I am also a pretty big fan of small experiments and hypotheses. This is related to what we were just talking about: getting to something demo-level as early as possible, the minimal viable product idea. Sometimes that literally means time-boxing something to one afternoon, or even one hour. What can I get done in one hour? Can I show it to someone, get feedback, and see what resonates about it and what doesn't? That, I would say, is one of the biggest tools I've used to evaluate ideas or hypotheses.

If we try something, does it resonate? Does someone want to see more? Do they have changes they want to suggest? Or are people kind of like, I don't know? I am really biased towards shipping early and experimenting, and it does mean I have to have an attitude of holding these things with a very light hand: I'm not super invested in this or that, I just want to know, what do you think?

On the positive side, that is actually how vetiver happened. MLOps as a field or idea feels so big, and a little hype-driven, and we, as Posit, wondered: can we play in that space? Is that reasonable for us? There are whole startups where that's everything they do. And I had a hypothesis: I thought the pieces were already here in open source, and that people were already wiring these things together themselves if they weren't buying into a startup product-y kind of MLOps thing. What if we made an open source thing that really used components and pieces that are already there?

So the initial version that I showed someone was: I think we have most of what we need, and we just need to make the user experience nice for this particular model-based use case. The initial thing I made was actually much more Shiny-like, in terms of what it looked like, like a server, all that kind of thing. There ended up being a fairly big shift: okay, no, we're not going to do it that way, we're going to instead do a fair amount of code generation behind the scenes. So it shifted, but there were things about the initial version I showed people that resonated; they were like, okay, they want to do X. And there were things where they were like, I don't know, that's not making sense to me, or that seems too complicated, I think it should be simpler. So there were things about it that helped me iterate toward something, and then we kept iterating and going forward and landed on something that hopefully fits pretty well.

So I think I probably lean less on my own opinions or tastes, on being sure that this is right. I tend to lean more on: let me show things to people, let me even do user interviews, let me use some of those tools to figure out how things need to happen.


Getting stakeholder buy-in on iterative work

I guess I've had the experience where people really don't want to see those intermediate steps; they just want that final thing, especially to your point about cases where it's a statistical model or something that you're needing to deliver. Sometimes delivering a simple thing is good enough; I've had the experience where it's just not. I think it's because of a perception of stability, or maybe just the culture. I don't know if you've ever come across that, but it's basically like you have to sell the whole thing or else it dies.

So, this experience where someone doesn't view it as important to give feedback on an intermediate thing, or to look at an intermediate thing, or to start integrating something that is less complete or less mature into the system you're going for. If it really is something where you say, I need feedback on these intermediate steps in order to get where I need to go, it can be helpful to explain why: why we want to build things with iterative improvement. If you're talking to a software engineer colleague, they probably resonate with this idea, because it's so common in software engineering: we do something small, then we improve it. We don't do something huge and massive all at once, because we know that's more prone to errors, more prone to not getting where you want to be.

So I have had luck with software engineer colleagues in explaining: this is what you already do, but for data work, so I need us to take these steps so that we end up in a better place in the long run. It may be interesting, George, if you're having this kind of problem more with, say, a business stakeholder, and less with an engineering kind of stakeholder in your data work. I have had more challenges in those situations in trying to get those kinds of answers.

Sometimes I have been in a situation where I just share: here's where it is, and the next steps are X. Next week: here's where it is, next steps are Y. If I do need feedback, if I need something coming back the other direction, it can be more challenging with business stakeholders. When I have had success overcoming that kind of obstacle, it has been through a fairly specific request: I have a prototype of the statistical results, it's not the final version, but I need to ask you these questions about it, and then either set a meeting or, if it's async, ask question A, question B, question C. When I've had success, it's been when they really clearly understand why I need something: the goal is this; to get to the goal, I need to know which of these things to do. With engineers there's more of a norm and an expected culture around doing things iteratively to get a better result; with business stakeholders it has been more challenging.

Why models often don't get deployed

I think some of the time that's appropriate, that's normal and okay, because one thing that's different about machine learning as a practice, compared to software engineering as a practice, is that in software you can pretty much always build something. If you write out a spec that says we need to build X, as a software engineer you pretty much always can do that; maybe it will take longer than you think, or more investment than you think. That's not true when it comes to machine learning, because until you start the process, you don't necessarily know that you can predict the thing you need to predict from the data you have. You may end up not having appropriate data, or the results you get may not be good enough to predict what you want to predict.

So machine learning as a practice is more iterative, more exploratory even, and I'm not talking specifically about exploratory data analysis here. As a practice, it is more exploratory than "we're going to build a feature." You can be less certain ahead of time that whatever model you're trying to build can actually be fit, especially within whatever constraints are needed. So a big reason models don't get deployed is that that is the nature of data science, the nature of machine learning.

There's a really great paper, let me see if I can find it real quick, that talks about this. It's called "Operationalizing Machine Learning," and I'll put the link here. It talks about this in the context of machine learning especially, and really just says: this is why, it's because of the nature of this work.

I think another reason is that, often, when people are working on machine learning projects in their org, they're bumping up against the maturity of where their org has been before, so people don't have as much of the muscle memory and the skills for how to do these things, how to deploy them. So the biggest reason is the nature of machine learning as work. The next piece is that often, when people are doing these things, it's new or newer to their org; they're exercising new skills. And the other thing I think really contributes is that you almost always end up collaborating across functions. Most typically, if you're a more statistical person, you now need to collaborate with someone who's a more ops-y person. People work in different ways, teams don't sit together, and people end up saying, oh, that will never work because of X. What can both sides do to actually make something work, if it needs to happen? I think those are the real reasons why this kind of thing happens.

Going from a trained model to production

So I think one thing is to acknowledge and understand that you as a team, or some part of the team, is going to need to gain some new skills and learn how to do some new things, and to try to make space for that, understanding that that kind of change is coming and is needed. For example, maybe it's important for someone on your team to start learning: what does our org need if I were to deploy a model? What is going to work for our org? That means working with people who are maybe DevOps-y, people who sit in IT. What are the new skills we need to add?

So that's one thing: to acknowledge that if you have been very statistically focused and it's time to deploy a model, how can we make space for some proportion of our team to gain some of those skills? I think also: don't be afraid of those tools and skills, and understand that that's how you scale the impact of your work. So, for example, learn about Docker. Learn how things work in your org: if your org has a certain cloud platform that you all use, start to push beyond "okay, we're the data people, we live and work in this environment" and figure out: how do I take that next step?
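To make the Docker point concrete, here is a hedged sketch of how the vetiver R package can generate the files for containerizing a pinned model. It assumes the illustrative "mtcars-mpg" pin from the earlier sketch still exists on the board; the image tag and port are placeholders.

```r
library(pins)
library(vetiver)

# A persistent board your org uses in practice; board_temp() is a stand-in
board <- board_temp()

# Generate a plumber API file for the pinned model...
vetiver_write_plumber(board, "mtcars-mpg")

# ...and a Dockerfile (plus an renv lockfile) to serve it from a container
v <- vetiver_pin_read(board, "mtcars-mpg")
vetiver_write_docker(v)

# Then, at a shell (illustrative tag and port):
# docker build -t mtcars-mpg .
# docker run -p 8000:8000 mtcars-mpg
```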

Also, I think, do things in as small a way as possible. If you're going to deploy some kind of model as an API, the first time you do it, it doesn't have to be the huge, scalable, can-go-to-infinity kind of API. You can first make an API at all, and then show a software engineer or DevOps-y kind of coworker: look, if we serve this in this way, you can get predictions back in this way, and then have a discussion about how to start to scale that up.
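For the "get predictions back" side of that conversation, here is a minimal sketch of what a coworker could run against such an API; the URL is a placeholder for wherever the plumber process from the first sketch happens to be running.

```r
library(vetiver)

# Point at the running API (placeholder URL for a local plumber process)
endpoint <- vetiver_endpoint("http://127.0.0.1:8080/predict")

# Predictions come back over HTTP, much like a local predict() call
predict(endpoint, data.frame(wt = 2.5, cyl = 4))
```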

I will say, if you are a person who combines statistical know-how with ops-y know-how, that's killer career-wise. So go for it, if you're interested and you have a little bit of space to start gaining those skills. That combo of skills is super valuable in the market.


Responding to community questions as part of the job

So, while I've worked on open source projects, adoption is something that open source teams watch and pay attention to; it's one piece of data we have to decide how to invest and what to invest in.

We consider it part of our job. The teams I have been on have absolutely considered it part of our job. I do think it depends on the maturity of the project that you're working on. For example, Hadley has told me that when he was first working on ggplot2, he answered every Stack Overflow question about ggplot2, or at least he looked at every one, because it's a really valuable source of feedback. But now ggplot2 is very mature and there are many people who answer questions about it on Stack Overflow, so Hadley doesn't do it anymore.

I have set up RSS feeds for Posit Community questions and for Stack Overflow that come to a place I check; I have them going to an RSS feed reader. Literally every time there's a question on Stack Overflow that's tagged tidymodels, I look at it. I look at every single one of them, and sometimes someone has beaten me to answering it, and that's fine, unless they answered it wrong, you know? I just want to make sure that anyone who comes along having a problem, we know what it is.
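For anyone wanting to replicate that kind of setup, Stack Overflow publishes a per-tag feed; to the best of my knowledge the URL follows the pattern below, and the tidyRSS package is one illustrative way to pull it into R rather than a feed reader.

```r
library(tidyRSS)

# Stack Overflow's per-tag feed (URL pattern believed current at time of writing)
feed_url <- "https://stackoverflow.com/feeds/tag?tagnames=tidymodels&sort=newest"

# Pull the latest questions into a tidy data frame and inspect them
questions <- tidyfeed(feed_url)
head(questions)
```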

It's really interesting working on open source software from the perspective of: how do you decide what to do, and what the problems are? Lots of you, I think, probably work in a situation where you have data; that's why you're hired. There's data that exists, and you use that data to improve some product or to change something. When you're working on open source software, just by the nature of it, we don't have data about usage; we don't have data about what the most common errors are. So you have to find proxies for that, and I think things like questions on Posit Community, questions on Stack Overflow, and certainly GitHub issues are the proxies we have for what problems people are having. So yes, figuring out what people's problems are is absolutely part of the job.

And like I said, this especially applies to early-maturity projects; it's in fact incredibly valuable data that you can use to decide what you need to do. You do have to have a certain mindset going into this, because it turns out people ask the same question over and over. But the fact is, if you find multiple people asking the same question over and over, you should fix that, or better document it, or make it easier for people to not have to ask that question anymore.

Open sourcing internal packages in regulated industries

That is such an interesting question, because sometimes, for leadership at a company, the risk feels too big compared to the benefit; they're like, oh, why do you want to do that? Sometimes you work at orgs that really value contributing to open source. The most recent place I worked before here was Stack Overflow, and Stack Overflow had an ethos that it's normal to contribute to open source. Even before data scientists got there, they already had that ethos: it's largely a .NET, C# kind of shop, and some of the most widely used logging libraries were made by Stack Overflow devs and open sourced. So they already had that ethos.

But if you work in healthcare, or finance, or something that has higher constraints around data privacy than a tech company where everything is kind of public anyway, it's different. At Stack Overflow there is user data you don't want linked, but so much of it is public; there's a public kind of ethos. If you don't work somewhere like that, I think the thing to do is to clearly lay out why they should do it at all, why they should let you spend time on that. One reason, and in this market maybe this is not so compelling, is recruiting: if you are in an org in a field that's viewed as a little stuffy or boring, you can show people the kind of work you do by having an open source library or two. It can be used as a recruiting tool to show people the kind of tools they'll be using and the way you work. So there's recruiting.

There's also the Pharma example: I know Pharma has just gone through a big moment of, if we all band together and make these tools, we can then use them in our compliance work, in how we need to report things. So there has to be a compelling case made, and then also: what processes are going to be in place so that reasonable concerns about data privacy are addressed? It's definitely tougher when you are in an industry with high compliance standards, but I do think the Pharma example shows how it can work out to make things better overall for everyone.

Statistical guardrails in tidymodels

There for sure are. And here's where I will admit that I actually don't have a super strong statistical background. My background is physics and astronomy; I'm not a statistician. I literally never took a statistics course through my whole education, so I'm fairly self-taught when it comes to these kinds of things. Some of the things I'm about to say were not necessarily my call, but they were calls where I was like, ah, that makes sense, yes, totally, we're going to do this.

There's a specific example in one of the tidymodels packages, rsample, which is the package for resampling data. There is a certain resampling method that is certainly something you can do, but it actually has pretty bad statistical characteristics when it comes to deciding how you are going to resample for a model: leave-one-out cross-validation. So we actually do have guardrails set up in some of the other tidymodels packages: if you have a leave-one-out cross-validation set, it won't let you tune with it, because it's pretty much always a bad idea.
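A small sketch of that guardrail, assuming current tidymodels behavior: rsample will happily create leave-one-out resamples, but tune refuses to tune over them. The model spec is just an illustrative one (and assumes the glmnet engine is installed).

```r
library(rsample)
library(parsnip)
library(tune)

# rsample will create leave-one-out resamples if you ask for them
loo_folds <- loo_cv(mtcars)

# ...but tuning over them is blocked, rather than silently
# producing unreliable resampled performance estimates
spec <- linear_reg(penalty = tune()) |> set_engine("glmnet")
tune_grid(spec, mpg ~ ., resamples = loo_folds)
#> Error (paraphrased): leave-one-out cross-validation is not supported with tune
```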

Another example: when you're doing machine learning or modeling, often what you'll see in a walkthrough is, let's predict using the training set and then compute some kind of metric using the training set. That is almost certainly going to be an overly optimistic result, so it's actually kind of hard to do that in tidymodels. We still get requests like, I want the metrics on the training set. But definitely there are these guardrails, and tidymodels was built with many of them in mind, to help you do the right thing and make it hard to do the wrong thing.
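For contrast, here is a minimal sketch of the pattern tidymodels steers you toward instead: split the data first, fit on the training portion, and compute metrics only on the held-out test set, which `last_fit()` handles in one step. The data and formula are placeholders.

```r
library(rsample)
library(parsnip)
library(tune)

# Hold out a test set before doing anything else
split <- initial_split(mtcars, prop = 0.8)

# last_fit() fits on the training portion and evaluates on the
# held-out test portion, so metrics aren't optimistically biased
fit_result <- last_fit(linear_reg(), mpg ~ wt + cyl, split = split)
collect_metrics(fit_result)
```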


I think it happens less when it comes to software. With software, it's pretty much never "we can't do that"; we are always willing to try to do something if we think it's a good idea. So I think it happens much more on the statistical side than on the actual software side, because with enough creativity and energy it's pretty much possible to do anything.

The same thing applies in vetiver; we have quite a lot of guardrails in there. It happens even in the tidyverse: it tries to keep you from doing things that are almost certainly a bad idea, you know?

Making it hard to do the wrong thing is such a great framework and such a good way to put it. Thank you, this has been great.

I know we're a little bit over here, but thank you so much, Julia, for joining us today and sharing your experience, and for everything that you do for the community as well. So nice to see so many of you here today too. Yeah, thank you all so much for coming, and thank you for having me. Absolutely. Have a great rest of the day, everybody.