
From Physics PhD to MLOps builder - Julia Silge - The Data Scientist Show #087
Julia Silge is an engineering manager at Posit PBC, formerly known as RStudio, where she leads a team of developers building open source software for machine learning and MLOps. Before Posit, she finished a PhD in astrophysics, worked for several years in the nonprofit space, and was a data scientist at Stack Overflow, where some of her most public work involved the annual developer survey. We talked about MLOps tools, challenges in survey data, text analysis, and balancing her interests in data science and engineering.

Subscribe to Daliana's newsletter on www.dalianaliu.com for more on data science and career.

Daliana's Twitter: https://twitter.com/DalianaLiu
Daliana's LinkedIn: https://www.linkedin.com/in/dalianaliu/
Julia's LinkedIn: https://www.linkedin.com/in/juliasilge/
Julia's Website: https://juliasilge.com/

00:00:00 Introduction
00:00:51 Getting into data science
00:04:45 Transition from data science to engineering manager
00:13:59 Common challenges in tool development
00:17:33 Challenges with survey data
00:26:42 Engineering skills for data scientists
00:28:54 Balancing roles
00:34:44 Developing skills in exploratory data analysis (EDA)
00:39:14 Python vs. R for data analysis
00:44:35 Exciting aspects in career and personal life
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Hello everyone, welcome to The Data Scientist Show. Today we have Julia Silge. Julia is a data scientist and engineering manager at Posit PBC, formerly known as RStudio, where she leads a team of developers building fluent, cohesive, open-source software for machine learning and MLOps. Before Posit, she finished a PhD in astrophysics, worked for several years in the non-profit space and was a data scientist at Stack Overflow, where some of her most public work involved the annual developer survey.
Today we'll talk about MLOps tools, challenges in survey data, text analysis, and balancing her interests in data science and engineering. If you like the show, subscribe to the channel, leave a comment and give me a five-star review. Welcome to the show, Julia. Thank you for having me. I'm really glad to be here.
From physics to data science
So Julia, how did you get into data science? Great, like you mentioned, my academic background is in physics and astronomy, and I was working in research and was realizing that the academic world was not gonna be a fit for me long-term. My path at that point was a bit circuitous. I worked for an ed-tech startup. I actually was a stay-at-home mom for a few years, but eventually I started to see some people who were similar to me in background, with this physics and astronomy training, making a transition into data science.
I talked to them and I was like, wait a minute. So there's a job out there where what you do is make plots, talk to people about analytical results, analyze data, and that's your actual job? That was my favorite part of when I was doing astronomy. So I thought this is gonna be a pretty good fit for me, and I decided to try to make this transition.
I, at the time, was a little bit underemployed. I actually had been laid off just a few months before this, and I was employed as a contractor, and I took about six months and took basically every MOOC, every massive open online course, that exists to really learn some of the modern data science languages. I had a pretty strong programming background from the work in physics and astronomy, but I had never taken any formal stats course. I didn't know modern data science languages like Python, SQL, R. I didn't have machine learning experience, because when I came up through physics and astronomy, it was not as common for people to use those kinds of machine learning tools.
I took a bit of time to do a lot of self-study, and then as part of that self-study to try to make this transition, I started writing a blog. And my vision for this blog was that it would be projects that I could talk to people about during job interviews, because my resume looked a little weird, and I thought I really need to have evidence that I can do this job. It turned out that blogging opened huge doors for me, both in terms of jobs, but also with open source collaboration. I met people that I eventually wrote books with. A big part of my transition was doing this public-facing kind of work.
I did get a job. So the first job that I got was in the nonprofit space, which I think is a really interesting way for people who are transitioning in, because often nonprofits have quite a lot of data, but they don't have as many resources for how to best use that data, how to best take advantage of that. Of course, salaries are not high in the nonprofit space, but it was really a great place for me to get that first data science title, that first job where that was my title, and to demonstrate that I could do this in a real org. So that's a little bit of how I got to that first kind of data science role. After that, I worked as a data science practitioner, like with the title data scientist, for several years at different kinds of orgs, moved from the nonprofit space into tech itself, and then moved to Posit about four years ago now.
Transitioning to engineering manager and MLOps
So when my title was data scientist, I noticed that I would spend about 80% of my time doing data analysis, and I would spend about 20% of my time building tools. This could be an internal tool at the organization where I was, like an internal package or library, something to make the other people in my org more effective. It also was open source tools. I contributed to open source software, starting pretty early in my transition into data science, and I noticed that I really am motivated by thinking about the systems people use to get their work done, the tools, the really practical nitty gritty of how people do their jobs. I was always interested in that kind of systems thinking: how do we go about making people more effective when they're doing text analysis, when they're building a machine learning model?
I was happy then, because I got to do that 80-20 split of data analysis and tool building. When I was at a transition point looking for my next thing, I was really excited to see this job at Posit, but I knew it would be a change, because basically that ratio flipped. So now I would say I spend 80% of my time focused on tools for people who are doing data science, and maybe 20% of my time actually doing data analysis itself.
One thing I notice about myself is that I really do like to be involved in both. I think if I ever had a job that was all one or all the other, I would not like that as much. It's been interesting to step back from a role where my primary focus is doing data analysis or data science into a role where my primary focus is how can we build effective, fluent tools for people to be able to do the kind of tasks that they do. So I like doing a little bit of both. When I joined Posit, I started out just focusing on tools for model development, like building machine learning models, and then transitioned into a leadership role and building tools for machine learning operations, what people call MLOps.
So it's that process of: you've already trained a model, now what do you do? Like, congrats, you trained a model. What are you gonna do now? How do you deploy that model? How do you version that model? How do you maintain that model in the long run and know when it's time to retrain it? That's the kind of work I do now, and it's a really interesting space. Like I said, I love thinking about the really practical parts of how people do their jobs, and I get to think about that a lot now.
Vetiver: the MLOps tool
So when I joined the company I'm at now, it was called RStudio. And a couple years ago, the company rebranded. So the new name of it is Posit, Posit PBC. We're a public benefit corporation. And part of that rebranding was to clarify that we're not a company that just makes tools for R. We're a data science company. The two main languages that we support are Python and R, but also some Julia and Observable. Basically, tools that people use to deal with data.
So the project that I work on now is called Vetiver. And if you are a perfume person or a candle person, you may have heard that word, vetiver. It's a stabilizing ingredient in perfumery. In a perfume, you might have lots of volatile fragrances, but vetiver is a stabilizing ingredient. And so the metaphor here is that Vetiver, the project, is for stabilizing your models. Your models might be these more volatile things, but Vetiver helps you have confidence in the reliability of the model.
So it's a project for Python and for R. And what it focuses on is: you have a model trained, and what do you need to do to maintain it in production, reliably and efficiently? There are three main tasks in the approach. The first is versioning your model. So that's, oh, I trained it this week and I got this result, but I trained it last week and I got these results. How do I keep those model versions really organized, in something like a model registry? So versioning is the first one. The second one is deploying. So that's the process of getting the model out of the computational environment where you developed it. Maybe think of that as your laptop, or maybe you developed it in a server kind of environment, but you developed it somewhere, and then you need to lift it out with all the computational pieces that it needs and put it somewhere else, so that it can be in a production environment where it's integrated into the infrastructure of your org.
One way that you can do that is with Docker. There are other ways as well that Vetiver supports, but that might be a good way to think of it. So instead of it's on my laptop, you're like, ah, it's in a Docker container and it's ready to make predictions, because we have captured all the software requirements. So versioning is the first piece, think of a model registry. Deploying is the second piece, think about getting it off your laptop, maybe into a Docker container. And then the last piece is model monitoring. So the model is in production. It's serving predictions in whatever way is appropriate for your organization. But you need to know: is the model performing as you expect it to? Is it starting to degrade? Are you having that sort of model drift that you need to address? So that third piece is the monitoring and maintenance piece.
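The monitoring piece Julia describes can be sketched in plain Python. This is only an illustration of the idea (Vetiver itself provides real tooling for it); the window size and alert threshold are made-up values:

```python
from collections import deque

def rolling_accuracy_monitor(window_size=100, alert_threshold=0.85):
    """Track accuracy over the most recent predictions and flag
    when it drops below a threshold -- a crude drift signal."""
    recent = deque(maxlen=window_size)  # only the latest outcomes count

    def record(prediction, actual):
        recent.append(prediction == actual)
        accuracy = sum(recent) / len(recent)
        # Only alert once we have a full window of evidence.
        alert = len(recent) == window_size and accuracy < alert_threshold
        return accuracy, alert

    return record
```

Each time the production model serves a prediction and the true outcome later arrives, you would call `record(prediction, actual)` and act when `alert` fires, for example by scheduling retraining.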
MLOps challenges and user research
So before we started deciding exactly what would be the shape of the MLOps tool we would build, I did a round of user interviews. I went to both people who were customers of Posit, so these are people who use our professional products, and also people in the open source ecosystem, people who maybe are users of scikit-learn and our tidymodels, people who are not our customers. I designed an interview so that we could really understand: what do people like about the tools they're trying to use now? What are the deficits that exist?
Yeah, one thing that was just hammered home to me, both through my own interviews and through the reading that I did, is that in many organizations, the people who are responsible for this are the people who are also developing the model. So in many places, if you're someone who develops a model, and by that I mean that process of hyperparameter tuning, evaluation, how do I train the model, you are often also the person responsible for operationalizing that model. But many of the tools that are out there for MLOps are really built with a software engineer user in mind, and don't really acknowledge the iterative, exploratory nature of what data science and machine learning are like.
The paper, we can add it to your show notes, I think that was one of the big takeaways from this interview study that was done of people who do MLOps. A big difference between what you might call regular software engineering and the process of MLOps is that MLOps is more iterative and interactive in nature. And so you have to build tools for that kind of person, for that kind of user persona.
Just as a really concrete example: if you have an app, just a software app that does not have machine learning in it, and someone talks about the performance of it, they're probably talking about things like latency, how fast does the app run, how much RAM does the app use. But when you're talking about a machine learning app, an API that serves a model or something like that, you do have to think about all those things, latency, how much memory it uses, how fast it returns predictions, but you also have to think about performance in terms of the statistical characteristics of the model.
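The two meanings of "performance" here can be shown side by side in a small sketch. The `predict` function is a hypothetical stand-in for a real model, not anything from Vetiver:

```python
import time

def predict(features):
    # Stand-in for a real model: a trivial threshold rule on the
    # sum of the input features (purely illustrative).
    return 1 if sum(features) > 1.0 else 0

def evaluate(batch, labels):
    """Report engineering performance (mean latency per prediction,
    in milliseconds) alongside statistical performance (accuracy)."""
    start = time.perf_counter()
    preds = [predict(x) for x in batch]
    latency_ms = (time.perf_counter() - start) * 1000 / len(batch)
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return latency_ms, accuracy
```

A regular app would stop at `latency_ms`; a machine learning app also has to watch `accuracy` (or whatever statistical metric fits the model) over time.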
Survey data challenges at Stack Overflow
Yeah, working on the Stack Overflow developer survey was honestly quite stressful at the time, partly because it was fairly public. People would pay quite a bit of attention to it, and it was enormous. In the years that I worked on the survey, it would have on the order of 70,000 to 90,000 responses, and it's an online opt-in survey. In this way, it's different from surveys that you might hear about from government statistical organizations, where they want a sample that is reflective of the population of that country, and they work very hard so that they know how to map from the responses they get to the population.
Now, there's no census out there of developers. No one knows actually how many there are, or what their demographic characteristics are like, and when you have an online opt-in survey, one of the aspects that I found most challenging was how to accurately communicate about the nature of those results. This is something that Stack Overflow got a fair amount of criticism for over the years, and some of that I think was quite fair. A big piece of this would be gender representation in that survey.
I don't remember the exact numbers, but the first year that I worked on that survey, about 5% of the respondents were women and gender minorities, very low. And if you are a woman in tech and you see that, you think, this survey is not representing me. So one of my priorities when I worked on the survey was, first of all, we invested in trying to increase that proportion. We would go to groups like PyLadies, groups where you would find women in tech, and say, hey, can you share this with your members, like in your Slack, because we want to hear from women and gender minorities in tech. And we did, we doubled the proportion of respondents. But the thing is, we don't actually know what the real answer is.
So we tried to prioritize understanding differences. Yeah, I think that makes sense.
Another interesting, more technical challenge, less of a people problem and more of a technical problem, is that at the time at Stack Overflow, the data science ecosystem was all R and Python. And at the time, Stack Overflow had no Python, no R in production anywhere. So I was doing the analysis with these data science tools, but we needed to get the results to production, to the website, and there was no way for me to use R or Python in that. Stack Overflow is a pretty interesting website. It is a super high performance, really fast website that has been carefully built with these really fast technologies. So Python's too slow, R's too slow. It's not a good fit.
What I found was that the best way for me to deliver those results was to deliver them in an API. In R, you can use Plumber; in Python, you can use FastAPI to deliver results from a computer over here that has R or Python available. And then you can publish an API, you might think of it as a survey microservice maybe, and the high performance website can slurp up the results when it needs to, update on whatever time period is appropriate, and get those results in.
I did that actually with not only numbers, but also with plots. I would serve SVG plots in an API to get slurped up into the website. When people think about data science tools and skills, you don't think as much about, can I serve my results in an API? But someone who has the data chops and is also able to back that up with some level of engineering chops, sure, I can publish an API for you, where do you want it? That is something the survey particularly showed me: how much that can scale your impact in an organization.
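The "survey microservice" pattern Julia describes, one endpoint for numbers and one for an SVG plot, can be sketched with just Python's standard library. In practice she used Plumber (R) or FastAPI (Python); the endpoint paths, survey numbers, and SVG here are all invented for illustration:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Made-up stand-ins for real analysis results.
RESULTS = {"respondents": 90000, "top_language": "JavaScript"}
SVG_PLOT = ('<svg xmlns="http://www.w3.org/2000/svg" width="100" height="50">'
            '<rect width="90" height="20"/></svg>')

class SurveyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/results":
            body, ctype = json.dumps(RESULTS).encode(), "application/json"
        elif self.path == "/plot":
            body, ctype = SVG_PLOT.encode(), "image/svg+xml"
        else:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep request logging quiet

def serve(port=0):
    """Start the survey microservice in a background thread; port=0
    picks any free port. Returns the server object."""
    server = HTTPServer(("127.0.0.1", port), SurveyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The high-performance website would then fetch `/results` and `/plot` on whatever refresh schedule is appropriate, without ever running R or Python itself.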
Engineering skills for data scientists
So I think that there are some skills that are table stakes, and those are things like, in your language of choice, being very skilled and effective at managing the nitty gritty of installation, being someone who effectively manages the software dependencies of your project. Maybe that seems even like a funny thing to say, but I run into people who really struggle with this piece, who really struggle with understanding dependency management of projects and how to handle that. And it is a whole skill. So that one I think is table stakes: being competent at that, right? That's table stakes.
Then I think the things that will really push you forward, if you can learn some combination of these: one of them is being someone who not only can manage dependencies, but who has packaging skill. So you can build a Python package, you can build an R package, to package up code and be able to run it somewhere else, so that you can then install it. Not only can you manage it, you can be the creator of it. Another really big piece is being someone who understands REST APIs, like I just mentioned. And being someone who maybe has some of the DevOps skills, like you understand Docker, when it is appropriate to use Docker, and how to use it. So these are skills that are not specifically about data, right? They're not specifically about statistics or machine learning or data analysis, but they're about how you operationalize the data analysis that you're doing.
Data storyteller vs. tool builder identity
Oh, that's a really tough question for me. That's a very interesting question, because my day job pays me for the tool builder piece. That is my primary professional focus, and that's what I'm paid for, I guess. That's what's in my job description, the tool builder piece. But how I think about myself is still on the practitioner side: oh, I'm someone who does data science. You introduced me as a data scientist and engineering manager, but I'm actually not a data scientist at my company. We have data scientists, and I'm not one of them.
I do, no, I think that's true. And I think it affects how I decide to build tools. The fact that that is my primary identity really informs the kinds of tools I wanna build, what I care about in those tools, how much I care about documentation and communicating about those tools effectively. So that is a super tough question, but I think it is most honest to say that my primary professional identity is still on the practitioner side.
Text analysis and the blog
One of the things that I love the most is text as data. It's interesting in the era of LLMs to have so much more of the data science ecosystem thinking about text as data, but I love analyzing text, both from the very exploratory data analysis stage through to training models. And I am interested in how these biggest models are being trained and used. I will say, though, I spend most of my time a little bit lower in the hierarchy, a little bit more towards exploratory work, visualization, and not those hard-to-interpret giant models, but models that are more transparent, that are easier to interpret and know what they mean.
One of my recent blog posts was about Taylor Swift's lyrics: treating the lyrics of Taylor Swift as a text data set and building a model with that data. The particular model is an unsupervised model called a topic model. Topic models treat documents as mixtures of topics, and each topic as a mixture of words. I feel like it's a really underrated tool for when you have a corpus of language, open-ended survey questions, there are many ways you can apply this. But in this fun example, I looked at the lyrics of Taylor Swift, treated the different songs as different documents, and then tried to understand how they are related to each other. If I remember correctly, some of the results were that the early albums of Taylor Swift have a lot in common in terms of the lyric content, if you look at the words that are there, like words about romance. Reputation is really unique, it is lyrically unique compared to her other works, which makes sense. And then Folklore and Evermore have a lot in common; they're two of the albums that are the closest in terms of their lyric content, which definitely also makes sense.
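A topic model of the kind described here can be sketched with scikit-learn's `LatentDirichletAllocation` (Julia's own post used R tooling; this is just one way to do the same thing, and the toy "songs" below are invented, not actual lyrics):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: each "song" is one document (invented words, not real lyrics).
songs = [
    "love heart romance kiss love",
    "heart love kiss romance dream",
    "rain night alone shadow night",
    "shadow night rain alone dark",
]

# Documents become word-count vectors...
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(songs)

# ...and LDA models each document as a mixture of topics,
# each topic as a mixture of words.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # rows: songs, columns: topic weights

# Songs with similar vocabularies end up with similar topic mixtures.
for song, mix in zip(songs, doc_topics):
    print(f"{mix.round(2)}  {song[:25]}")
```

Comparing the topic-mixture rows is what lets you say which "albums" are lyrically close to each other and which stand apart.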
I mean, those techniques can be applied in maybe more mundane or common ways, but it's really fun to look at something like Taylor Swift lyrics and do that. I actually did that the week after I went to see the movie. I went to see the concert film and it was like a little fun Taylor Swift moment for me.
I think a lot of people don't know what kind of project they should start with. You can just pick something that interests you. Yeah, absolutely. And I think actually the most fun part lies in the exploratory analysis. There is a lot of low-hanging fruit. If you want to get started on your public portfolio projects, you don't have to go into a very complicated machine learning model. Start with something exploratory, analyzing, I don't know, your favorite food, or, I think there's public data about Yelp stars and things like that. And to add onto that, I think those are incredibly valuable skills. If you're someone who's very quick and good at EDA, those people are incredible. Those people are very impactful in the organizations where they are, for sure.
Developing EDA instincts and presenting results
Yeah, this is a very interesting question, because it is exploratory in nature. It is kind of hard to say, what is the right thing to do? And when do you know you've done a good job? When do you know you're done? It will never be done. You always have another question. One of my favorite books that approaches this process is an R-specific book, but I think at least those first couple chapters would really benefit anyone, even if they use Python or Julia or another language. The book is called R for Data Science, by Hadley Wickham. The first couple chapters talk about the iterative nature of it: you import data and then you do this process of visualize, summarize, model. And this process tends to loop back on itself multiple times: oh, I see this result, let me go and make another plot, let me do another summarization, let me build another maybe simple model to learn what I can about this data set.
I was a huge user of R Markdown at the time, and what I would use today would be Quarto. Quarto is sort of a next generation iteration of what R Markdown is, and Quarto works for Python, R, Observable, Julia. The superpower that gave me was the ability to quickly generate reproducible reports. I would write what would now be a Quarto doc, text, code, making the plots, making the results, and then I would render the whole thing to either a PDF or Google Docs, or if I was doing a presentation, I would make slides, and I could do it reproducibly. The superpower it gave me is that I was never copying and pasting stuff or showing someone outdated results. I had really reproducible practices.
The other tool that was like a superpower for me was Shiny. Shiny is a tool for making interactive apps, for Python and for R. Being someone who knew how to use Shiny let me make, picture, an interactive data app that a stakeholder could then come and explore. Shiny made me so effective. And it's funny, at the time I did not work for my current company, but part of the reason I wanted to come here is because it was the company that made these tools. I don't personally work on Quarto or Shiny, I work on Vetiver, like I said, and I know coming from me now it's like, sure, you're gonna say that, but literally those were my superpowers when I was a data science practitioner. Being able to generate reproducible reports really quickly, being able to make interactive apps really quickly, made people view me as incredibly productive and helpful to the org.
Python vs. R
Yeah, so I think that the right answer is the one that you are more productive in. If you are someone who is more productive in Python, then that is the right tool for you. It depends on your own skills and what is a better fit for you. I do also think that sometimes the tooling is better in one than the other. A real strength of R is that for every kind of statistical model that exists, there's an R package for it. So if you have specific statistical needs around the kind of models that you build, often R is used in situations where you have fairly sophisticated statistical needs. You can't just brute force something; you need to be quite careful. That's why it's used a lot in clinical trials, why it's used a lot in research with human subjects, because you're looking for subtle effects and you need to make sure your stats are rock solid.
There are, of course, amazing tools in Python as well. One that I often will turn to Python for, just as an example, is spaCy. Like we talked about, I love text data, and spaCy is a project and package for natural language processing in Python that is amazing. I love it. The people who built it did such a good job. I'm such a fan. And it is just for Python, so that is something I would turn to Python for, to use that tool.
So I think the two things to think about are: which one are you better at? For example, I am just way better with the tidyverse than I am with pandas or Polars. I'm just way faster. It matches how my brain thinks. So if I need to do data munging or EDA, I'm gonna use the tidyverse, because I'm so effective with it. So there's one piece that's: what are you effective with? Use that. And then there's the other piece of, for the specific task you're doing, where are the best tools? I will find myself picking up different tools depending on what it is that I want to do.
I do think people, depending on companies and organizations, sometimes will get really locked in to one: we are an all-Python shop, we are an all-R kind of shop. So sometimes that also is part of the reality, like, oh, at this company, they've made a decision that all of our data science needs to be in Python. That's sometimes a constraint that comes in: where you are working, maybe you don't have complete freedom to decide what tool you are going to use.
What I feel like I've seen lately is people acknowledging more of these kinds of realities. I am not someone who is interested or willing to engage in one is good and one is bad. They're different; the trade-offs are different. And there are other really interesting projects out there too, which we may all be wanting to use in five years. Some people who use Julia love Julia, right? There's really cool stuff out there. It's been really interesting to see Mojo come out recently, this new, really fast kind of language. So I am very professionally associated with R, but it is for sure not the only language I use. And I think it's good to realize that once you have learned one programming language, it is not that hard to learn another one.
The hardest things, at least in my opinion, when you move into a different language, are usually not the syntax. It is usually: what is that language's packaging ecosystem like, and how is it different from the one you're used to? So I feel like that's the hardest piece, wrapping your mind around a different set of constraints in terms of how packages are distributed and dependencies are managed.
What's coming up
One thing in my professional life I'm really excited about this year is that I'm keynoting at SciPy. This July I'm giving a keynote at SciPy, and one of the themes at SciPy this year is scientific computing across different languages. So I'm really excited to get to go and talk about my own experiences building projects across different languages, about things that are less about a specific language and more about scientific computing generally. What are the things that come up that are different? When you're working with a Python user, when you're working with an R user, what are the things that actually do have to be different? So I'm super excited about giving that keynote. I'm super honored that they asked me.
So that is one thing. And then another thing, this is also conference related: my company is having its conference in August, and one of the people on my team, Isabel, and I are teaching a workshop on using Vetiver. It'll be the second time we've taught this workshop. It's so fun to sit down with people in a room and help them get from a place where they have not done something before, like deployed a model, and get them to that first deployment. It's just really exciting. So I'm looking forward to that as well. Thank you, Julia, for coming to the show today. Thank you so much for having me. It was a pleasure to chat with you.

