
Alex Gold | Managing Packages for Open Source Data Science | RStudio
What You'll Learn: With over 15,000 R packages on CRAN, over 230,000 on PyPI, and more arriving every day, the task of managing a package environment for data science in R and Python can be daunting. In this webinar, you will learn about the most current strategies and tooling for creating and maintaining a reproducible package environment. Whether you’re an individual data scientist or the administrator of an entire RStudio Team cluster, you’ll better understand how you can enhance your ability to work easily and safely with open-source data science packages. About Alex: Alex is a Solutions Engineer at RStudio, where he helps organizations succeed using R and RStudio products. Before coming to RStudio, Alex was a data scientist and worked on economic policy research, political campaigns, and federal consulting.
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Thanks so much for tuning in today to talk about managing packages for open source data science. My name is Alex, and I'm a solutions engineering manager here at RStudio. Just a little about me so that you sort of know who I am and why I'm talking about this. I'm a former data scientist and a data science team lead. I worked on sort of health care, politics, economic policy kinds of issues when I was a full-time data scientist. And since coming to RStudio, I've been a solutions engineer here for about two years.
So let's dive right in and talk about managing packages. So, you know, data science is a lot like making this delicious sandwich. This sandwich looks incredible, by the way. Maybe it's an app that you're making, maybe it's an API or a report or a scheduled job of some sort. And it takes a little bit of planning, a little preparation, and a little skill to make something really, really great.
Today, we're going to talk about one important aspect of open source data science, which is package management. And so we'll start by talking about how package management can screw up your plans to make beautiful, delicious sandwiches or data science products, then some background on how it all works, and finally some thoughts on how it can be better and how it doesn't always have to be so painful.
The pain points of package management
So first, you know, sort of the pain. So one reason this can be painful is that folks are blocked, right? You're blocked from installing packages. Either you don't have permissions to install on your machine, or you're blocked from public package sources like CRAN, Bioconductor, or PyPI. And that's a big problem. Or you have no way to share private packages.
And sort of that leads into the situation of it feeling like IT and data science are at odds. And this isn't so surprising, right? Like, data scientists want the software they want, and they want it when they want it. And that's not necessarily always IT's concern, right? IT or admins have a lot of concern about platform stability and security. And those things don't have to be at odds, but they can be.
There's the issue of fragility. You know, when you're sharing your project with somebody else, are you confident that they'll be able to open it up and actually use it? Or, you know, might they open it up and find themselves stuck because they don't have the right package set?
And then lastly, sort of like, when you come back to a project in the future, right? Like, this is just a form of sharing with future you, but how confident are you that, like, your package set hasn't changed in the meantime, and you can sort of get back to the same state?
And so, to be completely clear, like, you can't fix all these problems. And particularly if you're just a lone data scientist, it may not be possible to fix all these. You may need some help from others at your organization, but you can set yourself up for success here.
Mise en place for data science
I love to cook. And when you're talking about cooking, there's this concept called mise en place, which means something like "putting in place" in French. At least that's my understanding. I don't speak French. But what it's really about is that you cut, measure, and prep all of your ingredients before you ever turn on your stove, right? So, by the time you're turning on the gas, by the time you're starting to cook, everything is in place and ready to go.
And so, you know, the idea is that you can't guarantee outcomes, but you can guarantee your process. That's what the mise en place step is meant to do when you're cooking. And you can do something very similar, right? Package management is very much like the mise en place of doing data science: it's getting all your ingredients, all your things in place before you go.
How repositories and libraries work
So, let's take a step back and talk a little bit about, like, how do the ingredients get there in the first place? And how do you make sure they're ready to go when you're doing data science? We're going to stick with this metaphor. So, your food starts at the grocery store, right? It's, like, packaged up. It's on the shelf. It's ready to go. And you go and get it. And then when you want to cook something, you go and buy the right ingredients, and you bring them home to your pantry, right?
And it's really the same for packages. For packages, you know, your grocery store is your repository. So, this is sort of an external place where you have all of your packages, and they're stored, sort of ready for you to go in an inert way. And then there's your library, and that is sort of your pantry where you keep your packages that are just for you and the things you're going to make.
And so, for many teams, the repository that they use is public, right? It might be public CRAN, public Bioconductor, or RStudio's public Package Manager. Other teams choose to maintain private repositories. But the libraries are always private. And the libraries are, you know, anywhere there's a kitchen, anywhere there's an actual R or Python process running, in the IDE, in a notebook, on RStudio Connect, in a Docker container, right, there's a library attached to that. And the shopping process is the install, right?
So, just to be a little more, like, precise about this, you know, I have these two sort of side by side, but really the way this works is that, like the grocery store and pantries, many libraries are served by one repository, right? One repository provides packages for many libraries.
So, a repository is just a file server. It has a bunch of directories indexed in a particular way. And so, just to be precise, this is the link. You'll note it's a URL, right? It's a real link to an internet site for a particular package on CRAN. Now, CRAN is the main public R package repository. And so, what you'll see here is that this is for Windows, it's using R version 4.1, which was the development version at the time, and it's for the A3 package, which I chose because it is alphabetically the first package on CRAN, at version 1.0.0.
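To make that concrete, here's a hypothetical example of what such a URL looks like, following the pattern described above (the exact path is illustrative, not the one shown in the webinar):

```
# <repository>/bin/<os>/contrib/<R version>/<package>_<version>.<extension>
https://cran.r-project.org/bin/windows/contrib/4.1/A3_1.0.0.zip
```

The path encodes everything the installer needs: the operating system, the R version, and the package name and version.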
One of the other important aspects of the repository is that the repository is not local. It is almost never the case that your repository is on the same machine as where you're actually doing your work. Very often, the repository is going to be, like, actually in the cloud. It's actually going to be a website, like that URL I just showed you. Or sometimes it's going to be an internal networked server.
That's very much in contrast to the library. The library is just a directory, but it's on the same machine as the running R process. And so, you'll see, like, this is a Linux path inside my home directory. It's the x86_64 Linux library. And you'll see this is specific to R version 4.1 and the A3 package.
And so, you know, one of the big differences here is that libraries are specific to the version of R, but not to the package version. And this is because R needs a way to resolve definitively to a particular package. So this is one of the big distinctions between the repository and the library: the library has at most one version of any package, while the repository has many versions of each package, and even many versions for different operating systems.
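As an illustration of the library side, here's a sketch of what this might look like on Linux (the paths are hypothetical, not taken from the recording):

```r
# Illustrative only -- actual paths depend on your system and R version.
.libPaths()
#> [1] "/home/alex/R/x86_64-pc-linux-gnu-library/4.1"   # user library
#> [2] "/opt/R/4.1.0/lib/R/library"                     # system library

# Each installed package is one directory, so this library can hold
# at most one version of A3 at a time:
# /home/alex/R/x86_64-pc-linux-gnu-library/4.1/A3/
```

Note that the path encodes the R version (4.1) but not a package version, which is exactly why only one version of a package can live there.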
How package installs work in R and Python
So, let's get a little into the nitty gritty here of, like, how these package installs actually work in R and Python. So, let's say I go to install a package. In R, I do that with install.packages. In Python, I do that with pip install package. And you'll notice, like, here is one of the differences, which is that in R, I do all of my actual package management and installation from inside R, right? I type install.packages at a running R interpreter. That's different in Python. In Python, you use a separate binary to manage your package installs. Most often, that's pip.
But in terms of what these tools do, they're conceptually identical. In either case, what they're going to do is look at your settings. In particular, in R, it's going to look at the repos option. In Python, it's going to look at your pip config. And from there, it's going to find the URL of a repository, which is going to be a real URL, either on the internet or on a networked server. And from there, it's going to move that package into your particular library.
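A minimal sketch of that lookup in R (the repository URL shown is an assumption for illustration, not from the webinar):

```r
# 1. R reads the repository URL from the repos option:
getOption("repos")
#> Something like: c(CRAN = "https://cran.rstudio.com/")

# 2. install.packages() downloads the package from that URL and
#    unpacks it into the first library on .libPaths():
install.packages("A3")
```

The Python equivalent would be checking your configured index with `pip config list` and then running `pip install <package>`.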
Now, when I go to actually load a package, I do that in R with the library() command. In Python, I do it with import. And so, what R or Python are going to do is look at, in R, .libPaths(), or in Python, sys.path. Hopefully, you've never had to look at these before, because they just resolve automatically. From that, R or Python will know which libraries to look in, will bring the package back into your running R or Python environment, and then you can use the functions from those packages.
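And the loading side, sketched in R (with the Python analogue noted in comments):

```r
# R consults .libPaths() to find the libraries to search;
# Python consults sys.path for the same purpose.
.libPaths()

# library() walks those directories, finds the installed package,
# and attaches it to the running session -- like `import` in Python:
library(A3)
```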
So, you know, this was a lot of nitty-gritty, in-depth stuff. And, you know, if this all felt like this was really exciting and interesting to you, congratulations. You're a huge nerd and you're in great company. But, you know, if this sort of went over your head a little bit or didn't feel that important, that's great. That's totally fine. You should never need to manage this directly. But I share all this mainly to make the point that you can see why this is a little bit frustrating, why there's no sort of like single silver bullet to make this all work. Successfully managing a package environment requires, you know, at least three different components, right? It requires the repository, the library, and the settings that tie it all together, right? All three of those things need to work.
The package management prime directive
That said, there is a happy path. It is possible to make this delicious sandwich, whatever it is. And it's not even that hard. But it's, you know, you're going to want to do some planning up front. You're going to want to do your mise en place. And so, we're going to talk for the rest of this talk about two ways that you can make this better. And the first is to develop a package management plan with your IT department. And the second is to have project specific libraries that you can manage as an individual data scientist.
And so, from here, we're going to get to the package management prime directive, which is a little grandiose. And so, the first is that, you know, repositories should be as broad as possible, right? These should be managed centrally, if they're being managed at all. And they should try and serve as many users as possible.
That is the opposite from libraries, which are specific. One other thing I should mention about libraries is that they are pull only, right? You have to pull packages to your libraries. And so, you know, that means that the data scientists should have control of what they pull into their own libraries. This is sometimes a point of friction between IT teams and data scientists. And it's worth sort of, you know, having a conversation with your IT team about how you can safely manage libraries as the data science team.
So, on your repository, like, you don't necessarily need to manage a repository. A lot of organizations choose not to manage a repository. If you are going to manage one, that's generally on IT or admins. And, you know, sometimes you have data scientists serving in this role of sort of the data science platform admin. And that's fine. But it's good to distinguish what you're doing as a data science platform admin versus as a data scientist. And so, most often, we see IT admins managing a repository. Because the job here is, you know, to stand up a server, to network it correctly, to make sure that it's available to users, to make sure that it stays up, has the appropriate amount of stability and security.
On the library side, that really falls to the data scientist to manage either a user level library or increasingly, and as I'm going to suggest to do here, a project level library, right? An individual library that corresponds to a single project is the safest path. Because then you know that even if you do other projects, you won't have to worry about breaking your project library.
Tools for managing repositories and libraries
So, on the repository level, you know, obviously, we recommend using RStudio Package Manager as the package repository for CRAN, Bioconductor, and PyPI repositories. There are also open source products like miniCRAN that you can use if that's what your organization or you prefer. On the library side, there's renv and virtualenv. In R, the two options that most people use to do this are either Packrat or renv. renv is a full replacement for Packrat and in fact was written by the same person. So, if you're using Packrat, upgrade to renv. If you're not managing project libraries, start doing it. Use renv.
On the Python side, there are a ton of options for how to do this. There's virtualenv, there's Conda, there's pyenv, there's all these different things. And if your organization has a standard, that's fine. Use what you want to use. But if you aren't using anything yet, I would encourage you to use virtualenv. It's relatively simple, and I've seen lots of organizations use virtualenv successfully.
And, you know, in terms of the settings, those can be configured inside the RStudio IDE if you're using the RStudio IDE. Or you can use config files like .Rprofile, Rprofile.site, .Renviron, or Renviron.site to set those settings.
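For example, a hypothetical .Rprofile entry that sets a default repository (the URL is illustrative):

```r
# Hypothetical .Rprofile / Rprofile.site entry -- point installs at a
# specific repository by default for every new R session:
options(repos = c(CRAN = "https://packagemanager.rstudio.com/cran/latest"))
```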
Why organizations manage private repositories
So, let's talk a little bit about why organizations want to manage private repositories. And the main reason here, right, is that when IT needs to control the set of packages that's available to people, this is how you do it. And I want to emphasize this: if you are in an organization where you cannot install R packages on your machine, you're in a broken state, and you need to figure out a way to convince your IT team to let you install things. It's really, really hard to work in that environment.
One answer is to get a centralized server. But the other answer is to allow installs only from a private repository, right? So, that's one reason we see people standing up private repositories: they're in offline or air-gapped situations. That is, there's no access to the internet from where you're installing packages. Now, that's common in high-security environments, and that's totally fine. The solution is to have a repository behind your firewall, allow people free and open access to that repository, and block them from using any repository outside the firewall.
Similarly, if you have to validate package sets in your environment, right, if that's why you can't install packages, again, the way to solve this problem is to have the repository be the gateway. The repository determines the set of packages people are allowed to install, for safety or security reasons. This can also work really well if you have projects that require package validation: once those packages have been validated, you can only use that set.
Demo: using renv for reproducibility
So, first, let's open up this very ominously named "package fails" project here. And, you know, what I want to point out is that this is some code in an R Markdown document, and it last ran in June 2018. I'm not going to get into really what this code does. Actually, what I'll say up front is that what this code does is pretty silly, so please don't judge me based on that. What I wanted was an example where upgrading the package versions would cause something to fail in a tidyverse package.
So, here's code that ran in 2018. And I'm going to try to run it today, in 2021. And what we're going to find is that this code no longer runs, because of an error that the argument "data" is missing. Now, I do know what's up here, but this is the kind of thing that can take forever to debug, where you just get stuck. The eagle-eyed among you will notice that the problem is actually that this argument was renamed to .data in a later version of tidyr.
And so, if I look here, what version of tidyr am I running? I am running tidyr 1.0.2. Now, let's look again at when this last ran: June 2018. And I happen to know that as of June 2018, the version of tidyr back then was 0.8.1. So, what's happened is tidyr has gotten upgraded, because I did something else on my system that bumped the version of tidyr. And now my code is broken. It's just broken. And that's really frustrating, right?
So, let's go to the much more friendly named renvWorking example here. And what you'll see is that this is exactly the same code, identical code. And this is going to run. rmarkdown::render. Boom. Done. And now I've got a version that works today, right? This ran just now, February 17, 2021.
And so, what's the difference, right? If you've been paying attention, I'm sure you realize the difference is that I'm using renv here. So, inside the working version, the version of tidyr is 0.8.1, right? Whereas in the non-working version, it had gotten upgraded in a way that I didn't really mean for it to.
And to be transparent about how this works, let's check out those .libPaths() values, which I mentioned is how R looks up where the library is. And so, you'll see the libraries in the broken version: this is my user-level library, and this is the system-level library. So, at some point since 2018, I've updated the version of tidyr in my user library, which is not very hard to imagine. On the other hand, inside the version that works, the first .libPaths() entry is inside my home directory, inside the renvWorking project. Inside this project, there's an renv library that is going to keep this package set safe for me for later.
So, that's great. renv has done its job. I am keeping myself safe. So, that solves one problem, right? My package versions will stay consistent as long as I keep this renv library with this project.
So, what happens now if I want to install another package? Let's say I want to install ggplot2. Oh, great. So, it installed ggplot2 2.2.1. And that's great, because ggplot2 2.2.1, I happen to know, is the version from June 2018. Currently, ggplot2 is on 3.3.3, which is not the version I want, right? That's too new. It might break things. So, how did I get the old ggplot2? If I look at my options for repos, that's the answer: my repo here is pointed at public RStudio Package Manager, and I happen to know that this URL format means this is a dated repo.
So, let's look at what that means. Here's RStudio Package Manager, the public one, packagemanager.rstudio.com. You can go there and use it as you wish. It's free and available to use. And what I'll see is that if I go here to the setup pane and go back to June 2018, there's that snapshot ID, 72, right? That corresponds to that date. So, what that means is that every time I install packages from this repository, they will work together as of June 2018. And so, I never have to worry about installing a new version that's going to mess things up, because they will always play nicely together.
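Assuming the URL format works the way the demo describes, pinning a session to that dated snapshot might look like this (the snapshot ID 72 echoes the one shown in the demo, but the exact URL format here is an assumption):

```r
# Point the repos option at a dated Package Manager snapshot, so every
# install resolves to the package set as of June 2018:
options(repos = c(CRAN = "https://packagemanager.rstudio.com/cran/72"))
install.packages("ggplot2")  # resolves to the 2018-era version
```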
The other thing I will say, and this is, again, a thing that's different between R and Python, is that in R, you need to specify that you want binary repositories for your operating system. They're specific to the operating system, and you can change that here to choose the right repository. This is really important, just because using binary repositories can really, really speed up install times by, like, several orders of magnitude. And so, it's worth it to choose a binary repository. That's not necessary for Python. Python, by default, will always give you the binaries for your system.
The renv lock file
Now we know why my tidyr works, but how is this being maintained? What renv does is save what's called a lock file. This lock file contains my version of R; it says it's R version 3.6.2. It has that repository in it, right, the repository that I had set when I started this project. And then I have a list of all of my packages, what version they are, and what repository they came from. If I have GitHub packages, it will record that as well.
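Here's an abridged sketch of what a renv.lock file like this might contain (the field names follow renv's JSON lock file format; the values echo the demo, and the repository URL is an assumption):

```json
{
  "R": {
    "Version": "3.6.2",
    "Repositories": [
      { "Name": "CRAN", "URL": "https://packagemanager.rstudio.com/cran/72" }
    ]
  },
  "Packages": {
    "tidyr": {
      "Package": "tidyr",
      "Version": "0.8.1",
      "Source": "Repository",
      "Repository": "CRAN"
    }
  }
}
```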
And so, this gets recorded in my lock file. And if I want to share this with my colleague, right, all I need to do is share the renv lock file with them. So, here's what I'm going to do. I'm going to take that renv lock file, and I'm going to copy it to the broken version.
So, all that I've done, right, this is the version of this project that does not work. tidyr is wrong here, right? This is, again, that wrong version of tidyr, the new version. And so, now what I'm going to do is run renv::init, which will stand up my renv environment, add it to this project, and restore it from the lock file. The other thing I just want you to notice is that restoring from the lock file was almost instantaneous. And the reason is that renv maintains a user-level cache, so if I've ever installed these packages before, they automatically get restored, and it is super fast.
And so, now if I check the package version: tidyr, 0.8.1. Awesome. Now I've got the right version. Now I can run this code, and it just works, right? So, this is really awesome, and it makes for an easy way to share a project and its environment with somebody else, or with future you.
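On the receiving side, the restore workflow just described boils down to a couple of calls (a sketch, assuming renv is available to install):

```r
# In the project directory that now contains the copied renv.lock:
install.packages("renv")  # once, if renv isn't installed yet
renv::init()              # sets up the project library and restores it
                          # from renv.lock (near-instant if the packages
                          # are already in renv's user-level cache)
renv::restore()           # re-syncs the library to the lock file later
```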
Starting a new project with renv
Great. Okay. So, one more question that comes up all the time is, like, okay, great. How do I do this for a new project, right? Like, how do I start this afresh? And so, it's really quite easy. When I go to new project here in the RStudio IDE, new directory, I'm going to start a new project. I'm going to call this my new project. And now, all I do is I'm going to click here, use renv with this project. And now, renv is going to be set up with this project as it starts here on RStudio server. It's going to set up a new project for me with my sort of folder there with my Rproj file, the project folder, and an renv library that is attached to the project by default, which is really sort of a nice thing to do here.
Now, what this means is that whenever I go to install packages, it will install, again linking in from that cache, super quick. And so, that's part of the value of using renv. Now, what if I want to use a snapshot of the repository? There's a little bit of judgment here around what date to choose. You want to choose a date that works for your project. My general recommendation is to choose a date around the time your project is pretty stable. So, you probably don't want it to be the day you start the project, but maybe when the bulk of the coding is done and you're thinking, okay, it's time to save this for later.
You can do renv::modify and go in here. Here's my renv lock file, and I can edit it. Let's get a repository that's more up to date. Let's say I want to use yesterday's date. I can get the right repository URL for that date and just put it in the lock file here. And now, when I start a new session, it's going to automatically set my repository to be the right repository.
I'm going to take just a second here to reload. But after it reloads, what you'll see is that it's going to be that updated version of the Package Manager repository. One other thing to be aware of is how to capture dependencies. I'm going to save an R file here, and now, when I do renv::snapshot, it's going to introspect on those files. And it's realized that I need ggplot2 in here, and all of its dependencies. So, this makes it really easy to use. You can also set up auto-snapshots so that it will snapshot any time you run install.packages inside this project folder.
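The dependency-capture step just described can be sketched like this:

```r
# renv scans the project's .R / .Rmd files for library() and :: usage,
# then records the versions currently in use into renv.lock:
renv::snapshot()

# Optional: preview which files pull in which packages before snapshotting.
renv::dependencies()
```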
Wrapping up
Great. So, that's basically what I wanted to share in terms of using renv. And so, we're going to wrap up now. And so, just a couple of things to review here.
So, there are two things to encourage you to do. Do a package management plan with your group, right? How are you managing repositories? What are the settings supposed to be, the default repos? And how are you managing libraries? And before you start your individual data science work, do your mise en place. Use renv. And, you know, I just showed off how to use renv for R. The process of using virtualenv with Python is really, really similar.
If you want more information about this, we have a website, environments.rstudio.com, where you can feel free to check out more information. And if you're interested in RStudio Package Manager, if that seems like something your team could use, feel free to send us an email at sales@rstudio.com, and we can talk more with you there.
Q&A
Also interested in RStudio's Docker images, but they are advertised as experimental. Any idea of when they will be out of experimental? No. I don't know that we have any plans to make them, you know, more strongly supported than experimental now, for the time being. At this point, you know, if you want to use them, you should feel free to use them. But no imminent plans to stabilize them as far as I know.
Oh, this is a great question. Somebody asks: interested in the advice that users should manage their own libraries. To that end, what's your advice on the role of a centrally managed site library? This is a great question. The short answer here is that I do not believe it's necessary. I strongly recommend against trying to disallow users from being able to pull their own packages, right? Data scientists should be able to manage their own specific packages for themselves. And the reason for that is that, for example, let's say I have just a centrally managed set of packages. Then somebody has to do that management, right? Then IT has to manage that library, which can be very difficult: managing not only which packages, but which versions, what's installed, what date it was installed, and how that corresponds to each project. This gets really complicated really, really fast. And so, I strongly recommend against having all the packages in a centrally managed site library.
So, there's another version of this that's a little bit softer, which is, like, an admin stands up a sort of starter set of packages that are centrally installed, and then users can install their own packages on top of that, but, like, they can use the starter set if they want to so that, you know, that works more easily. There's nothing wrong with doing that. It's totally fine. In most cases, it is unnecessary, we tend to find.
Here's another great question: does what renv saves include all of the packages that are installed, or only those actually used inside a project? Yeah, this is a great question. By default, if you do a snapshot, it snapshots only those packages that are in use at the current time. And what that means is that it can find them somewhere in a library() call or a :: call, right? If it can find either of those, it will capture those packages as dependencies.
This does sometimes get people into trouble. If people are putting install.packages calls inside their scripts, those will not get picked up by renv, and it's also a bad practice, right? If you're doing this, don't. Use renv instead. It works way better, it captures versions and not just packages, and it doesn't mess things up, right? A stray install.packages call like that could mess up somebody's entire package environment. There is a way in renv to turn on auto-snapshotting, which means any package that gets installed will get snapshotted. And so, that can end up being more than the packages you actually need. And then there's a prune command, basically, that allows you to prune the renv library back to only the things that are actually being used.
Do you have a free version of package manager? Yes, it's the public one. RStudio package manager is public package manager. If you want something behind your firewall, an internal version, that is where our pro product comes in.
In Package Manager, we can control the list of R packages it caches; when will it do the same for Python packages? No guarantee on when. It is under active development by the Package Manager team and will be out shortly. For those of you who aren't quite aware, Package Manager, the private version, supports Python repos, PyPI repos. Currently, though, it only supports full mirrors. It does not support subsets and that sort of thing. This will be forthcoming in future versions of Package Manager.
What is your recommendation for using a mix of validated and public packages in one environment, in a highly regulated environment? Yeah, this is a great question. So, there are a bunch of different ways to do this. The most common way we see this done, and let me go back to our internal demo version of Package Manager, is that you can create a validated repository here inside RStudio Package Manager, and that can sit alongside your other repos, right? You can have multiple repos, and so you can do this yourself.
You can create a validated repository that has only those validated packages in it, right, with the versions and packages specified, and you can have that sitting beside a full CRAN mirror. So, what you can do then is say, okay, for the validated projects, use the validated repository. And again, just go to the setup tab here, make sure you're using the appropriate binary repository, and choose a date if you want to. And if you're doing a more public project that doesn't need that high degree of validation, just use the all-in public repository.
I know we did not have time to get to all the questions here. Unfortunately, we're, I think, out of time. I will be putting together some more materials on this topic and sort of emailing them out, so I really appreciate all the questions. I will take a look at all these questions, and we'll probably include a little FAQ and some of the future documents we send out, and we'll try and address as many of these as possible. Again, really, really appreciate everybody's time today, and for sticking with us through all the technical difficulties today. I've never quite had a webinar go like this, but I really appreciate everybody sticking with us, and hope you got something out of the time today. So, thanks so much for joining, and look forward to hearing from you more soon.
