Resources

Hadley Wickham - R in Production

R in Production by Hadley Wickham

Visit https://rstats.ai for information on upcoming conferences.

Abstract: In this talk, we delve into the strategic deployment of R in production environments, guided by three core principles that elevate your work from individual exploration to scalable, collaborative data science. The essence of putting R into production lies not just in executing code but in crafting solutions that are robust, repeatable, and collaborative:

* Not just once: Successful data science projects are not one-offs; they will be run repeatedly for months or years. I'll discuss some of the challenges of creating R scripts and applications that run repeatedly, handle new data seamlessly, and adapt to evolving analytical requirements without constant manual intervention. This principle ensures your analyses are enduring assets, not throwaway toys.

* Not just my computer: The transition from development on your laptop (usually Windows or Mac) to a production environment (usually Linux) introduces a number of challenges. Here, I'll discuss some strategies for making R code portable, how you can minimise pain when something inevitably goes wrong, and a few unresolved auth challenges that we're currently working on.

* Not just me: R is not just a tool for individual analysts but a platform for collaboration. I'll cover some best practices for writing readable, understandable code, and how you might go about sharing that code with your colleagues. This principle underscores the importance of building R projects that are accessible, editable, and usable by others, fostering a culture of collaboration and knowledge sharing.

By adhering to these principles, we pave the way for R to be not just a tool for individual analyses but a cornerstone of enterprise-level data science solutions. Join me to explore how to harness the full potential of R in production, creating workflows that are robust, portable, and collaborative.

Bio: Hadley is Chief Scientist at Posit PBC, winner of the 2019 COPSS award, and a member of the R Foundation. He builds tools (both computational and cognitive) to make data science easier, faster, and more fun. His work includes packages for data science (like the tidyverse, which includes ggplot2, dplyr, and tidyr) and principled software development (e.g. roxygen2, testthat, and pkgdown). He is also a writer, educator, and speaker promoting the use of R for data science. Learn more on his website, http://hadley.nz. Mastodon: https://fosstodon.org/@hadleywickham

Presented at the 2024 New York R Conference (May 17, 2024). Hosted by Lander Analytics (https://landeranalytics.com).

Jun 11, 2024
41 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Our next speaker represents the year 2018. And his current favorite spirit is Rum Fire. Please welcome Hadley.

Hey, everyone. So today, I wanted to talk about putting R in production. And this is a talk about a topic I don't really know anything about, and I've never done it myself. So I'm not going to be showing you any code today. What I'm going to be showing you is my efforts to understand what this thing is all about, with the hope that a few months down the road, it'll start to maybe influence some open source packages or maybe some of Posit's pro tools.

And I also want to say, I'm going to be talking about R in lowercase p production, not R in uppercase p production. And I think this is an important distinction to make if you're talking to folks in your IT organization or your DevOps organization. Because when they think production, they're often going to think capital P production.

And what's the difference? The difference, I think, is basically paging. The difference is paging. So when something's in capital P production, it means it is so vital to the health of your organization, to the correct operation of your organization, that if it stops working, someone is going to tell you, regardless of whether that's 3 PM on a Monday afternoon or 3 AM on a Sunday morning. That's what capital P production means, that your code is vital to the health of your organization.

And so I think you certainly can put R in production. But you probably don't want to, not because there's anything wrong with R, but because you don't want to put data scientists in production.

Because your data scientists don't want to be woken up at 3 AM on a Sunday morning, right? That's not part of their job description. So it's totally fine to put data scientists in lowercase p production. In fact, it's really important, because your data scientists are going to be producing things that are really useful for your organization. Those things are not so vital that someone has to be woken up if they break, but they're dashboards that people are going to be looking at every day or every week. That's really important.

And you absolutely can put R into lowercase p production, because the dirty secret of most organizations is that they already have Excel in production. The number of organizations where some semi-automated Excel spreadsheet is really, really important to someone in the executive part of the company is pretty high.

What does "production" actually mean?

So with that said, what does it mean to put something in lowercase p production? Well, I'm not 100% sure, but I can tell you what it's not. First of all, something in production isn't run just once. If you have a successful data analysis, it's going to need to be run again, and again, and again, and again.

It's also not going to be just run on your computer. It's typically going to need to be deployed somewhere else, like deployed in production. That means it's typically going to be running on a server somewhere, not on your laptop.

And then finally, if something's in production, it's no longer of importance to just you. You're not just the only person involved in it. You've got folks upstream, like data engineers. You've got your data scientist colleagues who you need to be able to collaborate with. And you have folks downstream, like the decision makers in your organization, or maybe other developers in your organization who want to be able to use the results of your models through a traditional API.

The ice cream prediction scenario

Now today, I'm going to focus on these two. So what are the challenges that arise when your R code is no longer running just once on your computer?

And so what I want you to do is to imagine that you're a data scientist who works for an ice cream company. And what you want to do is predict. You want to help your business by predicting the sales of ice cream tomorrow, maybe based on the sales of ice cream today, the temperature today, and the forecast for tomorrow. So you collect a bunch of data. You write a bunch of code. You fit some models. And then you come up with a beautiful dashboard.

And this is great. You know, you've taken into account important variables and made something that's really useful for your organization to make better decisions. But now you've got to take that dashboard, and it needs to work like every day. You need to kind of productionalize this so that ideally, your code stays the same. You get some new data, and it produces the correct dashboard.

And now of course, going into this, like you know the data is going to change, right? Because obviously, the ice cream sales are going to drop in winter. But a big risk is that the schema of the data might change.

So imagine, you know, obviously you are probably not out there recording the temperature each day. You're using some API to get weather data. And so far, you've been working with it. You've got the dates in month, day, year format, and the temperature. And you know, your API changes. It's now going to give you the data in a slightly different format.

So I'm going to give you all a quick challenge. What I want you to do is compare the old data on the left to the new data on the right. Like, can you figure out what's different, what's changed? And how is this likely to impact your analysis?

Who wants to yell out one difference they spotted? The column names, right? We've changed temperature to temp. So what impact is this likely to have on your analysis? It's probably going to break, right? It's probably going to break actually in quite a good way because it's going to say, oh, you don't have a variable called temperature anymore.

What else has changed? The date format, right? We've gone from American style to ISO 8601, the only true way of recording dates.

So what impact do you think this is going to have on your analysis? Like, you might be lucky, right? Maybe when you read this data in, you're just relying on whatever library, whatever package you're using to read this to automatically recognize this as a date. So this is still a date. So it might just work. Or if you've explicitly specified that this is month, day, year, this is going to cause an error. Again, probably quite a clear error.

What about the last change? I'm guessing it's probably gone from Fahrenheit to Celsius. And what impact is this going to have on your analysis? Your model's just going to give nonsense results, right? This is the worst case scenario because you're not going to get an error. It's just going to silently start returning garbage. And so I think this sort of change, like when I talk to folks who are actually doing data science, this sort of change is, I think, the most painful thing that people face today.
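One defensive habit that catches two of these three changes is reading the data with an explicit schema rather than relying on type guessing. A minimal sketch using readr (the file name and column names are assumptions based on the example above):

```r
library(readr)

# Spell out exactly the columns and formats we expect, so a renamed column
# or a changed date format errors loudly instead of being silently guessed.
weather <- read_csv(
  "weather.csv",
  col_types = cols(
    date = col_date(format = "%m/%d/%Y"),  # explicit US-style dates
    temperature = col_double()
  )
)

# Fail fast if the expected columns are missing after the read
stopifnot(all(c("date", "temperature") %in% names(weather)))
```

Note that parsing alone can't catch the worst change, Fahrenheit to Celsius; only a validation rule about plausible values can flag that.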

De-risking schema and dependency changes

So how can we de-risk that? Well, the first piece of advice probably no one is really going to appreciate. But my first piece of advice is you need to make friends. You need to make friends with the people who are producing your data, so that they at least know you're relying on it and give you a heads up when something changes. There's a really big people component to this: you have to be talking to those people, you have to be collaborating with them. Technology can certainly prevent the worst footguns, but it's not the real solution. It's going to alert you to a problem, but it's not going to help you solve it.

Technologically, what you can do is use a package like pointblank (in R) or Great Expectations (in Python) to define the schema that you expect for your data. These are really useful tools because they allow you to make precise exactly what you expect the data to look like. And then if the data changes, you're going to get an error. This doesn't solve the problem, right, but it ensures that your analysis isn't just going to silently give nonsense results.
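As a rough sketch of what that looks like with pointblank (the table name, column names, and the plausible-temperature range are illustrative assumptions, not from the talk):

```r
library(pointblank)

# Declare what the data should look like, then interrogate it.
agent <- create_agent(tbl = weather_data) |>
  col_exists(columns = c(date, temperature)) |>   # schema: columns present
  col_is_date(columns = date) |>                  # schema: correct type
  col_vals_between(columns = temperature,         # sanity range in Fahrenheit;
                   left = -20, right = 120) |>    # helps flag unit changes
  interrogate()

# Halt the pipeline if any validation step failed
stopifnot(all_passed(agent))
```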

You're still going to have to make friends. You're still going to have to make friends with whoever changed the data and figure out what happened. But at least you know there's a problem.

OK, so what else can go wrong? Well, maybe a dependency changes. A new version of your favorite package comes out with amazing new features, but it breaks your code in some way, which is really frustrating, right?

So how do you de-risk this? I think there's some pretty good tools available. I think this is basically a solved problem for reasons I'll talk about shortly. But the way you solve this problem is by using a virtual environment. And what a virtual environment does is just captures for this project, what are the versions of R packages or Python packages that you're using. So you no longer are using the latest version of everything or whatever collection of packages happens to be installed on your computer or whatever. You're using a specific known set of packages.
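In R, the standard tool for this is the renv package. A typical workflow looks something like:

```r
# renv gives each project its own package library and a lockfile
# recording the exact versions in use
install.packages("renv")

renv::init()      # create a project-local library and renv.lock
# ...install and use packages as normal...
renv::snapshot()  # record the exact versions the project uses
renv::restore()   # on another machine (or later), reinstall those versions
```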

And you need to do this anyway when we get to the not-just-on-my-computer problem, because if you're going to run this code somewhere else, you also need to capture those dependencies and ship them somewhere else. And the reason I think this is basically a solved problem is that when you deploy something to Posit Connect, it automatically creates a virtual environment for you. Even if you're not already using one, it's going to capture the exact versions of all of your dependencies. And then every time your report is rerun on Connect, it's going to use exactly the same versions of those dependencies. And that seems to have, by and large, fixed this problem for most people. This is not what's causing pain for most data scientists.

Now, there's kind of another related problem. It's not just the dependencies, the R and Python packages you're using; it's the entire stack of computation that eventually leads to some electrons moving around on a chip. Maybe Apple has come out with an amazing new laptop with a new chip, and that's going to cause some of your results to change. Or maybe there's just a new version of the operating system, and something about the way it optimizes matrix algebra means you get slightly different numbers. Or something else in your dependency stack changes: one of your system libraries, or maybe the R or Python version itself.

So this, I think, can occasionally be a problem. It's pretty rare these days, because we have a good solution for it, which is running containers on commodity hardware. A virtual environment is going to capture your R and Python package dependencies. A container captures everything else: the version of the operating system, the version of BLAS, which is powering your linear algebra computations, and so on. So capturing the package dependencies solves, I don't know, 95% of these problems. Capturing everything else in a container solves another 4% or 4.9%. And then you've just got a very weird long tail of other problems. But again, mostly a solved problem.
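A hedged sketch of what capturing "everything else" can look like, using one of the rocker base images (the image tag, system libraries, and script name are assumptions):

```dockerfile
# Pin the R version, OS release, and BLAS build via a versioned base image
FROM rocker/r-ver:4.3.2

# System libraries that source installs may compile against (assumed needs)
RUN apt-get update && apt-get install -y --no-install-recommends \
    libxml2-dev libcurl4-openssl-dev \
 && rm -rf /var/lib/apt/lists/*

# Restore the exact R package versions recorded in the renv lockfile
COPY renv.lock renv.lock
RUN Rscript -e 'install.packages("renv"); renv::restore()'

COPY forecast.R /app/forecast.R
CMD ["Rscript", "/app/forecast.R"]
```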

When the universe changes

A much bigger challenge is that the universe is also going to change. So there might be a big change. Maybe your ice cream shop has moved to a new location on the beach. That's obviously going to have profound impact on how many ice creams people are buying.

But even when there isn't a profound change, every model is necessarily a simplification of the universe. It's kind of a Taylor series approximation. And as you move further and further away from that approximation point in time, the approximation is probably going to get worse and worse. Maybe not by a huge amount, but it's going to slowly get poorer and poorer.

So you might have heard terms like concept drift, model drift, or data drift. I think these are all ideas reflecting that the universe is changing. It doesn't matter how good your model is today: if you're not regularly refitting it with new training data, and occasionally even rethinking the whole functional form of the model and rebuilding it from scratch, it's just going to gradually drift into irrelevance.

So how do you de-risk this? I think the way you de-risk this is, you just have to acknowledge it. A model is not a set-and-forget type thing. You're going to have to monitor it over time. As well as the dashboard that uses the predictions from the models to show to the decision makers in your organization, you also need a dashboard for you to look at as a data scientist that tells you about the model metrics.

So there's tons of tools for this. I know about two of them, because they're produced by my colleagues at Posit. The first one is Vetiver, a toolkit for R and Python that makes it really easy to create this model monitoring system, those regularly updated reports about the quality of your model.

And another really interesting package is called applicable. The idea of applicable is that it tries to help you detect whether the data you're making predictions on has drifted a long way from the data you used to fit the model. Have you moved from an area of confidence, where you've got tons of data, to an area where you're now extrapolating far from the data? And I think this is really important. This seems to me like something you want: every time you make a prediction, you really want to know, do I think this is a good prediction?

And a model can't kind of reach outside of itself and say, oh, this is not in the model. This is kind of one of the big problems with LLMs right now. They very rarely say, I don't know. They just make something up. That's a characteristic of every model. But you can use tools like Applicable to at least say, well, the data point I'm making a prediction about is a long way from the data that I used to fit this model. And I probably need to be pretty skeptical about that.
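A sketch of how that check might look with applicable (the formula, data frames, and variable names are assumptions; treat the exact output columns as approximate):

```r
library(applicable)

# Build a PCA-based reference from the training data...
ref <- apd_pca(~ sales_today + temp_today + temp_forecast,
               data = training_data)

# ...then score new observations by how far they sit from that reference.
scores <- score(ref, new_data)

# Large distances suggest the model is extrapolating,
# so its predictions deserve extra skepticism.
scores
```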

When requirements change

And I've talked about models here, but the same thing applies if you're creating a dashboard or a visualization, right? You've created a dashboard that presumably shows the most important things, and those most important things are probably going to change over time.

And I think this is something that really surprised me when I learned about how people use dashboards in real life: if a dashboard is successful, it's typically not static. It's going to change over time. And part of this is almost inherent in the nature of a good visualization or a good dashboard, right? If it's successful, people are going to make decisions based on it that are different to what they would have made without it. And so now you've run into this fundamental issue that the world after the dashboard is different to the world before the dashboard, because people are making different decisions. And that's probably going to affect how your model sees the world.

And the last and perhaps messiest problem is that the requirements change. In some ways, the worst thing that could happen to your dashboard is that it becomes so successful that your fat cat CEO is now looking at it every day. And they have a constant stream of comments and feedback and changes that they want made to that dashboard.

And I think this is challenging not just because you've got to make changes, but because it's not something you learn as a data scientist in your college courses. How do you respond to people asking you to make all of these changes? Just the way you think about the problem, the way you respond to people asking you for help, particularly people above you in the organization. How do you make sure that you can still do your actual job while responding to the stream of minor changes from people higher up in the org chart?

I don't really know how you de-risk this entirely. But I think one big part of it is the idea of refactoring: spending some time working with your code, not to add new features, not to fix bugs, but to make that code easier to maintain. That's the true idea of refactoring, to spend time improving the quality of your code so that it's easier to change in the future.

You have to recognize and embrace the fact that everything you do is going to have to change in the future. And ideally, you want your velocity to increase over time as you get a better understanding of the domain and become a better programmer. You want to get faster at making these changes, not slower as a massive backlog of things starts to drag you down, where every time you go back to an old project you just feel awful because you don't really understand how it works, and you're worried that any change you make is going to make the problem that much worse.

And I think software engineers, or at least software engineering managers, understand a bit better how important this is: making changes for the sake of future change. But I don't think that's necessarily something data scientists and data science managers know how to talk about and justify.

Summary of challenges

So these are the five things that I think are the biggest challenges when you have an analysis that's successful enough that you do it again and again and again and again. The schema changing, I think, from my conversations with people doing data science out in the world, this is the number one pain point right now. Dependencies and platform changes, sure, they're kind of annoying. But we've got good technology to mostly make those problems go away.

The universe changing: I think the way most people solve that is by just closing their eyes. And the way people handle requirements changing is by covering their ears. But both are really important.

Again, I don't really know. I've never put code into production, really. Well, I guess I have, once. I'll tell you the one thing that was closest to production: there was an artist that I really liked, he was on TikTok, and anytime he posted anything on his website, it would sell out in like 15 minutes. So what I did is I wrote an R script that ran every 15 minutes on GitHub Actions and scraped his website, and it was supposed to text me when anything changed so I could quickly go and buy it. And I never managed to get that to successfully work. It did actually scrape and collect some data, but I could never actually get it to send me the information to make a decision. But because I was checking on it so often, I noticed when the site updated and managed to buy some art.

So again, I'm no expert in this. So I'd love you to just take a minute or two, talk it over with your neighbor. Like are there problems that you've faced putting R in production that you don't think I've captured here?

Anyone want to yell out any ideas?

So one thing that I think you need to address here is also the code itself, which you're not explicitly talking about. Number one, your code needs to be able to run unattended, because it's going to be run on a machine somewhere and it should not expect any sort of user input. Number two, parametrization. And some logging, so that you can debug. Right, we're actually going to cover most of those points in the next part of the talk.

Anything else? Data contracts are often held up as a solution to the schema problem: instead of connecting directly to their database, you ask the data producers to build an API, and then that contract is something you can hold them to. So do you think of contracts as more of a people thing or a technology thing? It's an agreement among people for how technology is used.

Anything else? Your team changes. People can leave, and the institutional knowledge of the project leaves with them. Yeah, that's a really good one. That's sort of covered in the "not just me" principle, which I'm not going to talk about today.

Running code somewhere else

So those are kind of the problems related to running your code repeatedly. Now I want to talk about the problems more related to running your code somewhere else. These problems aren't independent, but I think this breakdown is still useful, because I think there are three possible places your code might be running. Not every organization has all three of these, but I think many organizations do.

Like you might be running code on your laptop, right? This is just kind of like the wild west of data science. And depending on your organization, you might be able to do whatever the hell you want on your laptop.

You might also have some kind of central compute service. So maybe that's some instance that's running on shared hardware that's fast or on some kind of container setup that's all centrally managed so everyone's using the same versions of things. And then you're often going to have, or hopefully you have, some kind of like combination of staging and deployment environment. So these tend to be like run unattended.

And you really want both of these. They're basically identical, but the main difference is that data scientists look at the staging environment, and decision makers look at the deployment environment. Keeping them otherwise identical means you can send something to staging and look at it to see whether it's working, without accidentally breaking the things that the people in your organization use to make decisions.

And I've drawn these as pretty thick lines here because there are some fairly big challenges in each of these transitions. One of the goals of Posit, the company I work for, is to help with each of them. Posit Workbench helps you run the RStudio IDE, Jupyter Notebooks, or Visual Studio Code in a standardized way that's centrally maintained. Posit Connect allows you to deploy things like Shiny apps, R Markdown documents, Quarto documents, Dash apps, and Flask apps, in a way where you can work on your laptop or central compute and get things published as easily as possible.

OK, and so what I want to talk a little bit about is like how do we cross these various chasms? And I think just the first thing is just to be aware of them. Right, at each of these points, there's some pain associated with it. And particularly like if you are a Windows user, like Jared, there's going to be some pain. Because now when you deploy, when you're using a central compute or deployment, it's almost certainly going to be in Linux. There's also some differences just between desktop machines and servers generally.

Then, as Mark pointed out, there are some major pain points moving from an environment where you can debug interactively to an environment where you run a script and get an error message back 30 minutes later. And there are also some similar challenges related to who is doing the analysis. Typically, on your laptop or in central compute, the person doing the analysis is you: it's your access to the database that determines what data comes back. When you start moving into a staging or deployment environment, you need to think about whose access is being used to get the data. This is particularly important when the data is sensitive and different people in your organization can see different views of the data set.

Windows vs. Linux and desktop vs. server

So the first challenge to overcome is if you're a Windows user on your laptop, there's a bunch of things that are just different on Linux for like historic reasons. None of them are particularly important, but they're all annoying and they will all catch you out. Like Windows uses different line endings. It uses a different character encoding in R, and it uses a different path separator. And all of these are just like little things you will stub your toe on again and again and again and again.

And I guess I stub my toe on these things too because I generally write my code on a Mac, which for these purposes is functionally a Linux. And then when I test my code on Windows machines, I'm like, oh, the path separator is wrong. Oh, there's some weird thing in my test snapshots because the line endings are different.
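The path-separator problem, at least, has an easy fix: never paste separators by hand. Base R's file.path() and the fs package build paths that work on every platform:

```r
# Portable: no hard-coded separators in the string literals
data_file <- file.path("data", "raw", "weather.csv")

# The fs package goes further and normalizes whatever it is given
library(fs)
path("data", "raw", "weather.csv")
path_norm("data\\raw\\weather.csv")  # backslashes become forward slashes
```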

There are also differences because your computer is a desktop and the central compute is a server. A desktop is set up for you: it's set up to use your time zone, it's set up to use your language of preference, it has all of the fonts you've installed, and it has graphics devices that lots of people use and that a lot of care has been put into. When you move to a server, it's probably defaulting to the UTC time zone, which is basically the same as Greenwich Mean Time. It's probably using the C locale, so it just sorts things in ASCII order. It probably doesn't have many fonts installed, and it probably has kind of crappy graphics devices. The first two can cause problems for your code.

One of the things we've tried to do in the tidyverse is make sure, as much as possible, that you don't automatically inherit those settings. For example, there's a really interesting problem: some locales sort the letters of the alphabet differently to other locales. The order of a factor in R is by default alphabetical, and that order determines the contrasts of a linear model. So it is possible, and it has happened to people, although it's pretty unusual, that by running code in a different locale you get different contrasts output from your model. The model's still the same, but it's confusing for people looking at the outputs.
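The practical defence is to pin these settings in the project itself rather than inherit them from the machine. A small sketch (the time zone, locale, and factor levels are illustrative assumptions):

```r
# Pin the time zone and collation locale so behavior doesn't depend on
# whether the code runs on your laptop or a UTC, C-locale server
Sys.setenv(TZ = "America/New_York")
Sys.setlocale("LC_COLLATE", "C")

# Give factors explicit levels instead of relying on locale-dependent
# alphabetical order, so model contrasts are identical everywhere
df$flavor <- factor(df$flavor,
                    levels = c("vanilla", "chocolate", "strawberry"))
```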

And the last two are just pain points. Like if you're trying to make sure all of your plots look exactly the same using your corporate style guide, your approved fonts, this is just a pain to get right when you can't easily install fonts on your computer.

Package installation and interactive vs. batch debugging

So one of the big differences between the way R packages work on Windows and Mac versus Linux is that on Windows and Mac, when you install a package like xml2, which uses the libxml2 library, libxml2 is bundled inside that package, and you can just use it without having to worry about it separately. On servers, you're generally going to install a package from source, which means you need to compile it, which means you need all these additional bits of software installed on your computer. That's changing a little bit: we've done a lot of work with Posit Package Manager and Posit Public Package Manager to make this easier. And there are some great tools in the pak package, which will tell you all of the system dependencies you need to install a package and all of its recursive dependencies. But it tends to be a bit more painful, and it might be something you need to get a server admin involved with. It's not something you can necessarily do by yourself anymore.
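The pak tooling mentioned here can be sketched like this, run from R (xml2 is just the example package from above):

```r
# List the Linux system libraries needed to build xml2 and all of its
# recursive dependencies from source
pak::pkg_sysreqs("xml2")

# pak can also install the package, and, where it has permission,
# the required system libraries alongside it
pak::pkg_install("xml2")
```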

On your laptop, you typically just install all of your R packages in one library. Obviously, now on a shared server, you're gonna have lots of different libraries, because not everyone's gonna agree on that. Similarly, it's a wild west on your laptop. You install packages from GitHub, you install them from CRAN, you install them from Bioconductor, you install them from like some random site you found on the internet. Hopefully on your central compute, that's gonna be a little bit more locked down.

So that was kind of the transition from your laptop to central compute. There's another transition from central compute to deployment. And that is you are no longer in an interactive environment. So when something goes wrong, when you get an error, you can no longer say, oh, give me a trace back, or let me browse into this function and fix it.

So debugging something that's failing on some other computer is extremely frustrating. If you develop R packages, the way you often experience this is that something fails in your automated checks on GitHub Actions, and it fails in a way you can't reproduce locally. Now your iteration speed, instead of typing a few ideas into the console and getting answers back in seconds, is a 20 or 30 minute loop. That's bad not just because it's so long, but because it's long enough that you go do something else, forget about the thing you were trying to do, come back to it two hours later, and you've lost all the state you had in your head.

So there are definitely some techniques that are different. When you're in an interactive debugging environment, it's all about very quickly coming up with hypotheses about what's going wrong, testing them out, and then discarding them; because you're in that short iteration cycle, you can do that very simply. When you're in a batch scenario, it's more about brainstorming every possible thing that could be causing the problem, trying to write some code that can identify all of those cases at once, and then deploying that, so that when you get the information back in 20 minutes, you haven't just answered one possible question, you've answered ten.
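In practice, that "answer ten questions at once" idea can be as simple as dumping every piece of state you can think of in a single run before the step that fails. A hypothetical sketch (`my_data` stands in for whatever your job actually reads):

```r
# Instead of testing one hypothesis per 20-minute cycle, print
# everything that might explain the failure in one go
message("R version:    ", R.version.string)
message("Platform:     ", R.version$platform)
message("Locale:       ", Sys.getlocale("LC_COLLATE"))
message("Time zone:    ", Sys.timezone())
message("Working dir:  ", getwd())
message("Files here:   ", paste(list.files(), collapse = ", "))

my_data <- mtcars  # stand-in for the data the server actually sees
str(head(my_data)) # structure and types as seen in the batch environment
```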

And the other technique that starts to get really useful is logging: just recording what is happening in your code and where it has got to, so that when it goes wrong, you have some sense of how you got to that point.
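A minimal version of that logging idea, using only base R (packages like logger add levels, formats, and destinations, but even this much helps):

```r
# A tiny logging helper: timestamped messages go to stderr, which most
# deployment platforms capture in their logs
log_msg <- function(...) {
  message(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), " | ", ...)
}

log_msg("Starting nightly report")
log_msg("Read input: ", nrow(mtcars), " rows")  # mtcars as stand-in data
log_msg("Fitting model")
log_msg("Done")
```

When the job fails, the last line in the log tells you how far it got before things went wrong.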

There are certainly some really cool technologies out there, like time-travel debugging, where it's theoretically possible in some languages to capture a failure on a server and bring it back to your local computer in a way that you can interactively debug. That seems like a dream, but not something that's easy to port to R, unfortunately.

Authentication and access control

And then the other place this difference comes up is when you're authenticating against some server. The easiest way, the way that works just about everywhere, is to stuff credentials into environment variables. That works locally, that works in your deployment environment, that's all fine and dandy, but it's not really a great way to do auth, because a lot of these credentials should ideally be changing over time. They should be rotating, and that means you need to update them. And if you've relied on that password in 15 different data science projects, now you have to go through and update all 15 of them.
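The environment-variable approach looks something like this (the variable name is hypothetical). It works everywhere, which is exactly why the same credential ends up spread across so many projects:

```r
# Read a credential from the environment rather than hard-coding it.
# Locally this might come from ~/.Renviron; in deployment, from the
# platform's environment-variable settings.
api_key <- Sys.getenv("MY_SERVICE_API_KEY")
if (!nzchar(api_key)) {
  stop("MY_SERVICE_API_KEY is not set")
}
# ...and every project that reads this key must be updated when it rotates
```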

So ideally, you're using something a bit more interactive, something like OAuth, where you do a little interactive dance to give an application permission to do stuff on your behalf, and it gives you something back. This all happens behind the scenes for you and is very easy to do interactively. But if you're running on a server, you can't do that interactive dance. And that also forces you to think about: who is doing this? Who is seeing this analysis?
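For the non-interactive case, one common workaround, sketched here with httr2 (the client id and URLs are hypothetical), is a machine-to-machine flow like client credentials, where the deployed job authenticates as itself and no browser dance is needed:

```r
library(httr2)

# A non-interactive OAuth flow: the client authenticates as itself,
# which suits servers and scheduled jobs
client <- oauth_client(
  id = "my-deployed-job",                       # hypothetical client id
  secret = Sys.getenv("OAUTH_CLIENT_SECRET"),   # injected by the platform
  token_url = "https://auth.example.com/token"  # hypothetical endpoint
)

resp <- request("https://api.example.com/report-data") |>  # hypothetical API
  req_oauth_client_credentials(client) |>
  req_perform()
```

Note this trades one problem for another: you still have a client secret in an environment variable, but at least the tokens it mints are short-lived.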

And typically, if you're doing that locally, you are doing the analysis: you pull down the data that you are allowed to see and work with that. If you're deploying something like a Shiny app, it's also possible that it should use the data that the person viewing the app is allowed to see. I think the easiest scenario to think about is an HR app: you should only be able to see the data for yourself and the people who report to you. You shouldn't be able to see data for random people in the organization.
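In a Shiny app, the viewer-based version of that looks roughly like this. On hosting platforms like Posit Connect, `session$user` carries the viewer's username; the data-access helper here is hypothetical:

```r
library(shiny)

ui <- fluidPage(tableOutput("table"))

server <- function(input, output, session) {
  viewer <- reactive({
    # session$user is set by hosting platforms such as Posit Connect;
    # it is NULL when running locally, so fall back to a dev identity
    if (is.null(session$user)) "local-dev-user" else session$user
  })

  output$table <- renderTable({
    # fetch_hr_rows() is a hypothetical helper that returns only the
    # rows this particular viewer is allowed to see
    fetch_hr_rows(viewer())
  })
}

shinyApp(ui, server)
```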

And this of course leads to the worst possible debugging scenario: bugs that only occur for other people, who can see data that you cannot see. And how you debug that, I have no idea, but thoughts and prayers.

Wrapping up

So today I've talked about two of the three things that I think make something a production job. The first is that it's not run just once, and that causes these challenges: the schema might change, your dependencies might change, your platform might change, your universe might change, your requirements might change. You need to think about these problems. Some of them have good solutions, some of them don't, but at least I think acknowledging them is the first step.

And then the other problem is that you're not just running it on your computer. You've got this transition possibly from Windows to Linux, from desktop to server, from interactive to batch, and you've got a bunch of challenges related to auth.

I'll leave you with one last picture, speaking to that last bullet point, that last problem: not just you. Now you're working with a team of data scientists, and there's a hierarchy of needs. Ideally, you want to at least be able to find your colleagues' work. Even better, you should be able to run it. Even better, you should be able to understand it. And optimally, you should be able to edit it if needed, so that if someone does leave your team, you can still carry on. I suspect many teams are still at the bottom of that pyramid, and some teams are striving to get higher up it, but it's definitely a challenge: how do you share work across people? How do you build standards in your team? Thank you.