
R-Ladies Rome (English) - R in Production - Hadley Wickham
In this inspiring talk, dive into the world of R in production with Hadley Wickham, Chief Scientist at Posit PBC (formerly RStudio). Explore the challenges and best practices for deploying R solutions in real-world production environments, from effective code structuring to ensuring scalability and reliability. Whether you're a seasoned data scientist or just beginning your journey with R, this event equips you with invaluable insights and actionable tips to drive impactful outcomes in your organization. Don't miss out on this engaging discussion!

Material:
- https://github.com/hadley/available-work
- https://github.com/hadley/web-scraping

0:00 Welcome & R-Ladies Introduction by Dorota Rizik (R-Ladies NYC)
6:28 Introduction and Dr. Wickham's Talk
53:46 Q&A

Have a look at our website for more insights about our events: https://rladiesrome.org
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
My name is Dorota Rizik, and I'm a lead organizer of the New York City chapter of R-Ladies. I'm going to give you a little overview of our chapter and of R-Ladies Global in general, and then I will hand it over to Federica to introduce the Rome chapter and also introduce our presenter, who will be joining us shortly. So it's a collaborative event between two R-Ladies chapters, which is always fun and exciting.
All right, so let's start with what R-Ladies is in general. R-Ladies, also known as R-Ladies Global, is a worldwide organization that promotes gender diversity in the R community via meetups and mentorship events, and we always try to keep it as friendly and safe an environment as possible. It's also been shifting recently from not just R specifically to programming in general, which I think is where the world is headed. So the mission is to promote and include more women and non-binary people as programmers, coders, developers, speakers, and leaders. And the motivation behind this is that more diversity, equity, and inclusion among the people developing R packages and R code will lead to a better community and more progress in general.
And the R-Ladies Global community is very large. When I looked earlier this month, there were 219 chapters across 63 countries and almost 4,000 events that have been created, which is just amazing to think about how wide a reach there is across the globe.
If you are interested in starting a local chapter of R-Ladies Global, it's fairly easy to get started. You can either email R-Ladies Global directly at info@rladies.org or you can ping them on social media. They're usually very responsive.
So now a little bit more about my specific chapter, which is the New York City chapter. So our chapter has a co-organizing board that's made up of, I think, nine people at this point. It's quite a large board, but that's because we do have over 3,100 members. And so we do our best to host monthly meetings, both in person and online, though it's been mostly online in recent years. And you can contact us very easily. Either visit our website, rladiesnyc.org, or you can reach out to us via email.
A little bit of the timeline overall. R-Ladies Global was born in San Francisco back in October of 2012, and then a few other chapters popped up over the years across the world. The NYC chapter was born in November of 2016. It's been eight years since then; there are so many more chapters all over the world now, and we've grown so much as an individual chapter as well.
And here is a visualization of that growth. Again, when I made this earlier this month, we had 3,155 members, and we've hosted a total of 110 events. Our growth has been pretty steady over the years, which is always really nice to see. And we pulled this data directly from Meetup using a package called meetupr, which was fun to play with.
And yeah, you can feel free to join us. We typically host Meetup talks or panels. We also try to attend conferences and workshops, and we do book clubs and also networking and socializing events. And ways to get involved with the New York City chapter: obviously, attend our Meetups, and you can also follow us on our social media channels. You can join our Slack channel and ask for resources or share resources, share job postings, things like that.
You can write a blog post for us, which we would love if you are interested in just getting some practice with writing about R or programming or data science or data work. You can also organize. You can submit a talk idea for us, and we'd be happy to help you develop that and give you a platform. So, you know, share and attend and join and participate. That's the best way to be a part of this community.
Introduction of Hadley Wickham
Thank you, everyone, for joining us today. Tonight's event, we have the honour of hosting Dr Hadley Wickham, Chief Scientist at Posit PBC, formerly RStudio, and a renowned figure in the world of data science. Dr Wickham is not only an adjunct professor of statistics at the University of Auckland, Stanford University, and Rice University, but he's also the mastermind behind some of the most widely used tools and packages in the R programming language. His contributions to the field have revolutionised the way we approach data analysis, visualisation, and software development in R. So tonight, Dr Wickham will be talking about R in production, sharing insights and best practices on how to deploy R solutions effectively in real-world environments. So without further ado, please join me in welcoming Dr Hadley Wickham.
What is R in production?
Okay. So I wanted to talk about R in production today. I'm going to start off with kind of a broad overview of what I think that means, and then I thought it would be fun to work on some code that I have in production, because I think it's a great example of code that, even if you're not inside a company, even if you're just learning R at school, is a great way to practice some of the same skills you'll need to put code in production.
So what is R in production? I'm not entirely sure what it is yet, but I can tell you what it's not. Typically, when you're putting code in production, it's code that's sufficiently important that you're not going to run it just once. You're going to run it again and again and again. Maybe that's because you've got a new data set coming in every now and then, or maybe it's because you've got new data coming in every day and you want to render your Quarto document, produce a dashboard, and then send that to your boss maybe every Monday morning. So your code is not going to be run just once; it's going to be run multiple times, potentially over months or years.
The other thing that's really important about in production is that code in production is typically not going to be running on just your computer. It's going to be running on a server somewhere. And I think there are a couple of differences there. The first is that if you're using Windows, that server is almost certainly going to be a Linux machine, and there are a bunch of small but annoying differences between Windows and Linux that you're going to need to learn about. It's also typically going to be configured as a server, versus your personal desktop. Your personal desktop probably uses the language that you speak and the time zone that you're in. When you move to a server, it's probably going to be using an English or C locale, and it's going to be using some standard time zone.
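As a concrete illustration (not from the talk itself), here is a quick way to check those settings from R; the values shown in the comments are typical examples, not guaranteed:

```r
# Two settings that commonly differ between a desktop and a server;
# worth checking at the top of a production script.
Sys.getlocale("LC_TIME")  # e.g. "en_US.UTF-8" on a desktop, "C" on a server
Sys.timezone()            # your local zone on a desktop; servers often use "UTC"

# The locale changes how the same date prints:
Sys.setlocale("LC_TIME", "C")
format(as.Date("2024-03-01"), "%B")  # "March" in the C locale
```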
The other big challenge with running code on another computer is what happens when something goes wrong. It's hard enough to debug your own code on your own computer, where you can put a browser() statement in, do some little interactive experiments, and figure out what's going on. But now your code is running somewhere that you can't interact with, and typically it's going to take maybe five or 10 or 30 minutes to do an iteration cycle. That's long enough that half the time you send the code off to run, go and do something else, and then you've forgotten what you were doing when you come back to it.
And the third challenge is that typically in production, it's not just you working on the code; it's a team of people. You've got a bunch of data scientist colleagues who ideally need to be able to understand your code and who you need to be able to share work with. But also, you might be getting data that's produced by data engineers in your organization. You might be producing some kind of API that other developers in your organization are going to use to get model predictions. And you're certainly going to be sending the results of your analyses, whether those are Quarto reports or Shiny apps or something else, to the decision makers.
And so today I'm going to focus on the first two, because these are ones that I think you can simulate reasonably well, even if you're not in an organization where you're putting stuff in production. And if you're still a student, if you're just getting started with data science, I think having some of these skills to be familiar with, so you can talk about them in job interviews, is really, really useful. Because it means that once you get a job, you can hit the ground running: you can get data, you can do stuff with it, and then you can automate that whole process.
The demo: scraping an artist's website
And so I'm going to show you a demo that's kind of motivated by my use of TikTok. One of the people I follow on TikTok is an artist named Weston Lambert, who produces these really cool sculptures. And I was like, oh, this is really cool, I would love to own one of those. So I went to his website, and of course every single thing is sold out. The reason it's sold out is that he's got like 600,000 followers on TikTok, so whenever he posts something new, it sells out almost immediately, which is not so surprising. But I really wanted to buy one of his pieces, and so I thought, well, let's solve it.
And so that is what I'm going to show you today: my kind-of-production script that's going to regularly scrape his website and then notify me whenever something new goes on sale. This is very much a production-type thing: you're going to run something regularly, and you want someone to make a decision or take an action at the end of it. So I think it's a good little microcosm of production.
And I'm going to show it to you. And it's not very good. So I am ambitiously going to try and improve it live in front of you all. And so hopefully I won't get too stuck. Or if I do, you can give me suggestions to get me unstuck.
So I have this open in RStudio. And the first thing I'm going to do is use the rvest package. The rvest package, if you haven't heard of it before, is a package for basically turning websites into tidy datasets. So I'm going to go to his website, go to the available work page, which now has basically nothing on it apart from a custom stand, which I don't know why anyone would want to buy on its own. But I can take that. And then I'm going to read the HTML. The way you work with rvest is you select things using CSS selectors. CSS stands for Cascading Style Sheets; it's the language that web developers use to describe how things should be styled.
And the nice thing about CSS selectors is that they also give us a way to identify elements on the page. Now, because the site has updated since I looked at it yesterday, this isn't going to be particularly exciting, because there's only one product available. But hopefully my code will still work. So basically, what I'm saying is: give me... let's just see if I can look at that.
Okay. So one of the things I'm going to try and show: if you ever do any web scraping, the thing that's really useful is the browser developer tools. So I'm going to right-click on this image, and we can see all the HTML. You don't need to know too much about HTML for this purpose, but the idea is that HTML is a tree, and we're going to try and find something in this tree that might be useful. When we look up here, we'll see there's a division of the page with an ID called "product-list", and that seems like a fairly good place to start. The way you select something by ID is to put a hash in front of it, so this is going to give me that element. And then I know everything inside of that is going to be a link, and a link is an "a" tag. So I can look at these products now.
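A minimal sketch of that selection step with rvest. I'm using an inline HTML snippet shaped like the page he describes (a div with id "product-list" containing one link per product); the real site's markup and the link text here are my assumptions:

```r
library(rvest)

# Stand-in for read_html("https://...") so the sketch runs offline:
# a tiny page with the shape described in the talk.
html <- minimal_html('
  <div id="product-list">
    <a href="/products/custom-stand">
      <div class="product-title">Custom stand</div>
      <div class="product-price">$250.00</div>
    </a>
  </div>
')

# "#product-list" is a CSS id selector; "a" then matches every link inside it
products <- html |>
  html_element("#product-list") |>
  html_elements("a")

length(products)  # one product on the page today
```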
And unfortunately, there is only one; this would have been more exciting yesterday. But now I need to look at this element. Let's see what this "a" is and what's inside of it, and try to figure out where the price is.
So you can see down here, this is kind of useful: it's got a class called "product-price" and a class called "product-title". That's pretty suggestive. So I can use another selector. I'm going to say: find all the elements with a class of "product-title" and extract the text. That gives me "Custom stand", and I can do the same thing with the price.
The price has some extra text and some dollar signs, so I just strip all those out and convert it to an actual number. There are lots of other ways I could do this. In particular, this would be much easier if I could use readr, because readr::parse_number() would do all of that for me.
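Continuing the sketch: class selectors start with a dot, and the price can be cleaned up in base R (keeping the dependency count down, as he discusses next) or with readr::parse_number(). The HTML snippet is again a stand-in, not the real page:

```r
library(rvest)

html <- minimal_html('
  <a href="/products/custom-stand">
    <div class="product-title">Custom stand</div>
    <div class="product-price">$250.00 USD</div>
  </a>
')

title <- html |> html_elements(".product-title") |> html_text2()
price_text <- html |> html_elements(".product-price") |> html_text2()

# Strip everything that isn't a digit or a decimal point, then convert.
# readr::parse_number(price_text) would do the same job in one call.
price <- as.numeric(gsub("[^0-9.]", "", price_text))
```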
But one of the things that's interesting about code in production is that your code is going to run somewhere else. And that means all of the packages you need to run the analysis are also going to need to be installed somewhere else. So when you're running in production, minimizing the number of packages you use is really going to make your life easier.
And I noticed that a few people are asking in the chat about my little arrows and triangles here. This is actually down to the font I'm using, which is called Fira Code, and which provides ligatures: custom displays for certain character combinations. So if I put a space in between them, you can see that it's just a regular vertical bar and a greater-than symbol. Similarly, this is a less-than sign followed by a minus. Fira Code just makes those look a little bit nicer, with the unfortunate side effect of confusing people when you show them your code.
Now, unfortunately, today there are no sold out items, so you're going to have to rely on the fact that I figured this out earlier: when a product is sold out, it happens to have an element on it marked sold out. And then I can also figure out where the actual product is.
So I've pulled out all these different variables. And I should say, web scraping isn't the main point of this talk, but I'll point you to a talk I gave about it recently, if I can remember where it was. If you do want to learn more about web scraping, this is a workshop I gave a couple of months ago that will help you learn a little bit more.
Okay, so where are we? What have we done? I've now taken each of those pieces and put them all together in a data frame. This is not very exciting because there's only one row. But in principle, what I want to do now is take a look at the products I saw last time, which I saved as a CSV file. Last time there were these 10 things for sale, almost all sold out. So what I want to do is find which ones are new and are not sold out.
And I'm using the link as a unique identifier: if I've seen it before, don't tell me about it again. So what I'm going to do is find all of the rows where sold out is missing, so they're not sold out, and where the link is not present in the last set of links. I should add a comment here: find all products that aren't sold out and that I didn't see last time.
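In base R (in keeping with the minimal-dependencies theme), that filter might look like this. The column names `link` and `sold_out` and the example rows are my guesses at the shape of the data, not the script's actual contents:

```r
# Products recorded last time (normally read from the saved CSV file)
old <- data.frame(link = c("/products/tide", "/products/custom-stand"))

# Products scraped just now; sold_out is NA when a piece is still available
current <- data.frame(
  link     = c("/products/custom-stand", "/products/new-sculpture"),
  sold_out = c(NA_character_, NA_character_),
  price    = c(250, 3200)
)

# Find all products that aren't sold out and that I didn't see last time
new_products <- current[is.na(current$sold_out) &
                          !current$link %in% old$link, ]
```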
And then, if there are any new products, I'm going to create a little message. Which didn't work. And now I have to figure out why that didn't work.
Let's just try one of those. So that worked. Oh, I see. When I looked at the old data, there already was a custom stand. Okay, so this is doing the right thing: it's saying there are no new products that you haven't already seen, because this custom stand was on sale last time you checked.
GitHub Actions: running the script automatically
But I'm going to update the products CSV file. And now, since this is in Git, let me switch what you can see in the share.
So in the Git pane, effectively what I'm going to do is update the products data. And actually, before I do that, I should probably make sure, I should have done this when I started, but I probably want to pull the latest changes.
Okay. So now there aren't actually any changes in Git, because, as I'll show you shortly, I actually have a GitHub Action that is running this automatically. But that's the basic idea. So what have I done? I scraped the website with rvest, you saw that, and I created a tidy table of data. The next thing I'm going to show you is how I repeatedly rerun this. If we go to the GitHub site and look at the commits, you'll see this repo actually has 319 commits in it.
Most of them have the same, not very useful title. But if I look at one of these, you can see it's updating a CSV file with the changes to the website. So just having this alone is quite useful, right? I've converted a website into a CSV file and I'm tracking the changes over time with Git. So now at least you could look at this and ask: how often do things actually sell out? How often does he add new mystery sculptures to the website? Or you could maybe start to do some more analysis, like: if I want to watch his website, what time should I check?
And the reason this works is that I use a GitHub Action. GitHub Actions are free for everyone to run, at least for open source or publicly available repos. And I've set this one up to run automatically every three hours. So every three hours, I'm going to run a GitHub Action. What's it going to do? It's going to check out the repo. This is kind of common when you're running something in production: often you're going to start with a completely blank slate; you just get given a Linux machine that has basically nothing on it. So the first thing I'm going to do is check out my code. Then I'm going to use some code that we provide through the r-lib/actions repo that's going to install R. So every time this runs, we're going to check out my code and then install R. Then we need to install all the dependencies the script needs, and now I can actually run that script.
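A sketch of what such a workflow file might look like (the real one lives in the hadley/available-work repo; the script name `scrape.R` and the commit details here are assumptions):

```yaml
on:
  schedule:
    - cron: "0 */3 * * *"   # every three hours
  workflow_dispatch:         # also allow manual runs

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4                    # check out the repo
      - uses: r-lib/actions/setup-r@v2               # install R
      - uses: r-lib/actions/setup-r-dependencies@v2  # install packages from DESCRIPTION
      - run: Rscript scrape.R                        # run the scraping script
      - run: |                                       # commit any changed data back
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          git commit -m "Update data" || echo "No changes"
          git push
```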
So that's not quite enough, because I also need to check the results back into Git. I wish I could tell you how I came up with this code, but I can guarantee you I did not write it myself, because I don't remember doing that, and I don't remember why I need some of the stuff in there. But I wrote a comment here that this is an idea I copied from Simon Willison. He used a slightly different approach, but it's a pretty similar idea.
So what's going on with the script? I start with a completely blank slate. I get my code out, install R, install all the dependencies I need, run my script, and then save the results. And if you go back far enough in time, you can see all 300-odd commits here.
So I guess this has just been running for a while. You can see there are a lot of initial commits where there was quite a lot of iteration to get this working correctly. But I eventually got it working, and once I did, it has just been running tirelessly every three hours, updating whenever the site changes, for the last eight months.
So I see Philippe in the chat asked a good question: how does it figure out my dependencies? I actually created a DESCRIPTION file here. Not because this is a package, which would always have a DESCRIPTION, but just because we've got some easy tools to install dependencies if you do have one. One of the downsides of this approach is that it's going to install the latest version of each package, the packages that are currently available on CRAN, which is not a great idea.
So the other thing the script is supposed to do is send a push notification. Unfortunately, I never actually got that push notification to work. But I did a little exploration yesterday, and I think we can get it working. I realized a couple of minutes ago that my goal was to do this push notification and then show you on my phone, but I forgot that I also use my phone as my webcam, so that's going to require some gymnastics.
But I'm using this package from Jonathan Carroll, which uses this free service called ntfy. Its main feature is that it's free, basically, and it does just what we need. The one wrinkle is that ntfy is basically public to everyone, and the way we get around that is to create a topic name that's basically just a random string. You can sign up for these notifications: if you quickly copy this URL down, I'll put it in the chat. Actually, it's not a big deal. And now, from R, I can run some code.
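Under the hood this is just an HTTP POST to ntfy.sh, so you don't strictly need a package; here is a sketch with httr2, where the topic name is a made-up placeholder (in practice you'd pick your own random string):

```r
library(httr2)

# Build the request: POSTing a body to https://ntfy.sh/<topic> is all
# that's needed to notify everyone subscribed to that topic.
req <- request("https://ntfy.sh/available-work-a1b2c3") |>
  req_body_raw("New piece available: Custom stand ($250)", type = "text/plain")

# req_perform(req) would actually send the notification
```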
And then, hopefully... I got this notification. And the cool thing about this is that as well as being a desktop app, it's also an iPhone app, so I actually get a push notification. It even came to my watch, which I can't show very easily, but I got a notification on my watch telling me that something has happened on that website.
So I think this is pretty cool. I've now got all the pieces in place so that when something changes on this website, I get a notification and I can actually go and use it. And that did, in fact, work successfully for me: I purchased this sculpture, which I really, really like.
Challenges of running code in production
So let's talk a little bit more about some of the challenges you'll face when putting code into production. And then, depending on how long that takes, we can come back and maybe make some improvements to this code.
So let's think about some of the challenges. For this particular example, I'm going to ask you to imagine that you're a data scientist working for an ice cream store. All of my images are drawn by ChatGPT, which is notoriously bad at words, so please enjoy the terrible spelling as we go.
So you're a data scientist for an ice cream company. You want to help them predict how many ice creams are going to sell tomorrow, maybe based on how many ice creams they sold today and the weather forecast for tomorrow. And so you've written a bunch of R code, you've fit some models, and you've made a nice Quarto notebook. And now you want to run this every day so that the people who are in charge of making ice cream in your organization, or shipping it out to the stores, know what to do.
And so what are the challenges to doing that? Well, the first challenge is that the data is going to change. This is probably the most obvious one: in winter, sales of ice cream are likely to drop. But if you don't have data about winter, if you don't have enough data going back in time, the first time you see something unusual like that, you're going to get bad predictions.
Another thing that might happen is that the schema of the data might change. The schema is the definition of the data: the variable types, the variable names. So maybe you've been working with data like this. Obviously, your company does not make weather forecasts, right? You're downloading this from some API somewhere on the internet. And one day it changes.
So what I want you to do is just take a minute and think about this. You might put ideas in the chat if you want. What are the differences between these two data frames, and what do you think the impact on your code is likely to be? I haven't shown you any code; I just want you to imagine what these changes are going to do. So let's just take a minute: see if you can identify the three differences and think about what the impact is going to be.
So what are the differences? The most obvious one is that temp has changed to temperature. So what's the likely impact of this on your code? You're probably going to get an error message, right? Which is good; it's kind of the best thing that could possibly happen compared to some of the changes we'll talk about next. It tells you that the format's changed and you have to go back to your code and fix it.
What else has changed? Well, we've changed from ISO 8601 date format, which is used commonly in Europe, to month/day/year format, which is used commonly in America. So what might happen here? I think you've got a few different options. First of all, your code might just error because you've hard-coded that you expect it to be in year, month, day format. It might just work, because the date types are automatically guessed from the column. Or, in the worst possible scenario, it might guess that these are day/month/year dates. In this subset of data, those are all valid dates: it could be the 5th of January, the 5th of February, the 5th of March, the 5th of April. As humans looking at this, we're likely to say that seems implausible, but if you're very unlucky, the code will automatically guess the wrong order, and now you're just going to get nonsense. Your code isn't going to error; it's just going to give bad results.
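You can see all three outcomes with base R's as.Date(); the date strings here are illustrative:

```r
# The same string parses to two different dates depending on the assumed
# field order -- the silent "nonsense results" case:
x <- "01/05/2024"
as.Date(x, format = "%m/%d/%Y")  # 2024-01-05 (January 5th)
as.Date(x, format = "%d/%m/%Y")  # 2024-05-01 (May 1st)

# And if the incoming format changes but your code doesn't, base R
# returns NA silently rather than erroring -- worth checking explicitly:
as.Date("2024-01-05", format = "%m/%d/%Y")  # NA
```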
And the same thing that might happen, that's almost certainly going to happen, with the last one, where it's changed from Celsius to Fahrenheit. This is still a number, a valid number, but if you fit your model on Celsius and now you're giving it Fahrenheit, you're going to get terrible predictions out of it. And you're not going to know that, right? Nothing is going to throw an error; you're just going to be getting really bad results. And that's kind of the worst possible scenario for code that's running in production: it doesn't error, it just silently gives the wrong results.
Okay, so those are the first two challenges: you might get data you've never seen before, and the schema might change. Another place the first one might arise, for example, is if originally the ice cream store was only open on the weekends and now it's also open on weekdays. Maybe the day of the week is a variable that's important, and now you're going to get new days of the week that you need to make predictions for. Or maybe one of your dependencies changes.
So maybe a great new version of your favorite package comes out, and it adds a bunch of cool new features that you love and think are amazing. But it also breaks one of your existing plots because something no longer works. Maybe that's because you were relying on a bug; maybe it's because the ggplot2 developers made a mistake; maybe they made some deliberate change. But for whatever reason, your code no longer works, because you are installing the current version every day.
Now, there's a really good fix for this, and that's to use renv. The basic idea of renv is that it captures all of the versions of the packages that you're currently using. There are only two packages here that I'm using directly, rvest and ntfy, but it captures all of the packages that those packages depend on as well, and it records all that information in a lock file. So I've changed the project so there's now an renv lock file. If I look at that lock file, it's a JSON file, but most importantly, you can see it's got the version of every single package I have installed. So now when this runs on GitHub, it's going to use exactly the same versions of those packages, even if new versions have been released in the meantime.
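The renv workflow boils down to three calls, run interactively in the project rather than inside the scheduled script (a sketch of the usual sequence; see the renv documentation for the details):

```r
# One-time setup: create a project library and an initial renv.lock
renv::init()

# After installing or updating packages, record the exact versions in renv.lock
renv::snapshot()

# On another machine (e.g. the GitHub Actions runner), reinstall
# exactly the versions recorded in renv.lock
renv::restore()
```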
Another thing that might change is the entire platform. Maybe there's a new chip that comes out, and maybe there's some difference in how it does linear algebra, and it gives slightly different results for your model. Or maybe the operating system changes, or the C libraries that the packages you use depend on change, or the version of R or Python changes. This is much, much less common, particularly these days, but there have been some famous examples in the past. I remember something like 20-plus years ago, an Intel chip had a bug in its math operations, so a bunch of people got incorrect results in their calculations. So this is not very common, but it's good to be aware of. And the way that people tend to solve this problem in practice is to use containers. This is where you might have heard of Docker: a container basically captures an entire operating system in a box, so you can ensure that every time your code runs, it's running on exactly the same system.
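For completeness, a minimal sketch of what pinning the platform with a container might look like, using the Rocker project's version-tagged R images (the tag and the file names are examples, not from the talk):

```dockerfile
# Pin the operating system and R version together
FROM rocker/r-ver:4.4.1

WORKDIR /app
COPY . /app

# Reinstall the exact package versions recorded in renv.lock
RUN R -e 'install.packages("renv"); renv::restore()'

CMD ["Rscript", "scrape.R"]
```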
The other challenge that you might face is that the universe might change. In the case of an ice cream store, maybe your shop changed location. Obviously, if it's now on a beach instead of in the city, the sales patterns are likely to be very different. But your code is just going to keep going; it's going to keep fitting the model that worked for you originally. Maybe that model doesn't fit very well anymore and isn't going to give you good results. And worse, it's not going to give you a clear error that says, hey, the model's wrong, because models, by their very nature, can't do that.
And so you might have heard of terms like concept drift or model drift or data drift. The basic idea is that a model, by its very nature, captures reality imperfectly, and it captures it best at the time it was fit. As you get further away from that time, more and more little errors are going to creep in. So even if nothing major changes, if you only fit your model once and make predictions from it, those predictions are only going to get worse over time, because you've implicitly fit something like a Taylor series approximation to the universe, and as you move away from the approximation point, it gets worse and worse. And the way you resolve that is that you can't just set and forget the model; you're going to have to check it regularly.
In the case of my little production thing, I think what the universe changing would look like is this page looking different, like the structure of the HTML changing. And the way rvest works is that if it doesn't find anything, if I deliberately introduce a misspelling into a selector, it's just going to report nothing. So in my code here, what I really should be doing is saying: if the number of products equals zero, throw an error. That means my script isn't going to continue running even though the structure of the website has changed, quietly giving me nonsense results. And the nice thing about throwing an error like this is what happens when you run R in batch mode. If you remember, my scraping workflow calls Rscript, and if you call Rscript instead of using R interactively, whenever there's an error it's basically going to quit and not run any more code. When it quits, GitHub Actions sees the failure, and that gives you an error that you get notified about through your GitHub notifications.
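A minimal sketch of that guard, assuming the scraped results end up in a data frame called `products` (the function name and message are illustrative, not Hadley's actual code):

```r
# The guard: an empty result almost certainly means the page
# structure changed, so fail loudly instead of silently succeeding.
check_nonempty <- function(products) {
  if (nrow(products) == 0) {
    stop("No products found: has the page structure changed?")
  }
  invisible(products)
}

# In the scraping script this would run right after the rvest code
# that builds `products`:
# products <- check_nonempty(products)
```

Under Rscript, the `stop()` produces a non-zero exit status, which GitHub Actions reports as a failed run.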
The last challenge of running code repeatedly in production is that, over time, your requirements are going to change. In some ways, the best and worst thing that can happen to you as a data scientist is that something you've done, some dashboard or model you've created, becomes so important that the executives in your company start to rely on it. That's awesome, because your work is having a very direct impact on the company, but it also means those people are going to be looking at it and emailing you with requests to change it. I don't think that's a problem necessarily, but dealing with that kind of iterative flow of requests is not something data scientists tend to be trained in. It's not something you learn in university. So how do you keep track of those requests? How do you make sure that you continue doing the work that is important to you as a data scientist, even while you get requests from several layers above you in the org chart to do things urgently?
So those, to me, are the things you need to think about when you're writing code that's running long term. Let's take a look at this. We've already seen an example of the data changing, right? I looked at the page today and there was only one product there, which made this much less interesting than it might have been.
One thing that makes me a little more comfortable about scraping this is that, as you can see, it's powered by Squarespace, which is a big platform that powers tons of websites. So I know that behind the scenes this is all automated; someone's not manually handwriting this HTML. It's generated by code, and it's unlikely for that code to change in the short run. It might change in the long run, and that is why I really should add the check I described, because otherwise my code will just continue "working": it'll return a zero-row data frame, everything else will run, and I just won't get any notifications.
So we can protect against changes to dependencies, as I said, by using renv, which basically works by recording the exact versions of all the packages you use. I've also mostly protected against changes in the platform in my YAML file, where I say run on ubuntu-latest. That specifies the container, the operating system and all the system libraries, that GitHub is going to use to run my code.
If I wanted to be safer, I could change that to a specific version of Ubuntu, so rather than using the latest released version, I'd always use, say, Ubuntu 20.04. In this case I don't think it's that important, because all of this code is very, very simple. It's easy to see what might go wrong, and I don't think the operating system is likely to affect it. Operating systems change relatively slowly anyway, and I don't expect to be running this script for years on end. But again, this is something I could do: lock down a specific version of the operating system.
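In a GitHub Actions workflow, that pinning is a one-line change; the job and step layout here is illustrative, not the actual workflow file from the repo:

```yaml
jobs:
  scrape:
    # "ubuntu-latest" floats over time; pinning an explicit version
    # keeps the OS and system libraries stable across runs
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - run: Rscript scrape.R
```

The trade-off is that pinned runner versions are eventually retired by GitHub, so a pinned workflow needs an occasional manual bump.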
Beyond the platform changing, we also talked about the universe changing. In this case, the universe changed because I successfully purchased the artwork I wanted to purchase, and so the script is no longer useful. I think that's a not-uncommon end for a data science project.
Sorry about that. I just lost power. But I am back. Hopefully we will not lose power again.
Okay. Let me share my screen again and regain my train of thought. We talked about the platform and the universe. Yes: in this case, the universe changed in a way that means the data analysis in production has achieved its desired effect, and I can now effectively wrap it up. That's not super uncommon. Sometimes you do an analysis just to get something to change; it's changed, and then you're done.
Or maybe that was the requirements changing, because in this case I got what I needed out of the analysis. So I think that's probably a good place to start. I showed you a few ways I could make this code more production-ready by adding errors so that when something goes wrong, I get notified about it. I also spent a little bit of time yesterday figuring out how I could get rid of the dependency on the ntfy package to make my code even simpler.
And that just means I'm no longer using the ntfy package; I'm using the httr2 package to make the request directly. There's no real reason to do that, except that if you were putting this code into production in a real company, it gets pretty hard to use code from random GitHub repos, because that code could change at any point, and many IT departments will not let you just run random code from the internet.
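A sketch of what that direct call might look like, assuming the ntfy.sh service; the topic name `my-topic` is a placeholder you would replace with your own:

```r
library(httr2)

# Send a push notification via ntfy.sh without the ntfy package;
# anyone subscribed to the topic in the ntfy app receives it
notify <- function(message, topic = "my-topic") {
  request("https://ntfy.sh") |>
    req_url_path_append(topic) |>
    req_body_raw(message, type = "text/plain") |>
    req_perform()
}

# notify("Found new products!")
```

Since ntfy.sh just takes a plain HTTP POST, the only dependency left is httr2, which is on CRAN and easy for an IT department to approve.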
So I think let's stop there, and we've got some time for questions. I won't walk you through every challenge of running this somewhere other than my computer, but you did see a few of those along the way. I will tell you that the absolute hardest thing to debug is when the notification itself doesn't work, because when it doesn't work, all that happens is you don't get a message, which is obviously very difficult to detect. So with any of these problems, you should expect some pain and frustration, especially the first time you do it.
Q&A
Peter, do you want to ask them, or do you want me to just pick them out and read them?
Avis, would you like to read the question and ask it yourself?
Yes, I can read. I have two questions. The first one is about web scraping: apart from the RSelenium package, are there any other R packages that can scrape JavaScript-managed web pages? And the second question is about data drift: how can you automatically identify such drifts?
Thank you. So here are my answers. First, rvest now has experimental tools for scraping live pages that run JavaScript. It actually runs a web browser in the background and interacts with it, so you can interact with any website exactly as if you were a real person. To echo one of the questions earlier in the chat, I will not be telling you how you can use this to scrape Ticketmaster or other places like that. I will say these techniques generally don't work on any website where you can imagine people really care about scraping, because Ticketmaster wants to stop people from scraping it; they don't want to enable scalping. So they introduce a bunch of tools to stop that, which, of course, you can overcome if you're creative enough. But I would never suggest such a thing.
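The entry point for this in rvest is `read_html_live()`, which drives a headless Chrome session via the chromote package. A minimal sketch, with a placeholder URL and selector (this needs Chrome installed, so it is shown but not run):

```r
library(rvest)

# read_html_live() opens a headless browser so the page's JavaScript
# runs before you scrape; requires Chrome via the chromote package
page <- read_html_live("https://example.com/js-rendered-page")

# The returned object supports the usual rvest verbs
page |>
  html_elements(".listing") |>
  html_text2()
```

The live-page object also exposes interaction methods (clicking, scrolling), which is what makes it usable on pages that only load content as you interact with them.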
Identifying data drift, I think, is more challenging. I'm kind of surprised there's not more research and interest in this. The best thing I've come up with, or that was suggested to me, is the applicable package that Max Kuhn suggested. Basically, every time you make a prediction, the applicable package tells you how far away that new observation is from the training data. That just seems like something you probably want with every prediction: am I making a prediction that's solidly in the body of data I've seen before, so I can feel really confident? Or am I making a prediction that's a very long way from the data I used to fit the model, in which case I should take it with a grain of salt? Or maybe, if it's far enough away, I should automatically flag that and encourage people to refit the model.
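As a rough sketch of how that might look with the applicable package, assuming its PCA-based interface (`apd_pca()` and `score()`; the data frame names are placeholders and the exact column names are from memory, so check the package docs):

```r
library(applicable)

# Fit an applicability-domain model on the training predictors
ad <- apd_pca(~ ., data = training_predictors)

# Score new observations: a distance percentile near 100 means the
# new point is farther from the training data than almost all of it
scores <- score(ad, new_predictors)
flagged <- scores$distance_pctl > 99
```

The idea is to attach that percentile to every prediction you serve, and treat high values as a signal to distrust the prediction or to refit the model.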
So she said: I've used a package that notifies you in Slack, and it works great. It wasn't really a question, just a note that that's an alternative to use for notifications instead of ntfy.
Yeah, there are a ton of packages out there, and I think Slack is useful. One of the things I like about ntfy is that there's just so little to it. It's free, and you don't have to worry about credentials, which is one of the other painful parts of putting stuff in production: how do you get all of your credentials shared correctly? If I were going to publish to Slack, I'd need to somehow provide either my Slack username and password, and putting that on my GitHub would probably be a bad idea, or some kind of token, and that's just a bunch more work. But if you're doing this inside your organization, Slack makes a lot of sense.
I see David asked a question about how you manage credentials in production. There are two ways, I think. There's the way that's not particularly secure, but it's okay, it's fine, and it works everywhere: environment variables. Locally, that means you can run something like usethis::edit_r_environ(), which I'm not going to run because it would show you all my environment variables, which contain a bunch of secret stuff. The basic idea is that you have this one file, your .Renviron file, which contains a bunch of secrets and never gets committed to GitHub. And then on GitHub, under Settings, then Secrets and variables, you can add the same values.
Once you've set a secret there, you can never actually see it again; you can't edit it, only replace it. But there are ways to read it into an environment variable in your GitHub Actions. You have to be careful never to print it out, although GitHub does take some basic precautions to make sure you don't accidentally do that. So that gives you something that's secret locally and secret everywhere else. This environment-variables approach basically works everywhere, but it's kind of a pain because it typically relies on you doing some copy and paste.
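In the R script itself, reading a credential then looks something like this (the variable name `NTFY_TOPIC` is a made-up example):

```r
# Read a secret from the environment; locally it comes from
# ~/.Renviron, on GitHub Actions from the repository's secrets
get_secret <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set")
  }
  value
}

# topic <- get_secret("NTFY_TOPIC")
```

On the Actions side, the corresponding step maps the stored secret into the environment with something like `env: NTFY_TOPIC: ${{ secrets.NTFY_TOPIC }}` in the workflow YAML.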
One of the things we've been working on as part of the professional products is making all of this just work. So if you're using Databricks or Snowflake, all of the credentials magically flow through. In a well-resourced organization, that's the way it should work: your administrator takes care of all the auth stuff so you don't need to worry about it. Until we get to that point, it's environment variables, and you do whatever you need to do, wherever you need to do it, basically.
Okay, thank you. There is a question from Eugene: what does the future look like for R?
Yeah, I'll preface my remarks with, I think it was Yogi Berra who said it's tough to make predictions, especially about the future. So I don't know what's going to happen. One of the things that makes it hard to get a sense of what's up with R is that I think the absolute number of R users is still increasing, but the percentage of people using R for data science is decreasing, because the number of people using Python for data science is increasing faster.
Now, don't get me wrong, Python is a great language, and I think the reason more and more people are using it for data science is that it's a great general-purpose programming language. But I don't think it's reasonable for there to be just one programming language; it makes sense to have some general-purpose tools and some special-purpose tools. And while R is a general-purpose programming language, you can do anything you want in R, it's particularly well tailored for the needs of data science. I think the design of R means, in my biased opinion, that tools like ggplot2 and dplyr are always going to be better in R than in basically any other programming language, because no other language gives you the flexibility to implement those sorts of APIs, which give you this very fluent interface for exploring data. I still think ggplot2 and dplyr are better than their Python equivalents, and I think they always will be.

