
R-Ladies Rome (English) - R in Production - Hadley Wickham
In this inspiring talk, dive into the world of R in production with Hadley Wickham, Chief Scientist at Posit PBC (formerly RStudio). Explore the challenges and best practices for deploying R solutions in real-world production environments, from effective code structuring to ensuring scalability and reliability. Whether you're a seasoned data scientist or just beginning your journey with R, this event equips you with invaluable insights and actionable tips to drive impactful outcomes in your organization. Don't miss out on this engaging discussion!

Material:
- https://github.com/hadley/available-work
- https://github.com/hadley/web-scraping

0:00 Welcome & R-Ladies Introduction by Dorota Rizik (R-Ladies NYC)
6:28 Introduction and Dr. Wickham's Talk
53:46 Q&A

Have a look at our website for more insights about our events: https://rladiesrome.org
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
My name is Dorota Rizik, and I'm a lead organizer of the New York City chapter of R-Ladies. I'm going to give you a little overview of our chapter and of R-Ladies Global in general, and then I will hand it over to Federica to introduce the Rome chapter and also introduce our presenter, who will be joining us shortly. So it's a collaborative event between two R-Ladies chapters, which is always fun and exciting.
All right, so let's start with what R-Ladies is in general. R-Ladies, also known as R-Ladies Global, is a worldwide organization that promotes gender diversity in the R community via meetups and mentorship events, and we always try to keep it as friendly and safe an environment as possible. It's also been shifting recently from not just R specifically to programming in general, which I think is where the world is headed. So the mission is to promote and include more women and non-binary people as programmers, coders, developers, speakers, and leaders. And the motivation behind this is that more diversity, equity, and inclusion among the people developing R packages and R code will lead to a better community and more progress in general.
And the R-Ladies Global community is very large. When I looked earlier this month, there were 219 chapters across 63 countries and almost 4,000 events that have been created, which is just amazing to think about how wide a reach there is across the globe.
If you are interested in starting a local chapter of R-Ladies Global, it's fairly easy to get started. You can either email R-Ladies Global directly at info@rladies.org or you can ping them on social media. They're usually very responsive.
So now a little bit more about my specific chapter, which is the New York City chapter. So our chapter has a co-organizing board that's made up of, I think, nine people at this point. It's quite a large board, but that's because we do have over 3,100 members. And so we do our best to host monthly meetings, both in person and online, though it's been mostly online in recent years. And you can contact us very easily. Either visit our website, rladiesnyc.org, or you can reach out to us via email.
A little bit of the timeline overall. R-Ladies Global was born in San Francisco back in October of 2012, and then a few other chapters popped up over the years across the world. The NYC chapter was born in November of 2016. It's been eight years since then; there are so many more chapters all over the world now, and we've grown so much as an individual chapter as well.
And here is a visualization of that growth. Again, when I made this earlier this month, we had 3,155 members, and we've hosted a total of 110 events. Our growth has been pretty steady over the years, which is always really nice to see. And we pulled this data directly from Meetup using a package called meetupr, which was fun to play with.
And yeah, you can feel free to join us. We typically host Meetup talks or panels. We also try to attend conferences and workshops, and we do book clubs and also networking and socializing events. And ways to get involved with the New York City chapter: obviously, attend our Meetups, and you can also follow us on our social media channels. You can join our Slack channel and ask for resources or share resources, share job postings, things like that.
You can write a blog post for us, which we would love if you are interested in just getting some practice with writing about R or programming or data science or data work. You can also organize. You can submit a talk idea for us, and we'd be happy to help you develop that and give you a platform. So, you know, share and attend and join and participate. That's the best way to be a part of this community.
Introduction of Hadley Wickham
Thank you, everyone, for joining us today. Tonight's event, we have the honour of hosting Dr Hadley Wickham, Chief Scientist at Posit PBC, formerly RStudio, and a renowned figure in the world of data science. Dr Wickham is not only an adjunct professor of statistics at the University of Auckland, Stanford University, and Rice University, but he's also the mastermind behind some of the most widely used tools and packages in the R programming language. His contributions to the field have revolutionised the way we approach data analysis, visualisation, and software development in R. So tonight, Dr Wickham will be talking about R in production, sharing insights and best practices on how to deploy R solutions effectively in real-world environments. So without further ado, please join me in welcoming Dr Hadley Wickham.
What is R in production?
Okay. So I wanted to talk about R in production today. I'm going to start off with kind of a broad overview of what I think that means, and then I thought it would be fun to work on some code that I have in production, because I think it's a great example of code that, even if you're not inside a company, even if you're just learning R at school, is a great way to practice some of the same skills you'll need to put code in production.
So what is R in production? I'm not entirely sure what it is yet, but I can tell you what it's not. Typically, when you're putting code in production, it's code that's sufficiently important that you're not going to run it just once. You're going to run it again and again and again. Maybe that's because you've got a new data set coming in every now and then, or maybe it's because you've got new data coming in every day and you want to render your Quarto document, produce a dashboard, and then send that to your boss maybe every Monday morning. So your code is not going to be run just once; it's going to be run multiple times, potentially over months or years.
The other thing that's really important about in production is that code in production is typically not going to be running on just your computer. It's going to be running on a server somewhere. And I think there are a couple of differences there. The first is that if you're using Windows, that server is almost certainly going to be a Linux machine, and there are a bunch of small but annoying differences between Windows and Linux that you're going to need to learn about. It's also typically going to be configured as a server, versus your personal desktop. Your personal desktop probably uses the language that you speak and the time zone that you're in. When you move to a server, it's probably going to be using an English or C locale, and it's going to be using some standard time zone.
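As a concrete illustration (not from the talk itself), here is a quick way to check those settings from R; the values shown in the comments are typical examples, not guaranteed:

```r
# Two settings that commonly differ between a desktop and a server;
# worth checking at the top of a production script.
Sys.getlocale("LC_TIME")  # e.g. "en_US.UTF-8" on a desktop, "C" on a server
Sys.timezone()            # your local zone on a desktop; servers often use "UTC"

# The locale changes how the same date prints:
Sys.setlocale("LC_TIME", "C")
format(as.Date("2024-03-01"), "%B")  # "March" in the C locale
```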
The other big challenge with running code on another computer is what happens when something goes wrong. It's hard enough to debug your own code on your own computer, where you can put a browser() statement in, do some little interactive experiments, and figure out what's going on. But now your code is running somewhere that you can't interact with, and typically it's going to take maybe five or 10 or 30 minutes to do an iteration cycle. That's long enough that half the time you send the code off to run, go and do something else, and then you've forgotten what you were doing when you come back to it.
And the third challenge is that typically in production, it's not just you working on the code; it's a team of people. You've got a bunch of data scientist colleagues who ideally need to be able to understand your code and who you need to be able to share work with. But also, you might be getting data that's produced by data engineers in your organization. You might be producing some kind of API that other developers in your organization are going to use to get model predictions. And you're certainly going to be sending the results of your analyses, whether those are Quarto reports or Shiny apps or something else, to the decision makers.
And so today I'm going to focus on the first two, because these are ones that I think you can simulate reasonably well, even if you're not in an organization where you're putting stuff in production. And if you're still a student, if you're just getting started with data science, I think having some of these skills to be familiar with, so you can talk about them in job interviews, is really, really useful. Because it means that once you get a job, you can hit the ground running: you can get data, you can do stuff with it, and then you can automate that whole process.
The demo: scraping an artist's website
And so I'm going to show you a demo that's kind of motivated by my use of TikTok. One of the people I follow on TikTok is an artist named Weston Lambert, who produces these really cool sculptures. And I was like, oh, this is really cool, I would love to own one of those. So I went to his website, and of course every single thing is sold out. The reason it's sold out is that he's got like 600,000 followers on TikTok, so whenever he posts something new, it sells out almost immediately, which is not so surprising. But I really wanted to buy one of his pieces, and so I thought, well, let's solve it.
And so that is what I'm going to show you today: my kind-of-production script that's going to regularly scrape his website and then notify me whenever something new goes on sale. This is very much a production-type thing: you're going to run something regularly, and you want someone to make a decision or take an action at the end of it. So I think it's a good little microcosm of production.
And I'm going to show it to you. And it's not very good. So I am ambitiously going to try and improve it live in front of you all. And so hopefully I won't get too stuck. Or if I do, you can give me suggestions to get me unstuck.
So I have this open in RStudio. And the first thing I'm going to do is use the rvest package. The rvest package, if you haven't heard of it before, is a package for basically turning websites into tidy datasets. So I'm going to go to his website, go to the available work page, which now has basically nothing on it apart from a custom stand, which I don't know why anyone would want to buy on its own. But I can take that. And then I'm going to read the HTML. The way you work with rvest is you select things using CSS selectors. CSS stands for Cascading Style Sheets; it's the language that web developers use to describe how things should be styled.
And the nice thing about CSS selectors is that they also give us a way to identify elements on the page. Now, because the site has updated since I looked at it yesterday, this isn't going to be particularly exciting, because there's only one product available. But hopefully my code will still work. So basically, what I'm saying is: give me... let's just see if I can look at that.
Okay. So one of the things I'm going to try and show: if you ever do any web scraping, the thing that's really useful is the browser developer tools. So I'm going to right-click on this image, and we can see all the HTML. You don't need to know too much about HTML for this purpose, but the idea is that HTML is a tree, and we're going to try and find something in this tree that might be useful. When we look up here, we'll see there's a division of the page with an ID called "product-list", and that seems like a fairly good place to start. The way you select something by ID is to put a hash in front of it, so this is going to give me that element. And then I know everything inside of that is going to be a link, and a link is an "a" tag. So I can look at these products now.
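A minimal sketch of that selection step with rvest. I'm using an inline HTML snippet shaped like the page he describes (a div with id "product-list" containing one link per product); the real site's markup and the link text here are my assumptions:

```r
library(rvest)

# Stand-in for read_html("https://...") so the sketch runs offline:
# a tiny page with the shape described in the talk.
html <- minimal_html('
  <div id="product-list">
    <a href="/products/custom-stand">
      <div class="product-title">Custom stand</div>
      <div class="product-price">$250.00</div>
    </a>
  </div>
')

# "#product-list" is a CSS id selector; "a" then matches every link inside it
products <- html |>
  html_element("#product-list") |>
  html_elements("a")

length(products)  # one product on the page today
```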
And unfortunately, there is only one; this would have been more exciting yesterday. But now I need to look at this element. Let's see what this "a" is and what's inside of it, and try to figure out where the price is.
So you can see down here, this is kind of useful: it's got a class called "product-price" and a class called "product-title". That's pretty suggestive. So I can use another selector. I'm going to say: find all the elements with a class of "product-title" and extract the text. That gives me "Custom stand", and I can do the same thing with the price.
The price has some extra text and some dollar signs, so I just strip all those out and convert it to an actual number. There are lots of other ways I could do this. In particular, this would be much easier if I could use readr, because readr::parse_number() would do all of that for me.
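Continuing the sketch: class selectors start with a dot, and the price can be cleaned up in base R (keeping the dependency count down, as he discusses next) or with readr::parse_number(). The HTML snippet is again a stand-in, not the real page:

```r
library(rvest)

html <- minimal_html('
  <a href="/products/custom-stand">
    <div class="product-title">Custom stand</div>
    <div class="product-price">$250.00 USD</div>
  </a>
')

title <- html |> html_elements(".product-title") |> html_text2()
price_text <- html |> html_elements(".product-price") |> html_text2()

# Strip everything that isn't a digit or a decimal point, then convert.
# readr::parse_number(price_text) would do the same job in one call.
price <- as.numeric(gsub("[^0-9.]", "", price_text))
```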
But one of the things that's interesting about code in production is that your code is going to run somewhere else. And that means all of the packages you need to run the analysis are also going to need to be installed somewhere else. So when you're running in production, minimizing the number of packages you use is really going to make your life easier.
And I noticed that a few people are asking in the chat about my little arrows and triangles here. This is actually down to the font I'm using, which is called Fira Code, and which provides ligatures: custom displays for certain character combinations. So if I put a space in between them, you can see that it's just a regular vertical bar and a greater-than symbol. Similarly, this is a less-than sign followed by a minus. Fira Code just makes those look a little bit nicer, with the unfortunate side effect of confusing people when you show them your code.
Now, unfortunately, today there are no sold out items, so you're going to have to rely on the fact that I figured this out earlier: when a product is sold out, it happens to have an element on it marked sold out. And then I can also figure out where the actual product is.
So I've pulled out all these different variables. And I should say, web scraping isn't the main point of this talk, but I'll point you to a talk I gave about it recently, if I can remember where it was. If you do want to learn more about web scraping, this is a workshop I gave a couple of months ago that will help you learn a little bit more.
Okay, so where are we? What have we done? I've now taken each of those pieces and put them all together in a data frame. This is not very exciting because there's only one row. But in principle, what I want to do now is take a look at the products I saw last time, which I saved as a CSV file. Last time there were these 10 things for sale, almost all sold out. So what I want to do is find which ones are new and are not sold out.
And I'm using the link as a unique identifier: if I've seen it before, don't tell me about it again. So what I'm going to do is find all of the rows where sold out is missing, so they're not sold out, and where the link is not present in the last set of links. I should add a comment here: find all products that aren't sold out and that I didn't see last time.
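In base R (in keeping with the minimal-dependencies theme), that filter might look like this. The column names `link` and `sold_out` and the example rows are my guesses at the shape of the data, not the script's actual contents:

```r
# Products recorded last time (normally read from the saved CSV file)
old <- data.frame(link = c("/products/tide", "/products/custom-stand"))

# Products scraped just now; sold_out is NA when a piece is still available
current <- data.frame(
  link     = c("/products/custom-stand", "/products/new-sculpture"),
  sold_out = c(NA_character_, NA_character_),
  price    = c(250, 3200)
)

# Find all products that aren't sold out and that I didn't see last time
new_products <- current[is.na(current$sold_out) &
                          !current$link %in% old$link, ]
```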
And then, if there are any new products, I'm going to create a little message. Which didn't work. And now I have to figure out why that didn't work.
Let's just try one of those. So that worked. Oh, I see. When I looked at the old data, there already was a custom stand. Okay, so this is doing the right thing: it's saying there are no new products that you haven't already seen, because this custom stand was on sale last time you checked.
GitHub Actions: running the script automatically
But I'm going to update the products CSV file. And now, since this is in Git, let me switch what you can see in the share.
So in the Git pane, effectively what I'm going to do is update the products data. And actually, before I do that, I should probably make sure, I should have done this when I started, but I probably want to pull the latest changes.
Okay. So now there aren't actually any changes in Git, because, as I'll show you shortly, I actually have a GitHub Action that is running this automatically. But that's the basic idea. So what have I done? I scraped the website with rvest, you saw that, and I created a tidy table of data. The next thing I'm going to show you is how I repeatedly rerun this. If we go to the GitHub site and look at the commits, you'll see this repo actually has 319 commits in it.
Most of them have the same, not very useful title. But if I look at one of these, you can see it's updating a CSV file with the changes to the website. So just having this alone is quite useful, right? I've converted a website into a CSV file and I'm tracking the changes over time with Git. So now at least you could look at this and ask: how often do things actually sell out? How often does he add new mystery sculptures to the website? Or you could maybe start to do some more analysis, like: if I want to watch his website, what time should I check?
And the reason this works is that I use a GitHub Action. GitHub Actions are free for everyone to run, at least for open source or publicly available repos. And I've set this one up to run automatically every three hours. So every three hours, I'm going to run a GitHub Action. What's it going to do? It's going to check out the repo. This is kind of common when you're running something in production: often you're going to start with a completely blank slate; you just get given a Linux machine that has basically nothing on it. So the first thing I'm going to do is check out my code. Then I'm going to use some code that we provide through the r-lib/actions repo that's going to install R. So every time this runs, we're going to check out my code and then install R. Then we need to install all the dependencies the script needs, and now I can actually run that script.
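A sketch of what such a workflow file might look like (the real one lives in the hadley/available-work repo; the script name `scrape.R` and the commit details here are assumptions):

```yaml
on:
  schedule:
    - cron: "0 */3 * * *"   # every three hours
  workflow_dispatch:         # also allow manual runs

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4                    # check out the repo
      - uses: r-lib/actions/setup-r@v2               # install R
      - uses: r-lib/actions/setup-r-dependencies@v2  # install packages from DESCRIPTION
      - run: Rscript scrape.R                        # run the scraping script
      - run: |                                       # commit any changed data back
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          git commit -m "Update data" || echo "No changes"
          git push
```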
So that's not quite enough, because I also need to check the results back into Git. I wish I could tell you how I came up with this code, but I can guarantee you I did not write it myself, because I don't remember doing that, and I don't remember why I need some of the stuff in there. But I wrote a comment here that this is an idea I copied from Simon Willison. He used a slightly different approach, but it's a pretty similar idea.
So what's going on with the script? I start with a completely blank slate. I get my code out, install R, install all the dependencies I need, run my script, and then save the results. And if you go back far enough in time, you can see all 300-odd commits here.
So I guess this has just been running for a while. You can see there are a lot of initial commits where there was quite a lot of iteration to get this working correctly. But I eventually got it working, and once I did, it has just been running tirelessly every three hours, updating whenever the site changes, for the last eight months.
So I see Philippe in the chat asked a good question: how does it figure out my dependencies? I actually created a DESCRIPTION file here. Not because this is a package, which would always have a DESCRIPTION, but just because we've got some easy tools to install dependencies if you do have one. One of the downsides of this approach is that it's going to install the latest version of each package, the packages that are currently available on CRAN, which is not a great idea.
So the other thing the script is supposed to do is send a push notification. Unfortunately, I never actually got that push notification to work. But I did a little exploration yesterday, and I think we can get it working. I realized a couple of minutes ago that my goal was to do this push notification and then show you on my phone, but I forgot that I also use my phone as my webcam, so that's going to require some gymnastics.
But I'm using this package from Jonathan Carroll, which uses this free service called ntfy. Its main feature is that it's free, basically, and it does just what we need. The one wrinkle is that ntfy is basically public to everyone, and the way we get around that is to create a topic name that's basically just a random string. You can sign up for these notifications: if you quickly copy this URL down, I'll put it in the chat. Actually, it's not a big deal. And now, from R, I can run some code.
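Under the hood this is just an HTTP POST to ntfy.sh, so you don't strictly need a package; here is a sketch with httr2, where the topic name is a made-up placeholder (in practice you'd pick your own random string):

```r
library(httr2)

# Build the request: POSTing a body to https://ntfy.sh/<topic> is all
# that's needed to notify everyone subscribed to that topic.
req <- request("https://ntfy.sh/available-work-a1b2c3") |>
  req_body_raw("New piece available: Custom stand ($250)", type = "text/plain")

# req_perform(req) would actually send the notification
```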
And then, hopefully... I got this notification. And the cool thing about this is that as well as being a desktop app, it's also an iPhone app, so I actually get a push notification. It even came to my watch, which I can't show very easily, but I got a notification on my watch telling me that something has happened on that website.
So I think this is pretty cool. I've now got all the pieces in place so that when something changes on this website, I get a notification and I can actually go and use it. And that did, in fact, work successfully for me: I purchased this sculpture, which I really, really like.
Challenges of running code in production
So let's talk a little bit more about some of the challenges you'll face when putting code into production. And then, depending on how long that takes, we can come back and maybe make some improvements to this code.
So let's think about some of the challenges. For this particular example, I'm going to ask you to imagine that you're a data scientist working for an ice cream store. All of my images are drawn by ChatGPT, which is notoriously bad at words, so please enjoy the terrible spelling as we go.
So you're a data scientist for an ice cream company. You want to help them predict how many ice creams are going to sell tomorrow, maybe based on how many ice creams they sold today and the weather forecast for tomorrow. And so you've written a bunch of R code, you've fit some models, and you've made a nice Quarto notebook. And now you want to run this every day so that the people who are in charge of making ice cream in your organization, or shipping it out to the stores, know what to do.
And so what are the challenges to doing that? Well, the first challenge is that the data is going to change. This is probably the most obvious one: in winter, sales of ice cream are likely to drop. But if you don't have data about winter, if you don't have enough data going back in time, the first time you see something unusual like that, you're going to get bad predictions.
Another thing that might happen is that the schema of the data might change. The schema is the definition of the data: the variable types, the variable names. So maybe you've been working with data like this. Obviously, your company does not make weather forecasts, right? You're downloading this from some API somewhere on the internet. And one day it changes.
So what I want you to do is just take a minute and think about this. You might put ideas in the chat if you want. What are the differences between these two data frames, and what do you think the impact on your code is likely to be? I haven't shown you any code; I just want you to imagine what these changes are going to do. So let's just take a minute: see if you can identify the three differences and think about what the impact is going to be.
So what are the differences? The most obvious one is that temp has changed to temperature. So what's the likely impact of this on your code? You're probably going to get an error message, right? Which is good; it's kind of the best thing that could possibly happen compared to some of the changes we'll talk about next. It tells you that the format's changed and you have to go back to your code and fix it.
What else has changed? Well, we've changed from ISO 8601 date format, which is used commonly in Europe, to month/day/year format, which is used commonly in America. So what might happen here? I think you've got a few different options. First of all, your code might just error because you've hard-coded that you expect it to be in year, month, day format. It might just work, because the date types are automatically guessed from the column. Or, in the worst possible scenario, it might guess that these are day/month/year dates. In this subset of data, those are all valid dates: it could be the 5th of January, the 5th of February, the 5th of March, the 5th of April. As humans looking at this, we're likely to say that seems implausible, but if you're very unlucky, the code will automatically guess the wrong order, and now you're just going to get nonsense. Your code isn't going to error; it's just going to give bad results.
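You can see all three outcomes with base R's as.Date(); the date strings here are illustrative:

```r
# The same string parses to two different dates depending on the assumed
# field order -- the silent "nonsense results" case:
x <- "01/05/2024"
as.Date(x, format = "%m/%d/%Y")  # 2024-01-05 (January 5th)
as.Date(x, format = "%d/%m/%Y")  # 2024-05-01 (May 1st)

# And if the incoming format changes but your code doesn't, base R
# returns NA silently rather than erroring -- worth checking explicitly:
as.Date("2024-01-05", format = "%m/%d/%Y")  # NA
```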
And the same thing that might happen, that's almost certainly going to happen, with the last one, where it's changed from Celsius to Fahrenheit. This is still a number, a valid number, but if you fit your model on Celsius and now you're giving it Fahrenheit, you're going to get terrible predictions out of it. And you're not going to know that, right? Nothing is going to throw an error; you're just going to be getting really bad results. And that's kind of the worst possible scenario for code that's running in production: it doesn't error, it just silently gives the wrong results.
Okay, so those are the first two challenges: you might get data you've never seen before, and the schema might change. Another place the first one might arise, for example, is if originally the ice cream store was only open on the weekends and now it's also open on weekdays. Maybe the day of the week is a variable that's important, and now you're going to get new days of the week that you need to make predictions for. Or maybe one of your dependencies changes.
So maybe a great new version of your favorite package comes out, and it adds a bunch of cool new features that you love and think are amazing. But it also breaks one of your existing plots because something no longer works. Maybe that's because you were relying on a bug; maybe it's because the ggplot2 developers made a mistake; maybe they made some deliberate change. But for whatever reason, your code no longer works, because you are installing the current version every day.
Now, there's a really good fix for this, and that's to use renv. The basic idea of renv is that it captures all of the versions of the packages that you're currently using. There are only two packages here that I'm using directly, rvest and ntfy, but it captures all of the packages that those packages depend on as well, and it records all that information in a lock file. So I've changed the project so there's now an renv lock file. If I look at that lock file, it's a JSON file, but most importantly, you can see it's got the version of every single package I have installed. So now when this runs on GitHub, it's going to use exactly the same versions of those packages, even if new versions have been released in the meantime.
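The renv workflow boils down to three calls, run interactively in the project rather than inside the scheduled script (a sketch of the usual sequence; see the renv documentation for the details):

```r
# One-time setup: create a project library and an initial renv.lock
renv::init()

# After installing or updating packages, record the exact versions in renv.lock
renv::snapshot()

# On another machine (e.g. the GitHub Actions runner), reinstall
# exactly the versions recorded in renv.lock
renv::restore()
```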
Another thing that might change is the entire platform. Maybe there's a new chip that comes out, and maybe there's some difference in how it does linear algebra, and it gives slightly different results for your model. Or maybe the operating system changes, or the C libraries that the packages you use depend on change, or the version of R or Python changes. This is much, much less common, particularly these days, but there have been some famous examples in the past. I remember something like 20-plus years ago, an Intel chip had a bug in its math operations, so a bunch of people got incorrect results in their calculations. So this is not very common, but it's good to be aware of. And the way that people tend to solve this problem in practice is to use containers. This is where you might have heard of Docker: a container basically captures an entire operating system in a box, so you can ensure that every time your code runs, it's running on exactly the same system.
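For completeness, a minimal sketch of what pinning the platform with a container might look like, using the Rocker project's version-tagged R images (the tag and the file names are examples, not from the talk):

```dockerfile
# Pin the operating system and R version together
FROM rocker/r-ver:4.4.1

WORKDIR /app
COPY . /app

# Reinstall the exact package versions recorded in renv.lock
RUN R -e 'install.packages("renv"); renv::restore()'

CMD ["Rscript", "scrape.R"]
```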
The other challenge that you might face is that the universe might change. In the case of an ice cream store, maybe your shop changed location. Obviously, if it's now on a beach instead of in the city, the sales patterns are likely to be very different. But your code is just going to keep going; it's going to keep fitting the model that worked for you originally. Maybe that model doesn't fit very well anymore and isn't going to give you good results. And worse, it's not going to give you a clear error that says, hey, the model's wrong, because models, by their very nature, can't do that.
And so you might have heard of terms like concept drift or model drift or data drift. The basic idea is that a model, by its very nature, captures reality imperfectly, and it captures it best at the time it was fit. As you get further away from that time, more and more little errors are going to creep in. So even if nothing major changes, if you only fit your model once and make predictions from it, those predictions are only going to get worse over time, because you've implicitly fit something like a Taylor series approximation to the universe, and as you move away from the approximation point, it gets worse and worse. And the way you resolve that is that you can't just set and forget the model; you're going to have to check it regularly.
In the case of my little production thing, I think what the universe changing would look like is this page looking different, like the structure of the HTML changing. And the way rvest works is that if it doesn't find anything, if I deliberately introduce a misspelling into a selector, it's just going to report nothing. So in my code here, what I really should be doing is saying: if the number of products equals zero, throw an error. That means my script isn't going to continue running even though the structure of the website has changed, quietly giving me nonsense results. And the nice thing about throwing an error like this is what happens when you run R in batch mode. If you remember, my scraping workflow calls Rscript, and if you call Rscript instead of using R interactively, whenever there's an error it's basically going to quit and not run any more code. When it quits, GitHub Actions sees the failure, and that gives you an error that you get notified about through your GitHub notifications.
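A minimal sketch of that guard, assuming the scraped results end up in a data frame called `products` (the function name and message are illustrative, not Hadley's actual code):

```r
# The guard: an empty result almost certainly means the page
# structure changed, so fail loudly instead of silently succeeding.
check_nonempty <- function(products) {
  if (nrow(products) == 0) {
    stop("No products found: has the page structure changed?")
  }
  invisible(products)
}

# In the scraping script this would run right after the rvest code
# that builds `products`:
# products <- check_nonempty(products)
```

Under Rscript, the `stop()` produces a non-zero exit status, which GitHub Actions reports as a failed run.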
The last challenge of running code repeatedly in production is that, over time, your requirements are going to change. In some ways, the best and worst thing that can happen to you as a data scientist is that something you've done, some dashboard or model you've created, becomes so important that the executives in your company start to rely on it. That's awesome, because your work is having a very direct impact on the company, but it also means those people are going to be looking at it and emailing you with requests to change it. I don't think that's a problem necessarily, but dealing with that kind of iterative flow of requests is not something data scientists tend to be trained in. It's not something you learn in university. So how do you keep track of those requests? How do you make sure that you continue doing the work that is important to you as a data scientist, even while you get requests from several layers above you in the org chart to do things urgently?
So those, to me, are the things you need to think about when you're writing code that's running long term. Let's take a look at this. We've already seen an example of the data changing, right? I looked at the page today and there was only one product there, which made this much less interesting than it might have been.
One thing that makes me a little more comfortable about scraping this is that, as you can see, it's powered by Squarespace, which is a big platform that powers tons of websites. So I know that behind the scenes this is all automated; someone's not manually handwriting this HTML. It's generated by code, and it's unlikely for that code to change in the short run. It might change in the long run, and that is why I really should add the check I described, because otherwise my code will just continue "working": it'll return a zero-row data frame, everything else will run, and I just won't get any notifications.
So we can protect against changes to dependencies, as I said, by using renv, which basically works by recording the exact versions of all the packages you use. I've also mostly protected against changes in the platform in my YAML file, where I say run on ubuntu-latest. That specifies the container, the operating system and all the system libraries, that GitHub is going to use to run my code.
If I wanted to be safer, I could change that to a specific version of Ubuntu, so rather than using the latest released version, I'd always use, say, Ubuntu 20.04. In this case I don't think it's that important, because all of this code is very, very simple. It's easy to see what might go wrong, and I don't think the operating system is likely to affect it. Operating systems change relatively slowly anyway, and I don't expect to be running this script for years on end. But again, this is something I could do: lock down a specific version of the operating system.
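In a GitHub Actions workflow, that pinning is a one-line change; the job and step layout here is illustrative, not the actual workflow file from the repo:

```yaml
jobs:
  scrape:
    # "ubuntu-latest" floats over time; pinning an explicit version
    # keeps the OS and system libraries stable across runs
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - run: Rscript scrape.R
```

The trade-off is that pinned runner versions are eventually retired by GitHub, so a pinned workflow needs an occasional manual bump.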
Beyond the platform changing, we also talked about the universe changing. In this case, the universe changed because I successfully purchased the artwork I wanted to purchase, and so the script is no longer useful. I think that's a not-uncommon end for a data science project.
Sorry about that. I just lost power. But I am back. Hopefully we will not lose power again.
Okay. Let me share my screen again and regain my train of thought. We talked about the platform and the universe. Yes: in this case, the universe changed in a way that means the data analysis in production has achieved its desired effect, and I can now effectively wrap it up. That's not super uncommon. Sometimes you do an analysis just to get something to change; it's changed, and then you're done.
Or maybe that was the requirements changing, because in this case I got what I needed out of the analysis. So I think that's probably a good place to start. I showed you a few ways I could make this code more production-ready by adding errors so that when something goes wrong, I get notified about it. I also spent a little bit of time yesterday figuring out how I could get rid of the dependency on the ntfy package to make my code even simpler.
And that just means I'm no longer using the ntfy package; I'm using the httr2 package to make the request directly. There's no real reason to do that, except that if you were putting this code into production in a real company, it gets pretty hard to use code from random GitHub repos, because that code could change at any point, and many IT departments will not let you just run random code from the internet.
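A sketch of what that direct call might look like, assuming the ntfy.sh service; the topic name `my-topic` is a placeholder you would replace with your own:

```r
library(httr2)

# Send a push notification via ntfy.sh without the ntfy package;
# anyone subscribed to the topic in the ntfy app receives it
notify <- function(message, topic = "my-topic") {
  request("https://ntfy.sh") |>
    req_url_path_append(topic) |>
    req_body_raw(message, type = "text/plain") |>
    req_perform()
}

# notify("Found new products!")
```

Since ntfy.sh just takes a plain HTTP POST, the only dependency left is httr2, which is on CRAN and easy for an IT department to approve.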
So I think let's stop there, and we've got some time for questions. I won't walk you through every challenge of running this somewhere other than my computer, but you did see a few of those along the way. I will tell you that the absolute hardest thing to debug is when the notification itself doesn't work, because when it doesn't work, all that happens is you don't get a message, which is obviously very difficult to detect. So with any of these problems, you should expect some pain and frustration, especially the first time you do it.
Q&A
Peter, do you want to ask them, or do you want me to just pick them out and read them?
Avis, would you like to read the question and ask it yourself?
Yes, I can read. I have two questions. The first one is about web scraping: apart from the RSelenium package, are there any other R packages that can scrape JavaScript-managed web pages? And the second question is about data drift: how can you automatically identify such drifts?
Thank you. So here are my answers. First, rvest now has experimental tools for scraping live pages that run JavaScript. It actually runs a web browser in the background and interacts with it, so you can interact with any website exactly as if you were a real person. To echo one of the questions earlier in the chat, I will not be telling you how you can use this to scrape Ticketmaster or other places like that. I will say these techniques generally don't work on any website where you can imagine people really care about scraping, because Ticketmaster wants to stop people from scraping it; they don't want to enable scalping. So they introduce a bunch of tools to stop that, which, of course, you can overcome if you're creative enough. But I would never suggest such a thing.
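The entry point for this in rvest is `read_html_live()`, which drives a headless Chrome session via the chromote package. A minimal sketch, with a placeholder URL and selector (this needs Chrome installed, so it is shown but not run):

```r
library(rvest)

# read_html_live() opens a headless browser so the page's JavaScript
# runs before you scrape; requires Chrome via the chromote package
page <- read_html_live("https://example.com/js-rendered-page")

# The returned object supports the usual rvest verbs
page |>
  html_elements(".listing") |>
  html_text2()
```

The live-page object also exposes interaction methods (clicking, scrolling), which is what makes it usable on pages that only load content as you interact with them.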
Identifying data drift, I think, is more challenging. I'm kind of surprised there's not more research and interest in this. The best thing I've come up with, or that was suggested to me, is the applicable package that Max Kuhn suggested. Basically, every time you make a prediction, the applicable package tells you how far away that new observation is from the training data. That just seems like something you probably want with every prediction: am I making a prediction that's solidly in the body of data I've seen before, so I can feel really confident? Or am I making a prediction that's a very long way from the data I used to fit the model, in which case I should take it with a grain of salt? Or maybe, if it's far enough away, I should automatically flag that and encourage people to refit the model.
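As a rough sketch of how that might look with the applicable package, assuming its PCA-based interface (`apd_pca()` and `score()`; the data frame names are placeholders and the exact column names are from memory, so check the package docs):

```r
library(applicable)

# Fit an applicability-domain model on the training predictors
ad <- apd_pca(~ ., data = training_predictors)

# Score new observations: a distance percentile near 100 means the
# new point is farther from the training data than almost all of it
scores <- score(ad, new_predictors)
flagged <- scores$distance_pctl > 99
```

The idea is to attach that percentile to every prediction you serve, and treat high values as a signal to distrust the prediction or to refit the model.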
So she said: I've used a package that notifies you in Slack, and it works great. It wasn't really a question, just a note that that's an alternative to use for notifications instead of ntfy.
Yeah, there are a ton of packages out there, and I think Slack is useful. One of the things I like about ntfy is that there's just so little to it. It's free, and you don't have to worry about credentials, which is one of the other painful parts of putting stuff in production: how do you get all of your credentials shared correctly? If I were going to publish to Slack, I'd need to somehow provide either my Slack username and password, and putting that on my GitHub would probably be a bad idea, or some kind of token, and that's just a bunch more work. But if you're doing this inside your organization, Slack makes a lot of sense.
I see David asked a question about how you manage credentials in production. There are two ways, I think. There's the way that's not particularly secure, but it's okay, it's fine, and it works everywhere: environment variables. Locally, that means you can run something like usethis::edit_r_environ(), which I'm not going to run because it would show you all my environment variables, which contain a bunch of secret stuff. The basic idea is that you have this one file, your .Renviron file, which contains a bunch of secrets and never gets committed to GitHub. And then on GitHub, under Settings, then Secrets and variables, you can add the same values.
Once you've set a secret there, you can never actually see it again; you can't edit it, only replace it. But there are ways to read it into an environment variable in your GitHub Actions. You have to be careful never to print it out, although GitHub does take some basic precautions to make sure you don't accidentally do that. So that gives you something that's secret locally and secret everywhere else. This environment-variables approach basically works everywhere, but it's kind of a pain because it typically relies on you doing some copy and paste.
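In the R script itself, reading a credential then looks something like this (the variable name `NTFY_TOPIC` is a made-up example):

```r
# Read a secret from the environment; locally it comes from
# ~/.Renviron, on GitHub Actions from the repository's secrets
get_secret <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set")
  }
  value
}

# topic <- get_secret("NTFY_TOPIC")
```

On the Actions side, the corresponding step maps the stored secret into the environment with something like `env: NTFY_TOPIC: ${{ secrets.NTFY_TOPIC }}` in the workflow YAML.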
One of the things we've been working on as part of the professional products is making all of this just work. So if you're using Databricks or Snowflake, all of the credentials magically flow through. In a well-resourced organization, that's the way it should work: your administrator takes care of all the auth stuff so you don't need to worry about it. Until we get to that point, it's environment variables, and you do whatever you need to do, wherever you need to do it, basically.
Okay, thank you. There is a question from Eugene: what does the future look like for R?
Yeah, I'll preface my remarks with, I think it was Yogi Berra who said it's tough to make predictions, especially about the future. So I don't know what's going to happen. One of the things that makes it hard to get a sense of what's up with R is that I think the absolute number of R users is still increasing, but the percentage of people using R for data science is decreasing, because the number of people using Python for data science is increasing faster.
Now, don't get me wrong, Python is a great language, and I think the reason more and more people are using it for data science is that it's a great general-purpose programming language. But I don't think it's reasonable for there to be just one programming language; it makes sense to have some general-purpose tools and some special-purpose tools. And while R is a general-purpose programming language, you can do anything you want in R, it's particularly well tailored for the needs of data science. I think the design of R means, in my biased opinion, that tools like ggplot2 and dplyr are always going to be better in R than in basically any other programming language, because no other language gives you the flexibility to implement those sorts of APIs, which give you this very fluent interface for exploring data. I still think ggplot2 and dplyr are better than their Python equivalents, and I think they always will be.

