R-Ladies Gaborone & R-Ladies RTP (English) - Personal R Administration

Transcript

This transcript was generated automatically and may contain errors.

Sure. Thank you, Shayla. So yeah. Hi, everyone. Like I said, my name is David Aja. I'm a solutions engineer at Posit. So I work on helping people understand how to get their data science environments to work at their jobs. And so today, we'll be working through part of a course that we'll be teaching at posit::conf. It's the What They Forgot to Teach You About R course. And our focus for today will be personal R administration.

I think just for a programming note, the runtime here is roughly three hours. If you can't stay for all three hours, that's fine. Feel free to drop. Like I said, there's a recording. But this is gonna be a little longer than the standard meetup. And our goal here is going to be to cover the things that you need to know to set up a development environment that's going to work effectively for you as you use R.

The material, the slides I'm working from now, you can access from GitHub. This is the link to the course repository, which will have links to these slides as well as some things we won't be covering here today. So you can get those things here. And then you can also if you go to rstats.wtf, you'll also see a bunch of links to a lot of material we'll be talking about today. I'll also drop a link to the slides in the chat.

Nothing, right. So when you set a project-level .Renviron file, your user-level one does not get evaluated, right. So there's a name for that behavior, which is short-circuiting.
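To make the short-circuiting concrete, here is a minimal sketch of the lookup order that ?Startup documents for the user-level environment file. The helper name find_renviron is hypothetical (it is not part of R or usethis), and it only reports which file would win; it does not read it.

```r
# Sketch of the .Renviron lookup order described in ?Startup.
# `find_renviron` is a hypothetical helper, not a function that ships with R.
find_renviron <- function(project_dir = getwd(), home_dir = path.expand("~")) {
  # An explicit R_ENVIRON_USER setting takes precedence over both locations.
  override <- Sys.getenv("R_ENVIRON_USER", unset = "")
  if (nzchar(override)) {
    return(override)
  }
  # Otherwise R looks in the current directory first, then the home directory.
  candidates <- file.path(c(project_dir, home_dir), ".Renviron")
  existing <- candidates[file.exists(candidates)]
  # Only the first match is used; the other file is never evaluated.
  if (length(existing) > 0) existing[[1]] else NA_character_
}

find_renviron()
```

If a project-level file exists, the function returns it and never considers the one in the home directory, which is the behavior described above.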

Version control and secrets

Questions about that before we move on to another way of customizing the way R starts up? Well, I have a question, I guess, about collaboration, right? If you put it in the actual project, and then that's under some sort of version control, is that generally an ignored file? Yeah, I'm wondering.

Yeah, so again, one of the reasons I don't usually give this workshop on my MacBook is because this is where all of my actual secrets are. So I have to do some things off screen to make sure I can show this to you safely. Gitignore. So, as part of my Git configuration, when I work with version control on this machine, there is a collection of files that I have Git configured to never recognize in any project. So if you work on macOS, right, you should ignore this file. Everyone should ignore this file. There's a bunch of R ones. So, you know, for example, the .Rproj.user folder that gets created by RStudio when you launch a project. The .Rhistory, right? Like I don't want to share everything I've ever typed into the console with my collaborators. I don't want to keep .RData files on disk, right? I don't want my .httr-oauth tokens in there. And the .Renviron file is another one of those things that I ignore for all projects.

So typically, if you have secrets that you're putting into that kind of context, you need some way of distributing those out of band. So that might be a password manager that your whole team uses. That's the only recommendation that I could make safely. There are some good free ones, but you need some solution for communicating secrets out of band and for setting them on remote systems without passing them around in your code. This is how you do that. But yeah, so the .Renviron file, I ignore globally.

And this is kind of a neat capability of R, which is really the only language I've used that has a built-in way of doing this. Lots of other languages have a convention of using something called a .env file, and then having some package responsible for reading that .env file to get the same behavior. But this is kind of a neat thing that we get for free without needing to install stuff. So yes, don't commit your .Renviron files.
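As an illustration of reading a value that was set in .Renviron, here is a small sketch. The helper get_secret and the variable name MY_API_KEY are made up for the example; Sys.getenv() and Sys.setenv() are the base R functions doing the work.

```r
# `get_secret` is a hypothetical helper; MY_API_KEY is a made-up variable name.
get_secret <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    # Fail loudly instead of silently continuing with an empty credential.
    stop("Environment variable ", name,
         " is not set; add it to your .Renviron.", call. = FALSE)
  }
  value
}

# For demonstration only: in real use, a line like MY_API_KEY=... in .Renviron
# sets this at startup, and the value never appears in the script.
Sys.setenv(MY_API_KEY = "not-a-real-key")
get_secret("MY_API_KEY")
```

The script only ever refers to the variable by name, so the code can be committed while the .Renviron file stays ignored.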

The R profile

Right, so we talked a little bit about those secret key-value pairs that you can use. You might also want to create some code that you run at the beginning of each session. And one of the places you can do that is in this .Rprofile. It's R code that runs at the beginning of each session. A lot of the time, if you're running a script at the beginning of an R session, one of the questions you have is, am I doing this in an interactive context or not? So an interactive context would be one like the one we're in now, where I'm sitting here, I am typing and hitting enter or control-enter, and lines are getting sent to the console. If you're knitting an R Markdown document or running an R script from the command line or launching a Shiny app, those are not interactive contexts. And most of the customization that you do in the .Rprofile, not all of it, but most of it, is usually focused on stuff that's happening in the interactive context.

So things you might want to put in your .Rprofile. You might want to set the default place where you get CRAN packages in your session. There are other ways to do this. If you're using the RStudio IDE, it has some ways of configuring the default CRAN mirror that you're going to fetch packages from. But if you were working on a system where you didn't have access to that and you wanted to make sure you were getting packages from the right place anyway, setting a default CRAN mirror is a great way to do that. There's a link to the prompt package. If you are the kind of person who works with R in the terminal a lot, you can use it to customize the way R looks in the terminal. It does not work exactly the way you want it to in RStudio, but you can see here, right, the usage instructions for this package describe setting it to run in your .Rprofile in interactive contexts. So something fun to check out if you do any sort of R work in the terminal.
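A minimal sketch of what such an .Rprofile might contain follows. The mirror URL is an assumption (substitute the one you actually want), and the prompt is changed with the base R prompt option rather than the prompt package, just to keep the example free of dependencies.

```r
# Example .Rprofile contents (a sketch, not the speaker's actual file).

# Set a default CRAN mirror so install.packages() does not have to ask.
# The URL below is an assumption; use whichever mirror suits you.
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Keep cosmetic customization to interactive sessions only, so scripts,
# rendered documents, and Shiny apps are unaffected.
if (interactive()) {
  options(prompt = "R> ")
}
```

The repository option applies everywhere, while the interactive() guard limits the prompt change to sessions where someone is typing at the console.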

But it's important to note, right, there are some things you don't want to put in your .Rprofile. And in particular, if it matters to code you're sharing, then you don't really want it to be in your .Rprofile. So let's see if we can figure out why we might not want to put these things in the .Rprofile, right? So why might you not want to set stringsAsFactors to FALSE in the .Rprofile? This example is showing its age. I'm old. When I started using R, this was more of a problem. It's less of a problem now, right? But if you're going to read something into R and you want to set the type it comes in as, that's something you want to do explicitly in the code so that when you share your code with someone, they get the same results, right? You don't want to, for example, load something like the tidyverse in an .Rprofile, because if you share code with a collaborator whose .Rprofile doesn't load the tidyverse, they might run the code and get a different set of packages in their environment, right? So that's something you would want to do explicitly in the script you're sharing. Don't alias functions in your .Rprofile, because that'll cause some of the same problems. Someone might not end up with this f available in their environment, and then when you share your code with them, it won't work. And then, again, if you were to set a theme in your .Rprofile, then when someone else renders your plots, they're not going to get the same thing, right? So these are things that matter to code that you're sharing, so you don't want to put them in the .Rprofile. You'll want to set them explicitly in the script instead.
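To illustrate setting behavior explicitly in the script rather than in the .Rprofile, here is a small sketch using the stringsAsFactors example. The inline CSV text is invented so the example is self-contained.

```r
# Inline CSV text stands in for a real file (made-up data).
csv_text <- "id,group\n1,a\n2,b\n3,a"

# State the column-type behavior in the call itself, so a collaborator
# gets the same result regardless of their own startup files.
as_character <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factor    <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_character$group)
class(as_factor$group)
```

The same reasoning applies to library() calls, function aliases, and plot themes: if the result depends on it, it belongs in the shared script.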

We don't have neighbors today, so I'll just, I'll pick on, will I pick on someone? No, that seems like a lot. But Shannon, I guess I'll pick on Shannon. Shannon, why might these be safe to put in your R profile?

So usethis and devtools are things that you tend to use interactively, and the functions in these packages are not imperative to reproducible code and data reports and data artifacts. It's just things that you use on the fly. Right. So yeah, thank you, Shannon. I like to think of these as development dependencies, right? usethis is something I'm calling a lot if I'm doing interactive work. I'm just repeating what she said. That's not helpful, but they're development dependencies, right? So if I send a package to my collaborator, they don't need to have devtools installed to work with the package necessarily. So those might be safe to put in your .Rprofile. I still wouldn't, but this helps you kind of understand the distinction there.
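For readers who do want development helpers attached automatically, a common pattern looks roughly like the sketch below. The requireNamespace() guard is an addition so the startup file does not fail on a machine where the package is missing.

```r
# Example .Rprofile fragment (a sketch): attach development helpers only in
# interactive sessions, and only if they are installed.
if (interactive() && requireNamespace("usethis", quietly = TRUE)) {
  suppressMessages(library(usethis))
}
```

Because of the interactive() check, scripts run from the command line and rendered documents never see the attached package, so shared code cannot come to depend on it by accident.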

Go ahead, Greg. You mentioned developer dependencies. Can I handpick which .Rprofile I load on execution? Can you handpick which .Rprofile is loaded when R starts? I mean, well, whenever you load a project. Sorry, I'm trying to encapsulate stuff into projects as opposed to just code. Right. I'm going to say, stay tuned, Greg. We're getting there. I mean, there are definitely ways that you can customize the behavior, like in the flowchart I showed at the beginning. There are a number of things that happen before the files that we're talking about putting on disk, right? These are kind of the most common ways. If you have a need to go deeper into the system, you can totally change those things. Some of the tools we'll talk about in a little bit take advantage of some of these facts to give you a startup experience that I think is going to reflect some of what you're looking for. But we'll get into some of those details in a little bit.

Dot files and the R profile

So one way that you can figure out things that people put in their .Rprofile is by searching through them on GitHub. Files that start with this dot prefix are called dot files. It's a very creative name. And they are often configuration or other files for programs that people use on their computers.

Some people share their dot files publicly as a way of making them available for other people who want to use them or just because they, whoops, because they want to use them to set up new computers. My dot files are public for that reason. What is happening to my ability to copy and paste? There we go. We've done it. Right. So if you search for .Rprofile files, right, Colin Gillespie, you can check out his .Rprofile. He's got a lot of stuff happening in here. Right. So you can check some of these out on GitHub for some inspiration for things to do.

Right. A lot of people also put things under just a dot files repository, but I use a dot file manager system. So the translation to R is not immediately apparent there. Anyway.

Okay. So we're going to do the same thing we did last time. Right. Now we're going to edit our user .Rprofile. Then we're going to edit our project .Rprofile. And we're going to see what happens after you restart each R session. So I'll give you five minutes to do that.

Live demo: editing R profiles

All right. So let's try out this activity and then we'll take a break. What I'll do quickly is confirm I don't have anything. Okay. So I'm going to call usethis::edit_r_profile(). Again, you can see it takes the same scope argument as edit_r_environ(). So if I don't provide a scope, it's going to default to the one in my home directory. Right. So this is my user .Rprofile. And since this is the .Rprofile, we can just put R code in it. So I'll say hello from the user R profile. I save and restart. Right. Now we can see this is R code that gets executed on startup. So I started the session. This was executed. I didn't do anything. That's just what happens when I put R code in this user .Rprofile.

Now I'm going to edit the same file, but in the project scope. Again, you can see the project scope I have here. Beneath my home directory, I have my projects folder. And then in this project, my WTF R-Ladies Gaborone project, I'm modifying this .Rprofile. And again, this is R code. So I use the print function to say hello from the project R profile.

Save. Restart. Right. I get hello from the project R profile. Because the .Rprofile is set in my project directory, the one in my home directory is not evaluated. So it's the same short-circuiting behavior. But in this case, we're executing R code instead of setting values that we have to retrieve.
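The demo steps can be sketched as the calls below. usethis::edit_r_profile() and its scope argument are as described in the session; the guard around the calls is an addition so nothing opens outside an interactive session, and the message text is illustrative.

```r
# Run interactively; each call opens the corresponding file in your editor.
if (interactive() && requireNamespace("usethis", quietly = TRUE)) {
  usethis::edit_r_profile()                   # user scope: ~/.Rprofile
  usethis::edit_r_profile(scope = "project")  # project scope: ./.Rprofile
}

# Lines to place in each file for the exercise (illustrative text):
#   user .Rprofile:     print("hello from the user R profile")
#   project .Rprofile:  print("hello from the project R profile")
```

After saving both files and restarting R inside the project, only the project message should appear, because the project file short-circuits the user one.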

Questions about any of that before we break briefly? Like I said, this behavior is going to come back in a bit when we talk about reproducible environments with renv. But if there are no questions, should we give it another five minutes, Shannon? Yeah. So give it five minutes. Take a break. Go get some coffee or whatever beverage is appropriate for whatever time zone you're in. And we'll see you in five minutes.

So we're gonna try this out in my environment. I'm gonna call usethis. I'm gonna edit my project .Rprofile. And that's this. I'm gonna drop this in here. In this case, I'll also use Package Manager, P3M, the CRAN latest. So now I have rOpenSci and Package Manager set as repositories.

I'm going to restart R. By the way, that interrupt that you see when I restart R is because I have a weird thing happening in my terminal. You shouldn't see that, but just to explain what's going on. If I kill this terminal, it will stop doing that.

So if I were to run options("repos") now, I would confirm that I have those two repositories set. And so now I can install.packages gitcellar. And you'll see it got fetched from the R-universe. And we downloaded a binary package. So a neat thing to know about R-universe is it's an rOpenSci project. It's for people who need to distribute more complicated packages. Often they're associated with specific scientific domains, and you want to distribute binary packages that are too difficult to get on CRAN for whatever reason. Then looking at the R-universe is a great way to get some of those packages. And if you have packages you want to distribute and CRAN isn't the right place, setting up an R-universe is pretty easy.
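A sketch of the repository configuration described here is below. The exact URLs are not spelled out in the session, so both are assumptions based on the public rOpenSci R-universe and Posit Public Package Manager addresses; the package name in the commented line is a best reading of the transcript.

```r
# Example project .Rprofile fragment (a sketch; both URLs are assumptions).
options(repos = c(
  ropensci = "https://ropensci.r-universe.dev",  # rOpenSci R-universe
  CRAN     = "https://p3m.dev/cran/latest"       # Package Manager, latest CRAN
))

# Confirm which repositories the session will use.
getOption("repos")

# With both set, a package that is not on CRAN can still resolve, e.g.:
# install.packages("gitcellar")
```

install.packages() consults every repository in the repos option, so a package missing from one can still be found in the other.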

So that's the idea, right? We can modify this repository option. This is a thing we would do in our .Rprofile, and it gives us the ability to install binary packages from somewhere else. When R says a version of the package might be available elsewhere, this is a kind of polite error message, right? It's not available in the repositories you've listed, but it might be available somewhere else. So there isn't necessarily one good place to look, because there's a kind of infinite number of places the package could be. But for the most likely places, I would say Package Manager has a search function. So if, for example, a package has been archived on CRAN, you might have an easier time finding it on Package Manager because it displays both current and archived packages, even if the process you have to go through is slightly different. So look at Package Manager. You can look on the CRAN website itself. Sometimes packages are only on GitHub and not distributed through an R-universe. Those are going to be the most common places. Those are where I would look.

So, right, we got gitcellar. We got a binary version of the package. And how do we know? We know because it told us that it downloaded a binary, right? Also, if you look at the extension here, you can see it's a .tgz, which is what the binary package format looks like when you download things on macOS. Again, it's slightly different on Windows and Linux.

So binaries, right, the easiest thing to get. But if you are installing packages from somewhere where compiled versions aren't available, if, for example, you install a package from GitHub, you're just copying the source files down to get the latest version. Then you have to compile them yourself. And so you may need to install those packages from source. And so hopefully this gives you some clues for how to do that.

Anything else on installing R packages before we talk about reproducible environments?

So if you can go back to your R session, you had a URL to point to the package manager. Is that correct? When I go there, it doesn't seem to work.

This URL?

Yeah. Maybe I'm typing it wrong.

Let's see what happens. Right. So let's talk about what's happening. So Package Manager, right, this is the web interface. So if you saw, I went to p3m.dev and I automatically got redirected to this client thing, right? So this is the interface that gets served if you visit Package Manager in a browser. If I go over to the setup page and say I want the directions for macOS, for example, in the RStudio IDE, then this is the repository URL that I need to configure from my R session, right? Because the HTTP request that I send from R to get packages is not going to look the same as the one that my browser makes. So that's the difference there, right? And I think we'll talk a little bit about some different settings you want to apply depending on how you're trying to get that information. But that's the distinction you're seeing there: the browser view and the view from the R package request are not the same.

Reproducible environments

Any other questions about installing packages before we talk about reproducible environments? Okay. Reproducible environments is my favorite topic, which is why I have this job.

And when I was working on this presentation a couple years ago, I just like, the takeaway I want is you are going to need to reproduce your environment. The work, if you believe the work you do is valuable, then you should believe it's worth being able to reproduce. That is an argument. There's a sermon that comes with it that I will give later in the interest of time. I'm just going to say having a way to reproduce your environment is going to make your life a lot easier if you have to pick up a project that you were working on in the future, which is a thing that happens often. Or if you want just someone else to be able to work on your stuff, just having a way to reproduce your environment is going to make your life easier.

This is a diagram that someone on my team made a while ago to talk about different ways that you can think about reproducing environments. On the x-axis, there's who's responsible. And then on the y-axis, there's how permissive the environment is. We are going to focus on this snapshot use case where you are responsible for reproducing your environment, and we're assuming that the environment is relatively permissive. There are different things you will have to do if you work in a context where the environment is not as permissive. But understanding how to do things well up here makes all of the rest of these easier. And if you're in the red zone, it kind of sucks to be there. So you want to try to stay on the happy path. This is where you're responsible. You don't necessarily have control over how permissive the environment is, but the more responsibility you take for reproducing the environment, the easier time you'll have if you have to operate under some of these more restrictive conditions.

So we're going to talk about two different tools that you can use to construct reproducible environments. One is Posit Package Manager. We'll focus on Public Package Manager. And then the other is the renv package. In a world where you have the ability to access Public Package Manager, there are a lot of things that it makes easy. renv is something that you can use as long as you can install the package. So we'll talk about both of those as strategies for reproducing your environment.

So one of the ways that Package Manager makes it possible for you to reproduce your environment is you'll notice that in our previous example, when I configured the repository that I wanted to get packages from, there's this slash latest. Slash latest tracks, plus or minus a day, usually, the current state of CRAN. So if I set this as my repository URL and I hit install.packages, then I'll get the packages the way they look on CRAN right now, whenever right now is.

One thing you can do that might make it easier to reproduce projects, say you were working on something a long time ago and you don't want to manually figure out what collection of dependencies you can use to bring that back to life, is you can use this date-based snapshot capability. So under the snapshots, do you want to freeze package versions or do you want to install packages from a particular date? So let's say I went back to a year ago today. Right now, if I look at this URL, rather than latest, what this says is /cran/2023-06-07, June 7th, 2023. If I use this URL as my repository URL, then when I request packages from Package Manager, I will get back a package set reflecting the way CRAN looked a year ago. And so on back to, I think, like October 2017.
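As a sketch of the date-based URL, the helper below builds a snapshot repository address from a date. The function name snapshot_repo and the base URL are assumptions for illustration; the setup page on Package Manager gives the exact URL for a given operating system, which can differ from this plain form.

```r
# `snapshot_repo` is a hypothetical helper; the base URL is an assumption.
snapshot_repo <- function(date, base = "https://p3m.dev/cran") {
  date <- as.Date(date)
  # Snapshot URLs replace "latest" with the date in YYYY-MM-DD form.
  paste(base, format(date, "%Y-%m-%d"), sep = "/")
}

snapshot_repo("2023-06-07")

# Point this session at CRAN as it looked on that date.
options(repos = c(CRAN = snapshot_repo("2023-06-07")))
```

Placing the options() call in a project .Rprofile pins every install.packages() call in that project to the same date.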

There used to be another way to do this. If you ever worked with Microsoft R, MRAN and the checkpoint package enabled a similar kind of workflow. Microsoft stopped supporting MRAN last year. We have made some changes on the Package Manager side so that if you were using the checkpoint package and you try to get a date-based snapshot out of Package Manager, it will also respect that. So if date-based snapshots are a way you like to work, then you can use this date-based repository URL to recover previous states of the CRAN repository. This also works for PyPI.

So date-based snapshots. I selected dates. You can see I get the dates in the URL. So what you're going to do is, yeah, Greg, go ahead. Sorry, just out of curiosity. So is the date stamp formatting of snapshots, is that only available in PPM or is that something that's also in renv?

There is an renv function that will make it a little easier for you to work with date-based snapshots. It's a relatively new function. It's called checkout. But what I'm illustrating here, using this date-based snapshot as the repository URL, will work kind of no matter what package installation client you're using. Because from the perspective of a package installation client, you're just supplying a CRAN repository that happens to behave this way.

So what we're going to do is we're going to take a couple minutes. You're going to set a date-based snapshot URL as your repository in your project. And then you're going to install a version of dplyr and post in the chat when you have installed this version of dplyr. What version did you install?

We'll give people a couple minutes to do that.

Okay. So some stuff is happening. One thing that Shannon has helpfully pointed out to me: I am very comfortable YOLO installing things because when I'm actually doing work, I always work in an isolated project environment. If that is not your lifestyle, make sure when this class is over, you reinstall whatever version of dplyr you were using before so that you don't break all of your projects.

So what we're going to do is I was running R 4.4. So I'm going to reset my version of RStudio so that it does that.

So I've set my repository here to point to a date-based snapshot from Package Manager. If I reset and I look at my repos option, you'll see that it's pointing to that version. If I install packages and I request dplyr, then I get dplyr version 1.0.5. Now, some people who are using R 4.3 did not get that result, and something else slightly unusual also happened for me, which I will debug after this call.

But in general, what ought to happen, right, is that you should be able to point yourself back in time, fetch an older version of dplyr than the one that is the latest on CRAN. So if I reset this to latest, I reload, and I install packages. I ask for dplyr again. Now I should get 1.1.4, right? So that's the current version of dplyr on CRAN. And so that's what I'm expecting to get.

Okay. Questions about that workflow, what's happening on the package manager side?

If not, we'll switch to... oh, you can also use packageVersion("dplyr") to check. That was a good call. I just read log messages all day, so why run functions? But yes, if I say packageVersion and I ask for dplyr, it will tell me that this is the version I have. Okay.
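A short sketch of checking an installed version with packageVersion() follows. It uses the stats package, which ships with R, so the example runs even where dplyr is not installed; the comparison threshold is arbitrary.

```r
# packageVersion() reports the installed version of a package.
# stats ships with R, so this works on any installation.
v <- packageVersion("stats")
print(v)

# Versions compare sensibly against strings, which is handy in scripts.
v >= "3.0.0"

# The check suggested in the session would be:
# packageVersion("dplyr")
```

Unlike reading the installation log, this reflects what is actually in the library the session is using.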

Managing dependencies with renv

So managing your dependencies by choosing a date for a repository is one way to do things. Right. In the context of a project-based workflow, one thing you might consider doing instead is working with a library that's isolated from your other projects, right? And that's what the renv package is going to help you do. It's going to give us a way of constructing per-project R package libraries, so that when we make changes in one project, we're not worried about the influence they'll have on our other projects.

So normally, right, and this is to the point of the warning that I will reiterate, normally when you have a user library, in the sort of standard setup, I have project one, project two, project three, and they all depend on the shared library that's available at .libPaths().
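To see the shared library this refers to, you can inspect .libPaths() directly. This is a minimal sketch; the per-directory count is a rough indication only, since it counts every entry in each library directory.

```r
# The directories R searches for installed packages, in order.
lib_dirs <- .libPaths()
print(lib_dirs)

# Rough count of entries in each library directory.
vapply(lib_dirs, function(d) length(list.files(d)), integer(1))
```

Without a project-level library, every project on the machine installs into and loads from these same directories.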

And so what renv is going to do is give me a way to have individual libraries associated with each project, right. And there are a lot of safety advantages to doing this. So I can experiment with new packages, I can install things experimentally, and I'm not worried about breaking the other projects on my machine.

You can communicate what versions of everything you're using to other people on your team, or just yourself if you're working on the same projects a year from now. And then renv also has a caching mechanism, which means that if you've already installed the package, you'll just get that one linked into the library. So you're reusing things instead of downloading each of them fresh each time. So you're getting kind of the best of both worlds: per-project isolation, but intelligent reuse of the things you have on disk.
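A typical renv workflow can be sketched as below. The functions are from the renv package, but the ordering and comments are a general outline rather than the steps shown in this session, and the guard is an addition so nothing runs outside an interactive session with renv installed.

```r
# Only attempt the renv steps interactively and when renv is available.
can_use_renv <- interactive() && requireNamespace("renv", quietly = TRUE)

if (can_use_renv) {
  # Create a project-local library and an renv.lock file for this project.
  renv::init()
}

# Later, in the same project (run these one at a time as needed):
#   renv::status()    # compare the library, the lockfile, and the code
#   renv::snapshot()  # record the package versions currently in use
#   renv::restore()   # reinstall the recorded versions on another machine
```

The lockfile is what gets shared through version control; the library itself stays local, and the cache lets several projects reuse a package version that is already on disk.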
