
Building R packages with devtools and usethis | RStudio
Package building doesn't have to be scary! The tidyverse team has made it easy to get started with RStudio and the devtools/usethis packages. This hour long presentation will walk you through the basics of R package building, and hopefully leave you prepared to go out and build your own package! Slides: https://colorado.rstudio.com/rsc/pkg-building/ Source Code: https://github.com/jthomasmock/pkg-building devtools: https://devtools.r-lib.org/ usethis: https://usethis.r-lib.org/ R Packages book: https://r-pkgs.org/index.html
Transcript
This transcript was generated automatically and may contain errors.
All right. Howdy, everybody. Thanks for joining me today. We're just going to get kicked off here, wait a couple minutes for some folks to roll in, but excited to be talking to you today about building packages in R. My name is Tom Mock. I'm a customer enablement lead at RStudio, and I'll be talking a little bit about everything we're doing today.
And let's see what time it is. We're about 9 a.m. Central Standard Time, where I am in Texas, but it'd be great to hear from the chat in terms of where everyone else is coming from or if you're in the U.S. or the U.K. or Africa or Europe or wherever else you're from. I'll probably wait another minute or so, and then we'll jump into the slides. I'll share that link real quick in terms of the GitHub repository with all of my code and slides from today, and then a link directly to the slides themselves.
Awesome. Looks like some folks from the U.S. We've got New Jersey, Maine, some folks from Canada, Switzerland, Chicago, Mumbai, Brazil. Fantastic. It's very cool to get kind of the international experience here, so thanks for dialing in from wherever you are. As a reminder, this is a live stream, so we'll kind of take things as they come, but it will be hosted up on YouTube, on RStudio's YouTube in the future. So if you do have to step out or if you have a colleague who can't make it today, feel free to send them the link in the future.
Awesome. So cool to see all the different groups, and thanks again for joining me today. We're about two minutes after the hour, so I'm going to go ahead and get started. As much as possible, feel free to post questions in the chat. I'll see as much as possible, try to answer some of those. And then again, the slides themselves are going to be linked to in the slide deck. So let me go ahead and start sharing my screen, and we're going to do a couple things today. I do have RStudio open in the back, so we will show a little bit of some live coding, but we have some slides today that we'll be covering for most of the time.
Again, the GitHub repository has a link to the slides as well as to the R packages book. This is an amazing textbook written by Hadley Wickham and Jenny Bryan from here at RStudio, and they do just a great job of walking through in obviously much greater depth than we can do even in an hour of the process of building an R package. For today, we're going to be kind of walking through end-to-end building functions, building packages, and then hopefully sharing them with others or at the very least kind of using them within your own team or within your own organization.
I also want to give a big shout out to Josiah Parry. He was pretty keen on kicking off this presentation and was kind of the motivation behind it, as well as the idea of spending quite a bit of time talking about functions, because really that's the part where you have to spend a lot of time thinking: what do I want to build? What is my function going to do? And then lastly, these slides are released under CC BY 2.0, meaning feel free to refactor them or deliver them to other people. You can reuse these slides as you see fit.
Why build R packages?
So you're here today. Hopefully you want to build an R package. Packages, at the most basic, are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and potentially some sample data. In other words, a package is a home for functions. And functions in R are a home for source code. So if we really want to start talking about packages, we really want to start talking about functions.
So stating this a bit differently, functions in R are just wrappers around longer source code. So a one-line function call, in terms of like "do this" as your function name, is actually calling quite a bit of R code behind the scenes. And packages are just a way of distributing these functions in a structured or consistent way. So taking a function that you've written one time and making it available on multiple projects or even potentially on multiple computers if you distribute it to your colleagues or to yourself on a different computer.
So in reality, if you want to build a package, you want to build a home for functions. And as far as kind of the motivation for writing these functions and writing packages, it's all about reproducibility or reusability. So reproducibility in code in some ways is actually all about being as lazy as possible in a good way. And we'll talk about that throughout.
So functions are a way to not repeat yourself and be more efficient: you can use a function multiple times and you don't have to type out the same code over and over and over. You just use your function one time or even apply it multiple times with other functions. It allows you to share workflows and empower yourself, in terms of I write packages that are only for me, or you could write functions and packages that empower the rest of your team to do work, to do their core job functions. They also allow you to test your code and trust your work, or trust the work of others who you may not know, even if you can't learn how they write their functions. So testing allows you to make sure that functions work as intended. And that's another part of writing packages.
So ultimately, functions make your work much easier, faster, and more reproducible. And R packages let you share these functions and be lazier in a good way: reproducibility is about being as lazy as possible, using functions and packages so you can recreate that environment as easily as possible.
Anatomy of a function
So I've kind of captured all your attention now. We're going to talk a little bit about functions in terms of you have to build functions to build packages. So we actually need to spend quite a bit of time talking about functions and how to write them and motivations behind them and this idea. Again, shout out to Josiah for saying we need to talk a bit more about functions because the process of building an R package is actually really straightforward. But thinking about how to write a function while it can be straightforward is where people typically get stuck a bit more.
So if we kind of describe or look at the anatomy of a function, a function is made up of a descriptive function name informing the user of the function's specific purpose. Arguments to that function, which control the output. Basically, I can change these parameters or these arguments and I get a novel output. So my function can do multiple things. The body of the function, which is all the source code used internally that actually does all the work but is hidden by the wrapper around the function. And then lastly, what the function returns. So at some point it stops and it returns a value, either visibly or silently, or it just does something as a side effect.
And if we do this with pseudocode or basically just showing some of this in R code, we have a function name, we have the function kind of wrapper, we have the argument, and then we have the body of the function, and lastly the return. So again, a descriptive function name, arguments to the function, the body itself, and then what the function returns.
But let's actually show this in a real function. We'll talk a little bit about that. So our first function that we're going to write will be super simple. We'll take it with one argument, which will be x, we'll use it to take x and square it, and then we will return that value. So we're going to name it something, you know what it does, it squares the value. And it has one argument of x and it takes x and it passes it to a second power so it squares it. So now we've written a function, we can pass numbers into it, and we get, you know, 2 squared is 4, 16 squared is 256. Great, you know, we have a very simple function, does exactly what we expect it to do, and, you know, we're ready to go.
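As a rough sketch of the function just described, the four parts of the anatomy map onto R code like this. The name square_val is an assumption inferred from the spoken transcript; the slides may use a different name.

```r
# Function name (assumed here): says what the function does.
# Argument: x, the value to square.
square_val <- function(x) {
  # Body: the source code hidden behind the wrapper.
  # Return: the last evaluated expression is returned.
  x^2
}

square_val(2)   # 4
square_val(16)  # 256
```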
But what about unintended inputs, in terms of what if someone passes a string to this, or a list, or a plot? Like what about, you know, protecting against these unintended inputs? So if I take a cat, or in this case, like a string saying "cat", and I try and square it, I get this lovely error message, which is: Error in "cat"^2, non-numeric argument to binary operator. Now, in computer science terms, this is describing exactly what it is. In human terms, this is not as helpful as I'd like it to be. And if I take the function I've just written and pass "cat" to it, I get the same error because it's just wrapping that R code. So non-numeric argument to binary operator. Yikes.
So let's try and make that a little bit friendlier. So for version two of our first function, we're going to take our function and just, you know, slap a two on the end of it to say that it's a second version. And now we're going to add a stopifnot(), which is basically saying, if this is not true, stop and return this error message. So now, if the input is not numeric, it will return the error message "input must be numeric", because the input has to be numeric to pass this stopifnot(). So now if I pass our "cat" into square val two, it says input must be numeric. And if I pass numerics into it, nothing extra happens. It just does what I intended it to do. But importantly, we've made this function a little bit more user-friendly. We've really done it with one extra line of code, but it's helped protect us against unintended inputs and made it more user-friendly for ourselves and for people who don't know all the source code that you've written inside of it.
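A minimal sketch of that second version follows. The name square_val2 and the exact wording of the message are assumptions based on the transcript. The named-argument form of stopifnot(), where the name becomes the error message, needs R 4.0.0 or later.

```r
# Second version: validate the input before doing the work.
square_val2 <- function(x) {
  # If the condition is FALSE, stop with the name as the error message.
  stopifnot("Input must be numeric" = is.numeric(x))
  x^2
}

square_val2(4)
# square_val2("cat") now stops with "Input must be numeric"
# instead of "non-numeric argument to binary operator".
```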
More complex function examples
Let's go a little bit further with this though, in terms of let's do a little bit more complex example. So now we're going to create a function that generates fake data, but reproducibly, in terms of we're going to generate random data, but we want to be able to reproduce it every time if we wanted to. So we want to have a seed. So a couple of things going on here. We have a logical check up top. We have two arguments. We have n, or the number of fake data points we want to create, and this with seed argument. If this with seed argument is not NULL, then it will set the seed with whatever number we put in here.
The other part we're going to do is we're going to pad some numbers. So we're going to make all the numbers the same length, and we're going to sample in terms of getting our actual random numbers. We're going to sample from one to n. So one through 10. And then we're going to paste together that padding onto the random integer that we've sampled. And we'll throw this into a data frame or a table and generate that out. So this is still all normal R code inside. We have two arguments and we have this NULL, which the logical check is based off. And when we do this with 125, we can now generate 125 random observations in a data frame. So we could use this for simulating data or whatever else. We've just generated a data frame. Great.
The part about making it reproducible is I can pass it with a specific seed. So now even though I'm generating 10 random numbers, I have a seed attached to it. So I can generate the same random numbers every single time, which is helpful for doing things like reprexes or reproducible examples. That's what I use this type of function for. However, if I don't want it to be reproducible in terms of I really want random numbers, then I can generate it without a seed. And now I have 10 random numbers that are all different and have different string mismatches. So I can get my fully random out of it as well.
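The seeded generator described above might look roughly like the sketch below. The function name gen_fake_data, the argument name with_seed, the "id-" prefix, and the column names are all assumptions for illustration; the function on the slides likely differs in its details.

```r
# Hypothetical reconstruction: fake data that is reproducible on request.
gen_fake_data <- function(n, with_seed = NULL) {
  # Only fix the random number generator when a seed is supplied.
  if (!is.null(with_seed)) set.seed(with_seed)

  # Sample n integers from 1 to n.
  rand_int <- sample(seq_len(n), size = n, replace = TRUE)

  # Zero-pad so every number has the same width, then build an ID string.
  width <- nchar(as.integer(n))
  padded <- formatC(rand_int, width = width, flag = "0")

  data.frame(
    id = paste0("id-", padded),
    value = rand_int,
    stringsAsFactors = FALSE
  )
}

gen_fake_data(10, with_seed = 125)  # same rows on every call
gen_fake_data(10)                   # different rows on each call
```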
So maybe that's not as motivating to you. I'm just trying to show you that you've got a couple different kinds of arguments, in terms of you can pass things like n, you can pass things like NULLs or logicals, and then you can do something called passing the dots, or passing multiple params. So in this case, the dots allow us to accept basically as many arguments as we want and pass those into something else. So we can keep everything else the same. But now for our randomization, it's going to use rnorm(), or the normal distribution of random data. And we're going to pass the dots from the function arguments into this call. So that means any arguments that rnorm() accepts, we can put in here in our function, and they will be passed through.
So if you're familiar with rnorm(), it takes arguments of mean and standard deviation. So even though I haven't defined those in my function previously, like we don't see, you know, mean or standard deviation up here, because these dots have been passed, I can pass these arbitrary arguments into the internal function. So very powerful in terms of passing arguments back and forth between functions. And now rather than integers, I have, you know, doubles or numeric values that can actually have decimal places. And they all kind of range around an average of 10 with a standard deviation of two.
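As a small sketch of passing the dots, the hypothetical function below forwards any extra arguments straight to rnorm(). The names gen_fake_norm and with_seed and the column names are assumptions; mean and sd are real rnorm() arguments.

```r
# Anything supplied after with_seed is captured by ... and handed to rnorm().
gen_fake_norm <- function(n, with_seed = NULL, ...) {
  if (!is.null(with_seed)) set.seed(with_seed)

  data.frame(
    row = seq_len(n),
    value = rnorm(n, ...)  # mean and sd arrive here via the dots
  )
}

# mean and sd are never named in the function definition,
# yet they reach rnorm() through the dots.
gen_fake_norm(5, with_seed = 37, mean = 10, sd = 2)
```

Calling it without mean or sd falls back to the rnorm() defaults of 0 and 1.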
So the whole point of this example is that this is a function I use all the time in my own package, gtExtras, specifically when a user comes to me and says, like, hey, I have an issue, I need to reproduce it. And they have a very specific data set that they're using. So they have, you know, a data set with 400,000 as the average. I can't use mtcars for that, because their values are so big. So now I can generate an exact set of values with a specific mean or standard deviation, a specific number of groups, and a specific length. So I can generate very specific random data sets reproducibly to help solve my own problems or work with users to solve their problems in the package.
Passing the dots (…)
There was a question about dot dot dot. So let's go back here. Dot dot dot is basically called passing the params or passing the dots. So this means anything that you pass right here at the end of the function will be captured and evaluated in this context. So when I type mean and standard deviation here, so mean equals 10, standard deviation equals 2, that's actually evaluated right here. So the dots get replaced with those arguments, because I haven't defined them previously.
All right, so enough about kind of generating random data. And we can, you know, use it for writing functions, because we don't want to repeat ourselves. So for this example, absolutely, you could use regex or you could use, you know, stringr, but let's not for now. Let's say that you have a bunch of strings that you need to extract the last few values from. So we could use the base R function, substring, take our string and get the 9th through the 11th value. So this will get us the last three digits, because this whole string is 11 characters long, we can get the 9th, 10th and 11th value of 147. However, if we have a longer value, then we say like Wales national 148, this is not 11 units long, it's actually like 16 or 17 units long. So if we use the exact same arguments, then it gives us not what we want, we really want to get these last three values for the ID, not the text itself.
So we could use this, you know, existing function, and write it, you know, 15 times for all the different values, but then we're repeating ourself, we want to have something that we don't have to repeat ourself. And we can do it the same way each time. So we're not having to reuse a bunch of code.
So maybe we can try and do negatives. So we can say like, okay, from the end, take from negative three to negative one. Well, unfortunately, this function doesn't take negative indices. So you can't actually run it that way. But we can write our own function that does operate in that way. So let's do that. So substring right is our next function. It has two arguments, x and n. And now it's going to take the total length of the string with the number of characters: it's going to take x, which is the first argument, and say, how long is the string? It's going to subtract n from this total length and add one. And then it's going to count across these and basically take from the right side, you know, however many characters you pass in. So we'll say the last three through to the end is what we want to take. So now with this Wales national 148, we can actually pass, you know, three to this, and it will go one, two, three over and take just the last three characters, because it's counting from the right, as opposed to counting from the left like substring is doing. And we can do it on a shorter string, so RStudio 147. And it will take those last three characters as well. So 147.
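A minimal sketch of that helper, using base R's substr() and nchar(). The name substr_right is an assumption, since the transcript only gives it as spoken, and the example strings are taken from the transcript as heard.

```r
# Counting from the left depends on the string length:
substr("RStudio 147", 9, 11)  # "147"

# Hypothetical helper: take the last n characters, counting from the right.
substr_right <- function(x, n) {
  total <- nchar(x)
  substr(x, total - n + 1, total)
}

substr_right("RStudio 147", 3)         # "147"
substr_right("Wales national 148", 3)  # "148"
```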
Now, where we'd actually use something like this, rather than on random strings, is you might imagine you have a very long data frame, and you're trying to extract IDs or something from it. So all of these strings have, you know, random lengths, but we always want to get the last three characters, because that's the ID we're trying to get. So we could take this, and maybe we want to compare the sales against the ID. We could, you know, write custom things 10 different times, or we can use our substring right. So with substring right, in a mutate call, we can just take our company ID, take the last three characters, and now we have our ID in its own column. So again, calling the function one time, having it vectorized across all the different values, and getting just the things we want, rather than having to manually write some type of thing over and over and over. So that's kind of the power of functions.
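To illustrate the vectorized use inside mutate(), here is a small sketch with made-up data. The data frame, its column names, the "Acme 149" row, and the helper name substr_right are all hypothetical.

```r
library(dplyr)

# Hypothetical helper: last n characters of each string.
substr_right <- function(x, n) {
  total <- nchar(x)
  substr(x, total - n + 1, total)
}

# Made-up sales data with IDs at the end of strings of varying length.
sales <- data.frame(
  company_id = c("RStudio 147", "Wales national 148", "Acme 149"),
  sales = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# One call handles every row, because substr() and nchar() are vectorized.
sales_with_id <- sales %>%
  mutate(id = substr_right(company_id, 3))

sales_with_id
```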
Tidy evaluation
Now, we're going to skip across that, we've covered a little bit about functions, the next part is like a lot of people love the tidyverse. A lot of people love things like dplyr, or gt, or ggplot. And what this provides is like a user friendly framework for writing, you know, R code, it's very easy to use. You have to learn one more thing sometimes, though, to write functions or wrappers around these, called tidy eval, or tidy evaluation.
So you might be familiar with dplyr, which is a way of manipulating data in R. Most dplyr verbs use tidy eval in some way. And tidy eval is a special type of non-standard evaluation used throughout the tidyverse. So tidy evaluation is kind of a fancy word for saying, you know, mtcars, group by cylinder, summarize n is equal to the count, mean is equal to the mean of miles per gallon. You'll notice that cylinder and miles per gallon are not quoted, there are no quotation marks around them. So they're bare names. So with tidy evaluation, dplyr knows to, you know, find these things inside the mtcars data frame and reference them in that way.
So while this is super user friendly, you have to supply arguments in the same way. So if you want to write wrappers around dplyr or other things with tidy eval, you can do it with just, in my opinion, two new concepts for the vast majority of things. So you can embrace your variable with the two curly braces here. So you take your var, in this case, like cylinder, and in your function, you just put the curly-curly, or the embracing operator, around it. And then dplyr knows to pass this along and evaluate it in the context of mtcars. Or you can pass the dots, which allows you to pass any arguments into the other location. So this can be used for many arguments when you don't really want to capture all of them by name at evaluation time. And you can always revert back to doing, you know, .data with double square brackets around var if you wanted to use strings instead of bare column names.
So let's show this real quickly with some functions and then we'll go into packages. So we'll load dplyr. And here we're going to do a car summary function. And all this function is doing is allowing you to group by a specific variable and then get the average and the count. So take mtcars, group by, and we're using our embracing operator with these double curly braces around var. And then it has summarize. And this part can be done with normal syntax. So when we call car summary on vs, it will group by whether it's a V-shaped engine, and that's zero or one, and it gives us the average and the count. Or we can try and do this with two arguments, in terms of we can try and do it with vs and am, whether it's an automatic or manual transmission. But here, because the function only has var, you know, the ability to take one argument, it fails. It says, well, I don't know what to do with am because there's no argument for it. There's not a second var up here, it's just one.
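A sketch of the embracing pattern just described, using the built-in mtcars data. The function name car_summary follows the spoken name; the output column names mean and n are assumptions.

```r
library(dplyr)

# {{ var }} lets the caller supply a bare column name.
car_summary <- function(var) {
  mtcars %>%
    group_by({{ var }}) %>%
    summarize(mean = mean(mpg), n = n())
}

# vs is 0 or 1, so this yields two rows.
car_summary(vs)

# A second bare column fails, because the function only has one argument:
# car_summary(vs, am)
# Error: unused argument (am)
```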
So in this case, we need to pass multiple things through. So we can use the dots here instead. In this case, car summary dots is a function that takes, you know, whatever arguments you want to pass. So the dots are passed forward into group by. And then any arguments you supply will be evaluated in this context. So now we can take our car summary dots and use as many arguments as we want. We can use vs, am, cylinder, and it will group by all three of those and give us the average and the count of each of these little subgroups. So concept one, the embracing operator allows you to take a specific variable and evaluate it within tidy eval. You just have to put these two curly braces around the actual function argument name. And the dots allow you to pass any number of columns through to, you know, the grouping function here.
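The dots version might look like the sketch below. The name car_summary_dots follows the spoken name; the .groups = "drop" argument is my addition to return an ungrouped result and is not something the talk mentions.

```r
library(dplyr)

# Every argument supplied by the caller is forwarded to group_by().
car_summary_dots <- function(...) {
  mtcars %>%
    group_by(...) %>%
    summarize(mean = mean(mpg), n = n(), .groups = "drop")
}

# Group by one, two, or three columns with the same function.
car_summary_dots(vs)
car_summary_dots(vs, am, cyl)
```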
And you can combine these ideas. So maybe you want to, you know, group by a specific variable and then pass many different things into your summarize. So here we can group by var. We've got double, you know, curly braces around it for the embracing operator. And then the dot dot dot allows us to pass new arguments to summarize. So now we can group by cylinder and then we can define novel new things. So we can say, well, I want to get the average of the horsepower and the standard deviation of the horsepower as well. And because you're passing the dots, the users can do this. You're grouping by cylinder or whatever other column they want to group by, but then they can define their own summarization functions within summarize. So a lot of different power here in combining things, whether it's single arguments, specific arguments, passing the dots, and for tidy eval, passing the dots or the embracing operator.
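Combining the two ideas could be sketched as follows. The function name car_summary_combo and the summary column names are hypothetical.

```r
library(dplyr)

# Embrace the grouping column; forward everything else to summarize().
car_summary_combo <- function(var, ...) {
  mtcars %>%
    group_by({{ var }}) %>%
    summarize(...)
}

# The caller decides which summaries to compute.
car_summary_combo(cyl, hp_mean = mean(hp), hp_sd = sd(hp))
```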
So let's do this one more time with novel data. I'm sure that very few people want to only be using mtcars. So in this case, to kind of adhere to the tidyverse style guide, we're going to try and put data as our very first argument. So this means we can pipe directly into it. So I take the mtcars data set and then group by cylinder and then summarize on mpg. So now rather than having our summarize function do the grouping and do all the rest of it, our summarizing function is just a wrapper around essentially summarize. And now our summaries are the min of the var and the max of the var, and we're dropping our groups. So we take this, we can obviously group by cylinder in mtcars and then get our summarization. But we can just as easily use a new data frame, like the ToothGrowth data set that's also built into R, which has vitamin C or orange juice given to guinea pigs and how long their teeth grow. Super interesting data set, but just showing you that you can take whatever different data you want, and your function doesn't have to capture the data inside of it. You can supply data as an argument.
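A data-first wrapper along those lines might look like this sketch. The name summary_range and the output column names are assumptions.

```r
library(dplyr)

# Data comes first, so the function can sit at the end of a pipe.
summary_range <- function(data, var) {
  data %>%
    summarize(
      min = min({{ var }}),
      max = max({{ var }}),
      .groups = "drop"  # return an ungrouped result
    )
}

# The same wrapper works on any data frame with a suitable column.
mtcars %>% group_by(cyl) %>% summary_range(mpg)
ToothGrowth %>% group_by(supp) %>% summary_range(len)
```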
So lastly, you know, if you did want to work with strings as opposed to bare column names, then you could do that. And you can use something like the .data pronoun with double square brackets. And now we're taking data and our var, and rather than leaving it as miles per gallon without quotes, we're quoting it and it's being passed into kind of this more base R style of subsetting. So that's possible, but you know, I really like this idea of being able to list things as their own, you know, unquoted names and then passing them into multiple tidy evaluated functions.
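For the string-based alternative, a minimal sketch using the .data pronoun. The function name car_summary_str and the output column names are hypothetical.

```r
library(dplyr)

# var is a character string here, so it is looked up with .data[[var]].
car_summary_str <- function(data, var) {
  data %>%
    summarize(mean = mean(.data[[var]]), n = n())
}

# The column name is quoted this time.
car_summary_str(mtcars, "mpg")
```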
That's something I do a lot with gt. Speaking of gt, the reason why I care a lot about tidy eval is that it allows me to build very specific functions for tables. So a lot of different things are going on with this function. This is actually from gtExtras, gt add divider. All the different concepts we've covered are captured in it. We have our stopifnot saying, if this is not found, then give me a nice error message. We're passing the dots. We have, you know, specific arguments here. And all we have to do in terms of passing columns, because it's tidy evaluated, is put our embracing operator around columns in the different locations it's being used. So now I can take my data set, pass it into gt, and that creates a nice table, fantastic. And I can then pipe that into gt add divider. And without having to wrap cylinder in anything else or treat it like another data set, I can use tidy evaluation to not have to do my quotes, and do all the other fancy things I can do with it. And this allows me to add a specific divider at the cylinder column. But because I've also passed the dots, I can supply novel arguments and make it, you know, red or heavier or lighter or whatever I want to do. And I can also pass multiple columns. So I can, you know, combine cylinder and miles per gallon. And now I have a divider on miles per gallon and cylinder, along with all these other transformations I've done. So a lot of different power here with tidy eval, because it works across many different functions. And specifically with things like dplyr, it allows you to translate things from R to, say, a database like SQL or Spark. So by using tidy evaluation, you not only get to not have to quote things, but you can pass them into other contexts, which is extremely important.
Resources and motivation for packages
So we've talked a lot about functions, and you might be thinking, well, I want to learn about packages. But the idea here is that to build packages, you have to write functions. So I think it's really important to talk about a bunch of different ways of doing that. And if you want to learn a little bit more about general functions, these are some great resources about just writing functions and how they work. If you want to think more about how do I name or style or organize my code, there's the tidyverse style guide. And if you want to learn all there is to know about tidy evaluation, and whether you even really need it, you can look into these different things. So the tidy evaluation book, things about the R language and how it works, programming with dplyr or ggplot2, lots of different use cases there that you can explore.
We've talked a lot about building functions. So let's jump into building packages and why you might actually write these functions. So for more realistic internal functions you might use, like I mentioned, you can use dplyr to connect to databases. So maybe you use SQL a lot at work. You can actually use dplyr and wrappers around dplyr to write functions that query against a production database and pull data in. Or maybe you have data that you get every week, and you have to clean it the same way every week. You could write a function to do that cleaning step and make it one line of code as opposed to copying and pasting 30 lines of code around. Maybe you want to create a theme for your plot or your table or use specific color themes or logos or even specific data you're bringing in. Write a function to generate an entire report. Write reusable Shiny functions that you can use across the different applications you're writing. Or scaffolding or wrappers around common machine learning tasks. So if you're always writing training, testing, use this function, you could write a wrapper around all of that. Basically, anything you do repeatedly and want to repeat or test without having to redo it manually, that's an argument for writing a function and eventually taking that function into an R package.
As Hilary Parker says in her introduction of packages from a while back, seriously, it doesn't have to be about sharing your code or getting your code on CRAN, although that's an amazing added benefit. It's about saving yourself time.
If you can write a function, you're being lazier in a good way. Rather than having to worry about copy pasting this code all around, you now have a function with a very specific purpose, with very specific arguments, that you can reuse over and over and over.
I also want to give a shout out to Emily Riederer. She wrote a great blog post about building a team of internal R packages. So if you're thinking about how do I use this at work or how do I go even deeper, she's got a lot of amazing ideas written here about how you can actually integrate this into an enterprise environment and build out a team of internal R packages that all work together. So let's jump back to package development. We'll use the remaining time to talk specifically about packages now that we've covered a good chunk about R functions themselves.
Why packages instead of sourcing
So argument one, and it's something I hear a lot of time, you know, why do I write a package? Why can't I just like source code, you know, just use the base R source function to pull code from somewhere? You know, I'm just, I have it on my drive or I have it on, you know, GitHub, and I just read the source code in. So source references a very specific dot R file, reads it all in and executes it. And you could use this to add a function to your environment.
But sourcing doesn't know anything about the versioning of the code in terms of you're not using a specific package version, you're literally just reading in text and then evaluating it in R. It doesn't have, you know, included testing or structure around it. It doesn't have included documentation that both you can use and other people who use the code can use. And it requires the R file to be copied into every project that needs it. So if you use it in one project and move to another one, you have to read the file back into that. And what if you change it upstream? So if you need to make those changes, those changes need to be made in every project that needs it. And it could be modified or deleted accidentally by the end user or collaborator. So maybe your collaborator says, oh, okay, well, I'll just change this one part. It'll make it easier for me. You go back to use it. You read in the dot R file and nothing's working. So overall, it's just, you know, sourcing is fine in terms of like if I'm just sourcing a dot R file so that I'm, you know, being more compact with my code. But packages basically solve all those different problems for you and provide all these different things that you need.
Anatomy of a package
So we were thinking about the anatomy of a function; let's talk about the anatomy of a package now. The metadata is basically the DESCRIPTION file: the name of the package, a description of the package's purpose, the version of the package and any package dependencies. So again, this structure allows you to install it and share it with others and have them be able to essentially recreate that same environment that you want to be evaluating things in. It has source code via .R files that live in the R directory. It has special roxygen comments inside the .R files that describe how the function operates as well as its arguments, dependencies and other metadata. It has the NAMESPACE, basically saying here are the functions that you're surfacing in your package and here are the imported functions you bring in from other packages. And then it also can have things like tests that confirm your function works as intended. So when you change things in your function or if you add something new, you kind of confirm to yourself and others, hey, I haven't broken the whole rest of my package or the whole rest of my code.
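To make the roxygen comments concrete, here is a sketch of what one documented function in a package's R directory might look like. The file name R/square_val.R and the function itself are hypothetical; @param, @return, @examples, and @export are standard roxygen2 tags, and @export is what asks roxygen2 to list the function in NAMESPACE.

```r
# Hypothetical file: R/square_val.R

#' Square a numeric value
#'
#' @param x A numeric vector.
#' @return A numeric vector the same length as `x`.
#' @examples
#' square_val(4)
#' @export
square_val <- function(x) {
  stopifnot("Input must be numeric" = is.numeric(x))
  x^2
}
```

The lines starting with #' are ordinary comments as far as R is concerned, so the file still runs as plain R code; they only become help pages and NAMESPACE entries when the package documentation is generated.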
So this is kind of the minimal anatomy of a package. You could go further, you could have data, you could have vignettes, you could have examples, all sorts of other things. But we're going to start with kind of that MVP or that minimal viable package for what we're going to do today. So all these other things in terms of installed files, compiled code from C++ or JavaScript or whatever else, we're not going to worry as much about that today and focus more on this kind of minimal parts of a package.
devtools and usethis
So while writing packages and you're thinking, well, there's four things I have to learn, the tidyverse team has spent years crafting metapackages to make their life and your life easier to create other packages. So these packages are used every day by thousands of package developers and really do make things easier with functions because you're writing packages and writing components of packages with these functions with very specific purposes.
So the two packages that we'll talk about today, devtools, the purpose of this is to make package development easier by providing R functions that simplify and expedite common package building tasks and usethis, which is a workflow package. Again, automating repetitive tasks that arise during project setup and development, both for R packages and non-package projects. So these are sub-linked in the slides. So if you go to this link for the packages and then you can go to each of them and kind of explore them deeper, but we'll talk about them a little bit today.
Live demo: building a package
So we've talked a lot about slides. We've talked a lot about packages, talked a lot about functions. Let's build a demo package real quickly in five minutes. So let's go to RStudio. This is where I was building the slides for today. So what we're going to do is we're going to create a new project. So within RStudio, I can create a package in this way. So I can go new project.
And now I'm going to say I'm going to create a new working directory. I want to create an R package and I'll give it a fun name. So fun name is what we'll call our package. And it's going to do all the cool things we want to do. So I'm going to create this project. It's going to switch over to this new project that we want to work in. So rather than package building, we now have our new project of fun name. Because we create it as a project, it's already pre-populated with a lot of the different things we need. It even gives us a nice friendly little hello world function saying like here's how to write a basic function and some useful shortcuts in RStudio or wherever else to install, check, or test these packages. So I'm going to delete my namespace because I want to build that from scratch. And I'm actually going to delete this hello world example because we're going to build our own function real quick.
So let's start off. We're going to clear this. We're going to do usethis, use R. And this is basically saying create an R function with a specific name. So let's do square val. When I do this, it will create this dot R file for square val. And then I can do fun. If I type just the shortcut here in RStudio, it's going to give me the snippet which is built in. And I can say square val x x squared. Now I have my function, you know, square val. So square val two. And that's four. Square val 16. 256. Great. But rather than, you know, reading it in that way, we can actually use devtools load all. And this will load the entire package and potentially dependencies within it. So I can do the same thing of square val two, whatever. It's been loaded but in the context of a package as opposed to me just, you know, reading in and saving something, a function as an R object in my local environment.
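Written out, the function from this part of the demo looks roughly like the sketch below. The spelling square_val is an assumption, since the transcript only records it as "square val". The usethis and devtools calls are left as comments because they only make sense when run from inside the package project.

```r
# In the package project, this creates and opens R/square_val.R:
# usethis::use_r("square_val")

# Contents of R/square_val.R
square_val <- function(x) {
  x^2
}

# Quick interactive checks in the console
square_val(2)   # 4
square_val(16)  # 256

# Preferred inside a package project: load every function in R/ at once
# devtools::load_all()
```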
Now another part in terms of documentation, and I'm just going to kind of abbreviate this in saying that I'm going quickly and we're going to go through all these steps in the slides. Just trying to show you how quickly you kind of move through this process. So I can use command shift P. And I can say I want to insert a Roxygen comment. So insert a Roxygen comment. And this will give you this scaffolding for all the different things. So, you know, my function is square a value. A numeric value to be squared. It returns a number.
And for example, square val two. Perfect. So now when I actually am ready to go, I can say devtools install. And it's going to install my package. And it's going to install everything. It's going to take a second because I'm streaming. Everything is going. So let's enter that.
And because I'm live streaming, I've got to do one step. So let's restart everything. And devtools document. devtools document is basically going to say, you know, we've done something with this R function. Update the documentation. So it's going to rewrite some of the documentation. Now let's do devtools install again. And it's going to take my package and install it locally for me. So now in the context of, you know, this environment that I've done, I can load fun name. And I have my package loaded. And I can do something like tell me about square val with a question mark. And just like any other R package that you install from GitHub or CRAN or anything else, I've now gone from no package to creating a package to creating a function to installing my package. And it has documentation about it. So in this short span, I've created something that I can use in any other project. And I'm ready to go. I have my package.
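The document-then-install loop from the demo can be sketched as a small helper. The name rebuild_package is hypothetical and not part of devtools. It just wraps the two calls used above and assumes devtools is installed and that it is run from the package's root directory.

```r
# Hypothetical convenience wrapper around the two steps shown in the demo.
rebuild_package <- function(pkg = ".") {
  devtools::document(pkg)  # regenerate help files and NAMESPACE from Roxygen comments
  devtools::install(pkg)   # install the package into the local R library
}

# Typical use, from the package root:
# rebuild_package()
# library(funname)   # package name assumed from the demo
# ?square_val        # opens the generated help page
```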
Obviously, there's more we can do with this. We probably want more than one function. We want, you know, better documentation. We want to share with others. But I just want to very quickly show you that, like, in the span of a few minutes, you can create your function, create your package, and have it working for yourself locally.
Walking through the steps
So now that we've hopefully motivated you to take that next step, let's talk a little bit about what we just did and talk about the individual components so you feel comfortable building on it. So we created a blank package. You can either use usethis create package, or from within RStudio, I just clicked new project, new directory, R package, and then gave it a very specific name. And then that actually created my package in my environment that I could work with and then install. So you can kind of get started from there. Either way is fine.
So now that we have kind of a new project that is set up as a package, we're going to go through the whole game, essentially creating a package end to end, creating our function, creating the package, the documentation and some basic testing. So for the whole game, we're going to use those two packages I was talking about devtools and usethis, those are available on CRAN, you can install them with install dot packages. And again, these just make some of the process a lot easier. So rather than having to manually learn all the different pieces, you can just use these functions to build out a lot of the boilerplate or reuse components.
I do want to take a break here in terms of saying like part of the way that you can share packages is through version control. And you know, never going to try and shame anyone or throw shade at anyone. But this is kind of the best way to share packages. And you probably should be using version control for your R packages. So you can, you know, step back and forth and collaborate. And when you're working on, you know, production packages and production code, you can check them into version control and understand like how changes were made over time, when they were changed, you know, documentation about the changes and collaborating on their development. At RStudio, we, you know, typically default to using Git, and a lot of us use GitHub. But other folks use things like SVN, Bitbucket, GitLab, whatever you're using is great. Version control is fantastic, and any way you can get started with version control is great.
A lot of the functions in usethis are closely tied to GitHub, specifically because that's what the tidyverse team uses. So they're writing things that they're familiar with and things that they're using. So in that package we are using, we can actually do usethis use Git. And this will add Git into that component and we can start tracking our changes over time. You can also do something like usethis use GitHub to, you know, start referencing it on a remote repository. So Git initially is local. This version control remotely is something like GitHub, which is based off the cloud or GitLab off the cloud or Bitbucket, which is an enterprise like on premise version of Git. Just in general, version control will allow you to take a package and share it with the world via something like GitHub or GitLab, where other people can install it. To read a lot more about version control and specifically how to use Git with R, please see Happy Git and GitHub for the useR by Jenny Bryan, another great resource that can help you get there.
All right. So again, the kind of the first step I did in my new project was usethis use R. And this is me basically saying, hey, I want to create a minimal R function and then open it for interactive editing. So I'm going to do squareval.R and this will create a function. It will open it up with basically me ready to go. And that was the first thing I did was usethis use R. At this point, you could copy over some code you're using or you could write it all inside that environment. It's just another .R file. You can do whatever you want in there and interactively code, interactively edit. Just you don't have to create a manual file. You can just usethis use R and get started really quickly.
The next step I did, once I wrote a function that I was like, I think this is working, let's test it, you know, I could highlight that code and load it into my environment. Or the better option is devtools load all. Because we're working in a package environment, you may actually write functions that are depending on other functions you've written. So in this way, you can load all the functions together in the context of the R package and it's not going to load them as specific objects in the global workspace. It's kind of doing something behind the scenes. But this devtools load all loads all the different functions you have in your R package all at once. So you're kind of ready to go. This is really helpful when your R package suddenly has like 10 or 15 or 100 functions in it and you want to load all of them and interact with them. Again, per version control. Once we've confirmed the minimal function is working, we should probably commit our changes. And you can do that via the built-in Git pane or via the terminal for GitHub.
So if I'm inside my R function for my R package, again, I can usethis, use Git, and it's going to ask me if I want to commit them. Yes, we've done initial commit. And then in RStudio, I have this Git pane at the very top for version control. And this will allow me to kind of interact with things like, okay, well, I've changed some parts and I can do commit messages and pulls and pushes and basic changes there. So you can use that or if you're comfortable using it from the terminal, you can do all sorts of Git commands from the terminal. There's really no difference in terms of the end result.
Checking the package
All right. So we've created a function. We've loaded it and used it. We've checked it into Git or version control. But, you know, we want to make sure it's working. So how do we go about checking the function? You know, again, we have evidence that the function works because we've used it and we interactively loaded it and used it. But how can we be sure that all the different components of the package still work? It might seem silly to check it after you've only done one function, but it's good to establish the habit of checking this very often. Because you want small moments of friction as opposed to you've written all these different things and you check it or test it and you're just overwhelmed with the output. So we can use devtools check. And this will load the package, check the package, and apply all these known best practices.
So if you go to our R function, devtools, check. So this is going to load it. It's going to do a bunch of different things. And you'll see a lot of things passing through. It's meant to be used interactively in terms of it's telling you all the different components and you can read through it as you want. And it's going to tell you warnings, failures, errors, and all sorts of other things. And it's just checking all this massive stuff for you so you don't have to manually go through and check all these components. You can actually look at the different things as noted. So it's got a couple different errors. So it's looking for the hello function that I deleted. So we're going to have to remove that. It's giving me a warning saying that, hey, there's no license on here. And it's saying that the hello function is documented but not in the code. So we need to fix those things in terms of it's telling us issues about the package without us manually having to check it with our own kind of eyes. We can look through here and get the actual output.
So the output of this function is really verbose. It's doing a lot of different things. And for the vast majority of the time, they all go yes. If you set up a project in the way that I've shown you and kind of with RStudio, it's ready to go and it's got all the different components. However, there's still mistakes you can make. You saw I had some errors and warnings and other things. So it's really good to run this frequently so I can make changes and fix things in the moment rather than having to do it a lot later and having this overwhelming amount of changes to make.
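Because the check output is so verbose, it can help to boil it down to counts. The sketch below assumes that devtools check returns an object with errors, warnings, and notes components, as its documentation describes. The helper name summarise_check is hypothetical and not part of devtools.

```r
# Hypothetical helper: run the check and count each kind of problem.
# Assumes devtools is installed and `pkg` points at a package directory.
summarise_check <- function(pkg = ".") {
  res <- devtools::check(pkg)
  c(
    errors   = length(res$errors),
    warnings = length(res$warnings),
    notes    = length(res$notes)
  )
}

# Typical use, from the package root:
# summarise_check()
```

Running something like this after every small change keeps the list of problems short enough to fix in the moment.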
Adding a license
Now, the one specific thing that check showed us: this function works fine, but we haven't added a license, which check will throw as a warning. And because I'm using open source code and anyone can see it, I need to provide a license basically telling people how they're allowed to use it. What are they allowed to do with it? Are they allowed to copy the code and use it in their own package? How do they reference it? It might seem silly to add a license if you're not showing it to other people. But even within your own org, you should at least define the license so that your wishes are respected and it's clear as to reuse and ownership of the source code.
The software licensing is really complex and luckily we don't have to like redefine things. There's common patterns. I typically default to use MIT license but other people use other licensing. There's a few different resources here for talking about why you should choose a license, how to choose a license, what they actually do. And I highly suggest if you're going to use a different license, check out what they mean at these different resources.
Documentation with Roxygen
Now, another step you saw me do was I wrote my function, I added the Roxygen comments which were those little indicators or the documentation at the top. Wouldn't it be nice if you documented how your function works and you were able to get help like you were with other functions? So if you did question mark on your functions, they actually told you what they were going to do.
So this requires that you have documentation. But again, rather than having to write all the R specific .Rd files that are kind of like LaTeX, you can just use Roxygen to generate those automatically. So in RStudio, you can go to the Code menu and do insert Roxygen skeleton. Or what I showed was with the command palette on RStudio 1.4 or later, you can just do command shift P and say, you know, insert Roxygen skeleton and it will show you the different parts.
So let's go here. If I go to code, there's a bunch of different things going on. And I can insert Roxygen skeleton. As long as I'm within a function, in terms of within these arguments here, when I go to code, insert Roxygen skeleton, it will give me these different comments that I filled in above. Or I can do command shift P to open this command palette. I can say insert Roxygen comment. And as I filter this down, it allows me to do the same thing. Insert Roxygen comment. And if you want to memorize it, there's this longer shortcut. I like using command palette because I can just type in what I want to do. And I don't have to memorize, you know, 15 or 100 different shortcuts. I just use one and then raw text.
Now, as far as what documenting is doing, is you're basically letting your code breathe on its own. You're self describing the code with Roxygen. And specifically roxygen2, like the modern version of this. So, the premise of roxygen2 is simple. Describe your functions in comments next to their definitions. And Roxygen will process the source code and the comments and generate those .Rd files for you. As well as update the namespace and potentially update the description. So, again, rather than you having to learn these 10 new things, you just learn how to write Roxygen comments. And that generates the downstream documentation that an R package needs to operate. If you want to go really, really deep on Roxygen, there's a nice intro to Roxygen you can go through. But the basic ideas we'll talk about right now.
So, Roxygen items are basically special comments. In terms of a normal comment is just a pound sign or a hashtag. And a Roxygen comment is a pound sign with a single quote behind it. And then you give it specific things that you're changing. So, you do like an at param. This is the parameters or the arguments for your package. So, I say a parameter, an argument, and then describe that argument. So, for square val two, it was a numeric input that will be squared.
Now, that basic idea, you can kind of repeat it over and over. And you can change out like at param, argument, and all the different at things you can change with Roxygen. It can seem overwhelming in terms of like, oh, I still have to know all these different things. There's really only a few things that are necessary to, again, build that minimum viable package. You're going to give the title of the function with at title. You can give a description of the function purpose with at description. You can document the function arguments with at param. And you can specify for export with at export. And if it requires other packages, you can either globally import all of those or import specific functions from other packages. And then tell R what the function returns with at return.
Again, when I inserted this basic Roxygen skeleton, it basically gave me the minimal viable components. So, I didn't have to type out param return export examples. I just described them. You know, it already said, hey, this parameter is x. So, I just describe what x is. And it returns something. So, I say it returns a number. And yes, I do want to export it so when people load my package, this function is loaded. And for the examples, I'll just, you know, use the most basic one, square val two. But I could also do square val 16. And however many examples you want to put in there, it will execute those. So, while it's not required to add those examples, those can be really helpful. You know, if I call, you know, question mark on rnorm, for example, it tells me a description, the title, and then it has some usage. But then here in the examples, it also gives me some common ways people use it. You know, like maybe you generate random data and then you plot it or you create a curve in base R. Or if you want to do, like, error functions. So, it basically gives you some context about how people use it as opposed to just documenting what the function does. So, those examples are really helpful.
So, let's, you know, show a full blown example in terms of a title, a description, parameter, return, export examples. All the different minimal kind of common items. So, again, these are special comments. They have the pound sign and then a quote. So, for the title, take a numeric value and square it. For the description, this function takes a numeric value and squares it. It's intended to be used as a replacement for value squared. For the one argument it has, so param argument, a numeric input that will be squared. It returns a numeric value. Yes, I want to export it so when people load my package, this function is loaded. And for the examples, square val four, which returns 16. It will actually show in the documentation because it will evaluate that function. So, while as you build more complex packages or more complex examples, you might expand upon this, in short, you're able to add kind of a ready to go function that's pretty well described just by adding these specific components. And then when you devtools document this and install the package, you can do, like, question mark square val and it will give you all of this in the pretty way that R displays it in the help panel.
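Put together, the documented function described above would look something like this sketch. The wording of each field follows the talk. The function name square_val and the argument name x are assumptions about spelling.

```r
#' @title Take a numeric value and square it
#'
#' @description This function takes a numeric value and squares it.
#'   It is intended to be used as a replacement for `value^2`.
#'
#' @param x A numeric input that will be squared.
#'
#' @return A numeric value.
#' @export
#'
#' @examples
#' square_val(4)
square_val <- function(x) {
  x^2
}
```

Roxygen comments are ordinary R comments, so the file still runs as plain R. They only turn into help pages once devtools document processes them.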
So, now that we've added these Roxygen comments, we basically need to update the documentation or the metadata. So, if we do devtools document, it will read through all the different .R files and write out that documentation for you. And then, just as expected, you can do, you know, question mark square val and it will show you all that documentation you wrote. So, you get to immediately see the benefits of your labor and other people downstream will be very excited to actually figure out what does this function do and how do I use it. So, while this is great, it's done some more behind the scenes work beyond just, you know, the help panel. It's also added the square val function as an export in the namespace file. The namespace is something we don't want to edit by hand. Again, it's a downstream thing that, you know, devtools document can do for you. But this basically just tells the package, yes, we're exporting square val too.
So, if I go back to R and I go into my package and look at the namespace, it has export square val. Just like we expected, because we've documented the package, it's added our function as an export. So, when I load the package, it's available. Now, we can devtools check one more time and make sure we haven't missed anything or that something hasn't broken. And then we'll basically, again, go through the whole process of checking it for errors or warnings or issues. And we're going to check it into version control. We're going to commit basically very early, very often, basically whenever we're making these changes and we're ready to kind of commit them into version control, make that commit. And then you can always kind of go back to that moment in time when things were working if you break it down a little bit.
Installing and using the package
All right. So, we've documented our package. We've created our function. Now, we can actually install it and basically make it available on any R session within our computer. So, devtools install, when you run that inside your specific project, will install that package locally. Meaning, even though, you know, I've been doing everything inside fun name, I can actually go back to my package building project, load my package, and use it there. I don't have to source the file in. I don't have to reinstall it. It's just available. So, if I do fun name, I think, yeah, square val 2, I can call this in this project. I can call the package in this environment. Even though it's a different environment, that package has been installed. So, I'm ready to go. I can use this.
For other people, they need to install it from version control. They need to download all the things and compile it and build it themselves, which is fine. But for me, I'm ready to go. I've created a package that's useful for me, and I'm happy with it.
So, as you'd expect, I can load it in any other environment. I can load demo package or package fun name or whatever else I want. And then I can use the functions as I need to.
Adding more functions and tests
Now, in most situations, you probably want to add a lot more functions. You're not just sticking with one function. You have ten functions you want to use or a hundred functions or however many you can think up and get ready to go. So, we can add all of our functions with a similar workflow. usethis, use R, define the function, write the function, make sure it's working, add documentation, and you're ready to go. Again, checking all these different components in the version control as you go. But because we've already covered that, I want to jump on to the next thing, which is let's add some tests. So, for the sake of time, and while we can add all those different functions with a similar workflow, we should talk about testing our functions with test that, because we want to show the whole game. We want to show, like, all the different components. So, testing is basically, in test that, our way of doing unit testing. And you've been doing unit testing, essentially, all the time, whenever you test your ideas.
If I say, like, 1 plus 1 equals 2, I've now checked myself on making sure 1 plus 1 equals 2. I've just done it in the R console. So, it's only useful for me at that moment in time, and I have to remember everything. Or if I do square val 2, and if I actually do fun name, and actually bring in the function, it gives me 4. And if I do 3, it should give me 9. So, I'm testing it, and it's, you know, going along with what I want it to be doing. But six months from now, I don't remember if it actually did what I wanted it to do. So, by adding tests, I can basically, you know, make these tests occur with code automatically, as opposed to me having to test it in the R console, or manually, or interactively.
So, whenever you're attempting to type something into just, like, a print statement, and be like, okay, yep, it gave me what I want, or a debugger expression, write it as a test instead. So, up until now, we've only tested our function interactively, literally used it in the R console, and checked for package errors via check. We can formalize and expand this with unit tests via the test that package.
Again, just a user-friendly package, making it easy to write tests in R, and basically check your assumptions. A unit test in R basically means we're expressing very specific expectations about our example. So, we want square val 2 to always generate squared values. And we can test that with a couple of different inputs. As far as unit tests, I came from a neurobiology background. I never used a unit test until I started writing packages. So, it was a new term to me. Unit tests are basically automated tests run and written by developers to ensure that their application or their unit of code behaves as intended. So, by writing these tests for small components, and then adding up all the tests together, you can basically test very complex objects, and with very simple tests.
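Before bringing in test that, the idea of "expressing expectations in code" can be sketched with nothing but base R. This uses stopifnot rather than the test that package, so it is a simplified stand-in and not the approach the talk goes on to demonstrate. The function name square_val is assumed from the demo.

```r
square_val <- function(x) x^2

# Each line states one expectation about the function's behaviour.
# stopifnot() raises an error as soon as any of them is FALSE,
# so the checks can be re-run automatically instead of by eye.
stopifnot(
  square_val(2) == 4,
  square_val(3) == 9,
  square_val(-5) == 25
)
```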
Setting up unit tests with usethis
So, let's do this real quick. In terms of usethis as a workflow package, just like everything else, there's a use test. So, we can load usethis, and then usethis use test. This will create a test file for whatever file you have open.
So, let's go back to our fun name. I do see a question in the chat about the console prompt.
I actually have a tweet about that and an example. I need to write a blog post about it. But I will, yeah. So, this is part of my R profile. It basically says, like, what is today's date? It gives me some kind of motivational examples and says, like, what R version are you using? And then every time I add something, or the time changes, and I execute something in the R console, it gives me the current time, a nice little laptop emoji, and tells me which branch I'm working with.
This is probably beyond today's scope, but I will promise I'll tweet out about it again, about how to do this. There was a great example from someone in the community on how to do that. But we're going to be talking, we're going to focus again on unit testing. So, I'm going to use this, use test. Now, you'll note that before I call this function, I want to go to the file I have open. It will basically say, whatever file I've opened interactively, create a test for that.
And it's going to tell me a few things that are going on. It's adding testthat to Suggests. It's creating a test folder. It's writing the test file. And now, it opens this up. Now, we have a test that function, which is basically, you know, it defaults to this example, which is test that multiplication works. And I need to load testthat. So, library testthat. We'll call it again.
And it gives me this happy, yeah, multiplication and R still works. Two times two is four. Great. That's good. But we're interested in square val actually squares. All right. So, let's do square val. And we want to square two, expect equal four. All right. So, what we're doing here is saying test that these different things are expected. So, we're going to expect that square val two returns four. And when I test that, let's see what it says. It says cannot find function square val. So, devtools load all. And the reason why I did that is because I moved into this project and didn't do anything yet. But if I do test that, it will say test pass. And it will always give you a happy little message every time it passes.
But what if something doesn't match? So, what if we say square val four is 15 instead of 16? So, we run this. And now, it gives me a failure. And it says square val actually squares. So, the text I have up here. The actual is not equal to the expected. The actual should be 16, but it returned 15. So, now, I can basically generate many of these different tests that test different components of my package and tell me, oh, yeah, by the way, like that thing that you thought was working is not working. And it failed in this exact way. And whenever someone opens an issue and says, hey, your package is failing, you can write a test for it to make sure it doesn't happen in the future.
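The passing and failing cases from this part of the demo can be sketched as follows, assuming the testthat package is installed. The function is defined inline so the sketch runs on its own. Inside the package you would rely on devtools load all instead. The failing expectation is left commented out so the block runs cleanly.

```r
library(testthat)

# Defined inline for the sketch; in the package this lives in R/square_val.R.
square_val <- function(x) x^2

test_that("square_val actually squares", {
  expect_equal(square_val(2), 4)
  expect_equal(square_val(4), 16)

  # Uncommenting the next line reproduces the failure from the demo,
  # because square_val(4) returns 16 rather than 15:
  # expect_equal(square_val(4), 15)
})
```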
Now, most of the time, when you're writing these functions initially, they just work. And that's great. Like you've written a function that works. But these unit tests allow you to standardize how you're making sure it actually does what you want. Now, again, there's many different things. So, maybe you don't want to expect equal. You actually want to expect failure. So, let's say expect failure. What happens if we square a cat? You know, it's Halloween. We're going to be squaring some cats.
So, the fun times about live streaming. So, now I'm in my own personal package. This is actually a package that I use a lot for writing R functions. So, let's find expect failure.
I can spell failure. Apparently I haven't written one of those before. So, we'll just abandon that for now and go back to the slides. And if I have time at the end, I'll keep exploring. Again, the idea is that we can use usethis and testthat to set up our expectations. There is a way to test for errors that I'm blanking on right now, but I want to keep going rather than getting us off track. And to see the way that tests are written, let's go to a test again. So, we're going to testthat. Let's do dot bar. So, this is a more complex test in terms of like we're testing SVG and HTML. But the way this is set up is we have a .R file that is doing tests of the specific function. It's creating an object. And then we're doing expect_equal(). I'm basically saying like do these values match my expectation?
That's the way that it's set up. So, I created that with usethis, use_testthat().
And these expectations are grouped into those tests in that .R file. So, an expectation is the specific unit of testing. You know, does it have the right value, the right class? Does it produce error messages? What does it kind of give as the output? These expectations are functions that always start with expect_. So, I was using expect_equal(). And the one that I failed on, which is kind of ironic, is expect_failure(). And then you have a test, which is a group of multiple expectations. So, if I were to, you know, go back to function name, we'll go there for a second. The test is the overall kind of parent of all the different expectations I'm writing, along with the message it shows on failure. You know, that's one unit of functionality: a test has multiple expectations. And as long as those are all true or all met, you get your happy message out.
And the file is grouping together the tests, which are grouping together all those expectations. So, if we do this test, this whole file is the test file. test_that() is the actual test. And these expectations, these expect_equal() calls, are the units of testing. So, again, we'll load testthat. We'll do test_that(), expect_equal(). And as long as it meets the expectation, we'll do devtools::load_all(). devtools::load_all(). Going all the way here because we moved into our new project. So, we got it loaded up. Test is passed. Happy days. We got it all the way back. So, we have a test file. We have an expectation, which is expect_equal(). It's your expression and the expected value. And then the test that you're wrapping around that, with the message if it fails.
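The file, test, expectation hierarchy can be sketched like this. The file name and the square_val() body are assumptions for illustration; inside a real package the function would come from devtools::load_all() rather than being defined in the test file.

```r
# Contents of a hypothetical tests/testthat/test-square_val.R  <- the FILE
library(testthat)  # only needed when running this outside a package

# Stand-in definition so the sketch runs on its own (assumed body).
square_val <- function(x) x^2

test_that("square_val squares numbers", {   # <- one TEST
  expect_equal(square_val(2), 4)            # <- EXPECTATIONS
  expect_equal(square_val(-3), 9)
  expect_type(square_val(2), "double")
})
```

All three expectations have to be met for the test to report a pass.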
Expect error vs expect failure
All right. So, interactively, you can write and check those tests. So, I can, again, load all, load testthat, and I can run this interactively as I've shown, in terms of, like, I do this, and it gives me the happy message. I do this one, and it expects, ah, this is what I was looking for: expect_error(), not expect_failure(). Here we go. So, good times. So, let's do this one more time. I should have just kept going on my slides.
testthat. And because we haven't written square_val2() yet, this is square_val() and the binary operator. So, square_val("cat"). And this is what it gives me. This is the error message. The way that expect_error() should go is it should give me the exact message, because we're trying to test for a message. And it's still going to give me a failure. Let's do one more. expect_error(). Now it's passing. So, this expect_error(), instead of expect_failure(), which was my failure earlier, is basically saying, like, if I pass something that it can't operate on, in terms of it can't square a cat, it's just going to error. So, if I do this, it generates an error. I'm expecting it to generate an error. And that all passes. So, not only can I square values and get the expectation, I can square the value of something that can't be squared, and it gives me the error that I want to get. So, we went full circle. We made it all the way around. Thanks for sticking with me on that. I'm glad we were finally able to square our cat.
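A small sketch of squaring the cat with expect_error(), again assuming square_val() is a plain squaring function. The message text is the base R error quoted in the talk; it assumes an English locale.

```r
library(testthat)

# Assumed definition, as before.
square_val <- function(x) x^2

test_that("square_val errors on input it cannot square", {
  # Any error at all satisfies this expectation.
  expect_error(square_val("cat"))
  # The second argument is a regular expression matched against the message.
  expect_error(square_val("cat"), "non-numeric argument to binary operator")
})
```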
Now, the way that these tests are written, you can see that I've done a few different things. So, for my expectations, I have, you know, the output should be equal to 2 squared, to 4 squared, to 16 squared. So, I've written essentially, like, multiple expectations testing the same idea, but with different inputs. Because while one input might always work, other inputs might not work. And this is probably more impactful for more complex functions, but we want to test our expectations multiple different ways. And then I can do test_that() non-numeric or missing input should error. So, "a" should always error as input must be numeric. Factors, data frames, missing values, those should all error out in specific ways. And for expect_error(), you can either do it as just give me an error, or give me this specific error message as it's expecting that. I'm writing this on square_val2(), which is the one that had the little helper check saying, if it's given something non-numeric, give me a nice friendly output in terms of "input must be numeric", rather than this kind of less helpful error, "non-numeric argument to binary operator". So, kind of coming full circle.
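Here is one way those two tests might look. The body of square_val2() and the exact wording of its message are assumptions based on the description above, not the code from the slides.

```r
library(testthat)

# Assumed implementation: check the input first, then square it.
square_val2 <- function(x) {
  if (!is.numeric(x)) {
    stop("Input must be numeric", call. = FALSE)
  }
  x^2
}

test_that("square_val2 squares several inputs", {
  expect_equal(square_val2(2), 4)
  expect_equal(square_val2(4), 16)
  expect_equal(square_val2(16), 256)
})

test_that("non-numeric or missing input should error", {
  expect_error(square_val2("a"), "must be numeric")
  expect_error(square_val2(factor("a")), "must be numeric")
  expect_error(square_val2(data.frame(x = 1)), "must be numeric")
  # A bare NA is logical, so it is rejected; NA_real_ would pass the check.
  expect_error(square_val2(NA), "must be numeric")
})
```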
Those tests, again, optional but useful, live in tests/testthat. File names must start with test. And as far as a usable test file, this is basically the exact same thing you were doing interactively, but now we've put it into our testing file. And the reason why it's useful in a testing file is I can do devtools::load_all(), devtools::test(). And now, it's going to go through and run my tests. And it gives me A-OK. I have two passes because it basically evaluated this one and it evaluated square_val(). So, in this case, when I have the situation of I have dozens of functions, or 10 functions or 15 functions, rather than, again, manually going in there, I'm able to go in and test all of them at once, figure out which ones pass, which ones fail, which ones have warnings or other messages. And when I run devtools::check(), again, that thing that I should be doing very frequently to say, like, how is my overall package looking? It will, again, run all of the tests that I've written with testthat.
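The calls mentioned here can be bundled into a small helper. run_package_tests() is a hypothetical name used only for this sketch; the devtools and usethis functions inside it are the real ones named in the talk.

```r
# Hypothetical convenience wrapper around the two calls from the talk.
run_package_tests <- function(pkg = ".") {
  devtools::load_all(pkg)  # load the package code from source
  devtools::test(pkg)      # run every test file under tests/testthat/
}

# One-time setup, run from the package root:
# usethis::use_testthat()          # creates tests/testthat/ and tests/testthat.R
# usethis::use_test("square_val")  # creates tests/testthat/test-square_val.R
# run_package_tests()
```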
So, what happens if they don't match? Again, it'll throw this error message. So, that was that one, if I would have kept going in my slides rather than trying to live code it all out, expect error, you know, give me this, you know, squaring of this, and the input must be numeric. So, you can still test things that should always fail or should always error in a specific way. So, you can still get those expectations and test both successes and failures in your package.
And if you do something that doesn't match, it will, again, show you that message of, like, you tried to square something, it wasn't a match, and here's the difference between the expectation and the actual value. So, while, again, I've gone through a lot of the different components of testing, and we went through kind of a side path that I appreciate you sticking with me on, writing expect_error() versus expect_failure(), you can read a lot more about this in the R Packages book, which has a testing chapter, and in the testthat documentation, and I wrote out kind of a longer form blog post example of robust testing.
Building a pkgdown site
Let's move on to some more documentation, though. So, while you might, you know, be relying on the documentation built into R, you can also build your own pkgdown site, just like the one that devtools has. They've got this really fancy, you know, cool devtools pkgdown site with, like, documentation and links and articles and all these cool things. They didn't write all this out manually; this was built through the pkgdown package. So, you can do the same thing. Just with the documentation we've done, we can use devtools::build_site(), and it will actually build out that website for us with the documentation and all the examples. So, we'll let that run for a second. It's going to build out the documentation, but the benefit here is that documentation you've written ahead of time is used here. So, every time you run devtools::document() and rebuild the site, it will, you know, update this as well. So, I can go into Reference, I click on square_val, and it says, hey, this is square_val, and it shows me how it's supposed to work, and it shows me the examples.
So, if you do square_val(2), it gives you the output. So, again, like, in literally seconds, I was able to go from writing a package, writing some documentation, to now having a website I can publish that other people can go to and get information about how my package works. And this is how, like, my personal packages have their own documentation. So, for, like, gtExtras, when you go to the reference page, I didn't have to manually build out documentation for all these functions. You know, I was able to go in and, you know, write the documentation with roxygen2, and it builds out all this complexity for me with, like, examples and all the different parts of the code and everything else. So, very, very powerful things you can do with pkgdown and just the basic documentation.
So, I think this is really powerful. I'm a big fan of that. You can host that pkgdown output anywhere. You can do it on, like, GitHub Pages, which is where one of my examples is. Or you can host it on something like RStudio Connect, which, again, if you're building packages inside your organization, you may not want to show the entire world. You might just keep them inside your organization. But you can host that HTML anywhere. For, you know, if I was hosting an example of, like, one of my basic packages on RStudio Connect, I can, you know, deploy it with this kind of code, which is basically saying, like, hey, this is a lot of different files. Take all of them and deploy to Connect. And then here's one running on our demo server with the hello world example. Operates just the exact same way. But you can take that documentation and host it anywhere.
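A short sketch of the document-then-build step. build_docs_site() is a hypothetical helper name; devtools::document(), devtools::build_site() and usethis::use_pkgdown() are the real functions, and pkgdown writes its output to docs/ unless configured otherwise.

```r
# Hypothetical helper: refresh the help files, then render the pkgdown site.
build_docs_site <- function(pkg = ".") {
  devtools::document(pkg)    # regenerate man/ and NAMESPACE from roxygen comments
  devtools::build_site(pkg)  # render the website from that documentation
}

# One-time setup, run from the package root:
# usethis::use_pkgdown()
# build_docs_site()
```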
Referencing external packages
Now, the last component we'll talk about as we're kind of nearing the end here is what if we want to reference external packages? So, I've written everything with just base R. The expectation is that while base R is super powerful, at some point you're probably going to bring in another package, whether you wrote it or someone else wrote it. For example, I wrote a lot of wrapper functions around gt. That's what this whole entire package here is. All these different functions are wrappers around the gt package. So, I have to reference gt as one of the packages I'm bringing in because I'm essentially importing all these functions that other people have written and loading them into my package. So, gt_add_divider() was one of the basic examples we showed when we were talking about tidy evaluation. Now, this only works if I bring in things from gt as well. So, tab_style(), cell_borders(), cells_body(), cells_column_labels(), all of that is actually being brought in from gt. The part that I'm doing is just the gt object and the actual inputs because it's wrapping all those other components.
So, in order for this to work, I have to tell the function, hey, by the way, I didn't write any of this. Please bring in this other package that's really beautiful and amazing. So, now in our documentation, we have the things that you've seen before, title, description, the function parameters, all those cool things. And then @importFrom gt and @import gt. Now, I'm importing from gt the magrittr pipe because inside my function, I want to be able to use the magrittr pipe. I want to use it all over because I have really long functions and I want to pipe things together. I'm also importing all of gt, in terms of all the functions that are available in gt are now made available inside my function. So, when I do gt_add_divider(), it's able to load in all of those functions and use them. It also adds gt as an import in the NAMESPACE file, and with gt listed under Imports in DESCRIPTION, when someone installs gtExtras, it says, hey, by the way, you also have to install gt for this package to work. So, it also downloads gt and installs it. So, again, part of this is that with simple documentation, you can build out all these different components of all the important things in your R package. When you call devtools::document() or do something like usethis::use_package("gt"), it will also update your imports in terms of, like, I need to import gt. I want to specifically have people bring in package version 0.3 or newer because they added in a lot of cool things in 0.3. So, when you install my demo package, it also installs gt.
So, again, this structure gives you the ability to do your documentation and metadata and attach all the pieces together in a way that other people know what to do with it and R knows what to do with it when you try to install it or load it in other locations. So, usethis::use_package() is what allows you to add specific packages to your DESCRIPTION file. Although you can always manually write this out. If I'm in my package, let's see, there's gtExtras, we can show it here.
So, if I pull this down, you can see that gtExtras brings in a lot of things because I need to, like, manipulate data with dplyr. I'm doing a lot of inline plots with ggplot2. I absolutely need gt to actually build up all these things. So, this tells it, every time someone installs my package, to bring in each of these packages, and you can bring in a specific package at a specific version. I do see a question from the chat. I think you could actually require a specific version, in terms of if you wanted to say equal to 0.5, but typically, I would say greater than or equal to, and most often greater than, because someone could actually install a newer package version. So, this means as long as you have 0.8.5 or newer; dplyr is actually on version 1.0. So, as long as you have dplyr 0.8.5 or newer on your system, it doesn't have to be installed again. It can just import that package.
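As a rough illustration of those roxygen tags, here is a hypothetical wrapper (add_right_border() is made up for this sketch and is not the gtExtras source) as it might appear in a package's R/ directory. It assumes gt is listed under Imports in DESCRIPTION.

```r
#' Add a right-hand border to table body cells
#'
#' @param gt_object A table created with gt::gt().
#' @param columns Columns to style, unquoted.
#' @param color Border color.
#' @return The modified gt table.
#' @import gt
#' @importFrom gt %>%
#' @export
add_right_border <- function(gt_object, columns, color = "grey") {
  # tab_style(), cell_borders() and cells_body() all come from gt;
  # the tags above are what make them visible inside the package.
  gt_object %>%
    tab_style(
      style = cell_borders(sides = "right", color = color),
      locations = cells_body(columns = {{ columns }})
    )
}
```

When devtools::document() runs, the @import and @importFrom tags become import directives in NAMESPACE. Outside a package, the same function would need library(gt) before it is called.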
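To make the "greater than or equal to" idea concrete, here is a small base R sketch. meets_minimum() is a hypothetical helper; the commented lines show the corresponding usethis call and the kind of entry it would write to DESCRIPTION.

```r
# Hypothetical helper: does an installed version satisfy a minimum version?
meets_minimum <- function(installed, minimum) {
  package_version(installed) >= package_version(minimum)
}

meets_minimum("1.0.0", "0.8.5")  # TRUE: a newer dplyr satisfies the constraint
meets_minimum("0.8.0", "0.8.5")  # FALSE: too old, so it would need upgrading

# Recording a minimum version as a dependency of your package:
# usethis::use_package("dplyr", min_version = "0.8.5")
# which adds a line like this under Imports in DESCRIPTION:
#     dplyr (>= 0.8.5)
```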
Workflow summary
Again, the whole kind of summary here as we get to the end, you know, we've talked about documentation. We've talked about building functions, building the package, writing all the different components. The basic workflow within a package, if we're summarizing this, is, remember, small changes committed frequently via version control. If you want to share it with the world, you probably want to check it into version control so they can download and install it. And if you do those in smaller components, if you do break something, you can go back earlier in the process. usethis::use_r() allows you to add new functions as you go. Ctrl or Cmd, Alt, Shift, R, or Cmd Shift P for the command palette, or however else you want to do it, adds the Roxygen skeleton when you're inside the function. devtools::load_all() lets you load and interactively test your new function.
And then usethis::use_package() for a specific package you want to import, to add that package as a dependency. usethis::use_version() when you're updating your package. So, allowing you to change like, oh, this is package version 0.2. It's the very first one I did, or it's 0.2.1, you know, versioning your package so you can install it at a specific moment in time. And then devtools::document() to document the package with your various changes as you make them. So, any time you update the Roxygen comments or other documentation, you need to re-run devtools::document() to document everything in an appropriate place. And lastly, devtools::check() all the time and devtools::test() to check and test the package so you're making sure that all the things are working.
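The workflow above, written out as a sketch. check_cycle() is a hypothetical wrapper for the three non-interactive steps; the commented calls are the interactive ones, shown with example arguments that are assumptions.

```r
# Hypothetical wrapper for the steps you repeat after every change.
check_cycle <- function(pkg = ".") {
  devtools::document(pkg)  # rebuild help files and NAMESPACE from roxygen comments
  devtools::test(pkg)      # run the testthat suite
  devtools::check(pkg)     # full R CMD check of the package
}

# Interactive steps, run from the package root as needed:
# usethis::use_r("square_val")    # create or open R/square_val.R
# devtools::load_all()            # try the new function interactively
# usethis::use_package("gt")      # add a dependency to DESCRIPTION
# usethis::use_version("patch")   # bump the version, e.g. 0.2.0 to 0.2.1
# check_cycle()
```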
Now, again, I've shown a lot of different stuff and it can still seem a little bit like there's a lot of different components going on. All these things don't always have to be done. Like, if you're not importing packages, you don't have to do that. If you're not checking into version control, you don't have to check into version control. I'm just trying to show you kind of the different levels of where you can go and where you can go in the future. And your minimal viable package can be done in just a few minutes.
If all is well, at the end of the day, just install the package and you can use it locally for as long as you want to. And then if you want to update the package, you can, again, install the newest version once you've made changes to it and it will just overwrite that package install.
Now, as far as devtools::install(), you know, we will have been installing it locally. So, no one else can install my package. That's fine. Maybe I'm the only person who's supposed to be using it. But to share with your colleagues, they're going to actually take it from the remote or from the cloud or wherever else you have it checked into version control to install it. So, for example, my gtExtras package: the way that you install it is you would actually do a remote install from GitHub, from my specific handle, of this specific package. So, even though the source code doesn't live on their computer, they can still install that package from GitHub or GitLab or wherever else and install it into their local environment. And if you get it on CRAN, then you can just do install.packages() with whatever the package name is. Now, you don't have to get things on CRAN for everything. Again, you might have packages that are only used internally or you're only using it for yourself. So, while it's great to kind of shoot for CRAN in the future, that's not required for every, you know, package that you're building.
And again, while I'm using install_github(), because that's where I have it, you could install it directly from GitLab or just from Git if you had like a local environment you're installing it from. Another thing that we use a lot: RStudio Package Manager allows you to host your packages on premise. So, that's really useful in terms of like, again, if you're not releasing it to the world, you're holding packages internally, you can actually host them on RStudio Package Manager and install from there. So, you're not having to breach your firewall to install.
And Package Manager, whether you're using our Public Package Manager, which is free, or the on-premise RStudio Package Manager, which is a paid product, has all the different versions available. You can install specific package versions. All the packages are binaries. You can install them very quickly, especially in like a Linux environment.
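A minimal sketch of the remote install described here. install_from_github() is a hypothetical helper name; remotes::install_github() is the real function, and the repository string follows the "handle/package" pattern.

```r
# Hypothetical helper: install a package straight from a GitHub repository.
install_from_github <- function(repo) {
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  remotes::install_github(repo)
}

# Example call (not run here because it needs network access):
# install_from_github("jthomasmock/gtExtras")

# For a package that is on CRAN, plain install.packages() is enough:
# install.packages("gt")
```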
Wrapping up
So, to wrap this up, and thanks, everyone, for sticking with me. We've gone through the end to end process of writing functions with both base R and with tidy eval. We've created a minimal package in just a few minutes. We've added more functions to the package. We've documented the package as well as potentially external dependencies. We've even covered testing and unit testing with your package, how to create documentation internal to R, as well as pkgdown for documentation that's external to R, just on a website. So, we've really covered a lot today. And you may only use components of that, and that's fine. Really just trying to show that you can create a package very quickly, and you can build up those components over time with tests or documentation, external documentation, other packages, whatever you need, you can kind of build up over time as you learn more. So, I really do hope that you feel empowered to create your own package.
You can use as many of the best practices as possible. But ultimately, once you get started, you quickly see that the tooling has really been written in a way to make the process very user friendly and really ready to go. As far as follow up, you can read through the whole process again in the R Packages book. They have a chapter called The Whole Game, which goes through the entire process, basically what we did today, but written out.
So, this has like a basic package you're building and loading different libraries, creating functions, and all the different things that we did today. So, maybe if you want to read through it as opposed to watch a video, you could always go through here. I was really just hoping to provide a video complement to these types of resources where you can see some of my successes and failures in real time, in terms of we struggled a little bit together with that expect_failure() because I was looking for expect_error().
But overall, very, very kind of quick to get started. Again, the slides from today are available publicly. So, you can kind of go to the slides. I'll throw those in the chat again if you just want to look at it all. Or if you want to look at the source code for how I even built the slides or some of that other stuff, I have that on my GitHub at the second link I shared, which is here at github.com/jthomasmock/pkg-building. Overall, 90 minutes. We built a package. We looked at a lot of different things. We had a lot of fun. Thanks so much for hanging around. If there are any other questions, I'll hang around for a little bit more and answer some questions. But I think it was a fun stream, and I had a good time. So, thanks for hanging out with me.
Importing tidyverse packages
When importing tidyverse packages, would you recommend importing the whole tidyverse or just the individual packages? I would definitely recommend just importing the individual packages, in terms of like the tidyverse is actually like 20 or 30 packages, which have quite a bit going on. And it's a meta package in terms of it's really just loading those other packages. So, for gtExtras, for example, like I depend on some components of the tidyverse in terms of like dplyr, ggplot2, you know, tidyverse packages. But I'm not bringing in tidyr, I'm not bringing in lubridate, I'm not bringing in all these other different components of the entire tidyverse. And I don't need to depend on those, and I don't need users to install all those packages, even though they may already have them installed locally.
Unit testing complex outputs
For unit testing, there are a few links on this specific slide. So, this slide has the references here. So, the R Packages book has a chapter on testing. The testthat documentation goes into further examples. And then I actually did a long-form example of unit testing in this blog post. So, you might think like, oh, well, he's only testing square_val2(), how do I do more complex testing? And that's really what this blog post is all about, is how do you test an HTML table?
So, if I create a gt table, it's actually hundreds or even a thousand lines of HTML. And those are more complex to test in terms of like, this whole section right here is all that table. And it's got all these different sub rows and different, you know, bodies and all these other different things going on. So, there's a lot of code there. But you can extract all those different components. And when you get to the actual unit testing, I'm just testing, you know, I'm just expecting this exact string in a very specific location. So, expect_match() or expect_equal(), you know, that this specific component of the table is equal to this. So, you can still get your unit test to be very simple.
As far as things that are more difficult to do, there's a question like, what are the works you've created that cannot be done with unit testing? So, something that's more, it's available now, but like there's interactive testing or test snapshots. So, you may imagine that rather than testing specific components, you want to test the entire HTML string all at once. So, you can still do that with expect_snapshot(), but you have to like, save the file and read it back in. Or something like, how do I test ggplot2? Like, how do I test the actual graphic or the PNG I'm creating? For that, you usually do like a visual test, although you can still test individual components that are being created as normal.
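A toy sketch of the "extract one component, then expect a string" idea, using base R string tools on a hand-written HTML fragment. The fragment and the extraction pattern are assumptions for illustration; the blog post's actual approach may differ.

```r
library(testthat)

# Hand-written stand-in for a tiny piece of rendered table HTML (assumed).
html <- paste0(
  '<table><tr>',
  '<td class="gt_row" style="border-right: 2px solid black;">1</td>',
  '</tr></table>'
)

# Pull out just the style attribute of the cell.
cell_style <- regmatches(html, regexpr('style="[^"]*"', html))

test_that("the cell carries the expected right border", {
  expect_match(cell_style, "border-right", fixed = TRUE)
  expect_equal(cell_style, 'style="border-right: 2px solid black;"')
})
```

Even though the full table can be very long, each expectation only looks at one small, predictable piece of it.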
All right. So, thanks again for everyone's time. I'll upload this to YouTube. But thank you so much for your time and hanging out with me today. I had a blast. I'm going to try and do some more of these regularly, maybe with smaller topics even. I know this was a relatively big one, but we'll talk a lot about tables, I'm sure, and other different things in the future. Thanks so much for your time, everyone. Have a great week. Stay safe and we'll see you next time.
