
Malcolm Barrett | You're Already Ready: Zen and the Art of R Package Development | RStudio
R packages make it easier to write robust, reproducible code, and modern tools in R development like usethis make it easy to work with packages. When you write R packages, you also unlock a whole ecosystem of tools that will make it easier to test, document, and share your code. Despite these benefits, many believe package development is too advanced for them or that they have nothing to offer. A fundamental belief in Zen is that you are already complete, that you already have everything you need. I’ll talk about why your project is already an R package, why you’re already an R package developer, and why you already have the skills to walk the path of development. About Malcolm: Malcolm Barrett is Clinical Research Data Scientist at Teladoc Health, an epidemiologist, and an R developer. He is also an organizer for the Los Angeles R Users Group. Malcolm is the author of several R packages, including ggdag and precisely. Previously, he was an intern at RStudio and spent two years of service in AmeriCorps. In 2013 and 2014, while serving in AmeriCorps, Malcolm lived in the Zen Center of New York City, where he is still a student.
Transcript
This transcript was generated automatically and may contain errors.
My name is Malcolm Barrett. I'm a clinical research data scientist at Teladoc Health. I'm an epidemiologist and I'm also an R developer. I use R packages every day in my work. They're also fundamental to the way that I organize my code. And the reason for that is that R packages are the fundamental unit of shareable code in R. They solve many of our problems for us. They make our code more robust, easier to share, and much safer over time, whether that's somebody that's using it in the future that we've shared it with, or it's just us in six months, right? Our code is less likely to break in that time because of the robustness of R packages.
And yet many people, when they encounter the idea of writing an R package, they think, this is too advanced for me. This is beyond my scope. This is beyond my skill set. Many people also think that they don't have something to offer. They think that if they're not writing something like ggplot2 or dplyr, something that's really incredible, then they don't have anything to offer, anything to share.
The parable of the lost son
I'd like to tell you a story that comes from Buddhism. Once there was a father and a son. And at some point, the father and the son got separated. Here, their lives diverged dramatically. The father became quite wealthy. He developed a huge estate and amassed a great amount of riches. The son, on the other hand, became absolutely destitute. He was embodying poverty.
Later on in their lives, the son actually stumbled upon the father's estate. Now, the son didn't recognize the father. The father had changed so much in his riches that he was unrecognizable. But the father, even though his son was draped in poverty, recognized the son immediately. He tried to bring him into the house to say, look, look, you can have all of this. And the son thought he was crazy. He tried to run out of the house. He thought, this guy is nuts. I don't want anything to do with this. So the father sent his servants to go after the son and hire him instead as an employee of the estate.
The son worked his way up from the very bottom until many years later to the very top of the estate. At this point, he's the right-hand man of his father. On his deathbed, the father finally reveals to the son who he is, that in fact, everything that's around him is already his. And in some versions of the story, at this point, the son says that he actually has already understood that. He has naturally come to understand that what surrounds him is his wealth, his treasure, something that is of himself.
This is a talk about why you already have the skill set, from the techniques that you use every day in your data analysis, to pursue the path of R package development. In a Zen text called the Sandokai, there's a saying, if you do not see the way, you do not see it even as you walk on it. This is to say that we're actually always already perfect and complete, already ready. So in R, we might say, if you don't see the R package, you do not see it even as you develop it.
You already structure your project
For instance, you already structure your project. Now an R package has a very specific structure that it uses. And yours might look something like this, your project. I've got a data folder, a reports folder, a scripts folder, an analysis folder. Now an R package specifically uses a folder called R, with a capital R. It also uses a data folder, just as it is. Instead of a report, we might have a vignette, a long form documentation, as well as a number of other files that are formal but help us do things that we actually often do in our analysis. Describe it, test it, and provide information about how it works.
The usethis package is really useful in connecting the abilities that we already have and use in our work to this formalized structure of an R package. So as you might imagine, one of the most useful things that this package does is create a package, using the create_package() function. So here I'm creating a package called zenart R packages. And it's laying out all of the infrastructure that I need for an R package, and it's doing it all for me automatically. It's also communicating to me very concretely, very simply, what it's actually doing, and whether I need to do anything.
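As a rough sketch of that step, the call below uses usethis::create_package() to lay out a package skeleton in a temporary directory. The name zenexample is a hypothetical stand-in, not the package created in the talk, and the example assumes usethis is installed.

```r
# A minimal sketch: create a throwaway package skeleton in a temp directory.
# "zenexample" is a hypothetical package name used only for illustration.
library(usethis)

pkg_path <- file.path(tempdir(), "zenexample")

# open = FALSE keeps usethis from trying to open the new package
# in a fresh RStudio session
create_package(pkg_path, open = FALSE)

# Look at what was laid out: a DESCRIPTION file, a NAMESPACE file,
# and an R/ folder for your functions
list.files(pkg_path, all.files = TRUE, no.. = TRUE)
```

In everyday use you would point create_package() at a real folder and let it open the new package for you.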
One of the most important files that it creates is the DESCRIPTION file. The DESCRIPTION file is going to tell you what the package does, what it's called, who wrote it, who contributed to it, what the license is, and it might tell you some more about dependencies and more along those lines. It's also a really amazing file because it's sort of a golden ticket. It gives you access to many of the really excellent tools in the R ecosystem for developing R packages, tools like usethis and devtools.
For instance, usethis also has a function called create_project(). And while this uses a formalized structure, it's much simpler compared to an R package. It does create an RStudio project, but it doesn't have a lot of the other top-level stuff that organizes an R package. Notably, it doesn't have a DESCRIPTION file. But we can add one to it using use_description(). And when I add a DESCRIPTION file to a project that actually isn't a package, usethis still treats it as if it is a package. And so that means that we can tap into some of the tools that we're going to talk about, like writing tests, and other tools that we're not going to talk about, like writing documentation, quite easily, because usethis and devtools will actually treat your project as if it's a package.
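The project-plus-DESCRIPTION idea might look something like the sketch below. The project name zenanalysis is hypothetical, the project lives in a temporary directory, and usethis is assumed to be installed.

```r
# A minimal sketch: a plain project that gains a DESCRIPTION file.
# "zenanalysis" is a hypothetical project name used only for illustration.
library(usethis)

proj_path <- file.path(tempdir(), "zenanalysis")

# Create a simple project (no DESCRIPTION file yet)
create_project(proj_path, open = FALSE)
file.exists(file.path(proj_path, "DESCRIPTION"))

# Point usethis at the project, then add a DESCRIPTION file to it.
# check_name = FALSE relaxes the package-naming rules, which is handy
# when a project folder is not named like a package.
proj_set(proj_path)
use_description(check_name = FALSE)
file.exists(file.path(proj_path, "DESCRIPTION"))
```

The second file.exists() call is the point of the example: after use_description(), the project carries the same metadata file a package would.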
You already write R code
You also already write R code. Now, your R code might look something like this. Here I'm taking iris, I'm grouping it, and I'm summarizing it, and it's resulting in a data frame. This is something that I do every day in my work. And if I use usethis, I can put this code in a formal place using use_r(). This will create an R folder for me if it doesn't exist already and open up a file in it for me to write code in. Instead of just writing my code plainly, I'm actually going to wrap it in a function, functions being what you would expect an R package to export. So here I'm creating summarizeiris, and I'm putting it in the file that I just opened.
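The slides are not part of this transcript, so the following is only a guess at the kind of code being described: a script-style summary of iris, then the same logic wrapped in a function. The name summarize_iris, its data argument, and the choice of summarizing Sepal.Length are assumptions made for illustration; dplyr is assumed to be installed.

```r
library(dplyr)

# The everyday, script-style version: group iris and summarize it
iris %>%
  group_by(Species) %>%
  summarize(mean_sepal_length = mean(Sepal.Length))

# The same logic wrapped in a function. In a package or project this would
# live in a file under R/, for example one opened with
# usethis::use_r("summarize_iris").
summarize_iris <- function(data = iris) {
  data %>%
    group_by(Species) %>%
    summarize(mean_sepal_length = mean(Sepal.Length))
}

result <- summarize_iris()
result
```

Both versions return a data frame with one row per species; the function simply gives that result a name you can call again later.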
Now, one of the things that's really great about having a package, or a project with a DESCRIPTION file in it, is tapping into the devtools ecosystem. The devtools package has a function called load_all(), which has a key binding in RStudio that makes it a little bit easier. And what it does is it looks in that R folder, and it loads the functions that are in your package or your project into your session, as if you had sourced an R script. And so now I have summarizeiris available to me, and I can run it in my console or in another script quite easily.
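To make the load_all() step concrete, here is a small self-contained sketch. It builds a throwaway package in a temporary directory, writes one function into its R folder, and loads it. The package name zenloadall and the function hello_zen() are hypothetical; usethis and devtools are assumed to be installed.

```r
library(usethis)
library(devtools)

# A throwaway package in a temp directory ("zenloadall" is hypothetical)
pkg_path <- file.path(tempdir(), "zenloadall")
create_package(pkg_path, open = FALSE)

# Write a small function into R/, the folder that load_all() reads from
writeLines(
  'hello_zen <- function() "already ready"',
  file.path(pkg_path, "R", "hello_zen.R")
)

# Load everything under R/ into the current session, much like sourcing it
load_all(pkg_path)

hello_zen()  # returns "already ready"
```

Inside a package you would normally just call load_all() with no arguments, since it defaults to the current directory.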
You already declare your dependencies
You may not realize it, but you also already declare your dependencies. For instance, if I have code like this, it often begins with a series of library calls. And really what I'm saying is, for this code to work, I need the dplyr library, and I need the ggplot2 library. Otherwise, this function is not going to run correctly.
In a package, we can declare a dependency by using the use_package() function. So here I'm calling use_package() with dplyr, usethis is adding it to the DESCRIPTION file, and then it's telling me that I have something to do. Whenever I use a function from this package, I need to preface it with dplyr::. I need to really specify which namespace this function is coming from. And then I'll also do the same for ggplot2, because of course my code also depends on this package. So now when I have my function, I need to include the names of these packages as well as the double colon. So this function ends up looking a little bit more like this, where I'm adding dplyr::, and then the name of the function that comes from dplyr, and the same for ggplot2.
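A namespaced function of the kind described here could look like the sketch below. The function plot_species_means() and its contents are hypothetical, since the talk's actual function is not in the transcript; dplyr and ggplot2 are assumed to be installed.

```r
# In the package project, the dependencies would be declared once with:
#   usethis::use_package("dplyr")
#   usethis::use_package("ggplot2")

# No library() calls: every external function is prefixed with its package.
# plot_species_means() is a hypothetical example function.
plot_species_means <- function(data = iris) {
  grouped <- dplyr::group_by(data, Species)
  means <- dplyr::summarize(grouped, mean_sepal_length = mean(Sepal.Length))

  ggplot2::ggplot(means, ggplot2::aes(x = Species, y = mean_sepal_length)) +
    ggplot2::geom_col()
}

p <- plot_species_means()
```

Writing the prefixes out is a little more typing, but it means the function no longer depends on which packages happen to be attached in the session.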
What also is changing is my DESCRIPTION file. What previously looked like this now has two extra lines. dplyr and ggplot2 are part of the Imports section. And this is where that magic happens when you install an R package. You'll notice that when you install an R package, you don't have to manually go and install every other package that that package uses. R is smart enough to look at this file and install those packages for you. And so that's exactly what happens when you install a package from CRAN or from GitHub or from another source. R will actually go and look for these packages and install them for you. You can also use devtools to install them manually using the install_deps() function, which is extremely useful.
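One way to watch the DESCRIPTION file change is sketched below, using a throwaway package in a temporary directory. The package name zendeps is hypothetical, and the example assumes usethis, dplyr, and ggplot2 are already installed, since use_package() checks that the package it is adding is available.

```r
library(usethis)

# A throwaway package in a temp directory ("zendeps" is hypothetical)
pkg_path <- file.path(tempdir(), "zendeps")
create_package(pkg_path, open = FALSE)
proj_set(pkg_path)

# Declare the two dependencies; each call adds a line under Imports
use_package("dplyr")
use_package("ggplot2")

# Read the Imports field back out of the DESCRIPTION file
imports <- read.dcf(file.path(pkg_path, "DESCRIPTION"), fields = "Imports")
imports

# To install whatever the DESCRIPTION file lists, you could then run:
#   devtools::install_deps(pkg_path)
```

The install_deps() call is left commented out because it reaches out to a package repository.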
You already test your code
Now you may not realize this part in particular, but you also already test your code. This might be one where you're thinking, wait a minute, I don't really even know what testing your code means. I'm not a software engineer. I don't write unit tests when I do my analysis. I don't test my code, but actually we test our code every day, often in a situation like this. I've got this function, cleanData, and I'm giving it the iris dataset. And something obviously has gone wrong here, right? It's giving me an error and I don't quite know what the problem is. So I actually need to sit around and think about it a little bit.
Usually at this point, I start fiddling around in my console to see if I can actually make it work, right? I will give it a different dataset or change a few things or look at my code and really try to understand why isn't this working? So I go back and forth. I really iterate through this process to try and get it working. And it turns out in this case, if for some reason I removed the fifth column, it actually works. My data is now clean. But that just makes me think more, why is that? Right? And so I'm going to again go through that iterative process of figuring out what's wrong with my code. How can I get it to work? What do I expect it to look like? And what does it look like right now?
What you're doing is a unit test of your code. You just aren't writing it down. And so the process of writing unit tests is to formalize this iterative process of kicking the tires of our code to make sure that it works the way that we expect. Now, unfortunately, often the only time we do this is when the code breaks. But writing unit tests helps you write code more robustly because it helps you kick the tires all around the car, not just the tire that has a leak.
The use_test() function will help you create a test in the right spot, set up all the infrastructure that you need from the testthat package, which is one of the most popular testing libraries in R, put everything in its right place, open up the test file for you, and set it up so that you can automate your tests. So if I go and I write the informal test that I just did in the console directly in a test script, I can now run the test() function from devtools, which also has a very convenient key binding in RStudio. It will load the package for me and will run all my tests. And now, if I make a change to my code, I can know that everything is still okay. I've got a green light still. More importantly, if I don't, I know what's wrong and where it comes from.
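Written down, an informal console check like that might become something like the test below. The clean_data() function here is a hypothetical stand-in that keeps only numeric columns; the talk's own function is not shown in the transcript. The example assumes testthat is installed.

```r
library(testthat)

# Hypothetical stand-in for the function under test: keep numeric columns only
clean_data <- function(data) {
  numeric_cols <- vapply(data, is.numeric, logical(1))
  data[, numeric_cols, drop = FALSE]
}

# The informal check from the console, written down as a unit test.
# In a package, this would live in a file under tests/testthat/, for example
# one created with usethis::use_test("clean_data").
test_that("clean_data() keeps only the numeric columns of iris", {
  cleaned <- clean_data(iris)
  expect_equal(ncol(cleaned), 4)
  expect_true(all(vapply(cleaned, is.numeric, logical(1))))
})
```

Once a test like this sits in the package, devtools::test() runs it along with every other test, so the check no longer depends on remembering to poke at the console.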
Three techniques to get started
There are a great many ways that you can use the R package system to extend what you already do in your analysis. But these are three really useful techniques that you can really get off the ground with. The first is using a DESCRIPTION file. This lets you provide metadata about your project or your package. It lets you tell us what your dependencies are. And it gives you access to this whole ecosystem, such as loading and testing, that is available to you when you're developing an R package.
The second is to write your code as functions. You're already using functions in your everyday work, so this step is to actually take that code and wrap it in your own function. And finally, write down the tests that you're actually already doing and then automate them. Take advantage of the DESCRIPTION file and this ecosystem for R packages to automate this process.
What would be the next step in coming home to R packages, taking advantage of this treasure that's already yours? We put together a workshop called My Organization's First R Package that's really focused on developing internal R packages, personal R packages, things for you, your team, your research group, things like that. So I recommend checking out this resource. And I also highly, highly recommend looking into the second edition of the R Packages book. In particular, the first chapter, which is called Whole Game, walks you through the whole process of creating an R package from A to Z. And it's great and will help you get off the ground right away.
So this is my invitation to you. Write an R package, whether it's for your own personal use that you'll never share with anybody else, perhaps for your team, perhaps changing a project into a package, or creating a package to help with a project, or maybe you've got a great idea that you want to develop out into a package that you're actually going to share with lots of people and maybe even submit to CRAN. This is my invitation to you, is to try this out, walk this path, and take advantage of this incredible resource that's already available to you with the skill set that you already have.
Trungpa Rinpoche, a Tibetan Buddhist teacher, had good news and bad news for us in our meditation practice. The first, the bad news, is that you're falling out of an airplane. You don't have anything to hang on to, you don't have a parachute, and things seem pretty bad. But the good news is, actually, there's no ground. There's no way that you can fail at this process. My name is Malcolm Barrett. You can find me on Twitter, GitHub, and my website. Thank you for coming, and good luck.
