Resources

Becoming an R Package Author (or How I Got Rich Responding to GitHub Issues) - posit::conf(2023)

Presented by Matt Herman

The transition from analyzing data in R to making packages in R can feel like a big step. Writing code to clean data or make visualizations seems categorically different from building robust "software" on which other people rely. In this talk, I'll show why that distinction is not necessarily true by discussing my personal experience from learning R in graduate school to reporting bugs on GitHub to becoming a co-author of the tidycensus package and a practicing data scientist. The positive and supportive R community on GitHub, Twitter, and elsewhere contributes to why anyone who writes R code can become a package author.

* I have not actually gotten rich but I did get freelance data work based on my package contributions!

Presented at Posit Conference, between Sept 19-20 2023. Learn more at posit.co/conference.

Talk Track: Package development. Session Code: TALK-1133

Transcript

This transcript was generated automatically and may contain errors.

My name is Matt Herman. I am a data scientist at the Council of State Governments Justice Center. We are a non-profit research and policy organization working on criminal justice issues. And I think for folks that are here today, not to see Jenny and Hadley's travel pictures, which were lovely, but maybe you saw the subtitle of my talk and you're really interested in this cool, new, get-rich-quick, multi-level marketing scheme.

So, I'm just going to jump straight to that part of the story here and introduce you first to Kyle Walker. Kyle is a geography professor, and he's also the developer of a number of great R and Python packages, including the tidycensus package. And in 2019, he sent me this direct message on Twitter. He says, hey, Matt, do you do any consulting? I received a tidycensus consulting request, but I'm booked solid. Let me know if you're interested.

So, I got this message. I got really, really excited, played it super cool, and waited three hours to respond. And I said, hey, Kyle, haven't done much consulting, but definitely interested. And Kyle said, sounds good. I've passed along your email address.

So, for I think a lot of people, getting a message like this or a conversation like this might be sort of normal, not that out of the ordinary. But for me, it was really surprising. I didn't feel like consulting material at that point.

Matt's background before R

When I was an undergraduate, I never took any technical classes, no computer science. I mostly hung out at my college radio station. After college, I ended up working in public radio, and then I made my way into the food business. I helped my friend open a butcher shop that was called Ends Meat. Get it? And then I eventually ended up in graduate school for sociology.

But I did have one thing going for me here, and that's that I really knew the tidycensus package and how to use it really, really well. For folks in the room who aren't familiar with tidycensus, it's a really cool R package that lets you get data from the U.S. Census Bureau directly into your R session via web API. And so, this makes it really easy to do all sorts of neat stuff, like make a map of household income by census tract in Chicago.

And to make this map, you really just need a few lines of tidycensus and ggplot2 code. So, that first code chunk is pulling that data and also the geometries from the census API into your R session, and then a couple lines of ggplot2 code to make this beautiful map.
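The slide itself is not captured in the transcript, so what follows is a minimal sketch of the kind of code being described, not the speaker's actual chunk. It assumes the tidycensus, ggplot2, sf, and scales packages are installed, that a Census API key has been configured (for example with `tidycensus::census_api_key()`), and that tracts in Cook County are a fair stand-in for "Chicago". The function name `chicago_income_map()` and the default year are made up for illustration.

```r
# Hypothetical helper; nothing is downloaded until the function is called.
chicago_income_map <- function(year = 2021) {
  # First chunk: pull estimates plus tract geometries from the Census API.
  income <- tidycensus::get_acs(
    geography = "tract",
    variables = "B19013_001",  # ACS median household income
    state = "IL",
    county = "Cook",           # assumption: Cook County approximates Chicago
    year = year,
    geometry = TRUE            # return an sf object with tract polygons
  )

  # Second chunk: a few lines of ggplot2 to draw the map.
  ggplot2::ggplot(income, ggplot2::aes(fill = estimate)) +
    ggplot2::geom_sf(color = NA) +
    ggplot2::scale_fill_viridis_c(labels = scales::dollar) +
    ggplot2::labs(fill = "Median household income")
}

# chicago_income_map()  # uncomment once an API key is configured
```

In an interactive script you would more likely call `library(tidycensus)` and `library(ggplot2)` at the top and drop the `::` prefixes; the wrapper is only there so the sketch can be sourced without touching the network.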

But my talk today really is not about how to use tidycensus, and it's not about Twitter. It's not about consulting. It's more about how I went from working in a butcher shop to getting consulting gigs to being able to speak to you today at this conference.

And what I'm going to argue today is that really anyone who knows how to write code for data analysis or data visualization in R can, and probably should, become a package developer too, or at least a package contributor. And I think you already have the tools to do this.

Starting small: wading into the water

This is my daughter, Tessie. She's two, and she's really into water and really wanted to learn how to swim this summer. So, we're good parents, so we didn't just pick her up and drop her off into the deepest body of water we could find and see if she floated. We didn't do that. We brought her to this beautiful lake where we live in Vermont and let her explore, you know, walk into the water, get wet, splash around, see how it feels.

And so, if you are ready for your R package development journey, I do not recommend going and finding a very high diving board and jumping in the deep end. Please don't start by submitting a package to CRAN. That's not going to be fun in the same way that I don't think we should drop Tessie into the deep end of a pool.

Contributing through GitHub issues

So, one good way I think that you can start is with GitHub issues. I think most folks in the room are probably familiar with GitHub. It's a place where a lot of R packages are stored. Issues are a really, really wonderful place to ask a question about an R package, or maybe you found a bug in a package you used, or maybe there's a feature you'd like to request, and you can just post that up on GitHub, and it's a really good place to start.

In fact, for me, that was my first contribution to the tidycensus package. I was in graduate school doing a lot of work and research using census data, and I discovered R, discovered tidycensus. I liked using that much better than the census website, and I was hooked right away.

So, this is actually my first contribution to tidycensus. This is an issue I opened July 2017. And I say, I just started exploring tidycensus today, and I love it. One question, is there a way to get one-year versus five-year estimates? Details here are not important. But then I go on to write, as much more of an R user than a programmer, I can't help with actually implementing this feature, but I certainly would use this. A few weeks go by. The aforementioned Kyle Walker says, this functionality is now supported.

So, this is amazing for a lot of reasons. First, we have a document of the day I started using tidycensus, which is apparently July 17th, 2017. Great day. But also, so I had this idea. I knew about census data. I wanted to get a different kind of census data that wasn't currently available in tidycensus. And then all of a sudden, a few weeks later, the package developer, Kyle, made some changes to the package, and now I could get this data. So, I was super stoked.
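For readers who want to see what that request looks like in practice: in current tidycensus releases, the one-year versus five-year choice is exposed through the `survey` argument of `get_acs()`. The sketch below assumes a configured Census API key; the wrapper name `state_income()` and the chosen variable, geography, and year are illustrative and not taken from the talk.

```r
# Hypothetical wrapper around get_acs(); the request only runs when invoked.
state_income <- function(survey = c("acs5", "acs1"), year = 2019) {
  survey <- match.arg(survey)  # "acs5" = five-year, "acs1" = one-year estimates
  tidycensus::get_acs(
    geography = "state",
    variables = "B19013_001",  # ACS median household income
    survey = survey,
    year = year
  )
}

# five_year <- state_income("acs5")
# one_year  <- state_income("acs1")
```

One-year estimates are only published for areas with populations of 65,000 or more, which is why this sketch asks for states rather than tracts.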

Also, any other user of tidycensus also could sort of take advantage of this, and now they can get this great data. So, for Kyle, as a package developer, he was getting direct user feedback. Maybe he hadn't thought that users might want to specify one-year versus five-year data. So, this got me really excited about R, about tidycensus, and about GitHub and package development.

And so, I continued on following the tidycensus GitHub repository, paying attention to the issues, learning more about the package and how things worked. And what I ended up doing is actually starting to respond to GitHub issues. So, this is from a few months later. GitHub user BigCK2, it's a very cool username, asked a question about getting five-year data from 2016. And I happened to know the answer to this question, and I write here, oh, yeah, you can do this, but you've got to install the dev version from GitHub. BigCK2 says, great, I downloaded the dev version, and it works great. Thank you.

So, this is really great for a lot of reasons, too. One, BigCK2 gets the answer to his question, now he can get the data. Second, if someone else has this exact same issue, hopefully they find this GitHub issue, and they know, oh, I just have to install the dev version. And third, Kyle, the main package developer and maintainer, didn't have to do anything here. He can spend his time writing R code, not responding to GitHub issues. But I, as a tidycensus user enthusiast, could help out BigCK2.

And I continued on this sort of path, learning more about the package, responding to more GitHub issues, and I sort of have direct evidence here of Kyle appreciating this. So, at some point in some other GitHub issue that we were talking about, Kyle says, @mfherman, that was me, you're the best. I'm putting you down as a package contributor on the next CRAN release for answering so many user questions. So, this was really exciting to me. Again, I didn't see myself, still don't really see myself, as a programmer or coder, but now I'm a co-author or contributor to this awesome package that I love to use.

Going deeper: bug fixes and small features

So, let's say you're ready to get a little bit deeper into the water. You're starting with the doggy paddle, you're maybe going up to your neck a little bit, and you want to actually start contributing some code to a package. I think bug fixes, simple bug fixes, are a really great place to start. Along this journey, maybe you also are reading the R Packages book. You're understanding a little bit more about devtools, and you kind of want to apply some of these things you're learning about to a package that you like to use. And this is exactly what I did.

So, this is, I don't know, looks like December 2018. I noticed a bug that was related to an update to tibble that was causing some downstream issues in tidycensus. Not a really big deal, but it was a bug and it needed to be taken care of. And so, I hunted around in the code. I remember it took a really long time to figure out where this bug was occurring. I didn't have that whole workflow, the beautiful load_all(), check(), test(), all that stuff. I didn't really understand it. It was a lot of manual hunting. But I tracked down this bug, and I thought that was really cool. I figured out how to make it work on my computer. And then I made this pull request on GitHub.
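The workflow mentioned here comes down, in today's devtools, to three function calls run against the package's source directory. A minimal sketch, assuming devtools is installed and that you have a local clone of the package you are fixing; the helper name `dev_cycle()` is hypothetical.

```r
# Hypothetical helper bundling the usual edit-and-verify loop for the
# package whose source lives at `pkg` (defaults to the working directory).
dev_cycle <- function(pkg = ".") {
  devtools::load_all(pkg)  # load the in-development code as if it were installed
  devtools::test(pkg)      # run the package's test suite
  devtools::check(pkg)     # run R CMD check on the package
}

# dev_cycle("path/to/your/clone")  # the path here is a placeholder
```

In day-to-day use you would usually call `load_all()` and `test()` many times while hunting a bug and save `check()` for just before opening the pull request.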

Also note, this pull request is tiny, like 11 lines of code, really, really small. But, again, this is another example. This was an issue that genuinely had to be fixed when tibble was going to be updated. And this is something that, again, Kyle, with limited time and resources to devote, didn't have to do. He looked at my code. It worked. And he was able to merge. So, this is a really great way to contribute and to get your feet wet a little bit more. Again, 11 lines of code.

I think a next step, if you want to keep going on this path, and this is what I did here, is to think about a small feature that you might want to add to the package. Nothing big. Not whole new functions or anything like that. For me, there was a sort of convenience feature, an additional argument that I wanted to add to one of the core tidycensus functions that I thought would make a user's life a little bit easier. And I had this idea. I figured out how to code it up. Seemed like it was working. And, again, pull request. It got merged.

Again, 101 lines of code added, not a really big addition to the package in the grand scheme of things. But this was, for me, a really exciting moment. Because what this meant was there was a new argument in the get_acs function. And so, not only could I use that because it was useful for me, but anyone else who installed this package from GitHub or CRAN now had access to that function. And so that meant, like, my code lived on tens or hundreds or thousands of people's computers. And, again, that to me was a really good feeling. And it made me feel more like this package developer, not just that R user that I started as.

Encoding domain knowledge into a package

A third way that I think you can really make a lot of impact on your package contributions is by taking your domain knowledge. The thing that you know better than anyone else. And figure out a way to encode that into an R package. Encode that into, somehow, R functions or documentation. I've always really loved this quote from Zed Shaw. He says, programming as a profession is only moderately interesting. You're much better off using code as your secret weapon in another profession. And this is exactly how I felt.

You know, I was coming at this studying sociology and other sorts of social research. I was never going to be a professional programmer, but I did know a lot about census data. I did know a lot about other kinds of social research. And so the idea here is that that was my secret weapon, right? I couldn't write the most amazing R code in the world, but was there some way that I could take what I knew about the census and what I knew about sociology and put that into the R package somehow.

And so this is exactly what Kyle and I worked on next. And this was actually a big new feature. It allowed users to get census micro data. So instead of summarized data where you just say a count of how many people live in a certain area, you could actually get individual lines of data that represented responses to the American Community Survey directly into your R session. So this is really neat.
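In current tidycensus this feature is the `get_pums()` function. Here is a hedged sketch of a call, assuming a Census API key is set; the wrapper name `vermont_pums()` and the choice of state, year, and variables are illustrative only and do not come from the talk.

```r
# Hypothetical wrapper; the microdata request only runs when it is called.
vermont_pums <- function(year = 2021) {
  tidycensus::get_pums(
    variables = c("PUMA", "AGEP", "SEX"),  # area code, age, sex
    state = "VT",
    survey = "acs5",
    year = year
  )
}

# vt <- vermont_pums()
# Each row of `vt` is an individual person record rather than a summary count,
# returned with survey weights alongside the requested variables.
```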

And along with this, I worked on a package vignette. A vignette is essentially documentation, an article about how to use your package. So this is the vignette called Working with Census microdata that you can find on the tidycensus website right now. And I do want to just point out the table of contents here. So if you look at some of these things, I'm talking about PUMAs, Public Use Microdata Areas, person-level data, housing-level data, how to calculate standard errors. None of this is about R code. None of this is about R programming. This is about the stuff that I know about, census data, and it's in an R package.

So this is, again, providing huge value in getting all that domain knowledge out of my head and into an R package through documentation so that users who might be interested in this also can benefit from it. Around this time, Kyle sent me another Twitter DM. He says, changed your role on package to co-author. And this didn't really mean much to anybody, but now my name is on the tidycensus website. And, again, this is another sort of point of validation for me, and it felt really good to be a co-author of this package.

Why every R user should consider contributing

So what I hope to get across here is that there are lots of different ways to contribute to packages. And I think this map of how I did it is, like, really powerful: starting with GitHub issues, just asking them, then maybe responding to GitHub issues that you see, maybe leveling up and fixing little bugs or adding small new features, and then eventually taking what you know outside of R package development and trying to insert it into the package framework. And there are a lot of good reasons why I think everyone who's an R user might want to do this.

And I think of them along sort of two dimensions here. The first is becoming a tool builder. So you're going to improve your R programming skills markedly. When you start thinking about writing functions, you're learning about abstracting your code and writing modular and generalizable code. You're going to learn all about unit tests and how important those are. You're also going to learn a lot about Git and GitHub. Maybe if you're a solo R user, you might use Git just for version control on your own. But once you're in the package development world, you're all about collaboration. And so you're going to learn really quickly about forks and clones and merge requests and merge conflicts and all that fun GitHub stuff. And I think ultimately what this does is make you more desirable to an employer because you can say, not only do I know how to use R, I'm also an R package developer. Also, I know how to use Git and GitHub really well.
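To make the "functions plus unit tests" point concrete, here is a small sketch of a domain-knowledge helper with a test next to it. The function `acs_cv()` is hypothetical and is not part of tidycensus; it relies on the fact that ACS margins of error are published at the 90 percent confidence level, so dividing by 1.645 gives a standard error. The test uses testthat and is guarded so the script still runs where testthat is not installed.

```r
# Hypothetical helper: coefficient of variation (in percent) from an ACS
# estimate and its published margin of error (90 percent confidence level).
acs_cv <- function(estimate, moe) {
  stopifnot(is.numeric(estimate), is.numeric(moe))
  se <- moe / 1.645        # convert the margin of error to a standard error
  100 * se / estimate
}

# In a package this would live under tests/testthat/.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("acs_cv converts a 90% MOE into a CV", {
    testthat::expect_equal(acs_cv(50000, 1645), 2)
    testthat::expect_error(acs_cv("50000", 1645))
  })
}
```

The function is deliberately tiny: the value is less in the arithmetic than in writing down, once and with a test, a rule that otherwise lives in someone's head.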

The other domain that I kind of think of this on is on community contributions. So you, through this, are going to be able to meet other R users online. I've never actually met Kyle in person. But as you can see, we've communicated a lot on Twitter, on GitHub, and other places. And I kind of think of him as a friend. You're also going to be able to grow and sustain the open source community. So as R users, we all have benefited so much from so many people who have volunteered their time to contribute to the open source community. And so me, now, as a tidycensus co-author, I'm able to give back in just a little bit, a little way, and show my appreciation for that community and help grow it and help sustain it. And then you might get really cool opportunities like being able to stand up on the stage and speak to you at posit::conf.

And sort of interestingly enough here, there is some documentation of this. I wrote this Twitter message to Kyle. I said, hey, I've been toying with the idea of submitting a talk proposal. I doubt it will get accepted, but just wanted to make sure it was cool with you. And Kyle says, yeah, of course, go for it. I do want to note here, I sent this to him in August of 2020, and I was talking about RStudioConf 2021. Life got in the way. There was a global pandemic. My daughter Tessie was born. Didn't end up submitting it then.

But what I really think that I learned from this whole journey is that it's never too early in your R programming career to start contributing to R packages. And there's lots of different ways to do it. Also, apparently it's never too late to apply to speak at posit::conf.

Q&A

Thanks, Matt. That was great. We do have a little bit of time for questions if you want to ask one or two. Anyone from the audience? There's none on Slido right now, but if you do have a question, I can come to you with the microphone.

I was curious to know, so you've contributed a lot to tidycensus now. Now that you've done all that, do you feel more comfortable contributing to other packages? And have you done any contributions to other packages?

That's a great question. Mostly, I started building a lot of internal packages at work. I do contribute to a few other packages like tigris and others in the spatial universe. But really, it gave me the confidence to just go build a bunch of packages at work to make my life easier. So, yeah, which I think everyone should do.

And you learned those skills through working on tidycensus, right?

100%. You know, it demystified it. And I think it gave you a place to start. And when I was doing this, this was before, like, usethis and devtools were as user-friendly as they are now. And so getting started was kind of intimidating. So starting through this process showed me kind of how all the pieces fit together. And then when I wanted to make an internal one, it was ready to go.

Great. Thank you so much, Matt.