Christopher T. Kenny - Templated Analyses within R Packages for Collaborative, Reproducible Research

Talk by Christopher T. Kenny

Post: https://alarm-redist.org/fifty-states/

GitHub Repo: https://github.com/alarm-redist/fifty-states

Oct 31, 2024

5 min

Transcript

This transcript was generated automatically and may contain errors.

I'm Chris Kenny. I'm a PhD candidate at Harvard University. And today I'm going to talk to you about how our research team, the ALARM Project, uses templates within an R package to structure our research so that it's collaborative and reproducible.

Primarily, I'm a political scientist, and what I work on is algorithmic redistricting. Now, if you're not familiar with the concept of redistricting, it's a simple process, but a very political and very contentious one: how you separate a state or region like a county into different districts, which controls the way that voters translate their voice into power. And the way that this can look, you know, we'll give some red and blue districts, Republican and Democratic districts. What our research team works on is how we generate large samples, thousands and hundreds of thousands of alternative plans, that we can use to evaluate a given plan.

Now, our goal in this is to use redistricting algorithms to evaluate each state's redistricting process. So we can look at this and say, this works or this doesn't, this is gerrymandering, a kind of negative thing that can happen here. The problem we often run into is that social science data is extremely messy. The type of data that we're using comes from thousands of different people across the United States who often are not actually trained to do the things that they're doing, right? You'll have people that just kind of end up in a job where they run their whole county's election system and have never considered that before.

So pretty much every place needs manual intervention. To fix that, we enlist labor, right? We enlist our very smart undergrads who help us work on this. Now, this introduces our new problem, which is that undergrads are pretty much by definition inexperienced, right? We're trying to push them into doing research and they're ambitious, but they need to learn R packages and Git and all sorts of other tools. So we want to teach them in a way that doesn't overwhelm them.

Structuring research as an R package

So the particular solution that our team has ended up with is to place the project within an R package. And I'll make that concrete. So senior members of the team, grad students and faculty, set up a simple package with devtools::create(), just like you would normally do. And then we set up all of the research tools as if they're part of that package. And that includes the project management, which we'll go into.
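As a rough illustration of what "a package plus a few research folders" could look like, here is a minimal sketch in base R. The package name and the folder names (analyses, data-raw, data-out, inst/templates) are assumptions made for this example, not a description of the team's actual layout. In practice devtools::create() would generate the standard package files; they are written by hand here only so the sketch has no dependencies.

```r
# Illustrative layout only: all names below are assumptions for this sketch.
pkg <- file.path(tempfile("demo_"), "researchpkg")
dir.create(file.path(pkg, "R"), recursive = TRUE)

# A DESCRIPTION file is what marks a directory as an R package.
writeLines(c(
  "Package: researchpkg",
  "Title: Project Management Tools for a Research Team",
  "Version: 0.0.1",
  "Description: Helper functions and templates for templated analyses."
), file.path(pkg, "DESCRIPTION"))

# Research folders sit alongside the usual package folders.
research_dirs <- c("analyses", "data-raw", "data-out")
for (d in research_dirs) dir.create(file.path(pkg, d))

# Templates that get copied into each new analysis.
dir.create(file.path(pkg, "inst", "templates"), recursive = TRUE)

# Non-standard top-level folders are excluded from the package build.
writeLines(paste0("^", research_dirs, "$"), file.path(pkg, ".Rbuildignore"))

list.files(pkg, all.files = TRUE, no.. = TRUE)
```

Listing the research folders in .Rbuildignore keeps the analysis files out of the built package, so the package tooling keeps working while the research lives in the same repository.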

This means that when your junior members, those undergraduates or new grad students, set up and want to start using this, all they have to do is run devtools::load_all() and they get everything into their environment. This allows us to set up things that require no new tools. And it's going to look very familiar. So if you see our good old friend RStudio's file explorer, the package stuff is the normal package stuff. And all we're doing really is adding a few more folders and a few more functions that let us do the research stuff. Stuff being the technical word, of course.

So what we're really doing, then, is using the R package to handle all of the project management part. And we create a series of functions. So, for example, if you want to create a new analysis, there's an analysis function. Or if you want to set up a pull request on GitHub, we automate that. The same goes for things like the Dataverse, which in the social sciences is a big place to share data that you've created. And because, again, these are inexperienced people that we're trying to teach how to do research, we want to be able to peer review their work internally. And so we set up a bunch of functions that handle that. These can be tested and documented so that the students don't have to learn anything new. They can just use the tools that they're already seeing in their classes.
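To make "tested and documented" concrete, here is a small sketch of what one such project-management helper might look like. The function analysis_slug() and its naming scheme are hypothetical, invented for this example. The roxygen-style comments are the standard way to document a package function, and inside a real package the checks at the bottom would typically live in tests/testthat rather than next to the function.

```r
#' Build the standard identifier for an analysis (hypothetical helper)
#'
#' @param state Two-letter state abbreviation, e.g. "DE".
#' @param year Census-cycle year, e.g. 2000.
#' @return A string such as "DE_2000".
analysis_slug <- function(state, year) {
  state <- toupper(state)
  # state.abb ships with base R and holds the 50 state abbreviations.
  if (!state %in% state.abb) stop("Unknown state abbreviation: ", state)
  if (year %% 10 != 0) stop("Expected a census-cycle year such as 2000 or 2020")
  paste0(state, "_", year)
}

# Simple checks of the kind a package test file would contain.
stopifnot(identical(analysis_slug("de", 2000), "DE_2000"))
bad <- tryCatch(analysis_slug("ZZ", 2000), error = function(e) "error")
stopifnot(identical(bad, "error"))
```

Because the helper validates its inputs, a typo in a state code fails immediately with a readable message instead of producing a misnamed folder that someone has to clean up later.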

Templates and validation in practice

You know, what this looks like is if you go into RStudio, you could run something like devtools::load_all(). And then here I'm showing you just running an analysis for the state of Delaware, DE, for the 2000 cycle. And when you hit run, it creates all sorts of files, which is where all of these templates come in. We can standardize, then, what the inputs are and what the outputs are, so that little bit of manual intervention doesn't go crazy.
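The step described here, one call that creates a set of standardized files from templates, can be sketched roughly as follows. This is not the team's actual function: init_analysis(), the {{state}} and {{year}} placeholders, and the file names are all assumptions made for illustration.

```r
# Hypothetical helper: copy every template into a new analysis folder,
# filling in {{state}} and {{year}} placeholders along the way.
init_analysis <- function(state, year, template_dir, out_root) {
  stopifnot(nchar(state) == 2, is.numeric(year))
  slug <- paste0(toupper(state), "_", year)
  out_dir <- file.path(out_root, slug)
  if (dir.exists(out_dir)) stop("Analysis already exists: ", slug)
  dir.create(out_dir, recursive = TRUE)

  for (tpl in list.files(template_dir, full.names = TRUE)) {
    txt <- readLines(tpl)
    txt <- gsub("{{state}}", toupper(state), txt, fixed = TRUE)
    txt <- gsub("{{year}}", as.character(year), txt, fixed = TRUE)
    # "01_prep.R" becomes "01_prep_DE_2000.R"
    new_name <- sub("\\.R$", paste0("_", slug, ".R"), basename(tpl))
    writeLines(txt, file.path(out_dir, new_name))
  }
  invisible(out_dir)
}

# Demo with two made-up templates in a temporary directory.
root <- tempfile("demo_")
template_dir <- file.path(root, "templates")
dir.create(template_dir, recursive = TRUE)
writeLines("# Prepare data for {{state}}, {{year}} cycle",
           file.path(template_dir, "01_prep.R"))
writeLines("# Simulate plans for {{state}}, {{year}} cycle",
           file.path(template_dir, "02_sim.R"))

out_dir <- init_analysis("DE", 2000, template_dir, file.path(root, "analyses"))
list.files(out_dir)
```

Every analysis created this way starts from the same numbered files with predictable names, which is what makes the inputs and outputs comparable across states and reviewers.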

And, of course, we want to be able to see this and share this. So the other part of these workflows is to actually validate this. Our team uses GitHub, so, for example, for the 2020 Washington analysis, what we create and set up is an automated set of diagnostics, and all they have to do is copy and paste it into GitHub once they're happy with it. And this is one of those things that to us works well, but it's not perfect.
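One way such a copy-and-paste diagnostic summary could be produced is to format the results as a Markdown table, which GitHub renders in pull request descriptions. The sketch below assumes a hypothetical diagnostics_markdown() function, and the diagnostic names and values are placeholders, not results from any real analysis.

```r
# Hypothetical helper: turn a named list of diagnostics into a Markdown
# table that can be pasted into a GitHub pull request.
diagnostics_markdown <- function(state, year, diagnostics) {
  header <- c(
    paste0("## ", state, " ", year, " validation"),
    "",
    "| Diagnostic | Value |",
    "|---|---|"
  )
  values <- vapply(diagnostics, format, character(1))
  rows <- paste0("| ", names(diagnostics), " | ", values, " |")
  paste(c(header, rows), collapse = "\n")
}

# Placeholder values for illustration only.
md <- diagnostics_markdown("WA", 2020, list(
  "Plans sampled" = 5000L,
  "Max population deviation" = 0.005
))
cat(md, "\n")
```

Generating the summary from code means every pull request reports the same diagnostics in the same order, so a reviewer can scan for a missing or unusual value quickly.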

It's very flexible and helps the students get in at a very basic level. We develop too many R packages as a team. And so it allows us to keep those types of things updated naturally without using things like renv. It pairs very well with Git. The cons, of course: a determined user can screw it up as they please, and it doesn't handle everything. So with that, if you want to see it in action and see how we use it, you can see our GitHub or send me a message on my website. There's a package coming soon.