
Amelia McNamara | Working with categorical data in R without losing your mind | RStudio (2019)
Categorical data, called “factor” data in R, presents unique challenges in data wrangling. R users often look down on tools like Excel for automatically coercing variables to incorrect datatypes, but factor data in R can produce very similar issues. The stringsAsFactors=HELLNO movement and standard tidyverse defaults have moved us away from the use of factors, but they are sometimes still necessary for analysis. This talk will outline common problems arising from categorical variable transformations in R, and show strategies to avoid them, using both base R and the tidyverse (particularly, dplyr and forcats functions).

View Materials
http://www.amelia.mn/WranglingCats.pdf (related paper from the DSS collection)
http://bitly.com/WranglingCats
https://peerj.com/collections/50-practicaldatascistats/

About the Author
Amelia McNamara
My work is focused on creating better tools for novices to use for data analysis. I have a theory about what the future of statistical programming should look like, and am working on next steps toward those tools. For more on that, see my dissertation. My research interests include statistics education, statistical computing, data visualization, and spatial statistics. At the moment, I am very interested in the effects of parameter choices on data analysis, particularly data visualizations. My collaborator Aran Lunzer and I have produced an interactive essay on histograms, and an initial foray into the effects of spatial aggregation. I talked more about spatial aggregation in my 2017 OpenVisConf talk, How Spatial Polygons Shape Our World.
Transcript
This transcript was generated automatically and may contain errors.
Thanks for that introduction. I'm Amelia McNamara. I teach statistics at the University of St. Thomas in the Department of Computer Science. I tweet at AmeliaMN, which is a double entendre for McNamara and Minnesota. And I'm going to be talking about working with categorical data in R without losing your mind.
So this talk came from a paper that I wrote for the Practical Data Science for Stats PeerJ collection, which was organized by Jenny Bryan and Hadley Wickham. And if you haven't checked out this collection of papers yet, you really should, because it's extremely useful. There are all these wonderful papers like "Data Organization in Spreadsheets," "Opinionated Analysis Development," and "Excuse Me, Do You Have a Moment to Talk About Version Control?" And then the paper that I worked on was called "Wrangling Categorical Data in R."
So all of these papers are fully reproducible. They're available online on GitHub, they're up as open-access preprints on PeerJ, and we did a special issue of the American Statistician and put them in there as well.
Motivation: teaching with real data
And my paper was inspired by my work as a professor. So I teach a lot of introductory statistics classes, and in those classes I ask students to do projects where they go out in the world, they find real data, and then they analyze it, usually using some modeling, usually multiple regression. And this is a really useful exercise for students because they get the experience of working with data that's real data that they hopefully care about. But it's very challenging for them because they don't always have the skills to work with it. And it's also very challenging for me because when they don't have the skills to work with it or things are very complicated, then I step in and I'm there helping them do their data wrangling.
So I had a group one year that wanted to use data from the General Social Survey, which is housed at the University of Chicago. It's a survey that's been going on since 1972, asking Americans questions about their lives. So this is a data set that sociologists love to use. And it's very interesting, but it's filled with categorical variables.
If you've been working in R or working with data, you probably know about categorical data. So that's as opposed to continuous data. It's something that has kind of discrete levels. So you might think about gender as a categorical variable, male, female, non-binary. You could think about whether you like your coffee hot or cold. That could be a categorical variable. You could think about the colors of Skittles on my front page. There's many types of data that end up being represented as categorical.
When you have survey responses, those are often like I agree, I disagree. Race is often a categorical variable because we don't have some continuous spectrum there. So lots of data has categorical variables.
Factors in R and their quirks
And in R, you probably know that we represent categorical data as factors. And factors consist of a set of values and then an ordered set of valid levels.
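As a minimal base R illustration of that structure (toy values, not the GSS data), a factor stores integer codes alongside an ordered set of valid levels:

```r
# A factor is a set of values plus an ordered set of valid levels
x <- factor(c("low", "high", "low", "medium"),
            levels = c("low", "medium", "high"))
levels(x)      # "low" "medium" "high" (the valid levels, in order)
as.integer(x)  # 1 3 1 2 (the underlying integer codes)
```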
And when I started thinking about the paper that I was going to write, I was thinking about all these problems that come up when you're doing data wrangling in real data problems from this experience of working with my students. And so I said, you know, I could write a paper on a couple different subjects, but here's one, you know, I could talk about factors and how to recode them.
So we started off this paper by showing some of the ways in which factors behave unexpectedly. So one of the ways is that if you have some numeric data and you turn it into a factor, it sort of looks the same because it has those same values, but then it has the levels associated with it. And if you convert back to a numeric, then you don't get the numbers that you expect. Okay? So instead of getting back 20, 20, 10, 40, 10, I get back 2, 2, 1, 3, 1. And this is the sort of thing that can really mess you up when you're doing analysis with factors. Because you think that you're going to get back the value, but you get back sort of the numeric level instead.
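Here is a small sketch of that exact pitfall, using the same values from the talk:

```r
x <- factor(c(20, 20, 10, 40, 10))
as.numeric(x)                # 2 2 1 3 1: the level codes, not the values
as.numeric(as.character(x))  # 20 20 10 40 10: the usual workaround
```

Going through `as.character()` first recovers the original numbers.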
There's other things like this where, you know, if you create a factor variable and give levels that don't match, you're going to get an NA. Again, you don't get a warning or a message or an error. If I had saved this into an object, it just would have seemed like it worked, but I've lost that level of A. So there are things like this that you learn to kind of work around when you're working with factor data.
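For instance, a toy reconstruction of that silent-NA behavior:

```r
# "a" is not among the declared levels, so it silently becomes NA
x <- factor(c("a", "b", "c"), levels = c("b", "c", "d"))
x  # <NA> b c; no warning, no message, no error
```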
The stringsAsFactors problem
And I think these frustrating issues were much more common back in the days of yore when we were using read.csv to read in our data files. Because read.csv has this argument, stringsAsFactors, and that was by default set to TRUE. So any time you had any character or string variable, it was getting read in as a factor. And so people were running into these frustrating issues a lot.
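A small illustration of that coercion, passing the argument explicitly (since R 4.0 the default is FALSE, so the old behavior has to be requested):

```r
# With stringsAsFactors = TRUE, every character column becomes a factor
dat <- read.csv(text = "name,score\nalice,1\nbob,2",
                stringsAsFactors = TRUE)
class(dat$name)  # "factor"
```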
Roger Peng has a great blog post, "stringsAsFactors: An unauthorized biography," where he talks about some of the historical reasons why this default was set to TRUE. And I think the most salient one is that it used to save space in memory. So instead of having to save the strings "high," "medium," "low" many, many times, it would save the numerals 1, 2, 3 many times and just keep those labels once. It turns out that even base R is no longer doing that, so it's not a memory or storage issue anymore.
But people were getting frustrated with stringsAsFactors not defaulting to FALSE. This led to the stringsAsFactors = HELLNO movement. At JSM one year, I think Jenny Bryan made these ribbons that you could put on your name tag to represent how you felt about that issue.
And so when we moved into the tidyverse world, where people are using read_csv, that got baked into the philosophy: you shouldn't be forcing some data format on your string or character vectors, because that might not be what you want, and you could run into these icky issues where things can get disconnected in a strange way. So now if you use read_csv, you're not going to encounter factors that much anymore.
Why we still need factors
So the problem is you sometimes still need factors. We can't get rid of them altogether. In particular, if you're doing modeling, even just using a simple linear model, but you're including a categorical variable, a factor variable in your analysis, those need to be formatted as factors so that you can choose which level is your reference level. And you can't do that with character strings because R will just pick the first level alphabetically.
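A sketch of why the reference level matters, with made-up data (the variable names here are invented for illustration):

```r
# With a character predictor, R would pick the reference level
# alphabetically; a factor lets you choose it with relevel()
set.seed(1)
d <- data.frame(group = c("treatment", "control", "treatment", "control"),
                y = rnorm(4))
d$group <- relevel(factor(d$group), ref = "treatment")
coef(lm(y ~ group, data = d))  # intercept is now the treatment-group mean
```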
You also need to work with factors if you want to reorder elements that are maybe in your ggplot. So if you have a number of items on your plot and you want them in some other order, you need those to be factors so that you can reorder them properly.
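One way to do that reordering is base R's reorder(), which sorts factor levels by a summary of another variable so plot elements appear in a meaningful order rather than alphabetically (forcats offers fct_reorder for the same job); the data here is invented:

```r
d <- data.frame(city = c("b", "b", "a", "a", "c", "c"),
                temp = c(5, 7, 20, 22, 10, 12))
# Reorder the levels of city by the median of temp within each city
d$city <- reorder(d$city, d$temp, FUN = median)
levels(d$city)  # "b" "c" "a" (ordered by median temperature)
```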
So we still need to use factors, and up until reasonably recently, this meant doing a bunch of ugly base R code. So this is an example of some ugly base R code that I wrote for another fully reproducible journal article in 2013. So if you want to look at some vintage R code, I recommend it. This paper will maybe come out this year, so it's now six years old and it's really getting pretty classic. So if you remember working with factors, this is the sort of thing that you would be doing, right? You'd be overriding the levels or you'd be reordering the levels by using square brackets and the C function to reorder them.
Bad approaches and fragile base R code
So R users love to talk about how Excel forces data to be dates, for example. When you open a spreadsheet, then anything that looks vaguely like a date becomes a date. But R has similar issues where factors can ruin your data in sort of a horrifying way.
So in the paper that Nick Horton and I wrote, we were focusing on the GSS data, and so what we showed were some of the bad approaches that you can use in base R. So if you have a variable about income and you want to order this so that it's in an appropriate order, right? We want far above average to come first, and then above average, average, et cetera. One thing that you can do is overwrite the levels. So that's what I was doing just a minute ago, and you can watch and see how this is wrecking my data. So watch: average has about 1,000 observations, and now average is 666, okay? So because I'm overwriting the levels outside of the factor command, I've now broken the relationship between my levels and labels, and all of my data is now invalid. And again, it's not going to give you an error or a warning.
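A toy reconstruction of that bad approach (the values and counts here are invented, not the actual GSS data): assigning a reordered vector to levels() relabels by position, so the counts end up attached to the wrong labels.

```r
x <- factor(c("average", "above average", "far above average", "average"))
summary(x)  # above average: 1, average: 2, far above average: 1
# BAD: overwriting levels() with a reordered vector relabels by position
levels(x) <- c("far above average", "above average", "average")
summary(x)  # counts are now attached to the wrong labels, silently
```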
Another bad approach in base R, again, you could watch the average category, is you use the levels there and do it numerically in some way. So again, this is looking similar to my code that I was showing, except I think mine happened to continue to work. And so Nick and I, you know, said you shouldn't be using this kind of base R code. If you're going to use base R, you should be using a more robust approach. So you should be doing something where you're going to use the factor command, you're going to put levels within it, and now things are working much better.
If you watch, for example, above average 483, that count is going to stay appropriate, right? But something has happened to average. Average actually did not come through. Even though I'm using a more robust base R approach, I'm still losing my average values because I made a typo. I put a trailing space after the word average when I was recoding my levels.
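That typo can be sketched like this (toy data): the more robust approach re-specifies levels inside factor(), but a trailing space still silently produces NAs.

```r
x <- factor(c("average", "above average", "average"))
# More robust: declare the levels inside factor(). But note the
# trailing-space typo in "average ": it doesn't match "average",
# so those values silently become NA
x2 <- factor(x, levels = c("above average", "average "))
summary(x2)  # above average: 1, "average ": 0, NA's: 2
```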
So we called this a fragile approach, sort of a reference to the idea of software brittleness, where if one little thing changes, then your code can break. And so we're thinking about if you had a collaborator who sent you a new spreadsheet and they happened to change the phrasing of one of the levels, or if you are a little bit fat fingering and you hit a space, then you're going to, again, mess up your data. But this approach at least is not losing the counts for everything. It's just that one place where I made the typo.
The arrival of forcats
So we're writing this paper. We're working on it. It's May, June, July of 2016. We're saying here are some bad approaches and here are the more robust or better base R approaches. And then if you've been following the tidyverse, you probably know the thing that happens next, which is that Hadley writes forcats.
So I've been talking to Hadley Wickham about the problems with factor variables for a while. And my suspicion is that when he saw me working on this paper where I wrote out some of these very specific issues and the ways that they probably needed to be solved, that that was just what he needed to say now I'm actually going to implement this package. And you'll see the inspiration there because the dataset that comes with forcats is also GSS. So my students' choice of final project data I think is having far-reaching implications.
So he wrote this package, and it's fantastic because it solves a huge number of the issues that were happening when you were working with factors in base R. So if you haven't used it before, you must use it. It has functions like fct_recode, fct_relevel, fct_reorder, fct_collapse, fct_lump, and fct_other. So it can do a lot of these tasks that you would often want to do with categorical data, without wrecking it, in a much more human-understandable way.
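For a flavor of those functions (toy data; assumes the forcats package is installed):

```r
library(forcats)

x <- factor(c("a", "b", "b", "c", "d"))
fct_recode(x, apple = "a")            # rename a level, matched by name
fct_collapse(x, early = c("a", "b"))  # merge several levels into one
fct_lump(x, n = 1)                    # keep the most common level, lump
                                      # everything else into "Other"
```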
So with those base R approaches, a lot of it is putting the onus on the programmer. You as the person who's doing the data analysis have to remember what the levels are. You have to make sure that you're not getting a space. You have to kind of do this process where you go back and forth between the data and your code, and you have to be very careful that you're doing the right thing. So Bret Victor talks about how much cognitive load it takes to program when you're changing things and holding the information in your head. So forcats is helping fix a lot of that.
And this actually gave us a great opportunity for this paper that we'd written. We did have to go back and rewrite a huge amount of it. So thanks, Hadley. But I'm very interested in the variety of R syntaxes that exist. So I have this cheat sheet, which is a contributed cheat sheet on RStudio. So scroll down to the bottom. And it compares doing the same task in what I consider to be the three main syntaxes in R. The dollar sign syntax, which is kind of standard base R, the formula syntax, and the tidyverse syntax. When I teach, I try and stay away from that base R syntax. I was team formula syntax for a long time. So that's the mosaic package and lattice graphics. And now I've been pretty convinced to move over to the tidyverse syntax, because I think it removes more of that stuff you have to hold in your head as a programmer.
So what this allowed us to do, the development of forcats with our paper, is compare several different methods for the same problem. So we often show like a compact but fragile way to do things with base R. We show a more robust but verbose, like many lines and very repetitive coding solution in base R. And then we show the direct and robust solution from the tidyverse.
So with that data about your opinion of income, again, you could watch average or below average. You're going to relevel that factor. Now everything's in the right order and the counts are right. And forcats has lots of good defaults. So in some functions, if you don't provide all the factor levels, it just sticks the rest of them at the end, or it'll give you good errors and warnings. So it's fixed a lot of this stuff.
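A sketch of those forgiving defaults (illustrative values, and it assumes forcats is installed): fct_relevel moves only the levels you name to the front, keeps the rest in their existing order, and warns, rather than failing silently, if you name a level that doesn't exist.

```r
library(forcats)

opinion <- factor(c("average", "far above average", "above average", "average"))
# Only the named levels move to the front; the rest keep their order
fct_relevel(opinion, "far above average", "above average")
# Naming a nonexistent level produces a warning, not a silent NA
```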
Another example: if you did this in base R and you wanted to collapse the marital status, you could do something like this. I think this is actually working, but it really counts on you as the analyst being able to count the indices: divorced comes first, and then married, and then never married, no answer. And if the data changes in some way, like if your collaborator decides to move something around, it's going to break. But the tidyverse solution using fct_recode is much neater, and it will actually work in a larger variety of situations.
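One way that collapse might look in forcats, matching levels by name instead of by index (fct_collapse here; fct_recode works similarly by mapping new = "old"; the marital levels are illustrative and assume forcats is installed):

```r
library(forcats)

marital <- factor(c("Divorced", "Never married", "Married", "No answer"))
# Levels are matched by name, so reordering the data cannot break this
fct_collapse(marital,
             married     = "Married",
             not_married = c("Divorced", "Never married", "No answer"))
```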
Defensive coding and takeaways
I think there's also something to be said about defensive coding here. So the way that you make sure that your examples are working is you have to, like, look at your data over and over again. So this is from an issue on that same paper. This was someone's example that had to do with categorical data, factor data, splitting the data into testing and training data. You can get different factor levels in the two datasets, and then, again, when you do the modeling and the predictions, it won't throw an error, but everything is going to be wrong. So there's lots of ways that working with factors can go wrong.
And so one of your methods of defense is just summary, summary, summary. So you saw every one of those examples, I ran the summary first, then I did something with my factors, and then I ran it again. And you could use count or something from the tidyverse if you wanted to. This is just a little more compact to show on the screen.
The other thing that we talked about is maybe you could get some testing worked into your paper, right? So you take the number of levels and assert that it should be equal to three, or test that the levels should be female and male, just so that you can keep track of your analysis. expect_equivalent doesn't work if the levels are the right levels but out of order, so that's on my wish list for future versions of the package.
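One base R way to write those assertions, sketched on toy data; setequal() gives an order-insensitive comparison of the level sets:

```r
x <- factor(c("female", "male", "female"))
# Assert facts about the factor before modeling; stopifnot() halts
# loudly if any assertion is violated
stopifnot(nlevels(x) == 2)
stopifnot(setequal(levels(x), c("male", "female")))  # order-insensitive
```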
And so my takeaways from this talk are essentially that you should use forcats. That's going to solve the vast majority of your problems. But like I said with the testing and training dataset, it doesn't always solve every single problem, so you really need to be practicing defensive coding. And that probably means that summary is your friend, and maybe you want to be asserting that and testing that things in your analysis. And if you want to see more horror stories and have a good way to explain to people why the tidyverse is better, not just for people and their thinking, but actually for making your data work, that's the paper, and thank you very much.
