
Amelia McNamara | Working with categorical data in R without losing your mind | RStudio (2019)
Categorical data, called “factor” data in R, presents unique challenges in data wrangling. R users often look down on tools like Excel for automatically coercing variables to incorrect datatypes, but factor data in R can produce very similar issues. The stringsAsFactors=HELLNO movement and standard tidyverse defaults have moved us away from the use of factors, but they are sometimes still necessary for analysis. This talk will outline common problems arising from categorical variable transformations in R, and show strategies to avoid them, using both base R and the tidyverse (particularly, dplyr and forcats functions).

View Materials
http://www.amelia.mn/WranglingCats.pdf (related paper from the DSS collection)
http://bitly.com/WranglingCats
https://peerj.com/collections/50-practicaldatascistats/

About the Author
Amelia McNamara
My work is focused on creating better tools for novices to use for data analysis. I have a theory about what the future of statistical programming should look like, and am working on next steps toward those tools. For more on that, see my dissertation. My research interests include statistics education, statistical computing, data visualization, and spatial statistics. At the moment, I am very interested in the effects of parameter choices on data analysis, particularly data visualizations. My collaborator Aran Lunzer and I have produced an interactive essay on histograms, and an initial foray into the effects of spatial aggregation. I talked more about spatial aggregation in my 2017 OpenVisConf talk, How Spatial Polygons Shape Our World.
Transcript
This transcript was generated automatically and may contain errors.
Thanks for that introduction. I'm Amelia McNamara. I teach statistics at the University of St. Thomas in the Department of Computer Science. I tweet at AmeliaMN, which is a double entendre for McNamara and Minnesota. And I'm going to be talking about working with categorical data in R without losing your mind.
So this talk came from a paper that I wrote for the Practical Data Science for Stats PeerJ collection, which was organized by Jenny Bryan and Hadley Wickham. And if you haven't checked out this collection of papers yet, you really should, because it's extremely useful. There are all these wonderful papers like "Data Organization in Spreadsheets," "Opinionated Analysis Development," and "Excuse Me, Do You Have a Moment to Talk About Version Control?" And then the paper that I worked on was called "Wrangling Categorical Data in R."
So all of these papers are fully reproducible. They're available online on GitHub, they're up as open-access preprints on PeerJ, and we did a special issue of the American Statistician and put them in there as well.
Motivation: teaching with real data
And my paper was inspired by my work as a professor. So I teach a lot of introductory statistics classes, and in those classes I ask students to do projects where they go out in the world, they find real data, and then they analyze it, usually using some modeling, usually multiple regression. And this is a really useful exercise for students because they get the experience of working with data that's real data that they hopefully care about. But it's very challenging for them because they don't always have the skills to work with it. And it's also very challenging for me because when they don't have the skills to work with it or things are very complicated, then I step in and I'm there helping them do their data wrangling.
So I had a group one year that wanted to use data from the General Social Survey, which is housed at the University of Chicago. It's a survey that's been going on since 1972, asking Americans questions about their lives. So this is a data set that sociologists love to use. And it's very interesting, but it's filled with categorical variables.
If you've been working in R or working with data, you probably know about categorical data. So that's as opposed to continuous data. It's something that has kind of discrete levels. So you might think about gender as a categorical variable, male, female, non-binary. You could think about whether you like your coffee hot or cold. That could be a categorical variable. You could think about the colors of Skittles on my front page. There's many types of data that end up being represented as categorical.
When you have survey responses, those are often like I agree, I disagree. Race is often a categorical variable because we don't have some continuous spectrum there. So lots of data has categorical variables.
Factors in R and their quirks
And in R, you probably know that we represent categorical data as factors. And factors consist of a set of values and then an ordered set of valid levels.
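As a minimal base R illustration of that structure (toy values, not the GSS data), a factor stores integer codes alongside an ordered set of valid levels:

```r
# A factor is a set of values plus an ordered set of valid levels
x <- factor(c("low", "high", "low", "medium"),
            levels = c("low", "medium", "high"))
levels(x)      # "low" "medium" "high" (the valid levels, in order)
as.integer(x)  # 1 3 1 2 (the underlying integer codes)
```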
And when I started thinking about the paper that I was going to write, I was thinking about all these problems that come up when you're doing data wrangling in real data problems from this experience of working with my students. And so I said, you know, I could write a paper on a couple different subjects, but here's one, you know, I could talk about factors and how to recode them.
So we started off this paper by showing some of the ways in which factors behave unexpectedly. So one of the ways is that if you have some numeric data and you turn it into a factor, it sort of looks the same because it has those same values, but then it has the levels associated with it. And if you convert back to a numeric, then you don't get the numbers that you expect. Okay? So instead of getting back 20, 20, 10, 40, 10, I get back 2, 2, 1, 3, 1. And this is the sort of thing that can really mess you up when you're doing analysis with factors. Because you think that you're going to get back the value, but you get back sort of the numeric level instead.
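Here is a small sketch of that exact pitfall, using the same values from the talk:

```r
x <- factor(c(20, 20, 10, 40, 10))
as.numeric(x)                # 2 2 1 3 1: the level codes, not the values
as.numeric(as.character(x))  # 20 20 10 40 10: the usual workaround
```

Going through `as.character()` first recovers the original numbers.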
There's other things like this where, you know, if you create a factor variable and give levels that don't match, you're going to get an NA. Again, you don't get a warning or a message or an error. If I had saved this into an object, it just would have seemed like it worked, but I've lost that level of A. So there are things like this that you learn to kind of work around when you're working with factor data.
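For instance, a toy reconstruction of that silent-NA behavior:

```r
# "a" is not among the declared levels, so it silently becomes NA
x <- factor(c("a", "b", "c"), levels = c("b", "c", "d"))
x  # <NA> b c; no warning, no message, no error
```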
The stringsAsFactors problem
And I think these frustrating issues were much more common back in the days of yore when we were using read.csv to read in our data files. Because read.csv has this argument, stringsAsFactors, and that was by default set to TRUE. So any time you had any character or string variable, it was getting read in as a factor. And so people were running into these frustrating issues a lot.
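A small illustration of that coercion, passing the argument explicitly (since R 4.0 the default is FALSE, so the old behavior has to be requested):

```r
# With stringsAsFactors = TRUE, every character column becomes a factor
dat <- read.csv(text = "name,score\nalice,1\nbob,2",
                stringsAsFactors = TRUE)
class(dat$name)  # "factor"
```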
Roger Peng has a great blog post, "stringsAsFactors: An unauthorized biography," where he talks about some of the historical reasons why this default was set to TRUE. And I think the most salient one is that it used to save space in memory. So instead of having to save the strings "high," "medium," "low" many, many times, it would save the numerals 1, 2, 3 many times and just keep those labels once. It turns out that even base R is no longer doing that, so it's not a memory or storage issue anymore.
But people were getting frustrated with stringsAsFactors not defaulting to FALSE. This led to the stringsAsFactors = HELLNO movement. At JSM one year, I think Jenny Bryan made these ribbons that you could put on your name tag to represent how you felt about that issue.
And so when we moved into the tidyverse world, where people are using read_csv, that got baked into the philosophy: you shouldn't be forcing some data format on your string or character vectors, because that might not be what you want, and you could run into these icky issues where things can get disconnected in a strange way. So now if you use read_csv, you're not going to encounter factors that much anymore.
Why we still need factors
So the problem is you sometimes still need factors. We can't get rid of them altogether. In particular, if you're doing modeling, even just using a simple linear model, but you're including a categorical variable, a factor variable in your analysis, those need to be formatted as factors so that you can choose which level is your reference level. And you can't do that with character strings because R will just pick the first level alphabetically.
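A sketch of why the reference level matters, with made-up data (the variable names here are invented for illustration):

```r
# With a character predictor, R would pick the reference level
# alphabetically; a factor lets you choose it with relevel()
set.seed(1)
d <- data.frame(group = c("treatment", "control", "treatment", "control"),
                y = rnorm(4))
d$group <- relevel(factor(d$group), ref = "treatment")
coef(lm(y ~ group, data = d))  # intercept is now the treatment-group mean
```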
You also need to work with factors if you want to reorder elements that are maybe in your ggplot. So if you have a number of items on your plot and you want them in some other order, you need those to be factors so that you can reorder them properly.
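One way to do that reordering is base R's reorder(), which sorts factor levels by a summary of another variable so plot elements appear in a meaningful order rather than alphabetically (forcats offers fct_reorder for the same job); the data here is invented:

```r
d <- data.frame(city = c("b", "b", "a", "a", "c", "c"),
                temp = c(5, 7, 20, 22, 10, 12))
# Reorder the levels of city by the median of temp within each city
d$city <- reorder(d$city, d$temp, FUN = median)
levels(d$city)  # "b" "c" "a" (ordered by median temperature)
```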
So we still need to use factors, and up until reasonably recently, this meant doing a bunch of ugly base R code. So this is an example of some ugly base R code that I wrote for another fully reproducible journal article in 2013. So if you want to look at some vintage R code, I recommend it. This paper will maybe come out this year, so it's now six years old and it's really getting pretty classic. So if you remember working with factors, this is the sort of thing that you would be doing, right? You'd be overriding the levels or you'd be reordering the levels by using square brackets and the C function to reorder them.
Bad approaches and fragile base R code
So R users love to talk about how Excel forces data to be dates, for example. When you open a spreadsheet, then anything that looks vaguely like a date becomes a date. But R has similar issues where factors can ruin your data in sort of a horrifying way.
So in the paper that Nick Horton and I wrote, we were focusing on the GSS data, and so what we showed were some of the bad approaches that you can use in base R. So if you have a variable about income and you want to order this so that it's in an appropriate order, right? We want far above average to come first, and then above average, average, et cetera. One thing that you can do is overwrite the levels. So that's what I was doing just a minute ago, and you can watch and see how this is wrecking my data. So watch: average has about 1,000 observations, and now average is 666, okay? So because I'm overwriting the levels outside of the factor command, I've now broken the relationship between my levels and labels, and all of my data is now invalid. And again, it's not going to give you an error or a warning.
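A toy reconstruction of that bad approach (the values and counts here are invented, not the actual GSS data): assigning a reordered vector to levels() relabels by position, so the counts end up attached to the wrong labels.

```r
x <- factor(c("average", "above average", "far above average", "average"))
summary(x)  # above average: 1, average: 2, far above average: 1
# BAD: overwriting levels() with a reordered vector relabels by position
levels(x) <- c("far above average", "above average", "average")
summary(x)  # counts are now attached to the wrong labels, silently
```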
Another bad approach in base R, again, you could watch the average category, is you use the levels there and do it numerically in some way. So again, this is looking similar to my code that I was showing, except I think mine happened to continue to work. And so Nick and I, you know, said you shouldn't be using this kind of base R code. If you're going to use base R, you should be using a more robust approach. So you should be doing something where you're going to use the factor command, you're going to put levels within it, and now things are working much better.
If you watch, for example, above average 483, that count is going to stay appropriate, right? But something has happened to average. Average actually did not come through. Even though I'm using a more robust base R approach, I'm still losing my average values because I made a typo. I put a trailing space after the word average when I was recoding my levels.
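That typo can be sketched like this (toy data): the more robust approach re-specifies levels inside factor(), but a trailing space still silently produces NAs.

```r
x <- factor(c("average", "above average", "average"))
# More robust: declare the levels inside factor(). But note the
# trailing-space typo in "average ": it doesn't match "average",
# so those values silently become NA
x2 <- factor(x, levels = c("above average", "average "))
summary(x2)  # above average: 1, "average ": 0, NA's: 2
```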
So we called this a fragile approach, sort of a reference to the idea of software brittleness, where if one little thing changes, then your code can break. And so we're thinking about if you had a collaborator who sent you a new spreadsheet and they happened to change the phrasing of one of the levels, or if you are a little bit fat fingering and you hit a space, then you're going to, again, mess up your data. But this approach at least is not losing the counts for everything. It's just that one place where I made the typo.
The arrival of forcats
So we're writing this paper. We're working on it. It's May, June, July of 2016. We're saying here are some bad approaches and here are the more robust or better base R approaches. And then if you've been following the tidyverse, you probably know the thing that happens next, which is that Hadley writes forcats.
So I've been talking to Hadley Wickham about the problems with factor variables for a while. And my suspicion is that when he saw me working on this paper where I wrote out some of these very specific issues and the ways that they probably needed to be solved, that that was just what he needed to say now I'm actually going to implement this package. And you'll see the inspiration there because the dataset that comes with forcats is also GSS. So my students' choice of final project data I think is having far-reaching implications.
So he wrote this package, and it's fantastic because it solves a huge number of the issues that were happening when you were working with factors in base R. So if you haven't used it before, you must use it. It has functions like fct_recode, fct_relevel, fct_reorder, fct_collapse, fct_lump, and fct_other. So it can do a lot of these tasks that you would often want to do with categorical data, without wrecking it, in a much more human-understandable way.
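For a flavor of those functions (toy data; assumes the forcats package is installed):

```r
library(forcats)

x <- factor(c("a", "b", "b", "c", "d"))
fct_recode(x, apple = "a")            # rename a level, matched by name
fct_collapse(x, early = c("a", "b"))  # merge several levels into one
fct_lump(x, n = 1)                    # keep the most common level, lump
                                      # everything else into "Other"
```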
So with those base R approaches, a lot of it is putting the onus on the programmer. You as the person who's doing the data analysis have to remember what the levels are. You have to make sure that you're not getting a space. You have to kind of do this process where you go back and forth between the data and your code, and you have to be very careful that you're doing the right thing. So Bret Victor talks about how much cognitive load it takes to program when you're changing things and holding the information in your head. So forcats is helping fix a lot of that.
And this actually gave us a great opportunity for this paper that we'd written. We did have to go back and rewrite a huge amount of it. So thanks, Hadley. But I'm very interested in the variety of R syntaxes that exist. So I have this cheat sheet, which is a contributed cheat sheet on RStudio. So scroll down to the bottom. And it compares doing the same task in what I consider to be the three main syntaxes in R. The dollar sign syntax, which is kind of standard base R, the formula syntax, and the tidyverse syntax. When I teach, I try and stay away from that base R syntax. I was team formula syntax for a long time. So that's the mosaic package and lattice graphics. And now I've been pretty convinced to move over to the tidyverse syntax, because I think it removes more of that stuff you have to hold in your head as a programmer.
So what this allowed us to do, the development of forcats with our paper, is compare several different methods for the same problem. So we often show like a compact but fragile way to do things with base R. We show a more robust but verbose, like many lines and very repetitive coding solution in base R. And then we show the direct and robust solution from the tidyverse.
So with that data about your opinion of income, again, you could watch average or below average. You're going to relevel that factor. Now everything's in the right order and the counts are right. And forcats has lots of good defaults. So in some functions, if you don't provide all the factor levels, it just sticks the rest of them at the end, or it'll give you good errors and warnings. So it's fixed a lot of this stuff.
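A sketch of those forgiving defaults (illustrative values, and it assumes forcats is installed): fct_relevel moves only the levels you name to the front, keeps the rest in their existing order, and warns, rather than failing silently, if you name a level that doesn't exist.

```r
library(forcats)

opinion <- factor(c("average", "far above average", "above average", "average"))
# Only the named levels move to the front; the rest keep their order
fct_relevel(opinion, "far above average", "above average")
# Naming a nonexistent level produces a warning, not a silent NA
```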
Another example: if you did this in base R and you wanted to collapse the marital status, you could do something like this. I think this is actually working, but it really counts on you as the analyst being able to count the indices: divorced comes first, and then married, and then never married, no answer. And if the data changes in some way, like if your collaborator decides to move something around, it's going to break. But the tidyverse solution using fct_recode is much neater, and it will actually work in a larger variety of situations.
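One way that collapse might look in forcats, matching levels by name instead of by index (fct_collapse here; fct_recode works similarly by mapping new = "old"; the marital levels are illustrative and assume forcats is installed):

```r
library(forcats)

marital <- factor(c("Divorced", "Never married", "Married", "No answer"))
# Levels are matched by name, so reordering the data cannot break this
fct_collapse(marital,
             married     = "Married",
             not_married = c("Divorced", "Never married", "No answer"))
```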
Defensive coding and takeaways
I think there's also something to be said about defensive coding here. So the way that you make sure that your examples are working is you have to, like, look at your data over and over again. So this is from an issue on that same paper. This was someone's example that had to do with categorical data, factor data, splitting the data into testing and training data. You can get different factor levels in the two datasets, and then, again, when you do the modeling and the predictions, it won't throw an error, but everything is going to be wrong. So there's lots of ways that working with factors can go wrong.
And so one of your methods of defense is just summary, summary, summary. So you saw every one of those examples, I ran the summary first, then I did something with my factors, and then I ran it again. And you could use count or something from the tidyverse if you wanted to. This is just a little more compact to show on the screen.
The other thing that we talked about is maybe you could get some testing worked into your paper, right? So you take the number of levels and assert that it should be equal to three, or test that the levels should be female and male, just so that you can keep track of your analysis. expect_equivalent doesn't work if the levels are the right levels but out of order, so that's on my wish list for future versions of the package.
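One base R way to write those assertions, sketched on toy data; setequal() gives an order-insensitive comparison of the level sets:

```r
x <- factor(c("female", "male", "female"))
# Assert facts about the factor before modeling; stopifnot() halts
# loudly if any assertion is violated
stopifnot(nlevels(x) == 2)
stopifnot(setequal(levels(x), c("male", "female")))  # order-insensitive
```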
And so my takeaways from this talk are essentially that you should use forcats. That's going to solve the vast majority of your problems. But like I said with the testing and training dataset, it doesn't always solve every single problem, so you really need to be practicing defensive coding. And that probably means that summary is your friend, and maybe you want to be asserting that and testing that things in your analysis. And if you want to see more horror stories and have a good way to explain to people why the tidyverse is better, not just for people and their thinking, but actually for making your data work, that's the paper, and thank you very much.
