Resources

Brooke Watson | R at the ACLU: Joining tables to reunite families | RStudio (2019)

Last year, over 2,500 immigrant children were separated from their families while in government custody. Information about their status is scattered across several government agencies, and throughout the national class-action lawsuit "Ms. L vs. ICE," the Analytics team of the ACLU has been using R to join, deduplicate, validate, and analyze it. Using specifics of this case, this talk will address common challenges arising from human-generated data in spreadsheets. With generalizable examples, I will discuss data tidying, standardization, deduplication, and validation using the tidyverse, janitor, assertthat, and other packages. Finally, I will share best practices for requesting useful data from non-quantitative subject matter experts.

About the Author

Brooke Watson: I am a Data Scientist at the ACLU, where I use code and statistics to support civil rights litigation and advocacy. Previously, I worked in public health and disease research, most recently as a Research Scientist with the EcoHealth Alliance. I completed my Master's degree in epidemiology at the London School of Hygiene and Tropical Medicine and swam for Tennessee's Lady Vols as an undergrad.


Transcript

This transcript was generated automatically and may contain errors.

Thanks. Hi, I'm Brooke. I am a data scientist at the ACLU, and today I'm going to be talking about the small role that R and data analysis played in an effort to reunify over 2,500 children who were separated from their families while trying to seek refuge in the United States. In that context, we'll talk about how to deal with extremely messy and flawed data and how to create efficient but safe ways to process and analyze that data, especially when it's updated over time and especially when those updates aren't consistent or predictable.

The story of family separation

But first we have to start from the beginning with the broader story of family separation. So in November of 2017, a Congolese woman and her six-year-old daughter arrived at a legal port of entry in California to seek asylum. They were detained together at first, but after four days in detention, ICE put the child on a plane and flew her 2,000 miles away from her mother to a facility in Chicago. Before she was sent to Chicago, S, her daughter, was close enough for Ms. L to hear her screaming through the night. They only got to talk to each other around six times for the four months that they were separated.

In February of 2018, ACLU lawyers met with Ms. L and immediately they filed a class-action lawsuit, because already by February, it was clear that this was not an isolated case. In March of 2018, Ms. L and her daughter were reunited largely in response to pressure from this litigation. But in that same month, then-DHS Secretary John Kelly proposed on CNN that family separation might be a useful policy to help deter immigration from Central America. In May, this policy became official and was met with immediate backlash, both from the legal community but also from activists, from regular people, from families. This was a huge story that I'm sure you saw over the summer and it really enraged everyone, as it should have.

In response to that, on June 20th, Trump finally signed an executive order to stop the separations. But this executive order didn't have anything to do with the more than 2,000 children who were already separated. And so the judge in the Ms. L case issued an injunction and ordered the government to reunite the families. And those efforts are still continuing today. Most of the families have either been reunified or their wishes have been gathered and understood. But this has been a really, really long and arduous process. And it's been clear the whole time that there was not a data system in place to handle the reunifications.

Data started pouring in around June, which also happened to be the time that I started working at the ACLU. And so I immediately got involved with the case to try to figure out how we can organize the data that was coming in.

The data challenge

In September, the Department of Homeland Security published an internal watchdog report which stated that they struggled to provide accurate, complete, and reliable data on family separations. In more day-to-day terms, what that meant was that I would get emails like this from our legal counsel with updates. Basically what would happen is every week we would get a joint status report in which the government told us the number of children that remained in various categories. And we would also get a data dump of messy Excel spreadsheets that all had been artisanally handcrafted one by one. And they never seemed to add up to the numbers in the joint status report.

And so my role as a data scientist wasn't really to do any data science. It wasn't to make predictions or run regressions or even to calculate anything more than simple counts and averages. My role was to support the team in what they needed to find out what data was wrong, what was missing, and what families might be falling through the cracks. And that boiled down to three main actions. Organizing, validating, and describing the data so that we could find people, ask new questions, and understand what was going on. Basically that means rectangulating hundreds of Excel spreadsheets, finding errors and inconsistencies in those, and then calculating descriptive statistics so that we could identify individual cases that were coming to us and make sure that the advocates, media, and legislators knew what was going on.

Organizing the data

So for organization, here's a snapshot of the data that we've received so far. And data was still coming in as recently as December. This isn't big data by anyone's definition, I don't think, but it's certainly big enough that you can't do everything manually by just looking at one spreadsheet. And it had some quirks. For example, take the unique identifier, the alien number, which is kind of like a Social Security number for people going through this process. Here are all the different fields that we could have joined on. We could call it an alien number or an A#, or abbreviate child, or label it for a child or a UAC, or for a relative or a parent. Or we could not describe whose number it was at all, and those labels could vary across spreadsheets. We could just call it "No."
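One way to tame that kind of column-name chaos is to keep a running lookup of every spelling you've encountered and refuse to guess when a sheet is ambiguous. This is only a sketch of the idea; the alias list and column names below are invented, not the actual government headers.

```r
library(dplyr)

# Every spelling of the A-number column we've seen so far
# (these aliases are invented for illustration)
id_aliases <- c("alien_no", "a_number", "child_a_no", "uac_a_no")

standardize_id <- function(df) {
  hits <- intersect(id_aliases, names(df))
  # Fail loudly rather than silently pick between candidate columns
  stopifnot(length(hits) == 1)
  rename(df, a_number = all_of(hits))
}

out <- standardize_id(tibble(alien_no = c("A123", "A456"), age = c(6, 9)))
```

The `stopifnot()` is the point: when a new spreadsheet arrives with a spelling you haven't seen, the pipeline stops and asks you to extend the alias list instead of joining on the wrong column.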

And so my immediate reaction in response to this data request was to get everything into a nice, single, tidy rectangle with all the information, one row per person. But that clearly wasn't possible. One of the reasons it wasn't possible was because some of the data was simply wrong. And we couldn't just throw out the data that we knew to be bad. We had to follow up with the government to say, what is going on here? What is this error? Where is this person currently? In this case, as in a lot of our cases, the data source is just as important as the data itself. And I think that's a thing to keep in mind when we're doing analyses.

the data source is just as important as the data itself.

But we still like rectangles. Everybody loves rectangles. So the first step was to get everything into a nice rectangle with list columns. And we did that with these packages. I really like the fs package because it includes metadata about a given file. And then we could arrange those. We could sort by size or by the date that they came in. And then we could mutate over various rows based on what the path string was. And we could map over all of those rows and read the data in.

So here's an example of what that might look like. If we read in all of the files that match this regex pattern in a given folder, we can then arrange by the time that that file was created and try to see, you know, what are the updated pieces of information telling us. And once we have a handle on that, we can read all of the data in. And so we did this with a bunch of tidyverse packages. purrr was really, really useful in unnesting the various sheets.
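A minimal, runnable sketch of that listing step might look like the following. The folder and file names here are throwaway stand-ins created in a temp directory, not the real case files.

```r
library(fs)
library(dplyr)
library(stringr)

# Create a throwaway folder with fake files so the sketch is runnable
dir <- path(tempdir(), "status_reports")
dir_create(dir)
file_create(path(dir, c("report_2018-07-13.xlsx",
                        "report_2018-07-20.xlsx",
                        "notes.txt")))

# dir_info() returns one row of metadata per file; keep only the Excel
# files and put the most recently modified ones first
files <- dir_info(dir) %>%
  filter(str_detect(path, "\\.xlsx$")) %>%
  arrange(desc(modification_time)) %>%
  select(path, size, modification_time)
```

Having the files as a data frame of paths plus metadata is what makes the next step possible: you can filter, sort, and map over files the same way you would over ordinary rows.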

So what we're doing here, just to walk through, is we're mapping the readxl function excel_sheets over the file paths so that we can see the names of all the different tabs. We're making it so that there's one sheet per row. And then we're reading all of the data in based on those sheet names.
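That walkthrough could be sketched like this, using readxl's bundled example workbook in place of the case files so the code actually runs:

```r
library(readxl)
library(purrr)
library(dplyr)
library(tidyr)

# readxl ships an example workbook; it stands in for the case files here
paths <- readxl_example("datasets.xlsx")

sheets <- tibble(path = paths) %>%
  mutate(sheet = map(path, excel_sheets)) %>%  # list of tab names per file
  unnest(sheet) %>%                            # now one row per (file, sheet)
  mutate(data = map2(path, sheet,              # read each tab by name
                     ~ read_excel(.x, sheet = .y)))
```

The result is the "rectangle of list columns" from the talk: one row per sheet, with the sheet's contents tucked into the `data` list column alongside the path it came from.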

Validating the data

And so once we have everything, at least, in one location in R, we can start to ask questions and see if the data makes sense. And spoiler alert, it often did not. Here are some examples of some errors, to put it lightly, that we found in the data. There were a lot of book-in dates that were later than the dates people had been deported, or children who had been reunified before they had even been booked in. We had 22 children who were reunified with themselves. And then there were 73 parents who had deportation dates after that essentially was not allowed anymore.

And so some of these obviously are data errors, but we don't know anything more about the location of that actual individual. And there's nothing we can do from a data perspective once we get to this point. We have to turn around and say, what is going on with these 22 rows? Because it really matters whether they were actually reunified with their parent and the identification number was just entered wrong, or whether they're still somewhere else. We don't know.

And so we did this with a series of assertions and filters to help us find errors. And so we have that nice rectangle of list columns. We can unnest the data so that we have one row for every row of information. And we have tons and tons of columns because of the variable name problems that you saw earlier. But this way, we can filter to which are the rows in which the child's alien number was the same as the adult's alien number. And then we can look at what that path is and say, this is the Excel sheet where we have a problem. This is the source of that information, and return to that source and figure out what's going on.
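A sketch of that filter, on invented toy data (the column names `child_a` and `adult_a` and every value below are made up for illustration):

```r
library(dplyr)

# Toy stand-in for the unnested case data
rows <- tibble(
  path    = c("reports/a.xlsx", "reports/a.xlsx", "reports/b.xlsx"),
  child_a = c("A-111", "A-222", "A-333"),
  adult_a = c("A-111", "A-444", "A-555")
)

# A child "reunified with themselves": the same A-number in both columns.
# Keeping `path` tells us exactly which spreadsheet to send back for review.
flagged <- rows %>%
  filter(child_a == adult_a) %>%
  select(path, child_a)
```

Because `path` traveled along with the data from the moment it was read in, every suspicious row can be traced straight back to its source file.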

We can also use verifications and assertions, which we can put right into a tidyverse-style pipe so that we are making sure that, for example, all of the children only have one age and one gender associated with them. So if we're going to tabulate how many children under 10 are separated from their parents, we're not double counting anyone.
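One pipe-friendly way to express that check is with the assertr package's verify() (the abstract also mentions assertthat; assertr is used here for the pipe style). The data below is invented:

```r
library(dplyr)
library(assertr)

# Invented records: each child should have exactly one age and one gender
children <- tibble(
  a_number = c("A-111", "A-111", "A-222"),
  age      = c(7, 7, 15),
  gender   = c("F", "F", "M")
)

checked <- children %>%
  group_by(a_number) %>%
  summarise(n_age = n_distinct(age), n_gender = n_distinct(gender)) %>%
  verify(n_age == 1) %>%      # halts the pipe if any child has two ages...
  verify(n_gender == 1)       # ...or two genders, so counts aren't doubled
```

If any row violates the condition, verify() throws an error and the pipeline stops, which is exactly what you want before tabulating something like "children under 10."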

Describing the data

And so once we get at least a subset of the data that we feel reasonably trustworthy about, we can start to describe it. And we can start to put it in pictures. So again, if we take that pipe from before and we pipe it into a ggplot, we can look just at a distribution of the ages and genders of all the children. And patterns jump out that we might not have expected. The two big patterns in this data for me are the jump between four-year-olds and five-year-olds and the pattern among teen boys especially.
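A sketch of that age-and-gender plot, on randomly generated toy data rather than the real case records:

```r
library(ggplot2)
library(dplyr)

set.seed(1)
# Invented toy data standing in for the real case records
children <- tibble(
  age    = sample(0:17, 500, replace = TRUE),
  gender = sample(c("F", "M"), 500, replace = TRUE)
)

# One bar per year of age, split by gender
p <- ggplot(children, aes(x = age, fill = gender)) +
  geom_histogram(binwidth = 1, position = "dodge") +
  labs(x = "Age at separation", y = "Children", fill = "Gender")
```

Because the validated data is already a tidy data frame, piping it into ggplot2 is a one-liner, and recutting the plot (say, by age group instead of single years) is just a change to the aesthetics.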

So one of the sort of peculiarities of this case was that children who were zero to four were put in a special category and expedited for reunification. And so the way that was reported, there were around 100 children who were between the ages of zero and four and around 2,500 who were older than five, which makes it seem like, okay, this is bad, but maybe we're doing a better job with the youngest children. But when you cut the data differently, we find out that more than 1,000 children were under the age of 10 when they were separated from their families.

Another important thing that we wanted to visualize, and we did this with Leaflet, is where the children were in space. So I think we think of family separation as a border problem, and it's true that most people did come in through the southern border, but that's not where they stayed. Children were sent to essentially foster care facilities all across the country. There ended up being over 400 children that ended up in New York. And a common question that we would get from legislators and from state representatives was, who are the people in my state? Are there any children here? Where are they? What can I do for them? What can I do about them? Are there any parents who are here? How can, you know, how can I in Nevada or Nebraska or South Carolina, how can I be most useful?
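A minimal leaflet sketch of that kind of map. The coordinates and counts below are invented placeholders, not actual facility locations:

```r
library(leaflet)

# Invented coordinates and counts standing in for facility locations
facilities <- data.frame(
  city = c("New York", "San Antonio", "Chicago"),
  lat  = c(40.71, 29.42, 41.88),
  lng  = c(-74.01, -98.49, -87.63),
  n    = c(40, 120, 25)
)

# One circle per facility, sized by the number of children there
m <- leaflet(facilities) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lng, lat = ~lat, radius = ~sqrt(n),
                   label = ~paste(city, "-", n, "children"))
```

An interactive map like this makes the "who is in my state?" question answerable at a glance, which a table of 2,500 rows never could.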

And some of those questions we weren't able to answer because the data was so poor. So one of the things that we didn't get accurate and consistent updates on was reunification date. We really wanted to know how long children had been separated, but we weren't able to effectively and responsibly calculate that because there was so much missingness. And there was missingness because when you don't have data about something or someone, you don't have to hold yourself accountable. And no agency was ready for this when it happened because the Office of Refugee Resettlement hadn't had to deal with the problem of reunifying children before.

What data reflects about our values

And this is because the data that we collect reflects what we value. And when we aren't collecting data, something or someone isn't being counted. I think this was said most succinctly and clearly by the federal judge in this case, who commented that we have methods of keeping track of the belongings and the objects of people who are coming in through this process, but we had no system in place to provide effective communication with children and to keep track of them and to eventually reunify them with their families. And it became very clear that no one had really thought this out in an appropriate way before this policy was enacted.

the data that we collect reflects what we value. And when we aren't collecting data, something or someone isn't being counted.

And this question of not collecting data and not counting people is really at the root of a lot of the issues that we're worried about and fighting for right now in the civil liberties space. A lot of the cases that we're fighting are really cases about who gets counted. So recently, the ACLU won a case that barred the addition of a citizenship question to the short-form census, which advocates were worried would reduce census responses in a lot of these populations that are already marginalized and already have good reason to be suspicious of government data collection. But when we figure out who gets counted, we can start to make choices based on that. So another case is whether or not we record people's LGBT status in something like the census.

When we don't collect that information, we don't know how to make choices that account for that population. And when we don't collect information and maintain a data structure for children who are separated from their families, we're making decisions about what is important. Omissions in data are just as important as what we have. And that doesn't mean collect data on everything. There are a lot of reasons why a full surveillance state is inappropriate. But it's important to reflect on what we're collecting and how that reflects what we value.

Omissions in data are just as important as what we have.

So I can't finish this talk without thanking all of the ACLU lawyers, all of the lawyers at other private sector firms and other nonprofits both in North America and in Central America, and all of the people who wouldn't stand for this, the literally hundreds of thousands of people who came out, who marched in the streets, who donated, who called their representatives. This was a really, really motivating issue for a lot of people. And while it was devastating, there were a lot of hopeful aspects. Because even though each of these individuals in these pictures has a lot going on in their lives, they knew that this wasn't something they were going to stand for. And I imagine that group of people turned activists includes a lot of people in this room. So thank you. Are there any questions?

Q&A

We'll have time, actually, for just one question. Is there somebody who would like to ask a question? There we go. That was awesome, first. But how easy was it to obtain data from agencies? Did you have to use, like, FOIA or some other procedure, or was it pretty much just, here you go?

Yeah. The reason we got that data is because we were in a case, and so we got it through discovery. And I will caveat that with the data that we got is only as good as what we got from the government. And so there are other organizations, including us, that have found other cases that weren't originally included in that class list of around 2,600 members. And so that's a really good point to call out. Thank you.