Resources

Brooke Watson | R at the ACLU: Joining tables to reunite families | RStudio (2019)

Last year, over 2,500 immigrant children were separated from their families while in government custody. Information about their status is scattered across several government agencies, and throughout the national class-action lawsuit "Ms. L vs. ICE," the Analytics team of the ACLU has been using R to join, deduplicate, validate, and analyze it. Using specifics of this case, this talk will address common challenges arising from human-generated data in spreadsheets. With generalizable examples, I will discuss data tidying, standardization, deduplication, and validation using the tidyverse, janitor, assertthat, and other packages. Finally, I will share best practices for requesting useful data from non-quantitative subject matter experts.

About the Author

Brooke Watson: I am a Data Scientist at the ACLU, where I use code and statistics to support civil rights litigation and advocacy. Previously, I worked in public health and disease research, most recently as a Research Scientist with the EcoHealth Alliance. I completed my Master's degree in epidemiology at the London School of Hygiene and Tropical Medicine and swam for Tennessee's Lady Vols as an undergrad.


Transcript

This transcript was generated automatically and may contain errors.

Thanks. Hi, I'm Brooke. I am a data scientist at the ACLU, and today I'm going to be talking about the small role that R and data analysis played in an effort to reunify over 2,500 children who were separated from their families while trying to seek refuge in the United States. In that context, we'll talk about how to deal with extremely messy and flawed data and how to create efficient but safe ways to process and analyze that data, especially when it's updated over time and especially when those updates aren't consistent or predictable.

The story of family separation

But first we have to start from the beginning with the broader story of family separation. So in November of 2017, a Congolese woman and her six-year-old daughter arrived at a legal port of entry in California to seek asylum. They were detained together at first, but after four days in detention, ICE put the child on a plane and flew her 2,000 miles away from her mother to a facility in Chicago. Before she was sent to Chicago, S, her daughter, was close enough for Ms. L to hear her screaming through the night. They only got to talk to each other around six times for the four months that they were separated.

In February of 2018, ACLU lawyers met with Ms. L and immediately they filed a class-action lawsuit, because already by February, it was clear that this was not an isolated case. In March of 2018, Ms. L and her daughter were reunited largely in response to pressure from this litigation. But in that same month, then-DHS Secretary John Kelly proposed on CNN that family separation might be a useful policy to help deter immigration from Central America. In May, this policy became official and was met with immediate backlash, both from the legal community but also from activists, from regular people, from families. This was a huge story that I'm sure you saw over the summer and it really enraged everyone, as it should have.

In response to that, on June 20th, Trump finally signed an executive order to stop the separations. But this executive order didn't have anything to do with the more than 2,000 children who were already separated. And so the judge in the Ms. L case issued an injunction and ordered the government to reunite the families. And those efforts are still continuing today. Most of the families have either been reunified or their wishes have been gathered and understood. But this has been a really, really long and arduous process. And it's been clear the whole time that there was not a data system in place to handle the reunifications.

Data started pouring in around June, which also happened to be the time that I started working at the ACLU. And so I immediately got involved with the case to try to figure out how we can organize the data that was coming in.

The data challenge

In September, the Department of Homeland Security published an internal watchdog report which stated that they struggled to provide accurate, complete, and reliable data on family separations. In more day-to-day terms, what that meant was that I would get emails like this from our legal counsel with updates. Basically what would happen is every week we would get a joint status report in which the government told us the number of children that remained in various categories. And we would also get a data dump of messy Excel spreadsheets that all had been artisanally handcrafted one by one. And they never seemed to add up to the numbers in the joint status report.

And so my role as a data scientist wasn't really to do any data science. It wasn't to make predictions or run regressions or even to calculate anything more than simple counts and averages. My role was to support the team in what they needed to find out what data was wrong, what was missing, and what families might be falling through the cracks. And that boiled down to three main actions. Organizing, validating, and describing the data so that we could find people, ask new questions, and understand what was going on. Basically that means rectangulating hundreds of Excel spreadsheets, finding errors and inconsistencies in those, and then calculating descriptive statistics so that we could identify individual cases that were coming to us and make sure that the advocates, media, and legislators knew what was going on.

Organizing the data

So for organization, here's a snapshot of the data that we've received so far. And data was still coming in as recently as December. This isn't big data by anyone's definition, I don't think, but it's certainly big enough that you can't do everything manually by just looking at one spreadsheet. And it had some quirks. For example, take the unique identifier, the alien number, which is kind of like a Social Security number for people going through this process. Here are all the different fields that we could have joined on. We could call it an alien number or an A#, or abbreviate child, or label it for a child or a UAC, or for a relative or a parent. Or we could not describe whose number it was at all, and those labels could vary across spreadsheets. We could just call it "No."
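One way to tame that kind of column-name chaos is to keep a running lookup of every spelling you've encountered and refuse to guess when a sheet is ambiguous. This is only a sketch of the idea; the alias list and column names below are invented, not the actual government headers.

```r
library(dplyr)

# Every spelling of the A-number column we've seen so far
# (these aliases are invented for illustration)
id_aliases <- c("alien_no", "a_number", "child_a_no", "uac_a_no")

standardize_id <- function(df) {
  hits <- intersect(id_aliases, names(df))
  # Fail loudly rather than silently pick between candidate columns
  stopifnot(length(hits) == 1)
  rename(df, a_number = all_of(hits))
}

out <- standardize_id(tibble(alien_no = c("A123", "A456"), age = c(6, 9)))
```

The `stopifnot()` is the point: when a new spreadsheet arrives with a spelling you haven't seen, the pipeline stops and asks you to extend the alias list instead of joining on the wrong column.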

And so my immediate reaction in response to this data request was to get everything into a nice, single, tidy rectangle with all the information, one row per person. But that clearly wasn't possible. One of the reasons it wasn't possible was because some of the data was simply wrong. And we couldn't just throw out the data that we knew to be bad. We had to follow up with the government to say, what is going on here? What is this error? Where is this person currently? In this case, as in a lot of our cases, the data source is just as important as the data itself. And I think that's a thing to keep in mind when we're doing analyses.

the data source is just as important as the data itself.

But we still like rectangles. Everybody loves rectangles. So the first step was to get everything into a nice rectangle with list columns. And we did that with these packages. I really like the fs package because it includes metadata about a given file. And then we could arrange those. We could sort by size or by the date that they came in. And then we could mutate over various rows based on what the path string was. And we could map over all of those rows and read the data in.

So here's an example of what that might look like. If we read in all of the files that match this regex pattern in a given folder, we can then arrange by the time that that file was created and try to see, you know, what are the updated pieces of information telling us. And once we have a handle on that, we can read all of the data in. And so we did this with a bunch of tidyverse packages. purrr was really, really useful in unnesting the various sheets.
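A minimal, runnable sketch of that listing step might look like the following. The folder and file names here are throwaway stand-ins created in a temp directory, not the real case files.

```r
library(fs)
library(dplyr)
library(stringr)

# Create a throwaway folder with fake files so the sketch is runnable
dir <- path(tempdir(), "status_reports")
dir_create(dir)
file_create(path(dir, c("report_2018-07-13.xlsx",
                        "report_2018-07-20.xlsx",
                        "notes.txt")))

# dir_info() returns one row of metadata per file; keep only the Excel
# files and put the most recently modified ones first
files <- dir_info(dir) %>%
  filter(str_detect(path, "\\.xlsx$")) %>%
  arrange(desc(modification_time)) %>%
  select(path, size, modification_time)
```

Having the files as a data frame of paths plus metadata is what makes the next step possible: you can filter, sort, and map over files the same way you would over ordinary rows.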

So what we're doing here, just to walk through, is we're mapping the readxl function excel_sheets over the file paths so that we can see the names of all the different tabs. We're making it so that there's one sheet per row. And then we're reading all of the data in based on those sheet names.
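That walkthrough could be sketched like this, using readxl's bundled example workbook in place of the case files so the code actually runs:

```r
library(readxl)
library(purrr)
library(dplyr)
library(tidyr)

# readxl ships an example workbook; it stands in for the case files here
paths <- readxl_example("datasets.xlsx")

sheets <- tibble(path = paths) %>%
  mutate(sheet = map(path, excel_sheets)) %>%  # list of tab names per file
  unnest(sheet) %>%                            # now one row per (file, sheet)
  mutate(data = map2(path, sheet,              # read each tab by name
                     ~ read_excel(.x, sheet = .y)))
```

The result is the "rectangle of list columns" from the talk: one row per sheet, with the sheet's contents tucked into the `data` list column alongside the path it came from.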

Validating the data

And so once we have everything, at least, in one location in R, we can start to ask questions and see if the data makes sense. And spoiler alert, it often did not. Here are some examples of some errors, to put it lightly, that we found in the data. There were a lot of book-in dates that were later than the dates people had been deported, or children who had been reunified before they had even been booked in. We had 22 children who were reunified with themselves. And then there were 73 parents who had deportation dates after that essentially was not allowed anymore.

And so some of these obviously are data errors, but we don't know anything more about the location of that actual individual. And there's nothing we can do from a data perspective once we get to this point. We have to turn around and say, what is going on with these 22 rows? Because it really matters whether they were actually reunified with their parent and the identification number was just entered wrong, or whether they're still somewhere else. We don't know.

And so we did this with a series of assertions and filters to help us find errors. And so we have that nice rectangle of list columns. We can unnest the data so that we have one row for every row of information. And we have tons and tons of columns because of the variable name problems that you saw earlier. But this way, we can filter to which are the rows in which the child's alien number was the same as the adult's alien number. And then we can look at what that path is and say, this is the Excel sheet where we have a problem. This is the source of that information, and return to that source and figure out what's going on.
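A sketch of that filter, on invented toy data (the column names `child_a` and `adult_a` and every value below are made up for illustration):

```r
library(dplyr)

# Toy stand-in for the unnested case data
rows <- tibble(
  path    = c("reports/a.xlsx", "reports/a.xlsx", "reports/b.xlsx"),
  child_a = c("A-111", "A-222", "A-333"),
  adult_a = c("A-111", "A-444", "A-555")
)

# A child "reunified with themselves": the same A-number in both columns.
# Keeping `path` tells us exactly which spreadsheet to send back for review.
flagged <- rows %>%
  filter(child_a == adult_a) %>%
  select(path, child_a)
```

Because `path` traveled along with the data from the moment it was read in, every suspicious row can be traced straight back to its source file.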

We can also use verifications and assertions, which we can put right into a tidyverse-style pipe so that we are making sure that, for example, all of the children only have one age and one gender associated with them. So if we're going to tabulate how many children under 10 are separated from their parents, we're not double counting anyone.
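One pipe-friendly way to express that check is with the assertr package's verify() (the abstract also mentions assertthat; assertr is used here for the pipe style). The data below is invented:

```r
library(dplyr)
library(assertr)

# Invented records: each child should have exactly one age and one gender
children <- tibble(
  a_number = c("A-111", "A-111", "A-222"),
  age      = c(7, 7, 15),
  gender   = c("F", "F", "M")
)

checked <- children %>%
  group_by(a_number) %>%
  summarise(n_age = n_distinct(age), n_gender = n_distinct(gender)) %>%
  verify(n_age == 1) %>%      # halts the pipe if any child has two ages...
  verify(n_gender == 1)       # ...or two genders, so counts aren't doubled
```

If any row violates the condition, verify() throws an error and the pipeline stops, which is exactly what you want before tabulating something like "children under 10."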

Describing the data

And so once we get at least a subset of the data that we feel reasonably trustworthy about, we can start to describe it. And we can start to put it in pictures. So again, if we take that pipe from before and we pipe it into a ggplot, we can look just at a distribution of the ages and genders of all the children. And patterns jump out that we might not have expected. The two big patterns in this data for me are the jump between four-year-olds and five-year-olds and the pattern among teen boys especially.
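A sketch of that age-and-gender plot, on randomly generated toy data rather than the real case records:

```r
library(ggplot2)
library(dplyr)

set.seed(1)
# Invented toy data standing in for the real case records
children <- tibble(
  age    = sample(0:17, 500, replace = TRUE),
  gender = sample(c("F", "M"), 500, replace = TRUE)
)

# One bar per year of age, split by gender
p <- ggplot(children, aes(x = age, fill = gender)) +
  geom_histogram(binwidth = 1, position = "dodge") +
  labs(x = "Age at separation", y = "Children", fill = "Gender")
```

Because the validated data is already a tidy data frame, piping it into ggplot2 is a one-liner, and recutting the plot (say, by age group instead of single years) is just a change to the aesthetics.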

So one of the sort of peculiarities of this case was that children who were zero to four were put in a special category and expedited for reunification. And so the way that was reported, there were around 100 children who were between the ages of zero and four and around 2,500 who were older than five, which makes it seem like, okay, this is bad, but maybe we're doing a better job with the youngest children. But when you cut the data differently, we find out that more than 1,000 children were under the age of 10 when they were separated from their families.

Another important thing that we wanted to visualize, and we did this with Leaflet, is where the children were in space. So I think we think of family separation as a border problem, and it's true that most people did come in through the southern border, but that's not where they stayed. Children were sent to essentially foster care facilities all across the country. There ended up being over 400 children that ended up in New York. And a common question that we would get from legislators and from state representatives was, who are the people in my state? Are there any children here? Where are they? What can I do for them? What can I do about them? Are there any parents who are here? How can, you know, how can I in Nevada or Nebraska or South Carolina, how can I be most useful?
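A minimal leaflet sketch of that kind of map. The coordinates and counts below are invented placeholders, not actual facility locations:

```r
library(leaflet)

# Invented coordinates and counts standing in for facility locations
facilities <- data.frame(
  city = c("New York", "San Antonio", "Chicago"),
  lat  = c(40.71, 29.42, 41.88),
  lng  = c(-74.01, -98.49, -87.63),
  n    = c(40, 120, 25)
)

# One circle per facility, sized by the number of children there
m <- leaflet(facilities) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lng, lat = ~lat, radius = ~sqrt(n),
                   label = ~paste(city, "-", n, "children"))
```

An interactive map like this makes the "who is in my state?" question answerable at a glance, which a table of 2,500 rows never could.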

And some of those questions we weren't able to answer because the data was so poor. So one of the things that we didn't get accurate and consistent updates on was reunification date. We really wanted to know how long children had been separated, but we weren't able to effectively and responsibly calculate that because there was so much missingness. And there was missingness because when you don't have data about something or someone, you don't have to hold yourself accountable. And no agency was ready for this when it happened because the Office of Refugee Resettlement hadn't had to deal with the problem of reunifying children before.

What data reflects about our values

And this is because the data that we collect reflects what we value. And when we aren't collecting data, something or someone isn't being counted. I think this was said most succinctly and clearly by the federal judge in this case, who commented that we have methods of keeping track of the belongings and the objects of people who are coming in through this process, but we had no system in place to provide effective communication with children and to keep track of them and to eventually reunify them with their families. And it became very clear that no one had really thought this out in an appropriate way before this policy was enacted.

the data that we collect reflects what we value. And when we aren't collecting data, something or someone isn't being counted.

And this question of not collecting data and not counting people is really at the root of a lot of the issues that we're worried about and fighting for right now in the civil liberties space. A lot of the cases that we're fighting are really cases about who gets counted. So recently, the ACLU won a case that barred the addition of a citizenship question to the short-form census, which advocates were worried would reduce census responses in a lot of these populations that are already marginalized and already have good reason to be suspicious of government data collection. But when we figure out who gets counted, we can start to make choices based on that. So another case is whether or not we record people's LGBT status in something like the census.

When we don't collect that information, we don't know how to make choices that account for that population. And when we don't collect information and maintain a data structure for children who are separated from their families, we're making decisions about what is important. Omissions in data are just as important as what we have. And that doesn't mean collect data on everything. There are a lot of reasons why a full surveillance state is inappropriate. But it's important to reflect on what we're collecting and how that reflects what we value.

Omissions in data are just as important as what we have.

So I can't finish this talk without thanking all of the ACLU lawyers, all of the lawyers at other private sector firms and other nonprofits both in North America and in Central America, and all of the people who wouldn't stand for this, the literally hundreds of thousands of people who came out, who marched in the streets, who donated, who called their representatives. This was a really, really motivating issue for a lot of people. And while it was devastating, there were a lot of hopeful aspects. Because even though each of these individuals in these pictures has a lot going on in their lives, they knew that this wasn't something they were going to stand for. And I imagine that group of people turned activists includes a lot of people in this room. So thank you. Are there any questions?

Q&A

We'll have time, actually, for just one question. Is there somebody who would like to ask a question? There we go. That was awesome, first. But how easy was it to obtain data from agencies? Did you have to use, like, FOIA or some other procedure, or was it pretty much just, here you go?

Yeah. The reason we got that data is because we were in a case, and so we got it through discovery. And I will caveat that with the data that we got is only as good as what we got from the government. And so there are other organizations, including us, that have found other cases that weren't originally included in that class list of around 2,600 members. And so that's a really good point to call out. Thank you.