Resources

Sophie Beiers | Trial and Error in Data Viz at the ACLU | RStudio

Visualizing data the “right” way requires many considerations — the topic, the quality of your data, your audience, your time frame, and the various channels of (sometimes conflicting) feedback you receive. In this presentation, I’ll introduce some reflections on these considerations and ways I’ve incorporated feedback (or not) into my work as Data Journalist at the ACLU. Lastly, I’ll present some of the sillier trials and errors I’ve made that were arguably necessary to my process in creating effective data visualizations using R. About Sophie: Sophie Beiers works on the ACLU's Analytics team as a Data Journalist, where she analyzes and visualizes data for their lawyers’ legal arguments and for external advocacy pieces. Prior to her time at the ACLU, she received her master’s degree in Quantitative Methods in the Social Sciences at Columbia University, where she also TA’d in the Lede Program for Data Journalism. Before NYC, she kicked off her career in analytics in San Francisco at the education nonprofit "YouthTruth." Sophie is a Bay Area native but currently lives in Portland, OR and enjoys running, hiking, and making pottery in her free time.


Transcript

This transcript was generated automatically and may contain errors.

Hi, my name is Sophie Beiers. Thank you for watching my video. I'm super honored to be a part of this conference. I definitely wish that we could all be in person. But in the meantime, welcome to my kitchen. I'm excited to talk to you a little bit about my process in creating data visualizations and data driven stories at the ACLU.

So just a little bit about me. I'm a data journalist at the ACLU where I analyze data for our lawyers' legal arguments and visualize and report on our data for external advocacy purposes. I'm only a few years into being a part of the R community, so I'm really excited to be here.

Data journalism at the ACLU

So to start, what is data science and data journalism more broadly at the ACLU? Some of you might know and remember Brooke Madubuonwu, who spoke at the 2019 RStudio conference about her data work helping reunify children who were separated from their families while seeking refuge in our country. She happens to be my awesome manager and colleague, and I felt she recently summed up our work in just one Slack message, which reads: "the lawyer called our analysis both deeply disturbing and extremely helpful," which might just be the slogan for our team.


This leads me to an important word to define, terribulous, which is commonly used on our team as a way of expressing excitement over being able to prove, with quantitative evidence, news that is usually terrible. Our analysis of the data that we receive often uncovers disparities that hold up in court and help our lawyers win their legal arguments. But in all seriousness, what this often means is that the data we are privileged to analyze represent real human beings who are being harmed in one way or another.

I'll briefly share with you a few examples of what this might mean. For example, each row in this data was a person who had an eviction filed against them. We found in this case that filings for eviction impacted Black women at alarmingly disproportionate rates. Another example is this recent report that we put out, where we found that Black people, despite using marijuana at the same rates as white people, are still being disproportionately arrested for possession of marijuana. We also found that people of color were more likely to have their driver's licenses suspended due to unpaid and often unaffordable fines and fees. And we were able to visualize the story of two Black women whose lives were turned upside down without the ability to use their cars.

The visualization process

All in all, I've been really lucky to be a part of these and other data journalistic projects during my time at the ACLU. So I wanted to talk to you a little bit about the process that I often go through and the key steps I take in creating my data visualizations. This will include the feedback that I sometimes receive, what I've done or not done with that feedback, and other considerations that I think are important in getting to a final result.

So some of you might be thinking that it's just this easy. You go from data analysis to visualization, you're good, you have a great viz, and you can go on and just enjoy your weekend. Maybe even I thought that at one point, based on this Slack evidence, when I was analyzing data a few months into the pandemic and came to the conclusion that actually letting incarcerated people out of jails had no effect on crime levels nearby. Spoiler: this is not the process, but it was the finding. The actual process in my experience can be really messy and definitely not linear. But I'll share with you the key elements and considerations that I believe have set me up for success on projects like this.

First, you'll get some data. A lot of the time for us here, that means that a lawyer will have some kind of data that they've obtained through the discovery process or through a FOIA request, and a lot of times we'll just use publicly available data. Next, you'll use R to read and explore that data. During the exploration process, I use the tidyverse quite a bit, but also the janitor package is one of my favorites to make easy tables, and always ggplot2 to find quick trends and actually start visualizing my findings.
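As a minimal sketch of this read-and-explore step (the data frame, column names, and values here are all invented for illustration; in practice you'd start from `read_csv()` on the real file):

```r
library(tidyverse)
library(janitor)

# Toy stand-in for a city's reported-crime extract (values invented).
# Real data would come from read_csv("some_city_crimes.csv") or similar.
crimes <- tribble(
  ~offense,    ~report_date,
  "BURGLARY",  as.Date("2020-03-02"),
  "burglary ", as.Date("2020-03-15"),
  "Theft",     as.Date("2020-04-01"),
  "THEFT",     as.Date("2020-04-20")
)

# janitor::tabyl() makes quick frequency tables, which is great for
# spotting messy categories (inconsistent case, stray whitespace) early
crimes %>% tabyl(offense)

# ggplot2 for a fast look at trends before any polish
crimes %>%
  count(month = lubridate::floor_date(report_date, "month")) %>%
  ggplot(aes(month, n)) +
  geom_col()
```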

So this is important. By now, you'll want to start thinking about the main points that you found in your exploration that you want to visualize and make clear. Ask yourself: does my visualization support and highlight my main point? You'll also want to carefully consider your audience. Is your audience made up purely of academics? Are they advocates? Are they people who know a lot about your topic already, or maybe they know nothing at all?

It's also good to ask for feedback on your visualizations from various people. You'll want to be specific about the main points that you're trying to visualize before receiving feedback about whether those main points are coming through clearly or not to those people. And then you might find that once you receive feedback, you'll start iterating on your data viz, which means you'll likely need to go back and forth and check a lot of these boxes again. The process could end up looking a lot like this.

A real-life example: COVID-19 and jail populations

Now, to show you how I've gone about using this process in practice, I'll go through these steps once more with a real-life example. For the purposes of the example that I'll use, I'll need to start with a little bit of background on this recent project that our team worked on. Sometime into the very beginning of the COVID-19 pandemic, it became really clear that people incarcerated in jails in America, which, by the way, has the highest incarceration rate in the world, would be particularly at risk. Incarcerated people are subject to extreme overcrowding, and jails often lack the resources and facilities that allow for handwashing and other hygienic practices that are necessary to slow the spread of COVID-19.

So it was really clear that in order to save lives, action had to be taken, and it had to be taken quickly. Our team partnered with academic researchers to build a model that accounted for the dynamics of the jail system and estimated the additional illnesses and loss of life that we'd expect if none of these preventative measures were taken. We recommended reducing arrests such that people were only being arrested for the most serious incidents and promptly releasing people who were pre-trial or charged for low-level crimes or were particularly vulnerable to the disease and posed no risk of immediate threat to others. All in all, the team estimated that taking these precautions could save 100,000 human lives, both inside the jails and in their surrounding communities.

Many communities did begin prioritizing the lives of incarcerated people and the people who worked within jails. Some localities reduced low-level arrests, or set bond at zero, or released a subset of incarcerated people. But there was some backlash. Some people were worried that taking these actions would result in a huge spike in crime in the surrounding communities, despite research showing that in New York City, for instance, people released from jails due to COVID were rearrested at lower rates than even expected. To quell these concerns further, our team analyzed trends in reported crimes during the initial three months of the pandemic in 29 of the largest US cities where data was available and released this report with our findings. For the purposes of this project, our goal was to find whether there was any relationship between the change in county jail populations and the reported crime rate in the closest metropolitan area.

So, back to the process, starting with the data step. We needed two categories of data: one for county jail populations, and one that showed the number of crimes reported in the nearest city. These were ultimately the 29 locations for which I could find quality reported crime data for the necessary time periods, aggregated daily, weekly, or monthly for all of 2019 through May 2020, and for which we could also find jail population information. So, we needed both. We collected this data piecemeal, mainly through Googling for specific county sheriff's offices or finding public open data sets. We ended up downloading countless CSVs and Excel files and even PDFs, kind of continuously, whenever we noticed a new bit of data had dropped. Hence, the circular arrow around the data step.
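One hedged sketch of how that piecemeal collection can stay manageable: keep dropping the downloaded CSVs into one folder, then read and stack them all in a single pipeline (the folder path and the idea of a `source_file` column are assumptions, not the team's actual code):

```r
library(tidyverse)

# Hypothetical folder where each new county/city download gets saved
files <- list.files("data/raw_crime", pattern = "\\.csv$", full.names = TRUE)

# Read every file and stack the rows, recording which file each row
# came from; re-running this picks up any newly dropped files
raw_crime <- files %>%
  set_names() %>%
  map_dfr(read_csv, .id = "source_file")
```

The payoff is that a new data dump means re-running one chunk, not editing the analysis.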

We also used the Vera Institute's data that tracked the number of people in each county jail over time, and we ended up using the specific dates of February 29th and April 30th to calculate the percent change in jail population. Because most jails took action early on in the pandemic, we looked at the crime rates during the time period of March through May 2019 and compared those to the same time period in 2020. And then, to standardize the way that we analyzed crime, and also because many of these cities' crime data sets only included these types of crimes, we focused on Part 1 crimes, which are defined by the Uniform Crime Reporting Program and encompass the most frequently reported violent and property crimes.
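The percent-change calculation between those two anchor dates might look roughly like this (the table is a toy stand-in with invented numbers, not the actual Vera data):

```r
library(tidyverse)

# Toy stand-in for the jail-population data: one row per county per date
jail_pop <- tribble(
  ~county,    ~date,                 ~population,
  "County A", as.Date("2020-02-29"), 1000,
  "County A", as.Date("2020-04-30"),  700,
  "County B", as.Date("2020-02-29"),  500,
  "County B", as.Date("2020-04-30"),  450
)

# Percent change in jail population from Feb 29 to Apr 30, by county
jail_change <- jail_pop %>%
  group_by(county) %>%
  summarize(
    pct_change = 100 * (population[date == max(date)] -
                        population[date == min(date)]) /
                       population[date == min(date)]
  )
# County A: -30, County B: -10
```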

You'll see that data exploration here was super messy for me. This is an example of what I saw using this function from my very favorite janitor package on the crime description column in just one city. During my exploration, I created this plot in R, which showed the relationship between the magnitude of the change in jail population and the magnitude of the change in crime during that same time frame. Now, it could be really tempting to draw a regression line and try to focus on some type of significant correlation that just isn't there. I received feedback from my colleagues that this graph focused a little bit too much on the outliers, which were one location that increased its jail population, and one location where there was more reported crime than in the previous year. It was then I realized that one of the most important main points that I could make was actually the lack of relationship between X and Y.

So after lots of data cleaning and trying out more charts and receiving more feedback, these were the three main points that I wanted to make clear in my report and visualize to my audience. One, crime was not on the rise compared to the same time period the year prior. Crime rates were instead right within normal range or even lower. Second, most of the locales that we found data for did decrease their jail populations during this time period. And like I said before, just one location actually increased its population. Lastly, and importantly, we found no relationship between the county jail population and its increase or decrease and the number of crimes that were reported.

Iterating on the visualizations

So once I decided on these main points, it was time to come up with a few mock-ups of graphs that I thought would clearly express them. Of course, though, this is not always easy, especially when you realize that you need to take your audience into consideration. So unfortunately, this is an example of a time where I thought I nailed it, I'd come to the perfect solution, but I hadn't yet considered our audience. This is a graph that shows the 29 locations in order of the percent change in jail population. The gray bar indicates the range of monthly crimes between January 2018 and February 2020, and the three dots indicate the number of monthly crimes reported in March, April, and May of 2020. You'll see that a lot of the time, the three dots are to the left of the gray bar, but never are they to the right of it, which indicates that reported crime during these first few months of the pandemic was within normal or expected range, or lower.
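A chart like the one described, a gray range bar with dots for the three pandemic months, can be sketched in ggplot2 roughly like this (two locations with invented numbers; this is my reconstruction of the idea, not the published code):

```r
library(tidyverse)

# Toy data: historical range of monthly crime counts per location
# (standing in for Jan 2018 - Feb 2020), values invented
ranges <- tribble(
  ~location, ~lo,  ~hi,
  "City A",  800, 1200,
  "City B",  300,  500
)

# Toy data: monthly crimes for March, April, May 2020
recent <- tribble(
  ~location, ~month, ~crimes,
  "City A",  "Mar",  750,
  "City A",  "Apr",  700,
  "City A",  "May",  820,
  "City B",  "Mar",  280,
  "City B",  "Apr",  260,
  "City B",  "May",  310
)

# Gray bar = historical range; dots = the three pandemic months
ggplot() +
  geom_linerange(data = ranges,
                 aes(y = location, xmin = lo, xmax = hi),
                 color = "gray70", linewidth = 6) +
  geom_point(data = recent,
             aes(y = location, x = crimes)) +
  labs(x = "Monthly reported crimes", y = NULL)
```

Dots landing left of (or inside) the gray bar read as "within normal range or lower," which is exactly the point the chart was trying to make.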

I will give credit to Lucia, our chief analytics officer, for scribbling down this idea in Slack, of course, but will give credit to myself for what I felt was kind of incredible execution. The team of data scientists, of course, loved it. I thought I was basically done. But let's remember, your audience is key. So the next day, I received the following feedback from several non-analytics folks. It honestly was a little bit like reading my own bad reviews, but it's important not to look at constructive feedback that way. These colleagues were speaking up about some major issues with this graph. You might agree. I agree now. They were right.

I see feedback that's qualified with things like "I'm not a numbers person" a lot. People sometimes blame themselves for not knowing how to interpret a data viz, but it's actually our job to make sure that the viz works for them and works for our audience. So this crimes and decarceration report was not just for data scientists. It was for a far more general audience, and this chart was clearly not working.

So sometimes this means going back to the drawing board, doing a little bit more data exploration, always with the janitor and ggplot packages for me, and trying out new ways of visualizing your main points while thinking about your audience. These charts were hashtag innoventive, but they simply weren't working. Some of the most valuable feedback conversations I had were not over Slack, so I have no hard evidence, but when you're asking for feedback from someone, it's really important to ask the why behind what isn't working for them. Sometimes the why might be surprising, or maybe it'll make you realize that what you need to do is add better annotation, rather than scrapping the whole thing and starting over.

And in the spirit of reiterating that this process was one messy adventure, before showing you the final visualizations we ended up with, at the end of the day, I really feel like this quote just kind of drives it all home.

So these were the final graphics that we published on our site. Both were created in R and touched up in Illustrator. Both, I feel, do a good job of expressing our three main points, and are not just interpretable for, quote, "numbers people." I know these are probably a little bit hard to see, so I'll share a link to my report in my presentation, as well as where you can find the R code for the analysis and viz.

And one last wrench I'll throw at you, because it was thrown at me: it turns out police data is often dynamic, meaning that a crime that occurred maybe a month ago could have been reported today. I cannot stress enough how much of a lifesaver reusable code in R was throughout this process, given the amount of extremely last-minute data dumps and swapping out of CSV files here and there. We wanted to be precise, and it's thanks to R that remaking these graphs time and time again was such a small deal.
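The "reusable code" idea can be as simple as wrapping the whole read-clean-plot pipeline in a function, so a late data dump means one re-run rather than a rewrite (the file path and column names below are hypothetical):

```r
library(tidyverse)

# Sketch: one function from raw CSV to finished trend chart, so updated
# data only requires calling it again on the new file
make_crime_plot <- function(csv_path) {
  read_csv(csv_path) %>%
    janitor::clean_names() %>%
    count(month = lubridate::floor_date(report_date, "month")) %>%
    ggplot(aes(month, n)) +
    geom_line() +
    labs(x = NULL, y = "Reported crimes per month")
}

# When a city re-releases its data, one call rebuilds the graphic:
# ggsave("crime_trend.png", make_crime_plot("data/city_crime_latest.csv"))
```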

So hopefully through this process, I've been able to convince you that data viz is, and should often be, non-linear. Mentors in my life have taught me that being good at making effective visualizations is not about getting from point A to point B the quickest. It's actually carefully considering each of these points that makes you an effective communicator to your audience.


Lastly, feel free to reach out to me if you have any questions or just want to nerd out about data viz. And thank you.