Resources

Larry Fenn | Journalism with RStudio, R, and the tidyverse | RStudio (2020)

The Associated Press data team uses R and the tidyverse as its main tools for data processing and analysis. This talk showcases some of the technology behind published stories:

- Using dbplyr to work off a hosted database containing 380 million opioid records to identify "pill mills"
- Using open-sourced AP style templates for R Markdown and ggplot to quickly produce graphics and reports off breaking news
- Using R Markdown and htmlwidgets to give reporters and editors interactive reports to identify reporting leads


Transcript

This transcript was generated automatically and may contain errors.

And now we can begin, so please join me in welcoming Larry Fenn.

Good afternoon, everyone. I'll get right to it, and say just a brief bit about myself. I'm going to talk about journalism at the Associated Press. We use R, at least on the data team, for pretty much every part of our workflow. We really couldn't do it without R and RStudio and the libraries that exist for R.

Brief high-level summary, the AP is a not-for-profit cooperative news agency. I'm a data journalist on the data team. And so typical AP coverage includes but is not limited to breaking news. So that is, like, non-negotiable. That's part of our blood. That's kind of, you know, when photographers run out to grab a camera, we, as nerdy as it sounds, go looking for data, go looking for the story in the data. You know, it's a train crash, we look up level-crossing data, stuff like that.

We also cover investigative reporting. So that's sort of your long-form deep dives into complex or large data sets. Something else about the scope of what we do, it's pretty much every level you can imagine geographically, local, municipal issues, all the way up to international. Another big thing, and I think this is in common with a lot of journalism news agencies, not just the AP, but there's collaboration both inside with domain experts who are not necessarily so tech-savvy or data-savvy, but also collaboration outside. People who, again, with a wide variety of backgrounds and levels of expertise, and it's very important to try to synthesize these viewpoints together.

And finally, what is it that we produce? At the AP, we produce pretty much everything you can imagine from data, text copy, graphics, broadcast graphics, social, interactive experiences, like the fancy ones you see on the web. We try to do that, too.

Loading data

So that's just a high-level overview of what we do. Here's what I'll try to cover. I'm leaving a lot out, because a whole lot of how journalism works is out of scope. But when it comes to data analysis, let's start with how we load in the data. There are a couple of really remarkable tools that R makes available to us. First is dbplyr.

I don't know if people are familiar with this one. This is a really great library, in particular when your data sets are extremely large. This was the ARCOS pharmaceutical database; I believe it's some hundreds of millions of rows. Even with a laptop like this one, you really couldn't work with it feasibly. So we have it hosted, and you can see here there's a connection to it. This is not a tutorial session, but the cool thing is that once you run this code, you get this connection, and with dbplyr the connection behaves basically like a data frame. You'll see here it says database. The idea is you're still able to use all those verbs you're used to from the tidyverse, but you're hitting some SQL database somewhere in the world. Incredibly useful for us.
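
A minimal sketch of that pattern; the host, credentials, table name "arcos", and column names here are illustrative assumptions, not the AP's actual setup:

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Connect to a hosted Postgres database (details are placeholders)
con <- dbConnect(
  RPostgres::Postgres(),
  host     = "db.example.org",
  dbname   = "opioids",
  user     = Sys.getenv("DB_USER"),
  password = Sys.getenv("DB_PASS")
)

# tbl() returns a lazy reference, not a data frame in memory
arcos <- tbl(con, "arcos")

# Ordinary dplyr verbs are translated to SQL and run on the server;
# rows only come back to R when collect() is called
top_buyers <- arcos %>%
  group_by(buyer_dea_no) %>%
  summarise(total_pills = sum(dosage_unit, na.rm = TRUE)) %>%
  arrange(desc(total_pills)) %>%
  head(20) %>%
  collect()
```

`show_query()` on the pipeline (before `collect()`) prints the SQL that dbplyr generates, which is handy when debugging.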

Another one is Google Forms and Google Sheets. This is often what we need to do when we collect data for a story, because the data doesn't exist. It's kind of awe-inspiring that we don't have to pay for Mechanical Turk; we can just sit down, distribute work between ourselves, and digest it all. The workflow, the way it looks, I'm just going to show off a bit. We were doing this story about NFL injuries, and that data's out there, but we wanted to make sure we were counting exactly what we wanted to count and drawing the lines where we wanted to draw them. So we came up with our own little survey that we could hand out to reporters, and then we could put it all together like a sort of human-powered MapReduce. You turn that crank, you get this great data sheet at the end, and then you plug into it using googlesheets4.

So that looks like this, and if you've used googlesheets4, it's pretty straightforward. There's an OAuth step, but then you just call read_sheet() and you get the data right away, ready to go. So that's another thing we do.
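
A sketch of that googlesheets4 workflow; the sheet URL and tab name are placeholders:

```r
library(googlesheets4)

# One-time OAuth step: opens a browser to authorize access
gs4_auth()

# read_sheet() returns a tibble ready for tidyverse work
injuries <- read_sheet(
  "https://docs.google.com/spreadsheets/d/<sheet-id>",
  sheet = "responses"
)
```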

Last is httr for APIs. That includes ones that exist in the world, like the BLS (Labor Statistics) API and the Census API, but also our homebrewed databases and data admin tools. One I'll call out here: we were looking at mass killings, and we rolled our own whole system, using httr to get the authentication token and to pull the relevant data files off of the mass killings database we built. The use case, of course, is that we often collaborate with other organizations. It seems like overkill, but building a standalone data admin tool actually saves a lot of headaches when collaborating with other large organizations.
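
A hypothetical sketch of that token-then-fetch pattern with httr; the endpoint URLs and field names are invented for illustration, not the AP's real API:

```r
library(httr)

# Exchange credentials for an auth token
auth <- POST(
  "https://data.example.org/api/token",
  body = list(user = Sys.getenv("API_USER"),
              pass = Sys.getenv("API_PASS")),
  encode = "json"
)
token <- content(auth)$token

# Use the token to request the data
resp <- GET(
  "https://data.example.org/api/mass-killings",
  add_headers(Authorization = paste("Bearer", token))
)
stop_for_status(resp)         # fail loudly on HTTP errors
records <- content(resp, as = "parsed")
```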

Template graphics

The next thing is template graphics. This is something we're doing and want to do more of, and it's an endless struggle to make it better and better. The convenient thing for us is that there's an existing newsroom style guide. We're not the first people to do graphs. That means it's really easy for us to develop a theme package using ggplot; we just follow very prescriptively what the style guide says. This is actually available on the Associated Press GitHub (github.com/associatedpress/aptheme) if you want to ape the graphical style. You may not have the proprietary font, my apologies; you'll have to substitute some other font.

But the cool thing is this also facilitates conversation between us and the other graphics-capable people in the newsroom, because we can say: all right, does this work? Is this the style of graphic we want to make? How does this comport with our style guidelines? With ggplot, that prototyping was so quick that working it out was easy, and once we have that theme in hand, you can come up with graphics basically as often as you have the idea for them. The other thing is we had to make sure it fit into the existing publishing workflows. We have a lot of artists; they use Illustrator, they use vector graphics. So it was really handy that in the R universe you can just export these graphics as vector graphics. Perfect as far as plugging into how things already work.
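
The pattern looks roughly like this; `theme_ap()` below is a stand-in I've defined for illustration, not the actual function from the AP theme package (check the repo for the real one):

```r
library(ggplot2)

# Stand-in for a newsroom theme: codify style-guide rules once,
# then apply them to every chart
theme_ap <- function(base_family = "sans") {
  theme_minimal(base_family = base_family) +
    theme(
      panel.grid.minor = element_blank(),
      plot.title       = element_text(face = "bold")
    )
}

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  labs(title = "Highway mileage vs. engine size") +
  theme_ap()
```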

Reporting findings with R Markdown

So next we'll talk about reporting the findings. R Markdown has come up already in some of these case-study presentations, and it's pretty much the same story here. People are in a hurry; no matter who we're talking to, internally or externally, no one has time to read stuff. So R Markdown is a crucial tool to help condense what would otherwise be a long, detailed, rambling notebook into something very concise. The other advantage: we used to roll our own things before R Markdown and R Notebooks came around. We'd make these web pages, these graphics, and it was a nightmare trying to make them work for people browsing at home on an iPad or something. Terrible. But R Markdown is a set of really handy existing templates and tools that we can plug into.

So let me just show off an example. The NFL injuries work ended up being turned into something simple like this, so that we can hand it to whoever is newly joining the project: hey, we're going to catch you up, this is what we've already got, this is what we're studying, and this is where we're going to go with the story. Really handy. The DT data table that you see at the top here: great, we love it. There are the filters. You can also style it. We don't do it that often, but you can selectively highlight, like, make all the values above a certain size bold. It's really quite handy.
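
That conditional styling looks like this in DT; the table and threshold here are just a built-in example:

```r
library(DT)

# Interactive table with per-column filters; formatStyle() bolds
# horsepower values above 200 (styleInterval maps cut points to styles)
datatable(mtcars, filter = "top") %>%
  formatStyle(
    "hp",
    fontWeight = styleInterval(200, c("normal", "bold"))
  )
```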

And then Leaflet for maps, you know, oftentimes I think a map is not really the right tool for the job, but no doubt it can be a very useful tool in data exploration when you first get a data set to at least see what is going on inside your data set. So that's where we're using R Markdown pretty heavily, and it's, again, we've made our own R Markdown template, and it's also on the Associated Press GitHub if you want to just steal it.
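
A minimal exploratory leaflet map of the kind described; the points here are made up:

```r
library(leaflet)

pts <- data.frame(
  lng   = c(-73.99, -87.63),
  lat   = c(40.73, 41.88),
  label = c("New York", "Chicago")
)

# addTiles() gives a base map; circle markers plot the data points
leaflet(pts) %>%
  addTiles() %>%
  addCircleMarkers(~lng, ~lat, label = ~label)
```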

Future directions

So I'll just sort of wrap up with future directions. Things are not perfect; we're still working on it. One big gap you might have noticed here is interactivity. Static graphics seem to be in hand; interactive graphics, not so much. We've explored ggiraph, plotly, bokeh, some sort of ggplot-to-D3 conversion. We're still missing that key piece that fits. Part of this is maybe what we need out of it; I'm not claiming there's no good R interactive web graphics workflow, that's not true. But for our particular case, it's still a question of finding the right tool for the job.

Automation, got to have that buzzword. The two big ones we're talking about: one is recurring data. Some stories keep happening, right? Spring housing, labor statistics; that stuff is recurring, and it's from an API. We can get better and better at closing that loop, so that when the data comes in, there's a human involved somewhere to make sure the data is correct, but other than that, all the other work has already been done and we package it up.

The other big one, and this touches on it, is we want to make a data validation tool. Again, not to remove the human element from the whole story, because it's impossible to come up with some magic script that will catch all the errors in your data, but just simple stuff: how many different types of white space characters are in this data file? Does this data file do day, month, year, or year, month, day, or month, day, year? Catching those kinds of problems before they become a real issue. I know people have made some efforts in this space, and we're still working on trying to either contribute or latch on to an existing solution there.
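
Toy versions of those two checks, invented here for illustration rather than taken from any real AP tool:

```r
# Find whitespace characters other than a plain space (tabs,
# non-breaking spaces, etc.) lurking in a character vector
odd_whitespace <- function(x) {
  hits <- regmatches(x, gregexpr("[^\\S ]|\\x{00a0}", x, perl = TRUE))
  unique(unlist(hits))
}

# Flag date strings that parse validly as BOTH month/day/year and
# day/month/year but give different dates
ambiguous_dates <- function(x) {
  d1 <- as.Date(x, format = "%m/%d/%Y")
  d2 <- as.Date(x, format = "%d/%m/%Y")
  x[!is.na(d1) & !is.na(d2) & d1 != d2]
}

odd_whitespace("a\tb\u00a0c")                   # the tab and the NBSP
ambiguous_dates(c("04/05/2020", "25/12/2020"))  # only the first is ambiguous
```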

Working with Python, I know, boo, but there are other people in the organization who work with Python. They make graphics and do all these other reports in it, and thanks to packages like reticulate, we haven't done it yet, but hopefully it's not far off that we can bridge that gap within our organization as well. That'll be a pretty outstanding moment when we get there.
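
A minimal sketch of what that reticulate bridge could look like; the Python helper here is invented for illustration:

```r
library(reticulate)

# Define a Python function inline (in practice you'd source a
# colleague's .py file with source_python())
py_run_string("
def headline_case(s):
    return s.title()
")

# Call it from R via the `py` object
py$headline_case("breaking news from the wire")
```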

And then lastly, the AP does this very particular thing where we hand out data to member papers and news orgs to then localize stories for themselves, and we believe we could automate a lot more of that using R to script and control it. We don't think we need anything more than R to achieve that goal. With that, that's the summary. Any questions?

Q&A

Yeah, a few questions coming through. The first one is, how do you export plots as vector graphics? You'll want svglite; you do need to get that installed, and it has a system dependency. So when you're doing a ggplot... yeah, here, I have this all in this AP theme script. You see here, when you call the ggsave function, if you just make the file name end with .svg, it does it. There is one caveat, though: the graphic artists don't really like the SVGs that svglite spits out, so that's one thing we're also looking into improving. Basically, it's subpar; it doesn't quite line up with the way they like their project files to look. But just make the file name end with .svg and follow the error messages if it doesn't work.
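
That export step in code; ggsave() picks the output device from the file extension, using svglite for .svg when it's installed:

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# The .svg extension is what triggers vector output
ggsave("plot.svg", p, width = 8, height = 5)
```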

All right, next question. When your team hires, are they looking for journalism experience? How important are writing skills versus data analysis skills? That's tricky, because I was hired for my background in this, but the team's core is about writing more than anything else. If we were thinking about what is more in demand, there's still this question of news judgment, because, to pull back the veil, we don't do a lot of Bayesian analysis or neural network modeling. We really don't. We would love to, but the reality is, stories have to be comprehensible to everyone, and that ties our hands as to the degree to which we can really use advanced statistics and mathematical knowledge.


All right, next question. How do you get involved in doing data analysis for journalism? I think one on-ramp that has been kind of cool is that there are a lot of state-level open data communities in particular. These are people who get together over maybe something as simple as election results, or something more complicated like lead-testing results. If you look on GitHub, there is a surprising number of state open data communities that are all about doing data projects in their state. If someone wanted to pursue an interesting story, that might be the easiest way to at least get started. I don't know how you could make money with it, but you could certainly start investigating things at the state level by reaching out to folks who work in open data. Stuff like K-12 education, public health; those data sets are all out there waiting for people to find stories in them.

How can you convince your management to allocate time and resources to developing work like this in R that may be costly up front? That's a tricky one, yeah. At least in the history of when I started, it was often a matter of building good partnerships within the company. You may not be able to get the project off the ground on its own, but at least from the journalism perspective, if there was already a strong investigative project that you could identify as having a strong data angle, that's the on-ramp, because then you can tangibly show how much better the story is with the data angle. Your mileage may vary in different organizations.

Final couple of questions, I think they're from the same person, so I'm just going to ask them together. If other media rely on Associated Press for breaking stories, any future for them also drawing on AP's code, example, for quick visualization to be rebranded? Also just curious, are you paging through your presentation notes in a reporter's notebook? Yeah, I do have a reporter's notebook. Absolutely. You got me dead to rights.

For the code, yeah, we wanted to expand that as well. So we are already doing data distributions, and we are partnered with this organization, Data.World, and they have a sort of web-based SQL browser, and we give them SQL code out of the gate that they can use. But we definitely, you know, we believe strongly that, like, you know, reproducibility and, you know, distributing the code and making it out there, at least for members, hopefully eventually also for the public as well, so they can just, you know, grab the public data, grab the code we used, see the same exact things we saw in the story. That's definitely a future goal for us. Right now, yeah, unfortunately it's sort of limited to whoever is interested. We can't, like, kick down people's doors and force them to use R.

A few more coming in. Can you explain a little more about how you do data validation? What we're hoping to do with the automated tool is catch really simple things. Oftentimes, for example, you're requesting data from someone who doesn't have the best data skills; maybe they did something in Excel, and it truncated the data file at 1.2 million rows or whatever, the Excel problem, right? Picking up that that happened is something we're hoping we can build into some scripting solution.

More generally, though, how do we know that the data is right? That's a long and tricky question, and that's where we intersect with traditional journalism. Oftentimes I'll look at K-12 data, and you'll just see something. You'll do the usual data analysis, see the outlier, and think: that's an extreme outlier. What's the next step? You pick up the phone, call the superintendent, and say, well, according to the survey data, you put this number in. Usually the superintendent will say, oh, that's not right. It's a soft process, unfortunately. I can't give you a cookbook for it; you just have to be constantly critical. And for a lot of the stories where it's really meaningful and worth it, we will ultimately make sure we got it right. Let's say we're reporting on shootings: we'll go over every one of the 1,500 shootings that are in our story, follow up with the police department or something. It takes a lot of time, but it's worth it.

Do you have any sort of peer review for your data analyses? Within the organization, we definitely have these principles, which I think we've helpfully stolen from other communities: a second set of eyes, making sure that nothing hits the wire without multiple people's input on it. More broadly, when it comes to stories, we don't go as far as handing the story pre-written to the people it involves. But many times, for example, if I'm doing something about the Bureau of Labor Statistics, I'll make sure to check in with the BLS economist who prepared that data set and ask up front: is this a valid use of your data? I'm trying to write this story; this is exactly how I've set it up; I'm looking at these numbers and trying to put them into this analysis. And if he says, no, you've got it all wrong, it's seasonally adjusted, you fool, then, you know, yeah.

What is your favorite data-driven story that you've worked on with the AP and why? Okay, there was one that was really hard but really rewarding. It was about K-12 education demographics. It involved this big, big question of charter schools and segregation by race and ethnicity. Obviously a controversial topic, but one thing I really wanted to make sure we could get across, one thing that needed to be in that story and in that data analysis, was to break the traditional approach of saying white, non-white. That makes the analysis easier because you've reduced it to a binary consideration, but in my opinion that's absolutely not the right way to look at it. So it took a lot of time, but we came up with new metrics that captured the multidimensional nature of race and ethnicity, and we kind of picked and stole the best ones from other fields of research. How do people study housing segregation? How do people study distributions of noise and information theory? What is the right way to represent similarity in a way that's deeper than just saying white versus non-white? Because so much of this country is not just a question of white versus non-white.
