Resources

Larry Fenn | Journalism with RStudio, R, and the tidyverse | RStudio (2020)

The Associated Press data team uses R and the tidyverse as its main tools for data processing and analysis. This talk showcases some of the technology behind published stories:

- Using dbplyr to work off a hosted database containing 380 million opioid records to identify "pill mills"
- Using open-sourced AP style templates for R Markdown and ggplot to quickly produce graphics and reports off breaking news
- Using R Markdown and htmlwidgets to give reporters and editors interactive reports to identify reporting leads


Transcript

This transcript was generated automatically and may contain errors.

And now we can begin, so please join me in welcoming Larry Fenn.

Good afternoon, everyone. I'll get right to it, and say just a brief bit about myself. I'm going to talk about journalism at the Associated Press. We use R, at least on the data team, for pretty much every part of our workflow. We really couldn't do it without R and RStudio and the libraries that exist for R.

Brief high-level summary, the AP is a not-for-profit cooperative news agency. I'm a data journalist on the data team. And so typical AP coverage includes but is not limited to breaking news. So that is, like, non-negotiable. That's part of our blood. That's kind of, you know, when photographers run out to grab a camera, we, as nerdy as it sounds, go looking for data, go looking for the story in the data. You know, it's a train crash, we look up level-crossing data, stuff like that.

We also cover investigative reporting. So that's sort of your long-form deep dives into complex or large data sets. Something else about the scope of what we do, it's pretty much every level you can imagine geographically, local, municipal issues, all the way up to international. Another big thing, and I think this is in common with a lot of journalism news agencies, not just the AP, but there's collaboration both inside with domain experts who are not necessarily so tech-savvy or data-savvy, but also collaboration outside. People who, again, with a wide variety of backgrounds and levels of expertise, and it's very important to try to synthesize these viewpoints together.

And finally, what is it that we produce? At the AP, we produce pretty much everything you can imagine from data, text copy, graphics, broadcast graphics, social, interactive experiences, like the fancy ones you see on the web. We try to do that, too.

Loading data

So that's just a high-level overview of what we do. Here's what I'll try to cover. I'm leaving a lot out, because a whole lot of how journalism works is out of scope. But when it comes to data analysis, let's start with how we load in the data. There are a couple of really remarkable tools that R makes available to us. First is dbplyr.

I don't know if people are familiar with this one. This is a really great library, in particular when your data sets are extremely large. This was the ARCOS pharmaceutical database; I believe it's some hundreds of millions of rows. Even with a laptop like this one, you really couldn't work with it feasibly. So we have it hosted, and you can see here there's a connection to it. This is not a tutorial session, but the cool thing is that once you run this code, you get this connection, and with dbplyr the connection behaves basically like a data frame. You'll see here it says database. The idea is you're still able to use all those verbs you're used to from the tidyverse, but you're hitting some SQL database somewhere in the world. Incredibly useful for us.
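
A minimal sketch of that pattern; the host, credentials, table name "arcos", and column names here are illustrative assumptions, not the AP's actual setup:

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Connect to a hosted Postgres database (details are placeholders)
con <- dbConnect(
  RPostgres::Postgres(),
  host     = "db.example.org",
  dbname   = "opioids",
  user     = Sys.getenv("DB_USER"),
  password = Sys.getenv("DB_PASS")
)

# tbl() returns a lazy reference, not a data frame in memory
arcos <- tbl(con, "arcos")

# Ordinary dplyr verbs are translated to SQL and run on the server;
# rows only come back to R when collect() is called
top_buyers <- arcos %>%
  group_by(buyer_dea_no) %>%
  summarise(total_pills = sum(dosage_unit, na.rm = TRUE)) %>%
  arrange(desc(total_pills)) %>%
  head(20) %>%
  collect()
```

`show_query()` on the pipeline (before `collect()`) prints the SQL that dbplyr generates, which is handy when debugging.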

Another one is Google Forms and Google Sheets. This is often what we need to do when we collect data for a story, because the data doesn't exist. It's kind of awe-inspiring that we don't have to pay for Mechanical Turk; we can just sit down, distribute work between ourselves, and digest it all. The workflow, the way it looks, I'm just going to show off a bit. We were doing this story about NFL injuries, and that data's out there, but we wanted to make sure we were counting exactly what we wanted to count and drawing the lines where we wanted to draw them. So we came up with our own little survey that we could hand out to reporters, and then we could put it all together like a sort of human-powered MapReduce. You turn that crank, you get this great data sheet at the end, and then you plug into it using googlesheets4.

So that looks like this, and if you've used googlesheets4, it's pretty straightforward. There's an OAuth step, but then you just call read_sheet() and you get the data right away, ready to go. So that's another thing we do.
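
A sketch of that googlesheets4 workflow; the sheet URL and tab name are placeholders:

```r
library(googlesheets4)

# One-time OAuth step: opens a browser to authorize access
gs4_auth()

# read_sheet() returns a tibble ready for tidyverse work
injuries <- read_sheet(
  "https://docs.google.com/spreadsheets/d/<sheet-id>",
  sheet = "responses"
)
```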

Last is httr for APIs. That includes ones that exist in the world, like the BLS (Labor Statistics) API and the Census API, but also our homebrewed databases and data admin tools. One I'll call out here: we were looking at mass killings, and we rolled our own whole system, using httr to get the authentication token and to pull the relevant data files off of the mass killings database we built. The use case, of course, is that we often collaborate with other organizations. It seems like overkill, but building a standalone data admin tool actually saves a lot of headaches when collaborating with other large organizations.
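
A hypothetical sketch of that token-then-fetch pattern with httr; the endpoint URLs and field names are invented for illustration, not the AP's real API:

```r
library(httr)

# Exchange credentials for an auth token
auth <- POST(
  "https://data.example.org/api/token",
  body = list(user = Sys.getenv("API_USER"),
              pass = Sys.getenv("API_PASS")),
  encode = "json"
)
token <- content(auth)$token

# Use the token to request the data
resp <- GET(
  "https://data.example.org/api/mass-killings",
  add_headers(Authorization = paste("Bearer", token))
)
stop_for_status(resp)         # fail loudly on HTTP errors
records <- content(resp, as = "parsed")
```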

Template graphics

The next thing is template graphics. This is something we're doing and want to do more of, and it's an endless struggle to make it better and better. The convenient thing for us is that there's an existing newsroom style guide. We're not the first people to do graphs. That means it's really easy for us to develop a theme package using ggplot; we just follow very prescriptively what the style guide says. This is actually available on the Associated Press GitHub (github.com/associatedpress/aptheme) if you want to ape the graphical style. You may not have the proprietary font, my apologies; you'll have to substitute some other font.

But the cool thing is this also facilitates conversation between us and the other graphics-capable people in the newsroom, because we can say: all right, does this work? Is this the style of graphic we want to make? How does this comport with our style guidelines? With ggplot, that prototyping was so quick that working it out was easy, and once we have that theme in hand, you can come up with graphics basically as often as you have the idea for them. The other thing is we had to make sure it fit into the existing publishing workflows. We have a lot of artists; they use Illustrator, they use vector graphics. So it was really handy that in the R universe you can just export these graphics as vector graphics. Perfect as far as plugging into how things already work.
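
The pattern looks roughly like this; `theme_ap()` below is a stand-in I've defined for illustration, not the actual function from the AP theme package (check the repo for the real one):

```r
library(ggplot2)

# Stand-in for a newsroom theme: codify style-guide rules once,
# then apply them to every chart
theme_ap <- function(base_family = "sans") {
  theme_minimal(base_family = base_family) +
    theme(
      panel.grid.minor = element_blank(),
      plot.title       = element_text(face = "bold")
    )
}

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  labs(title = "Highway mileage vs. engine size") +
  theme_ap()
```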

Reporting findings with R Markdown

So next we'll talk about reporting the findings. R Markdown has come up already in some of these case-study presentations, and it's pretty much the same story here. People are in a hurry; no matter who we're talking to, internally or externally, no one has time to read stuff. So R Markdown is a crucial tool to help condense what would otherwise be a long, detailed, rambling notebook into something very concise. The other advantage: we used to roll our own things before R Markdown and R Notebooks came around. We'd make these web pages, these graphics, and it was a nightmare trying to make them work for people browsing at home on an iPad or something. Terrible. But R Markdown is a set of really handy existing templates and tools that we can plug into.

So let me just show off an example. The NFL injuries work ended up being turned into something simple like this, so that we can hand it to whoever is newly joining the project: hey, we're going to catch you up, this is what we've already got, this is what we're studying, and this is where we're going to go with the story. Really handy. The DT data table that you see at the top here: great, we love it. There are the filters. You can also style it. We don't do it that often, but you can selectively highlight, like, make all the values above a certain size bold. It's really quite handy.
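
That conditional styling looks like this in DT; the table and threshold here are just a built-in example:

```r
library(DT)

# Interactive table with per-column filters; formatStyle() bolds
# horsepower values above 200 (styleInterval maps cut points to styles)
datatable(mtcars, filter = "top") %>%
  formatStyle(
    "hp",
    fontWeight = styleInterval(200, c("normal", "bold"))
  )
```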

And then Leaflet for maps, you know, oftentimes I think a map is not really the right tool for the job, but no doubt it can be a very useful tool in data exploration when you first get a data set to at least see what is going on inside your data set. So that's where we're using R Markdown pretty heavily, and it's, again, we've made our own R Markdown template, and it's also on the Associated Press GitHub if you want to just steal it.
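
A minimal exploratory leaflet map of the kind described; the points here are made up:

```r
library(leaflet)

pts <- data.frame(
  lng   = c(-73.99, -87.63),
  lat   = c(40.73, 41.88),
  label = c("New York", "Chicago")
)

# addTiles() gives a base map; circle markers plot the data points
leaflet(pts) %>%
  addTiles() %>%
  addCircleMarkers(~lng, ~lat, label = ~label)
```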

Future directions

So I'll just sort of wrap up with future directions. Things are not perfect; we're still working on it. One big gap you might have noticed here is interactivity. Static graphics seem to be in hand; interactive graphics, not so much. We've explored ggiraph, plotly, bokeh, some sort of ggplot-to-D3 conversion. We're still missing that key piece that fits. Part of this is maybe what we need out of it; I'm not claiming there's no good R interactive web graphics workflow, that's not true. But for our particular case, it's still a question of finding the right tool for the job.

Automation, got to have that buzzword. The two big ones we're talking about: one is recurring data. Some stories keep happening, right? Spring housing, labor statistics; that stuff is recurring, and it's from an API. We can get better and better at closing that loop, so that when the data comes in, there's a human involved somewhere to make sure the data is correct, but other than that, all the other work has already been done and we package it up.

The other big one, and this touches on it, is we want to make a data validation tool. Again, not to remove the human element from the whole story, because it's impossible to come up with some magic script that will catch all the errors in your data, but just simple stuff: how many different types of white space characters are in this data file? Does this data file do day, month, year, or year, month, day, or month, day, year? Catching those kinds of problems before they become a real issue. I know people have made some efforts in this space, and we're still working on trying to either contribute or latch on to an existing solution there.
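
Toy versions of those two checks, invented here for illustration rather than taken from any real AP tool:

```r
# Find whitespace characters other than a plain space (tabs,
# non-breaking spaces, etc.) lurking in a character vector
odd_whitespace <- function(x) {
  hits <- regmatches(x, gregexpr("[^\\S ]|\\x{00a0}", x, perl = TRUE))
  unique(unlist(hits))
}

# Flag date strings that parse validly as BOTH month/day/year and
# day/month/year but give different dates
ambiguous_dates <- function(x) {
  d1 <- as.Date(x, format = "%m/%d/%Y")
  d2 <- as.Date(x, format = "%d/%m/%Y")
  x[!is.na(d1) & !is.na(d2) & d1 != d2]
}

odd_whitespace("a\tb\u00a0c")                   # the tab and the NBSP
ambiguous_dates(c("04/05/2020", "25/12/2020"))  # only the first is ambiguous
```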

Working with Python, I know, boo, but there are other people in the organization who work with Python. They make graphics and do all these other reports in it, and thanks to packages like reticulate, we haven't done it yet, but hopefully it's not far off that we can bridge that gap within our organization as well. That'll be a pretty outstanding moment when we get there.
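
A minimal sketch of what that reticulate bridge could look like; the Python helper here is invented for illustration:

```r
library(reticulate)

# Define a Python function inline (in practice you'd source a
# colleague's .py file with source_python())
py_run_string("
def headline_case(s):
    return s.title()
")

# Call it from R via the `py` object
py$headline_case("breaking news from the wire")
```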

And then lastly, the AP does this very particular thing where we hand out data to member papers and news orgs to then localize stories for themselves, and we believe we could automate a lot more of that using R to script and control it. We don't think we need anything more than R to achieve that goal. With that, that's the summary. Any questions?

Q&A

Yeah, a few questions coming through. The first one is, how do you export plots as vector graphics? You'll want svglite; you do need to get that installed, and it has a system dependency. So when you're doing a ggplot... yeah, here, I have this all in this AP theme script. You see here, when you call the ggsave function, if you just make the file name end with .svg, it does it. There is one caveat, though: the graphic artists don't really like the SVGs that svglite spits out, so that's one thing we're also looking into improving. Basically, it's subpar; it doesn't quite line up with the way they like their project files to look. But just make the file name end with .svg and follow the error messages if it doesn't work.
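
That export step in code; ggsave() picks the output device from the file extension, using svglite for .svg when it's installed:

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# The .svg extension is what triggers vector output
ggsave("plot.svg", p, width = 8, height = 5)
```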

All right, next question. When your team hires, are they looking for journalism experience? How important are writing skills versus data analysis skills? That's tricky, because I was hired for my background in this, but the team's core is about writing more than anything else. If we were thinking about what is more in demand, there's still this question of news judgment, because, to pull back the veil, we don't do a lot of Bayesian analysis or neural network modeling. We really don't. We would love to, but the reality is, stories have to be comprehensible to everyone, and that ties our hands as to the degree to which we can really use advanced statistics and mathematical knowledge.


All right, next question. How do you get involved in doing data analysis for journalism? I think one on-ramp that has been kind of cool is that there are a lot of state-level open data communities in particular. These are people who get together over maybe something as simple as election results, or something more complicated like lead-testing results. If you look on GitHub, there is a surprising number of state open data communities that are all about doing data projects in their state. If someone wanted to pursue an interesting story, that might be the easiest way to at least get started. I don't know how you could make money with it, but you could certainly start investigating things at the state level by reaching out to folks who work in open data. Stuff like K-12 education, public health; those data sets are all out there waiting for people to find stories in them.

How can you convince your management to allocate time and resources to developing work like this in R that may be costly up front? That's a tricky one, yeah. At least in the history of when I started, it was often a matter of building good partnerships within the company. You may not be able to get the project off the ground on its own, but at least from the journalism perspective, if there was already a strong investigative project that you could identify as having a strong data angle, that's the on-ramp, because then you can tangibly show how much better the story is with the data angle. Your mileage may vary in different organizations.

Final couple of questions, I think they're from the same person, so I'm just going to ask them together. If other media rely on Associated Press for breaking stories, any future for them also drawing on AP's code, example, for quick visualization to be rebranded? Also just curious, are you paging through your presentation notes in a reporter's notebook? Yeah, I do have a reporter's notebook. Absolutely. You got me dead to rights.

For the code, yeah, we wanted to expand that as well. So we are already doing data distributions, and we are partnered with this organization, Data.World, and they have a sort of web-based SQL browser, and we give them SQL code out of the gate that they can use. But we definitely, you know, we believe strongly that, like, you know, reproducibility and, you know, distributing the code and making it out there, at least for members, hopefully eventually also for the public as well, so they can just, you know, grab the public data, grab the code we used, see the same exact things we saw in the story. That's definitely a future goal for us. Right now, yeah, unfortunately it's sort of limited to whoever is interested. We can't, like, kick down people's doors and force them to use R.

A few more coming in. Can you explain a little more about how you do data validation? What we're hoping to do with the automated tool is catch really simple things. Oftentimes, for example, you're requesting data from someone who doesn't have the best data skills; maybe they did something in Excel, and it truncated the data file at 1.2 million rows or whatever, the Excel problem, right? Picking up that that happened is something we're hoping we can build into some scripting solution.

More generally, though, how do we know that the data is right? That's a long and tricky question, and that's where we intersect with traditional journalism. Oftentimes I'll look at K-12 data, and you'll just see something. You'll do the usual data analysis, see the outlier, and think: that's an extreme outlier. What's the next step? You pick up the phone, call the superintendent, and say, well, according to the survey data, you put this number in. Usually the superintendent will say, oh, that's not right. It's a soft process, unfortunately. I can't give you a cookbook for it; you just have to be constantly critical. And for a lot of the stories where it's really meaningful and worth it, we will ultimately make sure we got it right. Let's say we're reporting on shootings: we'll go over every one of the 1,500 shootings that are in our story, follow up with the police department or something. It takes a lot of time, but it's worth it.

Do you have any sort of peer review for your data analyses? Within the organization, we definitely have these principles, which I think we've helpfully stolen from other communities: a second set of eyes, making sure that nothing hits the wire without multiple people's input on it. More broadly, when it comes to stories, we don't go as far as handing the story pre-written to the people it involves. But many times, for example, if I'm doing something about the Bureau of Labor Statistics, I'll make sure to check in with the BLS economist who prepared that data set and ask up front: is this a valid use of your data? I'm trying to write this story; this is exactly how I've set it up; I'm looking at these numbers and trying to put them into this analysis. And if he says, no, you've got it all wrong, it's seasonally adjusted, you fool, then, you know, yeah.

What is your favorite data-driven story that you've worked on with the AP and why? Okay, there was one that was really hard but really rewarding. It was about K-12 education demographics. It involved this big, big question of charter schools and segregation by race and ethnicity. Obviously a controversial topic, but one thing I really wanted to make sure we could get across, one thing that needed to be in that story and in that data analysis, was to break the traditional approach of saying white, non-white. That makes the analysis easier because you've reduced it to a binary consideration, but in my opinion that's absolutely not the right way to look at it. So it took a lot of time, but we came up with new metrics that captured the multidimensional nature of race and ethnicity, and we kind of picked and stole the best ones from other fields of research. How do people study housing segregation? How do people study distributions of noise and information theory? What is the right way to represent similarity in a way that's deeper than just saying white versus non-white? Because so much of this country is not just a question of white versus non-white.
