Resources

Andrew Tran | The Opioid Files: Turning big pharmacy data over to the public | RStudio (2021)

About Andrew: Andrew is a data reporter on the rapid-response investigative team at The Washington Post who has analyzed how covid-19 has disproportionately impacted certain communities, the spread of opioids across the country, and the rise of right-wing violence. He shared in winning the Pulitzer Prize for Investigative Reporting in 2018. He's an advocate for open data and reproducibility in journalism.

Transcript

This transcript was generated automatically and may contain errors.

Hi, my name is Andrew Ba Tran, and I'm a data reporter for the Washington Post's investigative team. And I'm here to talk about The Opioid Files and how we investigated pharmaceutical companies and the government and made data accessible to all.

This is a two-week-old boy in an isolation ward in West Virginia. Any sound seriously disturbs him and he won't stop crying, and that's because he was born addicted to opioids. And he is one among thousands of children separated from their parents and families and forced into state care by an opioid epidemic that has fractured families in every corner of the country.

In West Virginia alone, more than 5,200 lives over the past few decades have been lost. And we've seen the effects of this. We know that the number of children in foster care in West Virginia has doubled ever since the epidemic started.

And other effects that we've been able to see are the death rates in communities across the country. In West Virginia, we can see the effects of the opioid death rates, and we can also see that there were also higher rates of child abuse and neglect claims in those same areas. State officials said that more than 80% of children in foster care have been affected by the drug epidemic.

The ARCOS data set

So we've seen the effects, but we've never really been able to see the causes before, not with a lot of specificity. But the thing is, pharmaceutical companies and the Drug Enforcement Administration have been able to see this for the longest time, because every single pill from manufacturer to distributor to pharmacy is tracked in a data set called the Automated Reports and Consolidated Orders System.

And the Drug Enforcement Administration is also under the Department of Justice. And it's tracked because prescription pain pills are Schedule II drugs, which means they have a high potential for abuse, with use potentially leading to severe psychological and physical dependence. And we call it ARCOS for short.

And so pharmaceutical companies and the government have known the scope of this data for the longest time, the reach of the spread of these pills. And journalists and researchers have tried to get this data set for the longest time as well. But we've only been able to get it at the aggregate level, maybe by state, maybe by county, maybe by zip code, maybe by overall drug. We've never been able to get it at the raw data set level, where we could really dig into the nuances of the data. Every time we've made a request for it, it's been denied.

That is until recently.

So for context, the most powerful people in pharmaceutical America didn't want it released. And a lot of the most powerful people in the Department of Justice didn't want this data released. However, when 2,000 communities across the US sued dozens of big drug companies over the destruction that painkillers were causing across the country, a federal judge ordered that the ARCOS data set be made available to those plaintiffs for examination. However, they sealed it away from the public and from the media.

And so the Washington Post, as well as the Charleston Gazette-Mail, sued for access to that data. And so we appealed the decision. And a group of appellate judges ruled that the protective order on the ARCOS data should be amended. In effect, they released data from 2006 to 2012, and eventually 2014. And so we got our hands on the data the day it was released.

Analyzing the data

And what we found out was that the data set was huge. It had about 42 columns, which isn't too bad, but it had literally hundreds of millions of rows of data for every pain pill order sold between 2006 and eventually 2014. And so the team of us, there were about three data reporters, including myself. We used a mix of different statistical analyzing languages, like Python and R. And so in order to maintain great communication between the two, we used data structures that could operate between the two languages, such as Parquet and Feather. And because the data sets were so huge, we used packages like doParallel in order to handle all that.
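As a rough illustration of how a file too large to load comfortably can be processed in pieces, the sketch below reads rows in chunks, totals pills per county for each chunk, and then combines the partial totals. It is written in Python with the standard library rather than with doParallel, and it is not the Post's actual pipeline. The column names (BUYER_STATE, BUYER_COUNTY, DOSAGE_UNIT) and the sample rows are assumptions made up for the example.

```python
import csv
import io
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def read_chunks(handle, chunk_size):
    """Yield lists of row dicts, chunk_size rows at a time."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def pills_by_county(chunk):
    """Partial result for one chunk: dosage units per (state, county)."""
    totals = Counter()
    for row in chunk:
        key = (row["BUYER_STATE"], row["BUYER_COUNTY"])  # assumed column names
        totals[key] += int(float(row["DOSAGE_UNIT"]))    # assumed column name
    return totals


def total_pills(handle, chunk_size=100_000, workers=4):
    """Apply pills_by_county to every chunk, then merge the partial Counters.

    Threads keep the sketch simple; for CPU-bound work a process pool would be
    the closer analogue of a parallel backend. Note that Executor.map submits
    every chunk up front, so a real pipeline would bound how many are queued.
    """
    combined = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(pills_by_county, read_chunks(handle, chunk_size)):
            combined.update(partial)  # Counter.update adds counts together
    return combined


# Made-up rows, only to show the shape of the input.
sample = io.StringIO(
    "BUYER_STATE,BUYER_COUNTY,DOSAGE_UNIT\n"
    "WV,MINGO,100\n"
    "WV,MINGO,50\n"
    "KY,PIKE,30\n"
)
totals = total_pills(sample, chunk_size=2)
```

Because each chunk produces an independent partial total, the same split-and-combine shape carries over to whichever parallel backend is available.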

And in order to exchange our analysis methodologies and reports, we wrote up everything in Jupyter Notebooks or R Markdown. I also created a Shiny app so that we could investigate and dig into pharmacies that looked suspicious, because any way you might have sliced it would raise a different set of pharmacies that might have set off some suspicious orders.

What the data revealed

And so this data set was revelatory because it showed where each pill moved from manufacturer to distributor to pharmacy for the first time. And so we could see everything both in aggregate and fine detail. And this is what we found out.

We could see that 100 billion pills saturated the country between 2006 and 2014. And pharmaceutical companies from manufacturers to distributors to pharmacies made billions of dollars a year. We could finally see the connection between the deaths from opioids and how they soared in communities that were flooded with pain pills. And we saw that the highest per capita death rates nationwide were in rural West Virginia, Kentucky, and Virginia.

And we could see that just six companies distributed 76% of the pills during this period. And we could drill down into individual drug companies and pharmacy companies like Walgreens, where we noticed that at the height of the crisis, Walgreens sold nearly one out of five of the most addictive opioids, and that they had a really, really interesting company structure. They not only had a franchise of pharmacies out there, they also owned a distribution company as well. So it was very much vertical integration all the way.

So while a lot of mom and pop pharmacies could argue that they didn't really know exactly the impact of all the surge in orders for opioids in their pharmacy, Walgreens knew exactly how much and where these pills were going.

And so we went through the effort of geocoding 200,000 or so pharmacies so that we could put it on a map and then really analyze geographic patterns that way. And we made an interactive that allowed people to search their local pharmacy and see the scale of sales they did compared to their neighbors.

A case study: Walgreens in Florida

And so we could really drill down into individual pharmacies like this one in Florida. This is a Walgreens in Florida. And what's important to really note is that it was the law for these companies to implement an effective system for monitoring their opioid sales for suspicious orders. And so we found so many pharmacies that violated those suspicious orders but kept selling anyway.

This is an example of a store in Oviedo, Florida, where starting in 2011, sales started spiking. You can see it declined a little bit after legislators passed a bill to limit the sales for a specific type of oxycodone, which slowed it for a month until customers realized they could order a different milligram strength and get that much more easily. And so that started spiking even more exponentially.

It got so bad that a police chief in that town made so many arrests that he wrote a message to Walgreens executives begging them to change their policies and to restrict it more and to not sell to people who had been previously arrested for selling opioids on the street and to stop selling to people who had been rejected from other pharmacies with the same prescription.

But you can see it stopped for maybe a month or slowed down for about a month until it started growing in sales once again. And it wasn't until around 2012 that sales finally started slowing down. But that was when they started realizing that the DEA was investigating several stores as well as a distribution center, until eventually that store was suspended. And then the distribution center itself in Florida also got suspended.

You can see here that it sent billions of pills a month to Walgreens stores across the Southeast and Florida. And you can see here, thanks to the data, that even though they had been halted starting in 2012, that another distribution center stepped up to fill those orders that this particular distribution center stopped handling. So Walgreens barely slowed down their sales.

Making the data public

What was interesting about this data set and the story that we did was that we made the data widely available to the public. And we put out explicit instructions on how to use the data as well. Because the thing is, the data set was so huge that there were more stories than we could possibly tell by ourselves. And the data could be applied to specific communities to see what they thought was more interesting and applied to them.

So on the first main page where you could see the results of the opioid sales across the country, there was a section where folks could navigate and find their state and their county and see exactly how many pills were sold there, which were the top distributors, which were the top manufacturers, which were the top pharmacies, and get the numbers for them. And not only that, get generated viz specific to their county, and also the summarized tables, as well as the raw data itself, if they wanted to do deeper analysis.

On top of that, we took that data and we used plumber to create an API out of it. And what was great about plumber is that with very little coding, it will self-generate some great documentation behind this data portal. We also created an R wrapper around it for more ease for the special R users. And then a researcher out of Maryland saw this and reached out and went and made a Python package around this as well.
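To give a sense of what a thin wrapper around such an API does, here is a minimal Python sketch that turns a state and county into a request URL and, optionally, fetches the JSON. It is not the actual R wrapper or the Python package mentioned in the talk: the host, the endpoint name `county_data`, and the parameter names are all hypothetical placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical placeholder host, not the real API address.
BASE_URL = "https://example.org/v1"


def county_data_url(state, county, key, base_url=BASE_URL):
    """Build the request URL for one county (endpoint and parameters assumed)."""
    query = urlencode({"state": state.upper(), "county": county, "key": key})
    return f"{base_url}/county_data?{query}"


def fetch_county_data(state, county, key, base_url=BASE_URL):
    """Fetch and decode the JSON response; requires a real, reachable host."""
    with urlopen(county_data_url(state, county, key, base_url)) as response:
        return json.load(response)


url = county_data_url("wv", "Mingo", "demo")
```

The point of a wrapper like this is that users only supply a state and a county; URL construction, encoding, and decoding stay hidden behind one function call.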

And on top of that, we created a bunch of vignettes and walkthroughs so that folks could see exactly how we did our work, but also figure out how to tell stories relevant to their communities as well. And we included functionality in the package to slice out the data beyond what we had done in our stories ourselves, such as being able to look at individual drugs in the data set. For our stories, we focused on oxycodone and hydrocodone, but there are 12 other drugs in this data set from ARCOS that we didn't even look at.

The map on the right visualizes all the methadone sales in Boston. It's this area called Methadone Mile. And we made sure to include a bunch of functions that we anticipated people would find useful, such as bringing in the latitude and longitude for all the pharmacies that we looked up. We brought in population tables for state, county, and all the years the data was available so that folks could normalize it well. And also all the summarized tables, because it was important to make data accessible.
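The reason for shipping population tables alongside the sales data is that raw pill counts mostly track how many people live somewhere. A minimal Python sketch of that normalization follows, assuming both tables are keyed by (county, year). The function name and table layout are illustrative and not taken from the package.

```python
def pills_per_person_per_year(pills_by_county_year, population_by_county_year):
    """Average yearly pills per resident for each county.

    Both arguments map (county, year) -> number. County-years without a
    population estimate are skipped rather than guessed.
    """
    yearly_rates = {}
    for (county, year), pills in pills_by_county_year.items():
        population = population_by_county_year.get((county, year))
        if not population:
            continue
        yearly_rates.setdefault(county, []).append(pills / population)
    # Average the per-year rates so counties with more years of data
    # are not inflated relative to counties with fewer.
    return {county: sum(r) / len(r) for county, r in yearly_rates.items()}


# Made-up numbers, only to show the calculation.
pills = {("A", 2006): 1000, ("A", 2007): 3000, ("B", 2006): 500}
population = {("A", 2006): 100, ("A", 2007): 100, ("B", 2006): 250}
rates = pills_per_person_per_year(pills, population)
```

With the rates computed this way, a small county and a large city can be compared on the same scale.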

It wasn't enough just to put a data set on GitHub or put a link out there and then let people go through it themselves and figure out how to deal with it, which researchers are great at doing. They can create so many great research projects around it, such as the effects of the opioid epidemic on the housing market, for example. But we went through the effort of allowing many doorways to this data, through the API, through the browser itself on that main page, so that anyone with different levels of expertise could access this data. And sometimes it can convert others to R, which I find is a success.

Stories told by local journalists

So after we put all that data out there, we found out that within a few days, hundreds of stories were written by local journalists, and it was amazing. For example, the Cleveland Plain Dealer looked at the data and found that there was one particular small-town doctor that was responsible for selling 1.6 million pills, which helped fuel the surge there in Ohio. A radio station, WOSU, looked at the data and decided to focus on the victims and telling their stories as well as the stories of those who were left behind.

The Daily Union in Wisconsin looked at the data and tracked down the businesses that were selling the most pills, and then went to see how much they had donated to politicians and to whom. The Philadelphia Inquirer made some great interactive visualizations that showed how, after OxyContin was reformulated so that it was not as crushable and snortable, the demand for a different type of oxycodone surged instead.

And through the data, we were able to look at manufacturers in a new way. So while Purdue got all the attention, based on the data we could see that they only controlled about 3% of the opioid market, and that three lesser-known generic drug manufacturers actually manufactured most of the addictive pills. However, since they weren't well-known, they avoided DEA scrutiny for many years.

Journalists also saw through the data that lawmakers were involved in this. One lawmaker in Louisiana, a House member, owned two pharmacies that were 12.5 miles apart from each other in a town that served 6,000 people, but sold 1.5 million pills in that time period. Another legislator that journalists found in the data set was a state representative in Tennessee who was president and CEO of the largest orderers of opioids in the state.

And what journalists connected with that was that earlier in the year, this particular state lawmaker, Shane Reeves, actually introduced and helped pass a law that made it easier to access opioid prescriptions.

Accountability and aftermath

And so as a result of all this, four companies that made or distributed prescription opioids and played roles in the catastrophic opioid crisis have reached a tentative $26 billion settlement with counties and cities that sued them for damages in the largest federal court case in American history. And internal documents out of all this showed that the pressure within the drug companies to sell opioids was in conflict with numerous red flags from the DEA and even internal workers. But the profits were so huge that sales didn't slow down.

And we're still feeling the effects to this day in a different way. Prescription pain pills were the main cause of overdose deaths through 2011. But then after laws were passed, after pharmacists caught on, after doctors caught on and started restricting access to prescription pain pills, drug users then turned to heroin, which set the stage for fentanyl. And together, oxycodone, hydrocodone, heroin, and fentanyl have killed more than 400,000 Americans since the turn of the century, a quarter of whom have died from fentanyl in just the last six years.

So I'm not exactly sure how to wrap up this talk, but I want to point out that the data that pharmaceutical companies and the government should have been keeping an eye on has finally made it into the hands of the public, to the researchers and journalists out there. I hope we're able to hold everyone more accountable so that we can prevent more children from ending up in foster care. Anyway, that's my talk. Thanks so much for listening.