Building scalable data pipelines with R and global health information systems' APIs - posit::conf
Efficient and scalable analytics workflows are critical for an adaptive and data-driven organization. How can we scale systems to support an office charged with implementing USAID's $6 billion HIV/AIDS program? Our team leveraged R and global health APIs to build more efficient, automated workflows by developing custom R packages to access health program data. Our investment in an automated data infrastructure built with flexible, open-source tools like R enabled us to create reproducible workflows for analysts in over 50 partner countries. We share our experience as a federal agency integrating APIs with R to develop scalable data pipelines, as inspiration for organizations facing similar resource and data challenges.

Talk by Karishma Srikanth

Slides: https://usaid-oha-si.github.io/presentations/2024/08/14/posit-conf-simple-machines.html
Transcript
This transcript was generated automatically and may contain errors.
Hi everyone, thank you so much for coming. My name is Karishma, and today I'm going to be talking to you about simple machines, how to improve your workflows with APIs.
Okay, so show of hands, how many of you have ever heard or seen or built a Rube Goldberg machine before? Looks like most people in the audience. For those of you who have not, a Rube Goldberg machine is a chain reaction type of machine or contraption that's intentionally designed to perform a simple task in an indirect or overly complicated way. When done successfully, it is a clean process with very little human intervention. Once it gets started, it's efficient, it gives you a great output, and honestly, it's just super fun and satisfying to watch.
These simple machines are also sort of similar to a data access pipeline, from the point at which you access your data from the source and all of the intermediary steps that you take to get to your final output. So show of hands again, in your organizations, how many of you have data access pipelines that look as seamless as a Rube Goldberg machine? I see like two hands. Okay, cool. We were in the exact same boat. How many of you, one click, okay, instead have pipelines that look similar to this?
I'm not seeing as many hands as I would expect. Okay, there we go, that looks more like it. Yeah, lots of manual intervention, lots of hiccups, you get the picture, and we found ourselves here as well.
Background and key roadblocks
So my name is Karishma Srikanth, I'm a data analysis advisor at the U.S. Agency for International Development, working in the Office of HIV/AIDS under the President's Emergency Plan for AIDS Relief, also known as PEPFAR. It's a mouthful. This presentation was developed in collaboration with my colleague, Aaron Chafetz, who was unable to travel for the conference this time, but huge thanks to him for all of his work to get this presentation ready.
So to give you some background on where we work, our work supports PEPFAR, which is a $6 billion HIV/AIDS program. Our team at USAID analyzes, synthesizes, and visualizes large amounts of monitoring and evaluation health data from over 55 countries worldwide. So that's a lot of data and a lot of context to ultimately inform HIV programming and move the needle toward ending HIV as a public health threat.
When we were standing up our analytic infrastructure, we faced four key roadblocks. Firstly, as you saw, we have a really, really large scope. We have data coming in from various different external systems, often in non-standard formats, from static CSVs to messy Excel sheets to the occasional pull from a database. On top of that, we found ourselves in a series of analytic loops, frequently receiving similar analytic requests each quarter, and we often found ourselves addressing those requests in a really manual, repetitive, and redundant way.
Finally, in a government environment, we lacked a centralized data lake to manage these data processes. So we found ourselves doing these manual processes each time without a standardized infrastructure to make it a little bit easier for ourselves.
So when we were beginning to think about ways that we wanted to make our workflows more efficient, we knew we wanted to move across the spectrum from manual to automated. But importantly, we also needed to consider the axis of accessibility, ensuring that all the processes and tools that we built were meeting the needs of all of the users in our organization. For instance, a highly automated but complex infrastructure, while powerful, would be incredibly costly and time-consuming and potentially prohibitive for new users to actively engage with. Conversely, a highly complex manual system is sort of the worst of both worlds, requiring a lot of specialized expertise and knowledge that's not necessarily scalable to a broader organization.
We most frequently found ourselves here in the upper left-hand quadrant, with analytic workflows that were stood up sort of to respond reactively to requests, that were manual but accessible in nature to a lot of people in our organization. And so our goal was to leverage APIs strategically to move ourselves in this direction of the upper right-hand quadrant, optimizing for accessibility and automation.
Lesson 1: Keep it small to start
The first lesson is to keep it small to start. A small but effective simple machine, like these dominoes falling in a circle here, is far more impressive than an intricate machine that might be prone to error and takes a lot of time to stand up. So typically in our work, when we are introduced to a new data source or a new problem, we start by seeing all of the possibilities and think really big about all of the ways we can address this problem and the big things we can build. However, we can't emphasize enough how important it was for us to keep our focus really small to start: set some of those larger ambitions aside, fail fast and small, and solve the basic problem first in a way that allows us to scale more broadly in the future.
So I want to take you through one of the first examples that our team embarked on to routinize our data access with an API. On the left-hand side here, you'll see a screenshot of one of our internal PEPFAR data systems called DATIM. This houses all of our quarterly monitoring and evaluation data from over 50 countries. In this data system, you can pull various health indicators for various countries, all the way down to the health facility level if you wish. But as you can probably tell, it's incredibly cumbersome to use. A lot of point and click. You have to drag these indicators into a pane, toggle them on and off. And then you can't remember what you did when you come back to it a couple of days later.
And so this scope was really vast. This is one of our most common data systems; everyone in our organization uses it. So there were lots of different use cases we thought of for making these systems more efficient. However, in the spirit of keeping things small, we first wanted to tackle the question of how to query this system's API with basically no documentation for doing so.
So on the right-hand side there, you will see our basic code chunk that queries this API. We use the glue package to construct the URL for this API, and then use the GET() function from the httr package to create a GET request against that URL, authenticating with our internal credentials. From here, once we had this as our framework, we had to figure out how to package it up in a way that was actually useful for everyone in our organization, which was really challenging. You can take a really simple request, but then you need to think about how to parameterize it and put it into a function that covers all the different ways someone might use it. But by keeping it small and going through a series of trial and error in this more focused way, we were able to make this simple solution dynamic and generalizable to a broader audience. So this code chunk became the start of a framework that ultimately became an R package completely dedicated to accessing our data systems through an API.
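The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the base URL, endpoint, dimension parameters, and environment-variable names are all hypothetical placeholders.

```r
# Sketch: build a DATIM-style API URL with {glue}, then query it with {httr}.
# The URL, dimensions, and credential variable names below are illustrative.
library(glue)
library(httr)

baseurl <- "https://datim.example.org/api/"          # hypothetical base URL
ou      <- "Jupiter"                                  # placeholder operating unit
period  <- "2024Q2"                                   # placeholder reporting period

url <- glue("{baseurl}analytics.json?dimension=ou:{ou}&dimension=pe:{period}")

resp <- GET(url,
            authenticate(Sys.getenv("DATIM_USER"),    # internal credentials kept
                         Sys.getenv("DATIM_PASS")))   # out of the script itself
stop_for_status(resp)                                 # fail loudly on HTTP errors
payload <- content(resp, as = "parsed")               # parse the JSON response
```

Wrapping this chunk in a function whose arguments (operating unit, period, indicator) feed into the glue() template is the parameterization step the talk describes, and that function is what ultimately lands in a package.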
Lesson 2: Minimize the manual
The second lesson was to minimize the manual. And by that, I mean, limit your clicks. If your simple machine requires a lot of manual intervention, like a kid pushing a train down a train track, it can be prone to error, not likely to be sustainable, and can be costly to run. So by identifying and setting up APIs for data access, we can create a lot of time savings and speed up our production, allowing us to actually focus on some of the key components of our project rather than getting bogged down in the processing. Also, by reducing the cost of accessing this data, we can promote better data accessibility and lead to greater usage of these data.
So here's another example from our work using an external data system to our organization. This here is a Shiny app that houses publicly available epidemiological estimates that come from UNAIDS, which is a UN agency that focuses on HIV. And so this offers a glimpse across a lot of different countries about what the HIV epidemic looks like by country. So it's things like prevalence, how many people are on HIV treatment, et cetera, all really key metrics for us to understand programmatically and from a target setting perspective. So similar to the previous example, pulling data out of this system requires a lot of clicks and manual downloads. And you'll see here in this video, the most cumbersome part is that if you are trying to pull multiple different indicators, you sort of have to go through, pick everything you want, change it over to a table, click the CSV, and then go back and do it all over again for everything else that you're asking for. So it's a pain, especially when you're asked for a rapid request across multiple different indicators.
So we used APIs to solve this iteration problem. We figured out the URL for this API, identified the dynamic components of that URL, like the indicator or age group, and used the paste0() function to dynamically change the input URL within this function based on whatever the user inputs. So this function here essentially does exactly what was shown on the previous slide: it just pulls data out of the Shiny app. We then took this a step further and leveraged packages like purrr from the tidyverse alongside these API functions to make it easier to pull data across multiple indicators, countries, and population groups. And ultimately, both of these functions are in the process of being packaged up into a package called mindthegap, up in the top right-hand corner, which is entirely focused on projects related to these estimates. So the upshot here is that by keeping it small and by focusing on the key components of this process that were a pain because of how manual they were, we were able to build a really simple tool to increase the accessibility of these data and ultimately their ease of use.
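The iteration pattern described above can be sketched like this. Again, this is an illustrative outline, not the mindthegap source: the endpoint URL and query parameter names are made up for the example.

```r
# Sketch: paste0() assembles the request URL from user inputs, and
# purrr::map() repeats the pull across multiple indicators.
# The endpoint and parameter names are hypothetical placeholders.
library(httr)
library(purrr)

pull_estimates <- function(indicator, age = "all") {
  url <- paste0("https://aidsinfo.example.org/api/data",  # placeholder endpoint
                "?indicator=", indicator,
                "&age=", age)
  resp <- GET(url)
  stop_for_status(resp)
  content(resp, as = "parsed")
}

# One call per indicator would mean repeated clicking in the app;
# purrr turns it into a single line of iteration.
indicators <- c("prevalence", "plhiv", "art_coverage")
results <- map(indicators, pull_estimates)
```

The key design move is separating the single-pull function from the iteration: pull_estimates() stays simple and testable, while map() (or map_dfr() if each pull returns a data frame) handles the "do it again for everything else" step that made the manual workflow painful.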
Lesson 3: Use what you have
The third lesson is to use what you have, and this one's pretty self-explanatory. Rather than going out and buying a bunch of equipment to build your machine, use what you have at your house. And by that, I mean use the infrastructure that your organization has set up. This can be quite challenging. Our organization in particular didn't have a database we could tap into and really leverage. Instead, we were limited to the Google Workspace suite, which is not a database. But given that that was the environment we were working in, we could use it more strategically to optimize our workflows. So whenever we can, we try to tap into the Google API through packages like googledrive and googlesheets4 to routinize our data access, but also just to promote good documentation and project management practices, like setting up project folders and using Google Forms to collect data rather than relying on manual submissions.
So in this example, in addition to some of the quarterly data systems that I've shared with you today, we also collect what's called high-frequency reporting data, which is just another way of saying monthly data submissions from our country teams that give us a more granular view of how the HIV program landscape is evolving month to month. In the past, the way the system worked is we had over 30 countries reporting this data to our office in DC, and they would email us their Excel files. We would have to sift through our email, pull down all of these submissions, bring them into R, process them, validate them, and then manually push them back up to Google Drive. So super cumbersome. It was a pain.
However, leveraging the Google API really, really changed the way that we approached this problem. Instead of having these submissions be emailed, these were submitted to us via a Google Form, and then as you know, with a Google Form, the metadata gets stored into a Google Sheet, so we were able to use the Google API to query those data submissions from our country teams directly, pull them into our R environment, work with them, process, validate all those fun things, and then push up our documentation through an R Markdown or Quarto file. So just simply working within the infrastructure we have and identifying what those key metrics were that were causing a little bit of backlog and manual work ended up completely revolutionizing the way that we approached this problem.
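The Form-to-Sheet-to-R workflow described above can be sketched as follows. This is a hedged outline under assumptions: the sheet ID, column names, validation checks, and Drive folder path are all hypothetical placeholders, not the team's actual pipeline.

```r
# Sketch: read Google Form responses straight from their backing Sheet,
# validate in R, and push the processed output back to Drive.
# The sheet ID, column names, and Drive path below are illustrative.
library(googlesheets4)
library(googledrive)
library(dplyr)

gs4_auth(email = "team@example.org")       # authenticate once per session

# Form responses accumulate in a Google Sheet; read them directly
submissions <- read_sheet("FORM_RESPONSES_SHEET_ID")   # placeholder ID

# Hypothetical validation step: drop incomplete or implausible rows
validated <- submissions %>%
  filter(!is.na(country), value >= 0)

# Write the processed file and push it to a shared Drive folder,
# replacing the old manual upload step
write.csv(validated, "hfr_processed.csv", row.names = FALSE)
drive_upload("hfr_processed.csv", path = "HFR/processed/")  # placeholder path
```

Because the Form writes to the Sheet automatically, the only human step left is filling out the Form itself; everything downstream (pull, validate, publish) can run from a single script or scheduled R Markdown/Quarto document.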
Lesson 4: Package it all up
So the final step is to package it all up. Using all of these lessons, keeping it small, minimizing the manual, using the infrastructure you have, the next step is to take all of that work and build a package with documentation for your organization to store these generalized solutions. The packages you build around your workflows should be flexible enough to meet the organization's needs, but also speak the language of your organization so they're broadly accessible to every user. That lets you solve problems in a more streamlined fashion as they arise, because you've done all the upfront work to document.
So in our case, all of the examples that I shared with you today ended up as some kind of R package. Some of them, like grabr and glamr off to the left-hand side, are utility packages designed to house functions that improve our workflows, from project setup to querying our data systems with APIs. All of these functions are stored there, they just work, and they can be scaled across many of our projects. Others, like mindthegap and the HFR package, are more project-specific, tailored to a particular need or analytic question, and can then be generalized more broadly if need be. So these are just four of the almost 15 packages that our team maintains for our organization and work. If you're interested in learning more about what we've done, our packages are linked at the bottom of this slide.
Recap and closing thoughts
So just as a recap, when thinking about ways to improve efficiencies in your work and potentially use APIs to improve your data access: first, keep it small to start. Second, minimize your manual steps, and identify what those are so you can functionalize them appropriately. Third, use the infrastructure and environment that you have. And fourth, package it up and document. These solutions are not one-size-fits-all and will look different for every organization. However, it was really key for us to challenge the status quo of how we'd always done things, just because those processes were easy and stood up at the time, and to push for ways to automate. It created enormous time savings and got us back to focusing on the questions and key program areas that we really wanted to be focusing on with our data.
As we wrap up, let's just bring this back full circle to where we started with the Rube Goldberg machine. It's an elegant yet complicated solution to a simple problem. But as we've discussed today, our goal in managing our data workflows is to sort of simplify and streamline the process on the whole. We shouldn't want to have to think about it any more than we already do. So by applying some of these lessons we've learned, we can avoid building an overly complicated and error-prone machine and instead build something that's reliable, efficient, and ultimately more impactful, and it just works. So as you return to your projects, consider if you are building an intricate machine or sort of just working within the status quo, or if there are ways that you can promote automation and simplify the steps that you're taking for your data access. Thank you all for coming today, and we hope that this talk provokes you to think about what data processes in your organization you can automate and optimize.
Q&A
All right. Thank you for your talk. I also have a background in public health, so another shout-out: thank you for your work. You're doing great. Thank you. All right. So our first question: how did you start the transformation toward using APIs? For example, how did you convince leadership and other users in your organization that this is meaningful and helpful for the organization overall?
Definitely. And it was absolutely an iterative process, and I think something we're still working through. I think it was a matter of identifying that these APIs were available to us and that we weren't taking advantage of them, and then building tools that made it really easy for people who might not have a background in coding or querying data via an API, just building a simple function where they don't really need to think too hard about what it is they're doing. So it came down to building accessible and scalable functions like that, and also providing training opportunities to capacitate the rest of our staff to understand what tools are out there.
All right. I think you alluded to the next question. Can you talk about how you trained the public health side of your team? For example, the people who might not have a data science background but need to do data work?
Yeah, that's a great question. So our particular team is a data focused team. However, not everyone is an R user. We have folks that are primarily working out of Tableau, primarily working out of R, or even people who just kind of primarily like to work out of Excel. So it is absolutely still a work in progress and something that we're trying to figure out, how do we meet people where they are, but also promote sort of pushing the envelope a little bit and embracing code-based solutions to create these efficiencies in our work. So we do a lot of trainings across our office for both the data specific team and broadly, not just focused on R, but also focused on best practices for data management, data access, data processing, et cetera.
All right. I think we have time for one more. How do you balance or decide between making an API script or an R package? Are there situations where you only make one or the other?

I think they all start off with just a script to query the API. Then we're like, okay, we know we're going to use this every month, so why don't we functionalize it and store it somewhere people can use it? And as we build those resources up, it's a matter of figuring out: do we have an existing package this can fit into nicely, or is this something completely novel that we need to stand up on its own? So there's a bit of upfront investment that our team has tried to put into building out these packages, but I would say it's an iterative process.

All right. Thank you. Let's give Karishma another round of applause.