Building scalable data pipelines with R and global health information systems' APIs - posit::conf
Efficient and scalable analytics workflows are critical for an adaptive and data-driven organization. How can we scale systems to support an office charged with implementing USAID's $6 billion HIV/AIDS program? Our team leveraged R and global health APIs to build more efficient, automated workflows by developing custom R packages to access health program data. Our investment in an automated data infrastructure built with flexible, open-source tools like R enabled us to create reproducible workflows for analysts in over 50 partner countries. We share our experience as a federal agency integrating APIs with R to develop scalable data pipelines, as inspiration for organizations facing similar resource and data challenges.

Talk by Karishma Srikanth

Slides: https://usaid-oha-si.github.io/presentations/2024/08/14/posit-conf-simple-machines.html
Transcript
This transcript was generated automatically and may contain errors.
Hi everyone, thank you so much for coming. My name is Karishma, and today I'm going to be talking to you about simple machines, how to improve your workflows with APIs.
Okay, so show of hands, how many of you have ever heard or seen or built a Rube Goldberg machine before? Looks like most people in the audience. For those of you who have not, a Rube Goldberg machine is a chain reaction type of machine or contraption that's intentionally designed to perform a simple task in an indirect or overly complicated way. When done successfully, it is a clean process with very little human intervention. Once it gets started, it's efficient, it gives you a great output, and honestly, it's just super fun and satisfying to watch.
These simple machines are also sort of similar to a data access pipeline, from the point at which you access your data from the source and all of the intermediary steps that you take to get to your final output. So show of hands again, in your organizations, how many of you have data access pipelines that look as seamless as a Rube Goldberg machine? I see like two hands. Okay, cool. We were in the exact same boat. How many of you, one click, okay, instead have pipelines that look similar to this?
I'm not seeing as many hands as I would expect. Okay, there we go, that looks more like it. Yeah, lots of manual intervention, lots of hiccups, you get the picture, and we found ourselves here as well.
Background and key roadblocks
So my name is Karishma Srikanth, I'm a data analysis advisor at the U.S. Agency for International Development, working in the Office of HIV/AIDS under the President's Emergency Plan for AIDS Relief, also known as PEPFAR. It's a mouthful. This presentation was developed in collaboration with my colleague, Aaron Chafetz, who was unable to travel for the conference this time, but huge thanks to him for all of his work to get this presentation ready.
So to give you some background on where we work, our work supports PEPFAR, which is a $6 billion HIV/AIDS program. Our team at USAID analyzes, synthesizes, and visualizes large amounts of monitoring and evaluation health data from over 55 countries worldwide. So that's a lot of data and a lot of context to ultimately inform HIV programming and move the needle toward ending HIV as a public health threat.
When we were standing up our analytic infrastructure, we faced four key roadblocks. Firstly, as you saw, we have a really, really large scope. We have data coming in from various different external systems, often in non-standard formats, from static CSVs to messy Excel sheets to the occasional pull from a database. On top of that, we found ourselves in a series of analytic loops, frequently receiving similar analytic requests each quarter, and we often found ourselves addressing those requests in a really manual, repetitive, and redundant way.
Finally, in a government environment, we lacked a centralized data lake to manage these data processes. So we found ourselves doing these manual processes each time without a standardized infrastructure to make it a little bit easier for ourselves.
So when we were beginning to think about ways that we wanted to make our workflows more efficient, we knew we wanted to move across the spectrum from manual to automated. But importantly, we also needed to consider the axis of accessibility, ensuring that all the processes and tools that we built were meeting the needs of all of the users in our organization. For instance, a highly automated but complex infrastructure, while powerful, would be incredibly costly and time-consuming and potentially prohibitive for new users to actively engage with. Conversely, a highly complex manual system is sort of the worst of both worlds, requiring a lot of specialized expertise and knowledge that's not necessarily scalable to a broader organization.
We most frequently found ourselves here in the upper left-hand quadrant, with analytic workflows that were stood up sort of to respond reactively to requests, that were manual but accessible in nature to a lot of people in our organization. And so our goal was to leverage APIs strategically to move ourselves in this direction of the upper right-hand quadrant, optimizing for accessibility and automation.
Lesson 1: Keep it small to start
The first lesson is to keep it small to start. A small but effective simple machine, like these dominoes falling in a circle here, is far more impressive than an intricate machine that might be prone to error and takes a lot of time to stand up. So typically in our work, when we are introduced to a new data source or a new problem, we start by seeing all of the possibilities and think really big about all of the ways we can address this problem and the big things we can build. However, we can't emphasize enough how important it was for us to keep our focus really small to start: set some of those larger ambitions aside, fail fast and small, and solve the basic problem first in a way that allows us to scale more broadly in the future.
So I want to take you through one of the first examples that our team embarked on to routinize our data access with an API. On the left-hand side here, you'll see a screenshot of one of our internal PEPFAR data systems called DATIM. This houses all of our quarterly monitoring and evaluation data from over 50 countries. In this data system, you can pull various health indicators for various countries, all the way down to the health facility level if you wish. But as you can probably tell, it's incredibly cumbersome to use. A lot of point and click. You have to drag these indicators into a pane, toggle them on and off. And then you can't remember what you did when you come back to it a couple of days later.
And so this scope was really vast. This is one of our most common data systems; everyone in our organization uses it. So there were lots of different use cases we thought of for making these systems more efficient. However, in the spirit of keeping things small, we first wanted to tackle the question of how to query this system's API with basically no documentation for doing so.
So on the right-hand side there, you will see our basic code chunk that queries this API. We use the glue package to construct the URL for this API, and then use the GET() function from the httr package to create a GET request against that URL, authenticating with our internal credentials. From here, once we had this as our framework, we had to figure out how to package it up in a way that was actually useful for everyone in our organization, which was really challenging. You can take a really simple request, but then you need to think about how to parameterize it and put it into a function that covers all the different ways someone might use it. But by keeping it small and going through a series of trial and error in this more focused way, we were able to make this simple solution dynamic and generalizable to a broader audience. So this code chunk became the start of a framework that ultimately became an R package completely dedicated to accessing our data systems through an API.
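The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the base URL, endpoint, dimension parameters, and environment-variable names are all hypothetical placeholders.

```r
# Sketch: build a DATIM-style API URL with {glue}, then query it with {httr}.
# The URL, dimensions, and credential variable names below are illustrative.
library(glue)
library(httr)

baseurl <- "https://datim.example.org/api/"          # hypothetical base URL
ou      <- "Jupiter"                                  # placeholder operating unit
period  <- "2024Q2"                                   # placeholder reporting period

url <- glue("{baseurl}analytics.json?dimension=ou:{ou}&dimension=pe:{period}")

resp <- GET(url,
            authenticate(Sys.getenv("DATIM_USER"),    # internal credentials kept
                         Sys.getenv("DATIM_PASS")))   # out of the script itself
stop_for_status(resp)                                 # fail loudly on HTTP errors
payload <- content(resp, as = "parsed")               # parse the JSON response
```

Wrapping this chunk in a function whose arguments (operating unit, period, indicator) feed into the glue() template is the parameterization step the talk describes, and that function is what ultimately lands in a package.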
Lesson 2: Minimize the manual
The second lesson was to minimize the manual. And by that, I mean, limit your clicks. If your simple machine requires a lot of manual intervention, like a kid pushing a train down a train track, it can be prone to error, not likely to be sustainable, and can be costly to run. So by identifying and setting up APIs for data access, we can create a lot of time savings and speed up our production, allowing us to actually focus on some of the key components of our project rather than getting bogged down in the processing. Also, by reducing the cost of accessing this data, we can promote better data accessibility and lead to greater usage of these data.
So here's another example from our work using an external data system to our organization. This here is a Shiny app that houses publicly available epidemiological estimates that come from UNAIDS, which is a UN agency that focuses on HIV. And so this offers a glimpse across a lot of different countries about what the HIV epidemic looks like by country. So it's things like prevalence, how many people are on HIV treatment, et cetera, all really key metrics for us to understand programmatically and from a target setting perspective. So similar to the previous example, pulling data out of this system requires a lot of clicks and manual downloads. And you'll see here in this video, the most cumbersome part is that if you are trying to pull multiple different indicators, you sort of have to go through, pick everything you want, change it over to a table, click the CSV, and then go back and do it all over again for everything else that you're asking for. So it's a pain, especially when you're asked for a rapid request across multiple different indicators.
So we used APIs to solve this iteration problem. We figured out the URL for this API, identified the dynamic components of that URL, like the indicator or age group, and used the paste0() function to dynamically change the input URL within this function based on whatever the user inputs. So this function here essentially does exactly what was shown on the previous slide: it just pulls data out of the Shiny app. We then took this a step further and leveraged packages like purrr from the tidyverse alongside these API functions to make it easier to pull data across multiple indicators, countries, and population groups. And ultimately, both of these functions are in the process of being packaged up into a package called mindthegap, up in the top right-hand corner, which is entirely focused on projects related to these estimates. So the upshot here is that by keeping it small and by focusing on the key components of this process that were a pain because of how manual they were, we were able to build a really simple tool to increase the accessibility of these data and ultimately their ease of use.
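The iteration pattern described above can be sketched like this. Again, this is an illustrative outline, not the mindthegap source: the endpoint URL and query parameter names are made up for the example.

```r
# Sketch: paste0() assembles the request URL from user inputs, and
# purrr::map() repeats the pull across multiple indicators.
# The endpoint and parameter names are hypothetical placeholders.
library(httr)
library(purrr)

pull_estimates <- function(indicator, age = "all") {
  url <- paste0("https://aidsinfo.example.org/api/data",  # placeholder endpoint
                "?indicator=", indicator,
                "&age=", age)
  resp <- GET(url)
  stop_for_status(resp)
  content(resp, as = "parsed")
}

# One call per indicator would mean repeated clicking in the app;
# purrr turns it into a single line of iteration.
indicators <- c("prevalence", "plhiv", "art_coverage")
results <- map(indicators, pull_estimates)
```

The key design move is separating the single-pull function from the iteration: pull_estimates() stays simple and testable, while map() (or map_dfr() if each pull returns a data frame) handles the "do it again for everything else" step that made the manual workflow painful.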
Lesson 3: Use what you have
The third lesson is to use what you have, and this one's pretty self-explanatory. Rather than going out and buying a bunch of equipment to build your machine, use what you have at your house. And by that, I mean use the infrastructure that your organization has set up. This can be quite challenging. Our organization in particular didn't have a database we could tap into and really leverage. Instead, we were limited to the Google Workspace suite, which is not a database. But given that that was the environment we were working in, we could use it more strategically to optimize our workflows. So whenever we can, we try to tap into the Google API through packages like googledrive and googlesheets4 to routinize our data access, but also just to promote good documentation and project management practices, like setting up project folders and using Google Forms to collect data rather than relying on manual submissions.
So in this example, in addition to some of the quarterly data systems that I've shared with you today, we also collect what's called high-frequency reporting data, which is just another way of saying monthly data submissions from our country teams that give us a more granular view of how the HIV program landscape is evolving month to month. In the past, the way the system worked is we had over 30 countries reporting this data to our office in DC, and they would email us their Excel files. We would have to sift through our email, pull down all of these submissions, bring them into R, process them, validate them, and then manually push them back up to Google Drive. So super cumbersome. It was a pain.
However, leveraging the Google API really, really changed the way that we approached this problem. Instead of having these submissions be emailed, these were submitted to us via a Google Form, and then as you know, with a Google Form, the metadata gets stored into a Google Sheet, so we were able to use the Google API to query those data submissions from our country teams directly, pull them into our R environment, work with them, process, validate all those fun things, and then push up our documentation through an R Markdown or Quarto file. So just simply working within the infrastructure we have and identifying what those key metrics were that were causing a little bit of backlog and manual work ended up completely revolutionizing the way that we approached this problem.
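The Form-to-Sheet-to-R workflow described above can be sketched as follows. This is a hedged outline under assumptions: the sheet ID, column names, validation checks, and Drive folder path are all hypothetical placeholders, not the team's actual pipeline.

```r
# Sketch: read Google Form responses straight from their backing Sheet,
# validate in R, and push the processed output back to Drive.
# The sheet ID, column names, and Drive path below are illustrative.
library(googlesheets4)
library(googledrive)
library(dplyr)

gs4_auth(email = "team@example.org")       # authenticate once per session

# Form responses accumulate in a Google Sheet; read them directly
submissions <- read_sheet("FORM_RESPONSES_SHEET_ID")   # placeholder ID

# Hypothetical validation step: drop incomplete or implausible rows
validated <- submissions %>%
  filter(!is.na(country), value >= 0)

# Write the processed file and push it to a shared Drive folder,
# replacing the old manual upload step
write.csv(validated, "hfr_processed.csv", row.names = FALSE)
drive_upload("hfr_processed.csv", path = "HFR/processed/")  # placeholder path
```

Because the Form writes to the Sheet automatically, the only human step left is filling out the Form itself; everything downstream (pull, validate, publish) can run from a single script or scheduled R Markdown/Quarto document.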
Lesson 4: Package it all up
So the final step is to package it all up. Using all of these lessons, keeping it small, minimizing the manual, using the infrastructure you have, the next step is to take all of that work and build a package with documentation for your organization to store these generalized solutions. The packages you build around your workflows should be flexible enough to meet the organization's needs, but also speak the language of your organization so they're broadly accessible to every user. That lets you solve problems in a more streamlined fashion as they arise, because you've done all the upfront work to document.
So in our case, all of the examples that I shared with you today ended up as some kind of R package. Some of them, like grabr and glamr off to the left-hand side, are utility packages designed to house functions that improve our workflows, from project setup to querying our data systems with APIs. All of these functions are stored there, they just work, and they can be scaled across many of our projects. Others, like mindthegap and the HFR package, are more project-specific, tailored to a particular need or analytic question, and can then be generalized more broadly if need be. So these are just four of the almost 15 packages that our team maintains for our organization and work. If you're interested in learning more about what we've done, our packages are linked at the bottom of this slide.
Recap and closing thoughts
So just as a recap, when thinking about ways to improve efficiencies in your work and potentially use APIs to improve your data access: first, keep it small to start. Second, minimize your manual steps, and identify what those are so you can functionalize them appropriately. Third, use the infrastructure and environment that you have. And fourth, package it up and document. These solutions are not one-size-fits-all and will look different for every organization. However, it was really key for us to challenge the status quo of how we'd always done things, just because those processes were easy and stood up at the time, and to push for ways to automate. It created enormous time savings and got us back to focusing on the questions and key program areas that we really wanted to be focusing on with our data.
As we wrap up, let's just bring this back full circle to where we started with the Rube Goldberg machine. It's an elegant yet complicated solution to a simple problem. But as we've discussed today, our goal in managing our data workflows is to sort of simplify and streamline the process on the whole. We shouldn't want to have to think about it any more than we already do. So by applying some of these lessons we've learned, we can avoid building an overly complicated and error-prone machine and instead build something that's reliable, efficient, and ultimately more impactful, and it just works. So as you return to your projects, consider if you are building an intricate machine or sort of just working within the status quo, or if there are ways that you can promote automation and simplify the steps that you're taking for your data access. Thank you all for coming today, and we hope that this talk provokes you to think about what data processes in your organization you can automate and optimize.
Q&A
All right. Thank you for your talk. I also have a background in public health, so another shout-out: thank you for your work. You're doing great. Thank you. All right. So our first question: how did you start the transformation toward using APIs? For example, how did you convince leadership and other users in your organization that this is meaningful and helpful for the organization overall?
Definitely. And it was absolutely an iterative process, and I think something we're still working through. I think it was a matter of identifying that these APIs were available to us and that we weren't taking advantage of them, and then building tools that made it really easy for people who might not have a background in coding or querying data via an API, just building a simple function where they don't really need to think too hard about what it is they're doing. So it came down to building accessible and scalable functions like that, and also providing training opportunities to capacitate the rest of our staff to understand what tools are out there.
All right. I think you alluded to the next question. Can you talk about how you trained the public health side of your team? For example, the people who might not have a data science background but need to do data work?
Yeah, that's a great question. So our particular team is a data focused team. However, not everyone is an R user. We have folks that are primarily working out of Tableau, primarily working out of R, or even people who just kind of primarily like to work out of Excel. So it is absolutely still a work in progress and something that we're trying to figure out, how do we meet people where they are, but also promote sort of pushing the envelope a little bit and embracing code-based solutions to create these efficiencies in our work. So we do a lot of trainings across our office for both the data specific team and broadly, not just focused on R, but also focused on best practices for data management, data access, data processing, et cetera.
All right. I think we have time for one more. How do you balance or decide between making an API script or an R package? Are there situations where you only make one or the other?

I think they all start off with just a script to query the API. Then we're like, okay, we know we're going to use this every month, so why don't we functionalize it and store it somewhere people can use it? And as we build those resources up, it's a matter of figuring out: do we have an existing package this can fit into nicely, or is this something completely novel that we need to stand up on its own? So there's a bit of upfront investment that our team has tried to put into building out these packages, but I would say it's an iterative process.

All right. Thank you. Let's give Karishma another round of applause.