Real World Applications of Open Source in Public Health
Speakers from the following organizations joined to discuss how their teams' work has evolved after learning R or Python through Posit Academy:

- Southwest District Health, Idaho
- Oregon Health Authority
- Vermont Department of Health

What is Posit Academy? Posit Academy is a mentor-led apprenticeship program that follows a project-based learning format. Like riding a bike or playing the piano, Posit Academy participants learn by doing, with real-world applications. Participants are placed in small groups, under the guidance of a mentor, to motivate and support one another throughout their learning journey.

Helpful links:

- If you'd like to learn more about Posit Academy: https://posit.co/talk-to-us-about-academy/
- Q&A Panel Recording: https://youtube.com/live/9xTsNuLt3aQ?feature=share
image: thumbnail.jpg
Transcript#
This transcript was generated automatically and may contain errors.
Welcome. This is Real World Applications of Open Source in Public Health. Whether you're joining us live for this YouTube premiere or watching us later, we're excited that you want to learn more about coding with open source.
We have three partner organizations here with us today who recently made the decision to upskill their team members with a code-first approach to working with data. They did this with Posit Academy. Posit Academy is Posit's mentor-led training program. That means there's a real human in the room with the participants. Each of these organizations sent groups of employees through our data science curriculum. Some of them learned R, some learned Python, and many came to the training with no programming experience at all. And in just 10 weeks or less, they were able to walk away with real data science skills.
So we'll spend the next 25 minutes hearing directly from them. How are they working differently? Are they saving time? Are they going deeper with their data to better serve their communities? You'll hear why they started their journey to open source, why they're glad they did, and see a detailed real-world example of how they're using code-first data science to make an impact in their community. As you listen, drop questions in the chat, because we'll meet you right back here with the panelists for a live Q&A session. See you soon.
Vermont Department of Health
Good afternoon. I am Jessie Hammond with the Vermont Department of Health. I'm the Division Director for the Division of Health Statistics and Informatics. I've been in the Health Department for more than 20 years. The first several years were as an analyst before moving into more of a management position in the last five years.
Hello, everyone. I am Amanda Jones, and I'm the Informatics and Data Modernization Director at the Vermont Department of Health. I've been in this role for almost two years. I've been in the Health Department for about eight years, and I've been working in public health, data, and epidemiology for about 15 years.
So let's talk about what motivated us to start using open source. In reality, there are many practical reasons, including cost. But the driving factor really is public health data modernization. Data modernization is a national initiative to improve the way health departments work with data and their data systems. The department really wants to equip our workforce and improve our data science capabilities. We also recognized a lot of our newer staff coming on board already had these skill sets, and we really want to grow and nurture these abilities in all of our analytic staff.
We started using these types of software, and so how are we now doing our work differently? I will admit a lot of our changes are still in progress, and a lot of analysts are still translating their code into things like R and Python. Some people are automating more of their work, such as using APIs to pull data instead of using more manual workflows. We've also seen people making R dashboards using some of the nifty visualization packages, as well as interactive reports. An example of such a report is this immunization registry report, which can be found on our website, and it includes some embedded HTML figures, graphs, as well as a map.
Overall, people are changing their mindset when they're working with data and coding. Lastly, data governance is a real push for our department, so as part of those processes, our staff are identifying and documenting best practices. One of those is the use of a reproducible environment, such as the renv package in R.
So we're using it, we're trying to change our mindset and work differently, so what has the value of this been? After COVID-19 response work, analytic staff were very burned out. A lot of the work and the data workflows that were built during that response were very manual in nature. Leadership really recognized the need to change this moving forward, and we've also placed a really high value on our workforce. It was so important to give people learning opportunities to improve their skills. The result is a really renewed energy and motivation to try new things, and we really see people being excited about learning new skills, trying to change the way that they work, and learning new technology.
So what's some advice that we would give to others who want to embark on this journey? I think the first thing is that change is hard. This type of change takes investment on the part of the individual. Learning new things in general isn't easy, and it's particularly difficult for adults, especially adults who have a lot of competing priorities. Individuals need to give themselves grace throughout the learning and implementation processes, but we also need to acknowledge that this requires patience on the part of the receivers of the output of the work. It's going to take time for people to change their workflows or rewrite some of the code for things they've done before.
And most often, the receivers of the information are public health programs or leaders, so it's important to clearly communicate and communicate often. But with patience and perseverance, we think that people will see the benefits of this type of change. So thank you.
Oregon Health Authority
I'm Andy Muniz Brewster, and I'm a research analyst at Oregon Health Authority. Today, I'm excited to share some highlights from our team's experiences with Posit Academy.
Here are just a few data products produced by Academy learners. I'll come back to a quick demonstration of one of these after sharing more from my colleagues.
So what motivated teams at OHA to start using open source? Two things, really. Shifting needs for work streams and public health data modernization efforts. Public health teams at OHA have been using R and Python for analysis and reporting since before the pandemic. Efforts to modernize public health data systems and workflows over the last five years revealed common themes among analytic teams. Specifically, teams need transparent, reproducible, portable, cost-effective solutions for informatics and automated reporting. And open source tools provide just that.
Informatics teams have been eager to take on the challenge of incorporating these tools into their work streams. Oregon received federal data modernization grant funding from the Centers for Disease Control, which specified the purchase of license and infrastructure for open source tool adoption. This supported new opportunities for collaboration in code development and reporting, which enabled teams to provide better, faster, actionable insights for decision-making and sharing information with community partners.
So now, after learning a code-first approach to their work, analysts have more confidence using R. Teams have a shared foundation of knowledge and skills to work from. This, in turn, has improved collaboration, increased engagement, and support between teams and across programs. Folks now share better practices through code reviews and testing.
Something I'm really excited about is this new perspective gained for how to approach problems. And now we are finding that folks can take working code or functions that solve specific analytic tasks, and they can make those functions more generalizable, so that approach can be applied to similar tasks.
And how are folks applying what they've learned? We're beginning to automate formerly manual processes. A couple of teams have stood up data quality monitoring pipelines. Folks are updating old code and adapting existing workflows in response to environmental and policy changes. Folks are starting to develop custom functions, and they're putting those functions together into new packages for internal use. We're seeing better documentation of code and analysis workflows, and teams are getting more practice with testing and debugging functions.
In terms of value that we're seeing after Posit Academy, there's just so much time saved by using code to produce reports. That upfront time invested to write code to build a report is rewarded when we can rerun any process or reuse project templates. Analysts report being able to develop, modify existing, and deploy new interactive dashboards more quickly and efficiently after Academy. So we're seeing quicker turnaround time for dashboard development.
There's improved efficiency in our existing code resources through saving time by writing code once and using it over and over again as needed. And writing more efficient code, which saves analysis and server runtime. Analysts use their new skills to identify data quality issues at the source and implement corrective measures to improve data quality. And with better documentation and provisioning of source scripts for workflows, teams are more equipped to share and hand off work. This is crucial for supporting continuity of operations. Consequently, there's less risk of corrupting source documents or losing track of steps in a process.
When I asked our teams what advice we could give, folks had several suggestions. First, find community and find a mentor. Check out the R for data science learning community or the data science hangout. Or find an R user group through Meetup or LinkedIn. Or through affinity groups like R Ladies or Out in Tech.
One of the best things to come out of this Academy training was an internal Teams channel to support an agency-wide data science community of practice at Oregon Health Authority. This provides space for folks to connect and learn about each other's work, celebrate accomplishments, support each other's learning, and foster a sense of belonging between analytic teams. Also, seek project-based learning opportunities. Try out your new skills on real-world problems and see what you can do. Have an experimental growth mindset. Try stuff out. You won't break anything.
Now, I'd like to take a quick peek at that EMS annual dashboard.
The Emergency Medical Services Program has data on ambulance agencies, their patient care, and their staff. This is a map of Oregon showing the patient transport encounters from 2021. This dashboard also shows patient dispatch complaints and other patient-related demographics. We also track performance metrics for the EMS system here. And we've added a special tab here for tracking heat emergencies in the summer. I hope this has been a useful look into what we've been able to do at OHA through Posit Academy. Thank you.
Southwest District Health, Idaho
Well, hello, everyone. My name is Ricky Bowman. I am the Program Manager for Public Health Emergency Preparedness and Epidemiology Response Team at Southwest District Health in Caldwell, Idaho. We are a regional health department that covers six counties in southwest Idaho.
Hi, my name is Kate Lewis, and I'm an epidemiologist here with Southwest District Health. Hi, my name is Andy Nutting. I am also an epidemiologist here at Southwest District Health. Hi, my name is Lakshmi, and I am also an epidemiologist with Southwest District Health.
So the purpose of this meeting is for us to share what we learned during the Posit training, which Kate, Andy, and I took, and how we are using it in our work. I'm going to share what we are doing here with the exponentially weighted moving average (EWMA). I'll go into the details of what that is and how we are using R to do this to help keep track of our infectious disease trends and outbreaks here at the health district.
And the data that I used today is the data that Posit shared during the training. It's COVID data, which includes the number of cases and deaths over time in the United States, and we have state-level data as well in that data set. So I created a Quarto document that describes the code chunks and also gives a little information on the statistical methods that we are using here. The first code chunk shows all the libraries that we have to load in R to perform this, which includes dplyr; qcc, which is used to calculate the control limits (I'll go into detail later); ggplot2 for plotting; and readxl for importing the data.
So what is the exponentially weighted moving average? It's a statistical method used for analyzing time series data, which is usually the type of data that we deal with on a daily basis. It helps us to track infectious disease trends over time. What this method does is assign different weights to different data points, so we can give more significance to recent observations compared to previous ones, which is very important in our realm, since we need to be responsive to immediate, more real-time changes.
And then here you can see the formula that calculates the exponentially weighted moving average. The alpha here, which is the multiplier, is also called the smoothing factor. As we assign higher values to alpha, it gives more weight to more recent data.
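The formula on the slide (not legible in the recording) is the standard EWMA recurrence, which matches the step-by-step description that follows:

```latex
\mathrm{EWMA}_t = \alpha\, x_t + (1 - \alpha)\,\mathrm{EWMA}_{t-1},
\qquad \mathrm{EWMA}_1 = x_1, \quad 0 < \alpha \le 1
```

Here x_t is the observed case count at time t and alpha is the smoothing factor.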
So in R, in the code chunk, the first step is to import the data set using read.csv. Then step two: this data set contains data for all states across the country, and I just wanted to focus on one state, so I used the filter function to filter to New York's data and assigned it the name covid_filtered. The next step, step three, is to format the date. And step four is to assign the alpha value, which we discussed earlier. I set it at 0.3, which is a common choice; as you assign higher values, it gives more weight to more recent data, but 0.3 is a reasonable default.
Step five is to calculate the initial value of the exponentially weighted moving average. We assign the initial value from the cases column, which tracks the number of cases in New York over time, so the first row of that column is assigned as the initial value. Then we move on to step six, where we are actually initializing the EWMA column. Here I'm creating a column called ewma using the mutate function, assigning the initial value to the first row of the column and assigning the rest of the values as NAs for now. In step seven, we fill in those NAs: from the second row onwards, we calculate the EWMA based on the previous row's value. The formula I have put in is the one up here, where we multiply the smoothing factor by the observation at time t, and then add one minus the smoothing factor times the EWMA value on the previous row. That gives us the values for the exponentially weighted moving average in a separate column in our data set.
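As an illustrative sketch, steps five through seven amount to the following recurrence. The original analysis uses dplyr's mutate in R; this is a minimal Python translation under the same assumptions, and the function and variable names here are my own:

```python
def ewma(cases, alpha=0.3):
    """Exponentially weighted moving average of a case-count series.

    Step 5: initialize with the first observation. Step 7: from the
    second value onward, blend each observation with the previous
    smoothed value using the smoothing factor alpha.
    """
    smoothed = [float(cases[0])]
    for x in cases[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# With alpha = 0.3, a sudden jump in cases pulls the smoothed series
# up gradually rather than all at once.
ewma([10, 12, 30, 28, 15])
```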
And the next step is to identify our control limits. We are tracking this exponentially weighted moving average, but we need to know the trigger points where we cross a limit, and that is where control charts come into play. The purpose of a control chart is to track the progression of a specific event over time. Here I'm applying the exponentially weighted average that we calculated to the control chart, and the package used is the qcc package that we loaded at the start. Here is the code chunk that calculates the control chart values, and the type of chart that I have specified here is xbar.one.
What the x-bar chart does is handle time series data like ours, where we are tracking the number of cases over weeks, days, or months, so this chart is the go-to. It monitors the average, or mean, of the number of cases over time. The central line of the chart indicates the mean, and a lower confidence limit and an upper confidence limit are calculated, which are usually two to three standard deviations from the mean. I have set it at two standard deviations; you can change that to three standard deviations if needed in this code chunk in step eight.
Step nine is to extract the mean and the upper and lower confidence limits that we calculated. The mean can be extracted using this first line of code, where we pull the center from the control chart object; similarly, you can use these code chunks to extract the lower and upper confidence limits. Step 10 is to apply these limits to the moving average that we calculated earlier: identify points where the EWMA exceeds the upper confidence limit. Here, in the covid_filtered data set, I'm creating a new column called out_of_control, which is assigned TRUE when the EWMA value is more than the upper confidence limit and FALSE when it is less.
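A minimal sketch of steps eight through ten. The original derives the limits with the qcc package in R; the two-standard-deviation limits, function names, and sample values below are assumptions for illustration only:

```python
import statistics

def control_limits(values, n_sigma=2):
    """Return (lower, center, upper) limits at n_sigma standard deviations
    around the mean, analogous to the center/LCL/UCL of a control chart."""
    center = statistics.mean(values)
    spread = n_sigma * statistics.stdev(values)
    return center - spread, center, center + spread

def flag_out_of_control(ewma_values, upper_limit):
    """Step 10: True wherever the smoothed series exceeds the upper limit."""
    return [value > upper_limit for value in ewma_values]
```

For example, `control_limits([1, 2, 3], n_sigma=1)` yields limits of 1 and 3 around a center of 2, and any smoothed value above the upper limit would be flagged.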
In step 11, once we have this data finalized, we just have to plot it using ggplot. On the x-axis we have the date, and on the y-axis the exponentially weighted moving average that we calculated. I'm doing a line plot, so I have geom_line here. The geom_ribbon function plots the band between the upper and lower confidence limits that we calculated from the control chart, and geom_point marks the actual data points in the data set. The geom_hline calls plot the upper and lower confidence limits as lines. So in this graph, you can see geom_ribbon creates this gray shaded ribbon area, and the geom_hline calls create those upper, center, and lower lines from the control chart.
Then I am assigning the colors manually, black and red: if the out_of_control column is TRUE, the point is marked red, which means the exponentially weighted average for that particular time point exceeds the upper confidence limit; if it is FALSE, it is assigned black. So that's how this plot is created.
So here you can see, over time, we are tracking the exponentially weighted average of the number of cases in New York. The upper confidence limit is this upper dashed black line. Whenever we see the number of cases going above that limit, it indicates more than the usual, expected number of cases for that particular time point, and it triggers us to look further into what's going on, whether it's an outbreak, what else is happening, and what needs to be done at this point to control the spread of the disease.
And the purpose of using this exponentially weighted average is that it has several advantages. First is the sensitivity to recent data that I described earlier: we can set the smoothing factor so the plot is more responsive to recent data, which is very important in our work. It's also flexible: we can adjust the weight and calibration parameters to allow customization. For example, instead of looking across all of the data to calculate the upper limit, we can pick and choose time points where we know the data has been stable to calculate our baseline and upper limits. So there is that flexibility there.
Most recently, we have used this in our work to identify and track a pertussis outbreak that's ongoing in southwest Idaho. I have not shared our data or the code chunks that I used to calculate this particular plot because of privacy concerns, but it's very similar to the code already shared with the COVID data. This is the plot that we created using our data. In 2024, around March, we noticed that the data was exceeding that upper limit, which triggered us to declare an outbreak and implement intervention measures to prevent further spread of pertussis.
So I think that's all we have. If Kate, Andy, or Ricky wants to add anything else, please go ahead. I'll go ahead and stop sharing.
That was excellent. Thank you so much. That is everything that we have today. So if you have any questions, feel free to reach out to us. Thank you.