Resources

Building a Real-Time COVID-19 Surveillance System with R (Hugo Fitipaldi, Lund University)

Building a Real-Time COVID-19 Surveillance System with R: Lessons from the COVID Symptom Study Sweden

Speaker(s): Hugo Fitipaldi

Abstract: During the COVID-19 pandemic, the COVID Symptom Study Sweden collected over 20 million daily health reports from more than 200K participants. This talk demonstrates how R served as the backbone for transforming this massive dataset into actionable public health insights. I'll showcase our analytics pipeline built entirely in R, from processing raw data to developing an interactive Shiny dashboard for real-time COVID-19 surveillance. The presentation covers predictive modeling for prevalence and hospital admissions estimates and the creation of our 'covidsymptom' R package. Through practical examples, I'll share key learnings about handling large-scale health data and creating data products with R during a public health emergency.

posit::conf(2025)

Transcript

This transcript was generated automatically and may contain errors.

The COVID Symptom Study was a project built around an app developed by the health data science company ZOE in the UK, where it was launched together with King's College London. The app was subsequently launched in the US with the help of Harvard University, and in Sweden with us at Lund University and our friends at Uppsala University. It was launched very early in the pandemic in Europe and in the US: early March for the UK and the US, and April for Sweden.

Over the course of the project, which lasted roughly two to three years, we engaged 4.7 million participants across these countries, who contributed over 500 million data reports.

In Sweden, the COVID Symptom Study Sweden was a collaborative effort. We engaged researchers, but also data scientists, medical doctors, and of course the public. It was at its core a public health intervention. We wanted to understand a little more about this disease: which groups were at risk, and which symptoms were associated with a positive COVID test. And of course, since it was a public health intervention, we were trying to map the spread of COVID, so we wanted to give the public information about the prevalence of COVID infection in their region of Sweden.

How the app worked

So how did it work? The participants would download our app to their phones, and on the first day they would give us their baseline information, their demographics. For instance, they would give us their age, their gender if they wanted to share it, and their postal code, which we later used to group them by region so we could map the spread of the virus.

From that day on, the participants would be reminded to report at least once a day how they were feeling that day: healthy or unhealthy. If a participant said they were not feeling healthy, we would show them a list of symptoms and ask them to indicate which ones they were having, or to write them in a text field.

We also asked participants who had access to COVID tests to report them. As I said, this was very early in the pandemic, and COVID tests were not available to everyone; mostly the at-risk groups identified in the beginning, and health workers, were able to get tested. So we asked anyone in our app who had test information to report it: did you take a test, and what was the result? This will be important, and I'll show you how.

So, gladly, we had engagement from all over Sweden. And as I said, we collected information about their demographics, their health history, their symptoms, and their COVID tests. With the baseline data, we could create the groups that I mentioned and see that these participants were coming from all around Sweden, but mainly from the biggest metropolitan regions: Malmö and Lund in Skåne, Gothenburg in Västra Götaland, and Stockholm and Uppsala. We also found that most of our participants were women aged 50 to 59.

Symptom analysis and model building

The interesting things that we used in this app, as I said, were the symptom survey and the COVID test survey. Very early on, we had a large engagement: over 140,000 people participating in Sweden, which is a big number given that Sweden has 11 million people. Of these 140,000 people, more or less 20,000 had access to at least one COVID test and had reported it to us.

With this information, we were able to plot the symptoms and try to understand the differences between the syndromic presentation of a positive COVID test and that of a negative one, as you can see here. And we see right away that there is a difference in the shape of these curves. This is COVID positive, and we see that the symptoms peak higher and have a wider area under the curve, while for COVID negative they peak and go down really quickly.

And just for your reference, on the x-axis here we have the days standardized relative to the test. Day zero is the day the participant received the result of the test, and then we have 50 days before and 50 days after that. We standardized them to try to understand the presentation, the prevalence of these symptoms. Here, the symptom that peaks the highest in COVID positives is loss of smell. Today we know that loss of smell in the first two waves was highly associated with COVID. But you have to remember that at the time, we didn't know much about which symptoms were more characteristic of COVID.
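The alignment described above can be sketched in a few lines. This is only an illustration in Python, not the study's R code; the records, field layout, and function name are hypothetical. Each report date is converted to a day offset from the participant's test result (day zero), and the share of reports with a given symptom is computed per offset within a 50-day window.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (participant_id, report_date, reported_loss_of_smell)
reports = [
    ("p1", date(2020, 5, 1), False),
    ("p1", date(2020, 5, 3), True),
    ("p1", date(2020, 5, 4), True),
    ("p2", date(2020, 5, 9), True),
    ("p2", date(2020, 5, 10), True),
    ("p2", date(2020, 5, 12), False),
]
# Hypothetical date on which each participant received a test result (day zero).
test_result_date = {"p1": date(2020, 5, 3), "p2": date(2020, 5, 10)}


def symptom_prevalence_by_relative_day(reports, test_result_date, window=50):
    """Share of reports with the symptom, per day relative to the test result."""
    counts = defaultdict(lambda: [0, 0])  # relative day -> [with symptom, total]
    for participant_id, report_date, has_symptom in reports:
        if participant_id not in test_result_date:
            continue  # untested participants have no day zero
        relative_day = (report_date - test_result_date[participant_id]).days
        if abs(relative_day) > window:
            continue  # keep only 50 days before and after the result
        counts[relative_day][0] += int(has_symptom)
        counts[relative_day][1] += 1
    return {day: with_s / total for day, (with_s, total) in sorted(counts.items())}


prevalence = symptom_prevalence_by_relative_day(reports, test_result_date)
print(prevalence)
```

Computing one such curve per symptom, separately for positive and negative tests, gives the kind of comparison the plot shows.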

In fact, we were the first study to identify loss of smell as a symptom that was highly associated with COVID at this stage.

This difference in the curves gave us a hint that the data could be valuable for creating a model to predict COVID based on the symptoms only. So that's exactly what we did. We used a very simple model, a logistic regression with a LASSO penalty, on these 20,000 participants. And we identified the symptoms, and interactions of symptoms, that were positively and negatively associated with a true COVID test result.
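To make the idea of a penalized logistic regression concrete, here is a minimal sketch in Python (the talk's pipeline was in R). It assumes an L1 (LASSO) penalty, which the automatic transcript does not state clearly, and fits the model by proximal gradient descent. The toy data, symptom names, penalty strength, and step size are all invented for illustration; they are not the study's data or settings.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Proximal step for the L1 penalty: shrink toward zero, stop at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, penalty=0.02, step=0.5, n_iter=2000):
    """L1-penalized logistic regression fitted by proximal gradient descent."""
    n, p = len(X), len(X[0])
    weights, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        # Predicted probability minus observed outcome, per participant.
        residuals = [
            sigmoid(intercept + sum(w * x for w, x in zip(weights, row))) - target
            for row, target in zip(X, y)
        ]
        grad_intercept = sum(residuals) / n
        grads = [sum(r * row[j] for r, row in zip(residuals, X)) / n for j in range(p)]
        intercept -= step * grad_intercept  # the intercept is not penalized
        weights = [
            soft_threshold(w - step * g, step * penalty)
            for w, g in zip(weights, grads)
        ]
    return intercept, weights


symptoms = ["loss_of_smell", "abdominal_pain", "unrelated_symptom"]
# Toy patterns: (loss_of_smell, abdominal_pain, tested_positive, participants)
patterns = [
    (1, 0, 1, 3), (1, 0, 0, 1),
    (0, 1, 0, 3), (0, 1, 1, 1),
    (0, 0, 1, 1), (0, 0, 0, 1),
]
X, y = [], []
for smell, abdominal, positive, count in patterns:
    for unrelated in (0, 1):  # unrelated to the test result by construction
        for _ in range(count):
            X.append([smell, abdominal, unrelated])
            y.append(positive)

intercept, weights = fit_l1_logistic(X, y)
coefficients = dict(zip(symptoms, weights))
print(coefficients)
```

On this toy data the penalty leaves a positive weight on loss of smell and a negative one on abdominal pain, and shrinks the unrelated symptom to zero, which is the variable-selection behavior that makes this kind of model easy to interpret.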

So as I said, loss of smell was the symptom that had the strongest association with a positive COVID test, the highest weight. But we had other symptoms such as fever, unusual muscle pain, and so on. And we also had some negative associations: abdominal pain, for example, was not a COVID symptom at the time, or at least not positively associated.

With this information, we then started predicting COVID based on the symptoms that our participants were reporting daily. Basically, we took this population of 20,000, created a model, and then used the model to predict COVID in the part of the population that didn't have access to tests. And right away we started plotting the daily prevalence of COVID at the national level, as you can see on the right, but also at the regional level. As I said, we were grouping participants by postal code. You have some GIFs here, but everything was published in a Shiny app that I will show you soon.
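As a rough sketch of this prediction step, in Python rather than the R used in the project: a fitted symptom model is applied to each untested participant's daily report, and the predicted probabilities are averaged per region. The coefficients, the reports, and the simple averaging are all assumptions made for illustration; the talk does not give the real model's values or its exact aggregation.

```python
import math
from collections import defaultdict

# Hypothetical coefficients; the real model's values are not given in the talk.
INTERCEPT = -3.0
COEFFICIENTS = {"loss_of_smell": 2.5, "fever": 1.0, "abdominal_pain": -0.5}


def predicted_probability(symptoms):
    """Logistic model score for one daily report, as a probability."""
    score = INTERCEPT + sum(COEFFICIENTS[s] for s in symptoms)
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical reports for one day from untested participants: (region, symptoms)
daily_reports = [
    ("Skåne", {"loss_of_smell", "fever"}),
    ("Skåne", set()),
    ("Stockholm", {"abdominal_pain"}),
    ("Stockholm", set()),
]


def estimated_prevalence(reports):
    """Mean predicted probability of COVID per region (assumed aggregation)."""
    by_region = defaultdict(list)
    for region, symptoms in reports:
        by_region[region].append(predicted_probability(symptoms))
    return {region: sum(probs) / len(probs) for region, probs in by_region.items()}


prevalence = estimated_prevalence(daily_reports)
print(prevalence)
```

Repeating this for every day yields a regional time series of estimated prevalence that can be plotted or mapped.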

Model validation and refinement

We then used data from other sources to check whether our model was actually telling us the truth, as a form of validation. Here on your left, you see our predictions with the confidence intervals in red, and in black you see COVID-19 hospitalizations in the country. As you can see, we were very good at predicting the first wave and the second wave. But we have an artifact here around September 2020, where we predicted a peak in COVID, not as high as the first and second waves, that was not real according to the COVID-19 hospitalizations in the country.

So we understood that our model was probably suffering from some change in the real-world situation at that point, in Sweden in this case. We were concerned that people were downloading the app because they were feeling ill, reporting, and then no longer using the app. So we updated our model with a time dependency: we excluded the part of the population that downloaded the app and reported right away, and we added a weight for the time at which they were reporting. With that, we were actually able to create a better model.

I just want to show this as an example of how things change, and so do models. Later on we also had to use different models because we were having different types of COVID over time. So take this as an example of how we need to adapt machine learning models in the real world so we can produce a correct prediction of COVID, or anything else.

The whole process of creating the model was very iterative, and it's better described in a paper that we published in Nature Communications in 2022. I will soon show you the name of the paper. One more thing I would like to show you is that we used that model together with the hospitalization data to try to predict hospitalizations seven days before they happened. And our model performed really well, as you can see in the median absolute percentage errors for the regions shown here. Stockholm, Västra Götaland, Skåne: these are all regions of Sweden. These were the five regions where our model performed best at predicting future hospitalizations, using our model together with other data that were available in the country at that time.
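For reference, the median absolute percentage error mentioned here can be computed as below. This is a small Python illustration with made-up admission counts, not the study's figures or its R code.

```python
from statistics import median


def median_absolute_percentage_error(actual, predicted):
    """Median of |actual - predicted| / |actual|, expressed in percent."""
    errors = [
        100 * abs(a - p) / abs(a)
        for a, p in zip(actual, predicted)
        if a != 0  # a percentage error is undefined when the actual value is zero
    ]
    return median(errors)


# Made-up weekly hospital admissions and forecasts for one region.
actual_admissions = [100, 80, 50, 40]
predicted_admissions = [90, 84, 55, 30]

mdape = median_absolute_percentage_error(actual_admissions, predicted_admissions)
print(mdape)
```

Using the median rather than the mean keeps a single badly missed week from dominating the score.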

Publications, tools, and broader impact

So as I said, everything is very well described in our publication in Nature Communications. This is the publication. But if you want to know the nitty-gritty details of the project, you can also find my PhD thesis, in which I go into much more detail about the creation of the COVID Symptom Study Sweden, as well as the creation of these two tools: the COVID Symptom Study Sweden dashboard, which was a Shiny dashboard, and the covidsymptom R package. There you will find more details about them.

Very quickly, the package came from a need we felt: a lot of people wanted to use our data and our predictions for modeling in specific regions. So we made these aggregated data available to data scientists and researchers in Sweden.

The data collection for the study ended in July 2022, but the project continues. We have projects related to vaccination, but also to groups at risk and to long COVID.

I just want to show this study as an example of how we can use survey data and apps like this to create valuable information for real-time tracking of diseases. As many epidemiologists have been saying lately, we might be facing future pandemics very soon. With R and other data science tools, we can quickly set up a project like this and deliver very insightful information to public health decision-makers, and also to researchers, very fast.

Q&A

Could you please explain more about how the time-dependent model was different from your original main model?

So, sure. Really quickly, because throughout the whole study we had different models. The first model only took into consideration the symptoms, basically how they presented at any time within a period that we had set in the beginning, something like seven days before and seven days after the date of the test. What we did in the time-dependent model was add a time weight to the day of the report, so the closer to the date of the test, the bigger the weight of the symptoms. Another thing, very quickly, is that we started excluding participants who downloaded the app and reported right away, because we were finding a huge bias from a lot of people entering the study because they were feeling sick. But you can read more in the publications that I mentioned.
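The two adjustments described in this answer can be illustrated with a short Python sketch. The exponential half-life weighting, the seven-day grace period, and all field names are assumptions chosen for illustration; the talk does not specify the functional form, and the actual method is described in the publications.

```python
def time_weight(days_from_test, half_life_days=3.0):
    """Weight that halves every `half_life_days` away from the test date (assumed form)."""
    return 0.5 ** (abs(days_from_test) / half_life_days)


def joined_while_ill(days_from_install_to_first_symptoms, grace_days=7):
    """Flag participants who reported symptoms right after installing the app (assumed rule)."""
    return days_from_install_to_first_symptoms < grace_days


# Hypothetical participants with the day offsets of their reports around the test.
participants = [
    {"id": "p1", "days_to_first_symptoms": 0, "report_days_from_test": [-1, 0, 2]},
    {"id": "p2", "days_to_first_symptoms": 30, "report_days_from_test": [-6, 0, 3]},
]

# Drop participants who likely joined because they were ill, then weight the rest.
weighted_reports = {
    p["id"]: [time_weight(day) for day in p["report_days_from_test"]]
    for p in participants
    if not joined_while_ill(p["days_to_first_symptoms"])
}
print(weighted_reports)
```

In a weighted model fit, these values would multiply each report's contribution so that symptoms reported near the test date count the most.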

Speaking of the publication, also for Hugo: the app must have been developed very quickly, since your first Nature paper was published in 2020. It was, I think, March. Can you touch on the rapid development of that app?

That was a very quick learning curve. Basically, what happened was that the ZOE team in the UK had a huge group of data scientists working for them. At Lund University, as is usual in public universities, it was me; I was the whole group. So I had to learn Shiny very quickly. And I'm very glad for the huge Shiny community that we have online, because at the time we didn't have ChatGPT either, right? So I had to rely a lot on these online resources. So yeah, it was very quick.

But we actually spent a few months just publishing ggplot maps without any interaction. And then it was a process that developed day by day, you know? We had an MVP, and then we just grew from there as I learned to do different things. But yeah, in two months we deployed that. And I had a lot of help as well from the guys at the Shiny live. I forgot the name, I'm sorry. But the Shiny server, yeah. The Shiny server people helped me a lot with that. We had a couple of meetings.