Resources

Sjoerd Wierenga & Job Spijker | Public Health | Shiny in Production | Posit

R in Public Sector: Organizational & Technical Aspects of Shiny in Production with the Dutch National Institute for Public Health and the Environment.

00:00 - Introductions
2:47 - Organizational aspects of Shiny in production
32:52 - Technical aspects of Shiny in production
52:33 - Ask us everything / Open discussion

Questions:
29:00 - When you first introduced Shiny, what other tools were you comparing it to? How did you explain the difference to your leaders?
30:00 - What were the most important aspects of your prototype app to create buy-in?
52:33 - As Clusterbuster began to be used by more people, did you face any performance issues? How did you adjust your app to deal with more concurrent users?
56:10 - Can you say anything about the update frequency of the data?
57:15 - Which model was used to define the clusters?
58:23 - Did you ever consider not using a database?
1:01:50 - What's the communication with the data engineering team?
1:03:51 - How often do you collect feedback from users and update your app?
1:05:10 - Was your data loaded into Docker in the form of some aggregates? How did you create them?
1:06:26 - What is the main advantage of keeping it all in R with Shiny? Did you feel at any point you were sacrificing simplicity?
1:08:14 - Did you use any specific methods to increase the performance of your app? Did you scope your data, or load it all in the global file?
1:12:03 - How did you make sure regions and users felt comfortable using your app?
1:13:25 - What types of businesses are hotbeds for COVID clusters? Has this info informed policy changes?
1:14:50 - How did the data quality issues improve over the rollout?
1:16:47 - Did you use CI/CD?
1:17:38 - Did you have any functionality within your apps to send individual-level data to municipalities?
1:19:47 - For huge amounts of data, have you tested out different file types to store your data set within your containers?
1:20:54 - For people just starting to use Shiny, what is one piece of advice you would give them?

Proof of concept with fictitious data: https://rivm.shinyapps.io/clusterbuster/
Blog post from the team: https://www.rstudio.com/blog/how-the-clusterbuster-shiny-app-helps-battle-covid-19-in-the-netherlands/
Code-first blog post mentioned: https://www.rstudio.com/blog/code-first-data-science-for-the-enterprise2/

How the "Clusterbuster" app provides actionable information to 300 health professionals
Presented by: Sjoerd Wierenga
In this talk we want to give an overview of what it took to create the Clusterbuster from an organizational perspective. We will go into detail on how we got from an abstract question to an application that is user-friendly, safe, and valuable. Furthermore, we will offer a glimpse of what is yet to come, and where we see possibilities to turbocharge a more data-driven public policy approach.

How to build a production Shiny app within the context of public health governance
Presented by: Job Spijker
This presentation goes into the more technical details about the production environment of the Clusterbuster application. We will show how we deployed the application, how we ensured security and mitigated the risks in case of a security breach, and how we organized our code for maintainability and refactoring.

Presenter Biographies:
Sjoerd Wierenga: As the son of two healthcare professionals, with a background in public administration, and a passion for technology, it is no surprise that Sjoerd Wierenga now works at the National Institute for Public Health and the Environment, leading a team of highly skilled data scientists that created an application to support the battle against COVID-19. After having worked as a healthcare manager for several years, he decided he wanted to learn how to program, which he has been doing since 2016 in different capacities.
Job Spijker: Job Spijker is a senior research and data scientist at the Dutch National Institute for Public Health and the Environment. He has a PhD in Earth Sciences with a focus on computational and statistical methods for spatial data. He is currently involved in projects exploring how the institute's environmental and health data can be leveraged to create insightful, actionable information to assist policy makers at the local, regional, and national level.

Feb 7, 2022
1h 24min


Transcript

This transcript was generated automatically and may contain errors.

Well, thank you all so much for joining. Welcome to the RStudio Enterprise Community Meetup. I'm Rachel Dempsey. I'm calling in from Boston, and I'll be your host for today's meetup. And I'm so excited to be joined by Job and Sjoerd today. But just to go through a brief agenda, while we wait for a few others from the waiting room, we'll go through some introductions of the meetup, have a great presentation on the Cluster Buster app, and the organizational aspects of putting that into production with Sjoerd. And then how to build a production Shiny app, some of the more technical aspects from Job. And then a fun Ask Us Everything open discussion time for you to ask anything that's on your mind about presentations and Shiny in production in general.

Just a quick note that the recording will be shared on the RStudio YouTube channel. So you can always ask questions anonymously through Slido if you don't want to be part of that recording as well. You can upvote other people's questions too, and that will help us organize the questions. One other Zoom note: if you want to turn on live transcription, you can do so during the talk as well. In the Zoom bar below, you can just press More and turn that on.

For anyone who's joining for the first time, welcome to the RStudio Enterprise Community Meetup. Just to kind of let you know what this group is. This is a friendly and open meetup environment for teams to share the work they're doing within their organizations, teach lessons learned, network with others, and really just allow us all to learn from each other. So thank you all so much for making this a welcoming community. But with that, that's enough from me. And I'd love to turn it over to you, Sjoerd, for the first presentation.

Introducing the Cluster Buster

Thank you. All right. So this is my first presentation on the Cluster Buster in English. So if you see me struggling for words at some point, please forgive me; give me a couple of seconds, and I will figure out a way to put my thoughts into words eventually. So I'm here to talk about the Cluster Buster, which is an application for the surveillance of COVID-19. We made this for the RIVM in the Netherlands. And I'm going to talk mainly about the process of creating valuable insights with R Shiny.

First of all, a little introduction about myself. My name is Sjoerd Wierenga, which is a Dutch name, hard to pronounce. I studied public administration about a decade back, worked in healthcare for a couple of years, and finally decided that I wanted to learn how to program. So as you can see, I am not an epidemiologist. I'm not a doctor. So I'm not going to talk about the virus, about COVID or Omicron. I'm going to talk about the application and how to get things done within the context of the organization that I worked in. I started working at the RIVM a little over a year ago. It was Tuesday, December 1st, 2020.

So what is the RIVM? It is the National Institute for Public Health and the Environment. It's part of the Dutch government. And it houses the Center for Infectious Disease Control, which is the Dutch CDC, I guess you could say. There we have the Department of Data Innovation and Signaling where I work. And then finally, I was allowed to form my own team, which is Team Cluster Buster, which I'll be talking a little bit more about later on.

As I said before, I want to start off with the slightly abstract question that I was being asked in December of 2020, which was, can we help municipal health services gain insight into data concerning the clustering of cases? This is one of those questions that you need to really ponder on a little bit to fully understand what's going on here. So I had a lot of questions when somebody asked me this. What do you mean exactly by municipal health services? There are a lot of doctors working there, epidemiologists, but also researchers, policy makers, decision makers. Who are we referring to when we say we want to share insights with these organizations?

A little bit about the context. I was talking about these municipal health services; we have 25 of them in the Netherlands. They cover certain areas, certain municipalities, and they are at the front line of fighting COVID-19. There are a lot of doctors in infectious disease control working there, epidemiologists, and they are responsible for testing and vaccination in the Netherlands. They also have a local advisory function. So it does happen that sometimes an organization is closed because it refuses to adhere to the national guidelines, and these MHSs, these municipal health services, have an advisory function on that as well. These 25 regions are grouped into seven larger regions. And let's be honest, they've been in crisis mode for close to two years.

Organizational approach

So let's start off with the approach now, which is, I guess, the most important part of this presentation. But a little disclaimer to begin with. You have this saying, under pressure, everything becomes fluid. We use it a lot in Dutch as well. Onder druk wordt alles vloeibaar, we say. And I put this at the start of the main part of this presentation, because this is not a normal organization in COVID-19 times. So I was thinking, what is it that I want to share in my presentation? I want to share something that you can actually take with you to your organization also in normal times, because a lot of things were different now. There was more money to get things done, there was less bureaucracy, etc.

So as I said before, I was faced with a problem. And I really wanted to understand this problem properly before continuing. So I wanted to have a clear view of who our customers were, which is not a term you hear often in the public domain organization, but it's really relevant. Who are we working for exactly? Is it the decision makers? Is it the researchers? We decided that it is actually the doctors and epidemiologists that work at these MHSs that we want to inform. So we had a clear definition of our customer.

What kind of data do we have? Well, we have the source investigation and contact tracing data. So anytime you test positive for COVID, you will be called by one of the employees of the MHS when you live in the Netherlands. And they will ask you where have you been? Who have you been in contact with? Where do you think that you may have contracted the virus in all likelihood? And this data is being sent to the RIVM. And we use this to create, well, let's say actionable insights. This was at least our wish from the very beginning. We want to not just create insights on things that people want to know, but there needs to be some action tied to the insight.

The doctors and epidemiologists in the MHSs advise the executive branch. So we wanted to give them the insights to give as good advice as possible, to eventually bust clusters. The woman who hired me actually told me that they wanted to go busting clusters. So I was like, what does that mean? This is actually an epidemiological term. And I thought to myself, that is a great name for an R Shiny application. So I kept using that. And eventually, when we went to production and started opening our application to doctors and epidemiologists, I was like, maybe we should find a more formal name for this application. But this doctor I spoke to said, no, no, no, you definitely need to keep calling it the Cluster Buster, because that is a great name.

So it's been called the cluster buster ever since. And I'm very happy that we stuck with it.

Another part of the problem is not all 25 regions had proper access to actionable insights. This was obviously a main part of the problem. They did not have the insights. They did have the data because they collected, but they did not have sometimes the analytical talent or the tools to come up with insights that were valuable. Some did, some had great tools, they had great dashboards, there was already an R Shiny dashboard produced in one of these MHSs. Some had great reports. So one of the main questions that we asked ourselves is, what are the best practices that we see in these MHSs? And can we replicate them? And can we redistribute them to all 25?

To do this, I first had to have a clear understanding of whether or not it was possible to share insights. And obviously, you have some prerequisites when you want to do this, you have some legal boundaries, we cannot share all the data, you have GDPR, privacy regulations in every country. So there were some boundaries and limitations to what we were allowed to share. Obviously, the data had some limitations. And the technology has some limitations. So I needed to have a clear understanding of what our limitations and possibilities were. But once I had this understanding, my advice was actually to move forward with R Shiny.

So finally, I said, yeah, well, I think R Shiny is actually the right tool for the job. And it was actually a logical choice in my view. Why? Because a lot of the data preparation for COVID-19 data in the Netherlands was already done in R. A lot of people working at the RIVM are familiar with R. And there was already some experience with R Shiny applications. So based on my prior experience, and my first weeks of investigation at the RIVM, I knew for a fact that we could produce value with R Shiny. But this was just me. R Shiny is still quite a new tool, although some of you might disagree; for many organizations, it's still quite new. So I had to do some convincing.

So what I did was create a prototype, of which you see a screenshot here. And I used this to show that R Shiny can actually produce great-looking and valuable dashboards, valuable insights. This one is still online, so it's probably most fun if you just click through it yourselves. I created this prototype because I still needed to convince some people that R Shiny was actually the right tool for the job, and this is highly recommended: they got really enthusiastic. So eventually I got the green light. And they said, well, Sjoerd, we trust you. Let's go for it: create a team and do whatever you think is necessary to create this application and share insights with these MHSs.

So I got the green light to build my own team. I wanted it to be as vertically integrated as possible. I wanted the team to be able to produce 80 to 90% of the application all by itself. Obviously, some of the work is too specialized to do within your team, so we had to rely on the IT department a bit as well. We were meeting very often: let's say three times a week we had a stand-up meeting, and we would be meeting in between those pretty often too. So we had a sort of agile working style. Let's say it was rapid iteration; it was not agile to the T, but it was agile in spirit. And there was a lot of enthusiasm.

I think this is really a result of vertical integration and working in an agile way, because we were capable of doing so much ourselves. At times we got a request for an insight and could deliver the same day. And that is a great way of working. This is also thanks in part to R Shiny and to the special skills that we had in our team. So this vertically integrated way of working really helped us and made us very enthusiastic in the process.

So this is our team. There are seven of us right now, but we started off with a very small team: myself and Yossi. She is my colleague who is responsible for the coordination of doctors and epidemiologists that work at the RIVM. So we started off with Yossi, with Joep, and with myself, but we gradually built towards a larger team, and each and every one of them had their own part to play. So this is the team, seven people, but I think all in all it adds up to about two FTE that we used to create this product, which, in my view, is still quite amazing: that we were able to do so much with so little.

Defining value and the advisory board

So January 2021, there was some understanding of the problem. The prerequisites were met, we had our weapon of choice, which was R Shiny, and there was a team. So now what? We were all ready to go. But yeah, what was the approach? And I think that this is the single most important question in this entire process is how do you allocate as much of your energy as possible towards creating value? And obviously, to be able to do this, you need to have a clear definition of value. And as I said before, I'm not a doctor, nor am I an epidemiologist. So nor am I an end user of the application. So we said to ourselves, we need some sort of a feedback loop from the users themselves, from the experts themselves.

So we put together an advisory board, Yossi and I, consisting of doctors and epidemiologists that work at the MHSs. They were the representatives, so to speak, of our user group. Together with them, we had a strong focus on actionable insights as much as possible. Obviously, sometimes you create an insight just because it provides some background information that's useful for situational awareness. But we were definitely focused on creating insights that could lead to some action in practice, not just a nice-to-know thing.

For me personally, my most important role in this advisory board was balancing between requests and possibilities. So oftentimes, the advisory board would put down a request: Sjoerd, is this something that you can make? And I would balance this against legal possibilities as well as technological possibilities. And we had weekly meetings with this advisory board. So here you can see this rapid iteration: weekly meetings. We would provide feedback on Fridays on what we did, and we would continue to look forward, trying to find new insights that we could produce that would help out even more.

So this is the structure that we ended up with. All the way at the top, you see our user group. They are represented by the advisory board. And through the advisory board, the development team collects requests and starts working on realizing them. And obviously, in the background, there are a lot of specialists that helped us out with, for example, legal questions or IT-related questions.

What the app delivers

So what did we end up with? We ended up with an R Shiny application that provided insights into vaccination. What you see here is a map with the vaccination rate, I believe it's called, for every neighborhood and every age group. And we also combined this with the number of positive reports. So this is a bivariate choropleth map: the bivariate part is the vaccination rate versus the number of infections in the last week. Everything you see here, obviously, is fake data.
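A bivariate choropleth encodes two variables at once by crossing their binned classes into a single (here nine-color) legend. The classification step can be sketched in a few lines of base R; the numbers and the tercile binning below are purely illustrative, not the app's actual code:

```r
# Illustrative sketch of the classification behind a bivariate choropleth:
# cross the tercile bins of two variables into a 3 x 3 grid of classes,
# each of which then maps to one color in a nine-color legend.
bivariate_class <- function(vacc_rate, infections) {
  bin <- function(x) cut(x, quantile(x, probs = seq(0, 1, length.out = 4)),
                         labels = 1:3, include.lowest = TRUE)
  paste0(bin(vacc_rate), "-", bin(infections))   # e.g. "3-1"
}

# Made-up numbers for nine neighborhoods (the real app used fake data too).
vacc <- c(52, 61, 70, 74, 78, 81, 85, 88, 93)  # vaccination rate (%)
inf  <- c(35, 28, 22, 19, 15, 12,  9,  6,  3)  # positive reports, last week

bivariate_class(vacc, inf)
# the first neighborhood (low vaccination, many infections) falls in class "1-3"
```

Each of the nine resulting classes gets its own fill color on the map, so a single polygon simultaneously shows low/high vaccination and low/high case counts.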

This is a visualization of clusters. So on the left here, you see some of these clusters. This could be a workplace, maybe, or a school or a nursing home. And if you clicked on one of those, on the right, you would see where the linked clusters were situated. So this is one of the requests that was put forward by the advisory board. And thanks to R Shiny, actually, we could realize this. I'm not sure if we could have done this in any other tool. For us, it was really relatively easy to create these kinds of insights with R Shiny. So I seldom had to sell a no to a request from the advisory board, which was a nice way of working.

So finally, we had a pretty cool application. We thought it was very valuable, but obviously, this didn't sell itself. So I had to give a lot of presentations. I think I spoke to almost every region in the Netherlands. We have had some newsletters with regular updates. And obviously, we are doing a little bit of work on outreach, because we don't think that the product sells itself, so to speak. So there was a little bit of technology push, if you can call it that.

But eventually, we started noticing that a lot of people were asking for authorization. So all in all, we now have almost 400 doctors and epidemiologists using our tool, from all 25 MHS regions. This is something that we're actually pretty proud of. We know that the application is being used, because we track it quantitatively. And we also ask the MHSs: how do you use it? Why do you use it? And what can we improve? I literally call them or send them an email to get some qualitative feedback as well.

They use these insights to stay up to date on COVID-19 clusters. And they have also used our application for the allocation of mobile vaccination units. So they have a limited number of mobile vaccination units. And one of the questions they ask themselves, obviously, is what is the best place to locate this mobile vaccination unit, which could be a bus or something like that. And they use the Cluster Buster to find the optimal spot for that.

And one of the nice side effects was that the data quality actually improved because the MHSs started using the Cluster Buster. They finally saw the reason for keeping track of all the different types of data. We gave them back insights, and this motivated them to really pay attention to the quality of their data.

Takeaways and next steps

Yeah, well, I put this in. This is just to say that we are already working on our next application. This is one that I hope that we will release, let's say, in the next month or so. And this is, we will take everything we learned from the Cluster Buster and put it in our next application. This, I hope, and we are working towards this, is going to be open source. So I'm hoping to share this, share the code behind this application so all of you can see how we did this exactly and can create your own tiny new Cluster Busters for whatever use you prefer.

So main takeaways for me, I guess the most important ones are have a very clear understanding of the problem, which I talked about a lot before. Create a convincing prototype. If your organization is not already used to using R Shiny, especially for a public use, create a convincing prototype or just show what we have made. Vertical integration helped us out a lot because we could create insights very quickly, so rapid iteration was possible. Very important to have a clear definition of value. So for us, we had our advisory board. I say here clear-ish definition of value because what is valuable may change over time.

So at first, we said that it would be very valuable to have insight into clusters of COVID-19. This is still valuable information, but at some point, we started focusing a little bit more on insights into vaccination data. So the definition of value may change, obviously, over time. Incorporate your user feedback. Keep a strong focus on what your definition of value is. I noticed that it's important to promote your products. Don't think that it will sell itself. And finally, for us at least, I think that R Shiny was an excellent choice, especially because we got to develop our insights as close as possible to the wish of our end user, and I think that is very valuable.

Q&A: organizational aspects

One of the most upvoted questions so far was when you first introduced Shiny, what other tools were you comparing it to? Standard BI tools? And how did you explain the difference to your leaders? That's a really good question. I was comparing it to Tableau, Power BI, all the tools that I knew. And for me, I also already from the very beginning had a clear focus on R Shiny because we were working so much with R. So it was a very logical choice for me to investigate. So in that sense, R Shiny was in the lead already because we were doing so much with R.

The question was, what were the most important aspects of your prototype app to create buy-in? Right. So I think it was the overall looks of it and the swiftness: the visualizations that were in there all worked really smoothly, I should say. So the overall user experience was, I think, good. And this is something that not everyone was expecting from R Shiny. That is because it's still quite new, and sometimes you see tools that are not that flashy or shiny-looking. So I thought it was really important to make something that people instantly got a feel for, that they wanted to work with. So looks were actually really, really important.

Yeah, exactly. So imagine that you're a doctor and epidemiologist during this crisis and you work 60 hours a week, 80 hours a week, I don't know. And then you are asked to use an application that just doesn't look good. It doesn't work. It's slow, whatever. This is the last thing that you want. You really want to create an application that people would want to use even if they are working 50, 60, 70-hour work weeks during this crisis.

Technical aspects: architecture and security

To give a short introduction, I'm Job Spijker. I'm a geochemist at the National Institute for Public Health and the Environment. I work at the environmental department, and I have worked with R for, I think, at least 20 years or so. I do a lot of data analysis with R. And in recent years in our division, we have already had to put some Shiny applications out on the internet. We used Docker technology for that. That was the reason that Sjoerd asked me for this project: he needed a back-end engineer to do this work. So although I'm a data scientist, I wasn't called for my geochemical knowledge, because that wasn't needed, but mostly for my engineering skills.

Well, Sjoerd already told you something about the municipal health services. They collect a lot of data from you. Let's say that's you somewhere out there, having a COVID test. They collect your name, your address, your phone number, your social security number, and whether you are infected or not. And if you are infected, they do a source investigation, a contact tracing. They ask about your whereabouts, the people you're in contact with, et cetera. And they also collect data about your vaccination. So that's quite a lot of sensitive data about you.

Now, this is us. We are a data science team. And our assignment was phrased a little differently from what Sjoerd told you. They asked us to just fiddle with that very private data, create some information out of it, put it on the internet, and make sure only people who are allowed to see it can actually see it. And this is actually the moment you have to get out of your chair and start yelling at us. Because as data scientists, how can you be sure that that very sensitive data is in our good hands and that we have enough knowledge about security to make this application actually secure?

So actually, from an engineering point of view, we had three challenges. First, you have to create this information; you have to create the dashboard, which was the least of our concerns. Second, you have to deal with all the privacy issues. In Europe, we have the GDPR, the General Data Protection Regulation, which is very strict about privacy. And we're talking about personal medical data, which is the most sensitive data there is. We had to do a privacy impact assessment, we had to talk to our privacy officers, and that's a lot of paperwork. And part of the paperwork is to describe how we physically, in IT terminology, secure our data. And third: we are data scientists, not security experts. So this was the biggest challenge.

Before I go into how we solved it, I'll show you a little bit about the data flow that we have. Here we have the landscape with all these municipal health services. They collect all the data, and part of the data they send to the RIVM using a myriad of systems; there are about eight different systems within these MHSs. This data is collected in a database or two. And then we have our own data science teams, such as the surveillance department, which actually extract data from these databases and create all kinds of data products: reports, graphs, visualizations, whatsoever, which are used for combating this COVID crisis. And all these data products are stored on a network file storage. And we take it from there.

Now, looking at the other side: this is how we make this data available again to the MHSs. We decided to create this application using a Docker environment; we use an OpenShift Kubernetes environment for that. And what we did is say, all right, we put this web application in a DMZ, which is an open piece of the network, open to the world. And we use our F5 proxy server for the authorization. The biggest advantage is that this proxy server was a building block, a service provided by our IT department. So they were taking care of the authorization. And it means that the people at the MHSs can log in using their own credentials. It's like logging in to some other website using a Google account.

And the other thing we did: if somebody logs in, we know which user it is. So then, based on the username, we provide the user with only the data that he or she is allowed to see. It means that if somebody at the MHS makes an error and authorizes somebody who is allowed to log in, but about whom we haven't received any message, then that person will get access to the application but not to the data. So even in case of an authorization error, the sensitive data itself stays out of reach.
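This per-user data scoping can be sketched in a few lines of R. This is a hypothetical illustration, not the RIVM code: the lookup table is made up, and in the real app the authenticated identity comes from the F5 proxy (for example via a forwarded request header) rather than being hard-coded.

```r
# Hypothetical sketch of per-user data scoping. The user/region lookup
# table and the header name mentioned below are illustrative assumptions.
user_regions <- data.frame(
  user   = c("alice@mhs-utrecht.nl", "bob@mhs-rotterdam.nl"),
  region = c("Utrecht", "Rotterdam"),
  stringsAsFactors = FALSE
)

# Return the regions a user may see. An authenticated but unknown user
# gets an empty vector, so the app shows no data rather than failing open.
authorized_regions <- function(username, lookup = user_regions) {
  lookup$region[lookup$user == username]
}

# Inside the Shiny server function one would then filter every query, e.g.:
#   regions <- authorized_regions(session$request$HTTP_X_REMOTE_USER)
#   df <- df[df$region %in% regions, ]
```

The key design point is the default-deny behavior: authorization at the proxy gets you into the app, but only a known username maps to any data.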

So if you are some bad guy who wants to do something with that application, it means that you don't even come near it. However, if you do get near the application and start hanging around, you can only access the container environment and the data contained in that container environment. The internal data, the file storage, was completely blocked off, because there was a one-way firewall in between. The only thing we could do is push our data through to the container; it doesn't work the other way around. So this was one more way to secure the data. And these were all services, building blocks, or tools provided by the IT department, and also things the IT department could maintain.

So we can deliver an application and still rely on the experts of the IT department to secure everything.

Looking at the total picture, let's first have a look at our own process. We have this information created by the data science teams on the network share. And we had a small Linux virtual machine running all kinds of cron jobs and doing the data preparation. In the beginning, this machine was doing quite a lot of heavy lifting on the data. Nowadays, a lot of this heavy lifting is done by the data science team, and we only have to make some small adjustments to the data. But on this machine, a SQLite database is created, and this database is pushed through the one-way firewall onto the container platform, which is running Docker containers. These Docker containers contain the application. And of course we use a Git repository for the code.
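The nightly job on that Linux VM can be pictured as a crontab entry along these lines. The paths, hostname, script name, and schedule here are assumptions for illustration, not the actual RIVM setup:

```shell
# Illustrative crontab sketch (paths, host, and schedule are made up):
# 1) rebuild the single-file SQLite database from the prepared data,
# 2) push it one-way through the firewall onto the container platform.
15 3 * * *  Rscript /opt/clusterbuster/build_db.R \
              && scp /opt/clusterbuster/clusterbuster.sqlite \
                     deploy@container-platform:/data/clusterbuster.sqlite
```

Because SQLite is a single file, the "push" is just a file copy in one direction; the app containers only ever read the database, never write back through the firewall.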

So this total picture of how we created this architecture is that we have our data science teams who are doing the first step of the data preparation. Then it's our team who is actually creating the database and the application. But by launching the application and making the application available to the outer world, that's done on this container platform which was managed by the IT department. That means that we can focus on creating an application while all the security issues are the responsibility of the IT department. And these guys, they have the right experts to make sure that only the right people can access this application. So this is how we solve our biggest issue around security.

Technical aspects: the application itself

Now let's focus a little more on the application itself. But before that, I want to talk about our team. Sjoerd already introduced us. In the beginning we were just a bunch of data scientists, but an application like this requires teamwork, so pretty early on we decided on team roles: who is going to do what. In my case, I do the backend engineering; Jolien did the visualization; Sjoerd did a lot of the talking; and so on. We also had regular stand-up meetings, so the communication lines between us were very short. And we decided very quickly to put everything on GitHub in a private repository (the code, not the data, of course) and to work with branches, issue branches, and so on.

And we created several containers for the different stages of our application. We have a development version, which is just for us developers; a testing version, to make sure everything works; an acceptance version, which is not open to the outside world and can only be accessed from the inside, so we can show new things; and a production version. With these tiers, by the time we want to put something into production, we are pretty sure it's solid, it works, and users want to work with it.

Well, we chose Shiny, obviously. But one of the things we also decided is to split the whole Shiny application up into Shiny modules. If you create any serious Shiny application, I think you should use Shiny modules: they make your code much easier to maintain and to reuse. As I said, we use SQLite as the database powering the app. SQLite is very portable; it's just one single file, which is very easy to move around. And we use the pool package to access it. For the visualizations, we use Highcharts, a visualization library that is almost the default for the Dutch government. You need a paid license for it, so it's not free, but we have the license and the visualizations just look cool. We use shinylogs to track usage. For the map visualizations, we use Leaflet and sf (Simple Features). And we also use sparklines, so our tables look pretty cool.
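The pool-plus-SQLite piece of that stack can be sketched as below. A temporary file stands in for their single-file database; table and column names are invented for the example.

```r
library(DBI)
library(pool)
library(RSQLite)

# A temp file standing in for the app's single-file SQLite database
db_file <- tempfile(fileext = ".sqlite")
db <- dbPool(RSQLite::SQLite(), dbname = db_file)

dbExecute(db, "CREATE TABLE clusters (region TEXT, cases INTEGER)")
dbExecute(db, "INSERT INTO clusters VALUES ('Utrecht', 5)")

# Each query checks a connection out of the pool and returns it
# afterwards, so concurrent Shiny sessions are handled safely
res <- dbGetQuery(db, "SELECT region, cases FROM clusters")

poolClose(db)
```

In a Shiny app the pool would be created once at startup (e.g. in global.R) and closed with an `onStop()` handler, so every session shares the same pool instead of opening its own connection.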

First, these modules. The total application contains about 75 different visualizations, but there is a lot of reuse: we can create those 75 visualizations with about 25 modules, including modules for titles and the like. For example, there is a settings page in the application where you can see the infections broken down by setting, such as nursing homes or schools. These pages all contain the same visuals. With modules, you create just one module, where a module is one single visual, and you can use that visual on different pages and in different locations inside your application. It also makes maintaining your code much easier: if there is something wrong with a visualization, you immediately know which part of the code is causing the trouble.
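A minimal version of that write-once, instantiate-many pattern looks like this. The module here renders a plain table as a stand-in for one of their Highcharts visuals; all names and data are invented.

```r
library(shiny)

# One module = one visual.  UI half: namespaced output placeholder.
clusterTableUI <- function(id) {
  ns <- NS(id)
  tableOutput(ns("tbl"))
}

# Server half: renders whatever reactive data it is handed
clusterTableServer <- function(id, data) {
  moduleServer(id, function(input, output, session) {
    output$tbl <- renderTable(data())
  })
}

# The same module instantiated twice, e.g. on a "schools" page
# and a "nursing homes" page
ui <- fluidPage(
  h3("Schools"),       clusterTableUI("schools"),
  h3("Nursing homes"), clusterTableUI("nursing_homes")
)
server <- function(input, output, session) {
  clusterTableServer("schools",       reactive(data.frame(clusters = 3)))
  clusterTableServer("nursing_homes", reactive(data.frame(clusters = 8)))
}
# shinyApp(ui, server)  # run interactively
```

Because each instance gets its own namespace via `NS(id)`, the two tables never collide, which is what makes 25 modules enough for 75 visualizations.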

What we also did is organize our code into what we call subscripts. The server.R file for such a huge application runs to hundreds, maybe thousands of lines. So we chopped it up into different sections and sourced them using source() with local = TRUE. This makes the code more maintainable. It also means that if you work as a team, it's much easier for one person to work on a single file; if multiple people work on the same file, you risk merge conflicts, which makes things a little more difficult.

One thing that was very important for us was tracking the usage of the application. We really wanted to know how users were interacting with it. Of course, you can ask them what they're doing, but it's also nice to have some quantitative data about it. So we use the shinylogs package, and I think shinylogs is great for that. However, there were some privacy issues: shinylogs was recording the IP address, and an IP address is an identifier that can identify a person, which, according to our privacy statement, we cannot record. So we rewrote some shinylogs functions and overloaded them in the loaded package using assignInNamespace(). That way we made some small adjustments to the package without actually altering it. shinylogs stores all its data in a SQLite database as well, and we even created a separate dashboard out of it, so we could just follow the user interaction.
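The patching trick can be sketched generically. The talk does not name which shinylogs internals were replaced, so `record_visit` below is a hypothetical stand-in; only the `assignInNamespace()` mechanism itself is the real technique.

```r
# Stand-in for a package function that records a visit, IP included
# (hypothetical; the actual shinylogs function is not named in the talk)
record_visit <- function(user, ip) list(user = user, ip = ip)

# Wrapper that blanks the IP before anything is stored, to satisfy
# the privacy statement
record_visit_no_ip <- function(user, ip) record_visit(user, ip = NA)

# In the real app the wrapper replaces the original inside the
# package namespace, without modifying the installed package:
# utils::assignInNamespace("record_visit", record_visit_no_ip,
#                          ns = "shinylogs")

record_visit_no_ip("alice", "10.0.0.1")$ip  # NA: the IP never reaches the log
```

The appeal of this approach is that the package stays untouched on disk, so routine package updates still work; the price is that the override must be re-checked whenever shinylogs changes its internals.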

Of course, when you're working on an application, you're always thinking about the next steps. One of the things I'd really like to look into is regression tests. We are creating a Shiny application, but testing takes quite a lot of time; if you can automate it somehow, it will really contribute to your speed of coding and of adding new features. It's not implemented in Cluster Buster right now, but we're thinking of doing it for the next version. We also want to look into the communication between all the reactive elements in such an application; we think it can be more efficient, but working with it always gives me headaches. For other applications, we're also thinking about using frameworks like golem just to make the development cycle a little faster. And we'd really like to make this open source.
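One lightweight form of the automated testing they mention is `shiny::testServer()`, which drives server logic headlessly. This is a generic sketch on a toy server function, not Cluster Buster code, and `testServer()` is one option among several (shinytest2 being another).

```r
library(shiny)

# Toy server function standing in for a slice of an app's logic
server <- function(input, output, session) {
  doubled <- reactive(req(input$n) * 2)
  output$txt <- renderText(doubled())
}

# testServer() runs the server without a browser: set inputs,
# then assert on reactives and outputs directly
testServer(server, {
  session$setInputs(n = 21)
  stopifnot(doubled() == 42)
  stopifnot(output$txt == "42")
})
```

Run as part of a test suite, checks like these catch regressions in reactive logic before a human ever clicks through the app.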

So in summary: as a data scientist, you like to make Shiny applications, but you have to be aware that you're not a security-oriented programmer. Make sure that somebody else, in our case the IT department, takes care of security and provides a safe environment, in this case a Docker environment, where you can just put your application. So really think about your infrastructure, and that means go talk to these IT people; especially on security, at least at our institute, they know what they're doing. Also ask yourself: do you need all the data? If you create an application, do you need to connect it to all the corporate data you have, or can you extract only the data you need for this application? That way, in the case of a data breach, not everything is lost, just a few data points. And second, if you make a selection, your application will be quite fast. Also, of course, modules make you happy, so please do use modules and split your code. It's very important to know your user; Sjoerd talked quite a lot about this. And what I really like is that you have to work as a team. As you start on such an application, put some people together, sit with each other, and just talk: how are we going to do this, how do we like to work together, how do we make this work? A little reflection on how you interact with each other can be very important, and very fruitful for getting the energy going in a team.

Q&A: open discussion

Well, one of the most upvoted anonymous questions right now is: as Cluster Buster began to be used by more people, did you face any performance issues, and how did you adjust your app to deal with more concurrent users? Really good question. It's something we thought about when we first started. But the reality is that we created an application for professional use, so it's not the case that we have hundreds or thousands of users at the same time; sometimes it's 10 users at the same time, for example. And this is not a problem for us, because our impact does not depend on the number of users we have. Sometimes it's more impactful to have one doctor look at it than many other users. So for us, the numbers don't really say that much. Because this was a tool made for professional use, the number of users stayed consistently low, and we never actually faced any performance problems. And because we are running this on the OpenShift Kubernetes platform, it will probably just scale up if we do run into performance issues.

I see another question: how did you gain the support of the IT team? Because for some, there may be a bit of hesitation regarding Shiny. Well, I did notice some pushback, obviously; someone even literally said to me, this is never going to work. But as long as you stay positive and patient at certain points, I think it will all work out. What also really helped was showing the prototype, or the application itself. When we talked to people and I showed the application, they suddenly understood what we were talking about, and this, I believe, was very helpful.

It's very important to have an open-minded talk with these IT people. Just tell them that you want to put something like Cluster Buster on the internet, and try to think and work together on how to do this in a secure way. At least at our IT department, what you see is that they understand what you want and they think along with you. Don't try to make them think against you; try to find a common solution that actually works and is possible. Also, maybe the crisis helped a little. In this case the mindset was: let's get it done. So at some points we had to just push through and say, well, we need to do this, and I guess that's different in normal situations. But apart from that, I think a good honest talk, as Job said, and showing what your ideas are really helps.

Can you say anything about the update frequency of the data in SQLite in production? Yeah. They send the data to us once a day, usually in the morning. The data science team starts the morning with all the extraction and preparation of the data sets we use, and usually around noon we run our scripts to create the database. So it's created on a daily basis: once a day, a new database with the latest numbers from yesterday. If you wanted to do it more often, it would be quite easy; you could run it every hour, but the new data is only available once a day.

The next one: which model was used to define the clusters? Right. From an epidemiological standpoint I cannot really answer that question, but I can explain how the data was constructed. Every time someone got a positive test result, he or she was called by one of the employees at the MHSs (municipal health services) and asked: with whom have you been in contact, where have you been, where do you think it's most likely that you contracted the virus? Once three or more people said they most likely contracted the virus at a specific place, we would call that a cluster. Obviously, at a larger scale you would call it an outbreak or a super outbreak. The MHSs would write down these clusters and send them to the RIVM, and that's the starting point for our visualizations.

Thank you for an awesome presentation. Can I ask whether you considered not using a database? We have considered this. At first there were different ways to go, and we were not really sure which would be the best, so we started out with the database and eventually stuck with it. But this is one of those things you need to reevaluate from time to time: is this the right way for Cluster Buster 2.0? I'm not totally convinced that it is. There is something to be said for using databases, and something to be said for using pins and plain R objects.

Yeah, well, the great thing is that in the database there are multiple tables you want to use, and it's very easy to make relations between those tables; that's what you use a database for. If you put this information in an R object or in your environment, it's more difficult for me to work with. A database is such an easy object, and it's a single file. You can work with it in R, but you can also open it in a SQLite browser just to inspect your data, or use it in multiple applications; it's much easier to have multiple connections to the database. So I think I prefer a database over some file-based R object.

You could also look at this from a performance perspective. Right now we are querying our database to return the data for a specific region, and this is not a very fast operation. But we do monitor the performance of the application, and if we noticed that it was taking a long time to produce the visualizations, I would definitely consider moving away from a database. So there is a multitude of considerations to make. For now it's fine, maybe because of our small user group or the amount of data we have, but this is certainly a question that's on our mind.

And the next question: what is the main advantage of keeping it all in R with Shiny instead of using Streamlit, Power BI, or Tableau, for example? Did you feel you were at any point sacrificing simplicity? Well, no. I understand where this question is coming from; maybe it's easier to build a dashboard using Power BI, that could be the case. But we had the skills to build an R Shiny application, and this is the interesting part: I started working on a Shiny app with two new members of the team that does the data preparation, people who were skilled in R but not so much in Shiny. And we managed to build a new application from scratch within, let's say, two months, part-time, a couple of hours a week. So with no experience, we created an application in Shiny within two months; it is not as hard as you might think. The advantages of having the same language through the entire analytics pipeline vastly outweigh the complexity of creating an R Shiny application.

Did you use any specific methods to increase the performance of your application? And did you scope your data, or load it all in the global file? Yeah. Considering performance, we profiled the application while it was running to see which steps take the most time. It's usually the first step, where you have to load the data from the database into memory so the user can work with it. Using a database also means you can keep most of the data on disk and run a query with dbplyr to collect only the data you want to use, in our case only the data for a specific region. That's the data frame you're working with most of the time in the application.
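That scoping step can be sketched with dbplyr as below. An in-memory SQLite table with invented data stands in for their production database; the point is that only the filtered region is ever read into R.

```r
library(DBI)
library(RSQLite)
library(dplyr)  # dbplyr provides the SQL translation behind tbl()

# Throwaway in-memory database standing in for the app's SQLite file
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "clusters", data.frame(
  region = c("Utrecht", "Utrecht", "Amsterdam"),
  cases  = c(5, 7, 12)
))

# tbl() + filter() build SQL lazily; nothing is read from disk
# until collect(), so only the selected region enters memory
region_data <- tbl(con, "clusters") |>
  filter(region == "Utrecht") |>
  collect()

dbDisconnect(con)
```

In the app, the region would come from a reactive input, so each session pulls just its own slice instead of loading the full table into the global environment.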

In short, we have quite a large database, and with tens of thousands of infections each day it keeps getting bigger and bigger, so you do see the application getting a little slower. But as soon as the data is loaded, the application is as fast as ever. What we do is select the data from the database, and only the selected data is kept in memory. I'm not even sure it's all in the global environment; I think we use a global data frame for certain visualizations, and some data is selected at the level of a module. It depends. When we started off, I thought five seconds of loading time was too much, that it needed to be instant. I think we've passed five seconds now for some visualizations, and we could probably do better with reactlog and profvis. But in all honesty, the metaphor is: when are you going to change the wheels on the bus while you're going down the highway at 130 miles an hour? So it's quite difficult to optimize at this moment.

A question I wanted to ask earlier is about the prototype you built and the process you went through, because I feel that was so important for getting everybody on board. Was that shinyapps.io, or what was the process? Well, I created it just in RStudio in the month of December. I had done some work in Shiny before I started at the RIVM, so I had a lot of code already sitting there ready to be used, which was very helpful. I think I created it in maybe a week or so. And then, no, I didn't even put it on shinyapps.io at first; I just ran it locally and showed it to people. I was sitting in my office waving at people, telling them to come in and take a look at the application. Afterwards we put it on shinyapps.io so we could show it to more people. So yeah, a bit of improvisation there. This was a demo application containing fake data, of course.

But I see Dan also asked a question through Slido: how did you make sure regions and users felt comfortable using your app, so you're not just fielding questions that are answered within the app?