Resources

Accelerating Study Insights with Shiny and Posit | Regeneron x Atorus

In this session, Atorus, Regeneron, and Posit showcase their innovative partnership. Topics covered:

- Building a Unified Analytical Environment: Learn how to reduce friction by integrating fragmented SAS and R workflows.
- Case Study: A Config-Driven App: See a live demo of how a simple configuration file can build a fully functional application for data review and safety monitoring.
- Operational Shifts: Understand how internal teams are empowered to independently manage and extend their own applications in a regulated environment.

This isn’t just about technology—it’s about the strategic shift required to simplify development and accelerate insights within a regulated environment. Learn more about Posit's work in the pharma space: https://posit.co/use-cases/pharma/?utm_source=youtube&utm_medium=social&utm_term=pharma&utm_content=atorus-regeneron

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Welcome, everybody. We've got an exciting webinar for you today featuring Regeneron x Atorus and the work that we've been doing together with those two teams. I'm going to be your host, Phil Bauscher. I'm going to kick things off today and then I'll also help with the Q&A at the end of the webinar. So if you have any questions throughout the webinar, feel free to post those in the chat. And then during the Q&A, we'll get to those questions as a group.

I've got some really awesome presenters who are going to take you through the sessions today. First, we've got Sri Krishna Murthy, the Director of Product Management Global IT at Regeneron. We've also got Ryan Hu, the Associate Director of Stats Programming. And then we've got Maya Gaines, Associate Director at Atorus. And we've got Michael Stackhouse, CIO of Atorus as well.

And so we're really excited to highlight Regeneron today. They have been such a big user of open source across so many different areas of the company, drug discovery, PKPD, even people ops, and now bridging into clinical reporting. Today we're going to talk about the fragmented legacy environment and how they were able to unify that into a connected ecosystem using open source and Shiny to move from static reporting to dynamic, scalable, and real-time data science and highlight how they partnered with Atorus to do a lot of that work. So we're going to pass things over now to Sri to kick off the webinar.

Building the unified analytics platform

Thank you so much, Phil. And good afternoon, everyone. My name is Sri Krishna Murthy. I'm Director of Global Development IT within Regeneron, and I'm very excited to be in this webinar. We'll be talking about how, at Regeneron, we have created a compliant platform that supports multilingual analytics, and how multiple groups within Global Development at Regeneron use this platform to rapidly build standardized, reusable, and scalable R Shiny applications that support clinical review.

So this is a quick outline of what we are going to be speaking about. Initially, we'll talk about how we created the platform itself, and then we'll go into how the R Shiny app was created and some of the best design patterns for making things standardized and reusable.

All right. So let's start at the beginning. Let's go back a couple of years, to 2023, and talk about why we needed to build this and what the problem statement was. One thing our user community mentioned was that we are big users of SAS, like any other life sciences company, because SAS is typically used to submit to the FDA. But users expressed interest in doing more work with open source programming languages such as R and Python. And even though they were using those languages, it was all done on laptops. So there was a desire for some sort of centralized way of using open source programming in addition to SAS.

They also needed collaboration between the various platforms at Regeneron. For example, we had SAS 9.4 on prem, and then we had R and Python on laptops. But there was limited collaboration between these environments because of the fragmented nature of the workspace, and lack of collaboration meant that the same artifact would be regenerated over and over again. So they wanted better ways of working with each other, so that anything created could be reused. Apart from that, our SAS environment itself was nearing end of life, or at least the end of new features, and we needed to modernize or refresh it. And there were certain areas where performance was low and not scalable. So these were the problem statements we wanted to attack.

Based on this problem statement, we defined our goal: to build a scalable, multilingual, multi-source computing and analytics platform across the enterprise. Let me go into some detail on what exactly we mean by this. By scalable, we mean that the platform should be able to scale as we get more data and more studies. It should be multilingual in the sense that it supports SAS, but also languages such as R, Python, and Scala, to name a few. And it should be multi-source, meaning it can support a proprietary platform such as SAS alongside an open source platform such as Posit.

So all these things should go together and people should be able to seamlessly flow from one environment to another and then conduct their analytics. So that was our goal.

Platform roadmap: 2023 to 2026

So let's start with the roadmap: where we started in 2023, where we are at present, and the plan going forward. In 2023, as I mentioned earlier, we had legacy SAS 9.4, a GXP system installed on-prem, and the storage was Isilon, which was also on-prem. For the non-GXP use cases, we had SAS Studio 9.4 installed on AWS. So that was the environment. As you can see from this diagram, we really did not have any open source environment as such, but people did use R and Python on their laptops.

Moving on to 2024, we rolled out our open source global environment, which was the Posit platform, and we branded that as RSGD, which stands for RStudio Global Development environment. That gained a lot of traction and popularity, and multiple groups started using it for open source development. In 2024 itself, we also started planning how we would migrate SAS as well and build a consolidated ecosystem. Then in 2025, we upgraded the RSGD platform and rebranded it as OSGD, to give it the flavor of open source, not just R. That was somewhat parallel to Posit renaming RStudio to Posit; we wanted that open source flavor. We also moved the operating system from RHEL 7, which was somewhat outdated, to RHEL 9, so that we had the most modern operating system, and we introduced a lot of new features related to security and scaling. In addition, we started a major migration to SAS Viya.

Right now, we have moved SAS Grid to SAS Viya for the non-GXP environment; that is already live, and the migration of the studies is done as well. And come November, we should be able to go live with GXP SAS Viya as well. So we have done a lot this year: we have scaled out our open source Posit environment, we went live with the SAS Viya GXP and non-GXP environments, and we also installed a few environments that can handle big data. For example, we started using Databricks as the backend for big data use cases. And one important thing: if you look at the center of the diagram, you can see shared storage. We established Isilon OneFS as the shared storage between SAS Viya and the Posit platform.

Going into 2026, I would say 2025 is really the foundational year, and 2026 will be the year of consolidation, where we are going to truly unify all these different platforms. We are also going to push all the high performance use cases to either Databricks or to AWS clusters, and we'll have a clear segregation between GXP and non-GXP use cases. Many of these platforms, including Posit, will also become GXP at this stage. And then, of course, we'll be introducing new GenAI and AI use cases, integrating with Copilot and other LLMs. Finally, we'll be enabling multi-level security so that we can have fine-grained security at folder and subfolder level, as well as at row and column level. So that is our overall roadmap. We started small, then incrementally built this product, and by 2026 we should have a fully built, unified platform.

Technical highlights

So let's look at some of the technical highlights. As I mentioned earlier, we enabled unified storage using Isilon OneFS. There are a few alternatives to that, which we can go into later. But we use Isilon OneFS on AWS to support dual mounting across Windows and Linux, connecting to both the Posit platform and SAS. We are also GXP ready, or will be GXP ready, so that it can support both exploratory and primary endpoints. This is an important one, because we don't want to do work on Windows laptops anymore; we want to do everything from a centralized area. So all the work is now done either on SAS on cloud or on the Posit platform. No work gets done on laptops any longer.

As I mentioned earlier, we have enabled parallel computing. This is really the back end: if you want to do any parallel computing, we have dedicated AWS clusters where the computation gets pushed, either from SAS or from Posit. We also have harmonized directory structures, and this is one of the important points of collaboration: you have the same directory structure whether you work from SAS or from Posit. And it is big data ready, because if you have petabytes or terabytes of data, billions of rows, that computation can be pushed into Databricks, and we can enable those use cases. This is particularly helpful for some of our user community, including HEOR (health economic outcomes research) or the PPT. And finally, security and access is a big part of it. We support access management, and blinding and unblinding, which is very important for pharma use cases, through this mechanism.

I'll go quickly through this. Using this new platform, we have 80% faster queries on large datasets, and 30 minutes of productivity gain per dataset because of the shared file system across the different platforms. At this point, it's highly popular, and it is the go-to platform for any kind of R Shiny development. We have almost 11 departments within Global Development alone using this platform, and 500 plus users and growing, across GXP and non-GXP use cases. At this point, I would like to hand over to my co-worker, Ryan Yu, who is going to talk about how his group, statistical programming, uses this platform to create cutting-edge R Shiny apps. Thank you very much.

R-Shiny app design principles

Thank you, Sri. At Regeneron, in our statistical programming group, we have three technology visions: real-time, dynamic, and scalable. We think R Shiny is one of the critical components to help us achieve those visions. Our approach to developing R Shiny apps is through R packages, and this slide shows our R package design principles: standardization, flexibility, and ease of use.

For the standardization part, we mean each R package implements standards, a little like building blocks, covering different data analyses. And while maintaining standardization, we also allow customization, because that's the nature of clinical trial analysis: different study teams always have different requirements, right?

So to make R Shiny app development simpler for programmers with limited R knowledge, we introduced a metafile system. The system allows them to create a study-level application with minimal coding effort.

Case study: adverse events monitoring app

So I'd like to show a case study. This is an R Shiny app focused on adverse events monitoring in clinical trials, mainly used by safety scientists. In the next few slides, I'm going to highlight some key features from the app.

From the top of this screenshot, you see different tabs for different data analyses. We have a global filters panel on the left side and an output panel on the right side. On this page, at the top, we have a local filter, which only applies to this page. Down below, we have this waterfall plot, which shows the percentage of patients by system organ class. The bars are interactive: if you hover over one, you will see the count of patients who had adverse events in that category, and it also shows the population denominator and the corresponding percentage.

The other key feature I want to talk about is in the upper right: a download feature. The data in this app keeps refreshing over time, so if the end users want to keep a record, a static report of the current data version, they can use this download feature. When they click the button, it downloads an R Markdown report rendered to HTML, and this report preserves interactive features from the app, such as tooltips.
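The download feature described above can be sketched with Shiny's standard `downloadHandler` pattern. This is a minimal sketch, not the production code: the template name `ae_report.Rmd`, the parameter names, and the reactive `filtered_data()` are all hypothetical.

```r
library(shiny)

# Server-side sketch: render a parameterized R Markdown template to a
# self-contained HTML file, freezing the current data version.
# "ae_report.Rmd", the params, and filtered_data() are hypothetical.
output$report <- downloadHandler(
  filename = function() sprintf("ae-report-%s.html", Sys.Date()),
  content = function(file) {
    rmarkdown::render(
      input       = "ae_report.Rmd",                 # report template
      output_file = file,
      params      = list(data = filtered_data(), snapshot = Sys.time()),
      envir       = new.env(parent = globalenv())    # isolate rendering
    )
  }
)
```

Rendering to HTML rather than PDF is what lets htmlwidgets, and therefore interactive features like tooltips, survive in the static report.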

Another page from the same app is an AE timeline, a visualization of all the adverse events by patient. The x-axis is the study day. Each line is an adverse event, so from its position we can tell roughly when the event started and when it stopped. If we focus on one patient, you can tell this patient had four events, color-coded by severity: three mild events and one moderate. Each line is also interactive. On hover, we can see the name of the preferred term, the start date, stop date, and the duration. And if you want to learn more about the adverse events, you can click the line, and it will show a pop-up window that includes all the adverse events for that patient. In this window, you can see more columns: severity, causality, and so on.
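An interactive timeline like the one just described can be sketched with ggplot2 plus plotly. This is only an illustration: the data frame and its column names (`patient`, `term`, `start_day`, `end_day`, `severity`) are hypothetical stand-ins for the study data, not the app's actual implementation.

```r
library(ggplot2)
library(plotly)

# Toy AE data: one row per adverse event (hypothetical columns).
ae <- data.frame(
  patient   = c("P01", "P01", "P02"),
  term      = c("Headache", "Nausea", "Fatigue"),
  start_day = c(3, 10, 5),
  end_day   = c(7, 12, 20),
  severity  = c("MILD", "MODERATE", "MILD")
)

# One horizontal segment per event, colored by severity. The "text"
# aesthetic is unknown to ggplot2 (it warns) but ggplotly() uses it
# as the hover tooltip.
p <- ggplot(ae, aes(x = start_day, xend = end_day,
                    y = patient, yend = patient,
                    color = severity,
                    text = paste(term, start_day, "to", end_day))) +
  geom_segment(linewidth = 3) +
  labs(x = "Study day", y = "Patient")

ggplotly(p, tooltip = "text")
```

Click-to-popup behavior would come from plotly's `event_data()` inside a Shiny app, driving a modal with the full AE listing for the selected patient.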

Here's another example from the app. This table is a typical summary table by system organ class and preferred term (PT). On the left side, if we choose serious adverse events only, the table is re-rendered to include only serious events. The entire table is also interactive: if we click the number two in the red box, a pop-up window shows which patient IDs contributed to that summary count. This feature builds the link between the summary report and the patient-level data, which makes data review super efficient.

Another feature I want to highlight is SMQs (standardised MedDRA queries). All the SMQs relevant to the study data are built into the app. In this example, if I choose acute pancreatitis, and for the type I choose broad plus narrow, then the same summary table by PT we saw previously is re-rendered to include only the adverse events that belong to the acute pancreatitis SMQ.

Tips for successful Shiny apps

All right, so based on what we've learned, I'd also like to share some tips about how to make successful Shiny apps. First, we need to build powerful functionality. We need to talk to our stakeholders and understand their requirements. For example, we can include subgroup analyses, offer different statistical models for efficacy analysis for them to choose from, and give them options to explore relationships between different parameters.

Second, the user interface is important, because an appealing user interface helps end users adopt the new tool quickly. And we also care about user experience, of course. For example, we require response times of one second or less for any calculation and rendering. We also need to provide proper training and guidelines to the end users so they can get familiar with the app, leverage all its functionality and power, and make their review more efficient.

Lastly, we also care about user feedback. On the right side, we designed this within-app feedback collection panel. In this small panel, users can rate their experience and enter comments and questions. After submission, it triggers an email, so the relevant team members receive it and respond accordingly. Okay, so that's the end of my part. I'll pass to Maya. Thank you.

Config-driven app framework

Thanks, Ryan. So as Ryan went over this case study, we can use that and extrapolate it here for what we've done at Regeneron. It's common practice, right, that we make a Shiny application for a specific study with specific endpoints in mind. And once we're done with that, we want to reuse what we've just created for another study. In the imagery here, we can lift and shift by copying and pasting essentially all the code in that repository. But what happens if you find a bug in that second study? Then you have to copy and paste the fix back into the other repository. What about a feature? Ryan showed an AE timeline, but maybe there's another graphic you want that's specific to this second study. How am I going to back-add that to the first study? And let's extrapolate that even farther, to your entire portfolio of studies. How am I going to take that feature and copy and paste it into a bunch of different repositories with different study data? That seems onerous and a little cumbersome.

So what we have done here is create a layer of abstraction. We created a package that is essentially a suite of features and functions to create this Shiny app that you can configure for all of your studies. Now if you find a bug in study C, we can replace that code in the package and then update your child repositories. Similarly, you can turn on and off different features within your child repositories. Maybe there's a bespoke chart that applies to a really rare disease that doesn't apply to another study. You could turn those on or off very easily.
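A parent package with switchable features, as described above, might look roughly like the sketch below. `create app` is the function named in the talk, but everything else here — the module names, the config fields, and the overall shape — is an illustrative assumption, not the actual AE package.

```r
library(shiny)

# Sketch of a config-driven app constructor. cfg is a list parsed from
# the study's YAML file; ae_timeline_ui/server etc. are hypothetical
# Shiny modules that a parent package like this would export.
create_app <- function(cfg, data) {
  tabs <- list()
  if (isTRUE(cfg$features$ae_timeline)) {
    tabs <- c(tabs, list(tabPanel("AE Timeline", ae_timeline_ui("tl"))))
  }
  if (isTRUE(cfg$features$summary_table)) {
    tabs <- c(tabs, list(tabPanel("Summary", ae_summary_ui("sum"))))
  }

  # Splice the enabled tabs into a navbarPage.
  ui <- do.call(navbarPage, c(list(title = cfg$study_name), tabs))

  server <- function(input, output, session) {
    if (isTRUE(cfg$features$ae_timeline))   ae_timeline_server("tl", data, cfg)
    if (isTRUE(cfg$features$summary_table)) ae_summary_server("sum", data, cfg)
  }

  shinyApp(ui, server)
}
```

With this shape, a study's app.R reduces to a few lines (read the config, load the prepared data, call the constructor), and a bug fix in the package propagates to every study on its next deployment.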

Let's get nerdy and talk about implementation. How did we do this? Each study now has Shiny code that calls that parent package. At Regeneron, we call it the AE package, and within that package there's a function called create app. So I'm going to create a blank-slate repository, and it's going to have three files in it. One is my app.R, which calls the create app function of our custom suite of functionality. Another file is an R Markdown script; this could be any R file, or a Quarto document. We chose this because it's the data preparation file that pulls the data for this specific study, and each application requires the data to be in a rigid format, so we do all of our data manipulation and ETL in this file, which we also host on Connect. We host it on Connect because Connect makes it really easy to schedule a job to rerun the file at some cadence. So whenever this study's data is updated, the file reruns, and its output is used as the input for the Shiny app. Lastly, as Ryan touched upon, we have a YAML file, which is essentially a configuration file.

This framework calls variable names from configuration instead of relying on static column names, because those can change across studies. The configuration file tells us which column to map to which variable for that specific study. Using this paradigm, we can now create these three-file setups for each of our studies on Connect, across our entire portfolio of studies. This process has allowed us to optimize a Shiny framework for performance, security, and maintainability. Starting with your end users and a POC on a single study is a great thing to do. But eventually, the goal here wasn't just to build an app, it was to build a foundation.
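As an illustration, a study-level YAML configuration of the kind described might look like the fragment below. Every field name here is a hypothetical example, since the actual schema wasn't shown; the column names on the right are standard CDISC-style names.

```yaml
# Hypothetical study configuration (field names are illustrative)
study_name: "Study ABC-123"
features:
  ae_timeline: true
  summary_table: true
  waterfall_plot: false     # switched off for this study
variables:
  # canonical name used by the app -> this study's column name
  patient_id: USUBJID
  term: AETERM
  severity: AESEV
```

The `features` block is what lets a study turn modules on or off, and the `variables` block is the mapping layer that insulates the shared code from study-to-study differences in column names.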

So that is what we've done here. We've also included a verified handoff: the code is fully documented, we have infrastructure notes, and this allows the study programmers to configure these applications quite easily. We also hold knowledge transfer sessions with the statistical programmers who run the create app function, and provide the training materials mentioned earlier.

We follow this process, including the feedback cycle Ryan showed, within a defined software development life cycle, to keep releases consistent and manageable. Each cycle is a three-week development cycle with one week of UAT, where we test new features on the child applications to make sure we've accounted for backwards compatibility and the features work with different data sets and structures. Features are scoped to fit within these cycles, which keeps our progress steady and predictable. Each release includes detailed notes sent to the team covering any new functionality, bug fixes, and how to integrate the new version of the parent package into your existing application, so downstream users can fully adopt the new features and functionality. Lastly, this process reinforces best practices around version control, validation, documentation, you name it, helping the application evolve safely while remaining stable, and it has made this pretty easy to maintain.

Key takeaways and what's next

So as we look at the key takeaways: when we started working with Regeneron, one of the critical things we dove into first was building a unified environment. Having to migrate data out of the SAS platform, and not having a single source of truth for the data you're working with and what things are being driven on, really causes friction, and it makes it harder to implement these types of applications in practice and make them successful. So when we looked at Regeneron's modernization plans, what they wanted to do with the environment, how they wanted to grow their Posit environment, and how it fit into the larger GXP environment, making sure that it was unified, that everything was talking to centralized storage, and that the directory structures and the permissions and security model were harmonized across everything was really critical to keeping this seamless as the projects grew.

The second piece was to put this power in the hands of early-stage R users. When we're building these applications, it's difficult to expect everyone to be an expert. With lots of ongoing trials and study teams, you can't expect all of them to have a specialist like Maya building all of the visualizations and deploying the platform. We wanted this to be something we could put into the hands of more junior users, giving power to the domain experts who are familiar with the data at hand. We reduced the scope of what they're responsible for while still making sure they can deploy and take advantage of these applications. That gives a lot of power and autonomy to the Regeneron team, while centralizing the development expertise in a team that can be responsible for the visualization and development of everything at an advanced level.

Additionally, when we look at validation, we're centralizing where the development of the applications is actually done, so that we can implement testing at a central level and de-risk the deployment of these applications at the individual study level. As Maya pointed out, we can fix bugs in one place and have that proliferate out to the apps as we deploy. This was a really awesome implementation: we found we could get a new study started in about 30 minutes through the YAML-based configuration, making it plug and play with the data sets, with a small ETL pipeline that just focuses on getting the data ready for the application, and a central place to configure which parts of the app we want enabled. That makes it a lot easier to digest for those more junior users, or rather, thinking of them not as junior but as the domain specialists rather than R experts.

And lastly, governance and support: this shared responsibility model, with an SDLC-aligned core package, ensures sustainable growth in a feedback-driven release cycle. One of the mental models I've found in pretty much any statistical programming group working on regulatory deliverables is the idea of a "definition of done." With these Shiny applications, you have much more of an ongoing cycle. Maya mentioned the software development life cycle. Embracing that model, and understanding that you can work iteratively, release updates, and gradually improve over time instead of insisting on perfection before going live, really empowers the team to start taking advantage of these benefits, gather feedback, and continually improve as we go.

It's a difficult mental model for these teams to adopt, because we're used to preparing a deliverable, validating it, publishing it, handing it off to the ultimate stakeholders, and not being done until that release is completed. By putting together this SDLC, getting the team on board with it, and working through the process with everyone, it's reduced a lot of friction and made it a lot easier for us to continually improve and get these benefits into the hands of the study teams themselves.

Ultimately, what this gives us is scalable, reproducible, and regulated R-based data review for Regeneron. As for the benefits: we can deploy these applications to any study that needs them. There's a simple setup process and a simple mechanism to get an app out there, secured, and to enable the study team to use it. We can get from instantiation of the app to deployment within two hours. The build process is heavily simplified; it's really easy, even for novice R users, to get an app up and running. It enables faster decision-making for the teams, because we're giving them that application, and we can trust that it's been tested and that the visualizations on the page will deliver the baseline application we've put out there.

Improving over time is a simpler process. It's not rebuilding the app, and it's not the heavy burden of up-versioning everything. We make our updates, push them, and accept the new version of the app. It's a heavily simplified process, especially coming from the configuration-driven files: versioning to the new version of the app is straightforward. And it's a powerful tool in senior management meetings, giving visibility through beautiful visualizations that are helpful, informative, and provide feedback so much more immediately. It makes Ryan's life easier, because his team can get this up and running and put it out there, and you're not sending requests back to the stat programming team to quickly push out a report and then waiting on them in order to do reporting. They have tools at their fingertips that can easily look at different subsets of data, give different views into different things, and answer those questions immediately, instead of having programmers on hand just waiting for whatever request comes in.

So, what's next? Expanding the paradigm: reusing the SDLC and framework for more applications. Ryan's team dreams big; there's a lot of potential in the applications they can build here, and this model lets us build up those applications, get the use cases fit, and have a model to deploy them with. Continuing to empower our developers: providing ready-to-use functions so they can focus on their analysis, not infrastructure. We want their work to be seamless, without them having to think too hard about the underlying data platform. And the unified environment: making sure we're aligning with the permissions models and the harmonized data structures, so that from project to project you can trust those foundations are in place, and you don't need to worry much about getting everything set up so it can be deployed.

And applying this to other departments as well: working with the biomarker team to apply this agile methodology for iterative delivery across other departments, expanding it out. Sri started by noting that there are a lot of different stakeholders using the Posit platforms, Shiny, and these visualizations at Regeneron. These are foundations that multiple departments can take advantage of, because at the end of the day we need to align data security and the deployment process and make sure these things are seamless for different teams. And then combining everything within the GXP environment is a much larger project, but the entire purpose of this has been to make sure that in this next generation platform, everything is harmonized and we have compliance, validation, and scalability across projects. And with that, I think we can move into the Q&A.

Q&A

Yes, that sounds great. Thank you so much, Sri and Ryan, Maya, Mike. Fantastic webinar, lots of great comments, a lot of great questions are coming in. I'm glad we've got about 15-20 minutes here, so I'm just going to go through the list and we'll see how far we get. If you have questions still, feel free to post those in the chat and we can tackle those as they come in.

There's one question here: how did you manage GXP citizen development from a process and compliance perspective? So, I think that's one for Regeneron. As far as SAS is concerned, that is our GXP environment at this point. For open source, that is something we are building, and it's on our roadmap for next year. Once we implement it, that's when citizen development comes into the picture, right? But I'll still answer your question. When you talk about any environment, whether it's Posit or anything else, there are two parts to it. One is validating the infrastructure itself. So what we would do is, first of all, validate the infrastructure for Posit. Isilon, the storage, is already validated, so that's out of the way, and then we go ahead and validate the Posit platform itself. Now, as far as the process is concerned, that's where citizen development comes in, and it's not specific to Posit; it can apply to anything. Essentially, you need standard operating procedures, and that is your validation. As long as you identify those SOPs, test all those things, and then follow them, you are GXP compliant. With Posit, one additional thing to think about is how you validate the packages. I'm not going to name any vendors, but there are a couple of vendors out there you can work with to get pre-validated packages. So: qualifying the infrastructure, validating your SOPs, having pre-validated packages, and probably taking a risk-based approach. That's how you validate the whole environment.

What is your preferred table package for the Shiny apps: reactable, DT, etc.? That's a great question. We actually ended up using DT in this instance because it has server-side rendering. You get a lot of prettier styling with reactable, but as Ryan mentioned, one of our primary asks was that the end users want to see results within a second. So we rely heavily on custom CSS on top of DT. But there's a lot of great stuff out there; I also really love gt. There are different tools, and you can make a matrix of the costs and benefits of each. Sorry, this is a non-answer, but for different use cases you're probably going to want a different table package.
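For readers curious what the server-side DT pattern mentioned above looks like in practice, here is a minimal sketch in R. The app structure and table content are illustrative only, not Regeneron's actual code:

```r
library(shiny)
library(DT)

ui <- fluidPage(
  # DT's table container; custom CSS can be layered on top for styling
  DTOutput("ae_table")
)

server <- function(input, output, session) {
  # server = TRUE keeps paging, sorting, and filtering in the R process,
  # so only the currently visible page of rows is sent to the browser.
  # That is what keeps large listings responsive at sub-second speeds.
  output$ae_table <- renderDT(
    datatable(iris, options = list(pageLength = 25)),
    server = TRUE
  )
}

shinyApp(ui, server)
```

With client-side rendering, the entire dataset ships to the browser up front; with `server = TRUE`, only each requested page does, which is the trade-off the speakers are describing.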

Another tech question: how is version control used within the environment? Is it for code, data, or both? And are there GXP regulatory implications for hosting the code on a platform like GitHub? And it continues on: does Isilon OneFS provide versioning features? For the code part, we're using Bitbucket, which is a similar platform to GitHub. We make it mandatory for any R work, like programming and Shiny app development, to use Bitbucket. That's quite different from SAS programming, and it's a learning curve for everybody. For the data part, there was a question about the Isilon storage. Obviously, we won't be using Isilon for version control; it is really going to be Bitbucket. That's what we plan to do, and it is already being implemented as we speak. I know that Ryan and his team have worked on quite a few use cases for that. One additional thing we are doing on top of it is enabling CI/CD, so hopefully that will be in place for next year.

People are trying to understand how big the team at Regeneron is, or maybe how big the team that helps maintain the AE apps is. So, we have 14 folks on package development and more advanced Shiny app development. At the same time, we ask our study teams to do the study-level applications like AE and a few others like SRCD and so on. So it's more like a collaborative effort.

Another quick question: the Shiny apps, are they available to clients, or are they mostly used internally to help biostats, the study teams, the programming teams? I'm not entirely sure what's meant by clients here, but we have our internal Regeneron stakeholders, like safety scientists, medical directors, clinical scientists, and so on. But you're probably right: for now, we just use it internally at Regeneron. We are not sharing the app with any external stakeholders.

Fantastic. There's another one here: how do you handle the validation of data manipulation done in the markdown when it comes to GXP? Yeah, I can start to field that; I think that's more of a Ryan question. So, regarding the stuff we're doing in the data preparation file: the application assumes the data is in a specific structure. It's a singular list object with different items in that list with specific names, but the data can have different column names mapping to those via the YAML file. So you're structuring the data; you're not manipulating any of the numbers. We also have, and this I know rather than think, a suite of unit-tested functions for any ETL we do on the data, to ensure that those functions are verified. I do know that in pharma, validated is a special word, so I don't want to use it incorrectly, but I can tell you that the data we're using is tested to be kosher. Ryan, I'm sure you can add to that. Yeah. In addition to what Maya mentioned regarding the unit testing, after the deployment we also need to go through an internal validation process with documentation. We build a bunch of checklists, and all of the checklists have to be completed before the release.
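As a rough illustration of the unit-tested data-prep functions described above, here is a sketch using testthat. The function name, the mapping format, and the column names are all hypothetical, not the actual Atorus/Regeneron helpers:

```r
library(testthat)

# Hypothetical data-prep helper: renames study-specific columns to the
# names the app expects, without altering any of the underlying values.
standardize_ae <- function(df, mapping) {
  # mapping is a named character vector: app_name = "study_name"
  names(df)[match(mapping, names(df))] <- names(mapping)
  df
}

test_that("column mapping renames without changing values", {
  raw <- data.frame(AETERM_RAW = c("Headache", "Nausea"))
  out <- standardize_ae(raw, c(AETERM = "AETERM_RAW"))
  expect_named(out, "AETERM")               # column was renamed
  expect_equal(out$AETERM, raw$AETERM_RAW)  # values are untouched
})
```

Tests like this establish that the prep step only restructures the data, which is the distinction Maya draws between structuring and manipulating numbers.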

Awesome. I got a question here for Sri potentially. Does Databricks provide functionality similar to that of Spark EMR clusters for parallel computations for ETL on the big data? Yeah, I can certainly take that question. So, if you're familiar with Databricks, it is built on top of Apache Spark. So, that is why anything which EMR Spark clusters would have, this has the same features as well. But then it does go beyond EMR Spark in the sense that it is a fully automated and managed system. So, that is why things like cluster setups, scaling, maintenance, everything can be done within a Databricks itself. So, that's why people don't need to focus on the infrastructure. They can just focus on the data processing. So, apart from that, Databricks has so many other features like Data Lake and then ETL. So, it really gives a good platform to work on. But yeah, quick answer to that question. Yes, it does have what EMR Spark does and beyond.

Fantastic. I got a question for Maya, I think. They would like to know how customizable the apps are and how you manage one-off demands. And there's a second question, which I think we can connect to this, asking about the customization available through the YAML configuration. That's a great question, and we made sure to address it early. We have this application framework that gives you a suite of functionality, but you're not locked into it. The end user still has full creative freedom to access the UI and server code, so they can add their own bespoke features for that one-off thing. And if that becomes more than a one-off, maybe they'll work with our team to add it as a feature to the parent app, so that we can deploy it across more applications. It's kind of a trade-off: you have this suite of functions that you get for free, but we also gave you the open-endedness to add your own code. That would not go in the YAML; you would have to write that R code yourself, if that makes sense.
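To make the config-driven idea concrete, a YAML file for the parent app might look something like the sketch below. Every field name here is hypothetical; the actual schema is not shown in the webinar:

```yaml
# Illustrative config only -- field names are made up, not the real schema.
# The framework reads a file like this to wire study data into the standard
# AE app; anything beyond it is written as bespoke R code in the child app.
study: R1234-XX-5678
data:
  source: ae_review.rds        # list object produced by the data-prep step
columns:                       # map study column names to app-expected names
  subject_id: USUBJID
  adverse_event: AETERM
modules:                       # which standard features to enable
  - listings
  - safety_summary
```

The point of the design is that everything in the file is declarative and free, while one-off features live in the study's own UI/server code rather than in the YAML.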

And Maya, I think there was another question that came in that was similar. They were asking about using the underlying R Markdown/Quarto files instead of small underlying R packages, and whether that is because of the encoding of the data or to help junior R users. So this file essentially reads the raw data, which is its own whole conversation that I'm sure Sri can answer around access controls, manipulates it, and then writes the data out as an RDS that's used inside the child application. If the question is about creating a separate package that holds all those datasets, I think that's just another approach that could potentially work. But the reason we did it as a file hosted on Connect is to essentially act as a cron job, so that we can rerun the file really easily. If you have Connect, there's this really nice GUI for rerunning a file on a schedule, say every night at 3 a.m. So that's kind of why we did it that way.

Yeah, I think Maya is 100% right. The reason we use a markdown is that we can deploy the markdown to Posit Connect as well, and from there we can schedule an automatic data refresh, because some of our apps connect to the data lake and require a daily refresh. So that's the way to do it. I was just going to echo all of that: data snapshots coming in, having that run on a set schedule, so that we can ensure the data are up to date and it's hands-off for us.
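The R code inside that scheduled markdown boils down to something like the sketch below. File paths and item names are made up for illustration:

```r
# Scheduled data-prep step (illustrative): read the latest raw extract,
# shape it into the single list object the app expects, and write an RDS
# for the child application to load. Deployed to Posit Connect and
# scheduled, each run refreshes the snapshot, e.g. nightly at 3 a.m.
raw <- read.csv("raw/ae_extract.csv")

app_data <- list(
  ae        = raw,        # named items matching what the app looks up
  refreshed = Sys.time()  # timestamp so users can see data currency
)

saveRDS(app_data, "app_data.rds")
```

Because the app only reads the RDS, the refresh cadence is controlled entirely by Connect's scheduler rather than by anything inside the app itself.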

So I've got a couple of questions here from various people about training and onboarding users: what's the training like for people with R and Shiny, the Atorus Academy bit? Our people were traditionally trained on SAS programming, and R and other open source programming are quite different. So we have already tried different approaches: we tried the Atorus Academy, and we also hosted a few internal sessions. Phil, we also invited Posit for some workshops. I think the bottom line is that we need a lot more focused training, and after the training we also need people to keep using R; otherwise the training effort doesn't stick.

Fantastic. Got a question for Sri here. They're wondering about where the clinical data is saved: the data lake, or a database server? And I've got another question that said, hey, we see Databricks in the architectural diagrams; we'd love to learn more about the role it plays in the ecosystem. So the first question is a really good one: where is the clinical data stored? Is it the data lake? I'm sort of smiling because, just before this webinar, I had exactly the same conversation. Depending on the person you ask, the definition of a data lake will be different: is it just the S3 buckets, or the entire environment? What do you call the data lake? But let me answer the question. As far as the clinical data within Regeneron is concerned, we have all agreed that Isilon is the storage location. That means everything is stored in Isilon, and that's what is used by any product, whether it's Posit or SAS; they read it out of that. That is a decision we have made. That said, if there is a desire, you can always use an S3 bucket or Postgres tables, or any other tables for that matter, for storage of the data. But in the case of Regeneron, it is Isilon.

And then the second question, regarding Databricks: the way we are really using it is as a backend, in the sense that if there is a big data use case where you want to reduce a large number of data rows, petabytes or terabytes of data, then we use Databricks in that particular scenario. The way we envision it, whether it's SAS or Posit, the source environment will push those backend queries into Databricks, which will execute them, reduce the datasets, and then push the results back to the source environment. So that is how it's being used.

Fantastic. Got a couple of questions here about the programming teams. One in particular asks: is there a dedicated team of programmers working on creating the Shiny apps, or is it the same trial programmers? And then they go on to ask: are there other tasks that you're switching from SAS to R like this? Yes, that's a good question. So we have a core team focused more on R package development. At the same time, we have study teams, where the trial programmers focus on implementations. It also depends on what kind of Shiny apps we are talking about. For the more standard AE app, the implementation is usually done by the study team, but the more advanced, top-line Shiny apps, because they include complicated FHC analyses with a lot of heavy customization, have to be done by our Shiny team. And on the second question: at Regeneron, we started with R Shiny development, but we are also evaluating using R for SDTM, ADaM, and TFL generation. That's more for regulatory submission requirements, so we are going in that direction.

I've got quite a few questions about validation. In particular, I think people are just trying to understand the GXP approach, the strategy at Regeneron for the Shiny apps, and really, Sri, even for the environment that you mentioned. Sri, do you want to? Yeah, I can start that off. In fact, I think I did answer that question earlier, but let me give a little more detailed response here. If you recollect my diagram about the roadmap, what we have validated currently is really the SAS environment, and we will validate the Posit environment, or make it GXP compliant, next year. As for how we'll go about it, I think we already have it in our plans. We'll first start with validating the infrastructure, meaning the storage as well as the platform. One of those two is already done, because Isilon is the common storage for both SAS and Posit, and because we went live with GXP SAS earlier, we have already validated Isilon. So the storage validation is done, and what we need to work on now is validating Posit as a platform. That will be the next thing, and even with that validated environment, a lot of use cases can be taken care of. The third thing we have to take care of after that is validation of the packages. We can always do it internally: we take various packages, validate them, and certify that it's done. Or the other strategy is to purchase validated packages from a vendor or vendors and consider those validated. And then the final thing is really the work that is being done: how do people build the R Shiny apps, how do we maintain them, how do we keep an audit trail? That is the big thing here, and that is really the different SOPs, or operating procedures, which we have to define.
And then we have to validate everything, in the sense that we have to define each and every procedure, test it out, and certify it as validated. Once we accomplish all of these things, then we can say the environment is truly validated.

Maybe one for Maya that builds on that. Someone was asking about testing the releases of the parent packages for the implementations of the app within a study or project. Yeah, I think we both kind of touched on this. We have those monthly release cadences that we scope our features to, and in that UAT week we test the new functionality as well as backwards compatibility. So, to take us back: every change to the application gets pushed to dev, and those changes are reviewed as a singular ticket or task. Then the entire suite of features for that release is tested with integration testing in UAT, so you have kind of a double layer of defense there. And the way we test that, Ryan, correct me if I'm wrong, is with prior studies that use the AE application as well as the ones that are on the roadmap. And within those studies, when deploying a new app, when the statistical programmers run the create-app function, they're also going to do some QC and go through that checklist that Ryan was mentioning. So we have a lot of layers of defense here.

Sri, I think people are interested in the sponsorship, or the spark, for the modernization initiative before the project even got started, you know, two or three years ago, and strategies to help with that.