Resources

Accelerating Study Insights with Shiny and Posit | Regeneron x Atorus

In this session, Atorus, Regeneron, and Posit showcase their innovative partnership. Topics covered:

- Building a Unified Analytical Environment: Learn how to reduce friction by integrating fragmented SAS and R workflows.
- Case Study: A Config-Driven App: See a live demo of how a simple configuration file can build a fully functional application for data review and safety monitoring.
- Operational Shifts: Understand how internal teams are empowered to independently manage and extend their own applications in a regulated environment.

This isn’t just about technology—it’s about the strategic shift required to simplify development and accelerate insights within a regulated environment. Learn more about Posit's work in the pharma space: https://posit.co/use-cases/pharma/?utm_source=youtube&utm_medium=social&utm_term=pharma&utm_content=atorus-regeneron

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Welcome, everybody. We've got an exciting webinar for you today featuring Regeneron x Atorus and the work that we've been doing together with those two teams. I'm going to be your host, Phil Bauscher. I'm going to kick things off today and then I'll also help with the Q&A at the end of the webinar. So if you have any questions throughout the webinar, feel free to post those in the chat. And then during the Q&A, we'll get to those questions as a group.

I've got some really awesome presenters who are going to take you through the sessions today. First, we've got Sri Krishna Murthy, the Director of Product Management Global IT at Regeneron. We've also got Ryan Hu, the Associate Director of Stats Programming. And then we've got Maya Gaines, Associate Director at Atorus. And we've got Michael Stackhouse, CIO of Atorus as well.

And so we're really excited to highlight Regeneron today. They have been such a big user of open source across so many different areas of the company, drug discovery, PKPD, even people ops, and now bridging into clinical reporting. Today we're going to talk about the fragmented legacy environment and how they were able to unify that into a connected ecosystem using open source and Shiny to move from static reporting to dynamic, scalable, and real-time data science and highlight how they partnered with Atorus to do a lot of that work. So we're going to pass things over now to Sri to kick off the webinar.

Building the unified analytics platform

Thank you so much, Phil. And good afternoon, everyone. My name is Sri Krishna Murthy. I'm Director of Global Development IT within Regeneron, and I'm very excited to be in this webinar. We'll be talking about how, at Regeneron, we have created a compliant platform that supports multilingual analytics, and how multiple groups within Global Development at Regeneron use this platform to rapidly build standardized, reusable, and scalable R Shiny applications that support clinical review.

So this is a quick outline of what we are going to be speaking about. Initially, we'll talk about how we created the platform itself, and then we'll go into how the R Shiny app was created and some of the best design patterns for making things standardized and reusable.

All right. So let's start at the beginning. Let's go back a couple of years, to 2023, and talk about why we needed to build this and what the problem statement was. One thing our user community mentioned was that we are big users of SAS, like any other life sciences company, because SAS is typically used to submit to the FDA. But users expressed interest in doing more work with open source programming languages such as R and Python. And even though they were using those languages, it was all done on laptops. So there was a desire for some sort of centralized way of using open source programming in addition to SAS.

They also needed collaboration between the various platforms at Regeneron. For example, we had SAS 9.4 on prem, and then we had R and Python on laptops. But there was limited collaboration between these environments because of the fragmented nature of the workspace, and lack of collaboration meant that the same artifact would be regenerated over and over again. So they wanted better ways of working with each other, so that anything created could be reused. Apart from that, our SAS environment itself was nearing end of life, or at least the end of new features, and we needed to modernize or refresh it. And there were certain areas where performance was low and not scalable. So these were the problem statements we wanted to attack.

Based on this problem statement, we defined our goal: to build a scalable, multilingual, multi-source computing and analytics platform across the enterprise. Let me go into some detail on what exactly we mean by this. By scalable, we mean that the platform should be able to scale as we get more data and more studies. It should be multilingual in the sense that it supports SAS, but also languages such as R, Python, and Scala, to name a few. And it should be multi-source, meaning it can support a proprietary platform such as SAS alongside an open source platform such as Posit.

So all these things should go together and people should be able to seamlessly flow from one environment to another and then conduct their analytics. So that was our goal.

Platform roadmap: 2023 to 2026

So let's start with the roadmap: where we started in 2023, where we are at present, and the plan going forward. In 2023, as I mentioned earlier, we had legacy SAS 9.4, a GXP system installed on-prem, and the storage was Isilon, which was also on-prem. For the non-GXP use cases, we had SAS Studio 9.4 installed on AWS. So that was the environment. As you can see from this diagram, we really did not have any open source environment as such, but people did use R and Python on their laptops.

Moving on to 2024, we rolled out our open source global environment, which was the Posit platform, and we branded that as RSGD, which stands for RStudio Global Development environment. That gained a lot of traction and popularity, and multiple groups started using it for open source development. In 2024 itself, we also started planning how we would migrate SAS as well and build a consolidated ecosystem. Then in 2025, we upgraded the RSGD platform and rebranded it as OSGD, to give it the flavor of open source, not just R. That was somewhat parallel to Posit renaming RStudio to Posit; we wanted that open source flavor. We also moved the operating system from RHEL 7, which was somewhat outdated, to RHEL 9, so that we had the most modern operating system, and we introduced a lot of new features related to security and scaling. In addition, we started a major migration to SAS Viya.

Right now, we have moved SAS Grid to SAS Viya for the non-GXP environment; that is already live, and the migration of the studies is done as well. And come November, we should be able to go live with GXP SAS Viya as well. So we have done a lot this year: we have scaled out our open source Posit environment, we went live with the SAS Viya GXP and non-GXP environments, and we also installed a few environments that can handle big data. For example, we started using Databricks as the backend for big data use cases. And one important thing: if you look at the center of the diagram, you can see shared storage. We established Isilon OneFS as the shared storage between SAS Viya and the Posit platform.

Going into 2026, I would say 2025 is really the foundational year, and 2026 will be the year of consolidation, where we are going to truly unify all these different platforms. We are also going to push all the high performance use cases to either Databricks or to AWS clusters, and we'll have a clear segregation between GXP and non-GXP use cases. Many of these platforms, including Posit, will also become GXP at this stage. And then, of course, we'll be introducing new GenAI and AI use cases, integrating with Copilot and other LLMs. Finally, we'll be enabling multi-level security so that we can have fine-grained security at folder and subfolder level, as well as at row and column level. So that is our overall roadmap. We started small, then incrementally built this product, and by 2026 we should have a fully built, unified platform.

Technical highlights

So let's look at some of the technical highlights. As I mentioned earlier, we enabled unified storage using Isilon OneFS. There are a few alternatives to that, which we can go into later. But we use Isilon OneFS on AWS to support dual mounting across Windows and Linux, connecting to both the Posit platform and SAS. We are also GXP ready, or will be GXP ready, so that it can support both exploratory and primary endpoints. This is an important one, because we don't want to do work on Windows laptops anymore; we want to do everything from a centralized area. So all the work is now done either on SAS on cloud or on the Posit platform. No work gets done on laptops any longer.

As I mentioned earlier, we have enabled parallel computing. This is really the back end: if you want to do any parallel computing, we have dedicated AWS clusters where the computation gets pushed, either from SAS or from Posit. We also have harmonized directory structures, and this is one of the important points of collaboration: you have the same directory structure whether you work from SAS or from Posit. And it is big data ready, because if you have petabytes or terabytes of data, billions of rows, that computation can be pushed into Databricks, and we can enable those use cases. This is particularly helpful for some of our user community, including HEOR (health economic outcomes research) or the PPT. And finally, security and access is a big part of it. We support access management, and blinding and unblinding, which is very important for pharma use cases, through this mechanism.

I'll go quickly through this. Using this new platform, we have 80% faster queries on large datasets, and 30 minutes of productivity gain per dataset because of the shared file system across the different platforms. At this point, it's highly popular, and it is the go-to platform for any kind of R Shiny development. We have almost 11 departments within Global Development alone using this platform, and 500 plus users and growing, across GXP and non-GXP use cases. At this point, I would like to hand over to my co-worker, Ryan Yu, who is going to talk about how his group, statistical programming, uses this platform to create cutting-edge R Shiny apps. Thank you very much.

R-Shiny app design principles

Thank you, Sri. At Regeneron, in our statistical programming group, we have three technology visions: real-time, dynamic, and scalable. We think R Shiny is one of the critical components to help us achieve those visions. Our approach to developing R Shiny apps is through R packages, and this slide shows our R package design principles: standardization, flexibility, and ease of use.

For the standardization part, we mean each R package implements standards, a little like building blocks, covering different data analyses. And while maintaining standardization, we also allow customization, because that's the nature of clinical trial analysis: different study teams always have different requirements, right?

So to make R Shiny app development simpler for programmers with limited R knowledge, we introduced a metafile system. The system allows them to create a study-level application with minimal coding effort.

Case study: adverse events monitoring app

So I'd like to show a case study. This is an R Shiny app focused on adverse events monitoring in clinical trials, mainly used by safety scientists. In the next few slides, I'm going to highlight some key features from the app.

From the top of this screenshot, you see different tabs for different data analyses. We have a global filters panel on the left side and an output panel on the right side. On this page, at the top, we have a local filter, which only applies to this page. Down below, we have this waterfall plot, which shows the percentage of patients by system organ class. The bars are interactive: if you hover over one, you will see the count of patients who had adverse events in that category, and it also shows the population denominator and the corresponding percentage.

The other key feature I want to talk about is in the upper right: a download feature. The data in this app keeps refreshing over time, so if the end users want to keep a record, a static report of the current data version, they can use this download feature. When they click the button, it downloads an R Markdown report rendered to HTML, and this report preserves interactive features from the app, such as tooltips.
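The download feature described above can be sketched with Shiny's standard `downloadHandler` pattern. This is a minimal sketch, not the production code: the template name `ae_report.Rmd`, the parameter names, and the reactive `filtered_data()` are all hypothetical.

```r
library(shiny)

# Server-side sketch: render a parameterized R Markdown template to a
# self-contained HTML file, freezing the current data version.
# "ae_report.Rmd", the params, and filtered_data() are hypothetical.
output$report <- downloadHandler(
  filename = function() sprintf("ae-report-%s.html", Sys.Date()),
  content = function(file) {
    rmarkdown::render(
      input       = "ae_report.Rmd",                 # report template
      output_file = file,
      params      = list(data = filtered_data(), snapshot = Sys.time()),
      envir       = new.env(parent = globalenv())    # isolate rendering
    )
  }
)
```

Rendering to HTML rather than PDF is what lets htmlwidgets, and therefore interactive features like tooltips, survive in the static report.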

Another page from the same app is an AE timeline, a visualization of all the adverse events by patient. The x-axis is the study day. Each line is an adverse event, so from its position we can tell roughly when the event started and when it stopped. If we focus on one patient, you can tell this patient had four events, color-coded by severity: three mild events and one moderate. Each line is also interactive. On hover, we can see the name of the preferred term, the start date, stop date, and the duration. And if you want to learn more about the adverse events, you can click the line, and it will show a pop-up window that includes all the adverse events for that patient. In this window, you can see more columns: severity, causality, and so on.
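An interactive timeline like the one just described can be sketched with ggplot2 plus plotly. This is only an illustration: the data frame and its column names (`patient`, `term`, `start_day`, `end_day`, `severity`) are hypothetical stand-ins for the study data, not the app's actual implementation.

```r
library(ggplot2)
library(plotly)

# Toy AE data: one row per adverse event (hypothetical columns).
ae <- data.frame(
  patient   = c("P01", "P01", "P02"),
  term      = c("Headache", "Nausea", "Fatigue"),
  start_day = c(3, 10, 5),
  end_day   = c(7, 12, 20),
  severity  = c("MILD", "MODERATE", "MILD")
)

# One horizontal segment per event, colored by severity. The "text"
# aesthetic is unknown to ggplot2 (it warns) but ggplotly() uses it
# as the hover tooltip.
p <- ggplot(ae, aes(x = start_day, xend = end_day,
                    y = patient, yend = patient,
                    color = severity,
                    text = paste(term, start_day, "to", end_day))) +
  geom_segment(linewidth = 3) +
  labs(x = "Study day", y = "Patient")

ggplotly(p, tooltip = "text")
```

Click-to-popup behavior would come from plotly's `event_data()` inside a Shiny app, driving a modal with the full AE listing for the selected patient.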

Here's another example from the app. This table is a typical summary table by system organ class and preferred term (PT). On the left side, if we choose serious adverse events only, the table is re-rendered to include only serious events. The entire table is also interactive: if we click the number two in the red box, a pop-up window shows which patient IDs contributed to that summary count. This feature builds the link between the summary report and the patient-level data, which makes data review super efficient.

Another feature I want to highlight is SMQs (standardised MedDRA queries). All the SMQs relevant to the study data are built into the app. In this example, if I choose acute pancreatitis, and for the type I choose broad plus narrow, then the same summary table by PT we saw previously is re-rendered to include only the adverse events that belong to the acute pancreatitis SMQ.

Tips for successful Shiny apps

All right, so based on what we've learned, I'd also like to share some tips about how to make successful Shiny apps. First, we need to build powerful functionality. We need to talk to our stakeholders and understand their requirements. For example, we can include subgroup analyses, offer different statistical models for efficacy analysis for them to choose from, and give them options to explore relationships between different parameters.

Second, the user interface is important, because an appealing user interface helps end users adopt the new tool quickly. And we also care about user experience, of course. For example, we require response times of one second or less for any calculation and rendering. We also need to provide proper training and guidelines to the end users so they can get familiar with the app, leverage all its functionality and power, and make their review more efficient.

Lastly, we also care about user feedback. On the right side, we designed this within-app feedback collection panel. In this small panel, users can rate their experience and enter comments and questions. After submission, it triggers an email, so the relevant team members receive it and respond accordingly. Okay, so that's the end of my part. I'll pass to Maya. Thank you.

Config-driven app framework

Thanks, Ryan. So as Ryan went over this case study, we can use that and extrapolate it here for what we've done at Regeneron. It's common practice, right, that we make a Shiny application for a specific study with specific endpoints in mind. And once we're done with that, we want to reuse what we've just created for another study. In the imagery here, we can lift and shift by copying and pasting essentially all the code in that repository. But what happens if you find a bug in that second study? Then you have to copy and paste the fix back into the other repository. What about a feature? Ryan showed an AE timeline, but maybe there's another graphic you want that's specific to this second study. How am I going to back-add that to the first study? And let's extrapolate that even farther, to your entire portfolio of studies. How am I going to take that feature and copy and paste it into a bunch of different repositories with different study data? That seems onerous and a little cumbersome.

So what we have done here is create a layer of abstraction. We created a package that is essentially a suite of features and functions to create this Shiny app that you can configure for all of your studies. Now if you find a bug in study C, we can replace that code in the package and then update your child repositories. Similarly, you can turn on and off different features within your child repositories. Maybe there's a bespoke chart that applies to a really rare disease that doesn't apply to another study. You could turn those on or off very easily.
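A parent package with switchable features, as described above, might look roughly like the sketch below. `create app` is the function named in the talk, but everything else here — the module names, the config fields, and the overall shape — is an illustrative assumption, not the actual AE package.

```r
library(shiny)

# Sketch of a config-driven app constructor. cfg is a list parsed from
# the study's YAML file; ae_timeline_ui/server etc. are hypothetical
# Shiny modules that a parent package like this would export.
create_app <- function(cfg, data) {
  tabs <- list()
  if (isTRUE(cfg$features$ae_timeline)) {
    tabs <- c(tabs, list(tabPanel("AE Timeline", ae_timeline_ui("tl"))))
  }
  if (isTRUE(cfg$features$summary_table)) {
    tabs <- c(tabs, list(tabPanel("Summary", ae_summary_ui("sum"))))
  }

  # Splice the enabled tabs into a navbarPage.
  ui <- do.call(navbarPage, c(list(title = cfg$study_name), tabs))

  server <- function(input, output, session) {
    if (isTRUE(cfg$features$ae_timeline))   ae_timeline_server("tl", data, cfg)
    if (isTRUE(cfg$features$summary_table)) ae_summary_server("sum", data, cfg)
  }

  shinyApp(ui, server)
}
```

With this shape, a study's app.R reduces to a few lines (read the config, load the prepared data, call the constructor), and a bug fix in the package propagates to every study on its next deployment.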

Let's get nerdy and talk about implementation. How did we do this? Each study now has Shiny code that calls that parent package. At Regeneron, we call it the AE package, and within that package there's a function called create app. So I'm going to create a blank-slate repository, and it's going to have three files in it. One is my app.R, which calls the create app function of our custom suite of functionality. Another file is an R Markdown script; this could be any R file, or a Quarto document. We chose this because it's the data preparation file that pulls the data for this specific study, and each application requires the data to be in a rigid format, so we do all of our data manipulation and ETL in this file, which we also host on Connect. We host it on Connect because Connect makes it really easy to schedule a job to rerun the file at some cadence. So whenever this study's data is updated, the file reruns, and its output is used as the input for the Shiny app. Lastly, as Ryan touched upon, we have a YAML file, which is essentially a configuration file.

This framework calls variable names from configuration instead of relying on static column names, because those can change across studies. The configuration file tells us which column to map to which variable for that specific study. Using this paradigm, we can now create these three-file setups for each of our studies on Connect, across our entire portfolio of studies. This process has allowed us to optimize a Shiny framework for performance, security, and maintainability. Starting with your end users and a POC on a single study is a great thing to do. But eventually, the goal here wasn't just to build an app, it was to build a foundation.
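As an illustration, a study-level YAML configuration of the kind described might look like the fragment below. Every field name here is a hypothetical example, since the actual schema wasn't shown; the column names on the right are standard CDISC-style names.

```yaml
# Hypothetical study configuration (field names are illustrative)
study_name: "Study ABC-123"
features:
  ae_timeline: true
  summary_table: true
  waterfall_plot: false     # switched off for this study
variables:
  # canonical name used by the app -> this study's column name
  patient_id: USUBJID
  term: AETERM
  severity: AESEV
```

The `features` block is what lets a study turn modules on or off, and the `variables` block is the mapping layer that insulates the shared code from study-to-study differences in column names.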

So that is what we've done here. We've also included a verified handoff: the code is fully documented, we have infrastructure notes, and this allows the study programmers to configure these applications quite easily. We also hold knowledge transfer sessions with the statistical programmers who run the create app function, and provide the training materials mentioned earlier.

We follow this process, including the feedback cycle Ryan showed, within a defined software development life cycle, to keep releases consistent and manageable. Each cycle is a three-week development cycle with one week of UAT, where we test new features on the child applications to make sure we've accounted for backwards compatibility and the features work with different data sets and structures. Features are scoped to fit within these cycles, which keeps our progress steady and predictable. Each release includes detailed notes sent to the team covering any new functionality, bug fixes, and how to integrate the new version of the parent package into your existing application, so downstream users can fully adopt the new features and functionality. Lastly, this process reinforces best practices around version control, validation, documentation, you name it, helping the application evolve safely while remaining stable, and it has made this pretty easy to maintain.

Key takeaways and what's next

So as we look at the key takeaways: when we started working with Regeneron, one of the critical things we dove into first was building a unified environment. Having to migrate data out of the SAS platform, and not having a single source of truth for the data you're working with and what things are being driven on, really causes friction, and it makes it harder to implement these types of applications in practice and make them successful. So when we looked at Regeneron's modernization plans, what they wanted to do with the environment, how they wanted to grow their Posit environment, and how it fit into the larger GXP environment, making sure that it was unified, that everything was talking to centralized storage, and that the directory structures and the permissions and security model were harmonized across everything was really critical to keeping this seamless as the projects grew.

The second piece was to put this power in the hands of early-stage R users. When we're building these applications, it's difficult to expect everyone to be an expert. With lots of ongoing trials and study teams, you can't expect all of them to have a specialist like Maya building all of the visualizations and deploying the platform. We wanted this to be something we could put into the hands of more junior users, giving power to the domain experts who are familiar with the data at hand. We reduced the scope of what they're responsible for while still making sure they can deploy and take advantage of these applications. That gives a lot of power and autonomy to the Regeneron team, while centralizing the development expertise in a team that can be responsible for the visualization and development of everything at an advanced level.

Additionally, when we look at validation, we're centralizing where the development of the applications is actually done, so that we can implement testing at a central level and de-risk the deployment of these applications at the individual study level. As Maya pointed out, we can fix bugs in one place and have that proliferate out to the apps as we deploy. This was a really awesome implementation: we found we could get a new study started in about 30 minutes through the YAML-based configuration, making it plug and play with the data sets, with a small ETL pipeline that just focuses on getting the data ready for the application, and a central place to configure which parts of the app we want enabled. That makes it a lot easier to digest for those more junior users, or rather, thinking of them not as junior but as the domain specialists rather than R experts.

And lastly, governance and support: this shared responsibility model, with an SDLC-aligned core package, ensures sustainable growth in a feedback-driven release cycle. One of the mental models I've found in pretty much any statistical programming group working on regulatory deliverables is the idea of a "definition of done." With these Shiny applications, you have much more of an ongoing cycle. Maya mentioned the software development life cycle. Embracing that model, and understanding that you can work iteratively, release updates, and gradually improve over time instead of insisting on perfection before going live, really empowers the team to start taking advantage of these benefits, gather feedback, and continually improve as we go.

It's a difficult mental model for these teams to adopt, because we're used to preparing a deliverable, validating it, publishing it, handing it off to the ultimate stakeholders, and not being done until that release is completed. By putting together this SDLC, getting the team on board with it, and working through the process with everyone, it's reduced a lot of friction and made it a lot easier for us to continually improve and get these benefits into the hands of the study teams themselves.

Ultimately, what this gives us is scalable, reproducible, and regulated R-based data review for Regeneron. As for the benefits: we can deploy these applications to any study that needs them. There's a simple setup process and a simple mechanism to get an app out there, secured, and to enable the study team to use it. We can get from instantiation of the app to deployment within two hours. The build process is heavily simplified; it's really easy, even for novice R users, to get an app up and running. It enables faster decision-making for the teams, because we're giving them that application, and we can trust that it's been tested and that the visualizations on the page will deliver the baseline application we've put out there.

Improving over time is a simpler process. It's not rebuilding the app, and it's not the heavy burden of up-versioning everything. We make our updates, push them, and accept the new version of the app. It's a heavily simplified process, especially coming from the configuration-driven files: versioning to the new version of the app is straightforward. And it's a powerful tool in senior management meetings, giving visibility through beautiful visualizations that are helpful, informative, and provide feedback so much more immediately. It makes Ryan's life easier, because his team can get this up and running and put it out there, and you're not sending requests back to the stat programming team to quickly push out a report and then waiting on them in order to do reporting. They have tools at their fingertips that can easily look at different subsets of data, give different views into different things, and answer those questions immediately, instead of having programmers on hand just waiting for whatever request comes in.

So, what's next? Expanding the paradigm: reusing the SDLC and framework for more applications. Ryan's team dreams big; there's a lot of potential in the applications they can build here, and this model lets us build up those applications, get the use cases fit, and have a model to deploy them with. Continuing to empower our developers: providing ready-to-use functions so they can focus on their analysis, not infrastructure. We want their work to be seamless, without them having to think too hard about the underlying data platform. And the unified environment: making sure we're aligning with the permissions models and the harmonized data structures, so that from project to project you can trust those foundations are in place, and you don't need to worry much about getting everything set up so it can be deployed.

And applying this to other departments as well: working with the biomarker team to apply this agile methodology for iterative delivery across other departments, expanding it out. Sri started by noting that there are a lot of different stakeholders using the Posit platforms, Shiny, and these visualizations at Regeneron. These are foundations that multiple departments can take advantage of, because at the end of the day we need to align data security and the deployment process and make sure these things are seamless for different teams. And then combining everything within the GXP environment is a much larger project, but the entire purpose of this has been to make sure that in this next generation platform, everything is harmonized and we have compliance, validation, and scalability across projects. And with that, I think we can move into the Q&A.

Q&A

Yes, that sounds great. Thank you so much, Sri and Ryan, Maya, Mike. Fantastic webinar, lots of great comments, a lot of great questions are coming in. I'm glad we've got about 15-20 minutes here, so I'm just going to go through the list and we'll see how far we get. If you have questions still, feel free to post those in the chat and we can tackle those as they come in.

There's one question here: how did you manage GXP citizen development from a process and compliance perspective? So, I think that's one for Regeneron. As far as SAS is concerned, that is our GXP environment at this point. For open source, that is something we are building, and it's on our roadmap for next year. Once we implement it, that's when citizen development comes into the picture, right? But I'll still answer your question. When you talk about any environment, whether it's Posit or anything else, there are two parts to it. One is validating the infrastructure itself. So what we would do is, first of all, validate the infrastructure for Posit. Isilon, the storage, is already validated, so that's out of the way, and then we go ahead and validate the Posit platform itself. Now, as far as the process is concerned, that's where citizen development comes in, and it's not specific to Posit; it can apply to anything. Essentially, you need standard operating procedures, and that is your validation. As long as you identify those SOPs, test all those things, and then follow them, you are GXP compliant. With Posit, one additional thing to think about is how you validate the packages. I'm not going to name any vendors, but there are a couple of vendors out there you can work with to get pre-validated packages. So: qualifying the infrastructure, validating your SOPs, having pre-validated packages, and probably taking a risk-based approach. That's how you validate the whole environment.

What is your preferred table package for the Shiny apps: reactable, DT, etc.? That's a great question. We actually ended up using DT in this instance because it has server-side rendering. You get a lot of prettier styling with reactable, but as Ryan mentioned, one of our primary asks was that the end users want to see results within a second. So we rely heavily on custom CSS on top of DT. But there's a lot of great stuff out there; I also really love gt. There are different tools, and you can make a matrix of the costs and benefits of each. Sorry, this is a non-answer, but for different use cases you're probably going to want a different table package.
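For readers curious what the server-side DT pattern mentioned above looks like in practice, here is a minimal sketch in R. The app structure and table content are illustrative only, not Regeneron's actual code:

```r
library(shiny)
library(DT)

ui <- fluidPage(
  # DT's table container; custom CSS can be layered on top for styling
  DTOutput("ae_table")
)

server <- function(input, output, session) {
  # server = TRUE keeps paging, sorting, and filtering in the R process,
  # so only the currently visible page of rows is sent to the browser.
  # That is what keeps large listings responsive at sub-second speeds.
  output$ae_table <- renderDT(
    datatable(iris, options = list(pageLength = 25)),
    server = TRUE
  )
}

shinyApp(ui, server)
```

With client-side rendering, the entire dataset ships to the browser up front; with `server = TRUE`, only each requested page does, which is the trade-off the speakers are describing.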

Another tech question: how is version control used within the environment? Is it for code, data, or both? And are there GXP regulatory implications for hosting the code on a platform like GitHub? And it continues on: does Isilon OneFS provide versioning features? For the code part, we're using Bitbucket, which is a similar platform to GitHub. We make it mandatory for any R work, like programming and Shiny app development, to use Bitbucket. That's quite different from SAS programming, and it's a learning curve for everybody. For the data part, there was a question about the Isilon storage. Obviously, we won't be using Isilon for version control; it is really going to be Bitbucket. That's what we plan to do, and it is already being implemented as we speak. I know that Ryan and his team have worked on quite a few use cases for that. One additional thing we are doing on top of it is enabling CI/CD, so hopefully that will be in place for next year.

People are trying to understand how big the team at Regeneron is, or maybe how big the team that helps maintain the AE apps is. So, we have 14 folks on package development and more advanced Shiny app development. At the same time, we ask our study teams to do the study-level applications like AE and a few others like SRCD and so on. So it's more like a collaborative effort.

Another quick question: the Shiny apps, are they available to clients, or are they mostly used internally to help biostats, the study teams, the programming teams? I'm not entirely sure what's meant by clients here, but we have our internal Regeneron stakeholders, like safety scientists, medical directors, clinical scientists, and so on. But you're probably right: for now, we just use it internally at Regeneron. We are not sharing the app with any external stakeholders.

Fantastic. There's another one here: how do you handle the validation of data manipulation done in the markdown when it comes to GXP? Yeah, I can start to field that; I think that's more of a Ryan question. So, regarding the stuff we're doing in the data preparation file: the application assumes the data is in a specific structure. It's a singular list object with different items in that list with specific names, but the data can have different column names mapping to those via the YAML file. So you're structuring the data; you're not manipulating any of the numbers. We also have, and this I know rather than think, a suite of unit-tested functions for any ETL we do on the data, to ensure that those functions are verified. I do know that in pharma, validated is a special word, so I don't want to use it incorrectly, but I can tell you that the data we're using is tested to be kosher. Ryan, I'm sure you can add to that. Yeah. In addition to what Maya mentioned regarding the unit testing, after the deployment we also need to go through an internal validation process with documentation. We build a bunch of checklists, and all of the checklists have to be completed before the release.
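As a rough illustration of the unit-tested data-prep functions described above, here is a sketch using testthat. The function name, the mapping format, and the column names are all hypothetical, not the actual Atorus/Regeneron helpers:

```r
library(testthat)

# Hypothetical data-prep helper: renames study-specific columns to the
# names the app expects, without altering any of the underlying values.
standardize_ae <- function(df, mapping) {
  # mapping is a named character vector: app_name = "study_name"
  names(df)[match(mapping, names(df))] <- names(mapping)
  df
}

test_that("column mapping renames without changing values", {
  raw <- data.frame(AETERM_RAW = c("Headache", "Nausea"))
  out <- standardize_ae(raw, c(AETERM = "AETERM_RAW"))
  expect_named(out, "AETERM")               # column was renamed
  expect_equal(out$AETERM, raw$AETERM_RAW)  # values are untouched
})
```

Tests like this establish that the prep step only restructures the data, which is the distinction Maya draws between structuring and manipulating numbers.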

Awesome. I got a question here for Sri potentially. Does Databricks provide functionality similar to that of Spark EMR clusters for parallel computations for ETL on the big data? Yeah, I can certainly take that question. So, if you're familiar with Databricks, it is built on top of Apache Spark. So, that is why anything which EMR Spark clusters would have, this has the same features as well. But then it does go beyond EMR Spark in the sense that it is a fully automated and managed system. So, that is why things like cluster setups, scaling, maintenance, everything can be done within a Databricks itself. So, that's why people don't need to focus on the infrastructure. They can just focus on the data processing. So, apart from that, Databricks has so many other features like Data Lake and then ETL. So, it really gives a good platform to work on. But yeah, quick answer to that question. Yes, it does have what EMR Spark does and beyond.

Fantastic. I got a question for Maya, I think. They would like to know how customizable the apps are and how you manage one-off demands. And there's a second question, which I think we can connect to this, asking about the customization available through the YAML configuration. That's a great question, and we made sure to address it early. We have this application framework that gives you a suite of functionality, but you're not locked into it. The end user still has full creative freedom to access the UI and server code, so they can add their own bespoke features for that one-off thing. And if that becomes more than a one-off, maybe they'll work with our team to add it as a feature to the parent app, so that we can deploy it across more applications. It's kind of a trade-off: you have this suite of functions that you get for free, but we also gave you the open-endedness to add your own code. That would not go in the YAML; you would have to write that R code yourself, if that makes sense.
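To make the config-driven idea concrete, a YAML file for the parent app might look something like the sketch below. Every field name here is hypothetical; the actual schema is not shown in the webinar:

```yaml
# Illustrative config only -- field names are made up, not the real schema.
# The framework reads a file like this to wire study data into the standard
# AE app; anything beyond it is written as bespoke R code in the child app.
study: R1234-XX-5678
data:
  source: ae_review.rds        # list object produced by the data-prep step
columns:                       # map study column names to app-expected names
  subject_id: USUBJID
  adverse_event: AETERM
modules:                       # which standard features to enable
  - listings
  - safety_summary
```

The point of the design is that everything in the file is declarative and free, while one-off features live in the study's own UI/server code rather than in the YAML.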

And Maya, I think there was another question that came in that was similar. They were asking about using the underlying R Markdown/Quarto files instead of small underlying R packages, and whether that is because of the encoding of the data or to help junior R users. So this file essentially reads the raw data, which is its own whole conversation that I'm sure Sri can answer around access controls, manipulates it, and then writes the data out as an RDS that's used inside the child application. If the question is about creating a separate package that holds all those datasets, I think that's just another approach that could potentially work. But the reason we did it as a file hosted on Connect is to essentially act as a cron job, so that we can rerun the file really easily. If you have Connect, there's this really nice GUI for rerunning a file on a schedule, say every night at 3 a.m. So that's kind of why we did it that way.

Yeah, I think Maya is 100% right. The reason we use a markdown is that we can deploy the markdown to Posit Connect as well, and from there we can schedule an automatic data refresh, because some of our apps connect to the data lake and require a daily refresh. So that's the way to do it. I was just going to echo all of that: data snapshots coming in, having that run on a set schedule, so that we can ensure the data are up to date and it's hands-off for us.
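The R code inside that scheduled markdown boils down to something like the sketch below. File paths and item names are made up for illustration:

```r
# Scheduled data-prep step (illustrative): read the latest raw extract,
# shape it into the single list object the app expects, and write an RDS
# for the child application to load. Deployed to Posit Connect and
# scheduled, each run refreshes the snapshot, e.g. nightly at 3 a.m.
raw <- read.csv("raw/ae_extract.csv")

app_data <- list(
  ae        = raw,        # named items matching what the app looks up
  refreshed = Sys.time()  # timestamp so users can see data currency
)

saveRDS(app_data, "app_data.rds")
```

Because the app only reads the RDS, the refresh cadence is controlled entirely by Connect's scheduler rather than by anything inside the app itself.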

So I've got a couple of questions here from various people about training and onboarding users: what's the training like for people with R and Shiny, the Atorus Academy bit? Our people were traditionally trained on SAS programming, and R and other open source programming are quite different. So we have already tried different approaches: we tried the Atorus Academy, and we also hosted a few internal sessions. Phil, we also invited Posit for some workshops. I think the bottom line is that we need a lot more focused training, and after the training we also need people to keep using R; otherwise the training effort doesn't stick.

Fantastic. Got a question for Sri here. They're wondering about where the clinical data is saved: the data lake, or a database server? And I've got another question that said, hey, we see Databricks in the architectural diagrams; we'd love to learn more about the role it plays in the ecosystem. So the first question is a really good one: where is the clinical data stored? Is it the data lake? I'm sort of smiling because, just before this webinar, I had exactly the same conversation. Depending on the person you ask, the definition of a data lake will be different: is it just the S3 buckets, or the entire environment? What do you call the data lake? But let me answer the question. As far as the clinical data within Regeneron is concerned, we have all agreed that Isilon is the storage location. That means everything is stored in Isilon, and that's what is used by any product, whether it's Posit or SAS; they read it out of that. That is a decision we have made. That said, if there is a desire, you can always use an S3 bucket or Postgres tables, or any other tables for that matter, for storage of the data. But in the case of Regeneron, it is Isilon.

And then the second question, regarding Databricks: the way we are really using it is as a backend, in the sense that if there is a big data use case where you want to reduce a large number of data rows, petabytes or terabytes of data, then we use Databricks in that particular scenario. The way we envision it, whether it's SAS or Posit, the source environment will push those backend queries into Databricks, which will execute them, reduce the datasets, and then push the results back to the source environment. So that is how it's being used.

Fantastic. Got a couple of questions here about the programming teams. One in particular asks: is there a dedicated team of programmers working on creating the Shiny apps, or is it the same trial programmers? And then they go on to ask: are there other tasks that you're switching from SAS to R like this? Yes, that's a good question. So we have a core team focused more on R package development. At the same time, we have study teams, where the trial programmers focus on implementations. It also depends on what kind of Shiny apps we are talking about. For the more standard AE app, the implementation is usually done by the study team, but the more advanced, top-line Shiny apps, because they include complicated FHC analyses with a lot of heavy customization, have to be done by our Shiny team. And on the second question: at Regeneron, we started with R Shiny development, but we are also evaluating using R for SDTM, ADaM, and TFL generation. That's more for regulatory submission requirements, so we are going in that direction.

I've got quite a few questions about validation. In particular, I think people are just trying to understand the GXP approach, the strategy at Regeneron for the Shiny apps, and really, Sri, even for the environment that you mentioned. Sri, do you want to? Yeah, I can start that off. In fact, I think I did answer that question earlier, but let me give a little more detailed response here. If you recollect my diagram about the roadmap, what we have validated currently is really the SAS environment, and we will validate the Posit environment, or make it GXP compliant, next year. As for how we'll go about it, I think we already have it in our plans. We'll first start with validating the infrastructure, meaning the storage as well as the platform. One of those two is already done, because Isilon is the common storage for both SAS and Posit, and because we went live with GXP SAS earlier, we have already validated Isilon. So the storage validation is done, and what we need to work on now is validating Posit as a platform. That will be the next thing, and even with that validated environment, a lot of use cases can be taken care of. The third thing we have to take care of after that is validation of the packages. We can always do it internally: we take various packages, validate them, and certify that it's done. Or the other strategy is to purchase validated packages from a vendor or vendors and consider those validated. And then the final thing is really the work that is being done: how do people build the R Shiny apps, how do we maintain them, how do we keep an audit trail? That is the big thing here, and that is really the different SOPs, or operating procedures, which we have to define.
And then we have to validate everything, in the sense that we have to define each and every procedure, test it out, and certify it as validated. Once we accomplish all of these things, then we can say the environment is truly validated.

Maybe one for Maya that builds on that. Someone was asking about testing the releases of the parent packages for the implementations of the app within a study or project. Yeah, I think we both kind of touched on this. We have those monthly release cadences that we scope our features to, and in that UAT week we test the new functionality as well as backwards compatibility. So, to take us back: every change to the application gets pushed to dev, and those changes are reviewed as a singular ticket or task. Then the entire suite of features for that release is tested with integration testing in UAT, so you have kind of a double layer of defense there. And the way we test that, Ryan, correct me if I'm wrong, is with prior studies that use the AE application as well as the ones that are on the roadmap. And within those studies, when deploying a new app, when the statistical programmers run the create-app function, they're also going to do some QC and go through that checklist that Ryan was mentioning. So we have a lot of layers of defense here.

Sri, I think people are interested in the sponsorship, or the spark, for the modernization initiative before the project even got started, you know, two or three years ago, and strategies to help with that.