Resources

Data Science Hangout | Satish Murthy, Janssen | Creating a validated environment for reproducibility

We were recently joined by Satish J. Murthy, Senior Manager, Pharma R&D IT at Janssen. During the hangout we dove into the topic of validation and what it actually means to have a validated environment.

Snippets from our conversation with Satish:

In regulated industries like pharma and finance, any time a health regulatory agency comes in and asks for proof, we have to be comfortable saying that what we ran a couple of days ago, weeks ago, months ago, or years ago, we are in a position to replicate that entire environment as is. This goes through a formal validation/verification process, and such a platform, in traditional parlance, is known as a GxP platform. [GxP is a general abbreviation for the "good practice" quality guidelines and regulations; the x stands for clinical practice, manufacturing practice, lab practice, and so on.] A verified, validated environment is therefore often referred to as a GxP environment. It is there to build confidence for the end users we serve in pharma, the folks who help deliver patient care, which is crucial.

One of the key components of this is containerization. Because the ask in validation is to ensure repeatability and reproducibility, the only logical choice at this time is containers. We have to containerize. I have responsibility for the R part of it, so I'm going to concentrate a little more on that.

First: users want to run through their studies using R. We are using Posit Workbench as our launch pad, deployed on a traditional EC2 instance. It is a production platform, but there is a logical grouping where our users first test in a development area and then move into the traditional QA and prod. To support that, the entire Posit Workbench is containerized, and we always try to keep up with the latest Posit release. There are some unique challenges around that, mostly about the IQ/OQ and the time it takes. It's not just Posit; there are other vendors [like Atorus] who are helping us with this effort. We have to go through a formal process of vetting, validating, and verifying, and all of this takes time. That's one challenge we have.

Second: because we have containerized, our users can test against something that is locked down and be assured that, from a repeatability perspective, it will behave the same every time; containers are the only way to get there. We first define the process and specify the requirements for what goes into the container. For example, our users will have identified a set of workflows to help with the clinical studies they are looking into; traditionally that maps to a set of R packages, the packages that need to make it into the container. Here we use standard industry tools: a Jenkins pipeline, plus Posit Package Manager to lock down the versions of the packages that go into the container. There are business processes that define what should go in. Once that is defined, we run the build process to create the container and verify it in Posit Workbench. Once our users are comfortable that yes, this is the workflow, this fits our model, and we don't see any issues, that container is locked.

It then moves into the QA/prod area, and that is what is deemed a validated run. Anything the users try out in their development environment is a prerequisite, but it is not deemed a validated run; only a run through production is deemed validated. To make that possible, all of these validated runs execute behind the scenes under a service account user, on a traditional EKS cluster where we scale up and launch the containers.

Thank you to the Atorus team for their help in planning today's session!

Resources shared during the call:

R Validation Hub: https://www.pharmar.org/
riskmetric: https://pharmar.github.io/riskmetric/
Atorus Resources: https://www.atorusresearch.com/resources/the-power-of-a-trusted-partner/

To join future data science hangouts, add to your calendar here: pos.it/dsh (All are welcome! We'd love to see you!)

Jan 24, 2023
1h 1min


Transcript

This transcript was generated automatically and may contain errors.

Happy Thursday, everybody. Welcome to the Data Science Hangout. Hope you all are having a great week. If it is your first time joining us today, hello, nice to meet you. I'm Rachel. Each week we gather as a data science community to chat about data science leadership, the questions you're facing, and what's going on in the world of data science across different industries.

And so each week we feature a different data science leader as my co-host to help lead our discussion and answer questions from you all. So together, we're all dedicated to creating a welcoming environment for everyone. So I love when we can hear from everybody, no matter your level of experience or area of work or industry. Each week, it is totally okay to just listen in as well.

But there are also three ways to ask questions today and provide your own perspective on certain topics too. You can jump in by raising your hand here on Zoom. You can put questions into the Zoom chat, and feel free to put a little star next to one if you want me to read it out loud, say if you're in a coffee shop or something. And then we also have a Slido link where you can ask questions anonymously; our team should be sharing that in the chat in just a second.

But I do like to make sure I tell everybody up front that we share the recordings of each session to the Posit YouTube channel and the Data Science Hangout site, so you can always go back and rewatch or share with a friend.

I also want to remember to call this out: if you are hiring right now, feel free to share that in the chat. I never see that as spammy; I know sometimes people ask. It's so awesome to hear when roles are filled by people making connections here in the hangout.

Thank you so much, Satish, for joining us as the co-host today. Satish is Senior Manager, Pharma R&D IT at Janssen. And I also want to say a quick thank you to the Atorus team, who are one of our full service partners at Posit and actually helped me in scheduling today's session with Satish. So, Atorus team, feel free to wave and say hello to everybody, or say hi in the chat too.

But Satish, I'd love to have you introduce yourself and share a little bit about your role. Maybe also something you like to do outside of work too.

I'm Satish Murthy. I'm part of Janssen R&D IT. I have responsibility primarily for the R platform, R and soon Python as well. It's a data science platform that helps our scientists run reporting for clinical trials, whether it's top-line reporting or grid-based computing. It's an environment where traditional SAS is supported as well, but there is accelerated growth in folks moving to R at this time.

There is a small hiking spot near my house in Hillsborough; I live in New Jersey, near the Sourland Mountains. For those of you who are close by, that's where we hike often. I have a small dog, and with my two kids and my wife, we get out there anytime we can.

What's ahead for the team

So Satish, as we wait for questions to come in from everybody here, I'd love to hear what is something that you're excited about in the year ahead for your team?

Yeah, as I indicated earlier, we have the statistical programming and compute engine, what is called SPACE, inside Janssen to help our scientists. One of the things we are looking forward to is adding capabilities to this platform, because adoption is growing as folks start seeing the power of R. The open source community support has really been the big reason for us to adopt R, because we can always learn from others' experience as well.

So we are trying to see how we can grow the R use case and its adoption, number one, and at the same time enable other tools and technologies that our users are asking for. For example, for some of the folks who were wanting to leverage Python, we just added that capability as well.

One of the things that I'm proud of is that this is all in a validated environment. There are some challenges that come along with that, and we are partnering with the Atorus team to help us with the validation aspect of it. In any GxP platform, as folks might know, there are two parts. One is formal verification, which somebody traditional in IT like myself is responsible for, because it's crucial when we build a platform like this. It's absolutely essential that we do repeated installs just to prove that what we installed today is the same thing we can install tomorrow, because that increases user confidence.

The other aspect of it is validation: given what we have installed, are the users comfortable and confident that what they run today will give the same results later? That is the formal verification/validation process, and we are partnering with Atorus to build some of the capabilities to support us in that effort.

As we continue to add platforms, of course, we have to keep up with technology and with what is out there. One of the things I'm looking forward to, if I have to wear a development hat for a moment, is that some of the platforms we are running are on RHEL 7. With RHEL 7 sunsetting, what do we need to do to build on RHEL 8 platforms? What are the tools that are available?

So that is what I'm looking forward to from that perspective, because it's an immediate challenge, as folks might know. RHEL 7 is going out of support by the end of this year, with an additional six months of extended support formally from Red Hat, but beyond that we have to think about something else. This is a little more tactical in nature, but we have to go through the entire process of migrating to a new platform.

What does a validated environment mean?

I was just going to ask: for someone like myself who isn't in the pharma industry and is not familiar with the admin side of things, what does having a validated environment really mean?

Good question. A validated environment to me is this: pharma, and finance is another example, are regulated industries, correct? What that means is, anytime a health regulatory agency comes and asks for proof, we have to be comfortable in saying that what we ran a couple of days ago, weeks ago, months ago, or years ago, we are in a position to replicate that entire environment as is.

So, this goes through a formal validation slash verification process, and this, in traditional parlance, is known as a GxP platform. GxP stands for "good practice"; the x will stand for clinical practice, manufacturing practice, lab practice, and so on. So the verified, validated environment is often referred to as a GxP environment. It is there to build confidence for the end users that we serve in pharma, folks who want to help deliver patient care, which is crucial. That is what the validated environment is, Rachel.

I did see a few anonymous questions starting to come in, so I want to ask one of those. Somebody had asked: when talking about R, we hear a lot from a data science or engineering perspective; what do you think the differences in thinking are from an IT perspective?

From an IT perspective, yeah, good question. So, the one advantage that we have with R is, as I mentioned earlier, the power of open source where folks are contributing to the R ecosystem as such. So, there are more and more folks who are contributing to it. It could be, you know, at the end of the day, it's a traditional R package that gets developed.

And often what we are seeing, from an IT perspective, is that there are some challenges in reproducing this platform as is. Sometimes the frustration our users have is: hey, it worked yesterday; I tried the exact same thing today, so why am I having issues? And there are various reasons. We might be using version one of one particular package, but underneath, it will go and pull certain dependencies, which can be a function of time.

But there are tools available today to help us with that. In other words, there is an R package that folks are familiar with, renv as an example, where you can lock the versions. We are also using Posit Package Manager to help us with snapshotting capabilities. And this is where IT can step in and implement these tools, so that users can be confident that what they did today, they will be able to reproduce in the same environment tomorrow.
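To make the version-locking idea concrete, here is a minimal sketch assuming an renv-based project; the snapshot date and repository URL are illustrative, not Janssen's actual configuration.

```r
# Point installs at a frozen, dated Package Manager snapshot so the versions
# you get are a function of the snapshot, not of when you run the install:
options(repos = c(CRAN = "https://packagemanager.posit.co/cran/2023-01-24"))

renv::init()      # set up a project-local library
renv::snapshot()  # pin the exact package versions in renv.lock

# Later, or on another machine, rebuild the identical environment:
renv::restore()   # reinstall exactly the versions recorded in renv.lock
```

Committing renv.lock alongside the analysis code is what lets "what we ran months ago" be rebuilt as is.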

Cloud infrastructure and AWS

So Bill had asked, do you have SAS and R running on the same validated server? Is that server in the cloud? And is that managed by an external service provider?

One, we have completely adopted the cloud, at least inside Janssen. Most of what we do is with a cloud provider, and we have a huge investment in AWS. We partner with AWS as and when we want new capabilities; we work with them to see where they can help us implement certain services that we need.

Now, AWS might be ready, but our technology services team wants to ensure, because it's in the cloud, that several security policies are in place. So we do have to go through them to ensure that the services offered by any cloud service provider are blessed by our security team. That's how we work.

So we run everything in AWS. I don't want to call it one server hosting SAS and R, because at the end of the day, from a tactical perspective, these are different assets. But it is all part of the same ecosystem, as I mentioned earlier: the statistical programming and compute environment, which is a validated platform. Think of it as a platform consisting of multiple components. SAS is one of them; R is another. 100% validated at this time.

Containerization and the build process

So I mentioned AWS, but one of the key components of this is containerization, Eric, as you rightfully pointed out. Yeah, I was hoping somebody would ask that question. Everything here is in a validated environment, and because the ask is to ensure repeatability and reproducibility, the only logical choice at this time is containers. We try to containerize. Well, we have to. It's not that we need to; we have to containerize.

So, the SPACE platform: I have responsibility for the R part of it, so I'm going to concentrate a little more there. One, users want to run through their studies using R, and how we make that possible is the question. We are using Posit Workbench as our launch pad, deployed on a traditional EC2 instance.

First of all, this is a production platform, but there is a logical grouping where our users first come and test in a development area and then move into the traditional QA and prod. The way they first test in the development area is, number one, the entire RStudio/Posit Workbench is containerized. We always try to keep up with the latest release of Posit; I think we are just one release behind at this time.

And there are some unique challenges around that. The challenges are mostly about the IQ/OQ and the time that it takes. For example, it's not just Posit; there are other vendors who are helping us with this effort. We have to go through a formal process of vetting, validating, and verifying. All of this takes time, so typically we lag behind. That's one challenge that we have.

Number two, because we have containerized, it gives our users the capability to test something that is locked down and be assured that, again, from a repeatability perspective, containers are the only way that will get them there. So first, we go through defining the process and specifying the requirements as to what really needs to go into the container.

For example, our users will have identified a set of workflows that they want to help them with the clinical studies they are looking into. If I have to wear a tactical hat, that eventually boils down to a set of R packages, the R packages that need to make it into this container. This is where we pretty much use standard technologies and tools from the industry: we have a Jenkins pipeline, and we use Posit Package Manager to lock down the versions of the R packages that we want to go into the container.

And there are business processes that define what should go into this container in terms of R packages. Once we have that defined, we run through the build process to create the container, and we verify it in Posit Workbench. Once our users are comfortable that, yes, this is the workflow, this fits our model, and we don't see any issues, that container is then locked and moves into the QA/prod area. And that is what is deemed a validated run.

Anything that the users try out in their development environment is a prerequisite, but it is not deemed a validated run. It is only when it is run through production that it is deemed validated. In order to do that, all of these validated runs behind the scenes run as a service account user, because we want to ensure that no particular user runs it in their own environment; it runs under a service account. Behind the scenes, we also have the traditional EKS cluster where we scale up and launch these containers.
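As a rough illustration of the build step described above, the container build can run an R script that installs the locked package set from a dated snapshot and then verifies what actually landed in the image. This is a hedged sketch; the package list, snapshot date, and paths are assumptions, not Janssen's actual configuration.

```r
# Hypothetical install-and-verify script run during the container build.
repo <- "https://packagemanager.posit.co/cran/2023-01-24"  # frozen snapshot
pkgs <- c("dplyr", "ggplot2", "survival")                  # locked package set

install.packages(pkgs, repos = repo)

# Verify the install: fail the build immediately if anything is missing.
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) {
  stop("Packages failed to install: ", paste(missing, collapse = ", "))
}

# Record exactly what went into the image, as evidence for the validation report.
manifest <- installed.packages()[pkgs, c("Package", "Version")]
write.csv(manifest, "/opt/validation/package-manifest.csv", row.names = FALSE)
```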

The role of Atorus in validation

One part of the question earlier that we had missed, and that I think would be really helpful to cover as well, is: what role does a partner like Atorus play in facilitating this?

Okay, we pretty much rely on Atorus. We are partnering with Atorus mainly for the validation aspect of it, Rachel. What I mean by that is, for all of the R packages that we want to qualify, Atorus is helping us write the IQs and OQs that need to run. For example, when we build the container, yes, we have a list of R packages, but how do we ensure that they fit the users' requirements?

So, one advantage, one unique thing that we have done here, is joint collaboration: when we build the container, we automatically run through the IQ/OQ steps that have been written by the Atorus team. The advantage is that during the container build process, one, we are verifying, as I mentioned earlier, that these are the R packages that get installed; but at the same time, scripts run during the build that execute these IQ/OQ steps, and at the end of the day it generates a validation report where we go through each and every IQ step that has been written.

It could be: install package A, then ensure that all of the functionality within that package fits our requirements, and this is where the OQ part helps. So we run through the build process, and at the end of the day we inspect the validation report. We take a look: hey, did all of the test cases pass? If one failed, what was the reason? Is it because of an underlying system dependency, or because some other additional package that we forgot needs to be installed, or simply because there was a bug somewhere? And we go through this process; it's iterative. This is where Atorus is building the entire IQ/OQ part of it and taking responsibility for the validation aspect.
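To make that flow concrete, here is a hedged sketch of what an automated IQ/OQ step run during the build might look like. It is not Atorus's actual tooling; the package list, directory layout, and report location are assumptions.

```r
library(testthat)

pkgs   <- c("dplyr", "survival")        # illustrative qualified package set
oq_dir <- "/opt/validation/oq-tests"    # assumed home of the written OQ test scripts

run_iq_oq <- function(pkg) {
  # IQ: does the package load from the image at all?
  iq_pass <- requireNamespace(pkg, quietly = TRUE)
  # OQ: run the functional test scripts written for this package.
  oq_pass <- tryCatch({
    test_dir(file.path(oq_dir, pkg), stop_on_failure = TRUE)
    TRUE
  }, error = function(e) FALSE)
  data.frame(package = pkg, iq_pass = iq_pass, oq_pass = oq_pass)
}

report <- do.call(rbind, lapply(pkgs, run_iq_oq))
write.csv(report, "/opt/validation/validation-report.csv", row.names = FALSE)

# Fail the build if anything failed, so an unqualified container is never locked.
stopifnot(all(report$iq_pass), all(report$oq_pass))
```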

Yeah, I would just say it's been working with Satish from the beginning, getting the design of the entire pipeline worked out so that it could all fit within the container build; that was an upfront requirement of what he wanted to see. It's been a close relationship with Satish and the business unit to make sure we can check the right boxes for everything they want to do. And we've ended up with a nice, fully automated pipeline that has proved fairly resilient as we've gone through updates and continued to expand and improve the environment. It's been a combination of internally developed packages within the Janssen team, focused on some of their environment workflows, along with the other packages in use within the environment. So a lot of different components have come together to make this product for the team to deploy within their SPACE environment.

Handling requests for new open source tools

Satish, as data scientists and statisticians request support for open source tools that are new to their organization, how do you think about this? So thinking through tech debt or other issues, they're trying to understand your IT's perspective from the requests that you receive.

So, one part of it is fairly easy: what do you need this for, and why do you need it? That's the simpler question, because there will always be a convincing argument from the users: I need it to solve this particular problem. However, the other part of it is, as I mentioned earlier, we have to go through a security process. During the build process I described, behind the scenes, the security team comes in and scans the container, triggering a security validation report that checks for vulnerabilities.

And then they will flag it. If it's a showstopper, we have to address it, no questions asked. It could be a simple matter of installing a newer version of a system dependency. And if our validation fails for that reason, then we will somehow have to ensure that we pass all the checks and balances. So when a request for a new tool comes in, we also have to check with our internal security team to see if they have any concerns.

Most of the time it's okay. But sometimes they will alert us: hey, do you know there is a vulnerability risk with this? Here is the reason why we are going to stop it. Most often, they will also suggest a workaround or a way to address it. But when a request for a new tool or a new R package comes in, we have to go through the security checks just to ensure we don't expose ourselves to any vulnerability. Yes, it is true that things sit behind a firewall. Yes, it is true that everything runs inside a container, where the chances are a lot lower than in a traditional data center. But we do have to ensure that we address all of the security risks and challenges.

Getting regulatory and quality teams on board

I wonder if you could speak a little to what it took between the IT team, the clinical team, and the users to bring them along to these new tools and platforms. I've got to think there was a lot of change management there. And I'm wondering if there were any critical things that made the difference in getting regulatory on board with saying, okay, we're comfortable speaking to these systems in an audit, for example.

So you mentioned users, and I think you also mentioned regulatory agencies, as well as our internal validation team, the TQ team, all of that. Let me see if I can address that in bits and pieces. One, the users were on board, because, number one, as I said, they also want to learn from others' experience. Mike mentioned the internally developed R packages; they are also contributing to open source at this time. So there is give and take, learning from each other.

So for the users, it was pretty easy and they were the ones who were saying, hey, we want this because this is going to help us get to where we want simply because of the open source contribution. And because things were much faster when it came to R.

When it came to the TQ team, our internal quality team, right? Number one, they wanted to ensure we were putting a check mark against everything they asked for. This was new for them, and we had to sit with them. For example, on the Docker build process I mentioned earlier, they had several questions. We had to prove to them that, no matter what, if I run a build today and come back and run it again tomorrow, except for the timestamp of when it was built, it goes through the same process; no matter how many times we rebuild, we are going to get the exact same result.

Yes, there were some challenges along the way. It wasn't easy, because we were also new to it, but we were able to prove it to them. And it required, I want to say, weeks, months of effort to convince them, because when you are in the vanguard, the onus is on you to prove that what you are doing is not violating well-established norms and practices. So it took us some time.

So once they were convinced that the tools and technologies we had used conformed to their requirements, they gave their seal of approval. Of course, they were also learning in this process, because with cloud technology they'll sometimes ask questions like: what is your failback mechanism? What happens if things go down? And, as I mentioned earlier, we are all on AWS, so we had to tell them we have the power of AWS behind us. We wouldn't want to worry about something going down or a catastrophic failure, but we still had to plan for it. We had to plan for it.

And then even something as simple as disaster recovery: how do you recover from that? It's not as if you can go and build something immediately in another data center, because we are locked to a particular AWS region, us-east-1, the Northern Virginia region, which is extremely popular with most folks. So we had to convince them that, yes, there are some well-founded fears, but no, we don't have to be totally paranoid about them.

As for the question about the health regulatory agencies and how we convinced them, I'm going to answer this in a slightly different way. If you talk to any health agency, they're not going to say you have to restrict yourself to R or SAS, or this is bad and that is better. We just have to prove to them that, whenever they come and ask for evidence, we will be able to show the exact same results for the exact same study.

Scaling and load testing

Is R capable of running production environments with large-scale data pipelines? Do you test that capability while validating a system?

Large scale, right? "Large" is probably going to be a relative term here, Rachel. Many a time we rely on our users to tell us what that is. Yes, we did see some challenges initially, and we are working through them at this time, because there might be an odd scenario where we want to ensure the maximum load can be supported. We are working with the Posit team as well to help us with that, and I feel very positive that we should be able to address some of those issues.

Because at the end of the day, these requirements have to come from our users, and they will define them: we want to support, I'm going to throw out a number, 100 concurrent users, or 400 or 500 as the case may be, and we want to be in a position to support that. Sometimes, as folks who are familiar with EKS know, when we start scaling up there are some delays. That can lead to frustration for users: instead of seeing things respond in a second or two, if it takes even 30 seconds, that is way too much for them. What can we do to help alleviate that?

There are certain tools and technologies that we are still evaluating at this time. For those of you who are familiar and comfortable with EKS, Karpenter, to help us with autoscaling, is something we are evaluating right now. Again, as I said, everything has to be blessed by our security team and our internal compliance team as well, because even what might be regarded as a small change on the platform, since these are validated platforms, is subjected to extensive scrutiny, mainly in terms of documentation. Sometimes it takes a little longer than we would want.

Investing in new platforms and cloud costs

Do you have any tips about working internally with your peers and leadership to invest in moving towards new platforms?

One of the things we have to be cognizant of is the cost aspect, right? When we do things in a cloud environment, we absolutely need our leadership's support. With in-house data centers and in-house equipment, there is a finite cost associated with it; whether it is the initial capital cost or the maintenance cost, we know what it is and we can always project it. The challenge with AWS, or with any cloud service provider, is that you cannot predict those charges up front, because we generally pay after the fact.

Take, for example, the load testing we talked about earlier: how far can we go, how much load can we test? If we start testing, we will automatically start scaling up without realizing it, and then at the end of the month, when we are handed the bill: what, did I really incur that much cost? They will tell us here is the reason why. That is a unique challenge.

But on the other side of it, consider the post-COVID world, where we were completely shielded because none of us ever came to a J&J location. We were pretty much remote, and we were 100% functional, all because we were in the cloud. So there are certain benefits as well. For the cost part of it, you have to have management support. And there is a huge advantage: say something happens to an instance in your data center; somebody has to go and physically check it, pull out a hard disk, replace it, and all of that costs substantial dollars. With any of the cloud providers, you don't have that challenge. Something happened? You stop the instance and start it back up, because they give you all the tools to be back up and running in no time.

Balancing speed and validation rigor

It's nice to hear about another big pharma company using containers and the qualification process. And from my perspective, it's fantastic to be able to qualify once and then deploy and use it in many, many places; it's one of the big strengths of this approach. But how do you balance the slow and steady approach of qualification, deployment, keeping things nicely documented and steady, against colleagues who want the latest and greatest: give me it tomorrow, I've got an urgent deadline?

Yeah, so let me talk about it from a validated platform perspective, because I think that's relevant. Yes, we do get asked that question. For example, one of the more recent studies required us to install a new package that needed some system dependencies, and there is no instantaneous answer. Sadly, when you have to containerize, you have to weigh the long-term prospects: yes, there is short-term pain, but there is long-term gain. And yes, it does frustrate our users sometimes, because we cannot react to it immediately; when you go through this build process, you will encounter challenges, trust me.

But where I see a silver lining is, we have incredible partners, both from Posit and from Atorus, who are helping Janssen with that approach. Anytime we have a challenge, for example when we deployed our Workbench; due to the nature of this, every organization will have its own unique challenges, I promise you that, every organization. One thing we had done internally concerned how we were treating our clinical data in terms of Active Directory groups, which is a little more tactical. When we realized the initial deployment of Workbench did not work out, they were able to quickly come out with a release and an update saying, here is how we are going to help you get there. So we were able to deploy a new version.

There, it wasn't a question of if, it was a question of when we wanted it, and we were able to adopt it fairly quickly. But for the process it takes to get there, documenting each and every thing, there is no shortcut. I'm afraid there is absolutely no shortcut, because our TQ team wants proof: show us evidence of what you did. So there is extensive IQ/OQ that we go through as well.

Yeah, there is absolutely no shortcut, but we always have to do the right thing, as everybody says: yes, there is short-term pain, but there is long-term gain there.

I would also just add that there's a completely separate business unit consideration here as well, because, Mike, with you at Pfizer, you're also a group that has a large population of SAS users, and the target group this is going out to at Janssen is that SAS business unit. So Satish is just one part of this. Our adoption effort also goes into the packages; tidytlg was just released open source, and it was designed to support Janssen standards within that group. But then you have all of these business users who have gone through SAS training to get on board, and if you start changing everything underfoot without bringing them along, or too quickly, that will upset the business users in a different way. So there's definitely a balance to strike from an organizational standpoint.

I think the kind of folks involved in reporting clinical trials data want things nice, controlled, well-managed, slow-changing, predictable. It's other groups who have more rapid turnaround, and the statisticians supporting those groups in some senses, who want the latest and greatest: this has just emerged on CRAN, now I want it because it solves my problem. But like Satish is saying, the process to simply fold that into the existing framework is not as simple as many people think.

Yeah, it never is. It will take some time to get there. But the one thing the users absolutely want is stability. We don't want to introduce changes that will potentially break things; that's a red flag, a no-no. This is where we spend a careful, substantial amount of time ensuring backward compatibility, ensuring we don't break things. And all of this, at the end of the day, depends on how much IQ/OQ you are willing to go through, how much effort you are going to put into it. I think it is well worth it.

Containerization at the repository level and security constraints

I'm interested, Satish, whether you've been keeping an eye on things like GitHub Codespaces, which gives you, at a repository level, the ability to containerize its environment. You can do that without GitHub Codespaces as well, I suppose. But I'm curious what your opinion is on those analysts or statisticians who want to be on the cutting edge for a particular project and try to take matters into their own hands, if they have a little know-how about how containers work with Docker and the like, compared to what you're trying to do with this broader effort, which is obviously hugely valuable too.

Yeah, one thing that I may have forgotten to mention is that even for the containers that get deployed internally, our technology services team tells us what we can use as a base, right? In other words, we cannot just go and pull anything from Docker Hub. It has to be validated again, because they want to scan it. So there are some, I want to say, well-defined requirements; I don't want to call them constraints. It could be as simple as: you have to use this version of the CentOS 7 container, just as an example. We can't simply go and use Ubuntu unless it is supported inside our J&J ecosystem. And I'm sure folks are familiar with Artifactory; Artifactory is where we host our container images and where we also pull certain base images from.

So to answer your question, Eric, we cannot go and pull directly from GitHub. To answer the second part: say, for example, you were to point me to some GitHub repo and ask, can you support this? The short answer is maybe. And the reason I say maybe is that we can go take a look at it, see what software modules get installed as part of the container, see what we could use immediately, or whether it requires substitution with some other equivalent package, in which case we will go download and install that. All of that, so long as it is blessed by our technology services team, we will be able to do.

So the same challenge exists for our users as well; they have to rely on somebody in IT to help build that for them. Yes, the pipelines we are deploying, for example, might refer to the internally developed R packages as well. We are now establishing pipelines where users from the business partners we work with can develop an R package, and by simply copying over the basic necessities of the pipeline (we use Jenkins; we call it the Jenkins Pipeline Manager), they take that part of the infrastructure code, replicate it, and then work on what they are best qualified to do: the R package itself. They don't have to worry about the mundane aspects of, oh, how do I build it, how do I lint it, because all of that is streamlined to the point where linting and unit testing run automatically when an R package is built. So we help them by deploying those pipelines as well.
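As a hedged sketch of the R commands such a pipeline stage might run (the actual Jenkins Pipeline Manager steps are internal to Janssen, so treat every path and choice here as illustrative):

```r
# Lint the package source checked out by the pipeline; fail the stage on findings.
lints <- lintr::lint_package(".")
if (length(lints) > 0) {
  print(lints)
  stop("Linting failed with ", length(lints), " issue(s)")
}

# Run the package's unit tests and R CMD check; an error fails the stage.
rcmdcheck::rcmdcheck(".", args = "--no-manual", error_on = "error")

# Build the tar.gz artifact that would then be published to the internal repository.
pkgbuild::build(".", dest_path = "artifacts")
```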

Education and adoption

I think the big part from our side is really education. As Eric kind of mentioned as well, a lot of the users don't really care, but you have to help them understand that this is difficult, right? It's not just: I want this, let's have it. We have to consider accuracy, reproducibility, and traceability. Most important is that our results are correct, and they understand that. So that's what we use to make the case: okay, we have to validate this, here's what it takes to do it, and it's a significant effort.

So that takes a little time, but then it's really use case driven. Sometimes people ask for things just because they want them, but when you ask, well, what do you want to use it for?, you start getting details and find they don't really need it. So having that use-case-driven validation helps: what do you need it for? Why do you need it? Is it for production, or is it something you could just use in an exploratory environment? Do we really have to validate this? That shortens the list a lot.

And then once we get to: okay, we get it, you need it; we don't have the expertise as the team overseeing it, so you're now our de facto expert on this. Put some skin in the game, help us out with it. Sometimes they're willing to do that, and sometimes they say, well, I see the stats package is already validated and approved for use; let me just use that instead. And again, that helps shorten the list.

Open source IQ/OQ pipelines and unit testing

Are there any open source IQ, OQ pipelines for R?

Yeah, the source of my question was that for many use cases in clinical research like ours, it would seem the IQ/OQ need not be overly complicated, because we just want to ensure that R was properly installed, that dplyr and dbplyr pipelines can run, and that it can run a regression and a time-to-event analysis without trouble. We're not doing anything fancy. Couldn't I just do some minimal set of checks, for which someone might write a reasonable starting point in the open source realm: has R actually installed properly, and do these three or four packages do the thing they're supposed to do? Just curious what a minimum viable product for that might look like. But this is, of course, Atorus's realm of expertise.

Yeah, and just to note, one of the things we've really taken guidance from is the work done in the R Validation Hub to start defining what a risk-based approach to validation looks like. For things like a lot of base R and a lot of the tidyverse-maintained packages, those are (Nick, what is the term we landed on? "accepted") packages, where you can run the package unit tests and accept that. But again, part of this is also documenting the evidence that that was done, so you can pass it through. Satish's team gets the report so we can log it into the QA systems, so you have the evidence and traceability that it was done.
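For reference, the riskmetric package linked in the resources above is the R Validation Hub's implementation of this risk-based assessment. Its documented basic workflow looks like this (dplyr is just a stand-in package to score):

```r
# Score a package's risk from evidence such as unit test coverage, documentation,
# download history, and maintenance activity.
library(dplyr)
library(riskmetric)

pkg_ref("dplyr") %>%   # reference the package to assess
  pkg_assess() %>%     # collect the individual pieces of evidence
  pkg_score()          # roll them up into per-metric risk scores
```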

But then, like Nick says, when you get into other things, especially the more statistical realm, you need use-case-driven approaches around what you're looking for and how you can defend the integrity of it. And my comment back to you would be: you're saying that it works, but what does "work" mean? What is the baseline you're assessing that against? There are a lot of questions you need to go in and investigate and answer so you can defend yourself, because all of this comes back to what happens if you have to undergo an audit. What if a regulatory agency comes back and starts probing the questions of how you're verifying what you're doing?

So that's the worst-case scenario: something gets turned around because the integrity of your system was questioned. And pharma, I think, is a lot of the time extremely risk averse, just because of the investment that goes into getting where things are. Statistical programming is an extremely small window in the larger scope of everything that's going on, and no one wants to be the person who becomes a showstopper for something; even delays can be extremely costly.

But yeah, we're working in an extremely risk averse industry.

That's a fascinating answer, especially regarding the unit tests. It hadn't occurred to me, but it's probably a good message to developers: write good unit tests, and there are almost unintended but fortunate downstream consequences of doing so. You could just write out that report from, for example, the tidyverse unit tests, which I know are comprehensive and good. That's awesome.

Yeah, and one of the things that at least our technology services team is pushing for is to ensure that every R package goes through some static analysis: okay, here's what's being deployed; did it go through all the checks and balances? Sometimes that's possible and sometimes it's not, and it depends on who you're talking to. For the users, it's: does it work for my particular use case? Go deploy the package. Then somebody else comes in later and finds out it's causing a conflict. That's a red flag.

Sometimes security might be looking at it from a slightly different perspective. They don't care about the functionality; they care about the security aspect of it. So if it passes all the vulnerability checks, you are good to go from their perspective. But I go back to the comment you had earlier, Travis: hey, I just installed it, it worked, I'm going to go use it. There is a danger there, as Mike pointed out. I've personally seen at least one effort, I don't want to say fail, but reach a point where nobody wants to use it, because IT took the approach of: we're going to build this and do a minimal qualification; if R installed, good, it passed our check.

Then they said it's on the users, like Nick and others, to go install the R packages they want. And then they suddenly realized that, as everybody knows, in the R ecosystem most packages are ones users can install themselves, but there is a small subset that requires a unique dependency; sometimes that's easy, but sometimes it requires real handling. And that's where the users say: I don't have this expertise, and I'm sorry, I just cannot use this. So there is a danger when you do minimum qualification. I, for one, certainly don't believe in it and wouldn't encourage it. We have to take our users' workflows and ensure that everything is tested accurately. As Mike pointed out, it's to make sure we can defend ourselves, God forbid there is an audit and we have to show evidence of what we ran a couple of years ago. Trust me, short-term pain leading to long-term gain is accurate here.

Package management and rebuilding containers

No, I was just curious, because I understand that the versioning and the environment and everything need to be in place for, say, critical packages or critical tests that are developed and deployed. But I'm working in the finance industry, and we also have Artifactory in place for, let's say, less critical convenience packages. I just wanted to ask whether you mix these, using Artifactory maybe as an internal package mirror for less critical packages, user convenience things and so on, or whether you keep everything on Posit Package Manager.

So we do not distinguish whether a particular package is less important or more important, Walter. We just don't take that approach. If a user wants a package, we go through the entire process of IQing and OQing it. Yes, it will be in Posit Package Manager; yes, we will snapshot it, and we will use the exact same snapshot to replicate it. But no, we do not say, hey, these are less important, so they won't go through any qualification. We go through the totality of it. And as Nick pointed out earlier, the question will always be: why do you need this package? Once that question is answered, from an IT perspective it really doesn't matter how important the package is.

For a validated environment, the short answer is that you just have to rebuild the container; there is no other alternative. However, the process is: if you need new versions of existing packages, then you just go through the basic IQ/OQ that was already written for those packages. But if there is new functionality that you want to adopt, yes, you will have to write new IQ/OQs, no questions asked. There is no shortcut here; you just have to rinse and repeat through this entire process.

Sure, and we go through this whenever we have to build a new container. From an IT perspective, right, we check with users like Nick first: hey, did it meet your requirements? They have their own acceptance testing as well. Then he'll come back and say, nope, something failed, I need you to look into it. And basically it's rinse and repeat, right? You go through the process all over again: go back to the drawing board, add the package, add any system dependencies, run through your IQ, release the container in development for users to test. And once they say, yep, this all looks good, that's when we, quote unquote, lock the container.
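One way to make the "rerun existing IQ/OQs for version bumps, write new ones for new packages" rule systematic is to diff the previous and proposed locked package sets before a rebuild. A hedged sketch, assuming renv-style lockfiles; the file names are hypothetical:

```r
# Compare two renv lockfiles to see what a container rebuild actually changes.
old <- jsonlite::fromJSON("renv.lock.old", simplifyVector = FALSE)$Packages
new <- jsonlite::fromJSON("renv.lock",     simplifyVector = FALSE)$Packages

added   <- setdiff(names(new), names(old))   # new packages: write new IQ/OQs
removed <- setdiff(names(old), names(new))
bumped  <- Filter(function(p) !identical(old[[p]]$Version, new[[p]]$Version),
                  intersect(names(old), names(new)))  # rerun existing IQ/OQs

cat("New packages (write IQ/OQ):", paste(added, collapse = ", "), "\n")
cat("Version bumps (rerun IQ/OQ):", paste(bumped, collapse = ", "), "\n")
cat("Removed packages:", paste(removed, collapse = ", "), "\n")
```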

I know we just passed the hour mark here and a few people had to jump off for other meetings. I just want to say thank you so much, Satish, for sharing your experience with us and also the Atorus team for joining. Satish, you are such a wealth of knowledge and it's been great getting to see more of the IT experience as well that we don't always get to talk about in these Hangouts. Thank you so much. My pleasure. Thank you for having me.