Resources

Data Science Hangout | Satish Murthy, Janssen | Creating a validated environment for reproducibility

We were recently joined by Satish J. Murthy, Senior Manager, Pharma R&D IT at Janssen. During the hangout we dove into the topic of validation and what it actually means to have a validated environment.

Snippets from our conversation with Satish:

In regulated industries like pharma and finance, any time a health regulatory agency comes in and asks for proof, we have to be comfortable saying that what we ran a couple of days ago, weeks ago, months ago, or years ago, we are in a position to replicate that entire environment as is. This goes through a formal validation/verification process, and such a platform, in traditional parlance, is known as a GxP platform. [GxP is a general abbreviation for the "good practice" quality guidelines and regulations; the x stands for clinical practice, manufacturing practice, lab practice, and so on.] A verified, validated environment is therefore often referred to as a GxP environment. It is there to build confidence for the end users we serve in pharma, the folks who help deliver patient care, which is crucial.

One of the key components of this is containerization. Because the ask in validation is to ensure repeatability and reproducibility, the only logical choice at this time is containers. We have to containerize. I have responsibility for the R part of it, so I'm going to concentrate a little more on that.

First: users want to run through their studies using R. We are using Posit Workbench as our launch pad, deployed on a traditional EC2 instance. It is a production platform, but there is a logical grouping where our users first test in a development area and then move into the traditional QA and prod. To support that, the entire Posit Workbench is containerized, and we always try to keep up with the latest Posit release. There are some unique challenges around that, mostly about the IQ/OQ and the time it takes. It's not just Posit; there are other vendors [like Atorus] who are helping us with this effort. We have to go through a formal process of vetting, validating, and verifying, and all of this takes time. That's one challenge we have.

Second: because we have containerized, our users can test against something that is locked down and be assured that, from a repeatability perspective, it will behave the same every time; containers are the only way to get there. We first define the process and specify the requirements for what goes into the container. For example, our users will have identified a set of workflows to help with the clinical studies they are looking into; traditionally that maps to a set of R packages, the packages that need to make it into the container. Here we use standard industry tools: a Jenkins pipeline, plus Posit Package Manager to lock down the versions of the packages that go into the container. There are business processes that define what should go in. Once that is defined, we run the build process to create the container and verify it in Posit Workbench. Once our users are comfortable that yes, this is the workflow, this fits our model, and we don't see any issues, that container is locked.

It then moves into the QA/prod area, and that is what is deemed a validated run. Anything the users try out in their development environment is a prerequisite, but it is not deemed a validated run; only a run through production is deemed validated. To make that possible, all of these validated runs execute behind the scenes under a service account user, on a traditional EKS cluster where we scale up and launch the containers.

Thank you to the Atorus team for their help in planning today's session!

Resources shared during the call:

R Validation Hub: https://www.pharmar.org/
riskmetric: https://pharmar.github.io/riskmetric/
Atorus Resources: https://www.atorusresearch.com/resources/the-power-of-a-trusted-partner/

To join future data science hangouts, add to your calendar here: pos.it/dsh (All are welcome! We'd love to see you!)

Jan 24, 2023
1h 1min


Transcript

This transcript was generated automatically and may contain errors.

Happy Thursday, everybody. Welcome to the Data Science Hangout. Hope you all are having a great week. If it is your first time joining us today, hello, nice to meet you. I'm Rachel. Each week we gather as a data science community to chat about data science leadership, the questions you're facing, and what's going on in the world of data science across different industries.

And so each week we feature a different data science leader as my co-host to help lead our discussion and answer questions from you all. So together, we're all dedicated to creating a welcoming environment for everyone. So I love when we can hear from everybody, no matter your level of experience or area of work or industry. Each week, it is totally okay to just listen in as well.

But there are also three ways to ask questions today and provide your own perspective on certain topics too. You can jump in by raising your hand here on Zoom. You can put questions into the Zoom chat, and feel free to put a little star next to one if you want me to read it out loud, say if you're in a coffee shop or something. And then we also have a Slido link where you can ask questions anonymously; our team should be sharing that in the chat in just a second.

But I do like to make sure I tell everybody up front that we share the recordings of each session to the Posit YouTube channel and the Data Science Hangout site, so you can always go back and rewatch or share with a friend.

I also want to remember to call this out: if you are hiring right now, feel free to share that in the chat. I never see that as spammy; I know sometimes people ask. It's so awesome to hear when roles are filled by people making connections here in the hangout.

Thank you so much, Satish, for joining us as the co-host today. Satish is Senior Manager, Pharma R&D IT at Janssen. And I also want to say a quick thank you to the Atorus team, who are one of our full service partners at Posit and actually helped me in scheduling today's session with Satish. So, Atorus team, feel free to wave and say hello to everybody, or say hi in the chat too.

But Satish, I'd love to have you introduce yourself and share a little bit about your role. Maybe also something you like to do outside of work too.

I'm Satish Murthy. I'm part of Janssen R&D IT. I have responsibility primarily for the R platform, R and soon Python as well. It's a data science platform that helps our scientists run reporting for clinical trials, whether it's top-line reporting or grid-based computing. It's an environment where traditional SAS is supported as well, but there is accelerated growth in folks moving to R at this time.

There is a small hiking spot near my house in Hillsborough; I live in New Jersey, near the Sourland Mountains. For those of you who are close by, that's where we hike often. I have a small dog, and with my two kids and my wife, we get out there anytime we can.

What's ahead for the team

So Satish, as we wait for questions to come in from everybody here, I'd love to hear what is something that you're excited about in the year ahead for your team?

Yeah, as I indicated earlier, we have the statistical programming and compute engine, what is called SPACE, inside Janssen to help our scientists. One of the things we are looking forward to is adding capabilities to this platform, because adoption is growing as folks start seeing the power of R. The open source community support has really been the big reason for us to adopt R, because we can always learn from others' experience as well.

So we are trying to see how we can grow the R use case and its adoption, number one, and at the same time enable other tools and technologies that our users are asking for. For example, for some of the folks who were wanting to leverage Python, we just added that capability as well.

One of the things that I'm proud of is that this is all in a validated environment. There are some challenges that come along with that, and we are partnering with the Atorus team to help us with the validation aspect of it. In any GxP platform, as folks might know, there are two parts. One is formal verification, which somebody traditional in IT like myself is responsible for, because it's crucial when we build a platform like this. It's absolutely essential that we do repeated installs just to prove that what we installed today is the same thing we can install tomorrow, because that increases user confidence.

The other aspect of it is validation: given what we have installed, are the users comfortable and confident that what they run today will give the same results later? That is the formal verification/validation process, and we are partnering with Atorus to build some of the capabilities to support us in that effort.

As we continue to add platforms, of course, we have to keep up with technology and with what is out there. One of the things I'm looking forward to, if I have to wear a development hat for a moment, is that some of the platforms we are running are on RHEL 7. With RHEL 7 sunsetting, what do we need to do to build on RHEL 8 platforms? What are the tools that are available?

So that is what I'm looking forward to from that perspective, because it's an immediate challenge, as folks might know. RHEL 7 is going out of support by the end of this year, with an additional six months of extended support formally from Red Hat, but beyond that we have to think about something else. This is a little more tactical in nature, but we have to go through the entire process of migrating to a new platform.

What does a validated environment mean?

I was just going to ask: for someone like myself who isn't in the pharma industry and is not familiar with the admin side of things, what does having a validated environment really mean?

Good question. A validated environment to me is this: pharma, and finance is another example, are regulated industries, correct? What that means is, anytime a health regulatory agency comes and asks for proof, we have to be comfortable in saying that what we ran a couple of days ago, weeks ago, months ago, or years ago, we are in a position to replicate that entire environment as is.

So, this goes through a formal validation slash verification process, and this, in traditional parlance, is known as a GxP platform. GxP stands for "good practice"; the x will stand for clinical practice, manufacturing practice, lab practice, and so on. So the verified, validated environment is often referred to as a GxP environment. It is there to build confidence for the end users that we serve in pharma, folks who want to help deliver patient care, which is crucial. That is what the validated environment is, Rachel.

I did see a few anonymous questions starting to come in, so I want to ask one of those. Somebody had asked: when talking about R, we hear a lot from a data science or engineering perspective; what do you think the differences in thinking are from an IT perspective?

From an IT perspective, yeah, good question. So, the one advantage that we have with R is, as I mentioned earlier, the power of open source where folks are contributing to the R ecosystem as such. So, there are more and more folks who are contributing to it. It could be, you know, at the end of the day, it's a traditional R package that gets developed.

And often what we are seeing, from an IT perspective, is that there are some challenges in reproducing this platform as is. Sometimes the frustration our users have is: hey, it worked yesterday; I tried the exact same thing today, so why am I having issues? And there are various reasons. We might be using version one of one particular package, but underneath, it will go and pull certain dependencies, which can be a function of time.

But there are tools available today to help us with that. In other words, there is an R package that folks are familiar with, renv as an example, where you can lock the versions. We are also using Posit Package Manager to help us with snapshotting capabilities. And this is where IT can step in and implement these tools, so that users can be confident that what they did today, they will be able to reproduce in the same environment tomorrow.
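To make the version-locking idea concrete, here is a minimal sketch assuming an renv-based project; the snapshot date and repository URL are illustrative, not Janssen's actual configuration.

```r
# Point installs at a frozen, dated Package Manager snapshot so the versions
# you get are a function of the snapshot, not of when you run the install:
options(repos = c(CRAN = "https://packagemanager.posit.co/cran/2023-01-24"))

renv::init()      # set up a project-local library
renv::snapshot()  # pin the exact package versions in renv.lock

# Later, or on another machine, rebuild the identical environment:
renv::restore()   # reinstall exactly the versions recorded in renv.lock
```

Committing renv.lock alongside the analysis code is what lets "what we ran months ago" be rebuilt as is.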

Cloud infrastructure and AWS

So Bill had asked, do you have SAS and R running on the same validated server? Is that server in the cloud? And is that managed by an external service provider?

One, we have completely adopted the cloud, at least inside Janssen. Most of what we do is with a cloud provider, and we have a huge investment in AWS. We partner with AWS as and when we want new capabilities; we work with them to see where they can help us implement certain services that we need.

Now, AWS might be ready, but our technology services team wants to ensure, because it's in the cloud, that several security policies are in place. So we do have to go through them to ensure that the services offered by any cloud service provider are blessed by our security team. That's how we work.

So we run everything in AWS. I don't want to call it one server hosting SAS and R, because at the end of the day, from a tactical perspective, these are different assets. But it is all part of the same ecosystem, as I mentioned earlier: the statistical programming and compute environment, which is a validated platform. Think of it as a platform consisting of multiple components. SAS is one of them; R is another. 100% validated at this time.

Containerization and the build process

So I mentioned AWS, but one of the key components of this is containerization, Eric, as you rightfully pointed out. Yeah, I was hoping somebody would ask that question. Everything here is in a validated environment, and because the ask is to ensure repeatability and reproducibility, the only logical choice at this time is containers. We try to containerize. Well, we have to. It's not that we need to; we have to containerize.

So, the SPACE platform: I have responsibility for the R part of it, so I'm going to concentrate a little more there. One, users want to run through their studies using R, and how we make that possible is the question. We are using Posit Workbench as our launch pad, deployed on a traditional EC2 instance.

First of all, this is a production platform, but there is a logical grouping where our users first come and test in a development area and then move into the traditional QA and prod. The way they first test in the development area is, number one, the entire RStudio/Posit Workbench is containerized. We always try to keep up with the latest release of Posit; I think we are just one release behind at this time.

And there are some unique challenges around that. The challenges are mostly about the IQ/OQ and the time that it takes. For example, it's not just Posit; there are other vendors who are helping us with this effort. We have to go through a formal process of vetting, validating, and verifying. All of this takes time, so typically we lag behind. That's one challenge that we have.

Number two, because we have containerized, it gives our users the capability to test something that is locked down and be assured that, again, from a repeatability perspective, containers are the only way that will get them there. So first, we go through defining the process and specifying the requirements as to what really needs to go into the container.

For example, our users will have identified a set of workflows that they want to help them with the clinical studies they are looking into. If I have to wear a tactical hat, that eventually boils down to a set of R packages, the R packages that need to make it into this container. This is where we pretty much use standard technologies and tools from the industry: we have a Jenkins pipeline, and we use Posit Package Manager to lock down the versions of the R packages that we want to go into the container.

And there are business processes that define what should go into this container in terms of R packages. Once we have that defined, we run through the build process to create the container, and we verify it in Posit Workbench. Once our users are comfortable that, yes, this is the workflow, this fits our model, and we don't see any issues, that container is then locked and moves into the QA/prod area. And that is what is deemed a validated run.

Anything that the users try out in their development environment is a prerequisite, but it is not deemed a validated run. It is only when it is run through production that it is deemed validated. In order to do that, all of these validated runs behind the scenes run as a service account user, because we want to ensure that no particular user runs it in their own environment; it runs under a service account. Behind the scenes, we also have the traditional EKS cluster where we scale up and launch these containers.
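As a rough illustration of the build step described above, the container build can run an R script that installs the locked package set from a dated snapshot and then verifies what actually landed in the image. This is a hedged sketch; the package list, snapshot date, and paths are assumptions, not Janssen's actual configuration.

```r
# Hypothetical install-and-verify script run during the container build.
repo <- "https://packagemanager.posit.co/cran/2023-01-24"  # frozen snapshot
pkgs <- c("dplyr", "ggplot2", "survival")                  # locked package set

install.packages(pkgs, repos = repo)

# Verify the install: fail the build immediately if anything is missing.
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) {
  stop("Packages failed to install: ", paste(missing, collapse = ", "))
}

# Record exactly what went into the image, as evidence for the validation report.
manifest <- installed.packages()[pkgs, c("Package", "Version")]
write.csv(manifest, "/opt/validation/package-manifest.csv", row.names = FALSE)
```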

The role of Atorus in validation

One part of the question earlier that we had missed, and that I think would be really helpful to cover as well, is: what role does a partner like Atorus play in facilitating this?

Okay, we pretty much rely on Atorus. We are partnering with Atorus mainly for the validation aspect of it, Rachel. What I mean by that is, for all of the R packages that we want to qualify, Atorus is helping us write the IQs and OQs that need to run. For example, when we build the container, yes, we have a list of R packages, but how do we ensure that they fit the users' requirements?

So, one advantage, one unique thing that we have done here, is joint collaboration: when we build the container, we automatically run through the IQ/OQ steps that have been written by the Atorus team. The advantage is that during the container build process, one, we are verifying, as I mentioned earlier, that these are the R packages that get installed; but at the same time, scripts run during the build that execute these IQ/OQ steps, and at the end of the day it generates a validation report where we go through each and every IQ step that has been written.

It could be: install package A, then ensure that all of the functionality within that package fits our requirements, and this is where the OQ part helps. So we run through the build process, and at the end of the day we inspect the validation report. We take a look: hey, did all of the test cases pass? If one failed, what was the reason? Is it because of an underlying system dependency, or because some other additional package that we forgot needs to be installed, or simply because there was a bug somewhere? And we go through this process; it's iterative. This is where Atorus is building the entire IQ/OQ part of it and taking responsibility for the validation aspect.
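To make that flow concrete, here is a hedged sketch of what an automated IQ/OQ step run during the build might look like. It is not Atorus's actual tooling; the package list, directory layout, and report location are assumptions.

```r
library(testthat)

pkgs   <- c("dplyr", "survival")        # illustrative qualified package set
oq_dir <- "/opt/validation/oq-tests"    # assumed home of the written OQ test scripts

run_iq_oq <- function(pkg) {
  # IQ: does the package load from the image at all?
  iq_pass <- requireNamespace(pkg, quietly = TRUE)
  # OQ: run the functional test scripts written for this package.
  oq_pass <- tryCatch({
    test_dir(file.path(oq_dir, pkg), stop_on_failure = TRUE)
    TRUE
  }, error = function(e) FALSE)
  data.frame(package = pkg, iq_pass = iq_pass, oq_pass = oq_pass)
}

report <- do.call(rbind, lapply(pkgs, run_iq_oq))
write.csv(report, "/opt/validation/validation-report.csv", row.names = FALSE)

# Fail the build if anything failed, so an unqualified container is never locked.
stopifnot(all(report$iq_pass), all(report$oq_pass))
```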

Yeah, I would just say it's been working with Satish from the beginning, getting the design of the entire pipeline worked out so that it could all fit within the container build; that was an upfront requirement of what he wanted to see. It's been a close relationship with Satish and the business unit to make sure we can check the right boxes for everything they want to do. And we've ended up with a nice, fully automated pipeline that has proved fairly resilient as we've gone through updates and continued to expand and improve the environment. It's been a combination of internally developed packages within the Janssen team, focused on some of their environment workflows, along with the other packages in use within the environment. So a lot of different components have come together to make this product for the team to deploy within their SPACE environment.

Handling requests for new open source tools

Satish, as data scientists and statisticians request support for open source tools that are new to their organization, how do you think about this? So thinking through tech debt or other issues, they're trying to understand your IT's perspective from the requests that you receive.

So, one part of it is fairly easy: what do you need this for, and why do you need it? That's the simpler question, because there will always be a convincing argument from the users: I need it to solve this particular problem. However, the other part of it is, as I mentioned earlier, we have to go through a security process. During the build process I described, behind the scenes, the security team comes in and scans the container, triggering a security validation report that checks for vulnerabilities.

And then they will flag it. If it's a showstopper, we have to address it, no questions asked. It could be a simple matter of installing a newer version of a system dependency. And if our validation fails for that reason, then we will somehow have to ensure that we pass all the checks and balances. So when a request for a new tool comes in, we also have to check with our internal security team to see if they have any concerns.

Most of the time it's okay. But sometimes they will alert us: hey, do you know there is a vulnerability risk with this? Here is the reason why we are going to stop it. Most often, they will also suggest a workaround or a way to address it. But when a request for a new tool or a new R package comes in, we have to go through the security checks just to ensure we don't expose ourselves to any vulnerability. Yes, it is true that things sit behind a firewall. Yes, it is true that everything runs inside a container, where the chances are a lot lower than in a traditional data center. But we do have to ensure that we address all of the security risks and challenges.

Getting regulatory and quality teams on board

I wonder if you could speak a little to what it took between the IT team, the clinical team, and the users to bring them along to these new tools and platforms. I've got to think there was a lot of change management there. And I'm wondering if there were any critical things that made the difference in getting regulatory on board with saying, okay, we're comfortable speaking to these systems in an audit, for example.

So you mentioned users, and I think you also mentioned regulatory agencies, as well as our internal validation team, the TQ team, all of that. Let me see if I can address that in bits and pieces. One, the users were on board, because, number one, as I said, they also want to learn from others' experience. Mike mentioned the internally developed R packages; they are also contributing to open source at this time. So there is give and take, learning from each other.

So for the users, it was pretty easy and they were the ones who were saying, hey, we want this because this is going to help us get to where we want simply because of the open source contribution. And because things were much faster when it came to R.

When it came to the TQ team, our internal quality team, right? Number one, they wanted to ensure we were putting a check mark against everything they asked for. This was new for them, and we had to sit with them. For example, on the Docker build process I mentioned earlier, they had several questions. We had to prove to them that, no matter what, if I run a build today and come back and run it again tomorrow, except for the timestamp of when it was built, it goes through the same process; no matter how many times we rebuild, we are going to get the exact same result.

Yes, there were some challenges along the way. It wasn't easy, because we were also new to it, but we were able to prove it to them. And it required, I want to say, weeks, months of effort to convince them, because when you are in the vanguard, the onus is on you to prove that what you are doing is not violating well-established norms and practices. So it took us some time.

So once they were convinced that the tools and technologies we had used conformed to their requirements, they gave their seal of approval. Of course, they were also learning in this process, because with cloud technology they'll sometimes ask questions like: what is your failback mechanism? What happens if things go down? And, as I mentioned earlier, we are all on AWS, so we had to tell them we have the power of AWS behind us. We wouldn't want to worry about something going down or a catastrophic failure, but we still had to plan for it. We had to plan for it.

And then even something as simple as disaster recovery: how do you recover from that? It's not as if you can go and build something immediately in another data center, because we are locked to a particular AWS region, us-east-1, the Northern Virginia region, which is extremely popular with most folks. So we had to convince them that, yes, there are some well-founded fears, but no, we don't have to be totally paranoid about them.

As for the question about the health regulatory agencies and how we convinced them, I'm going to answer this in a slightly different way. If you talk to any health agency, they're not going to say you have to restrict yourself to R or SAS, or this is bad and that is better. We just have to prove to them that, whenever they come and ask for evidence, we will be able to show the exact same results for the exact same study.

Scaling and load testing

Is R capable of running production environments with large-scale data pipelines? Do you test that capability while validating a system?

Large scale, right? "Large" is probably going to be a relative term here, Rachel. Many a time we rely on our users to tell us what that is. Yes, we did see some challenges initially, and we are working through them at this time, because there might be an odd scenario where we want to ensure the maximum load can be supported. We are working with the Posit team as well to help us with that, and I feel very positive that we should be able to address some of those issues.

Because at the end of the day, these requirements have to come from our users, and they will define them: we want to support, I'm going to throw out a number, 100 concurrent users, or 400 or 500 as the case may be, and we want to be in a position to support that. Sometimes, as folks who are familiar with EKS know, when we start scaling up there are some delays. That can lead to frustration for users: instead of seeing things respond in a second or two, if it takes even 30 seconds, that is way too much for them. What can we do to help alleviate that?

There are certain tools and technologies that we are still evaluating at this time. For those of you who are familiar and comfortable with EKS, Karpenter, to help us with autoscaling, is something we are evaluating right now. Again, as I said, everything has to be blessed by our security team and our internal compliance team as well, because even what might be regarded as a small change on the platform, since these are validated platforms, is subjected to extensive scrutiny, mainly in terms of documentation. Sometimes it takes a little longer than we would want.

Investing in new platforms and cloud costs

Do you have any tips about working internally with your peers and leadership to invest in moving towards new platforms?

One of the things we have to be cognizant of is the cost aspect, right? When we do things in a cloud environment, we absolutely need our leadership's support. With in-house data centers and in-house equipment, there is a finite cost associated with it; whether it is the initial capital cost or the maintenance cost, we know what it is and we can always project it. The challenge with AWS, or with any cloud service provider, is that you cannot predict those charges up front, because we generally pay after the fact.

Take, for example, the load testing we talked about earlier: how far can we go, how much load can we test? If we start testing, we will automatically start scaling up without realizing it, and then at the end of the month, when we are handed the bill: what, did I really incur that much cost? They will tell us here is the reason why. That is a unique challenge.

But on the other side of it, consider the post-COVID world, where we were completely shielded because none of us ever came to a J&J location. We were pretty much remote, and we were 100% functional, all because we were in the cloud. So there are certain benefits as well. For the cost part of it, you have to have management support. And there is a huge advantage: say something happens to an instance in your data center; somebody has to go and physically check it, pull out a hard disk, replace it, and all of that costs substantial dollars. With any of the cloud providers, you don't have that challenge. Something happened? You stop the instance and start it back up, because they give you all the tools to be back up and running in no time.

Balancing speed and validation rigor

It's nice to hear about another big pharma company using containers and the qualification process. And from my perspective, it's fantastic to be able to qualify once and then deploy and use it in many, many places; it's one of the big strengths of this approach. But how do you balance the slow and steady approach of qualification, deployment, keeping things nicely documented and steady, against colleagues who want the latest and greatest: give me it tomorrow, I've got an urgent deadline?

Yeah, so let me talk about it from a validated platform perspective, because I think that's relevant. Yes, we do get asked that question. For example, one of the more recent studies required us to install a new package that needed some system dependencies, and there is no instantaneous answer. Sadly, when you have to containerize, you have to weigh the long-term prospects: yes, there is short-term pain, but there is long-term gain. And yes, it does frustrate our users sometimes, because we cannot react to it immediately; when you go through this build process, you will encounter challenges, trust me.

But where I see a silver lining is, we have incredible partners, both from Posit and from Atorus, who are helping Janssen with that approach. Anytime we have a challenge, for example when we deployed our Workbench; due to the nature of this, every organization will have its own unique challenges, I promise you that, every organization. One thing we had done internally concerned how we were treating our clinical data in terms of Active Directory groups, which is a little more tactical. When we realized the initial deployment of Workbench did not work out, they were able to quickly come out with a release and an update saying, here is how we are going to help you get there. So we were able to deploy a new version.

There, it wasn't a question of if, it was a question of when we wanted it, and we were able to adopt it fairly quickly. But for the process it takes to get there, documenting each and every thing, there is no shortcut. I'm afraid there is absolutely no shortcut, because our TQ team wants proof: show us evidence of what you did. So there is extensive IQ/OQ that we go through as well.

Yeah, there is absolutely no shortcut, but we always have to do the right thing, as everybody says: yes, there is short-term pain, but there is long-term gain there.

I would also just add that there's a completely separate business unit consideration here as well, because, Mike, with you at Pfizer, you're also a group that has a large population of SAS users, and the target group this is going out to at Janssen is that SAS business unit. So Satish is just one part of this. Our adoption effort also goes into the packages; tidytlg was just released open source, and it was designed to support Janssen standards within that group. But then you have all of these business users who have gone through SAS training to get on board, and if you start changing everything underfoot without bringing them along, or too quickly, that will upset the business users in a different way. So there's definitely a balance to strike from an organizational standpoint.

I think the kind of folks involved in reporting clinical trials data want things nice, controlled, well-managed, slow-changing, predictable. It's other groups who have more rapid turnaround, and the statisticians supporting those groups in some senses, who want the latest and greatest: this has just emerged on CRAN, now I want it because it solves my problem. But like Satish is saying, the process to simply fold that into the existing framework is not as simple as many people think.

Yeah, it never is. It will take some time to get there. But the one thing the users absolutely want is stability. We don't want to introduce changes that will potentially break things; that's a red flag, a no-no. This is where we spend a careful, substantial amount of time ensuring backward compatibility, ensuring we don't break things. And all of this, at the end of the day, depends on how much IQ/OQ you are willing to go through, how much effort you are going to put into it. I think it is well worth it.

Containerization at the repository level and security constraints

I'm interested, Satish, whether you've been keeping an eye on things like GitHub Codespaces, which gives you, at a repository level, the ability to containerize its environment. You can do that without GitHub Codespaces as well, I suppose. But I'm curious what your opinion is on those analysts or statisticians who want to be on the cutting edge for a particular project and try to take matters into their own hands, if they have a little know-how about how containers work with Docker and the like, compared to what you're trying to do with this broader effort, which is obviously hugely valuable too.

Yeah, one thing that I may have forgotten to mention is that even for the containers that get deployed internally, our technology services team tells us what we can use as a base, right? In other words, we cannot just go and pull anything from Docker Hub. It has to be validated again, because they want to scan it. So there are some, I want to say, well-defined requirements; I don't want to call them constraints. It could be as simple as: you have to use this version of the CentOS 7 container, just as an example. We can't simply go and use Ubuntu unless it is supported inside our J&J ecosystem. And I'm sure folks are familiar with Artifactory; Artifactory is where we host our container images and where we also pull certain base images from.

So to answer your question, Eric, we cannot go and pull directly from GitHub. To answer the second part: say, for example, you were to point me to some GitHub repo and ask, can you support this? The short answer is maybe. And the reason I say maybe is that we can go take a look at it, see what software modules get installed as part of the container, see what we could use immediately, or whether it requires substitution with some other equivalent package, in which case we will go download and install that. All of that, so long as it is blessed by our technology services team, we will be able to do.

So the same challenge exists for our users as well; they have to rely on somebody in IT to help build that for them. Yes, the pipelines we are deploying, for example, might refer to the internally developed R packages as well. We are now establishing pipelines where users from the business partners we work with can develop an R package, and by simply copying over the basic necessities of the pipeline (we use Jenkins; we call it the Jenkins Pipeline Manager), they take that part of the infrastructure code, replicate it, and then work on what they are best qualified to do: the R package itself. They don't have to worry about the mundane aspects of, oh, how do I build it, how do I lint it, because all of that is streamlined to the point where linting and unit testing run automatically when an R package is built. So we help them by deploying those pipelines as well.
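As a hedged sketch of the R commands such a pipeline stage might run (the actual Jenkins Pipeline Manager steps are internal to Janssen, so treat every path and choice here as illustrative):

```r
# Lint the package source checked out by the pipeline; fail the stage on findings.
lints <- lintr::lint_package(".")
if (length(lints) > 0) {
  print(lints)
  stop("Linting failed with ", length(lints), " issue(s)")
}

# Run the package's unit tests and R CMD check; an error fails the stage.
rcmdcheck::rcmdcheck(".", args = "--no-manual", error_on = "error")

# Build the tar.gz artifact that would then be published to the internal repository.
pkgbuild::build(".", dest_path = "artifacts")
```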

Education and adoption

I think the big part from our side is really education. As Eric kind of mentioned as well, a lot of the users don't really care, but you have to help them understand that this is difficult, right? It's not just: I want this, let's have it. We have to consider accuracy, reproducibility, and traceability. Most important is that our results are correct, and they understand that. So that's what we use to make the case: okay, we have to validate this, here's what it takes to do it, and it's a significant effort.

So that takes a little time, but then it's really use case driven. Sometimes people ask for things just because they want them, but when you ask, well, what do you want to use it for?, you start getting details and find they don't really need it. So having that use-case-driven validation helps: what do you need it for? Why do you need it? Is it for production, or is it something you could just use in an exploratory environment? Do we really have to validate this? That shortens the list a lot.

And then once we get to: okay, we get it, you need it; we don't have the expertise as the team overseeing it, so you're now our de facto expert on this. Put some skin in the game, help us out with it. Sometimes they're willing to do that, and sometimes they say, well, I see the stats package is already validated and approved for use; let me just use that instead. And again, that helps shorten the list.

Open source IQ/OQ pipelines and unit testing

Are there any open source IQ, OQ pipelines for R?

Yeah, the source of my question was that for many use cases in clinical research like ours, it would seem the IQ/OQ need not be overly complicated, because we just want to ensure that R was properly installed, that dplyr and dbplyr pipelines can run, and that it can run a regression and a time-to-event analysis without trouble. We're not doing anything fancy. Couldn't I just do some minimal set of checks, for which someone might write a reasonable starting point in the open source realm: has R actually installed properly, and do these three or four packages do the thing they're supposed to do? Just curious what a minimum viable product for that might look like. But this is, of course, Atorus's realm of expertise.

Yeah, and just to note, one of the things we've really taken guidance from is the work done in the R Validation Hub to start defining what a risk-based approach to validation looks like. For things like a lot of base R and a lot of the tidyverse-maintained packages, those are (Nick, what is the term we landed on? "accepted") packages, where you can run the package unit tests and accept that. But again, part of this is also documenting the evidence that that was done, so you can pass it through. Satish's team gets the report so we can log it into the QA systems, so you have the evidence and traceability that it was done.
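For reference, the riskmetric package linked in the resources above is the R Validation Hub's implementation of this risk-based assessment. Its documented basic workflow looks like this (dplyr is just a stand-in package to score):

```r
# Score a package's risk from evidence such as unit test coverage, documentation,
# download history, and maintenance activity.
library(dplyr)
library(riskmetric)

pkg_ref("dplyr") %>%   # reference the package to assess
  pkg_assess() %>%     # collect the individual pieces of evidence
  pkg_score()          # roll them up into per-metric risk scores
```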

But then, like Nick says, when you get into other things, especially the more statistical realm, you need use-case-driven approaches around what you're looking for and how you can defend the integrity of it. And my comment back to you would be: you're saying that it works, but what does "work" mean? What is the baseline you're assessing that against? There are a lot of questions you need to go in and investigate and answer so you can defend yourself, because all of this comes back to what happens if you have to undergo an audit. What if a regulatory agency comes back and starts probing the questions of how you're verifying what you're doing?

So that's the worst-case scenario: something gets turned around because the integrity of your system was questioned. And pharma, I think, is a lot of the time extremely risk averse, just because of the investment that goes into getting where things are. Statistical programming is an extremely small window in the larger scope of everything that's going on, and no one wants to be the person who becomes a showstopper for something; even delays can be extremely costly.

But yeah, we're working in an extremely risk averse industry.

That's a fascinating answer, especially regarding the unit tests. It hadn't occurred to me, but it's probably a good message to developers: write good unit tests, and there are almost unintended but fortunate downstream consequences of doing so. You could just write out that report from, for example, the tidyverse unit tests, which I know are comprehensive and good. That's awesome.

Yeah, and one of the things that at least our technology services team is pushing for is to ensure that every R package goes through some static analysis: okay, here's what's being deployed; did it go through all the checks and balances? Sometimes that's possible and sometimes it's not, and it depends on who you're talking to. For the users, it's: does it work for my particular use case? Go deploy the package. Then somebody else comes in later and finds out it's causing a conflict. That's a red flag.

Sometimes security might be looking at it from a slightly different perspective. They don't care about the functionality; they care about the security aspect of it. So if it passes all the vulnerability checks, you are good to go from their perspective. But I go back to the comment you had earlier, Travis: hey, I just installed it, it worked, I'm going to go use it. There is a danger there, as Mike pointed out. I've personally seen at least one effort, I don't want to say fail, but reach a point where nobody wants to use it, because IT took the approach of: we're going to build this and do a minimal qualification; if R installed, good, it passed our check.

Then they said it's on the users, like Nick and others, to go install the R packages they want. And then they suddenly realized that, as everybody knows, in the R ecosystem most packages are ones users can install themselves, but there is a small subset that requires a unique dependency; sometimes that's easy, but sometimes it requires real handling. And that's where the users say: I don't have this expertise, and I'm sorry, I just cannot use this. So there is a danger when you do minimum qualification. I, for one, certainly don't believe in it and wouldn't encourage it. We have to take our users' workflows and ensure that everything is tested accurately. As Mike pointed out, it's to make sure we can defend ourselves, God forbid there is an audit and we have to show evidence of what we ran a couple of years ago. Trust me, short-term pain leading to long-term gain is accurate here.

Package management and rebuilding containers

No, I was just curious, because I understand that the versioning and the environment and everything need to be in place for, say, critical packages or critical tests that are developed and deployed. But I'm working in the finance industry, and we also have Artifactory in place for, let's say, less critical convenience packages. I just wanted to ask whether you mix these, using Artifactory maybe as an internal package mirror for less critical packages, user convenience things and so on, or whether you keep everything on Posit Package Manager.

So we do not distinguish whether a particular package is less important or more important, Walter. We just don't take that approach. If a user wants a package, we go through the entire process of IQing and OQing it. Yes, it will be in Posit Package Manager; yes, we will snapshot it, and we will use the exact same snapshot to replicate it. But no, we do not say, hey, these are less important, so they won't go through any qualification. We go through the totality of it. And as Nick pointed out earlier, the question will always be: why do you need this package? Once that question is answered, from an IT perspective it really doesn't matter how important the package is.

For a validated environment, the short answer is that you just have to rebuild the container; there is no other alternative. However, the process is: if you need new versions of existing packages, then you just go through the basic IQ/OQ that was already written for those packages. But if there is new functionality that you want to adopt, yes, you will have to write new IQ/OQs, no questions asked. There is no shortcut here; you just have to rinse and repeat through this entire process.

Sure, and we go through this whenever we have to build a new container. From an IT perspective, right, we check with users like Nick first: hey, did it meet your requirements? They have their own acceptance testing as well. Then he'll come back and say, nope, something failed, I need you to look into it. And basically it's rinse and repeat, right? You go through the process all over again: go back to the drawing board, add the package, add any system dependencies, run through your IQ, release the container in development for users to test. And once they say, yep, this all looks good, that's when we, quote unquote, lock the container.
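One way to make the "rerun existing IQ/OQs for version bumps, write new ones for new packages" rule systematic is to diff the previous and proposed locked package sets before a rebuild. A hedged sketch, assuming renv-style lockfiles; the file names are hypothetical:

```r
# Compare two renv lockfiles to see what a container rebuild actually changes.
old <- jsonlite::fromJSON("renv.lock.old", simplifyVector = FALSE)$Packages
new <- jsonlite::fromJSON("renv.lock",     simplifyVector = FALSE)$Packages

added   <- setdiff(names(new), names(old))   # new packages: write new IQ/OQs
removed <- setdiff(names(old), names(new))
bumped  <- Filter(function(p) !identical(old[[p]]$Version, new[[p]]$Version),
                  intersect(names(old), names(new)))  # rerun existing IQ/OQs

cat("New packages (write IQ/OQ):", paste(added, collapse = ", "), "\n")
cat("Version bumps (rerun IQ/OQ):", paste(bumped, collapse = ", "), "\n")
cat("Removed packages:", paste(removed, collapse = ", "), "\n")
```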

I know we just passed the hour mark here and a few people had to jump off for other meetings. I just want to say thank you so much, Satish, for sharing your experience with us and also the Atorus team for joining. Satish, you are such a wealth of knowledge and it's been great getting to see more of the IT experience as well that we don't always get to talk about in these Hangouts. Thank you so much. My pleasure. Thank you for having me.