Volha Tryputsen | R in Janssen Drug Discovery Statistics | Posit

From rstudio::global(2021) Pharma X-Sessions, sponsored by ProCogia: this talk discusses how R is utilized in the Janssen drug discovery statistics workflow. About Volha: Volha is the Principal Statistician in the Translational Medicine and Early Development Statistics (TMEDS) group in the Quantitative Sciences Department of Janssen R&D. Learn more about the rstudio::global(2021) X-Sessions: https://blog.rstudio.com/2021/01/11/x-sessions-at-rstudio-global/

Transcript

This transcript was generated automatically and may contain errors.

Alright, thank you. Hello, everybody. I'd like to join Marlee and the other presenters in thanking RStudio and Phil and others for putting this session together. This is my favorite community, R in Pharma, and I'm glad to be here today. Thank you also for the opportunity to talk about R in Janssen Drug Discovery Statistics. My name is Volha Tryputsen, and I work as a principal statistician in the Department of Translational Medicine and Early Development Statistics at Janssen. So what do we do at Janssen? Well, at Janssen, we are guided by our mission of blending heart, science, and ingenuity to profoundly change the trajectory of health for humanity. And we discover and develop drugs which treat or help prevent some of the world's most devastating diseases.

Preclinical drug discovery overview

So how do we discover and develop drugs, right? Well, the process is quite complex, but it can roughly be broken down into two parts: preclinical drug discovery and development efforts, and clinical. I think clinical drug development gets a lot of the spotlight, whereas the preclinical phase is not as commonly talked about. So I'd like to take a couple of minutes to give you, in a nutshell, what it really means to discover a drug. It all starts with choosing which disease you are after. After you do that, you identify the target that you're going to pursue. Once you have that in place, you screen possible compounds, or even engineer a new compound, that are going to bind to your target. Once that is done, you study the relationship, the activity that is happening between your compounds and the target. You might want to optimize it and look at it from many, many angles. But what happens at the end is that you select a subset of compounds that you're going to move forward into another set of studies.

And once you have your subset of compounds, you further test them for efficacy and safety using in vitro experiments, which are experiments outside of a living body, such as testing those compounds in tissue and cell lines, and also in vivo experiments, where you test your compounds in a living organism. Once that is done, you're basically ready to move your final compound into the NME phase, where NME stands for new molecular entity. And that sort of completes your preclinical drug discovery process. But as you can see, it's very involved and very multifaceted.

So I work for the drug discovery statistics group, which supports all of these different drug discovery efforts across different therapeutic areas, whether it's neuroscience, oncology, immunology, infectious diseases, or vaccines. And yeah, we do a lot of work with the scientists. What our department does is bring mathematical and statistical rigor into building this drug discovery pipeline and into decision making.

How R drives the statistics workflow

Right. And so how do we do it? Well, we take the body of data from a lot of studies that biologists have conducted. One of the aspects for us is obviously designing those studies and working on design of experiments. But the other heavy task for us is to analyze all of the data. So we put the data into this sausage-making machine called statistics, we add some magic to it, and voila, we get a p-value. Of course, I'm joking. It's not quite as it is on this slide. Instead of magic, we obviously have a lot of math and statistics within our sausage-making machine. And the p-value is one of the outputs, but it's never the only one. We accompany all of our findings with a lot of different tables, figures, graphical representations, and a lot of different reports. But where I'm trying to go with all of this is that we have this process in place for analyzing the data, and what's really great is that this whole process is driven by R. In the drug discovery statistics department at Janssen, we heavily rely on R and RStudio products. And this is what my presentation is about.

So how do we use R capabilities, and what do we do with R in drug discovery statistics? Well, the most important one is our portfolio support. We use R extensively to analyze a lot of our studies. When I was thinking about how best to present what we do with R, I thought about splitting everything that we do into these four different spheres. First and foremost is our general practices. The majority of us use RStudio, and we use it both locally and on the server. The RStudio Server capabilities are great because they help us share our projects and our different study analyses within the group and collaborate in this manner. We've also lately been tapping into the powerful computing capabilities of RStudio Server, for our analyses but also for Shiny app development.

We all use Git in Bitbucket. Bitbucket is our local GitHub, as we call it, which of course ensures traceability and transparency of everything that we do, but also plays a big role in collaboration and code sharing. We extensively use R packages. We create our own, whether internal or external, for organizing our own code, of course, but also for sharing our methodologies with others, which speaks to the rigor and innovation that R packages bring to the whole drug discovery statistics space.

We have also started packaging the standardized analysis workflows that we use to create Shiny apps into packages as well. That's, again, an easy way to share your code, and it also makes it easy to troubleshoot or to bring in changes that might need to be implemented when the need arises. We also use R packages for packaging templates for R Markdown, which I'm going to discuss next.

R Markdown reports

R Markdown reports are something that we simply can't survive without these days. We extensively use R Markdown reports for a lot of things. I think in 2019, or maybe at the end of that year, there was a big effort in the department to create one standardized, parameterized R Markdown template, which incorporates the set of expectations and decisions about how we want to report our statistics, both among ourselves and with our clients, our biologists, and whoever we work with. What happened last year is that we decided to make that general parameterized template more specific to the therapeutic areas that we work with. For instance, I mainly support oncology. So we customized the template that we created first for the oncology setting, which is really cool because we can customize it whenever we want, but it also brings consistency to all the analyses that we do within the oncology drug discovery space, and obviously speaks to the reproducible nature of our research.
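As a rough illustration of what a parameterized R Markdown workflow can look like (this is not code from the talk or from the Janssen template), the sketch below builds a parameter list and hands it to `rmarkdown::render()`. The function names and the parameter names `study_id`, `therapeutic_area`, and `alpha` are hypothetical, and the template is assumed to declare the same names under `params:` in its YAML header.

```r
# Hypothetical helper: collects the values a parameterized template expects.
# The parameter names are illustrative assumptions, not the real template's.
make_report_params <- function(study_id, therapeutic_area = "oncology", alpha = 0.05) {
  stopifnot(is.character(study_id), length(study_id) == 1)
  list(study_id = study_id, therapeutic_area = therapeutic_area, alpha = alpha)
}

# Hypothetical wrapper: renders one report per study from a shared template.
render_study_report <- function(template, study_id, ...) {
  params <- make_report_params(study_id, ...)
  output_file <- paste0(params$therapeutic_area, "_", params$study_id, ".html")
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("The rmarkdown package is required to render the report.")
  }
  # render() passes `params` to the template; each name must also be
  # declared under `params:` in the template's YAML header.
  rmarkdown::render(template, params = params, output_file = output_file)
}
```

A call such as `render_study_report("template.Rmd", "S001")` would then produce one report per study from the same template, which is one way a shared template can keep reports consistent across analyses.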

Shiny apps

My favorite, and everybody's favorite, are Shiny apps. We do a lot of them. We put them together ourselves, and for some of them we have our developers helping us as well. When I was thinking about how to present the Shiny apps, well, we could talk about them the whole day, but I basically put everything that we do with Shiny into three buckets. The first set of applications that we develop for drug discovery goes into the bucket of data manipulation. That could be anything from taking different data files that scientists have, coming from the different software that they're using, putting them into Shiny, mapping the data, merging the files, doing some other preprocessing, and then exporting whatever other data file type the scientists need to put into another piece of software. So there is an array of things that we can do with data, and it's great that we can use Shiny for that, because it really eliminates any copy-paste or editing errors and gives us confidence that nothing like that has happened while we're preprocessing and manipulating our data.
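To make the file-merging step concrete, here is a minimal base R sketch (not taken from any of the apps described in the talk). The function name, the column names `well`, `signal`, and `compound`, and the example values are all made up for illustration.

```r
# Hypothetical helper: joins an instrument readout to a sample annotation
# table on a shared key column, keeping every readout row.
merge_assay_files <- function(readout, annotation, key = "well") {
  stopifnot(key %in% names(readout), key %in% names(annotation))
  merged <- merge(readout, annotation, by = key, all.x = TRUE)
  # Flag rows that found no annotation instead of silently dropping them.
  merged$unmatched <- !(merged[[key]] %in% annotation[[key]])
  merged
}

# Made-up example data standing in for two exported files.
readout <- data.frame(well = c("A1", "A2", "A3"), signal = c(0.12, 0.85, 0.40))
annotation <- data.frame(well = c("A1", "A2"), compound = c("cmpd_1", "cmpd_2"))

merged <- merge_assay_files(readout, annotation)
```

In a Shiny app, a function like this would typically be called on files uploaded through `fileInput()`, with the merged table offered back to the scientist through `downloadHandler()`. Doing the join in code rather than by hand is what removes the copy-paste step.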

So once we have the data in the right form and shape, the next thing that we do in statistics is obviously take a closer look at the data, and we call it EDA, exploratory data analysis. Another set of Shiny applications that we have developed goes into that bucket, where they have to do with data visualization and exploration. That is super important, and bringing efficiency to that space leads to more thorough research and, obviously, a richer body of evidence when it comes to data exploration.

And then the last bucket is apps that have to do with statistical analysis. Here we've developed standardized workflows for particular analyses that scientists do repeatedly. A lot of those apps have an exportable R Markdown report, whether it's PDF, Word, or HTML, you name it. This brings major efficiencies in our space because of the automation of those repeated tasks, analyses, and report generation. And of course, because of RStudio Connect capabilities, we can easily deploy the apps and share them with our colleagues in biology.

Innovation

So these four things, general practices, our Shiny apps, R packages, and R Markdown reports, help us tremendously in the portfolio development space. But what else are we doing with R? Well, I have the second part, which I call innovation. What do I mean by that? Obviously, this speaks a lot to the open source nature of R that we all benefit from, and to code sharing, which facilitates not just collaboration within the department, but collaboration across functions, across companies within Johnson & Johnson, and even beyond, going into the community. The statistical analysis automation that we're doing with Shiny apps and R Markdown reports allows us to free up a lot of resources so we can go and explore other cutting-edge technologies and bring innovative solutions to our drug discovery space.

Now, immediate access to all the R packages, whether we're creating our own or using someone else's, facilitates the innovative drug discovery statistics solutions that we create in our department. And then something new that Johnson & Johnson hadn't had before and added in 2020 is J&J hackathons, which bring together a lot of different employees from different parts of Johnson & Johnson, whether you're in consumer, medical devices, or pharma, to work on solving some very interesting business problems. And I'm proud to say that everyone from drug discovery statistics who participated in those hackathons was using R to solve those problems.

Training

So portfolio support, innovation, what else, right? Training. We've conducted a lot of smaller trainings that have to do with R, but I want to concentrate on two bigger ones. The first one is Graphics with R, which has been talked about in previous years at conferences, at R in Pharma and maybe others. The statistics and decision sciences department at Janssen put it together and led it with the aim of improving literacy in R programming among clinical statisticians. More than 100 clinical statisticians went through the training, and it was presented by internal R experts, a lot of whom came from the drug discovery statistics group, which I belong to. What we did was create a lot of different homework assignments on different topics that cover ggplot. We started with ggplot, and that was the major intention, but the appetite just grew and we had to extend to R Markdown, interactive plots, tidyverse, and dashboards. And I think we ended up with a small Shiny app at the end. We would deliver homework to the trainees, and then they would go into a review and regroup session and brainstorm solutions. Then they would present the solution in the general forum and discuss it. And then the experts would come in with their tips and tricks presentations.

So it worked really well, and that was in 2019 at Janssen. Then last year, another training was designed and delivered, led by the quantitative sciences organization. The aim there was to promote a shared foundation of quantitative literacy in the machine learning space. That was a major, major effort: a 25-week course, again presented by internal experts. Those are statisticians who have advanced degrees in machine learning, mostly PhDs, again including colleagues of mine and myself. They presented content about machine learning that was structured in four different modules covering statistical learning, classification, and supervised and unsupervised approaches, and it was attended by hundreds of people from all over the company. So that goes even outside of Janssen. The way we ran it was a discussion of the theoretical part and then a practical data application. And again, I'm proud to say that for the practical data application part, we used R entirely, and we used R Markdown for all of our code. We had a standardized template that we all used, and we delivered all the solutions in R Markdown. Those now sit in repositories that trainees, or people who didn't attend but want to learn, can tap into.

Building community

So we talked about portfolio support, innovation, and training, but I think R also helps to build community, whether it's for R or R-driven. Something that we have going on at Janssen, and this is the part that I'm involved in participating in and organizing, though maybe there are more, is Shiny Day, which we've been having for, I think, the last three years. It is an annual event organized by Janssen Quantitative Sciences and Scientific Computing Operations in partnership with the Data and Statistics Society, which I co-lead with some other people, where we bring our Shiny developers together for a one- or two-day event. Back in pre-COVID times, we would actually meet on site, at different sites, to get to know each other and share some of the exciting work that we're doing in the Shiny application development space. Last year we obviously had it remotely. It has been of great interest, and the community there is growing.

Also within the J&J Data and Statistics Society, we have workshops where we talk about R or discuss some of the methods developed in R. An example there is that we even have topics about using different software and how to plug R into it. We had something about Tableau, on how to bring in R capabilities or how to connect R with Tableau. We also have internal Yammer groups where our community is bubbling and growing, and I'm really happy to be part of it and see it grow. And I think a separate community is the one we have around our internal Bitbucket, our internal GitHub, where people share code and create collaborations.

Something else that we all love is hex stickers. I love those like everybody else in this community, and I try to bring them into any app that I'm involved with. Some of the stickers that we've developed go with the Shiny apps that we created. You see an example of the Enviva LD app that I put together, which has a hex sticker. We also put hex stickers together for different workshops or events that we have. You see a Shiny Day sticker and then stickers for some of the workshops that we had within that. And we also have a membership sticker for the Data and Statistics Society, which was actually in quite high demand, and we had to mail them globally, which is great.

Summary and future potential

So to summarize, R in Janssen drug discovery statistics today and tomorrow. Today we heavily rely on R and RStudio products. We use RStudio projects, R Markdown reports, Shiny apps, and R packages, which bring efficiencies and statistical rigor to our portfolio. They ensure reproducible and traceable research, as well as standardization and harmonization of methods, workflows, et cetera, and give us the capability to quickly adapt to changes. Then the open source nature of R provides us with immediate access and the opportunity to share our cutting-edge and state-of-the-art statistical methodologies, and also facilitates innovation and collaboration. And we're growing our community internally through different platforms, workshops, internal events, and trainings. We're also trying to give back to the community by having R packages on CRAN and presenting at different conferences.

Some notes on future potential. I think we can keep building a larger R community, not just within our Janssen pharmaceutical space, but also going into other Johnson & Johnson sectors, whether it's medical devices, consumer, finance, et cetera. That's always a great thing to do. I think there are ways to work on sharing R code across the enterprise, not just within sub-communities, so that would be something great to do in the future. And then enhancing the visibility of internal R packages. This is something I also feel is important, and we can do better there. With that, I'm going to conclude and say thank you to some of my colleagues who let me bounce some of my ideas for this presentation off them. And once again, thank you, RStudio and the people who organized this R in Pharma conference, for inviting me to speak. Thank you so much.