
Andy Nicholls & Michael Rimler | Using R to Drive Agility in Clinical Reporting | RStudio (2020)
The R language is used extensively throughout the pharmaceutical industry. But its use within tightly regulated clinical reporting workflows has remained limited. GSK Biostatistics has embarked upon a journey to embed R as a primary statistical analysis tool for clinical reporting. Enabling R within a global department of over 600 Statisticians, Programmers and Data Scientists is challenging! It requires planning, patience, and a strong foundation that enables consistency across the enterprise. We invite you to learn more about how we achieved this at GSK. You'll learn about our tidyverse-centric training program, a future-ready Working Area for R Programming (WARP) environment, and a leading-edge R for Clinical Reporting (R4CR) initiative. The goal: help embed R in everyday clinical reporting output.

About Andy: Andy Nicholls has a long history with the R language and in Data Science, having authored the book 'R in 24 Hours'. He is currently Head of Statistical Data Sciences within GSK's Biostatistics department. One of his team's main objectives is to embed the R language within Biostatistics: developing training materials, overseeing various adoption initiatives and provisioning a world-class environment for R. He is also the lead for the cross-industry R Validation Hub initiative that aims to support the use of R for regulatory work.

About Michael: Michael Rimler is a clinical programmer in GSK Biostatistics and is passionate about influencing the evolving role of open source technologies and data science capabilities in clinical data analytics. He is involved with numerous internal initiatives aimed at moving the organization in this direction, including leading the effort to fully integrate R into the clinical reporting process. Michael is a co-lead of the Open Source Technologies in Clinical Research PHUSE working group project, has chaired a PHUSE US Single Day Event on Data Visualization, and will serve as a co-chair for the 2021 PHUSE US Connect.
Transcript#
This transcript was generated automatically and may contain errors.
For those of you who don't work in the pharmaceutical industry, and indeed for those of you who do, the diagram in front of you is essentially an org chart, and it's also a timeline. So this is how the biostatistics department at GSK is structured. I say it's like a timeline because our structure essentially reflects the drug development life cycle. This can take up to 15 years, which is what the arrow at the bottom is representing, and it takes us through the research, in vivo and in vitro analysis, all the way through to clinical development.
And by and large, whether or not you're from industry and this is familiar to you, you can slice this up into two distinct ways of working, really. The first is the sort of non-clinical area, and that's actually where my group, Statistical Data Sciences, sits, but it's also where our research statistics and our manufacturing statistics groups sit as well. That's a large group of statisticians, programmers and data scientists, around about 100 people. But overwhelmingly, the vast majority of people in the department sit within the clinical space. And the clinical space has very, very different controls and ways of working to the non-clinical space.
And it's the clinical space that we're going to be talking about today. So a team of 500 clinical 'data scientists', as we're going to call them. And I put data scientists in quotes on this slide because, really, if you speak to anyone in that department, very few people would actually call themselves a data scientist. They are made up of people whose job title is either statistician or programmer, generally.
And one of the reasons why I think that I would also probably not call them data scientists in a sense is that, for me, a data scientist is someone who is multi-skilled in a number of different ways. So they're not simply a statistician. They can code a bit as well. They can use different languages. They use the right tool for the job. And although there's a huge amount of talent in that 500 statisticians or programmers, at the start of our journey, they weren't really data scientists in the sense that I'm describing.
Skills and language proficiency in 2018
Just to give you an idea of what kinds of skills we do have, away from just pure statistics and pure programming skill, I've put up some rough estimates of proficiency within the department at the start of this journey in 2018. And as you can see, the overwhelming majority can program in SAS. I've said 99% rather than 100%, allowing myself a little bit of room for one or two people who can't. But essentially, everybody can program in SAS.
There is some proficiency in R in the clinical space. But largely, R has been used historically for the generation of graphics. And SAS still dominates in that area. It's normally bespoke graphics that are particularly difficult to generate in SAS that R has been used for, along with a few very proactive R users pushing R for graphics, and for clinical trial simulations where R arguably lends itself better towards simulations. Python, again, I've tried to cover myself with 1%. Essentially, nobody's using Python in this clinical space. But just in case, I put 1% there. And Julia, essentially, I've put up there to make it sound like I know what I'm talking about. Essentially, no one is using that either.
Why R hasn't been used in clinical reporting
So why are we not using R? Well, a lot of it comes down to the past 20 or 30 years of drug development. Skills have been built up over that time in SAS. And alongside those skills, the tools we use, what are called SAS macros, which, if you're not familiar with SAS, would be akin to R functions. Think of all the functions and packages you might build in R: all those kinds of tools that we build to make our job easier have been built in SAS. And the clinical space is very, very heavily regulated, so we have systems that control our workflows to a very fine level of detail, and these systems, too, are entirely built for SAS.
Add all that to the fact that it's very conservative. The further along the pipeline you get, the more conservative it gets within our industry. And you can start to see why R hasn't been used. But the real one that I hear a lot is this statement here that regulators only accept analyses performed in SAS. This is essentially a myth that has been around for a long time. It gets repeated in various different forms.
So, for example, when you're filing for submission with the FDA, the US Food and Drug Administration, you have to provide data sets in the SAS transport format. And that's an open format. This statement here is from the FDA; it's clearly an open standard. I put a line of code, a bit of pseudocode, on the right-hand side to highlight just how easy that is to generate in R. In fact, it takes less code in R than it would in SAS, ironically. But these kinds of statements get put out there, and I think because 'SAS' appears in them, there's always this assumption that SAS has to be used.
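The pseudocode on the slide isn't captured in this transcript, but as a sketch (the dataset and file names here are invented for illustration), writing the SAS transport format from R is a single call with the haven package:

```r
# Write a data frame out in the SAS transport (XPT) format used for FDA
# submissions. 'adsl' is a made-up subject-level analysis dataset.
library(haven)

adsl <- data.frame(USUBJID = c("001", "002"), AGE = c(34L, 41L))
write_xpt(adsl, "adsl.xpt", version = 5)  # one call produces the open XPT format
```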
On the regulatory note, I could talk for an entire hour. I won't be, but I could talk for an entire hour on this topic that Phil mentioned in the introduction. If you are interested in the regulatory barriers or perceived regulatory barriers that we face in the pharmaceutical space, I would recommend you go to www.pharmaR.org. This is the R validation hub that Phil mentioned. And you can see here, in particular, our white paper on a risk-based approach to package validation.
Why we're looking at R now
So given all those barriers and challenges, why are we looking at R now? I think one of the key reasons is the influx of new talent and skill around R coming from college, from university. We may have people who have been using SAS for 20 or 30 years, but these new people coming in with skills in R are generally not coming out with skills in SAS. And because no one else understands R very well, we've been having to retrain them. All their education has been in R, and we retrain them in SAS so that they can work with everybody else.
And now, I think it's probably fair to say that we've reached critical mass. We've got so many people who know how to use R that we're now looking at it the other way around: let's retrain some of the people that have been here for 20 or 30 years in R. The other piece is the drive for innovation. That's not new; we're always driving for innovation. But there are a few things in particular that I've highlighted on this slide. What we're seeing is an increased drive to use the latest statistical methodologies to try and design more efficient trials.
And when I talk about more efficient trials, what I mean is, if you think about a drug getting to market, up until the point where you have proved the efficacy and the safety of that drug, your drug is something that you want to expose as few people to as possible. It's a potentially dangerous compound that we want to make sure that we're minimizing that exposure. So we need to reduce that. And equally, if it is effective and it's safe, we want to get it to market as quickly as possible.
And there's been a lot of investment in things like the reuse of placebo data, BSK, and other methodologies to try and create a more streamlined process for developing drugs. A lot of that requires heavy amounts of simulation, and there are a lot of tools available in R that have been built around that. So they're really, really pushing things forward as well.
Before I move on to the last two and I talk about visualization and the dynamic reporting aspect, I just want to mention a couple of others that often get talked about in relation to adoption of R. The first one being cost. I think cost is important in certain sectors. I wouldn't put it as important as any of the other things on this slide, actually. When you think about cost, yes, R is free and tools like SAS are not. But there's a huge cost in training 500 people and updating all of our standard tools and systems and so on as well.
So in a sense, yes, there are some long-term cost savings, but in the short term it's probably the other way around. And then the last one, the gateway to Python and open source: everything I'm talking about here with respect to R, you could switch out R and put in Python or any other open source tool. But R being more statistical in nature, at least in its origins, compared to something like Python, it is a more natural language to use in the pharmaceutical space. And that's why we've chosen it as the tool to sit alongside SAS.
GSK use cases: quantitative decision making
So the two GSK use cases I wanted to talk about, the first one is a Shiny use case. And this is something we call quantitative decision making at GSK. In a nutshell, it is the probability of being able to make a decision at the end of the trial, given the operating characteristics of the study, so the study design, and various other pieces of information that we know up front.
Essentially, what we try and do with quantitative decision making at GSK is take known values. Up in the top right-hand corner there's a diagram looking at treatment difference. On the x-axis, you've got a difference, for example between our drug and placebo, or our drug and some competitor on the market. In a standard clinical trial, you might look at the difference between those two compounds and generate some kind of p-value. And if it's a late phase study, you might be looking for something like p less than 0.05 to show that there is some kind of treatment difference.
And that's great at the study level. But what QDM is really looking at is the lifecycle of the compound. So in the early phases of development, in phase 2, where QDM is particularly used at GSK, you can't power a study sufficiently to be able to make those p less than 0.05 kinds of statements. What you want to be able to do, given everything you know about the drug, is estimate probabilities: what's the probability that the difference between our two compounds is greater than some clinically meaningful, commercially viable difference?
Now, all of this is quite complicated. It uses a Bayesian framework. We put priors on all the information we know. There's a lot of study design elements that we consider. And what we're trying to get to at the end is the probability of success of the study. So the probability that we're going to be able to make a clear decision as to whether to continue into the next phase of development or stop development of our compound.
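As a rough illustration of the idea (this is not GSK's actual QDM implementation, and every number here is invented), the core probability-of-success question can be sketched as a conjugate Bayesian update on the treatment difference:

```r
# Sketch of a QDM-style calculation: probability that the true treatment
# difference exceeds a clinically meaningful target, given a normal prior
# and a normally distributed estimate from the study. Values are illustrative.
prior_mean <- 0; prior_sd <- 5   # sceptical prior on the treatment difference
est_diff   <- 3; est_se   <- 2   # estimated difference and its standard error

# Conjugate normal update (precision-weighted average of prior and estimate)
post_var  <- 1 / (1 / prior_sd^2 + 1 / est_se^2)
post_mean <- post_var * (prior_mean / prior_sd^2 + est_diff / est_se^2)

target <- 2  # clinically meaningful, commercially viable difference
p_exceed <- 1 - pnorm(target, mean = post_mean, sd = sqrt(post_var))
```

A real QDM calculation would also fold in the study design elements mentioned above, such as sample size and randomization ratio, rather than taking the standard error as given.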
And because it's quite complicated, when we trained a lot of our clinicians in the methodology, they weren't immediately receptive to everything that we were saying. It's a lot of complex mathematics, and mathematics is generally not their first job. So we developed an app that takes all those factors: things like the sample size, the randomization ratio, the standard deviation, things you can hopefully see on this screen. You combine that with information about the minimum value and the target value, and potentially lots of other operating characteristics as well, and you allow the clinicians and the clinical teams to play around with the sliders and change different options. Then it starts to become real for them. They can start to see what the impact of the study design is on the ultimate probability of success.
And because this is such an important initiative at GSK, it's been rolled out across all of our studies in that phase of development. As a result, we're seeing widespread adoption of the QDM methodology. Everyone is seeing the Shiny apps, and there's a huge number of requests for additional Shiny apps.
Dynamic reporting
Second example, which I'll go over a little more briefly, is dynamic reporting. Essentially, our submission process, or our development process, is such that we generate summary datasets in a very old-fashioned, boring text tables format at the moment, and we generate these in the same way we've done for 20 or 30 years. That's fine for our submissions. But then when we want to present those in a PowerPoint presentation, or publish those results in some other way, what we find ourselves doing is reformatting all that information.
In fact, even in the clinical study report, we hand over the data to our medical writing teams, and they end up doing things like manually merging columns and reformatting some of the data that we've given them before it goes into the clinical study report. If you weren't using Microsoft products, something like R Markdown would be equally suitable for this. We actually used the officer package to prove this concept last year: that we could just take the summary datasets that we generate in Biostatistics and automatically place all of the outputs into each of these different formats, without the need for the huge amounts of transcription QC that takes place today.
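As a minimal sketch of the officer approach (the table contents, slide layout and file name are invented, not GSK's production outputs), a summary dataset can be dropped straight into a slide:

```r
# Place a summary dataset directly into a PowerPoint slide with officer,
# avoiding manual transcription of results. The data here are made up.
library(officer)

summary_df <- data.frame(Treatment  = c("Drug", "Placebo"),
                         N          = c(100, 98),
                         Responders = c(62, 40))

deck <- read_pptx()  # default template; a corporate template could be supplied
deck <- add_slide(deck, layout = "Title and Content", master = "Office Theme")
deck <- ph_with(deck, "Response summary", location = ph_location_type("title"))
deck <- ph_with(deck, summary_df, location = ph_location_type("body"))
print(deck, target = "response_summary.pptx")
```

The same summary data frame could equally feed an R Markdown document, which is the point of the proof of concept: one source, several output formats.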
Building capability
So I hope that's given you a bit of an introduction, and a couple of the things that we're looking at that are really accelerating the adoption of R within GSK. What I want to move on to now is how, in general, we're trying to build this capability. I guess some of the marketing, some of the advertising, that we've done internally to promote R. In building capability, I've put up something here which will probably horrify anyone working in change management: what I'm calling a simple recipe for building capability. Because the components here, the ingredients, if you like, in this recipe are not rocket science; it's more about how you do it.
So Tilo's going to talk about the environment that we've built, the WARP environment. And Michael will talk about the way that we're promoting on-the-job learning through an initiative called R4CR. Before I move on to training, I just want to highlight as well that, given this is a heavily regulated industry, processes and standards also play a big part. We actually have standard operating procedures that effectively reference SAS directly. Not intentionally, but our workflows are so built around the SAS workflow that certain elements don't even make sense for R.
So let's move on and look at training before I hand over to Tilo to talk you through the WARP environment. What you can see on the slide now, which is a build slide, is three prongs of how we talk about data science within biostatistics. This is largely based around our traditional roles. So at the top, we talk about modeling, and modeling is the role that's traditionally performed by a statistician in GSK biostatistics. A programmer would fulfill a data wrangling role within biostatistics.
The role that my team fulfills in promoting R is more of an engineering role. Engineering can mean lots of different things to lots of different people; what we mean in this space is things like building the environment, building applications, building R packages. So, if you like, the tooling that helps support data science analyses. And the way we've gone about training everybody is to focus on the tidyverse and on ggplot2. ggplot2 is clearly part of the tidyverse; the reason I've separated them out in this diagram is because they are two separate face-to-face courses that we created, which are now virtual courses for everyone in biostatistics.
And we want to make sure that everyone's got to a base level in the tidyverse before we move on. We chose the tidyverse for all of the reasons you would normally choose it: things like the simplicity of the functions, the consistency, and so on. But also because, if you're trying to take people who know SAS and train them in R, R is a lot more expansive than SAS. It's a language. You have things like matrices and lists and so on, and it can get quite complicated when you get into it. SAS, on the other hand, is built around very simple flows: data steps, as we call them in SAS. And the way those work is very, very similar to the way the tidyverse is built around data frames, or tibbles in the tidyverse vernacular. So it's a lot easier for someone coming across from SAS to pick up tibbles and that way of working than it is to pick up the full breadth of the R language.
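To make the analogy concrete (a hypothetical example, not one of our standard tools), a simple SAS data step and its tidyverse counterpart are both linear flows over one rectangular dataset:

```r
# The SAS data step
#   data adsl2; set adsl;
#     where AGE >= 18;
#     BMI = WEIGHT / (HEIGHT/100)**2;
#   run;
# maps naturally onto a dplyr pipeline over a tibble:
library(dplyr)

adsl <- tibble::tibble(AGE    = c(17, 34, 52),
                       WEIGHT = c(60, 75, 82),    # kg
                       HEIGHT = c(165, 180, 172)) # cm

adsl2 <- adsl %>%
  filter(AGE >= 18) %>%                   # the data step's 'where'
  mutate(BMI = WEIGHT / (HEIGHT / 100)^2) # the derived variable
```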
And in particular, the magrittr pipe and everything like that is very familiar to people coming from a SAS background. So we build them up in the foundations, and then in each of the core areas we look to strengthen core competencies. We have bespoke training in various different forms; I've put on the right-hand side some of the methods of delivery. So this will not all be face-to-face training: lots of this is guidebooks, webinars, all sorts of different things that we do to promote the learning. There'll be core stats training using real GSK data. In programming, ProgPlus here represents more targeted training in things like stringr and lubridate, those kinds of packages for more advanced data manipulation. And then we look at things like function writing from an engineering perspective.
And then the value comes in around these kinds of things. Shiny clearly plays a big part in our future, so we've got some expansive targets around building Shiny capability. Publishing here I'm using to refer to things like officer, as well as R Markdown, Bookdown, Blogdown, those kinds of tools. And you can see that through this, we're building capabilities in each direction across the group.
The WARP environment
OK. In this particular section, what I'm going to focus on is the environment that we've built to support statisticians, programmers, and data scientists within biostatistics, and increasingly now beyond biostatistics as well. We call this environment WARP because it's part of a wider initiative called SPACE. And to make that into an acronym, we talk about the Working Area for R Programming.
So as has been mentioned in the introduction, the majority of our users are clinical users, statisticians and programmers. And the kind of data they work with, clinical trial data, is relatively small in the data science space. So a few hundred subjects visiting a clinic a handful of times, maybe fewer than 10 times, generating at most a few megabytes of data in some of our laboratory data sets. Certainly not generally in gigabyte terms.
And the kind of analyses that we're running are standard statistical models: linear models, generalized linear models, mixed models, Cox proportional hazards, these kinds of things. And when we simulate that data, what we have is computationally demanding simulations where we fit that model over and over again, as opposed to needing to work with very, very large data sources and trying to solve problems with memory and so on. The only exception is that we are seeing increasing use of molecular and genomics data as well, and that's something that we'll have to adapt and respond to in the future.
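The "fit the same model over and over" pattern looks something like this minimal power simulation (all design values are invented for illustration):

```r
# Estimate the power of a two-arm trial by simulating it many times and
# fitting the same analysis to each simulated dataset. Values are illustrative.
set.seed(42)
n_per_arm <- 50; true_diff <- 0.5; sigma <- 1

one_trial <- function() {
  placebo <- rnorm(n_per_arm, mean = 0,         sd = sigma)
  active  <- rnorm(n_per_arm, mean = true_diff, sd = sigma)
  t.test(active, placebo)$p.value < 0.05  # did this simulated trial 'succeed'?
}

power <- mean(replicate(5000, one_trial()))
```

It's this embarrassingly parallel shape, thousands of small model fits rather than one big dataset, that makes the workload a natural fit for the HPC capability described in this section.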
So in terms of what that means our environment needs to cover, well, we need an interactive environment at its core. We need a high performance capability to support those simulations that I mentioned. And with respect to our new capabilities, our Shiny apps and our markdown-based workflows, we need an ability to share that kind of content. And of course, to do anything, we need access to all of our data sources.
We've provided this through a centralized environment. That probably isn't a surprise to anyone who works in this industry. But our aim is to get people off desktops where we can and provide a level of consistency. Particularly in the clinical setting, we need consistency, traceability, transparency in what we do. And we have a large number of users that we need to support when doing this. So we have capacity at the moment for around about 1,000 data scientists with 300 concurrent users, we estimate.
This diagram is just to highlight some of the technologies that are included in this environment. You'll see as well as R, you'll see technologies like Python and Stan mentioned. Generally, we access those through R as opposed to direct access through their own IDEs. So everything is built around RStudio Server Pro, and it's very much designed as an R-based environment. All of those other components there are really just supporting the use of R. So even take something like Git and GitHub, that's actually maintained completely separately, but we'll provide a connection in R to allow our code to be stored and managed through GitHub.
So we've chosen RStudio Server Pro. And again, maybe this is an obvious choice for many people; clearly the RStudio IDE is the standard IDE for R anyway. Increasingly in the data science space I do see a trend towards more notebook-like environments, but really, in a clinical space, that doesn't work. We work with scripts primarily, and for those users coming across from SAS, all they've ever known, really, is scripts. So first and foremost, we need an environment in which to develop and execute scripts. RStudio Server Pro is really nice in that the RStudio IDE provides notebook access, and obviously easy use of R Markdown if you want it, and is therefore the best of both.
But it also provides the ability to load balance. So for us with a high number of users, we actually have two servers with RStudio Server Pro running. And users will log onto a single address, and they'll be put onto whichever server has the most capacity. And this is also really, really useful for things like maintenance to minimize the downtime that users experience as well. There are some challenges with the interactive environment, the load balancing with respect to high-performance usage, but I'll come onto that after the next slide.
Managing the environment
So in this slide, what I want to talk about is our means of managing such an environment. For a regulated environment, I suppose at the moment we are quite open: we're not locking much down for our users. First and foremost, we need to get people off laptops with R, and on a laptop they're used to high levels of flexibility. So we provide a central installation of R, and that installation is updated every six months. We have many, many versions of R, and all of the recent versions are exposed to users. So, at six-month intervals over a period of three years, you can go back and select a version of R that you want to use.
And when we talk about a version of R, we don't just mean the core and recommended packages. We currently have around 450 packages centrally installed in a system library as well. So when a new user logs onto the system, they get access to the tidyverse and other popular packages for our kind of work immediately; they don't have to go and install those packages. Moreover, because these are centrally installed in the system library, everyone has the same version of those packages, which ensures a degree of stability of usage. We do allow flexibility, so users can also install their own packages in their own personal user library. We're not restricting that at the moment, but it's very likely that we'll supplement this environment in the future with more static, stable versions of R that are a bit more locked down.
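The layering described here relies on R's library search path; a sketch (the paths below are typical examples, not GSK's actual locations):

```r
# R searches library paths in order, so a read-only central system library
# can sit behind a writable per-user library.
.libPaths()
# e.g. "~/R/x86_64-pc-linux-gnu-library/4.0"  <- user library (writable)
#      "/opt/R/4.0.2/lib/R/site-library"      <- central system library
#      "/opt/R/4.0.2/lib/R/library"           <- base and recommended packages

# install.packages() writes to the first writable path (the user library),
# while library() searches every path, user library first.
```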
The only element of control that we really have on the environment is provided through RStudio Package Manager. RStudio Package Manager provides access to packages that users would typically go and install from CRAN, Bioconductor, GitHub, and so on. With respect to CRAN and Bioconductor, we provide the whole of both. For GitHub, users have to request that particular repositories are added to our environment before they can install them. So they can't directly access CRAN, Bioconductor, GitHub, or anywhere else. And we also have our own GitHub Enterprise at GSK, which we release packages onto, and users can use RStudio Package Manager to access those.
What's actually really, really nice about RStudio Package Manager is that it allows us to monitor the downloads. So when we see many users installing the same package, what we can then do is install that in the central system library in a future release. Again, this is just to limit the amount of personalization and customization that might affect the transferability of our code.
And speaking of central monitoring, another really nice feature that our admin support team use quite extensively is the admin screens that are available through RStudio Server Pro. So here you can see a view of our system. What I haven't mentioned thus far is that these two servers that we use are very, very powerful. They were originally intended for high-performance computing, so they each have 192 cores and three terabytes of RAM, which is why you can see around 1% CPU and memory usage here. So generally, we are not pushing these servers to their limits.
What you can see, though, is these little spikes that occur here. And these spikes occur when a user tries to parallelize, or does successfully parallelize, code and executes that in the background. So we can monitor the system, see these spikes, and then talk to those users. Users are allowed to parallelize code and they are allowed to perform HPC runs, but our expectation is that they run them on a separate HPC system, which I'll talk through next. That system is fully accessible from our RStudio Server Pro environment.
So we built an add-in to access Slurm, which is the tool that we use for accessing HPC. You can see on here that each one of these is a different server. When users in RStudio Server Pro access that add-in, the information they enter essentially writes the necessary script to launch that job onto the two separate HPC nodes. Again, every single one of these servers in this diagram has three terabytes of memory, which is far too much for our management node; it doesn't need anything like that, and no doubt we'll change that in the future. But certainly for the HPC, that means we've got some very powerful servers available.
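A hedged sketch of what such an add-in might generate behind the scenes (the job name, resource requests and script name are all invented, not GSK's actual configuration):

```r
# Write a Slurm batch script from the interactive session and submit it,
# so the heavy simulation runs on the HPC nodes rather than the
# interactive servers. Everything here is illustrative.
job_script <- c(
  "#!/bin/bash",
  "#SBATCH --job-name=sim-run",
  "#SBATCH --cpus-per-task=16",
  "#SBATCH --mem=32G",
  "Rscript run_simulation.R"   # hypothetical simulation script
)
writeLines(job_script, "submit.sh")
system("sbatch submit.sh")  # hand the job to the Slurm scheduler
```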
One point to note, I mentioned that we built an add-in for this. We were sort of bleeding edge when we were doing this and the functionality wasn't readily available in RStudio. I understand that it is available now. You can use Launcher within RStudio to work with Slurm as well as Kubernetes. But until GSK migrates to start using Kubernetes, we are still using our own custom add-in at this moment in time.
Content sharing with RStudio Connect
So beyond the obvious, I just want to now highlight the value add, I guess, of this environment: the new things that we can do now that we couldn't already do in our SAS-based environments. For content sharing, so for sharing Shiny apps, R Markdown documents, Bookdown, Blogdown, we use RStudio Connect. We have one server at the moment running RStudio Connect. In the future, again, that might be two; we might evolve to have a development and a production server. But at the moment it's all in one, and it's a pretty liberal model: users can publish anything they like.
So you see a lot of apps, development apps with the word test next to them. The Shiny apps are a mixed bag; we get all sorts: the QDM, quantitative decision making, app that I mentioned earlier, various dashboards that people have created, training apps, and so on. The image on the left-hand side is taken from a proof of concept we did using Shiny to report clinical trial data. So we've got an ongoing initiative at the moment where we're looking at replacing a lot of our standard outputs with a Shiny app.
Otherwise, we also use RStudio Connect for all of our training materials. Most of those are produced using Bookdown, and we host those on Connect. One particular feature I want to highlight that isn't mentioned on this slide, but that I'm really, really thrilled about, is a feature in RStudio Connect that allows you to publish directly from Git. We use Git and GitHub extensively, particularly within my team, and you can point RStudio Connect directly at a particular branch on GitHub. So, for example, we'd have our production apps pointing at the master branch, and a test branch that we point RStudio Connect at for more regular updates and testing. That's a really, really nice feature that often doesn't get talked about, which I just wanted to highlight.
Otherwise, I think RStudio Connect generally works really, really well for us. The only bit of guidance that we often need to work through with our users is around data access. Sometimes people don't understand how RStudio Connect runs: you can configure it to run as the user that's trying to access the content, or you can configure it to run as a generic user, and that affects things like data access. So we're often having to work with users to explain that. Equally, you can design an app such that the data is deployed as part of the app, or you can write apps to access external data sources, where some of what I just mentioned becomes more of an issue. But there are lots of different mechanisms for controlling access within Connect, and that's one of the key advantages we see in it for Shiny apps: making sure that the right people see the right content.
Back-end data sources
Before I round off, I wanted to mention some of the back ends. I've mentioned things like relational databases; these are ideal for clinical trial data. Our clinical trial data is actually stored on file shares at the moment, but in future it will live either in a relational database or a graph database. So we've provided access to a number of internal file shares, and we're able to access relational databases as well.
Outside of the clinical setting, a number of users want to access things like real-world data: electronic health record databases stored on separate Hadoop clusters within GSK. So we provide access to those. Then there's the second item down on the slide, our object store. It's perhaps not talked about too much these days, but I think we'll see it a bit more in the future. We have some fairly large structured and unstructured data sources that we access through the object store, and it's something that helped us develop our first clinical RNA sequencing pipeline, which we'll be able to announce later this year. It also enabled another pipeline that allowed us to analyse COVID-19 data and respond to a request from our clinical operations group to understand what the impact of COVID-19 is going to be on clinical trial recruitment.
So there have been a couple of really nice use cases where connecting to the object store has been particularly powerful, and we're very grateful for the work that's gone into getting the aws.s3 package back onto CRAN. I'll finish with an overview of the whole system. I've spoken about each of these components; the only one to highlight that I haven't mentioned before is the management node. This contains our centralised R distribution: whether it's Connect, the interactive environments, or the HPC nodes, all of these work off the same base R installation, which sits on there, along with RStudio Package Manager to manage packages and the Slurm workload manager. Having this extra node independent from the rest is useful. It's not strictly necessary to have it as a separate node, and, as I mentioned on the previous slide, it's a little overpowered at the moment for what we need, but it's a really useful component, alongside the load balancing that takes place everywhere else. And, as mentioned, a future improvement will be to balance content across the Connect servers as well.
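To give a flavour of what object-store access looks like from R, here is a minimal sketch using the aws.s3 package mentioned above. The bucket and object names are hypothetical, and credentials for an internal object store would typically come from environment variables rather than being hard-coded.

```r
library(aws.s3)

# Assumed setup: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and, for an
# internal (non-AWS) object store, AWS_S3_ENDPOINT are set in the
# environment. The bucket and object names below are invented.
recruitment <- s3read_using(
  FUN    = read.csv,
  object = "covid19/site_recruitment.csv",
  bucket = "clinical-ops-data"
)

head(recruitment)
```

`s3read_using()` streams the object through any reader function, so the same pattern works for larger structured files via, say, `arrow::read_parquet` instead of `read.csv`.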
R4QC: on-the-job learning in clinical programming
So with that, I will round off and hand over to Michael to talk about our on-the-job learning, and in particular an initiative called R4QC. Yeah, my name is Michael Rimler. I am in clinical programming within GSK Biostatistics, and I am here to talk about our adoption story within clinical programming, and really in the clinical reporting process.
The objectives of this project are twofold at a top level. One is driving capability. As Andy mentioned, the vast majority of our staff are historically SAS programmers, but we're moving into a space where we want to integrate R into the clinical reporting process, and to do that we need people to be able to program in R. People have developed their SAS skills over time. I've been in the industry for 10 or 11 years now, and virtually everything I've learned in SAS has been on the job: I had a challenge I needed to resolve, I asked questions, I looked things up, and I figured it out. It wasn't from sitting in classes, taking courses, and working on problem sets. And so we want to be able to build that capability the same way.
And secondly, we want to establish an environment. By environment, I don't mean the technical environment, although that, of course, is very important, but really making sure that people have the permission to engage in certain activities in the clinical reporting process, that we know what the ways of working are, and that someone coming in has the tools available to do that work. So when we hire someone in and, apples to apples, they look the same as a newly hired SAS programmer but happen to know R, which, as Andy mentioned, is what a lot of the folks coming out of university are arriving with, they can be just as productive. They don't have to go through additional steps of learning and additional training; they can be productive right out of the gate.
Along this process, we've been asking questions like: why are we doing this? What's the role of R versus SAS in our future? To Andy's point, that also brings in open source more broadly, but here we're focusing on R. My perspective in leading this project, and I feel I have the support of our senior management on this, is that it's about integrating R and SAS. It's not about flipping a switch from SAS to R; it's about how we integrate R into our processes. How are we doing this? What are the elements needed to declare our teams R-ready? And what's next?
This started with a project called R for QC, where we wanted to look at using R for the QC process. For those unfamiliar with a somewhat conventional QC process within clinical reporting: a production-side programmer creates the output, whether that's the data set or the table. Then an independent programmer creates code using the same inputs but through a different, independently developed process. If they get the same outcome at the end, that satisfies a significant level of the requirements from a validation perspective.
And so that's what we were looking at with the R for QC project. In parts one and two we focused on displays only, not data sets, and it was more of a proof-of-feasibility, proof-of-concept model. We only did about nine displays in part one; then we expanded the team, and in part two we were able to replicate, that is, doubly validate and compare against, the production-side SAS-generated displays using R as opposed to SAS. There were about 150-some displays in that proof of concept.
That opened the door for part three, where we sort of go live, which was one of our initial objectives: how do we open the door for all of our staff to be able to use R instead of SAS for something? That allows them to take any training that Andy's group is providing and use it on the job, so they can immediately go and do their normal, everyday work, but start to do it in the language of R as opposed to the language of SAS.
And so we had a number of components, which I'll go through, in order to get endorsement to do that. Part three has now been ongoing for about four or five months, and we continue to bring more and more people into the fold. At roughly the same time, we launched into part four, which is a proof of concept that moves us into data sets and ADaM data set mapping. These two projects have now come under the umbrella of a new and broader initiative within clinical programming on R adoption.
And there, I was tasked with: what does the roadmap to R look like for clinical programming? What do we need to do to get to the point where we're using R as much as SAS, or at least where they're interchangeable, or where we have a strategic direction? Before I could figure out a roadmap, which is when you're going to do things, I needed to know what we're going to do. What are the components of broad adoption of R within the department? That created six or seven work streams, which included the remaining elements of R for QC parts three and four, but also looked at things like central tools.
So in the SAS world, in the SAS workflow, we have, as many organisations do, our SAS macro library suite that people use. We don't have the same thing in R. There are packages out there, and they have useful functionality, but they're not written for the clinical data pipeline. So what can we do to make uptake easier for our SAS programmers moving into R? We're not asking them to program from first principles; we give them a set of tools that allows them to do the work. Andy's group, on that engineering side he was talking about, provided the initial packages that we've been using through the R for QC project.
There's also a work stream on training and support for our staff, of course. It's a massive organisation, and this is a big change in the way we do our work, so there's a change management and communications piece to this. There's outreach too. We're really driving this within the clinical programming department, but there's outreach outside of clinical programming, still within Biostats and still within GSK, and also external outreach, for example this type of platform, where we're sharing our knowledge and our experience. We're not the first ones to go through this, but we are carving our own path, and we do have different experiences. The more we share, the more we can pull the rest of the industry along with us, alongside the folks like Genentech and Roche.
And then, of course, we're always thinking: we're doing this on the QC side of things, but what does it mean to get R into production? Some of that falls within Andy's group, some within the tech group, and some within the work that we're doing. And there's always the question out there about submission with R-generated analyses and data.
So, the R for QC project, which I've talked about a little already and which Andy has also touched on. Why did we decide to use R in the R for QC project? We have a long-term objective within clinical programming to become programmatically multilingual. It's not driven by SAS licence costs. Andy already mentioned recruitment; I also think that flips to retention for the folks who have been within our group for a long time, because this offers new and exciting things they can do if they have an interest. But operationally, it also gives us flexibility: the ability to choose the right tool for the job at hand, and an expansion of the things we might deliver to our internal customers.
Why did we look at QC? It's a lower-risk entry point. That's what we thought from the beginning: QC programming tends to have lower regulatory scrutiny. It's not typically included in a submission package, so you can have QC code developed during the reporting workflow in R as opposed to SAS, and it's not going to be challenged as much. And we're still producing everything using SAS at the moment, so we're taking things produced in a language and a workflow that are commonly accepted, and we're simply performing our validation processes in R. And, as mentioned, it facilitates organic on-the-job upskilling, because we know that if I need to validate a summary of demographics, I can go and do that work in R within our RStudio environment instead of launching SAS.
Successes and challenges
Those first few parts, parts one and two on displays, and now part four, are all proofs of concept. We're really thinking about demonstrating the capability of using R for stage two QC (at GSK, independent validation is referred to as stage two QC), and, back to that earlier point, facilitating the on-the-job use of R for upskilling.
We had a number of successes with that. We had to figure out how to actually compare data frames: we can use PROC COMPARE in SAS, but how do you do that in R? We identified a couple of packages that could satisfy this, and we landed on the diffdf package, which has proven successful in our experience so far. We've had successful cross-functional collaborations, most specifically with Andy's group; that was needed because we didn't have the internal skill set and knowledge, so we had to work cross-functionally. I mentioned the central tools. It also gave us a better vision of our roadmap towards R, and we developed a significant amount of supporting documentation to help our staff.
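As a small sketch of how a diffdf comparison works in this double-programming setup (the data here is invented for illustration):

```r
library(diffdf)

# Production-side and QC-side versions of the same (illustrative) data set
prod <- data.frame(
  USUBJID = c("001", "002", "003"),
  AGE     = c(34, 57, 61)
)
qc <- data.frame(
  USUBJID = c("001", "002", "003"),
  AGE     = c(34, 57, 62)  # one deliberately discrepant value
)

# Compare by key, similar in spirit to PROC COMPARE with an ID statement;
# the printed report lists the differing variables, rows, and values
diffdf(base = prod, compare = qc, keys = "USUBJID")
```

A clean comparison prints "No issues were found!", while any discrepancy produces an issue report, which is what makes the package usable as a diagnostic tool rather than a simple pass/fail check.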
As for the challenges we faced, I would argue that one is onboarding into what I'd call the R ecosystem. That's the work environment; that's working with Git and GitHub; that's, as a programmer, having to worry about loading packages and installing packages and things like that. It's just different from SAS. So how do you take folks who have been using SAS for 3, 5, 10, 15, 20 years and say, now we want you to do this in R? How do you lower the entry cost so that uptake can be higher and adoption stronger?
PROC COMPARE and diffdf: diffdf seems to have been designed to be very similar to PROC COMPARE, but it has a different look and a different feel, so we had to work through how to use it as a diagnostic tool, not just an "everything compares" sort of result. Good programming practice in R is different too, and we're also trying to institute some sort of code review culture, which, and I don't know what it's like in every organisation, is not as strong as we would like within GSK. Within clinical reporting, I think that's primarily driven by the independence-of-QC culture, but we're trying to build up a code review mentality and good programming practice.
And because we're comparing SAS to R results, we also noticed that there are differences between SAS and R on some fundamental things, not that either is right or wrong. When we noticed differences in the compare output and dug into it, we realised that SAS was doing what SAS said it should be doing, and R was doing what R said it should be doing. So how do we reconcile that from a reporting standpoint, from an analysis standpoint? If our general expectation is that things should match exactly, and they don't, how do we sift through that? More importantly, how do we support our teams in navigating it? When you open the door and let loose 250 or 350 programmers to do this, they need to know: I now see this difference, what do I do about it?
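One concrete example of such a fundamental difference, which the talk doesn't name but which is well documented in both languages, is the rounding of halves: base R's round() follows the IEC 60559 "round half to even" convention, while SAS's ROUND() rounds halves away from zero.

```r
# Base R rounds halves to the nearest even digit (IEC 60559),
# whereas SAS ROUND() rounds halves away from zero.
round(0.5)  # 0 in R; SAS ROUND(0.5) returns 1
round(1.5)  # 2 in both
round(2.5)  # 2 in R; SAS ROUND(2.5) returns 3
```

Neither behaviour is wrong, but a QC comparison expecting exact matches will flag outputs whose rounded summary statistics fall on a half, so teams need guidance on which convention the reporting standard requires.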
Part three: going live with R for QC
That brought us to part three. Part three is where we go live with stage two, or independent, QC using R instead of SAS. You could think of running it like a pilot study, where you pick one study and say, we're going to validate everything on it using R. Instead of that model, we chose an organic, grassroots effort: any individual programmer, at the output level, can say, I've been assigned 25 different displays, so I'm going to choose these three and validate them using R instead of SAS. That's the opportunity to learn R on the job, taking what you've learned from the training coming out of Andy's group and using it for the work you typically do.
Part of the requirements was making sure that, of course, the leadership on the study was aware of what was happening, because, while we don't think it introduces a significant amount of risk, it's something leadership needs to know about from a resourcing and timelines perspective. We have mechanisms in place to help with ways of working and to provide visibility on what's happening. But really, it was about putting in place a model of encouragement from above. We had leadership in the team saying: you can do this, we allow you to do this, we know it's going to take you a little longer than it might in SAS because you're so fluent in SAS and just learning R, but we're going to provide all this support from below in terms of documentation and a support team to answer your questions.
Support mechanisms and documentation
We put together a quick-start guide to be a one-stop shop that links out to a number of pieces of internal documentation, and, on the left here, a guidance document that's pretty thorough. It goes through not just how to get started, but also what it means to develop code in RStudio, how to execute code, and some mini-vignettes walking through some of our standard display templates, such as a standard summary-of-AEs template that's very GSK-specific. It gives people sample code, things they can copy and paste and slightly tweak.
When I was learning SAS along the way, I learned by someone pointing me to a similar type of output in a different study. I would copy that code in and start to manipulate it to make it work for my purposes, so I wasn't starting from a blank sheet of paper every time. That's what we wanted to do here. The document on the left, built using the Bookdown package, was designed entirely to lower that transition cost for someone moving from SAS into R, in order to improve uptake and adoption. On the right here is a workplace page where GSK
