
What R We Counting? (Ben Arancibia, GSK) | posit::conf(2025)
What R We Counting? Speaker(s): Ben Arancibia Abstract: GSK Biostatistics mandates that all new tools be written using open-source languages and that open-source code achieve parity with proprietary software by the end of 2025. Given this ambitious timeline, someone might ask, "How is the open-source adoption going?" This seemingly simple question involves complexities: What metrics do you track? How do you measure success? How do you show progress? GSK addressed these by leveraging our internal GitHub data, using open-source R packages like {gh}, scheduling our data pipeline on Posit Connect, and generating diverse reports/dashboards. We'll share our journey of transitioning to R and other open-source tools, offering insights on scaling full enterprise open-source adoption. posit::conf(2025) Subscribe to posit::conf updates: https://posit.co/about/subscription-management/
Transcript
This transcript was generated automatically and may contain errors.
All right. Well, thank you. As mentioned, I'm here to talk about what R we counting. I had to do it; being here at posit, I had to do a little pun.
So, at some point in your open source journey, someone is going to come to you with a question. Might be a vice president. It might be a CEO. It might be some other stakeholder. They're going to come and ask you a question. And that question is really, really, really simple. They're going to say, hey, how is open source adoption going? It's a really simple question.
But it is incredibly difficult to answer. And when I see questions like this, one thing pops in my head. And that is Tom Hanks in the movie Apollo 13 with his famous line, Houston, we have a problem.
Now, you might be thinking, Ben, why are you here telling us about this question? And you might also be asking, Ben, why are you wearing Beetlejuice-looking pants? The second one, I just like them. The first question, it's because I lead open source adoption at GSK R&D. This question happened to me. A VP came and asked me, hey, how is open source adoption going? And just like the hypothetical I gave you, Houston, we have a problem flashed in front of my eyes as I fumbled around with holistic measures and things like that, trying to answer how open source adoption was going.
Context: what biostatistics does and why this matters
So before I tell you how we actually solved that problem, how we answered that question, you need to know a couple of things. First, what does a biostatistics organization do? We do two things. One, we write statistical tests to make sure that our drugs are safe. And two, we write statistical tests to determine if our drugs are effective. That is the core of our business. That is what we do.
The other question or the other thing that you need to know is why did our VP ask us this question? Well, we've been on a journey for a long time. Since 2020, we've been putting all the pieces together for R adoption. We've been putting together the platform where we do our R work. We've been putting together the process. How do you do QC in R? We've been doing all the training. How do we move people from SAS to R? How do we actually build them up to be able to use R? We had to get the pipeline on board. It sounds simple, but it's really hard to convince a clinical trial to move and adopt different tools or a different programming language. And then finally, the tools. We had to put the tools in place and surround R and our different platforms and process to make people effective and efficient.
If you're interested in sort of the in-depth of how we did this, there's a link. You can search Posit GSK Enterprise Adoption on YouTube. We go into a really, really in-depth conversation about this. I would check it out.
So we've been putting these pieces in place since 2020. And as a result, in August of 2023, we had two commitments from our leadership. First, all central tools had to be built using open-source languages. And second, 50% of our code needs to be written in an open-source language by the end of 2025. It's now coming up on the end of 2025, so we're approaching the deadline.
Using GitHub APIs to measure adoption
So you have the context now. How do we actually go about answering our vice president's question? Again, similar to Apollo 13, we needed to figure out how to make a square peg fit into a round hole using just the tools that we had in our organization. For those not familiar, pharma is very highly regulated. As a result, we're pretty limited in what tools we can introduce into our environments, because we need to make sure quality is good and we can trust all the different outputs.
So what was the tool that we had to pick? How did we go about doing this? Well, the tool we ended up using is GitHub. But more specifically, the GitHub APIs. So one of the tools that we introduced as part of our pieces was GitHub. And we said all studies must use version control. That was our mandate. So that was how we kind of started to figure out answering this question.
Now, if you've ever been on a GitHub repo page, there's tons and tons and tons of information. The one I care most about is on the bottom right. It's called language statistics. And what it does is it calculates in a repo how much of a certain programming language is actually there in the default branch or the main branch. We care about that because that helps us to be able to say, all right, if all our clinical studies are in GitHub, can we then determine what is our adoption?
So we started to play around with it and we said, great, we love this language statistics measure. Is there an API, or are we going to have to web scrape every single GitHub repo page at GSK and pull this out the hard way? Luckily, there's an API endpoint. Everything that you see on a GitHub page in the browser is accessible via an API endpoint. It's great. Thank you, GitHub. It makes our lives a lot easier.
So we said, great, the API endpoint exists. How do we actually access it? Are we going to have to write some custom scripts? Is there a package available? Well, as is typical in the R community, and hopefully you've experienced this, ask and you shall receive. We realized the {gh} package, which already exists out there, can be used to write these queries. You might have interacted with the gh package when setting up your personal access tokens; it sits underneath usethis, I believe. So it's a great way to get started, and it's a great way to access the API. You can actually write incredibly complex queries with it.
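As a sketch of what that looks like in practice, the call below hits the same languages endpoint with {gh}. The owner/repo names are hypothetical, and the live API call is shown commented out because it needs network access and a configured GITHUB_PAT; the helper that computes the R share runs on any languages list.

```r
# Share of R bytes in a languages list, as returned by the endpoint
# (a named list of language -> bytes of code on the default branch).
r_share <- function(langs) {
  bytes <- unlist(langs)
  total <- sum(bytes)
  if (total == 0) return(NA_real_)
  r <- if ("R" %in% names(bytes)) bytes[["R"]] else 0
  r / total
}

# Live query (hypothetical owner/repo; needs a configured GITHUB_PAT):
#   library(gh)
#   langs <- gh("GET /repos/{owner}/{repo}/languages",
#               owner = "gsk-biostats", repo = "study-001")
#   r_share(langs)

# On a made-up response, mostly SAS with some R:
r_share(list(SAS = 340000, R = 120000))
```

The endpoint counts bytes of code per language on the default branch, which is exactly the number behind the language statistics bar on the repo page.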
Choosing what to measure
So we figured that out. And then what did we start doing? What you would expect. We started to explore and pull all that data through the API and started to pull a lot of different information. It actually became overwhelming. What do you actually start to measure?
So, do you count the number of outputs in a study? Do you count the number of production programs or QC programs using R? Do you count both? Is that double counting? Do you count the memory size of programs? Does that matter to you? Do you count the number of studies using R, no matter how much or how little? Do you care about the growth of R versus other languages? Do you time box it? This is everything that went through our minds as we looked at all the GitHub API metrics available to us. And we really hit the analysis paralysis stage, because it's genuinely hard to figure out what you need to measure to satisfy that vice president's question, even though the question is so simple.
So we decided to sit back and think about it a little bit. We channeled our inner Kevin Bacon, again in Apollo 13, and kept it nice and simple: basically just add, and do some really simple summary statistics. These are the four things that are important to us. One, relative R size: the R code size out of the total programming language footprint, which for our use case mostly means R versus SAS. Two, the total number of repos with any usage of R. Three, repos where R code is above a target percentage. And four, the growth of R. The reason we care about the growth of R is that we have huge legacy code bases, but if we can see R growing faster than other languages, that's good for us, because it tells us that at some point people will be able to eclipse our legacy code. That's the context you have to add when you start to compare relative R size.
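Concretely, those four summary statistics can be computed from per-repo byte counts with nothing fancier than sums. The repo names, byte counts, 50% target, and prior-week figure below are all made up for illustration.

```r
# Toy portfolio: per-repo byte counts pulled from the languages endpoint.
repos <- data.frame(
  repo        = c("study-001", "study-002", "study-003"),
  r_bytes     = c(90000, 0, 40000),
  total_bytes = c(100000, 250000, 200000)
)

# 1. Relative R size across the whole portfolio
relative_r <- sum(repos$r_bytes) / sum(repos$total_bytes)

# 2. Repos with any usage of R at all
repos_any_r <- sum(repos$r_bytes > 0)

# 3. Repos where R is above a target share (50% here)
repos_above_target <- sum(repos$r_bytes / repos$total_bytes > 0.5)

# 4. Growth: this snapshot's R bytes versus a prior snapshot (made up)
prev_r_bytes <- 110000
r_growth <- sum(repos$r_bytes) / prev_r_bytes - 1

c(relative_r = relative_r, any_r = repos_any_r,
  above_target = repos_above_target, growth = r_growth)
```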
The production pipeline
Great. So we took our metrics and we put it in an incredibly fancy dashboard. And you'll get to see what it looks like. Well, sort of. This is our dashboard, but obviously all the metrics removed. I can't show you those metrics. I'm sorry. But just imagine, you put some fancy metrics there and it looks really, really cool.
So we were able to do this as a one-off. How do you actually put this into production so it's updated every single week? This is how we did it. There's a lot going on, so I'm going to walk you through it. Basically we go from GitHub, the source, all the way to our analysis, and it's orchestrated by Posit Connect.
So, first step: we use the gh package to do an API query. We then take our raw data and dump it into Azure Data Lake Storage. We do this once a week, Sunday night, all automated and coordinated by Posit Connect. After that raw data storage job is done, we use some R scripts to do processing and transformation, adding in a little extra information from our internal systems. Once that's done, we dump it back into Azure Data Lake Storage; you want a copy of raw and a copy of transformed. Again, coordinated by Posit Connect. After that job is done, we kick off a job to do data visualization, Quarto dashboards, and Shiny dashboard deployment. So on Monday morning, if our VP cares, he or she can take a look at it and say, all right, how's that R adoption going?
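As a skeleton, the weekly job breaks into those stages, each scheduled in sequence on Posit Connect. The transform step below is runnable on a toy payload; the extract and load steps are shown as comments because they need network access, and the {AzureStor} calls in them are an assumption about how one might talk to Azure Data Lake Storage from R, not GSK's actual code.

```r
# Extract (scheduled Sunday night on Posit Connect), sketched:
#   raw <- lapply(study_repos, function(r) {
#     gh::gh("GET /repos/{owner}/{repo}/languages",
#            owner = "gsk-biostats", repo = r)
#   })
# Dump the raw copy to Azure Data Lake Storage, e.g. with {AzureStor}:
#   cont <- AzureStor::storage_container(endpoint, "raw")
#   AzureStor::storage_save_rds(raw, cont, "languages.rds")

# Transform: raw languages lists -> one tidy row per repo, ready for
# the Quarto/Shiny dashboards deployed by the next scheduled job.
transform_langs <- function(raw) {
  do.call(rbind, lapply(names(raw), function(repo) {
    bytes <- unlist(raw[[repo]])
    data.frame(
      repo        = repo,
      r_bytes     = if ("R" %in% names(bytes)) bytes[["R"]] else 0,
      total_bytes = sum(bytes)
    )
  }))
}

# Made-up raw payload with two study repos:
raw <- list(
  "study-001" = list(R = 90000, SAS = 10000),
  "study-002" = list(SAS = 250000)
)
transform_langs(raw)
```

Keeping the raw dump separate from the transformed copy means the transform can be rerun or corrected later without hitting the API again.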
We also do a little predictive analytics and ML. That's more in the experimental phase and a lot smaller than the rest of what we do, but we like to throw it in there so we can say we're data scientists. Again, it's all coordinated by Posit Connect, which is a really great feature.
Expanding beyond adoption metrics
So Kevin Bacon appears again. We started working with our VP, showing what we could do in terms of answering that question, how is open source adoption going, as well as all the automation we put into place. And our VP loved it. Oh, man, that's incredible. But we became victims of our own success, because our VP went from Kevin Bacon to Ed Harris. He or she no longer really cared about what we had put together and seeing the current state. They cared about what more we could actually do with it.
Something I didn't tell you in the beginning, I'll tell you now. We pulled all the R metrics and language statistics information, but we also pulled a ton of other data: things like commits, pull requests, repo life, repo pull request reviewers, irregular commits, unmerged pull requests, and package names. We did that because we wanted to experiment with some other dashboards and answer, preemptively, other questions that might come up. So there are two other projects that came out of this. They're a bit more in the innovation and experimental phase, so less robust than the open source adoption work, but I'll tell you about them now.
One project that I'm very proud of, that I think is very cool, is GitHub health. If you have ever tried to get people to use GitHub in an organization, it is incredibly difficult. It's a steep learning curve and it's tough. The key thing about support is how you go from reactive support to proactive support. So the question we tried to solve is: can you proactively find studies that are struggling with GitHub and support them? And we can. There are a lot of different metrics you can pull, like the number of merge conflicts that haven't been resolved, the number of commits, the number of branches, things like that, that indicate a study is struggling. And you can proactively reach out and say, hey, do you need some help? They're very appreciative of that. So that's something I think is very cool, something I'm very proud of.
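A toy version of that proactive flag might look like the following. The signals are the ones mentioned above (unresolved merge conflicts, commit counts, branch counts), but the thresholds and numbers are invented for illustration; they are not GSK's actual rules.

```r
# Made-up per-study GitHub activity signals.
health <- data.frame(
  repo                 = c("study-A", "study-B", "study-C"),
  unresolved_conflicts = c(0, 4, 1),
  commits_last_30d     = c(25, 2, 14),
  stale_branches       = c(1, 7, 2)
)

# Flag a study for proactive outreach if any signal looks unhealthy
# (illustrative thresholds).
health$needs_outreach <-
  health$unresolved_conflicts >= 3 |
  health$commits_last_30d < 5 |
  health$stale_branches >= 5

# Which studies should the support team reach out to?
health$repo[health$needs_outreach]
```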
The other thing is the tool catalog. Another question a VP is going to ask you is around finance: return on investment. It is a dreaded question, but it's one you need to be prepared to answer. We spend a lot of time building tools; do our users actually use them? That question is going to come up. So what you'll have to answer is which tools are actually being used, and what's our return on investment for either internal tools or open source tools. Through the GitHub API and some regex, you can pull package names out of scripts pretty easily and get an idea of what is actually being used. For example, one of the things we use a lot at GSK is the admiral package. You could argue we should invest more time in the admiral package because we use it everywhere. This type of dashboarded data lets us answer those really important finance questions, because they're going to come for you at some point during that open source conversation.
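As a sketch of that package-name extraction: the regex below pulls library(), require(), and pkg:: references out of script text, as one might do file-by-file on content fetched via the GitHub API. The script lines, including the admiral call, are made-up examples, not code from a GSK repo.

```r
# Toy script contents.
script <- c(
  "library(admiral)",
  "require(dplyr)",
  "adsl <- admiral::derive_vars_merged(adsl, dataset_add = ex)"
)

# Match library(pkg), require(pkg), and pkg:: usages. Package names
# may contain only letters, digits, and dots.
pattern <- "(?:library|require)\\(([A-Za-z][A-Za-z0-9.]*)\\)|([A-Za-z][A-Za-z0-9.]*)::"
hits <- regmatches(script, gregexpr(pattern, script, perl = TRUE))

# Strip the wrappers, leaving just the unique package names.
pkgs <- unique(gsub("library\\(|require\\(|\\)|::", "", unlist(hits)))
pkgs
```

Tallying these names across every study repo is what turns "do our users actually use it?" into a number.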
So one of the things I think is really cool about this work through the GitHub API is that we were able to move our stakeholder, our VP, from "hey, how is open source adoption going?" to "I know exactly how adoption is going." And that feels really good. Probably not as good as mission control felt when Apollo 13 landed back on Earth, but close. And one thing I want to highlight here: in mission control, there are a lot of people. There's a team. Just like that, at GSK, we have a team that worked on this, and I want to thank them. Becca, who's in the audience. Hi, Becca. Alana, Aladri, Hamza, and Hashir. It's really crucial that we have a team to be able to do this, so I really want to highlight that. And that's it. I'm happy to take any questions you might have about open source adoption, or if you want to connect on LinkedIn, go for it. I feel like you have to put that on there. But, yeah, I'm an open book. So thank you.
Q&A
Okay. Thank you, Ben. Great presentation. We have a few questions that have come in from the audience. So the first one is from Scott. And Scott asks, how did you come up with the two specific goals, all open source central tools and 50% code written in open source?
The way we came up with those commitments is that there's a big transition in pharma toward open source. We prefer it in terms of the workforce coming out of schools, especially stats programs, and we think it provides us more innovation. And honestly, some of those proprietary tools are really expensive. So we made those commitments for financial as well as, I guess you'd call them, holistic reasons.
I think choosing targets is hard, but why 50%?
It sounded good at the time. We had 50% as sort of like the base, and then we have a stretch of 70%. We were just like, that sounds good. We'll put our finger in the wind and see how it goes over two years. So the beauty about goals is you can always change them later on if it's not going well or if it's going really well.
Okay. The next one is from anonymous. What are the main reasons practitioners resisted adopting open source?
What is the main reason? Well, it's hard. If you've been coding in a language for 20 plus years and you're told, hey, you can't use this tool anymore, you have to think about data in a different way, you have to stop using macros and start thinking about packages. It's hard, and I don't think we should overlook how hard it is for people. I think we have to have a lot of empathy. Even if you've only been doing your job for five years, it's hard to change, especially with really tight deadlines. In the clinical trial projects, we have really, really insane deadlines. So it's just hard. I think we have to have a lot of empathy.
Okay. This is an interesting question from anonymous again.
Did closed source vendors try to push back when they saw how things were going?
Of course. Depends on who in the closed source vendors saw it, whether it's sales, technical, whatever. But, yeah, of course. I mean, it's a business, you know? Like, that's just how it is. But I think you just have to have backing from your leadership, and you have to have the desire to want to have those tough conversations. I would say if you're not willing to have that tough conversation, maybe you are a little bit too early in your open source adoption journey. But you're going to have to have it at some point. But I feel like that could be a totally different talk.
Okay. Let's do one more question. This is from Raphael. Was there already a culture of versioning code with GitHub before R adoption? If not, are you measuring all the old/new SAS and new R code that exists elsewhere?
I mean, you could argue maybe there's not a culture now. It's tough. Let me be very frank. GitHub is really, really difficult, because you are taking people from a paradigm where they save their code and don't want to share it unless it's 100% totally done. With GitHub, obviously, we want to commit frequently, almost like an autosave function. So it's very, very difficult to get people to make that transition. Before GitHub, we did everything on file shares; think of OneDrive, essentially, for studies. I would say our version control is very good now, in the sense that people are pushing. It's not perfect; we still have a lot of bumps along the way. But that's just how it is, I think, when you need to teach a big organization GitHub.
Okay. Well, thank you, Ben. Really appreciate it. One more round of applause.
