Resources

Open Source in Pharma | Harvey Lieberman | Data Science Hangout

To join future data science hangouts, add it to your calendar here: https://pos.it/dsh - All are welcome! We'd love to see you!

We were recently joined by Harvey Lieberman, Associate Director of Data Science at Novartis, to chat about R/Pharma, automating processes, career advice, and data science in drug discovery vs. development.

In this Hangout, Harvey talks about a lot of things, like the power of automating processes. He shares examples of how automating mundane tasks can save significant time and identify errors that humans might miss (we all know human error is a thing!). For instance, he automated the analysis of data from 48 Excel sheets that had previously taken a colleague about three months to process by hand; Harvey completed the automated analysis in one hour over lunch and found copying and pasting errors in the original manual process! Automating processes not only increases efficiency but can also help move people into more data-focused roles. Harvey suggests demonstrating that automation speeds things up and, most importantly, removes errors, which is when people start to pay attention and get interested.

Resources mentioned in the video and Zoom chat:
R/Pharma website → https://rinpharma.com/
Cecilia Baldoni's scrollytelling project (on shrews!) → https://cecibaldoni.github.io/projects.html
Advent of Code → https://adventofcode.com/
Pharmaverse.org (pharmaceutical R packages) → https://pharmaverse.org
GSK's Journey to R → https://www.youtube.com/watch?v=xDrt6txplek
Roche's Journey to R → https://www.youtube.com/watch?v=BlJNILSoZlM
R/Pharma March 2025 newsletter (LinkedIn) → https://www.linkedin.com/pulse/rpharma-march-2025-newsletter-open-source-in-pharma-wmf5c/
ggplot2 extenders club → https://ggplot2-extenders.github.io/ggplot-extension-club/
Coursera: Making Data Science Work for Clinical Reporting Course → https://www.coursera.org/learn/making-data-science-work-for-clinical-reporting
hiring.cafe (for finding R jobs) → https://hiring.cafe/
Posit's PydyTuesday GitHub → https://github.com/posit-dev/python-tidytuesday
Joy's Law (management concept) Wikipedia → https://en.wikipedia.org/wiki/Joy%27s_law_(management)

If you didn't join live, one great discussion you missed from the Zoom chat was about the diverse backgrounds of attendees. Many participants shared that they came to data science "sideways," holding degrees in fields such as sociology, psychology, mathematics, atmospheric science, education, history, chemistry, and various engineering disciplines, rather than traditional statistics or computational degrees. So many data scientists have non-traditional paths into the field! But we're all better together.

► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu

Follow Us Here:
Website: https://www.posit.co
Hangout: https://pos.it/dsh
LinkedIn: https://www.linkedin.com/company/posit-software
Bluesky: https://bsky.app/profile/posit.co

Thanks for hanging out with us!

May 14, 2025
54 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hey there! Welcome to the Posit Data Science Hangout. I'm Libby Heeren, and this is a recording of our weekly community call that happens every Thursday at 12 p.m. U.S. Eastern Time. If you are not joining us live, you miss out on the amazing chat that's going on, so find the link in the description where you can add our call to your calendar and come hang out with the most supportive, friendly, and funny data community you'll ever experience.

I am so excited to be joined by our featured leader today, Harvey Lieberman, Associate Director of Data Science at Novartis. Hello Harvey, how are you doing today?

I'm doing good, thanks Libby, thanks for having me.

Oh, we're so happy to have you here. Harvey, I would love it if you could introduce yourself, tell us a little bit about yourself and your background, and what you like to do for fun.

Sure, so my name is Harvey Lieberman. I work in the pharma industry, have done for 27 years now, it's a long time, and I'm a data scientist. I've worked in both the discovery setting and the development setting, and I'm sure we'll go into a little bit about that as we chat a little bit more. For fun, I guess I'm what you would call a little bit of a geek, because I like to do data science for fun as well as for work. I also have three kids and spend a lot of time with them, shuttling them around, doing different things, and so on.

About Novartis and Harvey's role

So Novartis is a big pharma company. I joined Novartis five years ago, and I'm working in a biostatistics group in early drug development. So the role of the group that I'm in is that, as drugs are moving towards the development phase and into development, which is where we're starting to potentially test them in humans for the first time, we get involved. We're, I would say, data science, biostatistics, and programming, doing all the statistical work to ensure that things are working, that things are safe in what is initially the first couple of phases of drug development, which are where we have a few patients, maybe up to tens or hundreds, a couple of hundred maybe.

But prior to my role here, I was also working in drug discovery, which is where we'll do a lot of work in trying to identify the molecules that will become medicines that will then move into development as well. And one last thing, there's an enormous attrition rate as you move, as you start in discovery, moving towards development. So most of the things that you work on, certainly from a discovery and early development perspective, most of the things that you work on don't make it to the market for one reason or another.

R/Pharma community

R/Pharma is essentially a community. We started in 2017. In 2017, our industry, the pharma industry, was at a point where people were starting to get interested in the use of open source. And just to caveat this, pharma traditionally has used SAS for statistical programming. And there was starting to be a movement to see, well, could R be used instead of SAS? And where could it be used? And would it be advantageous to move in these kinds of directions?

So several people from multiple companies were starting to dabble in this area. And there wasn't really a forum where people could connect, could talk about it, could share information. So a few people got together, partly through Phil Bowsher from Posit, which was then, back in the day, RStudio, who helped pull some like-minded people together. And we decided to put together a conference, the first one in 2018, so that people could share what they'd been working on.

And the first conference was at Harvard University. Harvard were really nice. They gave us a space that we could use. Pulled together, there were about 150 people for two days. And it went really well. And we repeated it in 2019. And then in 2020, we were looking to make it a little bit bigger, so that more people could come, more people could share. We then came across COVID, like everyone else, and pivoted to what has since then become an online, virtual conference.

We also do a face-to-face summit once a year. We try and tag this onto posit::conf. And really, what the group is, is it's just a large community of people who are all like-minded, want to work within open source, want to use open source within the industry. But also, I would say that it's not only people from the pharma and biotech industries. It's open to everyone.

Data science for fun

Yeah. Oh, God. So I just, I'm a big fan of R, of Python, of other languages. I'm teaching myself Go at the moment. So that's a fun experience. Really, it's just finding data from a day-to-day point of view and playing around with it. So we have things, obviously, things like TidyTuesday. When it comes to December, I always try and do as much of Advent of Code as possible. If anyone does that, that's a lot of fun.

Especially to learn a new language. I'm playing around with Gen AI, just like everyone else probably is, as much as possible. But trying to do it from playing with the APIs and from a programming perspective. So for example, it doesn't sound like fun, but one of the things I'm trying to do with the AI side, and I've seen this done by other people too, is to take all the information from my, from our, for example, bank and credit card statements, and give me an idea as to what we're spending and where, and how we can do better. And doing it using local LLMs, so nothing's going out.

Plus, just little snippets of code whenever I can. I try and push them out, I try and blog them out wherever possible, because I find that it ends up as a repository that I can go back to at any point to find what I did.

Transitioning from SAS to open source

I can speak probably from an, you know, somewhat from an industry perspective as opposed to an internal perspective, because we're, I would say we're going through it at the moment. Other companies have gone through it, and there are lots of companies within the industry at various stages of going through it.

I would say that within the pharma industry, if you look at some companies, for example, GSK, Roche have done a lot of work, and they've actually presented and put out a lot of work on the transition. One of the biggest challenges, I would say, rather than an infrastructure perspective, is a training perspective, especially when you've got people who have been using something like SAS for a long time. Just the thought process of how you would approach a problem coming from more of a more traditional programming language approach is quite different.

And also, when we're looking as an industry at moving from SAS to R, or transitioning SAS to R, I would say there's more to it than that. It's more transitioning from, as you put it, a monolithic type approach to something that is more of an open source approach, where we're not talking about just R, we're talking about maybe using Git, or something else for version control, and looking into where CI/CD can step in. So it's this concept of bringing people along who might be used to working in one way for a very long time, to what you might call more modern approaches.

And certainly, you know, the Git concept, or the GitLab, GitHub, Bitbucket concept, when you show someone how they can do it just with their own code, how they can manage it so much better, you tend to have people coming along for this journey.

Automating processes and Gen AI

Okay, so let me take a step back and give an example of an old one. Because I think we, many of us in data science, or working in data science, have probably had experiences like this. So I worked in a group where I was working on a biomarker analysis. And in this one example, I had a colleague come over from Germany to visit. And he was working in a completely different group. He was working with bioreactors. And he had a setup with an experiment with 48 bioreactors. And they collected a whole bunch of data over a period of time. And they analyzed it with an Excel workbook.

So he showed me an Excel workbook that had 48 sheets with data in. And the data did not start in the top left-hand corner. There were some notes coming down. And then you had tables of data for the first 48 sheets. And then the next 20 or 30 were charts that they'd made by hand. And it had taken someone about three months, I was told, to work on this.

So this was an example where I could take something and say, well, we can't get the raw data, but we can pull the data out from the Excel. We can put it together in a Shiny dashboard. And we can just look at all the graphs and look at all the charts. And two things happened. One is, it took me one hour to do this over lunch. And secondly, I found mistakes in their copying and pasting of formulae from cell to cell.

So as I say, it was about three months worth of work to get to this point, which could have been saved easily. And I think we've all had situations like this. And the nice thing about this is, it took someone who was a lab scientist. And actually, they went back and it turned them into a data scientist, because they started to learn R and Python to be able to do these things.
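As a rough illustration of the kind of extraction described here, the sketch below stacks the data tables from many sheets into one set of records, even when each table sits below some notes and is offset from the top left-hand corner. It is not Harvey's code (he describes working in R and Shiny); it is a minimal Python sketch that assumes each sheet has already been read into a list of rows by whatever spreadsheet reader is available. The sheet names, the `time` and `pH` columns, and the function names `find_header_row` and `stack_sheets` are all hypothetical.

```python
from typing import Dict, List, Sequence

Row = Sequence[object]


def find_header_row(rows: List[Row], first_header: str) -> int:
    """Return the index of the first row whose first non-empty cell is first_header."""
    for i, row in enumerate(rows):
        cells = [c for c in row if c not in (None, "")]
        if cells and str(cells[0]).strip() == first_header:
            return i
    raise ValueError(f"no header row starting with {first_header!r}")


def stack_sheets(sheets: Dict[str, List[Row]], first_header: str = "time") -> List[dict]:
    """Combine the data table from every sheet into one list of records."""
    records = []
    for name, rows in sheets.items():
        start = find_header_row(rows, first_header)
        # Remember which columns hold headers; the table may not start in column A.
        cols = [(j, str(c).strip()) for j, c in enumerate(rows[start])
                if c not in (None, "")]
        for row in rows[start + 1:]:
            values = {label: (row[j] if j < len(row) else None) for j, label in cols}
            if all(v in (None, "") for v in values.values()):
                break  # a blank row ends the table
            records.append({"sheet": name, **values})
    return records


# Hypothetical workbook content: notes first, then a table, possibly indented.
sheets = {
    "Reactor01": [
        ["Notes: run started late"],
        [],
        [None, "time", "pH"],
        [None, 0, 7.1],
        [None, 1, 7.0],
    ],
    "Reactor02": [
        ["Operator: A"],
        ["time", "pH"],
        [0, 6.9],
        [],
        ["chart data below"],
    ],
}
records = stack_sheets(sheets)
```

Once every sheet is in one table like `records`, plotting all 48 reactors or checking values across sheets becomes a single loop rather than sheet-by-sheet manual work.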

This is one of the many, many examples. I hope we get to others. But to your question, Lou, how is Gen AI going to do this? How is Gen AI going to contribute to this? I think it's going to contribute hugely. I think what it's going to do is, what we've had the opportunity to do is, as data scientists in the past, is to take some of this mundane stuff and replace it with an automated process that's much more efficient, much faster. With Gen AI, we now have tools at our disposal that are going to be able to do this even more efficiently.

So I know people, it goes back and forth in the conversation with Gen AI. Is it a replacement? I look at this more as a tool. This is another tool in our box that we'll be able to use. One of the nice things it will be able to do is, when you come to, say, a data automation task, we'll be able to put things in place using these kinds of tools that can check that what we've got in place is working as we expect it as well. So I see these kinds of tools really speeding up the process, really helping moving forward and becoming more robust.

I'd say within our company, and I'd say within the industry as a whole, people are starting to approach it. So there's caution. The pharma industry is somewhat risk averse anyway, as an industry. People are starting to use it more. It's being encouraged more to use on a day-to-day basis. And what we're seeing is that people are starting to use it to automate and to speed up some of their day-to-day work. So for example, something as simple as summarize a meeting so that I don't have to read through.

And I feel that as people start to get more and more on board with that, we're going to find more and more of a move through. I feel like the industry as a whole is going to be somewhat protective of throwing all of their data into this. And I would say that most companies, you know, the few that I know working in the area have something where you've got certain business rules in place. So do not put your data into something that's going out. But utilize these kinds of things to make your day-to-day work more effective, more efficient.

Building a data science portfolio

See, I've approached this as someone who has a broad knowledge of a lot of languages. So I would say I'm somewhat of a specialist in R. I do a lot of work in Python, but I've also done a lot of work in working with database systems, different types of databases, and things outside of the language itself. So for example, working with SQL or Mongo or other types of database technologies, you can somewhat stand out by having a broad knowledge across several different types of technologies. Even if you're not an expert in them, being able to utilize these within a language.

The other thing I'd say is that within our industry and within data science as a whole, contributing to open source projects is a huge thing to get your name out there. There are so many projects that you can either contribute to, or you can start up your own and put it out there and try and gain some traction on it. But just by starting, just taking on some issues, even just responding, even just creating an issue is something to start the ball rolling.

And as you do so, you find, we've got this community, we talked about the R/Pharma community, you start to become part of other communities. And one of the things I would say, having worked for many years now, is that being part of a community and knowing people within the community, having people to reach out to, is a really good way to start to get going in a career, or even if you're halfway through a career.

Drug discovery vs. drug development

I would say, so development is far more structured in its approach than discovery. My experience in discovery has always been you're certainly much freer to experiment with something new, to dive into something very different, completely new.

So one example of the type of approach would be that in drug development, we're very concerned about validation all the way through. Are you comfortable enough to say that you're very confident with every step of the way that you have documentation to back it up and we have very strict rules and SOPs as to how to move? In discovery, you're in a much more open environment. So just the example of a new R package comes out, it would be very easy to pull that into a discovery setting. It would be more challenging to pull that into a development setting unless you've run some internal testing, some internal validation to say that you and the company are happy for you to move that way.

I'd also say along the lines we touched on AI, I'd say that certainly if you look at the AI work that has been done, the discovery community took to it faster than the development community. Although there are some things going on within the development community that will be catching up to that.

Automating processes: making the case

Really, and I'll give another example of this. In a group that I used to be in at a previous company, we had some really, really smart people working in the lab and we could collect what's called mass spectrometry data. So it was high quality lab data. It would take about three days to collect the data and it would take about three weeks to analyze it. And the analysis took so long, partly because people were moving between seven pieces of software.

What you often find, certainly going back a little while, is that lab equipment manufacturers are very good at making high quality equipment and they're good at making software that can capture data, but they traditionally have not invested a huge amount in how you would analyze that data afterwards for specific experiments. So people were going through seven pieces of software, of which one was Excel in the middle, so data was copied to Excel, calculations were performed. It was then taken from there to something else.

So this upset me to say the least when I first saw it. And it didn't take long to write within spare time. This was actually within spare time to write a pipeline in R. And this was at the very start, this was really early on with Shiny. So Shiny was just coming out. I said, I can write something in one language, in one workflow that can do it. And it can automate a lot of what we're doing or what the group was doing by hand. And in fact, I actually saw, I was working in a different group. I saw what was going on here and I asked to join this group to help with this kind of approach. And we reduced the time, the analysis time from three weeks down to one hour.

Another group was collecting data, and what we wanted to do was automate the data preparation as well. So we wanted to use the hardware automation with robotics and the analysis automation at the end of it. And I had a lot of skeptics when we did this. But we did it as a proof of concept to show we can match the way it's being done by hand. So you had someone in the lab pipetting samples by hand, running the samples, and then doing the analysis by hand. And the data preparation by robot wasn't as fast as someone doing it by hand. But the point is, it could run overnight. It could run 24 hours.

And you then do the analysis afterwards. And I did this and we had thousands of samples and it would take about a week to analyze by hand. We did the analysis. We showed that our robot could pipette as accurately as a human for most of the samples, or for all of the samples. And for the analysis, most of them were within 1% of a human's result. Except for two samples, which seemed ridiculously out.

So I went back through all the calculations. I went back through all the analysis. And I couldn't find out what was wrong. And it was getting to the point where I was getting incredibly frustrated at myself for not being able to spot something that must have been an obvious bug. So I asked to see the raw data, because they'd given me the data from the machine that was part processed. And I looked at the raw data and it didn't match what I had. And then I looked at what they'd given me that was part processed. And they'd switched two of the names of the samples around. So it was a human error that had led to me not matching the results exactly. When we switched them back, it matched exactly.
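The comparison described here can be sketched in a few lines. The example below is an illustration only, not the pipeline from the story: it assumes the manual and automated results are available as dictionaries keyed by sample name, that reference values are non-zero, and that a 1% relative tolerance is the agreement threshold. The sample names, values, and function names are hypothetical. It flags samples that disagree and then checks whether any pair of flagged samples agrees once their labels are exchanged, which is the pattern a name swap would leave behind.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def relative_diff(reference: float, value: float) -> float:
    """Absolute difference as a fraction of the (non-zero) reference value."""
    return abs(value - reference) / abs(reference)


def find_outliers(manual: Dict[str, float], automated: Dict[str, float],
                  tolerance: float = 0.01) -> List[str]:
    """Samples whose automated result differs from the manual one by more than tolerance."""
    return [name for name in manual
            if relative_diff(manual[name], automated[name]) > tolerance]


def find_swapped_pairs(manual: Dict[str, float], automated: Dict[str, float],
                       tolerance: float = 0.01) -> List[Tuple[str, str]]:
    """Pairs of outlier samples that agree once their labels are exchanged."""
    outliers = find_outliers(manual, automated, tolerance)
    return [(a, b) for a, b in combinations(outliers, 2)
            if relative_diff(manual[a], automated[b]) <= tolerance
            and relative_diff(manual[b], automated[a]) <= tolerance]


# Hypothetical results: S2 and S3 have had their names switched in one data set.
manual = {"S1": 10.0, "S2": 20.0, "S3": 35.0, "S4": 50.0}
automated = {"S1": 10.05, "S2": 35.1, "S3": 19.9, "S4": 49.8}

outliers = find_outliers(manual, automated)
swaps = find_swapped_pairs(manual, automated)
```

A check like this only suggests where to look; confirming the swap still means going back to the raw data, as in the story above.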

And this is another human error that would have been carried through. So if you can show, A, that you can speed things up, and most importantly, that you can remove the errors that crop up day-to-day, if you can show that you've removed that error, then that is when people start to look and get interested. And you end up with, in my experience, two completely opposite ways of responding to this. You never get apathy. You either get, this is the most amazing thing I've ever seen in my life. We have to implement it immediately. Or this is the scariest thing I've ever seen in my life. And it's coming for my job. And I want to hide it.

Pivoting into pharma

There are, I would say, several online resources. There was a shout out to a group at Roche, who did clinical… a clinical data science course on Coursera that's available to take, which is a really great way in that explains some of the clinical data science and reporting side of things. That's certainly a good entry point to get a feel for some of what's out there.

I would also say that as well as the community that we've got, the R/Pharma one, and once again, everyone's welcome to come to all the workshops that we put on, which will help bring people up to speed, there's the Pharmaverse group, which is essentially a group of people, many of whom work in the pharma industry, developing R packages to try and solve common problems. That's a great organization to get involved in. And there's plenty of open source development work going on there, if you're looking to be able to contribute to packages within an environment and within a community that can help, you know, make the connections that you might need to get into the pharma space. Once again, it comes back to something that's cropped up before. It helps build your portfolio as well. It shows an interest in the space and will certainly help move things along.

Harvey's background and Joy's Law

My background is in chemistry, not in data science, not in computational. But I grew up in the 1980s when personal computers were coming out. So I'm the generation that really got stuck into personal computers when I was in my teens. And so it stuck with me all the way through the concept of always using coding to solve problems. And so every step through when I was in different types of chemistry and chemical development and biomarkers, it was always, well, can we write the code to do this for us? I want to be able to sit back and watch it happen. And believe it or not, my goal has, it's a crazy thing to say, my goal has always been can I automate myself out of a job and then move on to another one?

To the second point, the R/Pharma Group, it came about, as I mentioned, to really focus within the pharma industry, which was certainly at that time going through a transition and still is going through a transition. And it was always this idea of can we build a community of people working in the same way, coming across the same problems? And the thing that comes to mind, there's a law, I'm just trying to think of it, I think it's called Joy's Law, which was named after Bill Joy, who used to be the CEO of Sun Microsystems. And he said his law was something along the lines of the smartest people don't work for you, or the smartest people are not in the room. And it's a law based on the fact that no one company can have all of the smartest people.

Because knowledge is dispersed, knowledge by its very definition is dispersed. And it's also what's known as sticky. Because knowledge tends to stay in certain places. The group we tried to form and all the others, it's an attempt to take on this concept of, in a way of open source, which is, can we bring everything together? Can we share everything? Can everyone contribute?

Career advice

I would say, and I think this has come out probably a thousand times before, which is find what you really like doing and do it. It's really simple. In a previous position, I used to come in on a, for example, I used to come in on a Monday morning and I would see people looking so miserable. And I'd ask, why do you look so sad? And they'd complain, well, it's Monday morning. And whereas I'm coming in, I couldn't, sometimes I couldn't wait to get out of bed on a Monday and come back to do the thing that I love.

And this is why playing with data and coding and stuff is my passion outside of work too. And I'm sure a lot of people on the call probably feel the same way. We love what we do and just keep doing it until you don't enjoy it anymore and then find something else that you enjoy. So that would be, I guess it's not really career advice, but the way I look at it is this is where I'm spending most of my waking time. So you better be doing something that you really, really enjoy.
