Open Source in Pharma | Harvey Lieberman | Data Science Hangout

Transcript

This transcript was generated automatically and may contain errors.

Hey there! Welcome to the Posit Data Science Hangout. I'm Libby Heeren, and this is a recording of our weekly community call that happens every Thursday at 12 p.m. U.S. Eastern Time. If you are not joining us live, you miss out on the amazing chat that's going on, so find the link in the description where you can add our call to your calendar and come hang out with the most supportive, friendly, and funny data community you'll ever experience.

I am so excited to be joined by our featured leader today, Harvey Lieberman, Associate Director of Data Science at Novartis. Hello Harvey, how are you doing today?

I'm doing good, thanks Libby, thanks for having me.

Oh, we're so happy to have you here. Harvey, I would love it if you could introduce yourself, tell us a little bit about yourself and your background, and what you like to do for fun.

Sure, so my name is Harvey Lieberman. I work in the pharma industry, have done for 27 years now, it's a long time, and I'm a data scientist. I've worked in both the discovery setting and the development setting, and I'm sure we'll go into a little bit about that as we chat a little bit more. For fun, I guess I'm what you would call a little bit of a geek, because I like to do data science for fun as well as for work. I also have three kids and spend a lot of time with them, shuttling them around, doing different things, and so on.

Automating processes and Gen AI

I look at this more as a tool. This is another tool in our box that we'll be able to use.

I'd say within our company, and I'd say within the industry as a whole, people are starting to approach it. So there's caution. The pharma industry is somewhat risk averse anyway, as an industry. People are starting to use it more. It's being encouraged more to use on a day-to-day basis. And what we're seeing is that people are starting to use it to automate and to speed up some of their day-to-day work. So for example, something as simple as summarize a meeting so that I don't have to read through.

And I feel that as people start to get more and more on board with that, we're going to find more and more of a move through. I feel like the industry as a whole is going to be somewhat protective of throwing all of their data into this. And I would say that most companies, you know, the few that I know working in the area have something where you've got certain business rules in place. So do not put your data into something that's going out. But utilize these kinds of things to make your day-to-day work more effective, more efficient.

Building a data science portfolio

See, I've approached this as someone who has a broad knowledge of a lot of languages. So I would say I'm somewhat of a specialist in R. I do a lot of work in Python, but I've also done a lot of work in working with database systems, different types of databases, and things outside of the language itself. So for example, working with SQL or Mongo or other types of database technologies, you can somewhat stand out by having a broad knowledge across several different types of technologies. Even if you're not an expert in them, being able to utilize these within a language.

The other thing I'd say is that within our industry and within data science as a whole, contributing to open source projects is a huge thing to get your name out there. There are so many projects that you can either contribute to, or you can start up your own and put it out there and try and gain some traction on it. But just by starting, just taking on some issues, even just responding, even just creating an issue is something to start the ball rolling.

And as you do so, you find, we've got this community, we talked about the R/Pharma community, you start to become part of other communities. And one of the things I would say, having worked for many years now, is that being part of a community and knowing people within the community, having people to reach out to, is a really good way to start to get going in a career, or even if you're halfway through a career.

Drug discovery vs. drug development

I would say, so development is far more structured in its approach than discovery. My experience in discovery has always been you're certainly much freer to experiment with something new, to dive into something very different, completely new.

So one example of the type of approach would be that in drug development, we're very concerned about validation all the way through. Are you comfortable enough to say that you're very confident with every step of the way that you have documentation to back it up and we have very strict rules and SOPs as to how to move? In discovery, you're in a much more open environment. So just the example of a new R package comes out, it would be very easy to pull that into a discovery setting. It would be more challenging to pull that into a development setting unless you've run some internal testing, some internal validation to say that you and the company are happy for you to move that way.

I'd also say along the lines we touched on AI, I'd say that certainly if you look at the AI work that has been done, the discovery community took to it faster than the development community. Although there are some things going on within the development community that will be catching up to that.

Automating processes — making the case

Really, and I'll give another example of this. In a group that I used to be in at a previous company, we had some really, really smart people working in the lab and we could collect data called mass spectrometry data. So it was high quality lab data. It would take about three days to collect the data and it would take about three weeks to analyze it. And the analysis took so long, partly because people were moving between seven pieces of software.

What you often find, certainly going back a little while, is that the equipment manufacturers, lab equipment manufacturers, are very good at making high quality equipment and they're good at making software that can capture data, but they traditionally have not invested a huge amount in how you would analyze that data afterwards for specific experiments. So people would go through seven pieces of software, of which one was Excel in the middle, so data was copied to Excel, calculations were performed. It was then taken from there to something else.

So this upset me to say the least when I first saw it. And it didn't take long to write within spare time. This was actually within spare time to write a pipeline in R. And this was at the very start, this was really early on with Shiny. So Shiny was just coming out. I said, I can write something in one language, in one workflow that can do it. And it can automate a lot of what we're doing or what the group was doing by hand. And in fact, I actually saw, I was working in a different group. I saw what was going on here and I asked to join this group to help with this kind of approach. And we reduced the time, the analysis time from three weeks down to one hour.

Another group was collecting data, and what we wanted to do was automate the data preparation as well. So we wanted to use the hardware automation with robotics and the analysis automation at the end of it. And I had a lot of skeptics when we did this. But we did it as a proof of concept to show we can match the way it's being done by hand. So you had someone in the lab pipetting samples by hand, running the samples, and then doing the analysis by hand. And the data preparation by robot wasn't as fast as someone doing it by hand. But the point is, it could run overnight. It could run 24 hours.

And you then do the analysis afterwards. And I did this and we had thousands of samples and it would take about a week to analyze by hand. We did the analysis. We showed that for most of the samples, or for all of the samples, our robot could pipette as accurately as a human. And for the analysis, most of them were within less than 1% of a human. Except for two samples, which seemed ridiculously out.

So I went back through all the calculations. I went back through all the analysis. And I couldn't find out what was wrong. And it was getting to the point where I was getting incredibly frustrated at myself for not being able to spot something that must have been an obvious bug. So I asked to see the raw data, because they'd given me the data from the machine that was part processed. And I looked at the raw data and it didn't match what I had. And then I looked at what they'd given me that was part processed. And they'd switched two of the names of the samples around. So it was a human error that had led to me not matching the results exactly. When we switched them back, it matched exactly.
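The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline from the talk: the sample names and values are made up, the helper functions `flag_mismatches` and `find_swapped_pairs` are hypothetical, and only the 1% tolerance comes from the account above. It also assumes every manual value is non-zero.

```python
from itertools import combinations


def relative_diff(manual, automated):
    """Relative difference of the automated result from the manual one."""
    return abs(automated - manual) / abs(manual)


def flag_mismatches(manual, automated, tolerance=0.01):
    """Return sample names whose automated result differs by more than tolerance."""
    return [
        name
        for name in manual
        if relative_diff(manual[name], automated[name]) > tolerance
    ]


def find_swapped_pairs(manual, automated, flagged, tolerance=0.01):
    """Among flagged samples, find pairs that agree once their labels are exchanged."""
    pairs = []
    for a, b in combinations(flagged, 2):
        if (
            relative_diff(manual[a], automated[b]) <= tolerance
            and relative_diff(manual[b], automated[a]) <= tolerance
        ):
            pairs.append((a, b))
    return pairs


# Hypothetical data: the labels of S3 and S4 were exchanged in one of the data sets.
manual = {"S1": 10.0, "S2": 20.0, "S3": 55.0, "S4": 30.0}
automated = {"S1": 10.05, "S2": 19.9, "S3": 30.1, "S4": 54.8}

flagged = flag_mismatches(manual, automated)
swapped = find_swapped_pairs(manual, automated, flagged)
print(flagged)  # samples outside the 1% tolerance
print(swapped)  # flagged pairs that match after exchanging labels
```

A pair that only agrees after its labels are exchanged points to a labeling mistake rather than a fault in the analysis, which is the situation described above.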

And this is another human error that would have been carried through. So if you can show, A, that you can speed things up, and most importantly, that you can remove the errors that creep in day to day. If you can show that you've removed that error, then that is when people start to look and get interested. And you end up with two complete opposite, or in my experience, two complete opposite ways of responding to this. You never get apathy. You either get, this is the most amazing thing I've ever seen in my life. We have to implement it immediately. Or this is the scariest thing I've ever seen in my life. And it's coming for my job. And I want to hide it.


Pivoting into pharma

There are, I would say, several online resources. There was a shout-out to a group at Roche, who did clinical… a clinical data science course on Coursera that's available to take, which is a really great way in that explains some of the clinical data science and reporting side of things. That's certainly a good entry point to get a feel for some of what's out there.

I would also say that as well as the community that we've got, the R/Pharma one, and once again, everyone's welcome to come to all the workshops that we put on, which will help bring people up to speed, there's the Pharmaverse group, which is essentially a group of people, many of whom work in the pharma industry, developing R packages to try and solve common problems. That's a great organization to get involved in. And there's plenty of open source development work going on there, if you're looking to be able to contribute to packages within an environment and within a community that can help, you know, make the connections that you might need to get into the pharma space. Once again, it comes back to something that's cropped up before. It helps build your portfolio as well. It shows an interest in the space and will certainly help move things along.

Harvey's background and Joy's Law

My background is in chemistry, not in data science, not in computational. But I grew up in the 1980s when personal computers were coming out. So I'm the generation that really got stuck into personal computers when I was in my teens. And so it stuck with me all the way through the concept of always using coding to solve problems. And so every step through when I was in different types of chemistry and chemical development and biomarkers, it was always, well, can we write the code to do this for us? I want to be able to sit back and watch it happen. And believe it or not, my goal has, it's a crazy thing to say, my goal has always been can I automate myself out of a job and then move on to another one?

To the second point, the R/Pharma Group, it came about, as I mentioned, to really focus within the pharma industry, which was certainly at that time going through a transition and still is going through a transition. And it was always this idea of can we build a community of people working in the same way, coming across the same problems? And the thing that comes to mind, there's a law, I'm just trying to think of it, I think it's called Joy's Law, which was named after Bill Joy, who used to be the CEO of Sun Microsystems. And he said his law was something along the lines of the smartest people don't work for you, or the smartest people are not in the room. And it's a law based on the fact that no one company can have all of the smartest people.

Because knowledge is dispersed, knowledge by its very definition is dispersed. And it's also what's known as sticky. Because knowledge tends to stay in certain places. The group we tried to form and all the others, it's an attempt to take on this concept of, in a way of open source, which is, can we bring everything together? Can we share everything? Can everyone contribute?

Career advice

I would say, and I think this has come out probably a thousand times before, which is find what you really like doing and do it. It's really simple. In a previous position, for example, I used to come in on a Monday morning and I would see people looking so miserable. And I'd ask, why do you look so sad? And they'd complain, well, it's Monday morning. Whereas I'm coming in, and sometimes I couldn't wait to get out of bed on a Monday and come back to do the thing that I love.

And this is why playing with data and coding and stuff is my passion outside of work too. And I'm sure a lot of people on the call probably feel the same way. We love what we do and just keep doing it until you don't enjoy it anymore and then find something else that you enjoy. So that would be, I guess it's not really career advice, but the way I look at it is this is where I'm spending most of my waking time. So you better be doing something that you really, really enjoy.



Featured software

ggplot2

python-tidytuesday