Resources

Succeed in the Life Sciences with R/Python and the Cloud - posit::conf(2023)

Presented by Colby Ford.

This talk covers best practices and lessons learned surrounding the use of R and Python by technical teams in the cloud, focusing on Posit Workbench, Azure ML, and Databricks. In the life sciences, whether it's pharma, biotech, research, or another type of organization, we are unique in that we blend scientific knowledge with technical skills to extract insights from large, complex datasets. In the cloud, we can architect solutions to help us scale, automate, and collaborate. Interestingly, the use of R and Python by bioinformatics, genomics, biostatistics, and data science teams can be challenging in a cloud-first world where all the data is somewhere other than your laptop (like a data lake). In this talk, I will share best practices and lessons learned surrounding the use of R and Python by technical teams in the cloud. We'll focus on the use of Posit Workbench and RStudio on various cloud services such as Azure ML and Databricks.

Tuple, The Cloud Genomics Company: https://tuple.xyz

Presented at the Posit Conference, Sept 19-20, 2023. Learn more at posit.co/conference. Talk Track: Pharma. Session Code: TALK-1069


Transcript#

This transcript was generated automatically and may contain errors.

Hello everyone, and good afternoon. Welcome to my talk, Succeeding in the Life Sciences with R, Python, and the Cloud. Before we get started, I want to give a few caveats and considerations. Everything in this talk is public, and opinions are my own, not my clients' or employers'. This talk references many Microsoft tools, but before you get up and run away: if you're an AWS shop, almost everything I talk about is going to be applicable if you use GCP, AWS, or something else. This content spans scientific, technical, academic, and industry experience, so depending on where you're coming from, a lot of this is applicable across the life sciences space. Don't worry if you're in academia or at a pharma company; we've got something for you.

A little bit about me. My name is Colby, again. I am a computational biologist and cloud AI architect. I'm the founder and principal consultant at Tuple. We're a Microsoft partner, and we're the only Microsoft partner that exclusively focuses on the genomics space. In addition, I am the co-founder of Amiso, an NIH- and NSF-backed startup company that enables remote patient monitoring using the Apple Watch. I'm a visiting scholar at the CIPHER Research Center at UNC Charlotte, which is an infectious disease research center, and I'm the author of Genomics in the Azure Cloud. I put some promotional copies of Genomics in the Azure Cloud up here at the front. They're free. I would ask that you only take them if you don't work for a multi-billion dollar corporation that can buy one. And if we run out, come find me somewhere tonight; I have more in my room. In addition, I'm a Microsoft MVP for Azure, and I'm a Microsoft Certified Trainer.

So in this talk, we'll talk about data storage and organization, scaling analyses with cloud compute, and then I'll leave you with a case study from one of my clients.

Why move to the cloud

Before we get started, I wanted to say a few things for those of you who are not in the cloud today. If you work for a pharma company or in academia and you're doing everything on-prem, what are some reasons that you might actually want to do something in the cloud? One of those reasons is scalability. In the cloud, I can elastically spin up virtual machines or clusters to meet some compute need and then spin them down once I'm done. The cloud also enables us to deploy a large data repository, some of which we'll talk about today, and this allows us to dynamically scale and grow our storage without having to worry about, oh, I've got to spin up another database, or I'm running out of hard drive space in the HPC cluster. The cloud natively offers a lot of things that are collaborative. It really fosters collaboration across teams, because once we get our data in one place and use shared workspaces or shared compute solutions, collaboration just naturally flows. And because all of these tools are in the cloud and can run without a lot of maintenance, the cloud also allows for a lot of automation, meaning we can automate very complex or long-running pipelines. This is super important in the bioinformatics space, where maybe we have a pipeline that runs for hours and we want to automate it so that our bioinformatics people can do something a little more interesting. In addition, since we're in a very highly regulated industry, the cloud offers a lot of security out of the box and also offers blueprints for making sure that you meet certain regulatory and compliance requirements. But it is not idiot-proof.

Data storage and organization

So first, data storage and organization. I'm going to start with a little metaphor, which is unlike me if you've ever heard me talk. How many of you, show of hands, in your bedroom have like a lovely little accent chair sitting in the corner? Seeing some hands. But I think some of you are liars because if you're like me, the chair actually looks something like this, right? Yeah, I got some more hands, right? And this is actually a good representation of what my OneDrive looks like or Google Drive or Dropbox because for some reason I have all three, right? And it's a mess. So thank God for the search box because otherwise I can't find my socks, metaphorically speaking.

So what we do is we want to really start figuring out how to actually organize that data rather than just throwing it somewhere. So metaphorically speaking, let's take a trip down to Target or Ikea, buy a little cheap shelf. This is our storage account, our data lake. And we can start putting things on that shelf in the places that they're supposed to go. So now once everything's organized, if I need a pair of jeans and some Converse, I know exactly where to get them. And this is really where we want to get to from a data lake perspective, right?

On this next slide, I won't go through all of these levels, but this is an actual slide from a client workshop where we really started talking about how to organize data so that you can be successful with a data lake across the lots of weird types of data we see in the life sciences. The pro tip here is I always start by defining the smallest unit of work. In the bioinformatics world, we oftentimes start with a set of sequences or a single sample. If we start from that level and work our way up in a hierarchical manner, we end up with a nice organizational scheme by which we can place data as we get it. This may seem a little backwards, because right now, in your Google Drive or OneDrive, you just kind of throw stuff in as you get new things and hopefully make a folder or whatever. But in the cloud, if we can think of it the other way and plan ahead, it really will help us when we start needing to do something with that data.

So this client is a biomanufacturing institute, and they develop stem cells for type 1 diabetes treatment. In their manufacturing line, they capture exomes for these stem cells. They also capture pathology slides, phenotypes, lab data, et cetera. You can see we can organize by those individual data types. But also, kind of planning for the future, this scheme lets them expand to other studies, other clients, and other projects, both internal and external. So this is something we would do at your organization: figure out everything you want to house in your data lake.
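To make that hierarchical scheme concrete, here's a minimal Python sketch. The project, study, and data-type names are hypothetical, not the client's actual layout; the point is only that every file's location is derived from the same hierarchy, starting at the smallest unit of work.

```python
from pathlib import PurePosixPath

def lake_path(project: str, study: str, data_type: str,
              sample_id: str, filename: str) -> str:
    """Build a consistent data lake path, working up from the smallest
    unit of work (a single sample) through data type, study, and project."""
    return str(PurePosixPath("datalake", project, study, data_type,
                             sample_id, filename))

# A few heterogeneous datasets filed under the same scheme:
print(lake_path("internal", "study-b", "rnaseq", "sample-001", "quant.tsv"))
print(lake_path("internal", "study-b", "pathology", "sample-001", "slide-01.tiff"))
print(lake_path("client-x", "study-a", "exomes", "sample-042", "sample-042.vcf.gz"))
```

Because every file lands in a predictable place, downstream tools can find data by pattern instead of by tribal knowledge.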

And just a little note here: earlier this year, Microsoft released OneLake, which is a unification system that's supposed to break down the silos in data lakes. What was previously happening is people would create storage accounts or data lakes, and then someone needed access to one and wasn't getting it, or IT was walling it off, so they'd just go and create another one. Originally, the data lake was supposed to solve data silos in databases, but now we just have data lakes that are also siloed off. OneLake is supposed to solve that problem. Will it? Who knows? But the goal of OneLake is that across the entire organization, we share a certain amount of compute that allows us to communicate with different workspaces in the OneLake. So in this example here, we might have a research workspace that's preclinical, and at a pharma company, we might have an omics workspace that is clinical, and those two can't see each other, and different people use those different parts, but at least they can be managed more easily from an IT perspective, right? Keep the IT people happy. And in addition to this shared compute, we also have the concept of one security, meaning through our data warehouse or things like Power BI, I can set a security protocol, and it flows all the way from the output through to the actual core data itself. So it makes things easier through and through, rather than having to redefine security on each of these different services, which is a pain. And another thing that's really interesting is that they're all in on the Parquet format. If you're unfamiliar with Parquet, it's a columnar file format that's really common in the Spark world and lets us read and write data from data lakes in a distributed manner. So as long as you have data in Parquet format, you can read it from any of the services that you see here at the top, in addition to Databricks.

So why does the organization actually matter, though? Besides saving you from turning gray by trying to find some file that you threw in six months ago, the reason why it matters is all about scalable queries. So using services like Azure Synapse or Databricks, we can query across our data lake and answer more complex questions in a much easier fashion. So if I take this pink box here as an example, how many of the samples in study B have gene ERBB2 expression greater than 10 transcripts per million across RNA-seq analyses? This is an actual question that clients ask me, right? And so are these others. Traditionally, if this was all on-prem, we would have to go maybe to a network file share, go grab a bunch of files, load them up in R or Python, filter them, and then produce this output. It's the data retrieval part that's the problem. And so using this kind of pathing that you see here at the bottom, you can see my data lake is organized in a certain way where I can just drop in some wild cards and grab everything that we need to meet that query need. And you'll see a little bit more of this in the case study.
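As a rough illustration of that wildcard-style retrieval, here's a Python sketch using the standard library's `fnmatch` against an in-memory list of hypothetical paths, standing in for a real Synapse or Databricks query over the lake:

```python
from fnmatch import fnmatch

# Hypothetical lake paths following one consistent scheme:
#   <project>/<study>/<data_type>/<sample_id>/<file>
lake = [
    "internal/study-a/rnaseq/sample-001/quant.tsv",
    "internal/study-b/rnaseq/sample-010/quant.tsv",
    "internal/study-b/rnaseq/sample-011/quant.tsv",
    "internal/study-b/pathology/sample-010/slide.tiff",
]

# One wildcard pattern grabs every RNA-seq result in study B,
# with no manual hunting through folders or file shares.
pattern = "internal/study-b/rnaseq/*/quant.tsv"
hits = [p for p in lake if fnmatch(p, pattern)]
print(hits)
```

The expensive part of the question, data retrieval, collapses into one pattern; the expression filtering would then run over just those files.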


While I won't go through this entire example architecture, because you'll see a thousand architectures within the next two days, one thing I want to point out is that the cloud storage is really sitting at the core of the entire successful architecture. It serves as the landing place for all of these heterogeneous datasets coming in, whether it's electronic medical records if you're at an academic medical center, or lab results, omics data, or image data from MRIs. It's the single place where we still have to think about where the data goes, but we don't have to worry about the file type very much. We also don't have to worry about running out of space. The data lake then serves as the source for more advanced analytics like data warehousing, or maybe even data science and AI workloads. We saw in the LLM keynote earlier today that in order to build your own LLM, you kind of need your own data, and it needs to be accessible and organized. This is a big deal, right? It's a major requirement there. And the data lake is also the source that serves operational databases and other front-end outputs like apps, Power BI or Tableau dashboards, or other third-party solutions.

Scaling analyses with cloud compute

So once we have all of our data in the right place and it's organized, we're good to go; now we want to scale analyses with cloud compute. But you may be wondering: all right, well, I've got this laptop. It's cool. I paid like $3,000 for this thing. Why would I use the cloud? I could just use my laptop. It works. That's true, but you'll hit a wall at some point with your local machine. So why cloud compute? Again, elastic scalability. We can dynamically scale virtual machines, clusters, whatever we need, up or back, and then spin them down when we're done, saving on costs. We can also orchestrate complex pipelines and automate repetitive tasks, which is really common in the bioinformatics world where maybe we have some QC tasks or tasks where we want to process sequence data, and it's the same pipeline over and over again. Why do we need to pay someone to click that run button when we could just automate those tasks? And especially for long-running analyses, like a bioinformatics pipeline that might take four hours, we can scale using cluster services like Kubernetes so that instead of taking four hours to run one sample, it's four hours to run 50 samples at the same time. All of these tools also enable analytical collaboration, especially things like Databricks and Azure Machine Learning where you're all working in the same workspace. You're sharing the same code, and you're connecting to the same single source of truth for the data. So it really helps to avoid the situation where everyone has their own version or their own environment that's weird and wonky and doesn't match. And all of these things foster much better scientific reproducibility, which is super important.
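A minimal local sketch of that fan-out, using Python's `concurrent.futures`. Here `run_pipeline` is a stand-in for a real pipeline, and on an actual cluster each sample would land on its own node rather than a local thread, but the shape of the code is the same:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(sample_id: str) -> str:
    """Stand-in for a long-running bioinformatics pipeline
    (QC, alignment, variant calling, and so on)."""
    return f"{sample_id}: done"

samples = [f"sample-{i:03d}" for i in range(1, 51)]

# Fan the 50 samples out across workers instead of running them one by one.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_pipeline, samples))

print(len(results))
```

The wall-clock win is that total runtime approaches the time of the slowest batch, not the sum of all 50 runs.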

So I'll take a little bit of time to talk about running R and Python IDEs on Azure Machine Learning and Azure Databricks. And again, don't run away if you're an AWS person: if you use Databricks on AWS or GCP, this also applies. The only thing that won't apply is the Azure Machine Learning part. Inside of Azure Machine Learning, if you've used it before, you can create a compute instance. For those of you who are not familiar, this is like a managed virtual machine that comes pre-installed with lots of data science packages, both in R and Python. But we can also go into the advanced settings and create a new application using a Docker image. So if you already have licenses for Posit Workbench, you can drop your Posit Workbench license in here. Alternatively, you can use an open-source image like the Rocker RStudio image. And once the compute instance is up and running, you'll be able to see Posit Workbench or RStudio by clicking on the little ellipsis. This serves that IDE up inside of Azure Machine Learning, in addition to JupyterLab, Jupyter, VS Code, and Terminal, which are already enabled on the workstation.

So what's already included on the Azure Machine Learning compute instances? As of the end of last month, these are the versions of Python and R that are pre-installed. Not a whole lot of options or flexibility there. And these are the versions of JupyterLab and VS Code, which are notably a little bit older. With the latest Posit Workbench image, that's the jammy (Ubuntu 22.04) image, you get a bit more control. You can pick different versions, but the latest versions are Python 3.10 and R 4.2.3. And you can see that, in addition, you get RStudio Pro as well, and a considerably newer version of VS Code.

In Azure Databricks, you can also create a similar instance. We can create a machine learning cluster, that is, a cluster with the machine learning runtime. This ML runtime also includes lots of packages that we already use in the data science world, and certain versions also include RStudio Server. So as long as you follow the directions and create the cluster with the ML runtime, you can go under Apps and click the button that says Set up RStudio Server, and it'll open in a new tab. There are two big reasons for this. One, even in the RStudio that opens in your new tab, you'll still have the Spark back end. So if you want to run SparkR or sparklyr workloads, no big deal, good to go. And two, inside of Databricks, we can mount our data lake. Then when we're writing R code, we're just pointing to a file path. No longer do you have to download from the data lake or from some weird place and then reference a local path. It's a mounted storage location, so it feels like a normal local path anyway.

If you're a Python person, or even if you just like notebook-style environments, you can also use the Databricks notebook. These notebooks are polyglot, meaning that you can switch between R, Python, Scala, SQL, Markdown, Shell, Java, and some others, even all in the same notebook. So in this example that I'm showing here, I'm reading in some drug sensitivity data from my data lake; it's all in mounted storage. And you can see I have pulled that drug sensitivity data into a PySpark data frame. Then later down in the notebook, over here on the right, I'm doing the exact same thing in R. So I'm able to switch back and forth. And also, these notebooks are collaborative, so you and your colleague could be in the same notebook at the same time, running different cells in different languages. It's pretty cool, because we all know that there are certain languages that do things better for a certain use case, because a package exists, right? And it's not always R, and it's not always Python. So you can switch back and forth pretty easily.

Case study: Foundation 101 Genomes

So I'll end with a case study. I have a client called Foundation 101 Genomes. They're a Belgian rare disease research foundation. The founders started this organization when they found out that their son has Marfan syndrome. For those of you who are not familiar, Marfan syndrome is a connective tissue disorder that has two different presentations: you're either kind of OK, you're just tall and lanky, or you have severe cardiovascular issues that can result in death in adolescence. With this foundation, what they do is pull in lots of research data and lab data from patient-generated samples, from participants in their study. They also have multi-omics data coming in, and then lots of MRI and CT images of patients with the cardiovascular form of Marfan syndrome. So our solution was a data-centric design that allows us to query that data in Azure Synapse or Azure Databricks and also serve that data up in other applications, for example, the DICOM API that you see here on the slide.

So I think the big aha moment for them was when their research lab, a wet lab, found in mice that there is a specific mutation in another gene, which I can't release yet, that is really related to the cardiovascular, the worst version of Marfan syndrome. And so then the question was: well, do we see that in the patients that we have in our cohort? Within about 30 minutes in Databricks, I was able to query all of the VCF files, the variant data, filter to the gene and/or region of interest, and then pivot by variant and look at this region to see who is homozygous or heterozygous or has the wild type for that particular mutation or set of mutations. And then we overlaid that information to see if it breaks down between controls, people with Marfan syndrome, and people with Marfan syndrome who have the cardiovascular disease. There'll be a paper coming out on this, hopefully later this year. So this is kind of the big aha moment, because being able to organize your data in a way that makes it accessible by these very large compute services means that we can answer questions very quickly and then hopefully, eventually, cure a major disease.
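The shape of that zygosity query can be sketched in plain Python. The variant IDs and genotypes below are entirely made up (the real gene hasn't been released), and the actual analysis ran over variant data in Databricks rather than an in-memory list:

```python
from collections import defaultdict

# Hypothetical records, as if already filtered from variant files in the
# data lake down to one region of interest: (sample, variant, genotype)
records = [
    ("participant-1", "chrN:1001A>G", "0/1"),
    ("participant-2", "chrN:1001A>G", "1/1"),
    ("participant-3", "chrN:1001A>G", "0/0"),
]

def zygosity(gt: str) -> str:
    """Classify a diploid genotype string like '0/1'."""
    a, b = gt.split("/")
    if a == b == "0":
        return "wild type"
    return "homozygous" if a == b else "heterozygous"

# Pivot: one row per sample, one column per variant, zygosity as the value.
table = defaultdict(dict)
for sample, variant, gt in records:
    table[sample][variant] = zygosity(gt)

for sample, row in sorted(table.items()):
    print(sample, row)
```

Overlaying cohort labels (control, Marfan, Marfan with cardiovascular disease) on this pivoted table is then a simple join rather than a file-wrangling exercise.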


So in summary: running workloads in the cloud offers scalability, flexibility, and automation. Data lakes offer a single source for your data, but it needs to be organized to be useful. It can't be that chair, right? We can't just throw some socks on there and promise ourselves we're going to fold them later. Connected compute services allow you to retrieve that data, and if we do it right, those sources really feel like a local compute context, meaning there's a really low barrier to entry if you're used to just using your laptop. And with that, feel free to connect with me on LinkedIn, Medium, or GitHub at Colby Ford, or if you're interested in what I do, you can shoot me an email at colby@tuple.xyz. I'll take some questions.

Q&A

Thank you again, Colby. Appreciate it. So whenever you engage with a new team, what is like the first thing you try to address when it comes to these kinds of data lakes?

I think the first thing is looking at all of the different types of analyses that that team works on, because then we can start to organize that data lake at a very fine-grained level. Like, if you're running a clinical trial, do you do IHC data? Do you do RNA-seq? Do you do whatever? All these different lab tests, whatever it is. If we can identify those things, it helps us plan for the future, so to speak, in a data lake environment. And it also helps us understand: do you have really small analyses, where maybe we don't need some sort of complicated compute architecture? Or do you have very large workloads that we need to plan for?

Thank you. We've got a couple of minutes if anyone wants to send over a question. I am curious though, do you usually get some pushback from certain teams? Like not necessarily a team you're working with, but like an ancillary team, and typically which teams are those?

There's a couple. One of the biggest things is that, since we're in a highly regulated environment, putting clinical data in the cloud still, for some reason, makes people a bit nervous. Which is really interesting, because I've achieved GxP compliance plenty of times and done things on classified data lakes. It can be done. And because of all of these blueprints and regulatory and compliance offerings that you can overlay onto your Azure environment, it's really not that big of a deal. But for some reason, there's a gap in understanding there. That's one. And then two, tying back to the last talk, people were saying there's this kind of barrier to entry for moving from a proprietary language to R or Python. Let's say you don't want to move from that proprietary language that costs millions of dollars a year for no damn reason. Then how do you actually run that workload in the cloud? Or how do you connect that service to a cloud-based data lake? It can be done, but it's not exactly as smooth as what I've presented here.

Well, thank you again, Colby. Appreciate it.