
David Smith | MLOps for R with Azure Machine Learning | RStudio (2020)
David Smith | January 31, 2020

Azure Machine Learning service (Azure ML) is Microsoft's cloud-based machine learning platform that enables data scientists and their teams to carry out end-to-end machine learning workflows at scale. With Azure ML's new open-source R SDK and R capabilities, you can take advantage of the platform's enterprise-grade features to train, tune, manage and deploy R-based machine learning models and applications. In this talk, attendees will learn how to:

• Carry out ML workflows using the authoring experience of their choice, from no-code to code-first options, including Azure ML's drag-and-drop visual interface for defining workflows and RStudio Server on the Data Science Virtual Machine, a hosted VM workstation, for using the Azure ML R SDK from the RStudio browser-based interface.

• Use the Azure ML R SDK to manage cloud resources and to train, hyperparameter-tune, and log and visualize metrics for their models at scale on Azure compute.

• Build ML Pipelines in R for defining and orchestrating reusable and reproducible ML workflows.

• Deploy, manage, and monitor their R ML models and applications as web services on Azure Container Instances and Azure Kubernetes Service, with an emphasis on robust DevOps and CI/CD for orchestrating and streamlining the end-to-end data science development lifecycle.
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Welcome back after lunch. It's been a great conference so far. It's been great to be here. My name is David Smith. I'm a cloud advocate with Microsoft. So basically that means I'm Microsoft's connection with the R community. So if there's anything you would like me to communicate back to the development teams back in Redmond, just have a chat with me and I'll pass that on.
But today I would like to talk to you about machine learning operations. And that's the process of efficiently and reliably building machine learning models at scale. We call it MLOps. It's a relatively new term. You might see it being related to something you're more familiar with, which is DevOps. And DevOps itself, I don't have time to get into all the details of that, but a definition that I like, which comes from one of my colleagues at Microsoft, is that DevOps is the union of people, process and products to enable the continuous delivery of value to your end users. And note that word value there. We're not talking about software. We're not talking about applications. We're talking about value in general. So it seems like it's something that we could also apply to machine learning in the same way that it's traditionally applied to regular applications.
So what DevOps gives us is a nice process for delivering applications in general. We plan, we develop and test those applications, we release them to production. And once they're in production, we're not done, because in production we monitor and learn from our users' experiences with those applications and go through that entire cycle again. And so the question that I want to ask here is: can we apply a similar process to building machine learning models, in this case with R in particular?
And I don't know about you, but at least in my experience when working with machine learning models, it's a little bit different than building applications with regular programming languages. You start with some data, you might try a model, you might try another model, you'll evaluate the results, you'll decide that you might need some more data, you might need a different model. It's a very iterative process. But at the end of that process, what you get is a model that you'd like to put into production and you throw it over the wall to the engineering team to do something with it. At least that was my experience up until relatively recently. I don't know if it was yours as well. But you can see that this is kind of very different to the modern way of building applications with continuous integration and continuous delivery. So can we get to that stage with machine learning?
Differences between DevOps and MLOps
And I think one of the ways that we can think about that is to have a look at some of the differences between traditional development processes, DevOps, and machine learning building, MLOps. Now in both cases, we're working with files, you know, that source code. It might be C++ or .NET files on the regular application side of things. But with machine learning, we're also working with other artifacts, things like data files. There's a whole thread I could go down at this point about what it means to have a 55-megabyte data file that you're analyzing and putting that into Git. Short answer: don't do that. But there are lots of issues around managing data that don't come up in traditional development processes. The same goes if you're working with notebooks or R Markdown documents: how do we deal with those, and how are they managed in source control systems?
A good practice when building applications is to manage infrastructure with commands in code, rather than managing it directly by pointing and clicking in consoles and terminals. Those same principles apply on the machine learning side as well, with an additional wrinkle: we often have to manage environments. Think, in the R case, of managing the packages and the package versions that need to exist for your R model to run in a reliable fashion. So we want to be able to manage those as code as well.
I touched a little bit on those issues of working with source control systems. Those also apply in the machine learning operations scenario. But there are other things we need to track changes in as well, not just the source code. We also want to track changes in our experiments. And by that I mean, as we try out different types of models, perhaps different transformations of our data, we're looking for some kind of outcome, perhaps the model with the best accuracy. And we'd like to be able to track that in a reproducible way, so that we can always go back to a prior experiment if we end up using that one instead.
When we're building applications in a DevOps environment, typically we're building binary executables. Those builds might take minutes, maybe hours; that's kind of the typical limit for most types of applications, and you're typically building on commodity computing infrastructure. But on the machine learning side, in addition to building executables, we're training the models that become part of those executables. And that model training, especially in deep learning environments, could easily take hours, might well take days. I've seen examples where model training takes months. So how do we put that into a traditional DevOps type of environment? And how do we incorporate the exotic computing environments, like the GPU environments we need for deep learning?
When it comes to version control, version management, when we're building applications, we like to give versions to the applications that we build and release. On the machine learning side, we would like to assign versions to models so we can track when we change a model and then go back to the underlying data and code that created that model and do so reproducibly. And another sort of side branch that I could go into for hours is this whole concept of tests when it comes to machine learning as opposed to developing applications. On the application side, tests are fairly deterministic. Did it break? Did it cause an error? Did it give the right response? But in machine learning, tests tend to be a little more probabilistic. You know, think of the question of, is this a picture of a cat? That's not so much of a yes or a no type question. It's more of a probabilistic thing that we need to determine. And so the way we do testing is different.
I can't go into all of these topics today, but I'm just going to touch on a few of them. But I just wanted to give you this as just an illustration that there are quite a few differences, in my opinion, between machine learning operations and traditional DevOps. And we need to incorporate those differences into any process that we're using when we build applications with machine learning.
Azure Machine Learning Service
Now, Microsoft has an application to help with that. It's called the Azure Machine Learning Service. It's based on a process that we developed internally for building the machine learning models that we have in things like Bing and Xbox and all sorts of other applications. But now we brought it out so that people can use it externally as well. And the workflow that's built into it has really been driven by a lot of the engagements that we have had with customers that are building machine learning models at scale and building them into very high-scale production applications.
There are lots of things in this that I don't have time to talk about today, things like being able to use notebooks, automated machine learning, being able to just give it a data set and it tries a bunch of models. There's a nice drag-and-drop visual interface that I can't go into today. But instead, I'm going to focus on these concepts, these abstractions, which help us build a nice process for machine learning operations. Things like data sets, being able to version a set of rows and columns of data, and being able to get back to that version of data at any time in a reproducible way when we want to recreate our models in our experiments.
Experiments in this context are training runs, a model, I'm thinking of a statistical model here that we would try against a specific version of data, and the metrics like accuracy that we get back from that model, being able to track all of those. Pipelines, I'm not going to cover too much here, but being able to combine together little pieces of R code, little pieces of Python code, that together form a complete workflow we might want to execute as part of a continuous integration or continuous delivery process. Models, we're all familiar with the concept of a model from a statistical standpoint, but a framework that allows us to register models that we might want to use and have versions associated with those models. And endpoints, this is where the ops part of it comes in. Real-time endpoints that we can use to pass data to those models and get results back, make predictions from models. And also endpoints for those pipelines that I mentioned just a minute ago, so that we can automate the process of kicking off sequences of doing computations in R and do that as part of a larger build process for a complete application.
The last three on this list are more like infrastructure that we need to manage to help us do all of this process. The first of those is compute, manage computing resources that we use for doing our interactive experimentation, the resources that we use to build our large models, and the resources that we use to deploy those large models into applications. The environments I mentioned briefly already, the computing environment that's needed to do all this reproducibly, and then data stores. Think here of the databases or the blob stores, where the fundamental data files that represent those data sets exist.
The Azure ML SDK for R
What I'm going to be using in this talk today is a relatively new package for R, available now on CRAN, called azuremlsdk, which provides a suite of R functions for working with all those artifacts I just told you about. So there are R functions for creating workspaces and experiments and compute targets and so forth. All of this works with any modern version of R; it's not specific to Microsoft at all. You can use any R package, a GitHub package, a private package. It's all completely open in that sense. And it tracks all those associated requirements, like the packages that you're using with your code, so that it can follow those through to make sure they're available at deployment time.
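As a rough sketch of what getting started with the package looks like (the workspace and resource names below are placeholders, not from the talk, and you would need your own Azure subscription):

```r
# Minimal azuremlsdk setup sketch. "myworkspace", the subscription ID,
# and the resource group are placeholders.
install.packages("azuremlsdk")   # from CRAN
library(azuremlsdk)
install_azureml()                # one-time: installs the Python backend

# Connect to an existing Azure ML workspace.
ws <- get_workspace(
  name = "myworkspace",
  subscription_id = "<subscription-id>",
  resource_group = "myresourcegroup"
)
```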
A few other things I don't have time to talk about are things like HyperDrive support, which is mainly used in machine learning for trying lots of different hyperparameters for models. But we are going to look at publishing an R-based model as a web service. I'm going to do it in Azure, but you can also do it on your own servers in exactly the same way. And we're also briefly going to talk about triggering these processes from CI/CD pipelines as we're building applications around R.
Demo: building and deploying a model
So I'm going to show you a little example. This is based on some data from the US traffic service around fatalities in auto accidents. And I want to build a model that predicts the probability of a fatality given the types of variables you can see there on the right. The process is going to be a pretty traditional machine learning process: I'm going to import some data and prepare it, and then I'm going to build a model. To do that, I need a cluster of machines to train that model. In fact, I'm going to fit some GLM, KNN, and glmnet models with caret.
Then we're going to have a look at the results of those models and select the best one according to accuracy, and then deploy an R function that predicts from that model as a container. And that container is then going to expose an endpoint, which we can call from any application. But I'm going to demonstrate that here with a Shiny application. So let me show you how that works.
The first thing I need to do is to create a compute instance, which is a little machine in the cloud that I can use for my interactive work. So here's an example where I've created an RStudio compute instance. This is a standard DS2V2 virtual machine, which is basically a two-core machine with about seven gigs of RAM. A bit better than my laptop, but not much. But the nice thing is it has persistent storage, and I can share that with other people. And I can just launch an RStudio server instance on that directly and use that on the cloud. If I need a beefier machine, I could launch a GPU-based instance like this one here, which has a little over 50 gigabytes of RAM and six processors. So you can just get a machine. If you need a more powerful machine, just to do your experimentation, it's really easy to spin one up. Or you can just use your own laptop. With all of this Azure Machine Learning service that I'm talking about, you can just use all your own infrastructure for all the levels of this, or you can use the Azure cloud-based services if you want as well.
So here's some R code. All of this is available in the GitHub repository, which is linked in the bottom left-hand corner of the screen right there. Here's an example of me creating a compute instance. Then, within the compute instance, there's some code that just imports a regular CSV file, does some preparation, and then stores that R data object into a shared storage service, so that it will be available when I do my training runs in just a moment.
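That prepare-and-store step might look something like this sketch (the file and folder names are illustrative, and `ws` is assumed to be a workspace object obtained with the SDK):

```r
library(azuremlsdk)

# Import a CSV, prepare it, and save it as an R data object.
accidents <- read.csv("accidents.csv")     # illustrative file name
accidents$dead <- factor(accidents$dead)   # illustrative preparation step
saveRDS(accidents, file = "accidents.Rd")

# Upload the prepared object to the workspace's default datastore so
# that training runs on the cluster can read it later.
ds <- get_default_datastore(ws)
upload_files_to_datastore(
  ds,
  files = c("accidents.Rd"),
  target_path = "accidentdata"
)
```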
Here is some R code using create_aml_compute, which creates a cluster for me in the cloud that I'm going to use to do all of my training runs. This is a fairly small cluster: I've defined it with a minimum of one node and a maximum of two. That means one node is always live, which means it's always quick for me to submit jobs into that cluster; everything's always ready to go. You can set the minimum nodes to zero, and it will automatically scale down to nothing when you're not using it, so you're not paying for anything.
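That call might be sketched like this (the cluster name and VM size are illustrative; `ws` is the workspace object from earlier):

```r
# Create a small training cluster. min_nodes = 1 keeps one node warm so
# submitted jobs start immediately; set it to 0 to scale to zero when idle.
cluster <- create_aml_compute(
  workspace = ws,
  cluster_name = "rcluster",
  vm_size = "STANDARD_D2_V2",
  min_nodes = 1,
  max_nodes = 2
)
wait_for_provisioning_completion(cluster, show_output = TRUE)
```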
So now that we have our compute environment ready to go, let's try training some models. I'm going to go ahead and create a model so we can have a look at some results and choose one to deploy. Here's an example of submitting some R code to that cluster, and you can see it's pretty simple. All I need to do is define an R script file. Here it's called accident_glmnet.R, which is some pretty standard caret code: it imports that same data that I stored to the shared data store, does a train call, and then saves the resulting model object back to the data store again. And the submit_experiment command you see right there at the bottom queues up that job on the cluster; it'll wait until a node is available, then run that R code and save all the results for me.
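The submission step might look roughly like this sketch (the script name, experiment name, and package list are illustrative, and `ws` is the workspace object):

```r
# Define how the script should run on the cluster, then submit it as an
# experiment run; azuremlsdk captures the logs and metrics for the run.
est <- estimator(
  source_directory = ".",
  entry_script = "accident_glmnet.R",
  compute_target = "rcluster",
  cran_packages = c("caret", "glmnet", "e1071")
)
exp <- experiment(ws, "accident-prediction")
run <- submit_experiment(exp, est)
wait_for_run_completion(run, show_output = TRUE)
```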
If you need to, you can also modify the behavior of these scripts with command-line parameters. One option I have there is percent_train, to define the test-versus-train proportion in the actual modeling, so you have a lot of control over how those runs behave as well.
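Passing and parsing such a parameter might be sketched like this; the option name mirrors the percent_train example above, and parsing with the optparse package is a standard R pattern rather than anything Azure-specific:

```r
# On the submitting side, the parameter goes through the estimator, e.g.:
# est <- estimator(..., script_params = list("--percent_train" = 0.75))

# Inside the training script, parse it with optparse:
library(optparse)
opts <- parse_args(OptionParser(option_list = list(
  make_option("--percent_train", type = "double", default = 0.75,
              help = "fraction of data used for training")
)))
train_fraction <- opts$percent_train
```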
The environment where R runs in that cluster comes preloaded with just about every package on CRAN, but if you want to use a GitHub package or a private package, it's easy to have that available there as well. It tracks all of the results, so I can always go back and have a look at them. On the top right-hand side, we can see the accuracy from all these models, and then pick the best one to deploy. To deploy it, the first thing I need to do is register that model in the system. That registers it with a version, so this is where the versioning of models comes in.
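Registering the chosen model might be sketched like this, assuming the run-download helpers in azuremlsdk (the output path and model name are illustrative):

```r
# Download the model artifact that the training run saved to its
# outputs/ folder, then register it in the workspace, which assigns
# it a version number.
download_files_from_run(run, prefix = "outputs/")
model <- register_model(
  ws,
  model_path = "outputs/model.rds",
  model_name = "accident-model"
)
```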
The R script that runs the actual prediction gets the data as a JSON object, so you just need to deserialize that JSON and then use regular R code to run the prediction right there. And then finally, we can deploy that model as a service. You can see the deploy_model function right there, which takes the model that we just registered along with its R code, and that becomes a live container, in this case in Azure Container Instances, which exposes a REST endpoint. I can send data to it, the container runs the R code, and it sends the prediction back to me.
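Put together, the entry script and the deployment might look like this sketch. In azuremlsdk, the entry script's init() loads the model and returns the scoring function; all names here are illustrative:

```r
# --- accident_predict.R: entry script run inside the container ---
init <- function() {
  # The registered model is mounted into the container at this path.
  model_path <- Sys.getenv("AZUREML_MODEL_DIR")
  model <- readRDS(file.path(model_path, "model.rds"))

  # Return the scoring function: deserialize JSON, predict, re-serialize.
  function(data) {
    input <- jsonlite::fromJSON(data)
    prediction <- predict(model, input, type = "prob")
    jsonlite::toJSON(prediction)
  }
}

# --- deployment code, run from the workstation ---
inf_config <- inference_config(entry_script = "accident_predict.R")
aci_config <- aci_webservice_deployment_config(cpu_cores = 1, memory_gb = 1)
service <- deploy_model(
  ws, "accident-service",
  models = list(model),
  inference_config = inf_config,
  deployment_config = aci_config
)
wait_for_deployment(service, show_output = TRUE)
```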
Here is the R code; this is running on my local laptop right now. When I run the Shiny application, hopefully the internet is working here, there we go, this is the prediction based on all these variables. If I increase the car occupant's age, and particularly the impact speed of the accident, you can see that the predicted probability of a fatality gets much higher. That calculation is being done within that inference cluster that I created in the cloud. I'm doing it in this case from Shiny and R, but you could do it from any application you like, calling out directly to that REST endpoint.
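Calling the endpoint from R, for example inside a Shiny server function, might be sketched as follows (the input variable names are illustrative):

```r
library(jsonlite)

# Build a one-row data frame of inputs and send it to the web service;
# the container deserializes it, scores it, and returns the prediction.
newdata <- data.frame(age = 35, speed = 60)   # illustrative variables
result <- invoke_webservice(service, toJSON(newdata))

# Any other application could POST the same JSON directly to the
# service's REST scoring URI instead of going through azuremlsdk.
```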
CI/CD integration and summary
And that's where the next stage comes in. I don't have time to go into all the details of this, but if you're using a CI/CD service like Azure Pipelines or GitHub Actions, you can control all those processes through the command-line interface that's provided with the Azure Machine Learning service. And if you're using Azure Pipelines, which is part of Azure DevOps, there is a plugin that makes it really easy to do things like this: when I, as a data scientist, register a new model into the system, a CI/CD process is immediately kicked off within Azure DevOps to build the entire application around that new model and then deploy it into production.
I've got a link to a talk I've given, in the AIML50 repo, that shows that entire process. It's about using Python to deploy a vision model into a website application, but all those same principles apply equally to R.
So just to summarize: what we just did is build a pipeline in the Azure Machine Learning service to prepare data, train a model, and register it into the service. Then we have a CI/CD pipeline in Azure DevOps, which recognizes when that model is registered and deploys it as a container instance, a single container, for testing. Typically we then do some testing on that container before going to production, and for production we would deploy that same model to a cluster like Kubernetes, for a production-level facility for integrating that model into your application.
And because you've got that pipeline set up, it's really easy to do things like retraining your model. For example, if after a certain amount of time, or perhaps after reviewing the metrics from your live model, you want to retrain it, you can just kick off those same pipelines to update your model completely automatically. So all the information you need about Azure Pipelines and the Azure Machine Learning service is available at those links right there, and those links are also available at the repository that I've put up for this talk, at the link you see on your screen right there. Thank you very much.
Thank you, David. We have time for one question. Sure. Are there existing options for authenticating with DevOps from RStudio Server Pro? If not, is it something that will be coming?
In fact, that's what I did within RStudio Server, not even the Pro version. All the authentication happens directly within the R package itself. For example, when you first try and register a workspace, it pops up a little prompt for you to authenticate directly with Azure Active Directory through a web page. That already exists. Great. Thanks so much. Thank you.
