
RStudio + Amazon SageMaker | Build Beyond Your Laptop
Did you know that you can use RStudio, the best IDE for R and Python users, with Amazon SageMaker? RStudio on Amazon SageMaker makes it easy for R users to get started coding in RStudio on AWS from their browser, no server setup required, by using a new integration with Posit Workbench. In this webinar, Posit team members will show you how to get started with RStudio on Amazon SageMaker to analyze your organization's data in S3 and train ML models. As a fully managed offering on Amazon SageMaker, this release makes it easy for DevOps teams and IT admins to administer, secure, and scale their organization's centralized data science infrastructure with familiar AWS tools and frameworks. Learn more at: https://posit.co/products/cloud/sagemaker/ Talk to us about using RStudio and SageMaker: https://posit.co/schedule-a-call/?booking_calendar__c=Sagemaker
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Today, we'll be talking about SageMaker and using RStudio Pro inside SageMaker for all of your R-based workflows.
My name is Tom Mock. I'm the Posit Workbench Product Manager, and I help with some of our integrations with companies like AWS and the SageMaker platform. And I'll be joined by my colleague, Gagan Deep, later on to go through a live demo.
To start with, I'd like to cover some of the high level kind of value propositions of what RStudio Pro and SageMaker provides and some of the benefits and kind of different trade-offs that are available for that.
Benefits of RStudio Pro on SageMaker
So let's jump into this. First off, we're really excited. This is a true kind of partnership that we have with the Amazon team and the SageMaker team. And we're really excited to kind of be able to offer our commercial offering of RStudio along with Amazon and SageMaker's platform. So enabling our customers to do this really powerful machine learning and enhance their workloads beyond what's available on their local laptops.
Ultimately, we can break down kind of the benefits of SageMaker with RStudio Pro into four topics. And we'll kind of dive into these a bit deeper.
So number one, RStudio Pro on SageMaker is a managed service. This massively simplifies administration, and you don't have to have a robust IT team to manage everything. It's managed through the SageMaker platform as a hosted service.
Additionally, especially compared to a local laptop, SageMaker with RStudio Pro provides this flexible compute. And you don't have to figure out how to spin up new servers. You can actually just use a dropdown within RStudio to select different sizing of your environment for both RAM and CPU, et cetera.
Additionally, the SageMaker platform comes with strict security and compliance benefits, things like SOC 2, FedRAMP, and HIPAA, plus other security certifications and compliance programs. That's really useful and needed for our customers in banking, pharma, or health care, for example.
And then lastly, SageMaker also allows customers to use their existing cloud budgets. So for a lot of customers, they actually have budgets that are set aside for purchasing compute through cloud like AWS. And you can use some of that budget to purchase licenses for RStudio Pro that is then used inside SageMaker. Again, alleviating some of the struggles with purchasing or figuring out how to get this budget approved.
Overall, you can kind of have this nice little table here that compares pure SageMaker to RStudio Pro in addition to the core SageMaker platform.
Both SageMaker and RStudio Pro on SageMaker are flexible computing environments, backed by ephemeral EC2 instances with Docker images loaded into them. Both have a dedicated home directory that persists across sessions: it's dedicated for persistent files, R and Python package installs, configuration settings, and the different things that keep your user experience the same from session to session.
The main difference at this point is that SageMaker is really like a hosted version of Jupyter or JupyterLab. And then RStudio adds the additional RStudio IDE experience on top of SageMaker's compute.
For SageMaker with JupyterLab, you're primarily using Python, so you can use packages like Boto3 to interact with AWS and its different services. And then in R, there's the paws R package that allows you to do very similar things and interact with all sorts of AWS services, in addition to SageMaker.
And then lastly, both SageMaker and RStudio Pro inside SageMaker allow you to leverage specific SageMaker machine learning capabilities. Both really use the SageMaker Python SDK, or software development kit. In SageMaker, you primarily call it natively through Python. In RStudio, you wrap the same SageMaker Python SDK with the reticulate R package, allowing you to call Python directly from R.
So really, the environment is the same for both of them, but providing kind of a different user interface with the same packages and persistent files behind the scenes.
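As a quick illustration of the R side of that equivalence, here is a minimal paws sketch for talking to S3 from R. The bucket name is a placeholder, and credentials are assumed to come from the environment or an attached IAM role.

```r
# List the objects in an S3 bucket using the paws R package.
# "my-example-bucket" is a placeholder; paws picks up credentials
# from the environment or the attached IAM role.
library(paws)

s3 <- paws::s3()
resp <- s3$list_objects_v2(Bucket = "my-example-bucket")

# Each entry in resp$Contents describes one object (Key, Size, LastModified)
keys <- vapply(resp$Contents, function(obj) obj$Key, character(1))
print(keys)
```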
Ideal customers and use cases
As far as the customers that we've seen be ideal fits for RStudio Pro on SageMaker: number one, users who are already invested, or are investing, in the AWS ecosystem. Maybe they're already using SageMaker but want to leverage R there and really like using RStudio. Or maybe they're using RStudio on a desktop and want to make use of the robust machine learning tooling and scalable compute available inside SageMaker.
Additionally, because SageMaker is a hosted environment and the AWS team is actually managing the infrastructure for you, users who lack IT support or don't necessarily want to commit the resources needed to maintain a fully custom environment can go through this with kind of a ready to go, spin it up kind of very quickly environment.
And then lastly, really emphasizing that for RStudio desktop users, at a certain point, your laptop or your computer or even your workstation is going to run out of compute. Maybe you need 64 gigs or 128 gigs of RAM. With SageMaker, you can scale up kind of as large as it'll allow you to go. And then whenever you're done, you can actually shut down that environment and stop accruing cost. So you can have these kind of cost efficient scaling out as needed.
Additional benefits and trade-offs
As far as some of the additional benefits of RStudio Pro on SageMaker: since RStudio is a managed service, customers don't have to worry about managing installations of R and Python, or even installing Posit Workbench / RStudio Pro itself in the environment. It's already ready to go.
Additionally, our software product called Launcher is working behind the scenes with SageMaker internally that allows you to create these individually sized and isolated sessions. So you can have a gigantic session, run a model, leave it running and then go to a much smaller session with say like eight gigs of RAM and then do some computation there or exploratory data analysis while your large model training is occurring.
Additionally, RStudio Pro on SageMaker has direct integration with identity access management through AWS. So you can use a lot of the other AWS services without needing to reauthenticate or move around a whole bunch of credentials, kind of log in once and use a lot of other different things.
You can also use, again, these pre-allocated cloud budgets. So it's pretty often that we see some of these customers have a lot of money tied up into their cloud budgets, and they can use a small portion of that to purchase these RStudio Pro licenses to be used within SageMaker.
And then lastly, reiterating that for some of our compliant and secure needs of customers, that SageMaker already has SOC2, FedRAMP, HIPAA compliance and is adding additional compliance kind of certifications over time.
I do want to call out some of the trade-offs or differences, though, between a self-managed Posit Workbench that includes RStudio Pro as well as JupyterLab and VS Code versus RStudio Pro on SageMaker.
Number one, self-managed Posit Workbench provides complete flexibility in how you set up, configure, and administer Workbench. You do need an IT admin, but you can build a really custom environment with access to all the different things you need. This does come at the cost of responsibility: you do have to have that IT admin persona come in to maintain and update the environment and the infrastructure. But if you really want that customization, that's a good route. And you can run it on an Amazon EC2 instance, or even use clusters such as EKS for Kubernetes or AWS ParallelCluster for Slurm on AWS.
Importantly, RStudio on SageMaker is this managed service. So the responsibility for the software and the environment resides with AWS. You actually work through the AWS support team, as well as in some cases, the Posit support team, if you run into trouble, instead of the individual customer kind of only working through their IT admin.
And lastly, while Posit Workbench has access to multiple editors, it has RStudio, it has VS Code, as well as Jupyter and JupyterLab. The SageMaker environment has the RStudio IDE, as far as the RStudio Pro integration. And then Amazon SageMaker Studio also provides a managed kind of Jupyter, JupyterLab experience. It does not have the same VS Code integration that Posit Workbench does.
So there are some different trade-offs and just kind of design decisions as far as what you want. If you want this really custom environment to kind of set it up exactly as you want, or you have different needs, you can self-manage or install your own software into traditional infrastructure, or you can go this fully managed route through Amazon SageMaker.
Before I turn it over to my colleague, I'm also going to share some of the resources. So if you do have questions about getting started with SageMaker, you can go to Posit.co slash SageMaker. This has all the information about getting started with RStudio Pro on SageMaker, maybe doing a free trial where you can evaluate its use for your team. Or you can even talk to a salesperson here or a colleague at Posit to talk about doing some of the integration work as needed.
With that, I'm going to pause, stop screen sharing and turn it over to my colleague, Gagan Deep, who's going to cover some of the more technical details and the integrations and show off some demos about how to use AWS and RStudio Pro together.
Live demo: RStudio Pro inside SageMaker
Awesome. Thank you, Tom. Hi, everyone. I'm Gagan Deep Singh. I am a Senior Solution Engineer here at Posit. And today I'm following up on Tom's overview by showing you how to use this RStudio Pro integration from within SageMaker.
So to begin with, I land in our own internal SageMaker domain, which is maintained by us. And then I can enable the RStudio IDE from within that SageMaker domain. Now, in your case, this will look different on how you reach the RStudio integration. It will be managed by your AWS IT admins. And they will share a domain with you after they've enabled RStudio in it. So once you have logged into AWS and accessed that domain, you will see the same screen that I'm seeing here.
So now this is RStudio Workbench running within the SageMaker domain. It's not on my computer; I'm accessing it through a URL. And this is an additional screen I get since I'm using the professional product, from which I can run multiple sessions of RStudio in the SageMaker domain at the same time.
I can start a new session by using the New Session button and give it a name. Right now, as Tom mentioned, the only editor option is the RStudio IDE, so I have no choice here. And this is running within the SageMaker environment, so the cluster is also fixed at SageMaker.
The options I can change according to my use case are the instance type and the image. By instance type, I can choose which EC2 instance to use for the current session. This is the flexibility you get by using RStudio within SageMaker: it is directly integrated into EC2 and the AWS environment, so I can choose an EC2 instance based on the type of work I'm doing. If I plan to do minimal exploratory data analysis, just create some plots and explore some data, I can choose a smaller instance. And if I'm planning to bring in bigger data to do heavy lifting in modeling, I can choose a bigger instance. This flexibility only comes from using something like AWS, and RStudio is integrated into that flexibility.
I can also change the docker image that is being used to run this session. So everything is running as a docker image, as a docker container. There is a default image available, but I can also customize what kind of images I want to use. So I can work with my administrator to create a docker image for different kinds of workflows. So if I have a requirement of using certain versions of packages or certain system dependencies in that environment, I can create different images for that.
For this work, I'm just using the default instance type and the default image, and I've already started a session here. Once I go into the session, this RStudio session looks very similar to what you're used to if you're using the desktop version, but it does come with the additional flexibility of the professional version.
Building a machine learning model: churn prediction
I'm going to run through a small example of building a machine learning model on a sample data set, where I'll be predicting churn at a cellular company based on the customer data: whether a customer is going to churn or not. I've already prepared the code, and I'm going to run through it today. I'm using an R Markdown document, and to begin with, I'm setting up my environment.
I'm using the public package manager provided by Posit, through which I can get package binaries in this session. So you can also customize your repos to get binary packages from the public package manager that will make the installation easier. So I've done that setup, and I'm also loading the different libraries that I'm going to use. So I'll run this bunch of code.
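The repository setup described here can be sketched as follows. The URL is the public default for Posit Public Package Manager; on Linux you would point at a distro-specific path to get binaries.

```r
# Point R at Posit Public Package Manager so install.packages() pulls
# pre-built binaries instead of compiling from source.
# The URL below is the public CRAN mirror; on Linux, use the
# distro-specific path (e.g. .../cran/__linux__/jammy/latest) for binaries.
options(repos = c(CRAN = "https://packagemanager.posit.co/cran/latest"))

install.packages(c("tidyverse", "tidymodels", "reticulate"))
```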
While this is running, I can see that I already have the data set available that I'm going to use for this work. I have saved it in my session in the home directory that is attached to the session. So I have moved it here already, and I'm going to bring it into the session in the churn data frame.
So I've got that, and I'm going to explore the data. This is what my input data for the model looks like. I have information on the customers: which state they belong to, how long they've had an account with us, their area code, how many minutes they've spent on calls, and how many calls they make per day, all the information associated with their cellular account. I also have other features available, like how many minutes they spend per call, how many calls they have made to customer service, whether they have an international plan, and, at the end, whether they have churned or not.
So this is my input data set, and I'm going to build a model to predict the churn for our consumers.
So before I begin with the modeling, as a good data scientist I would like to explore the data I'm working with. I want to see what different patterns I can observe and which features or feature combinations I can use. That will help me later in building the model I want to build.
So I'm going to select the international plan and then plot it to see how many of the people who have churned actually used that international plan feature. This is a plot showing, for the users who have churned, whether they had an international plan or not. The data here looks pretty even, so there's not much to gather from it.
Let me now create a histogram of how many customer service calls they have made. The assumption is that if somebody is churning, they might have some complaints about the service they are getting. If I create this histogram, this is what I get for people who have churned. I can also change the variable and see the same information for people who have not churned, and how satisfied they are, based on that.
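The customer service call histogram might look something like this with ggplot2. The column names (`churn`, `customer_service_calls`) are assumptions about the sample data set; adjust them to your own data.

```r
# Histogram of customer service calls among churned customers.
# Column names are illustrative; adjust to the actual data set.
library(dplyr)
library(ggplot2)

churn_df %>%
  filter(churn == "yes") %>%
  ggplot(aes(x = customer_service_calls)) +
  geom_histogram(binwidth = 1) +
  labs(
    title = "Customer service calls among churned customers",
    x     = "Number of customer service calls",
    y     = "Customers"
  )
```

Swapping `churn == "yes"` for `churn == "no"` gives the same view for retained customers, as described above.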
These are a few examples of this kind of exploratory analysis. There are plenty of feature sets I could have played around with, but for the sake of this call and the workflow, I'm going to stick to these.
Before I jump into modeling, I would like to prepare my data for the model. Right now the data is pretty untidy in that sense, so I'm going to tidy it up, and these are some of the transformations I'm doing. First, I'm just filtering the rows for the customers who have churned. Then, since I plan to use XGBoost in this case, I'm going to prepare some dummy variables for the voicemail plan, the international plan, and the state they belong to. And I'm going to move the churn variable to the front of the data frame so that I can divide test and train properly. So I'm just going to quickly run all of this code.
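One way to sketch that prep step is with the recipes package; the column names here are assumptions about the sample data.

```r
# Prepare the data for XGBoost: one-hot encode the categorical features
# and move the outcome to the first column (the layout SageMaker's
# built-in XGBoost expects). Column names are illustrative.
library(recipes)
library(dplyr)

rec <- recipe(churn ~ ., data = churn_df) %>%
  step_dummy(voice_mail_plan, international_plan, state, one_hot = TRUE) %>%
  prep()

churn_prepped <- bake(rec, new_data = NULL) %>%
  relocate(churn)   # outcome column first
```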
While this code is running: so far, I have been doing what I generally do day to day as a data scientist, either on my desktop or in another professional server product. But the advantage of doing it here is that my compute is flexible. My data set currently is not that big, but if I had a bigger data set and was bringing millions of rows into this session for this kind of analysis, I could increase the size of the compute instance I'm using behind the scenes in SageMaker. So this helps me with flexible compute, but I'm still an R programmer: I'm still using what I know in R, just in a scalable manner.
Now that I've prepared my data set, I am going to divide it into training, validation, and test sets, and as part of this code block, I'm also saving them as CSV files for future reference.
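The split-and-save step might look like this with rsample; the split proportions are assumptions.

```r
# Split into train / validation / test and save as CSVs.
# Proportions here are illustrative. SageMaker's built-in XGBoost
# expects headerless CSVs with the label in the first column.
library(rsample)
library(readr)

first_split <- initial_split(churn_prepped, prop = 0.7, strata = churn)
train_df    <- training(first_split)
rest_df     <- testing(first_split)

second_split  <- initial_split(rest_df, prop = 0.5)
validation_df <- training(second_split)
test_df       <- testing(second_split)

write_csv(train_df,      "train.csv",      col_names = FALSE)
write_csv(validation_df, "validation.csv", col_names = FALSE)
write_csv(test_df,       "test.csv",       col_names = FALSE)
```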
Currently, this is saved in the file system attached to this session. But if I clear out this domain, or something happens on the AWS side, I might lose this data. So one of the things I would want to do is move the input data for my model into a more persistent place than the file system attached to this session. A common data service used in AWS for that kind of work is S3 buckets, and I can interact with S3 buckets from within my code very efficiently, since I'm inside SageMaker. That is what I'm going to do next, once I've written those files.
Interacting with AWS S3 and the SageMaker SDK
In order to interact with the SageMaker session that is running this RStudio instance, I will use the SageMaker Python package, which is what Tom was alluding to before. If I want to interact with different AWS resources like S3, or directly with the SageMaker service, I can do that through the SageMaker SDK.
Now, this is a Python SDK, so you will have to use the reticulate library. If you're not aware, reticulate lets you interact with Python packages directly from your R environment. In order to do that, I need an existing Python installation running along with the session. I will use the terminal feature within RStudio; as you can see, I've already done some work and installed Python into this session using the Python binaries.
This is a Python version that works for me, but if you have requirements on certain Python versions, you can obviously install those for your session. I need this Python installation to also have the SageMaker Python package installed so that I can import it in this session, so I use a tool like pip to install the sagemaker package. I've already prepared the session to use that, so if I run library(reticulate), it will attach reticulate, and I can import the sagemaker package.
So now the sagemaker package is imported.
Now, I'm going to use the SageMaker Python package's functions to proceed, and this requires you to know what SageMaker is as a service. I'm using the SageMaker functions to access the information related to the session, so I need to create a session object. There's also always an S3 bucket attached to a SageMaker session, and I'm going to use that S3 bucket as the place to keep all my data. I can get that information directly from the session object as well.
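A minimal sketch of grabbing the session, default bucket, and execution role through reticulate might look like this; it assumes the sagemaker Python package is installed in the active Python environment.

```r
# Access the SageMaker session, its default S3 bucket, and the
# execution role from R via reticulate. Assumes the sagemaker
# Python package is installed in the active Python environment.
library(reticulate)

sagemaker <- import("sagemaker")

session <- sagemaker$Session()
bucket  <- session$default_bucket()        # S3 bucket attached to the session
role    <- sagemaker$get_execution_role()  # role configured by the admin
```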
Also, every domain in SageMaker needs an execution role when it's created. This is set up by your administrator, and through this execution role they can give you permissions from the AWS side to access different resources. So every RStudio session in SageMaker has an execution role attached to it, and I can access information for that role as well.
Now through this role is how I will access things like S3. So if this was not SageMaker, and I was doing this in my desktop environment, I would have to explicitly provide the AWS secret keys and all the credentials I need to access AWS. But here in this case, I already have a role attached to the session. So as a user, I have one less responsibility. And I don't have to worry about maintaining credentials or any security risks that are associated with it. Like everything is native within SageMaker. So I can directly interact with S3. And this is exactly what I'm going to do.
And now I'm going to save my training, testing, and validation data sets into this attached S3 bucket. So again, as you can see, it was pretty simple. Compared to using a desktop, I don't have to maintain credentials; there's already a role, and I'm using that role to access S3.
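That upload step can be sketched with the SDK's upload_data helper; the "churn/data" key prefix is an assumption.

```r
# Upload the saved CSVs to the session's default S3 bucket via the
# SageMaker Python SDK. The "churn/data" key prefix is a placeholder.
library(reticulate)

sagemaker <- import("sagemaker")
session   <- sagemaker$Session()
bucket    <- session$default_bucket()

s3_train <- session$upload_data(
  path = "train.csv", bucket = bucket, key_prefix = "churn/data"
)
s3_validation <- session$upload_data(
  path = "validation.csv", bucket = bucket, key_prefix = "churn/data"
)
# upload_data() returns the s3:// URI of the uploaded object
```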
Another way: so far, this was done through the SageMaker session that I'm driving programmatically with the Python package. There's also a native R package, if you're not familiar with it, a great package called paws, which provides options to access S3 or other AWS services natively. That also normally requires setting up AWS credentials, but in this scenario I am already within the SageMaker AWS environment, so I can directly use paws to access all the AWS resources.
As an example, I will show you how to read the same data that I pushed into the default S3 bucket, this time through paws. I'm going to create an S3 object with paws and then just read that file, and that will work.
So this is the same file I put in, and I can read it as well. Interacting with AWS resources is very simple, and I can use any other AWS services too, either through the SageMaker SDK, which is in Python, or directly through paws, depending on the functions I want to use.
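Reading the object back with paws might look like this; the bucket name and key are assumptions.

```r
# Read a CSV back out of S3 natively from R with paws.
# Bucket name and key are placeholders; credentials come from the
# session's execution role inside SageMaker.
library(paws)
library(readr)

s3  <- paws::s3()
obj <- s3$get_object(Bucket = "my-example-bucket",
                     Key    = "churn/data/train.csv")

# obj$Body is a raw vector, which readr can parse directly
train_back <- read_csv(obj$Body, col_names = FALSE)
```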
Training and deploying an XGBoost model
Now that I've played around with the data, saved it, made sure it's persistent, and have the training and testing sets, I want to build the XGBoost model. I can either train models natively with anything I'm familiar with, or, since this is an instance of RStudio running on SageMaker, I can interact with the SageMaker service as well.
The SageMaker service has pre-built models available that you can use on any size of data, because it's a flexible way of running those models. These models run as containers, so I can select and size them based on my input data. I am going to use the same SageMaker Python API that I've already loaded to build this XGBoost model directly in SageMaker.
Since I'm in SageMaker, I can directly access it, so I'm just going to run this code. This is just preparation for the model, and these are all SageMaker API-specific commands, so there is a need to know what you're working with. But there's great documentation available from SageMaker on how to use this. Here, I'm going to build a container that will use the XGBoost framework from SageMaker.
I'm going to give it the output location, my role, what the input type is, and what size of EC2 instance it should run on, so I can optimize that as well. I can also give it hyperparameters: I want to use a logistic objective, and I want to run 100 rounds of the XGBoost trees. Again, these are all settings that you can tune based on your workflow, and they're all flexibilities available in SageMaker.
So once I've prepared the XGBoost model, I need to start the model training, which is run through this command. This takes quite some time, so to save time, I have already trained the model beforehand; but this is the command you would use. Once you have trained the model on the input data set, you deploy it on SageMaker so that you can interact with it. This gives your model a unique identifier that can be used within this code, or from any other service or language. I have also already deployed that model, and these are the outputs of that.
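The estimator setup, training, and deployment steps described above can be sketched like this through reticulate. The region, framework version, instance types, and S3 paths are all assumptions to adjust for your account.

```r
# Build, train, and deploy SageMaker's built-in XGBoost from R.
# Region, version, instance types, and paths are illustrative.
library(reticulate)

sagemaker <- import("sagemaker")
session   <- sagemaker$Session()
bucket    <- session$default_bucket()
role      <- sagemaker$get_execution_role()

# Resolve the container image for the built-in XGBoost framework
container <- sagemaker$image_uris$retrieve(
  framework = "xgboost", region = "us-east-1", version = "1.5-1"
)

estimator <- sagemaker$estimator$Estimator(
  image_uri         = container,
  role              = role,
  instance_count    = 1L,
  instance_type     = "ml.m5.large",
  output_path       = paste0("s3://", bucket, "/churn/output"),
  sagemaker_session = session
)

# Hyperparameters: logistic objective, 100 boosting rounds
estimator$set_hyperparameters(objective = "binary:logistic",
                              num_round = 100L)

# Wrap the uploaded CSVs as training channels, then fit and deploy
train_input <- sagemaker$inputs$TrainingInput(
  s3_data = paste0("s3://", bucket, "/churn/data/train.csv"),
  content_type = "csv"
)
valid_input <- sagemaker$inputs$TrainingInput(
  s3_data = paste0("s3://", bucket, "/churn/data/validation.csv"),
  content_type = "csv"
)

estimator$fit(list(train = train_input, validation = valid_input))
predictor <- estimator$deploy(initial_instance_count = 1L,
                              instance_type = "ml.m5.large")
```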
And in order to see where the model is, you can go to the SageMaker UI. Within the SageMaker UI, you can see this is the model I've deployed, the XGBoost model, and you can also test or work with it. You can get its ARN if you want to interact with it from outside of SageMaker as well. Within SageMaker, you can make it part of your different workflows. So now your model is built in R, but is available in SageMaker as an endpoint.
Now that I've done this, I can make predictions on the same model. I can do it either programmatically or through the UI. So this bunch of code already does that. I already have these predictions based on the actual value and what is the predicted churn, and I can validate that by creating an ROC curve as well.
So this is a modeling workflow that you can use with SageMaker. And then once you're done with the endpoint, you can obviously delete it, or you can keep it up and running based on your use case.
Alternative workflow with vetiver
So this modeling workflow so far relies heavily on knowing the SageMaker service natively, and as an R user coming in from the desktop versions, that's a difficult entry point. So there's also an alternative workflow available for you, through the R package called vetiver. If you've not heard of vetiver, it's a machine learning operations framework that helps you build your models natively in R or Python and then deploy them to different endpoints, either on Posit Connect or AWS SageMaker. So you don't have to rely on SageMaker's API; if you're within SageMaker, vetiver can help you easily interact with it and deploy there as well.
For deploying a vetiver model, the process starts with actually building a model. I'm using a different kind of model here for vetiver: a logistic regression, but vetiver supports all the models within tidymodels. So you can make your own selections based on your workflow and define the model you want to deploy. I have this logistic regression model, and then I need to convert it into a vetiver model object so that vetiver can interact with it, which is what I'm doing here.
And vetiver needs the model to be pinned so that it can iterate on different versions of the model, so I will use the pins package here. The same S3 bucket that I'm using for my data, I can use for my model as well. So I'm using that bucket and the vetiver_pin_write() function to pin this model to the default bucket associated with the session. But if in your case you're interacting with some other resource like Posit Connect, and you want to save your pin on Posit Connect or another S3 bucket, you can change that and save it there too.
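A sketch of that vetiver-plus-pins step; the fitted model object (`churn_fit`) and the bucket name are placeholders.

```r
# Turn a fitted tidymodels model into a vetiver model and pin it to S3.
# `churn_fit` and the bucket name are placeholders.
library(vetiver)
library(pins)

v <- vetiver_model(churn_fit, "churn-model")

board <- board_s3(bucket = "my-example-bucket")
vetiver_pin_write(board, v)
```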
Now, this is where I would say the selling point of using vetiver in this scenario is: vetiver has the vetiver_deploy_sagemaker() function already available, which does all the hard work of what we have done so far in terms of deployment. So you don't need to know the SageMaker package; you can directly use this function, which will do all of that for you.
So I will provide it the board where the model is pinned, the name of that pin, the same instance type I want to run it on, and any other additional arguments. What this vetiver_deploy_sagemaker() function is doing in the backend is three things, which is what we already did step by step through the SageMaker API: it first builds the Docker image needed to run this model, then it creates the SageMaker model, and then it deploys the endpoint for you. All three steps that you could also do separately, this function does for you. So I only need this one line of code, and the model that I have locally is converted to a SageMaker model and deployed on SageMaker.
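That single deployment call might look like this; the board, pin name, and instance type are assumptions.

```r
# Deploy the pinned vetiver model to SageMaker in one call.
# vetiver_deploy_sagemaker() builds the Docker image, creates the
# SageMaker model, and deploys the endpoint. Names are placeholders.
library(vetiver)
library(pins)

board <- board_s3(bucket = "my-example-bucket")

endpoint <- vetiver_deploy_sagemaker(
  board         = board,
  name          = "churn-model",
  instance_type = "ml.t2.medium"
)
```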
Again, to save us time, I've already done that deployment. Like the other model we deployed, my model is also deployed and available to interact with. So this is a great way to get into working with SageMaker if you don't know the SageMaker SDK but you do know vetiver, which builds on the basic tidymodels workflow. There are multiple ways to get into SageMaker, and this workflow can now be extended to create other workflows within SageMaker, since I've already done the modeling.
So that's all I had and wanted to show, and I'm going to pass it back to Tom to talk more about this. Thanks for your time.
So as we wrap up for the day, I'll go back to this slide on the resources that are available. We walked through a bunch of different ways that R and RStudio can be used inside SageMaker, both natively by using some of the SDKs and software tooling that SageMaker provides, as well as some of the tooling that we provide here at Posit through things like tidymodels and vetiver.
If you're interested in learning more about SageMaker, you can again visit posit.co slash SageMaker for an overview, or even to schedule a call to talk about it with us. Gagan and I will also be here in the chat and can take questions from those of you viewing about SageMaker and some of the integrations that we've done with RStudio.
With that, I'd love to thank you again for watching along with us. We'll take questions now and we'll see you soon.
