
RStudio + Amazon SageMaker | Build Beyond Your Laptop
Did you know that you can use RStudio, the best IDE for R and Python users, with Amazon SageMaker? RStudio on Amazon SageMaker makes it easy for R users to get started coding in RStudio on AWS from their browser, no server setup required, by using a new integration with Posit Workbench. In this webinar, Posit team members will show you how to get started with RStudio on Amazon SageMaker to analyze your organization's data in S3 and train ML models. As a fully managed offering on Amazon SageMaker, this release makes it easy for DevOps teams and IT admins to administer, secure, and scale their organization's centralized data science infrastructure with familiar AWS tools and frameworks. Learn more at: https://posit.co/products/cloud/sagemaker/ Talk to us about using RStudio and SageMaker: https://posit.co/schedule-a-call/?booking_calendar__c=Sagemaker
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Today, we'll be talking about SageMaker and using RStudio Pro inside SageMaker for all of your R-based workflows.
My name is Tom Mock. I'm the Posit Workbench Product Manager, and I help with some of our integrations with companies like AWS and the SageMaker platform. And I'll be joined by my colleague, Gagan Deep, later on to go through a live demo.
To start with, I'd like to cover some of the high level kind of value propositions of what RStudio Pro and SageMaker provides and some of the benefits and kind of different trade-offs that are available for that.
Benefits of RStudio Pro on SageMaker
So let's jump into this. First off, we're really excited. This is a true kind of partnership that we have with the Amazon team and the SageMaker team. And we're really excited to kind of be able to offer our commercial offering of RStudio along with Amazon and SageMaker's platform. So enabling our customers to do this really powerful machine learning and enhance their workloads beyond what's available on their local laptops.
Ultimately, we can break down kind of the benefits of SageMaker with RStudio Pro into four topics. And we'll kind of dive into these a bit deeper.
So number one, RStudio Pro on SageMaker is a managed service. This massively simplifies administration, and you don't have to have a robust IT team to manage everything. It's managed through the SageMaker platform as a hosted service.
Additionally, especially compared to a local laptop, SageMaker with RStudio Pro provides this flexible compute. And you don't have to figure out how to spin up new servers. You can actually just use a dropdown within RStudio to select different sizing of your environment for both RAM and CPU, et cetera.
Additionally, the SageMaker platform comes with strict security and compliance benefits, things like SOC 2, FedRAMP, and HIPAA, plus other security certifications and compliance programs. That's really useful and needed for our customers in banking, pharma, or health care, for example.
And then lastly, SageMaker also allows customers to use their existing cloud budgets. So for a lot of customers, they actually have budgets that are set aside for purchasing compute through cloud like AWS. And you can use some of that budget to purchase licenses for RStudio Pro that is then used inside SageMaker. Again, alleviating some of the struggles with purchasing or figuring out how to get this budget approved.
Overall, you can kind of have this nice little table here that compares pure SageMaker to RStudio Pro in addition to the core SageMaker platform.
Both SageMaker and RStudio Pro on SageMaker are flexible computing environments, backed by ephemeral EC2 instances with Docker images loaded into them. Both have a dedicated home directory that persists across sessions: it's dedicated for persistent files, R and Python package installs, configuration settings, and the different things that keep your user experience the same from session to session.
The main difference at this point is that SageMaker is really like a hosted version of Jupyter or JupyterLab. And then RStudio adds the additional RStudio IDE experience on top of SageMaker's compute.
For SageMaker with JupyterLab, you're primarily using Python, so you can use packages like Boto3 to interact with AWS and its different services. And then in R, there's the paws R package that allows you to do very similar things and interact with all sorts of AWS services, in addition to SageMaker.
And then lastly, both SageMaker and RStudio Pro inside SageMaker allow you to leverage specific SageMaker machine learning capabilities. Both really use the SageMaker Python SDK, or software development kit. In SageMaker, you primarily call it natively through Python. In RStudio, you wrap the same SageMaker Python SDK with the reticulate R package, allowing you to call Python directly from R.
So really, the environment is the same for both of them, but providing kind of a different user interface with the same packages and persistent files behind the scenes.
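As a quick illustration of the R side of that equivalence, here is a minimal paws sketch for talking to S3 from R. The bucket name is a placeholder, and credentials are assumed to come from the environment or an attached IAM role.

```r
# List the objects in an S3 bucket using the paws R package.
# "my-example-bucket" is a placeholder; paws picks up credentials
# from the environment or the attached IAM role.
library(paws)

s3 <- paws::s3()
resp <- s3$list_objects_v2(Bucket = "my-example-bucket")

# Each entry in resp$Contents describes one object (Key, Size, LastModified)
keys <- vapply(resp$Contents, function(obj) obj$Key, character(1))
print(keys)
```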
Ideal customers and use cases
As far as the customers that we've seen be ideal fits for RStudio Pro on SageMaker: number one, users who are already invested, or are investing, in the AWS ecosystem. Maybe they're already using SageMaker but want to leverage R there and really like using RStudio. Or maybe they're using RStudio on a desktop and want to make use of the robust machine learning tooling and scalable compute available inside SageMaker.
Additionally, because SageMaker is a hosted environment and the AWS team is actually managing the infrastructure for you, users who lack IT support or don't necessarily want to commit the resources needed to maintain a fully custom environment can go through this with kind of a ready to go, spin it up kind of very quickly environment.
And then lastly, really emphasizing that for RStudio desktop users, at a certain point, your laptop or your computer or even your workstation is going to run out of compute. Maybe you need 64 gigs or 128 gigs of RAM. With SageMaker, you can scale up kind of as large as it'll allow you to go. And then whenever you're done, you can actually shut down that environment and stop accruing cost. So you can have these kind of cost efficient scaling out as needed.
Additional benefits and trade-offs
As far as some of the additional benefits of RStudio Pro on SageMaker: since RStudio is a managed service, customers don't have to worry about managing installations of R and Python, or even installing Posit Workbench / RStudio Pro itself in the environment. It's already ready to go.
Additionally, our software product called Launcher is working behind the scenes with SageMaker internally that allows you to create these individually sized and isolated sessions. So you can have a gigantic session, run a model, leave it running and then go to a much smaller session with say like eight gigs of RAM and then do some computation there or exploratory data analysis while your large model training is occurring.
Additionally, RStudio Pro on SageMaker has direct integration with identity access management through AWS. So you can use a lot of the other AWS services without needing to reauthenticate or move around a whole bunch of credentials, kind of log in once and use a lot of other different things.
You can also use, again, these pre-allocated cloud budgets. So it's pretty often that we see some of these customers have a lot of money tied up into their cloud budgets, and they can use a small portion of that to purchase these RStudio Pro licenses to be used within SageMaker.
And then lastly, reiterating that for some of our compliant and secure needs of customers, that SageMaker already has SOC2, FedRAMP, HIPAA compliance and is adding additional compliance kind of certifications over time.
I do want to call out some of the trade-offs or differences, though, between a self-managed Posit Workbench that includes RStudio Pro as well as JupyterLab and VS Code versus RStudio Pro on SageMaker.
Number one, self-managed Posit Workbench provides complete flexibility in how you set up, configure, and administer Workbench. You do need an IT admin, but you can build a really custom environment with access to all the different things you need. This does come at the cost of responsibility: you do have to have that IT admin persona come in to maintain and update the environment and the infrastructure. But if you really want that customization, that's a good route. And you can run it on an Amazon EC2 instance, or even use clusters such as EKS for Kubernetes or AWS ParallelCluster for Slurm on AWS.
Importantly, RStudio on SageMaker is this managed service. So the responsibility for the software and the environment resides with AWS. You actually work through the AWS support team, as well as in some cases, the Posit support team, if you run into trouble, instead of the individual customer kind of only working through their IT admin.
And lastly, while Posit Workbench has access to multiple editors, it has RStudio, it has VS Code, as well as Jupyter and JupyterLab. The SageMaker environment has the RStudio IDE, as far as the RStudio Pro integration. And then Amazon SageMaker Studio also provides a managed kind of Jupyter, JupyterLab experience. It does not have the same VS Code integration that Posit Workbench does.
So there are some different trade-offs and just kind of design decisions as far as what you want. If you want this really custom environment to kind of set it up exactly as you want, or you have different needs, you can self-manage or install your own software into traditional infrastructure, or you can go this fully managed route through Amazon SageMaker.
Before I turn it over to my colleague, I'm also going to share some of the resources. So if you do have questions about getting started with SageMaker, you can go to Posit.co slash SageMaker. This has all the information about getting started with RStudio Pro on SageMaker, maybe doing a free trial where you can evaluate its use for your team. Or you can even talk to a salesperson here or a colleague at Posit to talk about doing some of the integration work as needed.
With that, I'm going to pause, stop screen sharing and turn it over to my colleague, Gagan Deep, who's going to cover some of the more technical details and the integrations and show off some demos about how to use AWS and RStudio Pro together.
Live demo: RStudio Pro inside SageMaker
Awesome. Thank you, Tom. Hi, everyone. I'm Gagan Deep Singh. I am a Senior Solution Engineer here at Posit. And today I'm following up on Tom's overview by showing you how to use this RStudio Pro integration from within SageMaker.
So to begin with, I land in our own internal SageMaker domain, which is maintained by us. And then I can enable the RStudio IDE from within that SageMaker domain. Now, in your case, this will look different on how you reach the RStudio integration. It will be managed by your AWS IT admins. And they will share a domain with you after they've enabled RStudio in it. So once you have logged into AWS and accessed that domain, you will see the same screen that I'm seeing here.
So now this is RStudio Workbench running within the SageMaker domain. It's not on my computer; I'm accessing it through a URL. And this is an additional screen I get since I'm using the professional product, from which I can run multiple sessions of RStudio in the SageMaker domain at the same time.
I can start a new session by using the New Session button and give it a name. Right now, as Tom mentioned, the only editor option is the RStudio IDE, so I have no choice here. And this is running within the SageMaker environment, so the cluster is also fixed at SageMaker.
The options I can change according to my use case are the instance type and the image. By instance type, I can choose which EC2 instance to use for the current session. This is the flexibility you get by using RStudio within SageMaker: it is directly integrated into EC2 and the AWS environment, so I can choose an EC2 instance based on the type of work I'm doing. If I plan to do minimal exploratory data analysis, just create some plots and explore some data, I can choose a smaller instance. And if I'm planning to bring in bigger data to do heavy lifting in modeling, I can choose a bigger instance. This flexibility only comes from using something like AWS, and RStudio is integrated into that flexibility.
I can also change the docker image that is being used to run this session. So everything is running as a docker image, as a docker container. There is a default image available, but I can also customize what kind of images I want to use. So I can work with my administrator to create a docker image for different kinds of workflows. So if I have a requirement of using certain versions of packages or certain system dependencies in that environment, I can create different images for that.
For this work, I'm just using the default instance type and the default image, and I've already started a session here. Once I go into the session, this RStudio session looks very similar to what you're used to if you're using the desktop version, but it does come with the additional flexibility of the professional version.
Building a machine learning model: churn prediction
I'm going to run through a small example of building a machine learning model on a sample data set, where I'll be predicting churn at a cellular company based on the customer data: whether a customer is going to churn or not. I've already prepared the code, and I'm going to run through it today. I'm using an R Markdown document, and to begin with, I'm setting up my environment.
I'm using the public package manager provided by Posit, through which I can get package binaries in this session. So you can also customize your repos to get binary packages from the public package manager that will make the installation easier. So I've done that setup, and I'm also loading the different libraries that I'm going to use. So I'll run this bunch of code.
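The repository setup described here can be sketched as follows. The URL is the public default for Posit Public Package Manager; on Linux you would point at a distro-specific path to get binaries.

```r
# Point R at Posit Public Package Manager so install.packages() pulls
# pre-built binaries instead of compiling from source.
# The URL below is the public CRAN mirror; on Linux, use the
# distro-specific path (e.g. .../cran/__linux__/jammy/latest) for binaries.
options(repos = c(CRAN = "https://packagemanager.posit.co/cran/latest"))

install.packages(c("tidyverse", "tidymodels", "reticulate"))
```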
While this is running, I can see that I already have the data set available that I'm going to use for this work. I have saved it in my session in the home directory that is attached to the session. So I have moved it here already, and I'm going to bring it into the session in the churn data frame.
So I've got that, and I'm going to explore the data. This is what my input data for the model looks like. I have information on the customers: which state they belong to, how long they've had an account with us, their area code, how many minutes they've spent on calls, and how many calls they make per day, all the information associated with their cellular account. I also have other features available, like how many minutes they spend per call, how many calls they have made to customer service, whether they have an international plan, and, at the end, whether they have churned or not.
So this is my input data set, and I'm going to build a model to predict the churn for our consumers.
So before I begin with the modeling, as a good data scientist I would like to explore the data I'm working with. I want to see what different patterns I can observe and which features or feature combinations I can use. That will help me later in building the model I want to build.
So I'm going to select the international plan and then plot it to see how many of the people who have churned actually used that international plan feature. This is a plot showing, for the users who have churned, whether they had an international plan or not. The data here looks pretty even, so there's not much to gather from it.
Let me now create a histogram of how many customer service calls they have made. The assumption is that if somebody is churning, they might have some complaints about the service they are getting. If I create this histogram, this is what I get for people who have churned. I can also change the variable and see the same information for people who have not churned, and how satisfied they are, based on that.
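The customer service call histogram might look something like this with ggplot2. The column names (`churn`, `customer_service_calls`) are assumptions about the sample data set; adjust them to your own data.

```r
# Histogram of customer service calls among churned customers.
# Column names are illustrative; adjust to the actual data set.
library(dplyr)
library(ggplot2)

churn_df %>%
  filter(churn == "yes") %>%
  ggplot(aes(x = customer_service_calls)) +
  geom_histogram(binwidth = 1) +
  labs(
    title = "Customer service calls among churned customers",
    x     = "Number of customer service calls",
    y     = "Customers"
  )
```

Swapping `churn == "yes"` for `churn == "no"` gives the same view for retained customers, as described above.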
These are a few examples of this kind of exploratory analysis. There are plenty of feature sets I could have played around with, but for the sake of this call and the workflow, I'm going to stick to these.
Before I jump into modeling, I would like to prepare my data for the model. Right now the data is pretty untidy in that sense, so I'm going to tidy it up, and these are some of the transformations I'm doing. First, I'm just filtering the rows for the customers who have churned. Then, since I plan to use XGBoost in this case, I'm going to prepare some dummy variables for the voicemail plan, the international plan, and the state they belong to. And I'm going to move the churn variable to the front of the data frame so that I can divide test and train properly. So I'm just going to quickly run all of this code.
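One way to sketch that prep step is with the recipes package; the column names here are assumptions about the sample data.

```r
# Prepare the data for XGBoost: one-hot encode the categorical features
# and move the outcome to the first column (the layout SageMaker's
# built-in XGBoost expects). Column names are illustrative.
library(recipes)
library(dplyr)

rec <- recipe(churn ~ ., data = churn_df) %>%
  step_dummy(voice_mail_plan, international_plan, state, one_hot = TRUE) %>%
  prep()

churn_prepped <- bake(rec, new_data = NULL) %>%
  relocate(churn)   # outcome column first
```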
While this code is running: so far, I have been doing what I generally do day to day as a data scientist, either on my desktop or in another professional server product. But the advantage of doing it here is that my compute is flexible. My data set currently is not that big, but if I had a bigger data set and was bringing millions of rows into this session for this kind of analysis, I could increase the size of the compute instance I'm using behind the scenes in SageMaker. So this helps me with flexible compute, but I'm still an R programmer: I'm still using what I know in R, just in a scalable manner.
Now that I've prepared my data set, I am going to divide it into training, validation, and test sets, and as part of this code block, I'm also saving them as CSV files for future reference.
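The split-and-save step might look like this with rsample; the split proportions are assumptions.

```r
# Split into train / validation / test and save as CSVs.
# Proportions here are illustrative. SageMaker's built-in XGBoost
# expects headerless CSVs with the label in the first column.
library(rsample)
library(readr)

first_split <- initial_split(churn_prepped, prop = 0.7, strata = churn)
train_df    <- training(first_split)
rest_df     <- testing(first_split)

second_split  <- initial_split(rest_df, prop = 0.5)
validation_df <- training(second_split)
test_df       <- testing(second_split)

write_csv(train_df,      "train.csv",      col_names = FALSE)
write_csv(validation_df, "validation.csv", col_names = FALSE)
write_csv(test_df,       "test.csv",       col_names = FALSE)
```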
Currently, this is saved in the file system attached to this session. But if I clear out this domain, or something happens on the AWS side, I might lose this data. So one of the things I would want to do is move the input data for my model into a more persistent place than the file system attached to this session. A common data service used in AWS for that kind of work is S3 buckets, and I can interact with S3 buckets from within my code very efficiently, since I'm inside SageMaker. That is what I'm going to do next, once I've written those files.
Interacting with AWS S3 and the SageMaker SDK
In order to interact with the SageMaker session that is running this RStudio instance, I will use the SageMaker Python package, which is what Tom was alluding to before. If I want to interact with different AWS resources like S3, or directly with the SageMaker service, I can do that through the SageMaker SDK.
Now, this is a Python SDK, so you will have to use the reticulate library. If you're not aware, reticulate lets you interact with Python packages directly from your R environment. In order to do that, I need an existing Python installation running along with the session. I will use the terminal feature within RStudio; as you can see, I've already done some work and installed Python into this session using the Python binaries.
This is a Python version that works for me, but if you have requirements on certain Python versions, you can obviously install those for your session. I need this Python installation to also have the SageMaker Python package installed so that I can import it in this session, so I use a tool like pip to install the sagemaker package. I've already prepared the session to use that, so if I run library(reticulate), it will attach reticulate, and I can import the sagemaker package.
So now the sagemaker package is imported.
Now, I'm going to use the SageMaker Python package's functions to proceed, and this requires you to know what SageMaker is as a service. I'm using the SageMaker functions to access the information related to the session, so I need to create a session object. There's also always an S3 bucket attached to a SageMaker session, and I'm going to use that S3 bucket as the place to keep all my data. I can get that information directly from the session object as well.
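A minimal sketch of grabbing the session, default bucket, and execution role through reticulate might look like this; it assumes the sagemaker Python package is installed in the active Python environment.

```r
# Access the SageMaker session, its default S3 bucket, and the
# execution role from R via reticulate. Assumes the sagemaker
# Python package is installed in the active Python environment.
library(reticulate)

sagemaker <- import("sagemaker")

session <- sagemaker$Session()
bucket  <- session$default_bucket()        # S3 bucket attached to the session
role    <- sagemaker$get_execution_role()  # role configured by the admin
```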
Also, every domain in SageMaker needs an execution role when it's created. This is set up by your administrator, and through this execution role they can give you permissions from the AWS side to access different resources. So every RStudio session in SageMaker has an execution role attached to it, and I can access information for that role as well.
Now through this role is how I will access things like S3. So if this was not SageMaker, and I was doing this in my desktop environment, I would have to explicitly provide the AWS secret keys and all the credentials I need to access AWS. But here in this case, I already have a role attached to the session. So as a user, I have one less responsibility. And I don't have to worry about maintaining credentials or any security risks that are associated with it. Like everything is native within SageMaker. So I can directly interact with S3. And this is exactly what I'm going to do.
And now I'm going to save my training, testing, and validation data sets into this attached S3 bucket. So again, as you can see, it was pretty simple. Compared to using a desktop, I don't have to maintain credentials; there's already a role, and I'm using that role to access S3.
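That upload step can be sketched with the SDK's upload_data helper; the "churn/data" key prefix is an assumption.

```r
# Upload the saved CSVs to the session's default S3 bucket via the
# SageMaker Python SDK. The "churn/data" key prefix is a placeholder.
library(reticulate)

sagemaker <- import("sagemaker")
session   <- sagemaker$Session()
bucket    <- session$default_bucket()

s3_train <- session$upload_data(
  path = "train.csv", bucket = bucket, key_prefix = "churn/data"
)
s3_validation <- session$upload_data(
  path = "validation.csv", bucket = bucket, key_prefix = "churn/data"
)
# upload_data() returns the s3:// URI of the uploaded object
```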
Another way: so far, this was done through the SageMaker session that I'm driving programmatically with the Python package. There's also a native R package, if you're not familiar with it, a great package called paws, which provides options to access S3 or other AWS services natively. That also normally requires setting up AWS credentials, but in this scenario I am already within the SageMaker AWS environment, so I can directly use paws to access all the AWS resources.
As an example, I will show you how to read the same data that I pushed into the default S3 bucket, this time through paws. I'm going to create an S3 object with paws and then just read that file, and that will work.
So this is the same file I put in, and I can read it as well. Interacting with AWS resources is very simple, and I can use any other AWS services too, either through the SageMaker SDK, which is in Python, or directly through paws, depending on the functions I want to use.
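Reading the object back with paws might look like this; the bucket name and key are assumptions.

```r
# Read a CSV back out of S3 natively from R with paws.
# Bucket name and key are placeholders; credentials come from the
# session's execution role inside SageMaker.
library(paws)
library(readr)

s3  <- paws::s3()
obj <- s3$get_object(Bucket = "my-example-bucket",
                     Key    = "churn/data/train.csv")

# obj$Body is a raw vector, which readr can parse directly
train_back <- read_csv(obj$Body, col_names = FALSE)
```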
Training and deploying an XGBoost model
Now that I've played around with the data, saved it, made sure it's persistent, and have the training and testing sets, I want to build the XGBoost model. I can either train models natively with anything I'm familiar with, or, since this is an instance of RStudio running on SageMaker, I can interact with the SageMaker service as well.
The SageMaker service has pre-built models available that you can use on any size of data, because it's a flexible way of running those models. These models run as containers, so I can select and size them based on my input data. I am going to use the same SageMaker Python API that I've already loaded to build this XGBoost model directly in SageMaker.
Since I'm in SageMaker, I can directly access it, so I'm just going to run this code. This is just preparation for the model, and these are all SageMaker API-specific commands, so there is a need to know what you're working with. But there's great documentation available from SageMaker on how to use this. Here, I'm going to build a container that will use the XGBoost framework from SageMaker.
I'm going to give it the output location, my role, what the input type is, and what size of EC2 instance it should run on, so I can optimize that as well. I can also give it hyperparameters: I want to use a logistic objective, and I want to run 100 rounds of the XGBoost trees. Again, these are all settings that you can tune based on your workflow, and they're all flexibilities available in SageMaker.
So once I've prepared the XGBoost model, I need to start the model training, which is run through this command. This takes quite some time, so to save time, I have already trained the model beforehand; but this is the command you would use. Once you have trained the model on the input data set, you deploy it on SageMaker so that you can interact with it. This gives your model a unique identifier that can be used within this code, or from any other service or language. I have also already deployed that model, and these are the outputs of that.
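The estimator setup, training, and deployment steps described above can be sketched like this through reticulate. The region, framework version, instance types, and S3 paths are all assumptions to adjust for your account.

```r
# Build, train, and deploy SageMaker's built-in XGBoost from R.
# Region, version, instance types, and paths are illustrative.
library(reticulate)

sagemaker <- import("sagemaker")
session   <- sagemaker$Session()
bucket    <- session$default_bucket()
role      <- sagemaker$get_execution_role()

# Resolve the container image for the built-in XGBoost framework
container <- sagemaker$image_uris$retrieve(
  framework = "xgboost", region = "us-east-1", version = "1.5-1"
)

estimator <- sagemaker$estimator$Estimator(
  image_uri         = container,
  role              = role,
  instance_count    = 1L,
  instance_type     = "ml.m5.large",
  output_path       = paste0("s3://", bucket, "/churn/output"),
  sagemaker_session = session
)

# Hyperparameters: logistic objective, 100 boosting rounds
estimator$set_hyperparameters(objective = "binary:logistic",
                              num_round = 100L)

# Wrap the uploaded CSVs as training channels, then fit and deploy
train_input <- sagemaker$inputs$TrainingInput(
  s3_data = paste0("s3://", bucket, "/churn/data/train.csv"),
  content_type = "csv"
)
valid_input <- sagemaker$inputs$TrainingInput(
  s3_data = paste0("s3://", bucket, "/churn/data/validation.csv"),
  content_type = "csv"
)

estimator$fit(list(train = train_input, validation = valid_input))
predictor <- estimator$deploy(initial_instance_count = 1L,
                              instance_type = "ml.m5.large")
```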
And in order to see where the model is, you can go to the SageMaker UI. Within the SageMaker UI, you can see this is the model I've deployed, the XGBoost model, and you can also test or work with it. You can get its ARN if you want to interact with it from outside of SageMaker as well. Within SageMaker, you can make it part of your different workflows. So now your model is built in R, but is available in SageMaker as an endpoint.
Now that I've done this, I can make predictions on the same model. I can do it either programmatically or through the UI. So this bunch of code already does that. I already have these predictions based on the actual value and what is the predicted churn, and I can validate that by creating an ROC curve as well.
So this is a modeling workflow that you can use with SageMaker. And then once you're done with the endpoint, you can obviously delete it, or you can keep it up and running based on your use case.
Alternative workflow with vetiver
So this modeling workflow so far relies heavily on knowing the SageMaker service natively, and as an R user coming in from the desktop versions, that's a difficult entry point. So there's also an alternative workflow available for you, through the R package called vetiver. If you've not heard of vetiver, it's a machine learning operations framework that helps you build your models natively in R or Python and then deploy them to different endpoints, either on Posit Connect or AWS SageMaker. So you don't have to rely on SageMaker's API; if you're within SageMaker, vetiver can help you easily interact with it and deploy there as well.
For deploying a vetiver model, the process starts with actually building a model. I'm using a different kind of model here for vetiver: a logistic regression, but vetiver supports all the models within tidymodels. So you can make your own selections based on your workflow and define the model you want to deploy. I have this logistic regression model, and then I need to convert it into a vetiver model object so that vetiver can interact with it, which is what I'm doing here.
And vetiver needs the model to be pinned so that it can iterate on different versions of the model, so I will use the pins package here. The same S3 bucket that I'm using for my data, I can use for my model as well. So I'm using that bucket and the vetiver_pin_write() function to pin this model to the default bucket associated with the session. But if in your case you're interacting with some other resource like Posit Connect, and you want to save your pin on Posit Connect or another S3 bucket, you can change that and save it there too.
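A sketch of that vetiver-plus-pins step; the fitted model object (`churn_fit`) and the bucket name are placeholders.

```r
# Turn a fitted tidymodels model into a vetiver model and pin it to S3.
# `churn_fit` and the bucket name are placeholders.
library(vetiver)
library(pins)

v <- vetiver_model(churn_fit, "churn-model")

board <- board_s3(bucket = "my-example-bucket")
vetiver_pin_write(board, v)
```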
Now, this is where I would say the selling point of using vetiver in this scenario is: vetiver has the vetiver_deploy_sagemaker() function already available, which does all the hard work of what we have done so far in terms of deployment. So you don't need to know the SageMaker package; you can directly use this function, which will do all of that for you.
So I will provide it the board where the model is pinned, the name of that pin, the same instance type I want to run it on, and any other additional arguments. What this vetiver_deploy_sagemaker() function is doing in the backend is three things, which is what we already did step by step through the SageMaker API: it first builds the Docker image needed to run this model, then it creates the SageMaker model, and then it deploys the endpoint for you. All three steps that you could also do separately, this function does for you. So I only need this one line of code, and the model that I have locally is converted to a SageMaker model and deployed on SageMaker.
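That single deployment call might look like this; the board, pin name, and instance type are assumptions.

```r
# Deploy the pinned vetiver model to SageMaker in one call.
# vetiver_deploy_sagemaker() builds the Docker image, creates the
# SageMaker model, and deploys the endpoint. Names are placeholders.
library(vetiver)
library(pins)

board <- board_s3(bucket = "my-example-bucket")

endpoint <- vetiver_deploy_sagemaker(
  board         = board,
  name          = "churn-model",
  instance_type = "ml.t2.medium"
)
```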
Again, to save us time, I've already done that deployment. Like the other model we deployed, my model is also deployed and available to interact with. So this is a great way to get into working with SageMaker if you don't know the SageMaker SDK but you do know vetiver, which builds on the basic tidymodels workflow. There are multiple ways to get into SageMaker, and this workflow can now be extended to create other workflows within SageMaker, since I've already done the modeling.
So that's all I had and wanted to show, and I'm going to pass it back to Tom to talk more about this. Thanks for your time.
So as we wrap up for the day, I'll go back to this slide on the resources that are available. We walked through a bunch of different ways that R and RStudio can be used inside SageMaker, both natively by using some of the SDKs and software tooling that SageMaker provides, as well as some of the tooling that we provide here at Posit through things like tidymodels and vetiver.
If you're interested in learning more about SageMaker, you can again visit posit.co slash SageMaker for an overview, or even to schedule a call to talk about it with us. Gagan and I will also be here in the chat and can take questions from those of you viewing about SageMaker and some of the integrations that we've done with RStudio.
With that, I'd love to thank you again for watching along with us. We'll take questions now and we'll see you soon.
