James Blair | Using RStudio on Amazon SageMaker | RStudio
Using RStudio on Amazon SageMaker Presented by James Blair *please note that this meetup will share a new update about our professional product, RStudio Workbench. All are welcome, but just wanted to make sure to share that upfront :) https://www.rstudio.com/products/workbench/ Agenda: Presentation & Demo - 40 minutes Q&A with both RStudio & AWS SageMaker team - 20 minutes Amazon SageMaker helps data scientists and developers to build, train, and deploy machine learning models quickly by bringing together a broad set of capabilities purpose-built for machine learning. RStudio recently announced the release of RStudio on Amazon SageMaker, developed in collaboration with the SageMaker team. This brings a fully-managed RStudio Workbench IDE into the powerful SageMaker environment. With this functionality, Data Scientists can quickly get to work, spinning up their favorite development environment on SageMaker, choosing from a wide array of instance types as needed. They can get access to their organization’s data stored on AWS, as well as all of SageMaker’s deep learning capabilities. As a fully managed offering on Amazon SageMaker, this release makes it easy for DevOps teams and IT Admins to administer, secure, and scale their organization’s centralized data science infrastructure, using familiar AWS tools and frameworks. Here are James' slides as well: https://github.com/blairj09-talks/rstudio-sagemaker-webinar Answers to the Q&A are shared in this blog post: https://www.rstudio.com/blog/using-rstudio-on-amazon-sagemaker-faq/
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Awesome, great. So today our topic, like Rachel mentioned, is RStudio and SageMaker. And maybe, I don't know if an apology is the right phrase here, but I have a two-year-old son who's very into Spider-Man right now. And as a result, I recently rewatched Into the Spider-Verse, which is one of my all-time favorite movies. And that's like heavily influenced the direction that these slides have gone today.
So hopefully that makes sense. And if not, we can just focus on the content and less on the style that they're presented in. But I actually kind of thought this fit because as we go through this, I consider both RStudio and SageMaker both kind of superheroes in their own way in the data science space. And we're excited about this collaboration and the ways in which they can complement and kind of work with one another. And we'll talk through that today.
Again, as Rachel said, my name is James Blair, and I work as a solutions engineer here at RStudio.
Overview of the agenda
Just to give kind of an outline of the plan for today. First, we're just gonna talk briefly about RStudio Workbench. I'm operating under the assumption that many of us are probably already familiar with RStudio or RStudio Workbench to some degree. But for those who may not be, we'll just run through kind of a brief introduction of what that tool is. We'll also do the same for SageMaker. And then we'll talk about RStudio Workbench and SageMaker together and what that offers. We'll look at a little bit of a demonstration and then provide some resources for you to continue learning on your own and then answer any questions that come up through the course of the presentation.
RStudio Workbench
So RStudio Workbench is at its core an R-centric development environment. If you've used the RStudio open source software, either at the desktop level or through RStudio Server, you're likely familiar with this experience. You have a development environment that gives you lots of tooling around working with R: ways to manage packages, ways to manage your environment, edit files, write code, debug. All these tools are part of the RStudio development environment.
RStudio Workbench gives you this environment accessible through a browser. It's typically something that organizations will install, host and manage within their own infrastructure, within their own firewalls. We'll talk about how that model's a little bit different with SageMaker throughout the course of our presentation today.
One of the things that RStudio Workbench provides that you don't find in the open source version is you have the opportunity to support multiple versions of R concurrently. And there's some additional functionality that's also available related to security and scalability in RStudio Workbench itself.
Another kind of important component of the RStudio development environment, especially as it relates to the topic today, is that there are tools there for Python interoperability. So the ability to work in R and Python kind of simultaneously is something that's available. And that's not restricted just to RStudio Workbench. That's really just a feature of the R language, but there's some tooling inside of RStudio, the development environment, that makes it a little bit easier to work with Python frameworks within an R context. And we'll talk about why that's important here in a moment.
Amazon SageMaker
Amazon SageMaker, on the other hand, is a fully managed machine learning service. There are a large and growing number of services available through SageMaker. In fact, at AWS re:Invent just last week, they announced some new functionality and some new features that are being made available on the platform. So it's continuing to grow, continuing to evolve. It's one of the fastest growing AWS products that Amazon offers.
And for good reason, it's highly scalable. It provides access to a lot of native machine learning algorithms. You have an integrated machine learning environment for doing development and testing and model deployment and things like that. And then it gives you a really nice production ready way to deploy machine learning models, track those models, monitor them, kind of everything within the scope of model development and machine learning operations is available through the SageMaker platform.
Combining RStudio and SageMaker
With this said, it might be kind of easy to look at these two different products and these two different platforms and to kind of pit them against one another, right? SageMaker provides a machine learning kind of development environment that's built on top of JupyterLab. They provide lots of really nice functionality. And RStudio, on the other hand, provides a development environment heavily focused on the R user.
Thankfully, instead of putting these two things against one another, instead what's happened is we've combined the two, right? We've said, look, what if we have RStudio as an available editor, as an available tool through the...
And this effort has largely been driven by the R community. As our customer base has grown, both professionally as well as just the open source users who use RStudio products every day, there's been a growing interest in having some sort of solution offered through the SageMaker platform. And for good reason, right? SageMaker provides all these functionalities around machine learning and tooling that R users are likely very attracted to. But the RStudio IDE is the place where R users feel most at home.
And so we've worked closely with the SageMaker team and we're happy to announce that RStudio is an editor and an environment that is now natively available through Amazon SageMaker. And we'll take a look at what that looks like in just a moment.
And we're really excited about this, right? We think this is an awesome opportunity, not only for us, but for R users kind of everywhere within organizations throughout the globe to use a tool in an environment that they're comfortable and familiar with with the enhancements provided by the SageMaker platform.
So a little bit of a summary of what this is, right? It's a managed RStudio solution, which means that running RStudio Workbench on SageMaker means that as an organization or as an individual, you do not need to set up, install, maintain RStudio yourself. Instead, that's handled on the SageMaker side. It's simply something that you choose to opt into when you set up your SageMaker domain.
It uses, if you have an existing RStudio Workbench license, you can use that license and you can carry that over into SageMaker. It gives you access to scalable Amazon resources, easy access to additional SageMaker features and capabilities like some of the machine learning abilities and model deployment and management and other things that are part of what SageMaker offers to users. And you have flexible compute resources for the different workloads that you might be involved in and engaged in.
And we'll take a look at some of these things in an example here in a moment. So here's a just kind of quick little video, and we'll actually look at this in real time in just a moment, but just to give you an idea of what this looks like in SageMaker on AWS, you go and open up the launch app dropdown, click on RStudio if it's been configured through your domain, open up, you'll have the RStudio Workbench landing page. And from there, you can start a new RStudio session. It's really, you know, like as simple as that is, the powerful thing is that it's that easy to get in, to get working inside of RStudio. Again, fully managed, hosted, maintained by SageMaker.
And then just kind of a word that I think helps highlight the significance of this for us, but also hopefully the significance of this for the R community in general. This is a quote from Tareef, the president of RStudio. And he said, "RStudio is excited to collaborate with the Amazon SageMaker team on this release, as they make it easier for organizations to move their open source data science workloads to the cloud. We are committed to helping our joint customers use our commercial offerings to bring their production workloads to Amazon SageMaker and to further collaborations with the Amazon SageMaker team."
And I'll kind of touch on that last component here, which for me is kind of a personally exciting aspect of this whole thing. And that is that we're not done, right? We're excited about what we've started here. We're excited about what's available today, but we're also excited as we look towards the future and what we anticipate being made available in the near future. And we'll discuss some of what that will look like here towards the end of our conversation today.
Live demo
So with all this said and done, let's go ahead and jump in and let's look at a demo. Let's investigate what we can do, what's available to us in this integration. So I'm going to go ahead and switch over here. Let me see if I can move zoom controls out of the way. So I'm going to open up, here's my SageMaker domain on AWS. I can see my user here inside of SageMaker. And then over here, I can go to launch app and I'll see that I have three different options here. Studio, which is the native SageMaker interface, RStudio, which is what we've been talking about here today. And then Canvas is the new tool that was just announced last week that brings some kind of no-code functionality into SageMaker.
If we start with the traditional experience, just to give you an idea, if you've never been exposed to SageMaker before, and we open up SageMaker Studio, this is the kind of traditional interface through which developers will interact with SageMaker. Like I mentioned previously, it's built and heavily influenced by JupyterLab, but it's also heavily modified to provide lots of integration with the SageMaker platform and the tooling that that platform provides.
So here we can see, I have different tiles that I can scroll through. I have different projects that I can start. I have different kind of work streams that I can work on or different styles of projects that I can work on from within this environment. Now, this has been an environment that's been offered by SageMaker for quite a while. And R users may have experienced this environment themselves. There's been support for R, the language, inside of SageMaker for a while. And typically that interaction happens through Jupyter Notebooks.
Now, if we come back to our SageMaker landing page and we go ahead and click on RStudio, let's open this in a new tab here. Now we have, and we get dropped directly into this environment that if you've used RStudio Workbench before is likely familiar to you. Here we have our landing page where we can select sessions that we wanna work on. If we have currently running sessions, we can create new sessions if we want to. We can view projects that we've previously opened over here on the right-hand side. And we have some additional control over suspending or quitting the sessions that are running.
We're gonna be looking at this currently running session, this SageMaker webinar here in just a moment. But before we jump in there and look at some of the content there, I wanna go through the new session process just to kind of showcase what's available here.
We'll go ahead and open up new session. Here we can give this session a name, example, RStudio SageMaker session. That's too many characters. We'll just say example, RStudio SageMaker. There we go. And then here we can select the editor. The only option available is RStudio. So that's one thing to be aware of. If you are familiar with RStudio Workbench, you may be aware that it provides access to multiple editors beyond just RStudio. Those editors include Jupyter Notebooks, JupyterLab, VS Code. With the SageMaker and RStudio integration, the only editor available through RStudio is the RStudio editor, the RStudio development environment.
The other option that we have here is Cluster. Again, the only option available to us is SageMaker. And this is just kind of an interesting note, a little bit of a side note, but I think it's worth noting. And that is the fact that Amazon is one of the first groups that's taken the launcher, which is a component of RStudio Workbench that allows it to integrate with other execution engines like Kubernetes. But Amazon's one of the first groups that's taken that and created a custom backend for it. And that custom backend is this SageMaker backend. That's only available through the SageMaker integration.
But once I've selected SageMaker as my cluster, I can then define, we'll first talk about this image. I can define the Docker image. I want this particular session to be run on. As it stands currently, there's only one default image available, but that's something that will change in the near future. And then the instance type here is where I will select the EC2 instance or the Amazon compute instance that I want this particular session to run on. And there's a whole host of options. And this is one of the great kind of benefits of this architecture.
Because one, I don't have to, again, like I mentioned previously, as an organization, I don't have to manage and maintain this, right? This is something that's set up as part of my SageMaker domain. SageMaker is the owner, the maintainer of the RStudio Workbench environment. But automatically out of the box, if I go this route, I have a very flexible kind of compute engine that I can lean on. Because I can decide when I start a new session, what type of resources I might need. Do I need heavy compute? Do I have a very memory intensive process? Am I gonna be doing some really heavy model training with very big data? And I can make an intelligent selection at this point about what type of EC2 instance I want this particular session to run on so that I have the appropriate resources available to me as I run this session.
The other thing that's worth noting here is that as multiple users are engaged with RStudio through SageMaker, they're working kind of independently in their own environments. So that if one user does something in their environment, for example, reads data in that consumes all available memory or consumes all available CPU, they don't impact or negatively impact the other users because they're all kind of in their independent environments.
I'm gonna go ahead and select the default image here, start the session. We'll see a new session starting up right here. Now, because we're provisioning a separate new environment when we come in and start a session inside of SageMaker in the way that we've just done, it can sometimes take a long time to start this initial session because it takes some time to provision the environment, make the Docker image available, get that image running in a container, get everything up and running so that we have access to our RStudio session. But once that initial startup is complete, that session is available and subsequent sessions will start much more quickly.
And here we see this session didn't take too long to get up and running. Let's go ahead and come in here and just see what we have available. And again, if you've used, let me zoom in a little bit here.
If you've used RStudio previously, whether at the desktop level, open source on the server, RStudio Workbench, in any case, this environment likely feels very familiar to you. I've got packages that I can scroll through down here. I can navigate my file system here. I can look at help for different functions. I can navigate and explore my environment. I've got the console here where I can execute R commands. I have this document pulled open where I can, I've got my editor where I can open up R scripts, R Markdown documents, anything that I want to work with in this context.
A couple of things, just technically speaking, that are worth noting. One, SageMaker automatically mounts a home directory into every session. So you'll see down here in the bottom right-hand corner when I look at files, I have a home directory available to me, and this home directory is persistent. So this is nice because this means that if I am working on a project, let's say that I start a project in SageMaker and I realize part of the way through that I've under-provisioned my environment, I didn't give myself enough resources. And if I want to be able to conclude my analysis, I need to do it in a more heavily provisioned environment.
Well, I can terminate my existing session. I can save my changes, first of all, terminate my existing session, start a new session with a new, bigger environment based on what I've learned from the previous session. And then once that environment starts up, I'll have access back to the same files I was working on previously within my home directory. So these files follow you around. This also means by extension that if I install an R package, right? If I'm in here and I say install packages, let's do, yeah, let's install the line package. Then this package will install into my home directory as a user package. And then it will be available to me persistently from that point forward. So I don't need to reinstall packages every time I start a new session in SageMaker. Those packages are persistent and follow me into whatever session I happen to be running on.
And we can see those packages installed. Here, we can see the user library, which holds the packages I've installed to my own home directory. And then below that, we can see the system packages that come with the SageMaker image. So that default image that we selected contains a large number of kind of commonly used data science packages. So you'll see in here, we have, let's scroll down a little bit. We have the tidyverse in here. We have data.table in here. We have a number of kind of common data manipulation and data modeling packages that are made available here inside of the SageMaker image by default.
Okay, so while this is continuing to run down here, let's go ahead and look at a couple of other things that we can do, right? So we're in RStudio and for R users, this is an environment that we're likely comfortable in. So I can, if I want to, from in here, I can start a new Shiny application if I wanted to.
We're gonna open RStudio back up and drop into our, if we can, maybe we've frozen everything up.
Okay, once this pulls up, we're gonna drop into our webinar environment that we have set up already, the running session that we currently have for the RStudio SageMaker webinar. Let's see if we can get this to unfreeze, maybe.
Of course, live demos are always exciting and things never go as planned.
If not, we'll switch to another environment and try it in there. This is my fault for wandering off script, that's what I get.
There we go, okay. All right, so let's come in. We'll leave that one running, but let's come into this environment, this session that we have running. Again, one of the advantages is that other session's kind of tied up, but we can drop into this session and go from here. So again, I'm in my RStudio environment. Let's just examine a couple of things that we can do just to verify that things kind of work and behave the way that we would expect them to.
So if I start, if I create a new file, I can create a new Shiny application. We'll call this Shiny example. Go ahead and create that. I can run this here within SageMaker. I'll see, in fact, let's run it in the viewer pane here. So I can run this here. I can see the behavior that I would expect, which is I have my rendered Shiny application running over here. I can do what I would expect to do, which is interact with that application. If I wanted to, I could then publish this application to a place like RStudio Connect or shinyapps.io.
Obviously this application isn't of any real consequence or significance, but this is evidence that if I wanted to, and this is likely the case, I can use SageMaker to do things like fully develop my internal Shiny application that I might then share on our internal RStudio Connect platform or some other mechanism. So Shiny works as I would expect it to. Tools like Plumber also work. If we come in here and say, let's start a new Plumber API, we'll call this example Plumber. We can see my Plumber file running here. I can run this. In fact, let's do that in the viewer pane again. Here we go. So I can run this API in here. And again, I have the behavior that I would expect, right?
And the nice thing about this, and the reason that I'm going through kind of these simple examples is to highlight that once you're into this environment, it works as you would expect. I'm able to install R packages. I'm able to write R code. I'm able to interact with R documents in the way that I would expect, whether those are source code files, like a Shiny application or just a script file that I'm working on, or something like an R Markdown document or a Plumber API or any number of other things.
Using SageMaker features from RStudio
So now that we've kind of verified some of this basic functionality, I think the natural next question, and this was kind of my next question once I started getting into this and becoming more familiar with SageMaker, and that was, okay, I understand the significance and the advantage from the administrative standpoint. I don't need to maintain infrastructure. I have a flexible environment. I've got scalable compute resources. I have a lot of convenience on the administrative side when I allow SageMaker to manage the environment for me. But as an R user, what do I get? Do I just get RStudio? It's just running somewhere else?
Because in most cases, I don't, as the R user, I don't necessarily care about where RStudio is running. I just want to make sure that I have access to it. But the question in my mind was, is there something that I can now do that was difficult to do before? And the answer to that is yes. There's a lot of functionality within SageMaker that becomes very easy and kind of natural to work with once you're inside of this environment.
And so we're going to pull open an example document here. I'm not going to spend a ton of time. And just another note, all of the kind of code and everything that we look at here is available on GitHub. And I'll share a link at the end of the presentation that will take you to the GitHub repository. The slides will be there. Any source code or examples or anything that we look at will be there as well. So don't feel like you need to frantically try to copy and paste and follow along. Everything's available and will be shared after the fact.
So I'm going to open up, this is an R Markdown notebook here. So we're not going to walk through this entire thing. It's a fairly lengthy document. And I'll be honest, this is pulled from an example, Jupyter notebook that Amazon has provided. So there's a GitHub repository that Amazon has that contains a whole host of SageMaker resources, examples, getting started guides, all kinds of resources to become familiar with some of the SageMaker features. So this particular example here is pulled directly from a Jupyter notebook that they provide that walks through how to interact with SageMaker via R.
And so I want to highlight some of the patterns. And again, I'm not going to walk in depth through everything that's here, but I'm just going to highlight some of the patterns and functionality that's available and how we make use of it from within this RStudio environment that happens to be running on SageMaker.
Now, the first thing that we're going to do, let me just, so we can start clean, I'm going to restart my session here. Okay. Now, I mentioned this previously as one of the strengths of the RStudio development environment. And that was the fact that it works really well as kind of an, not a front end, but it provides some tooling around Python interoperability, right? If I'm working with Python and R concurrently within the same document, I have some functionality that can help me out with that. And that becomes beneficial because one of the easiest ways to work with some of the SageMaker tooling is through the SageMaker Python SDK.
Now, I'm not, I don't spend a lot of time in Python. I'm much more comfortable in R, but fortunately the reticulate package, which allows me to bounce back and forth between R and Python, could be used to allow me access to this Python SDK from my R session. And so that's what we're going to do here at first. We're going to load in the reticulate package. We're going to bring in the tidyverse as well so that we can use the tidyverse for some things later on. And then I'm going to bring in this SageMaker package that's available in Python. So here, this import function comes from reticulate. This is loading that SageMaker SDK into my R session as this SageMaker object. I'm going to go ahead and run this.
This will load up. I have already set up my Python environment. The default image for RStudio on SageMaker has reticulate already available at the system library. So everything should be set up and configured so that you can just kind of hit the ground running. And if we open up my environment here, we can see I've got this SageMaker object that's actually the SageMaker module from Python. So this SageMaker object in my R session now gives me access to all the tooling that's available in that SageMaker Python module.
Now we're going to create a couple of different resources that we can use as we go through this process of training and tuning this model. So we're going to create this session object and this bucket that identifies an Amazon S3 bucket that we're going to use to store artifacts, data and objects that we create throughout the course of this document.
And then one of the really, I think, convenient features of running RStudio directly inside of SageMaker is the fact that I don't have to, I'm already permissioned, right? Like the role that I have when I open up and start running SageMaker is the role that I assume whenever I execute any of these SageMaker commands. So I don't need to reauthenticate myself to Amazon. That permission is already scoped and attached to me as I start running here.
So I can grab that role right here by running this get execution role function. And notice as I'm interacting here, right? I'm using this SageMaker module that we see here. And this SageMaker module is a Python module that I've loaded into my R session. And so I access the methods and different objects attached to that module with this dollar sign, just as if I was manipulating a list or other object here inside of R. So here I'm running these different functions or these different methods off of SageMaker to get this functionality.
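The setup described here can be sketched directly in Python, the language the SDK itself is written in. From R, reticulate exposes the same calls with `$` in place of `.`. This is only a sketch under assumptions: the names `SageMakerContext` and `sagemaker_context` are made up for illustration, using the session's default bucket is an assumption (the talk only says a bucket is identified), and the function works only where the SageMaker Python SDK is installed and an execution role is attached, as in an RStudio session on SageMaker.

```python
from typing import NamedTuple


class SageMakerContext(NamedTuple):
    """Handles that later steps need; the class name is illustrative only."""

    session: object
    bucket: str
    role_arn: str


def sagemaker_context() -> SageMakerContext:
    """Collect the session, an S3 bucket, and the execution role.

    Assumes the SageMaker Python SDK is installed and that the code runs
    somewhere an execution role is already attached.
    """
    import sagemaker  # imported lazily so this file loads without the SDK

    session = sagemaker.Session()
    bucket = session.default_bucket()          # R: session$default_bucket()
    role_arn = sagemaker.get_execution_role()  # R: sagemaker$get_execution_role()
    return SageMakerContext(session, bucket, role_arn)
```

Because the role comes from the environment, nothing in this sketch asks for credentials, which is the convenience the talk points out.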
Training an XGBoost model
Okay, so what we're gonna do is, and again, a lot of this comes directly from this Jupyter Notebook, but we're gonna pull down this abalone dataset that comes from the UCI Machine Learning Repository. And we're gonna do a little bit of manipulation on that dataset. Not anything super crazy, we're gonna download it. We'll notice that if we scroll over, the sex column here is a character column, but it really should be a factor because we only have three different levels here. So we'll go ahead and change that to be a factor. We can look at a summary of what this dataset now looks like. And we see that in height, we have some that have zero height. And we wanna take a look at what those are. So we can go ahead and use ggplot to investigate those. We see a couple of outliers here. And then we have some infants over here and we've added some jitter to this just so that we can gain more visibility into the data. But we can see that we have a couple of observations over here that are infants that have zero height reported.
So we're gonna go ahead and remove any observations where the height value is zero. And now we're gonna go through and we're going to prepare this dataset for training. So we're gonna come in here and we're going to create these kind of one hot encoding columns for the sex column and then remove the sex column itself here.
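A minimal Python sketch of this cleanup step, assuming each observation is a plain dictionary keyed by lowercase column name. The talk does this in R with the tidyverse; the function name `prepare_rows` and the indicator column names `female`, `infant`, and `male` are illustrative and not taken from the talk.

```python
# The three levels of the abalone `sex` column mapped to indicator columns.
# The indicator column names are an assumption made for this sketch.
SEX_COLUMNS = {"F": "female", "I": "infant", "M": "male"}


def prepare_rows(rows):
    """Drop zero-height observations and one-hot encode the `sex` column."""
    prepared = []
    for row in rows:
        if float(row["height"]) == 0:
            continue  # a height of zero is treated as a recording error
        # Copy every column except `sex`, then add one 0/1 column per level.
        encoded = {key: value for key, value in row.items() if key != "sex"}
        for level, column in SEX_COLUMNS.items():
            encoded[column] = int(row["sex"] == level)
        prepared.append(encoded)
    return prepared
```

Each surviving row ends up with exactly one of the three indicator columns set to 1, and the original `sex` column is gone.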
The other thing that we're gonna do is we're gonna move rings, which is the target of our machine learning model that we're going to build. We're gonna move that into the first column of our dataset. SageMaker expects the first column to be the target value when you define models. And so in order to comply with that, we're gonna move rings over here to the front.
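Moving the target to the front is a small reordering. A sketch in Python, again assuming dictionary rows whose key order stands in for column order; `target_first` is a hypothetical helper name.

```python
def target_first(row, target="rings"):
    """Return a copy of `row` with the target column moved to the front."""
    rest = {key: value for key, value in row.items() if key != target}
    # Dictionaries keep insertion order, so the target becomes column one.
    return {target: row[target], **rest}
```

For example, `target_first({"length": 0.4, "rings": 9, "male": 1})` yields the keys in the order `rings`, `length`, `male`.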
Okay, now we're gonna split up our data so that we can get training on this. We're gonna go ahead and sample 70% of our data for training and then the remaining 30% we're gonna split in half and have half for testing and half for validation. And we're doing that here. And again, I'm not gonna step through everything. The other thing to highlight, a couple of things to highlight as we kind of round this example out is, one, like this is not intended to be an example of totally great and perfect machine learning practices. Rather, the intention here is to kind of highlight how to use some of the SageMaker features and functionality from within RStudio. And then the other thing to note is that I am by no means a SageMaker expert.
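The 70/15/15 split can be sketched as follows. This is a Python illustration of the logic, not the R code from the talk; the function name `split_rows` and the fixed seed are assumptions added so the split is reproducible.

```python
import random


def split_rows(rows, train_frac=0.7, seed=0):
    """Shuffle rows, keep `train_frac` for training, halve the remainder.

    Returns (train, test, validation).
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so reruns match
    n_train = round(len(shuffled) * train_frac)
    remainder = shuffled[n_train:]
    n_test = len(remainder) // 2
    return shuffled[:n_train], remainder[:n_test], remainder[n_test:]
```

With an odd-sized remainder the validation set receives the extra row.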
Okay, so we're gonna go ahead and trim down our data set to be only 500 rows. And the reason for this is because we wanna do some stuff towards the end where we're going to submit predictions to the, or we're going to submit data for prediction to an endpoint that we deploy. And we have to have, the maximum payload we can submit is 500 rows. So we're gonna go ahead and trim down our test set to be 500 rows here.
And then we're gonna write this data out as CSV files to our local directory. Now, the reason we write this data out as CSV files is so that we can upload it to S3. And that's what we're gonna do here. We're gonna create these different files in S3 that represent our training validation and test data.
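A sketch of the write-and-upload step in Python. Writing the file without a header row is an assumption made here because SageMaker's built-in algorithms expect headerless CSV input; the talk does not say how its files are written. The helper names and the `r_example/data` key prefix are hypothetical, and `upload` needs a real SageMaker session to run.

```python
import csv
from pathlib import Path


def write_headerless_csv(rows, path, max_rows=None):
    """Write dictionary rows as CSV with no header; return the row count.

    `max_rows` trims the data first, e.g. 500 for the test set in the talk.
    """
    kept = rows[:max_rows] if max_rows is not None else rows
    with Path(path).open("w", newline="") as handle:
        writer = csv.writer(handle)
        for row in kept:
            writer.writerow(row.values())  # values only, in column order
    return len(kept)


def upload(session, path, bucket, key_prefix="r_example/data"):
    """Upload a local file to S3 and return its S3 URI.

    `session` is a sagemaker.Session; the key prefix is illustrative.
    """
    return session.upload_data(path=str(path), bucket=bucket, key_prefix=key_prefix)
```

Trimming inside the writer keeps the 500-row limit next to the file it applies to.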
And then we're going to define these input types so that when we train our model, SageMaker knows what data it's dealing with. So we're gonna create these train input and validation input types that just identify where the data is living and what type of data it is.
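The input definitions might look like this in Python. `TrainingInput` is the SDK class for describing a data channel; the helper names `s3_uri` and `csv_channels`, and the bucket and prefix values a caller would pass, are assumptions for illustration.

```python
def s3_uri(bucket, prefix, filename):
    """Join the pieces of an S3 location into a single URI."""
    return f"s3://{bucket}/{prefix}/{filename}"


def csv_channels(train_uri, validation_uri):
    """Describe where the training and validation CSV files live."""
    from sagemaker.inputs import TrainingInput  # lazy: needs the SageMaker SDK

    return {
        "train": TrainingInput(s3_data=train_uri, content_type="text/csv"),
        "validation": TrainingInput(s3_data=validation_uri, content_type="text/csv"),
    }
```

Each input records only a location and a content type; no data is read until a training job starts.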
Now we're going to go ahead and start getting ready to train our XGBoost model. So SageMaker has ready-made containers for training certain types of models. XGBoost is one of those model types. So here we can go ahead and retrieve this model or this container definition. And if we look at this, and again, this is all using that SageMaker SDK that we've loaded into our R session. Here we can see, here's where the container is living, and we can reference that as we go through our training process.
Now we're gonna define our estimator. Again, I'm not gonna go through everything that's available here. All this content will be available, but we're gonna go ahead and define the estimator that we wanna use by providing it with the container that we wanna train with, the instance type that we wanna do our training on, and where the output should go from that model and things of that nature.
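Retrieving the container and defining the estimator could be sketched like this in Python. The instance type, the output prefix, and the XGBoost container version `1.5-1` are all assumptions chosen for illustration, since the talk does not state the values it uses; `estimator_settings` and `build_estimator` are hypothetical helper names.

```python
def estimator_settings(role_arn, bucket, prefix="r_example"):
    """Collect the estimator arguments that do not require the SDK.

    The instance type and output prefix are illustrative choices.
    """
    return {
        "role": role_arn,
        "instance_count": 1,
        "instance_type": "ml.m5.xlarge",
        "output_path": f"s3://{bucket}/{prefix}/output",
    }


def build_estimator(session, settings, xgboost_version="1.5-1"):
    """Look up the built-in XGBoost container and wrap it in an Estimator."""
    from sagemaker import image_uris            # lazy: needs the SageMaker SDK
    from sagemaker.estimator import Estimator

    image_uri = image_uris.retrieve(
        framework="xgboost",
        region=session.boto_region_name,
        version=xgboost_version,  # assumed version, not taken from the talk
    )
    return Estimator(image_uri=image_uri, sagemaker_session=session, **settings)
```

Keeping the plain settings separate from the SDK call makes it easy to inspect or change the instance type before any AWS resources are involved.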
Now, once we've defined what our estimator is, we can go through and define hyperparameters that we wanna explore and how we plan on tuning those hyperparameters. Again, I'm not gonna walk through everything that's involved here, but the idea is, again, I'm making use of this SageMaker module that's available to me in my R session through this SageMaker object.
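A hedged Python sketch of the tuning setup. The hyperparameter names are standard XGBoost parameters, but the ranges, the job counts, and the choice of `validation:rmse` as the metric to minimize are assumptions for illustration and are not read from the talk's notebook.

```python
# Search ranges as (kind, low, high); the values are illustrative only.
HYPERPARAMETER_BOUNDS = {
    "eta": ("continuous", 0.0, 1.0),
    "gamma": ("continuous", 0.0, 5.0),
    "min_child_weight": ("continuous", 0.0, 10.0),
    "max_depth": ("integer", 1, 10),
}


def build_tuner(estimator, bounds=HYPERPARAMETER_BOUNDS, max_jobs=10, max_parallel_jobs=2):
    """Wrap an estimator in a tuner that searches the given ranges."""
    from sagemaker.tuner import (  # lazy: needs the SageMaker SDK
        ContinuousParameter,
        HyperparameterTuner,
        IntegerParameter,
    )

    kinds = {"continuous": ContinuousParameter, "integer": IntegerParameter}
    ranges = {
        name: kinds[kind](low, high) for name, (kind, low, high) in bounds.items()
    }
    return HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="validation:rmse",  # assumed validation metric
        objective_type="Minimize",
        hyperparameter_ranges=ranges,
        max_jobs=max_jobs,
        max_parallel_jobs=max_parallel_jobs,
    )
```

From R the same objects are reached through the imported module, for example `sagemaker$tuner$HyperparameterTuner`.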
We're also identifying our validation metric here. Now, once this is all said and