
Rika Gorn | From Zero to Hero: Best practices for setting up RStudio Team in the Cloud | RStudio
Learn best practices for setting up the entire RStudio Team infrastructure - Server Pro, Connect, and Package Manager - from the perspective of a data scientist and for a data science audience, especially those who have never worked with servers, AWS, or Bash. This talk will also be applicable to data scientists looking to start an engineering project outside of RStudio. I started out as a complete novice, and throughout my learning experience I noticed a distinct lack of resources for non-engineers. This talk will focus on best practices for AWS architecture and CloudFormation, key security issues such as SSL and HTTPS, server configurations, deployment errors, and most importantly, resources that are understandable for data scientists just getting into the data engineering or DevOps space. About Rika: Rika Gorn is the Manager of Business Intelligence at Spring Health - a mental healthcare tech start-up that provides comprehensive mental healthcare benefits. Previously, she worked on quality assurance for a mobile mental health team at Coordinated Behavioral Care, data analytics at Covenant House International, strategic management and evaluation at TCC Group, and program analysis at the Vera Institute of Justice. Rika received her Bachelor's in Political Science from Hunter College and her Master's in Public Administration at the NYU Wagner School of Public Service. Rika is also a proud board member of R-Ladies NYC.
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Hi, my name is Rika Gorn, and thank you for coming to my talk, From Zero to Hero, Best Practices for Setting Up RStudio in the Cloud.
So last year, I was given the keys to an incredible treasure. And when I say keys, I mean literally keys. I was handed three product keys so that my data science team at Spring Health, where I work, could start using RStudio Server Pro, RStudio Connect, and RStudio Package Manager.
Now my name is Rika Gorn, and I'm a data scientist. I'm not an engineer. I work at an incredible organization called Spring Health. We're a mental health care tech startup that provides a comprehensive mental health care solution to employers all over the world. Over the last year, we've been growing and scaling a ton. And side note, we are hiring. And RStudio Team would be a huge win for a very quickly growing data science team.
So what did this mean for me and my team? Well, RStudio Server Pro would allow us to run R in a secure, remote environment, and we wouldn't have to rely on our local computer for computationally expensive jobs. RStudio Connect would allow us to publish all of our data products, including Shiny apps, R Markdown, dashboards, Plumber APIs, and quickly source them to other departments. And Package Manager would allow us to centralize and manage how our team uses internal and external packages.
So needless to say, this was an incredible win for my team, and we were super excited to get started. But because of how quickly everything was moving, while I was given the support of a fantastic engineer, the bulk of setting up and configuring RStudio Team would fall on me.
Now, I'm very comfortable coding in an R IDE or working with R Markdown documents, but I didn't know anything about servers or setting up infrastructure in the cloud, and I was pretty sure that you couldn't code up a server in an R Markdown document. So what did I do?
Well, of course, first I went to Google. Now, RStudio has a ton of great guides for administering their products, but since these are geared toward engineers or sysadmins, I didn't even know where to put all the code the guides were talking about. And so this was the beginning of my journey into the world of data engineering.
As I started learning, I realized that there was a huge disconnect between resources available for engineers and resources for data scientists to learn engineering tasks. I also learned that data scientists, especially in smaller organizations or startups where engineering support can be scarce, desperately need access to engineering skills and resources if they want to learn to quickly deploy their own data products.
A roadmap for data scientists learning engineering
So today what I want to do is share with you my own roadmap for how to start the process of learning data engineering. If you're looking to set up RStudio Team, this talk should be very helpful for you. But even if your first engineering project involves a different server setup, most of this talk should apply to you as well.
So my roadmap is framed around three different parts. People: how you can use people around you and in your organization to help you set up your first engineering project. Learning: the most important things for you to learn as a data scientist as you go on your data engineering journey. And implementation: what to do when you're actually in the thick of things, in the weeds of your project.
People: partnering with an engineer
Okay, so you're about to start your first data engineering project. What now? So even if you're going at this alone or as part of a small group, it's super helpful to have an engineer in your corner. It's important to understand that your relationship shouldn't be combative.
I think as a data scientist, very often you want all the data possible at your fingertips, whereas engineers, a lot of times they have particular security concerns and don't give you the data that you need as quickly as you may want it. So it's important to start a good relationship with your engineer and to learn from them what their security concerns may be for your server and to understand where their concerns are coming from.
It's also important, before you start your project, to really understand the value of your project. Why are you doing this? Who will benefit? Who will use this new server, this new product? Are there people in your organization that you need to train or bring over to your side? It's important to make sure that there's a pot of gold at the end of your rainbow, because what you're going to be doing is taking time away from your regular data science tasks and focusing on a new skill set in data engineering. So it's important that, before you start, you know the work you're doing will very clearly bring value to you and to your team.
Learning: Bash, architecture, and security
So, technology and learning, specifically for engineering projects. My biggest weakness when I started was that I wasn't comfortable using Bash. For me, it was this kind of scary, black, blinking command line, and I found a lot of it very unintuitive: one-letter commands and arguments that were cryptic compared to R.
So here we have on the left-hand side some R code, and you can see it reads very easily, almost like a sentence. You can see what's being grouped, what's being counted, whereas the Bash on the right-hand side can be kind of confusing. There are one-letter commands, and it's unclear what's happening. So it's important to start learning a little bit about this. You don't have to learn everything, but start. Learn how to move around your directories. Learn how to copy and move files. Learn how to log into your server using the SSH protocol. And make sure that you actually have access to the correct files, and if you don't, learn how to change the permissions on your files using Bash.
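To make those first steps concrete, here's a minimal Bash sketch you can run in a scratch directory. The paths, filenames, and the example hostname are placeholders chosen for illustration, not anything from the talk:

```shell
# Practice the basics in a throwaway directory.
mkdir -p /tmp/rstudio-demo          # make a working directory
cd /tmp/rstudio-demo                # move into it
echo "port=8787" > rserver.conf     # create a small example config file
cp rserver.conf rserver.conf.bak    # copy it before you edit anything
ls -l                               # list files with their permissions
chmod 600 rserver.conf.bak          # restrict the copy to the owner only

# Logging into a remote server over SSH (placeholder host and key,
# so this line is left commented out):
# ssh -i key.pem ubuntu@203.0.113.10
```

`chmod 600` means read and write for the owner and nothing for anyone else, which is also what SSH expects for private key files.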
Another piece of learning that's important: draw out your actual server architecture. And when I say draw it out, I mean actually take out a piece of paper and draw it out. It can even help to look at AWS architecture diagrams on the internet and try to decipher what they mean. AWS has a million and a half services, and as a data scientist entering that world, a lot of it can be super daunting to understand, so learning what all the various terms for the different services and products mean can be really helpful. That's why drawing out your diagram, even if it's super simple, can help you organize your work before you start. Sharing that diagram with your engineer can also help you prevent errors in your system before you even start.
So here's an example of a simplified AWS architecture diagram. When you're first drawing it out, it's important to think about which parts can actually talk to each other, what is public, what is private, and which parts of the server point to which other parts. So here we have the VPC, or Virtual Private Cloud. This is your virtual network, and it's hosted on AWS, but you can also host servers on Azure or other cloud providers. Here is where we have the meat of what makes RStudio Team, RStudio Team: we have three EC2 instances, or three smaller servers, that have all the data that make up RStudio Server Pro, Connect, and Package Manager. These EC2 instances live in a private subnet, which creates an added layer of security so that random people can't just access your server.
So here we have a bunch of extra layers of security. We have a load balancer, which helps route the correct requests to your servers. We also have a bastion or jump box, which adds yet another security layer so that you don't jump directly into the RStudio Server. You have to go through an added jump box in order to get into your server. Now once again, this is a simplified model, but it starts to show you what AWS products you as the data engineer have to start setting up in the cloud.
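As a sketch of how that jump box works in practice, OpenSSH's ProxyJump option lets you hop through the bastion automatically. The IPs and usernames below are placeholders, and in real use this entry would live in ~/.ssh/config rather than /tmp:

```shell
# Write an example SSH client config with a ProxyJump entry.
mkdir -p /tmp/sshdemo
cat > /tmp/sshdemo/config <<'EOF'
Host bastion
    HostName 203.0.113.5        # public IP of the jump box
    User ec2-user

Host rsp-server
    HostName 10.0.1.25          # private IP inside the private subnet
    User ec2-user
    ProxyJump bastion           # route the connection through the bastion
EOF

# One-off equivalent without a config file (not run here):
# ssh -J ec2-user@203.0.113.5 ec2-user@10.0.1.25
echo "wrote $(wc -l < /tmp/sshdemo/config) config lines"
```

With an entry like that in your real ~/.ssh/config, `ssh rsp-server` reaches the private instance via the bastion in one command.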
Now, one last learning topic that's critical before you start your project: ports and security. This is a huge topic, but it's helpful to know just a little bit about how computers network, a little bit about security management, and the most common internet protocols in use. This is one of those topics that engineers seem to know a lot about, but when I was learning, I found there weren't many resources available specifically for data scientists.
So ports are communication endpoints that allow computers to talk to each other in various secure ways. We have port 22 for SSH, 25 for SMTP, 80 for HTTP, and 443 for HTTPS, protocols you may have heard of. It's important to learn just a little bit about what ports are available and which ones you have to turn on and off on your server.
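Here's a small cheat sheet of those ports, printed from a Bash lookup table. The RStudio Server port is its documented default, 8787; the others are standard assignments:

```shell
# Common ports and what they carry; adjust for your own server's needs.
declare -A ports=(
  [22]="SSH - secure remote login"
  [25]="SMTP - outbound mail"
  [80]="HTTP - unencrypted web traffic"
  [443]="HTTPS - encrypted web traffic"
  [8787]="RStudio Server default"
)
for p in 22 25 80 443 8787; do
  printf "port %-5s %s\n" "$p" "${ports[$p]}"
done | tee /tmp/ports.txt
```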
SSL and TLS are protocols that make secure internet communication possible by encrypting internet traffic. I like to think of this as a handshake between different computers. Setting up these protocols requires learning a little bit about certificates and configurations, so it's important to learn what protocol you'll be using when setting up your project.
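One way to get hands-on with certificates is to generate a throwaway self-signed one with openssl and inspect it. This is for practice only; a real deployment needs a certificate from a trusted authority, and the hostname here is a placeholder:

```shell
# Create a self-signed certificate and key (testing only, no passphrase).
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/server.key -out /tmp/server.crt \
  -days 365 -subj "/CN=example.internal"

# Inspect the subject and validity dates of what you just created.
openssl x509 -in /tmp/server.crt -noout -subject -dates
```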
Implementation: getting started
Okay, so you've done your learning and now it's time for implementation. My main piece of advice is to get started. Obviously, do your research, but don't fall into the trap of analysis paralysis. As a data scientist, I kind of wanted to know everything available before I started, but what happened is that I did two weeks of research and then realized I had nothing to show for it other than theoretical knowledge.
It's helpful here to take a lesson from your engineering teams and use a more agile approach where you get started quickly, you may fail quickly, but then you can get started again just as quickly.
All right, so you've officially started your project, what do you do now? Use your engineer as a guide and to check in on your work. They can tell you when you've done something horribly wrong, like when I accidentally opened all of my instances to the entire world, and they can also help you to finish tasks that you may not have the appropriate security clearances for. So for example, if you want to point your server at a particular domain name, they can help you to do that and they could also help you to estimate costs for your server.
Learning: it's important to start looking at the new data formats and files that you're going to be using in your server. To set up RStudio Team specifically, RStudio provides a CloudFormation template for push-button deployment. Now, when I started looking at this, I was very excited, because it looked like something familiar from the R world: it looked basically like a YAML file. So it's important to get to know these files. Don't just deploy them: look at them, read through them, understand their defaults. When I started, I did not look at the defaults of my CloudFormation file, and then I had to destroy my entire stack and start from scratch because I hadn't changed a really important default. So get to know your files, get to know your file formats, read through them, understand their defaults.
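As a sketch of what "read the defaults" looks like in practice, here's a tiny parameter block in the same YAML style (this is my own illustrative fragment, not RStudio's actual template), plus a one-liner to list every default before deploying:

```shell
# Save an illustrative CloudFormation-style fragment locally.
cat > /tmp/template.yaml <<'EOF'
Parameters:
  InstanceType:
    Type: String
    Default: t3.large        # size (and cost) of each EC2 instance
  KeyPairName:
    Type: AWS::EC2::KeyPair::KeyName
  SSHCidr:
    Type: String
    Default: 0.0.0.0/0       # WARNING: open to the whole internet; change it
EOF

# List every default in the template before you deploy it.
grep -n "Default:" /tmp/template.yaml
```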
So implementation: you are going to be making mistakes, so make sure that you set up guardrails and fail-safes for those potential mistakes. First, take snapshots of your instances. This is something that you can do in AWS: after you configure your server, you can take a snapshot (a machine image) of it, so if you ever mess anything up and need to go back, you don't have to start over from scratch.
As you set up your server, there are going to be lots of passwords, PEM keys, PPK keys, and certificates. Save all of these in a secure way, for example in a password manager like 1Password. Don't just save them on your local computer. And lastly, write good documentation - documentation that an engineer can follow, that another data scientist can follow - because while you're the admin at the moment, you may not always be the admin for this project. So it's important to write good documentation as you go along.
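One concrete guardrail for keys that do end up on disk is keeping them owner-only. A small sketch using a dummy file in /tmp (SSH itself refuses to use private keys with loose permissions):

```shell
# Create a dummy "key", give it deliberately loose permissions,
# find the problem, then fix it.
mkdir -p /tmp/keys
echo "demo" > /tmp/keys/server.pem
chmod 644 /tmp/keys/server.pem           # group/world readable: too loose

find /tmp/keys -name "*.pem" -perm /044  # lists keys readable by others
chmod 600 /tmp/keys/server.pem           # owner-only, as a key should be
find /tmp/keys -name "*.pem" -perm /044  # now prints nothing
```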
Configuring your server and training your team
All right. So your server is set up, whether it's for RStudio Team or for another project. Now it's time for the fun part, which is to configure all of your settings and defaults.
So when you're configuring, and after you've set up the main parts of your server, you need to get buy-in from the data scientists on your team, because they're going to be the ones testing your deployment. It was helpful for me, at least, to have a testing log to write down which deployments were working and which were failing. This was also helpful later, when I was actually writing my documentation.
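The testing log doesn't need to be fancy. The columns below are just my own suggestion for a minimal append-only CSV; the product and artifact names are made up:

```shell
# Append one row per deployment attempt, then review the whole log.
log=/tmp/deploy-log.csv
echo "date,product,artifact,result,notes" > "$log"
echo "$(date -I),Connect,sales-dashboard,OK," >> "$log"
echo "$(date -I),Connect,plumber-api,FAIL,missing env var" >> "$log"
cat "$log"
```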
Understand who in the company needs to be trained alongside your data scientists, but don't bring everyone in until you've figured out most of the bugs. You don't want to be doing demos of your product while it's still buggy, and you don't want to discourage your users and the consumers of your product before they can get started and get excited about your new project. So set up training and testing well before you introduce it to folks at your company, and make sure there's a little bit of time before you actually start demoing.
All right, so once you've set up your server, you're now officially the root user with sudo access. It's important that you use this power for good and not for evil. So learn about different types of authorization: Google Auth, multi-factor authentication, API keys, basic authorization. As you configure your product, you're going to need to give folks in your company access to it, and this has to be done in a secure manner. Your engineer can definitely help you here, but knowing a little bit about the differences between authorization protocols can be super duper helpful. And once again, you may be the root user now, but you may not be the root user later.
Implementation. This is the fun part, for me especially. Get inside your server and play around. Look at the important main files. For RStudio Server Pro, RStudio Connect, and RStudio Package Manager, I found that five configuration files were super duper important in making sure that deployment was actually working. When I first started, I had a lot of deployment errors, and I realized it was because I hadn't changed some of the defaults in these files. So get into your server, understand what the paths are like, understand what the important files are, understand what their defaults are, and how and if and when they should be changed.
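As a hedged starting point, the paths below are the standard configuration locations from RStudio's admin guides of that era, and my guess at files worth reading first; verify them against your installed versions:

```shell
# Likely configuration files to read before changing anything.
files=(
  /etc/rstudio/rserver.conf                  # Server Pro: ports, SSL, auth
  /etc/rstudio/rsession.conf                 # Server Pro: session defaults
  /etc/rstudio/launcher.conf                 # Server Pro: job launcher, if used
  /etc/rstudio-connect/rstudio-connect.gcfg  # Connect: server, email, auth
  /etc/rstudio-pm/rstudio-pm.gcfg            # Package Manager: repos, storage
)
for f in "${files[@]}"; do
  if [ -f "$f" ]; then echo "found   $f"; else echo "missing $f"; fi
done | tee /tmp/rstudio-configs.txt
```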
Summary
All right. So this is my entire system for setting up a data engineering project as a data scientist. So before you start, partner with an engineer and understand their security concerns as well as the larger value that your project brings to the entire team. Learn a little bit about Bash, about ports and network administration, and draw out your AWS infrastructure diagram on a piece of paper. And when you're ready to implement, get started as soon as you can.
Throughout your project, use your engineer as a guide. Get comfortable with different data formats and set up guardrails in case of failure. Once you've set up your project, take the time to train your team, to test your project, and to shake out any little bugs and issues that may come up. Learn a little bit about user authorization and root access. And then, most importantly, get inside your server, play around, and start changing your defaults to your needs. And finally, remember to have fun.
Thank you so much for taking the time to listen to this talk. I look forward to answering your questions and connecting over the rest of this conference. Thank you so much to everyone at RStudio and everyone at Spring Health who helped me with this talk.
