
Lou Bajuk & Kevin Bolger | Why Data Science in the Cloud? | RStudio (2020)
As business and organizational needs expand, a centralized ecosystem such as the cloud is needed to securely store and access data, conduct analyses, and share results. We'll share some examples of what it means to do data science in the cloud, discuss some problems that users may face along the way and the solutions that RStudio products can provide, and cover best practices for migrating to a cloud environment.

What you'll learn:
- What are the benefits of working in a cloud environment?
- What are the different cloud environments available?
- How do I learn which is the best fit for my organization?
- What should I consider when migrating my data science infrastructure to the cloud?

Webinar materials: https://rstudio.com/resources/webinars/why-data-science-in-the-cloud/

About Lou: Lou is a passionate advocate for data science software and has many years of experience in a variety of leadership roles at large and small software companies, including product marketing, product management, engineering, and customer success. In his spare time, his interests include books, cycling, science advocacy, great food, and theater.

About Kevin: After finishing his education at the University of Limerick, Ireland, Kevin's passion for data science was cemented. Focusing primarily on data analytics and modelling, he spent the first years of his career at a biopharmaceutical company, where he led the data team on multiple products. Since moving to Seattle with his Washington-native wife, Kevin has spent his spare time enjoying the beautiful PNW and playing hurling, an ancient Gaelic field sport, with the Seattle Gaels. He now leads the data science team at ProCogia as the Director of Data Solutions, where he works with clients from biotech to telecom.
Transcript
This transcript was generated automatically and may contain errors.
Thank you, everyone, for your time today. Just a quick recap of our agenda. We're going to be starting talking about some of the challenges that data science teams face and how RStudio products can help tackle those challenges, and how those challenges then help drive people to consider moving to the cloud.
And so then we'll talk about some of the different cloud options that are out there, hosted services, working with a VPC provider such as Azure or AWS, cloud marketplace offerings, and doing data science in your data lake. And then I'll hand off to Kevin to do a deep dive on some of the factors that you should consider when you are considering deploying your software to a VPC.
Then we'll wrap up and have time for as many questions as we can. And as Sam said, any questions we don't get to today, we'll have a follow-up thread on our community site to continue the discussion there.
Why data science teams fall short
So we've talked to many data science teams in a large variety of organizations over the years. And what we've heard from many of them is that a lot of data science teams fail to live up to their full promise: the value that the organization is hoping to get from the team. And there's lots of reasons for that.
And based on our conversations with, again, many data science teams, these reasons tend to group into a few different areas. Perhaps teams have trouble creating insights. Perhaps, once they have actually created those insights, they have trouble using them to impact decision-making in the organization, because ultimately, if you've got a better way to do things or a better way to make a decision, it only matters if it gets implemented and actually impacts your organization.
Or sometimes teams have the greatest challenges around maintaining and improving this value over time: they get some insights, they inform decision-making, but then it fades, or they find it difficult to maintain. So there are many different challenges that go into these areas, and many things that teams need to overcome.
Today we're going to be focusing on this third column: siloed teams often lead to redundant work and make collaboration difficult. They may spend too much time maintaining tools, or too much time re-running analyses because teams can't easily deploy self-service applications. Or perhaps it takes too much manual effort to deploy and maintain these systems. And I'm focusing on these because these are the items that are often addressed by centralizing your data science and/or moving it to the cloud.
Serious data science: open source, code-oriented, and centralized
So what we found is that customers who have successfully scaled their data science team, the work that they do, and their impact on the organization tend to focus on three key attributes of their data science environment: open source, code-oriented, and centralized. And again, in this session, we'll focus on the centralized and cloud aspect. Open source, of course, is a key part of RStudio's mission, and we find it critical because it eases recruiting, retention, and training for new people, and it's comprehensive and interoperable.
Code-oriented is a key part of RStudio's philosophy of data science because code is flexible: it doesn't have black-box constraints, and code by its nature is easy to iterate, reuse, extend, and inspect. But again, in this session, we'll focus on the centralized and cloud aspects, because centralizing your data science helps reduce unnecessary work for the data science team and makes it easier to collaborate across that team, providing a common way of deploying self-service applications to your stakeholders that helps inform them and helps drive decision-making and impact in the organization.
And of course, if you're working with open source tools like R and Python, finding some consistent way to manage all the data science packages and manage the versions of that is critical. And often that's easiest if you centralize that so that it's common across both your development and deployment environments.
Now these three key aspects, open source, code-oriented, and centralized, are key aspects of what we call serious data science. And we call this serious data science because this approach typically enables a data scientist to tackle the complex, difficult, and at least initially often ill-defined problems that can really provide value and novel insights in an organization.
RStudio's commercial products
Now the two leading languages in open source data science are, of course, R and Python. But on their own, R and Python certainly provide the open source and code-oriented aspects, but they don't out of the box on their own make it easy to centralize your data science or deploy it to the cloud. And that's where RStudio's commercial products come in.
Our products support the development and deployment of data science with both R and Python. And we have a modular platform called RStudio Team that helps you centralize your data science in an enterprise-friendly way, providing security and scalability, et cetera. And so let's just talk about that briefly.
Now, what does a data science team actually need to do? Well, there are two things they typically need to do as part of their work: they need to create insights, and then they need to find a way of sharing those insights to impact decision-making in the organization.
And so typically when they create insights, they're using some sort of data science workbench, some way where they can use their IDE of choice, whether it's Jupyter Notebooks or the RStudio IDE or VS Code, all in a single location, use those different development environments to create various data products using R and Python. These typically are tailored applications or reports or APIs. We call these tailored because they're not off the shelf, they're not black box, but they're focused on the problem at hand, focused on providing the exact information that a decision-maker needs in their day-to-day work.
And that's one of the big advantages, of course, of a code-oriented approach. You don't have any arbitrary limitations. You can focus on what your stakeholders need. And of course, once they create these interactive applications or reports or APIs, they need a way of publishing them, a way of sharing them. And so typically they can publish these to a deployment portal, and once this is available in a deployment portal, the data science team can now share these insights to impact decision-making.
And this typically takes the form of an interactive web application, say a Shiny application that can be shared with a decision-maker, or an automated email report that can be sent to the decision-maker with the analysis right within it, so they have that data within their inbox. Or perhaps they create APIs that can be integrated into other systems for automated decision-making.
And it's critical for both these areas, for both the development side and the deployment side, that you have a way of managing your open source packages, because this eases maintenance and reproducibility.
And this, of course, is where our products come in. RStudio Server Pro provides that centralized data science workbench with a variety of different development environments. RStudio Connect provides that deployment portal where you can share your applications and APIs with your decision-makers and other systems. And these may take the form of Shiny, R Markdown, or Plumber APIs, plus a number of different Python frameworks that we support, things like Streamlit and Bokeh.
And it's critical for both of these that you have a way of managing packages, and that's where RStudio Package Manager comes in. And it's these three products, RStudio Server Pro, Connect, and Package Manager, that make up our RStudio Team bundle.
Why move to the cloud?
So the question is, what then drives people to want to go to the cloud? There's several different reasons.
One key reason is simply simplifying and reducing the startup cost for a new data science team, that when an organization is spinning up a new team, cloud is often a way to really get that team, get the resources that team needs going quickly. Perhaps they want to make collaboration or instruction, say workshops or classes, between organizations or groups easier. Cloud's a great way of doing that.
Perhaps they want to mitigate the high cost of maintaining their own computing infrastructure. This was one of the primary drivers originally for organizations to move to the cloud, and it is still a major factor. Closely aligned to that is scaling to meet variable demand: if the demand for your data science computations is highly variable, you don't want a bunch of expensive computing infrastructure sitting around idle. Instead, you want to push that responsibility out to the cloud provider to maintain it for you.
And finally, if your data is already in the cloud, then you want to reduce time and costs in moving the data to your analysis. And rather than pulling it down locally to do data science, it's often much more efficient and much less expensive to do that data science up on the cloud directly right in your data lake.
Cloud options overview
So let's talk about a few of these different options. Again, one of the most common options is doing a hosted service where a vendor provides software as a service. We talked about a couple of things RStudio does there. The other very common way of doing this is deploying to a virtual private cloud provider such as Azure or AWS. I'll be touching on that lightly here, but Kevin will be covering it in a lot more detail in the next section.
Third, and closely aligned, is cloud marketplace offerings. Again, that's very similar in some ways to deploying to a VPC: you're running on the cloud provider's hardware, but here it's typically easier to start up, and it's done on an hourly basis. And finally, data science in a data lake such as Qubole or Databricks.
So, hosted software-as-a-service offerings. Some of the things that really make these offerings great are that they're easy to start up quickly and they streamline collaboration. If the members of your team can simply go in, put in a credit card, and sign up for a few dollars a month to access some capabilities, that's an easy way to scale up very quickly.
Now, sometimes software-as-a-service may have limited functionality compared to on-prem, so you might not have the full set of options you would have if you were to install locally. And sometimes, depending on the service, integration with your internal data systems could be more challenging or perhaps not even possible.
So, a couple of offerings that RStudio has in this arena. We have RStudio Cloud, which is a hosted version of RStudio Server Pro with the RStudio IDE. It makes it easy for anybody to do, share, teach, or learn data science using just a web browser. The initial focus right now for RStudio Cloud is primarily as a platform for instructor-led education.
So if you were doing a workshop in your commercial organization or if you were teaching a class on data science or statistics at the university level, RStudio Cloud is a great solution for that. And we have many organizations using it for that already. Another great focus for RStudio Cloud right now is enabling academic research by sharing among different groups. It's a handy way of doing that.
And there's a lot more that we'll be rolling out in RStudio Cloud over the next several months. So some of the great benefits of RStudio Cloud, again, nothing to install locally. All you need is a browser. You can share projects that you create there with your team or a class or workshop. And again, nothing to configure, no hardware installation or anything like that, just a month-to-month purchase. And there's various plans available for RStudio Cloud, including free to get started.
ShinyApps.io is another hosted service RStudio provides. This is a way of securely and scalably sharing your Shiny applications on a hosted service. And so if you have a Shiny application that you want to share with the world or with a much smaller group, you can do that quickly on ShinyApps.io. Again, various plans available there, including free to get started. You can visit ShinyApps.io to learn more.
The second major way of going to the cloud is deploying to a virtual private cloud provider like AWS, Azure, or Google Cloud Platform. One of the advantages of this is that you pay as you go specifically for the compute resources. Typically, if you're deploying to a VPC, you've purchased the license from the vendor already, though not necessarily; it depends on the offering.
Another great advantage here: you get the full functionality of the on-premise software, because typically it is the on-premise software, or a close variant of it, that's available on the VPC. So for example, right now RStudio Cloud doesn't yet have the ability to provide Jupyter Notebooks in that environment; that's one of the things we're working on for the future. But if you were to deploy RStudio Server Pro to a VPC today, you'd have the full functionality of Jupyter Notebooks and JupyterLab through the same capabilities you'd use for the RStudio IDE. VPCs also give you access to specialized hardware like GPUs.
Now on the con side, it can be difficult to maintain. There's typically more administration overhead. There's a complexity of things like Kubernetes that you might need to consider. And Kevin is going to be discussing all this in a lot more detail in the next section. The other challenge around VPCs is that the cost can be high and variable, that you need to really manage the compute cost closely to make sure the software is only available when it's needed. Otherwise, it can keep running in the background and rack up your compute charges.
Third, and again, closely related, is cloud marketplace offerings. Instead of installing on your own on a VPC, you go to the AWS or Azure or GCP cloud marketplaces to spin up software from a vendor. RStudio has offerings on all those marketplaces.
It's very handy in a lot of ways because it's easy to get started quickly, which makes it great for proofs of concept. You're typically paying as you go for both compute and software licenses on an hourly basis, it's very inexpensive to get started, and again, you get access to the specialized hardware. But like deploying to a VPC, you need to carefully manage the software to make sure it's only running when you need it; otherwise, you could run into excessive hourly charges. That means this is often, not always, but often best for short-term projects, especially if you're collaborating between organizations, because it's a great way to provide access outside your organization.
And then finally, data science in your data lake. A great example of a data lake provider is Qubole, a partner of RStudio's. Qubole provides a data lake with RStudio Server Pro capabilities built into it. One of the advantages here is that you can minimize overhead by running your computations close to the data, avoiding the cost of moving the data from one cloud to another or down locally. It also makes it easy to incorporate data science directly into your data pipeline.
Now on the con side, running your data science in your data lake can add some technical complexity, can make it a little bit more difficult to do what you want to do, or at least a little bit more complicated. And again, you might run into some functionality limitations here that could be challenging, depending on what the data lake provider and the data science provider have done and how they're working together technically.
So, to sum up this section: ultimately it really depends on what your primary goal is, and based on that goal, there are different ways of going to the cloud. If your primary goal is to minimize cost, maintenance, and startup time, then using a hosted service is often a great way to start, especially if you have a small team or really just an individual data scientist.
If you want the full flexibility of an on-prem solution, all the bells and whistles, all the capabilities, but without maintaining your own hardware, but still having the flexibility to scale, then deploying to a VPC provider is often the best way to go. Kevin will be talking more about that next. If you're doing something that's much more short term, perhaps just a proof of concept, you don't want to commit to a long time period, offerings on a cloud marketplace by the hour are often a great way to do that. And again, if your biggest concern is your data is already in the cloud and you want to minimize the overhead of moving that data someplace else to analyze it, then often data science in your data lake is going to be the best approach.
Key considerations for moving to the cloud
Thank you, Lou, for that really good introduction. Really appreciate that presentation. So I want to talk more about why you might want to move to the cloud and what some of the key considerations are when people move. This is a big decision for a lot of organizations, whether they're switching from their on-premise infrastructure or maybe just expanding their cloud footprint for their data science teams.
So typically, when we meet with clients and we're discussing cloud infrastructure for various data science applications, and a lot of the time we're speaking about RStudio Team installations and configurations, we're considering a lot of different things. The key recurring themes that we see coming up are, first of all, assessing the needs of the team: how big is their team, how much power do they need, and do they really need to utilize the cloud's immense scale?
And do they see a lot of potential or unforeseen growth in the future? Do they need to be able to scale their resources over time in an unpredictable manner? This is where the cloud is really powerful: it's relatively easy to configure your environments to scale over time, if you understand where your team is going and what your needs are going to be.
Another key thing is: how am I going to control the cost? With infinite scale comes infinitely scaling bills. This is a big concern. As Lou alluded to, it's a pay-as-you-go model, but unfortunately, that means if you use lots and lots of resources, you're going to get charged for lots of resources. So controlling costs is a very key consideration. It's much more straightforward to control those costs when you're the one maintaining the physical infrastructure, because you've got a very defined view of what the cost can be. But when you're dealing with virtually unlimited resources, your bills can get out of hand quite quickly.
Another key consideration: we work with a lot of clients in more regulated industries, whether they're dealing with proprietary information or handling a lot of very sensitive data about customers, so PII that needs to be treated with a high degree of sensitivity. This is a key consideration that can slow the adoption of the cloud. Typically, these industries lag behind the tech industries, which are able to move a little quicker because they don't have these regulations to consider.
And so they often ask: how can I secure my cloud? And the answer really is: it's as secure as you want to make it. We can dive a little deeper into that in a few minutes.
Another key consideration is, of course, that you're handing over the keys of your infrastructure to a cloud provider. If you go back as far as the early noughties and the tech boom, there were plenty of websites being hosted by young entrepreneurs in their home offices or bedrooms, on computers they put on their desks. But with the immense scale of these cloud infrastructure companies, they're much more sophisticated organizations. While you're handing over the keys to them and they're managing the physical security of your infrastructure, they're able to benefit from that massive scale to secure their infrastructure in a way that's quite difficult to do yourself.
And then the last one, which ties back to our first point about scale: how do I adapt to change, and how easy is it to do that? If I architect a solution and build it, and all of a sudden I see a huge uptick in usage, how can I make sure that we can scale our environment for our teams in an efficient manner, so that we don't see any downtime or lengthy delays in making more resources available? Again, this is somewhere the cloud can really excel, if you know what you're doing.
Assessing team needs and architecture
So when we initially engage with our clients, one of the first things we do is carry out a series of interviews with their key users and business users, to get a better understanding across all the different products they're trying to embed in their team. So, as Lou mentioned: RStudio Server Pro for the data scientists, the people doing the development work; RStudio Connect, the platform for sharing that information with users; and RStudio Package Manager for managing IT security and package management.
So I put together this small matrix just to give some of the considerations we like to take into account. We're looking to understand, if we take RStudio Server Pro for example: How big is their team? What's the makeup and context of the team? Are there a lot of data scientists, people running a lot of simulations? How variable are the workloads? What size of data are they dealing with? Are they dealing with very small data on the order of megabytes, or are they really a big data shop dealing with terabytes or petabytes of data?
And the size of the team definitely matters, right? If you've got a very small team, say five people, you could probably handle your workloads with a much simpler architecture. But as you scale up your team or the size of your data, you might want to consider more complex approaches. That's the cluster approach with Kubernetes that Lou was alluding to earlier.
So I've broken it up here, and you can think about it like this. This is definitely a rule of thumb, and your mileage may vary, as they say; it all comes out of a series of interviews with your team and understanding your needs. But smaller, simpler teams would benefit from a simpler approach, and those are the ones in the yellow zone. If you're dealing with relatively small teams with relatively small amounts of data, you can probably get away with a more basic or simple approach.
And the same with the red zone: if you're dealing with very large sets of data or very large teams, you're definitely going to want to consider a more complex approach, because it's just harder to predict what your team is going to need. So you're probably going to want a more scalable solution, something that can flex with the needs of your team over time. And if you're sitting in that orange area, you're going to have to think a little harder about what kind of workloads you're running and how predictable they are; you might sit somewhere between the simpler approaches and the more complex ones, depending on the type of work your team does.
Cloud deployment solutions: from simple to complex
So there are a number of different solutions you can implement in the cloud for your architecture, from the very simple to the more complex. As Lou alluded to, most cloud providers have something like a quick-start image, and I've provided analogous AWS examples throughout this presentation. In AWS, there's something called an Amazon Machine Image, which is just a pre-configured image where you can quickly spin up the machine and test it out.
So this is better for short-term projects. Where we typically recommend it be used is in cases where people really just want to play around with RStudio and see whether it's worth investing their time in. They can quickly launch it, test it out, and get it up and running without too much work. So this is good for a prototyping phase, or just seeing what RStudio is all about if you're new to it, or if you're new to the cloud and want to see whether it works well with your different systems. But it comes at a bit of a premium, so, as Lou was saying, you probably want to make sure you don't leave it on all the time.
And so the simplest approach you can take is a virtual machine deployed on a VPC. In AWS, they have something called the Elastic Compute Cloud, and that's their virtual machine deployment. With EC2 instances, as they're called, you can go ahead and pick all the RAM and CPU requirements you need, and pick the right OS. That can be a very important consideration for some teams: do you need a Red Hat Linux environment, or are you OK with a free distribution like Ubuntu?
Then you want to consider whether you need to be highly available or not, and whether you need double redundancy. Something we often recommend for people working in the cloud: RStudio has a built-in load balancer that will distribute traffic evenly across the different machines if you've got multiple machines running, and with it you can implement some double redundancy, which is super important for a lot of teams, to make sure that if the master goes down, you don't lose access to your environment. So we use load balancers in the cloud, and every major cloud provider has some version of this.
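To make the load-balancing idea concrete, here is a toy round-robin sketch. This is not RStudio's actual algorithm, and the class and method names are made up for illustration; it just shows the principle of spreading incoming sessions evenly across a pool of nodes:

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy round-robin load balancer.

    Illustrative only: real load balancers also track node health,
    session affinity, and load, which this sketch ignores.
    """

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("need at least one node")
        self._nodes = cycle(nodes)

    def route(self):
        """Return the node that should receive the next session."""
        return next(self._nodes)
```

With two nodes, successive sessions alternate between them, so no single machine takes all the traffic; and with redundancy, removing one node from the rotation still leaves the other serving, which is the high-availability point made above.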
And this ultimately allows us to scale our resources vertically, or you can actually scale out horizontally as well; typically, if you're doing that, we recommend the next approach, which is Kubernetes. Just for those who may be uninitiated: scaling vertically is just increasing the amount of RAM or CPU your machine has, whereas scaling horizontally is simply adding more of the same machine.
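The vertical-versus-horizontal distinction can be sketched in a few lines. The numbers and field names here are purely illustrative; they have nothing to do with real instance types or pricing:

```python
def scale_vertically(machine, factor):
    """Scale up: keep one machine, multiply its RAM and CPU."""
    return [{"ram_gb": machine["ram_gb"] * factor,
             "cpu_cores": machine["cpu_cores"] * factor}]


def scale_horizontally(machine, count):
    """Scale out: run more copies of the same machine."""
    return [dict(machine) for _ in range(count)]


def total_capacity(fleet):
    """Total RAM and CPU across a fleet, whichever way it was scaled."""
    return {"ram_gb": sum(m["ram_gb"] for m in fleet),
            "cpu_cores": sum(m["cpu_cores"] for m in fleet)}
```

Doubling one 16 GB / 4-core machine vertically, or running two of them horizontally, yields the same total capacity; the practical difference is that the horizontal fleet can also give you redundancy and can be grown or shrunk incrementally, which is why clusters are the recommended route for variable workloads.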
But if you're going to go with that approach, we recommend a cluster. If you've got a larger team and variable workloads, you might want to consider using something like Kubernetes. Again, every cloud provider has their own version of this; on Amazon, they've got a managed service called Amazon EKS, or Amazon Elastic Kubernetes Service. This is really great for enterprises that have variable workloads. And if you want to version-control your environment, you can use Docker to containerize it and scale it to your team's needs: if your team grows, you just configure your cluster to grow with your team.
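The "grow with your team" behavior comes from cluster autoscaling. Here is a hypothetical sketch of the core sizing decision; a real autoscaler such as the Kubernetes Cluster Autoscaler considers far more (pod scheduling constraints, node groups, scale-down cooldowns), and the function name and fields are invented for illustration:

```python
import math


def nodes_needed(pending_jobs_cpu, node_cpu_cores, min_nodes=1, max_nodes=10):
    """Decide how many identical nodes fit the pending CPU demand.

    pending_jobs_cpu: CPU cores requested by each queued job.
    node_cpu_cores:   cores available on each node.
    The result is clamped between min_nodes and max_nodes: the cap
    is the kind of limit that keeps an autoscaled bill bounded.
    """
    demand = sum(pending_jobs_cpu)
    raw = math.ceil(demand / node_cpu_cores) if demand > 0 else 0
    return max(min_nodes, min(raw, max_nodes))
```

For example, three queued jobs asking for 2, 2, and 4 cores on 4-core nodes need two nodes, while an empty queue falls back to the configured minimum rather than zero.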
The final one I want to point out is a built-in feature that's come out in RStudio Server Pro in the last couple of years that some people may be aware of, and that's the Job Launcher. It integrates well with Kubernetes. What it allows you to do is work within your regular environment: maybe you've got pretty low compute needs on an everyday basis within your team, so you don't need an EKS environment, and you're OK with an EC2 instance, but every once in a while you want the ability to scale up your jobs to a much larger scale. That's where the Job Launcher can be really effective, allowing power users to run those heavy workloads without bringing the base instances to a halt.
So if you've got a team of 10 data scientists all working on their EC2 instance, and somebody wants to run a really heavy model training or simulation, they can utilize the Job Launcher and not take down or tank the entire EC2 instance for everybody else. It isolates that workload from everybody else on the team.
Controlling cloud costs
So, pay as you go: only pay for what you use. This is one of the great benefits of the cloud, but only if you do it right; you can definitely get this wrong and end up with a big, unexpected bill. One of the common mistakes people make is over-provisioning resources, and this comes down to poor planning: you overestimate your needs and end up with machines, an architecture, that is just a little bit too much for what you actually needed. You can scale that down if you're monitoring it well.
Another one is poor limit planning. Most providers will allow you to set limits on your spend, limits on how big you want your clusters to get, limits on what machines different people are allowed to use, and so on. If you're not planning that out, or not utilizing it at all, then, as with clusters, the cluster will grow as big as the team's usage demands. So if you've got a team of data scientists, you don't have any limits set, and they get carried away with themselves, you could end up with a very large bill at the end of the month.
And the last one, as you alluded to, is shutting down unused resources. What we see in a lot of teams is that they use, you know, Elastic MapReduce or other clusters, and the analysts will turn on their clusters and do their analysis, but they'll forget to shut them off. If they're not using scalable infrastructure, and they're setting up a lot of expensive machines and leaving them running overnight, that can be a pretty hefty bill in the morning for somebody who forgot to turn things off.
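As a hedged back-of-the-envelope illustration of that point (the node count, hourly rate, and idle time below are made-up assumptions, not figures from the webinar), the cost of a forgotten cluster adds up fast:

```python
# Rough cost of leaving a cluster running when nobody is using it.
# The hourly rate is purely illustrative; check your provider's pricing.
hourly_rate = 1.00          # $/hour per node (hypothetical on-demand rate)
nodes = 10                  # cluster size
idle_hours_per_night = 12   # left running overnight

nightly_waste = hourly_rate * nodes * idle_hours_per_night
monthly_waste = nightly_waste * 30

print(f"Wasted per night: ${nightly_waste:,.2f}")   # $120.00
print(f"Wasted per month: ${monthly_waste:,.2f}")   # $3,600.00
```

Even at a modest per-node rate, an idle cluster left on every night can quietly cost thousands per month, which is why auto-termination and scheduled shutdowns matter.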
You can definitely mitigate these problems. You can be very thoughtful about planning your architecture, and this is something we work on a lot with our clients, making sure we spend plenty of time in the planning phase and don't rush into implementing solutions. We really want to make sure that the solution we're architecting for them is right for them now and right for them six months from now. It's the right mix of balancing what they need today while also taking into consideration what they're going to need in the future.
Make sure you're utilizing budgets, monitoring, and alerts. Be very clever about that and use it to the max, because that's going to help prevent any unexpected bills or surprises. Categorize your spend: make sure that if you are overspending, you're able to easily pinpoint where in your system it's happening, so you can isolate the offending instance or cluster and take mitigating action. And of course, you need to regularly reassess: you can plan as much as you want, but your team's needs are going to change over time.
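One lightweight way to act on "categorize your spend" is to tag resources with a cost-allocation tag and roll costs up by tag. This is only a sketch with made-up tag names, costs, and budgets, not output from any billing API:

```python
# Roll per-resource costs up by a cost-allocation tag and flag overruns.
# All tags, costs, and budget figures here are illustrative.
from collections import defaultdict

line_items = [
    {"tag": "team-analytics", "cost": 420.0},
    {"tag": "team-ml",        "cost": 1950.0},
    {"tag": "team-analytics", "cost": 310.0},
    {"tag": "team-etl",       "cost": 95.0},
]
budgets = {"team-analytics": 800.0, "team-ml": 1500.0, "team-etl": 200.0}

spend = defaultdict(float)
for item in line_items:
    spend[item["tag"]] += item["cost"]

# Keep only the categories that have blown past their budget.
over_budget = {tag: total for tag, total in spend.items()
               if total > budgets.get(tag, 0.0)}
print(over_budget)  # {'team-ml': 1950.0}
```

With spend grouped like this, the "offending" category stands out immediately, which is exactly the pinpointing the speaker recommends.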
Cloud security and the shared responsibility model
The next key question is: how secure is the cloud? And like I said earlier, it can really be as secure as you want it to be. A lot of people think, well, if I have my physical infrastructure in-house, on premises, that's the most secure I can be. But that's not always the case, because you've got to pay for security on your own premises. You've got to pay for security guards if you're really concerned about people coming in and doing harm to your infrastructure, and you've got to maintain that infrastructure along with all the firewalls and security around it.
If you work with the cloud, Amazon under their shared responsibility model, or the other providers under theirs, will maintain the security and integrity of those physical machines, and they provide all of the software you could possibly need to configure your security. So really, it can be as secure as you want it to be. And that is a real caveat, because you need to make sure you're configuring your solutions to be as secure as your requirements demand. If you have a very high security need, you need to scale up your investment in your architecture appropriately.
With different cloud providers, and I'm going to use AWS again as an example, you've got different resources. You can have a VPC whose subnets are completely private, or completely public, or a mix of the two. And you can have instances in each environment that speak to each other using what's known as an Amazon Resource Name (ARN). Again, in every cloud provider, you've got this ability to allow services to talk to each other without humans having access to those services.
This is really powerful if you want to use serverless technology: you can have an entry point that a human can interact with, while the underlying hardware, data, and infrastructure are off-limits to them. Only the actual AWS services that do have access can reach those resources.
And this is just an example where we've got two VPCs with private subnets inside them. Ordinarily, these two instances shouldn't be able to interact with each other. But again, Amazon provides the software: you can use a VPC peering connection to make sure those instances can now communicate with each other. A malicious actor still cannot reach those private subnets or private instances, but you can configure the services themselves to talk to each other. That can be really powerful and important for a lot of use cases, making sure the different systems across your organization can talk to each other without compromising security.
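As a sketch of what that looks like in practice, here is an illustrative CloudFormation fragment (CloudFormation comes up again later in this talk). The resource names and the CIDR range are hypothetical, and the referenced VPCs and route table would be defined elsewhere in a real template; this is not a complete, production-ready configuration:

```yaml
# Illustrative fragment: peer two VPCs and add a route so instances in the
# requester VPC can reach the peer VPC's private address range.
Resources:
  AnalyticsPeering:
    Type: AWS::EC2::VPCPeeringConnection
    Properties:
      VpcId: !Ref DataScienceVpc         # requester VPC (defined elsewhere)
      PeerVpcId: !Ref SharedServicesVpc  # accepter VPC (defined elsewhere)

  RouteToPeer:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref DataSciencePrivateRouteTable
      DestinationCidrBlock: 10.1.0.0/16  # peer VPC's CIDR (example value)
      VpcPeeringConnectionId: !Ref AnalyticsPeering
```

The point of the sketch is the shape of the solution: the peering connection plus a route entry lets the two private environments exchange traffic without exposing either one publicly.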
Again, just to point out, this is an example of the AWS shared responsibility model, and the key message is that it's up to your organization to configure a secure environment. That's your portion of the shared responsibility model. AWS will provide and maintain the software and hardware needed to do all that, but you need to take responsibility for making sure your privacy and security meet your requirements. If you're in a highly regulated industry, you need to consider that. There are also government cloud options: if you're working in a government agency, most providers have an option that's exclusive to government agencies.
Stability, disaster recovery, and scaling
Again, I want to keep on the track of this shared responsibility model, now talking about the stability of environments. If I don't control my physical infrastructure, how can I ensure that everything's going to be safeguarded? What happens if there is a disaster? Am I going to lose all my data? Am I going to lose all my services? Here again, you really can take advantage of the scale of the cloud.
So, yes, if you were to architect your solutions poorly, and that's your portion of the shared responsibility model, you can expose yourself to instability within the cloud infrastructure. You want to make sure, when you're architecting your solutions, that you're taking advantage of some of the features cloud providers have. In the case of AWS, they've got many regions around the world, dozens of them, and they're constantly adding more. Within each region they then have multiple availability zones, which are geographically distinct areas within that region. And finally, within those availability zones, they have data centers, and these are where your actual machines live.
So if you want to make sure that you aren't hit by a disaster where you lose all your data, all your computers, or access to your systems, you can architect solutions that mitigate that. In that way, the cloud can be far more scalable, secure, and stable than any on-premise solution could ever be, if you take advantage of that scale. It just means you've got to architect it carefully.
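To make the "mirror across distinct locations" argument concrete, here is a toy availability calculation. It assumes, purely for illustration, that each site fails independently and that a single site is up 99.5% of the time; real availability figures and failure correlations will differ:

```python
# If each independent site is available with probability a, a system that
# works as long as at least one of n mirrored sites is up has availability
# 1 - (1 - a)**n. The 99.5% figure below is an illustrative assumption.
def mirrored_availability(a: float, n: int) -> float:
    return 1 - (1 - a) ** n

single = mirrored_availability(0.995, 1)
double = mirrored_availability(0.995, 2)
print(f"one site:  {single:.6f}")   # 0.995000
print(f"two sites: {double:.6f}")   # 0.999975
```

Under these assumptions, adding one independent mirror cuts the chance of a total outage from 0.5% to 0.0025%, which is why spreading across availability zones and regions pays off so quickly.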
And if you go to the next slide, we can have a look here: if we have a global team and we want a highly available environment, we want to make sure we've got our systems mirrored across different regions. That means if something happens in the Oregon region and one of the data centers experiences a big outage, which can happen, it's OK, because we've got a mirrored environment in North Virginia, and our analysts and data scientists can continue to do their work.
You also want to make sure you're setting up disaster recovery effectively, so that even if you only have one region, you're still backing up your data in separate areas, especially if you've got mission-critical information in your infrastructure. Just to make the point: data centers can fail, machines can fail. That's the reality. Cloud providers take advantage of their scale and in-house expertise to mitigate that and reduce it as much as they can, but it's on you to architect solutions that prevent it as much as possible on your side, too.
And there are just some closing remarks, then, on architecting in the cloud, tying back again to that idea of scale. You can plan as well as you want, build your team out, and design your architecture around your current needs, maybe even project your needs a little into the future. But oftentimes what we see is that people want to be very conservative: "Oh, I don't know how this is going to get adopted." And then, lo and behold, because it's such great software, RStudio Connect especially is something that sees a lot of adoption once it starts circulating within an organization, and suddenly your demand for those services will spike.
And so you want to be able to respond to that effectively. Now, if you're using on-premise servers, physical infrastructure on site, that means you've got to call up your hardware provider, order new machines, have them shipped to your site, and then have your IT team install them, configure them, and install the software on them. That's a time lag; you're adding weeks onto your process. With the cloud, if you're clever and you've automated your deployments in an effective way, which you can do with tools like CloudFormation, you can scale up your architecture as quickly as you want and take advantage of that effectively infinite scale.
So if you're planning for a lot of scale, consider whether you need a clustered approach if you think your team is really going to grow, or at least be aware of the different growth your team might experience, and take advantage of the monitoring that cloud providers offer to track that usage. In AWS, you have CloudWatch, and RStudio's professional products have built-in monitoring, so you can build custom monitoring solutions. Keep a close eye on usage patterns and understand when you need to scale up.
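As a toy sketch of "keep a close eye on usage patterns", here is one simple scale-up rule: trigger only on sustained high utilization, so a single spike doesn't cause churn. The threshold, window size, and sample data are made-up illustrations, not recommended values:

```python
# Decide whether to scale up based on sustained high utilization:
# trigger only if the last `window` samples all exceed `threshold`.
# The 80% threshold and 3-sample window are illustrative choices.
def should_scale_up(cpu_samples, threshold=80.0, window=3):
    if len(cpu_samples) < window:
        return False
    return all(s > threshold for s in cpu_samples[-window:])

print(should_scale_up([55, 62, 91, 88, 93]))  # True: last 3 samples above 80
print(should_scale_up([55, 91, 62, 88, 93]))  # False: a dip inside the window
```

In practice, this is the kind of rule a CloudWatch alarm encodes for you (N consecutive periods above a threshold), but it's worth understanding the logic you're configuring.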
And then, if you've got a clever architect or cloud engineer, you can easily scale up your solutions without that time lag of ordering and procuring physical infrastructure. And I think that's the last slide; if you go to the next one, it just highlights that, as I mentioned throughout, we are a full-service partner with RStudio. What that means for us is that we're licensed to resell the different RStudio products, but we also do training, implementation, and configuration of RStudio Team environments. So we work with clients to architect their solutions, and we have people who are qualified and trained to go in, install and configure those solutions, and maintain them over a long period of time as teams learn to control their own environments.
About RStudio and Q&A
And that is the end of the presentation, Lou. Great. Thanks, Kevin. And just a quick word on RStudio. I'm sure you know who RStudio is if you're listening to this webinar, but you might know us best through our open source work: things like creating and maintaining the tidyverse set of packages to make R easier to use and easier to learn; Shiny, for creating web-based applications from R; R Markdown, essentially R notebooks, to deliver reproducible, interactive R-based reports, presentations, and documents; as well as our IDE.
And certainly, creating and supporting this open source software is and will continue to be RStudio's core mission. In fact, we announced earlier this year that we've reorganized ourselves as a public benefit corporation, which means that our open source mission, our commitment to this, is now codified into our business charter, and all our decisions as an organization must balance the interests of all our stakeholders, including our community, our customers and our
