
Matching Tools to Titans: Tailoring Posit Workbench for Every Cloud - posit::conf(2023)
Presented by James Blair

In an era of diverse cloud platforms, leveraging tools effectively is paramount. This talk highlights the adaptability of Posit Workbench within leading cloud platforms. Delve into strategic integrations, understand key challenges, and uncover practical solutions. By the end, attendees will be equipped with insights to harness Posit Workbench's capabilities seamlessly across varied cloud environments.

Presented at Posit Conference, September 19-20, 2023. Learn more at posit.co/conference.

Talk Track: Data science infrastructure for your org. Session Code: TALK-1115
Transcript
This transcript was generated automatically and may contain errors.
Great. It's great to be with you all. I'm James Blair. I work at Posit as a product manager for cloud integrations. So today we're going to be talking about different ways that Posit Workbench can operate in different cloud environments through some of the partnerships that we're establishing today.
Now, at Posit, we work really hard to make sure that Posit Workbench meets the needs of the modern data science developer. This includes giving you tools and environments that allow you to do the day-to-day workloads that are important for your work. Exploratory data analysis, model training and tuning, deploying models, managing things of that nature, exploring data visually with tools like ggplot and other things like that, building shiny applications. Coding and working in other environments like VS Code and Jupyter Notebooks are all things that are supported under the Posit Workbench platform.
Now, what's not often apparent to developers as they're working in this environment is the underlying infrastructure requirements that are sometimes complicated in order to support an enterprise product like Posit Workbench. To illustrate some of this complication, I want to take a moment to look at some of the documentation that we frequently share with customers when they're setting up Posit Workbench for the first time.
It almost reads like a choose-your-own-adventure script. You can install Posit Workbench as its own standalone server. This is typically the simplest approach, and often the one most people breeze right past. You can also run Posit Workbench in a load-balanced environment, where multiple nodes are configured behind a load balancer and users are assigned to different nodes based on whatever rules are in place. This is more robust, but it adds complexity to the setup process.
I could use Posit Workbench with an external resource manager like Kubernetes. Now I have Workbench set up. I have a separate Kubernetes environment set up. Workbench communicates with that environment. Sessions launch in that environment. This is a supported infrastructure. Or I could run the whole thing in Kubernetes. Workbench runs there. The sessions run there. Everything runs in Kubernetes. And then if I really want to make things interesting, I could also run and integrate with Slurm.
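To make the external-resource-manager path a little more concrete, here is a minimal sketch of what the relevant Workbench configuration can look like when sessions launch in Kubernetes. The keys follow the pattern of the Workbench admin guide, but the specific values (address, port, cluster name) are placeholders, not a working setup:

```ini
# /etc/rstudio/rserver.conf -- enable the Workbench job launcher
# (placeholder values; consult the admin guide for your version)
launcher-address=localhost
launcher-port=5559
launcher-sessions-enabled=1

# /etc/rstudio/launcher.conf -- define a Kubernetes cluster for sessions
[cluster]
name=Kubernetes
type=Kubernetes
```

Even this simplified sketch hints at why the Kubernetes and Slurm paths add operational overhead: the launcher, the cluster, and Workbench itself all have to be configured and kept in sync.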
Now, some people look at this, some internal IT organizations look at this documentation and this is how they feel. They are salivating because they have the know-how. They have the expertise. They have the infrastructure. And it's almost a plug-and-play exercise to get Posit Workbench up and running. And we applaud those organizations.
However, I also recognize that lots of people look at this and they feel more like this. I'm overworked. In many cases, I may be underpaid and I don't have time to figure out what Kubernetes is, let alone figure out how to use it with this tool called Posit Workbench that if I'm in the IT organization, I probably have very limited familiarity with.
What we've noticed is that a lot of these administrators are starting to turn to the cloud to find solutions. They're looking for ready-made services or platforms that can alleviate some of this administrative burden and supply users with the experience they want without creating an undue maintenance overload for the IT professionals within the organization.
Today, we're going to talk about a number of different solutions that exist today and are coming in the near future that allow Posit Workbench and support Posit Workbench inside of these different cloud environments. We'll talk about some of the advantages, how they differ from one another, and provide some resources where you can learn more.
AWS and Amazon SageMaker
We'll start with AWS, or Amazon Web Services. We've partnered with Amazon SageMaker so that administrators can configure an Amazon SageMaker domain to have access to RStudio. In this process, an administrator either creates a new domain or selects an existing domain in their SageMaker environment, makes sure there's a Posit license in place, and configures access to Posit Workbench and RStudio within that environment.
Once that's been done, individual users can come into the SageMaker platform and can request RStudio sessions from within that platform. When users make this request, they're brought to the familiar Posit Workbench homepage, but there's a few key differences from what you might expect out of a traditional installation. One of the biggest things here is that users can request a specific compute instance type for their session to run in. This means that every user gets an isolated EC2 instance within AWS for their particular session.
This accomplishes two distinct things. One, users can request resources based on the workload they anticipate doing. If I have a huge data set and I plan on bringing it into memory and analyzing it, I know I'm going to need a lot of memory. So, I can make a choice when I start my session to have an instance that provides me with adequate memory for my analysis. On the other side, maybe I have an analysis that's going to require a lot of parallelization across multiple CPU cores. To improve efficiency, I can request a compute instance that has a high CPU count so that my workload finishes faster.
The other advantage of this architecture is that everybody's sessions run independently of one another. This is hugely advantageous if you're like me and you occasionally do something a little bit silly in RStudio and all of a sudden the entire server is locked up. I'm sure I'm not the only one who's tried to read a data set into memory that far exceeded the available memory in my environment. If that happens in a traditional single-server install, I've now brought the server down for everyone. I had to buy more than one round of drinks to make up for that.
Now, if you do the same thing here in the SageMaker environment, you need to reset your session, but other users are unaffected. If I do something in my session that causes me to exceed the available resources, I'll need to start over and make sure that my new session contains adequate resources, but I am not interrupting anyone else's workflow.
The advantage here, and this is true for all of these solutions that we talk about, is that this comes without additional IT administrative burden. The IT office does not need to worry about managing this environment. SageMaker handles the orchestration of the resources and everything behind the scenes.
Google Cloud Workstations
Let's talk about Google Cloud. We partnered with Google Cloud Workstations, which is a fairly new offering on GCP. The offering is divided into two personas. Administrators create workstation configurations that specify what resources are available and what tools run there, and users then come into the platform and request access to a workstation based on one of those configurations.
When a user makes this request, a dedicated environment is created just for that user that provides access to the tool set that was defined in the configuration running on a specific instance that was also defined in that same configuration. When users come into the platform, they can see existing workstations they have, they can request new workstations based on configurations they've been given access to, and then they can launch workstations once they're running. And in the case of Posit Workbench, they will then have access to all the tooling that they would expect to have access to inside of a Workbench environment.
VS Code, the RStudio IDE, Jupyter Notebook, and JupyterLab are all available within this integration. All IT has to do is set up these configurations to define what tools are available. Posit Workbench is available from a drop-down list of tools you might want to offer, and then users can access it in their own dedicated compute environments.
Databricks integration
Next up is Databricks. This is one that we're very excited about. This is fairly new. We've published a few blog posts. In fact, one was published just last week that highlights some of this. And there's a number of things that are happening here, so I'll spend a little bit more time here with Databricks.
One is we've made a lot of progress recently on improving the sparklyr R package to support some new and exciting developments on the Databricks side. Specifically, we're now able to take advantage of Databricks Connect from within sparklyr. What that means is I can remotely connect to a Databricks environment from anywhere. I don't have to be co-located, I don't have to use a Databricks notebook. I can have RStudio on my desktop, I can have Posit Workbench in AWS, I can have an R terminal on some dusty server in a closet. It doesn't matter. I can use sparklyr to connect to Databricks and run workloads that execute in the Databricks environment.
When I create these connections, the connections pane in the upper right-hand corner of the RStudio IDE will show me details about the data, the catalogs, the schemas, everything that's available to me from that Databricks context. I can explore that data, I can connect to it, I can manipulate it, and all of the computation happens on the Databricks end.
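As a rough sketch of what that remote workflow can look like with a recent version of sparklyr that supports Databricks Connect: the cluster ID and table below are placeholders, and the connection assumes `DATABRICKS_HOST` and `DATABRICKS_TOKEN` are set in the environment.

```r
library(sparklyr)
library(dplyr)

# Connect to a remote Databricks cluster via Databricks Connect.
# "0123-456789-example" is a hypothetical cluster ID.
sc <- spark_connect(
  method     = "databricks_connect",
  cluster_id = "0123-456789-example"
)

# The dplyr pipeline below is translated to Spark and executed on
# Databricks; only the small summary result comes back locally.
tbl(sc, dbplyr::in_catalog("samples", "nyctaxi", "trips")) |>
  summarise(trips = n()) |>
  collect()

spark_disconnect(sc)
```

The key point is the one from the talk: none of this code cares where R itself is running, because the computation happens on the Databricks side.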
We're also excited because we're creating a new feature inside the RStudio IDE that's part of Posit Workbench that will allow Databricks users to manage their clusters directly from the IDE. This means that if I'm using Databricks and I have a cluster that I want to connect to, but it's not started yet, I don't have to open a browser, log into Databricks, visit the control plane, click on Start Cluster, come back to the IDE, wait for the cluster to start, connect to the cluster, and then work, I can start the cluster directly from RStudio. Once it's started, I can connect directly to it and begin working. And this will be available in an upcoming release of Posit Workbench.
The last thing that we're excited about, and this is more forward-looking, this is not something that's available yet, but we anticipate being available next year, is Posit Workbench as a Lakehouse application. Databricks back in June announced Lakehouse applications as a way to run third-party applications natively directly on Databricks infrastructure. What this means is that coming soon, like I mentioned sometime next year, you'll be able to run Posit Workbench directly on your Databricks infrastructure in a way that's supported, stable, and provides users with the experience they expect.
Snowflake and Posit Cloud
I'll mention Snowflake briefly. This is one that we are in the very early stages with, but we're working closely with Snowflake engineers and developers to make it possible to run Posit Workbench as part of Snowpark Container Services, which is a fairly recently announced new development over at Snowflake. The idea is that you'll be able to run Posit Workbench directly within your Snowflake environment.
And when you do that, you'll also have access to the same connections pane that we looked at previously that will allow you to explore the data that's available to you within Snowflake, run queries against that data, use that data for analysis, all while staying within the realm of your Snowflake architecture and infrastructure. This is still really early phases, so more documentation and more examples will be available soon as we continue to work through this integration.
Finally, last but not least, we have Posit Cloud. We've done a lot recently to make Posit Cloud more robust, more capable, and more feature-rich as we've listened to feedback from our users. This is a really great kind of low-barrier way to get exposure to some of the products and features that we offer to enterprise customers. Anyone can create an account on Posit Cloud. And from that account, you can create workspaces, you can create projects that are leveraging either the RStudio IDE or Jupyter Notebooks. And from those projects, you can then publish things to Posit Cloud and share and distribute those with others. And we continue to build out this platform to improve the user experience.
Looking ahead
As we look towards the future, it's clear that the cloud landscape is always shifting. There's new technologies that arrive. There's technologies that disappear. But here at Posit, we remain committed to making sure that regardless of where your organization chooses to operate, we are a natural fit, that our products work well in those environments, and that in many cases, we find ways to make sure that they not only work well, but that they work in a way that doesn't add additional burden to IT organizations that are often already overworked.
We look forward to a future where the whole Posit family of products, not just Posit Workbench but also Posit Connect and Posit Package Manager, is available in these different cloud environments, and that's something we are diligently working towards with all of these partnerships.
So what's next? If you attended today and didn't see your cloud provider or cloud service of choice in what we've talked about, feel free to reach out to me. You can reach me at james@posit.co. I'd love to chat or set up a discussion about things that you are doing within your organization, or tools that you're using that you would like to see work better with Posit tools and products.
Alternatively, if you are a user of one of these services, you use Snowflake, you use Databricks, you're operating in GCP or AWS, whatever the case is, if you have questions or if you have feedback on the existing integrations, again, happy to receive that and happy to have you reach out.
Here's a collection of resources. These are links that will take you to different blog posts, documentation things that outline some of the work that's happening right now, places that you can keep an eye on for the future. This is all available if you scan the QR code here. This will take you to a GitHub repository that contains the slides, links to everything that I've just shared, and things like that. But I appreciate your time today. Thank you very much, and we'll take some questions.
Q&A
So I have some questions here, actually. How does Posit choose which provider or features to prioritize, and is there any possibility of a technical roadmap?
Yeah, this is such a good question. It's more of an art than a science. What I mean by that is each of these partnerships has two sides to it. We obviously want to make sure that our products are as well represented and as well positioned as they can be in each of these different environments. But we're also balancing that against the needs, demands, and expectations of the group or company we're working with. And so a lot of it comes down to the requirements or expectations they might have and where we end up meeting in the middle.
An example of this would be SageMaker, right? SageMaker has SageMaker Studio, which is a JupyterLab-based interface for interacting with their platform. So when we came to them and when we started these discussions about Posit Workbench, one of the things that they didn't want to do was they didn't want to enable Jupyter Notebooks and JupyterLab and VS Code within Workbench because they wanted Python users to continue to rely on the solution they provide. And so if you use SageMaker and Posit Workbench today, you'll notice that the only editor that's available in that context is the RStudio editor because of some of the tradeoffs we made in our conversations with Amazon.
So one of the things that I focus on and think about a lot is how I can make sure that our product integrations are as consistent as they can be across the different partners that we partner with, but also while acknowledging that in some cases there's expectations and things that need to be taken into consideration that might make the experience a little bit different from one to the other.
Is only providing RStudio on SageMaker an AWS limitation? Would you be able to run VS Code in the near future?
Yes. See previous response. Yeah, so we've talked to them a lot about that, right? I'd love it if they opened up some additional functionality on the SageMaker side. In all honesty, I don't expect that that will change anytime in the near future. But the one thing that I'll say, and this is true of any of these types of questions, is that if you have questions like this around how come this isn't this way or why is it this way, let us know about it. Also, if you have a relationship with SageMaker, if you're a SageMaker customer or an Amazon customer or with Databricks or whatever, reach out to your reps on their end as well and provide that feedback for them. Because they're managing their own roadmaps and we're trying to kind of jointly work on that with them. But these providers getting that feedback directly is often the thing that will help move things forward the best.
So I work in a highly regulated industry. Do any of these cloud service providers provide encryption in transit, at rest, and even during use?
Yeah, that's a really good question. I know SageMaker has several different compliance requirements that it meets. I know it's SOC 2 compliant, there's HIPAA compliance, there's some other things that are there on the SageMaker side. It really depends, and that's a good question for the provider or the partner. One of the things that I don't have a clear answer for right now is as we look towards the future with Databricks and the work that we're doing there, being able to operate directly on Databricks infrastructure, which means that data wouldn't ever travel outside of your Databricks environment. Workbench would be there. As you analyze data, it would stay within the realm of Databricks. There might be some things there that meet some regulatory requirements, but that would be a question better suited for Databricks in that case.
Can these solutions connect to Package Manager and have the workbench environments use only packages available based on freeze dates?
So yes is the short answer, but it depends on which particular integration or partnership we're talking about. For example, SageMaker lets you configure this at the domain level: when you set up a SageMaker domain and configure RStudio, you can supply a default package repository URL. So if you have a snapshot from Package Manager that you want every user to use, you can supply it there, and every user session will point to that location by default. With other solutions, like the GCP integration or the work we're doing with Snowflake, things sit a little further down the stack: you're managing containers and image definitions to provide users with access to the tooling, and defaults for upstream repositories can be set at that level.
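Whichever mechanism delivers it, a SageMaker domain setting or a baked container image, the effective result is an R session whose default repository points at a frozen Package Manager snapshot. A minimal sketch of that end state, with a placeholder hostname and freeze date standing in for your own Package Manager instance:

```r
# Hypothetical site-wide default, e.g. set in Rprofile.site:
# install.packages() then resolves against the frozen snapshot
# rather than the moving head of CRAN.
options(repos = c(
  CRAN = "https://packagemanager.example.com/cran/2023-09-01"
))
```

With this in place, every user installing packages gets the package versions as of that date, which is what makes results reproducible across sessions and users.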
