Resources

Alex Chisholm - Deploying data applications and documents to the cloud

Creating engaging data content has never been easier, yet sharing it easily remains a challenge. And that's the point, right? You cleaned the data, wrangled it, and summarized everything for others to benefit. But where do you put that final result? If you're still using R Markdown, perhaps it's rpubs.com. If you've adopted Quarto, it could be quartopub.com. Have a Jupyter notebook? Well, that's a different service. And this is just for docs. Want to deploy a Streamlit app? Head to streamlit.io. Shiny? Log into shinyapps.io. Dash? You could use ploomber.io, if you have a Dockerfile, and know what that is. This session summarizes the landscape for online data sharing and describes a new tool that Posit is working on to simplify your process.

Talk by Alex Chisholm

Slides: https://docs.google.com/presentation/d/1zulnuaT2Dm_vM0l9Gd3vS26KWJuAf0gJ1pcFKjTUNbI/edit?usp=sharing

Oct 31, 2024
19 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hi, everybody. My name is Alex Chisholm. I'm a product manager for hosted, online data science platforms here at Posit. I work on Posit Cloud and shinyapps.io, and more recently, something new we're working on called Connect Cloud. Today, I'll be talking a little bit about deploying data assets, whether they're applications or documents, up into the cloud.

I find it's great to be ending the data engineering session here. But it's a little bit, I guess, ironic would be the right word here, in the sense that I'm going to be talking about platforms that essentially remove the engineering work from data professionals in some ways, to be able to take what they're working on, on their local machines or elsewhere, and say, how can I quickly deploy this, maybe with minimal assistance from other people?

So going through the workshops on Monday, looking at a lot of Quarto outputs, then having conversations yesterday, really throughout the day, with people, it feels like we really are living in some kind of Renaissance for data and frameworks and what you can make. I mean, the amount of amazing open source tools that you can get great documentation on, that you can find tutorials on, the people embracing open source and being able to release what you want to build to the world. It's really fantastic.

So despite the hard work that data professionals still put into all these tools, it's relatively easy to get up and running in a variety of different tools today. And I think, you know, as somebody who has gone through a variety of organizations, mostly in that 300 to 1,000 range: if you were part of that Venn diagram that Nick fleshed out earlier, of we've got data scientists, we've got data engineers, we've got the people in the middle. Sometimes being in the middle is hard. Sometimes you want to be there.

A lot of the decisions we make around deployment are very dependent upon very specific circumstances within our organizations and within our data sets. But I found, especially early in my career, when I got to that stage of finding something incredible that was on my desktop, and showing it to my boss and saying, oh wow, we need the entire executive team to see this, we need the entire membership base of our organization to see this. It was hard to know what to do next. And it puts you in the middle of conversations with pure data engineers or pure IT professionals. And you felt, or I felt, a little bit like a fish out of water, not knowing exactly the best way to move forward.

Early career deployment challenges

And it all started with this really great starting point. You can see I added a few when I heard Nick talk. But at the core, it's: I make, and I made, these really cool things that I believe are useful to other people. It's a very honest place to be. You did your work, you think you have something that can be of value to other people. And then you can start getting pulled into conversations of, is this going to be secure for us? Does the other team have the priority to assist with it? Can we even make this public? Do we need to? Where's the data coming from? Who owns it? And maybe that new question that I put in italics down here after Nick: who's going to make the data contract around all of this? It's not just intimidating, it can be discouraging for people that are trying to get up and running quickly.

And two specific examples came to mind. This is the late 2000s, before business intelligence tools really took off, at least in the organizations that I was in. I remember being a young data analyst, and we had a third party website that had our revenue data. I would go to this website and download a CSV file, and bring that file into Excel. You got really creative and advanced with pivot charts and pivot tables. My boss at the time wanted that report to physically print out, or put on the CEO's desk. You would paste these into a Word document, you would get a PDF put together after, and then you would send the physical attachment out to all the people that were on the mailing list within the organization. A lot of steps, and especially in the middle of this, a lot of manual work for things that were very repetitive, month after month after month.

And it took a lot of conversations internally, especially around priorities and expertise and ownership, to get to the stage where it got a little bit better. So I still had to go to that website externally, I couldn't solve that; I still had to get that CSV file. But I got into coding for the first time. And it was actually Adobe Flex and Adobe Flash, ActionScript, to do all this. We were able to put together a pipeline to at least append to the old data, throw it into a dashboard that people could interact with, put it on an internal server, and have a link, which I think was whitelisted to the organization, that the executive team could look at at one time. And it looked pretty good. But it was a long way, and many, many steps, to get to that point.

And then a second example. I think many of us are crawling toward this notion of production, and we'll talk a little bit later about what production actually means. But getting closer to that stage: another organization, this time a high tech organization with a lot of talented engineers all over the place. We still had data streaming into our QlikView dashboards at the moment we were doing A/B testing on the website. So we had to filter down daily, pretty much, to see: we have a lot of traffic on the website, here's what the current numbers look like. We were getting the numbers from there, throwing them in internal documentation, doing the statistics behind it, trying to declare winners and losers, and eventually update the product.

And it was a very different experience here, because we had talent, because we had engineers. We were able to say, well, let's get someone to take the open source version of RStudio Server, throw that up somewhere, connect to our data, make some R Markdown documents, loop through all of these, throw these into S3, and eventually give people links internally that would be automatically updated daily, for them to be able to go through and see the winners and losers of these pretty significant A/B tests that we were running. And by the end of this pipeline, you start feeling like you're almost doing real work. After a few months, probably, of me waking up at seven every day and running something manually on that connection, eventually we got everything in place and all the pieces were put together.
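The core of that kind of pipeline is simple enough to sketch. Here is a minimal, hypothetical version in Python, not the talk's actual code: the experiment names, bucket, and file paths are all made up, and it assumes R with the rmarkdown package is installed and AWS credentials are configured. It renders one parameterized R Markdown report per experiment, then uploads each rendered HTML file to S3 under a predictable key so internal links stay current.

```python
"""Minimal sketch of a nightly report pipeline: render one R Markdown
report per A/B test, then upload each HTML file to S3.
All names here (experiments, bucket, paths) are hypothetical."""
import subprocess
from datetime import date

EXPERIMENTS = ["checkout_flow", "homepage_banner"]  # hypothetical A/B tests
BUCKET = "internal-ab-test-reports"                 # hypothetical S3 bucket


def report_key(experiment: str, day: date) -> str:
    # Predictable S3 key, so each day's report lands at a known URL.
    return f"reports/{experiment}/{day.isoformat()}.html"


def render_report(experiment: str, out_html: str) -> None:
    # Assumes R and the rmarkdown package are available on this machine,
    # and that report.Rmd declares an `experiment` parameter.
    subprocess.run(
        ["Rscript", "-e",
         f"rmarkdown::render('report.Rmd', output_file='{out_html}', "
         f"params=list(experiment='{experiment}'))"],
        check=True,
    )


def upload(path: str, key: str) -> None:
    # Assumes boto3 is installed and AWS credentials are configured.
    import boto3
    boto3.client("s3").upload_file(path, BUCKET, key)


def run_nightly() -> None:
    today = date.today()
    for exp in EXPERIMENTS:
        out = f"/tmp/{exp}.html"
        render_report(exp, out)
        upload(out, report_key(exp, today))
```

Calling `run_nightly()` from cron, instead of running it by hand at seven every morning, is exactly the kind of step that the hosted platforms discussed in this talk try to absorb for you.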

The current landscape of deployment options

So this dates me a little bit, but you come to 2024, and there are so many options out there right now. And not everyone's going to be able to use these, again, depending on their use case and their data. But you type in, how do I deploy this Python app or this Shiny app? And you come across many, many different potential solutions in this hosted space, the online space. And I know there's this endless battle between security and cloud, right? Especially multi-tenant SaaS type solutions. But then we talk to a lot of organizations and people that are willing to take risk at certain levels, given the nature and the state of their business and their case.

So trying to decipher all of the possible solutions that are out there, it looks like they fall into four major categories. One of them is general deployment. And I've talked to many, many data people that are okay, if they have one standalone application, using one of these services. It's scalable, they have security baked into it, and they can put it up there. These solutions tend to fall apart when you've got multiple documents or multiple applications that you're trying to release at one time, because the cost gets a little prohibitive: you're spinning up individual apps that have payments associated with each one of them.

Not pure applications and documents, but I'm sure many of you have also played around more recently with some of these more advanced, modern notebooks that are coming out there, especially on the Python side, though they support R workflows as well. Things like Deepnote and Hex and DataLore from JetBrains. Think of a Jupyter notebook on steroids: a lot of bells and whistles around it, a lot of good integrations out to the data sets that you have. And from a publishing perspective, from a sharing perspective, it's very few clicks to go from the work that you're doing to be able to say, let's publish this and share this with other people, either internally or externally.

I want to focus more on these code-first data science outputs that we work with. And so this whittles down a little bit more, and there are a couple of variants. Has anyone used any of these tools up here before to deploy something? It's probably 30, 40% of the room. And they all take very different approaches at the moment, both in terms of which frameworks they support, and in terms of how you're going to be engaging with these tools to put something into this production state, or at least a shareable state.

And if you start thinking about the first dimension, of what am I going to be able to deploy: here I'm looking at the free versions of these tools, and I'm looking for native support in most cases; where there is Docker support, I call that out as well. These break into three specific groups, and I can't see the shading on the big screen, but the first four are application driven: we're looking at Streamlit, and Dash, and Panel, and others. Then with the new tool we're working on, Connect Cloud, and with Hugging Face, you've got a combination of documents and applications that you can put up there. And then some of the previous tools that Posit has released, which have been sort of more uni-focused, whether it's QuartoPub and RPubs, or, on the application side, shinyapps.io.

And these are very useful tools, and you can get value, I think, from all of them. But then you start thinking about, well, what kind of person are you from that data science perspective? And what's neat is, since we launched the alpha release of Connect Cloud last month, nearly 70% of people who have signed up tell us that they intend to publish both documents and applications in one place. So we really like the idea of taking what you can do with self-managed Connect, which many of you probably work with in your organizations, and having this in a more flexible cloud environment to simplify those workflows.

nearly 70% of people who have signed up tell us that they intend to publish both documents and applications in one place.

The other neat thing to think through is just how you engage with getting, again, from a folder somewhere that has code, up to some link or something that you can share and bring people in to interact with your resources. So looking here, again, across the seven or eight that we took a look at: coding within the browser, where some tools give you the ability not just to put finished code up there, but to start something from scratch, or to bring your things in and start tweaking them. Some have pure file uploads, where you're literally just uploading zip files. In Ploomber's case, you need to do some special work, at least around Shiny, which we were playing around with.

A lot of Git-centric workflows, which I think make sense. We still talk to a fair amount of data professionals who know they should be using GitHub more, but it becomes a very powerful tool in terms of saying: I've got code in a GitHub repo, I know it works locally, and if I do enough to tell the service about what I'm trying to produce, it's going to be able to ingest that easily. Command line tools, push button, Docker; we see each one of these services target this in a different way. And what we want to continue trying to flesh out with Connect Cloud, at least, is being able to service multiple workflows here. We're starting with GitHub, as I'll show in a quick slide in a few moments, but eventually we want to be able to support command line and push-button, and potentially other deployment mechanisms as well.

Who is deploying and why

And then the final thing we're looking at here: these are supposed to be, in some cases, lightweight deployment options. You want to be able to share with the world. We know that we have a lot of people getting into data science initially, and we have a lot of people who make careers out of this, both as individuals, standalone, maybe a contractor, a freelancer, or an influencer, but also within organizations. We know that the people on our teams that make some of the most engaging documents and assets want to be recognized for it as well. So we're thinking through how these tools are going to be able to demonstrate, in a searchable, discoverable way, who is producing what, and how that ties back to other data and other resources within the organization.

So, the question: what is the best deployment solution? And of course, the answer is, it depends. It depends on a lot of things. We see, at least through this general market survey of what is being offered right now, that you do have a lot of differences in what is being supported in terms of the actual framework. Some of these tools are using Docker, which I know scares a fair amount of people on the pure data side of things, but through Docker they're able to say that they support essentially everything, by making it fully customizable for you.

There are also the security considerations of being able to operate in a pure multi-tenant, or more SaaS-style, cloud environment. My biggest takeaway from talking with both data scientists and teams over the last couple of years with Posit: production means very different things to different people, and security means very, very different things to different people. And typically when I'm talking to people, at least on this side of the Venn diagram, on the data science side, and you say, I know it needs to be secure, but what does it mean for you for it to be secure? It's hard to articulate exactly what those criteria are.

production means very different things to different people and security means very, very different things to different people.

So trying to think through the goals of that first question we looked at: I've made something neat, I want to share it, but what are we looking to do here? The two major dimensions that come up in conversations a lot are scalability and security. On scalability: are only a few people going to be interacting with what you made, or do you have a lot of people that are going to be interacting with it? And on the security side: do you have a very restrictive security footprint, maybe regulatory, maybe just your organization's beliefs about what you should be doing with your data? Or are you working with data that is a little bit more permissive, where you're able to share widely?

We work with plenty of organizations that have very secure data, but they also have a lot that they do publicly with aggregate data. And they're okay with these assets, especially ones that are highly engaged with, with a lot of people looking at them, operating in different spaces.

So, a few use cases that come to mind here, that we hear often, and that we're trying to address with Connect Cloud, where eventually we want to be able to handle all of these quadrants. First, learning and experimentation. We started by looking at all the frameworks that are out there. When you get a good foundation of programming and data science skills, being able to jump around is something you can do, and you might find yourself in an environment where you're going to be forced to. So it's fun, I think, to be able to play around with these things and have kind of a real production environment that is not as high stakes to play with.

People on the job hunt as well. We talked earlier a bit about the amount of tutorials and documentation that are out there; because of what you can make, this is also helpful, especially early in your career, to put together that portfolio and say, I can make these things, to tie this back to data, to show how you did things differently than the tutorial that you might've followed. We talked about rapid prototyping. So this is, maybe you couldn't operate in one of these cloud environments, but you know that this type of tool is something that you need and that would benefit your organization. Putting together something with anonymous data to be able to say, look at what I can make right now, go play with this. It's as simple as passing a link to someone within our organization, and they're going to have a better understanding of what we're putting together, or of some way for one of our models to be implemented.

Public dissemination is an interesting one, especially when you don't know how many people are going to be looking at your asset. We've talked to governments and departments within US states, for instance, or countries. And they've said, you know, we don't know how many people are going to hop into this, but we're going to release it on our LinkedIn feeds, on Twitter or X, or whatever it might be. How are you going to be able to handle that kind of scaling? And then getting into the more restrictive range: internal business intelligence, consulting projects; we put them in these upper boxes. And of course, security is very, very important, but in every conversation we have, it's a sliding scale about where people think they need to be, based upon the data that they're working with.

Introducing Connect Cloud

At a minimum, there are a lot more options out there. In the last minute or two that I have here, I just wanted to mention, if you haven't played around yet with Posit Connect Cloud, this is something that we released last month. I like to think of it as either a hosted version, a cloud variant, of self-managed Connect, which again, a lot of you probably use, but also as the next generation of our other standalone tools, like QuartoPub and RPubs and shinyapps.io. We're going to continue developing this, and we're starting initially with individuals. Right now it's just free plans, from public GitHub repos to public outputs.

But eventually we're going to add on those additional publishing workflows, and start thinking through privacy and more about security and collaboration as well. And it comes back to these two things that mean a lot to us: the ability to have your open source, code-first work sitting in a GitHub repo, private or public, and then also that working example right next to it. We think this is a powerful combination, and that's how we built this first level of tooling here. So all you need to do is authenticate with GitHub, pick a repo that has one of the assets that you want to share with the world, and it'll put together a public URL to share and build up a portfolio for you.
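As a rough sketch of what that workflow implies for your project (the repo name and layout here are hypothetical, not from the talk), the repo a GitHub-based publish flow picks up can be as small as an app file plus a dependency manifest:

```
ab-test-dashboard/          # hypothetical public GitHub repo
├── app.py                  # e.g. a Shiny for Python or Streamlit app
└── requirements.txt        # the Python packages the app imports
```

You authenticate with GitHub, point the service at the repo, and the service resolves the dependencies, runs the app, and hands back a shareable URL; the details of how dependencies are declared will vary by framework and service.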

So what we're hoping, regardless of whether it's Connect Cloud or one of the other options that we looked at, is that data scientists don't feel like that fish out of water when it comes to that final step, after all the work that they just did. They made that really cool thing; they want to be able to share it.

This bitly link just takes you to a bunch of links on some of the other deployment tools and information that we looked at earlier. I think that's one thing that we're hoping to achieve with Connect Cloud: helping people have confidence in what they're building, being able to test it out in real time, share it, and start getting feedback on what they're putting together. Thank you very much for your time. We're up at the lounge, I think on the seventh floor; we're happy to talk more about it. Thank you.

Q&A

Thank you, Alex. I know we're close to time, so maybe just one question. And this one actually hits close to home for me, so I would love to hear your take on it. So much deployment seems to be focused on presentation, like dashboards and reports. What about deploying R and Python scripts that run jobs, like ETL workloads?

Yeah, we're thinking more about batch and interactive job types. We're also thinking more about model deployments. Right now, you've probably noticed, for those who do use APIs, for instance, on things like self-managed Connect, those are not among the frameworks we support right now. But eventually we'll be adding more of these things, like pins and APIs, to enable a little bit more advanced workflows as well. I think it gets a little complicated when you're looking at a hosted tool that could be constantly online; if you need something up 24/7 today, it turns into a cost consideration on both our sides. So we're thinking through the best way to get into those phases as well.

Thank you, Alex. Yeah. Thank you all.