Resources

Keeping Data Alive: Persistent Storage Options for Dynamic Cloud Applications (Alex Chisholm, Posit)

Abstract: Building cloud-based data applications that evolve over time requires persistent storage solutions. Without a way to record new information or modify datasets, applications like Shiny or Streamlit would reset after each session, losing potentially valuable information. This talk explores persistent storage options ranging from lightweight solutions like Google Sheets to scalable services like Amazon S3 and MotherDuck. We'll also cover securing connections to these services using Posit Connect Cloud's secret variable management. By the end, you'll understand common persistent storage solutions, their trade-offs, and how to choose the best approach for your specific project.

Aquarium Search App: https://github.com/alex-chisholm/aquarium-search

posit::conf(2025)


Transcript

This transcript was generated automatically and may contain errors.

Thank you for coming today. I'm going to be talking about persistent data storage for cloud application deployments. And while I do talk about DuckDB for a slide or two, that was just one of the methods I looked into to help with this relatively common problem. So my name is Alex Chisholm. I've been with Posit for the last three years.

But before that, really my entire career, I've been doing something with data. And a lot has changed, obviously, over those 15-plus years at this point. But some things remain frustratingly true and stay the same. One of those is the ability to create something on our machines and get it to work, and then want to show it to somebody else in a way that still functions the same way it worked on our machine. That's what we're going to be talking about today. And I find that especially frustrating, because when you get to that point and you have that thing that is working, we have this feeling of: ah, this works on my machine, I really want to show this off, I really want to get it in front of other people. And then we start going through the motions. There are more deployment tools today than have ever been available before, but some things get lost in translation, depending on the type of application you're putting together.

I think this is true as well from teaching on the side over the years: we talk a lot about just getting something to work, but distributing and sharing is a vital part of the data science work that we all do. So I might start at this point: it works on my machine. And then I want to get my clicker to work, and it doesn't. So I go back to my laptop, and that doesn't work. There we go.

So we have that thing on our machine, we find some way to deploy it, and we have our base application. Then we probably have some initial data sources powering the application; these things are usually data-driven in some way. And then we're going to put this in front of a user, someone who's going to interact with it. To think about it a little more concretely: this might be a Shiny application, you might have a CSV file that powers or is displayed within your application, and you might put all of that code together onto GitHub. It's a relatively small dataset, let's say. Then you need to figure out what that deployment mechanism is.

I work on the hosted team at Posit, so we're responsible for Posit Connect Cloud, which is a great way to deploy these types of things. You can take these pieces, put them on Connect Cloud, and then make a link that goes out to somebody. And depending, again, on the type of application you're building, the first experience will likely be similar: the user goes to Posit Connect Cloud, the application takes a second to load, they see what's there, and they start interacting with it. Maybe it changes and gets a little personalized based on what they're searching for; maybe they upload data of their own that becomes part of the application itself. There are a lot of things a user can do.

But at some stage, the user is going to go away and stop interacting with it, and then we tell our servers that we need to shut this thing down, and we turn off the instance. And maybe unbeknownst to that user, or to you the first time you're testing it, you come back expecting to see all that good stuff that happened, the things you wanted to stay with your application. The app reloads, and you start saying, well, I thought I uploaded this data already, but the deployed instance has no memory of what happened. So that's why I'm trying to think through the stages or steps I could follow to add persistent storage to this, and how we capture this data efficiently, maybe cheaply, so that we can use this new data to get our application to v1.1, let's say.

The aquarium demo app

So I've got a simple app for this, if anyone's going to the Georgia Aquarium tonight. The first starting point would just be: we have this application, and we want people to see the things we have. And maybe you don't need persistent storage here, but at some stage you could do something innovative with this application. I could say, I want people to rate these animals for the first time ever. For this blue rock lizard, I might say it's not my type and thumbs-down it, or say I'm literally in love with this animal. So what happens to those interactions for the user who's making them, in this simple example?

Why this data is important to us, with persistence, is that at some stage we want a new person, or that same person, to come back to the application and be able to see what has historically been said about these different animals. So you don't waste your time tonight going to the wrong part of the aquarium with the boring fish and the boring mammals; you want to go to the right ones right away. And if we don't find some way to handle persistence, when these people come back, this is what they'll see: essentially, the data didn't survive those previous sessions.

And I come very much from a data science, data analysis background. I know many of you have crossed into this no-man's land of getting closer to data engineering or software engineering, and that's kind of how I've approached this as well. We typically follow the pattern of: does this work for me right now? If yes, we'll continue to do it. If not, we'll try to find something else.

But what we need to do, I think, to begin with, is reconsider those initial data sources we're dealing with. We probably want to break them into something that might be static. For the aquarium example, that was the CSV file, which was just a bunch of information about all the different animals you'll be able to see tonight. Maybe that doesn't change very often; we're not going to let users update it. All they need to do is query it and find information. And then we need the second side of this, which is the dynamic part of your data. These are the animal ratings we're creating through the application. And again, we must find a way for this data to persist if we want the application to grow and evolve along with those users.
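As a rough sketch of that static/dynamic split in Python (the inline CSV, the `load_static_animals` helper, and the `make_rating` record shape are all my own illustrations, not from the talk's actual app):

```python
import csv
import io
from datetime import datetime, timezone

# Static side: reference data loaded once at startup; users only query it.
# (This inline CSV stands in for a hypothetical animals.csv file.)
ANIMALS_CSV = """id,name,type
1,Blue Rock Lizard,reptile
2,Whale Shark,fish
"""

def load_static_animals(text: str) -> list:
    """Parse the read-only animal reference data."""
    return list(csv.DictReader(io.StringIO(text)))

# Dynamic side: records users create that must outlive the app session,
# so they need to be written to some persistent store.
def make_rating(animal_id: int, thumbs_up: bool) -> dict:
    """Build one rating record (thumbs up = +1, thumbs down = -1)."""
    return {
        "animal_id": animal_id,
        "rating": 1 if thumbs_up else -1,
        "rated_at": datetime.now(timezone.utc).isoformat(),
    }
```

Whatever backend you choose later, the dynamic records flow through something like `make_rating`, while the static table is only ever read.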

Persistent storage options

So again, I'm not an expert in any one of these specifically, but I've found ways to get things to work across this range, and my guess is that some of you have considered different ways to do this as well. The reason the app isn't working right is that every time it spins back up, that CSV file goes back to its original state. So we need to put the dynamic data somewhere it can be updated. I looked at Google Sheets, DuckDB in cloud storage, MotherDuck, and then Postgres for this specific application. I have a link at the end where you can see the code, if you're dabbling with how to do it, but there are definitely trade-offs that I'll try to outline to the best of my ability for these four options.

The first one: Google Sheets. There are a lot of pros to this, especially on the free side. Most of us have some kind of Google account, and we're familiar with what a Google spreadsheet looks like. The setup is relatively easy, and there are packages to support this in both Python and R. One really cool thing is the real-time visibility. When I was setting up the demo apps, I had the spreadsheet over here and my app here, and when I clicked that like button, the spreadsheet updated immediately. So that was fun, again, on my machine by myself. On the flip side, Google Sheets has limits on the number of rows going down and the columns going across. There's no real field validation that Google Sheets will do for you; you have to build that in yourself. And at some point, you have to worry about rate limits on the API if a bunch of people hit your application at the same time. But the paths for setting this up are well documented: there are dedicated libraries, you build a service account, you can do it interactively at first, pass in some environment variables, and make the connections for that storage to take place.
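On the Python side, that service-account flow might look something like this with the gspread library (a minimal sketch; the `GSHEETS_SERVICE_ACCOUNT` and `GSHEETS_SHEET_KEY` environment-variable names and the `rating_row` helper are my own choices, not from the talk's app):

```python
import json
import os

# Keeping the row construction pure makes it easy to test without a network.
def rating_row(animal_id: int, rating: int, rated_at: str) -> list:
    """Shape one rating as the row we would append to the sheet."""
    return [animal_id, rating, rated_at]

def append_rating_to_sheet(animal_id: int, rating: int, rated_at: str) -> None:
    """Append one rating to a Google Sheet via a service account.

    Assumes GSHEETS_SERVICE_ACCOUNT holds the service-account JSON and
    GSHEETS_SHEET_KEY holds the spreadsheet key (both names hypothetical).
    """
    import gspread  # third-party; imported lazily so the module loads without it

    creds = json.loads(os.environ["GSHEETS_SERVICE_ACCOUNT"])
    client = gspread.service_account_from_dict(creds)
    sheet = client.open_by_key(os.environ["GSHEETS_SHEET_KEY"]).sheet1
    sheet.append_row(rating_row(animal_id, rating, rated_at))
```

The same secrets can be set locally for development and then stored encrypted in the deployment platform, so the code path never changes.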

For the second one, initially I was thinking I could just put the CSV file up in the cloud as well. But for those of you who know DuckDB, you get a nice performance boost when you move from a CSV file to something like Parquet, even just having it up there, in terms of the number of people able to read from it concurrently and the types of operations you can run on it. And we're still talking about relatively small data here; the point isn't big data, it's finding a place that can be connected to. Now, while DuckDB, as we heard in Marcus's presentation, supports multiple concurrent reads, which is fantastic even for large datasets, it doesn't do so for multiple concurrent writes. And if you've messed with cloud storage like S3 before, it's not hard to set up, but the whole permissioning thing in AWS has probably caught you up before, or you put a new file up there and suddenly you can't read it. There's some complexity there as well. But you can work through it, and you'll see in the source code, if you want to dig in, that there are libraries that make this relatively easy to connect on both the Python and the R side. Really, it's just credentialing at this stage. And modern tools, including Connect Cloud, give you the ability to store these things in an encrypted way, so that when you deploy, the handshake between the two is quite easy.
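The credentialing-plus-query part can be sketched like this with DuckDB's httpfs extension (an illustration under my own assumptions: the bucket/key arguments, the `ratings` schema in the query, and reliance on standard `AWS_*` environment variables are all placeholders, not the talk's actual setup):

```python
import os

def s3_ratings_query(bucket: str, key: str) -> str:
    """Build a DuckDB query that reads a Parquet file straight out of S3."""
    return (
        "SELECT animal_id, AVG(rating) AS avg_rating "
        f"FROM read_parquet('s3://{bucket}/{key}') "
        "GROUP BY animal_id"
    )

def read_ratings_from_s3(bucket: str, key: str):
    """Query Parquet on S3 with DuckDB: many concurrent readers are fine,
    but concurrent writers are not, which is the limitation discussed above."""
    import duckdb  # third-party; imported lazily

    con = duckdb.connect()  # in-memory database
    con.execute("INSTALL httpfs; LOAD httpfs;")
    con.execute(f"SET s3_access_key_id='{os.environ['AWS_ACCESS_KEY_ID']}'")
    con.execute(f"SET s3_secret_access_key='{os.environ['AWS_SECRET_ACCESS_KEY']}'")
    return con.execute(s3_ratings_query(bucket, key)).fetchall()
```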

So the limitation there was write concurrency. And then I heard about MotherDuck. MotherDuck seemed innovative as well, especially because they have a very generous free plan, at least at the time I started looking at this early this year, where you can get in and open up one of these databases. And they have some kind of architecture, some plumbing, that means if multiple people are hitting the data I'm storing on MotherDuck at the same time, there's a queuing mechanism that helps with the problem of multiple people writing at once. So if everybody is giving thumbs up and thumbs down, it's more likely not to fail. It's not a perfect system, and it's probably not what it was designed for in the first place. But again, the question we often ask is: does it work? And it works.

The other potential risk here, unlike the other options: I can move my DuckDB or Parquet files from S3 to a different cloud system and keep my infrastructure pretty much the same. It's different with MotherDuck; it's one of the few services that supports the DuckDB process with this kind of setup. What's funny is that my first 10 years were in R, and then I got a little more into web development with Python. I couldn't get the MotherDuck handshake working with R over the last week. In part, it really needed that DuckDB binary for R on my local machine, and I gave up after waiting a little too long. I'm sure it can be done; I'm sure that's a me problem. But on the Python side, it was very fast. All we needed to do was pass through a token I got from MotherDuck, and this thing was up and running very quickly.
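That token handshake is roughly this in Python (a sketch; the `aquarium` database name and `ratings` table are my own placeholders, and I'm assuming the token lives in a `MOTHERDUCK_TOKEN` environment variable):

```python
import os

def motherduck_uri(database: str) -> str:
    """Build a MotherDuck connection URI; the token can ride along in the
    URI, or MotherDuck will pick it up from MOTHERDUCK_TOKEN on its own."""
    token = os.environ.get("MOTHERDUCK_TOKEN", "")
    return f"md:{database}?motherduck_token={token}" if token else f"md:{database}"

def record_rating_motherduck(animal_id: int, rating: int) -> None:
    """Insert one rating into a MotherDuck-hosted DuckDB database."""
    import duckdb  # third-party; imported lazily

    con = duckdb.connect(motherduck_uri("aquarium"))
    con.execute(
        "INSERT INTO ratings (animal_id, rating) VALUES (?, ?)",
        [animal_id, rating],
    )
```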

And then the last one. I think part of the fear people have when they get to needing persistent storage is that these things sound very much outside our natural realm. When I was first learning data analysis, nobody was teaching me what a database really was in terms of how to build and maintain it yourself. But there are so many services out there now, and thinking about it in the context of the application I put together, this really is what a database is designed for. First off, the cost is not as big as you might think. For the demo I put up, Render gives you a month for free to test it out, and it was $6 for the really small version after that. That sounded pretty good. On other services like AWS, it's more like $15 a month for a capped, managed database service. But what really helps for the application I was working on was multiple writes. These systems are designed to have many people hitting them at one time without any data loss or potential corruption.

As for trade-offs on the con side: there is some operational overhead. What's happening under the hood is definitely more complex, especially for non-engineers. But practically, setting up the one I did for this demo app, you go into the UI, you press a button that says create this thing, and you get a bunch of secrets you can pass into services. The hardest part for someone who's not comfortable in database land is probably the initial data population: getting the table you need up and running in sync with your application. But it's not as bad as you might think, especially if the projects you're working on don't need the utmost security. There are options out there for you to power applications with this. And again, these are well established with libraries, and then it's just a matter of passing in username, password, port, and whatever else the database you go with depends on.
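Initial table population plus a single insert might look like this with psycopg2 (a sketch under my own assumptions: the `ratings` schema is a placeholder, and I'm assuming the platform hands you a `DATABASE_URL` secret, as Render and similar managed services typically do):

```python
import os

# One-time schema; CHECK keeps ratings to thumbs up (+1) / thumbs down (-1).
RATINGS_DDL = """
CREATE TABLE IF NOT EXISTS ratings (
    id        SERIAL PRIMARY KEY,
    animal_id INTEGER NOT NULL,
    rating    INTEGER NOT NULL CHECK (rating IN (-1, 1)),
    rated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
"""

INSERT_RATING = "INSERT INTO ratings (animal_id, rating) VALUES (%s, %s)"

def record_rating_postgres(animal_id: int, rating: int) -> None:
    """Insert one rating; Postgres handles many concurrent writers safely,
    which is the property the other options struggled with."""
    import psycopg2  # third-party; imported lazily

    with psycopg2.connect(os.environ["DATABASE_URL"]) as conn:
        with conn.cursor() as cur:
            cur.execute(RATINGS_DDL)
            cur.execute(INSERT_RATING, (animal_id, rating))
```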


Putting it all together

So, for all of these specific tests, we found a way to get that application to v1.1. We still have the Shiny application. We still have a CSV file with all the information about the animals at the aquarium. We still put our code on GitHub. Now we've just implemented one of these storage options. You don't have to do all of them; any one of them will work. So it gets back to the question of: what is your use case, and what do you need out of setting something like this up?

So now, when I deploy something like this to Connect Cloud, we use secret environment variable management, which is fully encrypted. Once set, we can't see the values again, and neither can you; you need to delete and re-create them if you want to update them. But now this connects to any of the services. This was the example for the Postgres database, which allowed the handshake between the two services to get up and running. And now the first user comes here, does things in the app, and can see the results: you go in the first time, and the user interacting in that first session is actually able to see the ratings calculate in real time, which gives them immediate feedback. And after the app goes to sleep and comes back, the application pulls the data it needs from the persistent storage, from the Postgres database or Google Sheets or one of the other options, to keep that persistence there. And now we've got to the point where, ah, it was working locally for me, and now it works in the cloud as well, and I can feel happy with this application and hopefully never have to update it again and just leave it there forever with no pain.
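A small pattern that keeps local and cloud runs identical is failing loudly when a secret is missing (a generic sketch, not Connect Cloud-specific code; the variable name in the usage line is illustrative):

```python
import os

def require_env(name: str) -> str:
    """Read a required secret from the environment, failing loudly if absent.

    On a platform like Connect Cloud the value comes from the encrypted
    secret variables set at deploy time; locally, you export the same name
    in your shell. The application code never changes between the two.
    """
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

Usage would be a single line at startup, e.g. `db_url = require_env("DATABASE_URL")`, so a misconfigured deployment fails immediately instead of mid-session.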

So, in terms of thinking through these four options, it's going to depend a lot on your use case. If it's very analytics-heavy, we heard earlier about the benefits of DuckDB and Parquet files, the efficiency you can get there, and the concurrency for people reading. If that's all you need, I think the cost of throwing that data into S3 would be fantastic: you can query big things and power the application, even if you didn't need persistent storage. That could work well. But when you get to the point where you need a lot of write concurrency, that changes the decision.

I think one question we probably need to ask ourselves, as we think about where we fall on that spectrum between the data science world and the software engineering world, is: what is our app trying to do? Think of this app we put together; I'm sure some of you have gotten to this point as well. The first apps I made were: connect data, show a chart, add filters. You're building dashboards. But over time, these lines are really blurring in terms of what can be created and how people interact with it. This was still a data-driven app; we had a CSV file with a bunch of information that could be filtered down and viewed. But we really got into web development territory after that: thinking of a bunch of users coming through and being able to give feedback that we want to capture. So for this one, I'd spend six bucks a month and go with the Postgres database.

Q&A

So I know that was a fair amount. I really appreciate you coming to conf and you coming to this session. I think we probably have a few minutes for questions, if there are any.

We do have some time. So first question. Do you have advice for how to handle database migration? For example, what happens when you realize you need a new field or table? Are some of these solutions better or worse in this situation?

Yeah. And again, I'm not an expert in any of this. When I first got into using databases, so I mentioned I used R for about 10 years, then I got into web development with Django and Python. And Django made that very easy; it was fantastic for updating Postgres databases and adding fields. It wasn't like you had to hand-write the migrations; it was kind of taken care of for you under the hood. So I'm definitely not the right person to advise on physically migrating these. My naive approach would be, depending on the project size, to dump records from the old system and add them to the new one. That's what comes to mind as the way I'd probably approach it.

Are there any built-in options in Posit Connect?

At the moment, what we've built in is the secret environment variable management to go out to these external services. We have thought about what kind of integrations we could do to spin up something next to an application you put out there. One thing that is nice, from a software engineering point of view (I don't know if this is the right way), about using external storage from the start is that I can use it locally as well, and there's no hiccup going between what might be supported in each place. So if we were to build something like that on Connect Cloud, we'd want it to be almost standalone from the application itself, so it could be set up in advance and you could attach various applications to it. It's something we're open to, though. If anyone wants to talk about that with me, please track me down after the session.

Okay. One last question. How do you handle writes to DuckDB in S3?

How do I handle writes to DuckDB? From what I understand, and this was me testing it with just two or three browsers open, it happens very quickly. You go out there and you're writing something, and while something is being written, I think the file locks. If another request comes in, it's going to say: right now this is locked. My guess would be that unless you built that kind of error handling into your application specifically, you probably couldn't handle it. So that was my understanding. And MotherDuck takes care of multiple writes by parsing them out into a queue and then applying them at some stage. All right. Thank you, Alex. Let's thank Alex again.
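The error handling alluded to in the answer could be as simple as a retry wrapper with backoff around the write (a generic sketch of the idea, not code from the demo app; in practice you would catch the specific lock exception your DuckDB driver raises rather than bare `Exception`):

```python
import random
import time

def retry_write(write_fn, attempts: int = 5, base_delay: float = 0.1):
    """Call write_fn, retrying with jittered exponential backoff on failure.

    A rough sketch of handling the case where another session currently
    holds the write lock on a DuckDB file: wait a moment and try again,
    re-raising only after the final attempt fails.
    """
    for attempt in range(attempts):
        try:
            return write_fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error to the app
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

This papers over brief contention but not sustained concurrent writing, which is why the talk reaches for MotherDuck's queue or Postgres when many simultaneous writers are expected.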