The Power of Snowflake and Posit Workbench: Macroeconomic Data Exploration in the Cloud

In this live event, we will utilize the Posit Workbench Native App to demonstrate that macroeconomic research can be run in the Snowflake cloud but powered by R and RStudio. Starting with data sourced from the Snowflake marketplace, we will import, transform, visualize, and, finally, model data using the Orbital framework to push tidymodels down to the cloud. This is full-stack, R-driven macroeconomic research in the cloud.

Add the event to your calendar: https://evt.to/eugmedshw

Learn more about the Snowflake and Posit partnership: https://posit.co/use-cases/snowflake/

Jan 15, 2025
1h 0min

Transcript

This transcript was generated automatically and may contain errors.

All right, well, welcome, everyone, we're excited to have all of you here in attendance today. I'm James Blair, a senior product manager at Posit over Cloud Integrations, and I'm joined today by Jonathan Regenstein from the Snowflake side. Jonathan, if you want to introduce yourself.

Sure. Thanks, James. Hey, everyone. Yeah, I'm Jonathan Regenstein. I head up the financial services, machine learning, and AI practice here at Snowflake. Actually, formerly of Posit, formerly of RStudio, many, many years ago. So really happy to be talking to you all about how these two amazing data science platforms come together.

Perfect. Thanks. We're excited to have Jonathan. It's good to have him back. Like you said, he's an old friend of the company, and we're glad to be working with him again on this joint effort with Snowflake. I'm going to walk through just some introductory things here and kind of set the stage for what our Snowflake partnership means to us today. Some context around what is available and kind of where we see things headed. And then we'll turn the time over to Jonathan to walk through an example use case and demonstration of some of the things that I highlight as we go through a few slides here at the beginning.

About Posit

Just to give a little bit of context, many of you may be familiar with this. Some of you may not. So just to give some context of who Posit is, how we operate, we are a company at our core dedicated to this vision of supporting open source data science, scientific research and technical communication. And the way that we've built our business and company reflects that ideology. We want and believe in open source both now and in the future. And so we have a core commitment to that concept of supporting the open source ecosystem of packages, tools, products and things of that nature.

The way that we do that, and the way this all kind of comes together, goes back to the very beginning of the company, which, as Jonathan kind of alluded to, was started under the name RStudio. We've since rebranded and renamed the company to Posit over the past couple of years. We still produce the RStudio development environment. That's still a part of our product portfolio. But from the very beginning, we've contributed heavily to the open source space. Many of you will recognize open source R packages that we contribute to, like tidyverse or tidymodels, many of the packages that are commonly used in kind of data science workflows. And then over the past several years, we've been continually investing in the Python ecosystem as well, contributing to and creating a number of packages and tools that are available in that ecosystem.

As adoption grows for these open source tools, one of the things that we've observed is that many individuals bring those tools and those open source frameworks into their organization and start using them to solve business critical problems. And with that kind of transition, there comes an additional level of sometimes scrutiny, but also requirement for how those open source tools are managed, maintained, delivered so that they meet enterprise requirements. As a result of that, we have a number of commercial or professional products that we provide that exist to support the open source data science ecosystem within the enterprise.

So our commercial products focus on things like security and scalability and auditability, things that most organizations care about. Individual practitioners or users or students or hobbyists may not think as deeply about those problems. And so they benefit from the open source side of things. And then once those open source tools get adopted within large scale organizations, we provide commercial tools to support that sort of transition and make sure that those open source tools can continue to solve business critical problems over time. And then we take the investment that comes in from these commercial tools and reinvest it back into the open source ecosystem, a sort of framework that we've grown to call the virtuous cycle on the Posit side.

To give a little bit of additional context here, the collection of professional tools that we provide, we refer to as Posit Team, and it's primarily made up of three distinct pieces. There's Posit Workbench, which is where developers, data scientists, actuaries, statisticians will spend the bulk of their time. It gives developers access to tools like RStudio, VS Code, the new development environment that we've been creating called Positron, Jupyter Notebook, and JupyterLab. All these options are available so that you can use the tool that you're most comfortable with, and then, using the collection of open source packages on the R or the Python side, or maybe mixing both together, you work with and analyze data. Posit Workbench gives you all the tools to do that within an enterprise organization.

On the complementary side of that, we also offer a product called Posit Connect that allows the developers or data scientists, whoever they might be, to share the results of their analyses with a broader audience. Those results can be something as straightforward as a static plot that was created and we want to share that and we want to do it in a secure way, or it can be something much more dynamic like a dashboard or a web application or even an API that other development teams within the organization are going to use to bring some of the data science work into either their product or their tech stack. Posit Connect provides this bridge between developers and the work that they're doing and end users, business users, being able to interpret, understand, and view those results securely and in real time.

And then as a third part of this Posit Team collection, we have a tool called Posit Package Manager that allows organizations to control at the repository level what R and Python packages they make available. It has auditability functionality. It provides some security functionality as well for organizations and gives you the option to distribute your own internal packages and make those easily accessible to end users.

The Snowflake and Posit partnership

Now, the reason that I bring this up is because this concept of open source and commercial, the way in which they blend together and work together, is really very evident in the work that we do with Snowflake. So if we take just a moment, and I'll talk to this very lightly, and Jonathan may touch on it a little bit more deeply, right? But Snowflake as a platform really tries to eliminate data silos, make it as simple and as easy as possible to access data, data artifacts, things of that nature under one single access point. Snowflake provides this gateway. There's security and access controls that Snowflake makes available. And then users can come in and work with the data available to them in Snowflake. And the tagline there is that it just works.

And our goal has been to make it so that Posit tools, and as an extension, users of those tools can operate in a similar way, that it all just works. That if I'm a developer inside of Posit Workbench, and I want to work with something inside of Snowflake, it should be able to be as easy as possible for me to do that without introducing any unnecessary friction. Over the past year or so, we've actually made a tremendous amount of progress in that direction by investing in a couple of different key areas. And so I want to highlight what those are, kind of talk about what's possible today.

Posit Workbench native app on Snowflake

One of the things that we're very excited about is that we've worked closely with Snowflake over the past several months, and have now delivered a solution that enables Posit customers to run and access Posit Workbench directly within their Snowflake environment through what Snowflake calls Snowpark Container Services. The idea is, as a subscribing customer inside of Snowflake, I can come to the Snowflake marketplace, I can subscribe to Posit Workbench there. If I'm a Posit customer, I can apply my existing Posit license to that service, and then it will spin up and create a running version of Posit Workbench that's fully within the confines of my Snowflake environment. Users authenticate into it using their Snowflake credentials, and then they can query and interact with other Snowflake resources directly from there, and everything operates within the context of Snowflake itself.

Viewer-level data permissions with Posit Connect

On the complementary side, we've introduced a few distinct new features inside of Posit Connect, one of which, similar to what we see in Posit Workbench, is the ability to configure OAuth authentication into Snowflake. What this means on the Posit Connect side is that if I'm a developer and I'm building some sort of a dashboard, and this dashboard is built on top of some data inside of Snowflake, and then I want to share this dashboard with my company or my department or whoever it might be, a common thing that we've seen is that in many cases, there are very granular permissions defined on that data.

Historically, this has been a difficult problem to solve when it comes to building business critical or business analytics type applications. Because historically, we've relied on things like service accounts to give us a shared access into the underlying data. But then how do we take that service account and create an even narrower scope for the specific user who happens to be viewing the application at a specific point in time? And I've seen all kinds of creative and clever solutions that we feel good about, but security doesn't.

The things that we've introduced, the changes that we've introduced to Posit Connect make this a much easier path to creating applications that provide true viewer level permissions for the underlying data. So in practice, what this means is a developer can create an interactive dashboard application, whatever it might be, share that with a large audience on Posit Connect. And when they publish that application to Posit Connect, they can specify this application requires Snowflake access. When a user visits that application, let's say it's me. I visit the application, I log in, here's what I see. I see a dashboard that shows me all the data of a given table inside of Snowflake.

Now, if I share this with somebody else, let's say that this other person was responsible for and only had access to information about the West region represented in this data. Well, if I share this with them, they'll visit the link. And the first thing they'll see is a prompt to tell them they need to first authenticate or log in to Snowflake before they can view the dashboard. So they'll go through that process. However, Snowflake authentication has been configured, they'll log in. And then when they view the dashboard, they only see data from the West region.

The benefit here is that I, as the application developer, didn't have to do this logic myself. I don't have to go through and write a bunch of code that says, if the user is in group B, show only this data; if the user is in group C, only show this data. I've seen that pattern. It works, but it's brittle, and it's full of security issues and challenges. Here, we're relying on Snowflake to manage all the data governance and access control. Posit Connect just forwards on the identity of the logged-in user, and Snowflake uses that as it returns the results of the queries that are being executed. So the only data coming back from Snowflake is data that this specific user has access to.
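
To make that concrete, here is a rough sketch of what a viewer-scoped query can look like in content hosted on Posit Connect. How the viewer's OAuth access token is obtained depends on Connect's Snowflake OAuth integration and your Connect version (check the Connect documentation); the `viewer_token` value, account details, and table name below are all placeholders, not the configuration described in the talk.

```r
# A rough sketch of a viewer-scoped Snowflake connection inside content hosted
# on Posit Connect. The viewer's OAuth access token comes from Connect's
# Snowflake integration (see the Connect documentation); `viewer_token` below
# is a stand-in for that step.
library(DBI)
library(odbc)

viewer_token <- "<access token for the logged-in viewer, supplied by Connect>"

con <- dbConnect(
  odbc(),
  driver        = "Snowflake",
  server        = "<account>.snowflakecomputing.com",
  warehouse     = "<warehouse>",
  database      = "<database>",
  authenticator = "oauth",   # pass the viewer's identity through to Snowflake
  token         = viewer_token
)

# Snowflake evaluates this query under the viewer's own roles and row-level
# policies, so different viewers of the same dashboard can see different rows.
regional_sales <- dbGetQuery(con, "SELECT * FROM MY_DB.PUBLIC.SALES LIMIT 100")
dbDisconnect(con)
```

Because the query runs under the viewer's own Snowflake roles, the same dashboard code returns different rows for different viewers without any group-by-group logic in the app.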

I could have 100 different users, all logged in, viewing this dashboard at the same time, all of them with different levels of permission to the data in Snowflake, and they would all be seeing different views of the data, because what each of them sees is based on their permissioning within Snowflake.

The Orbital package

The last thing that I want to touch on just briefly, and then I'll turn the time over to Jonathan, is a little bit more of a recent development, but there's a new open source package that we've worked on here at Posit called Orbital that we've been really excited about, particularly as it relates to what this makes possible inside of Snowflake. I'll give a high level overview and then turn the time over to Jonathan to cover this a little bit more in depth. But the concept of Orbital is to say, I've taken some data, I've created a local extract of it, and I've trained some sort of machine learning or statistical model on that data.

If you've used the tidymodels framework or have read about it or are familiar with it to some extent, one of the advantages that that framework offers is it gives you the ability to not only train and tune a model itself, but to also embed within that model definition any pre-processing that takes place prior to the model fit. So if I normalized values, if I imputed missing values, if I did anything clever in my pre-processing to prepare the data to generate the best possible model, all of that can be captured in this tidymodels object that I end up with.

What Orbital does is take this trained R model object that we have and convert that object into native SQL. That means that in the context of Snowflake, I can take this model definition, which typically would require access to an R runtime in order to run and generate new predictions, and instead turn that definition into a SQL statement that I can then store within Snowflake, as a view perhaps, or a stored procedure. Now all of a sudden my model execution and prediction is not happening in an R session somewhere; it's happening in a Snowflake warehouse, and it's happening lightning fast because it's expressed in Snowflake's native dialect.
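
As a minimal sketch of that conversion (the workflow object `fitted_wf`, the connection `con`, and the object names are placeholders, and orbital's helpers should be checked against the package documentation):

```r
# Convert a fitted tidymodels workflow into SQL that Snowflake can execute.
# `fitted_wf` stands in for a workflow already fit() locally; `con` is an open
# DBI connection to Snowflake.
library(orbital)

ob <- orbital(fitted_wf)   # preprocessing steps + model captured as column expressions
orbital_sql(ob, con)       # render those expressions as SQL for this connection,
                           # e.g. to wrap in a CREATE VIEW statement inside Snowflake
```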

We've seen dramatic results with a number of different customer accounts where model prediction pipelines that previously took hours or days now take seconds or minutes, and it's been a dramatic time savings because we're able to take advantage of the full compute capability of what Snowflake warehouses have to offer.

Demo: Snowflake marketplace and Posit Workbench native app

I'm going to share my screen and everyone, if you can't see the screen I'm about to share, let me know and then we can kind of get into it.

Okay, can everyone see what looks like a Snowflake marketplace right now?

Yes, okay, awesome, great, okay, all right, thank you, James. So yeah, we're going to go through really just an example of kind of what James was just talking through.

Before I get over to the example of running Posit on Snowflake, though, I actually thought I'd spend a few minutes just showing y'all what Snowflake looks like, because it occurred to me that probably a lot of the people on this call are heavy Posit users, probably Posit native people. You might not have seen Snowflake or really know much about it, so I'm just going to spend a few minutes on it and then we'll get over into the power that comes from being able to run Posit and RStudio natively inside of Snowflake.

So in my role at Snowflake, I really focus on financial services and machine learning, data science, and increasingly the artificial intelligence side of that world. So the thing I am showing on the screen right now is what's called our Snowflake marketplace, and this is frankly a really big reason for why a lot of financial services enterprises gravitate over towards Snowflake. This is actually one of the reasons I first started using Snowflake when I was building a data science team. This marketplace has a growing and growing number of data sets that are very valuable to financial institutions.

So I'm going to talk about almost everything from the lens of a financial institution, but this applies pretty broadly to enterprises that don't just need data fast, but they need a secure, reliable way to get their hands on that data and that's kind of what the marketplace provides.

So I'm just showing a quick example here. I went up to the search bar, I typed in financial essentials. We're going to be working with this first one, this kind of free finance and economics data, but you can see, just from typing in financial essentials, there are a lot of different interesting data sources on here. Every time I come here, I find new things that look really interesting to me.

This is a free data set. So when I click in on a free data set like this one, it takes a few seconds to spin up. So I just kind of already have it going here. I've already grabbed this data and put it into my Snowflake account, so we don't have to go through that. But normally instead of saying open over here on the right, this would say get. So I would click get and then lo and behold, over in my Snowflake account, this data set would really almost automagically appear there.

You can imagine for financial institutions that have hundreds and hundreds, sometimes thousands of data sources that they're getting data from, to be able to have it all delivered in a centralized and, really importantly, secure way, it's a really big deal to these enterprises. So when I say secure, what I mean is when a big bank or a big asset manager, even if they want to just sample a new data source, their security team is going to say, well, we can't just let them give us a link. We can't just hit their API. They can't even just email us a file because we don't want you to open up some foreign file that could have some malicious code in it. If it's being delivered via the Snowflake marketplace, that's already a blessed vehicle of delivery, and that really, really streamlines things.

And it just gets updated, again, really automatically. You don't have to do anything to update the data because it's actually being done by the provider. So that's just kind of a quick look at why Snowflake is very prominent in the world of financial services. A lot of it starts with, we want this new data. That then leads to, okay, let's look at all the amazing kind of tooling and warehousing and now increasingly machine learning and data science work I can do on that data inside my secure environment. So data is kind of level one.

Level two, we're going to be working with what we call a native app today. So that's what Posit Workbench is. James mentioned that what makes this possible is what we call Snowpark Container Services. Snowpark Container Services is kind of exactly what it sounds like: it's our containerization feature. If you're familiar with Docker, you can just imagine it as a way to dockerize anything you want. It doesn't have to be RStudio. It could just be a big C++ code base and you want to dockerize it, put it up into the cloud and start running that container.

The Native Apps framework powers that. So in this case, the Posit engineering team has done all of the hard work of putting this into a Docker container and making it very easy to kind of pull down from the marketplace. So that's what we're looking at here. This is what the marketplace for these apps looks like. We've pulled down Posit Workbench from the marketplace. I've put it into my account. And if I were to click on this, it would kind of spin up into this little homepage right here.

So I kind of start off in the marketplace. I pull in the data that I want to pull in. And then I kind of toggle over to this native apps tab here. So it's like data, data products, then over to apps where I have this Posit Workbench, right? There's a ton of native SQL work you can do inside of Snowflake, of course, and that's where a lot of the data engineering work happens. But for our purposes today, we're going to kind of go from the marketplace over to this native app. And then once I click on that native app, I get this screen here. So if you've used Posit Workbench before, hopefully this looks really familiar to you. What's going on here, though, is that this is actually running natively inside my Snowflake account.

So when I click on projects, if I wanted to spin up a new project, I could do that from here. I already have this one up and running. And this is now what my RStudio environment looks like. So I've kind of progressed from the marketplace to where I can pull in data through native apps. Now I've got Workbench up and running. And hopefully this looks, again, really familiar to anyone who's an RStudio user. I could have clicked on VS Code or I could have clicked on JupyterLab. But for our purposes today, we want to look at RStudio.

This is also a really big deal because you can already, of course, run SQL natively on Snowflake. There's actually a lot of Python work you can do on Snowflake already, too. Running R natively on Snowflake would otherwise require setting up your own Docker container right now, and it actually turns out to be a pretty heavy lift to do that. So this has been extremely, extremely valuable to our customers who want to run models, do data science natively in R on Snowflake.

Exploring macroeconomic data in RStudio on Snowflake

So now we're kind of set up. We've spun up the RStudio IDE here in Snowflake. And now we have a few cool things available to us. So over here on the right-hand side, I'll just kind of click around a little bit. What we're doing here is we're peeking into my Snowflake account right now. So for all intents and purposes, I don't really need to be going back to Snowsight. Snowsight is kind of what we call the user interface of Snowflake for people who want to glance at their data. I don't really need to do that anymore because once I loaded up my libraries and established this connection back to Snowflake, I can peek into all my different objects. James showed you all a really easy way to do this, but I'm still doing it the old-fashioned manual way of just entering all my credentials.
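
For reference, the "old-fashioned" manual connection he's describing looks roughly like this; the account, user, and warehouse values are placeholders, and newer releases of the odbc package also ship Snowflake-specific helpers that can pick up Workbench-managed credentials, so check the documentation for your setup.

```r
# Connect to Snowflake from R by handing credentials to the Snowflake ODBC driver.
# Everything in angle brackets is a placeholder for your own account.
library(DBI)
library(odbc)

con <- dbConnect(
  odbc(),
  driver    = "Snowflake",
  server    = "<account>.snowflakecomputing.com",
  uid       = "<user>",
  pwd       = Sys.getenv("SNOWFLAKE_PASSWORD"),
  database  = "JR_DB",
  schema    = "PUBLIC",
  warehouse = "<warehouse>"
)

dbListTables(con)   # the same objects that show up in the IDE's Connections pane
```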

So what are we going to be doing today? Let's kind of just go through a little bit of an example, and hopefully it'll help prove out the power of what we're doing here. So I'm going to kind of eventually get to that Orbital workflow that James was talking about. I'm going to kind of build off of the blog that Posit published on running tidymodels using Orbital natively in Snowflake, but I'm also going to bring some macroeconomic data to be part of that workflow. A really simple look at macroeconomic data, just a couple of time series, but I think you'll kind of get the idea of how starting over in the marketplace, we can just build and build and build these different data estates, which is really how we get more value out of our models.

Two external resources I'll point out to you guys at the end, but I'll just show them to you now. One, this is the blog that kind of walks through in detail how to use Orbital with Snowflake. We're going to see how that works in just a second. Two, this book here: a lot of the macro work that I'm doing is contained in this book that I wrote with Professor Chavar from Georgia Tech, a guide to exploring macroeconomic data with R. Nothing too fancy in here. We're just going to be exploring data, visualizing the data, but then I'm going to add that data to this blog post, basically to kind of show the power of what happens when you have your data and your workbench and all your compute in one place.

The data we're going to be working with is all contained in this. I have a database called JR_DB. So this is my database within Snowflake. And then all the data sets I'm working with are in our schema called public. Okay, so within Snowflake, the vernacular or the vocabulary we use is we have a database, and then we have a schema. And then inside of that schema, we have different tables, different data tables. If you hear someone refer to a warehouse in Snowflake (I think James referred to a warehouse), a warehouse in Snowflake is actually your compute engine. So that's where you're kind of spinning up CPUs to run your work.

The public schema is kind of holding all the data that we have. Within that schema, there is a table called TREZ_DATA. That came from the marketplace. So that place where I was toggling over, looking at the marketplace, that free data, that was financial essentials. Within that free economic data, there are treasury yields, right? So basically, I didn't want you guys to have to sit here while I did this, but I grabbed some treasury yield data from that free marketplace, and I plunked it over here into the database that I wanted to use during this session, right?

One thing that's really important about that, just to reiterate, is that even though this is free and public data, the data is never actually going outside of Snowflake. So once it's been shared on that marketplace by the data provider, everything is staying within this wall, so to speak. So that's important for security reasons. No kind of malicious code is being introduced into this data. It's not going to walk away on someone's laptop because it's all staying on Snowflake. There's full auditability of this, right? So if someone were to change this underlying data that I'm calling TREZ_DATA, which is very simple data, but if it were complex and we were to make big, big changes to it, the full lineage of what happens to that data from the time it's imported to the time it even gets put into a model, all of that is tracked in Snowflake. So you can always kind of backtrack to the lineage of your data.

The second you kind of go local, I guess, so to speak, if you were to just kind of plunk this onto your desktop, I mean, nothing wrong with that. I do a ton of work on my desktop when I'm kind of writing and working on projects, but from an enterprise perspective, that can be a big no-no, breaking that lineage.

In any event, we've got this Treasury data. Here's a little snapshot of it. Again, it's nothing fancy. These are the yields, the interest rates, on Treasury instruments going back to 2005. The reason I only went back to 2005 is that the data that we want to be modeling eventually is from that blog post. It's really only from the last few years, so we don't need to go back too far in time, but we could have gone back all the way to the 60s if we'd wanted to, we just didn't need to for this particular use case. So you can see it comes in on a daily basis.

The data we want to ultimately be modeling at the end of the day, which is some sample data from LendingClub, is monthly. So we're going to convert that daily data to monthly data. I'll just show you guys. This isn't really anything to do with Snowflake, honestly; this is just how I like to work with time series inside Posit. So even though this isn't really a Snowflake capability, the fact that I can just run this natively on Snowflake is still a really big deal to me, because this is kind of my most comfortable coding environment. I don't have to relearn any new grammar. I can stay completely on Snowflake, but use, again, my tools of choice for working with time series.

So for converting daily time series into monthly time series, I use this really nifty function called summarise_by_time(). It's in the timetk package, and I can say, well, I want to summarize yields, and I want to get a monthly treasury yield. There actually is no such thing as a monthly treasury yield. There are only daily yields, even intraday yields, because it's a constantly fluctuating time series. So when we or anyone talks about a monthly yield, you have to kind of figure out, well, what do you mean by that? I'm just going to take the average yield.

So we can use, again, summarize by time, and that's going to give us our monthly treasury yields for two years and 10 years. So again, what I'm ultimately wanting to do is take this, join it up with the lending club data that we ultimately want to model, which is on a monthly basis as well.
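
A sketch of that daily-to-monthly conversion, assuming the treasury table has been collected into a local data frame with date, maturity, and value columns (those names are assumptions about the marketplace data, not the exact demo code):

```r
# Convert daily treasury yields to monthly averages with timetk::summarise_by_time().
library(dplyr)
library(timetk)

treasury_daily <- tbl(con, "TREZ_DATA") |>       # the marketplace yields, pulled locally
  collect()

treasury_monthly <- treasury_daily |>
  group_by(maturity) |>                          # e.g. the 2-year and 10-year series
  summarise_by_time(
    .date_var = date,
    .by       = "month",
    yield     = mean(value, na.rm = TRUE)        # "monthly yield" = average of daily yields
  ) |>
  ungroup()
```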

So one of my favorite things about R and probably a lot of other people's is the ability to use ggplot to visualize data. So I just took that treasury data, plunked it into a ggplot workflow, just more of a sanity check than anything else. Make sure there's no holes, make sure there's no gaps, make sure there's no huge, huge spikes, which might indicate a data issue. You can see there really aren't any huge spikes here. Normally, it would probably alarm us to see that the two-year, the green line, has made its way above the red line for most of the last few years. That's typically considered a bad thing. It's called an inverted yield curve. People sound the alarm about that a lot. They think it means a recession is coming, but alas, no recession has come.

So we have our data. We've converted it to monthly. Again, this is all running in the Snowflake cloud, even though I'm just coding this up in R, using all the tools that I like to use in R. At this point, though, we've just kind of imported the data, turned it over to monthly. If I really want to use this for modeling, there's a few things that I would typically do. So first, I want to lag this data a little bit. If I want to see if treasury data is going to help me in some way model consumer lending data, it's probably not going to be a coterminous relationship where treasury yields today are somehow affecting consumer lending on that exact day. Typically, we would expect there to be some sort of a lag.

So I'm going to take these two time series and lag them. I'm going to break them up into different columns just so it's easier to see. So we'll run a pivot wider on that. And then we're going to do something that I call lagging at scale. I'm going to take advantage of this mutate across vocabulary, which I find super, super useful. And I'm going to create a one-month, two-month, and three-month lag of both our 10-year and two-year treasuries. And we could keep going. We could make a six-month lag, a nine-month lag, an eight, seven-month lag, however many lags we wanted to create. And we can do it across every single column that contains the word year. So again, this is very, very typical when we're getting into engineering features around different macroeconomic data. We want to create a lot of lags.
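
In code, that lagging-at-scale pattern looks roughly like this (the maturity labels and column names carry over from the sketch above and are assumptions):

```r
# Pivot the two series into their own columns, then build 1-, 2-, and 3-month lags
# of every column whose name contains "year".
library(dplyr)
library(tidyr)

treasury_wide <- treasury_monthly |>
  mutate(maturity = if_else(maturity == "2 year", "two_year", "ten_year")) |>
  pivot_wider(names_from = maturity, values_from = yield) |>
  arrange(date)

treasury_lags <- treasury_wide |>
  mutate(across(
    contains("year"),
    list(lag_1 = ~ lag(.x, 1), lag_2 = ~ lag(.x, 2), lag_3 = ~ lag(.x, 3)),
    .names = "{.col}_{.fn}"
  ))
```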

I'm going to do the same thing with the rolling means because I want to create a whole bunch of rolling means. I'm going to use mutate across again. And now I feel like I've got my data into how maybe I want to actually use it for modeling. So I'm going to save it as an object called treasury monthly with lags and rolling mean for DB. So very, very verbose name, but I wanted to make it verbose so that we could kind of keep track of what's going on. And I'm going to take that data, visualize it real quick. I'm going to save it back to Snowflake. So this is a really important step because now I'm actually writing data back to my Snowflake environment. That means that myself and my fellow R coders can always access this data now, but it also means that other modeling teams can use this data too.
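
And the same across() idea for the rolling means, followed by writing the result back to Snowflake. The three-month window, the slider-based rolling mean, and the destination table name are choices made for this sketch, not the exact demo code:

```r
# Add trailing 3-month rolling means, then persist the result as a Snowflake table
# so other teams can reuse it.
library(dplyr)
library(slider)
library(DBI)

treasury_monthly_with_lags_and_rolling_mean_for_db <- treasury_lags |>
  mutate(across(
    contains("year") & !contains("lag"),
    ~ slide_dbl(.x, mean, .before = 2, .complete = TRUE),   # trailing 3-month mean
    .names = "{.col}_roll_3m"
  ))

dbWriteTable(
  con,
  "TREASURY_MONTHLY_WITH_LAGS_AND_ROLLING_MEAN",
  treasury_monthly_with_lags_and_rolling_mean_for_db,
  overwrite = TRUE
)
```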

And this is a really, really important point about Posit Workbench, which is that as more and more companies are kind of migrating to Snowflake and different cloud providers, if our teams don't have a way to run natively there, then it's hard to actually share their work. You can always kind of write it back up to a table, but again, that will involve going outside of the cloud environment and bringing data back into the cloud environment. So you're writing stuff back, but where's it being written from? Once you can work here natively in Snowflake, you can then write your data back however you want.

Modeling LendingClub data with tidymodels and Orbital

We're going to model this LendingClub data, right? And we're going to try to basically model or predict the interest rate that is charged on all these different loans. So I'm going to stick to the LendingClub data from just the year 2015. That's 421,000 rows. So nothing too big. I think if we did everything, we'd be above 2 million, but I just chose 2015. So I'm going to filter down to that year. I'm just going to grab a couple of columns, not everything: bc_util and bc_open_to_buy, plus the interest rate. I'm going to grab the issue month and issue year. And then once we collect this into our environment, we're going to mutate that into an actual date. Once we mutate it into a date, that'll let us join up our treasury data, right?

This is the visualization of what it looks like. Pretty hard to see a trend here from a time perspective. I guess the spreads, we could say, are widening a little bit as we go through 2015. The interest rates are absolutely out of control. As you can also see, some of these are above 15%, which is absolutely nuts. But be that as it may, we're going to take that LendingClub sample data and join up our treasury data with it right here. So again, we're just doing all of this in R, but it's running in Snowflake.
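
A sketch of that pull-and-join step, assuming the LendingClub sample lives in a Snowflake table with column names along the lines of what he describes (LENDING_CLUB, issue_year, issue_month, bc_util, bc_open_to_buy, and int_rate are all placeholders):

```r
# Filter and select in Snowflake, collect the 2015 sample locally, build a date,
# and join the monthly treasury features onto it.
library(dplyr)
library(lubridate)

lending_2015 <- tbl(con, "LENDING_CLUB") |>
  filter(issue_year == 2015) |>
  select(bc_util, bc_open_to_buy, int_rate, issue_month, issue_year) |>
  collect() |>                                           # ~421,000 rows pulled locally
  mutate(date = make_date(issue_year, issue_month, 1))   # assumes a numeric issue month

lending_joined <- lending_2015 |>
  left_join(treasury_monthly_with_lags_and_rolling_mean_for_db, by = "date")
```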

Okay, now here's where the really fun part starts. We save that back to Snowflake in case we want to use it again. And now we can start modeling this data. So what that means is we have the full array of tidymodels available to us. So we're going to create a recipe for some feature engineering. We're going to run that recipe. We're going to normalize all of our numeric predictors. We're going to run imputation on these two that have a few NAs in them. And then we're going to get rid of any NA that we might have missed anywhere else. So this is kind of a feature engineering recipe that precedes running our model. We're going to choose to run a linear regression model. And then we're going to set up this workflow to run the model, to run the recipe. And then eventually, we're going to fit that model and calculate some metrics.
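
A sketch of that recipe-plus-workflow setup (outcome and predictor names follow the joined data from the sketches above; the imputation and normalization steps are illustrative rather than the exact demo code):

```r
# Feature-engineering recipe plus a linear regression, bundled in a workflow and fit.
library(tidymodels)

lending_recipe <- recipe(int_rate ~ ., data = lending_joined) |>
  update_role(date, issue_month, issue_year, new_role = "id") |>  # keep but don't model
  step_impute_mean(bc_util, bc_open_to_buy) |>                    # the columns with a few NAs
  step_normalize(all_numeric_predictors()) |>
  step_naomit(all_predictors())

lending_wf <- workflow() |>
  add_recipe(lending_recipe) |>
  add_model(linear_reg())

lending_fit <- fit(lending_wf, data = lending_joined)
```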

If you're familiar with tidymodels, and you're familiar with recipes and parsnip, this all probably looks pretty comfortable. And hopefully it looks pretty awesome because we're, again, running all of this natively on Snowflake. Where this gets super, super interesting is with this new Orbital package. So what Orbital lets us do is once we fit this model, Orbital is going to basically convert that workflow into code that can be executed directly on Snowflake. When I say directly, I mean it's not going to be executed necessarily in this native app. It's going to be executed back in a Snowflake warehouse using native SQL. That's going to make it super, super, super fast. And it's going to let you take advantage of all the partitioning that Snowflake does, really all the stuff that makes Snowflake really, really fast.

The way this flow goes is that we use the orbital function to convert that LendingClub fit object. And remember, that included the recipe for feature engineering and the fitting of the model. So that's going to give us this object. We don't have to run this next one, but we can; it will show us the SQL that's actually being created. So this is what's actually running in Snowflake, which has the best SQL engine in the world. And then we can run predictions on this. So this would run predictions for us on, I think, only about 400,000 observations here. The blog runs it across over 2 million in two and a half seconds, which is really, really fast.
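
Roughly, that prediction step looks like this, assuming the joined table was saved back to Snowflake under a name like LENDING_CLUB_WITH_TREASURIES (a placeholder) and using orbital's predict() method for remote tables described in the blog linked above:

```r
# Score the Snowflake-resident table in the warehouse; only the finished
# predictions come back to the R session.
library(orbital)
library(dplyr)

ob <- orbital(lending_fit)                      # recipe + model as column expressions

lending_remote <- tbl(con, "LENDING_CLUB_WITH_TREASURIES")

predictions <- predict(ob, lending_remote) |>   # builds SQL that runs in the warehouse
  collect()
```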

But this is a really, really powerful thing that, honestly, is pretty new. I think it's only been out for about the last four or five months, but it's a huge leap forward in how we can start with RStudio in the Posit Workbench native app, again, doing all the work we want to do using R, using the grammar we're familiar with, in the Snowflake cloud. So everything's staying secure, everything's staying governed. We can write our objects back to Snowflake, and then Orbital lets us take our recipes and our models and run them as native SQL inside Snowflake. So then we get the benefit of the fastest SQL engine in the world to run these models.

Like James said, it doesn't cover every single model that's inside of tidymodels. So the documentation specifies which ones are there exactly. There's quite a few, and it's the same for recipes. It doesn't have every single recipe out there, but it does have almost all of them.

Snowflake native feature store

I only have one or two minutes left. There's one more thing that I just want to kind of point out. So you can see that we've kind of created our SQL here around these recipes. And I talked about saving data back to Snowflake. One more thing to point out in Snowflake is that we just launched, around the same time as this native app actually, in the last nine months, our own native feature store. So a feature store is a place to both store and track the features that you've engineered to run your models, right? This has always been kind of a best practice, but now that we've opened this up to our customers, the adoption of that feature store is going much, much faster as companies try to not just de-silo their data, but try to de-silo their feature stores.

So with this workflow and the ability to save your data back to objects inside Snowflake, you can easily then create a workflow that would take those objects and plunk them back into the feature store so that other modeling teams, maybe modeling teams that don't use R and don't want to use R, maybe they want to use Julia, maybe they want to use Python, whatever it is, they should be able to take advantage of the features that you've created. And likewise, our coders should be able to take advantage of any features they've created by accessing the feature store. So it's kind of creating this unified hub for all these different features, and it really should be language agnostic, right? And so that's something that Snowflake is trying to enable because it's all stored natively in Snowflake, as long as it can be accessed via SQL, which it can in most of those IDEs, all these teams can kind of work together.

So I hope you guys are as excited about this capability as I am. To me, this is bringing together two of the most performant, efficient, and, really, as James said, "it just works" pieces of software in Snowflake and Posit. Being able to run R natively inside Snowflake like this is something I've always wanted to be able to do, and the Orbital package really just takes it to the next level, honestly.

Q&A

Yeah, thanks for walking through that. Just to kind of recap, right? On the Posit side, I can't emphasize enough how excited we've been about the work we've been able to accomplish with Snowflake in such a relatively short period of time. We went from not really having much context around the native app infrastructure and Snowpark Container Services to having a working version of Posit Workbench relatively quickly. And we view a lot of this as just the beginning of a foundation that we plan to build on for the foreseeable future.

We talked at the very beginning about some of the context around all of our professional products. We focus primarily on Workbench today, and that's because that's the first piece that's available inside of this native apps and Snowpark container services infrastructure within Snowflake. We are hard at work on bringing the other products that we offer to that same sort of deployment pattern. There's some work that's happening on the Snowflake side to support us there. There's work that we're doing. We don't have a timeframe on when we expect those to be available at this point, but it's something that we're conscious of and working diligently towards.

I've also pulled up just a couple of links. These have been shared in the chat, but if you wanted to grab them from here, you have access to the partner page that we have for Snowflake that just provides some details and context around our partnership with them. And then we also have that blog post that Jonathan's alluded to a few different times and that we've, again, kind of shared some links to in the chat. And then finally, if you want to reach out and get ahold of either of us, you can reach us here. If you're interested and you are a Posit customer today and you want to kind of experiment with this or understand a little bit more, you can reach out to your customer success representative. Or if you're not a current Posit customer, but would like to try this out, you can reach out to sales at posit.co and we can make sure that we get you the evaluation licenses that you would need to get this set up and configured on your end if it's something that you're interested in.

Yeah. And great to meet you all. Nick Pelican. I'm a senior solution architect at Posit and I cover mostly financial services. And thank you to Jonathan and James. That was super awesome. Love seeing macroeconomic data be used in the Posit Workbench native app. I think just to kick it off, we have a couple of different questions. One from Slido and Dominic on the YouTube chat. Both ask kind of similar questions around how much memory is available to R. Is this running in the Snowflake native app? Is this running on a Snowflake warehouse or like how do you assign compute to this app?

Yeah, I can take that one. And then Jonathan, if you've got color you want to add. But the current implementation today is that when you subscribe to the application and then your Snowflake administrator configures it, one of the things that they set up is the size of the compute pool that is available to the app inside of Snowpark Container Services. Snowflake provides a very wide range of compute pools or compute instances. So you can have something really small, you can have something really big, you can have something in between. But that defines what local resources are available. So if you set this up with an instance with 64 gigabytes of memory, then that's the memory you have available. If it's 128, if it's whatever the number is there, that's what's available and it's shared across all users who are concurrently accessing the app at any given point in time.

The other side of this though is that one of the great things about working with Snowflake and R is that it's very easy to connect to Snowflake data and then just keep the data in Snowflake while submitting queries that get executed as SQL rather than pulling all the data locally and working off of it there. And so if you employ that pattern, you end up not requiring as much local memory in most use cases because you're just letting Snowflake be the execution engine. And in that case, all the compute is defined by the warehouse you've connected to with your connection.

Yeah, just to kind of echo that, I'll say two things. One, just to pick up on what James was just saying. If you noticed, a couple of the code snippets where I wrote collect in my code, that was bringing the data in locally, you could say, right into the app. Before you run that collect statement, everything is being pushed down and executed back in the database. So you can do a lot of filtering, selecting, kind of downsampling before you actually pull your data in. So that's one way to manage your memory and CPU needs. But I'll also just say, to scale up, like James was saying, we have customers who need to scale up and down from very, very small to very large workloads inside Snowflake. So again, it's kind of just managing those compute pools, like James was saying.
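
To illustrate that pushdown pattern: everything before collect() in a dbplyr pipeline is translated to SQL and executed by the Snowflake warehouse, so the app only holds the filtered result (table and column names are placeholders):

```r
library(dplyr)

tbl(con, "TREZ_DATA") |>
  filter(date >= "2015-01-01") |>                       # runs as SQL in Snowflake
  group_by(maturity) |>
  summarise(mean_yield = mean(value, na.rm = TRUE)) |>  # still SQL, still in Snowflake
  collect()                                             # only now do rows enter local memory
```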

Awesome. We've got another question from Atung who asked, will this support ETL pipelines from on-premise databases?

My perspective on that one is, I think the best practice there in my mind would be to configure that ETL process to happen, kind of however you want, such that the data lands itself inside of Snowflake. The easiest path here is if you have everything available to you that you need from a data perspective inside of Snowflake, then this becomes a really easy process. If what you're trying to do is take the native app, take this service that's running inside of Snowpark Container Services with Posit Workbench, and then reach out to some other data source that's not Snowflake, things get a little bit tricky there because there are some networking restrictions or requirements that come into play, and the configuration becomes a little bit more nuanced at that point.

And so we've seen the most success, and our perspective has