
Deploying Scikit-learn models for in-database scoring with Snowflake and Posit Team
Modern data science workflows often face a critical challenge: how to efficiently deploy machine learning models where the data lives. Traditional approaches require moving large datasets out of the database for scoring, creating bottlenecks, security concerns, and unnecessary data movement costs. In this demo, Nick Pelikan at Posit highlights how Orbital bridges the gap between Python's rich ML ecosystem and database-native execution by converting scikit-learn pipelines to optimized SQL. This enables data scientists to:

1. Use familiar Python tools for model development
2. Automatically translate complex feature engineering to SQL
3. Deploy models without rewriting code in different languages
4. Maintain model accuracy through faithful SQL translations

While this example focuses on Python, Nick also gave this demo using Orbital for R users: https://youtu.be/pnEjYNgOG9c?feature=shared

Helpful resources:

1. Follow-along blog post: https://posit.co/blog/snowflake-orbital-scikit-learn-models/
2. Supported models for Orbital: https://orbital.tidymodels.org/articles/supported-models.html
3. Orbital (Python) GitHub repo: https://github.com/posit-dev/orbital
4. Orbital (R) GitHub repo: https://github.com/tidymodels/orbital
5. Databricks and Orbital for R and Python model deployment: https://posit.co/blog/databricks-orbital-r-python-model-deployment/
6. Emily Riederer's XGBoost example with Orbital: https://www.emilyriederer.com/post/orbital-xgb/
7. Sara Altman's blog post on Shiny + databases: https://posit.co/blog/shiny-with-databases/
8. Emil Hvitfeldt's introduction of Orbital: https://emilhvitfeldt.com/talk/2024-08-13-orbital-positconf/
9. Nick's blog on running tidymodels prediction workflows with Orbital: https://posit.co/blog/running-tidymodel-prediction-workflows-inside-databases/

If you'd like to learn more about using Posit Team, you can always schedule time to chat with Posit here: https://posit.co/schedule-a-call/
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Hi, everyone. Thanks for tuning in to another edition of Data Science Workflows with Posit Team. I'm Nick Pelikan. I'm a Principal Solution Architect at Posit, and I'll be taking you through using Orbital, specifically Orbital for Python, in conjunction with Snowflake to deploy your machine learning models. Let's jump right in.
So first off, what is Orbital? A lot of you, having seen the last Data Science Workflows session I did a couple of months ago, might be asking: why are you talking about this? You already talked about Orbital. Well, we've recently introduced some really cool new capabilities to Orbital, specifically in the Python world.
So what Orbital does, specifically Orbital for Python, is enable you to run the predictions of scikit-learn pipelines directly within databases. The best way to think of it: first off, use the tools you're familiar with. Python users, we all know scikit-learn; use it to train your models. Then use Orbital to convert those models into native Snowflake SQL, and deploy them on Snowflake.
Why use Orbital?
So you might be asking yourself, why would I use Orbital? Well, a couple of different reasons. Number one, it enables you to share your models really, really easily with anybody. And we'll jump into this in just a little bit. What Orbital lets you do is, again, because it's converting models into SQL, it's putting them into a language that Snowflake can understand, which means that you can leverage all the great parts about Snowflake, including Snowflake Compute, all of Snowflake's tools around things like views, materialized views, scheduled jobs, to take your models and actually put them into production.
It's as simple as, and I'll show you this in just a second, one click or one SQL statement. And the second reason you might use Orbital is that it makes your models really, really, and I can't overstate this enough, really fast. Because it's converting them to SQL, putting them into that language Snowflake understands, it makes the models blazingly fast. You'll see that in just a little bit.
Setting up Posit Workbench on Snowflake
So first off, I'm going to go to Posit Workbench. And I'm actually using Posit Workbench running as a native app within Snowflake. So if you have any questions about using Posit Workbench directly within Snowflake, please reach out to us or check out some of our other material on our partnerships page about this really cool new capability. You can run the Posit tools, including all of your favorite IDEs, directly within your Snowflake environment.
So now that I'm in Posit Workbench on Snowflake, let me jump into Positron. If you haven't seen Positron yet, welcome. We're super excited to be releasing Positron as general availability. Just came out a couple of weeks ago. It's been in public beta, though, for about a year. It's our next generation IDE. We couldn't be more excited to release this to the public. If you've got any questions about it, please check out our website. You can download it to your desktop there. Or if you're a Workbench user in the most current versions of Workbench, you can upgrade and Positron's right there for you.
Connecting to Snowflake and exploring the data
So now I'm in Positron running within Snowflake. And let's get started with fitting a model. So first off, I'm going to connect to Snowflake. You'll notice because I'm using Posit Workbench in Snowflake, connection parameters, everything is just taken care of for me. I don't have to worry about username or password. It's just handled.
So now that I'm connected to Snowflake, let me access my data. I'm going to use Ibis. If you're not familiar with Ibis, it's a really cool project from the team behind pandas. What Ibis does is basically let you run pandas-style statements directly against a SQL database. I definitely encourage you to check it out.
Let me connect Ibis to Snowflake. Throughout this demo, I'll be using the loan data dataset. If you're not familiar with it, it's super popular; it's up on Kaggle. The loan data dataset is from Lending Club: roughly two-ish million loans from all throughout the U.S., mostly personal loans.
So we've got about two and a quarter million rows here in our main dataset. For the purposes of this demo, I'm going to filter that down a little bit: take a representative sample of that dataset with just my columns of interest. You can see I've cheated here a little bit; I've already done my feature selection. If you're doing this yourself, you'd want to do your own exploratory data analysis and feature selection.
So I'm going to sample down to 0.003 of the dataset. That'll leave me with roughly 5,000 to 10,000 rows on which to fit my model. Let's grab that sample; I'm taking just a quick look at the data to see what it looks like. So I've got my columns of interest: interest rate, the term of the loan, bc_util (bankcard credit utilization), bc_open_to_buy, and all_util (all credit utilization). What I'm going to try to do in this model is, for a given loan, predict its interest rate.
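In the demo this sampling happens in Ibis, pushed down to Snowflake. As a rough illustration of the same idea, here's a minimal pandas sketch; the DataFrame is synthetic, and the column names just follow the ones named in the transcript:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Lending Club loan table
# (the real one has ~2.25 million rows).
rng = np.random.default_rng(42)
n = 10_000
loans = pd.DataFrame({
    "id": np.arange(n),
    "int_rate": rng.uniform(5, 25, n),
    "term": rng.choice([" 36 months", " 60 months"], n),
    "bc_util": rng.uniform(0, 100, n),
    "bc_open_to_buy": rng.uniform(0, 50_000, n),
    "all_util": rng.uniform(0, 100, n),
})

# Take a 0.3% representative sample, keeping the columns of interest.
sample = loans.sample(frac=0.003, random_state=42)
print(len(sample))  # 30 rows out of 10,000
```

On the full ~2.25 million-row table, the same fraction yields the 5,000-10,000 rows mentioned above.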
Building the scikit-learn pipeline
Next up, let's do what we're all used to: split into training and test datasets, with the target column being interest rate. Then let's create a feature engineering and model fitting pipeline. If you're not familiar with scikit-learn pipelines, what they let you do, and I'm doing that right here, is combine preprocessing and model fitting steps. All of your feature engineering and model fitting get fit together, so it remains one continuous model object.
So for preprocessing, I'm going to grab term, my only categorical variable, and encode it. Then for my numeric columns, I'm going to do mean imputation on the missing data, and I'm going to scale the data as well. I'm just using standard scikit-learn components for this: SimpleImputer, OrdinalEncoder, and StandardScaler. And remainder equals passthrough means that all of the remaining columns in my dataset get passed through. Specifically, you'll notice I brought in one column called id; this is just the ID of the loan, and I want it to pass through this feature engineering step untouched.
And to predict this, I'm just going to use a simple linear regression. Orbital supports a ton of different models; towards the end, we'll talk about some of Orbital's limitations as well. But you can do things like linear regression, logistic regression, GLMs, tree-based models. There are a ton of options out there to fit whichever model works best for your specific problem or dataset.
So now that I've defined my pipeline, I'm going to fit it. You can see scikit-learn returns a visual representation of my pipeline: my ordinal encoder, imputer, scaler, and again, it's passing through that id column.
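As a concrete sketch of the pipeline just described: the data below is synthetic, and the column names follow the transcript; the components are the ones Nick names (SimpleImputer, OrdinalEncoder, StandardScaler, LinearRegression, with id passed through).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Synthetic stand-in for the sampled loan data.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "id": np.arange(n),
    "term": rng.choice([" 36 months", " 60 months"], n),
    "bc_util": rng.uniform(0, 100, n),
    "bc_open_to_buy": rng.uniform(0, 50_000, n),
    "all_util": rng.uniform(0, 100, n),
})
y = rng.uniform(5, 25, n)  # target: interest rate

# Preprocess: encode the categorical term column, impute + scale the
# numeric columns, and pass the id column through untouched.
preprocess = ColumnTransformer(
    [
        ("term", OrdinalEncoder(), ["term"]),
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("scale", StandardScaler()),
        ]), ["bc_util", "bc_open_to_buy", "all_util"]),
    ],
    remainder="passthrough",  # id flows through unchanged
)

# One continuous model object: feature engineering + linear regression.
pipe = Pipeline([("prep", preprocess), ("model", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
```

Because everything lives in one Pipeline object, Orbital can later translate the preprocessing and the model as a single unit.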
Saving the model to Snowflake's model registry
Next up, I'm going to use the Snowflake model registry to save my model. The model registry is a great piece of Snowflake functionality that lets you save models, for reasons like MLOps, compliance, and auditing. So what I'm going to do here is use Snowflake's Snowpark package to connect to the registry. Snowflake requires a model signature, a model signature being what the model expects as input features, so I'll use my training dataset to derive that signature. And then I'm going to log the model into my registry.
This can take a little bit, but the really cool thing here is that I'm using common tooling with other teams. I might be fitting this particular model for use with Orbital, but I'm registering it in Snowflake's model registry. That means if another team is fitting models in another language, or fitting models without Orbital, all of our models still end up in the same Snowflake model registry. Anyone in my company who's curious about the models out there in the wild, or about model performance, can go into the registry and see the entire inventory.
And now my model is logged with Snowflake. So it's got a model name, I'm just calling this Orbital test, and I've got a model version indicator.
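Spelled out, the registry step narrated above looks roughly like this pseudocode sketch. The class and argument names are assumptions inferred from the narration and Snowflake's snowflake-ml-python package, not verified code; `session`, `pipe`, and `X_train` are hypothetical names for the Snowpark session, the fitted pipeline, and the training data.

```
# Sketch only: names are assumptions, exact API may differ.
from snowflake.ml.registry import Registry

reg = Registry(session=session)      # session: an existing Snowpark session
mv = reg.log_model(
    pipe,                            # the fitted scikit-learn pipeline
    model_name="orbital_test",
    sample_input_data=X_train,       # used to derive the model signature
)
```

The logged model gets a name ("orbital_test" here) and a version indicator, which is what the view name later reuses.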
Converting the pipeline to SQL with Orbital
Let's jump into where Orbital comes in. First I'm going to import the Orbital package. Then I'm going to have Orbital parse the pipeline I just fit, giving it that same model signature. You can see I'm using Orbital's types to tell it what type to expect from each specific feature column: our id, term, bc_util, bc_open_to_buy, and all_util columns right here. So let's let Orbital parse that pipeline.
Let's take a look at what that parsed representation of the pipeline looks like. This is what Orbital is doing under the hood: it takes my scikit-learn pipeline and parses it into a logical representation that it can then convert into any number of downstream formats. In this case, we're concerned with SQL. You can see here all of the pipeline steps I just defined.
Now let's turn that into SQL. I'm going to run Orbital's export-SQL function and tell it what table to run against, that loan data table I used at the beginning, and what dialect to use. Orbital can emit a bunch of different SQL dialects; in this case, we want Snowflake. I give it that fit pipeline, and there it is: my model, converted within seconds, with just a couple of lines of code, into SQL.
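The two Orbital calls narrated here look roughly like the following pseudocode sketch. The function and type names are inferred from the narration and Orbital's documentation, so treat them as assumptions; `pipe` is the fitted scikit-learn pipeline from earlier.

```
# Sketch only: names are assumptions, exact API may differ.
import orbitalml
import orbitalml.types

# Parse the fitted pipeline into Orbital's intermediate representation,
# declaring the expected type of each input column (the model signature).
parsed = orbitalml.parse_pipeline(pipe, features={
    "id": orbitalml.types.Int64ColumnType(),
    "term": orbitalml.types.StringColumnType(),
    "bc_util": orbitalml.types.FloatColumnType(),
    "bc_open_to_buy": orbitalml.types.FloatColumnType(),
    "all_util": orbitalml.types.FloatColumnType(),
})

# Export that representation as a single Snowflake-dialect SQL query
# that scores rows of the loan data table.
sql = orbitalml.export_sql("loan_data", parsed, dialect="snowflake")
```

The result is a plain SQL string: everything downstream (pandas, views, Snowsight) just treats it as a query.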
And if you read through the SQL, you can actually see what Orbital is doing and how it's converting. You can see the ordinal encoding of the term variable, the mean imputation, and the scaling. And towards the bottom of the SQL statement, you can see the model itself happening: the actual model coefficients.
Running inference on Snowflake
So now that we have that model converted into SQL, let's run some inference, and because it's SQL, let's use Snowflake to run it. Let's use all that compute power Snowflake gives us to run that inference really, really quickly. I'm just going to use pandas: pandas.read_sql, fed that SQL along with my Snowflake connection. Let's see how fast that runs.
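The scoring call itself is just pandas.read_sql with the generated SQL and a database connection. Here's a self-contained illustration with an in-memory SQLite database standing in for Snowflake, and a trivial query standing in for the model SQL Orbital generated:

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the Snowflake connection.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE loan_data (id INTEGER, int_rate REAL);
    INSERT INTO loan_data VALUES (1, 13.5), (2, 7.9), (3, 21.0);
""")

# In the demo, `sql` is the full model query Orbital generated;
# a trivial SELECT stands in for it here.
sql = "SELECT id, int_rate FROM loan_data"
preds = pd.read_sql(sql, con)
print(len(preds))  # 3
```

Because the database does all the work, the Python side is a single round trip, which is why scoring the full table takes seconds.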
You can see that took about four seconds, and those are predictions back directly from that entire large dataset in my Snowflake environment. And let's see how many predictions just happened: that was the entire dataset. We just ran model predictions over our entire dataset, two and a quarter million rows, in less than five seconds.
But then again, as I mentioned at the beginning, one of the really cool things about Orbital is that instead of having to figure out Python runtimes and Docker images to deploy your models, because it's native SQL, you can deploy this model to Snowflake as a native Snowflake object. So what I'm going to do here is take that same SQL I used before and stick a create or replace view statement in front of it. What I'm creating is a Snowflake view. A Snowflake view is just a saved SQL query: you can select from it as if it were a table, and it runs that SQL query every time. So I'm sticking a create or replace view in front of the SQL I've already generated, naming it with my model name and the version name I got from the Snowflake model registry.
So let's save that to Snowflake. And now my model is deployed to Snowflake.
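Deploying really is just string concatenation: prefix the generated SQL with a view definition and execute it. Here's a minimal illustration, again with SQLite standing in for Snowflake and a trivial query standing in for the model SQL. Note that SQLite lacks CREATE OR REPLACE VIEW, so this sketch uses DROP VIEW IF EXISTS first; on Snowflake you'd issue CREATE OR REPLACE VIEW directly, as in the demo.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE loan_data (id INTEGER, int_rate REAL)")
con.execute("INSERT INTO loan_data VALUES (1, 13.5), (2, 7.9)")

# `model_sql` stands in for the query Orbital generated.
model_sql = "SELECT id, int_rate AS predicted_rate FROM loan_data"

# View named after the registered model name + version (hypothetical).
view_name = "orbital_test_v1"
con.execute(f"DROP VIEW IF EXISTS {view_name}")  # Snowflake: CREATE OR REPLACE
con.execute(f"CREATE VIEW {view_name} AS {model_sql}")

# Anyone with access can now score by selecting from the view like a table.
rows = con.execute(f"SELECT * FROM {view_name}").fetchall()
print(rows)  # [(1, 13.5), (2, 7.9)]
```

Every SELECT against the view re-runs the model SQL on current data, which is what makes the model shareable without any Python runtime.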
Comparing Orbital SQL vs. Snowpark Python runtime
So now I'm in my Snowflake instance itself. This is Snowsight, Snowflake's SQL dashboard. I've got a query here that I already wrote; let me change it to reference the view I just created. So what I'm going to do is select id, term, loan amount, and my predicted interest rate from the model I just fit, using that saved view to do the actual model inference. Let's see how fast this runs. While I'm doing this, keep in mind I'm not using some massive Snowflake compute engine here; I'm using Snowflake's second-smallest warehouse size to do all this work.
And there are my results back. It took only about 1.6 seconds. It's that fast.
And let's compare that to doing the same thing, but using Snowflake's native Python runtime called Snowpark. So here's the same query that I just ran, but because I've saved that model in the Snowflake model registry, I can use Snowpark to access it.
And let's run this and see how fast this runs. You'll see that took around an order of magnitude longer to run.
So you might be asking right now, why is Orbital so fast? Well, a combination of different things. Number one, Snowflake is really good at understanding SQL code. Snowflake on the back end translates SQL code into really optimized C code. That's why Snowflake warehouses are so fast. Because Orbital translates a model into something that Snowflake natively understands, it's going to be that much faster.
The other reason: if we go look at the query history, we can compare and contrast these two queries. This is the query I ran first, using my Orbital-converted model. Because this is Snowflake-native SQL, Snowflake's query optimizer can work on it. So instead of running on all two and a quarter million rows in this table, you'll notice it's only running on a subset, in some cases 80,000, 51,000, 45,000 rows, because I had some filter statements in my query. That query optimizer is doing a lot of legwork to make sure this returns really quickly.
One of the other things you don't have to deal with: every time Snowflake runs Python, it has to spin up a Python runtime, so you pay a bit of a tax for that. You'll see this initialization penalty took up about 45 percent of the time. If you convert a model to SQL, Snowflake doesn't have to do that, so it runs that much faster.
Orbital capabilities and limitations
Let's jump into some of the things that Orbital can and can't do. You might be thinking to yourself, this is awesome, I want to use Orbital for everything. Orbital is a fantastic capability, but because it is converting to SQL, there are a few limitations. As I mentioned, Orbital can convert a ton of different models: linear regressions, logistic regressions, tree-based models, XGBoost models, things like that. Those convert just fine.
But let's say you want to use a model that has some element of recursion, something like k-nearest neighbors. Well, SQL doesn't have that kind of recursion; you can't express that as a recursive function in SQL. So unfortunately, you can't use Orbital for that. But you can always use Snowflake's Python runtime.
The other thing to keep in mind is that Orbital does not, and probably won't, support something like a neural network model, a TensorFlow or Keras model. The reason is that those models involve a ton of parallelization at low numeric precision, while SQL engines, quite frankly, are designed for high precision. There's nothing mathematically stopping us from building an Orbital translation of, say, a Keras model; it would be one giant case-when statement. But those models are so much better served by GPUs, and SQL really doesn't run on GPUs. That's where something like Snowpark comes in handy, because you can attach GPUs to Snowflake and Snowpark.
But again, for those cases I'm sure a lot of us work on all the time, building a logistic regression, building a tree-based model, models that deliver so much value to our customers and stakeholders, being able to convert those into SQL, deploy them onto Snowflake effortlessly, and iterate as fast as I just showed you is going to be a huge step change in your capabilities. You're going to deploy models faster, your models are going to run faster and more cheaply, and it's just going to make your life that much easier.
And with that, that's all I've got. So please, I'm super excited to hear your questions and I'll see you in the discussion room. Thanks, everyone.
