
Deploying Scikit-learn models for in-database scoring with Snowflake and Posit Team
Modern data science workflows often face a critical challenge: how to efficiently deploy machine learning models where the data lives. Traditional approaches require moving large datasets out of the database for scoring, creating bottlenecks, security concerns, and unnecessary data movement costs. In this demo, Nick Pelikan at Posit highlights how Orbital bridges the gap between Python's rich ML ecosystem and database-native execution by converting scikit-learn pipelines to optimized SQL. This enables data scientists to:

1. Use familiar Python tools for model development
2. Automatically translate complex feature engineering to SQL
3. Deploy models without rewriting code in different languages
4. Maintain model accuracy through faithful SQL translations

While this example focuses on Python, Nick also gave this demo using Orbital for R users: https://youtu.be/pnEjYNgOG9c?feature=shared

Helpful resources:

1. Follow-along blog post: https://posit.co/blog/snowflake-orbital-scikit-learn-models/
2. Supported models for Orbital: https://orbital.tidymodels.org/articles/supported-models.html
3. Orbital (Python) GitHub repo: https://github.com/posit-dev/orbital
4. Orbital (R) GitHub repo: https://github.com/tidymodels/orbital
5. Databricks and Orbital for R and Python model deployment: https://posit.co/blog/databricks-orbital-r-python-model-deployment/
6. Emily Riederer's XGBoost example with Orbital: https://www.emilyriederer.com/post/orbital-xgb/
7. Sara Altman's blog post on Shiny + databases: https://posit.co/blog/shiny-with-databases/
8. Emil Hvitfeldt's introduction of Orbital: https://emilhvitfeldt.com/talk/2024-08-13-orbital-positconf/
9. Nick's blog on running tidymodels prediction workflows with Orbital: https://posit.co/blog/running-tidymodel-prediction-workflows-inside-databases/

If you'd like to learn more about using Posit Team, you can always schedule time to chat with Posit here: https://posit.co/schedule-a-call/
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Hi, everyone. Thanks for tuning in to another edition of Data Science Workflows with Posit Team. I'm Nick Pelikan. I'm a Principal Solution Architect at Posit, and I'll be taking you through using Orbital, specifically Orbital for Python, in conjunction with Snowflake to deploy your machine learning models. Let's jump right in.
So first off, what is Orbital? A lot of you, having seen the last Data Science Workflows session I did a couple of months ago, might be asking: why are you talking about this? You already talked about Orbital. Well, we've recently introduced some really cool new capabilities to Orbital, specifically in the Python world.
So what Orbital does, specifically Orbital for Python, is enable you to run the predictions of scikit-learn pipelines directly within databases. The best way to think of it: first off, use the tools you're familiar with. Python users, we all know scikit-learn; use it to train your models. Then use Orbital to convert those models into native Snowflake SQL, and deploy them on Snowflake.
Why use Orbital?
So you might be asking yourself, why would I use Orbital? Well, a couple of different reasons. Number one, it enables you to share your models really, really easily with anybody. And we'll jump into this in just a little bit. What Orbital lets you do is, again, because it's converting models into SQL, it's putting them into a language that Snowflake can understand, which means that you can leverage all the great parts about Snowflake, including Snowflake Compute, all of Snowflake's tools around things like views, materialized views, scheduled jobs, to take your models and actually put them into production.
It's as simple as, and I'll show you this in just a second, one click or one SQL statement. And the second reason you might use Orbital is that it makes your models really, really, and I can't overstate this enough, really fast. Because it's converting them to SQL, putting them into that language Snowflake understands, it makes the models blazingly fast. You'll see that in just a little bit.
Setting up Posit Workbench on Snowflake
So first off, I'm going to go to Posit Workbench. And I'm actually using Posit Workbench running as a native app within Snowflake. So if you have any questions about using Posit Workbench directly within Snowflake, please reach out to us or check out some of our other material on our partnerships page about this really cool new capability. You can run the Posit tools, including all of your favorite IDEs, directly within your Snowflake environment.
So now that I'm in Posit Workbench on Snowflake, let me jump into Positron. If you haven't seen Positron yet, welcome. We're super excited to be releasing Positron as general availability. Just came out a couple of weeks ago. It's been in public beta, though, for about a year. It's our next generation IDE. We couldn't be more excited to release this to the public. If you've got any questions about it, please check out our website. You can download it to your desktop there. Or if you're a Workbench user in the most current versions of Workbench, you can upgrade and Positron's right there for you.
Connecting to Snowflake and exploring the data
So now I'm in Positron running within Snowflake. And let's get started with fitting a model. So first off, I'm going to connect to Snowflake. You'll notice because I'm using Posit Workbench in Snowflake, connection parameters, everything is just taken care of for me. I don't have to worry about username or password. It's just handled.
So now that I'm connected to Snowflake, let me access my data. I'm going to use Ibis. If you're not familiar with Ibis, it's a really cool project from the team behind pandas. What Ibis does is basically let you run pandas-style statements directly against a SQL database. I definitely encourage you to check it out.
Let me connect Ibis to Snowflake. Throughout this demo, I'll be using the loan data dataset. If you're not familiar with it, it's super popular; it's up on Kaggle. The loan data dataset is from Lending Club: roughly two-ish million loans from all throughout the U.S., mostly personal loans.
So we've got about two and a quarter million rows here in our main dataset. For the purposes of this demo, I'm going to filter that down a little bit: take a representative sample of that dataset with just my columns of interest. You can see I've cheated here a little bit; I've already done my feature selection. If you're doing this yourself, you'd want to do your own exploratory data analysis and feature selection.
So I'm going to sample down to 0.003 of the dataset. That'll leave me with roughly 5,000 to 10,000 rows on which to fit my model. Let's grab that sample; I'm taking just a quick look at the data to see what it looks like. So I've got my columns of interest: interest rate, the term of the loan, bc_util (bankcard credit utilization), bc_open_to_buy, and all_util (all credit utilization). What I'm going to try to do in this model is, for a given loan, predict its interest rate.
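In the demo this sampling happens in Ibis, pushed down to Snowflake. As a rough illustration of the same idea, here's a minimal pandas sketch; the DataFrame is synthetic, and the column names just follow the ones named in the transcript:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Lending Club loan table
# (the real one has ~2.25 million rows).
rng = np.random.default_rng(42)
n = 10_000
loans = pd.DataFrame({
    "id": np.arange(n),
    "int_rate": rng.uniform(5, 25, n),
    "term": rng.choice([" 36 months", " 60 months"], n),
    "bc_util": rng.uniform(0, 100, n),
    "bc_open_to_buy": rng.uniform(0, 50_000, n),
    "all_util": rng.uniform(0, 100, n),
})

# Take a 0.3% representative sample, keeping the columns of interest.
sample = loans.sample(frac=0.003, random_state=42)
print(len(sample))  # 30 rows out of 10,000
```

On the full ~2.25 million-row table, the same fraction yields the 5,000-10,000 rows mentioned above.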
Building the scikit-learn pipeline
Next up, let's do what we're all used to: split into training and test datasets, with the target column being interest rate. Then let's create a feature engineering and model fitting pipeline. If you're not familiar with scikit-learn pipelines, what they let you do, and I'm doing that right here, is combine preprocessing and model fitting steps. All of your feature engineering and model fitting get fit together, so it remains one continuous model object.
So for preprocessing, I'm going to grab term, my only categorical variable, and encode it. Then for my numeric columns, I'm going to do mean imputation on the missing data, and I'm going to scale the data as well. I'm just using standard scikit-learn components for this: SimpleImputer, OrdinalEncoder, and StandardScaler. And remainder equals passthrough means that all of the remaining columns in my dataset get passed through. Specifically, you'll notice I brought in one column called id; this is just the ID of the loan, and I want it to pass through this feature engineering step untouched.
And to predict this, I'm just going to use a simple linear regression. Orbital supports a ton of different models; towards the end, we'll talk about some of Orbital's limitations as well. But you can do things like linear regression, logistic regression, GLMs, tree-based models. There are a ton of options out there to fit whichever model works best for your specific problem or dataset.
So now that I've defined my pipeline, I'm going to fit it. You can see scikit-learn returns a visual representation of my pipeline: my ordinal encoder, imputer, scaler, and again, it's passing through that id column.
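As a concrete sketch of the pipeline just described: the data below is synthetic, and the column names follow the transcript; the components are the ones Nick names (SimpleImputer, OrdinalEncoder, StandardScaler, LinearRegression, with id passed through).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Synthetic stand-in for the sampled loan data.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "id": np.arange(n),
    "term": rng.choice([" 36 months", " 60 months"], n),
    "bc_util": rng.uniform(0, 100, n),
    "bc_open_to_buy": rng.uniform(0, 50_000, n),
    "all_util": rng.uniform(0, 100, n),
})
y = rng.uniform(5, 25, n)  # target: interest rate

# Preprocess: encode the categorical term column, impute + scale the
# numeric columns, and pass the id column through untouched.
preprocess = ColumnTransformer(
    [
        ("term", OrdinalEncoder(), ["term"]),
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("scale", StandardScaler()),
        ]), ["bc_util", "bc_open_to_buy", "all_util"]),
    ],
    remainder="passthrough",  # id flows through unchanged
)

# One continuous model object: feature engineering + linear regression.
pipe = Pipeline([("prep", preprocess), ("model", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
```

Because everything lives in one Pipeline object, Orbital can later translate the preprocessing and the model as a single unit.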
Saving the model to Snowflake's model registry
Next up, I'm going to use the Snowflake model registry to save my model. The model registry is a great piece of Snowflake functionality that lets you save models, for reasons like MLOps, compliance, and auditing. So what I'm going to do here is use Snowflake's Snowpark package to connect to the registry. Snowflake requires a model signature, a model signature being what the model expects as input features, so I'll use my training dataset to derive that signature. And then I'm going to log the model into my registry.
This can take a little bit, but the really cool thing here is that I'm using common tooling with other teams. I might be fitting this particular model for use with Orbital, but I'm registering it in Snowflake's model registry. That means if another team is fitting models in another language, or fitting models without Orbital, all of our models still end up in the same Snowflake model registry. Anyone in my company who's curious about the models out there in the wild, or about model performance, can go into the registry and see the entire inventory.
And now my model is logged with Snowflake. So it's got a model name, I'm just calling this Orbital test, and I've got a model version indicator.
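Spelled out, the registry step narrated above looks roughly like this pseudocode sketch. The class and argument names are assumptions inferred from the narration and Snowflake's snowflake-ml-python package, not verified code; `session`, `pipe`, and `X_train` are hypothetical names for the Snowpark session, the fitted pipeline, and the training data.

```
# Sketch only: names are assumptions, exact API may differ.
from snowflake.ml.registry import Registry

reg = Registry(session=session)      # session: an existing Snowpark session
mv = reg.log_model(
    pipe,                            # the fitted scikit-learn pipeline
    model_name="orbital_test",
    sample_input_data=X_train,       # used to derive the model signature
)
```

The logged model gets a name ("orbital_test" here) and a version indicator, which is what the view name later reuses.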
Converting the pipeline to SQL with Orbital
Let's jump into where Orbital comes in. First I'm going to import the Orbital package. Then I'm going to have Orbital parse the pipeline I just fit, giving it that same model signature. You can see I'm using Orbital's types to tell it what type to expect from each specific feature column: our id, term, bc_util, bc_open_to_buy, and all_util columns right here. So let's let Orbital parse that pipeline.
Let's take a look at what that parsed representation of the pipeline looks like. This is what Orbital is doing under the hood: it takes my scikit-learn pipeline and parses it into a logical representation that it can then convert into any number of downstream formats. In this case, we're concerned with SQL. You can see here all of the pipeline steps I just defined.
Now let's turn that into SQL. I'm going to run Orbital's export-SQL function and tell it what table to run against, that loan data table I used at the beginning, and what dialect to use. Orbital can emit a bunch of different SQL dialects; in this case, we want Snowflake. I give it that fit pipeline, and there it is: my model, converted within seconds, with just a couple of lines of code, into SQL.
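The two Orbital calls narrated here look roughly like the following pseudocode sketch. The function and type names are inferred from the narration and Orbital's documentation, so treat them as assumptions; `pipe` is the fitted scikit-learn pipeline from earlier.

```
# Sketch only: names are assumptions, exact API may differ.
import orbitalml
import orbitalml.types

# Parse the fitted pipeline into Orbital's intermediate representation,
# declaring the expected type of each input column (the model signature).
parsed = orbitalml.parse_pipeline(pipe, features={
    "id": orbitalml.types.Int64ColumnType(),
    "term": orbitalml.types.StringColumnType(),
    "bc_util": orbitalml.types.FloatColumnType(),
    "bc_open_to_buy": orbitalml.types.FloatColumnType(),
    "all_util": orbitalml.types.FloatColumnType(),
})

# Export that representation as a single Snowflake-dialect SQL query
# that scores rows of the loan data table.
sql = orbitalml.export_sql("loan_data", parsed, dialect="snowflake")
```

The result is a plain SQL string: everything downstream (pandas, views, Snowsight) just treats it as a query.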
And if you read through the SQL, you can actually see what Orbital is doing and how it's converting. You can see the ordinal encoding of the term variable, the mean imputation, and the scaling. And towards the bottom of the SQL statement, you can see the model itself happening: the actual model coefficients.
Running inference on Snowflake
So now that we have that model converted into SQL, let's run some inference, and because it's SQL, let's use Snowflake to run it. Let's use all that compute power Snowflake gives us to run that inference really, really quickly. I'm just going to use pandas: pandas.read_sql, fed that SQL along with my Snowflake connection. Let's see how fast that runs.
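The scoring call itself is just pandas.read_sql with the generated SQL and a database connection. Here's a self-contained illustration with an in-memory SQLite database standing in for Snowflake, and a trivial query standing in for the model SQL Orbital generated:

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the Snowflake connection.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE loan_data (id INTEGER, int_rate REAL);
    INSERT INTO loan_data VALUES (1, 13.5), (2, 7.9), (3, 21.0);
""")

# In the demo, `sql` is the full model query Orbital generated;
# a trivial SELECT stands in for it here.
sql = "SELECT id, int_rate FROM loan_data"
preds = pd.read_sql(sql, con)
print(len(preds))  # 3
```

Because the database does all the work, the Python side is a single round trip, which is why scoring the full table takes seconds.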
You can see that took about four seconds, and those are predictions back directly from that entire large dataset in my Snowflake environment. And let's see how many predictions just happened: that was the entire dataset. We just ran model predictions over our entire dataset, two and a quarter million rows, in less than five seconds.
But then again, as I mentioned at the beginning, one of the really cool things about Orbital is that instead of having to figure out Python runtimes and Docker images to deploy your models, because it's native SQL, you can deploy this model to Snowflake as a native Snowflake object. So what I'm going to do here is take that same SQL I used before and stick a create or replace view statement in front of it. What I'm creating is a Snowflake view. A Snowflake view is just a saved SQL query: you can select from it as if it were a table, and it runs that SQL query every time. So I'm sticking a create or replace view in front of the SQL I've already generated, naming it with my model name and the version name I got from the Snowflake model registry.
So let's save that to Snowflake. And now my model is deployed to Snowflake.
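Deploying really is just string concatenation: prefix the generated SQL with a view definition and execute it. Here's a minimal illustration, again with SQLite standing in for Snowflake and a trivial query standing in for the model SQL. Note that SQLite lacks CREATE OR REPLACE VIEW, so this sketch uses DROP VIEW IF EXISTS first; on Snowflake you'd issue CREATE OR REPLACE VIEW directly, as in the demo.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE loan_data (id INTEGER, int_rate REAL)")
con.execute("INSERT INTO loan_data VALUES (1, 13.5), (2, 7.9)")

# `model_sql` stands in for the query Orbital generated.
model_sql = "SELECT id, int_rate AS predicted_rate FROM loan_data"

# View named after the registered model name + version (hypothetical).
view_name = "orbital_test_v1"
con.execute(f"DROP VIEW IF EXISTS {view_name}")  # Snowflake: CREATE OR REPLACE
con.execute(f"CREATE VIEW {view_name} AS {model_sql}")

# Anyone with access can now score by selecting from the view like a table.
rows = con.execute(f"SELECT * FROM {view_name}").fetchall()
print(rows)  # [(1, 13.5), (2, 7.9)]
```

Every SELECT against the view re-runs the model SQL on current data, which is what makes the model shareable without any Python runtime.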
Comparing Orbital SQL vs. Snowpark Python runtime
So now I'm in my Snowflake instance itself. This is Snowsight, Snowflake's SQL dashboard. I've got a query here that I already wrote; let me change it to reference the view I just created. So what I'm going to do is select id, term, loan amount, and my predicted interest rate from the model I just fit, using that saved view to do the actual model inference. Let's see how fast this runs. While I'm doing this, keep in mind I'm not using some massive Snowflake compute engine here; I'm using Snowflake's second-smallest warehouse size to do all this work.
And there are my results back. It took only about 1.6 seconds. It's that fast.
And let's compare that to doing the same thing, but using Snowflake's native Python runtime called Snowpark. So here's the same query that I just ran, but because I've saved that model in the Snowflake model registry, I can use Snowpark to access it.
And let's run this and see how fast this runs. You'll see that took around an order of magnitude longer to run.
So you might be asking right now, why is Orbital so fast? Well, a combination of different things. Number one, Snowflake is really good at understanding SQL code. Snowflake on the back end translates SQL code into really optimized C code. That's why Snowflake warehouses are so fast. Because Orbital translates a model into something that Snowflake natively understands, it's going to be that much faster.
The other reason: if we go look at the query history, we can compare and contrast these two queries. This is the query I ran first, using my Orbital-converted model. Because this is Snowflake-native SQL, Snowflake's query optimizer can work on it. So instead of running on all two and a quarter million rows in this table, you'll notice it's only running on a subset, in some cases 80,000, 51,000, 45,000 rows, because I had some filter statements in my query. That query optimizer is doing a lot of legwork to make sure this returns really quickly.
One of the other things you don't have to deal with: every time Snowflake runs Python, it has to spin up a Python runtime, so you pay a bit of a tax for that. You'll see this initialization penalty took up about 45 percent of the time. If you convert a model to SQL, Snowflake doesn't have to do that, so it runs that much faster.
Orbital capabilities and limitations
Let's jump into some of the things that Orbital can and can't do. You might be thinking to yourself, this is awesome, I want to use Orbital for everything. Orbital is a fantastic capability, but because it is converting to SQL, there are a few limitations. As I mentioned, Orbital can convert a ton of different models: linear regressions, logistic regressions, tree-based models, XGBoost models, things like that. Those convert just fine.
But let's say you want to use a model that has some element of recursion, something like k-nearest neighbors. Well, SQL doesn't have that kind of recursion; you can't express that as a recursive function in SQL. So unfortunately, you can't use Orbital for that. But you can always use Snowflake's Python runtime.
The other thing to keep in mind is that Orbital does not, and probably won't, support something like a neural network model, a TensorFlow or Keras model. The reason is that those models involve a ton of parallelization at low numeric precision, while SQL engines, quite frankly, are designed for high precision. There's nothing mathematically stopping us from building an Orbital translation of, say, a Keras model; it would be one giant case-when statement. But those models are so much better served by GPUs, and SQL really doesn't run on GPUs. That's where something like Snowpark comes in handy, because you can attach GPUs to Snowflake and Snowpark.
But again, for those cases I'm sure a lot of us work on all the time, building a logistic regression, building a tree-based model, models that deliver so much value to our customers and stakeholders, being able to convert those into SQL, deploy them onto Snowflake effortlessly, and iterate as fast as I just showed you is going to be a huge step change in your capabilities. You're going to deploy models faster, your models are going to run faster and more cheaply, and it's just going to make your life that much easier.
And with that, that's all I've got. So please, I'm super excited to hear your questions and I'll see you in the discussion room. Thanks, everyone.
