
Emil Hvitfeldt - Tidypredict with recipes, turn workflow to SQL, spark, duckdb and beyond
Tidypredict is one of my favorite packages. Being able to turn a fitted model object into an equation is very powerful! However, in tidymodels, we use recipes more and more to do preprocessing. So far, tidypredict didn’t have support for recipes, which severely limited its uses. This talk is about how I fixed that issue. After spending a couple of years thinking about this problem, I finally found a way! Being able to turn a tidymodels workflow into a series of equations for prediction is super powerful. For some uses, being able to run a model’s predictions inside SQL, Spark or duckdb allows us to handle some problems with more ease. Talk by Emil Hvitfeldt Slides: https://emilhvitfeldt.github.io/talk-orbital-positconf/ GitHub Repo: https://github.com/EmilHvitfeldt/talk-orbital-positconf/tree/main
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
All right, so this is a talk about something that I've been thinking about for the last couple of years, and it started with two years of nothing working, in that I trashed a couple of prototypes and it just didn't work. Then last Christmas, while I was recovering from very minor surgery, I had an idea, and that is what this talk is about.
So we're in the modeling workflow and we're done. We found the model we want and we're happy with it, and the next question is: what do we do now? There are a lot of different things we could do, depending on what we want. It could be inference, it could be other types of extraction, but what I'm really focused on here is making predictions from this model we decided was the best model ever.
To be able to do this I have pulled out this little model using the penguins dataset we all know and love. I have a fairly simple recipe that deals with missing values, creates dummy variables, removes some variables that don't seem useful, and does centering and scaling at the end. Then we're using the bonsai package to provide the partykit engine for a decision tree. We combine it all in a workflow and fit it.
This is not a great model but it's very useful to showcase everything I need for this talk.
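The setup described above might look something like the sketch below. This is a hedged reconstruction, not copied from the slides: the exact step arguments and the variable-removal step are assumptions.

```r
# Reconstruction of the talk's workflow (step arguments are assumptions)
library(tidymodels)
library(bonsai) # provides the partykit engine for decision_tree()

rec <- recipe(body_mass_g ~ ., data = penguins) |>
  step_unknown(all_nominal_predictors()) |>      # handle missing factor values
  step_impute_median(all_numeric_predictors()) | # handle missing numerics
  step_dummy(all_nominal_predictors()) |>        # create dummy variables
  step_zv(all_predictors()) |>                   # drop uninformative variables
  step_normalize(all_numeric_predictors())       # center and scale

spec <- decision_tree() |>
  set_engine("partykit") |>
  set_mode("regression")

wf_fit <- workflow(rec, spec) |> fit(data = penguins)
predict(wf_fit, new_data = penguins)
```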
And we can use it to predict like we normally would. It follows the tidymodels prediction guarantee: it always returns a table with the right number of rows and the right column names.
Introducing tidypredict
To sidestep a little bit, I want to talk about one of my favorite packages that I didn't write myself. It's called tidypredict, and what it does is allow us to take a fitted model and run predictions inside a database. The way it does that is that it parses the model, extracts all the sufficient information, and then reconstructs a formula that can be evaluated in SQL or something else, thanks to the great work done in packages like dtplyr, dbplyr, DBI and so on that translate our code into code that can be run in other places.
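The parse-then-reconstruct idea can be shown with a minimal sketch on a linear model (not the talk's decision tree):

```r
# Minimal tidypredict sketch: parse a fit into an equation
library(tidypredict)

fit <- lm(mpg ~ wt + cyl, data = mtcars)

# Reconstructs the fitted model as a single R expression
tidypredict_fit(fit)
#> e.g. 39.69 + (wt * -3.19) + (cyl * -1.51)

# The same expression rendered as SQL for a (simulated) connection
tidypredict_sql(fit, dbplyr::simulate_dbi())
```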
And tidypredict works with a lot of different types of models: linear models, forest models, other types of tree models. But it has a limitation, and the limitation is that we're only allowed one equation. This means it ends up not being that useful for a lot of people, because if you have a recipe it doesn't work. That's why it crashed for me: if you try to recursively reconstruct a recipe and a model together, even for a modest data set you get 50,000 terabytes of SQL strings, and nothing likes that.
For the same reason, we end up with a lot of redundant information that needs to be calculated over and over again. If you're working with a tree model, the split conditions have to be evaluated multiple times, which is inefficient. And lastly, because we have only one equation, we can't easily get classification probabilities: we can only get one vector out at the end, not a vector for each class. We can technically get it for the two-class case, but then we're limited.
The orbital package
And this is where my contribution to this whole problem is where I added the orbital package.
So, stepping back one more time, think about how any given model works. A lot of times, especially when we're practitioners who just need predictions done, we think of these models as a black box: data comes in, predictions come out. But sometimes we can peel back the lid on some of these types of models, especially a tree-based model like the one I used. We can see that it follows a tree structure: first we check whether bill_length is less than or equal to some value; if that's true we check some other thing, and at the end we arrive at the prediction. What you see here is just the normal print of the fitted tree, but we could rewrite it as a series of if-else statements that are equivalent to prediction. And lastly, we could turn it into a one-liner using the case_when() function from dplyr, which is essentially what tidypredict is doing.
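The tree-to-one-liner rewrite described above might look like this sketch. The column names and split values are invented for illustration; the real ones come from the fitted tree.

```r
# The fitted tree rewritten as a single case_when() (values invented)
library(dplyr)

scaled <- tibble(
  bill_length = c(-0.9, 1.1),
  sex_male = c(0, 0),
  species_Chinstrap = c(0, 1)
)

scaled |>
  mutate(
    .pred = case_when(
      bill_length <= -0.3 & sex_male == 0 ~ 3600,
      bill_length <= -0.3 & sex_male == 1 ~ 4200,
      species_Chinstrap == 1 ~ 3800,
      .default = 5100
    )
  )
```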
But we are not just working with a single model; we're working with a workflow, which includes everything that happens before. You might have noticed that this tree used bill_length values that are negative: that's because the variable has been normalized earlier in the recipe, so we need to make sure that when we do the predictions, the right data arrives at the right time. We can think of this recipe as working sequentially: data comes in, step_unknown() handles all the nominal predictors, then step_impute_median() handles all the numeric predictors, and so on and so forth. But we can rethink this as a series of statements, much like in a mutate() call. So at the beginning we have some statements that check whether values are NA and then put in "unknown". Likewise, when we impute the median, it's the same idea: we find out whether the values are NA or not and fill them in, and we move on. Then step_dummy(), you know the drill by now. We can think of each recipe step as doing some calculations one after another in a sequential way, ending up with our predictions at the end.
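The recipe-as-mutate idea could be sketched like this. The medians, means and standard deviations here are invented; in reality they are extracted from the fitted recipe.

```r
# Each recipe step becomes a few column assignments run in sequence
# (all numeric constants are invented for illustration)
library(dplyr)

penguins |>
  mutate(
    # step_unknown(): replace missing factor values with "unknown"
    sex = if_else(is.na(sex), "unknown", as.character(sex)),
    # step_impute_median(): fill missing numerics with the training median
    bill_length_mm = if_else(is.na(bill_length_mm), 44.5, bill_length_mm),
    # step_dummy(): indicator columns for factor levels
    sex_male = as.numeric(sex == "male"),
    # step_normalize(): center and scale with training statistics
    bill_length_mm = (bill_length_mm - 43.9) / 5.5
  )
```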
But if we look at our final statement, not all the variables from earlier were used. The ones we have left are sex_male, bill_length and species_Chinstrap. So now let's go in reverse. We go back up and look at where we produced all the variables, and we notice that only three out of the eight variables at that point in time were used, so we can just gray out the rest. We continue doing this going up, through the centering and scaling, identifying the calculations that are needed for later use, and so on and so forth. So now we have identified all the calculations that need to happen to be able to produce a prediction, and we just throw away everything that doesn't.
So in essence if we can do this math we are able to provide the equivalent of a prediction.
And that's in essence what the orbital package does. You load up orbital and run the orbital() function on our fitted workflow, and it creates an object that has all the sufficient information to do predictions that are equivalent to the original predict method.
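In code, creating the object is a one-liner (assuming `wf_fit` is the fitted workflow from earlier):

```r
# Turn the fitted workflow into a sequence of prediction equations
library(orbital)

orbital_obj <- orbital(wf_fit)
orbital_obj  # prints the mutate-style equations it will run
```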
Before I move on, I want to make it clear that this object by itself is quite small and doesn't require any of the previous packages to work. If you save this object to disk and load it up in a new session, you don't need tidymodels, you don't need parsnip, you don't need bonsai, you don't need anything. You just need orbital. So that's already quite a big cut in dependencies right there.
Predictions and code generation
So now we have the object, and there are two main things we can do: prediction and code generation. I want to start with prediction. And we predict just like we did before: with the orbital object we can use the predict() function, and we see that the predictions are identical to the predictions from the tidymodels object, down to some rounding at the 12th or 15th decimal place. But I want to take this a step further. We now know that it works with a plain table: we can predict with it and get the result back. Where it becomes interesting is if we set up a little bit of infrastructure. Here I spin up an ephemeral SQLite database and put the penguins data in it, and I'm able to predict directly in the database. So these predictions are evaluated in SQL; that happens not inside R, but in a database somewhere. If you can give me a connection object and some data that matches the data this model was fitted on, we can predict on it. And because we can do this, we can also do the same thing in a Spark database, directly using everything Spark has to offer, not inside R. It also works with things like arrow. The print is nicer, but I promise it returns the same thing, and it will also work with niceties like duckdb. So we have honest predictions, but not inside the R session.
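The SQLite demo might be sketched as below, assuming `penguins` and the `orbital_obj` from earlier are available:

```r
# Predict inside a database: the work is done as SQL, not in R
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # ephemeral database
penguins_db <- copy_to(con, penguins)            # remote table

# Evaluated as SQL in the database; returns the same predictions
predict(orbital_obj, penguins_db)

dbDisconnect(con)
```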
So basically, if you can give me some data in some way, we can predict on it. But there's another side to the story, which is a little bit about how all of this works, and that is the notion of code generation. I showed you how we can do predictions in a SQL database, but we can also just not use R anymore. We take our orbital object, we bring back our SQLite connection, we use them together, and it exports the SQL that, when run in this database, does the same thing as prediction. It's important to note that we need the right SQL: it needs to be the right connection for where you're using it, because all the different SQL databases have slightly different flavors and different names for things. But this should work on any known database that has support from the DBI and dbplyr packages. This is also the way that duckdb works, because the developers behind duckdb have enabled these translations.
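A sketch of the export step, using the `orbital_obj` from earlier; the generated SQL dialect follows whichever connection you pass in:

```r
# Export the prediction logic as SQL for a specific connection
library(orbital)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

sql <- orbital_sql(orbital_obj, con)
sql  # the SQL that reproduces predict() when run in this database

DBI::dbDisconnect(con)
```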
Let's try another fun thing. We're going back into R, and we want to do predictions, but we actually don't even want to use orbital. We just want to spin up a shiny app, and we want it lean and mean, so it runs fast with only what it needs. To do that, first we need a function that does the prediction. So we have a function here called orbital_r_fun(). You give it the orbital object and a file name, and it will write to a file a function definition that, when applied to a data set, produces predictions equivalent to the predictions we saw before. This one needs dplyr, because some of the steps need dplyr to work, but it doesn't need orbital. With that in mind, we can spin up a simple UI that matches the input we are expecting.
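The written-out file might look roughly like the following: a plain dplyr pipeline with the fitted numbers baked in. This is a hypothetical sketch with invented constants, not actual generated output.

```r
# Hypothetical shape of the generated prediction function:
# needs only dplyr, not orbital or tidymodels
orbital_predict <- function(x) {
  x |>
    dplyr::mutate(
      # normalization step with training statistics baked in (invented)
      bill_length_mm = (bill_length_mm - 43.9) / 5.5,
      # the tree collapsed to a case_when() (splits invented)
      .pred = dplyr::case_when(
        bill_length_mm <= -0.3 ~ 3600,
        .default = 5100
      )
    )
}
```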
On the server side, we take all the input, put it in a data frame, pass it to this function that we sourced in, and render the result as text in a shiny app. So now we have a slightly modified shiny app that looks nice, but in essence, without using tidymodels or orbital, at least in that shiny app, we have live predictions that do exactly what you expect.
And we can take it one step further. The same way I showed we can generate an R function, you can use the same technique elsewhere. This isn't implemented, but I spent around 10 minutes with my very rudimentary JavaScript knowledge and translated the R function that was output earlier into a JavaScript function, and now we have tidymodels predictions using pure JavaScript in a website. We can slide things around and it does the predictions the same way we expect, in the browser; we're not using R, we're not using anything else. And this is just a simple JavaScript example to show what can really be done. You could technically also translate it to a Python function, so we can do Python predictions from tidymodels, but you can also do Java, you can do C, whatever you have the domain knowledge for; wherever the right place is for you to deploy this prediction model.
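Such a hand-translation could look like the sketch below. The split values, column names, and centering statistics are invented for illustration; a real translation would copy them from the generated R function.

```javascript
// Hypothetical hand-translation of the generated R prediction
// function into JavaScript (all numeric constants are invented)
function predict(row) {
  // mirrors the centering/scaling step from the recipe
  const bill_length = (row.bill_length_mm - 43.9) / 5.5;
  // mirrors the case_when() one-liner from the tree
  if (bill_length <= -0.3) {
    return row.sex_male === 1 ? 4200 : 3600;
  }
  return row.species_Chinstrap === 1 ? 3800 : 5100;
}

console.log(predict({ bill_length_mm: 39, sex_male: 0, species_Chinstrap: 0 }));
// prints 3600
```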
Downsides and limitations
All right, so I said a lot of nice things about the package, but there are some downsides. The main reason you might not want to use this package is that not all models are supported. And there are two reasons a model isn't supported: one is that we haven't done it yet, and the other is that it's simply infeasible. Distance-based models like nearest-neighbor models can't be rewritten as a series of expressions; it would be too long. But other types of tree-based or linear models could be implemented, so if you want them, file an issue and it can happen.
Another major downside, which is correspondingly an upside of using tidymodels directly, is that there's no input checking. It assumes that all the data coming into these functions is exactly as before. So if you have a new level in your factor, it doesn't know how to deal with it; it might work, but it might crash. If something was expected to be numeric but comes in as a date-time variable, it will also probably crash.
But you also gain a slight speed-up that way, because we're not spending time actually checking. And the last downside, which will go away over time, is that this is very new. It was pushed to CRAN about 15 days ago, and this is the first time I've told people outside my team about it. But it can do a lot of neat things, and this is where we come back to the pros: we don't have to spin up a container with all of tidymodels and all the extra dependencies. We can either use orbital, or work straight in databases, or skip R entirely.
And I see a lot of potential in this package, and this is where I'm going to leave off. This link will send you to the documentation of the package, where I outline a little more of what it can and can't do.
Q&A
Thank you Emil. That was great. I look forward to using it. Thank you.
So you touched on this a little bit at the end of your talk, but does orbital support all possible steps in recipes, or only some steps? I have implemented 46 steps, and there are some that will never be supported; I can make a list if people want. Partly that's because everything needs to be translatable into the many flavors of SQL, Spark and duckdb. So any text preprocessing steps won't work, because you can't easily tokenize in SQL, and some other things don't work either. But if there's any step you need, let me know and I will let you know if it's possible.
That's great. So how do you envision orbital being used by data scientists? You touched on model deployment, maybe outside of R, directly in a database, but could it be useful for things like model comparison or anything else? So it depends on how you define model comparison, because this is only prediction; it doesn't carry any characteristics of the model. So it would only be a way of comparing predictions across models. You would need to save the objects, but this is a very lightweight object, so the memory footprint is small.
So someone here asked the question that was on my mind: where does the name orbital come from? So this was originally code-named weasel, because I needed a name that wasn't already used. And we tried a couple of different things. Orbital is like up in space: you're sending stuff out into production. It didn't feel bad.
Should we run butcher on an orbital object? Butcher doesn't do anything, because it's already the most minimal object size possible.
Can orbital optimize away the transformations done on input variables that are kept, by folding them into the final model code? So that is essentially what it's doing by not folding them in. Because if you fold them in, you get this recursive nature that ends up making the model object bigger: when you're scaling things in multiple places, it ends up taking a long time. So it is the fastest you can make it using this convention. And sometimes it's going to be slow, because we can't beat matrix multiplication, and this is not matrix multiplication.
So I see a question here about Python. I'll go ahead and ask it, but I have my own at the end. The question is: how would this be integrated with Python? And my own question is: could someone borrow this idea and concept and implement it on their own in Python? You're right. Part of the reason I made this package is that I haven't seen anyone else do this for preprocessing. People have tried taking a fitted model and extracting the prediction, but as far as I know no one has extended it to preprocessing. In theory, you should be able to follow this approach for scikit-learn models and preprocessing. Well, thank you, Emil. Thank you.

