Emil Hvitfeldt - Tidypredict with recipes, turn workflow to SQL, spark, duckdb and beyond

Transcript#

This transcript was generated automatically and may contain errors.

All right, so this is a talk about something that I've been thinking about the last couple of years and it started with two years of nothing working, as in a trashed a couple of parcels and it just didn't work. And then last Christmas while I was returning for very slight surgery I had an idea and this is what this talk is about.

So we're in the modeling workflow and we're done. We found the right model we want and we're happy with it and then the next question is what should we do now? And there's a lot of different things to do with it, depends on what we want to do. It could be inference, it could be other types of extractions but what I'm really focused on right here is making predictions from this model we decided it was the best model ever.

To be able to do this I have pulled out today this little model using the Penguins dataset we all know and love. I have a fairly simple recipe that deals with missing values, we create dummy variables, we remove some variables that doesn't seem like they are useful, centering and scaling at the end. Then we're using the bonsai package to provide the party head engine for a decision tree. We combine it all in a workflow and fit it.

This is not a great model but it's very useful to showcase everything I need for this talk.

And we can use it to predict like we normally would. It follows the tidymodels prediction guarantee, it's always a table and it always has the right number of rows, right name and everything.

Introducing tidypredict

To sidestep a little bit I want to talk about one of my favorite packages that I didn't write myself. It's called tidypredict and what it does it is that it allows us to take a fitted model and allows us to run predictions inside a database. And the way it does that is that it parses the model and strats all the sufficient information and then reconstructs our formula that can then be evaluated in SQL or something else because of the great work done stuff like DT plier, DB plier, DBI and all of that that allows us to translate our code into code that needs to be run other places.

And tidypredict works with a lot of different types of models. We have linear models, we have forest models, other types of tree models, all different types of models. But it has a limitation and the limitation is that we're only allowed one equation. And this means that it ends up not being that useful for a lot of people because if you have a recipe it doesn't work because that's why it crashed for me because if you try to recursively reconstruct a recipe and a model up here together you did a for even a modest data set you get 50,000 terabytes of SQL strings and it doesn't like that.

For the same reason we end up with a lot of redundant information that needs to be calculated over and over again. If you're working with a tree model calculating when to split has to happen multiple times and it's inefficient. And lastly same reason we have one equation we can't get easy classification probabilities because we can only get one vector out at the end. So we can't get a vector for each class. So we can get it technically for the two class but then we're limited.

The orbital package

And this is where my contribution to this whole problem is where I added the orbital package.

So what so stepping back one more time is we think about how any given model works. And a lot of times especially when we're like just practitioners and we think we don't really we need to have predictions done we think about these models as more of a black box of data coming in predictions coming out. But sometimes we can peel back for some of these types of models especially a tree-based model that I used. And we can see that it follows this tree-based structure where at first we see is the build length less or greater like less or equal to some value. If that's true then we find some other thing and then at the end we have a that's the prediction right there. Here is just a normal print that comes out but we could rewrite it as a series of if-else statements that then is equivalent to prediction. And lastly we could also turn into a one-liner from taste using the taste when function from dplyr which is essentially what tidypredict is doing.

But we are not just working with the single model we're also working with a workflow which includes everything that happens before. Because you might have noticed that this tree used let a build length total point does negative because it has been normalized earlier in this recipe so we need to make sure that when we do the predictions it needs the right data at the right time. So we can think of this recipe as it works sequentially it comes in it handles all the nominal predictors using step up known then all the numeric predictors with imputed median and so on and so forth. But we can rethink this as a series of statements much like in a mutate call. So at the beginning we have some statements that looks to see if the values are na and then puts in unknown. Likewise when we do the impute median it's the same idea we find out if the values are na or not and pulls them in and we move on. And then step dummy you know the drill by now we can think of each recipe step as doing some calculations one after another in a sequential way ending up with our predictions at the end.

But if we notice in our statement at the end not all the variables from earlier were used. The ones we have left is set mail, build length and species chinstrap. So now let's go in reverse. So now we go back up and look at when we when we when we send out all the variables and we notice that only three out of the eight variables at that point in time were used. So we can just gray out the rest. We continue doing this going up now what we're centering, staling, identifying the ones that needs to happen for later use and so on and so forth. So now we identified all the calculations that need to happen to be able to provide prediction and we just throw away everything that doesn't.

So in essence if we can do this math we are able to provide the equivalent of a prediction.

So in essence if we can do this math we are able to provide the equivalent of a prediction.

And that's in essence what the orbital package will do. So you load up all the orbital and run the orbital function on our fitted workflow and it creates an object that has all the sufficient information to do predictions that are equivalent to the original predict method.

Before I move on I want to make it clear that this object by itself is quite small and it doesn't require any of the previous packages to work. So if you save this file to a disk and load it up in a new session you don't need tidymodels, you don't need parsing, you don't need bonsai, you don't need nothing. You just need orbital. So that's already a quite a bit deep trees in dependencies right here.

Predictions and code generation

So now we have the object we have two main things we can do prediction and code generation and I want to start prediction. And we predict just like we did before. So with the orbital object we can use the predict function on it and we see that the predictions are identical to the predictions that happened through the tidymodels object down to some rounding like the 12 15th decimal place. But I want to take this a step further so then we know that it works with a plain table. We can predict with it and get it back. But where it becomes interesting is if we set up a little bit of infrastructure. Here I spin up a ephemeral c12 lite server and I put the penguins data in this server. I'm able to predict directly in the database. So these predictions are evaluated c12 that happens not inside R in a database somewhere. So if you can give me a connection up yet and some data that matches the data that this model is fitted on we can predict in it. And because we can do this we can also do something like in a spot database directly in using everything that spot has to offer not inside R. Also works with things like arrow. The print is nice but I promise that it returns the same thing and it will also work with niceties like .db. So we have like on test prediction but not inside the R session.

So basically if you can give me some data in a way we can predict on it. But there's another side to the story which is a little bit how all of this works and that is the notion of code generation. So I showed you how we can do predictions in a c12 database. But we can also just not use R anymore. We take our orbital object and we bring back our c12 connection. Here we're using c12 lite connection and we use them together and it exports out the c12 that when run in this database does the same thing as prediction. It's important to note that we need the right c12. It needs to be the right connection for where you're using it because all the different c12 databases have slightly different flavors and different names for things. But this should work on any known database that has support from the dpi packages. This is also the way that .db works because the developers behind .db have enabled these translations.

Let's try another fun thing. So we're going back into R but we want to do predictions but we actually don't even want to use orbital. We just want to spin up a shiny app and we want to have it lean and mean so it like runs fast and everything you need. So to do that first we need a function that does the prediction. So we have a function here called orbital rfun. You give it the orbital object and a file name and what it will do is it will write to a file a function definition that when applied to a data set produces predictions that are equivalent to the predictions we saw before. So with that in mind and so this one needs dplyr because some of the steps needed dplyr to work but it doesn't need orbital it just needs dplyr to work. So with that in mind we can spin up a simple ui that matches the input that we are expecting.

And the server side we're taking all the input putting it in a data frame putting it into this function that we sourced in and now I'm just rendering it to text on a shiny app. So now we have a slightly modified shiny app so it looks nice but in essence not using tidymodels not using orbital at least in that shiny app we have live predictions that does exactly what you expect.

And we can take it one step further. So the same way I showed how we can generate an rfunction you can use the same technique. So this isn't implemented but I spent around 10 minutes with my very rudimentary javascript knowledge and I translated the r function that was outputted earlier into a javascript function and now we have tidymodels prediction using pure javascript in a website. So then we just like we can slide things around and it does the predictions the same way we expect them to in the browser we're not using r we're not using anything else. And this is just a simple javascript example to show what can really be done. You could also you could technically also translate it to a python function so we can do python predictions of like from tidymodels but you can also do java you can do c whatever you have the domain knowledge to do that is the right place for you to deploy this prediction model.

You could technically also translate it to a python function so we can do python predictions of like from tidymodels but you can also do java you can do c whatever you have the domain knowledge to do that is the right place for you to deploy this prediction model.