Resources

3 Reasons to Use Tidymodels with Julia Silge

This is a recording of a virtual seminar on '3 Reasons to Use Tidymodels' by Julia Silge! The event took place July 13, 2023 and was hosted by R-Ladies Philly and R-Ladies DC. Thank you, Julia, for the wonderful talk! Talk description: Modeling and machine learning in R involve a bewildering array of heterogeneous packages, and establishing good statistical practice is challenging in any language. The tidymodels collection of packages offers a consistent, flexible framework for your modeling and machine learning work to address these problems. In this talk, we'll focus on three specific reasons to consider using tidymodels. We will start with model characteristics themselves, move to the wise management of your data budget, and finish with feature engineering.

Jul 28, 2023
1h 23min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

We are very excited to be hosting this event for you and want to thank you for being here. I also want to thank R-Ladies DC for collaborating with us to host this event. We'll be starting with a short introduction and announcements, and then I'll go ahead and introduce our speaker for today, who will then take over and talk to us about tidymodels. Then we can have a brief Q&A session. Go ahead and add your questions to the chat as the talk goes on.

And then for the last few minutes of the event, I'll be going over the raffle winner announcement. If you didn't read it on the Meetup page, we'll be doing a raffle: three copies of the book Tidy Modeling with R by Julia Silge will be up for raffle at the end of this talk. And you can choose if you'd like a hardcover or an e-book.

So again, welcome. We are R-Ladies Philly, a chapter of R-Ladies Global, a worldwide organization that promotes gender diversity in the R community and in data science. We warmly welcome all individuals who share our values and follow our code of conduct, as well as those who have an interest in data science, regardless of skill level. We encourage everyone to join the Slack community, where you have access to R help and more. Please subscribe to our YouTube channel to view recordings of past events, and consider signing up as a volunteer speaker or mentor. All skill levels are welcome.

We want to thank all of our sponsors: the R Consortium, Independence, and O'Reilly Media. Also check out our Twitter page, R-Ladies Philly. We have two event announcements. First, we'll be having an intro to GitHub workshop, which will be an in-person event on July 20th, which is really exciting. I believe we'll also be doing a raffle for the same book at that event. You can find more details about this on our Meetup page linked below. R-Ladies DC will be hosting an online book club on September 26th.

Bioconductor will be having a conference in the Boston area in August, and Posit is hosting a conference in Chicago in September. So if anyone is going to any of these conferences, or any other conference, let us know in the chat.

And again, I'd like to announce that we'll be doing a raffle at the end of this event for this book here.

So now I'd like to briefly introduce our speaker today. Julia Silge is a data scientist and software engineer at Posit (formerly RStudio), where she works on open source modeling and MLOps tools. She is an author, an international keynote speaker, and a real-world practitioner focusing on data analysis and machine learning. Julia loves text analysis, making beautiful charts, and communicating about technical topics with diverse audiences. So thank you so much, Julia, for being here again.

Introduction and background

Thank you so much for that introduction. I'm really happy to be here and to be talking about tidymodels and why I have been really glad to work on this project. I'll cover three things that you might keep in mind, so that if you run into them in your work, or start to get concerned about them in your particular domain, you might then think, ah, I should think about using tidymodels to solve this kind of problem.

So I'm going to start by giving just a little bit of background on where I came from and how I came to work on many of these kinds of tools. My academic background is physics and astronomy. I have a Ph.D. in astrophysics, and I was in research, eventually left academia, and then went on a little bit of a winding path of, what am I going to do now, now that I've exited academia?

And around the time when this was happening in my own life and career, we really were seeing this growth in this, at the time, newish field: data science. And people who were coming into these kinds of roles were coming in from backgrounds that were weird in the way that my background was weird.

So I come to the practice of data work, the practice of machine learning, not so much from a statistical background. I'm not one of these people that has really solid academic or theoretical statistical knowledge. And I'm also not someone who comes to my work as a data science practitioner from a software engineering background. Instead, I come as someone who says, let me come and pick up these skills as I'm preparing for jobs, or on the job somewhere.

Can I learn about this? And can I think about it as someone who is really concerned with process, with systems, and how things work together? I love thinking about people's really practical problems and how they solve them, really applied kinds of work. So I worked as a data scientist in a couple of different organizations. And then, starting about three and a half years ago, I was hired at then RStudio, now Posit. And I was hired to work on open source software.

So I had contributed to open source software kind of on the side, right, like many of us do while we have regular jobs. And so going to work at RStudio, now Posit, was a shift from being someone who was largely a practitioner and built some tooling on the side, to thinking about tooling as a focus, while keeping my hand in on data science practice as I think about how people engage in these tasks. I was hired at RStudio to work on tidymodels.

So tidymodels started maybe a couple of years before I became engaged in it. But by the time I joined, things were really coming together. Things were pretty much ready for people to engage and use tidymodels in their real work. So we started saying things like, well, if you're using something else, like caret, in your project, there's no reason to rewrite it from scratch. But if you're starting a new project, we recommend you check out tidymodels and see if it will be good for you.

What is machine learning?

So we love thinking about, what is machine learning? Like, okay, when I hear this, what is going on? And there is, of course, a very wonderful and delightful XKCD that gives us an insight into what it is that we mean when we say machine learning. We're using data, big piles of data, to give us some kind of answers, to give us some kind of prediction.

So instead of maybe coding decisions with if-else statements or something, we use data and use statistical methods to learn patterns from the data to be able to give us some kind of an answer. It's interesting to be talking about this right now, because over the past, you know, year, six months, two months, we see talk about what AI is, and we see this word being used so much more broadly.

So traditionally, AI would be the biggest outside category, and it includes both making decisions you learn from data and setting up rules-based AI. So picture, say, a program that plays chess. If we say that's artificial intelligence, for some definition of artificial intelligence, one approach is: I'm going to code a huge set of rules. If this happens, do this. If this happens, do this. And code the rules directly. Another approach is to take a huge data set of chess games that exist, put it all into some kind of statistical learning algorithm, and then let a model learn how to play chess from the data that we have.

So we have the big category, traditionally, of AI, which traditionally includes rules-based approaches and machine learning, meaning we learn from data. And then inside of that, there are different kinds of machine learning. So the kind that you're hearing about in the news, the kind driving these changes in large language models, they're all the way inside of that deep learning circle. The work that I'm going to talk about is in a little bit different place. It is in that place where it says dozens of different ML methods. Depending on the kind of data you have and the kind of problems you're trying to look at, that's actually where a huge amount of business value comes from: that place where it says dozens of different machine learning methods.

Of course, we're seeing lots of people be really excited about generating value from things over there in the deep learning category, but you do have to have specific kinds of data, and very large volumes of data, to be able to make use of that. So we're specifically going to focus down into that place where it says dozens of different ML methods. Another way to think about this is what you might call classical machine learning, in contrast to deep learning methods, which are, I guess we can say, newer.

We generally categorize things into two places. Unsupervised learning is when your data doesn't have a label. You might do things like dimensionality reduction or clustering there. Look at this example: say you've got clothes, and maybe I want to cluster the data. I have different kinds of clothes, and none of the clothes have labels. I don't know which kinds of clothes are which, but I want to use the characteristics of the clothes to clump them into clusters. So, say, I'm going to put the pants together with the shorts and the leggings, and I'm going to put the dresses with the jumpsuits. I learn how things go together. But in that case, I don't have any labels; I don't have something I'm trying to predict.

On the other side, supervised machine learning: the idea there is that the data has a label. That label typically is either a category or a number. So let's say we've got socks, and we want to divide them up by color; or, as we see here on that little visual, we're going to divide the apples from the pears. That's classification. The other kind of supervised problem, the most common kind, is regression, where we're predicting a number. So we can predict a category or predict a number.

The kind of work that I am going to be talking about here, and this is all kind of like, what are we even talking about, is over on the left-hand side. So if you can formulate a scientific or business question as a supervised machine learning problem, then there is a huge swath of tools out there to help you solve it. There are lots of different things that fall under the category of AI, lots of different things that fall under the category of machine learning. Supervised machine learning, either classification or regression, are areas that have a huge infrastructure of methods and tools, so you can use that to solve some kind of a problem. Specifically, we are going to talk about using tidymodels to approach this kind of problem.

The tidymodels framework

So the tidymodels framework, you can think of it as analogous to the Tidyverse. It's a collection of R packages. If you think about the Tidyverse being used for data manipulation, data visualization, reshaping of data, munging of data, dealing with CSV files, dealing with dates, the tidymodels framework is an analogy where all these different packages are focused on a different part of the machine learning process, with a specific focus on supervised machine learning. You can install the tidymodels metapackage, and then, just like when you have done library(tidyverse) and gotten ggplot2, dplyr, and tidyr, if you do library(tidymodels), you get a bunch of packages attached so that you can use the functions from them.
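As a minimal sketch of what that looks like in a session (assuming the metapackage is already installed):

```r
# Install once, then attach the whole collection in one call
# install.packages("tidymodels")
library(tidymodels)

# This attaches the core packages discussed in this talk,
# including parsnip, rsample, tune, and yardstick,
# alongside Tidyverse packages like ggplot2 and dplyr
```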

So I would first say, if it feels overwhelming to get started, knowing there's a bunch of packages and not being sure which functions come from which individual package, try to step back from that feeling of discomfort and just kind of go with it. Don't stress out too much if you don't know which functions come from which packages. That's definitely something that comes as you get more experience using something.

So maybe when you started using the Tidyverse, maybe you felt, oh, gosh, where is the function coalesce from? Or, I think I saw people talking about pivot_wider and pivot_longer. Is that in dplyr? Is that in tidyr? Try not to be super stressed about that, but know it's a metapackage with individual packages in it, and those individual packages are focused.

So it actually is better for us, the maintainers of tidymodels, to have things modular like this. Also, it turns out, it's better for you to have things more modular. It's better for maintenance because each of these packages is separate: if you need to change something about, say, tuning, which, as you can probably guess, lives in the tune package (hyperparameter tuning for models), you only have to change it there. You can have smaller changes, and it's easier to do releases when things are more modular. You've probably experienced in your own work that when you make pieces of code more modular, things are better, versus one giant thing that can get hard to deal with.

So it's better for maintenance. It's also better for you because it turns out if you want to do – you know, if you want to do, like, deploy a model, you don't need the example data that, you know, we use in all our packages. You don't need – you know, you don't need the tuning infrastructure, probably, if it's time for you to deploy a model. So it helps you be a – to adopt better practices in your machine learning when you have these modular pieces.

So let's look at a couple of these names. yardstick is a package for measuring how models are doing; it's a package for model metrics. You probably don't have to worry too much about knowing these individual things, but yardstick is the package that has all the metrics that we use to see how something is going. tune, like I said, is a package for hyperparameter tuning.
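As a hedged sketch of what yardstick looks like in practice (the tiny `results` data frame here is made up purely for illustration):

```r
library(yardstick)
library(tibble)

# Toy predictions: the true values and a model's estimates
results <- tibble(
  truth    = c(3.2, 4.1, 5.0, 2.8),
  estimate = c(3.0, 4.5, 4.8, 3.1)
)

# Root mean squared error, one of many metrics yardstick provides
rmse(results, truth = truth, estimate = estimate)
```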

Reason 1: What makes a model

All right, so let's start with the first one: what makes a model. In R, modeling is really heterogeneous. It's really, really mixed up; there are dramatic differences around model interfaces, how you go about fitting a model, and what the execution strategy is. This is a lot less true in Python. In Python, there is scikit-learn, and you have access to many kinds of models within scikit-learn. R has more statisticians working with it, and so a real strength of R is that there's a huge diversity of different kinds of models. They can live quite separately from each other.

On the downside, it means that if you were to pick up some package for modeling, the way that you go about fitting and maybe executing a model can be extremely different from the next modeling package that you pick up. So in R, the norm is that things are quite broken up. There are lots of packages, and, say, if you wanted to change from random forest to XGBoost, you go to a whole different package, because the implementations for those two things don't live in the same piece of software.

There have been several attempts to make a more unified kind of thing. If you have ever heard of caret, the caret package, C-A-R-E-T, that was by Max, my co-author on the book. And it was kind of a first go at making a more scikit-learn-like interface to models. There's also mlr, and currently mlr3. They're on their second iteration, or third, I don't know, of how to go about making a unified interface to many kinds of models.

So part of the motivation for tidymodels, the why bother doing tidymodels, was to approach this very question and give a new answer, kind of a modern answer, a Tidyverse-inspired answer, to: what are we going to do when I have to rewrite all my code just to change from one kind of model to another? The package that approaches this is called parsnip.

So it is a little bit of a joke, right, compared to caret. Although caret is not spelled like the vegetable, that's where this kind of pun comes through: it's another vegetable. So the parsnip package is the package for setting up model specifications. And in tidymodels, there are three components to completely specifying what a model is like. The first one we might use the word model for; what this means, and we'll get into it a little bit, is what kind of statistical algorithm it is. The second is to specify an engine, and the third is to set the mode of the model that you need to have.

So let's walk through these. First, model here means: what kind of mapping are we going to make to get from our inputs to our outputs, our predictors to our prediction? How do we get from our predictors to our prediction? How do we get from our inputs to our outputs? One kind of model is linear regression, regular old OLS. So here I can set up a model specification by using the function linear_reg().
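As a sketch, that model specification is a single function call:

```r
library(parsnip)

# A model specification: linear regression, with default settings
linear_reg()
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
```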

So there are lots of different model types. You might say model type, or model algorithm. Random forest is another one that I already mentioned; it's a different way of getting from the inputs to the outputs. Next we need to set an engine. Back here you can see it says computational engine: lm, because it turns out there are some defaults. But you don't have to just use the defaults. In fact, you can set your own engine, the engine that is right for you. So where before it said computational engine: lm, here I'm saying computational engine: glmnet.

So this is an R package. When I say lm, it's going to do ordinary least squares; it's going to use the built-in lm function that comes with R in the stats package. If I say set_engine("glmnet"), then it's going to do regularized regression using the R package glmnet. So for engine, think about it this way: it's not about the algorithm that's being used. Instead, it is about what computational engine I am going to use to implement that model type.

So the first two that I told you about are both R. But not all engines are R. We can also use Stan here, and what we'll make is a Bayesian linear model implemented in the Bayesian engine Stan. So the engines can be things like R packages. They can be things where what's actually doing the computation is not R at all; rather, I believe Stan is in C++. We can do things like Spark, where it's doing off-host execution. So we separate out you deciding what kind of model type or model algorithm you're going to use from how that engine is implemented. And, of course, the idea here is that we're going to give a unified interface to all of them, so that you don't have to change everything about your code if you want to switch from using, say, lm on your laptop to Spark in a Spark cluster.
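A sketch of the same model type pointed at different engines (to actually fit, the glmnet and stan engines need the glmnet and rstanarm packages installed):

```r
library(parsnip)

# Same model type, three different computational engines
linear_reg() |> set_engine("lm")      # base R ordinary least squares
linear_reg() |> set_engine("glmnet")  # regularized regression
linear_reg() |> set_engine("stan")    # Bayesian linear model via Stan
```

Only the set_engine() call changes; the rest of your code can stay the same.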

So next we need to set a mode. Notice here that it says regression in all of these, because it turns out linear regression always predicts a number; it's always predicting a continuous quantity. But some model types, some model algorithms, don't do only one kind of output. An example of that is a decision tree. If I call just the function decision_tree() here, you can see it has a default computational engine: rpart, an R package that has an implementation of decision trees. But notice it says unknown mode, because a decision tree can be used for regression or classification.

When I first learned about this, I was confused, because the idea of a decision tree being used to predict a number was confusing to me. But it depends on how complicated your tree is, how smooth a set of outputs you can get. So decision trees can do either regression or classification, and the way that we choose is we set the mode. We can set the mode as regression if we have a numeric, continuous output, or we can set the mode as classification if we have a categorical output, where we're going to predict yes versus no, or red versus blue, some category rather than a continuous value.
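A minimal sketch of setting the mode:

```r
library(parsnip)

# A decision tree can do either task, so its mode starts out unknown
decision_tree()

# Pin it down with set_mode()
decision_tree() |> set_mode("regression")      # predict a number
decision_tree() |> set_mode("classification")  # predict a category
```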

So if you go to tidymodels.org/find/parsnip, there's a list of all available models. And what you can do is search and sort and see what is there and what is appropriate to your use case.

So these are the three components, or aspects, of a model that make up a model specification in tidymodels. The reason why these three are the ones that are required is that they are what we need to fully specify what kind of model you need. And a reason why you might decide to use tidymodels is that you don't necessarily know ahead of time what kind of model algorithm is going to be the best fit for your particular use case. In fact, you want to try a bunch. You want to try more than one, maybe a lot. And if you use tidymodels, that allows you to pretty fluently change from one kind of model algorithm to another.

By contrast, if you are someone who says, I only ever use lm, lm solves all my problems and is good enough, then you're probably in a situation where tidymodels is not going to help you that much, where using tidymodels is maybe not necessary, maybe not that important.

Another piece of what makes a model specification is the interface to these different models. What we have here are three different implementations of boosted trees: the XGBoost implementation, the C5.0 implementation, and the Spark implementation. They all implement the same, or close to the same, algorithm, where you make a decision tree, you train it on some of the data, you use that result to make the next one, and you do it again. You boost the tree: you take a bunch of weak learners, and you put them together to make a strong learner.

The thing is, they all have different names for arguments that are actually the same thing. Look at the number of trees. When you're making a boosted tree model like XGBoost does, or like the Spark boosted tree implementation does, there is a number of trees: how many trees are you going to have? In XGBoost, that number is called nrounds, meaning how many rounds of boosting you're going to do. In C5.0, it's called trials. In Spark, it's called max_iter, as in, what's the maximum number of iterations we'll go through to boost things. I find it kind of overwhelming, if I wanted to switch from XGBoost to Spark, to try and figure out how these map to each other, so that you can do an apples-to-apples comparison.

So another thing that parsnip offers you is a unified interface. We will admit it's another set of names, right, for all of these things. But it is one set of names to interface to all of them. So you only have to know trees, instead of knowing all the individual different names. So this is another reason: if you think about what kind of problem you are working on, do I need to try different implementations to be able to know what the best option might be? If you are in that situation, then tidymodels can be a great fit for you.
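A sketch of that unified interface: one parsnip argument name, trees, is translated to each engine's own name behind the scenes:

```r
library(parsnip)

# parsnip's `trees` maps to nrounds (xgboost), trials (C5.0),
# and max_iter (spark) under the hood
spec <- boost_tree(trees = 500) |> set_mode("classification")

spec |> set_engine("xgboost")
spec |> set_engine("C5.0")
spec |> set_engine("spark")
```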

Reason 2: Spending your data budget wisely

All right, let's move on to the second reason to use tidymodels, the second sort of thing you might be faced with where you think, okay, I need to do this; how am I going to go about doing it? It's this idea of spending your data budget. When it's time to approach some modeling problem, some machine learning problem, you have a certain amount of data. And in most of our situations, it's not infinite, right? You have a certain amount, and you want to do a good job of using it to get the best results that you can from the limited data that you have.

So the package in tidymodels that handles these data budget kinds of tasks is called rsample, like a shortening of resample. It's tools for data splitting and data resampling. So let's start with data splitting. I bet many of you are familiar with this idea of splitting your data into training and testing. You have some original quantity of data, and you need to split it, randomly, typically. Some of your data you're going to keep to estimate the model parameters; you're going to use it for learning. Another piece of the data you are going to use to estimate how well the model is performing.

So it's not used for learning, but instead for prediction only, for estimating performance only. And we want those totally separate. We do not want to use that testing data for anything else. So tidymodels, of course, provides fluent tools for this kind of splitting. Here I've got a quick little example data set of housing prices in Sacramento, California. And if I take that original data set and use initial_split(), what I have is a split object that keeps track of which of my observations went into training and which went into testing.

Once I have that split object, I can operate on it to get out what I need. If I call training() on the split object, I get out those 699 houses that are in my training set. If I call testing(), I get the other ones that are in my testing set. So the idea here is you use the training data during the process of model development, and only at the very end do you use the testing data. In fact, we would say the testing data is extremely precious. We can't go about just spending it willy-nilly. The purpose of the testing set is to estimate your performance on new data, to do a final check on: is my model performing the way that I thought it would?
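A sketch of that split, using the Sacramento data from the modeldata package (the seed value here is arbitrary; the default puts three-quarters of the 932 houses, 699 of them, into training):

```r
library(rsample)
data(Sacramento, package = "modeldata")

set.seed(123)
split <- initial_split(Sacramento)  # default prop = 3/4

train <- training(split)  # 699 houses, used for model development
test  <- testing(split)   # 233 houses, held out until the very end

nrow(train)
#> [1] 699
```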

So that all sounds well and good: great, training, testing. But I said a little bit ago that we might want to try a whole bunch of different models. We might want to try XGBoost, plus random forest, plus glmnet. I might want to try a lot of different things and see how they compare to each other. So what do I do in this situation?

If I use the testing set to compare the performance of four or five models, I am up a creek. That is a bad decision, and I am in no good situation. I can't compare four or five different models that way. XGBoost needs to be tuned. There's a bunch of hyperparameters, and there's no way to know what those hyperparameters should be from training the model one time. Instead, I have to train it a bunch of times on different model configurations and see which one is best. How do I decide? What data do I use to see which one is best? If I use the training set, I get an overly optimistic view. And, in fact, depending on exactly which observations are in the training set, I can get a wrong answer.

So the answer to this is resampling. We have all our data to start with. We divide it into training and testing, and that testing data is set aside. Think of it as precious. We do not touch it until the very end of our model development process. Instead, we resample our training set to create little simulated versions of that training set. For each resample, I divide it into an analysis set, which is an analog of the training data, and an assessment set, which is an analog of the testing data. And I can use all those resamples for hyperparameter tuning, or to do something like comparing models. Say I want to compare random forest with glmnet with a decision tree. I want to compare them all and see which one is best. We do that by using our resamples, using all of them, and then aggregating the results.

There's a huge number of ways to make resamples, to divide up your data. One of the most common, a great default, is cross-validation. The way cross-validation works is that I have my pile of data, my pile of observations. In this example, it is the training set there, that yellow-orange pentagon-y thing. So these 30 things are the training set. And I can randomly split them up into a number of folds; here we're doing three for this example. I randomly assign, as I go along, which observation goes into which fold, and then I divide them out so that in fold one, the first time through, I hold out one-third of the data, and the other two-thirds go into the analysis set.

So notice I'm going to fit the model using two-thirds of the data in this case, and then I will estimate performance using held-out data that was not used in the fitting. Where it says estimate performance as we go along there, that's the equivalent of the assessment set. It's like a little test set. It is not used to train; it is instead used to measure performance. So in the second fold, we hold out a different third and train on the other two-thirds. In the third fold, we hold out the last third and do the same. This is how cross-validation works. In tidymodels, the function for it is called vfold_cv(), and the default is actually to make 10 splits, because it turns out that works great for many situations.

If you have the amount of data where you should be doing cross-validation, 10 is a good default. You'll notice this data might be a little too small for cross-validation, because I don't have that many observations in my little assessment sets there. So for this data, I might want to think about a different approach. But let's talk about what it is. We have the same amount of training data, and then we make 10 different resamples from it. We divide it into 10 pieces, and then we hold out the first tenth and use the other nine-tenths, hold out the second and use the other nine-tenths, hold out the third, and so forth. That's how we make these cross-validation resamples. What this lets us do is use the training set for tuning, for comparing models, for making decisions, while saving that very precious test set for the end.
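A self-contained sketch of making those resamples on the Sacramento training set (seed values are arbitrary):

```r
library(rsample)
data(Sacramento, package = "modeldata")

set.seed(123)
train <- training(initial_split(Sacramento))

# 10-fold cross-validation: each fold holds out one-tenth for
# assessment and uses the other nine-tenths for analysis
folds <- vfold_cv(train, v = 10)
folds
```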

For a smallish data set like this, it might be a better idea to use a different approach to resampling, like bootstrapping. It is a different way of making these kinds of resamples, and again the figure shows an example with three. The idea is that you take your original training set, which had 30 examples in it, and you draw with replacement until you get to 30 again. Notice that in the first bootstrap resample, the first observation is in there twice. You draw with replacement, so each observation goes back into the bucket before you take another one. In the second resample, notice observation three is in there three times; that can definitely happen. And then whatever doesn't get picked to go into the analysis set goes into the assessment set.

So here we are always going to train on 30, and then we will estimate performance on whatever is left over, whatever the leftovers are there. For bootstrapping, here's the function for that, bootstraps(), and the default is 25 resamples, a bigger number. You can see that for my smaller data set, this seems like a better idea. Cross-validation and bootstrap resampling have different bias-variance tradeoffs, which is what it comes down to, so each can be a good choice in different situations.
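The bootstrapping described above looks like this with rsample; again, the data set is just for illustration.

```r
# Bootstrap resamples with rsample
library(rsample)

set.seed(123)
boots <- bootstraps(mtcars, times = 25)  # times = 25 is the default

# Each analysis set is drawn with replacement and has the same number of
# rows as the original data; rows never drawn form the assessment set.
one_boot <- boots$splits[[1]]
nrow(analysis(one_boot))    # always 32, the size of mtcars
nrow(assessment(one_boot))  # varies: whatever was left out of the draw
```

Note that the assessment set size varies from resample to resample, exactly as described: it is whatever happened not to be drawn.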

Yeah, we have a few questions in the chat. I'm going to save some of them for the end, hoping we have some time, but there's one relevant to what you're presenting right now: how does the set.seed() function ensure reproducibility? Great question. These all involve randomness, right? We're pulling from these distributions randomly. And I have a set.seed() here, which means that you will get the same result every time you run this, if you run these together with set.seed().

Here, oh, I have set.seed() there too. But if you were not to run with set.seed(), if you were just to call bootstraps() again and again, you would get different versions every time. Now, we would hope we're using robust enough statistical methods that we're not seeing huge differences. If we pick a totally different model when we set a different seed, that means something's wrong, right? Something's not going great. However, it is always a good idea to set your seed when you have randomness, just so that you get the same answer the next time, down to the same exact numeric answer, instead of being like, oh my gosh, my metrics are a little bit different this time than what I got last time. So, great question about reproducibility and setting a seed. That number there, 123, is not a special number. I could put the year I was born, this year, any kind of number. What it does is it seeds the random number generator.

So computers have a hard time making random numbers, but we have algorithms that give us numbers as close as possible to truly random. And it's a stream: R has a random number generator stream, so if you've ever seen "RNG stream," that's what it means, the stream of random numbers. If you don't set the seed, the next time you go into the stream and say, give me a random number, you'll get the next one. Whereas if you do set the seed, you always start at the same place in the stream. It's not that it would necessarily be bad if you got a totally different result at the end, like choosing a different model, but setting the seed is still helpful in the process of model development so that you get the same answers every time.
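The seed behavior described here is easy to see with base R alone:

```r
# set.seed() pins R's random number generator stream, so code that
# involves randomness gives the same result on every run.
set.seed(123)
a <- sample(1:100, 5)

set.seed(123)
b <- sample(1:100, 5)

identical(a, b)  # TRUE: same seed, same place in the stream, same draws

c <- sample(1:100, 5)  # no seed reset: picks up further along the stream
identical(a, c)        # almost certainly FALSE
```

This is exactly why set.seed() before vfold_cv() or bootstraps() gives you the same resamples, down to the exact rows, every time you rerun the code.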

All right. So there are lots of different resampling methods, and what they allow you to do is spend your data wisely. You have a certain amount of data, and you need to make the best decisions possible about how to spend it. You spend some data on model estimation, on learning the parameters of that model, whether it's XGBoost or linear regression. And then you spend other data on answering: how is that model performing? I have to know, and I can't use the same data that I used for training, because the model can memorize that data.

So, really good options include cross-validation, that's vfold_cv(), and bootstraps are a really great option too. You can do Monte Carlo cross-validation; it's okay. Leave-one-out cross-validation is maybe not the best option, but it is there as an option. If you have a ton of data, just a ton, then a validation split can be a good option. Picture this same image, but instead of having resample one, resample two, up through resample B, you just do one split, and then you can use that validation set to tune and to choose models. So here, picture it as just one row instead of ten rows or 25 rows. And that can work if you have so much data that it is a waste of time to do things ten times, because you get a good estimate of how well a model is performing from one assessment set.
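For reference, the other resampling options mentioned map onto rsample functions like this. The function names are current rsample API; the proportions shown are illustrative defaults worth double-checking against the documentation.

```r
# Other resampling strategies in rsample
library(rsample)

set.seed(123)
mc_cv(mtcars, prop = 3/4, times = 25)  # Monte Carlo cross-validation
loo_cv(mtcars)                         # leave-one-out: one row held out per split

# With lots of data, a single validation split can stand in for resampling,
# giving train / validation / test pieces in one step:
initial_validation_split(mtcars, prop = c(0.6, 0.2))
```

Each of these produces split objects with the same analysis()/assessment() interface, so downstream tuning code doesn't care which strategy you picked.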

Reason 3: Building better features

Okay, so we talked about building models. You may choose to use tidymodels because you need to try a bunch of different models. You may choose to use tidymodels because your data budget is limited and you want to use good statistical practices to spend your data in a wise way. When it comes time to fit, to tune, to do feature engineering, which we're about to talk about, tidymodels provides you with tools and guardrails, keeping you from doing the wrong thing in many situations so that you can spend your data budget wisely.

The last of these three topics that I want to talk about, the third reason you might decide to use tidymodels, is to build better features. The package in tidymodels that handles feature engineering is called recipes. It uses a certain analogy, or metaphor, for what feature engineering is like. We use the phrase "preprocessing" kind of synonymously with feature engineering: you have data, and you need it to go into a model, but it's not quite ready to go into the model yet. You need to do some stuff to it first before it can go in.
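A minimal sketch of what a recipe looks like; the formula and the particular steps here are illustrative choices, not from the talk.

```r
# A small feature-engineering recipe
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>  # center and scale predictors
  step_dummy(all_nominal_predictors())         # indicator columns for factors

# prep() estimates each step from the training data (e.g. means and SDs);
# bake() applies the trained steps to data
prepped <- prep(rec, training = mtcars)
bake(prepped, new_data = NULL)  # the preprocessed training set
```

The prep()/bake() separation is part of the guardrails mentioned earlier: preprocessing parameters are learned from the training data only, then applied to new data, which helps avoid data leakage.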

This is not so much data cleaning. So data pre-processing or feature engineering is different from data cleaning or more basic data preparation. Like if it's something that you could use dplyr to do, that's not really,