Resources

3 Reasons to Use Tidymodels with Julia Silge

This is a recording of a virtual seminar on '3 Reasons to Use Tidymodels' by Julia Silge! The event took place July 13, 2023 and was hosted by R-Ladies Philly and R-Ladies DC. Thank you, Julia, for the wonderful talk! Talk description: Modeling and machine learning in R involve a bewildering array of heterogeneous packages, and establishing good statistical practice is challenging in any language. The tidymodels collection of packages offers a consistent, flexible framework for your modeling and machine learning work to address these problems. In this talk, we'll focus on three specific reasons to consider using tidymodels. We will start with model characteristics themselves, move to the wise management of your data budget, and finish with feature engineering.

Jul 28, 2023
1h 23min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

We are very excited to be hosting this event for you and want to thank you for being here. I also want to thank R-Ladies DC for collaborating with us to host this event. We'll be starting with a short introduction and announcements, and then I'll go ahead and introduce our speaker for today, who will then take over and talk to us about tidymodels. Then we can have a brief Q&A session. Go ahead and add your questions to the chat as the talk goes on.

And then for the last few minutes of the event, I'll be going over the raffle winner announcement. If you didn't read it on the Meetup page, we'll be doing a raffle: three copies of the book Tidy Modeling with R by Julia Silge will be up for raffle at the end of this talk. And you can choose if you'd like a hardcover or an e-book.

So again, welcome. We are R-Ladies Philly, a chapter of R-Ladies Global, a worldwide organization that promotes gender diversity in the R community and in data science. We warmly welcome all individuals who share our values and follow our code of conduct, as well as those who have an interest in data science, regardless of skill level. We encourage everyone to join the Slack community, where you have access to R help and more. Please subscribe to our YouTube channel to view recordings of past events, and consider signing up as a volunteer speaker or mentor. All skill levels are welcome.

We want to thank all of our sponsors: the R Consortium, Independence, and O'Reilly Media. Also check out our Twitter page, R-Ladies Philly. We have two event announcements. First, we'll be having an intro to GitHub workshop, which will be an in-person event on July 20th, which is really exciting. I believe we'll also be doing a raffle for the same book at that event. You can find more details about this on our Meetup page linked below. R-Ladies DC will be hosting an online book club on September 26th.

Bioconductor will be having a conference in the Boston area in August, and Posit is hosting a conference in Chicago in September. So if anyone is going to any of these conferences, or any other conference, let us know in the chat.

And again, I'd like to announce that we'll be doing a raffle at the end of this event for this book here.

So now I'd like to briefly introduce our speaker today. Julia Silge is a data scientist and software engineer at Posit (formerly RStudio), where she works on open source modeling and MLOps tools. She is an author, an international keynote speaker, and a real-world practitioner focusing on data analysis and machine learning. Julia loves text analysis, making beautiful charts, and communicating about technical topics with diverse audiences. So thank you so much, Julia, for being here again.

Introduction and background

Thank you so much for that introduction. I'm really happy to be here and to be talking about tidymodels and why I have been really glad to work on this project. I'll cover three things that you might keep in mind, so that if you run into them in your work, or start to get concerned about them in your particular domain, you might then think, ah, I should think about using tidymodels to solve this kind of problem.

So I'm going to start by giving just a little bit of background on where I came from and how I came to work on many of these kinds of tools. My academic background is physics and astronomy. I have a Ph.D. in astrophysics, and I was in research, eventually left academia, and then went on a little bit of a winding path of, what am I going to do now, now that I've exited academia?

And around the time when this was happening in my own life and career, we really were seeing this growth in this, at the time, newish field: data science. And people who were coming into these kinds of roles were coming in from backgrounds that were weird in the way that my background was weird.

So I come to the practice of data work, the practice of machine learning, not so much from a statistical background. I'm not one of these people that has really solid academic or theoretical statistical knowledge. And I'm also not someone who comes to my work as a data science practitioner from a software engineering background. Instead, I come as someone who says, let me come and pick up these skills as I'm preparing for jobs, or on the job somewhere.

Can I learn about this? And can I think about it as someone who is really concerned with process, with systems, and how things work together? I love thinking about people's really practical problems and how they solve them, really applied kinds of work. So I worked as a data scientist in a couple of different organizations. And then, starting about three and a half years ago, I was hired at then RStudio, now Posit. And I was hired to work on open source software.

So I had contributed to open source software kind of on the side, right, like many of us do while we have regular jobs. And so going to work at RStudio, now Posit, was a shift from being someone who was largely a practitioner and built some tooling on the side, to thinking about tooling as a focus, while keeping my hand in on data science practice as I think about how people engage in these tasks. I was hired at RStudio to work on tidymodels.

So tidymodels started maybe a couple of years before I became engaged in it. But by the time I joined, things were really coming together. Things were pretty much ready for people to engage and use tidymodels in their real work. So we started saying things like, well, if you're using something else, like caret, in your project, there's no reason to rewrite it from scratch. But if you're starting a new project, we recommend you check out tidymodels and see if it will be good for you.

What is machine learning?

So we love thinking about, what is machine learning? Like, okay, when I hear this, what is going on? And there is, of course, a very wonderful and delightful XKCD that gives us an insight into what it is that we mean when we say machine learning. We're using data, big piles of data, to give us some kind of answers, to give us some kind of prediction.

So instead of maybe coding decisions with if-else statements or something, we use data and use statistical methods to learn patterns from the data to be able to give us some kind of an answer. It's interesting to be talking about this right now, because over the past, you know, year, six months, two months, we see talk about what AI is, and we see this word being used so much more broadly.

So traditionally, AI would be the biggest outside category, and it includes both making decisions you learn from data and setting up rules-based AI. So picture, say, a program that plays chess. If we say that's artificial intelligence, for some definition of artificial intelligence, one approach is: I'm going to code a huge set of rules. If this happens, do this. If this happens, do this. And code the rules directly. Another approach is to take a huge data set of chess games that exist, put it all into some kind of statistical learning algorithm, and then let a model learn how to play chess from the data that we have.

So we have the big category, traditionally, of AI, which traditionally includes rules-based approaches and machine learning, meaning we learn from data. And then inside of that, there are different kinds of machine learning. So the kind that you're hearing about in the news, the kind driving these changes in large language models, they're all the way inside of that deep learning circle. The work that I'm going to talk about is in a little bit different place. It is in that place where it says dozens of different ML methods. Depending on the kind of data you have and the kind of problems you're trying to look at, that's actually where a huge amount of business value comes from: that place where it says dozens of different machine learning methods.

Of course, we're seeing lots of people be really excited about generating value from things over there in the deep learning category, but you do have to have specific kinds of data, and very large volumes of data, to be able to make use of that. So we're specifically going to focus down into that place where it says dozens of different ML methods. Another way to think about this is what you might call classical machine learning, in contrast to deep learning methods, which are, I guess we can say, newer.

We generally categorize things into two places. Unsupervised learning is when your data doesn't have a label. You might do things like dimensionality reduction or clustering there. Look at this example: say you've got clothes, and maybe I want to cluster the data. I have different kinds of clothes, and none of the clothes have labels. I don't know which kinds of clothes are which, but I want to use the characteristics of the clothes to clump them into clusters. So, say, I'm going to put the pants together with the shorts and the leggings, and I'm going to put the dresses with the jumpsuits. I learn how things go together. But in that case, I don't have any labels; I don't have something I'm trying to predict.

On the other side, supervised machine learning: the idea there is that the data has a label. That label typically is either a category or a number. So let's say we've got socks, and we want to divide them up by color; or, as we see here on that little visual, we're going to divide the apples from the pears. That's classification. The other kind of supervised problem, the most common kind, is regression, where we're predicting a number. So we can predict a category or predict a number.

The kind of work that I am going to be talking about here, and this is all kind of like, what are we even talking about, is over on the left-hand side. So if you can formulate a scientific or business question as a supervised machine learning problem, then there is a huge swath of tools out there to help you solve it. There are lots of different things that fall under the category of AI, lots of different things that fall under the category of machine learning. Supervised machine learning, either classification or regression, are areas that have a huge infrastructure of methods and tools, so you can use that to solve some kind of a problem. Specifically, we are going to talk about using tidymodels to approach this kind of problem.

The tidymodels framework

So the tidymodels framework, you can think of it as analogous to the Tidyverse. It's a collection of R packages. If you think about the Tidyverse being used for data manipulation, data visualization, reshaping of data, munging of data, dealing with CSV files, dealing with dates, the tidymodels framework is an analogy where all these different packages are focused on a different part of the machine learning process, with a specific focus on supervised machine learning. You can install the tidymodels metapackage, and then, just like when you have done library(tidyverse) and gotten ggplot2, dplyr, and tidyr, if you do library(tidymodels), you get a bunch of packages attached so that you can use the functions from them.
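As a minimal sketch of what that looks like in a session (assuming the metapackage is already installed):

```r
# Install once, then attach the whole collection in one call
# install.packages("tidymodels")
library(tidymodels)

# This attaches the core packages discussed in this talk,
# including parsnip, rsample, tune, and yardstick,
# alongside Tidyverse packages like ggplot2 and dplyr
```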

So I would first say, if it feels overwhelming to get started, knowing there's a bunch of packages and not being sure which functions come from which individual package, try to step back from that feeling of discomfort and just kind of go with it. Don't stress out too much if you don't know which functions come from which packages. That's definitely something that comes as you get more experience using something.

So maybe when you started using the Tidyverse, maybe you felt, oh, gosh, where is the function coalesce from? Or, I think I saw people talking about pivot_wider and pivot_longer. Is that in dplyr? Is that in tidyr? Try not to be super stressed about that, but know it's a metapackage with individual packages in it, and those individual packages are focused.

So it actually is better for us, the maintainers of tidymodels, to have things modular like this. Also, it turns out, it's better for you to have things more modular. It's better for maintenance because each of these packages is separate: if you need to change something about, say, tuning, which, as you can probably guess, lives in the tune package (hyperparameter tuning for models), you only have to change it there. You can have smaller changes, and it's easier to do releases when things are more modular. You've probably experienced in your own work that when you make pieces of code more modular, things are better, versus one giant thing that can get hard to deal with.

So it's better for maintenance. It's also better for you because it turns out if you want to do – you know, if you want to do, like, deploy a model, you don't need the example data that, you know, we use in all our packages. You don't need – you know, you don't need the tuning infrastructure, probably, if it's time for you to deploy a model. So it helps you be a – to adopt better practices in your machine learning when you have these modular pieces.

So let's look at a couple of these names. yardstick is a package for measuring how models are doing; it's a package for model metrics. You probably don't have to worry too much about knowing these individual things, but yardstick is the package that has all the metrics that we use to see how something is going. tune, like I said, is a package for hyperparameter tuning.
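As a hedged sketch of what yardstick looks like in practice (the tiny `results` data frame here is made up purely for illustration):

```r
library(yardstick)
library(tibble)

# Toy predictions: the true values and a model's estimates
results <- tibble(
  truth    = c(3.2, 4.1, 5.0, 2.8),
  estimate = c(3.0, 4.5, 4.8, 3.1)
)

# Root mean squared error, one of many metrics yardstick provides
rmse(results, truth = truth, estimate = estimate)
```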

Reason 1: What makes a model

All right, so let's start with the first one: what makes a model. In R, modeling is really heterogeneous. It's really, really mixed up; there are dramatic differences around model interfaces, how you go about fitting a model, and what the execution strategy is. This is a lot less true in Python. In Python, there is scikit-learn, and you have access to many kinds of models within scikit-learn. R has more statisticians working with it, and so a real strength of R is that there's a huge diversity of different kinds of models. They can live quite separately from each other.

On the downside, it means that if you were to pick up some package for modeling, the way that you go about fitting and maybe executing a model can be extremely different from the next modeling package that you pick up. So in R, the norm is that things are quite broken up. There are lots of packages, and, say, if you wanted to change from random forest to XGBoost, you go to a whole different package, because the implementations for those two things don't live in the same piece of software.

There have been several attempts to make a more unified kind of thing. If you have ever heard of caret, the caret package, C-A-R-E-T, that was by Max, my co-author on the book. And it was kind of a first go at making a more scikit-learn-like interface to models. There's also mlr, and currently mlr3. They're on their second iteration, or third, I don't know, of how to go about making a unified interface to many kinds of models.

So part of the motivation for tidymodels, the why bother doing tidymodels, was to approach this very question and give a new answer, kind of a modern answer, a Tidyverse-inspired answer, to: what are we going to do when I have to rewrite all my code just to change from one kind of model to another? The package that approaches this is called parsnip.

So it is a little bit of a joke, right, compared to caret. Although caret is not spelled like the vegetable, that's where this kind of pun comes through: it's another vegetable. So the parsnip package is the package for setting up model specifications. And in tidymodels, there are three components to completely specifying what a model is like. The first one we might use the word model for; what this means, and we'll get into it a little bit, is what kind of statistical algorithm it is. The second is to specify an engine, and the third is to set the mode of the model that you need to have.

So let's walk through these. First, model here means: what kind of mapping are we going to make to get from our inputs to our outputs, our predictors to our prediction? How do we get from our predictors to our prediction? How do we get from our inputs to our outputs? One kind of model is linear regression, regular old OLS. So here I can set up a model specification by using the function linear_reg().
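As a sketch, that model specification is a single function call:

```r
library(parsnip)

# A model specification: linear regression, with default settings
linear_reg()
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
```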

So there are lots of different model types. You might say model type, or model algorithm. Random forest is another one that I already mentioned; it's a different way of getting from the inputs to the outputs. Next we need to set an engine. Back here you can see it says computational engine: lm, because it turns out there are some defaults. But you don't have to just use the defaults. In fact, you can set your own engine, the engine that is right for you. So where before it said computational engine: lm, here I'm saying computational engine: glmnet.

So this is an R package. When I say lm, it's going to do ordinary least squares; it's going to use the built-in lm function that comes with R in the stats package. If I say set_engine("glmnet"), then it's going to do regularized regression using the R package glmnet. So for engine, think about it this way: it's not about the algorithm that's being used. Instead, it is about what computational engine I am going to use to implement that model type.

So the first two that I told you about are both R. But not all engines are R. We can also use Stan here, and what we'll make is a Bayesian linear model implemented in the Bayesian engine Stan. So the engines can be things like R packages. They can be things where what's actually doing the computation is not R at all; rather, I believe Stan is in C++. We can do things like Spark, where it's doing off-host execution. So we separate out you deciding what kind of model type or model algorithm you're going to use from how that engine is implemented. And, of course, the idea here is that we're going to give a unified interface to all of them, so that you don't have to change everything about your code if you want to switch from using, say, lm on your laptop to Spark in a Spark cluster.
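A sketch of the same model type pointed at different engines (to actually fit, the glmnet and stan engines need the glmnet and rstanarm packages installed):

```r
library(parsnip)

# Same model type, three different computational engines
linear_reg() |> set_engine("lm")      # base R ordinary least squares
linear_reg() |> set_engine("glmnet")  # regularized regression
linear_reg() |> set_engine("stan")    # Bayesian linear model via Stan
```

Only the set_engine() call changes; the rest of your code can stay the same.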

So next we need to set a mode. Notice here that it says regression in all of these, because it turns out linear regression always predicts a number; it's always predicting a continuous quantity. But some model types, some model algorithms, don't do only one kind of output. An example of that is a decision tree. If I call just the function decision_tree() here, you can see it has a default computational engine: rpart, an R package that has an implementation of decision trees. But notice it says unknown mode, because a decision tree can be used for regression or classification.

When I first learned about this, I was confused, because the idea of a decision tree being used to predict a number was confusing to me. But it depends on how complicated your tree is, how smooth a set of outputs you can get. So decision trees can do either regression or classification, and the way that we choose is we set the mode. We can set the mode as regression if we have a numeric, continuous output, or we can set the mode as classification if we have a categorical output, where we're going to predict yes versus no, or red versus blue, some category rather than a continuous value.
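A minimal sketch of setting the mode:

```r
library(parsnip)

# A decision tree can do either task, so its mode starts out unknown
decision_tree()

# Pin it down with set_mode()
decision_tree() |> set_mode("regression")      # predict a number
decision_tree() |> set_mode("classification")  # predict a category
```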

So if you go to tidymodels.org/find/parsnip, there's a list of all available models. And what you can do is search and sort and see what is there and what is appropriate to your use case.

So these are the three components, or aspects, of a model that make up a model specification in tidymodels. The reason why these three are the ones that are required is that they are what we need to fully specify what kind of model you need. And a reason why you might decide to use tidymodels is that you don't necessarily know ahead of time what kind of model algorithm is going to be the best fit for your particular use case. In fact, you want to try a bunch. You want to try more than one, maybe a lot. And if you use tidymodels, that allows you to pretty fluently change from one kind of model algorithm to another.

By contrast, if you are someone who says, I only ever use lm, lm solves all my problems and is good enough, then you're probably in a situation where tidymodels is not going to help you that much, where using tidymodels is maybe not necessary, maybe not that important.

Another piece of what makes a model specification is the interface to these different models. What we have here are three different implementations of boosted trees: the XGBoost implementation, the C5.0 implementation, and the Spark implementation. They all implement the same, or close to the same, algorithm, where you make a decision tree, you train it on some of the data, you use that result to make the next one, and you do it again. You boost the tree: you take a bunch of weak learners, and you put them together to make a strong learner.

The thing is, they all have different names for arguments that are actually the same thing. Look at the number of trees. When you're making a boosted tree model like XGBoost does, or like the Spark boosted tree implementation does, there is a number of trees: how many trees are you going to have? In XGBoost, that number is called nrounds, meaning how many rounds of boosting you're going to do. In C5.0, it's called trials. In Spark, it's called max_iter, as in, what's the maximum number of iterations we'll go through to boost things. I find it kind of overwhelming, if I wanted to switch from XGBoost to Spark, to try and figure out how these map to each other, so that you can do an apples-to-apples comparison.

So another thing that parsnip offers you is a unified interface. We will admit it's another set of names, right, for all of these things. But it is one set of names to interface to all of them. So you only have to know trees, instead of knowing all the individual different names. So this is another reason: if you think about what kind of problem you are working on, do I need to try different implementations to be able to know what the best option might be? If you are in that situation, then tidymodels can be a great fit for you.
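A sketch of that unified interface: one parsnip argument name, trees, is translated to each engine's own name behind the scenes:

```r
library(parsnip)

# parsnip's `trees` maps to nrounds (xgboost), trials (C5.0),
# and max_iter (spark) under the hood
spec <- boost_tree(trees = 500) |> set_mode("classification")

spec |> set_engine("xgboost")
spec |> set_engine("C5.0")
spec |> set_engine("spark")
```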

Reason 2: Spending your data budget wisely

All right, let's move on to the second reason to use tidymodels, the second sort of thing you might be faced with where you think, okay, I need to do this; how am I going to go about doing it? It's this idea of spending your data budget. When it's time to approach some modeling problem, some machine learning problem, you have a certain amount of data. And in most of our situations, it's not infinite, right? You have a certain amount, and you want to do a good job of using it to get the best results that you can from the limited data that you have.

So the package in tidymodels that handles these data budget kinds of tasks is called rsample, like a shortening of resample. It's tools for data splitting and data resampling. So let's start with data splitting. I bet many of you are familiar with this idea of splitting your data into training and testing. You have some original quantity of data, and you need to split it, randomly, typically. Some of your data you're going to keep to estimate the model parameters; you're going to use it for learning. Another piece of the data you are going to use to estimate how well the model is performing.

So it's not used for learning, but instead for prediction only, for estimating performance only. And we want those totally separate. We do not want to use that testing data for anything else. So tidymodels, of course, provides fluent tools for this kind of splitting. Here I've got a quick little example data set of housing prices in Sacramento, California. And if I take that original data set and use initial_split(), what I have is a split object that keeps track of which of my observations went into training and which went into testing.

Once I have that split object, I can operate on it to get out what I need. If I call training() on the split object, I get out those 699 houses that are in my training set. If I call testing(), I get the other ones that are in my testing set. So the idea here is you use the training data during the process of model development, and only at the very end do you use the testing data. In fact, we would say the testing data is extremely precious. We can't go about just spending it willy-nilly. The purpose of the testing set is to estimate your performance on new data, to do a final check on: is my model performing the way that I thought it would?
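A sketch of that split, using the Sacramento data from the modeldata package (the seed value here is arbitrary; the default puts three-quarters of the 932 houses, 699 of them, into training):

```r
library(rsample)
data(Sacramento, package = "modeldata")

set.seed(123)
split <- initial_split(Sacramento)  # default prop = 3/4

train <- training(split)  # 699 houses, used for model development
test  <- testing(split)   # 233 houses, held out until the very end

nrow(train)
#> [1] 699
```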

So that all sounds well and good: great, training, testing. But I said a little bit ago that we might want to try a whole bunch of different models. We might want to try XGBoost, plus random forest, plus glmnet. I might want to try a lot of different things and see how they compare to each other. So what do I do in this situation?

If I use the testing set to compare the performance of four or five models, I am up a creek. That is a bad decision, and I am in no good situation. I can't compare four or five different models that way. XGBoost needs to be tuned. There's a bunch of hyperparameters, and there's no way to know what those hyperparameters should be from training the model one time. Instead, I have to train it a bunch of times on different model configurations and see which one is best. How do I decide? What data do I use to see which one is best? If I use the training set, I get an overly optimistic view. And, in fact, depending on exactly which observations are in the training set, I can get a wrong answer.

So the answer to this is resampling. We have all our data to start with. We divide it into training and testing, and that testing data is set aside. Think of it as precious. We do not touch it until the very end of our model development process. Instead, we resample our training set to create little simulated versions of that training set. For each resample, I divide it into an analysis set, which is an analog of the training data, and an assessment set, which is an analog of the testing data. And I can use all those resamples for hyperparameter tuning, or to do something like comparing models. Say I want to compare random forest with glmnet with a decision tree. I want to compare them all and see which one is best. We do that by using our resamples, using all of them, and then aggregating the results.

There's a huge number of ways to make resamples, to divide up your data. One of the most common, a great default, is cross-validation. The way cross-validation works is that I have my pile of data, my pile of observations. In this example, it is the training set there, that yellow-orange pentagon-y thing. So these 30 things are the training set. And I can randomly split them up into a number of folds; here we're doing three for this example. I randomly assign, as I go along, which observation goes into which fold, and then I divide them out so that in fold one, the first time through, I hold out one-third of the data, and the other two-thirds go into the analysis set.

So notice I'm going to fit the model using two-thirds of the data in this case, and then I will estimate performance using held-out data that was not used in the fitting. Where it says estimate performance as we go along there, that's the equivalent of the assessment set. It's like a little test set. It is not used to train; it is instead used to measure performance. So in the second fold, we hold out a different third and train on the other two-thirds. In the third fold, we hold out the last third and do the same. This is how cross-validation works. In tidymodels, the function for it is called vfold_cv(), and the default is actually to make 10 splits, because it turns out that works great for many situations.

If you have the amount of data where you should be doing cross-validation, 10 is a good default. You'll notice this data might be a little too small for cross-validation, because I don't have that many observations in my little assessment sets there. So for this data, I might want to think about a different approach. But let's talk about what it is. We have the same amount of training data, and then we make 10 different resamples from it. We divide it into 10 pieces, and then we hold out the first tenth and use the other nine-tenths, hold out the second and use the other nine-tenths, hold out the third, and so forth. That's how we make these cross-validation resamples. What this lets us do is use the training set for tuning, for comparing models, for making decisions, while saving that very precious test set for the end.
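A self-contained sketch of making those resamples on the Sacramento training set (seed values are arbitrary):

```r
library(rsample)
data(Sacramento, package = "modeldata")

set.seed(123)
train <- training(initial_split(Sacramento))

# 10-fold cross-validation: each fold holds out one-tenth for
# assessment and uses the other nine-tenths for analysis
folds <- vfold_cv(train, v = 10)
folds
```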

For a smallish data set like this, it might be a better idea to use a different approach to resampling, like bootstrapping. It is a different way of making these kinds of resamples, and again the figure shows an example with three. The idea is that you take your original training set, which had 30 examples in it, and you draw with replacement until you get to 30 again. Notice that in the first bootstrap resample, the first observation is in there twice. You draw with replacement, so each observation goes back into the bucket before you take another one. In the second resample, notice observation three is in there three times; that can definitely happen. And then whatever doesn't get picked to go into the analysis set goes into the assessment set.

So here we are always going to train on 30, and then we will estimate performance on whatever is left over, whatever the leftovers are there. For bootstrapping, here's the function for that, bootstraps(), and the default is 25 resamples, a bigger number. You can see that for my smaller data set, this seems like a better idea. Cross-validation and bootstrap resampling have different bias-variance tradeoffs, which is what it comes down to, so each can be a good choice in different situations.
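The bootstrapping described above looks like this with rsample; again, the data set is just for illustration.

```r
# Bootstrap resamples with rsample
library(rsample)

set.seed(123)
boots <- bootstraps(mtcars, times = 25)  # times = 25 is the default

# Each analysis set is drawn with replacement and has the same number of
# rows as the original data; rows never drawn form the assessment set.
one_boot <- boots$splits[[1]]
nrow(analysis(one_boot))    # always 32, the size of mtcars
nrow(assessment(one_boot))  # varies: whatever was left out of the draw
```

Note that the assessment set size varies from resample to resample, exactly as described: it is whatever happened not to be drawn.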

Yeah, we have a few questions in the chat. I'm going to save some of them for the end, hoping we have some time, but there's one relevant to what you're presenting right now: how does the set.seed() function ensure reproducibility? Great question. These all involve randomness, right? We're pulling from these distributions randomly. And I have a set.seed() here, which means that you will get the same result every time you run this, if you run these together with set.seed().

Here, oh, I have set.seed() there too. But if you were not to run with set.seed(), if you were just to call bootstraps() again and again, you would get different versions every time. Now, we would hope we're using robust enough statistical methods that we're not seeing huge differences. If we pick a totally different model when we set a different seed, that means something's wrong, right? Something's not going great. However, it is always a good idea to set your seed when you have randomness, just so that you get the same answer the next time, down to the same exact numeric answer, instead of being like, oh my gosh, my metrics are a little bit different this time than what I got last time. So, great question about reproducibility and setting a seed. That number there, 123, is not a special number. I could put the year I was born, this year, any kind of number. What it does is it seeds the random number generator.

So computers have a hard time making random numbers, but we have algorithms that give us numbers as close as possible to truly random. And it's a stream: R has a random number generator stream, so if you've ever seen "RNG stream," that's what it means, the stream of random numbers. If you don't set the seed, the next time you go into the stream and say, give me a random number, you'll get the next one. Whereas if you do set the seed, you always start at the same place in the stream. It's not that it would necessarily be bad if you got a totally different result at the end, like choosing a different model, but setting the seed is still helpful in the process of model development so that you get the same answers every time.
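The seed behavior described here is easy to see with base R alone:

```r
# set.seed() pins R's random number generator stream, so code that
# involves randomness gives the same result on every run.
set.seed(123)
a <- sample(1:100, 5)

set.seed(123)
b <- sample(1:100, 5)

identical(a, b)  # TRUE: same seed, same place in the stream, same draws

c <- sample(1:100, 5)  # no seed reset: picks up further along the stream
identical(a, c)        # almost certainly FALSE
```

This is exactly why set.seed() before vfold_cv() or bootstraps() gives you the same resamples, down to the exact rows, every time you rerun the code.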

All right. So there are lots of different resampling methods, and what they allow you to do is spend your data wisely. You have a certain amount of data, and you need to make the best decisions possible about how to spend it. You spend some data on model estimation, on learning the parameters of that model, whether it's XGBoost or linear regression. And then you spend other data on answering: how is that model performing? I have to know, and I can't use the same data that I used for training, because the model can memorize that data.

So, really good options include cross-validation, that's vfold_cv(), and bootstraps are a really great option too. You can do Monte Carlo cross-validation; it's okay. Leave-one-out cross-validation is maybe not the best option, but it is there as an option. If you have a ton of data, just a ton, then a validation split can be a good option. Picture this same image, but instead of having resample one, resample two, up through resample B, you just do one split, and then you can use that validation set to tune and to choose models. So here, picture it as just one row instead of ten rows or 25 rows. And that can work if you have so much data that it is a waste of time to do things ten times, because you get a good estimate of how well a model is performing from one assessment set.
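For reference, the other resampling options mentioned map onto rsample functions like this. The function names are current rsample API; the proportions shown are illustrative defaults worth double-checking against the documentation.

```r
# Other resampling strategies in rsample
library(rsample)

set.seed(123)
mc_cv(mtcars, prop = 3/4, times = 25)  # Monte Carlo cross-validation
loo_cv(mtcars)                         # leave-one-out: one row held out per split

# With lots of data, a single validation split can stand in for resampling,
# giving train / validation / test pieces in one step:
initial_validation_split(mtcars, prop = c(0.6, 0.2))
```

Each of these produces split objects with the same analysis()/assessment() interface, so downstream tuning code doesn't care which strategy you picked.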

Reason 3: Building better features

Okay, so we talked about building models. You may choose to use tidymodels because you need to try a bunch of different models. You may choose to use tidymodels because your data budget is limited and you want to use good statistical practices to spend your data in a wise way. When it comes time to fit, to tune, to do feature engineering, which we're about to talk about, tidymodels provides you with tools and guardrails, keeping you from doing the wrong thing in many situations so that you can spend your data budget wisely.

The last of these three topics that I want to talk about, the third reason you might decide to use tidymodels, is to build better features. The package in tidymodels that handles feature engineering is called recipes. It uses a certain analogy, or metaphor, for what feature engineering is like. We use the phrase "preprocessing" kind of synonymously with feature engineering: you have data, and you need it to go into a model, but it's not quite ready to go into the model yet. You need to do some stuff to it first before it can go in.
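A minimal sketch of what a recipe looks like; the formula and the particular steps here are illustrative choices, not from the talk.

```r
# A small feature-engineering recipe
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>  # center and scale predictors
  step_dummy(all_nominal_predictors())         # indicator columns for factors

# prep() estimates each step from the training data (e.g. means and SDs);
# bake() applies the trained steps to data
prepped <- prep(rec, training = mtcars)
bake(prepped, new_data = NULL)  # the preprocessed training set
```

The prep()/bake() separation is part of the guardrails mentioned earlier: preprocessing parameters are learned from the training data only, then applied to new data, which helps avoid data leakage.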

This is not so much data cleaning. So data pre-processing or feature engineering is different from data cleaning or more basic data preparation. Like if it's something that you could use dplyr to do, that's not really,