
Emil Hvitfeldt | tidyclust - expanding tidymodels to clustering | RStudio (2022)
This talk marks the grand introduction of tidyclust, a new package that provides a tidy, unified interface to clustering models within the tidymodels framework. While tidymodels has been a leap forward in making machine learning methods accessible to a general audience in R, it is currently limited to the realm of supervised learning. tidyclust, by Emil Hvitfeldt and Kelly Bodwin, builds upon the interfaces familiar to tidymodels users to make unsupervised clustering models equally approachable. Session: Updates from the tidymodels team
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
So, today I will be talking about clustering in tidymodels. It's something I've been working on with Kelly Bodwin for the last half year or so, and I'm very excited to come here and talk to all of you about it. Before we get any further, I want to give a quick introduction to what I find most important about tidymodels as a framework.
And the three most important things I see are that it's consistent, modular, and extensible. It's consistent in the sense that any function that returns a tibble will always return a tibble. If it returns a fitted object, it's always a fitted object. And things like dimensions, column names, and row order will always stay the same. So once you know what a function outputs, you can reliably code knowing that that will be the case.
Secondly, it's modular in the sense that instead of having one monster function that does a bunch of preprocessing, fits a model inside a cross-validation fold, and finds the best model according to some heuristic, we instead create smaller objects that we can then bind together. This allows us to easily swap pieces out and build a more complete workflow.
And lastly, and for me most importantly, is the idea of extensibility. tidymodels was built from the ground up knowing that we needed to make it so other people could build extensions, and to make it much more maintainable. Instead of having one large package with hundreds of thousands of lines of code that need to all be kept in sync, we have smaller, more focused packages that can be maintained independently, yet all still work wonderfully together. This part is near and dear to my heart, because this is how I started working with tidymodels four years ago, when I started working on my first extension package. And now it has led me up here to talk about another extension package in front of all of you.
The case for clustering in tidymodels
So, when we think about machine learning, it's a wonderfully ill-defined topic, but one of the ways we can think about it is in terms of unsupervised and supervised settings. tidymodels right now works really well in the supervised setting: we can do classification, we can do regression. As we just saw, we are also starting to work on tools to deal with censored regression, which isn't even on this diagram. We also have tools in the form of preprocessing to deal with dimensionality reduction, taking a lot of variables and turning them into a smaller, hopefully better set of variables.
But there's something very glaring right in the middle of my diagram that we haven't been able to do for the longest time, and that's what drove me to work on this project: we want to work with clustering, but it just didn't fit in, and there was no easy way to force it to work. It wasn't possible before to do any kind of clustering models.
That's why I'm very excited to give the first public announcement that we're introducing tidyclust package.
So, you probably have a couple of questions, or you're thinking: yes, finally, I've needed this for years. People have been asking. People have been trying to teach it. We have a lot of people who try to teach the classic intro statistical machine learning course: you can do almost everything in tidymodels, and then you have one week where you show clustering, and it just feels off. We're finally here.
But lastly, maybe the more important question is: why are we building another package? And this comes from some internal friction, as I mentioned before: tidymodels just wasn't built to do clustering. It was built with supervised models in mind. And there wasn't really any way to force it through; it just didn't work. And this is because of some fundamental differences between how clustering models work and how typical supervised models work.
And the main thing, which is a very fundamental property and seems a little simple, is that most models have outcomes. Yes, of course they have outcomes, but it means that every single piece of code we have, more or less, is written with outcomes in mind. And it just meant that it didn't work. We couldn't make it work, because not only do clustering models not have outcomes, we don't want you to put in outcomes, because that would be a symptom of trying to do the wrong thing. We also have in tidymodels the idea of clearly defined predictions, which in clustering is a controversial topic. We'll still talk about it, but it is not the most clearly defined thing. And lastly, related to the idea of outcomes: having a ground truth to test your model against means you can do performance checking. You can fit a model, hide the outcome, predict on it, and see whether your predictions align with the true values. That makes it very easy to compute normal performance metrics. We can do no such thing with clustering methods.
There are some ways of assessing whether a clustering model appears to be solid, but we don't have any clear answer. Nevertheless, we move on and try to do the best we can. This means we needed to do a lot of work to port over many of the internals of tidymodels, rewriting them from scratch under the assumption that there are no outcomes. So it's a lot of parallel code, but hopefully it means the transition will be as seamless as possible for all of you. If you're already familiar with tidymodels, I would wager that by reading the vignettes we have coming up later today on the package website, you can get up and running with tidyclust in an hour or two, because we've made the transition fairly smooth.
The dates dataset
To show how some of this works, I will use a very exciting data set. It's a data set of dates, as in the fruit, and its features are extracted from images. They have a bunch of images of dates and computed various features from those images. The area is the number of pixels that contain the fruit. We have perimeter, eccentricity. We also have various measures of the RGB colors of those images. So there's a lot happening in here, and we're hoping to see if, just from these visual metrics about the dates, we can pull out some clusters.
Two things you will notice about this data. First, some of these values have quite different ranges: something like the area is on the order of hundreds of thousands, while something like eccentricity is quite a low value. So if our clustering model has any kind of scale sensitivity, we need to deal with that. Additionally, a lot of these predictors are quite correlated. Generally, for most shapes, the area, perimeter, and convex area are roughly related. Likewise for the colors: if there's a lot of blue in a picture, there might not be as much of the other colors. So we have a lot of correlations, which we can see in this correlation matrix. It's a little too small to actually read, but we are seeing big blocks of highly correlated variables and strongly negatively correlated variables. So we may have to deal with that as well.
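Both preprocessing concerns can be illustrated with plain base R. A minimal sketch, using `mtcars` as a stand-in since the dates data isn't bundled here:

```r
# Columns on wildly different scales (like area vs. eccentricity)
sapply(mtcars[, c("disp", "wt")], function(x) diff(range(x)))

# Standardizing puts every column on a comparable scale
scaled <- scale(mtcars)
round(apply(scaled, 2, sd), 2)  # every column now has sd = 1

# Correlated predictors, analogous to area vs. perimeter in the dates data
cor(mtcars$disp, mtcars$wt)  # strongly positive
```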
Building a tidyclust workflow
So we start out the same way we do in parsnip, by creating a clustering model specification. Here we are doing a simple k-means: we write it out, and we get a k-means cluster specification. We have yet another mode; here we call it partition. A partition means every observation will be assigned to one and only one cluster. Later down the line, we want to expand our vocabulary to cover other kinds of clustering, where any one observation can be in zero or more clusters. So there is work on that on the horizon.
Beyond specifying a cluster specification like this one, which uses the default engine from the stats package, we can also set the number of clusters, change the engine, or add engine-specific arguments. Another thing we really try to work hard on, as we do in parsnip and other packages, is having more descriptive argument names. So instead of k and h, we name them something like num_clusters or cut_height, so you don't have to know as much jargon moving around.
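A minimal sketch of such a specification, assuming the tidyclust interface as described (`k_means()`, `num_clusters`, `set_engine()`); this requires the tidyclust package:

```r
library(tidyclust)

# k-means spec; num_clusters replaces engine jargon like `centers` or `k`
kmeans_spec <- k_means(num_clusters = 5) |>
  set_engine("stats")  # the default engine

kmeans_spec  # prints a K Means Cluster Specification in "partition" mode
```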
With this model specification, we can now work with the rest of tidymodels. Here we are creating a recipe, and we said we want to normalize the data. I'm dealing with the correlations by applying some PCA, keeping the most variable principal components. And we can easily put it all into a workflow and fit that workflow.
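A sketch of that full pipeline. The data frame name `dates` and the number of components are illustrative assumptions, not shown in the talk:

```r
library(tidymodels)
library(tidyclust)

# Recipe with no outcome on the left-hand side: normalize, then
# decorrelate with PCA
dates_rec <- recipe(~ ., data = dates) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 5)

kmeans_wflow <- workflow() |>
  add_recipe(dates_rec) |>
  add_model(k_means(num_clusters = 5))

kmeans_fit <- fit(kmeans_wflow, data = dates)
```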
And if you take a quick look, you will see that this is very, very similar to everything else we do in tidymodels. The only two differences are that instead of specifying a parsnip object, we're creating our tidyclust object, and in the recipe, we're not specifying an outcome. Everything else should work smoothly, so there are very few new things for you to learn.
Extracting results and predictions
Once we have a fitted workflow, we have an object we want to do certain calculations with, extract from, and see what happens inside. Traditionally, a lot of that work was delegated to the broom package. While amazing, it has been growing a lot and has growing pains, so we are rewriting a lot of that functionality for our new engines to stay consistent and behave the way we want. One of the things we can do is pull out which cluster each of the observations landed in in our cluster model. We always return a tibble here with a column named .cluster, so it's always the same. Additionally, we have done some stabilization of the cluster names. In a case like k-means, the cluster labels are random, but we are stopping some of that randomness from coming through, so you're more likely to have consistent clusters across resamples.
Another thing we can do is pull out where those clusters are located, with extract_centroids(). And note here that since we had a workflow with a recipe, the centroids live in the data space that was passed to the cluster model, which in this case was the principal components that came out of the recipe.
And lastly, we are going a little out on the edge by saying that some of these engines also allow for prediction. Prediction is a bit of a touchy subject, but for every model, we have described what we actually mean by it. In the case of k-means, it means: which cluster centroid are you closest to? And this works with the whole workflow, so the preprocessing happens too, and you don't get the sneaky data leakage that comes from doing PCA outside the workflow, or anything like that.
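Putting the extraction and prediction steps together, continuing from a fitted workflow (`kmeans_fit` and `new_dates` are illustrative names, not from the talk):

```r
# Which cluster each observation landed in: a tibble with a .cluster column
extract_cluster_assignment(kmeans_fit)

# Centroid locations, expressed in the preprocessed (post-PCA) space
extract_centroids(kmeans_fit)

# "Prediction" for k-means = nearest centroid; the workflow applies
# the preprocessing first, so there is no data leakage
predict(kmeans_fit, new_data = new_dates)
```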
Metrics and tuning
While I talked about how metrics can be quite hard to calculate, there are still a number of metrics we would like to have. So, in the same style, we have a number of metrics that are specific to these cluster fits. Here we have a function that gives you the total within-cluster sum of squared errors. To calculate it, we are not comparing an outcome to a ground truth; instead, we are calculating a metric based on the centroids and the observations. This metric sums up how far the points are from their cluster centroids.
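The quantity being described can be computed by hand from the centroids and assignments. A base-R check against stats::kmeans(), which reports it as `tot.withinss` (`mtcars` again as stand-in data):

```r
set.seed(1)
x <- scale(mtcars)
km <- kmeans(x, centers = 3, nstart = 10)

# Sum of squared distances from each point to its assigned centroid
wss <- sum((x - km$centers[km$cluster, ])^2)

all.equal(wss, km$tot.withinss)  # the same quantity kmeans reports
```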
And in the same style, if we have multiple metrics, we can combine them in a cluster metric set to calculate them all at once. So far, I have talked about fitting one model where we arbitrarily set the number of clusters to 5. But ideally, we would like to find out how many clusters there really are. We'll never know for certain, but maybe we can see which number of clusters would be best.
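A sketch of bundling metrics, assuming tidyclust's metric names (`sse_within_total`, `sse_total`, `silhouette_avg`) and the illustrative `kmeans_fit` and `dates` objects from before:

```r
library(tidyclust)

# Analogous to yardstick::metric_set(), but for cluster fits
clust_metrics <- cluster_metric_set(sse_within_total, sse_total, silhouette_avg)

# Evaluate all of them at once on a fitted model
clust_metrics(kmeans_fit, new_data = dates)
```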
So we're moving on to tuning: we can specify that the number of clusters should be tuned, and try a bunch of different values. We set up a grid of values, saying something between 1 and 10 clusters will probably be best; it's a fairly small data set. If that's not enough, we can expand later.
And then we have a function, tune_cluster(), that has almost the same user interface as the tune functions you're familiar with from tune and finetune. You pass in a workflow or a cluster specification, you give it some resamples and a grid of values, and here we can also specify some cluster metrics. So the only difference from your usual workflow is that you use tune_cluster() instead of tune_grid(), and you can optionally pass in a custom set of metric functions.
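A sketch of that tuning call, reusing the illustrative `dates` and `kmeans_wflow` names (the fold count and grid are assumptions):

```r
library(tidymodels)
library(tidyclust)

kmeans_spec <- k_means(num_clusters = tune())

folds <- vfold_cv(dates, v = 5)
clust_grid <- tibble(num_clusters = 1:10)

tune_res <- tune_cluster(
  kmeans_wflow,          # workflow containing the tunable spec
  resamples = folds,
  grid = clust_grid,
  metrics = cluster_metric_set(sse_within_total, sse_ratio)
)
```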
And once all that runs cleanly, you get back the same type of result that you get from most of the tune functions. So you can do collect_metrics(), collect_predictions(), select_best(), autoplot(), and so on. I'm not saying you should do select_best() here, because as we see with k-means, when the number of clusters goes up, this metric will always go down; that's just a reality. But this can easily be combined with something like ggplot2 to create an elbow chart or any other diagnostic you may want.
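That always-decreasing behavior is easy to demonstrate with base R: refit k-means for k = 1..10 and watch the total within-cluster SSE shrink, which is exactly what the classic elbow chart plots (`mtcars` as stand-in data):

```r
set.seed(2022)
x <- scale(mtcars)

# Total within-cluster SSE for each candidate number of clusters
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# More clusters keeps lowering this metric, so naively "selecting the
# best" would always pick k = 10; plot the curve and look for the elbow
plot(1:10, wss, type = "b", xlab = "clusters", ylab = "total within-cluster SSE")
```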
And that's all I have for you today. There's more work on the way; we have something else I'm going to try to merge as soon as I can after this talk. The main takeaway I want for you is: use the package. We're still in very early development. Most of the foundation is here, in the sense that everything works, hopefully. But I want to know what you want to do, because we are still missing more types of metrics, engines, and models. I want to hear all of that from you.

