The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.
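To follow along with the examples below, you can load the metapackage (a minimal setup sketch):

```r
library(tidymodels)
```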
Since the beginning of last year, we have been publishing quarterly updates here on the tidyverse blog summarizing what’s new in the tidymodels ecosystem. The purpose of these regular posts is to share useful new features and any updates you may have missed. You can check out the tidymodels tag to find all tidymodels blog posts, including our roundup posts as well as those that are more focused, like these from the past month or so:
Since our last roundup post, there have been 21 CRAN releases of tidymodels packages. You can install these updates from CRAN with:
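One way to write that call, using the package names from the releases listed in this post (a sketch):

```r
install.packages(c(
  "baguette", "broom", "brulee", "dials", "discrim", "finetune",
  "hardhat", "multilevelmod", "parsnip", "plsmod", "poissonreg",
  "recipes", "rules", "stacks", "textrecipes", "tidymodels",
  "tune", "usemodels", "vetiver", "workflows", "workflowsets"
))
```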
The NEWS files are linked here for each package; you’ll notice that there are a lot! We know it may be bothersome to keep up with all these changes, so we want to draw your attention to our recent blog posts above and also highlight a few more useful updates in today’s blog post.
- baguette
- broom
- brulee
- dials
- discrim
- finetune
- hardhat
- multilevelmod
- parsnip
- plsmod
- poissonreg
- recipes
- rules
- stacks
- textrecipes
- tune
- the tidymodels metapackage itself
- usemodels
- vetiver
- workflows
- workflowsets
We’re really excited about brulee and vetiver but will share more in upcoming blog posts.
Feature hashing#
The newest textrecipes release provides support for feature hashing, a feature engineering approach that can be helpful when working with high cardinality categorical data or text. A hashing function takes an input of variable size and maps it to an output of fixed size. Hashing functions are commonly used in cryptography and databases, and we can create a hash in R using rlang::hash():
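For example, hashing a ZIP code from the Sacramento home sales data (a sketch, assuming the Sacramento data from the modeldata package):

```r
library(tidymodels)
data(Sacramento, package = "modeldata")

# zip is a high cardinality categorical variable
dplyr::n_distinct(Sacramento$zip)

# the same input always maps to the same fixed-size hash value
rlang::hash("95838")
rlang::hash("95838")  # identical to the line above
```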
The variable zip in this data on home sales in Sacramento, CA is of “high cardinality” (as ZIP codes often are), with 67 unique values. When we hash() the ZIP code, we get out, well, a hash value, and we will always get the same hash value for the same input (as you can see for ZIP code 95838 here). We can choose the fixed size of our hashed output to reduce the number of possible values to whatever we want; it turns out this works well in a lot of situations.
Let’s use a hashing algorithm like this one (with an output size of 16) to create binary indicator variables for this high cardinality zip:
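One way to write this with textrecipes is step_dummy_hash() (a sketch; the recipe formula and object name are illustrative, assuming the Sacramento data from above):

```r
library(textrecipes)

# hash zip into 16 binary indicator columns
hash_rec <- recipe(price ~ zip + beds + baths + sqft, data = Sacramento) %>%
  step_dummy_hash(zip, num_terms = 16L)

prep(hash_rec) %>% bake(new_data = NULL)
```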
We now have 16 columns for zip (along with the other predictors and the outcome), instead of the over 60 we would have had by making regular dummy variables.
For more on feature hashing, including its benefits (fast and low memory!) and downsides (not directly interpretable!), check out Section 6.7 of Supervised Machine Learning for Text Analysis with R and/or Section 17.4 of Tidy Modeling with R.
More customization for workflow sets#
Last year about this time, we introduced workflowsets, a new package for creating, handling, and tuning multiple workflows at once. See Section 7.5 and especially Chapter 15 of Tidy Modeling with R for more on workflow sets. The latest release of workflowsets provides finer control over how you customize the workflows you create. First, you can create a standard workflow set by crossing a set of models with a set of preprocessors (let’s just use the feature hashing recipe we already created):
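A sketch of such a workflow set, assuming the hashing recipe from above (here called `hash_rec`) and a tunable glmnet model (object names are illustrative):

```r
library(workflowsets)

glmnet_spec <-
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# cross one preprocessor with one model; more of either can be listed
sac_wf_set <- workflow_set(
  preproc = list(hash = hash_rec),
  models  = list(glmnet = glmnet_spec)
)
```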
The option column is a placeholder for any arguments to use when we evaluate the workflow; the possibilities here are any argument to functions like tune_grid() or fit_resamples(). But what about arguments that belong not to the workflow as a whole, but to a recipe or a parsnip model? In the new release, we added support for customizing those kinds of arguments via update_workflow_model() and update_workflow_recipe(). This lets you, for example, say that you want to use a sparse blueprint for fitting:
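For example, switching the glmnet workflow over to a sparse blueprint might look like this (a sketch; the workflow id `hash_glmnet` follows the `preproc_model` naming convention, and `sac_wf_set`/`hash_rec` refer to the workflow set and recipe created earlier):

```r
library(hardhat)

# produce a sparse matrix composition instead of a dense data frame
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")

sac_wf_set <- sac_wf_set %>%
  update_workflow_recipe("hash_glmnet", hash_rec, blueprint = sparse_bp)
```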
Now we can tune this workflow set, with the sparse blueprint for the glmnet model, over a set of resampling folds.
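A sketch of that tuning step, assuming the objects from the previous examples:

```r
set.seed(123)
folds <- vfold_cv(Sacramento)

# workflow_map() applies tune_grid() (the default fn) to each workflow
sac_results <- sac_wf_set %>%
  workflow_map(resamples = folds, grid = 5)
```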
New parameter objects and parameter handling#
Even if you are a regular tidymodels user, you may not have thought much about dials. This is an infrastructure package that is used to create and manage model hyperparameters. The latest release of dials provides a handful of new parameters for various models and feature engineering approaches, including parameters for the new parsnip::bart(), i.e. the Bayesian additive regression trees model:
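For example, the new parameter objects for the tree prior (a sketch; parameter names mirror the arguments of parsnip::bart()):

```r
library(dials)

# priors governing how likely a terminal node is to split, by depth
prior_terminal_node_coef()
prior_terminal_node_expo()
```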
This version of dials, along with the new hardhat release, also provides new functions for extracting single parameters and parameter sets from modeling objects.
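For example, extracting the full parameter set from a model specification (a sketch; the specification name is illustrative):

```r
bart_spec <-
  parsnip::bart(trees = tune(), prior_terminal_node_coef = tune()) %>%
  set_mode("regression")

extract_parameter_set_dials(bart_spec)
```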
You can also extract a single parameter by name:
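A sketch, again using a bart() specification:

```r
bart_spec <- parsnip::bart(trees = tune()) %>% set_mode("regression")

extract_parameter_dials(bart_spec, "trees")
```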
Acknowledgements#
We’d like to extend our thanks to all of the contributors who helped make these releases during Q1 possible!
- baguette: @EmilHvitfeldt and @hfrick.
- broom: @cgoo4, @colinbrislawn, @DanChaltiel, @ddsjoberg, @fschaffner, @grantmcdermott, @hughjonesd, @jennybc, @Marc-Girondot, @MichaelChirico, @mlaviolet, @oliverbothe, @PursuitOfDataScience, @simonpcouch, and @vincentarelbundock.
- brulee: @dfalbel, @EmilHvitfeldt, and @topepo.
- dials: @EmilHvitfeldt, @hfrick, and @py9mrg.
- discrim: @deschen1, @EmilHvitfeldt, @hfrick, @jmarshallnz, and @juliasilge.
- finetune: @juliasilge, @Steviey, and @topepo.
- hardhat: @DavisVaughan, @ddsjoberg, @EmilHvitfeldt, @hfrick, and @MasterLuke84.
- multilevelmod: @EmilHvitfeldt and @sitendug.
- parsnip: @brunocarlin, @dietrichson, @edgararuiz, @EmilHvitfeldt, @hfrick, @jmarshallnz, @juliasilge, @mattwarkentin, @nikhilpathiyil, @nvelden, @t-kalinowski, @tiagomaie, @tolliam, and @topepo.
- plsmod: @EmilHvitfeldt and @topepo.
- poissonreg: @EmilHvitfeldt and @juliasilge.
- recipes: @agwalker82, @AndrewKostandy, @aridf, @brunocarlin, @DoktorMike, @duccioa, @EmilHvitfeldt, @FieteO, @hfrick, @joeycouse, @juliasilge, @lionel-, @mattwarkentin, @mdsteiner, @MichaelChirico, @spsanderson, @themichjam, @tmastny, @tomazweiss, @topepo, @walrossker, and @zenggyu.
- rules: @EmilHvitfeldt, @juliasilge, and @wdkeyzer.
- stacks: @amcmahon17, @py9mrg, @Saarialho, @siegfried, @simonpcouch, @StuieT85, @topepo, and @williamshell.
- textrecipes: @EmilHvitfeldt, @lionel-, and @NLDataScientist.
- tune: @abichat, @AndrewKostandy, @dax44, @EmilHvitfeldt, @felxcon, @hfrick, @juanydlh, @juliasilge, @mattwarkentin, @mdancho84, @py9mrg, @topepo, @walrossker, @williamshell, and @wtbxsjy.
- tidymodels: @EmilHvitfeldt, @exsell-jc, @hardin47, @juliasilge, @PursuitOfDataScience, @RaymondBalise, @scottlyden, and @topepo.
- usemodels: @juliasilge and @topepo.
- vetiver: @atheriel and @juliasilge.
- workflows: @CarstenLange, @DavisVaughan, @dpprdan, @hfrick, and @juliasilge.
- workflowsets: @DavisVaughan, @dvanic, @gdmcdonald, @hfrick, @juliasilge, @topepo, and @wdefreitas.

