We’re delighted to announce the release of textrecipes 0.0.1 on CRAN. textrecipes implements a collection of new steps for the recipes package to deal with text preprocessing. textrecipes is still in early development, so any and all feedback is highly appreciated.
You can install it by running:
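Since textrecipes is on CRAN, the standard installation call applies:

```r
# Install the released version from CRAN
install.packages("textrecipes")
```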
New steps
The steps introduced here can be split into 3 types, those that:
- convert from characters to list-columns and vice versa,
- modify the elements in list-columns, and
- convert list-columns to numerics.
This allows for greater flexibility in the preprocessing tasks that can be done while staying inside the recipes framework. This also prevents having a single step with many arguments.
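As a sketch of this taxonomy, here is one possible grouping of steps exported by textrecipes (the grouping follows the description above; see the package documentation for the authoritative list):

```r
# Type 1: convert characters to list-columns and vice versa, e.g.
#   step_tokenize(), step_untokenize()
# Type 2: modify the elements of list-columns, e.g.
#   step_stem(), step_stopwords(), step_tokenfilter()
# Type 3: convert list-columns to numerics, e.g.
#   step_tf(), step_tfidf(), step_texthash()
```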
Workflows
First we start by creating a recipe object from the original data.
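A minimal sketch of that first step; the `text_df` data frame and its `text` column are illustrative, not from the original post:

```r
library(recipes)

# Illustrative data: one character column of free text
text_df <- data.frame(
  text = c("I am happy", "I am sad", "recipes are great"),
  stringsAsFactors = FALSE
)

# Create a recipe object from the original data;
# `text` enters the recipe as a predictor
rec <- recipe(~ text, data = text_df)
```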
The workflow in textrecipes so far starts with step_tokenize(), followed by a combination of type-1 and type-2 steps, and ends with a type-3 step. step_tokenize() wraps the tokenizers package for tokenization, but other tokenization functions can be used via the custom_token argument. More information about the arguments can be found in the documentation. The shortest possible recipe is step_tokenize() directly followed by a type-3 step.
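A sketch of such a minimal recipe, assuming an illustrative data frame `text_df` with a character column `text`, and using step_tf() as the type-3 step (any type-3 step would do):

```r
library(recipes)
library(textrecipes)

text_df <- data.frame(
  text = c("I am happy", "I am sad"),
  stringsAsFactors = FALSE
)

# Shortest possible recipe: tokenize, then go straight to numerics
rec <- recipe(~ text, data = text_df) %>%
  step_tokenize(text) %>%  # type 1: character -> list-column of tokens
  step_tf(text)            # type 3: list-column -> term-frequency columns
```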
If one wanted to calculate the word counts of the 100 most frequently used words after stemming is performed, type-2 steps are needed. Here we use step_stem() to perform stemming using the SnowballC package, and step_tokenfilter() to keep only the 100 most frequent tokens.
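That combination could be sketched as follows, with an illustrative data frame `text_df` holding a character column `text`:

```r
library(recipes)
library(textrecipes)

text_df <- data.frame(
  text = c("run running runs", "walk walked"),
  stringsAsFactors = FALSE
)

rec <- recipe(~ text, data = text_df) %>%
  step_tokenize(text) %>%                      # type 1: tokenize the text
  step_stem(text) %>%                          # type 2: stem via SnowballC
  step_tokenfilter(text, max_tokens = 100) %>% # type 2: keep the 100 most frequent tokens
  step_tf(text)                                # type 3: word counts
```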
For more combinations, please consult the documentation and the vignette, which includes recipe examples.
Acknowledgements
A big thank you goes out to the 6 people who contributed to this release: @ClaytonJY, @DavisVaughan, @EmilHvitfeldt, @jwijffels, @kanishkamisra, and @topepo.

