
Alan Feder | Categorical Embeddings: New Ways to Simplify Complex Data | RStudio
When building a predictive model in R, many of the functions (such as lm(), glm(), randomForest, xgboost, or neural networks in keras) require that all input variables be numeric. If your data has categorical variables, you may have to choose between ignoring some of your data and adding too many new columns. Categorical embeddings are a relatively new method, built on techniques popularized in Natural Language Processing, that help models solve this problem and can also help you understand more about the categories themselves. While there are a number of online tutorials on how to use Keras (usually in Python) to create these embeddings, this talk will use embed::step_embed(), an extension of the recipes package, to create the embeddings.

About Alan: Alan Feder is a Principal Data Scientist at Invesco, where he uses as much R as possible to solve problems and build products throughout the company. Previously, he worked as a data scientist at AIG and as an actuary at Swiss Re. He studied statistics and mathematics at Columbia University. He is unreasonably excited to spread the word about categorical embeddings. Alan lives in New York City with his wife, Ashira, and two children, Matan and Sarit.
Transcript
This transcript was generated automatically and may contain errors.
Hello, my name is Alan Feder and I am a Principal Data Scientist at Invesco. My talk today is about Categorical Embeddings and the Embed Package in R.
If you are trying to create a predictive model, you'd really like to use all of the variables in your dataset as that's what's likely to create the most accurate and successful predictions. But you may soon run into a problem because many of the functions within R that are used to create these models require that all of the input data are continuous numbers. Now this makes sense because many of the computations that go on in the background work on linear algebra and matrices, which inherently can only have numbers in them, but that might be a problem for you because some of your data might be categorical variables and you don't want to have to drop them.
The problem with categorical variables
You might try to replace each category with a number, 1, 2, 3, which might work if your data is ordinal, but if your data doesn't have any inherent order to it, this might not work. If one of your categories is color, how are you going to choose numbers for black, brown, yellow, or green?
You might try to create dummy variables, but this can actually cause big data problems very quickly. Dummy variables create a new column for each level of your variable, and if, for example, one of your categories is United States ZIP Code, you're about to add 42,000 new columns to your data. That's probably not what you want.
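The dummy-variable explosion is easy to see in base R: `model.matrix()` one-hot encodes a factor into one column per level (minus a reference level). The data frame below is a hypothetical toy example, not from the talk:

```r
# Toy data: a factor with 5 levels and one numeric predictor.
df <- data.frame(
  zip    = factor(c("10001", "10002", "10003", "60601", "94105")),
  amount = c(12.5, 40.0, 7.25, 99.9, 3.5)
)

# model.matrix() expands the factor into dummy columns
# (one per level, dropping the reference level):
mm <- model.matrix(~ zip + amount, data = df)
ncol(mm)  # 1 intercept + 4 zip dummies + 1 amount = 6 columns
```

With 5 ZIP codes that's already 6 columns; with all 42,000 US ZIP codes the same call would produce tens of thousands of columns.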
Introducing categorical embeddings
That leads me to categorical embeddings. If you're familiar with word embeddings, such as Word2Vec, this is actually a very similar idea, except instead of an embedding for each word in language, you're embedding each category in your data.
What happens is that every category is replaced by a vector of numbers. The vector can be length 4, it can be length 50, it really depends on the situation, but these numbers will replace your categorical variable for any future modeling.
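Mechanically, an embedding is just a lookup table: a matrix with one row per category and one column per embedding dimension. A minimal sketch in base R (the categories and random values here are illustrative, not a fitted embedding):

```r
set.seed(42)

# Toy embedding table: each category maps to a length-4 numeric vector.
emb <- matrix(rnorm(3 * 4), nrow = 3,
              dimnames = list(c("black", "brown", "green"),
                              paste0("emb_", 1:4)))

# Replacing a categorical column is a row lookup:
colors <- c("green", "black", "green")
X <- emb[colors, ]   # 3 observations x 4 numeric columns
dim(X)               # 3 4
```

In a real workflow the values in `emb` are learned from the data rather than drawn at random.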
Now, categorical embeddings are most commonly fit by neural networks, which is why, within R, many of the most common ways of creating them use the keras and tensorflow packages or the newer torch package. However, I find the simplest way to do it is with the embed package, which is one of the add-on packages in the tidymodels suite of packages.
Using the embed package
For example, let's say you had a data set of sales transactions and you wanted to predict which transactions are fraudulent and which ones are not. Having product in there, you can imagine how that would help the predictive ability of the model, but with over 4,500 unique products, that's going to cause problems with creating your model.
But with the embed package, all you have to do is add one function, step_embed(), to a tidymodels recipe to create the embedding. Once you've prepped the recipe, the embeddings are ready to use.
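A minimal sketch of that recipe, assuming a data frame `transactions` with an outcome factor `fraud` and a high-cardinality factor `product` (both names are hypothetical, standing in for the sales data described above):

```r
library(recipes)
library(embed)   # provides step_embed(), which fits embeddings via keras

# Hypothetical data: `transactions` has a factor `product` (~4,500 levels)
# and a two-level outcome `fraud`.
rec <- recipe(fraud ~ ., data = transactions) %>%
  step_embed(product,
             outcome   = vars(fraud),
             num_terms = 4,                          # 4 numbers per product
             options   = embed_control(epochs = 10))

# prep() fits the embedding; bake() replaces `product`
# with 4 numeric columns in the output data.
prepped <- prep(rec)
baked   <- bake(prepped, new_data = NULL)
```

The baked data can then feed any downstream model, not just a neural network.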
And you can see here how each product is now represented by an embedding of four numbers. And these four numbers are going to replace the old product category within the data going forward. You can now use this for any downstream model. It doesn't even have to be a neural network.
You could also use this, if you wanted, just to explore your data further and learn more about the products. Maybe you could use cosine similarity to figure out which products are most similar to each other.
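Cosine similarity between two embedding rows is a one-liner in base R. The embedding matrix below is a made-up illustration with hypothetical product names, not fitted values:

```r
# Cosine similarity between two numeric vectors.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy embedding matrix: one row per product, 4 embedding dimensions.
emb <- rbind(
  widget_a = c( 0.9, 0.1, 0.0,  0.2),
  widget_b = c( 0.8, 0.2, 0.1,  0.3),
  gadget_x = c(-0.5, 0.9, 0.4, -0.1)
)

cosine_sim(emb["widget_a", ], emb["widget_b", ])  # near 1: similar products
cosine_sim(emb["widget_a", ], emb["gadget_x", ])  # lower: dissimilar products
```

Ranking every other product by cosine similarity to a given row gives a quick "most similar products" view of the learned embedding.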
Now, there's obviously a ton more that I could talk about with categorical embeddings. It's a topic that I find really exciting. I hope you learn and discover more about categorical embeddings as you use them within your modeling. Thank you very much for coming, and good luck.
