
Creating Features for Machine Learning from Text – Julia Silge, March 2022
Julia Silge is a software engineer at RStudio PBC where she works on open source modeling tools. She holds a PhD in astrophysics and has worked as a data scientist in tech and the nonprofit sector, as well as a technical advisory committee member for the US Bureau of Labor Statistics. She is an author, an international keynote speaker, and a real-world practitioner focusing on data analysis and machine learning. Julia loves text analysis, making beautiful charts, and communicating about technical topics with diverse audiences.

Natural language that we as speakers and writers use must be dramatically transformed to new representations for analysis, whether we are just starting off with exploratory data analysis or are ready to train machine learning algorithms such as predictive models. We can explore typical text preprocessing steps from the ground up, from tokenization to building word embeddings, and consider the effects of these steps. When are these preprocessing steps helpful, and when are they not? In this talk, learn about the process of text preprocessing for ML models in the real world, how and when practitioners use different preprocessing choices, and considerations for text ML tooling.

#rstats #nlp #juliasilge #coding #machinelearning https://rug-at-hdsi.org/ https://twitter.com/RUGatHDSI
Transcript
This transcript was generated automatically and may contain errors.
And I'd love to introduce our speaker for today, Julia. I'll read her wonderful bio to you and then cede the rest of the time to her so she can do as she pleases with it. Julia Silge is a software engineer at RStudio where she works on open source modeling tools. She holds a PhD in astrophysics and has worked as a data scientist in tech and the non-profit sector, as well as a technical advisory committee member for the U.S. Bureau of Labor Statistics. She is an author, an international keynote speaker, and a real-world practitioner focused on data analysis and machine learning. Julia loves text analysis, making beautiful charts, and communicating about technical topics with diverse audiences. So without further ado, Julia, go ahead.
Thank you so, so much for that lovely introduction.
All right, so I'm really happy to be here talking with you all, specifically about feature engineering for text data. I think this is such a great and important topic for a couple of reasons. One is that a better understanding of what we do to text data to make it appropriate as input for machine learning algorithms has a whole lot of benefits: if you're directly getting ready to train a model with text as inputs, if you're at the beginning of a text analysis project and want to understand how these analysis steps may be reused later, maybe by your own self at some future point, for feature engineering, or if you're trying to understand the behavior of a model that you're interacting with in some way.
And we do this through cloud machine learning or AI services, and honestly, we do it a lot in our daily lives now, dealing with the outputs of models that were trained using text data. So understanding more about how this happens has a really wide range of benefits.
From natural language to numeric representation
So when we build models for text, either supervised models or unsupervised models, we start with something that looks like this. This is just some example data; I'm going to use it a couple of times during this talk. It's example data that describes animals, and then we have some quantitative, structured information on their diet, like what kind of diet they have.
So this looks familiar to me and to you as writers and speakers of human language. We can look at this; I could read it out loud to you, or read it in my head, and understand what it says. This kind of natural language is being generated all the time in all kinds of contexts. If you work in health care or finance or tech, not to mention the actual digital humanities, you see this kind of text data being generated by electronic health records, by survey takers, by social media; there are tons of organizational processes that generate this kind of text.
So we've got a lot of this. However, computers are not great at doing math on language represented like this. Instead, language has to be dramatically transformed into a machine-readable, numeric representation, and it will look more like this, to be ready for computation for basically any kind of model.
So I spent a fair amount of time working on software for people to do exploratory data analysis, visualization, summarization, and so forth with text data in a tidy format, where we have one observation per row. But when it comes time to build a model, to use some kind of underlying mathematical implementation, we almost always need something like this. This particular thing here is called a document-term matrix.
So in this matrix, the columns all belong to terms, and each row is a different document. What goes in there, in this particular example, is the counts: how many times does a document use a certain word? The exact representation may differ from this. You might weight by tf-idf instead of using counts, or you might keep sequence information: instead of counting things up or weighting them, you might keep track of where in a document a certain word or token appeared. But for basically all text modeling, from simpler models like naive Bayes through word embeddings to even the most state-of-the-art approaches today, say transformer models, we have to heavily feature engineer and process language to get it into a representation that's suitable for machine learning algorithms.
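The document-term-matrix idea is easy to sketch outside of any particular framework. Here's a minimal Python illustration (not the tidymodels code from the talk; the two toy documents are made up, and real tooling would use sparse storage):

```python
from collections import Counter

def document_term_matrix(documents):
    """Build a document-term matrix of raw counts: one row per
    document, one column per term in the sorted vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            row[index[tok]] = n  # cell = how often this doc uses this term
        matrix.append(row)
    return vocab, matrix

vocab, dtm = document_term_matrix([
    "the collared peccary is a pig like animal",
    "the peccary eats roots and other plants",
])
```

Swapping the raw counts for tf-idf weights, or for positions, changes the representation but not the basic shape of the matrix.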
Tidy Models and the Recipes framework
So I work in my day job on an open source framework in R for modeling and machine learning that's called Tidy Models. And a lot of the examples I'll be showing today use Tidy Models code. So some of the specific goals of the Tidy Models project are to provide a consistent, flexible framework for real world modeling practitioners, from those just starting out beginners to very experienced people, to harmonize the heterogeneous interfaces that we have within R, and to encourage good statistical practice.
So I'm glad to get to show a little bit of what it is that I work on and build and how it applies to text modeling. But a lot of what we're going to be talking about today isn't super specific to Tidy Models; it isn't even really specific to R. Instead, we're focusing on the basics of how we transform text into predictors for machine learning.
So Tidy Models is a meta package in a similar way that the Tidyverse is. If you've ever typed library(tidyverse) and then used ggplot2 for visualization and dplyr for data manipulation, Tidy Models works in a similar way, because it turns out modeling has a lot of different pieces to it, a lot of different kinds of tasks. Pre-processing or feature engineering, what we're focusing on today, is part of a broader modeling process.
It starts, you might really argue, with exploratory data analysis, and then let's say it comes to completion with model evaluation, with you understanding how well your model is performing. Tidy Models as a piece of software is made up of R packages, each of which has a specific focus: rsample is for resampling data, and tune is for hyperparameter tuning. One of these packages implements feature engineering transformations, and it's the one called recipes.
So in Tidy Models, we capture the concepts of data pre-processing and feature engineering in this idea of a pre-processing recipe that has steps. You choose which ingredients or variables you're going to use, you define the steps that you want to take, and then you prepare those steps, applied to those ingredients. Then you can apply that recipe to any kind of new data set: testing data, if you're in the process of training your model, or new data, if you've put your model into production and you're getting new predictions.
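That prepare-then-apply pattern can be sketched in a few lines. This toy Python class (the name TokenCountRecipe is invented for illustration; it is not a tidymodels API) learns its vocabulary from training data once, then applies the same transformation to any new data:

```python
class TokenCountRecipe:
    """A toy stand-in for a pre-processing recipe: learn the
    vocabulary from training data ("prepare"), then apply the same
    transformation to any new data set."""

    def fit(self, documents):
        # Prepare: learn the ingredients from the training data only.
        self.vocab_ = sorted({t for d in documents for t in d.lower().split()})
        return self

    def transform(self, documents):
        # Apply: use the already-prepared steps on new data. Words
        # unseen at training time are simply dropped, exactly as
        # they would be at prediction time.
        rows = []
        for d in documents:
            counts = {}
            for t in d.lower().split():
                if t in self.vocab_:
                    counts[t] = counts.get(t, 0) + 1
            rows.append([counts.get(term, 0) for term in self.vocab_])
        return rows

recipe = TokenCountRecipe().fit(["a b b", "b c"])   # training data
new_rows = recipe.transform(["c d a"])              # new data at prediction time
```

The important design point is that nothing about the new data leaks into the prepared steps; the vocabulary is fixed at training time.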
So the variables or ingredients that we use in modeling come in all kinds of shapes and sizes. And that includes text data. So some of the techniques and approaches that we use for pre-processing text data are the same as for any other kind of data that we might have. But some of what you need to know to be able to do a good job in this process for text is different and is specific to the nature of language data.
So I've written a book with my co-author, Emil Hvitfeldt, on supervised machine learning for text analysis in R. Fully the first third of the book focuses on how we transform natural language into features for modeling. The middle section of the book is how to use those features in what you might call simpler or more traditional machine learning models, say regularized regression, support vector machines, and other more straightforward machine learning models that work well with text. The last third of the book talks about how we use deep learning models with text data.
So deep learning models still require these kinds of transformations, where we go from natural language and end up with some mathematical representation. Deep learning models don't get you out of the need for understanding this kind of pre-processing. They are different, and they are often able to inherently learn structure or features from text, but using deep learning models doesn't get you out of the need for doing and understanding feature engineering for text altogether.
So this book is now complete and available. Folks have print copies, and it's also available in its entirety at smltar.com. That's how we like to say it, smltar, like it's a dragon or something, I think. So if you are new to dealing with text data, understanding these fundamental pre-processing approaches for text will set you up to train effective models. If you're really experienced with text data, you've probably noticed, and this was part of our motivation for working on this book, that the existing literature is kind of sparse when it comes to detailed, thoughtful explorations of how these pre-processing steps work and how choices made in these feature engineering steps tend to impact our model output.
Tokenization
So let's walk through a couple of these kinds of examples and talk about how they work, and let's start with what you might think of as the first or the most basic: tokenization. This is typically one of the first steps of transformation from natural language to machine learning features, and honestly, in any kind of text analysis, including exploratory data analysis before you build a model, tokenization is often the first thing you need to do.
So in tokenization, we take an input, a string, and a token type, which is some meaningful unit of text, such as a word, and then we split the input into pieces, or tokens, that correspond to that type. Most commonly, the meaningful unit of text that we want to split our text into is a word. So you might be like, oh yeah, okay, thanks, that's super exciting. It might seem straightforward, but it turns out it's difficult to clearly define what a word is for many or even most languages.
So many languages don't use whitespace between words, so you can't use whitespace for tokenization. Languages that do use whitespace, like English, often have particular examples that are ambiguous. Romance languages like Italian, French, or Spanish often use pronouns or negation words that may be better considered prefixes despite the space. And then we've got English contractions, like "didn't", that may more accurately be considered two words with no space.
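You can see the ambiguity with a couple of lines of Python; the sentence is invented, and each pattern below encodes a different, equally defensible definition of a "word":

```python
import re

text = "The peccary didn't run."

# Naive whitespace tokenization keeps punctuation attached to tokens.
whitespace_tokens = text.split()
# → ["The", "peccary", "didn't", "run."]

# A regex tokenizer must decide what a "word" is. Allowing an internal
# apostrophe keeps the contraction "didn't" as one token.
one_word = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

# Letters only: the contraction is split, treating it as two words.
two_words = re.findall(r"[A-Za-z]+", text)
```

Neither choice is wrong; they just produce different features downstream.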
So once we have tokenized text, it's on its way to being usable. We make some choice, and then we can use it in EDA, unsupervised algorithms, or as features for predictive modeling, which is what these results show here. These are from a regression model trained on the descriptions of media from artwork in the Tate collection, media meaning what kind of art medium was used to create a piece of art. And what we're predicting with this model is when it was created.
What we see here is that artwork created using graphite, watercolor, or engraving is more likely to have been created earlier, with older art, and artwork created using photography, screen printing, and dung and glitter is more likely to have been created later, with newer, more contemporary art. So the way that we tokenize this natural, human-generated language we started with, describing the medium of the artwork, has a big impact on what we learn from it. If we had tokenized in a different way, we would have gotten different results, both in terms of model performance and in terms of how we interpret the model, what we learn from it.
N-gram tokenization
There are other ways to tokenize text. Instead of breaking text up into single words, called unigrams, you can tokenize to n-grams. An n-gram is a contiguous sequence of n items from a given sequence of text. What I'm showing on screen here is that same text from before, describing this animal, divided up into bigrams, n-grams with two tokens. Notice how the words in the bigrams overlap, so that we see the word "collared" in both of the first two bigrams. N-gram tokenization slides along the text to create overlapping sets of tokens.
This shows trigrams for the same text: a sliding window of three words that moves along, identifying trigrams. Using unigrams is faster and more efficient, both to identify and to model on, but with only unigrams we don't capture any information about word order. If you ever hear the phrase "bag of words", that's what it means: we took all the words, threw them in a bag, shuffled them around, and kept no information about word order. Using a higher value for n keeps more information, but the vector space of tokens increases dramatically, and that corresponds to a reduction in token counts.
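The sliding window itself is tiny to implement. Here's a Python sketch, using my paraphrase of the example sentence from the slides:

```python
def ngrams(tokens, n):
    """Slide a window of length n along the token list, producing
    overlapping n-grams (n=1 unigrams, n=2 bigrams, n=3 trigrams)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the collared peccary is a pig like animal".split()
bigrams = ngrams(tokens, 2)   # "collared" appears in the first two bigrams
trigrams = ngrams(tokens, 3)
```

Note how the token counts shrink as n grows: 8 tokens yield 7 bigrams and only 6 trigrams, while the space of possible tokens explodes.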
Combining degrees of n-grams can be a good idea. This allows you to extract different levels of detail from text data. Unigrams tell you which individual words have been used a lot of times, and it's not uncommon for some of the most common unigrams to get overlooked if you're only looking at bigrams or trigrams, because they don't co-appear with any other particular word that often.
So how does this turn out? What kind of results do we see? This plot compares model performance, in terms of RMSE (root mean squared error) on the y-axis, for a lasso regression model predicting the year of Supreme Court opinions with three different degrees of n-grams. This whole little experiment held the total number of tokens constant at a thousand. Notice that using unigrams alone performed best for this corpus of Supreme Court opinions, and performance gets worse and worse as we add higher-order n-grams. This isn't always the case; it depends on the kind of model you use and the data set itself, both how big it is and how the language in it is used.
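For reference, RMSE itself is simple to compute; this sketch uses made-up years, not the actual Supreme Court results:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: the square root of the average
    squared difference between predictions and actual values."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Hypothetical true opinion years vs. a model's predictions.
error = rmse([1950, 1980, 2000], [1955, 1978, 1994])
```

Lower RMSE means the predicted years land closer to the true years, which is why the plot reads "lower is better".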
Keep in mind that identifying n-grams is computationally expensive. For this experiment, this data set of Supreme Court opinions, using bigrams plus unigrams takes more than twice as long to train as unigrams only. Remember, this is the same number of tokens, so that time mostly isn't coming from the model training; it's mostly coming from the feature engineering. If you add in trigrams, it's almost five times as long as training on unigrams alone, and this is using parallel processing, using all the tricks we know. So when you bump up in n-grams, you bump up in computational cost quite a lot, and the payoff, well, here it actually got worse, but even when you do see improvements, they're often very modest compared to the time you have to spend to get them.
Subword tokenization
All right, so let's go back the other direction and tokenize to units smaller than words, for example character shingles. There are multiple ways to break words up into subwords that are appropriate for machine learning, and often these tokenization approaches or algorithms have the benefit of being able to encode unknown tokens, words that are new at prediction time. You train your whole model with your training data set, but then you put your model into production and you're getting new text in, and one of those new examples of text has a word you didn't see at training time. Subwords can often match to new words.
So using this kind of subword information is a way to incorporate morphological structure into our models. That's a concept from linguistics: words are put together from morphemes, and morphemes often carry meaning. So using subword information is a way to incorporate that kind of morphological structure.
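Character shingles, one simple flavor of subword tokenization, take only a couple of lines in Python; the comment notes how an unseen word can still overlap with training-time subwords:

```python
def character_shingles(word, n=3):
    """Break a word into overlapping character n-grams (shingles),
    one simple flavor of subword tokenization."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# A word never seen during training can still share shingles with
# words that were seen, so the model has something to match on:
# "animals" and "animal" overlap on "ani", "nim", "ima", "mal".
shingles = character_shingles("animals")
```

Production subword tokenizers (byte-pair encoding, unigram language models, and so on) learn their subword inventory from data rather than using a fixed window, but the matching benefit is the same.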
So these results are for a classification model with a data set of very short text, just the names of post offices in the U.S. I created features for the model that are subwords of these post office names, and trained a linear support vector machine. We end up learning that names that start with H or P, or contain that "ale" subword, are more likely to be in Hawaii. On the flip side, subwords like "and", "land", "ri", and "ing", the ones over there in that pink color, are more likely to be outside of Hawaii.
So here, again, we see that the way we tokenized determined what we learned, and determined how well we were able to predict whether a post office is in Hawaii or not. In Tidy Models, we collect all these kinds of decisions about tokenization in code that looks like this. We start with a recipe that specifies what ingredients or variables we'll use, and then we define pre-processing steps. Here we just have one, where we tokenize the text; this example tokenizes to n-grams with n of 1, 2, and 3, so unigrams, bigrams, and trigrams. That's what that options argument does there. So even at this first and arguably most basic step, the choices that we make affect our modeling results in a big way.
Stop words
So next, let's talk about stop words. Once we have split text into tokens, we often find that not all tokens or words carry the same amount of information, if any information at all, for a machine learning task. Common words that are believed or found to carry little to no meaningful information are called stop words, and it's common advice and practice to remove stop words like these for various NLP tasks. What I'm showing here is the entirety of one of the shorter English stop word lists that's used very broadly.
We see lots of pronouns, lots of conjunctions; these are glue words that are used to make a sentence make sense and work structurally. We look at these and we're like, ah, those are not super meaningful on their own. It turns out that the decision just to remove stop words is a bit more involved and maybe more fraught than what you'll find reflected in a lot of tutorials or resources out there.
So almost all the time, like real world NLP practitioners use pre-made stop word lists. So this plot visualizes set intersections for three common stop word lists in English in what is called an upset plot. So the horizontal bars tell you the length of each set, the vertical bars tell you the length of the set intersections. So notice a couple of things. The lengths of the lists are quite different. Notice they don't all contain the same sets of words. So really the most important thing to remember about stop word lexicons is that they are not created in some neutral, perfect setting, but instead they are context specific and they can be biased.
So these things are both true because these lists are created from large data sets of language. So they reflect the characteristics of the data that were used in their creation. So this is the list of the 10 words that are in the smart lexicon, but not in the snowball lexicon. Notice they're all contractions, but actually that's not because the snowball lexicon doesn't include contractions. It actually has a lot. Also notice that, um, that lexicon has he's, but not she's. So, and this is just like one tiny example of the kind of bias that occurs in these lists.
These lists are created from large data sets of text. Lexicon creators look at the most frequent words in some big corpus of language, make a cutoff, and then make some decisions about what to include or exclude around the cutoff, and then you end up here. So when the large data set of language you start with discusses men more often than it discusses women, you end up in this kind of situation.
So, like so many decisions when it comes to modeling with language, we as practitioners have to decide what is appropriate for our particular domain, and it turns out this is even true for picking a stop word list. In Tidy Models, we can implement a pre-processing step like this one by adding an additional step to our recipe. First we specified what variables we'll use, then we tokenize the text, and now we're removing stop words. Here we're just removing a default set of stop words, but this is something you can change with different arguments there.
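Stripped of framework details, stop word removal is just set filtering. The tiny lexicon in this Python sketch is invented, which is exactly why the choice of a real lexicon matters so much:

```python
# A tiny made-up stop word list. Real lexicons like Snowball or
# SMART differ in length and contents, and that choice changes
# which features your model sees.
stop_words = {"the", "is", "a", "and", "of"}

def remove_stop_words(tokens, lexicon):
    """Drop any token that appears in the chosen stop word lexicon."""
    return [t for t in tokens if t not in lexicon]

tokens = "the collared peccary is a pig like animal".split()
kept = remove_stop_words(tokens, stop_words)
```

Swapping in a longer lexicon removes more tokens, and as the Supreme Court experiment shows, more removal is not automatically better.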
This plot explores an experiment to see what happens when we remove different stop words. It's the same modeling problem as before, the Supreme Court opinions, modeling whether they're recent or old, and it uses three different stop word lexicons with different lengths. The Snowball lexicon contains the smallest number of words, and in this case it results in the best performance: removing fewer words results in better performance, and as you take out more words, in this case, the performance gets worse.
This specific result is not generalizable to all data sets in all contexts, but the fact that removing different sets of stop words can have meaningfully different results, that is transferable. Seeing a difference is something that happens very commonly: one set of stop words will work well and another will be much worse. The only way to know, it turns out, is to try several options. There's no way to know ahead of time, for sure, what will work best with your data.
This really highlights how machine learning is an empirical field. There's not much about machine learning that lets you say ahead of time what is going to be best for solving a given problem; the only way to know is by trying different options. This makes it extremely important that you use good statistical practice, so that you are not fooled into thinking something is better when it's not.
Stemming and lemmatization
So now let's talk about another possible feature engineering step. Often when we deal with text, we have documents that contain versions of one base word, which we call a stem in this context. Say we're not interested in the difference between "animals" and "animal", singular; what if we want to treat them both together? That is the idea at the heart of stemming: identifying the stems of words.
Here's a shock: there's no one right, correct way to stem text. This plot shows three approaches to stemming for this example data set of animal descriptions. It starts with, okay, what if we just literally remove the final S. The middle one is a slightly more complex set of rules about handling plural endings, called the S-stemmer. And the pink one is one of the best-known implementations of stemming, called the Porter algorithm. The way Porter stemming works is as a step-by-step set of rules: if the word has this at the end, change it to this; if not, go to the next step. It's a very algorithmic, step-by-step set of rules about how to handle endings.
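To make the "remove the final S" versus rule-based contrast concrete, here's a Python sketch. The s_stemmer below is my paraphrase of the published S-stemmer rules, and the real Porter algorithm has many more steps than either:

```python
def strip_final_s(word):
    """The crudest stemmer: just drop a trailing 's'."""
    return word[:-1] if word.endswith("s") else word

def s_stemmer(word):
    """A sketch of S-stemmer-style rules: handle plural endings
    step by step instead of stripping every trailing 's'."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"           # "ponies" -> "pony"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]                 # "horses" -> "horse"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]                 # "animals" -> "animal"
    return word                          # "glass" stays "glass"
```

The crude version mangles words like "glass", which is exactly the kind of over-stemming the extra rules try to avoid.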
Practitioners typically are interested in stemming text data because it can bucket together tokens that belong together in some way that we understand. We can use these kinds of stemming rules, which, like I said before, are step-by-step, rules-based algorithms, or we can use lemmatization, which works in a different way. The purpose of lemmatization is to identify lemmas, which are very similar in concept to stems, but instead of being based on a set of rules that you step through, lemmatization is based on large dictionaries of text that incorporate linguistic understanding of which words belong together.
Lemmatization usually depends on having a large data set of language from which you can make dictionaries, so you can say, ah, "animal" and "animals" go together, because we have a dictionary that tells us that, instead of using a set of rules to take off the ending. So these are the two approaches: think of one as rules-based and one as linguistics-based.
This seems like it's going to be a helpful thing to do, because when we're dealing with text, we have so many features. When I tokenize text, the features I end up needing to use are the words, the tokens, and if you've got too many tokens, stemming lets you bucket them together.
I want you to notice how many features there are: 16, almost 17,000 features. And also notice how sparse this matrix is. Think of this as the sparsity of the data that we want to use to build a supervised machine learning model. Text data is sparse because of the way natural language works: we tend to use a few words a lot, and there are a lot of words that we only use a few times. So we end up with very, very sparse data.
If I stem this text with the Porter algorithm, we reduce the number of features by a lot, by many thousands. The sparsity here didn't change that much, so we're still going to have to deal with some pretty darn sparse data, but common sense might say, oh, reducing the number of word features that dramatically is going to help; that is going to improve our performance.
So common sense says that reducing the number of features here is going to help our model perform better, but that assumes we haven't lost any important information by stemming. It turns out that both stemming and lemmatization can be helpful in some contexts, but typical stemming algorithms especially (and lemmatization to a lesser extent) are built to be aggressive. They have been built to favor sensitivity, or recall, the true positive rate, and, because there's no free lunch, that comes at the expense of specificity and precision.
In a supervised machine learning context, this affects the model's positive predictive value, its precision: the ability to not incorrectly label true negatives as positives. Basically, stemming can increase a model's ability to find the positive examples. If we were building a classification model about an animal's diet, it can help us find the positive examples, the animal descriptions associated with that diet. However, if the text is over-stemmed, the resulting model loses its ability to label the negative examples correctly, say the descriptions that are not about that certain diet; it struggles to label those correctly.
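The quantities involved are worth pinning down with a tiny worked example; the confusion-matrix counts here are hypothetical:

```python
# Hypothetical confusion-matrix counts from a diet classifier:
# true positives, false positives, false negatives, true negatives.
tp, fp, fn, tn = 40, 20, 10, 30

recall = tp / (tp + fn)        # sensitivity / true positive rate
precision = tp / (tp + fp)     # positive predictive value
specificity = tn / (tn + fp)   # true negative rate
```

In this made-up case recall is 0.8 while precision is only about 0.67: the model finds most positives but mislabels a fair number of negatives as positive, which is the direction aggressive stemming tends to push things.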
So even these really common, basic pre-processing steps for text, like what is shown in this feature engineering recipe, can be computationally expensive, and these choices, whether to remove stop words or not, whether to stem text or not, all have dramatic impact on how machine learning models of all kinds perform. What this means is that as practitioners, as we learn, teach, write, and do the work that we do, being clear about what feature engineering steps we took, and what their impact can be, contributes to better, more robust statistical practice in our field.
Sparsity and word embeddings
I want to go back to the idea of the sparsity of text data, one of the really defining characteristics of text. We end up with a relationship that looks like this in terms of how sparsity changes as you add more documents, and thus more unique words, to a corpus. Take a real data set of documents: start with, say, 10% of the documents and then add more and more, and as we do that, count how many unique words there are, how sparse it gets, how fast, and how much memory it takes for a computer to hold it.
Notice that as we add more unique words, the sparsity goes up really fast, to very high values; this is very, very sparse data. Also notice what is happening to how much RAM it takes, how much memory your computer needs to hold that information. It turns out that what I'm showing you here is already using specialized data structures meant to store sparse data. With a regular matrix, you keep track of the whole thing, including all the zeros, but there are specialized data structures that are more efficient at storing sparse data: instead of holding a whole two-dimensional object, they keep track of row, column, value; row, column, value; and so on, so you never store the zeros.
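That row/column/value trick is coordinate (COO) sparse storage. Here is a minimal Python sketch with a made-up three-document corpus; real implementations live in libraries like scipy.sparse or R's Matrix package:

```python
# A tiny document-term matrix, stored densely and then as sparse
# (row, column, value) triples. Only nonzero counts are kept in the
# sparse version, which is why sparse formats scale to the large
# vocabularies where dense storage cannot.
docs = [
    "mortgage payment error",
    "credit card payment",
    "student loan error error",
]
vocab = sorted({w for doc in docs for w in doc.split()})

# Dense: one cell per (document, word) pair, mostly zeros.
dense = [[doc.split().count(w) for w in vocab] for doc in docs]

# Sparse COO: (row, column, value) triples for nonzero entries only.
coo = [
    (i, j, v)
    for i, row in enumerate(dense)
    for j, v in enumerate(row)
    if v != 0
]

print(len(dense) * len(dense[0]))  # 21 dense cells stored
print(len(coo))                    # only 9 sparse triples stored
```

Even with this trick, the talk's point stands: as the vocabulary grows, the number of nonzero entries, and therefore memory, still grows quickly.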
So even with the computational tricks we have, the memory required to handle these kinds of data sets still grows in a really non-linear way. This means you may just straight up run out of memory, but it also means it takes quite a long time to train your model. This is why training models on text data tends to be quite challenging.
People have known about this for a long time, and linguists have worked for a long time on vector models for language that can reduce the number of dimensions representing text data based on how people use language. This quote actually dates to 1957, before everyone had computers in their houses or anything like that.
So the idea is: let's look at how words are used together, and then use that information to take our super high-dimensional sparse data and create dense word vectors, also called word embeddings. We can use statistical modeling, something straightforward like matrix factorization, or fancier math like neural networks (which you can think of as fancy matrix factorization), to take this really high-dimensional space and create a new lower-dimensional space. That new space is special because it is built from vectors that incorporate information on which words are used together.
This is an approach for saying: I am going to make it more practical to train models by using the fact that words are not independent. Our words are not some scramble of equally likely words thrown together; words are used together in very specific ways. So let's use that information to make these dense word embeddings, so I can train a model.
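The "which words are used together" information starts life as word-word co-occurrence counts. Here is a minimal sketch of counting co-occurrences within a small context window; the documents and window size are invented for illustration. Factorizing the resulting matrix, for example with SVD, is one classic way to get dense word vectors:

```python
from collections import Counter

# Count how often pairs of words appear near each other. The matrix of
# these counts is what techniques like SVD factorize into dense,
# lower-dimensional word vectors.
def cooccurrence(docs, window=2):
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        for i, w in enumerate(tokens):
            # Look ahead up to `window` tokens; count both directions.
            for v in tokens[i + 1 : i + 1 + window]:
                counts[(w, v)] += 1
                counts[(v, w)] += 1
    return counts

docs = [
    "payment error on my mortgage",
    "error with my payment",
]
counts = cooccurrence(docs)
print(counts[("payment", "error")])  # adjacent in the first document
```

Words that genuinely travel together accumulate large counts, and that regularity is exactly what the factorization compresses into the dense vectors.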
You need a really big data set of text to create, or learn, word embeddings. The table I'm showing here is from a set of embeddings that I created using a corpus of complaints to the United States Consumer Financial Protection Bureau, the CFPB, if you've seen it around. These are complaints from consumers about things like credit cards and mortgages: something went wrong with my student loan payment, the company did something bad, that kind of thing. In the new space defined by these embeddings, the word "month" is closest to words like "year", "months", "monthly", "installments", "payment", and "weeks". People are talking about the payment, or the installment, in a certain month. These are words that are close together in the new space because they are used together or in similar ways.
Let's look at another one. In the new space defined by these embeddings I made, the word "error" is closest to words like "mistake", "clerical", "problem", "glitch", "errors", "miscommunication", and "misunderstanding". People say things like: there was an error with my student loan payment, it was a clerical glitch, or there was a misunderstanding about my mortgage. These are the kinds of things people are saying.
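These "closest word in the new space" lookups are just nearest-neighbor queries under cosine similarity. A minimal sketch with invented three-dimensional vectors (real embeddings have hundreds of dimensions learned from a corpus):

```python
import math

# Toy nearest-neighbor lookup in an embedding space using cosine
# similarity. These vectors are invented purely for illustration.
embeddings = {
    "month":   [0.9, 0.1, 0.0],
    "year":    [0.8, 0.2, 0.1],
    "error":   [0.0, 0.9, 0.3],
    "mistake": [0.1, 0.8, 0.4],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest(word):
    """The other word whose vector points most nearly the same way."""
    return max(
        (w for w in embeddings if w != word),
        key=lambda w: cosine(embeddings[word], embeddings[w]),
    )

print(nearest("month"))  # -> 'year' with these toy vectors
print(nearest("error"))  # -> 'mistake' with these toy vectors
```

The same query against a real embedding table is what produced the "month is close to year" and "error is close to mistake" results in the talk.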
You don't have to create embeddings yourself. In fact, it's very common to use pre-trained word embeddings. These are created by someone else based on some huge corpus of data that they have access to and you probably don't; we're talking Google, we're talking Facebook. There are these pre-trained embeddings that people make available for use.
So let's look at that space. This table shows the results for the same word, but from what are called the GloVe embeddings. The GloVe embeddings, to give you an idea, are trained on all of Wikipedia and a bunch of the Google News data sets, huge swaths of the internet: put it into a giant computer, find the word embeddings. Some of the closest words in this space are very similar to before, like "mistake" and "errors", but we no longer have some of that domain-specific flavor, like "clerical", "discrepancy", and "misunderstanding". Now we do have "probability" and "calculation", and people were not using those kinds of words when they talked about their financial product complaints.
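Pre-trained embeddings like GloVe are distributed as plain text: one word per line, followed by its vector. This sketch parses that format from a tiny in-memory string; in practice you would stream a real file, such as glove.6B.100d.txt from the standard GloVe distribution (file name assumed):

```python
# Parse the GloVe text format: "word v1 v2 v3 ..." per line.
# The two made-up lines below stand in for a file with hundreds of
# thousands of rows and 100+ dimensions per vector.
glove_text = """\
error -0.1 0.7 0.2
mistake -0.2 0.6 0.3
"""

def parse_embeddings(text):
    vectors = {}
    for line in text.strip().splitlines():
        word, *values = line.split()
        vectors[word] = [float(v) for v in values]
    return vectors

vectors = parse_embeddings(glove_text)
print(vectors["error"])  # -> [-0.1, 0.7, 0.2]
```

Once loaded, the pre-trained vectors plug into the same nearest-neighbor and modeling machinery as embeddings you learned yourself.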
This highlights that embeddings are trained, or learned, from a large corpus of text data, and the characteristics of that corpus become part of the embeddings. Machine learning in general is exquisitely sensitive to the training data: whatever is in your training data, the model is going to learn it and reify it. This is never more obvious than when dealing with text data, and maybe most of all with word embeddings, because what you are literally doing is learning the way language is used in order to make these lower-dimensional embeddings.
This means that if you apply those pre-trained embeddings to your own data, you might miss out on something that is there, or add in something that isn't there. And, maybe most concerning from an ethical standpoint, any human prejudice or bias in the corpus becomes imprinted into the embeddings.
In fact, when we look at some of the most commonly available embeddings out there, we see that first names more typical for African Americans are associated with more unpleasant feelings than European American, white American first names. Women's first names are more associated with family, and men's first names are more associated with career. Terms associated with women, things like "mother", "aunt", "sister", are more associated with the arts, and terms associated with men, like "brother", "uncle", "father", are more associated with science.
Bias is actually so ingrained in word embeddings that the embeddings can be used to quantify change in social attitudes over time. Digital humanities folks have taken a corpus, sliced the data into decade bins or something similar, found embeddings for each slice, measured the bias in those embeddings, and used that to measure changes in various social attitudes over time.
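Quantifying that kind of association is surprisingly mechanical: compare the mean cosine similarity of two sets of target words to an attribute word, in the spirit of the Word Embedding Association Test (WEAT). All vectors below are invented purely to show the calculation; a real analysis uses learned embeddings and balanced word lists for paired attributes (e.g. career vs. family):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def association(targets, attribute, vectors):
    """Mean cosine similarity of a set of target words to an attribute word."""
    sims = [cosine(vectors[t], vectors[attribute]) for t in targets]
    return sum(sims) / len(sims)

# Invented 2-d vectors: one group of names sits near "career" in the
# space, the other group does not.
vectors = {
    "career":  [1.0, 0.0],
    "name_a1": [0.9, 0.1],
    "name_a2": [0.8, 0.2],
    "name_b1": [0.2, 0.9],
    "name_b2": [0.1, 0.8],
}
gap = association(["name_a1", "name_a2"], "career", vectors) \
    - association(["name_b1", "name_b2"], "career", vectors)
print(gap > 0)  # group A is more associated with "career" here
```

Run the same measurement on embeddings trained from different decades of text and the gap itself becomes a time series of social attitudes.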
Wrapping up
To sum up: when it comes to preprocessing your text data, creating these features that you need to build a model, you have a lot of options and, I think, quite a bit of responsibility. My advice is always to start with simpler models that you are able to understand clearly and deeply, so you have them as a benchmark; to adopt good statistical practices as you train and tune models, so you aren't fooled about the model performance you would expect to see with various approaches; and to use model explainability tools and frameworks so you can understand your less straightforward models. These are all things that my coworkers and I work on and talk about, and we think there is so much possibility, and also such important things to keep in mind, when it comes to this kind of work.
And with that, I will say thank you so much. I want to be sure to thank my teammates on the tidymodels team at RStudio, as well as my co-author on this book, Emil Hvitfeldt. So thank you.
I think we have time for a few questions. The first question I see is about explainability tools. With most explainability tools, one common approach is to train a model on top of a model. If you see something called LIME, that's what that is: it trains a little linear model on top of your complex, fancy model. Other explainability tools you can think of as poking at the model. You train a model and ask, how does this thing work? So you say, okay, I've got 500 real examples. What if I take my 500 real examples and change something in the input so that they all have the same value for some feature? Take this example about houses that I just linked to.
Let's make it so they all have the same number of bedrooms, then put that data in and get the predictions out. It's like I poked the model and asked, how do you work? Or say I'm interested in how price is related to the size of the house, its square footage. You take your 500 real data points, and for each of those, you change the square footage from very low to very high. You keep everything else about the house the same, change only the square footage, put it into the model, and see what you get out. You poke at it to see how it behaves. Most explainability tools work like that.
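That "poke the model" procedure is essentially a partial dependence profile. Here is a sketch with a stand-in model function; the linear pricing rule and feature names are made up, and in practice the model would be a fitted black box you cannot read off directly:

```python
# Partial dependence sketch: hold every feature fixed except one,
# sweep that one across a grid, and average the model's predictions.
def model(house):
    # Stand-in for a fitted model: price rises with size and bedrooms.
    return 50_000 + 120 * house["sqft"] + 8_000 * house["bedrooms"]

houses = [
    {"sqft": 1200, "bedrooms": 2},
    {"sqft": 2400, "bedrooms": 4},
]

def partial_dependence(model, examples, feature, grid):
    """Average prediction over examples with `feature` forced to each grid value."""
    profile = []
    for value in grid:
        preds = [model({**ex, feature: value}) for ex in examples]
        profile.append(sum(preds) / len(preds))
    return profile

profile = partial_dependence(model, houses, "sqft", [1000, 2000, 3000])
print(profile)  # -> [194000.0, 314000.0, 434000.0]
```

The shape of that profile, here a straight line, is the answer to "how does price respond to square footage?" without ever opening up the model's internals.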
Oh, Todd, do you have one? I was just going to raise my hand. Julia, thank you for joining us. You are a huge inspiration to me and to a lot of people that I know. Thank you for doing this; this is amazing. NLP is one of the things that I'm working on and it is absolutely kicking my butt, so this is incredible. I love working with text, but there really are a lot of things about it that are different from the rectangular data sets we often get trained on or have more experience with.
Yeah, I just have to echo what Todd said. Thank you so much for coming and speaking for us; it's been a really great time to learn from you. Thank you. Well, I love your YouTube videos. I think you should quit your job and just focus on those. I need to just do that.
So I had a question, Julia. I feel like I misunderstood one of the plots. There was a plot with n-grams, I think, and mean squared error on the y-axis, and there was less mean squared error when it was just unigrams. Which made me wonder: when it says "1 and 2", does that mean that you've included both unigrams and bigrams in the model? It blew my mind that performance went down.
Oh, okay. So it's holding the number of tokens constant; all of these models have a thousand tokens. This does not involve anything fancy in feature selection, except that it is a lasso model: here are a thousand tokens, lasso, please tell me which ones are best. The difference is that in the one labeled "1", the thousand features are all unigrams, and the cutoff was the thousand most common ones. In the one labeled "1 and 2", it's a thousand tokens of unigrams and bigrams: find all the unigrams and bigrams, find the thousand most common ones, and then let lasso tell me which are important. And the result got worse. It did. For the same number of tokens, trying to use bigrams made it worse. Yes.
It probably means that, well, this is a data set with, I think, thousands of examples. If you look at how language is used in Supreme Court opinions, 5,000 or so examples is not enough to start learning from the bigrams; you would need way more. So it's not always true. We actually know that Supreme Court opinions are written with a very big vocabulary, and that's related: you might not get the same result with text that uses more of the common words and less of the uncommon words. It really depends on the nature of your data and how much of it you have to learn from.
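The "same number of tokens" setup from that answer can be sketched directly: count unigrams (and optionally bigrams), then keep only the most common ones as features. With a fixed budget, frequent bigrams take slots away from unigrams; the toy documents and tiny budget below are made up to show the mechanism:

```python
from collections import Counter

def ngrams(tokens, n):
    """All length-n runs of tokens, joined with spaces."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def top_tokens(docs, max_tokens, use_bigrams=False):
    """Most common unigram (and optionally bigram) features, like a
    max-tokens cutoff in a text preprocessing recipe."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        counts.update(ngrams(tokens, 1))
        if use_bigrams:
            counts.update(ngrams(tokens, 2))
    return [tok for tok, _ in counts.most_common(max_tokens)]

docs = ["the court held that", "the court ruled that"]
print(top_tokens(docs, 4))                    # unigrams only
print(top_tokens(docs, 4, use_bigrams=True))  # "the court" crowds in
```

Whether that trade is worth it depends, as the answer says, on having enough data for the bigram counts to be informative rather than noise.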
Thank you, that's so insightful. Thank you for that great question. Awesome. Well, it was so nice to meet you all and to get to talk. Thank you so much for coming. Thank you, Julia. Thanks. Have a great night.

