
Creating Features for Machine Learning from Text – Julia Silge, March 2022
Julia Silge is a software engineer at RStudio PBC where she works on open source modeling tools. She holds a PhD in astrophysics and has worked as a data scientist in tech and the nonprofit sector, as well as a technical advisory committee member for the US Bureau of Labor Statistics. She is an author, an international keynote speaker, and a real-world practitioner focusing on data analysis and machine learning. Julia loves text analysis, making beautiful charts, and communicating about technical topics with diverse audiences.

Natural language that we as speakers and writers use must be dramatically transformed to new representations for analysis, whether we are just starting off with exploratory data analysis or are ready to train machine learning algorithms such as predictive models. We can explore typical text preprocessing steps from the ground up, from tokenization to building word embeddings, and consider the effects of these steps. When are these preprocessing steps helpful, and when are they not? In this talk, learn about the process of text preprocessing for ML models in the real world, how and when practitioners use different preprocessing choices, and considerations for text ML tooling.

#rstats #nlp #juliasilge #coding #machinelearning https://rug-at-hdsi.org/ https://twitter.com/RUGatHDSI
Transcript
This transcript was generated automatically and may contain errors.
And I'd love to introduce our speaker for today, Julia. I'll read her wonderful bio to you and then cede the rest of the time to her so she can do as she pleases with it. Julia Silge is a software engineer at RStudio where she works on open source modeling tools. She holds a PhD in astrophysics and has worked as a data scientist in tech and the non-profit sector, as well as a technical advisory committee member for the U.S. Bureau of Labor Statistics. She is an author, an international keynote speaker, and a real-world practitioner focused on data analysis and machine learning. Julia loves text analysis, making beautiful charts, and communicating about technical topics with diverse audiences. So without further ado, Julia, go ahead.
Thank you so, so much for that lovely introduction.
All right, so I'm really happy to be here talking with you all, specifically about feature engineering for text data. I think this is such a great and important topic for a couple of reasons. One is that a better understanding of what we do to text data to make it appropriate as input for machine learning algorithms has a whole lot of benefits: if you're directly getting ready to train a model with text as inputs, if you're at the beginning of a text analysis project and want to understand how these analysis steps may be reused later, maybe by your own self at some future point, for feature engineering, or if you're trying to understand the behavior of a model that you're interacting with in some way.
And we do this through cloud machine learning or AI services, and honestly, we do it a lot in our daily lives now, dealing with the outputs of models that were trained using text data. So understanding more about how this happens has a really wide range of benefits.
From natural language to numeric representation
So when we build models for text, either supervised models or unsupervised models, we start with something that looks like this. This is just some example data; I'm going to use it a couple of times during this talk. It's example data that describes animals, and then we have some quantitative, structured information on their diet, like what kind of diet they have.
So this looks familiar to me and to you as writers and speakers of human language. We can look at this; I could read it out loud to you, or read it in my head, and understand what it says. This kind of natural language is being generated all the time in all kinds of contexts. If you work in health care or finance or tech, not to mention the actual digital humanities, you see this kind of text data being generated by electronic health records, by survey takers, by social media; there are tons of organizational processes that generate this kind of text.
So we've got a lot of this. However, computers are not great at doing math on language represented like this. Instead, language has to be dramatically transformed into a machine-readable, numeric representation, and it will look more like this, to be ready for computation for basically any kind of model.
So I spent a fair amount of time working on software for people to do exploratory data analysis, visualization, summarization, and so forth with text data in a tidy format, where we have one observation per row. But when it comes time to build a model, to use some kind of underlying mathematical implementation, we almost always need something like this. This particular thing here is called a document-term matrix.
So in this matrix, the columns all belong to terms, and each row is a different document. What goes in there, in this particular example, is the counts: how many times does a document use a certain word? The exact representation may differ from this. You might weight by tf-idf instead of using counts, or you might keep sequence information: instead of counting things up or weighting them, you might keep track of where in a document a certain word or token appeared. But for basically all text modeling, from simpler models like naive Bayes through word embeddings to even the most state-of-the-art approaches today, say transformer models, we have to heavily feature engineer and process language to get it into a representation that's suitable for machine learning algorithms.
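The document-term-matrix idea is easy to sketch outside of any particular framework. Here's a minimal Python illustration (not the tidymodels code from the talk; the two toy documents are made up, and real tooling would use sparse storage):

```python
from collections import Counter

def document_term_matrix(documents):
    """Build a document-term matrix of raw counts: one row per
    document, one column per term in the sorted vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            row[index[tok]] = n  # cell = how often this doc uses this term
        matrix.append(row)
    return vocab, matrix

vocab, dtm = document_term_matrix([
    "the collared peccary is a pig like animal",
    "the peccary eats roots and other plants",
])
```

Swapping the raw counts for tf-idf weights, or for positions, changes the representation but not the basic shape of the matrix.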
Tidy Models and the Recipes framework
So I work in my day job on an open source framework in R for modeling and machine learning that's called Tidy Models. And a lot of the examples I'll be showing today use Tidy Models code. So some of the specific goals of the Tidy Models project are to provide a consistent, flexible framework for real world modeling practitioners, from those just starting out beginners to very experienced people, to harmonize the heterogeneous interfaces that we have within R, and to encourage good statistical practice.
So I'm glad to get to show a little bit of what it is that I work on and build and how it applies to text modeling. But a lot of what we're going to be talking about today isn't super specific to Tidy Models; it isn't even really specific to R. Instead, we're focusing on the basics of how we transform text into predictors for machine learning.
So Tidy Models is a meta package in a similar way that the Tidyverse is. If you've ever typed library(tidyverse) and then used ggplot2 for visualization and dplyr for data manipulation, Tidy Models works in a similar way, because it turns out modeling has a lot of different pieces to it, a lot of different kinds of tasks. Pre-processing or feature engineering, what we're focusing on today, is part of a broader modeling process.
It starts, you might really argue, with exploratory data analysis, and then let's say it comes to completion with model evaluation, with you understanding how well your model is performing. Tidy Models as a piece of software is made up of R packages, each of which has a specific focus: rsample is for resampling data, and tune is for hyperparameter tuning. One of these packages implements feature engineering transformations, and it's the one called recipes.
So in Tidy Models, we capture the concepts of data pre-processing and feature engineering in this idea of a pre-processing recipe that has steps. You choose which ingredients or variables you're going to use, you define the steps that you want to take, and then you prepare those steps, applied to those ingredients. Then you can apply that recipe to any kind of new data set: testing data, if you're in the process of training your model, or new data, if you've put your model into production and you're getting new predictions.
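That prepare-then-apply pattern can be sketched in a few lines. This toy Python class (the name TokenCountRecipe is invented for illustration; it is not a tidymodels API) learns its vocabulary from training data once, then applies the same transformation to any new data:

```python
class TokenCountRecipe:
    """A toy stand-in for a pre-processing recipe: learn the
    vocabulary from training data ("prepare"), then apply the same
    transformation to any new data set."""

    def fit(self, documents):
        # Prepare: learn the ingredients from the training data only.
        self.vocab_ = sorted({t for d in documents for t in d.lower().split()})
        return self

    def transform(self, documents):
        # Apply: use the already-prepared steps on new data. Words
        # unseen at training time are simply dropped, exactly as
        # they would be at prediction time.
        rows = []
        for d in documents:
            counts = {}
            for t in d.lower().split():
                if t in self.vocab_:
                    counts[t] = counts.get(t, 0) + 1
            rows.append([counts.get(term, 0) for term in self.vocab_])
        return rows

recipe = TokenCountRecipe().fit(["a b b", "b c"])   # training data
new_rows = recipe.transform(["c d a"])              # new data at prediction time
```

The important design point is that nothing about the new data leaks into the prepared steps; the vocabulary is fixed at training time.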
So the variables or ingredients that we use in modeling come in all kinds of shapes and sizes. And that includes text data. So some of the techniques and approaches that we use for pre-processing text data are the same as for any other kind of data that we might have. But some of what you need to know to be able to do a good job in this process for text is different and is specific to the nature of language data.
So I've written a book with my co-author, Emil Hvitfeldt, on supervised machine learning for text analysis in R. Fully the first third of the book focuses on how we transform natural language into features for modeling. The middle section of the book is how to use those features in what you might call simpler or more traditional machine learning models, say regularized regression, support vector machines, and other more straightforward machine learning models that work well with text. The last third of the book talks about how we use deep learning models with text data.
So deep learning models still require these kinds of transformations, where we go from natural language and end up with some mathematical representation. Deep learning models don't get you out of the need for understanding this kind of pre-processing. They are different, and they are often able to inherently learn structure or features from text, but using deep learning models doesn't get you out of the need for doing and understanding feature engineering for text altogether.
So this book is now complete and available. Folks have print copies, and it's also available in its entirety at smltar.com. That's how we like to say it, smltar, like it's a dragon or something, I think. So if you are new to dealing with text data, understanding these fundamental pre-processing approaches for text will set you up to train effective models. If you're really experienced with text data, you've probably noticed, and this was part of our motivation for working on this book, that the existing literature is kind of sparse when it comes to detailed, thoughtful explorations of how these pre-processing steps work and how choices made in these feature engineering steps tend to impact our model output.
Tokenization
So let's walk through a couple of these kinds of examples and talk about how they work, and let's start with what you might think of as the first or the most basic: tokenization. This is typically one of the first steps of transformation from natural language to machine learning features, and honestly, in any kind of text analysis, including exploratory data analysis before you build a model, tokenization is often the first thing you need to do.
So in tokenization, we take an input, a string, and a token type, which is some meaningful unit of text, such as a word, and then we split the input into pieces, or tokens, that correspond to that type. Most commonly, the meaningful unit of text that we want to split our text into is a word. So you might be like, oh yeah, okay, thanks, that's super exciting. It might seem straightforward, but it turns out it's difficult to clearly define what a word is for many or even most languages.
So many languages don't use whitespace between words, so you can't use whitespace for tokenization. Languages that do use whitespace, like English, often have particular examples that are ambiguous. Romance languages like Italian, French, or Spanish often use pronouns or negation words that may be better considered prefixes despite the space. And then we've got English contractions, like "didn't", that may more accurately be considered two words with no space.
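You can see the ambiguity with a couple of lines of Python; the sentence is invented, and each pattern below encodes a different, equally defensible definition of a "word":

```python
import re

text = "The peccary didn't run."

# Naive whitespace tokenization keeps punctuation attached to tokens.
whitespace_tokens = text.split()
# → ["The", "peccary", "didn't", "run."]

# A regex tokenizer must decide what a "word" is. Allowing an internal
# apostrophe keeps the contraction "didn't" as one token.
one_word = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

# Letters only: the contraction is split, treating it as two words.
two_words = re.findall(r"[A-Za-z]+", text)
```

Neither choice is wrong; they just produce different features downstream.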
So once we have tokenized text, it's on its way to being usable. We make some choice, and then we can use it in EDA, unsupervised algorithms, or as features for predictive modeling, which is what these results show here. These are from a regression model trained on the descriptions of media from artwork in the Tate collection, media meaning what kind of art medium was used to create a piece of art. And what we're predicting with this model is when it was created.
What we see here is that artwork created using graphite, watercolor, or engraving is more likely to have been created earlier, with older art, and artwork created using photography, screen printing, and dung and glitter is more likely to have been created later, with newer, more contemporary art. So the way that we tokenize this natural, human-generated language we started with, describing the medium of the artwork, has a big impact on what we learn from it. If we had tokenized in a different way, we would have gotten different results, both in terms of model performance and in terms of how we interpret the model, what we learn from it.
N-gram tokenization
There are other ways to tokenize text. Instead of breaking text up into single words, called unigrams, you can tokenize to n-grams. An n-gram is a contiguous sequence of n items from a given sequence of text. What I'm showing on screen here is that same text from before, describing this animal, divided up into bigrams, n-grams with two tokens. Notice how the words in the bigrams overlap, so that we see the word "collared" in both of the first two bigrams. N-gram tokenization slides along the text to create overlapping sets of tokens.
This shows trigrams for the same text: a sliding window of three words that moves along, identifying trigrams. Using unigrams is faster and more efficient, both to identify and to model on, but with only unigrams we don't capture any information about word order. If you ever hear the phrase "bag of words", that's what it means: we took all the words, threw them in a bag, shuffled them around, and kept no information about word order. Using a higher value for n keeps more information, but the vector space of tokens increases dramatically, and that corresponds to a reduction in token counts.
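The sliding window itself is tiny to implement. Here's a Python sketch, using my paraphrase of the example sentence from the slides:

```python
def ngrams(tokens, n):
    """Slide a window of length n along the token list, producing
    overlapping n-grams (n=1 unigrams, n=2 bigrams, n=3 trigrams)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the collared peccary is a pig like animal".split()
bigrams = ngrams(tokens, 2)   # "collared" appears in the first two bigrams
trigrams = ngrams(tokens, 3)
```

Note how the token counts shrink as n grows: 8 tokens yield 7 bigrams and only 6 trigrams, while the space of possible tokens explodes.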
Combining degrees of n-grams can be a good idea. This allows you to extract different levels of detail from text data. Unigrams tell you which individual words have been used a lot of times, and it's not uncommon for some of the most common unigrams to get overlooked if you're only looking at bigrams or trigrams, because they don't co-appear with any other particular word that often.
So how does this turn out? What kind of results do we see? This plot compares model performance, in terms of RMSE (root mean squared error) on the y-axis, for a lasso regression model predicting the year of Supreme Court opinions with three different degrees of n-grams. This whole little experiment held the total number of tokens constant at a thousand. Notice that using unigrams alone performed best for this corpus of Supreme Court opinions, and performance gets worse and worse as we add higher-order n-grams. This isn't always the case; it depends on the kind of model you use and the data set itself, both how big it is and how the language in it is used.
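For reference, RMSE itself is simple to compute; this sketch uses made-up years, not the actual Supreme Court results:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: the square root of the average
    squared difference between predictions and actual values."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Hypothetical true opinion years vs. a model's predictions.
error = rmse([1950, 1980, 2000], [1955, 1978, 1994])
```

Lower RMSE means the predicted years land closer to the true years, which is why the plot reads "lower is better".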
Keep in mind that identifying n-grams is computationally expensive. For this experiment, this data set of Supreme Court opinions, using bigrams plus unigrams takes more than twice as long to train as unigrams only. Remember, this is the same number of tokens, so that time mostly isn't coming from the model training; it's mostly coming from the feature engineering. If you add in trigrams, it's almost five times as long as training on unigrams alone, and this is using parallel processing, using all the tricks we know. So when you bump up in n-grams, you bump up in computational cost quite a lot, and the payoff, well, here it actually got worse, but even when you do see improvements, they're often very modest compared to the time you have to spend to get them.
Subword tokenization
All right, so let's go back the other direction and tokenize to units smaller than words, for example character shingles. There are multiple ways to break words up into subwords that are appropriate for machine learning, and often these tokenization approaches or algorithms have the benefit of being able to encode unknown tokens, words that are new at prediction time. You train your whole model with your training data set, but then you put your model into production and you're getting new text in, and one of those new examples of text has a word you didn't see at training time. Subwords can often match to new words.
So using this kind of subword information is a way to incorporate morphological structure into our models. That's a concept from linguistics: words are put together from morphemes, and morphemes often carry meaning. So using subword information is a way to incorporate that kind of morphological structure.
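Character shingles, one simple flavor of subword tokenization, take only a couple of lines in Python; the comment notes how an unseen word can still overlap with training-time subwords:

```python
def character_shingles(word, n=3):
    """Break a word into overlapping character n-grams (shingles),
    one simple flavor of subword tokenization."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# A word never seen during training can still share shingles with
# words that were seen, so the model has something to match on:
# "animals" and "animal" overlap on "ani", "nim", "ima", "mal".
shingles = character_shingles("animals")
```

Production subword tokenizers (byte-pair encoding, unigram language models, and so on) learn their subword inventory from data rather than using a fixed window, but the matching benefit is the same.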
So these results are for a classification model with a data set of very short text, just the names of post offices in the U.S. I created features for the model that are subwords of these post office names, and trained a linear support vector machine. We end up learning that names that start with H or P, or contain that "ale" subword, are more likely to be in Hawaii. On the flip side, subwords like "and", "land", "ri", and "ing", the ones over there in that pink color, are more likely to be outside of Hawaii.
So here, again, we see that the way we tokenized determined what we learned, and determined how well we were able to predict whether a post office is in Hawaii or not. In Tidy Models, we collect all these kinds of decisions about tokenization in code that looks like this. We start with a recipe that specifies what ingredients or variables we'll use, and then we define pre-processing steps. Here we just have one, where we tokenize the text; this example tokenizes to n-grams with n of 1, 2, and 3, so unigrams, bigrams, and trigrams. That's what that options argument does there. So even at this first and arguably most basic step, the choices that we make affect our modeling results in a big way.
Stop words
So next, let's talk about stop words. Once we have split text into tokens, we often find that not all tokens or words carry the same amount of information, if any information at all, for a machine learning task. Common words that are believed or found to carry little to no meaningful information are called stop words, and it's common advice and practice to remove stop words like these for various NLP tasks. What I'm showing here is the entirety of one of the shorter English stop word lists that's used very broadly.
We see lots of pronouns, lots of conjunctions; these are glue words that are used to make a sentence make sense and work structurally. We look at these and we're like, ah, those are not super meaningful on their own. It turns out that the decision just to remove stop words is a bit more involved and maybe more fraught than what you'll find reflected in a lot of tutorials or resources out there.
So almost all the time, like real world NLP practitioners use pre-made stop word lists. So this plot visualizes set intersections for three common stop word lists in English in what is called an upset plot. So the horizontal bars tell you the length of each set, the vertical bars tell you the length of the set intersections. So notice a couple of things. The lengths of the lists are quite different. Notice they don't all contain the same sets of words. So really the most important thing to remember about stop word lexicons is that they are not created in some neutral, perfect setting, but instead they are context specific and they can be biased.
So these things are both true because these lists are created from large data sets of language. So they reflect the characteristics of the data that were used in their creation. So this is the list of the 10 words that are in the smart lexicon, but not in the snowball lexicon. Notice they're all contractions, but actually that's not because the snowball lexicon doesn't include contractions. It actually has a lot. Also notice that, um, that lexicon has he's, but not she's. So, and this is just like one tiny example of the kind of bias that occurs in these lists.
These lists are created from large data sets of text. Lexicon creators look at the most frequent words in some big corpus of language, make a cutoff, and then make some decisions about what to include or exclude around the cutoff, and then you end up here. So when the large data set of language you start with discusses men more often than it discusses women, you end up in this kind of situation.
So, like so many decisions when it comes to modeling with language, we as practitioners have to decide what is appropriate for our particular domain, and it turns out this is even true for picking a stop word list. In Tidy Models, we can implement a pre-processing step like this one by adding an additional step to our recipe. First we specified what variables we'll use, then we tokenize the text, and now we're removing stop words. Here we're just removing a default set of stop words, but this is something you can change with different arguments there.
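Stripped of framework details, stop word removal is just set filtering. The tiny lexicon in this Python sketch is invented, which is exactly why the choice of a real lexicon matters so much:

```python
# A tiny made-up stop word list. Real lexicons like Snowball or
# SMART differ in length and contents, and that choice changes
# which features your model sees.
stop_words = {"the", "is", "a", "and", "of"}

def remove_stop_words(tokens, lexicon):
    """Drop any token that appears in the chosen stop word lexicon."""
    return [t for t in tokens if t not in lexicon]

tokens = "the collared peccary is a pig like animal".split()
kept = remove_stop_words(tokens, stop_words)
```

Swapping in a longer lexicon removes more tokens, and as the Supreme Court experiment shows, more removal is not automatically better.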
This plot explores an experiment to see what happens when we remove different stop words. It's the same modeling problem as before, the Supreme Court opinions, modeling whether they're recent or old, and it uses three different stop word lexicons with different lengths. The Snowball lexicon contains the smallest number of words, and in this case it results in the best performance: removing fewer words results in better performance, and as you take out more words, in this case, the performance gets worse.
This specific result is not generalizable to all data sets in all contexts, but the fact that removing different sets of stop words can have meaningfully different results, that is transferable. Seeing a difference is something that happens very commonly: one set of stop words will work well and another will be much worse. The only way to know, it turns out, is to try several options. There's no way to know ahead of time, for sure, what will work best with your data.
This really highlights how machine learning is an empirical field. There's not much about machine learning that lets you say ahead of time what is going to be best for solving a given problem; the only way to know is by trying different options. This makes it extremely important that you use good statistical practice, so that you are not fooled into thinking something is better when it's not.
Stemming and lemmatization
So now let's talk about another possible feature engineering step. Often when we deal with text, we have documents that contain versions of one base word, which we call a stem in this context. Say we're not interested in the difference between "animals" and "animal", singular; what if we want to treat them both together? That is the idea at the heart of stemming: identifying the stems of words.
Here's a shock: there's no one right, correct way to stem text. This plot shows three approaches to stemming for this example data set of animal descriptions. It starts with, okay, what if we just literally remove the final S. The middle one is a slightly more complex set of rules about handling plural endings, called the S-stemmer. And the pink one is one of the best-known implementations of stemming, called the Porter algorithm. The way Porter stemming works is as a step-by-step set of rules: if the word has this at the end, change it to this; if not, go to the next step. It's a very algorithmic, step-by-step set of rules about how to handle endings.
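To make the "remove the final S" versus rule-based contrast concrete, here's a Python sketch. The s_stemmer below is my paraphrase of the published S-stemmer rules, and the real Porter algorithm has many more steps than either:

```python
def strip_final_s(word):
    """The crudest stemmer: just drop a trailing 's'."""
    return word[:-1] if word.endswith("s") else word

def s_stemmer(word):
    """A sketch of S-stemmer-style rules: handle plural endings
    step by step instead of stripping every trailing 's'."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"           # "ponies" -> "pony"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]                 # "horses" -> "horse"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]                 # "animals" -> "animal"
    return word                          # "glass" stays "glass"
```

The crude version mangles words like "glass", which is exactly the kind of over-stemming the extra rules try to avoid.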
Practitioners typically are interested in stemming text data because it can bucket together tokens that belong together in some way that we understand. We can use these kinds of stemming rules, which, like I said before, are step-by-step, rules-based algorithms, or we can use lemmatization, which works in a different way. The purpose of lemmatization is to identify lemmas, which are very similar in concept to stems, but instead of being based on a set of rules that you step through, lemmatization is based on large dictionaries of text that incorporate linguistic understanding of which words belong together.
Lemmatization usually depends on having a large data set of language from which you can make dictionaries, so you can say, ah, "animal" and "animals" go together, because we have a dictionary that tells us that, instead of using a set of rules to take off the ending. So these are the two approaches: think of one as rules-based and one as linguistics-based.
This seems like it's going to be a helpful thing to do, because when we're dealing with text, we have so many features. When I tokenize text, the features I end up needing to use are the words, the tokens, and if you've got too many tokens, stemming lets you bucket them together.
I want you to notice how many features there are: 16, almost 17,000 features. And also notice how sparse this matrix is. Think of this as the sparsity of the data that we want to use to build a supervised machine learning model. Text data is sparse because of the way natural language works: we tend to use a few words a lot, and there are a lot of words that we only use a few times. So we end up with very, very sparse data.
If I stem this text with the Porter algorithm, we reduce the number of features by a lot, by many thousands. The sparsity here didn't change that much, so we're still going to have to deal with some pretty darn sparse data, but common sense might say, oh, reducing the number of word features that dramatically is going to help; that is going to improve our performance.
So common sense says that reducing the number of features here is going to help our model perform better, but that assumes we haven't lost any important information by stemming. It turns out that both stemming and lemmatization can be helpful in some contexts, but typical stemming algorithms especially (and lemmatization to a lesser extent) are built to be aggressive. They have been built to favor sensitivity, or recall, the true positive rate, and, because there's no free lunch, that comes at the expense of specificity and precision.
In a supervised machine learning context, this affects the model's positive predictive value, its precision: the ability to not incorrectly label true negatives as positives. Basically, stemming can increase a model's ability to find the positive examples. If we were building a classification model about an animal's diet, it can help us find the positive examples, the animal descriptions associated with that diet. However, if the text is over-stemmed, the resulting model loses its ability to label the negative examples correctly, say the descriptions that are not about that certain diet; it struggles to label those correctly.
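The quantities involved are worth pinning down with a tiny worked example; the confusion-matrix counts here are hypothetical:

```python
# Hypothetical confusion-matrix counts from a diet classifier:
# true positives, false positives, false negatives, true negatives.
tp, fp, fn, tn = 40, 20, 10, 30

recall = tp / (tp + fn)        # sensitivity / true positive rate
precision = tp / (tp + fp)     # positive predictive value
specificity = tn / (tn + fp)   # true negative rate
```

In this made-up case recall is 0.8 while precision is only about 0.67: the model finds most positives but mislabels a fair number of negatives as positive, which is the direction aggressive stemming tends to push things.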
So even these really common, basic pre-processing steps for text, like what is shown in this feature engineering recipe, can be computationally expensive, and these choices, whether to remove stop words or not, whether to stem text or not, all have dramatic impact on how machine learning models of all kinds perform. What this means is that as practitioners, as we learn, teach, write, and do the work that we do, being clear about what feature engineering steps we took, and what their impact can be, contributes to better, more robust statistical practice in our field.
Sparsity and word embeddings
I want to go back to the idea of the sparsity of text data, one of the really defining characteristics of text. We end up with a relationship that looks like this in terms of how sparsity changes as you add more documents, and thus more unique words, to a corpus. Take a real data set of documents: start with, say, 10% of the documents and then add more and more, and as we do that, count how many unique words there are, how sparse it gets, how fast, and how much memory it takes for a computer to hold it.
Notice that as we add more unique words, the sparsity goes up really fast, to very high values; this is very, very sparse data. Also notice what is happening to how much RAM it takes, how much memory your computer needs to hold that information. It turns out that what I'm showing you here is already using specialized data structures meant to store sparse data. With a regular matrix, you keep track of the whole thing, including all the zeros, but there are specialized data structures that are more efficient at storing sparse data: instead of holding a whole two-dimensional object, they keep track of row, column, value; row, column, value; and so on, so you never store the zeros.
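That row/column/value trick is coordinate (COO) sparse storage. Here is a minimal Python sketch with a made-up three-document corpus; real implementations live in libraries like scipy.sparse or R's Matrix package:

```python
# A tiny document-term matrix, stored densely and then as sparse
# (row, column, value) triples. Only nonzero counts are kept in the
# sparse version, which is why sparse formats scale to the large
# vocabularies where dense storage cannot.
docs = [
    "mortgage payment error",
    "credit card payment",
    "student loan error error",
]
vocab = sorted({w for doc in docs for w in doc.split()})

# Dense: one cell per (document, word) pair, mostly zeros.
dense = [[doc.split().count(w) for w in vocab] for doc in docs]

# Sparse COO: (row, column, value) triples for nonzero entries only.
coo = [
    (i, j, v)
    for i, row in enumerate(dense)
    for j, v in enumerate(row)
    if v != 0
]

print(len(dense) * len(dense[0]))  # 21 dense cells stored
print(len(coo))                    # only 9 sparse triples stored
```

Even with this trick, the talk's point stands: as the vocabulary grows, the number of nonzero entries, and therefore memory, still grows quickly.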
So even with the computational tricks we have, the memory required to handle these kinds of data sets still grows in a really non-linear way. This means you may just straight up run out of memory, but it also means it takes quite a long time to train your model. This is why training models on text data tends to be quite challenging.
People have known about this for a long time, and linguists have worked for a long time on vector models for language that can reduce the number of dimensions representing text data based on how people use language. This quote actually dates to 1957, before everyone had computers in their houses or anything like that.
So the idea is: let's look at how words are used together, and then use that information to take our super high-dimensional sparse data and create dense word vectors, also called word embeddings. We can use statistical modeling, something straightforward like matrix factorization, or fancier math like neural networks (which you can think of as fancy matrix factorization), to take this really high-dimensional space and create a new lower-dimensional space. That new space is special because it is built from vectors that incorporate information on which words are used together.
This is an approach for saying: I am going to make it more practical to train models by using the fact that words are not independent. Our words are not some scramble of equally likely words thrown together; words are used together in very specific ways. So let's use that information to make these dense word embeddings, so I can train a model.
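The "which words are used together" information starts life as word-word co-occurrence counts. Here is a minimal sketch of counting co-occurrences within a small context window; the documents and window size are invented for illustration. Factorizing the resulting matrix, for example with SVD, is one classic way to get dense word vectors:

```python
from collections import Counter

# Count how often pairs of words appear near each other. The matrix of
# these counts is what techniques like SVD factorize into dense,
# lower-dimensional word vectors.
def cooccurrence(docs, window=2):
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        for i, w in enumerate(tokens):
            # Look ahead up to `window` tokens; count both directions.
            for v in tokens[i + 1 : i + 1 + window]:
                counts[(w, v)] += 1
                counts[(v, w)] += 1
    return counts

docs = [
    "payment error on my mortgage",
    "error with my payment",
]
counts = cooccurrence(docs)
print(counts[("payment", "error")])  # adjacent in the first document
```

Words that genuinely travel together accumulate large counts, and that regularity is exactly what the factorization compresses into the dense vectors.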
You need a really big data set of text to create, or learn, word embeddings. The table I'm showing here is from a set of embeddings that I created using a corpus of complaints to the United States Consumer Financial Protection Bureau, the CFPB, if you've seen it around. These are complaints from consumers about things like credit cards and mortgages: something went wrong with my student loan payment, the company did something bad, that kind of thing. In the new space defined by these embeddings, the word "month" is closest to words like "year", "months", "monthly", "installments", "payment", and "weeks". People are talking about the payment, or the installment, in a certain month. These are words that are close together in the new space because they are used together or in similar ways.
Let's look at another one. In the new space defined by these embeddings I made, the word "error" is closest to words like "mistake", "clerical", "problem", "glitch", "errors", "miscommunication", and "misunderstanding". People say things like: there was an error with my student loan payment, it was a clerical glitch, or there was a misunderstanding about my mortgage. These are the kinds of things people are saying.
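These "closest word in the new space" lookups are just nearest-neighbor queries under cosine similarity. A minimal sketch with invented three-dimensional vectors (real embeddings have hundreds of dimensions learned from a corpus):

```python
import math

# Toy nearest-neighbor lookup in an embedding space using cosine
# similarity. These vectors are invented purely for illustration.
embeddings = {
    "month":   [0.9, 0.1, 0.0],
    "year":    [0.8, 0.2, 0.1],
    "error":   [0.0, 0.9, 0.3],
    "mistake": [0.1, 0.8, 0.4],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest(word):
    """The other word whose vector points most nearly the same way."""
    return max(
        (w for w in embeddings if w != word),
        key=lambda w: cosine(embeddings[word], embeddings[w]),
    )

print(nearest("month"))  # -> 'year' with these toy vectors
print(nearest("error"))  # -> 'mistake' with these toy vectors
```

The same query against a real embedding table is what produced the "month is close to year" and "error is close to mistake" results in the talk.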
You don't have to create embeddings yourself. In fact, it's very common to use pre-trained word embeddings. These are created by someone else based on some huge corpus of data that they have access to and you probably don't; we're talking Google, we're talking Facebook. There are these pre-trained embeddings that people make available for use.
So let's look at that space. This table shows the results for the same word, but from what are called the GloVe embeddings. The GloVe embeddings, to give you an idea, are trained on all of Wikipedia and a bunch of the Google News data sets, huge swaths of the internet: put it into a giant computer, find the word embeddings. Some of the closest words in this space are very similar to before, like "mistake" and "errors", but we no longer have some of that domain-specific flavor, like "clerical", "discrepancy", and "misunderstanding". Now we do have "probability" and "calculation", and people were not using those kinds of words when they talked about their financial product complaints.
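Pre-trained embeddings like GloVe are distributed as plain text: one word per line, followed by its vector. This sketch parses that format from a tiny in-memory string; in practice you would stream a real file, such as glove.6B.100d.txt from the standard GloVe distribution (file name assumed):

```python
# Parse the GloVe text format: "word v1 v2 v3 ..." per line.
# The two made-up lines below stand in for a file with hundreds of
# thousands of rows and 100+ dimensions per vector.
glove_text = """\
error -0.1 0.7 0.2
mistake -0.2 0.6 0.3
"""

def parse_embeddings(text):
    vectors = {}
    for line in text.strip().splitlines():
        word, *values = line.split()
        vectors[word] = [float(v) for v in values]
    return vectors

vectors = parse_embeddings(glove_text)
print(vectors["error"])  # -> [-0.1, 0.7, 0.2]
```

Once loaded, the pre-trained vectors plug into the same nearest-neighbor and modeling machinery as embeddings you learned yourself.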
This highlights that embeddings are trained, or learned, from a large corpus of text data, and the characteristics of that corpus become part of the embeddings. Machine learning in general is exquisitely sensitive to the training data: whatever is in your training data, the model is going to learn it and reify it. This is never more obvious than when dealing with text data, and maybe most of all with word embeddings, because what you are literally doing is learning the way language is used in order to make these lower-dimensional embeddings.
This means that if you apply those pre-trained embeddings to your own data, you might miss out on something that is there, or add in something that isn't there. And, maybe most concerning from an ethical standpoint, any human prejudice or bias in the corpus becomes imprinted into the embeddings.
In fact, when we look at some of the most commonly available embeddings out there, we see that first names more typical for African Americans are associated with more unpleasant feelings than European American, white American first names. Women's first names are more associated with family, and men's first names are more associated with career. Terms associated with women, things like "mother", "aunt", "sister", are more associated with the arts, and terms associated with men, like "brother", "uncle", "father", are more associated with science.
Bias is actually so ingrained in word embeddings that the embeddings can be used to quantify change in social attitudes over time. Digital humanities folks have taken a corpus, sliced the data into decade bins or something similar, found embeddings for each slice, measured the bias in those embeddings, and used that to measure changes in various social attitudes over time.
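Quantifying that kind of association is surprisingly mechanical: compare the mean cosine similarity of two sets of target words to an attribute word, in the spirit of the Word Embedding Association Test (WEAT). All vectors below are invented purely to show the calculation; a real analysis uses learned embeddings and balanced word lists for paired attributes (e.g. career vs. family):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def association(targets, attribute, vectors):
    """Mean cosine similarity of a set of target words to an attribute word."""
    sims = [cosine(vectors[t], vectors[attribute]) for t in targets]
    return sum(sims) / len(sims)

# Invented 2-d vectors: one group of names sits near "career" in the
# space, the other group does not.
vectors = {
    "career":  [1.0, 0.0],
    "name_a1": [0.9, 0.1],
    "name_a2": [0.8, 0.2],
    "name_b1": [0.2, 0.9],
    "name_b2": [0.1, 0.8],
}
gap = association(["name_a1", "name_a2"], "career", vectors) \
    - association(["name_b1", "name_b2"], "career", vectors)
print(gap > 0)  # group A is more associated with "career" here
```

Run the same measurement on embeddings trained from different decades of text and the gap itself becomes a time series of social attitudes.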
Wrapping up
To sum up: when it comes to preprocessing your text data, creating these features that you need to build a model, you have a lot of options and, I think, quite a bit of responsibility. My advice is always to start with simpler models that you are able to understand clearly and deeply, so you have them as a benchmark; to adopt good statistical practices as you train and tune models, so you aren't fooled about the model performance you would expect to see with various approaches; and to use model explainability tools and frameworks so you can understand your less straightforward models. These are all things that my coworkers and I work on and talk about, and we think there is so much possibility, and also such important things to keep in mind, when it comes to this kind of work.
And with that, I will say thank you so much. I want to be sure to thank my teammates on the tidymodels team at RStudio, as well as my co-author on this book, Emil Hvitfeldt. So thank you.
I think we have time for a few questions. The first question I see is about explainability tools. With most explainability tools, one common approach is to train a model on top of a model. If you see something called LIME, that's what that is: it trains a little linear model on top of your complex, fancy model. Other explainability tools you can think of as poking at the model. You train a model and ask, how does this thing work? So you say, okay, I've got 500 real examples. What if I take my 500 real examples and change something in the input so that they all have the same value for some feature? Take this example about houses that I just linked to.
Let's make it so they all have the same number of bedrooms, then put that data in and get the predictions out. It's like I poked the model and asked, how do you work? Or say I'm interested in how price is related to the size of the house, its square footage. You take your 500 real data points, and for each of those, you change the square footage from very low to very high. You keep everything else about the house the same, change only the square footage, put it into the model, and see what you get out. You poke at it to see how it behaves. Most explainability tools work like that.
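That "poke the model" procedure is essentially a partial dependence profile. Here is a sketch with a stand-in model function; the linear pricing rule and feature names are made up, and in practice the model would be a fitted black box you cannot read off directly:

```python
# Partial dependence sketch: hold every feature fixed except one,
# sweep that one across a grid, and average the model's predictions.
def model(house):
    # Stand-in for a fitted model: price rises with size and bedrooms.
    return 50_000 + 120 * house["sqft"] + 8_000 * house["bedrooms"]

houses = [
    {"sqft": 1200, "bedrooms": 2},
    {"sqft": 2400, "bedrooms": 4},
]

def partial_dependence(model, examples, feature, grid):
    """Average prediction over examples with `feature` forced to each grid value."""
    profile = []
    for value in grid:
        preds = [model({**ex, feature: value}) for ex in examples]
        profile.append(sum(preds) / len(preds))
    return profile

profile = partial_dependence(model, houses, "sqft", [1000, 2000, 3000])
print(profile)  # -> [194000.0, 314000.0, 434000.0]
```

The shape of that profile, here a straight line, is the answer to "how does price respond to square footage?" without ever opening up the model's internals.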
Oh, Todd, do you have one? I was just going to raise my hand. Julia, thank you for joining us. You are a huge inspiration to me and to a lot of people that I know. Thank you for doing this; this is amazing. NLP is one of the things that I'm working on and it is absolutely kicking my butt, so this is incredible. I love working with text, but there really are a lot of things about it that are different from the rectangular data sets we often get trained on or have more experience with.
Yeah, I just have to echo what Todd said. Thank you so much for coming and speaking for us; it's been a really great time to learn from you. Thank you. Well, I love your YouTube videos. I think you should quit your job and just focus on those. I need to just do that.
So I had a question, Julia. I feel like I misunderstood one of the plots. There was a plot with n-grams, I think, and mean squared error on the y-axis, and there was less mean squared error when it was just unigrams. Which made me wonder: when it says "1 and 2", does that mean that you've included both unigrams and bigrams in the model? It blew my mind that performance went down.
Oh, okay. So it's holding the number of tokens constant; all of these models have a thousand tokens. This does not involve anything fancy in feature selection, except that it is a lasso model: here are a thousand tokens, lasso, please tell me which ones are best. The difference is that in the one labeled "1", the thousand features are all unigrams, and the cutoff was the thousand most common ones. In the one labeled "1 and 2", it's a thousand tokens of unigrams and bigrams: find all the unigrams and bigrams, find the thousand most common ones, and then let lasso tell me which are important. And the result got worse. It did. For the same number of tokens, trying to use bigrams made it worse. Yes.
It probably means that, well, this is a data set with, I think, thousands of examples. If you look at how language is used in Supreme Court opinions, 5,000 or so examples is not enough to start learning from the bigrams; you would need way more. So it's not always true. We actually know that Supreme Court opinions are written with a very big vocabulary, and that's related: you might not get the same result with text that uses more of the common words and less of the uncommon words. It really depends on the nature of your data and how much of it you have to learn from.
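The "same number of tokens" setup from that answer can be sketched directly: count unigrams (and optionally bigrams), then keep only the most common ones as features. With a fixed budget, frequent bigrams take slots away from unigrams; the toy documents and tiny budget below are made up to show the mechanism:

```python
from collections import Counter

def ngrams(tokens, n):
    """All length-n runs of tokens, joined with spaces."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def top_tokens(docs, max_tokens, use_bigrams=False):
    """Most common unigram (and optionally bigram) features, like a
    max-tokens cutoff in a text preprocessing recipe."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        counts.update(ngrams(tokens, 1))
        if use_bigrams:
            counts.update(ngrams(tokens, 2))
    return [tok for tok, _ in counts.most_common(max_tokens)]

docs = ["the court held that", "the court ruled that"]
print(top_tokens(docs, 4))                    # unigrams only
print(top_tokens(docs, 4, use_bigrams=True))  # "the court" crowds in
```

Whether that trade is worth it depends, as the answer says, on having enough data for the bigram counts to be informative rather than noise.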
Thank you, that's so insightful. Thank you for that great question. Awesome. Well, it was so nice to meet you all and to get to talk. Thank you so much for coming. Thank you, Julia. Thanks. Have a great night.

