Resources

JULIA SILGE: TEXT MINING WITH TIDY DATA PRINCIPLES - UNIVERSITY MIGUEL HERNANDEZ

PROGRAM: https://revolution.servicioapps.com/w/xxxviiicongresoasepelt/186821/programa-cientifico-del-congreso-asepelt-1?preview=1

Julia Silge is a data scientist and software engineer at Posit PBC (formerly RStudio), where she works on open source MLOps and modeling tools. She is an author, international speaker, and real-world practitioner focused on data analytics and machine learning. Julia loves analyzing text, making attractive graphs, and communicating about technical topics with diverse audiences.

Jul 15, 2024
1h 58min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Well, good morning. Hello, everybody.

We are so lucky to enjoy the presence of Julia Silge here today. It is a pleasure for us that you are here today with us in this ASEPELT seminar. Julia is a data scientist and software engineer at Posit, where she works on open source modeling and MLOps tools. I'm sure that you have read some of her books or some of her posts on LinkedIn or GitHub. She's an author, an international keynote speaker, and a real-world practitioner focusing on data analysis and machine learning. Julia loves text analysis, making beautiful charts, and communicating about technical topics with diverse audiences.

Thank you very much, Julia. We hope that we are going to enjoy this seminar.

Thank you so much, Yama, for that wonderful introduction. I am so happy to be here with you today. So we have a session of two hours, and I hope that during these two hours, if you are interested, you can do some interactive work, some actual practice of what we're doing.

So to get that started, if you have a laptop with you, I'm going to make this big for a bit. If you can go to this URL here: my name, juliasilge.github.io, slash asepelt-2024, the year. So if you want to go there right now, I'll show you what you will see.

And so that web page will look like this. And I invite you, if you have not already, to do some of these things at the preparation link here. If you want to practice here with me today, there are two choices. One choice is that you already have R and RStudio on your computer, you can install packages, and you can work yourself. The other option is that you go to Posit Cloud.

You need to have an account there. It's a free account, but it lets you do the analysis here with me today. So you go to the URL that I showed on the last slide, and then you go to the preparation page. You can click through and create a Posit Cloud account, which is free. And then I will give you the link in just a moment to join where the code is, so that you can run it yourself and we can learn together.

About Julia and Posit

All right, so my name is Julia Silge. I work at Posit. Many of you may think, what, I've not heard of Posit. So Posit is the company formerly known as RStudio. If you have worked with the RStudio IDE, that is the company that I work for. And my background is different from, I think, all of yours. I don't have a background in economics or business; my background, a long time ago, is in astrophysics.

And then about 10 years ago, I switched to work in data science. So first, I worked as a data science practitioner. My title was data scientist. I worked for tech companies. And then I started transitioning into being someone who built tools for data scientists to use. So open source tools in R, in Python, for people to do their practical tasks. Some of the tasks that I worked on, the kind of tools I worked on, are for text analysis, which is what we'll do here today, and then also tools for machine learning and MLOps.

And the books that I've written are also on these kinds of topics. Let me go back to this website here: if you want to learn more after this, there are some links up here to the book, and a course where you can learn online. And then this is another book that's about text. Also, if you want to see the slides, you can click on them down here and see them.

Joining the Posit Cloud workspace

So if you have the Posit Cloud account, then you go to this URL right here. So bit.ly slash join tidy text tutorial. So if you type that out right now, it will take you somewhere that then you can join. And I'll show you what it looks like here in a moment. So if you want to work together here right now, go to this URL.

And if you are someone who you use R a lot and you have R on your computer already, you can install these packages. And then if you have those packages ready, we can work together.

So let me show you what it will look like when we join. You will come to a website that looks something like this, because that's a link that asks you to join this workspace that has this project in it. And so when you click on this project, it will open up to look something like this. Maybe it won't be open yet.

So it will reload. It will either say initializing, or it will do this kind of thing to say it's getting started. We'll wait for it to finish, and then I'll show you how to open it up.

So if you're totally new to RStudio, this may look different to you. But if you've used RStudio a little bit, you may be like, oh, right. Here it is. So this is RStudio, but in a browser. So when you log in, you'll probably see the files like this. And if you choose the directory code, there'll be three files here. And so we're going to start with the first one, EDA, like this. And it shows you some text that we can go and work through together.

So when you join, do you see this? Oh, I made it private. Ah, sorry. Sorry, sorry. Now it is accessible to everyone.

So my apologies. And you can come in here; the main project directory is this, and this is everything. Then if you go to code and pick this first one, EDA, for exploratory data analysis, there.

So the free account for Posit Cloud, you can keep it for, I think, a week or something. I think for free, you can have one project, so if you were going to do another thing in Posit Cloud, you would have to delete this project. But it's OK for today; you will not be charged money for what we do. Only if you were going to want to add another project, you would have to delete this one.

Overview of the session

OK, so we are here to talk about text analysis. I've been working in tooling for text analysis for about 10 years, and it's a very interesting time to be thinking about it. Even 10 years ago, people were building tools for being able to do quantitative analysis of text of all kinds.

Some projects I know people have worked on are things like surveys, social media data, and reports of various kinds, say, from companies. And one thing that we can see is that there is information latent in text data sets. But it can be uncertain how to get it out. How can we begin to analyze the text data and get the information that is in it?

Often, if you have quantitative training, if you took statistics, if you learned data analysis, often what you learn on are nice rectangles of data. And maybe they're all numbers. And how do we apply these kinds of quantitative training to text data, which you can think of as unstructured data? There's information in there, but the data is not structured in a table format, like what you might have in a spreadsheet. So often, in our quantitative training, we don't learn how to deal with text.

And now, in maybe the past two years, we start to see more tools based on text data, like LLMs, ChatGPT, and similar tools that are trained on enormous collections of text. And as someone who has your own data that may be text, you're thinking, how does this impact me? How can I use or understand these kinds of tools? So that is what we're going to do today.

So the goal here today is that if you have not dealt with text data, you can take these first steps. It is not everything for today, but it is the first step so that you can move forward and learn how to deal with text data.

Tidy data principles for text

So I want to make the argument that using tidy data principles with your text gives you a step up: you can deal with it quickly and with greater fluency than with other kinds of tools. If you have learned R a little bit, you may have heard about tools that are called tidy, things like the tidyverse. These are sets of packages that adopt tidy data principles.

If you come from a SQL background, it's the same thing as the rules of normalized tables: every observation is a row. If you take a new observation, you don't add a new column; you add a new row. So say I am going to take the temperature outside today, tomorrow, the next day, for many different sensors. I have columns for sensor, date, and temperature. When it's time to take a new reading, I don't add a new column with a new temperature on a different date. I add a new row.

So we have one observation per row, and then one kind of table for a kind of observation. So these are this idea of tidy data principles. And we can apply them. It can be an approach for us to take unstructured text data and get it to structured, get it to structured data so we can use algorithms and math on it.
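As a minimal sketch of that sensor example (the sensor names, dates, and temperatures here are made up for illustration), reshaping from one-column-per-date to one-row-per-observation looks like this with tidyr:

```r
library(dplyr)
library(tidyr)

# Untidy: each new date of readings adds a new column
wide <- tibble(
  sensor       = c("A", "B"),
  `2024-07-15` = c(21.3, 19.8),
  `2024-07-16` = c(22.1, 20.4)
)

# Tidy: one observation (sensor, date, temperature) per row
long <- wide %>%
  pivot_longer(-sensor, names_to = "date", values_to = "temperature")

long  # 4 rows, one per sensor-date reading
```

Taking a new reading now means appending one row, and the table's shape never changes.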


So the main package we'll be using today is called tidytext. Tidytext is an R package for text mining using tidy data principles. And there's a book. And then today, we have some resources here. The URL that I gave you before is being built by the code in this repo. So if you would like to see the code that makes the slides, so that you can use them later, you can go here.

So for example, if you, today, you're working in Posit Cloud and you think, oh, wait, what did we do? I want to see that again. This will be here always. So this URL and that URL at the bottom will be there for you to go back and look at. The Posit Cloud instance will go away, but the code will be there for you to look at afterwards.

Tokenization with unnest_tokens

OK, so we are here to talk about text analysis. So we have, from now, maybe one hour and 40 minutes. And there are two things that we want to cover. The first is EDA, which means Exploratory Data Analysis. This means we're not training a model. We are, instead, we want to summarize, visualize, make plots. We want to start to understand the content of the text. And then after we talk about EDA for text, we'll talk about modeling for text, building the kinds of models that you can build for text data.

So we'll maybe go to the hour with EDA. And then maybe we'll all stand up for five minutes at 5 PM and then come back and talk about modeling, maybe from 5 to 6 PM. So that will be the plan for what we go through today.

So I was telling you about tidytext and tidy data principles. And some of you may think, what are you talking about? What do you mean? So let's look at an example. This is the poem "Cantos nuevos" by Federico García Lorca, just the beginning of the poem, and I have assigned it to a variable here. Notice that this is what, in R, we call a character vector. So it's a vector data structure.

And it has atomic elements of type character in it. If you are familiar with Python or other tools, you may think of them as strings. But notice that it's a vector of length 4: this is a vector, and it has four elements. So this is not yet tidy, meaning we don't have one observation per row here. Think of this as the unstructured data.

Now, we can put it into a data frame structure. A tibble is a modern implementation of a data frame. So if I put this into a structure like this, it's starting to be more like a data frame, more like a spreadsheet, more like a rectangle. I have two variables: one tells me which line of the poem it is, and the other tells me what is in the poem, the content of the poem. So we're starting to have some richer information, but this is still not a tidy data set when we talk about tidy data principles.

So we can load the tidytext library and then use this function, unnest_tokens. And now here we have what I want to say is tidy data for text. Notice that this function is doing several things, mostly options that you can turn on or off depending on your particular needs or analysis. But notice that instead of having one row per line, now I have one row per word.

So this is called tokenization, where we take text and we divide it into tokens. The most common token when we work with text is a single word. So the process of tokenization is the process of identifying what are the words, where do I break them apart? We also did a little bit of some text processing to make it easier for us to analyze it later. We got rid of capitalization. We took some steps so that we got rid of punctuation between words. These again are options you may want to keep or you may not, but these are the defaults here.

So we've transformed our data from unstructured to now structured. Notice that I still know which word came from which line in the poem. So when we tokenize text and transform its shape so that it's in the tidy data still, we still know things like all the other information we have. If we had other columns here, not just the line of the poem, but other information about that text, we keep it all. We keep it all as we move forward. So we have one observation per row.
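A minimal sketch of that transformation (I'm using the opening lines of "Cantos nuevos" here, which may not match the exact excerpt shown in the talk):

```r
library(dplyr)
library(tidytext)

# The beginning of the poem as a character vector of length 4
text <- c(
  "Dice la tarde: «¡Tengo sed de sombra!»",
  "Dice la luna: «¡Yo, sed de luceros!»",
  "La fuente cristalina pide labios",
  "y suspira el viento."
)

# Put it into a tibble, keeping track of the line number
text_df <- tibble(line = 1:4, text = text)

# Tokenize: one row per word, lowercased, punctuation stripped (the defaults)
tidy_poem <- text_df %>%
  unnest_tokens(word, text)

tidy_poem
```

Notice the line column is carried along, so we still know which word came from which line of the poem.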

Working with Don Quixote from Project Gutenberg

So this is the function that does all this. Now we're going to take one step up from a poem to a whole book. Project Gutenberg is an online project that makes available the full text of works that are in the public domain.

Now, if we run this code, you will have the full text of the book that corresponds to this ID. So let me show you what we're going to do. Once you're over here, you want to execute this. This first part sets where in the world we are getting our Gutenberg text from, the mirror. And then we want to execute this like so.

So it's looking at the mirror that we want, and now I have this thing that I call full_text here. So that worked for me.

Okay, great. So let's go back to the slides. I just want to make sure, because if you get stuck at the beginning and you don't even have the text, you get very stuck; there's nothing to do later. So I'm just making sure that people who want to work along are getting it.

So this is the next step of what we're going to do. I bet some of you can tell what book this is; it's not a secret. This is Don Quixote, the Spanish-language original, not a translation. And we have the next lines that we're going to run here.

So this is the part where I take my raw data and transform it into tidy data for me to explore. I'm going to keep track of which line of the book each word comes from; think of that like printed on a page, line one, line two, but for the whole book. That's some metadata I'm going to keep track of about where the words are from. It also has the Gutenberg ID, which is the same for the whole book, because the whole book is, of course, Don Quixote. And then we have the tokenization, where we take the text and break it apart into pieces.

So I'm showing you here with an example that is a book, a fiction, but this applies to all kinds of text, survey results, documents that you have, reports, any kind of text data that you have, you can use for this kind of analysis.

All right, so notice here that the thing you need to add, if you are typing along, is the unnest_tokens call right here. Anytime in this interactive exploring together, when you see this, you need to update it. And what did I say it was called? unnest_tokens(word, text): we are making the word column from the text column here. So I can run this and this, and now I have it right here. This is now a tidy text data set.

Counting words and stop words

All right, so that's the step you want to take if you are going to work along and have some experience with this. Now, these next functions use verbs from the tidyverse. So all of you can look at these words and try to decide: what do you think is going to happen if we run this code? We say, okay, I have my tidy book, and then I'm going to count the words with sort = TRUE. So maybe very briefly, turn to someone around you and ask: what do you think I'm going to get? What do you think will happen after I run this?

All right, so what happens is we get a count of the words in this book, sorted so that the most common words are on top. When we adopt tidy data principles in how we store our text, we are able to use very well-exercised, well-documented tools from the tidyverse ecosystem to analyze the text that we have. So the only part here so far that is from tidytext is that unnest_tokens. Everything else is from a set of tools that is built for working with quantitative data.

All right, so now, I'm no great Spanish speaker, but I look at this and even I know: those are not very interesting words. When you take any collection of natural text data and count up what's most common, the most common words are very boring. They're usually not important for the analytical purposes you have, and we have a name for them in text analysis: stop words.

So stop words, if you hear stop words, it's a word that usually means like, it's not important, it's very common, it's not very informative. It's there to make the structure of the sentences so that we can understand each other, but they're not very specific, usually, to the kind of data that we want to deal with.

So there are data sets of stop words that are available, and it's important to ask the question: where did these data sets come from? There are a bunch of them. This one, the default, is from a lexicon called Snowball, which is older software for analyzing text, and it includes a list of stop words. But there are many different data sets of stop words, and these lists are usually pretty old. The way that they're created is you take some big collection of language, not one book but many books, you add up the most common words, and you pick a cutoff and say, okay, these are the stop words and these are not. Or you compare across many different collections and ask which words are common to them. So this is an English list of stop words that's somewhat conservative.

It's somewhat small. It's not so big.

They are not only for English. There are data sets for other languages as well. Notice this one is a bit bigger, but of course, it's kind of comparing apples and oranges to compare English to Spanish, right, like in terms of what we might consider stop words.

So where do the non-English ones come from? Because computationally, we have a big inequality in text analysis between the English language and other languages. So where does this list come from? Some of them, I'm sad to say, are translated: you take one list and you translate it. Others are human curated. But we have different lexicons, different availability, different lists.

Notice that we can get them from another place; there are many different lexicons. This one is English also, but let's remind ourselves: the default one had 175 words. This one has 571. So they can vary in how conservative or aggressive they are. Take out everything, take out "able", take out "accordingly": there are different lists for different purposes.
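A quick sketch of comparing lexicons with tidytext's get_stopwords() (the counts in the comments are what the Snowball and SMART lexicons shipped with at the time of the talk):

```r
library(tidytext)

# Default: the Snowball English lexicon, a conservative list
nrow(get_stopwords())                  # 175 words

# Stop word lists exist for other languages too, e.g. Spanish
nrow(get_stopwords(language = "es"))

# A more aggressive English lexicon, SMART
nrow(get_stopwords(source = "smart"))  # 571 words
```

Each call returns a tibble with word and lexicon columns, so the lists drop straight into tidyverse joins.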

Removing stop words and finding common words

So now that we have this kind of list, we can use it to get a different answer for what the most common words are. For your next exercise, you have the code. We want to ask: what are the most common words in Don Quixote after we remove the stop words? This is all the correct code, but the lines are scrambled up.

So if you look at it, you want to put the lines in the right order. If you are very new to R, you may be like, I don't know. Don't worry; we'll show you the answer in just a minute. But maybe if you have a neighbor who is a little more familiar with R, you can work together, or you can try to make a guess. Which line do you think comes first, and what order do they go in?

So take just a moment to look at your code and try to rearrange it and see if you can rearrange it so that you can run this code. So I think hopefully if you're learning along, you see this, and then you can rearrange it.

So I'll do it here with you. First we'll use an anti-join. An anti-join is a way to keep the things in one set that are not in another set. And then, after the anti-join, we'll do the same count we did before, and then we'll pipe it to ggplot2.

So this is how you want to do it. Let's talk through these steps. We take our tidy data set. We remove the stop words. We count them up. Then we use this function slice_max to take the top 20. Then this starts our plot: we're going to do a visualization, and it makes a bar chart like so.
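Put together, the unscrambled pipeline looks roughly like this. The tiny tidy_book stand-in here is made up for illustration (a few words from the famous opening line); in the tutorial it comes from tokenizing the gutenbergr download:

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Stand-in for the real tidy_book (one row per word of Don Quixote)
tidy_book <- tibble(
  line = 1:8,
  word = c("en", "un", "lugar", "de", "la", "mancha", "hidalgo", "caballero")
)

top_words <- tidy_book %>%
  anti_join(get_stopwords(language = "es"), by = "word") %>%  # remove stop words
  count(word, sort = TRUE) %>%                                # count what remains
  slice_max(n, n = 20)                                        # keep the top 20

# Bar chart with the most common words on top
ggplot(top_words, aes(n, reorder(word, n))) +
  geom_col()
```

With the real book, the bars show Quixote, Sancho, caballero, and so on instead of the function words.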

So now we have this list for the most common words in Don Quixote. This is looking better. Now we have Don Quixote. We have Sancho. We have Caballero. You know, like we have words now that are telling us something about what Don Quixote is like.

Now, let's look at that third word, si. So it turns out that in that list of stop words in Spanish, sí with an accent, sí meaning "yes", is in it. But si meaning "if" is not in that list. They only have sí. So probably what happened is they tried to deduplicate the list, to not have any doubled items, and they took one out. I don't know exactly how this happened. But "if" is not in that Spanish list, while "if" is in the English list.

So what you should learn from this is that text analysis is full of supporting data sets, all the way up to pre-trained models, and they all have problems like this.


And so it's very important that you, as a practitioner, understand and have the tools to be able to dig into what the problem is, what's going on. When you use tidy data principles, it is straightforward to find out why this happened, and it is also straightforward for you to make the appropriate choice for your analysis. You say, okay, I add si to my list of stop words, the list that I use for this project. And for other things here, you decide: okay, I'm also taking that out. It's up to you to decide how to do this.

Word frequency in practice

So far, what we've talked about here is just the very beginning: we have text, we do some basic tokenization, and we ask even just the question, what is most common?

All right. So I'm going to briefly talk about some of this. What I've shown you so far is not mathematically very exciting, but I have found even those kinds of tools to be very impactful. For example, this is work that I did: I found a data set of pop lyrics, all the songs from the Top 100 chart since the 1960s, the chart-topping songs, with all the words to them. And then I could use the kind of tokenization that we just showed you and ask: what are the most common words?

And here, I find all the words, count them up, and then ask: which words are the names of states in the United States? We can see that California and New York are the most mentioned. In pop songs, what we talk about is California and New York. But I am a very good data scientist, and that means not only can I add, I also know how to divide. Those are the important skills, right? Adding and dividing. Mathematically very sophisticated; I'm making a joke. But it is actually interesting.

So not only do I know the words, I also know the population of these states. And so you can ask: which states are mentioned a lot in pop music relative to their population? Here we actually see a big cluster in the southeast of the United States. This was not country music, this was the pop list, but even in the pop list you see this representation of that part of the country. And Hawaii, almost no one lives there, but it looms large in people's imagination in pop music.

TF-IDF: measuring what a document is about

So okay. So this is just the beginning. But even the beginning can help you say interesting things about text data that you have. We're now going to move on to talk about, in a little more detail, how can we have measures that tell us what a document is about? So we're going to talk about a couple of statistics that you can measure about a document and how to combine them.

So term frequency is just what I was telling you before: counting, usually divided by the length of the document. So if Don Quixote says Quixote, Quixote, Quixote, over and over and over again, that word has a high term frequency in that document.

Inverse document frequency is this. It's not something that has a deep theory behind it; it's a heuristic. It's the natural log of the number of documents you have in a collection divided by the number of documents containing that term. So let's say we have 100 books, and only one of them has Quixote in it. That means we have a big number divided by a small number, we take the natural log, and we get a big number. Now say all 100 are English books and they all have the word "the" in them. Then the ratio is 1, and the natural log of 1 is 0, so that is a small number. That weights the word down.

So if you multiply these two numbers together, what you get is called tf-idf: the term frequency multiplied by the inverse document frequency. It's a weighting. It asks: which words are common, but common in one document compared to the other ones? It's about comparing documents within a collection. So tf-idf is a way to weight counts, and what it gives us is a measure of what a document is about, or of which words are important in one document compared to the other documents I want to compare it to.
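As a worked version of those two numbers, using the hypothetical 100-book collection from above:

```r
# idf(term) = ln(n_documents / n_documents_containing_term)
n_docs <- 100

idf_rare   <- log(n_docs / 1)    # "quixote": in only 1 of the 100 books
idf_common <- log(n_docs / 100)  # "the": in every book, so ln(1)

idf_rare    # about 4.6: rare terms get weighted up
idf_common  # exactly 0: a term in every document carries no weight

# tf-idf multiplies term frequency by idf
tf <- 0.02            # say the term is 2% of the words in one document
tf * idf_rare         # a meaningfully large weight
tf * idf_common       # 0
```

The 2% term frequency is an invented figure; the point is that the idf factor zeroes out words shared by every document, no matter how frequent they are.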

So now we're going to go back to the code, and we're going to get not one book but three books, so that we can compare across them. Let me remind myself how much I made you all fill in. This, I think, you can just run, and then you should have a collection. We have Don Quixote again, but also two other books. If you want, you can count the titles to see the three titles that are there.

And then what we're going to do is we're going to convert this more raw text to a tidy data structure for all of them together. So think of me now having not only one document but a collection of documents. These are long documents. They're whole books. But this could be short documents. So right now I have three long documents. But maybe think about survey responses. Maybe you could have— say again?

Like a text from an interview, or a survey result. Maybe you have 1,000 short documents; it works the same. And then what we're doing is counting. So we count up here, and Don Quixote is at the top for all of this, which probably means Don Quixote is the longest document, because it has que more times than the other books have que.

So what tf-idf allows us to do is account for all that. So you will run this. And let's think: what do the columns of this book_words object tell us? They say, for this book and this word, how many times it appears. So Don Quixote has the word que more than 20,000 times, and if we were to keep looking down, we would see that the other books have the word que a different number of times. So I'm counting up these words in the books. And now I can use this function right here, bind_tf_idf. This function computes tf-idf for the words that are in the documents.

And the n column tells us how many times each word is there. I think, if I remember right, you need to type this in. And then what we get, after we look at this (hopefully this isn't too fast), is this. Let's look at these words here. That function computed tf and idf for us and then multiplied them together, and what we have now is the tf-idf. Look at all those: it's all zero. That's because all three of these books contain these words: que, de, y, la, a, en. They're in all three books, so tf-idf has weighted them down. It says those are common to all of them.
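A minimal sketch of bind_tf_idf() on a toy two-book count table (the book labels, words, and counts are invented for illustration):

```r
library(tibble)
library(dplyr)
library(tidytext)

# One row per (book, word) with its count n
book_words <- tribble(
  ~book,     ~word,       ~n,
  "quijote", "que",       50,
  "quijote", "dulcinea",   8,
  "fierro",  "que",       40,
  "fierro",  "gaucho",    12
)

tfidf <- book_words %>%
  bind_tf_idf(word, book, n) %>%   # term column, document column, count column
  arrange(desc(tf_idf))

tfidf
# "que" appears in both books, so its idf and tf_idf are 0;
# "dulcinea" and "gaucho" are unique to one book and get positive tf_idf
```

This is the same pattern as in the tutorial: count words per document, then let bind_tf_idf() add the tf, idf, and tf_idf columns.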

So we can arrange the results in descending tf-idf, and what that shows us is something interesting. Now we have these three books, and we have words that are unique and specific to one title compared to another. I actually think it's pretty interesting: usted is not in the other two books; usted is only in this one. Gaucho and poema are in this book. And we see Don Quixote down here with Panza and Dulcinea. So we're now seeing words that are unique to one document compared to another.

You are welcome to do this unscramble, but I'm just going to go a little ahead to show you the results. So if you want, you can go on; I'm not going to take time right here. What we're doing is saying, for every title (so, group by title), give me the top ten words by tf-idf. And then we keep going and put it into a plot, a visualization, which will look like this.

So now this tells us the top ten words by tf-idf for each of these three documents. We see the Don Quixote words: "panza", "Dulcinea", "Rocinante". I was less familiar with these other books before I went looking for them for this talk, but I was like, oh, cool, this one has gauchos and poems; I can kind of learn what it's about. And then over here we have these interesting characters.

Weighted log odds

So think about a different application: if you have survey results, what tf-idf helps us do is identify things that are important in one document compared to another. It's not the only way to do this. We just talked about term frequency, but another way that I find very effective for this kind of work is what's called a weighted log odds. We're not really to models yet, but we're starting to have a little more sophistication in what we're doing.

So I imagine that you are all familiar that a log odds ratio is a way of comparing probabilities. And we can weight those log odds because we know the distribution of words: remember that some words are super common and other words are more rare. We want to weight the log odds ratio by how sure we are that what we're seeing is a real difference. Let's say we have some words that are very common, like "que" and "de". If we see a big difference in those words, we know that it is real.

Let's say there's a word that appears one time in one book and doesn't appear in the other book at all. We cannot be so certain that that's a significant difference; we have to weight. Because natural language exhibits a power law distribution: if you plot word frequencies on a log-log scale, we see that some words we use many, many times and other words we use very little. It's a characteristic of text data that we have this power law distribution.

And so when we use weighted log odds, it helps us get a good answer for which words have a higher probability of coming from one document versus the others. For those of you who are statisticians, the typical way of doing weighted log odds is with empirical Bayes. So with empirical Bayes, maybe we're starting to use a model, you know? But it just uses the data itself to do the weighting, to say what distribution we would expect. So we're maybe starting to get into some modeling, but not so much.

It looks just the same as `bind_tf_idf()`, except here we're doing `bind_log_odds()`. It is in another package, tidylo, because this actually works great not just for text data but for many kinds of data; anytime we have a power law distribution, it's a really great application, and that's definitely true of text. So this is a different statistic. The statistic before was tf-idf, a heuristic statistic, but one we can measure for a word in a document in a collection of documents. This is a different thing we can measure: the weighted log odds.
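tidylo's `bind_log_odds()` implements the weighted log odds of Monroe, Colaresi, and Quinn. The following is a simplified Python sketch of that idea, not the exact tidylo implementation: an informed prior built from the pooled counts, with the log odds ratio divided by its approximate standard deviation. All the counts here are invented for illustration.

```python
import math
from collections import Counter

def weighted_log_odds(counts_a, counts_b, prior_scale=1.0):
    """Compare word use in group A vs group B: shrink each count toward
    a prior taken from the pooled counts (the empirical Bayes step),
    then divide the log odds ratio by its estimated standard deviation
    so that rare words get appropriately uncertain scores."""
    total = counts_a + counts_b              # pooled counts act as the prior
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    n_prior = sum(total.values())
    z = {}
    for w in total:
        a = prior_scale * total[w]           # pseudo-counts from the prior
        ya, yb = counts_a[w] + a, counts_b[w] + a
        delta = (math.log(ya / (n_a + prior_scale * n_prior - ya))
                 - math.log(yb / (n_b + prior_scale * n_prior - yb)))
        var = 1 / ya + 1 / yb                # approximate variance
        z[w] = delta / math.sqrt(var)        # weight by uncertainty
    return z

a = Counter({"que": 50, "dulcinea": 10, "gaucho": 0})
b = Counter({"que": 48, "dulcinea": 0, "gaucho": 9})
z = weighted_log_odds(a, b)
print(z["dulcinea"])  # positive: more characteristic of group A
print(z["gaucho"])    # negative: more characteristic of group B
print(z["que"])       # near zero: common everywhere, similar use
```

Note that "que", unlike in tf-idf, is not forced to zero just for appearing in both groups; it scores near zero here only because its usage really is similar.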

So you'll notice here we have some similar results: "eugenia", "usted", "panza", "Dulcinea", some similar words. But it's a different approach, and the main difference is that weighted log odds can distinguish between words that are used in all the documents. Remember that when we had "que" in all three of the books, its tf-idf went down to zero. But a weighted log odds allows us to distinguish between words that are in all the texts: it can sometimes say, oh, this word is much more important in this document than that document, even if it is in all the documents.

N-grams and more complex analysis

Just in the next five minutes, we will talk about a little more EDA, and then we'll take a short break. So far, I've shown you only single words. But absolutely, we can move into more complex analysis, where instead of looking at one single word, we look at groups of words. Those are called n-grams.

So a single word is a unigram or 1-gram. A set of two words is a bigram or 2-gram. A set of three words is a trigram or 3-gram. You probably still have this: this is the text of Don Quixote again. And now we can start to use some more arguments to do a different kind of analysis. Before, we said one word is one observation, but maybe what we are interested in is sets of two words. So here we say `token = "ngrams"` and `n = 2`, and that will find for us all of the groups of two words.

So if you type this, and then we get rid of the ones at the edges that are not there, we have this instead of what we had before; this is probably what you need to type in if you're going to work along with us. Notice what's happening (now you hear my very bad pronunciation): "el ingenioso", "ingenioso hidalgo", "hidalgo don", "don quijote", "quijote de", "de la", "la mancha". So n-grams slide along like this; you get every combination of two words. Trigrams are the same, they slide along, and 4-grams slide along so that you get every set of four words.

You may notice it becomes a bigger data set. For one book, it's not so bad, but when you start using n-grams, the dimensionality of the space starts to become higher. Working with bigrams and higher-order n-grams also gets us richer information, though.
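As a minimal sketch of how n-grams "slide along" (in Python rather than the workshop's R, where `unnest_tokens(token = "ngrams", n = 2)` does this):

```python
def ngrams(tokens, n):
    """Slide a window of width n along the token list; each position
    yields one n-gram, so consecutive n-grams overlap by n - 1 words."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["el", "ingenioso", "hidalgo", "don", "quijote", "de", "la", "mancha"]
print(ngrams(tokens, 2))       # ('el', 'ingenioso'), ('ingenioso', 'hidalgo'), ...
print(len(ngrams(tokens, 2)))  # 7 overlapping bigrams from 8 tokens
```

The overlap is why the data set grows: nearly every token starts a new n-gram, and the vocabulary of distinct n-grams is far larger than the vocabulary of words.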

So here are the most common bigrams, just like before. "De la" is the most common, but after that, "don quijote". But look at a lot of these: these are, again, stop words, not so interesting. If we try to use the same approach we did before, we can't, because we can't use that `anti_join()` there. So instead, we will take this kind of approach.

If we take out the stop words in the way I just showed, then we get some much more useful results: things like "señor don", things like "dijo Sancho". We start to have bigrams that don't have stop words in them. And what can we do with n-grams? Everything we talked about already, you can do with n-grams: tf-idf of n-grams, weighted log odds of n-grams. You can also take approaches like network analysis, because a set of bigrams, it turns out, is a network: it has nodes and it has connections. So you can treat a set of bigrams as a network analysis.

You can also look for the effects of negation. This came up more in the past, when people were trying to do sentiment analysis. Compare the sentence "this is very good" and the sentence "this is not good": they both contain the word "good", which has a very high sentiment score. So can we use bigrams to understand the effect of negation there?

So here's an example of an analysis I did that uses bigrams. My favorite writer is Jane Austen; I love Jane Austen. This analysis took all the works of Jane Austen and found all the bigrams that were "he ___" or "she ___": what are the things that come after "he", and what are the things that come after "she"? And then it looked at the log odds to say what is more likely to come after "he" and what is more likely to come after "she".

So let's look here. We see "she remembered", "she read", "she felt", "she resolved". But "he stopped", "he takes", "he replied", "he comes". If you notice, the "she" words are all about the internal life, the feelings, the thinking, and the "he" words are all about what we observe him doing. I love this because it's a fairly simple analysis, but it gives us this insight into Jane Austen's work: that it is about the lives and thoughts and feelings of women, and with the men in it, you don't really know what they think or feel; you only know what they do.
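The extraction step of that analysis can be sketched in a few lines (Python here, and the sample sentence is invented; the original analysis was done in R on the full Austen corpus): collect whatever word follows "he" or "she" in each bigram.

```python
from collections import Counter

def words_after(tokens, pronouns=("he", "she")):
    """Collect the word that follows each pronoun, as in the
    'he ___' / 'she ___' bigram analysis described above."""
    following = {p: Counter() for p in pronouns}
    for first, second in zip(tokens, tokens[1:]):
        if first in following:
            following[first][second] += 1
    return following

tokens = "she remembered that he stopped and she felt that he replied".split()
out = words_after(tokens)
print(out["she"].most_common())  # [('remembered', 1), ('felt', 1)]
print(out["he"].most_common())   # [('stopped', 1), ('replied', 1)]
```

From there, the counts for the two pronouns can be compared with the weighted log odds from earlier.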

So these kinds of n-gram analyses can be very powerful, even with straightforward choices. I did the same thing with a dataset of movie scripts, looking at which words went more with "she" and which with "he". With "she", you see words like "she snuggles", "she laughs"; with "he", it's like "he shoots". So we can see what culture tells us about men and women with these kinds of text analyses.

This is the weighted log odds for Jane Austen again, but with bigrams. So the things we were talking about before are not only for single words; they can also be used for more complex structures.

Network analysis of bigrams

Now, I have in your code a little bit on network analysis. We won't go through it, but you'll be able to do this yourself; I'll just show you this example. Before my job now, I was a data scientist at Stack Overflow, the website where you go to find answers to coding questions. And we did some surveys asking: if you could change something about Stack Overflow, what would you change? This is what people said to us in the survey. We had tens of thousands of responses, so this was not survey data where we could read every answer. But we actually got a really good understanding of what people said with this kind of analysis.

People talked about questions and answers: that it's hard to find answers, that the answers can be out of date or off topic, that questions are being closed as off topic, which makes me feel sad. At the time, people wanted a dark mode, a dark theme for the website. We see voting and reputation. And over there, some people said: what I want you to change is the toxic community at Stack Overflow. So this is an effective tool for dealing with corpora of text that are large.

Now, I invite you later, if you're interested, to walk through this. But this is what the code looks like to do the same kind of thing for Don Quixote. We make a graph here; this is what the graph looks like. And then we can make a plot (I'm just going to blow through this), and this is what we get. So this is telling us how the words in Don Quixote are connected to each other. We see "don quijote", that's so strong together. But then we see other people, other sets of words, that help us understand what's going on.
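The network itself is just counted bigrams treated as weighted edges; a minimal Python sketch of that structure (the token list is made up, and in the R workflow this edge list is what gets handed to the graphing packages):

```python
from collections import Counter

def bigram_edges(tokens, min_count=1):
    """Count bigrams and treat each distinct pair of words as a
    weighted edge in a word network; higher counts mean stronger
    connections, like 'don'-'quijote' in the plot described above."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {edge: n for edge, n in counts.items() if n >= min_count}

tokens = ["don", "quijote", "de", "la", "mancha", "don", "quijote"]
edges = bigram_edges(tokens)
print(edges[("don", "quijote")])  # 2: the strongest connection here
```

A network layout then draws each word as a node and each bigram count as an edge weight.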

Okay. It's past 5. Let's take the next four minutes, till about ten after: just stand up, turn around, talk to your neighbor. And then in about four minutes, we'll go on to modeling. The first section was the introduction, learning about text as tidy data. Next, we're going to talk about topic modeling for maybe 30 minutes, and then in the last maybe 30 minutes, word embeddings.

You can close the EDA file and open the topic modeling file. This topic modeling file has what you need for this next part.

Any questions about where we are or opening the files? Okay. Great.

So if you are working locally, these are the packages that you need to install for this section. But if you are on Posit Cloud, they are all already installed there for you here.

Moving between data formats

So let's look at this diagram that tells us a little about how we move back and forth between the different formats. Picture where we were: where it says "text data", that's the raw text data we start with, say from Project Gutenberg.

When we download, say, the text of Don Quixote, it's right there at "text data". If we transform it with tidy data principles to a tidy text data set, that's where it says "tidy text". If we do things like count to find the most common words, that is the summarized text. So we have been moving back and forth along that main line: we have raw text, we reshape it, we summarize it, and then we can make visualizations.

In the next section, we're going to be talking about going down to the model, going down to make a model. And it turns out that by the time you want to do model estimation, a tidy format is not ideal, because you usually need a matrix to do linear algebra. Basically any model, underneath, is linear algebra of some kind or another. So we need a matrix.

So I have been telling you that tidy data is so important, that you should have all your data in a tidy format. But it is true that when it's time to do some algorithm, some math, we have to get it into matrix form. So we go down here to make what's called a document-term matrix, and then we can train models on that. We'll talk about that process just a little bit.
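A minimal Python sketch of what that casting step does: reshape a tidy (document, term, count) table into a document-term matrix, filling the absent combinations with zeros (the titles and counts here are invented):

```python
def cast_dtm(tidy_counts):
    """Reshape tidy (document, term, count) rows into a dense
    document-term matrix; missing (document, term) pairs become 0."""
    docs = sorted({d for d, _, _ in tidy_counts})
    terms = sorted({t for _, t, _ in tidy_counts})
    lookup = {(d, t): n for d, t, n in tidy_counts}
    matrix = [[lookup.get((d, t), 0) for t in terms] for d in docs]
    return docs, terms, matrix

tidy = [("emma", "miss", 599), ("oz", "scarecrow", 214), ("worlds", "martians", 163)]
docs, terms, matrix = cast_dtm(tidy)
print(docs)    # ['emma', 'oz', 'worlds']
print(terms)   # ['martians', 'miss', 'scarecrow']
print(matrix)  # [[0, 599, 0], [0, 0, 214], [163, 0, 0]]
```

Even in this tiny example, most entries are zero, which is why the sparse representation discussed below pays off.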

Introduction to topic modeling

Now we're going to start with a kind of model called topic modeling. Topic modeling tries to find the topics: what are the topics in the documents that we have? What are they about?

The way the most common kinds of topic models work is that each document is a mixture of topics, and each topic is a mixture of words. Say we're thinking about survey responses from student evaluations at a university. Some of the topics could be the tests, the homework, the lectures; each survey response can be about all of those things in some mixture. And then each topic uses a variety of words, and you can have the same words in different topics.

Topic modeling is an example of unsupervised machine learning. We don't have a label on each document, and we're not trying to predict yes or no; we don't have any labels at all. We're doing unsupervised machine learning here. That's the kind of model that topic modeling is.

Here is an example from my Stack Overflow days, a real example of topic modeling. We're trying to say, okay, for all of the questions on Stack Overflow, what are they about? And it's an interesting result, because you end up with, like, a front-end web development topic, a Java topic, a SQL topic, an Android topic, and you can see how they're related to each other.

The four books example

So we're going to go through a slightly smaller example. Let's imagine that you have a library with books in it. We're going to use an English example, but nothing in the approach depends on the language being English.

Let's say you have these four books in your library: Emma, from my favorite, Jane Austen; The War of the Worlds; The Wizard of Oz; and Wuthering Heights. And let's say someone breaks into your library, takes your books, rips them apart, and throws them all in a mixed-up pile on the floor. So your four books are torn apart and in a pile on the floor.

And that is this part here: we take the books, divide them into chunks, and put them in a pile on the floor. Topic modeling asks the question: can we pick up one of the chunks, look at it, and tell which book it is from? That's the idea of this topic model.

But topic modeling can be applied in many different ways: what are the survey responses about? Which topics are becoming more important over time in company reports or in the media? There are many different ways of using topic modeling, but this example is nice because we can walk through it and understand together what happened.

Building and training the model

So here's what we have. For the text, we have a document column here, so these texts are labeled: each one of the documents is a chunk of a book, as if someone ripped your book apart and you have it there. Now what we want to see is: can we put our books back together? Can we learn from the text which chunks go together? We don't tell the model which chunks belong together; we're asking whether the model can learn which ones go together.

So first we will make a tidy data set. We use tokenization, and we can remove some stop words. It turns out removing stop words does often help with topic modeling, because it brings the multidimensional space down and makes training faster. And then we're counting again. Always counting.

So we have what's called a document-term matrix. Here it's in a long, narrow form that gives us the counts of words per document, and then we can use a cast function to cast it to a different shape: we have a long, narrow rectangle, and we cast it to a matrix. Here we're casting it to a sparse matrix, because it turns out that most of the documents don't have most of the words, so it's full of zeros. It's more efficient to store it in a sparse matrix representation.

So now this is no longer tidy data. Now it is a matrix, which is great for doing math on, for doing linear algebra. It's not a tidy data set anymore; it is a non-tidy data set, which is a good fit for linear algebra.

My favorite implementation of topic modeling is called stm, which stands for structural topic model. It's very nice, very fast, very good to use; it's my favorite. And what we pass in is that sparse matrix.

If you are on Posit Cloud, don't run this; I think in there it says don't run it, because unfortunately you will run out of memory in the free tier. But I saved the output of a training run that you can just load in. So you should just run the line that loads it, because it turns out you need a lot of memory to train the model, but once the model is done, it's actually quite small. The model is saved in your workspace and you can open it, but don't train it yourself; it will crash. So don't do that.

This is a pretty robust model that is difficult to mess up, but we do have to say what K is. This is like k-means, where you don't know K ahead of time. In this pretend, fun example, we know that four books were ripped up and thrown on the floor, so we know that K is four. In a real problem, you don't know what K is. There are approaches and methods for deciding on the right K: usually you have some heuristic based on how big your dataset is (do you have 100, 1,000, or 10,000 texts?) that gives you a reasonable range, and then within that range you can train all of them, look at the models, and use their characteristics to decide the right value for K. But here we just say K is four, because we know there are four books that were broken apart, and we get a result.

Exploring model results

We get a result here. It does not take a long time to train this kind of model; it's pretty fast and efficient. So we had four books that were ripped apart, and we trained four topics. This summary gives us some high-level details, some results. If you're familiar with The War of the Worlds and Wuthering Heights and The Wizard of Oz, you probably start to see: oh, I think we're getting somewhere. I think we're able to put these books back together.

But we can dig deeper by using tidy data tools. There is a verb in this ecosystem called `tidy()`, and what it usually does is take something that is in some sort of awkward shape and give it back to you as a data frame, back to one observation per row. Notice that we started with raw text and converted it to a tidy data structure so we could use the rules of relational algebra to analyze it. Then we cast to a matrix so we could do math. And now that we have the model object, how can I explore it in detail and know what it is? I can tidy my output again.

Remember that topic modeling says each document is a mixture of topics and each topic is a mixture of words, so there are two matrices that come out. This one is the beta matrix, the topic-word probabilities: what is the probability that a certain word comes from a certain topic? There's another matrix too; I'll come to it in a moment.

The other matrix is about the document-topic probabilities. So we get two sets of probabilities, because this is a two-level mixture model. The beta matrix tells us the probability that a given topic generates each word. The first word we see is "green": notice it's like medium, medium, medium, tiny. So "green" has a medium probability for topics one, two, and three, and essentially zero in topic four. The word "chapter" has small-to-medium probability across the topics, and "mr" is tiny in some topics and bigger in others. These are the probabilities that a given word comes from a given topic.
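To make the meaning of beta concrete, here is a toy Python sketch. The counts are invented, and a real topic model estimates these probabilities during training rather than computing them from known counts; this only shows what the beta matrix is.

```python
def beta_matrix(topic_word_counts):
    """Turn per-topic word counts into per-topic word probabilities:
    for each topic, the probabilities over its vocabulary sum to 1.
    This is the shape of the 'beta' output a topic model reports."""
    beta = {}
    for topic, counts in topic_word_counts.items():
        total = sum(counts.values())
        beta[topic] = {w: n / total for w, n in counts.items()}
    return beta

counts = {
    1: {"martians": 8, "green": 2},   # a War of the Worlds-like topic
    2: {"green": 5, "emerald": 5},    # a Wizard of Oz-like topic
}
beta = beta_matrix(counts)
print(beta[1]["martians"])  # 0.8
print(beta[2]["green"])     # 0.5: the same word can appear in two topics
```

Notice that "green" appears in both topics with different probabilities, exactly the "same words in different topics" behavior described earlier.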

So we can now explore this in detail. The next bit of code you have rearranges again. And, oh no, it's already 5:25; okay, I'm sorry, I'm just going to keep going.

So we can rearrange to this, and then we have results. These are the highest-probability words from each topic, all arranged. At the top is topic one, so these are the highest-probability words from topic one: martians, people, black, time. So this is The War of the Worlds; topic one is the words that come from The War of the Worlds.

And since we're using tidy data again, we can do any kind of table, any kind of summarization, any kind of visualization that we need. Since we know what the books were, we can start to understand which of the topics are related to which of the books. We've got The War of the Worlds here with the Martians, Wuthering Heights over there, The Wizard of Oz down here, and then Emma over there. These are the highest-probability words, and it looks pretty good; it looks like we're going to be able to put our books back together.

These are the most probable words, but we can also identify important words with other statistics. FREX is about whether a word is both high frequency and exclusive: is it common and also exclusive to a topic? Lift is about weighting by word frequency again. So there are different statistics; if you would like to learn more about FREX and high-lift words, I have a YouTube tutorial on it. But we'll keep going, because the other matrix is the one between the documents and the topics. This is typically called the gamma matrix. And I'm saying matrix, but it's actually tidy data here.

So picture in your mind: I can have a long, skinny thing, or I can reshape it into a matrix, depending on which is more useful for the step you want to do. Here, for any topic, we say: what is the probability? Notice that we have kind of medium, low, medium, low. For this chunk, this broken-up piece, what is the probability it comes from each topic? These are all of the chunks that we broke apart. Remember, when we trained this model, it did not know which documents came from which books; it did not have that information. But since we stored that, we can get it back out and make a visualization that looks like this.

On the y-axis are the gamma probabilities, and on the x-axis is which topic it is. Now we can ask: how good a job did we do at learning which things go together, at being able to pick up a chunk of book from the floor and say which book it came from? Notice Emma looks very good; it's all topic four. The Wizard of Oz is topic three. But these are box plots, and we can see some evidence that Wuthering Heights and The Wizard of Oz may be a little harder to tell apart than the others.

So we're able to learn which documents are closest, which documents are a little further apart, and which are very far apart. The War of the Worlds, it looks like, was very easy to distinguish. It was probably the Martians, you know, why it was so easy. So we can learn this from the output of this kind of model.
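Reading the gamma matrix amounts to asking, per chunk, which topic has the highest probability; a toy Python sketch with invented chunk names and gamma values:

```python
def assign_topics(gamma):
    """For each document chunk, pick the topic with the highest gamma
    (document-topic probability): 'which book did this chunk come from?'"""
    return {doc: max(probs, key=probs.get) for doc, probs in gamma.items()}

# invented gamma values for three chunks of the ripped-apart books
gamma = {
    "emma_chunk_1":   {1: 0.02, 2: 0.01, 3: 0.03, 4: 0.94},
    "worlds_chunk_7": {1: 0.91, 2: 0.04, 3: 0.03, 4: 0.02},
    "oz_chunk_3":     {1: 0.05, 2: 0.20, 3: 0.70, 4: 0.05},
}
out = assign_topics(gamma)
print(out)  # {'emma_chunk_1': 4, 'worlds_chunk_7': 1, 'oz_chunk_3': 3}
```

Chunks whose probability is concentrated on one topic (like the Emma chunk here) are the easy cases; a chunk splitting its probability across topics is what shows up as overlap in the box plots.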

Choosing K and further uses

We can continue to use topic models in more sophisticated ways. Let me say just briefly: how do we decide what K is? In the example we just used, I said four because I know there are four books. But in real examples, you have to choose K using measures from the model itself. This is like hyperparameter tuning, where you try a bunch of different values and see which one has the best characteristics. You can look at things like semantic coherence, which is about whether a topic's high-probability words tend to be used together, and exclusivity. And there are many more details on how, in a real situation, you go about choosing K for the model that you want to build.

If you're a Taylor Swift fan, I have a fun example of topic modeling on the lyrics of Taylor Swift, where you can see which albums are close together and which are far apart. You can kind of see in the screenshot that Folklore and Evermore are close together; their lyrics look like they are drawn from the same kind of topic. Reputation is really different from everything else. And the early albums also look close together. This is about the lyrics, not the music: the words and how they go together. So this can help you understand another way to use topic modeling.

All right, it's just 5:31. I want to highlight this again: we get raw text data, and we transform it to a tidy format so we can do exploratory data analysis and understand what is in our text. Then we transform it again, because when we build models we have to have a matrix to do linear algebra. Once we have the model, we can tidy again, getting back to the nice rectangular data frame that helps us make plots and explore the results of our model.

Word embeddings

So let's jump into the last piece, and after that we'll have some time for questions and talking about this more. If you're working along, on this website we're going to open this last set of slides: word embeddings. And if you are in Posit Cloud, you want to close the topic modeling file and open the embeddings file.

Okay, last 30 minutes. Go. If you're working locally, here are the packages we'll be using: the same two from before, tidyverse and tidytext, and then a new one called wordsalad.

The data set we're going to use for this last example is a data set of cheeses. Everyone loves cheese, right? It's time for some cheese. In this data set we have the name of the cheese, what kind of milk it is, the country it's from. But we also have some very interesting data on the texture, the rind, the color, and the flavor. And the aroma also looks kind of interesting to me. "Barnyardy." I love it.

So these are text data, but it's not natural language like Don Quixote; it is a limited vocabulary. That makes it good for this example, because we can do it very quickly here together. This is a little bit of a toy example, but it can help you understand embeddings. We cannot use every word in the world when we describe the cheeses: there's some small set of words, and also they aren't sentences. They're just adjectives.

So think of it as: we don't take all of natural language, just a subset, just a little bit. Here we're going to look at the flavor. Here are five example flavors; this is what our data looks like. "Creamy, sharp, strong": that's one flavor description. "Buttery, milky, sweet" is another. "Buttery, tangy, mild, nutty, sweet," and then "acidic." These are the kinds of words in the descriptions of the flavors of the cheeses.

If you are working along in Posit Cloud, you can download the data and then you have it there to do this yourself. When we build models, as I said in the last section, we need a matrix to be able to do linear algebra, to build some kind of model. The matrix that you make is typically very sparse: it has very high dimensionality and can have a huge number of features. With natural language, most documents do not use most words, so the data is very sparse. This cheese data is about 95% sparse: 95% of the entries are zero.

If you have a matrix where the rows are the cheeses and the columns are the flavors, the words used to describe the flavors, it's 95% just zeros. And that's actually not so sparse; if you had natural language, it would be more like 99.99% sparse. Very sparse, very high dimensionality. Notice that I have a thousand documents, so that's a thousand cheeses, and 46 features. That's maybe medium dimensionality, but it's a lot of columns; 46 columns is a lot of columns. For real natural language, if you have 10,000 documents, you could have 10,000 or more columns. High dimensionality is a big problem for training models with language.
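The sparsity figure is just the fraction of zero entries; a quick Python sketch with a made-up cheese-by-flavor indicator matrix:

```python
def sparsity(matrix):
    """Fraction of entries that are zero: the 'how sparse is it'
    number quoted for the cheese data (~95%)."""
    cells = [x for row in matrix for x in row]
    return sum(1 for x in cells if x == 0) / len(cells)

# a toy cheese-by-flavor indicator matrix (1 = that flavor word is used)
m = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
print(sparsity(m))  # 0.75: even this toy matrix is mostly zeros
```

With a real vocabulary of tens of thousands of words, almost every cell in almost every row is zero, which is what pushes natural language matrices toward 99.99% sparsity.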

And this is where word embeddings come in, because if we think about it, words are not used together randomly. If you come from a statistics background: they are not independent of each other. In fact, they are used together in specific ways. When I say a sentence to you, I don't just choose words randomly and put them in a sentence; I use rules about how I speak.

And this quote is from someone named John Rupert Firth, a linguist who back in the 1950s talked about word embeddings as an abstract idea. They didn't really have them to use yet, but he had the idea that we can make an embedding. So think: embedding equals vector. Embedding equals vector. I want to make a column of numbers that will represent the text, because when I'm talking to you, I use words together in a way that follows rules about how the words are used, and we want to try to learn those rules.

You have probably heard of some of these things, because they are all now part of our lives. Maybe 10 years ago, word embeddings meant things like Word2Vec, GloVe, fastText. These are algorithms for taking a set of texts and learning a matrix representation of how the words go together. If you are a statistician or have a statistics background, think of it like PCA, but fancy: I have a high-dimensional space, and I need to learn a lower-dimensional space. These are all rules for using linear algebra to go from a super high-dimensional space to a lower-dimensional one, where I can then represent words and documents.


So now there are companies out there who have taken enormous data sets of text and are learning these embedding representations. All LLMs are built on the infrastructure of embeddings. You can go to OpenAI and get an API token, send text to OpenAI, and they will send back the embeddings from their model. What you get back is a long string of numbers: a way to represent a text or a document numerically, based on the rules of how words are used together, because it's not just random which words go together.

Now, I highly recommend that you read this if you are interested in embeddings. It's a little long, but it is excellent as a way to understand what embeddings are. All the resources and slides are available for you; you can go to the URL, and I'll be here afterwards. So I do recommend that you go and read it to understand more about embeddings and how they're used. But today, we're going to make some cheese embeddings.

Okay, so time for some cheesy embeddings. We take that cheeses data set and the flavor column, remember it was like milky, sweet, creamy, and we remove the commas. Then we are going to use the GloVe algorithm to learn word embeddings. GloVe is available in the wordsalad package. So what we're doing here is saying: I have all these cheeses, and I have their flavors described in words. Let me use GloVe to learn how those words are used together.

So think of GloVe here as like PCA. In analogy, not in detail. We have a high-dimensional space, and I'm going to learn, I think I did 10 dimensions. I have, I think it was about 50 dimensions of flavor words, and I want to learn a lower, 10-dimensional space. The analogy is that OpenAI took the whole internet and for months and months chugged away on computers to learn something like a 1,000-dimensional representation. So what we're doing here is a small example compared to the embeddings that are used in all of these LLMs.

And the tokens that we have are things like vegetal, tart, meaty, floral. These are the tokens, and these are the embeddings, the vectors.

And so now for every cheese, we have all these words in the lower-dimensional space, and we can find a mean for each cheese. So these are cheese embeddings. This might be the first time anyone has ever made cheese embeddings. No, probably not.

So what we're doing is: we use unnest_tokens, we're back to tidy data, and we join in the flavor embeddings. Then for every cheese that I have, keeping around the country and the type and all that kind of thing, I take the mean. If you have a document made of words, there are different ways of getting a document embedding from a set of word embeddings. The mean is one of them; you can also use other ways of weighting.

So these are cheese embeddings. For every cheese, where is it in this 10-dimensional space? I had a high-dimensional space of all the flavor words, I made a lower-dimensional space, and then I found where in that lower-dimensional space all the cheeses are.
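The mean-pooling step described above can be sketched like this. The 3-dimensional word vectors are invented for illustration; the real GloVe output in the talk is 10-dimensional:

```python
# Made-up word embeddings (real ones come from an algorithm like GloVe).
word_vectors = {
    "buttery": [0.9, 0.1, 0.0],
    "nutty":   [0.7, 0.3, 0.1],
    "sweet":   [0.8, 0.0, 0.2],
}

def document_embedding(tokens, vectors):
    """Average the embeddings of the tokens we have vectors for."""
    known = [vectors[t] for t in tokens if t in vectors]
    dims = len(next(iter(vectors.values())))
    return [sum(v[d] for v in known) / len(known) for d in range(dims)]

# A "buttery, nutty" cheese gets the mean of those two word vectors.
manchego = document_embedding(["buttery", "nutty"], word_vectors)
print([round(x, 2) for x in manchego])  # → [0.8, 0.2, 0.05]
```

Averaging is the simplest pooling choice; weighted schemes (for example by tf-idf) are also common.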

Finding cheese similarities with embeddings

And now I can find cheese similarities. This is just cosine similarity. I have all these cheeses in this 10-dimensional space, and I compute cosine similarity for every pair of cheeses, to say which cheeses are close together in the words that are used to describe them, and which cheeses are far apart.
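Cosine similarity itself is a short computation: the dot product of two vectors divided by the product of their norms. A minimal sketch, with invented cheese vectors:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented cheese embeddings for illustration.
buttery_nutty = [0.8, 0.2, 0.1]
buttery_milky = [0.9, 0.1, 0.2]   # a similar flavor profile
floral_meaty  = [0.1, 0.9, 0.8]   # a very different flavor profile

print(cosine_similarity(buttery_nutty, buttery_milky))  # close to 1
print(cosine_similarity(buttery_nutty, floral_meaty))   # much lower
```

Identical descriptions give a similarity of exactly 1, which is what shows up in the Manchego results below.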

Think about the analogy here: you can use embeddings in many ways. The LLMs use them to make sentences. You ask ChatGPT something, and when it's time to make an answer, it asks: what is the most likely next word? It uses embeddings like this to keep choosing next word after next word, because we can find which things are close together.

But we can use embeddings in many ways, and one of them is finding things that are similar to each other. This matrix contains the similarity scores for each cheese compared to each other cheese. So let's say, I'm enjoying my trip to Spain very much, I'm interested in one particular cheese while I'm here: Manchego. Manchego is here. It's from Spain, and the flavor in this dataset for Manchego is buttery and nutty. That sounds about right to me. I think that sounds like Manchego.

So let's say I want to know about Manchego: what are the most similar cheeses to Manchego? Maybe I go home and have to find a different cheese, or maybe I travel to a different country and want to find an equivalent. These are the ones in this dataset that are most similar. The top couple have a similarity of 1, so they are exactly like Manchego in the flavor.

They are exactly like Manchego in the words for the flavor. Notice these cheeses have exactly the same description, buttery, nutty; that's what a similarity of one means. Then we start having some where the numbers are very close to one, but not exactly one. Notice the descriptions can be longer or shorter. What the embedding has learned is which words are used together, which words are used in a similar way. We have buttery in all of them, but not nutty in all of them. Buttery, nutty is very similar to buttery, caramel, fruity, full flavored, nutty. What the word embedding can learn is that those are words that go together: in the high-dimensional space of cheese descriptions, saying something is buttery and nutty is similar to saying it is buttery, milky, and sweet. This is how word embeddings work.


Now maybe I want to know the least similar cheese to Manchego: what's far away from Manchego in the high-dimensional cheese space? It's these. I've never heard of any of these cheeses, but we can look at what they're called. That first one sounds interesting: floral and meaty. I'm a little scared of that cheese. I would try it, but I'm a little scared. Notice that the next ones both say they're full flavored. What this tells us is that a cheese that is buttery and nutty is far away from a cheese you would call full flavored; a garlicky, meaty cheese is far away from a buttery, nutty cheese. This is a small example with just cheese flavors, but this is how embeddings work: when you have enormous data sets of language, you can learn in the high-dimensional space which things are similar, and what the relationships are in the way language is used.

If you are working along yourself, you can put in other words, find anything else you might be interested in, and see, for example, what cheese is medium: not close, not far, but kind of in the middle. What is that? In the last part, I'll take maybe five minutes to wrap up, and then we can have some time for questions.

Real-world embeddings and fairness

If you'd like to see another use for embeddings in the real world, this is on YouTube: it's horror movie descriptions, and it uses the OpenAI embeddings. So instead of learning them myself, the way we did here together, it goes to OpenAI, gets the embeddings from the API, and then uses them to find which horror movies are similar and which are far apart. That's another nice example. Now I'm just going to take about five minutes and talk a little bit about fairness and word embeddings.

The embeddings work the way we've been saying: they're trained from a corpus of data, and human prejudice or bias actually becomes baked into those numbers. We think of numbers as neutral, but since we learn these numbers from large data sets of human-generated text, whatever is going on in society becomes baked into these sets of word embeddings. You can click on the link at the bottom and read more about this, but over and over and over again, when people use word embeddings, these kinds of issues crop up.

Where I live in the U.S., you find that first names that are more common for African Americans are closer in the high-dimensional space to unpleasant feelings than European American first names. Women's first names are associated with family, while men's first names are associated with career. Words associated with women, like mother, niece, aunt, are more associated with the arts, whereas terms associated with men are more associated with science. Over and over and over, we see these kinds of examples. It is so ingrained in this technology that embeddings can actually be used to quantify change in social attitudes over time. This is like taking a bug and making it into a feature: people have done analyses of historical text and, by looking at the embeddings, actually measured how beliefs about, say, women have changed over time. It is so much a part of this technology that you can use it to measure bias.


So can embeddings be de-biased? What is happening? What, for example, has OpenAI done or not done? What can we think about when we think about embeddings? There are methods you can try. This is all matrices and linear algebra, right? So people have taken approaches where they identify some bias, say an axis between women and men, and reproject the embeddings along that axis so that you force away any difference. You find the math-science or arts association with men and women, and you reproject the matrix.
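The reprojection idea can be sketched as removing the component of each word vector that lies along an assumed bias direction. The 2-D vectors and the "gender direction" below are invented for illustration; real debiasing methods work in much higher dimensions and estimate the direction from data:

```python
def project_out(vector, direction):
    """Subtract the component of `vector` that lies along `direction`."""
    norm_sq = sum(d * d for d in direction)
    scale = sum(v * d for v, d in zip(vector, direction)) / norm_sq
    return [v - scale * d for v, d in zip(vector, direction)]

# Pretend the first axis encodes a gender direction.
gender_direction = [1.0, 0.0]
science = [0.6, 0.8]

debiased = project_out(science, gender_direction)
print(debiased)  # → [0.0, 0.8]
```

After projection, "science" is orthogonal to the assumed gender axis; as the talk goes on to say, research suggests this mostly hides the bias rather than removing it.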

So people have tried that. Another approach that has been tried is adding to your training data: you find examples in your training data that say, for instance, "my father was a scientist", and you add a new one that says "my mother was a scientist". So you augment your training data. And other people suggest that messing with the training data isn't enough, that none of this works, and you have to make the correction at the point where a decision is made. Take the example of ChatGPT: when they decide what to show the user, someone types in something like "can a woman be a scientist?", they look at the results, and they program in a rule: don't ever say a woman can't be a scientist. They make the correction at the end.

There's very interesting research showing that for anything you try, debiasing still allows stereotypes to creep back in, and the approach that seems best so far is augmenting with counterfactuals. The first one, reprojection, basically doesn't work at all, because once you reproject, you end up making things worse for another set of words that you happened not to think to put in. We don't know what OpenAI does, because they don't tell us, but almost certainly the Googles and the OpenAIs are making the correction on the output, with if-then rules about what comes out. But it seems like, if you can afford to make your training data even bigger, that might be the approach with the best results in the long term.

This is small, but I do recommend this paper. It is from 2019, so before ChatGPT, but it is wonderful: "Lipstick on a Pig: Debiasing Methods Cover Up Systematic Gender Biases in Word Embeddings But Do Not Remove Them". I love it. Trying to debias the word embeddings is like putting lipstick on a pig: you just make it worse for other situations.

Okay. So with that, I will say thank you so much. I think we have a little bit of time for questions. It has been such a pleasure to be here with you. I will put the URL up again; if you want to revisit the slides, the code, any example, it's all there for you to have. The Posit Cloud instance will come down probably tomorrow or so, and if you're going to use Posit Cloud again, you will probably need to delete this one before you can use a different one. So I think that's all the logistics. I want to say thank you so much for having me, and I would love to answer a few questions if you have them.

Q&A

Hello. Thank you very much for your nice presentation. Congratulations. Maybe I'm going to say something silly, forgive me, because I knew practically nothing about text mining. I had heard about most frequent words, the log of the idf, but not much, and I discovered a new word today with the embeddings. While you were speaking, I was wondering how to bring statistical measures into this world of text, and I want to mention two possibilities, just in case, just for me to know if they could be possible. For example, when you say there are some associations in the embeddings, women with family, men with science: how do you measure that association? Because maybe we could use measures from data analysis, the Jaccard coefficient, Yule's Q, et cetera, from contingency tables, to put a number on the strength of that association.

I don't know. Of course, I guess it has been done, but I am a complete illiterate here.

No, no, no. So if you want to just look at the embeddings themselves, you can find those kinds of bias, in a technical sense but also in the colloquial sense of "these embeddings are biased", with distance measurements, say cosine similarity or some other way of measuring distance. You can see that in the high-dimensional space, words like mother, aunt, all kinds of feminine words, are closer, say, to family words, while words that are specific to men, say father, uncle, are closer to career words. These are some of the classic examples. King and queen is another classic one: king minus man plus woman equals queen. That one is good, but the problem is that doctor minus man plus woman equals nurse, which is terrible, and you find that result. So you can do these kinds of math with the words, because they're represented in this high-dimensional space.
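The king-queen arithmetic mentioned here can be sketched with vector addition and a nearest-neighbor lookup. The tiny 2-D vectors below are constructed so the analogy holds by design; real embeddings learn this structure from data:

```python
# Invented 2-D vectors: axis 0 roughly "gender", axis 1 roughly "royalty".
vectors = {
    "king":  [0.9, 0.9],
    "man":   [0.9, 0.1],
    "woman": [0.1, 0.1],
    "queen": [0.1, 0.9],
}

def analogy(a, b, c):
    """Compute a - b + c, element-wise."""
    return [x - y + z for x, y, z in zip(a, b, c)]

def nearest(target, vectors, exclude):
    """Word whose vector is closest (squared Euclidean) to target."""
    def dist(v):
        return sum((x - y) ** 2 for x, y in zip(target, v))
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return min(candidates, key=lambda w: dist(candidates[w]))

target = analogy(vectors["king"], vectors["man"], vectors["woman"])
print(nearest(target, vectors, exclude={"king", "man", "woman"}))  # → queen
```

Excluding the query words before the lookup is standard practice, since the nearest vector to the result is often one of the inputs.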

So how else can we quantify this kind of bias? We see over and over again that when we use embeddings in a system, we end up with results that are biased in the colloquial sense: the system is not being fair, it is behaving unfairly. Embeddings like this are behind much automatic translation. If you go to Google Translate, embeddings like this are what are used to map from one language to another. I'm sure you've seen examples of translating from a language that does not have gendered pronouns to one that does, and back again, and you end up in a loop. The doctor-nurse one has happened so many times: you construct something like "she is a doctor", translate it to a language with no gender, like "that person is a doctor", translate it back, and it comes out "he is a doctor". So that's more qualitative. We see it both ways: you can measure it using the math of the embeddings, and we see it in the way the systems behave once you use them.

Okay. The second question is the following one. You have a high-dimensional space, but maybe using multidimensional scaling, or whatever dimensionality reduction technique, you can reduce the space to a few dimensions related to the cheese. If you can represent the text in that lower-dimensional space, my question is: why not use distances to translate our cheese space into a physical space, where we could use all the instruments of classical statistics? Some new possibilities would open up.

Yeah. Okay. So the algorithms for finding word embeddings are more efficient for text than, say, generic dimensionality reduction. I said it's like PCA, in analogy, but PCA on words does not work nearly as well as these text-specific algorithms. A simple example that's a little closer: you look at pairwise information for sliding windows along a text, and then you can do dimensionality reduction on that. The reason people developed Word2Vec, GloVe, fastText, and now these much more sophisticated approaches, rather than use the straightforward dimensionality reduction from more classic areas of statistics, is how sparse text data is, and also that language has rules underneath it. So these algorithms are a little different, because they are built to take advantage of the rules inherent in language to get the representation. Once you have the representation, you can do anything that you might do with another matrix.

So you can use them as features for training a supervised model. Say you have a bunch of documents, say a bunch of cheeses, and then a thumbs up or thumbs down on each cheese. We can make a representation with embeddings, take the flavor representation, and use it to train, say, a logistic regression model. Once you have the representation, you have many, many options of things you can do. But traditional dimensionality reduction works more poorly on text than these methods, because they're built specifically for this application. Yeah. Thank you.
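The embeddings-as-features idea can be sketched with a tiny logistic regression trained by gradient descent. The 2-D "cheese embeddings" and the thumbs-up labels are invented for illustration; in practice you would use a library and the real 10-dimensional embeddings:

```python
import math

# (cheese embedding, liked?) pairs: pretend axis 0 ~ buttery, axis 1 ~ pungent.
data = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.1, 0.9], 0), ([0.2, 0.8], 0)]

weights, bias, lr = [0.0, 0.0], 0.0, 0.5

def predict(x):
    """Sigmoid of the linear combination: probability of a thumbs up."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Plain stochastic gradient descent on the log loss.
for _ in range(200):
    for x, y in data:
        err = predict(x) - y
        weights = [w - lr * err * xi for w, xi in zip(weights, x)]
        bias -= lr * err

print(predict([0.85, 0.15]) > 0.5)  # a buttery cheese → True
```

The point is only that once text is a matrix of numbers, any standard supervised model can be trained on top of it.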

Thank you very much, Julia, for this interesting presentation. I would like to ask you a small question. In the presentation, you gave us some examples, about soda or cheese, examples that are more or less real, and they are good. So my question is: can you give us some examples, some real case studies, where you have applied text mining in the past, or will in the future?

Yeah. So right now I work on tools for data scientists to use, but in my last job before this I was a data scientist, and a big one was survey results. I would do topic modeling of survey results. What are the most common topics? What topics are more common from users who are very experienced versus newcomers to Stack Overflow? For our customers, we'd have survey results; maybe some customers said, we quit, we're not your customer anymore. Let's go to all their survey responses: what did they talk about, what was most important to them? Of course, you want to balance that with really having a human read those kinds of results, but you can use the tools of text analysis to look for differences. People who kept giving us more money said X, Y, Z was important; people who said we quit, we're not your customer anymore, said A, B, C, and we need to understand that. So survey results, a huge one.

I also worked, in the same Stack Overflow example, with the text that people post on Stack Overflow. We had a model to try to detect toxic speech: if someone is about to type in something nasty, to pop up and say, are you sure you want to say that? It was interesting, though, because that was before the era of LLMs; I've been at my company now for four to five years. And actually, if any of you feel like you've had a bad experience on Stack Overflow, it was probably not that someone cussed you out or used very rude language. It's probably that they talked like they were smarter than you, and that's actually pretty subtle and more difficult for these kinds of models to detect. So those are some examples of the kinds of things I have used text analysis for in real life.

An area where you hear about these kinds of approaches a lot, a little more research oriented, is the digital humanities: people look at classic texts and ask what we can learn about literature and culture with these kinds of tools. One more common thing I have worked on is emails coming into a support system. You don't want to send an automated response, but you do want to classify them so that each one gets routed to the right person. You can have either supervised or unsupervised approaches: when an email comes in, you give it a score or a label, and it gets routed accordingly. Of course, when it goes to that person, they can say, no, no, this goes over here. But if your model does a good enough job, you can send it to the right place. So those are some more practical examples.

Thank you very much. More questions?

Well, thank you very much, Julia, for this beautiful presentation. It was a pleasure for us to be here with you. Thank you very much.

Thank you so much. Thank you.