Resources

JULIA SILGE: TEXT MINING WITH TIDY DATA PRINCIPLES - UNIVERSITY MIGUEL HERNANDEZ

PROGRAM: https://revolution.servicioapps.com/w/xxxviiicongresoasepelt/186821/programa-cientifico-del-congreso-asepelt-1?preview=1

Julia Silge is a data scientist and software engineer at Posit PBC (formerly RStudio), where she works on open source MLOps and modeling tools. She is an author, international speaker, and real-world practitioner focused on data analytics and machine learning. Julia loves analyzing text, making attractive graphs, and communicating about technical topics with diverse audiences.

Jul 15, 2024
1h 58min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Well, good morning. Hello, everybody.

We are so lucky to enjoy the presence of Julia Silge here today. It is a pleasure for us that you are here today with us in this ASEPELT seminar. Julia is a data scientist and software engineer at Posit, where she works on open source modeling and MLOps tools. I'm sure that you have read some of her books or some of her posts on LinkedIn or GitHub. She's an author, an international keynote speaker, and a real-world practitioner focusing on data analysis and machine learning. Julia loves text analysis, making beautiful charts, and communicating about technical topics with diverse audiences.

Thank you very much, Julia. We hope that we are going to enjoy this seminar.

Thank you so much, Yama, for that wonderful introduction. I am so happy to be here with you today. So we have a session of two hours, and I hope that during these two hours, if you are interested, you can do some interactive work, some actual practice of what we're doing.

So to get that started, if you have a laptop with you, I'm going to make this big for a bit. If you can go to this URL here: my name, juliasilge.github.io, slash asepelt-2024, the year. So if you want to go there right now, I'll show you what you will see.

And so that web page will look like this. And I invite you, if you have not already, to do some of these things at the preparation link here. If you want to practice here with me today, there are two choices. One choice is that you already have R and RStudio on your computer, you can install packages, and you can work yourself. The other option is that you go to Posit Cloud.

You need to have an account there. It's a free account, but it lets you do the analysis here with me today. So you go to the URL that I showed on the last slide, and then you go to the preparation page. You can click through and create a Posit Cloud account, which is free. And then I will give you the link in just a moment to join where the code is, so that you can run it yourself and we can learn together.

About Julia and Posit

All right, so my name is Julia Silge. I work at Posit. Many of you may think, what, I've not heard of Posit. So Posit is the company formerly known as RStudio. If you have worked with the RStudio IDE, that is the company that I work for. And my background is different from, I think, all of yours. I don't have a background in economics or business; my background, a long time ago, is in astrophysics.

And then about 10 years ago, I switched to work in data science. So first, I worked as a data science practitioner. My title was data scientist. I worked for tech companies. And then I started transitioning into being someone who built tools for data scientists to use. So open source tools in R, in Python, for people to do their practical tasks. Some of the tasks that I worked on, the kind of tools I worked on, are for text analysis, which is what we'll do here today, and then also tools for machine learning and MLOps.

And the books that I've written are also on these kinds of topics. Let me go back to this website here: if you want to learn more after this, there are some links up here to the book, and a course where you can learn online. And then this is another book that's about text. Also, if you want to see the slides, you can click on them down here and see them.

Joining the Posit Cloud workspace

So if you have the Posit Cloud account, then you go to this URL right here. So bit.ly slash join tidy text tutorial. So if you type that out right now, it will take you somewhere that then you can join. And I'll show you what it looks like here in a moment. So if you want to work together here right now, go to this URL.

And if you are someone who you use R a lot and you have R on your computer already, you can install these packages. And then if you have those packages ready, we can work together.

So let me show you what it will look like when we join. You will come to a website that looks something like this, because that's a link that asks you to join this workspace that has this project in it. And so when you click on this project, it will open up to look something like this. Maybe it won't be open yet.

So it will reload. It will either say initializing, or it will do this kind of thing to say it's getting started. We'll wait for it to finish, and then I'll show you how to open it up.

So if you're totally new to RStudio, this may look different to you. But if you've used RStudio a little bit, you may be like, oh, right. Here it is. So this is RStudio, but in a browser. So when you log in, you'll probably see the files like this. And if you choose the directory code, there'll be three files here. And so we're going to start with the first one, EDA, like this. And it shows you some text that we can go and work through together.

So when you join, do you see this? Oh, I made it private. Ah, sorry. Sorry, sorry. Now it is accessible to everyone.

So my apologies. And you can come in here; the main project directory is this, and this is everything. Then if you go to code and pick this first one, EDA, for exploratory data analysis, there.

So the free account for Posit Cloud, you can keep it for, I think, a week or something. I think for free, you can have one project, so if you were going to do another thing in Posit Cloud, you would have to delete this project. But it's OK for today; you will not be charged money for what we do. Only if you were going to want to add another project, you would have to delete this one.

Overview of the session

OK, so we are here to talk about text analysis. I've been working in tooling for text analysis for about 10 years, and it's a very interesting time to be thinking about it. Even 10 years ago, people were building tools for being able to do quantitative analysis of text of all kinds.

Some projects I know people have worked on are things like surveys, social media data, and reports of various kinds, say, from companies. And one thing that we can see is that there is information latent in text data sets. But it can be uncertain how to get it out. How can we begin to analyze the text data and get the information that is in it?

Often, if you have quantitative training, if you took statistics, if you learned data analysis, often what you learn on are nice rectangles of data. And maybe they're all numbers. And how do we apply these kinds of quantitative training to text data, which you can think of as unstructured data? There's information in there, but the data is not structured in a table format, like what you might have in a spreadsheet. So often, in our quantitative training, we don't learn how to deal with text.

And now, in maybe the past two years, we start to see more tools based on text data, like LLMs, ChatGPT, and similar tools that are trained on enormous collections of text. And as someone who has your own data that may be text, you're thinking, how does this impact me? How can I use or understand these kinds of tools? So that is what we're going to do today.

So the goal here today is that if you have not dealt with text data, you can take these first steps. It is not everything for today, but it is the first step so that you can move forward and learn how to deal with text data.

Tidy data principles for text

So I want to make the argument that using tidy data principles with your text gives you a step up: you can deal with it quickly and with greater fluency than with other kinds of tools. If you have learned R a little bit, you may have heard about tools that are called tidy, things like the tidyverse. These are sets of packages that adopt tidy data principles.

If you come from a SQL background, it's the same thing as the rules of normalized tables: every observation is a row. If you take a new observation, you don't add a new column; you add a new row. So say I am going to take the temperature outside today, tomorrow, the next day, for many different sensors. I have columns for sensor, date, and temperature. When it's time to take a new reading, I don't add a new column with a new temperature on a different date. I add a new row.

So we have one observation per row, and then one kind of table for a kind of observation. So these are this idea of tidy data principles. And we can apply them. It can be an approach for us to take unstructured text data and get it to structured, get it to structured data so we can use algorithms and math on it.
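As a minimal sketch of that sensor example (the sensor names, dates, and temperatures here are made up for illustration), reshaping from one-column-per-date to one-row-per-observation looks like this with tidyr:

```r
library(dplyr)
library(tidyr)

# Untidy: each new date of readings adds a new column
wide <- tibble(
  sensor       = c("A", "B"),
  `2024-07-15` = c(21.3, 19.8),
  `2024-07-16` = c(22.1, 20.4)
)

# Tidy: one observation (sensor, date, temperature) per row
long <- wide %>%
  pivot_longer(-sensor, names_to = "date", values_to = "temperature")

long  # 4 rows, one per sensor-date reading
```

Taking a new reading now means appending one row, and the table's shape never changes.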


So the main package we'll be using today is called tidytext. Tidytext is an R package for text mining using tidy data principles. And there's a book. And then today, we have some resources here. The URL that I gave you before is being built by the code in this repo. So if you would like to see the code that makes the slides, so that you can use them later, you can go here.

So for example, if you, today, you're working in Posit Cloud and you think, oh, wait, what did we do? I want to see that again. This will be here always. So this URL and that URL at the bottom will be there for you to go back and look at. The Posit Cloud instance will go away, but the code will be there for you to look at afterwards.

Tokenization with unnest_tokens

OK, so we are here to talk about text analysis. So we have, from now, maybe one hour and 40 minutes. And there are two things that we want to cover. The first is EDA, which means Exploratory Data Analysis. This means we're not training a model. We are, instead, we want to summarize, visualize, make plots. We want to start to understand the content of the text. And then after we talk about EDA for text, we'll talk about modeling for text, building the kinds of models that you can build for text data.

So we'll maybe go to the hour with EDA. And then maybe we'll all stand up for five minutes at 5 PM and then come back and talk about modeling, maybe from 5 to 6 PM. So that will be the plan for what we go through today.

So I was telling you about tidytext and tidy data principles. And some of you may think, what are you talking about? What do you mean? So let's look at an example. This is the poem "Cantos nuevos" by Federico García Lorca, just the beginning of the poem, and I have assigned it to a variable here. Notice that this is what, in R, we call a character vector. So it's a vector data structure.

And it has atomic elements of type character in it. If you are familiar with Python or other tools, you may think of them as strings. But notice that it's a vector of length 4: this is a vector, and it has four elements. So this is not yet tidy, meaning we don't have one observation per row here. Think of this as the unstructured data.

Now, we can put it into a data frame structure. A tibble is a modern implementation of a data frame. So if I put this into a structure like this, it's starting to be more like a data frame, more like a spreadsheet, more like a rectangle. I have two variables: one tells me which line of the poem it is, and the other tells me what is in the poem, the content of the poem. So we're starting to have some richer information, but this is still not a tidy data set when we talk about tidy data principles.

So we can load the tidytext library and then use this function, unnest_tokens. And now here we have what I want to say is tidy data for text. Notice that this function is doing several things, mostly options that you can turn on or off depending on your particular needs or analysis. But notice that instead of having one row per line, now I have one row per word.

So this is called tokenization, where we take text and we divide it into tokens. The most common token when we work with text is a single word. So the process of tokenization is the process of identifying what are the words, where do I break them apart? We also did a little bit of some text processing to make it easier for us to analyze it later. We got rid of capitalization. We took some steps so that we got rid of punctuation between words. These again are options you may want to keep or you may not, but these are the defaults here.

So we've transformed our data from unstructured to now structured. Notice that I still know which word came from which line in the poem. So when we tokenize text and transform its shape so that it's in the tidy data still, we still know things like all the other information we have. If we had other columns here, not just the line of the poem, but other information about that text, we keep it all. We keep it all as we move forward. So we have one observation per row.
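A minimal sketch of that transformation (I'm using the opening lines of "Cantos nuevos" here, which may not match the exact excerpt shown in the talk):

```r
library(dplyr)
library(tidytext)

# The beginning of the poem as a character vector of length 4
text <- c(
  "Dice la tarde: «¡Tengo sed de sombra!»",
  "Dice la luna: «¡Yo, sed de luceros!»",
  "La fuente cristalina pide labios",
  "y suspira el viento."
)

# Put it into a tibble, keeping track of the line number
text_df <- tibble(line = 1:4, text = text)

# Tokenize: one row per word, lowercased, punctuation stripped (the defaults)
tidy_poem <- text_df %>%
  unnest_tokens(word, text)

tidy_poem
```

Notice the line column is carried along, so we still know which word came from which line of the poem.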

Working with Don Quixote from Project Gutenberg

So this is the function that does all this. Now we're going to take one step up from a poem to a whole book. Project Gutenberg is an online project that makes available the full text of works that are in the public domain.

Now, if we run this code, you will have the full text of the book that corresponds to this ID. So let me show you what we're going to do. Once you're over here, you want to execute this. This first part sets where in the world we are getting our Gutenberg text from, the mirror. And then we want to execute this like so.

So it's looking at the mirror that we want, and now I have this thing that I call full_text here. So that worked for me.

Okay, great. So let's go back to the slides. I just want to make sure, because if you get stuck at the beginning and you don't even have the text, you get very stuck; there's nothing to do later. So I'm just making sure that people who want to work along are getting it.

So this is the next step of what we're going to do. I bet some of you can tell what book this is; it's not a secret. This is Don Quixote, the Spanish-language original, not a translation. And we have the next lines that we're going to run here.

So this is the part where I take my raw data and transform it into tidy data for me to explore. I'm going to keep track of which line of the book each word comes from; think of that like printed on a page, line one, line two, but for the whole book. That's some metadata I'm going to keep track of about where the words are from. It also has the Gutenberg ID, which is the same for the whole book, because the whole book is, of course, Don Quixote. And then we have the tokenization, where we take the text and break it apart into pieces.

So I'm showing you here with an example that is a book, a fiction, but this applies to all kinds of text, survey results, documents that you have, reports, any kind of text data that you have, you can use for this kind of analysis.

All right, so notice here that the thing you need to add, if you are typing along, is the unnest_tokens call right here. Anytime in this interactive exploring together, when you see this, you need to update it. And what did I say it was called? unnest_tokens(word, text): we are making the word column from the text column here. So I can run this and this, and now I have it right here. This is now a tidy text data set.

Counting words and stop words

All right, so that's the step you want to take if you are going to work along and have some experience with this. Now, these next functions use verbs from the tidyverse. So all of you can look at these words and try to decide: what do you think is going to happen if we run this code? We say, okay, I have my tidy book, and then I'm going to count the words with sort = TRUE. So maybe very briefly, turn to someone around you and ask: what do you think I'm going to get? What do you think will happen after I run this?

All right, so what happens is we get a count of the words in this book, sorted so that the most common words are on top. When we adopt tidy data principles in how we store our text, we are able to use very well-exercised, well-documented tools from the tidyverse ecosystem to analyze the text that we have. So the only part here so far that is from tidytext is that unnest_tokens. Everything else is from a set of tools that is built for working with quantitative data.

All right, so now, I'm no great Spanish speaker, but I look at this and even I know: those are not very interesting words. When you take any collection of natural text data and count up what's most common, the most common words are very boring. They're usually not important for the analytical purposes you have, and we have a name for them in text analysis: stop words.

So stop words, if you hear stop words, it's a word that usually means like, it's not important, it's very common, it's not very informative. It's there to make the structure of the sentences so that we can understand each other, but they're not very specific, usually, to the kind of data that we want to deal with.

So there are data sets of stop words that are available, and it's important to ask the question: where did these data sets come from? There are a bunch of them. This one, the default, is from a lexicon called Snowball, which is older software for analyzing text, and it includes a list of stop words. But there are many different data sets of stop words, and these lists are usually pretty old. The way that they're created is you take some big collection of language, not one book but many books, you add up the most common words, and you pick a cutoff and say, okay, these are the stop words and these are not. Or you compare across many different collections and ask which words are common to them. So this is an English list of stop words that's somewhat conservative.

It's somewhat small. It's not so big.

They are not only for English. There are data sets for other languages as well. Notice this one is a bit bigger, but of course, it's kind of comparing apples and oranges to compare English to Spanish, right, like in terms of what we might consider stop words.

So where do the non-English ones come from? Because computationally, we have a big inequality in text analysis between the English language and other languages. So where does this list come from? Some of them, I'm sad to say, are translated: you take one list and you translate it. Others are human curated. But we have different lexicons, different availability, different lists.

Notice that we can get them from another place; there are many different lexicons. This one is English also, but let's remind ourselves: the default one had 175 words. This one has 571. So they can vary in how conservative or aggressive they are. Take out everything, take out "able", take out "accordingly": there are different lists for different purposes.
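A quick sketch of comparing lexicons with tidytext's get_stopwords() (the counts in the comments are what the Snowball and SMART lexicons shipped with at the time of the talk):

```r
library(tidytext)

# Default: the Snowball English lexicon, a conservative list
nrow(get_stopwords())                  # 175 words

# Stop word lists exist for other languages too, e.g. Spanish
nrow(get_stopwords(language = "es"))

# A more aggressive English lexicon, SMART
nrow(get_stopwords(source = "smart"))  # 571 words
```

Each call returns a tibble with word and lexicon columns, so the lists drop straight into tidyverse joins.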

Removing stop words and finding common words

So now that we have this kind of list, we can use it to get a different answer for what the most common words are. For your next exercise, you have the code. We want to ask: what are the most common words in Don Quixote after we remove the stop words? This is all the correct code, but the lines are scrambled up.

So if you look at it, you want to put the lines in the right order. If you are very new to R, you may be like, I don't know. Don't worry; we'll show you the answer in just a minute. But maybe if you have a neighbor who is a little more familiar with R, you can work together, or you can try to make a guess. Which line do you think comes first, and what order do they go in?

So take just a moment to look at your code and try to rearrange it and see if you can rearrange it so that you can run this code. So I think hopefully if you're learning along, you see this, and then you can rearrange it.

So I'll do it here with you. First we'll use an anti-join. An anti-join is a way to keep the things in one set that are not in another set. And then, after the anti-join, we'll do the same count we did before, and then we'll pipe it to ggplot2.

So this is how you want to do it. Let's talk through these steps. We take our tidy data set. We remove the stop words. We count them up. Then we use this function slice_max to take the top 20. Then this starts our plot: we're going to do a visualization, and it makes a bar chart like so.
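Put together, the unscrambled pipeline looks roughly like this. The tiny tidy_book stand-in here is made up for illustration (a few words from the famous opening line); in the tutorial it comes from tokenizing the gutenbergr download:

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Stand-in for the real tidy_book (one row per word of Don Quixote)
tidy_book <- tibble(
  line = 1:8,
  word = c("en", "un", "lugar", "de", "la", "mancha", "hidalgo", "caballero")
)

top_words <- tidy_book %>%
  anti_join(get_stopwords(language = "es"), by = "word") %>%  # remove stop words
  count(word, sort = TRUE) %>%                                # count what remains
  slice_max(n, n = 20)                                        # keep the top 20

# Bar chart with the most common words on top
ggplot(top_words, aes(n, reorder(word, n))) +
  geom_col()
```

With the real book, the bars show Quixote, Sancho, caballero, and so on instead of the function words.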

So now we have this list for the most common words in Don Quixote. This is looking better. Now we have Don Quixote. We have Sancho. We have Caballero. You know, like we have words now that are telling us something about what Don Quixote is like.

Now, let's look at that third word, si. So it turns out that in that list of stop words in Spanish, sí with an accent, sí meaning "yes", is in it. But si meaning "if" is not in that list. They only have sí. So probably what happened is they tried to deduplicate the list, to not have any doubled items, and they took one out. I don't know exactly how this happened. But "if" is not in that Spanish list, while "if" is in the English list.

So what you should learn from this is that text analysis is full of supporting data sets, all the way up to pre-trained models, and they all have problems like this.


And so it's very important that you, as a practitioner, understand and have the tools to be able to dig into what the problem is, what's going on. When you use tidy data principles, it is straightforward to find out why this happened, and it is also straightforward for you to make the appropriate choice for your analysis. You say, okay, I add si to my list of stop words, the list that I use for this project. And for other things here, you decide: okay, I'm also taking that out. It's up to you to decide how to do this.

Word frequency in practice

So far, what we've talked about here is just the very beginning: we have text, we do some basic tokenization, and we ask even just the question, what is most common?

All right. So I'm going to briefly talk about some of this. What I've shown you so far is not mathematically very exciting, but I have found even those kinds of tools to be very impactful. For example, this is work that I did: I found a data set of pop lyrics, all the songs from the Top 100 chart since the 1960s, the chart-topping songs, with all the words to them. And then I could use the kind of tokenization that we just showed you and ask: what are the most common words?

And here, I find all the words, count them up, and then ask: which words are the names of states in the United States? We can see that California and New York are the most mentioned. In pop songs, what we talk about is California and New York. But I am a very good data scientist, and that means not only can I add, I also know how to divide. Those are the important skills, right? Adding and dividing. Mathematically very sophisticated; I'm making a joke. But it is actually interesting.

So not only do I know the words, I also know the population of these states. And so you can ask: which states are mentioned a lot in pop music relative to their population? Here we actually see a big cluster in the southeast of the United States. This was not country music, this was the pop list, but even in the pop list you see this representation of that part of the country. And Hawaii, almost no one lives there, but it looms large in people's imagination in pop music.

TF-IDF: measuring what a document is about

So okay. So this is just the beginning. But even the beginning can help you say interesting things about text data that you have. We're now going to move on to talk about, in a little more detail, how can we have measures that tell us what a document is about? So we're going to talk about a couple of statistics that you can measure about a document and how to combine them.

So term frequency is just what I was telling you before: counting, usually divided by the length of the document. So if Don Quixote says Quixote, Quixote, Quixote, over and over and over again, that word has a high term frequency in that document.

Inverse document frequency is this. It's not something that has a deep theory behind it; it's a heuristic. It's the natural log of the number of documents you have in a collection divided by the number of documents containing that term. So let's say we have 100 books, and only one of them has Quixote in it. That means we have a big number divided by a small number, we take the natural log, and we get a big number. Now say all 100 are English books and they all have the word "the" in them. Then the ratio is 1, and the natural log of 1 is 0, so that is a small number. That weights the word down.

So if you multiply these two numbers together, what you get is called tf-idf: the term frequency multiplied by the inverse document frequency. It's a weighting. It asks: which words are common, but common in one document compared to the other ones? It's about comparing documents within a collection. So tf-idf is a way to weight counts, and what it gives us is a measure of what a document is about, or of which words are important in one document compared to the other documents I want to compare it to.
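As a worked version of those two numbers, using the hypothetical 100-book collection from above:

```r
# idf(term) = ln(n_documents / n_documents_containing_term)
n_docs <- 100

idf_rare   <- log(n_docs / 1)    # "quixote": in only 1 of the 100 books
idf_common <- log(n_docs / 100)  # "the": in every book, so ln(1)

idf_rare    # about 4.6: rare terms get weighted up
idf_common  # exactly 0: a term in every document carries no weight

# tf-idf multiplies term frequency by idf
tf <- 0.02            # say the term is 2% of the words in one document
tf * idf_rare         # a meaningfully large weight
tf * idf_common       # 0
```

The 2% term frequency is an invented figure; the point is that the idf factor zeroes out words shared by every document, no matter how frequent they are.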

So now we're going to go back to the code, and we're going to get not one book but three books, so that we can compare across them. Let me remind myself how much I made you all fill in. This, I think, you can just run, and then you should have a collection. We have Don Quixote again, but also two other books. If you want, you can count the titles to see the three titles that are there.

And then what we're going to do is we're going to convert this more raw text to a tidy data structure for all of them together. So think of me now having not only one document but a collection of documents. These are long documents. They're whole books. But this could be short documents. So right now I have three long documents. But maybe think about survey responses. Maybe you could have— say again?

Like a text from an interview, or a survey result. Maybe you have 1,000 short documents; it works the same. And then what we're doing is counting. So we count up here, and Don Quixote is at the top for all of this, which probably means Don Quixote is the longest document, because it has que more times than the other books have que.

So what tf-idf allows us to do is account for all that. So you will run this. And let's think: what do the columns of this book_words object tell us? They say, for this book and this word, how many times it appears. So Don Quixote has the word que more than 20,000 times, and if we were to keep looking down, we would see that the other books have the word que a different number of times. So I'm counting up these words in the books. And now I can use this function right here, bind_tf_idf. This function computes tf-idf for the words that are in the documents.

And the n column tells us how many times each word is there. I think, if I remember right, you need to type this in. And then what we get, after we look at this (hopefully this isn't too fast), is this. Let's look at these words here. That function computed tf and idf for us and then multiplied them together, and what we have now is the tf-idf. Look at all those: it's all zero. That's because all three of these books contain these words: que, de, y, la, a, en. They're in all three books, so tf-idf has weighted them down. It says those are common to all of them.
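A minimal sketch of bind_tf_idf() on a toy two-book count table (the book labels, words, and counts are invented for illustration):

```r
library(tibble)
library(dplyr)
library(tidytext)

# One row per (book, word) with its count n
book_words <- tribble(
  ~book,     ~word,       ~n,
  "quijote", "que",       50,
  "quijote", "dulcinea",   8,
  "fierro",  "que",       40,
  "fierro",  "gaucho",    12
)

tfidf <- book_words %>%
  bind_tf_idf(word, book, n) %>%   # term column, document column, count column
  arrange(desc(tf_idf))

tfidf
# "que" appears in both books, so its idf and tf_idf are 0;
# "dulcinea" and "gaucho" are unique to one book and get positive tf_idf
```

This is the same pattern as in the tutorial: count words per document, then let bind_tf_idf() add the tf, idf, and tf_idf columns.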

So we can arrange the results in descending tf-idf, and what that shows us is something interesting. Now we have these three books, and we have words that are unique and specific to one title compared to another. I actually think it's pretty interesting: usted is not in the other two books; usted is only in this one. Gaucho and poema are in this book. And we see Don Quixote down here with Panza and Dulcinea. So we're now seeing words that are unique to one document compared to another.

You are welcome to do this unscramble, but I'm just going to go a little ahead to show you the results. So if you want, you can go on; I'm not going to take time right here. What we're doing is saying, for every title (so, group by title), give me the top ten words by tf-idf. And then we keep going and put it into a plot, a visualization, which will look like this.

So now this tells us the top ten words by tf-idf for each of these three documents. We see the Don Quixote words: "panza", "Dulcinea", "Rocinante". I was less familiar with these other books before I went looking for them for this talk, but I was like, oh, cool, this one has gauchos and poems; I can kind of learn what it's about. And then over here we have these interesting characters.

Weighted log odds

So think about a different application: if you have survey results, what tf-idf helps us do is identify things that are important in one document compared to another. It's not the only way to do this. We just talked about term frequency, but another way that I find very effective for this kind of work is what's called a weighted log odds. We're not really to models yet, but we're starting to have a little more sophistication in what we're doing.

So I imagine that you are all familiar that a log odds ratio is a way of comparing probabilities. And we can weight those log odds because we know the distribution of words: remember that some words are super common and other words are more rare. We want to weight the log odds ratio by how sure we are that what we're seeing is a real difference. Let's say we have some words that are very common, like "que" and "de". If we see a big difference in those words, we know that it is real.

Let's say there's a word that appears one time in one book and doesn't appear in the other book at all. We cannot be so certain that that's a significant difference; we have to weight. Because natural language exhibits a power law distribution: if you plot word frequencies on a log-log scale, we see that some words we use many, many times and other words we use very little. It's a characteristic of text data that we have this power law distribution.

And so when we use weighted log odds, it helps us get a good answer for which words have a higher probability of coming from one document versus the others. For those of you who are statisticians, the typical way of doing weighted log odds is with empirical Bayes. So with empirical Bayes, maybe we're starting to use a model, you know? But it just uses the data itself to do the weighting, to say what distribution we would expect. So we're maybe starting to get into some modeling, but not so much.

It looks just the same as `bind_tf_idf()`, except here we're doing `bind_log_odds()`. It is in another package, tidylo, because this actually works great not just for text data but for many kinds of data; anytime we have a power law distribution, it's a really great application, and that's definitely true of text. So this is a different statistic. The statistic before was tf-idf, a heuristic statistic, but one we can measure for a word in a document in a collection of documents. This is a different thing we can measure: the weighted log odds.
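tidylo's `bind_log_odds()` implements the weighted log odds of Monroe, Colaresi, and Quinn. The following is a simplified Python sketch of that idea, not the exact tidylo implementation: an informed prior built from the pooled counts, with the log odds ratio divided by its approximate standard deviation. All the counts here are invented for illustration.

```python
import math
from collections import Counter

def weighted_log_odds(counts_a, counts_b, prior_scale=1.0):
    """Compare word use in group A vs group B: shrink each count toward
    a prior taken from the pooled counts (the empirical Bayes step),
    then divide the log odds ratio by its estimated standard deviation
    so that rare words get appropriately uncertain scores."""
    total = counts_a + counts_b              # pooled counts act as the prior
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    n_prior = sum(total.values())
    z = {}
    for w in total:
        a = prior_scale * total[w]           # pseudo-counts from the prior
        ya, yb = counts_a[w] + a, counts_b[w] + a
        delta = (math.log(ya / (n_a + prior_scale * n_prior - ya))
                 - math.log(yb / (n_b + prior_scale * n_prior - yb)))
        var = 1 / ya + 1 / yb                # approximate variance
        z[w] = delta / math.sqrt(var)        # weight by uncertainty
    return z

a = Counter({"que": 50, "dulcinea": 10, "gaucho": 0})
b = Counter({"que": 48, "dulcinea": 0, "gaucho": 9})
z = weighted_log_odds(a, b)
print(z["dulcinea"])  # positive: more characteristic of group A
print(z["gaucho"])    # negative: more characteristic of group B
print(z["que"])       # near zero: common everywhere, similar use
```

Note that "que", unlike in tf-idf, is not forced to zero just for appearing in both groups; it scores near zero here only because its usage really is similar.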

So you'll notice here we have some similar results: "eugenia", "usted", "panza", "Dulcinea", some similar words. But it's a different approach, and the main difference is that weighted log odds can distinguish between words that are used in all the documents. Remember that when we had "que" in all three of the books, its tf-idf went down to zero. But a weighted log odds allows us to distinguish between words that are in all the texts: it can sometimes say, oh, this word is much more important in this document than that document, even if it is in all the documents.

N-grams and more complex analysis

Just in the next five minutes, we will talk about a little more EDA, and then we'll take a short break. So far, I've shown you only single words. But absolutely, we can move into more complex analysis, where instead of looking at one single word, we look at groups of words. Those are called n-grams.

So a single word is a unigram or 1-gram. A set of two words is a bigram or 2-gram. A set of three words is a trigram or 3-gram. You probably still have this: this is the text of Don Quixote again. And now we can start to use some more arguments to do a different kind of analysis. Before, we said one word is one observation, but maybe what we are interested in is sets of two words. So here we say `token = "ngrams"` and `n = 2`, and that will find for us all of the groups of two words.

So if you type this, and then we get rid of the ones at the edges that are not there, we have this instead of what we had before; this is probably what you need to type in if you're going to work along with us. Notice what's happening (now you hear my very bad pronunciation): "el ingenioso", "ingenioso hidalgo", "hidalgo don", "don quijote", "quijote de", "de la", "la mancha". So n-grams slide along like this; you get every combination of two words. Trigrams are the same, they slide along, and 4-grams slide along so that you get every set of four words.

You may notice it becomes a bigger data set. For one book, it's not so bad, but when you start using n-grams, the dimensionality of the space starts to become higher. Working with bigrams and higher-order n-grams also gets us richer information, though.
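As a minimal sketch of how n-grams "slide along" (in Python rather than the workshop's R, where `unnest_tokens(token = "ngrams", n = 2)` does this):

```python
def ngrams(tokens, n):
    """Slide a window of width n along the token list; each position
    yields one n-gram, so consecutive n-grams overlap by n - 1 words."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["el", "ingenioso", "hidalgo", "don", "quijote", "de", "la", "mancha"]
print(ngrams(tokens, 2))       # ('el', 'ingenioso'), ('ingenioso', 'hidalgo'), ...
print(len(ngrams(tokens, 2)))  # 7 overlapping bigrams from 8 tokens
```

The overlap is why the data set grows: nearly every token starts a new n-gram, and the vocabulary of distinct n-grams is far larger than the vocabulary of words.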

So here are the most common bigrams, just like before. "De la" is the most common, but after that, "don quijote". But look at a lot of these: these are, again, stop words, not so interesting. If we try to use the same approach we did before, we can't, because we can't use that `anti_join()` there. So instead, we will take this kind of approach.

If we take out the stop words in the way I just showed, then we get some much more useful results: things like "señor don", things like "dijo Sancho". We start to have bigrams that don't have stop words in them. And what can we do with n-grams? Everything we talked about already, you can do with n-grams: tf-idf of n-grams, weighted log odds of n-grams. You can also take approaches like network analysis, because a set of bigrams, it turns out, is a network: it has nodes and it has connections. So you can treat a set of bigrams as a network analysis.

You can also look for the effects of negation. This came up more in the past, when people were trying to do sentiment analysis. Compare the sentence "this is very good" and the sentence "this is not good": they both contain the word "good", which has a very high sentiment score. So can we use bigrams to understand the effect of negation there?

So here's an example of an analysis I did that uses bigrams. My favorite writer is Jane Austen; I love Jane Austen. This analysis took all the works of Jane Austen and found all the bigrams that were "he ___" or "she ___": what are the things that come after "he", and what are the things that come after "she"? And then it looked at the log odds to say what is more likely to come after "he" and what is more likely to come after "she".

So let's look here. We see "she remembered", "she read", "she felt", "she resolved". But "he stopped", "he takes", "he replied", "he comes". If you notice, the "she" words are all about the internal life, the feelings, the thinking, and the "he" words are all about what we observe him doing. I love this because it's a fairly simple analysis, but it gives us this insight into Jane Austen's work: that it is about the lives and thoughts and feelings of women, and with the men in it, you don't really know what they think or feel; you only know what they do.
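The extraction step of that analysis can be sketched in a few lines (Python here, and the sample sentence is invented; the original analysis was done in R on the full Austen corpus): collect whatever word follows "he" or "she" in each bigram.

```python
from collections import Counter

def words_after(tokens, pronouns=("he", "she")):
    """Collect the word that follows each pronoun, as in the
    'he ___' / 'she ___' bigram analysis described above."""
    following = {p: Counter() for p in pronouns}
    for first, second in zip(tokens, tokens[1:]):
        if first in following:
            following[first][second] += 1
    return following

tokens = "she remembered that he stopped and she felt that he replied".split()
out = words_after(tokens)
print(out["she"].most_common())  # [('remembered', 1), ('felt', 1)]
print(out["he"].most_common())   # [('stopped', 1), ('replied', 1)]
```

From there, the counts for the two pronouns can be compared with the weighted log odds from earlier.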

So these kinds of n-gram analyses can be very powerful, even with straightforward choices. I did the same thing with a dataset of movie scripts, looking at which words went more with "she" and which with "he". With "she", you see words like "she snuggles", "she laughs"; with "he", it's like "he shoots". So we can see what culture tells us about men and women with these kinds of text analyses.

This is the weighted log odds for Jane Austen again, but with bigrams. So the things we were talking about before are not only for single words; they can also be used for more complex structures.

Network analysis of bigrams

Now, I have in your code a little bit on network analysis. We won't go through it, but you'll be able to do this yourself; I'll just show you this example. Before my job now, I was a data scientist at Stack Overflow, the website where you go to find answers to coding questions. And we did some surveys asking: if you could change something about Stack Overflow, what would you change? This is what people said to us in the survey. We had tens of thousands of responses, so this was not survey data where we could read every answer. But we actually got a really good understanding of what people said with this kind of analysis.

People talked about questions and answers: that it's hard to find answers, that the answers can be out of date or off topic, that questions are being closed as off topic, which makes me feel sad. At the time, people wanted a dark mode, a dark theme for the website. We see voting and reputation. And over there, some people said: what I want you to change is the toxic community at Stack Overflow. So this is an effective tool for dealing with corpora of text that are large.

Now, I invite you later, if you're interested, to walk through this. But this is what the code looks like to do the same kind of thing for Don Quixote. We make a graph here; this is what the graph looks like. And then we can make a plot (I'm just going to blow through this), and this is what we get. So this is telling us how the words in Don Quixote are connected to each other. We see "don quijote", that's so strong together. But then we see other people, other sets of words, that help us understand what's going on.
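The network itself is just counted bigrams treated as weighted edges; a minimal Python sketch of that structure (the token list is made up, and in the R workflow this edge list is what gets handed to the graphing packages):

```python
from collections import Counter

def bigram_edges(tokens, min_count=1):
    """Count bigrams and treat each distinct pair of words as a
    weighted edge in a word network; higher counts mean stronger
    connections, like 'don'-'quijote' in the plot described above."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {edge: n for edge, n in counts.items() if n >= min_count}

tokens = ["don", "quijote", "de", "la", "mancha", "don", "quijote"]
edges = bigram_edges(tokens)
print(edges[("don", "quijote")])  # 2: the strongest connection here
```

A network layout then draws each word as a node and each bigram count as an edge weight.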

Okay. It's past 5. Let's take the next four minutes, till about ten after: just stand up, turn around, talk to your neighbor. And then in about four minutes, we'll go on to modeling. The first section was the introduction, learning about text as tidy data. Next, we're going to talk about topic modeling for maybe 30 minutes, and then in the last maybe 30 minutes, word embeddings.

You can close the EDA file and open the topic modeling file. This topic modeling file has what you need for this next part.

Any questions about where we are or opening the files? Okay. Great.

So if you are working locally, these are the packages that you need to install for this section. But if you are on Posit Cloud, they are all already installed there for you here.

Moving between data formats

So let's look at this diagram that tells us a little about how we move back and forth between the different formats. Picture where we were: where it says "text data", that's the raw text data we start with, say from Project Gutenberg.

When we download, say, the text of Don Quixote, it's right there at "text data". If we transform it with tidy data principles to a tidy text data set, that's where it says "tidy text". If we do things like count to find the most common words, that is the summarized text. So we have been moving back and forth along that main line: we have raw text, we reshape it, we summarize it, and then we can make visualizations.

In the next section, we're going to be talking about going down to the model, going down to make a model. And it turns out that by the time you want to do model estimation, a tidy format is not ideal, because you usually need a matrix to do linear algebra. Basically any model, underneath, is linear algebra of some kind or another. So we need a matrix.

So I have been telling you that tidy data is so important, that you should have all your data in a tidy format. But it is true that when it's time to do some algorithm, some math, we have to get it into matrix form. So we go down here to make what's called a document-term matrix, and then we can train models on that. We'll talk about that process just a little bit.
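A minimal Python sketch of what that casting step does: reshape a tidy (document, term, count) table into a document-term matrix, filling the absent combinations with zeros (the titles and counts here are invented):

```python
def cast_dtm(tidy_counts):
    """Reshape tidy (document, term, count) rows into a dense
    document-term matrix; missing (document, term) pairs become 0."""
    docs = sorted({d for d, _, _ in tidy_counts})
    terms = sorted({t for _, t, _ in tidy_counts})
    lookup = {(d, t): n for d, t, n in tidy_counts}
    matrix = [[lookup.get((d, t), 0) for t in terms] for d in docs]
    return docs, terms, matrix

tidy = [("emma", "miss", 599), ("oz", "scarecrow", 214), ("worlds", "martians", 163)]
docs, terms, matrix = cast_dtm(tidy)
print(docs)    # ['emma', 'oz', 'worlds']
print(terms)   # ['martians', 'miss', 'scarecrow']
print(matrix)  # [[0, 599, 0], [0, 0, 214], [163, 0, 0]]
```

Even in this tiny example, most entries are zero, which is why the sparse representation discussed below pays off.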

Introduction to topic modeling

Now we're going to start with a kind of model called topic modeling. Topic modeling tries to find the topics: what are the topics in the documents that we have? What are they about?

The way the most common kinds of topic models work is that each document is a mixture of topics, and each topic is a mixture of words. Say we're thinking about survey responses from student evaluations at a university. Some of the topics could be the tests, the homework, the lectures; each survey response can be about all of those things in some mixture. And then each topic uses a variety of words, and you can have the same words in different topics.

Topic modeling is an example of unsupervised machine learning. We don't have a label on each document, and we're not trying to predict yes or no; we don't have any labels at all. We're doing unsupervised machine learning here. That's the kind of model that topic modeling is.

Here is an example from my Stack Overflow days, a real example of topic modeling. We're trying to say, okay, for all of the questions on Stack Overflow, what are they about? And it's an interesting result, because you end up with, like, a front-end web development topic, a Java topic, a SQL topic, an Android topic, and you can see how they're related to each other.

The four books example

So we're going to go through a slightly smaller example. Let's imagine that you have a library with books in it. We're going to use an English example, but nothing in the approach depends on the language being English.

Let's say you have these four books in your library: Emma, from my favorite, Jane Austen; The War of the Worlds; The Wizard of Oz; and Wuthering Heights. And let's say someone breaks into your library, takes your books, rips them apart, and throws them all in a mixed-up pile on the floor. So your four books are torn apart and in a pile on the floor.

And that is this part here: we take the books, divide them into chunks, and put them in a pile on the floor. Topic modeling asks the question: can we pick up one of the chunks, look at it, and tell which book it is from? That's the idea of this topic model.

But topic modeling can be applied in many different ways: what are the survey responses about? Which topics are becoming more important over time in company reports or in the media? There are many different ways of using topic modeling, but this example is nice because we can walk through it and understand together what happened.

Building and training the model

So here's what we have. For the text, we have a document column here, so these texts are labeled: each one of the documents is a chunk of a book, as if someone ripped your book apart and you have it there. Now what we want to see is: can we put our books back together? Can we learn from the text which chunks go together? We don't tell the model which chunks belong together; we're asking whether the model can learn which ones go together.

So first we will make a tidy data set. We use tokenization, and we can remove some stop words. It turns out removing stop words does often help with topic modeling, because it brings the multidimensional space down and makes training faster. And then we're counting again. Always counting.

So we have what's called a document-term matrix. Here it's in a long, narrow form that gives us the counts of words per document, and then we can use a cast function to cast it to a different shape: we have a long, narrow rectangle, and we cast it to a matrix. Here we're casting it to a sparse matrix, because it turns out that most of the documents don't have most of the words, so it's full of zeros. It's more efficient to store it in a sparse matrix representation.

So now this is no longer tidy data. Now it is a matrix, which is great for doing math on, for doing linear algebra. It's not a tidy data set anymore; it is a non-tidy data set, which is a good fit for linear algebra.

My favorite implementation of topic modeling is called stm, which stands for structural topic model. It's very nice, very fast, very good to use; it's my favorite. And what we pass in is that sparse matrix.

If you are on Posit Cloud, don't run this; I think in there it says don't run it, because unfortunately you will run out of memory in the free tier. But I saved the output of a training run that you can just load in. So you should just run the line that loads it, because it turns out you need a lot of memory to train the model, but once the model is done, it's actually quite small. The model is saved in your workspace and you can open it, but don't train it yourself; it will crash. So don't do that.

This is a pretty robust model that is difficult to mess up, but we do have to say what K is. This is like k-means, where you don't know K ahead of time. In this pretend, fun example, we know that four books were ripped up and thrown on the floor, so we know that K is four. In a real problem, you don't know what K is. There are approaches and methods for deciding on the right K: usually you have some heuristic based on how big your dataset is (do you have 100, 1,000, or 10,000 texts?) that gives you a reasonable range, and then within that range you can train all of them, look at the models, and use their characteristics to decide the right value for K. But here we just say K is four, because we know there are four books that were broken apart, and we get a result.

Exploring model results

We get a result here. It does not take a long time to train this kind of model; it's pretty fast and efficient. So we had four books that were ripped apart, and we trained four topics. This summary gives us some high-level details, some results. If you're familiar with The War of the Worlds and Wuthering Heights and The Wizard of Oz, you probably start to see: oh, I think we're getting somewhere. I think we're able to put these books back together.

But we can dig deeper by using tidy data tools. There is a verb in this ecosystem called `tidy()`, and what it usually does is take something that is in some sort of awkward shape and give it back to you as a data frame, back to one observation per row. Notice that we started with raw text and converted it to a tidy data structure so we could use the rules of relational algebra to analyze it. Then we cast to a matrix so we could do math. And now that we have the model object, how can I explore it in detail and know what it is? I can tidy my output again.

Remember that topic modeling says each document is a mixture of topics and each topic is a mixture of words, so there are two matrices that come out. This one is the beta matrix, the topic-word probabilities: what is the probability that a certain word comes from a certain topic? There's another matrix too; I'll come to it in a moment.

The other matrix is about the document-topic probabilities. So we get two sets of probabilities, because this is a two-level mixture model. The beta matrix tells us the probability that a given topic generates each word. The first word we see is "green": notice it's like medium, medium, medium, tiny. So "green" has a medium probability for topics one, two, and three, and essentially zero in topic four. The word "chapter" has small-to-medium probability across the topics, and "mr" is tiny in some topics and bigger in others. These are the probabilities that a given word comes from a given topic.
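To make the meaning of beta concrete, here is a toy Python sketch. The counts are invented, and a real topic model estimates these probabilities during training rather than computing them from known counts; this only shows what the beta matrix is.

```python
def beta_matrix(topic_word_counts):
    """Turn per-topic word counts into per-topic word probabilities:
    for each topic, the probabilities over its vocabulary sum to 1.
    This is the shape of the 'beta' output a topic model reports."""
    beta = {}
    for topic, counts in topic_word_counts.items():
        total = sum(counts.values())
        beta[topic] = {w: n / total for w, n in counts.items()}
    return beta

counts = {
    1: {"martians": 8, "green": 2},   # a War of the Worlds-like topic
    2: {"green": 5, "emerald": 5},    # a Wizard of Oz-like topic
}
beta = beta_matrix(counts)
print(beta[1]["martians"])  # 0.8
print(beta[2]["green"])     # 0.5: the same word can appear in two topics
```

Notice that "green" appears in both topics with different probabilities, exactly the "same words in different topics" behavior described earlier.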

So we can now explore this in detail. The next bit of code you have rearranges again. And, oh no, it's already 5:25; okay, I'm sorry, I'm just going to keep going.

So we can rearrange to this, and then we have results. These are the highest-probability words from each topic, all arranged. At the top is topic one, so these are the highest-probability words from topic one: martians, people, black, time. So this is The War of the Worlds; topic one is the words that come from The War of the Worlds.

And since we're using tidy data again, we can do any kind of table, any kind of summarization, any kind of visualization that we need. Since we know what the books were, we can start to understand which of the topics are related to which of the books. We've got The War of the Worlds here with the Martians, Wuthering Heights over there, The Wizard of Oz down here, and then Emma over there. These are the highest-probability words, and it looks pretty good; it looks like we're going to be able to put our books back together.

These are the most probable words, but we can also identify important words with other statistics. FREX is about whether a word is both high frequency and exclusive: is it common and also exclusive to a topic? Lift is about weighting by word frequency again. So there are different statistics; if you would like to learn more about FREX and high-lift words, I have a YouTube tutorial on it. But we'll keep going, because the other matrix is the one between the documents and the topics. This is typically called the gamma matrix. And I'm saying matrix, but it's actually tidy data here.

So picture in your mind: I can have a long, skinny thing, or I can reshape it into a matrix, depending on which is more useful for the step you want to do. Here, for any topic, we say: what is the probability? Notice that we have kind of medium, low, medium, low. For this chunk, this broken-up piece, what is the probability it comes from each topic? These are all of the chunks that we broke apart. Remember, when we trained this model, it did not know which documents came from which books; it did not have that information. But since we stored that, we can get it back out and make a visualization that looks like this.

On the y-axis are the gamma probabilities, and on the x-axis is which topic it is. Now we can ask: how good a job did we do at learning which things go together, at being able to pick up a chunk of book from the floor and say which book it came from? Notice Emma looks very good; it's all topic four. The Wizard of Oz is topic three. But these are box plots, and we can see some evidence that Wuthering Heights and The Wizard of Oz may be a little harder to tell apart than the others.

So we're able to learn which documents are closest, which documents are a little further apart, and which are very far apart. The War of the Worlds, it looks like, was very easy to distinguish. It was probably the Martians, you know, why it was so easy. So we can learn this from the output of this kind of model.
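Reading the gamma matrix amounts to asking, per chunk, which topic has the highest probability; a toy Python sketch with invented chunk names and gamma values:

```python
def assign_topics(gamma):
    """For each document chunk, pick the topic with the highest gamma
    (document-topic probability): 'which book did this chunk come from?'"""
    return {doc: max(probs, key=probs.get) for doc, probs in gamma.items()}

# invented gamma values for three chunks of the ripped-apart books
gamma = {
    "emma_chunk_1":   {1: 0.02, 2: 0.01, 3: 0.03, 4: 0.94},
    "worlds_chunk_7": {1: 0.91, 2: 0.04, 3: 0.03, 4: 0.02},
    "oz_chunk_3":     {1: 0.05, 2: 0.20, 3: 0.70, 4: 0.05},
}
out = assign_topics(gamma)
print(out)  # {'emma_chunk_1': 4, 'worlds_chunk_7': 1, 'oz_chunk_3': 3}
```

Chunks whose probability is concentrated on one topic (like the Emma chunk here) are the easy cases; a chunk splitting its probability across topics is what shows up as overlap in the box plots.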

Choosing K and further uses

We can continue to use topic models in more sophisticated ways. Let me say just briefly: how do we decide what K is? In the example we just used, I said four because I know there are four books. But in real examples, you have to choose K using measures from the model itself. This is like hyperparameter tuning, where you try a bunch of different values and see which one has the best characteristics. You can look at things like semantic coherence, which is about whether a topic's high-probability words tend to be used together, and exclusivity. And there are many more details on how, in a real situation, you go about choosing K for the model that you want to build.

If you're a Taylor Swift fan, I have a fun example of topic modeling on the lyrics of Taylor Swift, where you can see which albums are close together and which are far apart. You can kind of see in the screenshot that Folklore and Evermore are close together; their lyrics look like they are drawn from the same kind of topic. Reputation is really different from everything else. And the early albums also look close together. This is about the lyrics, not the music: the words and how they go together. So this can help you understand another way to use topic modeling.

All right, it's just 5:31. I want to highlight this again: we get raw text data, and we transform it to a tidy format so we can do exploratory data analysis and understand what is in our text. Then we transform it again, because when we build models we have to have a matrix to do linear algebra. Once we have the model, we can tidy again, getting back to the nice rectangular data frame that helps us make plots and explore the results of our model.

Word embeddings

So let's jump into the last piece, and after that we'll have some time for questions and talking about this more. If you're working along, on this website we're going to open this last set of slides: word embeddings. And if you are in Posit Cloud, you want to close the topic modeling file and open the embeddings file.

Okay, last 30 minutes. Go. If you're working locally, here are the packages we'll be using: the same two from before, tidyverse and tidytext, and then a new one called wordsalad.

The data set we're going to use for this last example is a data set of cheeses. Everyone loves cheese, right? It's time for some cheese. In this data set we have the name of the cheese, what kind of milk it is, the country it's from. But we also have some very interesting data on the texture, the rind, the color, and the flavor. And the aroma also looks kind of interesting to me. "Barnyardy." I love it.

So these are text data, but it's not natural language like Don Quixote; it is a limited vocabulary. That makes it good for this example, because we can do it very quickly here together. This is a little bit of a toy example, but it can help you understand embeddings. We cannot use every word in the world when we describe the cheeses: there's some small set of words, and also they aren't sentences. They're just adjectives.

So think of it as: we don't take all of natural language, just a subset, just a little bit. Here we're going to look at the flavor. Here are five example flavors; this is what our data looks like. "Creamy, sharp, strong": that's one flavor description. "Buttery, milky, sweet" is another. "Buttery, tangy, mild, nutty, sweet," and then "acidic." These are the kinds of words in the descriptions of the flavors of the cheeses.

If you are working along in Posit Cloud, you can download the data and then you have it there to do this yourself. When we build models, as I said in the last section, we need a matrix to be able to do linear algebra, to build some kind of model. The matrix that you make is typically very sparse: it has very high dimensionality and can have a huge number of features. With natural language, most documents do not use most words, so the data is very sparse. This cheese data is about 95% sparse: 95% of the entries are zero.

If you have a matrix where the rows are the cheeses and the columns are the flavors, the words used to describe the flavors, it's 95% just zeros. And that's actually not so sparse; if you had natural language, it would be more like 99.99% sparse. Very sparse, very high dimensionality. Notice that I have a thousand documents, so that's a thousand cheeses, and 46 features. That's maybe medium dimensionality, but it's a lot of columns; 46 columns is a lot of columns. For real natural language, if you have 10,000 documents, you could have 10,000 or more columns. High dimensionality is a big problem for training models with language.
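The sparsity figure is just the fraction of zero entries; a quick Python sketch with a made-up cheese-by-flavor indicator matrix:

```python
def sparsity(matrix):
    """Fraction of entries that are zero: the 'how sparse is it'
    number quoted for the cheese data (~95%)."""
    cells = [x for row in matrix for x in row]
    return sum(1 for x in cells if x == 0) / len(cells)

# a toy cheese-by-flavor indicator matrix (1 = that flavor word is used)
m = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
print(sparsity(m))  # 0.75: even this toy matrix is mostly zeros
```

With a real vocabulary of tens of thousands of words, almost every cell in almost every row is zero, which is what pushes natural language matrices toward 99.99% sparsity.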

And this is where word embeddings come in, because if we think about it, words are not used together randomly. If you come from a statistics background: they are not independent of each other. In fact, they are used together in specific ways. When I say a sentence to you, I don't just choose words randomly and put them in a sentence; I use rules about how I speak.

And this quote is from someone named John Rupert Firth, a linguist who back in the 1950s talked about word embeddings as an abstract idea. They didn't really have them to use yet, but he had the idea that we can make an embedding. So think: embedding equals vector. Embedding equals vector. I want to make a column of numbers that will represent the text, because when I'm talking to you, I use words together in a way that follows rules about how the words are used, and we want to try to learn those rules.

You have probably heard of some of these things, because they are all now part of our lives. Maybe 10 years ago, word embeddings meant things like Word2Vec, GloVe, fastText. These are algorithms for taking a set of texts and learning a matrix representation of how the words go together. If you are a statistician or have a statistics background, think of it like PCA, but fancy: I have a high-dimensional space, and I need to learn a lower-dimensional space. These are all rules for using linear algebra to go from a super high-dimensional space to a lower-dimensional one, where I can then represent words and documents.


So now there are companies out there who have taken enormous data sets of text and are learning these embedding representations. All LLMs are built on the infrastructure of embeddings. You can go to OpenAI and get an API token, send text to OpenAI, and they will send back the embeddings from their model. What you get back is a long string of numbers: a way to represent a text or a document numerically, based on the rules of how words are used together, because it's not just random which words go together.

Now, I highly recommend that you read this if you are interested in embeddings. It's a little long, but it is excellent as a way to understand what embeddings are. All the resources and slides are available for you; you can go to the URL, and I'll be here afterwards. So I do recommend that you go and read it to understand more about embeddings and how they're used. But today, we're going to make some cheese embeddings.

Okay, so time for some cheesy embeddings. We take that cheeses data set and the flavor column, remember it was like milky, sweet, creamy, and we remove the commas. Then we are going to use the GloVe algorithm to learn word embeddings. GloVe is available in the wordsalad package. So what we're doing here is saying: I have all these cheeses, and I have their flavors described in words. Let me use GloVe to learn how those words are used together.

So think of GloVe here as like PCA. In analogy, not in detail. We have a high-dimensional space, and I'm going to learn, I think I did 10 dimensions. I have, I think it was about 50 dimensions of flavor words, and I want to learn a lower, 10-dimensional space. The analogy is that OpenAI took the whole internet and for months and months chugged away on computers to learn something like a 1,000-dimensional representation. So what we're doing here is a small example compared to the embeddings that are used in all of these LLMs.

And the tokens that we have are things like vegetal, tart, meaty, floral. These are the tokens, and these are the embeddings, the vectors.

And so now for every cheese, we have all these words in the lower-dimensional space, and we can find a mean for each cheese. So these are cheese embeddings. This might be the first time anyone has ever made cheese embeddings. No, probably not.

So what we're doing is: we use unnest_tokens, we're back to tidy data, and we join in the flavor embeddings. Then for every cheese that I have, keeping around the country and the type and all that kind of thing, I take the mean. If you have a document made of words, there are different ways of getting a document embedding from a set of word embeddings. The mean is one of them; you can also use other ways of weighting.

So these are cheese embeddings. For every cheese, where is it in this 10-dimensional space? I had a high-dimensional space of all the flavor words, I made a lower-dimensional space, and then I found where in that lower-dimensional space all the cheeses are.
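The mean-pooling step described above can be sketched like this. The 3-dimensional word vectors are invented for illustration; the real GloVe output in the talk is 10-dimensional:

```python
# Made-up word embeddings (real ones come from an algorithm like GloVe).
word_vectors = {
    "buttery": [0.9, 0.1, 0.0],
    "nutty":   [0.7, 0.3, 0.1],
    "sweet":   [0.8, 0.0, 0.2],
}

def document_embedding(tokens, vectors):
    """Average the embeddings of the tokens we have vectors for."""
    known = [vectors[t] for t in tokens if t in vectors]
    dims = len(next(iter(vectors.values())))
    return [sum(v[d] for v in known) / len(known) for d in range(dims)]

# A "buttery, nutty" cheese gets the mean of those two word vectors.
manchego = document_embedding(["buttery", "nutty"], word_vectors)
print([round(x, 2) for x in manchego])  # → [0.8, 0.2, 0.05]
```

Averaging is the simplest pooling choice; weighted schemes (for example by tf-idf) are also common.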

Finding cheese similarities with embeddings

And now I can find cheese similarities. This is just cosine similarity. I have all these cheeses in this 10-dimensional space, and I compute cosine similarity for every pair of cheeses, to say which cheeses are close together in the words that are used to describe them, and which cheeses are far apart.
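Cosine similarity itself is a short computation: the dot product of two vectors divided by the product of their norms. A minimal sketch, with invented cheese vectors:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented cheese embeddings for illustration.
buttery_nutty = [0.8, 0.2, 0.1]
buttery_milky = [0.9, 0.1, 0.2]   # a similar flavor profile
floral_meaty  = [0.1, 0.9, 0.8]   # a very different flavor profile

print(cosine_similarity(buttery_nutty, buttery_milky))  # close to 1
print(cosine_similarity(buttery_nutty, floral_meaty))   # much lower
```

Identical descriptions give a similarity of exactly 1, which is what shows up in the Manchego results below.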

Think about the analogy here: you can use embeddings in many ways. The LLMs use them to make sentences. You ask ChatGPT something, and when it's time to make an answer, it asks: what is the most likely next word? It uses embeddings like this to keep choosing next word after next word, because we can find which things are close together.

But we can use embeddings in many ways, and one of them is finding things that are similar to each other. This matrix contains the similarity scores for each cheese compared to each other cheese. So let's say, I'm enjoying my trip to Spain very much, I'm interested in one particular cheese while I'm here: Manchego. Manchego is here. It's from Spain, and the flavor in this dataset for Manchego is buttery and nutty. That sounds about right to me. I think that sounds like Manchego.

So let's say I want to know about Manchego: what are the most similar cheeses to Manchego? Maybe I go home and have to find a different cheese, or maybe I travel to a different country and want to find an equivalent. These are the ones in this dataset that are most similar. The top couple have a similarity of 1, so they are exactly like Manchego in the flavor.

They are exactly like Manchego in the words for the flavor. Notice these cheeses have exactly the same description, buttery, nutty; that's what a similarity of one means. Then we start having some where the numbers are very close to one, but not exactly one. Notice the descriptions can be longer or shorter. What the embedding has learned is which words are used together, which words are used in a similar way. We have buttery in all of them, but not nutty in all of them. Buttery, nutty is very similar to buttery, caramel, fruity, full flavored, nutty. What the word embedding can learn is that those are words that go together: in the high-dimensional space of cheese descriptions, saying something is buttery and nutty is similar to saying it is buttery, milky, and sweet. This is how word embeddings work.


Now maybe I want to know the least similar cheese to Manchego: what's far away from Manchego in the high-dimensional cheese space? It's these. I've never heard of any of these cheeses, but we can look at what they're called. That first one sounds interesting: floral and meaty. I'm a little scared of that cheese. I would try it, but I'm a little scared. Notice that the next ones both say they're full flavored. What this tells us is that a cheese that is buttery and nutty is far away from a cheese you would call full flavored; a garlicky, meaty cheese is far away from a buttery, nutty cheese. This is a small example with just cheese flavors, but this is how embeddings work: when you have enormous data sets of language, you can learn in the high-dimensional space which things are similar, and what the relationships are in the way language is used.

If you are working along yourself, you can put in other words, find anything else you might be interested in, and see, for example, what cheese is medium: not close, not far, but kind of in the middle. What is that? In the last part, I'll take maybe five minutes to wrap up, and then we can have some time for questions.

Real-world embeddings and fairness

If you'd like to see another use for embeddings in the real world, this is on YouTube: it's horror movie descriptions, and it uses the OpenAI embeddings. So instead of learning them myself, the way we did here together, it goes to OpenAI, gets the embeddings from the API, and then uses them to find which horror movies are similar and which are far apart. That's another nice example. Now I'm just going to take about five minutes and talk a little bit about fairness and word embeddings.

The embeddings work the way we've been saying: they're trained from a corpus of data, and human prejudice or bias actually becomes baked into those numbers. We think of numbers as neutral, but since we learn these numbers from large data sets of human-generated text, whatever is going on in society becomes baked into these sets of word embeddings. You can click on the link at the bottom and read more about this, but over and over and over again, when people use word embeddings, these kinds of issues crop up.

Where I live in the U.S., you find that first names that are more common for African Americans are closer in the high-dimensional space to unpleasant feelings than European American first names. Women's first names are associated with family, while men's first names are associated with career. Words associated with women, like mother, niece, aunt, are more associated with the arts, whereas terms associated with men are more associated with science. Over and over and over, we see these kinds of examples. It is so ingrained in this technology that embeddings can actually be used to quantify change in social attitudes over time. This is like taking a bug and making it into a feature: people have done analyses of historical text and, by looking at the embeddings, actually measured how beliefs about, say, women have changed over time. It is so much a part of this technology that you can use it to measure bias.


So can embeddings be de-biased? What is happening? What, for example, has OpenAI done or not done? What can we think about when we think about embeddings? There are methods you can try. This is all matrices and linear algebra, right? So people have taken approaches where they identify some bias, say an axis between women and men, and reproject the embeddings along that axis so that you force away any difference. You find the math-science or arts association with men and women, and you reproject the matrix.
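The reprojection idea can be sketched as removing the component of each word vector that lies along an assumed bias direction. The 2-D vectors and the "gender direction" below are invented for illustration; real debiasing methods work in much higher dimensions and estimate the direction from data:

```python
def project_out(vector, direction):
    """Subtract the component of `vector` that lies along `direction`."""
    norm_sq = sum(d * d for d in direction)
    scale = sum(v * d for v, d in zip(vector, direction)) / norm_sq
    return [v - scale * d for v, d in zip(vector, direction)]

# Pretend the first axis encodes a gender direction.
gender_direction = [1.0, 0.0]
science = [0.6, 0.8]

debiased = project_out(science, gender_direction)
print(debiased)  # → [0.0, 0.8]
```

After projection, "science" is orthogonal to the assumed gender axis; as the talk goes on to say, research suggests this mostly hides the bias rather than removing it.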

So people have tried that. Another approach that has been tried is adding to your training data: you find examples in your training data that say, for instance, "my father was a scientist", and you add a new one that says "my mother was a scientist". So you augment your training data. And other people suggest that messing with the training data isn't enough, that none of this works, and you have to make the correction at the point where a decision is made. Take the example of ChatGPT: when they decide what to show the user, someone types in something like "can a woman be a scientist?", they look at the results, and they program in a rule: don't ever say a woman can't be a scientist. They make the correction at the end.

There's very interesting research showing that for anything you try, debiasing still allows stereotypes to creep back in, and the approach that seems best so far is augmenting with counterfactuals. The first one, reprojection, basically doesn't work at all, because once you reproject, you end up making things worse for another set of words that you happened not to think to put in. We don't know what OpenAI does, because they don't tell us, but almost certainly the Googles and the OpenAIs are making the correction on the output, with if-then rules about what comes out. But it seems like, if you can afford to make your training data even bigger, that might be the approach with the best results in the long term.

This is small, but I do recommend this paper. It is from 2019, so before ChatGPT, but it is wonderful: "Lipstick on a Pig: Debiasing Methods Cover Up Systematic Gender Biases in Word Embeddings But Do Not Remove Them". I love it. Trying to debias the word embeddings is like putting lipstick on a pig: you just make it worse for other situations.

Okay. So with that, I will say thank you so much. I think we have a little bit of time for questions. It has been such a pleasure to be here with you. I will put the URL up again; if you want to revisit the slides, the code, any example, it's all there for you to have. The Posit Cloud instance will come down probably tomorrow or so, and if you're going to use Posit Cloud again, you will probably need to delete this one before you can use a different one. So I think that's all the logistics. I want to say thank you so much for having me, and I would love to answer a few questions if you have them.

Q&A

Hello. Thank you very much for your nice presentation. Congratulations. Maybe I'm going to say something silly, forgive me, because I knew practically nothing about text mining. I had heard about most frequent words, the log of the idf, but not much, and I discovered a new word today with the embeddings. While you were speaking, I was wondering how to bring statistical measures into this world of text, and I want to mention two possibilities, just in case, just for me to know if they could be possible. For example, when you say there are some associations in the embeddings, women with family, men with science: how do you measure that association? Because maybe we could use measures from data analysis, the Jaccard coefficient, Yule's Q, et cetera, from contingency tables, to put a number on the strength of that association.

I don't know. Of course, I guess it has been done, but I am a complete illiterate here.

No, no, no. So if you want to just look at the embeddings themselves, you can find those kinds of bias, in a technical sense but also in the colloquial sense of "these embeddings are biased", with distance measurements, say cosine similarity or some other way of measuring distance. You can see that in the high-dimensional space, words like mother, aunt, all kinds of feminine words, are closer, say, to family words, while words that are specific to men, say father, uncle, are closer to career words. These are some of the classic examples. King and queen is another classic one: king minus man plus woman equals queen. That one is good, but the problem is that doctor minus man plus woman equals nurse, which is terrible, and you find that result. So you can do these kinds of math with the words, because they're represented in this high-dimensional space.
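The king-queen arithmetic mentioned here can be sketched with vector addition and a nearest-neighbor lookup. The tiny 2-D vectors below are constructed so the analogy holds by design; real embeddings learn this structure from data:

```python
# Invented 2-D vectors: axis 0 roughly "gender", axis 1 roughly "royalty".
vectors = {
    "king":  [0.9, 0.9],
    "man":   [0.9, 0.1],
    "woman": [0.1, 0.1],
    "queen": [0.1, 0.9],
}

def analogy(a, b, c):
    """Compute a - b + c, element-wise."""
    return [x - y + z for x, y, z in zip(a, b, c)]

def nearest(target, vectors, exclude):
    """Word whose vector is closest (squared Euclidean) to target."""
    def dist(v):
        return sum((x - y) ** 2 for x, y in zip(target, v))
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return min(candidates, key=lambda w: dist(candidates[w]))

target = analogy(vectors["king"], vectors["man"], vectors["woman"])
print(nearest(target, vectors, exclude={"king", "man", "woman"}))  # → queen
```

Excluding the query words before the lookup is standard practice, since the nearest vector to the result is often one of the inputs.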

So how else can we quantify this kind of bias? We see over and over again that when we use embeddings in a system, we end up with results that are biased in the colloquial sense: the system is not being fair, it is behaving unfairly. Embeddings like this are behind much automatic translation. If you go to Google Translate, embeddings like this are what are used to map from one language to another. I'm sure you've seen examples of translating from a language that does not have gendered pronouns to one that does, and back again, and you end up in a loop. The doctor-nurse one has happened so many times: you construct something like "she is a doctor", translate it to a language with no gender, like "that person is a doctor", translate it back, and it comes out "he is a doctor". So that's more qualitative. We see it both ways: you can measure it using the math of the embeddings, and we see it in the way the systems behave once you use them.

Okay. The second question is the following one. You have a high-dimensional space, but maybe using multidimensional scaling, or whatever dimensionality reduction technique, you can reduce the space to a few dimensions related to the cheese. If you can represent the text in that lower-dimensional space, my question is: why not use distances to translate our cheese space into a physical space, where we could use all the instruments of classical statistics? Some new possibilities would open up.

Yeah. Okay. So the algorithms for finding word embeddings are more efficient for text than, say, generic dimensionality reduction. I said it's like PCA, in analogy, but PCA on words does not work nearly as well as these text-specific algorithms. A simple example that's a little closer: you look at pairwise information for sliding windows along a text, and then you can do dimensionality reduction on that. The reason people developed Word2Vec, GloVe, fastText, and now these much more sophisticated approaches, rather than use the straightforward dimensionality reduction from more classic areas of statistics, is how sparse text data is, and also that language has rules underneath it. So these algorithms are a little different, because they are built to take advantage of the rules inherent in language to get the representation. Once you have the representation, you can do anything that you might do with another matrix.

So you can use them as features for training a supervised model. Say you have a bunch of documents, say a bunch of cheeses, and then a thumbs up or thumbs down on each cheese. We can make a representation with embeddings, take the flavor representation, and use it to train, say, a logistic regression model. Once you have the representation, you have many, many options of things you can do. But traditional dimensionality reduction works more poorly on text than these methods, because they're built specifically for this application. Yeah. Thank you.
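The embeddings-as-features idea can be sketched with a tiny logistic regression trained by gradient descent. The 2-D "cheese embeddings" and the thumbs-up labels are invented for illustration; in practice you would use a library and the real 10-dimensional embeddings:

```python
import math

# (cheese embedding, liked?) pairs: pretend axis 0 ~ buttery, axis 1 ~ pungent.
data = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.1, 0.9], 0), ([0.2, 0.8], 0)]

weights, bias, lr = [0.0, 0.0], 0.0, 0.5

def predict(x):
    """Sigmoid of the linear combination: probability of a thumbs up."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Plain stochastic gradient descent on the log loss.
for _ in range(200):
    for x, y in data:
        err = predict(x) - y
        weights = [w - lr * err * xi for w, xi in zip(weights, x)]
        bias -= lr * err

print(predict([0.85, 0.15]) > 0.5)  # a buttery cheese → True
```

The point is only that once text is a matrix of numbers, any standard supervised model can be trained on top of it.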

Thank you very much, Julia, for this interesting presentation. I would like to ask you a small question. In the presentation, you gave us some examples, about soda or cheese, examples that are more or less real, and they are good. So my question is: can you give us some examples, some real case studies, where you have applied text mining in the past, or will in the future?

Yeah. So right now I work on tools for data scientists to use, but in my last job before this I was a data scientist, and a big one was survey results. I would do topic modeling of survey results. What are the most common topics? What topics are more common from users who are very experienced versus newcomers to Stack Overflow? For our customers, we'd have survey results; maybe some customers said, we quit, we're not your customer anymore. Let's go to all their survey responses: what did they talk about, what was most important to them? Of course, you want to balance that with really having a human read those kinds of results, but you can use the tools of text analysis to look for differences. People who kept giving us more money said X, Y, Z was important; people who said we quit, we're not your customer anymore, said A, B, C, and we need to understand that. So survey results, a huge one.

I also worked, in the same Stack Overflow example, with the text that people post on Stack Overflow. We had a model to try to detect toxic speech: if someone is about to type in something nasty, to pop up and say, are you sure you want to say that? It was interesting, though, because that was before the era of LLMs; I've been at my company now for four to five years. And actually, if any of you feel like you've had a bad experience on Stack Overflow, it was probably not that someone cussed you out or used very rude language. It's probably that they talked like they were smarter than you, and that's actually pretty subtle and more difficult for these kinds of models to detect. So those are some examples of the kinds of things I have used text analysis for in real life.

An area where you hear about these kinds of approaches a lot, a little more research oriented, is the digital humanities: people look at classic texts and ask what we can learn about literature and culture with these kinds of tools. One more common thing I have worked on is emails coming into a support system. You don't want to send an automated response, but you do want to classify them so that each one gets routed to the right person. You can have either supervised or unsupervised approaches: when an email comes in, you give it a score or a label, and it gets routed accordingly. Of course, when it goes to that person, they can say, no, no, this goes over here. But if your model does a good enough job, you can send it to the right place. So those are some more practical examples.

Thank you very much. More questions?

Well, thank you very much, Julia, for this beautiful presentation. It was a pleasure for us to be here with you. Thank you very much.

Thank you so much. Thank you.