Resources

817: The Positron IDE, Tidy NLP and MLOps — with Dr. @JuliaSilge

#PositronIDE #Tidyverse #MLOps

Dr. Julia Silge, Engineering Manager at Posit, joins @JonKrohnLearns to introduce the brand-new Positron IDE, perfect for exploratory data analysis and visualization. She also lays out her top picks for LLMs that boost coding efficiency and discusses when traditional NLP methods might be the smarter choice over LLMs. Plus, Julia highlights some must-know open-source libraries that make managing MLOps easier than ever. Tune in for insights that every data scientist, ML engineer, and developer will find useful.

This episode is brought to you by Gurobi (https://www.gurobi.com/personas/optimization-for-data-scientists/), the Decision Intelligence Leader, and by ODSC (https://odsc.com/california), the Open Data Science Conference. Interested in sponsoring a SuperDataScience Podcast episode? Email natalie@superdatascience.com for sponsorship information.

In this episode you will learn:
• [00:00:00] Introduction
• [00:03:23] Overview of Posit and Positron IDE
• [00:08:33] How the needs of a data scientist differ from those of a software developer
• [00:17:56] How to contribute to the open-source Positron
• [00:34:52] MLOps and Vetiver: Tools for deploying and maintaining ML models
• [00:48:34] Natural Language Processing (NLP) and the Tidyverse approach
• [01:22:18] The role of AI and LLMs in data science education

Additional materials: https://www.superdatascience.com/817

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

For people who are writing code as a data analyst or a data scientist, people who are working with data, what is different that we need specifically relative to another software developer? Yeah, so I think one piece that is very different is that the process of writing code is more exploratory, is more interactive. The one gap I feel like that Positron is working to address is that there isn't something out there right now that can be one place you go to do all your data science.

Julia, welcome to the Super Data Science Podcast. You are one of those megastars in the data science space that I've wanted to have on the show for so long. Oh, thank you so much for having me. Thank you. That's so kind.

Of course, yeah. Where are you calling in from today? I live in Salt Lake City, Utah, so that is where I am right now. It's late summer. It's hot. It's going to be fire season, unfortunately. But yeah, that's where I call home.

Great for outdoor activities I hear out there. Yes, just really unparalleled. I mean, in the summer, it's like hiking. In the winter, of course, snow sports. It's a really lovely place to be.

Nice. And so you were recommended to me most recently by Hadley Wickham, who is, of course, also a superstar in this space. He was recently on episode number 779. And shortly after that, I had the pleasure of meeting him in New York at the New York R Conference, which was really nice. Is that something – have you ever been to the New York R Conference? I have. I have attended a couple of times, and I spoke there at least once. It's really a fun group of people.

And it's a conference where they often have some really big names who come through kind of every time. I don't know if, like, everyone's favorite Bayesian swung through to give a sort of slide-free talk while you were there, but that's always kind of a highlight. Yes. Andrew Gelman? Yeah.

This year for the 10th anniversary, I'm going to do some injustice to probably someone out there who's listening who is also a huge name that wasn't there because I'm going to forget them in the slew of all the ones that were there this year. So we had Andrew Gelman there. There was David Robinson. There was – Hadley Wickham, of course, was there. Wes McKinney. Max Kuhn. Hilary Mason. It was wild. It was kind of like every talk with somebody that is iconic in the field. And so, yeah, it's an amazing conference to go to, highly concentrated.

And for whatever reason, it isn't a huge audience like you see with the Open Data Science Conference or with the O'Reilly conferences that used to happen before the pandemic. And so it means that if you do come and attend one of these conferences, you get to meet all of these people. It is a fairly intimate group, a fairly small sort of audience. So that is a real highlight of it. You really feel like you're not, you know, so far back from the speakers. The dynamic is really interactive, which is fun.

Introducing Positron

The most exciting thing that you're working on right now is that, as an engineering manager at Posit – the company formerly known as RStudio, and the makers of the RStudio IDE – you're leading the development of something called Positron, which is described as a next-generation IDE, integrated development environment, for data science.

And so with Positron, what are the gaps or limitations that you're addressing that aren't covered by things like RStudio, VS Code, or Jupyter Notebooks, which might be the go-to IDEs for data scientists or software developers today? Yeah, if I was going to sum up the one gap I feel like that Positron is working to address, it's that there isn't something out there right now that can be one place you go to do all your data science.

So Positron is not a general-purpose IDE. It is specifically an IDE built to do data science. And I come from a science background, and I've always been someone who wrote code for my data analysis, but I've always really felt that my needs were a little different than someone who is writing general-purpose code, like to build a website or to make a mobile app. People who write code to analyze data are different in some real ways. It's not that they're worse coders or – No, no, I really do think that. It's not, it's not. I don't think it is that people who write code to analyze data do a worse job writing code. It's that their needs are different and that they're writing code in a different way.

So folks who have been using VS Code as a data science IDE, for example, have really felt that tension, where they're like, this is really general-purpose, and I'm trying to kind of customize it using extensions to fit my needs. So Positron is meant to specifically be a data science IDE. Another real driving reason why we've built Positron the way it is, is that it is a multilingual or polyglot IDE. A lot of the environments you might download to do scientific computing or data science or data analysis are built specifically for one language. I know all of us have used these. RStudio is an example of one of these, like MATLAB, Spyder. There are a lot of, you know, environments in which you would do data analysis that are just built for one language.

And increasingly, I just think that's not how as many people work. Many, many people use multiple languages, whether it's on one project that literally uses multiple languages, or over the course of a week they pick up different projects that use different languages. Or almost certainly on the span of years or your career, you use different languages because things change in our ecosystem. Like you said, you started with R and, you know, now you use other languages. There are so many people who use combinations of R and Rust, you know, or they work on projects that are like Python plus front-end kinds of technologies, JavaScript, you know, HTML, et cetera. Or almost any data science language plus SQL, right? An IDE that is built to use one language – for very few people is that really going to fit all of the needs that they have over the course of a week, a month, or multiple years.

So, Positron is built with a design such that the front-end user-facing features are about the tasks you need to do, whether that is interactively writing code, whether that's dealing with your plots, whether that's seeing and exploring your data, like, you know, in a visual way. And then there are back-end language packs that provide the engines for those front-end features. It's very early days for Positron. We only made it public about six weeks ago as of the day we're recording this. So, it is currently shipping with support for Python and R. But it is designed in such a way that other data science languages can be added, because there's a separation between the front-end features and what is driving them.
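That separation can be sketched, very loosely, in a few lines of Python. This is not Positron's actual architecture or API – just a hypothetical illustration of the idea: a front-end feature like the console only talks to an abstract backend interface, so supporting a new language means writing a new backend, not new front-end features.

```python
from abc import ABC, abstractmethod

class LanguageBackend(ABC):
    """One 'language pack': the engine behind the front-end features."""

    @abstractmethod
    def execute(self, code: str) -> None: ...

    @abstractmethod
    def variables(self) -> dict: ...

class PythonBackend(LanguageBackend):
    """A toy Python engine that keeps a persistent environment."""

    def __init__(self):
        self.env = {}

    def execute(self, code: str) -> None:
        exec(code, self.env)  # toy: run code in the held environment

    def variables(self) -> dict:
        # Hide interpreter internals like __builtins__.
        return {k: v for k, v in self.env.items() if not k.startswith("__")}

class Console:
    """A front-end feature that works against any LanguageBackend."""

    def __init__(self, backend: LanguageBackend):
        self.backend = backend

    def run(self, code: str) -> dict:
        self.backend.execute(code)
        # The 'variables pane' refreshes after every execution.
        return self.backend.variables()

console = Console(PythonBackend())
print(console.run("x = 41 + 1"))  # {'x': 42}
```

An R or Julia backend would plug into the same `Console` without changing it, which is the modularity the design is after.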

Data science vs. general-purpose coding

So, the polyglot IDE part, to me, makes a huge amount of sense. I get that especially as a contrast to RStudio. For people who are writing code as a data analyst or a data scientist, people who are working with data, what is different that we need specifically relative to another software developer? Yeah. So, I think one piece that is very different is that the process of writing code is more exploratory, is more interactive. And that's not wrong or bad. That is actually just the fact that instead of getting a spec from a product manager and building a product – like, that's not what data scientists and data analysts do. You start with data, and you often don't know what you can or should do in detail until you start that process.

And if you have a code-writing process that is more exploratory, you need more supports for writing in that interactive, exploratory mode. Some things that support that are things like a truly fully featured interactive console. Of course, that does exist in various ways. People get at that in various ways, like when they use notebooks or, say, a Python REPL. But if you get a truly fully featured interactive console, what happens in the console is then reflected in the rest of where you're working. Say, in Positron we have what we call a variables pane. If you come from RStudio, you may be familiar with something called an environment pane, where you see all the things you've created, and it updates, right? Like, as you change things. Or the plots that you see – you have them all right there. You can scroll through them. If you change and make a new plot, you see it pop up there. And you have that really interactive way of working.

Some of the other things that I know really make a difference for people are things like help inside of the IDE where you are working. So, you know, you're working along, and you're like, ah, wait, what is the function signature? Or maybe I want to look at the docs for this. Instead of having to get out of a flow state and go somewhere else and read docs, like, on a website, you can open up help right there, copy-paste, go right back and forth, and stay in that kind of flow state. Another thing is if you're building interactive apps. You need a way to have that right there, updating as you change your code, versus having some sort of build process, going somewhere, looking at a browser.

There's really quite a lot that, if we put it together, we can make people more productive. You know, the company that I work for, Posit – the company formerly known as RStudio – is a really fun place to work as someone who likes thinking about the process that people bring to their tasks. Because we are huge believers in code-first data science. Not no-code solutions, not GUI-based tools. People who do data work should be writing code. And at the same time, their needs are different. And so pretty much every single thing my company does is deeply informed by this belief – how deeply we know that data practitioners are different, and that's good and fine. And we can make them more productive by building tools that are specifically for the kinds of tasks they need to do.

Positron's architecture and VS Code roots

Makes a huge amount of sense to me. You also talk about, in addition to these kinds of things that are great in an IDE, that are interactive, that allow us as data scientists, data analysts, to do exploratory data analysis – they're super helpful. So the kinds of things you mentioned, like a fully featured interactive console with an updating variables pane. So useful. And something that, for example, for me, is missing from Jupyter Notebook, which is what I primarily use today. And, yeah, obviously having plots in a way that you can scroll through them separately from your code, it sounds like. Whereas in my Jupyter notebook, it's all just kind of like one giant stream of information. And then obviously, you know, Jupyter notebooks don't have that kind of built-in help or docs. Although some implementations, like Google Colab, do that.

No, there's an interesting sort of set of overlapping tools in this space. And I do think Positron exists somewhere that's different from some of these other tools. Like, someone may be in a situation where they want a lot of guardrails, right? Often people will start in Jupyter notebooks. But as you kind of grow in sophistication, you often end up being frustrated by some of those guardrails and constraints. And it's like, well, I need a more fully featured IDE. I need something more. But if I go straight to something like VS Code, it's actually quite challenging to try to get it to be a good fit.

I mean, speaking of VS Code, we should say right now: Positron is now among the list of IDEs that are built on top of the infrastructure that powers VS Code. So there is a repo out there called Code OSS. It's the open-source components that are used to build what you might be familiar with as the Microsoft-branded Visual Studio Code. And there's actually quite a number now of IDEs that are forks of Code OSS.

And I think it's actually super interesting, because having that as open source has almost made it like a commodity – the building blocks for making an IDE. And it has lowered the bar to people building new IDEs in really interesting ways. Because you can say, okay, what do I want to experiment with as a differentiator when it comes to making an IDE? And one that I bet you and your listeners may have heard of or experimented with is Cursor. Cursor is another fork of this open-source code. And its sort of take on things is that it is AI-first, like LLM coding assistant first. And then our approach was data science specific. We think people who do data science are different from people who write general-purpose code, so let's make an IDE just for them.

And it means that you can build a new IDE with a smaller team, I think, than you would have in other ways, because you can concentrate on the things you think are differentiators. And then the pieces that are true for anyone who's writing code – like, we all need to save files, we all need to look at git merge conflicts, right? – you get that by building on top of this infrastructure. The other really interesting piece of building on top of this infrastructure is that you can then make the whole wide, wide world of VS Code-compatible extensions available to your users.

So, for example, what this means in our case is people who need a data-science-specific IDE now have access to that whole world of extensions. So if you're like, well, I do data science, but I want to use the Databricks extension or whatever, you can end up getting that extensibility, which has not been true of many of the other kinds of tools. If you look at the category of RStudio, Spyder, you know, those are quite difficult to extend and customize. So it's an interesting choice. And I think it's pretty interesting seeing how we're having this flourishing of IDEs because of the open-source nature of the underlying infrastructure for VS Code.

Very nice. And yes, so not only do you have Code OSS as a kind of backbone that's providing building blocks for Positron and offering that kind of extensibility through all the VS Code extensions – like you mentioned Databricks there; any number of extensions that people might want to import; rainbow tabs, whatever you want, it's out there. In addition to all those things, the Positron project itself is open source. So if people are listening and they want to be contributing, right now at the time of recording, there are 27 people – including, I can see, your face – as GitHub contributors. Listeners can go and contribute to this developing and very exciting project.

Yeah, yeah. So Positron is licensed such that it is source available. Anyone can come and look at the source, change it, contribute to it. And it is also licensed such that it is free to use, including for commercial purposes. You can use it, of course, in academia and for personal projects, but you can also use it at work. It is licensed in such a way that it is free to use in your work as a data scientist. So it's free to get, and you can read the code. There's real benefit to that kind of model for building and making software.

Posit as a public benefit corporation

I'm a big Posit fan. They have sponsored the show in the past, and they may sponsor the show again in the future, so there's that obvious kind of bias. But I feel confident that I would be saying these great things about Posit regardless, because it's so awesome to have companies like Posit out there that are supporting people like you to be able to work a huge portion of your time, I imagine, on open-source projects like this, that anyone can then go and contribute to and use. There are already thousands of stars on this GitHub repo, despite it only having been released six weeks ago at the time of recording, which is wild. And so, yeah, thanks, Posit, for that kind of system. And, you know, Posit is a public benefit corporation.

Yeah, yeah, it's really interesting. So some people hear that, and they're like, what does that mean? It does not mean a nonprofit. We're not a nonprofit like Wikipedia. Some of the really famous PBCs are outdoor companies. And basically, what a PBC means is that you write into your governance – how your company is governed – an explicit statement: here are the things we take into account when we make decisions. Because it turns out that under regular corporate governance, it's actually sort of against the rules to use any considerations other than shareholder value. That's actually how they're set up. That's the only thing.

So the way PBC-organized companies work is that you make explicit: here are the things we take into consideration. And, of course, one of them is shareholder value – making sure the company is financially stable and sustainable. So our PBC charter talks about the data science community in general. People, not just our customers who give us money, but the data science community in general. That means we can explicitly say we make a decision to improve the situation for data science users in general.

The analogy to think of is not a nonprofit, but rather many of the famous companies who make outdoor products. Back to my living in the West in a very outdoorsy kind of city – many of them, or some of them, are set up as PBCs, and often they'll say the environment or the outdoors is something they can use to make decisions for their company. It's pretty great as an employee. You know, no company is perfect, right? No place will always do everything that you agree with. But working somewhere where leadership has made their values explicit, and then set things up so that they can make decisions aligned with those values, is pretty great.

RStudio and Positron coexisting

And to go into kind of more of, I guess, the history of Posit and things developed in the past. Obviously, we've already talked about how Posit used to be called RStudio. And that's because they were the developers of that hugely popular tool – it was like the de facto IDE of choice for basically everyone using R. It just worked so nicely. So, with RStudio and Positron both being developed by Posit now, how do you see these two tools coexisting and complementing each other?

Yeah. So, I think the best way to think about this is to just really acknowledge how new Positron is. Positron is like a teeny tiny baby. And it is actually only available, as of when we're recording this, by getting an installer from GitHub Actions. I would not say it is ready for prime time. Positron is today, for example, not available to our paying customers – the people who pay for our products like Workbench. It's not even in there yet. It's in what we're calling a public beta. RStudio, in contrast, is rock-solid, stable software. It's got a decade-plus of real-world use on it.

We expect, particularly for an R user, RStudio is probably going to be the best choice for many people's work today and for quite a while to come. If you are happy with your current code editing experience, there's no need – and no call from me or anyone on the Positron team – to switch today. If I were to say who should try it out today, it's someone who is a little adventurous, maybe an early-adopter type, and probably someone who has some motivating reason. Like, they are someone who writes R and C++ together – that would be an example of using multiple languages together. Or you do a lot of maybe Python plus Rust. Or you do a lot of data science plus front-end work. And you are using those multiple languages, and you're frustrated by the options that are currently open to you.

So my company is committed to maintenance for, and including new features where appropriate in, RStudio for the long term. We expect it to be used for people's real work for quite a number of years to come. We made the call to build on top of a new architecture because of some of the limitations of the old one. You could have maybe envisioned: what if you made RStudio but for Python? Like, it was really RStudio, but built for Python. That would have been another sort of way it could have moved forward. And there were a couple of reasons why that was not the call.

One of them is that the architecture of RStudio is deeply, deeply entwined with R. So much so that the R session is deeply embedded in it, and it would have taken a lot of work to unravel that. You can't just kind of throw Python in there instead. If you have ever been an RStudio user, you may have had the experience of the crash with the little bomb, where the whole thing just goes down. If you crash R – because of too much data or, you know, RAM or whatever – the whole thing goes down. And with the new architecture that we're building on, which in many ways is more modern and more separated, if R crashes, the IDE stays up. You lose, of course, things you have made in your interactive R session, but the rest of everything stays. You don't lose everything else.

So, A, RStudio is not going anywhere for a very long time. And we really did think carefully about, if we're going to make the kind of next-generation thing, or support people who are using more than one language, what should we do next? There were a lot of things on the table. And I think that what we landed on, of course, has its upsides and downsides, but I think it's a really good choice where we landed. It's playing out well.

What language comes next?

It sounds like that ability to not have the whole system crash maybe is part of what makes this a next-generation IDE. Yeah, for sure. I would definitely agree with that. So what's next? You already support R and Python. Do you have some sense already? Can you disclose to us on air, maybe, what programming languages you're thinking of supporting next? I will say – and I am willing to say this not because the plans are immediate, but because I think it's really clear what makes sense next – that is probably Julia. The language that shares my name, but which I do not use.

And if you look at our GitHub repo and you search to find the most upvoted issue, at least as of today when I looked, it was like, hey, can you support Julia? So I'm just predicting that will be the next thing that happens. Not because we have started work on it or I'm committing us to it, but because it seems that is what the community would most like next. And there are already Julia Jupyter kernels. That's part of it – Positron is built using a lot of standard protocols so that work can be modular and reusable. And if there is an already-existing Jupyter kernel, which there is, of course, for Julia, then the amount of work to get it working in Positron is not enormous. So, to be clear, just to emphasize, this is not me saying we have started on this, but I'm predicting that'll be the next thing that happens, because that's where the most community interest is.

Yeah, it makes sense that there is a lot of community interest and also overlap in terms of existing components that you can make use of. Because this is probably obvious to some portion of listeners, but maybe for others it isn't: Jupyter – I don't know if portmanteau is the right English word – is a blend of Julia, Python, and R. That's what makes up the word Jupyter.

Yeah, yeah, yeah. So people who use a Jupyter Notebook may not be super aware of what's driving it. Every time you have a Jupyter Notebook, there's a Jupyter kernel, and the Jupyter kernel uses something called the Jupyter protocol. The Jupyter protocol is a way that a back end and a front end can communicate. The front end – think of your notebook – is going to ask a question, and then the back end gives an answer, and it goes back and forth with a standardized set of questions and answers. That's called the Jupyter protocol. And you can do that in a notebook, but you can also use that protocol in other ways.

And that's what Positron does. It uses that protocol as a way to drive those components that we just talked about – the variables pane, the plots, the viewer. You can use Jupyter – and by that I mean the Jupyter kernel and the Jupyter protocol – to build a console, a fully interactive console.
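To make that "standardized questions and answers" idea concrete, here's a minimal sketch of one of the questions a front end asks: an `execute_request` message. The field names follow the published Jupyter messaging spec (header, parent_header, metadata, content), but this example only builds the message dictionary; a real client would serialize it and send it to a kernel over ZeroMQ, typically via the `jupyter_client` library, and then receive answers like `execute_reply` back.

```python
import uuid
from datetime import datetime, timezone

def make_execute_request(code: str, session: str) -> dict:
    """Build a Jupyter-protocol execute_request message (shape only)."""
    return {
        "header": {
            "msg_id": str(uuid.uuid4()),
            "session": session,
            "username": "demo",
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": "execute_request",  # the "question" the front end asks
            "version": "5.3",               # messaging protocol version
        },
        "parent_header": {},  # empty for a fresh request; replies point back here
        "metadata": {},
        "content": {
            "code": code,             # the code the kernel should run
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": False,
        },
    }

msg = make_execute_request("1 + 1", session=str(uuid.uuid4()))
print(msg["header"]["msg_type"])  # execute_request
```

A console, a notebook, or a variables pane can all be front ends for the same kernel precisely because they all speak in messages shaped like this.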

All right. So this might be a really dumb question from me, but we talked about how Positron is also built on Code OSS. So does that mean that there is already an inbuilt connection between Code OSS and the Jupyter protocol? Or is that something that you stitched together for Positron?

That is something we stitched together. And I'm going to say, sometimes when people hear about what we're doing with Positron, they ask, well, why didn't you just build a set of extensions? Why didn't you just provide extensions so that someone can come to VS Code and use it as a data science user in this really fluent way?

VS Code is amazing. I have spent quite a lot of time in its innards over the past year. Incredible piece of engineering. But it is built with limited extension points, so what a pure extension can do is quite limited. And our hypothesis – the thing we believe after having tried it out – is that to get what a data science user truly needs, what they really need to be effective at that work, is not possible only with extensions.

And it's mainly because of the way things are connected to each other – the way that you want your console, your plots, your variables, your viewer, maybe the pane you use to connect to databases. You want all of that in sync and talking to each other, in a way that's not possible only through extensions. So, you know, if you ask one of these other forks a question like, why did you fork instead of building an extension, they'll have a similar answer: oh, well, our hypothesis is that what our user wants is X, and that's not possible through the existing extension points. Very cool. That makes a ton of sense to me.

SQL support and future directions

Going back just a tiny little bit to when I was asking you what language may be supported next, or what's next for Positron. I am not surprised that the answer was Julia, but – because this was before we'd done any of this discussion of the Jupyter protocol – my guess was actually that it was going to be SQL. Oh, so that's also a really interesting idea that, you know, we would be interested in hearing folks' feedback on.

So right now, the way that SQL is handled is that we have something called a connections pane that allows you to manage your SQL connections and see your tables in a really nice visual way. But then you would write Python or R – like, in Python, it supports SQLite and SQLAlchemy, you know, these ways of getting data from SQL in and out of Python. But I think a pure SQL experience would also be really interesting, and it is something that we have had internal discussions about. The thing here is that the execution model is really different.

When you are executing SQL queries, you're not executing them in some kind of kernel, like a Python kernel or an R kernel or this possible Julia kernel. You're actually executing them back in the database. So there's an interesting idea of a data science IDE that is built to use SQL. And this is maybe a little bit further out, but with SQL, you're executing in the database, so you could bring the results into something that provides support for reporting and dashboards – which for us would be Quarto, because Quarto does have support for pure JavaScript visualization through Observable.

And so you could execute SQL against the back end and then bring the results into something – if you're going to write a report, make a dashboard, make a set of slides. You could end up with a data science experience that actually does not go through a kernel, but instead executes queries against the actual SQL database and brings the results over. And then you could do your reporting or dashboard needs directly in JavaScript, using Observable. So, very interesting stuff that could be on the horizon.
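The "execute in the database, bring only the results over" model Julia describes can be sketched with Python's built-in sqlite3 module. This is just an illustration of the execution model, not how Positron's connections pane works; the in-memory database and the sales table are made up for the example.

```python
import sqlite3

# An in-memory SQLite database stands in for a real warehouse connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("west", 120.0), ("east", 80.0), ("west", 50.0)],
)

# The aggregation runs inside the database engine, not in a Python
# (or R, or Julia) kernel; only the small result set comes back.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 80.0), ('west', 170.0)]
```

That small result set is then what you would hand to a report, dashboard, or JavaScript visualization, with no kernel in the loop for the query itself.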

MLOps tooling: vetiver

Yeah, looking forward to seeing where this project goes. And obviously we will have a link to the GitHub repo for Positron in the show notes so people can check that out. Moving on to another big passion project of yours, as well as Posit's: MLOps tooling. You have defined MLOps as a set of practices to deploy and maintain ML models in production reliably and efficiently. How has the tooling that you have developed, tools like tidymodels and vetiver, supported MLOps practices?

I love talking about MLOps because I think it's a phrase that still has a lot of hype around it, and people aren't sure what it means. That's partly why, when we started building tools, we wanted to be really explicit about what we mean when we say MLOps. There's the process of developing a model, where you might use scikit-learn or tidymodels or PyTorch; let's call that model development. At some point you are done, you have your model. Then what do you do with it next? What are you even supposed to do with it next? That first piece you can learn about in a class; people will teach it to you if you go get a master's in data science, and you can read books about it.

But when you're done, what do you do with it? Where are you supposed to put the model? What does "production" even mean? People are almost afraid to ask. So one of the reasons I love working on tools for MLOps is that I really like thinking about people's practical workflows.

What do they actually do? What are they actually trying to do? Can we think about that in a systems-thinking kind of way, about how it all fits together? For the team that I built working on MLOps, what we wanted to focus on is: your model is trained, you have used the appropriate statistical methods to decide what your model is, it's done. Where do you go from there? We ended up highlighting three kinds of tasks. The first thing you need to do is version that model. Much like you need to version code, you need to version a model so that you can keep track of changes: if you were to retrain that model next month or next year, how do you know what the differences are? How do you keep track of the metadata around it? So that first piece is versioning.
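The essence of that versioning step can be sketched with nothing but the standard library: store the serialized model alongside metadata that lets you tell two trainings apart later. The model object and metadata fields here are invented for illustration; a tool like vetiver manages this for you.

```python
import hashlib
import json
import pickle
from datetime import datetime, timezone

# A stand-in "model": any picklable object (say, coefficients of a linear fit).
model = {"coef": [0.42, -1.3], "intercept": 7.0}

artifact = pickle.dumps(model)

# Versioning a model = keeping the artifact plus metadata (a content hash,
# a timestamp, ...) so that a retrain next month is distinguishable.
metadata = {
    "name": "demo-model",
    "sha256": hashlib.sha256(artifact).hexdigest(),
    "trained_at": datetime.now(timezone.utc).isoformat(),
}
print(json.dumps(metadata, indent=2))
```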

The second piece is deploy, and that's a word that, again, people ask what it even means. A simple way to think about deploying is getting the model off of your laptop, out of the development environment where you built it. You lift the model up, with all of the computational needs it has, and put it somewhere else. For the middle 80% of users, the best way to do that is often a REST API in some kind of containerized environment, whether that's literally Docker or something else. So: you version your model, doing the prep work so you can keep track of its characteristics, and then you deploy it, which is the process of wrapping it up and getting it away from your dev environment into a place where it can be integrated into the general infrastructure of your organization.
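The contract behind that REST deployment can be shown as a single pure function: JSON request in, JSON predictions out. This is a hedged sketch, not vetiver's actual API; the handler name, payload shape, and toy linear model are all invented, and a real deployment would wrap this function in a web framework inside a container.

```python
import json

# A toy linear model; in practice this would be loaded from the
# versioned artifact created at training time.
COEF = [0.5, 2.0]
INTERCEPT = 1.0

def predict_handler(request_body: str) -> str:
    """The core of a REST prediction endpoint: JSON in, JSON out."""
    payload = json.loads(request_body)
    preds = [
        sum(c * x for c, x in zip(COEF, row)) + INTERCEPT
        for row in payload["instances"]
    ]
    return json.dumps({"predictions": preds})

print(predict_handler('{"instances": [[1.0, 2.0], [0.0, 0.0]]}'))
```

Whatever framework trained the model, a collaborator integrating it only ever sees this request/response contract.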

And then the third big piece is monitoring. Once you have put that model somewhere else, how do you watch it over the long term? Machine learning models are pretty interesting in that they are both statistical artifacts and software artifacts. Something like a mobile app is mainly just a software artifact, and if you were to monitor how it's doing, you would monitor things like: is it up? How fast is it responding? That's the software side of it.

LLM-based code assistance in Positron

Your mentioning that caused me to realize a question that I should have asked you about Positron as well. Right now, those are just software development concerns, but I imagine, just as GitHub has Copilot and Google has Gemini built into Colab, that must be something you discuss as well, where there would be a machine learning assistant built right in.

So as of today, in its early stage, Positron doesn't come with LLM-based code assistant tools built in. Instead, we have built it so that you get a great integration experience with the LLM-based coding assistant tools that are available. People probably immediately ask: what about Copilot? And it is true that, as of today when we're recording this, Copilot is licensed in such a way that it can only be used in Microsoft's more proprietary tools. So Copilot itself is usable in the Microsoft build of VS Code, but as of this recording it is not available in Positron. What you do have access to is this burgeoning ecosystem of alternative LLM tools, and there are quite a number of ways of interacting with LLMs that align with different people's preferences.

There are three that I know people have had quite good experiences with. One of them is called Continue, one is called Tabnine, and one is called Codeium, with an E. They're different takes on what it means to have an LLM-based tool help you write code. A cool thing about these three is that you have different choices about the back ends you use: you can use one of the OpenAI back ends, but you can use alternatives as well. You can, in fact, use a fully open-source model, and you can even use a local model, so that if you have constraints, maybe around compliance, you don't have to send your data or code far away. So if the question is literally "does Copilot work," unfortunately, as of this moment, the answer is no.

But if you're asking what our vision for LLM-based code assistance is, it is excellent integrations and choice, so that people can decide what aligns with their own personal or organizational priorities when it comes to the details of the LLM itself and the details of how they interact with it.


That's a really exciting answer, because that's probably what developers love most: to hear that they can have whatever their preferred choice is. It means that behind the scenes they can be using the LLM of their choice. It could be the OpenAI API, or it could be Anthropic, or it could be a completely open-source implementation like Code Llama, something you could have running locally or use through a third-party provider.

Yeah, so, not to pick a favorite, but Continue. People bring different sets of priorities here, so I'm not saying the one I like best is the best one. So: Continue; Codeium, with an E in it; and Tabnine. These are ones that I know people have had good experiences with in Positron already.

Monitoring models in production

Yeah, it's really relevant, because that third piece of what MLOps means is monitoring. And when you're monitoring a machine learning model that's in production, it means different things depending on who you're talking to. Because it is a software artifact, you do have to measure things like uptime and latency; you need to measure those because models are software artifacts. But they are also statistical artifacts.

You could put a model into production and monitor its software characteristics, metrics like latency and uptime, and it could be doing great according to those metrics. But say something in the world changes, such that the actual underlying relationship between your inputs and your outputs changes over time; your statistical metrics could just be falling off a cliff. If you don't monitor your models, you don't know that's happening. Machine learning models in production are particularly prone to silent failure, because by failure we don't only mean their software characteristics; we also mean their statistical characteristics. The world has changed, and the model that you originally trained a month or a year ago no longer applies.
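One minimal sketch of that statistical monitoring, using only the standard library and entirely made-up numbers: track a model's accuracy over time windows and flag when it drops below an early baseline, even though uptime and latency would still look fine.

```python
from statistics import mean

# Weekly accuracy of a deployed classifier, measured once labels arrive.
# Uptime and latency could look perfect while these numbers fall off a cliff.
weekly_accuracy = [0.91, 0.90, 0.92, 0.89, 0.78, 0.74]

BASELINE_WEEKS = 3   # how many early weeks define "normal"
ALERT_DROP = 0.05    # how far below baseline counts as degradation

baseline = mean(weekly_accuracy[:BASELINE_WEEKS])
alerts = [
    (week, acc)
    for week, acc in enumerate(weekly_accuracy)
    if acc < baseline - ALERT_DROP
]
print(f"baseline={baseline:.3f}, alerts={alerts}")
```

The thresholds here are arbitrary; the point is that silent statistical failure is only visible if you measure statistical performance, not just software health.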


So those are the three pieces of MLOps that we focus on providing tooling for. We assume that you start with a model that's trained, and again, we think that people bring a variety of perspectives to how they get there, so we are fairly agnostic about how you got there. This is one of the real differentiators between the project I've worked on, which is called vetiver, and a project like MLflow: we come in at a different place and provide data science practitioners more flexibility in how they get started.

Again: version, deploy, monitor. That's what vetiver provides support for. Vetiver is a framework for MLOps in both Python and R, with very parallel implementations. What that means is that if you, as a practitioner, prefer to use tidymodels in R for one kind of statistical problem but PyTorch in Python for another, you can deploy them both with the same kind of tool, and you can provide your software engineer collaborator with an API that looks the same no matter how you trained the model. The other differentiator for vetiver, and this aligns so much with the other things we've said, is that it is built for a data science practitioner to use.

It is not built with a general software engineer in mind; it is built with a data science user in mind. Different orgs make different decisions about who is responsible for getting a model the last mile, getting it deployed. At really large organizations, there are whole teams whose entire job that is. But for many medium-sized organizations, or even small ones, the question is: who should do this? My hypothesis is that the best person to do it is the person who has the most domain knowledge about the model. If we give that person the tools so that they can hand off a little later in the process, not hand off some really raw thing, but actually be the one who packages up and deploys the model, then we end up with better machine learning practice overall, because they have the most knowledge about the model and how it works.
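The "same API no matter how you trained it" idea can be reduced to a tiny sketch: wrap differently-built models behind one predict() contract. This is not vetiver's actual implementation; the class, names, and toy models below are all invented to illustrate the design idea.

```python
from typing import Callable, Sequence

class DeployableModel:
    """Uniform wrapper: downstream consumers call predict() and never
    need to know which framework or language produced the model."""

    def __init__(self, name: str, predict_fn: Callable[[Sequence[float]], float]):
        self.name = name
        self._predict_fn = predict_fn

    def predict(self, row: Sequence[float]) -> float:
        return self._predict_fn(row)

# Two "models" trained in completely different ways share one interface.
linear = DeployableModel("linear", lambda row: 2.0 * row[0] + 1.0)
rule_based = DeployableModel("rules", lambda row: 10.0 if row[0] > 3 else 0.0)

for m in (linear, rule_based):
    print(m.name, m.predict([4.0]))
```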

Wow, yeah, thank you for that tour of vetiver. The compatibility, having support for both Python and R in one place, sounds great. There are still some things that I'd love to do in R; I primarily use Python these days, but there are things, particularly for me around creating visualizations with the ggplot2 library, that give me reasons to be using both together. So it's great that with vetiver I can be deploying both languages together and monitoring across all three key steps you outlined: version, deploy, and monitor.

tidytext and Jane Austen

So, moving on to our next topic area, it still is a tidy topic. With vetiver you were talking about tidymodels; this one is all about tidytext. You've written several books, several bestselling books in fact, and one of them, Text Mining with R: A Tidy Approach, features the tidytext natural language processing library. And interestingly, it also draws on Jane Austen's complete works via an R package that you wrote, janeaustenr. Yes, that's right. Some of your listeners are probably familiar with the hex stickers that the R community just loves to put on their laptops. I made a hex sticker for the janeaustenr library, and it's her signature, with colors and everything. I love it.

So, and this is a complete tangent from where I was going, but are you a big lover of Jane Austen?

I'm a super fan, I'll be honest. The story of tidytext is very intertwined with my story of getting into data science in general. When I was making this career transition from the kind of random stuff I was doing before into data science, about 10 years ago, I was thinking: okay, I have kind of a weird resume. How can I set myself up so that people in this, at the time, newish field of data science would believe that, yes, I'm someone who can do one of these jobs?

So I was working on what I thought of at the time as a blog, a way to show people the kinds of things I can work on. I envisioned myself sitting down with a hiring manager and talking through these projects, like a portfolio. As I was thinking about what would be compelling to people, I thought about the stuff I really know about and care about, things that are personal to me. My blog posts are all still up; if you go to the earliest ones, some of those use data from Utah, from Salt Lake City, because I would go to the public data portal and pull something about county differences in health.

So I kind of started out there, and then I very quickly started thinking: well, anyone who knows me knows I love Jane Austen. This has been one of the great loves of my life since I was about 12 years old. I should see what kind of analysis I could do with Jane Austen's work. Jane Austen's works are in the public domain, which means you can just get the text of them. And I started doing some initial exploration, and I was having a great time.

And then I was introduced, via the at-the-time thriving data science social media scene, to David Robinson, who you mentioned earlier. He lives in New York and is a big part of the NYR community, and David Robinson, I'll be honest, changed my life. Dave reached out about collaborating because he was excited about some of the stuff he saw me doing. He said, I think there's an opportunity here to build tidyverse-style tooling, but applied to text. I was quite new to the R community at the time and didn't have as much background as he did in what tidyverse-style tooling even means, but he had built things like broom and already had that kind of experience.

So we collaborated, and we actually met for the first time at what was called an unconference run by the organization rOpenSci, which is an amazing organization supporting open science through R (there's also pyOpenSci, a similar organization for Python). We met at that unconference in person, and he said, hey, do you want to build an R package to do text analysis from a tidyverse perspective? And I said, okay, sounds great, let's do it. We had something working by the end of three days, and that was the core of what tidytext became.

Then, over time, both Dave and I were really loving writing publicly and putting a lot of stuff out there to help people know how to use our tools. We were writing a lot about things we found really interesting. I did a lot of digging deeper into the Jane Austen material, comparing it to other books. This was around 2015, 2016; Dave did an analysis of Trump's tweets at the time that went super viral and used our tooling. So we had all these blog posts, and at one point we looked at each other and said: what if we wrote a book?

We started basically by taking the stuff we had already written, long-form documentation, package vignettes, blog posts, and putting it together: how do we reorganize, how do we make this flow from one thing to the next, how do we write an introduction, how do we wrap this thing up? We wrote that book really fast. I have since worked on other books, and the book that Dave and I wrote together came together fast, because it was just right. It was huge for me, huge for my career. I love that book. I love Jane Austen. That's a big part of my story, how that all came together.

As I said, we will be doing a book giveaway. You don't know this yet, Julia, but when we have authors on the show, we often do book giveaways, and so we will be doing one for your books. When this episode goes live, people listening to the audio version will hear it in my intro, because, you also wouldn't know this, Julia, after we finish recording, I use my notes from our conversation to create an intro and an outro. In that intro, I will have announced that people can get a physical copy of your book, with all the details on how to pull that off. Oh, delightful.

Topic modeling and NLP projects

So, yeah, that was a great story behind the development of your first book. And there are lots of recent projects as well that are super interesting. You've done topic modeling on Taylor Swift lyrics, and you've measured readability with the SMOG metric, which is a fun one; it's at juliasilge.com/blog/gobbledygook. Yeah, that was really fun to work on.

And I will say the Taylor Swift one was super fun to work on. I did it the same week that the concert film came out. I have not seen Taylor Swift live, sad to report, but I went with my kids to see the film the first week it was out, and then I did the topic modeling, so it felt very topical at the time. And actually the results are pretty interesting. The topic modeling looks at the textual content of the lyrics, and the real takeaway is that the pandemic-era albums, folklore and evermore, are very similar lyrically; the machine learning algorithm puts them together. The early albums all get put together as well, because they are thematically very similar. And then Reputation really stands out as separate; it is quite distinct lyrically from these other groups. So that was really fun to work on, and it certainly aligns with my own perception as a fan and someone who enjoys Taylor Swift's work. It was an interesting way to explore something else that I love. I love doing data science projects about things that I love; I get a lot of joy from that.

Nice, really cool to hear about that project as well. Another one that you did recently is with Stranger Things dialogue. For the popular Netflix series, you showcased the high-FREX (F-R-E-X, in all caps) and high-lift words of each season's dialogue. What do those terms mean, and what did you find?

Yeah, okay. So topic modeling is an unsupervised machine learning method for analyzing text. Text data is very interesting; I guess it's no shock that I find it very interesting, but when you think about text and natural language, there are a couple of really defining things about it. You see a lot of power laws: there are a few words we use a ton, like "the" and "and," and then there are a lot of words that are used only a few times. That means there's a really wide discrepancy in how many times you've observed the things you're counting, and so you have to use methods that allow you to learn something even given that. There are brute-force methods where you just make some cuts, but there are also much more sophisticated ways to learn which words are important and which topics are about what.
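That heavy-tailed word distribution is easy to see for yourself with a word count over any snippet of text. A minimal sketch, using the opening line of Pride and Prejudice (public domain) lowercased and unpunctuated:

```python
from collections import Counter

text = (
    "it is a truth universally acknowledged that a single man in "
    "possession of a good fortune must be in want of a wife"
)
counts = Counter(text.split())

# A handful of function words dominate; most words occur exactly once.
# This heavy-tailed shape is the power-law behavior described above.
most_common = counts.most_common(3)
singletons = sum(1 for c in counts.values() if c == 1)
print(most_common, singletons)
```

Even in one sentence, "a" appears four times while most words appear once; over a whole corpus this skew becomes extreme, which is why text methods must cope with very unequal counts.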

A topic model is a multi-level, hierarchical model. The mental model is: a topic is made of a mixture of words, and words can be in more than one topic; then a document is made of a mixture of topics. With the Taylor Swift example, I think I treated songs, or maybe lines, as the documents, and then you ask: which documents are made of which topics, and which topics are made of which words?
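That "mixture of mixtures" structure can be sketched directly. The topic names, words, and probabilities below are entirely made up; the point is only the arithmetic of how a document's word distribution falls out of its topic mixture:

```python
# A toy generative view of a topic model: topics are distributions over
# words, documents are mixtures of topics.
topics = {
    "romance": {"love": 0.6, "letter": 0.3, "monster": 0.1},
    "horror":  {"love": 0.1, "letter": 0.1, "monster": 0.8},
}

# This document is 75% romance, 25% horror.
doc_mixture = {"romance": 0.75, "horror": 0.25}

# Expected word distribution for the document = mixture-weighted
# combination of its topics' word distributions.
vocab = ["love", "letter", "monster"]
doc_words = {
    w: sum(doc_mixture[t] * topics[t][w] for t in topics) for w in vocab
}
print(doc_words)
```

Fitting a topic model runs this logic in reverse: from observed word counts, infer the topic-word distributions and each document's mixture.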

With Stranger Things, you end up with the most probable words per topic, but those are often the same across a lot of topics. I think I treated lines of dialogue as the documents there. If you think about a group of people talking, the most common words are just the common words people use when speaking; it's different from the words you'd see reading prose. It's not so much "the" and "and," but more words like "you" and the other words you use as you talk. Spoken language is different from written language.

So those are the most probable words, and if you look, all the topics have the same most probable words. That's normal and okay. These other metrics, like lift and FREX, get at which words are special or unique to different topics.

They're different statistics. FREX combines high frequency and exclusivity, so it picks out words that are used a lot, that's the high frequency, but that also have high exclusivity, meaning you see them in some topics but not others. High lift picks out words that appear in a topic much more often than you'd expect given how often they are used overall.
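Lift has a simple arithmetic core: the probability of a word within a topic divided by its probability in the whole corpus. A sketch with invented counts (the Stranger Things-flavored words are just for color):

```python
# Lift for a word in a topic: P(word | topic) / P(word overall).
# Toy counts, invented for illustration.
topic_counts = {"demogorgon": 8, "the": 40, "dart": 12}     # one topic
corpus_counts = {"demogorgon": 10, "the": 400, "dart": 14}  # whole corpus

topic_total = sum(topic_counts.values())
corpus_total = sum(corpus_counts.values())

lift = {
    w: (topic_counts[w] / topic_total) / (corpus_counts[w] / corpus_total)
    for w in topic_counts
}
# "the" is probable everywhere, so its lift is low; topic-specific words
# like "dart" score high because they barely appear outside this topic.
for w, score in sorted(lift.items(), key=lambda kv: -kv[1]):
    print(f"{w}: {score:.2f}")
```

Note this is the plain ratio definition; topic-modeling packages compute it from the fitted model's probabilities rather than raw counts.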

One thing I remember from that Stranger Things analysis: if you look across the seasons at the high-FREX, high-lift words, you see words like the name of that monster they called Dart, the little funny monster that got lost in the house. That one pops up only in its season because it's high lift, high frequency, high exclusivity. And there were things that only happened in the first season; they talked a lot more about the Upside Down, for example, so those words popped up in the early seasons because of their high exclusivity there.

Topic models in the era of LLMs

So, topic models are great. They are complicated models, and it's interesting to think about when they are useful in the era of LLM-based tools. I think they are most useful when you have medium-sized text data; by that I mean something like 5,000 to 10,000 documents. A document in a real application is often something like a survey response. So you have on the order of 5,000 to 10,000 of those, and you're interested in finding out: what are the topics? What are these about?

You can, of course, look to LLM-based tools for summarization, which gives you another way of doing it. But I tend to think topic models are most useful in situations where you have compliance reasons that prevent you from using LLM-based tools, where you have medium-sized data, or where you need higher statistical rigor than "I threw it into an LLM and got something out."

So I think, even in the era of LLM-based text tools, the need for doing EDA on text never goes away, and that's exactly what the tidytext package is all about: doing EDA for text. I would say it's a bad idea to just throw the text you're analyzing into an LLM-based tool for summarization without also doing EDA first.


It's also interesting to think about when you would use which kinds of tools, and what your needs are. These are all tools, and adding more tools is great, but we have to know when it's appropriate to apply them.

Yeah, those are really good points you get to at the end, around why you would use an LLM versus topic modeling and these more traditional natural language processing techniques. Use the traditional methods if you want higher statistical rigor, or if you want to be doing exploratory data analysis, which maybe you should be doing before you use an LLM anyway. No, I would argue yes, you should. And for midsize data, it's also probably going to be a lot less expensive.

Oh, absolutely, 100%. Because it is true: LLM-based tools are expensive to use, either literally per API call or in the expertise of running a local one.

Stop words

So, one of the questions brought up in our research by our researcher, Serg Masís, that I was really interested in asking you about NLP, is that you have pointed out how practitioners, including myself, tend to use pre-made lists of stop words before they start doing NLP analysis. So maybe quickly give us your definition of stop words, and then tell us why I should stop using a pre-made list.

So, stop words are lists of words that people consider unimportant and feel they can take out, words like "the," "and," and "of." In English, a conservative stop word list would be on the order of 100 words, and a more aggressive one on the order of 1,000. And the way these lists were made is that they're old; they typically come from the mid-20th century. They were made by taking, for the time, huge corpora of language, counting up words, looking at the top 1,000 or so, and then a person decided where to make the cutoff, and a person decided which words should or should not be kept in.

There are so many problems with stop word lists. For one, some of them literally have typos in them. You might ask, is that good or bad? If someone had a typo like "adn" for "and," maybe I do want to take that out too. But it's still strange that there are words with typos in these lists.

Another thing that happens, because these lists were created from corpora of language, many books put together, is that you actually end up with evidence of gender bias baked into the stop word lists. Say we take a huge number of books: there are more uses in those books of "he" than "she," more uses of "his" than "her." And some of those stop word lists, if you check them against a list of all the English pronouns, have all of the masculine pronouns but only about three-quarters of the feminine ones, just because of where the frequency cutoff fell.
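That asymmetry is easy to audit programmatically. A minimal sketch: the truncated stop list below is invented to mimic the frequency-cutoff artifact Julia describes (all masculine pronouns made the cut, but "hers" fell below it); a real audit would load an actual published list.

```python
# Checking a stop word list for pronoun coverage asymmetry.
stop_words = {"the", "and", "of", "he", "him", "his", "she", "her"}

masculine = {"he", "him", "his"}
feminine = {"she", "her", "hers"}

missing_masc = masculine - stop_words
missing_fem = feminine - stop_words
print("missing masculine:", missing_masc)
print("missing feminine:", missing_fem)
```

A few lines of set arithmetic like this, run against the list you are about to use, surfaces exactly the kind of bias that otherwise goes unnoticed.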

So even though making a list of words is about the simplest thing you can do, all of our challenges around data analysis, data science, and data sources show up even in this dead-simple thing.


So I now avoid using lists of stop words when I do topic models, because those words will always end up being the most probable words, and like we just talked about, the most probable words are never very interesting. You need to look at these other statistics that give you a better sense of what topics are really about.

When I do supervised machine learning, I often leave stop words in as well, because it turns out they're actually informative: the way documents use even those boring words can be predictive. If you're doing classification, how a document uses those boring words can help. If I'm doing EDA, I sometimes do take them out, for example if I'm trying to show the most common words and just want to drop the boring ones.

But I often supplement that with EDA approaches that reveal differences across groups without depending on removing stop words. One example is looking at the log odds of words: what are the highest log odds words for each group? I have a package for this called tidylo, for tidy log odds. It's an interesting approach any time you're looking at differences in counts across groups; it doesn't have to be language, but applying it to language works really well.
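The idea behind that comparison can be sketched with a plain smoothed log odds ratio over toy counts. Note this is the simple version, not the weighted variant that tidylo actually implements (which also accounts for sampling variability); the words and counts are invented.

```python
import math
from collections import Counter

# Word counts in two groups of documents (toy data).
group_a = Counter({"whale": 30, "sea": 20, "love": 2})
group_b = Counter({"whale": 1, "sea": 5, "love": 40})

def log_odds_ratio(word: str) -> float:
    """Smoothed log odds ratio of a word between group A and group B.

    Positive = more characteristic of A; negative = more of B.
    """
    a, b = group_a[word] + 1, group_b[word] + 1  # +1 smoothing
    total_a = sum(group_a.values()) + 1
    total_b = sum(group_b.values()) + 1
    return math.log((a / total_a) / (b / total_b))

for word in ["whale", "sea", "love"]:
    print(word, round(log_odds_ratio(word), 2))
```

Because the statistic contrasts groups against each other, ubiquitous words score near zero on their own, so no stop word removal is needed to surface distinctive vocabulary.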

I still use things like TF-IDF as an exploratory tool, which I bet many people have heard of, and I've written about what it means so people can dig into that more. So that's my pitch. My pitch is that stop word lists suffer from the same problems almost any data science process suffers from, even though they are so simple. If you're doing unsupervised or supervised machine learning, you probably want those words. And if you're doing EDA, there are alternative approaches that give you better answers.
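For reference, TF-IDF itself is only a few lines. This sketch uses toy documents invented for illustration; in the tidytext ecosystem the equivalent operates on tidy data frames:

```python
import math

# Three toy "documents" (invented) as whitespace-separated strings
docs = [
    "the whale the sea the ship",
    "the farm the horse",
    "the whale returns",
]

def tf_idf(term, doc, corpus):
    """Term frequency times inverse document frequency for one document."""
    words = doc.split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / df)
    return tf * idf

# "the" appears in every document, so its IDF is log(1) = 0: it is
# automatically down-weighted to nothing, without any stop word list.
score_the = tf_idf("the", docs[0], docs)
# "whale" appears in only two of three documents, so it keeps weight.
score_whale = tf_idf("whale", docs[0], docs)
```

This is why TF-IDF works as a stop-word-free exploratory tool: ubiquitous words wash out of the ranking by construction.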

Very cool answer. Something that I have been teaching for years, an issue I was aware of with stop word lists, is that for your particular application you at least need to be looking through the list of stop words. You shouldn't just be using it blindly. I mean, you've made a good argument for not using them at all, but something I have been saying for years is that you need to know what the stop words in there are. So, for example, if you're doing sentiment analysis and one of your stop words is "not," you're going to be pulling out the word "not," and that's one of the most critical words in figuring out the sentiment of a document.

Totally. Totally. My first book, the one I wrote with Dave, has an example of this. It's back to Jane Austen. It turns out one of the most commonly used sentiment lexicons has the word "miss" on it as a negative word. So this is slightly different: it's not a stop word list but a sentiment lexicon, but very similar constraints are at play.

So "miss" is on the list as a negative word, as in "I miss you," or, I don't know, "you missed quarterly earnings." But if you look at Jane Austen novels and do sentiment analysis using one of these lexicons, it shows up: the word driving negative sentiment, in a top-ten way, by a lot, is the word "miss." Of course, in Jane Austen's works, that's how everyone is referred to. It's all "Miss Bennet," you know. Everyone is "Miss." So those uses are not negative at all; those are neutral words.
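The "miss" problem is easy to reproduce with a toy lexicon counter. Everything below, the lexicon and the sentence, is invented to mimic the Austen example, not taken from any real lexicon:

```python
# A hypothetical sentiment lexicon that, like the one Julia describes,
# marks "miss" (the verb) as negative
negative_lexicon = {"miss", "abominable", "wretched"}

# A cheerful, Austen-flavored sentence where "miss" is an honorific
text = "miss bennet smiled and miss bingley smiled and all was agreeable"
tokens = text.split()

# Naive lexicon matching cannot tell the honorific from the verb:
negative_hits = [t for t in tokens if t in negative_lexicon]
# Both hits are the title "Miss," so a bag-of-words sentiment count
# reports negativity in a sentence with none.
```

This is the training/usage mismatch in miniature: the lexicon encoded one sense of a word, and the text uses another.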

And actually, the more sophisticated sentiment analysis tools now do better, but they are not immune from this problem, because it's basically fancier counting and fancier linear algebra. They do better, but they are not immune from this exact problem: words in their training data were used a certain way, and if a word is used in a different way in your data, there's a mismatch between the training data and the data you're applying the tool to. That's one of the real downsides of using pre-trained models: either they're too general, or the language is used in a different way. It's a similar problem.

Awesome. Great, insightful answer there. We got a ton from you. My stop word question has now been answered, and I will stop using them; you made a clear case. You know, built-in gender bias, these kinds of weird statistical phenomena around where exactly the cutoff was and what data the lists were built on. Using something like TF-IDF is something I've used a lot historically, but now you also mentioned tidylo. Yeah, and there are implementations of that in Python as well. If you look up the tidylo documentation, it's based on a paper whose method was originally implemented in Python, so it's available in Python as well.

Tidy principles and tidy models

Really quickly, and we are running out of recording time here, so hopefully you don't mind if we run over a few minutes. We've talked a lot about tidy modeling throughout this episode, but we haven't defined it. What are the tidy principles, and why should they matter to our listeners?

That's great. Okay, so I'm going to speak about it at a slightly higher level: what are tidy data principles? I'm sure Hadley would have been able to speak to this, and maybe he did. The tidy data principles are a set of ways of dealing with data. You want to have one observation per row. So let's say you were measuring the temperature of something over time, and you had five sensors. Every time you make a new observation, you make a new row, not a new column. Your data ends up long and skinny rather than super wide. So: one observation per row, and one table per kind of observation.

Actually, for listeners who come from the world of databases, it is much the same as the normal forms of data for databases. So there's this structure, this way of making data tidy, and the idea is that when you have tools for tidy data, you can use standardized tools instead of writing a lot of bespoke code every single time.
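The sensor example above can be sketched in a few lines. This is a hypothetical plain-Python version of the wide-to-long reshaping that tidyr's pivot_longer performs in R, with invented readings:

```python
# Wide form: one row per time point, one COLUMN per sensor (invented data)
wide = [
    {"time": 1, "sensor_a": 20.1, "sensor_b": 19.8},
    {"time": 2, "sensor_a": 20.4, "sensor_b": 19.9},
]

# Tidy (long) form: one ROW per observation, i.e. per (time, sensor) pair
tidy = [
    {"time": row["time"], "sensor": key, "temp": value}
    for row in wide
    for key, value in row.items()
    if key != "time"
]
# Adding a sixth sensor now adds rows, not columns, so any downstream
# code that groups by the "sensor" column keeps working unchanged.
```

That stability under new observations is the payoff of "one observation per row": standardized tools instead of bespoke code for each new data shape.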

The tidyverse is a collection of R packages that embraces tidy data principles. That means we have our data in this form and we give you tools, including tools for converting back and forth, getting into tidy data and out of tidy data. Then if you want to do visualization, or if you want to build models, you can get your data into tidy form and take those next steps.

The tidyverse set of R packages also has some values around reusable data structures: we don't want to make a new kind of data structure for every single problem. Instead, we want a standard set of data structures, and tools to get your data into that structure, so you can have more modular code. So: modularity, and reusable structures.

I have not myself worked much on tidyverse packages, but I did work on the tidymodels packages. That is a set of packages that applies these tidyverse priorities and principles: one observation per row, reusable code, modular code. So you're not saying, oh, I need to do something slightly different, I'm going to have to write it from scratch. We give you pieces of modular code so you can put them together in the way that you need to.

So tidymodels is a framework for machine learning and modeling in R. The two main things about it are, first, that it adopts these tidyverse principles to give someone a fluent way to get from EDA into machine learning. And second, it has some really great priorities around statistical guardrails: keeping you from doing the wrong things and keeping you using good practices around data. Data hygiene is a big one, right? I think most of us know we need to split our data into training and testing sets.

But it is pretty wild that even today, with machine learning knowledge being as broad as it is, people still trip up on data leakage kinds of issues. Tidymodels is built, first and foremost, around the question: can we keep you from leaking your data when you don't mean to? I think the biggest piece of that is real, explicit adoption of data pre-processing and feature engineering as part of your model. You can't think about that as something you do separately. You cannot do it before you split your data, and you cannot do it before you resample. If you're making cross-validation folds or resampling folds, your pre-processing steps, your feature engineering steps, have to happen inside the loop that is doing the resampling. So that is tidymodels.
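The guardrail Julia describes, fitting pre-processing inside each resampling fold, can be sketched like this. The data and folds below are invented, and tidymodels handles this bookkeeping for you in R; this is just the bare pattern:

```python
import statistics

# Invented data and two (train_indices, test_indices) resampling folds
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
folds = [([0, 1, 2, 3], [4, 5]), ([2, 3, 4, 5], [0, 1])]

leak_free_tests = []
for train_idx, test_idx in folds:
    train = [data[i] for i in train_idx]
    # Fit the pre-processing step (here, centering/scaling) on the
    # training portion of THIS fold only, never on the full data set...
    mu, sigma = statistics.mean(train), statistics.stdev(train)
    # ...then apply those training-fold statistics to the held-out fold.
    leak_free_tests.append([(data[i] - mu) / sigma for i in test_idx])
# The leaky alternative, computing mu and sigma once on all of `data`
# before splitting, would let the held-out points influence the scaler.
```

The contrast with the leaky version is the whole point: scaling on all the data first quietly feeds test-set information into every training fold.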

I was really excited to be working on that team for roughly the first two years I was at RStudio, and then Posit, and then I moved a little later to MLOps work. And for the past year, I've been working on Positron.

You can probably tell I'm not someone who has one life passion. I am a little bit of a generalist; I like to learn. If I would say there's one thing that puts all of it together, it's that I really care about people's real workflows. How are they approaching their real problems? How can we think about the problem they're solving and the tools they're using at a systems level? If I think about what connects text analysis to MLOps to building a freaking IDE, what connects those for me is that I really like applied, practical work. I really like thinking about how people approach their tasks in a very concrete, nitty-gritty kind of way.

Awesome. It's nice to get that insight from your background as well, and how everything ties together. But most important in that recap was getting that overview of the tidyverse and tidymodels. We certainly talked about the tidyverse in Hadley's most recent episode, number 779, but I don't remember for sure if we talked about tidymodels very much.

Audience questions

A week prior to you coming on the show, I posted on social media, on my LinkedIn and Twitter accounts (I still call it Twitter), that you would be coming on the show. There was a huge amount of engagement: hundreds upon hundreds of reactions, tens of thousands of impressions. And we did have some interesting questions as well, so I've got two for you.

The first one is from Luke Morris, who is a healthcare data scientist at the Stanford University School of Medicine. Luke says that you are awesome and that your book with Emil, and I'm going to butcher his last name, Hvitfeldt, the book Supervised Machine Learning for Text Analysis in R, was his North Star on his graduate capstone project. His question: as a major #TidyTuesday contributor, what have been the biggest "oh, wow" moments you've had digging through these weekly datasets?

Okay, I love this question, because something comes directly to mind for me. There was a Tidy Tuesday dataset back in 2020 about the voyages of enslaved Africans, a dataset covering the last several decades of the transatlantic slave trade. It was a really big thing during 2020, and people said, let's explore this, let's learn more about this.

And it showed something I didn't know going in: it turns out there is an exact example of Simpson's paradox in there, classic intro-stats Simpson's paradox. Looking at mean year of arrival and age, it looked like they were going up together, that over time, these enslaved Africans arriving in the Western Hemisphere were older when they arrived. But it turned out to be Simpson's paradox, because in the earlier years there were proportionally more women than men, and boys were being brought earlier, that kind of thing.

First of all, it was a pretty heavy topic, right? A heavy, deep topic, where you're like, this is rough, actually, looking at this data. And then I went through this example, this modeling, and no one had told me about it ahead of time; in fact, I don't think anyone had looked at this particularly. You make a first initial plot and you're like, oh, look, people are getting older over time, but it actually turns out the opposite was true, and it was just Simpson's paradox. That really sticks with me, partly because it was a heavy topic, and because I literally discovered the Simpson's paradox example as I was going. I'm like: it's real, everyone. Simpson's paradox is real.

Yeah. And Simpson's paradox being, if I remember correctly, that there appears to be a correlation one way between two factors, but when you condition on some other factor, in this case sex, the sign of the effect flips, because it turns out that factor is important. Here you had more women coming over in the earlier years, and the women tended to be younger. Yes. So in fact, controlling for sex, the enslaved people were getting younger. That's right. That's exactly right.
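Simpson's paradox is easy to demonstrate numerically. The records below are entirely invented to mirror the shape of the effect described (they are not the actual Tidy Tuesday data): the pooled mean age rises over time, while the mean within each sex falls:

```python
import statistics

# Invented (year, sex, age) records: in 1800 the group is mostly younger
# females; by 1850 it is mostly older males, but BOTH sexes got younger.
records = (
    [(1800, "male", 22)] * 2 + [(1800, "female", 16)] * 8 +
    [(1850, "male", 20)] * 8 + [(1850, "female", 14)] * 2
)

def mean_age(rows):
    return statistics.mean(age for _, _, age in rows)

# Pooled means suggest ages are RISING over time (17.2 -> 18.8)...
pooled_1800 = mean_age([r for r in records if r[0] == 1800])
pooled_1850 = mean_age([r for r in records if r[0] == 1850])

# ...but conditioning on sex shows each group's mean age FALLING
# (males 22 -> 20, females 16 -> 14): the group mix shifted, not the ages.
male_1800 = mean_age([r for r in records if r[0] == 1800 and r[1] == "male"])
male_1850 = mean_age([r for r in records if r[0] == 1850 and r[1] == "male"])
```

The flip happens purely because the proportion of the older group grows over time, which is exactly the mechanism Julia found in the real dataset.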

This one is from Otto Hansen, a data engineer and data scientist based in the Netherlands. Otto says he is looking forward to this episode. His question: how do you reflect on the rise of AI as an aid for teaching future generations of data science students? Can AI, in your view, Julia, be a force for good in teaching students how to code, or do you think AI poses a threat to properly teaching and mentoring future data science students and leaders?

Yeah, this is a really great question. I'll talk about my use of LLM-based tools specifically, because "AI" is one of those words where, what on earth does that mean? When I have used LLM-based tools for code assistance, I feel like they have overall helped me. And I think that's partly because of when they came into my life, and partly because of how I have used them.

So if I use, say, Copilot or one of the alternatives to Copilot, it is most helpful to me when I am trying to do something in a language I am a little less familiar with. In terms of my own background, these days I write a lot of TypeScript, a little bit of Rust, a little bit of Python, a medium amount of R, and a lot of YAML.

But anyway, a big proportion of those languages are not languages I have used my whole life, or even for ten years; a good proportion of them I've used for less than ten years. So if I know what I want to do and can't remember the syntax for how to do it in a certain language, these tools have actually been quite helpful for me.

If I am writing tidyverse EDA code in R, I am faster than the prompts, and I feel like the tool only gets in my way, because if I had to say what I am most productive at, where my expertise is, that's probably what I am fastest at writing. I think that's partly because those tools are so well designed; the tidyverse is such a good fit for how I think about data and analyzing data. So that's an example where I don't actually find it that helpful in my strongest competency. I find it quite helpful where I'm a little bit slower and would maybe have to look something up.

Now let's talk about teaching. I don't know that it would be that helpful in the long run for a learner to use one of these while learning their first language, because you maybe don't yet have the ability to evaluate what it produces. It spits something out and you're like, is this good or bad? I don't know. So I'm not sure it's so useful for someone trying to gain competence with their first language.

I think there are some real questions here: what's an effective or good use in a learning environment, and what do we want someone to learn? Maybe I can envision someone teaching a class saying, hey, we're going to learn to code in Python, and you all have one of these tools installed. Picture a very intro class: write a comment that says "write a for loop that does blah blah blah," then look at what comes out, read it, predict what will happen when you run it, and then run it. So I think it's possible to use these in teaching environments in ways that are going to be good. Because of my own background, it's a little hard for me to envision them being really useful for a learner of a first language, but I see tons of people around me finding these tools really useful for additional languages.

Now, that was all about code, right? What about other uses? I have not had great experiences using these tools for other things. If I try to have one write something for me, I hate the voice it's written in. I'm like, oh, I hate it; that doesn't sound like me at all. It sounds like this sort of gross machine voice they all have when they're generating prose, and I hate it. No way would I send an email that says that; that's ridiculous. So I've had less luck with text generation outside of code, and pretty good experiences with text generation in code.

I think it will take some really thoughtful education people to work out how we use these in helpful ways that build the real skills. The real skill when it comes to code is not writing syntax, right? That's almost what gets in the way. Can these tools help us solve that? Maybe, maybe.

So overall, I'd put myself in the camp of: I'm not morally opposed to these tools. Although, of course, there are all kinds of questions around the data used to train them and the licensing; there are some complicated issues. But I would not say I am opposed to them on principle. I have found uses for them in some ways. I'm kind of skeptical about how they are used broadly, and of course there's the fact that they can generate unlimited amounts of text that then gets put on the internet. Are we heading into an information environment where we cannot find good information because there is literally too much generated text of poor quality? Very interesting questions. I'm not someone who feels like I have the answers, but I certainly have opinions based on my own experiences of playing around with them.

Yeah, you touched on a lot of different topics there; we could probably spend a whole episode digging into those. But it's interesting to hear that, at least for the software development education use case, you do see a lot of potential. So do I. For me personally, these kinds of in-the-flow tools, like Gemini in Google Colab, have been amazing for developing software way more quickly. Instead of having to search the web for some Stack Overflow answer that isn't exactly the same situation, with a different version and obviously different variable names, so I have to figure things out, now it's often just a click of a button to get something to work.

I'm constantly blown away. It is interesting, the tone thing you mentioned with natural language generation. That is something I think is getting better. In fact, just last night at the time of recording, Natalie, who does operations management for our podcast, said to me: wow, something just suddenly changed with Google Gemini. Gemini is her preferred LLM, and she uses it specifically for emails; I think she provides context, ideally emails that she's written in the past, because she wants her tone to be mimicked. She said it really improved in the last few days. It will be really interesting to see how these tools develop moving forward.

One question I always ask our guests is whether you have a book recommendation for us. Obviously we know about your books, but maybe you have someone else's book to recommend. Based on the discussion we just had, I'm going to pull one off my shelf here, because I think it's so relevant.

It's called The Programmer's Brain, and it's by Felienne Hermans. It's all about how people program: how can you learn to be a better programmer, how can you increase your own skill in programming? It is super interesting to read. It came out before the rise of LLM-based tools, so it does not address those, but it is extremely interesting to read now through the lens of these tools being available.

What are the real skills when it comes to being someone who writes code, the real skills people are actually using? And how can you make yourself more effective? So that is my recommendation, because of the discussion we just had about where these tools are helpful. I think if you or your listeners read it with that mindset, you'll be like: ah, interesting, this is where a tool can come into the process and really make me better, or, oh, this is why I'm really frustrated when I try to use it here, because it's not supporting that kind of work.

How to follow Julia

Excellent recommendation. I love it. And the final question: how should people follow you after this episode, Julia? Yeah, I would say YouTube is a great place to find me, and my blog; I post there. In terms of social media, the places I'm hanging out these days are LinkedIn, Bluesky, and Mastodon.

Awesome. Thank you, Julia. This has been an amazing episode, and you've been really generous with your time; we've gone well over the scheduled recording slot. Thank you so much. We really appreciate all the insights you shared today. And maybe we can check in in a couple of years and see how your projects are coming along, how Positron is coming along. That would be so exciting. Thank you so much for having me.

Episode recap

Boom. So much rich, actionable detail in today's episode with Julia Silge. She filled us in on how the polyglot Positron IDE is designed from the ground up to be ideal for people who do exploratory data analysis, including updated variable panes and key consideration given to data visualization. She told us how Continue, Tabnine, and Codeium are her favorite LLM-based tools for code generation.

She told us how traditional NLP should be used instead of LLMs when we need higher statistical rigor, when carrying out EDA, or when working with larger datasets, and how we should use TF-IDF or her tidylo library in lieu of stop word lists because of issues like arbitrary frequency cutoffs and demographic biases. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Julia's social media profiles, as well as my own, at superdatascience.com/817.

Thanks to everyone on the SuperDataScience podcast team: our podcast manager Ivana Zibert, media editor Mario Pombo, operations manager Natalie Ziajski, researcher Serg Masís, our writers Dr. Zara Karschay and Sylvia Ogweng, and of course our founder, Kirill Eremenko. Thanks to all of them for producing another magnificent episode for us today and for enabling that super team to create this free podcast for you.

We're deeply grateful to our sponsors. You can support this show by checking out our sponsors' links, which you can find in the show notes. And if you yourself are ever interested in sponsoring an episode, you can get the details on how by making your way to jonkrohn.com/podcast. Otherwise, please share, review, subscribe, and so on. But most importantly, I just hope you'll keep on tuning in. I'm so grateful to have you listening, and I hope I can continue to make episodes you love for years and years to come. Till next time, keep on rocking it out there, and I'm looking forward to enjoying another round of the SuperDataScience podcast with you very soon.