Resources

817: The Positron IDE, Tidy NLP and MLOps — with Dr. @JuliaSilge

#PositronIDE #Tidyverse #MLOps

Dr. Julia Silge, Engineering Manager at Posit, joins @JonKrohnLearns to introduce the brand-new Positron IDE, perfect for exploratory data analysis and visualization. She also lays out her top picks for LLMs that boost coding efficiency and discusses when traditional NLP methods might be the smarter choice over LLMs. Plus, Julia highlights some must-know open-source libraries that make managing MLOps easier than ever. Tune in for insights that every data scientist, ML engineer, and developer will find useful.

This episode is brought to you by Gurobi (https://www.gurobi.com/personas/optimization-for-data-scientists/), the Decision Intelligence Leader, and by ODSC (https://odsc.com/california), the Open Data Science Conference. Interested in sponsoring a SuperDataScience Podcast episode? Email natalie@superdatascience.com for sponsorship information.

In this episode you will learn:
• [00:00:00] Introduction
• [00:03:23] Overview of Posit and Positron IDE
• [00:08:33] How the needs of a data scientist differ from those of a software developer
• [00:17:56] How to contribute to the open-source Positron
• [00:34:52] MLOps and Vetiver: Tools for deploying and maintaining ML models
• [00:48:34] Natural Language Processing (NLP) and the Tidyverse approach
• [01:22:18] The role of AI and LLMs in data science education

Additional materials: https://www.superdatascience.com/817

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

For people who are writing code as a data analyst or a data scientist, people who are working with data, what is different that we need specifically relative to another software developer? Yeah, so I think one piece that is very different is that the process of writing code is more exploratory, is more interactive. The one gap I feel like that Positron is working to address is that there isn't something out there right now that can be one place you go to do all your data science.

Julia, welcome to the Super Data Science Podcast. You are one of those megastars in the data science space that I've wanted to have on the show for so long. Oh, thank you so much for having me. Thank you. That's so kind.

Of course, yeah. Where are you calling in from today? I live in Salt Lake City, Utah, so that is where I am right now. It's late summer. It's hot. It's going to be fire season, unfortunately. But yeah, that's where I call home.

Great for outdoor activities I hear out there. Yes, just really unparalleled. I mean, in the summer, it's like hiking. In the winter, of course, snow sports. It's a really lovely place to be.

Nice. And so you were recommended to me most recently by Hadley Wickham, who is, of course, also a superstar in this space. He was recently on episode number 779. And shortly after that, I had the pleasure of meeting him in New York at the New York R Conference, which was really nice. Is that something – have you ever been to the New York R Conference? I have. I have attended a couple of times, and I spoke there at least once. It's really a fun group of people.

And it's a conference where they often have some really big names who come through kind of every time. I don't know if, like, everyone's favorite Bayesian swung through to give a sort of slide-free talk while you were there, but that's always kind of a highlight. Yes. Andrew Gelman? Yeah.

This year for the 10th anniversary, I'm going to do some injustice to probably someone out there who's listening who is also a huge name that wasn't there because I'm going to forget them in the slew of all the ones that were there this year. So we had Andrew Gelman there. There was David Robinson. There was – Hadley Wickham, of course, was there. Wes McKinney. Max Kuhn. Hilary Mason. It was wild. It was kind of like every talk with somebody that is iconic in the field. And so, yeah, it's an amazing conference to go to, highly concentrated.

And for whatever reason, it isn't a huge audience like you see with the Open Data Science Conference or with the O'Reilly conferences that used to happen before the pandemic. And so it means that if you do come and attend one of these conferences, you get to meet all of these people. It is a fairly intimate group, a fairly small sort of audience. So that is a real highlight of it. You really feel like you're not, you know, so far back from the speakers. The dynamic is really interactive, which is fun.

Introducing Positron

The most exciting thing that you're working on right now is that, as an engineering manager at Posit – the company formerly known as RStudio, and the makers of the RStudio IDE – you're leading the development of something called Positron, which is described as a next-generation IDE, integrated development environment, for data science.

And so with Positron, what are the gaps or limitations that you're addressing that aren't covered by things like RStudio, VS Code, or Jupyter Notebooks, which might be the go-to IDEs for data scientists or software developers today? Yeah, if I was going to sum up the one gap I feel like that Positron is working to address, it's that there isn't something out there right now that can be one place you go to do all your data science.

So Positron is not a general-purpose IDE. It is specifically an IDE built to do data science. And I come from a science background, and I've always been someone who wrote code for my data analysis, but I've always really felt that my needs were a little different than someone who is writing general-purpose code, like to build a website or to make a mobile app. People who write code to analyze data are different in some real ways. It's not that they're worse coders or – No, no, I really do think that. It's not, it's not. I don't think it is that people who write code to analyze data do a worse job writing code. It's that their needs are different and that they're writing code in a different way.

So folks who have been using VS Code as a data science IDE, for example, have really felt that tension, where they're like, this is really general-purpose, and I'm trying to kind of customize it using extensions to fit my needs. So Positron is meant to specifically be a data science IDE. Another real driving reason why we've built Positron the way it is, is that it is a multilingual or polyglot IDE. A lot of the environments you might download to do scientific computing or data science or data analysis are built specifically for one language. I know all of us have used these. RStudio is an example of one of these, like MATLAB, Spyder. There are a lot of, you know, environments in which you would do data analysis that are just built for one language.

And increasingly, I just think that's not how as many people work. Many, many people use multiple languages, whether it's on one project that literally uses multiple languages, or over the course of a week they pick up different projects that use different languages. Or almost certainly on the span of years or your career, you use different languages because things change in our ecosystem. Like you said, you started with R and, you know, now you use other languages. There are so many people who use combinations of R and Rust, you know, or they work on projects that are like Python plus front-end kinds of technologies, JavaScript, you know, HTML, et cetera. Or almost any data science language plus SQL, right? An IDE that is built to use one language – for very few people is that really going to fit all of the needs that they have over the course of a week, a month, or multiple years.

So, Positron is built with a design such that the front-end user-facing features are about the tasks you need to do, whether that is interactively writing code, whether that's dealing with your plots, whether that's seeing and exploring your data, like, you know, in a visual way. And then there are back-end language packs that provide the engines for those front-end features. It's very early days for Positron. We only made it public about six weeks ago as of the day we're recording this. So, it is currently shipping with support for Python and R. But it is designed in such a way that other data science languages can be added, because there's a separation between the front-end features and what is driving them.
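That separation can be sketched, very loosely, in a few lines of Python. This is not Positron's actual architecture or API – just a hypothetical illustration of the idea: a front-end feature like the console only talks to an abstract backend interface, so supporting a new language means writing a new backend, not new front-end features.

```python
from abc import ABC, abstractmethod

class LanguageBackend(ABC):
    """One 'language pack': the engine behind the front-end features."""

    @abstractmethod
    def execute(self, code: str) -> None: ...

    @abstractmethod
    def variables(self) -> dict: ...

class PythonBackend(LanguageBackend):
    """A toy Python engine that keeps a persistent environment."""

    def __init__(self):
        self.env = {}

    def execute(self, code: str) -> None:
        exec(code, self.env)  # toy: run code in the held environment

    def variables(self) -> dict:
        # Hide interpreter internals like __builtins__.
        return {k: v for k, v in self.env.items() if not k.startswith("__")}

class Console:
    """A front-end feature that works against any LanguageBackend."""

    def __init__(self, backend: LanguageBackend):
        self.backend = backend

    def run(self, code: str) -> dict:
        self.backend.execute(code)
        # The 'variables pane' refreshes after every execution.
        return self.backend.variables()

console = Console(PythonBackend())
print(console.run("x = 41 + 1"))  # {'x': 42}
```

An R or Julia backend would plug into the same `Console` without changing it, which is the modularity the design is after.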

Data science vs. general-purpose coding

So, the polyglot IDE part, to me, makes a huge amount of sense. I get that especially as a contrast to RStudio. For people who are writing code as a data analyst or a data scientist, people who are working with data, what is different that we need specifically relative to another software developer? Yeah. So, I think one piece that is very different is that the process of writing code is more exploratory, is more interactive. And that's not wrong or bad. That is actually just the fact that instead of getting a spec from a product manager and building a product – like, that's not what data scientists and data analysts do. You start with data, and you often don't know what you can or should do in detail until you start that process.

And if you have a code-writing process that is more exploratory, you need more supports for writing in that interactive, exploratory mode. Some things that support that are things like a truly fully featured interactive console. Of course, that does exist in various ways. People get at that in various ways, like when they use notebooks or, say, a Python REPL. But if you get a truly fully featured interactive console, what happens in the console is then reflected in the rest of where you're working. Say, in Positron we have what we call a variables pane. If you come from RStudio, you may be familiar with something called an environment pane, where you see all the things you've created, and it updates, right? Like, as you change things. Or the plots that you see – you have them all right there. You can scroll through them. If you change and make a new plot, you see it pop up there. And you have that really interactive way of working.

Some of the other things that I know really make a difference for people are things like help inside of the IDE where you are working. So, you know, you're working along, and you're like, ah, wait, what is the function signature? Or maybe I want to look at the docs for this. Instead of having to get out of a flow state and go somewhere else and read docs, like, on a website, you can open up help right there, copy-paste, go right back and forth, and stay in that kind of flow state. Another thing is if you're building interactive apps. You need a way to have that right there, updating as you change your code, versus having some sort of build process, going somewhere, looking at a browser.

There's really quite a lot that, if we put it together, we can make people more productive. You know, the company that I work for, Posit – the company formerly known as RStudio – is a really fun place to work as someone who likes thinking about the process that people bring to their tasks. Because we are huge believers in code-first data science. Not no-code solutions, not GUI-based tools. People who do data work should be writing code. And at the same time, their needs are different. And so pretty much every single thing my company does is deeply informed by this belief – how deeply we know that data practitioners are different, and that's good and fine. And we can make them more productive by building tools that are specifically for the kinds of tasks they need to do.

Positron's architecture and VS Code roots

Makes a huge amount of sense to me. You also talk about, in addition to these kinds of things that are great in an IDE, that are interactive, that allow us as data scientists, data analysts, to do exploratory data analysis – they're super helpful. So the kinds of things you mentioned, like a fully featured interactive console with an updating variables pane. So useful. And something that, for example, for me, is missing from Jupyter Notebook, which is what I primarily use today. And, yeah, obviously having plots in a way that you can scroll through them separately from your code, it sounds like. Whereas in my Jupyter notebook, it's all just kind of like one giant stream of information. And then obviously, you know, Jupyter notebooks don't have that kind of built-in help or docs. Although some implementations, like Google Colab, do that.

No, there's an interesting sort of set of overlapping tools in this space. And I do think Positron exists somewhere that's different from some of these other tools. Like, someone may be in a situation where they want a lot of guardrails, right? Often people will start in Jupyter notebooks. But as you kind of grow in sophistication, you often end up being frustrated by some of those guardrails and constraints. And it's like, well, I need a more fully featured IDE. I need something more. But if I go straight to something like VS Code, it's actually quite challenging to try to get it to be a good fit.

I mean, speaking of VS Code, we should say right now: Positron is now among the list of IDEs that are built on top of the infrastructure that powers VS Code. So there is a repo out there called Code OSS. It's the open-source components that are used to build what you might be familiar with as the Microsoft-branded Visual Studio Code. And there's actually quite a number now of IDEs that are forks of Code OSS.

And I think it's actually super interesting, because having that as open source has almost made it like a commodity – the building blocks for making an IDE. And it has lowered the bar to people building new IDEs in really interesting ways. Because you can say, okay, what do I want to experiment with as a differentiator when it comes to making an IDE? And one that I bet you and your listeners may have heard of or experimented with is Cursor. Cursor is another fork of this open-source code. And its sort of take on things is that it is AI-first, like LLM coding assistant first. And then our approach was data science specific. We think people who do data science are different from people who write general-purpose code, so let's make an IDE just for them.

And it means that you can build a new IDE with a smaller team, I think, than you would have in other ways, because you can concentrate on the things you think are differentiators. And then the pieces that are true for anyone who's writing code – like, we all need to save files, we all need to look at git merge conflicts, right? – you get that by building on top of this infrastructure. The other really interesting piece of building on top of this infrastructure is that you can then make the whole wide, wide world of VS Code-compatible extensions available to your users.

So, for example, what this means in our case is people who need a data-science-specific IDE now have access to that whole world of extensions. So if you're like, well, I do data science, but I want to use the Databricks extension or whatever, you can end up getting that extensibility, which has not been true of many of the other kinds of tools. If you look at the category of RStudio, Spyder, you know, those are quite difficult to extend and customize. So it's an interesting choice. And I think it's pretty interesting seeing how we're having this flourishing of IDEs because of the open-source nature of the underlying infrastructure for VS Code.

Very nice. And yes, so not only do you have Code OSS as a kind of backbone that's providing building blocks for Positron and offering that kind of extensibility through all the VS Code extensions – like you mentioned Databricks there; any number of extensions that people might want to import; rainbow tabs, whatever you want, it's out there. In addition to all those things, the Positron project itself is open source. So if people are listening and they want to be contributing, right now at the time of recording, there are 27 people – including, I can see, your face – as GitHub contributors. Listeners can go and contribute to this developing and very exciting project.

Yeah, yeah. So Positron is licensed such that it is source available. Anyone can come and look at the source, change it, contribute to it. And it is also licensed such that it is free to use, including for commercial purposes. You can use it, of course, in academia and for personal projects, but you can also use it at work. It is licensed in such a way that it is free to use in your work as a data scientist. So it's free to get, and you can read the code. There's real benefit to that kind of model for building and making software.

Posit as a public benefit corporation

I'm a big Posit fan. They have sponsored the show in the past, and they may sponsor the show again in the future, so there's that obvious kind of bias. But I feel confident that I would be saying these great things about Posit regardless, because it's so awesome to have companies like Posit out there that are supporting people like you to be able to work a huge portion of your time, I imagine, on open-source projects like this, that anyone can then go and contribute to and use. There are already thousands of stars on this GitHub repo, despite it only having been released six weeks ago at the time of recording, which is wild. And so, yeah, thanks, Posit, for that kind of system. And, you know, Posit is a public benefit corporation.

Yeah, yeah, it's really interesting. So some people hear that, and they're like, what does that mean? It does not mean a nonprofit. We're not a nonprofit like Wikipedia. Some of the really famous PBCs are outdoor companies. And basically, what a PBC means is that you write into your governance – how your company is governed – an explicit statement: here are the things we take into account when we make decisions. Because it turns out that under regular corporate governance, it's actually sort of against the rules to use any considerations other than shareholder value. That's actually how they're set up. That's the only thing.

So the way PBC-organized companies work is that you make explicit: here are the things we take into consideration. And, of course, one of them is shareholder value – making sure the company is financially stable and sustainable. So our PBC charter talks about the data science community in general. People, not just our customers who give us money, but the data science community in general. That means we can explicitly say we make a decision to improve the situation for data science users in general.

The analogy to think of is not a nonprofit, but rather many of the famous companies who make outdoor products. Back to my living in the West in a very outdoorsy kind of city – many of them, or some of them, are set up as PBCs, and often they'll say the environment or the outdoors is something they can use to make decisions for their company. It's pretty great as an employee. You know, no company is perfect, right? No place will always do everything that you agree with. But working somewhere where leadership has made their values explicit, and then set things up so that they can make decisions aligned with those values, is pretty great.

RStudio and Positron coexisting

And to go into kind of more of, I guess, the history of Posit and things developed in the past. Obviously, we've already talked about how Posit used to be called RStudio. And that's because they were the developers of that hugely popular tool – it was like the de facto IDE of choice for basically everyone using R. It just worked so nicely. So, with RStudio and Positron both being developed by Posit now, how do you see these two tools coexisting and complementing each other?

Yeah. So, I think the best way to think about this is to just really acknowledge how new Positron is. Positron is like a teeny tiny baby. And it is actually only available, as of when we're recording this, by getting an installer from GitHub Actions. I would not say it is ready for prime time. Positron is today, for example, not available to our paying customers – the people who pay for our products like Workbench. It's not even in there yet. It's in what we're calling a public beta. RStudio, in contrast, is rock-solid, stable software. It's got a decade-plus of real-world use on it.

We expect, particularly for an R user, RStudio is probably going to be the best choice for many people's work today and for quite a while to come. If you are happy with your current code editing experience, there's no need – and no call from me or anyone on the Positron team – to switch today. If I were to say who should try it out today, it's someone who is a little adventurous, maybe an early-adopter type, and probably someone who has some motivating reason. Like, they are someone who writes R and C++ together – that would be an example of using multiple languages together. Or you do a lot of maybe Python plus Rust. Or you do a lot of data science plus front-end work. And you are using those multiple languages, and you're frustrated by the options that are currently open to you.

So my company is committed to maintenance for, and including new features where appropriate in, RStudio for the long term. We expect it to be used for people's real work for quite a number of years to come. We made the call to build on top of a new architecture because of some of the limitations of the old one. You could have maybe envisioned: what if you made RStudio but for Python? Like, it was really RStudio, but built for Python. That would have been another sort of way it could have moved forward. And there were a couple of reasons why that was not the call.

One of them is that the architecture of RStudio is deeply, deeply entwined with R. So much so that the R session is deeply embedded in it, and it would have taken a lot of work to unravel that. You can't just kind of throw Python in there instead. If you have ever been an RStudio user, you may have had the experience of the crash with the little bomb, where the whole thing just goes down. If you crash R – because of too much data or, you know, RAM or whatever – the whole thing goes down. And with the new architecture that we're building on, which in many ways is more modern and more separated, if R crashes, the IDE stays up. You lose, of course, things you have made in your interactive R session, but the rest of everything stays. You don't lose everything else.

So, A, RStudio is not going anywhere for a very long time. And we really did think carefully about, if we're going to make the kind of next-generation thing, or support people who are using more than one language, what should we do next? There were a lot of things on the table. And I think that what we landed on, of course, has its upsides and downsides, but I think it's a really good choice where we landed. It's playing out well.

What language comes next?

It sounds like that ability to not have the whole system crash maybe is part of what makes this a next-generation IDE. Yeah, for sure. I would definitely agree with that. So what's next? You already support R and Python. Do you have some sense already? Can you disclose to us on air, maybe, what programming languages you're thinking of supporting next? I will say – and I am willing to say this not because the plans are immediate, but because I think it's really clear what makes sense next – that is probably Julia. The language that shares my name, but which I do not use.

And if you look at our GitHub repo and you search to find the most upvoted issue, at least as of today when I looked, it was like, hey, can you support Julia? So I'm just predicting that will be the next thing that happens. Not because we have started work on it or I'm committing us to it, but because it seems that is what the community would most like next. And there are already Julia Jupyter kernels. That's part of it – Positron is built using a lot of standard protocols so that work can be modular and reusable. And if there is an already-existing Jupyter kernel, which there is, of course, for Julia, then the amount of work to get it working in Positron is not enormous. So, to be clear, just to emphasize, this is not me saying we have started on this, but I'm predicting that'll be the next thing that happens, because that's where the most community interest is.

Yeah, it makes sense that there is a lot of community interest and also overlap in terms of existing components that you can make use of. Because this is probably obvious to some portion of listeners, but maybe for others it isn't: Jupyter – I don't know if portmanteau is the right English word – is a blend of Julia, Python, and R. That's what makes up the word Jupyter.

Yeah, yeah, yeah. So people who use a Jupyter Notebook may not be super aware of what's driving it. Every time you have a Jupyter Notebook, there's a Jupyter kernel, and the Jupyter kernel uses something called the Jupyter protocol. The Jupyter protocol is a way that a back end and a front end can communicate. The front end – think of your notebook – is going to ask a question, and then the back end gives an answer, and it goes back and forth with a standardized set of questions and answers. That's called the Jupyter protocol. And you can do that in a notebook, but you can also use that protocol in other ways.

And that's what Positron does. It uses that protocol as a way to drive those components that we just talked about – the variables pane, the plots, the viewer. You can use Jupyter – and by that I mean the Jupyter kernel and the Jupyter protocol – to build a console, a fully interactive console.
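To make that "standardized questions and answers" idea concrete, here's a minimal sketch of one of the questions a front end asks: an `execute_request` message. The field names follow the published Jupyter messaging spec (header, parent_header, metadata, content), but this example only builds the message dictionary; a real client would serialize it and send it to a kernel over ZeroMQ, typically via the `jupyter_client` library, and then receive answers like `execute_reply` back.

```python
import uuid
from datetime import datetime, timezone

def make_execute_request(code: str, session: str) -> dict:
    """Build a Jupyter-protocol execute_request message (shape only)."""
    return {
        "header": {
            "msg_id": str(uuid.uuid4()),
            "session": session,
            "username": "demo",
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": "execute_request",  # the "question" the front end asks
            "version": "5.3",               # messaging protocol version
        },
        "parent_header": {},  # empty for a fresh request; replies point back here
        "metadata": {},
        "content": {
            "code": code,             # the code the kernel should run
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": False,
        },
    }

msg = make_execute_request("1 + 1", session=str(uuid.uuid4()))
print(msg["header"]["msg_type"])  # execute_request
```

A console, a notebook, or a variables pane can all be front ends for the same kernel precisely because they all speak in messages shaped like this.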

All right. So this might be a really dumb question from me, but we talked about how Positron is also built on Code OSS. So does that mean that there is already an inbuilt connection between Code OSS and the Jupyter protocol? Or is that something that you stitched together for Positron?

That is something we stitched together. And I'm going to say, sometimes when people hear about what we're doing with Positron, they ask, well, why didn't you just build a set of extensions? Why didn't you just provide extensions so that someone can come to VS Code and use it as a data science user in this really fluent way?

VS Code is amazing. I have spent quite a lot of time in its innards over the past year. Incredible piece of engineering. But it is built with limited extension points, so what a pure extension can do is quite limited. And our hypothesis – the thing we believe after having tried it out – is that to get what a data science user truly needs, what they really need to be effective at that work, is not possible only with extensions.

And it's mainly because of the way things are connected to each other – the way that you want your console, your plots, your variables, your viewer, maybe the pane you use to connect to databases. You want all of that in sync and talking to each other, in a way that's not possible only through extensions. So, you know, if you ask one of these other forks a question like, why did you fork instead of building an extension, they'll have a similar answer: oh, well, our hypothesis is that what our user wants is X, and that's not possible through the existing extension points. Very cool. That makes a ton of sense to me.

SQL support and future directions

Going back just a tiny little bit to when I was asking you what language may be supported next, or what's next for Positron. I am not surprised that the answer was Julia, but – because this was before we'd done any of this discussion of the Jupyter protocol – my guess was actually that it was going to be SQL. Oh, so that's also a really interesting idea that, you know, we would be interested in hearing folks' feedback on.

So right now, the way that SQL is handled is that we have something called a connections pane that allows you to manage your SQL connections and see your tables in a really nice visual way. But then you would write Python or R – like, in Python, it supports SQLite and SQLAlchemy, you know, these ways of getting data from SQL in and out of Python. But I think a pure SQL experience would also be really interesting, and it is something that we have had internal discussions about. The thing here is that the execution model is really different.

When you are executing SQL queries, you're not executing them in some kind of kernel, like a Python kernel or an R kernel or this possible Julia kernel. You're actually executing them back in the database. So there's an interesting idea of a data science IDE that is built to use SQL. And this is maybe a little bit further out, but with SQL, you're executing in the database, so you could bring the results into something that provides support for reporting and dashboards – which for us would be Quarto, because Quarto does have support for pure JavaScript visualization through Observable.

And so you could execute SQL against the back end and then bring the results into something – if you're going to write a report, make a dashboard, make a set of slides. You could end up with a data science experience that actually does not go through a kernel, but instead executes queries against the actual SQL database and brings the results over. And then you could do your reporting or dashboard needs directly in JavaScript, using Observable. So, very interesting stuff that could be on the horizon.
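The "execute in the database, bring only the results over" model Julia describes can be sketched with Python's built-in sqlite3 module. This is just an illustration of the execution model, not how Positron's connections pane works; the in-memory database and the sales table are made up for the example.

```python
import sqlite3

# An in-memory SQLite database stands in for a real warehouse connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("west", 120.0), ("east", 80.0), ("west", 50.0)],
)

# The aggregation runs inside the database engine, not in a Python
# (or R, or Julia) kernel; only the small result set comes back.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 80.0), ('west', 170.0)]
```

That small result set is then what you would hand to a report, dashboard, or JavaScript visualization, with no kernel in the loop for the query itself.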

MLOps tooling: vetiver

Yeah, looking forward to seeing where this project goes. And obviously we will have a link to the GitHub repo for Positron in the show notes so people can check that out. Moving on to another big passion project of yours, as well as Posit's: MLOps tooling. You have defined MLOps as a set of practices to deploy and maintain ML models in production reliably and efficiently. How has the tooling that you have developed, tools like tidymodels and vetiver, supported MLOps practices?

I love talking about MLOps because I think it's a phrase that still has a lot of hype around it, and people aren't sure what it means. That's partly why, when we started building tools, we wanted to be really explicit about what we mean when we say MLOps. There's the process of developing a model, where you might use scikit-learn or tidymodels or PyTorch; let's call that model development. At some point you are done, you have your model. Then what do you do with it next? What are you even supposed to do with it next? That first piece you can learn about in a class; people will teach it to you if you go get a master's in data science, and you can read books about it.

But when you're done, what do you do with it? Where are you supposed to put the model? What does "production" even mean? People are almost afraid to ask. So one of the reasons I love working on tools for MLOps is that I really like thinking about people's practical workflows.

What do they actually do? What are they actually trying to do? Can we think about that in a systems-thinking kind of way, about how it all fits together? For the team that I built working on MLOps, what we wanted to focus on is: your model is trained, you have used the appropriate statistical methods to decide what your model is, it's done. Where do you go from there? We ended up highlighting three kinds of tasks. The first thing you need to do is version that model. Much like you need to version code, you need to version a model so that you can keep track of changes: if you were to retrain that model next month or next year, how do you know what the differences are? How do you keep track of the metadata around it? So that first piece is versioning.
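The essence of that versioning step can be sketched with nothing but the standard library: store the serialized model alongside metadata that lets you tell two trainings apart later. The model object and metadata fields here are invented for illustration; a tool like vetiver manages this for you.

```python
import hashlib
import json
import pickle
from datetime import datetime, timezone

# A stand-in "model": any picklable object (say, coefficients of a linear fit).
model = {"coef": [0.42, -1.3], "intercept": 7.0}

artifact = pickle.dumps(model)

# Versioning a model = keeping the artifact plus metadata (a content hash,
# a timestamp, ...) so that a retrain next month is distinguishable.
metadata = {
    "name": "demo-model",
    "sha256": hashlib.sha256(artifact).hexdigest(),
    "trained_at": datetime.now(timezone.utc).isoformat(),
}
print(json.dumps(metadata, indent=2))
```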

The second piece is deploy, and that's a word that, again, people ask what it even means. A simple way to think about deploying is getting the model off of your laptop, out of the development environment where you built it. You lift the model up, with all of the computational needs it has, and put it somewhere else. For the middle 80% of users, the best way to do that is often a REST API in some kind of containerized environment, whether that's literally Docker or something else. So: you version your model, doing the prep work so you can keep track of its characteristics, and then you deploy it, which is the process of wrapping it up and getting it away from your dev environment into a place where it can be integrated into the general infrastructure of your organization.
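The contract behind that REST deployment can be shown as a single pure function: JSON request in, JSON predictions out. This is a hedged sketch, not vetiver's actual API; the handler name, payload shape, and toy linear model are all invented, and a real deployment would wrap this function in a web framework inside a container.

```python
import json

# A toy linear model; in practice this would be loaded from the
# versioned artifact created at training time.
COEF = [0.5, 2.0]
INTERCEPT = 1.0

def predict_handler(request_body: str) -> str:
    """The core of a REST prediction endpoint: JSON in, JSON out."""
    payload = json.loads(request_body)
    preds = [
        sum(c * x for c, x in zip(COEF, row)) + INTERCEPT
        for row in payload["instances"]
    ]
    return json.dumps({"predictions": preds})

print(predict_handler('{"instances": [[1.0, 2.0], [0.0, 0.0]]}'))
```

Whatever framework trained the model, a collaborator integrating it only ever sees this request/response contract.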

And then the third big piece is monitoring. Once you have put that model somewhere else, how do you watch it over the long term? Machine learning models are pretty interesting in that they are both statistical artifacts and software artifacts. Something like a mobile app is mainly just a software artifact, and if you were to monitor how it's doing, you would monitor things like: is it up? How fast is it responding? That's the software side of it.

LLM-based code assistance in Positron

Your mentioning that caused me to realize a question that I should have asked you about Positron as well. Right now, those are just software development concerns, but I imagine, just as GitHub has Copilot and Google has Gemini built into Colab, that must be something you discuss as well, where there would be a machine learning assistant built right in.

So as of today, in its early stage, Positron doesn't come with LLM-based code assistant tools built in. Instead, we have built it so that you get a great integration experience with the LLM-based coding assistant tools that are available. People probably immediately ask: what about Copilot? And it is true that, as of today when we're recording this, Copilot is licensed in such a way that it can only be used in Microsoft's more proprietary tools. So Copilot itself is usable in the Microsoft build of VS Code, but as of this recording it is not available in Positron. What you do have access to is this burgeoning ecosystem of alternative LLM tools, and there are quite a number of ways of interacting with LLMs that align with different people's preferences.

There are three that I know people have had quite good experiences with. One of them is called Continue, one is called Tabnine, and one is called Codeium, with an E. They're different takes on what it means to have an LLM-based tool help you write code. A cool thing about these three is that you have different choices about the back ends you use: you can use one of the OpenAI back ends, but you can use alternatives as well. You can, in fact, use a fully open-source model, and you can even use a local model, so that if you have constraints, maybe around compliance, you don't have to send your data or code far away. So if the question is literally "does Copilot work," unfortunately, as of this moment, the answer is no.

But if you're asking what our vision for LLM-based code assistance is, it is excellent integrations and choice, so that people can decide what aligns with their own personal or organizational priorities when it comes to the details of the LLM itself and the details of how they interact with it.


That's a really exciting answer, because that's probably what developers love most: to hear that they can have whatever their preferred choice is. It means that behind the scenes they can be using the LLM of their choice. It could be the OpenAI API, or it could be Anthropic, or it could be a completely open-source implementation like Code Llama, something you could have running locally or use through a third-party provider.

Yeah, so, not to pick a favorite, but Continue. People bring different sets of priorities here, so I'm not saying the one I like best is the best one. So: Continue; Codeium, with an E in it; and Tabnine. These are ones that I know people have had good experiences with in Positron already.

Monitoring models in production

Yeah, it's really relevant, because that third piece of what MLOps means is monitoring. And when you're monitoring a machine learning model that's in production, it means different things depending on who you're talking to. Because it is a software artifact, you do have to measure things like uptime and latency; you need to measure those because models are software artifacts. But they are also statistical artifacts.

You could put a model into production and monitor its software characteristics, metrics like latency and uptime, and it could be doing great according to those metrics. But say something in the world changes, such that the actual underlying relationship between your inputs and your outputs changes over time; your statistical metrics could just be falling off a cliff. If you don't monitor your models, you don't know that's happening. Machine learning models in production are particularly prone to silent failure, because by failure we don't only mean their software characteristics; we also mean their statistical characteristics. The world has changed, and the model that you originally trained a month or a year ago no longer applies.
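One minimal sketch of that statistical monitoring, using only the standard library and entirely made-up numbers: track a model's accuracy over time windows and flag when it drops below an early baseline, even though uptime and latency would still look fine.

```python
from statistics import mean

# Weekly accuracy of a deployed classifier, measured once labels arrive.
# Uptime and latency could look perfect while these numbers fall off a cliff.
weekly_accuracy = [0.91, 0.90, 0.92, 0.89, 0.78, 0.74]

BASELINE_WEEKS = 3   # how many early weeks define "normal"
ALERT_DROP = 0.05    # how far below baseline counts as degradation

baseline = mean(weekly_accuracy[:BASELINE_WEEKS])
alerts = [
    (week, acc)
    for week, acc in enumerate(weekly_accuracy)
    if acc < baseline - ALERT_DROP
]
print(f"baseline={baseline:.3f}, alerts={alerts}")
```

The thresholds here are arbitrary; the point is that silent statistical failure is only visible if you measure statistical performance, not just software health.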


So those are the three pieces of MLOps that we focus on providing tooling for. We assume that you start with a model that's trained, and again, we think that people bring a variety of perspectives to how they get there, so we are fairly agnostic about how you got there. This is one of the real differentiators between the project I've worked on, which is called vetiver, and a project like MLflow: we come in at a different place and provide data science practitioners more flexibility in how they get started.

Again: version, deploy, monitor. That's what vetiver provides support for. Vetiver is a framework for MLOps in both Python and R, with very parallel implementations. What that means is that if you, as a practitioner, prefer to use tidymodels in R for one kind of statistical problem but PyTorch in Python for another, you can deploy them both with the same kind of tool, and you can provide your software engineer collaborator with an API that looks the same no matter how you trained the model. The other differentiator for vetiver, and this aligns so much with the other things we've said, is that it is built for a data science practitioner to use.

It is not built with a general software engineer in mind; it is built with a data science user in mind. Different orgs make different decisions about who is responsible for getting a model the last mile, getting it deployed. At really large organizations, there are whole teams whose entire job that is. But for many medium-sized organizations, or even small ones, the question is: who should do this? My hypothesis is that the best person to do it is the person who has the most domain knowledge about the model. If we give that person the tools so that they can hand off a little later in the process, not hand off some really raw thing, but actually be the one who packages up and deploys the model, then we end up with better machine learning practice overall, because they have the most knowledge about the model and how it works.
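The "same API no matter how you trained it" idea can be reduced to a tiny sketch: wrap differently-built models behind one predict() contract. This is not vetiver's actual implementation; the class, names, and toy models below are all invented to illustrate the design idea.

```python
from typing import Callable, Sequence

class DeployableModel:
    """Uniform wrapper: downstream consumers call predict() and never
    need to know which framework or language produced the model."""

    def __init__(self, name: str, predict_fn: Callable[[Sequence[float]], float]):
        self.name = name
        self._predict_fn = predict_fn

    def predict(self, row: Sequence[float]) -> float:
        return self._predict_fn(row)

# Two "models" trained in completely different ways share one interface.
linear = DeployableModel("linear", lambda row: 2.0 * row[0] + 1.0)
rule_based = DeployableModel("rules", lambda row: 10.0 if row[0] > 3 else 0.0)

for m in (linear, rule_based):
    print(m.name, m.predict([4.0]))
```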

Wow, yeah, thank you for that tour of vetiver. The compatibility, having support for both Python and R in one place, sounds great. There are still some things that I'd love to do in R; I primarily use Python these days, but there are things, particularly for me around creating visualizations with the ggplot2 library, that give me reasons to be using both together. So it's great that with vetiver I can be deploying both languages together and monitoring across all three key steps you outlined: version, deploy, and monitor.

tidytext and Jane Austen

So, moving on to our next topic area, it still is a tidy topic. With vetiver you were talking about tidymodels; this one is all about tidytext. You've written several books, several bestselling books in fact, and one of them, Text Mining with R: A Tidy Approach, features the tidytext natural language processing library. And interestingly, it also draws on Jane Austen's complete works via an R package that you wrote, janeaustenr. Yes, that's right. Some of your listeners are probably familiar with the hex stickers that the R community just loves to put on their laptops. I made a hex sticker for the janeaustenr library, and it's her signature, with colors and everything. I love it.

So, and this is a complete tangent from where I was going, but are you a big lover of Jane Austen?

I'm a super fan, I'll be honest. The story of tidytext is very intertwined with my story of getting into data science in general. When I was making this career transition from the kind of random stuff I was doing before into data science, about 10 years ago, I was thinking: okay, I have kind of a weird resume. How can I set myself up so that people in this, at the time, newish field of data science would believe that, yes, I'm someone who can do one of these jobs?

So I was working on what I thought of at the time as a blog, a way to show people the kinds of things I can work on. I envisioned myself sitting down with a hiring manager and talking through these projects, like a portfolio. As I was thinking about what would be compelling to people, I thought about the stuff I really know about and care about, things that are personal to me. My blog posts are all still up; if you go to the earliest ones, some of those use data from Utah, from Salt Lake City, because I would go to the public data portal and pull something about county differences in health.

So I kind of started out there, and then I very quickly started thinking: well, anyone who knows me knows I love Jane Austen. This has been one of the great loves of my life since I was about 12 years old. I should see what kind of analysis I could do with Jane Austen's work. Jane Austen's works are in the public domain, which means you can just get the text of them. And I started doing some initial exploration, and I was having a great time.

And then I was introduced, via the at-the-time thriving data science social media scene, to David Robinson, who you mentioned earlier. He lives in New York and is a big part of the NYR community, and David Robinson, I'll be honest, changed my life. Dave reached out about collaborating because he was excited about some of the stuff he saw me doing. He said, I think there's an opportunity here to build tidyverse-style tooling, but applied to text. I was quite new to the R community at the time and didn't have as much background as he did in what tidyverse-style tooling even means, but he had built things like broom and already had that kind of experience.

So we collaborated, and we actually met for the first time at what was called an unconference run by the organization rOpenSci, which is an amazing organization supporting open science through R (there's also pyOpenSci, a similar organization for Python). We met at that unconference in person, and he said, hey, do you want to build an R package to do text analysis from a tidyverse perspective? And I said, okay, sounds great, let's do it. We had something working by the end of three days, and that was the core of what tidytext became.

Then, over time, both Dave and I were really loving writing publicly and putting a lot of stuff out there to help people know how to use our tools. We were writing a lot about things we found really interesting. I did a lot of digging deeper into the Jane Austen material, comparing it to other books. This was around 2015, 2016; Dave did an analysis of Trump's tweets at the time that went super viral and used our tooling. So we had all these blog posts, and at one point we looked at each other and said: what if we wrote a book?

We started basically by taking the stuff we had already written, long-form documentation, package vignettes, blog posts, and putting it together: how do we reorganize, how do we make this flow from one thing to the next, how do we write an introduction, how do we wrap this thing up? We wrote that book really fast. I have since worked on other books, and the book that Dave and I wrote together came together fast, because it was just right. It was huge for me, huge for my career. I love that book. I love Jane Austen. That's a big part of my story, how that all came together.

As I said, we will be doing a book giveaway. You don't know this yet, Julia, but when we have authors on the show, we often do book giveaways, and so we will be doing one for your books. When this episode goes live, people listening to the audio version will hear it in my intro, because, you also wouldn't know this, Julia, after we finish recording, I use my notes from our conversation to create an intro and an outro. In that intro, I will have announced that people can get a physical copy of your book, with all the details on how to pull that off. Oh, delightful.

Topic modeling and NLP projects

So, yeah, that was a great story behind the development of your first book. And there are lots of recent projects as well that are super interesting. You've done topic modeling on Taylor Swift lyrics, and you've measured readability with the SMOG metric, which is a fun one; it's at juliasilge.com/blog/gobbledygook. Yeah, that was really fun to work on.

And I will say the Taylor Swift one was super fun to work on. I did it the same week that the concert film came out. I have not seen Taylor Swift live, sad to report, but I went with my kids to see the film the first week it was out, and then I did the topic modeling, so it felt very topical at the time. And actually the results are pretty interesting. The topic modeling looks at the textual content of the lyrics, and the real takeaway is that the pandemic-era albums, folklore and evermore, are very similar lyrically; the machine learning algorithm puts them together. The early albums all get put together as well, because they are thematically very similar. And then Reputation really stands out as separate; it is quite distinct lyrically from these other groups. So that was really fun to work on, and it certainly aligns with my own perception as a fan and someone who enjoys Taylor Swift's work. It was an interesting way to explore something else that I love. I love doing data science projects about things that I love; I get a lot of joy from that.

Nice, really cool to hear about that project as well. Another one that you did recently is with Stranger Things dialogue. For the popular Netflix series, you showcased the high-FREX (F-R-E-X, in all caps) and high-lift words of each season's dialogue. What do those terms mean, and what did you find?

Yeah, okay. So topic modeling is an unsupervised machine learning method for analyzing text. Text data is very interesting; I guess it's no shock that I find it very interesting, but when you think about text and natural language, there are a couple of really defining things about it. You see a lot of power laws: there are a few words we use a ton, like "the" and "and," and then there are a lot of words that are used only a few times. That means there's a really wide discrepancy in how many times you've observed the things you're counting, and so you have to use methods that allow you to learn something even given that. There are brute-force methods where you just make some cuts, but there are also much more sophisticated ways to learn which words are important and which topics are about what.
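That heavy-tailed word distribution is easy to see for yourself with a word count over any snippet of text. A minimal sketch, using the opening line of Pride and Prejudice (public domain) lowercased and unpunctuated:

```python
from collections import Counter

text = (
    "it is a truth universally acknowledged that a single man in "
    "possession of a good fortune must be in want of a wife"
)
counts = Counter(text.split())

# A handful of function words dominate; most words occur exactly once.
# This heavy-tailed shape is the power-law behavior described above.
most_common = counts.most_common(3)
singletons = sum(1 for c in counts.values() if c == 1)
print(most_common, singletons)
```

Even in one sentence, "a" appears four times while most words appear once; over a whole corpus this skew becomes extreme, which is why text methods must cope with very unequal counts.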

A topic model is a multi-level, hierarchical model. The mental model is: a topic is made of a mixture of words, and words can be in more than one topic; then a document is made of a mixture of topics. With the Taylor Swift example, I think I treated songs, or maybe lines, as the documents, and then you ask: which documents are made of which topics, and which topics are made of which words?
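That "mixture of mixtures" structure can be sketched directly. The topic names, words, and probabilities below are entirely made up; the point is only the arithmetic of how a document's word distribution falls out of its topic mixture:

```python
# A toy generative view of a topic model: topics are distributions over
# words, documents are mixtures of topics.
topics = {
    "romance": {"love": 0.6, "letter": 0.3, "monster": 0.1},
    "horror":  {"love": 0.1, "letter": 0.1, "monster": 0.8},
}

# This document is 75% romance, 25% horror.
doc_mixture = {"romance": 0.75, "horror": 0.25}

# Expected word distribution for the document = mixture-weighted
# combination of its topics' word distributions.
vocab = ["love", "letter", "monster"]
doc_words = {
    w: sum(doc_mixture[t] * topics[t][w] for t in topics) for w in vocab
}
print(doc_words)
```

Fitting a topic model runs this logic in reverse: from observed word counts, infer the topic-word distributions and each document's mixture.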

With Stranger Things, you end up with the most probable words per topic, but those are often the same across a lot of topics. I think I treated lines of dialogue as the documents there. If you think about a group of people talking, the most common words are just the common words people use when speaking; it's different from the words you'd see reading prose. It's not so much "the" and "and," but more words like "you" and the other words you use as you talk. Spoken language is different from written language.

So those are the most probable words, and if you look, all the topics have the same most probable words. That's normal and okay. These other metrics, like lift and FREX, get at which words are special or unique to different topics.

They're different statistics. FREX combines high frequency and exclusivity, so it picks out words that are used a lot, that's the high frequency, but that also have high exclusivity, meaning you see them in some topics but not others. High lift picks out words that appear in a topic much more often than you'd expect given how often they are used overall.
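Lift has a simple arithmetic core: the probability of a word within a topic divided by its probability in the whole corpus. A sketch with invented counts (the Stranger Things-flavored words are just for color):

```python
# Lift for a word in a topic: P(word | topic) / P(word overall).
# Toy counts, invented for illustration.
topic_counts = {"demogorgon": 8, "the": 40, "dart": 12}     # one topic
corpus_counts = {"demogorgon": 10, "the": 400, "dart": 14}  # whole corpus

topic_total = sum(topic_counts.values())
corpus_total = sum(corpus_counts.values())

lift = {
    w: (topic_counts[w] / topic_total) / (corpus_counts[w] / corpus_total)
    for w in topic_counts
}
# "the" is probable everywhere, so its lift is low; topic-specific words
# like "dart" score high because they barely appear outside this topic.
for w, score in sorted(lift.items(), key=lambda kv: -kv[1]):
    print(f"{w}: {score:.2f}")
```

Note this is the plain ratio definition; topic-modeling packages compute it from the fitted model's probabilities rather than raw counts.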

One thing I remember from that Stranger Things analysis: if you look across the seasons at the high-FREX, high-lift words, you see words like the name of that monster they called Dart, the little funny monster that got lost in the house. That one pops up only in its season because it's high lift, high frequency, high exclusivity. And there were things that only happened in the first season; they talked a lot more about the Upside Down, for example, so those words popped up in the early seasons because of their high exclusivity there.

Topic models in the era of LLMs

So, topic models are great. They are complicated models, and it's interesting to think about when they are useful in the era of LLM-based tools. I think they are most useful when you have medium-sized text data; by that I mean something like 5,000 to 10,000 documents. A document in a real application is often something like a survey response. So you have on the order of 5,000 to 10,000 of those, and you're interested in finding out: what are the topics? What are these about?

You can, of course, look to LLM-based tools for summarization, which gives you another way of doing it. But I tend to think topic models are most useful in situations where you have compliance reasons that prevent you from using LLM-based tools, where you have medium-sized data, or where you need higher statistical rigor than "I threw it into an LLM and got something out."

So I think, even in the era of LLM-based text tools, the need for doing EDA on text never goes away, and that's exactly what the tidytext package is all about: doing EDA for text. I would say it's a bad idea to just throw the text you're analyzing into an LLM-based tool for summarization without also doing EDA first.


It's also interesting to think about when you would use which kinds of tools, and what your needs are. These are all tools, and adding more tools is great, but we have to know when it's appropriate to apply them.

Yeah, those are really good points you get to at the end, around why you would use an LLM versus topic modeling and these more traditional natural language processing techniques. Use the traditional methods if you want higher statistical rigor, or if you want to be doing exploratory data analysis, which maybe you should be doing before you use an LLM anyway. No, I would argue yes, you should. And for midsize data, it's also probably going to be a lot less expensive.

Oh, absolutely, 100%. Because it is true: LLM-based tools are expensive to use, either literally per API call or in the expertise of running a local one.

Stop words

So, one of the questions brought up in our research by our researcher, Serg Masís, that I was really interested in asking you about NLP, is that you have pointed out how practitioners, including myself, tend to use pre-made lists of stop words before they start doing NLP analysis. So maybe quickly give us your definition of stop words, and then tell us why I should stop using a pre-made list.

So, stop words are lists of words that people consider unimportant and feel they can take out, words like "the," "and," and "of." In English, a conservative stop word list would be on the order of 100 words, and a more aggressive one on the order of 1,000. And the way these lists were made is that they're old; they typically come from the mid-20th century. They were made by taking, for the time, huge corpora of language, counting up words, looking at the top 1,000 or so, and then a person decided where to make the cutoff, and a person decided which words should or should not be kept in.

There are so many problems with stop word lists. For one, some of them literally have typos in them. You might ask, is that good or bad? If someone had a typo like "adn" for "and," maybe I do want to take that out too. But it's still strange that there are words with typos in these lists.

Another thing that happens, because these lists were created from corpora of language, many books put together, is that you actually end up with evidence of gender bias baked into the stop word lists. Say we take a huge number of books: there are more uses in those books of "he" than "she," more uses of "his" than "her." And some of those stop word lists, if you check them against a list of all the English pronouns, have all of the masculine pronouns but only about three-quarters of the feminine ones, just because of where the frequency cutoff fell.
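That asymmetry is easy to audit programmatically. A minimal sketch: the truncated stop list below is invented to mimic the frequency-cutoff artifact Julia describes (all masculine pronouns made the cut, but "hers" fell below it); a real audit would load an actual published list.

```python
# Checking a stop word list for pronoun coverage asymmetry.
stop_words = {"the", "and", "of", "he", "him", "his", "she", "her"}

masculine = {"he", "him", "his"}
feminine = {"she", "her", "hers"}

missing_masc = masculine - stop_words
missing_fem = feminine - stop_words
print("missing masculine:", missing_masc)
print("missing feminine:", missing_fem)
```

A few lines of set arithmetic like this, run against the list you are about to use, surfaces exactly the kind of bias that otherwise goes unnoticed.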

So even though making a list of words is about the simplest thing you can do, all of our challenges around data analysis, data science, and data sources show up even in this dead-simple thing.


So I now avoid using lists of stop words when I do topic models, because those words will always end up being the most probable words, and like we just talked about, the most probable words are never very interesting. You need to look at these other statistics that give you a better sense of what topics are really about.

When I do supervised machine learning, I often leave stop words in as well, because it turns out they're actually informative: the way documents use even those boring words can be predictive. If you're doing classification, how a document uses those boring words can help. If I'm doing EDA, I sometimes do take them out, for example if I'm trying to show the most common words and just want to drop the boring ones.

But I often supplement that with EDA approaches that reveal differences across groups without depending on removing stop words. One example is looking at the log odds of words: what are the highest log odds words for each group? I have a package for this called tidylo, for tidy log odds. It's an interesting approach any time you're looking at differences in counts across groups; it doesn't have to be language, but applying it to language works really well.
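The idea behind that comparison can be sketched with a plain smoothed log odds ratio over toy counts. Note this is the simple version, not the weighted variant that tidylo actually implements (which also accounts for sampling variability); the words and counts are invented.

```python
import math
from collections import Counter

# Word counts in two groups of documents (toy data).
group_a = Counter({"whale": 30, "sea": 20, "love": 2})
group_b = Counter({"whale": 1, "sea": 5, "love": 40})

def log_odds_ratio(word: str) -> float:
    """Smoothed log odds ratio of a word between group A and group B.

    Positive = more characteristic of A; negative = more of B.
    """
    a, b = group_a[word] + 1, group_b[word] + 1  # +1 smoothing
    total_a = sum(group_a.values()) + 1
    total_b = sum(group_b.values()) + 1
    return math.log((a / total_a) / (b / total_b))

for word in ["whale", "sea", "love"]:
    print(word, round(log_odds_ratio(word), 2))
```

Because the statistic contrasts groups against each other, ubiquitous words score near zero on their own, so no stop word removal is needed to surface distinctive vocabulary.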

I still use things like TF-IDF as an exploratory tool, which I bet many people have heard of, and I've written about what it means so people can dig into that more. So that's my pitch. My pitch is that stop word lists suffer from the same problems almost any data science process suffers from, even though they are so simple. If you're doing unsupervised or supervised machine learning, you probably want those words. And if you're doing EDA, there are alternative approaches that give you better answers.
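For reference, TF-IDF itself is only a few lines. This sketch uses toy documents invented for illustration; in the tidytext ecosystem the equivalent operates on tidy data frames:

```python
import math

# Three toy "documents" (invented) as whitespace-separated strings
docs = [
    "the whale the sea the ship",
    "the farm the horse",
    "the whale returns",
]

def tf_idf(term, doc, corpus):
    """Term frequency times inverse document frequency for one document."""
    words = doc.split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / df)
    return tf * idf

# "the" appears in every document, so its IDF is log(1) = 0: it is
# automatically down-weighted to nothing, without any stop word list.
score_the = tf_idf("the", docs[0], docs)
# "whale" appears in only two of three documents, so it keeps weight.
score_whale = tf_idf("whale", docs[0], docs)
```

This is why TF-IDF works as a stop-word-free exploratory tool: ubiquitous words wash out of the ranking by construction.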

Very cool answer. Something that I have been teaching for years, an issue I was aware of with stop word lists, is that for your particular application you at least need to be looking through the list of stop words. You shouldn't just be using it blindly. I mean, you've made a good argument for not using them at all, but something I have been saying for years is that you need to know what the stop words in there are. So, for example, if you're doing sentiment analysis and one of your stop words is "not," you're going to be pulling out the word "not," and that's one of the most critical words in figuring out the sentiment of a document.

Totally. Totally. My first book, the one I wrote with Dave, has an example of this. It's back to Jane Austen. It turns out one of the most commonly used sentiment lexicons has the word "miss" on it as a negative word. So this is slightly different: it's not a stop word list but a sentiment lexicon, but very similar constraints are at play.

So "miss" is on the list as a negative word, as in "I miss you," or, I don't know, "you missed quarterly earnings." But if you look at Jane Austen novels and do sentiment analysis using one of these lexicons, it shows up: the word driving negative sentiment, in a top-ten way, by a lot, is the word "miss." Of course, in Jane Austen's works, that's how everyone is referred to. It's all "Miss Bennet," you know. Everyone is "Miss." So those uses are not negative at all; those are neutral words.
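The "miss" problem is easy to reproduce with a toy lexicon counter. Everything below, the lexicon and the sentence, is invented to mimic the Austen example, not taken from any real lexicon:

```python
# A hypothetical sentiment lexicon that, like the one Julia describes,
# marks "miss" (the verb) as negative
negative_lexicon = {"miss", "abominable", "wretched"}

# A cheerful, Austen-flavored sentence where "miss" is an honorific
text = "miss bennet smiled and miss bingley smiled and all was agreeable"
tokens = text.split()

# Naive lexicon matching cannot tell the honorific from the verb:
negative_hits = [t for t in tokens if t in negative_lexicon]
# Both hits are the title "Miss," so a bag-of-words sentiment count
# reports negativity in a sentence with none.
```

This is the training/usage mismatch in miniature: the lexicon encoded one sense of a word, and the text uses another.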

And actually, the more sophisticated sentiment analysis tools now do better, but they are not immune from this problem, because it's basically fancier counting and fancier linear algebra. They do better, but they are not immune from this exact problem: words in their training data were used a certain way, and if a word is used in a different way in your data, there's a mismatch between the training data and the data you're applying the tool to. That's one of the real downsides of using pre-trained models: either they're too general, or the language is used in a different way. It's a similar problem.

Awesome. Great, insightful answer there. We got a ton from you. My stop word question has now been answered, and I will stop using them; you made a clear case. You know, built-in gender bias, these kinds of weird statistical phenomena around where exactly the cutoff was and what data the lists were built on. Using something like TF-IDF is something I've used a lot historically, but now you also mentioned tidylo. Yeah, and there are implementations of that in Python as well. If you look up the tidylo documentation, it's based on a paper whose method was originally implemented in Python, so it's available in Python as well.

Tidy principles and tidy models

Really quickly, and we are running out of recording time here, so hopefully you don't mind if we run over a few minutes. We've talked a lot about tidy modeling throughout this episode, but we haven't defined it. What are the tidy principles, and why should they matter to our listeners?

That's great. Okay, so I'm going to speak about it at a slightly higher level: what are tidy data principles? I'm sure Hadley would have been able to speak to this, and maybe he did. The tidy data principles are a set of ways of dealing with data. You want to have one observation per row. So let's say you were measuring the temperature of something over time, and you had five sensors. Every time you make a new observation, you make a new row, not a new column. Your data ends up long and skinny rather than super wide. So: one observation per row, and one table per kind of observation.

Actually, for listeners who come from the world of databases, it is much the same as the normal forms of data for databases. So there's this structure, this way of making data tidy, and the idea is that when you have tools for tidy data, you can use standardized tools instead of writing a lot of bespoke code every single time.
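The sensor example above can be sketched in a few lines. This is a hypothetical plain-Python version of the wide-to-long reshaping that tidyr's pivot_longer performs in R, with invented readings:

```python
# Wide form: one row per time point, one COLUMN per sensor (invented data)
wide = [
    {"time": 1, "sensor_a": 20.1, "sensor_b": 19.8},
    {"time": 2, "sensor_a": 20.4, "sensor_b": 19.9},
]

# Tidy (long) form: one ROW per observation, i.e. per (time, sensor) pair
tidy = [
    {"time": row["time"], "sensor": key, "temp": value}
    for row in wide
    for key, value in row.items()
    if key != "time"
]
# Adding a sixth sensor now adds rows, not columns, so any downstream
# code that groups by the "sensor" column keeps working unchanged.
```

That stability under new observations is the payoff of "one observation per row": standardized tools instead of bespoke code for each new data shape.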

The tidyverse is a collection of R packages that embraces tidy data principles. That means we have our data in this form and we give you tools, including tools for converting back and forth, getting into tidy data and out of tidy data. Then if you want to do visualization, or if you want to build models, you can get your data into tidy form and take those next steps.

The tidyverse set of R packages also has some values around reusable data structures: we don't want to make a new kind of data structure for every single problem. Instead, we want a standard set of data structures, and tools to get your data into that structure, so you can have more modular code. So: modularity, and reusable structures.

I have not myself worked much on tidyverse packages, but I did work on the tidymodels packages. That is a set of packages that applies these tidyverse priorities and principles: one observation per row, reusable code, modular code. So you're not saying, oh, I need to do something slightly different, I'm going to have to write it from scratch. We give you pieces of modular code so you can put them together in the way that you need to.

So tidymodels is a framework for machine learning and modeling in R. The two main things about it are, first, that it adopts these tidyverse principles to give someone a fluent way to get from EDA into machine learning. And second, it has some really great priorities around statistical guardrails: keeping you from doing the wrong things and keeping you using good practices around data. Data hygiene is a big one, right? I think most of us know we need to split our data into training and testing sets.

But it is pretty wild that even today, with machine learning knowledge being as broad as it is, people still trip up on data leakage kinds of issues. Tidymodels is built, first and foremost, around the question: can we keep you from leaking your data when you don't mean to? I think the biggest piece of that is real, explicit adoption of data pre-processing and feature engineering as part of your model. You can't think about that as something you do separately. You cannot do it before you split your data, and you cannot do it before you resample. If you're making cross-validation folds or resampling folds, your pre-processing steps, your feature engineering steps, have to happen inside the loop that is doing the resampling. So that is tidymodels.
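The guardrail Julia describes, fitting pre-processing inside each resampling fold, can be sketched like this. The data and folds below are invented, and tidymodels handles this bookkeeping for you in R; this is just the bare pattern:

```python
import statistics

# Invented data and two (train_indices, test_indices) resampling folds
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
folds = [([0, 1, 2, 3], [4, 5]), ([2, 3, 4, 5], [0, 1])]

leak_free_tests = []
for train_idx, test_idx in folds:
    train = [data[i] for i in train_idx]
    # Fit the pre-processing step (here, centering/scaling) on the
    # training portion of THIS fold only, never on the full data set...
    mu, sigma = statistics.mean(train), statistics.stdev(train)
    # ...then apply those training-fold statistics to the held-out fold.
    leak_free_tests.append([(data[i] - mu) / sigma for i in test_idx])
# The leaky alternative, computing mu and sigma once on all of `data`
# before splitting, would let the held-out points influence the scaler.
```

The contrast with the leaky version is the whole point: scaling on all the data first quietly feeds test-set information into every training fold.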

I was really excited to be working on that team for roughly the first two years I was at RStudio, and then Posit, and then I moved a little later to MLOps work. And for the past year, I've been working on Positron.

You can probably tell I'm not someone who has one life passion. I am a little bit of a generalist; I like to learn. If I would say there's one thing that puts all of it together, it's that I really care about people's real workflows. How are they approaching their real problems? How can we think about the problem they're solving and the tools they're using at a systems level? If I think about what connects text analysis to MLOps to building a freaking IDE, what connects those for me is that I really like applied, practical work. I really like thinking about how people approach their tasks in a very concrete, nitty-gritty kind of way.

Awesome. It's nice to get that insight from your background as well, and how everything ties together. But most important in that recap was getting that overview of the tidyverse and tidymodels. We certainly talked about the tidyverse in Hadley's most recent episode, number 779, but I don't remember for sure if we talked about tidymodels very much.

Audience questions

A week prior to you coming on the show, I posted on social media, on my LinkedIn and Twitter accounts (I still call it Twitter), that you would be coming on the show. There was a huge amount of engagement: hundreds upon hundreds of reactions, tens of thousands of impressions. And we did have some interesting questions as well, so I've got two for you.

The first one is from Luke Morris, who is a healthcare data scientist at the Stanford University School of Medicine. Luke says that you are awesome and that your book with Emil, and I'm going to butcher his last name, Hvitfeldt, the book Supervised Machine Learning for Text Analysis in R, was his North Star on his graduate capstone project. His question: as a major #TidyTuesday contributor, what have been the biggest "oh, wow" moments you've had digging through these weekly datasets?

Okay, I love this question, because something comes directly to mind for me. There was a Tidy Tuesday dataset back in 2020 about the voyages of enslaved Africans, a dataset covering the last several decades of the transatlantic slave trade. It was a really big thing during 2020, and people said, let's explore this, let's learn more about this.

And it showed something I didn't know going in: it turns out there is an exact example of Simpson's paradox in there, classic intro-stats Simpson's paradox. Looking at mean year of arrival and age, it looked like they were going up together, that over time, these enslaved Africans arriving in the Western Hemisphere were older when they arrived. But it turned out to be Simpson's paradox, because in the earlier years there were proportionally more women than men, and boys were being brought earlier, that kind of thing.

First of all, it was a pretty heavy topic, right? A heavy, deep topic, where you're like, this is rough, actually, looking at this data. And then I went through this example, this modeling, and no one had told me about it ahead of time; in fact, I don't think anyone had looked at this particularly. You make a first initial plot and you're like, oh, look, people are getting older over time, but it actually turns out the opposite was true, and it was just Simpson's paradox. That really sticks with me, partly because it was a heavy topic, and because I literally discovered the Simpson's paradox example as I was going. I'm like: it's real, everyone. Simpson's paradox is real.

Yeah. And Simpson's paradox being, if I remember correctly, that there appears to be a correlation one way between two factors, but when you condition on some other factor, in this case sex, the sign of the effect flips, because it turns out that factor is important. Here you had more women coming over in the earlier years, and the women tended to be younger. Yes. So in fact, controlling for sex, the enslaved people were getting younger. That's right. That's exactly right.
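Simpson's paradox is easy to demonstrate numerically. The records below are entirely invented to mirror the shape of the effect described (they are not the actual Tidy Tuesday data): the pooled mean age rises over time, while the mean within each sex falls:

```python
import statistics

# Invented (year, sex, age) records: in 1800 the group is mostly younger
# females; by 1850 it is mostly older males, but BOTH sexes got younger.
records = (
    [(1800, "male", 22)] * 2 + [(1800, "female", 16)] * 8 +
    [(1850, "male", 20)] * 8 + [(1850, "female", 14)] * 2
)

def mean_age(rows):
    return statistics.mean(age for _, _, age in rows)

# Pooled means suggest ages are RISING over time (17.2 -> 18.8)...
pooled_1800 = mean_age([r for r in records if r[0] == 1800])
pooled_1850 = mean_age([r for r in records if r[0] == 1850])

# ...but conditioning on sex shows each group's mean age FALLING
# (males 22 -> 20, females 16 -> 14): the group mix shifted, not the ages.
male_1800 = mean_age([r for r in records if r[0] == 1800 and r[1] == "male"])
male_1850 = mean_age([r for r in records if r[0] == 1850 and r[1] == "male"])
```

The flip happens purely because the proportion of the older group grows over time, which is exactly the mechanism Julia found in the real dataset.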

This one is from Otto Hansen, a data engineer and data scientist based in the Netherlands. Otto says he is looking forward to this episode. His question: how do you reflect on the rise of AI as an aid for teaching future generations of data science students? Can AI, in your view, Julia, be a force for good in teaching students how to code, or do you think AI poses a threat to properly teaching and mentoring future data science students and leaders?

Yeah, this is a really great question. I'll talk about my use of LLM-based tools specifically, because "AI" is one of those words where, what on earth does that mean? When I have used LLM-based tools for code assistance, I feel like they have overall helped me. And I think that's partly because of when they came into my life, and partly because of how I have used them.

So if I use, say, Copilot or one of the alternatives to Copilot, it is most helpful to me when I am trying to do something in a language I am a little less familiar with. In terms of my own background, these days I write a lot of TypeScript, a little bit of Rust, a little bit of Python, a medium amount of R, and a lot of YAML.

But anyway, a big proportion of those languages are not languages I have used my whole life, or even for ten years; a good proportion of them I've used for less than ten years. So if I know what I want to do and can't remember the syntax for how to do it in a certain language, these tools have actually been quite helpful for me.

If I am writing tidyverse EDA code in R, I am faster than the prompts, and I feel like the tool only gets in my way, because if I had to say what I am most productive at, where my expertise is, that's probably what I am fastest at writing. I think that's partly because those tools are so well designed; the tidyverse is such a good fit for how I think about data and analyzing data. So that's an example where I don't actually find it that helpful in my strongest competency. I find it quite helpful where I'm a little bit slower and would maybe have to look something up.

Now let's talk about teaching. I don't know that it would be that helpful in the long run for a learner to use one of these while learning their first language, because you maybe don't yet have the ability to evaluate what it produces. It spits something out and you're like, is this good or bad? I don't know. So I'm not sure it's so useful for someone trying to gain competence with their first language.

I think there are some real questions here: what's an effective or good use in a learning environment, and what do we want someone to learn? Maybe I can envision someone teaching a class saying, hey, we're going to learn to code in Python, and you all have one of these tools installed. Picture a very intro class: write a comment that says "write a for loop that does blah blah blah," then look at what comes out, read it, predict what will happen when you run it, and then run it. So I think it's possible to use these in teaching environments in ways that are going to be good. Because of my own background, it's a little hard for me to envision them being really useful for a learner of a first language, but I see tons of people around me finding these tools really useful for additional languages.

Now, that was all about code, right? What about other uses? I have not had great experiences using these tools for other things. If I try to have one write something for me, I hate the voice it's written in. I'm like, oh, I hate it; that doesn't sound like me at all. It sounds like this sort of gross machine voice they all have when they're generating prose, and I hate it. No way would I send an email that says that; that's ridiculous. So I've had less luck with text generation outside of code, and pretty good experiences with text generation in code.

I think it will take some really thoughtful education people to work out how we use these in helpful ways that build the real skills. The real skill when it comes to code is not writing syntax, right? That's almost what gets in the way. Can these tools help us solve that? Maybe, maybe.

So overall, I'd put myself in the camp of: I'm not morally opposed to these tools. Although, of course, there are all kinds of questions around the data used to train them and the licensing; there are some complicated issues. But I would not say I am opposed to them on principle. I have found uses for them in some ways. I'm kind of skeptical about how they are used broadly, and of course there's the fact that they can generate unlimited amounts of text that then gets put on the internet. Are we heading into an information environment where we cannot find good information because there is literally too much generated text of poor quality? Very interesting questions. I'm not someone who feels like I have the answers, but I certainly have opinions based on my own experiences of playing around with them.

Yeah, you touched on a lot of different topics there; we could probably spend a whole episode digging into those. But it's interesting to hear that, at least for the software development education use case, you do see a lot of potential. So do I. For me personally, these kinds of in-the-flow tools, like Gemini in Google Colab, have been amazing for developing software way more quickly. Instead of having to search the web for some Stack Overflow answer that isn't exactly the same situation, with a different version and obviously different variable names, so I have to figure things out, now it's often just a click of a button to get something to work.

I'm constantly blown away. It is interesting, the tone thing you mentioned with natural language generation. That is something I think is getting better. In fact, just last night at the time of recording, Natalie, who does operations management for our podcast, said to me: wow, something just suddenly changed with Google Gemini. Gemini is her preferred LLM, and she uses it specifically for emails; I think she provides context, ideally emails that she's written in the past, because she wants her tone to be mimicked. She said it really improved in the last few days. It will be really interesting to see how these tools develop moving forward.

One question I always ask our guests is whether you have a book recommendation for us. Obviously we know about your books, but maybe you have someone else's book to recommend. Based on the discussion we just had, I'm going to pull one off my shelf here, because I think it's so relevant.

It's called The Programmer's Brain, and it's by Felienne Hermans. It's all about how people program: how can you learn to be a better programmer, how can you increase your own skill in programming? It is super interesting to read. It came out before the rise of LLM-based tools, so it does not address those, but it is extremely interesting to read now through the lens of these tools being available.

What are the real skills when it comes to being someone who writes code, the real skills people are actually using? And how can you make yourself more effective? So that is my recommendation, because of the discussion we just had about where these tools are helpful. I think if you or your listeners read it with that mindset, you'll be like: ah, interesting, this is where a tool can come into the process and really make me better, or, oh, this is why I'm really frustrated when I try to use it here, because it's not supporting that kind of work.

How to follow Julia

Excellent recommendation. I love it. And the final question: how should people follow you after this episode, Julia? Yeah, I would say YouTube is a great place to find me, and my blog; I post there. In terms of social media, the places I'm hanging out these days are LinkedIn, Bluesky, and Mastodon.

Awesome. Thank you, Julia. This has been an amazing episode, and you've been really generous with your time; we've gone well over the scheduled recording slot. Thank you so much. We really appreciate all the insights you shared today. And maybe we can check in in a couple of years and see how your projects are coming along, how Positron is coming along. That would be so exciting. Thank you so much for having me.

Episode recap

Boom. So much rich, actionable detail in today's episode with Julia Silge. She filled us in on how the polyglot Positron IDE is designed from the ground up to be ideal for people who do exploratory data analysis, including updated variable panes and key consideration given to data visualization. She told us how Continue, Tabnine, and Codeium are her favorite LLM-based tools for code generation.

She told us how traditional NLP should be used instead of LLMs when we need higher statistical rigor, when carrying out EDA, or when working with larger datasets, and how we should use TF-IDF or her tidylo library in lieu of stop word lists because of issues like arbitrary frequency cutoffs and demographic biases. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Julia's social media profiles, as well as my own, at superdatascience.com/817.

Thanks to everyone on the SuperDataScience podcast team: our podcast manager Ivana Zibert, media editor Mario Pombo, operations manager Natalie Ziajski, researcher Serg Masís, our writers Dr. Zara Karschay and Sylvia Ogweng, and of course our founder, Kirill Eremenko. Thanks to all of them for producing another magnificent episode for us today and for enabling that super team to create this free podcast for you.

We're deeply grateful to our sponsors. You can support this show by checking out our sponsors' links, which you can find in the show notes. And if you yourself are ever interested in sponsoring an episode, you can get the details on how by making your way to jonkrohn.com/podcast. Otherwise, please share, review, subscribe, and so on. But most importantly, I just hope you'll keep on tuning in. I'm so grateful to have you listening, and I hope I can continue to make episodes you love for years and years to come. Till next time, keep on rocking it out there, and I'm looking forward to enjoying another round of the SuperDataScience podcast with you very soon.