Julia Silge: Part 2 — Glue work, licensing, and open source in the age of LLMs

Transcript

This transcript was generated automatically and may contain errors.

Welcome to The Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning, digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field.

This episode is part two of a conversation with Julia Silge, data science leader and engineering manager at Posit.

Synthesizing community signals and working in the open

Yeah, it's so interesting to hear at the edge of experience as being your focus and even that inside of Posit and inside of companies being a big, from what I heard you like a challenge across teams. Where does that thinking show up for you usually? Is that in like a notebook or document? Where do you like to kind of refine your thoughts or shape them?

Yeah, yeah, yeah, yeah. I think I work in a combination of GitHub issues. If things can be broken down into, say, like a fairly concrete piece of work, then I'm like, okay, I'm somewhere there. Another sort of way of working that I and people on my team work in is writing documents, whether that's like feature specs that are very technical or whether that's more like a description of the kind of behavior we expect to be a little more like high level. I think that another way that this comes out is actually in like synthesizing bits of smaller information, whether that is, you know, we have as a public facing product where that has quite a number of users out there, right? So there's like stuff comes in on GitHub, of course. So, you know, stuff comes in. You know, we watch Stack Overflow. We watch like various discourse type boards and whatnot to like kind of be able to synthesize.

And often that involves like a kind of like a bottom up kind of organizational, like information organizational kind of thing of like how can we like how successfully can we understand how these things are related? Because they 100% are, right? Like very few, it's rare when something comes up and we're like, well, that's like nothing else we've seen before. Like we either these are thematically related or like they're different ways of the same kind of problem. So I would say in a practical sense, it's like, like I'm writing in different places depending on who the audience is or like how much and then in terms of organizationally thinking, I do a fair amount of like those kinds of like information organization kind of like activities.

Do you have like a preferred like tool or system that you use? Like are you like a mind mapper or like really into Obsidian or?

I feel like because of where our team stuff is, I end up in GitHub Projects a lot, which are pretty good for things that are captured as really concrete type things. For things that are a little bit less concrete, I think I am mostly in documents, and I am mostly in documents that are shareable, because Obsidian is great for your internal stuff.

You're like just talking around the fact that you did everything in Google Docs and you don't want to admit it.

I don't want to admit it. I spent all day in Google Docs and I just don't want to admit it.

No, I mean, the thing is Obsidian, I don't know. You don't have any like shareable way of using Obsidian, do you? It has to be another. No. Yeah, yeah. And so much I mean, so much of what like I need to get feedback from people on it or like so the fact that I so much of what I do involves going to other people, either for their input or for them to agree, you know, like it's really the collaboration.

So speaking of that, we do actually use Quarto as well for a good chunk of this, because we have really extensive internal documentation for process, like iterating on plans, that is less like a situation that might have been a GitHub issue and also less something that might have been a feature spec that's kind of throwaway, like write it one time, do it, you know, don't come back to it. But stuff that we maintain over time about process and about how things work, that we do actually use Quarto for.

And then the collaboration looks like Git looks like, you know, like those kinds of modes.

Yeah, I'm really looking forward to the day where we figure out how to like have shareable commentable Quarto docs.

That's the dream. That's a dream.

Yeah, I've become a pretty heavy Obsidian user, mainly as a capture tool, because stuff will come up in a Slack channel that I'm a part of or over email. And I find that if I have too many unread tasks in different places, eventually I'll just forget things and I'll drop balls. So I find capturing things in Obsidian is like my personal don't-forget-this list of things to not forget, and to get things out of my inbox.

Glue work and the value of boundaries

I have to say, too, I feel like inside Posit we find you wrangling a lot of kind of like meetings or, I want to say, like edgy, not edgy. OK, edgy is the wrong term, like at-the-edge-of-experience type things. If I could give one example, it's the Python open source meeting. You're really a facilitator of the Python open source meeting, which I've always found pretty inspiring. It seems to me like there was a gap, like we need someone to fill this.

And there are a lot of people floating around, but you really kind of took it and ran.

So I think there's different ways to operate as a manager, and some people who are really successful as managers have a really strong sense of ownership, like this is my team. I am going to make them successful. If this thing is successful, I am successful. And that is great. But that's not quite the way I operate.

I think as like a manager or a leader, I think there is so much space for a like slam dunk experience. There is so much space for massively leveling up how people perceive your tools when attention is spent at the boundaries of things.

And it is often somewhat thankless work, in the sense that you don't always get as much credit for stuff that ends up gluey, you know. It's also like pain: people can't acknowledge pain that they didn't experience. Yeah. Right.

Yeah. I think that's like you're making pain go away, but it's because people never experienced it. Like it's hard for them to appreciate.

Yeah, yeah, yeah, yeah. Super. I think that sort of work is like so incredibly important. It's important. And also, I feel like there's often a lot of space to really, you know, you're it's not like you're trying to eke out some tiny little bit of performance improvement, making something tiny. It's like you can usually massively level up some sort of painful experience by paying attention at some boundary.

And so I do think, you know, internally at Posit, I do end up doing things like maintaining the Quarto extension for a while. I mean, you know, stuff like that, because it's like, well, no one else is quite doing it and it's important to this team and this team. And I think it's because I have seen, I'm going to say, really outsize impact from attention at those spots that I think have a big-picture kind of effect, I guess.

It is. It's sort of fascinating to me just how much like kind of team identity starts to play. Like I think like in general, I think it's a good thing for a team to have the strong sense of like, this is what we do. But it can mean that there's these things that like are really important that kind of fall between the gaps of every team. And it's surprisingly valuable to me, like it surprises me, like I don't think it's really surprising. It surprises me. Like one of the very valuable things you can do is convince people that this is their problem to solve. Like this is a real problem that exists. Maybe someone else should fix it, but they're not going to. Like you could have a really big impact here by like kind of going outside of your comfort zone, like tackling this problem that crosses some boundary you don't normally cross that like that can be so, so valuable.

They took data that I would say morally belongs to all of us. Right. Like morally, ethically, that's our data. That's our data as like human beings.

So I think it's super interesting to think about where are we today and what are going to be the what are going to be the steps that will keep, for example, our ability to get answers to our coding questions, you know, like like what will what will that look like going forward? I'm not I mean, I'm not a big doom and gloom person, you know, I'm not saying like the world's ending and anything like that. But I think it's some real questions because what got us to here is not what we are doing now because the ecosystem has substantively changed. The world has changed. The world has changed in terms of what like what like where is that data coming from or like like where where are the questions and answers? Like where is it such that it can be useful to the community as a whole?

Licensing in the age of LLMs

Do you think it's changed how you think about like because I think Stack Overflow is all like Creative Commons license. So like kind of legally, all the LLMs are, you know, providers are fine. Does it kind of change how you think about that license? Because I have to say for me, like for the longest time, that just seemed like absolutely the right thing to do. And now I'm like, I don't know, that's just giving all the stuff away for free is kind of. I don't know.

Well, there was an inflection point where I think the license for content posted on Stack Overflow changed. For content posted prior to that date, it became this massive IP contamination issue where developers would copy and paste stuff from Stack Overflow into their company's proprietary code bases. And then, you know, if you go to sell a software product, that code would turn up during IP legal due diligence, and they'd say, oh, you used code from Stack Overflow, and maybe it could be just a small snippet. But either you have to figure out how to replace that code with your own IP or you have to go and find the original author and ask them for permission to use that code. I think in the meantime there was a change to make it more lenient. But still, any content that was posted prior to that date has the old, more restrictive license. But it's interesting. It's very interesting. The licensing around these kinds of issues is very, very interesting.

And I think there are questions around whether these licenses have gotten us what we thought they were going to get us, because all these open source licenses, you know, came up and have been hashed out in a time that was technologically different from what we have now in terms of the constraints. And people talk about, OK, can we iterate on these? Can we think about these differently? Something that there's still discussion about is iterations on these licenses that make kind of moral or ethical claims. Right. Like, I want to exclude certain kinds of uses. There are these licenses that are mostly open source-ish, but they exclude some uses, like maybe defense or something like that. So there's that sort of category of iterating.

There is the category of license that's mostly open source-ish, but puts some restraints on, say, how you can platform it. So in full disclosure, Positron has a license like that. Positron has a license that is not a true OSI-approved license. It's the Elastic License. And the reason why we did that was because our experience as a company working on open source software has shown us huge benefit. We're huge believers. We're committed to open source software. For the pieces of software that you can make money on by platforming them, we ended up making the call that actually we don't want another giant, say, cloud company to be able to get revenue directly from just making available the thing that we made.

So I think like I I've been involved in open source for a pretty long time. I'm a huge believer in open source. I'm not religious, though, about these specific licenses, because I think they're just things we wrote. And how are they turning out? How do we want to iterate on them? Like, like, what do we think is best for us as a community? I mean, our company, we're iterating with that. We're being really explicit, like this kind of software, like a Python package or an R package. It's like MIT, this kind of software. We're not going to do MIT anymore because we think it is not aligned with our long term goals around the like the sustainability of our of our company. Super interesting questions.

I think, yeah, to me, a lot of it is about, like, this open source is kind of like a gift. This is a gift that I'm, you know, spending my time on and giving to the world. But it's not like if you're going to abuse that gift, I have to keep giving it. There's some, I don't know, some sort of sense of, like, I want to be giving it to people who are giving back to the community. I get that not everyone's in a place to do that, and I'm happy for a lot of this work just to be used by people. But it just starts to feel exploitative when big companies that are tens or hundreds of thousands of times the size of Posit make money off our work. That just feels a little.

Gross, and yeah, I'm not like so religious about this, like it's about freedom, it's about all of these other kind of like big philosophical ideas, like to me, it's more about like community and, you know, trying to share what people are trying to do good in the world in some way, but at the same time accepting like if you try to pin down exactly what that means to like you get you get lost in the details and you just have to accept that people are going to use it for things that you don't, you wouldn't personally like them to do. But I don't know, on the whole, I kind of hope that people are using open source software like make the world a better place, not just make money for themselves.

The chicken-and-egg problem for new open source projects

Yeah, it's helpful to hear the kind of how Posit is trying to like thread the needle between things like MIT and the elastic license and find things that kind of work for everyone in the different like circumstances. It's for sure an experiment, it's like, okay, let's try this. How's this going to go? You know, like, and I think I mean, so many of these things are untested. And like, we don't quite know, you know, like how these things will play out. But it definitely is interesting.

The thing that's been on my mind a lot is: how will developers, how will users, be motivated to discover and learn new technologies that their favorite LLM doesn't know about? And how is that going to affect the development of new open source software projects? How do the LLMs get the content that they need to get trained on new projects? And so, you know, for folks like us that have been in the business of teaching people how to do data science, building data science projects, it presents this conundrum, or maybe chicken-and-egg problem. The nature of building new open source software for data science is, I feel like, going to be permanently altered, and how does that affect adoption rates? And just how long does it take before a new project is important enough that the LLM providers go to the effort of creating a training corpus to teach the LLM how to use your new open source library that doesn't have that much content available on GitHub?

And we're on kind of the lucky side of this. Most of the tools that we have created are now in the training sets. But I don't think we want to be locked in. Like, I don't want ggplot2 to be the visualization package used for the rest of humanity because it's impossible to create a new system because everyone uses LLMs, and if it's not in an LLM, they don't use it. Like, that doesn't seem like a win.

Like how, like how, what's the, like, how does the, like the death part of the open source life cycle, like, how does that change? It's interesting with Stack Overflow too, like to these points, like a lot of my early experience with ggplot2 was on Stack Overflow, where Hadley would answer questions, but you would see the question and then you would see like a couple different options and one would be ggplot. So it's interesting to think about sometimes, yeah, the answers that people might get now if they ask an LLM, if it both, you might not even see that Hadley's answering and you might not see the range of tools or you might only see certain tools, I guess, in the output to Wes's point.

So, yeah, Julia, I really appreciate you coming on and just like opening up the complexity of this like data science workflow, how to reach your senator, you know, is there any, any like parting words for people at home, either ways to help you with Positron or things you'd encourage people to check out?

Yeah, yeah. So if what we've been talking about today has piqued your interest in Positron, you can go to positron.posit.co for installers and documentation. And I think I am excited for more and more people to get exposed to it and to try it out. And it's been really delightful to talk with the three of you here today. Thank you so much for having me on and for asking such insightful questions. I honestly haven't thought about the pizza in quite a while. So it was a little bit delightful to get to revisit that.

No, it's been such a treat. Thanks. Thanks so much for coming on.

The Test Set is a production of Posit PBC, an open source and enterprise tooling data science software company. This episode was produced in collaboration with creative studio Adji. For more episodes, visit thetestset.co or find us on your favorite podcast platform.