Resources

Julia Silge - Keynote PyCon Colombia 2025

Julia Silge
Engineering manager at Posit PBC, leading development of open-source software for data science in Python and R. Data scientist with expertise in machine learning and text mining. PhD in astrophysics and author of books on data science.
GitHub: https://github.com/juliasilge
More about PyCon Colombia at http://www.pycon.co

Sep 12, 2025
45 min


Transcript

This transcript was generated automatically and may contain errors.

Wonderful. Thank you so much for that great introduction. Thank you.

This has been such a fantastic day. The talks today, from the keynote this morning to all the session talks, have been so exciting. And I am excited to get to end our day here by giving you a bit of an introduction to Positron, a new IDE that I am working on along with my team.

I'm going to kick off by telling you a little bit about who I am, so that you can understand the perspective from which I am coming to speak. So, a very long time ago, I was an astrophysicist. A medium long time ago, I was a practicing data scientist. I worked in the non-profit sphere and also in tech companies as a data scientist, working with data to help make better decisions: doing analysis, writing reports, making interactive apps, training models.

And then more recently, in the last five years or so, I've shifted to being a tool builder for data science. Someone who works on the tools we can build to make the work of data scientists more productive, more fluent, and simpler.

So, I work at a company called Posit PBC, the company formerly known as RStudio. It's the company that made the RStudio IDE. And today I'm talking to you about a new IDE. So, that's a little bit about me.

And now I want to find out a little bit about you, so that I can understand your background and how I should talk about these things. Because I know people use Python for all kinds of things, right? People use Python for everything from web development with the Django stack to building APIs. So, the first thing, I'm going to ask you to raise your hands here, so I can understand. Can you tell me, do you mainly use Python for data work?

Okay, okay, great, awesome. Thank you very much. So, it seems like most of the people in here, you use Python mostly for working with data. Which is great, because that's what I'm going to talk about here.

Have you ever used RStudio? Have you ever opened up RStudio and used it? Have you ever used VS Code? Have you ever used a VS Code-like IDE or editor, such as Cursor, Windsurf, or VS Codium? Or one of the other forks?

The iterative nature of data science

I want to start off by talking about how the process of data science is, in many ways, different from the process of what you might call general purpose software engineering. The process of data analysis, working with data, is iterative and exploratory. So, we definitely use code. We write code to analyze data. But the process of writing code for the specific purpose of analyzing data, it's substantially different from what someone is doing who is writing code, maybe to build a website. Or maybe to make a mobile app or something.

This diagram outlines one model for what that process of data science may be like. The first thing you do is read in some data. Maybe you read from a CSV, or maybe you read from a database. You have some data in memory, in Python. And then you start in what is, I believe, an iterative set of steps. Maybe you reshape the data. You munge the data. Maybe you make a visualization to see what's in there. Then you use that to train a first version of a model. And you realize, oh, maybe I should use this variable I have with some feature engineering. And you go around.

Or maybe you're like, ah, this doesn't make sense. Let me go back to that variable I was using and look in more detail. And, oh, I have to correct for this way that the data was recorded. And so, then you transform again. You make another visualization. Then you go back to your model again. And this process, the important thing is you don't know what the next step is until you see what's really in the data.
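Purely as an illustration, one pass through that loop might look like this in Python. The data here is made up, and the visualization step is left as a comment, since it only makes sense in a live session:

```python
import numpy as np
import pandas as pd

# 1. Read in some data. Normally this would be pd.read_csv("data.csv")
#    or a database query; here it's a tiny made-up frame.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [2.1, 3.9, 6.2, 8.1, 9.8]})

# 2. Transform / munge: after looking at the data, add a derived feature.
df["log_y"] = np.log(df["y"])

# 3. Visualize to see what's in there (in a live session: df.plot(x="x", y="y")).

# 4. Fit a first, simple model; what you see here decides the next step.
slope, intercept = np.polyfit(df["x"], df["y"], 1)
print(round(slope, 2))  # about 2, so y grows roughly linearly; iterate from here
```

The point is not the specific code but the shape of the loop: each step is chosen after seeing the output of the previous one.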

And so, in this process, you would probably make mistakes if you said ahead of time exactly what you would do and did not stray from that path. Because when we're doing statistical analysis, machine learning, data analysis of various kinds, we're coming up against the real world. And we have to iterate because of the actual content of the data that we're working with. This is unique to the work of people doing data analytics, data science, machine learning. It's a defining characteristic of what this kind of work is like.

Eventually, you do decide you're done. Or at least you're like, ah, I got to stop. I can't keep iterating on this model or app or plot or whatever it is you're trying to make. You decide you're done. And then you do something with that analysis. Largely, we communicate about it. We write a report. We build an interactive app of some kind. Maybe we deploy a model and then have to tell people about the model. But this idea of what data science is like, what working with data is like, is an important thing for us to keep in mind as we're thinking about our practices, as we're thinking about our tools and the kind of tools we choose to use.

The garden of forking paths

A result of this iterative, exploratory nature of data work is that we end up making a lot of decisions, and each decision we make depends on the last decision we made. When people talk about this, they use the idea of a labyrinth or a maze, and there's a paper called The Garden of Forking Paths. It's the idea that I'm walking through a garden, and there are all these paths that keep forking, and I choose to go one way, and that means I end up in a different place at the end than if I had chosen to go a different direction.

So when we are in this iterative process, the decisions we make while doing data analysis change where we end up at the end. This paper specifically talks about the idea of a labyrinth, or a garden of forking paths, in the context of how we use p-values in a statistical sense. It talks about how dangerous it is to naively use p-values when you've made a lot of decisions.

But forking paths are a good thing. It is good to make decisions and to base our next decision on the last one we made. It is good to analyze data in different ways and to use what we learn about the data to do the next thing. The mistake actually would be to choose just one path ahead of time, to say, I don't care what I see in the data, this is just what I'm doing, and not use the information in the data itself. And another mistake, the one this paper really focuses on, would be, for example, to compute p-values without accounting for all the different choices we made.

So the garden of forking paths is not a problem. It is the only option we have to correctly analyze data. But it means that as we move through a path, we end up at a place that would have been different than if we took a different set of paths. So this is part of why we have this very exploratory kind of iterative nature of data work.


Reproducibility and tension in data work

Now, at the same time that that is happening, we as data scientists, people working with data, know we need really reproducible practices. And it turns out that the nature of computational data analysis has really exposed limitations in our ability to evaluate findings. So this paper explores what it means for work to be reproducible, from a research or academic perspective. And it talks about a spectrum that starts at, hey, I'm just telling you my results. Picture this as: you ran some code in a Jupyter notebook, the cells are all out of order relative to the order you actually ran them in, you made a plot, and you posted the plot in Slack. That's one end of the spectrum.

And then there's a full spectrum of reproducibility that takes us through, okay, I check my code into GitHub. And I use good practices about how I write my code. And, oh, maybe I even can go all the way over that side to maybe a fully reproducible or replicable kind of process.

So by the very nature of our data work, we have an iterative, exploratory process. At the same time, there's tension because we need these reproducible practices. That paper was about academic research, but the same thing applies in a company or in industry. And so we end up in this situation: the process of data science is inherently iterative and exploratory, and at the same time, we know we need to adopt more reproducible practices. This is to maintain the integrity of our work. To show the people we work with that what we're doing can be relied on. That it is efficacious, that it does what we say it does.

So these characteristics of data analysis work are in tension. And what I want to talk to you about today is this tension. And how the processes that we adopt and the tools we choose to use can bring those tensions into balance. So it is with this background and context that I want to tell you about a new IDE that I and my team have been working on. That is built specifically with this kind of tension in mind.

Introducing Positron

So this is a screenshot of Positron, this new data science IDE. Positron has been available for beta testing for about the last year. And just this past week, actually, we moved to stable releases, which is really exciting. So Positron is no longer a beta product. It's now a generally available product.

So I'm using the term IDE. That stands for integrated development environment: a piece of software that supports you, as someone who writes code, in developing software yourself. In this example, someone is writing a report using Quarto and Python. On the left, you see the source code for the report that they're writing. And on the right, you see the rendered, finished version of the report that you might publish or send to a coworker.

So there are a lot of IDEs out there in the world. IDEs, code editors, depending on what you might want to call them. And so I'm going to tell you today about what's new and different about Positron. I would be very happy if you tried it out. But more importantly, I want us to think about choosing the right tool for the job you need to do, whatever that job is. Whether it's very iterative and exploratory, or if you're someone who does more traditional software engineering, you can think about what the right kind of tool is for the kind of work that you do.

The first thing I want to say more about is what I mean when I tell you Positron is a next-generation IDE specifically for data science. The company that I work at, Posit, is the company that makes Quarto. The developers at our company are the ones who make Quarto. We make Shiny, the Tidyverse, RStudio, and open-source packages in Python and R for things like making tables, doing machine learning, and MLOps. So we think a lot about this. We are all about data science, the process of data science.

And we think, we would even say we know, that someone who is writing code for data analysis or machine learning is different from someone who is doing software engineering. We are huge proponents of code-first data science. What I mean when I say that is that this person writes code. They're not using a low-code or no-code tool. They're not pointing and clicking in a GUI. They're writing code. Tools that are built for a typical software engineer who is writing code are often not a good fit for someone who is writing code to analyze data, because of that need for iteration and exploration.

So all across our org, pretty much every single thing we make or do or build is informed by how deeply we know and believe kind of these two things. Code first data science. People who do data science are different. And we think we can make you, if you are a data practitioner, more productive with these tools because they're specifically built for the kinds of things you need to do. So this is why we're building this thing. Because there isn't something out there that is quite like this. A platform where you can do all of your data science.

A polyglot IDE for Python and R

The second thing I want to dig into is that Positron is what we call a polyglot or multilingual IDE. Currently, it comes with first-class data science support for both Python and R. A lot of environments that were built for data analysis are specifically built around one language runtime for scientific computing. Tools that you have probably heard of or used yourself that fall into this category include RStudio, MATLAB, and Spyder, the Python IDE. There are a lot of these where all of the UI is strongly linked to one kind of runtime for scientific computing. And there are real limits to these kinds of tools or IDEs.

Because it turns out a high proportion of people use multiple languages for data science, or for their projects in general. This can happen literally on the same project: one project can involve multiple languages. It can happen over a week as you move from one project to another: this project is in one language, that project is in another. And it almost certainly is going to happen over the span of your career. I know that it has in mine. When I think back to the kinds of programming languages I used when I was in astronomy, I see some real shifts in the languages I have used over the course of my career. And most people at about the same place in their careers as I am say the same thing. Very few of us picked one language and then used it our whole career.

So I bet many of you are familiar with this. Maybe you build Python packages that involve C code, so you're combining Python and C. Maybe you combine Python and JavaScript: you make complex interactive apps, and so you use Python plus a bit of JavaScript. Maybe you're someone who uses Python together with SQL. You write SQL queries against your data, and then you use Python for machine learning. Maybe, like me, you're someone who uses both R and Python.
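As a tiny illustration of that Python-plus-SQL combination, using only the standard library's sqlite3 module and made-up data:

```python
import sqlite3

# An in-memory database stands in for wherever your data actually lives.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (product TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("a", 10.0), ("b", 5.5), ("a", 7.5)])

# SQL does the aggregation on the data side...
rows = con.execute(
    "SELECT product, SUM(amount) AS total FROM orders "
    "GROUP BY product ORDER BY product"
).fetchall()

# ...and Python picks up from there (feature engineering, modeling, plotting).
totals = dict(rows)
print(totals)  # {'a': 17.5, 'b': 5.5}
```

Two languages, one analysis: exactly the kind of mixing that a single-runtime IDE makes awkward.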

So by contrast with those kinds of IDEs or editors, Positron is built with front-end, user-facing features that are about the tasks you need to do when you're analyzing data or doing other kinds of data-specific work. And then there are back-end language packs that provide the engines for all those features and tasks. So here you see some of these. You see a fully interactive console, much more fully featured than a typical Python REPL. A plots pane where you can interact with the plots that you make. A variables pane where you see all the variables you've defined. All in addition to the place where you're actually writing the code. In this case, these are all backed by Python. But these front-end features are architecturally separate from the runtime that serves as the engine. And that means that today we have support for Python and R, but we can add support for Julia, or for other scientific computing languages that become important in data science.

Three categories of data practice

So data practice, as I think about it based on my experience, tends to involve three main categories of work that are qualitatively different from general-purpose software engineering. One category is exploratory data analysis. That's what I've really been talking about here, that iterative process. Another category is reproducible authoring. This is where we need to write some kind of report, or make a website, or make some kind of interactive app that depends on the data we are presenting or analyzing or communicating about. We do not want to be in a situation where we're copying and pasting plots from inside our editor into a Word document. We do not ever want to be in that situation. Instead, we want to build reproducible documents, so that the whole document can create the report or the website or whatever it is you need to make. And the third category is publishing data artifacts.

And this is often a challenge for people doing data work, because it sits at the edge between iterative scientific data work and the more traditional software world: you're trying to deploy something.

As we dig into this a little more, I might say this more than once, but I really hope a big takeaway for all of you is that it's not bad or wrong that people doing data work are different from people doing traditional software engineering work. If you're a data science person and you feel like you work a little differently from the software engineers around you, that doesn't mean what they do is better. It's not bad or wrong that it's different. It's just different. And that's reflective of the reality that the work actually is different. But you do need tools that are built specifically for the tasks you need to do, whether that is data work or more traditional engineering work.

Positron's key features

So I'm going to walk through a few more of the components that make Positron unique, so you can understand how each one links to this iterative, exploratory kind of work. First, I want to highlight that because we are in a world where we're writing code, there is definitely a place in this IDE where you write code in a very traditional way. But not only do you need somewhere to write code, you also need somewhere to execute code interactively. You could think of this as a sandbox or a playground. It's a good place to test out code quickly and see the results in a truly interactive way, so that you get that quick feedback loop of what's going on with your code. Not because you are doing all your analysis in this interactive console, but because you have to see what the results are to know what the next step is. Because that is the nature of what data work is like.

You can use keyboard shortcuts to run code from the file into the console. You get things like autocomplete and syntax highlighting in the console, because again, this is real code that you're writing and running. It's a fully featured console that's connected to the other pieces of the IDE, like the plots and variables panes.

I'm not going to go into this in great detail right now, but here in the console there's enhanced environment management support: interpreter management designed with an eye to these data science tasks. So whatever Python interpreter you have, you can easily understand what's going on with it.

So we said that what's going on down here in the console is connected to the rest of the IDE. This section over here is the variables pane, and it gives you a lot of information about every variable you have defined in the session you're running. You get information about the type of each variable. Some of them show a little table icon because they're rectangular data structures; you'll see a little cylinder if you're dealing with a database connection. The variables pane is connected, as you see here, to the console. But if you're someone who uses Jupyter notebooks, it's also connected there: it has another tab for each notebook session that you have. If you're running more than one Jupyter notebook at a time, it's super handy for understanding which session you're working with and what's going on in it.

This section down here is where you work with the plots that you have. You can dynamically create and update plots: you run code in the console, and then you see plots here. I am someone who finds this way easier for interactive work than, say, a Jupyter notebook; they're two different models for how to do this exploratory, interactive kind of work. It comes in handy when you have a lot of plots, because of how easy it is to switch back and forth between them. You can see your iterations and decide what you like. And although I bad-mouthed it a little bit ago, there is a button here where you can just copy that plot and then go paste it in Slack. There are a lot of UI affordances here for exporting at certain sizes, and for all the things that we know happen with plots in real life.

Another feature here in Positron is the data explorer. We think of this as a tool for debugging your data. It supports Pandas data frames and Polars data frames. And it's built in a way that is performant, so that if you have something up to, say, tens of millions of rows, and very wide data, you can scroll around without hanging the rest of the UI. For each column, you get summary statistics and sparklines where you can see what's going on. When you're in that iterative process of, how do I know what's in my data so that I can decide what code to write, this is a really good way to support that process.

So you may see this and feel like, well, I've got a question here. Because you may say to me, you were talking about code-first data science, right? How does UI like this fit into that way of working? One of the things we're working on right now, actually this month, is this: let's say you did some sort of filter and sort in here, as if you were in a spreadsheet. We want a way to export the code that will get you that same filtering and sorting. So we are all in on making sure that things are reproducible, by making sure that we're focusing on code-first practices.
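That export-to-code feature was described as in progress, so the following is only an illustration of the idea, with made-up data: a filter and a sort done by pointing and clicking in a data explorer corresponds to a couple of lines of reproducible pandas code, something like:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Bogotá", "Medellín", "Cali", "Bogotá"],
                   "sales": [120, 95, 60, 140]})

# The code equivalent of filtering to one city and sorting by sales
# descending, as you might do interactively in a spreadsheet-like UI:
result = df[df["city"] == "Bogotá"].sort_values("sales", ascending=False)
print(result["sales"].tolist())  # [140, 120]
```

Capturing the interactive steps as code is what keeps the exploration reproducible.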

Another component here is a viewer pane for viewing locally running content. This is a fully interactive pane. Here on the left, we have a Streamlit app, and it is running right there in the IDE on the right. You can view localhost URLs. You can open HTML files right there in the IDE. You can run interactive apps like this one, or FastAPI or Flask or Shiny apps, or other kinds of HTML content. This is also where you can view a rendered document, such as a rendered Quarto document, as you can see here. I find this hugely impactful for my own work, because you can end up in a flow state where you edit the app or the document you're working on and automatically see it update, versus having to go somewhere else to look at it. You end up in a very tight feedback loop between the code you're writing and the result you're seeing.

Familiar and extensible: built on Code OSS

So as I'm walking through these features, I want to highlight that Positron is both familiar and extensible. It's familiar to some of you in this room because it's pretty directly inspired by RStudio. Many of the people working on Positron are literally the same people who have worked on RStudio. And here you see a really RStudio-like interface, with a console and a source editor and a fully featured help pane. But it's being used with Python, not with R, which is all that would be possible in RStudio.

This help feature makes it really easy to find help for any function or package or topic right from within the IDE, so you don't have to go and get distracted by the internet looking things up. You get to see everything right there inside the product. It supports links between help topics across different packages. When you see examples, they have syntax highlighting, which is really nice. I have found this helpful as a user, but it's also helpful as a package developer. Because if you're working on your Python package and you build and reload it locally, you can see the help you just wrote right there, in a way that keeps you in that tight feedback loop, in that flow state, without having to leave the IDE.

It's also familiar to almost everyone in this room because it is built on the open-source components that are used to make Visual Studio Code. Positron is a fork of the open-source project that Microsoft uses to make their proprietary builds of VS Code, similar to how Cursor and Windsurf are forks of that project. Now, we forked Code OSS, as that project is called, for slightly different reasons than Cursor and Windsurf did. We forked it because it lets us focus on the data science tasks and features, which is what we care about and what we really have institutional muscle around. We do not have to spend our time working on the general features that every kind of person who writes code needs.

So as an example, we have not changed the experience around the source editor. We have not changed the experience around how you interact with Git. We have built an entirely new, truly interactive console. We have built a new connected, integrated way to build with your plots. So building on Code OSS really allows us to focus on the things we care about, and we kind of make a deal that we largely accept how things work in VS Code for other kinds of features.

Building on Code OSS also opens up the wide world of VS Code-compatible extensions for users who want a data science-specific IDE. If you're familiar with these, they range from any theme you can imagine, so you can make your editor look however you want, to extensions with more substantive functionality, like connecting to Databricks or wherever your data is. It also means you can build extensions yourself that will work both in VS Code and in Positron.

Now, if you're familiar with extensions of that nature, you may be thinking, wait a minute, why didn't you just make extensions? Why didn't you make really good data science extensions? We have thought pretty carefully about this: about what features need to be part of the core product itself, and what we can support through integration with extensions. At my company, we actually make a bunch of extensions. We make an extension for Quarto, an extension for Shiny. We make an extension for Posit Connect, which is what's shown here; Connect is a publishing platform for data science. So this person has made a dashboard, and they're in the process of publishing this Python dashboard to Posit Connect. So we're all in on extensions, and we're huge fans of the extension story around these kinds of IDEs.

However, it turns out that not everything we need for a first-class data science experience can be built into extensions. That's because of the very good reasons that the extension API is limited. In particular, extensions cannot talk to each other, so we cannot get the kind of fully integrated experience across the components that I just showed you. For these deeply iterative workflows, and that tension I talked about between iteration and reproducibility, solving the problem just with extensions, I'm going to argue, doesn't play out that well. And I bet this is resonating with some of you. If you do a lot of iterative, exploratory data work, and you have tried to make VS Code into a data science IDE with an extension pack or the Data Wrangler, you may have found, okay, I'm getting there, but this doesn't update when that updates, and I can't see this from there. That's exactly why we forked instead of building sets of extensions.

AI and LLMs in Positron

So if I'm up here in the year 2025 talking about a new IDE, I'm confident that a question at the top of many of your minds is how, and in what ways, Positron is set up to work with LLMs. So I'm going to say it one more time: Positron is not a general-purpose software engineering IDE. We don't see tools like Cursor and Windsurf as our direct competitors, because those are built, again, for general-purpose software engineering, and we're just not in that business. Those are not the users we are focused on. Instead, our goals around LLM code assistance and AI integrations are to build an IDE specifically for data science, for statistical analysis.

I'm going to take a moment to talk about how our company has been going through some pretty deep exploration of what the rise of these new models means for our users: the users of our free and open-source products, our customers, the people who pay us for our enterprise software, and even for us ourselves. What does it mean for us at our company to work in the era of these AI models?

So our company, one of the ways we've been exploring that is with internal hackathons. They're a week long. You end up in a small group with people who you maybe don't work with on a daily basis, and at the beginning of the week, there's an intro, here are what these models are like, and then each person commits to spending a certain number of hours working on a small project. So this is my project, my very small, little experimental project.

I'm very interested in text analysis and NLP, and have been for a very long time. So I was interested, A, in how these models can be used for text analysis, how they can be used not only to generate text but to analyze text. And B, I'm someone who's pretty concerned about the way these models can be used for disinformation at scale. And I wondered, if I thought about it, could I come up with a way that these models might be used to counteract disinformation?

And so this is a very small project, right? But I built a dashboard that takes as its input a URL, any URL. It gets the content from the URL and summarizes it. It goes to Wikipedia and finds which Wikipedia pages are about the same topic, and gets the content from Wikipedia. Then at the end, you ask the model: you have this text, which is the text from the original URL, and you have this text, which is from Wikipedia. Can you tell me how well they agree? You ask it to give a score, and then it writes a short paragraph about whether they agree or disagree. The example shown is a website that spreads vaccine disinformation, and this particular page argues that the CDC should study a link between vaccines and autism.
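The shape of that pipeline can be sketched as below. Everything here is a hypothetical stand-in, not the actual project code: fetch_text and ask_model are placeholder functions, and a real version would use an HTTP client, the Wikipedia API, and a call to an actual language model.

```python
def fetch_text(url: str) -> str:
    """Placeholder: a real version would fetch the page and extract its text."""
    return f"main text of {url}"

def ask_model(prompt: str) -> str:
    """Placeholder: a real version would call a language model API."""
    return "score: 2/10 -- the texts largely disagree"

def check_against_wikipedia(url: str) -> str:
    article = fetch_text(url)                      # 1. get content from the URL
    summary = ask_model(f"Summarize:\n{article}")  # 2. summarize it
    wiki = fetch_text("https://en.wikipedia.org/wiki/...")  # 3. the matched topic page
    return ask_model(                              # 4. ask how well the two texts agree
        f"Text A:\n{summary}\n\nText B:\n{wiki}\n\n"
        "Give a score for how well these texts agree, then a short paragraph."
    )

print(check_against_wikipedia("https://example.com/article"))
```

The interesting part is the composition: the model is used at two points, once to summarize and once to compare, with ordinary data-fetching code in between.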

And so what my little project did was help me get my hands dirty with the models and help to understand what is it that they can do and what is it that they actually cannot do right now. And it's been interesting to see people at my company in general start to wrestle with these issues. And just briefly, I'll say, people who are skeptics stay skeptical, but they are more concrete about what they are skeptical about. People who are enthusiasts, they stay really enthusiastic about these kinds of models, but they get more specific about what they are excited about. So it's been very interesting seeing my company really wrestle with this, and like what are we going to do? Like what are we going to do with the tools that we make?

This paper came out this past spring, and I highly recommend that you read it. It's an opinion paper or a position paper, and it makes the argument that AI is a normal technology. When they say AI is a normal technology, they're not saying, oh, it's no big deal. They're saying it's a normal technology in the same way that the Internet is a normal technology or electricity is a normal technology. So it's going to have a huge impact on how we work and learn and teach. To view it as a normal technology is not to understate its impact, but to contrast it both with the very utopian views of what these models can bring and with the very dystopian views, and instead to give us tools from what we have learned. What did we learn from the transition from before the Internet to having the Internet? What can we learn from history about bringing electricity? How did that happen? Who benefited? Who was left behind? We can learn from these things that happened in the past to help us make better decisions about what's going on now with this transition to AI.


So how we think about AI and LLMs is deeply informed by all this stuff we're thinking about and this particular perspective on data science that we have. So we're asking questions like, what do people who are doing statistical analysis need when it comes to LLMs? To be clear, we're not training new models. We're not training, like, Copilot but for data science. We instead are building tools of various kinds on top of the models. And in the IDE, in Positron, this looks like taking advantage of that deep integration so that we can increase the context available to the LLM. This includes things like chat participants that are data-science-aware, that know what is going on with data science projects, to scaffold you toward getting what you need done more quickly.

So this is what Positron Assistant looks like. It is currently available in preview in Positron, and it can be run in three modes. The first mode is ask. This is a chat style: you ask and it answers, sort of ChatGPT style. The second is edit, where you're in a file and you ask the model things about the file and ask it to propose edits for you. And the third mode is agent, an agentic mode where you opt in to giving the assistant permission to run code on your behalf. So the assistant gets to run code in the console, and it sees the results and I see the results. We both see the results of what happens here.

And this ends up giving us more context and some pretty successful results. So here, down here, I'm saying, hey, I want to read in some data available at a certain CSV file. And if you'll notice up there, I imported Polars and I imported Seaborn. And so what's going to happen is it's going to try to use Polars to read in the data. So this is my prompt, this is what I asked it to do. And then you will notice that it says, okay, great, here is some code, and it automatically ran it because it's in this agent mode. But look what happened: an error. It didn't work. It didn't read in my CSV.

Now, because we have this deeply integrated experience, what happens is that, I mean, obviously I see the error. I'm the one looking at the computer. But also, the assistant sees the error. And notice, I did not give any input between that first step and this step. It sees the error, and it's like, oh, the error indicates that we didn't handle the NA values that are in the CSV, so let me try again. And then if I hit that run code button, it runs it again and it succeeds. I also want to highlight that it used Polars. And that's not because I prompted it; it's because I imported it. It has deeply integrated knowledge of what's already going on in the IDE, so that it can generate the code that you actually want it to generate.

So let's get that ugly error off the screen. Who wants to see errors in code? Here, this is just us going a little further, and I'll highlight a few things. I ask a question up at the top, like, hey, for these Pokémon that are in this data set, how are height and weight related? And then the assistant generates some code that uses Polars and then uses Seaborn to make plots, because that's what I had already imported. I didn't explicitly tell it to; it's using the context here. And also, the assistant can literally view the plot. If you opt in to the right permissions, the plot itself can get sent to the model, because most of these models that are good at generating code are multimodal and can look at images, and so they can describe plots for you, tell you what's going on, notice outliers, notice problems, and whatnot.
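The plot in question is the kind of thing Seaborn produces in one call. This is a minimal sketch with made-up numbers standing in for the Pokémon data set (the column names `height` and `weight` are my assumption about the data shown on the slide):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display

import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the Pokémon data set from the talk.
pokemon = pd.DataFrame(
    {"height": [0.7, 1.0, 2.0, 1.7], "weight": [6.9, 13.0, 100.0, 90.5]}
)

# One scatterplot relating the two variables; Seaborn labels the axes
# from the column names automatically.
ax = sns.scatterplot(data=pokemon, x="height", y="weight")
```

The resulting figure is exactly the sort of image a multimodal model can then be asked to describe.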

So Positron Assistant, I think, is a really interesting piece of the IDE, because it highlights the outcomes that we're getting from the architecture we've chosen and the set of trade-offs that we're making. We are making trade-offs on the side of: this is not for everyone. This is for people doing data science. And that means it's powerful for the people doing data science, because we're not trying to be everything to everyone; we're trying to make the best thing for people doing data science. And instead of making everything an extension, we are asking what can actually improve the experience by not being an extension and instead being part of the IDE as a whole.

And with that, I will wrap up and say, hey, if this makes you curious, if you want to try out Positron, you can go to positron.posit.co. That's where you can get documentation and installers. If you run into bugs, if you have questions, if you want to give us feedback, join us on GitHub at posit-dev/positron. And, I mean, I don't know, if any of you are customers of ours, Positron is available in preview in Posit Workbench, which is one of our enterprise products for big companies who need these kinds of features.

And if you're sitting in here and you're maybe not a data science person, maybe you are one of these people who uses Python more for traditional software engineering, I do think there is a takeaway here for all of us: not all software is built for the same purposes. Software tools that bring us joy are not one-size-fits-all. Rather, we can build tools that are specifically for the kind of work that we need to do. And with that, I will say thank you very much.
