Resources

Keynote: Julia Silge - The Right Tool for the Job | SciPy 2024

There are many programming languages that we might choose for scientific computing, and we each bring a complex set of preferences and experiences to such a decision. There are significant barriers to learning about other programming languages outside our comfort zone, and seeing another person or community make a different choice can be baffling. In this talk, hear about the costs that arise from exploring or using multiple programming languages, what we can gain by being open to different languages, and how curiosity and interest in other programming languages supports sharing across communities. We’ll explore these three points with practical examples from software built for flexible storage and model deployment, as well as a brand new project for scientific computing. Julia Silge is a data scientist and engineering manager at Posit PBC, where she leads a team of developers building fluent, cohesive open source software for data science in Python and R. She is a tool builder, author, international keynote speaker, and real-world data science practitioner. She holds a PhD in astrophysics and serves on the technical advisory committee of the US Bureau of Labor Statistics. You can find her online at her blog and on YouTube.

Aug 20, 2024
46 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Julia is a data scientist and engineering manager at Posit PBC, where she leads a team of developers building fluent, cohesive, open-source software for data science in Python and R. She's a tool builder, author, international keynote speaker, and real-world data science practitioner. She holds a PhD in astrophysics and serves on the Technical Advisory Committee of the U.S. Bureau of Labor Statistics. You can find her online on her blog, which is quite excellent, and YouTube.

I am super happy to be here at what is my first SciPy. I've had multiple people that I have worked with tell me it is their favorite conference, so I'm really happy to be able to be here and join you today and to speak specifically about the process of using multiple programming languages for scientific computing and how we can explore or start to identify what is the right tool for the job.

Before I get a little deeper into it, I want to tell you a little bit about the path that brought me here to be standing in front of all of you talking about this. My academic background is in physics and astronomy, but I came up through those fields before the dominance of Python in these fields and in scientific computing in general. If I were maybe 5 to 10 years younger than I am, the tooling that we are all here to talk about would be my tooling, and you would be my people. But I started in these fields before that era, and so I and the people around me in academic departments doing physics and astronomy, we wrote a lot of C. We wrote a lot of FORTRAN. We wrote a lot of bespoke code that, in hindsight, really makes a lot more sense to have in a well-maintained community resource like AstroPy and that whole ecosystem. There were people around me who used IRAF and IDL, closed-source tools like that.

This was like my first scientific computing experience, my first exposure to how to use computing for scientific purposes. So I was in academia for a while, left, did a few pretty random things in tech companies for a little while, and about 10 years ago made a career transition, made a change in my professional identity to the sort of at the time newly burgeoning field of data science. So I became a data scientist.

As I made this transition, I brought with me the excellent quantitative training that came from studying physics and astronomy. I brought a lot of real-world experience dealing with messy data that I collected myself at telescopes, a lot of experience understanding what kinds of questions I could answer with data, and how to communicate about data with data visualization, speaking, and writing. However, I did not at the time know modern data science languages, and so as I was making this transition, it was my first time really wrestling with: is it worth it? Do I need to do this? Is it important for me to add additional languages?

So I first learned Python, and then I learned R, and it turns out R has proven to be one of the great loves of my professional life. My exposure to R was such a good fit for how I approach data analysis that it had a huge impact on how my career went after that. The work I'm probably most strongly associated with is my work in the R world: open source, books, learning materials, and that kind of thing.

So I moved into data science. I was a data science practitioner, and I started being involved with tool building on the side. I bet many of you are the same: you have one sort of day job, and then you contribute to open source on the side in your own time. About four and a half years ago, I made a bit of a pivot again to becoming a full-time tool builder, spending all my time building tools for data science. This pivot meant changing my job from working in tech as a data scientist to working at a company that was then called RStudio. So I started working at RStudio, on open source software for machine learning and for text analysis.

About Posit and multi-language tools

Now, about the company that I work for: it's the same company, but it's no longer called that. The company I work for is now called Posit. The main motivation for this rebrand is that for a long time, the company has built software and tools for scientific computing and data science that are tools for Python, tools for R, of course, and tools for Julia. Honestly, people don't believe you when you say that, when your name is so strongly associated with one particular programming language. So we rebranded, we chose this new name, to try to be more clear about who we are.

And now when I think about what I spend my time doing, and what the people I work with spend their time doing, I notice that we now have, in aggregate, years of experience building tools using multiple programming languages. We use multiple programming languages in two ways. First, we ourselves internally use multiple languages as we build these tools. My company is probably best known for the RStudio IDE, and if you go to GitHub and look at the repo for the RStudio IDE, you can see the pretty long list of languages used to build it. So we internally as an org have experience needing to use multiple languages together as we collaborate on the tools we're building. Second, we have years of experience building tools for multiple programming languages, because the tools we build are meant to be used from multiple programming languages.

So I'm going to share a couple of examples that I'll use to illustrate these points as we go along. The first one I want to talk about is Quarto. Quarto is a tool for authoring scientific documents that can be used from Python, R, and Julia, and it works with Observable. It's a tool you can use to write a paper, make a website, build package documentation, or make slides. The slides you're looking at are, in fact, created with Quarto, and they're available at the URL down at the bottom if you'd like to check them out afterwards. If you were here on Monday, you may have had the opportunity to go to a tutorial on using Quarto for scientific communication in Python.

Another one is Shiny, a tool for making interactive apps for data; Shiny has an implementation in Python and an implementation in R. At Posit, there are also folks who work on tools for making publication-ready tables, for when you need a table in a paper you're writing or on a website you're making; Great Tables is such a tool for Python and for R. And later in the conference, tomorrow, I believe, there's a talk here about how to use the Great Tables Python package to create the beautiful tables you need.

So these are all examples of things of which I am a user, but not directly a developer. This next set are ones that I myself have been directly involved in working on, and they really inform how I think about these issues of using multiple languages. The first is Vetiver: software in Python and R for MLOps. Pins is software in Python and R for versioning, publishing, and sharing data. And then I'm excited to also share some new work that I've been doing recently with folks here.

So this is the context, right? The context for the experiences that have brought me to think about what's hard and what's easy, and what comes up when we use multiple programming languages. As I was thinking about standing here in front of you all, about what I want to talk about with the people at SciPy and what I feel I've noticed or learned, there are three things I want us to talk about here together. The first is: what does it cost when we use multiple programming languages in a scientific computing context? The second is: what do we gain when we start using these different programming languages? And third: when we have curiosity and openness to different ways of doing things, what does that allow us to share or give to the people and communities around us?

The cost of using multiple languages

So what if, hypothetically, I stood in front of you and said: you really need to learn another programming language? To be clear, I am not saying that. But, hypothetically, imagine how you would feel. I think many of you would feel a reaction, maybe a defensiveness. You might feel a certain sense of "I'm absolutely not going to do that" or "I don't have the time to do that." That's because there are huge costs associated with learning new programming languages. It's really expensive, in terms of time, energy, concentration, and effort. It takes a lot to learn, especially if we're talking about becoming professionally competent in an additional programming language. This is something I myself have done several times through my career, so I'm very familiar with how challenging and how costly it can be.

I do think these costs may be changing a little in the era of LLM-based coding assistants. I've actually seen these tools be quite helpful to people, for example when they know one programming language and want to translate something to another, or when they're learning a new language and asking: hey, how do I do this, but in this other language? But nothing I've seen from these LLM-based tools really changes the fundamental fact of how expensive it is for a person to gain competency in an entire new programming language.

At the same time, there are real benefits to specialization. You yourself have probably experienced this in your career: good things have happened to you because of the way you have specialized in something. Or when you build a partnership or collaboration with someone, often you don't want your skills and the other person's skills to overlap exactly; you want them to be complementary. So these things are real: the costs of learning and the benefits of specialization are really real.

So this is what we experience at the individual level, and when we think about the organization level, these costs get aggregated up. If everyone in an organization starts to experience these kinds of costs, you start to observe tension and problems around consistency and complexity. Consistency gets pushed down, complexity gets pushed up, and we start to have questions like: who can do code review for this person if no one else knows that programming language? Or, if we have multiple ways of building a model, how are we going to go about deploying these different kinds of models?

And these experiences, which are real, which I bet many of us have observed or experienced the way I have, result in people holding a belief or a value that you might express like this: there should be one, and preferably only one, obvious way to do it. This comes from the Zen of Python, and it is a statement saying that it's really important for us to be consistent, and really important for us to not have unneeded complexity.

I have experienced these kinds of costs firsthand. An example that comes to mind is my work on the pins package. Pins is software in Python and R for versioning, publishing, and sharing data. This slide shows how you might write a pin to a board. The metaphor is that there's a pin board somewhere and you're pinning things on it. There's nice support for getting versions of different datasets, and you can specify, for example, that we're going to write something as a Parquet file. So this is what writing looks like, and this is what reading looks like. I'm using a toy dummy board here, but in real life you would use a board that is an S3 bucket, or a network drive on a high-performance computing cluster, or Google Cloud Storage, or whatever. Pins is a friendly user interface for a data science or data analyst user persona, so that they can switch out back ends in a pretty flexible way and not get dragged down in the specifics of how things have to be versioned on an S3 bucket versus some other way of storing data.
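To make the board metaphor concrete, here is a minimal, stdlib-only sketch of the pattern pins provides: write named, versioned objects to a board, then read the latest version back. This is an illustration of the idea, not the real pins API; the `ToyBoard` class, its method names, and the JSON storage are all invented for this sketch (the real packages add backends like S3, rich metadata, and formats like Parquet).

```python
import json
import tempfile
import time
from pathlib import Path


class ToyBoard:
    """A toy 'pin board': named datasets, each stored under a version folder."""

    def __init__(self, root):
        self.root = Path(root)

    def pin_write(self, obj, name):
        # Version stamp: timestamp plus a counter so versions sort correctly.
        version = time.strftime("%Y%m%dT%H%M%S") + f"-{len(self._versions(name))}"
        path = self.root / name / version
        path.mkdir(parents=True)
        (path / "data.json").write_text(json.dumps(obj))
        return version

    def pin_read(self, name, version=None):
        # Default to the most recent version of the named pin.
        version = version or sorted(self._versions(name))[-1]
        return json.loads((self.root / name / version / "data.json").read_text())

    def _versions(self, name):
        d = self.root / name
        return [p.name for p in d.iterdir()] if d.exists() else []


board = ToyBoard(tempfile.mkdtemp())
board.pin_write({"x": [1, 2, 3]}, "mydata")
board.pin_write({"x": [1, 2, 3, 4]}, "mydata")  # writes a second version
print(board.pin_read("mydata"))  # reads the latest version back
```

The design point pins makes, which this sketch gestures at, is that the board's storage backend can be swapped out without the analyst's read/write code changing.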

Okay, so reading and writing data. It is explicitly a goal of pins to support interoperability, but we observe that there is a cost for individuals when they need to do this. In the slide, I showed writing using Parquet, and Parquet has really great features around interoperability. You can read and write Parquet files from Python, and R, and JavaScript, and C, all these different ways of reading and writing, and get consistent results. That is a whole point of Parquet and Arrow. If you know that already, that's fantastic, and you are set up for success. However, there are stumbling blocks that people hit when they have not yet spent the cost to learn what they'll need to do to really collaborate. Honestly, this even comes up with things like trying to store things as CSV, which you may think is just plain text, but you can run into problems trying to read and write CSVs between Python and other languages, like R. And then there's a whole universe of binary file formats that are quite difficult to open in one language versus another.
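One concrete example of the CSV stumbling block: CSV carries no type information, so every value round-trips as text and each language applies its own parsing rules on the way back in. A small sketch using only Python's standard library:

```python
import csv
import io

# Write a tiny "dataset" with an integer and a float column to CSV.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "measured"])
writer.writerow([1, 2.5])

# Read it back: the numbers come back as strings, because CSV has no types.
buf.seek(0)
rows = list(csv.reader(buf))
print(rows[1])  # ['1', '2.5']
```

Every consumer (pandas, R's readr, a JavaScript parser) has to guess or be told the types, and their guesses can disagree; formats like Parquet avoid this by storing the schema alongside the data.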

So if an individual wants to have this kind of collaboration with someone who uses a different programming language, they have to face this. They have to spend this energy and time to get it right. We have also observed costs to our organization. The pins Python package and R package are explicitly built to work together, but what that means is that the steps we take to write tests, and to run those tests on a CI system, scale up the complexity of those projects, because interoperability is an explicit goal of them. So if someone is only an expert in, say, Python packaging, there's a cost in that person having to think: okay, I need to also learn about R packaging. I need to also learn how GitHub Actions works for R, in addition to how GitHub Actions works for Python. So these costs are real.

What we gain from multiple languages

I am now going to move on and start talking about what we gain when we use multiple programming languages. And I'm going to start back at the organization level, because I know that's top of mind when people think about questions like: will we support using XYZ language in our research group or company or department? At any given time, the people in your organization are better at some task using one tool versus another tool. And when a person can make the choice of what tool they're going to use, they are going to be more productive. So when we use multiple programming languages, this makes everybody more productive.

But this dynamic is in tension with the consistency and complexity problems I described. Everyone actually being better at the work they're trying to do ends up balancing against those problems around consistency and complexity. And in my experience, there is no a priori way of knowing how that balance is going to play out. There's no hard and fast rule about which one will win in the end. It depends on the scale of your organization, and on the specifics of what the infrastructure is like in your organization.

This leads people to say: okay, let's figure it out. Let's take a pragmatic attitude to how we choose tools. This leads to a belief or a value that you might express like this: practicality beats purity. This is also from the Zen of Python, and it reflects the idea that, in our specific circumstances, we can decide what works best at this point and not have an overly purist view about tooling or how people should be doing things.

This is really related to the idea of the right tool for the job. And what I observe is that it is very difficult, maybe plain impossible, to make overarching general statements about what the right tool for a job is. Instead, the right tool for the job is always specific to a circumstance, to a person, even to a time. So the right question isn't what is the right tool for the job, but what is the right tool for the job for this person at this time in this organization?


The project where I've seen this play out a lot is Vetiver. Vetiver is software for Python and R for MLOps, that is, for the process of deploying and operationalizing models. When you say MLOps, people mean a lot of different things, so I'm going to get a little specific here by looking at this diagram of a model lifecycle. We start by collecting data. The first thing we'll need to do when we have that data is to understand and clean it, and there are lots of great open source tools for understanding and cleaning your data. Next, it's time to train and evaluate a machine learning model, and again, there are lots of really fantastic open source packages across different languages to train and evaluate the model.

At that point, the situation becomes much less clear. That's both because there's a lot less community understanding of what the right next thing to do is once you have a machine learning model trained, and also because there are fewer open source tools for what needs to happen next. Especially fewer open source tools built for a scientist, data analyst, or data scientist type user, as opposed to what I might call a generalist software engineer type user. And this is where Vetiver sits. Vetiver has an opinionated idea of the next things you need to do: version your model, deploy your model, and then monitor your model. And it provides functions and infrastructure for taking those next steps.

It's really about giving people the opportunity to choose the right tool for what they need to do. Person A might decide: I'm going to train a random forest model using scikit-learn in Python. They can version, deploy, and monitor their model with Vetiver. Person B might say: I need to create a survival analysis model, and I think the best way to do that is to use tidymodels in R. They also can use Vetiver to version, deploy, and monitor their model.

One thing I really have observed from our users of Vetiver, from people telling us how they're using it, is that it allows people the autonomy to make the best decisions given their own domain knowledge, what they know about their data and their situation, and still actually get back some of that consistency. Tools themselves can be designed in such a way that we give people the flexibility that increases their productivity and allows them to make the best decision in the context in which they are working, while also pushing back against the problems around consistency and complexity that arise when people do things in different ways.

Now, that's at the organization level, right? That's what we gain when people become more productive. Let's go back to the individual. Again, hypothetically, what if I stood up here and said you need to learn another programming language? Which, again, to be clear, I absolutely am not. But I want to reflect on what happens when people do take opportunities as they come and add other languages to their toolkit. What are the benefits that individuals gain and observe?

The first one is that people scale their impact. If you are someone who can solve a certain problem using a certain toolkit, that's great. But if you are someone who can see a problem and understand different ways of solving it, weigh pros and cons that are specific to your situation, and understand how certain stacks connect into other infrastructure, that kind of systems-level thinking about a problem means that second person has much more impact in their field, in their organization, in the areas in which they work.

I also think it's important to take the long-term view when it comes to these individual-level questions. I do know some people who have spent their whole career using one programming language and are really happy and fulfilled and successful. But in my experience, that's fairly rare. A lot of us at least also use SQL, right, in addition to the other languages that we use. And I observe some, I'm going to call them archetypal career arcs, that people often take as they consider what they want to do with their career in the long term.

A common arc I observe is people starting from high-level scripting languages like Python and R and moving to the front end, really becoming experts in JavaScript. Often these are people who are really interested in data visualization or making interactive apps and dashboards. I observe people going the other way, too: they start with a high-level scripting language like Python or R and then move lower level, to something like Rust or C or Go. Often these are people working on mathematical methods or machine learning methods, or people building developer tooling, where you have to work at a lower level because maybe the high-level scripting language needs to talk to something else. My own arc, I perceive to be a little weird: I started a long time ago with C, went way to the front end, was actually paid to write Flash for a little bit, believe it or not, and then landed back at the high-level scripting language layer, building tools for this layer.

But when we can take a long-term view of what our careers are like, we can more accurately understand what the individual benefits are and whether it's worth it at any given time to learn something else.

And the last thing I want to say here about what individuals gain is about increasing your vocabulary. The metaphor here is from natural human language. If you yourself are bilingual, or if you've dabbled in other languages, or you hear about people learning languages, you have probably come up against some situation where there's a word that so perfectly expresses some concept in one language but does not exist in another. We might say it's very difficult to translate that word, because of how the word is connected to some concept, and people might use that word even when speaking other languages, because it so perfectly encapsulates the concept.

This happens in computing. Computing languages are really different from each other, built with different priorities and really different characteristics. Sometimes a concept in computing is really well expressed in one language, and maybe cannot be translated, or cannot be translated perfectly, to some other language. When we have curiosity and openness to things happening a little bit outside our own communities, that allows us to increase the vocabulary of what we understand we can do using scientific computing.

What we can share across communities

Okay, so we talked about cost. We talked about what we gain. And now I want to talk about what using multiple programming languages allows us to share or give, both to individuals and to different communities. I approach this question mostly as a tool builder, because it is in the process of building tools that I have most observed this phenomenon of someone learning from one community and then bringing that thing they've learned to another community. So let me be a little concrete about what this might look like.

The main R package documentation generation tool is called Roxygen, and it is pretty directly inspired by the Doxygen tool for creating docs that I bet many of you have run into or seen or used. R as a programming language does not have interpolated string literals, and people inside the R community observed how great that feature is in languages like Python or TypeScript, and said: let's bring that great idea of interpolated string literals to R. So they built a tool called Glue that gives you that kind of behavior for strings. Those are both examples of things coming to R, but of course, this moves around in all kinds of directions all the time.
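For anyone who hasn't met the feature: interpolated string literals let you embed variable values directly inside a string. Python's built-in f-strings are one example of the behavior that Glue brought to R (the R line shown in the comment is for comparison only and is not executed here):

```python
# Interpolated string literals: values are substituted inside the braces.
language = "R"
feature = "interpolated string literals"
message = f"{language} gained {feature} via the glue package."
print(message)

# The rough equivalent in R with Glue would be:
#   glue::glue("{language} gained {feature} via the glue package.")
```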

I'm going to highlight Quarto here again, because Quarto is a next-generation implementation of a way of working that did come from the R community. The original implementation was called R Markdown. Some of its main characteristics: it's a plain text format, in contrast to a Jupyter Notebook, with interspersed plain text and executable code chunks. In the R community, R Markdown was so life-changing, so impactful for people as they worked, that it motivated building this tool in a way that's not specific to R, but can actually be applied to different computing languages and used for all kinds of different purposes. I use Quarto to write my blog. I use Quarto to make my slides. And you can do this from Python, R, Julia, all these different ways of using it.
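To show what "plain text with interspersed executable code chunks" means, here is a minimal sketch of a Quarto document with a Python chunk. The specific YAML options and the chunk contents are illustrative, but this is the general shape: a YAML header, Markdown prose, and fenced code chunks that are executed when the document is rendered.

````markdown
---
title: "A minimal Quarto document"
format: html
jupyter: python3
---

Some narrative text written in plain Markdown, followed by a code chunk
that runs when the document is rendered:

```{python}
import statistics
statistics.mean([1, 2, 3])
```
````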

Introducing Positron

So there are all these examples of things moving around all the time. Looking at the company I work for, and thinking about this idea of something being great in one community and how we can take it somewhere else, I will say there's one thing that's top of mind for people. People will say things like: gosh, is there anything like RStudio but for Python? Or: hey, I'm loving Python, Python is really fun, but I still prefer RStudio as an IDE. Or: there are certain features in the RStudio IDE, and I really want those features when I do my Python work. Or people will say to us: all I want is PyStudio.

So I am really excited to announce the project that I have been spending a lot of time on over the last year, and that is a brand new data science IDE. This data science IDE is called Positron. If you have ever used RStudio or seen somebody use RStudio, a lot of this will look familiar. There is a pane where you write your source code. There is a truly interactive, fully featured console. There are UI affordances for seeing the variables you have defined, for dealing with the plots you've created, and for getting help right in your IDE, so that you don't have to get out of the flow state when you need to quickly look up a function signature or how a certain method works.

Our design of Positron is directly informed by the years of experience at my company building these kinds of tools for a data analyst user persona, the kind of person who deals with data on a regular basis. If you have ever used VS Code, you probably also think this looks pretty familiar, and that's because Positron is built on top of the open source components that are used to build Visual Studio Code. There are two main reasons we've done this. The first is that it allows us to concentrate on what we're good at. Internally at my company, we have a lot of experience around data science tooling. By using the general purpose components from the open source parts of VS Code, the support around general source code editing, saving files, and interacting with version control, we can, with a fairly smallish team, focus on the pieces of a data science IDE that we think are table stakes but don't exist in a general purpose software engineering IDE.

A second big reason is that it allows us to connect into the vast ecosystem of VS Code compatible extensions. RStudio as an IDE was never very extensible or customizable. By building something that works with these extensions, we really open up people's ability to customize their IDE for the kinds of tasks they have and the kinds of things they need to do. And in fact, we ourselves develop some of these very extensions. This is what developing a Quarto document looks like in Positron, and it uses the same VS Code compatible extension, the Quarto extension, that you would use in the official, Microsoft-branded VS Code. So the extensions can be used in both places, and we're able to modularize code in ways that have really big benefits for the kind of people who do this kind of work.

So I am really excited about this. Like I said, a big reason why is that it is the project where I have most observed this back and forth, this learning and then sharing. And I don't only mean from RStudio to Python. I mean learning from general software engineering: the way debuggers are built, the way information about plots is shared in different kinds of tools. I've observed this in a way that's really exciting for me.

I want to emphasize that Positron is a very early stage project. Today is Wednesday, and it has not even been a full two weeks that this project has been public. So it is, like, a brand new baby. And, boy, I love babies, but babies can be a little bit of a mess, a little bit of a challenge. So I certainly think it's probably not the right fit for everyone sitting in this room; I'm not encouraging everyone to switch right now. But if you consider yourself a bit of an early adopter, and if something that I said about this intrigues you, I do invite you to go to our GitHub, download an installer, and give it a try. We'd be really interested in the feedback you have as you try out our new data science IDE.

Wrapping up: cost as investment

All right. So as I wrap up here, I started out talking about the cost of using multiple programming languages. And it's probably no shock to you, as I get here to the end, that I do want to reframe that. Because when I think about my own career, and when I observe what happens in the careers and the organizations around me, I think it's more accurate to think about it as an investment. You have to decide for your situation what kind of investment fits, and then, community-wide, we can start to accumulate some of these gains that we can realize.

And if I were to leave you with one takeaway, it would not be, just to be clear, that you had better go learn another language. No. If I were to leave you with one takeaway, it would be this: if you can approach the communities around you that are adjacent, or maybe even a little further away, with curiosity and openness, that allows you to learn new concepts and new skills that will make both your own work and the work in your organization more robust and more fulfilling. So thank you very much.

Q&A

Thank you, Julia, for this excellent talk. A quick note for everyone. We're going to be asking questions through Slack, and so please use the keynotes channel to ask your questions.

Students who are interested in going into data science are always asking what language they should use to be, quote unquote, good or competitive as a job candidate. Do you think being a baddie at Python and okay at R is good, or the other way around? Alternatively, what have you seen as a powerful distribution of programming skills? This is from Anna.

Great. That's a great question. Thank you for that. I definitely think, especially for people who are junior, what's most important is becoming truly, professionally competent in your first programming language. Being, like, a baddie in one thing is the most important thing. And depending on the kind of work you want to do, it's good to observe what is most dominant in that work. What I observe is that, for example, like I said at the beginning, in physics and astronomy today, Python is super dominant. If you're working in tech-centric fields, it's very Python dominant. If you're interested in something like the life sciences, or insurance and financial services, we see much more prominence for R. So I would say, at a high level, the most important thing is to be good at one thing, versus trying to be a bit of a jack of all trades early in your career and being spread too thin, and to observe, depending on what you're interested in, what makes the most sense for you to get good at.

We have a lot of questions about Positron. Here's an easy one: does Positron have dark mode? It does, yes. Positron currently has built-in light and dark modes and a high-contrast, high-accessibility mode. I will say there's a bug open about using VS Code extensions that provide color themes; it works mostly. But yes, it does have a built-in dark mode.

From Pierre: are you planning a full web version of Positron, which could be enabled thanks to the Code OSS technology? That's right. It's currently in what we're calling a public beta, and as of today it's only a desktop app. But before the end of the year, it will have support for two kinds of remote ways of working. One is SSH tunneling, like you may have used with VS Code; that will work with Positron. Right now it doesn't quite, but it's highly prioritized over the next couple of months. The other thing is that Positron will always be free to use, with supported use in research and teaching, and that's an example where we would want people to be able to install it into a JupyterHub so they can use it there. It does use the same code-server infrastructure that VS Code does.

So, for example, we're telling people: do not teach with Positron this fall. If you're someone who's a teacher, this fall is too early, but if you're interested, reach out and we could talk about the spring. Certainly by next year, we would expect it to be a good fit for teaching Python and data science.

Could you talk about distributed computing in relation to these languages? In the past it was so much work. Are there some good ways to utilize distributed computing? That's a great question. I may have to punt that question to someone who is more of an expert than me on that particular case. Most of the ways I have used distributed computing have been from inside these high-level scripting languages, where the interface goes, say, from Python down to the lower-level thing, then out to workers, and then back in. I still think there are some real usability challenges around these kinds of workflows: in which cases do they work, and in which do they not? The process of setting up these kinds of environments that bring everything you need out and then back in is something that I think is not a solved problem.

The area where I worked on this the most was when I was working on machine learning software for R for a couple of years. When you want to do something like tune hyperparameters for some model, you take your data, you try a whole bunch of different hyperparameters, so you send the work out to a bunch of workers and you bring the results back in. We experienced some real tensions around how this works on different operating systems, and even on different versions of those operating systems. So that's an area that I think is not quite solved and could use some real investment.
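The fan-out/fan-in workflow described here, sending each hyperparameter combination out to a worker and gathering the scores back in, can be sketched with Python's standard library alone. The model and scoring function below are hypothetical stand-ins for a real fit-and-evaluate step, not anything from the tools mentioned in the talk:

```python
from concurrent.futures import ProcessPoolExecutor

def fit_and_score(params):
    # Hypothetical stand-in for fitting a model with these
    # hyperparameters and returning a validation score.
    alpha, depth = params
    return {"alpha": alpha, "depth": depth,
            "score": 1.0 / (1.0 + alpha) + 0.01 * depth}

# A small grid of hyperparameter combinations to try.
grid = [(a, d) for a in (0.1, 1.0, 10.0) for d in (2, 4, 8)]

if __name__ == "__main__":
    # Fan out: each combination is evaluated in a worker process;
    # fan in: results come back to the main process in grid order.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(fit_and_score, grid))
    best = max(results, key=lambda r: r["score"])
    print(best["alpha"], best["depth"])
```

The cross-platform tensions she mentions show up exactly here: process pools behave differently across operating systems (fork vs. spawn start methods), which is one reason this still isn't a solved problem.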

This maybe leads me to my last question, and a nod to the sprints we will have this weekend at a location to be determined. Are there any issues open that are easily approachable during this weekend's sprints? I don't know, I haven't looked at them. Oh, do you mean for Positron? That's a really interesting idea. We've actually had some of our first contributions from people outside of our team, even in the last ten days, which has been so exciting to see. One thing people were able to get going right away: there was a user who really loves the code cells feature. I don't know if you have used this in VS Code; it's not a full Jupyter notebook, but in a .py file, you write these little code cells and it gives you a kind of notebook-lite experience. Anyway, that's somewhere people have been able to contribute right away.
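For context, the code cells feature being described follows the `# %%` convention: a plain `.py` file is divided into notebook-like cells by comment markers, and compatible editors let you run each cell interactively while keeping an ordinary script on disk. A minimal sketch:

```python
# %% [markdown]
# Cells in a plain .py file are delimited by "# %%" comment lines;
# compatible editors run each cell on its own, like a lightweight notebook.

# %% Define some data in one cell
import statistics

values = [3, 1, 4, 1, 5, 9, 2, 6]

# %% Compute a summary in a separate cell
mean = statistics.mean(values)
print(mean)  # → 3.875
```

Because the file is still valid Python, it also runs top to bottom as a normal script, which is much friendlier to version control than a notebook's JSON format.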

So if you are someone who has built a VS Code extension, or you have ever worked in VS Code itself, which I know is not a lot of people, that's where a lot of the work comes from. If you've ever built a VS Code extension and put it up on the marketplace or on Open VSX, there's a lot of work in Positron that happens in these kinds of built-in extensions. So I think that would be the best place to look: look for issues that are tagged for the extensions, because that's the easiest way to have that iterative development workflow where you can make a change and then see right away what it's doing. So I would say, look for something that's for one of the built-in extensions.

Well, thank you again, Julia, for your talk and thank you for the questions.