Alenka Frim & Nic Crane - Mixing R, Python, and Quarto: Crafting the Perfect Open Source Cocktail

Transcript

This transcript was generated automatically and may contain errors.

Let's start. I'm a mother of two, so I have to start with a story, right? Whenever I'm away for a while, I do miss my kids, and I look forward to seeing them again. But then, coming back home, opening the door, I'm met with excitement. Stomping feet, you know, hands reaching, grabbing, the voices louder and louder and louder until the point of squeaking. It's fair to say I get a bit overwhelmed by the attention.

So, to give a bit more context: Alenka and I met when we were both working as apprentices on the Apache Arrow project, learning how to be open source maintainers. And it was a really exciting time. Apache Arrow is a bit of a unique open source project in that, because there are so many different implementations of Arrow in different languages, you've got a huge number of people all collaborating together on one repository, plus the multiple other ones in the wider Arrow ecosystem. So, this can be really advantageous in a lot of ways. There's just so much collaboration happening that wouldn't have otherwise. But actually, there can be some challenges associated with that.

The scale of Apache Arrow

Let's look at the GitHub issue tracker of Apache Arrow. This is the picture you see when you arrive there. It's quite a big number of issues: over 4,000 still open, and over 20,000 already closed. But if we zoom in on just the last year, and only on R and Python because that is the part we work on, there were over 500 pull requests submitted and over 1,000 issues opened. And the interesting thing is that almost half of those issues were opened by new contributors, meaning each of those issues was opened by a different person, and for every one of those people, it was the first issue they had opened on the repo.
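
The new-contributor figure described above comes down to a small calculation. The following is a minimal sketch of one way to compute it, not the speakers' actual code: the `Issue` record, the function names, and the sample data are all hypothetical, and real data would come from the GitHub API rather than a hard-coded list.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Issue:
    """Hypothetical minimal record of an issue: number, author login, open date."""
    number: int
    author: str
    opened: date


def first_time_issues(issues, since):
    """Return issues opened on or after `since` that were their author's
    first-ever issue in `issues`."""
    seen_authors = set()
    firsts = []
    # Walk the full history in chronological order so that "first" is
    # judged against everything the author opened before, not just the window.
    for issue in sorted(issues, key=lambda i: i.opened):
        if issue.author not in seen_authors:
            seen_authors.add(issue.author)
            if issue.opened >= since:
                firsts.append(issue)
    return firsts


def new_contributor_share(issues, since):
    """Fraction of issues opened on or after `since` that were first issues."""
    recent = [i for i in issues if i.opened >= since]
    if not recent:
        return 0.0
    return len(first_time_issues(issues, since)) / len(recent)


# Made-up sample data, for illustration only.
SAMPLE = [
    Issue(1, "alice", date(2022, 5, 1)),
    Issue(2, "bob", date(2023, 2, 1)),
    Issue(3, "alice", date(2023, 3, 1)),
    Issue(4, "carol", date(2023, 4, 1)),
    Issue(5, "bob", date(2023, 5, 1)),
]

print(new_contributor_share(SAMPLE, date(2023, 1, 1)))  # prints 0.5
```

In the sample, four issues fall inside the window and two of them (from "bob" and "carol") are first issues, which gives the one-half share.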

So, this is quite exciting. But what problems can this cause? We're tracking information from GitHub issues and pull requests. We've got questions on Stack Overflow. And we've also got the mailing list. This is a huge amount of information to manage. And when you think about the work done on an open source project, the important work on the code base is responding to issues, fixing bugs, adding features, and doing releases, not really managing the project itself. What we want is to be able to quickly get to the right information so we can get on with the fun stuff.

And as well as that, it can be really tricky just to get an idea of what's going on and to see what's changing over time. So, for example, let's say we had an increase in issues opened since the last release. That might indicate bugs that we're not aware of and that we want to get fixed. But there are also contributors. Contributors are the lifeblood of any open source project. But if you're new to a project, let's say you open up an issue or maybe even a pull request, and nobody gets back to you for weeks or even months, you're going to be discouraged. And you might not want to continue to stay involved. So, what we really want is to be able to get to these new contributors as quickly as possible and really just get that collaboration going.
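
As an illustration of the "get to new contributors quickly" idea, here is a minimal sketch of how unanswered items from first-time contributors could be flagged. It is an assumption about how such a check might look, not the dashboard's actual logic: the `OpenItem` record, the `needs_attention` function, the seven-day threshold, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class OpenItem:
    """Hypothetical record for an open issue or pull request."""
    number: int
    author: str
    opened_at: datetime
    first_response_at: Optional[datetime]  # None if nobody has replied yet
    author_is_new: bool                    # True for first-time contributors


def needs_attention(items, now, max_wait=timedelta(days=7)):
    """Return items from new contributors that have had no response for
    longer than `max_wait`, longest-waiting first."""
    waiting = [
        item
        for item in items
        if item.author_is_new
        and item.first_response_at is None
        and now - item.opened_at > max_wait
    ]
    return sorted(waiting, key=lambda item: item.opened_at)


# Made-up sample data, for illustration only.
NOW = datetime(2023, 9, 20)
ITEMS = [
    OpenItem(101, "dana", datetime(2023, 9, 1), None, True),
    OpenItem(102, "eli", datetime(2023, 9, 18), None, True),
    OpenItem(103, "fay", datetime(2023, 8, 1), datetime(2023, 8, 2), True),
    OpenItem(104, "gus", datetime(2023, 8, 15), None, False),
    OpenItem(105, "hal", datetime(2023, 8, 20), None, True),
]

print([item.number for item in needs_attention(ITEMS, NOW)])  # prints [105, 101]
```

Sorting by the open date puts the contributor who has waited longest at the top of the list, which matches the goal of not leaving anyone unanswered for weeks.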

Well, actually, when I think about the problems that we're talking about here, these aren't problems that are unique to Arrow or unique to open source projects or even just kind of unique to us. Like, everybody's got things to think about in terms of how do we foster collaboration between different people with different things going on? And how do we reduce information overload? So, what we're going to talk to you about today is a dashboard that we created to manage all of this information. But actually, this isn't just about the dashboard. It is about building the dashboard and the technology choices we made there. But it's also about building diverse collaborations that allow everybody to contribute in their own way. And building curiosity for the ideas and technology that enable all of this.

We both feel it's really important that everyone is allowed to contribute in their own way.

And then, with the tools we chose, it was much easier to work together in this process. So, as Nic mentioned, we didn't have to climb different ladders to get to the solution. We used one and helped each other climb it. It was faster, nicer, and much more fun.

We started with the Python and R scripts that Nic already had for dealing with the sources of information. I took them, leveled them up, and connected them together in the dashboard, and we both worked on how to pass data between R and Python. Nic did the core dashboard and also the GitHub Actions deployment. We both worked on the display content of the dashboard itself, some in R, some in Python. And what we needed to do here was iterate on the feedback and update it to select the data we cared about most.

So, Nic at the end said, this is cool, I learned Python environments and dependency management. I learned about Quarto, which ended up being really simple to use. Plus, we both had to learn to select the data that would help us solve the problem, because this is not so clear. You have to iterate, you have to find and select. So, the different skill sets we have and our different backgrounds made us work on and learn things we wouldn't have otherwise. And ultimately, we got much further than we would have if we were working alone.

We have a challenge for you, and with that we'll end this talk. We would like to challenge you (there's still some time today, it's not the end yet) to talk to somebody at the conference about a problem you're interested in or trying to solve. It doesn't have to be tech. Talk and see what happens. Thank you.

Featured software

Quarto