Alenka Frim & Nic Crane - Mixing R, Python, and Quarto: Crafting the Perfect Open Source Cocktail

Collaborating effectively on a cross-language open-source project like Apache Arrow has a lot in common with data science teams, where the most productivity is seen when people are given the right tools to enable them to contribute to the programming language they are most familiar with. In this talk, we share a project we created to combine information from different sources to simplify project maintenance and monitor important metrics for tracking project sustainability, using Quarto dashboards with both R and Python components. We'll share the lessons we learned collaborating on this project - what was easy, where things got tougher, and concrete principles we discovered were key to effective cross-language collaboration.

Talk by Alenka Frim and Nic Crane
Slides: https://github.com/arrow-maintenance/arrowdash/blob/main/other/PositConfTalk2024.pdf
GitHub Repo: https://github.com/arrow-maintenance/arrowdash

Oct 31, 2024
16 min

Transcript

This transcript was generated automatically and may contain errors.

Let's start. I'm a mother of two, so I have to start with a story, right? Whenever I'm away for a while, I do miss my kids, and I look forward to seeing them again. But then, coming back home, opening the door, I'm met with excitement. Stomping feet, you know, hands reaching, grabbing, the voices louder and louder and louder until the point of squeaking. It's fair to say I get a bit overwhelmed by the attention.

And somewhat similar feelings can be triggered when maintaining an open source project, especially if it's a big one. In our case, that's Apache Arrow.

So, to give a bit more context then, Alenka and I met when we were both working as apprentices on the Apache Arrow project, learning how to be open source maintainers. And it was like a really exciting time. Apache Arrow is a bit of a unique open source project in that because there are so many different implementations of Arrow in different languages, you've got a huge number of people all collaborating together on one repository, and on the multiple other ones in the wider Arrow ecosystem. So, this can be really advantageous in a lot of ways. There's just so much collaboration that happens that, like, wouldn't have otherwise. But actually, there can be some challenges associated with that.

The scale of Apache Arrow

Let's look at the GitHub issue tracker of Apache Arrow. This is the picture you see when you arrive there. It's quite a big number of issues. Over 4,000 still open. Over 20,000 were already closed. But if we zoom in on just the last year, and only R and Python because that is the part we work on, there were over 500 pull requests submitted, over 1,000 issues opened. And the interesting thing is that almost half of those issues were opened by new contributors, meaning each issue was opened by a different person. And for every person, that was the first issue they opened on the repo.

So, this is quite exciting. So, what are the problems this can cause? So, we're tracking information from GitHub issues and pull requests. We've got questions on Stack Overflow. And we've also got the mailing list. And this is a huge amount of information to manage. And in terms of thinking about the work done on an open source project, the important work in terms of the code base is responding to issues, fixing bugs, adding features, doing releases, but not really managing the project itself. What we want is to be able to quickly get to the right information so we can get on with the fun stuff.

And as well as that, it can be really tricky just to get an idea of, like, what's going on and seeing what's changing over time. So, for example, let's say we had an increase in issues opened since the last release. That might indicate bugs that we're not aware of that we want to get fixed. But also contributors. So, contributors are the lifeblood of any open source project. But if you're new to a project, let's say you open up an issue or maybe even a pull request, and nobody gets back to you for weeks or even months, you're going to be discouraged. And you might not want to continue to stay involved. So, what we really want is to be able to get to these new contributors as quickly as possible and really just get that collaboration going.

Well, actually, when I think about the problems that we're talking about here, these aren't problems that are unique to Arrow or unique to open source projects or even just kind of unique to us. Like, everybody's got things to think about in terms of how do we foster collaboration between different people with different things going on? And how do we reduce information overload? So, what we're going to talk to you about today is a dashboard that we created to manage all of this information. But actually, this isn't just about the dashboard. It is about building the dashboard and the technology choices we made there. But it's also about building diverse collaborations that allow everybody to contribute in their own way. And building curiosity for the ideas and technology that enable all of this.

The dashboard

So, let's first take a look at the dashboard itself. We mentioned the sheer number of issues and pull requests created. So, what we wanted to do first is to kind of split the information so we can get to the right part quickly. So, we did two pages. One for Python, the other for R. Both of the pages have the same structure, though.

We also wanted to have an understanding of what's going on at a higher level. So, the first thing we see here is the Summary tab. In the Summary tab, we can see, for example, issues that have been created lately or PRs and similar information.

We wanted to prioritize new contributors, as mentioned. So, in the individual tabs for issues and PRs, you can see the list of those and the highlight. See? So, this, what's highlighted, is the issues or PRs opened by a new contributor, so we can identify them easily and get to that first.
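As a rough illustration of that highlighting, here is a minimal Python sketch. It assumes each issue or PR is a plain dict with hypothetical `author` and `created` fields, and it treats an author as new when they have no earlier item in the list. The dashboard's own code may well define a new contributor differently.

```python
from datetime import date

def flag_new_contributors(items):
    """Mark each item whose author has no earlier item in the list."""
    seen = set()
    flagged = []
    # Walk items oldest first so "first item by this author" is well defined.
    for item in sorted(items, key=lambda i: i["created"]):
        is_new = item["author"] not in seen
        seen.add(item["author"])
        flagged.append({**item, "new_contributor": is_new})
    return flagged

# Hypothetical sample data, not real Arrow issues.
issues = [
    {"title": "Crash on read", "author": "ana", "created": date(2024, 5, 1)},
    {"title": "Docs typo", "author": "ben", "created": date(2024, 5, 3)},
    {"title": "Follow-up crash", "author": "ana", "created": date(2024, 5, 7)},
]
flagged = flag_new_contributors(issues)
```

A display layer could then style rows where `new_contributor` is true, which is the effect described for the Issues and PRs tabs.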

The next, yes, it's the Stack Overflow tab. Here you can find questions asked on Stack Overflow connected to Python or R. Again, we highlighted information that's important to us, which is, is this answer accepted yet or not? So, if not, we can get to that first.
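The "get to that first" idea can be sketched in a few lines of Python. This assumes each question is a dict with a hypothetical boolean `accepted` field; how the dashboard actually fetches and stores this is not described in the talk.

```python
def unaccepted_first(questions):
    """Order questions so those without an accepted answer come first."""
    # False sorts before True, and sorted() is stable, so the original
    # order is preserved within each group.
    return sorted(questions, key=lambda q: q["accepted"])

# Hypothetical sample data.
questions = [
    {"title": "Read parquet in pyarrow", "accepted": True},
    {"title": "arrow R filter error", "accepted": False},
    {"title": "Timestamp conversion", "accepted": False},
]
ordered = [q["title"] for q in unaccepted_first(questions)]
```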

Last, but not least, the Mailing List. I know it's a bit of an old-school thing, but a note we want to add here is that for any Apache project, this is the most important communication channel. And if there's a topic that's important, it's discussed on the Mailing List. So, here you can find a list of emails that have been sent in the last month.
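Selecting the last month of emails could look something like the sketch below. The `subject` and `sent` fields, the 30-day window, and the sample messages are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def recent_emails(emails, now, days=30):
    """Keep emails sent within the last `days` days, newest first."""
    cutoff = now - timedelta(days=days)
    recent = [e for e in emails if e["sent"] >= cutoff]
    return sorted(recent, key=lambda e: e["sent"], reverse=True)

# Hypothetical sample data; `now` is fixed so the example is reproducible.
now = datetime(2024, 8, 14)
emails = [
    {"subject": "[DISCUSS] Release schedule", "sent": datetime(2024, 8, 1)},
    {"subject": "[VOTE] Old proposal", "sent": datetime(2024, 6, 2)},
    {"subject": "[DISCUSS] New format idea", "sent": datetime(2024, 8, 10)},
]
subjects = [e["subject"] for e in recent_emails(emails, now)]
```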

Technology choices

So, this is the dashboard we made, and now we're going to talk about how we got there. So, in terms of how we made the dashboard, we decided to use Quarto dashboards. So, this is a feature of Quarto that's been available since January this year, and it allows you to create dashboards, but using Quarto kind of syntax. So, in terms of the different technologies we were trying to choose between before we decided on Quarto dashboards, we considered Shiny, creating it as a Quarto document, or the one that we went with in the end, a Quarto dashboard.

So, here are a few of the criteria that we had to make this decision. So, the first thing was that we wanted something with a dashboard aesthetic. We didn't want a document that we had to scroll through. We just wanted to be able to look at it, click on a couple of things, and get the information that we needed.

The next thing that was important to us was the fact that with Quarto dashboards, you can work with markdown syntax. So, we were thinking about what is this project about? And really, the information in it, it doesn't really change that often. We were refreshing the data like once a day, and we didn't need a huge amount of interactivity. And Shiny is great, but it felt like just too much for this project, because the idea here was to reduce the maintenance burden. And so, being able to work with markdown syntax made life a lot easier for us.

Another thing we had to consider was deployment of the dashboard. So, we decided we wanted to deploy it with GitHub Actions, mainly because that's something that I'm familiar with, and that was something we could get working quickly. And actually, you can deploy Shiny apps, when it's Shinylive, with GitHub Actions, but it still just felt simpler to go with the Quarto dashboard option.

But actually, the most important thing for us was that we wanted to work in something where we could have both R and Python combined. We didn't want to divide the project up between, okay, this bit is yours, so you can do that in Python, and this bit is mine, so I'll do that in R. We wanted to be able to both work on all of the different bits. So, we could just have fun and enjoy it.

Combining R and Python in Quarto

So, just having a bit of a think about the technological decisions and choices we had to make, when combining R and Python in a Quarto dashboard, I think there's one decision that has to be made up front, and that's whether you start from R or you start from Python. So, starting from R, you're using the knitr engine and reticulate to pass data between R and Python. Starting from Python, it's using Jupyter as the engine and using rpy2.

And we were thinking, okay, can we do a bit of a compare and contrast of these? But to be honest, when we were kind of building this, we found they're pretty similar approaches. In terms of the project, we went with starting from R solely because I was setting that up at the time, and it was what I was familiar with. So, thinking about kind of the structure of the dashboard and the code inside it, we started off having a mix of scripts. So, some were in R, some were in Python. And then we had this index Quarto markdown document that orchestrated all of that. So, that was in a mix of R and Python.

But then we had this data Quarto markdown document, and that actually enabled us to use a really cool feature of Quarto. So, there's this concept of Quarto includes. So, includes allow you to reuse information or code across multiple documents, or in our case, multiple times in the same document. So, this code snippet you can see now is from that index QMD file. And we've got the high-level headings, Python or R. We've got a variable, lang, which is a string. So, we set that to Python or R. And then we include the data Quarto markdown document, and that's it. And we get the same dashboard for the different languages.
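The slide with the snippet is not reproduced in this transcript. As a loose analogy only, the same reuse pattern can be sketched in plain Python: one shared "template" function, called once per language with a different `lang` value. This is not Quarto's include mechanism, and `render_section`, the field names, and the sample data are hypothetical.

```python
def render_section(lang, items):
    """Shared 'template': the same layout for whichever language is passed."""
    titles = [i["title"] for i in items if i["lang"] == lang]
    lines = [f"# {lang}", f"Open issues: {len(titles)}"]
    lines += [f"- {t}" for t in titles]
    return "\n".join(lines)

# Hypothetical sample data shared by both pages.
items = [
    {"title": "pyarrow wheel missing", "lang": "Python"},
    {"title": "arrow R build fails", "lang": "R"},
    {"title": "pandas conversion bug", "lang": "Python"},
]

# Set `lang`, then reuse the same body, once per language.
pages = {lang: render_section(lang, items) for lang in ("Python", "R")}
```

Adding another language then amounts to one more value in the tuple, which mirrors the point about rolling the dashboard out to other parts of the community.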

And the reason this was really important to us is that when we started this project, Alenka and I didn't want this to be a collaboration between us that only benefited the two of us. We wanted to be able to roll this out to other bits of the Arrow dev community. So, let's say the Arrow C++ developers would make good use of this dashboard. We could just roll this out, like, in about 30 seconds to them. So, that was super important to us.

Okay. So, the last kind of technical question we wanted to ask is, like, how much R and how much Python did we end up with in the end? So, quite conveniently, GitHub on a repository page actually has this information. So, we found we ended up actually with a pretty even split, which was pretty cool.

Building diverse collaboration

So, yeah. So, that's all of the technological choices and the thinking around that. But actually, I don't think this is the most interesting thing about this project. I think it's actually much more interesting thinking about exactly how we got there. And the first thing we needed to do was build on curiosity for both technology and ideas. We really needed to start with having honest discussions about our current approaches to maintenance. We realized we struggled with the same difficulties, but we did approach them from a different perspective.

I, in the end, set up email filters for GitHub notifications and mainly used that. Nic had various scripts that handled all the different sources we had. We both felt overwhelmed by the amount of information and we realized we were missing things. Things that were important, right? For example, the new contributors. It's really important and we felt like we were both frustrated by the fact that we didn't get to them as fast as we wanted. Or maybe we didn't even get to them. So, this had to be a part of the solution.

The sharing, getting an opinion from somebody else, a new perspective for us created new ideas. And the process became exciting. It was highly motivating to move towards a solution together. And the cool part at the end was that the solution became something that other contributors, other maintainers could also use.

Which leads me to my next point of building diverse collaboration. We both feel it's really important that everyone is allowed to contribute in their own way. We have different skill sets. I like both R and Python, but the most comfortable I feel is in Python. Nic can also do both, but the most comfortable she feels is in R. Nic's head is bubbling with good ideas. I like to tidy up things, right? So, we have these different skill sets.

And then, together with the tools we chose, it was much easier to work together in this process. So, we didn't have to, as Nic mentioned, we didn't have to climb different ladders to get to the solution. We used one and helped each other climb it. It was faster, nicer, and much more fun.

If we start with the Python and R scripts that Nic already had for dealing with the sources of information, I took them, leveled them up, connected them together in the dashboard, and we both worked on how to pass data between R and Python. Nic did the core dashboard and also the GitHub Actions deployment. We both worked on the display content of the dashboard itself, some in R, some in Python. And what we needed to do here was iterate on the feedback and update to select the data we cared about most.

So, Nic at the end said, this is cool, I learned Python environments and dependency management. I learned about Quarto, which ended up being really simple to use. So, plus we both had to learn to select the data that will help us solve the problem, because this is not so clear. You have to iterate, you have to find and select. So, these different skill sets we have and the different backgrounds made us work on and learn things we wouldn't otherwise. And ultimately, we got much further than we would if we were working alone.

We have a challenge for you, and with that we'll end this talk. We would like to challenge you to, there's still some time today, it's not the end yet, to talk to somebody at the conference about a problem you're interested in or you're trying to solve. It doesn't have to be tech. Talk and see what happens. Thank you.