
Building Multilingual Data Science Teams (Michael Thomas, Ketchbrook Analytics) | posit::conf(2025)
Building Multilingual Data Science Teams Speaker(s): Michael Thomas Abstract: For much of my career, I have seen data science teams make the critical decision of whether they are going to be an “R shop” or a “Python shop”. Doing both seemed impossible. I argue that this has changed drastically, as we have built out an effective multilingual data science team at Ketchbrook, thanks to polars/dplyr, gt/great-tables, ggplot2/plotnine, arrow, duckdb, Quarto, etc. I would like to provide a walkthrough of our journey to developing a multilingual data science team, lessons learned, and best practices. posit::conf(2025) Subscribe to posit::conf updates: https://posit.co/about/subscription-management/
image: thumbnail.jpg
Transcript#
This transcript was generated automatically and may contain errors.
I'm Michael Thomas. I'm the Chief Data Scientist at Ketchbrook Analytics, which is a data science consulting firm, and believe it or not, we use both R and Python together. This is a LinkedIn post that was making the rounds a few months ago. It sounds like some of you in this room may have seen it. It got a lot of buzz online, reigniting the so-called language wars reminiscent of the pre-Elon Twitter days, but I digress. My experience with the R versus Python debate is a little bit different, and it doesn't take a side.
So this might be a hot take, but I believe that almost everything that used to make Python awesome that wasn't in R has since been ported over to R, and vice versa. Everything that used to make R awesome that wasn't in Python has since been ported over to Python.
So Michael, if all of the good stuff is in both languages, then why not just pick one? I can think of at least three reasons why. First, if you're looking for good data science candidates to build out your team, limiting your search to only individuals that know one language or the other can slow the hiring process. Second, once in a while, I will run into a problem where there's a package in one language that solves that problem that doesn't exist in the other language. Lastly, I work at a consulting firm, and the fact that we can offer our clients the option to choose which language we're going to build their solution in is a huge value proposition for us.
The secret to multilingual teams: technical empathy
So now that I've clearly convinced you that a multilingual data science team can have a lot of benefits, I'm going to let you in on the secret to actually making that work. And this is going to sound non-technical, but I've found that the key to making multilingual data science teams work is an extra dose of empathy. And I'd argue that empathy is an important ingredient in any data science team, but especially so for teams that are trying to use R and Python together.
And I want to focus on the technical things that you can do that are empathetic towards others on your team. And I'm going to break it down into three areas. Environment management, package choices, and documentation. And I like to think of these as the three pillars of technical empathy, and we'll revisit this pillar concept throughout this presentation.
I've found that the key to making multilingual data science teams work is an extra dose of empathy.
Pillar one: environment management
And let's start with the first pillar, environment management. In the pillar of environment management, empathy means helping your team spend more time creating value and less time troubleshooting technical issues.
But it works on my machine. Have any of you ever heard this phrase, maybe even uttered it yourself? A little show of hands out there. I won't ask you if you've said it yourself, but at least heard it. There are still lots of folks dealing with this today, day in and day out. And there are also people who don't care because they're essentially a lone wolf. I still think those people should care about this; it's a short-sighted approach. But today we're focusing on data science teams, not lone wolves, and on the idea that you want the code you're writing on your own laptop to also run successfully somewhere else.
I believe successful environment management boils down to consistency across three things. The operating system, the version of Python or R that you're using, and the versions of the individual R or Python packages that you're using on that project.
So let's talk about the first two. On the operating system and Python or R version side, you can typically use the same tooling for managing both of those. There are two best approaches to doing this, leveraging third-party tools like Posit Workbench, hosted on a dedicated server or cloud environment, that allow you to set up your environment in a more simplified kind of point-and-click approach.
Or you can leverage open source tooling where you will essentially script your environment. And if you choose to go that open source approach, know that it's going to require some DevOps skill sets. So you'll need to ask yourself if that's something you want to undertake or not, and weigh that against the cost benefit of something like a Posit Workbench that abstracts a lot of that away from you.
On the package versioning side, using virtual environments, which allow you to manage package versions on a project-by-project basis instead of relying on some sort of global package library, can be really efficient for collaboration and for ensuring that you and the rest of your team are working within the same setup. And I think there are essentially two leaders these days in that area: uv on the Python side and renv on the R side.
Pillar two: package choices
So we just talked about package versions, but now let's talk about choosing which packages to use at all. In the second pillar of package choices, empathy means syntax similarity for easier code review and collaboration.
Here's a big question that I ask myself every day. How easy would it be for someone who knows R to review my Python code, or vice versa? And the choice of packages you use can really impact how successful your multilingual data science team will be.
And this somewhat goes back to my slide where I mentioned that everything great in each language had been ported to the other language. But I really think it actually goes beyond that. I think the great ideas are being borrowed from each language.
And the Python package Polars is a great example. The author of Polars, Ritchie Vink, has mentioned in a bunch of interviews that Polars takes a lot of inspiration from the tidyverse in R. And in a minute here, I'll show you just how true that is.
And we know that data prep is 80% of data science. So the methodology and the opinions that you choose are a significant investment for your team.
At Ketchbrook, our choice is typically Polars for Python data prep and dplyr for R data prep because not only are they two of the most popular data wrangling libraries in their respective languages, but they're also syntactically really, really similar.
If you want to filter data in your data frame, use the filter function in either package and specify the column and its constraint. Do you need to select a column or a specific set of columns? Use the select verb. And because nothing is completely perfect, we have the arrange verb in dplyr and sort in Polars, with a slight difference in how you control direction: desc() is a function in dplyr, whereas descending is an argument on the Polars side.
But lastly, we have the trusty head function, available in both base R and Polars, for grabbing the first five rows of data after the upstream ETL takes place. And these are just a few functions, but if you went out today and further explored the dplyr and Polars packages, you'd continue to find these syntax similarities all over the place.
But on multilingual teams, there's another interesting option that's perhaps even more language agnostic. It's actually courtesy of a posit::conf keynote speaker last year, and that's DuckDB. If you haven't heard of it, DuckDB is a zero-dependency analytics database engine that allows you to do data wrangling in SQL at lightning-fast speeds. I'm underselling it. If you haven't tried it, I really encourage you to check it out.
But beyond that, we have great libraries in both R and Python that allow us to call DuckDB. In each language, there's a bit of setup in the front to connect to the database and some small syntactic differences at the end of the script. But everything in the middle is exactly the same. The year is 2025 and SQL will not die. In fact, I'd argue that perhaps it's as popular and hot as ever. And using DuckDB is a great way to still have a high-powered analytics engine that's easily portable across languages.
The year is 2025 and SQL will not die. In fact, I'd argue that perhaps it's as popular and hot as ever.
Let's take a look at another example, ggplot2 in R and plotnine in Python. With the exception of quoting column names and wrapping the entire expression in parentheses on the Python side, the code is literally identical. And one last example I'll give is gt in R versus great-tables in Python. Cue the office meme with Pam saying they're literally the same picture.
And these last two examples I just gave are no accident. Posit is behind all four of these packages, ggplot2, plotnine, gt, and great-tables. And they have put a lot of work into structuring these packages in a way that preserves a ton of syntax similarity.
Pillar three: documentation
And our final pillar today, documentation. Empathy means providing clear guides and road maps for collaborators, including your future self. So here's my public service announcement. Please treat documentation as a first-class citizen in your software development practices.
Years ago, I heard someone say, I think it was J.D. Long, that software shouldn't be a bunch of code with a little bit of documentation. Actually, the ratio should be the inverse. It should be a lot of documentation with a little bit of code.
Consider this simple function I've put together that rounds a number to a certain number of digits and forces it to round up. You don't need to care about what it's doing, and the syntax is very similar across both R and Python. But the point here is going to be around the documentation of the function. In R, we have roxygen2 as the prevailing framework for function documentation. In Python, we have a couple of options and a little more flexibility: the two prevailing conventions seem to be Google-style docstrings and the NumPy/SciPy (numpydoc) style. Within our team at Ketchbrook, we tend to go with the SciPy approach, really only because I find it a little more readable, and I know SciPy, as an organization, is data science-focused.
So let's compare the roxygen2 framework to the Python SciPy framework. In both cases, you can just put your function description at the beginning of the documentation. Function arguments are defined in @param tags in R and under a Parameters section in Python. And you might note that Python traditionally tends to be more explicit about supplying argument data types than we're used to in R; while it's not required by roxygen2, we've tried to adopt that in R as well. The output of the function is documented in a @return tag in R and a Returns section in Python. And examples are always critical and similarly structured in both cases. So when we leverage these parallels in how we document our functions across the two languages, it's yet another relatively simple way to make it easier to cross the language aisle and work together.
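As a sketch of the kind of round-up function described, documented in the SciPy (numpydoc) style: the implementation details here are my own illustration, not the talk's exact code, and the roxygen2 version would carry the same sections as @param, @return, and @examples tags.

```python
import math

def round_up(x, digits=0):
    """Round a number up to a given number of decimal places.

    Parameters
    ----------
    x : float
        The number to round.
    digits : int, optional
        Number of decimal places to keep. Defaults to 0.

    Returns
    -------
    float
        ``x`` rounded up (toward positive infinity) at ``digits`` places.

    Examples
    --------
    >>> round_up(2.341, digits=2)
    2.35
    """
    # Shift the decimal point, take the ceiling, then shift back
    factor = 10 ** digits
    return math.ceil(x * factor) / factor
```

The numpydoc sections (description, Parameters with types, Returns, Examples) map one-to-one onto the roxygen2 tags, which is what makes reviewing documentation across the language aisle so painless.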
So jumping now to our approach to what I'll call GitHub documentation via issues and pull requests. If you're on a data science team, there are probably a few approaches to how you could structure your collaborative workflows. I'm going to show you our current approach, which I believe is not only ours, but very similar to organizations that we've seen, including Posit. So the first step would be a user or a developer identifies a bug in the software or comes up with an enhancement that they're suggesting we implement. And they write this up in a detailed issue. Once that issue has been created, when we're ready to actually work on making the required code changes, we first create a new Git branch in the repository, specifically for the purposes of addressing that issue. Then the developer goes off and writes or modifies the code that fixes the bug or adds the enhancement. And once the developer feels like the updated code is in a good spot, they can write up a pull request or a PR for short that details how they addressed the issue, what changed, and more.
So let's focus first on issues. In my opinion, a good issue is composed of three different parts. First, an overview of the bug you're encountering or the enhancement you're proposing, including the rationale for why you feel a change to the code base is necessary. Second, a reprex, a reproducible example, a.k.a. some code that someone else can run to identify the bug or show the current shortfall that the proposed enhancement will overcome. And lastly, a discussion, which may or may not include code, regarding possible solutions. So I thought I'd show an example of a real-world GitHub issue we've authored that contains this structure: an overview section, a reproducible example section with some code that can be run, and a possible solution included as well.
On the pull request side, it's very similar. A good pull request, in my opinion, is also composed of three parts. First, the purpose of the pull request and the associated issue or issues it addresses. Second, a more in-the-weeds section that details the technical aspects of how the issue was addressed in code, as well as any design decisions that were made and any hurdles that were overcome along the way. Lastly, instructions and code that a reviewer can run in order to see the impact of the changes you made. So I'm going to pick on Ivan on our team in the front row for a minute here, because he does an awesome job at this. Here's an example of a GitHub pull request he authored containing those three components I just mentioned: an overview section, a detail section, a test section, and even a discussion section that, in this case, provided some additional context he felt was important for the reviewer to know.
Putting it all together
So let's put this all back together here. I think it's easy to say that successful team collaboration requires empathy, but these are the technical ways that we've hacked empathy within our multilingual data science team that seem to be keeping everyone happy and productive. Lastly, I want to make a couple quick shout-outs. Thank you to Posit and Articulation, as well as Emily Riederer. I wanted to mention her. She previously gave a talk called Python Rgonomics, which was a big inspiration for this talk and for how we think about multilingual data science at Ketchbrook. If you haven't seen it, I recommend checking it out. I think it's a nice complement if you're interested in exploring this topic further. And if you'd like to get in touch with me, you can find me on LinkedIn, BlueSky, GitHub, or through our company website.
Q&A
Do you see R-like Python packages like plotnine or Polars being used more outside of just multilingual teams? How transferable are these skills to purely Python teams versus using more standardized packages like Pandas or Matplotlib?
That's a good question. I think I'm sure that there's a lot of legacy code and workflows that folks have on teams that are typically just using one language or the other. But I think you have to think about the rationale for why these packages were created in the first place. And I think Hadley said something interesting either yesterday or the day before, where the idea is to be able to get what's in your brain out as fast as possible, right, with maybe as little code in the middle. And I think a lot of these packages that we're talking about, plotnine, Polars, some of the newer advancements, are trying to do that regardless of whether you care about them being multilingual or not.
Okay. One more. Do you think there's a need for Python versions of the devtools or usethis packages?
Yes, sure. I mean, it's fine with me. I think there's much more standardization, and obviously enforcement, on the R side in terms of package development that's lacking on the Python side, and that could be beneficial.

