Resources

Polyglot Data Science: Why and How to Combine R and Python (Jeroen Janssens) | posit::conf(2025)

Polyglot Data Science: Why and How to Combine R and Python
Speaker(s): Jeroen Janssens
Abstract: Doing everything in one language is convenient but not always possible. For example, your Python app might need an algorithm only available as an R package. Or your R analysis might need to fit into a Python pipeline. What do you do? You take a polyglot approach! Many data scientists hesitate to explore beyond their main language, but combining R and Python can be powerful. In my talk, I'll explain why polyglot data science is beneficial and address common concerns. Then, I'll show you how to make it happen using tools like Quarto, Positron, Reticulate, and the Unix command line. By the end, you'll gain a fresh perspective and practical ideas to start.
posit::conf(2025)
Subscribe to posit::conf updates: https://posit.co/about/subscription-management/


Transcript#

This transcript was generated automatically and may contain errors.

Hello everyone. Welcome. My name is Jeroen Janssens and today I'm going to talk about how you can combine R and Python.

In order to get the job done, you sometimes need to resort to other languages, however strange they may be. And the same goes for data science, right? There might be situations where you need to combine the tools or languages out there. For example, you may have written a Python application that needs to use an algorithm that's only available inside an R package. Or maybe you've done your analysis in R and it needs to be part of a larger ETL pipeline written in Python. So, what do you do in those situations? Well, you take a polyglot approach.

So, I've had a few of those situations over the past few years, and I want to share some stories with you today. Now, this won't be a very technical talk, because I think the main challenge is not going to be the code that you need to write, but your mindset. The way you're going to handle this. Your way of thinking. I really believe that if you want to be effective as a data scientist, you need to be able to recognize when it's okay to use the same language for the entire project and when it's better to call in the help of some other tool or language.

So, who here uses R or Python? Show of hands. R? Python? And who uses both R and Python? Not that many. All right. That's okay. If you're watching this on YouTube, almost everybody raised their hands. Keep your hands raised. Who here is using SQL? Congratulations. You are already a polyglot data scientist. Perhaps without even being aware of it.


So, there are, of course, many other languages and tools available, but in this talk, I'm going to focus on just R and Python. And rest assured, I'm not suggesting that you should transition from R to Python. Although I realize, I've heard, that there are plenty of people who, at the place they work, do need to transition. But I really see this as a spectrum. Right? It's not R or Python. You can combine these two beautiful languages in multiple ways. And which way is best depends entirely on your situation.

Story one: using rpy2 for ggplot2 in a Python team

Okay. Story time. About 13 years ago, I started working as a data scientist at a startup in New York City. I was the first data scientist, and all the engineers there were using Python. So, I'm like, all right. I got this. We've got pandas. We've got notebooks. Oh, my. I don't want to visualize my data using Matplotlib. I had seen what ggplot2 could do. So, what did I do? I took a polyglot approach.

So, the solution conceptually looks like this. Right? And try to think whether you've had a similar situation in the past where this might have been appropriate, or maybe there is something like this currently going on. In my specific situation, there was some IPython magic that I could use that uses the package rpy2 under the hood. So, if you're in a similar situation, I can recommend rpy2. Now, in this particular case, it was fine, because I was the only data scientist doing this.
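
A minimal sketch of what that notebook magic looks like, for the curious. This is not the speaker's actual code: the DataFrame `df` and its columns are made-up examples, and the R cells are shown as comments because they only run inside a Jupyter/IPython notebook with rpy2, R, and ggplot2 installed.

```python
# Build some example data on the Python side (df is a hypothetical example).
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 8]})

# In a notebook with rpy2 installed, loading the extension once
#   %load_ext rpy2.ipython
# gives you the %%R cell magic. A cell can then pull df into R with -i
# and hand it straight to ggplot2:
#
#   %%R -i df
#   library(ggplot2)
#   ggplot(df, aes(x, y)) + geom_point()
```

The `-i` flag converts the pandas DataFrame to an R data frame under the hood, so the plotting code is ordinary ggplot2.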

And for data visualization specifically, as we have seen in the previous presentation as well, there is now the wonderful plotnine package, a Python port of ggplot2. And speaking of which, Posit is currently organizing a contest for plotnine as well as a table contest. So, if you're into visualizing data or creating tables, I recommend that you participate. It could be a nice excuse to try out that other language. The deadline is October 3rd for these two contests. And if you search for 2025 plotnine contest, you'll get to see the announcement with more details.

Story two: the save and load approach

All right. Story time. I was working at a Dutch telecommunications company, at a team that only used R. All right. Fantastic. I still used the tidyverse and ggplot2, I did some fancy dimensionality reduction, and I was asked to create a machine learning model that would then be used on their website. So, I had this nicely wrapped up and it was all good to go. But when I then spoke to the team responsible for serving these models, right, it would be an API kind of situation, they were like, yeah, we don't do that here. We only use Python.

So, luckily, I had used Annoy, which is a C++ library for doing approximate nearest neighbor search. I realize that's really specific. But this particular library has language bindings available for both R and Python. So, what I could do is what I like to call the save and load approach. It doesn't have to be that complicated. You can save intermediate results, say to disk or to a database, and let some other process read them. Right? It doesn't have to be fancy.

This might work. And in our situation we had multiple processes reading from this database, but it could just as well be a single one, and the languages could be flipped. So, this approach is very flexible. But keep in mind that you may need additional coordination, so that you ensure that both sides of the database or disk are being run at the correct times.
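
The save and load approach can be sketched in a few lines. This is a minimal illustration, not the speaker's code: the file name, columns, and values are made up, and CSV stands in for whatever intermediate format (database table, Parquet file) you actually use. Either side could be R instead of Python, as the comments indicate.

```python
import csv
import os
import tempfile

def save_scores(rows, path):
    # Producer side: could just as well be R, e.g.
    #   write.csv(df, path, row.names = FALSE)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "score"])
        writer.writerows(rows)

def load_scores(path):
    # Consumer side: a separate process, possibly in another language,
    # reads the intermediate result back in, e.g. read.csv(path) in R.
    with open(path) as f:
        return [(row["user_id"], float(row["score"])) for row in csv.DictReader(f)]

# One process writes, another (later, elsewhere) reads:
path = os.path.join(tempfile.mkdtemp(), "scores.csv")
save_scores([("a", 0.9), ("b", 0.4)], path)
print(load_scores(path))  # → [('a', 0.9), ('b', 0.4)]
```

The coordination caveat from the talk applies: nothing in the files themselves guarantees that the reader runs only after the writer has finished, so a scheduler or some handshake is needed around this.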

The Unix pipe approach

So, the polyglot approach is nothing new. Yeah? The ability to easily combine multiple tools and languages has been around for over 50 years. 50 years. Since the Unix command line got its pipe operator. So, this is not so much a single story, but over the past 15 years, this has become more a way of working for me. And the idea here is that the pipe operator connects the standard output stream of the first tool to the standard input stream of the second tool. Right? There's no saving to disk. Now, keep in mind that according to the Unix philosophy, text is the universal interface. So, if you have to write and read, you know, a data frame, that will most likely be CSV. That's something to keep in mind. Now, you can send binary data over the pipe, but then you wouldn't be playing nice with the gazillion other tools out there. And again, this approach works for any language. Yeah? And you can make that chain as long as you want. But hey, it's not for everybody. I realize that. But if you do want to give this a go, you might enjoy this book I wrote a couple of years ago. It's available for free online.
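
A pipeline participant only has to read standard input and write standard output. Here is a minimal sketch of such a filter in Python; the script name `filter.py` and the example transformation are made up for illustration.

```python
import io

def upper_filter(in_stream, out_stream):
    # Stream line by line, so arbitrarily large input flows through
    # without ever being loaded into memory entirely.
    for line in in_stream:
        out_stream.write(line.upper())

# As a standalone script (a hypothetical filter.py) this would be
#   upper_filter(sys.stdin, sys.stdout)
# so that it can sit anywhere in a shell pipeline, e.g.:
#   cat tools.txt | python filter.py | sort | uniq

# In-memory demonstration of the same behavior:
out = io.StringIO()
upper_filter(io.StringIO("ggplot2\npolars\n"), out)
print(out.getvalue())  # prints GGPLOT2 and POLARS on separate lines
```

Because the interface is just text streams, the neighboring stages of the pipeline can be written in R, awk, or anything else, which is exactly the point made above.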

Story three: a cautionary tale about going too polyglot

Now, the last story that I want to share with you is more of a cautionary tale. So, I was once again working at a Dutch company. This time a utilities company. And they had a huge ETL pipeline, so to speak, set up. It started off as some R code, and then later on they added some Python, with data being copied back and forth through rpy2. But eventually it became a process that took over 700 gigabytes of RAM on a single machine to run. I believe it took over two days to finish. And that was okay, because these calculations only needed to be done twice a year. They were needed to make predictions about the utility grid and how much power was going over it, so that they could make choices on where to reinforce the network and add additional transformers. So, pretty important stuff.

But then management came down to us and said, all right, instead of doing this for 10% of our network, why don't you do this for everything that we have? And instead of doing this twice a year, why don't you do this every week? Oh, and your budget is 5,000 euros a month. So, we had a real problem here. We had the reverse problem, because using both Python and R was really hurting our performance. It also made our virtual machines and our Docker images way too large, and it was very difficult to maintain. So, we needed to convince the rest of the team that we actually needed a monoglot approach.

So, we were successful in the end. We suggested, okay, let's use Polars. Mind you, this was almost three years ago, before Polars was cool. But we were really excited. In the end, it was a big success. Everything was rewritten to Python and Polars and it all worked out. But it was a big investment. My colleague and I learned so much about Polars, and we were so excited about it, that we decided to write a book about it at the same time. Now, on this website, you can find more information and get the first chapter for free. Right? And for five bucks, I can send you the PDF.

Weighing the costs of polyglot complexity

All right. So, just keep in mind that everything that you add, every dependency, and that goes for packages and tools as well as for entire languages, adds complexity. Yeah? And space, not only in memory, but also on disk. Right? It adds time: depending on what you want to do, it usually takes longer if you want to combine languages. Unless you decide, all right, we've got to rewrite this part in Fortran, but that's another talk. Yeah? And, of course, risk.

So, I hear some of you thinking, what about Python within R? Yeah? And I have to be honest here. I haven't had a situation yet where I needed to do this. But if you do, then I can definitely recommend that you check out reticulate and Arrow. Right? reticulate for running Python code in R, and Arrow for zero-copy data sharing. Yeah? And I also want to give a shout-out to Positron and Quarto, two tools that make it particularly convenient and pleasant to work with these two languages. Positron as an IDE and Quarto as a format.

So, I hope that through these stories I've been able to convince you that there are definitely situations where it's worthwhile to embrace a polyglot approach. Yeah? You know, it takes time and skill to recognize when this is appropriate, which approach is the right one, and which tools to use. But that shouldn't keep you clinging to doing everything in the same language, however beautiful that language may be. Like Dutch.

So, all right. So, yeah. I've shared with you, through my stories, a couple of these approaches and tools: rpy2, reticulate, Arrow, saving and loading. Right? So, whatever you do, be mindful of the complexity that you're adding. Space, time, risk. Right? It may be more pragmatic. It may get the job done. But, yeah, I think if you take all that into account, then you'll be good. Thank you very much.

Q&A

Okay. We have a few minutes. So, send your questions in. Let's see. One, managing versions in each language takes work. Is it twice as bad or more when using both?

It's more than twice as bad. Yeah. Well, I mean, it depends on how coupled the two processes are. Right? If you're using reticulate or rpy2, there's a really tight connection between the two. Whereas if you're using the other two approaches, piping or the save and load approach, where you have, say, a hard disk or a database in between, you decouple that, and then it's exactly twice as bad.

Okay. Another question. Could you explain a little bit more about how rewriting your R plus Python code with Polars helps speed things up? Could you have used Arrow?

Yeah. So, Polars uses Arrow, with some changes. It uses an in-memory columnar data frame format. So, yeah. Arrow. Good stuff. Parquet. Right? These kinds of things.

Okay. How do you determine when it's worth exploring doing something in another language if you can already do it in your primary language?

Yeah. So, I think you'll recognize when there's a problem. Right? You shouldn't just add another language for the sake of it. But maybe you work at a consultancy firm and you're asked to do something in a different language, and you're like, oh, man, I wish I just had that algorithm that I use in that other language. Then, of course, you're feeling it. You're feeling, okay, there is now a situation. Wait a minute. I remember what this Dutch guy was saying. I should use a polyglot approach. But if there is no problem, then I don't see any reason why you should not stick to the same language.

Okay. One more. What is something that R is missing that Python has, and something that Python is missing that R has?

Well, as we've seen in the previous talk, there are quite some fundamental differences between the two languages. Right? With R, you've all been spoiled with the tidyverse, non-standard evaluation, and the pipe operator. Those are things that aren't easily done in Python. So, yeah. I mean, I think that's about it. I was really hoping for an applause there. I don't know. But the tidyverse. All right. Thank you.