Resources

Combining R and Python for the Belgian Justice Department - posit::conf(2023)

Presented by Thomas Michem. Thomas builds a great case for combining R and Python in a production environment. The Belgian Justice Department's back office monitors the smooth processing of all traffic fines in Belgium: it gathers data from all police departments and checks whether any anomalies occur. The back office does this through a Shiny application showing traffic lights that reflect the status of the whole operation. That status is produced by Python scripts that perform anomaly detection, checking whether the number of fines is in line with daily expectations, and the results of those checks are delivered to the Shiny front end via a Python Flask API. Presented at the Posit Conference, Sept 19-20, 2023. Learn more at posit.co/conference. Talk Track: R or Python? Why not both! Session Code: TALK-1122


Transcript

This transcript was generated automatically and may contain errors.

So I think it's time to settle the case of R versus Python. I won't settle it; it will be settled by the Belgian Justice Department. So let the battle begin. The Belgian Justice Department had an urgent need for a traffic fine monitoring system, and both R and Python were on the table. Who am I? I'm Thomas Michem, a data science lead at AXI. AXI is an IT software company and full-service Posit partner in Belgium and the Netherlands. Here we're also known for the donuts at our booth. We actually settled another battle there, the battle of Dunkin' Donuts versus Stan's Donuts. I'm happy to announce that Stan's Donuts won today.

My background: I started as an analyst, then became an architect, then a data scientist. I learned Python from my fellow developers, and I learned R from business experts, so a slightly different path for each. Last year, our company also gave a battle presentation; it was Shiny versus Power BI then. This year, it's R versus Python.

The traffic fine processing system

So AXI, as an IT software company, is also responsible for building the traffic fine processing system in Belgium. It's a complex system with many actors involved. The simple flow is: police departments send fines to the fine processing system, which sends the files to the postal service, which prints the fines and delivers them to citizens, who then make payments that are matched back by the fine processing system. Each time I get a traffic fine in the mail, I get to thank the company I work for.

But for the case, we'll have a look at the judge, which here is the Belgian Justice Department. We'll have a look at the suspects, R and Python. There are two important investigations: who can deliver the quickest, and who is the best fit for production. And like in any case, we'll see whether any kind of mediation or settlement is possible, and then we'll come to the verdict.

So the Belgian Justice Department is, among many other things, responsible for overseeing traffic fines. It's quite a large-scale operation: the collection of traffic fines in Belgium alone exceeds 500 million dollars annually. So it's quite a large number, and making mistakes can cost a lot of money. But it's also a reputational matter: citizens really don't like getting fines, and they like it even less if they get a fine twice, or pay it twice. So it's a process that needs watching. There were some issues with the process, and the Justice Department needed a monitoring tool to help smooth it out.

Who can deliver the quickest?

And then R and Python came to the table. These are languages that can be used not only for machine learning but also in a much broader context. Of course, people are already a bit biased when they hear the two names. Some people, when they think of R, see it as a specifically statistical language, good for trivia questions like: how many people do you need in a room to get a 50-50 chance that two people share a birthday? Does anybody in the audience know? 23, okay. 23 is the right answer, of course. Never let anybody tell you otherwise.
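That answer is easy to verify; a quick check in Python, using the classic uniform-365-day approximation (ignoring leap years):

```python
def birthday_collision_prob(n: int) -> float:
    """Probability that at least two of n people share a birthday."""
    p_unique = 1.0
    for k in range(n):
        # k-th person must avoid the k birthdays already taken
        p_unique *= (365 - k) / 365
    return 1 - p_unique

# 23 is the smallest group with a better-than-even chance.
print(birthday_collision_prob(22))  # ~0.476
print(birthday_collision_prob(23))  # ~0.507
```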

Python is the developer language. It's for the whitespace enthusiasts who make sure their indentation is always correct, and for the Zen lovers who, when they type import this, get a whole Zen poem. And it can be used for anything, not just data science. So already from the start, there are different opinions.

Who can deliver the quickest? There was quite an immediate concern: there had been some glitches in the system, and the Justice Department needed a solution quickly. Overhauling the operational system would take a couple of months, because there was a lot of process around making changes to it, so it was not something that could be fixed quickly. That's where the requirement came from for a flexible monitoring tool covering anomaly detection, inspection of file contents, and more. And that solution needed to be there fast.

And Python was actually the quickest out of the gate. Most of the stakeholders were already on board with Python. The IT department already had some data processing solutions in Python, and some of the checks, for example for the financial files, the payment files, were already available as a Python package, something the open-source community is really good at: building readily packaged solutions. So had Python been chosen, at least those pieces would have been in place quickly.

But we made a counter-move: we used Shiny to build a quick UI mockup, a quick UI demo actually, for the actual users of the system. We started this project in 2020, so there was no Shiny for Python yet; it was good old classic R Shiny. There was only one requirement from the end users: they wanted to see traffic lights when they looked at the monitoring. So we gave them traffic lights: a dashboard where the colors show whether things are running smoothly or there are issues, and where they could drill down into what is happening.

So the back office was already on board. And also in R's favor, there were some auditors, from one of the Big Four firms actually, who already had some scripts in R that they used for part of their checks and controls. So at this point, it was still balanced.

Who is the best fit for production?

But then the question was: who would be the best fit for production? Because of course, the system had to run in production. And when people hear production, Python can have a slight edge. It was already approved by the IT department as valid for production use, and the Belgian Justice Department already had developers in-house who could manage such a system. But the existing systems, among others a data processing system, did not extend easily to this monitoring case: the things needed for the monitoring could not be provided by the Python system as it was already running, so that would have to change. Plus, if new use cases came up, there were doubts whether that system could be extended to tackle all of them.

But yeah, at that point Python still looked like the only option for production. Then we had a good look at what they meant by production, because opinions can differ there too. If you ask data scientists, and I think Joe Cheng had a good article about this, they say production means real users using your app, real consequences if things go wrong, and real data. That's production. IT has a different view: if it's production, then no users may log into the system, no one may access a terminal and execute commands, and you shall not install things on the fly. You shall not pass. IT had this really strict view of production where not a lot of things were allowed. I think that's very recognizable for a lot of data science teams that struggle with this.

So here we faced it as well. But then we had a good look at the side of production that we really needed: just real users, with real consequences, on real data. And for that, we didn't need to be in the production environment as IT saw it. We then proposed a platform strategy for production, looking at systems that provide enterprise-level support, so that not all of the load was on the IT department to keep everything running. There is a proven platform for this: Posit Connect. And by introducing that platform, we could also level the playing field again between R and Python, as both can run on it.

And specifically here, the platform did not have to be deployed on what IT considered the production environment. It could live in a test environment, because from that test environment there was access to the production data, and that was actually all we needed. Users could access that test environment, so we had real users; it was the actual monitoring, with actual consequences, using actual data. Admittedly, IT would not be too bothered if it went down, but as this was not the operational system actually processing the fines, that was not a big concern either. So we did have the most important things: real users and real data. The advantage of this platform strategy is that once it was set up, all future use cases could deploy new things to it, often without having the discussions with the IT department all over again each time. And at this point, we were back on an even battlefield of R versus Python.

Mediation: combining R and Python

If it's even, then we can look at whether any mediation is possible. Can't we all just get along? Combining R and Python in the same deliverable could be a strategy. There are packages that follow this approach: great packages like reticulate and rpy2, where you combine the languages within the same thing you're making. In Quarto notebooks, or even in plain R scripts, this is a very valid strategy: you can use different cells in different languages and access objects created in one from the other in a very flexible way. Quarto, if you're not using it, you should be; it's a really powerful framework for delivering all kinds of things, for example presentations at posit::conf as well. But it was trickier to combine R and Python in the same code when building a Shiny app or an API. So combining them in the same deliverable was not the best approach, and it was not seen as the solution in this case.
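As a minimal sketch of what that cell-level mixing looks like, a Quarto document combining R and Python chunks via reticulate might resemble the following (the object names are invented for illustration):

````markdown
---
title: "Daily fine counts"
engine: knitr
---

```{r}
library(reticulate)
fines <- c(118, 95, 142)   # hypothetical daily counts
```

```{python}
# Python chunks see R objects through reticulate's `r` proxy
total = sum(r.fines)
```

```{r}
# And R sees Python objects back through `py$`
py$total
```
````

The `r` and `py$` proxies are what reticulate provides for passing objects between the two engines in one document.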

Then we took a step back: how do other software systems solve this? There are actually a lot of patterns, frameworks, approaches, and methodologies that are standard in general software development but sometimes feel like new strategies in data science. Just splitting your software into different components or modules, or microservices if you're Netflix or the like, splitting your architecture into multiple components, is also a good way to combine multiple programming frameworks. Posit has very good documentation on this: there is the bike prediction case on their website, a more elaborate architecture with ETL steps, where Quarto or R Markdown documents connect to some source data, perform transformations, and publish clean data sets with the pins package; those feed another Quarto document that trains a machine learning model on the pinned data set and publishes it behind a Plumber API; and a Shiny application then consumes that API.
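In miniature, that component split looks something like the following Python sketch. In the real pattern each function would be a separately scheduled document or script on Connect, and the hand-off would go through a pin rather than a local JSON file; all names here are invented:

```python
import json
from pathlib import Path

CLEAN_DATA = Path("clean_fines.json")  # stands in for a published pin

def etl(raw_rows):
    """ETL step: drop invalid rows and publish the clean data set."""
    clean = [row for row in raw_rows if row["count"] >= 0]
    CLEAN_DATA.write_text(json.dumps(clean))

def run_checks():
    """Check step: read the published data and compute a summary status."""
    clean = json.loads(CLEAN_DATA.read_text())
    total = sum(row["count"] for row in clean)
    return {"total": total, "status": "ok" if total > 0 else "alert"}

def serve():
    """API step: the JSON a Flask/Plumber endpoint would return."""
    return json.dumps(run_checks())

etl([{"day": "mon", "count": 118}, {"day": "tue", "count": -1}])
print(serve())  # {"total": 118, "status": "ok"}
```

Because each step only depends on the published artifact of the previous one, each could be written in R or Python independently.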

So with something like Connect, you can build these more elaborate architectures on a single platform, dividing your use case into multiple small pieces that are easier to manage, easier to maintain, easier to change, and easier to evolve while getting new people on board and contributing. It's a very well-documented solution, with source code for all the steps, so definitely something to check out.

The verdict: R and Python together

And it's also the strategy we ultimately chose for the cross-border monitoring architecture, the monitoring of the traffic fines. On Posit Connect, we could use R Shiny as our monitoring dashboard, which connects to a monitoring API built in Python, which in turn uses the results of ETL that also runs in Python. And for the checks, it made sense to have some written in Python and some written in R, and then combine all of these pieces into a very efficient and performant monitoring architecture.
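The talk doesn't show the check code itself, but the traffic-light idea reduces to comparing each day's fine count with the expected volume. A toy Python version, with thresholds chosen arbitrarily for illustration:

```python
def traffic_light(actual: int, expected: float, tolerance: float = 0.2) -> str:
    """Return a dashboard color for one day's fine count.

    green  : within +/- tolerance of the expected volume
    orange : within +/- 2 * tolerance
    red    : further out, an anomaly worth drilling into
    """
    deviation = abs(actual - expected) / expected
    if deviation <= tolerance:
        return "green"
    if deviation <= 2 * tolerance:
        return "orange"
    return "red"

print(traffic_light(980, expected=1000))   # green
print(traffic_light(700, expected=1000))   # orange
print(traffic_light(150, expected=1000))   # red
```

In the actual system, checks like this ran as Python or R jobs whose results fed the dashboard through the monitoring API.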

So this way we reached a verdict. And as you all probably expected, it wasn't a story about R or Python; it's the story of R and Python, where using them together let us move quickly and make innovation and efficiency possible for the Belgian Justice Department. I think there is a lot that data science teams can learn from classical software architecture and how it approaches these kinds of systems. And for me, data science has broader applications than just the high-end machine learning cases. In a business context, there are many places where data plays an important role and the focus is more on the analysis than on the process.

So I don't think we'll ever see a traffic fine processing system built in R or Python. But there's a lot around such a system that can be more data-driven, and I think this monitoring case is a very relevant one: you have anomaly detection, where some algorithms can help, but also a full end-to-end architecture with real users using an application, getting warnings and alerts, and acting on them in a very small workflow. It's a very good case for data science and for R and Python.

And I think in this case, the verdict was reached by the Belgian justice system. So from now on, no more discussions of R or Python, but it's a question of R and Python. Thank you very much for your attention.

Q&A

Thank you so much. We have a few questions for you. The first one is: did reticulate give you any headaches when deploying apps that use both R and Python code?

Not in this case, but in a lot of cases it has, indeed. It can be a bit tricky to set up correctly. I think in cases where the R code and the Python code don't have to exchange too many objects, it can work perfectly. If it becomes tightly interwoven, it can give headaches. Yes, that's true.

One more question. Do you have a view on Plumber for APIs in R versus the abundance of Python API libraries, like Flask or FastAPI?

In this case, we actually chose Flask for the API. And in a lot of cases, I think Python does have an edge; FastAPI, for example, is a very powerful API framework. Plumber has the advantage that it's easy for R users to get started, and if there's a lot of R work happening in your API, then Plumber makes more sense. For example, if you need to generate R Markdown documents, it makes more sense to stay in the R framework and use Plumber to generate them. In the cases where we used it, there was some consideration around how to scale it. But with something like the Posit Connect platform, there are ways to control how many users hit the same Plumber API at the same time. And in a cloud context, on Azure for example, you have Azure Container Apps where you can also scale your Plumber app. So that downside doesn't have to be a real downside. For me, it's more about what the API does: if there's something R-specific, use Plumber; otherwise, I think Python does have an edge with FastAPI. Yes.
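For a sense of scale, the Flask side of such a monitoring API can be very small. A hypothetical sketch, where the endpoint path and payload are invented and Flask is assumed to be installed:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical precomputed traffic-light results, e.g. loaded from ETL output.
CHECKS = {"payments": "green", "daily_volume": "orange"}

@app.route("/status")
def status():
    """Serve the latest check statuses to the Shiny front end."""
    return jsonify(CHECKS)

# Exercise the endpoint in-process with Flask's built-in test client.
client = app.test_client()
print(client.get("/status").get_json())
```

A Shiny app would poll an endpoint like this and map the returned colors onto the dashboard's traffic lights.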

Thank you so much. We can give one more round of applause.