Resources

Combining R and Python for the Belgian Justice Department - posit::conf(2023)

Presented by Thomas Michem. Thomas builds a great case for combining R and Python in a production environment. The Belgian Justice Department's back office monitors the smooth processing of all traffic fines in Belgium: it gathers data from all police departments and checks whether any anomalies occur. The back office does this through a Shiny application showing traffic lights that reflect the status of the whole operation. That status is produced by Python scripts that perform anomaly detection, checking whether the number of fines is in line with daily expectations, and the results of those checks are delivered to the Shiny front end via a Python Flask API. Presented at the Posit Conference, Sept 19-20, 2023. Learn more at posit.co/conference. Talk Track: R or Python? Why not both! Session Code: TALK-1122


Transcript

This transcript was generated automatically and may contain errors.

So I think it's time to settle the case of R versus Python. I won't settle it; it will be settled by the Belgian Justice Department. So let the battle begin. The Belgian Justice Department had an urgent need for a traffic fine monitoring system, and both R and Python were on the table. Who am I? I'm Thomas Michem, a data science lead at AXI. AXI is an IT software company and full-service Posit partner in Belgium and the Netherlands. Here we're also known for the donuts at our booth. We actually settled another battle there, the battle of Dunkin' Donuts versus Stan's Donuts. I'm happy to announce that Stan's Donuts won today.

My background: I started as an analyst, then became an architect, then a data scientist. I learned Python from my fellow developers, and I learned R from business experts, so a slightly different path for each. Last year, our company also gave a battle presentation; it was Shiny versus Power BI then. This year, it's R versus Python.

The traffic fine processing system

So AXI, as an IT software company, is also responsible for building the traffic fine processing system in Belgium. It's a complex system with many actors involved. The simple flow is: police departments send fines to the fine processing system, which sends the files to the postal service, which prints the fines and delivers them to citizens, who then make payments that are matched back by the fine processing system. Each time I get a traffic fine in the mail, I get to thank the company I work for.

But for the case, we'll have a look at the judge, which here is the Belgian Justice Department. We'll have a look at the suspects, R and Python. There are two important investigations: who can deliver the quickest, and who is the best fit for production. And like in any case, we'll see whether any kind of mediation or settlement is possible, and then we'll come to the verdict.

So the Belgian Justice Department is, among many other things, responsible for overseeing traffic fines. It's quite a large-scale operation: the collection of traffic fines in Belgium alone exceeds 500 million dollars annually. So it's quite a large number, and making mistakes can cost a lot of money. But it's also a reputational matter: citizens really don't like getting fines, and they like it even less if they get a fine twice, or pay it twice. So it's a process that needs watching. There were some issues with the process, and the Justice Department needed a monitoring tool to help smooth it out.

Who can deliver the quickest?

And then R and Python came to the table. These are languages that can be used not only for machine learning but also in a much broader context. Of course, people are already a bit biased when they hear the two names. Some people, when they think of R, see it as a specifically statistical language, good for trivia questions like: how many people do you need in a room to get a 50-50 chance that two people share a birthday? Does anybody in the audience know? 23, okay. 23 is the right answer, of course. Never let anybody tell you otherwise.
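That answer is easy to verify; a quick check in Python, using the classic uniform-365-day approximation (ignoring leap years):

```python
def birthday_collision_prob(n: int) -> float:
    """Probability that at least two of n people share a birthday."""
    p_unique = 1.0
    for k in range(n):
        # k-th person must avoid the k birthdays already taken
        p_unique *= (365 - k) / 365
    return 1 - p_unique

# 23 is the smallest group with a better-than-even chance.
print(birthday_collision_prob(22))  # ~0.476
print(birthday_collision_prob(23))  # ~0.507
```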

Python is the developer language. It's for the whitespace enthusiasts who make sure their indentation is always correct, and for the Zen lovers who, when they type import this, get a whole Zen poem. And it can be used for anything, not just data science. So already from the start, there are different opinions.

Who can deliver the quickest? There was quite an immediate concern: there had been some glitches in the system, and the Justice Department needed a solution quickly. Overhauling the operational system would take a couple of months, because there was a lot of process around making changes to it, so it was not something that could be fixed quickly. That's where the requirement came from for a flexible monitoring tool covering anomaly detection, inspection of file contents, and more. And that solution needed to be there fast.

And Python was actually the quickest out of the gate. Most of the stakeholders were already on board with Python. The IT department already had some data processing solutions in Python, and some of the checks, for example for the financial files, the payment files, were already available as a Python package, something the open-source community is really good at: building readily packaged solutions. So had Python been chosen, at least those pieces would have been in place quickly.

But we made a counter-move: we used Shiny to build a quick UI mockup, a quick UI demo actually, for the actual users of the system. We started this project in 2020, so there was no Shiny for Python yet; it was good old classic R Shiny. There was only one requirement from the end users: they wanted to see traffic lights when they looked at the monitoring. So we gave them traffic lights: a dashboard where the colors show whether things are running smoothly or there are issues, and where they could drill down into what is happening.

So the back office was already on board. And also in R's favor, there were some auditors, from one of the Big Four firms actually, who already had some scripts in R that they used for part of their checks and controls. So at this point, it was still balanced.

Who is the best fit for production?

But then the question was: who would be the best fit for production? Because of course, the system had to run in production. And when people hear production, Python can have a slight edge. It was already approved by the IT department as valid for production use, and the Belgian Justice Department already had developers in-house who could manage such a system. But the existing systems, among others a data processing system, did not extend easily to this monitoring case: the things needed for the monitoring could not be provided by the Python system as it was already running, so that would have to change. Plus, if new use cases came up, there were doubts whether that system could be extended to tackle all of them.

But yeah, at that point Python still looked like the only option for production. Then we had a good look at what they meant by production, because opinions can differ there too. If you ask data scientists, and I think Joe Cheng had a good article about this, they say production means real users using your app, real consequences if things go wrong, and real data. That's production. IT has a different view: if it's production, then no users may log into the system, no one may access a terminal and execute commands, and you shall not install things on the fly. You shall not pass. IT had this really strict view of production where not a lot of things were allowed. I think that's very recognizable for a lot of data science teams that struggle with this.

So here we faced it as well. But then we had a good look at the side of production that we really needed: just real users, with real consequences, on real data. And for that, we didn't need to be in the production environment as IT saw it. We then proposed a platform strategy for production, looking at systems that provide enterprise-level support, so that not all of the load was on the IT department to keep everything running. There is a proven platform for this: Posit Connect. And by introducing that platform, we could also level the playing field again between R and Python, as both can run on it.

And specifically here, the platform did not have to be deployed on what IT considered the production environment. It could live in a test environment, because from that test environment there was access to the production data, and that was actually all we needed. Users could access that test environment, so we had real users; it was the actual monitoring, with actual consequences, using actual data. Admittedly, IT would not be too bothered if it went down, but as this was not the operational system actually processing the fines, that was not a big concern either. So we did have the most important things: real users and real data. The advantage of this platform strategy is that once it was set up, all future use cases could deploy new things to it, often without having the discussions with the IT department all over again each time. And at this point, we were back on an even battlefield of R versus Python.

Mediation: combining R and Python

If it's even, then we can look at whether any mediation is possible. Can't we all just get along? Combining R and Python in the same deliverable could be a strategy. There are packages that follow this approach: great packages like reticulate and rpy2, where you combine the languages within the same thing you're making. In Quarto notebooks, or even in plain R scripts, this is a very valid strategy: you can use different cells in different languages and access objects created in one from the other in a very flexible way. Quarto, if you're not using it, you should be; it's a really powerful framework for delivering all kinds of things, for example presentations at posit::conf as well. But it was trickier to combine R and Python in the same code when building a Shiny app or an API. So combining them in the same deliverable was not the best approach, and it was not seen as the solution in this case.
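As a minimal sketch of what that cell-level mixing looks like, a Quarto document combining R and Python chunks via reticulate might resemble the following (the object names are invented for illustration):

````markdown
---
title: "Daily fine counts"
engine: knitr
---

```{r}
library(reticulate)
fines <- c(118, 95, 142)   # hypothetical daily counts
```

```{python}
# Python chunks see R objects through reticulate's `r` proxy
total = sum(r.fines)
```

```{r}
# And R sees Python objects back through `py$`
py$total
```
````

The `r` and `py$` proxies are what reticulate provides for passing objects between the two engines in one document.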

Then we took a step back: how do other software systems solve this? There are actually a lot of patterns, frameworks, approaches, and methodologies that are standard in general software development but sometimes feel like new strategies in data science. Just splitting your software into different components or modules, or microservices if you're Netflix or the like, splitting your architecture into multiple components, is also a good way to combine multiple programming frameworks. Posit has very good documentation on this: there is the bike prediction case on their website, a more elaborate architecture with ETL steps, where Quarto or R Markdown documents connect to some source data, perform transformations, and publish clean data sets with the pins package; those feed another Quarto document that trains a machine learning model on the pinned data set and publishes it behind a Plumber API; and a Shiny application then consumes that API.
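In miniature, that component split looks something like the following Python sketch. In the real pattern each function would be a separately scheduled document or script on Connect, and the hand-off would go through a pin rather than a local JSON file; all names here are invented:

```python
import json
from pathlib import Path

CLEAN_DATA = Path("clean_fines.json")  # stands in for a published pin

def etl(raw_rows):
    """ETL step: drop invalid rows and publish the clean data set."""
    clean = [row for row in raw_rows if row["count"] >= 0]
    CLEAN_DATA.write_text(json.dumps(clean))

def run_checks():
    """Check step: read the published data and compute a summary status."""
    clean = json.loads(CLEAN_DATA.read_text())
    total = sum(row["count"] for row in clean)
    return {"total": total, "status": "ok" if total > 0 else "alert"}

def serve():
    """API step: the JSON a Flask/Plumber endpoint would return."""
    return json.dumps(run_checks())

etl([{"day": "mon", "count": 118}, {"day": "tue", "count": -1}])
print(serve())  # {"total": 118, "status": "ok"}
```

Because each step only depends on the published artifact of the previous one, each could be written in R or Python independently.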

So with something like Connect, you can build these more elaborate architectures on a single platform, dividing your use case into multiple small pieces that are easier to manage, easier to maintain, easier to change, and easier to evolve while getting new people on board and contributing. It's a very well-documented solution, with source code for all the steps, so definitely something to check out.

The verdict: R and Python together

And it's also the strategy we ultimately chose for the cross-border monitoring architecture, the monitoring of the traffic fines. On Posit Connect, we could use R Shiny as our monitoring dashboard, which connects to a monitoring API built in Python, which in turn uses the results of ETL that also runs in Python. And for the checks, it made sense to have some written in Python and some written in R, and then combine all of these pieces into a very efficient and performant monitoring architecture.
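The talk doesn't show the check code itself, but the traffic-light idea reduces to comparing each day's fine count with the expected volume. A toy Python version, with thresholds chosen arbitrarily for illustration:

```python
def traffic_light(actual: int, expected: float, tolerance: float = 0.2) -> str:
    """Return a dashboard color for one day's fine count.

    green  : within +/- tolerance of the expected volume
    orange : within +/- 2 * tolerance
    red    : further out, an anomaly worth drilling into
    """
    deviation = abs(actual - expected) / expected
    if deviation <= tolerance:
        return "green"
    if deviation <= 2 * tolerance:
        return "orange"
    return "red"

print(traffic_light(980, expected=1000))   # green
print(traffic_light(700, expected=1000))   # orange
print(traffic_light(150, expected=1000))   # red
```

In the actual system, checks like this ran as Python or R jobs whose results fed the dashboard through the monitoring API.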

So this way we reached a verdict. And as you all probably expected, it wasn't a story about R or Python; it's the story of R and Python, where using them together let us move quickly and make innovation and efficiency possible for the Belgian Justice Department. I think there is a lot that data science teams can learn from classical software architecture and how it approaches these kinds of systems. And for me, data science has broader applications than just the high-end machine learning cases. In a business context, there are many places where data plays an important role and the focus is more on the analysis than on the process.

So I don't think we'll ever see a traffic fine processing system built in R or Python. But there's a lot around such a system that can be more data-driven, and I think this monitoring case is a very relevant one: you have anomaly detection, where some algorithms can help, but also a full end-to-end architecture with real users using an application, getting warnings and alerts, and acting on them in a very small workflow. It's a very good case for data science and for R and Python.

And I think in this case, the verdict was reached by the Belgian justice system. So from now on, no more discussions of R or Python, but it's a question of R and Python. Thank you very much for your attention.

Q&A

Thank you so much. We have a few questions for you. The first one is: did reticulate give you any headaches when deploying apps that use both R and Python code?

Not in this case, but in a lot of cases it has, indeed. It can be a bit tricky to set up correctly. I think in cases where the R code and the Python code don't have to exchange too many objects, it can work perfectly. If it becomes tightly interwoven, it can give headaches. Yes, that's true.

One more question. Do you have a view on Plumber for APIs in R versus the abundance of Python API libraries, like Flask or FastAPI?

In this case, we actually chose Flask for the API. And in a lot of cases, I think Python does have an edge; FastAPI, for example, is a very powerful API framework. Plumber has the advantage that it's easy for R users to get started, and if there's a lot of R work happening in your API, then Plumber makes more sense. For example, if you need to generate R Markdown documents, it makes more sense to stay in the R framework and use Plumber to generate them. In the cases where we used it, there was some consideration around how to scale it. But with something like the Posit Connect platform, there are ways to control how many users hit the same Plumber API at the same time. And in a cloud context, on Azure for example, you have Azure Container Apps where you can also scale your Plumber app. So that downside doesn't have to be a real downside. For me, it's more about what the API does: if there's something R-specific, use Plumber; otherwise, I think Python does have an edge with FastAPI. Yes.
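For a sense of scale, the Flask side of such a monitoring API can be very small. A hypothetical sketch, where the endpoint path and payload are invented and Flask is assumed to be installed:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical precomputed traffic-light results, e.g. loaded from ETL output.
CHECKS = {"payments": "green", "daily_volume": "orange"}

@app.route("/status")
def status():
    """Serve the latest check statuses to the Shiny front end."""
    return jsonify(CHECKS)

# Exercise the endpoint in-process with Flask's built-in test client.
client = app.test_client()
print(client.get("/status").get_json())
```

A Shiny app would poll an endpoint like this and map the returned colors onto the dashboard's traffic lights.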

Thank you so much. We can give one more round of applause.