
Data Science in Production: The Way to a Centralized Infrastructure - posit::conf(2023)
Presented by Oliver Bracht. In this talk, the success story of Covestro's Posit infrastructure is presented. The problem of the leading German material manufacturer was that no common development environment existed. With the help of eoda and Posit, a replicable, centralized development environment for R and Python was created. Although R and Python represent the core of the infrastructure, multiple languages and tools are unified. In addition to improving the collaboration of Covestro's data science teams, compliance guidelines could also be better fulfilled. The staging architecture provides developers with a concept for testing and going live with their products. This project presents a best-practice approach to a data science infrastructure, using Covestro as an example. Presented at Posit Conference, Sept 19-20, 2023. Learn more at posit.co/conference. -------------------------- Talk Track: Data science infrastructure for your org. Session Code: TALK-1113
image: thumbnail.jpg
Transcript#
This transcript was generated automatically and may contain errors.
Thanks a lot for having me here. I'm very happy to talk to you about an infrastructure that we have just set up at Covestro. In my presentation I want to show you a little bit of what we did and what, from our perspective, is good practice if you want to integrate professional Posit products into your infrastructure.
Just before we start, let me take the opportunity to introduce myself quickly. I belong to a company called eoda. We were founded in 2010 and have done data science from the very beginning. We are about 50 employees at the moment, located pretty much in the center of Germany, not too far from the geographical center of the EU. We have been a Posit partner since 2017, and our mission is empowering data-driven intelligence, which I think is pretty much in line with Posit's mission.
So we have four pillars of what we do: consulting services, with use case workshops, proof-of-concept projects, and the implementation of algorithms into production environments; trainings; infrastructure; and data science software. Today I will focus on the pillar of data science infrastructure. This is what we do when we help companies that have a large internal group of data scientists who want to use professional products.
And yeah, as the situation is a little bit complicated, I also want to introduce our customer, a company called Covestro, which is also located in Germany. It's quite big: 80,000 employees, 50 production sites all around the globe, and headquarters in Germany. What they do is create polymers and highly sophisticated, high-performance plastic components that you use in everyday life. So probably almost every one of you has used some sort of product from Covestro.
Covestro's starting point
And the situation where we started was that there was already an existing data science team. Covestro has been doing data science for, I don't know, 100 years or so, because they do chemical trials and experiments and things like that. So they have been a data-science-centric company from the very beginning. They just wanted to bring their infrastructure, and the way they do data science, to the next level.
So this is how we came in, and I think this is important, because the situation is probably a bit different if data science is more or less newly introduced to a company than if there's already an existing team that does something and wants to mature to the next level.
Bridging IT and data science perspectives
So the general challenge when it comes to building up a centralized data science infrastructure, we heard this also in the keynote speeches, is that there are two perspectives on how it feels like to be a captain of your ship. So the IT operation perspective is pretty much like that. So you have a pretty big ship. You want to drive it over the Atlantic Ocean, for example. Once the course is set, you don't want to make any changes anymore. And if nothing happens, then you had a good day because you were good.
As a data scientist, it feels a little bit more like this, to be captain of your ship. You want to have a powerful motor. You want to be able to change your direction quickly. And in most cases, you are not so far away from the shoreline. You typically wouldn't cross the ocean with such a boat. But this is kind of the way it works.
And there are also other perspectives, business perspectives or data governance perspectives, which are pretty much somewhere in the middle of these. So one of the key challenges, from my perspective, is to bring those two perspectives together. It's not either this or that; both perspectives are valid.
So those implementation projects are typically successful if both sides have a mutual understanding of the other side.
Three-step implementation approach
Well, at Covestro, we did it in a three-step way. This is actually our good practice for introducing centralized data science infrastructures. The first step is an assessment, where we try to find out: where are you at the moment? What are the databases that you could never get rid of, for example? And what are the technologies where you say, well, they are not up-to-date anymore and it might be possible to migrate?
And we also want to know: where are you now in terms of data science use cases? And where do you see yourself in the foreseeable future, in the next two or three years or so? One of the central aspects of those assessments is to bring all the perspectives together: the business perspective, the data governance perspective, the IT operation perspective, and the data science perspective. It is really important to have somebody from every stakeholder group at the table all at once, so that there's a common understanding of where we want to go and how to do it.
The next step is the implementation itself. It means setting up the infrastructure that we have just decided on. And the implementation step, from our point of view, works really like a data science project. You think in iterations. You start with a minimum viable product that somebody can use, so that you can gather experience from the users, from the data scientists, in order to improve the infrastructure.
And finally, there's the operation side. Once everything is in place and the business and data science people are running crucial apps on this infrastructure, it needs to be operated. And then there's a step forward, step back; it's like the DevOps idea: implement and operate more or less at the same time.
Challenges and goals identified in the assessment
So after we finished the assessment with Covestro, we had a list of challenges that they faced with their existing infrastructure, or with the existing way they do data science, and a list of goals for where they wanted to go. The challenges were: there were 20 to 30 Python developers. Well, that's not a challenge in itself, but they were distributed across multiple locations and multiple business lines.
So there was actually something like an internal meetup of the data scientists, who exchanged what they were doing. But it was not really the case that they felt like a common group; they were more or less split across their departments. The data science infrastructure itself was pretty much decentralized. While they had some kind of advanced workflows, basically everybody was working on his or her own laptop. So this was the original situation.
And they expected substantial growth in data. And I guess this is what every organization can say, that there is substantial growth in data. On the IT operation side, there was quite a high administrative overhead to set up all those images of R and all the packages. And it was also not so easy to tackle the compliance issues that you have in a larger organization. They managed to do it, but it just took a bit too much time.
And of course, as always when you work in a decentralized infrastructure, they had issues with compatibility, version dependencies, and all those things which you might all know. This is one benefit of a centralized infrastructure: you can really get rid of this stuff.
On the other hand, we have goals. So what do they want? They want to be able to collaborate, to share projects and results. They want central package management, so that it is easy to exchange packages and to always get the same version that all the others have. They were looking for efficiency, so they want to reduce the effort for the admin people. And they want consistent data governance.
Because of the growth they were expecting, and all the new technology that's coming up all the time, they wanted to be able to scale out horizontally and vertically. But they still wanted to have everything secure, of course. And they wanted easy development pipelines, so that it is easy to develop something and bring it to the business user as easily as possible.
The posit-centric infrastructure
What we did then is set up an infrastructure which is a Posit-centric infrastructure, if you will. And in the next couple of slides, well, it's actually just one slide, I will show you what this infrastructure looked like in a little more detail. Not all the details, but I hope you can get an idea of what it looked like, and it might help you to do the same in your organization.
So first of all, we set up Posit Workbench on AWS, and this is what the developers were using from the beginning. The cool thing is, if you switch from a laptop to a centralized Posit Workbench on AWS, it's not much different from the user's perspective. And that's really, really important for acceptance. There's one user in particular who has, I don't know, worked with R for 25 years or so. He's really good, really, really good. And he was like, OK, I have to have everything on my laptop, this is the only way it will work for me, so I cannot believe that it's possible if we have an infrastructure in the cloud. And finally, after we finished everything, he never wanted to work on his laptop again. That was actually our greatest success indicator.
Yeah, so then we combined the Workbench with an Azure Active Directory, and we used SAML for the communication between Posit and the Active Directory. The interesting thing here, and we see this in many organizations actually, is that they mix cloud providers. The operating platform is AWS, but the platform that provides the Active Directory is Azure in this case.
And when we started the implementation, we actually had sessions that ran on the server, but it was clear from the beginning that we were going to switch to a Kubernetes setup where the sessions are spawned into a Kubernetes cluster. The way we did it here is an off-host execution: the Posit Workbench server itself is not in Kubernetes, only the sessions are running in Kubernetes. And that worked out pretty well. We distinguished between three load groups: low, medium, and high load, referring to the computational power that the applications or computations need. That was really useful for providing the respective resources to the users.
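As a rough sketch of how such load groups can be expressed: Posit Workbench's Kubernetes launcher supports per-user and per-group resource profiles in a configuration file. The group names and resource values below are illustrative, not Covestro's actual settings; check the Workbench admin guide for the exact keys in your version.

```ini
# /etc/rstudio/launcher.kubernetes.profiles.conf
# Resource profiles for sessions spawned into the Kubernetes
# cluster (off-host execution). [*] applies to everyone;
# [@group] sections override it for members of that group.
# Group names and values here are illustrative.

[*]
default-cpus=1
default-mem-mb=2048
max-cpus=2
max-mem-mb=4096

[@ds-medium-load]
default-cpus=2
default-mem-mb=8192
max-cpus=4
max-mem-mb=16384

[@ds-high-load]
default-cpus=4
default-mem-mb=16384
max-cpus=8
max-mem-mb=32768
```

With a setup like this, a session request from a user in the high-load group is scheduled with larger CPU and memory limits, while everyone else stays within the defaults.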
And additionally, we had an NFS share, an external file share where the users had their home directories. From our perspective, it is also good practice to have this on a separate server: whenever there's an issue on the Workbench server, the home directories are kept separate, and you have no issues when updating.
The next step was to set up Posit Connect in a dev environment, a development environment. This is how we started. The developers were able to push what they had done, as APIs or Shiny apps or reports, to this Connect server. And they were also consumers of this Connect server. So it's not only that they provided stuff to the business people; they also used it themselves, and it made their lives much easier to have a central, reliable base where the data and the analytics are to be found, and where it's easy to be sure that everything is in one place. So the developers are actually heavy users of the Connect server themselves.
Same idea here: we externalized the app directories on an NFS share. And, it's probably a detail, but very helpful: there's a Postgres DB within Connect, and we externalized this too. This has two advantages. First, you can do updates on the Connect server while it's running. And whenever you have to set up your Connect server again, everything is on a separate server. That worked out pretty well.
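For reference, pointing Connect at an external Postgres database comes down to a few lines in its configuration file. This is a minimal sketch with a hypothetical host and database name; by default Connect uses an embedded SQLite database, and Postgres is the documented external alternative.

```ini
; /etc/rstudio-connect/rstudio-connect.gcfg
; Use an external Postgres instance instead of the default
; embedded SQLite database. Host, port, and database name
; below are illustrative.
[Database]
Provider = "Postgres"

[Postgres]
URL = "postgres://connect_user@db.example.internal:5432/connect"
; Keep the password out of the config file and supply it
; separately (see the Connect admin guide for the options).
```

Because the state then lives outside the Connect host, rebuilding or upgrading the server does not touch the content metadata, which is exactly the advantage described above.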
And then the same thing with the production environment. So it's basically two types of server, a two-tier environment, if you will: development plus test, and production. For other customers we have a three-tier setup, where we additionally distinguish between development and test, but in this case it was OK to just go with one development server. And on the production environment, the business users, on the other end of the diagram, from the infrastructure perspective at least, are heavily using the APIs and mainly the Shiny apps that were provided by the developers.
And finally, we added Posit Package Manager on top of that as an external component, and also externalized the packages themselves and the metadata of those packages. So this is more or less what the infrastructure looks like. Some components are missing: there was also Git, there was also continuous integration; not everything is on the table.
And there are also other perspectives which are quite important when you introduce something like this: for example, how the data flows, the data governance perspective, which I was not talking about, and also the use case perspective. How do you manage use cases? If a business user comes up and says, OK, I have this and that use case, and it would be pretty cool if I had this, then somebody has to evaluate and prioritize what is important and what is not. Because it will not take long until you have so many demands from the business users that you are not able to serve everything.
Outcomes and lessons learned
So now it is easy for the data scientists to collaborate across departments, and also to communicate between data scientists and business users. The agility of data science was also introduced into other parts of the organization, just as we heard in the keynote speech before. That's an interesting aspect: if the data science agility mindset spreads into the organization, that is good for the organization as well.
The maintainability was improved, and that was good from the perspective of the IT operators. They were really happy, because it is much easier for them to manage everything. The performance increased, and the setup became much more flexible. Another advantage is that it is in line with compliance guidelines. I mean, they were in line with compliance guidelines before, of course. But this infrastructure is also compliant, and that is a very important aspect.
So what I wanted to say a minute ago was: avoid manual steps from the very beginning. Something that I haven't mentioned yet: we did this from the very beginning with infrastructure as code, with Terraform in this example. And that's pretty cool, because if you want to roll out the same environment in the Asia-Pacific region, for example, it's just one click. Very easy. And we have other use cases, for example, where we set up a training environment for just a couple of days. If you do it in Terraform, this is really easy to do. So this is really our learning there: we always build infrastructure with Terraform or similar tools.
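The "one click for another region" point is what infrastructure as code buys you. Below is a minimal Terraform sketch under assumed names: the AMI variable, instance type, and tags are hypothetical, and a real Workbench deployment would add networking, storage, and the Kubernetes cluster on top.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Rolling the same stack out in the Asia-Pacific region is just a
# different value for this variable (e.g. "ap-southeast-1").
variable "region" {
  type    = string
  default = "eu-central-1"
}

# Hypothetical: a prebuilt image with Posit Workbench installed.
variable "workbench_ami" {
  type = string
}

provider "aws" {
  region = var.region
}

resource "aws_instance" "workbench" {
  ami           = var.workbench_ami
  instance_type = "m5.2xlarge" # illustrative sizing

  tags = {
    Name = "posit-workbench"
  }
}
```

The same mechanism makes short-lived environments cheap: a training setup can be created with `terraform apply` and removed a few days later with `terraform destroy`.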
Then there's the question of cloud versus on-prem versus hybrid infrastructure, which at least in Germany is still very much alive. Many companies are still not completely moving into the cloud. Everything I've just shown, although it was created in the cloud, is possible to do on-prem as well, including the Kubernetes cluster. You can run the Kubernetes cluster on-prem. I don't know if I would want to do it, but it's possible.
So it's always good to develop the infrastructure in close coordination with data science, and also with the business users. That is our lesson learned. And the implementation in iterations really makes sense. Don't do everything at once: what is your outlook, what is your minimum viable product? Implement this and have it used, and then you get feedback from the users.
All right, yeah, that's everything I have to say. Thank you for your attention. You can find my contact details here, so if you are interested in chatting about some more details, I'm happy to talk with you.
