Resources

Giving your scientific computing environment (SCE) a voice - posit conf 2024

Full Title: Giving your scientific computing environment (SCE) a voice: experiences and learnings leveraging operational data from our SCE and Posit products to help us serve our users better

Platform owners often ask questions like 'how quickly do users migrate to new versions of R', 'what programming languages are used', and 'how are internal packages, dashboards and outputs consumed'? The answer to these questions and many more lives within the operational logs collected by systems like Posit Connect, GitHub, and AWS. I'll share examples of how we use this data at Roche to shape our product roadmap. I'll also share some ideas we are exploring to use this data to help empower our data scientists to understand the hidden consequences of how they work by feeding back personal cost and environmental impact - enabling informed decisions, e.g., what it means to schedule a pin to update daily, or request 2 vs 8 cores.

Talk by James Black

Oct 31, 2024
19 min

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

My name is James Black, I'm the insights engineering lead in late stage at Roche, and today I'm wearing my hat as the scientific computing environment product owner. I'm talking about a project with Vegeta, who's one of our API engineers who helped us navigate all of the data that's within our logs, and then also Christian who took a very badly written POC by me and turned it into a data product.

So in this talk I'm going to cover two problems that we ran into. The first was that when we were building our scientific computing environment, we spent a lot of time understanding what the users needed. We built this platform which hopefully met all the use cases that our statisticians, statistical programmers and real-world data scientists had. But when we got to 1.0, suddenly we were generating lots and lots of data on how they actually use the platform. So we started to ask: can we now start incorporating what users actually do into our decision making?

And the second problem was around optimizing use. So with our users, kind of historically, compute and particularly CPU compute is seen as a free resource. So we have over 1,200 users on this platform and when you're talking about that kind of scale, one thing we know is that it is definitely not free. So what could we do to kind of feed back to users to try and improve their use or at least allow them to make educated decisions rather than having to limit stuff?

Four key questions from operational data

So there's so much you can do with the operational data that a compute platform generates, but I'm going to focus on four questions. The first was: what versions of R are people actually using at any one point in time? The second was this question around the optimal way to handle interactive containers, because they spend a lot of time just sitting there - so what should we do in terms of provisioning those interactive environments for people? The third is one that's slightly hidden in the Connect documentation: how can we make sure that the disk space on Connect is full of things people are actually using, and not a lot of failed apps, or apps which are five years old and people have forgotten about, or a cron job running every day for a project which hasn't been working for many years.

And the last one was just even understanding who's using the platform. So once we have this data in place, it's very easy for us to get really deep into what departments are using it, what are the patterns of use, what components are used by different people and things like that.

Overview of the scientific computing environment

So to give some background on the compute environment I'm talking about: we have one scientific computing environment, and here I'm including the tools that sit around it. We have data scientists, from the real-world evidence team right through to imaging scientists. They're using the same kind of backend as the people doing the clinical trial analysis; it's just the UIs that are different. One of them is much more efficiency focused, and for the data scientists the focus is agility.

But the pieces they share are things like the studies being in Git repos on GitLab. We have Posit Package Manager, which is currently surfacing all the R packages; Python is a little bit different - we use a mixture of Lmod and a couple of things in Posit Package Manager. Connect hosts a lot of our apps, and for some departments the Connect server is actually their results portal - for instance, real-world evidence treats it as a results portal for static content as well. We have a Kubernetes server with all those containers running those interactive jobs. We have a home pod, which is like their home drive where people are storing things - we deal with patient data, so sometimes there's a risk they're storing things in their home drive they shouldn't be storing there. And then we use AWS Batch. We have a proper HPC, but we treat Batch as a simplified HPC for those middle-of-the-road jobs - if it takes between an hour and six hours, AWS Batch is probably going to be a much easier interface to work with.

So under the hood these are all generating data. For instance, in the containers we have a Dockerfile, so we know exactly what version of everything people are actively using at any one time. We know things like CPU use - I've kept the HPC out of scope here, so we don't have GPUs - but we also know things like when people are creating containers, when they're deleting containers, or whether we're the ones deleting them with our rules. In the home pod we know how much space they're using up, and we're also looking into things like data transfer within AWS. And we obviously have all of the HR information on people.

Package Manager is a slightly weird one. Download statistics are available but, just like CRAN download counts, they're not that useful. But we also have the Git repos, so we can actually go into things like the renv lock file for a study and properly understand which version of every package is being used. And something which GSK told me about just a couple of days ago is the API for flagging what language is being used and things like that. Then finally, for Connect we have information for every single object on things like when it was last accessed, how many times it was accessed, and whether it actually successfully deployed.
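To make the lock file idea concrete, here's a minimal sketch of tallying package versions across a set of study repos. This assumes only the documented renv.lock layout (JSON with a top-level "Packages" map); the file paths would come from however you clone or mirror the Git repos.

```python
import json
from collections import Counter

def package_versions(lockfile_paths):
    """Tally (package, version) pairs across many renv.lock files.

    renv.lock is JSON with a "Packages" map of package name ->
    metadata, where metadata includes a "Version" field.
    """
    counts = Counter()
    for path in lockfile_paths:
        with open(path) as f:
            lock = json.load(f)
        for name, meta in lock.get("Packages", {}).items():
            counts[(name, meta.get("Version"))] += 1
    return counts
```

From there it's one group-by away from "how many studies are pinned to each version of a given package".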

R version usage across containers

So we do have a dashboard, but because I don't have that much time I'm just going to focus on those four topics. But we have this kind of living live dashboard. So if people are asking questions or we want to know something we can kind of dive into it very easily.

So, going into the split of active containers over our versions. When I was making these slides I looked in the dashboard to see what the current status was. You'd think there would be use cases where people are using an old container - let's say they're still working on a study, and a study might last a year - so there could be quite a split there. Looking just at the containers with R, we actually have a lot of diversity in what people are using. And some of this is actually quite impressive. This person is still using R 4.0.0; they've been going in at least once every now and again to stop it auto-deleting for many years. That's maybe four years they've been managing to do that, so they obviously haven't had a sabbatical or anything like that.

But it really helps us understand what versions of R are being used, and we could do the same for Python - we just have fewer users there. We can also start going deeper. We know things like the life of the container: we know when it was created, and we know who deleted it - was it us or was it the user? So we can start to look at different patterns. If you look at this plot, you can at least see the patterns for when things are deleted - say, someone creates a container, maybe visits it once and never comes back - because you can see those lines appearing in the data when it comes to the life cycle of these containers.
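As a sketch of that lifecycle analysis - the record shape (created, deleted, last_active timestamps) is my own invention for illustration, not our actual schema:

```python
from datetime import datetime, timedelta, timezone

def lifecycle_summary(containers, now=None):
    """Summarize container lifetimes and flag the
    'visited once, never came back' pattern."""
    now = now or datetime.now(timezone.utc)
    out = []
    for c in containers:
        end = c["deleted"] or now          # still running if never deleted
        lifetime = end - c["created"]
        # Active for less than a day, but kept alive for over a week:
        # a candidate for a "please reuse your container" nudge.
        abandoned = (c["last_active"] - c["created"]) < timedelta(days=1) \
            and lifetime > timedelta(days=7)
        out.append({"lifetime_days": lifetime.days, "abandoned": abandoned})
    return out
```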

Optimizing idle interactive containers

So one of the other things we can look into is this question around how often the interactive containers sit idle. Maybe to give a bit of background: we have our containers here in the Kubernetes server, but really behind the scenes we have a whole bunch of worker nodes. In our current configuration there are 64 CPUs per worker node. So it's basically this hugely horizontal scaling thing, where it's adding a machine of 64 CPUs as new containers are created.

So if we said, okay, we want this to at least mirror a laptop, and we give each user 16 reserved CPUs, that means we're only putting a few containers onto every server, and our costs are relatively high. You can suspect that most people are not working 24 hours a day running things across 16 cores. So how can we optimize this to understand the best way to run this Kubernetes server?
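The provisioning arithmetic behind that is simple; here's a sketch (the node price is a made-up placeholder, not our rate):

```python
def containers_per_node(node_cpus=64, reserved_cpus=16):
    """How many containers fit on one worker node at a given reservation."""
    return node_cpus // reserved_cpus

def hourly_cost_per_container(node_hourly_usd, node_cpus=64, reserved_cpus=16):
    """Per-container share of a fully packed node's hourly price."""
    return node_hourly_usd / containers_per_node(node_cpus, reserved_cpus)

# With 16 reserved CPUs only 4 containers fit on a 64-CPU node; dropping
# the reservation to 4 CPUs packs 16 containers onto the same node and
# cuts the per-container cost by 4x.
```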

And because we have the data, we actually have a living track record of what people are actually doing. Looking at our statisticians and our statistical programmers, we can see in this plot that pretty much everyone has a container doing nothing most of the time - zero is where almost everyone is. There are a couple of people who managed to request all 16 cores - literally a couple of people in our department - but even they're not using the full 16. So 16 probably wasn't a good default for us to have in the first place.

And we can definitely conclude that reserving 16 cores per container is probably a bad idea and it's definitely a waste of money. So we can try to figure out what actually is the best balance to optimize the number of containers we can put onto each worker node, and then we can really drive down the price. And I think one of the cool things we can do is dig a bit deeper. For instance, for our real-world data scientists and statistical programmers we have different levels of over-provisioning - different numbers of containers per worker node, depending on what role the person has within the company.

And we can definitely conclude that reserving 16 cores per container is probably a bad idea and it's definitely a waste of money.

Cleaning up stale content on Posit Connect

So this is one, as I mentioned, that's actually in the docs for Connect, but really hidden in there - there's actually R code to do this as well. One of the things we can look into is making sure the disk space on our Connect server isn't full of stuff that doesn't work or is very old. So we decided to look into this, and when we ran the scripts that Posit has in their documentation, we saw there were actually 707 items where someone clicked the publish button and it's stale - it was never successfully deployed and the user didn't come back and get it working. The person may have tried to publish again, maybe they didn't, but these 707 items basically never successfully deployed. So if it's a 50 megabyte app, it's just 50 megabytes of data sitting there that does absolutely nothing and never has.

And I think there's a little quirk in there, because if you're just pressing the button in RStudio and it fails the first time, you keep making new ones. So sometimes it took someone 12 attempts to publish, which means we have 11 dead ones just sitting on the server as duplicated objects. Then we also looked at how long it had been since people had touched an object - across everything: APIs, static content, Shiny apps - when was the last time it was touched? We had 502 items that hadn't been viewed in the last year. We also have retention policies within the company, so there could be some hard deadlines where, if it hasn't been modified in three years, we may want to automatically delete it, or at least tell the user it's about to get deleted because of our retention policies.

So one thing we could do - and again, the code for this is already in the Connect documentation - is go in with a cron job and automatically start deleting those items which were never successfully deployed. You can imagine: if you try to deploy, you fail, and you don't come back within 30 days, we can assume we can safely delete that. For the 502, we're still trying to figure out the best strategy. With the retention policies there's at least a floor where it will get deleted, but we can also start to inform users that, hey, you've got this content sitting there - and remind people about their cron jobs, because I think there are a lot of jobs where people have set something on a schedule within Connect and then forgotten about it.
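As a sketch of the classification step - the field names here ("last_deployed_time", "last_visited") are assumptions for illustration; the real field names and the ready-made cleanup code live in the Connect Server API documentation that the talk references:

```python
from datetime import datetime, timezone

def flag_stale(items, now=None, max_idle_days=365):
    """Split Connect content items into 'never successfully deployed'
    and 'idle for over max_idle_days' buckets for cleanup."""
    now = now or datetime.now(timezone.utc)
    never_deployed, idle = [], []
    for item in items:
        if item.get("last_deployed_time") is None:
            # Published, but no successful deployment ever happened.
            never_deployed.append(item["guid"])
        elif (now - item["last_visited"]).days > max_idle_days:
            idle.append(item["guid"])
    return never_deployed, idle
```

The never-deployed bucket is the easy automatic delete; the idle bucket is the one where you notify the owner first.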

Understanding who uses the platform

So the last example I had was kind of understanding what teams are actually using our platform. So our platform was built for specific departments, but we have a whole bunch of people that have come in from other departments to use it as well. And we could really get into the details on how people are using the platform and which parts of the platform as well.

So I mentioned there's a lot you could do with that information, but this is just looking at one component: the frequency of interactive container use. We can really dig into how often people are actually going into an interactive container and doing something - we're taking into account CPU use, and whether the container was idle or actively used on that particular day - and we get this picture of what the use is: daily, weekly, monthly, or even less than that. I handpicked these departments because they had interesting numbers, but they're across different departments. There are a lot of different ways you could tackle this, and because we have our HR information we could also split this by whether someone is an individual contributor or a manager, and things like that. It's also really useful when you have teams coming in and using your platform who aren't the original stakeholders - it can be quite useful to understand where else it's being used.
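The daily/weekly/monthly split above boils down to bucketing each user's count of genuinely active days per month; a sketch, where the thresholds are my own illustrative guesses rather than the ones used in the dashboard:

```python
def usage_bucket(active_days_per_month):
    """Bucket a user's interactive-container activity, counting only
    days with real CPU activity (idle containers don't count)."""
    if active_days_per_month >= 20:
        return "daily"
    if active_days_per_month >= 4:
        return "weekly"
    if active_days_per_month >= 1:
        return "monthly"
    return "less than monthly"
```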

Feeding usage data back to users

So one thing we started relatively recently is seeing if we could try to positively influence how our users use the platform. One thing we want to avoid is forcing them to make a ticket if they want more than one virtual CPU. So how could we encourage respectful use of the resources without having to put in limits or make them fill in a ticket?

So I looked in the data when I made these slides and took one statistician who I know does simulations. That was his real number: he had 2,971 CPU hours in July. For this particular user I'm not using the real math - it's a lot more complicated because of, firstly, the corporate rates and also how the system is actually set up - but using a simplistic formula, that's about 62 US dollars. So we can actually feed back to users: hey, that job you ran - if you ran it 10 times, that's about $6 each time.

When I submitted the abstract, we had just started looking at the carbon footprint. In terms of this being something we want to use to positively influence users, the carbon footprint is interesting if we look at it at the platform level, and it's definitely a useful thing to communicate back if we're talking about the HPCs, where you've got much more CPU and GPU use. One of the things we found, though, is that when we use the available code to calculate it, it's actually relatively low if we're talking about a statistical programmer. This person with almost 3,000 CPU hours in a month was actually only at about 0.13 kilograms of CO2.
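Reverse-engineering the per-CPU-hour rates from those two numbers in the talk (2,971 CPU hours ≈ $62 and ≈ 0.13 kg CO2), the feedback calculation is trivial - but note these rates are placeholders implied by one example, not real corporate pricing or a proper grid-mix estimate:

```python
# Rates implied by the talk's single example; real rates depend on
# corporate pricing and the data center's energy mix.
USD_PER_CPU_HOUR = 62 / 2971       # ~ $0.021 per CPU hour
KG_CO2_PER_CPU_HOUR = 0.13 / 2971  # ~ 0.044 g CO2 per CPU hour

def usage_feedback(cpu_hours):
    """What we might show a user about a month of compute."""
    return {"usd": round(cpu_hours * USD_PER_CPU_HOUR, 2),
            "kg_co2": round(cpu_hours * KG_CO2_PER_CPU_HOUR, 3)}
```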

And we need to make sure that when we're communicating stuff back to users, it's steering people in the direction we want them to go. Roche also does things like telling you your carbon footprint when you book a flight through the company, and when I was making the slides I looked up the flights. The flight here was 9,000 kilograms; this person who did a crazy amount of simulations was 0.13 kilograms. Is that actually going to steer the person in the right direction if we start sharing this information back to them? Maybe they'll say: actually, that's hardly anything - I can now increase my bootstrap to 100,000 or something like that.

And we kind of need to make sure when we're communicating stuff back to users, it's steering people in the direction we want them to go.

And so this is just a mockup, because we haven't started doing it yet. But one thing we've talked about is putting in some rules for where we think there's a behavior we want to modify - could we actually start feeding back to users automatically by making some rules? So this is an example. We have a lot of people who make a lot of containers and just don't log into them again. They might make 11 containers over the span of a month, work in each one for a day, and just not come back - and sometimes it's the exact same image, so they're just making a new container each time they start working. We can start identifying those users and feeding back to them: hey, because we wait a little while to close these, that's actually quite expensive - can we please encourage you to stick to one image and one container? And that example from Connect, I think, is almost a no-brainer: we can start feeding back to people when there's content that's maybe forgotten, and definitely when there's content that's not actually working.
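A sketch of such a rule - the event shape here is hypothetical; in practice this would be derived from the container lifecycle data already sitting in Prometheus:

```python
from collections import defaultdict

def nudge_candidates(events, min_repeats=5):
    """Flag users who keep creating containers from the same image,
    use each for at most one active day, and never come back.

    events: one record per created container, with 'user', 'image',
    and 'active_days' (days the container saw real activity).
    """
    single_visit = defaultdict(int)
    for e in events:
        if e["active_days"] <= 1:
            single_visit[(e["user"], e["image"])] += 1
    return {user for (user, _), n in single_visit.items() if n >= min_repeats}
```

The output set is the list of users who'd get the automated "please reuse one container" message.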

Where we are now

Yeah, so where are we now? We have this dashboard which provides transparent metrics on data that was already there - we didn't create this data. All of our stuff from the Kubernetes cluster was just sitting in Prometheus, where it was used by our informatics team to monitor the platform. But it was also available to us on the business side to understand the users of the platform.

And so now when we get into these questions - I can see Doug, who leads our validation strategy, here - when he starts asking me how long people use particular versions of R and things like that, we actually know that. We know the splits across different versions of R for our users by department; we can really drill into stuff like that.

Yeah, and soon we'll figure out exactly what we want to communicate back to users and how we want to influence them. We could start to try to lower costs, or at least allow users to make more informed decisions based on their usage of the platform. And one thing I added just before I gave this talk, from that conversation with GSK a few days ago, is that a lot of people are asking similar questions. So maybe there's an opportunity to share more of this code, so that people could just pick it up and automatically build these kinds of dashboards to understand their users.

So yeah, thank you.

Q&A

So, have your efforts so far changed user perceptions of the SCE as free? Yes - as I mentioned before, we want to move away from relying on forcing people to fill in a ticket. That's what we currently do for GPUs: if you want to use a GPU and you don't sit in the imaging team, you basically have to try and convince us that you should be allowed to use one. So I think we definitely are getting the message across, but it's not the way we want to get it across. I don't want to have tickets for that; I want them to make a decision, and make a good decision.

Was there any data that was harder to obtain than you would have liked? And what other tools did you need to use in order to get that data? I think the Prometheus logs themselves are not that easy to access, so there's some stuff where our API engineers are pushing it into a SQL database for us. And there are definitely some things they're not collecting yet - AWS Batch, for instance, wasn't actually available when we first talked about it, so they had to start keeping that information. But I think it's all relatively easy for them to keep.