Giving your scientific computing environment (SCE) a voice - posit conf 2024

Transcript#

This transcript was generated automatically and may contain errors.

My name is James Black, I'm the insights engineering lead in late stage at Roche, and today I'm wearing my hat as the scientific computing environment product owner. I'm talking about a project with Vegeta, who's one of our API engineers who helped us navigate all of the data that's within our logs, and then also Christian who took a very badly written POC by me and turned it into a data product.

So in this talk I'm going to cover two problems that we ran into. So one of them was we were building our scientific computing environment and we spent a lot of time kind of understanding what the users needed. So we built this platform which hopefully was able to meet all the use cases which statisticians and our statistical programmers and our real data scientists had. But now we got to 1.0 and suddenly we were generating lots and lots of data on how they actually use the platform. So we started to ask, well can we start now incorporating what the users actually do into our decision making?

And the second problem was around optimizing use. So with our users, kind of historically, compute and particularly CPU compute is seen as a free resource. So we have over 1,200 users on this platform and when you're talking about that kind of scale, one thing we know is that it is definitely not free. So what could we do to kind of feed back to users to try and improve their use or at least allow them to make educated decisions rather than having to limit stuff?

And we can definitely conclude that reserving 16 cores per container is probably a bad idea and it's definitely a waste of money.

Cleaning up stale content on Posit Connect

So this is one, as I mentioned, it's actually in the docs for Connect, but like really hidden in there. There's actually R code as well to do this. But one of the things we can look into is making sure the disk space on our Connect server is, I mean, not full of stuff that doesn't work or is very old. So we decided to kind of look into this. And when we ran the scripts that Posit has in their documentation, we saw that there actually were 707 items where they click the publish button, it's stale, like it was never successfully deployed and the user didn't come back and get it working. So this was 707 items that were published. The person may have tried to publish them again, maybe they didn't, but they basically never successfully deployed. So it's essentially, you know, if it's a 50 megabyte app, it's just 50 megabytes of data sitting there that does absolutely nothing and never has.

And I think there's a little quirk in there, because if you're just pressing the button in RStudio and it fails the first time, you keep making new ones. So sometimes this was like they made actually 12, it took them 12 attempts to publish, which then means we have 11 dead ones that are just sitting on the server as duplicated objects. And then we also looked at like how long it had been since people had touched an object. So this was across everything, like across APIs, static, shiny apps, like when was the last time it was touched? And so we had 502 items that hadn't been viewed in the last year. We also have retention policies within the company. So this is something where, you know, there could be some hard deadlines where if it hasn't been modified in three years, we may want to automatically delete it or at least tell the user this is about to get deleted because of our retention policies.

So one thing we could do, and actually, again, the code for this is already in the Connect documentation, but we can go in and automatically start deleting using a cron job those items which actually were never successfully deployed. So you can imagine if you try to deploy it, you fail, you don't come back within 30 days. We can assume we can safely delete that. And then I think for the 502 is something we're still trying to figure out the best strategy. So retention policies, at least there's a floor where it will get deleted, but then we can at least start to inform users that, hey, you've got this content that's running and also reminding people about their cron jobs because I think there's a lot of jobs where people have set something on a schedule within Connect and then forget about it.

Understanding who uses the platform

So the last example I had was kind of understanding what teams are actually using our platform. So our platform was built for specific departments, but we have a whole bunch of people that have come in from other departments to use it as well. And we could really get into the details on how people are using the platform and which parts of the platform as well.

So I mentioned there's a lot you could do with that information, but this is just looking at one component. So within the frequency of interactive containers, we can really dig into how often are people actually going into an interactive container? And this is where they actually are doing something. So we're taking into account CPU use and if the container was idle or they were actively in it on that particular day, and we kind of get this picture around like what is the use daily, weekly, monthly, or even less than that across, I handpicked these departments because they were interesting numbers, but they're across different departments. And there's a lot of different ways you could tackle this. And because we have our HR information, we could also split this like are they an individual contributor or are they a manager and things like that. I mean, it's also really useful when you have teams who are coming in and using your platform who aren't the original stakeholders, it can be quite useful to understand where else it's being used.

Feeding usage data back to users

So one thing we started relatively recently is also seeing if we could try to kind of positively influence use in our users. So one thing we want to kind of avoid is like forcing them to make a ticket if they want like more than one virtual CPU, but so how could we do things to try and encourage kind of respectful use of the resources without having to put in limits or making them fill in the ticket or something like that?

So I looked in the data when I made these slides and took one statistician who I know does simulations. So that was his real number. He had 2,971 CPU hours in July. So for this particular user, I'm not using the real math is a lot more complicated because of firstly the corporate rates and then also like how the system is actually set up. But using kind of a simplistic formula, that's about 62 US dollars. So we can actually feed back to users like, hey, that job you ran, each time you're running it, that's $6 or something like that. If you ran it 10 times.

When I submitted the abstract, we had just started looking at the carbon footprint. I think in terms of like this being something we want to positively influence the users, the carbon footprint is interesting if we look at it at the platform level. And it's definitely a useful thing to communicate back. If we're talking about the HPCs where you've got like much more CPU use and GPU use and stuff like that. One of the things we found though is that using the kind of available code to calculate that, it's actually relatively low if we're talking about a statistical programmer. So this person who had the almost 3,000 hours in a month, it was actually only about 0.13 kilograms of CO2.

And we kind of need to make sure when we're communicating stuff back to users, it's steering people in the direction we want them to go. Because Roche also does stuff like when you book a flight through the company, they tell you your carbon footprint. And when I was making the slides, I got back the flights. So the flight here was 9,000 kilograms. This person who did a crazy amount of simulations was 0.13 kilograms. And is that actually going to steer the person in the right direction if we start sharing this information back to them? Maybe they'll say, actually, that's hardly anything. I can now run like, you know, I can increase my bootstrap to 100,000 or something like that.

And we kind of need to make sure when we're communicating stuff back to users, it's steering people in the direction we want them to go.

And so this is just a mockup because we haven't started doing it yet. But one thing we've talked about is maybe we could put some rules in where we think there's something happening, which is a behavior which we want to modify. Could we actually start feeding them back automatically by making some rules? So this is an example. We have a lot of people who make a lot of containers and they just don't log into them again. So they might make like 11 containers over the span of a month. And each one they kind of work in for a day and just don't come back. And sometimes, you know, it's the exact same image. So they're just making a new container each time they start working. So we can start doing stuff like identifying those users and maybe feeding back to them like, hey, you know, because we wait a little while to close these, that's actually quite expensive what you're doing. Can we please encourage you to kind of stick to one image and one container? And we can, that example from Connect, I think is almost like a no brainer. We can start feeding back to people when there's content that's maybe forgotten. And definitely when there's content that's not actually working.

Where we are now

Yeah, so where are we now? We have this dashboard which provides transparent metrics on data that was already there. Like we didn't make this data. All of our stuff from the Kubernetes cluster that was just sitting in Prometheus, and it was used by our informatics team to kind of monitor the platform. But it was also available to us on the business side to kind of understand the users of the platform.

And so now when we kind of get into these questions, like I can see Doug, who leads our validation strategy here, when he starts asking me like how long do people use particular versions of R and things like that, we actually know that. Like we know the splits across different versions of R for our users by department. Like we can really drill into the stuff like that.

Yeah, and soon we'll figure out exactly what we want to communicate back to users and how we want to influence them. But we could start to try and kind of lower costs or at least allow them to make more informed decisions based on their usage of the platform. And I think one thing which I've added just before I gave this talk was from that conversation with GSK just a few days ago where a lot of people are asking similar questions. So maybe there's an opportunity to kind of share more of this code so that people could just pick it up and, you know, automatically be able to build these kind of dashboards understanding their users.

So yeah, thank you.

Q&A

So have your efforts so far changed user perceptions of SCE as free? Yes, I think as I mentioned before, we want to kind of move away from relying on forcing people to fill in a ticket. That's what we currently do for GPUs. So if you want to use a GPU and you don't sit in the imaging team, you basically have to try and convince us that you should be allowed to use one. So I think we definitely are getting across the message, but it's not the way we want to get the message across. Like, I don't want to have tickets for that. I want them to make a decision and make a good decision.

Was there any data that was harder to obtain than you would have liked? And what other tools did you need to use in order to get that data? I think the Prometheus logs themselves are not that easy to access. So there's some stuff where API engineers pushing it into an SQL database for us. So yeah, there's definitely some things where maybe they're not collecting it yet. So like AWS Batch, that wasn't actually available when we were first talking about it. And so they had to actually start keeping that information. But it's all, I think it's relatively easy for them to keep.

Giving your scientific computing environment (SCE) a voice - posit conf 2024

Transcript#

Four key questions from operational data

Overview of the scientific computing environment

R version usage across containers

Optimizing idle interactive containers

Cleaning up stale content on Posit Connect

Understanding who uses the platform

Feeding usage data back to users

Where we are now

Q&A