
Building Governable ML Models with R (Tom Shafer, Elder Research) | posit::conf(2025)
Speaker(s): Tom Shafer

Abstract: For a model to provide value in production, it must be fit for purpose, deployable, and maintainable over time. We know that R provides a host of tools and packages for building good models, but the language and ecosystem also provide tools to help us build these kinds of maintainable production systems. This talk will present techniques, adapted from software engineering, that provide a stable foundation for building models and writing all the accompanying code that's often needed to train, test, and update models over time. Those attending this talk will learn how, by centering model development on packages, writing tests, creating intuitive S3 methods, and more, we can build modularized, testable code that makes our models easier to monitor and update over time.

Materials: https://elderresearch.github.io/posit-conf-2025/index.html
Blog Post: https://tshafer.com/blog/2025/11/recap-post-conf-2025
Subscribe to posit::conf updates: https://posit.co/about/subscription-management/
Transcript
This transcript was generated automatically and may contain errors.
Congratulations! Our model's in production. The session is over. Now what do we do? Back when we started this project, the goal was to build a model that provided some kind of value for our company, for the public sector, for the world in general, and so we've made a good start now that it's in production. Supposedly some enormous number of analytics projects never get off the ground at all, so okay, we're doing okay. But in practice, deployment isn't actually the end of putting something into production; it's just the beginning. It's the first gate.
Really what we need to do is keep this in production for an arbitrarily long time in order for it to actually effect the changes that we want to happen. Typically, all of the stuff that happens after deployment falls under the heading of model governance or something like this, which is, you know, how do we keep this thing running when things change? Because a lot of stuff does change. Companies are going to change out from under us. Sometimes production systems change; we move from one system to another. Or maybe our model is amazing, and our business customer wants us to change it: they want us to add new features, or they want us to retrain for some new use case. Anyway, we're going to have to retrain the model, and so we're going to have to work with it again in the future.
Now, a lot of times when we talk about model governance, we think about the model object itself. We're thinking about versioning, serving, monitoring for drift, these kinds of things. But in practice, a lot of what we actually end up doing involves sort of all the scaffolding around the model. This is how the model is trained, how the model is validated. If we're responsible for writing the code for inference, for working with predictions, there's a lot of code that gets involved there, too. And so this talk is coming from experiences that I've had over the last few years putting models into production, trying to keep them there, and sort of how to think about model governance in this broader context, including all of the scaffolding code.
And so I want to center the talk on just this question: what can we do now, while we're building the model and designing how it's going to work, to make our maintenance job easier later? And in practice, I think we can distill it down to a few core practices, core principles that serve as a foundation for this. Let's see if you're ready for this list. These are things like packaging, documentation, testing, writing legible code. Take it in. I know what you're thinking: wow, this is going to change my life. No one has ever told me before that I should document my code. But the trick here is that in the governance context, these go from nice things that we should maybe do to things that stack on top of each other to produce a model, or a model system, that we can then maintain over time, one that keeps providing value even as we have to retrain it, adapt it, and change it.
Packaging as a foundation
And this talk is about R. This is the context in which I was working on the project that inspired this talk. But a lot of this stuff is just as applicable to Python as anything else. I write a ton of Python code and a ton of R code, so I'd be happy to talk afterward about ways this applies in Python the same way it does in R. It's just that all the examples are going to be in R.
So let's start with packaging. And I start here because this is foundational to everything else. And to be clear, I don't mean packaging like we're going to release this on CRAN. Most of what I do is proprietary. I work in consulting. So I work with a client. It stays in their environment. It stays in their version control, right? We're not putting this anywhere. But packaging is still tremendously useful because this provides a structure to support things like automation and lots of the other sort of governance and maintenance details. I try to do this with almost every analytics project I do of reasonable complexity for this reason.
And it does a couple of things. First, packages provide structure. Literally, they provide a file structure: our metadata goes here, our R files go here, our tests go here, our documentation goes here. At the very least, this helps avoid the thing you'll get in teams, and in academia, where I've come from, where you have scripts just kind of everywhere in a folder, and that's our production model. This helps avoid that. But it also does a lot of other things. It structures your dependencies: if your model depends on other packages, it helps declare and maintain those, so as the world changes, we can redeploy and rebuild with some amount of confidence.
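As a rough sketch of what that structure looks like in practice (the package name "classifier" matches the talk's fictional example; the file and dependency names here are illustrative), the usethis package can generate all of it:

```r
# Scaffold a package named "classifier" (the talk's fictional example).
# usethis lays out the standard structure: DESCRIPTION, NAMESPACE, R/, tests/.
library(usethis)

create_package("classifier")  # new package skeleton in ./classifier
use_r("train")                # adds R/train.R for the training code
use_testthat()                # sets up tests/testthat/
use_test("train")             # adds tests/testthat/test-train.R
use_package("glmnet")         # declares a dependency in DESCRIPTION
```

Each of these calls also prints what it did, so you can walk through the setup step by step.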
The other thing it tracks is your code version, and this is important in at least two cases in governance. One, I don't just want to know the model version that we're running; I would really like to know what version of my code base trained that model. Or, if I'm responsible for inference, like I am in the project I was referring to, I also want to be able to capture what version of the inference code was used. There are other ways you can do this, but packaging just puts the number in one place, and it propagates throughout as you use your package.
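One way to sketch this version-stamping idea (the function and field names are hypothetical; in a real package you would look up your own package's version, not "stats", which is used here only so the sketch runs as-is):

```r
# Stamp the code version onto the trained model object so we can later tell
# exactly which version of the code base produced this model.
train_model <- function(data) {
  fit <- lm(Sepal.Length ~ Sepal.Width, data = data)  # stand-in model
  list(
    fit = fit,
    # In a real package this would be packageVersion("classifier");
    # "stats" stands in so the example is runnable outside a package.
    code_version = as.character(utils::packageVersion("stats"))
  )
}

model <- train_model(iris)
model$code_version  # the version of the code that trained this model
```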
So it provides a structure. The other thing it does is support a bunch of useful automations using other really cool packages you've maybe used before, like devtools. This supports things like auto-generating package documentation by just calling document(), or running all of your tests in a package using test(). It also supports some more holistic kinds of things, like check(), which runs R CMD check and a whole bunch of other stuff that will basically verify that your package is loadable and usable by all of your potential downstream users. And you can bundle this into something like continuous integration, a GitHub workflow or something, so every time I make any meaningful change to my package, this all happens automatically. People don't have to think about it, which is sort of the point: I want to be thinking about the code, not about all of this other governance stuff. I don't have to.
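Those devtools calls, run from the package directory, look like this (a workflow sketch, not something to run outside a package):

```r
# The devtools automation loop: run these from the package's root directory.
library(devtools)

document()  # regenerate roxygen2 help files and NAMESPACE
test()      # run the full testthat suite
check()     # R CMD check: can the package be built, installed, and used?
```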
And then finally, this builds on typical R patterns. This is what any R user would expect. Instead of having to write some kind of proprietary thing where we're sourcing a bunch of files and making sure they're in the right order, we can just call library() for our little fictional package here, classifier, and then we can just use its functions. We've built this into a package structure, so any R user, whether it's next month, next year, or in two years, can use this code base the way that we would expect.
Documentation and testing
So this starts as a foundation, and then we can build on it with things like documentation. This is probably the least surprising part of this framework, so I won't spend too much time on this, but it's super helpful, because it works at multiple timescales.
In the governance context, documentation helps provide a sort of contract for how things work. It starts by helping future you, or whoever's going to be maintaining your package, because in a lot of cases it probably will be you, but it's going to be you in six or twelve months, when you don't have all the context loaded up that you have right now about the problem, the domain, the stakeholders, the weird conflicting things you've got to take into account. The other thing it does is help you right now, because it gets you away from vibes-based programming and into something more resembling a contract. If, in at least the most basic form, I can write down what this function should do, what it takes in, and what it puts out, I've forced myself to think about that, and that's going to help you later in the governance process.
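In roxygen2 terms, even a minimal contract can look like this (the function itself is a made-up example, not from the talk):

```r
#' Impute missing ages with the training-set median
#'
#' @param ages Numeric vector of ages, possibly containing NAs.
#' @param median_age Single number: the median age learned at training time.
#' @return Numeric vector the same length as `ages`, with NAs filled in.
impute_age <- function(ages, median_age) {
  ages[is.na(ages)] <- median_age
  ages
}
```

Three short lines of @param and @return are enough to state what goes in and what comes out, and devtools::document() turns them into real help pages.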
The other thing is that in this ecosystem we're all swimming in, documentation offers a lot of other really nice functionality. If you're working in Positron or RStudio and you write help, even really basic help, you don't have to go crazy, you get nice help integrated into your IDE. You get auto-completion, even if you're using pipes and that kind of thing. Again, all you have to do is document your arguments; you don't have to write a treatise. Plus, in the world we're in now, all of this documentation can get picked up by language models and other tools to power other experiences. And so it's not just something that's like, oh, we should do this, you know? We get real benefits from it, both now and downstream.
And then there's testing, which for a long time for me was kind of, "I should do this, but I don't really know how to get started. I'm not a software developer. I didn't go to school for computer science." But it is really, really useful, and it's really important in this context, because testing is what's going to provide us with a safety harness once our model is in production and we need to change it. Because code's complicated, and we can make mistakes, and I want something that's going to catch me if I make a mistake.
And this helps us answer a couple of questions in particular. One: does my code work right now? Again, this is moving away from vibes-based coding, where instead of saying, yeah, this model seems to do what I want, or, yes, this transformation function seems to do what I want, it lets us write something down in code that can ensure it. But then, more importantly, later, when you can't remember how this stuff worked, this helps me know: does my code still work? And this is particularly important when you start having a larger code base for a model, or model functions that have lasted a while, where changes in part A are going to affect part F over here, and you can't keep that all in your head.
There's also all kinds of interesting stuff where you start depending on other packages, and packages change and update over time. Maybe you didn't do anything, but a dependency broke something. Or maybe I did; that happens. Tests catch you in this kind of situation.
The good news is it's getting easier and easier, I think, to start growing in this capacity. So what do we do? There are ways to start small and grow from there; we don't have to boil the ocean to do testing. We have a lot of tools that help with this now.
And what you can do to get started, and what we've done on some of our key projects, are really just two steps. One, identify and test your key functionality. In a modeling context, this might be a main training function using little bits of reproducible data. It might be inference functionality. Or it might be complicated transformation stuff where you have to smash data frames together, and any time you do that, you can have missing values and misjoins; if you're using untrusted data, you can have a bad time. This lets you test the things you think are most likely to go wrong. You can do a couple of those; you don't have to do a lot.
But then over time, as we add new work or as we fix bugs, because they happen, we can just write a test that says: can I catch this bug? Can I figure out where the bug is? Can I test it? And then fix the bug. Now the bug will never happen to you again, at least not without you knowing it. And we have lots of really good tools for this now, packages like testthat, which introduced snapshot testing in version 3 and makes it super easy to write tests. It basically works in a three-part model. You start with some known data, like the iris data set in this example. This is not random data; it's deterministic, and we know what these data are. You capture the output. And then you call expect_snapshot() with your output. On the first run, this writes a little markdown file into your package structure that stores the call, any output messages, and the actual output of the object. The next time you run it, it goes and reads that file and figures out whether or not your result is reproducible. I didn't have to write any super fancy test conditions or anything like that. As a way to get started, this is pretty good. It also helps you ensure that your code is deterministic; it will yell at you if it's not.
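A sketch of that three-part snapshot pattern as it might appear in a package's test file (the model here is a stand-in; the talk's actual training function isn't shown in the materials):

```r
# tests/testthat/test-train.R
library(testthat)

test_that("model training output is stable", {
  # 1. Start from known, deterministic data: iris, not random draws.
  model <- lm(Sepal.Length ~ Sepal.Width, data = iris)
  # 2./3. Capture the output and snapshot it. The first run records the call
  # and its output in a markdown file under tests/testthat/_snaps/; every
  # later run is compared against that file.
  expect_snapshot(coef(model))
})
```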
Writing legible code with S3
And then finally, sort of the layer on top would be writing legible code. Right? We've got packaging. We've got docs and tests. But then if we want people to be able to use our code into the future, this is where writing legible code I think really helps. Because the real world is more complicated than demos. We have data that are complicated and involve lots of merging. We have models that aren't just one little model object we trained. It's multiple model objects that have to be managed together and maybe interact. In our project before, we had a two-stage model where one flowed into the other. Both were trainable. There can also be complicated prediction logic if you have to manage state for multiple models.
One approach, and it's fine, would be to just bundle everything together in lists. I train my model, and out the end comes a list with two parts: a preparatory thing and a model thing. Then, if we want to do inference, maybe we extract the different parts of the model and call some function that uses all these pieces together. You can do that, but we already have a nicer way to do this in R, using S3 objects and S3 methods. If you use R, you're used to calling print on something and getting nice output; if you call predict, you get nice output. The good news is that getting this kind of thing is really, really easy, and adopting it has made life better for me and for us.
You can do this in two easy steps. First, instead of returning a plain list, here's our original list from before, return a list with one extra attribute attached, called class. Call it anything you want; I called it classifier_model because it matches the package. Once you have that, you can write this funny-named function, predict.classifier_model(), and put whatever complex nonsense you want in there. Once you do that, your users can just use predict() from then on. They don't have to know what's going on; you can abstract it all away, which hopefully you were going to do anyway. Now they can work in the R ecosystem the way they should be used to doing anyway. This is good for downstream.
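Those two steps, sketched in runnable form (the class name follows the talk; the model inside the list is a stand-in for the real one):

```r
# Step 1: return the same list as before, but with a class attribute.
train_model <- function(data) {
  structure(
    list(fit = lm(mpg ~ wt, data = data)),  # stand-in for the real model
    class = "classifier_model"
  )
}

# Step 2: write predict.<class>() and hide the complex logic inside it.
predict.classifier_model <- function(object, newdata, ...) {
  predict(object$fit, newdata = newdata, ...)
}

model <- train_model(mtcars)
predict(model, mtcars[1:3, ])  # downstream users just call predict()
```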
This is particularly good if you're responsible for inference. In this project, we were responsible for maintaining the R stub files that actually did inference and were then orchestrated by the DevOps team. This lets us make our stub files expressive, easy to understand, and self-contained: they just call the package and do the thing. You can also extend this to other methods if you want to define something entirely new. I found out the other day that this one was actually already defined in the generics package, but we're going to pretend that we invented it. You define it using UseMethod(). Literally, this is not even toy code; you just write this. This is all you have to do. And then you can define your S3 method this way. And again, now I have this function, explain(), that I can call on my object. You can define things very easily, so when people come back to your code later, it's more expressive and follows the conventions of the ecosystem.
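And the new generic really is this short (explain() is the name from the talk; the method body and the example object's fields are illustrative):

```r
# Define a brand-new generic: one line of dispatch via UseMethod().
explain <- function(x, ...) UseMethod("explain")

# ...and a method for our classed model object.
explain.classifier_model <- function(x, ...) {
  cat("A classifier_model with", length(coef(x$fit)), "coefficients\n")
  invisible(x)
}

# Example object carrying the matching class:
model <- structure(list(fit = lm(mpg ~ wt, data = mtcars)),
                   class = "classifier_model")
explain(model)  # prints: A classifier_model with 2 coefficients
```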
Putting it all together
And so even though none of these things is probably mind-blowing or new, what we have found is that when we combine these principles, it has made our model more maintainable over time. In this particular case, we've been in production for a few years now for a client, and it's generated a ton of value for them. We even survived a transition from an Airflow-based architecture to Databricks, after Databricks came in and sold them all the things. We were able to lift this and move it over in part because of this kind of infrastructure. So I would love to talk more about this later, and I would love to hear the ways you do this or do other things. I appreciate you letting me be here. I also have a companion website on GitHub with some more examples and the slides from this talk. Thank you so much.
Q&A
I think we have time for one question. So how would you suggest someone goes about...
Okay. So someone wants to go about starting a small test base for their existing functionality, or even starting to package their code. Do you suggest using an LLM for that? Have you had any experience with that and good results? The question was, if you want to get started with any of these directions, what do I think about using a language model to help? I have not used language models to do this, because I got into this before they existed. But if they work the same here as they do at everything else, I say, sure, take a look at them. See what they have to say. They can certainly point you in the right direction. But in our ecosystem, we have packages like usethis and devtools that make this so easy. Install usethis and then call create_package(), and it just does it for you, and you can sort of walk your way through that way. It's so easy now.
