Resources

Demystifying MLOps with Vetiver (Myles Mitchell, Jumping Rivers) | posit::conf(2025)

Demystifying MLOps with Vetiver
Speaker(s): Myles Mitchell
Abstract: MLOps is the process of setting up a Machine Learning lifecycle, including model training, deployment and monitoring. It is a complex topic which brings together an understanding of data processing, modelling and cloud architecture. It is therefore not surprising that many newcomers (myself included) can feel intimidated by the subject. In this talk I will draw on my experience as an organiser of local data science meetups. I will go into how MLOps is often presented within the data science community, how it could be made more accessible to students and beginners, and my current process for teaching MLOps in R and Python using my favourite package, vetiver. In summary: no, you do not have to be an expert in AWS or Azure to get started!
posit::conf(2025)
Subscribe to posit::conf updates: https://posit.co/about/subscription-management/


Transcript

This transcript was generated automatically and may contain errors.

So I'm Myles, and my talk is titled Demystifying MLOps with Vetiver. And I'm going to start with a quick disclaimer. So this is the statistical modeling and machine learning track. But my talk is not going to feature any machine learning models. In fact, there's going to be no statistical models at all for that matter. There's going to be no code. There's going to be no plots. Apart from one pie chart.

Instead, I'm going to be taking you on my own personal journey as I got to grips with MLOps over the past couple of years. And along this journey, we'll find out about how... Well, first of all, what is MLOps? I'll be talking a bit about how it is often presented within the data science community. I think it can cause a lot of confusion to beginner data scientists. And I'll be going through some take-home lessons.

So first, just some introductions. So I'm a principal data scientist at Jumping Rivers, which is a data science consultancy based in the United Kingdom. In terms of my day-to-day work, I do a lot of project management. I do a lot of support on client projects, mostly on the Python side. So I have worked on a few machine learning projects in the past. I also do a lot of teaching of programming, advanced statistics, that kind of thing. And I also organize a number of data science meetups. As you'll find out, I also enjoy hiking. And I've included lots of hiking analogies in this talk. Because who doesn't love analogies? And yes, it is as tenuous as it sounds.

You may have already visited us at the Jumping Rivers stand. We're one of the sponsors for the conference. If you haven't already, we'll be there at the next coffee break. So please do stop by to say hi if you haven't already.

Background and getting into data science

So I said there'd be hiking analogies. So like with any journey, we have to start with a route plan. So this is the plan for the talk. So I'll begin with how I got into data science. My first encounter with MLOps. I'll go through how I got to grips with MLOps using the wonderful Vetiver package by Posit. I'll go through some success stories and finish with some take-home lessons.

So I came from an academic background originally. So I did a PhD in astrophysics at Durham University in the UK. I was studying in this wonderful building, which is a research and development centre at Durham University. And I was among the first PhD cohorts that were offered funding for additional training in what was called data-intensive science back then. This was only eight years ago, but data science was still quite a novel topic back then. Machine learning wasn't so widely known. And there was certainly very little talk about AI and large language models and that sort of thing. So I got to do an additional six months of training in industry, and get some practice with big data and machine learning.

Now, originally, I'd actually planned to stay in academia. I thought I'd love research. You know, that's amazing. But as I quickly found out, academia is very hard. And if you want to stay, you're going to be moving all around the world. It's going to be a long time before you end up with a permanent position. So I took full advantage of the extra training I'd received. I did an internship at Jumping Rivers for about three months during the PhD. And then I joined as a full-time data scientist in 2022. And I'm still there to this day. So obviously, I'm enjoying it.

So I thought I'd say a little bit about my initial experience as a data scientist. So I had all these pictures in my head that I'd be working with these big machine learning models, you know, working with vast data sets. The reality was a bit different. So to start with, I was doing a lot of software development. If you're interested to know what software, check out diffify.com. It's an app I helped to build quite early on. Jumping Rivers also does a lot of teaching of courses. So I was very much involved in the writing of these courses and in teaching them to, like, the general public and going into companies and upskilling staff. I was doing lots of merge requests as well, lots of just co-development sort of internally within the company. And I was also going to conferences representing Jumping Rivers, also going to local data science meetups.

Data science meetups and first encounter with MLOps

But, yeah, I think you'd agree a lot of this doesn't really look very data sciency. You know, there's not really a lot of data, and there's certainly not much science going on in this slide.

So we organize meetups around England. So this is a map of the British Isles. This is England, the shaded region, for anyone who doesn't know. And the English refer to this region in the square as the North, okay? If you're from Scotland, you'd actually call this the Deep South, okay? But this is what we call the North. And these are the two cities in which we organize the data science meetups. So Newcastle-upon-Tyne at the top, which is also where the Jumping Rivers main office is, and then Leeds at the bottom there. That's where we organize the other meetups. Lovely part of the world, lots of national parks to go to if you're ever around.

So in September 2022, I got the chance to organize my first data science meetup. So I was planning the speakers. I was arranging the pizza. You know, lots of administration to go along with that. All these are advertised via meetup.com. And I was really curious to see what everyone else views as data science in the community. So I was very interested to see the sort of talks that are being submitted.

So I did promise a pie chart. This is a breakdown of all the talks we've had from 2022 to the present day. So we've had 58 talks in total, and only about 5% are sort of outreach talks. So that's sort of talking about, you know, here are some opportunities for students, you know, to get into data science, that kind of thing. It'd be nice to have a bit more of that, I think. Maybe about 15% are sort of software talks, where, you know, a speaker is just talking about a package they've been using, that sort of thing. About 15% will be focusing on a particular model someone's been using, either machine learning model or just a general statistical model. 25% LLMs. That's definitely on the rise, though. I think if I'd created this using just the data from the past year, LLMs would probably be over half of the talks, because that's very much on the rise at the moment. But about 40% fall into this area that I've sort of grouped as MLOps. So clearly, this is a really big area in data science. And this was my, personally, my first encounter with MLOps, was from going to these meetups.

What is MLOps and why it's intimidating

So my first encounter with MLOps, well, what is it, first of all? So it stands for Machine Learning Operations. And the idea is that it streamlines the machine learning lifecycle. So the idea is we train our models. We want that model to be usable, so we deploy it to the cloud, where users can query it. So we'll have some sort of model API. And then we also want to monitor the model while it's in production. So we want to check that as the data changes, is the model still maintaining an acceptable accuracy? And if it's not, then we need to make sure we're retraining it at a regular frequency.

Another big part of this, as well, is because this is like a cycle, and we're constantly retraining our models, it becomes very important to version and to store our models, because we want to be able to retrieve a model from any point in time in the past. If these models are leading to important decision-making in businesses, it's very important to see why a certain decision was made, and what was it about that model that led to that prediction. So that's pretty much all there is to it, really, in words. It sounds simple enough. Until we see our first architecture diagram.
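The version-and-retrieve idea can be sketched with nothing but the Python standard library: write each model out under a timestamped name, and look versions up by sorting. Vetiver does this properly via pins boards; everything below, including the function names, is an illustrative stand-in rather than any real package API.

```python
# Minimal sketch of timestamped model versioning (hypothetical helpers,
# not vetiver's API): every save is kept, so any past model is retrievable.
import pickle
import tempfile
import time
from pathlib import Path

def save_version(board: Path, name: str, model) -> Path:
    """Store a model under a timestamped name so every version stays retrievable."""
    stamp = time.time_ns()  # nanosecond wall-clock time: sortable and effectively unique
    path = board / f"{name}-{stamp}.pkl"
    path.write_bytes(pickle.dumps(model))
    return path

def load_latest(board: Path, name: str):
    """Retrieve the most recently saved version of a named model."""
    versions = sorted(board.glob(f"{name}-*.pkl"))  # timestamps sort chronologically
    return pickle.loads(versions[-1].read_bytes())

board = Path(tempfile.mkdtemp())            # stand-in for a pins board
save_version(board, "demo", {"coef": 1.0})  # first training run
save_version(board, "demo", {"coef": 2.0})  # retrained model
latest = load_latest(board, "demo")
```

Swapping `load_latest` for a lookup by a specific timestamp is what lets you answer "which model made that decision?" after the fact.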

So this is like a very typical sort of architecture diagram. I actually just used a diagram that my friend put together. He'll not mind. But essentially, the way these work is they basically summarize the full infrastructure of like a machine learning sort of lifecycle or workflow. So usually there's some version control stuff, in this case with GitHub, sort of around the top. Then you're going to have some stuff to do with like development environment. There's going to be a staging environment somewhere, production environment. But, yeah, for like a beginner data scientist, when I look at this, it might as well just be squiggly lines, you know, just spaghetti. You know, I don't really know what's going on.

So there is just an uncomfortable reality for a lot of beginners in data science that, and this was also touched on in Flavia's brilliant talk just before mine, there's an overwhelming choice when it comes to data science. So particularly with modeling frameworks, you know, do I go tidy models? Do I go R or Python? You know, which package am I going to use? What type of model will I use? Then there's cloud platforms. Do I use Azure? Do I use AWS? You know, it's an endless choice. And there's so many different options for creating environments and containers, and the list just goes on. It's also very multidisciplinary. So, you know, this demands expertise in data science on the modeling side, but also in like cloud infrastructure. So actually setting up, you know, compute environments, setting like the, you know, the optimal computing power, that sort of thing. So unless you've got a big team around you, it's very difficult to know how to jump into this area. It's also very expensive. You know, all these cloud platforms charge money. You know, none of these are free. There might be a few free trials out there, but, you know, it's very difficult to jump into this.

Getting to grips with MLOps using Vetiver

So just to add some extra motivation and incentive, in 2024, I agreed to deliver an MLOps workshop at an upcoming conference. I say I agreed to this. Actually, it was the CTO of Jumping Rivers, Colin. He signed up to do a workshop, and then he asked me to do it in his place. And he was very positive about, you know, he saw lots of benefits from doing this. You know, it means you'll get to learn it. You can start to teach MLOps as a service. And actually, you know what, he has got a point, and I would actually argue that if you are learning a really difficult topic, you know, sign up to do a talk sometime in the future or a workshop. It doesn't have to be at a big conference, but just at a local data science meetup. Because knowing that you're going to be getting up onto a stage, you know, presenting it to a large audience, you know, you're going to have to be able to explain it to an audience, and so you're going to have to understand it yourself. So I think it is actually probably the best way to learn a new topic is to just sign up to do a workshop, if you've got any opportunities to do that.

The turning point in learning MLOps was discovering the Vetiver package. So this is a package developed by Posit, and like many of their other packages, like Shiny and Quarto, it's multilingual, so it's available in both R and Python. It's completely free to install and use. So it's a really good sort of beginner-friendly package. I think this diagram from the Vetiver documentation summarizes it really nicely. So this is the kind of lifecycle that we have in machine learning. So we're collecting data. We're doing some data cleaning using the tidyverse. We're training the model using tidymodels. And then we've got these steps in this kind of shaded background, and these are the steps that Vetiver takes care of. So with a single package, you've got functions for versioning your models. So that will store the models with a timestamp, so you can retrieve your models from any previous time. You can store them locally. You can store them on the cloud. It handles deployment, so it will use Docker to create a Dockerfile. It will create a model API using Plumber, and it's got functions for deploying this straight to the cloud, whether that's Posit Connect or AWS. It's also got functions for monitoring your models. So you can basically track metrics like the accuracy or the root mean square error in real time as the data changes.
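The monitoring idea can be sketched in a few lines of plain Python: score each fresh batch of data, keep a history of the metric, and flag when it breaches a threshold. Vetiver ships its own monitoring helpers; the hand-rolled version below is purely illustrative, and the function names are hypothetical.

```python
# Hand-rolled sketch of model monitoring: track RMSE per batch of fresh
# data and flag when the latest batch drifts past an acceptable threshold.
import math

def rmse(actual, predicted):
    """Root mean square error over one batch."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def needs_retraining(batches, predict, threshold):
    """Return the per-batch RMSE history and whether the latest batch breaches the threshold."""
    history = [rmse(y, [predict(x) for x in X]) for X, y in batches]
    return history, history[-1] > threshold

predict = lambda x: 2 * x  # toy "trained model": assumes y = 2x
batches = [
    ([1, 2, 3], [2.1, 3.9, 6.0]),  # early data: close to y = 2x, small error
    ([1, 2, 3], [3.0, 6.0, 9.0]),  # later data drifts to y = 3x: large error
]
history, retrain = needs_retraining(batches, predict, threshold=1.0)
print(retrain)  # True: the latest batch's RMSE is well above 1.0
```

In a real pipeline a `retrain` flag like this is what kicks off the next loop of the lifecycle: retrain, version, redeploy.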

In the background, it's using lots of other packages. So it's using pins for the versioning. It's using Plumber if it's R or FastAPI if it's Python to actually set up the model API, and it's also generating the Docker file for you. So even if you're not an expert in Docker and creating container environments, it does a lot of this heavy lifting for you, which I think is really nice.

It also has the option to deploy your models to localhost. So if you're not ready to invest in cloud infrastructure, just deploy it onto your local machine, and you can still query it as though it's an API using the localhost IP address. It's just a nice way to sort of get used to working with APIs and to check that the model actually behaves as expected, and it's completely free to do this. Once you're happy then with your model and it's working as expected, then you can start to think about that next step. So yeah, I think this is a really great beginner-friendly way to learn MLOps.
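The "deploy to localhost and query it like an API" idea can be demonstrated with only the standard library: stand up a tiny HTTP endpoint serving a stand-in model, then send it a POST request. Vetiver generates a real Plumber or FastAPI app for you; this sketch just shows the shape of the interaction, and the `/predict` route and payload format are invented for the example.

```python
# Stdlib-only sketch of a local model API: serve a trivial "model" on
# localhost and query it over HTTP, just as you would a cloud deployment.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def predict(x: float) -> float:
    return 2 * x  # stand-in for a trained model

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Parse the JSON body, run the model, and return a JSON response.
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        payload = json.dumps({"prediction": predict(body["x"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

req = Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"x": 3.0}).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
print(result)  # {'prediction': 6.0}
```

Once querying a local endpoint like this feels routine, pointing the same request at a cloud-hosted URL is the only change the client side needs.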

So going back to that workshop that I was signed up to teach. So the way we did this, it's quite similar to how we run our training courses at Jumping Rivers. So we use Posit Workbench, where attendees can use Jupyter if it's Python, or RStudio if they're using R. All the dependencies are pre-installed, so no one has to install anything. So both R and Python attendees are welcome. We only really did deployments to localhost just to sort of teach them what an API is, how do we query an API, what's a POST request, what's a GET request, that kind of thing. And the feedback I've had is that the attendees have found it really clear to understand and pitched at the right level. So I'm really pleased with how that's gone.

Success stories and take-home lessons

So we've taken this forward. We've not just left it at that one workshop. So I've now taught three workshops at conferences over the past year, and also I'm now giving this talk today. We're also now wrapping up our first MLOps project for a client. I'm not going to show you an architecture diagram for this, but basically it's Python models deployed in AWS, essentially. And it's gone very well, so we're very happy with that. We're also now starting to build up an MLOps service at Jumping Rivers, where explainability will be a key idea of it. So just making sure that the client is fully informed of the process and explaining the concepts to them in an understandable sort of manner. So we've already got data science and cloud engineering at Jumping Rivers, so this is essentially bridging these two areas. So, yeah, we're very excited to be developing this, and we're excited to share more information about this in the very near future.

So just to give a quick summary, you can get started in MLOps right now with free and open-source tools, so check out Vetiver if you're interested in a really beginner-friendly package to get started. Architecture diagrams can be incredibly useful for the right use cases, but do consider the audience at the conference or the meetup that you're going to, and just consider whether you do need to have all of the information in that diagram. Of course, there's nothing wrong with flashing it up quickly just in case anyone wants to take a photograph, but it's quite a lot of information to be talking through in just one slide. And finally, signing up to teach a workshop or to give a talk is a really good way to learn a topic, so I can highly recommend having a go at teaching if you've never tried that.

Just a quick plug as well, we organize an annual Shiny conference. It's called Shiny in Production, so this is running in Newcastle in the UK in just over two weeks, so if you fancy getting on a plane in two weeks' time, nine-and-a-half-hour journey, feel free to sign up. There's a discount code there. This will be on the last slide as well, and those are just some useful links. So we've got the Vetiver documentation, the link for the Shiny conference, the Jumping Rivers website where we've also got a blog. We've also got a YouTube channel where we put up free monthly webinars that anyone can check out, and we're Jumping Rivers Limited on LinkedIn if you want to check us out there. Thank you very much.

Q&A

Okay, a quick question or two. Do you know of anyone who has had to rescind a model and fall back to the previous version?

No, I've not come across any cases like that. Like I said, it's still quite a new service at Jumping Rivers, so I think the longer we spend in this area, we will start to see some more cases like this, but yeah, it's a core part of MLOps that you always want models to be retrievable at any point. Unexpected things can happen. You might suddenly change your mind and want to go back to where you were, so yeah.

Okay, one more. I'm an early career data scientist who wants experience with MLOps, but my organization doesn't focus on deploying models. We do more one-off research. What advice do you have to practice MLOps without organizational support?

Yeah, so like I said, the Vetiver package is a really good way just to start learning the concepts at least. It's completely free to install. It's a bit limited in that if you actually want to start practicing on the cloud, Vetiver is not going to just automatically let you deploy things straight to AWS or Posit Connect, so I think just to get used to the concepts, it's a really good way to get started at least. I'm still looking for that answer as to how does a complete beginner take that next step going onto the cloud, so if anyone does have any suggestions or ideas about that, I'd be really interested to hear them afterwards.