Resources

Community Meetup | Suteja Kanuri | R in Retail: ML Ops - AI as an Engineering Discipline

R in Retail & E-Commerce: MLOps - Machine Learning as an Engineering Discipline. Presentation by Suteja Kanuri. Abstract: "Only 22 percent of companies using machine learning have successfully deployed a model." What makes it so hard? And what do we need to do to improve the situation? MLOps is a set of practices that combines machine learning, DevOps, and data engineering to deploy and maintain ML systems in production reliably and efficiently. There are various maturity levels of MLOps depending on the industry, and awareness of the MLOps toolkit is important for machine learning teams to succeed. Speaker bio: Suteja is based out of Singapore and started her career in the data industry. She has been an engineering manager for the last 3 years across two industries, banking and e-commerce. She has immense experience working with machine learning teams and is well versed in MLOps practices across multiple industries. She has nurtured and managed a 15-member MLOps team comprising machine learning engineers and data engineers.

Dec 3, 2021
38 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Thank you everyone for joining in, first of all. So today's presentation is about MLOps, as everyone would have known prior to the meeting. I thought I would take around 30-35 minutes to deep dive into everything about MLOps, explain it from my experience as well, and have a discussion towards the end in the form of questions that you can drop to Rachel. So before actually deep diving into the topic of MLOps, which is machine learning as an engineering discipline, let me give a quick introduction about myself so that it helps you connect with me.

So this is me on the right side. I'm based out of Singapore, and I have been in the data world since 2015. I came to Singapore to study my graduate course in data science and business analytics. That's how I deep-dived into the world of data, and ever since then I've just been in it. And I absolutely love it.

Over the last several years, I have been working in three different industries, primarily advertising, banking, and e-commerce. That's how my journey went: I was an individual contributor in advertising before I moved to engineering manager roles in the banking industry, and my most recent experience is in an e-commerce firm. I'm also a parent to my beautiful dog, whom I got about a month ago. That's her; her name is Toffee. And my email address is Suteja Kanuri at gmail.com, just in case anyone wants to connect with me. I'm also available on LinkedIn as well.

Agenda and overview

So the agenda for today's session is that by the end of it, one would be able to understand or take away the following aspects: the elements of machine learning systems, why we need MLOps, and the benefits of MLOps. My entire presentation is aimed at answering these questions, and hopefully everyone has a lot of takeaways at the end of it.

Software systems vs. ML systems

So this is just an introductory slide where I wanted to check understanding with everyone. I'm sure that whoever is attending the session at this point knows the difference between a software system and an ML system, right? Well, we know that an AI system actually comprises code and data, unlike a software system, which only comprises code. And what exactly is the code in the ML system? Well, it is the machine learning algorithm, which we famously call the model. It can be either a deep learning model or a machine learning model. Essentially, that's what the code is all about, right?

And traditionally, programmers automated a task by writing a program. Whereas in machine learning, the system finds the most suitable algorithm that fits the data patterns and is able to produce results in the form of predictions, classifications, etc. So a machine learning system is ideally a combination of both code and data.
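To make the "code learned from data" idea concrete, here is a minimal sketch in plain Python; the fitting rule and the data are my own toy example, not from the talk:

```python
# Instead of hand-coding the rule y = 2x, we let the "model" recover it
# from example data using a one-parameter least-squares fit.

def fit_slope(xs, ys):
    """Learn w minimizing sum((y - w*x)^2); closed form: w = sum(x*y)/sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]        # data generated by the hidden rule y = 2x
w = fit_slope(xs, ys)    # the learned parameter, produced by the data
print(w)                 # → 2.0
print(w * 10)            # prediction for a new input: 20.0
```

The program here is generic; what it actually computes is determined by the data it sees, which is exactly why an ML system must version both.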

So I just want to draw everyone's attention to the red box, which says ML code. What can we deduce from this entire diagram? Only a small fraction of a real-world machine learning system actually consists of the ML code, which is shown in the box in the middle. But what about the rest? What are the other processes? Well, that is the entire surrounding infrastructure supporting the machine learning algorithms and processes, and it is vast and complex. As you see, in reality, a machine learning team does more than just build the models; they maintain the entire infrastructure around machine learning, right from data collection, transformation, feature engineering, and A/B testing, to building in-house tools and frameworks to serve a solution in a particular pattern, monitoring, retraining, etc.

So a machine learning team has to do more than just figure out which algorithm works; they need to ensure that the entire process is repeatable and seamless, and that it is fast to deliver. Apart from ensuring that the models can be quickly deployed to production, one also has to understand that a machine learning team spends a lot of time on technical debt as well, where the data is changing, or the code has degraded based on changes in user behavior and demographics. So what do we do when certain behaviors have changed? The machine learning team invests in retraining the machine learning models to keep them up to date.

What is MLOps?

All of these processes being shown right now lead to an engineering practice which has been coined and is now practiced widely across all industries. It is called MLOps, which is machine learning operations. Essentially, it is the application of DevOps to the ML world and ML systems.

So what is the goal of MLOps? If you look at the diagram in the slide in front of you, which loops back and forth through designing, training, and running, that is what MLOps is all about. It is basically ensuring that there is faster experimentation and model development, faster deployment of upgraded models to production, and, while doing this, quality assurance once the machine learning models have been deployed to production. These are the three aspects that MLOps tackles really well.

Let me conclude this slide by saying that the goal of MLOps is to make the development and deployment of ML systems systematic, reliable, and repeatable. As we go ahead, we will realize that the machine learning process is extremely repeatable apart from the model itself, the algorithm itself. The entire process around it, if we start zooming out, is very repeatable. So why don't we start relying on what DevOps engineers have been doing forever and implement similar aspects through the lens of machine learning?

The goal of MLOps is to make development and deployment of ML systems systematic, reliable, and repeatable.

Typical ML workflow

Now, let me move on to the typical ML workflow that every organization is following, or probably wants to follow, at any given point in time. These are the eight boxes that I've come up with. There can be many, but these are good enough for a high-level workflow which covers most of the aspects of a typical ML workflow.

So the first one starts with data extraction. As everyone knows, data extraction is the phase where we select the different data sources and bring them into data lakes, which are the entry point where data analysts or machine learning engineers can actually analyze the data. Moving on to the second phase, the data analysis phase: here, analysts or machine learning engineers do exploratory data analysis to understand all the data which could be an input to the machine learning model. This is the phase where the engineers really get intimate with the data, trying to understand and wrangle it in preparation for building models.

Once the data has been analyzed, data preparation becomes a very important phase for a machine learning task. It involves an ample amount of cleaning, changing or splitting the data into training, validation, and test sets, and applying complex transformations, for example converting categorical variables to one-hot encoding. This is where the entire chunk of preparation happens before the data actually enters the model training phase. And the model training phase is the phase which probably everyone is aware of: choosing the right machine learning algorithm, choosing the right hyperparameters, figuring out which combination works for the data set that we have chosen, and coming up with a reasonably good model which can be tested and deployed.
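As an illustrative sketch of this preparation step (the split ratios, column names, and helper functions are my own, not from the talk), the splitting and one-hot encoding might look like this in plain Python:

```python
import random

def split(rows, train=0.7, val=0.15, seed=42):
    """Shuffle deterministically, then cut into train/validation/test."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n = len(rows)
    a, b = int(n * train), int(n * (train + val))
    return rows[:a], rows[a:b], rows[b:]

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over known categories."""
    return [1 if value == c else 0 for c in categories]

# Toy data set: a categorical "color" column and a numeric "price" column.
rows = [{"color": c, "price": p}
        for c, p in [("red", 10), ("blue", 12), ("green", 9)] * 10]

train_set, val_set, test_set = split(rows)
cats = sorted({r["color"] for r in rows})          # ['blue', 'green', 'red']
encoded = [one_hot(r["color"], cats) + [r["price"]] for r in train_set]
print(len(train_set), len(val_set), len(test_set))  # 21 4 5
```

In practice a library such as pandas or scikit-learn would do this, but the shape of the work is the same: split first, then transform.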

So the actual output of model training is the best model that we have chosen based on experimentation. There are different tools which help in each and every phase. The tools are fairly standard in the market at this point and can be found in the different cloud ecosystems as well, like Amazon, Google, or Microsoft.

Now, after we have come up with the best model for that particular place and time, we go for model evaluation, where the model is validated against the test data for its quality, baselines are set, and metrics are measured. Then we move to the model validation phase, where we check that we have adequate information on the predictive performance of the model. Here is where we finalize the model which can actually go into production environments, which is done in the phase called model serving. In the model serving phase, the validated model is deployed to the target environment using automated practices to serve predictions. The deployment can be done using a microservice architecture, which is really catching on in ML workflows at this point. Typically, we do it using REST APIs to serve online predictions, or we embed a machine learning model as an object in an edge system or device. Or it can be part of a batch prediction system as well, which can run daily or monthly, whatever the frequency is.

And the last and final phase of a great machine learning workflow is model monitoring. It is imperative to track the performance of the model once it has been pushed to production and is serving in real time, because all the tests done so far were generally done on offline, historical data. We need to know how well our model is performing at any given point in time, and whether it needs any re-tweaking in terms of the algorithm itself or the data itself. Based on model monitoring, we probably want to go back again and again through the loop: retraining, data preparation, and all the other steps.

Why do we need MLOps?

So this particular ML workflow answers the question of why exactly we need MLOps. Essentially, it helps in every one of those eight processes. It puts a checkpoint around every input and every output; there is validation, there is cross-validation, at each step. I have listed a couple of pointers as to why there is a need for MLOps, which probably everyone will agree with. It helps in experimentation. It helps in tracking metrics in a process-oriented manner. It helps with source control of the code, because every time we figure out the perfect model, we need to be able to trace it back in terms of both code and data; we need to track the lineage. For that, we need to check in both data and code regularly at every point in time, so that for any auditing or governance purpose, the results are reproducible.

Also, MLOps automates the entire deployment. We do not want manual checkpoints, and we do not want to spend a lot of effort on manual deployment, because that goes against scalability. MLOps also helps in monitoring model performance and efficacy on a regular basis, and if something is amiss, say the model has degraded or the data has degraded, the entire process can be repeated. So essentially, the need for MLOps is to convert each and every process into a series of steps or subtasks, with a set of triggers which warrant a change, to maintain reliability and consistency.

MLOps as a process, not a product

With this, I move on to what exactly MLOps is, the definition. It is not a product that we can buy off the shelf and deploy. There are, at this point, some tools and frameworks which provide most of the MLOps processes, but typically it is not a product which you can deploy, because it is actually a process which we need to follow. And how do we follow the process? By combining different tools available in the market, choosing the right one from the various competing tools, plugging it into our ecosystem based on the industry and the maturity of the industry that we are in, and ensuring an automated pipeline of triggers is run to maintain the entire MLOps flow.

So it is the entire process around integration, training, and delivery, and it needs different teams to come together to ensure this. Let me name certain teams which need to come together to set up MLOps in an organization. The first one is definitely the entire data team, which comprises platform teams, engineering teams, and machine learning teams. We also need the support of the production support teams, because they help in monitoring and alerting. There is also the entire IT infrastructure whose support is needed, just in case we need more, say different types of clusters where we can deploy our machine learning workloads.

Probably there might be a case where we need a new infrastructure to be created all by itself to help us with our deployments. So in that way, it's a total combination and support of multiple teams coming together and setting up the MLOps. But the direction is what has to be given by the data teams, specifically the machine learning teams.

There are various maturity levels of MLOps. This is something which I have been following as a bible from a Google paper on MLOps, where they define every maturity level. A quick anecdote: when I was working in the banking industry, I was working at maturity level zero. And when I recently made the move to the e-commerce industry, I realized that I was working with MLOps whose maturity is somewhere between one and two. Our eventual goal is to start moving to two, but there are different checkpoints as well; there is a 0.5, there is a 1.5 as well. So MLOps is a process which continuously keeps getting better based on how much to automate, what to automate, and the problems that we're trying to solve. It is something which essentially has to keep improving.

CI, CD and CT

This brings me to another very important topic: CI, CD, and CT. As I said, MLOps is not a product but a process. The process builds on what has existed in the past through the DevOps lens, continuous integration and delivery, extended with continuous training. If we apply the lens of machine learning on top of it, the definitions change, and the way it has to be done also changes.

So let me go through the most interesting and newest addition to CI/CD pipelines, which is continuous training, the third one. This is the new property which is unique to machine learning systems. It is concerned with automatically retraining and serving the models. We have to continuously retrain the models, right? Because suppose it's the e-commerce industry, where the products change every week, and where user behavior changes based on sale days, non-peak days, etc. We need our models to cater to the changing data and changing patterns. That's when continuous training becomes extremely important.

So we can think of continuous training as a series of multiple jobs, which can be orchestrated by any orchestration tool like Airflow. What exactly happens in training is that we look into the data, try out multiple combinations of different hyperparameters, choose the model which best fits the current data based on how much it has changed, and come up with the best model, which can be served in real time. That is what continuous training is about. Another example: when I was in the banking industry, we were not doing continuous training at all, because how often do we open a banking app? Not as often as we open an e-commerce app and navigate through the products. Also, the inventory in an e-commerce firm is very different from the inventory in the banking world: in banking there are high-ticket products, whereas in e-commerce there are small-ticket, low-value products. So in that way, continuous training also depends on the industry that we're in.
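The series of jobs described above can be pictured as a simple chain. In production each step would be a scheduled task in an orchestrator such as Airflow; this stdlib-only Python sketch, with all names hypothetical, just shows the sequence:

```python
# Hypothetical continuous-training flow: pull data, tune, validate,
# and only then promote. Each function stands in for a real job that an
# orchestrator would schedule (e.g. weekly) and retry on failure.

def pull_data():
    # stand-in for "pull the last three months of data"
    return [{"x": i, "y": 2 * i} for i in range(100)]

def tune_and_train(data):
    # stand-in for trying hyperparameter combinations and keeping the best
    candidates = [{"lr": 0.1, "score": 0.81}, {"lr": 0.01, "score": 0.87}]
    return max(candidates, key=lambda m: m["score"])

def validate(model, threshold=0.85):
    # only models beating the threshold are allowed into serving
    return model["score"] >= threshold

def retraining_pipeline():
    data = pull_data()
    model = tune_and_train(data)
    if validate(model):
        return {"status": "promoted", "model": model}
    return {"status": "kept_previous"}

print(retraining_pipeline()["status"])   # → promoted
```

The key design point is that the chain is fully automatic: a schedule or a data-drift trigger starts it, and no human is needed unless validation fails.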

Now, let me move on quickly to continuous integration and delivery. Continuous integration is no longer only about testing and validating code and components, as for software systems; it is also about testing and validating data. As you remember from the first slide, AI or machine learning comprises code and data, so data is imperative. Hence, continuous integration tests the data, data schemas, and models, and essentially builds tested components in the form of containers or executables which can be deployed further. Continuous delivery is all about continuously deploying new pipelines to target environments. These can be retraining pipelines and serving pipelines, just to ensure that the workloads being served in production are fresh and continuously integrated with the changing model and code.
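As a sketch of the "testing and validating data" idea, a CI step might check incoming training data against an expected schema before the pipeline is allowed to build; the schema and checks below are illustrative, not from the talk:

```python
# Minimal schema check a CI job could run on new training data:
# every row must have the expected columns with the expected types.

EXPECTED_SCHEMA = {"user_id": int, "item_id": int, "price": float}

def validate_rows(rows, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable schema violations (empty = pass)."""
    errors = []
    for i, row in enumerate(rows):
        missing = set(schema) - set(row)
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        for col, typ in schema.items():
            if not isinstance(row[col], typ):
                errors.append(f"row {i}: {col} should be {typ.__name__}")
    return errors

good = [{"user_id": 1, "item_id": 7, "price": 9.5}]
bad = [{"user_id": 1, "price": "9.5"}]         # missing item_id
print(validate_rows(good))   # → []
print(validate_rows(bad))    # reports the missing column
```

A real pipeline would use a dedicated tool for this, but the CI contract is the same: a non-empty error list fails the build before any model is trained.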

This is a quick diagram that I came up with just to explain how a developer probably thinks about CI/CD and CT. As you can see, when the developer makes any changes to the code and pushes them to a GitHub repository, for example, there has to be a framework or tool which detects that a new commit has been made. As soon as it detects the commit, unit test cases must run automatically, a container image must be created and pushed to a container image repository, and the branch for the target environment, which can be development, staging, or production, has to be ready. After the branch is ready to be pushed to a particular environment, there must be another tool which detects the candidate branch and the changes, and ensures that the downstream work of deploying to the right clusters is done. For example, we can deploy the container images to a Kubernetes cluster. This is the process that I'm talking about; how everyone does it is totally dependent on the rules and regulations of the industry.

Benefits of MLOps

So I would like to conclude the three topics that I wanted to take everyone through with this final point, which is the benefits of MLOps. It enables scalability and management. If we have one model or two models, yes, I can probably follow a manual process: trigger all the changes and manually deploy to production. But if I have a thousand models, then I need a process; I need an automated pipeline around it. So the benefits of MLOps boil down to these pointers: it enables model scalability and management, reusability of ML pipelines for audit and regulatory requirements, effortless CI/CD, and data lineage. Data lineage is extremely important for both analytics and machine learning teams. Also very important is maintaining health and governance post deployment. And finally, MLOps is useful for three verticals: people, process, and technology. It is not just for one of them; this process is for each and every piece of an organization.

Retraining pipeline example

Now that I've spoken a lot about the process, giving some sort of example will actually help each one of you understand how to map all these ideologies and philosophies into actual action. For that, I have picked up a retraining pipeline from the blog of the company that I work for, where this is how we have imagined and attained a retraining pipeline.

So let me start with the box at the bottom, marked in red, where Airflow is written. To keep the models fresh, we have to build retraining pipelines, right? And the retraining pipeline can run weekly, daily, whatever, based on the industry. If I were Twitter or Facebook, and I had to refresh my timelines, I would probably run my retraining pipelines very often, probably many times a day. Whereas in the e-commerce industry, where I know that the products change only once a week and inventory probably comes in only once in 10 days, I would build a retraining pipeline to run only once in 10 days. And if I'm in the banking domain, where I know that none of my products are going to change for a long time, maybe I don't need to retrain the model for six months. So the frequency is something pertaining to your own industry.

So what is the need for retraining? There are two needs. One is for the model to learn new patterns in data, patterns with respect to new items or new users. The other is to solve the cold start problem of new items and new users. The cold start problem is a very common term in machine learning: when someone is new to a platform, how do we start giving them recommendations when there is no previous data? These are the problems that continuous retraining actually helps with.

So let's move on to the first box on the bottom left, which is how the retraining pipeline actually starts. We pull the last three months of data, based on our particular use case. We split it into a training set and a test set; it can be a 70-30 split, an 80-20 split, whatever you want. We train a model using the training set. Then, after we have trained the model, there is this entire component called MLflow, where we can save the model that we have tuned. So what exactly is MLflow? MLflow is an open source platform for managing the end-to-end machine learning lifecycle. There are many components in MLflow, like tracking, projects, models, and the registry. All of these help in managing the ML lifecycle and in deploying as well. It's up to us which components we choose: either we choose all the components and stick only to MLflow for model serving, or, if we detect any problems with MLflow while using the tool, we use only the components which map to our use case.

Okay, so the new models are saved to MLflow. There has to be some sort of tool or framework, here it is Jenkins, which detects that a new model has been trained. Then we compare, against the test data, how the new and the old model are performing. For that we have included Jenkins. This is again an open source framework which helps in continuous integration and continuous delivery. Jenkins detects the entry of a new model in MLflow, validates the model, and assigns a production tag to the best performing model; it can be the previous model or the latest model that we have trained. Then it triggers a job which pulls the model that we have decided is the best, converts it into an executable, builds a Docker image, and pushes it to an image repository.
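The comparison Jenkins performs can be sketched in a few lines; the models, data, and the tagging step here are toy stand-ins for the registry operations:

```python
# Score the current production model and the newly trained candidate on
# the same held-out test set, then promote whichever wins.

def accuracy(model, test_set):
    """Fraction of test examples the model predicts exactly right."""
    hits = sum(1 for x, y in test_set if model(x) == y)
    return hits / len(test_set)

test_set = [(x, 2 * x) for x in range(10)]   # held-out data, true rule y = 2x

def old_model(x):   # current production model, degraded on recent inputs
    return 2 * x if x < 8 else 0

def new_model(x):   # freshly retrained candidate
    return 2 * x

scores = {"old": accuracy(old_model, test_set),
          "new": accuracy(new_model, test_set)}
production = max(scores, key=scores.get)     # winner gets the "production" tag
print(scores)       # {'old': 0.8, 'new': 1.0}
print(production)   # → new
```

Because both models are scored on the same test data, the production tag can move either way: if the candidate had lost, the previous model would simply keep the tag.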

So for code we have GitHub, and for Docker images we need a repository, right? We're using AWS's ECR. Jenkins is the orchestrator or tool which pulls the production model and pushes to the image repository. It also creates a pull request to the GitHub repository to change the production tags, so that when serving happens, the new model is hit instead of the older model. The pull request is merged with the help of a reviewer; this is a manual step that we have deliberately kept so that every reviewer is responsible. Once the PR is merged, there is another tool called Flux CD which detects the change in the GitHub repository and triggers a new Kubernetes deployment of the entire model, the entire serving layer, etc.

So the components that we're using to achieve retraining for the entire MLOps flow are MLflow, Jenkins, Kubernetes, and Flux CD. And if it's an insurance firm or the banking industry and deployment to Kubernetes is not allowed, it can be a homegrown cluster, right? So that entire box would be replaced by the cluster they want to deploy to. Likewise, if MLflow, being open source, is not allowed, it can probably be replaced by Databricks's managed MLflow or AWS SageMaker. So these are the components which can be moved around.

But how it is done essentially remains the same. It's just that, based on the technical skills or the industry, we can keep changing these tools. The crux of it remains the same. The entire retraining pipeline is orchestrated by another tool called Airflow, which ensures that all the jobs run sequentially, or with some sort of dependencies.

Model serving pipeline

So let me discuss one more architecture before we conclude the presentation and open up for Q&A, which is the model serving pipeline. Once we have completed the exploration and experimentation phases of our machine learning models, it is time to make them live, and it is time to build a web service to serve our models in production. The architecture that you see in front of you is specifically for real-time predictions only, not batch.

So let me draw your attention to the right side, where there is a mobile phone which says client, and then a prediction request, a FastAPI service, and a BentoML server. As and when any user opens their mobile phone and logs into that particular app, a prediction request is sent to a FastAPI service, which I'll explain, and the request is navigated or routed to another server, which gives the predictions. Hence the arrows go both ways, and the response is given back to the client. The request can be for Netflix recommendations, Amazon recommendations, or any sort of data that is shown on the app.

So we are serving the model as a microservice, as this is the norm for engineering teams now, because the service can be consumed by other teams. If I'm a machine learning team, I can keep one service ready, and as and when new engineering teams are formed or new features are built on the app, all of them can consume the same data. Typically, we load the model into the memory of the FastAPI service during startup and use the predict function to return predictions every time a request comes in. By itself this does not scale very well, so, again as an example, one can use Kubernetes to scale horizontally and vertically by using multiple pods.

Several model serving frameworks exist. The one being used here is the BentoML server, but there are alternatives in this pattern, such as TensorFlow Serving. One can decide based on their use case. The reason we have settled on BentoML is that it gives a higher throughput compared to the other serving services. It has a feature called micro-batching, which essentially groups individual requests into batches and serves predictions on them. Throughput is important for us as an e-commerce firm, but if it's not for you, then one can go and choose the other products as well.

So the key components of this diagram are on the right side: model serving, which is done by BentoML, and the model client service, which is done by FastAPI. FastAPI is a Pythonic framework and set of tools that enables developers to use a REST interface to call commonly used functions, and to implement machine learning applications as well. The entire left side, if you see, is the retraining pipeline from the previous slide. Essentially, all of these pieces come together and form different architecture diagrams or pipelines. This is what I wanted to draw everyone's attention to: how the previous pipeline amalgamates with this one. The left-side flow runs probably once a week, but the right-side flow happens in real time each time the client logs into the application.

All right. Let me just conclude by saying that CI, CD, and CT, which are integration, delivery, and training, are done by Jenkins, GitHub Actions, and Flux CD: different open source tools and frameworks. Orchestration is done using Airflow, which is very popular; most companies do it this way. Data sources are typically data lakes; it can be AWS S3 or Cloudera, whichever platform your company has onboarded. And model tracking and the registry are done using MLflow or an equivalent. This is where I come to the end of my presentation.


And I have also set up some additional FAQ slides based on the questions. I'll just take everyone through certain slides that can probably be explained well with the help of diagrams. So that's it for now. Thank you, everyone. Thank you.