Issue 3

Model Serving

Intro

After training, what do we do with our models?

Models alone don’t have much value — it’s all in how you use them. Whether that’s to drive decisions within your business, or to provide new features for your customers, the role of a serving framework is to bring your models to life.

With a model serving framework, you can:

  • Interact with a model via an API. Because of this, anything that talks to your model can do so without knowing any internal details such as which tools were used to train it or what language it’s written in.
  • Deploy the model in the cloud alongside other components of your applications.
  • Scale the model easily to meet user demand.

For a concrete example, suppose you run an online store and you want each of your customers to see personalised product recommendations. There are lots of ways to train a model for this task, but assuming you’ve already done that part, the next challenge is getting the website talking to it.

Even though the model might be complex, a model serving framework will hide that complexity, leaving us with a simple API so that, whenever we want a customer to see recommendations, all we need to do is query that API.
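To give a feel for it, here’s a minimal sketch of what that query might look like in Python. The endpoint URL, request payload, and response field are hypothetical stand-ins rather than the API of any particular serving framework:

    # Minimal sketch: asking a hypothetical recommendation service for
    # suggestions to show a customer. All names here are placeholders.
    import requests

    def get_recommendations(customer_id: str, n: int = 5) -> list[str]:
        response = requests.post(
            "http://recommender.internal/recommend",  # hypothetical endpoint
            json={"customer_id": customer_id, "num_items": n},
            timeout=2,
        )
        response.raise_for_status()
        return response.json()["product_ids"]  # hypothetical response field

    # e.g. render these on the product page for customer "c-123"
    products = get_recommendations("c-123")

The website never needs to know how the recommendations are produced; swapping in a better model later doesn’t change this code at all.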

Batch vs real-time

Sometimes, you want a model to give instant results. This is the case in the product recommendation example, where we want to serve relevant suggestions to a customer while they browse a website.

In other cases, results don’t need to be instant, and the model is accessed on a schedule. Imagine we have some products whose price gets updated every week using a model that’s trained to price things according to seasonal trends.

Many model serving frameworks are suited to both real-time and batch usage, but it’s important to know which approach you need before implementing model serving.
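As a sketch of the batch pattern, a weekly repricing job might look something like the snippet below. It calls the same kind of model endpoint as the real-time case, but is triggered on a schedule (for example from cron or a workflow orchestrator) rather than per request. The endpoint, field names, and helper functions are all hypothetical:

    # Minimal sketch of batch scoring: reprice every product on a schedule.
    # Endpoint, payload, and helpers are placeholders for illustration.
    import requests

    def load_product_ids() -> list[str]:
        # Placeholder: in practice, read these from your product database.
        return ["sku-001", "sku-002", "sku-003"]

    def save_price(product_id: str, price: float) -> None:
        # Placeholder: in practice, write the new price back to the database.
        print(f"{product_id}: {price:.2f}")

    def reprice_all_products() -> None:
        for product_id in load_product_ids():
            response = requests.post(
                "http://pricer.internal/predict",  # hypothetical endpoint
                json={"product_id": product_id},
                timeout=10,
            )
            response.raise_for_status()
            save_price(product_id, response.json()["price"])

    # Triggered weekly by a scheduler, e.g. cron: 0 3 * * MON python reprice.py
    if __name__ == "__main__":
        reprice_all_products()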

Do I need it?

While we all need to interact with our models in some way, a model serving framework isn’t the only way to do this.

The strength of model serving is that it can hide complex models behind simple APIs, making this approach a perfect fit for any application that runs in the cloud, including web applications.

But not everything is cloud-based; edge and IoT applications come to mind. Take, for example, a smart camera that uses a model to detect faces. In this case the model needs to run directly on the camera’s hardware, because streaming the video to a remote server would simply be too slow.

What are the options?

There is a vast range of open source model serving frameworks to choose from. To narrow it down a little, it helps to consider a few factors:

  • Machine learning library support. Any model will have been trained using an ML library such as TensorFlow, PyTorch, or SKLearn. Some serving tools support multiple ML libraries, while others might support only TensorFlow, for example.
  • How the model is packaged. A typical model is made up of the raw model assets and a bunch of code dependencies. The serving tools in this guide all work by packaging model + dependencies into a Docker container. Docker is the industry standard way to package, distribute and deploy software to modern infrastructure.
  • Where the model runs. Some serving frameworks simply give you a container that you can run anywhere that supports Docker. Others are built on top of Kubernetes, which is the most popular open source solution for automating the deployment, scaling and management of containers.

With these in mind, let’s look at some options.

Seldon Core

Seldon Core is a mature tool built for scalable, reliable model serving, and it backs this up with a rich set of features out of the box, including advanced metrics, logging, explainability and A/B testing.

One particularly important feature of Seldon Core is metrics and monitoring. Each model server exposes metrics that can be integrated into your software monitoring stack, and it can be combined with Alibi-Detect, another open source product from Seldon, to do model-specific monitoring of things like drift and bias.

Pros:

  • Powered by Kubernetes. Kubernetes is the de-facto standard in container orchestration, and Seldon leverages its capabilities to provide scalable, reliable serving infrastructure with low management overhead.
  • Support for many ML libraries. Including PyTorch, TensorFlow, SKLearn.
  • No extra code needed. By default, you can deploy a model without writing any code; you just tell Seldon where to find your model assets and let it do the rest (see the sketch after this list).
  • Model monitoring. Includes features to enable monitoring of models.

Cons:

  • Kubernetes isn’t always the best option. With Seldon you have no choice but to use Kubernetes. But if, for example, you’re only deploying one or two models, then running Kubernetes might be too costly for your needs.
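To make the no-extra-code point concrete, here is a rough sketch of what a deployment looks like. Normally you would write this as a YAML manifest and apply it with kubectl; the Python version below does the same thing through the Kubernetes client. It assumes Seldon Core v1’s SeldonDeployment resource and the prepackaged SKLearn server, and the names, namespace, and bucket path are placeholders, so check the Seldon docs for the exact fields in your version:

    # Rough sketch: registering a SeldonDeployment that serves a scikit-learn
    # model straight from cloud storage, with no custom serving code.
    # Field names follow Seldon Core v1 and may differ between releases.
    from kubernetes import client, config

    seldon_deployment = {
        "apiVersion": "machinelearning.seldon.io/v1",
        "kind": "SeldonDeployment",
        "metadata": {"name": "recommender", "namespace": "models"},
        "spec": {
            "predictors": [
                {
                    "name": "default",
                    "replicas": 1,
                    "graph": {
                        "name": "classifier",
                        "implementation": "SKLEARN_SERVER",          # prepackaged server
                        "modelUri": "gs://my-bucket/recommender/v1",  # placeholder path
                    },
                }
            ]
        },
    }

    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="machinelearning.seldon.io",
        version="v1",
        plural="seldondeployments",
        namespace="models",
        body=seldon_deployment,
    )

Once applied, Seldon wraps the model in its standard REST and gRPC endpoints, so querying it is the same kind of HTTP call as in the earlier recommendation sketch.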

BentoML

While Seldon Core specialises in model deployment on Kubernetes, BentoML isn’t picky about where models are deployed.

The simplest way to deploy a BentoML model is as a Docker container, but even that isn’t a requirement: you can run the BentoML server directly on your own infrastructure.
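As a rough idea of what that looks like in code, here is a minimal service definition in the style of BentoML 1.x (newer releases use a different service syntax, so treat this as a sketch). It assumes a scikit-learn model has already been saved to the local BentoML store under the name "recommender"; all names are placeholders:

    # Minimal sketch of a BentoML 1.x service wrapping a saved scikit-learn
    # model. Assumes the model was stored earlier with
    # bentoml.sklearn.save_model("recommender", model).
    import numpy as np
    import bentoml
    from bentoml.io import NumpyNdarray

    runner = bentoml.sklearn.get("recommender:latest").to_runner()
    svc = bentoml.Service("recommender_service", runners=[runner])

    @svc.api(input=NumpyNdarray(), output=NumpyNdarray())
    def predict(features: np.ndarray) -> np.ndarray:
        # Delegate inference to the runner, which BentoML can scale separately.
        return runner.predict.run(features)

From there, BentoML’s CLI can serve this locally for development, or package it together with its dependencies into a Docker image for deployment.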

Pros

  • Support for many ML libraries. Including PyTorch, TensorFlow, SKLearn.
  • Deploy models anywhere. Including Docker, Kubernetes, or cloud platforms like AWS, GCP and Azure.

Cons

  • No orchestration built in. Orchestration is about automating deployment, management, scaling and networking for containers. BentoML leaves that up to you: depending on where you deploy, orchestration support will differ. By contrast, Seldon Core gets this for free through Kubernetes.

Kubeflow and KServe

Kubeflow is one of the oldest open source MLOps frameworks. It covers a lot of ground, including pipelines, training, and serving. As you might guess from the name, it runs on top of Kubernetes.

If you already use Kubeflow, it’s easiest to stick with it for model serving. More recently, the model serving part of Kubeflow was spun out into an independent project called KServe, which you can use without the rest of Kubeflow.

On the face of it, KServe promises similar things to Seldon Core: run any model on Kubernetes. But compared to Seldon, KServe is a lot more lightweight, making it easier to set up and run, at the cost of fewer features.
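Once a model is deployed, calling it is again a plain HTTP request. As a rough sketch, KServe’s V1 inference protocol (an “instances in, predictions out” format inherited from TensorFlow Serving) looks roughly like this; the host, model name, and feature values are placeholders, and the exact URL depends on how your cluster exposes the service:

    # Rough sketch: querying a model served by KServe via its V1 protocol.
    # Hostname, model name, and input features are placeholders.
    import requests

    response = requests.post(
        "http://models.example.com/v1/models/recommender:predict",
        json={"instances": [[0.3, 1.2, 5.0]]},  # one row of input features
        timeout=2,
    )
    response.raise_for_status()
    predictions = response.json()["predictions"]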

Pros

  • Support for many ML libraries. Including PyTorch, TensorFlow, SKLearn.
  • More than just serving. If you use Kubeflow in its entirety, you can train and deploy models all in the same system.

Cons

  • Legacy. MLOps tooling has moved on a lot in the past few years, and in many ways KServe and Kubeflow are falling behind the curve.
  • Missing advanced monitoring. Compared with Seldon Core, this is an area where Kubeflow is particularly lacking.