When you’re building a machine-learning-powered product, figuring out how to bridge the gap between your models and everything else is pivotal. For instance, maybe you’ve got a great recommendations model, but until you can get those recommendations in front of customers, that model isn’t much use.
This is where model serving fits in. In this article, we’ll look at how to serve models using Seldon Core, an open source serving framework built for speed and scale, capable of running thousands of models at once. I’ll discuss some of the things that make Seldon unique in this space, along with reasons for and against using it on your project.
This is the first part of a 3-part series on Seldon Core. In addition to the basics of serving, in later editions we’ll get into monitoring Seldon models using Alibi Detect, as well as integrating Seldon with a ZenML pipeline.
What is model serving?
As a model serving framework, Seldon Core is in good company, sitting among dozens of different competing frameworks for serving models (see our guide to model serving to learn more about them).
There are three main things we want to do with model serving:
- Interact with your model via an API so that other components of the product can interact with the model without needing to know its internal details.
- Deploy the model to the cloud (or elsewhere, like edge/IoT environments).
- Scale the model easily to meet user demand.
As you can imagine, each framework takes different approaches to these problems. Let’s see how Seldon works on all three of these points.
About Seldon Core
Seldon Core supports models from a wide range of ML frameworks, including TensorFlow and PyTorch. It also works across multiple implementation languages, supporting R, Julia, C++ and Java as well as Python. That’s something we don’t see in a lot of competing serving frameworks.
It works by packaging your model into a Docker container. Seldon provide a set of pre-built Docker images, and many real-world models can be deployed immediately by using one of these, without requiring you to write any extra code.
Seldon runs on top of Kubernetes. If you’re not familiar with it, Kubernetes is the de-facto standard for cloud-based container orchestration, providing a robust and reliable way to run and scale containers. This is one of Seldon’s super powers: by running on top of Kubernetes, Seldon brings these same capabilities in terms of scale and reliability to model serving.
As a consequence, you do need to maintain a Kubernetes cluster in order to use Seldon, but as all the major cloud providers provide Kubernetes as a managed service, this isn’t a massive overhead. And if you need the scale, it’s the best option.
Being Kubernetes-based also means that it's best suited to cloud deployments; it's not a natural fit for edge or IoT model serving.
Setting up Seldon Core
A Kubernetes cluster is a prerequisite to setting up Seldon Core. After that, installing Seldon is pretty straightforward.
Seldon provide instructions to install on Kubernetes using Helm, which is the simplest approach.
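As a sketch of what that looks like, the install boils down to a couple of commands (this assumes Helm 3, and uses `seldon-system` as the namespace, following Seldon's documentation; check the docs for current chart options):

```shell
# Create a namespace for the Seldon operator to live in.
kubectl create namespace seldon-system

# Install the seldon-core-operator chart from Seldon's chart repository.
helm install seldon-core seldon-core-operator \
    --repo https://storage.googleapis.com/seldon-charts \
    --namespace seldon-system
```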
Additionally, it’s good to have the option to run Seldon locally. The benefit of doing so is that you can experiment with models as well as serving configurations locally before deploying anything for real, while still ensuring that your experimental environment matches what’s being used in production.
Seldon’s recommended approach here is to use Kind (which stands for Kubernetes in Docker). For more details see the Seldon documentation for installing locally.
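Spinning up a local cluster with Kind is a one-liner (the cluster name here is arbitrary):

```shell
# Create a local Kubernetes cluster running inside Docker.
kind create cluster --name seldon

# Verify that kubectl is pointing at the new cluster.
kubectl cluster-info --context kind-seldon
```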
How do we serve a model?
Let’s look at an example. Suppose we’ve got a model that’s been trained with SKLearn. Actually, the ML framework doesn’t really matter; it could just as well be PyTorch, TensorFlow, etc.
First we need to host the model assets somewhere that Seldon can access them. This can be a Google Cloud storage bucket, an AWS S3 bucket, and so on. Internally, Seldon uses a tool called rclone to read files from cloud storage locations, so it will work with any of the 40+ platforms that rclone supports.
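Before uploading, the model needs to be saved in the format the pre-built server expects; for Seldon's SKLearn server that's a `model.joblib` file. A minimal sketch (the training code and file name here are illustrative):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a simple classifier on the iris dataset.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Seldon's pre-built SKLearn server looks for a file named model.joblib
# at the configured model URI.
joblib.dump(model, "model.joblib")
```

From here, the file just needs to be copied up to your bucket, for example with `gsutil cp model.joblib gs://your-bucket/iris/`.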
Next we need to configure the model server. The configuration looks like this (the model name and bucket path below follow Seldon's public iris example):<pre><code>apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model
spec:
  name: iris
  predictors:
    - name: default
      replicas: 2
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: gs://seldon-models/sklearn/iris
</code></pre>
That seems like a lot of configuration, so let’s break down the important bits:
Name: every model needs to have a unique name. This way we can manage it after deployment, identify it in application logs, etc.
Predictors: this is where we describe the model itself. There can be multiple predictors, but for our purposes that’s not so important.
A predictor needs to be set up with the following information:
Implementation: what kind of model server is this? In our case, it’s an SKLearn server.
Model URI: where are the model assets? We’re using a Google Cloud bucket here.
Replicas: a model server might need to handle thousands of requests. We can spread the load between multiple servers by setting the number of replicas.
We deploy the server using the Kubernetes command line tool:<pre><code>kubectl apply -f iris_server.yaml
</code></pre>
Just like that, we have a model server, and we can interact with the model using a REST API. Without writing any code, we’ve turned our model into a web service, which is pretty cool. Not only that, but Seldon also provides Swagger-based documentation for your API. You can use this to experiment with and test your API.
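To sketch what calling the API looks like, Seldon Core exposes predictions at a standard path, and the request body carries a batch of feature rows. The host, namespace, and deployment name here are placeholders for your own setup:

```python
import json

# Placeholder values: substitute your ingress host, namespace,
# and the deployment name from your SeldonDeployment manifest.
host = "localhost:8003"
namespace = "default"
deployment = "iris-model"

# Seldon Core's standard REST prediction endpoint.
url = f"http://{host}/seldon/{namespace}/{deployment}/api/v1.0/predictions"

# The request body: a batch of feature rows in Seldon's ndarray format.
payload = {"data": {"ndarray": [[5.1, 3.5, 1.4, 0.2]]}}

print(url)
print(json.dumps(payload))
```

You can send this with any HTTP client (curl, requests, and so on), and the response carries a matching `data.ndarray` of predictions.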
Serving with a custom Docker image
While Seldon’s pre-built Docker images can get you up and running easily, sometimes we need to build our own images. This includes when our model has special dependencies that need to be installed, or when we want to do some pre-processing on the model inputs.
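In the Python case, Seldon's wrapper routes prediction requests to any class exposing a predict method, which is where custom pre-processing would live. A minimal sketch (the class name and echo logic are illustrative; a real model would load its artifacts in the constructor):

```python
# MyModel.py -- served via Seldon's Python wrapper.
class MyModel:
    def __init__(self):
        # Load model artifacts here (e.g. with joblib); for this
        # sketch we just flag that the model is ready.
        self.ready = True

    def predict(self, X, features_names=None):
        # Pre-process inputs and run the model here.
        # This sketch simply echoes the inputs back.
        return X
```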
Seldon uses and recommends a particular way of building Docker images, called source-to-image (s2i). s2i was originally developed by Red Hat to provide a way of generating images directly from source code.
If you’re already familiar with how to build Docker images with a Dockerfile, s2i is just an abstraction on top of this process. You don’t have to use s2i if you prefer to write your own Dockerfile; it’s just the recommended approach.
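With s2i installed, the build is a single command. A sketch, assuming a Python model in the current directory (the builder image tag and output image name here are illustrative; check Seldon's docs for current versions):

```shell
# Build a servable image from the source directory, using Seldon's
# Python builder image as the base.
s2i build . seldonio/seldon-core-s2i-python3:1.14.0 my-model:0.1
```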
Even though this all sounds a bit complicated, it actually isn’t in practice; it just works. Building your own images is something that most teams will inevitably end up doing at some point, and Seldon have put a lot of thought into making this painless.
Is Seldon right for you?
Seldon is a mature model serving framework that is compatible with many different kinds of ML models, and works across multiple programming languages. Its pre-built Docker images make it really easy to get models into production, and because it’s built on Kubernetes, you get all the reliability and scale that Kubernetes is famous for, too.
But not everybody wants to set up and maintain a Kubernetes cluster, and for many people it doesn’t really make sense to take on the complexity and cost of doing so. That’s especially the case if you’re at a very early stage of building your ML-powered product, and you don’t anticipate the need for huge scale in the short term.
With that being said, it’s getting easier and easier to work with Kubernetes. This process started with managed Kubernetes services, which handle provisioning and scaling the cluster for you.
But more recently, Google announced GKE Autopilot, which lets you run workloads on Kubernetes without managing your own cluster, paying only for the CPU and memory your workloads use. Seldon can’t run on it yet, owing to limitations in Autopilot, but if that changes, the barrier to entry becomes much lower, and Seldon will be a good choice even for those early-stage projects.
There are more benefits to Seldon Core beyond the versatility of its model serving. In the next post, we’ll take a look at Alibi Detect, which can be used to monitor Seldon models.
Until then, take a look at our Seldon example repo on GitHub.