If you are familiar with Kubeflow, you know KFServing as the platform’s model server and inference engine. In September last year, the KFServing project went through a transformation to become KServe.

Apart from the name change, KServe is now an independent component that has graduated from the Kubeflow project. The separation allows KServe to evolve as a separate, cloud native inference engine deployed as a standalone model server. It will continue to have tight integration with Kubeflow, but the two will be treated and maintained as independent open source projects.

For a brief overview of the model server, refer to one of my previous articles at The New Stack.

KServe is collaboratively developed by Google, IBM, Bloomberg, Nvidia, and Seldon as an open source, cloud native model server for Kubernetes. The most recent version, 0.8, squarely focused on transforming the model server into a standalone component with changes to the taxonomy and nomenclature.

Let’s understand the core capabilities of KServe.

A model server is to machine learning models what an application is to code binaries. Both provide the runtime and execution context to the deployments. KServe, as a model server, provides the foundation for serving machine learning and deep learning models at scale.

KServe can be deployed as a traditional Kubernetes deployment or as a serverless deployment with support for scale-to-zero. In serverless mode, it relies on Knative Serving, which comes with automatic scale-up and scale-down capabilities. Istio is used as an ingress to expose the service endpoints to API consumers. The combination of Istio and Knative Serving enables exciting scenarios such as blue/green and canary deployments of models.
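As a rough illustration of a canary rollout, the sketch below builds an InferenceService manifest as a Python dictionary and sets the `canaryTrafficPercent` field of the v1beta1 predictor spec. The service name, the bucket path, and the helper function are made up for this example, so treat it as a starting point rather than a tested deployment.

```python
import json


def canary_inference_service(name, storage_uri, canary_percent):
    """Build an InferenceService manifest that sends part of the traffic to the newest revision."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                # Share of requests routed to the latest revision; the
                # remainder stays on the previously rolled-out revision.
                "canaryTrafficPercent": canary_percent,
                # Hypothetical model location, for illustration only.
                "tensorflow": {"storageUri": storage_uri},
            }
        },
    }


canary = canary_inference_service(
    "flowers", "gs://example-bucket/models/flowers-v2", 10
)
print(json.dumps(canary, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, which accepts JSON as well as YAML manifests.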

KServe architecture diagram

RawDeployment mode, which lets you use KServe without Knative Serving, supports traditional scaling techniques such as the Horizontal Pod Autoscaler (HPA) but lacks support for scale-to-zero.
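The deployment mode is chosen per inference service through the `serving.kserve.io/deploymentMode` annotation. Here is a minimal sketch that patches a manifest held as a Python dictionary; the helper name and the sample manifest are assumptions made for the example.

```python
import copy

# Annotation KServe reads to decide how an InferenceService is deployed.
DEPLOYMENT_MODE_ANNOTATION = "serving.kserve.io/deploymentMode"


def use_raw_deployment(manifest):
    """Return a copy of the manifest that opts into RawDeployment mode."""
    patched = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    metadata = patched.setdefault("metadata", {})
    annotations = metadata.setdefault("annotations", {})
    annotations[DEPLOYMENT_MODE_ANNOTATION] = "RawDeployment"
    return patched


# Hypothetical base manifest, for illustration only.
base = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "flowers"},
    "spec": {
        "predictor": {
            "tensorflow": {"storageUri": "gs://example-bucket/models/flowers"}
        }
    },
}

raw = use_raw_deployment(base)
print(raw["metadata"]["annotations"])
```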

KServe Architecture

The KServe model server has a control plane and a data plane. The control plane manages and reconciles the custom resources responsible for inference. In serverless mode, it coordinates with Knative resources to manage autoscaling.

KServe control plane

At the heart of the KServe control plane is the KServe Controller, which manages the lifecycle of an inference service. It is responsible for creating the service, the ingress resources, the model server container, and the model agent container, which handles request/response logging, batching, and pulling models from the model store. The model store is a repository of models registered with the model server. It is typically an object storage service such as Amazon S3, Google Cloud Storage, Azure Storage, or MinIO.

The data plane manages the request/response cycle targeting a specific model. It has predictor, transformer, and explainer components.

An AI application sends a REST or gRPC request to the predictor endpoint. The predictor acts as an inference pipeline that invokes the transformer component, which can perform pre-processing of the inbound data (request) and post-processing of the outbound data (response). Optionally, there may be an explainer component to bring AI explainability to the hosted models. KServe encourages the use of the V2 protocol, which is interoperable and extensible.
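To give a feel for the V2 protocol over REST, the sketch below builds the path and JSON body of an inference request for a two-dimensional batch. The model name, the input tensor name, and the sample values are placeholders I picked for the example; a real model defines its own input names and data types.

```python
import json


def build_v2_infer_request(model_name, input_name, rows, datatype="FP32"):
    """Return (path, body) for a V2 REST inference call on a 2-D batch."""
    if not rows or any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("rows must be a non-empty, rectangular list of lists")
    path = f"/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": input_name,
                "shape": [len(rows), len(rows[0])],
                "datatype": datatype,
                # Tensor contents flattened in row-major order.
                "data": [value for row in rows for value in row],
            }
        ]
    }
    return path, body


# Hypothetical model and tensor names, for illustration only.
path, body = build_v2_infer_request(
    "iris", "input-0", [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]
)
print(path)
print(json.dumps(body, indent=2))
```

The body would be sent as an HTTP POST to the service host followed by the printed path.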

The data plane also has endpoints to check the readiness and health of models. It also exposes APIs for retrieving model metadata.
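In the V2 protocol, these checks map to a handful of GET endpoints. The helper below only assembles the URLs; the host name and model name are placeholders, and the function itself is my own convenience wrapper rather than part of any KServe client.

```python
def v2_status_endpoints(base_url, model_name):
    """Map each V2 health or metadata check to the URL that serves it."""
    base = base_url.rstrip("/")
    return {
        "server_live": f"{base}/v2/health/live",
        "server_ready": f"{base}/v2/health/ready",
        "model_ready": f"{base}/v2/models/{model_name}/ready",
        "model_metadata": f"{base}/v2/models/{model_name}",
    }


# Hypothetical host and model name, for illustration only.
endpoints = v2_status_endpoints("http://iris.example.com/", "iris")
for check, url in endpoints.items():
    print(f"{check}: GET {url}")
```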

Supported Frameworks and Runtimes

KServe supports a wide range of machine learning and deep learning frameworks. Deep learning frameworks and runtimes work with existing serving infrastructures such as TensorFlow Serving, TorchServe, and Triton Inference Server. KServe can host TensorFlow, ONNX, PyTorch, and TensorRT runtimes through Triton.

For classical machine learning models based on SKLearn, XGBoost, Spark MLlib, and LightGBM, KServe relies on Seldon’s MLServer.

The extensible framework of KServe makes it possible to plug in any runtime that adheres to the V2 inference protocol.
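With the newer predictor spec, a model declares its format and protocol version and leaves the choice of a matching runtime to KServe. The sketch below shows what such a manifest might look like as a Python dictionary; the service name and storage location are invented, and the set of fields a given KServe release accepts should be checked against its API reference.

```python
import json


def model_format_service(name, model_format, storage_uri, protocol_version="v2"):
    """Build an InferenceService manifest that names a model format instead of a runtime."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "model": {
                    # KServe looks for a serving runtime that supports this format.
                    "modelFormat": {"name": model_format},
                    "protocolVersion": protocol_version,
                    # Hypothetical model location, for illustration only.
                    "storageUri": storage_uri,
                }
            }
        },
    }


service = model_format_service("iris", "sklearn", "s3://example-bucket/models/iris")
print(json.dumps(service, indent=2))
```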

Multimodel Serving with ModelMesh

KServe deploys one model per inference service, limiting the platform’s scalability to the available CPUs and GPUs. This limitation becomes obvious when running inference on GPUs, which are expensive and scarce compute resources.

With multimodel serving, we can overcome the limitations of the infrastructure: compute resources, maximum pods, and maximum IP addresses.

ModelMesh Serving, developed by IBM, is a Kubernetes-based platform for real-time serving of ML/DL models, optimized for high volume/density use cases. Similar to an operating system that manages processes to optimally utilize the available resources, ModelMesh optimizes the deployed models to run efficiently within the cluster.

ModelMesh serving diagram

Through intelligent management of in-memory model data across clusters of deployed pods, and the usage of those models over time, the system maximizes the use of available cluster resources.
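The general idea of keeping only recently used models in memory can be sketched as a small least-recently-used cache. This is a toy, single-process illustration of the concept and not ModelMesh's actual implementation; the class, the loader function, and the model IDs are all invented for the example.

```python
from collections import OrderedDict


class ModelCache:
    """Keeps at most `capacity` models loaded, evicting the least recently used."""

    def __init__(self, capacity, loader):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.loader = loader  # callable that loads a model given its ID
        self._loaded = OrderedDict()  # ordered from least to most recently used

    def get(self, model_id):
        """Return the model, loading it and evicting an older one if needed."""
        if model_id in self._loaded:
            self._loaded.move_to_end(model_id)  # mark as most recently used
            return self._loaded[model_id]
        model = self.loader(model_id)
        self._loaded[model_id] = model
        if len(self._loaded) > self.capacity:
            self._loaded.popitem(last=False)  # drop the least recently used
        return model

    def loaded_ids(self):
        return list(self._loaded)


# Stand-in loader: a real one would pull weights from the model store.
cache = ModelCache(capacity=2, loader=lambda model_id: f"<weights for {model_id}>")
for model_id in ["a", "b", "a", "c"]:
    cache.get(model_id)
print(cache.loaded_ids())  # ['a', 'c']: "b" was the least recently used
```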

ModelMesh Serving is based on the KServe V2 data plane API for inferencing, which makes it possible to deploy it as a runtime similar to NVIDIA Triton Inference Server. When a request hits the KServe data plane, it is simply delegated to ModelMesh Serving.

The integration of ModelMesh Serving with KServe is currently in Alpha. As both projects mature, there will be tighter integration, making it possible to mix and match the features and capabilities of both platforms.

With model serving becoming the core building block of MLOps, open source projects such as KServe become important. The extensibility of KServe to use existing and upcoming runtimes makes it a unique model serving platform.

In upcoming articles, I will walk you through the steps of deploying KServe on a GPU-based Kubernetes cluster to perform inference on a TensorFlow model. Stay tuned.

Feature Image by Alexas_Fotos from Pixabay.