LLM Model Serving Runtime
The LLM model serving runtimes aim at exposing LLM models as OpenAI-compatible APIs. For this purpose, several different runtimes are available; depending on the specific scenario requirements, the user may choose one approach or another.
- KubeAI Text Serving (`kubeai-text`): a runtime that relies on the KubeAI operator to expose the model. KubeAI serving deploys the models, while their serving is performed by KubeAI through a single channel. The runtime relies on different engines, including vLLM, Ollama, and Infinity, for different tasks. KubeAI also supports serving multiple LoRA adapters, autoscaling, and many other useful options for production-ready environments.
- vLLM (`vllmserve-text`, `vllmserve-speech`, and `vllmserve-pooling`): a runtime that exposes LLM models using the vLLM engine. This is a custom implementation of the OpenAI-compatible API based on the vLLM engine. Depending on the specific runtime version, the model supports the OpenAI generative AI APIs (completions, chat completions), OpenAI audio processing (audio transcription, audio translation), and a series of other OpenAI-compatible functions (like embeddings, ranking, tokenization, and classification).
- HuggingFace Serving (`huggingfaceserve`): a runtime that exposes standalone LLM models using a KServe-based implementation (deprecated). In a nutshell, this runtime allows for exposing LLMs using the vLLM engine. The engine supports, in particular, completions and chat completions APIs compatible with the OpenAI protocol, as well as a series of other functions (like embeddings, fill mask, and classification) using the Open Inference Protocol. See the corresponding KServe documentation for details.
KubeAI Text runtime
The KubeAI Text runtime relies on the KubeAI platform for model serving. In this case, for each serve action performed with this runtime a corresponding deployment is created, while no dedicated service is exposed: the models are served by the KubeAI service directly.
Using a KubeAI deployment has several advantages, including:
- the possibility to use multiple backend engines optimized for different goals. For example, Ollama is best suited for testing models without a GPU, while vLLM is better suited for GPU-based environments.
- the possibility to serve multiple models simultaneously through LoRA adapters. In this case, a single base model plus a list of different fine-tuned adapters are served on the same resources, while being made available independently.
- the possibility to configure more efficient resource management with autoscaling and scale profiles.
- the use of configurable prefix caching.
- full OpenAI compatibility (completions, chat completions, embeddings).
The specification of the KubeAI Text runtime amounts to defining:
- the base model URL (from S3 storage or from the HuggingFace catalog)
- the list of adapters (from S3 storage or from the HuggingFace catalog)
- the name of the model to expose
- the model task or feature: text generation (default) or embedding
- the backend engine: vLLM, Ollama, or Infinity (for embeddings only)
- an optional base image for serving
The serve action allows for deploying the model and adapters; a set of extra properties may be configured, including:
- inference server-specific arguments
- load balancing strategy and properties
- prefix cache length
- scaling configuration (min/max/default replicas, scale delays, and request targets)
- resource configuration (e.g., run profile), environments, and secrets (e.g., a reference to `HF_TOKEN` if needed for accessing HuggingFace resources)
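As an illustration, the elements above might be combined into a specification like the following. This is only a sketch: the field names and values are hypothetical, so consult the SDK runtime documentation for the actual schema.

```python
# Hypothetical kubeai-text specification; field names are illustrative
# only -- check the platform SDK reference for the real schema.
kubeai_text_spec = {
    "url": "hf://Qwen/Qwen2.5-0.5B",      # base model from the HuggingFace catalog
    "model_name": "qwen-chat",            # name under which the model is exposed
    "task": "text-generation",            # text generation (default) or embedding
    "engine": "vllm",                     # backend engine: vLLM, Ollama, or Infinity
    "adapters": [                         # LoRA adapters served on the same base model
        {"name": "my-adapter", "url": "s3://models/my-adapter"},
    ],
    # serve-time extras: scaling, prefix caching, engine arguments
    "scaling": {"min_replicas": 0, "max_replicas": 3},
    "prefix_cache_length": 256,
}

# Ollama suits CPU-only testing, while vLLM targets GPU-based environments.
assert kubeai_text_spec["engine"] == "vllm"
```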
Using GPU for model serving
Please note that for large models and the text generation task, the usage of the corresponding GPU-based profiles may be required.
When deployed, the corresponding serve run specification contains extra information for using the LLM model. This includes:
- the base URL of the KubeAI environment to be used by the clients
- the name of the deployed model and adapters to be used in the OpenAI requests
- LLM metadata: feature information, engine, base model, etc.
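Since the model is exposed through an OpenAI-compatible API, any OpenAI client can consume it once the base URL and model name are read from the run status. A minimal sketch using only the standard library (the base URL and model name below are placeholders, not real endpoints):

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,  # deployed model (or adapter) name from the run status
        "messages": [{"role": "user", "content": prompt}],
    }

def chat_completion(base_url: str, model: str, prompt: str) -> dict:
    """POST the request to the serving endpoint and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (placeholder values -- take the real ones from the serve run status):
# chat_completion("http://kubeai.example/openai", "qwen-chat", "Hello!")
```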
vLLM Serving runtime
The specification of the vLLM runtime functions consists of the following elements:
- `url` defining the URL of the model, either from the platform storage or from the HuggingFace catalog (e.g., `hf://Qwen/Qwen2.5-0.5B`)
- `model_name` defining the name of the exposed model
- `image` defining the base image to use for serving the model, if different from the one used by the platform by default
- `adapters` defining the list of LoRA adapters (with `name` and `url`) to be used for serving the model
The specification of the vLLM run additionally allows for defining the following elements:
- `url` defining the URL of the model to serve, either from the platform storage or from the HuggingFace catalog (e.g., `hf://Qwen/Qwen2.5-0.5B`)
- `args` defining the list of arguments to be passed to the vLLM engine
- `enable_telemetry` defining whether telemetry should be enabled
- `use_cpu_image` defining whether the CPU-only image should be used for serving the model
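For instance, the `args` element might carry vLLM engine flags such as `--max-model-len` or `--dtype` (both are standard vLLM server options). A hedged sketch of a run specification — the surrounding structure is illustrative, only the element names listed above come from this document:

```python
# Hypothetical vllmserve-text run specification; the url/args/
# enable_telemetry/use_cpu_image keys follow the elements listed above,
# the values are illustrative.
vllm_run_spec = {
    "url": "hf://Qwen/Qwen2.5-0.5B",
    "args": [
        "--max-model-len", "4096",   # vLLM engine flag: cap the context length
        "--dtype", "float16",        # vLLM engine flag: weight/activation dtype
    ],
    "enable_telemetry": False,
    "use_cpu_image": True,           # CPU-only image, e.g. for testing without a GPU
}

assert "--max-model-len" in vllm_run_spec["args"]
```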
Once deployed, a model is exposed with the corresponding Kubernetes service. The service endpoint is available as part of the status/service data of the run.
Huggingface Serve runtime (deprecated)
The specification of the HuggingfaceServe runtime functions consists of the following elements:
- `path` defining the URL of the model, either from the platform storage or from the HuggingFace catalog (e.g., `huggingface://Qwen/Qwen2.5-0.5B`)
- `model` defining the name of the exposed model
- an optional base image to use for serving the model, if different from the one used by the platform by default
The runtime supports the serve action, which may specify further deployment details, including:
- backend engine type (vLLM or a custom KServe implementation called "huggingface")
- inference task (e.g., `sequence_classification`, `fill_mask`, `text_generation`, `text_embedding`, etc.)
- specific parameters referring to the context length, data types, logging properties, tokenizer revision, engine args, etc.
- resource configuration (e.g., run profile), environments, and secrets (e.g., a reference to `HF_TOKEN` if needed for accessing HuggingFace resources)
Once deployed, a model is exposed with the corresponding Kubernetes service. The service endpoint is available as part of the status/service data of the run.
Management with SDK
Check the SDK runtime documentation for more information.