Modelserve runtime
The Modelserve runtime allows you to deploy ML models on Kubernetes or locally.
Prerequisites
Python version and libraries:
- python >= 3.9, <3.13
- digitalhub-runtime-modelserve
The package is available on PyPI:
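pip install digitalhub-runtime-modelserve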
HOW TO
The modelserve runtime introduces several function kinds (sklearnserve, mlflowserve, huggingfaceserve, kubeai-text and kubeai-speech) that allow you to serve different ML model flavours, and a task of kind serve.
The usage of the runtime is similar to the others:
- Create a Function object of the desired model kind and execute its run() method.
- The runtime collects (if in remote execution), loads and exposes the model as a service.
- With the run's invoke() method you can call the v2 inference API, specifying the JSON payload you want (passed as keyword arguments).
- You can stop the service with the run's stop() method.
The modelserve runtime launches an MLServer inference server, which is deployed on Kubernetes as a Deployment and exposed as a service.
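For reference, a minimal sketch of this flow (project name, function name and model path are illustrative; get_or_create_project is assumed from the digitalhub SDK):
import digitalhub as dh

project = dh.get_or_create_project("my-project")

# 1. Create the function for the desired model flavour
function = project.new_function(name="sklearn-serve-function",
                                kind="sklearnserve",
                                path="s3://my-bucket/path-to-model/model.pkl")

# 2. Deploy the model as a service
run = function.run(action="serve")

# 3. Call the v2 inference API once the service is ready (see below)
# run.invoke(json={...})

# 4. Stop the service
run.stop()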
Service responsiveness
It may take a while for the service to become ready and be notified to the client. You can use the refresh() method and access the status attribute of the run object: when the service is ready, a service attribute appears in the status.
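For example, you can poll until the service attribute appears (a minimal sketch; the exact layout of the status object may differ):
import time

while True:
    run.refresh()
    if getattr(run.status, "service", None) is not None:
        break
    time.sleep(5)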
Once the service is ready, you can use the run.invoke() method to call the inference server.
The invoke() method accepts requests.request parameters as kwargs. The url parameter is collected from the run object by default; if you need to override it, pass the url parameter explicitly.
Note
If you passed model_name in the function spec and you execute the run in remote execution, you need to pass model_name to the invoke() method, because it is used to identify the model in the inference server: "http://{url-from-k8s}/v2/models/{model_name}/infer".
data = [[...]]  # some array
json = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [x, y],
            "datatype": "FP32",
            "data": data  # data array goes here
        }
    ]
}
run.invoke(json=json)
Function
There are different modelserve function kinds (sklearnserve, mlflowserve, huggingfaceserve, kubeai-text and kubeai-speech), each one representing a different ML model flavour.
Function parameters
A modelserve function has the following spec parameters to pass to the new_function() method:
| Name | Type | Description | Default | Runtime | 
|---|---|---|---|---|
| project | str | Project name. Required only if creating from library, otherwise MUST NOT be set | ||
| name | str | Name that identifies the object | required | |
| kind | str | Function kind | required | |
| uuid | str | ID of the object in form of UUID4 | None | |
| description | str | Description of the object | None | |
| labels | list[str] | List of labels | None | |
| embedded | bool | Flag to determine if object must be embedded in project | True | |
| path | str | Path to the model files | None | |
| model_name | str | Name of the model | None | |
| image | str | Docker image where to serve the model | None | |
| url | str | Model url | None | kubeai-text,kubeai-speech | 
| adapters | list[str] | Adapters | None | kubeai-text,kubeai-speech | 
| features | list[str] | Features | None | kubeai-text | 
| engine | KubeaiEngine | Engine | None | kubeai-text | 
Function kinds
The kind parameter must be one of the following:
- sklearnserve
- mlflowserve
- huggingfaceserve
- kubeai-text
- kubeai-speech
Adapters
Adapters is a list of dictionaries with the following keys:
Features
Features is a list of strings. It accepts the following values:
- TextGeneration
- TextEmbedding
- SpeechToText
Engine
The engine parameter is a KubeaiEngine object representing the serving engine to use for the function. It can be one of the following:
- OLlama
- VLLM
- FasterWhisper
- Infinity
Model path
The model path is the path to the model files. In remote execution, the path is a remote S3 path (for example: s3://my-bucket/path-to-model). In local execution, the path is a local path (for example: ./my-path or my-path). Depending on the kind of modelserve function, the path must follow a specific pattern:
- sklearnserve: s3://my-bucket/path-to-model/model.pkl or ./path-to-model/model.pkl. The remote path is the partition containing the model file; the local path is the model file itself.
- mlflowserve: s3://my-bucket/path-to-model-files or ./path-to-model-files. The remote path is the partition with all the model files; the local path is the folder containing the MLmodel file, according to the MLflow specification.
The model path is not required for kubeai-text and kubeai-speech.
Model url
The model url must follow the pattern:
regexp = (
    r"^(store://([^/]+)/model/huggingface/.*)"
    + r"|"
    + r"^pvc?://.*$"
    + r"|"
    + r"^s3?://.*$"
    + r"|"
    + r"^ollama?://.*$"
    + r"|"
    + r"^hf?://.*$"
)
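For illustration, the following snippet checks a few candidate URLs against the pattern (the URLs themselves are made up):
import re

regexp = (
    r"^(store://([^/]+)/model/huggingface/.*)"
    r"|^pvc?://.*$"
    r"|^s3?://.*$"
    r"|^ollama?://.*$"
    r"|^hf?://.*$"
)

examples = [
    "store://my-project/model/huggingface/my-model",
    "hf://mistralai/Mistral-7B-v0.1",
    "ollama://llama3",
    "s3://my-bucket/models/my-model",
    "pvc://my-volume/models/my-model",
]

for url in examples:
    print(url, bool(re.match(regexp, url)))  # all of these match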
Function example
# Example remote model mlflow
function = project.new_function(name="mlflow-serve-function",
                                kind="mlflowserve",
                                path=model.spec.path + "model")
# Example local model mlflow
function = project.new_function(name="mlflow-serve-function",
                                kind="mlflowserve",
                                path="./my-path/model")
# Example remote model sklearn
function = project.new_function(name="sklearn-serve-function",
                                kind="sklearnserve",
                                path=model.spec.path)
# Example local model sklearn
function = project.new_function(name="sklearn-serve-function",
                                kind="sklearnserve",
                                path="./my-path/model.pkl")
# Example KubeAI text model
function = project.new_function(
    name="kubeai-text-function",
    kind="kubeai-text",
    url="hf://mistralai/Mistral-7B-v0.1",
    features=["TextGeneration"],
    engine="VLLM"
)
# Example KubeAI speech model
function = project.new_function(
    name="kubeai-speech-function",
    kind="kubeai-speech",
    url="hf://openai/whisper-large-v3",
    features=["SpeechToText"],
    engine="FasterWhisper"
)
Task
The modelserve runtime introduces one task of kind serve that allows you to deploy ML models on Kubernetes or locally.
A Task is created with the run() method, so it's not managed directly by the user. The parameters for the task creation are passed directly to the run() method, and may vary depending on the kind of task.
Task parameters
| Name | Type | Description | Default | Runtime | 
|---|---|---|---|---|
| action | str | Task action | required | |
| node_selector | list[dict] | Node selector | None | |
| volumes | list[dict] | List of volumes | None | |
| resources | dict | Resources restrictions | None | |
| affinity | dict | Affinity | None | |
| tolerations | list[dict] | Tolerations | None | |
| envs | list[dict] | Env variables | None | |
| secrets | list[str] | List of secret names | None | |
| profile | str | Profile template | None | |
| replicas | int | Number of replicas | None | |
| service_type | str | Service type | NodePort | |
| huggingface_task | str | Huggingface task type | None | huggingfaceserve | 
| backend | str | Backend type | None | huggingfaceserve | 
| tokenizer_revision | str | Tokenizer revision | None | huggingfaceserve | 
| max_length | int | Huggingface max sequence length for the tokenizer | None | huggingfaceserve | 
| disable_lower_case | bool | Do not use lower case for the tokenizer | None | huggingfaceserve | 
| disable_special_tokens | bool | The sequences will not be encoded with the special tokens relative to their model | None | huggingfaceserve | 
| dtype | str | Data type to load the weights in | None | huggingfaceserve | 
| trust_remote_code | bool | Allow loading of models and tokenizers with custom code | None | huggingfaceserve | 
| tensor_input_names | list[str] | The tensor input names passed to the model | None | huggingfaceserve | 
| return_token_type_ids | bool | Return token type ids | None | huggingfaceserve | 
| return_probabilities | bool | Return all probabilities | None | huggingfaceserve | 
| disable_log_requests | bool | Disable log requests | None | huggingfaceserve | 
| max_log_len | int | Max number of prompt characters or prompt ID numbers printed in the log | None | huggingfaceserve | 
Task actions
Actions must be one of the following:
- serve: to deploy a service
Huggingface task
You can specify the task type for the Huggingface model. The task type must be one of the following:
- sequence_classification
- token_classification
- fill_mask
- text_generation
- text2text_generation
- text_embedding
Backend
You can specify the backend type for the Huggingface model. The backend type must be one of the following:
- AUTO
- VLLM
- HUGGINGFACE
Dtype
You can specify the data type to load the weights in. The data type must be one of the following:
- AUTO
- FLOAT32
- FLOAT16
- BFLOAT16
- FLOAT
- HALF
Task example
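A minimal sketch of passing task parameters to run() (values are illustrative; the second call assumes a huggingfaceserve function):
# Basic serve task
run = function.run(action="serve",
                   replicas=1,
                   service_type="NodePort")

# Serve task for a huggingfaceserve function
run = function.run(action="serve",
                   huggingface_task="text_generation",
                   backend="VLLM",
                   dtype="FLOAT16")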
Run
The Run object, like the Task, is created with the run() method.
The run's parameters are passed alongside the task's ones.
Run parameters
| Name | Type | Description | Default | Runtime | 
|---|---|---|---|---|
| local_execution | bool | Flag to determine if the run must be executed locally | False | |
| env | dict | Environment variables | None | kubeai-text,kubeai-speech | 
| args | list[str] | Arguments | None | kubeai-text,kubeai-speech | 
| cache_profile | str | Cache profile | None | kubeai-text,kubeai-speech | 
| files | list[KubeaiFile] | Files | None | kubeai-text,kubeai-speech | 
| scaling | Scaling | Scaling parameters | None | kubeai-text,kubeai-speech | 
| processors | int | Number of processors | None | kubeai-text,kubeai-speech | 
Files
Files is a list of dictionaries with the following keys:
Scaling
Scaling is a Scaling object that represents the scaling parameters for the run. Its structure is as follows:
scaling = {
    "replicas": int,
    "min_replicas": int,
    "max_replicas": int,
    "autoscaling_disabled": bool,
    "target_request": int,
    "scale_down_delay_seconds": int,
    "load_balancing": {
        "strategy": str,  # "LeastLoad" or "PrefixHash"
        "prefix_hash": {
            "mean_load_factor": int,
            "replication": int,
            "prefix_char_length": int
        }
    }
}
Run example
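A minimal sketch for a kubeai-text function, passing run parameters alongside the task ones (values are illustrative):
run = function.run(action="serve",
                   env={"SOME_ENV_VAR": "value"},
                   scaling={
                       "min_replicas": 1,
                       "max_replicas": 3,
                       "target_request": 10
                   })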
Run methods
Once the run is created, you can access some of its attributes and methods through the run object.
invoke
    Invoke the served model. By default it calls the v2 infer endpoint.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| model_name | str | Name of the model. | None | 
| method | str | Method of the request. | 'POST' | 
| url | str | URL of the request. | None | 
| **kwargs | dict | Keyword arguments to pass to the request. | {} | 
Returns:
| Type | Description | 
|---|---|
| Response | Response from the request. |
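For example (model name and payload are illustrative; the payload follows the v2 inference protocol shown above):
json = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [[1.0, 2.0, 3.0, 4.0]]
        }
    ]
}
response = run.invoke(model_name="my-model", json=json)
print(response.json())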