Modelserve runtime
The Modelserve runtime allows you to deploy ML models on Kubernetes or locally.
Prerequisites
Python version and libraries:
- python >= 3.9, <3.13
- digitalhub-runtime-modelserve
The package is available on PyPI:
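pip install digitalhub-runtime-modelserve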
HOW TO
The modelserve runtime introduces several function kinds (sklearnserve, mlflowserve, huggingfaceserve, kubeai-text and kubeai-speech) that allow you to serve different ML model flavours, and a task of kind serve.
The usage of the runtime is similar to the others:
- Create a Function object of the desired model kind and execute its run() method.
- The runtime collects (if in remote execution), loads and exposes the model as a service.
- With the run's invoke() method you can call the v2 inference API, specifying the JSON payload you want (passed as keyword arguments).
- You can stop the service with the run's stop() method.
The modelserve runtime launches an MLServer inference server, which is deployed on Kubernetes as a Deployment and exposed as a service.
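For reference, a minimal sketch of this flow (project name, function name and model path are illustrative; get_or_create_project is assumed from the digitalhub SDK):
import digitalhub as dh

project = dh.get_or_create_project("my-project")

# 1. Create the function for the desired model flavour
function = project.new_function(name="sklearn-serve-function",
                                kind="sklearnserve",
                                path="s3://my-bucket/path-to-model/model.pkl")

# 2. Deploy the model as a service
run = function.run(action="serve")

# 3. Call the v2 inference API once the service is ready (see below)
# run.invoke(json={...})

# 4. Stop the service
run.stop()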
Service responsiveness
It may take a while for the service to become ready and be notified to the client. You can use the refresh() method and access the status attribute of the run object: when the service is ready, a service attribute appears in the status.
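For example, you can poll until the service attribute appears (a minimal sketch; the exact layout of the status object may differ):
import time

while True:
    run.refresh()
    if getattr(run.status, "service", None) is not None:
        break
    time.sleep(5)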
Once the service is ready, you can use the run.invoke() method to call the inference server.
The invoke() method accepts requests.request parameters as kwargs. The url parameter is collected from the run object by default; if you need to override it, pass the url parameter explicitly.
Note
If you passed model_name in the function spec and you execute the run in remote execution, you need to pass model_name to the invoke() method, because it is used to identify the model in the inference server: "http://{url-from-k8s}/v2/models/{model_name}/infer".
data = [[...]]  # some array
json = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [x, y],
            "datatype": "FP32",
            "data": data  # data array goes here
        }
    ]
}
run.invoke(json=json)
Function
There are different modelserve function kinds (sklearnserve, mlflowserve, huggingfaceserve, kubeai-text and kubeai-speech), each one representing a different ML model flavour.
Function parameters
A modelserve function has the following spec parameters to pass to the new_function() method:
| Name | Type | Description | Default | Runtime | 
|---|---|---|---|---|
| project | str | Project name. Required only if creating from library, otherwise MUST NOT be set | ||
| name | str | Name that identifies the object | required | |
| kind | str | Function kind | required | |
| uuid | str | ID of the object in form of UUID4 | None | |
| description | str | Description of the object | None | |
| labels | list[str] | List of labels | None | |
| embedded | bool | Flag to determine if object must be embedded in project | True | |
| path | str | Path to the model files | None | |
| model_name | str | Name of the model | None | |
| image | str | Docker image where to serve the model | None | |
| url | str | Model url | None | kubeai-text,kubeai-speech | 
| adapters | list[str] | Adapters | None | kubeai-text,kubeai-speech | 
| features | list[str] | Features | None | kubeai-text | 
| engine | KubeaiEngine | Engine | None | kubeai-text | 
Function kinds
The kind parameter must be one of the following:
- sklearnserve
- mlflowserve
- huggingfaceserve
- kubeai-text
- kubeai-speech
Adapters
Adapters is a list of dictionaries with the following keys:
Features
Features is a list of strings. It accepts the following values:
- TextGeneration
- TextEmbedding
- SpeechToText
Engine
The engine parameter is a KubeaiEngine object representing the serving engine to use for the function. It can be one of the following:
- OLlama
- VLLM
- FasterWhisper
- Infinity
Model path
The model path is the path to the model files. In remote execution, the path is a remote S3 path (for example: s3://my-bucket/path-to-model). In local execution, the path is a local path (for example: ./my-path or my-path). Depending on the kind of modelserve function, the path must follow a specific pattern:
- sklearnserve: s3://my-bucket/path-to-model/model.pkl or ./path-to-model/model.pkl. The remote path is the partition containing the model file; the local path is the model file itself.
- mlflowserve: s3://my-bucket/path-to-model-files or ./path-to-model-files. The remote path is the partition with all the model files; the local path is the folder containing the MLmodel file, according to the MLflow specification.
The model path is not required for kubeai-text and kubeai-speech.
Model url
The model url must follow the pattern:
regexp = (
    r"^(store://([^/]+)/model/huggingface/.*)"
    + r"|"
    + r"^pvc?://.*$"
    + r"|"
    + r"^s3?://.*$"
    + r"|"
    + r"^ollama?://.*$"
    + r"|"
    + r"^hf?://.*$"
)
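For illustration, the following snippet checks a few candidate URLs against the pattern (the URLs themselves are made up):
import re

regexp = (
    r"^(store://([^/]+)/model/huggingface/.*)"
    r"|^pvc?://.*$"
    r"|^s3?://.*$"
    r"|^ollama?://.*$"
    r"|^hf?://.*$"
)

examples = [
    "store://my-project/model/huggingface/my-model",
    "hf://mistralai/Mistral-7B-v0.1",
    "ollama://llama3",
    "s3://my-bucket/models/my-model",
    "pvc://my-volume/models/my-model",
]

for url in examples:
    print(url, bool(re.match(regexp, url)))  # all of these match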
Function example
# Example remote model mlflow
function = project.new_function(name="mlflow-serve-function",
                                kind="mlflowserve",
                                path=model.spec.path + "model")
# Example local model mlflow
function = project.new_function(name="mlflow-serve-function",
                                kind="mlflowserve",
                                path="./my-path/model")
# Example remote model sklearn
function = project.new_function(name="sklearn-serve-function",
                                kind="sklearnserve",
                                path=model.spec.path)
# Example local model sklearn
function = project.new_function(name="sklearn-serve-function",
                                kind="sklearnserve",
                                path="./my-path/model.pkl")
# Example KubeAI text model
function = project.new_function(
    name="kubeai-text-function",
    kind="kubeai-text",
    url="hf://mistralai/Mistral-7B-v0.1",
    features=["TextGeneration"],
    engine="VLLM"
)
# Example KubeAI speech model
function = project.new_function(
    name="kubeai-speech-function",
    kind="kubeai-speech",
    url="hf://openai/whisper-large-v3",
    features=["SpeechToText"],
    engine="FasterWhisper"
)
Task
The modelserve runtime introduces one task of kind serve that allows you to deploy ML models on Kubernetes or locally.
A Task is created with the run() method, so it's not managed directly by the user. The parameters for the task creation are passed directly to the run() method, and may vary depending on the kind of task.
Task parameters
| Name | Type | Description | Default | Runtime | 
|---|---|---|---|---|
| action | str | Task action | required | |
| node_selector | list[dict] | Node selector | None | |
| volumes | list[dict] | List of volumes | None | |
| resources | dict | Resources restrictions | None | |
| affinity | dict | Affinity | None | |
| tolerations | list[dict] | Tolerations | None | |
| envs | list[dict] | Env variables | None | |
| secrets | list[str] | List of secret names | None | |
| profile | str | Profile template | None | |
| replicas | int | Number of replicas | None | |
| service_type | str | Service type | NodePort | |
| huggingface_task | str | Huggingface task type | None | huggingfaceserve | 
| backend | str | Backend type | None | huggingfaceserve | 
| tokenizer_revision | str | Tokenizer revision | None | huggingfaceserve | 
| max_length | int | Huggingface max sequence length for the tokenizer | None | huggingfaceserve | 
| disable_lower_case | bool | Do not use lower case for the tokenizer | None | huggingfaceserve | 
| disable_special_tokens | bool | The sequences will not be encoded with the special tokens relative to their model | None | huggingfaceserve | 
| dtype | str | Data type to load the weights in | None | huggingfaceserve | 
| trust_remote_code | bool | Allow loading of models and tokenizers with custom code | None | huggingfaceserve | 
| tensor_input_names | list[str] | The tensor input names passed to the model | None | huggingfaceserve | 
| return_token_type_ids | bool | Return token type ids | None | huggingfaceserve | 
| return_probabilities | bool | Return all probabilities | None | huggingfaceserve | 
| disable_log_requests | bool | Disable log requests | None | huggingfaceserve | 
| max_log_len | int | Max number of prompt characters or prompt ID numbers printed in the log | None | huggingfaceserve | 
Task actions
Actions must be one of the following:
- serve: to deploy a service
Huggingface task
You can specify the task type for the Huggingface model. The task type must be one of the following:
- sequence_classification
- token_classification
- fill_mask
- text_generation
- text2text_generation
- text_embedding
Backend
You can specify the backend type for the Huggingface model. The backend type must be one of the following:
- AUTO
- VLLM
- HUGGINGFACE
Dtype
You can specify the data type to load the weights in. The data type must be one of the following:
- AUTO
- FLOAT32
- FLOAT16
- BFLOAT16
- FLOAT
- HALF
Task example
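A minimal sketch of passing task parameters to run() (values are illustrative; the second call assumes a huggingfaceserve function):
# Basic serve task
run = function.run(action="serve",
                   replicas=1,
                   service_type="NodePort")

# Serve task for a huggingfaceserve function
run = function.run(action="serve",
                   huggingface_task="text_generation",
                   backend="VLLM",
                   dtype="FLOAT16")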
Run
The Run object, like the Task, is created with the run() method.
The run's parameters are passed alongside the task's ones.
Run parameters
| Name | Type | Description | Default | Runtime | 
|---|---|---|---|---|
| local_execution | bool | Flag to determine if the run must be executed locally | False | |
| env | dict | Environment variables | None | kubeai-text,kubeai-speech | 
| args | list[str] | Arguments | None | kubeai-text,kubeai-speech | 
| cache_profile | str | Cache profile | None | kubeai-text,kubeai-speech | 
| files | list[KubeaiFile] | Files | None | kubeai-text,kubeai-speech | 
| scaling | Scaling | Scaling parameters | None | kubeai-text,kubeai-speech | 
| processors | int | Number of processors | None | kubeai-text,kubeai-speech | 
Files
Files is a list of dictionaries with the following keys:
Scaling
Scaling is a Scaling object that represents the scaling parameters for the run. Its structure is as follows:
scaling = {
    "replicas": int,
    "min_replicas": int,
    "max_replicas": int,
    "autoscaling_disabled": bool,
    "target_request": int,
    "scale_down_delay_seconds": int,
    "load_balancing": {
        "strategy": str,  # "LeastLoad" or "PrefixHash"
        "prefix_hash": {
            "mean_load_factor": int,
            "replication": int,
            "prefix_char_length": int
        }
    }
}
Run example
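A minimal sketch for a kubeai-text function, passing run parameters alongside the task ones (values are illustrative):
run = function.run(action="serve",
                   env={"SOME_ENV_VAR": "value"},
                   scaling={
                       "min_replicas": 1,
                       "max_replicas": 3,
                       "target_request": 10
                   })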
Run methods
Once the run is created, you can access some of its attributes and methods through the run object.
invoke
    Invoke the served model. By default it calls the v2 infer endpoint.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| model_name | str | Name of the model. | None | 
| method | str | Method of the request. | 'POST' | 
| url | str | URL of the request. | None | 
| **kwargs | dict | Keyword arguments to pass to the request. | {} | 
Returns:
| Type | Description | 
|---|---|
| Response | Response from the request. |
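For example (model name and payload are illustrative; the payload follows the v2 inference protocol shown above):
json = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [[1.0, 2.0, 3.0, 4.0]]
        }
    ]
}
response = run.invoke(model_name="my-model", json=json)
print(response.json())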