Modelserve runtime
The Modelserve runtime allows you to deploy ML models on Kubernetes or locally.
Prerequisites
Python version and libraries:

- python >= 3.9, < 3.13
- `digitalhub-runtime-modelserve`
The package is available on PyPI:
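```shell
pip install digitalhub-runtime-modelserve
```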
HOW TO
The modelserve runtime introduces several function kinds (`sklearnserve`, `mlflowserve`, `huggingfaceserve`, `kubeai-text` and `kubeai-speech`) that allow you to serve different ML model flavours, and a task of kind `serve`.
The usage of the runtime is similar to the others:

- Create a `Function` object of the desired model kind and execute its `run()` method.
- The runtime collects (if in remote execution), loads and exposes the model as a service.
- With the run's `invoke()` method you can call the v2 inference API, specifying the JSON payload you want (passed as keyword arguments).
- You can stop the service with the run's `stop()` method.
The modelserve runtime launches an MLServer inference server, which is deployed on Kubernetes as a Deployment and exposed as a Service.
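For example, a minimal end-to-end sketch of this workflow (the project helper, model path and payload are illustrative; the exact call signatures follow the parameters documented below):

```python
import digitalhub as dh

project = dh.get_or_create_project("my-project")

# Create a serving function (kind and model path are illustrative)
function = project.new_function(
    name="sklearn-serve-function",
    kind="sklearnserve",
    path="s3://my-bucket/path-to-model/model.pkl"
)

# Deploy the model as a service
run = function.run(action="serve")

# Once the service is ready (see the note below), call it and stop it
response = run.invoke(json={"inputs": []})  # v2 payload, see the example below
run.stop()
```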
Service responsiveness
It takes a while for the service to become ready and for the client to be notified. You can use the `refresh()` method and access the `status` attribute of the run object: when the service is ready, a `service` attribute appears in the `status`, as in the sketch below.
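A minimal polling sketch, assuming `run` was created as above and that the `status` object exposes the `service` entry as an attribute once the service is up:

```python
import time

# Refresh the run until the service endpoint is published in its status
run.refresh()
while not getattr(run.status, "service", None):
    time.sleep(5)
    run.refresh()
```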
Once the service is ready, you can use the `run.invoke()` method to call the inference server.
The `invoke` method accepts `requests.request` parameters as kwargs. The `url` parameter is collected by default from the run object; if you need to override it, pass the `url` parameter explicitly.
Note
If you passed `model_name` in the function spec and you execute the run remotely, you need to pass the `model_name` to the `invoke` method as well, because the `model_name` is used to identify the model in the inference server: `http://{url-from-k8s}/v2/models/{model_name}/infer`.
```python
data = [[...]]  # some array
json = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [x, y],
            "datatype": "FP32",
            "data": data  # data-array goes here
        }
    ]
}
run.invoke(json=json)
```
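If a `model_name` was set in the function spec and the run executes remotely (see the note above), the name must be passed to `invoke()` as well; `my-model` below is illustrative, and the returned object is a `requests` Response:

```python
response = run.invoke(model_name="my-model", json=json)
print(response.json())
```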
Function
There are different modelserve functions (`sklearnserve`, `mlflowserve`, `huggingfaceserve`, `kubeai-text` and `kubeai-speech`), each one representing a different ML model flavour.
Function parameters
A modelserve function has the following `spec` parameters to pass to the `new_function()` method:
Name | Type | Description | Default | Runtime |
---|---|---|---|---|
project | str | Project name. Required only if creating from library, otherwise MUST NOT be set | ||
name | str | Name that identifies the object | required | |
kind | str | Function kind | required | |
uuid | str | ID of the object in form of UUID4 | None | |
description | str | Description of the object | None | |
labels | list[str] | List of labels | None | |
embedded | bool | Flag to determine if object must be embedded in project | True | |
path | str | Path to the model files | None | |
model_name | str | Name of the model | None | |
image | str | Docker image where to serve the model | None | |
url | str | Model url | None | kubeai-text , kubeai-speech |
adapters | list[str] | Adapters | None | kubeai-text , kubeai-speech |
features | list[str] | Features | None | kubeai-text |
engine | KubeaiEngine | Engine | None | kubeai-text |
Function kinds
The `kind` parameter must be one of the following:

- `sklearnserve`
- `mlflowserve`
- `huggingfaceserve`
- `kubeai-text`
- `kubeai-speech`
Adapters
Adapters is a list of dictionaries with the following keys:
Features
Features is a list of strings. It accepts the following values:
- `TextGeneration`
- `TextEmbedding`
- `SpeechToText`
Engine
The engine is a `KubeaiEngine` object that represents the engine to use for the function. The engine can be one of the following:

- `OLlama`
- `VLLM`
- `FasterWhisper`
- `Infinity`
Model path
The model path is the path to the model files. In remote execution, the path is a remote S3 path (for example: `s3://my-bucket/path-to-model`). In local execution, the path is a local path (for example: `./my-path` or `my-path`). According to the kind of modelserve function, the path must follow a specific pattern:

- `sklearnserve`: `s3://my-bucket/path-to-model/model.pkl` or `./path-to-model/model.pkl`. The remote path is the partition containing the model file, while the local path points to the model file itself.
- `mlflowserve`: `s3://my-bucket/path-to-model-files` or `./path-to-model-files`. The remote path is the partition with all the model files, while the local path is the folder containing the `MLmodel` file according to the MLflow specification.

The model path is not required for `kubeai-text` and `kubeai-speech`.
Model url
The model url must follow the pattern:
```python
regexp = (
    r"^(store://([^/]+)/model/huggingface/.*)"
    + r"|"
    + r"^pvc?://.*$"
    + r"|"
    + r"^s3?://.*$"
    + r"|"
    + r"^ollama?://.*$"
    + r"|"
    + r"^hf?://.*$"
)
```
Function example
```python
# Example remote model mlflow
function = project.new_function(
    name="mlflow-serve-function",
    kind="mlflowserve",
    path=model.spec.path + "model"
)

# Example local model mlflow
function = project.new_function(
    name="mlflow-serve-function",
    kind="mlflowserve",
    path="./my-path/model"
)

# Example remote model sklearn
function = project.new_function(
    name="sklearn-serve-function",
    kind="sklearnserve",
    path=model.spec.path
)

# Example local model sklearn
function = project.new_function(
    name="sklearn-serve-function",
    kind="sklearnserve",
    path="./my-path/model.pkl"
)

# Example KubeAI text model
function = project.new_function(
    name="kubeai-text-function",
    kind="kubeai-text",
    url="hf://mistralai/Mistral-7B-v0.1",
    features=["TextGeneration"],
    engine="VLLM"
)

# Example KubeAI speech model
function = project.new_function(
    name="kubeai-speech-function",
    kind="kubeai-speech",
    url="hf://openai/whisper-large-v3",
    features=["SpeechToText"],
    engine="FasterWhisper"
)
```
Task
The modelserve runtime introduces one task of kind `serve` that allows you to deploy ML models on Kubernetes or locally.
A `Task` is created with the `run()` method, so it is not managed directly by the user. The parameters for the task creation are passed directly to the `run()` method and may vary depending on the kind of task.
Task parameters
Name | Type | Description | Default | Runtime |
---|---|---|---|---|
action | str | Task action | required | |
node_selector | list[dict] | Node selector | None | |
volumes | list[dict] | List of volumes | None | |
resources | dict | Resources restrictions | None | |
affinity | dict | Affinity | None | |
tolerations | list[dict] | Tolerations | None | |
envs | list[dict] | Env variables | None | |
secrets | list[str] | List of secret names | None | |
profile | str | Profile template | None | |
replicas | int | Number of replicas | None | |
service_type | str | Service type | NodePort | |
huggingface_task | str | Huggingface task type | None | huggingfaceserve |
backend | str | Backend type | None | huggingfaceserve |
tokenizer_revision | str | Tokenizer revision | None | huggingfaceserve |
max_length | int | Huggingface max sequence length for the tokenizer | None | huggingfaceserve |
disable_lower_case | bool | Do not use lower case for the tokenizer | None | huggingfaceserve |
disable_special_tokens | bool | The sequences will not be encoded with the special tokens relative to their model | None | huggingfaceserve |
dtype | str | Data type to load the weights in | None | huggingfaceserve |
trust_remote_code | bool | Allow loading of models and tokenizers with custom code | None | huggingfaceserve |
tensor_input_names | list[str] | The tensor input names passed to the model | None | huggingfaceserve |
return_token_type_ids | bool | Return token type ids | None | huggingfaceserve |
return_probabilities | bool | Return all probabilities | None | huggingfaceserve |
disable_log_requests | bool | Disable log requests | None | huggingfaceserve |
max_log_len | int | Max number of prompt characters or prompt ID numbers printed in the log | None | huggingfaceserve |
Task actions
Actions must be one of the following:
- `serve`: to deploy a service
Huggingface task
You can specify the task type for the Huggingface model. The task type must be one of the following:
- `sequence_classification`
- `token_classification`
- `fill_mask`
- `text_generation`
- `text2text_generation`
- `text_embedding`
Backend
You can specify the backend type for the Huggingface model. The backend type must be one of the following:
- `AUTO`
- `VLLM`
- `HUGGINGFACE`
Dtype
You can specify the data type to load the weights in. The data type must be one of the following:
- `AUTO`
- `FLOAT32`
- `FLOAT16`
- `BFLOAT16`
- `FLOAT`
- `HALF`
Task example
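Task parameters are passed to the `run()` method together with the action. A minimal sketch, with illustrative values for the optional parameters:

```python
# Serve the model with two replicas, exposed as a NodePort service
run = function.run(
    action="serve",
    replicas=2,
    service_type="NodePort"
)
```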
Run
The `Run` object is, similarly to the `Task`, created with the `run()` method.
The run's parameters are passed alongside the task's ones.
Run parameters
Name | Type | Description | Default | Runtime |
---|---|---|---|---|
local_execution | bool | Flag to determine if the run must be executed locally | False | |
env | dict | Environment variables | None | kubeai-text , kubeai-speech |
args | list[str] | Arguments | None | kubeai-text , kubeai-speech |
cache_profile | str | Cache profile | None | kubeai-text , kubeai-speech |
files | list[KubeaiFile] | Files | None | kubeai-text , kubeai-speech |
scaling | Scaling | Scaling parameters | None | kubeai-text , kubeai-speech |
processors | int | Number of processors | None | kubeai-text , kubeai-speech |
Files
Files is a list of dict with the following keys:
Scaling
Scaling is a `Scaling` object that represents the scaling parameters for the run. Its structure is as follows:

```python
scaling = {
    "replicas": int,
    "min_replicas": int,
    "max_replicas": int,
    "autoscaling_disabled": bool,
    "target_request": int,
    "scale_down_delay_seconds": int,
    "load_balancing": {
        "strategy": str,  # "LeastLoad" or "PrefixHash"
        "prefix_hash": {
            "mean_load_factor": int,
            "replication": int,
            "prefix_char_length": int
        }
    }
}
```
Run example
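A sketch of a run for the KubeAI text function defined earlier; the environment variable and scaling values are illustrative, and the `scaling` structure is assumed to be accepted as a plain dict:

```python
run = function.run(
    action="serve",
    env={"MY_ENV_VAR": "value"},
    scaling={
        "min_replicas": 1,
        "max_replicas": 3
    }
)
```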
Run methods
Once the run is created, you can access some of its attributes and methods through the `run` object.
invoke
Invoke the served model. By default it calls the v2 infer endpoint.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`model_name` | str | Name of the model. | None |
`method` | str | Method of the request. | 'POST' |
`url` | str | URL of the request. | None |
`**kwargs` | dict | Keyword arguments to pass to the request. | {} |
Returns:
Type | Description |
---|---|
Response | Response from the request. |
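For example, to override the URL collected from the run (host and model name below are illustrative):

```python
response = run.invoke(
    url="http://my-host:8080/v2/models/my-model/infer",
    json=json  # the v2 payload built in the example above
)
print(response.status_code)
```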